For example, we might notice some symptoms in a patient, and try to deduce what the underlying disease could be. Then our prior density \(p(\mbox{Disease})\) is what we think the probability of each disease is, and our posterior distribution after observing the symptoms is \(p(\mbox{Disease} \mid \mbox{Symptoms})\).

It is easy to sample from the joint density of latent variables and observations in Bayesian networks by an application of the definition of conditional probability:

\[p(\mbox{Disease}, \mbox{Symptoms}) = \underbrace{p(\mbox{Disease})}_{\text{prior}} \cdot \underbrace{p(\mbox{Symptoms} \mid \mbox{Disease})}_{\text{likelihood based on model}}\]On the other hand, computing the posterior is hard. By Bayes' rule,

\[\begin{align*} p(\mbox{Disease} \mid \mbox{Symptoms}) & = \frac{p(\mbox{Disease}, \mbox{Symptoms})}{p(\mbox{Symptoms})} \\ & = \frac{p(\mbox{Disease}, \mbox{Symptoms})}{\sum_{\mbox{Disease}^\prime \in \mbox{Diseases}}p(\mbox{Disease}^\prime, \mbox{Symptoms})}, \end{align*}\]where in the second line we had to marginalize over all possible diseases. The summation term in the denominator is called the partition function, and is also referred to as the evidence. Computing the partition function exactly is often computationally intractable, as there can be an exponential number of configurations.

For instance, consider an Ising model, where we have \(n\) nodes \(\bx = x_1, \dots, x_n\), and each node takes on a binary value \(x_i \in \left\{ \pm 1 \right\}\). The model is parameterized by \(\theta\), with \(\theta_{ij}\) denoting the strength of interactions between nodes \(i\) and \(j\), and \(\theta_i\) denoting the self-relationship for node \(i\). The joint probability is given by

\[\begin{align*} p_\theta(\bx) = \frac{1}{\mathcal{Z}(\theta)}\exp \left( \sum_{\substack{i, j \in [n] \\ i \neq j}} x_i x_j \theta_{ij} + \sum_{i \in [n]} x_i \theta_i \right), \end{align*}\]with the partition function \(\mathcal{Z}(\theta)\), which ensures that the probability distribution sums to 1, given by

\[\begin{align*} \mathcal{Z}(\theta) = \sum_{\bx \in \left\{ \pm 1 \right\}^n} \exp \left( \sum_{\substack{i, j \in [n] \\ i \neq j}} x_i x_j \theta_{ij} + \sum_{i \in [n]} x_i \theta_i \right). \end{align*}\]Computing \(\mathcal{Z}(\theta)\) is the main obstacle here, since the number of configurations of \(\bx\) is exponential in \(n\). Indeed, computing partition functions is proven to be \(\#\P\)-hard in general (strictly harder than \(\NP\)-hard, for which all known algorithms take exponential time).
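To see the blow-up concretely, here is a minimal brute-force sketch of the Ising partition function (all names and parameter values are illustrative; for simplicity it sums each interaction once over pairs \(i < j\)):

```python
import itertools
import math

# Illustrative parameters: interaction strengths for pairs i < j, and self-terms.
n = 4
theta_pair = {(i, j): 0.2 for i in range(n) for j in range(i + 1, n)}
theta_self = [0.1 * i for i in range(n)]

def score(x):
    # x is a configuration in {-1, +1}^n; returns the exponent of the Boltzmann weight.
    pair_term = sum(x[i] * x[j] * t for (i, j), t in theta_pair.items())
    self_term = sum(x[i] * theta_self[i] for i in range(n))
    return pair_term + self_term

# The partition function sums over all 2^n configurations -- tractable only for tiny n.
Z = sum(math.exp(score(x)) for x in itertools.product([-1, 1], repeat=n))

def prob(x):
    return math.exp(score(x)) / Z

# Sanity check: normalized probabilities over all configurations sum to 1.
total = sum(prob(x) for x in itertools.product([-1, 1], repeat=n))
```

Even at \(n = 30\), the same loop would need to visit over a billion configurations, which is the intractability the \(\#\P\)-hardness result formalizes.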

Therefore, several families of methods have been developed to approximate the posterior distribution in a computationally feasible manner. In this series of posts, we will discuss variational inference techniques, which reduce inference to an optimization problem. Another family of techniques is Markov chain Monte Carlo (MCMC), which constructs a Markov chain whose stationary distribution is the posterior, and samples from it to approximate the posterior.

The main idea of variational inference is to estimate the partition function by minimizing the distance between our distribution \(P\) and some easier-to-compute distribution \(Q\), by modifying the parameters of \(Q\).

The name variational inference comes from the calculus of variations, which uses small perturbations to find maxima/minimas of functionals.


In this post, we will see how we can apply variational methods to perform inference and learning in latent variable models.

Unfortunately, this meant that while many people have good operational knowledge of LaTeX and can get the job done, many small mistakes persist and best practices go unfollowed, since TAs either deem them not severe enough to warrant a note, or are not aware of them themselves.

In this post, we cover some common mistakes that are made by LaTeX practitioners (even in heavily cited papers), and how to address them. This post assumes that the reader has some working knowledge of LaTeX.

It is important to get into the right mindset whenever you typeset a document. You are not simply “writing” a document — you are crafting a work of art that combines the precision and creativity of your logical thinking with the elegance of beautifully typeset writing. The amount of attention and care you put into the presentation is indicative of the amount of thought you put into the content. Therefore, having good style is not only delightful and aesthetically pleasing to read, but it also serves to establish your ethos and character. One can tell that someone puts a lot of effort into their work and takes great pride in it when they pay attention even to the smallest of details.

Furthermore, adopting good practices helps you avoid typographical mistakes in your proofs, such as missing parentheses or wrong positioning. These often lead to cascading errors that are very annoying to fix when you discover them later on. There are even ways to replicate the strict typechecking of statically typed languages, so that mistakes in your expressions are caught at compile time.

In the following section, we take a look at common mistakes that people make, and how they can be avoided or fixed. We cover style mistakes first, since the ideas behind them are more general. All the screenshotted examples come from peer-reviewed papers that have been published to top conferences, so they are definitely very common mistakes and you shouldn’t feel bad for making them. The important thing is that you are aware of them now so that your style will gradually improve over time.

We take a look at style mistakes, which impair reader understanding and make it easy to commit other sorts of errors.

Parentheses, brackets, and pipes are examples of delimiters that mark the start and end of subexpressions in a formula. As they come in pairs, a common mistake is accidentally leaving out the closing delimiter, especially in nested expressions. Even when both delimiters are present, there is the issue of incorrect sizing.

For instance, consider the following way of expressing the Topologist’s sine curve, which is a classic example of a topological space that is connected but not path-connected:

which is rendered as follows:

\[T = \{(x, \sin \frac{1}{x} ) : x \in (0, 1] \} \cup \{ ( 0, 0 ) \}\]The problem here is that the curly braces have the wrong size, as they should be large enough to cover the \(\sin \frac{1}{x}\) expression vertically.

The wrong way of resolving this would be to use manual delimiter size modifiers, i.e. `\bigl, \Bigl, \biggl` paired with `\bigr, \Bigr, \biggr` and the like. This is tedious and error-prone, since LaTeX will even happily let you match delimiters of different sizes. Indeed, I came across the following formula in a paper recently, where the outer right square bracket was missing and the left one had the wrong size:

The correct way to do this would be to use paired delimiters, which automatically adjust their size based on their contents, and result in a compile error if the matching right delimiter is missing or nested at the wrong level. Some of them are given below:

Raw LaTeX | Rendered |
---|---|
`\left( \frac{1}{x} \right)` | \(\left( \frac{1}{x} \right)\) |
`\left[ \frac{1}{x} \right]` | \(\left[ \frac{1}{x} \right]\) |
`\left\{ \frac{1}{x} \right\}` | \(\left\{ \frac{1}{x} \right\}\) |
`\left\lvert \frac{1}{x} \right\rvert` | \(\left\lvert \frac{1}{x} \right\rvert\) |
`\left\lceil \frac{1}{x} \right\rceil` | \(\left\lceil \frac{1}{x} \right\rceil\) |

In fact, to make things even simpler and more readable, you can declare paired delimiters using the `mathtools` package, with the following commands due to Ryan O’Donnell:
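The original snippet appears as an image in this post; a sketch of what such declarations might look like (the macro names here are my own guesses):

```latex
\usepackage{mathtools}
% Starred usage (e.g. \braces*{...}) auto-resizes the delimiters to their contents.
\DeclarePairedDelimiter{\parens}{(}{)}
\DeclarePairedDelimiter{\braces}{\{}{\}}
\DeclarePairedDelimiter{\abs}{\lvert}{\rvert}
\DeclarePairedDelimiter{\norm}{\lVert}{\rVert}
```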

Then you can now use the custom delimiters as follows, taking note that you need the `*` for them to auto-resize:

which gives

\[T = \left\{ \left( x, \sin \frac{1}{x} \right) : x \in (0, 1] \right\} \cup \left\{ \left( 0, 0 \right) \right\}\]The biggest downside of using custom paired delimiters is having to remember to add the `*`; otherwise, the delimiters will not auto-resize, which unfortunately keeps things somewhat error-prone. There is a proposed solution floating around on StackExchange that relies on a custom command to make auto-resizing the default, but it’s still a far cry from a parsimonious solution.

Macros can be defined using the `\newcommand` command. The basic syntax is `\newcommand{command_name}{command_definition}`. For instance, it might get tiring to always type `\boldsymbol{A}` to refer to a matrix \(\boldsymbol{A}\), so you can use the following macro:
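The original macro is shown as an image; a sketch of such a definition (the name `\mA` is illustrative):

```latex
% \mA abbreviates a bold matrix A
\newcommand{\mA}{\boldsymbol{A}}
% Usage: $\mA \mA^\top$ instead of $\boldsymbol{A} \boldsymbol{A}^\top$
```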

Macros can also take arguments to be substituted within the definition. This is done by adding an `[n]` argument after your command name, where `n` is the number of arguments that it should take. You can then reference the positional arguments using `#1, #2,` and so on. Here, we create a `\dotprod` macro that takes two arguments:
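The snippet itself is an image in the post; a plausible definition consistent with the description:

```latex
\newcommand{\dotprod}[2]{\left\langle #1, #2 \right\rangle}
% Usage: $\dotprod{u}{v}$ renders the same as $\left\langle u, v \right\rangle$
```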

Macros are incredibly helpful as they save time and ensure that our notation is consistent. They can also help to catch mistakes when typesetting grammatically structured content.

For instance, when expressing types and terms in programming language theory, there is often a lot of nested syntactical structure, which could make it easy to make mistakes. Consider the following proof:

The details are unimportant, but it is clear that it is easy to miss a letter here or a term there in the proof, given how cumbersome the notation is. To avoid this, I used the following macros, due to Robert Harper:

And the source for the proof looks like the following:

It is definitely still not the most pleasant thing to read, but at least now you will be less likely to miss an argument or forget to close a parenthesis.

Expressions which are logically a single unit should stay on the same line, instead of being split apart mid-sentence. Cue the following bad example from another paper:

In the area marked in red, the expression defining \(\tau^i\) is cut in half, which is visually jarring and interrupts the reader’s train of thought.

To ensure that an expression does not get split, simply wrap it in curly braces: that is, wrap the whole expression with `{` and `}` on both sides.

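The before/after snippets are screenshots in the post; the technique looks like this (the expression here is illustrative):

```latex
% Without braces, the inline expression may break at the relation:
... we assign $\tau = \sigma_1 + \sigma_2$ to each index ...
% Wrapped in braces, it always stays on one line:
... we assign ${\tau = \sigma_1 + \sigma_2}$ to each index ...
```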
So if we render the following snippet, which would otherwise have expressions split in half without the wrapping curly braces:

we get the following positive result, where there is additional whitespace in the justified text on the first line to compensate for the expression assigning \(\tau\) staying on the same line:

The non-breaking space `~`

When referencing figures and equations, you want the text and number (i.e. “Figure 10”) to end up on the same line. Here is a negative example, where the region underlined in red shows how they were split up:

To remedy this, add a `~` after `Figure`, which LaTeX interprets as a non-breaking space:

This ensures that “Figure 2” always appears together.
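A minimal sketch of the fix (the label `fig:results` is made up for illustration):

```latex
% Bad: "Figure" and the number can end up on different lines
see Figure \ref{fig:results} for details
% Good: the non-breaking space ~ glues them together
see Figure~\ref{fig:results} for details
```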

Your document is meant to be read, and it should follow the rules and structures of English (or whichever language you are writing in). This means that mathematical expressions should also be punctuated appropriately, which allows them to flow more naturally and makes it easier for the reader to follow.

Consider the following example that does not use punctuation:

In the region highlighted in red, the expressions do not carry any punctuation at all, and by the end of the last equation (Equation 15) I am almost out of breath trying to process all of the information. In addition, it does not end with a full stop, which gives me no affordance to take a mental break until the next paragraph.

Instead, commas should be added after each expression that does not terminate the sentence, and the final equation should end with a full stop. Here is a good example of punctuation that helps to guide the reader along the author’s train of thought:

Here is another good example of how using commas for the equations allows the text to flow naturally, where it takes the form of “analogously, observe that we have [foo] and [bar], where the inequality…”:

This even extends to when you pack several equations on a single line, which is common when you are trying to fit the page limit for conference submissions:

The `proof` environment

The `proof` environment from the `amsthm` package is great for signposting to your readers where a proof starts and ends. For instance, consider how it is used in the following example:
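The example in the post is a screenshot; a generic sketch of the environment (the proof content here is my own filler):

```latex
\usepackage{amsthm}
% ...
\begin{proof}
  Suppose for contradiction that $\sqrt{2} = p/q$ with $p, q$ coprime.
  Then $p^2 = 2q^2$, so $p$ is even; writing $p = 2k$ gives $q^2 = 2k^2$,
  so $q$ is even as well, contradicting coprimality.
\end{proof}
```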

This will helpfully highlight the start of your argument with *“Proof”*, and
terminate it with a square that symbolizes QED.

`\qedhere`

Consider the same example as previously, but now you accidentally added an additional newline before the closing `\end{proof}`, which happens pretty often:

This results in the above scenario, where the QED symbol now appears on the next line by itself, which visually throws the entire text off balance. To avoid this, always include an explicit `\qedhere` marker at the end of your proof, which causes the QED symbol to appear on the line where the marker is placed:

We would then get the same result as originally, when we did not have the extra newline.

Spacing matters a lot in readability, as it helps to separate logical components. For instance, the following example fails to add spacing before the differential \(dz\):

This might seem innocuous, but consider the following example that makes the issue more explicit:

\[P(X) = \int xyz dx\]Now we can really see that the quantities are running into each other, and it becomes hard to interpret. Instead, we can add math-mode spacing, summarized in the following table:

Spacing Expression | Type |
---|---|
`\;` | Thick space |
`\:` | Medium space |
`\,` | Thin space |

So our new expression now looks like:

\[P(X) = \int xyz \, dx\]which is much more readable.

The `align*` Environment for Multiline Equations

When using the `align*` environment, make sure that your ampersands `&` appear before the symbol that you are aligning against. This ensures that you get the correct spacing.

For instance, the following is wrong, where the `&` appears after the `=`:

This is wrong because there is too little spacing after the `=` sign on each line, which feels very cramped. Putting the `&` before the `=` is correct:

The spacing is much more comfortable now.
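The two variants side by side, as a sketch (the equation is filler):

```latex
% Wrong: `&` after the `=` squeezes the spacing
\begin{align*}
  f(x) =& (x + 1)^2 \\
       =& x^2 + 2x + 1
\end{align*}
% Right: `&` before the `=`
\begin{align*}
  f(x) &= (x + 1)^2 \\
       &= x^2 + 2x + 1
\end{align*}
```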

We now look at some mistakes that arise from using the wrong commands.

Instead of `sin (x)` \((sin(x))\) or `log (x)` \((log (x))\), use `\sin (x)` \((\sin (x))\) and `\log (x)` \((\log (x))\). The idea extends to many other common math functions. These are math operators, which de-italicize the function names and also take care of the appropriate math-mode spacing between characters:

Raw LaTeX | Rendered |
---|---|
`O(n log n)` | \(O(n log n)\) |
`O(n \log n)` | \(O(n \log n)\) |

Many times there is a math operator that you need repeatedly but which does not come out of the box. You can define custom math operators with the `\DeclareMathOperator` command. For instance, here are some commonly used in probability:
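The definitions in the post are a screenshot; plausible versions consistent with the usage below (`\Ex`, `\Var`, and the rendering choices are my assumptions):

```latex
\DeclareMathOperator{\Ex}{\mathbf{E}}
\DeclareMathOperator{\Var}{Var}
% Starred version places subscripts underneath in display mode:
\DeclareMathOperator*{\argmax}{arg\,max}
```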

Then you can use it as follows:

\[\Pr \left[ X \geq a \right] \leq \frac{\Ex[X]}{a}\]This is more of a rookie mistake since it’s visually very obvious something is wrong. Double quotes don’t work the way you would expect:

\[\text{"Hello World!"}\]Instead, open with double backticks and close with two single quotes, which are reminiscent of the directional strokes of an actual double quote; this tells LaTeX which way to orient the ticks:

Unfortunately I had to demonstrate this with a screenshot since MathJax only performs math-mode typesetting, but this is an instance of text-mode typesetting.
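For reference, the source-level fix looks like this:

```latex
"Hello World!"   % wrong: straight quotes point the same way on both sides
``Hello World!'' % right: two backticks to open, two single quotes to close
```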

This is a common mistake due to laziness. Many times, people use `\epsilon` (\(\epsilon\)) when they really meant to write `\varepsilon` (\(\varepsilon\)). For instance, in analysis this is usually the case, and writing `\epsilon` results in a very uncomfortable read:

Using `\varepsilon` makes the reader feel much more at peace:

Similarly, people tend to get lazy and mix up `\phi, \Phi, \varphi` (\(\phi, \Phi, \varphi\)), since they are “about the same”. Details matter!

`mathbbm` Instead Of `mathbb`

For sets like \(\mathbb{N}\), you should use `\mathbbm{N}` (from the `bbm` package) instead of `\mathbb{N}` (from `amssymb`). See the difference in how the rendering of the set of natural numbers \(\mathbb{N}\) differs, using the same example as the previous section:

`\mathbbm` causes the symbols to be bolded, which is what you want.

`...` and `\dots` are different. See the difference:

When using `...`, the spacing between each dot, and between the final dot and the comma, is wrong. Always use `\dots`.
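In source form:

```latex
$x_1, ..., x_n$    % wrong: dot spacing is off
$x_1, \dots, x_n$  % right: \dots adjusts the spacing contextually
```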

When writing summations or products of terms, use `\sum` and `\prod` instead of `\Sigma` and `\Pi`. This handles the relative positioning of the limits properly, and is much more idiomatic to read in the raw source:

Raw LaTeX | Rendered |
---|---|
`\Sigma_{i=1}^n X_i` | \(\Sigma_{i=1}^n X_i\) |
`\sum_{i=1}^n X_i` | \(\sum_{i=1}^n X_i\) |
`\Pi_{i=1}^n X_i` | \(\Pi_{i=1}^n X_i\) |
`\prod_{i=1}^n X_i` | \(\prod_{i=1}^n X_i\) |

To denote multiplication, use `\cdot` or `\times` instead of `*`. See the difference below in the equation:
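In source form:

```latex
$y = a * b$        % wrong: the asterisk is not a multiplication sign
$y = a \cdot b$    % centered dot
$y = a \times b$   % cross, e.g. for dimensions like $m \times n$
```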

For set builder notation or conditional probability, use `\mid` instead of the pipe `|`. This handles the spacing between the terms properly:

Raw LaTeX | Rendered |
---|---|
`p(\mathbf{z}, \mathbf{x}) = p(\mathbf{z}) p(\mathbf{x} | \mathbf{z})` | \(p(\mathbf{z}, \mathbf{x}) = p(\mathbf{z}) p(\mathbf{x} | \mathbf{z})\) |
`p(\mathbf{z}, \mathbf{x}) = p(\mathbf{z}) p(\mathbf{x} \mid \mathbf{z})` | \(p(\mathbf{z}, \mathbf{x}) = p(\mathbf{z}) p(\mathbf{x} \mid \mathbf{z})\) |

When writing inner products, use `\langle` and `\rangle` instead of the keyboard angle brackets:

Raw LaTeX | Rendered |
---|---|
`<u, v>` | \(<u, v>\) |
`\langle u, v \rangle` | \(\langle u, v \rangle\) |

Use `\label` to label your figures, equations, tables, and so on, and reference them using `\ref` instead of hardcoding the number. For instance, use `\label{fig:myfig}` and `\ref{fig:myfig}`. Including the type of the object in the tag helps to keep track of what it is and ensures that you are referencing it correctly, i.e. making sure you write `Figure \ref{fig:myfig}` instead of accidentally saying something like `Table \ref{fig:myfig}`.
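Putting it together, a sketch (the file name `myfig.pdf` is a placeholder):

```latex
\begin{figure}
  \centering
  \includegraphics{myfig.pdf} % placeholder file name
  \caption{My figure.}
  \label{fig:myfig}
\end{figure}
% ...
As shown in Figure~\ref{fig:myfig}, ...
```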

That was a lot, and I hope it has been a helpful read! I will continue updating this post whenever I come across other important points that I missed.

I would like to thank my friend Zack Lee for reviewing this article and for providing valuable suggestions. I would also like to express my thanks to Ryan O’Donnell, and my 15-751 A Theorist’s Toolkit TAs Tim Hsieh and Emre Yolcu for helping me realize a lot of the style-related LaTeX issues mentioned in this post, many of which I made personally in the past.

The Central Limit Theorem states that the appropriately standardized sample mean converges in distribution to a standard normal. We first need to introduce the definition of convergence of probability distributions:

Definition
(Convergence in Distribution)

Let \( F_{X_n} \) and \( F_{X} \) denote the cumulative distribution functions (CDFs) of
\( X_n \) and \( X \) respectively.
A sequence \( X_n \) converges to \( X \) in distribution if
$$ \lim_{n \to \infty } F_{X_n}(t) = F_X (t)$$
for all points \( t \) where \( F_X \) is continuous.

Note that the requirement that convergence holds only at points of continuity is not superfluous, as there can be distributions that converge but disagree in value at points of discontinuity (e.g. take \(X_n = N(0, 1/n)\) and \(X\) to be the point mass at 0; they converge, but their CDFs take different values at \(t=0\)).
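This exact example can be checked numerically with the standard library, since \(F_{X_n}(t) = \Phi(t\sqrt{n})\) (the function names here are my own):

```python
import math

def normal_cdf(z):
    """CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cdf_xn(t, n):
    """CDF of X_n ~ N(0, 1/n): F(t) = Phi(t * sqrt(n))."""
    return normal_cdf(t * math.sqrt(n))

def cdf_x(t):
    """CDF of the point mass at 0."""
    return 1.0 if t >= 0 else 0.0

# At continuity points t != 0, F_{X_n}(t) approaches F_X(t)...
left = cdf_xn(-0.1, 10_000)    # essentially 0
right = cdf_xn(0.1, 10_000)    # essentially 1
# ...but at the discontinuity t = 0, F_{X_n}(0) = 0.5 for every n, while F_X(0) = 1.
at_zero = cdf_xn(0.0, 10_000)
```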

The Central Limit Theorem can then be stated in the following form (there are many other equivalent statements):

Theorem
(Central Limit Theorem)

Let \( X_1, X_2, \dots, X_n \) be a sequence of independent and identically distributed random variables with mean \( \mu \) and variance \( \sigma^2 \).
Assume that the moment generating function \( \E \left[ \exp(t X_i) \right] \) is finite for \( t \) in a neighborhood around zero.
Let \( \overline{X}_n = \frac{1}{n} \sum\limits_{i=1}^n X_i \). Let
$$ Z_n = \frac{\sqrt{n} \left( \overline{X}_n - \mu \right)}{\sigma}. $$
Then \( Z_n \) converges in distribution to \( Z \sim N(0, 1) \).

There are several ways of proving the Central Limit Theorem. The proof that we will explore today relies on the method of moments. An alternative measure-theoretic version of the proof relies on Lévy’s Continuity Theorem, and makes use of convolutions and Fourier transforms.

Our goal is to show that \(Z_n\) converges in distribution to \(Z \sim N(0, 1)\). To do so, we will show that all the moments of \(Z_n\) converge to the respective moments of \(Z\).

The moments of a random variable can be obtained from its moment-generating function (MGF), defined as follows:

Definition
(Moment Generating Function)

The moment generating function of a random variable \( X \) is given by
$$ M_X(t) = \E \left[ e^{tX} \right].$$

It is called a moment generating function since the \(k\)th moment of \(X\), i.e. \(\E \left[X^k \right]\), can be obtained by taking the \(k\)th derivative of its moment-generating function at 0:

\[\E \left[X^k \right] = M_X^{(k)}(0).\]This is not too hard to see by induction on the fact that \(M_X^{(k)}(t) = \E \left[ X^k e^{tX} \right]\). The base case is trivial. For the inductive case,

\[\begin{align*} M_X^{(k)}(t) & = \frac{d^k}{dt^k} \E \left[ e^{tX} \right] \\ & = \frac{d}{dt} \E \left[ X^{k-1} e^{tX} \right] & \text{(by IH)}\\ & = \frac{d}{dt} \int f(x) x^{k-1} e^{tx} \; dx \\ & = \int \frac{d}{dt} f(x) x^{k-1} e^{tx} \; dx \\ & = \int f(x) x^{k} e^{tx} \; dx \\ & = \E \left[ X^{k} e^{tX} \right]. \end{align*}\]Substituting \(t=0\) gives us the desired result.

Distributions are determined uniquely by their moments under certain conditions. This is made precise in the following theorem:

Theorem
(Sufficient Condition for Distribution to be Determined by Moments)

Let \( s_0 > 0 \), and let \( X \) be a random variable with moment generating
function \( M_X(s) \) which is finite for \( |s| < s_0 \). Then \( f_X \)
is determined by its moments (and also by \( M_X(s)\)).

In words, it means that if the moment generating function is finite on some open interval around 0, then the moments determine the distribution. This is true for the normal distribution, where it can be shown that the following recurrence holds for the derivatives of the MGF:

\[M^{(k)}(t) = \mu M^{(k-1)}(t) + (k-1) \sigma^2 M^{(k-2)}(t).\]This is also not hard to show by induction, and the proof is omitted for brevity. Since the mean and variance of the standard normal are 0 and 1 respectively, both finite, all the moments generated by the recurrence must also be finite. So the standard normal is determined by its moments.

Now cue the theorem that ties things together:

Theorem
(Method of Moments)

Suppose that \( X \) is determined by its moments. Let \( X_n \) be a sequence of
random variables, such that \( \int f_{X_n}(x) x^k \; dx \) is finite for all \( n, k \in \N \),
and such that \( \lim_{n \to \infty} \int f_{X_n}(x) x^k \; dx = \int f_{X}(x) x^k \; dx \)
for each \( k \in \N \). Then \( X_n \) converges in distribution to \( X \).

In words, it states that if the \(k\)th moment of \(X_n\) is finite and converges to the \(k\)th moment of \(X\) as \(n \to \infty\), for every \(k\), then \(X_n\) converges to \(X\) in distribution.

This is great, since now we just have to show that all the moments of \(Z_n = \frac{\sqrt{n} \left( \overline{X}_n - \mu \right)}{\sigma}\) converge to the moments of the standard normal \(Z\).

Let’s first find the moment generating function of \(Z\):

\[\begin{align*} M_{Z}(t) & = \E \left[ e^{tZ} \right] \\ & = \int f_Z(x) e^{tx} \; dx \\ & = \int \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}x^2} e^{tx} \; dx & \text{(subst. pdf of standard Gaussian)} \\ & = \int \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}x^2 + tx} \; dx \\ & = \int \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}(x - t)^2 + \frac{1}{2}t^2} \; dx & \text{(completing the square)} \\ & = e^{\frac{1}{2}t^2} \int \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}(x - t)^2 } \; dx & \text{($e^{\frac{1}{2}t^2}$ does not depend on $x$)} \\ & = e^{\frac{1}{2}t^2} \cdot 1 \\ & = e^{\frac{1}{2}t^2}, \end{align*}\]where the second last step comes from the fact that \(\frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}(x - t)^2 }\) is the probability density of a Gaussian with mean \(t\) and variance 1, and therefore integrates to 1.

Now we find the moment generating function of \(Z_n\). To simplify notation, define \(A_i = \frac{X_i - \mu}{\sigma}\), and see that we can write \(Z_n = \frac{1}{\sqrt{n}} \sum\limits_{i=1}^n A_i\), since

\[\begin{align*} \frac{1}{\sqrt{n}} \sum\limits_{i=1}^n A_i &= \frac{1}{\sqrt{n}} \sum\limits_{i=1}^n \frac{X_i - \mu}{\sigma} \\ &= \sqrt{n} \sum\limits_{i=1}^n \frac{X_i - \mu}{ n \sigma} \\ &= \sqrt{n} \frac{\overline{X}_n - \mu}{ \sigma} \\ &= Z_n. \end{align*}\]See that \(\E[A_i] = 0\), and \(\Var(A_i) = 1\).

Then starting from the definition of the moment generating function of \(Z_n\),

\[\begin{align*} M_{Z_n}(t) & = \E \left[ e^{t Z_n} \right] \\ & = \E \left[ \exp\left(t \frac{1}{\sqrt{n}} \sum\limits_{i=1}^n A_i \right) \right] & \text{(by equivalent definition of $Z_n$)} \\ & = \prod_{i=1}^n \E \left[ \exp\left( \frac{t}{\sqrt{n}} A_i \right) \right] & \text{(by independence of $A_i$'s)} \\ & = \prod_{i=1}^n M_{A_i}(t/\sqrt{n}) & \text{(definition of $M_{A_i}$)} \\ & = M_{A_i}(t/\sqrt{n} )^n. \end{align*}\]Let’s analyze each individual term \(M_{A_i}(t / \sqrt{n})\) by performing a Taylor expansion around 0. Recall that the Taylor expansion of a function \(f(x)\) about a point \(a\) is given by \(f(x)= \sum\limits_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n\). We will expand up to the second-order term, which requires the value of the MGF and its first two derivatives at 0.

These are:

\[\begin{align*} M_{A_i}(0) & = \E \left[ e^{t A_i} \right] \Big|_{t=0} \\ & = \E \left[ 1 \right] \\ & = 1, \\ M_{A_i}^\prime(0) & = \E \left[ A_i \right] & \text{(by the $k$th moment property proved previously)} \\ & = 0, \\ M_{A_i}^{\prime \prime}(0) & = \E \left[ A_i^2 \right] & \text{(by the $k$th moment property proved previously)} \\ & = \E \left[ A_i^2 \right] - \E \left[ A_i \right]^2 + \E \left[ A_i \right]^2 \\ & = \Var(A_i) + \E \left[ A_i \right]^2 & \text{($\Var(A_i) = \E \left[ A_i^2 \right] - \E \left[ A_i \right]^2 $)} \\ & = 1 + 0 \\ & = 1. \end{align*}\]Taking all terms up to the second order Taylor expansion allows us to approximate \(M_{A_i}\) as

\[\begin{align*} M_{A_i}(t/\sqrt{n}) & \approx M_{A_i}(0) + M_{A_i}^\prime(0) \frac{t}{\sqrt{n}} + M_{A_i}^{\prime \prime}(0) \frac{t^2}{2n} \\ & = 1 + 0 + \frac{t^2}{2n} \\ & = 1 + \frac{t^2}{2n}. \end{align*}\]Then now we can write the limit of the MGF of \(Z_n\) as the following:

\[\begin{align*} M_{Z_n}(t) & = M_{A_i}(t/\sqrt{n})^n \\ & \approx \left( 1 + \frac{t^2}{2n} \right)^n \\ & \to e^{t^2/2}, & \text{(by identity $\lim_{n \to \infty} (1 + x/n)^n \to e^x$)} \end{align*}\]which shows that it converges to the MGF of \(Z\), as desired. Hooray!
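The convergence can also be seen empirically. Here is a small stdlib-only simulation (all names and parameter choices are mine) that draws samples of \(Z_n\) for \(X_i \sim \mathrm{Exponential}(1)\), a deliberately skewed distribution with \(\mu = \sigma = 1\), and compares the empirical CDF against \(\Phi\):

```python
import bisect
import math
import random

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_zn(n, trials, rng):
    """Sample Z_n = sqrt(n) * (mean - mu) / sigma for X_i ~ Exponential(1),
    which has mu = sigma = 1."""
    mu, sigma = 1.0, 1.0
    return [
        math.sqrt(n) * (sum(rng.expovariate(1.0) for _ in range(n)) / n - mu) / sigma
        for _ in range(trials)
    ]

rng = random.Random(0)
zs = sorted(simulate_zn(n=1000, trials=2000, rng=rng))

def empirical_cdf(t, sorted_xs):
    """Fraction of samples <= t (sorted input)."""
    return bisect.bisect_right(sorted_xs, t) / len(sorted_xs)

# The empirical CDF of Z_n should be close to the standard normal CDF everywhere.
max_gap = max(abs(empirical_cdf(t, zs) - normal_cdf(t)) for t in [-2, -1, 0, 1, 2])
```

With \(n = 1000\) the gap should already be on the order of a few percent, despite the heavy skew of the exponential.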

However, one thing in this proof might have bothered you. Our result came from a Taylor approximation and taking limits, but there is no bound on how large \(n\) must be for the distributions to converge to within a given error. This makes it unsuitable for much theoretical analysis, since we would usually like to know that \(n\) does not have to be too large to obtain a sufficiently good approximation to the standard normal.

The Berry-Esseen theorem solves this limitation by also providing explicit error bounds. This was proved independently by Andrew Berry and Carl-Gustav Esseen in the 40s, and the statement goes as follows:

Theorem
(Berry-Esseen)

Let \( X_1, \dots, X_n \) be independent random variables.
Assume \( \E [X_i] = 0 \; \forall i \).
Write \( \sigma_i^2 = \Var [ X_i] = \E[X_i^2] - \E[X_i]^2 = \E[X_i^2] \).
Assume \( \sum\limits_{i=1}^n \sigma_i^2 = 1 \).
Let \( S = \sum\limits_{i=1}^n X_i \). Then \( \forall u \in \R \),
$$
\lvert \Pr \left[ S \leq u \right] - \Pr \left[ Z \leq u \right] \rvert
\leq \mbox{const} \cdot \beta,
$$
where the exact constant depends on the proof, with the best known constant
being \(.5600\) proven by Shevtsova in 2010, and
\(\beta = \sum\limits_{i=1}^n \E \left[ \lvert X_i \rvert^3 \right]\).

In words, the theorem says that the difference between the CDF of the sum of the mean-0 random variables and the CDF of the standard normal is bounded by a value proportional to the sum of the third absolute moments. This becomes useful as a tool for proving high-probability statements when we can show that \(\beta\) is inversely polynomially small, i.e. \(\beta = 1/\poly(n)\).

Another thing to note is that the theorem only provides an absolute error bound, uniform over all values of \(u\). Therefore, when \(u\) is very negative and \(\Pr [Z \leq u ] = \Phi(u)\) is very small, the relative error is actually very large, so the bound is not as helpful there.

I hope this article has been helpful!

*I would like to express my thanks to my friend Albert Gao
for reviewing this article and for providing valuable suggestions*.

- Rosenthal, J. S. (2016). A first look at rigorous probability theory. World Scientific.
- Larry Wasserman, CMU 36-705 Intermediate Statistics Lecture Notes. URL: https://www.stat.cmu.edu/~larry/=stat705/
- Ryan O’Donnell, CMU 15-751 A Theorist’s Toolkit. URL: https://www.youtube.com/watch?v=Ig5TuZauhW4

Our goal is to train an agent that is able to maximize its rewards in a given task. For instance, its goal could be to balance a cartpole for as long as possible: for each time step that the pole does not fall down, the agent receives a reward of 1, and when the pole falls down the episode is terminated and the agent no longer receives any rewards:

Formally, we want to maximize the expected reward of our policy over the trajectories that it visits. A trajectory \(\tau\) is defined as a sequence of state-action pairs \(\tau = (s_0, a_0, s_1, a_1, \dots, s_H, a_H, s_{H+1})\), where \(H\) is the horizon of the trajectory, i.e. the duration until the episode terminates, and \(s_t, a_t\) are the state and action at each time step \(t\).

This can be formalized as the following objective:

\[\begin{align} & \max_\theta \E_{\tau \sim P_\theta(\tau)} [R(\tau)] \\ = & \max_\theta \sum\limits_\tau P_\theta(\tau) R(\tau) \\ = & \max_\theta U(\theta), \end{align}\]where \(\tau\) refers to a trajectory of state-action pairs, \(P_\theta(\tau)\) denotes the probability of experiencing trajectory \(\tau\) under the policy parameterized by \(\theta\), \(R(\tau)\) is the reward of trajectory \(\tau\), and \(U(\theta)\) is shorthand for the objective.

The probability \(P_\theta(\tau)\) is given by the following:

\[\begin{align} P_\theta(\tau) = \prod_{t=0}^H P(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta (a_t \mid s_t), \end{align}\]where in words, it is the product over each time step \(t\), of the probability of taking the action at time \(t\) in the trajectory \(a_t\) when we were in state \(s_t\) under our policy \(\pi_\theta\), given by \(\pi_\theta(a_t \mid s_t)\), multiplied by the probability that the environment transitions us from \(s_t\) to \(s_{t+1}\) given that we performed action \(a_t\). Note that we do not necessarily know this environment transition probability \(P(s_{t+1} \mid s_t, a_t)\).

To perform a gradient-based update on \(\theta\) to increase the reward, we need to compute the gradient with respect to our policy parameters \(\theta\), i.e., \(\nabla_\theta \E_{\tau \sim P_\theta(\tau)} [R(\tau)]\). Let’s walk through the derivation step by step:

\[\begin{align*} \nabla_\theta \E_{\tau \sim P_\theta(\tau)} [R(\tau)] & = \nabla_\theta \sum\limits_\tau P_\theta(\tau) R(\tau) \\ & = \sum\limits_\tau \nabla_\theta P_\theta(\tau) R(\tau) & \text{(uh oh...)}\\ \end{align*}\]It appears that we are already stuck here: since \(P_\theta(\tau)\) is a huge product containing both our policy probabilities and the unknown environment transition probabilities, differentiating it directly requires many repeated applications of the product rule and quickly becomes infeasible to compute.

Instead, the trick is to multiply by \(1 = \frac{P_\theta(\tau)}{P_\theta(\tau)}\):

\[\begin{align*} \sum\limits_\tau \nabla_\theta P_\theta(\tau) R(\tau) &= \sum\limits_\tau \frac{ P_\theta(\tau) }{ P_\theta(\tau) } \nabla_\theta P_\theta(\tau) R(\tau) & \text{(multiplying by 1)} \\ &= \sum\limits_\tau P_\theta(\tau) \frac{ \nabla_\theta P_\theta(\tau) }{ P_\theta(\tau) } R(\tau) & \text{(rearranging)} \\ &= \sum\limits_\tau P_\theta(\tau) \nabla_\theta \log P_\theta(\tau) R(\tau) & \text{($\frac{d}{dx} \log f(x) = \frac{f'(x)}{f(x)} $)} \\ &= \E_{\tau \sim P_\theta(\tau)} \left[ \nabla_\theta \log P_\theta(\tau) R(\tau) \right] \\ &\approx \frac{1}{N} \sum\limits_{i=1}^N \nabla_\theta \log P_\theta(\tau_i) R(\tau_i), \\ \end{align*}\]where we can use \(\frac{1}{N} \sum\limits_{i=1}^N \nabla_\theta \log P_\theta(\tau_i) R(\tau_i)\) as our estimator, which converges to the true expectation as the number of trajectory samples \(N\) increases.
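As a sanity check on the estimator, we can compare it against the exact gradient in a toy setting where each "trajectory" is a single action drawn from a softmax policy over three choices; all parameters and rewards below are illustrative. For a softmax, the exact gradient is \(\nabla_{\theta_k} \E[R] = p_k (R_k - \E[R])\), and the Monte Carlo estimate should approach it as \(N\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.5, -0.2, 0.1])       # policy parameters (illustrative)
R = np.array([1.0, 0.0, 2.0])            # reward for each "trajectory" (action)

p = np.exp(theta) / np.exp(theta).sum()  # softmax probabilities

# Exact gradient of E[R] for a softmax: grad_k = p_k (R_k - E[R]).
exact = p * (R - p @ R)

# Score-function estimate: average of grad_theta log p(a_i) * R(a_i).
N = 200_000
a = rng.choice(3, size=N, p=p)
score = np.eye(3)[a] - p                 # grad_theta log p(a_i), one row per sample
estimate = (score * R[a][:, None]).mean(axis=0)

print(np.abs(estimate - exact).max())    # shrinks as N grows
```

Note that the estimator never differentiates through the sampling itself, only through \(\log p\) evaluated at the sampled actions, which is exactly what makes the trick useful when sampling involves an unknown environment.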

We can compute \(\nabla_\theta \log P_\theta(\tau_i)\) for each sampled trajectory \(\tau_i\), and then take their average. This can be done as follows:

\[\begin{align*} \nabla_\theta \log P_\theta(\tau_i) & = \nabla_\theta \log P_\theta(s_0, a_0, \dots, s_H, a_H, s_{H+1}) \\ & = \nabla_\theta \log \left[ \prod_{t=0}^H P(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta (a_t \mid s_t) \right] \\ & = \nabla_\theta \left[ \sum\limits_{t=0}^H \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta (a_t \mid s_t) \right] \\ & = \nabla_\theta \sum\limits_{t=0}^H \log \pi_\theta (a_t \mid s_t) \\ & \qquad \qquad \text{(first term does not depend on $\theta$, becomes zero)} \\ & = \sum\limits_{t=0}^H \nabla_\theta \log \pi_\theta (a_t \mid s_t),\\ \end{align*}\]where the last expression is easily computable for models such as neural networks since it is end-to-end differentiable.
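For a tabular softmax policy (an illustrative assumption; a neural network would use autodiff instead), the per-timestep score \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) has a closed form, and the sum over the trajectory can be computed directly:

```python
import numpy as np

# Tabular softmax policy: theta has shape (S, A); pi(a|s) = softmax(theta[s]).
theta = np.array([[0.2, -0.1],
                  [0.0, 0.3]])
pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)

def grad_log_pi_trajectory(states, actions):
    """sum_t grad_theta log pi_theta(a_t | s_t) for a tabular softmax policy."""
    g = np.zeros_like(theta)
    for s, a in zip(states, actions):
        # Gradient of log-softmax: one-hot(a) minus the probability row for s.
        g[s] += np.eye(theta.shape[1])[a] - pi[s]
    return g

g = grad_log_pi_trajectory([0, 1, 0], [1, 0, 1])
print(g)
```

Notice that the environment transition probabilities never appear: only the policy's own log-probabilities are differentiated, matching the derivation above.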

With the approximate gradient \(\nabla_\theta U(\theta)\) in hand, we can now perform our policy gradient update as

\[\begin{align*} \theta_{\mbox{new}} = \theta_{\mbox{old}} + \alpha \nabla_\theta U(\theta_{\mbox{old}}), \end{align*}\]for some choice of step size \(\alpha\).
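Putting the pieces together, here is a minimal sketch of the full update loop on a hypothetical two-armed bandit (a one-step "trajectory"); the reward means, step size, and sample counts are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: arm 1 pays more in expectation.
true_means = np.array([0.2, 0.8])
theta = np.zeros(2)                       # policy parameters
alpha, N = 0.5, 100                       # step size, trajectories per update

def expected_reward(theta):
    p = np.exp(theta) / np.exp(theta).sum()
    return p @ true_means

before = expected_reward(theta)
for _ in range(50):
    p = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, size=N, p=p)        # sample N one-step trajectories
    r = rng.normal(true_means[a], 0.1)    # noisy rewards
    score = np.eye(2)[a] - p              # grad_theta log pi per sample
    grad = (score * r[:, None]).mean(axis=0)  # policy gradient estimate
    theta = theta + alpha * grad          # gradient ascent step
after = expected_reward(theta)
print(before, after)
```

Under this setup the policy shifts probability mass toward the better arm, so the expected reward after training exceeds the initial value of 0.5.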

In this post, we saw from first principles how averaging the reward-weighted gradients of the log-probabilities of many sampled trajectories converges to the true policy gradient.

This method of multiplying by 1 to pull out a probability term, so that a summation can be converted into an expectation, is widely used in machine learning, for example in deriving the variational autoencoder (VAE) loss. It is known as the log derivative trick.

The estimator \(\frac{1}{N} \sum\limits_{i=1}^N \nabla_\theta \log P_\theta(\tau_i) R(\tau_i)\) is also sometimes known as the REINFORCE estimator, after the popular REINFORCE algorithm.

One limitation of this approach is that it requires \(\pi_\theta\) to be differentiable. However, given that most RL policies are parameterized by neural networks, this is not a significant restriction.

Choosing the right step size \(\alpha\) is not straightforward. This differs from the offline supervised-learning setting, where methods like AdaGrad or RMSProp adaptively choose a learning rate for you, and even a suboptimal learning rate merely takes more iterations to converge. In reinforcement learning, a learning rate that is too small makes inefficient use of trajectory samples, since they cannot be trivially re-used once the policy changes, while a learning rate that is too large can push the policy into a bad region that is difficult to recover from, since future trajectories would also be bad.

We will discuss three important methods to choose an appropriate step size in a future post: Natural Policy Gradients, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO). Hope to see you around!

*I would like to express my thanks to my friend Jun Yu Tan
for reviewing this article and for providing valuable suggestions*.


Given a fixed input for a search problem, pseudo-deterministic algorithms produce the same answer over multiple independent runs, with high probability. For example, we can efficiently find a certificate for inequality of multivariate polynomials pseudo-deterministically, but it is not known how to do so deterministically. The same notion can be extended to the streaming model. The problem of finding a nonzero element from a turnstile stream was previously shown to require linear space for both deterministic and pseudo-deterministic algorithms. Another model of streaming problems is that of graphs, where edge insertions and deletions occur along a stream; natural problems include connectivity, bipartiteness, and colorability of a graph. While randomized and deterministic graph streaming algorithms have been well studied, we investigate pseudo-deterministic space lower and upper bounds for graph-theoretic streaming problems.

Joint work with Albert Gao, Andrew Caosun, and Puhua Cheng for the course project of 15-859CC Algorithms for Big Data.

Covariance matrix prediction is a long-standing challenge in modern portfolio theory and quantitative finance. In this project, we investigate the effectiveness of Bayesian networks in predicting the covariance matrix of financial assets (specifically a subset of the S&P 500), evaluated against Heterogeneous Autoregressive (HAR) models. In particular, we consider both HAR-DRD, based on the DRD decomposition of the covariance matrix, and Graphical HAR (GHAR)-DRD, which likewise uses the DRD decomposition but additionally exploits graphical relationships between the assets. To build the graph representing relationships between the assets, we apply Latent Dirichlet allocation (LDA) to the 10-K filings of each of the companies, and infer edges based on topic overlap. We show that this technique has limited usefulness in our setup, and provide recommendations on how it could be improved based on our observations of its predictions.

Joint work with Kevin Minghan Li for the course project of 10-708 Probabilistic Graphical Models.

**Note:** different diagram-generation packages require external dependencies to be installed on your machine.
Also, be mindful that, because of diagram generation, the first time you build your Jekyll website after adding new diagrams will be SLOW.
For any other details, please refer to the jekyll-diagrams README.

Install mermaid using the `node.js` package manager `npm` by running the following command:

```
npm install -g mermaid.cli
```

The diagram below was generated by the following code:

```
{% mermaid %}
sequenceDiagram
participant John
participant Alice
Alice->>John: Hello John, how are you?
John-->>Alice: Great!
{% endmermaid %}
```


We investigate whether policies learnt by agents using the Off-Belief Learning (OBL) algorithm in the multi-player cooperative game Hanabi, in the zero-shot coordination (ZSC) context, are invariant across symmetries of the game, and whether any conventions formed during training are arbitrary or natural. We do this by performing a convention analysis on the action matrix of what the agent does, introducing a novel technique called Intervention Analysis to estimate whether the actions taken by the learnt policies are equivalent between isomorphisms of the same game state, and finally evaluating whether our observed results also hold in a simplified version of Hanabi which we call Mini-Hanabi.

Joint work with William Zhang for the course project of 15-784 Foundations of Cooperative AI.