This is a long paper, but it’s full of gems. Here’s a reading recommendation guide:
Introduces a new set of models (8B/70B/405B) that support:
Largest model:
Also introduced Llama Guard 3 model for input/output safety.
Knowledge cutoff is end of 2023. To ensure high-quality tokens, they performed de-duplication and data cleaning, and removed domains known to contain large amounts of PII or adult content.
Data cleaning:
De-duplication:
Used heuristics to filter other low-quality documents: logs/error messages, other adult websites, websites with excessive numbers of outlier tokens
Built a model-based classifier to sub-select high-quality tokens.
Built domain-specific pipelines to extract code & math-relevant web pages, including pages containing math deduction, pages containing code interleaved with natural language.
Used similar approaches as the above for other languages.
This ensures they have the right proportion of different data sources. They ended up with:
Knowledge classification: categorizes data to determine the data mix. Used this to downsample data over-represented on the web like arts & entertainment.
Scaling laws for data mix: trained several small models on a data mix and used that to predict the performance of large models on that mix
Overview
Separate encoders trained for images and speech.
Image encoder:
Speech encoder:
TBD
Performed annealing on small amounts of high-quality code and mathematical data. Annealing here means increasingly upsampling these high-quality data over time.
Found improvements for Llama 3 8B on GSM8k and MATH, but not 405B.
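The exact upsampling schedule isn't given in these notes; as a rough illustration only (the linear schedule, function name, and weights below are my own assumptions, not the paper's), annealing-style upsampling might look like:

```python
# Illustrative sketch of annealing-style upsampling (NOT the paper's
# actual schedule): linearly increase the fraction of each batch drawn
# from a high-quality subset as training progresses.
def annealed_weight(step, total_steps, start=0.05, end=0.30):
    """Fraction of the batch drawn from the high-quality subset at `step`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac
```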
Architecture
Uses the tiktoken tokenizer with an extra 28K non-English tokens. The tokenizer improves the compression rate from 3.17 to 3.94 characters per token compared to the Llama 2 tokenizer.

Scaling laws are nice for predicting loss, but not helpful for understanding impact on downstream task performance.
To find relationship with downstream task performance they did:
The scaling laws suggest that given their compute budget of \(3.8 \times 10^{25}\) FLOPs, a 402B model with 16.55T tokens is optimal, which led to their 405B model.
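As a quick sanity check on those numbers, using the common C ≈ 6ND FLOPs approximation (my assumption here, not necessarily the paper's fitted scaling law):

```python
# Check that 402B parameters x 16.55T tokens lands near the stated
# 3.8e25 FLOP budget under the rough C = 6 * N * D approximation.
N = 402e9      # model parameters
D = 16.55e12   # training tokens
C = 6 * N * D  # approximate total training FLOPs
print(f"C = {C:.2e} FLOPs")  # prints "C = 3.99e+25 FLOPs"
```

This lands on the same order of magnitude as the quoted 3.8e25 FLOP budget, as expected.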
They also found their predictions to be quite accurate for the final downstream performance of their models.
Compute:
Storage:
Network:
Scaled parallelism as much as possible, so that each GPU's share of model parameters, optimizer states, gradients, and activations fits in HBM.
4D parallelism:
Parallelism achieved BF16 Model FLOPs Utilization (MFU) of 38-43%
90% effective training time, even while supporting automated cluster maintenance (e.g. Linux kernel upgrades)
466 job interruptions
Debugging
Others
Initial pre-training:
Long context pre-training:
Annealing:
Checkpoints aligned with DPO
Modified DPO:
Finetuning data contains:
Datasets:
Rejection sampling:
Most of the training data is model-generated, which requires careful cleaning and quality control
Data cleaning:
Data pruning:
Capabilities:
Targeted languages: Python, Java, JavaScript, C/C++, TypeScript, Rust, PHP, HTML/CSS, SQL, bash/shell
Improved capabilities via:
Expert training:
Synthetic data generation:
During RS, used code specific system prompts to improve:
Trained Llama 3 to use search engine (Brave), Python interpreter, Wolfram Alpha API.
To train on tool use:
To train the model to guard against hallucinations, they used a knowledge probe to find out what the model knows, and to generate training data of refusals for the things it doesn’t:
But because pre-training data is not always factually correct, they also did this for sensitive topics where contradictory/incorrect statements are prevalent
Remainder to be continued…
This beginner’s guide will help to demystify the process of setting up Sound Voltex at home with a custom SDVX controller and Unnamed SDVX Clone.
My foray into rhythm games started way back with Love Live! School Idol Festival. While the game has sadly since shut down, other titles I’ve played include BanG Dream! and Project SEKAI.
When I visited Japan a few months ago in the summer, I discovered Sound Voltex, and instantly fell in love with its unique control system and beautiful flashy graphics:
In Japan, you pay 100 yen (~$0.68 USD) to play two to three songs. You get to play three songs if you don’t crash (i.e fail) any tracks, and two if you fail on either of the first two guaranteed plays.
Rhythm games in general are sadly not as mainstream outside of Japan. For instance, in Singapore I was only aware of a single arcade that had Sound Voltex cabs, even though arcades are quite popular in general. Similarly, in NYC, there’s only a single small arcade called Chinatown Fair that has Sound Voltex. So naturally I wanted to see if I could set it up at home to continue enjoying the game.
(Apparently, if you have a lot of disposable income and space in your living room, you can also just buy an entire previous-generation Sound Voltex cabinet for a few thousand dollars)
I decided to write this guide since the setup process could seem somewhat daunting for people who are interested in rhythm games but are not developers. Hopefully now more people will also be able to play and enjoy this game.
The setup process is very straightforward on Windows, but has a few subtle points on macOS and Linux that I’ll point out.
This guide will use the following setup:
While I performed the setup on macOS, the instructions are largely the same for Linux-based systems as well, and most of the steps apply regardless of which controller or OS you use.
The setup process for Windows is very straightforward. You should just download the latest Windows build as linked on the GitHub page, and run usc-game.exe to start the game.
This is mostly just from the official instructions, but with implicit points made explicit:
If you don’t have git yet, install it with Homebrew:
$ brew install git
Git is a version control system (normally used for code). In our case, we use it mainly to obtain the project dependencies.
Clone the unnamed-sdvx-clone repository with git:
$ git clone https://github.com/Drewol/unnamed-sdvx-clone
This will download the game to an unnamed-sdvx-clone folder in your current working directory.
Navigate into the new folder, and download the submodules of the project:
$ cd unnamed-sdvx-clone
$ git submodule update --init --recursive
This is necessary because the game has third-party dependencies, which are tracked as other Github repositories.
Install more dependencies required to build the project with Homebrew:
$ brew install cmake freetype libvorbis sdl2 libpng jpeg libarchive libiconv
These are all open source libraries required for the following reasons:
- cmake: a popular build system used to compile the project
- freetype: for rendering fonts
- libvorbis: audio compression
- sdl2: access to hardware inputs like keyboard, mouse, controller, etc.
- libpng: for using/manipulating PNG images
- jpeg: for using/manipulating JPEG images
- libarchive: compression library
- libiconv: converts between different character encodings (e.g. ISO-8859-1 to UTF-8)

Configure the project using cmake. In this case, the project author already kindly wrapped a script around this command, so we only have to run the script:
$ ./mac-cmake.sh
Compile and build the project:
$ make
This step could take a while. If you want to speed it up, you can specify the -j argument to parallelize the compilation based on the number of cores you have (use one less than your total number of cores):
$ make -j 9
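If you don’t know your core count offhand, you can compute the -j value instead of hardcoding it. A small convenience sketch (nproc is the Linux command; on macOS, substitute sysctl -n hw.ncpu):

```shell
# Compute (cores - 1) as the parallelism level, with a floor of 1.
# `nproc` reports the core count on Linux; macOS users can use
# `sysctl -n hw.ncpu` instead.
JOBS=$(( $(nproc) - 1 ))
if [ "$JOBS" -lt 1 ]; then JOBS=1; fi
echo "Building with $JOBS jobs"
# Then build with: make -j "$JOBS"
```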
Run the game from the bin folder:
$ cd bin
$ ./usc-game
It is important to run it from the ./bin folder and not the root of the project directory, as some skins search for file dependencies in a relative manner and will hence not be able to find them.
Honestly if you’re on Linux, you should be able to figure it out by yourself 😊
On first startup, you should see this:
For now, you can just use your mouse to interact with the game menu.
Let’s now setup our Yuancon controller!
Quit the game, and unplug your SDVX controller if it is plugged in
Hold the START and BT-C buttons simultaneously. The START button is the diamond-shaped button at the top, while BT-C is the third white button from the left (it should also be labelled on the controller board).
Then, while still holding down both buttons, connect it to your computer. This will put it in Controller HID mode, where the controller presents its inputs as a gamepad.
Start up the game again, and navigate to the Settings page. Here, you want to do the following:

- Set Button input mode to Controller
- Set Laser input mode to Controller
- Adjust the laser sensitivity to a value that feels comfortable (e.g. 1.875)

You should have something that looks similar to this:
Restart the game. You should now be able to use the knobs to cycle through the menus, and the buttons to activate them!
Right after setup, there are no songs to play yet. USC uses the same chart format as K-Shoot MANIA (KSM).
There are a few places you can get songs:
Once you have downloaded the songs, unzip and extract them if necessary, and copy them into ./bin/songs.
The default skin works, but it is not very impressive:
Let’s try to re-create the original SDVX arcade experience with skins. You can get skins for the game here. These are really high-effort and well-made, and huge thanks to the developers and artists for making them.
Once you have downloaded the skin, extract and move it to ./bin/skins. You should then be able to select the skin under the Skins tab of the game settings.
The UI of the skin for the game may change depending on whether your monitor is in portrait or landscape mode. Orienting it vertically is recommended for the best SDVX-like experience - the spaceship(?) at the bottom only shows up when it’s vertical.
Some examples of the different skins are shown below. I know, they’re pretty!
(Why are the previews so low-res? Bandwidth costs add up!)
If you run into errors about shaders when trying to play a song, see the Common Errors section below.
As a side note, if you find the default menu text for this skin too casual/unprofessional, you can change it in ./bin/skins/ExperimentalGear/scripts/language/EN.lua.
Not all skins come with a cast of crews; the ExperimentalGear skin, for example, only comes with a boring empty nothing in ./bin/skins/ExperimentalGear/textures/crew/anim.
As crews are very important for our psychological safety and well-being, fortunately we can just copy over the animations from other skins. In HeavenlyExpress, you can find them in ./bin/skins/HeavenlyExpress-1.3.0/textures/_shared/crew. Similarly, in LiqidWave they are stored in ./bin/skins/LiqidWave-1.5.0/textures/_shared/crew.
If you’ve made it this far, congrats and thanks for reading! I hope you’ll enjoy the game as much as I do. If you have any questions or run into problems, feel free to ask in the comments section below.
KSM charts have a .ksh extension. This can be a useful check to ensure that any charts that you download are actually for this game.
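If you want to automate that check, here is a small sketch (the folder path you pass in is up to you, e.g. bin/songs):

```python
# List files under a songs folder that are NOT .ksh charts, so you can
# spot downloads in the wrong format before launching the game.
from pathlib import Path

def find_non_ksh(songs_dir):
    """Return all non-.ksh files under songs_dir, recursively."""
    return [p for p in Path(songs_dir).rglob("*")
            if p.is_file() and p.suffix != ".ksh"]
```

For example, find_non_ksh("bin/songs") returns any stray files that are not charts (note that legitimate song folders also contain audio and jacket images, so treat the output as a hint rather than a verdict).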
The following is a snippet of the ADV.ksh (i.e. advanced beatmap) file for YOASOBI’s Idol (アイドル):
title=アイドル
artist=YOASOBI /「推しの子」より
effect=AS
jacket=jk.jpg
illustrator=-
difficulty=challenge
level=10
t=166
m=music.ogg
o=0
bg=desert
layer=smoke
po=56024
plength=15000
pfiltergain=50
filtertype=peak
chokkakuautovol=0
chokkakuvol=50
ver=171
--
beat=4/4
0000|00|--
--
0000|00|0-
0000|00|:-
0000|00|o-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|o-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|P-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|P-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
0000|00|:-
--
0000|00|0-
0000|00|:-
filtertype=lpf1
0000|00|0o
0000|00|::
0000|00|o0
0000|00|::
Some errors I faced when trying to setup and configure the game.
commonShared not found

If you get a Lua error about not being able to load a commonShared package, such as when using a custom skin:
[14:45:10][Error] Lua error: ...clone/bin/skins/HeavenlyExpress-1.3.0/scripts/common.lua:2: module 'commonShared' not found:
no field package.preload['commonShared']
no file '/usr/local/share/lua/5.3/commonShared.lua'
no file '/usr/local/share/lua/5.3/commonShared/init.lua'
no file '/usr/local/lib/lua/5.3/commonShared.lua'
no file '/usr/local/lib/lua/5.3/commonShared/init.lua'
no file './commonShared.lua'
no file './commonShared/init.lua'
no file '/Users/fanpu/unnamed-sdvx-clone/bin/skins/HeavenlyExpress-1.3.0/scripts/commonShared.lua'
no file 'skins/HeavenlyExpress-1.3.0/textures/_shared/scripts/commonShared.lua'
no file '/usr/local/lib/lua/5.3/commonShared.so'
no file '/usr/local/lib/lua/5.3/loadall.so'
no file './commonShared.so'
You are likely running the game from the root of the project directory (i.e. ./bin/usc-game), instead of from within the ./bin directory itself.
If you are using the HeavenlyExpress skin, you may run into the following error after selecting a track to play:
Shader Error:
Could not load shaders skins/HeavenlyExpress-1.3.0/shaders/holdbutton.vs
and skins/HeavenlyExpress-1.3.0/shaders/holdbutton.fs
You may also get logs like this:
[14:58:37][Error] Shader program compile log for /Users/fanpu/unnamed-sdvx-clone/bin/skins/HeavenlyExpress-1.3.0/shaders/holdbutton.vs: ERROR: 0:6: 'varying' : syntax error: syntax error
[14:58:37][Error] Shader program compile log for /Users/fanpu/unnamed-sdvx-clone/bin/skins/HeavenlyExpress-1.3.0/shaders/holdbutton.fs: ERROR: 0:10: 'varying' : syntax error: syntax error
[14:58:37][Error] Failed to load vertex shader for material from /Users/fanpu/unnamed-sdvx-clone/bin/skins/HeavenlyExpress-1.3.0/shaders/holdbutton.vs
The shaders were probably written a long time ago, since the varying keyword has been deprecated since OpenGL 3.3. It was previously used as a qualifier for variables that communicate between the vertex shader and the fragment shader, and has been replaced by the in and out qualifiers to provide a clearer distinction of data flow between shaders.

To fix this, modify the two files and change the varying keyword to out in both files:
In file bin/skins/HeavenlyExpress-1.3.0/shaders/holdbutton.vs:
#version 330
#extension GL_ARB_separate_shader_objects : enable
layout(location=0) in vec2 inPos;
layout(location=1) in vec2 inTex;
out vec4 position; // update here
out gl_PerVertex
{
vec4 gl_Position;
};
...rest of file omitted...
In file bin/skins/HeavenlyExpress-1.3.0/shaders/holdbutton.fs:
#version 330
#extension GL_ARB_separate_shader_objects : enable
layout(location=1) in vec2 fsTex;
layout(location=0) out vec4 target;
uniform sampler2D mainTex;
uniform float objectGlow;
out vec4 position; // update here
...rest of file omitted...
Restart the game and you should be good now.
I faced issues where the skin value I set in the settings page was not being saved. I resolved this by manually editing the config file in ./bin/skins/ExperimentalGear/skin.cfg.
From Wikipedia:
A trackback allows one website to notify another about an update. It is one of four types of linkback methods for website authors to request notification when somebody links to one of their documents. This enables authors to keep track of who is linking to their articles. Some weblog software, such as SilverStripe, WordPress, Drupal, and Movable Type, supports automatic pingbacks where all the links in a published article can be pinged when the article is published. The term is used colloquially for any kind of linkback.
Essentially, it is a mechanism for other websites to know that you mentioned them, with the hope that they’ll notice you and possibly mention you as well. It helps to increase the visibility and discoverability of your website.
My use case was to send trackbacks to arXiv, so that specific arXiv papers will know that my blog post mentioned them, and readers can also check it out as an additional resource. In particular, each of my paper summary posts is based around a paper, and it would be nice if they could be linked from the respective arXiv paper abstract pages.
In arXiv, there is a blog link section that will track websites that made trackback requests for a given paper:
Unfortunately, if you try to search for anything about trackbacks and/or pingbacks, most of what you’ll get are articles about how to disable them on popular blogging platforms like WordPress due to widespread misuse and spam, or otherwise how to configure them.
There was also a 7-year old StackOverflow post about how to create trackback requests for arXiv, essentially the same problem I was facing. Sadly, it currently has a grand total of 0 answers and 0 comments. I hope this article might be useful if the author is still facing the issue.
The convenience of CMS blogging software like WordPress is that it supports features like automated trackbacks and pingbacks for content that you create. Static site generators are not capable of this, since by design they are static and stateless. This means that we have to make such requests manually, which is fortunately not too difficult!
Here’s a very simple script for doing it. In this example, the target URL is for the arXiv trackback endpoint.
Before reading or running the code, please note that you SHOULD NOT test or experiment on this with trackback listener URLs and spam them. You should only make requests if they are legitimate and you have a genuine reason for letting them know about your blog post. Trackback spam is a serious issue and part of why they have become so unpopular and unmanageable is due to the high volumes of spam.
import requests

# Replace with your own data
data = {
    'title': 'My Awesome Blog Post',
    'url': 'https://my-blog.com/post/',
    'blog_name': 'My Awesome Blog'
}

# Replace with the actual trackback destination URL
trackback_url = 'https://foo.bar/trackback/post_id'

response = requests.post(trackback_url, data=data)

if response.status_code == 200:
    print("Trackback successful!")
else:
    print(f"Trackback failed with status code: {response.status_code}")
    print(response.content.decode())
A successful response has the error field set to 0:
If an error occurred, the error field is set to 1:
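For reference, the original Six Apart TrackBack specification defines these responses as small XML documents, roughly like the following (a sketch of the spec's format, not output captured from arXiv):

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- success: error is 0 -->
<response>
  <error>0</error>
</response>
```

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- failure: error is 1, with a human-readable message -->
<response>
  <error>1</error>
  <message>The reason for the failure</message>
</response>
```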
And that’s all there is to creating Trackback requests! It’s actually quite simple, and is just not terribly well-documented.
As a final parting word, a reminder again to please use it responsibly and stay away from any behavior that could be construed as spamming.
In high-dimensional statistical inference, it is common for the number of parameters \(p\) to be comparable to or greater than the sample size \(n\). However, for an estimator \(\thatn\) to be consistent in such a regime, meaning that it converges to the true parameter \(\theta\), it is necessary to make additional low-dimensional assumptions on the model. Examples of such constraints that have been well-studied include linear regression with sparsity constraints, estimation of structured covariance or inverse covariance matrices, graphical model selection, sparse principal component analysis (PCA), low-rank matrix estimation, matrix decomposition problems and estimation of sparse additive nonparametric models (Negahban et al., 2009).
In recent years, there has been a flurry of work on each of these individual specific cases. However, the authors of the paper under discussion pose the question of whether there is a way of unifying these analyses to understand all such estimators in a common framework, and answer it in the affirmative. They showed that it is possible to bound the squared difference between any regularized \(M\)-estimator and its true parameter in terms of (1) the decomposability of the regularization function, and (2) restricted strong convexity of the loss function. We will call this the “main theorem” in the remainder of the blog post; it is referred to as “Theorem 1” in (Negahban et al., 2009).
In the remainder of this post, we will develop the tools necessary to deeply understand and prove the result. Notation will be consistent with the original paper for expositional clarity.
In this section, we develop some of the necessary background and notation to build up to the proof.
\(M\)-estimators (\(M\) for “maximum likelihood-type”) are solutions that minimize the sum of loss functions \(\rho\): \begin{align} \that \in \argmin_\theta \sum_{i=1}^n \rho(x_i, \theta). \end{align}
If we add a regularization term \(\rcal\) to penalize complexity of the model, scaled by weights \(\lambda\), the method is known as a regularized \(M\)-estimator: \begin{align} \that \in \argmin_\theta \sum_{i=1}^n \rho(x_i, \theta) + \lambda \rcal(\theta). \end{align}
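A canonical instance (discussed in Negahban et al.) is the Lasso, where the loss is squared error and the regularizer is the \(\ell_1\) norm:

```latex
% Lasso as a regularized M-estimator:
% squared-error loss with an l1 penalty.
\begin{align}
\widehat{\theta} \in \argmin_{\theta \in \mathbb{R}^p}
  \frac{1}{2n} \| y - X \theta \|_2^2 + \lambda_n \| \theta \|_1.
\end{align}
```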
The subspace compatibility constant measures how much the regularizer \(\rcal\) can change with respect to the error norm \(\| \cdot \|\) restricted to the subspace \(\mcal\). This concept will show up later in showing that the restricted strong convexity condition will hold with certain parameters.
The subspace compatibility constant is defined as follows:
It can be thought of as the Lipschitz constant of the regularizer with respect to the error norm restricted to values in \(\mcal\), by considering the point where it can vary the most.
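For concreteness, a standard example from the paper: for the subspace of vectors supported on a set \(S\) with \(|S| = k\), the \(\ell_1\) regularizer, and the \(\ell_2\) error norm,

```latex
% Subspace compatibility for k-sparse vectors: Cauchy-Schwarz over the
% k nonzero coordinates gives ||u||_1 <= sqrt(k) ||u||_2, with equality
% attained, so
\begin{align}
\varPsi(\mcal(S))
  = \sup_{u \in \mcal(S) \setminus \{0\}} \frac{\|u\|_1}{\|u\|_2}
  = \sqrt{k}.
\end{align}
```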
Define the projection operator \begin{align} \Pi_{\mcal}(u) \coloneqq \argmin_{v \in \mcal} | u - v | \end{align} to be the projection of \(u\) onto the subspace \(\mcal\). For notational brevity, we will use the shorthand \(u_{\mcal} = \Pi_{\mcal}(u)\).
One property of the projection operator is that it is non-expansive, meaning that \begin{align} | \Pi(u) - \Pi(v) | \leq | u - v | \label{eq:non-expansive} \end{align} for some error norm \(\| \cdot \|\). In other words, it has Lipschitz constant 1.
In our setup, we define the following quantities:
The purpose of the regularized \(M\)-estimator is then to solve for the convex optimization problem
\[\begin{align} \label{eq:opt} \widehat{\theta}_{\lambda_n} \in \argmin_{\theta \in \mathbb{R}^p} \left\{ \mathcal{L}(\theta; Z_1^n) + \lambda_n \mathcal{R} (\theta) \right\}, \end{align}\]and we are interested in deriving bounds on \(\begin{align} \| \thatlambda - \theta^* \| \end{align}\) for some error norm \(\| \cdot \|\) induced by an inner product \(\langle \cdot, \cdot \rangle\) in \(\mathbb{R}^p\).
The first key property in the result is decomposability of our norm-based regularizer \(\rcal\). Working in the ambient \(\mathbb{R}^p\), define \(\mcal \sse \mathbb{R}^p\) to be the model subspace that captures the constraints of the model that we are working with (e.g. \(k\)-sparse vectors), and denote \(\mocal\) to be its closure, i.e. the union of \(\mcal\) and all of its limit points. In addition, denote \(\mocalp\) to be the orthogonal complement of \(\mocal\), namely
\[\begin{align} \mocalp \coloneqq \left\{ v \in \mathbb{R}^p \mid \langle u, v \rangle = 0 \text{ for all \( u \in \mocal \) } \right\}. \end{align}\]We call this the perturbation subspace, as they represent perturbations away from the model subspace \(\mocal\). The reason why we need to consider \(\mocal\) instead of \(\mcal\) is because there are some special cases of low-rank matrices and nuclear norms where it could be possible that \(\mcal\) is strictly contained in \(\mocal\).
Now we can introduce the property of decomposability:
Since \(\rcal\) is a norm-based regularizer, the triangle inequality always gives \begin{align} \rcal(\theta + \gamma) \leq \rcal(\theta) + \rcal(\gamma), \end{align} so decomposability is a stronger condition which requires the inequality to hold with equality when we specifically consider elements in the closure of the model subspace and its orthogonal complement.
Decomposability of the regularizer is important as it allows us to penalize deviations \(\gamma\) away from the model subspace \(\mcal\) to the maximum extent possible. We are usually interested in finding model subspaces that are small, with a large orthogonal complement. We will see in the main theorem that when this is the case, we obtain better rates for estimating \(\theta^*\).
There are many natural contexts that admit regularizers which are decomposable with respect to subspaces, and the following example highlights one such case.
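One such case is the \(\ell_1\) norm over sparse supports, sketched here:

```latex
% Decomposability of the l1 norm: for a support set S, take
%   M(S) = { theta in R^p : theta_j = 0 for all j not in S },
% so that Mbar(S) = M(S) and Mbar(S)^perp = M(S^c). For any
% theta in M(S) and gamma in M(S)^perp, the supports are disjoint, so
\begin{align}
\rcal(\theta + \gamma) = \| \theta + \gamma \|_1
  = \| \theta \|_1 + \| \gamma \|_1
  = \rcal(\theta) + \rcal(\gamma).
\end{align}
```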
Decomposability is important because it allows us to bound the error of the estimator. This is given in the following result, which is known as Lemma 1 in (Negahban et al., 2009):
Recall from the Projections section that \(\Delta_{\mocalp}\) represents the projection of \(\Delta\) onto \(\mocalp\), and similarly for the other quantities. Due to space constraints, we are unable to prove Lemma 1 in this survey, but it is very important in the formulation of restricted strong convexity, and in proving Theorem 1.
Figure 1 provides a visualization of \(\ctriplet\) in \(\mathbb{R}^3\) in the sparse vectors setting. In this case, \(S = \left\{ 3 \right\}\) with \(|S|=1\), and so the projection of \(\Delta\) onto the model subspace only has non-zero values on the third coordinate, and its orthogonal complement is where the third coordinate is zero. Formally,
\[\begin{align} \mcal(S) = \mocal(S) & = \left\{ \Delta \in \mathbb{R}^3 \mid \Delta_1 = \Delta_2 = 0 \right\}, \\ \mocalp(S) & = \left\{ \Delta \in \mathbb{R}^3 \mid \Delta_3 = 0 \right\}. \end{align}\]The vertical axis of Figure 1 denotes the third coordinate, and the horizontal plane denotes the first two coordinates. The shaded area represents the set \(\ctriplet\), i.e all values of \(\theta\) that satisfies the inequality of the set in Lemma 1.
Figure 1(a) shows the special case when \(\ts \in \mcal\). In this scenario, \(\rcal (\ts_{\mcalp}) = 0\), and so
\[\begin{align*} \C(\mcal, \mocalp; \ts) = \left\{ \Delta \in \mathbb{R}^p \mid \rcal(\Delta_{\mocalp}) \leq 3 \rcal (\Delta_{\mocal}) \right\}, \end{align*}\]which is a cone.
However, in the general setting where \(\ts \not\in \mcal\), then \(\rcal (\ts_{\mcalp}) > 0\), and the set \(\ctriplet\) will become a star-shaped set like what is shown in Figure 1(b).
In a classical setting, as the number of samples \(n\) increases, the difference in loss \(d \lcal = |\lcal(\thatlambda) - \lcal(\ts)|\) will converge to zero. However, the convergence in loss by itself is insufficient to also ensure the convergence in parameters, \(\hd = \thatlambda - \ts\). Instead, it also depends on the curvature of the loss function \(\lcal\).
Figure 2 illustrates the importance of curvature. In Figure 2(a), \(\lcal\) has high curvature, and so having a small \(d\lcal\) also implies a small \(\hd\). On the other hand, in Figure 2(b), \(\lcal\) has an almost flat landscape near \(\thatlambda\), and hence even when \(d \lcal\) is small, \(\hd\) could still be large.
Consider performing a Taylor expansion of \(\lcal\) around \(\ts\):
\[\begin{align} \lcal(\ts + \Delta) & = \lcal(\ts) + \dotprod{\nabla \lcal(\ts)}{\Delta} + \underbrace{\frac{1}{2} \Delta^T \nabla^2 \lcal(\ts) \Delta + \dots}_{\delta \lcal(\Delta, \ts)}. \end{align}\]Then we can rearrange and write the error of the first-order Taylor series expansion at \(\ts\) as
\[\begin{align*} \delta \lcal(\Delta, \ts) = \lcal(\ts + \Delta) - \lcal(\ts) - \dotprod{\nabla \lcal(\ts)}{\Delta}. \end{align*}\]The first-order Taylor approximation is a linear approximation, and hence the error \(\delta \lcal(\Delta, \ts)\), which is dominated by the quadratic term, can capture the curvature about \(\ts\).
As such, one way to show that \(\lcal\) has good curvature about \(\ts\) is to show that \(\delta \lcal(\Delta, \ts) \geq \kappa \|\Delta \|^2\) holds for all \(\Delta\) in a neighborhood of \(\ts\). This is because we are enforcing a lower bound on its quadratic growth.
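To make the curvature condition concrete, here is a tiny numeric check with a 1-D strongly convex loss (a toy example of mine, not the paper's general setting):

```python
# For L(t) = t^2, the first-order Taylor remainder at any t in
# direction d is exactly d^2, so delta_L >= kappa * d^2 holds with
# kappa = 1 (and tolerance 0).
def loss(t):
    return t * t

def grad(t):
    return 2 * t

def taylor_error(d, t):
    """First-order Taylor remainder of `loss` at t in direction d."""
    return loss(t + d) - loss(t) - grad(t) * d

print(taylor_error(0.3, 1.7))  # ~0.09 (= 0.3**2), independent of t
```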
This leads us to the definition of restricted strong convexity:
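In the notation of (Negahban et al., 2009), \(\lcal\) satisfies restricted strong convexity with curvature \(\kl > 0\) and tolerance \(\tl\) if

```latex
% Restricted strong convexity: a quadratic lower bound on the Taylor
% error, required only over the restricted set from Lemma 1.
\begin{align}
\delta \lcal(\Delta, \ts) \geq \kl \| \Delta \|^2 - \tl^2(\ts)
  \quad \text{for all } \Delta \in \ctriplet.
\end{align}
```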
We only need to consider error terms \(\Delta \in \ctriplet\), since Lemma \ref{lemma:1} guarantees us that the error term will only lie in that set.
In many statistical models, restricted strong convexity holds with \(\tl = 0\); however, a nonzero tolerance is required in more general settings, such as generalized linear models.
We can now state and prove the main result of the paper. This will hold under the decomposability of the regularizer (G1), and the restricted strong convexity of the loss function (G2).
(G1) The regularizer \(\rcal\) is a norm and is decomposable with respect to the subspace pair \((\mcal, \mocalp)\), where \(\mcal \sse \mocal\).
(G2) The loss function \(\lcal\) is convex and differentiable, and satisfies restricted strong convexity with curvature \(\kl\) and tolerance \(\tl\).
We will rely on the following lemmas that will be stated without proof due to space constraints:
Note that this is similar to our previous analysis of restricted strong convexity, where we only really need to consider error terms restricted to \(\ctriplet\) due to Lemma 1. Therefore, it suffices to show \(\fcal(\Delta) > 0\) on this set to obtain a bound on \(\| \hd \| = \| \thatlambda - \ts\|\), which completes the proof of Theorem 1.
Define \(\fcal : \mathbb{R}^p \to \mathbb{R}\) by
\[\begin{align} \fcal(\Delta) \coloneqq \lcal(\ts + \Delta) - \lcal(\ts) + \lambda_n \left\{ \rcal(\ts + \Delta) - \rcal(\ts) \right\}, \end{align}\]and define the set
\[\begin{align} \mathbb{K}(\delta) \coloneqq \ctriplet \cap \left\{ \| \Delta \| = \delta \right\}. \end{align}\]Take any \(\Delta \in \kbb\). Then
\[\begin{align} \fcal(\Delta) = & \lcal(\ts + \Delta) - \lcal(\ts) + \lambda_n \left\{ \rcal(\ts + \Delta) - \rcal(\ts) \right\} \tag{by definition} \\ \geq & \langle \nabla \lcal (\ts), \Delta \rangle + \kl \| \Delta \|^2 - \tl^2(\ts) + \lambda_n \left\{ \rcal(\ts + \Delta) - \rcal(\ts) \right\} \\ & \qquad \text{(by restricted strong convexity: \(\delta \lcal(\Delta, \ts) \geq \kl \| \Delta \|^2 - \tl^2(\ts)\),} \\ & \qquad \text{ and \( \delta \lcal(\Delta, \ts) = \lcal(\ts + \Delta) - \lcal(\ts) - \dotprod{\nabla \lcal(\ts)}{\Delta} \) ) } \\ \geq & \langle \nabla \lcal (\ts), \Delta \rangle + \kl \| \Delta \|^2 - \tl^2(\ts) + \lambda_n \left\{ \rcal(\Delta_{\mocalp}) - \rcal(\Delta_{\mocal}) - 2 \rcal(\ts_{\mcal^{\perp}}) \right\} \\ & \qquad \text{(by Lemma 3)}. \label{thm-deriv:1} \end{align}\]We lower bound the first term as \(\langle \nabla \lcal (\ts), \Delta \rangle \geq - \frac{\lambda_n}{2} \rcal(\Delta)\):
\[\begin{align} | \langle \nabla \lcal (\ts), \Delta \rangle | \leq & \rs(\nabla \lcal(\ts)) \rcal(\Delta) & \text{(Cauchy-Schwarz using dual norms \( \rcal \) and \( \rs \))} \\ \leq & \frac{\lambda_n}{2} \rcal(\Delta) & \text{(Theorem 1 assumption: \( \lambda_n \geq 2 \rs (\nabla \lcal(\ts)) \))}, \end{align}\]and hence,
\[\begin{align} \langle \nabla \lcal (\ts), \Delta \rangle \geq & - \frac{\lambda_n}{2} \rcal(\Delta). \end{align}\]So applying to (\ref{thm-deriv:1}),
\[\begin{align} \fcal(\Delta) \geq & \kl \| \Delta \|^2 - \tl^2(\ts) + \lambda_n \left\{ \rcal(\Delta_{\mocalp}) - \rcal(\Delta_{\mocal}) - 2 \rcal(\ts_{\mcal^{\perp}}) \right\} - \frac{\lambda_n}{2} \rcal(\Delta) \\ \geq & \kl \| \Delta \|^2 - \tl^2(\ts) + \lambda_n \left\{ \rcal(\Delta_{\mocalp}) - \rcal(\Delta_{\mocal}) - 2 \rcal(\ts_{\mcal^{\perp}}) \right\} - \frac{\lambda_n}{2} (\rcal(\Delta_{\mocalp}) + \rcal(\Delta_{\mocal})) \\ & \qquad \text{(Triangle inequality: \( \rcal(\Delta) \leq \rcal(\Delta_{\mocalp}) + \rcal(\Delta_{\mocal}) \))} \\ = & \kl \| \Delta \|^2 - \tl^2(\ts) + \lambda_n \left\{ \frac{1}{2}\rcal(\Delta_{\mocalp}) - \frac{3}{2}\rcal(\Delta_{\mocal}) - 2 \rcal(\ts_{\mcal^{\perp}}) \right\} \\ & \qquad \text{(Moving terms in)} \\ \geq & \kl \| \Delta \|^2 - \tl^2(\ts) + \lambda_n \left\{ - \frac{3}{2}\rcal(\Delta_{\mocal}) - 2 \rcal(\ts_{\mcal^{\perp}}) \right\} \\ & \qquad \text{(Norms always non-negative)} \\ = & \kl \| \Delta \|^2 - \tl^2(\ts) - \frac{\lambda_n }{2} \left\{ 3 \rcal(\Delta_{\mocal}) + 4 \rcal(\ts_{\mcal^{\perp}}) \right\} \label{eq:r-delta-lb} . \end{align}\]To bound the term \(\rcal(\Delta_{\mocal})\), recall the definition of subspace compatibility:
\[\begin{align} \varPsi (\mcal) \coloneqq \sup_{u \in \mcal \setminus \left\{ 0 \right\}} \frac{\rcal(u)}{\| u \|}, \label{eq:r-delta-ub} \end{align}\]and hence
\[\begin{align} \rcal(\Delta_{\mocal}) \leq \varPsi(\mocal) \| \Delta_{\mocal} \|. \end{align}\]To upper bound \(\| \Delta_{\mocal} \|\), we have
\[\begin{align} \| \Delta_{\mocal} \| & = \| \Pi_{\mocal} (\Delta) - \Pi_{\mocal}(0) \| & \text{(Since \(0 \in \mocal \), \( \Pi_{\mocal}(0) = 0 \)) } \\ & \leq \| \Delta - 0 \| & \text{(Projection operator is non-expansive, see Equation \ref{eq:non-expansive})} \\ & = \| \Delta \|, \end{align}\]which substituting into Equation (\ref{eq:r-delta-ub}) gives
\[\begin{align} \rcal(\Delta_{\mocal}) \leq \varPsi(\mocal) \| \Delta \|. \end{align}\]Now we can use this result to lower bound Equation \ref{eq:r-delta-lb}:
\[\begin{align} \fcal (\Delta) \geq & \kl \| \Delta \|^2 - \tl^2(\ts) - \frac{\lambda_n }{2} \left\{ 3 \varPsi(\mocal) \| \Delta \| + 4 \rcal(\ts_{\mcal^{\perp}}) \right\}. \label{eq:strict-psd} \end{align}\]The RHS of the inequality in Equation \ref{eq:strict-psd} is a quadratic in \(\| \Delta \|\) with strictly positive leading coefficient \(\kl\), and hence by taking \(\| \Delta \|\) sufficiently large, it becomes strictly positive. To find such a sufficiently large \(\| \Delta \|\), write
\[\begin{align} a & = \kl, \\ b & = \frac{3\lambda_n}{2} \varPsi (\mocal), \\ c & = \tau_{\lcal}^2 (\ts) + 2 \lambda_n \rcal(\ts_{\mcalp}), \\ \end{align}\]such that we have
\[\begin{align} \fcal (\Delta) & \geq a \| \Delta \|^2 - b \| \Delta \| - c. \end{align}\]Then the rightmost root of the quadratic \(a t^2 - b t - c\) in \(t = \| \Delta \|\) is given by the quadratic formula, and its square satisfies
\[\begin{align} \| \Delta \|^2 & = \left( \frac{-(-b) + \sqrt{b^2 - 4a(-c)}}{2a} \right)^2 \\ & = \left( \frac{b + \sqrt{b^2 + 4ac}}{2a} \right)^2 \\ & \leq \left( \frac{\sqrt{b^2 + 4ac}}{a} \right)^2 & \text{($b \leq \sqrt{b^2 + 4ac}$)} \label{eq:coarse-bound} \\ & = \frac{b^2 + 4ac}{a^2} \\ & = \frac{9 \lambda_n^2 \varPsi^2 (\mocal)}{4 \kl^2} + \frac{ 4 \tau_{\lcal}^2 (\ts) + 8 \lambda_n \rcal(\ts_{\mcalp}) }{\kl}. & \text{(Substituting in \(a, b, c\))} \\ \end{align}\]In (Negahban et al., 2009), they were able to show an upper bound of
\[\begin{align} \| \Delta \|^2 & \leq \frac{9 \lambda_n^2 \varPsi^2 (\mocal)}{\kl^2} + \frac{\lambda_n}{\kl} \left\{ 2\tau_{\lcal}^2 (\ts) + 4 \rcal(\ts_{\mcalp}) \right\}, \label{eq:ub} \end{align}\]but I was unable to figure out how they produced a \(\lambda_n\) factor beside the \(\tl^2(\ts)\) term. All other differences are just constant factors. It may be due to an overly coarse bound applied on my end in Equation \ref{eq:coarse-bound}, but it is unclear to me how the \(\lambda_n\) factor can be applied to only the \(\tl^2(\ts)\) term without also affecting the \(\rcal(\ts_{\mcalp})\) term.
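As a quick numerical sanity check of the coarse bound in Equation \ref{eq:coarse-bound}, the following minimal Python sketch verifies that the square of the rightmost root never exceeds \((b^2 + 4ac)/a^2\); the random positive coefficients are purely illustrative stand-ins for \(a\), \(b\), \(c\) as defined above:

```python
import math
import random

random.seed(0)
for _ in range(1000):
    # Random positive stand-ins for the coefficients a, b, c defined above.
    a = random.uniform(0.1, 10.0)
    b = random.uniform(0.0, 10.0)
    c = random.uniform(0.0, 10.0)
    # Rightmost root of a t^2 - b t - c = 0, via the quadratic formula.
    root = (b + math.sqrt(b * b + 4.0 * a * c)) / (2.0 * a)
    # The coarse bound: root^2 <= (b^2 + 4ac) / a^2, since b <= sqrt(b^2 + 4ac).
    assert root ** 2 <= (b * b + 4.0 * a * c) / (a * a) + 1e-9
    # Past the root, the quadratic form is strictly positive.
    t = root + 1e-6
    assert a * t * t - b * t - c > 0.0
print("coarse root bound holds on 1000 random instances")
```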
With Equation \ref{eq:ub}, we can hence apply Lemma 4 in (Negahban et al., 2009) to obtain the desired result that
\[\begin{align} \| \thatlambda - \ts \|^2 \leq 9 \frac{\lambda_n^2}{\kl^2} \varPsi^2(\mocal) + \frac{\lambda_n}{\kl} \left( 2 \tl^2 (\ts) + 4 \rcal (\ts_{\mcal^{\perp}}) \right). \end{align}\]This concludes the proof.
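As a concrete instance of the subspace compatibility constant \(\varPsi\) used in the proof, consider the standard sparsity example from (Negahban et al., 2009): take \(\rcal = \| \cdot \|_1\) and let \(\mcal\) be the subspace of vectors supported on a set \(S\) with \(|S| = s\). Then

\[\begin{align} \varPsi(\mcal) = \sup_{u \in \mcal \setminus \{0\}} \frac{\| u \|_1}{\| u \|_2} = \sqrt{s}, \end{align}\]

since Cauchy-Schwarz gives \(\| u \|_1 \leq \sqrt{s} \, \| u \|_2\) for any \(u\) supported on \(s\) coordinates, with equality when all nonzero entries have equal magnitude. Substituting this into the bound above yields the familiar \(\lambda_n^2 s / \kl^2\) scaling for the Lasso.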
In the proof of Theorem 1, we saw how the bound is derived from the two key ingredients of the decomposability of the regularizer, and restricted strong convexity of the loss function. The decomposability of the regularizer allowed us to ensure that the error vector \(\hd\) will stay in the set \(\ctriplet\). This condition is then required in Lemma 4 of (Negahban et al., 2009), which allows us to bound \(\| \hd \|\) given that \(\fcal(\Delta) > 0\). In one of the steps where we were lower bounding \(\fcal(\Delta)\) in the proof, we made use of the properties of restricted strong convexity.
Theorem 1 provides a family of bounds for each decomposable regularizer under the choice of \((\mcal, \mocalp)\). The authors of (Negahban et al., 2009) used Theorem 1 both to rederive existing known results and to derive new ones on low-rank matrix estimation using the nuclear norm, minimax-optimal rates for noisy matrix completion, and noisy matrix decomposition. The reader is encouraged to refer to (Negahban et al., 2009) for more details on the large number of corollaries of Theorem 1.
I would like to thank my dear friend Josh Abrams for helping to review and provide valuable suggestions for this post!
Before I continue, let me warn readers that you are not allowed to enter the steam tunnels by yourself. Please sign up for an official tour by SLICE that will be led by a facilities engineer. From The Word Student Handbook:
Steam Tunnels
Because of the danger to all who enter them, the steam tunnels are locked and anyone found in the tunnels will be subject to serious disciplinary action and/or criminal action. The University Police are responsible for keeping the tunnels locked and apprehending anyone who trespasses in them.
We met at the fence, and all of us had to sign a waiver and don a helmet (the same kind of white helmet used by builders during Spring Carnival booth construction).
We proceeded to the basement of Margaret Morrison Hall, and the engineer guiding the expedition shared some history about how the Margaret Morrison basement was enhanced to be flood-resistant after a flooding incident a few decades ago caused water to also flood into the steam tunnels.
We were warned that the tunnels would be hot and claustrophobic, that we should not touch any pipes as they would be extremely hot, and not to poke at any asbestos, for obvious reasons. He then unlocked an unmarked door, and let us into the tunnels proper:
It was initially still quite cool near the door, but the temperature began rising as we went further in. Some sections of the pipes were hissing and you could really feel the warmth emanating from them. At some point I was slightly afraid that a pipe beside me might burst.
At some point, there was a fork where the tunnel on the left fork became very short and narrow. We took the right fork.
It was generally quite well-lit, until we were brought to a section of the tunnel where we had to ascend a rusty ladder to reach a cavern, which was unlit. It was known as the CFA cavern as it is located right under the steps of CFA:
Apparently at some point in the past, a very resourceful CFA student decided to make this their home. Not only was rent cheap (free!), but it was also very close to the CFA building! However, they were found by campus police and booted out.
I would not say that it was the most ideal living arrangement. There were plastic bottles strewn everywhere, and stalactites growing down from the ceiling. The air was very damp and musty, and would probably do something bad to your lungs if you stayed in there long enough. It was surprisingly much cooler than the steam tunnels right below it though.
We then went back down into the tunnels, and continued on:
As we got close to the end of the tunnels, we were each handed chalk that could be used to leave our mark in the tunnel. Since public vandalism is punishable by caning in my home country, of course I was not going to pass on this wonderful opportunity to defile the steam tunnels to my heart’s content:
We then emerged from an exit deep inside Doherty Hall, which was nice as it was beginning to get rather uncomfortable and claustrophobic. Whew!
If you thought there were only 8 levels in Wean, then you will learn something new today. We took the freight elevator from the corner of Wean to floor PH (penthouse?), AKA Wean 9.
Wean 9 was essentially a huge storeroom for CMU FMS (Facility Management Services). There were all sorts of supplies and tools, and even spare doors for classrooms. It made me realize just how much maintenance it took to operate a campus.
We were finally brought to a door that led to the roof of Wean Hall, and had to adjust our eyes for a few seconds to the new blinding sunlight. It was beautiful!
Everyone got busy snapping photos, myself included. To my knowledge this is the only place on campus where you can take a side-by-side photo with the Hammerschlag radio tower:
There was some open space on the roof, and I thought it would be pretty cool if they opened an open-air cafe here. It has pretty nice panoramic views of the entire campus.
And that’s it for the tour!
I would like to thank SLICE for organizing this trip, my friend Justin Sun who also went on this little adventure with me for proofreading this post, and Joey Li for pointing out a mistake in the post, where I previously erroneously claimed that WRCT 88.3FM was also broadcast from Hammerschlag tower.
This class exceeded my expectations significantly. I found it especially meaningful and apt since this was my last systems class before I graduate, and the topics and discussions from class helped to unify all the systems concepts that I had learnt from previous classes into a nice package informed by common underlying principles: from distributed systems, to networking, databases, filesystems, operating systems, and even machine learning systems.
The first lecture went through 2 Wisdom Papers, which no one was expected to have read yet as it was the first class. You can refer to the slides here if you are curious.
The first paper, The Mythical Man-Month: Essays on Software Engineering, is a book by Turing Award winner Fred Brooks about his many observations and principles on software engineering, drawn from his own vast experience. What really brought it home to me was that several of his observations matched suspicions I had previously held myself, but had dismissed as artifacts of the way I approached things rather than universally applicable principles.
For instance, one of the principles is “Plan to Throw One Away”, meaning that one should first build a worthwhile system in a short amount of time, and then re-build a better second version with the benefit of hindsight. This is because one would end up having to re-build the system anyway after being confronted with change and feedback, and also due to the following observation on program maintenance:
“Program maintenance is an entropy-increasing process, and even its most skillful execution only delays the subsidence of the system into unfixable obsolescence”
This had many parallels with my own experiences. For instance, my group ended up having 4 major re-writes of our kernel during 15-410, and I also did a complete re-write of my CloudFS filesystem for my 18-746 project. Similarly, many of my internship projects were also re-writes and improvements on design of existing systems that had accumulated too much technical debt. It does seem a lot more reasonable to plan for this eventual change to begin with.
The paper also contained a lot of other great advice, such as the importance of conceptual integrity to separate architecture from implementation, structuring a team in a “surgical” fashion where the best programmer leads the most critical development work like a surgeon and directs the others on the remaining aspects, and of course the famous Brooks’s law:
“Adding manpower to a late software project makes it later”
The second paper, You and Your Research by Richard Hamming (of Hamming code fame), talks about how to become a great scientist. The following two slides give a good sense of the spirit of the paper:
I mention the first lecture and the two papers that were discussed here not simply because they were interesting, but because they helped to set the tone and expectations for the rest of the semester going forward. The message is clear: this is going to be a practical and useful class that will help you on your journey to becoming great systems designers and researchers.
The class took us on a whirlwind tour through many SIGOPS Hall of Fame papers, which the award description states was “instituted in 2005 to recognize the most influential Operating Systems papers that were published at least ten years in the past”. Reading through the papers helped to consolidate a lot of the knowledge that I learned in previous systems classes, and it was cool to see how decades ago many of these ideas that were once unappreciated or heavily criticized now form the bedrock of many of the systems that we use today.
In addition to the Hall of Fame papers, there were also several relatively recent papers that the course staff thought were conceptually interesting and promising.
The following sections will go through each of the modules and the required papers that you will read (refer to the course website if you are also interested in the optional papers), and a short description of what the paper is about so you can get a pretty good sense of what is covered. A cool thing to note is that the scope of all the papers will touch almost all the systems classes offered at CMU.
ext* filesystems
Here are my thoughts on the key takeaways from the class.
As a seminar-based class, one of the most surprising things for me was how fun and valuable the class discussions were. It was especially enlightening to hear the comments of Ph.D. students who are working in systems and other fields in computer science, who often had very different critiques and opinions of the papers than what I had come up with, which often led me to wonder how they got their perspectives and what their background is like. This was particularly true when someone mentioned glaring deficiencies and problems with the paper that I had completely not even thought of.
However, one thing that made me sad was that attendance started to fall after the halfway point of the semester. This included quite a few of the students who used to give very insightful and interesting responses, and so the diversity of perspectives in the discussions suffered as a whole.
While attendance is not strictly enforced, actively participating in the discussions and being engaged in lectures is one of the most valuable takeaways from this class, and positively impacts not just you but also your classmates, and so I would strongly encourage anyone interested in the class to attend all the lectures that you can.
Another aspect of the class that I really appreciated was how Phil taught us a lot of the spirit and tribal knowledge of doing CS research during his lively lectures. These were often presented as off-hand remarks while presenting the context or background of a paper, and provided insight into the zeitgeist of the time, the motivations and challenges that the paper authors faced, and what the authors went on to do in the future based on the impact (or lack of impact at that time) of their work.
As someone who has not done a long-term research project with a faculty member but am thinking about possibly doing a Ph.D. in the future, all of these were very valuable wisdom which are not things that you can pick up easily yourself from reading past papers or books. In fact, it almost felt as if I had my own advisor at times.
As you read through the papers, you almost feel as if you are being put into the driver’s seat and can see how systems research has matured and evolved over the past few decades. Seminal papers of the past tackled the most general problems, although many of them lacked implementations or proper benchmarks, which today would surely be grounds for red flags and rejection from any systems conference. Many of the more recent papers strive to anticipate and build for future changes in the computing landscape, have solid replicable implementations and evaluations, and are a lot more careful about anticipating and providing rebuttals for criticisms.
I also really appreciated the personal attention that Phil and Val gave to us by meeting with us every other week for our course projects. This is especially impressive if you consider that many advisors already have trouble meeting their own Ph.D. students for an hour a week, whereas here the course staff dedicated half an hour every two weeks to every single group in the class (there were around 10), which I thought was some real dedication. I will admit that I did not live up to my end of the bargain by spending as much time on the project as I would have wanted to (compared to when I took 15-410). One could always give excuses for anything, so you don’t have to listen to mine, but if I had to reflect on it, it was due to a combination of factors: high workloads from other classes, the fact that the project was not the highest priority for the members of our group (my project partners were both quite busy with their own research), and some unexpected obstacles that forced us back to the drawing board a few times (our interim report was drastically different from our initial proposal).
Phil is a really good lecturer. He is very clear, the class pacing is great, and the lecture slides are polished. He is very approachable and respectful towards students, and puts in great effort to give a good and satisfying answer to every question.
Feedback for projects is prompt (there was no feedback for the paper summaries), and the midterms were graded fairly quickly.
Overall it is clear that the class is pedagogically mature and has benefited from many rounds of feedback during past iterations. It is rich in content, is accessible and yet challenging to students from a wide range of backgrounds, and will prepare one well for building systems in the future, be it in academia or industry.
There are three main components to the class: paper summaries, projects, and exams.
Before each lecture, the class is assigned a required reading and an optional reading. A paper summary of the required reading must be submitted before the class, which will discuss both readings.
The paper summary will contain 3 things:
It took me on average 2-4 hours to read each paper and around 15 minutes for the summary.
The lectures for this class are front-loaded, meaning that during the first two-thirds of the semester, you will meet 3 times a week for 80 minutes each, while there will be no lectures at all during the final third of the semester, and so “on average” throughout the semester you will meet twice a week. This is so that students have enough knowledge and content to begin working on their course projects early on in the semester.
There will be 3 short breaks in each lecture, where all students will get into breakout groups and share and discuss among themselves one of the prompts for the paper based on their paper summaries. Afterwards, all groups are invited to share what they thought.
Reading and writing the paper summaries are the only “homework” you will get in this class.
There is also a semester-long course project with a significant systems component in groups of three. This will begin in earnest after a third of the semester, and all the project groups met with Phil and the TA Val once every two weeks. The deliverables include a project proposal, an interim report, a final presentation, and a final report. The course project will be the largest constituent of your final grade.
Finally, there are two midterm exams, which are taken during class time. The first is taken in the middle of the semester, and the second is taken after all lectures have concluded.
Each midterm will cover content from a shortlisted selection of 10 of the required readings. There will be 9 questions on the midterm, which cover 9 of the 10 papers, and you are only required to answer 7 of them.
The course staff will also provide two past year exams to practice on, though some of the readings may have changed since.
It admittedly does seem quite daunting to have to study and be familiar with 10 papers spanning very different topics. I did not have time to actually re-read all 10 papers to prepare for the midterm, and so the way I prepared was to go through all the lecture slides again, re-read the most important sections of the paper, and skim through the rest. Afterward, I attempted the past exams to fill in any gaps that I may have missed. This strategy allowed me to do fairly well on the exam.
The class has a moderate workload for a systems class. Expect to spend 10-12 hours a week on the readings and paper summaries while lectures are ongoing, probably a couple more hours once the projects get into motion midway through the semester, and for it to consume a significant portion of your existence in the last two weeks before the final presentations.
It is a far less demanding and stressful class than the legendary 15-410/605 Operating System Design and Implementation class, so don’t let the “advanced” in the course title scare you off from taking this class. After all, most people taking this class are Ph.D. students who have their own research to work on and can’t exactly spend all their time on courses, unlike undergraduates.
Our course project was on the automated optimal scheduling of data in dynamic neural networks over heterogeneous GPUs for inference tasks in a pipeline-parallelism fashion. This means that when a model is too large to fit on a single GPU and instead has to be distributed across multiple GPUs, we aim to find the optimal way to perform this split in the presence of dynamism in the network. In our case we focused on input dynamism, meaning that the sizes of the inputs can vary, which can result in different execution times in different segments of the network. We built a system called DynPartition, a reinforcement-learning-based scheduler that uses Deep Q-Learning to learn the optimal way of performing this split.
We had some positive empirical results on our benchmarks, but will require additional future work to verify the generality of these results. Overall, I thought it was a great experience working with PhD students and to learn from their working styles and approach to solving problems. It was also really cool to see the breadth and depth of projects presented by the other teams during the final presentation, which was structured like a conference.
I cannot recommend this course enough to anyone who has sufficient background and an interest in building systems or in systems research.
You should be sufficiently prepared for the class if you have taken 15-410 Operating System Design and Implementation, or any other equivalent rigorous operating systems design class in your undergraduate college. Most papers draw heavily on low-level concepts from operating systems and assume that the reader is familiar with them, and therefore familiarity with these ideas is critical to understanding the papers.
I don’t feel any of the other classes are as critical, as any new concepts can be picked up relatively easily. For instance, a good grasp of considerations involved in operating systems design means that it’s not too hard to also understand the challenges involved in filesystems or virtual machine design. Having taken other classes would definitely still help to make the papers more approachable though. For instance, the Pollux paper was not very approachable for people who did not have prior exposure to machine learning systems, which led to the course staff deciding not to include that as one of the papers tested for the second midterm.
When I took the class, all the students were either Masters or Ph.D. students. Strong undergraduates with sufficient background would also definitely do well in the class.
I had to take a systems class this semester to fulfill my graduation requirements for the MSCS program. I initially did include this class in my shortlist of systems classes to take, but then thought it was just going to be a paper-reading class (not that I had taken such a class before, but it did not sound very interesting and felt like something I could do by myself asynchronously after I graduate), and therefore I was quite hesitant to take it.
As such, during registration week I settled on 15-618 Parallel Computer Architecture and Programming, since it included topics on GPU programming that aligned with my current interests in machine learning. However, I did not feel like the class was sufficiently challenging for me after the first lecture, as it was a bit too slow-paced and simple for my liking as I already had exposure to most of the topics from other system classes that I had taken. I decided to switch to 15-712, and I knew immediately that it was the right class for me after the first lecture.
In a sense, this class was a hidden gem and I was really glad that I ended up taking it.
I would like to express my gratitude to Albert Gao, who took the class with me this semester, for helping to proofread this article.
There has recently been a flurry of work in score-based diffusion models as part of the broader area of generative models. This is due to the recent success of such score-based methods, which have achieved results comparable to the state of the art in generative adversarial networks (GANs).
Past techniques in generative modeling have either relied on the approximation of the partition function of the probability density, or the combination of an implicit network representation of the probability density and adversarial training. The former suffers from having to either constrain the model to make the partition function tractable, or otherwise relies on approximations with surrogate losses that may be inaccurate, and the latter suffers from training instability and mode collapse.
Score-based diffusion models try to address the cons of both approaches, and instead, use score-matching to learn a model of the gradient of the log of the probability density function. This allows it to avoid computing the partition function completely.
One of the first approaches to use score matching for generative modeling generates new samples via Langevin dynamics (Song & Ermon, 2019). A key observation is that a naively trained score model will be inaccurate in regions of low density under the data distribution, which corrupts the Langevin dynamics in those low-density areas. The proposed solution is to inject noise into the data, which provides additional training signal by expanding the support of the perturbed data distribution.
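To make the sampling procedure concrete, here is a minimal pure-Python sketch of (unadjusted) Langevin dynamics for a target whose score is known in closed form, a standard 1D Gaussian with \(\nabla_x \log p(x) = -x\). The step size, iteration count, and seed are illustrative choices; in a score-based generative model, a trained network would replace the closed-form score:

```python
import math
import random

def langevin_sample(score, x0, step=0.01, n_steps=2000):
    """Unadjusted Langevin dynamics: x <- x + (step / 2) * score(x) + sqrt(step) * z."""
    x = x0
    for _ in range(n_steps):
        z = random.gauss(0.0, 1.0)
        x = x + 0.5 * step * score(x) + math.sqrt(step) * z
    return x

random.seed(0)
# Target: standard normal, whose score is d/dx log p(x) = -x.
samples = [langevin_sample(lambda x: -x, x0=5.0) for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))  # should land near 0 and 1
```

The noise-injection fix discussed above addresses exactly the failure mode this toy example sidesteps: here the score is exact everywhere, whereas a learned score would be unreliable far from the data.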
The next major step, introduced in (Song et al., 2021), is to perturb the data using a diffusion process, which takes the form of a stochastic differential equation (SDE). The SDE is then reversed using annealed Langevin dynamics in order to recover the generative process, where the reversal makes use of score matching.
Other recent refinements that have been proposed include re-casting the objective as a Schrödinger bridge problem, which is an entropy-regularized optimal transport problem. The advantage of this approach is that it allows for fewer diffusion steps to be taken during the generative process.
We will be primarily focusing on the paper Generative Modeling by Estimating Gradients of the Data Distribution (Song & Ermon, 2019).
In this section, we provide the necessary background, provide derivations for important results, and explain the key ideas of score matching for diffusion models as proposed in the papers.
Score matching is motivated by the limitations of likelihood-based methods. In likelihood-based methods, we use a parameterized model \(f_\theta(\bx) \in \mathbb{R}\) and attempt to recover the parameters \(\theta\) that best explain the observed data. For instance, in energy-based models, the probability density function \(p_\theta(\bx)\) would be given as \begin{align} p_\theta(\bx) = \frac{\exp(-f_\theta(\bx))}{Z_\theta}, \end{align} where \(Z_\theta\) is the normalizing constant that causes the distribution to integrate to 1, i.e. \begin{align} Z_\theta = \int \exp(-f_\theta(\bx)) \, d \bx. \end{align} The goal then is to maximize the log-likelihood of the observed data \(\{\bx_i\}_{i=1}^N\), given by \begin{align} \max_\theta \sum_{i=1}^N \log p_\theta (\bx_i). \end{align}
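As a toy illustration of this setup (a hypothetical 1D energy-based model, not from the papers): take \(f_\theta(x) = (x - \theta)^2 / 2\), so that \(p_\theta\) is a unit-variance Gaussian centered at \(\theta\). In one dimension, \(Z_\theta\) can be computed by simple quadrature, which is exactly the step that becomes intractable in high dimensions:

```python
import math

def f(x, theta):
    """Hypothetical energy: f_theta(x) = (x - theta)^2 / 2, so p_theta is N(theta, 1)."""
    return (x - theta) ** 2 / 2.0

def partition(theta, lo=-10.0, hi=10.0, n=2000):
    """Z_theta = integral of exp(-f_theta(x)) dx, via midpoint quadrature.
    Easy in 1D; the number of grid points needed grows exponentially with dimension."""
    h = (hi - lo) / n
    return h * sum(math.exp(-f(lo + (i + 0.5) * h, theta)) for i in range(n))

def log_likelihood(theta, data):
    log_z = math.log(partition(theta))
    return sum(-f(x, theta) - log_z for x in data)

data = [1.8, 2.2, 1.9, 2.1, 2.0]
# Grid-search the MLE; for this Gaussian family it coincides with the sample mean, 2.0.
grid = [i / 100.0 for i in range(100, 301)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, data))
print(theta_hat)
```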
It is often computationally intractable to compute the partition function \(Z_\theta\) unless there are restrictions on what the model can be, since there are usually at least an exponential number of possible configurations. Examples of models where the partition function can be efficiently computed include causal convolutions in autoregressive models, and invertible networks in normalizing flow models. However, such architecture restrictions are very undesirable as they limit the expressiveness of the models.
A likelihood-based approach that tries to avoid computing the partition function is variational inference. In variational inference, we use the Evidence Lower Bound (ELBO) as a surrogate objective, where the approximation error is the smallest Kullback-Leibler divergence between the true distribution and a distribution that can be parameterized by our model.
Adversarial-based approaches, like generative adversarial networks (GANs), have been shown to suffer from both instability in training and mode collapse.
Training GANs can be viewed as finding a Nash equilibrium for a two-player non-cooperative game between the discriminator and the generator. Finding a Nash equilibrium is PPAD-complete and believed to be computationally intractable, so methods like gradient-based optimization techniques are used instead. However, the highly non-convex and high-dimensional optimization landscape means that small perturbations in the parameters of either player can change the cost function of the other player, which results in non-convergence.
Another problem with training GANs is that when either the generator or discriminator becomes significantly better than the other, then the learning signal for the other player becomes very weak. For generators, this is when the discriminator is always able to tell it apart. For discriminators, this is when the generator performs so well it can hardly do better than random guessing.
Finally, a common failure mode of GANs is mode collapse, where the generator only learns to produce a set of very similar outputs from a single mode instead of from all the modes. This is due to the non-convexity of the optimization landscape.
Score matching is a non-likelihood-based method to perform sampling from an unknown data distribution, and seeks to address many of the limitations of likelihood-based methods and adversarial methods. This is achieved by learning the score of the probability density function, defined as the gradient of the log density, \(\nabla_\bx \log p(\bx)\).
In practice, we try to learn the score function using a neural network \(\stx\) parameterized by \(\theta\).
The objective of score matching is to minimize the Fisher Divergence between the score function and the score network:
\[\begin{align} \label{eq:score-matching-target-fisher-div} \argmin_\theta \frac{1}{2} \mathbb{E}_{\pdata} \left[ \| \stx - \nabla_\bx \log \pdata \|_2^2 \right]. \end{align}\]However, the main problem here is that we do not know \(\nabla_\bx \log \pdata\), since it depends on knowing what \(\pdata\) is.
(Hyvärinen, 2005) showed that Equation \ref{eq:score-matching-target-fisher-div} is equivalent to Equation \ref{eq:score-matching-target} below:
\[\begin{align} \label{eq:score-matching-target} \argmin_\theta \mathbb{E}_{\pdata} \left[ \tr \left( \nabla_\bx \stx \right) + \frac{1}{2} \| \stx \|_2^2 \right]. \end{align}\]We can now estimate this using Monte Carlo methods by sampling from \(\pdata\), since it only depends on knowing \(\stx\).
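As a minimal worked example of this objective (a hypothetical 1D setup, not from the papers): fit a linear score model \(s_\theta(x) = -\theta x\) to samples from \(\mathcal{N}(0, \sigma^2)\). In 1D the trace term is just the scalar derivative \(\partial_x s_\theta(x) = -\theta\), so the Monte Carlo objective has a closed form, and its minimizer should approach \(1/\sigma^2\), recovering the true score \(-x/\sigma^2\):

```python
import random

random.seed(0)
sigma = 2.0
data = [random.gauss(0.0, sigma) for _ in range(20000)]

# Score model s_theta(x) = -theta * x, with derivative d/dx s_theta(x) = -theta.
# Monte Carlo score-matching objective: J(theta) = E[ -theta + 0.5 * (theta * x)^2 ]
#                                                = -theta + 0.5 * theta^2 * E[x^2].
m2 = sum(x * x for x in data) / len(data)  # empirical E[x^2], approx. sigma^2 = 4

def J(theta):
    return -theta + 0.5 * theta ** 2 * m2

grid = [i / 1000.0 for i in range(1, 1000)]
theta_hat = min(grid, key=J)
print(round(theta_hat, 2))  # should be close to 1 / sigma^2 = 0.25
```

Here the derivative term is available analytically; for a neural network \(\stx\), forming \(\tr(\nabla_\bx \stx)\) requires automatic differentiation, which is the cost that motivates the sliced variant.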
It is computationally difficult to compute the trace term \(\tr \left( \nabla_\bx \stx \right)\) in Equation \ref{eq:score-matching-target} when \(\bx\) is high-dimensional. This motivates another alternative cheaper approach for score matching, called sliced score matching (Song et al., 2019).
In sliced score matching, we sample random vectors from some distribution \(\pv\) (such as the multivariate standard Gaussian) in order to optimize an analog of the Fisher Divergence:
\[\begin{align} L(\btheta; \pv) = \frac{1}{2} \mathbb{E}_{\pv} \mathbb{E}_{\pdata} \left[ (\bv^T \stx - \bv^T \sdx)^2 \right]. \end{align}\]We observe that
\[\begin{align} L(\btheta; \pv) &= \frac{1}{2} \mathbb{E}_{\pv} \mathbb{E}_{\pdata} \left[ (\bv^T \stx - \bv^T \sdx)^2 \right]\\ &=\frac{1}{2} \mathbb{E}_{\pv} \mathbb{E}_{\pdata} \left[ (\bv^T \stx )^2 + (\bv^T \sdx)^2 - 2(\bv^T \stx )(\bv^T \sdx) \right]\\ &= \mathbb{E}_{\pv} \mathbb{E}_{\pdata} \left[ \frac{1}{2}(\bv^T \stx )^2 - (\bv^T \stx )(\bv^T \sdx) \right] + C\\ \end{align}\]where the \(\sdx\) term is absorbed into \(C\) as it doesn’t depend on \(\theta\). Now note
\[\begin{align} & -\mathbb{E}_{\pv} \mathbb{E}_{\pdata}\left[(\bv^T \stx )(\bv^T \sdx) \right] \\ =& -\mathbb{E}_{\pv} \left[\int(\bv^T \stx )(\bv^T \sdx) \pdata \, d\bx\right]\\ =& -\mathbb{E}_{\pv} \left[\int(\bv^T \stx )(\bv^T\nabla_{\bx}\log \pdata)\pdata \, d\bx\right] \\ =& -\mathbb{E}_{\pv} \left[\int(\bv^T \stx )(\bv^T\nabla_{\bx}\pdata)\, d\bx\right] \\ =& -\mathbb{E}_{\pv} \left[\sum_{i}\int(\bv^T \stx )\left(v_i\frac{\partial \pdata}{\partial x_i}\right)d\bx\right] \\ =& \mathbb{E}_{\pv} \left[\int \bv^T \nabla_\bx \stx \bv \cdot \pdata \, d\bx\right] \\ =& \mathbb{E}_{\pv}\mathbb{E}_{\pdata}\left[\bv^T \nabla_\bx \stx \bv \right] \end{align}\]where the second-to-last equality follows from multivariate integration by parts. This finally yields the equivalent objective:
\[\begin{align} J(\btheta; \pv) &= \mathbb{E}_{\pv} \mathbb{E}_{\pdata} \left[ \bv^T \nabla_\bx \stx \bv + \frac{1}{2} \| \stx \|_2^2 \right] \end{align}\]which no longer depends on the unknown data score \(\sdx\). This leads to the unbiased estimator:
\[\begin{align} \hat J_{N,M}(\btheta; \pv) &=\frac{1}{N}\frac{1}{M}\sum_{i= 1}^N\sum_{j=1}^M \left[\bv_{ij}^T\nabla_{\bx}\mathbf{s}_\mathbf{\btheta}(\bx_i)\bv_{ij} + \frac{1}{2} \|\mathbf{s}_\mathbf{\btheta}(\bx_i)\|_2^2\right] \end{align}\]where for each data point \(\bx_i\) we draw \(M\) projection vectors from \(\pv\).
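To make the estimator concrete, here is a minimal NumPy sketch (not the authors' code) that evaluates \(\hat J_{N,M}\) for a *linear* score model \(\bs_\theta(\bx) = W\bx\), whose Jacobian is simply \(W\), so the term \(\bv^T \nabla_\bx \bs_\theta(\bx) \bv\) is available in closed form. In practice \(\bs_\theta\) is a neural network and this term is computed as a Hessian-vector product via autodiff.

```python
import numpy as np

def sliced_score_matching_loss(W, xs, vs):
    """J_hat over N data points and M projection vectors per point.

    W  : (d, d) weight matrix of the linear score model s(x) = W @ x
    xs : (N, d) data points
    vs : (N, M, d) projection vectors drawn from p_v
    """
    N, M, _ = vs.shape
    total = 0.0
    for i in range(N):
        s = W @ xs[i]                     # s_theta(x_i)
        for j in range(M):
            v = vs[i, j]
            # v^T (grad_x s) v + (1/2) ||s||^2; the Jacobian of W @ x is W
            total += v @ W @ v + 0.5 * s @ s
    return total / (N * M)

rng = np.random.default_rng(0)
d, N, M = 2, 4, 3
W = -np.eye(d)                 # score of a standard Gaussian: s(x) = -x
xs = rng.normal(size=(N, d))
vs = rng.normal(size=(N, M, d))
loss = sliced_score_matching_loss(W, xs, vs)
```

With \(W = -I\) (the exact score of a standard Gaussian), the per-term contribution reduces to \(-\|\bv\|^2 + \frac{1}{2}\|\bx\|^2\), which makes the estimator easy to verify by hand.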
(Song et al., 2019) showed that under some regularity conditions, sliced score matching is an asymptotically consistent estimator:
\[\begin{align} \hat \btheta_{N,M} \overset{p}{\to} \btheta^* \text{ as } N \to \infty \end{align}\]where
\[\begin{align} \btheta^* &= \underset{\btheta}{\text{argmin }} J(\btheta; \pv), \\ \hat \btheta_{N,M} &= \underset{\btheta}{\text{argmin }} \hat J_{N,M}(\btheta; \pv). \end{align}\]Sliced score matching is computationally more efficient, since it now only involves Hessian-vector products, and continues to work well in high dimensions.
Once we have trained a score network, we can sample from the data distribution via Langevin dynamics. Langevin dynamics is a Markov chain Monte Carlo method for sampling from a stationary distribution that only requires the gradient of the log-probability of the samples \(\bx\). We satisfy this requirement, since the trained score network approximates exactly this gradient.
In Langevin dynamics, we start from some initial point \(\bx_0 \sim \bpi(\bx)\) sampled from some prior distribution \(\bpi\), and then iteratively obtain updated points based on the following recurrence: \begin{align} \xt_t = \xt_{t-1} + \frac{\epsilon}{2} \nabla_\bx \log p(\xt_{t-1}) + \sqrt{\epsilon} \bz_t, \end{align} where \(\bz_t \sim \mathcal{N}(0, I)\). The addition of the Gaussian noise is required; without it, the process simply converges to the nearest mode instead of to the stationary distribution.
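As a sanity check of this recurrence, the following sketch (a toy example, not from the paper) runs Langevin dynamics on a 1-D standard Gaussian, whose score \(\nabla_x \log p(x) = -x\) is known in closed form. The step size and iteration count are illustrative; starting from a diffuse uniform prior, the samples should converge to approximately \(\mathcal{N}(0, 1)\).

```python
import numpy as np

def langevin_sample(score, x0, eps, T, rng):
    """Run T steps of x_t = x_{t-1} + (eps/2) * score(x_{t-1}) + sqrt(eps) * z_t."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        z = rng.normal(size=x.shape)
        x = x + 0.5 * eps * score(x) + np.sqrt(eps) * z
    return x

rng = np.random.default_rng(42)
x0 = rng.uniform(-5, 5, size=5000)   # initial points from a diffuse prior pi
samples = langevin_sample(lambda x: -x, x0, eps=0.05, T=1000, rng=rng)
# samples.mean() should be near 0 and samples.std() near 1
```

Note the small discretization bias: for a finite step size \(\epsilon\), the stationary distribution deviates slightly from the target, which is why the convergence result below requires \(\epsilon \to 0\).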
It can be shown that as \(\epsilon \to 0\) and \(T \to \infty\), we have that the distribution of the process \(\xt_T\) converges to \(\pdata\) (Welling & Teh, 2011).
Langevin dynamics does not perform well with multi-modal distributions with poor conductance, since it will tend to stay in a single mode, which causes long mixing times. This is particularly a problem when the modes have disjoint supports, since there is very weak gradient information in the region where there is no support.
The manifold hypothesis postulates that real-world data often lies in a low-dimensional manifold embedded in a high-dimensional space. This has been empirically observed in many datasets.
This poses problems for score matching. The first problem that the manifold hypothesis poses is that the score \(\score\) becomes undefined if \(\bx\) actually lies in a low-dimensional manifold. The second problem is that the estimator in Equation \ref{eq:score-matching-target} is only consistent when the support of \(\pdata\) is the whole space.
To make the support of the data distribution match the ambient space, (Song & Ermon, 2019) proposed injecting small amounts of Gaussian noise into the data, so that the perturbed distribution has full support. As long as the perturbation is sufficiently small (\(\mathcal{N}(0, 0.0001)\) was used in their paper), the perturbed data are almost indistinguishable from the original to humans.
The other problem with score matching is that it may not be able to learn the score function accurately in regions of low data density. This is due to the lack of samples drawn from these regions, causing the Monte Carlo estimate to have high variance there.
The challenges mentioned in the previous sections are addressed by Noise Conditional Score Networks (NCSN).
In NCSN, we define a geometric sequence of \(L\) noise levels \({\left\{ \sigma_i \right\}}_{i=1}^L\), with the property that \(\frac{\sigma_1}{\sigma_2} = \dots = \frac{\sigma_{L-1}}{\sigma_L} > 1\). Each noise level corresponds to Gaussian noise added to perturb the data distribution, i.e. \(q_{\sigma_i}\) is the distribution of \(\bx + \mathcal{N}(0, \sigma_i^2 I)\) for \(\bx \sim \pdata\).
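For concreteness, such a geometric schedule can be built with NumPy's `geomspace`; the endpoints \(\sigma_1 = 1\) down to \(\sigma_L = 0.01\) over \(L = 10\) levels below are the values reported in Song & Ermon (2019)'s image experiments.

```python
import numpy as np

# Geometric sequence of L noise levels: constant ratio sigma_i / sigma_{i+1} > 1.
L = 10
sigmas = np.geomspace(1.0, 0.01, num=L)   # sigma_1 > sigma_2 > ... > sigma_L

ratios = sigmas[:-1] / sigmas[1:]         # all equal by construction, and > 1
```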
We augment the score network to also take the noise level \(\sigma\) as input, giving the NCSN \(\stxt\). The goal of NCSN is then to estimate the score conditioned on the noise level. Once we have a trained NCSN, we use an approach similar to simulated annealing in Langevin sampling: we begin with a large noise level in order to cross between the different modes easily, before gradually annealing the noise down to achieve convergence.
The denoising score matching objective for each noise level \(\sigma_i\) is given as
\[\begin{align} \ell(\theta; \sigma) \triangleq \frac{1}{2} \mathbb{E}_{\pdata} \mathbb{E}_{\xt \sim \mathcal{N}(\bx, \sigma^2 I)} \left[ \left\| \stxt + \frac{\xt - \bx}{\sigma^2} \right\|_2^2 \right], \end{align}\]and the unified objective for denoising across all levels is given as
\[\begin{align} \mathcal{L}\left(\theta; \left\{ \sigma_i\right\}_{i=1}^L \right) \triangleq \frac{1}{L} \sum_{i=1}^L \lambda(\sigma_i) \ell(\theta; \sigma_i). \end{align}\]We can extend the idea of a finite number of noise scales to a continuum of noise scales by modeling the process as a diffusion process, which can be formalized as a stochastic differential equation (SDE) of the following form:
\[\begin{align} d\bx = \boldf(\bx, t) \, dt + g(t) \, d\bw. \end{align}\]Here, \(\boldf\) is the drift coefficient, which models the deterministic part of the SDE and determines the rate at which the process \(\bx(t)\) is expected to change over time on average. \(g(t)\) is the diffusion coefficient, which represents the random part of the SDE and determines the magnitude of the noise added over time. Finally, \(\bw\) is Brownian motion, so \(g(t) \, d\bw\) represents the noising process.
We want our diffusion process to be such that \(\bx(0) \sim p_0\) is the original data distribution, and \(\bx(T) \sim p_T\) is the Gaussian noise distribution that is independent of \(p_0\). Then since every SDE has a corresponding reverse SDE, we can start from the final noise distribution and run the reverse-time SDE in order to recover a sample from \(p_0\), given by the following process:
\[\begin{align} d \bx = [\boldf (\bx, t) - g(t)^2 \nabla_{\bx} \log p_t (\bx) ] \, dt + g(t) \,d \overline{\bw}, \end{align}\]where \(\overline{\bw}\) is Brownian motion that flows backwards in time from \(T\) to \(0\), and \(dt\) is an infinitesimal negative timestep.
The objective function for score matching for the SDE is then given by
\[\begin{align} \argmin_{\theta} \mathbb{E}_t \left[ \lambda (t) \mathbb{E}_{\bx(0)} \mathbb{E}_{\bx (t) \mid \bx(0)} \left[ \| \bs_\theta (\bx(t), t) - \nabla_{\bx(t)} \log p_{0t}(\bx (t) \mid \bx(0)) \|_2^2 \right] \right]. \end{align}\](Song et al., 2021) covers two score-based generative models that use SDEs to perform generative modeling. The first is called score matching with Langevin dynamics (SMLD), which performs score estimation at different noise scales and then performs sampling using Langevin dynamics with decreasing noise scales. The second is denoising diffusion probabilistic modeling (DDPM)
(Ho et al., 2020), which uses a parameterized Markov chain that is trained with a re-weighted variant of the evidence lower bound (ELBO), which is an instance of variational inference. The Markov chain is trained to reverse the noise diffusion process, which then allows sampling from the chain using standard Markov Chain Monte Carlo techniques.
(Song et al., 2021) shows that SMLD and DDPM actually correspond to discretizations of the Variance Exploding (VE) and Variance Preserving (VP) SDEs respectively, which are the focus of the next two sections. We believe expanding on this will be illuminating, as it highlights the connections between SDEs and the discretized approaches that are used in practice.
Recall that we use a geometric sequence of \(L\) noise levels \({\left\{ \sigma_i \right\}}_{i=1}^L\) that is added to the data distribution.
We can recursively define the distribution for each noise level \(i\) by incrementally adding noise:
\[\begin{align} \bx_i = \bx_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2} \bz_{i-1}, \qquad \qquad i = 1, \dots, L, \end{align}\]where \(\bz_{i-1} \sim \mathcal{N}(\mathbf{0}, \bI)\), and \(\sigma_0 = 0\) so \(\bx_0 \sim \pdata\).
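Since the incremental variances telescope, the marginal of \(\bx_i\) is \(\bx_0\) plus \(\mathcal{N}(\mathbf{0}, \sigma_i^2 \bI)\) noise. This toy NumPy check (the schedule values are illustrative) verifies the telescoping numerically:

```python
import numpy as np

# Simulate x_i = x_{i-1} + sqrt(sigma_i^2 - sigma_{i-1}^2) z_{i-1} with
# sigma_0 = 0; the incremental variances telescope, so the final standard
# deviation should be sigma_L.
rng = np.random.default_rng(0)
sigmas = np.concatenate(([0.0], np.geomspace(0.01, 1.0, 10)))  # increasing, sigma_0 = 0
x = np.zeros(200_000)                  # data concentrated at x_0 = 0
for i in range(1, len(sigmas)):
    x = x + np.sqrt(sigmas[i] ** 2 - sigmas[i - 1] ** 2) * rng.normal(size=x.shape)
final_std = x.std()                    # should be close to sigma_L = 1.0
```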
If we view the noise levels as gradually changing in time, then the continuous-time limit of the process is given by the following: \begin{align} \bx(t + \Delta t) = \bx(t) + \sqrt{\sigma^2 (t + \Delta t ) - \sigma^2 (t)} \bz(t) \approx \bx(t) + \sqrt{\frac{d [\sigma^2 (t)]}{dt} \Delta t } \bz (t), \end{align} where the approximation holds when \(\Delta t \ll 1\). If we take \(\Delta t \to 0\), we recover the VE SDE: \begin{align} d \bx = \sqrt{\frac{d [\sigma^2 (t)]}{dt} } d \bw, \end{align} which causes the variance of \(\bx(t)\) to go to infinity as \(t \to \infty\) due to the geometric growth of \(\sigma(t)\), hence its name.
Similarly, the Markov chain of the perturbation kernel of DDPM is given by \begin{align} \bx_i = \sqrt{1 - \beta_i} \bx_{i-1} + \sqrt{\beta_i} \bz_{i-1}, \qquad i = 1, \cdots, L, \end{align} where \(\left\{ \beta_i \right\}_{i=1}^L\) are the noise scales. Defining the rescaled noise scales \(\overline{\beta}_i = L \beta_i\), we can rewrite this as \begin{align} \bx_i = \sqrt{1 - \frac{\overline{\beta}_i}{L} } \bx_{i-1} + \sqrt{ \frac{\overline{\beta}_i}{L} } \bz_{i-1}, \qquad i = 1, \cdots, L. \end{align} Taking the limit \(L \to \infty\), we get \begin{align} \bx(t + \Delta t) \approx \bx(t) - \frac{1}{2} \beta(t) \Delta t \bx(t) + \sqrt{\beta(t) \Delta t} \bz(t), \end{align} where the approximation comes from the first-order Taylor expansion of \(\sqrt{1 - \beta(t + \Delta t) \Delta t}\). Then taking the limit \(\Delta t \to 0\), we obtain the VP SDE \begin{align} d \bx = - \frac{1}{2} \beta(t) \bx \, dt + \sqrt{\beta(t)} \, d \bw. \end{align} This process thus has bounded variance since \(\beta(t)\) is bounded.
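The variance-preserving property is easy to verify numerically: the variance recursion \(V_i = (1 - \beta_i) V_{i-1} + \beta_i\) has fixed point \(1\), so starting from data with unit variance, iterating the DDPM perturbation kernel keeps the marginal variance at \(1\). The linear \(\beta\) schedule below is illustrative, not taken from the papers.

```python
import numpy as np

# V_i = (1 - beta_i) V_{i-1} + beta_i, so V_i stays at 1 if V_0 = 1.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # illustrative linear schedule
x = rng.normal(size=100_000)            # "data" with unit variance
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)
variance = x.var()                       # stays close to 1
```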
We conduct the following preliminary series of experiments, based on released work by (Song & Ermon, 2019).
In this experiment, we plot the true data density of a toy distribution along with samples drawn in three ways. The i.i.d. samples are drawn directly from the underlying distribution, and we can see that more samples land in the areas of high data density. However, when applying Langevin dynamics without annealing, we see an almost equal number of points in the top-left and bottom-right corners, evidence that the sampling method does not conform to the true distribution. Finally, by injecting noise and decreasing its magnitude through the annealing process, we can recover a representative sample of the distribution.
To better visualize the effects of annealing when sampling via Langevin dynamics, we generated images from a model trained on the CelebA dataset. We first tried applying Langevin dynamics with a fixed noise level, and then used annealing to gradually decrease the noise.
Figure 2 shows that the results with annealing are significantly clearer and more varied, matching the performance of GANs in 2019.
We notice that the images generated without annealing manage to produce the structure of a human face but fail to capture finer details such as the hair and the surrounding backdrop. There is also little variation in color between different samples. This agrees with our theory that without annealing, Langevin dynamics cannot properly explore regions of lower data density.
We also investigated the effect of changing the lowest noise standard deviation \(\sigma\) while keeping the number of different noises injected fixed at \(10\). The 10 noise values are determined by an interpolation in log scale.
Our experiment shows that the starting value, ending value, and interval between noise values all have a significant effect on the convergence of annealed Langevin sampling.
Having completed a survey of score-based diffusion models, and having run some experiments on them, we now turn our attention to discussing the pros and cons of this approach.
As mentioned previously in this paper, the main draw of score-based diffusion models is that they have been shown to be capable of generating impressive high-quality samples on par with state-of-the-art GANs. We hence focus on their limitations and how they might be overcome, drawing from work in (Cao et al., 2022).
A common criticism of score-based diffusion models is their high computational cost in both training and sampling. This is because they require thousands of small diffusion steps to ensure that the forward and reverse SDEs hold in their approximations (Zheng et al., 2022). If the diffusion steps are too large, then the Gaussian noise assumption may not hold, resulting in poor score estimates. This makes them significantly more expensive than other generative methods like GANs and VAEs. To this end, several directions are being explored to reduce this cost.
The first technique seeks to reduce the number of sampling steps required by a method known as knowledge distillation (Lopes et al., 2017). In knowledge distillation, knowledge is transferred from a larger and more complex model (called the teacher), to one that is smaller and simpler (called the student). This technique has found success in other domains such as image classification, and has also been shown to result in improvements in diffusion models (Salimans & Ho, 2022). It would be interesting to see how far we can take this optimization.
Another technique is truncated diffusion probabilistic modeling (TDPM) (Zheng et al., 2022). In this approach, instead of running the diffusion process until it becomes pure noise, the process is stopped once it reaches a hidden noisy-data distribution that can be learnt by an auto-encoder via adversarial training. Then, to produce samples, a sample is first drawn from the learnt noisy-data distribution before being passed through the reverse-SDE diffusion steps.
Score-based diffusion models also suffer from poor explainability and interpretability, though this is a problem common to other generative models as well.
(Song et al., 2021) also notes that it is currently difficult to tune the myriad of hyperparameters introduced by the choice of noise levels and specific samplers chosen, and new methods to automatically select and tune these hyperparameters would make score-based diffusion models more easily deployable in practice.
Diffusion models have mostly seen applications in generating image data, and their potential for generating other data modalities has not been as thoroughly investigated. (Austin et al., 2021) introduces Discrete Denoising Diffusion Probabilistic Models (D3PMs), which develop a diffusion process for corrupting text data into noise. It would be interesting to see how far diffusion models can be stretched to perform compared to state-of-the-art transformer models in text generation.
Dimensionality reduction is another technique that can be used to speed up the training and sampling of diffusion models. Diffusion models are typically trained directly in data space. (Vahdat et al., 2021) instead proposes training them in latent space, which reduces the dimensionality of the learnt representation and also potentially increases the expressiveness of the framework. In a similar vein, (Zhang et al., 2022) argues that due to redundancy in spatial data, it is not necessary to learn in data space, and instead proposes a dimensionality-varying diffusion process (DVDP), where the dimensionality of the signal is dynamically adjusted during both the diffusion and denoising processes.
We showed that score matching presents a promising new direction for generative models, which avoids many of the limitations of other approaches such as training instability and mode collapse in GANs, and poor approximation guarantees in variational inference. While score matching has several flaws, such as suffering from the manifold hypothesis and requiring an expensive Langevin dynamics process in order to draw samples, successive work has done well in addressing these limitations to make score matching on diffusion models a viable contender to displace GANs as the state-of-the-art for generative modeling.
Our experiments in this blog post help to provide empirical context to the theoretical results we have derived. Most notably, we have shown how annealing is an essential part of sampling via Langevin dynamics.
Finally, we discuss some future directions that can help to improve the viability of using score-based diffusion models, which includes improving its computational cost in both training and sampling and increasing the diversity of applicable modalities.
@article{zeng2023diffusion,
title = {Score-Based Diffusion Models},
author = {Fan Pu Zeng and Owen Wang},
journal = {fanpu.io},
year = {2023},
month = {Jun},
url = {https://fanpu.io/blog/2023/score-based-diffusion-models/}
}
Unfortunately, this meant that while many people have good operational knowledge of LaTeX and can get the job done, there are still many small mistakes and best practices that go uncorrected by TAs, either because they are not severe enough to warrant a note, or because the TAs themselves are not aware of them.
In this post, we cover some common mistakes that are made by LaTeX practitioners (even in heavily cited papers), and how to address them. This post assumes that the reader has some working knowledge of LaTeX.
It is important to get into the right mindset whenever you typeset a document. You are not simply “writing” a document — you are crafting a work of art that combines both the precision and creativity of your logical thinking, as well as the elegance of beautifully typeset writing. The amount of attention and care you put into the presentation is indicative of the amount of thought you put into the content. Therefore, having good style is not only delightful and aesthetically pleasing to read, but it also serves to establish your ethos and character. One can tell that someone puts a lot of effort into their work and takes great pride in it when they pay attention even to the smallest of details.
Furthermore, adopting good practices also helps you avoid typographical mistakes in your proofs, such as missing parentheses or wrong positioning. These can often lead to cascading errors that are very annoying to fix when you discover them later on. There are even ways to replicate the strict typechecking of statically typed languages, so that mistakes in your expressions are caught at compile time.
In the following section, we take a look at common mistakes that people make, and how they can be avoided or fixed. We cover style mistakes first, since the ideas behind them are more general. All the screenshotted examples come from peer-reviewed papers that have been published to top conferences, so they are definitely very common mistakes and you shouldn’t feel bad for making them. The important thing is that you are aware of them now so that your style will gradually improve over time.
We take a look at style mistakes, which impair reader understanding and make it easy to commit other sorts of errors.
Parentheses, brackets, and pipes are examples of delimiters used to mark the start and end of formula expressions. As they come in pairs, a common mistake is accidentally leaving out the closing delimiter, especially in nested expressions. Even when both delimiters are present, there is the issue of incorrect sizing.
For instance, consider the following way of expressing the Topologist’s sine curve, which is an example of a topology that is connected but not path connected:
which is rendered as follows:
\[T = \{(x, \sin \frac{1}{x} ) : x \in (0, 1] \} \cup \{ ( 0, 0 ) \}\]The problem here is that the curly braces have the wrong size, as they should be large enough to cover the \(\sin \frac{1}{x}\) expression vertically.
The wrong way of resolving this would be to use delimiter size modifiers, i.e \bigl, \Bigl, \biggl
paired with \bigr, \Bigr, \biggr
and the like. This is tedious and error-prone, since LaTeX will happily let you match delimiters of different sizes. Indeed, I came across the following formula in a paper recently, where the outer right square bracket was missing and the left one had the wrong size:
The correct way to do this is to use paired delimiters, which automatically adjust their size based on their contents, and automatically produce a compile error if the matching right delimiter is missing or nested at the wrong level. Some of them are given below:
Raw LaTeX | Rendered |
---|---|
\left( \frac{1}{x} \right) | \(\left( \frac{1}{x} \right)\) |
\left[ \frac{1}{x} \right] | \(\left[ \frac{1}{x} \right]\) |
\left\{ \frac{1}{x} \right\} | \(\left\{ \frac{1}{x} \right\}\) |
\left\lvert \frac{1}{x} \right\rvert | \(\left\lvert \frac{1}{x} \right\rvert\) |
\left\lceil \frac{1}{x} \right\rceil | \(\left\lceil \frac{1}{x} \right\rceil\) |
In fact, to make things even simpler and more readable, you can declare paired delimiters for use based on the mathtools
package, with the following commands due to Ryan O’Donnell:
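Declarations along the following lines do the job (the macro names here are illustrative, not necessarily the exact commands referenced):

```latex
\usepackage{mathtools}

% Starred calls (e.g. \braces*{...}) auto-resize to their contents.
\DeclarePairedDelimiter{\parens}{(}{)}
\DeclarePairedDelimiter{\braces}{\{}{\}}
\DeclarePairedDelimiter{\bracks}{[}{]}
\DeclarePairedDelimiter{\abs}{\lvert}{\rvert}
\DeclarePairedDelimiter{\norm}{\lVert}{\rVert}
```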
Then you can now use the custom delimiters as follows, taking note that you need the *
for it to auto-resize:
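A sketch of the usage, assuming paired delimiters \braces and \parens have been declared via mathtools' \DeclarePairedDelimiter:

```latex
% Assumes \braces and \parens were declared with \DeclarePairedDelimiter;
% the * makes each delimiter pair resize to its contents automatically.
\[
  T = \braces*{ \parens*{ x, \sin \frac{1}{x} } : x \in (0, 1] }
      \cup \braces*{ (0, 0) }
\]
```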
which gives
\[T = \left\{ \left( x, \sin \frac{1}{x} \right) : x \in (0, 1] \right\} \cup \left\{ \left( 0, 0 \right) \right\} \\\]The biggest downside of using custom paired delimiters is having to remember to add the *
, otherwise, the delimiters will not auto-resize. This is pretty unfortunate as it still makes it error-prone. There is a proposed solution floating around on StackExchange that relies on a custom command that makes auto-resizing the default, but it’s still a far cry from a parsimonious solution.
Macros can be defined using the \newcommand
command. The basic syntax is \newcommand{command_name}{command_definition}
. For instance, it might get tiring to always type \boldsymbol{A}
to refer to a matrix \(\boldsymbol{A}\), so you can use the following macro:
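A sketch of such a macro (the name \mA is illustrative):

```latex
% Zero-argument macro: \mA expands to \boldsymbol{A}.
\newcommand{\mA}{\boldsymbol{A}}

% Usage: the matrix $\mA$ is symmetric, so $\mA = \mA^T$.
```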
Macros can also take arguments to be substituted within the definition. This is done by adding a [n]
argument after your command name, where n
is the number of arguments that it should take. You can then reference the positional arguments using #1, #2,
and so on. Here, we create a \dotprod
macro that takes two arguments:
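A minimal version of such a macro might look like:

```latex
% Two-argument macro: #1 and #2 are substituted into the definition.
\newcommand{\dotprod}[2]{\left\langle #1, #2 \right\rangle}

% Usage: $\dotprod{u}{v} = \sum_i u_i v_i$.
```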
Macros are incredibly helpful, as they save time and ensure that our notation is consistent. They can also help catch mistakes when typesetting grammatically structured expressions.
For instance, when expressing types and terms in programming language theory, there is often a lot of nested syntactical structure, which could make it easy to make mistakes. Consider the following proof:
The details are unimportant, but it is clear that it is easy to miss a letter here or a term there in the proof, given how cumbersome the notation is. To avoid this, I used the following macros, due to Robert Harper:
And the source for the proof looks like the following:
It is definitely still not the most pleasant thing to read, but at least now you will be less likely to miss an argument or forget to close a parenthesis.
Expressions which are logically a single unit should stay on the same line, instead of being split apart mid-sentence. Cue the following bad example from another paper:
In the area marked in red, we had the expression that was defining \(\tau^i\) get cut in half, which is very jarring visually and interrupts the reader’s train of thought.
To ensure that expressions do not get split, simply wrap it around in curly braces. For instance,
would be wrapped by {
and }
on both sides and become
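As an illustration (the expression below is made up, since the original snippet is an image), wrapping the inline formula in a brace group keeps it on one line:

```latex
% May be broken across lines at a relation or binary operator:
... where we assign $\tau^i = \sigma(W^i h^i + b^i)$ to each layer ...

% Wrapped in braces, the formula is treated as a single unbreakable unit:
... where we assign ${\tau^i = \sigma(W^i h^i + b^i)}$ to each layer ...
```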
So if we render the following snippet, which would otherwise have expressions split in half without the wrapped curly braces:
we get the following positive result where there is additional whitespace between the justified text on the first line, to compensate for the expression assigning \(\tau\) to stay on the same line:
~
When referencing figures and equations, you want the text and number (i.e Figure 10) to end up on the same line. This is a negative example, where the region underlined in red shows how it was split up:
To remedy this, add a ~
after Figure
, which LaTeX interprets as a non-breaking space:
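For example (the label name is illustrative):

```latex
% Bad: "Figure" and "2" may end up on different lines.
As shown in Figure \ref{fig:celeba-samples}, ...

% Good: the non-breaking space ~ glues them together.
As shown in Figure~\ref{fig:celeba-samples}, ...
```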
This ensures that “Figure 2” always appears on a single line.
Your document is meant to be read, and it should follow the rules and structures of English (or whichever language you are writing in). This means that mathematical expressions should also be punctuated appropriately, which allows it to flow more naturally and make it easier for the reader to follow.
Consider the following example that does not use punctuation:
In the region highlighted in red, the expressions do not carry any punctuation at all, and by the end of the last equation (Equation 15), I am almost out of breath trying to process all of the information. In addition, it does not end in a full stop, which does not give me an affordance to take a break mentally until the next paragraph.
Instead, commas should be added after each expression where the expression does not terminate, and the final equation should be ended by a full stop. Here is a good example of punctuation that helps to guide the reader along the author’s train of thought:
Here is another good example of how using commas for the equations allow the text to flow naturally, where it takes the form of “analogously, observe that we have [foo] and [bar], where the inequality…”:
This even extends to when you pack several equations on a single line, which is common when you are trying to fit the page limit for conference submissions:
proof
environmentThe proof
environment from the amsthm
package is great for signposting to your readers where a proof starts and ends. For instance, consider how it is used in the following example:
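A sketch of the environment in use (the proof content here is illustrative):

```latex
\begin{proof}
  Suppose for contradiction that $\sqrt{2} = p/q$ with $\gcd(p, q) = 1$.
  Then $p^2 = 2q^2$, so $p$ is even; writing $p = 2k$ gives $q^2 = 2k^2$,
  so $q$ is also even, contradicting $\gcd(p, q) = 1$.
\end{proof}
```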
This will helpfully highlight the start of your argument with “Proof”, and terminate it with a square that symbolizes QED.
\qedhere
Consider the same example as previously, but now you accidentally added an additional newline before the closing \end{proof}
, which happens pretty often:
This results in the above scenario, where the QED symbol now appears on the next line by itself, which throws the entire text off-balance visually. To avoid such things happening, always include an explicit \qedhere
marker at the end of your proof, which would cause it to always appear on the line that it appears after:
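A sketch of the fix (illustrative content): placing \qedhere inside the final display keeps the QED symbol on the last line of math, even if stray blank lines follow.

```latex
\begin{proof}
  By Euler's formula with $\theta = \pi$,
  \[
    e^{i\pi} + 1 = 0. \qedhere
  \]
\end{proof}
```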
We would then get the same result as originally, even with the extra newline present.
Spacing matters a lot in readability, as it helps to separate logical components. For instance, the following example fails to add spacing before the differential of the variable \(dz\):
This might seem innocuous, but consider the following example that makes the issue more explicit:
\[P(X) = \int xyz dx\]Now we can really see that the quantities are running into each other, and it becomes hard to interpret. Instead, we can add math-mode spacing, summarized in the following table:
Spacing Expression | Type |
---|---|
\; | Thick space |
\: | Medium space |
\, | Thin space |
So our new expression now looks like:
\[P(X) = \int xyz \, dx\]which is much more readable.
align*
Environment for Multiline EquationsWhen using the align*
environment, make sure that your ampersands &
appear before the symbol that you are aligning against. This ensures that you get the correct spacing.
For instance, the following is wrong, where the &
appears after the =
:
This is because there is too little spacing after the =
sign on each line, which feels very cramped. Putting the &
before the =
is correct:
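A side-by-side sketch of the two placements:

```latex
% Wrong: "=&" puts the alignment point after the =, cramping the spacing.
\begin{align*}
  f(x) =& x^2 + 2x + 1 \\
       =& (x + 1)^2
\end{align*}

% Right: "&=" aligns on the = with normal spacing around it.
\begin{align*}
  f(x) &= x^2 + 2x + 1 \\
       &= (x + 1)^2
\end{align*}
```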
The spacing is much more comfortable now.
We now look at some mistakes that arise from using the wrong commands.
Instead of sin (x)
\((sin(x))\) or log (x)
\((log (x))\), use \sin (x)
\((\sin (x))\) and \log (x)
\((\log (x))\). The idea extends to many other common math functions. These are math operators that will de-italicize the commands and also take care of the appropriate math-mode spacing between characters:
O(n log n) | \(O(n log n)\) |
O(n \log n) | \(O(n \log n)\) |
Many times there is a math operator that you need to use repeatedly, but which does not come out of the box. You can define custom math operators with the \DeclareMathOperator
command. For instance, here are some commonly used in probability:
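For instance, operators along these lines (the names are a common convention; adjust to taste — note that \Pr is already built in):

```latex
\DeclareMathOperator{\Ex}{\mathbf{E}}    % expectation
\DeclareMathOperator{\Var}{Var}          % variance
\DeclareMathOperator{\Cov}{Cov}          % covariance
```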
Then you can use it as follows:
\[\Pr \left[ X \geq a \right] \leq \frac{\Ex[X]}{a}\]This is more of a rookie mistake since it’s visually very obvious something is wrong. Double quotes don’t work the way you would expect:
\[\text{"Hello World!"}\]Instead, surround them in double backticks and single quotes, which is supposed to be reminiscent of the directional strokes of an actual double quote. This allows it to know which side to orient the ticks:
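In source form, the fix looks like the following (text mode):

```latex
% Wrong: both quotes render as closing quotes.
"Hello World!"

% Right: backticks open the quote, apostrophes close it.
``Hello World!''
```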
Unfortunately I had to demonstrate this with a screenshot since MathJax only performs math-mode typesetting, but this is an instance of text-mode typesetting.
This is a common mistake due to laziness. Many times, people use \epsilon
(\(\epsilon\)) when they really meant to write \varepsilon
(\(\varepsilon\)). For instance, in analysis this is usually the case, and therefore writing \epsilon
results in a very uncomfortable read:
Using \varepsilon
makes the reader feel much more at peace:
Similarly, people tend to get lazy and mix up \phi, \Phi, \varphi
(\(\phi, \Phi, \varphi\)), since they are “about the same”. Details matter!
mathbbm
Instead Of mathbb
For sets like \(\mathbb{N}\), you should use \mathbbm{N}
(from bbm
package) instead of \mathbb{N}
(from amssymb
). See the difference in how the rendering of the set of natural numbers \(\mathbb{N}\) differs, using the same example as the previous section:
mathbbm
causes the symbols to be bolded, which is what you want.
...
and \dots
are different. See the difference:
When using “…”, the spacing between each dot, and between the final dot and the comma character is wrong. Always use “\dots”.
When writing summation or products of terms, use \sum
and \prod
instead of \Sigma
and \Pi
. This helps to handle the relative positioning of the limits properly, and is much more idiomatic to read from the raw script:
| Raw LaTeX | Rendered |
| --- | --- |
| `\Sigma_{i=1}^n X_i` | \(\Sigma_{i=1}^n X_i\) |
| `\sum_{i=1}^n X_i` | \(\sum_{i=1}^n X_i\) |
| `\Pi_{i=1}^n X_i` | \(\Pi_{i=1}^n X_i\) |
| `\prod_{i=1}^n X_i` | \(\prod_{i=1}^n X_i\) |
To denote multiplication, use `\cdot` or `\times` instead of `*`. See the difference below in the equation:
For set-builder notation or conditional probability, use `\mid` instead of the pipe `|`. This helps to handle the spacing between the terms properly:
| Raw LaTeX | Rendered |
| --- | --- |
| `p(\mathbf{z}, \mathbf{x}) = p(\mathbf{z}) p(\mathbf{x} \| \mathbf{z})` | \(p(\mathbf{z}, \mathbf{x}) = p(\mathbf{z}) p(\mathbf{x} \vert \mathbf{z})\) |
| `p(\mathbf{z}, \mathbf{x}) = p(\mathbf{z}) p(\mathbf{x} \mid \mathbf{z})` | \(p(\mathbf{z}, \mathbf{x}) = p(\mathbf{z}) p(\mathbf{x} \mid \mathbf{z})\) |
When writing inner products of vectors, use `\langle` and `\rangle` instead of the keyboard angle brackets:
| Raw LaTeX | Rendered |
| --- | --- |
| `<u, v>` | \(<u, v>\) |
| `\langle u, v \rangle` | \(\langle u, v \rangle\) |
Use `\label` to label your figures, equations, tables, and so on, and reference them using `\ref` instead of hardcoding the number. For instance, `\label{fig:myfig}` and `\ref{fig:myfig}`. Including the type of the object in the tag helps to keep track of what it is and ensures that you are referencing it correctly, i.e. making sure you write `Figure \ref{fig:myfig}` instead of accidentally saying something like `Table \ref{fig:myfig}`.
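As a minimal sketch (the file name and label are illustrative):

```latex
\begin{figure}
  \centering
  \includegraphics[width=0.6\linewidth]{myfig.png}
  \caption{A caption describing the figure.}
  \label{fig:myfig}
\end{figure}

% Elsewhere in the text; the number stays correct even if figures are reordered:
As shown in Figure \ref{fig:myfig}, ...
```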
That was a lot, and I hope it has been a helpful read! I will continue to update this post as I come across other important points that I missed.
I would like to thank my friend Zack Lee for reviewing this article and for providing valuable suggestions. I would also like to express my thanks to Ryan O’Donnell, and my 15-751 A Theorist’s Toolkit TAs Tim Hsieh and Emre Yolcu for helping me realize a lot of the style-related LaTeX issues mentioned in this post, many of which I made personally in the past.
The Central Limit Theorem states that the appropriately standardized mean of i.i.d. random variables converges in distribution to a standard normal. We first need to introduce the definition of convergence of probability distributions:
Note that the requirement that it only holds at points of continuity is not superfluous, as there can be sequences of distributions that converge but disagree in value at points of discontinuity (e.g. take \(X_n = N(0, 1/n)\) and \(X\) to be the point mass at 0; they converge, but their CDFs take different values at \(t=0\)).
The Central Limit Theorem can then be stated in the following form (there are many other equivalent statements):
There are several ways of proving the Central Limit Theorem. The proof that we will explore today relies on the method of moments. An alternative measure-theoretic proof relies on Lévy’s Continuity Theorem, and makes use of convolutions and Fourier transforms.
Our goal is to show that \(Z_n\) converges in distribution to \(Z \sim N(0, 1)\). To do so, we will show that all the moments of \(Z_n\) converge to the respective moments of \(Z\).
The moments of a random variable \(X\) can be obtained from its moment-generating function (MGF), defined as \(M_X(t) = \mathbb{E} \left[ e^{tX} \right]\).
It is called a moment-generating function since the \(k\)th moment of \(X\), i.e. \(\mathbb{E} \left[X^k \right]\), can be obtained by taking its \(k\)th derivative at 0:
\[\mathbb{E} \left[X^k \right] = M_X^{(k)}(0).\]This is not too hard to see by induction on the fact that \(M_X^{(k)}(t) = \mathbb{E} \left[ X^k e^{tX} \right]\). The base case is trivial. For the inductive case,
\[\begin{align*} M_X^{(k)}(t) & = \frac{d^k}{dt^k} \mathbb{E} \left[ e^{tX} \right] \\ & = \frac{d}{dt} \mathbb{E} \left[ X^{k-1} e^{tX} \right] & \text{(by IH)}\\ & = \frac{d}{dt} \int f(x) x^{k-1} e^{tx} \; dx \\ & = \int \frac{d}{dt} f(x) x^{k-1} e^{tx} \; dx \\ & = \int f(x) x^{k} e^{tx} \; dx \\ & = \mathbb{E} \left[ X^{k} e^{tX} \right]. \end{align*}\]Substituting \(t=0\) gives us the desired result.
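As a quick numerical sanity check of this moment property (my addition, not from the original post): for an Exponential(1) random variable the MGF is \(M(t) = 1/(1-t)\) for \(t < 1\), and its \(k\)th moment is \(k!\). Finite differences of the MGF at 0 should recover the first two moments:

```python
# Sanity check: derivatives of the MGF at 0 give the moments.
# For X ~ Exponential(1), M(t) = 1/(1 - t) and E[X^k] = k!.
def mgf(t: float) -> float:
    return 1.0 / (1.0 - t)

h = 1e-3
# Central finite differences for the first and second derivatives at 0.
m1 = (mgf(h) - mgf(-h)) / (2 * h)               # should be E[X]   = 1! = 1
m2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2   # should be E[X^2] = 2! = 2

print(m1, m2)
```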
Distributions are uniquely determined by their moments under certain conditions. This is made precise in the following theorem:
In words, it means that if all moments are finite on some open interval around 0, then the moments determine the distribution. This is true for the normal distribution, where it can be shown that the following recurrence holds for the \(k\)th moment \(m_k = \mathbb{E} \left[ X^k \right]\):
\[m_k = \mu\, m_{k-1} + (k-1)\, \sigma^2\, m_{k-2}.\]This is also not hard to show by induction, and the proof is omitted for brevity. Since the first two moments of the standard normal distribution are 0 and 1 respectively, which are both finite, and our mean and standard deviation are both finite, all the moments generated by the recurrence must also be finite. So our standard normal is determined by its moments.
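As an aside (my check, not in the original post), the recurrence is easy to verify numerically: for the standard normal (\(\mu = 0, \sigma = 1\)) it should reproduce the well-known moments, where odd moments vanish and even moments are the double factorials \(1, 3, 15, 105, \dots\):

```python
# Moments of N(mu, sigma^2) via the recurrence m_k = mu*m_{k-1} + (k-1)*sigma^2*m_{k-2}.
mu, sigma2 = 0.0, 1.0
m = [1.0, mu]  # m_0 = 1, m_1 = mu
for k in range(2, 9):
    m.append(mu * m[k - 1] + (k - 1) * sigma2 * m[k - 2])

print(m)  # odd moments are 0; even moments are 1, 3, 15, 105
```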
Now cue the theorem that ties things together:
In words, it states that if, for every \(k\), the \(k\)th moment of \(X_n\) is finite and converges to the \(k\)th moment of \(X\) in the limit of \(n\), then \(X_n\) converges in distribution to \(X\).
This is great, since now we just have to show that all the moments of \(Z_n = \frac{\sqrt{n} \left( \overline{X}_n - \mu \right)}{\sigma}\) converge to the moments of the standard normal \(Z\).
Let’s first find the moment generating function of \(Z\):
\[\begin{align*} M_{Z}(t) & = \mathbb{E} \left[ e^{tZ} \right] \\ & = \int f_Z(x) e^{tx} \; dx \\ & = \int \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}x^2} e^{tx} \; dx & \text{(subst. pdf of standard Gaussian)} \\ & = \int \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}x^2 + tx} \; dx \\ & = \int \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}(x - t)^2 + \frac{1}{2}t^2} \; dx & \text{(completing the square)} \\ & = e^{\frac{1}{2}t^2} \int \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}(x - t)^2 } \; dx & \text{($e^{\frac{1}{2}t^2}$ does not depend on $x$)} \\ & = e^{\frac{1}{2}t^2} \cdot 1 \\ & = e^{\frac{1}{2}t^2}, \end{align*}\]where the second-to-last step comes from the fact that \(\frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}(x - t)^2 }\) is the probability density of a Gaussian with mean \(t\) and variance 1, and therefore integrates to 1.
Now we find the moment generating function of \(Z_n\). To simplify notation, define \(A_i = \frac{X_i - \mu}{\sigma}\), and see that we can write \(Z_n = \frac{1}{\sqrt{n}} \sum\limits_{i=1}^n A_i\), since
\[\begin{align*} \frac{1}{\sqrt{n}} \sum\limits_{i=1}^n A_i &= \frac{1}{\sqrt{n}} \sum\limits_{i=1}^n \frac{X_i - \mu}{\sigma} \\ &= \sqrt{n} \sum\limits_{i=1}^n \frac{X_i - \mu}{ n \sigma} \\ &= \sqrt{n} \frac{\overline{X}_n - \mu}{ \sigma} \\ &= Z_n. \end{align*}\]See that \(\mathbb{E}[A_i] = 0\), and \(\mathbf{Var}(A_i) = 1\).
Then starting from the definition of the moment generating function of \(Z_n\),
\[\begin{align*} M_{Z_n}(t) & = \mathbb{E} \left[ e^{t Z_n} \right] \\ & = \mathbb{E} \left[ \exp\left(t \frac{1}{\sqrt{n}} \sum\limits_{i=1}^n A_i \right) \right] & \text{(by equivalent definition of $Z_n$)} \\ & = \prod_{i=1}^n \mathbb{E} \left[ \exp\left( \frac{t}{\sqrt{n}} A_i \right) \right] & \text{(by independence of $A_i$'s)} \\ & = \prod_{i=1}^n M_{A_i}(t/\sqrt{n}) & \text{(definition of $M_{A_i}$)} \\ & = M_{A_i}(t/\sqrt{n})^n. & \text{(the $A_i$'s are identically distributed)} \end{align*}\]Let’s analyze each individual term \(M_{A_i}(t / \sqrt{n})\) by performing a Taylor expansion around 0. Recall that the Taylor expansion of a function \(f(x)\) about a point \(a\) is given by \(f(x) = \sum\limits_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n\). We will expand up to the second-order term, which requires the value of the MGF and its first two derivatives at 0.
These are:
\[\begin{align*} M_{A_i}(0) & = \mathbb{E} \left[ e^{t A_i} \right] \Big|_{t=0} \\ & = \mathbb{E} \left[ 1 \right] \\ & = 1, \\ M_{A_i}^\prime(0) & = \mathbb{E} \left[ A_i \right] & \text{(by the $k$th moment property proved previously)} \\ & = 0, \\ M_{A_i}^{\prime \prime}(0) & = \mathbb{E} \left[ A_i^2 \right] & \text{(by the $k$th moment property proved previously)} \\ & = \mathbb{E} \left[ A_i^2 \right] - \mathbb{E} \left[ A_i \right]^2 + \mathbb{E} \left[ A_i \right]^2 \\ & = \mathbf{Var}(A_i) + \mathbb{E} \left[ A_i \right]^2 & \text{($\mathbf{Var}(A_i) = \mathbb{E} \left[ A_i^2 \right] - \mathbb{E} \left[ A_i \right]^2 $)} \\ & = 1 + 0 \\ & = 1. \end{align*}\]Taking all terms up to the second order Taylor expansion allows us to approximate \(M_{A_i}\) as
\[\begin{align*} M_{A_i}(t/\sqrt{n}) & \approx M_{A_i}(0) + M_{A_i}^\prime(0) \frac{t}{\sqrt{n}} + M_{A_i}^{\prime \prime}(0) \frac{t^2}{2n} \\ & = 1 + 0 + \frac{t^2}{2n} \\ & = 1 + \frac{t^2}{2n}. \end{align*}\]Now we can write the limit of the MGF of \(Z_n\) as the following:
\[\begin{align*} M_{Z_n}(t) & = M_{A_i}(t/\sqrt{n})^n \\ & \approx \left( 1 + \frac{t^2}{2n} \right)^n \\ & \to e^{t^2/2}, & \text{(by the identity $\lim_{n \to \infty} (1 + x/n)^n = e^x$)} \end{align*}\]which shows that it converges to the MGF of \(Z\), as desired. Hooray!
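To see the convergence concretely, here is a small simulation (my addition, not part of the proof): draw many standardized means \(Z_n\) of Exponential(1) samples (for which \(\mu = \sigma = 1\)) and check that their empirical distribution matches the standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 20_000

# X_i ~ Exponential(1), so mu = sigma = 1.
samples = rng.exponential(scale=1.0, size=(trials, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0  # one Z_n per trial

print(z.mean(), z.var())          # should be close to 0 and 1
print(np.mean(z <= 1.2816))       # should be close to Phi(1.2816) ~= 0.90
```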
However, there is one thing in this proof that might have bothered you. Our result came from making use of the Taylor approximation and taking limits, but there is no bound on how large \(n\) must be for the distributions to converge up to a maximum amount of error. This makes it unsuitable for much theoretical analysis, since usually we would like to know that \(n\) does not have to be too large for us to obtain a sufficiently good approximation to the standard normal.
The Berry-Esseen theorem addresses this limitation by providing explicit error bounds. It was proved independently by Andrew C. Berry and Carl-Gustav Esseen in the 1940s, and the statement goes as follows:
In words, the theorem says that the difference between the CDF of the sum of the mean-0 random variables and the CDF of the standard normal is bounded by a quantity proportional to the third moment. This becomes useful as a tool for proving high-probability statements if we can show that the third moment is inversely polynomially small, i.e. \(\beta = 1/\text{poly}(n)\).
Another thing to note is that the theorem only provides an absolute bound, uniform over all values of \(u\). Therefore, when \(u\) is very negative and \(\Pr [Z \leq u ] = \Phi(u)\) is very small, the relative error is actually very large, and the bound is not as helpful.
I hope this article has been helpful!
I would like to express my thanks to my friend Albert Gao for reviewing this article and for providing valuable suggestions.
Our goal is to train an agent that maximizes its rewards in a given task. For instance, its goal could be to balance a cartpole for as long as possible: for each time step that the pole does not fall down, the agent receives a reward of 1, and when the pole falls down the episode is terminated and the agent no longer receives any rewards:
Formally, we want to maximize the expected reward of our policy over the trajectories that it visits. A trajectory \(\tau\) is defined as a sequence of state-action pairs \(\tau = (s_0, a_0, s_1, a_1, \dots, s_H, a_H, s_{H+1})\), where \(H\) is the horizon of the trajectory, i.e. the duration until the episode is terminated, and \(s_t, a_t\) are the state and action at each time step \(t\).
This can be formalized as the following objective:
\[\begin{align} & \max_\theta \mathbb{E}_{\tau \sim P_\theta(\tau)} [R(\tau)] \\ = & \max_\theta \sum\limits_\tau P_\theta(\tau) R(\tau) \\ = & \max_\theta U(\theta), \end{align}\]where \(\tau\) refers to a trajectory of state-action pairs, \(P_\theta(\tau)\) denotes the probability of experiencing trajectory \(\tau\) under policy \(\theta\), \(R(\tau)\) is the reward under trajectory \(\tau\), and \(U(\theta)\) is shorthand for the objective, introduced for brevity.
The probability \(P_\theta(\tau)\) is given by the following:
\[\begin{align} P_\theta(\tau) = \prod_{t=0}^H P(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta (a_t \mid s_t), \end{align}\]where in words, it is the product over each time step \(t\), of the probability of taking the action at time \(t\) in the trajectory \(a_t\) when we were in state \(s_t\) under our policy \(\pi_\theta\), given by \(\pi_\theta(a_t \mid s_t)\), multiplied by the probability that the environment transitions us from \(s_t\) to \(s_{t+1}\) given that we performed action \(a_t\). Note that we do not necessarily know this environment transition probability \(P(s_{t+1} \mid s_t, a_t)\).
To perform a gradient-based update on \(\theta\) to increase the reward, we need to compute the gradient with respect to our policy parameters \(\theta\), i.e. \(\nabla_\theta \mathbb{E}_{\tau \sim P_\theta(\tau)} [R(\tau)]\). Let’s walk through the derivation step by step:
\[\begin{align*} \nabla_\theta \mathbb{E}_{\tau \sim P_\theta(\tau)} [R(\tau)] & = \nabla_\theta \sum\limits_\tau P_\theta(\tau) R(\tau) \\ & = \sum\limits_\tau \nabla_\theta P_\theta(\tau) R(\tau) & \text{(uh oh...)}\\ \end{align*}\]It appears that we are already stuck here, since \(\nabla_\theta P_\theta(\tau)\) will result in many repeated applications of the chain rule since \(P_\theta(\tau)\) is a huge product containing our policy transition probabilities, and will quickly get out of hand to be computed feasibly.
Instead, the trick is to multiply by 1 on the left:
\[\begin{align*} \sum\limits_\tau \nabla_\theta P_\theta(\tau) R(\tau) &= \sum\limits_\tau \frac{ P_\theta(\tau) }{ P_\theta(\tau) } \nabla_\theta P_\theta(\tau) R(\tau) & \text{(multiplying by 1)} \\ &= \sum\limits_\tau P_\theta(\tau) \frac{ \nabla_\theta P_\theta(\tau) }{ P_\theta(\tau) } R(\tau) & \text{(rearranging)} \\ &= \sum\limits_\tau P_\theta(\tau) \nabla_\theta \log P_\theta(\tau) R(\tau) & \text{($\frac{d}{dx} \log f(x) = \frac{f'(x)}{f(x)} $)} \\ &= \mathbb{E}_{\tau \sim P_\theta(\tau)} \left[ \nabla_\theta \log P_\theta(\tau) R(\tau) \right] \\ &\approx \frac{1}{N} \sum\limits_{i=1}^N \nabla_\theta \log P_\theta(\tau_i) R(\tau_i), \\ \end{align*}\]where we can use \(\frac{1}{N} \sum\limits_{i=1}^N \nabla_\theta \log P_\theta(\tau_i) R(\tau_i)\) as our estimator, which converges to the true expectation as our number of trajectory samples \(N\) increases.
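This estimator can be checked on a toy problem where the true gradient is known in closed form (my example, not from the post): let the “trajectory” be a single draw \(x \sim \mathrm{Bern}(p)\) with \(p = \sigma(\theta)\) and reward \(R(x) = x\). Then \(\mathbb{E}[R] = \sigma(\theta)\), so the true gradient is \(\sigma(\theta)(1 - \sigma(\theta))\), while the score function is \(\nabla_\theta \log p_\theta(x) = x - \sigma(\theta)\):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
p = 1.0 / (1.0 + math.exp(-theta))  # sigma(theta)

# Score-function (log-derivative) estimator of grad_theta E[R(x)], R(x) = x:
# average grad_theta log p_theta(x) * R(x) over many samples.
x = (rng.random(200_000) < p).astype(float)  # x ~ Bernoulli(p)
grad_est = np.mean((x - p) * x)

grad_true = p * (1.0 - p)  # d/dtheta of sigma(theta)
print(grad_est, grad_true)
```

The two values agree up to the Monte Carlo sampling error, which shrinks as the number of samples grows.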
We can compute \(\nabla_\theta \log P_\theta(\tau_i)\) for each sampled trajectory \(\tau_i\), and then take their average. This can be done as follows:
\[\begin{align*} \nabla_\theta \log P_\theta(\tau_i) & = \nabla_\theta \log P_\theta(s_0, a_0, \dots, s_H, a_H, s_{H+1}) \\ & = \nabla_\theta \log \left[ \prod_{t=0}^H P(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta (a_t \mid s_t) \right] \\ & = \nabla_\theta \left[ \sum\limits_{t=0}^H \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta (a_t \mid s_t) \right] \\ & = \nabla_\theta \sum\limits_{t=0}^H \log \pi_\theta (a_t \mid s_t) \\ & \qquad \qquad \text{(first term does not depend on $\theta$, becomes zero)} \\ & = \sum\limits_{t=0}^H \nabla_\theta \log \pi_\theta (a_t \mid s_t),\\ \end{align*}\]where the last expression is easily computable for models such as neural networks since it is end-to-end differentiable.
With the approximate gradient \(\nabla_\theta U(\theta)\) in hand, we can now perform our policy gradient update as
\[\begin{align*} \theta_{\mbox{new}} = \theta_{\mbox{old}} + \alpha \nabla_\theta U(\theta), \end{align*}\]for some choice of step size \(\alpha\).
In this post, we saw from first principles how taking the gradients of many sampled trajectories does indeed converge to the true policy gradient.
This method of multiplying by 1 to pull out a probability term so that a summation can be converted into an expectation is widely used in machine learning, such as for computing variational autoencoder (VAE) loss. It is known as the log derivative trick.
The estimator \(\frac{1}{N} \sum\limits_{i=1}^N \nabla_\theta \log P_\theta(\tau_i) R(\tau_i)\) is also sometimes known as the REINFORCE estimator, after the popular REINFORCE algorithm.
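As a toy illustration (my sketch, not from the post): REINFORCE on a two-armed bandit with deterministic rewards, using a softmax policy over two logits. For a softmax policy, \(\nabla_\theta \log \pi_\theta(a)\) is the one-hot vector for \(a\) minus the probability vector, and the policy should learn to prefer the higher-reward arm:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.zeros(2)             # policy logits, one per action
rewards = np.array([0.2, 1.0])  # deterministic reward of each arm
alpha = 0.5                     # step size

for _ in range(500):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)        # sample an action from the policy
    R = rewards[a]
    grad_log = -pi                 # grad of log pi(a) wrt the logits...
    grad_log[a] += 1.0             # ...is onehot(a) - pi for a softmax policy
    theta += alpha * grad_log * R  # REINFORCE update

print(softmax(theta))  # probability of arm 1 should be close to 1
```

Note this is the bare estimator with no baseline; in practice a baseline is usually subtracted from \(R\) to reduce variance.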
One limitation of this approach is that it requires \(\pi_\theta\) to be differentiable. However, given that most RL policies are parameterized by neural networks, this is not a significant restriction.
Choosing the right step size \(\alpha\) is actually not straightforward. This differs from the offline supervised-learning setting, where adaptive methods like AdaGrad or RMSProp choose a learning rate for you, and even a suboptimal learning rate just means more iterations to converge. In reinforcement learning, by contrast, a learning rate that is too small wastes trajectory samples, since they depend on the current policy and cannot be trivially reused, while a learning rate that is too large can make the policy bad, which is difficult to recover from since future trajectories would then also be collected under that bad policy.
We will discuss three important methods to choose an appropriate step size in a future post: Natural Policy Gradients, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO). Hope to see you around!
I would like to express my thanks to my friend Jun Yu Tan for reviewing this article and for providing valuable suggestions.