Theoretical Foundations

Machine Learning, Computer Science, Mathematics

Neural Networks from Maximizing Rate Reduction

While we have witnessed much empirical evidence of the success of deep learning, much of that success is due to trial and error rather than guidance from underlying mathematical principles. I attended Yi Ma's keynote on "Pursuing the Nature of Intelligence" at ICLR this year, which took a statistical lens and urged the community to view model training as learning to compress. I was especially struck by the novelty of his recent work on using coding rate reduction as a learning objective in place of standard loss functions, and the remainder of this post is a high-level overview of his ReduNet paper.

6 min read · May 10, 2025

2025 · machine-learning statistics
An Intuitive Introduction to Gaussian Processes

Deep learning is currently dominated by parametric models, which are models with a fixed number of parameters regardless of the size of the training dataset. Examples include linear regression models and neural networks.

However, it's good to occasionally take a step back and remember that this is not all there is. Non-parametric models like k-NN, decision trees, or kernel density estimation don't rely on a fixed set of weights, but instead grow in complexity with the size of the data.

In this post we'll talk about Gaussian processes, a conceptually important but, in my opinion, under-appreciated non-parametric approach with deep connections to modern-day neural networks. An interesting motivating fact, which we will eventually show, is that neural networks initialized with Gaussian weights are equivalent to Gaussian processes in the infinite-width limit.

11 min read · January 21, 2025

2025 · machine-learning math statistics
Bounding Mixing Times of Markov Chains via the Spectral Gap

An aperiodic and irreducible Markov chain will eventually converge to a stationary distribution. This is used in many applications in machine learning, such as Markov Chain Monte Carlo (MCMC) methods, where random walks on Markov chains are used to obtain a good estimate of the log likelihood of the partition function of a model, which is hard to compute directly as it is #P-hard (this is even harder than NP-hard). However, one common problem is that it is unclear how many steps we should take before we are guaranteed that the Markov chain has converged to its stationary distribution.

In this post, we examine how the spectral gap of the transition matrix of the Markov chain relates to its mixing time.

18 min read · January 12, 2025

2025 · math machine-learning
Notes on 'The Llama 3 Herd of Models'

Notes on the new Llama 3.1 technical report. It's a long paper, but one that's well-written with lots of interesting technical details and design choices.

14 min read · August 07, 2024

2024 · machine-learning
Playing Sound Voltex at Home: Setting Up Unnamed SDVX Clone with the Yuancon SDVX Controller

Rhythm is just a $200 controller and some hopefully-not-too-complicated open source software setup away! This beginner's guide will help demystify the process of setting up Sound Voltex at home with a custom SDVX controller and Unnamed SDVX Clone.

14 min read · September 02, 2023

2023 · general rhythm-games
Creating Trackback Requests for Static Sites

A simple guide on creating manual Trackback requests for static sites to increase visibility and discoverability.

5 min read · September 01, 2023

2023 · code general
A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers: A Guided Walkthrough

Imagine doing high-dimensional statistical inference where, instead of repeatedly studying different settings with specific low-dimensional constraints (such as linear regression with sparsity constraints, or estimation of structured covariance matrices), you had a single method for performing a unified analysis using appropriate notions.

Well, you're in luck! 'A Unified Framework for High-Dimensional Analysis of $M$-Estimators with Decomposable Regularizers' by Negahban, Ravikumar, Wainwright, and Yu shows that the $\ell_2$ difference between any regularized $M$-estimator and its true parameter can be bounded if the regularization function is decomposable and the loss function satisfies restricted strong convexity.

The goal of this post is to provide intuition for the result and develop sufficient background for understanding the proof of this result, followed by a walkthrough of the proof itself.

22 min read · July 14, 2023

2023 · statistics machine-learning
The CMU Steam Tunnels and Wean 9

If you're curious about the infamous steam tunnels at CMU, or what the view from the roof of Wean Hall looks like, this post is for you!

8 min read · June 16, 2023

2023 · general cmu
CMU 15712 Advanced Operating Systems and Distributed Systems Course Review

15-712 Advanced OS was an excellent seminar-based graduate course that took us on a whirlwind tour through many of the most seminal SIGOPS Hall of Fame papers across several systems domains. It will prepare you to be a great systems designer and researcher. In this post, I will share my experience in the class, the course structure and content, what I thought were the biggest takeaways, and who this class might be suitable for.

22 min read · June 09, 2023

2023 · courses cmu systems
Score-Based Diffusion Models

Score-based diffusion models are a promising direction for generative models, as they improve on both likelihood-based approaches like variational autoencoders and adversarial methods like Generative Adversarial Networks (GANs). In this blog post, we survey recent developments in the field centered around the line of results developed in (Song & Ermon, 2019), analyze the current strengths and limitations of score-based diffusion models, and discuss possible future directions that can address their drawbacks. Joint work with Owen Wang.

26 min read · June 07, 2023

2023 · machine-learning
DynPartition: Automatic Optimal Pipeline Parallelism of Dynamic Neural Networks over Heterogeneous GPU Systems for Inference Tasks

Dynamic neural networks are slowly gaining popularity due to their ability to adapt their structures or parameters to different inputs, leading to notable advantages in terms of accuracy, computational efficiency, and adaptivity, in comparison to static models which have fixed computational graphs and parameters. We propose a novel reinforcement learning-based scheduler called DynPartition that performs dynamic partitioning of computation across multiple heterogeneous GPUs for dynamic neural network inference tasks.

1 min read · May 05, 2023

2023 · ml-systems
The Art of LaTeX: Common Mistakes, and Advice for Typesetting Beautiful, Delightful Proofs

When was the first time you had to use LaTeX? If you are like most people, it was probably suddenly forced upon you during your first math or CS class where you had to start writing proofs, with minimal guidance on how to get started. Unfortunately, this means that while many people have good operational knowledge of LaTeX, many small mistakes persist and many best practices go unfollowed. TAs do not correct them, either because they are not severe enough to warrant a note, or because the TAs themselves are not aware of them.

In this post, we cover some common mistakes that are made by LaTeX practitioners (even in heavily cited papers), and how to address them.

31 min read · January 02, 2023

2023 · code general
A Concise Proof of the Central Limit Theorem, and Its Actually Useful Version, the Berry-Esseen Theorem

The Central Limit Theorem is widely used in statistics and machine learning, as it allows us to assume that given enough samples, the mean of the samples will approximately follow a normal distribution. This holds even if the samples come from a distribution that is not normal. In this post, we prove the Central Limit Theorem, and then take a look at the Berry-Esseen Theorem, which provides a quantitative bound on the rate of convergence and can therefore actually be used in deriving theoretical bounds.

12 min read · December 28, 2022

2022 · math machine-learning
Reinforcement Learning Policy Optimization: Deriving the Policy Gradient Update

Reinforcement learning algorithms that learn a policy (as opposed to implicit policy methods like $\epsilon$-greedy) optimize their policies by updating them in the direction of the gradient. However, the precise environment dynamics are not usually known to us, and the state space is usually also too large to enumerate, which means that we cannot compute the gradient analytically. In this post, we derive the policy gradient update from scratch, and show how it can be approximated by sampling sufficiently many trajectories.

7 min read · December 26, 2022

2022 · machine-learning
Pseudo-determinism for Graph Streaming Problems

Given a fixed input for a search problem, pseudo-deterministic algorithms produce the same answer over multiple independent runs, with high probability. For example, we can efficiently find a certificate for inequality of multivariate polynomials pseudo-deterministically, but it is not known how to do so deterministically. The same notion can be extended to the streaming model. The problem of finding a nonzero element from a turnstile stream was previously shown to require linear space for both deterministic and pseudo-deterministic algorithms. Another model of streaming problems is that of graphs, where edge insertions and deletions occur along a stream. Some natural problems include connectivity, bipartiteness, and colorability of a graph. While randomized and deterministic graph streaming algorithms have been mostly well-studied, we investigate pseudo-deterministic space lower bounds and upper bounds for graph-theoretic streaming problems.

1 min read · December 19, 2022

2022 · theory project
Graphical Bayesian Networks with Topic Modeling Priors for Predicting Asset Covariances

Covariance matrix prediction is a long-standing challenge in modern portfolio theory and quantitative finance. In this project, we investigate the effectiveness of Bayesian networks in predicting the covariance matrix of financial assets (specifically a subset of the S&P 500), evaluated against Heterogeneous Autoregressive (HAR) models. In particular, we consider both HAR-DRD, based on the DRD decomposition of the covariance matrix, and Graphical HAR (GHAR)-DRD, which is based on the same decomposition but also makes use of graphical relationships between the assets. To build the graph representing relationships between the assets, we apply Latent Dirichlet Allocation (LDA) to the 10-K filings of each of the companies, and infer edges based on topic overlap.

1 min read · December 13, 2022

2022 · machine-learning project
Analysis of Symmetry and Conventions in Off-Belief Learning (OBL) in Hanabi

Hanabi has been proposed as the new frontier for developing strategies in cooperative AI, currently a very nascent area of AI research. A recent algorithm that has been developed for multi-agent reinforcement learning in a cooperative context is the Off-Belief Learning (OBL) algorithm, which is based on iterated reasoning starting from a base policy. We investigate if policies learnt by agents using the OBL algorithm in the multi-player cooperative game Hanabi in the zero-shot coordination (ZSC) context are invariant across symmetries of the game, and if any conventions formed during training are arbitrary or natural, both of which are desirable properties.

1 min read · December 09, 2022

2022 · machine-learning project
Improving Domain Adaptation of Transformer Models For Generating Reddit Comments

We improve upon the recent success of large language models based on the transformer architecture by investigating and showing several methods that have empirically improved their performance in domain adaptation. We use a pre-trained GPT-2 model and perform fine-tuning on 5 different subreddits, and use different methods of ordering the training data based on our priors about the input to see how this affects the prediction quality of the trained model. We propose a new metric for evaluating causal language modeling tasks called APES (Average Perplexity Evaluation for Sentences) to address the limitations of existing metrics, and apply it to our results. Our results are evaluated against both LSTM and GPT-2 baselines.

4 min read · December 05, 2022

2022 · machine-learning project
Efficient Low Rank Approximation via Affine Embeddings

Suppose you have an $n \times d$ matrix $A$, where both dimensions are large. This could represent something like a customer-product matrix used in online recommender systems, where each cell $A_{i,j}$ denotes how many times customer $i$ purchased item $j$. Then it is typically the case that $A$ can be well-approximated by a low-rank matrix. For instance, using the previous example, there might only be a few dominant patterns that describe purchasing behavior in $A$, and the rest of it is just noise. Therefore, if we can find such a low-rank approximation, we can achieve significant space savings and also make the data more interpretable. In this post, we explore how affine embeddings via the CountSketch matrix allow us to perform low rank approximation in time $O\left(\text{nnz}(A)+(n+d) \, \text{poly} \left( \frac{k}{\epsilon} \right)\right)$, where $\text{nnz}(A)$ is the number of nonzero entries of $A$.

16 min read · September 29, 2022

2022 · theory machine-learning math
CMU 15-441/641 Computer Networks Course Review

Computer Networks is one of the lesser-known systems classes at Carnegie Mellon that turned out to be surprisingly fun and informative. In this post I'll talk about the projects and content covered, followed by my own thoughts on the usefulness of the class and who should take it.

15 min read · August 15, 2022

2022 · courses systems