ML Reading List
A curated list of papers I've bookmarked to read, well, someday...
This list is continuously updated as I bookmark more papers to read. It's also a good reflection of my current research interests.
There's a good chance that if I like a paper, I'll write a summary of it.
- Fast Inference from Transformers via Speculative Decoding
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
- Scaling Data-Constrained Language Models
- Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
- TIES-Merging: Resolving Interference When Merging Models
- Editing Models with Task Arithmetic
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification
- Jasper and Stella: distillation of SOTA embedding models
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- YaRN: Efficient Context Window Extension of Large Language Models
- Zero Bubble Pipeline Parallelism
- How Does Critical Batch Size Scale in Pre-training?
- Better & Faster Large Language Models via Multi-token Prediction
- Pointer Networks
- Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve
- WARM: On the Benefits of Weight Averaged Reward Models
- Dual Operating Modes of In-Context Learning
- Combining Induction and Transduction for Abstract Reasoning
- Phi-4 Technical Report
- The Impact of Positional Encoding on Length Generalization in Transformers
- RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
- Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
- HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation
- Stream of Search (SoS): Learning to Search in Language
- Quadratic models for understanding catapult dynamics of neural networks
- Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning
- $\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
- VeLO: Training Versatile Learned Optimizers by Scaling Up
- Probing the Decision Boundaries of In-context Learning in Large Language Models
- Deep Policy Gradient Methods Without Batch Updates, Target Networks, or Replay Buffers
- VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search
- xLSTM: Extended Long Short-Term Memory
- Memory-Efficient LLM Training with Online Subspace Descent
- Efficient Model-Free Exploration in Low-Rank MDPs
- On the Computational Landscape of Replicable Learning
- TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
- Reinforcement Learning: An Overview
- Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
- Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
- Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices
- Compute Better Spent: Replacing Dense Layers with Structured Matrices
- CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra
- Theoretical Foundations of Conformal Prediction
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning
- Mechanism of feature learning in convolutional neural networks
- Average gradient outer product as a mechanism for deep neural collapse
- Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel
- Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation
- The duality structure gradient descent algorithm: analysis and applications to neural networks
- A Spectral Condition for Feature Learning
- Modular Duality in Deep Learning
- Old Optimizer, New Norm: An Anthology
- Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
- Investigating the Limitations of Transformers with Simple Arithmetic Tasks
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
- Evaluating Language Models for Mathematics through Interactions
- Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks
- Continual Pre-Training of Large Language Models: How to (re)warm your model?
- Functional Data Analysis: An Introduction and Recent Developments
- On Calibration of Modern Neural Networks
- How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis
- LoRA vs Full Fine-tuning: An Illusion of Equivalence
- An Empirical Model of Large-Batch Training
- The Description Length of Deep Learning Models
- ADOPT: Modified Adam Can Converge with Any $\beta_2$ with the Optimal Rate
- Denoising Diffusion Probabilistic Models in Six Simple Steps
- Understanding Optimization in Deep Learning with Central Flows
- DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
- Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis
- Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
- Impacts of Continued Legal Pre-Training and IFT on LLMs’ Latent Representations of Human-Defined Legal Concepts
- A First Course in Monte Carlo Methods
- Mixture of Parrots: Experts improve memorization more than reasoning
- The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities
- Agentic Information Retrieval
- nGPT: Normalized Transformer with Representation Learning on the Hypersphere
- Why Do We Need Weight Decay in Modern Deep Learning?
- GLU Variants Improve Transformer
- Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning
- Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
- Generative Verifiers: Reward Modeling as Next-Token Prediction
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
- Round and Round We Go! What makes Rotary Positional Encodings useful?
- Large Language Models as Markov Chains
- Searching for Best Practices in Retrieval-Augmented Generation
- Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers
- Reinforced Self-Training (ReST) for Language Modeling
- eGAD! double descent is explained by Generalized Aliasing Decomposition
- A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning
- Classical Statistical (In-Sample) Intuitions Don’t Generalize Well: A Note on Bias-Variance Tradeoffs, Overfitting and Moving from Fixed to Random Designs
- Contextual Document Embeddings
- Were RNNs All We Needed?
- Instruction Following without Instruction Tuning
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
- Decoupled Weight Decay Regularization
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Hardware Acceleration of LLMs: A comprehensive survey and comparison
- Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering
- White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
- Just Say the Name: Online Continual Learning with Category Names Only via Data Generation
- Magicoder: Empowering Code Generation with OSS-Instruct
- Code Llama: Open Foundation Models for Code
- #InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models
- Equivariant neural networks and piecewise linear representation theory
- Does your data spark joy? Performance gains from domain upsampling at the end of training
- Reconciling modern machine learning practice and the bias-variance trade-off
- Moment Matching for Multi-Source Domain Adaptation
- Invariant Risk Minimization
- How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
- Do ImageNet Classifiers Generalize to ImageNet?
- The Effect of Natural Distribution Shift on Question Answering Models
- Causal inference using invariant prediction: identification and confidence intervals
- Neural Tangent Kernel: Convergence and Generalization in Neural Networks
- Measuring the Intrinsic Dimension of Objective Landscapes
- Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
- No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems
- Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization
- In-Context Learning Learns Label Relationships but Is Not Conventional Learning
- Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
- Language Models (Mostly) Know What They Know
- A Tutorial on Bayesian Optimization
- Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- The Geometry of Categorical and Hierarchical Concepts in Large Language Models
- When Representations Align: Universality in Representation Learning Dynamics
- Scaling Synthetic Data Creation with 1,000,000,000 Personas
- A Theory of Interpretable Approximations
- BPO: Staying Close to the Behavior LLM Creates Better Online LLM Alignment
- A Tutorial on Thompson Sampling
- Beyond the Black Box: A Statistical Model for LLM Reasoning and Inference
- Transcendence: Generative Models Can Outperform The Experts That Train Them
- Step-by-Step Diffusion: An Elementary Tutorial
- Text Embeddings Reveal (Almost) As Much As Text
- Language Modeling with Gated Convolutional Networks
- Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks
- Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
- Customizing Text-to-Image Models with a Single Image Pair
- Self-Play Preference Optimization for Language Model Alignment
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
- U-Nets as Belief Propagation: Efficient Classification, Denoising, and Diffusion in Generative Hierarchical Models
- On the Bottleneck of Graph Neural Networks and its Practical Implications
- How Can We Know What Language Models Know?
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
- Why do tree-based models still outperform deep learning on tabular data?
- On the Complexity of Best Arm Identification in Multi-Armed Bandit Models
- Random Utility Theory for Social Choice
- Black-box Dataset Ownership Verification via Backdoor Watermarking
- Efficient Training of Language Models to Fill in the Middle
- From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
- Large Language Models Can Self-Improve
- How to Train Data-Efficient LLMs
- Co-training Improves Prompt-based Learning for Large Language Models
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
- Learning Transformer Programs
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
- Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
- A Convergence Theory for Deep Learning via Over-Parameterization
- Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian
- Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
- STaR: Bootstrapping Reasoning With Reasoning
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks
- The Barron Space and the Flow-induced Function Spaces for Neural Network Models
- Deep Equilibrium Based Neural Operators for Steady-State PDEs
- Parametric Complexity Bounds for Approximating PDEs with Neural Networks
- Simple linear attention language models balance the recall-throughput tradeoff
- Chain-of-Thought Reasoning Without Prompting
- Evolutionary Optimization of Model Merging Recipes
- An Invitation to Deep Reinforcement Learning
- Deep Neural Networks Tend To Extrapolate Predictably
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- Is Cosine-Similarity of Embeddings Really About Similarity?
- On the Measure of Intelligence
- Asymmetry in Low-Rank Adapters of Foundation Models
- Scalable Diffusion Models with Transformers
- In-Context Learning for Extreme Multi-Label Classification
- Matryoshka Representation Learning
- Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization
- Diffusion Models for Generative Artificial Intelligence: An Introduction for Applied Mathematicians
- In-context Learning and Induction Heads
- Characterizing Implicit Bias in Terms of Optimization Geometry
- Scaling Instruction-Finetuned Language Models
- Textbooks Are All You Need II: phi-1.5 technical report
- Data Selection for Language Models via Importance Resampling
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Data Movement Is All You Need: A Case Study on Optimizing Transformers
- Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars
- Accelerating LLM Inference with Staged Speculative Decoding
- Accelerating Large Language Model Decoding with Speculative Sampling
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- Masked Autoencoders Are Scalable Vision Learners
- One Wide Feedforward is All You Need
- A Mean Field View of the Landscape of Two-Layers Neural Networks
- Sharpness-Aware Minimization for Efficiently Improving Generalization
- Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
- The large learning rate phase of deep learning: the catapult mechanism
- Label Noise SGD Provably Prefers Flat Global Minimizers
- Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
- Deep Double Descent: Where Bigger Models and More Data Hurt
- The generalization error of random features regression: Precise asymptotics and double descent curve
- Exploring Generalization in Deep Learning
- Understanding deep learning requires rethinking generalization
- On Exact Computation with an Infinitely Wide Neural Net
- Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- A Simple Framework for Contrastive Learning of Visual Representations
- Big Transfer (BiT): General Visual Representation Learning
- A Fourier Perspective on Model Robustness in Computer Vision
- Certified Adversarial Robustness via Randomized Smoothing
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Vision Transformers are Robust Learners
- On the Adversarial Robustness of Vision Transformers
- Extracting Training Data from Large Language Models
- Supervised Contrastive Learning
- Efficiently Modeling Long Sequences with Structured State Spaces