ML Reading List
A curated list of papers I have bookmarked to read or have read, each with a short note on why I think it is worth reading.
1. The Emergence of Spectral Universality in Deep Networks
Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli
Studies the singular value distribution of an MLP's Jacobian in the large-width limit
2018-02-27 -
5. The Recurrent Neural Tangent Kernel
Sina Alemohammad, Zichao Wang, Randall Balestriero, Richard Baraniuk
NTK for RNNs
2020-06-18 -
6. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu
Wide NNs behave as linear models
2018-11-21 -
7. The Nonlinearity Coefficient - Predicting Generalization in Deep Neural Networks
George Philipp, Jaime G. Carbonell
Signal propagation in NNs with random weights
2018-06-01 -
8. Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients?
Boris Hanin
Signal propagation in NNs with random weights
2018-01-11 -
9. Gaussian Process Behaviour in Wide Deep Neural Networks
Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, Zoubin Ghahramani
GP behavior of wide nets
2018-04-30 -
10. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli
Shows that if the Jacobian singular value distribution of a wide NN concentrates around 1 even as the network gets deeper, the error signal is largely preserved and all layers receive signal to improve
2017-11-13 -
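To see what dynamical isometry means in code, here is a minimal NumPy sketch (mine, not from the paper) comparing the input-output Jacobian spectrum of a deep linear net under Gaussian vs. orthogonal initialization; the linear case keeps the Jacobian exact:
```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 50

def jacobian_singular_values(init):
    # For a deep *linear* net the input-output Jacobian is exactly the
    # product of the weight matrices, so its SVD is cheap and exact.
    J = np.eye(width)
    for _ in range(depth):
        if init == "gaussian":
            W = rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
        else:
            # Random orthogonal matrix via QR decomposition.
            Q, _ = np.linalg.qr(rng.normal(size=(width, width)))
            W = Q
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

for init in ("gaussian", "orthogonal"):
    s = jacobian_singular_values(init)
    print(f"{init:>10}: max {s.max():.2e}, min {s.min():.2e}")
# Orthogonal init keeps every singular value exactly 1 (dynamical
# isometry); the Gaussian product's spectrum spreads out with depth.
```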
11. Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function
Wojciech Tarnowski, Piotr Warchoł, Stanisław Jastrzębski, Jacek Tabor, Maciej A. Nowak
Random matrix theory applied to DL
2018-09-24 -
12. Spectrum concentration in deep residual learning: a free probability approach
Zenan Ling, Xing He, Robert C. Qiu
Random matrix theory applied to DL
2018-07-31 -
13. Tensor Programs II: Neural Tangent Kernel for Any Architecture
Greg Yang
Provides a good overview of the NTK, with some nice discussion of the Gradient Independence Assumption (GIA). Extends the TP I language to RNNs
2020-06-25 -
14. Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes
Greg Yang
Framework for analyzing modern NN architectures. Pretty notation heavy though.
2019-10-28 -
15. Analysis of Boolean Functions
Ryan O'Donnell
Contains results on Hermite polynomials that can be useful for DL
2021-05-21 -
16. A Mean Field Theory of Batch Normalization
Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz
2019-02-21 -
17. The dynamics of message passing on dense graphs, with applications to compressed sensing
Mohsen Bayati, Andrea Montanari
Introduces a useful Gaussian conditioning technique
2010-01-20 -
19. Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity
Amit Daniely, Roy Frostig, Yoram Singer
2016-02-18 -
20. Message Passing Algorithms for Compressed Sensing
David L. Donoho, Arian Maleki, Andrea Montanari
Approximate message passing
2009-07-21 -
21. Path Integral Approach to Random Neural Networks
A. Crisanti, H. Sompolinsky
Classic work on random spiking networks
2018-09-17 -
22. Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach
Ryo Karakida, Shotaro Akaho, Shun-ichi Amari
Spectral properties of the empirical Fisher information matrix of random NNs
2018-06-04 -
23. Fisher Information and Natural Gradient Learning of Random Deep Networks
Shun-ichi Amari, Ryo Karakida, Masafumi Oizumi
Spectral properties of the empirical Fisher information matrix of random NNs
2018-08-22 -
24. Mean Field Residual Networks: On the Edge of Chaos
Greg Yang, Samuel S. Schoenholz
Edge of chaos MFT
2017-12-24 -
25. Deep Information Propagation
Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, Jascha Sohl-Dickstein
Mean field theory for backprop
2016-11-04 -
26. Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks
Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington
Training a 10k-layer CNN without batchnorm or skip connections
2018-06-14 -
27. The boundary of neural network trainability is fractal
Jascha Sohl-Dickstein
The boundary between trainable and untrainable NN hyperparameter regions is fractal; nice visualizations
2024-02-09 -
28. Exponential expressivity in deep neural networks through transient chaos
Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, Surya Ganguli
Signal propagation in NNs, NN expressivity
2016-06-16 -
29. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
He initialization
2015-02-06 -
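For reference, a quick sketch of He initialization; the layer sizes below are arbitrary:
```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # He et al.: for ReLU layers, draw weights with variance 2 / fan_in
    # so activation variance is preserved through the rectifier.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W = he_init(1024, 512)
print(W.std())  # ~ sqrt(2/1024) ~ 0.044
```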
30. Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes
Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia et al.
Shows the equivalence between infinitely wide CNNs and GPs
2018-10-11 -
31. Deep Neural Networks as Gaussian Processes
Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein
Scaling limit for infinite width MLPs
2017-11-01 -
32. Steps Toward Deep Kernel Methods from Infinite Neural Networks
Tamir Hazan, Tommi Jaakkola
GP behavior of wide NNs under other conditions
2015-08-20 -
33. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy
Batchnorm paper
2015-02-11 -
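The batchnorm forward pass is small enough to sketch inline (training-mode batch statistics only; the running-average inference path is omitted):
```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension, then apply the
    # learnable affine transform (gamma, beta), as in Ioffe & Szegedy.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(3.0, 5.0, size=(64, 10))
y = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 and ~1
```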
34. DeepSeek-V3 Technical Report
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu et al.
First open model to beat SOTA closed models
2024-12-27 -
35. Feature Learning in Infinite-Width Neural Networks
Greg Yang, Edward J. Hu
Shows that NN parameterizations (standard, mean-field, NTK) either exhibit feature learning or have infinite-width training dynamics given by kernel gradient descent
2020-11-30 -
36. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan et al.
Uses autoformalization to create training data
2024-05-23 -
37. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao et al.
Introduces Multi-head Latent Attention (MLA)
2024-05-07 -
38. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang et al.
Introduced Group Relative Policy Optimization (GRPO)
2024-02-05 -
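A minimal sketch of the group-relative advantage that gives GRPO its name; the PPO-style clipped surrogate objective that consumes these advantages is omitted, and the reward values below are made up:
```python
import numpy as np

def grpo_advantages(rewards):
    # GRPO's core trick: sample a group of completions per prompt and
    # use the group's own reward statistics as the baseline, so no
    # separate value/critic network is needed.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, four sampled completions scored by a reward model:
print(grpo_advantages([0.1, 0.9, 0.4, 0.6]))
```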
39. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel
Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli
Benefits of feature learning
2020-10-28 -
40. On Lazy Training in Differentiable Programming
Lenaic Chizat, Edouard Oyallon, Francis Bach
Shows feature learning is usually beneficial in practical large-scale DL settings
2018-12-19 -
42. Disentangling feature and lazy training in deep neural networks
Mario Geiger, Stefano Spigler, Arthur Jacot, Matthieu Wyart
Feature learning limit for NN dynamics
2019-06-19 -
43. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit
Song Mei, Theodor Misiakiewicz, Andrea Montanari
Introduces mean-field theory for analyzing NN training
2019-02-16 -
44. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington
Dynamics of NTK in parameter space
2019-02-18 -
45. Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Arthur Jacot, Franck Gabriel, Clément Hongler
The paper that introduced the NTK; a nice kernel viewpoint of how NNs evolve during training in the infinite-width limit
2018-06-20 -
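To connect the kernel viewpoint to code: the empirical NTK is K(x, x') = ⟨∇_θ f(x), ∇_θ f(x')⟩, sketched below on a toy MLP with finite differences standing in for autodiff (the network and sizes are my own toy choices):
```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 16
theta = np.concatenate([rng.normal(0, 1 / np.sqrt(d), d * h),
                        rng.normal(0, 1 / np.sqrt(h), h)])

def f(theta, x):
    # Tiny one-hidden-layer MLP with scalar output.
    W1 = theta[:d * h].reshape(d, h)
    w2 = theta[d * h:]
    return np.tanh(x @ W1) @ w2

def grad(theta, x, eps=1e-5):
    # Central finite differences as a stand-in for autodiff.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
    return g

xs = rng.normal(size=(4, d))
G = np.stack([grad(theta, x) for x in xs])
K = G @ G.T  # empirical NTK Gram matrix on 4 inputs
print(K.round(3))
```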
46. On the distance between two neural networks and the stability of learning
Jeremy Bernstein, Arash Vahdat, Yisong Yue, Ming-Yu Liu
Derives a spectral analysis of feature learning based on perturbation bounds, but obtains the wrong scaling relation with network width due to a flawed conditioning assumption on the gradients
2020-02-09 -
47. A Spectral Condition for Feature Learning
Greg Yang, James B. Simon, Jeremy Bernstein
Shows that using the spectral norm instead of the Frobenius norm to analyze how NNs change during training is the right framing. A good summary and generalization of the earlier Tensor Programs series of work.
2023-10-26 -
48. Trainability and Accuracy of Neural Networks: An Interacting Particle System Approach
Grant M. Rotskoff, Eric Vanden-Eijnden
2018-05-02 -
49. Limitations of the NTK for Understanding Generalization in Deep Learning
Nikhil Vyas, Yamini Bansal, Preetum Nakkiran
2022-06-20 -
50. Spectral Normalization for Generative Adversarial Networks
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida
2018-02-16 -
51. Layer rotation: a surprisingly powerful indicator of generalization in deep networks?
Simon Carbonnelle, Christophe De Vleeschouwer
2018-06-05 -
52. Learning by Turning: Neural Architecture Aware Optimisation
Yang Liu, Jeremy Bernstein, Markus Meister, Yisong Yue
2021-02-14 -
53. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen et al.
2023-12-14 -
54. Scaling Data-Constrained Language Models
Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo et al.
2023-05-25 -
55. Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li
2023-11-06 -
56. TIES-Merging: Resolving Interference When Merging Models
Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, Mohit Bansal
2023-06-02 -
57. Editing Models with Task Arithmetic
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, Ali Farhadi
2022-12-08 -
58. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao
2024-01-19 -
59. AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification
Ronghui You, Zihan Zhang, Ziye Wang, Suyang Dai, Hiroshi Mamitsuka, Shanfeng Zhu
2018-11-01 -
60. Jasper and Stella: distillation of SOTA embedding models
Dun Zhang, Jiacheng Li, Ziyang Zeng, Fulong Wang
2024-12-26 -
61. YaRN: Efficient Context Window Extension of Large Language Models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole
2023-08-31 -
63. How Does Critical Batch Size Scale in Pre-training?
Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster et al.
2024-10-29 -
64. Better & Faster Large Language Models via Multi-token Prediction
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
2024-04-30 -
66. Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang et al.
2024-04-16 -
67. Recursive Introspection: Teaching Language Model Agents How to Self-Improve
Yuxiao Qu, Tianjun Zhang, Naman Garg, Aviral Kumar
2024-07-25 -
68. WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
2024-01-22 -
70. Combining Induction and Transduction for Abstract Reasoning
Wen-Ding Li, Keya Hu, Carter Larsen, Yuqing Wu, Simon Alford, Caleb Woo, Spencer M. Dunn et al.
2024-11-04 -
71. Phi-4 Technical Report
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison et al.
2024-12-12 -
72. The Impact of Positional Encoding on Length Generalization in Transformers
Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, Siva Reddy
2023-05-31 -
73. RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, Gabriel Synnaeve
2024-10-02 -
74. Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause
2024-10-10 -
75. HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation
Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, Wei Liu
2024-10-28 -
76. Stream of Search (SoS): Learning to Search in Language
Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, Noah D. Goodman
2024-04-01 -
77. Quadratic models for understanding catapult dynamics of neural networks
Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin
2022-05-24 -
78. Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning
Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin
2023-06-07 -
79. $μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
Benjamin Thérien, Charles-Étienne Joseph, Boris Knyazev, Edouard Oyallon, Irina Rish, Eugene Belilovsky
2024-05-31 -
80. VeLO: Training Versatile Learned Optimizers by Scaling Up
Luke Metz, James Harrison, C. Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal et al.
2022-11-17 -
81. Probing the Decision Boundaries of In-context Learning in Large Language Models
Siyan Zhao, Tung Nguyen, Aditya Grover
2024-06-17 -
82. Deep Policy Gradient Methods Without Batch Updates, Target Networks, or Replay Buffers
Gautham Vasan, Mohamed Elsayed, Alireza Azimi, Jiamin He, Fahim Shariar, Colin Bellinger, Martha White et al.
2024-11-22 -
83. VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search
David Brandfonbrener, Simon Henniger, Sibi Raja, Tarun Prasad, Chloe Loughridge, Federico Cassano, Sabrina Ruixin Hu et al.
2024-02-13 -
84. xLSTM: Extended Long Short-Term Memory
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer et al.
2024-05-07 -
85. Memory-Efficient LLM Training with Online Subspace Descent
Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu
2024-08-23 -
86. Efficient Model-Free Exploration in Low-Rank MDPs
Zakaria Mhammedi, Adam Block, Dylan J. Foster, Alexander Rakhlin
2023-07-08 -
87. On the Computational Landscape of Replicable Learning
Alkis Kalavasis, Amin Karbasi, Grigoris Velegkas, Felix Zhou
2024-05-24 -
88. TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications
David Salinas, Nick Erickson
2023-11-06 -
89. TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
Noah Hollmann, Samuel Müller, Katharina Eggensperger, Frank Hutter
2022-07-05 -
91. Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi
2024-06-03 -
92. Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder et al.
2022-03-07 -
93. Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices
Andres Potapczynski, Shikai Qiu, Marc Finzi, Christopher Ferri, Zixi Chen, Micah Goldblum, Bayan Bruss et al.
2024-10-03 -
94. Compute Better Spent: Replacing Dense Layers with Structured Matrices
Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson
2024-06-10 -
95. CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra
Andres Potapczynski, Marc Finzi, Geoff Pleiss, Andrew Gordon Wilson
2023-09-06 -
96. Theoretical Foundations of Conformal Prediction
Anastasios N. Angelopoulos, Rina Foygel Barber, Stephen Bates
2024-11-18 -
97. AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning
Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis N. Ioannidis et al.
2024-06-17 -
98. Mechanism of feature learning in convolutional neural networks
Daniel Beaglehole, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin
2023-09-01 -
99. Average gradient outer product as a mechanism for deep neural collapse
Daniel Beaglehole, Peter Súkeník, Marco Mondelli, Mikhail Belkin
2024-02-21 -
100. Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel
Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma
2018-10-12 -
101. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation
Mikhail Belkin
2021-05-29 -
102. The duality structure gradient descent algorithm: analysis and applications to neural networks
Thomas Flynn
2017-08-01 -
104. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio
2014-06-10 -
105. Investigating the Limitations of Transformers with Simple Arithmetic Tasks
Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin
2021-02-25 -
106. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou et al.
2023-08-03 -
107. Evaluating Language Models for Mathematics through Interactions
Katherine M. Collins, Albert Q. Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz et al.
2023-06-02 -
108. Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks
Tiedong Liu, Bryan Kian Hsiang Low
2023-05-23 -
109. Continual Pre-Training of Large Language Models: How to (re)warm your model?
Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish et al.
2023-08-08 -
110. Functional Data Analysis: An Introduction and Recent Developments
Jan Gertheiss, David Rügamer, Bernard X. W. Liew, Sonja Greven
2023-12-09 -
112. How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis
Guan Zhe Hong, Nishanth Dikkala, Enming Luo, Cyrus Rashtchian, Xin Wang, Rina Panigrahy
2024-11-06 -
113. LoRA vs Full Fine-tuning: An Illusion of Equivalence
Reece Shuttleworth, Jacob Andreas, Antonio Torralba, Pratyusha Sharma
2024-10-28 -
114. An Empirical Model of Large-Batch Training
Sam McCandlish, Jared Kaplan, Dario Amodei, OpenAI Dota Team
2018-12-14 -
116. ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate
Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama et al.
2024-11-05 -
117. Denoising Diffusion Probabilistic Models in Six Simple Steps
Richard E. Turner, Cristiana-Diana Diaconu, Stratis Markou, Aliaksandra Shysheya, Andrew Y. K. Foong, Bruno Mlodozeniec
2024-02-06 -
118. Understanding Optimization in Deep Learning with Central Flows
Jeremy M. Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, Jason D. Lee
2024-10-31 -
120. Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis
Rachel S. Y. Teo, Tan M. Nguyen
2024-06-19 -
121. Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma
2024-10-07 -
124. Mixture of Parrots: Experts improve memorization more than reasoning
Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis et al.
2024-10-24 -
125. The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities
Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid
2024-08-23 -
127. nGPT: Normalized Transformer with Representation Learning on the Hypersphere
Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, Boris Ginsburg
2024-10-01 -
128. Why Do We Need Weight Decay in Modern Deep Learning?
Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, Nicolas Flammarion
2023-10-06 -
130. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning
Stefan Elfwing, Eiji Uchibe, Kenji Doya
2017-02-10 -
131. Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal et al.
2024-10-10 -
132. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, Saining Xie
2024-10-09 -
133. Generative Verifiers: Reward Modeling as Next-Token Prediction
Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal
2024-08-27 -
134. Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün et al.
2024-02-22 -
135. Round and Round We Go! What makes Rotary Positional Encodings useful?
Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković
2024-10-08 -
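Since positional encodings come up repeatedly in this list, a small sketch of the rotary rotation itself; note this uses the common split-halves convention rather than the original interleaved-pairs formulation:
```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate pairs of dimensions of a query/key vector by position-
    # dependent angles; relative offsets then appear as relative
    # rotations inside the attention dot product.
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)  # theta_i = base^(-2i/d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], -1)

q = np.ones(8)
print(rope(q, pos=0))  # position 0 is left unrotated
print(rope(q, pos=5))
```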
136. Large Language Models as Markov Chains
Oussama Zekri, Ambroise Odonnat, Abdelhakim Benechehab, Linus Bleistein, Nicolas Boullé, Ievgen Redko
2024-10-03 -
137. Searching for Best Practices in Retrieval-Augmented Generation
Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi et al.
2024-07-01 -
138. Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers
Alicia Curth, Alan Jeffares, Mihaela van der Schaar
2024-02-02 -
139. Reinforced Self-Training (ReST) for Language Modeling
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant et al.
2023-08-17 -
140. eGAD! double descent is explained by Generalized Aliasing Decomposition
Mark K. Transtrum, Gus L. W. Hart, Tyler J. Jarvis, Jared P. Whitehead
2024-08-15 -
141. A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning
Alicia Curth, Alan Jeffares, Mihaela van der Schaar
2023-10-29 -
144. Were RNNs All We Needed?
Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, Hossein Hajimirsadeghi
2024-10-02 -
145. Instruction Following without Instruction Tuning
John Hewitt, Nelson F. Liu, Percy Liang, Christopher D. Manning
2024-09-21 -
146. Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, Chuang Gan
2024-03-14 -
147. From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov et al.
2024-06-24 -
148. Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma
2024-02-20 -
149. Hardware Acceleration of LLMs: A comprehensive survey and comparison
Nikoletta Koilia, Christoforos Kachris
2024-09-05 -
150. Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering
Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, Qing Qu
2024-09-04 -
151. White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai et al.
2023-11-22 -
152. Just Say the Name: Online Continual Learning with Category Names Only via Data Generation
Minhyuk Seo, Seongwon Cho, Minjae Lee, Diganta Misra, Hyeonbeom Choi, Seon Joo Kim, Jonghyun Choi
2024-03-16 -
153. Magicoder: Empowering Code Generation with OSS-Instruct
Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang
2023-12-04 -
154. Code Llama: Open Foundation Models for Code
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi et al.
2023-08-24 -
155. #InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models
Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou et al.
2023-08-14 -
156. Equivariant neural networks and piecewise linear representation theory
Joel Gibson, Daniel Tubbenhauer, Geordie Williamson
2024-08-01 -
157. Does your data spark joy? Performance gains from domain upsampling at the end of training
Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle
2024-06-05 -
158. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
2024-08-06 -
159. Reconciling modern machine learning practice and the bias-variance trade-off
Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal
2018-12-28 -
160. Moment Matching for Multi-Source Domain Adaptation
Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, Bo Wang
2018-12-04 -
162. How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
Divyansh Kaushik, Zachary C. Lipton
2018-08-14 -
163. Do ImageNet Classifiers Generalize to ImageNet?
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, Vaishaal Shankar
2019-02-13 -
164. The Effect of Natural Distribution Shift on Question Answering Models
John Miller, Karl Krauth, Benjamin Recht, Ludwig Schmidt
2020-04-29 -
165. Causal inference using invariant prediction: identification and confidence intervals
Jonas Peters, Peter Bühlmann, Nicolai Meinshausen
2015-01-06 -
166. Measuring the Intrinsic Dimension of Objective Landscapes
Chunyuan Li, Heerad Farkhoor, Rosanne Liu, Jason Yosinski
2018-04-24 -
167. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta
2017-07-10 -
168. No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems
Nimit S. Sohoni, Jared A. Dunnmon, Geoffrey Angus, Albert Gu, Christopher Ré
2020-11-25 -
169. Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization
Elan Rosenfeld, Pradeep Ravikumar, Andrej Risteski
2022-02-14 -
170. In-Context Learning Learns Label Relationships but Is Not Conventional Learning
Jannik Kossen, Yarin Gal, Tom Rainforth
2023-07-23 -
171. Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak et al.
2024-04-01 -
172. Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer et al.
2022-07-11 -
174. Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations
Polina Kirichenko, Pavel Izmailov, Andrew Gordon Wilson
2022-04-06 -
175. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
2022-10-31 -
176. The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch
2024-06-03 -
177. When Representations Align: Universality in Representation Learning Dynamics
Loek van Rossem, Andrew M. Saxe
2024-02-14 -
178. Scaling Synthetic Data Creation with 1,000,000,000 Personas
Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu
2024-06-28 -
179. A Theory of Interpretable Approximations
Marco Bressan, Nicolò Cesa-Bianchi, Emmanuel Esposito, Yishay Mansour, Shay Moran, Maximilian Thiessen
2024-06-15 -
180. BPO: Staying Close to the Behavior LLM Creates Better Online LLM Alignment
Wenda Xu, Jiachen Li, William Yang Wang, Lei Li
2024-06-18 -
181. A Tutorial on Thompson Sampling
Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen
2017-07-07 -
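A self-contained sketch of Thompson sampling on a Bernoulli bandit with Beta priors, the classic conjugate setup (the arm means below are made up):
```python
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.3, 0.5, 0.7]               # unknown Bernoulli arm means
alpha = np.ones(3); beta = np.ones(3)  # Beta(1, 1) priors per arm

for t in range(2000):
    samples = rng.beta(alpha, beta)    # one posterior draw per arm
    arm = int(np.argmax(samples))      # act greedily w.r.t. the draw
    reward = rng.random() < true_p[arm]
    alpha[arm] += reward               # conjugate posterior update
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))  # posterior means; the best arm dominates
```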
182. Beyond the Black Box: A Statistical Model for LLM Reasoning and Inference
Siddhartha Dalal, Vishal Misra
2024-02-05 -
183. Transcendence: Generative Models Can Outperform The Experts That Train Them
Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade et al.
2024-06-17 -
184. Step-by-Step Diffusion: An Elementary Tutorial
Preetum Nakkiran, Arwen Bradley, Hattie Zhou, Madhu Advani
2024-06-13 -
185. Text Embeddings Reveal (Almost) As Much As Text
John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, Alexander M. Rush
2023-10-10 -
186. Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks
Giovanni Luca Marchetti, Christopher Hillar, Danica Kragic, Sophia Sanborn
2023-12-13 -
187. Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Sander Land, Max Bartolo
2024-05-08 -
188. Customizing Text-to-Image Models with a Single Image Pair
Maxwell Jones, Sheng-Yu Wang, Nupur Kumari, David Bau, Jun-Yan Zhu
2024-05-02 -
189. Self-Play Preference Optimization for Language Model Alignment
Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu
2024-05-01 -
190. Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
Shreya Shankar, J. D. Zamfirescu-Pereira, Björn Hartmann, Aditya G. Parameswaran, Ian Arawjo
2024-04-18 -
193. How Can We Know What Language Models Know?
Zhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig
2019-11-28 -
194. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam et al.
2017-12-15 -
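The affine quantize/dequantize mapping at the heart of integer-arithmetic-only inference, sketched per-tensor in NumPy (the simplest asymmetric variant):
```python
import numpy as np

def affine_quantize(x, num_bits=8):
    # Map a float tensor to uint8 with a scale and zero point.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.default_rng(0).normal(size=100).astype(np.float32)
q, s, z = affine_quantize(x)
print(np.abs(x - dequantize(q, s, z)).max())  # small quantization error
```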
195. Why do tree-based models still outperform deep learning on tabular data?
Léo Grinsztajn, Edouard Oyallon, Gaël Varoquaux
2022-07-18 -
196. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models
Emilie Kaufmann, Olivier Cappé, Aurélien Garivier
2014-07-16 -
198. Black-box Dataset Ownership Verification via Backdoor Watermarking
Yiming Li, Mingyan Zhu, Xue Yang, Yong Jiang, Tao Wei, Shu-Tao Xia
2022-08-04 -
200. Efficient Training of Language Models to Fill in the Middle
Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, Mark Chen
2022-07-28 -
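A sketch of the fill-in-the-middle data transformation in prefix-suffix-middle order; the sentinel strings are placeholders, since real sentinel tokens depend on the tokenizer:
```python
import random

def to_fim(document, sentinels=("<PRE>", "<SUF>", "<MID>")):
    # Cut the document into (prefix, middle, suffix) and move the middle
    # to the end, so a plain left-to-right LM learns to infill.
    i, j = sorted(random.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    pre, suf, mid = sentinels
    return f"{pre}{prefix}{suf}{suffix}{mid}{middle}"

print(to_fim("def add(a, b):\n    return a + b\n"))
```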
201. From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn
2024-04-18 -
202. Large Language Models Can Self-Improve
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han
2022-10-20 -
203. How to Train Data-Efficient LLMs
Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee et al.
2024-02-15 -
204. Co-training Improves Prompt-based Learning for Large Language Models
Hunter Lang, Monica Agrawal, Yoon Kim, David Sontag
2022-02-02 -
205. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden et al.
2023-06-07 -
208. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
Yuanzhi Li, Yingyu Liang
2018-08-03 -
209. A Convergence Theory for Deep Learning via Over-Parameterization
Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song
2018-11-09 -
210. Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian
Samet Oymak, Zalan Fabian, Mingchen Li, Mahdi Soltanolkotabi
2019-06-12 -
211. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta
2020-12-22 -
212. STaR: Bootstrapping Reasoning With Reasoning
Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman
2022-03-28 -
213. Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks
Hao Chen, Jindong Wang, Ankit Shah, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama et al.
2023-09-29 -
214. The Barron Space and the Flow-induced Function Spaces for Neural Network Models
Weinan E, Chao Ma, Lei Wu
2019-06-18 -
215. Deep Equilibrium Based Neural Operators for Steady-State PDEs
Tanya Marwah, Ashwini Pokle, J. Zico Kolter, Zachary C. Lipton, Jianfeng Lu, Andrej Risteski
2023-11-30 -
216. Parametric Complexity Bounds for Approximating PDEs with Neural Networks
Tanya Marwah, Zachary C. Lipton, Andrej Risteski
2021-03-03 -
217. Simple linear attention language models balance the recall-throughput tradeoff
Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou et al.
2024-02-28 -
219. Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti
2024-03-14 -
220. Evolutionary Optimization of Model Merging Recipes
Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, David Ha
2024-03-19 -
222. Deep Neural Networks Tend To Extrapolate Predictably
Katie Kang, Amrith Setlur, Claire Tomlin, Sergey Levine
2023-10-02 -
223. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi et al.
2024-03-05 -
224. Is Cosine-Similarity of Embeddings Really About Similarity?
Harald Steck, Chaitanya Ekanadham, Nathan Kallus
2024-03-08 -
225. Language Modeling with Gated Convolutional Networks
Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier
2016-12-23 -
228. Asymmetry in Low-Rank Adapters of Foundation Models
Jiacheng Zhu, Kristjan Greenewald, Kimia Nadjahi, Haitz Sáez de Ocáriz Borde, Rickard Brüel Gabrielsson, Leshem Choshen, Marzyeh Ghassemi et al.
2024-02-26 -
230. In-Context Learning for Extreme Multi-Label Classification
Karel D'Oosterlinck, Omar Khattab, François Remy, Thomas Demeester, Chris Develder, Christopher Potts
2024-01-22 -
231. Matryoshka Representation Learning
Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder et al.
2022-05-26 -
232. Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization
Elan Rosenfeld, Andrej Risteski
2023-11-07 -
233. Diffusion Models for Generative Artificial Intelligence: An Introduction for Applied Mathematicians
Catherine F. Higham, Desmond J. Higham, Peter Grindrod
2023-12-21 -
234. In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann et al.
2022-09-24 -
235. Characterizing Implicit Bias in Terms of Optimization Geometry
Suriya Gunasekar, Jason Lee, Daniel Soudry, Nathan Srebro
2018-02-22 -
236. Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li et al.
2022-10-20 -
237. Textbooks Are All You Need II: phi-1.5 technical report
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee
2023-09-11 -
238. Data Selection for Language Models via Importance Resampling
Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang
2023-02-06 -
239. Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
2023-05-29 -
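The DPO loss itself fits in a few lines; a sketch with made-up sequence log-probs:
```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO objective: -log sigmoid(beta * (policy margin - ref margin)),
    # where each margin is the log-prob difference between the chosen
    # (w) and rejected (l) responses under that model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# Sequence log-probs (sums over tokens) for one preference pair:
print(dpo_loss(logp_w=-45.0, logp_l=-52.0,
               ref_logp_w=-47.0, ref_logp_l=-50.0))
```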
240. Data Movement Is All You Need: A Case Study on Optimizing Transformers
Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler
2020-06-30 -
241. Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars
Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski
2023-12-03 -
243. Accelerating Large Language Model Decoding with Speculative Sampling
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper
2023-02-02 -
244. Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias
2022-11-30 -
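This entry and the previous one share the same accept/resample rule, sketched below; it keeps the output exactly distributed as the target model p:
```python
import numpy as np

def speculative_accept(p, q, draft_token, rng=np.random.default_rng(0)):
    # p: target-model distribution, q: draft-model distribution over the
    # vocabulary at this position. Accept the draft token with prob
    # min(1, p/q); on rejection, resample from the residual max(0, p-q).
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

p = np.array([0.1, 0.6, 0.3])  # made-up target distribution
q = np.array([0.3, 0.4, 0.3])  # made-up draft distribution
print(speculative_accept(p, q, draft_token=1))  # usually accepted
```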
245. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer
2022-08-15 -
246. Masked Autoencoders Are Scalable Vision Learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
2021-11-11 -
247. One Wide Feedforward is All You Need
Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan
2023-09-04 -
248. A Mean Field View of the Landscape of Two-Layers Neural Networks
Song Mei, Andrea Montanari, Phan-Minh Nguyen
2018-04-18 -
249. Sharpness-Aware Minimization for Efficiently Improving Generalization
Pierre Foret, Ariel Kleiner, Hossein Mobahi, Behnam Neyshabur
2020-10-03 -
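One SAM update on a toy quadratic loss, sketched in NumPy (step sizes are arbitrary):
```python
import numpy as np

def sam_step(theta, grad_fn, lr=0.1, rho=0.05):
    # SAM: step by rho along the normalized gradient to the approximate
    # worst-case neighbor, then apply the gradient computed *there*
    # to the original parameters.
    g = grad_fn(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_adv = grad_fn(theta + eps)
    return theta - lr * g_adv

# Toy loss L(theta) = 0.5 * ||theta||^2, so grad_fn is the identity.
theta = np.array([2.0, -1.0])
for _ in range(5):
    theta = sam_step(theta, grad_fn=lambda t: t)
print(theta)
```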
250. Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar
2021-02-26 -
251. The large learning rate phase of deep learning: the catapult mechanism
Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari
2020-03-04 -
254. Deep Double Descent: Where Bigger Models and More Data Hurt
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever
2019-12-04 -
255. The generalization error of random features regression: Precise asymptotics and double descent curve
Song Mei, Andrea Montanari
2019-08-14 -
256. Exploring Generalization in Deep Learning
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro
2017-06-27 -
257. Understanding deep learning requires rethinking generalization
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals
2016-11-10 -
258. On Exact Computation with an Infinitely Wide Neural Net
Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang
2019-04-26 -
259. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang
2018-11-12 -
260. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Jonathan Frankle, Michael Carbin
2018-03-09 -
261. A Simple Framework for Contrastive Learning of Visual Representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
2020-02-13 -
262. Big Transfer (BiT): General Visual Representation Learning
Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby
2019-12-24 -
263. A Fourier Perspective on Model Robustness in Computer Vision
Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D. Cubuk, Justin Gilmer
2019-06-21 -
264. Certified Adversarial Robustness via Randomized Smoothing
Jeremy M Cohen, Elan Rosenfeld, J. Zico Kolter
2019-02-08 -
265. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al.
2020-10-22 -
267. On the Adversarial Robustness of Vision Transformers
Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, Cho-Jui Hsieh
2021-03-29 -
268. Extracting Training Data from Large Language Models
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts et al.
2020-12-14 -
269. Supervised Contrastive Learning
Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot et al.
2020-04-23 -
270. Efficiently Modeling Long Sequences with Structured State Spaces
Albert Gu, Karan Goel, Christopher Ré
2021-10-31