Three Important Things
1. Wide Two-Layer Neural Networks with Gram Matrix Spectral Properties Enjoy a Linear Convergence Rate
This paper shows theoretically how overparameterization of a two-layer neural network leads to a linear rate of convergence for gradient descent (linear on a logarithmic scale, i.e., the training error really decays exponentially fast in the number of iterations).
The setup of the paper is a standard two-layer neural network of the following form:

$$f(\mathbf{W}, \mathbf{a}, \mathbf{x}) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r \, \sigma(\mathbf{w}_r^\top \mathbf{x}),$$

where $\sigma$ is the ReLU activation, $m$ is the number of hidden nodes, $\mathbf{w}_r$ are the first-layer weights, and $a_r$ are the output weights.
The first layer is initialized with random Gaussians, $\mathbf{w}_r \sim N(\mathbf{0}, \mathbf{I})$, and the second layer is initialized uniformly with either $-1$ or $+1$. The reason the second layer can be this simple is that it is held fixed throughout training, so only the first-layer weights $\mathbf{W}$ are updated by gradient descent.
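To make the setup concrete, the following is a minimal NumPy sketch of this parameterization; the function names and array shapes are illustrative choices, while the $1/\sqrt{m}$ scaling, the Gaussian first layer, and the fixed random-sign output layer follow the paper.

```python
import numpy as np

def init_network(d, m, rng):
    """Initialize the two-layer ReLU network described above.

    First-layer weights w_r ~ N(0, I_d); output weights a_r ~ Unif{-1, +1}.
    Only W is updated during training; a stays fixed.
    """
    W = rng.standard_normal((m, d))       # first layer: random Gaussians
    a = rng.choice([-1.0, 1.0], size=m)   # second layer: random signs, held fixed
    return W, a

def predict(W, a, X):
    """f(x) = (1/sqrt(m)) * sum_r a_r * ReLU(w_r^T x), applied to each row of X."""
    m = W.shape[0]
    hidden = np.maximum(X @ W.T, 0.0)     # ReLU activations, shape (n, m)
    return hidden @ a / np.sqrt(m)        # network outputs u, shape (n,)
```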
The Gram matrix $\mathbf{H}^\infty \in \mathbb{R}^{n \times n}$, defined entrywise by $\mathbf{H}^\infty_{ij} = \mathbb{E}_{\mathbf{w} \sim N(\mathbf{0}, \mathbf{I})}\big[\mathbf{x}_i^\top \mathbf{x}_j \, \mathbb{1}\{\mathbf{w}^\top \mathbf{x}_i \ge 0, \, \mathbf{w}^\top \mathbf{x}_j \ge 0\}\big]$, is the central object of the analysis: its smallest eigenvalue $\lambda_0 = \lambda_{\min}(\mathbf{H}^\infty)$ controls the convergence rate.
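Because $\mathbf{H}^\infty$ is an expectation over the Gaussian initialization, its entries (and hence $\lambda_0$) can be estimated by straightforward Monte Carlo; a minimal sketch, assuming the inputs are stacked as rows of `X` and the sample count is an arbitrary choice:

```python
import numpy as np

def gram_matrix_infty(X, n_samples=10_000, rng=None):
    """Monte Carlo estimate of H^inf_{ij} =
    E_{w ~ N(0, I)}[ x_i^T x_j * 1{w^T x_i >= 0, w^T x_j >= 0} ]."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    W = rng.standard_normal((n_samples, d))   # i.i.d. draws of w
    act = (X @ W.T >= 0).astype(float)        # indicator matrix, shape (n, n_samples)
    return (X @ X.T) * (act @ act.T) / n_samples

# Smallest eigenvalue (lambda_0) of the estimated Gram matrix:
# lambda_0 = np.linalg.eigvalsh(gram_matrix_infty(X))[0]
```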
The main result of the paper states that if the following conditions hold:
- The neural network is initialized in the manner mentioned previously,
- The Gram matrix $\mathbf{H}^\infty$ induced by the ReLU activation and the random initialization has its smallest eigenvalue $\lambda_0$ bounded away from $0$ (this is the key assumption used),
- We set the step size $\eta = O\!\left(\frac{\lambda_0}{n^2}\right)$,
- The number of hidden nodes is at least $m = \Omega\!\left(\frac{n^6}{\lambda_0^4 \delta^3}\right)$,

then with probability at least $1 - \delta$ over the random initialization, the predictions $\mathbf{u}(k)$ at iteration $k$ of gradient descent satisfy $\|\mathbf{u}(k) - \mathbf{y}\|_2^2 \le \left(1 - \frac{\eta \lambda_0}{2}\right)^k \|\mathbf{u}(0) - \mathbf{y}\|_2^2$.
In other words, it converges at a linear rate.
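As a worked example of what this rate means, the bound implies that roughly $k \ge \frac{2}{\eta \lambda_0}\log\frac{\|\mathbf{u}(0)-\mathbf{y}\|_2^2}{\epsilon}$ iterations suffice to drive the squared error below $\epsilon$; the numbers below are made up purely for illustration.

```python
import numpy as np

# Hypothetical values, purely for illustration.
lam0, n = 0.1, 100                 # smallest eigenvalue of H^inf, number of samples
eta = lam0 / n ** 2                # step size of the order suggested by the theorem
init_err, target_err = 50.0, 1e-6  # ||u(0) - y||^2 and the desired squared error

# Smallest k with (1 - eta * lam0 / 2)^k * init_err <= target_err.
k = int(np.ceil(np.log(target_err / init_err) / np.log(1.0 - eta * lam0 / 2.0)))
print(f"predicted iterations to reach squared error {target_err:g}: {k}")
```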
2. Spectral Norm of Gram Matrix Does Not Change Much During Gradient Descent
An essential part of the proof of the main theorem is showing that the smallest-eigenvalue property of the Gram matrix is preserved during training: the empirical Gram matrix $\mathbf{H}(k)$ at iteration $k$ must stay close to its value at initialization so that its smallest eigenvalue remains bounded away from $0$.
Precisely, they showed that if every hidden weight stays close to its initialization, i.e. $\|\mathbf{w}_r(k) - \mathbf{w}_r(0)\|_2 \le R$ for all $r$ and a sufficiently small radius $R$, then with high probability $\|\mathbf{H}(k) - \mathbf{H}(0)\|_2$ is small and $\lambda_{\min}(\mathbf{H}(k)) > \frac{\lambda_0}{2}$.
They then used this result to show that any iterate of gradient descent indeed stays within such a small radius of the initialization (the per-neuron movement shrinks as the width $m$ grows), so the Gram matrix remains well-conditioned along the entire trajectory.
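One quick way to see this stability numerically is to perturb every $\mathbf{w}_r$ within a radius $R$ of its initialization and measure how far the empirical Gram matrix $\mathbf{H}(\mathbf{W})_{ij} = \frac{1}{m}\,\mathbf{x}_i^\top\mathbf{x}_j \sum_{r} \mathbb{1}\{\mathbf{w}_r^\top\mathbf{x}_i \ge 0,\, \mathbf{w}_r^\top\mathbf{x}_j \ge 0\}$ moves in spectral norm; the sizes and radii in this sketch are illustrative, not the paper's constants.

```python
import numpy as np

def gram_matrix(W, X):
    """Empirical Gram matrix induced by the current weights and activation patterns."""
    m = W.shape[0]
    act = (X @ W.T >= 0).astype(float)        # activation patterns, shape (n, m)
    return (X @ X.T) * (act @ act.T) / m

rng = np.random.default_rng(0)
n, d, m = 20, 10, 10_000                      # illustrative problem sizes
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True) # unit-norm inputs, as in the paper
W0 = rng.standard_normal((m, d))
H0 = gram_matrix(W0, X)

for R in (0.01, 0.1, 1.0):
    # Move every w_r by exactly R in Euclidean norm.
    delta = rng.standard_normal((m, d))
    delta *= R / np.linalg.norm(delta, axis=1, keepdims=True)
    drift = np.linalg.norm(gram_matrix(W0 + delta, X) - H0, 2)
    print(f"R={R:<4}: ||H(W) - H(W0)||_2 = {drift:.4f}")
```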
3. Synthetic Data To Validate Theoretical Findings

The authors generated synthetic data to verify their theoretical findings (a small reproduction sketch is given after the list below). They found that:
- Greater widths result in faster convergence,
- Greater widths result in fewer activation pattern changes, which verifies the stability of the Gram matrix,
- Greater widths result in smaller weight changes.
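The following is a minimal sketch of this kind of experiment: synthetic unit-norm inputs with random labels, full-batch gradient descent on the squared loss with the second layer held fixed, and the fraction of flipped activation patterns tracked across widths. All sizes, the step size, and the iteration count are illustrative choices, not the paper's settings.

```python
import numpy as np

def train(m, X, y, eta=1e-2, steps=500, seed=0):
    """Full-batch gradient descent on 0.5 * ||u - y||^2, updating W only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d))              # first layer ~ N(0, I)
    a = rng.choice([-1.0, 1.0], size=m)          # second layer: fixed random signs
    pattern0 = X @ W.T >= 0                      # activation patterns at initialization
    losses = []
    for _ in range(steps):
        pre = X @ W.T                            # pre-activations, shape (n, m)
        u = np.maximum(pre, 0.0) @ a / np.sqrt(m)
        resid = u - y
        losses.append(0.5 * np.sum(resid ** 2))
        # dL/dw_r = (1/sqrt(m)) * a_r * sum_i resid_i * 1{w_r^T x_i >= 0} * x_i
        grad = ((resid[:, None] * (pre >= 0)).T @ X) * a[:, None] / np.sqrt(m)
        W -= eta * grad
    flips = np.mean((X @ W.T >= 0) != pattern0)  # fraction of changed activation patterns
    return losses, flips

rng = np.random.default_rng(1)
n, d = 50, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs
y = rng.standard_normal(n)                       # arbitrary labels

for m in (100, 1_000, 10_000):
    losses, flips = train(m, X, y)
    print(f"m={m:>6}: final loss = {losses[-1]:.4e}, pattern flips = {flips:.3%}")
```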
Most Glaring Deficiency
The assumption that the second layer of the neural network is initialized with random $\pm 1$ values and then held fixed during training is the most restrictive part of the setup: in practice both layers are trained jointly, so it is unclear how much of the analysis carries over to the way such networks are actually trained.
Conclusions for Future Work
Their work provides a further stepping stone to understanding why over-parameterized models perform so well. As the authors mentioned, it may be possible to generalize the results of their approach to deeper neural networks.