Three Important Things

1. Wide Two-Layer Neural Networks with Gram Matrix Spectral Properties Enjoy a Linear Convergence Rate

This paper shows theoretically how overparameterizing a two-layer neural network can give gradient descent a linear rate of convergence (linear once logarithms are taken, i.e. the training error really shrinks exponentially fast).

The setup of the paper is a standard two-layer neural network of the following form:

\[f(\mathbf{W}, \mathbf{a}, \mathbf{x})=\frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_{r} \sigma\left(\mathbf{w}_{r}^{\top} \mathbf{x}\right)\]

The first-layer weights are initialized as random Gaussians, and the second-layer weights are initialized uniformly at random to either \(-1\) or \(+1\) and kept fixed during training, so only the first layer is optimized. The \(\pm 1\) initialization is used because \(a_r^2 = 1\), which simplifies the entries of the Gram matrix of the first-layer gradients.

The Gram matrix \(\bH\) referred to in the paper is defined as the product of the Jacobian of the network outputs with respect to \(\mathbf{W}\) and its transpose, i.e. \(\bH = \nabla_W f \, \nabla_W f^{\top}\), so that \(\bH_{ij} = \langle \nabla_W f(\mathbf{x}_i), \nabla_W f(\mathbf{x}_j) \rangle\). This is similar to the kernel in the neural tangent kernel paper. Note that the eigenvalues of the Gram matrix are always non-negative since the Gram matrix is positive semi-definite.
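To make this concrete, here is a minimal numpy sketch of the finite-width Gram matrix at a random initialization. It assumes the entrywise form \(\bH_{ij} = \frac{1}{m} \mathbf{x}_i^{\top} \mathbf{x}_j \sum_{r=1}^{m} \mathbb{I}\{\mathbf{w}_r^{\top} \mathbf{x}_i \geq 0, \mathbf{w}_r^{\top} \mathbf{x}_j \geq 0\}\), which is what the inner product of gradients works out to for the ReLU network above with \(a_r \in \{\pm 1\}\); the sizes and seed are arbitrary illustrative choices, not the paper's.

```python
import numpy as np

def gram_matrix(W, X):
    """Finite-width Gram matrix H(W) of a two-layer ReLU network.

    H_ij = (1/m) * x_i^T x_j * sum_r 1{w_r^T x_i >= 0} * 1{w_r^T x_j >= 0},
    i.e. the inner product of the gradients of f w.r.t. W at x_i and x_j
    when a_r in {-1, +1} (a_r^2 = 1 drops out).
    W has shape (m, d); X has shape (n, d).
    """
    m = W.shape[0]
    act = (X @ W.T >= 0).astype(float)        # (n, m) ReLU activation patterns
    return (X @ X.T) * (act @ act.T) / m      # entrywise product of the two kernels

# Illustrative sizes (not the paper's): n points on the unit sphere in R^d.
rng = np.random.default_rng(0)
n, d, m = 20, 5, 1000
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W0 = rng.normal(size=(m, d))                  # w_r ~ N(0, I)
H0 = gram_matrix(W0, X)
print("lambda_min(H(0)) =", np.linalg.eigvalsh(H0).min())  # always >= 0 (PSD)
```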

The main result of the paper states that if the following conditions hold:

  1. The neural network is initialized in the manner mentioned previously,
  2. The Gram matrix induced by the ReLU activation and the random initialization distribution (the matrix \(\bH^{\infty}\) defined below) has its smallest eigenvalue \(\lambda_0\) bounded away from 0 (this is the key assumption),
  3. We set step size \(\eta = O \left( \frac{\lambda_0}{n^2} \right)\),
  4. The number of hidden nodes is at least \(m = \Omega \left( \frac{n^6}{ \lambda_0^4 \delta^3 } \right)\),

then with probability at least \(1-\delta\) over the random initialization, for every gradient descent step \(k\), the gap between the network's predictions \(\bu(k)\) and the labels \(\by\) can be bounded:

\[\| \bu(k) - \by \|_2^2 \leq \left( 1 - \frac{\eta \lambda_0}{2} \right)^k \| \bu(0) - \by \|_2^2.\]

In other words, the training error converges to zero at a linear rate.
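As a sanity check on the statement (not the paper's experiment), the sketch below runs full-batch gradient descent on the squared loss of the network above, training only \(\mathbf{W}\) with \(\mathbf{a}\) fixed at its \(\pm 1\) initialization. The width, step size, data, and number of steps are arbitrary toy choices; the printed squared error should shrink roughly geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 5, 2000
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)                          # arbitrary labels

W = rng.normal(size=(m, d))                     # w_r ~ N(0, I)
a = rng.choice([-1.0, 1.0], size=m)             # a_r ~ unif{-1, +1}, held fixed

eta = 0.1                                       # toy step size, not the paper's eta = O(lambda_0 / n^2)
for k in range(1001):
    pre = X @ W.T                               # (n, m) pre-activations w_r^T x_i
    u = (np.maximum(pre, 0) @ a) / np.sqrt(m)   # u_i = f(W, a, x_i)
    if k % 200 == 0:
        print(k, np.sum((u - y) ** 2))          # ||u(k) - y||_2^2, should decay geometrically
    # gradient of (1/2) * ||u - y||^2 w.r.t. W, using the a.e. derivative of ReLU
    grad = (((pre > 0) * ((u - y)[:, None] * a[None, :])).T @ X) / np.sqrt(m)
    W -= eta * grad
```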

2. The Gram Matrix Does Not Change Much in Spectral Norm During Gradient Descent

An essential part of the proof of the main theorem is showing that the smallest-eigenvalue condition on the Gram matrix \(\bH\) continues to hold throughout gradient descent. They showed this by first showing that the randomly initialized \(\bH(0)\) is close in spectral norm to \(\bH^{\infty}\), defined as

\[\mathbf{H}_{i j}^{\infty}= \mathbb{E}_{\mathbf{w} \sim N(\mathbf{0}, \mathbf{I})}\left[\mathbf{x}_{i}^{\top} \mathbf{x}_{j} \mathbb{I}\left\{\mathbf{w}^{\top} \mathbf{x}_{i} \geq 0, \mathbf{w}^{\top} \mathbf{x}_{j} \geq 0\right\}\right].\]

Precisely, they showed that if \(m\) is large enough, then with high probability \(\left\|\mathbf{H}(0)-\mathbf{H}^{\infty}\right\|_{2} \leq \frac{\lambda_{0}}{4}\), and consequently \(\lambda_{\min }(\mathbf{H}(0)) \geq \frac{3}{4} \lambda_{0}\).
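A quick way to see this numerically (again a toy sketch, not the paper's proof or experiment) is to estimate \(\bH^{\infty}\) by Monte Carlo over \(\mathbf{w} \sim N(\mathbf{0}, \mathbf{I})\) and compare it to \(\bH(0)\) in spectral norm. This reuses the hypothetical `gram_matrix`, `X`, and `W0` names from the first sketch above.

```python
import numpy as np

def gram_infinity(X, num_samples=200_000, seed=1):
    """Monte Carlo estimate of H^infinity over w ~ N(0, I):
    H^inf_ij = E[ x_i^T x_j * 1{w^T x_i >= 0} * 1{w^T x_j >= 0} ]."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(num_samples, X.shape[1]))
    act = (X @ W.T >= 0).astype(float)               # (n, num_samples)
    return (X @ X.T) * (act @ act.T) / num_samples

# X, W0, gram_matrix come from the earlier sketch; increase m there to watch
# ||H(0) - H^inf||_2 shrink toward (and below) lambda_0 / 4.
H_inf = gram_infinity(X)
H0 = gram_matrix(W0, X)
lam0 = np.linalg.eigvalsh(H_inf).min()
print("||H(0) - H^inf||_2 =", np.linalg.norm(H0 - H_inf, 2))   # spectral norm
print("lambda_0 / 4       =", lam0 / 4)
print("lambda_min(H(0))   =", np.linalg.eigvalsh(H0).min())
```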

They then used this result to show that \(\bH(t)\) is stable for every step \(t\): it stays close to its value at initialization because each weight vector \(\mathbf{w}_r\) moves only a small distance from where it started, so few ReLU activation patterns change.

3. Synthetic Data To Validate Theoretical Findings

The authors generated synthetic data to verify their theoretical findings. They found that:

  1. Greater widths result in faster convergence,
  2. Greater widths result in fewer activation pattern changes, which verifies the stability of the Gram matrix,
  3. Greater widths result in smaller weight changes (a toy sketch illustrating the last two observations follows below).
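The following is a rough numerical sketch in the same spirit, not the authors' actual experimental setup: the widths, data, step size, and number of steps are arbitrary toy choices. It trains networks of increasing width on one small synthetic dataset and reports the fraction of flipped activation patterns and the largest per-neuron weight movement.

```python
import numpy as np

def width_experiment(m, steps=500, eta=0.1, seed=0):
    """Train a width-m two-layer ReLU net (W only, a fixed) with full-batch GD.
    Returns the fraction of flipped (example, neuron) activation patterns and
    the largest per-neuron weight movement ||w_r(steps) - w_r(0)||_2."""
    rng = np.random.default_rng(seed)
    n, d = 10, 5
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.normal(size=n)
    W0 = rng.normal(size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    W = W0.copy()
    pattern0 = (X @ W0.T > 0)                 # activation patterns at initialization
    for _ in range(steps):
        pre = X @ W.T
        u = (np.maximum(pre, 0) @ a) / np.sqrt(m)
        grad = (((pre > 0) * ((u - y)[:, None] * a[None, :])).T @ X) / np.sqrt(m)
        W -= eta * grad
    flips = ((X @ W.T > 0) != pattern0).mean()
    max_move = np.linalg.norm(W - W0, axis=1).max()
    return flips, max_move

for m in (100, 1000, 10000):
    print(m, width_experiment(m))             # both quantities should shrink with m
```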

Most Glaring Deficiency

The assumption that the second layer of the network is initialized with \(\pm 1\) values feels quite unrealistic, although it admittedly serves to simplify the theoretical analysis. It would be interesting to see whether such an initialization scheme could actually be competitive with established weight initialization techniques in practice.

Conclusions for Future Work

Their work provides a further stepping stone to understanding why over-parameterized models perform so well. As the authors mentioned, it may be possible to generalize the results of their approach to deeper neural networks.