Three Important Things
1. Gradient Descent On Separable Data Has Implicit Bias Towards Max-Margin SVM
Why is it that over-parameterized models fitted to training data via gradient descent actually generalize well instead of overfitting? In this work, the authors make headway towards answering this question by showing that the solution found by gradient descent on linearly separable data has an implicit bias towards the max-margin (hard-margin SVM) solution.
As a brief recap of max-margin (also known as hard-margin) SVM, consider the linearly separable dataset given in the figure below:

While there are infinitely many separating hyperplanes that achieve perfect training accuracy, the one that maximizes the margin to the support vectors (i.e., the datapoints closest to the separating hyperplane on either side) is the one expected to generalize best.
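To make the recap concrete, here is a minimal sketch (my own illustration, not from the paper) that approximates the hard-margin solution using sklearn's soft-margin SVC with a very large C, and reads off the support vectors and margin width:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs in 2D: a linearly separable toy dataset.
X = np.vstack([rng.normal(+2.0, 0.5, (20, 2)),
               rng.normal(-2.0, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# With a very large C, the soft-margin SVM effectively becomes hard-margin.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = svm.coef_.ravel(), svm.intercept_[0]

print("hyperplane normal w =", w, " bias b =", b)
print("margin width =", 2.0 / np.linalg.norm(w))  # distance between the two margin lines
print("support vectors:\n", svm.support_vectors_)
```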
Let’s now consider the problem setup: the goal is to minimize the empirical loss
$$\mathcal{L}(\mathbf{w}) = \sum_{n=1}^{N} \ell\left(y_n \mathbf{w}^\top \mathbf{x}_n\right),$$
where the labels $y_n \in \{-1, +1\}$ and the $\mathbf{x}_n \in \mathbb{R}^d$ are the datapoints.
They make three key assumptions for their result:
Assumption 1: The dataset is linearly separable: there exists some $\mathbf{w}_*$ such that $y_n \mathbf{w}_*^\top \mathbf{x}_n > 0$ for all $n$ (a quick way to test this is sketched after the list).
Assumption 2: The loss function $\ell(u)$ is positive, differentiable, monotonically decreasing to zero, and $\beta$-smooth (its derivative is Lipschitz).
Assumption 3: The negative loss derivative $-\ell'(u)$ has a tight exponential tail, i.e., it behaves like $e^{-u}$ for large $u$. Both the exponential loss $\ell(u) = e^{-u}$ and the logistic loss $\ell(u) = \log(1 + e^{-u})$ satisfy Assumptions 2 and 3.
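Assumption 1 is easy to test on a given dataset. Below is a minimal sketch (a hypothetical helper, not from the paper): by rescaling $\mathbf{w}$, strict separability is equivalent to the feasibility of the constraints $y_n \mathbf{w}^\top \mathbf{x}_n \ge 1$, which we can hand to a linear-programming solver.

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Feasibility test for Assumption 1: does some w satisfy y_n * w.x_n >= 1?

    X has shape (N, d); y contains labels in {-1, +1}.
    """
    N, d = X.shape
    # linprog minimizes c.w subject to A_ub @ w <= b_ub; a zero objective
    # turns this into a pure feasibility problem. Encode y_n x_n . w >= 1
    # as -(y_n x_n) . w <= -1, with w unbounded in every coordinate.
    res = linprog(c=np.zeros(d),
                  A_ub=-(y[:, None] * X),
                  b_ub=-np.ones(N),
                  bounds=[(None, None)] * d)
    return res.status == 0  # status 0: a feasible solution was found

# Example: XOR-style data is famously not linearly separable.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])
print(is_linearly_separable(X, y))  # False
```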
Using these assumptions, they showed the following result: for any sufficiently small step size, the gradient descent iterates satisfy
$$\mathbf{w}(t) = \hat{\mathbf{w}} \log t + \boldsymbol{\rho}(t),$$
where the residual $\boldsymbol{\rho}(t)$ stays bounded and $\hat{\mathbf{w}}$ is the $L_2$ max-margin (hard-margin SVM) solution
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_n \mathbf{w}^\top \mathbf{x}_n \ge 1 \;\; \forall n.$$
Let’s analyze what the theorem says.
First, we see that as the number of time steps increases, the magnitude of the weights grows without bound: $\|\mathbf{w}(t)\|$ increases like $\log t$, while the residual $\boldsymbol{\rho}(t)$ remains bounded.
Then this shows that the normalized weight vector tends towards the max-margin direction: $\frac{\mathbf{w}(t)}{\|\mathbf{w}(t)\|} \to \frac{\hat{\mathbf{w}}}{\|\hat{\mathbf{w}}\|}$.
However, one thing to note is that the rate of convergence to the max-margin solution is exponentially slow - since the growth of the weights is dominated by the $\log t$ term, the normalized weights approach the max-margin direction only at a rate of $O(1/\log t)$, so reaching accuracy $\epsilon$ requires on the order of $e^{1/\epsilon}$ iterations.
This means that it is worthwhile to continue running gradient descent for a long time, even when the loss is vanishingly small and the training error is already zero.
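Here is a minimal sketch (my own illustration, not the paper's code) of this slow directional convergence: full-batch gradient descent with the logistic loss on a separable toy dataset, compared against a hard-margin direction approximated by SVC with a huge C (the intercept is ignored, which is harmless for this symmetric data):

```python
import numpy as np
from scipy.special import expit
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Linearly separable toy data: two symmetric Gaussian blobs.
X = np.vstack([rng.normal(+2.0, 1.0, (50, 2)),
               rng.normal(-2.0, 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# Approximate the max-margin direction w_hat (large C ~ hard margin).
w_hat = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()
w_hat /= np.linalg.norm(w_hat)

# Full-batch gradient descent on L(w) = sum_n log(1 + exp(-y_n w.x_n)).
w, lr = np.zeros(2), 0.005
for t in range(1, 100_001):
    margins = y * (X @ w)
    grad = -(y * expit(-margins)) @ X  # since d/du log(1 + e^{-u}) = -sigmoid(-u)
    w -= lr * grad
    if t in (10, 100, 1_000, 10_000, 100_000):
        cos = w @ w_hat / np.linalg.norm(w)
        print(f"t={t:>7}  ||w||={np.linalg.norm(w):6.2f}  cos(w, w_hat)={cos:.5f}")
```

The training loss is already tiny well before the last rows are printed, yet the alignment with the max-margin direction keeps creeping upwards over further orders of magnitude of training time, which is exactly the $O(1/\log t)$ behavior the theorem predicts.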
2. Proof Sketch of Main Theorem
Why should this theorem be true? In this section, we’ll go through a quick sketch of the proof.
Assume for simplicity that the loss function is simply the exponential function $\ell(u) = e^{-u}$, and (without loss of generality, by absorbing each label $y_n$ into its datapoint $\mathbf{x}_n$) that $y_n = 1$ for all $n$.
First, they used previous results showing that gradient descent on a smooth loss function with an appropriate stepsize always converges, in the sense that the gradient eventually goes to 0. The gradient of the loss function is
$$\nabla\mathcal{L}(\mathbf{w}) = -\sum_{n=1}^{N} e^{-\mathbf{w}^\top \mathbf{x}_n}\, \mathbf{x}_n,$$
and hence for this to go to zero, we require that $\mathbf{w}^\top \mathbf{x}_n \to \infty$ for every $n$, which in turn forces $\|\mathbf{w}(t)\| \to \infty$.
If we assume that
$$\mathbf{w}(t) = \hat{\mathbf{w}} \log t + \boldsymbol{\rho}(t)$$
for some bounded residual function $\boldsymbol{\rho}(t)$,
then we can re-write our gradient as
$$\nabla\mathcal{L}(\mathbf{w}(t)) = -\sum_{n=1}^{N} e^{-\hat{\mathbf{w}}^\top \mathbf{x}_n \log t}\, e^{-\boldsymbol{\rho}(t)^\top \mathbf{x}_n}\, \mathbf{x}_n = -\sum_{n=1}^{N} t^{-\hat{\mathbf{w}}^\top \mathbf{x}_n}\, e^{-\boldsymbol{\rho}(t)^\top \mathbf{x}_n}\, \mathbf{x}_n.$$
But as we increase $t$, each factor $t^{-\hat{\mathbf{w}}^\top \mathbf{x}_n}$ decays polynomially in $t$, and the datapoints with the smallest margin $\hat{\mathbf{w}}^\top \mathbf{x}_n$ decay the slowest.
But then this means that the gradient update is asymptotically dominated by contributions from just these minimal-margin points - essentially the support vectors of the problem, like in SVM.
Then, as we continue performing gradient updates such that the increments accumulate along the divergent direction, $\hat{\mathbf{w}}$ must itself be a non-negative linear combination of the support vectors.
This coincides with the KKT conditions for the hard-margin SVM problem $\min_{\mathbf{w}} \|\mathbf{w}\|^2$ s.t. $\mathbf{w}^\top \mathbf{x}_n \ge 1$:
$$\hat{\mathbf{w}} = \sum_{n=1}^{N} \alpha_n \mathbf{x}_n, \qquad \alpha_n \ge 0, \qquad \alpha_n \left(\hat{\mathbf{w}}^\top \mathbf{x}_n - 1\right) = 0,$$
and hence allows us to conclude that $\hat{\mathbf{w}}$ is indeed the max-margin solution.
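A toy numeric check (mine, ignoring the bounded $e^{-\boldsymbol{\rho}(t)^\top \mathbf{x}_n}$ factors) makes the domination argument tangible: as $t$ grows, the share of the gradient carried by the minimal-margin points, i.e. the support vectors with $\hat{\mathbf{w}}^\top \mathbf{x}_n = 1$, tends to 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Margins w_hat . x_n for 10 datapoints (labels absorbed into x_n);
# the first two sit exactly on the margin, i.e. they are support vectors.
margins = np.sort(rng.uniform(1.0, 3.0, 10))
margins[:2] = 1.0

for t in (1e1, 1e3, 1e6):
    contrib = t ** (-margins)  # the t^{-w_hat.x_n} weight of each point's gradient term
    share = contrib[:2].sum() / contrib.sum()
    print(f"t={t:.0e}: support vectors carry {share:.6f} of the gradient magnitude")
```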
3. Empirical Evidence on Non-Separable Data
The authors showed that empirical results on synthetic data supported their theoretical findings:

A key objection to the generality of the main theorem is the extremely strong assumption that the data must be linearly separable.
The authors of course saw that coming, and also investigated the loss and error curves on the CIFAR10 dataset, which is not linearly separable, and saw that it exhibited the same trends:

This is exciting, as it provides evidence that it may be possible to extend this theory to deep neural networks as well.
Most Glaring Deficiency
The result assumes the existence of some $\mathbf{w}_*$ such that $y_n \mathbf{w}_*^\top \mathbf{x}_n > 0$ for all $n$, i.e., that the data is strictly linearly separable.
It would be great if the result could be shown without relying on this strong assumption, as it is quite unrealistic: most real-world datasets are not linearly separable.
Conclusions for Future Work
Even when the training loss is zero or plateaus, it can be worthwhile to continue training. We should also monitor the 0-1 error on the validation set instead of only the validation loss: even if the validation loss increases, the 0-1 error may still improve, meaning the model continues to generalize better.
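In practice this just means logging both quantities. A minimal sketch of such a validation hook (a hypothetical helper, not from the paper):

```python
import numpy as np

def validation_metrics(w, X_val, y_val):
    """Report validation logistic loss and 0-1 error separately: for a linear
    model drifting towards the max-margin direction, the loss can rise on a
    few hard examples while the 0-1 error keeps improving."""
    margins = y_val * (X_val @ w)
    loss = np.logaddexp(0.0, -margins).mean()  # numerically stable log(1 + e^{-m})
    err01 = (margins <= 0).mean()              # fraction of misclassified points
    return loss, err01
```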
As mentioned previously, it would be very exciting if the linear separability assumption on the data could be theoretically removed, and if the analysis could be extended to deeper neural networks.