Double descent is a phenomenon observed in machine learning that challenges traditional ideas about the relationship between model complexity and performance. Typically, as models become more complex, they start to overfit the training data, leading to higher test error. With double descent, however, the test error instead falls, rises, and then falls a second time as complexity increases.
In the first phase, models behave as expected: they are underfit, and their test error decreases as they grow more complex. After a certain point the error spikes; this is the overfitting stage, where the model becomes too complex for the data, captures noise, and suffers high variance. The peak occurs at the "interpolation threshold," the point where the model has just enough capacity to fit the training data perfectly but cannot yet generalize well.
Surprisingly, if we continue increasing the complexity, the test error does not keep worsening; it starts decreasing again. This second drop, or "second descent," occurs when the model enters the over-parameterized regime, where it has more parameters than are needed to memorize the training data. In this regime, larger models often settle on smoother solutions that generalize better, contrary to traditional expectations.
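The curve described above can be reproduced in a few lines with a random-features regression model fit by minimum-norm least squares, sweeping the number of features from well below to well beyond the training-set size. Everything in the sketch below (the synthetic linear target, the fixed random ReLU features, the noise level, and the feature counts) is an illustrative assumption rather than a setup taken from a particular study; the point is the qualitative shape of the test error, not the exact numbers.

```python
# Minimal sketch of model-wise double descent: a random-features model whose
# capacity (number of features) is swept past the interpolation threshold.
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, d = 40, 500, 5          # a small training set makes the peak visible
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)   # noisy labels
y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

def random_relu_features(X, W):
    """Project inputs through fixed random weights and apply a ReLU."""
    return np.maximum(X @ W, 0.0)

for n_features in [5, 10, 20, 30, 40, 50, 80, 160, 640, 2560]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)   # fixed random first layer
    Phi_train = random_relu_features(X_train, W)
    Phi_test = random_relu_features(X_test, W)

    # Minimum-norm least-squares fit of the output weights. Below n_train
    # features the model underfits; near n_features == n_train it barely
    # interpolates and test error spikes; far beyond that, the minimum-norm
    # solution becomes smoother and test error falls again.
    beta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)

    train_mse = np.mean((Phi_train @ beta - y_train) ** 2)
    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"features={n_features:5d}  train MSE={train_mse:9.4f}  test MSE={test_mse:9.4f}")
```

With this setup, the printed test error typically improves while the model underfits, spikes sharply around 40 features (the interpolation threshold for 40 training points), and then declines again as the feature count keeps growing.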
Double descent has been observed across a variety of deep learning architectures, including convolutional neural networks (CNNs), transformers, and residual networks (ResNets). It can also manifest over training epochs, where extending training past the overfitting point can lead to improved performance—a phenomenon known as epoch-wise double descent.
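Measuring epoch-wise double descent amounts to training well past the point where training error flattens and logging test error at every epoch. The toy harness below is only a sketch of that measurement: the tiny two-layer network, the synthetic labels, and the 15% label noise are assumptions made for illustration, and whether a second descent actually appears in such a small setup depends on the width, noise level, and training length; reported cases involve much larger models.

```python
# Sketch of the measurement behind epoch-wise double descent: keep training
# after training error flattens and record test error along the way.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic binary classification data with 15% label noise on the training set.
n_train, n_test, d = 200, 2000, 20
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
w_star = rng.normal(size=d)
y_train = np.sign(X_train @ w_star)
y_test = np.sign(X_test @ w_star)
flip = rng.random(n_train) < 0.15
y_train[flip] *= -1

# Wide two-layer tanh network trained by full-batch gradient descent on the squared loss.
hidden = 512
W1 = rng.normal(size=(d, hidden)) / np.sqrt(d)
W2 = rng.normal(size=(hidden, 1)) / np.sqrt(hidden)
lr = 0.01

def forward(X, W1, W2):
    H = np.tanh(X @ W1)
    return H, (H @ W2).ravel()

for epoch in range(1, 2001):
    H, pred = forward(X_train, W1, W2)
    err = pred - y_train                              # squared-loss residual
    grad_W2 = H.T @ err[:, None] / n_train
    grad_H = err[:, None] @ W2.T * (1 - H ** 2)       # backprop through tanh
    grad_W1 = X_train.T @ grad_H / n_train
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

    if epoch % 200 == 0:                              # log test error periodically
        _, test_pred = forward(X_test, W1, W2)
        train_err = np.mean(np.sign(pred) != y_train)
        test_err = np.mean(np.sign(test_pred) != y_test)
        print(f"epoch {epoch:5d}  train error {train_err:.3f}  test error {test_err:.3f}")
```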
This discovery is particularly relevant to the design of modern AI systems, suggesting that larger, more complex models may ultimately perform better even when they initially appear to overfit.