Reproducing the deep double descent paper

stpn · 6/5/2025, 6:34:23 PM · stpn.bearblog.dev

Comments (4)

lcrmorin · 3h ago
Do you change the regularisation?
davidguetta · 21h ago
Is this not because the longer you train, the more neurons 'die' (not utilized anymore because the gradient is flat on the dataset), so you effectively get a smaller model as training goes on?
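
A hypothesis like this can be checked directly by counting units that never fire. Below is a minimal PyTorch sketch, not from the post: `model` and `data_loader` are placeholders, and it assumes the nonlinearities are `nn.ReLU` modules (models that call `F.relu` functionally would need a different hook point).

```python
import torch

@torch.no_grad()
def dead_relu_fraction(model, data_loader, device="cpu"):
    """Fraction of ReLU units that never produce a positive output on the data."""
    alive = {}  # module name -> bool tensor, True where the unit fired at least once

    def make_hook(name):
        def hook(module, inputs, output):
            # Per-unit "did it fire?" over the batch. This counts individual
            # activation elements; for conv layers you may prefer per-channel stats.
            fired = (output > 0).flatten(1).any(dim=0).cpu()
            prev = alive.get(name)
            alive[name] = fired if prev is None else (prev | fired)
        return hook

    handles = [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
        if isinstance(module, torch.nn.ReLU)
    ]

    model.eval()
    for x, _ in data_loader:
        model(x.to(device))

    for h in handles:
        h.remove()

    dead = sum((~a).sum().item() for a in alive.values())
    total = sum(a.numel() for a in alive.values())
    return dead / max(total, 1)
```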
rsfern · 18h ago
I don't think so? The double descent phenomenon also occurs in linear models under the right conditions. My understanding is that when the effective model capacity exactly matches the information in the dataset, there is only one solution that interpolates the training data perfectly, but when the capacity increases far beyond this there are many such interpolating solutions. Apply enough regularization and you are likely to find an interpolating solution that generalizes well.
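
As a concrete illustration of the linear-model case, here is a generic toy sketch (none of these numbers or names come from the paper or the post): fit minimum-norm least squares on random ReLU features and sweep the feature count past the number of training points. The test error typically spikes near the interpolation threshold (features ≈ training points) and descends again as the feature count keeps growing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, noise = 50, 2000, 10, 0.5

# Fixed linear teacher with noisy training labels.
w_star = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ w_star + noise * rng.normal(size=n_train)
y_test = X_test @ w_star

def relu_features(X, W):
    """Random ReLU feature map: phi(x) = max(x W, 0)."""
    return np.maximum(X @ W, 0.0)

for n_features in [5, 10, 25, 45, 50, 55, 75, 150, 500, 2000]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    Phi_train = relu_features(X_train, W)
    Phi_test = relu_features(X_test, W)
    # pinv gives the least-squares fit when under-parameterized and the
    # minimum-norm interpolating solution once n_features >= n_train.
    beta = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"features={n_features:5d}  test MSE={test_mse:8.3f}")
```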
stpn · 15h ago
(post author here)

I was curious about this since it kind of makes sense, but here are a few reasons why I think this isn't the case:

- In the 10% noise case at least, the second descent eventually finds a minimum that's better than the original local minimum, which suggests to me the model really is finding a better fit rather than just reducing itself to a similar smaller model

- If that were the case, I think we'd also expect the error for larger models to converge to the performance of the smaller models. But instead they converge to a lower, better error

- I checked the logged gradient histograms I had for the runs (a sketch of the kind of per-layer gradient check I mean is below). While I'm still learning how to interpret the results, I didn't see signs of vanishing gradients where dead neurons later in the model prevented earlier layers from learning. Gradients do get smaller over time, but that seems expected, and we don't see big waves of neurons dying, which is what I'd expect if the larger network were converging on the size of the smaller one.
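
For concreteness, a check along these lines can be approximated without full histograms by logging per-layer gradient norms. This is a generic PyTorch sketch, not the post's actual logging code; `model`, `criterion`, `x`, and `y` are placeholders.

```python
import torch

def layer_grad_norms(model: torch.nn.Module) -> dict[str, float]:
    """L2 norm of each parameter's gradient; call after loss.backward().

    A rough vanishing-gradient check: if norms for early layers collapse
    toward zero while later layers stay healthy, the early layers have
    effectively stopped learning.
    """
    return {
        name: param.grad.norm().item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }

# Usage sketch:
#   loss = criterion(model(x), y)
#   loss.backward()
#   for name, g in layer_grad_norms(model).items():
#       print(f"{name:40s} {g:.3e}")
```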