2208.02083

Gradient descent provably escapes saddle points in the training of shallow ReLU networks

Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek

correct (medium confidence)
Category
math.DS
Journal tier
Strong Field
Processed
Sep 28, 2025, 12:56 AM

Audit review

The paper proves that, for almost every stepsize, the set of initializations whose gradient-descent iterates converge to the specified saddle set S has Lebesgue measure zero. It does so by (i) establishing a center-stable manifold theorem that does not require the iteration map to be a local diffeomorphism at the fixed point, and (ii) modifying the gradient map near semi-inactive neurons so that differentiability and symmetry conditions hold at the saddles under study (Theorem 1, Proposition 7, Lemmas 10–11, 13, Theorem 14). By contrast, the model’s proof assumes that L is C^2 at the relevant saddles and applies the classical center-stable manifold theorem for C^1 diffeomorphisms, relying on the Hessian at those saddles. But the paper explicitly notes that semi-inactive neurons place many saddles outside the full-measure set on which L is C^2, so ∇^2 L may not exist there; this is precisely why the modified gradient G_J and the non-diffeomorphic manifold theorem are needed. The model also asserts, without support, that these saddles are nondegenerate and isolated, whereas the paper only requires a strictly negative eigenvalue of the symmetric Jacobian of the modified system and allows degeneracy in other directions (handled by the new theorem). Hence the model’s argument has essential gaps, while the paper’s argument is sound.
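
In schematic form, the avoidance statement described above can be written as follows; the symbols g_\gamma, \lambda, S, and G_J are paraphrased from the audit rather than quoted from the paper and should be read as assumed notation.

\[
  g_\gamma(\theta) \;=\; \theta - \gamma \nabla L(\theta),
  \qquad
  \lambda\Bigl(\bigl\{\theta_0 \;:\; \operatorname{dist}\bigl(g_\gamma^{\,n}(\theta_0),\, S\bigr) \to 0 \text{ as } n \to \infty \bigr\}\Bigr) \;=\; 0
  \quad \text{for almost every stepsize } \gamma,
\]

where \lambda denotes Lebesgue measure on parameter space and S is the saddle set under study. Because L need not be C^2 at saddles with semi-inactive neurons, the paper replaces \nabla L near such points by the modified map G_J and invokes a center-stable manifold theorem that asks only for a strictly negative eigenvalue of the symmetric Jacobian of the modified system, not that g_\gamma be a local diffeomorphism.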

Referee report (LaTeX)

\textbf{Recommendation:} minor revisions

\textbf{Journal Tier:} strong field

\textbf{Justification:}

The manuscript closes a key gap between smooth dynamical systems theory and the nonsmooth loss landscapes of ReLU networks. By relaxing the local-diffeomorphism requirement and introducing a principled gradient modification, the authors rigorously prove that gradient descent avoids most saddle points for affine targets. The technical development is solid and the results are relevant to modern optimization theory. Minor clarifications would further improve readability.
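
For concreteness, the affine-target setting mentioned above can be read against the following schematic formulation; the parameterization, loss, and symbols (m, v_j, w_j, b_j, c, A, a, \mu) are a standard shallow-ReLU paraphrase introduced here for illustration, not the manuscript's exact notation.

\[
  \mathcal{N}_\theta(x) \;=\; c + \sum_{j=1}^{m} v_j \max\bigl\{\langle w_j, x\rangle + b_j,\, 0\bigr\},
  \qquad
  L(\theta) \;=\; \int \bigl(\mathcal{N}_\theta(x) - (A x + a)\bigr)^2 \,\mu(\mathrm{d}x),
\]

with gradient descent run on L over the parameter vector \theta = (w_j, b_j, v_j, c)_{j \le m}. Saddles at which some neuron is semi-inactive are among the points where L can fail to be twice differentiable, which is what the gradient modification discussed in the audit is designed to handle.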