Day 4: Why Plain Gradient Descent Fails (And What Polyak Momentum Fixes)
We love the clean update:
wₖ₊₁ = wₖ − η ∇f(wₖ)
Simple. Elegant. First-order.
But in real ML systems, it struggles.
Where Plain Gradient Descent Fails
1. Zig-Zag in Narrow Valleys
In ill-conditioned problems (think long, narrow, curved valleys), the gradient points mostly across the valley, not along it.
You move:
- Left
- Then right
- Then left again
Progress along the valley direction becomes painfully slow.
This happens when the condition number of the Hessian, κ = λ_max / λ_min (largest over smallest eigenvalue), is large.
Large κ ⇒ slow convergence.
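To see the zig-zag in numbers, here is a minimal sketch. It assumes a toy 2D quadratic f(w) = ½ (λ_min w₁² + λ_max w₂²) with κ = 100; the function name `gd_path`, the eigenvalues, and the step size are illustrative choices, not anything fixed by the method.

```python
import numpy as np

def gd_path(w0, eigs, lr, steps):
    """Plain gradient descent on f(w) = 0.5 * sum(eigs * w**2)."""
    w = np.asarray(w0, dtype=float)
    path = [w.copy()]
    for _ in range(steps):
        grad = eigs * w          # gradient of the diagonal quadratic
        w = w - lr * grad        # w_{k+1} = w_k - lr * grad f(w_k)
        path.append(w.copy())
    return np.array(path)

# Assumed toy problem: lambda_min = 1, lambda_max = 100, so kappa = 100.
eigs = np.array([1.0, 100.0])
path = gd_path([1.0, 1.0], eigs, lr=0.018, steps=50)

# Count how often the steep coordinate changes sign (left, right, left, ...).
sign_flips = int(np.sum(np.sign(path[1:, 1]) != np.sign(path[:-1, 1])))
print("sign flips in steep direction:", sign_flips)
print("final point (flat, steep):", path[-1])
```

The steep coordinate flips sign on every step while the flat coordinate shrinks by under 2% per step. The step size is capped by λ_max, so the λ_min direction crawls.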
2. High-Frequency Noise (in SGD)
With stochastic gradients: gₖ = ∇f(wₖ) + ξₖ
Noise causes:
- Jitter
- Oscillations
- Bouncing near minima
Mini-batches reduce variance by averaging over samples within a single step. But they don’t smooth noise across steps.
Temporal noise remains.
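A small simulation makes the distinction concrete. This is a sketch under a loud assumption: the per-sample gradient noise is i.i.d. Gaussian with standard deviation `sigma` (real gradient noise is rarely this clean). All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, batch, steps = 1.0, 16, 20000

# Each row holds the per-sample gradient noise for one step (assumed i.i.d.).
noise = rng.normal(0.0, sigma, size=(steps, batch))
batch_noise = noise.mean(axis=1)         # noise left after mini-batch averaging

var_single = noise[:, 0].var()           # variance with batch size 1
var_batch = batch_noise.var()            # variance with batch size 16

# Correlation between the noise at step k and step k+1.
lag1 = np.corrcoef(batch_noise[:-1], batch_noise[1:])[0, 1]

print("variance, single sample:", var_single)
print("variance, mini-batch   :", var_batch)
print("lag-1 correlation      :", lag1)
```

Under this assumption the mini-batch variance lands near σ² / batch, while the step-to-step correlation stays near zero. Each step still gets a fresh, independent kick.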
3. No Memory
Vanilla GD forgets everything.
Each step only uses:
- Current position
- Current gradient
It ignores:
- Where you were going
- Whether direction was consistent
That’s inefficient in structured landscapes.
The Core Idea of Polyak Momentum
Instead of trusting only today’s gradient, accumulate history.
Define velocity:
vₖ₊₁ = β vₖ + ∇f(wₖ)
Update:
wₖ₊₁ = wₖ − η vₖ₊₁
Equivalent two-step form (a second-order difference equation in w, not a second-order method):
wₖ₊₁ = wₖ − η ∇f(wₖ) + β (wₖ − wₖ₋₁)
That’s it.
Simple change. Big effect.
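Both forms can be sketched in a few lines and checked against each other. The function names, the toy quadratic, and the values of η and β below are assumptions for illustration. The two-step form starts from w₋₁ = w₀, which corresponds to v₀ = 0.

```python
import numpy as np

def heavy_ball_velocity(grad, w0, lr, beta, steps):
    """Velocity form: v <- beta * v + grad(w); w <- w - lr * v."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    out = [w.copy()]
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - lr * v
        out.append(w.copy())
    return np.array(out)

def heavy_ball_two_step(grad, w0, lr, beta, steps):
    """Two-step form: w_next = w - lr * grad(w) + beta * (w - w_prev)."""
    w_prev = np.asarray(w0, dtype=float)
    w = w_prev.copy()            # w_{-1} = w_0, i.e. zero initial velocity
    out = [w.copy()]
    for _ in range(steps):
        w_next = w - lr * grad(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
        out.append(w.copy())
    return np.array(out)

# Assumed toy problem: diagonal quadratic with eigenvalues 1 and 100.
eigs = np.array([1.0, 100.0])
grad = lambda w: eigs * w

a = heavy_ball_velocity(grad, [1.0, 1.0], lr=0.01, beta=0.9, steps=100)
b = heavy_ball_two_step(grad, [1.0, 1.0], lr=0.01, beta=0.9, steps=100)
print("same trajectory:", np.allclose(a, b))
```

The velocity form is the one most training loops carry around as optimizer state. The two-step form is handier for analysis.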
What Actually Changes?
1. It Adds Memory
Momentum is an exponentially weighted sum of past gradients (with v₀ = 0):
vₖ₊₁ = ∇f(wₖ) + β ∇f(wₖ₋₁) + β² ∇f(wₖ₋₂) + …
So instead of reacting instantly, it builds directional conviction.
If gradients consistently point one way → speed increases, up to an effective step size of η / (1 − β).
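A quick check of the memory claim, as a sketch: the recursion v ← β v + g unrolls into a weighted sum with weights 1, β, β², …, and under a constant gradient the velocity settles at g / (1 − β). The random gradient stream and β = 0.9 are illustrative assumptions.

```python
import numpy as np

beta = 0.9

# 1) The recursion equals an explicit exponentially weighted sum.
rng = np.random.default_rng(0)
grads = rng.normal(size=50)                  # stand-in gradient history
v = 0.0
for g in grads:
    v = beta * v + g
weights = beta ** np.arange(len(grads))      # 1, beta, beta^2, ...
v_sum = float(np.sum(weights * grads[::-1])) # newest gradient gets weight 1

# 2) With a constant gradient, velocity builds up to g / (1 - beta).
g_const = 1.0
v_const = 0.0
for _ in range(200):
    v_const = beta * v_const + g_const
limit = g_const / (1.0 - beta)

print("recursion vs weighted sum:", v, v_sum)
print("velocity vs g / (1 - beta):", v_const, limit)
```

With β = 0.9, a consistent gradient ends up moving the weights about 10× further per step than plain GD with the same η.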
2. It Filters Noise
Zero-mean noise that changes sign from step to step largely averages out.
Consistent signal accumulates.
Momentum behaves like a low-pass filter.
Noise ↓ Signal ↑
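The filtering effect can be sketched on a fixed stream of noisy gradients. Assumptions: a constant true gradient `g`, i.i.d. Gaussian noise, and the velocity rescaled by (1 − β) so it is comparable to a single gradient. For this setup the stationary variance works out to σ² (1 − β) / (1 + β).

```python
import numpy as np

rng = np.random.default_rng(0)
beta, g, sigma, steps, burn_in = 0.9, 1.0, 1.0, 20000, 200

raw = g + rng.normal(0.0, sigma, size=steps)   # noisy gradients g_k = g + xi_k
smoothed = np.empty(steps)
v = 0.0
for k, g_k in enumerate(raw):
    v = beta * v + g_k
    smoothed[k] = (1.0 - beta) * v             # rescale to gradient units

# Drop the warm-up steps while the velocity is still building.
var_raw = raw[burn_in:].var()
var_smoothed = smoothed[burn_in:].var()
mean_smoothed = smoothed[burn_in:].mean()
var_predicted = sigma**2 * (1.0 - beta) / (1.0 + beta)

print("signal kept (mean):", mean_smoothed)
print("noise variance raw vs smoothed:", var_raw, var_smoothed)
print("predicted smoothed variance:", var_predicted)
```

The mean (the signal) passes through unchanged while the variance drops by roughly (1 − β) / (1 + β). This isolates the filter only. In a real training loop the gradients also depend on the iterates, so treat it as intuition rather than a guarantee.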
3. It Fixes Ill-Conditioning (Partially)
For strongly convex quadratic problems with optimally tuned η (and β), the per-step contraction factors are:
Vanilla GD rate: (κ − 1) / (κ + 1)
Heavy Ball rate: (√κ − 1) / (√κ + 1)
That √κ improvement is why momentum flies through narrow valleys.
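Here is a sketch comparing the two on a toy quadratic with κ = 100. It assumes the classical tuned parameters for quadratics: η = 2 / (L + μ) for GD, and η = 4 / (√L + √μ)², β = ((√κ − 1) / (√κ + 1))² for Heavy Ball, where μ and L are the smallest and largest eigenvalues. The helper `run` and the tolerance are illustrative.

```python
import numpy as np

mu, L = 1.0, 100.0                       # smallest / largest Hessian eigenvalue
eigs = np.array([mu, L])
kappa = L / mu

gd_rate = (kappa - 1) / (kappa + 1)                      # 99/101
hb_rate = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)    # 9/11

def run(lr, beta, tol=1e-6, max_iters=100_000):
    """Iterations until ||w|| < tol on f(w) = 0.5 * sum(eigs * w**2)."""
    w = np.ones(2)
    v = np.zeros(2)
    for k in range(1, max_iters + 1):
        v = beta * v + eigs * w          # beta = 0 recovers plain GD
        w = w - lr * v
        if np.linalg.norm(w) < tol:
            return k
    return max_iters

gd_iters = run(lr=2.0 / (L + mu), beta=0.0)
hb_iters = run(lr=4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2, beta=hb_rate ** 2)

print("rates:", gd_rate, hb_rate)
print("iterations to 1e-6:", gd_iters, hb_iters)
```

Both tunings need μ and L, which you rarely know in practice. The gap in iteration counts is what the √κ buys you when you do.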
Intuition: Heavy Ball Analogy
Imagine rolling a heavy ball down a valley.
Without momentum → step, stop, step, stop.
With momentum → inertia builds → smoother motion.
Mathematically it behaves like:
ẅ + c ẇ + ∇f(w) = 0
Acceleration + friction + force.
Optimization becomes a dynamical system, not just a local step rule.
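One way to connect the ODE back to the update rule is a finite-difference discretization. This is a sketch of one possible scheme: the time step h, the central second difference for acceleration, and the backward difference for velocity are all assumed choices.

```latex
\begin{aligned}
&\ddot{w} + c\,\dot{w} + \nabla f(w) = 0 \\
&\frac{w_{k+1} - 2w_k + w_{k-1}}{h^2} + c\,\frac{w_k - w_{k-1}}{h} + \nabla f(w_k) = 0 \\
&w_{k+1} = w_k - h^2\,\nabla f(w_k) + (1 - ch)\,(w_k - w_{k-1})
\end{aligned}
```

Matching terms with the two-step form gives η = h² and β = 1 − ch. More friction c means smaller β, and c = 1/h removes the momentum term entirely.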
But It’s Not Magic
Momentum:
- Needs tuning (η, β)
- Can overshoot
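- Is sensitive to mis-tuning when curvature (λ_min, λ_max) is unknown
- Doesn’t guarantee the optimal worst-case rate for general convex functions (the √κ rate above is for quadratics)
It improves the dependence on conditioning for quadratics. It doesn’t solve everything.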
Final Takeaway
Gradient Descent fails because:
- It forgets history
- It overreacts to noise
- It struggles in bad geometry
Polyak Momentum fixes this by:
- Adding memory
- Smoothing temporal noise
- Accelerating movement in narrow valleys
It’s not about “bigger steps”.
It’s about better dynamics.
If SGD makes learning possible, Momentum makes it efficient.
#Optimization #Momentum #HeavyBall #DeepLearning #MachineLearning #MathForML