Day - 4 : Why Plain Gradient Descent Fails (And What Polyak Momentum Fixes)

Day - 4 : Why Plain Gradient Descent Fails (And What Polyak Momentum Fixes)

We love the clean update:

wₖ₊₁ = wₖ − η ∇f(wₖ)

Simple. Elegant. First-order.

But in real ML systems, it struggles.

Where Core Gradients Fail

1. Zig-Zag in Narrow Valleys

In ill-conditioned problems (think long curved valleys), gradients keep pointing sideways.

You move:

  • Left
  • Then right
  • Then left again

Progress along the valley direction becomes painfully slow.

This happens when the condition number [κ = λ_max / λ_min ] is large.

Large κ ⇒ slow convergence.

2. High-Frequency Noise (in SGD)

With stochastic gradients: gₖ = ∇f(wₖ) + ξₖ

Noise causes:

  • Jitter
  • Oscillations
  • Bouncing near minima

Mini-batches reduce variance spatially. But they don’t smooth noise across time.

Temporal noise remains.

3. No Memory

Vanilla GD forgets everything.

Each step only uses:

  • Current position
  • Current gradient

It ignores:

  • Where you were going
  • Whether direction was consistent

That’s inefficient in structured landscapes.


The Core Idea of Polyak Momentum

Instead of trusting only today’s gradient, accumulate history.

Define velocity:

vₖ₊₁ = β vₖ + ∇f(wₖ)

Update:

wₖ₊₁ = wₖ − η vₖ₊₁

Equivalent second-order form:

wₖ₊₁ = wₖ − η ∇f(wₖ) + β (wₖ − wₖ₋₁)

That’s it.

Simple change. Big effect.


What Actually Changes?

1. It Adds Memory

Momentum is an exponential moving average of past gradients.

So instead of reacting instantly, it builds directional conviction.

If gradients consistently point one way → speed increases.

2. It Filters Noise

High-frequency random noise cancels out.

Consistent signal accumulates.

Momentum behaves like a low-pass filter.

Noise ↓ Signal ↑

3. It Fixes Ill-Conditioning (Partially)

For quadratic problems:

Vanilla GD rate: (κ − 1) / (κ + 1)

Heavy Ball rate: (√κ − 1) / (√κ + 1)

That √κ improvement is why momentum flies through narrow valleys.


Intuition: Heavy Ball Analogy

Imagine rolling a heavy ball down a valley.

Without momentum → step, stop, step, stop. With momentum → inertia builds → smoother motion.

Mathematically it behaves like:

w¨ + c w˙ + ∇f(w) = 0

Acceleration + friction + force.

Optimization becomes a dynamical system, not just a local step rule.


But It’s Not Magic

Momentum:

  • Needs tuning (η, β)
  • Can overshoot
  • Is sensitive if curvature is unknown
  • Doesn’t guarantee optimal worst-case rate in general convex case

It improves constants. It doesn’t solve everything.


Final Takeaway

Gradient Descent fails because:

  • It forgets history
  • It overreacts to noise
  • It struggles in bad geometry

Polyak Momentum fixes this by:

  • Adding memory
  • Smoothing temporal noise
  • Accelerating movement in narrow valleys

It’s not about “bigger steps”.

It’s about better dynamics.

If SGD makes learning possible, Momentum makes it efficient.


Article content


Article content
Article content
Article content




#Optimization #Momentum #HeavyBall #DeepLearning #MachineLearning #MathForML

To view or add a comment, sign in

More articles by chandu chowdary

Others also viewed

Explore content categories