Day 4: Why Plain Gradient Descent Fails (And What Polyak Momentum Fixes)

chandu chowdary

Published Mar 1, 2026


We love the clean update:

wₖ₊₁ = wₖ − η ∇f(wₖ)

Simple. Elegant. First-order.

But in real ML systems, it struggles.

Where Plain Gradient Descent Fails

1. Zig-Zag in Narrow Valleys

In ill-conditioned problems (think long, narrow, curved valleys), the gradient points mostly across the valley, toward the steep walls, rather than along it.

You move:

Left
Then right
Then left again

Progress along the valley direction becomes painfully slow.

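This happens when the condition number of the Hessian, κ = λ_max / λ_min (largest over smallest eigenvalue), is large.

Large κ ⇒ slow convergence: on a strongly convex quadratic, GD with the best fixed step size shrinks the error by only a factor of (κ − 1)/(κ + 1) per step.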
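To make the zig-zag concrete, here is a minimal sketch on a toy 2D quadratic f(w) = ½(λ₁w₁² + λ₂w₂²). The function name gd_steps, the step-size scale of 1.8 / λ_max, the starting point and the tolerance are all assumptions chosen for illustration, not values from the article.

```python
import numpy as np

def gd_steps(kappa, eta_scale=1.8, tol=1e-6, max_steps=100_000):
    """Plain GD on f(w) = 0.5 * sum(lam * w**2); returns (steps used, steep-axis history)."""
    lam = np.array([1.0, float(kappa)])   # Hessian eigenvalues: lam_min = 1, lam_max = kappa
    eta = eta_scale / lam.max()           # hypothetical choice: large but still stable (< 2 / lam_max)
    w = np.array([1.0, 1.0])
    steep_coords = [w[1]]                 # coordinate along the steep (high-curvature) axis
    for k in range(max_steps):
        if np.linalg.norm(w) < tol:
            return k, steep_coords
        w = w - eta * lam * w             # gradient of the quadratic is lam * w
        steep_coords.append(w[1])
    return max_steps, steep_coords

steps_10, steep_10 = gd_steps(10)
steps_100, _ = gd_steps(100)
steps_1000, _ = gd_steps(1000)

print("steps needed for kappa = 10, 100, 1000:", steps_10, steps_100, steps_1000)
print("steep coordinate, first iterates:", steep_10[:4])
```

The steep coordinate flips sign every step (left, right, left), while the flat coordinate shrinks by a factor of only 1 − 1.8/κ per step. So the step count grows roughly in proportion to κ.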

2. High-Frequency Noise (in SGD)

With stochastic gradients: gₖ = ∇f(wₖ) + ξₖ

Noise causes:

Jitter
Oscillations
Bouncing near minima

Mini-batches reduce variance within a step by averaging over samples. But they don’t smooth noise across steps.

Each step draws fresh noise, so step-to-step jitter remains.
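A small simulation can illustrate the difference between averaging over samples and averaging over time. This is a sketch under simplifying assumptions: a scalar gradient, Gaussian per-sample noise, and hypothetical values for sigma, batch and steps.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad, sigma, batch, steps = 1.0, 2.0, 32, 20_000   # illustrative values only

# Per-sample gradients: true gradient plus independent noise, shape (steps, batch).
per_sample = true_grad + sigma * rng.standard_normal((steps, batch))

# Mini-batch gradient g_k: average over the batch dimension at each step.
g = per_sample.mean(axis=1)
noise = g - true_grad

std_within = noise.std()                                  # expected near sigma / sqrt(batch)
lag1_corr = np.corrcoef(noise[:-1], noise[1:])[0, 1]      # expected near 0: no smoothing across steps

print("noise std:", std_within, "predicted:", sigma / np.sqrt(batch))
print("lag-1 correlation between consecutive steps:", lag1_corr)
```

Batching shrinks the noise by about 1/√B, but consecutive steps stay uncorrelated. Nothing in plain SGD averages the noise across iterations.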

3. No Memory

Vanilla GD forgets everything.

Each step only uses:

Current position
Current gradient

It ignores:

Where you were going
Whether direction was consistent

That’s inefficient in structured landscapes.

The Core Idea of Polyak Momentum

Instead of trusting only today’s gradient, accumulate history.

Define velocity:

vₖ₊₁ = β vₖ + ∇f(wₖ)

Update:

wₖ₊₁ = wₖ − η vₖ₊₁

Equivalent two-step form (a second-order difference equation in w, not a second-order method):

wₖ₊₁ = wₖ − η ∇f(wₖ) + β (wₖ − wₖ₋₁)

That’s it.

Simple change. Big effect.
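Here is a minimal sketch of both forms of the update on a toy quadratic, checking that they trace the same path. The helper names, the quadratic, and the values of η and β are assumptions for illustration. The equivalence relies on starting with v₀ = 0, which corresponds to w₋₁ = w₀.

```python
import numpy as np

def grad(w, lam):
    # Gradient of the toy objective f(w) = 0.5 * sum(lam * w**2).
    return lam * w

def heavy_ball_velocity(w0, lam, eta, beta, steps):
    # Velocity form: v_{k+1} = beta * v_k + grad(w_k);  w_{k+1} = w_k - eta * v_{k+1}
    w, v = w0.copy(), np.zeros_like(w0)
    traj = [w.copy()]
    for _ in range(steps):
        v = beta * v + grad(w, lam)
        w = w - eta * v
        traj.append(w.copy())
    return np.array(traj)

def heavy_ball_two_step(w0, lam, eta, beta, steps):
    # Two-step form: w_{k+1} = w_k - eta * grad(w_k) + beta * (w_k - w_{k-1})
    w_prev, w = w0.copy(), w0.copy()      # w_{-1} = w_0 matches v_0 = 0
    traj = [w.copy()]
    for _ in range(steps):
        w_next = w - eta * grad(w, lam) + beta * (w - w_prev)
        w_prev, w = w, w_next
        traj.append(w.copy())
    return np.array(traj)

lam = np.array([1.0, 100.0])              # hypothetical ill-conditioned quadratic
w0 = np.array([1.0, 1.0])
traj_a = heavy_ball_velocity(w0, lam, eta=0.01, beta=0.9, steps=50)
traj_b = heavy_ball_two_step(w0, lam, eta=0.01, beta=0.9, steps=50)
print("forms agree:", np.allclose(traj_a, traj_b))
```

Note that some libraries scale the gradient inside the velocity differently (for example by η or by 1 − β), so η and β values do not always transfer directly between conventions.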

What Actually Changes?

1. It Adds Memory

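Unrolling the recursion (with v₀ = 0), the velocity is an exponentially weighted sum of past gradients: vₖ₊₁ = ∇f(wₖ) + β ∇f(wₖ₋₁) + β² ∇f(wₖ₋₂) + … Scaled by (1 − β), that is roughly an exponential moving average.

So instead of reacting instantly, it builds directional conviction.

If gradients consistently point one way → speed increases, up to an effective step of about η / (1 − β).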
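A quick numerical check of what the memory does, as a sketch: feed the velocity recursion a constant gradient and then a sign-flipping one. The helper final_velocity and the value β = 0.9 are assumptions for illustration.

```python
def final_velocity(grads, beta):
    # v_{k+1} = beta * v_k + g_k, starting from v_0 = 0.
    v = 0.0
    for g in grads:
        v = beta * v + g
    return v

beta = 0.9
consistent = [1.0] * 200                          # gradient always points the same way
alternating = [(-1.0) ** k for k in range(200)]   # gradient flips sign every step (zig-zag)

v_consistent = final_velocity(consistent, beta)
v_alternating = final_velocity(alternating, beta)

print("consistent:", v_consistent, "predicted 1/(1 - beta) =", 1 / (1 - beta))
print("alternating:", abs(v_alternating), "predicted 1/(1 + beta) =", 1 / (1 + beta))
```

Consistent directions are amplified by about 1/(1 − β). Directions that flip every step are damped to about 1/(1 + β). The same mechanism underlies both the speed-up along a valley and the smoothing of oscillations across it.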

Intuition: Heavy Ball Analogy

Imagine rolling a heavy ball down a valley.

Without momentum → step, stop, step, stop. With momentum → inertia builds → smoother motion.

In the continuous-time limit (small steps), it behaves like a damped second-order ODE:

ẅ + c ẇ + ∇f(w) = 0

Acceleration + friction + force.

Optimization becomes a dynamical system, not just a local step rule.
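One way to see the link to the discrete update is a finite-difference sketch. The time step h is an assumption introduced here, and other discretizations are possible. Replace the second derivative with a central difference and the first derivative with a backward difference:

```latex
\begin{aligned}
\ddot{w} + c\,\dot{w} + \nabla f(w) &= 0 \\
\frac{w_{k+1} - 2w_k + w_{k-1}}{h^2} + c\,\frac{w_k - w_{k-1}}{h} + \nabla f(w_k) &= 0 \\
w_{k+1} &= w_k - h^2\,\nabla f(w_k) + (1 - c\,h)\,(w_k - w_{k-1})
\end{aligned}
```

Matching terms with the two-step momentum update gives η = h² and β = 1 − c h. More friction (larger c) means smaller β. No friction means β = 1.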

But It’s Not Magic

Momentum:

Needs tuning (η, β)
Can overshoot
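Is sensitive to its settings when curvature is unknown
Doesn’t guarantee the optimal worst-case rate in the general convex case

On strongly convex quadratics with well-tuned η and β, it improves the dependence on the condition number from κ to roughly √κ. Outside that setting, the gain is not guaranteed. It doesn’t solve everything.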
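To see both the gain and the tuning sensitivity, here is a sketch comparing three runs on a toy quadratic with κ = 100. The helper steps_to_converge and the test problem are assumptions for illustration. The tuned values use the classical heavy-ball formulas for quadratics, η = 4/(√L + √μ)² and β = ((√L − √μ)/(√L + √μ))², which require knowing L and μ.

```python
import numpy as np

def steps_to_converge(lam, eta, beta, tol=1e-6, max_steps=100_000):
    # Heavy-ball iteration on f(w) = 0.5 * sum(lam * w**2); beta = 0 gives plain GD.
    w = np.ones_like(lam)
    v = np.zeros_like(lam)
    for k in range(max_steps):
        if np.linalg.norm(w) < tol:
            return k
        v = beta * v + lam * w
        w = w - eta * v
    return max_steps

mu, L = 1.0, 100.0                      # smallest and largest curvature, kappa = 100
lam = np.array([mu, L])
sqrt_mu, sqrt_L = np.sqrt(mu), np.sqrt(L)

eta_gd = 2.0 / (L + mu)                                # best fixed step for plain GD
eta_hb = 4.0 / (sqrt_L + sqrt_mu) ** 2                 # tuned heavy-ball step
beta_hb = ((sqrt_L - sqrt_mu) / (sqrt_L + sqrt_mu)) ** 2

gd = steps_to_converge(lam, eta_gd, beta=0.0)
hb_tuned = steps_to_converge(lam, eta_hb, beta_hb)
hb_too_much = steps_to_converge(lam, eta_hb, beta=0.99)   # same step, too much momentum

print("plain GD:", gd)
print("tuned momentum:", hb_tuned)
print("beta = 0.99:", hb_too_much)
```

On this problem the tuned run needs several times fewer steps than GD, while β = 0.99 overshoots and rings, ending up slower than plain GD. The sketch only covers a quadratic. It says nothing about general convex or non-convex losses.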

Final Takeaway

Gradient Descent fails because:

It forgets history
It overreacts to noise
It struggles in bad geometry

Polyak Momentum fixes this by:

Adding memory
Smoothing temporal noise
Accelerating movement in narrow valleys

It’s not about “bigger steps”.

It’s about better dynamics.

If SGD makes learning possible, Momentum makes it efficient.

#Optimization #Momentum #HeavyBall #DeepLearning #MachineLearning #MathForML
