From the course: Neural Networks and Convolutional Neural Networks Essential Training
Gradient descent
Remember when we talked about artificial neurons learning to recognize handwritten digits, we left with a crucial question: how does the network actually learn? How does it figure out the right weights, those important values that determine what the neuron pays attention to? Let me tell you a story that happened to me last month. I was driving to a new client meeting in an unfamiliar part of town, and I was running late, and my GPS kept recalculating the route. At first I was frustrated. Why couldn't it just give me the perfect directions from the start? And then I realized something fascinating. My GPS was doing exactly what neural networks do when they learn. It was making its best guess, checking if that guess was working, and then adjusting its approach based on the feedback. Here's how it worked. My GPS initially estimated the fastest route based on its current knowledge. But as I drove, it continuously received new information: traffic updates, road closures, my actual speed. And it realized its prediction was wrong. It calculated how far off it was, and that difference between the predicted arrival time and my actual arrival time is what's called the loss, or the error. The GPS then asked itself, how can I adjust my route recommendations to reduce this error? Should it avoid highways more, or maybe weight side streets differently? It made small adjustments to its internal decision-making process, and it recalculated. And this is exactly how neural networks learn, through a process called gradient descent. Just like my GPS, a neural network makes predictions, measures how wrong those predictions are, then adjusts its internal settings, those weights we talked about, to do better next time. Now, let's say our neural network is trying to recognize handwritten sixes. It looks at an image and says, I'm 70 percent confident this is a five. But the correct answer is a six. The neural network has made an error.
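To make the idea of a loss concrete, here is a minimal sketch in Python. The confidence values and the choice of mean squared error are my own illustrative assumptions, not from the course: the network outputs a confidence for each digit, and the loss measures how far those confidences are from the correct answer.

```python
# Toy illustration of a loss: the network outputs a confidence for each
# digit 0-9, and we compare that against the correct one-hot answer.

def squared_error_loss(predicted, target):
    """Mean squared error between predicted confidences and the true label."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

# Illustrative confidences for digits 0-9: 70% confident it's a five.
predicted = [0.01, 0.02, 0.03, 0.05, 0.04, 0.70, 0.05, 0.04, 0.03, 0.03]
# But the correct answer is a six: a 1 at index 6, zeros elsewhere.
target = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

loss = squared_error_loss(predicted, target)
print(loss)  # a large error, driven mostly by the wrong "five" guess
```

A perfect prediction would give a loss of zero; learning is the process of nudging the weights so this number shrinks.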
Now, here's where it gets interesting. Imagine you're hiking in thick fog, trying to find the bottom of a valley, the lowest point. You can't see where you're going, but you can feel the slope of the ground under your feet. Your strategy is simple: feel which direction slopes downward most steeply, take a step in that direction, and repeat. Step by step, you'll eventually reach the bottom of the valley. This is exactly what gradient descent does, except instead of hiking down a physical mountain, the network is navigating down an error mountain. The height of the mountain represents how wrong the network's predictions are. The bottom of the valley represents perfect accuracy. The network feels which direction will reduce the error most quickly, and that's the gradient, or the slope. Then it takes a small step in that direction by adjusting its weights slightly. But here's a crucial detail: how big should each step be? In my hiking analogy, imagine that everywhere you look around you, there are mountains and valleys. If you take tiny steps, you'll eventually reach the bottom, but it might take forever. If you take huge steps, you might overshoot the valley and end up on the wrong mountain entirely. This step size is called the learning rate in neural networks. Too small, and learning takes forever; too large, and the network never settles on good weights. It keeps overshooting the optimal solution. Now, remember our perceptron from last time? It had a harsh decision boundary: either the neuron fired and recommended the restaurant with a value of 1, or it didn't and the output was 0. And this creates a problem for our hiking analogy. Imagine if the mountain slope suddenly turned into a cliff, a complete vertical drop. You couldn't determine which way to step, because the ground doesn't slope gradually in any direction. In mathematics, we define the gradient as the change in y divided by the change in x.
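The hiking analogy can be sketched in a few lines of code. This is a minimal one-dimensional illustration with my own example valley and numbers, not code from the course: we repeatedly step against the slope, and the learning rate decides how big each step is.

```python
# Minimal gradient descent on a one-dimensional "error valley".
# error(w) = (w - 3) ** 2 has its lowest point (zero error) at w = 3.

def slope(w):
    """Gradient (slope) of the error function (w - 3)**2, which is 2*(w - 3)."""
    return 2 * (w - 3)

def descend(start, learning_rate, steps):
    """Repeatedly step downhill and return the final position."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * slope(w)  # step in the downhill direction
    return w

print(descend(start=10.0, learning_rate=0.1, steps=100))  # settles near 3
print(descend(start=10.0, learning_rate=1.1, steps=100))  # too large: overshoots and diverges
```

With a moderate learning rate the position glides toward the bottom of the valley; with one that is too large, each step overshoots the minimum by more than the last, exactly the behavior described above.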
If you have a complete vertical drop, then there is no change in x. And because there is no change in x for a vertical drop, we can't calculate the gradient at that point, because you can't divide by zero. The solution was brilliant. Instead of these harsh on-off decisions, we use something called a sigmoid function. Think of it as smoothing out those cliff edges into gentle slopes. Instead of saying definitely a six or definitely not a six, the neural network now says, I'm 73.2 percent confident this is a six, and this creates smooth rolling hills instead of jagged cliffs, so our gradient descent algorithm can always find which direction leads downhill. Now, this might seem like a technical detail, but it's the difference between AI that works and AI that doesn't. This learning process, making predictions, measuring errors, and gradually adjusting, is happening millions of times as your phone learns to recognize your voice, as Netflix figures out what you might want to watch, and as your e-mail learns what counts as spam. Every recommendation you get, every photo that gets automatically tagged, every voice command that gets understood correctly, it all started with the simple idea of following the gradient downhill towards better predictions. But here's where things get really interesting. What we've described works great for simple problems, but what happens when we encounter something that stumped early AI researchers for decades? In the next video, I'll show you the XOR problem, a deceptively simple logical puzzle that broke the original perceptron and sparked what many call the first AI winter. More importantly, I'll show you how the solution to this problem led to the deep learning revolution we're still living through today.
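Before moving on, the sigmoid smoothing described above can be sketched in code. This is a minimal illustration with my own example inputs, not code from the course: it contrasts the perceptron's cliff-like step function with the sigmoid's gentle slope.

```python
import math

def step(x):
    """The perceptron's harsh decision: fires (1) or doesn't (0), a cliff at x = 0."""
    return 1 if x >= 0 else 0

def sigmoid(x):
    """Smooth replacement for the step: a confidence between 0 and 1."""
    return 1 / (1 + math.exp(-x))

# The step function jumps abruptly from 0 to 1 at x = 0, so its slope there
# is undefined; the sigmoid glides between them, so a gradient always exists.
print(step(-0.01), step(0.01))  # abrupt jump: 0 then 1
print(sigmoid(0))               # 0.5, right in the middle of the slope
print(sigmoid(1.0))             # roughly 0.73, i.e. "73% confident"
```

Because the sigmoid's slope is defined everywhere, gradient descent can always feel which way is downhill, which is exactly why it replaced the hard on-off decision.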