Gradient Descent
Basic concepts and how to implement Gradient Descent
Gradient descent is one of the basic but very important machine learning techniques: it finds the best values of a model's weights, among the possible ones, such that the total loss is minimized.
When we define a model such as an SVM, Linear Regression, or a Decision Tree Classifier, we need some way of knowing whether our model is being trained well or not. To know that, we use a loss function, which computes the difference between the true values, usually denoted $y$, and the predicted values, $\hat{y}$.
One simple loss function is MSE (mean squared error), whose equation is

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
When we first initialize a model, its weights are usually generated randomly, and every time we train, we update them so that the next training step yields a lower loss value.
Repeating this process until the loss no longer decreases is called gradient descent.
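Concretely, each update moves every parameter a small step in the direction opposite its gradient; with a learning rate $\eta$, the updates are

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b}$$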
Let's randomly generate x values and set y values in the range (1, 10). If you are following along, your generated data will look different, since it is random.
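The original data-generation snippet isn't reproduced here, so below is a minimal sketch of what it might look like with numpy; the sample size, noise distribution, and variable names are assumptions on my part, not necessarily the post's exact code.

```python
import numpy as np

n = 100                                    # assumed sample size
x = np.random.rand(n) * 10                 # random x values
noise = np.random.uniform(1, 10, size=n)   # noise in the range (1, 10)
y = x + noise                              # y trends upward with x
```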
If we try to find a line that has the lowest loss value (or that fits the data), it would be something like this.
The red line is not the best possible line, though, since it was drawn by hand.
So to find the best line, let's start from 0 for both the weight and the bias (in practice, weights are also initialized randomly).
The line with this weight and bias will look like this.
And let's use the same loss function mentioned above.
To perform gradient descent, we need the partial derivatives of the loss with respect to the weight and the bias (if there is one). For our linear model $\hat{y}_i = w x_i + b$ with the MSE loss above, the corresponding equations are

$$\frac{\partial L}{\partial w} = -\frac{2}{n}\sum_{i=1}^{n} x_i \left(y_i - \hat{y}_i\right), \qquad \frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)$$
Before proceeding further, let me go over how we arrive at the optimal weights (and bias).
In a perfect world where there always exists only one optimum, our loss values might look like this.
We can see that the loss starts out large, moves toward 0 as the weight progresses, and then increases again afterward. Cases like the above, where there is only a single valley, are called convex problems, and in such cases we are guaranteed to reach the optimal value.
It does not matter whether the loss can actually reach 0 or not, because even if it cannot, it still achieves the best possible minimum.
So how do we get there? If you have taken calculus, you know that at each point of a curve there exists a tangent line (or multiple ones, in some cases). The slopes of those tangent lines, the derivatives, tell us which direction to move in to reach the lowest point.
Whether the problem is convex or not, we can always reach an optimum (global or local, depending on the convexity of the problem). Those optimal points have a derivative of 0 (a horizontal tangent line).
In the graph above, we can see the horizontal tangent at the point 0, and that is where we would like to be.
So when a problem is not convex, meaning there exist many points with a derivative of 0, we are not guaranteed to reach the global optimum but will most likely reach a local one. These cases will be covered in another post.
The code for the partial derivatives and MSE is defined next.
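That code block didn't survive here, but based on the equations above, a reconstruction might look like the following (the function and variable names are my own):

```python
def predict(x, w, b):
    """Linear model: y_hat = w * x + b."""
    return w * x + b

def mse(y, y_hat):
    """Mean squared error between true and predicted values."""
    return np.mean((y - y_hat) ** 2)

def gradients(x, y, y_hat):
    """Partial derivatives of the MSE loss with respect to w and b."""
    dw = -2 * np.mean(x * (y - y_hat))   # dL/dw
    db = -2 * np.mean(y - y_hat)         # dL/db
    return dw, db
```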
Since we are using MSE, which is a convex function, we know we can get to the global optimum with the following code.
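Here is a sketch of such a training loop, reusing the helpers above; the learning rate and number of steps are assumptions, not necessarily what the original post used.

```python
w, b = 0.0, 0.0    # start from zero, as described earlier
lr = 0.001         # assumed learning rate
steps = 10_000     # assumed number of training steps

for _ in range(steps):
    y_hat = predict(x, w, b)
    dw, db = gradients(x, y, y_hat)
    w -= lr * dw   # step against the gradient
    b -= lr * db

print(w, b)        # on the post's data this landed near w = 1.011, b = 3.413
```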
We can see that the best w is around 1.011 and b is 3.413.
One very important thing in gradient descent is choosing an appropriate learning rate. The learning rate determines how far we move toward the optimum on each step. Setting it too high will prevent us from reaching the optimum because the weights bounce around too much, while setting it too low will make training slow.
Let's look at an example of setting a high learning rate. Take the graph of $y = w^2$ above, where our current $w$ is $-4$. To get to the optimum, we have to move to the right by 4. With the derivative of the function at that point, $\frac{dy}{dw} = 2w = -8$, and a learning rate of 2, we will have the new weight

$$w_{new} = -4 - 2 \times (-8) = 12$$
We can see that the red dot is now much further away from the starting point!
Now if we use too low a learning rate, such as 0.00001, the new weight from $w = -4$ will be

$$w_{new} = -4 - 0.00001 \times (-8) = -3.99992$$

It progressed so little that it seems as though it did nothing at all.
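As a quick check, here is a small sketch (mine, not from the post) reproducing both updates on $y = w^2$:

```python
def step(w, lr):
    """One gradient descent step on y = w**2, whose derivative is 2w."""
    return w - lr * 2 * w

print(step(-4, 2))        # 12.0      -> overshoots far past the optimum
print(step(-4, 0.00001))  # -3.99992  -> barely moves at all
```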
Because of this, when we try to find the best weights, we have to try many different learning rates and use the one that works best.
The following 6 graphs plot the loss for different learning rates on the same data we used above.
The first two learning rates are set so high that the final loss remains large. The next two seem fine, but we can see that the learning rate of 0.001 works better (converges faster) than 0.01. The last two seem to work as well, but they converge so slowly that we would either not want to use them or would have to train for much longer.
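The plotting code isn't included here; a sketch of such a sweep, reusing the helpers from above, could look like this (the exact learning rates and step count are my guesses):

```python
import matplotlib.pyplot as plt

learning_rates = [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]  # assumed values

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for ax, lr in zip(axes.flat, learning_rates):
    w, b = 0.0, 0.0
    losses = []
    for _ in range(1000):
        y_hat = predict(x, w, b)
        losses.append(mse(y, y_hat))
        dw, db = gradients(x, y, y_hat)
        w -= lr * dw   # very high rates will visibly diverge here
        b -= lr * db
    ax.plot(losses)
    ax.set_title(f"lr = {lr}")
plt.tight_layout()
plt.show()
```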
We've covered what gradient descent is and how different learning rates affect its speed of convergence. Gradient descent is a very important concept in all of machine learning, from supervised and unsupervised learning to deep learning, and without understanding how it works, the models you build might not work well.
Thank you again for reading the post, and if you spot any typos or mistakes, or have any other suggestions, please let me know.
You can find the full code at this link.