Gradient descent in Python from scratch

Prerequisites:

  1. Some programming knowledge
  2. A bit theoretical background — loss function, derivative, chain rule, etc.
  3. Willingness to try out different stuffs

Step-1: Download and Install Python

Get Python from here, and install. If you run into any issue, there are many solutions available online.

Step-2: Download and install PyCharm

We are using PyCharm only because I am more comfortable with it. Feel free to use the IDE you are comfortable with.

Get the Professional version here. It is not free, but you can use the trial for a month. For our purpose, the free Community edition might work, but I never used that.

Step-3: Setup Project

  1. Open PyCharm
  2. File -> New Project
  3. Select Scientific from the left panel
  4. Keep other settings as shown in the screenshot below, and click Create

Step-4: Install Numpy and Matplotlib

As shown in the screenshot below, go to the terminal panel at the bottom and type:

Now, let's install Matplotlib

pip install matplotlib

Step-5: Understand the exercise

In this exercise, we will find a polynomial approximation of a sine curve. Here is a memory refresher on the polynomial functions. They can be represented as:

For example, this is a polynomial function:

But these are not:

Okay, let’s keep things simpler. First of all, let us rid of the subscripts, and use a,b,c,d instead. Also, let us take up to 3rd-degree polynomial (simply speaking, up to the term x to the power 3). So our intended function is:

We want the function to behave like:

So, we can summarize the problem this way — finding the right values of a,b,c,d so that equation(1) works the same (or almost the same) as (2), meaning

Step-6: Get hands dirty — Plot sine curve

Open main.py file in the project, paste the following code. To run the project, either click the play button or right-click main.py and click the Run option.

You don’t need to understand how everything in this code works. Just understanding what it is doing is fine for now.

In line#5, we are generating an array of 2000 elements from -π to +π, and all the numbers in between are linearly distributed. For example, np.linspace(0, 100, 5) will generate an array of elements 0, 25, 50, 75, 100.

In line#6, we are generating sine function output for each element in x, meaning y_sin is an array of 2000 elements as well, the first element being sin(-π) and the last element being sin(π).

In line#7 and #8, we are plotting the graph, as visible in the right part of the screenshot above.

Step-7: Let’s cheat

I had actually run the full program before writing this article and derived some values for a,b,c,d. Before going further, let us plot the resulting curve to get an understanding of what we are trying to achieve.

Run the following code:

In the screenshot below, you can see the sine curve in blue and the polynomial curve (y_polynomial_cheat in the example) in orange. Notice how they are almost the same, but not exactly the same.

Step-8: Let’s start deriving in the proper way now

In the code, now, initialize a,b,c,d with random values (np.random.randn()). Also, let us rename y_polynomial_cheat to y_polynomial_derived.

Unsurprisingly, those are not the same by any means. Now, we will be updating a,b,c,d in a way, so that the orange curve gradually aligns with the blue curve.

Step-9: The holy grail — Gradient Descent

Clone the complete code from Github, or simply copy from below:

If everything goes okay, you will see this output!

Looks scary? I got your back — I promise (if you know chain rule).

From Line#15 to #34, we are running a loop 2000 times. Every time we are evaluating y_polynomial_derived, comparing it with y_sin, and based on the difference (loss), updating a,b,c,d.

The most important part of this loop is the last block (Line#31–34), where we are updating parameters. Everything else is done to know how much to update. So, lets start investigating from the block and gradually go upward.

  1. So, we are subtracting descent_grad_a from a. Let us try to understand how we are deriving descent_grad_a. Same, intuition can be applied for the other 3 parameters.
  2. As we see in line#26, descent_grad_a is a product of grad_a and learning_rate. Let us first find out what grad_a is.
  3. grad_a is gradient or derivative of the loss with respect to a, which in other words, a’s impact on the loss. Why there is a relation between derivatives and impact on change? That is kind of the definition of derivative — isn’t it?

Now, why a’s gradient equals to grad_y_polynomial_derived.sum()? From chain rule formula:

gradient of loss with respect to a

= (gradient of loss with respect to y_polynomial_derived)
* (gradient of y_polynomial_derived with respect to a)

= (grad_y_polynomial_derived) * (1) [Applying differential calculus]

= grad_y_polynomial_derived

Why “gradient of y_polynomial_derived with respect to a” equals 1? You need differential calculus knowledge for it, which is outside the scope of this article. But I will give a quick explanation - it is derivative of “a + b * x + c * x ** 2 + d * x ** 3” with respect to a. Here all terms are constant with respect to a except for the first one (a), and derivative of this term is “1” and so the derivative becomes “1 + 0 + 0 + 0”, meaning 1.

In line#22, 23, 24, we are calculating gradients of b,c,d respectively applying differential calculus in the same way. Same idea applies to line#19 to calculate the derivative of loss with respect to y_polynomial_derived.

Why we are using the sum of the gradients? Would not it be more rational to use mean instead? Yes, but that doesn’t affect since we are finally scaling down these gradients by a factor (learning_rate).

So, is it the only use of learning_rate, scaling down? No! Remember from calculus that gradients give you the impact information for a “very close” area around a point. So, from the statement

“a has a positive impact on loss when a = p

It is safer to assume that

“a has a positive impact on loss when a = p+0.000001 as well”

than

“a has a positive impact on loss when a = p+10000 as well”

So would a value much much less than 0.000001 for learning rate would give a better result? Yes! But that would make the update much slower. So, we decide a value somewhere in the middle. This is a hyperparameter — meaning you have the freedom to set the value based on the specific problem. For more explanation on learning rate, have a look here.

One final question — why we are subtracting gradient and not adding? See — if gradient of loss with respect to a (grad_a) is positive, it means a has a positive impact on the loss. Meaning, increasing a will increase loss. But we want to decrease the loss here. So, we descent a by the amount of it’s gradient (with some learning_rate factor of course). Understood now why it is called “gradient descent?”.

Step-10: Where to go from here

If you really really want a deep intuition in gradient descent, it will be impossible unless you try it out by yourself.

Here are some sample and simple exercises you can do:

  1. Try different values of learning_rate
  2. Try different values of the iterations
  3. Try different loss functions
  4. In the loop, print a,b,c,d, loss, and plot the graphs every time — or maybe in a regular interval (say, once in 100 iterations)
  5. If you get comfortable with graph, try plotting loss with change in a,b,c,d
  6. Set breakpoints in PyCharm and inspect variables.
  7. Take a bigger range for y_sin, maybe -4π to +4π, and see if the approximation still works. If not, why not? (food for thought)

Now, please write in the comment if there is any question. I will try to address it.

Credits:

I used the example from PyTorch's official tutorial and changed the code a bit to make it more readable to newbies.

Jack of a few trades, master of none.