This post is a draft.

In the post, we suppose you have a basic understanding of derivative and gradient, however it’s not a must. If you know how to code in Python, you can figure out what is gradient descent and how it works by reading the following codes and explanations.

## Preliminary

Gradient descent[1] is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes proportional steps to the negative of the gradient (or of the approximate gradient) of the function at the current point. Otherwise, if one takes steps to the positive of the gradient, one approaches a local maximum of that function. The procedure is then known as gradient ascent.

### Why do we use Gradient Descent?

Intuitively, what will you do when I ask to you find out the minimum values of function $f(x)=\frac{x^4}{4} - x^3 + x^2$? According to the text book, the first step is to compute the derivative $f’(x)$ of $f(x)$ with respect to $x$. Then make $f’(x)=x(x-1)(x-2)=0$, the solution is $x=0,1,2$. The two local minimums are both 0, so the global minimum is 0. This is the way we humans tackle these kind of problems, even more complex ones. Computer never do like that, I mean computer is not that clever. For gradient descent, it search the negative direction of the gradient at a point to find a solution where the derivative of target function is zero.

## Show Me the Code

In this section, we take Stochastic Gradient Descent as example.

### Question 1: How does code work?

The above code, we make $\delta = np.dot(x, theta)$ as a whole variable, then we get the derivative:

$$der = \frac{d(L(x, theta)}{d(\delta)} = -1 * (y - \delta) \tag{1.3.1}$$

First we should take a look at the problem why we have to calculate the derivative der. The target function is the loss function of the model, so we need to minimize it as possible as we can. It’s a good way to reduce the value $abs(y - \delta$) with the calculated derivative. We can decrease $\delta$ if $y < \delta$ and increase $\delta$ if $y > \delta$ to make the loss function L smaller.

Here we derive the negative of gradient with $der$ in equation 1.3.1:

$$neg_grad = -1 * der = y - \delta = \begin{cases} & > 0, \quad y > \delta; \qquad//\ \delta\ needs\ to\ increase \\ & < 0, \quad y < \delta; \qquad//\ \delta\ needs\ to\ decrease \\ & = 0, \quad y = \delta; \qquad//\ \delta\ is\ ok \end{cases} \tag{1.3.2}$$

With equation 1.3.2, we can update

$$\delta’ = \delta + \lambda * neg_grad \tag{1.3.3}$$

to reduce cost function, $\lambda$ is the step size. You may ask why we have use the negative of gradient and why is gradient the direction of steepest ascent? The textbook said target function decreases fastest if one goes from the point $\alpha$ in the direction of the negative gradient of F at $\alpha$. Now I will prove it for you. At first let’s recall the concept of gradient. Let $f(x_1, x_2, \ldots x_n):\mathbb{R}^n \rightarrow \mathbb{R}$, the partial derivatives of $f$ are the rates of change along the basis vectors of $x$. The partial derivative, or rate of change along $e_i$ can be expressed like this:

$$der(f(x))_i=\frac{\partial f}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x+h*e_i) - f(x)}{h} \qquad //\ it’s\ a\ number. \tag{1.3.4}$$

Actually the gradient is the directional derivative, the gradient along $e_i$ is:

$$grad(f(x))_i=der(f(x))_i * e_i\qquad //\ it’s\ a\ vector. \tag{1.3.5}$$

So if we change x along $e_i$ direction, we can increase or decrease target function $f$.

1. $if\ der(f(x))_i > 0$, We can increase x along $e_i$ direction to increase $f$.

2. $if\ der(f(x))_i < 0$, We can decrease x along $e_i$ direction to increase $f$

Finally we can draw the conclusion like this:

$$diff = \lim_{h \to 0} f(a + h * der(f(a))_i*e_i) - f(a) = \lim_{h \to 0} f(a+h*grad(f(a))_i > 0 \tag{1.3.6}$$

We will increase target function by changing x along positive of gradient, otherwise changing x along negative of gradient to reduce target function.

### Question 2: How to explain equation 1.4.1?

However in the code we use equation 1.4.1 instead of 1.3.3. Here we got the problem that the gradient is of $\delta$, how can we use for theta directly in the example?

To figure out this question, we suppose

$$theta’ = theta + alpha*np.dot(x.T, neg_grad) \tag{1.4.1}$$

then we calculate the new $\delta’$:

$$\delta’ = np.dot(x, theta’) = np.dot(x, theta) + np.dot(x, alpha*np.dot(x.T, neg_grad))\\ = \delta + alpha*np.dot(np.dot(x, x.T), neg_grad) \tag{1.4.2}$$

With sample $x_i$, it’s is obvious in equation 1.4.2 that matrix $np.dot(x_i, x_i.T) > 0$, so if $neg_grad_i > 0$, then $\delta_i’ > \delta_i$, which means $\delta_i’$ increased, in other words, we can increase $\delta$ by making $neg_grad_i>0$ and vice versa. At this point we will find that equation 1.4.1 have the same effect with equation 1.3.3. Bingo, the gradient for $\delta$ also can be used for $\theta$ and the hypothesis proposed above is correct.

### Question 3: How about calculate the derivative of $theta$?

Another question is how about calculate the derivative of $theta$, rather than $\delta=np.dot(x, theta)$. You can give it a try yourself. However you will find it’s quite complicated and you have to calculate the derivative of a matrix, it’s really a disaster. So this a trick you have to take in mind when dealing with gradient descent problems.

### Question 4: Will the target function converge?

Then you may ask how could the target function converges to the desired minimum at the end. When you update $theta$ for sample $x_i$, sample $x_k$ will be affected. How can you make sure that target function or cost function will converge to a tiny number.

### Question 5: Is it a global minimum?

In this post, we talked about how to implement GD. You may think it’s an easy task, since it’s only a few lines of code. However, I have to remind you that you may miss something important. Is the solution we get a global minimum or local minimum? The target function we use in the example code is a convex function which means local minimum is necessarily global. However, in many situations, the target function we faced in real world is not a convex one, leading to a bunch of local minimums and there are good chances you will stuck in certain one rather than the global minimum. You can imagine with the increase of parameter number, there could be more local minimums. For example, if $x_1$, $x_2$ makes $\frac{\partial f}{\partial x} = 0$, and $y_1$, $y_2$ makes $\frac{\partial f}{\partial y} = 0$, number of local minimums could be up to $2*2=4$.

## Recommendation

The idea of gradient descent we get in textbook is always abstract, when it comes to the real world problems, there seems to be much confusion and you may get no clues what to do. It greatly dependents on the train model and target function of task you are tackling when you are implementing gradient descent algorithm, also there are some tricks you have to keep in mind. We strongly recommend you to write your own code after reading this article. If you want to be able to create arbitrary architectures based on new academic papers or read and understand sample code for these different architectures, I think that it’s a killer exercise.