Batch Gradient Descent: The math and the code

Gradient Descent is the Algorithm behind the Algorithm. It is the heart of Machine Learning. Batch Gradient Descent is probably the most popular of all optimization algorithms, and it shows up throughout the field. To fully understand the algorithm, it is important to look at it without shying away from the math behind it.

Warning: The following text may contain what some people consider too much Math. Viewer discretion is advised.

Gradient Descent is the process of reducing the Cost Function. Why is that important? Well, the Cost Function essentially represents the total error of our model. So we use Gradient Descent to iteratively reduce the overall cost, bringing the model closer to the data with every step.

To break it down, the Gradient is the slope and the Descent is, well, the descent. So quite literally, we have to descend down the slope to find the point of least error. How does that work?

Cost Equations

$$h_{\theta}(x) = \theta_0+ \theta_1 x$$

This is the equation of a line. Now we know that if we change the intercept \(\theta_0\) and the slope \(\theta_1\), it becomes a different line. For example, \(h_{\theta}(x) = 3+ 2 x\) is a completely different line from \(h_{\theta}(x) = 5+ 7 x\).

So if we want to change the line, all we have to do is update the parameters \(\theta_0\) and \(\theta_1\)

The key lies in understanding the cost function; after all, the aim is to get the lowest possible Cost.

$$J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i)^2$$ $$= \frac{1}{m} \sum_{i=1}^m \left((\theta_0 + \theta_1 x_i) - y_i\right)^2$$

To get the cost of a different line, all we have to do is alter the two parameters \(\theta_0\) and \(\theta_1\). Hence, we treat the cost as a function of \(\theta_0\) and \(\theta_1\).
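As a quick sanity check, the cost of any candidate line can be computed directly from this formula. Here is a minimal sketch; the data points are made up for illustration and roughly follow \(y = 1 + 2x\):

```python
import numpy as np

# Hypothetical data, roughly following y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

def cost(theta0, theta1):
    '''J(theta0, theta1) = (1/m) * sum((h(x_i) - y_i)^2)'''
    predictions = theta0 + theta1 * x
    return np.mean((predictions - y) ** 2)

print(cost(1.0, 2.0))  # close to the true line -> small cost (0.025)
print(cost(5.0, 7.0))  # a poor line -> a much larger cost
```

Changing the two parameters changes the line, and the cost function immediately tells us whether the new line fits the data better or worse.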

Cost Function Visualized

This is a 3D representation of the Cost Function. We have to find the lowest Cost and note the parameter values for that Cost ( \(\theta_0\) and \(\theta_1\) ). Those values will give us the Best fit line.

Let’s visualize the cost function and how to go down the slope:

To achieve the following, we use steps to move towards the lower cost. The direction of the downwards movement is decided by the gradient and the size of the step is decided by the learning rate.
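The mechanics of a step can be seen on a one-dimensional toy problem (a sketch, not part of the regression model above): descending \(J(\theta) = \theta^2\), whose gradient is \(2\theta\), with a fixed learning rate.

```python
# Toy example: minimize J(theta) = theta^2, whose gradient is 2*theta.
theta = 4.0   # an arbitrary starting point
alpha = 0.1   # learning rate: controls the size of each step

for step in range(50):
    gradient = 2 * theta              # the slope of J at the current theta
    theta = theta - alpha * gradient  # move downhill, against the gradient

print(theta)  # theta approaches the minimum at 0
```

The gradient always points uphill, so subtracting it moves us downhill; a larger \(\alpha\) would take bigger steps, and too large an \(\alpha\) would overshoot the minimum entirely.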

Another visualization to understand this better:


Math behind each step

We can describe this process as: to reach a local minimum, the cost is driven down by repeatedly updating its parameters.

So, we have to update the parameters at every step. The following updates are applied every time a step is taken.

$$\theta_0 = \theta_0 - \alpha \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} $$
$$\theta_1 = \theta_1 - \alpha \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} $$

Here, \(\alpha\) is the learning rate: a constant that determines how large each step should be.

To solve these, we have to evaluate the following partial derivatives:

$$\frac{\partial J}{\partial \theta_1} = \frac {\partial (\frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i)^2)}{\partial \theta_1}$$
$$\frac{\partial J}{\partial \theta_0} = \frac {\partial (\frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i)^2)}{\partial \theta_0}$$

Since \(h_{\theta}(x) = \theta_0+ \theta_1 x\),

taking the partial derivative of \(h_{\theta}(x)\) w.r.t. \(\theta_1\): $$\frac{\partial (\theta_0+ \theta_1 x)}{\partial \theta_1} = x $$

and taking the partial derivative of \(h_{\theta}(x)\) w.r.t. \(\theta_0\): $$\frac{\partial (\theta_0+ \theta_1 x)}{\partial \theta_0} = 1 $$

We also know the partial derivative of \(y\) is zero w.r.t. both \(\theta_1\) and \(\theta_0\):

Coming back to the slope, using the chain rule:

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^m 2\,(h_{\theta}(x_i) - y_i) \cdot \frac{\partial (h_{\theta}(x_i) - y_i)}{\partial \theta_1}$$

$$= \frac{2}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i)\, x_i$$

This sum is exactly what we get by multiplying the vector of errors element-wise with the column of \(x\) values and adding the products up:

$$\sum_{i=1}^m (h_{\theta}(x_i) - y_i)\, x_i = Sum\left((h_{\theta}(x) - y) * \begin{bmatrix} x_1\\x_2\\.\\.\\x_m\end{bmatrix}\right)$$

We know:

$$\begin{bmatrix} x_1\\x_2\\.\\.\\x_m\end{bmatrix} = X[:\:,\:1]$$

and for ease, let’s call the vector of residuals:

$$(h_{\theta}(x) - y) = Error$$


$$\frac{\partial J}{\partial \theta_1} = \frac{2}{m} * Sum(Error * X[ :\:,\:1])$$
and similarly, since \(X[:\:,\:0]\) is a column of ones,
$$\frac{\partial J}{\partial \theta_0} = \frac{2}{m} * Sum(Error * X[:\:,\:0])$$
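These closed forms can be checked numerically against a finite-difference approximation of the cost (a sketch with made-up data; the factor \(\frac{2}{m}\) is written out explicitly here rather than being folded into \(\alpha\)):

```python
import numpy as np

# Made-up data and an arbitrary parameter setting to test the derivatives at
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.5, 6.1])
theta0, theta1 = 0.5, 1.5
m = len(x)

error = (theta0 + theta1 * x) - y          # h(x_i) - y_i for each sample

# Analytic partial derivatives, as derived above
dJ_dtheta0 = (2 / m) * np.sum(error * 1)   # X[:, 0] is a column of ones
dJ_dtheta1 = (2 / m) * np.sum(error * x)   # X[:, 1] is the x column

# Central finite-difference approximation of the same partials
def J(t0, t1):
    return np.mean(((t0 + t1 * x) - y) ** 2)

eps = 1e-6
num_dtheta0 = (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)
num_dtheta1 = (J(theta0, theta1 + eps) - J(theta0, theta1 - eps)) / (2 * eps)

print(abs(dJ_dtheta0 - num_dtheta0) < 1e-5)  # True: the formulas agree
print(abs(dJ_dtheta1 - num_dtheta1) < 1e-5)  # True
```

Because \(J\) is quadratic in the parameters, the central difference matches the analytic gradient essentially exactly, which is a handy way to catch sign or factor errors in a derivation.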

Plugging them back into the main equation we get:

$$\theta_0 = \theta_0 - \alpha * Sum(Error * X[:\:,\:0]) $$
$$\theta_1 = \theta_1 - \alpha *Sum(Error * X[:\:,\:1]) $$

Note: the constant factor \(\frac{2}{m}\) is absorbed into the learning rate, forming a new adjusted \(\alpha\).

This process of updating the parameters needs to be repeated until we reach a point of convergence.
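In practice, "convergence" is often detected by watching the change in cost between steps. Here is a hedged sketch of such a stopping rule; the tolerance value is an arbitrary choice, and `step_fn`/`cost_fn` are hypothetical helpers standing in for one update step and one cost evaluation:

```python
def descend_until_converged(step_fn, cost_fn, theta, tol=1e-8, max_iters=10000):
    '''Repeat parameter updates until the cost stops improving.
       step_fn(theta) -> new theta; cost_fn(theta) -> scalar cost.'''
    prev_cost = cost_fn(theta)
    for _ in range(max_iters):
        theta = step_fn(theta)
        new_cost = cost_fn(theta)
        if abs(prev_cost - new_cost) < tol:  # cost barely changed: converged
            break
        prev_cost = new_cost
    return theta

# Usage on the 1-D toy problem J(theta) = theta^2, gradient 2*theta:
result = descend_until_converged(lambda t: t - 0.1 * 2 * t, lambda t: t ** 2, 4.0)
print(result)  # close to the minimum at 0
```

Fixing the iteration count in advance (as the snippet below does) is the simpler alternative; a tolerance-based stop trades that simplicity for not running longer than needed.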

Generalizing and putting it all together:

Now, we have to generalize and scale to cases where many variables and many parameters are present. For these cases, we use vectors and matrices to make the calculations much faster. Without Linear Algebra, the practical applications of these algorithms would not be possible at all.

We should vectorize all that we possibly can. For example:

$$X = \begin{bmatrix}1 & x_{11} & x_{21} &.&.&x_{n1}\\1&x_{12}&x_{22}&.&.&x_{n2}\\1&x_{13}&x_{23}&.&.&x_{n3}\\.&.&.&.&.&.\\.&.&.&.&.&.\\1&x_{1m}&x_{2m}&.&.&x_{nm}\end{bmatrix}$$

$$\theta = \begin{bmatrix}\theta_0&\theta_1&\theta_2&.&.&.&\theta_n\end{bmatrix}$$

Therefore :

$$\theta_j^{(new)} = \theta_j^{(old)} - \alpha * Sum(Error * X[:\:,\:j]) $$

Here \(n\) is the number of variables present while \(m\) is the number of samples present.

This is the final snippet of the algorithm:

import numpy as np

def computeCost(X, y, theta):
    '''J(theta) = (1/m) * sum of squared errors, as defined above.'''
    error = (X @ theta.T) - y
    return np.sum(np.square(error)) / len(y)

def gradientDescent(X, y, iters, theta, alpha):
    '''To run a loop where each loop is a step towards the local optima.
        Here: X is the input matrix (first column all ones),
        y is the vector of actual recorded values,
        iters is the number of iterations until convergence,
        theta is the 1 x (n+1) parameter row vector,
        alpha is the learning rate'''
    cost = np.zeros(iters)
    for i in range(iters):
        error = (X @ theta.T) - y              # h(x_i) - y_i for every sample
        term = np.multiply(error, X)           # element-wise Error * X[:, j]
        theta = theta - (alpha / len(y)) * term.sum(axis=0)  # update every theta_j
        cost[i] = computeCost(X, y, theta)     # track cost to monitor convergence
    return theta, cost
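To see the update rule recover known parameters end to end, here is a self-contained sketch on synthetic data (the dataset, seed, and hyperparameters are invented for illustration; the constant 2 is folded into \(\alpha\) as noted above):

```python
import numpy as np

# Made-up dataset drawn from y = 2 + 3x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 0.1, size=100)

# Design matrix: a column of ones for theta_0, then the x values for theta_1
X = np.column_stack([np.ones_like(x), x])
theta = np.zeros(2)

alpha, iters, m = 0.01, 5000, len(y)
for _ in range(iters):
    error = X @ theta - y                        # residual vector: h(x_i) - y_i
    theta = theta - (alpha / m) * (X.T @ error)  # X.T @ error == Sum(Error * X[:, j])

print(theta)  # lands near [2, 3], the parameters the data was generated from
```

Note that `X.T @ error` computes every `Sum(Error * X[:, j])` at once, which is the same quantity as `term.sum(axis=0)` in the snippet above, just expressed as a single matrix-vector product.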

To see how Linear Regression works end to end, check these out:
Method Behind Madness: Linear Regression part-1
Method Behind Madness: Linear Regression part-2
