Source: James et al. Introduction to Statistical Learning (Springer 2013)

In Simple Linear Regression, we had only two variables, so the data made a 2D plot, and the “best fit” linear model was a line. With three variables, the points make a 3D plot, and the “best fit” linear model is a plane! What about ‘n’ variables, and hence ‘n’ dimensions?

The “best fit” linear model will be a hyperplane with $$n-1$$ dimensions!

## Converting Equations to General Case Scenario:

First, the equation of our linear model:

$$h_{\theta}(x) = \theta_0+ \theta_1 x$$

becomes $$h_{\theta}(x) = \theta_0+ \theta_1 x_1+\theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$$

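Here $$n$$ is the number of independent variables. Note that $$n$$ now counts only the independent variables; including $$y$$, there are $$n+1$$ variables in total.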

A general rule of thumb to remember at this point is ‘vectorize wherever possible’. Since we will probably be dealing with large amounts of data, vectorizing provides huge performance improvements. To do so, let:

$$y = \begin{bmatrix}y_1\\y_2\\y_3\\\vdots\\y_m\end{bmatrix},$$

$$X = \begin{bmatrix}1 & x_{11} & x_{21} & \cdots & x_{n1}\\1 & x_{12} & x_{22} & \cdots & x_{n2}\\1 & x_{13} & x_{23} & \cdots & x_{n3}\\\vdots & \vdots & \vdots & \ddots & \vdots\\1 & x_{1m} & x_{2m} & \cdots & x_{nm}\end{bmatrix},$$

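Here $$n$$ is the number of independent variables and $$m$$ is the number of samples, so $$x_{ji}$$ is the value of the $$j$$-th variable in the $$i$$-th sample. The leading column of ones multiplies the intercept $$\theta_0$$.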

$$\theta = \begin{bmatrix}\theta_0&\theta_1&\theta_2&\cdots&\theta_n\end{bmatrix}$$

Therefore $$X \bullet \theta^T$$ becomes:

$$\begin{bmatrix}\theta_0 +\theta_1 x_{11} +\theta_2 x_{21} +\cdots+\theta_nx_{n1}\\\theta_0+\theta_1x_{12}+\theta_2x_{22}+\cdots+\theta_nx_{n2}\\\vdots\\\theta_0+\theta_1x_{1m}+\theta_2x_{2m}+\cdots+\theta_nx_{nm}\end{bmatrix}$$


So what does this achieve? Our equation boils down to just:

$$h_{\theta}(X) = X \bullet \theta^T$$
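To make the vectorized form concrete, here is a minimal NumPy sketch. The helper names `add_intercept` and `predict` and the sample numbers are made up for illustration; they are not taken from the original notebook.

```python
import numpy as np

def add_intercept(features):
    """Prepend a column of ones so that theta_0 acts as the intercept."""
    m = features.shape[0]
    return np.hstack([np.ones((m, 1)), features])

def predict(X, theta):
    """Vectorized hypothesis: one prediction per row of X."""
    return X @ theta.T

# Hypothetical data: m = 3 samples, n = 2 independent variables.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
X = add_intercept(features)           # shape (3, 3)
theta = np.array([[0.5, 1.0, 2.0]])   # row vector, shape (1, 3)

predictions = predict(X, theta)       # shape (3, 1)
```

Each row of `predictions` is $$\theta_0 + \theta_1 x_1 + \theta_2 x_2$$ for that sample, computed in a single matrix product instead of a loop over samples.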

## Cost Equation:

Now the cost equation:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i)^2$$

becomes:

$$J(\theta) = \frac{1}{m} \operatorname{sum}\left((X\theta^T - y)^2\right) = \frac{1}{m} (X\theta^T - y)^T (X\theta^T - y)$$

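```python
import numpy as np

def computeCost(X, y, theta):
    inner = np.power((X @ theta.T) - y, 2)  # `@` is matrix multiplication
    # Divides by 2m, not m as in the equation; the 1/2 only rescales the cost.
    return np.sum(inner) / (2 * len(X))
```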
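As a sanity check, the two vectorized forms of the cost can be compared numerically. The sketch below uses made-up data and hypothetical function names, and it keeps the $$\frac{1}{m}$$ scaling from the equations (`computeCost` above additionally halves the result, which does not change where the minimum lies).

```python
import numpy as np

def cost_sum_form(X, y, theta):
    """(1/m) * sum of the elementwise squared residuals."""
    residuals = X @ theta.T - y
    return np.sum(residuals ** 2) / len(X)

def cost_inner_product_form(X, y, theta):
    """(1/m) * r^T r, where r is the residual column vector."""
    residuals = X @ theta.T - y
    return (residuals.T @ residuals).item() / len(X)

# Hypothetical data: 3 samples, an intercept column plus 1 feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([[2.0], [4.0], [7.0]])
theta = np.array([[0.0, 2.0]])

cost_a = cost_sum_form(X, y, theta)
cost_b = cost_inner_product_form(X, y, theta)
```

Both functions return the same value because multiplying the residual vector by its own transpose is exactly the sum of its squared entries.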



The parameters $$\theta$$ that minimize this cost are found with gradient descent. Since that topic is too big to cover here, it is covered separately:

Batch Gradient Descent: The math and the code

## Exploring an example:

Taking the previous example of the Restaurant:

We first perform a preprocessing step, normalization, to scale the values down proportionally:
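One common way to do this is z-score standardization, sketched below. This is an assumption: the exact scheme used in the notebook is not shown here, and the function name `standardize` and the numbers are hypothetical.

```python
import numpy as np

def standardize(features):
    """Scale each column to zero mean and unit standard deviation."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / std, mean, std

# Hypothetical raw data: two columns on very different scales.
raw = np.array([[2000.0, 2.0],
                [1600.0, 3.0],
                [2400.0, 4.0]])

scaled, mean, std = standardize(raw)
```

Returning `mean` and `std` matters in practice: any new input has to be scaled with the same values before it is passed to the fitted model.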

Then we apply all the aforementioned concepts to get a prediction (Best Fit Line):

To see visually how the cost has decreased, we plot the value of the cost function at every iteration like so:


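```python
import matplotlib.pyplot as plt
import numpy as np

# `iters` and `cost` (one value per iteration) come from the gradient descent run.
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(np.arange(iters), cost, 'r')  # red line: cost at each iteration
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
```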


What we get is the following figure:

The entire code and its execution are available in a Jupyter notebook here

Here is the first part:
Method Behind Madness: Linear Regression part-1