Method behind Madness: Linear Regression - Part 1

Something I read recently that tickled me was the statement “Machine Learning is just Statistics done on a Mac”.

The fact is that it’s not entirely untrue! Machine Learning borrows from a lot of fields, including mathematics and statistics. Linear Regression is simply one of those algorithms that came from statistics. Its deceptiveness lies in how powerful it can be in certain scenarios. To understand Linear Regression, we must go eat first. Yes, I always give food examples, don’t judge me. Let’s say we went to a restaurant…

Problem Statement

The service was pleasant, the food was delicious and, surprisingly, the bill was not too much. We pay the bill, tip the waiter and leave. Unbeknownst to us, the owner of the restaurant was writing down both the bill amount and the tip amount. As we leave, he stops us (this part is just to quell my vanity) and asks us to help him with a burning question that is consuming him.

This is his question: will an increase in the bill amount result in an increase in tips? More importantly, by how much? (I know! Just bear with me, okay..)


Thinking about the problem before looking at the data, let us imagine the owner only had the tip amounts recorded. If he wanted to predict the tip for the next bill, what would be the most intuitive way?

Looking at it visually, here the x-axis is the index and the y-axis is the Tip amount (₹):


If there is no other information present, then the only estimate possible would be a measure of central tendency, right? And in most cases, the best option is the mean. Put simply, the owner would be right in assuming that the next tip would be near the average of all the tips.
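As a quick sketch in Python, this baseline “model” is just the mean. The tip values below are invented purely for illustration:

```python
# A minimal sketch: with only tips recorded, the most intuitive
# single guess for the next tip is the mean of the past tips.
# (These tip amounts are invented for illustration.)
tips = [30, 45, 25, 50, 40, 35]  # tip amounts in ₹

mean_tip = sum(tips) / len(tips)
print(f"Predicted next tip: ₹{mean_tip:.2f}")  # Predicted next tip: ₹37.50
```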


Simple Linear Regression

This time, let us take two variables: the Tip amount and the Bill amount. When we have only two variables and we are trying to predict one with the help of the other using Linear Regression, it’s called a Simple Linear Regression.

Since we are trying to figure out whether the tip amount depends on the bill amount, the former becomes the dependent variable and the latter the independent variable here.

Our aim is to find the best fit line that will give us the most accurate predictions.

“How in the world do we do that?” is a very valid question at this point.

The answer begins with the Cost Function. The Cost Function is nothing but a function that describes the total error present in a particular model. In our case, the model is a Simple Linear Regression.

The equation of our line is:

$$\hat{y} = \theta_0+ \theta_1 x$$
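In code, this line is a simple function of \(x\). The parameter values used below are placeholders for illustration, not fitted values:

```python
def predict(x, theta0, theta1):
    """The line's prediction: y_hat = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Placeholder parameters (not fitted): a base tip of ₹5
# plus 10% of the bill amount.
print(predict(300, theta0=5.0, theta1=0.10))  # 35.0
```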

This is how we describe the error:

For every Bill amount, we check what our line says the Tip should be and compare it with the actual Tip received for that value of \(x\). This difference between the predicted amount and the actual amount is the error, i.e. \((\hat{y}(x_i) - y_i)\)

To penalise the larger errors more, we square this error term. For example: if the error for one meal is 3 and for another is 5, they are fairly close. But on squaring them, they become 9 and 25, and the difference between them becomes significant, i.e. \((\hat{y}(x_i) - y_i)^2\)

So far, we have found these errors and squared each of them. The last step?

It is to sum up all these squared error terms and divide by the total number of terms. This is also called the Mean Squared Error, i.e. the mean of the squared errors. Now the name makes sense..

$$J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (\hat{y}(x_i) - y_i)^2$$ $$= \frac{1}{m} \sum_{i=1}^m \left((\theta_0 + \theta_1 x_i) - y_i\right)^2$$

Here \(J\) is the Cost Function,

\(y_i\) is the actual recorded value,

\(\hat{y}(x_i)\) is the value predicted by our line for a particular \(x_i\), and

\(m\) is the number of examples/samples in the dataset.
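Putting all these steps together, here is a minimal sketch of the cost function \(J\) in plain Python. The bill and tip values are invented for illustration:

```python
def cost(theta0, theta1, xs, ys):
    """Mean Squared Error of the line theta0 + theta1 * x over the data."""
    m = len(xs)
    total = 0.0
    for x_i, y_i in zip(xs, ys):
        y_hat = theta0 + theta1 * x_i  # predicted tip for this bill
        total += (y_hat - y_i) ** 2    # squared error for this sample
    return total / m                   # mean of the squared errors

# Invented data: bill amounts and the tips actually received (₹).
bills = [120, 250, 300, 450, 500]
tips  = [15,  30,  35,  50,  55]

print(cost(theta0=5.0, theta1=0.10, xs=bills, ys=tips))  # 0.8
```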

[Figure: simpleLinear — candidate lines drawn through the Bill vs Tip data]

The best fit line is the line that passes through the data such that it has the least possible value of the cost function. The dotted lines shown in the figure are candidates for that best fit line.

For Simple Linear Regression, this can even be done manually, as we have only two variables. When we have multiple independent variables, finding the best fit line manually is not only tedious but also infeasible.
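To make “least possible value of the cost function” concrete, here is a sketch that scores a few hypothetical candidate lines (the “dotted lines”) against the same invented bill/tip data and keeps the cheapest one:

```python
# Score a few hypothetical candidate lines against invented data
# and keep the one with the lowest Mean Squared Error.
bills = [120, 250, 300, 450, 500]  # bill amounts in ₹
tips  = [15,  30,  35,  50,  55]   # tips actually received in ₹

def cost(theta0, theta1):
    """Mean Squared Error of the line theta0 + theta1 * x over the data."""
    return sum((theta0 + theta1 * x - y) ** 2
               for x, y in zip(bills, tips)) / len(bills)

# Hypothetical (theta0, theta1) pairs -- the "dotted line" candidates.
candidates = [(0.0, 0.12), (5.0, 0.10), (10.0, 0.08)]

for c in candidates:
    print(c, "-> cost:", round(cost(*c), 2))

best = min(candidates, key=lambda c: cost(*c))
print("Best candidate:", best)  # (5.0, 0.1) wins on this data
```

Trying out candidates by hand like this quickly becomes hopeless as the number of parameters grows, which is exactly why we need an automated procedure.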

How do we automate this entire process in Python?:
Method behind Madness: Linear Regression - Part 2

We need to use an optimization algorithm to achieve this; Batch Gradient Descent is one of them:
Batch Gradient Descent: The math and the code
