Logistic Regression: Flag bearer of Classification

Imagine tossing a few unfair coins. Now imagine that the parameters of unfairness for these coins are all the same, and that we have tossed every coin except one a large number of times. Wouldn’t it be crazy if we could predict what the toss of the last coin will result in?

Well… we’re in luck (I sense a pattern here): Logistic Regression boasts of doing exactly that.

We would be able to say with a high degree of certainty that 6 or 7 or 8 or 9… times out of 10 the coin will land heads/tails. Now this is nothing but the probability of an event (heads or tails). The highest probability is one, i.e. 10 times out of 10 (for example, a coin with heads on both sides), and the lowest is zero.

Logistic Regression is actually one of the many popular Classification Algorithms.

Classification is the act of grouping an example or sample into one of two or more classes. When there are only two classes to group into (heads and tails, for example), it is called Binary Classification. When there are more than two classes, it is called Multi-Class Classification.

As with every machine learning algorithm, there are three sections:

1. Model Equation
2. Cost Function
3. Method of Cost/Error Reduction


Model:

Since Logistic Regression is a linear model, it can be thought of as an extension of Linear Regression. Therefore, this is the first stage:

$$Z = X * \theta^T$$

Note: This is a vectorized equation; how it was obtained can be seen here.

The next part deals with how to convert this into a probabilistic output.

Why?

Quite simply, let us imagine a coin with certain attributes tossed about a thousand times. We notice that it lands heads 700 times and tails 300 times. The heads and tails are the outputs/results, which are driven by a probability that we can easily guess is around 0.7 for heads.

Therefore, to predict the result of a coin with similar attributes, we have to convert the linear combination of characteristics into a probability!

By characteristics, I mean the values of ‘X’, which could be anything: a shift in the center of mass, the angle of tilt of one surface, the weight of the coin, the friction coefficients of the surfaces, etc. We assume these all linearly affect the probability of the output (heads/tails). We therefore multiply them by parameters ‘ \(\theta\) ’ (controlled by us), which are random at first but which we later plan to make as precise as possible.

So the linear combination of characteristics is \(z = x * \theta^T\) . This is what we have to convert into that driving probability.

This is done using a sigmoid function.

Sigmoid

It is easier to see what the sigmoid function does:

$$h_\theta(x) = g(z) = \frac {e^z}{(1 + e^z)} = \frac{1}{1 + e^{-z}}$$

As we can observe, when the input is 0 (all characteristics become zero and the coin becomes fair), the output is 0.5 or 50%. The range of the output is from 0 to 1. Note that when the parameters ( \(\theta\) ) cause ‘z’ to be positive, the probability is greater than 0.5, approaching 1, and when it is negative, vice versa is observed.
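To get a feel for these values, here is a minimal sketch in plain NumPy (the inputs 0, 2 and -2 are just illustrative):

import numpy as np

def g(z):
    # Sigmoid: 1 / (1 + e^(-z)), equivalent to e^z / (1 + e^z)
    return 1.0 / (1.0 + np.exp(-z))

print(g(0))    # 0.5   -> a perfectly fair coin
print(g(2))    # ~0.88 -> leaning strongly towards heads
print(g(-2))   # ~0.12 -> leaning strongly towards tails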

So we have to design the cost function such that the parameters can be trained accordingly.


Cost Function

We now know that z is the output of a combination of parameters and attributes. We also know that when we take the sigmoid of this z, we get a probability.

The reason we chose the sigmoid function ( \(h_\theta(x)\) ) is that it denotes a true probability. It serves as the first step in defining the cost function, which is derived from Maximum Likelihood Estimation.

How?

Let us assume that:

$$P(y=1|x;\theta) = h_\theta(x)$$ $$P(y=0|x;\theta) = 1 - h_\theta(x)$$

This is rather obvious. To make it easier to perform calculations, it is convenient to combine these two cases into a single compressed form:

$$p(y|x;\theta) = (h_\theta(x))^y (1-h_\theta(x))^{1-y}$$
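We can check that this compressed form recovers both cases by substituting the two possible values of \(y\):

$$y = 1: \quad p(y|x;\theta) = (h_\theta(x))^1(1-h_\theta(x))^0 = h_\theta(x)$$ $$y = 0: \quad p(y|x;\theta) = (h_\theta(x))^0(1-h_\theta(x))^1 = 1-h_\theta(x)$$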

Assuming that the m training examples were generated independently, we can then write down the likelihood of the parameters as the product of the probabilities of individual examples.

$$L(\theta) = \prod_{i=1}^m p(y^{(i)} | x^{(i)};\theta)$$

Maximizing Likelihood

Why maximize likelihood? Maximizing the likelihood is, in a way, the same as reducing the total error, which is nothing but the cost function. Our aim is to arrive at the cost function through the likelihood expression.

As \(x\) increases, \(log(x)\) also increases. Since the likelihood (a product of probabilities) is positive, we are allowed to maximize the log of the likelihood instead, as it means the same thing. This makes the algebra much easier.

$$l(\theta) = log(L(\theta))$$

We know, \([log(ab) = log(a) + log (b)]\) , therefore:
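$$l(\theta) = \log \prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta) = \sum_{i=1}^m \log \, p(y^{(i)}|x^{(i)};\theta)$$

Expanding the compressed form of \(p\) inside the logarithm (and scaling by \(\frac{1}{m}\)) then gives the expression below.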

Maximize:

$$l(\theta) = \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right]$$

(Note: \(\frac{1}{m}\) is simple scaling and is added to make the final equation easier to work with)

In other words:

Minimize:

$$-l(\theta) = Cost \: Function = J(\theta)$$

Cost Equation:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right]$$

Here is the Python code for it:


import numpy as np

def Sigmoid(z):
    # Element-wise sigmoid; works on scalars and numpy arrays alike.
    return 1.0 / (1.0 + np.exp(-z))

def Hypothesis(X, Theta):
    ''' We obtain the h_theta(X) or the hypothesis function.
        Here X should be a numpy array of dimensions (m, n) whose first column
        (the bias column) is all ones.
        Theta is the 1-D parameter vector of length n.
    '''
    z = np.dot(X, Theta)
    return Sigmoid(z)

def CostFunction(X, Y, Theta):
    # Vectorized form of J(theta): the dot products implement the sum over examples.
    h_of_x = Hypothesis(X, Theta)
    SumOfErrors = np.dot(Y, np.log(h_of_x)) + np.dot((1.0 - Y), np.log(1.0 - h_of_x))
    J = (-1.0 / len(Y)) * SumOfErrors
    return J

Gradient Descent

The idea is to reach a local minimum (the lowest cost) by updating the parameters ( \(\theta\) ) and then using those to compute the new cost derivative at every step.

$$Repeat \: \{$$

$$\theta_{j}^{(new)} := \theta_{j}^{(old)} - \alpha \frac{\partial J(\theta)}{\partial \theta_{j}^{(old)}} $$ $$\} \: update \: all \: \theta_j \: simultaneously$$

Cost Derivative

By simple calculus, we can figure out the derivative of Cost Function:

$$\frac{\partial J(\theta)}{ \partial \theta_j} = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$
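The calculus hinges on the neat derivative of the sigmoid, \(g'(z) = g(z)(1-g(z))\). Applying the chain rule to a single training example:

$$\frac{\partial}{\partial \theta_j}\left[y\log(h_\theta(x)) + (1-y)\log(1-h_\theta(x))\right] = \left(\frac{y}{h_\theta(x)} - \frac{1-y}{1-h_\theta(x)}\right)h_\theta(x)(1-h_\theta(x))\,x_j = (y - h_\theta(x))\,x_j$$

The minus sign in \(J(\theta)\) flips this to \((h_\theta(x) - y)\,x_j\), and averaging over the m examples gives the expression above.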

Gradient Descent in Action

$$Repeat \: \{$$

$$\theta_{j}^{(new)} := \theta_{j}^{(old)} - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} $$ $$\} \: update \: all \: \theta_j \: simultaneously$$

Here is the Python code:


def CostDerivative(X, Y, Theta):
    ''' This function gives the derivative of the cost function of logistic regression,
        as a vector with one entry per parameter.
    '''
    # (1/m) * X^T (h_theta(X) - Y): the vectorized form of the sum above.
    return (1.0 / len(Y)) * np.dot(np.transpose(X), Hypothesis(X, Theta) - Y)

def GradientDescent(X, Y, Theta, n_i, alpha=0.01):
    ''' This function performs Gradient Descent, updates the parameters and returns
        the new Theta along with the cost at every step.
        n_i is the number of iterations, alpha is the learning rate.
    '''
    cost = np.zeros(n_i)

    for i in range(n_i):
        # Update all parameters simultaneously, then record the new cost.
        Theta = Theta - alpha * CostDerivative(X, Y, Theta)
        cost[i] = CostFunction(X, Y, Theta)

    return Theta, cost
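Here is a minimal usage sketch on toy data (the numbers and shapes are made up purely for illustration):

# Toy data: m = 4 examples, a bias column of ones plus n = 2 features.
X = np.array([[1.0,  0.5,  1.2],
              [1.0, -0.3,  0.4],
              [1.0,  2.1, -0.7],
              [1.0, -1.5,  0.9]])
Y = np.array([1.0, 0.0, 1.0, 0.0])
Theta = np.zeros(3)

Theta, cost = GradientDescent(X, Y, Theta, n_i=1000, alpha=0.1)
print(Theta)     # the trained parameters
print(cost[-1])  # final cost; it should be well below cost[0]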



Batch Gradient Descent has been written about in detail here.


In Closing

From the Cost Derivative and the Gradient Descent equations, it might seem like Linear Regression and Logistic Regression are the same. But that is incorrect, as the hypothesis functions ( \(h_\theta(x)\) ) of Linear and Logistic Regression are different.

The final Theta is used in the \(h_\theta(x)\) to get the driving probability.
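To turn that probability into an actual heads/tails call, the usual convention is to threshold at 0.5; a minimal sketch (this Predict helper is my own addition, not part of the code above):

def Predict(X, Theta, threshold=0.5):
    # Class 1 (heads) wherever the predicted probability reaches the threshold, else 0 (tails).
    return (Hypothesis(X, Theta) >= threshold).astype(int)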

The code can also be found in this GitHub repo.


To see how Linear Regression works, you can check these out:
Method Behind Madness: Linear Regression part-1
Method Behind Madness: Linear Regression part-2
