Logistic regression is another class of Machine Learning algorithms that falls under supervised learning. Despite the name, it is used to classify data rather than to predict a continuous value. In my earlier post, Simplifying Machine Learning algorithms – Part 1, I discussed linear regression. As an example of classification, suppose we had data on tumor sizes along with whether each tumor was benign or malignant. The question is whether, given a tumor size, we can predict whether the tumor is benign or malignant. To do this we need to be able to classify the data.

This is shown below

It is obvious that a line with a certain slope could easily separate the two.

As another example, we could have an algorithm that automatically classifies mail as either spam or not spam based on the subject line. For example, if the subject line had words like medicine, prize, lottery etc., we could classify it as spam with a fair degree of probability.

However, some classification problems can be far more complex. We may need to classify data such as that shown below.

From the above it can be seen that the boundary separating the two classes is a second-order equation, either a circle or an ellipse, so the hypothesis function needs second-order terms.

In the case of logistic regression the hypothesis function should be able to switch between 2 values, 0 or 1, almost like a transistor being either in cutoff or in saturation.

Its output is therefore bounded: 0 <= h_{Ɵ}(x) <= 1

The hypothesis function uses a function of the following form

g(z) = 1/(1 + e^{-z})

and h_{Ɵ}(x) = g(Ɵ^{T}x)

The function g(z) shown above has the characteristic required for logistic regression as it has the following shape

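The function g(z) equals 0.5 at z = 0, rapidly approaches 1 as z becomes large and positive, and rapidly approaches 0 as z becomes large and negative. We predict y = 1 when h_{Ɵ}(x) >= 0.5, that is when Ɵ^{T}x >= 0, and y = 0 when h_{Ɵ}(x) < 0.5.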
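As a rough illustration of the shape of g(z), here is a minimal Python sketch of the function and of thresholding its output at 0.5. The names sigmoid and predict, and the sample parameter values, are made up for this example and do not come from the course.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z),
    # which avoids overflow in exp() when z is very negative.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(theta, x, threshold=0.5):
    """Return 1 if h(x) = g(theta . x) reaches the threshold, else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical parameters: z = -3 + 1 * size, so the switch happens at size 3.
theta = [-3.0, 1.0]
small = predict(theta, [1.0, 2.0])  # z = -1, h < 0.5
large = predict(theta, [1.0, 5.0])  # z = 2, h > 0.5
```

The first element of each x is the constant 1 that multiplies Ɵ_{0}.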

As in linear regression, the hypothesis function can be of an appropriate order. For example, for the ellipse figure above one could pass the following second-order expression to g

z = Ɵ_{0} + Ɵ_{1}x_{1}^{2} + Ɵ_{2}x_{2}^{2} + Ɵ_{3}x_{1} + Ɵ_{4}x_{2}

so that

h_{Ɵ}(x) = 1/(1 + e^{-(Ɵ_{0} + Ɵ_{1}x_{1}^{2} + Ɵ_{2}x_{2}^{2} + Ɵ_{3}x_{1} + Ɵ_{4}x_{2})})

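The expression inside the exponent has the same form as the general second-degree equation without a cross term,

f(x, y) = ax^{2} + by^{2} + 2gx + 2hy + d

which can describe a circle when a = b and, more generally, an ellipse when a and b have the same sign. Here g and h are just coefficients, unrelated to g(z) and h_{Ɵ}(x).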
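To make the second-order hypothesis concrete, the sketch below plugs hand-picked parameters into it. The values Ɵ = (-1, 1, 1, 0, 0) are an assumption chosen so that the decision boundary is the unit circle x_{1}^{2} + x_{2}^{2} = 1; they are not fitted to any data.

```python
import math

def hypothesis(theta, x1, x2):
    """h(x) = g(theta0 + theta1*x1^2 + theta2*x2^2 + theta3*x1 + theta4*x2)."""
    z = (theta[0]
         + theta[1] * x1 ** 2
         + theta[2] * x2 ** 2
         + theta[3] * x1
         + theta[4] * x2)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: z = x1^2 + x2^2 - 1, zero exactly on the unit circle.
theta = [-1.0, 1.0, 1.0, 0.0, 0.0]

inside = hypothesis(theta, 0.0, 0.0)       # z = -1, so h < 0.5
on_boundary = hypothesis(theta, 1.0, 0.0)  # z = 0, so h = 0.5
outside = hypothesis(theta, 2.0, 0.0)      # z = 3, so h > 0.5
```

With these values, points inside the circle would be classified as 0 and points on or outside it as 1.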

The cost function for logistic regression is given below

Cost(h_{Ɵ}(x), y) = { -log(h_{Ɵ}(x)) if y = 1

-log(1 - h_{Ɵ}(x)) if y = 0

In the case of linear regression there was a single cost function, which measured the error between the predicted value and the data.

In logistic regression the cost is given as above, as a set of 2 equations: one for the case where the data is 1 and another for the case where the data is 0.

The reason for this is as follows. If we consider y = 1 as the positive value, then when our hypothesis correctly predicts 1 we have a ‘true positive’. If we predict 0 when it should be 1, we have a ‘false negative’. Similarly, when the data is 0 and we predict 1 we have a ‘false positive’, and if we correctly predict 0 when the data is 0 we have a ‘true negative’.
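The four outcomes can be counted with a few lines of Python. This is only a sketch: the function name confusion_counts and the sample labels are invented for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives, treating 1 as positive."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for y, p in zip(actual, predicted):
        if y == 1 and p == 1:
            counts["TP"] += 1   # data is 1, predicted 1
        elif y == 1 and p == 0:
            counts["FN"] += 1   # data is 1, predicted 0
        elif y == 0 and p == 1:
            counts["FP"] += 1   # data is 0, predicted 1
        else:
            counts["TN"] += 1   # data is 0, predicted 0
    return counts

# Made-up labels and predictions
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1]
counts = confusion_counts(actual, predicted)
```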

Here is how the cost function

Cost(h_{Ɵ}(x), y) = { -log(h_{Ɵ}(x)) if y = 1

-log(1 - h_{Ɵ}(x)) if y = 0

was arrived at. By definition, the cost function gives the error between the predicted value and the data value.

The logic for determining the appropriate function is as follows

For y = 1

y = 1 & hypothesis = 1 then cost = 0

y = 1 & hypothesis = 0 then cost = Infinity

Similarly for y = 0

y = 0 & hypothesis = 0 then cost = 0

y = 0 & hypothesis = 1 then cost = Infinity

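The functions above serve exactly this purpose: -log(h_{Ɵ}(x)) is 0 when h_{Ɵ}(x) = 1 and grows without bound as h_{Ɵ}(x) approaches 0, while -log(1 - h_{Ɵ}(x)) is 0 when h_{Ɵ}(x) = 0 and grows without bound as h_{Ɵ}(x) approaches 1.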
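A quick numerical check of this behaviour, as a sketch: the function name cost and the probe values 0.999 and 0.001 are chosen only for illustration. Note that the hypothesis must stay strictly between 0 and 1 on the side where the log is taken, since log(0) is undefined.

```python
import math

def cost(h: float, y: int) -> float:
    """Two-case cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)

# y = 1: nearly right costs almost nothing, nearly wrong costs a lot
good_when_one = cost(0.999, 1)
bad_when_one = cost(0.001, 1)

# y = 0: the mirror image
good_when_zero = cost(0.001, 0)
bad_when_zero = cost(0.999, 0)
```

The closer the hypothesis gets to the wrong extreme, the larger the cost becomes, without any upper limit.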

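Hence the cost for a single example can be written as

Cost(h_{Ɵ}(x), y) = -y * log(h_{Ɵ}(x)) - (1 - y) * log(1 - h_{Ɵ}(x))

This is the same as the equation above: when y = 1 the second term vanishes, and when y = 0 the first term vanishes. The overall cost J(Ɵ) is the average of this cost over the m training examples

J(Ɵ) = 1/m Σ_{i=1}^{m} Cost(h_{Ɵ}(x^{(i)}), y^{(i)})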
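The equivalence of the single-line form and the two-case form can be checked numerically. In this sketch the function names and sample values are invented, and averaging the per-example cost over the training set is an assumption that follows the usual definition of J(Ɵ).

```python
import math

def cost_two_case(h, y):
    """Cost written as two separate cases."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def cost_combined(h, y):
    """Cost written as a single expression."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

def average_cost(hs, ys):
    """Mean of the per-example cost (assumed definition of J)."""
    return sum(cost_combined(h, y) for h, y in zip(hs, ys)) / len(hs)

# Made-up hypothesis outputs and labels
hs = [0.9, 0.2]
ys = [1, 0]
j = average_cost(hs, ys)
```

For y = 1 the factor (1 - y) removes the second term, and for y = 0 the factor y removes the first, so both forms give the same number for any h strictly between 0 and 1.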

The same gradient descent algorithm can now be used to minimize the cost function

So we can iterate through

Ɵ_{j} = Ɵ_{j} - α ∂/∂Ɵ_{j} J(Ɵ_{0}, Ɵ_{1}, …, Ɵ_{n})

This works out to an update that looks the same as for linear regression, except that h_{Ɵ}(x) is now the logistic function

Ɵ_{j} = Ɵ_{j} - α 1/m Σ_{i=1}^{m} (h_{Ɵ}(x^{(i)}) - y^{(i)}) x_{j}^{(i)}
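The update rule can be sketched in plain Python as follows. Everything specific here is an assumption made for illustration: the tiny tumor-size dataset is invented, and the learning rate α = 0.1 and the 5000 iterations are arbitrary choices rather than recommended values.

```python
import math

def hypothesis(theta, x):
    """h(x) = g(theta . x) with g the logistic function."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, xs, ys):
    """Average cost J(theta) over all examples."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = hypothesis(theta, x)
        total += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    return total / len(xs)

def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    """Batch gradient descent, updating every theta_j simultaneously."""
    m, n = len(xs), len(xs[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # h(x_i) - y_i for every example, using the current theta
        errors = [hypothesis(theta, x) - y for x, y in zip(xs, ys)]
        # (1/m) * sum_i (h(x_i) - y_i) * x_ij for each parameter j
        grad = [sum(e * x[j] for e, x in zip(errors, xs)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Invented data: tumor size -> 0 (benign) or 1 (malignant)
sizes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
xs = [[1.0, s] for s in sizes]  # leading 1.0 multiplies theta_0
ys = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(xs, ys)
predictions = [1 if hypothesis(theta, x) >= 0.5 else 0 for x in xs]
```

On this small, cleanly separable dataset the learned parameters should reproduce the labels; real data would usually need feature scaling and a check that the cost is actually decreasing.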

This will enable the machine to determine the parameters Ɵ_{j} for the features x fairly accurately and so provide the hypothesis function.

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. Highly recommended!!!
