Simplifying ML: Neural networks – Part 3

Neural networks try to overcome a shortcoming of logistic regression, namely that we have to choose the non-linear hypothesis ourselves. Logistic regression requires that we choose an appropriate combination of polynomial terms and the order of the equation, and the problem with this is that we tend to either overfit or underfit. Neural networks, on the other hand, are able to learn their own features from the raw input features.

The neural network is modeled on the networking ability of neurons in the human brain. The brain is made up of billions of neurons. Each neuron is a processing unit with several inputs, the dendrites, and an output, the axon. Neurons communicate through electrochemical signals at the synapses, the spaces between neurons.

(Figure: a biological neuron with dendrites, an axon and synapses)

A neural network mimics the working of the neuron.

So in a neural network the features of the problem serve as the input. For example, in determining whether a mail is spam or not, the features could be the words in the subject line, the from address, the contents and so on. Based on a combination of these features we need to classify whether the mail is spam or not.

(Figure: a simple neural network with inputs x1, x2, x3 and a bias unit x0)

The above diagram shows a simple neural network with features x1, x2, x3 and a bias unit x0, with the hypothesis function

h_\Theta(x) = 1/(1 + e^{-\Theta^T x})

The edges from the features xi carry the model parameters Ɵ. In other words the edges represent weights.

A typical neural network is a network of many logistic units organized in layers, where the output of each layer forms the input to the next layer. This is shown below.

(Figure: a multi-layer neural network with an input layer, a hidden layer and an output layer)

As can be seen, in a multi-layer neural network the features x1, x2, …, xn form the input layer at the left.

These inputs become the activation units of the next layer. The key advantage of neural networks over plain logistic regression is that the values computed at each layer are fed as inputs to the next layer, which learns its own parameters on top of them. This successive refinement gives a better fit than a single hand-picked combination of features.

The activation parameters at the next layer are

a_1^{(2)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3), where g is the logistic function or the sigmoid function discussed in my previous post Simplifying ML: Logistic regression – Part 2


Here a_1^{(2)} is the first activation unit at layer 2, computed from the layer-1 parameters.

\Theta_{10}^{(1)} is the layer-1 model parameter connecting the bias unit x_0 to this activation unit. Similarly \Theta_{11}^{(1)} is the layer-1 parameter connecting x_1 to it, and so on.

Similarly the other activation parameters can be written as

a_2^{(2)} = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3)

a_3^{(2)} = g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)

h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})   – (A)
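To make these equations concrete, here is a minimal sketch of forward propagation for this 3-input, single-hidden-layer network in Python with numpy. The values in Theta1 and Theta2 are made-up placeholders, not learned parameters.

```python
import numpy as np

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights purely for illustration.
# Theta1 maps the 4 inputs (bias x0 plus x1..x3) to the 3 hidden units.
# Theta2 maps the 4 hidden values (bias a0 plus a1..a3) to the single output.
Theta1 = np.array([[ 0.5, -1.0,  2.0,  0.3],
                   [ 1.5,  0.7, -0.2,  0.9],
                   [-0.4,  1.1,  0.8, -1.2]])
Theta2 = np.array([[ 0.2, -0.6,  1.3,  0.5]])

x = np.array([1.0, 0.0, 1.0])     # raw features x1, x2, x3

a1 = np.insert(x, 0, 1.0)         # add bias unit x0 = 1
a2 = sigmoid(Theta1 @ a1)         # layer-2 activations a1..a3
a2 = np.insert(a2, 0, 1.0)        # add bias unit a0 = 1
h  = sigmoid(Theta2 @ a2)         # hypothesis h_Theta(x), equation (A)

print(h)
```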

 

The crux of neural networks is that instead of creating a hypothesis based on the set of raw features, a neural network with one or more hidden layers can learn its own features. In equation (A) we can see that the hypothesis is not a function of the raw input features x1, x2, …, xn but of a new set of features, the activation units a1, a2, …, an. In other words the network has ‘learned’ its own features.

As mentioned above, the output of each unit is the logistic or sigmoid function applied to a weighted sum of its inputs.

The beauty of neural networks based on logistic functions is that we can easily realize the equivalent of logic gates like AND, OR, NOT, NOR etc.

(Figure: a neural network with inputs x1, x2, a bias weight of -30 and input weights of +20, +20, realizing an AND gate)

The hypothesis for the above network would be

hƟ(x) = g(-30 + 20 * x1 + 20 * x2)

So for x1= 0 and x2 = 0 we would have

hƟ(x) = g(-30 + 0 + 0) = g(-30)

Since g(-30) ≈ 0, well below the 0.5 threshold, the output is 0. Only when x1 = 1 and x2 = 1 do we get g(-30 + 20 + 20) = g(10) ≈ 1, so the network behaves exactly like an AND gate.
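A quick sketch in Python makes the full truth table explicit; the only assumption is that we threshold the sigmoid output at 0.5 by rounding:

```python
import math

def g(z):
    # sigmoid / logistic function
    return 1.0 / (1.0 + math.exp(-z))

# Weights -30, +20, +20 realize the AND gate described above
for x1 in (0, 1):
    for x2 in (0, 1):
        h = g(-30 + 20 * x1 + 20 * x2)
        print(x1, x2, round(h))   # 0 for every row except x1 = x2 = 1
```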


Similarly a NOT gate can be constructed with a neural network as follows

(Figure: a neural network realizing a NOT gate)
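A similar sketch works for the NOT gate. The weights +10 and -20 are one possible choice (any sufficiently large positive bias with a larger negative input weight would do):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

# One possible weight choice for NOT: g(10 - 20*x)
for x in (0, 1):
    print(x, round(g(10 - 20 * x)))   # prints 0 -> 1 and 1 -> 0
```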

Neural networks can also be used for multi-class classification.

(Figure: a neural network with multiple output units, one per class, for multi-class classification)
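For multi-class classification the output layer has one unit per class, and the predicted class is simply the output unit with the largest activation. A minimal sketch, with made-up weights and hidden-layer activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up output-layer weights for a 4-class problem
# (one row per class; columns = bias + 3 hidden-layer activations)
Theta_out = np.array([[ 1.0, -2.0,  0.5,  0.1],
                      [-1.0,  2.5, -0.5,  0.3],
                      [ 0.2,  0.1,  1.5, -2.0],
                      [-0.5, -1.0, -0.2,  2.0]])

a = np.array([1.0, 0.3, 0.8, 0.1])       # bias + hidden activations
outputs = sigmoid(Theta_out @ a)         # one sigmoid output per class
print(outputs)
print("predicted class:", int(np.argmax(outputs)))
```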

Hence there are multiple advantages to neural networks: a) they can create complex logical models out of combinations of AND, OR and NOT gates, and b) the model's features are learned from the raw input features, which makes the hypothesis far more flexible.

It appears that interest in neural networks surged in the 1980s and then waned. The neural networks of that era were similar to the above and were based on forward propagation. However it appears that in recent times backward propagation has been used very successfully in the area of research known as ‘deep learning’.

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. A highly enjoyable and classic course!!!



Simplifying ML: Logistic regression – Part 2

Logistic regression is another class of Machine Learning algorithms which comes under supervised learning. In this technique we need to classify data rather than predict a continuous value. Take a look at my earlier post Simplifying Machine Learning algorithms – Part 1, where I discussed linear regression. For example, if we had data on tumor sizes versus whether the tumor was benign or malignant, the question is whether, given a tumor size, we can predict whether the tumor would be benign or malignant. So we need to have the ability to classify this data.

This is shown below

(Figure: benign vs malignant tumors plotted against tumor size, separable by a straight line)

It is obvious that a line with a certain slope could easily separate the two.

As another example, we could have an algorithm that automatically classifies mail as either spam or not spam based on the subject line. For example, if the subject line had words like medicine, prize, lottery etc. we could, with a fair degree of probability, classify this as spam.

However some classification problems can be far more complex. We may need to classify data like that shown below.

(Figure: data that requires a circular or elliptical decision boundary)

From the above it can be seen that the hypothesis function needs to be a second-order equation, whose decision boundary is either a circle or an ellipse.

In the case of logistic regression the hypothesis function should be able to switch between two values, 0 or 1, almost like a transistor being either in cutoff or in saturation.

In the case of logistic regression 0 <= hƟ <= 1

The hypothesis uses a function of the following form

g(z) = 1/(1 + e^{-z})

and h_\Theta(x) = g(\Theta^T x)
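A minimal sketch of the sigmoid and the hypothesis in Python with numpy, assuming the feature vector x already includes the bias term x0 = 1 and using made-up parameter values:

```python
import numpy as np

def g(z):
    # sigmoid: g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # hypothesis h_theta(x) = g(theta^T x); x must include the bias term x0 = 1
    return g(theta @ x)

theta = np.array([-1.0, 0.5])            # made-up parameters
x = np.array([1.0, 3.0])                 # bias term plus one feature
print(h(theta, x))                       # a value between 0 and 1
print(1 if h(theta, x) >= 0.5 else 0)    # classify at the 0.5 threshold
```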

(Figure: the sigmoid function g(z))

The function g(z) shown above has the characteristic shape required for logistic regression: it rapidly asymptotes to 1 for large positive z and asymptotes to 0 for large negative z. We predict y = 1 when h_\Theta(x) >= 0.5 (i.e. when \Theta^T x >= 0) and y = 0 when h_\Theta(x) < 0.5.

As in linear regression we can have the hypothesis function be of an appropriate order. For example, for the ellipse figure above one could choose a hypothesis function as follows

h_\Theta(x) = g(\Theta_0 + \Theta_1 x_1^2 + \Theta_2 x_2^2 + \Theta_3 x_1 + \Theta_4 x_2)

or, writing out the sigmoid in full,

h_\Theta(x) = 1/(1 + e^{-(\Theta_0 + \Theta_1 x_1^2 + \Theta_2 x_2^2 + \Theta_3 x_1 + \Theta_4 x_2)})

We could choose the general form of a circle, which is

f(x, y) = ax^2 + by^2 + 2gx + 2hy + d
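To see how such a second-order hypothesis would be evaluated, here is a small sketch. The \Theta values are arbitrary placeholders chosen only so that the decision boundary is roughly the circle x_1^2 + x_2^2 = 4:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x1, x2):
    # h_theta(x) = g(theta0 + theta1*x1^2 + theta2*x2^2 + theta3*x1 + theta4*x2)
    features = np.array([1.0, x1**2, x2**2, x1, x2])
    return g(theta @ features)

theta = np.array([-4.0, 1.0, 1.0, 0.0, 0.0])   # boundary roughly x1^2 + x2^2 = 4
print(h(theta, 0.5, 0.5))   # inside the circle  -> close to 0
print(h(theta, 3.0, 3.0))   # outside the circle -> close to 1
```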

The cost function for logistic regression is given below

Cost(h_\Theta(x), y) = -log(h_\Theta(x))        if y = 1
Cost(h_\Theta(x), y) = -log(1 - h_\Theta(x))    if y = 0

In the case of linear regression there was a single cost function which measured the error of the predicted value against the data.

The cost for logistic regression is given above as a pair of equations: one for the case where the label y is 1 and another for the case where it is 0.

The reason for this is as follows. If we consider y = 1 as a positive value, then when our hypothesis correctly predicts 1 we have a ‘true positive’; however if we predict 0 when it should be 1 we have a false negative. Similarly, when the data is 0 and we predict 1 this is a false positive, and if we correctly predict 0 when it is 0 it is a true negative.

Here is how this cost function was arrived at. By definition the cost function gives the error between the predicted value and the data value.

The logic for determining the appropriate function is as follows.

For y = 1:
y = 1 & hypothesis = 1, then cost = 0
y = 1 & hypothesis = 0, then cost = infinity

Similarly for y = 0:
y = 0 & hypothesis = 0, then cost = 0
y = 0 & hypothesis = 1, then cost = infinity

and the functions above serve exactly this purpose, as can be seen

(Figure: plots of -log(h_\Theta(x)) and -log(1 - h_\Theta(x)))

Hence the cost can be written in a single expression as

Cost(h_\Theta(x), y) = -y log(h_\Theta(x)) - (1 - y) log(1 - h_\Theta(x))

and the overall cost over all m training examples is

J(\Theta) = (1/m) \sum_{i=1}^{m} Cost(h_\Theta(x^{(i)}), y^{(i)})

This single expression is equivalent to the pair of equations above.
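A short sketch verifying that the single combined expression matches the piecewise definition for a few values of h_\Theta(x) and y:

```python
import math

def cost_piecewise(h, y):
    # -log(h) when y = 1, -log(1 - h) when y = 0
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def cost_combined(h, y):
    # -y*log(h) - (1 - y)*log(1 - h)
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

for h in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert abs(cost_piecewise(h, y) - cost_combined(h, y)) < 1e-12
        print(h, y, cost_combined(h, y))
```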

The same gradient descent algorithm can now be used to minimize the cost function

So we can iterate through

\Theta_j := \Theta_j - \alpha \frac{\partial}{\partial \Theta_j} J(\Theta_0, \Theta_1, \ldots, \Theta_n)

This works out to an update rule that looks just like the one for linear regression

\Theta_j := \Theta_j - \alpha (1/m) \sum_{i=1}^{m} (h_\Theta(x^{(i)}) - y^{(i)}) x_j^{(i)}
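A minimal sketch of batch gradient descent for logistic regression in Python with numpy. The toy data, learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: a column of ones (bias x0) plus one feature; labels 0/1
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])
m = len(y)

theta = np.zeros(X.shape[1])
alpha = 0.1                       # learning rate (arbitrary)

for _ in range(5000):             # iteration count (arbitrary)
    h = sigmoid(X @ theta)        # predictions for all m examples
    grad = (X.T @ (h - y)) / m    # (1/m) * sum of (h - y) * x_j
    theta = theta - alpha * grad  # simultaneous update of all theta_j

print(theta)                      # learned parameters
print(sigmoid(X @ theta))         # fitted probabilities
```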

This will enable the machine to fairly accurately determine the parameters Ɵj for the features x and provide the hypothesis function.

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. Highly recommended!!!
