Unraveling the mysteries of life

This article was published in Gigaom, Nov 23, “Unraveling the mysteries of life

SUMMARY:

novaThe future of technology will bring big changes, including advances in AI, brain-to-brain interfaces, and the ability to halt death

Time, space and matter were created 13.7 billion years ago, when the Big Bang occurred. This pale, blue planet, so termed by Carl Sagan, our earth, came into existence about 4.5 billion years ago. Life originated on earth about 3.8 billion years ago. Our species, the home sapiens, came much later at about 0.2 million years while recorded history is merely 6000 years old.

However in the last 60 years or so, man has started to unravel many secrets of his own existence. There have been extremely rapid advances in science and mankind is now grappling with very profound aspects of life from intelligence, perception, aging all the way to death itself…. more

Find me on Google+

Simplifying ML: Neural networks- Part 3

Neural networks try to overcome the shortcomings of logistic regression in which  we have to choose a non-linear hypothesis. Logistic regression requires that we choose an appropriate combination of polynomial terms and the order of the equation. The problem with this is sometimes we either tend to overfit or underfit. Neural networks allow the ability to learns new model parameters from the basis raw parameters.

The neural network is modeled on the neural networking ability of the human brain. The brain is made of trillions of neurons. Each neuron is a processing unit which has several inputs in the dendrites and an output the axon. The neurons communicate thro a combination of electro chemical signal at the synapses or the spaces between the neuron.

neuron

A neural network mimics the working of the neuron.

So in a neural network the features of the problem serve as input. For e.g in the case of being able to determine if a mail is spam or not the features could be the words in the subject line, the from address, the contents etc. Based on a combination of these features we need to classify whether the mail is spam or not.

31

The above diagram shows a simple neural network with features x1, x2, x3 and a bias unit x0

 

With a hypothesis function hƟ(x) = 1/(1 + e-x)

The edges from the features xi  are the model parameters Ɵ. In other words the edges represent weights.

A typical neural network is a network of many logistic units organized in layers. The output of each layer forms the input to the next subsequent layer. This is shown below

32

As can be seen in a multi-layer neural network at the left we have the features x1,x2, .. xn.

This at the layer becomes the activation unit. The key advantage of neural networks over regular logistic regression that learns the models parameters is that learned model parameters are input to the next subsequent layers which learn the model parameters more finely. Hence this gives a better fit for the combination of parameters.

The activation parameters at the next layer are

a12 = g(Ɵ101x0+ Ɵ111x1+ Ɵ121x2 + Ɵ131x3) where g is the logistic function or the sigmoid function discussed in my previous post Simplifying ML: Logistic regression – Part 2

33

Here a12 is the activation parameter at layer 1

Ɵ10 is the model parameter at layer 1 and is the 0th parameter. Similarly Ɵ11 is the model parameter at layer 1 and is the 1st parameter and so on.

Similarly the other activation parameters can be written as

a22 = g(Ɵ201x0+ Ɵ211x1+ Ɵ221x2 + Ɵ231x3)

a32 = g(Ɵ301x0+ Ɵ311x1+ Ɵ321x2 + Ɵ331x3)

hƟ(x) = a13 = g(Ɵ102a0+ Ɵ112a1+ Ɵ122a2 + Ɵ132a3  – (A)

 

The crux of neural networks is that instead of creating a hypothesis based on the set of raw features, the neural network with multiple hidden layers can learn its own features. In the equation (A) we can see that the hypothesis is not a function of the input raw features x1,x2,… xbut on a new set of features or the activation units a1,a2, … an . In other words the network has ‘learned’ its own features.

As mentioned above the output of each layer is the logistic function or the sigmoid function

The beauty of neural networks based on logistic functions is that we can easily realize the equivalent of logic gates like AND, OR, NOT, NOR etc.

The hypothesis for the above network would be

34

hƟ(x) = g(-30 + 20 * x1 + 20 * x2)

So for x1= 0 and x2 = 0 we would have

hƟ(x) = g(-30 + 0 + 0) = g(-30)

Since g(-30) < g(0) < 0.5 = 0

37

Similarly a NOT gate can be constructed with a neural network as follows

35

38

Neural networks can also be used for multi class classification.

36

Hence there are multiple advantages to neural networks. Neural networks are amenable to a) creating complex logic models of combinations of AND, NOT, OR gates

b) The model parameters are learned from the raw parameters and can be more flexible.

It appears that the interest in neural networks surged in the 1980s and then waned, The neural networks were similar to the above and were based on forward propagation. However it appears that in recent time’s backward propagation has been used successfully in areas of research known as ‘deep learning’

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. A highy enjoyable and classic course!!!


Find me on Google+

Simplifying ML: Logistic regression – Part 2

Logistic regression is another class of Machine Learning algorithms which comes under supervised learning. In this regression technique we need to classify data. Take a look at my earlier post Simplifying Machine Learning algorithms – Part 1 I had discussed linear regression. For e.g if we had data on tumor sizes versus the fact that the tumor was benign or malignant, the question is whether given a tumor size we can predict whether this tumor would be benign or cancerous. So we need to have the ability to classify this data.

This is shown below

4

It is obvious that a line with a certain slope could easily separate the two.

As another example we could have an algorithm that is able to automatically classify mail as either spam or not spam based on the subject line. So for e.g if the subject line had words like medicine, prize, lottery etc we could with a fair degree of probability classify this as spam.

However some classification problems could be far more complex.  We may need to classify another problem as shown below.

5

From the above it can be seen that hypothesis function is second order equation which is either a circle or an ellipse.

In the case of logistic regression the hypothesis function should be able to switch between 2 values 0 or 1 almost like a transistor either being in cutoff or in saturation state.

In the case of logistic regression 0 <= hƟ <= 1

The hypothesis function uses function of the following form

g(z) = 1/(1 + e‑z)

and hƟ (x) = g(ƟTX)

6

The function g(z) shown above has the characteristic required for logistic regression as it has the following shape

The function rapidly asymptotes at 1 when hƟ (x) >= 0.5 and  hƟ (x) asymptotes to 0 when hƟ (x) < 0.5

As in linear regression we can have hypothesis function be of an appropriate order. So for e.g. in the ellipse figure above one could choose a hypothesis function as follows

hƟ (x) = Ɵ0 + Ɵ1x12 + Ɵ2x22 + Ɵ3x1 +  Ɵ4x2

 

or

 

hƟ (x) = 1/(1 + e –(Ɵ0 + Ɵ1×12 + Ɵ2×22 + Ɵ3×1 +  Ɵ4×2))

We could choose the general form of a circle which is

f(x) = ax2 + by2 +2gx + 2hy + d

The cost function for logistic regression is given below

Cost(hƟ (x),y) = { -log(hƟ (x))             if y = 1

-log(1 – hƟ (x)))       if y = 0

In the case of regression there was a single cost function which could determine the error of the data against the predicted value.

The cost in the event of logistic regression is given as above as a set of 2 equations one for the case where the data is 1 and another for the case where the data is 0.

The reason for this is as follows. If we consider y =1 as a positive value, then when our hypothesis correctly predicts 1 then we have a ‘true positive’ however if we predict 0 when it should be 1 then we have a false negative. Similarly when the data is 0 and we predict a 1 then this is the case of a false positive and if we correctly predict 0 when it is 0 it is true negative.

Here is the reason as how the cost function

Cost(hƟ (x),y) = { -log(hƟ (x))             if y = 1

-log(1 – hƟ (x)))       if y = 0

Was arrived at. By definition the cost function gives the error between the predicted value and the data value.

The logic for determining the appropriate function is as follows

For y = 1

y=1 & hypothesis = 1 then cost = 0

y= 1 & hypothesis = 0 then cost = Infinity

Similarly for y = 0

y = 0 & hypotheses  = 0 then cost = 0

y = 0 & hypothesis = 1 then cost = Infinity

and the the functions above serve exactly this purpose as can be seen

7

Hence the cost can be written as

J(Ɵ) = Cost(hƟ (x),y) = -y * log(hƟ (x))  – (1-y) * (log(1 – hƟ (x))

This is the same as the equation above

The same gradient descent algorithm can now be used to minimize the cost function

So we can iterate througj

Ɵj =   Ɵj – α δ/δ Ɵj J(Ɵ0, Ɵ1,… Ɵn)

This works out to a function that is similar to linear regression

Ɵj = Ɵj – α 1/m { Σ hƟ (xi) – yi} xj i

This will enable the machine to fairly accurately determine the parameters Ɵj for the features x and provide the hypothesis function.

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. Highly recommended!!!

Find me on Google+

Simplifying Machine Learning algorithms – Part 1

Machine learning or the ability to use computers to predict values, classify data or identify patterns is truly a fascinating field. It is amazing how algorithms can come to conclusions on data. Detecting patterns is a inborn ability of the human mind. But our mind cannot handle large quantities of data with many features. It is here that machines have an edge over us.

This post is inspired by the Machine Learning course at Coursera conducted by Professor Andrew Ng of Stanford. The lectures are truly lucid and delivered with amazing clarity. In a series of post I will be trying to distil the meaning and motivation behind the algorithms that are part of machine learning.

There are 2 major types of learning

a)      Supervised learning b) Unsupervised learning

Supervised learning: In supervised learning we have to infer the relationship between input data and output values. The intention of supervised learning is determine the possible out for some random input once the relationship has been determined. Some examples of supervised learning are linear regression, logistic regression etc.

Unsupervised learning: In unsupervised learning the problem is to determine patterns and structure in unlabeled data. Some examples of unsupervised learning are K-Means clustering, hidden Markov models etc.

In this post I would like to take a look at Supervised Learning algorithms

Linear Regression

In regression problems we try to infer the relationship between a set of input parameters to an output value. Let us we have data for the number of rooms vs. price of the house as shown below

1

Depending on the data we could either fit a straight line or use a linear fit. Alternatively we could fit a higher order curve to data.

The function that determines the relationship is also known as hypothesis function. This can be represented as follows for e.g a hypothesis function with a single feature

hƟ(x) = Ɵ1x+ Ɵ0

 

The above equation is the hypothesis function where Ɵ is the parameter and x is the feature

We could have a higher order hypothesis function as follows

hƟ(x) = Ɵ2x2+ Ɵ1x+Ɵ0

 

To evaluate whether the hypothesis function is able to map the input and related output accurately is known as the ‘cost function’.

The cost function can be represented as

J(Ɵ) = 1/2m Σ(hƟ (xi)  – y i)2

The cost function really calculates the ‘mean squared error’ of the actual data points (y) with the points on the hypothesis function (hƟ). Clearly higher the value of J(Ɵ) the greater is the error in predicting the output based on a set of input parameters. If we just took the error instead of the squared error then if there were data points on either side of the predicted line then the positive & negative errors could cancel out. Hence the approach is usually to take the mean of the squared error.

2

The goal would be to minimize the error which will result in the best fit.

So the approach would be to choose values for the parameters Ɵi

The algorithm that is used for determining the values of the parameters that will result in the minimum error is gradient descent

The formula is

Ɵj := Ɵj – αd/d Ɵj J(Ɵ)

Where α is the learning rate

Gradient descent starts by picking a random value for Ɵi. Then the algorithm looks around to search for the next combination that will take us down fastest. By continuing this process the local minima is determined.

Gradient descent is based on the observation that if the multivariable function  is defined and differentiable in a neighborhood of a point , then  decreases fastest if one goes from  in the direction of the negative gradient. This is shown in the below diagram taken from Wikipedia.

Gradient_descent.svg

For e.g for a curve as shown below

3

This how I think the gradient descent works. In the above diagram at point A the slope is +ve and taking the negative of the slope multiplied by the learning factor α and subtracting it from Ɵj will result in a value that is less than Ɵj. That is we move towards the minima or C. Similarly at point B the slope will be -ve. If we multiply by  – α then we will add to Ɵj. Hence we will move to the right or towards point C.

By applying the iterative process of gradient descent we can get the combination of parameter values for  Ɵ that will provide the best fit for the set of data points

The iterative process of gradient descent is applied to minimize the cost function which is function of the error in the current hypothesis

δ/δ J(Ɵ) = δ/ δ Ɵ * 1/2m Σ(hƟ (xi)  – y i)2

 

This process is applied iteratively to the below equation to arrive at the values of Ɵi

The formula is

Ɵj := Ɵj – αd/d Ɵj J(Ɵ)

to obtain the values for the best fit equation

hƟ(x) = Ɵ2xn+ Ɵ1xn-1+ …+  Ɵ0

Also read my post on Simplifying ML: Logistic regression – Part 2

You may like
1. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
2. Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra
3. Applying the principles of Machine Learning

Find me on Google+