# Simplifying Machine Learning algorithms – Part 1

Machine learning, or the ability to use computers to predict values, classify data or identify patterns, is truly a fascinating field. It is amazing how algorithms can come to conclusions from data. Detecting patterns is an inborn ability of the human mind. But our mind cannot handle large quantities of data with many features. It is here that machines have an edge over us.

This post is inspired by the Machine Learning course at Coursera conducted by Professor Andrew Ng of Stanford. The lectures are truly lucid and delivered with amazing clarity. In a series of posts I will try to distil the meaning and motivation behind the algorithms that are part of machine learning.

There are 2 major types of learning:

a) Supervised learning

b) Unsupervised learning

Supervised learning: In supervised learning we have to infer the relationship between input data and output values. The intention of supervised learning is to determine the likely output for a new input once the relationship has been determined. Some examples of supervised learning are linear regression, logistic regression etc.

Unsupervised learning: In unsupervised learning the problem is to determine patterns and structure in unlabeled data. Some examples of unsupervised learning are K-Means clustering, hidden Markov models etc.

In this post I would like to take a look at supervised learning algorithms.

Linear Regression

In regression problems we try to infer the relationship between a set of input parameters and an output value. Let us say we have data for the number of rooms vs. the price of a house as shown below.

Depending on the data we could fit a straight line, that is, a linear fit. Alternatively we could fit a higher order curve to the data.

The function that determines the relationship is also known as the hypothesis function. For example, a hypothesis function with a single feature can be represented as follows

$$h_\theta(x) = \theta_1 x + \theta_0$$

The above equation is the hypothesis function, where $\theta_0$ and $\theta_1$ are the parameters and $x$ is the feature.

We could have a higher order hypothesis function as follows

$$h_\theta(x) = \theta_2 x^2 + \theta_1 x + \theta_0$$
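The two hypothesis functions above can be written as ordinary functions. This is only a small sketch: the function names and the parameter values are my own, picked for illustration, and are not taken from the course.

```python
def linear_hypothesis(theta0, theta1, x):
    """Straight-line hypothesis: h(x) = theta1 * x + theta0."""
    return theta1 * x + theta0


def quadratic_hypothesis(theta0, theta1, theta2, x):
    """Second-order hypothesis: h(x) = theta2 * x^2 + theta1 * x + theta0."""
    return theta2 * x ** 2 + theta1 * x + theta0


# Hypothetical parameter values, chosen only for illustration.
rooms = 3
linear_price = linear_hypothesis(theta0=50.0, theta1=20.0, x=rooms)
quadratic_price = quadratic_hypothesis(theta0=50.0, theta1=20.0, theta2=2.0, x=rooms)

print(linear_price, quadratic_price)
```

Different parameter values give a different line or curve, which is why the rest of the post is about choosing them well.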

The function used to evaluate how accurately the hypothesis function maps the inputs to their related outputs is known as the ‘cost function’.

The cost function can be represented as

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2$$

Here $m$ is the number of data points.

The cost function calculates half the ‘mean squared error’ between the actual data points ($y$) and the corresponding points on the hypothesis function ($h_\theta$). Clearly, the higher the value of $J(\theta)$, the greater the error in predicting the output from a set of input parameters. If we took the plain error instead of the squared error, then for data points on either side of the predicted line the positive and negative errors could cancel out. Hence the approach is usually to take the mean of the squared error.
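Here is a minimal sketch of the cost function for a straight-line hypothesis, together with a plain average of the errors to show the cancellation problem. The data and the function names are made up for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta) = 1/(2m) * sum((h(x_i) - y_i)^2) for a straight-line hypothesis."""
    m = len(xs)
    squared_errors = [(theta1 * x + theta0 - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)


def mean_signed_error(theta0, theta1, xs, ys):
    """Plain average of the errors, shown only to illustrate cancellation."""
    m = len(xs)
    return sum(theta1 * x + theta0 - y for x, y in zip(xs, ys)) / m


# Made-up data: points sit one unit above and one unit below the line h(x) = x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]

signed = mean_signed_error(0.0, 1.0, xs, ys)  # errors are -1, +1, -1, +1
j_value = cost(0.0, 1.0, xs, ys)              # squared errors are 1, 1, 1, 1

print(signed, j_value)
```

The signed errors average to zero even though no point lies on the line, while the squared-error cost stays positive and so still reports the misfit.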

The goal would be to minimize the error which will result in the best fit.

So the approach is to choose values for the parameters $\theta_j$ that minimize $J(\theta)$.

The algorithm used to determine the parameter values that give the minimum error is gradient descent.

The formula is

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

where $\alpha$ is the learning rate.

Gradient descent starts by picking a random value for each $\theta_j$. The algorithm then looks at the slope of the cost function around the current point and takes a step in the direction that goes down fastest. By continuing this process a local minimum is reached.

Gradient descent is based on the observation that if the multivariable function $F(x)$ is defined and differentiable in a neighborhood of a point $a$, then $F(x)$ decreases fastest if one goes from $a$ in the direction of the negative gradient of $F$ at $a$, $-\nabla F(a)$. This is shown in the below diagram taken from Wikipedia.
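To see this observation in numbers, here is a small sketch. The function F(x, y) = x² + 3y² and the starting point (1, 1) are my own choices for illustration and come from neither the course nor Wikipedia. The code tries unit directions at one-degree intervals, keeps the one that lowers F the most for a small step, and compares it with the negative gradient direction.

```python
import math


def F(x, y):
    """Illustrative two-variable function (an assumption, not from the post)."""
    return x ** 2 + 3 * y ** 2


def grad_F(x, y):
    """Gradient of F: (dF/dx, dF/dy)."""
    return (2 * x, 6 * y)


a = (1.0, 1.0)
step = 0.001

# Try unit directions at 1-degree intervals; keep the one that lowers F most.
best_angle = min(
    range(360),
    key=lambda deg: F(a[0] + step * math.cos(math.radians(deg)),
                      a[1] + step * math.sin(math.radians(deg))),
)
best_dir = (math.cos(math.radians(best_angle)), math.sin(math.radians(best_angle)))

gx, gy = grad_F(*a)
norm = math.hypot(gx, gy)
neg_grad_dir = (-gx / norm, -gy / norm)

# Cosine of the angle between the two directions; close to 1 means they agree.
alignment = best_dir[0] * neg_grad_dir[0] + best_dir[1] * neg_grad_dir[1]
print(best_dir, neg_grad_dir, alignment)
```

The best sampled direction agrees with the negative gradient only up to the one-degree sampling resolution, so the alignment comes out close to 1 rather than exactly 1.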

For example, consider a curve as shown below.

This is how I think gradient descent works. In the above diagram the slope at point A is positive, so multiplying the slope by the learning rate $\alpha$ and subtracting it from $\theta_j$ results in a value that is less than $\theta_j$. That is, we move left towards the minimum, C. Similarly, at point B the slope is negative, so subtracting $\alpha$ times the slope adds to $\theta_j$. Hence we move to the right, again towards point C.
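The argument about points A and B can be checked with a one-parameter sketch. The curve J(θ) = (θ − 3)², with its minimum C at θ = 3, and the positions A = 5 and B = 1 are assumptions of mine standing in for the diagram.

```python
def slope(theta):
    """Derivative of the illustrative curve J(theta) = (theta - 3)^2."""
    return 2 * (theta - 3)


def update(theta, alpha):
    """One gradient descent step: theta := theta - alpha * dJ/dtheta."""
    return theta - alpha * slope(theta)


alpha = 0.1
theta_a = 5.0  # point A, to the right of the minimum, slope is positive
theta_b = 1.0  # point B, to the left of the minimum, slope is negative

next_a = update(theta_a, alpha)  # moves left, towards C
next_b = update(theta_b, alpha)  # moves right, towards C

# Repeating the step from A keeps moving theta closer to the minimum at 3.
theta = theta_a
for _ in range(100):
    theta = update(theta, alpha)

print(next_a, next_b, theta)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor. A learning rate that is too large would overshoot C instead.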

By applying the iterative process of gradient descent we can get the combination of parameter values for $\theta$ that provides the best fit for the set of data points.

The iterative process of gradient descent is applied to minimize the cost function, which is a function of the error in the current hypothesis.

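$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \left[ \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2 \right]$$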
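As a worked case, assume the single-feature hypothesis $h_\theta(x) = \theta_1 x + \theta_0$. Applying the chain rule to the cost function, the factor of 2 from the square cancels the 1/2, which gives one partial derivative per parameter:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_0} J(\theta) &= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) \\
\frac{\partial}{\partial \theta_1} J(\theta) &= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_i
\end{aligned}
$$

The extra $x_i$ in the second line appears because differentiating $h_\theta(x_i)$ with respect to $\theta_1$ gives $x_i$, while differentiating it with respect to $\theta_0$ gives 1.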

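This derivative is plugged into the gradient descent update rule, which is applied iteratively to arrive at the values of $\theta_j$

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

The result is the set of parameter values for the best fit equation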

$$h_\theta(x) = \theta_n x^n + \theta_{n-1} x^{n-1} + \dots + \theta_0$$
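Putting the pieces together, here is a minimal sketch of gradient descent for the simplest case, a straight line with one feature, rather than the general polynomial above. The data, the learning rate, the iteration count and the function name are all assumptions chosen for illustration. The sketch also starts from zero instead of a random value so that the result is repeatable.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit h(x) = theta1 * x + theta0 by batch gradient descent on J(theta)."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # start from zero rather than a random value
    for _ in range(iterations):
        # Error of the current hypothesis on every data point.
        errors = [theta1 * x + theta0 - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)
```

On this data the parameters approach the intercept 1 and slope 2 of the line the points were generated from. For real data the learning rate usually needs some tuning, since too large a value makes the updates diverge.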

Also read my post on Simplifying ML: Logistic regression – Part 2