Presentation on ‘Machine Learning in plain English – Part 1’

This is the first part on my series ‘Machine Learning in plain English – Part 1’ in which I discuss the intuition behind different Machine Learning algorithms, metrics and the approaches etc. These presentations will not include tiresome math or laborious programming constructs, and will instead focus on just the concepts behind the Machine Learning algorithms. This presentation discusses what Machine Learning is, Gradient Descent, linear, multi variate & polynomial regression, bias/variance, under fit, good fit and over fit and finally logistic regression etc.

It is hoped that these presentations will trigger sufficient interest in you, to explore this fascinating field further

To see actual implementations of the most widely used Machine Learning algorithms in R and Python, check out My book ‘Practical Machine Learning with R and Python’ on Amazon

Also see
1. Practical Machine Learning with R and Python – Part 3
2.R vs Python: Different similarities and similar differences
3. Perils and pitfalls of Big Data
4. Deep Learning from first principles in Python, R and Octave – Part 2
5. Getting started with memcached-libmemcached

To see all post see “Index of posts“

Simplifying ML: Recommender Systems – Part 7

In this age of Amazon, Netflix and App stores where products, movies and apps are purchased online the method of up-selling and cross-selling online is through the use of recommender based systems.

When you go to site like Amazon/Flipkart or purchase apps on App store/Google Play we often see things like “People who bought this book/app also bought X, Y, Z”. These recommendations are the recommender system algorithms in action.

Recently, Netflix ran a competition in which users had to come with the best algorithm to recommend films that a user would also like. The prize money for this was of the order of $1 million. That’s how critical recommender systems are to organizations of today where most of the transactions happen on the web.

Typically users are asked to give a rating of 1 to 5 with 1 being the lowest and 5 being the highest. So for example if we had classics like Moby Dick, Great Expectations and current best sellers like The Client, The da Vinci Code and a Science Fiction like 2001- A Space Odyssey we can expect that different people will rate the books differently. Obviously not everybody would have read every book in the list and some elements would be blank.

Recommender Systems are based on machine learning algorithms. The goal of these algorithms is to predict what score any user would give to books they did not rate. In other words what would be rating the buyers would give for books or apps they did not buy. So if the algorithm predicts a high rating then we could recommend that the user would also ‘like’ them. Or we could give recommendations of books/apps bought by users who bought the books/apps bought by this user.

The notation is

n_u = Number of users

n_b= Number of books

r^(i,j) = Boolean whether user j rated a book i

y^(i,j) = The rating user j gave book i

m_j = The number of books that user j rated

Content based recommendation

In a typical content based recommendation algorithm we assume that we have data about some items we want to recommend rating for e.g. books/products/apps. In the example for books bought in an online bookstore we assume some features in our case ‘classic’, “fiction” etc

So each book has its own feature vector where x¹is the feature vector of the first book x² feature vector of the 2nd book and so on

This can be done through linear regression by minimizing the cost function of the sum of squared errors from the predicted value

So for a parameter vector Ɵ^jand a feature vector xⁱ the recommender system will try to predict the rating that a user j will give a book i.

This can be written as

Number of stars (rating) = (θ^j)^T xⁱ

This reduces to the minimization problem over all θ^j for r=1

min 1/2m Σ ((θj)^T xⁱ – y ^i,j)²

θ^ji:r=1

Adding the regularization term this becomes

min 1/2m Σ((θ^j)^T xⁱ – y ^i,j)² + λ/2m(Σ θ^j)²

θ^ji:r=1

The recommender algorithm in essence tries to learn parameters θ^j for a set of features of xⁱthe chosen system for e.g. books in this case.

The recommender tries to learn the parameters for all the users

min 1/2m Σ Σ((θ^j)^T xⁱ – y ^i,j)² + λ/2m(Σ Σ θ^j)²

θ¹…θⁿi:r=1

The minimization is performed by gradient descent as

Θ^j_k:= Θ^j_k – α (Σ((θ^j)^T xⁱ – y ^i,j)xⁱ + λ Θ^j_k

Recommender systems tries to learn the parameters for a set of chosen features over all users. Based on the learnt paramaters it then tries to predict the rating the user would give to books/apps that he is yet to purchase and push up those apps for which the user is likely to give a high rating based on the given set of ratings.

Recommender systems contribute substantially to the revenues of e-commerce sites like Amazon, Flipkart, Netflix etc

Note: This post, line previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng

Find me on Google+

Simplifying ML: Logistic regression – Part 2

Logistic regression is another class of Machine Learning algorithms which comes under supervised learning. In this regression technique we need to classify data. Take a look at my earlier post Simplifying Machine Learning algorithms – Part 1 I had discussed linear regression. For e.g if we had data on tumor sizes versus the fact that the tumor was benign or malignant, the question is whether given a tumor size we can predict whether this tumor would be benign or cancerous. So we need to have the ability to classify this data.

This is shown below

It is obvious that a line with a certain slope could easily separate the two.

As another example we could have an algorithm that is able to automatically classify mail as either spam or not spam based on the subject line. So for e.g if the subject line had words like medicine, prize, lottery etc we could with a fair degree of probability classify this as spam.

However some classification problems could be far more complex. We may need to classify another problem as shown below.

From the above it can be seen that hypothesis function is second order equation which is either a circle or an ellipse.

In the case of logistic regression the hypothesis function should be able to switch between 2 values 0 or 1 almost like a transistor either being in cutoff or in saturation state.

In the case of logistic regression 0 <= h_Ɵ<= 1

The hypothesis function uses function of the following form

g(z) = 1/(1 + e^‑z)

and h_Ɵ(x) = g(Ɵ^TX₎

The function g(z) shown above has the characteristic required for logistic regression as it has the following shape

The function rapidly asymptotes at 1 when h_Ɵ(x) >= 0.5 and h_Ɵ(x) asymptotes to 0 when h_Ɵ(x) < 0.5

As in linear regression we can have hypothesis function be of an appropriate order. So for e.g. in the ellipse figure above one could choose a hypothesis function as follows

h_Ɵ(x) = Ɵ₀ + Ɵ₁x₁² + Ɵ₂x₂² + Ɵ₃x₁ + Ɵ₄x₂

h_Ɵ(x) = 1/(1 + e –^{(Ɵ0 + Ɵ1×12 + Ɵ2×22 + Ɵ3×1 + Ɵ4×2)})

We could choose the general form of a circle which is

f(x) = ax² + by² +2gx + 2hy + d

The cost function for logistic regression is given below

Cost(h_Ɵ(x),y) = { -log(h_Ɵ(x)) if y = 1

-log(1 – h_Ɵ(x))) if y = 0

In the case of regression there was a single cost function which could determine the error of the data against the predicted value.

The cost in the event of logistic regression is given as above as a set of 2 equations one for the case where the data is 1 and another for the case where the data is 0.

The reason for this is as follows. If we consider y =1 as a positive value, then when our hypothesis correctly predicts 1 then we have a ‘true positive’ however if we predict 0 when it should be 1 then we have a false negative. Similarly when the data is 0 and we predict a 1 then this is the case of a false positive and if we correctly predict 0 when it is 0 it is true negative.

Here is the reason as how the cost function

Cost(h_Ɵ(x),y) = { -log(h_Ɵ(x)) if y = 1

-log(1 – h_Ɵ(x))) if y = 0

Was arrived at. By definition the cost function gives the error between the predicted value and the data value.

The logic for determining the appropriate function is as follows

For y = 1

y=1 & hypothesis = 1 then cost = 0

y= 1 & hypothesis = 0 then cost = Infinity

Similarly for y = 0

y = 0 & hypotheses = 0 then cost = 0

y = 0 & hypothesis = 1 then cost = Infinity

and the the functions above serve exactly this purpose as can be seen

Hence the cost can be written as

J(Ɵ) = Cost(h_Ɵ(x),y) = -y * log(h_Ɵ(x)) – (1-y) * (log(1 – h_Ɵ(x))

This is the same as the equation above

The same gradient descent algorithm can now be used to minimize the cost function

So we can iterate througj

Ɵ_j = Ɵ_j – α δ/δ Ɵ_j J(Ɵ₀, Ɵ₁,… Ɵ_n)

This works out to a function that is similar to linear regression

Ɵj₌Ɵj – α 1/m { Σ h_Ɵ(x_i) – y_i} x_jⁱ

This will enable the machine to fairly accurately determine the parameters Ɵ_jfor the features x and provide the hypothesis function.

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. Highly recommended!!!

Find me on Google+

Simplifying Machine Learning algorithms – Part 1

Machine learning or the ability to use computers to predict values, classify data or identify patterns is truly a fascinating field. It is amazing how algorithms can come to conclusions on data. Detecting patterns is a inborn ability of the human mind. But our mind cannot handle large quantities of data with many features. It is here that machines have an edge over us.

This post is inspired by the Machine Learning course at Coursera conducted by Professor Andrew Ng of Stanford. The lectures are truly lucid and delivered with amazing clarity. In a series of post I will be trying to distil the meaning and motivation behind the algorithms that are part of machine learning.

There are 2 major types of learning

a) Supervised learning b) Unsupervised learning

Supervised learning: In supervised learning we have to infer the relationship between input data and output values. The intention of supervised learning is determine the possible out for some random input once the relationship has been determined. Some examples of supervised learning are linear regression, logistic regression etc.

Unsupervised learning: In unsupervised learning the problem is to determine patterns and structure in unlabeled data. Some examples of unsupervised learning are K-Means clustering, hidden Markov models etc.

In this post I would like to take a look at Supervised Learning algorithms

Linear Regression

In regression problems we try to infer the relationship between a set of input parameters to an output value. Let us we have data for the number of rooms vs. price of the house as shown below

Depending on the data we could either fit a straight line or use a linear fit. Alternatively we could fit a higher order curve to data.

The function that determines the relationship is also known as hypothesis function. This can be represented as follows for e.g a hypothesis function with a single feature

h_Ɵ(x) = Ɵ₁x+ Ɵ₀

The above equation is the hypothesis function where Ɵ is the parameter and x is the feature

We could have a higher order hypothesis function as follows

h_Ɵ(x) = Ɵ₂x²+ Ɵ₁x+Ɵ₀

To evaluate whether the hypothesis function is able to map the input and related output accurately is known as the ‘cost function’.

The cost function can be represented as

J(Ɵ) = 1/2m Σ(h_Ɵ (xⁱ)– yⁱ)²

The cost function really calculates the ‘mean squared error’ of the actual data points (y) with the points on the hypothesis function (h_Ɵ). Clearly higher the value of J(Ɵ) the greater is the error in predicting the output based on a set of input parameters. If we just took the error instead of the squared error then if there were data points on either side of the predicted line then the positive & negative errors could cancel out. Hence the approach is usually to take the mean of the squared error.

The goal would be to minimize the error which will result in the best fit.

So the approach would be to choose values for the parameters Ɵi

The algorithm that is used for determining the values of the parameters that will result in the minimum error is gradient descent

The formula is

Ɵj := Ɵj – αd/d Ɵj J(Ɵ)

Where α is the learning rate

Gradient descent starts by picking a random value for Ɵi. Then the algorithm looks around to search for the next combination that will take us down fastest. By continuing this process the local minima is determined.

Gradient descent is based on the observation that if the multivariable function is defined and differentiable in a neighborhood of a point , then decreases fastest if one goes from in the direction of the negative gradient. This is shown in the below diagram taken from Wikipedia.

For e.g for a curve as shown below

This how I think the gradient descent works. In the above diagram at point A the slope is +ve and taking the negative of the slope multiplied by the learning factor α and subtracting it from Ɵj will result in a value that is less than Ɵj. That is we move towards the minima or C. Similarly at point B the slope will be -ve. If we multiply by – α then we will add to Ɵj. Hence we will move to the right or towards point C.

By applying the iterative process of gradient descent we can get the combination of parameter values for Ɵ that will provide the best fit for the set of data points

The iterative process of gradient descent is applied to minimize the cost function which is function of the error in the current hypothesis

δ/δ J(Ɵ) = δ/ δ Ɵ * 1/2m Σ(h_Ɵ (xⁱ)– yⁱ)²

This process is applied iteratively to the below equation to arrive at the values of Ɵi

The formula is

Ɵj := Ɵj – αd/d Ɵj J(Ɵ)

to obtain the values for the best fit equation

h_Ɵ(x) = Ɵ₂xⁿ+ Ɵ₁x^n-1+ …+ Ɵ₀

Also read my post on Simplifying ML: Logistic regression – Part 2

Find me on Google+