My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

Are you wondering whether to get into the ‘R’ bus or ‘Python’ bus?
My suggestion is to you is “Why not get into the ‘R and Python’ train?”

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($8.99/Rs449) versions.  In the third edition all code sections have been re-formatted to use the fixed width font ‘Consolas’. This neatly organizes output which have columns like confusion matrix, dataframes etc to be columnar, making the code more readable.  There is a science to formatting too!! which improves the look and feel. It is little wonder that Steve Jobs had a keen passion for calligraphy! Additionally some typos have been fixed.


In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical machine with R and Python: Third Edition – Machine Learning in Stereo(Paperback-$12.99)
2. Practical machine with R and Python Third Edition – Machine Learning in Stereo(Kindle- $8.99/Rs449)

This book is ideal both for beginners and the experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally,those who are proficient in both languages, can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Table of Contents
Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Pick up your copy today!!
Hope you have a great time learning as I did while implementing these algorithms!

Applying the principles of Machine Learning

While working with multivariate regression there are certain essential principles that must be applied to ensure the correctness of the solution while being able to pick the most optimum solution. This is all the more important when the problem has a large number of features. In this post I apply these important principles to a regression data set which I was able to pull of the internet. This data set was taken from the UCI Machine Learning repository and deals with Boston housing data.  The housing data provides the cost of house in Boston suburbs given the number of rooms, the connectivity to main highways, and crime rate in the area and several other data.  There are a total of 506 data points in this data set with a total of 13 features.

This seemed a reasonable dataset to start to try out the principles of Machine Learning I had picked up from Coursera’s ML course.

Out of a total of 13 features 2 features ’ZN’ and ‘CHAS’ proximity to  Charles river were dropped as the values were mostly zero in these columns . The remaining 11 features were used to map to the output variable of the price.

The following key rules have been applied on the

  • The dataset was divided into training samples (60%), cross-validation set (20%) and test set (20%) using a random index
  • Try out different polynomial functions while performing gradient descent to determine the theta values
  • Different combinations of ‘alpha’ learning rate and ‘lambda’ the regularization parameter were tried while performing gradient descent
  • The error rate is then calculated on the cross-validation and test set
  • The theta values that were obtained for the lowest cost for a polynomial was used to compute and plot the learning curve for the different polynomials against increasing number of training and cross-validation test samples to check for bias and variance.
  • The plot of the cost versus the polynomial degree was plotted to obtain the best fit polynomial for the data set.

A multivariate regression hypothesis can be represented as

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 + …
And the cost can is determined as
J(θ0, θ1, θ2, θ3..) = 1/2m ∑ (hΘ (xi) -yi)2
The implementation was done using Octave. As in my previous posts some functions have not been include to comply with Coursera’s Honor Code. The code can be cloned from GitHub at machine-learning-principles

a) housing compute.m. In this module I perform gradient descent for different polynomial degrees and check the error that is obtained when using the computed theta on the cross validation and test set

max_degrees =4;
J_history = zeros(max_degrees, 1);
Jcv_history = zeros(max_degrees, 1);
for degree = 1:max_degrees;
[J Jcv alpha lambda] = train_samples(randidx, training,cross_validation,test_data,degree);

b) train_samples.m – This module uses gradient descent to check the best fit for a given polynomial degree for different combinations of alpha (learning rate) and lambda( regularization).

for i = 1:length(alpha_arr),
for j = 1:length(lambda_arr)
alpha = alpha_arr{i};
lambda= lambda_arr{j};
% Perform Gradient descent
% Compute error for training sample for computed theta values
% Compute the error rate for the cross validation samples
% Compute the error rate against the test set

c) cross_validation.m – This module uses the theta values to compute cost for the cross validation set

d) test-samples.m – This modules computes the error when using the trained theta on the test set

e) poly.m – This module constructs polynomial vectors based on the degree as follows
function [x] = poly(xinput, n)
x = [];
for i= 1:n
xtemp = xinput .^i;
x = [x xtemp];

e) learning_curve.m – The learning curve module plots the error rate for increasing number of training and cross validation samples. This is done as follows. For the theta with the lowest cost as determined by gradient descent
for i from 1 to 100

  • Compute the error for ‘i’ training samples
  • Compute the error for ‘i’ cross-validation
  • Plot the learning curve to determine the bias and variance of the polynomial fit

This is included below
for i = 1: 100
xsample = xtrain(1:i,:);
ysample = ytrain(1:i,:);
[xsample] = poly(xsample,degree);
xsample= [ones(i, 1) xsample];
[c d] = size(xsample);
theta = zeros(d, 1);
% Minimize using fmincg
J = computeCost(xsample, ysample, theta);
Jtrain(i) = J;
xsample_cv = xcv(1:i,:);
ysample_cv = ycv(1:i,:);
[xsample_cv] = poly(xsample_cv,degree);
xsample_cv= [ones(i, 1) xsample_cv];
J_cv = computeCost(xsample_cv, ysample_cv,theta)
Jcv(i) = J_cv;

Finally a plot is done been different lambda and the cost.

The results are included below

A) Polynomial degree 1
Convergence graph

Learning curve

The above figure does show a stronger bias. Note: the learning curve was done with around 100 samples
B) Polynomial degree 2

Convergence graph

Learning curve

The learning curve for degree 2 shows a stronger variance.

C) Polynomial degree 3
Convergence graph


Learning curve

D) Polynomial degree 4
Convergence graph

E) Learning curve

This plot is useful to determine which polynomial degree will give the best fit for the dataset and the lowest cost


Clearly from the above it can be seen that degree 2 will give a good fit for the data set.

F) Lambda vs Cost function

The above code demonstrates some key principles while performing multivariate regression
The code can be cloned from GitHub at machine-learning-principles

Simplifying ML: Recommender Systems – Part 7

In this age of Amazon, Netflix and App stores where products, movies and apps are purchased online the method of up-selling and cross-selling online is through the use of recommender based systems.

When you go to site like Amazon/Flipkart or purchase apps on App store/Google Play we often see things like “People who bought this book/app also bought X, Y, Z”. These recommendations are the recommender system algorithms in action.

Recently, Netflix ran a competition in which users had to come with the best algorithm to recommend films that a user would also like. The prize money for this was of the order of $1 million. That’s how critical recommender systems are to organizations of today where most of the transactions happen on the web.

Typically users are asked to give a rating of 1 to 5 with 1 being the lowest and 5 being the highest.  So for example if we had classics like Moby Dick, Great Expectations and current best sellers like The Client, The da Vinci Code and a Science Fiction like 2001- A Space Odyssey we can expect that different people will rate the books differently. Obviously not everybody would have read every book in the list and some elements would be blank.


Recommender Systems are based on machine learning algorithms. The goal of these algorithms is to predict what score any user would give to books they did not rate. In other words what would be rating the buyers would give for books or apps they did not buy. So if the algorithm predicts a high rating then we could recommend that the user would also ‘like’ them. Or we could give recommendations of books/apps bought by users who bought the books/apps bought by this user.

The notation is

nu = Number of users

nb = Number of books

r(i,j) = Boolean whether user j rated a book i

y(i,j) = The rating user j gave book i

mj  = The number of books that user j rated

Content based recommendation

In a typical content based recommendation algorithm we assume that we have data about some items we want to recommend rating for e.g. books/products/apps.  In the example for books bought in an online bookstore we assume some features in our case ‘classic’, “fiction” etc


So each book has its own feature vector where x1 is the feature vector of the first book x2  feature vector of the 2nd book and so on

This can be done through linear regression by minimizing the cost function of the sum of squared errors from the predicted value

So for a parameter vector Ɵjand a feature vector xi the recommender system will try to predict the rating that a user j will give a book i.

This can be written as

Number of stars (rating) = (θj) T xi


This reduces to the minimization problem over all θj for r=1

min 1/2m Σ  ((θj)T xi  – y i,j)2

θj                  i:r=1

Adding the regularization term this becomes

min 1/2m Σ((θj)T xi  – y i,j)2  + λ/2m(Σ θj)2

θj                  i:r=1

The recommender algorithm in essence tries to learn parameters θj for a set of features of xi the chosen system for e.g.  books in this case.

The recommender tries to learn the parameters for all the users

min 1/2m Σ Σ((θj)T x – y i,j)2  + λ/2m(Σ Σ θj)2

θ1…θn                i:r=1

The minimization is performed by gradient descent as

Θj k:= Θjk – α (Σ((θj)T x – y i,j)xi + λ Θj k


Recommender systems tries to learn the parameters for a set of chosen features over all users. Based on the learnt paramaters it then tries to predict the rating the user would give to books/apps that he is yet to purchase and push up those apps for which the user is likely to give a high rating based on the given set of ratings.

Recommender systems contribute substantially to the revenues of e-commerce sites like Amazon, Flipkart, Netflix etc

Note: This post, line previous posts on Machine Learning,  is based on the Coursera course on Machine Learning by Professor Andrew Ng

Find me on Google+

Simplifying ML: Impact of degree of polynomial degree on bias & variance and other insights

This post takes off from my earlier post Simplifying Machine Learning: Bias, variance, regularization and odd facts- Part 4. As discussed earlier a poor hypothesis function could either underfit or overfit the data.  If the number of features selected were small of the order of 1 or 2 features, then we could plot the data and try to determine how the hypothesis function fits the data. We could also see whether the function is capable of predicting output target values for new data.

 However if the number of features were large for e.g. of the order of 10’s of features then there needs to be method by which one can determine if the learned hypotheses is a ‘just right’ fit for all the data.

Checkout my book ‘Deep Learning from first principles Second Edition- In vectorized Python, R and Octave’.  My book is available on Amazon  as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($12.99) and Kindle($9.99/Rs449) versions.


The following technique can be used to determine the ‘goodness’ of a hypothesis or how well the hypothesis can fit the data and can also generalize to new examples not in the training set.

Several insights on how to evaluate a hypothesis is  given below

Consider a hypothesis function

hƟ (x) = Ɵ0 + Ɵ1x1 + Ɵ2x22 + Ɵ3x33  +  Ɵ4x44


The above hypothesis does not generalize well enough for new examples in the data set.

Let us assume that there 100 training examples or data sets. Instead of using the entire set of 100 examples to learn the hypothesis function, the data set is divided into training set and test set in a 70%:30% ratio respectively

The hypothesis is learned from the training set. The learned hypothesis is then checked against the 30% test set data to determine whether the hypothesis is able to generalize on the test set also.

This is done by determining the error when the hypothesis is used against the test set.

For linear regression the error is computed by determining the average mean square error of the output value against the actual value as follows

The test set error is computed as follows

Jtest(Ɵ) = 1/2mtest Σ(hƟ (xtest – ytesti)2

For logistic regression the test set error is similarly determined as

Jtest(Ɵ) = = 1/mtest Σ -ytest * log(hƟ (xtest))  – (1-ytest) * (log(1 – hƟ (xtest))

The idea is that the test set error should as low as possible.

Model selection

A typical problem in determining the hypothesis is to choose the degree of the polynomial or to choose an appropriate model for the hypothesis

The method that can be followed is to choose 10 polynomial models

  1. hƟ (x) = Ɵ0 + Ɵ1x1
  2. hƟ (x) = Ɵ0 + Ɵ1x1 + Ɵ2x22
  3. hƟ (x) = Ɵ0 + Ɵ1x12 + Ɵ2x22 + Ɵ3x33

Here‘d’ is the degree of the polynomial. One method is to train all the 10 models. Run each of the model’s hypotheses against the test set and then choose the model with the smallest error cost.

While this appears to a good technique to choose the best fit hypothesis, in reality it is not so. The reason is that the hypothesis chosen is based on the best fit and the least error for the test data. However this does not generalize well for examples not in the training or test set.

So the correct method is to divide the data into 3 sets  as 60:20:20 where 60% is the training set, 20% is used as a test set to determine the best fit and the remaining 20% is the cross-validation set.

The steps carried out against the data is

  1. Train all 10 models against the training set (60%)
  2. Compute the cost value J against the cross-validation set (20%)
  3. Determine the lowest cost model
  4. Use this model against the test set and determine the generalization error.

Degree of the polynomial versus bias and variance

How does the degree of the polynomial affect the bias and variance of a hypothesis?

Clearly for a given training set when the degree is low the hypothesis will underfit the data and there will be a high bias error. However when the degree of the polynomial is high then the fit will get better and better on the training set (Note: This does not imply a good generalization)

We run all the models with different polynomial degrees on the cross validation set. What we will observe is that when the degree of the polynomial is low then the error will be high. This error will decrease as the degree of the polynomial increases as we will tend to get a better fit. However the error will again increase as higher degree polynomials that overfit the training set will be a poor fit for the cross validation set.

This is shown below


Effect of regularization on bias & variance

Here is the technique to choose the optimum value for the regularization parameter λ

When λ is small then Ɵi values are large and we tend to overfit the data set. Hence the training error will be low but the cross validation error will be high. However when λ is large then the values of Ɵi become negligible almost leading to a polynomial degree of 1. These will underfit the data and result in a high training error and a cross validation error. Hence the chosen value of λ should be such that the cross validation error is the lowest


Plotting learning curves

This is another technique to identify if the learned hypothesis has a high bias or a high variance based on the number of training examples

A high bias indicates an underfit. When the number of samples in training set if low then the training error and cross validation error will be low as it will be easy to create a hypothesis if there are few training examples. As the number of samples increase the error will increase for the training set and will slightly decrease for the cross validation set. However for a high bias, or underfit, after a certain point increasing the number of samples will not change the error. This is the case of a high bias


In the case of high variance where a high degree polynomial is used for the hypothesis the training error will be low for smaller number of training examples. As the number of training examples increase the error will increase slowly. The cross validation error will be high for lesser number of training samples but will slowly decrease as the number of samples grow as the hypothesis will learn better. Hence for the case of high variance increasing the number of samples in the training set size will decrease the gap between the cross validation and the training error as shown below


Note: This post, line previous posts on Machine Learning,  is based on the Coursera course on Machine Learning by Professor Andrew Ng

Also see
1. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
2.My book ‘Deep Learning from first principles:Second Edition’ now on Amazon
3.The Clash of the Titans in Test and ODI cricket
4. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
5.Latency, throughput implications for the Cloud
6. Simulating a Web Joint in Android
5. Pitching yorkpy … short of good length to IPL – Part 1

Simplifying ML: Logistic regression – Part 2

Logistic regression is another class of Machine Learning algorithms which comes under supervised learning. In this regression technique we need to classify data. Take a look at my earlier post Simplifying Machine Learning algorithms – Part 1 I had discussed linear regression. For e.g if we had data on tumor sizes versus the fact that the tumor was benign or malignant, the question is whether given a tumor size we can predict whether this tumor would be benign or cancerous. So we need to have the ability to classify this data.

This is shown below


It is obvious that a line with a certain slope could easily separate the two.

As another example we could have an algorithm that is able to automatically classify mail as either spam or not spam based on the subject line. So for e.g if the subject line had words like medicine, prize, lottery etc we could with a fair degree of probability classify this as spam.

However some classification problems could be far more complex.  We may need to classify another problem as shown below.


From the above it can be seen that hypothesis function is second order equation which is either a circle or an ellipse.

In the case of logistic regression the hypothesis function should be able to switch between 2 values 0 or 1 almost like a transistor either being in cutoff or in saturation state.

In the case of logistic regression 0 <= hƟ <= 1

The hypothesis function uses function of the following form

g(z) = 1/(1 + e‑z)

and hƟ (x) = g(ƟTX)


The function g(z) shown above has the characteristic required for logistic regression as it has the following shape

The function rapidly asymptotes at 1 when hƟ (x) >= 0.5 and  hƟ (x) asymptotes to 0 when hƟ (x) < 0.5

As in linear regression we can have hypothesis function be of an appropriate order. So for e.g. in the ellipse figure above one could choose a hypothesis function as follows

hƟ (x) = Ɵ0 + Ɵ1x12 + Ɵ2x22 + Ɵ3x1 +  Ɵ4x2




hƟ (x) = 1/(1 + e –(Ɵ0 + Ɵ1×12 + Ɵ2×22 + Ɵ3×1 +  Ɵ4×2))

We could choose the general form of a circle which is

f(x) = ax2 + by2 +2gx + 2hy + d

The cost function for logistic regression is given below

Cost(hƟ (x),y) = { -log(hƟ (x))             if y = 1

-log(1 – hƟ (x)))       if y = 0

In the case of regression there was a single cost function which could determine the error of the data against the predicted value.

The cost in the event of logistic regression is given as above as a set of 2 equations one for the case where the data is 1 and another for the case where the data is 0.

The reason for this is as follows. If we consider y =1 as a positive value, then when our hypothesis correctly predicts 1 then we have a ‘true positive’ however if we predict 0 when it should be 1 then we have a false negative. Similarly when the data is 0 and we predict a 1 then this is the case of a false positive and if we correctly predict 0 when it is 0 it is true negative.

Here is the reason as how the cost function

Cost(hƟ (x),y) = { -log(hƟ (x))             if y = 1

-log(1 – hƟ (x)))       if y = 0

Was arrived at. By definition the cost function gives the error between the predicted value and the data value.

The logic for determining the appropriate function is as follows

For y = 1

y=1 & hypothesis = 1 then cost = 0

y= 1 & hypothesis = 0 then cost = Infinity

Similarly for y = 0

y = 0 & hypotheses  = 0 then cost = 0

y = 0 & hypothesis = 1 then cost = Infinity

and the the functions above serve exactly this purpose as can be seen


Hence the cost can be written as

J(Ɵ) = Cost(hƟ (x),y) = -y * log(hƟ (x))  – (1-y) * (log(1 – hƟ (x))

This is the same as the equation above

The same gradient descent algorithm can now be used to minimize the cost function

So we can iterate througj

Ɵj =   Ɵj – α δ/δ Ɵj J(Ɵ0, Ɵ1,… Ɵn)

This works out to a function that is similar to linear regression

Ɵj = Ɵj – α 1/m { Σ hƟ (xi) – yi} xj i

This will enable the machine to fairly accurately determine the parameters Ɵj for the features x and provide the hypothesis function.

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. Highly recommended!!!

Find me on Google+