My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

Are you wondering whether to get into the ‘R’ bus or ‘Python’ bus?
My suggestion is to you is “Why not get into the ‘R and Python’ train?”

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($8.99/Rs449) versions. In the third edition all code sections have been re-formatted to use the fixed width font ‘Consolas’. This neatly organizes output which have columns like confusion matrix, dataframes etc to be columnar, making the code more readable. There is a science to formatting too!! which improves the look and feel. It is little wonder that Steve Jobs had a keen passion for calligraphy! Additionally some typos have been fixed.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical machine with R and Python: Third Edition – Machine Learning in Stereo(Paperback-$12.99)
2. Practical machine with R and Python Third Edition – Machine Learning in Stereo(Kindle- $8.99/Rs449)

This book is ideal both for beginners and the experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally,those who are proficient in both languages, can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Table of Contents
Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Pick up your copy today!!
Hope you have a great time learning as I did while implementing these algorithms!

Presentation on ‘Machine Learning in plain English – Part 1’

This is the first part on my series ‘Machine Learning in plain English – Part 1’ in which I discuss the intuition behind different Machine Learning algorithms, metrics and the approaches etc. These presentations will not include tiresome math or laborious programming constructs, and will instead focus on just the concepts behind the Machine Learning algorithms. This presentation discusses what Machine Learning is, Gradient Descent, linear, multi variate & polynomial regression, bias/variance, under fit, good fit and over fit and finally logistic regression etc.

It is hoped that these presentations will trigger sufficient interest in you, to explore this fascinating field further

To see actual implementations of the most widely used Machine Learning algorithms in R and Python, check out My book ‘Practical Machine Learning with R and Python’ on Amazon

Also see
1. Practical Machine Learning with R and Python – Part 3
2.R vs Python: Different similarities and similar differences
3. Perils and pitfalls of Big Data
4. Deep Learning from first principles in Python, R and Octave – Part 2
5. Getting started with memcached-libmemcached

To see all post see “Index of posts“

Applying the principles of Machine Learning

While working with multivariate regression there are certain essential principles that must be applied to ensure the correctness of the solution while being able to pick the most optimum solution. This is all the more important when the problem has a large number of features. In this post I apply these important principles to a regression data set which I was able to pull of the internet. This data set was taken from the UCI Machine Learning repository and deals with Boston housing data. The housing data provides the cost of house in Boston suburbs given the number of rooms, the connectivity to main highways, and crime rate in the area and several other data. There are a total of 506 data points in this data set with a total of 13 features.

This seemed a reasonable dataset to start to try out the principles of Machine Learning I had picked up from Coursera’s ML course.

Out of a total of 13 features 2 features ’ZN’ and ‘CHAS’ proximity to Charles river were dropped as the values were mostly zero in these columns . The remaining 11 features were used to map to the output variable of the price.

The following key rules have been applied on the

The dataset was divided into training samples (60%), cross-validation set (20%) and test set (20%) using a random index
Try out different polynomial functions while performing gradient descent to determine the theta values
Different combinations of ‘alpha’ learning rate and ‘lambda’ the regularization parameter were tried while performing gradient descent
The error rate is then calculated on the cross-validation and test set
The theta values that were obtained for the lowest cost for a polynomial was used to compute and plot the learning curve for the different polynomials against increasing number of training and cross-validation test samples to check for bias and variance.
The plot of the cost versus the polynomial degree was plotted to obtain the best fit polynomial for the data set.

A multivariate regression hypothesis can be represented as

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + θ₄x₄ + …
And the cost can is determined as
J(θ₀, θ₁, θ₂, θ₃..) = 1/2m ∑ (h_Θ (xⁱ) -yⁱ)²
The implementation was done using Octave. As in my previous posts some functions have not been include to comply with Coursera’s Honor Code. The code can be cloned from GitHub at machine-learning-principles

a) housing compute.m. In this module I perform gradient descent for different polynomial degrees and check the error that is obtained when using the computed theta on the cross validation and test set

max_degrees =4; J_history = zeros(max_degrees, 1); Jcv_history = zeros(max_degrees, 1); for degree = 1:max_degrees; [J Jcv alpha lambda] = train_samples(randidx, training,cross_validation,test_data,degree); end;

b) train_samples.m – This module uses gradient descent to check the best fit for a given polynomial degree for different combinations of alpha (learning rate) and lambda( regularization).

for i = 1:length(alpha_arr), for j = 1:length(lambda_arr) alpha = alpha_arr{i}; lambda= lambda_arr{j}; % Perform Gradient descent % Compute error for training sample for computed theta values % Compute the error rate for the cross validation samples % Compute the error rate against the test set end; end;

c) cross_validation.m – This module uses the theta values to compute cost for the cross validation set

d) test-samples.m – This modules computes the error when using the trained theta on the test set

e) poly.m – This module constructs polynomial vectors based on the degree as follows
function [x] = poly(xinput, n) x = []; for i= 1:n xtemp = xinput .^i; x = [x xtemp]; end;

e) learning_curve.m – The learning curve module plots the error rate for increasing number of training and cross validation samples. This is done as follows. For the theta with the lowest cost as determined by gradient descent
for i from 1 to 100

Compute the error for ‘i’ training samples
Compute the error for ‘i’ cross-validation
Plot the learning curve to determine the bias and variance of the polynomial fit

This is included below
for i = 1: 100 xsample = xtrain(1:i,:); ysample = ytrain(1:i,:); size(xsample); size(ysample); [xsample] = poly(xsample,degree); xsample= [ones(i, 1) xsample]; [c d] = size(xsample); theta = zeros(d, 1); % Minimize using fmincg J = computeCost(xsample, ysample, theta); Jtrain(i) = J; xsample_cv = xcv(1:i,:); ysample_cv = ycv(1:i,:); [xsample_cv] = poly(xsample_cv,degree); xsample_cv= [ones(i, 1) xsample_cv]; J_cv = computeCost(xsample_cv, ysample_cv,theta) Jcv(i) = J_cv; end;

Finally a plot is done been different lambda and the cost.

The results are included below

A) Polynomial degree 1
Convergence graph

Learning curve

The above figure does show a stronger bias. Note: the learning curve was done with around 100 samples
B) Polynomial degree 2

Convergence graph

Learning curve

The learning curve for degree 2 shows a stronger variance.

C) Polynomial degree 3
Convergence graph

Learning curve

D) Polynomial degree 4
Convergence graph

E) Learning curve

This plot is useful to determine which polynomial degree will give the best fit for the dataset and the lowest cost

Clearly from the above it can be seen that degree 2 will give a good fit for the data set.

F) Lambda vs Cost function

The above code demonstrates some key principles while performing multivariate regression
The code can be cloned from GitHub at machine-learning-principles

Simplifying Machine Learning algorithms – Part 1

Machine learning or the ability to use computers to predict values, classify data or identify patterns is truly a fascinating field. It is amazing how algorithms can come to conclusions on data. Detecting patterns is a inborn ability of the human mind. But our mind cannot handle large quantities of data with many features. It is here that machines have an edge over us.

This post is inspired by the Machine Learning course at Coursera conducted by Professor Andrew Ng of Stanford. The lectures are truly lucid and delivered with amazing clarity. In a series of post I will be trying to distil the meaning and motivation behind the algorithms that are part of machine learning.

There are 2 major types of learning

a) Supervised learning b) Unsupervised learning

Supervised learning: In supervised learning we have to infer the relationship between input data and output values. The intention of supervised learning is determine the possible out for some random input once the relationship has been determined. Some examples of supervised learning are linear regression, logistic regression etc.

Unsupervised learning: In unsupervised learning the problem is to determine patterns and structure in unlabeled data. Some examples of unsupervised learning are K-Means clustering, hidden Markov models etc.

In this post I would like to take a look at Supervised Learning algorithms

Linear Regression

In regression problems we try to infer the relationship between a set of input parameters to an output value. Let us we have data for the number of rooms vs. price of the house as shown below

Depending on the data we could either fit a straight line or use a linear fit. Alternatively we could fit a higher order curve to data.

The function that determines the relationship is also known as hypothesis function. This can be represented as follows for e.g a hypothesis function with a single feature

h_Ɵ(x) = Ɵ₁x+ Ɵ₀

The above equation is the hypothesis function where Ɵ is the parameter and x is the feature

We could have a higher order hypothesis function as follows

h_Ɵ(x) = Ɵ₂x²+ Ɵ₁x+Ɵ₀

To evaluate whether the hypothesis function is able to map the input and related output accurately is known as the ‘cost function’.

The cost function can be represented as

J(Ɵ) = 1/2m Σ(h_Ɵ (xⁱ)– yⁱ)²

The cost function really calculates the ‘mean squared error’ of the actual data points (y) with the points on the hypothesis function (h_Ɵ). Clearly higher the value of J(Ɵ) the greater is the error in predicting the output based on a set of input parameters. If we just took the error instead of the squared error then if there were data points on either side of the predicted line then the positive & negative errors could cancel out. Hence the approach is usually to take the mean of the squared error.

The goal would be to minimize the error which will result in the best fit.

So the approach would be to choose values for the parameters Ɵi

The algorithm that is used for determining the values of the parameters that will result in the minimum error is gradient descent

The formula is

Ɵj := Ɵj – αd/d Ɵj J(Ɵ)

Where α is the learning rate

Gradient descent starts by picking a random value for Ɵi. Then the algorithm looks around to search for the next combination that will take us down fastest. By continuing this process the local minima is determined.

Gradient descent is based on the observation that if the multivariable function is defined and differentiable in a neighborhood of a point , then decreases fastest if one goes from in the direction of the negative gradient. This is shown in the below diagram taken from Wikipedia.

For e.g for a curve as shown below

This how I think the gradient descent works. In the above diagram at point A the slope is +ve and taking the negative of the slope multiplied by the learning factor α and subtracting it from Ɵj will result in a value that is less than Ɵj. That is we move towards the minima or C. Similarly at point B the slope will be -ve. If we multiply by – α then we will add to Ɵj. Hence we will move to the right or towards point C.

By applying the iterative process of gradient descent we can get the combination of parameter values for Ɵ that will provide the best fit for the set of data points

The iterative process of gradient descent is applied to minimize the cost function which is function of the error in the current hypothesis

δ/δ J(Ɵ) = δ/ δ Ɵ * 1/2m Σ(h_Ɵ (xⁱ)– yⁱ)²

This process is applied iteratively to the below equation to arrive at the values of Ɵi

The formula is

Ɵj := Ɵj – αd/d Ɵj J(Ɵ)

to obtain the values for the best fit equation

h_Ɵ(x) = Ɵ₂xⁿ+ Ɵ₁x^n-1+ …+ Ɵ₀

Also read my post on Simplifying ML: Logistic regression – Part 2

Find me on Google+