Practical Machine Learning with R and Python – Part 1

Introduction

This is the 1st part of a series of posts I intend to write on some common Machine Learning Algorithms in R and Python. In this first part I cover the following Machine Learning Algorithms

  • Univariate Regression
  • Multivariate Regression
  • Polynomial Regression
  • K Nearest Neighbors Regression

The code includes the implementation in both R and Python. This series of posts are based on the following 2 MOOC courses I did at Stanford Online and at Coursera

  1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford
  2. Applied Machine Learning in Python Prof Kevyn-Collin Thomson, University Of Michigan, Coursera

I have used the data sets from UCI Machine Learning repository(Communities and Crime and Auto MPG). I also use the Boston data set from MASS package

Note: Please listen to my video presentations Machine Learning in youtube
1. Machine Learning in plain English-Part 1
2. Machine Learning in plain English-Part 2
3. Machine Learning in plain English-Part 3

Check out my compact and minimal book  “Practical Machine Learning with R and Python:Third edition- Machine Learning in stereo”  available in Amazon in paperback($12.99) and kindle($8.99) versions. My book includes implementations of key ML algorithms and associated measures and metrics. The book is ideal for anybody who is familiar with the concepts and would like a quick reference to the different ML algorithms that can be applied to problems and how to select the best model. Pick your copy today!!

While coding in R and Python I found that there were some aspects that were more convenient in one language and some in the other. For example, plotting the fit in R is straightforward in R, while computing the R squared, splitting as Train & Test sets etc. are already available in Python. In any case, these minor inconveniences can be easily be implemented in either language.

R squared computation in R is computed as follows
RSS=\sum (y-yhat)^{2}
TSS= \sum(y-mean(y))^{2}
Rsquared- 1-\frac{RSS}{TSS}

Note: You can download this R Markdown file and the associated data sets from Github at MachineLearning-RandPython
Note 1: This post was created as an R Markdown file in RStudio which has a cool feature of including R and Python snippets. The plot of matplotlib needs a workaround but otherwise this is a real cool feature of RStudio!

1.1a Univariate Regression – R code

Here a simple linear regression line is fitted between a single input feature and the target variable

# Source in the R function library
source("RFunctions.R")
# Read the Boston data file
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - Statistical Learning

# Split the data into training and test sets (75:25)
train_idx <- trainTestSplit(df,trainPercent=75,seed=5)
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Fit a linear regression line between 'Median value of owner occupied homes' vs 'lower status of 
# population'
fit=lm(medv~lstat,data=df)
# Display details of fir
summary(fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
# Display the confidence intervals
confint(fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
plot(df$lstat,df$medv, xlab="Lower status (%)",ylab="Median value of owned homes ($1000)", main="Median value of homes ($1000) vs Lowe status (%)")
abline(fit)
abline(fit,lwd=3)
abline(fit,lwd=3,col="red")

rsquared=Rsquared(fit,test,test$medv)
sprintf("R-squared for uni-variate regression (Boston.csv)  is : %f", rsquared)
## [1] "R-squared for uni-variate regression (Boston.csv)  is : 0.556964"

1.1b Univariate Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
#os.chdir("C:\\software\\machine-learning\\RandPython")

# Read the CSV file
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
# Select the feature variable
X=df['lstat']

# Select the target 
y=df['medv']

# Split into train and test sets (75:25)
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
X_train=X_train.values.reshape(-1,1)
X_test=X_test.values.reshape(-1,1)

# Fit a linear model
linreg = LinearRegression().fit(X_train, y_train)

# Print the training and test R squared score
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))
     
# Plot the linear regression line
fig=plt.scatter(X_train,y_train)

# Create a range of points. Compute yhat=coeff1*x + intercept and plot
x=np.linspace(0,40,20)
fig1=plt.plot(x, linreg.coef_ * x + linreg.intercept_, color='red')
fig1=plt.title("Median value of homes ($1000) vs Lowe status (%)")
fig1=plt.xlabel("Lower status (%)")
fig1=plt.ylabel("Median value of owned homes ($1000)")
fig.figure.savefig('foo.png', bbox_inches='tight')
fig1.figure.savefig('foo1.png', bbox_inches='tight')
print "Finished"
## R-squared score (training): 0.571
## R-squared score (test): 0.458
## Finished

1.2a Multivariate Regression – R code

# Read crimes data
crimesDF <- read.csv("crimes.csv",stringsAsFactors = FALSE)

# Remove the 1st 7 columns which do not impact output
crimesDF1 <- crimesDF[,7:length(crimesDF)]

# Convert all to numeric
crimesDF2 <- sapply(crimesDF1,as.numeric)

# Check for NAs
a <- is.na(crimesDF2)
# Set to 0 as an imputation
crimesDF2[a] <-0
#Create as a dataframe
crimesDF2 <- as.data.frame(crimesDF2)
#Create a train/test split
train_idx <- trainTestSplit(crimesDF2,trainPercent=75,seed=5)
train <- crimesDF2[train_idx, ]
test <- crimesDF2[-train_idx, ]

# Fit a multivariate regression model between crimesPerPop and all other features
fit <- lm(ViolentCrimesPerPop~.,data=train)

# Compute and print R Squared
rsquared=Rsquared(fit,test,test$ViolentCrimesPerPop)
sprintf("R-squared for multi-variate regression (crimes.csv)  is : %f", rsquared)
## [1] "R-squared for multi-variate regression (crimes.csv)  is : 0.653940"

1.2b Multivariate Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Read the data
crimesDF =pd.read_csv("crimes.csv",encoding="ISO-8859-1")
#Remove the 1st 7 columns
crimesDF1=crimesDF.iloc[:,7:crimesDF.shape[1]]
# Convert to numeric
crimesDF2 = crimesDF1.apply(pd.to_numeric, errors='coerce')
# Impute NA to 0s
crimesDF2.fillna(0, inplace=True)

# Select the X (feature vatiables - all)
X=crimesDF2.iloc[:,0:120]

# Set the target
y=crimesDF2.iloc[:,121]

X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
# Fit a multivariate regression model
linreg = LinearRegression().fit(X_train, y_train)

# compute and print the R Square
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))
## R-squared score (training): 0.699
## R-squared score (test): 0.677

1.3a Polynomial Regression – R

For Polynomial regression , polynomials of degree 1,2 & 3 are used and R squared is computed. It can be seen that the quadaratic model provides the best R squared score and hence the best fit

 # Polynomial degree 1
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))

# Select key columns
df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]

# Split as train and test sets
train_idx <- trainTestSplit(df3,trainPercent=75,seed=5)
train <- df3[train_idx, ]
test <- df3[-train_idx, ]

# Fit a model of degree 1
fit <- lm(mpg~. ,data=train)
rsquared1 <-Rsquared(fit,test,test$mpg)
sprintf("R-squared for Polynomial regression of degree 1 (auto_mpg.csv)  is : %f", rsquared1)
## [1] "R-squared for Polynomial regression of degree 1 (auto_mpg.csv)  is : 0.763607"
# Polynomial degree 2 - Quadratic
x = as.matrix(df3[1:6])
# Make a  polynomial  of degree 2 for feature variables before split
df4=as.data.frame(poly(x,2,raw=TRUE))
df5 <- cbind(df4,df3[7])

# Split into train and test set
train_idx <- trainTestSplit(df5,trainPercent=75,seed=5)
train <- df5[train_idx, ]
test <- df5[-train_idx, ]

# Fit the quadratic model
fit <- lm(mpg~. ,data=train)
# Compute R squared
rsquared2=Rsquared(fit,test,test$mpg)
sprintf("R-squared for Polynomial regression of degree 2 (auto_mpg.csv)  is : %f", rsquared2)
## [1] "R-squared for Polynomial regression of degree 2 (auto_mpg.csv)  is : 0.831372"
#Polynomial degree 3
x = as.matrix(df3[1:6])
# Make polynomial of degree 4  of feature variables before split
df4=as.data.frame(poly(x,3,raw=TRUE))
df5 <- cbind(df4,df3[7])
train_idx <- trainTestSplit(df5,trainPercent=75,seed=5)

train <- df5[train_idx, ]
test <- df5[-train_idx, ]
# Fit a model of degree 3
fit <- lm(mpg~. ,data=train)
# Compute R squared
rsquared3=Rsquared(fit,test,test$mpg)
sprintf("R-squared for Polynomial regression of degree 2 (auto_mpg.csv)  is : %f", rsquared3)
## [1] "R-squared for Polynomial regression of degree 2 (auto_mpg.csv)  is : 0.773225"
df=data.frame(degree=c(1,2,3),Rsquared=c(rsquared1,rsquared2,rsquared3))
# Make a plot of Rsquared and degree
ggplot(df,aes(x=degree,y=Rsquared)) +geom_point() + geom_line(color="blue") +
    ggtitle("Polynomial regression - R squared vs Degree of polynomial") +
    xlab("Degree") + ylab("R squared")

1.3a Polynomial Regression – Python

For Polynomial regression , polynomials of degree 1,2 & 3 are used and R squared is computed. It can be seen that the quadaratic model provides the best R squared score and hence the best fit

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
# Select key columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
# Convert columns to numeric
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
# Drop NAs
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']

# Polynomial degree 1
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
print('R-squared score - Polynomial degree 1 (training): {:.3f}'.format(linreg.score(X_train, y_train)))
# Compute R squared     
rsquared1 =linreg.score(X_test, y_test)
print('R-squared score - Polynomial degree 1 (test): {:.3f}'.format(linreg.score(X_test, y_test)))

# Polynomial degree 2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)

# Compute R squared
print('R-squared score - Polynomial degree 2 (training): {:.3f}'.format(linreg.score(X_train, y_train)))
rsquared2 =linreg.score(X_test, y_test)
print('R-squared score - Polynomial degree 2 (test): {:.3f}\n'.format(linreg.score(X_test, y_test)))

#Polynomial degree 3

poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
print('(R-squared score -Polynomial degree 3  (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
# Compute R squared     
rsquared3 =linreg.score(X_test, y_test)
print('R-squared score Polynomial degree 3 (test): {:.3f}\n'.format(linreg.score(X_test, y_test)))
degree=[1,2,3]
rsquared =[rsquared1,rsquared2,rsquared3]
fig2=plt.plot(degree,rsquared)
fig2=plt.title("Polynomial regression - R squared vs Degree of polynomial")
fig2=plt.xlabel("Degree")
fig2=plt.ylabel("R squared")
fig2.figure.savefig('foo2.png', bbox_inches='tight')
print "Finished plotting and saving"
## R-squared score - Polynomial degree 1 (training): 0.811
## R-squared score - Polynomial degree 1 (test): 0.799
## R-squared score - Polynomial degree 2 (training): 0.861
## R-squared score - Polynomial degree 2 (test): 0.847
## 
## (R-squared score -Polynomial degree 3  (training): 0.933
## R-squared score Polynomial degree 3 (test): 0.710
## 
## Finished plotting and saving

1.4 K Nearest Neighbors

The code below implements KNN Regression both for R and Python. This is done for different neighbors. The R squared is computed in each case. This is repeated after performing feature scaling. It can be seen the model fit is much better after feature scaling. Normalization refers to

X_{normalized} = \frac{X-min(X)}{max(X-min(X))}

Another technique that is used is Standardization which is

X_{standardized} = \frac{X-mean(X)}{sd(X)}

1.4a K Nearest Neighbors Regression – R( Unnormalized)

The R code below does not use feature scaling

# KNN regression requires the FNN package
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]

# Split train and test
train_idx <- trainTestSplit(df3,trainPercent=75,seed=5)
train <- df3[train_idx, ]
test <- df3[-train_idx, ]
#  Select the feature variables
train.X=train[,1:6]
# Set the target for training
train.Y=train[,7]
# Do the same for test set
test.X=test[,1:6]
test.Y=test[,7]

rsquared <- NULL
# Create a list of neighbors
neighbors <-c(1,2,4,8,10,14)
for(i in seq_along(neighbors)){
    # Perform a KNN regression fit
    knn=knn.reg(train.X,test.X,train.Y,k=neighbors[i])
    # Compute R sqaured
    rsquared[i]=knnRSquared(knn$pred,test.Y)
}

# Make a dataframe for plotting
df <- data.frame(neighbors,Rsquared=rsquared)
# Plot the number of neighors vs the R squared
ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") +
    xlab("Number of neighbors") + ylab("R squared") +
    ggtitle("KNN regression - R squared vs Number of Neighors (Unnormalized)")

1.4b K Nearest Neighbors Regression – Python( Unnormalized)

The Python code below does not use feature scaling

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']

# Perform a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# Create a list of neighbors
rsquared=[]
neighbors=[1,2,4,8,10,14]
for i in neighbors:
        # Fit a KNN model
        knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train, y_train)
        # Compute R squared
        rsquared.append(knnreg.score(X_test, y_test))
        print('R-squared test score: {:.3f}'
        .format(knnreg.score(X_test, y_test)))
# Plot the number of neighors vs the R squared        
fig3=plt.plot(neighbors,rsquared)
fig3=plt.title("KNN regression - R squared vs Number of neighbors(Unnormalized)")
fig3=plt.xlabel("Neighbors")
fig3=plt.ylabel("R squared")
fig3.figure.savefig('foo3.png', bbox_inches='tight')
print "Finished plotting and saving"
## R-squared test score: 0.527
## R-squared test score: 0.678
## R-squared test score: 0.707
## R-squared test score: 0.684
## R-squared test score: 0.683
## R-squared test score: 0.670
## Finished plotting and saving

1.4c K Nearest Neighbors Regression – R( Normalized)

It can be seen that R squared improves when the features are normalized.

df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]

# Perform MinMaxScaling of feature variables 
train.X.scaled=MinMaxScaler(train.X)
test.X.scaled=MinMaxScaler(test.X)

# Create a list of neighbors
rsquared <- NULL
neighbors <-c(1,2,4,6,8,10,12,15,20,25,30)
for(i in seq_along(neighbors)){
    # Fit a KNN model
    knn=knn.reg(train.X.scaled,test.X.scaled,train.Y,k=i)
    # Compute R ssquared
    rsquared[i]=knnRSquared(knn$pred,test.Y)
    
}

df <- data.frame(neighbors,Rsquared=rsquared)
# Plot the number of neighors vs the R squared 
ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") +
    xlab("Number of neighbors") + ylab("R squared") +
    ggtitle("KNN regression - R squared vs Number of Neighors(Normalized)")

1.4d K Nearest Neighbors Regression – Python( Normalized)

R squared improves when the features are normalized with MinMaxScaling

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']

# Perform a train/ test  split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# Use MinMaxScaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Apply scaling on test set
X_test_scaled = scaler.transform(X_test)

# Create a list of neighbors
rsquared=[]
neighbors=[1,2,4,6,8,10,12,15,20,25,30]
for i in neighbors:
    # Fit a KNN model
    knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train_scaled, y_train)
    # Compute R squared
    rsquared.append(knnreg.score(X_test_scaled, y_test))
    print('R-squared test score: {:.3f}'
        .format(knnreg.score(X_test_scaled, y_test)))

# Plot the number of neighors vs the R squared 
fig4=plt.plot(neighbors,rsquared)
fig4=plt.title("KNN regression - R squared vs Number of neighbors(Normalized)")
fig4=plt.xlabel("Neighbors")
fig4=plt.ylabel("R squared")
fig4.figure.savefig('foo4.png', bbox_inches='tight')
print "Finished plotting and saving"
## R-squared test score: 0.703
## R-squared test score: 0.810
## R-squared test score: 0.830
## R-squared test score: 0.838
## R-squared test score: 0.834
## R-squared test score: 0.828
## R-squared test score: 0.827
## R-squared test score: 0.826
## R-squared test score: 0.816
## R-squared test score: 0.815
## R-squared test score: 0.809
## Finished plotting and saving

Conclusion

In this initial post I cover the regression models when the output is continous. I intend to touch upon other Machine Learning algorithms.
Comments, suggestions and corrections are welcome.

Watch this this space!

To be continued….

You may like
1. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
2. Neural Networks: The mechanics of backpropagation
3. More book, more cricket! 2nd edition of my books now on Amazon
4. Spicing up a IBM Bluemix cloud app with MongoDB and NodeExpress
5. Introducing cricket package yorkr:Part 4-In the block hole!

To see all posts see Index of posts

Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra

Continuing my earlier ‘innings’, of test driving my knowledge in Machine Learning acquired via Coursera,  I now turn my attention towards the bowling performances of our Indian bowling heroes. In this post I give a slightly different ‘spin’ to the bowling analysis and hope I can ‘swing’ your opinion based on my assessment.

I guess that is enough of my cricketing ‘double-speak’ for now and I will get down to the real business of my bowling analysis!

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of performances of both batsmen, bowlers besides evaluating team & match performances in Tests , ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the kindle versions. The books can be accessed at Cricket analytics with cricketr  and Beaten by sheer pace-Cricket analytics with yorkr  A must read for any cricket lover! Check it out!!

1

 

As in my earlier post Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid ,the first part of the post has my analyses and the latter part has the details of the implementation of the algorithm. Feel free to read the first part and either scan or skip the latter.

To perform this analysis I have skipped the data on our recent crop of new bowlers. The reason being that data is scant on these bowlers, besides they also seem to have a relatively shorter shelf life (hope there are a couple of finds in this Australian tour of Dec 2014). For the analyses I have chosen B S Chandrasekhar, Kapil Dev Anil Kumble. My rationale as to why I chose the above 3

B S Chandrasekhar also known as “Chandra’ was one of the most lethal leg spinners in the late 1970’s. He had a very dangerous combination of fast leg breaks, searing tops spins interspersed with the  occasional googly. On many occasions he would leave most batsmen completely clueless.

Kapil Nikhanj Dev, the Haryana Hurricane who could outwit the most technically sound batsmen  through some really clever bowling. His variations were almost always effective and he would achieve the vital breakthrough outsmarting the opponent.

And finally Anil Kumble, I chose Kumble because in my opinion he is truly the embodiment of the ‘thinking’ bowler. Many times I have seen Kumble repeatedly beat batsmen. It was like he was telling the batsman ‘check’ as he bowled faster leg breaks, flippers, a straighter delivery or top spins before finally crashing into the wickets or trapping the batsmen. It felt he was saying ‘checkmate dude!’

I have taken the data for the 3 bowlers from ESPN Cricinfo. Only the Test matches were considered for the analyses. All tests against all oppositions both at home and away were included

The assumptions taken and basis of the computation is included below
a.The data is based on the following 2 input variables a) Overs bowled b) Runs given. The output variable is ‘Wickets taken’

b.To my surprise I found that in the late 1970’s when BS Chandrasekhar used to bowl, an over had 8 balls for matches in Australia. So, I had to normalize this data for Chandra to make it on par with the others. Hence for Chandra where the overs were made up of 8 balls the overs was calculated as follows
Overs (O) = (Overs * 8)/6

c.The Economy rate E was calculated as below
E = Overs/runs was chosen as input variable to take into account fewer runs given by the bowler

d.The output variable was re-calculated as Strike Rate (SR) to determine the ‘bowling effectiveness’
Strike Rate = Wickets/Overs
(not be confused with a batsman’s strike rate batsman strike rate = runs/ balls faced)

e.Hence the analysis is based on
f(O,E) = SR
An outline of the Octave code and the data used can be cloned from GitHub at ml-bowling-analyze

 1. Surface of Bowling Effectiveness (SBE)
In my earlier post I was able to fit a ‘prediction plane’ based on the minutes at crease, balls faced versus the runs scored. But in this case a plane did not make sense as the wickets can only range from 0 – 10 and in most cases averaging between 3 and 5. So I plot the best fitting 3-D surface over the predicted hypothesis function. The steps performed are

1) The data for the different  bowlers were cleaned with data which indicated (DNB – Did not bowl)
2) The Economy Rate (E) = Runs given/Overs and Strike Rate(SR) = Wickets/overs were calculated.
3) The product of Overs (O), and Economy(E) were stored as Over_Economy(OE)
4) The hypothesis function was computed as h(O, E, OE) = y
5) Theta was calculated using the Normal Equation. The Surface of Bowling Effectiveness( SBE) was then plotted. The plots for each of the bowler is shown below

Here are the plots

A) Anil Kumble
The  data of Kumble, based on Overs bowled & Economy rate versus the Strike Rate is plotted as a 3-D scatter plot (pink crosses). The best fit as determined by solving the optimum theta using the Normal Equation is plotted as 3-D surface shown below.
kumble-1
The 3-D surface is what I have termed as ‘Surface of Bowling Effectiveness (SBE)’ as it depicts bowlers overall effectiveness as it plots the overs (O), ‘economy rate’ E against predicted ‘strike rate’ SR.
Here is another view
kumble-2
The theta values obtained for Kumble are
Theta =
0.104208
-0.043769
-0.016305
0.011949

And the cost at this theta is
Cost Function J = 0.0046269

B) B S Chandrasekhar
Here are the best optimal surface plot for Chandra with the data on O,E vs SR plotted as a 3D scatter plot.  Note: The dataset for  Chandrasekhar is smaller compared to the other two.
chandra-1Another view for Chandra
chandra-2

Theta values for B S Chandrasekhar are
Theta =
0.095780
-0.025377
-0.024847
0.023415
and the cost is
Cost Function J = 0.0032980

c) Kapil Dev
The plots  for Kapil
kapil-1
Another view of SBE for Kapil
kapil-2
The Theta values and cost function for Kapil are
Theta =
0.090219
0.027725
0.023894
-0.021434
Cost Function J = 0.0035123

2. Predicting wickets
In the previous section the optimum theta with the lowest Cost Function J was calculated. Based on the value of theta, the wickets that will be taken by a bowler can be computed as the product of the hypothesis function and theta. i.e.

y= h(x) * theta  => Strike Rate (SR) = [1 O E OE] * theta
Now predicted wickets can be calculated as

wickets = Strike rate(SR) * Overs(O)
This is done  for Kumble, Chandra and Kapil  for different combinations of Overs(O) and Economy(E) rate.

Here are the results
Predicted wickets for Anil Kumble
The plot of predicted wickets for Kumble is represented below
kumble-wickets-1
This can also be represented as a a table
kumble-wkts-tbl

Predicted wickets for B S Chandrasekhar
chandra-wickets-1
The table for Chandra
chandra-wkts-tbl
 Predicted wickets for Kapil Dev

The plot
kapil-wicket-2

The predicted table from the hypothesis function for Kapil Dev
kapil-wkts-tbl

Observation: A closer look at  the predicted wickets for Kapil, Kumble and B S Chandra shows an interesting aspect. The predicted number of wickets is higher for lower economy rates. With a little thought we can see bowlers on turning or pitches with a lot of movement can not only be more economical but can also be destructive and take a lot of wickets. Hence the higher wickets for lower economy rates!

Implementation details
In this post I have used the Normal Equation to get the optimal values of theta for local minimum of the Gradient function.  As mentioned above when I had run the 3D scatter plot fitting a 2D plane did not seem quite right. So I had to experiment with different polynomial equations first trying 2nd order, 3rd order and also the sqrt

I tried the following where ‘O is Overs, ‘E’ stands for Economy Rate and ‘SR’ the predicated Strike rate. Theta is the computed theta from the Normal Equation. The notation in  Matrix notation is shown below

i) A linear plane
SR = [1 O E] * theta

ii) Using the sqrt function
SR = [1 sqrt(O) sqrt(E)]  * theta

iii) Using 2nd order plynomial
SR = [1 O^2 E^2] * theta

iv) Using the 3rd order polynomial
SR = [1 O^3 E^3] * theta

v) Before finally settling on
SR = [1 O E OE] * theta

where OE  = O .* E

The last one seemed to give me the lowest cost and also seemed the most logical visual choice.

A good resource to play around with different functions and check out the shapes of combinations of variables and polynomial order of equation is at WolframAlpha: Plotting and Graphics

Note 1: The gradient descent with the Normal Equation has been performed on the entire data set (approx 220 for Kumble & Kapil) and 99 for Chandra. The proper process for verifying a Machine Learning algorithm is to split the data set into (60% training data, 20% cross validation data and 20% as the test set).  We need to validate the prediction function against the cross-validation set, fine tune it and finally ensure that it  fits  the test set samples well.  However, this split was not done as the data set itself was very low. The entire data set was used to perform the optimal surface fit

Note 2: The optimal theta values have been chosen with a feature vector that is of the form
[1 x y x .* y] The Surface of  Bowling Effectiveness’ has been plotted above. It may appear that there is a’high bias’ in the fit and an even better fit could be obtained by choosing higher order polynomials like
[1 x y x*y x^2 y^2 (x^2) .* y x  .* (y^2)] or
[1 x y x*y x^2 y^2 x^3 y^3]  etc
While we can get a better fit we could run into the problem of ‘high variance; and without the cross validation and test set we will not be able to verify the results, Hence the simpler option [1 x y x*y] was chosen

The Octave code outline and the data used can be cloned from GitHub at ml-bowling-analyze

 Conclusion:

1) Predicted wickets: The predicted number of wickets is higher at lower economy rates
2) Comparing performances: There are different ways of looking at the results. One possible way is to check for a particular number of overs and economy rate who is most effective. Here is one way. Taking a small slice from each bowler’s predicted wickets table for anm Economy Rate=4.0 the predicted wickets are

comp

From the above it does appear that Kapil is definitely more effective than the other two. However one could slice and dice in different ways, maybe the most economical for a given numbers and wickets combination or wickets taken in the least overs etc. Do add your thoughts. comments on my assessment or analysis

Also see
1. Analyzing cricket’s batting legends – Through the mirage with R
2. Masters of spin: Unraveling the web with R

You may also like
1. A peek into literacy in India:Statistical learning with R
2. A crime map of India in R: Crimes against women
3.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1