# My book ‘Cricket analytics with cricketr and cricpy’ is now on Amazon

‘Cricket analytics with cricketr and cricpy – Analytics harmony with R and Python’ is now available on Amazon in both paperback ($21.99) and kindle ($9.99/Rs 449) versions. The book includes analysis of cricketers using both my R package ‘cricketr’ and my python package ‘cricpy’ for all formats of the game namely Test, ODI and T20. Both packages use data from ESPN Cricinfo Statsguru. The paperback is available on Amazon for $21.99 and the kindle version is available for$9.99/Rs 449

The book includes the following chapters

CONTENTS

Introduction 7
1. Cricket analytics with cricketr 9
1.1. Introducing cricketr! : An R package to analyze performances of cricketers 10
1.2. Taking cricketr for a spin – Part 1 48
1.2. cricketr digs the Ashes! 69
1.3. cricketr plays the ODIs! 97
1.4. cricketr adapts to the Twenty20 International! 139
1.5. Sixer – R package cricketr’s new Shiny avatar 168
1.6. Re-introducing cricketr! : An R package to analyze performances of cricketers 178
1.7. cricketr sizes up legendary All-rounders of yesteryear 233
1.8. cricketr flexes new muscles: The final analysis 277
1.9. The Clash of the Titans in Test and ODI cricket 300
1.10. Analyzing performances of cricketers using cricketr template 338
2. Cricket analytics with cricpy 352
2.1 Introducing cricpy:A python package to analyze performances of cricketers 353
2.2 Cricpy takes a swing at the ODIs 405
Analysis of Top 4 batsman 448
2.3 Cricpy takes guard for the Twenty20s 449
2.4 Analyzing batsmen and bowlers with cricpy template 490
9. Average runs against different opposing teams 493
3. Other cricket posts in R 500
3.1 Analyzing cricket’s batting legends – Through the mirage with R 500
3.2 Mirror, mirror … the best batsman of them all? 527
4. Appendix 541
Cricket analysis with Machine Learning using Octave 541
4.1 Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid 542
4.2 Informed choices through Machine Learning-2 Pitting together Kumble, Kapil, Chandra 555

To see all posts click Index of posts

# Analyzing performances of cricketers using cricketr template

This post includes a template which you can use for analyzing the performances of cricketers, both batsmen and bowlers in Test, ODI and Twenty 20 cricket using my R package cricketr. To see actual usage of functions in the R package cricketr see Introducing cricketr! : An R package to analyze performances of cricketers.

The ‘cricketr’ package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package supports all formats of the game including Test, ODI and Twenty20 versions.

You should be able to install the package from GitHub and use the many functions available in the package. Please mindful of the ESPN Cricinfo Terms of Use

Take a look at my short video tutorial on my R package cricketr on Youtube – R package cricketr – A short tutorial

Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar

# The cricketr package

The cricketr package has several functions that perform several different analyses on both batsman and bowlers. The package has function that plot percentage frequency runs or wickets, runs likelihood for a batsman, relative run/strike rates of batsman and relative performance/economy rate for bowlers are available.

Other interesting functions include batting performance moving average, forecast and a function to check whether the batsmans in in-form or out-of-form.

The data for a particular player can be obtained with the getPlayerData() function. To do you will need to go to ESPN CricInfo Player and type in the name of the player for e.g Ricky Ponting, Sachin Tendulkar etc. This will bring up a page which have the profile number for the player e.g. for Sachin Tendulkar this would be http://www.espncricinfo.com/india/content/player/35320.html. Hence, Sachin’s profile is 35320. This can be used to get the data for Tendulkar as shown below

The cricketr package is now available from CRAN!!! You should be able to install directly with

### 1. Install the cricketr package

if (!require("cricketr")){
install.packages("cricketr",lib = "c:/test")
}
library(cricketr)

The cricketr package includes some pre-packaged sample (.csv) files. You can use these sample to test functions as shown below

# Retrieve the file path of a data file installed with cricketr
#pathToFile <- system.file("data", "tendulkar.csv", package = "cricketr")
#batsman4s(pathToFile, "Sachin Tendulkar")

# The general format is pkg-function(pathToFile,par1,...)
#batsman4s(<path-To-File>,"Sachin Tendulkar")

“ The pre-packaged files can be accessed as shown above. To get the data of any player use the function in Test, ODI and Twenty20 use the following

### 2. For Test cricket

#tendulkar <- getPlayerData(35320,dir="..",file="tendulkar.csv",type="batting",homeOrAway=c(1,2), result=c(1,2,4))

### 2a. For ODI cricket

#tendulkarOD <- getPlayerDataOD(35320,dir="..",file="tendulkarOD.csv",type="batting")

### 2b For Twenty 20 cricket

#tendulkarT20 <- getPlayerDataTT(35320,dir="..",file="tendulkarT20.csv",type="batting")

## Analysis of batsmen

Important Note This needs to be done only once for a player. This function stores the player’s data in a CSV file (for e.g. tendulkar.csv as above) which can then be reused for all other functions. Once we have the data for the players many analyses can be done. This post will use the stored CSV file obtained with a prior getPlayerData for all subsequent analyses

## Sachin Tendulkar’s performance – Basic Analyses

The 3 plots below provide the following for Tendulkar

1. Frequency percentage of runs in each run range over the whole career
2. Mean Strike Rate for runs scored in the given range
3. A histogram of runs frequency percentages in runs ranges For example

### 3. Basic analyses

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsmanRunsFreqPerf("./tendulkar.csv","Tendulkar")
#batsmanMeanStrikeRate("./tendulkar.csv","Tendulkar")
#batsmanRunsRanges("./tendulkar.csv","Tendulkar")
dev.off()
## null device
##           1
1. Player 1
2. Player 2
3. Player 3
4. Player 4

### 4. More analyses

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player1.csv","Player1")
#batsman6s("./player1.csv","Player1")
#batsmanMeanStrikeRate("./player1.csv","Player1")

# For ODI and T20
#batsmanScoringRateODTT("./player1.csv","Player1")
dev.off()
## null device
##           1
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player2.csv","Player2")
#batsman6s("./player2.csv","Player2")
#batsmanMeanStrikeRate("./player2.csv","Player2")
# For ODI and T20
#batsmanScoringRateODTT("./player1.csv","Player1")
dev.off()
## null device
##           1
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player3.csv","Player3")
#batsman6s("./player3.csv","Player3")
#batsmanMeanStrikeRate("./player3.csv","Player3")
# For ODI and T20
#batsmanScoringRateODTT("./player1.csv","Player1")

dev.off()
## null device
##           1
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player4.csv","Player4")
#batsman6s("./player4.csv","Player4")
#batsmanMeanStrikeRate("./player4.csv","Player4")
# For ODI and T20
#batsmanScoringRateODTT("./player1.csv","Player1")
dev.off()
## null device
##           1

Note: For mean strike rate in ODI and Twenty20 use the function batsmanScoringRateODTT()

### 5.Boxplot histogram plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency

#batsmanPerfBoxHist("./player1.csv","Player1")
#batsmanPerfBoxHist("./player2.csv","Player2")
#batsmanPerfBoxHist("./player3.csv","Player3")
#batsmanPerfBoxHist("./player4.csv","Player4")

### 6. Contribution to won and lost matches

For the 2 functions below you will have to use the getPlayerDataSp() function. I have commented this as I already have these files. This function can only be used for Test matches

#player1sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player1sp.csv",ttype="batting")
#player2sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player2sp.csv",ttype="batting")
#player3sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player3sp.csv",ttype="batting")
#player4sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player4sp.csv",ttype="batting")
par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanContributionWonLost("player1sp.csv","Player1")
#batsmanContributionWonLost("player2sp.csv","Player2")
#batsmanContributionWonLost("player3sp.csv","Player3")
#batsmanContributionWonLost("player4sp.csv","Player4")
dev.off()
## null device
##           1

### 7, Performance at home and overseas

This function also requires the use of getPlayerDataSp() as shown above. This can only be used for Test matches

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanPerfHomeAway("player1sp.csv","Player1")
#batsmanPerfHomeAway("player2sp.csv","Player2")
#batsmanPerfHomeAway("player3sp.csv","Player3")
#batsmanPerfHomeAway("player4sp.csv","Player4")
dev.off()
## null device
##           1

### 8. Batsman average at different venues

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanAvgRunsGround("./player1.csv","Player1")
#batsmanAvgRunsGround("./player2.csv","Player2")
#batsmanAvgRunsGround("./player3.csv","Ponting")
#batsmanAvgRunsGround("./player4.csv","Player4")
dev.off()
## null device
##           1

### 9. Batsman average against different opposition

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanAvgRunsOpposition("./player1.csv","Player1")
#batsmanAvgRunsOpposition("./player2.csv","Player2")
#batsmanAvgRunsOpposition("./player3.csv","Ponting")
#batsmanAvgRunsOpposition("./player4.csv","Player4")
dev.off()
## null device
##           1

### 10. Runs Likelihood of batsman

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanRunsLikelihood("./player1.csv","Player1")
#batsmanRunsLikelihood("./player2.csv","Player2")
#batsmanRunsLikelihood("./player3.csv","Ponting")
#batsmanRunsLikelihood("./player4.csv","Player4")
dev.off()
## null device
##           1

### 11. Moving Average of runs in career

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanMovingAverage("./player1.csv","Player1")
#batsmanMovingAverage("./player2.csv","Player2")
#batsmanMovingAverage("./player3.csv","Ponting")
#batsmanMovingAverage("./player4.csv","Player4")
dev.off()
## null device
##           1

### 12. Cumulative Average runs of batsman in career

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanCumulativeAverageRuns("./player1.csv","Player1")
#batsmanCumulativeAverageRuns("./player2.csv","Player2")
#batsmanCumulativeAverageRuns("./player3.csv","Ponting")
#batsmanCumulativeAverageRuns("./player4.csv","Player4")
dev.off()
## null device
##           1

### 13. Cumulative Average strike rate of batsman in career

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanCumulativeStrikeRate("./player1.csv","Player1")
#batsmanCumulativeStrikeRate("./player2.csv","Player2")
#batsmanCumulativeStrikeRate("./player3.csv","Ponting")
#batsmanCumulativeStrikeRate("./player4.csv","Player4")
dev.off()
## null device
##           1

### 14. Future Runs forecast

Here are plots that forecast how the batsman will perform in future. In this case 90% of the career runs trend is uses as the training set. the remaining 10% is the test set.

A Holt-Winters forecating model is used to forecast future performance based on the 90% training set. The forecated runs trend is plotted. The test set is also plotted to see how close the forecast and the actual matches

Take a look at the runs forecasted for the batsman below.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanPerfForecast("./player1.csv","Player1")
#batsmanPerfForecast("./player2.csv","Player2")
#batsmanPerfForecast("./player3.csv","Player3")
#batsmanPerfForecast("./player4.csv","Player4")
dev.off()
## null device
##           1

### 15. Relative Mean Strike Rate plot

The plot below compares the Mean Strike Rate of the batsman for each of the runs ranges of 10 and plots them. The plot indicate the following

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","Player4")
#relativeBatsmanSR(frames,names)

### 16. Relative Runs Frequency plot

The plot below gives the relative Runs Frequency Percetages for each 10 run bucket. The plot below show

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","Player4")
#relativeRunsFreqPerf(frames,names)

### 17. Relative cumulative average runs in career

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","Player4")
#relativeBatsmanCumulativeAvgRuns(frames,names)

### 18. Relative cumulative average strike rate in career

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","player4")
#relativeBatsmanCumulativeStrikeRate(frames,names)

### 19. Check Batsman In-Form or Out-of-Form

The below computation uses Null Hypothesis testing and p-value to determine if the batsman is in-form or out-of-form. For this 90% of the career runs is chosen as the population and the mean computed. The last 10% is chosen to be the sample set and the sample Mean and the sample Standard Deviation are caculated.

The Null Hypothesis (H0) assumes that the batsman continues to stay in-form where the sample mean is within 95% confidence interval of population mean The Alternative (Ha) assumes that the batsman is out of form the sample mean is beyond the 95% confidence interval of the population mean.

A significance value of 0.05 is chosen and p-value us computed If p-value >= .05 – Batsman In-Form If p-value < 0.05 – Batsman Out-of-Form

Note Ideally the p-value should be done for a population that follows the Normal Distribution. But the runs population is usually left skewed. So some correction may be needed. I will revisit this later

This is done for the Top 4 batsman

#checkBatsmanInForm("./player1.csv","Player1")
#checkBatsmanInForm("./player2.csv","Player2")
#checkBatsmanInForm("./player3.csv","Player3")
#checkBatsmanInForm("./player4.csv","Player4")

### 20. 3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
#battingPerf3d("./player1.csv","Player1")
#battingPerf3d("./player2.csv","Player2")
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
#battingPerf3d("./player3.csv","Player3")
#battingPerf3d("./player4.csv","player4")
dev.off()
## null device
##           1

### 21. Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

BF <- seq( 10, 400,length=15)
Mins <- seq(30,600,length=15)
newDF <- data.frame(BF,Mins)
#Player1 <- batsmanRunsPredict("./player1.csv","Player1",newdataframe=newDF)
#Player2 <- batsmanRunsPredict("./player2.csv","Player2",newdataframe=newDF)
#ponting <- batsmanRunsPredict("./player3.csv","Player3",newdataframe=newDF)
#sangakkara <- batsmanRunsPredict("./player4.csv","Player4",newdataframe=newDF)
#batsmen <-cbind(round(Player1$Runs),round(Player2$Runs),round(Player3$Runs),round(Player4$Runs))
#colnames(batsmen) <- c("Player1","Player2","Player3","Player4")
#newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
#colnames(newDF) <- c("BallsFaced","MinsAtCrease")
#predictedRuns <- cbind(newDF,batsmen)
#predictedRuns

## Analysis of bowlers

1. Bowler1
2. Bowler2
3. Bowler3
4. Bowler4

player1 <- getPlayerData(xxxx,dir=“..”,file=“player1.csv”,type=“bowling”) Note For One day you will have to use getPlayerDataOD() and for Twenty20 it is getPlayerDataTT()

### 21. Wicket Frequency Plot

This plot below computes the percentage frequency of number of wickets taken for e.g 1 wicket x%, 2 wickets y% etc and plots them as a continuous line

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerWktsFreqPercent("./bowler1.csv","Bowler1")
#bowlerWktsFreqPercent("./bowler2.csv","Bowler2")
#bowlerWktsFreqPercent("./bowler3.csv","Bowler3")
dev.off()
## null device
##           1

### 22. Wickets Runs plot

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerWktsRunsPlot("./bowler1.csv","Bowler1")
#bowlerWktsRunsPlot("./bowler2.csv","Bowler2")
#bowlerWktsRunsPlot("./bowler3.csv","Bowler3")
dev.off()
## null device
##           1

### 23. Average wickets at different venues

#bowlerAvgWktsGround("./bowler3.csv","Bowler3")

### 24. Average wickets against different opposition

#bowlerAvgWktsOpposition("./bowler3.csv","Bowler3")

### 25. Wickets taken moving average

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerMovingAverage("./bowler1.csv","Bowler1")
#bowlerMovingAverage("./bowler2.csv","Bowler2")
#bowlerMovingAverage("./bowler3.csv","Bowler3")

dev.off()
## null device
##           1

### 26. Cumulative Wickets taken

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerCumulativeAvgWickets("./bowler1.csv","Bowler1")
#bowlerCumulativeAvgWickets("./bowler2.csv","Bowler2")
#bowlerCumulativeAvgWickets("./bowler3.csv","Bowler3")
dev.off()
## null device
##           1

### 27. Cumulative Economy rate

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerCumulativeAvgEconRate("./bowler1.csv","Bowler1")
#bowlerCumulativeAvgEconRate("./bowler2.csv","Bowler2")
#bowlerCumulativeAvgEconRate("./bowler3.csv","Bowler3")
dev.off()
## null device
##           1

### 28. Future Wickets forecast

Here are plots that forecast how the bowler will perform in future. In this case 90% of the career wickets trend is used as the training set. the remaining 10% is the test set.

A Holt-Winters forecating model is used to forecast future performance based on the 90% training set. The forecated wickets trend is plotted. The test set is also plotted to see how close the forecast and the actual matches

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerPerfForecast("./bowler1.csv","Bowler1")
#bowlerPerfForecast("./bowler2.csv","Bowler2")
#bowlerPerfForecast("./bowler3.csv","Bowler3")
dev.off()
## null device
##           1

### 29. Contribution to matches won and lost

As discussed above the next 2 charts require the use of getPlayerDataSp(). This can only be done for Test matches

#bowler1sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler1sp.csv",ttype="bowling")
#bowler2sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler2sp.csv",ttype="bowling")
#bowler3sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler3sp.csv",ttype="bowling")
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerContributionWonLost("bowler1sp","Bowler1")
#bowlerContributionWonLost("bowler2sp","Bowler2")
#bowlerContributionWonLost("bowler3sp","Bowler3")
dev.off()
## null device
##           1

### 30. Performance home and overseas.

This can only be done for Test matches

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerPerfHomeAway("bowler1sp","Bowler1")
#bowlerPerfHomeAway("bowler2sp","Bowler2")
#bowlerPerfHomeAway("bowler3sp","Bowler3")
dev.off()
## null device
##           1

### 31 Relative Wickets Frequency Percentage

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlingPerf(frames,names)

### 32 Relative Economy Rate against wickets taken

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlingER(frames,names)

### 33 Relative cumulative average wickets of bowlers in career

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlerCumulativeAvgWickets(frames,names)

### 34 Relative cumulative average economy rate of bowlers

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlerCumulativeAvgEconRate(frames,names)

### 35 Check for bowler in-form/out-of-form

The below computation uses Null Hypothesis testing and p-value to determine if the bowler is in-form or out-of-form. For this 90% of the career wickets is chosen as the population and the mean computed. The last 10% is chosen to be the sample set and the sample Mean and the sample Standard Deviation are caculated.

The Null Hypothesis (H0) assumes that the bowler continues to stay in-form where the sample mean is within 95% confidence interval of population mean The Alternative (Ha) assumes that the bowler is out of form the sample mean is beyond the 95% confidence interval of the population mean.

A significance value of 0.05 is chosen and p-value us computed If p-value >= .05 – Batsman In-Form If p-value < 0.05 – Batsman Out-of-Form

Note Ideally the p-value should be done for a population that follows the Normal Distribution. But the runs population is usually left skewed. So some correction may be needed. I will revisit this later

Note: The check for the form status of the bowlers indicate

#checkBowlerInForm("./bowler1.csv","Bowler1")
#checkBowlerInForm("./bowler2.csv","Bowler2")
#checkBowlerInForm("./bowler3.csv","Bowler3")
dev.off()
## null device
##           1

# The Clash of the Titans in Test and ODI cricket

Who looks outside, dreams; who looks inside, awakes.
Show me a sane man and I will cure him for you.

            Carl Jung 

We’re made of star stuff. We are a way for the cosmos to know itself.
If you want to make an apple pie from scratch, you must first create the universe.

            Carl Sagan

## Introduction

The biggest nag in the collective psyche of cricketing fraternity these days, is whether Virat Kohli has surpassed Sachin Tendulkar. This question has been troubling cricket lovers the world over and particularly in India, for quite a while. This nagging question has only grown stronger with Kohli’s 41st ODI century and with Michael Vaughan bestowing the GOAT title to Virat Kohli for ODI cricket. Hence, I decided to do my bit in addressing this, by doing analysis of Kohli’s and Tendulkar’s performance in ODI cricket. I also wanted to address the the best among the cricketing idols of India in Test cricket, namely Sunil Gavaskar, Sachin Tendulkar and Virat Kohli. Hence this post has 2 parts

1. Analysis of Tendulkar, Gavaskar and Kohli in Test cricket
2. Analysis of Tendulkar and Kohli in ODIs

In this post, I analyze the performances of these titans in Test and ODI cricket using my R package cricketr. While some may feel that comparisons are not possible as these batsmen are from different eras. To some extent this is true. I would give some leeway to Gavaskar as he had to bat in a pre-helmet era. But with Tendulkar and Kohli a fair and objective comparison is possible. There were pre-eminient bowlers in the times of Tendulkar as there are now.

From the analysis below, it can be seen that Tendulkar is ahead  of everybody else in Test cricket. However it must be noted that Tendulkar’s performance deteriorated towards the end of his career. Such was not the case with Gavaskar. Kohli has some catching up to do and he still has a lot of Test cricket in him.

In ODI Kohli can be seen to pulling ahead of Tendulkar in several aspects.

My R package cricketr can be installed directly from CRAN and you can use it analyze cricketers.

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package supports all formats of the game including Test, ODI and Twenty20 versions.

You should be able to install the package from GitHub and use the many functions available in the package. Please mindful of the ESPN Cricinfo Terms of Use

Take a look at my short video tutorial on my R package cricketr on Youtube – R package cricketr – A short tutorial

Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar

Note 1: If you would like to do a similar analysis for a different set of batsman and bowlers, you can clone/download my skeleton cricketr templatefrom Github (which is the R Markdown file I have used for the analysis below).

Note 2: I sprinkle the charts with my observations. Feel free to look at them more closely and come to your conclusions.

If you are passionate about cricket, and love analyzing cricket performances, then check out my racy book on cricket ‘Cricket analytics with cricketr and cricpy – Analytics harmony with R & Python’! This book discusses and shows how to use my R package ‘cricketr’ and my Python package ‘cricpy’ to analyze batsmen and bowlers in all formats of the game (Test, ODI and T20). The paperback is available on Amazon at $21.99 and the kindle version at$9.99/Rs 449/-. A must read for any cricket lover! Check it out!!

Important note: Do check out the python avatar of cricketr, ‘cricpy’ in my post Introducing cricpy:A python package to analyze performances of cricketers

### 1 Load the cricketr package

if (!require("cricketr")){
install.packages("cricketr",lib = "c:/test")
}
library(cricketr)

## A Test cricket  – Analysis of Gavaskar, Tendulkar and Kohli

### 2. Get player data

tendulkar <- getPlayerData(35320,dir=".",file="tendulkar.csv",type="batting")
kohli <- getPlayerData(253802,dir=".",file="kohli.csv",type="batting")
gavaskar <- getPlayerData(28794,dir=".",file="gavaskar.csv",type="batting")

### 3a. Basic analyses for Tendulkar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsFreqPerf("./tendulkar.csv","Tendulkar")
batsmanMeanStrikeRate("./tendulkar.csv","Tendulkar")
batsmanRunsRanges("./tendulkar.csv","Tendulkar")
dev.off()

### 3b Basic analyses for Kohli

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsFreqPerf("./kohli.csv","Kohli")
batsmanMeanStrikeRate("./kohli.csv","Kohli")
batsmanRunsRanges("./kohli.csv","Kohli")
dev.off()

### 3c Basic analyses for Gavaskar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsRanges("./gavaskar.csv","Gavaskar")
dev.off()

### 4a.More analyses for Tendulkar

It can be seen that Tendulkar and Gavaskar has been bowled more often than Kohli. Also Kohli does not have as many sixes in Test cricket as Tendulkar and Gavaskar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./tendulkar.csv","Tendulkar")
batsman6s("./tendulkar.csv","Tendulkar")
batsmanDismissals("./tendulkar.csv","Tendulkar")
dev.off()

### 4b. More analyses for Kohli

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./kohli.csv","Kohli")
batsman6s("./kohli.csv","Kohli")
batsmanDismissals("./kohli.csv","Kohli")
dev.off()

### 4c More analyses for Gavaskar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanDismissals("./gavaskar.csv","Gavaskar")
dev.off()

### 5 Performance of batsmen on different grounds

par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./tendulkar.csv","Tendulkar")
batsmanAvgRunsGround("./kohli.csv","Kohli")
batsmanAvgRunsGround("./gavaskar.csv","Gavaskar")


a

#dev.off()

### 6. Performance if batsmen against different Opposition

1. Tendulkar averages 50 against the following countries – Australia, Bangladesh, England, Sri Lanka, West Indies and Zimbabwe
2. Kohli average almost 50 against all the nations he has played – Australia, Bangladesh, England, New Zealand, Sri Lanka and West Indies
3. Gavaskar averages 50 against Australia, Pakistan, West Indies, Sri Lanka
par(mar=c(4,4,2,2))
batsmanAvgRunsOpposition("./tendulkar.csv","Tendulkar")
batsmanAvgRunsOpposition("./kohli.csv","Kohli")
batsmanAvgRunsOpposition("./gavaskar.csv","Gavaskar")

### 7. Get player data special

This is required for the next 2 function calls

tendulkarsp <- getPlayerDataSp(35320,tdir=".",tfile="tendulkarsp.csv",ttype="batting")
kohlisp <- getPlayerDataSp(253802,tdir=".",tfile="kohlisp.csv",ttype="batting")

#dev.off()

### 8 Get contribution of batsmen in matches won and lost

Kohli contribution has had an equal contribution in won and lost matches. Tendulkar’s runs seem to have not helped in winning as much as only 50% of matches he has played have been won

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))

batsmanContributionWonLost("tendulkarsp.csv","Tendulkar")
batsmanContributionWonLost("./kohlisp.csv","Kohli")


a

### 9 Performance of batsmen at home and overseas

The boxplots show that Kohli performs better overseas than at home. The 3rd quartile is higher, though the median seems to lower overseas. For Tendulkar the performance is similar both ways. Gavaskar’s median runs scored overseas is higher.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))

batsmanPerfHomeAway("tendulkarsp.csv","Tendulkar")
batsmanPerfHomeAway("./kohlisp.csv","Kohli")



### 10. Moving average of runs

Gavaskar’s moving average was very good at the time of his retirement. Kohli seems to be going very strong. Tendulkar’s performance shows signs of deterioration around the time of his retirement.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))

batsmanMovingAverage("./tendulkar.csv","Tendulkar")
batsmanMovingAverage("./kohli.csv","Kohli")

#dev.off()

### 11 Boxplot and histogram of runs

Kohli has a marginally higher average (50.69) than Tendulkar (48.65) while Gavaskar 46. The median runs are same for Tendulkar and Kohli at 32

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfBoxHist("./tendulkar.csv","Sachin Tendulkar")
batsmanPerfBoxHist("./kohli.csv","Kohli")
batsmanPerfBoxHist("./gavaskar.csv","Gavaskar")

### 12 Cumulative average Runs for batsmen

Looking at the cumulative average runs we can see a gradual drop in the cumulative average for Tendulkar while Kohli and Gavaskar’s performance seems to be getting better

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanCumulativeAverageRuns("./tendulkar.csv","Tendulkar")
batsmanCumulativeAverageRuns("./kohli.csv","Kohli")
batsmanCumulativeAverageRuns("./gavaskar.csv","Gavaskar")

### 13. Cumulative average strike rate of batsmen

Tendulkar’s strike rate is better than Kohli and Gavaskar

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanCumulativeStrikeRate("./tendulkar.csv","Tendulkar")
batsmanCumulativeStrikeRate("./kohli.csv","Kohli")
batsmanCumulativeStrikeRate("./gavaskar.csv","Gavaskar")

### 14 Performance forecast of batsmen

The forecasted performance for Kohli and Gavaskar is higher than that of Tendulkar

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfForecast("./tendulkar.csv","Sachin Tendulkar")
batsmanPerfForecast("./kohli.csv","Kohli")

#dev.off()

### 15. Relative strike rate of batsmen

par(mar=c(4,4,2,2))

relativeBatsmanSR(frames,names)
#dev.off()



### 16. Relative Runs frequency of batsmen

par(mar=c(4,4,2,2))
relativeRunsFreqPerf(frames,names)
#dev.off()


### 17. Relative cumulative average runs of batsmen

Tendulkar leads the way here, but it can be seem Kohli catching up.

par(mar=c(4,4,2,2))
relativeBatsmanCumulativeAvgRuns(frames,names)
#dev.off()


### 18. Relative cumulative average strike rate

Tendulkar has better strike rate than the other two.

par(mar=c(4,4,2,2))
relativeBatsmanCumulativeStrikeRate(frames,names)
#dev.off()


### 19. Check batsman in form

As in the moving average and performance forecast and cumulative average runs, Kohli and Gavaskar are in-form while Tendulkar was out-of-form towards the end.

checkBatsmanInForm("./tendulkar.csv","Sachin Tendulkar")
## [1] "**************************** Form status of Sachin Tendulkar ****************************
\n\n Population size: 294  Mean of population: 50.48 \n Sample size: 33  Mean of sample: 32.42 SD of
sample: 29.8 \n\n Null hypothesis H0 : Sachin Tendulkar 's sample average is within 95% confidence interval
of population average\n Alternative hypothesis Ha : Sachin Tendulkar 's sample average is below
the 95% confidence interval of population average\n\n
Sachin Tendulkar 's Form Status: Out-of-Form because the p value: 0.000713  is less than alpha=  0.05 \n *******************************************************************************************\n\n"
checkBatsmanInForm("./kohli.csv","Kohli")
## [1] "**************************** Form status of Kohli ****************************\n\n Population size: 117
Mean of population: 50.35 \n Sample size: 13  Mean of sample: 53.77 SD of sample: 46.15 \n\n Null
hypothesis H0 : Kohli 's sample average is within 95% confidence interval of population average\n
Alternative hypothesis Ha : Kohli 's sample average is below the 95% confidence interval of population
average\n\n Kohli 's Form Status: In-Form because the p value: 0.603244  is greater than alpha=  0.05 \n *******************************************************************************************\n\n"
checkBatsmanInForm("./gavaskar.csv","Gavaskar")
## [1] "**************************** Form status of Gavaskar ****************************\n\n
Population size: 125  Mean of population: 44.67 \n Sample size: 14  Mean of sample: 57.86 SD of sample:
58.55 \n\n Null hypothesis H0 : Gavaskar 's sample average is within 95% confidence interval of population
average\n Alternative hypothesis Ha : Gavaskar 's sample average is below the 95% confidence interval of
population average\n\n Gavaskar 's Form Status: In-Form because the p value: 0.793276  is greater
than alpha=  0.05 \n *******************************************************************************************\n\n"
#dev.off()

### 20. Performance 3D

A 3D regression plane is fitted between the the Balls faced, Minutes at crease and Runs scored

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
battingPerf3d("./tendulkar.csv","Sachin Tendulkar")
battingPerf3d("./kohli.csv","Kohli")
#dev.off()

### 20. Runs likelihood

This functions computes the K-Means and determines the runs the batsmen are likely to score.

par(mar=c(4,4,2,2))
batsmanRunsLikelihood("./tendulkar.csv","Tendulkar")
## Summary of  Tendulkar 's runs scoring likelihood
## **************************************************
##
## There is a 16.51 % likelihood that Tendulkar  will make  139 Runs in  251 balls over 353  Minutes
## There is a 25.08 % likelihood that Tendulkar  will make  66 Runs in  122 balls over  167  Minutes
## There is a 58.41 % likelihood that Tendulkar  will make  16 Runs in  31 balls over 44  Minutes
batsmanRunsLikelihood("./kohli.csv","Kohli")
## Summary of  Kohli 's runs scoring likelihood
## **************************************************
##
## There is a 20 % likelihood that Kohli  will make  143 Runs in  232 balls over 330  Minutes
## There is a 33.85 % likelihood that Kohli  will make  51 Runs in  92 balls over  127  Minutes
## There is a 46.15 % likelihood that Kohli  will make  11 Runs in  24 balls over 31  Minutes
batsmanRunsLikelihood("./gavaskar.csv","Gavaskar")
## Summary of  Gavaskar 's runs scoring likelihood
## **************************************************
##
## There is a 33.81 % likelihood that Gavaskar  will make  69 Runs in  159 balls over 214  Minutes
## There is a 8.63 % likelihood that Gavaskar  will make  172 Runs in  364 balls over  506  Minutes
## There is a 57.55 % likelihood that Gavaskar  will make  13 Runs in  35 balls over 48  Minutes

### 21. Predict runs for a random combination of Balls faced and runs scored

BF <- seq( 10, 400,length=15)
Mins <- seq(30,600,length=15)
newDF <- data.frame(BF,Mins)
tendulkar <- batsmanRunsPredict("./tendulkar.csv","Tendulkar",newdataframe=newDF)
kohli <- batsmanRunsPredict("./kohli.csv","Kohli",newdataframe=newDF)
2. Practical machine with R and Third Edition – Machine Learning in Stereo(Kindle- $9.99/Rs449) This book is ideal both for beginners and the experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally,those who are proficient in both languages, can use the R and Python implementations to internalize the ML algorithms better. Here is a look at the topics covered Table of Contents Preface …………………………………………………………………………….4 Introduction ………………………………………………………………………6 1. Essential R ………………………………………………………………… 8 2. Essential Python for Datascience ……………………………………………57 3. R vs Python …………………………………………………………………81 4. Regression of a continuous variable ……………………………………….101 5. Classification and Cross Validation ………………………………………..121 6. Regression techniques and regularization ………………………………….146 7. SVMs, Decision Trees and Validation curves ………………………………191 8. Splines, GAMs, Random Forests and Boosting ……………………………222 9. PCA, K-Means and Hierarchical Clustering ………………………………258 References ……………………………………………………………………..269 Pick up your copy today!! Hope you have a great time learning as I did while implementing these algorithms! # The 3rd paperback & kindle editions of my books on Cricket, now on Amazon The 3rd paperback & kindle edition of both my books on cricket is now available on Amazon a) Cricket analytics with cricketr, Third Edition. The paperback edition is$12.99 and the kindle edition is $4.99/Rs320. This book is based on my R package ‘cricketr‘, available on CRAN and uses ESPN Cricinfo Statsguru b) Beaten by sheer pace! Cricket analytics with yorkr, 3rd edition . The paperback is$12.99 and the kindle version is $6.99/Rs448. This is based on my R package ‘yorkr‘ on CRAN and uses data from Cricsheet Pick up your copies today!! Note: In the 3rd edition of the paperback book, the charts will be in black and white. If you would like the charts to be in color, please check out the 2nd edition of these books see More book, more cricket! 2nd edition of my books now on Amazon To see all posts see Index of posts # Practical Machine Learning with R and Python – Part 4 This is the 4th installment of my ‘Practical Machine Learning with R and Python’ series. In this part I discuss classification with Support Vector Machines (SVMs), using both a Linear and a Radial basis kernel, and Decision Trees. Further, a closer look is taken at some of the metrics associated with binary classification, namely accuracy vs precision and recall. I also touch upon Validation curves, Precision-Recall, ROC curves and AUC with equivalent code in R and Python This post is a continuation of my 3 earlier posts on Practical Machine Learning in R and Python 1. Practical Machine Learning with R and Python – Part 1 2. Practical Machine Learning with R and Python – Part 2 3. Practical Machine Learning with R and Python – Part 3 The RMarkdown file with the code and the associated data files can be downloaded from Github at MachineLearning-RandPython-Part4 Note: Please listen to my video presentations Machine Learning in youtube 1. Machine Learning in plain English-Part 1 2. Machine Learning in plain English-Part 2 3. Machine Learning in plain English-Part 3 Check out my compact and minimal book “Practical Machine Learning with R and Python:Third edition- Machine Learning in stereo” available in Amazon in paperback($12.99) and kindle(8.99) versions. My book includes implementations of key ML algorithms and associated measures and metrics. The book is ideal for anybody who is familiar with the concepts and would like a quick reference to the different ML algorithms that can be applied to problems and how to select the best model. Pick your copy today!! Support Vector Machines (SVM) are another useful Machine Learning model that can be used for both regression and classification problems. SVMs used in classification, compute the hyperplane, that separates the 2 classes with the maximum margin. To do this the features may be transformed into a larger multi-dimensional feature space. SVMs can be used with different kernels namely linear, polynomial or radial basis to determine the best fitting model for a given classification problem. In the 2nd part of this series Practical Machine Learning with R and Python – Part 2, I had mentioned the various metrics that are used in classification ML problems namely Accuracy, Precision, Recall and F1 score. Accuracy gives the fraction of data that were correctly classified as belonging to the +ve or -ve class. However ‘accuracy’ in itself is not a good enough measure because it does not take into account the fraction of the data that were incorrectly classified. This issue becomes even more critical in different domains. For e.g a surgeon who would like to detect cancer, would like to err on the side of caution, and classify even a possibly non-cancerous patient as possibly having cancer, rather than mis-classifying a malignancy as benign. Here we would like to increase recall or sensitivity which is given by Recall= TP/(TP+FN) or we try reduce mis-classification by either increasing the (true positives) TP or reducing (false negatives) FN On the other hand, search algorithms would like to increase precision which tries to reduce the number of irrelevant results in the search result. Precision= TP/(TP+FP). In other words we do not want ‘false positives’ or irrelevant results to come in the search results and there is a need to reduce the false positives. When we try to increase ‘precision’, we do so at the cost of ‘recall’, and vice-versa. I found this diagram and explanation in Wikipedia very useful Source: Wikipedia “Consider a brain surgeon tasked with removing a cancerous tumor from a patient’s brain. The surgeon needs to remove all of the tumor cells since any remaining cancer cells will regenerate the tumor. Conversely, the surgeon must not remove healthy brain cells since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of the brain she removes to ensure she has extracted all the cancer cells. This decision increases recall but reduces precision. On the other hand, the surgeon may be more conservative in the brain she removes to ensure she extracts only cancer cells. This decision increases precision but reduces recall. That is to say, greater recall increases the chances of removing healthy cells (negative outcome) and increases the chances of removing all cancer cells (positive outcome). Greater precision decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome).” ## 1.1a. Linear SVM – R code In R code below I use SVM with linear kernel source('RFunctions-1.R') library(dplyr) library(e1071) library(caret) library(reshape2) library(ggplot2) # Read data. Data from SKLearn cancer <- read.csv("cancer.csv") cancertarget <- as.factor(cancer$target) # Split into training and test sets train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5) train <- cancer[train_idx, ] test <- cancer[-train_idx, ] # Fit a linear basis kernel. DO not scale the data svmfit=svm(target~., data=train, kernel="linear",scale=FALSE) ypred=predict(svmfit,test) #Print a confusion matrix confusionMatrix(ypred,test$target)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0 54  3
##          1  3 82
##
##                Accuracy : 0.9577
##                  95% CI : (0.9103, 0.9843)
##     No Information Rate : 0.5986
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.9121
##  Mcnemar's Test P-Value : 1
##
##             Sensitivity : 0.9474
##             Specificity : 0.9647
##          Pos Pred Value : 0.9474
##          Neg Pred Value : 0.9647
##              Prevalence : 0.4014
##          Detection Rate : 0.3803
##    Detection Prevalence : 0.4014
##       Balanced Accuracy : 0.9560
##
##        'Positive' Class : 0
## 

## 1.1b Linear SVM – Python code

The code below creates a SVM with linear basis in Python and also dumps the corresponding classification metrics

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

from sklearn.datasets import make_classification, make_blobs

from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
clf = LinearSVC().fit(X_train, y_train)
print('Breast cancer dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
## Breast cancer dataset
## Accuracy of Linear SVC classifier on training set: 0.92
## Accuracy of Linear SVC classifier on test set: 0.94

## 1.2 Dummy classifier

Often when we perform classification tasks using any ML model namely logistic regression, SVM, neural networks etc. it is very useful to determine how well the ML model performs agains at dummy classifier. A dummy classifier uses some simple computation like frequency of majority class, instead of fitting and ML model. It is essential that our ML model does much better that the dummy classifier. This problem is even more important in imbalanced classes where we have only about 10% of +ve samples. If any ML model we create has a accuracy of about 0.90 then it is evident that our classifier is not doing any better than a dummy classsfier which can just take a majority count of this imbalanced class and also come up with 0.90. We need to be able to do better than that.

In the examples below (1.3a & 1.3b) it can be seen that SVMs with ‘radial basis’ kernel with unnormalized data, for both R and Python, do not perform any better than the dummy classifier.

## 1.2a Dummy classifier – R code

R does not seem to have an explicit dummy classifier. I created a simple dummy classifier that predicts the majority class. SKlearn in Python also includes other strategies like uniform, stratified etc. but this should be possible to create in R also.

# Create a simple dummy classifier that computes the ratio of the majority class to the totla
DummyClassifierAccuracy <- function(train,test,type="majority"){
if(type=="majority"){
count <- sum(train$target==1)/dim(train)[1] } count } cancer <- read.csv("cancer.csv") cancer$target <- as.factor(cancer$target) # Create training and test sets train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5) train <- cancer[train_idx, ] test <- cancer[-train_idx, ] #Dummy classifier majority class acc=DummyClassifierAccuracy(train,test) sprintf("Accuracy is %f",acc) ## [1] "Accuracy is 0.638498" ## 1.2b Dummy classifier – Python code This dummy classifier uses the majority class. import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_split from sklearn.dummy import DummyClassifier from sklearn.metrics import confusion_matrix (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True) X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0) # Negative class (0) is most frequent dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train) y_dummy_predictions = dummy_majority.predict(X_test) print('Dummy classifier accuracy on test set: {:.2f}' .format(dummy_majority.score(X_test, y_test)))  ## Dummy classifier accuracy on test set: 0.63 ## 1.3a – Radial SVM (un-normalized) – R code SVMs perform better when the data is normalized or scaled. The 2 examples below show that SVM with radial basis kernel does not perform any better than the dummy classifier library(dplyr) library(e1071) library(caret) library(reshape2) library(ggplot2) # Radial SVM unnormalized train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5) train <- cancer[train_idx, ] test <- cancer[-train_idx, ] # Unnormalized data svmfit=svm(target~., data=train, kernel="radial",cost=10,scale=FALSE) ypred=predict(svmfit,test) confusionMatrix(ypred,test$target)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0  0  0
##          1 57 85
##
##                Accuracy : 0.5986
##                  95% CI : (0.5131, 0.6799)
##     No Information Rate : 0.5986
##     P-Value [Acc > NIR] : 0.5363
##
##                   Kappa : 0
##  Mcnemar's Test P-Value : 1.195e-13
##
##             Sensitivity : 0.0000
##             Specificity : 1.0000
##          Pos Pred Value :    NaN
##          Neg Pred Value : 0.5986
##              Prevalence : 0.4014
##          Detection Rate : 0.0000
##    Detection Prevalence : 0.0000
##       Balanced Accuracy : 0.5000
##
##        'Positive' Class : 0
## 

## 1.4b – Radial SVM (un-normalized) – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)

clf = SVC(C=10).fit(X_train, y_train)
print('Breast cancer dataset (unnormalized features)')
print('Accuracy of RBF-kernel SVC on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of RBF-kernel SVC on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
## Breast cancer dataset (unnormalized features)
## Accuracy of RBF-kernel SVC on training set: 1.00
## Accuracy of RBF-kernel SVC on test set: 0.63

## 1.5a – Radial SVM (Normalized) -R Code

The data is scaled (normalized ) before using the SVM model. The SVM model has 2 paramaters a) C – Large C (less regularization), more regularization b) gamma – Small gamma has larger decision boundary with more misclassfication, and larger gamma has tighter decision boundary

The R code below computes the accuracy as the regularization paramater is changed

trainingAccuracy <- NULL
testAccuracy <- NULL
C1 <- c(.01,.1, 1, 10, 20)
for(i in  C1){

ypredTrain <-predict(svmfit,train)
ypredTest=predict(svmfit,test)
a <-confusionMatrix(ypredTrain,train$target) b <-confusionMatrix(ypredTest,test$target)
trainingAccuracy <-c(trainingAccuracy,a$overall[1]) testAccuracy <-c(testAccuracy,b$overall[1])

}
print(trainingAccuracy)
##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
## 0.6384977 0.9671362 0.9906103 0.9976526 1.0000000
print(testAccuracy)
##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
## 0.5985915 0.9507042 0.9647887 0.9507042 0.9507042
a <-rbind(C1,as.numeric(trainingAccuracy),as.numeric(testAccuracy))
b <- data.frame(t(a))
names(b) <- c("C1","trainingAccuracy","testAccuracy")
df <- melt(b,id="C1")
ggplot(df) + geom_line(aes(x=C1, y=value, colour=variable),size=2) +
xlab("C (SVC regularization)value") + ylab("Accuracy") +
ggtitle("Training and test accuracy vs C(regularization)")

## 1.5b – Radial SVM (normalized) – Python

The Radial basis kernel is used on normalized data for a range of ‘C’ values and the result is plotted.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('Breast cancer dataset (normalized with MinMax scaling)')
trainingAccuracy=[]
testAccuracy=[]
for C1 in [.01,.1, 1, 10, 20]:
clf = SVC(C=C1).fit(X_train_scaled, y_train)
acctrain=clf.score(X_train_scaled, y_train)
accTest=clf.score(X_test_scaled, y_test)
trainingAccuracy.append(acctrain)
testAccuracy.append(accTest)

# Create a dataframe
C1=[.01,.1, 1, 10, 20]
trainingAccuracy=pd.DataFrame(trainingAccuracy,index=C1)
testAccuracy=pd.DataFrame(testAccuracy,index=C1)

# Plot training and test R squared as a function of alpha
df=pd.concat([trainingAccuracy,testAccuracy],axis=1)
df.columns=['trainingAccuracy','trainingAccuracy']

fig1=df.plot()
fig1=plt.title('Training and test accuracy vs C (SVC)')
fig1.figure.savefig('fig1.png', bbox_inches='tight')
## Breast cancer dataset (normalized with MinMax scaling)

Output image:

## 1.6a Validation curve – R code

Sklearn includes code creating validation curves by varying paramaters and computing and plotting accuracy as gamma or C or changd. I did not find this R but I think this is a useful function and so I have created the R equivalent of this.

# The R equivalent of np.logspace
seqLogSpace <- function(start,stop,len){
a=seq(log10(10^start),log10(10^stop),length=len)
10^a
}

# Read the data. This is taken the SKlearn cancer data
cancer$target <- as.factor(cancer$target)

set.seed(6)

# Create the range of C1 in log space
param_range = seqLogSpace(-3,2,20)
# Initialize the overall training and test accuracy to NULL
overallTrainAccuracy <- NULL
overallTestAccuracy <- NULL

# Loop over the parameter range of Gamma
for(i in param_range){
# Set no of folds
noFolds=5
# Create the rows which fall into different folds from 1..noFolds
folds = sample(1:noFolds, nrow(cancer), replace=TRUE)
# Initialize the training and test accuracy of folds to 0
trainingAccuracy <- 0
testAccuracy <- 0

# Loop through the folds
for(j in 1:noFolds){
# The training is all rows for which the row is != j (k-1 folds -> training)
train <- cancer[folds!=j,]
# The rows which have j as the index become the test set
test <- cancer[folds==j,]
# Create a SVM model for this

# Add up all the fold accuracy for training and test separately
ypredTrain <-predict(svmfit,train)
ypredTest=predict(svmfit,test)

# Create confusion matrix
a <-confusionMatrix(ypredTrain,train$target) b <-confusionMatrix(ypredTest,test$target)
# Get the accuracy
trainingAccuracy <-trainingAccuracy + a$overall[1] testAccuracy <-testAccuracy+b$overall[1]

}
# Compute the average of accuracy for K folds for number of features 'i'
overallTrainAccuracy=c(overallTrainAccuracy,trainingAccuracy/noFolds)
overallTestAccuracy=c(overallTestAccuracy,testAccuracy/noFolds)
}
#Create a dataframe
a <- rbind(param_range,as.numeric(overallTrainAccuracy),
as.numeric(overallTestAccuracy))
b <- data.frame(t(a))
names(b) <- c("C1","trainingAccuracy","testAccuracy")
df <- melt(b,id="C1")
#Plot in log axis
ggplot(df) + geom_line(aes(x=C1, y=value, colour=variable),size=2) +
xlab("C (SVC regularization)value") + ylab("Accuracy") +
ggtitle("Training and test accuracy vs C(regularization)") + scale_x_log10()

## 1.6b Validation curve – Python

Compute and plot the validation curve as gamma is varied.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_cancer)

# Create a gamma values from 10^-3 to 10^2 with 20 equally spaced intervals
param_range = np.logspace(-3, 2, 20)
# Compute the validation curve
train_scores, test_scores = validation_curve(SVC(), X_scaled, y_cancer,
param_name='gamma',
param_range=param_range, cv=10)

#Plot the figure
fig2=plt.figure()

#Compute the mean
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

fig2=plt.title('Validation Curve with SVM')
fig2=plt.xlabel('$\gamma$ (gamma)')
fig2=plt.ylabel('Score')
fig2=plt.ylim(0.0, 1.1)
lw = 2

fig2=plt.semilogx(param_range, train_scores_mean, label='Training score',
color='darkorange', lw=lw)

fig2=plt.fill_between(param_range, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2,
color='darkorange', lw=lw)

fig2=plt.semilogx(param_range, test_scores_mean, label='Cross-validation score',
color='navy', lw=lw)

fig2=plt.fill_between(param_range, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.2,
color='navy', lw=lw)
fig2.figure.savefig('fig2.png', bbox_inches='tight')


Output image:

## 1.7a Validation Curve (Preventing data leakage) – Python code

In this course Applied Machine Learning in Python, the Professor states that when we apply the same data transformation to a entire dataset, it will cause a data leakage. “The proper way to do cross-validation when you need to scale the data is not to scale the entire dataset with a single transform, since this will indirectly leak information into the training data about the whole dataset, including the test data (see the lecture on data leakage later in the course). Instead, scaling/normalizing must be computed and applied for each cross-validation fold separately”

So I apply separate scaling to the training and testing folds and plot. In the lecture the Prof states that this can be done using pipelines.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.cross_validation import  KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# Set the parameter range
param_range = np.logspace(-3, 2, 20)

# Set number of folds
folds=5
#Initialize
overallTrainAccuracy=[]
overallTestAccuracy=[]

# Loop over the paramater range
for c in  param_range:
trainingAccuracy=0
testAccuracy=0
kf = KFold(len(X_cancer),n_folds=folds)
# Partition into training and test folds
for train_index, test_index in kf:
# Partition the data acccording the fold indices generated
X_train, X_test = X_cancer[train_index], X_cancer[test_index]
y_train, y_test = y_cancer[train_index], y_cancer[test_index]

# Scale the X_train and X_test
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Fit a SVC model for each C
clf = SVC(C=c).fit(X_train_scaled, y_train)
#Compute the training and test score
acctrain=clf.score(X_train_scaled, y_train)
accTest=clf.score(X_test_scaled, y_test)
trainingAccuracy += np.sum(acctrain)
testAccuracy += np.sum(accTest)
# Compute the mean training and testing accuracy
overallTrainAccuracy.append(trainingAccuracy/folds)
overallTestAccuracy.append(testAccuracy/folds)

overallTrainAccuracy=pd.DataFrame(overallTrainAccuracy,index=param_range)
overallTestAccuracy=pd.DataFrame(overallTestAccuracy,index=param_range)

# Plot training and test R squared as a function of alpha
df=pd.concat([overallTrainAccuracy,overallTestAccuracy],axis=1)
df.columns=['trainingAccuracy','testAccuracy']

fig3=plt.title('Validation Curve with SVM')
fig3=plt.xlabel('$\gamma$ (gamma)')
fig3=plt.ylabel('Score')
fig3=plt.ylim(0.5, 1.1)
lw = 2

fig3=plt.semilogx(param_range, overallTrainAccuracy, label='Training score',
color='darkorange', lw=lw)

fig3=plt.semilogx(param_range, overallTestAccuracy, label='Cross-validation score',
color='navy', lw=lw)

fig3=plt.legend(loc='best')
fig3.figure.savefig('fig3.png', bbox_inches='tight')


Output image:

## 1.8 a Decision trees – R code

Decision trees in R can be plotted using RPart package

library(rpart)
library(rpart.plot)
rpart = NULL
# Create a decision tree
m <-rpart(Species~.,data=iris)
#Plot
rpart.plot(m,extra=2,main="Decision Tree - IRIS")

## 1.8 b Decision trees – Python code

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
import graphviz

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state = 3)
clf = DecisionTreeClassifier().fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))

dot_data = tree.export_graphviz(clf, out_file=None,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph
## Accuracy of Decision Tree classifier on training set: 1.00
## Accuracy of Decision Tree classifier on test set: 0.97

## 1.9a Feature importance – R code

I found the following code which had a snippet for feature importance. Sklean has a nice method for this. For some reason the results in R and Python are different. Any thoughts?

set.seed(3)
library(mlbench)
library(caret)
cancer$target <- as.factor(cancer$target)
# Split as data
data <- cancer[,1:31]
target <- cancer[,32]

# Train the model
model <- train(data, target, method="rf", preProcess="scale", trControl=trainControl(method = "cv"))
# Compute variable importance
importance <- varImp(model)
# summarize importance
print(importance)
# plot importance
plot(importance)

## 1.9b Feature importance – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
# Use the DecisionTreClassifier
clf = DecisionTreeClassifier(max_depth = 4, min_samples_leaf = 8,
random_state = 0).fit(X_train, y_train)

c_features=len(cancer.feature_names)
print('Breast cancer dataset: decision tree')
print('Accuracy of DT classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of DT classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))

# Plot the feature importances
fig4=plt.figure(figsize=(10,6),dpi=80)

fig4=plt.barh(range(c_features), clf.feature_importances_)
fig4=plt.xlabel("Feature importance")
fig4=plt.ylabel("Feature name")
fig4=plt.yticks(np.arange(c_features), cancer.feature_names)
fig4=plt.tight_layout()
plt.savefig('fig4.png', bbox_inches='tight')

## Breast cancer dataset: decision tree
## Accuracy of DT classifier on training set: 0.96
## Accuracy of DT classifier on test set: 0.94

Output image:

## 1.10a Precision-Recall, ROC curves & AUC- R code

I tried several R packages for plotting the Precision and Recall and AUC curve. PRROC seems to work well. The Precision-Recall curves show the tradeoff between precision and recall. The higher the precision, the lower the recall and vice versa.AUC curves that hug the top left corner indicate a high sensitivity,specificity and an excellent accuracy.

source("RFunctions-1.R")
library(dplyr)
library(caret)
library(e1071)
library(PRROC)
# Read the data (this data is from sklearn!)
digits <- d[2:66]
digits$X64 <- as.factor(digits$X64)

# Split as training and test sets
train_idx <- trainTestSplit(digits,trainPercent=75,seed=5)
train <- digits[train_idx, ]
test <- digits[-train_idx, ]

# Fit a SVM model with linear basis kernel with probabilities
svmfit=svm(X64~., data=train, kernel="linear",scale=FALSE,probability=TRUE)
ypred=predict(svmfit,test,probability=TRUE)
head(attr(ypred,"probabilities"))
##               0            1
## 6  7.395947e-01 2.604053e-01
## 8  9.999998e-01 1.842555e-07
## 12 1.655178e-05 9.999834e-01
## 13 9.649997e-01 3.500032e-02
## 15 9.994849e-01 5.150612e-04
## 16 9.999987e-01 1.280700e-06
# Store the probability of 0s and 1s
m0<-attr(ypred,"probabilities")[,1]
m1<-attr(ypred,"probabilities")[,2]

# Create a dataframe of scores
scores <- data.frame(m1,test$X64) # Class 0 is data points of +ve class (in this case, digit 1) and -ve class (digit 0) #Compute Precision Recall pr <- pr.curve(scores.class0=scores[scores$test.X64=="1",]$m1, scores.class1=scores[scores$test.X64=="0",]\$m1,
curve=T)

# Plot precision-recall curve
plot(pr)

#Plot the ROC curve
roc<-roc.curve(m0, m1,curve=TRUE)
plot(roc)

## 1.10b Precision-Recall, ROC curves & AUC- Python code

For Python Logistic Regression is used to plot Precision Recall, ROC curve and compute AUC

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve, auc
X, y = dataset.data, dataset.target
#Create 2 classes -i) Digit 1 (from digit 1) ii) Digit 0 (from all other digits)
# Make a copy of the target
z= y.copy()
# Replace all non 1's as 0
z[z != 1] = 0

X_train, X_test, y_train, y_test = train_test_split(X, z, random_state=0)
# Fit a LR model
lr = LogisticRegression().fit(X_train, y_train)

#Compute the decision scores
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))

#Show the decision_function scores for first 20 instances
y_score_list

precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]
#Plot
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.axes().set_aspect('equal')
plt.savefig('fig5.png', bbox_inches='tight')

#Compute and plot the ROC
y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.axes()
plt.savefig('fig6.png', bbox_inches='tight')


output

## 1.10c Precision-Recall, ROC curves & AUC- Python code

In the code below classification probabilities are used to compute and plot precision-recall, roc and AUC

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

X, y = dataset.data, dataset.target
# Make a copy of the target
z= y.copy()
# Replace all non 1's as 0
z[z != 1] = 0

X_train, X_test, y_train, y_test = train_test_split(X, z, random_state=0)
svm = LinearSVC()
# Need to use CalibratedClassifierSVC to redict probabilities for lInearSVC
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba_lr = clf.predict_proba(X_test)
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_proba_lr[:,1])
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]
#plt.figure(figsize=(15,15),dpi=80)
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.axes().set_aspect('equal')
plt.savefig('fig7.png', bbox_inches='tight')`