Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8

“Lights, camera and … action – Take 4+!”

This post includes a rework of all the presentations of ‘Elements of Neural Networks and Deep Learning – Parts 1-8’, since my earlier presentations had some missing parts, omissions and occasional errors. So I have re-recorded all the presentations.
This series of presentations does a deep dive into Deep Learning networks, starting from the fundamentals. The equations required for performing learning in an L-layer Deep Learning network are derived in detail, starting from the basics. Further, the presentations also discuss multi-class classification, regularization techniques, and gradient descent optimization methods in deep networks. Finally, the presentations also touch on how Deep Learning networks can be tuned.

The corresponding implementations in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles: Second edition – In vectorized Python, R and Octave’.

1. Elements of Neural Networks and Deep Learning – Part 1
This presentation introduces Neural Networks and Deep Learning. It looks at the history of Neural Networks, Perceptrons and why Deep Learning networks are required, and concludes with a simple toy example of a Neural Network and how it computes. This part also includes a small digression on the basics of Machine Learning and how the algorithm learns from a data set.

2. Elements of Neural Networks and Deep Learning – Part 2
This presentation takes logistic regression as an example and creates an equivalent 2 layer Neural Network. The presentation also takes a look at forward & backward propagation and how the cost is minimized using gradient descent.
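
The idea can be illustrated with a minimal sketch in R. This is not the code from the presentation or the linked post: the function name trainLogistic, the toy data and the hyper-parameter values are all assumptions chosen for the example. It treats logistic regression as a single sigmoid unit and repeats forward propagation, backward propagation and a gradient descent update.

```r
# Minimal sketch: logistic regression as one sigmoid unit (illustrative only)
sigmoid <- function(z) 1 / (1 + exp(-z))

# trainLogistic is a hypothetical helper, not a function from the book or posts.
# X: features x examples matrix, Y: 1 x examples matrix of 0/1 labels
trainLogistic <- function(X, Y, learningRate = 0.5, numIterations = 5000) {
  m <- ncol(X)
  W <- matrix(0, nrow = 1, ncol = nrow(X))
  b <- 0
  costs <- numeric(numIterations)
  for (i in seq_len(numIterations)) {
    # Forward propagation: linear step followed by sigmoid activation
    A <- sigmoid(W %*% X + b)
    costs[i] <- -mean(Y * log(A) + (1 - Y) * log(1 - A))
    # Backward propagation: gradients of the cross-entropy cost
    dZ <- A - Y
    dW <- (dZ %*% t(X)) / m
    db <- sum(dZ) / m
    # Gradient descent update
    W <- W - learningRate * dW
    b <- b - learningRate * db
  }
  list(W = W, b = b, costs = costs)
}

# Made-up toy data: the label is 1 when x1 + x2 > 1
X <- matrix(c(0, 0,  0, 1,  1, 0,  1, 1,  0.2, 0.1,  0.9, 0.8), nrow = 2)
Y <- matrix(c(0, 0, 0, 1, 0, 1), nrow = 1)

model <- trainLogistic(X, Y)
predictions <- as.numeric(sigmoid(model$W %*% X + model$b) > 0.5)
```

Plotting model$costs against the iteration number is a quick way to check that the cost falls as gradient descent proceeds.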


The implementation of the discussed 2 layer Neural Network in vectorized R, Python and Octave is available in my post ‘Deep Learning from first principles in Python, R and Octave – Part 1’.

3. Elements of Neural Networks and Deep Learning – Part 3
This 3rd part discusses a primitive neural network with an input layer, output layer and a hidden layer. The neural network uses tanh activation in the hidden layer and a sigmoid activation in the output layer. The equations for forward and backward propagation are derived.


To see the implementations for the video discussed above, see my post ‘Deep Learning from first principles in Python, R and Octave – Part 2’.

4. Elements of Neural Network and Deep Learning – Part 4
This presentation is a continuation of my 3rd presentation in which I derived the equations for a simple 3 layer Neural Network with 1 hidden layer. In this video presentation, I discuss step-by-step the derivations for an L-Layer, multi-unit Deep Learning Network, with any activation function g(z).


The implementations of L-Layer, multi-unit Deep Learning Network in vectorized R, Python and Octave are available in my post Deep Learning from first principles in Python, R and Octave – Part 3

5. Elements of Neural Network and Deep Learning – Part 5
This presentation discusses multi-class classification using the Softmax function. The detailed derivation for the Jacobian of the Softmax is discussed, and subsequently the derivative of cross-entropy loss is also discussed in detail. Finally, the complete set of equations for a Neural Network with multi-class classification is derived.
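
As a small sketch of where that derivation ends up (not the code from the linked posts), the snippet below computes the Softmax activations and the cross-entropy loss, and uses the standard result that the gradient of the loss with respect to the pre-activations reduces to A - Y when Y is one-hot. The function names softmax and crossEntropy and the numbers in Z and Y are assumptions for the example.

```r
# Softmax over the rows (classes) of each column (example); illustrative sketch
softmax <- function(Z) {
  # Subtract the column maximum first for numerical stability
  Zs <- sweep(Z, 2, apply(Z, 2, max))
  E <- exp(Zs)
  sweep(E, 2, colSums(E), "/")
}

# Mean cross-entropy loss over the examples; Y holds one-hot labels
crossEntropy <- function(A, Y) -mean(colSums(Y * log(A)))

# Made-up scores: 3 classes (rows) x 2 examples (columns)
Z <- matrix(c(2, 1, 0.1,  0.5, 2.5, 0.3), nrow = 3)
Y <- matrix(c(1, 0, 0,  0, 1, 0), nrow = 3)

A  <- softmax(Z)
dZ <- A - Y        # per-example gradient of the loss w.r.t. Z
loss <- crossEntropy(A, Y)
```

Dividing dZ by the number of examples gives the gradient of the mean loss, which can be checked against a finite-difference estimate.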


The corresponding implementations in vectorized R, Python and Octave are available in the following posts
a. Deep Learning from first principles in Python, R and Octave – Part 4
b. Deep Learning from first principles in Python, R and Octave – Part 5

6. Elements of Neural Networks and Deep Learning – Part 6
This part discusses initialization methods, specifically He and Xavier initialization. The presentation also focuses on how to prevent over-fitting using regularization. Lastly, the dropout method of regularization is also discussed.
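
A rough sketch of the two initialization schemes follows. It is not the implementation from the linked post: initializeWeights is a hypothetical helper, and the Xavier scaling shown, sqrt(1/fanIn), is one common variant (another uses both fan-in and fan-out). He initialization scales by sqrt(2/fanIn).

```r
# Hypothetical helper: random weights scaled by He or Xavier factors
# layerDims: number of units per layer, starting with the input layer
initializeWeights <- function(layerDims, method = c("he", "xavier")) {
  method <- match.arg(method)
  params <- list()
  for (l in 2:length(layerDims)) {
    fanIn  <- layerDims[l - 1]
    fanOut <- layerDims[l]
    # He: sqrt(2 / fanIn); Xavier (one common variant): sqrt(1 / fanIn)
    scale <- if (method == "he") sqrt(2 / fanIn) else sqrt(1 / fanIn)
    params[[paste0("W", l - 1)]] <-
      matrix(rnorm(fanOut * fanIn), nrow = fanOut, ncol = fanIn) * scale
    params[[paste0("b", l - 1)]] <- matrix(0, nrow = fanOut, ncol = 1)
  }
  params
}

set.seed(1)
heParams     <- initializeWeights(c(4, 5, 3, 1), "he")
xavierParams <- initializeWeights(c(4, 5, 3, 1), "xavier")
```

The smaller the fan-in of a layer, the larger the spread of its initial weights under both schemes.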


The corresponding implementations in vectorized R, Python and Octave of the above discussed methods are available in my post Deep Learning from first principles in Python, R and Octave – Part 6

7. Elements of Neural Networks and Deep Learning – Part 7
This presentation introduces the exponentially weighted moving average and shows how it is used in different approaches to gradient descent optimization. The key techniques discussed are learning rate decay, the momentum method, RMSProp and Adam.
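
To make the connection concrete, here is a minimal sketch, assuming the usual recurrence v = beta * v + (1 - beta) * x with bias correction. The functions ewma and momentumDescent, the toy gradient and the parameter values are assumptions for illustration and are not taken from the presentation.

```r
# Exponentially weighted moving average with bias correction (sketch)
ewma <- function(x, beta = 0.9) {
  v <- 0
  out <- numeric(length(x))
  for (t in seq_along(x)) {
    v <- beta * v + (1 - beta) * x[t]
    out[t] <- v / (1 - beta^t)   # bias correction for the early steps
  }
  out
}

# Gradient descent with momentum: the update follows the moving
# average of the gradients rather than the raw gradient
momentumDescent <- function(grad, w0, learningRate = 0.1, beta = 0.9, steps = 200) {
  w <- w0
  v <- 0
  for (i in seq_len(steps)) {
    v <- beta * v + (1 - beta) * grad(w)
    w <- w - learningRate * v
  }
  w
}

smoothed <- ewma(c(5, 5, 5, 5, 5))
# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
wMin <- momentumDescent(function(w) 2 * (w - 3), w0 = 0)
```

RMSProp and Adam reuse the same moving-average recurrence, applied to the squared gradients as well.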

The equivalent implementations of the gradient descent optimization techniques in R, Python and Octave can be seen in my post Deep Learning from first principles in Python, R and Octave – Part 7

8. Elements of Neural Networks and Deep Learning – Part 8
This last part touches on the method to adopt while tuning hyper-parameters in Deep Learning networks.

Check out my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as a paperback ($18.99) and in a Kindle version ($9.99/Rs449).

This concludes this series of presentations on ‘Elements of Neural Networks and Deep Learning’.

Also
1. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
2. Introducing cricpy:A python package to analyze performances of cricketers
3. Natural language processing: What would Shakespeare say?
4. Big Data-2: Move into the big league:Graduate from R to SparkR
5. Presentation on Wireless Technologies – Part 1
6. Introducing cricketr! : An R package to analyze performances of cricketers

To see all posts click Index of posts

cricketr sizes up legendary All-rounders of yesteryear

Introduction

This is a post I have been wanting to write for several months, but had to put it off for one reason or another. In this post I use my R package cricketr to analyze the performance of All-rounder greats namely Kapil Dev, Ian Botham, Imran Khan and Richard Hadlee. All these players had talent that was natural and raw. They were good strikers of the ball and extremely lethal with their bowling. The ODI data for these players have been taken from ESPN Cricinfo.

Please be mindful of the ESPN Cricinfo Terms of Use

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of performances of both batsmen, bowlers besides evaluating team & match performances in Tests , ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the kindle versions. The books can be accessed at Cricket analytics with cricketr  and Beaten by sheer pace-Cricket analytics with yorkr  A must read for any cricket lover! Check it out!!


You can also read this post at Rpubs as cricketr-AR. Download this report as a PDF file from cricketr-AR

Note: If you would like to do a similar analysis for a different set of batsmen and bowlers, you can clone/download my skeleton cricketr template from Github (which is the R Markdown file I have used for the analysis below). You will only need to make appropriate changes for the players you are interested in. Only a familiarity with R and R Markdown is needed.

Important note: Do check out my other posts using cricketr at cricketr-posts

All Rounders

  1. Kapil Dev (Ind)
  2. Ian Botham (Eng)
  3. Imran Khan (Pak)
  4. Richard Hadlee (NZ)

I have sprinkled the plots with a few of my comments. Feel free to draw your conclusions! The analysis is included below

if (!require("cricketr")){ 
    install.packages("cricketr") 
} 

library(cricketr)

The data for any particular ODI player can be obtained with the getPlayerDataOD() function. To do this, go to ESPN CricInfo Player and type in the name of the player, e.g. Kapil Dev. This will bring up a page which has the profile number for the player, e.g. for Kapil Dev this would be http://www.espncricinfo.com/india/content/player/30028.html. Hence, Kapil’s profile is 30028. This can be used to get the data for Kapil Dev as shown below. I have already executed the below 4 commands and I will use the files to run further commands.

#kapil1 
#botham11 
#imran1 
#hadlee1 
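
As a small aid, the profile number can be pulled out of the player URL with a regular expression. This is only a sketch: profileFromUrl is a hypothetical helper and is not part of cricketr. The commented call mirrors the form of the getPlayerDataOD() call shown for Sehwag in the ODI post further down, with the directory and file name assumed.

```r
# Hypothetical helper: extract the numeric profile id from a Cricinfo player URL
profileFromUrl <- function(url) {
  as.integer(sub(".*/player/([0-9]+)\\.html$", "\\1", url))
}

kapilProfile <- profileFromUrl(
  "http://www.espncricinfo.com/india/content/player/30028.html"
)

# Assumed usage (needs cricketr and network access, so it is left commented):
# kapil <- getPlayerDataOD(kapilProfile, dir=".", file="kapil1.csv", type="batting")
```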

Analyses of batting performances of the All Rounders

The following plots give the analysis of the 4 ODI batsmen

  1. Kapil Dev (Ind) – Innings = 225, Runs = 3783, Average = 23.79, Strike Rate = 95.07
  2. Ian Botham (Eng) – Innings = 116, Runs = 2113, Average = 23.21, Strike Rate = 79.10
  3. Imran Khan (Pak) – Innings = 175, Runs = 3709, Average = 33.41, Strike Rate = 72.65
  4. Richard Hadlee (NZ) – Innings = 115, Runs = 1751, Average = 21.61, Strike Rate = 75.50

Plot of 4s, 6s and the scoring rate in ODIs

The 3 charts below give the number of

  1. 4s vs Runs scored
  2. 6s vs Runs scored
  3. Balls faced vs Runs scored

A regression line is fitted in each of these plots for each of the ODI batsmen
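
For readers who want to see what fitting such a line involves, here is a minimal sketch using base R's lm() and predict(). The innings data frame holds made-up numbers, not Cricinfo data, and nothing here is claimed to be how cricketr draws its plots internally.

```r
# Made-up innings for illustration only (not real player data)
innings <- data.frame(
  BF   = c(5, 12, 20, 33, 41, 58, 64, 80),   # balls faced
  Runs = c(3, 10, 15, 30, 33, 52, 50, 71)    # runs scored
)

# Fit a straight line of Runs against Balls Faced
fit <- lm(Runs ~ BF, data = innings)

# Read off the expected runs after facing 50 deliveries
runsAt50 <- predict(fit, newdata = data.frame(BF = 50))

# To draw the scatter plot with the fitted line:
# plot(innings$BF, innings$Runs, xlab = "Balls faced", ylab = "Runs")
# abline(fit)
```

Statements such as "after facing 50 deliveries he scores around 43" correspond to reading the fitted line at BF = 50.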

A. Kapil Dev
It can be seen that Kapil scores four 4’s when he scores 50. Also after facing 50 deliveries he scores around 43

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./kapil1.csv","Kapil")
batsman6s("./kapil1.csv","Kapil")
batsmanScoringRateODTT("./kapil1.csv","Kapil")

dev.off()
## null device 
##           1

B. Ian Botham
Botham scores around 39 runs after 50 deliveries

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./botham1.csv","Botham")
batsman6s("./botham1.csv","Botham")
batsmanScoringRateODTT("./botham1.csv","Botham")

dev.off()
## null device 
##           1

C. Imran Khan
Imran scores around 36 runs for 50 deliveries

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./imran1.csv","Imran")
batsman6s("./imran1.csv","Imran")
batsmanScoringRateODTT("./imran1.csv","Imran")

dev.off()
## null device 
##           1

D. Richard Hadlee
Hadlee also scores around 30 runs facing 50 deliveries

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./hadlee1.csv","Hadlee")
batsman6s("./hadlee1.csv","Hadlee")
batsmanScoringRateODTT("./hadlee1.csv","Hadlee")

dev.off()
## null device 
##           1

Cumulative Average runs of batsman in career

Kapil’s cumulative average runs drops towards the last 15 innings, whereas Botham had a good run towards the end of his career. Imran’s performance as a batsman really peaks towards the end with a cumulative average of almost 25 runs. Hadlee has a steady performance.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanCumulativeAverageRuns("./kapil1.csv","Kapil")
batsmanCumulativeAverageRuns("./botham1.csv","Botham")
batsmanCumulativeAverageRuns("./imran1.csv","Imran")
batsmanCumulativeAverageRuns("./hadlee1.csv","Hadlee")

dev.off()
## null device 
##           1
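
The quantity being plotted can be sketched in a couple of lines of base R, assuming a plain running mean of runs per innings (the package may treat not-outs or missing innings differently). The runs and balls vectors are made-up numbers. The same running idea gives a cumulative strike rate.

```r
# Made-up innings-by-innings figures (illustrative only)
runs  <- c(12, 45, 0, 30, 8)
balls <- c(15, 40, 3, 35, 12)

# Running mean of runs after each innings
cumulativeAverageRuns <- cumsum(runs) / seq_along(runs)

# Running strike rate: total runs per 100 balls faced so far
cumulativeStrikeRate <- 100 * cumsum(runs) / cumsum(balls)
```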

Cumulative Average strike rate of batsman in career

Kapil’s strike rate is superlative, touching the 90’s steadily. Botham’s strike rate drops dramatically towards the latter part of his career. Imran averages a steady 75 and Hadlee averages around 85.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanCumulativeStrikeRate("./kapil1.csv","Kapil")
batsmanCumulativeStrikeRate("./botham1.csv","Botham")
batsmanCumulativeStrikeRate("./imran1.csv","Imran")
batsmanCumulativeStrikeRate("./hadlee1.csv","Hadlee")

dev.off()
## null device 
##           1

Relative Mean Strike Rate

Kapil tops the strike rate among all the all-rounders. This is really a revelation to me. This can also be seen in the original data, where Kapil’s strike rate is at a whopping 95.07 in comparison to Botham, Imran and Hadlee, who are at 79.10, 72.65 and 75.50 respectively.

par(mar=c(4,4,2,2))
frames <- list("./kapil1.csv","./botham1.csv","imran1.csv","hadlee1.csv")
names <- list("Kapil","Botham","Imran","Hadlee")
relativeBatsmanSRODTT(frames,names)

Relative Runs Frequency Percentage

This plot shows that Imran has a much better average runs scored over the other all rounders followed by Kapil

frames <- list("./kapil1.csv","./botham1.csv","imran1.csv","hadlee1.csv")
names <- list("Kapil","Botham","Imran","Hadlee")
relativeRunsFreqPerfODTT(frames,names)

Relative cumulative average runs in career

It can be seen clearly that Imran Khan leads the pack in cumulative average runs followed by Kapil Dev and then Botham

frames <- list("./kapil1.csv","./botham1.csv","imran1.csv","hadlee1.csv")
names <- list("Kapil","Botham","Imran","Hadlee")
relativeBatsmanCumulativeAvgRuns(frames,names)

Relative cumulative average strike rate in career

In the cumulative strike rate Hadlee and Kapil run a close race.

frames <- list("./kapil1.csv","./botham1.csv","imran1.csv","hadlee1.csv")
names <- list("Kapil","Botham","Imran","Hadlee")
relativeBatsmanCumulativeStrikeRate(frames,names)

Percent 4’s,6’s in total runs scored

The plot below shows the contribution of runs from 1s, 2s and 3s, from 4s and from 6s to the total runs scored, as a percentage.

frames <- list("./kapil1.csv","./botham1.csv","imran1.csv","hadlee1.csv")
names <- list("Kapil","Botham","Imran","Hadlee")
runs4s6s <-batsman4s6s(frames,names)

print(runs4s6s)
##                Kapil Botham Imran Hadlee
## Runs(1s,2s,3s) 72.08  66.53 77.53  73.27
## 4s             21.98  25.78 17.61  21.08
## 6s              5.94   7.68  4.86   5.65

Runs forecast

The forecast for the batsman is shown below.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfForecast("./kapil1.csv","Kapil")
batsmanPerfForecast("./botham1.csv","Botham")
batsmanPerfForecast("./imran1.csv","Imran")
batsmanPerfForecast("./hadlee1.csv","Hadlee")

dev.off()
## null device 
##           1

3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./kapil1.csv","Kapil")
battingPerf3d("./botham1.csv","Botham")

dev.off()
## null device 
##           1
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./imran1.csv","Imran")
battingPerf3d("./hadlee1.csv","Hadlee")

dev.off()
## null device 
##           1

Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

BF <- seq( 10, 200,length=10)
Mins <- seq(30,220,length=10)
newDF <- data.frame(BF,Mins)

kapil <- batsmanRunsPredict("./kapil1.csv","Kapil",newdataframe=newDF)
botham <- batsmanRunsPredict("./botham1.csv","Botham",newdataframe=newDF)
imran <- batsmanRunsPredict("./imran1.csv","Imran",newdataframe=newDF)
hadlee <- batsmanRunsPredict("./hadlee1.csv","Hadlee",newdataframe=newDF)

The fitted model is then used to predict the runs that the batsmen will score for hypothetical values of Balls Faced and Minutes at Crease. It can be seen that Kapil is the best bet for a given Balls Faced and Minutes at Crease, followed by Botham.

batsmen <-cbind(round(kapil$Runs),round(botham$Runs),round(imran$Runs),round(hadlee$Runs))
colnames(batsmen) <- c("Kapil","Botham","Imran","Hadlee")
newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
colnames(newDF) <- c("BallsFaced","MinsAtCrease")
predictedRuns <- cbind(newDF,batsmen)
predictedRuns
##    BallsFaced MinsAtCrease Kapil Botham Imran Hadlee
## 1          10           30    16      6    10     15
## 2          31           51    33     22    22     28
## 3          52           72    49     38    33     42
## 4          73           93    65     54    45     56
## 5          94          114    81     70    56     70
## 6         116          136    97     86    67     84
## 7         137          157   113    102    79     97
## 8         158          178   130    117    90    111
## 9         179          199   146    133   102    125
## 10        200          220   162    149   113    139
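
The underlying technique can be sketched with base R alone. This is an illustration under assumptions, not the internals of batsmanRunsPredict(): the innings data frame is randomly generated toy data, and lm() with two predictors stands in for the fitted regression plane.

```r
set.seed(42)  # reproducible toy data

# Randomly generated innings (illustrative, not real player data)
innings <- data.frame(BF = round(runif(40, 5, 150)))
innings$Mins <- round(innings$BF * 1.3 + runif(40, 0, 20))
innings$Runs <- round(0.8 * innings$BF + 0.05 * innings$Mins + rnorm(40, sd = 5))

# Fit the regression plane Runs ~ BF + Mins
fit <- lm(Runs ~ BF + Mins, data = innings)

# Predict runs for the same hypothetical grid used above
newDF <- data.frame(
  BF   = seq(10, 200, length.out = 10),
  Mins = seq(30, 220, length.out = 10)
)
predicted <- predict(fit, newdata = newDF)
predictedRuns <- cbind(round(newDF), Runs = round(predicted))
```

Each row of predictedRuns pairs a Balls Faced and Minutes at Crease combination with the runs the fitted plane predicts for it.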

Highest runs likelihood

The plots below give the runs likelihood of each batsman. This uses K-Means.

A. Kapil Dev

batsmanRunsLikelihood("./kapil1.csv","Kapil")

## Summary of  Kapil 's runs scoring likelihood
## **************************************************
## 
## There is a 34.57 % likelihood that Kapil  will make  22 Runs in  24 balls over 34  Minutes 
## There is a 17.28 % likelihood that Kapil  will make  46 Runs in  46 balls over  65  Minutes 
## There is a 48.15 % likelihood that Kapil  will make  5 Runs in  7 balls over 9  Minutes

B. Ian Botham

batsmanRunsLikelihood("./botham1.csv","Botham")

## Summary of  Botham 's runs scoring likelihood
## **************************************************
## 
## There is a 47.95 % likelihood that Botham  will make  9 Runs in  12 balls over 15  Minutes 
## There is a 39.73 % likelihood that Botham  will make  23 Runs in  32 balls over  44  Minutes 
## There is a 12.33 % likelihood that Botham  will make  59 Runs in  74 balls over 101  Minutes

C. Imran Khan

batsmanRunsLikelihood("./imran1.csv","Imran")

## Summary of  Imran 's runs scoring likelihood
## **************************************************
## 
## There is a 23.33 % likelihood that Imran  will make  36 Runs in  54 balls over 74  Minutes 
## There is a 60 % likelihood that Imran  will make  14 Runs in  18 balls over  23  Minutes 
## There is a 16.67 % likelihood that Imran  will make  53 Runs in  90 balls over 115  Minutes

D. Richard Hadlee

batsmanRunsLikelihood("./hadlee1.csv","Hadlee")

## Summary of  Hadlee 's runs scoring likelihood
## **************************************************
## 
## There is a 6.1 % likelihood that Hadlee  will make  64 Runs in  79 balls over 90  Minutes 
## There is a 42.68 % likelihood that Hadlee  will make  25 Runs in  33 balls over  44  Minutes 
## There is a 51.22 % likelihood that Hadlee  will make  9 Runs in  11 balls over 15  Minutes
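
A summary of this shape can be sketched with base R's kmeans(). The reading assumed here, inferred only from the printed output above, is that innings are clustered on Runs, Balls Faced and Minutes into 3 groups, with the cluster centres reported alongside the share of innings in each cluster. The data is randomly generated and the variable names are illustrative, so this is not a description of the package's internals.

```r
set.seed(7)  # reproducible toy data and cluster starts

# Made-up innings: loose groups of short, medium and long stays at the crease
runs <- c(rpois(30, 6), rpois(20, 25), rpois(10, 55))
innings <- data.frame(
  Runs = runs,
  BF   = round(runs * 1.2) + rpois(60, 3),
  Mins = round(runs * 1.6) + rpois(60, 4)
)

# Cluster the innings into 3 groups
clusters <- kmeans(innings, centers = 3, nstart = 20)

# Share of innings in each cluster, as a percentage
likelihood <- round(100 * clusters$size / nrow(innings), 2)

# One row per cluster: typical Runs, BF, Mins and the likelihood
summaryDF <- data.frame(round(clusters$centers), Likelihood = likelihood)
print(summaryDF)
```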

Average runs at ground and against opposition

A. Kapil Dev

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./kapil1.csv","Kapil")
batsmanAvgRunsOpposition("./kapil1.csv","Kapil")

dev.off()
## null device 
##           1

B. Ian Botham

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./botham1.csv","Botham")
batsmanAvgRunsOpposition("./botham1.csv","Botham")

dev.off()
## null device 
##           1

C. Imran Khan

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./imran1.csv","Imran")
batsmanAvgRunsOpposition("./imran1.csv","Imran")

dev.off()
## null device 
##           1

D. Richard Hadlee

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./hadlee1.csv","Hadlee")
batsmanAvgRunsOpposition("./hadlee1.csv","Hadlee")

dev.off()
## null device 
##           1

Moving Average of runs over career

The moving average for the 4 batsmen indicates the following:

Kapil’s performance drops significantly while there is a slump in Botham’s performance. On the other hand, Imran’s and Hadlee’s performances were on the upswing.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanMovingAverage("./kapil1.csv","Kapil")
batsmanMovingAverage("./botham1.csv","Botham")
batsmanMovingAverage("./imran1.csv","Imran")
batsmanMovingAverage("./hadlee1.csv","Hadlee")

dev.off()
## null device 
##           1
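
For intuition, a simple trailing moving average can be computed with stats::filter(), as sketched below. The package may well use a different smoother for its plots, so this is an assumption-laden illustration. movingAverage is a hypothetical helper and the runs vector is made up.

```r
# Hypothetical helper: trailing moving average over the last `window` innings
movingAverage <- function(x, window = 3) {
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 1))
}

runs <- c(10, 20, 30, 40, 50, 60)   # made-up scores
ma <- movingAverage(runs)           # first window - 1 values are NA
```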

Check batsmen in-form, out-of-form

**************************** Form status of Kapil ****************************

Population size: 72
Mean of population: 19.38
Sample size: 9 Mean of sample: 6.78 SD of sample: 6.14

Null hypothesis H0 : Kapil's sample average is within 95% confidence interval of population average
Alternative hypothesis Ha : Kapil's sample average is below the 95% confidence interval of population average

Kapil's Form Status: Out-of-Form because the p value: 8.4e-05 is less than alpha= 0.05

**************************** Form status of Botham ****************************

Population size: 65
Mean of population: 21.29
Sample size: 8 Mean of sample: 15.38 SD of sample: 13.19

Null hypothesis H0 : Botham's sample average is within 95% confidence interval of population average
Alternative hypothesis Ha : Botham's sample average is below the 95% confidence interval of population average

Botham's Form Status: In-Form because the p value: 0.120342 is greater than alpha= 0.05

**************************** Form status of Imran ****************************

Population size: 54
Mean of population: 24.94
Sample size: 6 Mean of sample: 30.83 SD of sample: 25.4

Null hypothesis H0 : Imran's sample average is within 95% confidence interval of population average
Alternative hypothesis Ha : Imran's sample average is below the 95% confidence interval of population average

Imran's Form Status: In-Form because the p value: 0.704683 is greater than alpha= 0.05

**************************** Form status of Hadlee ****************************

Population size: 73
Mean of population: 18
Sample size: 9 Mean of sample: 27 SD of sample: 24.27

Null hypothesis H0 : Hadlee's sample average is within 95% confidence interval of population average
Alternative hypothesis Ha : Hadlee's sample average is below the 95% confidence interval of population average

Hadlee's Form Status: In-Form because the p value: 0.85262 is greater than alpha= 0.05
*******************************************************************************************
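
The decision rule in this output can be approximated from the printed summary statistics with a one-sided, lower-tail t-test. This is a sketch under that assumption: the package's exact test may differ, so the p values need not match the reported ones digit for digit. formPValue and formStatus are hypothetical helpers.

```r
# Lower-tail p value for "sample mean is below the population mean"
# computed from summary statistics (assumed one-sample t-test)
formPValue <- function(popMean, sampleMean, sampleSD, sampleSize) {
  tStat <- (sampleMean - popMean) / (sampleSD / sqrt(sampleSize))
  pt(tStat, df = sampleSize - 1)
}

# Out-of-Form when the p value falls below alpha
formStatus <- function(p, alpha = 0.05) {
  if (p < alpha) "Out-of-Form" else "In-Form"
}

# Summary statistics copied from the output above
pKapil  <- formPValue(popMean = 19.38, sampleMean = 6.78,  sampleSD = 6.14,  sampleSize = 9)
pBotham <- formPValue(popMean = 21.29, sampleMean = 15.38, sampleSD = 13.19, sampleSize = 8)

statusKapil  <- formStatus(pKapil)
statusBotham <- formStatus(pBotham)
```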

Analyses of bowling performances of the All Rounders

The following plots give the analysis of the 4 ODI bowlers

  1. Kapil Dev (Ind) – Innings = 225, Wickets = 253, Average = 27.45, Economy Rate = 3.71
  2. Ian Botham (Eng) – Innings = 116, Wickets = 145, Average = 28.54, Economy Rate = 3.96
  3. Imran Khan (Pak) – Innings = 175, Wickets = 182, Average = 26.61, Economy Rate = 3.89
  4. Richard Hadlee (NZ) – Innings = 115, Wickets = 158, Average = 21.56, Economy Rate = 3.30

Kapil has the highest number of innings and wickets, followed by Imran. Botham and Hadlee have relatively fewer innings.

To get the bowler’s data use

#kapil2 
#botham2 
#imran2 
#hadlee2 


Wicket Frequency percentage

This plot gives the frequency of each wicket haul (1, 2, 3, etc.) as a percentage.

par(mfrow=c(1,4))
par(mar=c(4,4,2,2))
bowlerWktsFreqPercent("./kapil2.csv","Kapil")
bowlerWktsFreqPercent("./botham2.csv","Botham")
bowlerWktsFreqPercent("./imran2.csv","Imran")
bowlerWktsFreqPercent("./hadlee2.csv","Hadlee")

dev.off()
## null device 
##           1
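
The percentages behind such a plot can be sketched with table() and prop.table() in base R. The wickets vector is made up, and the snippet illustrates the calculation rather than the package's own code.

```r
# Made-up wickets taken per innings (illustrative only)
wickets <- c(0, 1, 1, 2, 3, 1, 0, 2, 4, 1)

# Percentage of innings for each wicket count
wicketFreqPercent <- round(100 * prop.table(table(wickets)), 2)

# To draw it as a bar chart:
# barplot(wicketFreqPercent, xlab = "Wickets", ylab = "Percentage of innings")
```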

Wickets Runs plot

The plot below gives a boxplot of the runs ranges for each of the wickets taken by the bowlers.

par(mfrow=c(1,4))
par(mar=c(4,4,2,2))

bowlerWktsRunsPlot("./kapil2.csv","Kapil")
bowlerWktsRunsPlot("./botham2.csv","Botham")
bowlerWktsRunsPlot("./imran2.csv","Imran")
bowlerWktsRunsPlot("./hadlee2.csv","Hadlee")

dev.off()
## null device 
##           1

Cumulative average wicket plot

Botham has the best cumulative average wicket touching almost 1.6 wickets followed by Hadlee

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
bowlerCumulativeAvgWickets("./kapil2.csv","Kapil")
bowlerCumulativeAvgWickets("./botham2.csv","Botham")
bowlerCumulativeAvgWickets("./imran2.csv","Imran")
bowlerCumulativeAvgWickets("./hadlee2.csv","Hadlee")

dev.off()
## null device 
##           1
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
bowlerCumulativeAvgEconRate("./kapil2.csv","Kapil")
bowlerCumulativeAvgEconRate("./botham2.csv","Botham")
bowlerCumulativeAvgEconRate("./imran2.csv","Imran")
bowlerCumulativeAvgEconRate("./hadlee2.csv","Hadlee")

dev.off()
## null device 
##           1

Average wickets in different grounds and opposition

A. Kapil Dev

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./kapil2.csv","Kapil")
bowlerAvgWktsOpposition("./kapil2.csv","Kapil")

dev.off()
## null device 
##           1

B. Ian Botham

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./botham2.csv","Botham")
bowlerAvgWktsOpposition("./botham2.csv","Botham")

dev.off()
## null device 
##           1

C. Imran Khan

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./imran2.csv","Imran")
bowlerAvgWktsOpposition("./imran2.csv","Imran")

dev.off()
## null device 
##           1

D. Richard Hadlee

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./hadlee2.csv","Hadlee")
bowlerAvgWktsOpposition("./hadlee2.csv","Hadlee")

dev.off()
## null device 
##           1

Relative bowling performance

It can be seen that Botham is the most effective wicket taker of the lot

frames <- list("./kapil2.csv","./botham2.csv","imran2.csv","hadlee2.csv")
names <- list("Kapil","Botham","Imran","Hadlee")
relativeBowlingPerf(frames,names)

Relative Economy Rate against wickets taken

Hadlee has the best overall economy rate followed by Kapil Dev

frames <- list("./kapil2.csv","./botham2.csv","imran2.csv","hadlee2.csv")
names <- list("Kapil","Botham","Imran","Hadlee")
relativeBowlingERODTT(frames,names)

Relative cumulative average wickets of bowlers in career

This plot confirms the wicket taking ability of Botham followed by Hadlee

frames <- list("./kapil2.csv","./botham2.csv","imran2.csv","hadlee2.csv")
names <- list("Kapil","Botham","Imran","Hadlee")
relativeBowlerCumulativeAvgWickets(frames,names)

Relative cumulative average economy rate of bowlers

This plot shows that Hadlee has the best economy rate followed by Kapil.

frames <- list("./kapil2.csv","./botham2.csv","imran2.csv","hadlee2.csv")
names <- list("Kapil","Botham","Imran","Hadlee")
relativeBowlerCumulativeAvgEconRate(frames,names)

Moving average of wickets over career

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
bowlerMovingAverage("./kapil2.csv","Kapil")
bowlerMovingAverage("./botham2.csv","Botham")
bowlerMovingAverage("./imran2.csv","Imran")
bowlerMovingAverage("./hadlee2.csv","Hadlee")

dev.off()
## null device 
##           1

Wickets forecast

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
bowlerPerfForecast("./kapil2.csv","Kapil")
bowlerPerfForecast("./botham2.csv","Botham")
bowlerPerfForecast("./imran2.csv","Imran")
bowlerPerfForecast("./hadlee2.csv","Hadlee")

dev.off()
## null device 
##           1

Check bowler in-form, out-of-form

**************************** Form status of Kapil ****************************

Population size: 198
Mean of population: 1.2
Sample size: 23 Mean of sample: 0.65 SD of sample: 0.83

Null hypothesis H0 : Kapil's sample average is within 95% confidence interval of population average
Alternative hypothesis Ha : Kapil's sample average is below the 95% confidence interval of population average

Kapil's Form Status: Out-of-Form because the p value: 0.002097 is less than alpha= 0.05

**************************** Form status of Botham ****************************

Population size: 166
Mean of population: 1.58
Sample size: 19 Mean of sample: 1.47 SD of sample: 1.12

Null hypothesis H0 : Botham's sample average is within 95% confidence interval of population average
Alternative hypothesis Ha : Botham's sample average is below the 95% confidence interval of population average

Botham's Form Status: In-Form because the p value: 0.336694 is greater than alpha= 0.05

**************************** Form status of Imran ****************************

Population size: 137
Mean of population: 1.23
Sample size: 16 Mean of sample: 0.81 SD of sample: 0.91

Null hypothesis H0 : Imran's sample average is within 95% confidence interval of population average
Alternative hypothesis Ha : Imran's sample average is below the 95% confidence interval of population average

Imran's Form Status: Out-of-Form because the p value: 0.041727 is less than alpha= 0.05

**************************** Form status of Hadlee ****************************

Population size: 100
Mean of population: 1.38
Sample size: 12 Mean of sample: 1.67 SD of sample: 1.37

Null hypothesis H0 : Hadlee's sample average is within 95% confidence interval of population average
Alternative hypothesis Ha : Hadlee's sample average is below the 95% confidence interval of population average

Hadlee's Form Status: In-Form because the p value: 0.761265 is greater than alpha= 0.05
*******************************************************************************************

Key findings

Here are some key conclusions

  1. Kapil Dev’s strike rate stands high above the other 3
  2. Imran Khan has the best cumulative average runs followed by Kapil
  3. Botham is the most effective wicket taker followed by Hadlee
  4. Hadlee is the most economical bowler and is followed by Kapil Dev
  5. For a hypothetical Balls Faced and Minutes at Crease, Kapil will score the most runs followed by Botham
  6. The moving average indicates that the best was yet to come for Imran and Hadlee, while Kapil and Botham were on the decline

Also see my other posts in R

  1. A primer on Qubits, Quantum gates and Quantum operations
  2. Deblurring with OpenCV:Weiner filter reloaded
  3. Designing a Social Web Portal
  4. A crime map of India in R – Crimes against women
  5. Bend it like Bluemix, MongoDB with autoscaling – Part 2
  6. Mirror, mirror … the best batsman of them all?

For a full list of posts see Index of posts

cricketr plays the ODIs!

Published in R bloggers: cricketr plays the ODIs

Introduction

In this post my package ‘cricketr’ takes a swing at One Day Internationals (ODIs). Like Test batsmen who adapt to ODIs with some innovative strokes, the cricketr package has some additional functions and some modified functions to handle the high strike and economy rates in ODIs. As before, I have chosen my top 4 ODI batsmen and top 4 ODI bowlers.

Important note: Do check out the Python avatar of cricketr, ‘cricpy’, in my post ‘Introducing cricpy:A python package to analyze performances of cricketers’

Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar

You can also read this post at Rpubs as odi-cricketr. Download this report as a PDF file from odi-cricketr.pdf

Important note: Do check out my other posts using cricketr at cricketr-posts

Note: If you would like to do a similar analysis for a different set of batsmen and bowlers, you can clone/download my skeleton cricketr template from Github (which is the R Markdown file I have used for the analysis below). You will only need to make appropriate changes for the players you are interested in. Only a familiarity with R and R Markdown is needed.

Batsmen

  1. Virender Sehwag (Ind)
  2. AB Devilliers (SA)
  3. Chris Gayle (WI)
  4. Glenn Maxwell (Aus)

Bowlers

  1. Mitchell Johnson (Aus)
  2. Lasith Malinga (SL)
  3. Dale Steyn (SA)
  4. Tim Southee (NZ)

I have sprinkled the plots with a few of my comments. Feel free to draw your conclusions! The analysis is included below

The profile for Virender Sehwag is 35263. This can be used to get the ODI data for Sehwag. For a batsman the type should be “batting” and for a bowler the type should be “bowling” and the function is getPlayerDataOD()

The package can be installed directly from CRAN

if (!require("cricketr")){ 
    install.packages("cricketr",lib = "c:/test") 
} 
library(cricketr)

or from Github

library(devtools)
install_github("tvganesh/cricketr")
library(cricketr)

The One day data for a particular player can be obtained with the getPlayerDataOD() function. To do this you will need to go to ESPN CricInfo Player and type in the name of the player, e.g. Virender Sehwag. This will bring up a page which has the profile number for the player, e.g. for Virender Sehwag this would be http://www.espncricinfo.com/india/content/player/35263.html. Hence, Sehwag’s profile is 35263. This can be used to get the data for Virender Sehwag as shown below

sehwag <- getPlayerDataOD(35263,dir=".",file="sehwag.csv",type="batting")
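
The profile number can also be read off the player URL in code. This is a minimal sketch in base R, assuming the URL always ends in /player/<number>.html as in the Sehwag example; profileFromUrl is a hypothetical helper and is not part of cricketr.

```r
# Hypothetical helper (not part of cricketr): pull the numeric profile id
# out of an ESPN Cricinfo player URL of the form .../player/<number>.html
profileFromUrl <- function(url) {
  pattern <- "^.*/player/([0-9]+)\\.html$"
  if (!grepl(pattern, url)) {
    stop("URL does not look like .../player/<number>.html")
  }
  as.integer(sub(pattern, "\\1", url))
}

profile <- profileFromUrl("http://www.espncricinfo.com/india/content/player/35263.html")
print(profile)
```

The resulting number is what gets passed as the first argument of getPlayerDataOD().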

Analyses of Batsmen

The following plots give the analysis of the 4 ODI batsmen

  1. Virender Sehwag (Ind) – Innings – 245, Runs = 8586, Average = 35.05, Strike Rate = 104.33
  2. AB Devilliers (SA) – Innings – 179, Runs = 7941, Average = 53.65, Strike Rate = 99.12
  3. Chris Gayle (WI) – Innings – 264, Runs = 9221, Average = 37.65, Strike Rate = 85.11
  4. Glenn Maxwell (Aus) – Innings – 45, Runs = 1367, Average = 35.02, Strike Rate = 126.69

Plot of 4s, 6s and the scoring rate in ODIs

The 3 charts below give the number of

  1. 4s vs Runs scored
  2. 6s vs Runs scored
  3. Balls faced vs Runs scored

A regression line is fitted in each of these plots for each of the ODI batsmen.

A. Virender Sehwag

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./sehwag.csv","Sehwag")
batsman6s("./sehwag.csv","Sehwag")
batsmanScoringRateODTT("./sehwag.csv","Sehwag")

sehwag-4s6sSR-1

dev.off()
## null device 
##           1

B. AB Devilliers

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./devilliers.csv","Devillier")
batsman6s("./devilliers.csv","Devillier")
batsmanScoringRateODTT("./devilliers.csv","Devillier")

devillier-4s6SR-1

dev.off()
## null device 
##           1

C. Chris Gayle

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./gayle.csv","Gayle")
batsman6s("./gayle.csv","Gayle")
batsmanScoringRateODTT("./gayle.csv","Gayle")

gayle-4s6sSR-1

dev.off()
## null device 
##           1

D. Glenn Maxwell

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./maxwell.csv","Maxwell")
batsman6s("./maxwell.csv","Maxwell")
batsmanScoringRateODTT("./maxwell.csv","Maxwell")

maxwell-4s6sout-1

dev.off()
## null device 
##           1
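
The regression lines in the 4s, 6s and scoring-rate plots above can be pictured with a plain least-squares fit. The sketch below uses made-up innings rather than real ODI data, and a straight line is an assumption on my part; the curve that batsman4s() actually fits may differ.

```r
# Made-up innings (NOT real ODI data): runs scored and number of 4s hit
runs  <- c(5, 20, 35, 50, 80, 110)
fours <- c(0, 2, 4, 6, 10, 13)

# Ordinary least squares: fours = intercept + slope * runs
fit <- lm(fours ~ runs)
slope <- unname(coef(fit)["runs"])

# Scatter plot with the fitted line overlaid
plot(runs, fours, xlab = "Runs", ylab = "Number of 4s")
abline(fit, lty = 2)
```

The slope can be read as the expected number of additional 4s per extra run scored.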

Relative Mean Strike Rate

In this first plot I plot the Mean Strike Rate of the batsmen. It can be seen that Maxwell has an awesome strike rate in ODIs. However, we need to keep in mind that Maxwell has played relatively few innings (only 45). He is followed by Sehwag (245 innings), who also has an excellent strike rate up to 100 runs, after which Devilliers roars ahead. This is also seen in the overall strike rates listed above.

par(mar=c(4,4,2,2))
frames <- list("./sehwag.csv","./devilliers.csv","gayle.csv","maxwell.csv")
names <- list("Sehwag","Devilliers","Gayle","Maxwell")
relativeBatsmanSRODTT(frames,names)

plot-1-1
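
To make the idea of a mean strike rate per runs range concrete, here is a small sketch in base R. The innings are made up, and the 10-run bucketing is an assumption; relativeBatsmanSRODTT() may group the runs differently.

```r
# Made-up innings (NOT real ODI data)
runs       <- c(5, 12, 18, 25, 40)
ballsFaced <- c(10, 12, 20, 25, 32)

# Strike rate = runs scored per 100 balls faced
strikeRate <- 100 * runs / ballsFaced

# Group innings into 10-run buckets: [0,10), [10,20), ...
bucket <- cut(runs, breaks = seq(0, 50, by = 10), right = FALSE)

# Mean strike rate per bucket; buckets with no innings come out as NA
meanSR <- tapply(strikeRate, bucket, mean)
print(meanSR)
```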

Relative Runs Frequency Percentage

Sehwag leads in the percentage of runs in the 10-run ranges up to 50 runs. Maxwell and Devilliers lead in 55-66 & 66-85 respectively.

frames <- list("./sehwag.csv","./devilliers.csv","gayle.csv","maxwell.csv")
names <- list("Sehwag","Devilliers","Gayle","Maxwell")
relativeRunsFreqPerfODTT(frames,names)

plot-2-1
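
The runs frequency percentage is simply the share of innings that fall in each runs range. This sketch uses made-up scores and assumes 10-run ranges as described above; it is not the code inside relativeRunsFreqPerfODTT().

```r
# Made-up scores (NOT real ODI data)
runs <- c(3, 8, 15, 22, 27, 41, 45, 48)

# Count the innings in each 10-run range
tab <- table(cut(runs, breaks = seq(0, 50, by = 10), right = FALSE))

# Convert the counts to percentages of all innings
freqPct <- setNames(100 * as.numeric(tab) / sum(tab), names(tab))
print(freqPct)
```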

Percentage of 4s,6s in the runs scored

The plot below shows the percentage of runs made by the batsmen by way of 1s, 2s, 3s, 4s and 6s. It can be seen that Sehwag has the highest percentage of 4s (33.36%) in his overall runs in ODIs. Maxwell has the highest percentage of 6s (13.36%) in his ODI career. If we take the overall 4s+6s then Sehwag leads with (33.36 + 5.95 = 39.31%), followed by Gayle (27.80 + 10.15 = 37.95%)

frames <- list("./sehwag.csv","./devilliers.csv","gayle.csv","maxwell.csv")
names <- list("Sehwag","Devilliers","Gayle","Maxwell")
runs4s6s <-batsman4s6s(frames,names)

plot-46s-1

print(runs4s6s)
##                Sehwag Devilliers Gayle Maxwell
## Runs(1s,2s,3s)  60.69      67.39 62.05   62.11
## 4s              33.36      24.28 27.80   24.53
## 6s               5.95       8.32 10.15   13.36
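
The 4s+6s totals quoted earlier can be recomputed from the printed table. The sketch below retypes the percentages by hand instead of reusing the runs4s6s object, so that it runs without the data files.

```r
# Percentages copied from the printed runs4s6s table above
pct <- rbind(
  "Runs(1s,2s,3s)" = c(60.69, 67.39, 62.05, 62.11),
  "4s"             = c(33.36, 24.28, 27.80, 24.53),
  "6s"             = c( 5.95,  8.32, 10.15, 13.36)
)
colnames(pct) <- c("Sehwag", "Devilliers", "Gayle", "Maxwell")

# Share of runs that came in boundaries (4s + 6s), largest first
boundaryPct <- sort(pct["4s", ] + pct["6s", ], decreasing = TRUE)
print(boundaryPct)
```

Sorting this way also shows that Maxwell (24.53 + 13.36 = 37.89%) sits just behind Gayle.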
 

Runs forecast

The forecast for the batsmen is shown below.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfForecast("./sehwag.csv","Sehwag")
batsmanPerfForecast("./devilliers.csv","Devilliers")
batsmanPerfForecast("./gayle.csv","Gayle")
batsmanPerfForecast("./maxwell.csv","Maxwell")

swcr-perf-1

dev.off()
## null device 
##           1

3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./sehwag.csv","V Sehwag")
battingPerf3d("./devilliers.csv","AB Devilliers")

plot-3-1

dev.off()
## null device 
##           1
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./gayle.csv","C Gayle")
battingPerf3d("./maxwell.csv","G Maxwell")

plot-4-1

dev.off()
## null device 
##           1

Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

BF <- seq( 10, 200,length=10)
Mins <- seq(30,220,length=10)
newDF <- data.frame(BF,Mins)

sehwag <- batsmanRunsPredict("./sehwag.csv","Sehwag",newdataframe=newDF)
devilliers <- batsmanRunsPredict("./devilliers.csv","Devilliers",newdataframe=newDF)
gayle <- batsmanRunsPredict("./gayle.csv","Gayle",newdataframe=newDF)
maxwell <- batsmanRunsPredict("./maxwell.csv","Maxwell",newdataframe=newDF)

The fitted model is then used to predict the runs that the batsmen will score for a hypothetical Balls Faced and Minutes at Crease. It can be seen that Maxwell sets a searing pace in the predicted runs for a given Balls Faced and Minutes at Crease, followed by Sehwag. But we have to keep in mind that Maxwell has only around 1/5th of the innings of Sehwag (45 to Sehwag’s 245 innings). They are followed by Devilliers and then finally Gayle.

batsmen <-cbind(round(sehwag$Runs),round(devilliers$Runs),round(gayle$Runs),round(maxwell$Runs))
colnames(batsmen) <- c("Sehwag","Devilliers","Gayle","Maxwell")
newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
colnames(newDF) <- c("BallsFaced","MinsAtCrease")
predictedRuns <- cbind(newDF,batsmen)
predictedRuns
##    BallsFaced MinsAtCrease Sehwag Devilliers Gayle Maxwell
## 1          10           30     11         12    11      18
## 2          31           51     33         32    28      43
## 3          52           72     55         52    46      67
## 4          73           93     77         71    63      92
## 5          94          114    100         91    81     117
## 6         116          136    122        111    98     141
## 7         137          157    144        130   116     166
## 8         158          178    167        150   133     191
## 9         179          199    189        170   151     215
## 10        200          220    211        190   168     240
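
For readers who want to see the mechanics, the prediction step can be sketched with lm() and predict() from base R. The training innings below are made up and built to lie exactly on a plane, so the fitted coefficients are easy to check; this is an assumption-laden illustration and not the internals of batsmanRunsPredict().

```r
# Made-up innings (NOT real data), generated so that
# Runs = 2 + 0.8 * BF + 0.1 * Mins holds exactly
train <- data.frame(
  BF   = c(12, 30, 45, 60, 80, 100),
  Mins = c(20, 35, 70, 75, 120, 130)
)
train$Runs <- 2 + 0.8 * train$BF + 0.1 * train$Mins

# Fit the regression plane Runs ~ BF + Mins
fit <- lm(Runs ~ BF + Mins, data = train)

# Same grid of hypothetical inputs as used in the post
newDF <- data.frame(BF   = seq(10, 200, length.out = 10),
                    Mins = seq(30, 220, length.out = 10))

# Predicted runs for each (BF, Mins) pair on the grid
predicted <- predict(fit, newdata = newDF)
print(round(cbind(newDF, Runs = predicted)))
```

With real innings the points scatter around the plane, so the predictions are estimates and not exact values.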

Highest runs likelihood

The plots below show the runs likelihood of each batsman, computed using K-Means. It can be seen that Devilliers has almost a 30% likelihood (29.84%) of making around 90+ runs. Sehwag and Gayle have a roughly 33-35% likelihood of making 45+ runs.

A. Virender Sehwag

batsmanRunsLikelihood("./sehwag.csv","Sehwag")

smith-1

## Summary of  Sehwag 's runs scoring likelihood
## **************************************************
## 
## There is a 35.22 % likelihood that Sehwag  will make  46 Runs in  44 balls over 67  Minutes 
## There is a 9.43 % likelihood that Sehwag  will make  119 Runs in  106 balls over  158  Minutes 
## There is a 55.35 % likelihood that Sehwag  will make  12 Runs in  13 balls over 18  Minutes

B. AB Devilliers

batsmanRunsLikelihood("./devilliers.csv","Devilliers")

warner-1

## Summary of  Devilliers 's runs scoring likelihood
## **************************************************
## 
## There is a 30.65 % likelihood that Devilliers  will make  44 Runs in  43 balls over 60  Minutes 
## There is a 29.84 % likelihood that Devilliers  will make  91 Runs in  88 balls over  124  Minutes 
## There is a 39.52 % likelihood that Devilliers  will make  11 Runs in  15 balls over 21  Minutes

C. Chris Gayle

batsmanRunsLikelihood("./gayle.csv","Gayle")

cook,cache-TRUE-1

## Summary of  Gayle 's runs scoring likelihood
## **************************************************
## 
## There is a 32.69 % likelihood that Gayle  will make  47 Runs in  51 balls over 72  Minutes 
## There is a 54.49 % likelihood that Gayle  will make  10 Runs in  15 balls over  20  Minutes 
## There is a 12.82 % likelihood that Gayle  will make  109 Runs in  119 balls over 172  Minutes

D. Glenn Maxwell

batsmanRunsLikelihood("./maxwell.csv","Maxwell")

oot-1

## Summary of  Maxwell 's runs scoring likelihood
## **************************************************
## 
## There is a 34.38 % likelihood that Maxwell  will make  39 Runs in  29 balls over 35  Minutes 
## There is a 15.62 % likelihood that Maxwell  will make  89 Runs in  55 balls over  69  Minutes 
## There is a 50 % likelihood that Maxwell  will make  6 Runs in  7 balls over 9  Minutes
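
The summaries above report three groups per batsman with percentages that add up to 100, which fits a reading of each percentage as the share of innings in a K-Means cluster. The sketch below follows that reading with made-up innings; how batsmanRunsLikelihood() derives its figures is an assumption here.

```r
set.seed(42)  # kmeans starts from random centres

# Made-up innings (NOT real data): three loose groups of short, medium
# and long innings, with columns Runs, balls faced (BF) and minutes (Mins)
innings <- data.frame(
  Runs = c(rnorm(20, 10, 3), rnorm(12, 45, 5), rnorm(8, 100, 8)),
  BF   = c(rnorm(20, 14, 3), rnorm(12, 48, 5), rnorm(8, 95, 8)),
  Mins = c(rnorm(20, 20, 4), rnorm(12, 65, 6), rnorm(8, 140, 10))
)

# Partition the innings into 3 clusters
km <- kmeans(innings, centers = 3, nstart = 10)

# Share of innings in each cluster, read as a rough "likelihood"
likelihood <- round(100 * km$size / sum(km$size), 2)
summaryDF <- cbind(round(as.data.frame(km$centers)), Likelihood = likelihood)
print(summaryDF)
```

Each row gives a cluster centre (typical Runs, BF and Mins) together with the percentage of innings that fall in that cluster.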

Average runs at ground and against opposition

A. Virender Sehwag

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./sehwag.csv","Sehwag")
batsmanAvgRunsOpposition("./sehwag.csv","Sehwag")

avgrg-1-1

dev.off()
## null device 
##           1

B. AB Devilliers

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./devilliers.csv","Devilliers")
batsmanAvgRunsOpposition("./devilliers.csv","Devilliers")

avgrg-2-1

dev.off()
## null device 
##           1

C. Chris Gayle

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./gayle.csv","Gayle")
batsmanAvgRunsOpposition("./gayle.csv","Gayle")

avgrg-3-1

dev.off()
## null device 
##           1

D. Glenn Maxwell

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./maxwell.csv","Maxwell")
batsmanAvgRunsOpposition("./maxwell.csv","Maxwell")

avgrg-4-1

dev.off()
## null device 
##           1

Moving Average of runs over career

The moving averages for the 4 batsmen indicate the following

1. The moving averages of Devilliers and Maxwell are on the way up.
2. Sehwag shows a slight downward trend from his 2nd peak in 2011.
3. Gayle has maintained a consistent 45 runs over the last few years.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanMovingAverage("./sehwag.csv","Sehwag")
batsmanMovingAverage("./devilliers.csv","Devilliers")
batsmanMovingAverage("./gayle.csv","Gayle")
batsmanMovingAverage("./maxwell.csv","Maxwell")

sdgm-ma-1

dev.off()
## null device 
##           1
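
As a rough illustration of what a moving average does to a run sequence, here is a trailing 3-innings mean in base R. The runs are made up, and a simple trailing mean is an assumption; batsmanMovingAverage() may use a different smoother.

```r
# Made-up runs in consecutive innings (NOT real data)
runs <- c(10, 20, 30, 40, 50)
window <- 3

# Trailing mean over the current and previous (window - 1) innings;
# the first (window - 1) positions have no full window and are NA
movAvg <- as.numeric(stats::filter(runs, rep(1 / window, window), sides = 1))
print(movAvg)
```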

Check batsmen in-form, out-of-form

  1. Maxwell, Devilliers, Sehwag are in-form. This is also evident from the moving average plot
  2. Gayle is out-of-form
checkBatsmanInForm("./sehwag.csv","Sehwag")
## *******************************************************************************************
## 
## Population size: 143  Mean of population: 33.76 
## Sample size: 16  Mean of sample: 37.44 SD of sample: 55.15 
## 
## Null hypothesis H0 : Sehwag 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Sehwag 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Sehwag 's Form Status: In-Form because the p value: 0.603525  is greater than alpha=  0.05"
## *******************************************************************************************
checkBatsmanInForm("./devilliers.csv","Devilliers")
## *******************************************************************************************
## 
## Population size: 111  Mean of population: 43.5 
## Sample size: 13  Mean of sample: 57.62 SD of sample: 40.69 
## 
## Null hypothesis H0 : Devilliers 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Devilliers 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Devilliers 's Form Status: In-Form because the p value: 0.883541  is greater than alpha=  0.05"
## *******************************************************************************************
checkBatsmanInForm("./gayle.csv","Gayle")
## *******************************************************************************************
## 
## Population size: 140  Mean of population: 37.1 
## Sample size: 16  Mean of sample: 17.25 SD of sample: 20.25 
## 
## Null hypothesis H0 : Gayle 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Gayle 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Gayle 's Form Status: Out-of-Form because the p value: 0.000609  is less than alpha=  0.05"
## *******************************************************************************************
checkBatsmanInForm("./maxwell.csv","Maxwell")
## *******************************************************************************************
## 
## Population size: 28  Mean of population: 25.25 
## Sample size: 4  Mean of sample: 64.25 SD of sample: 36.97 
## 
## Null hypothesis H0 : Maxwell 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Maxwell 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Maxwell 's Form Status: In-Form because the p value: 0.948744  is greater than alpha=  0.05"
## *******************************************************************************************
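
The hypotheses printed above read like a one-sided (lower-tail) test of the recent sample mean against the career mean. The sketch below computes such a p value from the summary figures in the output. That checkBatsmanInForm() does exactly this is an assumption, formPValue is a hypothetical helper, and because the printed inputs are rounded the results may differ slightly from the p values shown.

```r
# Hypothetical helper: lower-tailed one-sample t-test from summary statistics
formPValue <- function(popMean, n, sampleMean, sampleSD) {
  tStat <- (sampleMean - popMean) / (sampleSD / sqrt(n))
  pt(tStat, df = n - 1)  # P(T <= tStat)
}

# Summary figures copied from the printed output for Gayle and Sehwag
pGayle  <- formPValue(popMean = 37.10, n = 16, sampleMean = 17.25, sampleSD = 20.25)
pSehwag <- formPValue(popMean = 33.76, n = 16, sampleMean = 37.44, sampleSD = 55.15)

# A p value below alpha flags the batsman as out of form
alpha <- 0.05
status <- ifelse(c(Gayle = pGayle, Sehwag = pSehwag) < alpha, "Out-of-Form", "In-Form")
print(status)
```

A sample mean well below the career mean gives a small p value and an out-of-form verdict, while a sample mean at or above the career mean gives a p value of 0.5 or more.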

Analysis of bowlers

  1. Mitchell Johnson (Aus) – Innings-150, Wickets – 239, Econ Rate : 4.83
  2. Lasith Malinga (SL)- Innings-182, Wickets – 287, Econ Rate : 5.26
  3. Dale Steyn (SA)- Innings-103, Wickets – 162, Econ Rate : 4.81
  4. Tim Southee (NZ)- Innings-96, Wickets – 135, Econ Rate : 5.33

Malinga has the highest number of innings and wickets, followed closely by Mitchell Johnson. Steyn and Southee have relatively fewer innings.

To get the bowler’s data use

malinga <- getPlayerDataOD(49758,dir=".",file="malinga.csv",type="bowling")

Wicket Frequency percentage

This plot gives the percentage frequency of each wicket count (1, 2, 3, etc.)

par(mfrow=c(1,4))
par(mar=c(4,4,2,2))
bowlerWktsFreqPercent("./mitchell.csv","J Mitchell")
bowlerWktsFreqPercent("./malinga.csv","Malinga")
bowlerWktsFreqPercent("./steyn.csv","Steyn")
bowlerWktsFreqPercent("./southee.csv","southee")

relBowlFP-1

dev.off()
## null device 
##           1

Wickets Runs plot

The plot below gives a boxplot of the runs ranges for each of the wickets taken by the bowlers. M Johnson and Steyn are more economical than Malinga and Southee, corroborating the figures above.

par(mfrow=c(1,4))
par(mar=c(4,4,2,2))

bowlerWktsRunsPlot("./mitchell.csv","J Mitchell")
bowlerWktsRunsPlot("./malinga.csv","Malinga")
bowlerWktsRunsPlot("./steyn.csv","Steyn")
bowlerWktsRunsPlot("./southee.csv","southee")

wktsrun-1

dev.off()
## null device 
##           1

Average wickets in different grounds and opposition

A. Mitchell Johnson

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./mitchell.csv","J Mitchell")
bowlerAvgWktsOpposition("./mitchell.csv","J Mitchell")

gr-1-1

dev.off()
## null device 
##           1

B. Lasith Malinga

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./malinga.csv","Malinga")
bowlerAvgWktsOpposition("./malinga.csv","Malinga")

gr-2-1

dev.off()
## null device 
##           1

C. Dale Steyn

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./steyn.csv","Steyn")
bowlerAvgWktsOpposition("./steyn.csv","Steyn")

gr-3-1

dev.off()
## null device 
##           1

D. Tim Southee

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./southee.csv","southee")
bowlerAvgWktsOpposition("./southee.csv","southee")

avgrg-4-1

dev.off()
## null device 
##           1

Relative bowling performance

The plot below shows that Mitchell Johnson and Southee have more wickets in the 3-4 wicket range, while Steyn and Malinga have more in the 1-2 wicket range.

frames <- list("./mitchell.csv","./malinga.csv","steyn.csv","southee.csv")
names <- list("M Johnson","Malinga","Steyn","Southee")
relativeBowlingPerf(frames,names)

relBowlPerf-1

Relative Economy Rate against wickets taken

Steyn had the best economy rate followed by M Johnson. Malinga and Southee have a poorer economy rate

frames <- list("./mitchell.csv","./malinga.csv","steyn.csv","southee.csv")
names <- list("M Johnson","Malinga","Steyn","Southee")
relativeBowlingERODTT(frames,names)

relBowlER-1

Moving average of wickets over career

The career wicket graphs of Johnson and Steyn are on the up-swing. Southee is maintaining a reasonable record, while Malinga shows a decline in ODI performance.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
bowlerMovingAverage("./mitchell.csv","M Johnson")
bowlerMovingAverage("./malinga.csv","Malinga")
bowlerMovingAverage("./steyn.csv","Steyn")
bowlerMovingAverage("./southee.csv","Southee")

jmss-bowlma-1

dev.off()
## null device 
##           1

Wickets forecast

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
bowlerPerfForecast("./mitchell.csv","M Johnson")
bowlerPerfForecast("./malinga.csv","Malinga")
bowlerPerfForecast("./steyn.csv","Steyn")
bowlerPerfForecast("./southee.csv","southee")

jsba-pfcst-1

dev.off()
## null device 
##           1

Check bowler in-form, out-of-form

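All the bowlers except Southee are shown to be still in-form. Southee is flagged as out-of-form, with a p value of 0.044302 that falls just below alpha = 0.05.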

checkBowlerInForm("./mitchell.csv","J Mitchell")
## *******************************************************************************************
## 
## Population size: 135  Mean of population: 1.55 
## Sample size: 15  Mean of sample: 2 SD of sample: 1.07 
## 
## Null hypothesis H0 : J Mitchell 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : J Mitchell 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "J Mitchell 's Form Status: In-Form because the p value: 0.937917  is greater than alpha=  0.05"
## *******************************************************************************************
checkBowlerInForm("./malinga.csv","Malinga")
## *******************************************************************************************
## 
## Population size: 163  Mean of population: 1.58 
## Sample size: 19  Mean of sample: 1.58 SD of sample: 1.22 
## 
## Null hypothesis H0 : Malinga 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Malinga 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Malinga 's Form Status: In-Form because the p value: 0.5  is greater than alpha=  0.05"
## *******************************************************************************************
checkBowlerInForm("./steyn.csv","Steyn")
## *******************************************************************************************
## 
## Population size: 93  Mean of population: 1.59 
## Sample size: 11  Mean of sample: 1.45 SD of sample: 0.69 
## 
## Null hypothesis H0 : Steyn 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Steyn 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Steyn 's Form Status: In-Form because the p value: 0.257438  is greater than alpha=  0.05"
## *******************************************************************************************
checkBowlerInForm("./southee.csv","southee")
## *******************************************************************************************
## 
## Population size: 86  Mean of population: 1.48 
## Sample size: 10  Mean of sample: 0.8 SD of sample: 1.14 
## 
## Null hypothesis H0 : southee 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : southee 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "southee 's Form Status: Out-of-Form because the p value: 0.044302  is less than alpha=  0.05"
## *******************************************************************************************

***************

Key findings

Here are some key conclusions

ODI batsmen

  1. AB Devilliers has a high frequency of runs in the 60-120 range and the highest average
  2. Sehwag has a large number of innings (245) and a good strike rate
  3. Maxwell has the best strike rate, but it should be kept in mind that he has about 1/5 of the innings of Sehwag. We need to see how he progresses further
  4. Sehwag has the highest percentage of 4s in the runs scored, while Maxwell has the highest percentage of 6s
  5. For a hypothetical Balls Faced and Minutes at Crease, Maxwell will score the most runs followed by Sehwag
  6. The moving average indicates that the best is yet to come for Devilliers and Maxwell. Sehwag has a few more years in him, while Gayle shows a decline in ODI performance and is indicated to be out of form.

ODI bowlers

  1. Malinga has played the highest number of innings and also has the highest number of wickets, though he has a poor economy rate
  2. M Johnson is the most effective in the 3-4 wicket range followed by Southee
  3. M Johnson and Steyn have the best overall economy rate followed by Malinga and Southee
  4. M Johnson and Steyn’s careers are on the up-swing, Southee maintains a steady consistent performance, while Malinga shows a downward trend

Hasta la vista! I’ll be back!
Watch this space!

Also see my other posts in R

  1. Introducing cricketr! : An R package to analyze performances of cricketers
  2. cricketr digs the Ashes!
  3. A peek into literacy in India: Statistical Learning with R
  4. A crime map of India in R – Crimes against women
  5. Analyzing cricket’s batting legends – Through the mirage with R
  6. Mirror, mirror . the best batsman of them all?

You may also like

  1. A closer look at “Robot Horse on a Trot” in Android
  2. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
  3. Bend it like Bluemix, MongoDB with autoscaling – Part 2
  4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
  5. TWS-4: Gossip protocol: Epidemics and rumors to the rescue
  6. Deblurring with OpenCV:Weiner filter reloaded

http://www.r-bloggers.com/cricketr-plays-the-odis/

cricketr digs the Ashes!

Published in R bloggers: cricketr digs the Ashes

Introduction

In some circles the Ashes is considered the ‘mother of all cricketing battles’. But, being a staunch supporter of all things Indian, cricket or otherwise, I have to say that the Ashes pales in comparison to an India-Pakistan match. After all, what are a few frowns and raised eyebrows at the Ashes in comparison to the seething emotions and reckless exuberance of Indian fans?

Anyway, the Ashes are an interesting duel and I have decided to do some cricketing analysis using my R package cricketr. For this analysis I have chosen the top 2 batsmen and top 2 bowlers from both the Australian and English sides.

Batsmen

  1. Steven Smith (Aus) – Innings – 58 , Ave: 58.52, Strike Rate: 55.90
  2. David Warner (Aus) – Innings – 76, Ave: 46.86, Strike Rate: 73.88
  3. Alastair Cook (Eng) – Innings – 208 , Ave: 46.62, Strike Rate: 46.33
  4. J E Root (Eng) – Innings – 53, Ave: 54.02, Strike Rate: 51.30

Bowlers

  1. Mitchell Johnson (Aus) – Innings-131, Wickets – 299, Econ Rate : 3.28
  2. Peter Siddle (Aus) – Innings – 104 , Wickets- 192, Econ Rate : 2.95
  3. James Anderson (Eng) – Innings – 199 , Wickets- 406, Econ Rate : 3.05
  4. Stuart Broad (Eng) – Innings – 148 , Wickets- 296, Econ Rate : 3.08

It is my opinion that if any 2 of the 4 in either team click, they will be able to swing the match in favor of their team.

I have interspersed the plots with a few comments. Feel free to draw your conclusions!

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of the performances of both batsmen and bowlers, besides evaluating team & match performances in Tests, ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the Kindle versions. The books can be accessed at Cricket analytics with cricketr and Beaten by sheer pace-Cricket analytics with yorkr. A must read for any cricket lover! Check it out!!

Important note: Do check out the Python avatar of cricketr, ‘cricpy’, in my post ‘Introducing cricpy:A python package to analyze performances of cricketers’

The analysis is included below. Note: This post has also been hosted at Rpubs as cricketr digs the Ashes!
You can also download this analysis as a PDF file from cricketr digs the Ashes!

Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar

Note: If you would like to do a similar analysis for a different set of batsmen and bowlers, you can clone/download my skeleton cricketr template from Github (which is the R Markdown file I have used for the analysis below). You will only need to make appropriate changes for the players you are interested in. Only a familiarity with R and R Markdown is needed.

Important note: Do check out my other posts using cricketr at cricketr-posts

The package can be installed directly from CRAN

if (!require("cricketr")){ 
    install.packages("cricketr",lib = "c:/test") 
} 
library(cricketr)

or from Github

library(devtools)
install_github("tvganesh/cricketr")
library(cricketr)

Analyses of Batsmen

The following plots give the analysis of the 2 Australian and 2 English batsmen. It must be kept in mind that Cook has more innings than all the rest put together. Smith has the best average, and Warner has the best strike rate.

Box Histogram Plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency

batsmanPerfBoxHist("./smith.csv","S Smith")

swcr-boxhist-1

batsmanPerfBoxHist("./warner.csv","D Warner")

swcr-boxhist-2

batsmanPerfBoxHist("./cook.csv","A Cook")

swcr-boxhist-3

batsmanPerfBoxHist("./root.csv","JE Root")

swcr-boxhist-4

Plot of 4s, 6s and the type of dismissals

A. Steven Smith

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./smith.csv","S Smith")
batsman6s("./smith.csv","S Smith")
batsmanDismissals("./smith.csv","S Smith")

smith-4s6sout-1

dev.off()
## null device 
##           1

B. David Warner

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./warner.csv","D Warner")
batsman6s("./warner.csv","D Warner")
batsmanDismissals("./warner.csv","D Warner")

warner-4s6sout-1

dev.off()
## null device 
##           1

C. Alastair Cook

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./cook.csv","A Cook")
batsman6s("./cook.csv","A Cook")
batsmanDismissals("./cook.csv","A Cook")

cook-4s6sout-1

dev.off()
## null device 
##           1

D. J E Root

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./root.csv","JE Root")
batsman6s("./root.csv","JE Root")
batsmanDismissals("./root.csv","JE Root")

root-4s6sout-1

dev.off()
## null device 
##           1

Relative Mean Strike Rate

In this first plot I plot the Mean Strike Rate of the batsmen. It can be seen that Warner has the best strike rate (hit outside the plot!) followed by Smith in the range 20-100. Root has a good strike rate above a hundred runs. Cook maintains a good strike rate.

par(mar=c(4,4,2,2))
frames <- list("./smith.csv","./warner.csv","cook.csv","root.csv")
names <- list("Smith","Warner","Cook","Root")
relativeBatsmanSR(frames,names)

plot-1-1

Relative Runs Frequency Percentage

The plot below shows the percentage contribution in each 10-run bucket over the entire career. It can be seen that Smith pops up above the rest with remarkable regularity. Cook is consistent over the entire range.

frames <- list("./smith.csv","./warner.csv","cook.csv","root.csv")
names <- list("Smith","Warner","Cook","Root")
relativeRunsFreqPerf(frames,names)

plot-2-1

Moving Average of runs over career

The moving averages for the 4 batsmen indicate the following: S Smith is the most promising, with a marked spike in performance. Cook maintains a steady pace and is consistent, averaging around 50 over the years.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanMovingAverage("./smith.csv","S Smith")
batsmanMovingAverage("./warner.csv","D Warner")
batsmanMovingAverage("./cook.csv","A Cook")
batsmanMovingAverage("./root.csv","JE Root")

swcr-ma-1

dev.off()
## null device 
##           1

Runs forecast

The forecast for the batsmen is shown below. As before, Cook’s performance is really consistent across the years and the forecast is good for the years ahead. In Cook’s case it can be seen that the forecast runs track the actual runs reasonably closely.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfForecast("./smith.csv","S Smith")
batsmanPerfForecast("./warner.csv","D Warner")
batsmanPerfForecast("./cook.csv","A Cook")
## Warning in HoltWinters(ts.train): optimization difficulties: ERROR:
## ABNORMAL_TERMINATION_IN_LNSRCH
batsmanPerfForecast("./root.csv","JE Root")

swcr-perf-1

dev.off()
## null device 
##           1
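
The warning in the output above shows that the forecast goes through HoltWinters() from base R's stats package. As a rough picture of what that call does, here is a sketch on a made-up series; the settings used (no trend, no seasonal component) are assumptions and may not match what batsmanPerfForecast() passes.

```r
# Made-up run averages over successive periods (NOT real data)
runs <- ts(c(32, 41, 38, 45, 50, 47, 52, 49, 55, 53, 58, 56))

# Exponential smoothing without trend or seasonal component;
# the smoothing parameter alpha is chosen by HoltWinters itself
fit <- HoltWinters(runs, beta = FALSE, gamma = FALSE)

# Forecast the next 3 observations
fc <- predict(fit, n.ahead = 3)
print(fc)
```

Without a trend term the forecast is flat, so all three predicted values are the same smoothed level.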

3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./smith.csv","S Smith")
battingPerf3d("./warner.csv","D Warner")

plot-3-1

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./cook.csv","A Cook")
battingPerf3d("./root.csv","JE Root")

plot-4-1

dev.off()
## null device 
##           1

Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

BF <- seq( 10, 400,length=15)
Mins <- seq(30,600,length=15)
newDF <- data.frame(BF,Mins)
smith <- batsmanRunsPredict("./smith.csv","S Smith",newdataframe=newDF)
warner <- batsmanRunsPredict("./warner.csv","D Warner",newdataframe=newDF)
cook <- batsmanRunsPredict("./cook.csv","A Cook",newdataframe=newDF)
root <- batsmanRunsPredict("./root.csv","JE Root",newdataframe=newDF)

The fitted model is then used to predict the runs that the batsmen will score for a given Balls Faced and Minutes at Crease. It can be seen that Warner sets a searing pace in the predicted runs for a given Balls Faced and Minutes at Crease, while Smith and Root are neck and neck in the predicted runs.

batsmen <-cbind(round(smith$Runs),round(warner$Runs),round(cook$Runs),round(root$Runs))
colnames(batsmen) <- c("Smith","Warner","Cook","Root")
newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
colnames(newDF) <- c("BallsFaced","MinsAtCrease")
predictedRuns <- cbind(newDF,batsmen)
predictedRuns
##    BallsFaced MinsAtCrease Smith Warner Cook Root
## 1          10           30     9     12    6    9
## 2          38           71    25     33   20   25
## 3          66          111    42     53   33   42
## 4          94          152    58     73   47   59
## 5         121          193    75     93   60   75
## 6         149          234    91    114   74   92
## 7         177          274   108    134   88  109
## 8         205          315   124    154  101  125
## 9         233          356   141    174  115  142
## 10        261          396   158    195  128  159
## 11        289          437   174    215  142  175
## 12        316          478   191    235  155  192
## 13        344          519   207    255  169  208
## 14        372          559   224    276  182  225
## 15        400          600   240    296  196  242

Highest runs likelihood

The plots below show the runs likelihood of each batsman, computed using K-Means. It can be seen that Smith has the best likelihood, around 40%, of scoring around 41 runs, followed by Root who has a 28.3% likelihood of scoring around 81 runs.

A. Steven Smith

batsmanRunsLikelihood("./smith.csv","S Smith")
smith-1
## Summary of  S Smith 's runs scoring likelihood
## **************************************************
## 
## There is a 40 % likelihood that S Smith  will make  41 Runs in  73 balls over 101  Minutes 
## There is a 36 % likelihood that S Smith  will make  9 Runs in  21 balls over  27  Minutes 
## There is a 24 % likelihood that S Smith  will make  139 Runs in  237 balls over 338  Minutes

B. David Warner

batsmanRunsLikelihood("./warner.csv","D Warner")
warner-1
## Summary of  D Warner 's runs scoring likelihood
## **************************************************
## 
## There is a 11.11 % likelihood that D Warner  will make  134 Runs in  159 balls over 263  Minutes 
## There is a 63.89 % likelihood that D Warner  will make  17 Runs in  25 balls over  37  Minutes 
## There is a 25 % likelihood that D Warner  will make  73 Runs in  105 balls over 156  Minutes

C. Alastair Cook

batsmanRunsLikelihood("./cook.csv","A Cook")
cook,cache-TRUE-1
## Summary of  A Cook 's runs scoring likelihood
## **************************************************
## 
## There is a 27.72 % likelihood that A Cook  will make  64 Runs in  140 balls over 195  Minutes 
## There is a 59.9 % likelihood that A Cook  will make  15 Runs in  32 balls over  46  Minutes 
## There is a 12.38 % likelihood that A Cook  will make  141 Runs in  300 balls over 420  Minutes

D. J E Root

batsmanRunsLikelihood("./root.csv","JE Root")
oot-1
## Summary of  JE Root 's runs scoring likelihood
## **************************************************
## 
## There is a 28.3 % likelihood that JE Root  will make  81 Runs in  158 balls over 223  Minutes 
## There is a 7.55 % likelihood that JE Root  will make  179 Runs in  290 balls over  425  Minutes 
## There is a 64.15 % likelihood that JE Root  will make  16 Runs in  39 balls over 59  Minutes
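
The likelihood summaries above can be approximated with a short K-Means sketch. This is only an illustration under stated assumptions: the data frame `innings` is synthetic, the column names `BF`, `Mins` and `Runs` are chosen for this example, and the choice of 3 clusters simply mirrors the three lines in each summary. The package's own implementation may differ.

```r
# Hypothetical innings data (balls faced, minutes, runs); not real career figures
set.seed(42)
innings <- data.frame(
  BF   = c(rnorm(30, 25, 8),  rnorm(20, 100, 20), rnorm(10, 240, 30)),
  Mins = c(rnorm(30, 35, 10), rnorm(20, 150, 25), rnorm(10, 340, 40)),
  Runs = c(rnorm(30, 12, 5),  rnorm(20, 60, 12),  rnorm(10, 140, 20))
)

# Group the innings into 3 clusters (assumed number of clusters)
fit <- kmeans(innings, centers = 3, nstart = 10)

# Likelihood of a cluster = share of all innings that fall into it
likelihood <- round(100 * fit$size / nrow(innings), 2)

# One row per cluster: likelihood and the cluster centre
summary_df <- data.frame(
  Likelihood = likelihood,
  Runs  = round(fit$centers[, "Runs"]),
  Balls = round(fit$centers[, "BF"]),
  Mins  = round(fit$centers[, "Mins"])
)
print(summary_df)
```

Each row then reads like one line of the summaries above: the percentage of innings in the cluster, and the runs, balls and minutes at the cluster centre.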
 

Average runs at ground and against opposition

A. Steven Smith

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./smith.csv","S Smith")
batsmanAvgRunsOpposition("./smith.csv","S Smith")

avgrg-1-1

dev.off()
## null device 
##           1

B. David Warner

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./warner.csv","D Warner")
batsmanAvgRunsOpposition("./warner.csv","D Warner")

avgrg-2-1

dev.off()
## null device 
##           1

C. Alastair Cook

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./cook.csv","A Cook")
batsmanAvgRunsOpposition("./cook.csv","A Cook")

avgrg-3-1

dev.off()
## null device 
##           1

D. J E Root

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./root.csv","JE Root")
batsmanAvgRunsOpposition("./root.csv","JE Root")

avgrg-4-1

dev.off()
## null device 
##           1

Analysis of bowlers

  1. Mitchell Johnson (Aus) – Innings-131, Wickets – 299, Econ Rate : 3.28
  2. Peter Siddle (Aus) – Innings – 104 , Wickets- 192, Econ Rate : 2.95
  3. James Anderson (Eng) – Innings – 199 , Wickets- 406, Econ Rate : 3.05
  4. Stuart Broad (Eng) – Innings – 148 , Wickets- 296, Econ Rate : 3.08

Anderson has the highest number of innings and wickets, followed closely by Broad and Johnson who are neck and neck with respect to wickets. Johnson is on the more expensive side though. Siddle has fewer innings but a good economy rate.

Wicket Frequency percentage

This plot gives the percentage of innings in which the bowler took a given number of wickets (1, 2, 3, etc.)

par(mfrow=c(1,4))
par(mar=c(4,4,2,2))
bowlerWktsFreqPercent("./johnson.csv","Johnson")
bowlerWktsFreqPercent("./siddle.csv","Siddle")
bowlerWktsFreqPercent("./broad.csv","Broad")
bowlerWktsFreqPercent("./anderson.csv","Anderson")

relBowlFP-1

dev.off()
## null device 
##           1
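
A minimal sketch of how such a frequency percentage could be computed is given below. The vector `wkts` is made-up data for illustration, and this is not necessarily how bowlerWktsFreqPercent() works internally.

```r
# Hypothetical wickets taken per innings by one bowler (illustrative only)
wkts <- c(0, 1, 2, 2, 3, 1, 0, 4, 2, 5, 1, 3, 2, 0, 1, 2, 3, 1, 2, 6)

# Count the innings for each wicket haul, then convert the counts to percentages
freq <- table(wkts)
freq_pct <- round(100 * prop.table(freq), 1)
print(freq_pct)
```

The percentages sum to 100, so the bars of the plot can be compared across bowlers who have played a different number of innings.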

Wickets Runs plot

The plot below gives a boxplot of the runs ranges for each of the wickets taken by the bowlers

par(mfrow=c(1,4))
par(mar=c(4,4,2,2))
bowlerWktsRunsPlot("./johnson.csv","Johnson")
bowlerWktsRunsPlot("./siddle.csv","Siddle")
bowlerWktsRunsPlot("./broad.csv","Broad")
bowlerWktsRunsPlot("./anderson.csv","Anderson")

wktsrun-1

dev.off()
## null device 
##           1

Average wickets in different grounds and opposition

A. Mitchell Johnson

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./johnson.csv","Johnson")
bowlerAvgWktsOpposition("./johnson.csv","Johnson")

gr-1-1

dev.off()
## null device 
##           1

B. Peter Siddle

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./siddle.csv","Siddle")
bowlerAvgWktsOpposition("./siddle.csv","Siddle")

gr-2-1

dev.off()
## null device 
##           1

C. Stuart Broad

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./broad.csv","Broad")
bowlerAvgWktsOpposition("./broad.csv","Broad")

gr-3-1

dev.off()
## null device 
##           1

D. James Anderson

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./anderson.csv","Anderson")
bowlerAvgWktsOpposition("./anderson.csv","Anderson")

gr-4-1

dev.off()
## null device 
##           1

Relative bowling performance

The plot below shows that Mitchell Johnson is the most effective bowler among the lot, with more wickets in the 3-6 wicket range. Broad and Anderson seem to perform well at 2 wickets in comparison to Siddle, but at 3 wickets Siddle is better than Broad and Anderson.

frames <- list("./johnson.csv","./siddle.csv","broad.csv","anderson.csv")
names <- list("Johnson","Siddle","Broad","Anderson")
relativeBowlingPerf(frames,names)

relBowlPerf-1

Relative Economy Rate against wickets taken

Anderson followed by Siddle has the best economy rates. Johnson is fairly expensive in the 4-8 wicket range.

frames <- list("./johnson.csv","./siddle.csv","broad.csv","anderson.csv")
names <- list("Johnson","Siddle","Broad","Anderson")
relativeBowlingER(frames,names)

relBowlER-1
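
For readers unfamiliar with the metric, the economy rate is the number of runs conceded per over bowled. The sketch below averages it for each wicket haul. The data frame `bowling` is hypothetical, and whole overs are assumed for simplicity (real scorecards record part overs such as 10.3, which would need converting first).

```r
# Hypothetical bowling figures per innings (illustrative only)
bowling <- data.frame(
  Overs = c(10, 20, 15, 25, 12, 18),
  Runs  = c(30, 70, 45, 60, 48, 54),
  Wkts  = c(1, 2, 1, 4, 2, 4)
)

# Economy rate = runs conceded per over
bowling$ER <- bowling$Runs / bowling$Overs

# Mean economy rate for each number of wickets taken
meanER <- tapply(bowling$ER, bowling$Wkts, mean)
print(round(meanER, 2))
```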

Moving average of wickets over career

Johnson is on his second peak while Siddle is on the decline with respect to bowling. Broad and Anderson show improving performance over the years.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
bowlerMovingAverage("./johnson.csv","Johnson")
bowlerMovingAverage("./siddle.csv","Siddle")
bowlerMovingAverage("./broad.csv","Broad")
bowlerMovingAverage("./anderson.csv","Anderson")

jsba-bowlma-1

dev.off()
## null device 
##           1

Wickets forecast

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
bowlerPerfForecast("./johnson.csv","Johnson")
bowlerPerfForecast("./siddle.csv","Siddle")
bowlerPerfForecast("./broad.csv","Broad")
bowlerPerfForecast("./anderson.csv","Anderson")

jsba-bowlma-1

dev.off()
## null device 
##           1

Key findings

Here are some key conclusions

  1. Cook has the highest number of innings and has been extremely consistent in his scores
  2. Warner has the best strike rate among the lot followed by Smith and Root
  3. The moving average shows a marked improvement over the years for Smith
  4. Johnson is the most effective bowler but is fairly expensive
  5. Anderson has the best economy rate followed by Siddle
  6. Johnson is at his second peak with respect to bowling while Broad and Anderson maintain a steady line and length in their career bowling performance


Also see my other posts in R

  1. Introducing cricketr! : An R package to analyze performances of cricketers
  2. Taking cricketr for a spin – Part 1
  3. A peek into literacy in India: Statistical Learning with R
  4. A crime map of India in R – Crimes against women
  5. Analyzing cricket’s batting legends – Through the mirage with R
  6. Masters of Spin: Unraveling the web with R
  7. Mirror, mirror . the best batsman of them all?

You may also like

  1. A crime map of India in R: Crimes against women
  2. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
  3. Bend it like Bluemix, MongoDB with autoscaling – Part 2
  4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
  5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data
  6. Deblurring with OpenCV:Weiner filter reloaded

Taking cricketr for a spin – Part 1

“Curiouser and curiouser!” cried Alice
“The time has come,” the walrus said, “to talk of many things: Of shoes and ships – and sealing wax – of cabbages and kings”
“Begin at the beginning,”the King said, very gravely,“and go on till you come to the end: then stop.”
“And what is the use of a book,” thought Alice, “without pictures or conversation?”

            Excerpts from Alice in Wonderland by Lewis Carroll

Introduction

This post is a continuation of my previous post “Introducing cricketr! : An R package to analyze performances of cricketers”. In this post I take my package cricketr for a spin. For this analysis I focus on the Indian batting legends:

– Sachin Tendulkar (Master Blaster)
– Rahul Dravid (The Will)
– Sourav Ganguly (The Dada Prince)
– Sunil Gavaskar (Little Master)

This post is also hosted on RPubs – cricketr-1

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of performances of both batsmen, bowlers besides evaluating team & match performances in Tests , ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the kindle versions. The books can be accessed at Cricket analytics with cricketr  and Beaten by sheer pace-Cricket analytics with yorkr  A must read for any cricket lover! Check it out!!

Important note: Do check out the python avatar of cricketr, ‘cricpy’, in my post ‘Introducing cricpy:A python package to analyze performances of cricketers’

(Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar)

Note: If you would like to do a similar analysis for a different set of batsmen and bowlers, you can clone/download my skeleton cricketr template from Github (which is the R Markdown file I have used for the analysis below). You will only need to make appropriate changes for the players you are interested in. Only a familiarity with R and R Markdown is needed.

The package can be installed directly from CRAN

if (!require("cricketr")){ 
    install.packages("cricketr",lib = "c:/test") 
} 
library(cricketr)

or from Github

library(devtools)
install_github("tvganesh/cricketr")
library(cricketr)

Box Histogram Plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency. The plots below indicate that Tendulkar’s average is the highest. He is followed by Dravid, Gavaskar and then Ganguly.

batsmanPerfBoxHist("./tendulkar.csv","Sachin Tendulkar")
tkps-boxhist-1
batsmanPerfBoxHist("./dravid.csv","Rahul Dravid")
tkps-boxhist-2
batsmanPerfBoxHist("./ganguly.csv","Sourav Ganguly")
tkps-boxhist-3
batsmanPerfBoxHist("./gavaskar.csv","Sunil Gavaskar")
tkps-boxhist-4

Relative Mean Strike Rate

This first plot shows the Mean Strike Rate of the batsmen. Tendulkar leads in Mean Strike Rate for scores in the range 100-180. Ganguly has a very good Mean Strike Rate for scores in the range 40-80.

frames <- list("./tendulkar.csv","./dravid.csv","ganguly.csv","gavaskar.csv")
names <- list("Tendulkar","Dravid","Ganguly","Gavaskar")
relativeBatsmanSR(frames,names)

plot-1-1
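
As a rough illustration of what a mean strike rate per runs range means, the sketch below computes the strike rate (runs per 100 balls faced) for each innings and averages it within buckets of runs. The data frame `batting` is invented, and the bucket width of 20 runs is an assumption for this example rather than the package's setting.

```r
# Hypothetical innings (illustrative only)
batting <- data.frame(
  Runs = c(5, 12, 18, 25, 33, 47, 52, 68, 75, 104),
  BF   = c(10, 30, 36, 50, 60, 94, 80, 136, 125, 160)
)

# Strike rate = runs scored per 100 balls faced
batting$SR <- 100 * batting$Runs / batting$BF

# Bucket the runs into ranges of 20 and average the strike rate in each range
batting$Range <- cut(batting$Runs, breaks = seq(0, 120, by = 20), right = FALSE)
meanSR <- tapply(batting$SR, batting$Range, mean)
print(round(meanSR, 1))
```

Ranges with no innings come out as NA, which is why a relative plot of several batsmen can have gaps at the higher scores.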

Relative Runs Frequency Percentage

The plot below shows the percentage contribution in each 10-run bucket over the entire career. The percentage Runs Frequency is fairly close, but Gavaskar seems to lead most of the way.

frames <- list("./tendulkar.csv","./dravid.csv","ganguly.csv","gavaskar.csv")
names <- list("Tendulkar","Dravid","Ganguly","Gavaskar")
relativeRunsFreqPerf(frames,names)

plot-2-1

Moving Average of runs over career

The moving averages for the 4 batsmen indicate the following:

– Tendulkar’s and Ganguly’s careers show a downward trend, and their retirements didn’t come too soon
– Dravid’s and Gavaskar’s careers definitely show an upswing. They probably had a year or two left.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanMovingAverage("./tendulkar.csv","Tendulkar")
batsmanMovingAverage("./dravid.csv","Dravid")
batsmanMovingAverage("./ganguly.csv","Ganguly")
batsmanMovingAverage("./gavaskar.csv","Gavaskar")

tdsg-ma-1

dev.off()
## null device 
##           1
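
To make the idea of a career moving average concrete, here is a minimal sketch using a simple trailing mean. The vector `runs` is made up and the window of 4 innings is an arbitrary assumption. batsmanMovingAverage() may well use a different smoother, so treat this as a stand-in rather than the package's method.

```r
# Hypothetical runs per innings in chronological order (illustrative only)
runs <- c(10, 45, 0, 78, 23, 101, 5, 60, 34, 88, 12, 56)

# Trailing moving average over a window of k innings (k is an assumed value)
k <- 4
movingAvg <- as.numeric(stats::filter(runs, rep(1 / k, k), sides = 1))
print(round(movingAvg, 2))
```

The first k - 1 values are NA because a full window is not yet available. A rising tail in the smoothed series corresponds to the upswing described above, and a falling tail to a decline.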

Runs forecast

The forecast for the batsmen is shown below. The plots indicate that only Tendulkar seemed to maintain consistency over the period, while the rest seem to score less than their forecast runs in the last 10% of their careers.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfForecast("./tendulkar.csv","Sachin Tendulkar")
batsmanPerfForecast("./dravid.csv","Rahul Dravid")
batsmanPerfForecast("./ganguly.csv","Sourav Ganguly")
batsmanPerfForecast("./gavaskar.csv","Sunil Gavaskar")

tdsg-perf-1

dev.off()
## null device 
##           1

Check for batsman in-form/out-of-form

The following snippet checks whether the batsman is in-form or out-of-form during the last 10% of the innings of his career. The null hypothesis (H0) is that the batsman is in-form; the alternative hypothesis (Ha) is that he is out-of-form. The population is taken as the first 90% of the career and the last 10% as the sample. A lower-tail test then checks whether the sample mean falls below the 95% confidence interval of the population mean. If the p-value is less than 0.05, the null hypothesis is rejected and the batsman is considered out-of-form.

The computation shows that Tendulkar was out-of-form while the others weren’t. While Dravid’s and Gavaskar’s moving averages do show an upward trend, the surprise is Ganguly. It could be that Ganguly was able to keep his average in the last 10% within the 95% confidence interval. It has to be noted that Ganguly’s average was much lower than Tendulkar’s.

checkBatsmanInForm("./tendulkar.csv","Tendulkar")
## *******************************************************************************************
## 
## Population size: 294  Mean of population: 50.48 
## Sample size: 33  Mean of sample: 32.42 SD of sample: 29.8 
## 
## Null hypothesis H0 : Tendulkar 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Tendulkar 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Tendulkar 's Form Status: Out-of-Form because the p value: 0.000713  is less than alpha=  0.05"
## *******************************************************************************************
checkBatsmanInForm("./dravid.csv","Dravid")
## *******************************************************************************************
## 
## Population size: 256  Mean of population: 46.98 
## Sample size: 29  Mean of sample: 43.48 SD of sample: 40.89 
## 
## Null hypothesis H0 : Dravid 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Dravid 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Dravid 's Form Status: In-Form because the p value: 0.324138  is greater than alpha=  0.05"
## *******************************************************************************************
checkBatsmanInForm("./ganguly.csv","Ganguly")
## *******************************************************************************************
## 
## Population size: 169  Mean of population: 38.94 
## Sample size: 19  Mean of sample: 33.21 SD of sample: 32.97 
## 
## Null hypothesis H0 : Ganguly 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Ganguly 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Ganguly 's Form Status: In-Form because the p value: 0.229006  is greater than alpha=  0.05"
## *******************************************************************************************
checkBatsmanInForm("./gavaskar.csv","Gavaskar")
## *******************************************************************************************
## 
## Population size: 125  Mean of population: 44.67 
## Sample size: 14  Mean of sample: 57.86 SD of sample: 58.55 
## 
## Null hypothesis H0 : Gavaskar 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Gavaskar 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Gavaskar 's Form Status: In-Form because the p value: 0.793276  is greater than alpha=  0.05"
## *******************************************************************************************
dev.off()
## null device 
##           1
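
The form check can be sketched from the summary statistics printed above. The sketch assumes a one-sided (lower-tail), one-sample t-test of the sample mean against the population mean. The helper names `lowerTailP` and `formStatus` are hypothetical, and the package's internal computation may differ in detail.

```r
# Lower-tail p-value of a one-sample t-test, computed from summary statistics.
# Assumption: this mirrors the check described in the post, not the package code.
lowerTailP <- function(popMean, sampleMean, sampleSD, n) {
  tStat <- (sampleMean - popMean) / (sampleSD / sqrt(n))
  pt(tStat, df = n - 1)  # P(T <= tStat) under the null hypothesis
}

# Summary statistics copied from the checkBatsmanInForm() output above
pTendulkar <- lowerTailP(50.48, 32.42, 29.80, 33)
pDravid    <- lowerTailP(46.98, 43.48, 40.89, 29)
pGanguly   <- lowerTailP(38.94, 33.21, 32.97, 19)
pGavaskar  <- lowerTailP(44.67, 57.86, 58.55, 14)

# Reject H0 (in-form) when the p-value is below alpha
alpha <- 0.05
formStatus <- function(p) if (p < alpha) "Out-of-Form" else "In-Form"

print(sapply(c(Tendulkar = pTendulkar, Dravid = pDravid,
               Ganguly = pGanguly, Gavaskar = pGavaskar), formStatus))
```

Under this assumption only Tendulkar falls below alpha = 0.05, which agrees with the form status reported for the four batsmen.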

3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./tendulkar.csv","Tendulkar")
battingPerf3d("./dravid.csv","Dravid")

plot-3-1

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./ganguly.csv","Ganguly")
battingPerf3d("./gavaskar.csv","Gavaskar")

plot-4-1

dev.off()
## null device 
##           1

Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls Faced + Minutes at Crease.

BF <- seq( 10, 400,length=15)
Mins <- seq(30,600,length=15)
newDF <- data.frame(BF,Mins)
tendulkar <- batsmanRunsPredict("./tendulkar.csv","Tendulkar",newdataframe=newDF)
dravid <- batsmanRunsPredict("./dravid.csv","Dravid",newdataframe=newDF)
ganguly <- batsmanRunsPredict("./ganguly.csv","Ganguly",newdataframe=newDF)
gavaskar <- batsmanRunsPredict("./gavaskar.csv","Gavaskar",newdataframe=newDF)

The fitted model is then used to predict the runs that the batsmen will score for a given Balls Faced and Minutes at Crease. It can be seen that Tendulkar has much higher predicted runs than all of the others.

Tendulkar is followed by Ganguly who we saw earlier had a very good strike rate. However it must be noted that Dravid and Gavaskar have a better average.

batsmen <-cbind(round(tendulkar$Runs),round(dravid$Runs),round(ganguly$Runs),round(gavaskar$Runs))
colnames(batsmen) <- c("Tendulkar","Dravid","Ganguly","Gavaskar")
newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
colnames(newDF) <- c("BallsFaced","MinsAtCrease")
predictedRuns <- cbind(newDF,batsmen)
predictedRuns
##    BallsFaced MinsAtCrease Tendulkar Dravid Ganguly Gavaskar
## 1          10           30         7      1       7        4
## 2          38           71        23     14      21       17
## 3          66          111        39     27      35       30
## 4          94          152        54     40      50       43
## 5         121          193        70     54      64       56
## 6         149          234        86     67      78       69
## 7         177          274       102     80      93       82
## 8         205          315       118     94     107       95
## 9         233          356       134    107     121      108
## 10        261          396       150    120     136      121
## 11        289          437       165    134     150      134
## 12        316          478       181    147     165      147
## 13        344          519       197    160     179      160
## 14        372          559       213    173     193      173
## 15        400          600       229    187     208      186
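
A table like the one above could be produced along the following lines. This sketch fits a plain linear model with lm() on synthetic innings; the data frame `innings` and its coefficients are invented for illustration, and batsmanRunsPredict() is not guaranteed to work exactly this way.

```r
# Hypothetical innings data (illustrative only; not read from a player's CSV)
set.seed(1)
BF   <- round(runif(50, 5, 300))
Mins <- round(BF * 1.5 + rnorm(50, 0, 10))
Runs <- round(0.5 * BF + 0.05 * Mins + rnorm(50, 0, 5))
innings <- data.frame(BF, Mins, Runs)

# Fit the regression plane Runs ~ BF + Mins
fit <- lm(Runs ~ BF + Mins, data = innings)

# Predict runs on the same grid of Balls Faced and Minutes used in the post
newDF <- data.frame(BF   = seq(10, 400, length.out = 15),
                    Mins = seq(30, 600, length.out = 15))
newDF$Runs <- predict(fit, newdata = newDF)
print(round(newDF))
```

Because the grid increases both Balls Faced and Minutes together, the predicted runs rise steadily down the table, just as in the output above.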

Contribution to matches won and lost

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanContributionWonLost(35320,"Tendulkar")
batsmanContributionWonLost(28114,"Dravid")
batsmanContributionWonLost(28779,"Ganguly")
batsmanContributionWonLost(28794,"Gavaskar")

tdgg-1

Home and overseas performance

From the plot below, Tendulkar and Dravid have played a lot more matches both at home and abroad, and their performance has been good both at home and overseas. Tendulkar has the best performance home and abroad and is consistent all across. Dravid is also consistent at all venues. Gavaskar played fewer matches than Tendulkar and Dravid. The range of runs at home is higher than overseas; however, the average is consistent both at home and abroad. Finally we have Ganguly.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfHomeAway(35320,"Tendulkar")
batsmanPerfHomeAway(28114,"Dravid")
batsmanPerfHomeAway(28779,"Ganguly")
batsmanPerfHomeAway(28794,"Gavaskar")
tdgg-ha-1

Average runs at ground and against opposition

Tendulkar averages above 50 runs against Sri Lanka, Bangladesh, West Indies and Zimbabwe. His averages against Australia and England are very close to 50. Sydney, Port Elizabeth, Bloemfontein and Colombo are great hunting grounds for Tendulkar.

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./tendulkar.csv","Tendulkar")
batsmanAvgRunsOpposition("./tendulkar.csv","Tendulkar")
avgrg-1-1
dev.off()
## null device 
##           1

Dravid plundered runs at Adelaide, Georgetown, the Oval, Hamilton etc. Dravid averages above 50 against England, Bangladesh, New Zealand, Pakistan, West Indies and Zimbabwe.

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./dravid.csv","Dravid")
batsmanAvgRunsOpposition("./dravid.csv","Dravid")
avgrg-2-1
dev.off()
## null device 
##           1

Ganguly has good performance at the Oval, Rawalpindi, Johannesburg and Kandy. Ganguly averages 50 runs against England and Bangladesh.

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./ganguly.csv","Ganguly")
batsmanAvgRunsOpposition("./ganguly.csv","Ganguly")
avgrg-3-1
dev.off()
## null device 
##           1

The Oval, Sydney, Perth, Melbourne, Brisbane and Manchester are happy hunting grounds for Gavaskar. Gavaskar averages around 50 runs against Australia, Pakistan, Sri Lanka and West Indies.

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./gavaskar.csv","Gavaskar")
batsmanAvgRunsOpposition("./gavaskar.csv","Gavaskar")
avgrg-4-1
dev.off()
## null device 
##           1

Key findings

Here are some key conclusions

  1. Tendulkar has the highest average among the 4. He is followed by Dravid, Gavaskar and Ganguly.
  2. Tendulkar’s predicted performance for a given number of Balls Faced and Minutes at Crease is superior to the rest
  3. Dravid averages above 50 against 6 countries
  4. West Indies and Australia are Gavaskar’s favorite batting grounds
  5. Ganguly has a very good Mean Strike Rate for the range 40-80 and Tendulkar from 100-180
  6. In home and overseas performance, Tendulkar is the best. Dravid and Gavaskar also have good performance overseas.
  7. Dravid and Gavaskar probably retired a year or two early while Tendulkar and Ganguly’s time was clearly up

Final thoughts

Tendulkar is clearly the greatest batsman India has produced as he leads in almost all aspects of batting – number of centuries, strike rate, predicted runs and home and overseas performance. Dravid follows Tendulkar with 48 centuries, consistent performance home and overseas and a career that was still green. Gavaskar has fewer matches than the rest, but his performance overseas is very good in those helmetless times. Finally we have Ganguly.

Dravid and Gavaskar had a few more years of great batting while Tendulkar and Ganguly’s career was on a decline.

Note: It is really not fair to include Gavaskar in the analysis as he played in a different era when helmets were not used, even against the fiery pace of Thomson, Lillee, Roberts, Holding etc. In addition, Gavaskar did not play against some of the newer countries like Bangladesh and Zimbabwe where he could have amassed runs. Yet I wanted to include him, and his performance is clearly excellent.

Also see my other posts in R

  1. A peek into literacy in India: Statistical Learning with R
  2. A crime map of India in R – Crimes against women
  3. Analyzing cricket’s batting legends – Through the mirage with R
  4. Masters of Spin: Unraveling the web with R
  5. Mirror, mirror . the best batsman of them all?

You may also like

  1. A crime map of India in R: Crimes against women
  2. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
  3. Bend it like Bluemix, MongoDB with autoscaling – Part 2
  4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
  5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data
  6. Deblurring with OpenCV:Weiner filter reloaded

Introducing cricketr! : An R package to analyze performances of cricketers

Yet all experience is an arch wherethro’
Gleams that untravell’d world whose margin fades
For ever and forever when I move.
How dull it is to pause, to make an end,
To rust unburnish’d, not to shine in use!

Ulysses by Alfred Tennyson

Introduction

This is an initial post in which I introduce a cricketing package ‘cricketr’ which I have created. This package was a natural culmination of my earlier posts on cricket and my finishing 10 modules of the Data Science Specialization from Johns Hopkins University at Coursera. The thought of creating this package struck me some time back, and I have finally been able to bring this to fruition.

So here it is. My R package ‘cricketr!!!’

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package can handle all formats of the game including Test, ODI and Twenty20 cricket.

You should be able to install the package from CRAN and use  many of the functions available in the package. Please be mindful of  ESPN Cricinfo Terms of Use

(Note: This page is also hosted as a GitHub page at cricketr and also at RPubs as cricketr: A R package for analyzing performances of cricketers)

You can download this analysis as a PDF file from Introducing cricketr

Note: If you would like to do a similar analysis for a different set of batsmen and bowlers, you can clone/download my skeleton cricketr template from Github (which is the R Markdown file I have used for the analysis below). You will only need to make appropriate changes for the players you are interested in. Only a familiarity with R and R Markdown is needed.

You can clone the cricketr code from Github at cricketr

(Take a look at my short video tutorial on my R package cricketr on Youtube – R package cricketr – A short tutorial)

Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar

Please look at my recent post, which includes updates to this post and 8 new functions added to the cricketr package: “Re-introducing cricketr: An R package to analyze the performances of cricketers”

Important note: Do check out the python avatar of cricketr, ‘cricpy’, in my post ‘Introducing cricpy:A python package to analyze performances of cricketers’

 The cricketr package

The cricketr package has several functions that perform different analyses on both batsmen and bowlers. The package has functions that plot the percentage frequency of runs or wickets, runs likelihood for a batsman, relative run/strike rates of batsmen, and relative performance/economy rate for bowlers.

Other interesting functions include batting performance moving average, forecast and a function to check whether the batsman/bowler is in-form or out-of-form.

The data for a particular player can be obtained with the getPlayerData() function from the package. To do this you will need to go to ESPN CricInfo Player and type in the name of the player, e.g. Ricky Ponting, Sachin Tendulkar etc. This will bring up a page which has the profile number for the player, e.g. for Sachin Tendulkar this would be http://www.espncricinfo.com/india/content/player/35320.html. Hence, Sachin’s profile is 35320. This can be used to get the data for Tendulkar as shown below.

The cricketr package is now available from  CRAN!!!.  You should be able to install directly with

if (!require("cricketr")){ 
    install.packages("cricketr",lib = "c:/test") 
} 
library(cricketr)
?getPlayerData
## 
## getPlayerData(profile, opposition='', host='', dir='./data', file='player001.csv', type='batting', homeOrAway=[1, 2], result=[1, 2, 4], create=True)
##     Get the player data from ESPN Cricinfo based on specific inputs and store in a file in a given directory
##     
##     Description
##     
##     Get the player data given the profile of the batsman. The allowed inputs are home,away or both and won,lost or draw of matches. The data is stored in a .csv file in a directory specified. This function also returns a data frame of the player
##     
##     Usage
##     
##     getPlayerData(profile,opposition="",host="",dir="./data",file="player001.csv",
##     type="batting", homeOrAway=c(1,2),result=c(1,2,4))
##     Arguments
##     
##     profile     
##     This is the profile number of the player to get data. This can be obtained from http://www.espncricinfo.com/ci/content/player/index.html. Type the name of the player and click search. This will display the details of the player. Make a note of the profile ID. For e.g For Sachin Tendulkar this turns out to be http://www.espncricinfo.com/india/content/player/35320.html. Hence the profile for Sachin is 35320
##     opposition  
##     The numerical value of the opposition country e.g.Australia,India, England etc. The values are Australia:2,Bangladesh:25,England:1,India:6,New Zealand:5,Pakistan:7,South Africa:3,Sri Lanka:8, West Indies:4, Zimbabwe:9
##     host        
##     The numerical value of the host country e.g.Australia,India, England etc. The values are Australia:2,Bangladesh:25,England:1,India:6,New Zealand:5,Pakistan:7,South Africa:3,Sri Lanka:8, West Indies:4, Zimbabwe:9
##     dir 
##     Name of the directory to store the player data into. If not specified the data is stored in a default directory "./data". Default="./data"
##     file        
##     Name of the file to store the data into for e.g. tendulkar.csv. This can be used for subsequent functions. Default="player001.csv"
##     type        
##     type of data required. This can be "batting" or "bowling"
##     homeOrAway  
##     This is a vector with either 1,2 or both. 1 is for home 2 is for away
##     result      
##     This is a vector that can take values 1,2,4. 1 - won match 2- lost match 4- draw
##     Details
##     
##     More details can be found in my short video tutorial in Youtube https://www.youtube.com/watch?v=q9uMPFVsXsI
##     
##     Value
##     
##     Returns the player's dataframe
##     
##     Note
##     
##     Maintainer: Tinniam V Ganesh <tvganesh.85@gmail.com>
##     
##     Author(s)
##     
##     Tinniam V Ganesh
##     
##     References
##     
##     http://www.espncricinfo.com/ci/content/stats/index.html
##     https://gigadom.wordpress.com/
##     
##     See Also
##     
##     getPlayerDataSp
##     
##     Examples
##     
##     ## Not run: 
##     # Both home and away. Result = won,lost and drawn
##     tendulkar = getPlayerData(35320,dir=".", file="tendulkar1.csv",
##     type="batting", homeOrAway=c(1,2),result=c(1,2,4))
##     
##     # Only away. Get data only for won and lost innings
##     tendulkar = getPlayerData(35320,dir=".", file="tendulkar2.csv",
##     type="batting",homeOrAway=c(2),result=c(1,2))
##     
##     # Get bowling data and store in file for future
##     kumble = getPlayerData(30176,dir=".",file="kumble1.csv",
##     type="bowling",homeOrAway=c(1),result=c(1,2))
##     
##     #Get the Tendulkar's Performance against Australia in Australia
##     tendulkar = getPlayerData(35320, opposition = 2,host=2,dir=".", 
##     file="tendulkarVsAusInAus.csv",type="batting")

The cricketr package includes some pre-packaged sample (.csv) files. You can use these samples to test the functions, as shown below

# Retrieve the file path of a data file installed with cricketr
pathToFile ,"Sachin Tendulkar")

unnamed-chunk-2-1

Alternatively, the cricketr package can be installed from GitHub with

if (!require("cricketr")){ 
    library(devtools) 
    install_github("tvganesh/cricketr") 
}
library(cricketr)

The pre-packaged files can be accessed as shown above.
To get the data of any player use the function getPlayerData()

tendulkar <- getPlayerData(35320,dir="..",file="tendulkar.csv",type="batting",homeOrAway=c(1,2),
                           result=c(1,2,4))

Important Note: This needs to be done only once for a player. This function stores the player’s data in a CSV file (e.g. tendulkar.csv as above), which can then be reused by all other functions. Once we have the data for the players, many analyses can be done. This post uses the stored CSV files obtained with a prior getPlayerData() call for all subsequent analyses

Sachin Tendulkar’s performance – Basic Analyses

The 3 plots below provide the following for Tendulkar

  1. Frequency percentage of runs in each run range over the whole career
  2. Mean Strike Rate for runs scored in the given range
  3. A histogram of runs frequency percentages in runs ranges
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsFreqPerf("./tendulkar.csv","Sachin Tendulkar")
batsmanMeanStrikeRate("./tendulkar.csv","Sachin Tendulkar")
batsmanRunsRanges("./tendulkar.csv","Sachin Tendulkar")

tendulkar-batting-1

dev.off()
## null device 
##           1
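
To make the first two plots more concrete, here is a minimal base R sketch of how a runs frequency percentage and a mean strike rate per runs range could be computed. The innings values and the bucket width of 50 are made up for illustration, and this is not necessarily how cricketr computes them internally.

```r
# Hypothetical innings (not real career data): runs scored and balls faced
runs  <- c(0, 4, 12, 18, 25, 37, 41, 56, 68, 103, 9, 15, 33, 72, 150)
balls <- c(3, 10, 30, 35, 60, 70, 95, 110, 140, 190, 20, 28, 66, 150, 280)

# Strike rate = runs scored per 100 balls faced
sr <- runs / balls * 100

# Bucket the runs into ranges [0,50), [50,100), [100,150), [150,200)
breaks <- seq(0, 200, by = 50)
range_bucket <- cut(runs, breaks = breaks, right = FALSE)

# Percentage of innings that fall into each runs range
freq_pct <- prop.table(table(range_bucket)) * 100

# Mean strike rate of the innings within each runs range
mean_sr <- tapply(sr, range_bucket, mean)

print(round(freq_pct, 1))
print(round(mean_sr, 1))
```

The same two vectors, freq_pct and mean_sr, are what a bar plot or line plot per range would be drawn from.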

More analyses

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./tendulkar.csv","Tendulkar")
batsman6s("./tendulkar.csv","Tendulkar")
batsmanDismissals("./tendulkar.csv","Tendulkar")

tendulkar-4s6sout-1

 

3D scatter plot and prediction plane

The plots below show the 3D scatter plot of Sachin’s Runs versus Balls Faced and Minutes at crease. A linear regression model is then fitted between Runs and Balls Faced + Minutes at crease

battingPerf3d("./tendulkar.csv","Sachin Tendulkar")

tendulkar-3d-1

Average runs at different venues

The plot below gives the average runs scored by Tendulkar at different grounds. The plot also displays the number of innings at each ground as a label on the x-axis. It can be seen that Tendulkar did great in Colombo (SSC) and Melbourne for matches overseas, and in Mumbai, Mohali and Bangalore at home

batsmanAvgRunsGround("./tendulkar.csv","Sachin Tendulkar")
tendulkar-avggrd-1

Average runs against different opposing teams

This plot computes the average runs scored by Tendulkar against different countries. The x-axis also gives the number of innings against each team

batsmanAvgRunsOpposition("./tendulkar.csv","Tendulkar")
tendulkar-avgopn-1
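
As a rough illustration of the grouping behind these two plots, the sketch below computes the average runs and the number of innings per opposition with base R. The data frame and its column names (Opposition, Runs) are hypothetical and are not taken from the cricketr data format.

```r
# Hypothetical innings (not real data): opposition and runs scored
innings <- data.frame(
  Opposition = c("Australia", "England", "Australia", "Pakistan", "England", "Australia"),
  Runs       = c(114, 10, 45, 194, 68, 0)
)

# Average runs against each opposition
avg <- aggregate(Runs ~ Opposition, data = innings, FUN = mean)

# Number of innings against each opposition, aligned with the rows of avg
counts <- table(innings$Opposition)
avg$Innings <- as.vector(counts[as.character(avg$Opposition)])

print(avg)
```

Grouping by ground instead of opposition would only change the grouping column.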

Highest Runs Likelihood

The plot below shows the Runs Likelihood for a batsman. For this, Sachin’s performance is plotted as a 3D scatter plot of Runs versus Balls Faced and Minutes at crease. The centroids of 3 clusters are then computed using K-Means and plotted. These centroids represent Sachin Tendulkar’s highest scoring tendencies

batsmanRunsLikelihood("./tendulkar.csv","Sachin Tendulkar")

tendulkar-kmeans-1

## Summary of  Sachin Tendulkar 's runs scoring likelihood
## **************************************************
## 
## There is a 16.51 % likelihood that Sachin Tendulkar  will make  139 Runs in  251 balls over 353  Minutes 
## There is a 58.41 % likelihood that Sachin Tendulkar  will make  16 Runs in  31 balls over  44  Minutes 
## There is a 25.08 % likelihood that Sachin Tendulkar  will make  66 Runs in  122 balls over 167  Minutes
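
The percentages in the summary above add up to 100, which is consistent with reading each likelihood as the share of innings that falls in a cluster. The sketch below illustrates that idea with base R’s kmeans() on synthetic innings. The data is invented, and the package’s internal steps may differ.

```r
set.seed(1)

# Synthetic innings (not real data): short, medium and long stays at the crease
BF   <- c(rnorm(60, 30, 10), rnorm(25, 120, 20), rnorm(15, 250, 30))
Mins <- BF * 1.4 + rnorm(100, 0, 10)
Runs <- BF * 0.55 + rnorm(100, 0, 8)
innings <- data.frame(Runs, BF, Mins)

# K-Means with 3 clusters; nstart restarts reduce the effect of a bad initialisation
fit <- kmeans(innings, centers = 3, nstart = 20)

# Share of innings in each cluster, read here as a "likelihood"
likelihood <- fit$size / nrow(innings) * 100
centroids  <- fit$centers

for (i in seq_len(nrow(centroids))) {
  cat(sprintf("%.2f %% likelihood of about %.0f runs in %.0f balls over %.0f minutes\n",
              likelihood[i], centroids[i, "Runs"], centroids[i, "BF"], centroids[i, "Mins"]))
}
```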

A look at the Top 4 batsmen – Tendulkar, Kallis, Ponting and Sangakkara

The batsmen with the most hundreds in test cricket are

  1. Sachin Tendulkar: Average: 53.78, 100’s – 51, 50’s – 68
  2. Jacques Kallis: Average: 55.47, 100’s – 45, 50’s – 58
  3. Ricky Ponting: Average: 51.85, 100’s – 41, 50’s – 62
  4. Kumar Sangakkara: Average: 58.04, 100’s – 38, 50’s – 52

in that order.

The following plots take a closer look at their performances. The box plots show the mean (red line) and median (blue line). The two ends of the box display the 25th and 75th percentiles.

Box Histogram Plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency. The calculated means differ from the stated means, possibly because of data cleaning. I am also not sure how ESPN Cricinfo arrives at its means, e.g. how not-out innings are treated.

batsmanPerfBoxHist("./tendulkar.csv","Sachin Tendulkar")

tkps-boxhist-1

batsmanPerfBoxHist("./kallis.csv","Jacques Kallis")

tkps-boxhist-2

batsmanPerfBoxHist("./ponting.csv","Ricky Ponting")

tkps-boxhist-3

batsmanPerfBoxHist("./sangakkara.csv","K Sangakkara")

tkps-boxhist-4

Contribution to won and lost matches

The plot below shows the contribution of Tendulkar, Kallis, Ponting and Sangakkara in matches won and lost. The plots show the range of runs scored as a boxplot (25th & 75th percentile) and the mean runs scored. The total matches won and lost are also printed in the plot.

All the players have scored more in the matches they won than in the matches they lost. Ricky Ponting seems to have more won matches to his credit than the others. This could also be because he was a member of a strong Australian team

For the next 2 functions below you will have to use the getPlayerDataSp() function. I have commented this out as I already have these files

tendulkarsp 
par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanContributionWonLost("tendulkarsp.csv","Tendulkar")
batsmanContributionWonLost("kallissp.csv","Kallis")
batsmanContributionWonLost("pontingsp.csv","Ponting")
batsmanContributionWonLost("sangakkarasp.csv","Sangakkara")

tkps-wonlost-1

dev.off()
## null device 
##           1

Performance at home and overseas

From the plot below it can be seen that Tendulkar has more matches overseas than at home, and his performance is consistent across venues at home and abroad. Ponting has fewer innings than Tendulkar and an equally good performance at home and overseas. Kallis and Sangakkara’s performance abroad is lower than their performance at home.

This function also requires the use of getPlayerDataSp() as shown above

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfHomeAway("tendulkarsp.csv","Tendulkar")
batsmanPerfHomeAway("kallissp.csv","Kallis")
batsmanPerfHomeAway("pontingsp.csv","Ponting")
batsmanPerfHomeAway("sangakkarasp.csv","Sangakkara")

tkps-homeaway-1

dev.off()
## null device 
##           1
 

Relative Mean Strike Rate plot

The plot below compares the Mean Strike Rate of the batsmen for each runs range of 10 and plots them. The plot indicates the following

  • Range 0 – 50 Runs – Ponting leads followed by Tendulkar
  • Range 50 – 100 Runs – Ponting followed by Sangakkara
  • Range 100 – 150 Runs – Ponting and then Tendulkar

frames <- list("./tendulkar.csv","./kallis.csv","ponting.csv","sangakkara.csv")
names <- list("Tendulkar","Kallis","Ponting","Sangakkara")
relativeBatsmanSR(frames,names)

tkps-relSR-1

Relative Runs Frequency plot

The plot below gives the relative Runs Frequency Percentages for each 10 run bucket. The plot shows that Sangakkara leads, followed by Ponting

frames <- list("./tendulkar.csv","./kallis.csv","ponting.csv","sangakkara.csv")
names <- list("Tendulkar","Kallis","Ponting","Sangakkara")
relativeRunsFreqPerf(frames,names)

tkps-relRunFreq-1

Moving Average of runs in career

Take a look at the Moving Average across the careers of the Top 4. Clearly, Kallis and Sangakkara had a few more years of great batting ahead of them; they seem to average around 50. Tendulkar and Ponting definitely show a slump in the later years

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanMovingAverage("./tendulkar.csv","Sachin Tendulkar")
batsmanMovingAverage("./kallis.csv","Jacques Kallis")
batsmanMovingAverage("./ponting.csv","Ricky Ponting")
batsmanMovingAverage("./sangakkara.csv","K Sangakkara")

tkps-ma-1

dev.off()
## null device 
##           1
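
For readers who want to see the idea behind a moving average of runs, here is a small base R sketch using stats::filter() with a centred 5-innings window. The runs are invented, and cricketr may well use a different smoother or window, so treat this only as an illustration of the concept.

```r
# Hypothetical runs per innings in chronological order (not real data)
runs <- c(10, 45, 0, 88, 23, 61, 5, 120, 34, 17, 76, 50, 2, 95, 41)

# Centred simple moving average over a window of 5 innings;
# the first and last 2 positions have no full window and are NA
window <- 5
ma <- stats::filter(runs, rep(1 / window, window), sides = 2)

print(round(as.numeric(ma), 1))
```

A wider window gives a smoother curve but reacts more slowly to a change in form.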

Future Runs forecast

Here are plots that forecast how the batsman will perform in future. In this case 90% of the career runs trend is used as the training set. The remaining 10% is the test set.

A Holt-Winters forecasting model is used to forecast future performance based on the 90% training set. The forecasted runs trend is plotted. The test set is also plotted to see how closely the forecast matches the actual runs

Take a look at the runs forecasted for the batsman below.

  • Tendulkar’s forecasted performance seems to tally with his actual performance with an average of 50
  • Kallis’s forecasted runs are higher than the actual runs he scored
  • Ponting seems to have a good run in the future
  • Sangakkara has a decent run in the future averaging 50 runs
par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfForecast("./tendulkar.csv","Sachin Tendulkar")
batsmanPerfForecast("./kallis.csv","Jacques Kallis")
batsmanPerfForecast("./ponting.csv","Ricky Ponting")
batsmanPerfForecast("./sangakkara.csv","K Sangakkara")

tkps-perffcst-1

dev.off()
## null device 
##           1
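
The 90%/10% split and the forecast can be sketched with base R’s HoltWinters() and predict(). The runs below are synthetic, and switching off the trend and seasonal components (beta = FALSE, gamma = FALSE) is an assumption made to keep the example small; the package may configure the model differently.

```r
set.seed(7)

# Synthetic runs per innings in chronological order (not real data)
runs <- pmax(0, round(rnorm(100, mean = 45, sd = 30)))

# First 90% of the innings is the training set, the last 10% the test set
n_train <- floor(0.9 * length(runs))
train <- ts(runs[1:n_train])
test  <- runs[(n_train + 1):length(runs)]

# Exponential smoothing without trend or seasonal components
fit <- HoltWinters(train, beta = FALSE, gamma = FALSE)

# Forecast as many innings ahead as there are in the test set
fc <- predict(fit, n.ahead = length(test))

# Mean absolute error of the forecast against the held-out innings
mae <- mean(abs(as.numeric(fc) - test))
cat("Smoothing parameter alpha:", round(as.numeric(fit$alpha), 3), "\n")
cat("Mean absolute error on the test set:", round(mae, 1), "\n")
```

Plotting the training series, the forecast and the test set on one chart gives the kind of comparison shown in the plots above.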

Check Batsman In-Form or Out-of-Form

The computation below uses Null Hypothesis testing and a p-value to determine if the batsman is in-form or out-of-form. For this, the first 90% of the career runs is taken as the population and its mean is computed. The last 10% is taken as the sample set, and the sample Mean and sample Standard Deviation are calculated.

The Null Hypothesis (H0) assumes that the batsman continues to stay in-form, i.e. the sample mean is within the 95% confidence interval of the population mean. The Alternative (Ha) assumes that the batsman is out of form, i.e. the sample mean is below the 95% confidence interval of the population mean.

A significance value of 0.05 is chosen and the p-value is computed

  • If p-value >= 0.05 – Batsman In-Form
  • If p-value < 0.05 – Batsman Out-of-Form

Note: Ideally the p-value should be computed for a population that follows the Normal Distribution. But the runs population is usually right skewed (many low scores and a long tail of high scores). So some correction may be needed. I will revisit this later

This is done for the Top 4 batsmen

checkBatsmanInForm("./tendulkar.csv","Sachin Tendulkar")
## *******************************************************************************************
## 
## Population size: 294  Mean of population: 50.48 
## Sample size: 33  Mean of sample: 32.42 SD of sample: 29.8 
## 
## Null hypothesis H0 : Sachin Tendulkar 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Sachin Tendulkar 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Sachin Tendulkar 's Form Status: Out-of-Form because the p value: 0.000713  is less than alpha=  0.05"
## *******************************************************************************************
checkBatsmanInForm("./kallis.csv","Jacques Kallis")
## *******************************************************************************************
## 
## Population size: 240  Mean of population: 47.5 
## Sample size: 27  Mean of sample: 47.11 SD of sample: 59.19 
## 
## Null hypothesis H0 : Jacques Kallis 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Jacques Kallis 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Jacques Kallis 's Form Status: In-Form because the p value: 0.48647  is greater than alpha=  0.05"
## *******************************************************************************************
checkBatsmanInForm("./ponting.csv","Ricky Ponting")
## *******************************************************************************************
## 
## Population size: 251  Mean of population: 47.5 
## Sample size: 28  Mean of sample: 36.25 SD of sample: 48.11 
## 
## Null hypothesis H0 : Ricky Ponting 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Ricky Ponting 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Ricky Ponting 's Form Status: In-Form because the p value: 0.113115  is greater than alpha=  0.05"
## *******************************************************************************************
checkBatsmanInForm("./sangakkara.csv","K Sangakkara")
## *******************************************************************************************
## 
## Population size: 193  Mean of population: 51.92 
## Sample size: 22  Mean of sample: 71.73 SD of sample: 82.87 
## 
## Null hypothesis H0 : K Sangakkara 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : K Sangakkara 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "K Sangakkara 's Form Status: In-Form because the p value: 0.862862  is greater than alpha=  0.05"
## *******************************************************************************************
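
The printed p-values appear to be close to what a one-sided (lower-tail) t-test of the sample mean against the population mean gives. The sketch below works this out from the summary figures printed above. The helper names form_p_value() and status() are hypothetical, and the choice of test is an inference from the output, not taken from the package source.

```r
# Lower-tail p-value of a one-sided t-test computed from summary statistics
form_p_value <- function(pop_mean, sample_mean, sample_sd, n) {
  t_stat <- (sample_mean - pop_mean) / (sample_sd / sqrt(n))
  pt(t_stat, df = n - 1)
}

alpha <- 0.05

# Figures taken from the printed output above
p_tendulkar <- form_p_value(pop_mean = 50.48, sample_mean = 32.42, sample_sd = 29.80, n = 33)
p_kallis    <- form_p_value(pop_mean = 47.50, sample_mean = 47.11, sample_sd = 59.19, n = 27)

# In-Form when the p-value is at least alpha, Out-of-Form otherwise
status <- function(p) if (p >= alpha) "In-Form" else "Out-of-Form"

cat("Tendulkar:", signif(p_tendulkar, 3), status(p_tendulkar), "\n")
cat("Kallis:   ", signif(p_kallis, 3), status(p_kallis), "\n")
```

Small differences from the printed values are expected because the inputs here are the rounded means and standard deviations.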

3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./tendulkar.csv","Tendulkar")
battingPerf3d("./kallis.csv","Kallis")
plot-3-1

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./ponting.csv","Ponting")
battingPerf3d("./sangakkara.csv","Sangakkara")
plot-4-1

dev.off()
## null device 
##           1

Predicting Runs given Balls Faced and Minutes at Crease

A multiple regression plane is fitted between Runs and Balls Faced + Minutes at crease. A sample sequence of Balls Faced (BF) and Minutes at crease (Mins) is set up as shown below. The fitted model is used to predict the runs for these values

BF <- seq( 10, 400,length=15)
Mins <- seq(30,600,length=15)
newDF <- data.frame(BF,Mins)
tendulkar <- batsmanRunsPredict("./tendulkar.csv","Tendulkar",newdataframe=newDF)
kallis <- batsmanRunsPredict("./kallis.csv","Kallis",newdataframe=newDF)
ponting <- batsmanRunsPredict("./ponting.csv","Ponting",newdataframe=newDF)
sangakkara <- batsmanRunsPredict("./sangakkara.csv","Sangakkara",newdataframe=newDF)

The fitted model is then used to predict the runs that the batsmen will score for a given Balls Faced and Minutes at crease. It can be seen that Ponting will score the highest for a given Balls Faced and Minutes at crease.

Ponting is followed by Tendulkar, who has Sangakkara close on his heels, and finally we have Kallis. This is intuitive as we have already seen that Ponting has the highest strike rate.

batsmen <-cbind(round(tendulkar$Runs),round(kallis$Runs),round(ponting$Runs),round(sangakkara$Runs))
colnames(batsmen) <- c("Tendulkar","Kallis","Ponting","Sangakkara")
newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
colnames(newDF) <- c("BallsFaced","MinsAtCrease")
predictedRuns <- cbind(newDF,batsmen)
predictedRuns
##    BallsFaced MinsAtCrease Tendulkar Kallis Ponting Sangakkara
## 1          10           30         7      6       9          2
## 2          38           71        23     20      25         18
## 3          66          111        39     34      42         34
## 4          94          152        54     48      59         50
## 5         121          193        70     62      76         66
## 6         149          234        86     76      93         82
## 7         177          274       102     90     110         98
## 8         205          315       118    104     127        114
## 9         233          356       134    118     144        130
## 10        261          396       150    132     161        146
## 11        289          437       165    146     178        162
## 12        316          478       181    159     194        178
## 13        344          519       197    173     211        194
## 14        372          559       213    187     228        210
## 15        400          600       229    201     245        226
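
For a sense of what fitting and predicting with such a regression plane involves, here is a self-contained base R sketch using lm() and predict(). The innings are synthetic and the column names (Runs, BF, Mins) simply mirror the ones used above. It is an illustration under those assumptions, not the package’s implementation.

```r
set.seed(3)

# Synthetic innings (not real data): runs grow with balls faced and minutes at crease
BF   <- round(runif(80, 5, 350))
Mins <- round(BF * 1.5 + rnorm(80, 0, 15))
Runs <- round(0.5 * BF + 0.03 * Mins + rnorm(80, 0, 10))
innings <- data.frame(Runs, BF, Mins)

# Fit the regression plane Runs ~ BF + Mins
fit <- lm(Runs ~ BF + Mins, data = innings)

# 15 evenly spaced (BF, Mins) pairs, in the same style as the grid above
newDF <- data.frame(BF   = seq(10, 400, length.out = 15),
                    Mins = seq(30, 600, length.out = 15))

# Predicted runs for each (BF, Mins) pair
newDF$Runs <- round(predict(fit, newdata = newDF))
print(newDF)
```

Because Balls Faced and Minutes at crease are strongly correlated, the individual coefficients are hard to interpret on their own; the predictions along a realistic grid are the more useful output.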


Analysis of Top 3 wicket takers

The top 3 wicket takers in test history are

  1. M Muralitharan: Wickets: 800, Average = 22.72, Economy Rate – 2.47
  2. Shane Warne: Wickets: 708, Average = 25.41, Economy Rate – 2.65
  3. Anil Kumble: Wickets: 619, Average = 29.65, Economy Rate – 2.69

How do Anil Kumble, Shane Warne and M Muralitharan compare with one another with respect to wickets taken and Economy Rate? The next set of plots computes and plots precisely these analyses.

Wicket Frequency Plot

The plot below computes the percentage frequency of the number of wickets taken, e.g. 1 wicket x%, 2 wickets y% etc., and plots them as a continuous line

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
bowlerWktsFreqPercent("./kumble.csv","Anil Kumble")
bowlerWktsFreqPercent("./warne.csv","Shane Warne")
bowlerWktsFreqPercent("./murali.csv","M Muralitharan")

relBowlFP-1

dev.off()
## null device 
##           1

Wickets Runs plot

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
bowlerWktsRunsPlot("./kumble.csv","Kumble")
bowlerWktsRunsPlot("./warne.csv","Warne")
bowlerWktsRunsPlot("./murali.csv","Muralitharan")
wktsrun-1
dev.off()
## null device 
##           1

Average wickets at different venues

The plot gives the average wickets taken by Muralitharan at different venues. Muralitharan has taken an average of 8 and 6 wickets at Oval & Wellington respectively in 2 different innings. His best performances are at Kandy and Colombo (SSC)

bowlerAvgWktsGround("./murali.csv","Muralitharan")
avgWktshrg-1

Average wickets against different opposition

The plot gives the average wickets taken by Muralitharan against different countries. The x-axis also includes the number of innings against each team

bowlerAvgWktsOpposition("./murali.csv","Muralitharan")
avgWktoppn-1

Relative Wickets Frequency Percentage

The Relative Wickets Percentage plot shows that M Muralitharan has a large percentage of wickets in the 3-8 wicket range

frames <- list("./kumble.csv","./murali.csv","warne.csv")
names <- list("Anil Kumble","M Muralitharan","Shane Warne")
relativeBowlingPerf(frames,names)

relBowlPerf-1

Relative Economy Rate against wickets taken

Clearly from the plot below it can be seen that Muralitharan has the best Economy Rate among the three

frames <- list("./kumble.csv","./murali.csv","warne.csv")
names <- list("Anil Kumble","M Muralitharan","Shane Warne")
relativeBowlingER(frames,names)

relBowlER-1

Wickets taken moving average

From the plot below it can be seen that

  1. Shane Warne’s performance at the time of his retirement was still at a peak of 3 wickets
  2. M Muralitharan seems to have become less effective over time, with his peak years being 2004-2006
  3. Anil Kumble also seems to slump and become less effective.

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
bowlerMovingAverage("./kumble.csv","Anil Kumble")
bowlerMovingAverage("./warne.csv","Shane Warne")
bowlerMovingAverage("./murali.csv","M Muralitharan")

tkps-bowlma-1

dev.off()
## null device 
##           1

Future Wickets forecast

Here are plots that forecast how the bowler will perform in future. In this case 90% of the career wickets trend is used as the training set. The remaining 10% is the test set.

A Holt-Winters forecasting model is used to forecast future performance based on the 90% training set. The forecasted wickets trend is plotted. The test set is also plotted to see how closely the forecast matches the actual wickets

Take a look at the wickets forecasted for the bowlers below.

  • Shane Warne and Muralitharan have a fairly consistent forecast
  • Kumble’s forecast shows a small dip

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
bowlerPerfForecast("./kumble.csv","Anil Kumble")
bowlerPerfForecast("./warne.csv","Shane Warne")
bowlerPerfForecast("./murali.csv","M Muralitharan")

kwm-perffcst-1

dev.off()
## null device 
##           1

Contribution to matches won and lost

The plot below is extremely interesting

  1. Kumble’s wickets range from 2 to 4 in matches won, with a mean of 3
  2. Warne’s wickets in won matches range from 1 to 4, with more matches won. Clearly there are other bowlers contributing to the wins, possibly the pacers
  3. Muralitharan’s wickets range in winning matches is higher than that of the other 2, from 3 to 5, and he clearly had a hand (pun unintended) in Sri Lanka’s wins

As discussed above the next 2 charts require the use of getPlayerDataSp()

kumblesp 
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
bowlerContributionWonLost("kumblesp.csv","Kumble")
bowlerContributionWonLost("warnesp.csv","Warne")
bowlerContributionWonLost("muralisp.csv","Murali")

kwm-wl-1

dev.off()
## null device 
##           1

Performance home and overseas

From the plot below it can be seen that Kumble & Warne have played more matches overseas than Muralitharan. Both Kumble and Warne show an average of 2 wickets overseas. Murali, on the other hand, has an average of 2.5 wickets overseas but slightly fewer matches than Kumble & Warne

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
bowlerPerfHomeAway("kumblesp.csv","Kumble")
bowlerPerfHomeAway("warnesp.csv","Warne")
bowlerPerfHomeAway("muralisp.csv","Murali")

kwm-ha-1
dev.off()
## null device 
##           1
 

Check for bowler in-form/out-of-form

The computation below uses Null Hypothesis testing and a p-value to determine if the bowler is in-form or out-of-form. For this, the first 90% of the career wickets is taken as the population and its mean is computed. The last 10% is taken as the sample set, and the sample Mean and sample Standard Deviation are calculated.

The Null Hypothesis (H0) assumes that the bowler continues to stay in-form, i.e. the sample mean is within the 95% confidence interval of the population mean. The Alternative (Ha) assumes that the bowler is out of form, i.e. the sample mean is below the 95% confidence interval of the population mean.

A significance value of 0.05 is chosen and the p-value is computed

  • If p-value >= 0.05 – Bowler In-Form
  • If p-value < 0.05 – Bowler Out-of-Form

Note: Ideally the p-value should be computed for a population that follows the Normal Distribution. But the wickets population is usually skewed. So some correction may be needed. I will revisit this later

Note: The check for the form status of the bowlers indicates

  1. Both Kumble and Muralitharan were out of form. This also shows in the moving average plot
  2. Warne was still in great form and could have continued for a few more years. Too bad we didn’t see the magic later

checkBowlerInForm("./kumble.csv","Anil Kumble")
## *******************************************************************************************
## 
## Population size: 212  Mean of population: 2.69 
## Sample size: 24  Mean of sample: 2.04 SD of sample: 1.55 
## 
## Null hypothesis H0 : Anil Kumble 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Anil Kumble 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Anil Kumble 's Form Status: Out-of-Form because the p value: 0.02549  is less than alpha=  0.05"
## *******************************************************************************************
checkBowlerInForm("./warne.csv","Shane Warne")
## *******************************************************************************************
## 
## Population size: 240  Mean of population: 2.55 
## Sample size: 27  Mean of sample: 2.56 SD of sample: 1.8 
## 
## Null hypothesis H0 : Shane Warne 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : Shane Warne 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "Shane Warne 's Form Status: In-Form because the p value: 0.511409  is greater than alpha=  0.05"
## *******************************************************************************************
checkBowlerInForm("./murali.csv","M Muralitharan")
## *******************************************************************************************
## 
## Population size: 207  Mean of population: 3.55 
## Sample size: 23  Mean of sample: 2.87 SD of sample: 1.74 
## 
## Null hypothesis H0 : M Muralitharan 's sample average is within 95% confidence interval 
##         of population average
## Alternative hypothesis Ha : M Muralitharan 's sample average is below the 95% confidence
##         interval of population average
## 
## [1] "M Muralitharan 's Form Status: Out-of-Form because the p value: 0.036828  is less than alpha=  0.05"
## *******************************************************************************************
dev.off()
## null device 
##           1

Key Findings

The plots above capture some of the capabilities and features of my cricketr package. Feel free to install the package and try it out. Please do keep in mind ESPN Cricinfo’s Terms of Use.
Here are the main findings from the analysis above

Analysis of Top 4 batsmen

The analysis of the Top 4 test batsmen Tendulkar, Kallis, Ponting and Sangakkara shows the following

  1. Sangakkara has the highest average, followed by Tendulkar, Kallis and then Ponting.
  2. Ponting has the highest strike rate followed by Tendulkar, Sangakkara and then Kallis
  3. The predicted runs for a given Balls faced and Minutes at crease is highest for Ponting, followed by Tendulkar, Sangakkara and Kallis
  4. The moving average for Tendulkar and Ponting shows a downward trend while Kallis and Sangakkara retired too soon
  5. Tendulkar was out of form about the time of retirement while the rest were in-form. But this result has to be taken along with the moving average plot. Ponting was clearly on the way out.
  6. The home and overseas performance indicates that Tendulkar is the clear leader. He has the highest number of matches played overseas and his performance has been consistent. He is followed by Ponting, Kallis and finally Sangakkara

Analysis of Top 3 spinners

The analysis of Anil Kumble, Shane Warne and M Muralitharan shows the following

  1. Muralitharan has the highest wickets and best economy rate followed by Warne and Kumble
  2. Muralitharan has higher wickets frequency percentage between 3 to 8 wickets
  3. Muralitharan has the best Economy Rate for wickets between 2 to 7
  4. The moving average plot shows that the time was up for Kumble and Muralitharan but Warne had a few years ahead
  5. The check for form status shows that Muralitharan’s and Kumble’s time was over while Warne was still in great form
  6. Kumble has more matches abroad than the other 2, yet he averages 3 wickets at home and 2 wickets overseas, like Warne. Murali has played fewer matches but has an average of 4 wickets at home and 3 wickets overseas.

Final thoughts

Here are my final thoughts

Batting

Among the 4 batsmen Tendulkar, Kallis, Ponting and Sangakkara, the clear leader is Tendulkar for the following reasons

  1. Tendulkar has the highest number of test centuries and runs of all time. Tendulkar’s average is 2nd to Sangakkara’s, and his predicted runs for a given Balls Faced and Minutes at crease is 2nd, behind Ponting. Also Tendulkar’s performance at home and overseas is consistent throughout, despite the fact that he has the highest number of overseas matches
  2. Ponting takes the 2nd spot with the 3rd highest number of centuries, 1st in Strike Rate and 2nd in home and away performance.
  3. The 3rd spot goes to Sangakkara, with the highest average, the 4th highest number of centuries, and a reasonable run frequency percentage in different run ranges. However he has fewer matches overseas and his performance overseas is significantly lower than at home
  4. Kallis has the 2nd highest number of centuries but his performance overseas and strike rate are behind the others
  5. Finally Kallis and Sangakkara had a few good years of batting still left in them (pity they retired!) while Tendulkar’s and Ponting’s time was up

Bowling

Muralitharan leads the way followed closely by Warne and finally Kumble. The reasons are

  1. Muralitharan has the highest number of test wickets with the best Wickets percentage and the best Economy Rate. Murali on average has taken 4 wickets at home and 3 wickets overseas
  2. Warne follows Murali in the highest wickets taken, however Warne has fewer matches overseas than Murali and averages 3 wickets at home and 2 wickets overseas
  3. Kumble has the 3rd highest wickets, with 3 wickets on average at home and 2 wickets overseas. However Kumble has played more matches overseas than the other two. In that respect his performance is great. Also Kumble has played fewer matches at home, otherwise his numbers would have looked even better.
  4. Also, while Kumble’s and Muralitharan’s careers were on the decline, Warne was going great and had a couple of years ahead.

You can download this analysis at Introducing cricketr

Hope you have as much fun using the cricketr package as I had in developing it. Do take a look at my follow up post Taking cricketr for a spin – Part 1

Important note: Do check out my other posts using cricketr at cricketr-posts

Do take a look at my 2nd package in “The making of cricket package yorkr – Part 1”

Also see
1. My book “Deep Learning from first principles” now on Amazon
2. My book ‘Practical Machine Learning with R and Python’ on Amazon
3. Taking cricketr for a spin – Part 1
4. cricketr plays the ODIs
5. cricketr adapts to the Twenty20 International
6. Analyzing cricket’s batting legends – Through the mirage with R
7. Masters of spin: Unraveling the web with R
8. Mirror,mirror …best batsman of them all
