The plots below show the 3D scatter plot of Kohli’s Runs versus Balls Faced and Minutes at crease. A linear regression plane is then fitted between Runs and Balls Faced + Minutes at crease
The plot below gives the average runs scored by Kohli at different grounds. The plot also the number of innings at each ground as a label at x-axis.
This plot computes the average runs scored by Kohli against different countries.
The plot below shows the Runs Likelihood for a batsman. For this the performance of Kohli is plotted as a 3D scatter plot with Runs versus Balls Faced + Minutes at crease. K-Means. The centroids of 3 clusters are computed and plotted. In this plot Kohli’s highest tendencies are computed and plotted using K-Means
The following batsmen have been very prolific in Twenty20 cricket and will be used for the analyses
The following plots take a closer at their performances. The box plots show the median the 1st and 3rd quartile of the runs
12. Box Histogram Plot
This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency
import cricpy.analytics as ca
ca.batsmanPerfBoxHist("./kohli.csv","Virat Kohli")

ca.batsmanPerfBoxHist("./guptill.csv","M J Guptill")

ca.batsmanPerfBoxHist("./shahzad.csv","M Shahzad")

ca.batsmanPerfBoxHist("./mccullum.csv","BB McCullum")

13 Moving Average of runs in career
Take a look at the Moving Average across the career of the Top 4 Twenty20 batsmen.
import cricpy.analytics as ca
ca.batsmanMovingAverage("./kohli.csv","Virat Kohli")

ca.batsmanMovingAverage("./guptill.csv","M J Guptill")

ca.batsmanMovingAverage("./mccullum.csv","BB McCullum")

14 Cumulative Average runs of batsman in career
This function provides the cumulative average runs of the batsman over the career.Kohli’s average tops around 45 runs around 43 innings, though there is a dip downwards
import cricpy.analytics as ca
ca.batsmanCumulativeAverageRuns("./kohli.csv","Virat Kohli")

ca.batsmanCumulativeAverageRuns("./guptill.csv","M J Guptill")

ca.batsmanCumulativeAverageRuns("./shahzad.csv","M Shahzad")

ca.batsmanCumulativeAverageRuns("./mccullum.csv","BB McCullum")

15 Cumulative Average strike rate of batsman in career
Kohli, Guptill and McCullum average a strike rate of 125+
import cricpy.analytics as ca
ca.batsmanCumulativeStrikeRate("./kohli.csv","Virat Kohli")

ca.batsmanCumulativeStrikeRate("./guptill.csv","M J Guptill")

ca.batsmanCumulativeStrikeRate("./shahzad.csv","M Shahzad")

ca.batsmanCumulativeStrikeRate("./mccullum.csv","BB McCullum")

16 Relative Batsman Cumulative Average Runs
The plot below compares the Relative cumulative average runs of the batsman. Kohli is way above all the other 3 batsmen. Behind Kohli is McCullum and then Guptill
import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullumn"]
ca.relativeBatsmanCumulativeAvgRuns(frames,names)

17. Relative Batsman Strike Rate
The plot below gives the relative Runs Frequency Percetages for each 10 run bucket. The plot below show that Kohli tops the overall strike rate followed by McCullum and then Guptill
import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullum"]
ca.relativeBatsmanCumulativeStrikeRate(frames,names)

18. 3D plot of Runs vs Balls Faced and Minutes at Crease
The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A 3D prediction plane is fitted
import cricpy.analytics as ca
ca.battingPerf3d("./kohli.csv","Virat Kohli")

ca.battingPerf3d("./guptill.csv","M J Guptill")

ca.battingPerf3d("./shahzad.csv","M Shahzad")

ca.battingPerf3d("./mccullum.csv","BB McCullum")

19. 3D plot of Runs vs Balls Faced and Minutes at Crease
Guptill and McCullum have a large percentage of sixes in comparison to the 4s. Kohli has a relative lower number of 6s
import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullum"]
ca.batsman4s6s(frames,names)

20. Predicting Runs given Balls Faced and Minutes at Crease
A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.
import cricpy.analytics as ca
import numpy as np
import pandas as pd
BF = np.linspace( 10, 400,15)
Mins = np.linspace( 30,600,15)
newDF= pd.DataFrame({'BF':BF,'Mins':Mins})
kohli= ca.batsmanRunsPredict("./kohli.csv",newDF,"Kohli")
print(kohli)
## BF Mins Runs
## 0 10.000000 30.000000 14.753153
## 1 37.857143 70.714286 55.963333
## 2 65.714286 111.428571 97.173513
## 3 93.571429 152.142857 138.383693
## 4 121.428571 192.857143 179.593873
## 5 149.285714 233.571429 220.804053
## 6 177.142857 274.285714 262.014233
## 7 205.000000 315.000000 303.224414
## 8 232.857143 355.714286 344.434594
## 9 260.714286 396.428571 385.644774
## 10 288.571429 437.142857 426.854954
## 11 316.428571 477.857143 468.065134
## 12 344.285714 518.571429 509.275314
## 13 372.142857 559.285714 550.485494
## 14 400.000000 600.000000 591.695674
21 Analysis of Top Bowlers
The following 4 bowlers have had an excellent career and will be used for the analysis
- Shakib Hasan:Wickets: 80, Average = 21.07, Economy Rate – 6.74
- Mohammed Nabi : Wickets: 67, Average = 24.25, Economy Rate – 7.13
- Rashid Khan: Wickets: 64, Average = 12.40, Economy Rate – 6.01
- Imran Tahir : Wickets:62, Average – 14.95, Economy Rate – 6.77
22. Get the bowler’s data
This plot below computes the percentage frequency of number of wickets taken for e.g 1 wicket x%, 2 wickets y% etc and plots them as a continuous line
import cricpy.analytics as ca
23. Wicket Frequency Plot
This plot below plots the frequency of wickets taken for each of the bowlers
import cricpy.analytics as ca
ca.bowlerWktsFreqPercent("./shakib.csv","Shakib Al Hasan")

ca.bowlerWktsFreqPercent("./nabi.csv","Mohammad Nabi")

ca.bowlerWktsFreqPercent("./rashid.csv","Rashid Khan")

ca.bowlerWktsFreqPercent("./tahir.csv","Imran Tahir")

24. Wickets Runs plot
The plot below create a box plot showing the 1st and 3rd quartile of runs conceded versus the number of wickets taken.
import cricpy.analytics as ca
ca.bowlerWktsRunsPlot("./shakib.csv","Shakib Al Hasan")

ca.bowlerWktsRunsPlot("./nabi.csv","Mohammad Nabi")

ca.bowlerWktsRunsPlot("./rashid.csv","Rashid Khan")

ca.bowlerWktsRunsPlot("./tahir.csv","Imran Tahir")

25 Average wickets at different venues
The plot gives the average wickets taken by Muralitharan at different venues.
import cricpy.analytics as ca
ca.bowlerAvgWktsGround("./shakib.csv","Shakib Al Hasan")

ca.bowlerAvgWktsGround("./nabi.csv","Mohammad Nabi")

ca.bowlerAvgWktsGround("./rashid.csv","Rashid Khan")

ca.bowlerAvgWktsGround("./tahir.csv","Imran Tahir")

26 Average wickets against different opposition
The plot gives the average wickets taken by Muralitharan against different countries. The x-axis also includes the number of innings against each team
import cricpy.analytics as ca
ca.bowlerAvgWktsOpposition("./shakib.csv","Shakib Al Hasan")

ca.bowlerAvgWktsOpposition("./nabi.csv","Mohammad Nabi")

ca.bowlerAvgWktsOpposition("./rashid.csv","Rashid Khan")

ca.bowlerAvgWktsOpposition("./tahir.csv","Imran Tahir")

27 Wickets taken moving average
From the plot below it can be see
import cricpy.analytics as ca
ca.bowlerMovingAverage("./shakib.csv","Shakib Al Hasan")

ca.bowlerMovingAverage("./nabi.csv","Mohammad Nabi")

ca.bowlerMovingAverage("./rashid.csv","Rashid Khan")

ca.bowlerMovingAverage("./tahir.csv","Imran Tahir")

28 Cumulative average wickets taken
The plots below give the cumulative average wickets taken by the bowlers. Rashid Khan has been the most effective with almost 2.28 wickets per match
import cricpy.analytics as ca
ca.bowlerCumulativeAvgWickets("./shakib.csv","Shakib Al Hasan")

ca.bowlerCumulativeAvgWickets("./nabi.csv","Mohammad Nabi")

ca.bowlerCumulativeAvgWickets("./rashid.csv","Rashid Khan")

ca.bowlerCumulativeAvgWickets("./tahir.csv","Imran Tahir")

29 Cumulative average economy rate
The plots below give the cumulative average economy rate of the bowlers. Rashid Khan has the nest economy rate followed by Mohammed Nabi
import cricpy.analytics as ca
ca.bowlerCumulativeAvgEconRate("./shakib.csv","Shakib Al Hasan")

ca.bowlerCumulativeAvgEconRate("./nabi.csv","Mohammad Nabi")

ca.bowlerCumulativeAvgEconRate("./rashid.csv","Rashid Khan")

ca.bowlerCumulativeAvgEconRate("./tahir.csv","Imran Tahir")

30 Relative cumulative average economy rate of bowlers
The Relative cumulative economy rate is given below. It can be seen that Rashid Khan has the best economy rate followed by Mohammed Nabi and then Imran Tahir
import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlerCumulativeAvgEconRate(frames,names)

31 Relative Economy Rate against wickets taken
Rashid Khan has the best figures for wickets between 2-3.5 wickets. Mohammed Nabi pips Rashid Khan when takes a haul of 4 wickets.
import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlingER(frames,names)

32 Relative cumulative average wickets of bowlers in career
Rashid has the best performance with cumulative average wickets. He is followed by Imran Tahir in the wicket haul, followed by Shakib Al Hasan
import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlerCumulativeAvgWickets(frames,names)

33. Key Findings
The plots above capture some of the capabilities and features of my cricpy package. Feel free to install the package and try it out. Please do keep in mind ESPN Cricinfo’s Terms of Use.
Here are the main findings from the analysis above
This post is a continuation of my earlier post Big Data-1: Move into the big league:Graduate from Python to Pyspark. While the earlier post discussed parallel constructs in Python and Pyspark, this post elaborates similar and key constructs in R and SparkR. While this post just focuses on the programming part of R and SparkR it is essential to understand and fully grasp the concept of Spark, RDD and how data is distributed across the clusters. This post like the earlier post shows how if you already have a good handle of R, you can easily graduate to Big Data with SparkR
Note 1: This notebook has also been published at Databricks community site Big Data-2: Move into the big league:Graduate from R to SparkR