The plots below show a 3D scatter plot of Kohli’s Runs versus Balls Faced and Minutes at crease. A linear regression plane is then fitted between Runs and Balls Faced + Minutes at crease.

The plot below gives the average runs scored by Kohli at different grounds. The plot also shows the number of innings at each ground as a label on the x-axis.

This plot computes the average runs scored by Kohli against different countries.

The plot below shows the Runs Likelihood for a batsman. For this, Kohli’s performance is plotted as a 3D scatter plot of Runs versus Balls Faced and Minutes at crease. K-Means is then used to compute the centroids of 3 clusters, which are plotted to show Kohli’s highest tendencies.
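To make the K-Means step a little more concrete, here is a minimal pure-Python sketch of the clustering idea (plain Lloyd's algorithm). This is not cricpy's code: the `kmeans` helper, the deterministic choice of starting centroids and the innings data are all assumptions made up for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

# Hypothetical innings as (balls faced, minutes at crease, runs)
innings = [(5, 8, 4), (30, 40, 38), (60, 85, 90), (7, 10, 6),
           (28, 42, 35), (58, 80, 84), (6, 9, 2), (32, 44, 41)]
centroids = kmeans(innings, k=3)
for bf, mins, runs in centroids:
    print(f"BF={bf:.1f} Mins={mins:.1f} Runs={runs:.1f}")
```

Each centroid can be read as a "typical" innings: roughly how many runs the batsman tends to score for a given number of balls faced and minutes at the crease.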

The following batsmen have been very prolific in Twenty20 cricket and will be used for the analyses

The following plots take a closer look at their performances. The box plots show the median and the 1st and 3rd quartiles of the runs.

## 12. Box Histogram Plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency

```
import cricpy.analytics as ca
ca.batsmanPerfBoxHist("./kohli.csv","Virat Kohli")
```

`ca.batsmanPerfBoxHist("./guptill.csv","M J Guptill")`

`ca.batsmanPerfBoxHist("./shahzad.csv","M Shahzad")`

`ca.batsmanPerfBoxHist("./mccullum.csv","BB McCullum")`
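The quantities a box plot summarises can be computed directly. The sketch below uses Python's standard `statistics` module on a made-up list of innings scores. The data and the choice of the `inclusive` method are assumptions, and cricpy's plotting library may interpolate quartiles slightly differently.

```python
import statistics

# Hypothetical innings scores (not taken from any of the CSV files above)
runs = [2, 8, 15, 21, 26, 33, 47, 52, 70, 90]

# n=4 returns the three cut points: 1st quartile, median, 3rd quartile
q1, median, q3 = statistics.quantiles(runs, n=4, method="inclusive")
iqr = q3 - q1  # height of the box in a box plot

print(q1, median, q3, iqr)
```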

## 13 Moving Average of runs in career

Take a look at the Moving Average across the career of the Top 4 Twenty20 batsmen.

```
import cricpy.analytics as ca
ca.batsmanMovingAverage("./kohli.csv","Virat Kohli")
```

```
ca.batsmanMovingAverage("./guptill.csv","M J Guptill")
```

`ca.batsmanMovingAverage("./mccullum.csv","BB McCullum")`
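For readers who want to see what a moving average does to a runs series, here is a small stand-alone sketch of a trailing simple moving average. The `moving_average` helper, the window size and the scores are hypothetical, and cricpy may well use a different smoothing method internally.

```python
from collections import deque

def moving_average(values, window):
    """Trailing simple moving average; one output value per full window."""
    buf = deque(maxlen=window)  # automatically drops the oldest value
    out = []
    for v in values:
        buf.append(v)
        if len(buf) == window:
            out.append(sum(buf) / window)
    return out

runs = [12, 45, 30, 60, 6, 54]  # hypothetical scores in career order
smoothed = moving_average(runs, window=3)
print(smoothed)
```

A larger window gives a smoother curve but reacts more slowly to a change in form.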

## 14 Cumulative Average runs of batsman in career

This function provides the cumulative average runs of the batsman over the career. Kohli’s average peaks at around 45 runs at around 43 innings, though there is a downward dip.

```
import cricpy.analytics as ca
ca.batsmanCumulativeAverageRuns("./kohli.csv","Virat Kohli")
```

`ca.batsmanCumulativeAverageRuns("./guptill.csv","M J Guptill")`

`ca.batsmanCumulativeAverageRuns("./shahzad.csv","M Shahzad")`

`ca.batsmanCumulativeAverageRuns("./mccullum.csv","BB McCullum")`
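The idea behind a cumulative average is simply the running total divided by the number of innings so far. A minimal sketch with made-up scores follows. Note the assumption that every innings counts in the denominator, whereas the conventional batting average divides by dismissals.

```python
from itertools import accumulate

runs = [20, 40, 0, 60, 30]  # hypothetical scores in career order

# Running total after each innings divided by innings played so far
cumulative_avg = [total / n for n, total in enumerate(accumulate(runs), start=1)]
print(cumulative_avg)
```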

## 15 Cumulative Average strike rate of batsman in career

Kohli, Guptill and McCullum average a strike rate of 125+

```
import cricpy.analytics as ca
ca.batsmanCumulativeStrikeRate("./kohli.csv","Virat Kohli")
```

`ca.batsmanCumulativeStrikeRate("./guptill.csv","M J Guptill")`

`ca.batsmanCumulativeStrikeRate("./shahzad.csv","M Shahzad")`

`ca.batsmanCumulativeStrikeRate("./mccullum.csv","BB McCullum")`
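Strike rate is runs scored per 100 balls faced. One way to build a cumulative version is to divide the running total of runs by the running total of balls, as sketched below. Both the data and this aggregate definition are assumptions. Averaging the per-innings strike rates instead is another possibility and gives different numbers.

```python
from itertools import accumulate

# Hypothetical innings as (runs, balls faced)
innings = [(30, 20), (12, 15), (48, 25)]

total_runs = accumulate(r for r, _ in innings)
total_balls = accumulate(b for _, b in innings)

# Career strike rate after each innings: 100 * runs so far / balls so far
cumulative_sr = [100 * r / b for r, b in zip(total_runs, total_balls)]
print(cumulative_sr)
```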

## 16 Relative Batsman Cumulative Average Runs

The plot below compares the relative cumulative average runs of the batsmen. Kohli is well above the other 3 batsmen, followed by McCullum and then Guptill.

```
import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullum"]
ca.relativeBatsmanCumulativeAvgRuns(frames,names)
```

## 17. Relative Batsman Strike Rate

The plot below compares the relative cumulative strike rates of the batsmen. It shows that Kohli tops the overall strike rate, followed by McCullum and then Guptill.

```
import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullum"]
ca.relativeBatsmanCumulativeStrikeRate(frames,names)
```

## 18. 3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A 3D prediction plane is fitted

```
import cricpy.analytics as ca
ca.battingPerf3d("./kohli.csv","Virat Kohli")
```

`ca.battingPerf3d("./guptill.csv","M J Guptill")`

`ca.battingPerf3d("./shahzad.csv","M Shahzad")`

`ca.battingPerf3d("./mccullum.csv","BB McCullum")`

## 19. 4s and 6s of batsmen

Guptill and McCullum have a large percentage of sixes in comparison to the 4s. Kohli has a relatively lower number of 6s.

```
import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullum"]
ca.batsman4s6s(frames,names)
```
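As a rough illustration of how such a comparison can be quantified, the sketch below works out what share of a batsman's runs came from 4s and from 6s. The `boundary_share` helper and the numbers are hypothetical, and I am assuming a percentage-of-runs breakdown here rather than reproducing what `batsman4s6s` computes internally.

```python
def boundary_share(total_runs, fours, sixes):
    """Percentage of total runs scored in 4s, in 6s and by other means."""
    runs_4s = 4 * fours
    runs_6s = 6 * sixes
    other = total_runs - runs_4s - runs_6s
    return {
        "4s": 100 * runs_4s / total_runs,
        "6s": 100 * runs_6s / total_runs,
        "other": 100 * other / total_runs,
    }

# Hypothetical career totals
share = boundary_share(total_runs=200, fours=20, sixes=10)
print(share)
```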

## 20. Predicting Runs given Balls Faced and Minutes at Crease

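A multivariate regression plane is fitted between Runs and Balls Faced + Minutes at crease. The fitted plane is then used to predict the runs for new values of Balls Faced and Minutes at crease.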

```
import cricpy.analytics as ca
import numpy as np
import pandas as pd
BF = np.linspace(10, 400, 15)
Mins = np.linspace(30, 600, 15)
newDF = pd.DataFrame({'BF': BF, 'Mins': Mins})
kohli = ca.batsmanRunsPredict("./kohli.csv", newDF, "Kohli")
```

`print(kohli)`

```
## BF Mins Runs
## 0 10.000000 30.000000 14.753153
## 1 37.857143 70.714286 55.963333
## 2 65.714286 111.428571 97.173513
## 3 93.571429 152.142857 138.383693
## 4 121.428571 192.857143 179.593873
## 5 149.285714 233.571429 220.804053
## 6 177.142857 274.285714 262.014233
## 7 205.000000 315.000000 303.224414
## 8 232.857143 355.714286 344.434594
## 9 260.714286 396.428571 385.644774
## 10 288.571429 437.142857 426.854954
## 11 316.428571 477.857143 468.065134
## 12 344.285714 518.571429 509.275314
## 13 372.142857 559.285714 550.485494
## 14 400.000000 600.000000 591.695674
```
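To show what fitting such a plane involves, here is a small pure-Python sketch that solves the least-squares normal equations for Runs = c0 + c1·BF + c2·Mins. The helpers `solve3` and `fit_plane` and the data are made up for illustration (the data lies exactly on a plane so the answer is easy to check). cricpy itself presumably relies on a regression library rather than anything like this.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_plane(bf, mins, runs):
    """Least-squares fit of runs = c0 + c1*bf + c2*mins via the normal equations."""
    rows = [(1.0, x, y) for x, y in zip(bf, mins)]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * z for r, z in zip(rows, runs)) for i in range(3)]
    return solve3(ata, atb)

# Hypothetical innings lying exactly on runs = 2 + 1.0*BF + 0.25*Mins
bf = [10, 20, 30, 40, 25]
mins = [20, 25, 50, 45, 60]
runs = [17, 28.25, 44.5, 53.25, 42]

c0, c1, c2 = fit_plane(bf, mins, runs)
predicted = c0 + c1 * 30 + c2 * 40  # prediction for BF=30, Mins=40
print(round(c0, 3), round(c1, 3), round(c2, 3), round(predicted, 3))
```

With real data the points will not lie exactly on a plane, and the fitted coefficients are the ones that minimise the squared prediction error.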

## 21 Analysis of Top Bowlers

The following 4 bowlers have had an excellent career and will be used for the analysis

- Shakib Al Hasan: **Wickets: 80, Average: 21.07, Economy Rate: 6.74**
- Mohammad Nabi: **Wickets: 67, Average: 24.25, Economy Rate: 7.13**
- Rashid Khan: **Wickets: 64, Average: 12.40, Economy Rate: 6.01**
- Imran Tahir: **Wickets: 62, Average: 14.95, Economy Rate: 6.77**
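For anyone new to these two statistics: the bowling average is runs conceded per wicket taken, and the economy rate is runs conceded per over bowled (6 balls). The sketch below computes both from made-up totals. The helper names and the numbers are hypothetical and are not the figures of the bowlers listed above.

```python
def bowling_average(runs_conceded, wickets):
    """Runs conceded per wicket taken (lower is better)."""
    return runs_conceded / wickets

def economy_rate(runs_conceded, balls_bowled):
    """Runs conceded per over, with 6 balls to an over (lower is better)."""
    return runs_conceded / (balls_bowled / 6)

# Hypothetical career totals
avg = bowling_average(runs_conceded=300, wickets=15)
econ = economy_rate(runs_conceded=300, balls_bowled=240)
print(avg, econ)
```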

## 22. Get the bowler’s data

```
import cricpy.analytics as ca
```

## 23. Wicket Frequency Plot

The plots below compute the percentage frequency of the number of wickets taken (e.g. 1 wicket x%, 2 wickets y%, etc.) for each of the bowlers and plot them as a continuous line.

```
import cricpy.analytics as ca
ca.bowlerWktsFreqPercent("./shakib.csv","Shakib Al Hasan")
```

`ca.bowlerWktsFreqPercent("./nabi.csv","Mohammad Nabi")`

`ca.bowlerWktsFreqPercent("./rashid.csv","Rashid Khan")`

`ca.bowlerWktsFreqPercent("./tahir.csv","Imran Tahir")`
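The percentage frequency behind these plots is easy to reproduce by hand. Here is a minimal sketch using `collections.Counter` on a made-up list of wickets per innings. The data is hypothetical and this is only meant to illustrate the calculation, not cricpy's implementation.

```python
from collections import Counter

# Hypothetical wickets taken in each innings
wickets = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0]

counts = Counter(wickets)
# Share of innings (in %) in which the bowler took exactly w wickets
freq_percent = {w: 100 * counts[w] / len(wickets) for w in sorted(counts)}
print(freq_percent)
```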

## 24. Wickets Runs plot

The plot below creates a box plot showing the 1st and 3rd quartiles of runs conceded versus the number of wickets taken.

```
import cricpy.analytics as ca
ca.bowlerWktsRunsPlot("./shakib.csv","Shakib Al Hasan")
```

`ca.bowlerWktsRunsPlot("./nabi.csv","Mohammad Nabi")`

`ca.bowlerWktsRunsPlot("./rashid.csv","Rashid Khan")`

`ca.bowlerWktsRunsPlot("./tahir.csv","Imran Tahir")`

## 25 Average wickets at different venues

The plots give the average wickets taken by each bowler at different venues.

```
import cricpy.analytics as ca
ca.bowlerAvgWktsGround("./shakib.csv","Shakib Al Hasan")
```

`ca.bowlerAvgWktsGround("./nabi.csv","Mohammad Nabi")`

`ca.bowlerAvgWktsGround("./rashid.csv","Rashid Khan")`

`ca.bowlerAvgWktsGround("./tahir.csv","Imran Tahir")`

## 26 Average wickets against different opposition

The plots give the average wickets taken by each bowler against different countries. The x-axis also includes the number of innings against each team.

```
import cricpy.analytics as ca
ca.bowlerAvgWktsOpposition("./shakib.csv","Shakib Al Hasan")
```

`ca.bowlerAvgWktsOpposition("./nabi.csv","Mohammad Nabi")`

`ca.bowlerAvgWktsOpposition("./rashid.csv","Rashid Khan")`

`ca.bowlerAvgWktsOpposition("./tahir.csv","Imran Tahir")`

## 27 Wickets taken moving average

The plots below show the moving average of wickets taken by each bowler over the career.

```
import cricpy.analytics as ca
ca.bowlerMovingAverage("./shakib.csv","Shakib Al Hasan")
```

`ca.bowlerMovingAverage("./nabi.csv","Mohammad Nabi")`

`ca.bowlerMovingAverage("./rashid.csv","Rashid Khan")`

`ca.bowlerMovingAverage("./tahir.csv","Imran Tahir")`

## 28 Cumulative average wickets taken

The plots below give the cumulative average wickets taken by the bowlers. Rashid Khan has been the most effective, with about 2.28 wickets per match.

```
import cricpy.analytics as ca
ca.bowlerCumulativeAvgWickets("./shakib.csv","Shakib Al Hasan")
```

`ca.bowlerCumulativeAvgWickets("./nabi.csv","Mohammad Nabi")`

`ca.bowlerCumulativeAvgWickets("./rashid.csv","Rashid Khan")`

`ca.bowlerCumulativeAvgWickets("./tahir.csv","Imran Tahir")`

## 29 Cumulative average economy rate

The plots below give the cumulative average economy rate of the bowlers. Rashid Khan has the best economy rate, followed by Mohammad Nabi.

```
import cricpy.analytics as ca
ca.bowlerCumulativeAvgEconRate("./shakib.csv","Shakib Al Hasan")
```

`ca.bowlerCumulativeAvgEconRate("./nabi.csv","Mohammad Nabi")`

`ca.bowlerCumulativeAvgEconRate("./rashid.csv","Rashid Khan")`

`ca.bowlerCumulativeAvgEconRate("./tahir.csv","Imran Tahir")`

## 30 Relative cumulative average economy rate of bowlers

The relative cumulative economy rate is given below. It can be seen that Rashid Khan has the best economy rate, followed by Mohammad Nabi and then Imran Tahir.

```
import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","./tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlerCumulativeAvgEconRate(frames,names)
```

## 31 Relative Economy Rate against wickets taken

Rashid Khan has the best figures for between 2 and 3.5 wickets. Mohammad Nabi pips Rashid Khan when he takes a haul of 4 wickets.

```
import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","./tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlingER(frames,names)
```

## 32 Relative cumulative average wickets of bowlers in career

Rashid Khan has the best cumulative average wickets. He is followed in the wicket haul by Imran Tahir and then Shakib Al Hasan.

```
import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","./tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlerCumulativeAvgWickets(frames,names)
```

## 33. Key Findings

The plots above capture some of the capabilities and features of my **cricpy** package. Feel free to install the package and try it out. Please do keep in mind ESPN Cricinfo’s Terms of Use.

Here are the main findings from the analysis above
