Revisiting World Bank data analysis with WDI and gVisMotionChart

Note: About 3 years ago I wrote a post on World Bank data analysis using World Development Indicators (WDI) and gvisMotionChart, but the motion charts stopped working some time ago. I had long wanted to fix this and have finally got around to it. The issue was that 2 of the WDI indicators had changed. After fixing this I was able to host the generated motion charts using github.io pages. Please make sure that you enable Flash Player if you open the motion charts with Google Chrome. You may also have to enable Flash if you are using Firefox, IE, etc.

Please check out the 2 motion charts with World Bank data

1. World Bank Chart 1
2. World Bank Chart 2

If you are using Chrome, please enable (Allow) ‘flash player’ by clicking on the lock sign in the address bar, as shown

Introduction

Recently I was surfing the web when I came across a really cool post, New R package to access World Bank data, by Markus Gesmann, on using googleVis and motion charts with World Bank data. The post also introduced me to Hans Rosling, Professor at Sweden’s Karolinska Institute. Hans Rosling, the creator of the famous Gapminder chart, the “Health and Wealth of Nations”, displays global trends through animated charts (a must see!!!). As they say, in Hans Rosling’s hands, data dances and sings. Take a look at his TED talks, e.g. Hans Rosling: New insights on poverty. Prof Rosling developed the breakthrough software behind the visualizations at Gapminder. The free software, which can be loaded with any data, was purchased by Google in March 2007.

In this post, I recreate some of the Gapminder charts with the help of the R packages WDI and googleVis. The WDI package by Vincent Arel-Bundock provides a set of really useful functions to get at data based on the World Bank Data indicators. googleVis provides motion charts with which you can animate the data.

You can clone/download the code from Github at worldBankAnalysis which is in the form of an Rmd file.

library(WDI)
library(ggplot2)
library(googleVis)
library(plyr)

1.Get the data from 1960 to 2019 for the following

  1. Population – SP.POP.TOTL
  2. GDP in US $ – NY.GDP.MKTP.CD
  3. Life Expectancy at birth (Years) – SP.DYN.LE00.IN
  4. GDP Per capita income – NY.GDP.PCAP.PP.CD
  5. Fertility rate (Births per woman) – SP.DYN.TFRT.IN
  6. Poverty headcount ratio – SI.POV.NAHC
# World population total
population = WDI(indicator='SP.POP.TOTL', country="all",start=1960, end=2019)
# GDP in US $
gdp= WDI(indicator='NY.GDP.MKTP.CD', country="all",start=1960, end=2019)
# Life expectancy at birth (Years)
lifeExpectancy= WDI(indicator='SP.DYN.LE00.IN', country="all",start=1960, end=2019)
# GDP Per capita
income = WDI(indicator='NY.GDP.PCAP.PP.CD', country="all",start=1960, end=2019)
# Fertility rate (births per woman)
fertility = WDI(indicator='SP.DYN.TFRT.IN', country="all",start=1960, end=2019)
# Poverty head count
poverty= WDI(indicator='SI.POV.NAHC', country="all",start=1960, end=2019)

2.Rename the columns

names(population)[3]="Total population"
names(lifeExpectancy)[3]="Life Expectancy (Years)"
names(gdp)[3]="GDP (US$)"
names(income)[3]="GDP per capita income"
names(fertility)[3]="Fertility (Births per woman)"
names(poverty)[3]="Poverty headcount ratio"

3.Join the data frames

Join the individual data frames to one large wide data frame with all the indicators for the countries
j1 <- join(population, gdp)

j2 <- join(j1,lifeExpectancy)

j3 <- join(j2,income)

j4 <- join(j3,poverty)

wbData <- join(j4,fertility)
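
To picture what the chain of join() calls above does, here is a minimal sketch in plain Python. By default plyr's join() performs a left join on the columns the two data frames have in common. The helper name left_join and the sample rows are made up for illustration and are not part of the original code; the sketch also assumes that each key combination appears only once in the right-hand table.

```python
# Sketch of a left join on the columns two tables share.
# The rows and values below are illustrative only, not World Bank data.

def left_join(left, right):
    """Left-join two lists of dict rows on every column name they share."""
    if not left or not right:
        return [dict(row) for row in left]
    keys = [k for k in left[0] if k in right[0]]
    # Index the right-hand rows by key values (assumes keys are unique there).
    index = {tuple(r[k] for k in keys): r for r in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        merged = dict(row)
        for col in right[0]:
            if col not in keys:
                merged[col] = match.get(col)  # None when there is no match
        joined.append(merged)
    return joined

population = [
    {"iso2c": "AA", "country": "Country A", "year": 2000, "Total population": 100},
    {"iso2c": "BB", "country": "Country B", "year": 2000, "Total population": 250},
]
gdp = [
    {"iso2c": "AA", "country": "Country A", "year": 2000, "GDP (US$)": 5000.0},
]

wide = left_join(population, gdp)
for row in wide:
    print(row)
```

Rows without a match keep their left-hand values and get an empty value for the new column, which is why a later complete-cases step can be useful.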

4.Use WDI_data

Use WDI_data to get the list of indicators and the countries. Join the countries and region

# WDI_data is a list of 2 matrices
wdi_data =WDI_data
# The 1st matrix in the list is the set of all World Bank indicators
indicators=wdi_data[[1]]
# The 2nd matrix gives the set of countries and regions
countries=wdi_data[[2]]
df = as.data.frame(countries)
aa <- df$region != "Aggregates"
# Remove the aggregates
countries_df <- df[aa,]
# Subset from the development data only those corresponding to the countries
bb = subset(wbData, country %in% countries_df$country)
cc = join(bb,countries_df)
dd = complete.cases(cc)
developmentDF = cc[dd,]

5.Create and display the motion chart

# xvar, yvar and sizevar must match the column names assigned in step 2
gg<- gvisMotionChart(cc,
                                idvar = "country",
                                timevar = "year",
                                xvar = "GDP (US$)",
                                yvar = "Life Expectancy (Years)",
                                sizevar ="Total population",
                                colorvar = "region")
plot(gg)
cat(gg$html$chart, file="chart1.html")

Note: Unfortunately it is not possible to embed the motion chart in WordPress. It has to be hosted on a server as a web page. After exploring several possibilities I came up with the following process to display the animated chart. The plot is saved as an HTML file using ‘cat’ as shown above. The WorldBank_chart1.html page is then hosted as a GitHub page (gh-page) on GitHub.

Here is the gvisMotionChart

Do give  World Bank Motion Chart1  a spin.  Here is how the Motion Chart has to be used

untitled

You can select Life Expectancy, Population, Fertility etc. by clicking the black arrows. The blue arrow shows the ‘play’ button that animates the motion chart. You can also select the countries and change the size of the circles. Do give it a try. Here is some quick analysis from playing around with the motion charts with different parameters chosen.

The set of charts below are screenshots captured by running the motion chart World Bank Motion Chart1

a. Life Expectancy vs Fertility chart

This chart is used by Hans Rosling in his Ted talk. The left chart shows low life expectancy and high fertility rate for several sub Saharan and East Asia Pacific countries in the early 1960’s. Today the fertility has dropped and the life expectancy has increased overall. However the sub Saharan countries still have a high fertility rate

pic1

b. Population vs GDP

The chart below shows that India and China have about the same GDP from 1973 to 1994, with the US and Japan well ahead.

pic2

From 1998 to 2014 China really pulls away from India and Japan, as seen below

pic3

c. Per capita income vs Life Expectancy

In the 1990’s the per capita income and life expectancy (42-50 years) of the sub-Saharan countries are low. Japan and the US have a good life expectancy in the 1990’s. In 2014 the per capita income of the sub-Saharan countries is still low, though the life expectancy has marginally improved.

pic4

d. Population vs Poverty headcount

pic5

In the early 1990’s China had a higher poverty head count ratio than India. By 2004 China had this all figured out and the poverty head count ratio drops significantly. This can also be seen in the chart below.

pop_pov3

In the chart above China shows a drastic reduction in poverty headcount ratio vs India. Strangely Zambia shows an increase in the poverty head count ratio.

6.Get the data for the 2nd set of indicators

  1. Total population  – SP.POP.TOTL
  2. GDP in US$ – NY.GDP.MKTP.CD
  3. Access to electricity (% population) – EG.ELC.ACCS.ZS
  4. Electricity consumption KWh per capita -EG.USE.ELEC.KH.PC
  5. CO2 emissions -EN.ATM.CO2E.KT
  6. Basic Sanitation Access – SH.STA.BASS.ZS
# World population
population = WDI(indicator='SP.POP.TOTL', country="all",start=1960, end=2016)
# GDP in US $
gdp= WDI(indicator='NY.GDP.MKTP.CD', country="all",start=1960, end=2016)
# Access to electricity (% population)
elecAccess= WDI(indicator='EG.ELC.ACCS.ZS', country="all",start=1960, end=2016)
# Electric power consumption Kwh per capita
elecConsumption= WDI(indicator='EG.USE.ELEC.KH.PC', country="all",start=1960, end=2016)
#CO2 emissions
co2Emissions= WDI(indicator='EN.ATM.CO2E.KT', country="all",start=1960, end=2016)
# Access to sanitation (% population)
sanitationAccess= WDI(indicator='SH.STA.BASS.ZS', country="all",start=1960, end=2016)

7.Rename the columns

names(population)[3]="Total population"
names(gdp)[3]="GDP US($)"
names(elecAccess)[3]="Access to Electricity (% popn)"
names(elecConsumption)[3]="Electric power consumption (KWH per capita)"
names(co2Emissions)[3]="CO2 emissions"
names(sanitationAccess)[3]="Access to sanitation(% popn)"

8.Join the individual data frames

Join the individual data frames to one large wide data frame with all the indicators for the countries


j1 <- join(population, gdp)
j2 <- join(j1,elecAccess)
j3 <- join(j2,elecConsumption)
j4 <- join(j3,co2Emissions)
wbData1 <- join(j4,sanitationAccess)

 

9.Use WDI_data

Use WDI_data to get the list of indicators and the countries. Join the countries and region

# WDI_data is a list of 2 matrices
wdi_data =WDI_data
# The 1st matrix in the list is the set of all World Bank indicators
indicators=wdi_data[[1]]
# The 2nd matrix gives the set of countries and regions
countries=wdi_data[[2]]
df = as.data.frame(countries)
aa <- df$region != "Aggregates"
# Remove the aggregates
countries_df <- df[aa,]
# Subset from the development data only those corresponding to the countries
ee = subset(wbData1, country %in% countries_df$country)
ff = join(ee,countries_df)
## Joining by: iso2c, country

10.Create and display the motion chart

# xvar, yvar and sizevar must match the column names assigned in step 7
gg1<- gvisMotionChart(ff,
                                idvar = "country",
                                timevar = "year",
                                xvar = "GDP US($)",
                                yvar = "Access to Electricity (% popn)",
                                sizevar ="Total population",
                                colorvar = "region")
plot(gg1)
cat(gg1$html$chart, file="chart2.html")

This is World Bank Motion Chart2  which has a different set of parameters like Access to Energy, CO2 emissions etc

The set of charts below are screenshots of the motion chart World Bank Motion Chart 2

a. Access to Electricity vs Population

pic6

The above chart shows that in China 100% of the population has access to electricity. India has made decent progress from 50% in 1990 to 79% in 2012. However, Pakistan seems to have done much better in providing access to electricity, moving from 59% to close to 98%.

b. Power consumption vs population

powercon

The above chart shows power consumption vs population. China and India have proportionally much lower consumption than Norway, the US and Canada.

c. CO2 emissions vs Population

pic7

In 1963 the CO2 emissions were fairly low and about comparable for all countries. US, India have shown a steady increase while China shows a steep increase. Interestingly UK shows a drop in CO2 emissions

d.  Access to sanitation
san

India shows an improvement but it has a long way to go, with only 40% of the population having access to sanitation. China has made much better strides, with 80% having access to sanitation in 2015. Strangely, Nigeria shows a drop in sanitation access of almost 20% of the population.

The code is available at Github at worldBankAnalysis

Conclusion: So there you have it. I have shown some screenshots of some sample parameters of the World Bank indicators. Please try to play around with World Bank Motion Chart1 & World Bank Motion Chart 2 with your own set of parameters and countries. You can also create your own motion chart from the 100s of WDI indicators available at World Bank Data indicator.


Big Data: On RDDs, Dataframes,Hive QL with Pyspark and SparkR-Part 3

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. – Jamie Zawinski

Some programmers, when confronted with a problem, think “I know, I’ll use floating point arithmetic.” Now they have 1.999999999997 problems. – @tomscott

Some people, when confronted with a problem, think “I know, I’ll use multithreading”. Nothhw tpe yawrve o oblems. – @d6

Some people, when confronted with a problem, think “I know, I’ll use versioning.” Now they have 2.1.0 problems. – @JaesCoyle

Some people, when faced with a problem, think, “I know, I’ll use binary.” Now they have 10 problems. – @nedbat

Introduction

The power of Spark, which operates on in-memory datasets, comes from the fact that it stores data as collections called Resilient Distributed Datasets (RDDs), which are themselves distributed in partitions across a cluster. RDDs are a fast way of processing data, as the data is operated on in parallel based on the map-reduce paradigm. RDDs can be used when the operations are low level, and they are typically used on unstructured data like logs or text. For structured and semi-structured data, Spark has a higher abstraction called Dataframes. Handling data through Dataframes is extremely fast, as queries are optimized by the Catalyst optimization engine, and the performance is orders of magnitude faster than with RDDs. In addition, Dataframes also use Tungsten, which handles memory management and garbage collection more effectively.
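
To make the map-reduce idea concrete, here is a small sketch in plain Python (no Spark involved). The two lists in partitions stand in for data split across a cluster; the function names are made up for illustration, and the rows (Runs, Mins, BF) are the first few complete innings from tendulkar.csv shown later in this post.

```python
from functools import reduce

# Each string is one CSV line: Runs, Mins, BF
lines = ["15,28,24", "59,254,172", "8,24,16", "41,124,90"]

# Stand-in for an RDD whose data lives in two partitions
partitions = [lines[0:2], lines[2:4]]

def map_partition(part):
    """Map step: pull the Runs field out of every line in one partition."""
    return [int(line.split(",")[0]) for line in part]

def reduce_partition(values):
    """Reduce step: combine the mapped values of one partition."""
    return reduce(lambda a, b: a + b, values, 0)

# Every partition can be processed independently (in Spark, in parallel)
partial_sums = [reduce_partition(map_partition(p)) for p in partitions]

# The partial results are then combined into the final answer
total_runs = reduce(lambda a, b: a + b, partial_sums, 0)
print(partial_sums, total_runs)
```

Because each partition is mapped and reduced on its own, only the small partial results need to be brought together at the end.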

The picture below shows the performance improvement achieved with Dataframes over RDDs

Benefits from Project Tungsten

Note: The above data and graph are taken from the course Big Data Analysis with Apache Spark at edX, UC Berkeley.

This post is a continuation of my 2 earlier posts
1. Big Data-1: Move into the big league:Graduate from Python to Pyspark
2. Big Data-2: Move into the big league:Graduate from R to SparkR

In this post I perform equivalent operations on a small dataset using RDDs, Dataframes in Pyspark & SparkR, and HiveQL. As in some of my earlier posts, I have used the tendulkar.csv file for this post. The dataset is small and allows me to do almost everything, from data cleaning and data transformation to grouping.
You can clone/fork the notebooks from GitHub at Big Data:Part 3

The notebooks have also been published and can be accessed below

  1. Big Data-1: On RDDs, DataFrames and HiveQL with Pyspark
  2. Big Data-2:On RDDs, Dataframes and HiveQL with SparkR

1. RDD – Select all columns of tables

from pyspark import SparkContext 
rdd = sc.textFile( "/FileStore/tables/tendulkar.csv")
rdd.map(lambda line: (line.split(","))).take(5)
Out[90]:
[['Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos', 'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date'],
 ['15', '28', '24', '2', '0', '62.5', '6', 'bowled', '2', 'v Pakistan', 'Karachi', '15-Nov-89'],
 ['DNB', '-', '-', '-', '-', '-', '-', '-', '4', 'v Pakistan', 'Karachi', '15-Nov-89'],
 ['59', '254', '172', '4', '0', '34.3', '6', 'lbw', '1', 'v Pakistan', 'Faisalabad', '23-Nov-89'],
 ['8', '24', '16', '1', '0', '50', '6', 'run out', '3', 'v Pakistan', 'Faisalabad', '23-Nov-89']]

1b.RDD – Select columns 1 to 4

from pyspark import SparkContext 
rdd = sc.textFile( "/FileStore/tables/tendulkar.csv")
rdd.map(lambda line: (line.split(",")[0:4])).take(5)
Out[91]:
[['Runs', 'Mins', 'BF', '4s'],
 ['15', '28', '24', '2'],
 ['DNB', '-', '-', '-'],
 ['59', '254', '172', '4'],
 ['8', '24', '16', '1']]

1c. RDD – Select specific columns 0, 10

from pyspark import SparkContext 
rdd = sc.textFile( "/FileStore/tables/tendulkar.csv")
df=rdd.map(lambda line: (line.split(",")))
df.map(lambda x: (x[10],x[0])).take(5)
Out[92]:
[('Ground', 'Runs'),
 ('Karachi', '15'),
 ('Karachi', 'DNB'),
 ('Faisalabad', '59'),
 ('Faisalabad', '8')]
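
The column selections in 1, 1b and 1c boil down to splitting each line on commas and then slicing or indexing the resulting list. Here is a small plain-Python sketch of the same logic, using the header and the first two data rows shown above; the variable names are my own. Note that a plain split(",") is only safe while no field contains a quoted comma.

```python
header = "Runs,Mins,BF,4s,6s,SR,Pos,Dismissal,Inns,Opposition,Ground,Start Date"
rows = [
    "15,28,24,2,0,62.5,6,bowled,2,v Pakistan,Karachi,15-Nov-89",
    "DNB,-,-,-,-,-,-,-,4,v Pakistan,Karachi,15-Nov-89",
]

# Same as rdd.map(lambda line: line.split(",")): one list of fields per line
records = [line.split(",") for line in [header] + rows]

# Same as line.split(",")[0:4]: a slice keeps positions 0, 1, 2 and 3
first_four = [r[0:4] for r in records]

# Same as (x[10], x[0]): pick Ground (position 10) and Runs (position 0)
ground_runs = [(r[10], r[0]) for r in records]

print(first_four)
print(ground_runs)
```

The header line is just another record here, which is why it shows up in the RDD outputs until it is filtered out explicitly.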

2. Dataframe:Pyspark – Select all columns

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Read CSV DF').getOrCreate()
tendulkar1 = spark.read.format('csv').option('header','true').load('/FileStore/tables/tendulkar.csv')
tendulkar1.show(5)
+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+
|Runs|Mins| BF| 4s| 6s|   SR|Pos|Dismissal|Inns|Opposition|    Ground|Start Date|
+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+
|  15|  28| 24|  2|  0| 62.5|  6|   bowled|   2|v Pakistan|   Karachi| 15-Nov-89|
| DNB|   -|  -|  -|  -|    -|  -|        -|   4|v Pakistan|   Karachi| 15-Nov-89|
|  59| 254|172|  4|  0| 34.3|  6|      lbw|   1|v Pakistan|Faisalabad| 23-Nov-89|
|   8|  24| 16|  1|  0|   50|  6|  run out|   3|v Pakistan|Faisalabad| 23-Nov-89|
|  41| 124| 90|  5|  0|45.55|  7|   bowled|   1|v Pakistan|    Lahore|  1-Dec-89|
+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+
only showing top 5 rows

2a. Dataframe:Pyspark- Select specific columns

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Read CSV DF').getOrCreate()
tendulkar1 = spark.read.format('csv').option('header','true').load('/FileStore/tables/tendulkar.csv')
tendulkar1.select("Runs","BF","Mins").show(5)
+----+---+----+
|Runs| BF|Mins|
+----+---+----+
|  15| 24|  28|
| DNB|  -|   -|
|  59|172| 254|
|   8| 16|  24|
|  41| 90| 124|
+----+---+----+

3. Dataframe:SparkR – Select all columns

# Load the SparkR library
library(SparkR)
# Initiate a SparkR session
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv", 
                header = "true", 
                delimiter = ",", 
                source = "csv", 
                inferSchema = "true", 
                na.strings = "")

# Select all columns of the dataframe
df=SparkR::select(tendulkar1,"*")
head(SparkR::collect(df))

  Runs Mins  BF 4s 6s    SR Pos Dismissal Inns Opposition     Ground Start Date
1   15   28  24  2  0  62.5   6    bowled    2 v Pakistan    Karachi  15-Nov-89
2  DNB    -   -  -  -     -   -         -    4 v Pakistan    Karachi  15-Nov-89
3   59  254 172  4  0  34.3   6       lbw    1 v Pakistan Faisalabad  23-Nov-89
4    8   24  16  1  0    50   6   run out    3 v Pakistan Faisalabad  23-Nov-89
5   41  124  90  5  0 45.55   7    bowled    1 v Pakistan     Lahore   1-Dec-89
6   35   74  51  5  0 68.62   6       lbw    1 v Pakistan    Sialkot   9-Dec-89

3a. Dataframe:SparkR- Select specific columns

# Load the SparkR library
library(SparkR)
# Initiate a SparkR session
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv", 
                header = "true", 
                delimiter = ",", 
                source = "csv", 
                inferSchema = "true", 
                na.strings = "")

# Select specific columns of the dataframe
df=SparkR::select(tendulkar1, "Runs", "BF","Mins")
head(SparkR::collect(df))
  Runs  BF Mins
1   15  24   28
2  DNB   -    -
3   59 172  254
4    8  16   24
5   41  90  124
6   35  51   74

4. Hive QL – Select all columns

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Read CSV DF').getOrCreate()
tendulkar1 = spark.read.format('csv').option('header','true').load('/FileStore/tables/tendulkar.csv')
tendulkar1.createOrReplaceTempView('tendulkar1_table')
spark.sql('select  * from tendulkar1_table limit 5').show(10, truncate = False)
+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+
|Runs|Mins|BF |4s |6s |SR   |Pos|Dismissal|Inns|Opposition|Ground    |Start Date|
+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+
|15  |28  |24 |2  |0  |62.5 |6  |bowled   |2   |v Pakistan|Karachi   |15-Nov-89 |
|DNB |-   |-  |-  |-  |-    |-  |-        |4   |v Pakistan|Karachi   |15-Nov-89 |
|59  |254 |172|4  |0  |34.3 |6  |lbw      |1   |v Pakistan|Faisalabad|23-Nov-89 |
|8   |24  |16 |1  |0  |50   |6  |run out  |3   |v Pakistan|Faisalabad|23-Nov-89 |
|41  |124 |90 |5  |0  |45.55|7  |bowled   |1   |v Pakistan|Lahore    |1-Dec-89  |
+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+

4a. Hive QL – Select specific columns

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Read CSV DF').getOrCreate()
tendulkar1 = spark.read.format('csv').option('header','true').load('/FileStore/tables/tendulkar.csv')
tendulkar1.createOrReplaceTempView('tendulkar1_table')
spark.sql('select  Runs, BF,Mins from tendulkar1_table limit 5').show(10, truncate = False)
+----+---+----+
|Runs|BF |Mins|
+----+---+----+
|15  |24 |28  |
|DNB |-  |-   |
|59  |172|254 |
|8   |16 |24  |
|41  |90 |124 |
+----+---+----+

5. RDD – Filter rows on specific condition

from pyspark import SparkContext
rdd = sc.textFile( "/FileStore/tables/tendulkar.csv")
# Each record is a list of fields, so test the Runs field (x[0]) rather than the whole record
df=(rdd.map(lambda line: line.split(","))
      .filter(lambda x: x[0] != "DNB")
      .filter(lambda x: x[0] != "TDNB")
      .filter(lambda x: x[0] != "absent")
      # Remove the "*" that marks a not-out score
      .map(lambda x: [x[0].replace("*","")] + x[1:]))

df.take(5)

Out[97]:
[['Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos', 'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date'],
 ['15', '28', '24', '2', '0', '62.5', '6', 'bowled', '2', 'v Pakistan', 'Karachi', '15-Nov-89'],
 ['59', '254', '172', '4', '0', '34.3', '6', 'lbw', '1', 'v Pakistan', 'Faisalabad', '23-Nov-89'],
 ['8', '24', '16', '1', '0', '50', '6', 'run out', '3', 'v Pakistan', 'Faisalabad', '23-Nov-89'],
 ['41', '124', '90', '5', '0', '45.55', '7', 'bowled', '1', 'v Pakistan', 'Lahore', '1-Dec-89']]

5a. Dataframe:Pyspark – Filter rows on specific condition

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace
spark = SparkSession.builder.appName('Read CSV DF').getOrCreate()
tendulkar1 = spark.read.format('csv').option('header','true').load('/FileStore/tables/tendulkar.csv')
tendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'DNB')
tendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'TDNB')
tendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'absent')
tendulkar1 = tendulkar1.withColumn('Runs', regexp_replace('Runs', '[*]', ''))
tendulkar1.show(5)
+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+
|Runs|Mins| BF| 4s| 6s|   SR|Pos|Dismissal|Inns|Opposition|    Ground|Start Date|
+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+
|  15|  28| 24|  2|  0| 62.5|  6|   bowled|   2|v Pakistan|   Karachi| 15-Nov-89|
|  59| 254|172|  4|  0| 34.3|  6|      lbw|   1|v Pakistan|Faisalabad| 23-Nov-89|
|   8|  24| 16|  1|  0|   50|  6|  run out|   3|v Pakistan|Faisalabad| 23-Nov-89|
|  41| 124| 90|  5|  0|45.55|  7|   bowled|   1|v Pakistan|    Lahore|  1-Dec-89|
|  35|  74| 51|  5|  0|68.62|  6|      lbw|   1|v Pakistan|   Sialkot|  9-Dec-89|
+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+
only showing top 5 rows
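
The cleaning done in this section has two parts: drop the innings where Runs is "DNB", "TDNB" or "absent", and strip the "*" that marks a not-out score. A minimal plain-Python sketch of that logic follows. The function name clean_innings is hypothetical, and the sample uses only the first three columns (Runs, Mins, BF) of rows that appear in this post.

```python
def clean_innings(records):
    """Drop non-batting rows and remove the not-out marker from Runs."""
    skip = {"DNB", "TDNB", "absent"}
    cleaned = []
    for rec in records:
        # The test must look at the Runs field, not at the whole record
        if rec[0] in skip:
            continue
        cleaned.append([rec[0].replace("*", "")] + rec[1:])
    return cleaned

sample = [
    ["15", "28", "24"],
    ["DNB", "-", "-"],
    ["119*", "225", "189"],
]

result = clean_innings(sample)
print(result)

# A whole record (a list) is never equal to a string, so a test such as
# record != "DNB" is always True and filters nothing out
always_true = sample[1] != "DNB"
print(always_true)
```

The last two lines show a pitfall worth watching for with RDDs of split lines: the comparison has to be made on the field, since the record as a whole is a list.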

5b. Dataframe:SparkR – Filter rows on specific condition

sparkR.session()

tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv", 
                header = "true", 
                delimiter = ",", 
                source = "csv", 
                inferSchema = "true", 
                na.strings = "")

print(dim(tendulkar1))
tendulkar1 <-SparkR::filter(tendulkar1,tendulkar1$Runs != "DNB")
print(dim(tendulkar1))
tendulkar1<-SparkR::filter(tendulkar1,tendulkar1$Runs != "TDNB")
print(dim(tendulkar1))
tendulkar1<-SparkR::filter(tendulkar1,tendulkar1$Runs != "absent")
print(dim(tendulkar1))

# Remove the "*" indicating not out
tendulkar1$Runs=SparkR::regexp_replace(tendulkar1$Runs, "\\*", "")
# Cast the string type Runs to double (withColumn returns a new DataFrame, so assign the result)
tendulkar1 <- withColumn(tendulkar1, "Runs", cast(tendulkar1$Runs, "double"))
head(SparkR::distinct(tendulkar1[,"Runs"]),20)
df=SparkR::select(tendulkar1,"*")
head(SparkR::collect(df))

5c Hive QL – Filter rows on specific condition

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Read CSV DF').getOrCreate()
tendulkar1 = spark.read.format('csv').option('header','true').load('/FileStore/tables/tendulkar.csv')
tendulkar1.createOrReplaceTempView('tendulkar1_table')
spark.sql('select  Runs, BF,Mins from tendulkar1_table where Runs NOT IN  ("DNB","TDNB","absent")').show(10, truncate = False)
+----+---+----+
|Runs|BF |Mins|
+----+---+----+
|15  |24 |28  |
|59  |172|254 |
|8   |16 |24  |
|41  |90 |124 |
|35  |51 |74  |
|57  |134|193 |
|0   |1  |1   |
|24  |44 |50  |
|88  |266|324 |
|5   |13 |15  |
+----+---+----+
only showing top 10 rows
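
If you want to try the WHERE ... NOT IN filter without a Spark cluster, the same query shape can be run with Python's built-in sqlite3 module. This is only a sketch: SQLite is not Hive or Spark SQL, the table is created by hand with three text columns, and the four rows are copied from the output above. String literals use single quotes, which is the portable SQL form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tendulkar1_table (Runs TEXT, BF TEXT, Mins TEXT)")
conn.executemany(
    "INSERT INTO tendulkar1_table VALUES (?, ?, ?)",
    [("15", "24", "28"), ("DNB", "-", "-"), ("59", "172", "254"), ("8", "16", "24")],
)

# Same filter as in the Hive QL example: keep only rows with a real score
query = (
    "SELECT Runs, BF, Mins FROM tendulkar1_table "
    "WHERE Runs NOT IN ('DNB', 'TDNB', 'absent') "
    "ORDER BY rowid"
)
kept = conn.execute(query).fetchall()
conn.close()

for row in kept:
    print(row)
```

The ORDER BY rowid clause is there only to make the row order explicit in this sketch.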

6. RDD – Find rows where Runs > 50

from pyspark import SparkContext
rdd = sc.textFile( "/FileStore/tables/tendulkar.csv")
df=rdd.map(lambda line: (line.split(",")))
df=rdd.map(lambda line: line.split(",")[0:4]) \
   .filter(lambda x: x[0] not in ["DNB", "TDNB", "absent"])
df1=df.map(lambda x: [x[0].replace("*","")] + x[1:4])
header=df1.first()
df2=df1.filter(lambda x: x !=header)
df3=df2.map(lambda x: [float(x[0])] +x[1:4])
df3.filter(lambda x: x[0]>=50).take(10)
Out[101]: 
[[59.0, '254', '172', '4'],
 [57.0, '193', '134', '6'],
 [88.0, '324', '266', '5'],
 [68.0, '216', '136', '8'],
 [119.0, '225', '189', '17'],
 [148.0, '298', '213', '14'],
 [114.0, '228', '161', '16'],
 [111.0, '373', '270', '19'],
 [73.0, '272', '208', '8'],
 [50.0, '158', '118', '6']]
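
Two details in this step are easy to miss: the Runs field has to be converted to a number before it can be compared with 50, and the code uses >= 50 while the heading says Runs > 50, so an innings of exactly 50 is included. The plain-Python sketch below shows both; the variable names are my own and the four rows (Runs, Mins, BF, 4s) are taken from the outputs in this post.

```python
rows = [
    ["59", "254", "172", "4"],
    ["8", "24", "16", "1"],
    ["119*", "225", "189", "17"],
    ["50", "158", "118", "6"],
]

# Strip the not-out marker and convert Runs to a float, keeping the other fields
as_numbers = [[float(r[0].replace("*", ""))] + r[1:] for r in rows]

# Same condition as the RDD filter above: 50 itself is kept
fifty_plus = [r for r in as_numbers if r[0] >= 50]

# Strictly greater than 50, as the heading suggests: 50 is dropped
strictly_over_50 = [r for r in as_numbers if r[0] > 50]

print([r[0] for r in fifty_plus])
print([r[0] for r in strictly_over_50])
```

Which of the two conditions is right depends on whether a score of exactly 50 should count, so it is worth choosing one and using it consistently.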

6a. Dataframe:Pyspark – Find rows where Runs >50

from pyspark.sql import SparkSession

from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import IntegerType
spark = SparkSession.builder.appName('Read CSV DF').getOrCreate()
tendulkar1 = spark.read.format('csv').option('header','true').load('/FileStore/tables/tendulkar.csv')
tendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'DNB')
tendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'TDNB')
tendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'absent')
# Remove the "*" that marks a not-out score; otherwise the integer cast turns such scores into null
tendulkar1 = tendulkar1.withColumn('Runs', regexp_replace('Runs', '[*]', ''))
tendulkar1 = tendulkar1.withColumn("Runs", tendulkar1["Runs"].cast(IntegerType()))
tendulkar1.filter(tendulkar1['Runs']>=50).show(10)
+----+----+---+---+---+-----+---+---------+----+--------------+------------+----------+
|Runs|Mins| BF| 4s| 6s|   SR|Pos|Dismissal|Inns|    Opposition|      Ground|Start Date|
+----+----+---+---+---+-----+---+---------+----+--------------+------------+----------+
|  59| 254|172|  4|  0| 34.3|  6|      lbw|   1|    v Pakistan|  Faisalabad| 23-Nov-89|
|  57| 193|134|  6|  0|42.53|  6|   caught|   3|    v Pakistan|     Sialkot|  9-Dec-89|
|  88| 324|266|  5|  0|33.08|  6|   caught|   1| v New Zealand|      Napier|  9-Feb-90|
|  68| 216|136|  8|  0|   50|  6|   caught|   2|     v England|  Manchester|  9-Aug-90|
| 119| 225|189| 17|  0|62.96|  6|  not out|   4|     v England|  Manchester|  9-Aug-90|
| 148| 298|213| 14|  0|69.48|  6|  not out|   2|   v Australia|      Sydney|  2-Jan-92|
| 114| 228|161| 16|  0| 70.8|  4|   caught|   2|   v Australia|       Perth|  1-Feb-92|
| 111| 373|270| 19|  0|41.11|  4|   caught|   2|v South Africa|Johannesburg| 26-Nov-92|
|  73| 272|208|  8|  1|35.09|  5|   caught|   2|v South Africa|   Cape Town|  2-Jan-93|
|  50| 158|118|  6|  0|42.37|  4|   caught|   1|     v England|     Kolkata| 29-Jan-93|
+----+----+---+---+---+-----+---+---------+----+--------------+------------+----------+

6b. Dataframe:SparkR – Find rows where Runs >50

# Load the SparkR library
library(SparkR)
sparkR.session()

tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv", 
                header = "true", 
                delimiter = ",", 
                source = "csv", 
                inferSchema = "true", 
                na.strings = "")

print(dim(tendulkar1))
tendulkar1 <-SparkR::filter(tendulkar1,tendulkar1$Runs != "DNB")
print(dim(tendulkar1))
tendulkar1<-SparkR::filter(tendulkar1,tendulkar1$Runs != "TDNB")
print(dim(tendulkar1))
tendulkar1<-SparkR::filter(tendulkar1,tendulkar1$Runs != "absent")
print(dim(tendulkar1))

# Remove the "*" indicating not out
tendulkar1$Runs=SparkR::regexp_replace(tendulkar1$Runs, "\\*", "")
# Cast the string type Runs to double (withColumn returns a new DataFrame, so assign the result)
tendulkar1 <- withColumn(tendulkar1, "Runs", cast(tendulkar1$Runs, "double"))
head(SparkR::distinct(tendulkar1[,"Runs"]),20)
df=SparkR::select(tendulkar1,"*")
df=SparkR::filter(tendulkar1, tendulkar1$Runs > 50)
head(SparkR::collect(df))
  Runs Mins  BF 4s 6s    SR Pos Dismissal Inns    Opposition     Ground
1   59  254 172  4  0  34.3   6       lbw    1    v Pakistan Faisalabad
2   57  193 134  6  0 42.53   6    caught    3    v Pakistan    Sialkot
3   88  324 266  5  0 33.08   6    caught    1 v New Zealand     Napier
4   68  216 136  8  0    50   6    caught    2     v England Manchester
5  119  225 189 17  0 62.96   6   not out    4     v England Manchester
6  148  298 213 14  0 69.48   6   not out    2   v Australia     Sydney
  Start Date
1  23-Nov-89
2   9-Dec-89
3   9-Feb-90
4   9-Aug-90
5   9-Aug-90
6   2-Jan-92

 

7 RDD – groupByKey() and reduceByKey()

from pyspark import SparkContext
from pyspark.mllib.stat import Statistics
rdd = sc.textFile( "/FileStore/tables/tendulkar.csv")
df=rdd.map(lambda line: (line.split(",")))
df=rdd.map(lambda line: line.split(",")[0:]) \
   .filter(lambda x: x[0] not in ["DNB", "TDNB", "absent"])
df1=df.map(lambda x: [x[0].replace("*","")] + x[1:])
header=df1.first()
df2=df1.filter(lambda x: x !=header)
df3=df2.map(lambda x: [float(x[0])] +x[1:])
df4 = df3.map(lambda x: (x[10],x[0]))
df5=df4.reduceByKey(lambda a,b: a+b,1)
df4.groupByKey().mapValues(lambda x: sum(x) / len(x)).take(10)

[('Georgetown', 81.0),
 ('Lahore', 17.0),
 ('Adelaide', 32.6),
 ('Colombo (SSC)', 77.55555555555556),
 ('Nagpur', 64.66666666666667),
 ('Auckland', 5.0),
 ('Bloemfontein', 85.0),
 ('Centurion', 73.5),
 ('Faisalabad', 27.0),
 ('Bridgetown', 26.0)]
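
The difference between the two approaches is easier to see without Spark. With groupByKey() all the values for a key are collected first and then averaged; with reduceByKey() a running (sum, count) pair per key can be combined step by step. Here is a plain-Python sketch of both. The four (Ground, Runs) pairs are only the first few innings from this post, so the means differ from the full-data output above.

```python
from collections import defaultdict

pairs = [("Karachi", 15.0), ("Faisalabad", 59.0), ("Faisalabad", 8.0), ("Lahore", 41.0)]

# groupByKey style: gather every value for a key, then compute the mean
groups = defaultdict(list)
for ground, runs in pairs:
    groups[ground].append(runs)
mean_by_group = {g: sum(v) / len(v) for g, v in groups.items()}

# reduceByKey style: keep only a running (sum, count) per key
totals = {}
for ground, runs in pairs:
    running_sum, count = totals.get(ground, (0.0, 0))
    totals[ground] = (running_sum + runs, count + 1)
mean_by_reduce = {g: s / n for g, (s, n) in totals.items()}

print(mean_by_group)
print(mean_by_reduce)
```

Both routes give the same means; the second one never has to hold the full list of values for a key, which is generally why reduceByKey() is preferred for large datasets.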

7a Dataframe:Pyspark – Compute mean, min and max

from pyspark.sql.functions import *
tendulkar1= (sqlContext
         .read.format("com.databricks.spark.csv")
         .options(delimiter=',', header='true', inferschema='true')
         .load("/FileStore/tables/tendulkar.csv"))
tendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'DNB')
tendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'TDNB')
tendulkar1 = tendulkar1.withColumn('Runs', regexp_replace('Runs', '[*]', ''))
tendulkar1.select('Runs').rdd.distinct().collect()

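from pyspark.sql import functions as F
# Note: Runs is still a string column at this point, so F.min and F.max compare text
# (for example '111' sorts before '36'); cast Runs to an integer type first for numeric minima and maxima
df=tendulkar1[['Runs','BF','Ground']].groupby(tendulkar1['Ground']).agg(F.mean(tendulkar1['Runs']),F.min(tendulkar1['Runs']),F.max(tendulkar1['Runs']))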
df.show()
+-------------+-----------------+---------+---------+
|       Ground|        avg(Runs)|min(Runs)|max(Runs)|
+-------------+-----------------+---------+---------+
|    Bangalore|          54.3125|        0|       96|
|     Adelaide|             32.6|        0|       61|
|Colombo (PSS)|             37.2|       14|       71|
| Christchurch|             12.0|        0|       24|
|     Auckland|              5.0|        5|        5|
|      Chennai|           60.625|        0|       81|
|    Centurion|             73.5|      111|       36|
|     Brisbane|7.666666666666667|        0|        7|
|   Birmingham|            46.75|        1|       40|
|    Ahmedabad|           40.125|      100|        8|
|Colombo (RPS)|            143.0|      143|      143|
|   Chittagong|             57.8|      101|       36|
|    Cape Town|69.85714285714286|       14|        9|
|   Bridgetown|             26.0|        0|       92|
|     Bulawayo|             55.0|       36|       74|
|        Delhi|39.94736842105263|        0|       76|
|   Chandigarh|             11.0|       11|       11|
| Bloemfontein|             85.0|       15|      155|
|Colombo (SSC)|77.55555555555556|      104|        8|
|      Cuttack|              2.0|        2|        2|
+-------------+-----------------+---------+---------+
only showing top 20 rows
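
Some rows in the table above look odd, for example Centurion with min(Runs) = 111 and max(Runs) = 36. This is what happens when minimum and maximum are taken over strings rather than numbers: text is compared character by character, so "111" sorts before "36". The plain-Python sketch below reproduces the effect with the two values reported for Centurion; treating them as the only two innings there is my assumption, based on the average of 73.5.

```python
centurion_runs = ["111", "36"]

# Compared as text: '1' comes before '3', so "111" is the "minimum"
text_min = min(centurion_runs)
text_max = max(centurion_runs)
print(text_min, text_max)

# Compared as numbers after an explicit conversion
as_int = [int(r) for r in centurion_runs]
num_min = min(as_int)
num_max = max(as_int)
mean = sum(as_int) / len(as_int)
print(num_min, num_max, mean)
```

Converting the Runs column to a numeric type before aggregating avoids this, and the mean is unaffected either way.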

7b Dataframe:SparkR – Compute mean, min and max

sparkR.session()

tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv", 
                header = "true", 
                delimiter = ",", 
                source = "csv", 
                inferSchema = "true", 
                na.strings = "")

print(dim(tendulkar1))
tendulkar1 <-SparkR::filter(tendulkar1,tendulkar1$Runs != "DNB")
print(dim(tendulkar1))
tendulkar1<-SparkR::filter(tendulkar1,tendulkar1$Runs != "TDNB")
print(dim(tendulkar1))
tendulkar1<-SparkR::filter(tendulkar1,tendulkar1$Runs != "absent")
print(dim(tendulkar1))

# Note: the result of withColumn() is not assigned, so Runs stays a string column
withColumn(tendulkar1, "Runs", cast(tendulkar1$Runs, "double"))
head(SparkR::distinct(tendulkar1[,"Runs"]),20)
# Remove the "*" indicating not out
tendulkar1$Runs=SparkR::regexp_replace(tendulkar1$Runs, "\\*", "")
head(SparkR::distinct(tendulkar1[,"Runs"]),20)
# With Runs still a string, min() and max() compare text (for example "111" sorts before "36")
df=SparkR::summarize(SparkR::groupBy(tendulkar1, tendulkar1$Ground), mean = mean(tendulkar1$Runs), minRuns=min(tendulkar1$Runs),maxRuns=max(tendulkar1$Runs))
head(df,20)
          Ground       mean minRuns maxRuns
1      Bangalore  54.312500       0      96
2       Adelaide  32.600000       0      61
3  Colombo (PSS)  37.200000      14      71
4   Christchurch  12.000000       0      24
5       Auckland   5.000000       5       5
6        Chennai  60.625000       0      81
7      Centurion  73.500000     111      36
8       Brisbane   7.666667       0       7
9     Birmingham  46.750000       1      40
10     Ahmedabad  40.125000     100       8
11 Colombo (RPS) 143.000000     143     143
12    Chittagong  57.800000     101      36
13     Cape Town  69.857143      14       9
14    Bridgetown  26.000000       0      92
15      Bulawayo  55.000000      36      74
16         Delhi  39.947368       0      76
17    Chandigarh  11.000000      11      11
18  Bloemfontein  85.000000      15     155
19 Colombo (SSC)  77.555556     104       8
20       Cuttack   2.000000       2       2


Analyzing performances of cricketers using cricketr template

This post includes a template which you can use for analyzing the performances of cricketers, both batsmen and bowlers in Test, ODI and Twenty 20 cricket using my R package cricketr. To see actual usage of functions in the R package cricketr see Introducing cricketr! : An R package to analyze performances of cricketers.

This template can be downloaded from Github at cricketer-template

The ‘cricketr’ package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package supports all formats of the game including Test, ODI and Twenty20 versions.

You should be able to install the package from GitHub and use the many functions available in the package. Please be mindful of the ESPN Cricinfo Terms of Use

Take a look at my short video tutorial on my R package cricketr on Youtube – R package cricketr – A short tutorial

Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar

Important note 1: The latest release of ‘cricketr’ now includes the ability to analyze performances of teams now!!  See Cricketr adds team analytics to its repertoire!!!

Important note 2 : Cricketr can now do a more fine-grained analysis of players, see Cricketr learns new tricks : Performs fine-grained analysis of players

Important note 3: Do check out the python avatar of cricketr, ‘cricpy’ in my post ‘Introducing cricpy:A python package to analyze performances of cricketers

The cricketr package

The cricketr package has several functions that perform different analyses on both batsmen and bowlers. The package has functions that plot the percentage frequency of runs or wickets, runs likelihood for a batsman, relative run/strike rates of batsmen and relative performance/economy rate for bowlers.

Other interesting functions include batting performance moving average, forecast and a function to check whether the batsman is in-form or out-of-form.

The data for a particular player can be obtained with the getPlayerData() function. To do this you will need to go to ESPN CricInfo Player and type in the name of the player, e.g. Ricky Ponting, Sachin Tendulkar etc. This will bring up a page which has the profile number for the player, e.g. for Sachin Tendulkar this would be http://www.espncricinfo.com/india/content/player/35320.html. Hence, Sachin’s profile is 35320. This can be used to get the data for Tendulkar as shown below
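
If you prefer not to copy the profile number by hand, it can be pulled out of the profile URL. This is a small base R sketch that assumes the URL ends in `<number>.html` as in the example above; the variable names are only for illustration:

```r
# URL of the player's profile page, as quoted in the text
profile_url <- "http://www.espncricinfo.com/india/content/player/35320.html"

# Keep only the digits that sit between the last "/" and ".html"
profile_no <- as.integer(sub(".*/([0-9]+)\\.html$", "\\1", profile_url))
profile_no
```

The resulting number is what is passed as the first argument to getPlayerData().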

The cricketr package is now available from CRAN!!! You should be able to install directly with

1. Install the cricketr package

if (!require("cricketr")){
    # Install into the default library so that library(cricketr) can find it
    install.packages("cricketr")
}
library(cricketr)

The cricketr package includes some pre-packaged sample (.csv) files. You can use these samples to test functions as shown below

# Retrieve the file path of a data file installed with cricketr
#pathToFile <- system.file("data", "tendulkar.csv", package = "cricketr")
#batsman4s(pathToFile, "Sachin Tendulkar")

# The general format is pkg-function(pathToFile,par1,...)
#batsman4s(<path-To-File>,"Sachin Tendulkar")

The pre-packaged files can be accessed as shown above. To get the data of any player in Test, ODI and Twenty20 cricket, use the following functions

2. For Test cricket

#tendulkar <- getPlayerData(35320,dir="..",file="tendulkar.csv",type="batting",homeOrAway=c(1,2), result=c(1,2,4))

2a. For ODI cricket

#tendulkarOD <- getPlayerDataOD(35320,dir="..",file="tendulkarOD.csv",type="batting")

2b. For Twenty20 cricket

#tendulkarT20 <- getPlayerDataTT(35320,dir="..",file="tendulkarT20.csv",type="batting")

Analysis of batsmen

Important Note: This needs to be done only once for a player. This function stores the player’s data in a CSV file (e.g. tendulkar.csv as above) which can then be reused for all other functions. Once we have the data for the players many analyses can be done. This post will use the stored CSV file obtained with a prior getPlayerData for all subsequent analyses

Sachin Tendulkar’s performance – Basic Analyses

The 3 plots below provide the following for Tendulkar

  1. Frequency percentage of runs in each run range over the whole career
  2. Mean Strike Rate for runs scored in the given range
  3. A histogram of runs frequency percentages in runs ranges

3. Basic analyses

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsmanRunsFreqPerf("./tendulkar.csv","Tendulkar")
#batsmanMeanStrikeRate("./tendulkar.csv","Tendulkar")
#batsmanRunsRanges("./tendulkar.csv","Tendulkar")
dev.off()
## null device 
##           1
  1. Player 1
  2. Player 2
  3. Player 3
  4. Player 4

4. More analyses

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player1.csv","Player1")
#batsman6s("./player1.csv","Player1")
#batsmanMeanStrikeRate("./player1.csv","Player1")

# For ODI and T20
#batsmanScoringRateODTT("./player1.csv","Player1")
dev.off()
## null device 
##           1
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player2.csv","Player2")
#batsman6s("./player2.csv","Player2")
#batsmanMeanStrikeRate("./player2.csv","Player2")
# For ODI and T20
#batsmanScoringRateODTT("./player2.csv","Player2")
dev.off()
## null device 
##           1
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player3.csv","Player3")
#batsman6s("./player3.csv","Player3")
#batsmanMeanStrikeRate("./player3.csv","Player3")
# For ODI and T20
#batsmanScoringRateODTT("./player3.csv","Player3")

dev.off()
## null device 
##           1
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player4.csv","Player4")
#batsman6s("./player4.csv","Player4")
#batsmanMeanStrikeRate("./player4.csv","Player4")
# For ODI and T20
#batsmanScoringRateODTT("./player4.csv","Player4")
dev.off()
## null device 
##           1

Note: For mean strike rate in ODI and Twenty20 use the function batsmanScoringRateODTT()

5. Boxplot histogram plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency

#batsmanPerfBoxHist("./player1.csv","Player1")
#batsmanPerfBoxHist("./player2.csv","Player2")
#batsmanPerfBoxHist("./player3.csv","Player3")
#batsmanPerfBoxHist("./player4.csv","Player4")

6. Contribution to won and lost matches

For the 2 functions below you will have to use the getPlayerDataSp() function. I have commented this as I already have these files. This function can only be used for Test matches

#player1sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player1sp.csv",ttype="batting")
#player2sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player2sp.csv",ttype="batting")
#player3sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player3sp.csv",ttype="batting")
#player4sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player4sp.csv",ttype="batting")
par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanContributionWonLost("player1sp.csv","Player1")
#batsmanContributionWonLost("player2sp.csv","Player2")
#batsmanContributionWonLost("player3sp.csv","Player3")
#batsmanContributionWonLost("player4sp.csv","Player4")
dev.off()
## null device 
##           1

7. Performance at home and overseas

This function also requires the use of getPlayerDataSp() as shown above. This can only be used for Test matches

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanPerfHomeAway("player1sp.csv","Player1")
#batsmanPerfHomeAway("player2sp.csv","Player2")
#batsmanPerfHomeAway("player3sp.csv","Player3")
#batsmanPerfHomeAway("player4sp.csv","Player4")
dev.off()
## null device 
##           1

8. Batsman average at different venues

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanAvgRunsGround("./player1.csv","Player1")
#batsmanAvgRunsGround("./player2.csv","Player2")
#batsmanAvgRunsGround("./player3.csv","Player3")
#batsmanAvgRunsGround("./player4.csv","Player4")
dev.off()
## null device 
##           1

9. Batsman average against different opposition

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanAvgRunsOpposition("./player1.csv","Player1")
#batsmanAvgRunsOpposition("./player2.csv","Player2")
#batsmanAvgRunsOpposition("./player3.csv","Player3")
#batsmanAvgRunsOpposition("./player4.csv","Player4")
dev.off()
## null device 
##           1

10. Runs Likelihood of batsman

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanRunsLikelihood("./player1.csv","Player1")
#batsmanRunsLikelihood("./player2.csv","Player2")
#batsmanRunsLikelihood("./player3.csv","Player3")
#batsmanRunsLikelihood("./player4.csv","Player4")
dev.off()
## null device 
##           1

11. Moving Average of runs in career

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanMovingAverage("./player1.csv","Player1")
#batsmanMovingAverage("./player2.csv","Player2")
#batsmanMovingAverage("./player3.csv","Player3")
#batsmanMovingAverage("./player4.csv","Player4")
dev.off()
## null device 
##           1

12. Cumulative Average runs of batsman in career

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanCumulativeAverageRuns("./player1.csv","Player1")
#batsmanCumulativeAverageRuns("./player2.csv","Player2")
#batsmanCumulativeAverageRuns("./player3.csv","Player3")
#batsmanCumulativeAverageRuns("./player4.csv","Player4")
dev.off()
## null device 
##           1

13. Cumulative Average strike rate of batsman in career

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanCumulativeStrikeRate("./player1.csv","Player1")
#batsmanCumulativeStrikeRate("./player2.csv","Player2")
#batsmanCumulativeStrikeRate("./player3.csv","Player3")
#batsmanCumulativeStrikeRate("./player4.csv","Player4")
dev.off()
## null device 
##           1

14. Future Runs forecast

Here are plots that forecast how the batsman will perform in future. In this case 90% of the career runs trend is used as the training set. The remaining 10% is the test set.

A Holt-Winters forecasting model is used to forecast future performance based on the 90% training set. The forecasted runs trend is plotted. The test set is also plotted to see how closely the forecast and the actual values match

Take a look at the runs forecasted for the batsman below.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanPerfForecast("./player1.csv","Player1")
#batsmanPerfForecast("./player2.csv","Player2")
#batsmanPerfForecast("./player3.csv","Player3")
#batsmanPerfForecast("./player4.csv","Player4")
dev.off()
## null device 
##           1
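
The 90/10 split and the forecast described in this section can be sketched in base R as follows. This is not the code inside batsmanPerfForecast(); it is a minimal illustration on synthetic scores, and it assumes the simplest Holt-Winters variant (exponential smoothing without trend or seasonal components):

```r
set.seed(42)
# Synthetic innings scores standing in for a career; not real data
runs <- rpois(100, lambda = 40)

# 90% of the career is the training set, the last 10% the test set
n_train <- floor(0.9 * length(runs))
train <- ts(runs[1:n_train])
test  <- runs[(n_train + 1):length(runs)]

# Exponential smoothing via HoltWinters (no trend, no seasonal component)
fit <- HoltWinters(train, beta = FALSE, gamma = FALSE)

# Forecast as many innings ahead as there are in the test set
fc <- as.numeric(predict(fit, n.ahead = length(test)))

# Compare forecast and actual values side by side
data.frame(actual = test, forecast = round(fc, 1))
```

Plotting the forecast against the held-out test set is what lets you judge how close the two are.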

15. Relative Mean Strike Rate plot

The plot below compares the Mean Strike Rate of the batsmen for each runs range of 10 and plots them.

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","Player4")
#relativeBatsmanSR(frames,names)
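
To make the idea of a mean strike rate per 10-run range concrete, here is a small base R sketch on made-up innings. It only illustrates the bucketing and is not the implementation used by relativeBatsmanSR(); the column names Runs and BF are assumptions for the example:

```r
# Synthetic innings: runs scored and balls faced (illustrative only)
innings <- data.frame(
  Runs = c(5, 12, 18, 25, 27, 34, 41, 8),
  BF   = c(10, 20, 30, 50, 45, 60, 70, 16)
)

# Strike rate is runs per 100 balls
innings$SR <- 100 * innings$Runs / innings$BF

# Assign each innings to a 10-run bucket: [0,10), [10,20), ...
innings$Range <- cut(innings$Runs, breaks = seq(0, 50, by = 10), right = FALSE)

# Mean strike rate within each bucket
mean_sr <- tapply(innings$SR, innings$Range, mean)
mean_sr
```

Computing this vector for each batsman and drawing the results on one chart gives a relative comparison of the kind described above.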

16. Relative Runs Frequency plot

The plot below gives the relative Runs Frequency Percentages for each 10-run bucket.

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","Player4")
#relativeRunsFreqPerf(frames,names)

17. Relative cumulative average runs in career

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","Player4")
#relativeBatsmanCumulativeAvgRuns(frames,names)

18. Relative cumulative average strike rate in career

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","Player4")
#relativeBatsmanCumulativeStrikeRate(frames,names)

19. Check Batsman In-Form or Out-of-Form

The computation below uses Null Hypothesis testing and the p-value to determine if the batsman is in-form or out-of-form. For this, 90% of the career runs is chosen as the population and its mean is computed. The last 10% is chosen to be the sample set, and the sample mean and the sample standard deviation are calculated.

The Null Hypothesis (H0) assumes that the batsman continues to stay in-form, i.e. the sample mean is within the 95% confidence interval of the population mean. The Alternative Hypothesis (Ha) assumes that the batsman is out-of-form, i.e. the sample mean is beyond the 95% confidence interval of the population mean.

A significance level of 0.05 is chosen and the p-value is computed.

If p-value >= 0.05 – Batsman In-Form
If p-value < 0.05 – Batsman Out-of-Form

Note: Ideally the p-value should be computed for a population that follows the Normal Distribution. But the runs population is usually right-skewed (many low scores and a few large ones). So some correction may be needed. I will revisit this later

This is done for the top 4 batsmen

#checkBatsmanInForm("./player1.csv","Player1")
#checkBatsmanInForm("./player2.csv","Player2")
#checkBatsmanInForm("./player3.csv","Player3")
#checkBatsmanInForm("./player4.csv","Player4")

20. 3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
#battingPerf3d("./player1.csv","Player1")
#battingPerf3d("./player2.csv","Player2")
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
#battingPerf3d("./player3.csv","Player3")
#battingPerf3d("./player4.csv","Player4")
dev.off()
## null device 
##           1

21. Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

BF <- seq( 10, 400,length=15)
Mins <- seq(30,600,length=15)
newDF <- data.frame(BF,Mins)
#Player1 <- batsmanRunsPredict("./player1.csv","Player1",newdataframe=newDF)
#Player2 <- batsmanRunsPredict("./player2.csv","Player2",newdataframe=newDF)
#Player3 <- batsmanRunsPredict("./player3.csv","Player3",newdataframe=newDF)
#Player4 <- batsmanRunsPredict("./player4.csv","Player4",newdataframe=newDF)
#batsmen <-cbind(round(Player1$Runs),round(Player2$Runs),round(Player3$Runs),round(Player4$Runs))
#colnames(batsmen) <- c("Player1","Player2","Player3","Player4")
#newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
#colnames(newDF) <- c("BallsFaced","MinsAtCrease")
#predictedRuns <- cbind(newDF,batsmen)
#predictedRuns
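
For readers who want to see what such a regression looks like without the package, the sketch below fits a plane of Runs on Balls Faced and Minutes with base R's lm() and predicts on the same grid of values. The data are synthetic and the fit is only an illustration of the technique, not the internals of batsmanRunsPredict():

```r
set.seed(7)
# Synthetic innings (illustrative only): balls faced, minutes at crease, runs
BF   <- round(runif(60, 5, 300))
Mins <- round(BF * 1.4 + rnorm(60, 0, 10))
Runs <- round(0.5 * BF + 0.05 * Mins + rnorm(60, 0, 8))
innings <- data.frame(Runs, BF, Mins)

# Fit the regression plane Runs ~ BF + Mins
fit <- lm(Runs ~ BF + Mins, data = innings)

# Same grid of new values as in the template above
newDF <- data.frame(BF = seq(10, 400, length.out = 15),
                    Mins = seq(30, 600, length.out = 15))

# Predicted runs for each (BF, Mins) pair
predicted <- round(predict(fit, newdata = newDF))
cbind(newDF, Runs = predicted)
```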

Analysis of bowlers

  1. Bowler1
  2. Bowler2
  3. Bowler3
  4. Bowler4

player1 <- getPlayerData(xxxx,dir="..",file="player1.csv",type="bowling")

Note: For One Day you will have to use getPlayerDataOD() and for Twenty20 it is getPlayerDataTT()

21. Wicket Frequency Plot

The plot below computes the percentage frequency of the number of wickets taken, e.g. 1 wicket x%, 2 wickets y% etc., and plots them as a continuous line

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerWktsFreqPercent("./bowler1.csv","Bowler1")
#bowlerWktsFreqPercent("./bowler2.csv","Bowler2")
#bowlerWktsFreqPercent("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1
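
The percentage frequency of wickets can be computed in a couple of lines of base R. The sketch below uses a made-up vector of wickets per innings and is only meant to show the calculation, not how bowlerWktsFreqPercent() is implemented:

```r
# Wickets taken per innings (synthetic, illustrative only)
wkts <- c(0, 1, 1, 2, 3, 1, 0, 2, 5, 1)

# Count how often each wicket haul occurs, then convert to percentages
freq_pct <- 100 * prop.table(table(wkts))
freq_pct
```

Plotting freq_pct against the number of wickets with a line plot gives a chart of the kind described above.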

22. Wickets Runs plot

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerWktsRunsPlot("./bowler1.csv","Bowler1")
#bowlerWktsRunsPlot("./bowler2.csv","Bowler2")
#bowlerWktsRunsPlot("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

23. Average wickets at different venues

#bowlerAvgWktsGround("./bowler3.csv","Bowler3")

24. Average wickets against different opposition

#bowlerAvgWktsOpposition("./bowler3.csv","Bowler3")

25. Wickets taken moving average

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerMovingAverage("./bowler1.csv","Bowler1")
#bowlerMovingAverage("./bowler2.csv","Bowler2")
#bowlerMovingAverage("./bowler3.csv","Bowler3")

dev.off()
## null device 
##           1

26. Cumulative Wickets taken

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerCumulativeAvgWickets("./bowler1.csv","Bowler1")
#bowlerCumulativeAvgWickets("./bowler2.csv","Bowler2")
#bowlerCumulativeAvgWickets("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

27. Cumulative Economy rate

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerCumulativeAvgEconRate("./bowler1.csv","Bowler1")
#bowlerCumulativeAvgEconRate("./bowler2.csv","Bowler2")
#bowlerCumulativeAvgEconRate("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

28. Future Wickets forecast

Here are plots that forecast how the bowler will perform in future. In this case 90% of the career wickets trend is used as the training set. The remaining 10% is the test set.

A Holt-Winters forecasting model is used to forecast future performance based on the 90% training set. The forecasted wickets trend is plotted. The test set is also plotted to see how closely the forecast and the actual values match

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerPerfForecast("./bowler1.csv","Bowler1")
#bowlerPerfForecast("./bowler2.csv","Bowler2")
#bowlerPerfForecast("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

29. Contribution to matches won and lost

As discussed above the next 2 charts require the use of getPlayerDataSp(). This can only be done for Test matches

#bowler1sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler1sp.csv",ttype="bowling")
#bowler2sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler2sp.csv",ttype="bowling")
#bowler3sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler3sp.csv",ttype="bowling")
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerContributionWonLost("bowler1sp.csv","Bowler1")
#bowlerContributionWonLost("bowler2sp.csv","Bowler2")
#bowlerContributionWonLost("bowler3sp.csv","Bowler3")
dev.off()
## null device 
##           1

30. Performance home and overseas.

This can only be done for Test matches

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerPerfHomeAway("bowler1sp.csv","Bowler1")
#bowlerPerfHomeAway("bowler2sp.csv","Bowler2")
#bowlerPerfHomeAway("bowler3sp.csv","Bowler3")
dev.off()
## null device 
##           1

31. Relative Wickets Frequency Percentage

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlingPerf(frames,names)

32. Relative Economy Rate against wickets taken

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlingER(frames,names)

33. Relative cumulative average wickets of bowlers in career

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlerCumulativeAvgWickets(frames,names)

34. Relative cumulative average economy rate of bowlers

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlerCumulativeAvgEconRate(frames,names)

35. Check for bowler in-form/out-of-form

The computation below uses Null Hypothesis testing and the p-value to determine if the bowler is in-form or out-of-form. For this, 90% of the career wickets is chosen as the population and its mean is computed. The last 10% is chosen to be the sample set, and the sample mean and the sample standard deviation are calculated.

The Null Hypothesis (H0) assumes that the bowler continues to stay in-form, i.e. the sample mean is within the 95% confidence interval of the population mean. The Alternative Hypothesis (Ha) assumes that the bowler is out-of-form, i.e. the sample mean is beyond the 95% confidence interval of the population mean.

A significance level of 0.05 is chosen and the p-value is computed.

If p-value >= 0.05 – Bowler In-Form
If p-value < 0.05 – Bowler Out-of-Form

Note: Ideally the p-value should be computed for a population that follows the Normal Distribution. But the wickets population is usually skewed. So some correction may be needed. I will revisit this later

The form status of the bowlers is checked below

#checkBowlerInForm("./bowler1.csv","Bowler1")
#checkBowlerInForm("./bowler2.csv","Bowler2")
#checkBowlerInForm("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

The Clash of the Titans in Test and ODI cricket

Who looks outside, dreams; who looks inside, awakes.
Show me a sane man and I will cure him for you.

            Carl Jung 

 

We’re made of star stuff. We are a way for the cosmos to know itself.
If you want to make an apple pie from scratch, you must first create the universe.

            Carl Sagan

Introduction

The biggest nag in the collective psyche of the cricketing fraternity these days is whether Virat Kohli has surpassed Sachin Tendulkar. This question has been troubling cricket lovers the world over, and particularly in India, for quite a while. It has only grown stronger with Kohli’s 41st ODI century and with Michael Vaughan bestowing the GOAT title on Virat Kohli for ODI cricket. Hence, I decided to do my bit in addressing this by analyzing Kohli’s and Tendulkar’s performances in ODI cricket. I also wanted to look at who is the best among the cricketing idols of India in Test cricket, namely Sunil Gavaskar, Sachin Tendulkar and Virat Kohli. Hence this post has 2 parts

  1. Analysis of Tendulkar, Gavaskar and Kohli in Test cricket
  2. Analysis of Tendulkar and Kohli in ODIs

In this post, I analyze the performances of these titans in Test and ODI cricket using my R package cricketr. Some may feel that comparisons are not possible as these batsmen are from different eras, and to some extent this is true. I would give some leeway to Gavaskar as he had to bat in a pre-helmet era. But with Tendulkar and Kohli a fair and objective comparison is possible. There were pre-eminent bowlers in the times of Tendulkar as there are now.

From the analysis below, it can be seen that Tendulkar is ahead  of everybody else in Test cricket. However it must be noted that Tendulkar’s performance deteriorated towards the end of his career. Such was not the case with Gavaskar. Kohli has some catching up to do and he still has a lot of Test cricket in him.

In ODIs Kohli can be seen to be pulling ahead of Tendulkar in several aspects.

My R package cricketr can be installed directly from CRAN and you can use it to analyze cricketers.

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package supports all formats of the game including Test, ODI and Twenty20 versions.

You should be able to install the package from GitHub and use the many functions available in the package. Please be mindful of the ESPN Cricinfo Terms of Use

Important note 1: The latest release of ‘cricketr’ now includes the ability to analyze performances of teams now!!  See Cricketr adds team analytics to its repertoire!!!

Important note 2 : Cricketr can now do a more fine-grained analysis of players, see Cricketr learns new tricks : Performs fine-grained analysis of players

Important note 3: Do check out the python avatar of cricketr, ‘cricpy’ in my post ‘Introducing cricpy:A python package to analyze performances of cricketers

Take a look at my short video tutorial on my R package cricketr on Youtube – R package cricketr – A short tutorial

Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar

Note 1: If you would like to do a similar analysis for a different set of batsmen and bowlers, you can clone/download my skeleton cricketr template from Github (which is the R Markdown file I have used for the analysis below).

Note 2: I sprinkle the charts with my observations. Feel free to look at them more closely and come to your conclusions.

If you are passionate about cricket, and love analyzing cricket performances, then check out my racy book on cricket ‘Cricket analytics with cricketr and cricpy – Analytics harmony with R & Python’! This book discusses and shows how to use my R package ‘cricketr’ and my Python package ‘cricpy’ to analyze batsmen and bowlers in all formats of the game (Test, ODI and T20). The paperback is available on Amazon at $21.99 and  the kindle version at $9.99/Rs 449/-. A must read for any cricket lover! Check it out!!

1 Load the cricketr package

if (!require("cricketr")){
    # Install into the default library so that library(cricketr) can find it
    install.packages("cricketr")
}
library(cricketr)

A Test cricket  – Analysis of Gavaskar, Tendulkar and Kohli

2. Get player data

tendulkar <- getPlayerData(35320,dir=".",file="tendulkar.csv",type="batting")
kohli <- getPlayerData(253802,dir=".",file="kohli.csv",type="batting")
gavaskar <- getPlayerData(28794,dir=".",file="gavaskar.csv",type="batting")

3a. Basic analyses for Tendulkar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsFreqPerf("./tendulkar.csv","Tendulkar")
batsmanMeanStrikeRate("./tendulkar.csv","Tendulkar")
batsmanRunsRanges("./tendulkar.csv","Tendulkar")
dev.off()

3b Basic analyses for Kohli

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsFreqPerf("./kohli.csv","Kohli")
batsmanMeanStrikeRate("./kohli.csv","Kohli")
batsmanRunsRanges("./kohli.csv","Kohli")
dev.off()

3c Basic analyses for Gavaskar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsFreqPerf("./gavaskar.csv","Gavaskar")
batsmanMeanStrikeRate("./gavaskar.csv","Gavaskar")
batsmanRunsRanges("./gavaskar.csv","Gavaskar")
dev.off()

4a.More analyses for Tendulkar

It can be seen that Tendulkar and Gavaskar have been bowled more often than Kohli. Also, Kohli does not have as many sixes in Test cricket as Tendulkar and Gavaskar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./tendulkar.csv","Tendulkar")
batsman6s("./tendulkar.csv","Tendulkar")
batsmanDismissals("./tendulkar.csv","Tendulkar")
dev.off()

4b. More analyses for Kohli

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./kohli.csv","Kohli")
batsman6s("./kohli.csv","Kohli")
batsmanDismissals("./kohli.csv","Kohli")
dev.off()

4c More analyses for Gavaskar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./gavaskar.csv","Gavaskar")
batsman6s("./gavaskar.csv","Gavaskar")
batsmanDismissals("./gavaskar.csv","Gavaskar")
dev.off()

5 Performance of batsmen on different grounds

par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./tendulkar.csv","Tendulkar")
batsmanAvgRunsGround("./kohli.csv","Kohli")
batsmanAvgRunsGround("./gavaskar.csv","Gavaskar")

#dev.off()

6. Performance of batsmen against different Opposition

  1. Tendulkar averages 50 against the following countries – Australia, Bangladesh, England, Sri Lanka, West Indies and Zimbabwe
  2. Kohli averages almost 50 against all the nations he has played – Australia, Bangladesh, England, New Zealand, Sri Lanka and West Indies
  3. Gavaskar averages 50 against Australia, Pakistan, West Indies, Sri Lanka

par(mar=c(4,4,2,2))
batsmanAvgRunsOpposition("./tendulkar.csv","Tendulkar")
batsmanAvgRunsOpposition("./kohli.csv","Kohli")
batsmanAvgRunsOpposition("./gavaskar.csv","Gavaskar")

7. Get player data special

This is required for the next 2 function calls

tendulkarsp <- getPlayerDataSp(35320,tdir=".",tfile="tendulkarsp.csv",ttype="batting")
kohlisp <- getPlayerDataSp(253802,tdir=".",tfile="kohlisp.csv",ttype="batting")
gavaskarsp <- getPlayerDataSp(28794,tdir=".",tfile="gavaskarsp.csv",ttype="batting")

#dev.off()

8 Get contribution of batsmen in matches won and lost

Kohli has contributed equally in won and lost matches. Tendulkar’s runs seem not to have helped as much in winning, as only 50% of the matches he has played have been won

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))

batsmanContributionWonLost("tendulkarsp.csv","Tendulkar")
batsmanContributionWonLost("./kohlisp.csv","Kohli")
batsmanContributionWonLost("./gavaskarsp.csv","Gavaskar")
  

9 Performance of batsmen at home and overseas

The boxplots show that Kohli performs better overseas than at home. The 3rd quartile is higher, though the median seems to be lower overseas. For Tendulkar the performance is similar both ways. Gavaskar’s median runs scored overseas is higher.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))


batsmanPerfHomeAway("tendulkarsp.csv","Tendulkar")
batsmanPerfHomeAway("./kohlisp.csv","Kohli")
batsmanPerfHomeAway("./gavaskarsp.csv","Gavaskar")

10. Moving average of runs

Gavaskar’s moving average was very good at the time of his retirement. Kohli seems to be going very strong. Tendulkar’s performance shows signs of deterioration around the time of his retirement.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))

batsmanMovingAverage("./tendulkar.csv","Tendulkar")
batsmanMovingAverage("./kohli.csv","Kohli")
batsmanMovingAverage("./gavaskar.csv","Gavaskar")

#dev.off()
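
As a rough illustration of what a moving average of runs captures, here is a trailing-window average in base R on made-up scores. The window length k and the use of stats::filter() are assumptions for this sketch; batsmanMovingAverage() may smooth the data differently:

```r
# Runs per innings in career order (synthetic, illustrative only)
runs <- c(10, 50, 0, 100, 35, 72, 4, 60)

# Trailing moving average over the last k innings
k <- 3
mov_avg <- as.numeric(stats::filter(runs, rep(1 / k, k), sides = 1))
mov_avg
```

The first k - 1 entries are NA because a full window is not yet available. A rising tail in such a curve corresponds to a batsman finishing strongly, a falling tail to a decline near retirement.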

11 Boxplot and histogram of runs

Kohli has a marginally higher average (50.69) than Tendulkar (48.65), while Gavaskar averages 46. The median runs are the same for Tendulkar and Kohli at 32

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfBoxHist("./tendulkar.csv","Sachin Tendulkar")
batsmanPerfBoxHist("./kohli.csv","Kohli")
batsmanPerfBoxHist("./gavaskar.csv","Gavaskar")

12 Cumulative average Runs for batsmen

Looking at the cumulative average runs we can see a gradual drop in the cumulative average for Tendulkar while Kohli and Gavaskar’s performance seems to be getting better

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanCumulativeAverageRuns("./tendulkar.csv","Tendulkar")
batsmanCumulativeAverageRuns("./kohli.csv","Kohli")
batsmanCumulativeAverageRuns("./gavaskar.csv","Gavaskar")
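
A cumulative average is simply the running total of runs divided by the number of innings played so far. The base R sketch below shows this on made-up scores; it averages per innings and is not necessarily the exact calculation inside batsmanCumulativeAverageRuns():

```r
# Runs per innings in career order (synthetic, illustrative only)
runs <- c(10, 50, 0, 100, 35, 72, 4, 60)

# Running total divided by the number of innings so far
cum_avg <- cumsum(runs) / seq_along(runs)
round(cum_avg, 2)
```

Because every new innings is weighed against the whole career so far, this curve moves slowly, which is why a gradual drop late in a long career is notable.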

13. Cumulative average strike rate of batsmen

Tendulkar’s strike rate is better than Kohli’s and Gavaskar’s

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanCumulativeStrikeRate("./tendulkar.csv","Tendulkar")
batsmanCumulativeStrikeRate("./kohli.csv","Kohli")
batsmanCumulativeStrikeRate("./gavaskar.csv","Gavaskar")

14 Performance forecast of batsmen

The forecasted performance for Kohli and Gavaskar is higher than that of Tendulkar

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfForecast("./tendulkar.csv","Sachin Tendulkar")
batsmanPerfForecast("./kohli.csv","Kohli")
batsmanPerfForecast("./gavaskar.csv","Gavaskar")

#dev.off()

15. Relative strike rate of batsmen

par(mar=c(4,4,2,2))

frames <- list("./tendulkar.csv","./kohli.csv","gavaskar.csv")
names <- list("Tendulkar","Kohli","Gavaskar")
relativeBatsmanSR(frames,names)
#dev.off()

16. Relative Runs frequency of batsmen

par(mar=c(4,4,2,2))
frames <- list("./tendulkar.csv","./kohli.csv","gavaskar.csv")
names <- list("Tendulkar","Kohli","Gavaskar")
relativeRunsFreqPerf(frames,names)
#dev.off()

17. Relative cumulative average runs of batsmen

Tendulkar leads the way here, but it can be seen that Kohli is catching up.

par(mar=c(4,4,2,2))
frames <- list("./tendulkar.csv","./kohli.csv","gavaskar.csv")
names <- list("Tendulkar","Kohli","Gavaskar")
relativeBatsmanCumulativeAvgRuns(frames,names)
#dev.off()

18. Relative cumulative average strike rate

Tendulkar has better strike rate than the other two.

par(mar=c(4,4,2,2))
frames <- list("./tendulkar.csv","./kohli.csv","gavaskar.csv")
names <- list("Tendulkar","Kohli","Gavaskar")
relativeBatsmanCumulativeStrikeRate(frames,names)
#dev.off()

19. Check batsman in form

As in the moving average and performance forecast and cumulative average runs, Kohli and Gavaskar are in-form while Tendulkar was out-of-form towards the end.

checkBatsmanInForm("./tendulkar.csv","Sachin Tendulkar")
## [1] "**************************** Form status of Sachin Tendulkar ****************************
\n\n Population size: 294  Mean of population: 50.48 \n Sample size: 33  Mean of sample: 32.42 SD of 
sample: 29.8 \n\n Null hypothesis H0 : Sachin Tendulkar 's sample average is within 95% confidence interval 
of population average\n Alternative hypothesis Ha : Sachin Tendulkar 's sample average is below 
the 95% confidence interval of population average\n\n 
Sachin Tendulkar 's Form Status: Out-of-Form because the p value: 0.000713  is less than alpha=  0.05 \n *******************************************************************************************\n\n"
checkBatsmanInForm("./kohli.csv","Kohli")
## [1] "**************************** Form status of Kohli ****************************\n\n Population size: 117
  Mean of population: 50.35 \n Sample size: 13  Mean of sample: 53.77 SD of sample: 46.15 \n\n Null 
hypothesis H0 : Kohli 's sample average is within 95% confidence interval of population average\n 
Alternative hypothesis Ha : Kohli 's sample average is below the 95% confidence interval of population
 average\n\n Kohli 's Form Status: In-Form because the p value: 0.603244  is greater than alpha=  0.05 \n *******************************************************************************************\n\n"
checkBatsmanInForm("./gavaskar.csv","Gavaskar")
## [1] "**************************** Form status of Gavaskar ****************************\n\n 
Population size: 125  Mean of population: 44.67 \n Sample size: 14  Mean of sample: 57.86 SD of sample:
 58.55 \n\n Null hypothesis H0 : Gavaskar 's sample average is within 95% confidence interval of population
 average\n Alternative hypothesis Ha : Gavaskar 's sample average is below the 95% confidence interval of 
population average\n\n Gavaskar 's Form Status: In-Form because the p value: 0.793276  is greater 
than alpha=  0.05 \n *******************************************************************************************\n\n"
#dev.off()
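
The p values printed above are consistent with a one-sided (lower-tail) t-test of the recent sample mean against the career mean. The sketch below reproduces them from the printed summary figures only. It is an assumption about the kind of test involved, not the actual source of checkBatsmanInForm, and the helper names t_pdf, t_cdf and form_p_value are made up for illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] and symmetry around zero."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

def form_p_value(pop_mean, sample_mean, sample_sd, sample_size):
    """Lower-tail p value: small values mean the sample mean is well below the population mean."""
    t_stat = (sample_mean - pop_mean) / (sample_sd / math.sqrt(sample_size))
    return t_cdf(t_stat, sample_size - 1)

# Summary figures copied from the printed output above
p_tendulkar = form_p_value(50.48, 32.42, 29.8, 33)
p_kohli = form_p_value(50.35, 53.77, 46.15, 13)
p_gavaskar = form_p_value(44.67, 57.86, 58.55, 14)

for name, p in [("Tendulkar", p_tendulkar), ("Kohli", p_kohli), ("Gavaskar", p_gavaskar)]:
    status = "Out-of-Form" if p < 0.05 else "In-Form"
    print(name, round(p, 4), status)
```

Because the inputs are the rounded figures from the printout, the p values agree with the package output only to about three decimal places.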

20. Performance 3D

A 3D regression plane is fitted between the Balls faced, Minutes at crease and Runs scored.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
battingPerf3d("./tendulkar.csv","Sachin Tendulkar")
battingPerf3d("./kohli.csv","Kohli")
battingPerf3d("./gavaskar.csv","Gavaskar")
#dev.off()

20. Runs likelihood

This function uses K-Means clustering to determine the runs the batsmen are likely to score.

par(mar=c(4,4,2,2))
batsmanRunsLikelihood("./tendulkar.csv","Tendulkar")
## Summary of  Tendulkar 's runs scoring likelihood
## **************************************************
## 
## There is a 16.51 % likelihood that Tendulkar  will make  139 Runs in  251 balls over 353  Minutes 
## There is a 25.08 % likelihood that Tendulkar  will make  66 Runs in  122 balls over  167  Minutes 
## There is a 58.41 % likelihood that Tendulkar  will make  16 Runs in  31 balls over 44  Minutes
batsmanRunsLikelihood("./kohli.csv","Kohli")
## Summary of  Kohli 's runs scoring likelihood
## **************************************************
## 
## There is a 20 % likelihood that Kohli  will make  143 Runs in  232 balls over 330  Minutes 
## There is a 33.85 % likelihood that Kohli  will make  51 Runs in  92 balls over  127  Minutes 
## There is a 46.15 % likelihood that Kohli  will make  11 Runs in  24 balls over 31  Minutes
batsmanRunsLikelihood("./gavaskar.csv","Gavaskar")
## Summary of  Gavaskar 's runs scoring likelihood
## **************************************************
## 
## There is a 33.81 % likelihood that Gavaskar  will make  69 Runs in  159 balls over 214  Minutes 
## There is a 8.63 % likelihood that Gavaskar  will make  172 Runs in  364 balls over  506  Minutes 
## There is a 57.55 % likelihood that Gavaskar  will make  13 Runs in  35 balls over 48  Minutes
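
To make the idea concrete, here is a minimal Python sketch of how such likelihoods could be derived: cluster the innings on (balls faced, minutes, runs) with K-Means and report each cluster's share of innings together with its centre. The innings data is made up, the kmeans helper is hypothetical, and the package itself may differ in its details (for example in how the clusters are initialised).

```python
def kmeans(points, k=3, max_iter=100):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    # Deterministic start: evenly spaced points after sorting by the last column (runs)
    ordered = sorted(points, key=lambda p: p[-1])
    step = (len(ordered) - 1) / (k - 1)
    centroids = [ordered[round(i * step)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance)
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Made-up innings as (balls faced, minutes at crease, runs)
innings = [(12, 20, 5), (20, 30, 10), (25, 35, 14), (30, 45, 18), (18, 25, 8), (35, 50, 20),
           (90, 130, 55), (110, 150, 62), (120, 170, 70),
           (230, 330, 140), (250, 360, 150)]

centroids, labels = kmeans(innings, k=3)
sizes = [labels.count(c) for c in range(3)]
shares = [100 * n / len(innings) for n in sizes]

for (balls, mins, runs), share in zip(centroids, shares):
    print(f"There is a {share:.2f} % likelihood of {runs:.0f} runs in {balls:.0f} balls over {mins:.0f} minutes")
```

The likelihood here is simply the fraction of innings that fall into each cluster, which is why the three percentages add up to 100.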

21. Predict runs for a random combination of Balls faced and Minutes at crease

BF <- seq( 10, 400,length=15)
Mins <- seq(30,600,length=15)
newDF <- data.frame(BF,Mins)
tendulkar <- batsmanRunsPredict("./tendulkar.csv","Tendulkar",newdataframe=newDF)
kohli <- batsmanRunsPredict("./kohli.csv","Kohli",newdataframe=newDF)
gavaskar <- batsmanRunsPredict("./gavaskar.csv","Gavaskar",newdataframe=newDF)
batsmen <-cbind(round(tendulkar$Runs),round(kohli$Runs),round(gavaskar$Runs))
colnames(batsmen) <- c("Tendulkar","Kohli","Gavaskar")
newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
colnames(newDF) <- c("BallsFaced","MinsAtCrease")
predictedRuns <- cbind(newDF,batsmen)
predictedRuns
##    BallsFaced MinsAtCrease Tendulkar Kohli Gavaskar
## 1          10           30         7     6        4
## 2          38           71        23    24       17
## 3          66          111        39    42       30
## 4          94          152        54    60       43
## 5         121          193        70    78       56
## 6         149          234        86    96       69
## 7         177          274       102   114       82
## 8         205          315       118   132       95
## 9         233          356       134   150      108
## 10        261          396       150   168      121
## 11        289          437       165   186      134
## 12        316          478       181   204      147
## 13        344          519       197   222      160
## 14        372          559       213   240      173
## 15        400          600       229   258      186
#dev.off()
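
The prediction step amounts to fitting a plane Runs = b0 + b1*BallsFaced + b2*Mins and evaluating it on a grid of new values. The Python sketch below shows that idea with the standard library only. The training innings are made up (generated from a known plane so the fit can be checked), and seq and fit_plane are hypothetical helpers, not cricketr functions.

```python
def seq(start, stop, length):
    """Evenly spaced values, like R's seq(start, stop, length=n)."""
    step = (stop - start) / (length - 1)
    return [start + i * step for i in range(length)]

def fit_plane(rows):
    """Least-squares fit of runs = b0 + b1*balls + b2*minutes via the normal equations."""
    X = [(1.0, bf, mins) for bf, mins, _ in rows]
    y = [runs for _, _, runs in rows]
    # Augmented 3x4 system (X'X | X'y)
    A = [[sum(xi[r] * xi[c] for xi in X) for c in range(3)] +
         [sum(xi[r] * yi for xi, yi in zip(X, y))] for r in range(3)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][3] / A[i][i] for i in range(3)]

# Made-up innings (balls faced, minutes, runs) lying exactly on runs = 2 + 0.5*balls + 0.05*mins
training = [(10, 30, 8.5), (40, 50, 24.5), (60, 100, 37.0),
            (100, 120, 58.0), (150, 240, 89.0), (200, 260, 115.0)]
b0, b1, b2 = fit_plane(training)

# Same grid as in the R code above
BF = seq(10, 400, 15)
Mins = seq(30, 600, 15)
predicted = [b0 + b1 * bf + b2 * m for bf, m in zip(BF, Mins)]

for bf, m, runs in zip(BF, Mins, predicted):
    print(round(bf), round(m), round(runs))
```

Since both inputs grow linearly along the grid, the predicted runs also grow by a constant step from row to row, which is the pattern visible in the table above.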

Key findings

  1. Kohli has a marginally higher average than Tendulkar
  2. Tendulkar has the best strike rate of the three.
  3. The cumulative average runs and the performance forecast for Kohli and Gavaskar show an improving trend, while Tendulkar’s numbers deteriorate towards the end of his career
  4. Kohli is fast catching up with Tendulkar on cumulative average runs vs innings in career.

B ODI Cricket – Analysis of Tendulkar and Kohli

The functions below get the ODI data for Tendulkar and Kohli as CSV files so that the analyses can be done

22. Get player data for ODIs

tendulkarOD <- getPlayerDataOD(35320,dir=".",file="tendulkarOD.csv",type="batting")
kohliOD <- getPlayerDataOD(253802,dir=".",file="kohliOD.csv",type="batting")

#dev.off()

23a. Basic performance of Tendulkar in ODI

par(mfrow=c(3,2))
par(mar=c(4,4,2,2))
batsmanRunsFreqPerf("./tendulkarOD.csv","Tendulkar")
batsmanRunsRanges("./tendulkarOD.csv","Tendulkar")
batsman4s("./tendulkarOD.csv","Tendulkar")
batsman6s("./tendulkarOD.csv","Tendulkar")
batsmanScoringRateODTT("./tendulkarOD.csv","Tendulkar")
#dev.off()

23b. Basic performance of Kohli in ODI

par(mfrow=c(3,2))
par(mar=c(4,4,2,2))
batsmanRunsFreqPerf("./kohliOD.csv","Kohli")
batsmanRunsRanges("./kohliOD.csv","Kohli")
batsman4s("./kohliOD.csv","Kohli")
batsman6s("./kohliOD.csv","Kohli")
batsmanScoringRateODTT("./kohliOD.csv","Kohli")
#dev.off()

24. Performance forecast in ODIs

Kohli’s forecasted runs are much higher than Tendulkar’s in ODIs

par(mar=c(4,4,2,2))
batsmanPerfForecast("./tendulkarOD.csv","Tendulkar")
batsmanPerfForecast("./kohliOD.csv","Kohli")

25. Batting performance

A 3D regression plane is fitted between Balls faced, Minutes at crease and Runs scored.

par(mar=c(4,4,2,2))
battingPerf3d("./tendulkarOD.csv","Tendulkar")
battingPerf3d("./kohliOD.csv","Kohli")

26. Predicting runs scored for the ODI batsmen

Kohli is predicted to score more runs than Tendulkar for the same minutes at crease and balls faced.

BF <- seq( 10, 200,length=10)
Mins <- seq(30,220,length=10)
newDF <- data.frame(BF,Mins)
tendulkarDF <- batsmanRunsPredict("./tendulkarOD.csv","Tendulkar",newdataframe=newDF)
kohliDF <- batsmanRunsPredict("./kohliOD.csv","Kohli",newdataframe=newDF)
batsmen <-cbind(round(tendulkarDF$Runs),round(kohliDF$Runs))
colnames(batsmen) <- c("Tendulkar","Kohli")
newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
colnames(newDF) <- c("BallsFaced","MinsAtCrease")
predictedRuns <- cbind(newDF,batsmen)
predictedRuns
##    BallsFaced MinsAtCrease Tendulkar Kohli
## 1          10           30         7     8
## 2          31           51        26    28
## 3          52           72        45    48
## 4          73           93        64    68
## 5          94          114        83    88
## 6         116          136       102   108
## 7         137          157       121   128
## 8         158          178       140   149
## 9         179          199       159   169
## 10        200          220       178   189

27. Runs likelihood for the ODI batsmen

Tendulkar has clusters around 13, 53 and 111 runs while Kohli has clusters around 13, 63 and 116. So it is more likely that Kohli will tend to score higher.

par(mar=c(4,4,2,2))
batsmanRunsLikelihood("./tendulkarOD.csv","Tendulkar")
## Summary of  Tendulkar 's runs scoring likelihood
## **************************************************
## 
## There is a 18.09 % likelihood that Tendulkar  will make  111 Runs in  118 balls over 172  Minutes 
## There is a 28.39 % likelihood that Tendulkar  will make  53 Runs in  63 balls over  95  Minutes 
## There is a 53.52 % likelihood that Tendulkar  will make  13 Runs in  18 balls over 27  Minutes
batsmanRunsLikelihood("./kohliOD.csv","Kohli")
## Summary of  Kohli 's runs scoring likelihood
## **************************************************
## 
## There is a 31.41 % likelihood that Kohli  will make  63 Runs in  69 balls over 97  Minutes 
## There is a 49.74 % likelihood that Kohli  will make  13 Runs in  18 balls over  24  Minutes 
## There is a 18.85 % likelihood that Kohli  will make  116 Runs in  113 balls over 163  Minutes

28. Runs in different venues for the ODI batsmen

par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./tendulkarOD.csv","Tendulkar")
batsmanAvgRunsGround("./kohliOD.csv","Kohli")

28. Runs against different opposition for the ODI batsmen

Tendulkar has a 50+ average against Bermuda, Kenya and Namibia, while Kohli has a 50+ average against New Zealand, West Indies, South Africa, Zimbabwe and Bangladesh.

par(mar=c(4,4,2,2))
batsmanAvgRunsOpposition("./tendulkarOD.csv","Tendulkar")
batsmanAvgRunsOpposition("./kohliOD.csv","Kohli")
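
As a rough illustration of what sits behind such a chart, the Python sketch below groups innings by opposition and averages the runs. The innings and team names are made up, average_by_opposition is a hypothetical helper, and it uses a plain mean of runs per innings (not-outs are ignored), which may not match the package exactly.

```python
from collections import defaultdict

def average_by_opposition(innings):
    """Mean runs per innings for each opposition."""
    totals = defaultdict(lambda: [0, 0])  # opposition -> [runs, innings count]
    for opposition, runs in innings:
        totals[opposition][0] += runs
        totals[opposition][1] += 1
    return {opp: total / count for opp, (total, count) in totals.items()}

# Made-up innings as (opposition, runs)
innings = [("Team X", 80), ("Team X", 40),
           ("Team Y", 30), ("Team Y", 50), ("Team Y", 10)]

averages = average_by_opposition(innings)
fifty_plus = sorted(opp for opp, avg in averages.items() if avg >= 50)
print(averages)
print(fifty_plus)
```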

29. Moving average of runs for the ODI batsmen

Tendulkar’s moving average shows an improvement (50+) towards the end of his career, but Kohli currently shows a marked increase (60+).

par(mar=c(4,4,2,2))
batsmanMovingAverage("./tendulkarOD.csv","Tendulkar")
batsmanMovingAverage("./kohliOD.csv","Kohli")

30. Cumulative average runs of ODI batsmen

Tendulkar plateaus at 40+ while Kohli’s cumulative average runs goes up and up!!!

par(mar=c(4,4,2,2))
batsmanCumulativeAverageRuns("./tendulkarOD.csv","Tendulkar")
batsmanCumulativeAverageRuns("./kohliOD.csv","Kohli")
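
The two trend measures used in the last two sections can be sketched in a few lines of Python. This is a simplified illustration on made-up scores: cumulative_average is a plain running mean of runs per innings, and moving_average is a fixed-window rolling mean. Both helper names are hypothetical and the package may smooth or average differently.

```python
def cumulative_average(runs):
    """Running mean of runs after each innings."""
    averages, total = [], 0
    for innings_number, score in enumerate(runs, start=1):
        total += score
        averages.append(total / innings_number)
    return averages

def moving_average(runs, window):
    """Mean of the most recent `window` innings, one value per complete window."""
    return [sum(runs[i - window:i]) / window for i in range(window, len(runs) + 1)]

scores = [10, 50, 0, 100, 40]  # made-up innings scores
print(cumulative_average(scores))   # [10.0, 30.0, 20.0, 40.0, 40.0]
print(moving_average(scores, 3))
```

The cumulative average reacts more slowly as a career gets longer, whereas the moving average only reflects recent innings, which is why the two charts can tell different stories for the same batsman.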

31. Cumulative strike rate of ODI batsmen

par(mar=c(4,4,2,2))
batsmanCumulativeStrikeRate("./tendulkarOD.csv","Tendulkar")
batsmanCumulativeStrikeRate("./kohliOD.csv","Kohli")

32. Relative batsmen strike rate

par(mar=c(4,4,2,2))

frames <- list("./tendulkarOD.csv","./kohliOD.csv")
names <- list("Tendulkar","Kohli")
relativeBatsmanSRODTT(frames,names)
#dev.off()

33. Relative Run Frequency percentages

par(mar=c(4,4,2,2))

frames <- list("./tendulkarOD.csv","./kohliOD.csv")
names <- list("Tendulkar","Kohli")
relativeRunsFreqPerfODTT(frames,names)
#dev.off()

34. Relative cumulative average runs of ODI batsmen

Kohli breaks away from Tendulkar in cumulative average runs after 100 innings

par(mar=c(4,4,2,2))

frames <- list("./tendulkarOD.csv","./kohliOD.csv")
names <- list("Tendulkar","Kohli")
relativeBatsmanCumulativeAvgRuns(frames,names)
#dev.off()

35. Relative cumulative strike rate of ODI batsmen

This seems to be a tussle, with Kohli having an edge till about 40 innings, after which Tendulkar leads from 40+ to 180 innings. Kohli just seems to be edging forward.

par(mar=c(4,4,2,2))

frames <- list("./tendulkarOD.csv","./kohliOD.csv")
names <- list("Tendulkar","Kohli")
relativeBatsmanCumulativeStrikeRate(frames,names)
#dev.off()

36. Batsmen 4s and 6s

par(mar=c(4,4,2,2))

frames <- list("./tendulkarOD.csv","./kohliOD.csv")
names <- list("Tendulkar","Kohli")
batsman4s6s(frames,names)
##                Tendulkar Kohli
## Runs(1s,2s,3s)     66.29 69.67
## 4s                 29.65 25.90
## 6s                  4.06  4.43
#dev.off()
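
The table above splits a batsman's runs into the share scored in boundaries and the share run between the wickets. A minimal Python sketch of that arithmetic, on made-up innings and with a hypothetical helper name, could look like this:

```python
def boundary_shares(innings):
    """Percentage of total runs from 1s/2s/3s, from 4s and from 6s."""
    total = sum(runs for runs, _, _ in innings)
    from_fours = 4 * sum(fours for _, fours, _ in innings)
    from_sixes = 6 * sum(sixes for _, _, sixes in innings)
    from_running = total - from_fours - from_sixes
    return {
        "Runs(1s,2s,3s)": 100 * from_running / total,
        "4s": 100 * from_fours / total,
        "6s": 100 * from_sixes / total,
    }

# Made-up innings as (runs, number of fours, number of sixes)
innings = [(50, 6, 1), (30, 3, 0), (20, 1, 1)]
shares = boundary_shares(innings)
print(shares)  # {'Runs(1s,2s,3s)': 48.0, '4s': 40.0, '6s': 12.0}
```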

37. Check ODI batsmen form

par(mar=c(4,4,2,2))

checkBatsmanInForm("./tendulkar.csv","Tendulkar")
## [1] "**************************** Form status of Tendulkar ********
********************\n\n Population size: 294  Mean of population: 50.48 \n
 Sample size: 33  Mean of sample: 32.42 SD of sample: 29.8 \n\n 
Null hypothesis H0 : Tendulkar 's sample average is within 95% confidence
 interval of population average\n Alternative hypothesis 
Ha : Tendulkar 's sample average is below the 95% confidence interval 
of population average\n\n Tendulkar 's Form Status: Out-of-Form because the p value: 0.000713  is less than alpha=  0.05 \n *******************************************************************************************\n\n"
checkBatsmanInForm("./kohli.csv","Kohli")
## [1] "**************************** Form status of Kohli ***********
*****************\n\n Population size: 117  Mean of population: 50.35 \n
 Sample size: 13  Mean of sample: 53.77 SD of sample: 46.15 \n\n 
Null hypothesis H0 : Kohli 's sample average is within 95% confidence 
interval of population average\n Alternative hypothesis 
Ha : Kohli 's sample average is below the 95% confidence interval 
of population average\n\n Kohli 's Form Status: In-Form because 
the p value: 0.603244  is greater than alpha=  0.05 \n *******************************************************************************************\n\n"
#dev.off()

Key Findings

  1. Kohli has a better performance against oppositions like West Indies, South Africa and New Zealand
  2. Kohli breaks away from Tendulkar in cumulative average runs
  3. Tendulkar has been leading on strike rate, but Kohli in recent times seems to be breaking loose.

Check out some other players with my R package cricketr

Important note: Do check out my other posts using cricketr at cricketr-posts


Analyzing T20 matches with yorkpy templates

1. Introduction

In this post I create yorkpy templates for end-to-end analysis of any T20 matches that are available on Cricsheet in yaml format. These templates can be used to analyze Intl. T20, IPL, BBL and Natwest T20. In fact they can be used for any T20 games which have been saved in the yaml format as specified by Cricsheet.

Note: yorkpy is the Python clone of my R package yorkr; see yorkr pads up for the Twenty20s: Part 1- Analyzing team’s match performance

With these templates you can convert all T20 match data which is in yaml format to Pandas dataframes and save them as CSV. Note: The data for Intl T20, IPL, BBL and Natwest T20 have already been converted and are available at allYorkpyData. This template is also available on Github at yorkpyTemplate. The template includes the following steps

  1. Template for conversion and setup
  2. Analysis of Any T20 match
  3. Analysis of a T20 team in all matches against another T20 team
  4. Analysis of a T20 team in all matches against all other teams
  5. Analysis of T20 batsmen and bowlers

You can recreate the files as more matches are added to the Cricsheet site for IPL 2017 and future seasons. This post contains all the steps needed for detailed analysis of IPL matches, teams and IPL players. This will also be my reference if I decide to analyze IPL in future!

Install yorkpy with pip install yorkpy

The yaml files have to be converted before any analysis of T20 batsmen, bowlers, a T20 match, matches between any 2 T20 teams, or a team’s performance against all other teams can be done.

The first step is to convert the YAML files that are available on Cricsheet for the different T20 leagues, namely Intl. T20, IPL, BBL and Natwest T20. For the initial data setup we need to use slightly different functions for each of the T20 leagues since the teams are different. The function to convert yaml to a Pandas dataframe and save it as CSV is common to all leagues.
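
To give a feel for what the conversion step involves, the sketch below flattens a nested, ball-by-ball match record into one row per delivery and writes the rows as CSV. It is only an illustration: the match dictionary is a made-up, heavily trimmed stand-in for what a YAML loader might return, its field layout is an assumption, and flatten_match is a hypothetical helper rather than the yorkpy implementation.

```python
import csv
import io

# Hypothetical, trimmed match in the nested shape a YAML loader might return
match = {
    "info": {"venue": "Some Ground"},
    "innings": [
        {"1st innings": {"team": "Team A", "deliveries": [
            {0.1: {"batsman": "A Opener", "bowler": "B Quick",
                   "runs": {"batsman": 1, "extras": 0, "total": 1}}},
            {0.2: {"batsman": "A Other", "bowler": "B Quick",
                   "runs": {"batsman": 4, "extras": 0, "total": 4}}},
        ]}},
        {"2nd innings": {"team": "Team B", "deliveries": [
            {0.1: {"batsman": "B Opener", "bowler": "A Spinner",
                   "runs": {"batsman": 0, "extras": 1, "total": 1}}},
        ]}},
    ],
}

def flatten_match(match):
    """Return one flat dict per delivery, ready to be written as a CSV row."""
    rows = []
    for innings in match["innings"]:
        for label, body in innings.items():
            for delivery in body["deliveries"]:
                for ball, detail in delivery.items():
                    rows.append({
                        "innings": label,
                        "team": body["team"],
                        "ball": ball,
                        "batsman": detail["batsman"],
                        "bowler": detail["bowler"],
                        "runs": detail["runs"]["batsman"],
                        "extras": detail["runs"]["extras"],
                        "total": detail["runs"]["total"],
                        "venue": match["info"]["venue"],
                    })
    return rows

rows = flatten_match(match)

# Write to an in-memory buffer; a real run would write to a file instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Once every delivery is a flat row, all later analyses (scorecards, partnerships, worm charts) reduce to grouping and summing over these rows.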

A. For International T20

import yorkpy.analytics as yka
# Convert yaml to pandas and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\data1")

# Save all matches between any 2 Intl T20 countries
#yka.saveAllMatchesBetween2IntlT20s(dir1)

#Save all matches between an Intl.T20 country and all other countries
#yka.saveAllMatchesAllOppositionIntlT20(dir1)

# Get batting details for a country
#yka.getTeamBattingDetails(<country>,dir=dir1, save=True)

#Get bowling details
#yka.getTeamBowlingDetails(<country>,dir=dir1, save=True)

B. For Indian Premier League (IPL)

import yorkpy.analytics as yka
# Convert yaml to pandas and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\data1")

# Save all matches between any 2 IPL teams
#yka.saveAllMatchesBetween2IPLTeams(dir1)

#Save all matches between an IPL team and all other teams
#yka.saveAllMatchesAllOppositionIPLT20(dir1)

# Get batting details for an IPL team
#yka.getTeamBattingDetails(<team1>,dir=dir1, save=True)

#Get bowling details for an IPL team
#yka.getTeamBowlingDetails(<team1>,dir=dir1, save=True)

C. For Big Bash League (BBL)

import yorkpy.analytics as yka
# Convert yaml to pandas and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\data1")

# Save all matches between any 2 BBL teams
#yka.saveAllMatchesBetween2BBLTeams(dir1)

#Save all matches between a BBL team and all other teams
#yka.saveAllMatchesAllOppositionBBLT20(dir1)

# Get batting details for a BBL team
#yka.getTeamBattingDetails(<team1>,dir=dir1, save=True)

#Get bowling details for a BBL team
#yka.getTeamBowlingDetails(<team1>,dir=dir1, save=True)

D. For Natwest T20

import yorkpy.analytics as yka
# Convert yaml to pandas and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\data1")

# Save all matches between any 2 NWB teams
#yka.saveAllMatchesBetween2NWBTeams(dir1)

#Save all matches between an NWB team and all other teams
#yka.saveAllMatchesAllOppositionNWBT20(dir1)

# Get batting details for an NWB team
#yka.getTeamBattingDetails(<team1>,dir=dir1, save=True)

#Get bowling details for an NWB team
#yka.getTeamBowlingDetails(<team1>,dir=dir1, save=True)

Once the conversion has been done and the data has been set up, we can use any of the yorkpy functions for the 4 leagues (Intl. T20, IPL, BBL or Natwest T20). There are four classes of functions, and they can be used for any of these leagues.

  1. Class 1 – Functions that analyze a single T20 match
  2. Class 2 – Functions that analyze the performance of a T20 team in all matches against another T20 team
  3. Class 3 – Functions that analyze the performance of a T20 team against all other teams
  4. Class 4 – Functions that analyze individual T20 batsmen or bowler

2. Class 1 functions

These functions analyze a single T20 match (Intl T20, BBL, IPL or Natwest T20). To see actual usage of Class 1 functions see Pitching yorkpy … short of good length to IPL – Part 1

import pandas as pd
import yorkpy.analytics as yka
# Get scorecard
#match=pd.read_csv("<match.csv>")
#scorecard,extras=yka.teamBattingScorecardMatch(match,<team1>)

#Get partnership
#match=pd.read_csv("<match.csv>")
#yka.teamBatsmenPartnershipMatch(match,<team1>,<team2>,plot=True/False)

#Batsmen vs bowler
#match=pd.read_csv("<match.csv>")
#yka.teamBatsmenVsBowlersMatch(match,<team1>,<team2>,plot=True/False)

#Bowling scorecard
#match=pd.read_csv("<match.csv>")
#a=yka.teamBowlingScorecardMatch(match,<team1>)

#Wicket Kind
#match=pd.read_csv("<match.csv>")
#yka.teamBowlingWicketKindMatch(match,<team1>,<team2>)

#Wicket Match
#match=pd.read_csv("<match.csv>")
#yka.teamBowlingWicketMatch(match,<team1>,<team2>,plot=True/False)

#Bowler vs Batsman
#match=pd.read_csv("<match.csv>")
#yka.teamBowlersVsBatsmenMatch(match,<team1>,<team2>)

#Match worm chart
#match=pd.read_csv("<match.csv>")
#yka.matchWormChart(match,<team1>,<team2>)
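
A worm chart plots each team's cumulative score over by over. As a small, hedged illustration of the underlying computation (not the yorkpy implementation), the Python sketch below turns made-up ball-by-ball rows into the cumulative totals that would be plotted. The worm helper and the data layout are assumptions.

```python
from itertools import accumulate

def worm(deliveries, overs=20):
    """Cumulative team score at the end of each over."""
    per_over = [0] * overs
    for over, runs in deliveries:
        per_over[over] += runs  # over is a 0-based index
    return list(accumulate(per_over))

# Made-up deliveries as (over index, total runs off the ball)
team1 = [(0, 1), (0, 4), (1, 0), (1, 6), (2, 2)]
team2 = [(0, 0), (1, 1), (1, 1), (2, 4), (2, 4)]

worm1 = worm(team1, overs=3)  # [5, 11, 13]
worm2 = worm(team2, overs=3)  # [0, 2, 10]
print(worm1, worm2)
```

Plotting the two lists against the over number gives the familiar pair of rising lines, where a steeper segment marks a high-scoring over.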

3. Class 2 functions

This set of functions analyzes the performance of a T20 team (Intl T20, IPL, BBL or Natwest T20) in all matches against another T20 team. To see usages of Class 2 functions see Pitching yorkpy…on the middle and outside off-stump to IPL – Part 2

import pandas as pd
import yorkpy.analytics as yka

# Batting partnerships - Table
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv>)
#m=yka.teamBatsmenPartnershiOppnAllMatches(team1_team2_matches,<team1/team2>,report="summary/detailed", top=<n>)

# Batting partnerships - Plot
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv>)
#yka.teamBatsmenPartnershipOppnAllMatchesChart(team1_team2_matches,<team1>,<team2>, plot=<True/False>, top=<N>, partnershipRuns=<M>)

#Batsmen vs Bowlers
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv>)
#yka.teamBatsmenVsBowlersOppnAllMatches(team1_team2_matches,<team1>,<team2>, plot=<True/False>, top=<N>,runsScored=<M>)

# Batting scorecard
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv>)
#scorecard=yka.teamBattingScorecardOppnAllMatches(team1_team2_matches,<team1>,<team2>)

#Bowling scorecard
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv>)
#scorecard=yka.teamBowlingScorecardOppnAllMatches(team1_team2_matches,<team1>,<team2>)

#Bowling wicket kind
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv>)
#yka.teamBowlingWicketKindOppositionAllMatches(team1_team2_matches,<team1>,<team2>,plot=<True/False>,top=<N>,wickets=<M>)

#Bowler vs batsman
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv>)
#yka.teamBowlersVsBatsmenOppnAllMatches(team1_team2_matches,<team1>,<team2>,plot=<True/False>,top=<N>,runsConceded=<M>)

# Wins vs losses
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv>)
#yka.plotWinLossBetweenTeams(team1_team2_matches,<team1>,<team2>)

#Wins by win type
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv>)
#yka.plotWinsByRunOrWickets(team1_team2_matches,<team1>)

#Wins by toss decision
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv>)
#yka.plotWinsbyTossDecision(team1_team2_matches,<team1>,tossDecision=<field/bat>)
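
The win/loss functions in this class boil down to counting match results from one team's point of view. Here is a minimal Python sketch of that tally on made-up match records. The record layout and the win_loss helper are hypothetical and only meant to show the idea.

```python
from collections import Counter

# Made-up head-to-head results between two teams
matches = [
    {"team1": "Team A", "team2": "Team B", "winner": "Team A"},
    {"team1": "Team B", "team2": "Team A", "winner": "Team B"},
    {"team1": "Team A", "team2": "Team B", "winner": "Team A"},
    {"team1": "Team A", "team2": "Team B", "winner": None},  # no result
]

def win_loss(matches, team):
    """Count wins, losses and no-results from the point of view of `team`."""
    tally = Counter()
    for match in matches:
        if match["winner"] is None:
            tally["no result"] += 1
        elif match["winner"] == team:
            tally["won"] += 1
        else:
            tally["lost"] += 1
    return dict(tally)

summary = win_loss(matches, "Team A")
print(summary)
```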

4. Class 3 functions

This set of functions deals with analyzing the performance of a T20 team (Intl. T20, IPL, BBL or Natwest T20) in all matches against all other teams. To see usages of Class 3 functions see Pitching yorkpy…swinging away from the leg stump to IPL – Part 3. Once the data for all matches against all oppositions has been saved, we can use it as follows

import pandas as pd
import yorkpy.analytics as yka
#Batsman partnerships
#allmatches = pd.read_csv("<allmatchesForteam>")
#m=yka.teamBatsmenPartnershiAllOppnAllMatches(allmatches,<team1>,report=<"summary"/"detailed">, top=<N>,partnershipRuns=<M>)

#Batsmen vs Bowlers
#allmatches = pd.read_csv("<allmatchesForteam>")
#yka.teamBatsmenVsBowlersAllOppnAllMatches(allmatches,<team1>,plot=<True/False>,top=<N>,runsScored=<M>)

#Batting scorecard
#allmatches = pd.read_csv("<allmatchesForteam>")
#scorecard=yka.teamBattingScorecardAllOppnAllMatches(allmatches,<team1>)

#Bowling scorecard
#allmatches = pd.read_csv("<allmatchesForteam>")
#scorecard=yka.teamBowlingScorecardAllOppnAllMatches(allmatches,<team1>)

#Bowling wicket kind
#allmatches = pd.read_csv("<allmatchesForteam>")
#yka.teamBowlingWicketKindAllOppnAllMatches(allmatches,<team1>,plot=<True/False>,top=<N>,wickets=<M>)

# Bowler vs Batsmen
#allmatches = pd.read_csv("<allmatchesForteam>")
#yka.teamBowlersVsBatsmenAllOppnAllMatches(allmatches,<team1>,plot=<True/False>,top=<N>,runsConceded=<M>)

# Wins vs losses
#allmatches = pd.read_csv("<allmatchesForteam>")
#yka.plotWinLossByTeamAllOpposition(allmatches,<team1>,plot=<"summary"/"detailed">)

# Wins by win type
#allmatches = pd.read_csv("<allmatchesForteam>")
#yka.plotWinsByRunOrWicketsAllOpposition(allmatches,<team1>)

# Wins by toss decision
#allmatches = pd.read_csv("<allmatchesForteam>")
#yka.plotWinsbyTossDecisionAllOpposition(allmatches,<team1>,tossDecision='bat'/'field',plot='summary'/'detailed')

5. Class 4 functions

This set of functions is used for analyzing individual batsmen and bowlers. From the converted xxx-BattingDetails.csv and xxx-BowlingDetails.csv we can get the batsman and bowler details as shown below. Subsequently we can perform analyses of the individual batsman and bowler. To see actual usages of Class 4 functions see Pitching yorkpy … in the block hole – Part 4

import yorkpy.analytics as yka

#Batsman analyses
#Get batsman Dataframe
#batsmanDF=yka.getBatsmanDetails(<team1>,<batsman>,dir=dir1)

#Batsman Runs vs Deliveries
#yka.batsmanRunsVsDeliveries(batsmanDF,<batsmanName>)

#Batsman fours and sixes
#yka.batsmanFoursSixes(batsmanDF,<batsmanName>)


#Batsman dismissals
#yka.batsmanDismissals(batsmanDF,<batsmanName>)

#Batsman Runs vs Strike Rate
#yka.batsmanRunsVsStrikeRate(batsmanDF,<batsmanName>)

#Batsman Moving average
#yka.batsmanMovingAverage(batsmanDF,<batsmanName>)


#Batsman Cumulative average
#yka.batsmanCumulativeAverageRuns(batsmanDF,<batsmanName>)

#Batsman Cumulative Strike rate
#yka.batsmanCumulativeStrikeRate(batsmanDF,<batsmanName>)

#Batsman Runs against opposition
#yka.batsmanRunsAgainstOpposition(batsmanDF,<batsmanName>)

#Batsman Runs at venue
#yka.batsmanRunsVenue(batsmanDF,<batsmanName>)


#Bowler analyses
#Get bowler dataframe
#bowlerDF=yka.getBowlerWicketDetails(<team1>,<bowler>,dir=dir1)

#Mean economy rate
#yka.bowlerMeanEconomyRate(bowlerDF,<bowlerName>)



#Mean Runs conceded
#yka.bowlerMeanRunsConceded(bowlerDF,<bowlerName>)

#Moving average of wickets
#yka.bowlerMovingAverage(bowlerDF,<bowlerName>)

# Cumulative average of wickets
#yka.bowlerCumulativeAvgWickets(bowlerDF,<bowlerName>)

# Cumulative economy rate
#yka.bowlerCumulativeAvgEconRate(bowlerDF,<bowlerName>)

# Wicket plot
#yka.bowlerWicketPlot(bowlerDF,<bowlerName>)

# Wicket against opposition
#yka.bowlerWicketsAgainstOpposition(bowlerDF,<bowlerName>)

# Wickets at venue
#yka.bowlerWicketsVenue(bowlerDF,<bowlerName>)
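
As an illustration of the kind of statistic behind the bowler functions, the sketch below computes a mean economy rate (runs conceded per over) grouped by the number of overs bowled. The spells are made up and mean_economy_by_overs is a hypothetical helper. How yorkpy groups or plots the figure is not shown here, so treat this only as the arithmetic.

```python
from collections import defaultdict

def mean_economy_by_overs(spells):
    """Average economy rate (runs per over) for each spell length."""
    grouped = defaultdict(list)
    for overs, runs_conceded in spells:
        grouped[overs].append(runs_conceded / overs)
    return {overs: sum(rates) / len(rates) for overs, rates in sorted(grouped.items())}

# Made-up spells as (overs bowled, runs conceded)
spells = [(4, 28), (4, 36), (3, 21), (2, 20)]
economy = mean_economy_by_overs(spells)
print(economy)  # {2: 10.0, 3: 7.0, 4: 8.0}
```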

Important note: Do check out my other posts using yorkpy at yorkpy-posts

Conclusion

With the above templates detailed analysis can be done on

  • A T20 match
  • Performance of a team in all matches against another team
  • Performance of a team in all matches against all other teams
  • Individual batting and bowling performances


yorkpy takes a hat-trick, bowls out Intl. T20s, BBL and Natwest T20!!!

“Dear, dear! How queer everything is to-day! And yesterday things went on just as usual. I wonder if I’ve been changed in the night? Let me think: was I the same when I got up this morning? I almost think I can remember feeling a little different. But if I’m not the same, the next question is ‘Who in the world am I?’ Ah, that’s the great puzzle!”

             Alice's adventures  in Wonderland, Lewis Carroll

1. Introduction

In this post, yorkpy clean bowls the following T20 formats, namely International T20s, Big Bash League and Natwest T20 Blast. I take yorkpy on a spin through these T20 leagues. In the post below, I choose a random set of about 10-12 of the overall 63 functions that yorkpy has, and execute them for each of the different T20 leagues – Intl T20s, BBL and Natwest T20s. yorkpy is the Python avatar of my R package yorkr, see Introducing cricket package yorkr: Part 1- Beaten by sheer pace!

There were a couple of new functions that needed to be added for each of the T20 leagues – Intl T20, BBL and Natwest T20 – to take into account the different teams in each of these leagues. Further, some bugs were also ironed out in the latest version of yorkpy. yorkpy uses data from Cricsheet. The match data is in the form of YAML files. yorkpy converts these YAML files to dataframes. YAML files are very detailed and include a ball-by-ball account of the match.

– You can clone/fork the latest code for yorkpy from github yorkpy
– This post has also been published in RPubs at yorkpy takes a hat-trick
– You can download the PDF version of this post at yorkpy takes a hat-trick

The data for IPL, Intl. T20, BBL and Natwest T20 have already been converted into pandas dataframes and saved as CSVs. You can download the converted files from Github at allYorkpyT20Data (https://github.com/tvganesh/allYorkpyT20Data)

yorkpy has the following 4 main classes of functions

A. Functions analyzing individual T20 match (Class 1)

This was demonstrated in Pitching yorkpy … short of good length to IPL – Part 1. The functions deal with individual T20 matches. The functions are

  1. convertYaml2PandasDataframeT20()
  2. convertAllYaml2PandasDataframesT20()
  3. teamBattingScorecardMatch()
  4. teamBatsmenPartnershipMatch()
  5. teamBatsmenVsBowlersMatch()
  6. teamBowlingScorecardMatch()
  7. teamBowlingWicketKindMatch()
  8. teamBowlingWicketRunsMatch()
  9. teamBowlingWicketMatch()
  10. teamBowlersVsBatsmenMatch()
  11. matchWormChart()

B. Functions that analyze all matches between 2 T20 teams (Class 2)

Pitching yorkpy…on the middle and outside off-stump to IPL – Part 2 included functions that analyze head-to-head confrontations between any 2 T20 teams. The functions are

  1. getAllMatchesBetweenTeams()
  2. saveAllMatchesBetween2IPLTeams()
  3. teamBatsmenPartnershiOppnAllMatches()
  4. teamBatsmenPartnershipOppnAllMatchesChart()
  5. teamBatsmenVsBowlersOppnAllMatches()
  6. teamBattingScorecardOppnAllMatches()
  7. teamBowlingScorecardOppnAllMatches()
  8. teamBowlingWicketKindOppositionAllMatches()
  9. teamBowlersVsBatsmenOppnAllMatches()
  10. plotWinLossBetweenTeams()
  11. plotWinsByRunOrWickets()
  12. plotWinsbyTossDecision()

C. Functions that analyze the performance of a T20 team against all other teams (Class 3)

The post Pitching yorkpy…swinging away from the leg stump to IPL – Part 3 is based on the Class 3 set of functions shown below

  1. getAllMatchesAllOpposition()
  2. saveAllMatchesAllOppositionIPLT20()
  3. teamBatsmenPartnershiAllOppnAllMatches()
  4. teamBatsmenPartnershipAllOppnAllMatchesChart()
  5. teamBatsmenVsBowlersAllOppnAllMatches()
  6. teamBattingScorecardAllOppnAllMatches()
  7. teamBowlingScorecardAllOppnAllMatches()
  8. teamBowlingWicketKindAllOppnAllMatches()
  9. teamBowlersVsBatsmenAllOppnAllMatches()
  10. plotWinLossByTeamAllOpposition()
  11. plotWinsByRunOrWicketsAllOpposition()
  12. plotWinsbyTossDecisionAllOpposition()

D. Functions that analyze performances of T20 batsmen and bowlers (Class 4)

This set of functions analyzes individual batsmen and bowlers and has been used in Pitching yorkpy … in the block hole – Part 4. The functions are

  1. getTeamBattingDetails()
  2. getBatsmanDetails()
  3. batsmanRunsVsDeliveries()
  4. batsmanFoursSixes()
  5. batsmanDismissals()
  6. batsmanRunsVsStrikeRate()
  7. batsmanMovingAverage()
  8. batsmanCumulativeAverageRuns()
  9. batsmanCumulativeStrikeRate()
  10. batsmanRunsAgainstOpposition()
  11. batsmanRunsVenue()
  12. getTeamBowlingDetails()
  13. getBowlerWicketDetails()
  14. bowlerMeanEconomyRate()
  15. bowlerMeanRunsConceded()
  16. bowlerMovingAverage()
  17. bowlerCumulativeAvgWickets()
  18. bowlerCumulativeAvgEconRate()
  19. bowlerWicketPlot()
  20. bowlerWicketsAgainstOpposition()
  21. bowlerWicketsVenue()

Additional new functions were added to handle Intl T20s, Big Bash League and Natwest T20 Blast, since the teams are different. They are

59. saveAllMatchesBetween2IntlT20s()
60. saveAllMatchesAllOppositionIntlT20()
61. saveAllMatchesBetween2BBLTeams()
62. saveAllMatchesAllOppositionBBLT20()
63. saveAllMatchesBetween2NWBTeams()
64. saveAllMatchesAllOppositionNWBT20()

All other functions can be used as is! You can get the help of any function in yorkpy using

import yorkpy.analytics as yka
help(yka.teamBatsmenPartnershiOppnAllMatches)
## Help on function teamBatsmenPartnershiOppnAllMatches in module yorkpy.analytics:
## 
## teamBatsmenPartnershiOppnAllMatches(matches, theTeam, report='summary', top=5)
##     Team batting partnership against a opposition all IPL matches
##     
##     Description
##     
##     This function computes the performance of batsmen against all bowlers of an oppositions in 
##     all matches. This function returns a dataframe
##     
##     Usage
##     
##     teamBatsmenPartnershiOppnAllMatches(matches,theTeam,report="summary")
##     Arguments
##     
##     matches     
##     All the matches of the team against the oppositions
##     theTeam     
##     The team for which the the batting partnerships are sought
##     report      
##     If the report="summary" then the list of top batsmen with the highest partnerships 
##     is displayed. If report="detailed" then the detailed break up of partnership is returned 
##     as a dataframe
##     top
##     The number of players to be displayed from the top
##     Value
##     
##     partnerships The data frame of the partnerships
##     
##     Note
##     
##     Maintainer: Tinniam V Ganesh tvganesh.85@gmail.com
##     
##     Author(s)
##     
##     Tinniam V Ganesh
##     
##     References
##     
##     http://cricsheet.org/
##     https://gigadom.wordpress.com/
##     
##     
##     See Also
##     
##     teamBatsmenVsBowlersOppnAllMatchesPlot
##     teamBatsmenPartnershipOppnAllMatchesChart

As I mentioned above, I will be randomly choosing a set of functions from Classes 1, 2, 3 and 4 for each of the T20 leagues (Intl. T20, BBL and NWB T20) for analysis.

2. International T20s

The following functions were added for handling Intl. T20s

  1. saveAllMatchesBetween2IntlT20s()
  2. saveAllMatchesAllOppositionIntlT20()

These handle the following countries in Intl. T20s:

Afghanistan, Australia, Bangladesh, Bermuda, Canada, England, Hong Kong, India, Ireland, Kenya, Nepal, Netherlands, New Zealand, Oman, Pakistan, Scotland, South Africa, Sri Lanka, United Arab Emirates, West Indies, Zimbabwe

import os
#os.chdir('C:\\software\\cricket-package\\yorkpyT20\\t20s')
#import yorkpy.analytics as yka
#1.  Convert all YAML files to dataframes and CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\data1")
#dir1='C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches'
#2. Save all matches between 2 T20 teams
#yka.saveAllMatchesBetween2IntlT20s(dir1)
#3. Save all matches between a T20 team and all other teams
#dir1='C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches'
#yka.saveAllMatchesAllOppositionIntlT20(dir1)
#4. Get batting details
#dir1='C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches'
#yka.getTeamBattingDetails("Afghanistan",dir=dir1, save=True)
#yka.getTeamBattingDetails("Australia",dir=dir1,save=True)
#yka.getTeamBattingDetails("Bangladesh",dir=dir1,save=True)
#...
#5. Get bowling details
#dir1='C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches'
#yka.getTeamBowlingDetails("Afghanistan",dir=dir1, save=True)
#yka.getTeamBowlingDetails("Australia",dir=dir1,save=True)
#yka.getTeamBowlingDetails("Bangladesh",dir=dir1,save=True)
# ...

Once the data is converted, you can use the yorkpy functions. The data has already been converted for Intl. T20 and is available on GitHub at IntlT20

To use the yorkpy functions for a new league, the YAML files first have to be converted into the format that the yorkpy functions expect, as in the commented steps above

This creates the necessary files, which are used in the functions below
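The batting and bowling steps above repeat one call per country. A minimal sketch of doing this in a loop is given below. It assumes the call shape `getTeamBattingDetails(team, dir=..., save=True)` seen in the commented code. `run_for_teams` is a hypothetical helper and is not part of yorkpy, and `fake_details` is only a stand-in so that the snippet runs without the match data.

```python
def run_for_teams(func, teams, data_dir):
    """Call func(team, dir=data_dir, save=True) for every team.

    Returns a list of (team, error message) pairs for teams that failed,
    so one missing file does not stop the whole run.
    """
    failed = []
    for team in teams:
        try:
            func(team, dir=data_dir, save=True)
        except Exception as exc:
            failed.append((team, str(exc)))
    return failed


# Stand-in for yka.getTeamBattingDetails (assumption: same keyword arguments)
def fake_details(team, dir=".", save=False):
    print("processing", team, "in", dir)


teams = ["Afghanistan", "Australia", "Bangladesh"]
failed = run_for_teams(fake_details, teams, "IntlT20-Matches")
print("failed:", failed)
```

With yorkpy installed and the data in place, `fake_details` would be replaced by `yka.getTeamBattingDetails` or `yka.getTeamBowlingDetails`.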

2.1 Intl. T20 – Team score card (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches"
path=os.path.join(dir1,".\\India-New Zealand-2007-09-16.csv")
ind_nz=pd.read_csv(path)
scorecard,extras=yka.teamBattingScorecardMatch(ind_nz,"India")
print(scorecard)
##             batsman  runs  balls  4s  6s          SR
## 0         G Gambhir    51     34   5   2  150.000000
## 1          V Sehwag    40     18   6   2  222.222222
## 2        RV Uthappa     0      2   0   0    0.000000
## 3          MS Dhoni    24     20   2   0  120.000000
## 4      Yuvraj Singh     5      7   0   0   71.428571
## 5        KD Karthik    17     12   3   0  141.666667
## 6         IK Pathan    11     10   2   0  110.000000
## 7        AB Agarkar     1      2   0   0   50.000000
## 8   Harbhajan Singh     7      6   1   0  116.666667
## 9       S Sreesanth    19     10   4   0  190.000000
## 10         RP Singh     1      1   0   0  100.000000
print(extras)
##    total  wides  noballs  legbyes  byes  penalty  extras
## 0    370      6        0        8     0        0      14
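The SR and extras columns in the output above can be checked by hand. The sketch below uses the usual definitions, strike rate as runs per 100 balls and extras as the sum of the individual extras columns, which agree with the printed values. Whether yorkpy computes them in exactly this way internally is an assumption, and both function names are made up for illustration.

```python
def strike_rate(runs, balls):
    """Runs scored per 100 balls faced (0.0 when no balls were faced)."""
    return 100.0 * runs / balls if balls else 0.0


def total_extras(wides, noballs, legbyes, byes, penalty):
    """Sum of the individual extras columns."""
    return wides + noballs + legbyes + byes + penalty


# Values taken from the India scorecard above
print(strike_rate(51, 34))          # G Gambhir
print(strike_rate(40, 18))          # V Sehwag
print(total_extras(6, 0, 8, 0, 0))  # extras row
```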

2.2 Intl. T20 -Team batsmen partnership (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches"
path=os.path.join(dir1,".\\South Africa-Australia-2009-03-27.csv")
sa_aus=pd.read_csv(path)
yka.teamBatsmenPartnershipMatch(sa_aus,'Australia','South Africa',plot=True)

2.3 Intl. T20 -Team bowling scorecard match (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches"
path=os.path.join(dir1,".\\Sri Lanka-West Indies-2012-09-28.csv")
sl_wi=pd.read_csv(path)
a=yka.teamBowlingScorecardMatch(sl_wi,'Sri Lanka')
print(a)
##          bowler  overs  runs  maidens  wicket  econrate
## 0    A Mohammed      2    13        0       0       6.5
## 1  SA Campbelle      1     8        0       1       8.0
## 2     SC Selman      1     3        0       0       3.0
## 3      SF Daley      2     5        0       1       2.5
## 4     SR Taylor      2     4        0       1       2.0
## 5     TD Smartt      2    17        0       0       8.5
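The econrate column above is consistent with runs conceded divided by overs bowled. A small sketch with the rows copied from this output follows. The helper name `economy_rate` is hypothetical, and treating overs as whole numbers follows the printed table rather than any documented yorkpy behaviour.

```python
# (bowler, overs, runs, wickets) copied from the scorecard above
bowling = [
    ("A Mohammed", 2, 13, 0),
    ("SA Campbelle", 1, 8, 1),
    ("SC Selman", 1, 3, 0),
    ("SF Daley", 2, 5, 1),
    ("SR Taylor", 2, 4, 1),
    ("TD Smartt", 2, 17, 0),
]


def economy_rate(runs, overs):
    """Runs conceded per over bowled."""
    return runs / overs


rates = {name: economy_rate(runs, overs) for name, overs, runs, _ in bowling}
most_economical = min(rates, key=rates.get)
print(rates)
print("most economical:", most_economical)
```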

2.4 Intl. T20 -Match Worm chart (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches"
path=os.path.join(dir1,".\\England-India-2012-09-29.csv")
eng_ind=pd.read_csv(path)
yka.matchWormChart(eng_ind,"England", "India")

path=os.path.join(dir1,".\\Bangladesh-Ireland-2015-12-05.csv")
ban_ire=pd.read_csv(path)
yka.matchWormChart(ban_ire,"Bangladesh", "Ireland")

2.5 Intl. T20 -Team Batting partnerships all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"India-England-allMatches.csv")
ind_eng_matches = pd.read_csv(path)
theTeam='India'
m=yka.teamBatsmenPartnershiOppnAllMatches(ind_eng_matches,theTeam,report="detailed", top=4)
print(m)
##      batsman  totalPartnershipRuns    non_striker  partnershipRuns
## 0   SK Raina                   265      G Gambhir                2
## 1   SK Raina                   265       KL Rahul               40
## 2   SK Raina                   265      MK Tiwary               24
## 3   SK Raina                   265       MS Dhoni              124
## 4   SK Raina                   265        P Kumar                0
## 5   SK Raina                   265      PP Chawla                4
## 6   SK Raina                   265       R Ashwin                1
## 7   SK Raina                   265      RG Sharma               16
## 8   SK Raina                   265        V Kohli               47
## 9   SK Raina                   265   Yuvraj Singh                7
## 10  MS Dhoni                   264       A Mishra                1
## 11  MS Dhoni                   264      AT Rayudu               18
## 12  MS Dhoni                   264      HH Pandya                8
## 13  MS Dhoni                   264      IK Pathan                2
## 14  MS Dhoni                   264      JJ Bumrah                2
## 15  MS Dhoni                   264      MK Pandey                3
## 16  MS Dhoni                   264  Parvez Rasool               21
## 17  MS Dhoni                   264       R Ashwin               11
## 18  MS Dhoni                   264      RA Jadeja               11
## 19  MS Dhoni                   264      RG Sharma                9
## 20  MS Dhoni                   264        RR Pant                6
## 21  MS Dhoni                   264     RV Uthappa                5
## 22  MS Dhoni                   264       SK Raina               98
## 23  MS Dhoni                   264      YK Pathan               36
## 24  MS Dhoni                   264   Yuvraj Singh               33
## 25   V Kohli                   236      AM Rahane                3
## 26   V Kohli                   236      G Gambhir               78
## 27   V Kohli                   236       KL Rahul               46
## 28   V Kohli                   236      RG Sharma                2
## 29   V Kohli                   236     RV Uthappa                4
## 30   V Kohli                   236       S Dhawan               45
## 31   V Kohli                   236       SK Raina               48
## 32   V Kohli                   236   Yuvraj Singh               10
## 33     M Raj                   176       A Sharma                2
## 34     M Raj                   176         H Kaur               18
## 35     M Raj                   176      J Goswami                6
## 36     M Raj                   176        KV Jain                5
## 37     M Raj                   176       L Kumari                5
## 38     M Raj                   176    N Niranjana                3
## 39     M Raj                   176       N Tanwar               17
## 40     M Raj                   176        PG Raut               41
## 41     M Raj                   176     R Malhotra                5
## 42     M Raj                   176     S Mandhana                8
## 43     M Raj                   176         S Naik               10
## 44     M Raj                   176       S Pandey               19
## 45     M Raj                   176       SK Naidu               37
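In the detailed report above, totalPartnershipRuns is the sum of a batsman's partnershipRuns across all non-strikers. The sketch below reproduces that total in plain Python for two of the batsmen, using rows copied from the output. It only illustrates the relationship between the two columns and is not yorkpy's own implementation.

```python
from collections import defaultdict

# (batsman, non_striker, partnershipRuns) copied from the report above
rows = [
    ("SK Raina", "G Gambhir", 2), ("SK Raina", "KL Rahul", 40),
    ("SK Raina", "MK Tiwary", 24), ("SK Raina", "MS Dhoni", 124),
    ("SK Raina", "P Kumar", 0), ("SK Raina", "PP Chawla", 4),
    ("SK Raina", "R Ashwin", 1), ("SK Raina", "RG Sharma", 16),
    ("SK Raina", "V Kohli", 47), ("SK Raina", "Yuvraj Singh", 7),
    ("V Kohli", "AM Rahane", 3), ("V Kohli", "G Gambhir", 78),
    ("V Kohli", "KL Rahul", 46), ("V Kohli", "RG Sharma", 2),
    ("V Kohli", "RV Uthappa", 4), ("V Kohli", "S Dhawan", 45),
    ("V Kohli", "SK Raina", 48), ("V Kohli", "Yuvraj Singh", 10),
]

# Sum the partnership runs per batsman
totals = defaultdict(int)
for batsman, _non_striker, runs in rows:
    totals[batsman] += runs

# Rank batsmen by total partnership runs, highest first
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```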

2.6 Intl. T20 -Team Batsmen vs Bowlers all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Ireland-Netherlands-allMatches.csv")
ire_nl_matches = pd.read_csv(path)
yka.teamBatsmenVsBowlersOppnAllMatches(ire_nl_matches,'Ireland',"Netherlands",plot=True,top=3,runsScored=10)

2.7 Intl. T20 -Team Bowling scorecard all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Bangladesh-Nepal-allMatches.csv")
bang_nep_matches = pd.read_csv(path)
scorecard=yka.teamBowlingScorecardOppnAllMatches(bang_nep_matches,'Bangladesh',"Nepal")
print(scorecard)
##         bowler  overs  runs  maidens  wicket   econrate
## 0      B Regmi      3    14        0       1   4.666667
## 3   SP Gauchan      4    40        0       1  10.000000
## 1   JK Mukhiya      2    16        0       0   8.000000
## 2     P Khadka      3    23        0       0   7.666667
## 4    Sagar Pun      1    16        0       0  16.000000
## 5  Sompal Kami      2    21        0       0  10.500000

2.8 Intl. T20 -Team Batsmen vs Bowlers all Oppositions (Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-allMatchesAllOpposition\\"
path=os.path.join(dir1,"Australia-allMatchesAllOpposition.csv")
aus_matches = pd.read_csv(path)
yka.teamBatsmenVsBowlersAllOppnAllMatches(aus_matches,"Australia",plot=True,top=3,runsScored=40)

2.9 Intl. T20 -Wins vs Losses of a team against all other teams (Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-allMatchesAllOpposition\\"
path=os.path.join(dir1,"South Africa-allMatchesAllOpposition.csv")
sa_matches = pd.read_csv(path)
team1='South Africa'
yka.plotWinLossByTeamAllOpposition(sa_matches,team1,plot="detailed")

2.10 Intl. T20 -Batsmen analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-BattingBowlingDetails\\"
# Rohit Sharma
name="RG Sharma"
team='India'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeAverageRuns(df,name)

# MJ Guptill
name="MJ Guptill"
team='New Zealand'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeStrikeRate(df,name)

2.11 Intl. T20 -Bowler analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-BattingBowlingDetails\\"
# Shakib Al Hasan
name="Shakib Al Hasan"
team='Bangladesh'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMeanEconomyRate(df,name)

# SL Malinga
name="SL Malinga"
team='Sri Lanka'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketsAgainstOpposition(df,name)

3. Big Bash League

The following functions were added to handle BBL teams

  1. saveAllMatchesBetween2BBLTeams()
  2. saveAllMatchesAllOppositionBBLT20()

The BBL teams included are Adelaide Strikers, Brisbane Heat, Hobart Hurricanes, Melbourne Renegades, Melbourne Stars, Perth Scorchers, Sydney Sixers, Sydney Thunder

To use the yorkpy functions, the YAML files first have to be converted into pandas dataframes and then saved as CSV, as shown below

import os
import yorkpy.analytics as yka
os.chdir('C:\\software\\cricket-package\\yorkpyBBL\\bbl')
#1. Convert all YAML files to dataframes and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\BBLT20-Matches")
#2. Save all matches between 2 BBL teams
dir1='C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches'
#yka.saveAllMatchesBetween2BBLTeams(dir1)
#3. Save T20 matches between a BBL team and all other teams
dir1='C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches'
#yka.saveAllMatchesAllOppositionBBLT20(dir1)
#4. Get the batting details
dir1='C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches'
#yka.getTeamBattingDetails("Adelaide Strikers",dir=dir1, save=True)
#yka.getTeamBattingDetails("Brisbane Heat",dir=dir1,save=True)
#yka.getTeamBattingDetails("Hobart Hurricanes",dir=dir1,save=True)
#...
#5. Get the bowling details
dir1='C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches'
#yka.getTeamBowlingDetails("Adelaide Strikers",dir=dir1, save=True)
#yka.getTeamBowlingDetails("Brisbane Heat",dir=dir1,save=True)
#yka.getTeamBowlingDetails("Hobart Hurricanes",dir=dir1,save=True)
#...

The functions below perform analysis on the files generated above. The YAML files have already been converted, and the results are available on GitHub at BBL
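The examples in this post build file paths by hand with Windows-style separators. The generated files follow three naming patterns that can be read off the examples: one file per match, one per pair of teams, and one per team against all opposition. A minimal sketch of helpers that build these names portably follows. The helper names are hypothetical and not part of yorkpy, and the team order in a file name has to match the order used when the file was saved.

```python
import os


def match_file(team1, team2, date):
    """Single match, e.g. 'Team1-Team2-YYYY-MM-DD.csv'."""
    return f"{team1}-{team2}-{date}.csv"


def head_to_head_file(team1, team2):
    """All matches between two teams."""
    return f"{team1}-{team2}-allMatches.csv"


def all_opposition_file(team):
    """All matches of one team against every other team."""
    return f"{team}-allMatchesAllOpposition.csv"


# os.path.join picks the right separator for the current platform
base = os.path.join("yorkpyBBL", "BBLT20-Matches")
path = os.path.join(
    base, match_file("Adelaide Strikers", "Brisbane Heat", "2012-12-13")
)
print(path)
```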

3.1 Big Bash League – Team score card (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches"
path=os.path.join(dir1,".\\Adelaide Strikers-Brisbane Heat-2012-12-13.csv")
as_bh=pd.read_csv(path)
scorecard,extras=yka.teamBattingScorecardMatch(as_bh,"Brisbane Heat")
print(scorecard)
##          batsman  runs  balls  4s  6s          SR
## 0  LA Pomersbach    65     42   8   2  154.761905
## 1       JR Hopes     1      2   0   0   50.000000
## 2       JA Burns    37     31   2   2  119.354839
## 3   DT Christian    12     15   0   0   80.000000
## 4    NLTC Perera    12      4   0   2  300.000000
## 5        CA Lynn    19     18   1   1  105.555556
## 6    BCJ Cutting    13      5   0   2  260.000000
## 7     PJ Forrest    12      8   0   1  150.000000
## 8     CD Hartley     5      2   1   0  250.000000
print(extras)
##    total  wides  noballs  legbyes  byes  penalty  extras
## 0    371     10        2        5     0        0      17

3.2 Big Bash League -Team batsmen vs Bowlers (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches"
path=os.path.join(dir1,".\\Hobart Hurricanes-Melbourne Renegades-2012-01-18.csv")
hh_mr=pd.read_csv(path)
yka.teamBatsmenVsBowlersMatch(hh_mr,'Hobart Hurricanes','Melbourne Renegades',plot=True)

3.3 Big Bash League -Team bowling scorecard match (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches"
path=os.path.join(dir1,".\\Melbourne Stars-Sydney Thunder-2016-01-24.csv")
ms_st=pd.read_csv(path)
a=yka.teamBowlingScorecardMatch(ms_st,'Sydney Thunder')
print(a)
##           bowler  overs  runs  maidens  wicket   econrate
## 0        A Zampa      4    32        0       2   8.000000
## 1  BW Hilfenhaus      2    21        0       0  10.500000
## 2      DJ Hussey      1     9        0       1   9.000000
## 3     DJ Worrall      3    42        0       0  14.000000
## 4      EP Gulbis      2    19        0       0   9.500000
## 5        MA Beer      3    25        0       1   8.333333
## 6     MP Stoinis      4    30        0       3   7.500000

3.4 Big Bash League – Match Worm chart (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches"
path=os.path.join(dir1,".\\Sydney Sixers-Melbourne Stars-2011-12-27.csv")
ss_ms=pd.read_csv(path)
yka.matchWormChart(ss_ms,"Melbourne Stars", "Sydney Sixers")

path=os.path.join(dir1,".\\Hobart Hurricanes-Brisbane Heat-2015-01-02.csv")
hh_bh=pd.read_csv(path)
yka.matchWormChart(hh_bh,"Hobart Hurricanes", "Brisbane Heat")

3.5 Big Bash League -Team Batting partnerships all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Brisbane Heat-Adelaide Strikers-allMatches.csv")
bh_as_matches = pd.read_csv(path)
yka.teamBatsmenPartnershipOppnAllMatchesChart(bh_as_matches,"Brisbane Heat","Adelaide Strikers",plot=True, top=4, partnershipRuns=20)

3.6 Big Bash League -Team Bowling wicket kind all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Sydney Sixers-Perth Scorchers-allMatches.csv")
ss_ps_matches = pd.read_csv(path)
yka.teamBowlingWicketKindOppositionAllMatches(ss_ps_matches,'Perth Scorchers','Sydney Sixers',plot=True,top=5,wickets=1)

3.7 Big Bash League -Team Bowling scorecard all teams (Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-allMatchesAllOpposition"
path=os.path.join(dir1,"Hobart Hurricanes-allMatchesAllOpposition.csv")
hh_matches = pd.read_csv(path)
scorecard=yka.teamBowlingScorecardAllOppnAllMatches(hh_matches,"Hobart Hurricanes")
print(scorecard)
##              bowler  overs  runs  maidens  wicket   econrate
## 16            B Lee     20   132        0       9   6.600000
## 30         CJ McKay     13   110        0       9   8.461538
## 88    NJ Rimmington     16   103        1       9   6.437500
## 67      JW Hastings     15    88        0       8   5.866667
## 63      JP Faulkner     15   146        0       7   9.733333
## 27        CJ Gannon     17   147        1       7   8.647059
## 93          NM Lyon      8    51        0       7   6.375000
## 20      BCJ Cutting     27   226        0       7   8.370370
## 48          GB Hogg     22   167        0       7   7.590909
## 107       SM Boland     12    96        0       7   8.000000
## 15       B Laughlin     13    99        0       7   7.615385
## 87      MT Steketee     15   134        0       5   8.933333
## 121    Yasir Arafat      9    48        0       4   5.333333
## 96       PJ Cummins      8    83        0       4  10.375000
## 46      Fawad Ahmed     11    64        0       4   5.818182
## 76          MA Beer     12    63        0       4   5.250000
## 108     SNJ O'Keefe     15   104        0       4   6.933333
## 75   M Muralitharan      7    31        0       4   4.428571
## 10           AJ Tye     16   127        0       4   7.937500
## 52          J Botha     13    94        0       4   7.230769
## 56     JL Pattinson      7    71        0       4  10.142857
## 62   JP Behrendorff     16   119        0       4   7.437500
## 3           AC Agar     12    87        0       4   7.250000
## 24     BM Edmondson      4    40        0       4  10.000000
## 37        DJ Hussey      8    47        0       3   5.875000
## 49       GJ Maxwell      8    65        0       3   8.125000
## 84       MN Samuels      4    22        0       3   5.500000
## 81         MG Neser      5    54        0       3  10.800000
## 44     DT Christian      9   114        0       3  12.666667
## 50        GS Sandhu      7    51        0       3   7.285714
## ..              ...    ...   ...      ...     ...        ...
## 43        DP Nannes      8    58        0       1   7.250000
## 51         IA Moran      4    25        0       1   6.250000
## 55         JK Lalor     10    82        0       1   8.200000
## 54        JH Kallis      3    18        0       1   6.000000
## 73   LR Butterworth      4    25        0       1   6.250000
## 4      AC McDermott      2    28        0       1  14.000000
## 70         LA Doran      4    38        0       1   9.500000
## 69    KW Richardson      6    44        0       1   7.333333
## 119     WD Sheridan      2     6        0       0   3.000000
## 2       AB McDonald      1    15        0       0  15.000000
## 115      TD Andrews      3    23        0       0   7.666667
## 11          AK Heal      4    33        0       0   8.250000
## 7        AD Russell      4    40        0       0  10.000000
## 8          AJ Finch      2    15        0       0   7.500000
## 9         AJ Turner      3    28        0       0   9.333333
## 60        JM Mennie      1    20        0       0  20.000000
## 18        BA Stokes      1     9        0       0   9.000000
## 26         CH Gayle      1    16        0       0  16.000000
## 28         CJ Green      4    44        0       0  11.000000
## 95   PD Collingwood      2    20        0       0  10.000000
## 31       CJ Simmons      4    21        0       0   5.250000
## 59       JM Holland      3    34        0       0  11.333333
## 36         DJ Bravo      6    64        0       0  10.666667
## 38     DJ Pattinson      2    16        0       0   8.000000
## 41       DJ Worrall      8    90        0       0  11.250000
## 72      LN O'Connor      6    56        0       0   9.333333
## 71        LJ Wright      3    27        0       0   9.000000
## 68       KA Pollard      1     7        0       0   7.000000
## 58       JM Herrick      4    23        0       0   5.750000
## 92       NM Hauritz      5    42        0       0   8.400000
## 
## [122 rows x 6 columns]

3.8 Big Bash League -Plot wins vs losses against all teams(Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-allMatchesAllOpposition"
path=os.path.join(dir1,"Sydney Sixers-allMatchesAllOpposition.csv")
ss_matches = pd.read_csv(path)
yka.plotWinLossByTeamAllOpposition(ss_matches,'Sydney Sixers')

3.9 Big Bash League -Wins by runs or wickets against all teams (Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-allMatchesAllOpposition"
path=os.path.join(dir1,"Adelaide Strikers-allMatchesAllOpposition.csv")
as_matches = pd.read_csv(path)
yka.plotWinsByRunOrWicketsAllOpposition(as_matches,'Adelaide Strikers')

3.10 Big Bash League -Batsmen Analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-BattingBowlingDetails"
# CA Lynn
name="CA Lynn"
team='Brisbane Heat'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVsStrikeRate(df,name)

# UT Khawaja
name="UT Khawaja"
team='Sydney Thunder'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsAgainstOpposition(df,name)

3.11 Big Bash League – Bowler analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-BattingBowlingDetails"
# CJ McKay
name="CJ McKay"
team='Sydney Thunder'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgWickets(df,name)

# AU Rashid
name="AU Rashid"
team='Adelaide Strikers'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgEconRate(df,name)

4. Natwest T20 Blast

The following functions were added to handle Natwest T20 teams

  1. saveAllMatchesBetween2NWBTeams()
  2. saveAllMatchesAllOppositionNWBT20()

The Natwest teams are
Derbyshire, Durham, Essex, Glamorgan, Gloucestershire, Hampshire, Kent, Lancashire, Leicestershire, Middlesex, Northamptonshire, Nottinghamshire, Somerset, Surrey, Sussex, Warwickshire, Worcestershire, Yorkshire

In order to perform analysis with yorkpy, the YAML data has to be converted to pandas dataframes and saved as CSV, as shown

#import os
#import yorkpy.analytics as yka
#os.chdir('C:\\software\\cricket-package\\yorkpyNWB\\nwb')
#1. Convert YAML to dataframes and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\NWBT20-Matches")
#2. Save all matches between 2 NWBT20 teams
#dir1='C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches'
#yka.saveAllMatchesBetween2NWBTeams(dir1)
#3. Save all matches between a NWB T20 team and all other teams
#dir1='C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches'
#yka.saveAllMatchesAllOppositionNWBT20(dir1)
#4. Compute the batting details
dir1='C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches'
#yka.getTeamBattingDetails("Derbyshire",dir=dir1, save=True)
#yka.getTeamBattingDetails("Durham",dir=dir1,save=True)
#yka.getTeamBattingDetails("Essex",dir=dir1,save=True)
#..
#5. Compute bowling details
dir1='C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches'
#yka.getTeamBowlingDetails("Derbyshire",dir=dir1, save=True)
#yka.getTeamBowlingDetails("Durham",dir=dir1,save=True)
#yka.getTeamBowlingDetails("Essex",dir=dir1,save=True)
#...

Once the data is converted, all yorkpy functions can be used. This has already been done, and the converted data is available on GitHub at NWB

4.1 Natwest T20 Blast – Team score card (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches"
path=os.path.join(dir1,".\\Durham-Yorkshire-2016-08-20.csv")
d_y=pd.read_csv(path)
scorecard,extras=yka.teamBattingScorecardMatch(d_y,"Durham")
print(scorecard)
##           batsman  runs  balls  4s  6s          SR
## 0     MD Stoneman    25     20   4   0  125.000000
## 1     KK Jennings    11     13   1   0   84.615385
## 2       BA Stokes    56     37   4   3  151.351351
## 3   MJ Richardson    29     23   4   1  126.086957
## 4     JTA Burnham    17     15   1   1  113.333333
## 5      RD Pringle    10      9   1   0  111.111111
## 6  PD Collingwood     2      3   0   0   66.666667
## 7        U Arshad     1      1   0   0  100.000000
print(extras)
##    total  wides  noballs  legbyes  byes  penalty  extras
## 0    305      2        0        5     0        0       7

4.2 Natwest T20 Blast -Team batsmen vs Bowlers (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches"
path=os.path.join(dir1,".\\Derbyshire-Lancashire-2016-07-13.csv")
d_l=pd.read_csv(path)
yka.teamBatsmenVsBowlersMatch(d_l,'Lancashire','Derbyshire',plot=True)

4.3 Natwest T20 Blast -Team bowling scorecard match (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches"
path=os.path.join(dir1,".\\Essex-Surrey-2016-05-20.csv")
e_s=pd.read_csv(path)
a=yka.teamBowlingScorecardMatch(e_s,'Essex')
print(a)
##           bowler  overs  runs  maidens  wicket   econrate
## 0  Azhar Mahmood      3    38        0       4  12.666667
## 1       GJ Batty      4    33        0       1   8.250000
## 2       JE Burke      1    18        0       0  18.000000
## 3     MW Pillans      3    28        0       0   9.333333
## 4      SM Curran      4    23        0       2   5.750000
## 5      TK Curran      4    21        0       3   5.250000

4.4 Natwest T20 Blast -Match Worm chart (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches"
path=os.path.join(dir1,".\\Gloucestershire-Glamorgan-2016-06-10.csv")
glo_gla=pd.read_csv(path)
yka.matchWormChart(glo_gla,"Gloucestershire", "Glamorgan")

path=os.path.join(dir1,".\\Leicestershire-Northamptonshire-2016-05-20.csv")
lei_nor=pd.read_csv(path)
yka.matchWormChart(lei_nor,"Northamptonshire", "Leicestershire")

4.5 Natwest T20 Blast -Team Batting partnerships all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Hampshire-Sussex-allMatches.csv")
h_s_matches = pd.read_csv(path)
yka.teamBatsmenPartnershipOppnAllMatchesChart(h_s_matches,"Hampshire","Sussex",plot=True, top=4, partnershipRuns=10)

4.6 Natwest T20 Blast -Team Bowlers vs Batsmen all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Kent-Somerset-allMatches.csv")
k_s_matches = pd.read_csv(path)
yka.teamBowlersVsBatsmenOppnAllMatches(k_s_matches,'Kent','Somerset',plot=True,
top=5,runsConceded=10)

4.7 Natwest T20 Blast -Team Bowling scorecard all teams (Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-allMatchesAllOpposition"
path=os.path.join(dir1,"Middlesex-allMatchesAllOpposition.csv")
m_matches = pd.read_csv(path)
scorecard=yka.teamBowlingScorecardAllOppnAllMatches(m_matches,"Middlesex")
print(scorecard)
##               bowler  overs  runs  maidens  wicket   econrate
## 1             AJ Tye      8    75        0       6   9.375000
## 5         BAC Howell      8    41        0       5   5.125000
## 26         GR Napier      7    65        0       5   9.285714
## 15        DI Stevens      4    31        0       4   7.750000
## 19       DW Lawrence      6    37        0       4   6.166667
## 32       JW Dernbach      4    33        0       3   8.250000
## 7          BTJ Wheal      4    43        0       3  10.750000
## 18         DR Briggs      4    24        0       3   6.000000
## 50     RK Kleinveldt      4    24        0       3   6.000000
## 46         R McLaren      7    59        0       3   8.428571
## 47         R Rampaul      3    21        0       3   7.000000
## 34         L Gregory      6    51        0       2   8.500000
## 33   KMDN Kulasekara      2    24        0       2  12.000000
## 40          MG Hogan      3    17        0       2   5.666667
## 43        MTC Waller      4    31        0       2   7.750000
## 49        RJ Gleeson      4    20        0       2   5.000000
## 48  RE van der Merwe      5    24        0       2   4.800000
## 51  RN ten Doeschate      4    32        0       2   8.000000
## 53        S Prasanna      4    20        0       2   5.000000
## 56           SW Tait      3    17        0       2   5.666667
## 57     Shahid Afridi      8    55        0       2   6.875000
## 59  T van der Gugten      3    13        1       2   4.333333
## 64          TS Mills      3    34        0       2  11.333333
## 65          WAT Beer      4    23        0       2   5.750000
## 31          JH Davey      4    28        0       2   7.000000
## 68         ZS Ansari      3    16        0       2   5.333333
## 25         GM Andrew      3    19        0       2   6.333333
## 23          GJ Batty      6    55        0       2   9.166667
## 16          DJ Bravo      3    27        0       2   9.000000
## 41          MR Quinn      6    65        0       1  10.833333
## ..               ...    ...   ...      ...     ...        ...
## 24     GL van Buuren      7    49        0       1   7.000000
## 37           MD Hunn      3    35        0       1  11.666667
## 36        LC Norwell      6    62        0       1  10.333333
## 29       JC Tredwell      4    35        0       1   8.750000
## 35         LA Dawson      6    53        0       1   8.833333
## 62           TL Best      4    51        0       0  12.750000
## 58         T Westley      2    12        0       0   6.000000
## 4         Azharullah      3    24        0       0   8.000000
## 60     TD Groenewald      1    21        0       0  21.000000
## 61         TK Curran      4    35        0       0   8.750000
## 38         MD Taylor      3    30        0       0  10.000000
## 30        JG Myburgh      1     5        0       0   5.000000
## 8          C Overton      2    18        0       0   9.000000
## 2        Ashar Zaidi      1     5        0       0   5.000000
## 66          WR Smith      2    25        0       0  12.500000
## 28         J Overton      2    24        0       0  12.000000
## 6          BJ Taylor      1     6        0       0   6.000000
## 22          GG White      4    31        0       0   7.750000
## 55          SP Crook      1     9        0       0   9.000000
## 39        ME Claydon      4    40        0       0  10.000000
## 52         RS Bopara      4    32        0       0   8.000000
## 10           CD Nash      2    19        0       0   9.500000
## 11         CH Morris      4    36        0       0   9.000000
## 12         DA Cosker      3    32        0       0  10.666667
## 13      DA Griffiths      4    39        0       0   9.750000
## 45          PD Trego      1    11        0       0  11.000000
## 44   PA van Meekeren      2    19        0       0   9.500000
## 42          MS Crane      2    25        0       0  12.500000
## 20        FK Cowdrey      1    19        0       0  19.000000
## 14        DD Masters      2    16        0       0   8.000000
## 
## [69 rows x 6 columns]
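The last column in the output above appears to be the economy rate, that is, runs conceded divided by overs bowled. Here is a minimal sketch to check this against a few of the rows, assuming the second and third columns are overs and runs. The function name `economy_rate` is mine and is not part of yorkpy.

```python
def economy_rate(runs, overs):
    """Runs conceded per over; returns None when no overs were bowled."""
    if overs == 0:
        return None
    return runs / overs

# (bowler, overs, runs) copied from rows of the output above
rows = [
    ("DR Briggs", 4, 24),
    ("R McLaren", 7, 59),
    ("TD Groenewald", 1, 21),
]

for bowler, overs, runs in rows:
    # Rounded to 6 decimals to match the precision printed in the table
    print(bowler, round(economy_rate(runs, overs), 6))
```

The three values printed agree with the last column for these bowlers (6.0, 8.428571 and 21.0).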

4.8 Natwest T20 Blast -Plot wins vs losses against all teams(Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-allMatchesAllOpposition"
path=os.path.join(dir1,"Warwickshire-allMatchesAllOpposition.csv")
w_matches = pd.read_csv(path)
yka.plotWinLossByTeamAllOpposition(w_matches,'Warwickshire')

4.9 Natwest T20 Blast -Batsmen Analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-BattingBowlingDetails"
# M Klinger
name="M Klinger"
team='Gloucestershire'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsAgainstOpposition(df,name)

# CA Ingram
name="CA Ingram"
team='Glamorgan'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeStrikeRate(df,name)

4.11 Natwest T20 Blast -Bowler analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-BattingBowlingDetails"
# BAC Howell
name="BAC Howell"
team='Gloucestershire'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgEconRate(df,name)

# GR Napier
name="GR Napier"
team='Essex'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketsVenue(df,name)

Note: yorkpy will work for all T20 leagues which are in YAML format as specified in Cricsheet.

You can clone/fork the latest code for yorkpy from github yorkpy

The data for IPL, Intl. T20, BBL and Natwest T20 have already been converted into pandas dataframes and saved as CSVs. You can download the converted files from Github at allYorkpyT20Data (https://github.com/tvganesh/allYorkpyT20Data)

Conclusion: This post shows the kind of detailed analysis that can be performed with yorkpy. With all the converted data it should also be possible to train a Machine Learning model, which I will keep for another day. You could go ahead and use the data in other innovative ways. Do keep me posted if you do!!

Important note: Do check out my other posts using yorkpy at yorkpy-posts

Have fun with yorkpy!!

See also
1. Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8
2. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
3. Hand detection through Haartraining: A hands-on approach
4. My book ‘Deep Learning from first principles:Second Edition’ now on Amazon
5. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
6. The 3rd paperback & kindle editions of my books on Cricket, now on Amazon

To see all posts click Index of posts

Pitching yorkpy … in the block hole – Part 4

A good programmer is someone who always looks both ways before crossing a one-way street.  Doug Linder

There are two ways to write error-free programs; only the third one works. Alan J. Perlis

In order to understand recursion, one must first understand recursion. Anonymous

This is the fourth and final part of my series on the Python package yorkpy. In this part yorkpy, the Python avatar of my R package yorkr (see Introducing cricket package yorkr: Part 1- Beaten by sheer pace!), develops wings and is prepared for take-off. The yorkpy package uses data from Cricsheet.

You can clone/download the code at Github yorkpy
This post has been published to RPubs at yorkpy-Part4
You can download this post as PDF at IPLT20-yorkpy-part4
You can download all the data used in this post and the previous post at yorkpyData

This post is a continuation of the earlier posts on yorkpy

1. Pitching yorkpy . short of good length to IPL – Part 1 In this part I included functions that convert the YAML data of IPL matches into Pandas dataframes, which are then saved as CSV. This part can perform analysis of individual IPL matches. Note: The converted data is available at yorkpyData
2. Pitching yorkpy.on the middle and outside off-stump to IPL – Part 2 This part included functions to create a large data frame for head-to-head confrontations between any 2 IPL teams, say CSK-MI, DD-KKR etc., which can be saved as CSV. Analysis is then performed on these team-2-team confrontations. Note: The converted data is available at yorkpyData
3. Pitching yorkpy.swinging away from the leg stump to IPL – Part 3 The 3rd part includes the performance of any IPL team against all other IPL teams. The data can also be saved as CSV. Note: The converted data is available at yorkpyData

Note: If you would like to do a similar analysis for a different set of batsmen and bowlers, you can clone/download my skeleton yorkpy-template from Github (which is the R Markdown file I have used for the analysis below).

This 4th and final part includes analysis of batting and bowling performances of any IPL player. The batting and bowling details for all teams have already been converted and are available at IPLT20-Batting-BowlingDetails

This part includes the following new functions

Batsman functions

  1. batsmanRunsVsDeliveries
  2. batsmanFoursSixes
  3. batsmanDismissals
  4. batsmanRunsVsStrikeRate
  5. batsmanMovingAverage
  6. batsmanCumulativeAverageRuns
  7. batsmanCumulativeStrikeRate
  8. batsmanRunsAgainstOpposition
  9. batsmanRunsVenue

Bowler functions

  1. bowlerMeanEconomyRate
  2. bowlerMeanRunsConceded
  3. bowlerMovingAverage
  4. bowlerCumulativeAvgWickets
  5. bowlerCumulativeAvgEconRate
  6. bowlerWicketPlot
  7. bowlerWicketsAgainstOpposition
  8. bowlerWicketsVenue

A. Batsman functions

1. Get IPL Team Batting details

The function below gets the overall IPL team batting details based on the CSV files that were saved for IPL T20 matches. These files are also available on Github at yorkpyData. The batting details of the IPL team in each match are created, and a huge data frame is built by combining the batting details from each match. This can be saved as a CSV file with a name such as Delhi Daredevils-BattingDetails.csv.

dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
#csk_details = yka.getTeamBattingDetails("Chennai Super Kings",dir=dir1, save=True)
#dd_details = yka.getTeamBattingDetails("Delhi Daredevils",dir=dir1,save=True)
#kkr_details = yka.getTeamBattingDetails("Kolkata Knight Riders",dir=dir1,save=True)
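The idea of stacking per-match details into one table and saving it under a team-specific name can be sketched in plain Python as below. This is not yorkpy's implementation. The function names, the column names and the two tiny "matches" are made up for illustration; only the file naming pattern (team name followed by -BattingDetails.csv) comes from the description above.

```python
import csv
import io
import os

def stack_match_rows(match_files):
    """Read each per-match CSV and stack all rows into one list of dicts."""
    combined = []
    for handle in match_files:
        combined.extend(csv.DictReader(handle))
    return combined

def details_filename(data_dir, team, kind="BattingDetails"):
    """Build '<team>-<kind>.csv' inside data_dir, following the naming in the post."""
    return os.path.join(data_dir, team + "-" + kind + ".csv")

# Two made-up "matches" standing in for the real per-match CSV files
match1 = io.StringIO("batsman,runs\nPlayer A,38\nPlayer B,12\n")
match2 = io.StringIO("batsman,runs\nPlayer A,57\n")

rows = stack_match_rows([match1, match2])
print(len(rows))  # 3 rows in the combined table
print(details_filename("data3", "Delhi Daredevils"))
```

With pandas, the stacking step would typically be done with pd.concat over the per-match dataframes. Building the path with os.path.join also avoids hard-coding the Windows-style separators used in dir1.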

2. Get IPL batsman details

This function is used to get the individual IPL T20 batting record for the specified batsman of the team, as in the functions below.

For the batsmen functions below I have chosen Rishabh Pant, Kane Williamson and Ambati Rayudu for the analysis as they top the batting lists. You can choose any IPL batsman for the analysis.

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Rishabh Pant
name="RR Pant"
team='Delhi Daredevils'
rpant=yka.getBatsmanDetails(team,name,dir=dir1)

3 Batsman Runs vs Deliveries (in IPL matches)

This function plots the runs vs deliveries faced by the batsman

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Rishabh Pant
name="RR Pant"
team='Delhi Daredevils'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVsDeliveries(df,name)

# 2. Kane Williamson
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="KS Williamson"
team='Sunrisers Hyderabad'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVsDeliveries(df,name)

#3. Ambati Rayudu
name="AT Rayudu"
team='Mumbai Indians'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVsDeliveries(df,name)
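Since the same two calls are repeated for every player, the pattern can be wrapped in a small helper. The sketch below is only a convenience wrapper of my own (the name `run_for_players` is hypothetical and not part of yorkpy). It takes the two functions as arguments, so it can be tried out with stand-ins before pointing it at yorkpy.

```python
def run_for_players(players, get_details, plot_fn, data_dir):
    """Call plot_fn once per (team, name) pair; return the names processed."""
    done = []
    for team, name in players:
        # Same two steps as in the chunks above: fetch details, then plot
        df = get_details(team, name, dir=data_dir)
        plot_fn(df, name)
        done.append(name)
    return done

# The three batsmen used in this post
batsmen = [
    ("Delhi Daredevils", "RR Pant"),
    ("Sunrisers Hyderabad", "KS Williamson"),
    ("Mumbai Indians", "AT Rayudu"),
]

# Stand-in functions so the sketch runs without yorkpy installed
def fake_get_details(team, name, dir=None):
    return {"team": team, "name": name, "dir": dir}

def fake_plot(df, name):
    print("plotting", name, "of", df["team"])

run_for_players(batsmen, fake_get_details, fake_plot, "data3")
```

With yorkpy imported as yka, the equivalent call would be `run_for_players(batsmen, yka.getBatsmanDetails, yka.batsmanRunsVsDeliveries, dir1)`, and any of the other batsman plotting functions can be passed in place of the third argument.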

4. Batsman fours and sixes (in IPL matches)

This plots the fours, sixes and the total runs for a batsman

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Rishabh Pant
name="RR Pant"
team='Delhi Daredevils'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanFoursSixes(df,name)


# 2. Kane Williamson
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="KS Williamson"
team='Sunrisers Hyderabad'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanFoursSixes(df,name)

#3. Ambati Rayudu
name="AT Rayudu"
team='Mumbai Indians'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanFoursSixes(df,name)

5. Batsman dismissals (in IPL matches)

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Rishabh Pant
name="RR Pant"
team='Delhi Daredevils'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanDismissals(df,name)

# 2. Kane Williamson
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="KS Williamson"
team='Sunrisers Hyderabad'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanDismissals(df,name)

#3. Ambati Rayudu
name="AT Rayudu"
team='Mumbai Indians'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanDismissals(df,name)

6. Batsman Runs vs Strike Rate (in IPL matches)

The plots below give the Runs vs Strike rate for batsmen

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Rishabh Pant
name="RR Pant"
team='Delhi Daredevils'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVsStrikeRate(df,name)

# 2. Kane Williamson
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="KS Williamson"
team='Sunrisers Hyderabad'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVsStrikeRate(df,name)

#3. Ambati Rayudu
name="AT Rayudu"
team='Mumbai Indians'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVsStrikeRate(df,name)
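For reference, a batsman's strike rate is the number of runs scored per 100 balls faced. A minimal sketch of the calculation follows; the function name and the sample innings are made up for illustration and are not taken from the yorkpy data.

```python
def strike_rate(runs, balls):
    """Runs scored per 100 balls faced; None if no balls were faced."""
    if balls == 0:
        return None
    return 100.0 * runs / balls

# Made-up innings as (runs, balls faced)
innings = [(50, 25), (30, 30), (0, 0)]
for runs, balls in innings:
    print(runs, balls, strike_rate(runs, balls))
```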

7. Batsman Moving average of runs (in IPL matches)

The plots below compute and plot the moving average of batsmen

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Rishabh Pant
name="RR Pant"
team='Delhi Daredevils'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanMovingAverage(df,name)

# 2. Kane Williamson
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="KS Williamson"
team='Sunrisers Hyderabad'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanMovingAverage(df,name)

#3. Ambati Rayudu
name="AT Rayudu"
team='Mumbai Indians'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanMovingAverage(df,name)
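The idea of a moving average can be sketched in plain Python as below. This is only a simple fixed-window mean over made-up scores. The window size and the function name are my assumptions, and yorkpy may well smooth the data differently.

```python
def moving_average(values, window=3):
    """Mean of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    # One average per full window; shorter inputs give an empty list
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        averages.append(sum(chunk) / window)
    return averages

# Made-up runs scored in successive innings
runs = [10, 20, 30, 40]
print(moving_average(runs, window=2))  # [15.0, 25.0, 35.0]
```

A larger window gives a smoother curve but reacts more slowly to a change in form.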

8. Batsman Cumulative average of runs (in IPL matches)

The functions below plot the cumulative average of the batsmen

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Rishabh Pant
name="RR Pant"
team='Delhi Daredevils'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeAverageRuns(df,name)

# 2. Kane Williamson
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="KS Williamson"
team='Sunrisers Hyderabad'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeAverageRuns(df,name)

#3. Ambati Rayudu
name="AT Rayudu"
team='Mumbai Indians'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeAverageRuns(df,name)
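A cumulative average is the running mean of all innings played so far. Here is a minimal sketch, assuming a plain list of runs per innings in date order. The function name and the scores are made up, and this is not taken from the yorkpy source.

```python
from itertools import accumulate

def cumulative_average(values):
    """Running mean: element i is the mean of values[0..i]."""
    return [total / count
            for count, total in enumerate(accumulate(values), start=1)]

# Made-up runs scored in successive innings
runs = [10, 20, 60]
print(cumulative_average(runs))  # [10.0, 15.0, 30.0]
```

Unlike the moving average, every earlier innings keeps contributing, so the curve settles down as the number of innings grows.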

9. Batsman Cumulative Strike Rate (in IPL matches)

The functions below plot the cumulative strike rate of the batsmen

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Rishabh Pant
name="RR Pant"
team='Delhi Daredevils'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeStrikeRate(df,name)

# 2. Kane Williamson
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="KS Williamson"
team='Sunrisers Hyderabad'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeStrikeRate(df,name)

#3. Ambati Rayudu
name="AT Rayudu"
team='Mumbai Indians'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeStrikeRate(df,name)

10. Batsman performance against opposition (in IPL matches)

The plots below show how the batsmen performed against other IPL teams

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Rishabh Pant
name="RR Pant"
team='Delhi Daredevils'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsAgainstOpposition(df,name)

# 2. Kane Williamson
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="KS Williamson"
team='Sunrisers Hyderabad'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsAgainstOpposition(df,name)

#3. Ambati Rayudu
name="AT Rayudu"
team='Mumbai Indians'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsAgainstOpposition(df,name)
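Performance against each opposition boils down to grouping the innings by a key and summarising the runs in each group. The sketch below uses the mean as the summary. The field names 'opposition', 'venue' and 'runs', the function name and the innings themselves are all hypothetical, and I am not asserting that yorkpy plots the mean rather than the total.

```python
from collections import defaultdict

def mean_runs_by(innings, key):
    """Average the 'runs' field of each innings, grouped by innings[key]."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for inn in innings:
        totals[inn[key]] += inn["runs"]
        counts[inn[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up innings for a single batsman
innings = [
    {"opposition": "Mumbai Indians", "venue": "Ground A", "runs": 40},
    {"opposition": "Mumbai Indians", "venue": "Ground B", "runs": 20},
    {"opposition": "Chennai Super Kings", "venue": "Ground A", "runs": 60},
]

print(mean_runs_by(innings, "opposition"))
print(mean_runs_by(innings, "venue"))
```

The same helper covers a breakdown by venue simply by changing the key.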

11. Batsman performance at different venues (in IPL matches)

The plots below show how the batsmen performed at different venues

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Rishabh Pant
name="RR Pant"
team='Delhi Daredevils'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVenue(df,name)

# 2. Kane Williamson
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="KS Williamson"
team='Sunrisers Hyderabad'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVenue(df,name)

#3. Ambati Rayudu
name="AT Rayudu"
team='Mumbai Indians'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVenue(df,name)

B. Bowler functions

12. Get bowling details in IPL matches

The function below gets the overall team IPL T20 bowling details based on the CSV files that were saved for IPL T20 matches. These files are also available on Github at yorkpyData. The IPL T20 bowling details of the IPL team in each match are created, and a huge data frame is built by stacking the individual dataframes. This can be saved as a CSV file, for e.g. Chennai Super Kings-BowlingDetails.csv

dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
#kkr_bowling = yka.getTeamBowlingDetails("Kolkata Knight Riders",dir=dir1,save=True)
#csk_bowling = yka.getTeamBowlingDetails("Chennai Super Kings",dir=dir1,save=True)
#kxip_bowling = yka.getTeamBowlingDetails("Kings XI Punjab",dir=dir1,save=True)

13. Get bowling details of the individual IPL bowlers

This function is used to get the individual bowling record for a specified bowler of the team, as in the functions below.

The plots below deal with bowlers’ performance. For this analysis I have chosen Amit Mishra, Piyush Chawla and Bhuvneshwar Kumar. You can choose any other IPL bowler.

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Amit Mishra
name="A Mishra"
team='Delhi Daredevils'
#df=yka.getBowlerWicketDetails(team,name,dir=dir1)

14. Bowler Economy Rate (in IPL matches)

The plots below show the economy rate of the selected bowlers

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Amit Mishra
name="A Mishra"
team='Delhi Daredevils'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMeanEconomyRate(df,name)

# 2. Piyush Chawla
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="PP Chawla"
team='Kolkata Knight Riders'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMeanEconomyRate(df,name)

#3. Bhuvneshwar Kumar
name="B Kumar"
team='Sunrisers Hyderabad'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMeanEconomyRate(df,name)

15. Bowler Mean Runs conceded (in IPL matches)

The plots below show the mean runs conceded by the selected bowlers

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Amit Mishra
name="A Mishra"
team='Delhi Daredevils'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMeanRunsConceded(df,name)

# 2. Piyush Chawla
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="PP Chawla"
team='Kolkata Knight Riders'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMeanRunsConceded(df,name)

#3. Bhuvneshwar Kumar
name="B Kumar"
team='Sunrisers Hyderabad'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMeanRunsConceded(df,name)

16. Moving average of wickets for bowler (in IPL matches)

The moving average of wickets for each bowler is plotted below

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Amit Mishra
name="A Mishra"
team='Delhi Daredevils'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMovingAverage(df,name)

# 2. Piyush Chawla
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="PP Chawla"
team='Kolkata Knight Riders'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMovingAverage(df,name)

#3. Bhuvneshwar Kumar
name="B Kumar"
team='Sunrisers Hyderabad'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMovingAverage(df,name)

17. Cumulative average wickets for bowler (in IPL matches)

The cumulative average of wickets for each bowler is computed and plotted

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Amit Mishra
name="A Mishra"
team='Delhi Daredevils'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgWickets(df,name)

# 2. Piyush Chawla
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="PP Chawla"
team='Kolkata Knight Riders'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgWickets(df,name)

#3. Bhuvneshwar Kumar
name="B Kumar"
team='Sunrisers Hyderabad'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgWickets(df,name)

18. Cumulative average economy rate for bowler (in IPL matches)

The plots below give the cumulative average economy rate for each bowler

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Amit Mishra
name="A Mishra"
team='Delhi Daredevils'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgEconRate(df,name)

# 2. Piyush Chawla
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="PP Chawla"
team='Kolkata Knight Riders'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgEconRate(df,name)

#3. Bhuvneshwar Kumar
name="B Kumar"
team='Sunrisers Hyderabad'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgEconRate(df,name)

19. Bowler wicket plot (in IPL matches)

The plots below give the over vs wickets for bowlers

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Amit Mishra
name="A Mishra"
team='Delhi Daredevils'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketPlot(df,name)

# 2. Piyush Chawla
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="PP Chawla"
team='Kolkata Knight Riders'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketPlot(df,name)

#3. Bhuvneshwar Kumar
name="B Kumar"
team='Sunrisers Hyderabad'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketPlot(df,name)

20. Bowler wicket against opposition (in IPL matches)

The performance of the bowlers against different IPL teams is shown below

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Amit Mishra
name="A Mishra"
team='Delhi Daredevils'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketsAgainstOpposition(df,name)

# 2. Piyush Chawla
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="PP Chawla"
team='Kolkata Knight Riders'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketsAgainstOpposition(df,name)

#3. Bhuvneshwar Kumar
name="B Kumar"
team='Sunrisers Hyderabad'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketsAgainstOpposition(df,name)

21. Bowler wicket in different venues (in IPL matches)

The plots below show how the bowlers perform at different venues

import pandas as pd
import os
import yorkpy.analytics as yka
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
# 1. Amit Mishra
name="A Mishra"
team='Delhi Daredevils'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketsVenue(df,name)

# 2. Piyush Chawla
dir1= "C:\\software\\cricket-package\\yorkpyIPLData\\data3"
name="PP Chawla"
team='Kolkata Knight Riders'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketsVenue(df,name)

#3. Bhuvneshwar Kumar
name="B Kumar"
team='Sunrisers Hyderabad'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketsVenue(df,name)

Note: You can clone/download the code at Github yorkpy

Important note: Do check out my other posts using yorkpy at yorkpy-posts

Conclusion: This concludes the python package yorkpy. Go ahead and give yorkpy a spin!

Also see
1. Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8
2. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
3. Hand detection through Haartraining: A hands-on approach
4. My book ‘Deep Learning from first principles:Second Edition’ now on Amazon
5. Big Data-1: Move into the big league:Graduate from Python to Pyspark
6. Cricpy takes a swing at the ODIs

To see all posts click Index of posts