T20 Win Probability using CTGANs, synthetic data

This should be my last post on computing T20 Win Probability. In this post I compute Win Probability using Augmented Data with the help of Conditional Tabular Generative Adversarial Networks (CTGANs).

A.Introduction

I started the computation of T20 match Win Probability in my earlier post

a) ‘Computing Win-Probability of T20 matches‘ where I used

  • vanilla Logistic Regression to get an accuracy of 0.67,
  • Random Forest with Tidy models gave me an accuracy of 0.737
  • Deep Learning with Keras also with 0.73.

This was done without player embeddings

b) Next I used player embeddings for batsmen and bowlers in my post Boosting Win Probability accuracy with player embeddings , and my accuracies improved significantly

  • glmnet : accuracy – 0.728 and roc_auc – 0.81
  • random forest : accuracy – 0.927 and roc_auc – 0.98
  • mlp-dnn :accuracy – 0.762 and roc_auc – 0.854

c) Third I tried using Deep Learning with Keras using player embeddings

  • DL network gave an accuracy of 0.8639

This was lightweight and could be easily deployed in my Shiny GooglyPlusPlus app as opposed to the Tidymodel’s Random Forest, which was bulky and slow.

d) Finally I decided to try and improve the accuracy of my Deep Learning Model using Synthetic data. Towards this end, my explorations led me to Conditional Tabular Generative Adversarial Networks (CTGANs). CTGAN are GAN networks that can be used with Tabular data as GAN models are not useful with tabular data. However, the best performance I got for

  • DL Keras Model + Synthetic data : accuracy =0.77

The poorer accuracy was because CTGAN requires enormous computing power (GPUs) and RAM. The free version of Colab, Kaggle kept crashing when I tried with even 0.1 % of my 1.2 million dataset size. Finally, I tried with just 0.05% and was able to generate synthetic data. Most likely, it is the small sample size and the smaller number of epochs could be the reason for the poor result. In any case, it was worth trying and this approach would possibly work with sufficient computing resources.

B.Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) was the brain child of Ian Goodfellow who demonstrated it in 2014. GANs are capable of generating synthetic text, tables, images, videos using available data. In Adversarial nets framework, the generative model is pitted against an adversary: a
discriminative model that learns to determine whether a sample is from the model distribution or the
data distribution.

GANs have 2 Deep Neural Networks , the Generator and Discriminator which compete against other

  • The Generator (Counterfeiter) takes random noise as input and generates fake images, tables, text. The generator learns to generate plausible data. The generated instances become negative training examples for the discriminator.
  • The Discriminator (Police) which tries to distinguish between the real and fake images, text. The discriminator learns to distinguish the generator’s fake data from real data. The discriminator penalises the generator for producing implausible results.

A pictorial representation of the GAN model can be shown below

Theoretically best performance of GANs are supposed to happen when the network reaches the ‘Nash equilibrium‘, i.e. when the Generator produces near fake images and the Discriminator’s loss is f ~0.5 i.e. the discriminator is unable to distinguish between real and fake images.

Note: Though I have mentioned T20 data in the above GAN model, the T20 tabular data is actually used in CTGAN which is slightly different from the above. See Reference 2) below.

C. Conditional Tabular Generative Adversial Networks (CTGANs)

“Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes whereas discrete columns are sometimes imbalanced making the modeling difficult.” CTGANs handle these challenges.

I came upon CTGAN after spending some time exploring GANs via blogs, videos etc. For building the model I use real T20 match data. However, CTGAN requires immense raw computing power and a lot of RAM. My initial attempts on Colab, my Mac (12 core, 32GB RAM), took forever before eventually crashing, I switched to Kaggle and used GPUs. Still I was only able to use only a miniscule part of my T20 dataset. My match data has 1.2 million rows, hoanything > 0.05% resulted in Kaggle crashing. Since I was able to use only a fraction, I executed the CTGAN model over several iterations, each iteration with a random 0.05% sample of the dataset. At the end of each iterations I also generate synthetic dataset. Over 12 iterations, I generate close 360K of ‘synthetic‘ T20 match data.

I then augment the 1.2 million rows of ‘real‘ T20 match data with the generated ‘synthetic T20 match data to run my Deep Learning model

D. Executing the CTGAN model

a. Read the real T20 match data

!pip install ctgan
import pandas as pd
import ctgan
from ctgan import CTGAN
from numpy.random import seed

# Read the T20 match data
df = pd.read_csv('/kaggle/input/cricket1/t20.csv')

# Randomly sample 0.05% of the dataset. Note larger datasets cause the algo to crash
train_dataset = df.sample(frac=0.05)

# Print the real T20 match data
print(train_dataset.head(10))
print(train_dataset.shape)

             batsmanIdx  bowlerIdx  ballNum  ballsRemaining  runs   runRate  \
363695         3333        432      134             119   153  1.285714   
1082839        3881       1180      218              30    93  3.100000   
595799         2366        683      187              65   120  1.846154   
737614         4490       1381      148              87   144  1.655172   
410202          934       1003       19             106    35  1.842105   
525627          921       1711      251               1     8  8.000000   
657669         4718       1602      130             115   145  1.260870   
666461         4309       1989       44              87    38  0.863636   
651229         3336        754       30              92    36  1.200000   
709892         3048        421       97              28   119  1.226804   

            numWickets  runsMomentum  perfIndex  isWinner  
363695            0      0.092437  18.333333         1  
1082839           5      0.200000   4.736842         0  
595799            4      0.107692   9.566667         0  
737614            1      0.114943   9.130435         1  
410202            0      0.103774  20.263158         0  
525627            8      3.000000   3.837209         0  
657669            0      0.095652  19.555556         0  
666461            0      0.126437   9.500000         0  
651229            0      0.119565  13.200000         0  
709892            3      0.285714   9.814433         1  
(59956, 10)

b. Run CTGAN model on the real T20 data

import pandas as pd
import ctgan
from ctgan import CTGAN
from numpy.random import seed
from pickle import TRUE

df = pd.read_csv('/kaggle/input/cricket1/t20.csv')

#Specify the categorical features. batsmanIdx & bowlerIdx are player embeddings
categorical_features = ['batsmanIdx','bowlerIdx']

# Create a empty dataframe for synthetic data
df1 = pd.DataFrame()

# Loop for 12 iterations. Minimize generator & discriminator loss
for i in range(12):
    print(i)
    train_dataset = df.sample(frac=0.05)
    seed(33)

    ctgan = CTGAN(epochs=20,verbose=True,generator_lr=.001,discriminator_lr=.001,batch_size=1000)
    ctgan.fit(train_dataset, categorical_features)

    # Generate synthetic data
    samples = ctgan.sample(30000)

   # Concatenate the synthetic data after each iteration
    df1 = pd.concat([df1,samples])
    print(samples.head())
    print(df1.shape)

# Output the synthetic data to file
df1.to_csv("output1.csv",index=False)

0
Epoch 1, Loss G:  8.3825,Loss D: -0.6159
Epoch 2, Loss G:  3.5117,Loss D: -0.3016
Epoch 3, Loss G:  2.1619,Loss D: -0.5713
Epoch 4, Loss G:  0.9847,Loss D:  0.1010
Epoch 5, Loss G:  0.6198,Loss D:  0.0789
Epoch 6, Loss G:  0.1710,Loss D:  0.0959
Epoch 7, Loss G:  0.3236,Loss D: -0.1554
Epoch 8, Loss G:  0.2317,Loss D: -0.0765
Epoch 9, Loss G: -0.0127,Loss D:  0.0275
Epoch 10, Loss G:  0.1477,Loss D: -0.0353
Epoch 11, Loss G:  0.0997,Loss D: -0.0129
Epoch 12, Loss G:  0.0066,Loss D: -0.0486
Epoch 13, Loss G:  0.0351,Loss D: -0.0805
Epoch 14, Loss G: -0.1399,Loss D: -0.0021
Epoch 15, Loss G: -0.1503,Loss D: -0.0518
Epoch 16, Loss G: -0.2306,Loss D: -0.0234
Epoch 17, Loss G: -0.2986,Loss D:  0.0469
Epoch 18, Loss G: -0.1941,Loss D: -0.0560
Epoch 19, Loss G: -0.3794,Loss D:  0.0000
Epoch 20, Loss G: -0.2763,Loss D:  0.0368
   batsmanIdx  bowlerIdx  ballNum  ballsRemaining  runs   runRate  numWickets  \
0         906        224        8              75    81  1.955153           4   
1        4159        433       17              31   126  1.799280           9   
2         229        351      192              66    82  1.608527           5   
3        1926        962       63               0   117  1.658105           0   
4         286        431      128               1    36  1.605079           0   

   runsMomentum  perfIndex  isWinner  
0      0.146670   6.937595         1  
1      0.160534  10.904346         1  
2      0.516010  11.698128         1  
3      0.380986  11.914613         0  
4      0.112255   5.392120         0  
(30000, 10)
1
Epoch 1, Loss G:  7.9977,Loss D: -0.3592
Epoch 2, Loss G:  3.7418,Loss D: -0.3371
Epoch 3, Loss G:  1.6685,Loss D: -0.3211
Epoch 4, Loss G:  1.0539,Loss D: -0.3495
Epoch 5, Loss G:  0.4664,Loss D: -0.0907
Epoch 6, Loss G:  0.4004,Loss D: -0.1208
Epoch 7, Loss G:  0.3250,Loss D: -0.1482
Epoch 8, Loss G:  0.1753,Loss D:  0.0169
Epoch 9, Loss G:  0.1382,Loss D:  0.0661
Epoch 10, Loss G:  0.1509,Loss D: -0.1023
Epoch 11, Loss G: -0.0235,Loss D:  0.0210
Epoch 12, Loss G: -0.1636,Loss D: -0.0124
Epoch 13, Loss G: -0.3370,Loss D: -0.0185
Epoch 14, Loss G: -0.3054,Loss D: -0.0085
Epoch 15, Loss G: -0.5142,Loss D:  0.0121
Epoch 16, Loss G: -0.3813,Loss D: -0.0921
Epoch 17, Loss G: -0.5838,Loss D:  0.0210
Epoch 18, Loss G: -0.4033,Loss D: -0.0181
Epoch 19, Loss G: -0.5711,Loss D:  0.0269
Epoch 20, Loss G: -0.4828,Loss D: -0.0830
   batsmanIdx  bowlerIdx  ballNum  ballsRemaining  runs   runRate  numWickets  \
0        2202        265      223              39    13  0.868927           0   
1        3641        856       35              59    26  2.236160           6   
2         676       2903      218              93    16  0.460693           1   
3        3482       3459       44             117   102  0.851471           8   
4        3046       3076       59               5    84  1.016824           2   

   runsMomentum  perfIndex  isWinner  
0      0.138586   4.733462         0  
1      0.124453   5.146831         1  
2      0.273168  10.106869         0  
3      0.129520   5.361127         0  
4      1.083525  25.677574         1  
(60000, 10)
...
...
11
Epoch 1, Loss G:  8.8362,Loss D: -0.7111
Epoch 2, Loss G:  4.1322,Loss D: -0.8468
Epoch 3, Loss G:  1.2782,Loss D:  0.1245
Epoch 4, Loss G:  1.1135,Loss D: -0.3588
Epoch 5, Loss G:  0.6033,Loss D: -0.1255
Epoch 6, Loss G:  0.6912,Loss D: -0.1906
Epoch 7, Loss G:  0.3340,Loss D: -0.1048
Epoch 8, Loss G:  0.3515,Loss D: -0.0730
Epoch 9, Loss G:  0.1702,Loss D:  0.0237
Epoch 10, Loss G:  0.1064,Loss D:  0.0632
Epoch 11, Loss G:  0.0884,Loss D: -0.0005
Epoch 12, Loss G:  0.0556,Loss D: -0.0607
Epoch 13, Loss G: -0.0917,Loss D: -0.0223
Epoch 14, Loss G: -0.1492,Loss D:  0.0258
Epoch 15, Loss G: -0.0986,Loss D: -0.0112
Epoch 16, Loss G: -0.1428,Loss D: -0.0060
Epoch 17, Loss G: -0.2225,Loss D: -0.0263
Epoch 18, Loss G: -0.2255,Loss D: -0.0328
Epoch 19, Loss G: -0.3482,Loss D:  0.0277
Epoch 20, Loss G: -0.2667,Loss D: -0.0721
   batsmanIdx  bowlerIdx  ballNum  ballsRemaining  runs   runRate  numWickets  \
0         367       1447      129              27    30  1.242120           2   
1        2481       1528      221               4    10  1.344024           2   
2        1034       3116      132              87   153  1.142750           3   
3        1201       2868      151              60   136  1.091638           1   
4        4327       3291      108              89    22  0.842775           2   

   runsMomentum  perfIndex  isWinner  
0      1.978739   6.393691         1  
1      0.539650   6.783990         0  
2      0.107156  12.154197         0  
3      3.193574  11.992059         0  
4      0.127507  12.210876         0  
(360000, 10)

E. Sample of the Synthetic data

synthetic_data = ctgan.sample(20000)
print(synthetic_data.head(100))

    batsmanIdx  bowlerIdx  ballNum  ballsRemaining  runs    runRate  \
0         1073       3059       72              72   149   2.230236   
1         3769       1443      106               7   137   0.881409   
2          448       3048      166               6   220   1.092504   
3         2969       1244      103              82   207  12.314862   
4          180       1372      125             111    14   1.310051   
..         ...        ...      ...             ...   ...        ...   
95        1521       1040      153               6   166   1.097363   
96        2366         62       25             114   119   0.910642   
97        3506       1736      100             118   140   1.640921   
98        3343       2347       47              54    50   0.696462   
99        1957       2888      136              27   153   1.315565   

    numWickets  runsMomentum  perfIndex  isWinner  
0            0      0.111707  17.466925         0  
1            1      0.130352  14.274113         0  
2            1      0.173541  11.076731         1  
3            1      0.218977   6.239951         0  
4            4      2.829380   9.183323         1  
..         ...           ...        ...       ...  
95           0      0.223437   7.011180         0  
96           1      0.451371  16.908120         1  
97           5      0.156936   9.217205         0  
98           6      0.124536   6.273091         0  
99           1      0.249329  14.221554         0  

[100 rows x 10 columns]

F. Evaluating the synthetic T20 match data

Here the quality of the synthetic data set is evaluated.

a) Statistical evaluation

  • Read the real T20 match data
  • Read the generated T20 synthetic match data
import pandas as pd

# Read the T20 match and synthetic match data
df = pd.read_csv('/kaggle/input/cricket1/t20.csv').  #1.2 million rows
synthetic=pd.read_csv('/kaggle/input/synthetic/synthetic.csv')   #300K

# Randomly sample 1000 rows, and generate stats
df1=df.sample(n=1000)
real=df1.describe()
realData_stats=real.transpose
print(realData_stats)

synthetic1=synthetic.sample(n=1000)
synthetic=synthetic1.describe()
syntheticData_stats=synthetic.transpose
syntheticData_stats

a) Stats of real T20 match data

<bound method DataFrame.transpose of         batsmanIdx    bowlerIdx      ballNum  ballsRemaining         runs  \
count  1000.000000  1000.000000  1000.000000     1000.000000  1000.000000   
mean   2323.940000  1776.481000   118.165000       59.236000    77.649000   
std    1329.703046  1011.470703    70.564291       35.312934    49.098763   
min       8.000000    13.000000     1.000000        1.000000    -2.000000   
25%    1134.750000   850.000000    58.000000       28.750000    39.000000   
50%    2265.000000  1781.500000   117.000000       59.000000    72.000000   
75%    3510.000000  2662.250000   178.000000       89.000000   111.000000   
max    4738.000000  3481.000000   265.000000      127.000000   246.000000   

           runRate   numWickets  runsMomentum    perfIndex     isWinner  
count  1000.000000  1000.000000   1000.000000  1000.000000  1000.000000  
mean      1.734979     2.614000      0.310568     9.580386     0.499000  
std       5.698104     2.267189      0.686171     4.530856     0.500249  
min      -2.000000     0.000000      0.071429     0.000000     0.000000  
25%       1.009063     1.000000      0.105769     6.666667     0.000000  
50%       1.272727     2.000000      0.141026     9.236842     0.000000  
75%       1.546891     4.000000      0.250000    12.146735     1.000000  
max     166.000000    10.000000     10.000000    30.800000     1.000000

b) Stats of Synthetic T20 match data

     
           batsmanIdx    bowlerIdx      ballNum  ballsRemaining         runs  \
count  1000.000000  1000.000000  1000.000000     1000.000000  1000.000000   
mean   2304.135000  1760.776000   116.081000       50.102000    74.357000   
std    1342.348684  1003.496003    72.019228       35.795236    48.103446   
min       2.000000    15.000000    -4.000000       -2.000000    -1.000000   
25%    1093.000000   881.000000    46.000000       18.000000    30.000000   
50%    2219.500000  1763.500000   116.000000       45.000000    75.000000   
75%    3496.500000  2644.750000   180.250000       77.000000   112.000000   
max    4718.000000  3481.000000   253.000000      124.000000   222.000000   

           runRate   numWickets  runsMomentum    perfIndex     isWinner  
count  1000.000000  1000.000000   1000.000000  1000.000000  1000.000000  
mean      1.637225     3.096000      0.336540     9.278073     0.507000  
std       1.691060     2.640408      0.502346     4.727677     0.500201  
min      -4.388339     0.000000      0.083351    -0.902991     0.000000  
25%       1.077789     1.000000      0.115770     5.731931     0.000000  
50%       1.369655     2.000000      0.163085     9.104328     1.000000  
75%       1.660477     5.000000      0.311586    12.619318     1.000000  
max      23.757001    10.000000      4.630908    29.829497     1.000000

c) Plotting the Generator and Discriminator loss

import pandas as pd

# CTGAN prints out a new line for each epoch
epochs_output = str(output).split('\n')

# CTGAN separates the values with commas
raw_values = [line.split(',') for line in epochs_output]
loss_values = pd.DataFrame(raw_values)[:-1] # convert to df and delete last row (empty)

# Rename columns
loss_values.columns = ['Epoch', 'Generator Loss', 'Discriminator Loss']

# Extract the numbers from each column 
loss_values['Epoch'] = loss_values['Epoch'].str.extract('(\d+)').astype(int)
loss_values['Generator Loss'] = loss_values['Generator Loss'].str.extract('([-+]?\d*\.\d+|\d+)').astype(float)
loss_values['Discriminator Loss'] = loss_values['Discriminator Loss'].str.extract('([-+]?\d*\.\d+|\d+)').astype(float)

# the result is a row for each epoch that contains the generator and discriminator loss
loss_values.head()

	Epoch	Generator Loss	Discriminator Loss
0	1	8.0158	-0.3840
1	2	4.6748	-0.9589
2	3	1.1503	-0.0066
3	4	1.5593	-0.8148
4	5	0.6734	-0.1425
5	6	0.5342	-0.2202
6	7	0.4539	-0.1462
7	8	0.2907	-0.0155
8	9	0.2399	0.0172
9	10	0.1520	-0.0236
import plotly.graph_objects as go

# Plot loss function
fig = go.Figure(data=[go.Scatter(x=loss_values['Epoch'], y=loss_values['Generator Loss'], name='Generator Loss'),
                      go.Scatter(x=loss_values['Epoch'], y=loss_values['Discriminator Loss'], name='Discriminator Loss')])


# Update the layout for best viewing
fig.update_layout(template='plotly_white',
                    legend_orientation="h",
                    legend=dict(x=0, y=1.1))

title = 'CTGAN loss function for T20 dataset - ' 
fig.update_layout(title=title, xaxis_title='Epoch', yaxis_title='Loss')
fig.show()

G. Qualitative evaluation of Synthetic data

a) Quality of continuous columns in synthetic data

KSComplement -This metric computes the similarity of a real column vs. a synthetic column in terms of the column shapes.The KSComplement uses the Kolmogorov-Smirnov statistic. Closer to 1.0 is good and 0 is worst

from sdmetrics.single_column import KSComplement
numerical_columns=['ballNum','ballsRemaining','runs','runRate','numWickets','runsMomentum','perfIndex']
total_score = 0
for column_name in numerical_columns:
    column_score = KSComplement.compute(df[column_name], synthetic[column_name])
    total_score += column_score
    print('Column:', column_name, ', Score: ', column_score)

print('\nAverage: ', total_score/len(numerical_columns))

Column: ballNum , Score:  0.9502754283367316
Column: ballsRemaining , Score:  0.8770284103276166
Column: runs , Score:  0.9136464248633367
Column: runRate , Score:  0.9183841670732166
Column: numWickets , Score:  0.9016209114638712
Column: runsMomentum , Score:  0.8773491702213716
Column: perfIndex , Score:  0.9173808852778924

Average:  0.9079550567948624

b) Quality of categorical columns

This statistic measures the quality of generated categorical columns. 1 is best and 0 is worst

categorical_columns=['batsmanIdx','bowlerIdx']
from sdmetrics.single_column import TVComplement

total_score = 0
for column_name in categorical_columns:
    column_score = TVComplement.compute(df[column_name], synthetic[column_name])
    total_score += column_score
    print('Column:', column_name, ', Score: ', column_score)

print('\nAverage: ', total_score/len(categorical_columns))

Column: batsmanIdx , Score:  0.8436263499539245
Column: bowlerIdx , Score:  0.7356177407921669

Average:  0.7896220453730457

The performance is decent but not excellent. I was unable to execute more epochs as it it required larger than the memory allowed

c) Correlation similarity

This metric measures the correlation between a pair of numerical columns and computes the similarity between the real and synthetic data – it compares the trends of 2D distributions. Best 1.0 and 0.0 is worst

import itertools
from sdmetrics.column_pairs import CorrelationSimilarity

total_score = 0
total_pairs = 0
for pair in itertools.combinations(numerical_columns,2):
    col_A, col_B = pair
    score = CorrelationSimilarity.compute(df[[col_A, col_B]], synthetic[[col_A, col_B]])
    print('Columns:', pair, ' Score:', score)
    total_score += score
    total_pairs += 1

print('\nAverage: ', total_score/total_pairs)

Columns: ('ballNum', 'ballsRemaining')  Score: 0.7153942317384889
Columns: ('ballNum', 'runs')  Score: 0.8838043045134777
Columns: ('ballNum', 'runRate')  Score: 0.8710243133637056
Columns: ('ballNum', 'numWickets')  Score: 0.7978515509750435
Columns: ('ballNum', 'runsMomentum')  Score: 0.8956281260834316
Columns: ('ballNum', 'perfIndex')  Score: 0.9275145840528048
Columns: ('ballsRemaining', 'runs')  Score: 0.9566928975064546
Columns: ('ballsRemaining', 'runRate')  Score: 0.9127313819127167
Columns: ('ballsRemaining', 'numWickets')  Score: 0.6770737279315224
Columns: ('ballsRemaining', 'runsMomentum')  Score: 0.7939260278412358
Columns: ('ballsRemaining', 'perfIndex')  Score: 0.8694582252638351
Columns: ('runs', 'runRate')  Score: 0.999593795992159
Columns: ('runs', 'numWickets')  Score: 0.9510731832916608
Columns: ('runs', 'runsMomentum')  Score: 0.9956131422133428
Columns: ('runs', 'perfIndex')  Score: 0.9742931845536701
Columns: ('runRate', 'numWickets')  Score: 0.8859830711832263
Columns: ('runRate', 'runsMomentum')  Score: 0.9174744874779561
Columns: ('runRate', 'perfIndex')  Score: 0.9491100087911353
Columns: ('numWickets', 'runsMomentum')  Score: 0.8989709776329797
Columns: ('numWickets', 'perfIndex')  Score: 0.7178946968801441
Columns: ('runsMomentum', 'perfIndex')  Score: 0.9744441623018661

Average:  0.8840738134048025

d) Category coverage

This metric measures whether a synthetic column covers all the possible categories that are present in a real column. 1.0 is best , 0 is worst

from sdmetrics.single_column import CategoryCoverage

total_score = 0
for column_name in categorical_columns:
    column_score = CategoryCoverage.compute(df[column_name], synthetic[column_name])
    total_score += column_score
    print('Column:', column_name, ', Score: ', column_score)

print('\nAverage: ', total_score/len(categorical_columns))

Column: batsmanIdx , Score:  0.9533951919021509
Column: bowlerIdx , Score:  0.9913966160022942

Average:  0.9723959039522225

H. Augmenting the T20 match data set

In this final part I augment my T20 match data set with the generated synthetic T20 data set.

import pandas as pd
from numpy import savetxt
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np


from keras.layers import Input, Embedding, Flatten, Dense, Reshape, Concatenate, Dropout
from keras.models import Model
import matplotlib.pyplot as plt

# Read real and synthetic data
df = pd.read_csv('/kaggle/input/cricket1/t20.csv')
synthetic=pd.read_csv('/kaggle/input/synthetic/synthetic.csv')

# Augment the data. Concatenate real & synthetic data
df1=pd.concat([df,synthetic])

# Create training and test samples
print("Shape of dataframe=",df1.shape)
train_dataset = df1.sample(frac=0.8,random_state=0)
test_dataset = df1.drop(train_dataset.index)
train_dataset1 = train_dataset[['batsmanIdx','bowlerIdx','ballNum','ballsRemaining','runs','runRate','numWickets','runsMomentum','perfIndex']]
test_dataset1 = test_dataset[['batsmanIdx','bowlerIdx','ballNum','ballsRemaining','runs','runRate','numWickets','runsMomentum','perfIndex']]
train_dataset1
train_labels = train_dataset.pop('isWinner')
test_labels = test_dataset.pop('isWinner')
print(train_dataset1.shape)

a=train_dataset1.describe()
stats=a.transpose
print(a)

a) Create A Deep Learning Model in Keras

from numpy.random import seed
seed(33)
tf.random.set_seed(432)
# create input layers for each of the predictors
batsmanIdx_input = Input(shape=(1,), name='batsmanIdx')
bowlerIdx_input = Input(shape=(1,), name='bowlerIdx')
ballNum_input = Input(shape=(1,), name='ballNum')
ballsRemaining_input = Input(shape=(1,), name='ballsRemaining')
runs_input = Input(shape=(1,), name='runs')
runRate_input = Input(shape=(1,), name='runRate')
numWickets_input = Input(shape=(1,), name='numWickets')
runsMomentum_input = Input(shape=(1,), name='runsMomentum')
perfIndex_input = Input(shape=(1,), name='perfIndex')

no_of_unique_batman=len(df1["batsmanIdx"].unique()) 
print(no_of_unique_batman)
no_of_unique_bowler=len(df1["bowlerIdx"].unique()) 
print(no_of_unique_bowler)

embedding_size_bat = no_of_unique_batman ** (1/4)
print(embedding_size_bat)
embedding_size_bwl = no_of_unique_bowler ** (1/4)
print(embedding_size_bwl)

# create embedding layer for the categorical predictor
batsmanIdx_embedding = Embedding(input_dim=no_of_unique_batman+1, output_dim=16,input_length=1)(batsmanIdx_input)
print(batsmanIdx_embedding)
batsmanIdx_flatten = Flatten()(batsmanIdx_embedding)
print(batsmanIdx_flatten)
bowlerIdx_embedding = Embedding(input_dim=no_of_unique_bowler+1, output_dim=16,input_length=1)(bowlerIdx_input)
bowlerIdx_flatten = Flatten()(bowlerIdx_embedding)
print(bowlerIdx_flatten)
# concatenate all the predictors
x = keras.layers.concatenate([batsmanIdx_flatten,bowlerIdx_flatten, ballNum_input, ballsRemaining_input, runs_input, runRate_input, numWickets_input, runsMomentum_input, perfIndex_input])
print(x.shape)
# add hidden layers
x = Dense(96, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(8, activation='relu')(x)
x = Dropout(0.1)(x)
# add output layer
output = Dense(1, activation='sigmoid', name='output')(x)
print(output.shape)
# create model
model = Model(inputs=[batsmanIdx_input,bowlerIdx_input, ballNum_input, ballsRemaining_input, runs_input, runRate_input, numWickets_input, runsMomentum_input, perfIndex_input], outputs=output)
model.summary()
# compile model
#optimizer=keras.optimizers.Adam(learning_rate=.01, beta_1=0.1, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)
#optimizer=keras.optimizers.RMSprop(learning_rate=0.001, rho=0.2, momentum=0.2, epsilon=1e-07)
#optimizer=keras.optimizers.SGD(learning_rate=.01,momentum=0.1) #- Works without dropout
#optimizer = tf.keras.optimizers.RMSprop(0.01)
#optimizer=keras.optimizers.SGD(learning_rate=.01,momentum=0.1)
#optimizer=keras.optimizers.RMSprop(learning_rate=.005, rho=0.1, momentum=0, epsilon=1e-07)

optimizer=keras.optimizers.Adam(learning_rate=.015, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=True)

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# train the model
history=model.fit([train_dataset1['batsmanIdx'],train_dataset1['bowlerIdx'],train_dataset1['ballNum'],train_dataset1['ballsRemaining'],train_dataset1['runs'],
           train_dataset1['runRate'],train_dataset1['numWickets'],train_dataset1['runsMomentum'],train_dataset1['perfIndex']], train_labels, epochs=20, batch_size=1024,
          validation_data = ([test_dataset1['batsmanIdx'],test_dataset1['bowlerIdx'],test_dataset1['ballNum'],test_dataset1['ballsRemaining'],test_dataset1['runs'],
           test_dataset1['runRate'],test_dataset1['numWickets'],test_dataset1['runsMomentum'],test_dataset1['perfIndex']],test_labels), verbose=1)

plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "test"], loc="upper left")
plt.show()

==================================================================================================
Total params: 144,497
Trainable params: 144,497
Non-trainable params: 0
__________________________________________________________________________________________________
Epoch 1/20
1219/1219 [==============================] - 15s 11ms/step - loss: 0.6285 - accuracy: 0.6372 - val_loss: 0.5164 - val_accuracy: 0.7606
Epoch 2/20
1219/1219 [==============================] - 14s 11ms/step - loss: 0.5594 - accuracy: 0.7121 - val_loss: 0.4920 - val_accuracy: 0.7663
Epoch 3/20
1219/1219 [==============================] - 14s 12ms/step - loss: 0.5338 - accuracy: 0.7244 - val_loss: 0.4541 - val_accuracy: 0.7878
Epoch 4/20
1219/1219 [==============================] - 14s 11ms/step - loss: 0.5176 - accuracy: 0.7317 - val_loss: 0.4226 - val_accuracy: 0.7933
Epoch 5/20
1219/1219 [==============================] - 13s 11ms/step - loss: 0.4966 - accuracy: 0.7420 - val_loss: 0.4547 - val_accuracy: 0.7
...
...
poch 18/20
1219/1219 [==============================] - 14s 11ms/step - loss: 0.4300 - accuracy: 0.7747 - val_loss: 0.3536 - val_accuracy: 0.8288
Epoch 19/20
1219/1219 [==============================] - 14s 12ms/step - loss: 0.4269 - accuracy: 0.7766 - val_loss: 0.3565 - val_accuracy: 0.8302
Epoch 20/20
1219/1219 [==============================] - 14s 11ms/step - loss: 0.4259 - accuracy: 0.7775 - val_loss: 0.3498 - val_accuracy: 0.831

As can be seen the accuracy with augmented dataset is around 0.77, while without it I was getting 0.867 with just the real data. This degradation is probably due to the folllowing reasons

  • Only a fraction of the dataset was used for training. This was not representative of the data distribution for CTGAN to correctly synthesise data
  • The number of epochs had to be kept low to prevent Kaggle/Colab from crashing

I. Conclusion

This post shows how we can generate synthetic T20 match data to augment real T20 match data. Assuming we have sufficient processing power we should be able to generate synthetic data for augmenting our data set. This should improve the accuracy of the Win Probabily Deep Learning model.

References

  1. Generative Adversarial Networks – Ian Goodfellow et al.
  2. Modeling Tabular data using Conditional GAN
  3. Introduction to GAN
  4. Ian Goodfellow: Generative Adversarial Networks (GANs) | Lex Fridman Podcast
  5. CTGAN
  6. Tabular Synthetic Data Generation using CTGAN
  7. CTGAN Model
  8. Interpreting the Progress of CTGAN
  9. CTGAN metrics

Also see

  1. Using embeddings, collaborative filtering with Deep Learning to analyse T20 players
  2. Using Reinforcement Learning to solve Gridworld
  3. Deep Learning from first principles in Python, R and Octave – Part 4
  4. Practical Machine Learning with R and Python – Part 5
  5. Cricketr adds team analytics to its repertoire!!!
  6. yorkpy takes a hat-trick, bowls out Intl. T20s, BBL and Natwest T20!!!
  7. Deconstructing Convolutional Neural Networks with Tensorflow and Keras
  8. My TEDx talk on the “Internet of Things”
  9. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
  10. The Anomaly

To see all posts click Index of posts

Big Data 6: The T20 Dance of Apache NiFi and yorkpy

“I don’t count my sit-ups. I only start counting once it starts hurting. ”

Muhammad Ali

“Hard work beats talent when talent doesn’t work hard.”

Tim Notke

In my previous post Big Data 5: kNiFI-ing through cricket data with Apache NiFi and yorkpy, I created a Big Data Pipeline that takes raw data in YAML format from a Cricsheet to processing and ranking IPL T20 players. In that post I had mentioned that we could create a similar pipeline to create a real time dashboard of IPL Analytics. I could have have done this but I needed to know how to create a Web UI. After digging and poking around, I have been able to create a simple Web UI running off Apache Web server. This UI uses basic JQuery and CSS to display a real time IPL T20 dashboard. As in my previous post, this is an end-2-end Big Data pipeline which can handle large data sets at scheduled times, process them and generate real time dashboards.

We could imagine an inter-galactic T20 championship league where T20 data comes in every hour or sooner and we need to perform analytics to see if us earthlings are any better than people with pointy heads  or little green men. The NiFi pipeline could be used as-is, however the yorkpy package would have to be rewritten in Pyspark. That is in another eon, though.

My package yorkpy has around ~45+ functions which fall in the following main categories

1. Pitching yorkpy . short of good length to IPL – Part 1 :Class 1: This includes functions that convert the yaml data of IPL matches into Pandas dataframe which are then saved as CSV. This part can perform analysis of individual IPL matches.
2. Pitching yorkpy.on the middle and outside off-stump to IPL – Part 2 :Class 2:This part includes functions to create a large data frame for head-to-head confrontation between any 2IPL teams says CSK-MI, DD-KKR etc, which can be saved as CSV. Analysis is then performed on these team-2-team confrontations.
3. Pitching yorkpy.swinging away from the leg stump to IPL – Part 3 Class 3:The 3rd part includes the performance of any IPL team against all other IPL teams. The data can also be saved as CSV.
4. Pitching yorkpy … in the block hole – Part 4 :Class 4: This part performs analysis of individual IPL batsmen and bowlers

 

Watch the live demo of the end-2-end NiFi pipeline at ‘The T20 Dance

You can download the NiFi template and associated code from Github at  T20 Dance

The Apache NiFi Pipeline is shown below

1. T20 Dance – Overall NiFi Pipeline

 

There are 5 process groups

2. ListAndConvertYaml2DataFrames

This post starts with having the YAML files downloaded and unpacked from Cricsheet.  The individual YAML files are converted into Pandas dataframes and saved as CSV. A concurrency of 12 is used to increase performance and process YAML files in parallel. The processor MergeContent creates a merged content to signal the completion of conversion and triggers the other Process Groups through a funnel.

 

3. Analyse individual IPL T20 matches

This Process Group ‘Analyse T20 matches’  used the yorkpy’s Class 1 functions which can perform analysis of individual IPL T20 matches. The matchWorm() and matchScorecard() functions are used, through any other function could have been used. The Process Group is shown below

 

4. Analyse performance of an IPL team in all matches against another IPL team

This Process Group ‘Analyse performance of IPL team in all matched against another IPL team‘ does analysis in all matches between any 2 IPL teams (Class 2) as shown below

5. Analyse performance of IPL team in all matches against all other IPL teams

This uses Class 3 functions. Individual data sets for each IPL team versus all other IPL teams is created before Class 3 yorkpy functions are invoked. This is included below

6. Analyse performances of IPL batsmen and bowlers

This Process Group uses Class 4 yorkpy functions. The match CSV files are processed to get batting and bowling details before calling the individual functions as shown below

 

7. IPL T20 Dashboard

The IPL T20 Dashboard is shown

 

Conclusion

This NiFI pipeline was done for IPL T20 however, it could be done for any T20 format like Intl T20, BBL, Natwest etc which are posted in Cricsheet. Also, only a subset of the yorkpy functions were used. There is a much wider variety of functions available.

Hope the T20 dance got your foot a-tapping!

 

You may also like
1. A primer on Qubits, Quantum gates and Quantum Operations
2.Computer Vision: Ramblings on derivatives, histograms and contours
3.Deep Learning from first principles in Python, R and Octave – Part 6
4.A Bluemix recipe with MongoDB and Node.js
5.Practical Machine Learning with R and Python – Part 4
6.Simulating the domino effect in Android using Box2D and AndEngine

To see all posts click Index of posts

Ranking T20 players in Intl T20, IPL, BBL and Natwest using yorkpy

There is a voice that doesn’t use words, listen.
When someone beats a rug, the blows are not against the rug, but against the dust in it.
I lost my hat while gazing at the moon, and then I lost my mind.
Rumi

Introduction

After a long hiatus, I am back to my big, bad, blogging ways! In this post I rank T20 players from several different leagues namely

  • International T20
  • Indian Premier League (IPL) T20
  • Big Bash League (BBL) T20
  • Natwest Blast (NTB) T20

I have added 8 new functions to my Python Package yorkpy, which will perform the ranking for the above 4 T20 League formats. To know more about my Python package see Pitching yorkpy . short of good length to IPL – Part 1, and the related posts on yorkpy. The code can be easily extended to other leagues which have a the same ‘yaml’ format for the matches. I also fixed some issues which started to crop up, possibly because a few things have changed in the new data.

The new functions are

  1. rankIntlT20Batting()
  2. rankIntlT20Batting()
  3. rankIPLT20Batting()
  4. rankIPLT20Batting
  5. rankBBLT20Batting()
  6. rankBBLT20Batting()
  7. rankNTBT20Batting()
  8. rankNTBT20Batting()

The yorkpy package uses data from Cricsheet

You can clone/fork the code for yorkpy at yorkpy

You can download the PDF of the post from Rank T20

yorkpy can be installed with ‘pip install yorkpy

1. International T20

The steps to do before ranking for International T20 matches are 1. Download International T20 zip file from Cricsheet Intl T20 2. Unzip the file. This will create a folder with yaml files

import yorkpy.analytics as yka
#yka.convertAllYaml2PandasDataframesT20("../t20s","../data")

This above step will convert the yaml files into CSV files. Now do the ranking as below

1a. Ranking of International T20 batsmen

import yorkpy.analytics as yka
intlT20RankBatting=yka.rankIntlT20Batting("C:\\software\\cricket-package\\yorkpyPkg\\data\\data")
intlT20RankBatting.head(15)
##                      matches  runs_mean     SR_mean
## batsman                                            
## V Kohli                   58  38.672414  125.212402
## KS Williamson             42  32.595238  122.884631
## Mohammad Shahzad          52  31.942308  118.212288
## CH Gayle                  50  31.140000  111.869984
## BB McCullum               69  29.492754  117.011666
## MM Lanning                48  28.812500   98.582663
## SJ Taylor                 44  28.659091   98.684856
## MJ Guptill                68  28.573529  117.673702
## DA Warner                 71  28.507042  121.142746
## DPMD Jayawardene          53  27.584906  107.787092
## KC Sangakkara             54  26.407407  106.039838
## JP Duminy                 68  26.294118  114.606717
## TM Dilshan                78  26.243590   97.910384
## RG Sharma                 65  25.907692  113.056548
## H Masakadza               53  25.566038   99.453880

1b. Ranking of International T20 bowlers

import yorkpy.analytics as yka
intlT20RankBowling=yka.rankIntlT20Bowling("C:\\software\\cricket-package\\yorkpyPkg\\data\\data")
intlT20RankBowling.head(15)
##                       matches  wicket_mean  econrate_mean
## bowler                                                   
## Umar Gul                   58     1.603448       7.637931
## SL Malinga                 78     1.500000       7.409188
## Saeed Ajmal                63     1.492063       6.451058
## DW Steyn                   46     1.478261       7.014855
## A Shrubsole                45     1.422222       6.294444
## M Morkel                   41     1.292683       7.680894
## KMDN Kulasekara            57     1.280702       7.476608
## TG Southee                 51     1.274510       8.759804
## SCJ Broad                  53     1.264151            inf
## Shakib Al Hasan            58     1.241379       6.836207
## R Ashwin                   44     1.204545       7.162879
## Nida Dar                   44     1.204545       6.083333
## KH Brunt                   44     1.204545       5.982955
## KD Mills                   42     1.166667       8.289683
## SR Watson                  46     1.152174       8.246377

2. Indian Premier League (IPL) T20

The steps to do before ranking for IPL T20 matches are 1. Download IPL T20 zip file from Cricsheet IPL T20 2. Unzip the file. This will create a folder with yaml files

import yorkpy.analytics as yka
#yka.convertAllYaml2PandasDataframesT20("../ipl","../ipldata")

This above step will convert the yaml files into CSV files in the /ipldata folder. Now do the ranking as below

2a. Ranking of batsmen in IPL T20

import yorkpy.analytics as yka
IPLT20RankBatting=yka.rankIPLT20Batting("C:\\software\\cricket-package\\yorkpyPkg\\data\\ipldata")
IPLT20RankBatting.head(15)
##                    matches  runs_mean     SR_mean
## batsman                                          
## DA Warner              129  37.589147  119.917864
## CH Gayle               123  36.723577  125.256818
## SE Marsh                70  36.314286  114.707578
## KL Rahul                59  33.542373  123.424971
## MEK Hussey              60  33.400000  100.439187
## V Kohli                174  32.413793  115.830849
## KS Williamson           42  31.690476  120.443172
## AB de Villiers         143  30.923077  128.967081
## JC Buttler              45  30.800000  132.561154
## AM Rahane              118  30.330508  102.240398
## SR Tendulkar            79  29.949367  101.651959
## F du Plessis            65  29.415385  112.462114
## Q de Kock               51  29.333333  110.973836
## SS Iyer                 47  29.170213  102.144222
## G Gambhir              155  28.741935  103.997558

2b. Ranking of bowlers in IPL T20

import yorkpy.analytics as yka
IPLT20RankBowling=yka.rankIPLT20Bowling("C:\\software\\cricket-package\\yorkpyPkg\\data\\ipldata")
IPLT20RankBowling.head(15)
##                      matches  wicket_mean  econrate_mean
## bowler                                                  
## SL Malinga               122     1.540984       7.173361
## Imran Tahir               43     1.465116       8.155039
## A Nehra                   88     1.375000       7.923295
## MJ McClenaghan            56     1.339286       8.638393
## Rashid Khan               46     1.304348       6.543478
## Sandeep Sharma            79     1.303797       7.860759
## MM Patel                  63     1.301587       7.530423
## DJ Bravo                 131     1.282443       8.458333
## M Morkel                  70     1.257143       7.760714
## SP Narine                109     1.256881       6.747706
## YS Chahal                 83     1.228916       8.103659
## R Vinay Kumar            104     1.221154       8.556090
## RP Singh                  82     1.219512       8.149390
## CH Morris                 52     1.211538       7.854167
## B Kumar                  117     1.205128       7.536325

3. Natwest T20

The steps to do before ranking for Natwest T20 matches are 1. Download Natwest T20 zip file from Cricsheet NTB T20 2. Unzip the file. This will create a folder with yaml files

import yorkpy.analytics as yka
#yka.convertAllYaml2PandasDataframesT20("../ntb","../ntbdata")

This above step will convert the yaml files into CSV files in the /ntbdata folder. Now do the ranking as below

3a. Ranking of NTB batsmen

import yorkpy.analytics as yka
NTBT20RankBatting=yka.rankNTBT20Batting("C:\\software\\cricket-package\\yorkpyPkg\\data\\ntbdata")
NTBT20RankBatting.head(15)
##                      matches  runs_mean     SR_mean
## batsman                                            
## Babar Azam                13  44.461538  121.268809
## T Banton                  13  42.230769  139.376274
## JJ Roy                    12  41.250000  142.182147
## DJM Short                 12  40.250000  131.182294
## AN Petersen               12  37.916667  132.522727
## IR Bell                   13  37.615385  130.104721
## M Klinger                 26  35.346154  112.682922
## EJG Morgan                16  35.062500  129.817650
## AJ Finch                  19  34.578947  137.093465
## MH Wessels                26  33.884615  116.300969
## S Steel                   11  33.545455  140.118207
## DJ Bell-Drummond          21  33.142857  108.566309
## Ashar Zaidi               11  33.000000  178.553331
## DJ Malan                  26  33.000000  120.127202
## T Kohler-Cadmore          23  32.956522  112.493019

3b. Ranking of NTB bowlers

import yorkpy.analytics as yka
NTBT20RankBowling=yka.rankNTBT20Bowling("C:\\software\\cricket-package\\yorkpyPkg\\data\\ntbdata")
NTBT20RankBowling.head(15)
##                        matches  wicket_mean  econrate_mean
## bowler                                                    
## MW Parkinson                11     2.000000       7.628788
## HF Gurney                   23     1.956522       8.831884
## GR Napier                   12     1.916667       8.694444
## R Rampaul                   19     1.736842       7.131579
## P Coughlin                  11     1.727273       8.909091
## AJ Tye                      26     1.692308       8.227564
## GC Viljoen                  12     1.666667       7.708333
## BAC Howell                  21     1.666667       6.857143
## BW Sanderson                12     1.583333       7.902778
## KJ Abbott                   14     1.571429       9.398810
## JE Taylor                   13     1.538462       9.839744
## JDS Neesham                 12     1.500000      10.812500
## MJ Potts                    12     1.500000       8.486111
## TT Bresnan                  21     1.476190       8.817460
## T van der Gugten            13     1.461538       7.211538

4. Big Bash Leagure (BBL) T20

The steps to do before ranking for BBL T20 matches are 1. Download BBL T20 zip file from Cricsheet BBL T20 2. Unzip the file. This will create a folder with yaml files

import yorkpy.analytics as yka
#yka.convertAllYaml2PandasDataframesT20("../bbl","../bbldata")

This above step will convert the yaml files into CSV files in the /bbldata folder. Now do the ranking as below

4a. Ranking of BBL batsmen

import yorkpy.analytics as yka
BBLT20RankBatting=yka.rankBBLT20Batting("C:\\software\\cricket-package\\yorkpyPkg\\data\\bbldata")
BBLT20RankBatting.head(15)
##                 matches  runs_mean     SR_mean
## batsman                                       
## DJM Short            43  40.883721  118.773047
## SE Marsh             47  39.148936  113.616053
## AJ Finch             62  36.306452  120.271231
## AT Carey             37  34.945946  120.125341
## UT Khawaja           41  31.268293  107.355655
## CA Lynn              74  31.162162  121.746578
## MS Wade              46  30.782609  120.310081
## TM Head              45  30.000000  126.769564
## MEK Hussey           23  29.173913  109.492934
## BJ Hodge             29  29.000000  124.438040
## BR Dunk              39  28.230769  106.149913
## AD Hales             31  27.161290  117.678008
## BB McCullum          34  27.058824  115.486392
## GJ Bailey            57  27.000000  121.159220
## MR Marsh             47  26.510638  114.994909

4b. Ranking of BBL bowlers

import yorkpy.analytics as yka
BBLT20RankBowling=yka.rankBBLT20Bowling("C:\\software\\cricket-package\\yorkpyPkg\\data\\bbldata")
BBLT20RankBowling.head(15)
##                    matches  wicket_mean  econrate_mean
## bowler                                                
## Yasir Arafat            15     2.000000       7.587778
## CH Morris               15     1.733333       8.572222
## TK Curran               27     1.629630       8.716049
## TT Bresnan              13     1.615385       8.775641
## JR Hazlewood            18     1.555556       7.361111
## CJ McKay                15     1.533333       8.555556
## DR Sams                 36     1.527778       8.581019
## AC McDermott            14     1.500000       9.166667
## JP Faulkner             20     1.500000       8.345833
## SP Narine               12     1.500000       7.395833
## AJ Tye                  51     1.490196       8.101307
## M Kelly                 21     1.476190       8.908730
## SA Abbott               73     1.438356       8.737443
## B Laughlin              82     1.426829       8.332317
## SW Tait                 31     1.419355       8.895161

Conclusion

You should be able to now rank players in the above formats as new data is added to Cricsheet. yorkpy can also be used for other leagues which follow the Cricsheet format.

Also see
1. Deep Learning from first principles in Python, R and Octave – Part 5
2. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
3. Using Reinforcement Learning to solve Gridworld
4. Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR
5. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
6. Deblurring with OpenCV: Weiner filter reloaded
7. Rock N’ Roll with Bluemix, Cloudant & NodeExpress
8. Modeling a Car in Android

To see all posts click Index of posts

Cricpy performs granular analysis of players

“Gold medals aren’t really made of gold. They’re made of sweat, determination, & a hard-to-find alloy called guts.” Dan Gable

“It doesn’t matter whether you are pursuing success in business, sports, the arts, or life in general: The bridge between wishing and accomplishing is discipline” Harvey Mackay

“I won’t predict anything historic. But nothing is impossible.” Michael Phelps

Introduction

In this post, I introduce 2 new functions in my Python package ‘cricpy’ (cricpy v0.20) see Introducing cricpy:A python package to analyze performances of cricketers which enable granular analysis of batsmen and bowlers. They are

  1. Step 1: getPlayerDataHA – This function is a wrapper around getPlayerData(), getPlayerDataOD() and getPlayerDataTT(), and adds an extra column ‘homeOrAway’ which says whether the match was played at home/away/neutral venues. A CSV file is created with this new column.
  2. Step 2: getPlayerDataOppnHA – This function allows you to slice & dice the data for batsmen and bowlers against specific oppositions, at home/away/neutral venues and between certain periods. This reducedsubset of data can be used to perform analyses. A CSV file is created as an output based on the parameters of opposition, home or away and the interval of time

Note All the existing cricpy functions can be used on this smaller fine-grained data set for a closer analysis of players

This post has been published in Rpubs and can be accessed at Cricpy performs granular analysis of players

You can download a PDF version of this post at Cricpy performs granular analysis of players

I have also updated the cricpy template with these lastest changes. See cricpy-template

You can download the latest PDF version of the book  at  ‘Cricket analytics with cricketr and cricpy: Analytics harmony with R and Python-6th edition

1. Analyzing Rahul Dravid at 3 different stages of his career

The following functions analyze Rahul Dravid during 3 different periods of his illustrious career. a) 1st Jan 2001-1st Jan 2002 b) 1st Jan 2004-1st Jan 2005 c) 1st Jan 2009-1st Jan 2010

import cricpy.analytics as ca
# Get the homeOrAway dataset for Dravid in matches
# Note:Since I have already got the data I reuse the CSV file
#df=ca.getPlayerDataHA(28114,tfile="dravidTestHA.csv",matchType="Test")

# Get Dravid's data for 2001-02
df1=ca.getPlayerDataOppnHA(infile="dravidTestHA.csv",outfile="dravidTest2001.csv",startDate="2001-01-01",endDate="2002-01-01")

# Get Dravid's data for 2004-05
df2=ca.getPlayerDataOppnHA(infile="dravidTestHA.csv",outfile="dravidTest2004.csv", startDate="2004-01-01",endDate="2005-01-01")

# Get Dravid's data for 2009-10
df3=ca.getPlayerDataOppnHA(infile="dravidTestHA.csv",outfile="dravidTest2009.csv",startDate="2009-01-01",endDate="2010-01-01")

1a. Plot the performance of Dravid at venues during 2001,2004,2009

Note: Any of the cricpy functions can be used on the fine-grained subset of data as below.

import cricpy.analytics as ca
ca.batsmanAvgRunsGround("dravidTest2001.csv","Dravid-2001")

ca.batsmanAvgRunsGround("dravidTest2004.csv","Dravid-2004")

ca.batsmanAvgRunsGround("dravidTest2009.csv","Dravid-2009")


1b. Plot the performance of Dravid against different oppositions during 2001,2004,2009

import cricpy.analytics as ca
ca.batsmanAvgRunsOpposition("dravidTest2001.csv","Dravid-2001")

ca.batsmanAvgRunsOpposition("dravidTest2004.csv","Dravid-2004")


ca.batsmanAvgRunsOpposition("dravidTest2009.csv","Dravid-2009")


1c. Plot the relative cumulative average and relative strike rate of Dravid in 2001,2004,2009

The plot below compares Dravid’s cumulative strike rate and cumulative average during 3 different stages of his career

import cricpy.analytics as ca
frames=["dravidTest2001.csv","dravidTest2004.csv","dravidTest2009.csv"]
names=["Dravid-2001","Dravid-2004","Dravid-2009"]
ca.relativeBatsmanCumulativeAvgRuns(frames,names)

ca.relativeBatsmanCumulativeStrikeRate(frames,names)

2. Analyzing Virat Kohli’s performance against England in England in 2014 and 2018

The analysis below looks at Kohli’s performance against England in ‘away’ venues (England) in 2014 and 2018

import cricpy.analytics as ca
# Get the homeOrAway data for Kohli in Test matches
#df=ca.getPlayerDataHA(253802,tfile="kohliTestHA.csv",type="batting",matchType="Test")

# Get the homeOrAway data for Kohli in Test matches
df=ca.getPlayerDataHA(253802,tfile="kohliTestHA.csv",type="batting",matchType="Test")

# Get the subset if data of Kohli's performance against England in England in 2014
df=ca.getPlayerDataOppnHA(infile="kohliTestHA.csv",outfile="kohliTestEng2014.csv",  opposition=["England"],homeOrAway=["away"],startDate="2014-01-01",endDate="2015-01-01")

# Get the subset if data of Kohli's performance against England in England in 2018
df1=ca.getPlayerDataOppnHA(infile="kohliTestHA.csv",outfile="kohliTestEng2018.csv",
   opposition=["England"],homeOrAway=["away"],startDate="2018-01-01",endDate="2019-01-01")

2a. Kohli’s performance at England grounds in 2014 & 2018

Kohli had a miserable outing to England in 2014 with a string of low scores. In 2018 Kohli pulls himself out of the morass

import cricpy.analytics as ca
ca.batsmanAvgRunsGround("kohliTestEng2014.csv","Kohli-Eng-2014")
ca.batsmanAvgRunsGround("kohliTestEng2018.csv","Kohli-Eng-2018")


2a. Kohli’s cumulative average runs in 2014 & 2018

Kohli’s cumulative average runs in 2014 is in the low 15s, while in 2018 it is 70+. Kohli stamps his class back again and undoes the bad memories of 2014

import cricpy.analytics as ca
ca.batsmanCumulativeAverageRuns("kohliTestEng2014.csv", "Kohli-Eng-2014")

ca.batsmanCumulativeAverageRuns("kohliTestEng2018.csv", "Kohli-Eng-2018")

3a. Compare the performances of Ganguly, Dravid and VVS Laxman against opposition in ‘away’ matches in Tests

The analyses below compares the performances of Sourav Ganguly, Rahul Dravid and VVS Laxman against Australia, South Africa, and England in ‘away’ venues between 01 Jan 2002 to 01 Jan 2008

import cricpy.analytics as ca
#Get the HA data for Ganguly, Dravid and Laxman
#df=ca.getPlayerDataHA(28779,tfile="gangulyTestHA.csv",type="batting",matchType="Test")
#df=ca.getPlayerDataHA(28114,tfile="dravidTestHA.csv",type="batting",matchType="Test")
#df=ca.getPlayerDataHA(30750,tfile="laxmanTestHA.csv",type="batting",matchType="Test")

# Slice the data 
df=ca.getPlayerDataOppnHA(infile="gangulyTestHA.csv",outfile="gangulyTestAES2002-08.csv" ,opposition=["Australia", "England", "South Africa"],                        homeOrAway=["away"],startDate="2002-01-01",endDate="2008-01-01")
df=ca.getPlayerDataOppnHA(infile="dravidTestHA.csv",outfile="dravidTestAES2002-08.csv" ,opposition=["Australia", "England", "South Africa"],                        homeOrAway=["away"],startDate="2002-01-01",endDate="2008-01-01")
df=ca.getPlayerDataOppnHA(infile="laxmanTestHA.csv",outfile="laxmanTestAES2002-08.csv",opposition=["Australia", "England", "South Africa"],                       homeOrAway=["away"],startDate="2002-01-01",endDate="2008-01-01")

3b Plot the relative cumulative average runs and relative cumative strike rate

Plot the relative cumulative average runs and relative cumative strike rate of Ganguly, Dravid and Laxman

-Dravid towers over Laxman and Ganguly with respect to cumulative average runs. – Ganguly has a superior strike rate followed by Laxman and then Dravid

import cricpy.analytics as ca
frames=["gangulyTestAES2002-08.csv","dravidTestAES2002-08.csv","laxmanTestAES2002-08.csv"]
names=["GangulyAusEngSA2002-08","DravidAusEngSA2002-08","LaxmanAusEngSA2002-08"]
ca.relativeBatsmanCumulativeAvgRuns(frames,names)

ca.relativeBatsmanCumulativeStrikeRate(frames,names)

4. Compare the ODI performances of Rohit Sharma, Joe Root and Kane Williamson against opposition

Compare the performances of Rohit Sharma, Joe Root and Kane williamson in away & neutral venues against Australia, West Indies and Soouth Africa

  • Joe Root piles us the runs in about 15 matches. Rohit has played far more ODIs than the other two and averages a steady 35+
import cricpy.analytics as ca
# Get the ODI HA data for Rohit, Root and Williamson
#df=ca.getPlayerDataHA(34102,tfile="rohitODIHA.csv",type="batting",matchType="ODI")
#df=ca.getPlayerDataHA(303669,tfile="joerootODIHA.csv",type="batting",matchType="ODI")
#df=ca.getPlayerDataHA(277906,tfile="williamsonODIHA.csv",type="batting",matchType="ODI")

# Subset the data for specific opposition in away and neutral venues
## C:\Users\Ganesh\ANACON~1\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
##   from pandas.core import datetools
df=ca.getPlayerDataOppnHA(infile="rohitODIHA.csv",outfile="rohitODIAusWISA.csv"
                       ,opposition=["Australia", "West Indies", "South Africa"],
                      homeOrAway=["away","neutral"])
df=ca.getPlayerDataOppnHA(infile="joerootODIHA.csv",outfile="joerootODIAusWISA.csv"
                       ,opposition=["Australia", "West Indies", "South Africa"],
                       homeOrAway=["away","neutral"])
df=ca.getPlayerDataOppnHA(infile="williamsonODIHA.csv",outfile="williamsonODIAusWiSA.csv",opposition=["Australia", "West Indies", "South Africa"],                    homeOrAway=["away","neutral"])

4a. Compare cumulative strike rates and cumulative average runs of Rohit, Root and Williamson

The relative cumulative strike rate of all 3 are comparable

import cricpy.analytics as ca
frames=["rohitODIAusWISA.csv","joerootODIAusWISA.csv","williamsonODIAusWiSA.csv"]
names=["Rohit-ODI-AusWISA","Joe Root-ODI-AusWISA","Williamson-ODI-AusWISA"]
ca.relativeBatsmanCumulativeAvgRuns(frames,names)

ca.relativeBatsmanCumulativeStrikeRate(frames,names)

5. Plot the performance of Dhoni in T20s against specific opposition at all venues

Plot the performances of Dhoni against Australia, West Indies, South Africa and England

import cricpy.analytics as ca
# Get the HA T20 data for Dhoni
#df=ca.getPlayerDataHA(28081,tfile="dhoniT20HA.csv",type="batting",matchType="T20")
#Subset the data
df=ca.getPlayerDataOppnHA(infile="dhoniT20HA.csv",outfile="dhoniT20AusWISAEng.csv",opposition=["Australia", "West Indies", "South Africa","England"],                homeOrAway=["all"])

5a. Plot Dhoni’s performances in T20

Note You can use any of cricpy’s functions against the fine grained data

import cricpy.analytics as ca
ca.batsmanAvgRunsOpposition("dhoniT20AusWISAEng.csv","Dhoni")

ca.batsmanAvgRunsGround("dhoniT20AusWISAEng.csv","Dhoni")

ca.batsmanCumulativeStrikeRate("dhoniT20AusWISAEng.csv","Dhoni")

ca.batsmanCumulativeAverageRuns("dhoniT20AusWISAEng.csv","Dhoni")

6. Compute and performances of Anil Kumble, Muralitharan and Warne in ‘away’ test matches

Compute the performances of Kumble, Warne and Maralitharan against New Zealand, West Indies, South Africa and England in pitches that are not ‘home’ pithes

import cricpy.analytics as ca
# Get the bowling data for Kumble, Warne and Muralitharan in Test matches
#df=ca.getPlayerDataHA(30176,tfile="kumbleTestHA.csv",type="bowling",matchType="Test")
#df=ca.getPlayerDataHA(8166,tfile="warneTestHA.csv",type="bowling",matchType="Test")
#df=ca.getPlayerDataHA(49636,tfile="muraliTestHA.csv",type="bowling",matchType="Test")

# Subset the data
df=ca.getPlayerDataOppnHA(infile="kumbleTestHA.csv",outfile="kumbleTest-NZWISAEng.csv",opposition=["New Zealand", "West Indies", "South Africa","England"],
                       homeOrAway=["away"])

df=ca.getPlayerDataOppnHA(infile="warneTestHA.csv",outfile="warneTest-NZWISAEng.csv"
                       ,opposition=["New Zealand", "West Indies", "South Africa","England"], homeOrAway=["away"])

df=ca.getPlayerDataOppnHA(infile="muraliTestHA.csv",outfile="muraliTest-NZWISAEng.csv"
                       ,opposition=["New Zealand", "West Indies", "South Africa","England"], homeOrAway=["away"])

6a. Plot the average wickets of Kumble, Warne and Murali

import cricpy.analytics as ca
ca.bowlerAvgWktsOpposition("kumbleTest-NZWISAEng.csv","Kumble-NZWISAEng-AN")

ca.bowlerAvgWktsOpposition("warneTest-NZWISAEng.csv","Warne-NZWISAEng-AN")

ca.bowlerAvgWktsOpposition("muraliTest-NZWISAEng.csv","Murali-NZWISAEng-AN")

6b. Plot the average wickets in different grounds of Kumble, Warne and Murali

import cricpy.analytics as ca
ca.bowlerAvgWktsGround("kumbleTest-NZWISAEng.csv","Kumble")

ca.bowlerAvgWktsGround("warneTest-NZWISAEng.csv","Warne")

ca.bowlerAvgWktsGround("muraliTest-NZWISAEng.csv","Murali")

6c. Plot the cumulative average wickets and cumulative economy rate of Kumble, Warne and Murali

  • Murali has the best economy rate followed by Kumble and then Warne
  • Again Murali has the best cumulative average wickets followed by Warne and then Kumble
import cricpy.analytics as ca
frames=["kumbleTest-NZWISAEng.csv","warneTest-NZWISAEng.csv","muraliTest-NZWISAEng.csv"]
names=["Kumble","Warne","Murali"]
ca.relativeBowlerCumulativeAvgEconRate(frames,names)

ca.relativeBowlerCumulativeAvgWickets(frames,names)

7. Compute and plot the performances of Bumrah in 2016, 2017 and 2018 in ODIs

import cricpy.analytics as ca
# Get the HA data for Bumrah in ODI in bowling
#df=ca.getPlayerDataHA(625383,tfile="bumrahODIHA.csv",type="bowling",matchType="ODI")

# Slice the data for periods 2016, 2017 and 2018
df=ca.getPlayerDataOppnHA(infile="bumrahODIHA.csv",outfile="bumrahODI2016.csv",
                       startDate="2016-01-01",endDate="2017-01-01")

df=ca.getPlayerDataOppnHA(infile="bumrahODIHA.csv",outfile="bumrahODI2017.csv",
                       startDate="2017-01-01",endDate="2018-01-01")

df=ca.getPlayerDataOppnHA(infile="bumrahODIHA.csv",outfile="bumrahODI2018.csv",
                       startDate="2018-01-01",endDate="2019-01-01")

7a. Compute the performances of Bumrah in 2016, 2017 and 2018

  • Very clearly Bumrah is getting better at his art. His economy rate in 2018 is the best!!!
  • Bumrah has had a very prolific year in 2017. However all the years he seems to be quite effective
import cricpy.analytics as ca
frames=["bumrahODI2016.csv","bumrahODI2017.csv","bumrahODI2018.csv"]
names=["Bumrah-2016","Bumrah-2017","Bumrah-2018"]
ca.relativeBowlerCumulativeAvgEconRate(frames,names)

ca.relativeBowlerCumulativeAvgWickets(frames,names)

8. Compute and plot the performances of Shakib, Bumrah and Jadeja in T20 matches for bowling

import cricpy.analytics as ca
# Get the HA bowling data for Shakib, Bumrah and Jadeja
#df=ca.getPlayerDataHA(56143,tfile="shakibT20HA.csv",type="bowling",matchType="T20")
#df=ca.getPlayerDataHA(625383,tfile="bumrahT20HA.csv",type="bowling",matchType="T20")
#df=ca.getPlayerDataHA(234675,tfile="jadejaT20HA.csv",type="bowling",matchType="T20")

# Slice the data for performances against Sri Lanka, Australia, South Africa and England
df=ca.getPlayerDataOppnHA(infile="shakibT20HA.csv",outfile="shakibT20-SLAusSAEng.csv" ,opposition=["Sri Lanka","Australia", "South Africa","England"],
                       homeOrAway=["all"])
df=ca.getPlayerDataOppnHA(infile="bumrahT20HA.csv",outfile="bumrahT20-SLAusSAEng.csv",opposition=["Sri Lanka","Australia", "South Africa","England"],
                       homeOrAway=["all"])

df=ca.getPlayerDataOppnHA(infile="jadejaT20HA.csv",outfile="jadejaT20-SLAusSAEng.csv"                      ,opposition=["Sri Lanka","Australia", "South Africa","England"],   homeOrAway=["all"])

8a. Compare the relative performances of Shakib, Bumrah and Jadeja

  • Jadeja and Bumrah have comparable economy rates. Shakib is more expensive
  • Shakib pips Bumrah in number of cumulative wickets, though Bumrah is close behind
import cricpy.analytics as ca
frames=["shakibT20-SLAusSAEng.csv","bumrahT20-SLAusSAEng.csv","jadejaT20-SLAusSAEng.csv"]
names=["Shakib-SLAusSAEng","Bumrah-SLAusSAEng","Jadeja-SLAusSAEng"]
ca.relativeBowlerCumulativeAvgEconRate(frames,names)

ca.relativeBowlerCumulativeAvgWickets(frames,names)

Conclusion

By getting the homeOrAway data for players using the profileNo, you can slice and dice the data based on your choice of opposition, whether you want matches that were played at home/away/neutral venues. Finally by specifying the period for which the data has to be subsetted you can create fine grained analysis.

Hope you have a great time with cricpy!!!

Also see
1. My book ‘Cricket analytics with cricketr and cricpy’ is now on Amazon
2. The 3rd paperback & kindle editions of my books on Cricket, now on Amazon
3. Exploring Quantum Gate operations with QCSimulator
4. Deep Learning from first principles in Python, R and Octave – Part 6
5. Natural selection of database technology through the years
6. Pitching yorkpy … short of good length to IPL – Part 1
7. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
8. Practical Machine Learning with R and Python – Part 3

To see all posts click Index of posts

Analyze cricket players and cricket teams with cricpy template

Introduction

This post shows how you can analyze batsmen, bowlers see Introducing cricpy:A python package to analyze performances of cricketers and cricket teams see Cricpy adds team analytics to its arsenal! in Test, ODI and T20s using cricpy templates, with data from ESPN Cricinfo.

The cricpy package

A. Analyzing batsmen and bowlers in Test, ODI and T20s

The data for a particular player can be obtained with the getPlayerData() function. To do you will need to go to ESPN CricInfo Player and type in the name of the player for e.g Rahul Dravid, Virat Kohli, Alastair Cook etc. This will bring up a page which have the profile number for the player e.g. for Rahul Dravid this would be http://www.espncricinfo.com/india/content/player/28114.html. Hence, Dravid’s profile is 28114. This can be used to get the data for Rahul Dravid as shown below

and select the player you want Please mindful of the ESPN Cricinfo Terms of Use

My posts on Cripy were

  1. Introducing cricpy:A python package to analyze performances of cricketers
  2. Cricpy takes a swing at the ODIs
  3. Cricpy takes guard for the Twenty20s

You can clone/download this cricpy template for your own analysis of players. This can be done using RStudio or IPython notebooks

The cricpy package is now available with pip install cricpy!!!

You can download the latest PDF version of the book  at  ‘Cricket analytics with cricketr and cricpy: Analytics harmony with R and Python-6th edition

1 Importing cricpy – Python

# Install the package
# Do a pip install cricpy
# Import cricpy
import cricpy.analytics as ca 
## C:\Users\Ganesh\ANACON~1\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
##   from pandas.core import datetools

2. Invoking functions with Python package cricpy

import cricpy.analytics as ca 
#ca.batsman4s("aplayer.csv","A Player")

3. Getting help from cricpy – Python

import cricpy.analytics as ca
#help(ca.getPlayerData)

The details below will introduce the different functions that are available in cricpy.

4. Get the player data for a player using the function getPlayerData()

Important Note This needs to be done only once for a player. This function stores the player’s data in the specified CSV file (for e.g. dravid.csv as above) which can then be reused for all other functions). Once we have the data for the players many analyses can be done. This post will use the stored CSV file obtained with a prior getPlayerData for all subsequent analyses

4a. For Test players

import cricpy.analytics as ca
#player1 =ca.getPlayerData(profileNo1,dir="..",file="player1.csv",type="batting",homeOrAway=[1,2], result=[1,2,4])
#player1 =ca.getPlayerData(profileNo2,dir="..",file="player2.csv",type="batting",homeOrAway=[1,2], result=[1,2,4])

4b. For ODI players

import cricpy.analytics as ca
#player1 =ca.getPlayerDataOD(profileNo1,dir="..",file="player1.csv",type="batting")
#player1 =ca.getPlayerDataOD(profileNo2,dir="..",file="player2.csv",type="batting"")

4c For T20 players

import cricpy.analytics as ca
#player1 =ca.getPlayerDataTT(profileNo1,dir="..",file="player1.csv",type="batting")
#player1 =ca.getPlayerDataTT(profileNo2,dir="..",file="player2.csv",type="batting"")

5 A Player’s performance – Basic Analyses

The 3 plots below provide the following for Rahul Dravid

  1. Frequency percentage of runs in each run range over the whole career
  2. Mean Strike Rate for runs scored in the given range
  3. A histogram of runs frequency percentages in runs ranges

import cricpy.analytics as ca
import matplotlib.pyplot as plt


#ca.batsmanRunsFreqPerf("aplayer.csv","A Player")
#ca.batsmanMeanStrikeRate("aplayer.csv","A Player")
#ca.batsmanRunsRanges("aplayer.csv","A Player") 

6. More analyses

This gives details on the batsmen’s 4s, 6s and dismissals

import cricpy.analytics as ca

#ca.batsman4s("aplayer.csv","A Player")
#ca.batsman6s("aplayer.csv","A Player") 
#ca.batsmanDismissals("aplayer.csv","A Player")

# The below function is for ODI and T20 only
#ca.batsmanScoringRateODTT("./kohli.csv","Virat Kohli")  

7. 3D scatter plot and prediction plane

The plots below show the 3D scatter plot of Runs versus Balls Faced and Minutes at crease. A linear regression plane is then fitted between Runs and Balls Faced + Minutes at crease

import cricpy.analytics as ca
#ca.battingPerf3d("aplayer.csv","A Player")

8. Average runs at different venues

The plot below gives the average runs scored at different grounds. The plot also the number of innings at each ground as a label at x-axis.

import cricpy.analytics as ca
#ca.batsmanAvgRunsGround("aplayer.csv","A Player")

9. Average runs against different opposing teams

This plot computes the average runs scored against different countries.

import cricpy.analytics as ca

#ca.batsmanAvgRunsOpposition("aplayer.csv","A Player")

10. Highest Runs Likelihood

The plot below shows the Runs Likelihood for a batsman.

import cricpy.analytics as ca

#ca.batsmanRunsLikelihood("aplayer.csv","A Player")

11. A look at the Top 4 batsman

Choose any number of players

1.Player1 2.Player2 3.Player3 …

The following plots take a closer at their performances. The box plots show the median the 1st and 3rd quartile of the runs

12. Box Histogram Plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency

import cricpy.analytics as ca

#ca.batsmanPerfBoxHist("aplayer001.csv","A Player001")
#ca.batsmanPerfBoxHist("aplayer002.csv","A Player002")
#ca.batsmanPerfBoxHist("aplayer003.csv","A Player003")
#ca.batsmanPerfBoxHist("aplayer004.csv","A Player004")

13. get Player Data special

import cricpy.analytics as ca

#player1sp = ca.getPlayerDataSp(profile1,tdir=".",tfile="player1sp.csv",ttype="batting")
#player2sp = ca.getPlayerDataSp(profile2,tdir=".",tfile="player2sp.csv",ttype="batting")
#player3sp = ca.getPlayerDataSp(profile3,tdir=".",tfile="player3sp.csv",ttype="batting")
#player4sp = ca.getPlayerDataSp(profile4,tdir=".",tfile="player4sp.csv",ttype="batting")

14. Contribution to won and lost matches

Note:This can only be used for Test matches

import cricpy.analytics as ca

#ca.batsmanContributionWonLost("player1sp.csv","A Player001")
#ca.batsmanContributionWonLost("player2sp.csv","A Player002")
#ca.batsmanContributionWonLost("player3sp.csv","A Player003")
#ca.batsmanContributionWonLost("player4sp.csv","A Player004")

15. Performance at home and overseas

Note:This can only be used for Test matches This function also requires the use of getPlayerDataSp() as shown above

import cricpy.analytics as ca
#ca.batsmanPerfHomeAway("player1sp.csv","A Player001")
#ca.batsmanPerfHomeAway("player2sp.csv","A Player002")
#ca.batsmanPerfHomeAway("player3sp.csv","A Player003")
#ca.batsmanPerfHomeAway("player4sp.csv","A Player004")

16 Moving Average of runs in career

import cricpy.analytics as ca

#ca.batsmanMovingAverage("aplayer001.csv","A Player001")
#ca.batsmanMovingAverage("aplayer002.csv","A Player002")
#ca.batsmanMovingAverage("aplayer003.csv","A Player003")
#ca.batsmanMovingAverage("aplayer004.csv","A Player004")

17 Cumulative Average runs of batsman in career

This function provides the cumulative average runs of the batsman over the career.

import cricpy.analytics as ca

#ca.batsmanCumulativeAverageRuns("aplayer001.csv","A Player001")
#ca.batsmanCumulativeAverageRuns("aplayer002.csv","A Player002")
#ca.batsmanCumulativeAverageRuns("aplayer003.csv","A Player003")
#ca.batsmanCumulativeAverageRuns("aplayer004.csv","A Player004")

18 Cumulative Average strike rate of batsman in career

.

import cricpy.analytics as ca
#ca.batsmanCumulativeStrikeRate("aplayer001.csv","A Player001")
#ca.batsmanCumulativeStrikeRate("aplayer002.csv","A Player002")
#ca.batsmanCumulativeStrikeRate("aplayer003.csv","A Player003")
#ca.batsmanCumulativeStrikeRate("aplayer004.csv","A Player004")

19 Future Runs forecast

import cricpy.analytics as ca

#ca.batsmanPerfForecast("aplayer001.csv","A Player001")

20 Relative Batsman Cumulative Average Runs

The plot below compares the Relative cumulative average runs of the batsman for each of the runs ranges of 10 and plots them.

import cricpy.analytics as ca

frames = ["aplayer1.csv","aplayer2.csv","aplayer3.csv","aplayer4.csv"]
names = ["A Player1","A Player2","A Player3","A Player4"]
#ca.relativeBatsmanCumulativeAvgRuns(frames,names)

21 Plot of 4s and 6s

import cricpy.analytics as ca

frames = ["aplayer1.csv","aplayer2.csv","aplayer3.csv","aplayer4.csv"]
names = ["A Player1","A Player2","A Player3","A Player4"]
#ca.batsman4s6s(frames,names)

22. Relative Batsman Strike Rate

The plot below gives the relative Runs Frequency Percetages for each 10 run bucket. The plot below show

import cricpy.analytics as ca

frames = ["aplayer1.csv","aplayer2.csv","aplayer3.csv","aplayer4.csv"]
names = ["A Player1","A Player2","A Player3","A Player4"]
#ca.relativeBatsmanCumulativeStrikeRate(frames,names)

23. 3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

import cricpy.analytics as ca
#ca.battingPerf3d("aplayer001.csv","A Player001")
#ca.battingPerf3d("aplayer002.csv","A Player002")
#ca.battingPerf3d("aplayer003.csv","A Player003")
#ca.battingPerf3d("aplayer004.csv","A Player004")

24. Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

import cricpy.analytics as ca

import numpy as np
import pandas as pd

BF = np.linspace( 10, 400,15)
Mins = np.linspace( 30,600,15)
newDF= pd.DataFrame({'BF':BF,'Mins':Mins})

#aplayer = ca.batsmanRunsPredict("aplayer.csv",newDF,"A Player")
#print(aplayer)

The fitted model is then used to predict the runs that the batsmen will score for a given Balls faced and Minutes at crease.

25 Analysis of Top 3 wicket takers

Take any number of bowlers from either Test, ODI or T20

  1. Bowler1
  2. Bowler2
  3. Bowler3 …

26. Get the bowler’s data (Test)

This plot below computes the percentage frequency of number of wickets taken for e.g 1 wicket x%, 2 wickets y% etc and plots them as a continuous line

import cricpy.analytics as ca

#abowler1 =ca.getPlayerData(profileNo1,dir=".",file="abowler1.csv",type="bowling",homeOrAway=[1,2], result=[1,2,4])
#abowler2 =ca.getPlayerData(profileNo2,dir=".",file="abowler2.csv",type="bowling",homeOrAway=[1,2], result=[1,2,4])
#abowler3 =ca.getPlayerData(profile3,dir=".",file="abowler3.csv",type="bowling",homeOrAway=[1,2], result=[1,2,4])

26b For ODI bowlers

import cricpy.analytics as ca

#abowler1 =ca.getPlayerDataOD(profileNo1,dir=".",file="abowler1.csv",type="bowling")
#abowler2 =ca.getPlayerDataOD(profileNo2,dir=".",file="abowler2.csv",type="bowling")
#abowler3 =ca.getPlayerDataOD(profile3,dir=".",file="abowler3.csv",type="bowling")

26c For T20 bowlers

import cricpy.analytics as ca

#abowler1 =ca.getPlayerDataTT(profileNo1,dir=".",file="abowler1.csv",type="bowling")
#abowler2 =ca.getPlayerDataTT(profileNo2,dir=".",file="abowler2.csv",type="bowling")
#abowler3 =ca.getPlayerDataTT(profile3,dir=".",file="abowler3.csv",type="bowling")

27. Wicket Frequency Plot

This plot below plots the frequency of wickets taken for each of the bowlers

import cricpy.analytics as ca

#ca.bowlerWktsFreqPercent("abowler1.csv","A Bowler1")
#ca.bowlerWktsFreqPercent("abowler2.csv","A Bowler2")
#ca.bowlerWktsFreqPercent("abowler3.csv","A Bowler3")

28. Wickets Runs plot

The plot below create a box plot showing the 1st and 3rd quartile of runs conceded versus the number of wickets taken

import cricpy.analytics as ca

#ca.bowlerWktsRunsPlot("abowler1.csv","A Bowler1")
#ca.bowlerWktsRunsPlot("abowler2.csv","A Bowler2")
#ca.bowlerWktsRunsPlot("abowler3.csv","A Bowler3")

29 Average wickets at different venues

The plot gives the average wickets taken bat different venues.

import cricpy.analytics as ca

#ca.bowlerAvgWktsGround("abowler1.csv","A Bowler1")
#ca.bowlerAvgWktsGround("abowler2.csv","A Bowler2")
#ca.bowlerAvgWktsGround("abowler3.csv","A Bowler3")

30 Average wickets against different opposition

The plot gives the average wickets taken against different countries.

import cricpy.analytics as ca

#ca.bowlerAvgWktsOpposition("abowler1.csv","A Bowler1")
#ca.bowlerAvgWktsOpposition("abowler2.csv","A Bowler2")
#ca.bowlerAvgWktsOpposition("abowler3.csv","A Bowler3")

31 Wickets taken moving average

import cricpy.analytics as ca

#ca.bowlerMovingAverage("abowler1.csv","A Bowler1")
#ca.bowlerMovingAverage("abowler2.csv","A Bowler2")
#ca.bowlerMovingAverage("abowler3.csv","A Bowler3")

32 Cumulative average wickets taken

The plots below give the cumulative average wickets taken by the bowlers.

import cricpy.analytics as ca

#ca.bowlerCumulativeAvgWickets("abowler1.csv","A Bowler1")
#ca.bowlerCumulativeAvgWickets("abowler2.csv","A Bowler2")
#ca.bowlerCumulativeAvgWickets("abowler3.csv","A Bowler3")

33 Cumulative average economy rate

The plots below give the cumulative average economy rate of the bowlers.

import cricpy.analytics as ca

#ca.bowlerCumulativeAvgEconRate("abowler1.csv","A Bowler1")
#ca.bowlerCumulativeAvgEconRate("abowler2.csv","A Bowler2")
#ca.bowlerCumulativeAvgEconRate("abowler3.csv","A Bowler3")

34 Future Wickets forecast

import cricpy.analytics as ca
#ca.bowlerPerfForecast("abowler1.csv","A bowler1")

35 Get player data special

import cricpy.analytics as ca

#abowler1sp =ca.getPlayerDataSp(profile1,tdir=".",tfile="abowler1sp.csv",ttype="bowling")
#abowler2sp =ca.getPlayerDataSp(profile2,tdir=".",tfile="abowler2sp.csv",ttype="bowling")
#abowler3sp =ca.getPlayerDataSp(profile3,tdir=".",tfile="abowler3sp.csv",ttype="bowling")

36 Contribution to matches won and lost

Note:This can be done only for Test cricketers

import cricpy.analytics as ca

#ca.bowlerContributionWonLost("abowler1sp.csv","A Bowler1")
#ca.bowlerContributionWonLost("abowler2sp.csv","A Bowler2")
#ca.bowlerContributionWonLost("abowler3sp.csv","A Bowler3")

37 Performance home and overseas

Note:This can be done only for Test cricketers

import cricpy.analytics as ca

#ca.bowlerPerfHomeAway("abowler1sp.csv","A Bowler1")
#ca.bowlerPerfHomeAway("abowler2sp.csv","A Bowler2")
#ca.bowlerPerfHomeAway("abowler3sp.csv","A Bowler3")

38 Relative cumulative average economy rate of bowlers

import cricpy.analytics as ca

frames = ["abowler1.csv","abowler2.csv","abowler3.csv"]
names = ["A Bowler1","A Bowler2","A Bowler3"]
#ca.relativeBowlerCumulativeAvgEconRate(frames,names)

39 Relative Economy Rate against wickets taken

import cricpy.analytics as ca

frames = ["abowler1.csv","abowler2.csv","abowler3.csv"]
names = ["A Bowler1","A Bowler2","A Bowler3"]
#ca.relativeBowlingER(frames,names)

40 Relative cumulative average wickets of bowlers in career

import cricpy.analytics as ca
frames = ["abowler1.csv","abowler2.csv","abowler3.csv"]
names = ["A Bowler1","A Bowler2","A Bowler3"]
#ca.relativeBowlerCumulativeAvgWickets(frames,names)

B. Analyzing cricket teams in Test, ODI and T20s

The following functions will get the team data for Tests, ODI and T20s

1a. Get Test team data

import cricpy.analytics as ca
#country1Test= ca.getTeamDataHomeAway(dir=".",teamView="bat",matchType="Test",file="country1Test.csv",save=True,teamName="Country1")
#country2Test= ca.getTeamDataHomeAway(dir=".",teamView="bat",matchType="Test",file="country2Test.csv",save=True,teamName="Country2")
#country3Test= ca.getTeamDataHomeAway(dir=".",teamView="bat",matchType="Test",file="country3Test.csv",save=True,teamName="Country3")

1b. Get ODI team data

import cricpy.analytics as ca
#team1ODI=  ca.getTeamDataHomeAway(dir=".",matchType="ODI",file="team1ODI.csv",save=True,teamName="team1")
#team2ODI=  ca.getTeamDataHomeAway(dir=".",matchType="ODI",file="team2ODI.csv",save=True,teamName="team2")
#team3ODI=  ca.getTeamDataHomeAway(dir=".",matchType="ODI",file="team3ODI.csv",save=True,teamName="team3")

1c. Get T20 team data

import cricpy.analytics as ca
#team1T20 = ca.getTeamDataHomeAway(matchType="T20",file="team1T20.csv",save=True,teamName="team1")
#team2T20 = ca.getTeamDataHomeAway(matchType="T20",file="team2T20.csv",save=True,teamName="team2")
#team3T20 = ca.getTeamDataHomeAway(matchType="T20",file="team3T20.csv",save=True,teamName="team3")

2a. Test – Analyzing test performances against opposition

import cricpy.analytics as ca
# Get the performance of Indian test team against all teams at all venues as a dataframe
#df = ca.teamWinLossStatusVsOpposition("country1Test.csv",teamName="Country1",opposition=["all"],homeOrAway=["all"],matchType="Test",plot=False)
#print(df.head())
# Plot the performance of Country1 Test team  against all teams at all venues
#ca.teamWinLossStatusVsOpposition("country1Test.csv",teamName="Country1",opposition=["all"],homeOrAway=["all"],matchType="Test",plot=True)
# Plot the performance of Country1 Test team  against specific teams at home/away venues
#ca.teamWinLossStatusVsOpposition("country1Test.csv",teamName="Country1",opposition=["Country2","Country3","Country4"],homeOrAway=["home","away","neutral"],matchType="Test",plot=True)

2b. Test – Analyzing test performances against opposition at different grounds

import cricpy.analytics as ca
# Get the performance of Indian test team against all teams at all venues as a dataframe
#df = ca.teamWinLossStatusAtGrounds("country1Test.csv",teamName="Country1",opposition=["all"],homeOrAway=["all"],matchType="Test",plot=False)
#df.head()
# Plot the performance of Country1 Test team  against all teams at all venues
#ca.teamWinLossStatusAtGrounds("country1Test.csv",teamName="Country1",opposition=["all"],homeOrAway=["all"],matchType="Test",plot=True)
# Plot the performance of Country1 Test team  against specific teams at home/away venues
#ca.teamWinLossStatusAtGrounds("country1Test.csv",teamName="Country1",opposition=["Country2","Country3","Country4"],homeOrAway=["home","away","neutral"],matchType="Test",plot=True)

2c. Test – Plot time lines of wins and losses

import cricpy.analytics as ca
#ca.plotTimelineofWinsLosses("country1Test.csv",team="Country1",opposition=["all"], #startDate="1970-01-01",endDate="2017-01-01")
#ca.plotTimelineofWinsLosses("country1Test.csv",team="Country1",opposition=["Country2","Count#ry3","Country4"], homeOrAway=["home",away","neutral"], startDate=<start Date> #,endDate=<endDate>)

3a. ODI – Analyzing test performances against opposition

import cricpy.analytics as ca
#df = ca.teamWinLossStatusVsOpposition("team1ODI.csv",teamName="Team1",opposition=["all"],homeOrAway=["all"],matchType="ODI",plot=False)
#print(df.head())
# Plot the performance of team1  in ODIs against Sri Lanka, India at all venues
#ca.teamWinLossStatusVsOpposition("team1ODI.csv",teamName="Team1",opposition=["all"],homeOrAway=[all"],matchType="ODI",plot=True)
# Plot the performance of Team1 ODI team  against specific teams at home/away venues
#ca.teamWinLossStatusVsOpposition("team1ODI.csv",teamName="Team1",opposition=["Team2","Team3","Team4"],homeOrAway="home","away","neutral"],matchType="ODI",plot=True)

3b. ODI – Analyzing test performances against opposition at different venues

import cricpy.analytics as ca
#df = ca.teamWinLossStatusAtGrounds("team1ODI.csv",teamName="Team1",opposition=["all"],homeOrAway=["all"],matchType="ODI",plot=False)
#print(df.head())
# Plot the performance of Team1s in ODIs specific ODI teams at all venues
#ca.teamWinLossStatusAtGrounds("team1ODI.csv",teamName="Team1",opposition=["all"],homeOrAway=[all"],matchType="ODI",plot=True)
# Plot the performance of Team1 against specific ODI teams at home/away venues
#ca.teamWinLossStatusAtGrounds("team1ODI.csv",teamName="Team1",opposition=["Team2","Team3","Team4"],homeOrAway=["home","away","neutral"],matchType="ODI",plot=True)

3c. ODI – Plot time lines of wins and losses

import cricpy.analytics as ca
#Plot the time line of wins/losses of Bangladesh ODI team between 2 dates all venues
#ca.plotTimelineofWinsLosses("team1ODI.csv",team="Team1",startDate=<start date> ,endDa#te=<end date>,matchType="ODI")
#Plot the time line of wins/losses against specific opposition between 2 dates
#ca.plotTimelineofWinsLosses("team1ODI.csv",team="Team1",opposition=["Team2","Team2"], homeOrAway=["home",away","neutral"], startDate=<start date>,endDate=<end date> ,matchType="ODI")

4a. T20 – Analyzing test performances against opposition

import cricpy.analytics as ca
#df = ca.teamWinLossStatusVsOpposition("teamT20.csv",teamName="Team1",opposition=["all"],homeOrAway=["all"],matchType="T20",plot=False)
#print(df.head())
# Plot the performance of Team1 in T20s  against  all opposition at all venues
#ca.teamWinLossStatusVsOpposition("teamT20.csv",teamName="Team1",opposition=["all"],homeOrAway=[all"],matchType="T20",plot=True)
# Plot the performance of T20 Test team  against specific teams at home/away venues
#ca.teamWinLossStatusVsOpposition("teamT20.csv",teamName="Team1",opposition=["Team2","Team3","Team4"],homeOrAway=["home","away","neutral"],matchType="T20",plot=True)

4b. T20 – Analyzing test performances against opposition at different venues

import cricpy.analytics as ca
#df = ca.teamWinLossStatusAtGrounds("teamT20.csv",teamName="Team1",opposition=["all"],homeOrAway=["all"],matchType="T20",plot=False)
#df.head()
# Plot the performance of Team1s in ODIs specific ODI teams at all venues
#ca.teamWinLossStatusAtGrounds("teamT20.csv",teamName="Team1",opposition=["all"],homeOrAway=["all"],matchType="T20",plot=True)
# Plot the performance of Team1 against specific ODI teams at home/away venues
#ca.teamWinLossStatusAtGrounds("teamT20.csv",teamName="Team1",opposition=["Team2","Team3","Team4"],homeOrAway=["home","away","neutral"],matchType="T20",plot=True)

4c. T20 – Plot time lines of wins and losses

import cricpy.analytics as ca
#Plot the time line of wins/losses of Bangladesh ODI team between 2 dates all venues
#ca.plotTimelineofWinsLosses("teamT20.csv",team="Team1",startDate=<start date> ,endDa#te=<end date>,matchType="T20")
#Plot the time line of wins/losses against specific opposition between 2 dates
#ca.plotTimelineofWinsLosses("teamT20.csv",team="Team1",opposition=c("Team2","Team2"), homeOrAway=c("home",away","neutral"), startDate=<start date>,endDate=<end date> ,matchType="T20")

Conclusion

Key Findings

Analysis of batsman

Analysis of bowlers

Analysis of teams

Have fun with cripy!!!

Cricpy adds team analytics to its arsenal!!

I can’t sit still and see another man slaving and working. I want to get up and superintend, and walk round with my hands in my pockets, and tell him what to do. It is my energetic nature. I can’t help it.

It always does seem to me that I am doing more work than I should do. It is not that I object to the work, mind you; I like work: it fascinates me. I can sit and look at it for hours. I love to keep it by me: the idea of getting rid of it nearly breaks my heart.

Let your boat of life be light, packed with only what you need – a homely home and simple pleasures, one or two friends, worth the name, someone to love and someone to love you, a cat, a dog, and a pipe or two, enough to eat and enough to wear, and a little more than enough to drink; for thirst is a dangerous thing.

                Three Men in a boat by Jerome K Jerome
                

Introduction

Cricpy, the python avatar of my R package was born about a 9 months back see Introducing cricpy:A python package to analyze performances of cricketers. Cricpy, like its R twin, can analyze performance of batsmen & bowlers in Test, ODI and T20 formats. About a week and a half back, I added team analytics to my R package cricketr see Cricketr adds team analytics to its repertoire!!!. If cricketr has team analysis functions, then can cricpy be far behind? So, I have included the same 8 functions which can perform Team analytics into cricpy also. Team performance analysis can be done for Test, ODI and T20 matches.

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package can handle all formats of the game including Test, ODI and Twenty20 cricket.

You should be able to install the package using pip install cricpy. Please be mindful of ESPN Cricinfo Terms of Use

You can download the latest PDF version of the book  at  ‘Cricket analytics with cricketr and cricpy: Analytics harmony with R and Python-6th edition

There are 5 functions which are used internally 1) getTeamData b) getTeamNumber c) getMatchType d) getTeamDataHomeAway e) cleanTeamData

and the external functions which a) teamWinLossStatusVsOpposition b) teamWinLossStatusAtGrounds c) plotTimelineofhttps://drive.google.com/file/d/1l4nQsRZ0C2FyPosigZmo0t-kC2xZZ_wl/view?usp=sharingWinsLosses

All the above functions are common to Test, ODI and T20 teams

The data for a particular Team can be obtained with the getTeamDataHomeAway() function from the package. This will return a dataframe of the team’s win/loss status at home and away venues over a period of time. This can be saved as a CSV file. Once this is done, you can use this CSV file for all subsequent analysis

This post has been published at Rpubs at teamAnalyticsCricpy You can download the PDF version of this post at teamAnalyticsCricpy

As before you can get the help for any of the cricpy functions as below

import cricpy.analytics as ca
help(ca.teamWinLossStatusAtGrounds)
## Help on function teamWinLossStatusAtGrounds in module cricpy.analytics:
## 
## teamWinLossStatusAtGrounds(file, teamName, opposition=['all'], homeOrAway=['all'], matchType='Test', plot=False)
##     Compute the wins/losses/draw/tied etc for a Team in Test, ODI or T20 at venues
##     
##     Description
##     
##     This function computes the won,lost,draw,tied or no result for a team against other teams in home/away or neutral venues and either returns a dataframe or plots it for grounds
##     
##     Usage
##     
##     teamWinLossStatusAtGrounds(file,teamName,opposition=["all"],homeOrAway=["all"],
##                   matchType="Test",plot=FALSE)
##     Arguments
##     
##     file        
##     The CSV file for which the plot is required
##     teamName    
##     The name of the team for which plot is required
##     opposition  
##     Opposition is a vector namely ["all")] or ["Australia", "India", "England"]
##     homeOrAway  
##     This parameter is a vector which is either ["all")] or a vector of venues ["home","away","neutral"]
##     matchType   
##     Match type - Test, ODI or T20
##     plot        
##     If plot=FALSE then a data frame is returned, If plot=TRUE then a plot is generated
##     Value
##     
##     None
##     
##     Note
##     
##     Maintainer: Tinniam V Ganesh tvganesh.85@gmail.com
##     
##     Author(s)
##     
##     Tinniam V Ganesh
##     
##     References
##     
##     http://www.espncricinfo.com/ci/content/stats/index.html
##     https://gigadom.in/
##     See Also
##     
##     teamWinLossStatusVsOpposition teamWinLossStatusAtGrounds plotTimelineofWinsLosses
##     
##     Examples
##     
##     ## Not run: 
##     #Get the team data for India for Tests
##     
##     df =getTeamDataHomeAway(teamName="India",file="indiaOD.csv",matchType="ODI")
##     ca.teamWinLossStatusAtGrounds("india.csv",teamName="India",opposition=c("Australia","England","India"),
##                               homeOrAway=c("home","away"),plot=TRUE)
##     
##     ## End(Not run)

1. Get team data

1a. Test

The teams in Test cricket are included below

  1. Afghanistan 2.Bangladesh 3. England 4. World 5. India 6. Ireland 7. New Zealand 8. Pakistan 9. South Africa 10.Sri Lanka 11. West Indies 12.Zimbabwe

You can use this for the teamName paramater. This will return a dataframe and also save the file as a CSV , if save=True

Note: – Since I have already got the data as CSV files I am not executing the lines below

import cricpy.analytics as ca
# Get the data for the teams. Save as CSV
#indiaTest= ca.getTeamDataHomeAway(dir=".",teamView="bat",matchType="Test",file="indiaTest.csv",save=True,teamName="India")
#ca.getTeamDataHomeAway(teamName="South Africa", matchType="Test", file="southafricaTest.csv", save=True)
#ca.getTeamDataHomeAway(teamName="West Indies", matchType="Test", file="westindiesTest.csv", save=True)
#newzealandTest = ca.getTeamDataHomeAway(matchType="Test",file="newzealandTest.csv",save=True,teamName="New Zealand")

1b. ODI

The ODI teams in the world are below. The data for these teams can be got by names as shown below

  1. Afghanistan 2. Africa XI 3. Asia XI 4.Australia 5.Bangladesh
  2. Bermuda 7. England 8. ICC World X1 9. India 11.Ireland 12. New Zealand 13. Pakistan       14. South Africa 15.Sri Lanka 17. West Indies 18. Zimbabwe 19 Canada    21. East Africa        22. Hong Kong 23.Ireland 24. Kenya 25. Namibia 26.Nepal 27.Netherlands 28. Oman 29.Papua New Guinea 30. Scotland 31 United Arab Emirates 32. United States of America
import cricpy.analytics as ca
#indiaODI=  ca.getTeamDataHomeAway(dir=".",matchType="ODI",file="indiaODI.csv",save=True,teamName="India")
#englandODI =  ca.getTeamDataHomeAway(matchType="ODI",file="englandODI.csv",save=True,teamName="England")
#westindiesODI = ca.getTeamDataHomeAway(matchType="ODI",file="westindiesODI.csv",save=True,teamName="West Indies")
#irelandODI <- ca.getTeamDataHomeAway(matchType="ODI",file="irelandODI.csv",save=True,teamName="Ireland")

1c T20

The T20 teams in the world are

  1. Afghanistan 2. Australia 3. Bahrain 4. Bangladesh 5. Belgium 6. Belize
  2. Bermuda 8.Botswana 9. Canada 11. Costa Rica 12. Germany 13. Ghana
  3. Guernsey 15. Hong Kong 16. ICC World X1 17.India 18. Ireland 19.Italy
  4. Jersey 21. Kenya 22.Kuwait 23.Maldives 24.Malta 25.Mexico 26.Namibia
    27.Nepal 28.Netherlands 29. New Zealand 30.Nigeria 31.Oman 32. Pakistan
    33.Panama 34.Papua New Guinea 35. Philippines 36.Qatar 37.Saudi Arabia
    38.Scotland 39.South Africa 40.Spain 41.Sri Lanka 42.Uganda 43.United Arab Emirates United States of America 44.Vanuatu 45.West Indies
import cricpy.analytics as ca
#southafricaT20 = ca.getTeamDataHomeAway(matchType="T20",file="southafricaT20.csv",save=True,teamName="South Africa")
#srilankaT20 = ca.getTeamDataHomeAway(matchType="T20",file="srilankaT20.csv",save=True,teamName="Sri Lanka")
#canadaT20 = ca.getTeamDataHomeAway(matchType="T20",file="canadaT20.csv",save=True,teamName="Canada")
#afghanistanT20 = ca.getTeamDataHomeAway(matchType="T20",file="afghanistanT20.csv",save=True,teamName="Afghanistan")

2 Analysis of Test matches

The functions below perform analysis of Test teams

2a. Wins vs Loss against opposition

This function performs analysis of Test teams against other teams at home/away or neutral venue. Note:- The opposition can be a list of opposition teams. Similarly homeOrAway can also be a list of home/away/neutral venues.

import cricpy.analytics as ca
# Get the performance of Indian test team against all teams at all venues as a dataframe
df =ca.teamWinLossStatusVsOpposition("indiaTest.csv",teamName="India",opposition=["all"], homeOrAway=["all"], matchType="Test", plot=False)
print(df)
## ha                   away  home
## Opposition   Result            
## Afghanistan  won      0.0   1.0
## Australia    draw    20.0  23.0
##              lost    58.0  26.0
##              tied     0.0   2.0
##              won     13.0  39.0
## Bangladesh   draw     3.0   0.0
##              won      9.0   2.0
## England      draw    35.0  48.0
##              lost    68.0  26.0
##              won     13.0  33.0
## New Zealand  draw    18.0  28.0
##              lost    16.0   4.0
##              won     10.0  28.0
## Pakistan     draw    29.0  34.0
##              lost    14.0  10.0
##              won      2.0  13.0
## South Africa draw    13.0   3.0
##              lost    20.0  10.0
##              won      6.0  15.0
## Sri Lanka    draw    11.0  14.0
##              lost    14.0   0.0
##              won     16.0  13.0
## West Indies  draw    39.0  35.0
##              lost    32.0  28.0
##              won     13.0  21.0
## Zimbabwe     draw     1.0   1.0
##              lost     4.0   0.0
##              won      5.0   6.0
# Plot the performance of Indian Test team  against all teams at all venues
ca.teamWinLossStatusVsOpposition("indiaTest.csv",teamName="India",opposition=["all"],homeOrAway=["all"],matchType="Test",plot=True)















# Get the performance of Australia against India, England and New Zealand at all venues in Tests
df =ca.teamWinLossStatusVsOpposition("southafricaTest.csv",teamName="South Africa",opposition=["India","England","New Zealand"],homeOrAway=["all"],matchType="Test",plot=False)
print(df)

#Plot the performance of Australia against England, India and New Zealand only at home (Australia) 
## ha                  away  home
## Opposition  Result            
## England     draw      43    55
##             lost      60    62
##             won       26    34
## India       draw       5    14
##             lost      16     6
##             won        7    19
## New Zealand draw      20     7
##             lost       2     6
##             won       14    29
ca.teamWinLossStatusVsOpposition("southafricaTest.csv",teamName="South Africa",opposition=["India","England","New Zealand"],homeOrAway=["home","away"],matchType="Test",plot=True)

2b Wins vs losses of Test teams against opposition at different venues

import cricpy.analytics as ca
# Get the  performance of Pakistan against India, West Indies, South Africa at all venues in Tests and show performances at the venues
df = ca.teamWinLossStatusAtGrounds("westindiesTest.csv",teamName="West Indies",opposition=["India","Sri Lanka","South Africa"],homeOrAway=["all"],matchType="Test",plot=False)
print(df)

# Plot the performance of New Zealand Test team against England, Sri Lanka and Bangladesh at all grounds playes 
## ha                         away  home
## Ground             Result            
## Ahmedabad          won      2.0   0.0
## Basseterre         draw     0.0   3.0
## Bengaluru          draw     2.0   0.0
##                    won      2.0   0.0
## Bridgetown         draw     0.0   6.0
##                    lost     0.0   6.0
##                    won      0.0  14.0
## Cape Town          draw     2.0   0.0
##                    lost     6.0   0.0
## Centurion          lost     6.0   0.0
## Chennai            draw     4.0   0.0
##                    lost     8.0   0.0
##                    won      3.0   0.0
## Colombo (PSS)      lost     2.0   0.0
## Colombo (RPS)      draw     2.0   0.0
## Colombo (SSC)      lost     4.0   0.0
## Delhi              draw     6.0   0.0
##                    lost     2.0   0.0
##                    won      3.0   0.0
## Durban             lost     6.0   0.0
## Galle              draw     1.0   0.0
##                    lost     4.0   0.0
## Georgetown         draw     0.0  10.0
## Gros Islet         draw     0.0   5.0
##                    lost     0.0   2.0
## Hyderabad (Deccan) lost     2.0   0.0
## Johannesburg       lost     4.0   0.0
## Kandy              lost     4.0   0.0
## Kanpur             draw     1.0   0.0
##                    won      3.0   0.0
## Kingston           draw     0.0   8.0
##                    lost     0.0   4.0
##                    won      0.0  15.0
## Kingstown          draw     0.0   2.0
## Kolkata            draw     7.0   0.0
##                    lost     6.0   0.0
##                    won      3.0   0.0
## Mohali             won      2.0   0.0
## Moratuwa           draw     1.0   0.0
## Mumbai             draw     7.0   0.0
##                    lost     6.0   0.0
##                    won      2.0   0.0
## Mumbai (BS)        draw     5.0   0.0
##                    won      2.0   0.0
## Nagpur             draw     2.0   0.0
## North Sound        lost     0.0   2.0
## Pallekele          draw     1.0   0.0
## Port Elizabeth     draw     1.0   0.0
##                    lost     2.0   0.0
##                    won      2.0   0.0
## Port of Spain      draw     0.0  12.0
##                    lost     0.0  12.0
##                    won      0.0  10.0
## Providence         lost     0.0   2.0
## Rajkot             lost     2.0   0.0
## Roseau             draw     0.0   2.0
## St John's          draw     0.0   6.0
##                    lost     0.0   2.0
##                    won      0.0   2.0
ca. teamWinLossStatusAtGrounds("newzealandTest.csv",teamName="New Zealand",opposition=["England","Sri Lanka","Bangladesh"],homeOrAway=["all"],matchType="Test",plot=True)

2c. Plot the time line of wins vs losses of Test teams against opposition at different venues during an interval

import cricpy.analytics as ca
# Plot the time line of wins/losses of India against Australia, West Indies, South Africa in away/neutral venues
#from 2000-01-01 to 2017-01-01
ca.plotTimelineofWinsLosses("indiaTest.csv",teamName="India",opposition=["Australia","West Indies","South Africa"],
                         homeOrAway=["away","neutral"], startDate="2000-01-01",endDate="2017-01-01")
#Plot the time line of wins/losses of Indian Test team from 1970 onwards

ca.plotTimelineofWinsLosses("indiaTest.csv",teamName="India",startDate="1970-01-01",endDate="2017-01-01")

3 ODI

The functions below perform analysis of ODI teams listed above

3a. Wins vs Loss against opposition ODI teams

This function performs analysis of ODI teams against other teams at home/away or neutral venue. Note:- The opposition can be a vector of opposition teams. Similarly homeOrAway can also be a vector of home/away/neutral venues.

import cricpy.analytics as ca
# Get the performance of West Indies in ODIs against all other ODI teams at all venues and retirn as a dataframe
df = ca.teamWinLossStatusVsOpposition("westindiesODI.csv",teamName="West Indies",opposition=["all"],homeOrAway=["all"],matchType="ODI",plot=False)
print(df)

# Plot the performance of West Indies in ODIs against Sri Lanka, India at all venues
## ha                   away  home  neutral
## Opposition   Result                     
## Afghanistan  lost     0.0   1.0      2.0
##              won      0.0   1.0      0.0
## Australia    lost    41.0  25.0      8.0
##              n/r      3.0   0.0      0.0
##              tied     1.0   2.0      0.0
##              won     35.0  18.0      7.0
## Bangladesh   lost     6.0   5.0      3.0
##              n/r      1.0   0.0      1.0
##              won     10.0   8.0      3.0
## Bermuda      won      0.0   0.0      1.0
## Canada       won      2.0   1.0      1.0
## England      lost    22.0  17.0     12.0
##              n/r      0.0   3.0      0.0
##              won     15.0  23.0      6.0
## India        lost    27.0  14.0     18.0
##              n/r      0.0   1.0      0.0
##              tied     1.0   0.0      1.0
##              won     27.0  20.0     15.0
## Ireland      lost     0.0   0.0      1.0
##              won      2.0   3.0      2.0
## Kenya        lost     0.0   0.0      1.0
##              won      3.0   0.0      2.0
## Netherlands  won      0.0   0.0      2.0
## New Zealand  lost    19.0   5.0      3.0
##              n/r      2.0   0.0      2.0
##              won     10.0  15.0      5.0
## P.N.G.       won      0.0   0.0      1.0
## Pakistan     lost    11.0  15.0     34.0
##              tied     1.0   2.0      0.0
##              won     14.0  16.0     41.0
## Scotland     won      0.0   0.0      3.0
## South Africa lost    20.0  17.0      7.0
##              n/r      1.0   0.0      0.0
##              tied     0.0   0.0      1.0
##              won      5.0   7.0      3.0
## Sri Lanka    lost     9.0   5.0     11.0
##              n/r      2.0   1.0      0.0
##              won      3.0   5.0     20.0
## U.A.E.       won      0.0   0.0      2.0
## Zimbabwe     lost     4.0   1.0      5.0
##              n/r      0.0   1.0      0.0
##              tied     1.0   0.0      0.0
##              won      9.0  15.0     12.0
ca.teamWinLossStatusVsOpposition("westindiesODI.csv",teamName="West Indies",opposition=["Sri Lanka", "India"],homeOrAway=["all"],matchType="ODI",plot=True)















#Plot the performance of Ireland in ODIs against Zimbabwe, Kenya, bermuda, UAE, Oman and Scotland at all venues
ca.teamWinLossStatusVsOpposition("irelandODI.csv",teamName="Ireland",opposition=["Zimbabwe","Kenya","Bermuda","U.A.E.","Oman","Scotland"],homeOrAway=["all"],matchType="ODI",plot=True)

3b Wins vs losses of ODI teams against opposition at different venues

import cricpy.analytics as ca
# Plot the performance of England ODI team against Bangladesh, West Indies and Australia at all venues
ca.teamWinLossStatusAtGrounds("englandODI.csv",teamName="England",opposition=["West Indies"],homeOrAway=["all"],matchType="ODI",plot=True)
























#Plot the performance of India against South Africa, West Indies and Australia at 'home' venues
ca.teamWinLossStatusAtGrounds("indiaODI.csv",teamName="India",opposition=["South Africa"],homeOrAway=["home"],matchType="ODI",plot=True)

3c. Plot the time line of wins vs losses of ODI teams against opposition at different venues during an interval


import cricpy.analytics as ca
#Plot the time line of wins/losses of Bangladesh ODI team between 2015 and 2019 against all other teams and at
# all venues
ca.plotTimelineofWinsLosses("bangladeshOD.csv",teamName="Bangladesh",startDate="2015-01-01",endDate="2019-01-01",matchType="ODI")























#Plot the time line of wins/losses of India ODI against Sri Lanka, Bangladesh from 2016 to 2019
ca.plotTimelineofWinsLosses("indiaODI.csv",teamName="India",opposition=["Sri Lanka","Bangladesh"],startDate="2016-01-01",endDate="2019-01-01",matchType="ODI")

4 Twenty 20

The functions below perform analysis of Twenty 20 teams listed above

4a. Wins vs Loss against opposition ODI teams

This function performs analysis of T20 teams against other T20 teams at home/away or neutral venue. Note:- The opposition can be a list of opposition teams. Similarly homeOrAway can also be a list of home/away/neutral venues.

import cricpy.analytics as ca
# Get the performance of South Africa T20 team against England, India and Sri Lanka at home grounds at England
df = ca.teamWinLossStatusVsOpposition("southafricaT20.csv",teamName="South Africa",opposition=["England","India","Sri Lanka"], homeOrAway=["home"], matchType="T20", plot=False)
print(df)

#Plot the performance of South Africa T20 against England, India and Sri Lanka at all venues
## ha                 home
## Opposition Result      
## England    lost       1
##            won        4
## India      lost       5
##            won        2
## Sri Lanka  lost       2
##            tied       1
##            won        3
ca.teamWinLossStatusVsOpposition("southafricaT20.csv",teamName="South Africa", opposition=["England","India","Sri Lanka"],homeOrAway=["all"],matchType="T20",plot=True)

























#Plot the performance of Afghanistan T20 teams against all oppositions
ca.teamWinLossStatusVsOpposition("afghanistanT20.csv",teamName="Afghanistan",opposition=["all"],homeOrAway=["all"],matchType="T20",plot=True)

4b Wins vs losses of T20 teams against opposition at different venues

# Compute the performance of Canada against all opposition at all venues and show by grounds. Return as dataframe
df =ca.teamWinLossStatusAtGrounds("canadaT20.csv",teamName="Canada",opposition=["all"],homeOrAway=["all"],matchType="T20",plot=False)
print(df)

# Plot the performance of Sri Lanka T20 team against India and Bangladesh in different venues at home/away and neutral
## ha                     home  neutral
## Ground         Result               
## Abu Dhabi      lost     0.0      1.0
## Belfast        lost     0.0      1.0
##                won      0.0      2.0
## Colombo (SSC)  lost     0.0      1.0
##                won      0.0      1.0
## Dubai (DSC)    lost     0.0      5.0
## ICCA Dubai     lost     0.0      2.0
##                won      0.0      1.0
## King City (NW) lost     3.0      0.0
##                tied     1.0      0.0
## Sharjah        lost     0.0      1.0
ca.teamWinLossStatusAtGrounds("srilankaT20.csv",teamName="Sri Lanka",opposition=["India", "Bangladesh"], homeOrAway=["all"], matchType="T20", plot=True)

4c. Plot the time line of wins vs losses of T20 teams against opposition at different venues during an interval

#Plot the time line of Sri Lanka T20 team agaibst all opposition
ca.plotTimelineofWinsLosses("srilankaT20.csv",teamName="Sri Lanka",opposition=["Australia", "Pakistan"], startDate="2013-01-01", endDate="2019-01-01",  matchType="T20")





















# Plot the time line of South Africa T20 between 2010 and 2015 against West Indies and Pakistan
ca.plotTimelineofWinsLosses("southafricaT20.csv",teamName="South Africa",opposition=["West Indies", "Pakistan"], startDate="2010-01-01", endDate="2015-01-01",  matchType="T20")

Conclusion

With the above additional functions cricpy can now analyze batsmen, bowlers and teams in all formats of the game (Test, ODI and T20).

Have fun with cricpy!!!

You may also like

  1. My book ‘Deep Learning from first principles:Second Edition’ now on Amazon
  2. Practical Machine Learning with R and Python – Part 3
  3. Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR
  4. Revisiting World Bank data analysis with WDI and gVisMotionChart
  5. The Clash of the Titans in Test and ODI cricket
  6. Simulating the domino effect in Android using Box2D and AndEngine
  7. Presentation on Wireless Technologies – Part 1 8.De-blurring revisited with Wiener filter using OpenCV
  8. Cloud Computing – Design Considerations

To see all posts click Index of posts

Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR

“There’s something so paradoxical about pi. On the one hand, it represents order, as embodied by the shape of a circle, long held to be a symbol of perfection and eternity. On the other hand, pi is unruly, disheveled in appearance, its digits obeying no obvious rule, or at least none that we can perceive. Pi is elusive and mysterious, forever beyond reach. Its mix of order and disorder is what makes it so bewitching. ” 

From  Infinite Powers by Steven Strogatz

Anybody who wants to be “anybody” in Big Data must necessarily be able to work on both large structured and unstructured data.  Log analysis is critical in any enterprise which is usually unstructured. As I mentioned in my previous post Big Data: On RDDs, Dataframes,Hive QL with Pyspark and SparkR-Part 3 RDDs are typically used to handle unstructured data. Spark has the Dataframe abstraction over RDDs which performs better as it is optimized with the Catalyst optimization engine. Nevertheless, it is important to be able to process with RDDs.  This post is a continuation of my 3 earlier posts on Big Data namely

1. Big Data-1: Move into the big league:Graduate from Python to Pyspark
2. Big Data-2: Move into the big league:Graduate from R to SparkR
3. Big Data: On RDDs, Dataframes,Hive QL with Pyspark and SparkR-Part 3

This post uses publicly available Webserver logs from NASA. The logs are for the months Jul 95 and Aug 95 and are a good place to start unstructured text analysis/log analysis. I highly recommend parsing these publicly available logs with regular expressions. It is only when you do that the truth of Jamie Zawinski’s pearl of wisdom

“Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.” – Jamie Zawinksi

hits home. I spent many hours struggling with regex!!

For this post for the RDD part,  I had to refer to Dr. Fisseha Berhane’s blog post Webserver Log Analysis and for the Pyspark part, to the Univ. of California Specialization which I had done 3 years back Big Data Analysis with Apache Spark. Once I had played around with the regex for RDDs and PySpark I managed to get SparkR and SparklyR versions to work.

The notebooks used in this post have been published and are available at

  1. logsAnalysiswithRDDs
  2. logsAnalysiswithPyspark
  3. logsAnalysiswithSparkRandSparklyR

You can also download all the notebooks from Github at WebServerLogsAnalysis

An essential and unavoidable aspect of Big Data processing is the need to process unstructured text.Web server logs are one such area which requires Big Data techniques to process massive amounts of logs. The Common Log Format also known as the NCSA Common log format, is a standardized text file format used by web servers when generating server log files. Because the format is standardized, the files can be readily analyzed.

A publicly available webserver logs is the NASA-HTTP Web server logs. This is good dataset with which we can play around to get familiar to handling web server logs. The logs can be accessed at NASA-HTTP

Description These two traces contain two month’s worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida.

Format The logs are an ASCII file with one line per request, with the following columns:

-host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.

-timestamp in the format “DAY MON DD HH:MM:SS YYYY”, where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.

-request given in quotes.

-HTTP reply code.

-bytes in the reply.

1 Parse Web server logs with RDDs

1.1 Read NASA Web server logs

Read the logs files from NASA for the months Jul 95 and Aug 95

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("Spark-Logs-Handling").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

sqlcontext = SQLContext(sc)
rdd = sc.textFile("/FileStore/tables/NASA_access_log_*.gz")
rdd.count()
Out[1]: 3461613

1.2Check content

Check the logs to identify the parsing rules required for the logs

i=0
for line in rdd.sample(withReplacement = False, fraction = 0.00001, seed = 100).collect():
    i=i+1
    print(line)
    if i >5:
      break
ix-stp-fl2-19.ix.netcom.com – – [03/Aug/1995:23:03:09 -0400] “GET /images/faq.gif HTTP/1.0” 200 263
slip183-1.kw.jp.ibm.net – – [04/Aug/1995:18:42:17 -0400] “GET /shuttle/missions/sts-70/images/DSC-95EC-0001.gif HTTP/1.0” 200 107133
piweba4y.prodigy.com – – [05/Aug/1995:19:17:41 -0400] “GET /icons/menu.xbm HTTP/1.0” 200 527
ruperts.bt-sys.bt.co.uk – – [07/Aug/1995:04:44:10 -0400] “GET /shuttle/countdown/video/livevideo2.gif HTTP/1.0” 200 69067
dal06-04.ppp.iadfw.net – – [07/Aug/1995:21:10:19 -0400] “GET /images/NASA-logosmall.gif HTTP/1.0” 200 786
p15.ppp-1.directnet.com – – [10/Aug/1995:01:22:54 -0400] “GET /images/KSC-logosmall.gif HTTP/1.0” 200 1204

1.3 Write the parsing rule for each of the fields

  • host
  • timestamp
  • path
  • status
  • content_bytes

1.21 Get IP address/host name

This regex is at the start of the log and includes any non-white characted

import re
rslt=(rdd.map(lambda line: re.search('\S+',line)
   .group(0))
   .take(3)) # Get the IP address \host name
rslt
Out[3]: [‘in24.inetnebr.com’, ‘uplherc.upl.com’, ‘uplherc.upl.com’]

1.22 Get timestamp

Get the time stamp

rslt=(rdd.map(lambda line: re.search(‘(\S+ -\d{4})’,line)
    .groups())
    .take(3))  #Get the  date
rslt
[(‘[01/Aug/1995:00:00:01 -0400’,),
(‘[01/Aug/1995:00:00:07 -0400’,),
(‘[01/Aug/1995:00:00:08 -0400’,)]

1.23 HTTP request

Get the HTTP request sent to Web server \w+ {GET}

# Get the REST call with ” “
rslt=(rdd.map(lambda line: re.search('"\w+\s+([^\s]+)\s+HTTP.*"',line)
    .groups())
    .take(3)) # Get the REST call
rslt
[(‘/shuttle/missions/sts-68/news/sts-68-mcc-05.txt’,),
(‘/’,),
(‘/images/ksclogo-medium.gif’,)]

1.23Get HTTP response status

Get the HTTP response to the request

rslt=(rdd.map(lambda line: re.search('"\s(\d{3})',line)
    .groups())
    .take(3)) #Get the status
rslt
Out[6]: [(‘200’,), (‘304’,), (‘304’,)]

1.24 Get content size

Get the HTTP response in bytes

rslt=(rdd.map(lambda line: re.search(‘^.*\s(\d*)$’,line)
    .groups())
    .take(3)) # Get the content size
rslt
Out[7]: [(‘1839’,), (‘0’,), (‘0’,)]

1.24 Putting it all together

Now put all the individual pieces together into 1 big regular expression and assign to the groups

  1. Host 2. Timestamp 3. Path 4. Status 5. Content_size
rslt=(rdd.map(lambda line: re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line)
    .groups())
    .take(3))
rslt
[(‘in24.inetnebr.com’,
‘ -‘,
‘ ‘,
‘-‘,
‘[01/Aug/1995:00:00:01 -0400]’,
‘”GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0″‘,
‘/shuttle/missions/sts-68/news/sts-68-mcc-05.txt’,
‘200 1839’,
‘1839’),
(‘uplherc.upl.com’,
‘ -‘,
‘ ‘,
‘-‘,
‘[01/Aug/1995:00:00:07 -0400]’,
‘”GET / HTTP/1.0″‘,
‘/’,
‘304 0’,
‘0’),
(‘uplherc.upl.com’,
‘ -‘,
‘ ‘,
‘-‘,
‘[01/Aug/1995:00:00:08 -0400]’,
‘”GET /images/ksclogo-medium.gif HTTP/1.0″‘,
‘/images/ksclogo-medium.gif’,
‘304 0’,
‘0’)]

1.25 Add a log parsing function

import re
def parse_log1(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line)
    if match is None:    
        return(line,0)
    else:
        return(line,1)

1.26 Check for parsing failure

Check how many lines successfully parsed with the parsing function

n_logs = rdd.count()
failed = rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1] == 0).count()
print('Out of a total of {} logs, {} failed to parse'.format(n_logs,failed))
# Get the failed records line[1] == 0
failed1=rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1]==0)
failed1.take(3)
Out of a total of 3461613 logs, 38768 failed to parse
Out[10]:
[(‘gw1.att.com – – [01/Aug/1995:00:03:53 -0400] “GET /shuttle/missions/sts-73/news HTTP/1.0” 302 -‘,
0),
(‘js002.cc.utsunomiya-u.ac.jp – – [01/Aug/1995:00:07:33 -0400] “GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0” 404 -‘,
0),
(‘pipe1.nyc.pipeline.com – – [01/Aug/1995:00:12:37 -0400] “GET /history/apollo/apollo-13/apollo-13-patch-small.gif” 200 12859’,
0)]

1.26 The above rule is not enough to parse the logs

It can be seen that the single rule only parses part of the logs and we cannot group the regex separately. There is an error “AttributeError: ‘NoneType’ object has no attribute ‘group'” which shows up

#rdd.map(lambda line: re.search(‘^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s(“\w+\s+([^\s]+)\s+HTTP.*”)\s(\d{3}\s(\d*)$)’,line[0]).group()).take(4)

File “/databricks/spark/python/pyspark/util.py”, line 99, in wrapper
return f(*args, **kwargs)
File “<command-1348022240961444>”, line 1, in <lambda>
AttributeError: ‘NoneType’ object has no attribute ‘group’

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:490)

1.27 Add rule for parsing failed records

One of the issues with the earlier rule is the content_size has “-” for some logs

import re
def parse_failed(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line)
    if match is None:        
        return(line,0)
    else:
        return(line,1)

1.28 Parse records which fail

Parse the records that fails with the new rule

failed2=rdd.map(lambda line: parse_failed(line)).filter(lambda line: line[1]==1)
failed2.take(5)
Out[13]:
[(‘gw1.att.com – – [01/Aug/1995:00:03:53 -0400] “GET /shuttle/missions/sts-73/news HTTP/1.0” 302 -‘,
1),
(‘js002.cc.utsunomiya-u.ac.jp – – [01/Aug/1995:00:07:33 -0400] “GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0” 404 -‘,
1),
(‘tia1.eskimo.com – – [01/Aug/1995:00:28:41 -0400] “GET /pub/winvn/release.txt HTTP/1.0” 404 -‘,
1),
(‘itws.info.eng.niigata-u.ac.jp – – [01/Aug/1995:00:38:01 -0400] “GET /ksc.html/facts/about_ksc.html HTTP/1.0” 403 -‘,
1),
(‘grimnet23.idirect.com – – [01/Aug/1995:00:50:12 -0400] “GET /www/software/winvn/winvn.html HTTP/1.0” 404 -‘,
1)]

1.28 Add both rules

Add both rules for parsing the log.

Note it can be shown that even with both rules all the logs are not parse.Further rules may need to be added

import re
def parse_log2(line):
    # Parse logs with the rule below
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line)
    # If match failed then use the rule below
    if match is None:
        match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line)
    if match is None:
        return (line, 0) # Return 0 for failure
    else:
        return (line, 1) # Return 1 for success

1.29 Group the different regex to groups for handling

def map2groups(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line)
    if match is None:
        match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(-)$',line)    
    return(match.groups())

1.30 Parse the logs and map the groups

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])

parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))

2. Parse Web server logs with Pyspark

2.1Read data into a Pyspark dataframe

import os
logs_file_path="/FileStore/tables/" + os.path.join('NASA_access_log_*.gz')
from pyspark.sql.functions import split, regexp_extract
base_df = sqlContext.read.text(logs_file_path)
#base_df.show(truncate=False)
from pyspark.sql.functions import split, regexp_extract
split_df = base_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
                          regexp_extract('value', r'^.*\[(\d\d\/\w{3}\/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
                          regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
                          regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
                          regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_df.show(5,truncate=False)
+———————+————————–+———————————————–+——+————+
|host |timestamp |path |status|content_size|
+———————+————————–+———————————————–+——+————+
|199.72.81.55 |01/Jul/1995:00:00:01 -0400|/history/apollo/ |200 |6245 |
|unicomp6.unicomp.net |01/Jul/1995:00:00:06 -0400|/shuttle/countdown/ |200 |3985 |
|199.120.110.21 |01/Jul/1995:00:00:09 -0400|/shuttle/missions/sts-73/mission-sts-73.html |200 |4085 |
|burger.letters.com |01/Jul/1995:00:00:11 -0400|/shuttle/countdown/liftoff.html |304 |0 |
|199.120.110.21 |01/Jul/1995:00:00:11 -0400|/shuttle/missions/sts-73/sts-73-patch-small.gif|200 |4179 |
+———————+————————–+———————————————–+——+————+
only showing top 5 rows

2.2 Check data

bad_rows_df = split_df.filter(split_df[‘host’].isNull() |
                              split_df['timestamp'].isNull() |
                              split_df['path'].isNull() |
                              split_df['status'].isNull() |
                             split_df['content_size'].isNull())
bad_rows_df.count()
Out[20]: 33905

2.3Check no of rows which do not have digits

We have already seen that the content_type field has ‘-‘ instead of digits in RDDs

#bad_content_size_df = base_df.filter(~ base_df[‘value’].rlike(r’\d+$’))
bad_content_size_df.count()
Out[21]: 33905

2.4 Add ‘*’ to identify bad rows

To identify the rows that are bad, concatenate ‘*’ to the content_size field where the field does not have digits. It can be seen that the content_size has ‘-‘ instead of a valid number

from pyspark.sql.functions import lit, concat
bad_content_size_df.select(concat(bad_content_size_df['value'], lit('*'))).show(4,truncate=False)
+—————————————————————————————————————————————————+
|concat(value, *) |
+—————————————————————————————————————————————————+
|dd15-062.compuserve.com – – [01/Jul/1995:00:01:12 -0400] “GET /news/sci.space.shuttle/archive/sci-space-shuttle-22-apr-1995-40.txt HTTP/1.0” 404 -*|
|dynip42.efn.org – – [01/Jul/1995:00:02:14 -0400] “GET /software HTTP/1.0” 302 -* |
|ix-or10-06.ix.netcom.com – – [01/Jul/1995:00:02:40 -0400] “GET /software/winvn HTTP/1.0” 302 -* |
|ix-or10-06.ix.netcom.com – – [01/Jul/1995:00:03:24 -0400] “GET /software HTTP/1.0” 302 -* |
+—————————————————————————————————————————————————+

2.5 Fill NAs with 0s

# Replace all null content_size values with 0.

cleaned_df = split_df.na.fill({‘content_size’: 0})

3. Webserver  logs parsing with SparkR

library(SparkR)
library(stringr)
file_location = "/FileStore/tables/NASA_access_log_Jul95.gz"
file_location = "/FileStore/tables/NASA_access_log_Aug95.gz"
# Load the SparkR library


# Initiate a SparkR session
sparkR.session()
sc <- sparkR.session()
sqlContext <- sparkRSQL.init(sc)
df <- read.text(sqlContext,"/FileStore/tables/NASA_access_log_Jul95.gz")

#df=SparkR::select(df, "value")
#head(SparkR::collect(df))
#m=regexp_extract(df$value,'\\\\S+',1)

a=df %>% 
  withColumn('host', regexp_extract(df$value, '^(\\S+)', 1)) %>%
  withColumn('timestamp', regexp_extract(df$value, "((\\S+ -\\d{4}))", 2)) %>%
  withColumn('path', regexp_extract(df$value, '(\\"\\w+\\s+([^\\s]+)\\s+HTTP.*")', 2))  %>%
  withColumn('status', regexp_extract(df$value, '(^.*"\\s+([^\\s]+))', 2)) %>%
  withColumn('content_size', regexp_extract(df$value, '(^.*\\s+(\\d+)$)', 2))
#b=a%>% select(host,timestamp,path,status,content_type)
head(SparkR::collect(a),10)

1 199.72.81.55 – – [01/Jul/1995:00:00:01 -0400] “GET /history/apollo/ HTTP/1.0” 200 6245
2 unicomp6.unicomp.net – – [01/Jul/1995:00:00:06 -0400] “GET /shuttle/countdown/ HTTP/1.0” 200 3985
3 199.120.110.21 – – [01/Jul/1995:00:00:09 -0400] “GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0” 200 4085
4 burger.letters.com – – [01/Jul/1995:00:00:11 -0400] “GET /shuttle/countdown/liftoff.html HTTP/1.0” 304 0
5 199.120.110.21 – – [01/Jul/1995:00:00:11 -0400] “GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0” 200 4179
6 burger.letters.com – – [01/Jul/1995:00:00:12 -0400] “GET /images/NASA-logosmall.gif HTTP/1.0” 304 0
7 burger.letters.com – – [01/Jul/1995:00:00:12 -0400] “GET /shuttle/countdown/video/livevideo.gif HTTP/1.0” 200 0
8 205.212.115.106 – – [01/Jul/1995:00:00:12 -0400] “GET /shuttle/countdown/countdown.html HTTP/1.0” 200 3985
9 d104.aa.net – – [01/Jul/1995:00:00:13 -0400] “GET /shuttle/countdown/ HTTP/1.0” 200 3985
10 129.94.144.152 – – [01/Jul/1995:00:00:13 -0400] “GET / HTTP/1.0” 200 7074
host timestamp
1 199.72.81.55 [01/Jul/1995:00:00:01 -0400
2 unicomp6.unicomp.net [01/Jul/1995:00:00:06 -0400
3 199.120.110.21 [01/Jul/1995:00:00:09 -0400
4 burger.letters.com [01/Jul/1995:00:00:11 -0400
5 199.120.110.21 [01/Jul/1995:00:00:11 -0400
6 burger.letters.com [01/Jul/1995:00:00:12 -0400
7 burger.letters.com [01/Jul/1995:00:00:12 -0400
8 205.212.115.106 [01/Jul/1995:00:00:12 -0400
9 d104.aa.net [01/Jul/1995:00:00:13 -0400
10 129.94.144.152 [01/Jul/1995:00:00:13 -0400
path status content_size
1 /history/apollo/ 200 6245
2 /shuttle/countdown/ 200 3985
3 /shuttle/missions/sts-73/mission-sts-73.html 200 4085
4 /shuttle/countdown/liftoff.html 304 0
5 /shuttle/missions/sts-73/sts-73-patch-small.gif 200 4179
6 /images/NASA-logosmall.gif 304 0
7 /shuttle/countdown/video/livevideo.gif 200 0
8 /shuttle/countdown/countdown.html 200 3985
9 /shuttle/countdown/ 200 3985
10 / 200 7074

4 Webserver logs parsing with SparklyR

install.packages("sparklyr")
library(sparklyr)
library(dplyr)
library(stringr)
#sc <- spark_connect(master = "local", version = "2.1.0")
sc <- spark_connect(method = "databricks")
sdf <-spark_read_text(sc, name="df", path = "/FileStore/tables/NASA_access_log*.gz")
sdf
Installing package into ‘/databricks/spark/R/lib’
# Source: spark [?? x 1]
   line                                                                         
                                                                           
 1 "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1…
 2 "unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] \"GET /shuttle/countd…
 3 "199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] \"GET /shuttle/missions/sts…
 4 "burger.letters.com - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/countdow…
 5 "199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/missions/sts…
 6 "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /images/NASA-logo…
 7 "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdow…
 8 "205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdown/c…
 9 "d104.aa.net - - [01/Jul/1995:00:00:13 -0400] \"GET /shuttle/countdown/ HTTP…
10 "129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] \"GET / HTTP/1.0\" 200 7074"
# … with more rows
#install.packages(“sparklyr”)
library(sparklyr)
library(dplyr)
library(stringr)
#sc <- spark_connect(master = "local", version = "2.1.0")
sc <- spark_connect(method = "databricks")
sdf <-spark_read_text(sc, name="df", path = "/FileStore/tables/NASA_access_log*.gz")
sdf <- sdf %>% mutate(host = regexp_extract(line, '^(\\\\S+)',1)) %>% 
               mutate(timestamp = regexp_extract(line, '((\\\\S+ -\\\\d{4}))',2)) %>%
               mutate(path = regexp_extract(line, '(\\\\"\\\\w+\\\\s+([^\\\\s]+)\\\\s+HTTP.*")',2)) %>%
               mutate(status = regexp_extract(line, '(^.*"\\\\s+([^\\\\s]+))',2)) %>%
               mutate(content_size = regexp_extract(line, '(^.*\\\\s+(\\\\d+)$)',2))

5 Hosts

5.1  RDD

5.11 Parse and map to hosts to groups

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))

# Create tuples of (host,1) and apply reduceByKey() and order by descending
rslt=(parsed_rdd2.map(lambda x😦x[0],1))
                 .reduceByKey(lambda a,b:a+b)
                 .takeOrdered(10, lambda x: -x[1]))
rslt
Out[18]:
[(‘piweba3y.prodigy.com’, 21988),
(‘piweba4y.prodigy.com’, 16437),
(‘piweba1y.prodigy.com’, 12825),
(‘edams.ksc.nasa.gov’, 11962),
(‘163.206.89.4’, 9697),
(‘news.ti.com’, 8161),
(‘www-d1.proxy.aol.com’, 8047),
(‘alyssa.prodigy.com’, 8037),
(‘siltb10.orl.mmc.com’, 7573),
(‘www-a2.proxy.aol.com’, 7516)]

5.12Plot counts of hosts

import seaborn as sns

import pandas as pd import matplotlib.pyplot as plt df=pd.DataFrame(rslt,columns=[‘host’,‘count’]) sns.barplot(x=‘host’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.6, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8) display()

5.2 PySpark

5.21 Compute counts of hosts

df= (cleaned_df
     .groupBy('host')
     .count()
     .orderBy('count',ascending=False))
df.show(5)
+——————–+—–+
| host|count|
+——————–+—–+
|piweba3y.prodigy….|21988|
|piweba4y.prodigy….|16437|
|piweba1y.prodigy….|12825|
| edams.ksc.nasa.gov |11964|
| 163.206.89.4 | 9697|
+——————–+—–+
only showing top 5 rows

5.22 Plot count of hosts

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df1=df.toPandas()
df2 = df1.head(10)
df2.count()
sns.barplot(x='host',y='count',data=df2)
plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9)
plt.xlabel("Hosts")
plt.ylabel('Count')
plt.xticks(rotation="vertical",fontsize=10)
display()

5.3 SparkR

5.31 Compute count of hosts

c <- SparkR::select(a,a$host)
df=SparkR::summarize(SparkR::groupBy(c, a$host), noHosts = count(a$host))
df1 =head(arrange(df,desc(df$noHosts)),10)
head(df1)
                  host noHosts
1 piweba3y.prodigy.com   17572
2 piweba4y.prodigy.com   11591
3 piweba1y.prodigy.com    9868
4   alyssa.prodigy.com    7852
5  siltb10.orl.mmc.com    7573
6 piweba2y.prodigy.com    5922

5.32 Plot count of hosts

library(ggplot2)
p <-ggplot(data=df1, aes(x=host, y=noHosts,fill=host)) +   geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Host') + ylab('Count')
p

5.4 SparklyR

5.41 Compute count of Hosts

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(host) %>% group_by(host) %>% summarise(noHosts=n()) %>% arrange(desc(noHosts))
df2 <-head(df1,10)

5.42 Plot count of hosts

library(ggplot2)

p <-ggplot(data=df2, aes(x=host, y=noHosts,fill=host)) + geom_bar(stat=identity”)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(Host’) + ylab(Count’)

p

6 Paths

6.1 RDD

6.11 Parse and map to hosts to groups

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
rslt=(parsed_rdd2.map(lambda x😦x[5],1))
                 .reduceByKey(lambda a,b:a+b)
                 .takeOrdered(10, lambda x: -x[1]))
rslt
[(‘”GET /images/NASA-logosmall.gif HTTP/1.0″‘, 207520),
(‘”GET /images/KSC-logosmall.gif HTTP/1.0″‘, 164487),
(‘”GET /images/MOSAIC-logosmall.gif HTTP/1.0″‘, 126933),
(‘”GET /images/USA-logosmall.gif HTTP/1.0″‘, 126108),
(‘”GET /images/WORLD-logosmall.gif HTTP/1.0″‘, 124972),
(‘”GET /images/ksclogo-medium.gif HTTP/1.0″‘, 120704),
(‘”GET /ksc.html HTTP/1.0″‘, 83209),
(‘”GET /images/launch-logo.gif HTTP/1.0″‘, 75839),
(‘”GET /history/apollo/images/apollo-logo1.gif HTTP/1.0″‘, 68759),
(‘”GET /shuttle/countdown/ HTTP/1.0″‘, 64467)]

6.12 Plot counts of HTTP Requests

import seaborn as sns

df=pd.DataFrame(rslt,columns=[‘path’,‘count’]) sns.barplot(x=‘path’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.7, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8)

display()

6.2 Pyspark

6.21 Compute count of HTTP Requests

df= (cleaned_df
     .groupBy('path')
     .count()
     .orderBy('count',ascending=False))
df.show(5)
Out[20]:
+——————–+——+
| path| count|
+——————–+——+
|/images/NASA-logo…|208362|
|/images/KSC-logos…|164813|
|/images/MOSAIC-lo…|127656|
|/images/USA-logos…|126820|
|/images/WORLD-log…|125676|
+——————–+——+
only showing top 5 rows

6.22 Plot count of HTTP Requests

import matplotlib.pyplot as plt

import pandas as pd import seaborn as sns df1=df.toPandas() df2 = df1.head(10) df2.count() sns.barplot(x=‘path’,y=‘count’,data=df2)

plt.subplots_adjust(bottom=0.7, right=0.8, top=0.9) plt.xlabel(“HTTP Requests”) plt.ylabel(‘Count’) plt.xticks(rotation=90,fontsize=8)

display()

 

6.3 SparkR

6.31Compute count of HTTP requests

library(SparkR)
c <- SparkR::select(a,a$path)
df=SparkR::summarize(SparkR::groupBy(c, a$path), numRequest = count(a$path))
df1=head(df)

3.14 Plot count of HTTP Requests

library(ggplot2)
p <-ggplot(data=df1, aes(x=path, y=numRequest,fill=path)) +   geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))+ xlab('Path') + ylab('Count')
p

6.4 SparklyR

6.41 Compute count of paths

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(path) %>% group_by(path) %>% summarise(noPaths=n()) %>% arrange(desc(noPaths))
df2 <-head(df1,10)
df2
# Source: spark [?? x 2]
# Ordered by: desc(noPaths)
   path                                    noPaths
                                        
 1 /images/NASA-logosmall.gif               208362
 2 /images/KSC-logosmall.gif                164813
 3 /images/MOSAIC-logosmall.gif             127656
 4 /images/USA-logosmall.gif                126820
 5 /images/WORLD-logosmall.gif              125676
 6 /images/ksclogo-medium.gif               121286
 7 /ksc.html                                 83685
 8 /images/launch-logo.gif                   75960
 9 /history/apollo/images/apollo-logo1.gif   68858
10 /shuttle/countdown/                       64695

6.42 Plot count of Paths

library(ggplot2)
p <-ggplot(data=df2, aes(x=path, y=noPaths,fill=path)) +   geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Path') + ylab('Count')
p

7.1 RDD

7.11 Compute count of HTTP Status

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])

parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
rslt=(parsed_rdd2.map(lambda x😦x[7],1))
                 .reduceByKey(lambda a,b:a+b)
                 .takeOrdered(10, lambda x: -x[1]))
rslt
Out[22]:
[(‘200’, 3095682),
(‘304’, 266764),
(‘302’, 72970),
(‘404’, 20625),
(‘403’, 225),
(‘500’, 65),
(‘501’, 41)]

1.37 Plot counts of HTTP response status’

import seaborn as sns

df=pd.DataFrame(rslt,columns=[‘status’,‘count’]) sns.barplot(x=‘status’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.4, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8)

display()

7.2 Pyspark

7.21 Compute count of HTTP status

status_count=(cleaned_df
                .groupBy('status')
                .count()
                .orderBy('count',ascending=False))
status_count.show()
+——+——-+
|status| count|
+——+——-+
| 200|3100522|
| 304| 266773|
| 302| 73070|
| 404| 20901|
| 403| 225|
| 500| 65|
| 501| 41|
| 400| 15|
| null| 1|

7.22 Plot count of HTTP status

Plot the HTTP return status vs the counts

df1=status_count.toPandas()

df2 = df1.head(10) df2.count() sns.barplot(x=‘status’,y=‘count’,data=df2) plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9) plt.xlabel(“HTTP Status”) plt.ylabel(‘Count’) plt.xticks(rotation=“vertical”,fontsize=10) display()

7.3 SparkR

7.31 Compute count of HTTP Response status

library(SparkR)
c <- SparkR::select(a,a$status)
df=SparkR::summarize(SparkR::groupBy(c, a$status), numStatus = count(a$status))
df1=head(df)

3.16 Plot count of HTTP Response status

library(ggplot2)
p <-ggplot(data=df1, aes(x=status, y=numStatus,fill=status)) +   geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Status') + ylab('Count')
p

7.4 SparklyR

7.41 Compute count of status

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(status) %>% group_by(status) %>% summarise(noStatus=n()) %>% arrange(desc(noStatus))
df2 <-head(df1,10)
df2
# Source: spark [?? x 2]
# Ordered by: desc(noStatus)
  status noStatus
       
1 200     3100522
2 304      266773
3 302       73070
4 404       20901
5 403         225
6 500          65
7 501          41
8 400          15
9 ""            1

7.42 Plot count of status

library(ggplot2)

p <-ggplot(data=df2, aes(x=status, y=noStatus,fill=status)) + geom_bar(stat=identity”)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(Status’) + ylab(Count’) p

8.1 RDD

8.12 Compute count of content size

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
rslt=(parsed_rdd2.map(lambda x😦x[8],1))
                 .reduceByKey(lambda a,b:a+b)
                 .takeOrdered(10, lambda x: -x[1]))
rslt
Out[24]:
[(‘0’, 280017),
(‘786’, 167281),
(‘1204’, 140505),
(‘363’, 111575),
(‘234’, 110824),
(‘669’, 110056),
(‘5866’, 107079),
(‘1713’, 66904),
(‘1173’, 63336),
(‘3635’, 55528)]

8.21 Plot content size

import seaborn as sns

df=pd.DataFrame(rslt,columns=[‘content_size’,‘count’]) sns.barplot(x=‘content_size’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.4, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8) display()

8.2 Pyspark

8.21 Compute count of content_size

size_counts=(cleaned_df
                .groupBy('content_size')
                .count()
                .orderBy('count',ascending=False))
size_counts.show(10)
+------------+------+
|content_size| count|
+------------+------+
|           0|313932|
|         786|167709|
|        1204|140668|
|         363|111835|
|         234|111086|
|         669|110313|
|        5866|107373|
|        1713| 66953|
|        1173| 63378|
|        3635| 55579|
+------------+------+
only showing top 10 rows

8.22 Plot counts of content size

Plot the path access versus the counts

df1=size_counts.toPandas()

df2 = df1.head(10) df2.count() sns.barplot(x=‘content_size’,y=‘count’,data=df2) plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9) plt.xlabel(“content_size”) plt.ylabel(‘Count’) plt.xticks(rotation=“vertical”,fontsize=10) display()

8.3 SparkR

8.31 Compute count of content size

library(SparkR)
c <- SparkR::select(a,a$content_size)
df=SparkR::summarize(SparkR::groupBy(c, a$content_size), numContentSize = count(a$content_size))
df1=head(df)
df1
     content_size numContentSize
1        28426           1414
2        78382            293
3        60053              4
4        36067              2
5        13282            236
6        41785            174
8.32 Plot count of content sizes
library(ggplot2)

p <-ggplot(data=df1, aes(x=content_size, y=numContentSize,fill=content_size)) + geom_bar(stat=identity”) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(Content Size’) + ylab(Count’)

p

8.4 SparklyR

8.41Compute count of content_size

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(content_size) %>% group_by(content_size) %>% summarise(noContentSize=n()) %>% arrange(desc(noContentSize))
df2 <-head(df1,10)
df2
# Source: spark [?? x 2]
# Ordered by: desc(noContentSize)
   content_size noContentSize
                   
 1 0                   280027
 2 786                 167709
 3 1204                140668
 4 363                 111835
 5 234                 111086
 6 669                 110313
 7 5866                107373
 8 1713                 66953
 9 1173                 63378
10 3635                 55579

8.42 Plot count of content_size

library(ggplot2)
p <-ggplot(data=df2, aes(x=content_size, y=noContentSize,fill=content_size)) +   geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Content size') + ylab('Count')
p

Conclusion: I spent many,many hours struggling with Regex and getting RDDs,Pyspark to work. Also had to spend a lot of time trying to work out the syntax for SparkR and SparklyR for parsing. After you parse the logs plotting and analysis is a piece of cake! This is definitely worth a try!

Watch this space!!

Also see
1. Practical Machine Learning with R and Python – Part 3
2. Deep Learning from first principles in Python, R and Octave – Part 5
3. My book ‘Cricket analytics with cricketr and cricpy’ is now on Amazon
4. Latency, throughput implications for the Cloud
5. Modeling a Car in Android
6. Architecting a cloud based IP Multimedia System (IMS)
7. Dabbling with Wiener filter using OpenCV

To see all posts click Index of posts

My book ‘Cricket analytics with cricketr and cricpy’ is now on Amazon

‘Cricket analytics with cricketr and cricpy – Analytics harmony with R and Python’ is now available on Amazon in both paperback ($21.99) and kindle ($9.99/Rs 449) versions. The book includes analysis of cricketers using both my R package ‘cricketr’ and my python package ‘cricpy’ for all formats of the game namely Test, ODI and T20. Both packages use data from ESPN Cricinfo Statsguru. The paperback is available on Amazon for $21.99 and the kindle version is available for $9.99/Rs 449

Pick up your copy today!

The book includes the following chapters

CONTENTS

Introduction 7
1. Cricket analytics with cricketr 9
1.1. Introducing cricketr! : An R package to analyze performances of cricketers 10
1.2. Taking cricketr for a spin – Part 1 48
1.2. cricketr digs the Ashes! 69
1.3. cricketr plays the ODIs! 97
1.4. cricketr adapts to the Twenty20 International! 139
1.5. Sixer – R package cricketr’s new Shiny avatar 168
1.6. Re-introducing cricketr! : An R package to analyze performances of cricketers 178
1.7. cricketr sizes up legendary All-rounders of yesteryear 233
1.8. cricketr flexes new muscles: The final analysis 277
1.9. The Clash of the Titans in Test and ODI cricket 300
1.10. Analyzing performances of cricketers using cricketr template 338
2. Cricket analytics with cricpy 352
2.1 Introducing cricpy:A python package to analyze performances of cricketers 353
2.2 Cricpy takes a swing at the ODIs 405
Analysis of Top 4 batsman 448
2.3 Cricpy takes guard for the Twenty20s 449
2.4 Analyzing batsmen and bowlers with cricpy template 490
9. Average runs against different opposing teams 493
3. Other cricket posts in R 500
3.1 Analyzing cricket’s batting legends – Through the mirage with R 500
3.2 Mirror, mirror … the best batsman of them all? 527
4. Appendix 541
Cricket analysis with Machine Learning using Octave 541
4.1 Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid 542
4.2 Informed choices through Machine Learning-2 Pitting together Kumble, Kapil, Chandra 555
Further reading 569
Important Links 570

You can download the latest PDF version of the book  at  ‘Cricket analytics with cricketr and cricpy: Analytics harmony with R and Python-6th edition

Also see
1. My book “Deep Learning from first principles” now on Amazon
2. Practical Machine Learning with R and Python – Part 1
3. Revisiting World Bank data analysis with WDI and gVisMotionChart
4. Natural language processing: What would Shakespeare say?
5. Optimal Cloud Computing
6. Pitching yorkpy … short of good length to IPL – Part 1
7. Computer Vision: Ramblings on derivatives, histograms and contours

To see all posts click Index of posts

Analyzing T20 matches with yorkpy templates

1. Introduction

In this post I create yorkpy templates for end-to-end analysis of any T20 matches that are available on Cricsheet as yaml format. These templates can be used to analyze Intl. T20, IPL, BBL and Natwest T20. In fact they can be used for any T20 games which have been saved in the yaml format as specified by Cricsheet Cricheet.

Noteyorkpy is the clone of my R package yorkr see yorkr pads up for the Twenty20s: Part 1- Analyzing team”s match performance

With these templates you can convert all T20 match data which is in yaml format to Pandas dataframes and save them as CSV. Note The data for Intl T20, IPL, BBL and Natwest T20 have already been converted and are available at allYorkpyData. This templates is also available at Github at yorkpyTemplate. The template includes the following steps

  1. Template for conversion and setup
  2. Analysis of Any T20 match
  3. Analysis of a T20 team in all matches against another T20 team
  4. Analysis of a T20 team in all matches against all other teams
  5. Analysis of T20 batsmen and bowlers

You can recreate the files as more matches are added to Cricsheet site in IPL 2017 and future seasons. This post contains all the steps needed for detailed analysis of IPL matches, teams and IPL player. This will also be my reference in future if I decide to analyze IPL in future!

Install yorkpy with pip install yorkpy

Data conversion of the yaml files have to be done before any analysis of T20 batsmen, bowlers, any T20 match matches between any 2 T20 team or analysis of a teams performance against all other team can be done

The first step is To convert the YAML files that are available for the different T20 leagues namely Intl. T20, IPL, BBL, Natwest T20 which are available in yaml format in Cricsheet. For initial data setup we need to use slighly different functions for each of the T20 leagues since the teams are different. The function to convert yaml to Pandas dataframe and save as CSV is common for all leagues

A. For International T20

import yorkpy.analytics as yka
# COnvert yaml to pandas and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\data1")

# Save all matches between any 2 Intl T20 countries
#yka.saveAllMatchesBetween2IntlT20s(dir1)

#Save all matches between an Intl.T20 country and all other countries
#yka.saveAllMatchesAllOppositionIntlT20(dir1)

# Get batting details for a country
#yka.getTeamBattingDetails(<country>,dir=dir1, save=True)

#Get bowling details
#yka.getTeamBowlingDetails(<country>,dir=dir1, save=True)

B. For Indian Premier League (IPL)

import yorkpy.analytics as yka
# COnvert yaml to pandas and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\data1")

# Save all matches between any 2 IPL teams
#yka.saveAllMatchesBetween2IPLTeams(dir1)

#Save all matches between an IPL team and all other teams
#yka.saveAllMatchesAllOppositionIPLT20(dir1)

# Get batting details for an IPL team
#yka.getTeamBattingDetails(<team1>,dir=dir1, save=True)

#Get bowling details for an IPL team
#yka.getTeamBowlingDetails(<team1>>,dir=dir1, save=True)

C. For Big Bash League (BBL)

import yorkpy.analytics as yka
# COnvert yaml to pandas and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\data1")

# Save all matches between any 2 BBL teams
#yka.saveAllMatchesBetween2BBLTeams(dir1)

#Save all matches between an BBL team and all other teams
#yka.saveAllMatchesAllOppositionBBLT20(dir1)

# Get batting details for an BBL team
#yka.getTeamBattingDetails(<team1>,dir=dir1, save=True)

#Get bowling details for an BBL team
#yka.getTeamBowlingDetails(<team1>>,dir=dir1, save=True)

D For Natwest T20

import yorkpy.analytics as yka
# COnvert yaml to pandas and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\data1")

# Save all matches between any 2 NWB teams
#yka.saveAllMatchesBetween2NWBTeams(dir1)

#Save all matches between an NWB team and all other teams
#yka.saveAllMatchesAllOppositionNWBT20(dir1)

# Get batting details for an NWB team
#yka.getTeamBattingDetails(<team1>,dir=dir1, save=True)

#Get bowling details for an NWB team
#yka.getTeamBowlingDetails(<team1>>,dir=dir1, save=True)

Once the conversion has been done and the data has been setup we can use any of the yorkpy functions for the the 4 leagues (Intl. T20, IPL, BBL or Natwest T20) There are four classes of functions. These functions can be used for any of the

  1. Class 1 – Functions that analyze a single T20 match
  2. Class 2 – Functions that analyze the performance of a T20 team in all matches against another T20 team
  3. Class 3 – Functions that analyze the performance of a T20 team against all other teams
  4. Class 4 – Functions that analyze individual T20 batsmen or bowler

2. Class 1 functions

These functions analyze a single T20 match (Intl T20, BBL, IPL or Natwest T20) To see actual usage of Class 1 function see Pitching yorkpy … short of good length to IPL – Part 1

import yorkpy.analytics as yka
# Get scorecard
#scorecard,extras=yka.teamBattingScorecardMatch(<team1>,"Name of Team")

#Get partnership
#match=pd.read_csv("<match.csv>")
#yka.teamBatsmenPartnershipMatch(match,<team1>,<team2>,plot=True/False)

#Batsmen vs bowler
#match=pd.read_csv("<match.csv>")
#yka.teamBatsmenVsBowlersMatch(match,<team1>,<team2>,plot=True/False)

#Bowling scorecard
#match=pd.read_csv("<match.csv>")
#a=yka.teamBowlingScorecardMatch(match,<team1>)

#Wicket Kind
#match=pd.read_csv("<match.csv>")
#yka.teamBowlingWicketKindMatch((match,<team1>,<team2>)

#Wicket Match
#match=pd.read_csv("<match.csv>")
#yka.teamBowlingWicketMatch(match,<team1>,<team2>,plot=True/False)

#Bowler vs Batsman
#match=pd.read_csv("<match.csv>")
#yka.teamBowlersVsBatsmenMatch(match,<team1>,<team2>)

#Match worm chart
#match=pd.read_csv("<match.csv>")
#yka.matchWormChart(match,<team1>,<team2>,)

3. Class 2 functions

These set of functions analyze the performance a T20 team for e.g. Intl T20, BBL or Natwest T20 in all matches against another T20 team (country or IPL, BBL or Natwest T20 team. To see usages of Class 2 functions see Pitching yorkpy…on the middle and outside off-stump to IPL – Part 2

import yorkpy.analytics as yka

# Batting partnerships - Table
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv)
#m=yka.teamBatsmenPartnershiOppnAllMatches(team1_team2_matches,<team1/team2>,report="summary/detailed", top=<n>)

# Batting partnerships - Plot
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv)
#yka.teamBatsmenPartnershipOppnAllMatchesChart(team1_team2_matches,<team1>,<team2> plot=<True/False>, top=<N>, partnershipRuns=<M>)

#Batsmen vs Bowlers
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv)
#yka.teamBatsmenVsBowlersOppnAllMatches(team1_team2_matches,<team1>,<team2> plot=<True/False>, top=<N>,runsScored=<M>)

# Batting scorecard
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv)
#scorecard=yka.teamBattingScorecardOppnAllMatches(team1_team2_matches,<team1>,<team2>)

#Bowling scorecard
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv)
#scorecard=yka.teamBowlingScorecardOppnAllMatches(team1_team2_matches,<team1>,<team2>)

#Bowling wicket kind
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv)
#yka.teamBowlingWicketKindOppositionAllMatches(team1_team2_matches,<team1>,<team2>,plot=<True/False>,top=<N>,wickets=<M>)

#Bowler vs batsman
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv)
#yka.teamBowlersVsBatsmenOppnAllMatches(team1_team2_matches,<team1>,<team2>,plot=<True/False>,top=<N>,runsConceded=<M>)

# Wins vs losses
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv)
#yka.plotWinLossBetweenTeams(team1_team2_matches,<team1>,<team2>)

#Wins by win type
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv)
#yka.plotWinsByRunOrWickets(team1_team2_matches,<team1>)

#Wins by toss decision
#team1_team2_matches = pd.read_csv(<matches_between_2_teams.csv)
#yka.plotWinsbyTossDecision(team1_team2_matches,<team1>,tossDecision=<field/bat>)

4. Class 3 functions

This set of functions deals with analyzing the performance of a T20 team (Intl. T20, IPL, BBL or Natwest T20) in all matches against all other teams. To see usages of Class 3 functions see Pitching yorkpy…swinging away from the leg stump to IPL – Part 3. After the data is save all matches between all oppositions we can use this data

import yorkpy.analytics as yka
#Batsman partnerships
#allmatches = pd.read_csv("<allmatchesForteam")
#m=yka.teamBatsmenPartnershiAllOppnAllMatches(allmatches,<team1>,report=<"summary"/"detailed", top=<N>,partnershipRuns=<M>)

#Batsmen vs Bowlers
#allmatches = pd.read_csv("<allmatchesForteam")
#yka.teamBatsmenVsBowlersAllOppnAllMatches(allmatches,<team1>,plot=<True/False>,top=N>,runsScored=<M>)

#Batting scorecard
#allmatches = pd.read_csv("<allmatchesForteam")
#scorecard=yka.teamBattingScorecardAllOppnAllMatches(allmatches,<team1>)

#Bowling scorecard
#allmatches = pd.read_csv("<allmatchesForteam")
#scorecard=yka.teamBowlingScorecardAllOppnAllMatches(allmatches,<team1>)

#Bowling wicket kind
#allmatches = pd.read_csv("<allmatchesForteam")
#yka.teamBowlingWicketKindAllOppnAllMatches(allmatches,<team1>,plot=<True/False>,top=<N>,wickets=<M>)

# Bowler vs Batsmen
#allmatches = pd.read_csv("<allmatchesForteam")
#yka.teamBowlersVsBatsmenAllOppnAllMatches(allmatches,<team1>,plot=<True/False>,top=<N>,runsConceded=<M>)

# Wins vs losses
#allmatches = pd.read_csv("<allmatchesForteam")
#yka.plotWinLossByTeamAllOpposition(allmatches,<team1>,plot=<"summary"/"detailed">)

# Wins by win type
#allmatches = pd.read_csv("<allmatchesForteam")
#yka.plotWinsByRunOrWicketsAllOpposition(allmatches,<team1>)

# Wins by toss decision
#allmatches = pd.read_csv("<allmatchesForteam")
#yka.plotWinsbyTossDecisionAllOpposition(allmatches,<team1>,tossDecision='bat'/'field',plot='summary'/'detailed')

5. Class 4 functions

This set of functions are used for analyzing individual batsman/bowler. From the converted xxx-BattingDetails.csv and xxx-BowlingDetails.csv we can get the batsman and bowler details as shown below. Subsequenly we can perform analyses of the individual batsman and bowler. To see actual usages of Class 4 functions see Pitching yorkpy … in the block hole – Part 4

import yorkpy.analytics as yka

#Batsman analyses
#Get batsman Dataframe
#batsmanDF=yka.getBatsmanDetails(<team1>,<batsman>,dir=dir1)

#Batsman Runs vs Deliveries
#yka.batsmanRunsVsDeliveries(batsmanDF,<batsmanName>)

#Batsman fours and sixes
#yka.batsmanFoursSixes(batsmanDF,<batsmanName>)


#Batsman dismissals
#yka.batsmanDismissals(batsmanDF,<batsmanName>)

#Batsman Runs vs Strike Rate
#yka.batsmanRunsVsStrikeRate(batsmanDF,<batsmanName>)

#Batsman Moving average
#yka.batsmanMovingAverage(batsmanDF,<batsmanName>)


#Batsman Cumulative average
#yka.batsmanCumulativeAverageRuns(batsmanDF,<batsmanName>)

#Batsman Cumulative Strike rate
#yka.batsmanCumulativeStrikeRate(batsmanDF,<batsmanName>)

#Batsman Runs against opposition
#yka.batsmanRunsAgainstOpposition(batsmanDF,<batsmanName>)

#Batsman Runs against opposition
#yka.batsmanRunsVenue(batsmanDF,<batsmanName>)


#Bowler analyses
#Get bowler dataframe
#bowlerDF=yka.getBowlerWicketDetails(<team1>,<bowler>dir=dir1)

#Mean economy rate
#yka.bowlerMeanEconomyRate(bowlerDF,<bowlerName>)


#Mean Economy rate
#yka.bowlerMeanEconomyRate(bowlerDF,<bowlerName>)

#Mean Runs conceded
#yka.bowlerMeanRunsConceded(bowlerDF,<bowlerName>)

#Moving average of wickets
#yka.bowlerMovingAverage((bowlerDF,<bowlerName>)

# Cumulative average of wickets
#yka.bowlerCumulativeAvgWickets(bowlerDF,<bowlerName>)

# Cumulative economy rate
#yka.bowlerCumulativeAvgEconRate(bowlerDF,<bowlerName>)

# Wicket plot
#yka.bowlerWicketPlot(df,name)

# Wicket against opposition
#yka.bowlerWicketsAgainstOpposition(bowlerDF,<bowlerName>)

# Wickets at venue
#yka.bowlerWicketsVenue(bowlerDF,<bowlerName>)

Important note: Do check out my other posts using yorkpy at yorkpy-posts

Conclusion

With the above templates detailed analyis can be done on

  • A T20 match
  • Performance of a team in all matches against another team
  • Performance of a team in all matches against all other teams
  • Individual batting and bowling performances

See also

  1. Deep Learning from first principles in Python, R and Octave – Part 5
  2. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)
  3. Practical Machine Learning with R and Python – Part 4
  4. Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8
  5. A method to crowd source pothole marking on (Indian) roads

To see all posts click Index of posts

yorkpy takes a hat-trick, bowls out Intl. T20s, BBL and Natwest T20!!!

“Dear, dear! How queer everything is to-day! And yesterday things went on just as usual. I wonder if I’ve been changed in the night? Let me think: was I the same when I got up this morning? I almost think I can remember feeling a little different. But if I’m not the same, the next question is ’Who in the world am I? Ah, that’s the great puzzle!”

             Alice's adventures  in Wonderland, Lewis Carroll

1. Introduction

In this post, yorkpy clean bowls the following T20 formats namely International T20s, Big Bash League and Natwest T20 Blast. I take yorkpy on a spin through these T20 leagues. In the post below,I choose a random set of about 10-12 of the overall 63 functions that yorkpy has, and execute them for each of the different T20 leagues – Intl T20s, BBL and Natwest T20s. yorkpy, is the python avatar of my R package yorkr, see Introducing cricket package yorkr: Part 1- Beaten by sheer pace!

There were a couple of new functions that needed to be added for each of the T20 leagues – Intl T20, BBL and Natwest T20 to take into account the different teams in each of these leagues. Further some bugs were also ironed out in tje latest version of yorkpy. yorkpy uses data from Cricsheet . The match data is in the form of YAML files. yorkpy converts these YAML files to dataframes. YAML files are very detailed and include a ball-by-ball account of the match.

– You can clone/fork the latest code for yorkpy from github yorkpy
– This post has also been published in RPubs at yorkpy takes a hat-trick
– You can download the PDF version of this post at yorkpy takes a hat-trick

The data for IPL, Intl. T20, BBL and Natwest T20 have already been converted into pandas dataframes and saved as CSVs. You can download the converted files from Github at [allYorkpyT20Data])(https://github.com/tvganesh/allYorkpyT20Data)

yorkpy has the following 4 main classes of functions

A.Functions analyzing individual T20 match (Class 1)

This was demonstrated in Pitching yorkpy . short of good length to IPL – Part 1 The functions deal with individual T20 matches. The functions are

  1. convertYaml2PandasDataframeT20()
  2. convertAllYaml2PandasDataframesT20()
  3. teamBattingScorecardMatch()
  4. teamBatsmenPartnershipMatch()
  5. teamBatsmenVsBowlersMatch()
  6. teamBowlingScorecardMatch()
  7. teamBowlingWicketKindMatch()
  8. teamBowlingWicketRunsMatch()
  9. teamBowlingWicketMatch()
  10. teamBowlersVsBatsmenMatch()
  11. matchWormChart()

B. Functions that analyze all matches between 2 T20 teams (Class 2

Pitching yorkpy.on the middle and outside off-stump to IPL – Part 2 included functions that analyze head-to-head confrontation between any 2 T20 teams The functions are

  1. getAllMatchesBetweenTeams()
  2. saveAllMatchesBetween2IPLTeams()
  3. getAllMatchesBetweenTeams()
  4. saveAllMatchesBetween2IPLTeams()
  5. teamBatsmenPartnershiOppnAllMatches()
  6. teamBatsmenPartnershipOppnAllMatchesChart()
  7. teamBatsmenVsBowlersOppnAllMatches()
  8. teamBattingScorecardOppnAllMatches()
  9. teamBowlingScorecardOppnAllMatches()
  10. teamBowlingWicketKindOppositionAllMatches()
  11. teamBowlersVsBatsmenOppnAllMatches()
  12. plotWinLossBetweenTeams()
  13. plotWinsByRunOrWickets() 23.plotWinsbyTossDecision()

C. Functions that analyze the performance of a T20 team against all other teams (Class 3)

The post Pitching yorkpy.swinging away from the leg stump to IPL – Part 3 is based on Class C set of functions shown below

  1. getAllMatchesAllOpposition()
  2. saveAllMatchesAllOppositionIPLT20(dir1)
  3. getAllMatchesAllOpposition()
  4. saveAllMatchesAllOppositionIPLT20()
  5. teamBatsmenPartnershiAllOppnAllMatches()
  6. teamBatsmenPartnershipAllOppnAllMatchesChart()
  7. teamBatsmenVsBowlersAllOppnAllMatches()
  8. teamBattingScorecardAllOppnAllMatches()
  9. teamBowlingScorecardAllOppnAllMatches()
  10. teamBowlingWicketKindAllOppnAllMatches()
  11. teamBowlersVsBatsmenAllOppnAllMatches()
  12. plotWinLossByTeamAllOpposition()
  13. plotWinsByRunOrWicketsAllOpposition()
  14. plotWinsbyTossDecisionAllOpposition()

D. Functions that analyze performances of T20 batsmen and bowlers (Class 4)

These set of functions analyze individual batsmen and bowlers and have been used in Pitching yorkpy . in the block hole – Part 4 The functions are

  1. getTeamBattingDetails()
  2. getBatsmanDetails()
  3. batsmanRunsVsDeliveries()
  4. batsmanFoursSixes()
  5. batsmanDismissals()
  6. batsmanRunsVsStrikeRate()
  7. batsmanMovingAverage()
  8. batsmanCumulativeAverageRuns()
  9. batsmanCumulativeStrikeRate()
  10. batsmanRunsAgainstOpposition()
  11. batsmanRunsVenue
  12. getTeamBowlingDetails()
  13. getBowlerWicketDetails()
  14. bowlerMeanEconomyRate()
  15. bowlerMeanRunsConceded()
  16. bowlerMovingAverage()
  17. bowlerCumulativeAvgWickets()
  18. bowlerCumulativeAvgEconRate()
  19. bowlerWicketPlot()
  20. bowlerWicketsAgainstOpposition()
  21. bowlerWicketsVenue()

Additional new functions were added to handle Intl T20s, Big Bash League and Natwest T20 Blast, since the teams are different. They are

59. saveAllMatchesBetween2IntlT20s()
60. saveAllMatchesAllOppositionIntlT20()
61. saveAllMatchesBetween2BBLTeams()
62 saveAllMatchesAllOppositionBBLT20()
63. saveAllMatchesBetween2NWBTeams()
64. saveAllMatchesAllOppositionNWBT20()

All other functions can be used as is! You can get the help of any function in yorkpy using

import yorkpy.analytics as yka
help(yka.teamBatsmenPartnershiOppnAllMatches)
## Help on function teamBatsmenPartnershiOppnAllMatches in module yorkpy.analytics:
## 
## teamBatsmenPartnershiOppnAllMatches(matches, theTeam, report='summary', top=5)
##     Team batting partnership against a opposition all IPL matches
##     
##     Description
##     
##     This function computes the performance of batsmen against all bowlers of an oppositions in 
##     all matches. This function returns a dataframe
##     
##     Usage
##     
##     teamBatsmenPartnershiOppnAllMatches(matches,theTeam,report="summary")
##     Arguments
##     
##     matches     
##     All the matches of the team against the oppositions
##     theTeam     
##     The team for which the the batting partnerships are sought
##     report      
##     If the report="summary" then the list of top batsmen with the highest partnerships 
##     is displayed. If report="detailed" then the detailed break up of partnership is returned 
##     as a dataframe
##     top
##     The number of players to be displayed from the top
##     Value
##     
##     partnerships The data frame of the partnerships
##     
##     Note
##     
##     Maintainer: Tinniam V Ganesh tvganesh.85@gmail.com
##     
##     Author(s)
##     
##     Tinniam V Ganesh
##     
##     References
##     
##     http://cricsheet.org/
##     https://gigadom.wordpress.com/
##     
##     
##     See Also
##     
##     teamBatsmenVsBowlersOppnAllMatchesPlot
##     teamBatsmenPartnershipOppnAllMatchesChart

As I mentioned above I will be randomly choosing a set of 12 functions from Class 1,2,3,4 for each of the T20 leagues (Intl T20, BBL and NWB T20) for analysis

2. International T20s

The following functions were added for handling Intl. T20s

  1. saveAllMatchesBetween2IntlT20s()
  2. saveAllMatchesAllOppositionIntlT20()

To handle the countries in Intl. T20s below

Afghanistan, Australia, Bangladesh, Bermuda, Canada, England,Hong Kong,India, Ireland, Kenya, Nepal, Netherlands, “New Zealand, Oman,Pakistan,Scotland,South Africa, Sri Lanka, United Arab Emirates,West Indies, Zimbabwe

import os
#os.chdir('C:\\software\\cricket-package\\yorkpyT20\\t20s')
#import yorkpy.analytics as yka
#1.  Convert all YAML files to dataframes and CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\data1")
#dir1='C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches'
#2. Save all matches between 2 T20 teams
#yka.saveAllMatchesBetween2IntlT20s(dir1)
#3. Save all matches between a T20 team and all other teams
#dir1='C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches'
#yka.saveAllMatchesAllOppositionIntlT20(dir1)
#4. Get batting details
#dir1='C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches
#yka.getTeamBattingDetails("Afghanistan",dir=dir1, save=True)
#yka.getTeamBattingDetails("Australia",dir=dir1,save=True)
#yka.getTeamBattingDetails("Bangladesh",dir=dir1,save=True)
#...
#5. Get bowling details
#dir1='C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches
#yka.getTeamBowlingDetails("Afghanistan",dir=dir1, save=True)
#yka.getTeamBowlingDetails("Australia",dir=dir1,save=True)
#yka.getTeamBowlingDetails("Bangladesh",dir=dir1,save=True)
# ...

Once the data is converted you can use the yorkpy functions. The data has been converted for Intl T20 and is available at Github at IntlT20

To use the yorkpy functions for a new league we need to initial convert the YAML files into appropriate format for processing by yorkpy functions

This will create the necessary files which are are used in the functions below

2.2 2.1 Intl. T20 – Team score card  (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches"
path=os.path.join(dir1,".\\India-New Zealand-2007-09-16.csv")
ind_nz=pd.read_csv(path)
scorecard,extras=yka.teamBattingScorecardMatch(ind_nz,"India")
print(scorecard)
##             batsman  runs  balls  4s  6s          SR
## 0         G Gambhir    51     34   5   2  150.000000
## 1          V Sehwag    40     18   6   2  222.222222
## 2        RV Uthappa     0      2   0   0    0.000000
## 3          MS Dhoni    24     20   2   0  120.000000
## 4      Yuvraj Singh     5      7   0   0   71.428571
## 5        KD Karthik    17     12   3   0  141.666667
## 6         IK Pathan    11     10   2   0  110.000000
## 7        AB Agarkar     1      2   0   0   50.000000
## 8   Harbhajan Singh     7      6   1   0  116.666667
## 9       S Sreesanth    19     10   4   0  190.000000
## 10         RP Singh     1      1   0   0  100.000000
print(extras)
##    total  wides  noballs  legbyes  byes  penalty  extras
## 0    370      6        0        8     0        0      14

2.2 Intl. T20 -Team batsmen partnership (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches"
path=os.path.join(dir1,".\\South Africa-Australia-2009-03-27.csv")
sa_aus=pd.read_csv(path)
yka.teamBatsmenPartnershipMatch(sa_aus,'Australia','New Zealand',plot=True)

2.3 Intl. T20 -Team bowling scorecard match (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches"
path=os.path.join(dir1,".\\Sri Lanka-West Indies-2012-09-28.csv")
sl_wi=pd.read_csv(path)
a=yka.teamBowlingScorecardMatch(sl_wi,'Sri Lanka')
print(a)
##          bowler  overs  runs  maidens  wicket  econrate
## 0    A Mohammed      2    13        0       0       6.5
## 1  SA Campbelle      1     8        0       1       8.0
## 2     SC Selman      1     3        0       0       3.0
## 3      SF Daley      2     5        0       1       2.5
## 4     SR Taylor      2     4        0       1       2.0
## 5     TD Smartt      2    17        0       0       8.5

2.4 Intl. T20 -Match Worm chart (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-Matches"
path=os.path.join(dir1,".\\England-India-2012-09-29.csv")
eng_ind=pd.read_csv(path)
yka.matchWormChart(eng_ind,"England", "India")

path=os.path.join(dir1,".\\Bangladesh-Ireland-2015-12-05.csv")
ban_ire=pd.read_csv(path)
yka.matchWormChart(ban_ire,"Bangladesh", "Ireland")

2.5 Intl. T20 -Team Batting partnerships all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"India-England-allMatches.csv")
dc_mi_matches = pd.read_csv(path)
theTeam='India'
m=yka.teamBatsmenPartnershiOppnAllMatches(dc_mi_matches,theTeam,report="detailed", top=4)
print(m)
##      batsman  totalPartnershipRuns    non_striker  partnershipRuns
## 0   SK Raina                   265      G Gambhir                2
## 1   SK Raina                   265       KL Rahul               40
## 2   SK Raina                   265      MK Tiwary               24
## 3   SK Raina                   265       MS Dhoni              124
## 4   SK Raina                   265        P Kumar                0
## 5   SK Raina                   265      PP Chawla                4
## 6   SK Raina                   265       R Ashwin                1
## 7   SK Raina                   265      RG Sharma               16
## 8   SK Raina                   265        V Kohli               47
## 9   SK Raina                   265   Yuvraj Singh                7
## 10  MS Dhoni                   264       A Mishra                1
## 11  MS Dhoni                   264      AT Rayudu               18
## 12  MS Dhoni                   264      HH Pandya                8
## 13  MS Dhoni                   264      IK Pathan                2
## 14  MS Dhoni                   264      JJ Bumrah                2
## 15  MS Dhoni                   264      MK Pandey                3
## 16  MS Dhoni                   264  Parvez Rasool               21
## 17  MS Dhoni                   264       R Ashwin               11
## 18  MS Dhoni                   264      RA Jadeja               11
## 19  MS Dhoni                   264      RG Sharma                9
## 20  MS Dhoni                   264        RR Pant                6
## 21  MS Dhoni                   264     RV Uthappa                5
## 22  MS Dhoni                   264       SK Raina               98
## 23  MS Dhoni                   264      YK Pathan               36
## 24  MS Dhoni                   264   Yuvraj Singh               33
## 25   V Kohli                   236      AM Rahane                3
## 26   V Kohli                   236      G Gambhir               78
## 27   V Kohli                   236       KL Rahul               46
## 28   V Kohli                   236      RG Sharma                2
## 29   V Kohli                   236     RV Uthappa                4
## 30   V Kohli                   236       S Dhawan               45
## 31   V Kohli                   236       SK Raina               48
## 32   V Kohli                   236   Yuvraj Singh               10
## 33     M Raj                   176       A Sharma                2
## 34     M Raj                   176         H Kaur               18
## 35     M Raj                   176      J Goswami                6
## 36     M Raj                   176        KV Jain                5
## 37     M Raj                   176       L Kumari                5
## 38     M Raj                   176    N Niranjana                3
## 39     M Raj                   176       N Tanwar               17
## 40     M Raj                   176        PG Raut               41
## 41     M Raj                   176     R Malhotra                5
## 42     M Raj                   176     S Mandhana                8
## 43     M Raj                   176         S Naik               10
## 44     M Raj                   176       S Pandey               19
## 45     M Raj                   176       SK Naidu               37

2.6 Intl. T20 -Team Batsmen vs Bowlers all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Ireland-Netherlands-allMatches.csv")
ire_nl_matches = pd.read_csv(path)
yka.teamBatsmenVsBowlersOppnAllMatches(ire_nl_matches,'Ireland',"Netherlands",plot=True,top=3,runsScored=10)

2.7 Intl. T20 -Team Bowling scorecard all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\IntlT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Bangladesh-Nepal-allMatches.csv")
bang_nep_matches = pd.read_csv(path)
scorecard=yka.teamBowlingScorecardOppnAllMatches(bang_nep_matches,'Bangladesh',"Nepal")
print(scorecard)
##         bowler  overs  runs  maidens  wicket   econrate
## 0      B Regmi      3    14        0       1   4.666667
## 3   SP Gauchan      4    40        0       1  10.000000
## 1   JK Mukhiya      2    16        0       0   8.000000
## 2     P Khadka      3    23        0       0   7.666667
## 4    Sagar Pun      1    16        0       0  16.000000
## 5  Sompal Kami      2    21        0       0  10.500000

2.8 Intl. T20 -Team Batsmen vs Bowlers all Oppositions (Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\\IntlT20-allMatchesAllOpposition\\"
path=os.path.join(dir1,"Australia-allMatchesAllOpposition.csv")
aus_matches = pd.read_csv(path)
yka.teamBatsmenVsBowlersAllOppnAllMatches(aus_matches,"Australia",plot=True,top=3,runsScored=40)

2.9 Intl. T20 -Wins vs Losses of a team against all other teams (Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\\IntlT20-allMatchesAllOpposition\\"
path=os.path.join(dir1,"South Africa-allMatchesAllOpposition.csv")
sa_matches = pd.read_csv(path)
team1='South Africa'
yka.plotWinLossByTeamAllOpposition(sa_matches,team1,plot="detailed")

2.10 Intl. T20 -Batsmen analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\\IntlT20-BattingBowlingDetails\\"
# Rohit Sharma
name="RG Sharma"
team='India'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeAverageRuns(df,name)

# MJ Guptill
name="MJ Guptill"
team='New Zealand'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeStrikeRate(df,name)

2.11 Intl. T20 -Bowler analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyT20\\\IntlT20-BattingBowlingDetails\\"
# Shakib Al Hasan
name="Shakib Al Hasan"
team='Bangladesh'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerMeanEconomyRate(df,name)

# Rashid Khan
name="SL Malinga"
team='Sri Lanka'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketsAgainstOpposition(df,name)

3. Big Bash League

The following functions for added to handle BBL teams

  1. saveAllMatchesBetween2BBLTeams()
  2. saveAllMatchesAllOppositionBBLT20

The BBL teams are included are Adelaide Strikers, Brisbane Heat, Hobart Hurricanes, Melbourne Renegades, Perth Scorchers, Sydney Sixers, Sydney Thunder

To use the yorkpy functions first the YAML files have to be converted into pandas dataframe and then saved as CSV as shown below

import os
import yorkpy.analytics as yka
os.chdir('C:\\software\\cricket-package\\yorkpyBBL\\bbl')
#1. Convert all YAML files to dataframes and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\BBLT20-Matches")
#2. Save all matches between 2 BBL teams
dir1='C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches'
#yka.saveAllMatchesBetween2BBLTeams(dir1)
#3. Save T20 matches between a BBL team and all other teams
dir1='C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches'
#yka.saveAllMatchesAllOppositionBBLT20(dir1)
#4. Get the batting details
dir1='C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches'
#yka.getTeamBattingDetails("Adelaide Strikers",dir=dir1, save=True)
#yka.getTeamBattingDetails("Brisbane Heat",dir=dir1,save=True)
#yka.getTeamBattingDetails("Hobart Hurricanes",dir=dir1,save=True)
#...
# Get the bowling details
dir1='C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches'
#yka.getTeamBowlingDetails("Adelaide Strikers",dir=dir1, save=True)
#yka.getTeamBowlingDetails("Brisbane Heat",dir=dir1,save=True)
#yka.getTeamBowlingDetails("Hobart Hurricanes",dir=dir1,save=True)
#...

The functions below perform analysis on the generated files from above. The YAML files have already been converted and are available at Github at BBL

3.1 Big Bash League – Team score card (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches"
path=os.path.join(dir1,".\\Adelaide Strikers-Brisbane Heat-2012-12-13.csv")
as_bh=pd.read_csv(path)
scorecard,extras=yka.teamBattingScorecardMatch(as_bh,"Brisbane Heat")
print(scorecard)
##          batsman  runs  balls  4s  6s          SR
## 0  LA Pomersbach    65     42   8   2  154.761905
## 1       JR Hopes     1      2   0   0   50.000000
## 2       JA Burns    37     31   2   2  119.354839
## 3   DT Christian    12     15   0   0   80.000000
## 4    NLTC Perera    12      4   0   2  300.000000
## 5        CA Lynn    19     18   1   1  105.555556
## 6    BCJ Cutting    13      5   0   2  260.000000
## 7     PJ Forrest    12      8   0   1  150.000000
## 8     CD Hartley     5      2   1   0  250.000000
print(extras)
##    total  wides  noballs  legbyes  byes  penalty  extras
## 0    371     10        2        5     0        0      17

3.2 Big Bash League -Team batsmen vs Bowlers (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches"
path=os.path.join(dir1,".\\Hobart Hurricanes-Melbourne Renegades-2012-01-18.csv")
hh_mr=pd.read_csv(path)
yka.teamBatsmenVsBowlersMatch(hh_mr,'Hobart Hurricanes','Melbourne Renegades',plot=True)

3.3 Big Bash League -Team bowling scorecard match (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches"
path=os.path.join(dir1,".\\Melbourne Stars-Sydney Thunder-2016-01-24.csv")
ms_st=pd.read_csv(path)
a=yka.teamBowlingScorecardMatch(ms_st,'Sydney Thunder')
print(a)
##           bowler  overs  runs  maidens  wicket   econrate
## 0        A Zampa      4    32        0       2   8.000000
## 1  BW Hilfenhaus      2    21        0       0  10.500000
## 2      DJ Hussey      1     9        0       1   9.000000
## 3     DJ Worrall      3    42        0       0  14.000000
## 4      EP Gulbis      2    19        0       0   9.500000
## 5        MA Beer      3    25        0       1   8.333333
## 6     MP Stoinis      4    30        0       3   7.500000

3.4 Big Bash League – Match Worm chart (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-Matches"
path=os.path.join(dir1,".\\Sydney Sixers-Melbourne Stars-2011-12-27.csv")
ss_ms=pd.read_csv(path)
yka.matchWormChart(ss_ms,"Melbourne Stars", "Sydney Sixers")

path=os.path.join(dir1,".\\Hobart Hurricanes-Brisbane Heat-2015-01-02.csv")
hh_bh=pd.read_csv(path)
yka.matchWormChart(hh_bh,"Hobart Hurricanes", "Brisbane Heat")

3.5 Big Bash League -Team Batting partnerships all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Brisbane Heat-Adelaide Strikers-allMatches.csv")
bh_as_matches = pd.read_csv(path)
yka.teamBatsmenPartnershipOppnAllMatchesChart(bh_as_matches,"Brisbane Heat","Adelaide Strikers",plot=True, top=4, partnershipRuns=20)

3.6 Big Bash League -Team Bowling wicket kind all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Sydney Sixers-Perth Scorchers-allMatches.csv")
ss_ps_matches = pd.read_csv(path)
yka.teamBowlingWicketKindOppositionAllMatches(ss_ps_matches,'Perth Scorchers','Sydney Sixers',plot=True,top=5,wickets=1)

3.7 Big Bash League -Team Bowling scorecard all teams (Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-allMatchesAllOpposition"
path=os.path.join(dir1,"Hobart Hurricanes-allMatchesAllOpposition.csv")
hh_matches = pd.read_csv(path)
scorecard=yka.teamBowlingScorecardAllOppnAllMatches(hh_matches,"Hobart Hurricanes")
print(scorecard)
##              bowler  overs  runs  maidens  wicket   econrate
## 16            B Lee     20   132        0       9   6.600000
## 30         CJ McKay     13   110        0       9   8.461538
## 88    NJ Rimmington     16   103        1       9   6.437500
## 67      JW Hastings     15    88        0       8   5.866667
## 63      JP Faulkner     15   146        0       7   9.733333
## 27        CJ Gannon     17   147        1       7   8.647059
## 93          NM Lyon      8    51        0       7   6.375000
## 20      BCJ Cutting     27   226        0       7   8.370370
## 48          GB Hogg     22   167        0       7   7.590909
## 107       SM Boland     12    96        0       7   8.000000
## 15       B Laughlin     13    99        0       7   7.615385
## 87      MT Steketee     15   134        0       5   8.933333
## 121    Yasir Arafat      9    48        0       4   5.333333
## 96       PJ Cummins      8    83        0       4  10.375000
## 46      Fawad Ahmed     11    64        0       4   5.818182
## 76          MA Beer     12    63        0       4   5.250000
## 108     SNJ O'Keefe     15   104        0       4   6.933333
## 75   M Muralitharan      7    31        0       4   4.428571
## 10           AJ Tye     16   127        0       4   7.937500
## 52          J Botha     13    94        0       4   7.230769
## 56     JL Pattinson      7    71        0       4  10.142857
## 62   JP Behrendorff     16   119        0       4   7.437500
## 3           AC Agar     12    87        0       4   7.250000
## 24     BM Edmondson      4    40        0       4  10.000000
## 37        DJ Hussey      8    47        0       3   5.875000
## 49       GJ Maxwell      8    65        0       3   8.125000
## 84       MN Samuels      4    22        0       3   5.500000
## 81         MG Neser      5    54        0       3  10.800000
## 44     DT Christian      9   114        0       3  12.666667
## 50        GS Sandhu      7    51        0       3   7.285714
## ..              ...    ...   ...      ...     ...        ...
## 43        DP Nannes      8    58        0       1   7.250000
## 51         IA Moran      4    25        0       1   6.250000
## 55         JK Lalor     10    82        0       1   8.200000
## 54        JH Kallis      3    18        0       1   6.000000
## 73   LR Butterworth      4    25        0       1   6.250000
## 4      AC McDermott      2    28        0       1  14.000000
## 70         LA Doran      4    38        0       1   9.500000
## 69    KW Richardson      6    44        0       1   7.333333
## 119     WD Sheridan      2     6        0       0   3.000000
## 2       AB McDonald      1    15        0       0  15.000000
## 115      TD Andrews      3    23        0       0   7.666667
## 11          AK Heal      4    33        0       0   8.250000
## 7        AD Russell      4    40        0       0  10.000000
## 8          AJ Finch      2    15        0       0   7.500000
## 9         AJ Turner      3    28        0       0   9.333333
## 60        JM Mennie      1    20        0       0  20.000000
## 18        BA Stokes      1     9        0       0   9.000000
## 26         CH Gayle      1    16        0       0  16.000000
## 28         CJ Green      4    44        0       0  11.000000
## 95   PD Collingwood      2    20        0       0  10.000000
## 31       CJ Simmons      4    21        0       0   5.250000
## 59       JM Holland      3    34        0       0  11.333333
## 36         DJ Bravo      6    64        0       0  10.666667
## 38     DJ Pattinson      2    16        0       0   8.000000
## 41       DJ Worrall      8    90        0       0  11.250000
## 72      LN O'Connor      6    56        0       0   9.333333
## 71        LJ Wright      3    27        0       0   9.000000
## 68       KA Pollard      1     7        0       0   7.000000
## 58       JM Herrick      4    23        0       0   5.750000
## 92       NM Hauritz      5    42        0       0   8.400000
## 
## [122 rows x 6 columns]

3.8 Big Bash League -Plot wins vs losses against all teams(Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-allMatchesAllOpposition"
path=os.path.join(dir1,"Sydney Sixers-allMatchesAllOpposition.csv")
ss_matches = pd.read_csv(path)
yka.plotWinLossByTeamAllOpposition(ss_matches,'Sydney Sixers')

3.9 Big Bash League -Wins vs losses by toss decision (Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-allMatchesAllOpposition"
path=os.path.join(dir1,"Adelaide Strikers-allMatchesAllOpposition.csv")
as_matches = pd.read_csv(path)
yka.plotWinsByRunOrWicketsAllOpposition(as_matches,'Adelaide Strikers')

3.10 Big Bash League -Batsmen Analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-BattingBowlingDetails"
# CA Lynn
name="CA Lynn"
team='Brisbane Heat'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsVsStrikeRate(df,name)

# UT Khawaja
name="UT Khawaja"
team='Sydney Thunder'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsAgainstOpposition(df,name)

3.11Big Bash League – Bowler analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyBBL\\BBLT20-BattingBowlingDetails"
# CJ McKay
name="CJ McKay"
team='Sydney Thunder'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgWickets(df,name)

# AU Rashid
name="AU Rashid"
team='Adelaide Strikers'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgEconRate(df,name)

4. Natwest T20 Blast

The following functions for added to handle Natwest T20 teams

  1. saveAllMatchesBetween2NWBTeams()
  2. saveAllMatchesAllOppositionNWBT20

The Natwest teams are
Derbyshire, Durham, Essex, Glamorgan, Gloucestershire, Hampshire, Kent,Lancashire, Leicestershire, Middlesex,Northamptonshire, Nottinghamshire, Somerset, Surrey, Sussex, Warwickshire, Worcestershire,Yorkshire

In order to perform analysis with yorkpy, the YAML data has to be converted to pandas dataframe and saves as CSV as shown

#import os
#import yorkpy.analytics as yka
#os.chdir('C:\\software\\cricket-package\\yorkpyNWB\\nwb')
#1. Convert YAML to dataframes and save as CSV
#yka.convertAllYaml2PandasDataframesT20(".", "..\\NWBT20-Matches")
#2. Save all matches between 2 NWBT20 teams
#dir1='C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches'
#yka.saveAllMatchesBetween2NWBTeams(dir1)
#3. Save all matches between a NWB T20 team and all other teams
#dir1='C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches'
#yka.saveAllMatchesAllOppositionNWBT20(dir1)
#4. Compute the batting details
dir1='C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches'
#yka.getTeamBattingDetails("Derbyshire",dir=dir1, save=True)
#yka.getTeamBattingDetails("Durham",dir=dir1,save=True)
#yka.getTeamBattingDetails("Essex",dir=dir1,save=True)
#..
#5. Compute bowling details
dir1='C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-Matches'
#yka.getTeamBowlingDetails("Derbyshire",dir=dir1, save=True)
#yka.getTeamBowlingDetails("Durham",dir=dir1,save=True)
#yka.getTeamBowlingDetails("Essex",dir=dir1,save=True)
#...

Once the data is converted all yorkpy functions can be used. This has already been done and is available at github NWB

4.1 Natwest T20 Blast – Team score card (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\\yorkpyNWB\\NWBT20-Matches"
path=os.path.join(dir1,".\\Durham-Yorkshire-2016-08-20.csv")
d_y=pd.read_csv(path)
scorecard,extras=yka.teamBattingScorecardMatch(d_y,"Durham")
print(scorecard)
##           batsman  runs  balls  4s  6s          SR
## 0     MD Stoneman    25     20   4   0  125.000000
## 1     KK Jennings    11     13   1   0   84.615385
## 2       BA Stokes    56     37   4   3  151.351351
## 3   MJ Richardson    29     23   4   1  126.086957
## 4     JTA Burnham    17     15   1   1  113.333333
## 5      RD Pringle    10      9   1   0  111.111111
## 6  PD Collingwood     2      3   0   0   66.666667
## 7        U Arshad     1      1   0   0  100.000000
print(extras)
##    total  wides  noballs  legbyes  byes  penalty  extras
## 0    305      2        0        5     0        0       7

4.2 Natwest T20 Blast -Team batsmen vs Bowlers (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\\yorkpyNWB\\NWBT20-Matches"
path=os.path.join(dir1,".\\Derbyshire-Lancashire-2016-07-13.csv")
d_l=pd.read_csv(path)
yka.teamBatsmenVsBowlersMatch(d_l,'Lancashire','Derbyshire',plot=True)

4.3 Natwest T20 Blast -Team bowling scorecard match (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\\yorkpyNWB\\NWBT20-Matches"
path=os.path.join(dir1,".\\Essex-Surrey-2016-05-20.csv")
e_s=pd.read_csv(path)
a=yka.teamBowlingScorecardMatch(e_s,'Essex')
print(a)
##           bowler  overs  runs  maidens  wicket   econrate
## 0  Azhar Mahmood      3    38        0       4  12.666667
## 1       GJ Batty      4    33        0       1   8.250000
## 2       JE Burke      1    18        0       0  18.000000
## 3     MW Pillans      3    28        0       0   9.333333
## 4      SM Curran      4    23        0       2   5.750000
## 5      TK Curran      4    21        0       3   5.250000

4.4 Natwest T20 Blast -Match Worm chart (Class 1)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\\yorkpyNWB\\NWBT20-Matches"
path=os.path.join(dir1,".\\Gloucestershire-Glamorgan-2016-06-10.csv")
ss_ms=pd.read_csv(path)
yka.matchWormChart(ss_ms,"Gloucestershire", "Glamorgan")

path=os.path.join(dir1,".\\Leicestershire-Northamptonshire-2016-05-20.csv")
hh_bh=pd.read_csv(path)
yka.matchWormChart(hh_bh,"Northamptonshire", "Leicestershire")

4.5 Natwest T20 Blast -Team Batting partnerships all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Hampshire-Sussex-allMatches.csv")
h_s_matches = pd.read_csv(path)
yka.teamBatsmenPartnershipOppnAllMatchesChart(h_s_matches,"Hampshire","Sussex",plot=True, top=4, partnershipRuns=10)

4.6 Natwest T20 Blast -Team Bowling wicket kind all matches 2 teams (Class 2)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-allMatchesBetween2Teams"
path=os.path.join(dir1,"Kent-Somerset-allMatches.csv")
k_s_matches = pd.read_csv(path)
yka.teamBowlersVsBatsmenOppnAllMatches(k_s_matches,'Kent','Somerset',plot=True,
top=5,runsConceded=10)

4.7 Natwest T20 Blast -Team Bowling scorecard all teams (Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-allMatchesAllOpposition"
path=os.path.join(dir1,"Middlesex-allMatchesAllOpposition.csv")
m_matches = pd.read_csv(path)
scorecard=yka.teamBowlingScorecardAllOppnAllMatches(m_matches,"Middlesex")
print(scorecard)
##               bowler  overs  runs  maidens  wicket   econrate
## 1             AJ Tye      8    75        0       6   9.375000
## 5         BAC Howell      8    41        0       5   5.125000
## 26         GR Napier      7    65        0       5   9.285714
## 15        DI Stevens      4    31        0       4   7.750000
## 19       DW Lawrence      6    37        0       4   6.166667
## 32       JW Dernbach      4    33        0       3   8.250000
## 7          BTJ Wheal      4    43        0       3  10.750000
## 18         DR Briggs      4    24        0       3   6.000000
## 50     RK Kleinveldt      4    24        0       3   6.000000
## 46         R McLaren      7    59        0       3   8.428571
## 47         R Rampaul      3    21        0       3   7.000000
## 34         L Gregory      6    51        0       2   8.500000
## 33   KMDN Kulasekara      2    24        0       2  12.000000
## 40          MG Hogan      3    17        0       2   5.666667
## 43        MTC Waller      4    31        0       2   7.750000
## 49        RJ Gleeson      4    20        0       2   5.000000
## 48  RE van der Merwe      5    24        0       2   4.800000
## 51  RN ten Doeschate      4    32        0       2   8.000000
## 53        S Prasanna      4    20        0       2   5.000000
## 56           SW Tait      3    17        0       2   5.666667
## 57     Shahid Afridi      8    55        0       2   6.875000
## 59  T van der Gugten      3    13        1       2   4.333333
## 64          TS Mills      3    34        0       2  11.333333
## 65          WAT Beer      4    23        0       2   5.750000
## 31          JH Davey      4    28        0       2   7.000000
## 68         ZS Ansari      3    16        0       2   5.333333
## 25         GM Andrew      3    19        0       2   6.333333
## 23          GJ Batty      6    55        0       2   9.166667
## 16          DJ Bravo      3    27        0       2   9.000000
## 41          MR Quinn      6    65        0       1  10.833333
## ..               ...    ...   ...      ...     ...        ...
## 24     GL van Buuren      7    49        0       1   7.000000
## 37           MD Hunn      3    35        0       1  11.666667
## 36        LC Norwell      6    62        0       1  10.333333
## 29       JC Tredwell      4    35        0       1   8.750000
## 35         LA Dawson      6    53        0       1   8.833333
## 62           TL Best      4    51        0       0  12.750000
## 58         T Westley      2    12        0       0   6.000000
## 4         Azharullah      3    24        0       0   8.000000
## 60     TD Groenewald      1    21        0       0  21.000000
## 61         TK Curran      4    35        0       0   8.750000
## 38         MD Taylor      3    30        0       0  10.000000
## 30        JG Myburgh      1     5        0       0   5.000000
## 8          C Overton      2    18        0       0   9.000000
## 2        Ashar Zaidi      1     5        0       0   5.000000
## 66          WR Smith      2    25        0       0  12.500000
## 28         J Overton      2    24        0       0  12.000000
## 6          BJ Taylor      1     6        0       0   6.000000
## 22          GG White      4    31        0       0   7.750000
## 55          SP Crook      1     9        0       0   9.000000
## 39        ME Claydon      4    40        0       0  10.000000
## 52         RS Bopara      4    32        0       0   8.000000
## 10           CD Nash      2    19        0       0   9.500000
## 11         CH Morris      4    36        0       0   9.000000
## 12         DA Cosker      3    32        0       0  10.666667
## 13      DA Griffiths      4    39        0       0   9.750000
## 45          PD Trego      1    11        0       0  11.000000
## 44   PA van Meekeren      2    19        0       0   9.500000
## 42          MS Crane      2    25        0       0  12.500000
## 20        FK Cowdrey      1    19        0       0  19.000000
## 14        DD Masters      2    16        0       0   8.000000
## 
## [69 rows x 6 columns]

4.8 Natwest T20 Blast -Plot wins vs losses against all teams(Class 3)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-allMatchesAllOpposition"
path=os.path.join(dir1,"Warwickshire-allMatchesAllOpposition.csv")
w_matches = pd.read_csv(path)
yka.plotWinLossByTeamAllOpposition(w_matches,'Warwickshire')

4.9 Natwest T20 Blast -Batsmen Analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-BattingBowlingDetails"
# M Klinger
name="M Klinger"
team='Gloucestershire'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanRunsAgainstOpposition(df,name)

# CA Ingram
name="CA Ingram"
team='Glamorgan'
df=yka.getBatsmanDetails(team,name,dir=dir1)
yka.batsmanCumulativeStrikeRate(df,name)

4.11 Natwest T20 Blast -Bowler analysis (Class 4)

import os
import pandas as pd
import yorkpy.analytics as yka
dir1="C:\\software\\cricket-package\\yorkpyNWB\\NWBT20-BattingBowlingDetails"
# BAC Howell
name="BAC Howell"
team='Gloucestershire'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerCumulativeAvgEconRate(df,name)

# GR Napier
name="GR Napier"
team='Essex'
df=yka.getBowlerWicketDetails(team,name,dir=dir1)
yka.bowlerWicketsVenue(df,name)

Note: yorkpy will work for all T20 leagues which are in YAML format as specified in Cricsheet.

You can clone/fork the latest code for yorkpy from github yorkpy

The data for IPL, Intl. T20, BBL and Natwest T20 have already been converted into pandas dataframes and saved as CSVs. You can download the converted files from Github at [allYorkpyT20Data])(https://github.com/tvganesh/allYorkpyT20Data)

Conclusion This post shows the kind of detailed analysis that can be performed with yorkpy. In fact with all the converted data it should be possible to also train a Machine Learning model, which I will probably keep for another day. You could go ahead and use the data in other innovative ways. Do keep me posted if you do!!

Important note: Do check out my other posts using yorkpy at yorkpy-posts

Have fun with yorkpy!!

See also
1. Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8
2. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
3. Hand detection through Haartraining: A hands-on approach
4.My book ‘Deep Learning from first principles:Second Edition’ now on Amazon
5. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
6. The 3rd paperback & kindle editions of my books on Cricket, now on Amazon

To see all posts click Index of posts