GooglyPlusPlus: Computing T20 player’s Win Probability Contribution

In this post, I compute each batsman’s or bowler’s Win Probability Contribution (WPC) in a T20 match. This metric captures how much the player (batsman or bowler) changed the Win Probability of the match. For this computation I use the machine learning models I created earlier, which predict the ball-by-ball win probability as the match progresses through its 2 innings.

In the picture snippet below, you can see how the win probability changes ball-by-ball for each batsman in the T20 match between CSK and LSG on 31 Mar 2022.

In my previous posts I had created several Machine Learning models. In order to compute the player’s Win Probability contribution in this post, I have used the following ML models

The batsman’s or bowler’s win probability contribution changes ball-by-ball. A batsman’s contribution is calculated as the difference between the win probability at the last ball of his/her innings (when he/she gets out or the innings comes to an end) and the win probability at the 1st ball he/she faces. If the difference is positive, the player has had a positive impact; if it is negative, a negative impact. Similarly, for a bowler, it is the change in win probability from the first delivery he/she bowls to the last delivery he/she bowls.
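The definition above can be sketched in a few lines of Python. This is only an illustration under assumptions: the input layout (a list of player and win-probability pairs in delivery order), the function name `win_prob_contribution` and the numbers are all made up for the example, and this is not how yorkr implements it.

```python
def win_prob_contribution(balls):
    """Return {player: win probability at last ball - win probability at first ball}.

    `balls` is a list of (player, win_probability) tuples in delivery order.
    This layout is an assumption made purely for illustration.
    """
    first_last = {}
    for player, wp in balls:
        if player not in first_last:
            first_last[player] = [wp, wp]   # first ball seen for this player
        else:
            first_last[player][1] = wp      # overwrite with the latest ball
    return {p: round(last - first, 2) for p, (first, last) in first_last.items()}


# Made-up win probabilities (in %) for two hypothetical batsmen
balls = [("Batsman A", 48.0), ("Batsman A", 52.5), ("Batsman B", 51.0),
         ("Batsman A", 55.0), ("Batsman B", 47.5)]
contribution = win_prob_contribution(balls)
print(contribution)
```

A positive value means the win probability rose between the player's first and last ball, a negative value means it fell.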

Note: The Win Probability Contribution is not directly related to how many runs the batsman scored or at what strike rate. Rather, the model computes a different win probability for each player, based on his/her embedding, the ball in the innings and six other features such as runs, run rate, runsMomentum etc. These values change for every ball as seen in the table above. Also, this is not continuous. The 2 ML models determine the Win Probability for a specific player, ball and the context in the match.

This metric is similar to Win Probability Added (WPA) used in Sabermetrics for baseball. Here is the definition of WPA from Fangraphs “Win Probability Added (WPA) captures the change in Win Expectancy from one plate appearance to the next and credits or debits the player based on how much their action increased their team’s odds of winning.” This article in Fangraphs explains in detail how this computation is done.

In this post I have added 4 new functions to my R package yorkr.

  • batsmanWinProbLR – batsman’s win probability contribution based on glmnet (Logistic Regression)
  • bowlerWinProbLR – bowler’s win probability contribution based on glmnet (Logistic Regression)
  • batsmanWinProbDL – batsman’s win probability contribution based on Deep Learning Model
  • bowlerWinProbDL – bowler’s win probability contribution based on Deep Learning Model

Hence there are 4 additional features in GooglyPlusPlus based on the above 4 functions. In addition I have also updated

-winProbLR (overLap) function to include the names of batsmen when they come to bat and when they get out or the innings comes to an end, based on Logistic Regression

-winProbDL (overLap) function to include the names of batsmen when they come to bat and when they get out, based on Deep Learning

Hence there are 6 new features in this version of GooglyPlusPlus.

Note: All 6 new features are available for all 9 formats of T20 in GooglyPlusPlus, namely

a) IPL b) BBL c) NTB d) PSL e) Intl. T20 (men) f) Intl. T20 (women) g) WBB h) CPL i) SSM

Check out the latest version of GooglyPlusPlus at gpp2023-2

Note: The data for GooglyPlusPlus comes from Cricsheet and the Shiny app is based on my R package yorkr

A) Chennai Super Kings vs Delhi Capitals – 04 Oct 2021

To understand Win Probability Contribution better let us look at Chennai Super Kings vs Delhi Capitals match on 04 Oct 2021

This was a closely fought match with fortunes swinging wildly, as the Worm wicket chart of this match shows.

a) Worm Wicket chart – CSK vs DC – 04 Oct 2021

Delhi Capitals finally won the match.

b) Win Probability Logistic Regression (side-by-side) – CSK vs DC – 4 Oct 2021

Plotting how win probability changes over the course of the match using Logistic Regression Model

Delhi Capitals won this match. Their batting scorecard is shown below.

c) Batting Scorecard of Delhi Capitals – CSK vs DC – 4 Oct 2021

d) Win Probability Logistic Regression (Overlapping) – CSK vs DC – 4 Oct 2021

The Win Probability LR (overlapping) plot shows the win probability curves of both teams superimposed on one another. The plot marks when each batsman came in to bat and when he got out, for both teams. This looks a little noisy, but there is a way to selectively display the change in Win Probability for each team, by clicking the 3 arrows (orange or blue) from top to bottom. First double-click the team (CSK or DC), then click the next 2 items (blue, red or black, grey). Sorry, the legends don’t match the colors! 😦

Below we can see how the win probability changed for Delhi Capitals during their innings, as each batsman came in to bat.

e) Batsman Win Probability contribution: DC – CSK vs DC – 4 Oct 2021

Computing and plotting each individual batsman’s Win Probability Contribution, we see that Hetmyer has a higher Win Probability contribution than Shikhar Dhawan despite scoring fewer runs.

f) Bowler’s Win Probability contribution: CSK – CSK vs DC – 4 Oct 2021

We can also check the Win Probability Contribution of the bowlers, e.g. which of the CSK bowlers had the most impact. Moeen Ali had the least impact in this match.

B) Intl. T20 (men) Australia vs India – 25 Sep 2022

a) Worm wicket chart – Australia vs India – 25 Sep 2022

This was another close match, which India won off the penultimate ball.

b) Win Probability based on Deep Learning model (side-by-side) – Australia vs India – 25 Sep 2022

c) Win Probability based on Deep Learning model (overlapping) – Australia vs India – 25 Sep 2022

The plot below shows how the Win Probability of the teams varied across the 20 overs. The 2 Win Probability distributions are superimposed over each other

d) Batsman Win Probability Contribution: India – Australia vs India – 25 Sep 2022

Selectively choosing the India Win Probability plot by double-clicking the legend ‘India’ on the right, followed by a single click on the black and grey legends, we have

We see that Kohli and Suryakumar Yadav made good contributions to the Win Probability.

e) Plotting the Runs vs Strike Rate: India – Australia vs India – 25 Sep 2022

f) Batsman’s Win Probability Contribution – Australia vs India – 25 Sep 2022

Finally plotting the Batsman’s Win Probability Contribution

Interestingly, Kohli has a greater Win Probability Contribution than SKY (Suryakumar Yadav), though SKY scored more runs at a better strike rate. As mentioned above, the Win Probability is context dependent and also depends on the past performances of the player (batsman, bowler).

Finally let us look at

C) India vs England Intl. T20 Women (11 July 2021)

a) Worm wicket chart – India vs England Intl. T20 Women (11 July 2021)

India won this T20 match by 8 runs

b) Win Probability using the Logistic Regression Model – India vs England Intl. T20 Women (11 July 2021)

c) Win Probability with the DL model – India vs England Intl. T20 Women (11 July 2021)

d) Bowler Win Probability Contribution with the LR model – India vs England Intl. T20 Women (11 July 2021)

e) Bowler Win Probability Contribution with the DL model – India vs England Intl. T20 Women (11 July 2021)

Go ahead and try out the latest version of GooglyPlusPlus


GooglyPlusPlus: Win Probability using Deep Learning and player embeddings

In my last post ‘GooglyPlusPlus now with Win Probability Analysis for all T20 matches‘ I had discussed the performance of my ML models, created with and without player embeddings, in computing the Win Probability of T20 matches. With batsman & bowler embeddings I got much better performance than without the embeddings

  • glmnet – Accuracy – 0.73
  • Random Forest (RF) – Accuracy – 0.92

While the Random Forest gave excellent accuracy, it was bulky and also took an unusually long time to predict the Win Probability of a single T20 match. The above 2 ML models were built using R’s Tidymodels. glmnet was fast, but I wanted to see if I could create an ML model that was better, lighter and faster. I had initially tried to use Tensorflow, Keras in Python but then abandoned it, since I did not know how to port the Deep Learning model to R and use it in my app GooglyPlusPlus.

But later, since I was stuck with a bulky Random Forest model, I decided to again explore options for saving the Keras Deep Learning model and loading it in R. I found that if the model is saved as .h5, it can be loaded in R and used for predictions. Hence, I rebuilt a Deep Learning model using Keras in Python with player embeddings and got excellent performance. The DL model was light and had an accuracy of 0.8639 with an ROC_AUC of 0.964, which was great!

GooglyPlusPlus uses data from Cricsheet and is based on my R package yorkr

You can try out this latest version of GooglyPlusPlus at gpp2023-1

Here are the steps

A. Build a Keras Deep Learning model

a. Import necessary packages

import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from pathlib import Path
import matplotlib.pyplot as plt

b. Upload the data of all 9 T20 leagues (BBL, CPL, IPL, T20 (men), T20 (women), NTB, PSL, SSM, WBB)

# Read all T20 leagues 
df1=pd.read_csv('t20.csv')
print("Shape of dataframe=",df1.shape)

# Create training and test data set
train_dataset = df1.sample(frac=0.8,random_state=0)
test_dataset = df1.drop(train_dataset.index)
train_dataset1 = train_dataset[['batsmanIdx','bowlerIdx','ballNum','ballsRemaining','runs','runRate','numWickets','runsMomentum','perfIndex']]
test_dataset1 = test_dataset[['batsmanIdx','bowlerIdx','ballNum','ballsRemaining','runs','runRate','numWickets','runsMomentum','perfIndex']]
train_dataset1

# Set the target data
train_labels = train_dataset.pop('isWinner')
test_labels = test_dataset.pop('isWinner')
train_dataset1

# Summary statistics of the predictors
a = train_dataset1.describe()
stats = a.transpose()   # transpose is a method and needs to be called
a

c. Create a Deep Learning ML model using batsman & bowler embeddings

# pandas, numpy, tensorflow (tf) and keras are already imported in step a
from keras.layers import Input, Embedding, Flatten, Dense, Dropout
from keras.models import Model

# Set seed
tf.random.set_seed(432)

# create input layers for each of the predictors
batsmanIdx_input = Input(shape=(1,), name='batsmanIdx')
bowlerIdx_input = Input(shape=(1,), name='bowlerIdx')
ballNum_input = Input(shape=(1,), name='ballNum')
ballsRemaining_input = Input(shape=(1,), name='ballsRemaining')
runs_input = Input(shape=(1,), name='runs')
runRate_input = Input(shape=(1,), name='runRate')
numWickets_input = Input(shape=(1,), name='numWickets')
runsMomentum_input = Input(shape=(1,), name='runsMomentum')
perfIndex_input = Input(shape=(1,), name='perfIndex')

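# Rule-of-thumb embedding size: 4th root of the number of unique batsmen, bowlers
# (computed for reference only; the embedding layers below use a fixed output_dim=16)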
no_of_unique_batman=len(df1["batsmanIdx"].unique()) 
no_of_unique_bowler=len(df1["bowlerIdx"].unique()) 
embedding_size_bat = no_of_unique_batman ** (1/4)
embedding_size_bwl = no_of_unique_bowler ** (1/4)


# create embedding layer for the categorical predictor
batsmanIdx_embedding = Embedding(input_dim=no_of_unique_batman+1, output_dim=16,input_length=1)(batsmanIdx_input)
batsmanIdx_flatten = Flatten()(batsmanIdx_embedding)
bowlerIdx_embedding = Embedding(input_dim=no_of_unique_bowler+1, output_dim=16,input_length=1)(bowlerIdx_input)
bowlerIdx_flatten = Flatten()(bowlerIdx_embedding)

# concatenate all the predictors
x = keras.layers.concatenate([batsmanIdx_flatten,bowlerIdx_flatten, ballNum_input, ballsRemaining_input, runs_input, runRate_input, numWickets_input, runsMomentum_input, perfIndex_input])

# add hidden layers
# Use dropouts for regularisation
x = Dense(64, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(8, activation='relu')(x)
x = Dropout(0.1)(x)

# add output layer
output = Dense(1, activation='sigmoid', name='output')(x)
print(output.shape)

# create a DL model
model = Model(inputs=[batsmanIdx_input,bowlerIdx_input, ballNum_input, ballsRemaining_input, runs_input, runRate_input, numWickets_input, runsMomentum_input, perfIndex_input], outputs=output)
model.summary()

# compile model
# decay=0.0 (the default) is omitted, since newer Keras optimizers reject the decay argument
optimizer=keras.optimizers.Adam(learning_rate=.01, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=True)

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# train the model
history=model.fit([train_dataset1['batsmanIdx'],train_dataset1['bowlerIdx'],train_dataset1['ballNum'],train_dataset1['ballsRemaining'],train_dataset1['runs'],
           train_dataset1['runRate'],train_dataset1['numWickets'],train_dataset1['runsMomentum'],train_dataset1['perfIndex']], train_labels, epochs=40, batch_size=1024,
          validation_data = ([test_dataset1['batsmanIdx'],test_dataset1['bowlerIdx'],test_dataset1['ballNum'],test_dataset1['ballsRemaining'],test_dataset1['runs'],
           test_dataset1['runRate'],test_dataset1['numWickets'],test_dataset1['runsMomentum'],test_dataset1['perfIndex']],test_labels), verbose=1)

plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "test"], loc="upper left")
plt.show()

Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 batsmanIdx (InputLayer)        [(None, 1)]          0           []                               
                                                                                                  
 bowlerIdx (InputLayer)         [(None, 1)]          0           []                               
                                                                                                  
 embedding_10 (Embedding)       (None, 1, 16)        75888       ['batsmanIdx[0][0]']             
                                                                                                  
 embedding_11 (Embedding)       (None, 1, 16)        55808       ['bowlerIdx[0][0]']              
                                                                                                  
 flatten_10 (Flatten)           (None, 16)           0           ['embedding_10[0][0]']           
                                                                                                  
 flatten_11 (Flatten)           (None, 16)           0           ['embedding_11[0][0]']           
                                                                                                  
 ballNum (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 ballsRemaining (InputLayer)    [(None, 1)]          0           []                               
                                                                                                  
 runs (InputLayer)              [(None, 1)]          0           []                               
                                                                                                  
 runRate (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 numWickets (InputLayer)        [(None, 1)]          0           []                               
                                                                                                  
 runsMomentum (InputLayer)      [(None, 1)]          0           []                               
                                                                                                  
 perfIndex (InputLayer)         [(None, 1)]          0           []                               
                                                                                                  
 concatenate_5 (Concatenate)    (None, 39)           0           ['flatten_10[0][0]',             
                                                                  'flatten_11[0][0]',             
                                                                  'ballNum[0][0]',                
                                                                  'ballsRemaining[0][0]',         
                                                                  'runs[0][0]',                   
                                                                  'runRate[0][0]',                
                                                                  'numWickets[0][0]',             
                                                                  'runsMomentum[0][0]',           
                                                                  'perfIndex[0][0]']              
                                                                                                  
 dense_19 (Dense)               (None, 64)           2560        ['concatenate_5[0][0]']          
                                                                                                  
 dropout_19 (Dropout)           (None, 64)           0           ['dense_19[0][0]']               
                                                                                                  
 dense_20 (Dense)               (None, 32)           2080        ['dropout_19[0][0]']             
                                                                                                  
 dropout_20 (Dropout)           (None, 32)           0           ['dense_20[0][0]']               
                                                                                                  
 dense_21 (Dense)               (None, 16)           528         ['dropout_20[0][0]']             
                                                                                                  
 dropout_21 (Dropout)           (None, 16)           0           ['dense_21[0][0]']               
                                                                                                  
 dense_22 (Dense)               (None, 8)            136         ['dropout_21[0][0]']             
                                                                                                  
 dropout_22 (Dropout)           (None, 8)            0           ['dense_22[0][0]']               
                                                                                                  
 output (Dense)                 (None, 1)            9           ['dropout_22[0][0]']             
                                                                                                  
==================================================================================================
Total params: 137,009
Trainable params: 137,009
Non-trainable params: 0
__________________________________________________________________________________________________
Epoch 1/40
937/937 [==============================] - 11s 10ms/step - loss: 0.5683 - accuracy: 0.6968 - val_loss: 0.4480 - val_accuracy: 0.7708
Epoch 2/40
937/937 [==============================] - 9s 10ms/step - loss: 0.4477 - accuracy: 0.7721 - val_loss: 0.4305 - val_accuracy: 0.7833
Epoch 3/40
937/937 [==============================] - 9s 10ms/step - loss: 0.4229 - accuracy: 0.7832 - val_loss: 0.3984 - val_accuracy: 0.7936
...
...
937/937 [==============================] - 10s 10ms/step - loss: 0.2909 - accuracy: 0.8627 - val_loss: 0.2943 - val_accuracy: 0.8613
Epoch 38/40
937/937 [==============================] - 10s 10ms/step - loss: 0.2892 - accuracy: 0.8633 - val_loss: 0.2933 - val_accuracy: 0.8621
Epoch 39/40
937/937 [==============================] - 10s 10ms/step - loss: 0.2889 - accuracy: 0.8638 - val_loss: 0.2941 - val_accuracy: 0.8620
Epoch 40/40
937/937 [==============================] - 10s 11ms/step - loss: 0.2886 - accuracy: 0.8639 - val_loss: 0.2929 - val_accuracy: 0.8621
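
The parameter counts in the model summary above can be checked by hand. The sketch below is plain Python arithmetic, not Keras code; the helper names are mine, and the two vocabulary sizes are inferred from the summary (75888 / 16 and 55808 / 16) rather than taken from the data.

```python
def dense_params(n_in, n_out):
    """A dense layer has n_in * n_out weights plus one bias per output unit."""
    return n_in * n_out + n_out


def embedding_params(vocab_size, dim):
    """An embedding layer stores one dim-long vector per index."""
    return vocab_size * dim


# Vocabulary sizes inferred from the summary: 75888 / 16 and 55808 / 16
bat_vocab, bowl_vocab, emb_dim = 4743, 3488, 16

# Two flattened embeddings plus the 7 numeric inputs feed the first dense layer
concat_width = 2 * emb_dim + 7
layer_sizes = [concat_width, 64, 32, 16, 8, 1]

counts = [embedding_params(bat_vocab, emb_dim),
          embedding_params(bowl_vocab, emb_dim)]
counts += [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(concat_width, counts, total)
```

Almost all of the parameters sit in the two embedding tables; the dense layers add only a few thousand.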

d. Compute and plot the ROC-AUC for the above model

from sklearn.metrics import roc_curve, auc

# Select a random sample set
tf.random.set_seed(59)
train = df1.sample(frac=0.9,random_state=0)
# Note: the rows dropped are those of train_dataset (the 80% sample from step b),
# so test is the 20% hold-out set; train is not used further
test = df1.drop(train_dataset.index)
test_dataset1 = test[['batsmanIdx','bowlerIdx','ballNum','ballsRemaining','runs','runRate','numWickets','runsMomentum','perfIndex']]
test_labels = test.pop('isWinner')

# Compute the predicted values
y_pred_keras = model.predict([test_dataset1['batsmanIdx'],test_dataset1['bowlerIdx'],test_dataset1['ballNum'],test_dataset1['ballsRemaining'],test_dataset1['runs'],
           test_dataset1['runRate'],test_dataset1['numWickets'],test_dataset1['runsMomentum'],test_dataset1['perfIndex']]).ravel()

# Compute TPR & FPR
fpr_keras, tpr_keras, thresholds_keras = roc_curve(test_labels, y_pred_keras)

# Plot the Area Under the Curve (AUC)
auc_keras = auc(fpr_keras, tpr_keras)
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_keras, tpr_keras, label='Keras (area = {:.3f})'.format(auc_keras))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

The ROC_AUC for the Deep Learning Model is 0.946 as seen below
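
For intuition on what the ROC_AUC number measures, here is a small sketch in plain Python. It uses the pairwise definition of AUC (the fraction of positive/negative pairs in which the positive example gets the higher score), which gives the same value as the area under the ROC curve. The function name and the toy labels and scores are made up for illustration; for real data the sklearn calls above are the practical choice, since this version compares every pair.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 2 losing and 2 winning observations with made-up scores
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of winners above losers.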

e. Save the Keras model (in Python) for use in R

from keras.models import Model
model.save("wpDL.h5")

f. Load the model in R using the keras and rhdf5 packages for use in GooglyPlusPlus

library(keras)   # provides load_model_hdf5()
library(rhdf5)
dl_model <- load_model_hdf5('wpDL.h5')

Being able to create the Deep Learning model in Python and use it in my Shiny app GooglyPlusPlus was a huge success for me. The Deep Learning Keras model is light-weight and extremely fast.

The Deep Learning model has now been integrated into GooglyPlusPlus. Now you can check the Win Probability using both a) glmnet (Logistic Regression with lasso regularisation) b) Keras Deep Learning model with dropouts as regularisation

In addition I have created 2 features based on Win Probability (WP)

i) Win Probability (Side-by-side) – Plot (interactive): With this functionality the 1st and 2nd innings are shown side-by-side. When the 1st innings is played by team 1, the Win Probability of team 2 = 100 – WP (team 1). Similarly, when the 2nd innings is being played by team 2, the Win Probability of team 1 = 100 – WP (team 2)

ii) Win Probability (Overlapping) – Plot (static): With this functionality the Win Probabilities of both team 1 (1st innings) & team 2 (2nd innings) are displayed overlapping, so that we can see how the probabilities vary ball-by-ball.
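
The side-by-side construction described in i) can be sketched as follows. This is a minimal illustration, assuming the model gives the batting team's win probability (in %) for every ball of each innings; the function name `side_by_side` and the numbers are hypothetical and this is not the code used in GooglyPlusPlus.

```python
def side_by_side(wp_innings1, wp_innings2):
    """Build both teams' win-probability series over the whole match.

    wp_innings1: team 1's win probability (%) per ball while batting first.
    wp_innings2: team 2's win probability (%) per ball while chasing.
    Returns (team1_series, team2_series), each covering both innings.
    """
    # While one team bats, the other team's probability is the complement
    team1 = list(wp_innings1) + [100 - wp for wp in wp_innings2]
    team2 = [100 - wp for wp in wp_innings1] + list(wp_innings2)
    return team1, team2


# Made-up probabilities for a 3-ball "innings" each
team1, team2 = side_by_side([50, 55, 60], [45, 30, 20])
print(team1)
print(team2)
```

At every ball the two series add up to 100, which is why the two curves in the side-by-side plot mirror each other.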

Note: Since the same UI is used for all match functions I had to re-use the Plot(interactive) and Plot(static) radio buttons for Win Probability (Side-by-side) and Win Probability(Overlapping) respectively

Here are screenshots using both ML models with both functionality for some random matches

B) ICC T20 Men World Cup – Netherlands – South Africa – 2022-11-06

i) Match Worm wicket chart

ii) Win Probability with LR (Side-by-side – Plot (interactive))

iii) Win Probability with LR (Overlapping – Plot (static))

iv) Win Probability with Deep Learning (Side-by-side – Plot (interactive))

At the 213th ball of the side-by-side plot, South Africa was slightly ahead of Netherlands. After that they crashed and burned!

v) Win Probability with Deep Learning (Overlapping – Plot (static))

It can be seen that at the 94th ball of both innings South Africa was ahead of Netherlands before the eventual slump.

C) Intl. T20 (Women) India – New Zealand – 2020-02-27

Here is an interesting match between the India and New Zealand T20 Women’s teams. NZ successfully chased India’s total in a match of wildly swinging fortunes. See the charts below

i) Match Worm Wicket chart

ii) Win Probability with LR (Side-by-side – Plot (interactive))

iii) Win Probability with LR (Overlapping – Plot (static))

iv) Win Probability with DL model (Side-by-side – Plot (interactive))

v) Win Probability with DL model (Overlapping – Plot (static))

The above functionality in plotting the Win Probability using LR or DL with both options (Side-by-side or Overlapping) is available for all 9 T20 leagues currently supported by GooglyPlusPlus.

Go ahead and give gpp2023-1 a try!!!


Boosting Win Probability accuracy with player embeddings

In my previous post Computing Win Probability of T20 matches I had discussed various approaches to computing the Win Probability of T20 matches. I had created ML models with glmnet and random forest using TidyModels. This was what I had achieved

  • glmnet : accuracy – 0.67 and sensitivity/specificity – 0.68/0.65
  • random forest : accuracy – 0.737 and roc_auc- 0.834
  • DL model with Keras in Python : accuracy – 0.73

I wanted to see if the performance of the models could be further improved. I got a suggestion from an AI/DL whizkid, who is close to me, to include embeddings for batsmen and bowlers. He felt that win percentage is influenced by which batsman faces which bowler.

So, I started to explore this idea. Embeddings can be used to convert categorical variables to a vector of continuous floating point numbers. Fortunately R’s Tidymodels has convenient functionality to create embeddings. By including embeddings for batsman and bowler, the performance of my ML models improved vastly. Now the performance is
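
To make the idea of an embedding concrete, here is a minimal sketch in plain Python. Each distinct player name is mapped to an integer index, and the index selects a row of a table of floating point vectors. The helper names and the embedding size of 4 are assumptions for illustration, and the vectors are random placeholders; in a real model (Tidymodels’ step_embed or a Keras Embedding layer) these values are learned during training.

```python
import random


def build_index(categories):
    """Map each distinct category to an integer index, in first-seen order."""
    index = {}
    for c in categories:
        index.setdefault(c, len(index))
    return index


def init_embedding_table(n_categories, dim, seed=0):
    """Random placeholder vectors; a trained model would learn these values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(n_categories)]


# Player names as they appear in the ball-by-ball data
batsmen = ["JD Smith", "M Klinger", "M Klinger", "JD Smith"]
index = build_index(batsmen)
table = init_embedding_table(len(index), dim=4)

# Every ball now carries a 4-number vector instead of a name
vectors = [table[index[name]] for name in batsmen]
print(index)
print(vectors[0])
```

The same player always maps to the same vector, so the model can learn a compact numeric description of each batsman and bowler.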

  • glmnet : accuracy – 0.728 and roc_auc – 0.81
  • random forest : accuracy – 0.927 and roc_auc – 0.98
  • mlp-dnn :accuracy – 0.762 and roc_auc – 0.854

As can be seen, the accuracy of the random forest with embeddings is almost 20 percentage points higher than without embeddings (0.927 vs 0.737). Moreover, the feature importance which is plotted below shows that the bowler and batsman embeddings have a significant influence on the Win Probability

Note: The data for this analysis is taken from Cricsheet and has been processed with my R package yorkr.

A. Win Probability using GLM with penalty and player embeddings

Here a Generalised Linear Model (GLMNET) for Logistic Regression is used. In GLMNET the regularisation path is computed for the lasso or elastic net penalty at a grid of values for the regularisation parameter lambda. glmnet is extremely fast and gave an accuracy of 0.72 with an roc_auc of 0.81 with batsman, bowler embeddings. This was a good improvement over my earlier implementation with glmnet without the batsman & bowler embeddings, which had an accuracy of 0.67.

1) Read the data

a) Read the data from 9 T20 leagues (BBL, CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB) and create a single data frame of ball-by-ball data. Display the data frame

library(dplyr)
library(caret)
library(e1071)
library(ggplot2)
library(tidymodels)  
library(embed)

# Helper packages
library(readr)       # for importing data
library(vip) 

df1=read.csv("output3/matchesBBL3.csv")
df2=read.csv("output3/matchesCPL3.csv")
df3=read.csv("output3/matchesIPL3.csv")
df4=read.csv("output3/matchesNTB3.csv")
df5=read.csv("output3/matchesPSL3.csv")
df6=read.csv("output3/matchesSSM3.csv")
df7=read.csv("output3/matchesT20M3.csv")
df8=read.csv("output3/matchesT20W3.csv")
df9=read.csv("output3/matchesWBB3.csv")

#Bind all dataframes together
df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)
glimpse(df)
Rows: 1,199,115
Columns: 10
$ batsman        <chr> "JD Smith", "M Klinger", "M Klinger", "M Klinger", "JD …
$ bowler         <chr> "NM Hauritz", "NM Hauritz", "NM Hauritz", "NM Hauritz",…
$ ballNum        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ ballsRemaining <int> 125, 124, 123, 122, 121, 120, 119, 118, 117, 116, 115, …
$ runs           <int> 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 13, 14, 16, 18, 18,…
$ runRate        <dbl> 1.0000000, 0.5000000, 0.6666667, 0.7500000, 0.6000000, …
$ numWickets     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ runsMomentum   <dbl> 0.08800000, 0.08870968, 0.08943089, 0.09016393, 0.09090…
$ perfIndex      <dbl> 11.000000, 5.500000, 7.333333, 8.250000, 6.600000, 5.50…
$ isWinner       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…


df %>% 
  count(isWinner) %>% 
  mutate(prop = n/sum(n))
  isWinner      n      prop
1        0 614237 0.5122419
2        1 584878 0.4877581

2) Create training, validation and test sets

a) Split into training, validation and test sets. The dataset is initially split into training and test in the ratio 80:20. The training data is again split into training and validation in the ratio 80:20

set.seed(123)
splits      <- initial_split(df,prop = 0.80)
splits
<Training/Testing/Total>
<959292/239823/1199115>
df_other <- training(splits)
df_test  <- testing(splits)

set.seed(234)
val_set <- validation_split(df_other,prop = 0.80)
val_set
# A tibble: 1 × 2
  splits                  id        
  <list>                  <chr>     
1 <split [767433/191859]> validation
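
The split sizes printed above follow directly from applying the 80:20 ratio twice. The sketch below reproduces them with integer arithmetic in Python. Rounding the first part down is my assumption; it happens to match the sizes shown, but I have not checked it against the rsample source, and the function name `split_sizes` is made up.

```python
def split_sizes(n_rows, first_pct):
    """Split n_rows into two parts, the first holding first_pct percent.

    The first part is rounded down (an assumption that matches the sizes above).
    """
    n_first = n_rows * first_pct // 100
    return n_first, n_rows - n_first


n_total = 1199115                                   # rows in the combined data frame
n_other, n_test = split_sizes(n_total, 80)          # training+validation vs test
n_train, n_validation = split_sizes(n_other, 80)    # training vs validation
print(n_other, n_test, n_train, n_validation)
```

So the model is tuned on about 64% of the balls, validated on 16% and tested on the remaining 20%.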

3) Create pre-processing recipe

a) Normalise the following predictors

  • ballNum
  • ballsRemaining
  • runs
  • runRate
  • numWickets
  • runsMomentum
  • perfIndex

b) Create floating point embeddings for

  • batsman
  • bowler

4) Create a Logistic Regression Workflow by adding the GLM model and the recipe

5) Create grid of elastic penalty values for regularisation

6) Train all 30 models

7) Plot the ROC of the model against the penalty
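
The penalty grid in step 5 is spaced evenly on a log scale. As a minimal Python sketch of the same idea (the function name `log_spaced_grid` is mine and is not part of Tidymodels), 30 values between 10^-4 and 10^-1 can be generated as follows:

```python
def log_spaced_grid(low_exp, high_exp, n):
    """Return n values 10**e, with e evenly spaced from low_exp to high_exp."""
    if n < 2:
        raise ValueError("need at least 2 grid points")
    step = (high_exp - low_exp) / (n - 1)
    return [10 ** (low_exp + i * step) for i in range(n)]


# 30 penalties from 0.0001 up to 0.1, one per candidate model
grid = log_spaced_grid(-4, -1, 30)
print([f"{p:.3g}" for p in grid[:5]])
print([f"{p:.3g}" for p in grid[-5:]])
```

A log-spaced grid tries as many candidates between 0.0001 and 0.001 as between 0.01 and 0.1, which suits a penalty whose useful range spans several orders of magnitude.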

# Use all available cores (12 on my machine)
cores <- parallel::detectCores()
cores
# Create a Logistic Regression model with penalty
lr_mod <- 
  logistic_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet",num.threads = cores)

# Create pre-processing recipe
lr_recipe <- 
  recipe(isWinner ~ ., data = df_other) %>%
  step_embed(batsman,bowler, outcome = vars(isWinner)) %>%
  step_normalize(ballNum,ballsRemaining,runs,runRate,numWickets,runsMomentum,perfIndex) 

# Set the workflow by adding the GLM model with the recipe
lr_workflow <- 
  workflow() %>% 
  add_model(lr_mod) %>% 
  add_recipe(lr_recipe)

# Create a grid for the elastic net penalty
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
lr_reg_grid %>% top_n(-5)  # lowest penalty values
# A tibble: 5 × 1
   penalty
     <dbl>
1 0.0001  
2 0.000127
3 0.000161
4 0.000204
5 0.000259

lr_reg_grid %>% top_n(5)  # highest penalty values
# A tibble: 5 × 1
  penalty
    <dbl>
1  0.0386
2  0.0489
3  0.0621
4  0.0788
5  0.1
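
The grid is evenly spaced on a log scale rather than a linear one, so neighbouring penalties differ by a constant factor. A small base-R check of that property, using the same expression as the grid above:

```r
# Sketch: the 30 candidate penalties are evenly spaced on a log10 scale.
penalty <- 10^seq(-4, -1, length.out = 30)

range(penalty)                       # from 1e-04 up to 1e-01
ratio <- penalty[-1] / penalty[-30]  # ratio between neighbouring values
range(ratio)                         # constant, about 1.27

signif(penalty[8], 3)                # the 8th value, 0.00053
```

A log-spaced grid is a common choice for regularisation penalties because their effect tends to change over orders of magnitude.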

# Train 30 penalized models
lr_res <- 
  lr_workflow %>% 
  tune_grid(val_set,
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(accuracy,roc_auc))

# Plot the penalty versus ROC AUC
# (keep only the roc_auc rows, since accuracy is collected as well)
lr_plot <- 
  lr_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>% 
  ggplot(aes(x = penalty, y = mean)) + 
  geom_point() + 
  geom_line() + 
  ylab("Area under the ROC Curve") +
  scale_x_log10(labels = scales::label_number())

lr_plot

The penalty vs ROC AUC plot is shown below

8) Display the ROC_AUC of the top models with the penalty

9) Select a model with the best ROC_AUC and the associated penalty. The mean ROC_AUC stays at 0.810 for the twelve smallest penalties, so any of them is a reasonable choice; I pick the 8th, with a penalty of 0.000530

top_models <-
  lr_res %>% 
  show_best("roc_auc", n = 15) %>% 
  arrange(penalty) 
top_models

# A tibble: 15 × 7
    penalty .metric .estimator  mean     n std_err .config              
      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1 0.0001   roc_auc binary     0.810     1      NA Preprocessor1_Model01
 2 0.000127 roc_auc binary     0.810     1      NA Preprocessor1_Model02
 3 0.000161 roc_auc binary     0.810     1      NA Preprocessor1_Model03
 4 0.000204 roc_auc binary     0.810     1      NA Preprocessor1_Model04
 5 0.000259 roc_auc binary     0.810     1      NA Preprocessor1_Model05
 6 0.000329 roc_auc binary     0.810     1      NA Preprocessor1_Model06
 7 0.000418 roc_auc binary     0.810     1      NA Preprocessor1_Model07
 8 0.000530 roc_auc binary     0.810     1      NA Preprocessor1_Model08
 9 0.000672 roc_auc binary     0.810     1      NA Preprocessor1_Model09
10 0.000853 roc_auc binary     0.810     1      NA Preprocessor1_Model10
11 0.00108  roc_auc binary     0.810     1      NA Preprocessor1_Model11
12 0.00137  roc_auc binary     0.810     1      NA Preprocessor1_Model12
13 0.00174  roc_auc binary     0.809     1      NA Preprocessor1_Model13
14 0.00221  roc_auc binary     0.809     1      NA Preprocessor1_Model14
15 0.00281  roc_auc binary     0.809     1      NA Preprocessor1_Model15

# Pick the best model and the corresponding penalty
# (the 8th penalty in the grid, using only the roc_auc rows)
lr_best <- 
  lr_res %>% 
  collect_metrics() %>% 
  filter(.metric == "roc_auc") %>% 
  arrange(penalty) %>% 
  slice(8)
lr_best
# A tibble: 1 × 7
   penalty .metric .estimator  mean     n std_err .config              
     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1 0.000530 roc_auc binary     0.810     1      NA Preprocessor1_Model08

# Collect predictions and generate the AUC curve
lr_auc <- 
  lr_res %>% 
  collect_predictions(parameters = lr_best) %>% 
  roc_curve(isWinner, .pred_0) %>% 
  mutate(model = "Logistic Regression")

autoplot(lr_auc)

The ROC curve for the selected model is shown below.
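
For intuition, the area under the ROC curve equals the probability that a randomly chosen case of the event class receives a higher score than a randomly chosen case of the other class. The base-R sketch below computes it from ranks on toy data (not taken from the match data). The helper `auc_by_ranks` is a name made up for this illustration; it is not how yardstick computes the metric internally.

```r
# Sketch: ROC AUC as the probability that a randomly chosen positive case
# gets a higher score than a randomly chosen negative case (ties count 1/2).
# Toy data, not taken from the match data set.
truth <- c(1, 1, 1, 0, 0, 0, 1, 0)                   # 1 = event of interest
score <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.5)   # predicted probability of the event

auc_by_ranks <- function(truth, score) {
  r     <- rank(score)                # average ranks handle ties
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_by_ranks(truth, score)   # 13 of the 16 positive/negative pairs are ordered correctly: 0.8125
```

In the post's code the score passed to roc_curve() is .pred_0, which matches yardstick's default of treating the first factor level as the event.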

10) Build the final model with the best LR parameters value as found in lr_best

a) The best performance was for a penalty of 0.000530

b) Using the embeddings for batsmen and bowlers clearly improves on the GLM model without embeddings: the accuracy is now about 0.73, whereas previously it was 0.67 (see Computing Win Probability of T20 Matches)

c) Create a fit with the best parameters

d) The accuracy on the test set is 72.8% and the ROC_AUC is 0.813

# Create a model with the penalty for best ROC_AUC
last_lr_mod <- 
  logistic_reg(penalty = 0.000530, mixture = 1) %>% 
  set_engine("glmnet",num.threads = cores,importance = "impurity")

#Update the workflow with this model
last_lr_workflow <- 
  lr_workflow %>% 
  update_model(last_lr_mod)

#Create a fit
set.seed(345)
last_lr_fit <- 
  last_lr_workflow %>% 
  last_fit(splits)

#Generate accuracy, roc_auc
last_lr_fit %>% 
  collect_metrics()
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.728 Preprocessor1_Model1
2 roc_auc  binary         0.813 Preprocessor1_Model1

11) Plot the feature importance

It can be seen that the bowler and batsman embeddings are the most significant for the prediction, followed by runRate.

Here runRate stands for:

  • runRate in 1st innings
  • requiredRunRate in 2nd innings

12) Plot the ROC characteristics

last_lr_fit %>% 
  collect_predictions() %>% 
  roc_curve(isWinner, .pred_0) %>% 
  autoplot()

13) Generate a confusion matrix

14) Create a final Generalised Linear Model for Logistic Regression with the penalty of 0.000530

15) Save the model

# generate predictions from the test set
test_predictions <- last_lr_fit %>% collect_predictions()
test_predictions

# generate a confusion matrix
test_predictions %>% 
  conf_mat(truth = isWinner, estimate = .pred_class)

          Truth
Prediction     0     1
         0 90105 32658
         1 32572 84488

final_lr_model <- fit(last_lr_workflow, df_other)

final_lr_model

lobstr::obj_size(final_lr_model)
146.51 MB

butcher::weigh(final_lr_model)
# A tibble: 305 × 2
  object                                  size
  <chr>                                  <dbl>
1 pre.actions.recipe.recipe.steps.terms1  57.9
2 pre.actions.recipe.recipe.steps.terms2  57.9
3 pre.actions.recipe.recipe.steps.terms3  57.9


cleaned_lm <- butcher::axe_env(final_lr_model, verbose = TRUE)
#✔ Memory released: "1.04 kB"
#✔ Memory released: "1.62 kB"

saveRDS(cleaned_lm, "cleanedLR.rds")
  

16) Compute Ball-by-ball Win Probability

  • Chennai Super Kings-Lucknow Super Giants-2022-03-31

16a) The corresponding Worm-wicket graph for this match is as below

  • Chennai Super Kings-Lucknow Super Giants-2022-03-31

B) Win Probability using Random Forest with player embeddings

In the 2nd approach I use Random Forest with batsman and bowler embeddings. The performance of the model with embeddings is a quantum jump over the earlier performance without embeddings. However, the random forest is also computationally intensive.

1) Read the data

a) Read the data from 9 T20 leagues (BBL, CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB) and create a single data frame of ball-by-ball data. Display the data frame

2) Create training, validation and test sets

b) Split into training, validation and test sets. The dataset is first split into training and test sets in the ratio 80:20. The training data is then split again into training and validation sets in the ratio 80:20

library(dplyr)
library(caret)
library(e1071)
library(ggplot2)
library(tidymodels)  
library(embed)

# Helper packages
library(readr)       # for importing data
library(vip) 
library(ranger)

# Read all the 9 T20 leagues
df1=read.csv("output3/matchesBBL3.csv")
df2=read.csv("output3/matchesCPL3.csv")
df3=read.csv("output3/matchesIPL3.csv")
df4=read.csv("output3/matchesNTB3.csv")
df5=read.csv("output3/matchesPSL3.csv")
df6=read.csv("output3/matchesSSM3.csv")
df7=read.csv("output3/matchesT20M3.csv")
df8=read.csv("output3/matchesT20W3.csv")
df9=read.csv("output3/matchesWBB3.csv")

# Bind into a single dataframe
df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)

set.seed(123)
df$isWinner = as.factor(df$isWinner)

#Split data into training, validation and test sets
splits      <- initial_split(df,prop = 0.80)
df_other <- training(splits)
df_test  <- testing(splits)
set.seed(234)
val_set <- validation_split(df_other, prop = 0.80)
val_set

2) Create a Random Forest model, tuning the number of predictors randomly sampled at each split (mtry) and the minimum number of data points a node must contain to be split further (min_n)

3) Use the ranger engine and set up for classification

4) Set up the recipe and include batsman and bowler embeddings

5) Create a workflow and add the recipe and the random forest model with the tuning parameters
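
To make the two tuning parameters concrete: at every split a random forest considers only a random subset of mtry predictors, and a node holding fewer than min_n rows is not split further. The base-R sketch below illustrates just those two rules. It is not how ranger is implemented, the helper names are made up, and the two embedding column names are hypothetical.

```r
# Sketch of the two rules controlled by mtry and min_n (illustration only).
predictors <- c("ballNum", "ballsRemaining", "runs", "runRate", "numWickets",
                "runsMomentum", "perfIndex",
                "batsman_embed", "bowler_embed")   # embedding names are hypothetical

mtry  <- 4   # predictors considered at each split
min_n <- 4   # minimum rows a node needs before it may be split

# Rule 1: each split looks at a fresh random subset of mtry predictors
candidates_for_split <- function() sample(predictors, mtry)

# Rule 2: a node is only split further if it holds at least min_n rows
can_split <- function(n_rows_in_node) n_rows_in_node >= min_n

set.seed(1)
candidates_for_split()     # four predictors drawn at random
can_split(c(3, 4, 250))    # FALSE TRUE TRUE
```

A small mtry makes the trees less correlated with each other, while a small min_n lets each tree grow deeper.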

# Use all 12 cores in parallel
cores <- parallel::detectCores()
cores
[1] 12

# Create the random forest model with mtry and min_n as tuning parameters
rf_mod <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger", num.threads = cores) %>% 
  set_mode("classification")

# Setup the recipe with batsman and bowler embeddings
rf_recipe <- 
  recipe(isWinner ~ ., data = df_other) %>% 
  step_embed(batsman,bowler, outcome = vars(isWinner)) 

# Create the random forest workflow
rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)

rf_mod
# show what will be tuned
extract_parameter_set_dials(rf_mod)

set.seed(345)

# Tune the model: grid = 10 evaluates 10 candidate (mtry, min_n) combinations on the validation set
rf_res <- 
  rf_workflow %>% 
  tune_grid(val_set,
            grid = 10,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(accuracy,roc_auc))

# Pick the best  roc_auc and the associated tuning parameters
rf_res %>% 
  show_best(metric = "roc_auc")
# A tibble: 5 × 8
   mtry min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     4     4 roc_auc binary     0.980     1      NA Preprocessor1_Model08
2     9     8 roc_auc binary     0.979     1      NA Preprocessor1_Model03
3     8    16 roc_auc binary     0.974     1      NA Preprocessor1_Model10
4     7    22 roc_auc binary     0.969     1      NA Preprocessor1_Model09
5     5    19 roc_auc binary     0.969     1      NA Preprocessor1_Model06

rf_res %>% 
  show_best(metric = "accuracy")
# A tibble: 5 × 8
   mtry min_n .metric  .estimator  mean     n std_err .config              
  <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
1     4     4 accuracy binary     0.927     1      NA Preprocessor1_Model08
2     9     8 accuracy binary     0.926     1      NA Preprocessor1_Model03
3     8    16 accuracy binary     0.915     1      NA Preprocessor1_Model10
4     7    22 accuracy binary     0.906     1      NA Preprocessor1_Model09
5     5    19 accuracy binary     0.904     1      NA Preprocessor1_Model0

6) Select the models with the best roc_auc. It can be seen that the best roc_auc is 0.980 for mtry=4 and min_n=4

7) Get the model with the highest accuracy. The highest accuracy achieved is 0.927 or 92.7%. This accuracy is also for mtry=4 and min_n=4

8) Select the model with the best parameters for accuracy, mtry=4 and min_n=4. For this configuration the accuracy is 0.927 and the roc_auc is also the best at 0.980

9) Plot the ROC curve. It can be seen that this model performs really well: the curve hugs the top-left corner.

# Pick the best model
rf_best <- 
  rf_res %>% 
  select_best(metric = "accuracy")

# The best model has mtry=4 and min_n=4
rf_best
   mtry min_n .config              
  <int> <int> <chr>                
1     4     4 Preprocessor1_Model08

#Plot AUC
rf_auc <- 
  rf_res %>% 
  collect_predictions(parameters = rf_best) %>% 
  roc_curve(isWinner, .pred_0) %>% 
  mutate(model = "Random Forest")

autoplot(rf_auc)

10) Create the final model with the best parameters

11) Execute the final fit

12) Plot feature importance. The bowler and batsman embeddings, followed by perfIndex and runRate, are the features that contribute the most to the Win Probability

last_rf_mod <- 
  rand_forest(mtry = 4, min_n = 4, trees = 1000) %>% 
  set_engine("ranger", num.threads = cores, importance = "impurity") %>% 
  set_mode("classification")

# the last workflow
last_rf_workflow <- 
  rf_workflow %>% 
  update_model(last_rf_mod)

set.seed(345)
last_rf_fit <- 
  last_rf_workflow %>% 
  last_fit(splits)

last_rf_fit %>% 
  collect_metrics()

  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.944 Preprocessor1_Model1
2 roc_auc  binary         0.988 Preprocessor1_Model1

last_rf_fit %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 9)

13) Plot the ROC curve for the best fit

# Plot the ROC for the final model
last_rf_fit %>% 
  collect_predictions() %>% 
  roc_curve(isWinner, .pred_0) %>% 
  autoplot()

14) Create a confusion matrix

We can see that the number of false positives and false negatives is very low

15) Create the final fit with the Random Forest Model

# generate predictions from the test set
test_predictions <- last_rf_fit %>% collect_predictions()
test_predictions

   id               .pred_0 .pred_1  .row .pred_class isWinner .config          
   <chr>              <dbl>   <dbl> <int> <fct>       <fct>    <chr>            
 1 train/test split   0.838  0.162      1 0           0        Preprocessor1_Mo…
 2 train/test split   0.463  0.537     11 1           0        Preprocessor1_Mo…
 3 train/test split   0.846  0.154     14 0           0        Preprocessor1_Mo…
 4 train/test split   0.839  0.161     22 0           0        Preprocessor1_Mo…
 5 train/test split   0.846  0.154     36 0           0        Preprocessor1_Mo…
 6 train/test split   0.848  0.152     37 0           0        Preprocessor1_Mo…
 7 train/test split   0.731  0.269     39 0           0        Preprocessor1_Mo…
 8 train/test split   0.972  0.0281    40 0           0        Preprocessor1_Mo…
 9 train/test split   0.655  0.345     42 0           0        Preprocessor1_Mo…
10 train/test split   0.662  0.338     43 0           0        Preprocessor1_Mo…

# generate a confusion matrix
test_predictions %>% 
  conf_mat(truth = isWinner, estimate = .pred_class)

          Truth
Prediction      0      1
         0 116576   7096
         1   6391 109760

# Create the final model
final_model <- fit(last_rf_workflow, df_other)

16) Computing Win Probability with Random Forest Model for match

  • Pakistan-India-2022-10-23

17) Worm-wicket graph of match

  • Pakistan-India-2022-10-23

C) Win Probability using MLP – Deep Neural Network (DNN) with player embeddings

In this approach the mlp() model from Tidymodels was used with the keras engine. A Multi-layer Perceptron (MLP), a small Deep Neural Network (DNN), was used to compute the Win Probability using player embeddings. An accuracy of 0.76 was obtained

1) Read the data

a) Read the data from 9 T20 leagues (BBL, CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB) and create a single data frame of ball-by-ball data. Display the data frame

2) Create training, validation and test sets

b) Split into training, validation and test sets. The dataset is first split into training and test sets in the ratio 80:20. The training data is then split again into training and validation sets in the ratio 80:20

library(dplyr)
library(caret)
library(e1071)
library(ggplot2)
library(tidymodels)    
library(embed)

# Helper packages
library(readr)       # for importing data
library(vip) 
library(ranger)

df1=read.csv("output3/matchesBBL3.csv")
df2=read.csv("output3/matchesCPL3.csv")
df3=read.csv("output3/matchesIPL3.csv")
df4=read.csv("output3/matchesNTB3.csv")
df5=read.csv("output3/matchesPSL3.csv")
df6=read.csv("output3/matchesSSM3.csv")
df7=read.csv("output3/matchesT20M3.csv")
df8=read.csv("output3/matchesT20W3.csv")
df9=read.csv("output3/matchesWBB3.csv")

df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)


set.seed(123)
df$isWinner = as.factor(df$isWinner)
splits      <- initial_split(df,prop = 0.80)
df_other <- training(splits)
df_test  <- testing(splits)
set.seed(234)
val_set <- validation_split(df_other, 
                            prop = 0.80)
val_set

3) Create a Deep Neural Network recipe

  • Normalize parameters
  • Add embeddings for batsman, bowler

4) Set the MLP-DNN hyperparameters

  • epochs=100
  • hidden units =5
  • dropout regularization =0.1

5) Fit on Training data

cores <- parallel::detectCores()
cores

nn_recipe <- 
  recipe(isWinner ~ ., data = df_other) %>% 
  step_normalize(ballNum,ballsRemaining,runs,runRate,numWickets,runsMomentum,perfIndex) %>%
  step_embed(batsman,bowler, outcome = vars(isWinner)) %>%
  prep(training = df_other, retain = TRUE) 

# For validation:
test_normalized <- bake(nn_recipe, new_data = df_test)

set.seed(57974)
# Set the hyper parameters for DNN
# Use Keras
# Fit on training data
nnet_fit <-
  mlp(epochs = 100, hidden_units = 5, dropout = 0.1) %>%
  set_mode("classification") %>% 
  # Also set engine-specific `verbose` argument to prevent logging the results: 
  set_engine("keras", verbose = 0) %>%
  fit(isWinner ~ ., data = bake(nn_recipe, new_data = df_other))

nnet_fit
parsnip model object
Model: "sequential"
____________________________________________________________________________
Layer (type)                        Output Shape                  Param #   
============================================================================
dense (Dense)                       (None, 5)                     60        
____________________________________________________________________________
dense_1 (Dense)                     (None, 5)                     30        
____________________________________________________________________________
dropout (Dropout)                   (None, 5)                     0         
____________________________________________________________________________
dense_2 (Dense)                     (None, 2)                     12        
============================================================================
Total params: 102
Trainable params: 102
Non-trainable params: 0
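
The parameter counts in the summary can be verified by hand: a fully connected layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases. The sketch below redoes that arithmetic in base R. The input width of 11 is inferred from the first layer's 60 parameters and is not stated in the post; it would be consistent with the seven normalised predictors plus two embedding columns each for batsman and bowler.

```r
# Sketch: parameter count of a fully connected layer = inputs * units + units (biases).
dense_params <- function(n_in, n_out) n_in * n_out + n_out

# 11 inputs is inferred from the first layer's 60 parameters: 60 = 5 * (n_in + 1)
n_inputs <- 60 / 5 - 1

layers <- c(dense   = dense_params(n_inputs, 5),
            dense_1 = dense_params(5, 5),
            dropout = 0,                    # dropout has no trainable weights
            dense_2 = dense_params(5, 2))
layers
sum(layers)                                 # 102, matching "Total params" above
```

With only 102 parameters this is a very small network, which is in line with the later remark that the NN model is lightweight.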

6) Test on Test data

  • Check ROC_AUC. It is 0.854
  • Check accuracy. The MLP-DNN gives a decent performance with an accuracy of 0.76
  • Compute the Confusion Matrix

# Validate on test data
val_results <- 
  df_test %>%
  bind_cols(
    predict(nnet_fit, new_data = test_normalized),
    predict(nnet_fit, new_data = test_normalized, type = "prob")
  )
val_results 

# Check roc_auc
val_results %>% roc_auc(truth = isWinner, .pred_0)
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.854

# Check accuracy
val_results %>% accuracy(truth = isWinner, .pred_class)
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.762

# Display confusion matrix
val_results %>% conf_mat(truth = isWinner, .pred_class)
          Truth
Prediction     0     1
         0 97419 31564
         1 25548 85292

Conclusion

  1. Of the 3 ML models (glmnet, random forest and Multi-layer Perceptron DNN), random forest had the best performance
  2. The Random Forest model with batsman and bowler embeddings achieved an accuracy of 92.7% and a ROC_AUC of 0.98 on the validation set (94.4% and 0.988 on the test set), with very few false positives and false negatives. This was a quantum jump from my earlier random forest model without embeddings, which had an accuracy of 73.7% and a ROC_AUC of 0.834
  3. The glmnet and NN models are fairly lightweight. Random Forest is computationally very intensive.
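
The comparison can be made concrete from the three test-set confusion matrices reported in this post. The base-R sketch below recomputes the accuracy and the two error rates of each model from those counts; the helper name `summarise_model` is made up for this illustration.

```r
# Sketch: error rates of the three models from their test-set confusion matrices.
# Each vector is c(pred0_truth0, pred1_truth0, pred0_truth1, pred1_truth1).
counts <- list(
  glmnet        = c(90105, 32572, 32658, 84488),
  random_forest = c(116576, 6391, 7096, 109760),
  mlp           = c(97419, 25548, 31564, 85292)
)

summarise_model <- function(x) {
  total <- sum(x)
  c(accuracy         = (x[1] + x[4]) / total,
    wrong_on_truth_0 = x[2] / (x[1] + x[2]),   # true 0 predicted as 1
    wrong_on_truth_1 = x[3] / (x[3] + x[4]))   # true 1 predicted as 0
}

comparison <- round(t(sapply(counts, summarise_model)), 3)
comparison
```

The accuracy column reproduces the reported 0.728, 0.944 and 0.762, and the random forest's error rates on both classes are several times lower than those of the other two models.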

Check out my other posts

  1. Using Reinforcement Learning to solve Gridworld
  2. Deep Learning from first principles in Python, R and Octave – Part 8
  3. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
  4. Big Data-5: kNiFi-ing through cricket data with yorkpy
  5. Singularity
  6. Practical Machine Learning with R and Python – Part 6
  7. GooglyPlusPlus2022 optimizes batting/bowling lineup
  8. Fun simulation of a Chain in Android
  9. Introducing cricpy:A python package to analyze performances of cricketers
  10. Programming languages in layman’s language

To see all posts click Index of posts

Near Real-time Analytics of ICC Men’s T20 World Cup with GooglyPlusPlus

In my last post GooglyPlusPlus gets ready for ICC Men’s T20 World Cup, I had mentioned that GooglyPlusPlus was preparing for the big event, the ICC Men’s T20 World Cup. Now that the T20 World Cup is underway, my Shiny app in R, GooglyPlusPlus, will be generating near real-time analytics of matches completed the previous day. Besides, the app can also do historical analysis of players, teams and matches.

The whole process is automated. A cron job will execute every day, in the morning, which will automatically download the matches of the previous day from Cricsheet, unzip them, start a pipeline which will transform and process the match data into necessary folders and finally upload the newly acquired data into my Shiny app. Hence, you will be able to access all the breathless, pulsating cricketing action in timeless, interactive plots and tables which will capture all aspects of Men’s T20 matches, namely batsman, bowler performance, match analysis, team-vs-team, team-vs-all teams besides ranking of batsmen & bowlers. Since the data is cumulative, all the analytics are historical and current.
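
A minimal sketch of what such a daily step could look like in R is given below. All function names, arguments and paths are hypothetical and are not taken from the actual GooglyPlusPlus pipeline; the archive URL is left as an argument rather than assumed. A scheduler such as cron would call run_daily_ingest() once every morning.

```r
# Sketch of a daily ingest step (all names and paths here are hypothetical).

# Date of the matches to pick up: the day before the run date
previous_day <- function(run_date = Sys.Date()) as.Date(run_date) - 1

# Download an archive, unpack it and return the unpacked file paths
fetch_and_unpack <- function(archive_url, work_dir) {
  dir.create(work_dir, recursive = TRUE, showWarnings = FALSE)
  zip_path <- file.path(work_dir, basename(archive_url))
  download.file(archive_url, zip_path, mode = "wb")
  unzip(zip_path, exdir = work_dir)          # returns the extracted file paths
}

run_daily_ingest <- function(archive_url, work_dir, run_date = Sys.Date()) {
  match_day <- previous_day(run_date)
  files     <- fetch_and_unpack(archive_url, work_dir)
  message("Unpacked ", length(files), " files; processing matches of ", format(match_day))
  # ... convert, filter to match_day and copy into the app's data folders here ...
  invisible(files)
}

previous_day("2022-10-17")   # the run on 17 Oct picks up the matches of 16 Oct
```

Keeping the download, unpack and processing steps in separate functions makes it easier to rerun a single stage when one day's job fails.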

Check out GooglyPlusPlus!!

The data for GooglyPlusPlus is taken from Cricsheet

Interest in cricket has mushroomed in recent times around the world, with the addition of new formats, starting with ODI and followed by T20, T10, 100 ball and so on. There are leagues which host these matches at different levels around the world. While GooglyPlusPlus provides near real-time analytics of the Men’s T20 World Cup, we can clearly envision a big data platform which ingests matches daily from multiple cricket formats and leagues around the world, generating real-time and near real-time analytics, which are essential these days for the selection of teams at different levels through auctions. For more discussion on this see my posts

  1. Big Data 7: yorkr waltzes with Apache NiFi
  2. Big Data 6: The T20 Dance of Apache NiFi and yorkpy

We could imagine a Data Lake into which data from the different cricket formats and leagues is ingested through appropriate technology connectors. Once the data is ingested, we could have data pipelines, based on Azure ADF, Apache NiFi, Apache Airflow or Amazon EMR etc., to transform, process and enhance the data, generating real-time analytics on the fly. Recent formats like T20 and T10 require more urgency in strategic thinking, whether that is scoring within limited overs or containing batsmen from going on a rampage within a set of overs. Analytics on the fly may help the coach to modify the batting or bowling lineup at different points in the match. In this context see my earlier post Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket

All of these are not just possible, but are likely to become reality as more and more formats, leagues and cricket data proliferate around the world.

This post focuses on generating near real-time analytics for the ICC Men’s T20 World Cup using GooglyPlusPlus. Included below is a sampling of the analytics that you can perform for analysing the matches. In addition, you can do all the analysis included in my post GooglyPlusPlus gets ready for ICC Men’s T20 World Cup

  1. Namibia-Sri Lanka-16 Oct 2022 : Match Worm graph

The opening match between Namibia vs Sri Lanka resulted in an upset. We can see this in the match worm-wicket graph below

2. Scotland vs West Indies – 17 Oct 2022: Batsmen vs Bowlers

George Munsey was the top scorer for Scotland and was instrumental in the win against WI. His performance against West Indies bowlers is shown below. Note, the charts are interactive

3. Zimbabwe vs Ireland – 17 Oct 2022 : Team Runs vs SR

Sikandar Raza of Zimbabwe scored 82 runs at a strike rate of ~170

4. United Arab Emirates vs Netherlands – 16 Oct 2022: Team runs across 20 overs

UAE pipped Netherlands in the middle overs, but Netherlands were able to win by 3 wickets with 1 ball to spare

5. Scotland vs Ireland – 19 Oct 2022 : Team Runs vs SR Middle overs plot

Curtis Campher snatched the game away from Scotland with his stellar performance in middle and death overs

6. UAE vs Namibia : 20 Oct 2022 : Team Wickets vs ER plot

Basoor Hameed and Zahoor Khan got 2 wickets apiece with an economy rate of ~5.00 but still they were not able to stop UAE from stealing a win

7. Overall Runs vs SR in T20 World Cup 2022

It is too early to rank the players. Nevertheless, in the current T20 World Cup, MP O’Dowd (Netherlands), BKG Mendis (Sri Lanka) and JN Frylinck (Namibia) are the top 3 batsmen with good runs and strike rate

8. Overall Wickets over ER in T20 World Cup 2022

The top 3 bowlers so far in T20 World Cup 2022 are a) BFW de Leede (Netherlands) b) PWH De Silva (Sri Lanka) c) KP Meiyappan (UAE) with a total of 7, 7 and 6 wickets respectively

Note: Besides the match analysis GooglyPlusPlus also provides detailed analysis of batsmen, bowlers, matches as above, team-vs-team, team-vs-all teams, ranking of batsmen & bowlers etc. For more details see my post GooglyPlusPlus gets ready for ICC Men’s T20 World Cup

Do visit GooglyPlusPlus every day to check out the cricketing action of matches gone by. You can also follow me on twitter @tvganesh_85 for daily highlights.

You may also like

  1. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
  2. De-blurring revisited with Wiener filter using OpenCV
  3. Using Reinforcement Learning to solve Gridworld
  4. Deep Learning from first principles in Python, R and Octave – Part 3
  5. Getting started with Tensorflow, Keras in Python and R
  6. Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR
  7. Practical Machine Learning with R and Python – Part 5
  8. Cricpy takes a swing at the ODIs
  9. Video presentation on Machine Learning, Data Science, NLP and Big Data – Part 1

To see all posts click Index of posts

GooglyPlusPlus gets ready for ICC Men’s T20 World Cup

It is time!! So last weekend, I turned the wheels, moved the levers and listened to the hiss of steam, as I cranked up my Shiny app GooglyPlusPlus. The ICC Men’s T20 World Cup is just around the corner, and it was time to prepare for this event. This latest GooglyPlusPlus is current with the latest Intl. men’s T20 match data, give or take a few. GooglyPlusPlus can analyze batsmen, bowlers, matches, team-vs-team, team-vs-all teams, besides also ranking batsmen, bowlers and plot performances in Powerplay, middle and death overs.

In this post, I include a quick refresher of some of the features of my app GooglyPlusPlus. Note: This is a random sampling of the functions available. There are more than 120 features available in the app.

Check out your favourite players and your country’s team with GooglyPlusPlus

Note 1: All charts are interactive

Note 2: You can choose a date range for your analysis

Note 3: The data for this app is taken from Cricsheet

  1. T20 Batsman tab

This tab includes functions pertaining to individual batsmen. Functions include Runs vs Deliveries, moving average runs, cumulative average run, cumulative average strike rate, runs against opposition, runs at venue etc.

For e.g.

a) Suryakumar Yadav’s (India) cumulative strike rate

b) Mohammed Rizwan’s (Pakistan) performance against opposition

2. T20 Bowler’s Tab

The bowlers tab has functions for computing mean economy rate, moving average wickets, cumulative average wickets, cumulative economy rate, bowler’s performance against opposition, bowler’s performance at venue, predict wickets and others

A random function is shown below

a) Predict wickets for Wanindu Hasaranga of Sri Lanka

3. T20 Match tab

The match tab has functions that can compute match batting & bowling scorecard, batting partnerships, batsmen performance vs bowlers, bowler’s wicket kind, bowler’s wicket match, match worm graph, match worm wicket graph, team runs across 20 overs, team wickets in 20 overs, teams runs or wickets in powerplay, middle and death overs

Here are a couple of functions from this tab

a) Afghanistan vs Ireland – 2022-08-15

b) Australia vs Sri Lanka – 2019-11-01 – Runs across 20 overs

4. T20 Head-to-head tab

This tab provides the analysis of all combinations of T20 teams (countries) in different aspects. This tab can compute the overall batting and bowling scorecards in all matches between 2 countries, batsmen partnerships, performances against bowlers, bowlers vs batsmen, runs, strike rate, wickets and economy rate across 20 overs, the runs vs SR plot and the wickets vs ER plot in all matches between the teams, and so on. Here are a couple of examples from this tab

a) Bangladesh vs West Indies – Batting scorecard from 2019-01-01 to 2022-07-07

b) Wickets vs ER plot – England vs New Zealand – 2019-01-01 to 2021-11-10

5. T20 Team performance overall tab

This tab provides detailed analysis of the team’s performance against all other teams. As in the previous tab there are functions to compute the overall batting, bowling scorecard of a team against all other teams for any specific interval of time. This can help in picking out the most consistent batsmen, bowlers. Besides there are functions to compute overall batting partnerships, bowler vs batsmen, runs, wickets across 20 overs, run vs SR and wickets vs ER etc.

a) Batsmen vs Bowlers (Rank 1- V Kohli 2019-01-01 to 2022-09-25)

b) Team Runs vs SR in Death overs (India) (2019-01-01 to 2022-09-25)

6) Optimisation tab

In the optimisation tab we can check the performance of specific batsmen against specific bowlers, or of bowlers against batsmen

a) Batsmen vs Bowlers

b) Bowlers vs batsmen

7) T20 Batting Performance tab

This tab performs various analytics like ranking batsmen based on Runs over SR and SR over Runs. You can also plot overall Runs vs SR, and more specifically Runs vs SR in Powerplay, Middle and Death overs. All of this can be done for a specific date range. Here are some examples. The data includes all of T20 (all countries, all matches)
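
As a rough illustration of the two ranking orders, the base-R sketch below computes the strike rate (runs per 100 balls faced) for four made-up batsmen and sorts them both ways. It assumes that "Runs over SR" means sorting by runs first with strike rate as the tie-breaker, and the reverse for "SR over Runs"; the app's actual ranking logic may differ.

```r
# Sketch: strike rate and the two ranking orders, on made-up numbers.
batsmen <- data.frame(
  batsman = c("A", "B", "C", "D"),     # hypothetical players
  runs    = c(900, 900, 650, 400),
  balls   = c(720, 600, 400, 250)
)
batsmen$SR <- 100 * batsmen$runs / batsmen$balls   # strike rate = runs per 100 balls

# Runs over SR: sort by runs first, break ties on strike rate
runs_over_sr <- batsmen[order(-batsmen$runs, -batsmen$SR), ]

# SR over Runs: sort by strike rate first, break ties on runs
sr_over_runs <- batsmen[order(-batsmen$SR, -batsmen$runs), ]

runs_over_sr$batsman   # "B" "A" "C" "D"
sr_over_runs$batsman   # "C" "D" "B" "A"
```

The two orders can differ sharply, which is why a minimum number of matches is applied before ranking.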

a) Rank batsmen (Runs over SR, minimum matches played=33, date range=2019-01-01 to 2022-09-27)

The top 3 batsmen are Mohammad Rizwan, V Kohli and Babar Azam

b) Overall runs vs SR plot (2019-01-01 to 2022-09-27)

c) Overall Runs vs SR in Powerplay (all teams- 2019-01-01-2022-09-27)

This plot will be crowded. However, we can zoom into an area of interest. The controls for interacting with the plot are in the top of the plot as shown

Zooming in and panning to the area we can see the best performers in powerplay are as below

8) T20 Bowling Performance tab

This tab computes and ranks bowlers on Wickets over Economy rate and Economy rate over Wickets. We can also compute and plot the Wickets vs ER in all matches, besides the Wickets vs ER in powerplay, middle and death overs, with data from all countries

a) Rank Bowlers (Wickets over ER, minimum matches=28, 2019-01-01 to 2022-09-27)

b) Wickets vs ER plot

S Lamichhane (NEP), Hasaranga (SL) and Shamsi (SA) are excellent bowlers with high wickets and low ER as seen in the plot below

c) Wickets vs ER in death overs (2019-01-01 to 2022-09-27, min matches=24)

Zooming in and panning we see the best performers in death overs are MR Adair (IRE), Haris Rauf(PAK) and Chris Jordan (ENG)

With the excitement building up, it is time you checked out how your country will perform and the players who will do well.

Go ahead give GooglyPlusPlus a spin !!!

Also see

  1. Deep Learning from first principles in Python, R and Octave – Part 5
  2. Big Data-5: kNiFi-ing through cricket data with yorkpy
  3. Understanding Neural Style Transfer with Tensorflow and Keras
  4. De-blurring revisited with Wiener filter using OpenCV
  5. Re-introducing cricketr! : An R package to analyze performances of cricketers
  6. Modeling a Car in Android
  7. Presentation on “Intelligent Networks, CAMEL protocol, services & applications”
  8. Practical Machine Learning with R and Python – Part 2
  9. Cricpy adds team analytics to its arsenal!!
  10. Benford’s law meets IPL, Intl. T20 and ODI cricket

To see all posts click Index of posts

GooglyPlusPlus2022 optimizes batting/bowling lineup

GooglyPlusPlus2022 is the new avatar of last year’s GooglyPlusPlus2021. About 5 years back I had written a post on Using linear programming to optimize T20 batting and bowling line up. That post has been at the back of my mind for a long time, and I decided to revisit it. The optimization requires computing the performance of individual batsmen against bowlers and vice versa. So in this latest incarnation, there are 4 new functions

  1. batsmanVsBowlerPerf – Performance of batsmen against chosen bowlers
  2. bowlerVsBatsmanPerf – Performance of bowlers versus specific batsmen
  3. battingOptimization – Optimizing batting line up based on strike rates and remaining overs
  4. bowlingOptimization – Optimizing bowling line up based on economy rates and remaining overs

These 4 functions have been incorporated in all 9 supported T20 formats, namely a. IPL b. Intl. T20 (men) c. Intl. T20 (women) d. BBL e. NTB f. PSL g. WBB h. CPL i. SSM

Check out GooglyPlusPlus2022!!

You can clone/fork the code for GooglyPlusPlus2022 from Github at gpp2022-1

With this latest update you can do a myriad of analyses of batsmen, bowlers, teams, matches. This is just-in-time for the IPL Mega-auction!! Do check out these other posts of GooglyPlusPlus for other detailed analysis

  1. GooglyPlusPlus2021: Towards more picturesque analytics!
  2. GooglyPlusPlus2021 now with power play, middle and death over analysis
  3. GooglyPlusPlus2021 adds new bells and whistles!!
  4. GooglyPlusPlus2021 is now fully interactive!!!

A) Batsman Vs Bowlers – This option computes the performance of individual batsmen against individual bowlers

a) IPL Batsmen vs Bowlers

Included below are the performances of Dhoni, Raina and Kohli against Malinga, Ashwin and Bumrah. Note: The last 2 text box inputs are not required for this.

b) Intl. T20 (men) Batsmen vs Bowlers

Note: You can type the name and choose from the drop down list

B) Bowler vs Batsmen – You can check the performance of specific bowlers against specific batsmen

a) Intl. T20 (women) India vs Australia

b) PSL Bowlers vs Batsmen

C) Strategy for optimizing batting and bowling line up

From the above 2 tabs, it is obvious, that different bowlers have different ER and wicket rate against different batsmen. In other words, the effectiveness of the bowlers varies by batsmen. Conversely, batsmen are more comfortable with certain bowlers versus others and this shows up in different strike rates.
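As a rough sketch of how such per-matchup numbers can be derived, the Python snippet below aggregates ball-by-ball records into a strike rate (runs per 100 balls) and an economy rate (runs per over) for every batsman-bowler pair. This is not the code behind batsmanVsBowlerPerf or bowlerVsBatsmanPerf; the function name matchup_rates, the record layout and the data are hypothetical, and for simplicity it assumes every record is a legal delivery and ignores extras.

```python
from collections import defaultdict


def matchup_rates(deliveries):
    """deliveries: iterable of (batsman, bowler, runs) for legal balls only."""
    runs = defaultdict(int)
    balls = defaultdict(int)
    for batsman, bowler, r in deliveries:
        runs[(batsman, bowler)] += r
        balls[(batsman, bowler)] += 1

    rates = {}
    for pair, n_balls in balls.items():
        rates[pair] = {
            "balls": n_balls,
            "runs": runs[pair],
            "SR": 100.0 * runs[pair] / n_balls,  # batsman's strike rate
            "ER": 6.0 * runs[pair] / n_balls,    # bowler's economy rate
        }
    return rates


# Hypothetical ball-by-ball data
deliveries = [
    ("Batsman A", "Bowler X", 4), ("Batsman A", "Bowler X", 0),
    ("Batsman A", "Bowler X", 1), ("Batsman A", "Bowler X", 1),
    ("Batsman A", "Bowler Y", 6), ("Batsman A", "Bowler Y", 6),
    ("Batsman B", "Bowler X", 0), ("Batsman B", "Bowler X", 0),
]

rates = matchup_rates(deliveries)
for (batsman, bowler), r in sorted(rates.items()):
    print(batsman, "vs", bowler, r)
```

The same pair appears with a very different SR and ER depending on who bowls to whom, which is exactly the variation the optimisation exploits. With few balls per pair these rates are noisy, so they should be read with care.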

Hence during the death overs, when trying to restrict batsmen to a certain score or on the flip side when the batting side needs to score a target within certain overs, we need to take advantage of the relative effectiveness of bowlers vs batsmen for optimising bowling and aggressiveness of batsmen versus bowlers to quickly reach the target.

This is the approach used for bowling and batting optimisation. For optimising bowling, we formulate a minimisation problem based on economy rates (ER), and for optimising batting, a maximisation problem based on strike rates (SR). ‘Integer programming’ is used to compute the optimal allocation over the last set of overs

This latest version implements the optimization with “integer programming” using the R package lpSolve.

Here are the 2 formulations

Assume there are 3 bowlers – bwlr_{1},bwlr_{2},bwlr_{3}
and there are 3 batsmen – bman_{1},bman_{2},bman_{3}

I) LP Formulation for bowling order

Let er_{ij} be the Economy Rate of the jth bowler to the ith batsman, and let o_{ij} be the number of overs the jth bowler bowls to the ith batsman, for m batsmen and n bowlers. Also, if the remaining overs for the bowlers are o_{1},o_{2},o_{3}
then the total number of overs left to be bowled is
o_{1}+o_{2}+o_{3} = N

Objective function : Minimize –
er_{11}*o_{11} + er_{12}*o_{12} +..+ er_{1n}*o_{1n} + er_{21}*o_{21} + er_{22}*o_{22} +..+ er_{2n}*o_{2n} +..+ er_{m1}*o_{m1} +..+ er_{mn}*o_{mn}
i.e.
\sum_{i=1}^{m}\sum_{j=1}^{n} er_{ij}*o_{ij}
Constraints
Where o_{j} is the number of overs remaining for the jth bowler against the m batsmen
o_{1j} + o_{2j} + .. + o_{mj} <= o_{j}
and if the total number of overs remaining to be bowled is N then
o_{1} + o_{2} +...+ o_{n} = N or
\sum_{j=1}^{n} o_{j} = N
The overs that any bowler can bowl is o_{j} >= 0

II) LP Formulation for batting lineup

Let sr_{ij} be the Strike Rate of the ith batsman to the jth bowler, and let o_{ij} be the number of overs the ith batsman faces from the jth bowler, for m batsmen and n bowlers
Objective function : Maximize –
sr_{11}*o_{11} + sr_{12}*o_{12} +..+ sr_{1n}*o_{1n} + sr_{21}*o_{21} + sr_{22}*o_{22} +..+ sr_{2n}*o_{2n} +..+ sr_{m1}*o_{m1} +..+ sr_{mn}*o_{mn}
i.e.
\sum_{i=1}^{m}\sum_{j=1}^{n} sr_{ij}*o_{ij}
Constraints
Where o_{j} is the number of overs remaining for the jth bowler against the m batsmen
o_{1j} + o_{2j} + .. + o_{mj} <= o_{j}
and if the total number of overs remaining to be bowled is N then
o_{1} + o_{2} +...+ o_{n} = N or
\sum_{j=1}^{n} o_{j} = N
The overs that any bowler can bowl is
o_{j} >= 0
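To make the two formulations concrete, here is a minimal Python sketch. It is not the lpSolve-based R code used in GooglyPlusPlus2022. Instead, it solves the same kind of small integer program by brute-force enumeration, which is workable for a handful of bowlers and overs. The function names, the ER and SR numbers, and the requirement that exactly N whole overs are allocated in total are assumptions made for illustration (without that last requirement the minimisation would trivially allocate zero overs).

```python
from itertools import product


def allocations(n_batsmen, cap):
    """All ways one bowler can split at most `cap` whole overs among batsmen."""
    return [a for a in product(range(cap + 1), repeat=n_batsmen) if sum(a) <= cap]


def optimise_overs(rate, overs_left, total_overs, maximise=False):
    """rate[i][j]: ER (or SR) for batsman i against bowler j.
    overs_left[j]: overs bowler j may still bowl.
    Returns (objective value, o) with o[i][j] = overs of bowler j to batsman i.
    """
    n_batsmen, n_bowlers = len(rate), len(overs_left)
    per_bowler = [allocations(n_batsmen, cap) for cap in overs_left]
    best_value, best_plan = None, None
    for plan in product(*per_bowler):  # plan[j][i], bowler first
        if sum(map(sum, plan)) != total_overs:  # assumed: exactly N overs used
            continue
        value = sum(rate[i][j] * plan[j][i]
                    for j in range(n_bowlers) for i in range(n_batsmen))
        if best_value is None or (value > best_value if maximise else value < best_value):
            best_value = value
            best_plan = [[plan[j][i] for j in range(n_bowlers)]
                         for i in range(n_batsmen)]
    return best_value, best_plan


# Hypothetical rates: rows are batsmen 1..3, columns are bowlers 1..3
er = [[7.0, 9.0, 8.0],
      [10.0, 6.0, 9.5],
      [8.5, 11.0, 5.0]]
sr = [[120, 150, 110],
      [140, 100, 130],
      [90, 160, 125]]
overs_left = [2, 1, 2]  # o_1, o_2, o_3
N = 4

bowling_value, bowling_plan = optimise_overs(er, overs_left, N)
batting_value, batting_plan = optimise_overs(sr, overs_left, N, maximise=True)
print("Bowling:", bowling_value, bowling_plan)
print("Batting:", batting_value, batting_plan)
```

In the bowling case each bowler ends up bowling to the batsman against whom he is most economical, subject to his remaining overs. Enumeration grows very quickly with the number of players and overs, which is one reason a proper solver such as lpSolve is the better choice in practice.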

D) Optimized bowling lineup

a) IPL – Optimizing bowling line up

Note: For computing the Optimal bowling lineup, the total number of overs remaining and the number of overs for each bowler have to be entered.

b) PSL – Optimizing bowling line up

E) Optimized batting lineup

a) Intl. T20 (men) India vs England

b) Caribbean Premier League – Optimizing batting line up

Give GooglyPlusPlus2022 a spin!

You can also check the code here gpp2022-1

Hope you have a good time with GooglyPlusPlus2022!

Also see

  1. Re-working the Lucy Richardson algorithm in OpenCV
  2. Deconstructing Convolutional Neural Networks with Tensorflow and Keras
  3. Deep Learning from first principles in Python, R and Octave – Part 5
  4. Cricketr adds team analytics to its repertoire!!!
  5. Practical Machine Learning with R and Python – Part 4
  6. Cricpy takes a swing at the ODIs
  7. yorkpy takes a hat-trick, bowls out Intl. T20s, BBL and Natwest T20!!!
  8. Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR
  9. Introducing QCSimulator: A 5-qubit quantum computing simulator in R

To see all posts click Index of posts

GooglyPlusPlus2021 now with power play, middle and death over analysis

This latest edition of GooglyPlusPlus2021 now includes detailed analysis of teams, batsmen and bowlers in power play, middle and death overs. A T20 innings can be divided into 3 phases, as each side faces 20 overs.

Power play: Overs 1 – 6 – No more than 2 players can be outside the 30 yard circle

Middle overs: Overs 7 – 15 – During these overs the batting side tries to consolidate their innings

Death overs: Overs 16 – 20 – During these 5 overs the batting side tries to accelerate the scoring rate, while the bowling side tries to restrict the batsmen from going for big hits

This is shown below
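In code, the same split can be sketched roughly as follows. This is a Python illustration, not the yorkr implementation. It assumes overs are numbered 1 to 20 and that the phases are overs 1 to 6 (power play), 7 to 15 (middle) and 16 to 20 (death); the function names and the innings data are hypothetical.

```python
def phase_of_over(over):
    """Map a 1-based over number (1..20) to its T20 phase."""
    if not 1 <= over <= 20:
        raise ValueError("a T20 innings has overs 1 to 20")
    if over <= 6:
        return "Power play"
    if over <= 15:
        return "Middle overs"
    return "Death overs"


def runs_by_phase(runs_per_over):
    """runs_per_over: runs scored in overs 1, 2, ... (at most 20 entries)."""
    totals = {"Power play": 0, "Middle overs": 0, "Death overs": 0}
    for over, runs in enumerate(runs_per_over, start=1):
        totals[phase_of_over(over)] += runs
    return totals


# Hypothetical innings, one entry per over
innings = [8, 6, 10, 4, 7, 9,
           5, 6, 7, 4, 8, 6, 5, 9, 7,
           12, 10, 14, 9, 16]

print(runs_by_phase(innings))
```

The same grouping can be applied to wickets, and dividing by the balls or overs in each phase gives the strike rate and economy rate per phase.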

This latest update of GooglyPlusPlus2021 includes the following functions

a) Match tab

  1. teamRunsAcrossOvers
  2. teamSRAcrossOvers
  3. teamWicketsAcrossOvers
  4. teamERAcrossOvers
  5. matchWormWickets

b) Head-to-head tab

  1. teamRunsAcrossOversOppnAllMatches
  2. teamSRAcrossOversOppnAllMatches
  3. teamWicketsAcrossOversOppnAllMatches
  4. teamERAcrossOversOppnAllMatches
  5. topRunsBatsmenAcrossOversOppnAllMatches
  6. topSRBatsmenAcrossOversOppnAllMatches
  7. topWicketsBowlersAcrossOversOppnAllMatches
  8. topERBowlerAcrossOverOppnAllMatches

c) Overall performance tab

  1. teamRunsAcrossOversAllOppnAllMatches
  2. teamSRAcrossOversAllOppnAllMatches
  3. teamWicketsAcrossOversAllOppnAllMatches
  4. teamERAcrossOversAllOppnAllMatches
  5. topRunsBatsmenAcrossOversAllOppnAllMatches
  6. topSRBatsmenAcrossOversAllOppnAllMatches
  7. topWicketsBowlersAcrossOversAllOppnAllMatches
  8. topERBowlerAcrossOverAllOppnAllMatches

Hence a total of 8 + 8 + 5 = 21 functions have been added. These functions can be utilized across all the 9 T20 formats that are supported in GooglyPlusPlus2021 namely

i) IPL ii) Intl. T20 (men) iii) Intl. T20 (women) iv) BBL v) NTB vi) PSL vii) CPL viii) SSM ix) WBB

Hence there are a total of 21 x 9 = 189 new possibilities to explore in GooglyPlusPlus2021

GooglyPlusPlus2021 is based on my R package yorkr and uses data from Cricsheet. To know how to use GooglyPlusPlus see any of my earlier posts: GooglyPlusPlus2021 is now fully interactive!!!, GooglyPlusPlus2021 adds new bells and whistles!!, GooglyPlusPlus2021 enhanced with drill-down batsman, bowler analytics

Take GooglyPlusPlus for a spin here GooglyPlusPlus2021

You can clone/fork the code for the Shiny app from Github – gpp2021-9

Included below is a random selection of options from the 189 possibilities mentioned above. Feel free to try out for yourself

A) IPL – CSK vs KKR 2018-04-10

a) Team Runs in power play, middle and death overs

b) Team Strike rate in power play, middle and death overs

B) Intl. T20 (men) – India vs Afghanistan (2021-11-03)

a) Team wickets in power play, middle and death overs

b) Team Economy rate in power play, middle and death overs

C) Intl. T20 (women) Head-to-head : India vs Australia since 2018

a) Team Runs in all matches in power play, middle and death overs

D) PSL Head-to-head strike rate since 2019

a) Team vs team Strike rate : Karachi Kings vs Lahore Qalandars since 2019 in power play, middle and death overs

E) Team overall performance in all matches against all opposition

a) BBL : Brisbane Heat : Team Wickets between 2015 – 2018 in power play, middle and death overs

F) Top Runs and Strike rate Batsman of Mumbai Indians vs Royal Challengers Bangalore since 2018

a) Top runs scorers for Mumbai Indians (MI) in power play, middle and death overs

b) Top strike rate for RCB in power play, middle and death overs

G) Intl. T20 (women) India vs England since 2018

a) Top wicket takers for England in power play, middle and death overs since 2018

b) Top wicket takers for India in power play, middle and death overs since 2018

H) Intl. T20 (men) All time best batsmen and bowlers for India

a) Most runs in power play, middle and death overs

b) Highest strike rate in power play, middle and death overs

I) Match worm wicket chart

In addition to the usual Match worm chart, I have also added a Match Wicket worm chart in the latest version

Note: You can zoom to the area where you would like to focus more

The option of looking at the Match worm chart (without wickets) also exists.

Go ahead take GooglyPlusPlus2021 for a test drive and check out how your favourite players perform in power play, middle and death overs. Click GooglyPlusPlus2021

You can fork/download the app code from Github at gpp2021-9

Hope you have fun with GooglyPlusPlus

You may also like

  1. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
  2. Practical Machine Learning with R and Python – Part 6
  3. Big Data 6: The T20 Dance of Apache NiFi and yorkpy
  4. Understanding Neural Style Transfer with Tensorflow and Keras
  5. Using Reinforcement Learning to solve Gridworld
  6. Exploring Quantum Gate operations with QCSimulator
  7. Experiments with deblurring using OpenCV
  8. Deep Learning from first principles in Python, R and Octave – Part 5
  9. Re-introducing cricketr! : An R package to analyze performances of cricketers
  10. Natural language processing: What would Shakespeare say?

To see all posts click Index of posts

GooglyPlusPlus2021:ICC WC T20:Pavilion-view analytics as-it-happens!

This year 2021, we are witnessing a rare spectacle in the cricketing universe, where IPL playoffs are immediately followed by ICC World Cup T20. Cricket pundits have claimed such a phenomenon occurs once in 127 years! Jokes apart, the World cup T20 is underway and as usual GooglyPlusPlus is ready for the action.

GooglyPlusPlus will provide near-real time analytics, by automatically downloading the latest match data daily, processing and organising the match data into appropriate folders so that my R package yorkr can slice and dice the data to provide the pavilion-view analytics.

The charts capture all the breathless, heart-pounding, and nail-biting action in great detail in the many tables and plots. Every table and chart tells a story. You just have to ‘read between the lines!’

GooglyPlusPlus2021 will update itself automatically every day, so the data will be current and you can analyse all matches up to the previous day, along with the historical performances of the teams. So make sure you check it every day.

Note:

  1. All charts are interactive. To know how to use the interactive charts see my post GooglyPlusPlus2021 is now fully interactive!!!
  2. GooglyPlusPlus2021 now supports IPL, Intl. T20(men), Intl. T20(women), BBL, NTB, PSL, CPL, SSM, WBB. Besides, it also supports ODI (men) and ODI (women)
  3. Each of the formats has 5 tabs – Batsman, Bowler, Match, Head-to-head and Overall Performance.
  4. All T20 formats also include a ranking functionality for the batsmen and bowlers
  5. You can now perform drill-down analytics for batsmen, bowlers, head-to-head and overall performance based on date-range selector functionality. The ranking tabs also include date range selector granular analysis. For more details see GooglyPlusPlus2021 enhanced with drill-down batsman, bowler analytics

Try out GooglyPlusPlus2021 here GooglyPlusPlus2021!!

You can clone/fork the code from Github at gpp2021-8

I am including some random screenshots of things that can be done with GooglyPlusPlus2021

A. Papua New Guinea vs Oman (2021-10-17)

a. Batting partnership

B. Match worm chart (Papua New Guinea vs Oman)

This was a no contest as Oman cruised to victory

C. Scotland vs Bangladesh (2021-10-17)

a. Scotland upset Bangladesh

b. Match worm chart (Scotland vs Bangladesh)

Fortunes see-sawed one way, then another, as can be seen in the match worm chart

D. Netherlands vs Ireland (2021-10-18)

a. Batsman vs Bowler

E. Historical performance head-to-head

a. Sri Lanka vs West Indies (2019-2021) – Batting partnerships

b. India vs England (2018 – 2021) – Bowling scorecard

c. Australia vs South Africa – Team wicket opposition

F. Overall performance

a. Pakistan batting scorecard since 2019

b. Win loss of Australia since 2019

G. Batsman Performance

a. PR Stirling’s runs against opposition since 2019

b. KJ O’Brien’s cumulative average runs since 2019

H. Bowler performance

a. PWH De Silva’s wicket prediction since 2019

b. T Shamsi’s cumulative average wickets since 2019

I. Ranking Intl. T20 batsmen since 2019

a. Runs over Strike rate

b. Strike rate over runs

J. Ranking bowlers since 2019

a. Wickets over Economy rate

b. Economy rate over wickets

As mentioned above GooglyPlusPlus2021 will be updated daily automatically, so you won’t miss any analytic action.

Do give GooglyPlusPlus2021 a spin!

Clone/fork the code for the Shiny app from Github gpp2021-8

You may also like

  1. Natural language processing: What would Shakespeare say?
  2. Literacy in India – A deepR dive
  3. Practical Machine Learning with R and Python – Part 5
  4. Big Data 7: yorkr waltzes with Apache NiFi
  5. Getting started with Tensorflow, Keras in Python and R
  6. Deep Learning from first principles in Python, R and Octave – Part 7
  7. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
  8. Video presentation on Machine Learning, Data Science, NLP and Big Data – Part 1

To see all posts click Index of posts

GooglyPlusPlus2021 adds new bells and whistles!!

This latest update of GooglyPlusPlus2021 includes new controls which allow for granular analysis of teams and matches. This version includes a new ‘Date Range’ widget which will allow you to choose a specific interval between which you would like to analyze data. The Date Range widget has been added to 2 tabs namely

a) Head-to-Head

b) Overall Performance
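Conceptually, the Date Range widget narrows the set of matches before any analysis is run. The Python sketch below shows the idea with an inclusive date filter followed by a win tally. It is only an illustration, not the Shiny code of the app: the function names, the record layout and the match data are hypothetical.

```python
from collections import Counter
from datetime import date


def in_date_range(matches, start, end):
    """Keep matches played on or between the two dates (inclusive)."""
    return [m for m in matches if start <= m["date"] <= end]


def win_tally(matches):
    """Count wins per team in the given matches."""
    return Counter(m["winner"] for m in matches)


# Hypothetical match records
matches = [
    {"date": date(2012, 3, 18), "winner": "Team A"},
    {"date": date(2016, 1, 26), "winner": "Team B"},
    {"date": date(2018, 11, 17), "winner": "Team A"},
    {"date": date(2020, 2, 8), "winner": "Team A"},
]

selected = in_date_range(matches, date(2015, 1, 1), date(2020, 12, 31))
tally = win_tally(selected)
print(len(selected), dict(tally))
```

Changing the two dates changes the tally, which is how the same head-to-head or overall analysis can tell a different story for different eras.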

Important note:

This change is applicable to all T20 formats and ODI formats that GooglyPlusPlus2021 handles. This means you can do fine-grained analysis of the following formats

a. IPL b. Intl. T20 (men) c. Intl. T20 (women)

d. BBL e. NTB f. PSL

g. WBB h. CPL i. SSM

j. ODI (men) k. ODI (women)

Important note 2: Also note that all charts in GooglyPlusPlus2021 are interactive. You can hover over the charts to get details of the underlying data. You can also selectively filter in bar charts using double-click and click. To know more about how to use GooglyPlusPlus2021 interactively, please see my post GooglyPlusPlus2021 is now fully interactive!!

You can clone/download the code for GooglyPlusPlus2021 from Github at GooglyPlusPlus2021

Try out GooglyPlusPlus2021 here GooglyPlusPlus2021

Here are some random examples from the latest version of GooglyPlusPlus2021

a) Team Batting Scorecard – MI vs CSK (all matches 2008-2013) – Tendulkar era

Tendulkar is the top scorer, followed by Rohit Sharma and Jayasuriya for Mumbai Indians

b) Team Batting Partnerships (MI -CSK) – Tendulkar’s partnerships

Partnerships for Tendulkar with his MI team mates

c) Team Bowler Wicket Kinds (Opposition countries vs India in all matches in T20)

d) Win vs Loss India vs Australia T20 Women (2010 – 2015)

Australia won all 3 matches against India

e) Win vs Loss India vs Australia T20 Women (2015 – 2020)

Between 2016-2020 the tally is 3-2 for Australia vs India

f) Wins vs Losses – MI vs all other teams 2013 – 2018

g) Team Batting Partnerships Head-to-head Australia vs England ODI (Women)

Partnerships of EA Perry and AJ Blackwell for Australia women

Go ahead give GooglyPlusPlus2021 a try!

Hope you have fun!

Also see

  1. Exploring Quantum Gate operations with QCSimulator
  2. De-blurring revisited with Wiener filter using OpenCV
  3. Deep Learning from first principles in Python, R and Octave – Part 3
  4. Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR
  5. Cricpy adds team analytics to its arsenal!!
  6. Practical Machine Learning with R and Python – Part 5

To see all posts see Index of posts

GooglyPlusPlus2021 is now fully interactive!!!

GooglyPlusPlus2021 is now fully interactive. Please read the below post carefully to see the different ways you can interact with the data in the plots.

There are 2 main updates in this latest version of GooglyPlusPlus2021

a) GooglyPlusPlus gets all ‘touchy, feely’ with the data and now you can interact with the plot/chart to get more details of the underlying data. There are many ways you can slice ’n’ dice the data in the charts. The examples below illustrate a few of these. You can interact with plots by hovering over, clicking and double-clicking curves, plots and barplots to get details of the data.

b) GooglyPlusPlus also includes the ‘Super Smash T20’ league from New Zealand. You can also analyze batsmen, bowlers, matches and teams, and rank players, for Super Smash (SSM)

Note: GooglyPlusPlus2021 can handle a total of 11 formats including T20 and ODI. They are

i) IPL ii) Intl. T20(men) iii) Intl. T20(women) iv) BBL

v) NTB vi) PSL vii) WBB viii) CPL

ix) SSM x) ODI (men) xi) ODI (women)

Each of these formats has 7 tabs which are

— Analyze batsman

— Analyze bowlers

— Analyze match

— Head-to-head

— Team vs all other teams

— Rank batsmen

— Rank bowlers

Within these 11 x 7 = 77 tabs you can analyze batsmen, bowlers, matches, head-to-head, team vs all other teams and rank players for T20 and ODI. In addition all plots have been made interactive so there is a lot more information that you can get from these charts

Try out the interactive GooglyPlusPlus2021 now!!!

You can fork/clone the Shiny app from Github at GooglyPlusPlus2021

Below I have randomly included some charts for different formats to show how you can interact with them

a) Batsman Analysis – Runs vs Deliveries (IPL)

Mouse-over/Hover

The plot below gives the number of runs scored by David Warner vs Deliveries faced.

b) Batsman Analysis – Runs vs Deliveries (IPL) (prediction)

Since a 2nd order regression line, with confidence intervals (shaded area), has been fitted in the above plot, we can predict the runs given the ‘balls faced’ as below

Click ‘Toggle Spike lines’ (use palette on top-right)

By using hover(mouse-over) on the curve we can determine the predicted number of runs Warner will score given a certain number of deliveries
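For readers who want to see what a 2nd order fit does, here is a small Python sketch of least-squares fitting of runs = b0 + b1*balls + b2*balls^2, followed by a prediction. It is a bare-bones illustration rather than the code behind the plot: it does not compute the confidence intervals, it needs at least 3 distinct ‘balls faced’ values, and the function names and data points are made up.

```python
def fit_quadratic(xs, ys):
    """Least-squares coefficients [b0, b1, b2] for y = b0 + b1*x + b2*x^2."""
    # Normal equations (X^T X) b = X^T y for the columns [1, x, x^2]
    s = [sum(x ** k for x in xs) for k in range(5)]               # sums of x^0..x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]  # augmented matrix

    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def predict(coeffs, x):
    b0, b1, b2 = coeffs
    return b0 + b1 * x + b2 * x * x


# Hypothetical (balls faced, runs scored) pairs, one per innings
balls = [5, 10, 20, 30, 40, 50]
runs = [6.25, 12.0, 25.0, 40.0, 57.0, 76.0]

coeffs = fit_quadratic(balls, runs)
print("coefficients:", coeffs)
print("predicted runs off 25 balls:", predict(coeffs, 25))
```

Hovering over the fitted curve in the app gives the same kind of answer as calling predict here: the expected runs for a chosen number of deliveries.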

c) Bowler Analysis – Wickets against opposition – Intl. T20 (women)

Jhulan Goswami’s wickets against opposition countries in Intl. T20 (women)

d) Bowler Analysis (Predict bowler wickets) IPL – (non-interactive**)

Note: Some plots are non-interactive, like the one below which predicts the number of wickets Bumrah will take based on number of deliveries bowled

e) Match Analysis – Batsmen Partnership -Intl. T20 (men)

India vs England batting partnership between Virat Kohli & Shikhar Dhawan in all matches between England and India

f) Match Analysis – Worm chart (Super Smash T20) SSM

i) Worm chart of Auckland vs Northern Districts (29 Jan 2021).

ii) The final cross-over happens around the 2nd delivery of the 19th over (18.2) as Northern Districts over-takes Auckland to win the match.
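A worm chart is simply the cumulative runs of each innings plotted ball by ball, so a cross-over can also be located programmatically. The Python sketch below is a hedged illustration of that idea, not yorkr code. It compares the two innings at the same delivery number, assumes only legal deliveries are listed, and uses hypothetical function names and data.

```python
from itertools import accumulate


def worm(runs_per_ball):
    """Cumulative runs after each ball, with a leading 0 for the start."""
    return [0] + list(accumulate(runs_per_ball))


def ball_label(index):
    """0-based legal-ball index -> 'over.ball' label, e.g. 109 -> '18.2'."""
    return f"{index // 6}.{index % 6 + 1}"


def last_crossover(first, second):
    """Label of the last ball on which the second innings moves ahead
    of the first innings at the same stage, or None if it never does."""
    w1, w2 = worm(first), worm(second)
    last = None
    for i in range(1, min(len(w1), len(w2))):
        if w2[i - 1] <= w1[i - 1] and w2[i] > w1[i]:
            last = i - 1  # convert back to a 0-based ball index
    return None if last is None else ball_label(last)


# Hypothetical 2-over innings, runs per legal ball
first_innings = [1, 1, 1, 1, 1, 1, 4, 0, 0, 0, 0, 0]
second_innings = [0, 0, 0, 0, 0, 0, 0, 6, 6, 0, 0, 0]

print(last_crossover(first_innings, second_innings))
```

Here the chasing side is behind until the third ball of the second over (label 1.3), where its worm crosses over, in the same way that the chart above shows the cross-over at 18.2.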

g) Head-to-head – Team batsmen vs bowlers (Bangladesh batsmen against Afghanistan bowlers) Intl. T20 (men)

Batting performance of Shakib-al-Hasan (Bangladesh) against Afghanistan bowlers in Intl. T20 (men)

h) Head-to-head – Team batsmen vs bowlers (Bangladesh batsmen against Afghanistan bowlers) Intl. T20 (men) – Filter

Double click on Shakib-al-Hasan on the legend to get the performance of Shakib-al-Hasan against Afghanistan bowlers

Avoiding the clutter

i) Head-to-head – Team bowler vs batsmen (Chennai Super Kings bowlers vs Mumbai Indians batsmen) – IPL

If you choose the above option the resulting plot is very crowded as shown below

To get the performance of Mumbai Indians (MI) batsmen (Rohit Sharma & Kieron Pollard) against Chennai Super Kings (CSK) bowlers in all matches, follow the steps below

Steps to avoid clutter in stacked bar plots

1) The clutter can be avoided by filtering for only the batsmen we are interested in, say RG Sharma and Kieron Pollard. First double-click RG Sharma in the legend; this will bring up the chart with only RG Sharma as below

2) Now add additional batsmen you are interested in by single-clicking. In the example below Kieron Pollard is added

You can continue to add additional players that you are interested in by single-clicking.

j) Head-to-head (Performance of Indian batsmen vs Australian bowlers)- ODI

In the plot V Kohli, MS Dhoni and SC Ganguly have been selected for their performance against Australian bowlers (use toggle spike lines)

k) Overall Performance – PSL batting partnership against all teams (Fakhar Zaman)

The plot below shows Fakhar Zaman’s (Lahore Qalandars) partnerships with his teammates in all matches in PSL.

l) Win-loss against all teams (CPL)

Win-loss chart of Jamaica Tallawahs (CPL) in all matches against all opposition

m) Team batting partnerships against all teams for India (ODI Women)

Batting partnerships of Indian ODI women against all other teams

n) Ranking of batsmen (IPL 2021)

Finally here is the latest ranking of IPL batsmen for IPL 2021 (can be done for all other T20 formats)

o) Ranking of bowlers (IPL 2021)

Clone/download the Shiny app from Github at GooglyPlusPlus2021

So what are you waiting for? Go ahead and try out GooglyPlusPlus2021!

Knock yourself out!

Enjoy enjaami!!!

See also

  1. Deconstructing Convolutional Neural Networks with Tensorflow and Keras
  2. Deep Learning from first principles in Python, R and Octave – Part 6
  3. Cricketr learns new tricks : Performs fine-grained analysis of players
  4. Big Data 6: The T20 Dance of Apache NiFi and yorkpy
  5. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
  6. Practical Machine Learning with R and Python – Part 6
  7. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
  8. Simulating an oscillating revoluteJoint in Android
  9. Benford’s law meets IPL, Intl. T20 and ODI cricket
  10. De-blurring revisited with Wiener filter using OpenCV

To see all posts click Index of posts