# Boosting Win Probability accuracy with player embeddings

In my previous post, Computing Win Probability of T20 matches, I discussed various approaches to computing the Win Probability of T20 matches. I created ML models with glmnet and random forest using Tidymodels, and a Deep Learning model with Keras in Python. This is what I had achieved:

• glmnet : accuracy – 0.67 and sensitivity/specificity – 0.68/0.65
• random forest : accuracy – 0.737 and roc_auc- 0.834
• DL model with Keras in Python : accuracy – 0.73

I wanted to see if the performance of the models could be improved further. I got a suggestion from an AI/DL whizkid, who is close to me, to include embeddings for batsmen and bowlers. He felt that win percentage is influenced by which batsman faces which bowler.

So, I started to explore this idea. Embeddings convert categorical variables into vectors of continuous floating point numbers. Fortunately, R’s Tidymodels has convenient functionality (the embed package) to create embeddings. By including embeddings for batsman and bowler, the performance of my ML models improved vastly. Now the performance is

• glmnet : accuracy – 0.728 and roc_auc – 0.81
• random forest : accuracy – 0.927 and roc_auc – 0.98
• mlp-dnn :accuracy – 0.762 and roc_auc – 0.854

As can be seen, there is an increase of almost 20 percentage points in accuracy (0.737 to 0.927) for random forest with embeddings over the model without embeddings. Moreover, the feature importance plots below show that the bowler and batsman embeddings have a significant influence on the Win Probability.

Note: The data for this analysis is taken from Cricsheet and has been processed with my R package yorkr.

A) Win Probability using GLM with penalty and player embeddings

Here a Generalised Linear Model (glmnet) for Logistic Regression is used. In glmnet, the regularisation path is computed for the lasso or elastic net penalty at a grid of values for the regularisation parameter lambda. glmnet is extremely fast and gave an accuracy of 0.728 and an roc_auc of 0.81 with batsman and bowler embeddings. This is a good improvement over my earlier glmnet implementation without the batsman and bowler embeddings, which had an accuracy of 0.67.

1) Read the data from 9 T20 leagues (BBL, CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB) and create a single data frame of ball-by-ball data. Display the data frame

``````library(dplyr)
library(caret)
library(e1071)
library(ggplot2)
library(tidymodels)
library(embed)

# Helper packages
library(vip)

#Bind all dataframes together
df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)
glimpse(df)
#> Rows: 1,199,115
#> Columns: 10
#> $ batsman        <chr> "JD Smith", "M Klinger", "M Klinger", "M Klinger", "JD …
#> $ bowler         <chr> "NM Hauritz", "NM Hauritz", "NM Hauritz", "NM Hauritz",…
#> $ ballNum        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
#> $ ballsRemaining <int> 125, 124, 123, 122, 121, 120, 119, 118, 117, 116, 115, …
#> $ runs           <int> 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 13, 14, 16, 18, 18,…
#> $ runRate        <dbl> 1.0000000, 0.5000000, 0.6666667, 0.7500000, 0.6000000, …
#> $ numWickets     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ runsMomentum   <dbl> 0.08800000, 0.08870968, 0.08943089, 0.09016393, 0.09090…
#> $ perfIndex      <dbl> 11.000000, 5.500000, 7.333333, 8.250000, 6.600000, 5.50…
#> $ isWinner       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

# Check the balance of the two outcome classes
df %>%
  count(isWinner) %>%
  mutate(prop = n/sum(n))
#>   isWinner      n      prop
#> 1        0 614237 0.5122419
#> 2        1 584878 0.4877581
``````

2) Create training, validation and test sets. The dataset is initially split into training and test sets in the ratio 80:20. The training data is again split into training and validation sets in the ratio 80:20

``````set.seed(123)
# The outcome must be a factor for classification models
df$isWinner <- as.factor(df$isWinner)
splits      <- initial_split(df,prop = 0.80)
splits
#> <Training/Testing/Total>
#> <959292/239823/1199115>
df_other <- training(splits)
df_test  <- testing(splits)

set.seed(234)
val_set <- validation_split(df_other,prop = 0.80)
val_set
#> # A tibble: 1 × 2
#>   splits                  id
#>   <list>                  <chr>
#> 1 <split [767433/191859]> validation

``````
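The split sizes printed above follow from applying the 80:20 ratio twice. Here is a minimal base R sketch of that arithmetic. It assumes the larger part is the proportion rounded down, which agrees with the printed sizes but is an assumption about how the split is computed, not something taken from the rsample source.

```r
# Total number of ball-by-ball rows, as printed by glimpse(df)
n_total <- 1199115

# First split: 80% for training + validation (df_other), 20% for test
n_other <- floor(n_total * 0.80)
n_test  <- n_total - n_other

# Second split: 80% of df_other for training, 20% for validation
n_train <- floor(n_other * 0.80)
n_val   <- n_other - n_train

sizes <- c(train = n_train, validation = n_val, test = n_test)
sizes

# Share of the full data set that ends up in each part
round(sizes / n_total, 2)
```

So the models are effectively trained on about 64% of the data, tuned on 16% and tested on the remaining 20%.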

3) Create pre-processing recipe

a) Normalise the following predictors

• ballNum
• ballsRemaining
• runs
• runRate
• numWickets
• runsMomentum
• perfIndex

b) Create floating point embeddings for

• batsman
• bowler
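To make the two pre-processing steps concrete, here is a small base R sketch on toy data. It is not the Tidymodels implementation. The normalisation mirrors what step_normalize does (subtract the mean, divide by the standard deviation). The embedding matrix, however, holds made-up random numbers purely to show the shape of the result; in the actual recipe step_embed learns these values using the outcome. The column names and the choice of 2 embedding columns are assumptions for illustration.

```r
# Toy data: three balls with one numeric predictor and a categorical batsman
toy <- data.frame(
  batsman = c("JD Smith", "M Klinger", "M Klinger"),
  runRate = c(1.0, 0.5, 0.75),
  stringsAsFactors = FALSE
)

# a) Normalise: centre on the mean and scale by the standard deviation
toy$runRate_norm <- (toy$runRate - mean(toy$runRate)) / sd(toy$runRate)

# b) Embedding lookup: one row of floating point numbers per player.
#    These values are random placeholders (hypothetical), not learned.
set.seed(42)
players <- unique(toy$batsman)
emb <- matrix(rnorm(length(players) * 2), nrow = length(players),
              dimnames = list(players, c("batsman_embed_1", "batsman_embed_2")))

# Look up the embedding row for the batsman on each ball
emb_rows <- emb[toy$batsman, , drop = FALSE]
rownames(emb_rows) <- NULL

# Replace the categorical column with its embedding columns
toy_embedded <- cbind(toy[, c("runRate", "runRate_norm")], as.data.frame(emb_rows))
toy_embedded
```

Every ball faced by the same batsman gets the same pair of numbers, so the model sees the player as a point in a small continuous space rather than as one of thousands of category levels.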

4) Create a Logistic Regression Workflow by adding the GLM model and the recipe

5) Create grid of elastic penalty values for regularisation

6) Train all 30 models

7) Plot the ROC of the model against the penalty

``````# Use all 12 cores
cores <- parallel::detectCores()
cores

# Create a Logistic Regression model with a tunable penalty (mixture = 1 is the lasso)
lr_mod <-
  logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# Create pre-processing recipe
lr_recipe <-
  recipe(isWinner ~ ., data = df_other) %>%
  step_embed(batsman, bowler, outcome = vars(isWinner)) %>%
  step_normalize(ballNum, ballsRemaining, runs, runRate, numWickets, runsMomentum, perfIndex)

# Set the workflow by adding the GLM model with the recipe
lr_workflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(lr_recipe)
# Create a grid for the elastic net penalty
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
lr_reg_grid %>% top_n(-5)
# A tibble: 5 × 1
penalty

<dbl>
1 0.0001
2 0.000127
3 0.000161
4 0.000204
5 0.000259

lr_reg_grid %>% top_n(5)  # highest penalty values
# A tibble: 5 × 1
penalty
<dbl>
1  0.0386
2  0.0489
3  0.0621
4  0.0788
5  0.1

# Train 30 penalized models
lr_res <-
lr_workflow %>%
tune_grid(val_set,
grid = lr_reg_grid,
control = control_grid(save_pred = TRUE),
metrics = metric_set(accuracy,roc_auc))

# Plot the penalty versus ROC
lr_plot <-
  lr_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%   # accuracy is also collected, keep only ROC_AUC
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  geom_line() +
  ylab("Area under the ROC Curve") +
  scale_x_log10(labels = scales::label_number())

lr_plot``````

The Penalty vs ROC plot is shown below

8) Display the ROC_AUC of the top models with the penalty

9) Select the model with the best ROC_AUC and the associated penalty. The mean ROC_AUC is 0.810 for the twelve smallest penalties, and the penalty of 0.000530 is chosen from among them

``````top_models <-
lr_res %>%
show_best("roc_auc", n = 15) %>%
arrange(penalty)
top_models

# A tibble: 15 × 7
penalty .metric .estimator  mean     n std_err .config
<dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 0.0001   roc_auc binary     0.810     1      NA Preprocessor1_Model01
2 0.000127 roc_auc binary     0.810     1      NA Preprocessor1_Model02
3 0.000161 roc_auc binary     0.810     1      NA Preprocessor1_Model03
4 0.000204 roc_auc binary     0.810     1      NA Preprocessor1_Model04
5 0.000259 roc_auc binary     0.810     1      NA Preprocessor1_Model05
6 0.000329 roc_auc binary     0.810     1      NA Preprocessor1_Model06
7 0.000418 roc_auc binary     0.810     1      NA Preprocessor1_Model07
8 0.000530 roc_auc binary     0.810     1      NA Preprocessor1_Model08
9 0.000672 roc_auc binary     0.810     1      NA Preprocessor1_Model09
10 0.000853 roc_auc binary     0.810     1      NA Preprocessor1_Model10
11 0.00108  roc_auc binary     0.810     1      NA Preprocessor1_Model11
12 0.00137  roc_auc binary     0.810     1      NA Preprocessor1_Model12
13 0.00174  roc_auc binary     0.809     1      NA Preprocessor1_Model13
14 0.00221  roc_auc binary     0.809     1      NA Preprocessor1_Model14
15 0.00281  roc_auc binary     0.809     1      NA Preprocessor1_Model15

# Pick the best model and the corresponding penalty
lr_best <-
  lr_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%   # accuracy rows are also present in the metrics
  arrange(penalty) %>%
  slice(8)
lr_best
#> # A tibble: 1 × 7
#>    penalty .metric .estimator  mean     n std_err .config
#>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1 0.000530 roc_auc binary     0.810     1      NA Preprocessor1_Model08

# Collect predictions and generate the AUC curve
lr_auc <-
lr_res %>%
collect_predictions(parameters = lr_best) %>%
roc_curve(isWinner, .pred_0) %>%
mutate(model = "Logistic Regression")

autoplot(lr_auc)``````

Plot the ROC curve (Area under the Curve, AUC) for the selected model.

10) Build the final model with the best LR parameters value as found in lr_best

a) The best performance was for a penalty of 0.000530

b) Create a fit with the best parameters

c) The accuracy on the test set is 72.8% and the ROC_AUC is 0.813

d) Clearly, using the embeddings for batsman and bowler improves on the GLM model without the embeddings, which had an accuracy of 0.67 (see Computing Win Probability of T20 Matches)

``````# Create a model with the penalty for best ROC_AUC
last_lr_mod <-
  logistic_reg(penalty = 0.000530, mixture = 1) %>%
  set_engine("glmnet")

#Update the workflow with this model
last_lr_workflow <-
lr_workflow %>%
update_model(last_lr_mod)

#Create a fit
set.seed(345)
last_lr_fit <-
last_lr_workflow %>%
last_fit(splits)

#Generate accuracy, roc_auc
last_lr_fit %>%
collect_metrics()
# A tibble: 2 × 4
.metric  .estimator .estimate .config

<chr>    <chr>          <dbl> <chr>
1 accuracy binary         0.728 Preprocessor1_Model1

2 roc_auc  binary         0.813 Preprocessor1_Model1
``````

11) Plot the feature importance

It can be seen that the bowler and batsman embeddings are the most significant for the prediction, followed by runRate.

Here runRate stands for:

• runRate in 1st innings
• requiredRunRate in 2nd innings

12) Plot the ROC characteristics

``````last_lr_fit %>%
collect_predictions() %>%
roc_curve(isWinner, .pred_0) %>%
autoplot()``````

13) Generate a confusion matrix

14) Create a final Generalised Linear Model for Logistic Regression with the penalty of 0.000530

15) Save the model

``````# generate predictions from the test set
test_predictions <- last_lr_fit %>% collect_predictions()
test_predictions

# generate a confusion matrix
test_predictions %>%
  conf_mat(truth = isWinner, estimate = .pred_class)
#>           Truth
#> Prediction     0     1
#>          0 90105 32658
#>          1 32572 84488

# Fit the final model on the training data
final_lr_model <- fit(last_lr_workflow, df_other)
final_lr_model

# Check the size of the fitted workflow (obj_size is from the lobstr package)
lobstr::obj_size(final_lr_model)
#> 146.51 MB

butcher::weigh(final_lr_model)
#> A tibble: 305 × 2
#>   object                                  size
#>   <chr>                                  <dbl>
#> 1 pre.actions.recipe.recipe.steps.terms1  57.9
#> 2 pre.actions.recipe.recipe.steps.terms2  57.9
#> 3 pre.actions.recipe.recipe.steps.terms3  57.9

# Remove stored environments to shrink the model before saving
cleaned_lm <- butcher::axe_env(final_lr_model, verbose = TRUE)
#✔ Memory released: "1.04 kB"
#✔ Memory released: "1.62 kB"

saveRDS(cleaned_lm, "cleanedLR.rds")
``````

16) Compute Ball-by-ball Win Probability

• Chennai Super Kings-Lucknow Super Giants-2022-03-31

16a) The corresponding Worm-wicket graph for this match is as below

• Chennai Super Kings-Lucknow Super Giants-2022-03-31

B) Win Probability using Random Forest with player embeddings

In the 2nd approach I use Random Forest with batsman and bowler embeddings. The performance of the model with embeddings is a quantum jump from the earlier performance without embeddings. However, the random forest is also computationally intensive.

1) Read the data from 9 T20 leagues (BBL, CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB) and create a single data frame of ball-by-ball data. Display the data frame

2) Create training, validation and test sets. The dataset is initially split into training and test sets in the ratio 80:20. The training data is again split into training and validation sets in the ratio 80:20

``````library(dplyr)
library(caret)
library(e1071)
library(ggplot2)
library(tidymodels)
library(embed)

# Helper packages
library(vip)
library(ranger)

# Read all the 9 T20 leagues

# Bind into a single dataframe
df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)

set.seed(123)
df$isWinner = as.factor(df$isWinner)

#Split data into training, validation and test sets
splits      <- initial_split(df,prop = 0.80)
df_other <- training(splits)
df_test  <- testing(splits)
set.seed(234)
val_set <- validation_split(df_other, prop = 0.80)
val_set``````

2) Create a Random Forest model, tuning the number of predictors randomly sampled at each split (mtry) and the minimum number of data points in a node required for it to be split further (min_n)

3) Use the ranger engine and set up for classification

4) Set up the recipe and include batsman and bowler embeddings

5) Create a workflow and add the recipe and the random forest model with the tuning parameters

``````# Use all 12 cores in parallel
cores <- parallel::detectCores()
cores
#> [1] 12

# Create the random forest model with mtry and min_n as tuning parameters
rf_mod <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger", num.threads = cores) %>%
  set_mode("classification")

# Setup the recipe with batsman and bowler embeddings
rf_recipe <-
  recipe(isWinner ~ ., data = df_other) %>%
  step_embed(batsman, bowler, outcome = vars(isWinner))

# Create the random forest workflow
rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_recipe)

rf_mod
# show what will be tuned
extract_parameter_set_dials(rf_mod)

set.seed(345)
# specify which values meant to tune

# Build the model
rf_res <-
rf_workflow %>%
tune_grid(val_set,
grid = 10,
control = control_grid(save_pred = TRUE),
metrics = metric_set(accuracy,roc_auc))

``````

6) Select all models with the best roc_auc. It can be seen that the best roc_auc is 0.980 for mtry=4 and min_n=4

7) Get the model with the highest accuracy. The highest accuracy achieved is 0.927 or 92.7. This accuracy is also for mtry=4 and min_n=4

``````# Pick the best  roc_auc and the associated tuning parameters
rf_res %>%
show_best(metric = "roc_auc")
#> # A tibble: 5 × 8
#>    mtry min_n .metric .estimator  mean     n std_err .config
#>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1     4     4 roc_auc binary     0.980     1      NA Preprocessor1_Model08
#> 2     9     8 roc_auc binary     0.979     1      NA Preprocessor1_Model03
#> 3     8    16 roc_auc binary     0.974     1      NA Preprocessor1_Model10
#> 4     7    22 roc_auc binary     0.969     1      NA Preprocessor1_Model09
#> 5     5    19 roc_auc binary     0.969     1      NA Preprocessor1_Model06

# Display the accuracy of the models in descending order and the parameters
rf_res %>%
  show_best(metric = "accuracy")
#> # A tibble: 5 × 8
#>    mtry min_n .metric  .estimator  mean     n std_err .config
#>   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>
#> 1     4     4 accuracy binary     0.927     1      NA Preprocessor1_Model08
#> 2     9     8 accuracy binary     0.926     1      NA Preprocessor1_Model03
#> 3     8    16 accuracy binary     0.915     1      NA Preprocessor1_Model10
#> 4     7    22 accuracy binary     0.906     1      NA Preprocessor1_Model09
#> 5     5    19 accuracy binary     0.904     1      NA Preprocessor1_Model0
``````

8) Select the model with the best parameters for accuracy mtry=4 and min_n=4. For this the accuracy is 0.927. For this configuration the roc_auc is also the best at 0.980

9) Plot the Area Under the Curve (AUC). It can be seen that this model performs really well and it hugs the top left.

``````# Pick the best model
rf_best <-
rf_res %>%
select_best(metric = "accuracy")

# The best model has mtry=4 and min=4
rf_best
mtry min_n .config
<int> <int> <chr>
1     4     4      Preprocessor1_Model08

#Plot AUC
rf_auc <-
rf_res %>%
collect_predictions(parameters = rf_best) %>%
roc_curve(isWinner, .pred_0) %>%
mutate(model = "Random Forest")

autoplot(rf_auc)``````

10) Create the final model with the best parameters

11) Execute the final fit

12) Plot feature importance. The bowler and batsman embeddings, followed by perfIndex and runRate, are the features that contribute the most to the Win Probability

``````last_rf_mod <-
rand_forest(mtry = 4, min_n = 4, trees = 1000) %>%
set_engine("ranger", num.threads = cores, importance = "impurity") %>%
set_mode("classification")

# the last workflow
last_rf_workflow <-
rf_workflow %>%
update_model(last_rf_mod)

set.seed(345)
last_rf_fit <-
last_rf_workflow %>%
last_fit(splits)

last_rf_fit %>%
collect_metrics()

.metric  .estimator .estimate .config
<chr>    <chr>          <dbl> <chr>

1 accuracy binary         0.944 Preprocessor1_Model1
2 roc_auc  binary         0.988 Preprocessor1_Model1

last_rf_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 9)``````

13) Plot the ROC curve for the best fit

``````# Plot the ROC for the final model
last_rf_fit %>%
collect_predictions() %>%
roc_curve(isWinner, .pred_0) %>%
autoplot()
``````

14) Create a confusion matrix

We can see that the number of false positives and false negatives is very low: 7,096 and 6,391 misclassified balls out of 239,823 in the test set

15) Create the final fit with the Random Forest Model

``````# generate predictions from the test set
test_predictions <- last_rf_fit %>% collect_predictions()
test_predictions

#>    id               .pred_0 .pred_1  .row .pred_class isWinner .config
#>    <chr>              <dbl>   <dbl> <int> <fct>       <fct>    <chr>
#>  1 train/test split   0.838  0.162      1 0           0        Preprocessor1_Mo…
#>  2 train/test split   0.463  0.537     11 1           0        Preprocessor1_Mo…
#>  3 train/test split   0.846  0.154     14 0           0        Preprocessor1_Mo…
#>  4 train/test split   0.839  0.161     22 0           0        Preprocessor1_Mo…
#>  5 train/test split   0.846  0.154     36 0           0        Preprocessor1_Mo…
#>  6 train/test split   0.848  0.152     37 0           0        Preprocessor1_Mo…
#>  7 train/test split   0.731  0.269     39 0           0        Preprocessor1_Mo…
#>  8 train/test split   0.972  0.0281    40 0           0        Preprocessor1_Mo…
#>  9 train/test split   0.655  0.345     42 0           0        Preprocessor1_Mo…
#> 10 train/test split   0.662  0.338     43 0           0        Preprocessor1_Mo…

# generate a confusion matrix
test_predictions %>%
  conf_mat(truth = isWinner, estimate = .pred_class)
#>           Truth
#> Prediction      0      1
#>          0 116576   7096
#>          1   6391 109760

# Create the final model
final_model <- fit(last_rf_workflow, df_other)

``````

16) Computing Win Probability with Random Forest Model for match

• Pakistan-India-2022-10-23

17) Worm -wicket graph of match

• Pakistan-India-2022-10-23

C) Win Probability using MLP – Deep Neural Network (DNN) with player embeddings

In this approach the mlp() model of Tidymodels was used with the Keras engine. A Multi-layer perceptron (MLP) Deep Neural Network (DNN) was used to compute the Win Probability using player embeddings. An accuracy of 0.76 was obtained

1) Read the data from 9 T20 leagues (BBL, CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB) and create a single data frame of ball-by-ball data. Display the data frame

2) Create training, validation and test sets. The dataset is initially split into training and test sets in the ratio 80:20. The training data is again split into training and validation sets in the ratio 80:20

``````library(dplyr)
library(caret)
library(e1071)
library(ggplot2)
library(tidymodels)
library(embed)

# Helper packages
library(vip)
library(ranger)

df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)

set.seed(123)
df$isWinner = as.factor(df$isWinner)
splits      <- initial_split(df,prop = 0.80)
df_other <- training(splits)
df_test  <- testing(splits)
set.seed(234)
val_set <- validation_split(df_other,
prop = 0.80)
val_set

``````

3) Create a Deep Neural Network recipe

• Normalize parameters
• Add embeddings for batsman, bowler

4) Set the MLP-DNN hyperparameters

• epochs=100
• hidden units =5
• dropout regularization =0.1

5) Fit on Training data

``````cores <- parallel::detectCores()
cores

nn_recipe <-
recipe(isWinner ~ ., data = df_other) %>%
step_normalize(ballNum,ballsRemaining,runs,runRate,numWickets,runsMomentum,perfIndex) %>%
step_embed(batsman,bowler, outcome = vars(isWinner)) %>%
prep(training = df_other, retain = TRUE)

# For validation:
test_normalized <- bake(nn_recipe, new_data = df_test)

set.seed(57974)
# Set the hyper parameters for DNN
# Use Keras
# Fit on training data
nnet_fit <-
mlp(epochs = 100, hidden_units = 5, dropout = 0.1) %>%
set_mode("classification") %>%
# Also set engine-specific `verbose` argument to prevent logging the results:
set_engine("keras", verbose = 0) %>%
fit(isWinner ~ ., data = bake(nn_recipe, new_data = df_other))

nnet_fit
parsnip model object
Model:"sequential"

____________________________________________________________________________

Layer (type)                                           Output Shape                                    Param #
============================================================================
dense (Dense)                                           (None, 5)                                          60
____________________________________________________________________________

dense_1 (Dense)                                         (None, 5)                                          30
____________________________________________________________________________
dropout (Dropout)                                       (None, 5)                                          0
____________________________________________________________________________
dense_2 (Dense)                                         (None, 2)                                          12
============================================================================
Total params: 102
Trainable params: 102
Non-trainable params: 0
``````
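The parameter counts in the summary above can be reproduced by hand. The sketch below assumes 11 inputs to the network: the 7 normalised numeric predictors plus 2 embedding columns each for batsman and bowler (2 per player is an assumption about what step_embed produced here, chosen because it agrees with the 60 parameters of the first layer). Each dense layer has (inputs + 1) x units parameters, the +1 being the bias term.

```r
# Parameters of a dense layer: one weight per input per unit, plus one bias per unit
dense_params <- function(n_in, n_units) (n_in + 1) * n_units

# Numeric predictors + batsman embeddings + bowler embeddings (assumed split)
n_inputs <- 7 + 2 + 2

params <- c(
  dense   = dense_params(n_inputs, 5),  # first dense layer with 5 units
  dense_1 = dense_params(5, 5),         # second dense layer with 5 units
  dropout = 0,                          # dropout has no trainable parameters
  dense_2 = dense_params(5, 2)          # output layer, one unit per class
)
params
sum(params)
```

With only 102 trainable parameters this is a very small network, which is consistent with it being light weight compared to the random forest.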

6) Test on Test data

• Check ROC_AUC. It is 0.854
• Check accuracy. The MLP-DNN gives a decent performance with an accuracy of 0.76
• Compute the Confusion Matrix
``````# Validate on test data
val_results <-
df_test %>%
bind_cols(
predict(nnet_fit, new_data = test_normalized),
predict(nnet_fit, new_data = test_normalized, type = "prob")
)
val_results

# Check roc_auc
val_results %>% roc_auc(truth = isWinner, .pred_0)
.metric .estimator .estimate

<chr>   <chr>          <dbl>
1 roc_auc binary         0.854

# Check accuracy
val_results %>% accuracy(truth = isWinner, .pred_class)
.metric  .estimator .estimate
<chr>    <chr>          <dbl>
1 accuracy binary         0.762

# Display confusion matrix
val_results %>% conf_mat(truth = isWinner, .pred_class)
#>           Truth
#> Prediction     0     1
#>          0 97419 31564
#>          1 25548 85292
``````
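As a cross-check, the test-set accuracies of the three models can be recomputed from the confusion matrices shown in this post. This is a small base R sketch; the helper names make_cm and acc_from_cm are made up for illustration.

```r
# Build a 2 x 2 confusion matrix (rows = prediction, columns = truth)
make_cm <- function(p0t0, p0t1, p1t0, p1t1) {
  matrix(c(p0t0, p1t0, p0t1, p1t1), nrow = 2,
         dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1")))
}

# Accuracy = correctly classified balls / all balls
acc_from_cm <- function(cm) sum(diag(cm)) / sum(cm)

# Counts copied from the three confusion matrices above
cms <- list(
  glmnet        = make_cm(90105, 32658, 32572, 84488),
  random_forest = make_cm(116576, 7096, 6391, 109760),
  mlp_dnn       = make_cm(97419, 31564, 25548, 85292)
)

accuracy <- sapply(cms, acc_from_cm)
round(accuracy, 3)
```

The three values round to 0.728, 0.944 and 0.762, in line with the accuracies reported for the final fits of each model.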

Conclusion

1. Of the 3 ML models (glmnet, random forest and Multi-layer Perceptron DNN), random forest had the best performance
2. The Random Forest ML model with batsman and bowler embeddings achieved an accuracy of 92.7% and an ROC_AUC of 0.98 on the validation set (94.4% and 0.988 on the test set), with very few false positives and false negatives. This was a quantum jump from my earlier random forest model without embeddings, which had an accuracy of 73.7% and an ROC_AUC of 0.834
3. The glmnet and NN models are fairly lightweight. Random Forest is computationally very intensive.


# Computing Win-Probability of T20 matches

I am late to the ‘Win probability’ computation for T20 matches, but managed to jump on to this bus with this post. Win Probability analysis and computation have been around for some time and are used in baseball, NFL, soccer, hockey and others. For T20 cricket, posts from White Ball Analytics & Sports Data Science were good pointers to the general approach. The data for the Win Probability computation is taken from Cricsheet.

My initial Machine Learning models could not do better than 62% accuracy. I created a data set of ~830 IPL matches which roughly came to about 280,000 rows of ball-by-ball match data but I could not move beyond 62%. Addition of T20 men moved the needle to 64% accuracy. I spent time tuning Deep Learning networks using Tensorflow and Keras. Finally, I added T20 data from 9 T20 leagues – IPL, T20 men, T20 women, BBL, CPL, NTB, PSL, WBB, SSM. I had one large data set of 1.2 million rows of ball by ball data. The data frame looks like

I created a data frame for each match from ballNum 1 to ballNum ~240 for the 1st and 2nd innings of the match. My initial set of features was ballNum, runs, runRate and numWickets. The target variable isWinner takes the values {0,1} depending on whether the team won or lost the match.

The features

• ballNum – ball number for 1 ~ 240+ in data frame. 1 – 120+ for 1st innings and 120+ – 240+ in 2nd innings including noballs, wides etc.
• runs = cumulative runs scored at the ball count
• runRate = cumulative runs scored / ballNum for the 1st innings, and required runs / ballNum for the 2nd innings
• numWickets = wickets lost

The target variable isWinner can take values {0,1} depending on whether the team won or lost

With this initial dataframe, even though I had close to 1.2 million rows of ball by ball data of T20 matches my best performance with vanilla Logistic regression & SVM in Python was about 64% accuracy.

``````# Read all the data from 9 T20 leagues
# BBL,CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB

# Create one large dataframe
df10=pd.concat([df1,df2,df3,df4,df5,df6,df7,df8,df9])
print("Shape of dataframe=",df10.shape)
print("#####################################")
stats=check_values(df10)
print("#####################################")
model_fit(df10)
#norm_model_fit(df,stats)
svm_model_fit(df10)

Shape of dataframe= (1206901, 6)
#####################################
Null values: False
It contains 0 infinite values

Accuracy of Logistic regression classifier on training set: 0.63
Accuracy of Logistic regression classifier on test set: 0.64
Accuracy: 0.64
Precision: 0.62
Recall: 0.65
F1: 0.64

Accuracy of Linear SVC classifier on training set: 0.52
Accuracy of Linear SVC classifier on test set: 0.52``````

With Tensorflow/Keras the performance was about 67%. I tried several things

• Normalisation
• Tried different learning rates
• Different optimisers – SGD, RMSProp, Adam
• Changed depth and width of Neural Network

However I did not get much improvement. Finally I decided to do some Feature engineering. I added 2 new features

a) Runs Momentum: This feature is based on the fact that the more wickets in hand, the more freely the batsmen can play risky strokes, hence increasing the momentum of the runs. This is calculated as

runsMomentum = (11 – numWickets)/balls remaining

b) Performance Index: This feature is the product of the run rate and the wickets in hand. In other words, if the run rate is good and fewer wickets have been lost at that point in the match, then the performance index will be higher
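The two engineered features can be sketched in base R as below, using the first three balls of the innings from the glimpse() output in this post. The sketch takes "wickets in hand" to be 11 - numWickets for both features; for perfIndex this is an assumption, though it agrees with the perfIndex values in that output.

```r
# First three balls of an innings, values taken from the glimpse() output
balls <- data.frame(
  ballNum        = c(1, 2, 3),
  ballsRemaining = c(125, 124, 123),
  runs           = c(1, 1, 2),
  numWickets     = c(0, 0, 0)
)

# Run rate in the 1st innings: cumulative runs per ball bowled
balls$runRate <- balls$runs / balls$ballNum

# Runs momentum: wickets in hand per ball remaining
balls$runsMomentum <- (11 - balls$numWickets) / balls$ballsRemaining

# Performance index: run rate times wickets in hand (assumed to be 11 - numWickets)
balls$perfIndex <- balls$runRate * (11 - balls$numWickets)

balls
```

Both features shrink as wickets fall, and runsMomentum grows as the innings runs out of balls while wickets remain, which is the intuition behind adding them to the plain run rate.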

The final set of features chosen were as below

I had also included the balls Remaining in the innings. Now with this set of features I decided to execute Tensorflow/Keras and do a GridSearch with different learning rates, optimisers. After a couple of hours of computation I got an accuracy of 0.73. I needed to be able to read the ML model in R which required installation of Tensorflow, reticulate and Keras in RStudio and I had several issues. Since I hit a roadblock I moved to regular R models

I performed Win Probability computation in the following ways

A) Win Probability with Vanilla Logistic Regression (R)

With vanilla Logistic Regression in R using the ‘glm’ function I got an accuracy of 0.67, sensitivity of 0.68 and specificity of 0.65 as shown below

``````library(dplyr)
library(caret)
library(e1071)
library(ggplot2)

# Read all the data from 9 T20 leagues
# BBL,CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB

# Create one large dataframe
df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)

# Helper function to split into training/test
trainTestSplit <- function(df,trainPercent,seed1){
## Sample size percent
samp_size <- floor(trainPercent/100 * nrow(df))
## set the seed
set.seed(seed1)
idx <- sample(seq_len(nrow(df)), size = samp_size)
idx

}

train_idx <- trainTestSplit(df, trainPercent = 80, seed1 = 5)
train <- df[train_idx, ]

test <- df[-train_idx, ]
# Fit a generalized linear logistic model,
fit=glm(isWinner~.,family=binomial,data=train,control = list(maxit = 50))

a=predict(fit,newdata=train,type="response")
# Set response >0.5 as 1 and <=0.5 as 0
b=as.factor(ifelse(a>0.5,1,0))
# Compute the confusion matrix for training data

confusionMatrix(
  factor(b, levels = 0:1),
  factor(train$isWinner, levels = 0:1)
)
#> Confusion Matrix and Statistics
#>
#>           Reference
#> Prediction      0      1
#>          0 339938 160336
#>          1 154236 310217
#>
#> Accuracy : 0.6739
#> 95% CI : (0.673, 0.6749)
#> No Information Rate : 0.5122
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.3473
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.6879
#> Specificity : 0.6593
#> Pos Pred Value : 0.6795
#> Neg Pred Value : 0.6679
#> Prevalence : 0.5122
#> Detection Rate : 0.3524
#> Detection Prevalence : 0.5186
#> Balanced Accuracy : 0.6736
#>
#> 'Positive' Class : 0

# This can be saved and loaded as
saveRDS(fit, "glm.rds")
``````

Using the above ML model on Deccan Chargers vs Chennai Super on 27-04-2009 the Win Probability as the match progresses is as below

The Worm wicket graph of this match shows it was a closely fought match

B) Win Probability using Random Forests with Tidy Models – R

Initially I tried Tidy models with tuning for glmnet. The best I got was 0.67. However, I got an excellent performance using TidyModels with Random Forests. I am using Tidy Models for the first time and I have been blown away with how logically it is constructed, much like dplyr & ggplot2.

``````library(dplyr)
library(caret)
library(e1071)
library(ggplot2)
library(tidymodels)

# Helper packages
library(vip)
library(ranger)
# Read all the data from 9 T20 leagues
# BBL,CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB

# Create one large dataframe
df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)

dim(df)
#> [1] 1205909       8

# Take a peek at the dataset
glimpse(df)
\$ ballNum        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28…
\$ ballsRemaining <int> 125, 124, 123, 122, 121, 120, 119, 118, 117, 116, 115, 114, 113, 112, 111, 110, 109, 108, 107, 106, 1…
\$ runs           <int> 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 13, 14, 16, 18, 18, 18, 24, 24, 24, 26, 26, 32, 32, 33, 34, 34, 3…
\$ runRate        <dbl> 1.0000000, 0.5000000, 0.6666667, 0.7500000, 0.6000000, 0.5000000, 0.5714286, 0.5000000, 0.5555556, 0.…
\$ numWickets     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,…
\$ runsMomentum   <dbl> 0.08800000, 0.08870968, 0.08943089, 0.09016393, 0.09090909, 0.09166667, 0.09243697, 0.09322034, 0.094…
\$ perfIndex      <dbl> 11.000000, 5.500000, 7.333333, 8.250000, 6.600000, 5.500000, 6.285714, 5.500000, 6.111111, 5.000000, …
\$ isWinner       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

df %>%
count(isWinner) %>%
mutate(prop = n/sum(n))

set.seed(123)
df\$isWinner = as.factor(df\$isWinner)

# Split the data into training and test set in 80%:20%
splits      <- initial_split(df,prop = 0.80)
df_other <- training(splits)
df_test  <- testing(splits)

# Create a validation set from training set in 80%:20%
set.seed(234)
val_set <- validation_split(df_other,
prop = 0.80)
val_set

# Setup for Random forest using Ranger for classification
# Set up cores for parallel execution
cores <- parallel::detectCores()
cores

#Set up Random Forest engine
rf_mod <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_engine("ranger", num.threads = cores) %>%
set_mode("classification")

rf_mod
# mtry is the number of predictors randomly sampled at each split of a tree,
# and min_n is the minimum number of data points a node must have to be split further
Random Forest Model Specification (classification)

Main Arguments:
mtry = tune()
trees = 1000
min_n = tune()

Engine-Specific Arguments:

Computational engine: ranger

# Setup the predictors and target variable
# Normalise all predictors. Random Forests don't need normalisation but
# I have done it anyway
rf_recipe <-
recipe(isWinner ~ ., data = df_other) %>%
step_normalize(all_predictors())

# Create workflow adding the ML model and recipe
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_recipe(rf_recipe)

# The tune is done for 5 different values of the tuning parameters.
# Metrics include accuracy and roc_auc
rf_res <-
rf_workflow %>%
tune_grid(val_set,
grid = 5,
control = control_grid(save_pred = TRUE),
metrics = metric_set(accuracy,roc_auc))

# Pick the best by ROC AUC
rf_res %>%
show_best(metric = "roc_auc")

# With mtry (number of predictors) of 5 or 7 the ROC_AUC is 0.834, which is quite good

# A tibble: 5 × 8
mtry min_n .metric .estimator  mean     n std_err .config
<int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1     5    26 roc_auc binary     0.834     1      NA Preprocessor1_Model5
2     7    36 roc_auc binary     0.834     1      NA Preprocessor1_Model3
3     2    17 roc_auc binary     0.833     1      NA Preprocessor1_Model4
4     1    20 roc_auc binary     0.832     1      NA Preprocessor1_Model2
5     5     6 roc_auc binary     0.825     1      NA Preprocessor1_Model1

# Select the model with highest accuracy
rf_res %>%
show_best(metric = "accuracy")
mtry min_n .metric  .estimator  mean     n std_err .config
<int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>
1     7    36 accuracy binary     0.737     1      NA Preprocessor1_Model3
2     5    26 accuracy binary     0.736     1      NA Preprocessor1_Model5
3     1    20 accuracy binary     0.736     1      NA Preprocessor1_Model2
4     2    17 accuracy binary     0.735     1      NA Preprocessor1_Model4
5     5     6 accuracy binary     0.731     1      NA Preprocessor1_Model1

# The model with mtry (number of predictors) = 7 has the best accuracy.
# Hence the best model has mtry=7 and min_n=36

rf_best <-
rf_res %>%
select_best(metric = "accuracy")

# Display the best model
rf_best
# A tibble: 1 × 3
mtry min_n .config
<int> <int> <chr>
1     7    36 Preprocessor1_Model3

rf_res %>%
collect_predictions()
id         .pred_class  .row  mtry min_n .pred_0  .pred_1 isWinner .config
<chr>      <fct>       <int> <int> <int>   <dbl>    <dbl> <fct>    <chr>
1 validation 1               1     5     6 0.497   0.503    0        Preprocessor1_Model1
2 validation 1               9     5     6 0.00753 0.992    1        Preprocessor1_Model1
3 validation 0              10     5     6 0.627   0.373    0        Preprocessor1_Model1
4 validation 0              16     5     6 0.998   0.002    0        Preprocessor1_Model1
5 validation 1              18     5     6 0.270   0.730    1        Preprocessor1_Model1
6 validation 0              23     5     6 0.899   0.101    0        Preprocessor1_Model1
7 validation 1              26     5     6 0.452   0.548    1        Preprocessor1_Model1
8 validation 0              30     5     6 0.657   0.343    1        Preprocessor1_Model1
9 validation 0              34     5     6 0.576   0.424    0        Preprocessor1_Model1
10 validation 0              35     5     6 1.00    0.000167 0        Preprocessor1_Model1

rf_auc <-
rf_res %>%
collect_predictions(parameters = rf_best) %>%
roc_curve(isWinner, .pred_0) %>%
mutate(model = "Random Forest")

autoplot(rf_auc)

``````
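To make the roc_auc metric a little more concrete, here is a minimal Python sketch (not part of the tidymodels pipeline) that computes AUC as the share of (winner, non-winner) pairs that are ranked correctly. It uses only the ten validation rows printed above, so the number says nothing about the model itself. The function name `pairwise_auc` is hypothetical, and class 1 is treated as the event here for readability, whereas the R code passes .pred_0 because yardstick treats the first factor level as the event by default.

```python
# (isWinner, .pred_1) pairs copied from the ten collect_predictions() rows shown above
rows = [
    (0, 0.503), (1, 0.992), (0, 0.373), (0, 0.002), (1, 0.730),
    (0, 0.101), (1, 0.548), (1, 0.343), (0, 0.424), (0, 0.000167),
]

def pairwise_auc(rows):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive row gets the higher score; ties count as half a win."""
    pos = [p for y, p in rows if y == 1]
    neg = [p for y, p in rows if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

auc = pairwise_auc(rows)
print(round(auc, 3))
```

On these ten rows, 21 of the 24 pairs are ordered correctly. The full validation set is what produces the 0.834 reported by show_best().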


The Final Model

``````# Create the final Random Forest model with mtry=7 and min_n=36
# engine as "ranger" for classification
last_rf_mod <-
rand_forest(mtry = 7, min_n = 36, trees = 1000) %>%
set_engine("ranger", num.threads = cores, importance = "impurity") %>%
set_mode("classification")

# the last workflow is updated with the final model
last_rf_workflow <-
rf_workflow %>%
update_model(last_rf_mod)

set.seed(345)
last_rf_fit <-
last_rf_workflow %>%
last_fit(splits)

# Collect metrics
last_rf_fit %>%
collect_metrics()
.metric  .estimator .estimate .config
<chr>    <chr>          <dbl> <chr>
1 accuracy binary         0.739 Preprocessor1_Model1
2 roc_auc  binary         0.837 Preprocessor1_Model1

# The Random Forest model gives an accuracy of 0.739 and ROC_AUC of 0.837 which I think is quite good.
# This is roughly what I got with Tensorflow/Keras

# Get the feature importance
last_rf_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 7)

``````

Interestingly, the feature that I engineered, namely Performance Index, which is the product of Run rate x Wickets in Hand, seems to have the maximum importance. I would have thought numWickets would be important, but in a T20 match it probably is not.
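The glimpse() output earlier makes it possible to check how perfIndex relates to the other columns. The Python sketch below is inferred from those printed values and is not code taken from yorkr: runRate appears to be runs per ball, and "wickets in hand" appears to be 11 minus numWickets. The function name `perf_index` is hypothetical.

```python
# First ten balls copied from the glimpse() output:
# (ballNum, runs, numWickets, perfIndex as printed)
balls = [
    (1, 1, 0, 11.000000), (2, 1, 0, 5.500000), (3, 2, 0, 7.333333),
    (4, 3, 0, 8.250000), (5, 3, 0, 6.600000), (6, 3, 0, 5.500000),
    (7, 4, 0, 6.285714), (8, 4, 0, 5.500000), (9, 5, 0, 6.111111),
    (10, 5, 1, 5.000000),
]

def perf_index(ball_num, runs, num_wickets):
    run_rate = runs / ball_num          # runs per ball, matches the runRate column
    wickets_in_hand = 11 - num_wickets  # assumption inferred from the printed data
    return run_rate * wickets_in_hand

computed = [perf_index(b, r, w) for b, r, w, _ in balls]
for (b, r, w, printed), value in zip(balls, computed):
    print(b, round(value, 6), printed)
```

The computed values agree with the printed perfIndex column for these ten balls, including ball 10 where the first wicket has fallen.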

``````# Generate predictions from the test set
test_predictions <- last_rf_fit %>% collect_predictions()
test_predictions
# A tibble: 241,182 × 7
id               .pred_0 .pred_1  .row .pred_class isWinner .config
<chr>              <dbl>   <dbl> <int> <fct>       <fct>    <chr>
1 train/test split   0.496   0.504     1 1           0        Preprocessor1_Model1
2 train/test split   0.640   0.360    11 0           0        Preprocessor1_Model1
3 train/test split   0.596   0.404    14 0           0        Preprocessor1_Model1
4 train/test split   0.287   0.713    22 1           0        Preprocessor1_Model1
5 train/test split   0.616   0.384    28 0           0        Preprocessor1_Model1
6 train/test split   0.516   0.484    36 0           0        Preprocessor1_Model1
7 train/test split   0.754   0.246    37 0           0        Preprocessor1_Model1
8 train/test split   0.641   0.359    39 0           0        Preprocessor1_Model1
9 train/test split   0.811   0.189    40 0           0        Preprocessor1_Model1
10 train/test split   0.618   0.382    42 0           0        Preprocessor1_Model1

# generate a confusion matrix
test_predictions %>%
conf_mat(truth = isWinner, estimate = .pred_class)

          Truth
Prediction     0     1
         0 92173 31623
         1 31320 86066

# Fit the final model on the full training data (df_other)
final_model <- fit(last_rf_workflow, df_other)

# Final model
final_model
══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 Recipe Step

• step_normalize()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Ranger result

Call:
ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~7,      x), num.trees = ~1000, min.node.size = min_rows(~36, x),      num.threads = ~cores, importance = ~"impurity", verbose = FALSE,      seed = sample.int(10^5, 1), probability = TRUE)

Type:                             Probability estimation
Number of trees:                  1000
Sample size:                      964727
Number of independent variables:  7
Mtry:                             7
Target node size:                 36
Variable importance mode:         impurity
Splitrule:                        gini
OOB prediction error (Brier s.):  0.1631303
``````
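As a quick sanity check, the confusion matrix above can be turned back into the headline metrics. The Python sketch below uses only the four counts printed by conf_mat(). The variable names are hypothetical, and class 0 is taken as the "event" because that is yardstick's default (the first factor level).

```python
# Counts copied from the conf_mat() output (rows = prediction, columns = truth)
pred0_truth0, pred0_truth1 = 92173, 31623
pred1_truth0, pred1_truth1 = 31320, 86066

total = pred0_truth0 + pred0_truth1 + pred1_truth0 + pred1_truth1
accuracy = (pred0_truth0 + pred1_truth1) / total

# With class 0 as the event:
sensitivity = pred0_truth0 / (pred0_truth0 + pred1_truth0)  # share of truth-0 rows predicted 0
specificity = pred1_truth1 / (pred0_truth1 + pred1_truth1)  # share of truth-1 rows predicted 1

print(total, round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
```

The total matches the 241,182 rows of test predictions, and the accuracy rounds to the 0.739 reported by collect_metrics().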

The Random Forest Model’s performance has been quite impressive and probably requires further exploration.

``````# Saving and loading the model
save(final_model, file = "fit.rda")
load("fit.rda")

#Predicting the Win Probability of CSK vs DD match on 12 May 2012``````

Comparing this with the Worm wicket graph of this match we see that DD had no chance at all

C) Win Probability with Tensorflow/Keras with Grid Search – Python

I spent a fair amount of time tuning the hyperparameters of the Keras Deep Learning Network, and finally went for Grid Search. Incidentally I did ask ChatGPT to suggest code snippets for GridSearch which it promptly did!!!

``````import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from sklearn.model_selection import GridSearchCV

# Define the model; the optimizer is a parameter so that the grid search can vary it
def create_model(optimizer='adam'):
    tf.random.set_seed(4)
    model = tf.keras.Sequential([
        keras.layers.Dense(32, activation=tf.nn.relu, input_shape=[len(train_dataset1.keys())]),
        keras.layers.Dense(16, activation=tf.nn.relu),
        keras.layers.Dense(8, activation=tf.nn.relu),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])

    # Since this is binary classification use binary_crossentropy
    model.compile(loss='binary_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
    return model

# Create a KerasClassifier object
model = keras.wrappers.scikit_learn.KerasClassifier(build_fn=create_model)

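# Define the grid of hyperparameters to search over
# Note: only 'Nadam' is named in this post; the full list of optimizers searched is not shown
optimizer = ['Nadam']
batch_size = [1024]
epochs = [40]
learning_rate = [0.01, 0.001, 0.0001]  # not passed to param_grid in this snippet

param_grid = dict(optimizer=optimizer, batch_size=batch_size, epochs=epochs)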
# Create the grid search object
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)

# Fit the grid search object to the training data
grid_search.fit(normalized_train_data, train_labels)

# Print the best hyperparameters
print('Best hyperparameters:', grid_search.best_params_)
# summarize results
print("Best: %f using %s" % (grid_search.best_score_, grid_search.best_params_))
means = grid_search.cv_results_['mean_test_score']
stds = grid_search.cv_results_['std_test_score']
params = grid_search.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))``````

The best worked out to be the optimiser ‘Nadam’ with a learning rate of 0.001
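For intuition about the cost of such a search, the sketch below expands a parameter grid the way an exhaustive grid search does: one candidate per combination of values, each fitted once per cross-validation fold. It is plain Python and not scikit-learn's implementation. The grid is illustrative, built from the batch_size, epochs and learning_rate lists defined in the snippet above, and the helper name `expand_grid` is hypothetical.

```python
from itertools import product

# Illustrative grid using the lists defined in the snippet above
param_grid = {
    "batch_size": [1024],
    "epochs": [40],
    "learning_rate": [0.01, 0.001, 0.0001],
}
cv_folds = 3  # same value as cv=3 in the GridSearchCV call

def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(param_grid)
total_fits = len(candidates) * cv_folds  # one fit per candidate per fold

for candidate in candidates:
    print(candidate)
print("fits:", total_fits)
```

Every extra value in any list multiplies the number of fits, which is why a search over a Keras model with 40 epochs per fit takes a while. By default GridSearchCV also refits the best candidate once on the full training data at the end.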

``````import matplotlib.pyplot as plt
# Create a model
tf.random.set_seed(4)
model = tf.keras.Sequential([
    keras.layers.Dense(32, activation=tf.nn.relu, input_shape=[len(train_dataset1.keys())]),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(8, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

# Best setting from the search: Nadam with a learning rate of 0.001
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.001)

# Since this is binary classification use binary_crossentropy
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

# Fit
#history=model.fit(
#  train_dataset1, train_labels,batch_size=1024,
#  epochs=40, validation_data=(test_dataset1,test_labels), verbose=1)
history=model.fit(
normalized_train_data, train_labels,batch_size=1024,
epochs=40, validation_data=(normalized_test_data,test_labels), verbose=1)

Epoch 37/40
943/943 [==============================] - 3s 3ms/step - loss: 0.4971 - accuracy: 0.7310 - val_loss: 0.4968 - val_accuracy: 0.7357
Epoch 38/40
943/943 [==============================] - 3s 3ms/step - loss: 0.4970 - accuracy: 0.7310 - val_loss: 0.4974 - val_accuracy: 0.7378
Epoch 39/40
943/943 [==============================] - 4s 4ms/step - loss: 0.4970 - accuracy: 0.7309 - val_loss: 0.4994 - val_accuracy: 0.7296
Epoch 40/40
943/943 [==============================] - 3s 3ms/step - loss: 0.4969 - accuracy: 0.7311 - val_loss: 0.4998 - val_accuracy: 0.7300
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "test"], loc="upper left")
plt.show()``````
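To put the loss values in the training log in context, here is a small Python sketch of binary cross-entropy, the loss named in model.compile(). The labels and probabilities are made up purely for illustration and the function is a plain re-statement of the formula, not Keras code. A model that always predicts 0.5 scores ln(2), roughly 0.693, so the logged loss of about 0.497 is better than a coin flip.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels and predicted win probabilities, for illustration only
labels = [1, 0, 1, 0]
coin_flip = [0.5, 0.5, 0.5, 0.5]
confident = [0.9, 0.1, 0.8, 0.3]

baseline = binary_crossentropy(labels, coin_flip)  # equals ln(2)
better = binary_crossentropy(labels, confident)
print(round(baseline, 4), round(better, 4))
```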

Conclusion

So, the Keras Deep Learning Network gives about the same performance as Random Forest in TidyModels. But I went with the R Random Forest as it was easier to save and load the model for use with my data. Also, I am not sure whether the performance of the ML model can be improved beyond a point. However, I will continue to explore.

Watch this space!!!


References