# GooglyPlusPlus: Computing T20 player’s Win Probability Contribution

In this post, I compute each batsman’s or bowler’s Win Probability Contribution (WPC) in a T20 match. This metric captures by how much the player (batsman or bowler) changed/impacted the Win Probability of the T20 match. For this computation I use my machine learning models, I had created earlier, which predicts the ball-by-ball win probability as the T20 match progresses through the 2 innings of the match.

In the picture snippet below, you can see how the win probability changes ball-by-ball for each batsman for a T20 match between CSK vs LSG- 31 Mar 2022

In my previous posts I had created several Machine Learning models. In order to compute the player’s Win Probability contribution in this post, I have used the following ML models

The batsman’s or bowler’s win probability contribution changes ball-by=ball. The player’s contribution is calculated as the difference in win probability when the batsman faces the 1st ball in his innings and the last ball either when is out or the innings comes to an end. If the difference is +ve the the player has had a positive impact, and likewise for negative contribution. Similarly, for a bowler, it is the win probability when he/she comes into bowl till, the last delivery he/she bowls

Note: The Win Probability Contribution does not have any relation to the how much runs or at what strike rate the batsman scored the runs. Rather the model computes different win probability for each player, based on his/her embedding, the ball in the innings and six other feature vectors like runs, run rate, runsMomentum etc. These values change for every ball as seen in the table above. Also, this is not continuous. The 2 ML models determine the Win Probability for a specific player, ball and the context in the match.

This metric is similar to Win Probability Added (WPA) used in Sabermetrics for baseball. Here is the definition of WPA from Fangraphs “Win Probability Added (WPA) captures the change in Win Expectancy from one plate appearance to the next and credits or debits the player based on how much their action increased their team’s odds of winning.” This article in Fangraphs explains in detail how this computation is done.

In this post I have added 4 new function to my R package yorkr.

• batsmanWinProbLR – batsman’s win probability contribution based on glmnet (Logistic Regression)
• bowlerWinProbLR – bowler’s win probability contribution based on glmnet (Logistic Regression)
• batsmanWinProbDL – batsman’s win probability contribution based on Deep Learning Model
• bowlerWinProbDL – bowlerWinProbLR – bowler’s win probability contribution based on Deep Learning

Hence there are 4 additional features in GooglyPlusPlus based on the above 4 functions. In addition I have also updated

-winProbLR (overLap) function to include the names of batsman when they come to bat and when they get out or the innings comes to an end, based on Logistic Regression

-winProbDL(overLap) function to include the names of batsman when they come to bat and when they get out based on Deep Learning

Hence there are 6 new features in this version of GooglyPlusPlus.

Note: All these new 6 features are available for all 9 formats of T20 in GooglyPlusPlus namely

a) IPL b) BBL c) NTB d) PSL e) Intl, T20 (men) f) Intl. T20 (women) g) WBB h) CSL i) SSM

Note: The data for GooglyPlusPlus comes from Cricsheet and the Shiny app is based on my R package yorkr

A) Chennai SuperKings vs Delhi Capitals – 04 Oct 2021

To understand Win Probability Contribution better let us look at Chennai Super Kings vs Delhi Capitals match on 04 Oct 2021

This was closely fought match with fortunes swinging wildly. If we take a look at the Worm wicket chart of this match

a) Worm Wicket chartCSK vs DC – 04 Oct 2021

Delhi Capitals finally win the match

b) Win Probability Logistic Regression (side-by-side) – CSK vs DC – 4 Oct 2021

Plotting how win probability changes over the course of the match using Logistic Regression Model

In this match Delhi Capitals won. The batting scorecard of Delhi Capitals

c) Batting Scorecard of Delhi Capitals – CSK vs DC – 4 Oct 2021

d) Win Probability Logistic Regression (Overlapping) – CSK vs DC – 4 Oct 2021

The Win Probability LR (overlapping) shows the probability function of both teams superimposed over one another. The plot includes when a batsman came into to play and when he got out. This is for both teams. This looks a little noisy, but there is a way to selectively display the change in Win Probability for each team. This can be done , by clicking the 3 arrows (orange or blue) from top to bottom. First double-click the team CSK or DC, then click the next 2 items (blue,red or black,grey) Sorry the legends don’t match the colors! 😦

Below we can see how the win probability changed for Delhi Capitals during their innings, as batsmen came into to play. See below

e) Batsman Win Probability contribution:DC – CSK vs DC – 4 Oct 2021

Computing the individual batsman’s Win Contribution and plotting we have. Hetmeyer has a higher Win Probability contribution than Shikhar Dhawan depsite scoring fewer runs

f) Bowler’s Win Probability contribution :CSK – CSK vs DC – 4 Oct 2021

We can also check the Win Probability of the bowlers. So for e.g the CSK bowlers and which bowlers had the most impact. Moeen Ali has the least impact in this match

B) Intl. T20 (men) Australia vs India – 25 Sep 2022

a) Worm wicket chart – Australia vs India – 25 Sep 2022

This was another close match in which India won with the penultimate ball

b) Win Probability based on Deep Learning model (side-by-side) – Australia vs India – 25 Sep 2022

c) Win Probability based on Deep Learning model (overlapping) – Australia vs India – 25 Sep 2022

The plot below shows how the Win Probability of the teams varied across the 20 overs. The 2 Win Probability distributions are superimposed over each other

d) Batsman Win Probability Contribution : IndiaAustralia vs India – 25 Sep 2022

Selectively choosing the India Win Probability plot by double-clicking legend ‘India’ on the right , followed by single click of black, grey legend we have

We see that Kohli, Suryakumar Yadav have good contribution to the Win Probability

e) Plotting the Runs vs Strike Rate:India – Australia vs India – 25 Sep 2022

f) Batsman’s Win Probability Contribution- Australia vs India – 25 Sep 2022

Finally plotting the Batsman’s Win Probability Contribution

Interestingly, Kohli has a greater Win Probability Contribution than SKY, though SKY scored more runs at a better strike rate. As mentioned above, the Win Probability is context dependent and also depends on past performances of the player (batsman, bowler)

Finally let us look at

C) India vs England Intll T20 Women (11 July 2021)

a) Worm wicket chart – India vs England Intl. T20 Women (11 July 2021)

India won this T20 match by 8 runs

b) Win Probability using the Logistic Regression Model – India vs England Intl. T20 Women (11 July 2021)

c) Win Probability with the DL model – India vs England Intl. T20 Women (11 July 2021)

d) Bowler Win Probability Contribution with the LR model India vs England Intl. T20 Women (11 July 2021)

e) Bowler Win Contribution with the DL model India vs England Intl. T20 Women (11 July 2021)

Also see my other posts

To see all posts click Index of posts

# GooglyPlusPlus: Win Probability using Deep Learning and player embeddings

In my last post ‘GooglyPlusPlus now with Win Probability Analysis for all T20 matches‘ I had discussed the performance of my ML models, created with and without player embeddings, in computing the Win Probability of T20 matches. With batsman & bowler embeddings I got much better performance than without the embeddings

• glmnet – Accuracy – 0.73
• Random Forest (RF) – Accuracy – 0.92

While the Random Forest gave excellent accuracy, it was bulky and also took an unusually long time to predict the Win Probability of a single T20 match. The above 2 ML models were built using R’s Tidymodels. glmnet was fast, but I wanted to see if I could create a ML model that was better, lighter and faster. I had initially tried to use Tensorflow, Keras in Python but then abandoned it, since I did not know how to port the Deep Learning model to R and use in my app GooglyPlusPlus.

But later, since I was stuck with a bulky Random Forest model, I decided to again explore options for saving the Keras Deep Learning model and loading it in R. I found out that saving the model as .h5, we can load it in R and use it for predictions. Hence, I rebuilt a Deep Learning model using Keras, Python with player embeddings and I got excellent performance. The DL model was light and had an accuracy 0.8639 with an ROC_AUC of 0.964 which was great!

GooglyPlusPlus uses data from Cricsheet and is based on my R package yorkr

Here are the steps

A. Build a Keras Deep Learning model

a. Import necessary packages

import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from pathlib import Path
import matplotlib.pyplot as plt


b, Upload the data of all 9 T20 leagues (BBL, CPL, IPL, T20 (men) , T20(women), NTB, CPL, SSM, WBB)

# Read all T20 leagues
print("Shape of dataframe=",df1.shape)

# Create training and test data set
train_dataset = df1.sample(frac=0.8,random_state=0)
test_dataset = df1.drop(train_dataset.index)
train_dataset1 = train_dataset[['batsmanIdx','bowlerIdx','ballNum','ballsRemaining','runs','runRate','numWickets','runsMomentum','perfIndex']]
test_dataset1 = test_dataset[['batsmanIdx','bowlerIdx','ballNum','ballsRemaining','runs','runRate','numWickets','runsMomentum','perfIndex']]
train_dataset1

# Set the target data
train_labels = train_dataset.pop('isWinner')
test_labels = test_dataset.pop('isWinner')
train_dataset1

a=train_dataset1.describe()
stats=a.transpose
a

c. Create a Deep Learning ML model using batsman & bowler embeddings

import pandas as pd
import numpy as np
from keras.layers import Input, Embedding, Flatten, Dense
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, Reshape, Concatenate, Dropout
from keras.models import Model

# Set seed
tf.random.set_seed(432)

# create input layers for each of the predictors
batsmanIdx_input = Input(shape=(1,), name='batsmanIdx')
bowlerIdx_input = Input(shape=(1,), name='bowlerIdx')
ballNum_input = Input(shape=(1,), name='ballNum')
ballsRemaining_input = Input(shape=(1,), name='ballsRemaining')
runs_input = Input(shape=(1,), name='runs')
runRate_input = Input(shape=(1,), name='runRate')
numWickets_input = Input(shape=(1,), name='numWickets')
runsMomentum_input = Input(shape=(1,), name='runsMomentum')
perfIndex_input = Input(shape=(1,), name='perfIndex')

# Set the embedding size as the 4th root of unique batsmen, bowlers
no_of_unique_batman=len(df1["batsmanIdx"].unique())
no_of_unique_bowler=len(df1["bowlerIdx"].unique())
embedding_size_bat = no_of_unique_batman ** (1/4)
embedding_size_bwl = no_of_unique_bowler ** (1/4)

# create embedding layer for the categorical predictor
batsmanIdx_embedding = Embedding(input_dim=no_of_unique_batman+1, output_dim=16,input_length=1)(batsmanIdx_input)
batsmanIdx_flatten = Flatten()(batsmanIdx_embedding)
bowlerIdx_embedding = Embedding(input_dim=no_of_unique_bowler+1, output_dim=16,input_length=1)(bowlerIdx_input)
bowlerIdx_flatten = Flatten()(bowlerIdx_embedding)

# concatenate all the predictors
x = keras.layers.concatenate([batsmanIdx_flatten,bowlerIdx_flatten, ballNum_input, ballsRemaining_input, runs_input, runRate_input, numWickets_input, runsMomentum_input, perfIndex_input])

# Use dropouts for regularisation
x = Dense(64, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(8, activation='relu')(x)
x = Dropout(0.1)(x)

output = Dense(1, activation='sigmoid', name='output')(x)
print(output.shape)

# create a DL model
model = Model(inputs=[batsmanIdx_input,bowlerIdx_input, ballNum_input, ballsRemaining_input, runs_input, runRate_input, numWickets_input, runsMomentum_input, perfIndex_input], outputs=output)
model.summary()

# compile model

model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# train the model
history=model.fit([train_dataset1['batsmanIdx'],train_dataset1['bowlerIdx'],train_dataset1['ballNum'],train_dataset1['ballsRemaining'],train_dataset1['runs'],
train_dataset1['runRate'],train_dataset1['numWickets'],train_dataset1['runsMomentum'],train_dataset1['perfIndex']], train_labels, epochs=40, batch_size=1024,
validation_data = ([test_dataset1['batsmanIdx'],test_dataset1['bowlerIdx'],test_dataset1['ballNum'],test_dataset1['ballsRemaining'],test_dataset1['runs'],
test_dataset1['runRate'],test_dataset1['numWickets'],test_dataset1['runsMomentum'],test_dataset1['perfIndex']],test_labels), verbose=1)

plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "test"], loc="upper left")
plt.show()

Model: "model_5"
__________________________________________________________________________________________________
Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
batsmanIdx (InputLayer)        [(None, 1)]          0           []

bowlerIdx (InputLayer)         [(None, 1)]          0           []

embedding_10 (Embedding)       (None, 1, 16)        75888       ['batsmanIdx[0][0]']

embedding_11 (Embedding)       (None, 1, 16)        55808       ['bowlerIdx[0][0]']

flatten_10 (Flatten)           (None, 16)           0           ['embedding_10[0][0]']

flatten_11 (Flatten)           (None, 16)           0           ['embedding_11[0][0]']

ballNum (InputLayer)           [(None, 1)]          0           []

ballsRemaining (InputLayer)    [(None, 1)]          0           []

runs (InputLayer)              [(None, 1)]          0           []

runRate (InputLayer)           [(None, 1)]          0           []

numWickets (InputLayer)        [(None, 1)]          0           []

runsMomentum (InputLayer)      [(None, 1)]          0           []

perfIndex (InputLayer)         [(None, 1)]          0           []

concatenate_5 (Concatenate)    (None, 39)           0           ['flatten_10[0][0]',
'flatten_11[0][0]',
'ballNum[0][0]',
'ballsRemaining[0][0]',
'runs[0][0]',
'runRate[0][0]',
'numWickets[0][0]',
'runsMomentum[0][0]',
'perfIndex[0][0]']

dense_19 (Dense)               (None, 64)           2560        ['concatenate_5[0][0]']

dropout_19 (Dropout)           (None, 64)           0           ['dense_19[0][0]']

dense_20 (Dense)               (None, 32)           2080        ['dropout_19[0][0]']

dropout_20 (Dropout)           (None, 32)           0           ['dense_20[0][0]']

dense_21 (Dense)               (None, 16)           528         ['dropout_20[0][0]']

dropout_21 (Dropout)           (None, 16)           0           ['dense_21[0][0]']

dense_22 (Dense)               (None, 8)            136         ['dropout_21[0][0]']

dropout_22 (Dropout)           (None, 8)            0           ['dense_22[0][0]']

output (Dense)                 (None, 1)            9           ['dropout_22[0][0]']

==================================================================================================
Total params: 137,009
Trainable params: 137,009
Non-trainable params: 0
__________________________________________________________________________________________________
Epoch 1/40
937/937 [==============================] - 11s 10ms/step - loss: 0.5683 - accuracy: 0.6968 - val_loss: 0.4480 - val_accuracy: 0.7708
Epoch 2/40
937/937 [==============================] - 9s 10ms/step - loss: 0.4477 - accuracy: 0.7721 - val_loss: 0.4305 - val_accuracy: 0.7833
Epoch 3/40
937/937 [==============================] - 9s 10ms/step - loss: 0.4229 - accuracy: 0.7832 - val_loss: 0.3984 - val_accuracy: 0.7936
...
...
937/937 [==============================] - 10s 10ms/step - loss: 0.2909 - accuracy: 0.8627 - val_loss: 0.2943 - val_accuracy: 0.8613
Epoch 38/40
937/937 [==============================] - 10s 10ms/step - loss: 0.2892 - accuracy: 0.8633 - val_loss: 0.2933 - val_accuracy: 0.8621
Epoch 39/40
937/937 [==============================] - 10s 10ms/step - loss: 0.2889 - accuracy: 0.8638 - val_loss: 0.2941 - val_accuracy: 0.8620
Epoch 40/40
937/937 [==============================] - 10s 11ms/step - loss: 0.2886 - accuracy: 0.8639 - val_loss: 0.2929 - val_accuracy: 0.8621

d. Compute and plot the ROC-AUC for the above model

from sklearn.metrics import roc_curve

# Select a random sample set
tf.random.set_seed(59)
train = df1.sample(frac=0.9,random_state=0)
test = df1.drop(train_dataset.index)
test_dataset1 = test[['batsmanIdx','bowlerIdx','ballNum','ballsRemaining','runs','runRate','numWickets','runsMomentum','perfIndex']]
test_labels = test.pop('isWinner')

# Compute the predicted values
y_pred_keras = model.predict([test_dataset1['batsmanIdx'],test_dataset1['bowlerIdx'],test_dataset1['ballNum'],test_dataset1['ballsRemaining'],test_dataset1['runs'],
test_dataset1['runRate'],test_dataset1['numWickets'],test_dataset1['runsMomentum'],test_dataset1['perfIndex']]).ravel()

# Compute TPR & FPR
fpr_keras, tpr_keras, thresholds_keras = roc_curve(test_labels, y_pred_keras)

fpr_keras, tpr_keras, thresholds_keras = roc_curve(test_labels, y_pred_keras)
from sklearn.metrics import auc

# Plot the Area Under the Curve (AUC)
auc_keras = auc(fpr_keras, tpr_keras)
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_keras, tpr_keras, label='Keras (area = {:.3f})'.format(auc_keras))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

The ROC_AUC for the Deep Learning Model is 0.946 as seen below

e. Save the Keras model for use in Python

from keras.models import Model
model.save("wpDL.h5")

f. Load the model in R using rhdf5 package for use in GooglyPlusPlus

library(rhdf5)
dl_model <- load_model_hdf5('wpDL.h5')

This was a huge success for me to be able to create the Deep Learning model in Python and use it in my Shiny app GooglyPlusPlus. The Deep Learning Keras model is light-weight and extremely fast.

The Deep Learning model has now been integrated into GooglyPlusPlus. Now you can check the Win Probability using both a) glmnet (Logistic Regression with lasso regularisation) b) Keras Deep Learning model with dropouts as regularisation

In addition I have created 2 features based on Win Probability (WP)

i) Win Probability (Side-by-side – Plot(interactive) : With this functionality the 1st and 2nd innings will be side-by-side. When the 1st innings is played by team 1, the Win Probability of team 2 = 100 – WP (team1). Similarly, when the 2nd innings is being played by team 2, the Win Probability of team1 = 100 – WP (team 2)

ii) Win Probability (Overlapping) – Plot (static): With this functionality the Win Probabilities of both team1(1st innings) & team 2 (2nd innings) are displayed overlapping, so that we can see how the probabilities vary ball-by-ball.

Note: Since the same UI is used for all match functions I had to re-use the Plot(interactive) and Plot(static) radio buttons for Win Probability (Side-by-side) and Win Probability(Overlapping) respectively

Here are screenshots using both ML models with both functionality for some random matches

B) ICC T20 Men World Cup – Netherland-South Africa- 2022-11-06

i) Match Worm wicket chart

ii) Win Probability with LR (Side-by-Side- Plot(interactive))

iii) Win Probability LR (Overlapping- Plot(static))

iv) Win Probability Deep Learning (Side-by-side – Plot(interactive)

In the 213th ball of the innings South Africa was slightly ahead of Netherlands. After that they crashed and burned!

v) Win Probability Deep Learning (Overlapping – Plot (static)

It can be seen that in the 94th ball of both innings South Africa was ahead of Netherlands before the eventual slump.

C) Intl. T20 (Women) India – New Zealand – 2020 – 02 – 27

Here is an interesting match between India and New Zealand T20 Women’s teams. NZ successfully chased the India’s total in a wildly swinging fortunes. See the charts below

i) Match Worm Wicket chart

ii) Win Probability with LR (Side-by-side – Plot (interactive)

iii) Win Probability with LR (Overlapping – Plot (static)

iv) Win Probability with DL model (Side-by-side – Plot (interactive))

v) Win Probability with DL model (Overlapping – Plot (static))

The above functionality in plotting the Win Probability using LR or DL with both options (Side-by-side or Overlapping) is available for all 9 T20 leagues currently supported by GooglyPlusPlus.

Go ahead and give gpp2023-1 a try!!!

Do also check out my other posts’

To see all posts click Index of posts

# Boosting Win Probability accuracy with player embeddings

In my previous post Computing Win Probability of T20 matches I had discussed various approaches on computing Win Probability of T20 matches. I had created ML models with glmnet and random forest using TidyModels. This was what I had achieved

• glmnet : accuracy – 0.67 and sensitivity/specificity – 0.68/0.65
• random forest : accuracy – 0.737 and roc_auc- 0.834
• DL model with Keras in Python : accuracy – 0.73

I wanted to see if the performance of the models could be further improved. I got a suggestion from a AI/DL whizkid, who is close to me, to include embeddings for batsmen and bowlers. He felt that win percentage is influenced by which batsman faces which bowler.

So, I started to explore this idea. Embeddings can be used to convert categorical variables to a vector of continuous floating point numbers.Fortunately R’s Tidymodels, has a convenient functionality to create embeddings. By including embeddings for batsman, bowler the performance of my ML models improved vastly. Now the performance is

• glmnet : accuracy – 0.728 and roc_auc – 0.81
• random forest : accuracy – 0.927 and roc_auc – 0.98
• mlp-dnn :accuracy – 0.762 and roc_auc – 0.854

As can be seem there is almost a 20% increase in accuracy with random forests with embeddings over the model without embeddings. Moreover, the feature importance which is plotted below shows that the bowler and batsman embeddings have a significant influence on the Win Probability

Note: The data for this analysis is taken from Cricsheet and has been processed with my R package yorkr.

A. Win Probability using GLM with penalty and player embeddings

Here Generalised Linear Model (GLMNET) for Logistic Regression is used. In the GLMNET the regularisation path is computed for the lasso or elastic net penalty at a grid of values for the regularisation parameter lambda. glmnet is extremely fast and gave an accuracy of 0.72 for an roc_auc of 0.81 with batsman, bowler embeddings. This was good improvement over my earlier implementation with glmnet without the batsman & bowler embeddings which had a

a) Read the data from 9 T20 leagues (BBL, CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB) and create a single data frame of ball-by-ball data. Display the data frame

library(dplyr)
library(caret)
library(e1071)
library(ggplot2)
library(tidymodels)
library(embed)

# Helper packages
library(vip)

#Bind all dataframes together
df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)
glimpse(df)
Rows: 1,199,115
Columns: 10
$batsman <chr> "JD Smith", "M Klinger", "M Klinger", "M Klinger", "JD …$ bowler         <chr> "NM Hauritz", "NM Hauritz", "NM Hauritz", "NM Hauritz",…

$ballNum <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …$ ballsRemaining <int> 125, 124, 123, 122, 121, 120, 119, 118, 117, 116, 115, …
$runs <int> 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 13, 14, 16, 18, 18,…$ runRate        <dbl> 1.0000000, 0.5000000, 0.6666667, 0.7500000, 0.6000000, …
$numWickets <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…$ runsMomentum   <dbl> 0.08800000, 0.08870968, 0.08943089, 0.09016393, 0.09090…
$perfIndex <dbl> 11.000000, 5.500000, 7.333333, 8.250000, 6.600000, 5.50…$ isWinner       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

df %>%
count(isWinner) %>%
mutate(prop = n/sum(n))
isWinner      n      prop
1
0 614237 0.5122419
2
1 584878 0.4877581


2) Create training.validation and test sets

b) Split to training, validation and test sets. The dataset is initially split into training and test in the ratio 80%:20%. The training data is again split into training and validation in the ratio 80:20

set.seed(123)
splits      <- initial_split(df,prop = 0.80)
splits
<Training/Testing/Total>
<959292/239823/1199115>
df_other <- training(splits)
df_test  <- testing(splits)

set.seed(234)
val_set <- validation_split(df_other,prop = 0.80)
val_set
# A tibble: 1 × 2
splits
id
<list>                  <chr>
1 <split [767433/191859]> validation



3) Create pre-processing recipe

a) Normalise the following predictors

• ballNum
• ballsRemaining
• runs
• runRate
• numWickets
• runsMomentum
• perfIndex

b) Create floating point embeddings for

• batsman
• bowler

4) Create a Logistic Regression Workflow by adding the GLM model and the recipe

5) Create grid of elastic penalty values for regularisation

6) Train all 30 models

7) Plot the ROC of the model against the penalty

# Use all 12 cores
cores <- parallel::detectCores()
cores
# Create a Logistic Regression model with penalty
lr_mod <-
logistic_reg(penalty = tune(), mixture = 1) %>%

# Create pre-processing recipe
lr_recipe <-
recipe(isWinner ~ ., data = df_other) %>%
step_embed(batsman,bowler, outcome = vars(isWinner)) %>%  step_normalize(ballNum,ballsRemaining,runs,runRate,numWickets,runsMomentum,perfIndex)

# Set the workflow by adding the GLM model with the recipe
lr_workflow <-
workflow() %>%

# Create a grid for the elastic net penalty
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
lr_reg_grid %>% top_n(-5)
# A tibble: 5 × 1
penalty

<dbl>
1 0.0001
2 0.000127
3 0.000161
4 0.000204
5 0.000259

lr_reg_grid %>% top_n(5)  # highest penalty values
# A tibble: 5 × 1
penalty
<dbl>
1  0.0386
2  0.0489
3  0.0621
4  0.0788
5  0.1

# Train 30 penalized models
lr_res <-
lr_workflow %>%
tune_grid(val_set,
grid = lr_reg_grid,
control = control_grid(save_pred = TRUE),
metrics = metric_set(accuracy,roc_auc))

# Plot the penalty versus ROC
lr_plot <-
lr_res %>%
collect_metrics() %>%
ggplot(aes(x = penalty, y = mean)) +
geom_point() +
geom_line() +
ylab("Area under the ROC Curve") +
scale_x_log10(labels = scales::label_number())

lr_plot

The Penalty vs ROC plot is shown below

8) Display the ROC_AUC of the top models with the penalty

9) Select the model with the best ROC_AUC and the associated penalty. It can be seen the best mean ROC_AUC is 0.81 and the associated penalty is 0.000530

top_models <-
lr_res %>%
show_best("roc_auc", n = 15) %>%
arrange(penalty)
top_models

# A tibble: 15 × 7
penalty .metric .estimator  mean     n std_err .config
<dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 0.0001   roc_auc binary     0.810     1      NA Preprocessor1_Model01
2 0.000127 roc_auc binary     0.810     1      NA Preprocessor1_Model02
3 0.000161 roc_auc binary     0.810     1      NA Preprocessor1_Model03
4 0.000204 roc_auc binary     0.810     1      NA Preprocessor1_Model04
5 0.000259 roc_auc binary     0.810     1      NA Preprocessor1_Model05
6 0.000329 roc_auc binary     0.810     1      NA Preprocessor1_Model06
7 0.000418 roc_auc binary     0.810     1      NA Preprocessor1_Model07
8 0.000530 roc_auc binary     0.810     1      NA Preprocessor1_Model08
9 0.000672 roc_auc binary     0.810     1      NA Preprocessor1_Model09
10 0.000853 roc_auc binary     0.810     1      NA Preprocessor1_Model10
11 0.00108  roc_auc binary     0.810     1      NA Preprocessor1_Model11
12 0.00137  roc_auc binary     0.810     1      NA Preprocessor1_Model12
13 0.00174  roc_auc binary     0.809     1      NA Preprocessor1_Model13
14 0.00221  roc_auc binary     0.809     1      NA Preprocessor1_Model14
15 0.00281  roc_auc binary     0.809     1      NA Preprocessor1_Model15

#Picking the best model and the corresponding penalty
lr_best <-
lr_res %>%
collect_metrics() %>%
arrange(penalty) %>%
slice(8)
lr_best
# A tibble: 1 × 7

penalty .metric .estimator  mean     n std_err .config
<dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>

1 0.000530 roc_auc binary     0.810     1      NA Preprocessor1_Model08

# Collect predictions and generate the AUC curve
lr_auc <-
lr_res %>%
collect_predictions(parameters = lr_best) %>%
roc_curve(isWinner, .pred_0) %>%
mutate(model = "Logistic Regression")

autoplot(lr_auc)

7) Plot the Area under the Curve (AUC).

10) Build the final model with the best LR parameters value as found in lr_best

a) The best performance was for a penalty of 0.000530

b) The accuracy achieved is 0.72. Clearly using the embeddings for batsman, bowlers improves on the performance of the GLM model without the embeddings. The accuracy achieved was 0.72 whereas previously it was 0.67 see (Computing Win Probability of T20 Matches)

c) Create a fit with the best parameters

d) The accuracy is 72.8% and the ROC_AUC is 0.813

# Create a model with the penalty for best ROC_AUC
last_lr_mod <-
logistic_reg(penalty = 0.000530, mixture = 1) %>%

#Update the workflow with this model
last_lr_workflow <-
lr_workflow %>%
update_model(last_lr_mod)

#Create a fit
set.seed(345)
last_lr_fit <-
last_lr_workflow %>%
last_fit(splits)

#Generate accuracy, roc_auc
last_lr_fit %>%
collect_metrics()
# A tibble: 2 × 4
.metric  .estimator .estimate .config

<chr>    <chr>          <dbl> <chr>
1 accuracy binary         0.728 Preprocessor1_Model1

2 roc_auc  binary         0.813 Preprocessor1_Model1


11) Plot the feature importance

It can be seen that bowler and batsman embeddings are the most significant for the prediction followed by runRate.

runRate –

• runRate in 1st innings
• requiredRunRate in 2nd innings

12) Plot the ROC characteristics

last_lr_fit %>%
collect_predictions() %>%
roc_curve(isWinner, .pred_0) %>%
autoplot()

13) Generate a confusion matrix

14) Create a final Generalised Linear Model for Logistic Regression with the penalty of 0.000530

15) Save the model

# generate predictions from the test set
test_predictions <- last_lr_fit %>% collect_predictions()
test_predictions

# generate a confusion matrix
test_predictions %>%
conf_mat(truth = isWinner, estimate = .pred_class)

Truth
Prediction     0     1

0                  90105 32658

1                  32572 84488

final_lr_model <- fit(last_lr_workflow, df_other)

final_lr_model

obj_size(final_lr_model)
146.51 MB

butcher::weigh(final_lr_model)
A tibble: 305 × 2
object                                  size
<chr>                                  <dbl>
1 pre.actions.recipe.recipe.steps.terms1  57.9
2 pre.actions.recipe.recipe.steps.terms2  57.9
3 pre.actions.recipe.recipe.steps.terms3  57.9

cleaned_lm <- butcher::axe_env(final_lr_model, verbose = TRUE)
#✔ Memory released: "1.04 kB"
#✔ Memory released: "1.62 kB"

saveRDS(cleaned_lm, "cleanedLR.rds")


16) Compute Ball-by-ball Win Probability

• Chennai Super Kings-Lucknow Super Giants-2022-03-31

16a) The corresponding Worm-wicket graph for this match is as below

• Chennai Super Kings-Lucknow Super Giants-2022-03-31

B) Win Probability using Random Forest with player embeddings

In the 2nd approach I use Random Forest with batsman and bowler embeddings. The performance of the model with embeddings is quantum jump from the earlier performance without embeddings. However, the random forest is also computationally intensive.

a) Read the data from 9 T20 leagues (BBL, CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB) and create a single data frame of ball-by-ball data. Display the data frame

2) Create training.validation and test sets

b) Split to training, validation and test sets. The dataset is initially split into training and test in the ratio 80%:20%. The training data is again split into training and validation in the ratio 80:20

library(dplyr)
library(caret)
library(e1071)
library(ggplot2)
library(tidymodels)
library(tidymodels)
library(embed)

# Helper packages
library(vip)
library(ranger)

# Read all the 9 T20 leagues

# Bind into a single dataframe
df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)

set.seed(123)
df$isWinner = as.factor(df$isWinner)

#Split data into training, validation and test sets
splits      <- initial_split(df,prop = 0.80)
df_other <- training(splits)
df_test  <- testing(splits)
set.seed(234)
val_set <- validation_split(df_other, prop = 0.80)
val_set

2) Create a Random Forest model tuning for number of predictor nodes at each decision node (mtry) and minimum number of predictor nodes (min_n)

3) Use the ranger engine and set up for classification

4) Set up the recipe and include batsman and bowler embeddings

5) Create a workflow and add the recipe and the random forest model with the tuning parameters

# Use all 12 cores parallely
cores <- parallel::detectCores()
cores
[1] 12

# Create the random forest model with mtry and min as tuning parameters
rf_mod <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_mode("classification")

# Setup the recipe with batsman and bowler embeddings
rf_recipe <-
recipe(isWinner ~ ., data = df_other) %>%
step_embed(batsman,bowler, outcome = vars(isWinner))

# Create the random forest workflow
rf_workflow <-
workflow() %>%

rf_mod
# show what will be tuned
extract_parameter_set_dials(rf_mod)

set.seed(345)
# specify which values meant to tune

# Build the model
rf_res <-
rf_workflow %>%
tune_grid(val_set,
grid = 10,
control = control_grid(save_pred = TRUE),
metrics = metric_set(accuracy,roc_auc))

# Pick the best  roc_auc and the associated tuning parameters
rf_res %>%
show_best(metric = "roc_auc")
# A tibble: 5 × 8
mtry min_n .metric .estimator  mean     n std_err .config
<int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1     4     4 roc_auc binary     0.980     1      NA Preprocessor1_Model08
2     9     8 roc_auc binary     0.979     1      NA Preprocessor1_Model03

3     8    16 roc_auc binary     0.974     1      NA Preprocessor1_Model10
4     7    22 roc_auc binary     0.969     1      NA Preprocessor1_Model09

5     5    19 roc_auc binary     0.969     1      NA Preprocessor1_Model06

rf_res %>%
show_best(metric = "accuracy")
# A tibble: 5 × 8

mtry min_n .metric  .estimator  mean     n std_err .config
<int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>
1  4     4 accuracy binary    0.927     1      NA Preprocessor1_Model08

2  9     8 accuracy binary    0.926     1      NA Preprocessor1_Model03
3  8    16 accuracy binary    0.915     1      NA Preprocessor1_Model10
4  7    22 accuracy binary    0.906     1      NA Preprocessor1_Model09

5  5    19 accuracy binary    0.904     1      NA Preprocessor1_Model0

6) Select all models with the best roc_auc. It can be seen that the best roc_auc is 0.980 for mtry=4 and min_n=4

7) Get the model with the highest accuracy. The highest accuracy achieved is 0.927 or 92.7. This accuracy is also for mtry=4 and min_n=4

# Pick the best  roc_auc and the associated tuning parameters
rf_res %>%
show_best(metric = "roc_auc")
# A tibble: 5 × 8
mtry min_n .metric .estimator  mean     n std_err .config
<int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1     4     4 roc_auc binary     0.980     1      NA Preprocessor1_Model08
2     9     8 roc_auc binary     0.979     1      NA Preprocessor1_Model03

3     8    16 roc_auc binary     0.974     1      NA Preprocessor1_Model10
4     7    22 roc_auc binary     0.969     1      NA Preprocessor1_Model09

5     5    19 roc_auc binary     0.969     1      NA Preprocessor1_Model06

# Display the accuracy of the models in descending order and the parameters
rf_res %>%
show_best(metric = "accuracy")
# A tibble: 5 × 8

mtry min_n .metric  .estimator  mean     n std_err .config
<int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>
1  4     4 accuracy binary    0.927     1      NA Preprocessor1_Model08

2  9     8 accuracy binary    0.926     1      NA Preprocessor1_Model03
3  8    16 accuracy binary    0.915     1      NA Preprocessor1_Model10
4  7    22 accuracy binary    0.906     1      NA Preprocessor1_Model09

5  5    19 accuracy binary    0.904     1      NA Preprocessor1_Model0

8) Select the model with the best parameters for accuracy mtry=4 and min_n=4. For this the accuracy is 0.927. For this configuration the roc_auc is also the best at 0.980

9) Plot the Area Under the Curve (AUC). It can be seen that this model performs really well and it hugs the top left.

# Pick the best model
rf_best <-
rf_res %>%
select_best(metric = "accuracy")

# The best model has mtry=4 and min=4
rf_best
mtry min_n .config
<int> <int> <chr>
1     4     4      Preprocessor1_Model08

#Plot AUC
rf_auc <-
rf_res %>%
collect_predictions(parameters = rf_best) %>%
roc_curve(isWinner, .pred_0) %>%
mutate(model = "Random Forest")

autoplot(rf_auc)

10) Create the final model with the best parameters

11) Execute the final fit

12) Plot feature importance, The bowler and batsman embedding followed by perfIndex and runRate are features that contribute the most to the Win Probability

last_rf_mod <-
rand_forest(mtry = 4, min_n = 4, trees = 1000) %>%
set_engine("ranger", num.threads = cores, importance = "impurity") %>%
set_mode("classification")

# the last workflow
last_rf_workflow <-
rf_workflow %>%
update_model(last_rf_mod)

set.seed(345)
last_rf_fit <-
last_rf_workflow %>%
last_fit(splits)

last_rf_fit %>%
collect_metrics()

.metric  .estimator .estimate .config
<chr>    <chr>          <dbl> <chr>

1 accuracy binary         0.944 Preprocessor1_Model1
2 roc_auc  binary         0.988 Preprocessor1_Model1

last_rf_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 9)

13) Plot the ROC curve for the best fit

# Plot the ROC for the final model
last_rf_fit %>%
collect_predictions() %>%
roc_curve(isWinner, .pred_0) %>%
autoplot()


14) Create a confusion matrix

We can see that the number of false positives and false negatives is very low

15) Create the final fit with the Random Forest Model

# generate predictions from the test set
test_predictions <- last_rf_fit %>% collect_predictions()
test_predictions

id               .pred_0 .pred_1  .row .pred_class isWinner .config
<chr>              <dbl>   <dbl> <int> <fct>       <fct>    <chr>
1 train/test split   0.838  0.162      1 0           0       Preprocessor1_Mo…
2
train/test split   0.463  0.537     11 1           0        Preprocessor1_Mo…
3
train/test split   0.846  0.154     14 0           0        Preprocessor1_Mo…
4
train/test split   0.839  0.161     22 0           0        Preprocessor1_Mo…
5
train/test split   0.846  0.154     36 0           0        Preprocessor1_Mo…
6
train/test split   0.848  0.152     37 0           0        Preprocessor1_Mo…
7
train/test split   0.731  0.269     39 0           0        Preprocessor1_Mo…
8
train/test split   0.972  0.0281    40 0           0        Preprocessor1_Mo…
9
train/test split   0.655  0.345     42 0           0        Preprocessor1_Mo…
10
train/test split   0.662  0.338     43 0           0        Preprocessor1_Mo…

# generate a confusion matrix
test_predictions %>%
conf_mat(truth = isWinner, estimate = .pred_class)

Truth
Prediction      0      1

0 116576   7096

1   6391 109760

# Create the final model
final_model <- fit(last_rf_workflow, df_other)



16) Computing Win Probability with Random Forest Model for match

• Pakistan-India-2022-10-23

17) Worm -wicket graph of match

• Pakistan-India-2022-10-23

C) Win Probability using MLP – Deep Neural Network (DNN) with player embeddings

In this approach the MLP package of Tidymodels was used. Multi-layer perceptron (MLP) with Deep Neural Network (DNN) was used to compute the Win Probability using player embeddings. An accuracy of 0.76 was obtained

a) Read the data from 9 T20 leagues (BBL, CPL, IPL, NTB, PSL, SSM, T20 Men, T20 Women, WBB) and create a single data frame of ball-by-ball data. Display the data frame

2) Create training.validation and test sets

b) Split to training, validation and test sets. The dataset is initially split into training and test in the ratio 80%:20%. The training data is again split into training and validation in the ratio 80:20

library(dplyr)
library(caret)
library(e1071)
library(ggplot2)
library(tidymodels)
library(embed)

# Helper packages
library(vip)
library(ranger)

df=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9)

set.seed(123)
df$isWinner = as.factor(df$isWinner)
splits      <- initial_split(df,prop = 0.80)
df_other <- training(splits)
df_test  <- testing(splits)
set.seed(234)
val_set <- validation_split(df_other,
prop = 0.80)
val_set



3) Create a Deep Neural Network recipe

• Normalize parameters
• Add embeddings for batsman, bowler

4) Set the MLP-DNN hyperparameters

• epochs=100
• hidden units =5
• dropout regularization =0.1

5) Fit on Training data

cores <- parallel::detectCores()
cores

nn_recipe <-
recipe(isWinner ~ ., data = df_other) %>%
step_normalize(ballNum,ballsRemaining,runs,runRate,numWickets,runsMomentum,perfIndex) %>%
step_embed(batsman,bowler, outcome = vars(isWinner)) %>%
prep(training = df_other, retain = TRUE)

# For validation:
test_normalized <- bake(nn_recipe, new_data = df_test)

set.seed(57974)
# Set the hyper parameters for DNN
# Use Keras
# Fit on training data
nnet_fit <-
mlp(epochs = 100, hidden_units = 5, dropout = 0.1) %>%
set_mode("classification") %>%
# Also set engine-specific verbose argument to prevent logging the results:
set_engine("keras", verbose = 0) %>%
fit(isWinner ~ ., data = bake(nn_recipe, new_data = df_other))

nnet_fit
parsnip model object
Model:"sequential"

____________________________________________________________________________

Layer (type)                                           Output Shape                                    Param #
============================================================================
dense (Dense)                                           (None, 5)                                          60
____________________________________________________________________________

dense_1 (Dense)                                         (None, 5)                                          30
____________________________________________________________________________
dropout (Dropout)                                       (None, 5)                                          0
____________________________________________________________________________
dense_2 (Dense)                                         (None, 2)                                          12
============================================================================
Total params: 102
Trainable params: 102
Non-trainable params: 0


6) Test on Test data

• Check ROC_AUC. It is 0.854
• Check accuracy. The MLP-DNN gives a decent performance with an acuracy of 0.76
• Compute the Confusion Matrix
# Validate on test data
val_results <-
df_test %>%
bind_cols(
predict(nnet_fit, new_data = test_normalized),
predict(nnet_fit, new_data = test_normalized, type = "prob")
)
val_results

# Check roc_auc
val_results %>% roc_auc(truth = isWinner, .pred_0)
.metric .estimator .estimate

<chr>   <chr>          <dbl>
1 roc_auc binary         0.854

# Check accuracy
val_results %>% accuracy(truth = isWinner, .pred_class)
.metric  .estimator .estimate
<chr>    <chr>          <dbl>
1 accuracy binary         0.762

# Display confusion matrix
val_results %>% conf_mat(truth = isWinner, .pred_class)
Truth
Prediction
0     1
0 97419 31564
1 25548 85292

Conclusion

1. Of the 3 ML models, glmnet, random forest and Multi-layer Perceptron DNN, random forest had the best performance
2. Random Forest ML model with batsman, bowler embeddings was able to achieve an accuracy of 92.4% and a ROC_AUC of 0.98 with very low false positives, negatives. This was a quantum jump from my earlier random forest model without embeddings which had an accuracy of 73.7% and an ROC_AUC of 0.834
3. The glmnet and NN models are fairly light weight. Random Forest is computationally very intensive.

Check out my other posts

To see all posts click Index of posts

# My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 4,5

This is the next set of presentations on “Elements of Neural Networks and Deep Learning”.  In the 4th presentation I discuss and derive the generalized equations for a multi-unit, multi-layer Deep Learning network.  The 5th presentation derives the equations for a Deep Learning network when performing multi-class classification along with the derivations for cross-entropy loss. The corresponding implementations are available in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles:Second edition- In vectorized Python, R and Octave

Important note: Do check out my later version of these videos at Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8 . These have more content and also include some corrections. Check it out!

1. Elements of Neural Network and Deep Learning – Part 4
This presentation is a continuation of my 3rd presentation in which I derived the equations for a simple 3 layer Neural Network with 1 hidden layer. In this video presentation, I discuss step-by-step the derivations for a L-Layer, multi-unit Deep Learning Network, with any activation function g(z)

The implementations of L-Layer, multi-unit Deep Learning Network in vectorized R, Python and Octave are available in my post Deep Learning from first principles in Python, R and Octave – Part 3

2. Elements of Neural Network and Deep Learning – Part 5
This presentation discusses multi-class classification using the Softmax function. The detailed derivation for the Jacobian of the Softmax is discussed, and subsequently the derivative of cross-entropy loss is also discussed in detail. Finally the final set of equations for a Neural Network with multi-class classification is derived.

The corresponding implementations in vectorized R, Python and Octave are available in the following posts
a. Deep Learning from first principles in Python, R and Octave – Part 4
b. Deep Learning from first principles in Python, R and Octave – Part 5

To be continued. Watch this space!

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

To see all posts click Index of Posts

# My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

Are you wondering whether to get into the ‘R’ bus or ‘Python’ bus?
My suggestion is to you is “Why not get into the ‘R and Python’ train?”

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($8.99/Rs449) versions.  In the third edition all code sections have been re-formatted to use the fixed width font ‘Consolas’. This neatly organizes output which have columns like confusion matrix, dataframes etc to be columnar, making the code more readable.  There is a science to formatting too!! which improves the look and feel. It is little wonder that Steve Jobs had a keen passion for calligraphy! Additionally some typos have been fixed.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical machine with R and Python: Third Edition – Machine Learning in Stereo(Paperback-$12.99) 2. Practical machine with R and Python Third Edition – Machine Learning in Stereo(Kindle-$8.99/Rs449)

This book is ideal both for beginners and the experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally,those who are proficient in both languages, can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Hope you have a great time learning as I did while implementing these algorithms!

# My book ‘Deep Learning from first principles:Second Edition’ now on Amazon

The second edition of my book ‘Deep Learning from first principles:Second Edition- In vectorized Python, R and Octave’, is now available on Amazon, in both paperback ($18.99) and kindle ($9.99/Rs449/-)  versions. Since this book is almost 70% code, all functions, and code snippets have been formatted to use the fixed-width font ‘Lucida Console’. In addition line numbers have been added to all code snippets. This makes the code more organized and much more readable. I have also fixed typos in the book

The book includes the following chapters

Table of Contents
Preface 4
Introduction 6
1. Logistic Regression as a Neural Network 8
2. Implementing a simple Neural Network 23
3. Building a L- Layer Deep Learning Network 48
4. Deep Learning network with the Softmax 85
5. MNIST classification with Softmax 103
6. Initialization, regularization in Deep Learning 121
7. Gradient Descent Optimization techniques 167
8. Gradient Check in Deep Learning 197
1. Appendix A 214
2. Appendix 1 – Logistic Regression as a Neural Network 220
3. Appendix 2 - Implementing a simple Neural Network 227
4. Appendix 3 - Building a L- Layer Deep Learning Network 240
5. Appendix 4 - Deep Learning network with the Softmax 259
6. Appendix 5 - MNIST classification with Softmax 269
7. Appendix 6 - Initialization, regularization in Deep Learning 302
8. Appendix 7 - Gradient Descent Optimization techniques 344
9. Appendix 8 – Gradient Check 405
References 475

To see posts click Index of Posts

# My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon

Note: The 3rd edition of this book is now available My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($9.99/Rs449) versions.  This second edition includes more content,  extensive comments and formatting for better readability.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical machine with R and Python: Third Edition – Machine Learning in Stereo(Paperback-$12.99) 2. Practical machine with R and Third Edition – Machine Learning in Stereo(Kindle-$9.99/Rs449)

This book is ideal both for beginners and the experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally,those who are proficient in both languages, can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Hope you have a great time learning as I did while implementing these algorithms!

# My book “Deep Learning from first principles” now on Amazon

Note: The 2nd edition of this book is now available on Amazon

My 4th book(self-published), “Deep Learning from first principles – In vectorized Python, R and Octave” (557 pages), is now available on Amazon in both paperback ($18.99) and kindle ($9.99/Rs449). The book starts with the most primitive 2-layer Neural Network and works  its way to a generic L-layer Deep Learning Network, with all the bells and whistles.  The book includes detailed derivations and vectorized implementations in Python, R and Octave.  The code has been extensively  commented and has been included in the Appendix section.

# Deep Learning from first principles in Python, R and Octave – Part 5

## Introduction

a. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
b. A robot must obey orders given it by human beings except where such orders would conflict with the First Law.
c. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

      Isaac Asimov's Three Laws of Robotics 

Any sufficiently advanced technology is indistinguishable from magic.

      Arthur C Clarke.   

In this 5th part on Deep Learning from first Principles in Python, R and Octave, I solve the MNIST data set of handwritten digits (shown below), from the basics. To do this, I construct a L-Layer, vectorized Deep Learning implementation in Python, R and Octave from scratch and classify the  MNIST data set. The MNIST training data set  contains 60000 handwritten digits from 0-9, and a test set of 10000 digits. MNIST, is a popular dataset for running Deep Learning tests, and has been rightfully termed as the ‘drosophila’ of Deep Learning, by none other than the venerable Prof Geoffrey Hinton.

The ‘Deep Learning from first principles in Python, R and Octave’ series, so far included  Part 1 , where I had implemented logistic regression as a simple Neural Network. Part 2 implemented the most elementary neural network with 1 hidden layer, but  with any number of activation units in that layer, and a sigmoid activation at the output layer.

This post, ‘Deep Learning from first principles in Python, R and Octave – Part 5’ largely builds upon Part3. in which I implemented a multi-layer Deep Learning network, with an arbitrary number of hidden layers and activation units per hidden layer and with the output layer was based on the sigmoid unit, for binary classification. In Part 4, I derive the Jacobian of a Softmax, the Cross entropy loss and the gradient equations for a multi-class Softmax classifier. I also  implement a simple Neural Network using Softmax classifications in Python, R and Octave.

In this post I combine Part 3 and Part 4 to to build a L-layer Deep Learning network, with arbitrary number of hidden layers and hidden units, which can do both binary (sigmoid) and multi-class (softmax) classification.

Note: A detailed discussion of the derivation for multi-class clasification can be seen in my video presentation Neural Networks 5

The generic, vectorized L-Layer Deep Learning Network implementations in Python, R and Octave can be cloned/downloaded from GitHub at DeepLearning-Part5. This implementation allows for arbitrary number of hidden layers and hidden layer units. The activation function at the hidden layers can be one of sigmoid, relu and tanh (will be adding leaky relu soon). The output activation can be used for binary classification with the ‘sigmoid’, or multi-class classification with ‘softmax’. Feel free to download and play around with the code!

I thought the exercise of combining the two parts(Part 3, & Part 4)  would be a breeze. But it was anything but. Incorporating a Softmax classifier into the generic L-Layer Deep Learning model was a challenge. Moreover I found that I could not use the gradient descent on 60,000 training samples as my laptop ran out of memory. So I had to implement Stochastic Gradient Descent (SGD) for Python, R and Octave. In addition, I had to also implement the numerically stable version of Softmax, as the softmax and its derivative would result in NaNs.

### Numerically stable Softmax

The Softmax function $S_{j} =\frac{e^{Z_{j}}}{\sum_{i}^{k}e^{Z_{i}}}$ can be numerically unstable because of the division of large exponentials.  To handle this problem we have to implement stable Softmax function as below

$S_{j} =\frac{e^{Z_{j}}}{\sum_{i}^{k}e^{Z_{i}}}$
$S_{j} =\frac{e^{Z_{j}}}{\sum_{i}^{k}e^{Z_{i}}} = \frac{Ce^{Z_{j}}}{C\sum_{i}^{k}e^{Z_{i}}} = \frac{e^{Z_{j}+log(C)}}{\sum_{i}^{k}e^{Z_{i}+log(C)}}$
Therefore $S_{j} = \frac{e^{Z_{j}+ D}}{\sum_{i}^{k}e^{Z_{i}+ D}}$
Here ‘D’ can be anything. A common choice is
$D=-max(Z_{1},Z_{2},... Z_{k})$

Here is the stable Softmax implementation in Python

# A numerically stable Softmax implementation
def stableSoftmax(Z):
#Compute the softmax of vector x in a numerically stable way.
shiftZ = Z.T - np.max(Z.T,axis=1).reshape(-1,1)
exp_scores = np.exp(shiftZ)
# normalize them for each example
A = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
cache=Z
return A,cache


While trying to create a L-Layer generic Deep Learning network in the 3 languages, I found it useful to ensure that the model executed correctly on smaller datasets.  You can run into numerous problems while setting up the matrices, which becomes extremely difficult to debug. So in this post, I run the model on 2 smaller data for sets used in my earlier posts(Part 3 & Part4) , in each of the languages, before running the generic model on MNIST.

Here is a fair warning. if you think you can dive directly into Deep Learning, with just some basic knowledge of Machine Learning, you are bound to run into serious issues. Moreover, your knowledge will be incomplete. It is essential that you have a good grasp of Machine and Statistical Learning, the different algorithms, the measures and metrics for selecting the models etc.It would help to be conversant with all the ML models, ML concepts, validation techniques, classification measures  etc. Check out the internet/books for background.

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

### 1. Random dataset with Sigmoid activation – Python

This random data with 9 clusters, was used in my post Deep Learning from first principles in Python, R and Octave – Part 3 , and was used to test the complete L-layer Deep Learning network with Sigmoid activation.

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification, make_blobs
exec(open("DLfunctions51.py").read()) # Cannot import in Rmd.
# Create a random data set with 9 centeres
X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9,cluster_std = 1.3, random_state =4)

#Create 2 classes
Y1=Y1.reshape(400,1)
Y1 = Y1 % 2
X2=X1.T
Y2=Y1.T
# Set the dimensions of L -layer DL network
layersDimensions = [2, 9, 9,1] #  4-layer model
# Execute DL network with hidden activation=relu and sigmoid output function
parameters = L_Layer_DeepModel(X2, Y2, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid",learningRate = 0.3,num_iterations = 2500, print_cost = True)

### 2. Spiral dataset with Softmax activation – Python

The Spiral data was used in my post Deep Learning from first principles in Python, R and Octave – Part 4 and was used to test the complete L-layer Deep Learning network with multi-class Softmax activation at the output layer

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification, make_blobs

# Create an input data set - Taken from CS231n Convolutional Neural networks
# http://cs231n.github.io/neural-networks-case-study/
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
ix = range(N*j,N*(j+1))
t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
y[ix] = j

X1=X.T
Y1=y.reshape(-1,1).T
numHidden=100 # No of hidden units in hidden layer
numFeats= 2 # dimensionality
numOutput = 3 # number of classes
# Set the dimensions of the layers
layersDimensions=[numFeats,numHidden,numOutput]
parameters = L_Layer_DeepModel(X1, Y1, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax",learningRate = 0.6,num_iterations = 9000, print_cost = True)
## Cost after iteration 0: 1.098759
## Cost after iteration 1000: 0.112666
## Cost after iteration 2000: 0.044351
## Cost after iteration 3000: 0.027491
## Cost after iteration 4000: 0.021898
## Cost after iteration 5000: 0.019181
## Cost after iteration 6000: 0.017832
## Cost after iteration 7000: 0.017452
## Cost after iteration 8000: 0.017161

### 3. MNIST dataset with Softmax activation – Python

In the code below, I execute Stochastic Gradient Descent on the MNIST training data of 60000. I used a mini-batch size of 1000. Python takes about 40 minutes to crunch the data. In addition I also compute the Confusion Matrix and other metrics like Accuracy, Precision and Recall for the MNIST data set. I get an accuracy of 0.93 on the MNIST test set. This accuracy can be improved by choosing more hidden layers or more hidden units and possibly also tweaking the learning rate and the number of epochs.

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import math
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Read the MNIST training and test sets
# Create labels and pixel arrays
lbls=[]
pxls=[]
print(len(training))
#for i in range(len(training)):
for i in range(60000):
l,p=training[i]
lbls.append(l)
pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T
# Set the dimensions of the layers. The MNIST data is 28x28 pixels= 784
# Hence input layer is 784. For the 10 digits the Softmax classifier
# has to handle 10 outputs
layersDimensions=[784, 15,9,10] # Works very well,lr=0.01,mini_batch =1000, total=20000
np.random.seed(1)
costs = []
# Run Stochastic Gradient Descent with Learning Rate=0.01, mini batch size=1000
# number of epochs=3000
parameters = L_Layer_DeepModel_SGD(X1, Y1, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax",learningRate = 0.01 ,mini_batch_size =1000, num_epochs = 3000, print_cost = True)

# Compute the Confusion Matrix on Training set
# Compute the training accuracy, precision and recall
proba=predict_proba(parameters, X1,outputActivationFunc="softmax")
#A2, cache = forwardPropagationDeep(X1, parameters)
#proba=np.argmax(A2, axis=0).reshape(-1,1)
a=confusion_matrix(Y1.T,proba)
print(a)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy: {:.2f}'.format(accuracy_score(Y1.T, proba)))
print('Precision: {:.2f}'.format(precision_score(Y1.T, proba,average="micro")))
print('Recall: {:.2f}'.format(recall_score(Y1.T, proba,average="micro")))

lbls=[]
pxls=[]
print(len(test))
for i in range(10000):
l,p=test[i]
lbls.append(l)
pxls.append(p)
testLabels= np.array(lbls)
testPixels=np.array(pxls)
ytest=testLabels.reshape(-1,1)
Xtest=testPixels.reshape(testPixels.shape[0],-1)
X1test=Xtest.T
Y1test=ytest.T

# Compute the Confusion Matrix on Test set
# Compute the test accuracy, precision and recall
probaTest=predict_proba(parameters, X1test,outputActivationFunc="softmax")
#A2, cache = forwardPropagationDeep(X1, parameters)
#proba=np.argmax(A2, axis=0).reshape(-1,1)
a=confusion_matrix(Y1test.T,probaTest)
print(a)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy: {:.2f}'.format(accuracy_score(Y1test.T, probaTest)))
print('Precision: {:.2f}'.format(precision_score(Y1test.T, probaTest,average="micro")))
print('Recall: {:.2f}'.format(recall_score(Y1test.T, probaTest,average="micro")))

##1.  Confusion Matrix of Training set
0     1    2    3    4    5    6    7    8    9
## [[5854    0   19    2   10    7    0    1   24    6]
##  [   1 6659   30   10    5    3    0   14   20    0]
##  [  20   24 5805   18    6   11    2   32   37    3]
##  [   5    4  175 5783    1   27    1   58   60   17]
##  [   1   21    9    0 5780    0    5    2   12   12]
##  [  29    9   21  224    6 4824   18   17  245   28]
##  [   5    4   22    1   32   12 5799    0   43    0]
##  [   3   13  148  154   18    3    0 5883    4   39]
##  [  11   34   30   21   13   16    4    7 5703   12]
##  [  10    4    1   32  135   14    1   92  134 5526]]

##2. Accuracy, Precision, Recall of  Training set
## Accuracy: 0.96
## Precision: 0.96
## Recall: 0.96

##3. Confusion Matrix of Test set
0     1    2    3    4    5    6    7    8    9
## [[ 954    1    8    0    3    3    2    4    4    1]
##  [   0 1107    6    5    0    0    1    2   14    0]
##  [  11    7  957   10    5    0    5   20   16    1]
##  [   2    3   37  925    3   13    0    8   18    1]
##  [   2    6    1    1  944    0    7    3    4   14]
##  [  12    5    4   45    2  740   24    8   42   10]
##  [   8    4    4    2   16    9  903    0   12    0]
##  [   4   10   27   18    5    1    0  940    1   22]
##  [  11   13    6   13    9   10    7    2  900    3]
##  [   8    5    1    7   50    7    0   20   29  882]]
##4. Accuracy, Precision, Recall of  Training set
## Accuracy: 0.93
## Precision: 0.93
## Recall: 0.93

### 4. Random dataset with Sigmoid activation – R code

This is the random data set used in the Python code above which was saved as a CSV. The code is used to test a L -Layer DL network with Sigmoid Activation in R.

source("DLfunctions5.R")
# Read the random data set
x <- z[,1:2]
y <- z[,3]
X <- t(x)
Y <- t(y)
# Set the dimensions of the  layer
layersDimensions = c(2, 9, 9,1)

# Run Gradient Descent on the data set with relu hidden unit activation
# sigmoid activation unit in the output layer
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.3,
numIterations = 5000,
print_cost = True)
#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvals$costs df=data.frame(iterations,costs) ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs iterations") + xlab("Iterations") + ylab("Loss") ### 5. Spiral dataset with Softmax activation – R The spiral data set used in the Python code above, is reused to test multi-class classification with Softmax. source("DLfunctions5.R") Z <- as.matrix(read.csv("spiral.csv",header=FALSE)) # Setup the data X <- Z[,1:2] y <- Z[,3] X <- t(X) Y <- t(y) # Initialize number of features, number of hidden units in hidden layer and # number of classes numFeats<-2 # No features numHidden<-100 # No of hidden units numOutput<-3 # No of classes # Set the layer dimensions layersDimensions = c(numFeats,numHidden,numOutput) # Perform gradient descent with relu activation unit for hidden layer # and softmax activation in the output retvals = L_Layer_DeepModel(X, Y, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.5, numIterations = 9000, print_cost = True) #Plot cost vs iterations iterations <- seq(0,9000,1000) costs=retvals$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs iterations") + xlab("Iterations") + ylab("Costs")

### 6. MNIST dataset with Softmax activation – R

The code below executes a L – Layer Deep Learning network with Softmax output activation, to classify the 10 handwritten digits from MNIST with Stochastic Gradient Descent. The entire 60000 data set was used to train the data. R takes almost 8 hours to process this data set with a mini-batch size of 1000.  The use of ‘for’ loops is limited to iterating through epochs, mini batches and for creating the mini batches itself. All other code is vectorized. Yet, it seems to crawl. Most likely the use of ‘lists’ in R, to return multiple values is performance intensive. Some day, I will try to profile the code, and see where the issue is. However the code works!

Having said that, the Confusion Matrix in R dumps a lot of interesting statistics! There is a bunch of statistical measures for each class. For e.g. the Balanced Accuracy for the digits ‘6’ and ‘9’ is around 50%. Looks like, the classifier is confused by the fact that 6 is inverted 9 and vice-versa. The accuracy on the Test data set is just around 75%. I could have played around with the number of layers, number of hidden units, learning rates, epochs etc to get a much higher accuracy. But since each test took about 8+ hours, I may work on this, some other day!

source("DLfunctions5.R")
source("mnist.R")
show_digit(train$x[2,]) #Set the layer dimensions layersDimensions=c(784, 15,9, 10) # Works at 1500 x <- t(train$x)
X <- x[,1:60000]
y <-train$y y1 <- y[1:60000] y2 <- as.matrix(y1) Y=t(y2) # Subset 32768 random samples from MNIST permutation = c(sample(2^15)) # Randomly shuffle the training data X1 = X[, permutation] y1 = Y[1, permutation] y2 <- as.matrix(y1) Y1=t(y2) # Execute Stochastic Gradient Descent on the entire training set # with Softmax activation retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.05, mini_batch_size = 512, num_epochs = 1, print_cost = True)  # Compute the Confusion Matrix library(caret) library(e1071) predictions=predictProba(retvalsSGD[['parameters']], X,hiddenActivationFunc='relu', outputActivationFunc="softmax") confusionMatrix(predictions,Y) # Confusion Matrix on the Training set > confusionMatrix(predictions,Y) Confusion Matrix and Statistics Reference Prediction 0 1 2 3 4 5 6 7 8 9 0 5738 1 21 5 16 17 7 15 9 43 1 5 6632 21 24 25 3 2 33 13 392 2 12 32 5747 106 25 28 3 27 44 4779 3 0 27 12 5715 1 21 1 20 1 13 4 10 5 21 18 5677 9 17 30 15 166 5 142 21 96 136 93 5306 5884 43 60 413 6 0 0 0 0 0 0 0 0 0 0 7 6 9 13 13 3 4 0 6085 0 55 8 8 12 7 43 1 32 2 7 5703 69 9 2 3 20 71 1 1 2 5 6 19 Overall Statistics Accuracy : 0.777 95% CI : (0.7737, 0.7804) No Information Rate : 0.1124 P-Value [Acc > NIR] : < 2.2e-16 Kappa : 0.7524 Mcnemar's Test P-Value : NA Statistics by Class: Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Sensitivity 0.96877 0.9837 0.96459 0.93215 0.97176 0.97879 0.00000 Specificity 0.99752 0.9903 0.90644 0.99822 0.99463 0.87380 1.00000 Pos Pred Value 0.97718 0.9276 0.53198 0.98348 0.95124 0.43513 NaN Neg Pred Value 0.99658 0.9979 0.99571 0.99232 0.99695 0.99759 0.90137 Prevalence 0.09872 0.1124 0.09930 0.10218 0.09737 0.09035 0.09863 Detection Rate 0.09563 0.1105 0.09578 0.09525 0.09462 0.08843 0.00000 Detection Prevalence 0.09787 0.1192 0.18005 0.09685 0.09947 0.20323 0.00000 Balanced Accuracy 0.98314 0.9870 0.93551 0.96518 0.98319 0.92629 0.50000 Class: 7 Class: 8 Class: 9 Sensitivity 0.9713 0.97471 0.0031938 Specificity 0.9981 0.99666 0.9979464 Pos Pred Value 0.9834 0.96924 0.1461538 Neg Pred Value 0.9967 0.99727 0.9009521 Prevalence 0.1044 0.09752 0.0991500 Detection Rate 0.1014 0.09505 0.0003167 Detection Prevalence 0.1031 0.09807 0.0021667 Balanced Accuracy 0.9847 0.98568 0.5005701  # Confusion Matrix on the Training set xtest <- t(test$x) Xtest <- xtest[,1:10000] ytest <-test\$y ytest1 <- ytest[1:10000] ytest2 <- as.matrix(ytest1) Ytest=t(ytest2)

Confusion Matrix and Statistics

Reference
Prediction    0    1    2    3    4    5    6    7    8    9
0  950    2    2    3    0    6    9    4    7    6
1    3 1110    4    2    9    0    3   12    5   74
2    2    6  965   21    9   14    5   16   12  789
3    1    2    9  908    2   16    0   21    2    6
4    0    1    9    5  938    1    8    6    8   39
5   19    5   25   35   20  835  929    8   54   67
6    0    0    0    0    0    0    0    0    0    0
7    4    4    7   10    2    4    0  952    5    6
8    1    5    8   14    2   16    2    3  876   21
9    0    0    3   12    0    0    2    6    5    1

Overall Statistics

Accuracy : 0.7535
95% CI : (0.7449, 0.7619)
No Information Rate : 0.1135
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.7262
Mcnemar's Test P-Value : NA

Statistics by Class:

Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
Sensitivity            0.9694   0.9780   0.9351   0.8990   0.9552   0.9361   0.0000
Specificity            0.9957   0.9874   0.9025   0.9934   0.9915   0.8724   1.0000
Pos Pred Value         0.9606   0.9083   0.5247   0.9390   0.9241   0.4181      NaN
Neg Pred Value         0.9967   0.9972   0.9918   0.9887   0.9951   0.9929   0.9042
Prevalence             0.0980   0.1135   0.1032   0.1010   0.0982   0.0892   0.0958
Detection Rate         0.0950   0.1110   0.0965   0.0908   0.0938   0.0835   0.0000
Detection Prevalence   0.0989   0.1222   0.1839   0.0967   0.1015   0.1997   0.0000
Balanced Accuracy      0.9825   0.9827   0.9188   0.9462   0.9733   0.9043   0.5000
Class: 7 Class: 8  Class: 9
Sensitivity            0.9261   0.8994 0.0009911
Specificity            0.9953   0.9920 0.9968858
Pos Pred Value         0.9577   0.9241 0.0344828
Neg Pred Value         0.9916   0.9892 0.8989068
Prevalence             0.1028   0.0974 0.1009000
Detection Rate         0.0952   0.0876 0.0001000
Detection Prevalence   0.0994   0.0948 0.0029000
Balanced Accuracy      0.9607   0.9457 0.4989384


### 7. Random dataset with Sigmoid activation – Octave

The Octave code below uses the random data set used by Python. The code below implements a L-Layer Deep Learning with Sigmoid Activation.


source("DL5functions.m")

X=data(:,1:2);
Y=data(:,3);
#Set the layer dimensions
layersDimensions = [2 9 7  1]; #tanh=-0.5(ok), #relu=0.1 best!
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.1,
numIterations = 10000);
# Plot cost vs iterations
plotCostVsIterations(10000,costs);


### 8. Spiral dataset with Softmax activation – Octave

The  code below uses the spiral data set used by Python above. The code below implements a L-Layer Deep Learning with Softmax Activation.

# Read the data

# Setup the data
X=data(:,1:2);
Y=data(:,3);

# Set the number of features, number of hidden units in hidden layer and number of classess
numFeats=2; #No features
numHidden=100; # No of hidden units
numOutput=3; # No of  classes
# Set the layer dimensions
layersDimensions = [numFeats numHidden  numOutput];
#Perform gradient descent with softmax activation unit
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.1,
numIterations = 10000);


### 9. MNIST dataset with Softmax activation – Octave

The code below implements a L-Layer Deep Learning Network in Octave with Softmax output activation unit, for classifying the 10 handwritten digits in the MNIST dataset. Unfortunately, Octave can only index to around 10000 training at a time,  and I was getting an error ‘error: out of memory or dimension too large for Octave’s index type error: called from…’, when I tried to create a batch size of 20000.  So I had to come with a work around to create a batch size of 10000 (randomly) and then use a mini-batch of 1000 samples and execute Stochastic Gradient Descent. The performance was good. Octave takes about 15 minutes, on a batch size of 10000 and a mini batch of 1000.

I thought if the performance was not good, I could iterate through these random batches and refining the gradients as follows

# Pseudo code that could be used since Octave only allows 10K batches
# at a time
# Randomly create weights
[weights biases] = initialize_weights()
for i=1:k
# Create a random permutation and create a random batch
permutation = randperm(10000);
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Compute weights from SGD and update weights in the next batch update
[weights biases costs]=L_Layer_DeepModel_SGD(X,Y,mini_bactch=1000,weights, biases,...);
...
endfor
# Load the MNIST data
#Create a random permutatation from 60K
permutation = randperm(10000);
disp(length(permutation));

# Use this 10K as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);

# Set layer dimensions
layersDimensions=[784, 15, 9, 10];

# Run Stochastic Gradient descent with batch size=10K and mini_batch_size=1000
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.01,
mini_batch_size = 2000, num_epochs = 5000);


#### 9. Final thoughts

Here are some of my final thoughts after working on Python, R and Octave in this series and in other projects
1. Python, with its highly optimized numpy library, is ideally suited for creating Deep Learning Models, which have a lot of matrix manipulations. Python is a real workhorse when it comes to Deep Learning computations.
2. R is somewhat clunky in comparison to its cousin Python in handling matrices or in returning multiple values. But R’s statistical libraries, dplyr, and ggplot are really superior to the Python peers. Also, I find R handles  dataframes,  much better than Python.
3. Octave is a no-nonsense,minimalist language which is very efficient in handling matrices. It is ideally suited for implementing Machine Learning and Deep Learning from scratch. But Octave has its problems and cannot handle large matrix sizes, and also lacks the statistical libaries of R and Python. They possibly exist in its sibling, Matlab

#### Conclusion

Building a Deep Learning Network from scratch is quite challenging, time-consuming but nevertheless an exciting task.  While the statements in the different languages for manipulating matrices, summing up columns, finding columns which have ones don’t take more than a single statement, extreme care has to be taken to ensure that the statements work well for any dimension.  The lessons learnt from creating L -Layer Deep Learning network  are many and well worth it. Give it a try!

Hasta la vista! I’ll be back, so stick around!
Watch this space!

To see all posts click Index of Posts

# Presentation on ‘Machine Learning in plain English – Part 2’

This presentation is a continuation of my earlier presentation Presentation on ‘Machine Learning in plain English – Part 1’. As the title suggests, the presentation is devoid of any math or programming constructs, and just focuses on the concepts and approaches to different Machine Learning algorithms. In this 2nd part, I discuss KNN regression, KNN classification, Cross Validation techniques like (LOOCV, K-Fold)   feature selection methods including best-fit,forward-fit and backward fit and finally Ridge (L2) and Lasso Regression (L1)

If you would like to see the implementations of the discussed algorithms, in this presentation, do check out my book   My book ‘Practical Machine Learning with R and Python’ on Amazon

To see all post click Index of posts