Using embeddings, collaborative filtering with Deep Learning to analyse T20 players

There is a school of thought that total runs scored and strike rate for a batsman, or total wickets taken and economy rate for a bowler, do not tell the whole story. This is true to a fair extent. The runs could have been scored, or the wickets taken, against weaker teams, so these aggregates alone do not capture how a batsman or bowler performs against a given quality of opposition. Collaborative filtering offers a way to estimate a batsman’s performance against different bowlers, including bowlers he/she has not yet faced. Collaborative filtering with embeddings can also be used to group players with similar characteristics. In the same way, we can estimate the performance of bowlers against different batsmen. Hence we need to look at average runs and strike rate (SR), or total wickets and economy rate (ER), through the lens of how batsmen and bowlers perform against similar opposition. This is where collaborative filtering is useful.

The table below shows the performance of all batsmen against all bowlers. Each row is a batsman and each column is a bowler, and the value in the cell is the total runs scored by the batsman against the bowler in all matches. Note that the value is 0 where a batsman has not yet faced a specific bowler. The table is fairly sparse.

Table A

Similarly, we can compute the performance of all bowlers against all batsmen, as in the table below. Here the row is the bowler, the column is the batsman, and the value in the cell is the number of times the bowler took the batsman’s wicket. As before, the data is sparsely populated.
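Because most cells in such a table are empty, it can help to store only the populated cells as (row, column, value) triples. The following is a minimal sketch of that idea; the player names and run values are hypothetical toy data, not taken from the Cricsheet files, and a 0 is treated as "has not faced", as in the tables described above.

```python
# Hypothetical toy data: rows are batsmen, columns are bowlers, cells are total runs.
batsmen = ["Batsman A", "Batsman B", "Batsman C"]
bowlers = ["Bowler X", "Bowler Y", "Bowler Z", "Bowler W"]
runs_table = [
    [12, 0, 0, 7],
    [0, 0, 31, 0],
    [0, 5, 0, 0],
]


def to_triples(table, row_names, col_names):
    """Keep only the non-zero cells as (row, column, value) triples."""
    return [
        (row_names[i], col_names[j], value)
        for i, row in enumerate(table)
        for j, value in enumerate(row)
        if value != 0
    ]


triples = to_triples(runs_table, batsmen, bowlers)
total_cells = len(batsmen) * len(bowlers)
density = len(triples) / total_cells

for triple in triples:
    print(triple)
print(f"{len(triples)} of {total_cells} cells are populated ({density:.0%})")
```

Only the triples need to be fed to a model, which is why the sparse table does not have to be materialised in full.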

This problem of computing a batsman’s performance against bowlers, or vice versa, is identical to the user vs movie rating problem used in collaborative filtering. For example, we could consider

The problem depicted above can be solved using collaborative filtering with embeddings. We could assign sequential numbers to the batsmen from 1 to M, and to the bowlers from 1 to N. The total runs scored need only be represented for the cells where values exist. One way to encode the players for Machine Learning is One Hot Encoding (OHE), where every batsman and every bowler gets its own 0/1 column and each populated cell of the table becomes one row of mostly zeros. But this would take an enormous amount of computation time and memory. The solution is to use vector embeddings. Here embeddings can be used to capture the sparse relationship between batsmen, bowlers and runs scored, or vice versa between bowlers, batsmen and wickets taken. We only need to consider the cells for which values exist. An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. An embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space.
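To make the difference between One Hot Encoding and an embedding concrete, here is a small sketch showing that multiplying a one-hot vector by a weight table gives the same result as simply looking up one row of that table. The sizes and the random values are assumptions for illustration; in a real model the table entries are learned during training.

```python
import random

random.seed(0)

num_batsmen = 6    # hypothetical; the real data has many more players
embedding_dim = 3  # hypothetical embedding width

# Embedding table: one small dense vector per batsman (randomly initialised here).
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(num_batsmen)
]


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def one_hot_times_table(index, table):
    """Multiply the one-hot row vector by the table: work grows with the number of players."""
    vec = one_hot(index, len(table))
    dim = len(table[0])
    return [sum(vec[i] * table[i][d] for i in range(len(table))) for d in range(dim)]


def embedding_lookup(index, table):
    """Pick the row directly: no one-hot vector is ever built."""
    return table[index]


batsman_index = 4
via_one_hot = one_hot_times_table(batsman_index, embedding_table)
via_lookup = embedding_lookup(batsman_index, embedding_table)
print(via_one_hot)
print(via_lookup)
```

Both calls return the same vector, but the lookup avoids building and multiplying the long one-hot vector, which is where the savings in memory and computation come from.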

a) To compute bowler performances and identify similarities between bowlers, the following embedding in the Deep Learning network was used

b) To compute batsman similarities, a similar Deep Learning network for bowler vs batsman is used

I had earlier written another post, Player Performance Estimation using AI Collaborative Filtering, for batsman and bowler recommendation using the R package recommenderlab. However, I was not too happy with the results I got with this R package. When I searched the net for material on using embeddings for collaborative filtering, most of the material on MovieLens or word2vec was repetitive and offered nothing new. Finally, this short video lecture from Google Developers on Embeddings provided the most clarity.

I have created 4 Colab notebooks to identify player similarities (recommendations)

a) Batsman similarities IPL

b) Batsman similarities T20

c) Bowler similarities IPL

d) Bowler similarities T20

For creating the models I have used all the data for T20 and IPL so that I get the best results. The data is from Cricsheet. I have also used Google’s Embeddings Projector to display the batsman and bowler embeddings and to group similar players.

All the Colab notebooks and the data associated with the code are available in Github. Feel free to download and execute them. See if you get better performance. I tried a wide variety of hyperparameters – learning rate, width and depth of nodes per layer, number of layers, gradient methods etc.

You can download all the code & data from Github at embeddings

A) Batsman Recommender IPL (BatsmanRecommenderIPLA.ipynb)

Steps for creating the model

a) Upload the bowler vs batsman data with the number of times each bowler has taken each batsman’s wicket. This will be a sparse matrix

b) Assign integer indices for bowlers, batsmen

c) Add additional input features balls, runs conceded and Economy rate

d) Minimise loss for wickets taken for the bowler using SGD

e) Display embeddings of similar batsmen using Tensorboard projector

a) Upload data

1. Upload data file
2. Remove rows where wickets = 0

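import io

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from google.colab import files

# Upload the bowler vs batsman wickets file and read it into a DataFrame
uploaded = files.upload()
df2 = pd.read_csv(io.BytesIO(uploaded['bowlerVsBatsmanIPLE.csv']))
print(df2.shape)
# Keep only the rows where the bowler has taken the batsman's wicket at least once
df2 = df2.loc[df2['wicketTaken'] != 0]
print(df2.shape)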

uploaded = files.upload()
df6 = pd.read_csv(io.BytesIO(uploaded['bowlerVsBatsmanIPLAll.csv']))
df6

Out[14]:

	bowler1	batsman1	balls	runsConceded	ER
0	A Ashish Reddy	DJG Sammy	1	0	0.000000
1	A Ashish Reddy	G Gambhir	10	17	10.200000
2	A Ashish Reddy	JEC Franklin	2	0	0.000000
3	A Ashish Reddy	LRPL Taylor	5	6	7.200000
4	A Ashish Reddy	MA Agarwal	3	7	14.000000
…	…	…	…	…	…
8550	Z Khan	Vishnu Vinod	4	8	12.000000
8551	Z Khan	VS Malik	3	5	10.000000
8552	Z Khan	W Jaffer	7	3	2.571429
8553	Z Khan	YK Pathan	22	35	9.545455
8554	Z Khan	Yuvraj Singh	12	12	6.000000

b) Create integer dictionaries for batsmen & bowlers

# Assumption: df3 is the filtered wickets frame (df2) from step a)
df3 = df2.copy()

bowlers = df3["bowler1"].unique().tolist()
# Create dictionary of bowler to index
bowlers2index = {x: i for i, x in enumerate(bowlers)}
# Create dictionary of index to bowler
index2bowlers = {i: x for i, x in enumerate(bowlers)}


batsmen = df3["batsman1"].unique().tolist()
batsmen
# Create dictionary of batsman to index
batsmen2index = {x: i for i, x in enumerate(batsmen)}
batsmen2index
# Create dictionary of index to batsman
index2batsmen = {i: x for i, x in enumerate(batsmen)}
index2batsmen

# Map bowler, batsman to respective indices
df3["bowler"] = df3["bowler1"].map(bowlers2index)
df3["batsman"] = df3["batsman1"].map(batsmen2index)
num_bowlers = len(bowlers2index)
num_batsmen = len(batsmen2index)
df3["wicketTaken"] = df3["wicketTaken"].values.astype(np.float32)
# Min and max wickets taken, printed below as a sanity check
min_wicketTaken = min(df3["wicketTaken"])
max_wicketTaken = max(df3["wicketTaken"])

print(
    "Number of bowlers: {}, Number of batsmen: {}, Min wicketsTaken: {}, Max wicketsTaken: {}".format(
        num_bowlers, num_batsmen, min_wicketTaken, max_wicketTaken
    )
)

c) Concatenate additional features

df3
df6
df31=pd.concat([df3,df6],axis=1)
df31

d) Create a Tensorflow/Keras deep learning model. Minimise the Mean Squared Error using Stochastic Gradient Descent. I used ‘dropouts’ to regularise the model and keep the validation loss within limits

tf.random.set_seed(4)
vector_size=len(batsmen2index)

df4=df31[['bowler','batsman','wicketTaken','balls','runsConceded','ER']]
df4
train_dataset = df4.sample(frac=0.9,random_state=0)
test_dataset = df4.drop(train_dataset.index)

train_dataset1 = train_dataset[['bowler','batsman','balls','runsConceded','ER']]
test_dataset1 = test_dataset[['bowler','batsman','balls','runsConceded','ER']]
train_stats = train_dataset1.describe()
train_stats = train_stats.transpose()
#print(train_stats)

train_labels = train_dataset.pop('wicketTaken')
test_labels = test_dataset.pop('wicketTaken')

# Create a Deep Learning model with keras
# The embedding layer looks up a 16-dimensional vector for each of the 5 input columns
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vector_size,16,input_length=5),
    tf.keras.layers.Flatten(),
    keras.layers.Dropout(.2),
    keras.layers.Dense(16),
 
    keras.layers.Dense(8,activation=tf.nn.relu),
    
    keras.layers.Dense(4,activation=tf.nn.relu),
    keras.layers.Dense(1)
  ])

# Print the model summary
#model.summary()
# Optimizers tried: Adam and RMSprop (commented out). SGD with a learning rate of 0.01 is used
#optimizer=keras.optimizers.Adam(learning_rate=.0009, beta_1=0.5, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)
#optimizer=keras.optimizers.RMSprop(learning_rate=0.01, rho=0.2, momentum=0.2, epsilon=1e-07)
#optimizer=keras.optimizers.SGD(learning_rate=.009,momentum=0.1) - Works without dropout
optimizer=keras.optimizers.SGD(learning_rate=.01,momentum=0.1)

# Compile the model with Mean Squared Error as the loss
model.compile(loss='mean_squared_error',
                optimizer=optimizer,
                )

#model.compile(loss='binary_crossentropy',optimizer='rmsprop',metrics=['accuracy'])
# Train the model, using the held-out 10% as validation data
history=model.fit(
  train_dataset1, train_labels,batch_size=32,
  epochs=40, validation_data = (test_dataset1,test_labels), verbose=1)
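For readers who want to see what "minimising Mean Squared Error with SGD over embeddings" means without Keras, here is a stripped-down sketch. It is not the network above: it swaps in plain matrix factorisation, where the predicted wickets are the dot product of a bowler vector and a batsman vector, with no dense layers, dropout or extra features. The observations, sizes and learning rate are hypothetical.

```python
import random

# Hypothetical (bowler_index, batsman_index, wickets) triples, not taken from the IPL data.
observations = [
    (0, 0, 3.0), (0, 1, 1.0), (1, 0, 2.0),
    (1, 2, 4.0), (2, 1, 1.0), (2, 2, 3.0),
]
num_bowlers, num_batsmen, dim = 3, 3, 2

rng = random.Random(4)
bowler_emb = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_bowlers)]
batsman_emb = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_batsmen)]


def predict(bowler, batsman):
    """Predicted wickets = dot product of the two embedding vectors."""
    return sum(p * q for p, q in zip(bowler_emb[bowler], batsman_emb[batsman]))


def mean_squared_error():
    """Average squared error over the observed cells only."""
    return sum((predict(b, m) - y) ** 2 for b, m, y in observations) / len(observations)


def train(epochs=200, learning_rate=0.05):
    """Plain SGD: one gradient step per observed (bowler, batsman) cell."""
    for _ in range(epochs):
        rng.shuffle(observations)
        for b, m, y in observations:
            error = predict(b, m) - y
            for d in range(dim):
                p, q = bowler_emb[b][d], batsman_emb[m][d]
                # Gradient of 0.5 * error^2 with respect to p is error * q, and vice versa
                bowler_emb[b][d] = p - learning_rate * error * q
                batsman_emb[m][d] = q - learning_rate * error * p


loss_before = mean_squared_error()
train()
loss_after = mean_squared_error()
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.4f}")
```

After training, `predict` can also be called for a (bowler, batsman) pair that never appears in the observations, which is the sense in which collaborative filtering estimates performance against opponents not yet faced.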

e) Plot losses

f) Predict wickets that will be taken by bowlers against random batsmen


df5 = df4[['bowler','batsman','balls','runsConceded','ER']]
test1 = df5.sample(n=10)
# Predict one sampled row at a time and print it with the player names
for i in range(test1.shape[0]):
    print('Bowler :', index2bowlers.get(test1.iloc[i,0]), ", Batsman : ", index2batsmen.get(test1.iloc[i,1]),
          '- Times wicket Prediction:', model.predict(test1.iloc[[i]]))
1/1 [==============================] - 0s 90ms/step
Bowler : Harbhajan Singh , Batsman :  AM Nayar - Times wicket Prediction: [[1.0114906]]
1/1 [==============================] - 0s 18ms/step
Bowler : T Natarajan , Batsman :  Arshdeep Singh - Times wicket Prediction: [[0.98656166]]
1/1 [==============================] - 0s 19ms/step
Bowler : KK Ahmed , Batsman :  A Mishra - Times wicket Prediction: [[1.0504484]]
1/1 [==============================] - 0s 24ms/step
Bowler : M Muralitharan , Batsman :  F du Plessis - Times wicket Prediction: [[1.0941994]]
1/1 [==============================] - 0s 25ms/step
Bowler : SK Warne , Batsman :  DR Smith - Times wicket Prediction: [[1.0679393]]
1/1 [==============================] - 0s 28ms/step
Bowler : Mohammad Nabi , Batsman :  Ishan Kishan - Times wicket Prediction: [[1.403399]]
1/1 [==============================] - 0s 32ms/step
Bowler : R Bhatia , Batsman :  DJ Thornely - Times wicket Prediction: [[0.89399755]]
1/1 [==============================] - 0s 26ms/step
Bowler : SP Narine , Batsman :  MC Henriques - Times wicket Prediction: [[1.1997008]]
1/1 [==============================] - 0s 19ms/step
Bowler : AS Rajpoot , Batsman :  K Gowtham - Times wicket Prediction: [[0.9911405]]
1/1 [==============================] - 0s 21ms/step
Bowler : K Rabada , Batsman :  P Simran Singh - Times wicket Prediction: [[1.0064855]]

g) The embedding can be visualised using Google’s Embedding Projector, which identifies other batsmen who have similar characteristics. Here Cosine Similarity is used for grouping similar batsmen of IPL

The closest neighbour for AB de Villiers in IPL is SK Raina, followed by Rohit Sharma, as seen in the visualisation below.
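The same nearest-neighbour idea can be sketched outside the Embedding Projector. The snippet below ranks players by cosine similarity of their embedding vectors; the player names and three-dimensional vectors are hypothetical stand-ins for the trained embeddings.

```python
import math

# Hypothetical 3-dimensional embeddings; real ones come from the trained embedding layer.
embeddings = {
    "Player A": [0.9, 0.1, 0.3],
    "Player B": [0.8, 0.2, 0.4],
    "Player C": [-0.5, 0.7, 0.1],
    "Player D": [0.1, -0.9, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(name, table, k=2):
    """Return the k other players with the highest cosine similarity to `name`."""
    target = table[name]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in table.items()
        if other != name
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


for other, score in nearest_neighbours("Player A", embeddings):
    print(f"{other}: {score:.3f}")
```

Cosine similarity compares the direction of two vectors and ignores their length, so two players count as similar when their embeddings point the same way.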

B) Bowler Recommender T20 (BowlerRecommenderT20M1A.ipynb)

The setup is similar to the batsman recommender above. The steps are

a) Upload data for T20 Batsman vs Bowler with Total runs scored. This will be a sparse matrix

b) Create integer dictionaries for batsman & bowler

c) Add additional features like fours, sixes and strike rate

d) Minimise loss for the (normalised) total runs scored

e) Display embeddings of bowlers using Tensorboard Embeddings Projector

Minimizing the loss for the normalised total runs using SGD

tf.random.set_seed(4)
vector_size=len(batsman2index)

# Normalize target variable (z-score). The copy avoids pandas' SettingWithCopyWarning
df4 = df31[['bowler','batsman','totalRuns','fours','sixes','ballsFaced']].copy()
df4['normalizedRuns'] = (df4['totalRuns'] - df4['totalRuns'].mean()) / df4['totalRuns'].std()
print(df4)
train_dataset = df4.sample(frac=0.8,random_state=0)
test_dataset = df4.drop(train_dataset.index)
train_dataset1 = train_dataset[['batsman','bowler','fours','sixes','ballsFaced']]
test_dataset1 = test_dataset[['batsman','bowler','fours','sixes','ballsFaced']]

train_labels = train_dataset.pop('normalizedRuns')
test_labels = test_dataset.pop('normalizedRuns')
train_labels
print(train_dataset1)

# Create a Deep Learning model with keras
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vector_size,16,input_length=5),
    tf.keras.layers.Flatten(),
    keras.layers.Dropout(.2),
    keras.layers.Dense(16),
 
    keras.layers.Dense(8,activation=tf.nn.relu),
    
    keras.layers.Dense(4,activation=tf.nn.relu),
    keras.layers.Dense(1)
  ])

# Print the model summary
#model.summary()
# Optimizers tried: Adam and RMSprop (commented out). SGD with a learning rate of 0.01 is used
#optimizer=keras.optimizers.Adam(learning_rate=.0009, beta_1=0.5, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)
#optimizer=keras.optimizers.RMSprop(learning_rate=0.001, rho=0.2, momentum=0.2, epsilon=1e-07)
#optimizer=keras.optimizers.SGD(learning_rate=.009,momentum=0.1) - Works without dropout
optimizer=keras.optimizers.SGD(learning_rate=.01,momentum=0.1)

# Compile the model with Mean Squared Error as the loss
model.compile(loss='mean_squared_error',
                optimizer=optimizer,
                )

#model.compile(loss='binary_crossentropy',optimizer='rmsprop',metrics=['accuracy'])
# Train the model, using the held-out 20% as validation data
history=model.fit(
  train_dataset1, train_labels,batch_size=32,
  epochs=40, validation_data = (test_dataset1,test_labels), verbose=1)

model.predict(train_dataset1[1:10])
df5 = df4[['batsman','bowler','fours','sixes','ballsFaced']]
test1 = df5.sample(n=10)
model.predict(test1)
# Convert each normalised prediction back to runs: prediction * std + mean.
# iloc[[i]] keeps the row as a one-row DataFrame, as in the IPL notebook above
for i in range(test1.shape[0]):
    print('Batsman :', index2batsman.get(test1.iloc[i,0]), ", Bowler : ", index2bowler.get(test1.iloc[i,1]),
          '- Total runs Prediction:', (model.predict(test1.iloc[[i]]) * df4['totalRuns'].std()) + df4['totalRuns'].mean())
1/1 [==============================] - 0s 396ms/step
1/1 [==============================] - 0s 112ms/step
1/1 [==============================] - 0s 183ms/step
Batsman : G Chohan , Bowler :  Khawar Ali - Total runs Prediction: [[1.8883028]]
1/1 [==============================] - 0s 56ms/step
Batsman : Umar Akmal , Bowler :  LJ Wright - Total runs Prediction: [[9.305391]]
1/1 [==============================] - 0s 68ms/step
Batsman : M Shumba , Bowler :  Simi Singh - Total runs Prediction: [[19.662743]]
1/1 [==============================] - 0s 30ms/step
Batsman : CH Gayle , Bowler :  RJW Topley - Total runs Prediction: [[16.854687]]
1/1 [==============================] - 0s 39ms/step
Batsman : BA King , Bowler :  Taskin Ahmed - Total runs Prediction: [[3.5154686]]
1/1 [==============================] - 0s 102ms/step
Batsman : KD Shah , Bowler :  Avesh Khan - Total runs Prediction: [[8.411661]]
1/1 [==============================] - 0s 38ms/step
Batsman : ST Jayasuriya , Bowler :  SCJ Broad - Total runs Prediction: [[5.867449]]
1/1 [==============================] - 0s 45ms/step
Batsman : AB de Villiers , Bowler :  Saeed Ajmal - Total runs Prediction: [[15.150892]]
1/1 [==============================] - 0s 46ms/step
Batsman : SV Samson , Bowler :  J Little - Total runs Prediction: [[10.44426]]
1/1 [==============================] - 0s 102ms/step
Batsman : Zawar Farid , Bowler :  GJ Delany - Total runs Prediction: [[1.9770675]]
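The predictions above are first produced on the normalised scale and then converted back to runs. A minimal sketch of that round trip, using hypothetical run totals rather than the T20 data, looks like this:

```python
import statistics

# Hypothetical total runs for a handful of batsman vs bowler pairs.
total_runs = [2.0, 9.0, 20.0, 17.0, 4.0, 8.0]

mean_runs = statistics.mean(total_runs)
# Sample standard deviation (n - 1 in the denominator), which is also the pandas .std() default
std_runs = statistics.stdev(total_runs)


def normalise(runs):
    """Z-score: how many standard deviations the value lies from the mean."""
    return (runs - mean_runs) / std_runs


def denormalise(z_score):
    """Invert the z-score to get back to runs."""
    return z_score * std_runs + mean_runs


normalised = [normalise(r) for r in total_runs]
recovered = [denormalise(z) for z in normalised]
print(normalised)
print(recovered)
```

A model trained on the normalised target outputs z-scores, so `denormalise` has to be applied with the same mean and standard deviation that were used for training.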

Identifying similar bowlers using Embeddings Projector for T20

Bhuvneshwar Kumar’s performance is closest to that of CR Woakes.

Note: Incidentally the accuracy in the above model was not too good. I may work on this again later!

C) Bowler Embeddings IPL – Grouping similar bowlers of IPL with Embeddings Projector (BowlerRecommenderIPLA.ipynb)

D) Batting Embeddings T20 – Grouping similar batsmen of T20 (BatsmanRecommenderT20MA.ipynb)

The Tensorboard Embeddings Projector is also interesting. There are multiple ways the data can be visualised, namely UMAP, t-SNE and PCA. You could play with it.
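The Embedding Projector can load embeddings from two tab-separated files: one with the vectors and one with the labels (metadata). Below is a minimal sketch of producing those two files' contents. The names and vectors are hypothetical; with a trained Keras model, the vectors could be read from the embedding layer with something like `model.layers[0].get_weights()[0]`, which assumes the embedding is the first layer.

```python
import csv
import io


def embeddings_to_tsv(names, vectors):
    """Return (vectors_tsv, metadata_tsv) strings with one tab-separated row per player."""
    vec_buf, meta_buf = io.StringIO(), io.StringIO()
    vec_writer = csv.writer(vec_buf, delimiter="\t", lineterminator="\n")
    meta_writer = csv.writer(meta_buf, delimiter="\t", lineterminator="\n")
    for name, vector in zip(names, vectors):
        vec_writer.writerow(vector)
        meta_writer.writerow([name])
    return vec_buf.getvalue(), meta_buf.getvalue()


# Hypothetical players and embedding vectors.
names = ["Player A", "Player B"]
vectors = [[0.9, 0.1, 0.3], [0.8, 0.2, 0.4]]

vectors_tsv, metadata_tsv = embeddings_to_tsv(names, vectors)
print(vectors_tsv)
print(metadata_tsv)
```

The two strings can be written to files such as `vectors.tsv` and `metadata.tsv` (file names chosen here for illustration) and uploaded through the projector's load option. Row i of the metadata labels row i of the vectors.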

As mentioned above the Colab notebooks and data are available at Github embeddings

The ability to identify batsmen & bowlers who would perform similarly against specific bowling attacks coupled with the average runs & strike rate should give a good measure of a player’s performance.
