# Deconstructing Convolutional Neural Networks with Tensorflow and Keras

I have been very fascinated by how Convolution Neural  Networks have been able to, so efficiently,  do image classification and image recognition CNN’s have been very successful in in both these tasks. A good paper that explores the workings of a CNN Visualizing and Understanding Convolutional Networks  by Matthew D Zeiler and Rob Fergus. They show how through a reverse process of convolution using a deconvnet.

In their paper they show how by passing the feature map through a deconvnet ,which does the reverse process of the convnet, they can reconstruct what input pattern originally caused a given activation in the feature map

In the paper they say “A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features, it does the opposite. An input image is presented to the CNN and features  activation computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.”

I started to scout the internet to see how I can implement this reverse process of Convolution to understand what really happens under the hood of a CNN.  There are a lot of good articles and blogs, but I found this post Applied Deep Learning – Part 4: Convolutional Neural Networks take the visualization of the CNN one step further.

This post takes VGG16 as the pre-trained network and then uses this network to display the intermediate visualizations.  While this post was very informative and also the visualizations of the various images were very clear, I wanted to simplify the problem for my own understanding.

Hence I decided to take the MNIST digit classification as my base problem. I created a simple 3 layer CNN which gives close to 99.1% accuracy and decided to see if I could do the visualization.

As mentioned in the above post, there are 3 major visualisations

1. Feature activations at the layer
2. Visualisation of the filters
3. Visualisation of the class outputs

Feature Activation – This visualization the feature activation at the 3 different layers for a given input image. It can be seen that first layer  activates based on the edge of the image. Deeper layers activate in a more abstract way.

Visualization of the filters: This visualization shows what patterns the filters respond maximally to. This is implemented in Keras here

To do this the following is repeated in a loop

• Choose a loss function that maximizes the value of a convnet filter activation
• Do gradient ascent (maximization) in input space that increases the filter activation

Visualizing Class Outputs of the MNIST Convnet: This process is similar to determining the filter activation. Here the convnet is made to generate an image that represents the category maximally.

You can access the Google colab notebook here – Deconstructing Convolutional Neural Networks in Tensoflow and Keras

import numpy as np
import pandas as pd
import os
import tensorflow as tf
import matplotlib.pyplot as plt
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, Input
from keras.models import Model
from sklearn.model_selection import train_test_split
from keras.utils import np_utils

Using TensorFlow backend.
In [0]:
mnist=tf.keras.datasets.mnist
# Set training and test data and labels

In [0]:
#Normalize training data
X =np.array(training_images).reshape(training_images.shape[0],28,28,1)
# Normalize the images by dividing by 255.0
X = X/255.0
X.shape
# Use one hot encoding for the labels
Y = np_utils.to_categorical(training_labels, 10)
Y.shape
# Split training data into training and validation data in the ratio of 80:20
X_train, X_validation, y_train, y_validation = train_test_split(X,Y,test_size=0.20, random_state=42)

In [4]:
# Normalize test data
X_test =np.array(test_images).reshape(test_images.shape[0],28,28,1)
X_test=X_test/255.0
#Use OHE for the test labels
Y_test = np_utils.to_categorical(test_labels, 10)
X_test.shape

Out[4]:
(10000, 28, 28, 1)

# Display data

Display the training data and the corresponding labels

In [5]:
print(training_labels[0:10])
f, axes = plt.subplots(1, 10, sharey=True,figsize=(10,10))
for i,ax in enumerate(axes.flat):
ax.axis('off')
ax.imshow(X[i,:,:,0],cmap="gray")



# Create a Convolutional Neural Network

The CNN consists of 3 layers

• Conv2D of size 28 x 28 with 24 filters
• Perform Max pooling
• Conv2D of size 14 x 14 with 48 filters
• Perform max pooling
• Conv2d of size 7 x 7 with 64 filters
• Flatten
• Use Dense layer with 128 units
• Perform 25% dropout
• Perform categorical cross entropy with softmax activation function
In [0]:
num_classes=10
inputs = Input(shape=(28,28,1))
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(48, (3, 3), padding='same',activation='relu')(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(64, (3, 3), padding='same',activation='relu')(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.25)(x)
output = Dense(num_classes,activation="softmax")(x)

model = Model(inputs,output)

model.compile(loss='categorical_crossentropy',
metrics=['accuracy'])


# Summary of CNN

Display the summary of CNN

In [7]:
model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 28, 28, 1)         0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 24)        240
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 24)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 14, 14, 48)        10416
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 48)          0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 7, 7, 64)          27712
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 3, 3, 64)          0
_________________________________________________________________
flatten_1 (Flatten)          (None, 576)               0
_________________________________________________________________
dense_1 (Dense)              (None, 128)               73856
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290
=================================================================
Total params: 113,514
Trainable params: 113,514
Non-trainable params: 0
_________________________________________________________________


# Perform Gradient descent and validate with the validation data

In [8]:
epochs = 20
batch_size=256
history = model.fit(X_train,y_train,
epochs=epochs,
batch_size=batch_size,
validation_data=(X_validation,y_validation))
————————————————
acc = history.history[ ‘accuracy’ ]
val_acc = history.history[ ‘val_accuracy’ ]
loss = history.history[ ‘loss’ ]
val_loss = history.history[‘val_loss’ ]
epochs = range(len(acc)) # Get number of epochs
#————————————————
# Plot training and validation accuracy per epoch
#————————————————
plt.plot ( epochs, acc,label=”training accuracy” )
plt.plot ( epochs, val_acc, label=’validation acuracy’ )
plt.title (‘Training and validation accuracy’)
plt.legend()
plt.figure()
#————————————————
# Plot training and validation loss per epoch
#————————————————
plt.plot ( epochs, loss , label=”training loss”)
plt.plot ( epochs, val_loss,label=”validation loss” )
plt.title (‘Training and validation loss’ )
plt.legend()
Test model on test data
f, axes = plt.subplots(1, 10, sharey=True,figsize=(10,10))
for i,ax in enumerate(axes.flat):
ax.axis(‘off’)
ax.imshow(X_test[i,:,:,0],cmap=”gray”)
l=[]
for i in range(10):
x=X_test[i].reshape(1,28,28,1)
y=model.predict(x)
m = np.argmax(y, axis=1)
print(m)

[7]
[2]
[1]
[0]
[4]
[1]
[4]
[9]
[5]
[9]


# Generate the filter activations at the intermediate CNN layers

In [12]:
img = test_images[51].reshape(1,28,28,1)
fig = plt.figure(figsize=(5,5))
print(img.shape)
plt.imshow(img[0,:,:,0],cmap="gray")
plt.axis('off')


# Display the activations at the intermediate layers

This displays the intermediate activations as the image passes through the filters and generates these feature maps

In [13]:
layer_names = ['conv2d_4', 'conv2d_5', 'conv2d_6']

layer_outputs = [layer.output for layer in model.layers if 'conv2d' in layer.name]
activation_model = Model(inputs=model.input,outputs=layer_outputs)
intermediate_activations = activation_model.predict(img)
images_per_row = 8
max_images = 8

for layer_name, layer_activation in zip(layer_names, intermediate_activations):
print(layer_name,layer_activation.shape)
n_features = layer_activation.shape[-1]
print("features=",n_features)
n_features = min(n_features, max_images)
print(n_features)

size = layer_activation.shape[1]
print("size=",size)
n_cols = n_features // images_per_row
display_grid = np.zeros((size * n_cols, images_per_row * size))

for col in range(n_cols):
for row in range(images_per_row):
channel_image = layer_activation[0,:, :, col * images_per_row + row]

channel_image -= channel_image.mean()
channel_image /= channel_image.std()
channel_image *= 64
channel_image += 128
channel_image = np.clip(channel_image, 0, 255).astype('uint8')
display_grid[col * size : (col + 1) * size,
row * size : (row + 1) * size] = channel_image
scale = 2. / size
plt.figure(figsize=(scale * display_grid.shape[1],
scale * display_grid.shape[0]))
plt.axis('off')
plt.title(layer_name)
plt.grid(False)
plt.imshow(display_grid, aspect='auto', cmap='viridis')

plt.show()

It can be seen that at the higher layers only abstract features of the input image are captured

# To fix the ImportError: cannot import name 'imresize' in the next cell. Run this cell. Then comment and restart and run all
#!pip install scipy==1.1.0


## Visualize the pattern that the filters respond to maximally

• Choose a loss function that maximizes the value of the CNN filter in a given layer
• Start from a blank input image.
• Do gradient ascent in input space. Modify input values so that the filter activates more
• Repeat this in a loop.
In [14]:
from vis.visualization import visualize_activation, get_num_filters
from vis.utils import utils
from vis.input_modifiers import Jitter

max_filters = 24
selected_indices = []
vis_images = [[], [], [], [], []]
i = 0
selected_filters = [[0, 3, 11, 15, 16, 17, 18, 22],
[8, 21, 23, 25, 31, 32, 35, 41],
[2, 7, 11, 14, 19, 26, 35, 48]]

# Set the layers
layer_name = ['conv2d_4', 'conv2d_5', 'conv2d_6']
# Set the layer indices
layer_idx = [1,3,5]
for layer_name,layer_idx in zip(layer_name,layer_idx):

# Visualize all filters in this layer.
if selected_filters:
filters = selected_filters[i]
else:
# Randomly select filters
filters = sorted(np.random.permutation(get_num_filters(model.layers[layer_idx]))[:max_filters])
selected_indices.append(filters)

# Generate input image for each filter.
# Loop through the selected filters in each layer and generate the activation of these filters
for idx in filters:
img = visualize_activation(model, layer_idx, filter_indices=idx, tv_weight=0.,
input_modifiers=[Jitter(0.05)], max_iter=300)
vis_images[i].append(img)

# Generate stitched image palette with 4 cols so we get 2 rows.
stitched = utils.stitch_images(vis_images[i], cols=4)
plt.figure(figsize=(20, 30))
plt.title(layer_name)
plt.axis('off')
stitched = stitched.reshape(1,61,127,1)
plt.imshow(stitched[0,:,:,0])
plt.show()
i += 1
from vis.utils import utils
new_vis_images = [[], [], [], [], []]
i = 0
layer_name = ['conv2d_4', 'conv2d_5', 'conv2d_6']
layer_idx = [1,3,5]
for layer_name,layer_idx in zip(layer_name,layer_idx):

# Generate input image for each filter.
for j, idx in enumerate(selected_indices[i]):
img = visualize_activation(model, layer_idx, filter_indices=idx,
seed_input=vis_images[i][j], input_modifiers=[Jitter(0.05)], max_iter=300)
#img = utils.draw_text(img, 'Filter {}'.format(idx))
new_vis_images[i].append(img)

stitched = utils.stitch_images(new_vis_images[i], cols=4)
plt.figure(figsize=(20, 30))
plt.title(layer_name)
plt.axis('off')
print(stitched.shape)
stitched = stitched.reshape(1,61,127,1)
plt.imshow(stitched[0,:,:,0])
plt.show()
i += 1



## Visualizing Class Outputs

Here the CNN will generate the image that maximally represents the category. Each of the output represents one of the digits as can be seen below

In [16]:
from vis.utils import utils
from keras import activations
codes = '''
zero 0
one 1
two 2
three 3
four 4
five 5
six 6
seven 7
eight 8
nine 9
'''
layer_idx=10
initial = []
images = []
tuples = []
# Swap softmax with linear for better visualization
model.layers[layer_idx].activation = activations.linear
model = utils.apply_modifications(model)
for line in codes.split('\n'):
if not line:
continue
name, idx = line.rsplit(' ', 1)
idx = int(idx)
img = visualize_activation(model, layer_idx, filter_indices=idx,
tv_weight=0., max_iter=300, input_modifiers=[Jitter(0.05)])

initial.append(img)
tuples.append((name, idx))

i = 0
for name, idx in tuples:
img = visualize_activation(model, layer_idx, filter_indices=idx,
seed_input = initial[i], max_iter=300, input_modifiers=[Jitter(0.05)])
#img = utils.draw_text(img, name) # Unable to display text on gray scale image
i += 1
images.append(img)

stitched = utils.stitch_images(images, cols=4)
plt.figure(figsize=(20, 20))
plt.axis('off')
stitched = stitched.reshape(1,94,127,1)
plt.imshow(stitched[0,:,:,0])

plt.show()



In the grid below the class outputs show the MNIST digits to which output responds to maximally. We can see the ghostly outline
of digits 0 – 9. We can clearly see the outline if 0,1, 2,3,4,5 (yes, it is there!),6,7, 8 and 9. If you look at this from a little distance the digits are clearly visible. Isn’t that really cool!!

## Conclusion:

It is really interesting to see the class outputs which show the image to which the class output responds to maximally. In the
post Applied Deep Learning – Part 4: Convolutional Neural Networks the class output show much more complicated images and is worth a look. It is really interesting to note that the model has adjusted the filter values and the weights of the fully connected network to maximally respond to the MNIST digits

## Also see

To see all posts click Index of posts

# Understanding Neural Style Transfer with Tensorflow and Keras

Neural Style Transfer (NST)  is a fascinating area of Deep Learning and Convolutional Neural Networks. NST is an interesting technique, in which the style from an image, known as the ‘style image’ is transferred to another image ‘content image’ and we get a third a image which is a generated image which has the content of the original image and the style of another image.

NST can be used to reimagine how famous painters like Van Gogh, Claude Monet or a Picasso would have visualised a scenery or architecture. NST uses Convolutional Neural Networks (CNNs) to achieve this artistic style transfer from one image to another. NST was originally implemented by Gati et al., in their paper Neural Algorithm of Artistic Style. Convolutional Neural Networks have been very successful in image classification image recognition et cetera. CNN networks have also been have also generated very interesting pictures using Neural Style Transfer which will be shown in this post. An interesting aspect of CNN’s is that the first couple of layers in the CNN capture basic features of the image like edges and  pixel values. But as we go deeper into the CNN, the network captures higher level features of the input image.

To get started with Neural Style transfer  we will be using the VGG19 pre-trained network. The VGG19 CNN is a compact pre-trained your network which can be used for performing the NST. However, we could have also used Resnet or InceptionV3 networks for this purpose but these are very large networks. The idea of using a network trained on a different task and applying it to a new task is called transfer learning.

What needs to be done to transfer the style from one of the image to another image. This brings us to the question – What is ‘style’? What is it that distinguishes Van Gogh’s painting or Picasso’s cubist art. Convolutional Neural Networks capture basic features in the lower layers and much more complex features in the deeper layers.  Style can be computed by taking the correlation of the feature maps in a layer L. This is my interpretation of how style is captured.  Since style  is intrinsic to  the image, it  implies that the style feature would exist across all the filters in a layer. Hence, to pick up this style we would need to get the correlation of the filters across channels of a lawyer. This is computed mathematically, using the Gram matrix which calculates the correlation of the activation of a the filter by the style image and generated image

To transfer the style from one image to the content image we need to do two parallel operations while doing forward propagation
– Compute the content loss between the source image and the generated image
– Compute the style loss between the style image and the generated image
– Finally we need to compute the total loss

In order to get transfer the style from the ‘style’ image to the ‘content ‘image resulting in a  ‘generated’  image  the total loss has to be minimised. Therefore backward propagation with gradient descent  is done to minimise the total loss comprising of the content and style loss.

Initially we make the Generated Image ‘G’ the same as the source image ‘S’

The content loss at layer ‘l’

$L_{content} = 1/2 \sum_{i}^{j} ( F^{l}_{i,j} - P^{l}_{i,j})^{2}$

where $F^{l}_{i,j}$ and $P^{l}_{i,j}$ represent the activations at layer ‘l’ in a filter i, at position ‘j’. The intuition is that the activations will be same for similar source and generated image. We need to minimise the content loss so that the generated stylized image is as close to the original image as possible. An intermediate layer of VGG19 block5_conv2 is used

The Style layers that are are used are

style_layers = [‘block1_conv1’,
‘block2_conv1’,
‘block3_conv1’,
‘block4_conv1’,
‘block5_conv1’]
To compute the Style Loss the Gram matrix needs to be computed. The Gram Matrix is computed by unrolling the filters as shown below (source: Convolutional Neural Networks by Prof Andrew Ng, Coursera). The result is a matrix of size $n_{c}$ x $n_{c}$ where $n_{c}$ is the number of channels
The above diagram shows the filters of height $n_{H}$ and width $n_{W}$ with $n_{C}$ channels
The contribution of layer ‘l’ to style loss is given by
$L^{'}_{style} = \frac{\sum_{i}^{j} (G^{2}_{i,j} - A^l{i,j})^2}{4N^{2}_{l}M^{2}_{l}}$
where $G_{i,j}$  and $A_{i,j}$ are the Gram matrices of the style and generated images respectively. By minimising the distance in the gram matrices of the style and generated image we can ensure that generated image is a stylized version of the original image similar to the style image
The total loss is given by
$L_{total} = \alpha L_{content} + \beta L_{style}$
Back propagation with gradient descent works to minimise the content loss between the source and generated image, while the style loss tries to minimise the discrepancies in the style of the style image and generated image. Running through forward and backpropagation through several epochs successfully transfers the style from the style image to the source image.
You can check the Notebook at Neural Style Transfer

Note: The code in this notebook is largely based on the Neural Style Transfer tutorial from Tensorflow, though I may have taken some changes from other blogs. I also made a few changes to the code in this tutorial, like removing the scaling factor, or the class definition (Personally, I belong to the old school (C language) and am not much in love with the ‘self.”..All references are included below

Note: Here is a interesting thought. Could we do a Neural Style Transfer in music? Imagine Carlos Santana playing ‘Hotel California’ or Brian May style in ‘Another brick in the wall’. While our first reaction would be that it may not sound good as we are used to style of these songs, we may be surprised by a possible style transfer. This is definitely music to the ears!

Here are few runs from this

## A) Run 1

1. Neural Style Transfer – a) Content Image – My portrait.  b) Style Image – Wassily Kadinsky Oil on canvas, 1913, Vassily Kadinsky’s composition

2. Result of Neural Style Transfer

2) Run 2

a) Content Image – Portrait of my parents b) Style Image –  Vincent Van Gogh’s ,Starry Night Oil on canvas 1889

2. Result of Neural Style Transfer

## Run 3

1.  Content Image – Caesar 2 (Masai Mara- 20 Jun 2018).  Style Image – The Great Wave at Kanagawa – Katsushika Hokosai, 1826-1833

2. Result of Neural Style Transfer

## Run 4

1.   Content Image – Junagarh Fort , Rajasthan   Sep 2016              b) Style Image – Le Pont Japonais by Claude Monet, Oil on canvas, 1920

2. Result of Neural Style Transfer

Neural Style Transfer is a very ingenious idea which shows that we can segregate the style of a painting and transfer to another image.

### References

1. A Neural Algorithm of Artistic Style, Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
2. Neural style transfer
3. Neural Style Transfer: Creating Art with Deep Learning using tf.keras and eager execution
4. Convolutional Neural Network, DeepLearning.AI Specialization, Prof Andrew Ng
5. Intuitive Guide to Neural Style Transfer

To see all posts click Index of posts

# Big Data-5: kNiFi-ing through cricket data with yorkpy

“The temptation to form premature theories upon insufficient data is the bane of our profession.”

Sherlock Holmes in the Valley of fear by Arthur Conan Doyle

“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.”

Jim Barksdale, former CEO Netscape

In this post I use  Apache NiFi Dataflow Pipeline along with my Python package yorkpy to crunch through cricket data from Cricsheet. The Data Pipelne  flows all the way from the source  to target analytics output. Apache NiFi was created to automate the flow of data between systems.  NiFi dataflows enable the automated and managed flow of information between systems. This post automates the flow of data from Cricsheet, from where the zip file it is downloaded, unpacked, processed, transformed and finally T20 players are ranked.

While this is a straight forward example of what can be done, this pattern can be applied to real Big Data systems. For example hypothetically, we could consider that we get several parallel streams of  cricket data or for that matter any sports related data. There could be parallel Data flow pipelines that get the data from the sources. This would then be  followed by data transformation modules and finally a module for generating analytics. At the other end a UI based on AngularJS or ReactJS could display the results in a cool and awesome way.

Incidentally, the NiFi pipeline that I discuss in this post, is a simplistic example, and does not use the Big Data stack like HDFS, Hive, Spark etc. Nevertheless, the pattern used, has all the modules for a Big Data pipeline namely ingestion, unpacking, transformation and finally analytics. This NiF pipeline demonstrates the flow using the regular file system of Mac and my python based package yorkpy. The concepts mentioned could be used in a real Big Data scenario which has much fatter pipes of data coming. If  this was the case the NiFi pipeline would utilize  HDFS/Hive for storing the ingested data and Pyspark/Scala for the transformation and analytics and other related technologies.

A pictorial representation is given below

In the diagram above each of the vertical boxes could be any technology from the ever proliferating Big Data stack namely HDFS, Hive, Spark, Sqoop, Kafka, Impala and so on.  Such a dataflow automation could be created when any big sporting event happens, as long as the data generated large, and there is a need for dynamic and automated reporting. The UI could be based on AngularJS/ReactJS and could display analytical tables and charts.

This post demonstrates one such scenario in which IPL T20 data is downloaded from Cricsheet site, unpacked and stored in a specific directory. This dataflow automation is based on my yorkpy package. To know more about the yorkpy package  see Pitching yorkpy … short of good length to IPL – Part 1  and the associated parts. The zip file, from Cricsheet, contains individual IPL T20 matches in YAML format. The convertYaml2DataframeT20() function is used to convert the YAML files into Pandas dataframes before storing them as CSV files. After this done, the function rankIPLT20batting() function is used to perform the overall ranking of the T20 players. My yorkpy Python package has about ~ 50+ functions that perform various analytics on any T20 data for e.g it has the following classes of functions

• analyze T20 matches
• analyze performance of a T20 team in all matches against another T20 team
• analyze performance of a T20 team against all other T20 teams
• analyze performance of T20 batsman and bowlers
• rank T20 batsmen and bowlers

The functions of yorkpy generate tables or charts. While this post demonstrates one scenario, we could use any of the yorkpy T20 functions, generate the output and display on a widget in the UI display, created with cool technologies like AngularJS/ReactJS,  possibly in near real time as data keeps coming in.,

To use yorkpy with NiFI the following packages have to be installed in your environment

-pip install yorkpy
-pip install pyyaml
-pip install pandas
-yum install python-devel (equivalent in Windows)
-pip install matplotlib
-pip install seaborn
-pip install sklearn
-pip install datetime

I have created a video of the NiFi Pipeline with the real dataflow fro source to the ranked IPL T20 batsmen. Take a look at RankingT20PlayersWithNiFiYorkpy

You can clone/fork the NiFi template from rankT20withNiFiYorkpy

The NiFi Data Flow Automation is shown below

## 1. Overall flow

The overall NiFi flow contains 2 Process Groups a) DownloadAnd Unpack. b) Convert and Rank IPL batsmen. While it appears that the Process Groups are disconnected, they are not. The first process group downloads the T20 zip file, unpacks the. zip file and saves the YAML files in a specific folder. The second process group monitors this folder and starts processing as soon the YAML files are available. It processes the YAML converting it into dataframes before storing it as CSV file. The next  processor then does the actual ranking of the batsmen before writing the output into IPLrank.txt

This process group is shown below

The ${T20data} variable points to the specific T20 format that needs to be downloaded. I have set this to https://cricsheet.org/downloads/ipl.zip. This could be set any other data set. In fact we could have parallel data flows for different T20/ Sports data sets and generate #### 1.1.2 SaveUnpackedData This processor stores the YAML files in a predetermined folder, so that the data can be picked up by the 2nd Process Group for processing ### 1.2 ProcessAndRankT20Players Process Group This is the second process group which converts the YAML files to pandas dataframes before storing them as. CSV files. The RankIPLPlayers will then read all the CSV files, stack them and then proceed to rank the IPL players. The Process Group is shown below #### 1.2.1 ListFile and FetchFile Processors The left 2 Processors ListFile and FetchFile get all the YAML files from the folder and pass it to the next processor #### 1.2.2 convertYaml2DataFrame Processor The convertYaml2DataFrame Processor uses the ExecuteStreamCommand which call a python script. The Python script invoked the yorkpy function convertYaml2Dataframe() as shown below The${convertYaml2Dataframe} variable points to the python file below which invoked the yorkpy function yka.convertYaml2PandasDataframeT20()

import yorkpy.analytics as yka
import argparse
parser = argparse.ArgumentParser(description='convert')
args=parser.parse_args()
yamlFile=args.yamlFile
yka.convertYaml2PandasDataframeT20(yamlFile,"/Users/tvganesh/backup/software/nifi/ipl","/Users/tvganesh/backup/software/nifi/ipldata")

This function takes as input $filename which comes from FetchFile processor which is a FlowFile. So I have added a concurrency of 8 to handle upto 8 Flowfiles at a time. The thumb rule as I read on the internet is 2x, 4x the number of cores of your system. Since I have an 8 core Mac, I could possibly have gone ~ 30 concurrent threads. Also the number of concurrent threads is less when the flow is run in a Oracle Box VirtualMachine. Box since a vCore < actual Core The scheduling tab is as below Here are the 8 concurrent Python threads on Mac at bottom right… (pretty cool!) I have not fully tested how latency vs throughput slider changes, affects the performance. #### 1.2.3 MergeContent Processor This processor’s only job is to trigger the rankIPLPlayers when all the FlowFiles have merged into 1 file. #### 1.2.4 RankT20Players This processor is an ExecuteStreamCommand Processor that executes a Python script which invokes a yorkpy function rankIPLT20Batting() import yorkpy.analytics as yka rank=yka.rankIPLT20Batting("/Users/tvganesh/backup/software/nifi/ipldata") print(rank.head(15))  #### 1.2.5 OutputRankofT20Player Processor This processor writes the generated rank to an output file. ### 1.3 Final Ranking of IPL T20 players The Nodejs based web server picks up this file and displays on the web page the final ranks (the code is based on a good youtube for reading from file) ## 2. Final thoughts As I have mentioned above though the above NiFi Cricket Dataflow automation does not use the Hadoop ecosystem, the pattern used is valid and can be used with some customization in Big Data flows as parallel stream. I could have also done this on Oracle VirtualBox but I thought since the code is based on Python and Pandas there is no real advantage of running on the VirtualBox. GIve the NiFi flow a shot. Have fun!!! To see all posts click Index of posts # Ranking T20 players in Intl T20, IPL, BBL and Natwest using yorkpy There is a voice that doesn’t use words, listen. When someone beats a rug, the blows are not against the rug, but against the dust in it. I lost my hat while gazing at the moon, and then I lost my mind. Rumi ## Introduction After a long hiatus, I am back to my big, bad, blogging ways! In this post I rank T20 players from several different leagues namely • International T20 • Indian Premier League (IPL) T20 • Big Bash League (BBL) T20 • Natwest Blast (NTB) T20 I have added 8 new functions to my Python Package yorkpy, which will perform the ranking for the above 4 T20 League formats. To know more about my Python package see Pitching yorkpy . short of good length to IPL – Part 1, and the related posts on yorkpy. The code can be easily extended to other leagues which have a the same ‘yaml’ format for the matches. I also fixed some issues which started to crop up, possibly because a few things have changed in the new data. The new functions are 1. rankIntlT20Batting() 2. rankIntlT20Batting() 3. rankIPLT20Batting() 4. rankIPLT20Batting 5. rankBBLT20Batting() 6. rankBBLT20Batting() 7. rankNTBT20Batting() 8. rankNTBT20Batting() The yorkpy package uses data from Cricsheet You can clone/fork the code for yorkpy at yorkpy You can download the PDF of the post from Rank T20 yorkpy can be installed with ‘pip install yorkpy ## 1. International T20 The steps to do before ranking for International T20 matches are 1. Download International T20 zip file from Cricsheet Intl T20 2. Unzip the file. This will create a folder with yaml files import yorkpy.analytics as yka #yka.convertAllYaml2PandasDataframesT20("../t20s","../data") This above step will convert the yaml files into CSV files. Now do the ranking as below ## 1a. Ranking of International T20 batsmen import yorkpy.analytics as yka intlT20RankBatting=yka.rankIntlT20Batting("C:\\software\\cricket-package\\yorkpyPkg\\data\\data") intlT20RankBatting.head(15) ## matches runs_mean SR_mean ## batsman ## V Kohli 58 38.672414 125.212402 ## KS Williamson 42 32.595238 122.884631 ## Mohammad Shahzad 52 31.942308 118.212288 ## CH Gayle 50 31.140000 111.869984 ## BB McCullum 69 29.492754 117.011666 ## MM Lanning 48 28.812500 98.582663 ## SJ Taylor 44 28.659091 98.684856 ## MJ Guptill 68 28.573529 117.673702 ## DA Warner 71 28.507042 121.142746 ## DPMD Jayawardene 53 27.584906 107.787092 ## KC Sangakkara 54 26.407407 106.039838 ## JP Duminy 68 26.294118 114.606717 ## TM Dilshan 78 26.243590 97.910384 ## RG Sharma 65 25.907692 113.056548 ## H Masakadza 53 25.566038 99.453880 ## 1b. Ranking of International T20 bowlers import yorkpy.analytics as yka intlT20RankBowling=yka.rankIntlT20Bowling("C:\\software\\cricket-package\\yorkpyPkg\\data\\data") intlT20RankBowling.head(15) ## matches wicket_mean econrate_mean ## bowler ## Umar Gul 58 1.603448 7.637931 ## SL Malinga 78 1.500000 7.409188 ## Saeed Ajmal 63 1.492063 6.451058 ## DW Steyn 46 1.478261 7.014855 ## A Shrubsole 45 1.422222 6.294444 ## M Morkel 41 1.292683 7.680894 ## KMDN Kulasekara 57 1.280702 7.476608 ## TG Southee 51 1.274510 8.759804 ## SCJ Broad 53 1.264151 inf ## Shakib Al Hasan 58 1.241379 6.836207 ## R Ashwin 44 1.204545 7.162879 ## Nida Dar 44 1.204545 6.083333 ## KH Brunt 44 1.204545 5.982955 ## KD Mills 42 1.166667 8.289683 ## SR Watson 46 1.152174 8.246377 ## 2. Indian Premier League (IPL) T20 The steps to do before ranking for IPL T20 matches are 1. Download IPL T20 zip file from Cricsheet IPL T20 2. Unzip the file. This will create a folder with yaml files import yorkpy.analytics as yka #yka.convertAllYaml2PandasDataframesT20("../ipl","../ipldata") This above step will convert the yaml files into CSV files in the /ipldata folder. Now do the ranking as below ## 2a. Ranking of batsmen in IPL T20 import yorkpy.analytics as yka IPLT20RankBatting=yka.rankIPLT20Batting("C:\\software\\cricket-package\\yorkpyPkg\\data\\ipldata") IPLT20RankBatting.head(15) ## matches runs_mean SR_mean ## batsman ## DA Warner 129 37.589147 119.917864 ## CH Gayle 123 36.723577 125.256818 ## SE Marsh 70 36.314286 114.707578 ## KL Rahul 59 33.542373 123.424971 ## MEK Hussey 60 33.400000 100.439187 ## V Kohli 174 32.413793 115.830849 ## KS Williamson 42 31.690476 120.443172 ## AB de Villiers 143 30.923077 128.967081 ## JC Buttler 45 30.800000 132.561154 ## AM Rahane 118 30.330508 102.240398 ## SR Tendulkar 79 29.949367 101.651959 ## F du Plessis 65 29.415385 112.462114 ## Q de Kock 51 29.333333 110.973836 ## SS Iyer 47 29.170213 102.144222 ## G Gambhir 155 28.741935 103.997558 ## 2b. Ranking of bowlers in IPL T20 import yorkpy.analytics as yka IPLT20RankBowling=yka.rankIPLT20Bowling("C:\\software\\cricket-package\\yorkpyPkg\\data\\ipldata") IPLT20RankBowling.head(15) ## matches wicket_mean econrate_mean ## bowler ## SL Malinga 122 1.540984 7.173361 ## Imran Tahir 43 1.465116 8.155039 ## A Nehra 88 1.375000 7.923295 ## MJ McClenaghan 56 1.339286 8.638393 ## Rashid Khan 46 1.304348 6.543478 ## Sandeep Sharma 79 1.303797 7.860759 ## MM Patel 63 1.301587 7.530423 ## DJ Bravo 131 1.282443 8.458333 ## M Morkel 70 1.257143 7.760714 ## SP Narine 109 1.256881 6.747706 ## YS Chahal 83 1.228916 8.103659 ## R Vinay Kumar 104 1.221154 8.556090 ## RP Singh 82 1.219512 8.149390 ## CH Morris 52 1.211538 7.854167 ## B Kumar 117 1.205128 7.536325 ## 3. Natwest T20 The steps to do before ranking for Natwest T20 matches are 1. Download Natwest T20 zip file from Cricsheet NTB T20 2. Unzip the file. This will create a folder with yaml files import yorkpy.analytics as yka #yka.convertAllYaml2PandasDataframesT20("../ntb","../ntbdata") This above step will convert the yaml files into CSV files in the /ntbdata folder. Now do the ranking as below ## 3a. Ranking of NTB batsmen import yorkpy.analytics as yka NTBT20RankBatting=yka.rankNTBT20Batting("C:\\software\\cricket-package\\yorkpyPkg\\data\\ntbdata") NTBT20RankBatting.head(15) ## matches runs_mean SR_mean ## batsman ## Babar Azam 13 44.461538 121.268809 ## T Banton 13 42.230769 139.376274 ## JJ Roy 12 41.250000 142.182147 ## DJM Short 12 40.250000 131.182294 ## AN Petersen 12 37.916667 132.522727 ## IR Bell 13 37.615385 130.104721 ## M Klinger 26 35.346154 112.682922 ## EJG Morgan 16 35.062500 129.817650 ## AJ Finch 19 34.578947 137.093465 ## MH Wessels 26 33.884615 116.300969 ## S Steel 11 33.545455 140.118207 ## DJ Bell-Drummond 21 33.142857 108.566309 ## Ashar Zaidi 11 33.000000 178.553331 ## DJ Malan 26 33.000000 120.127202 ## T Kohler-Cadmore 23 32.956522 112.493019 ## 3b. Ranking of NTB bowlers import yorkpy.analytics as yka NTBT20RankBowling=yka.rankNTBT20Bowling("C:\\software\\cricket-package\\yorkpyPkg\\data\\ntbdata") NTBT20RankBowling.head(15) ## matches wicket_mean econrate_mean ## bowler ## MW Parkinson 11 2.000000 7.628788 ## HF Gurney 23 1.956522 8.831884 ## GR Napier 12 1.916667 8.694444 ## R Rampaul 19 1.736842 7.131579 ## P Coughlin 11 1.727273 8.909091 ## AJ Tye 26 1.692308 8.227564 ## GC Viljoen 12 1.666667 7.708333 ## BAC Howell 21 1.666667 6.857143 ## BW Sanderson 12 1.583333 7.902778 ## KJ Abbott 14 1.571429 9.398810 ## JE Taylor 13 1.538462 9.839744 ## JDS Neesham 12 1.500000 10.812500 ## MJ Potts 12 1.500000 8.486111 ## TT Bresnan 21 1.476190 8.817460 ## T van der Gugten 13 1.461538 7.211538 ## 4. Big Bash Leagure (BBL) T20 The steps to do before ranking for BBL T20 matches are 1. Download BBL T20 zip file from Cricsheet BBL T20 2. Unzip the file. This will create a folder with yaml files import yorkpy.analytics as yka #yka.convertAllYaml2PandasDataframesT20("../bbl","../bbldata") This above step will convert the yaml files into CSV files in the /bbldata folder. Now do the ranking as below ## 4a. Ranking of BBL batsmen import yorkpy.analytics as yka BBLT20RankBatting=yka.rankBBLT20Batting("C:\\software\\cricket-package\\yorkpyPkg\\data\\bbldata") BBLT20RankBatting.head(15) ## matches runs_mean SR_mean ## batsman ## DJM Short 43 40.883721 118.773047 ## SE Marsh 47 39.148936 113.616053 ## AJ Finch 62 36.306452 120.271231 ## AT Carey 37 34.945946 120.125341 ## UT Khawaja 41 31.268293 107.355655 ## CA Lynn 74 31.162162 121.746578 ## MS Wade 46 30.782609 120.310081 ## TM Head 45 30.000000 126.769564 ## MEK Hussey 23 29.173913 109.492934 ## BJ Hodge 29 29.000000 124.438040 ## BR Dunk 39 28.230769 106.149913 ## AD Hales 31 27.161290 117.678008 ## BB McCullum 34 27.058824 115.486392 ## GJ Bailey 57 27.000000 121.159220 ## MR Marsh 47 26.510638 114.994909 ## 4b. Ranking of BBL bowlers import yorkpy.analytics as yka BBLT20RankBowling=yka.rankBBLT20Bowling("C:\\software\\cricket-package\\yorkpyPkg\\data\\bbldata") BBLT20RankBowling.head(15) ## matches wicket_mean econrate_mean ## bowler ## Yasir Arafat 15 2.000000 7.587778 ## CH Morris 15 1.733333 8.572222 ## TK Curran 27 1.629630 8.716049 ## TT Bresnan 13 1.615385 8.775641 ## JR Hazlewood 18 1.555556 7.361111 ## CJ McKay 15 1.533333 8.555556 ## DR Sams 36 1.527778 8.581019 ## AC McDermott 14 1.500000 9.166667 ## JP Faulkner 20 1.500000 8.345833 ## SP Narine 12 1.500000 7.395833 ## AJ Tye 51 1.490196 8.101307 ## M Kelly 21 1.476190 8.908730 ## SA Abbott 73 1.438356 8.737443 ## B Laughlin 82 1.426829 8.332317 ## SW Tait 31 1.419355 8.895161 ## Conclusion You should be able to now rank players in the above formats as new data is added to Cricsheet. yorkpy can also be used for other leagues which follow the Cricsheet format. To see all posts click Index of posts # Cricpy performs granular analysis of players “Gold medals aren’t really made of gold. They’re made of sweat, determination, & a hard-to-find alloy called guts.” Dan Gable “It doesn’t matter whether you are pursuing success in business, sports, the arts, or life in general: The bridge between wishing and accomplishing is discipline” Harvey Mackay “I won’t predict anything historic. But nothing is impossible.” Michael Phelps ## Introduction In this post, I introduce 2 new functions in my Python package ‘cricpy’ (cricpy v0.20) see Introducing cricpy:A python package to analyze performances of cricketers which enable granular analysis of batsmen and bowlers. They are 1. Step 1: getPlayerDataHA – This function is a wrapper around getPlayerData(), getPlayerDataOD() and getPlayerDataTT(), and adds an extra column ‘homeOrAway’ which says whether the match was played at home/away/neutral venues. A CSV file is created with this new column. 2. Step 2: getPlayerDataOppnHA – This function allows you to slice & dice the data for batsmen and bowlers against specific oppositions, at home/away/neutral venues and between certain periods. This reducedsubset of data can be used to perform analyses. A CSV file is created as an output based on the parameters of opposition, home or away and the interval of time Note All the existing cricpy functions can be used on this smaller fine-grained data set for a closer analysis of players This post has been published in Rpubs and can be accessed at Cricpy performs granular analysis of players You can download a PDF version of this post at Cricpy performs granular analysis of players I have also updated the cricpy template with these lastest changes. See cricpy-template ## 1. Analyzing Rahul Dravid at 3 different stages of his career The following functions analyze Rahul Dravid during 3 different periods of his illustrious career. a) 1st Jan 2001-1st Jan 2002 b) 1st Jan 2004-1st Jan 2005 c) 1st Jan 2009-1st Jan 2010 import cricpy.analytics as ca # Get the homeOrAway dataset for Dravid in matches # Note:Since I have already got the data I reuse the CSV file #df=ca.getPlayerDataHA(28114,tfile="dravidTestHA.csv",matchType="Test") # Get Dravid's data for 2001-02 df1=ca.getPlayerDataOppnHA(infile="dravidTestHA.csv",outfile="dravidTest2001.csv",startDate="2001-01-01",endDate="2002-01-01") # Get Dravid's data for 2004-05 df2=ca.getPlayerDataOppnHA(infile="dravidTestHA.csv",outfile="dravidTest2004.csv", startDate="2004-01-01",endDate="2005-01-01") # Get Dravid's data for 2009-10 df3=ca.getPlayerDataOppnHA(infile="dravidTestHA.csv",outfile="dravidTest2009.csv",startDate="2009-01-01",endDate="2010-01-01") ## 1a. Plot the performance of Dravid at venues during 2001,2004,2009 Note: Any of the cricpy functions can be used on the fine-grained subset of data as below. import cricpy.analytics as ca ca.batsmanAvgRunsGround("dravidTest2001.csv","Dravid-2001") ca.batsmanAvgRunsGround("dravidTest2004.csv","Dravid-2004")  ca.batsmanAvgRunsGround("dravidTest2009.csv","Dravid-2009") ## 1b. Plot the performance of Dravid against different oppositions during 2001,2004,2009 import cricpy.analytics as ca ca.batsmanAvgRunsOpposition("dravidTest2001.csv","Dravid-2001") ca.batsmanAvgRunsOpposition("dravidTest2004.csv","Dravid-2004") ca.batsmanAvgRunsOpposition("dravidTest2009.csv","Dravid-2009") ## 1c. Plot the relative cumulative average and relative strike rate of Dravid in 2001,2004,2009 The plot below compares Dravid’s cumulative strike rate and cumulative average during 3 different stages of his career import cricpy.analytics as ca frames=["dravidTest2001.csv","dravidTest2004.csv","dravidTest2009.csv"] names=["Dravid-2001","Dravid-2004","Dravid-2009"] ca.relativeBatsmanCumulativeAvgRuns(frames,names) ca.relativeBatsmanCumulativeStrikeRate(frames,names) ## 2. Analyzing Virat Kohli’s performance against England in England in 2014 and 2018 The analysis below looks at Kohli’s performance against England in ‘away’ venues (England) in 2014 and 2018 import cricpy.analytics as ca # Get the homeOrAway data for Kohli in Test matches #df=ca.getPlayerDataHA(253802,tfile="kohliTestHA.csv",type="batting",matchType="Test") # Get the homeOrAway data for Kohli in Test matches df=ca.getPlayerDataHA(253802,tfile="kohliTestHA.csv",type="batting",matchType="Test") # Get the subset if data of Kohli's performance against England in England in 2014 df=ca.getPlayerDataOppnHA(infile="kohliTestHA.csv",outfile="kohliTestEng2014.csv", opposition=["England"],homeOrAway=["away"],startDate="2014-01-01",endDate="2015-01-01") # Get the subset if data of Kohli's performance against England in England in 2018 df1=ca.getPlayerDataOppnHA(infile="kohliTestHA.csv",outfile="kohliTestEng2018.csv", opposition=["England"],homeOrAway=["away"],startDate="2018-01-01",endDate="2019-01-01") ## 2a. Kohli’s performance at England grounds in 2014 & 2018 Kohli had a miserable outing to England in 2014 with a string of low scores. In 2018 Kohli pulls himself out of the morass import cricpy.analytics as ca ca.batsmanAvgRunsGround("kohliTestEng2014.csv","Kohli-Eng-2014") ca.batsmanAvgRunsGround("kohliTestEng2018.csv","Kohli-Eng-2018") ## 2a. Kohli’s cumulative average runs in 2014 & 2018 Kohli’s cumulative average runs in 2014 is in the low 15s, while in 2018 it is 70+. Kohli stamps his class back again and undoes the bad memories of 2014 import cricpy.analytics as ca ca.batsmanCumulativeAverageRuns("kohliTestEng2014.csv", "Kohli-Eng-2014") ca.batsmanCumulativeAverageRuns("kohliTestEng2018.csv", "Kohli-Eng-2018") ## 3a. Compare the performances of Ganguly, Dravid and VVS Laxman against opposition in ‘away’ matches in Tests The analyses below compares the performances of Sourav Ganguly, Rahul Dravid and VVS Laxman against Australia, South Africa, and England in ‘away’ venues between 01 Jan 2002 to 01 Jan 2008 import cricpy.analytics as ca #Get the HA data for Ganguly, Dravid and Laxman #df=ca.getPlayerDataHA(28779,tfile="gangulyTestHA.csv",type="batting",matchType="Test") #df=ca.getPlayerDataHA(28114,tfile="dravidTestHA.csv",type="batting",matchType="Test") #df=ca.getPlayerDataHA(30750,tfile="laxmanTestHA.csv",type="batting",matchType="Test") # Slice the data df=ca.getPlayerDataOppnHA(infile="gangulyTestHA.csv",outfile="gangulyTestAES2002-08.csv" ,opposition=["Australia", "England", "South Africa"], homeOrAway=["away"],startDate="2002-01-01",endDate="2008-01-01") df=ca.getPlayerDataOppnHA(infile="dravidTestHA.csv",outfile="dravidTestAES2002-08.csv" ,opposition=["Australia", "England", "South Africa"], homeOrAway=["away"],startDate="2002-01-01",endDate="2008-01-01") df=ca.getPlayerDataOppnHA(infile="laxmanTestHA.csv",outfile="laxmanTestAES2002-08.csv",opposition=["Australia", "England", "South Africa"], homeOrAway=["away"],startDate="2002-01-01",endDate="2008-01-01") ## 3b Plot the relative cumulative average runs and relative cumative strike rate Plot the relative cumulative average runs and relative cumative strike rate of Ganguly, Dravid and Laxman -Dravid towers over Laxman and Ganguly with respect to cumulative average runs. – Ganguly has a superior strike rate followed by Laxman and then Dravid import cricpy.analytics as ca frames=["gangulyTestAES2002-08.csv","dravidTestAES2002-08.csv","laxmanTestAES2002-08.csv"] names=["GangulyAusEngSA2002-08","DravidAusEngSA2002-08","LaxmanAusEngSA2002-08"] ca.relativeBatsmanCumulativeAvgRuns(frames,names) ca.relativeBatsmanCumulativeStrikeRate(frames,names) ## 4. Compare the ODI performances of Rohit Sharma, Joe Root and Kane Williamson against opposition Compare the performances of Rohit Sharma, Joe Root and Kane williamson in away & neutral venues against Australia, West Indies and Soouth Africa • Joe Root piles us the runs in about 15 matches. Rohit has played far more ODIs than the other two and averages a steady 35+ import cricpy.analytics as ca # Get the ODI HA data for Rohit, Root and Williamson #df=ca.getPlayerDataHA(34102,tfile="rohitODIHA.csv",type="batting",matchType="ODI") #df=ca.getPlayerDataHA(303669,tfile="joerootODIHA.csv",type="batting",matchType="ODI") #df=ca.getPlayerDataHA(277906,tfile="williamsonODIHA.csv",type="batting",matchType="ODI") # Subset the data for specific opposition in away and neutral venues ## C:\Users\Ganesh\ANACON~1\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead. ## from pandas.core import datetools df=ca.getPlayerDataOppnHA(infile="rohitODIHA.csv",outfile="rohitODIAusWISA.csv" ,opposition=["Australia", "West Indies", "South Africa"], homeOrAway=["away","neutral"]) df=ca.getPlayerDataOppnHA(infile="joerootODIHA.csv",outfile="joerootODIAusWISA.csv" ,opposition=["Australia", "West Indies", "South Africa"], homeOrAway=["away","neutral"]) df=ca.getPlayerDataOppnHA(infile="williamsonODIHA.csv",outfile="williamsonODIAusWiSA.csv",opposition=["Australia", "West Indies", "South Africa"], homeOrAway=["away","neutral"])  ## 4a. Compare cumulative strike rates and cumulative average runs of Rohit, Root and Williamson The relative cumulative strike rate of all 3 are comparable import cricpy.analytics as ca frames=["rohitODIAusWISA.csv","joerootODIAusWISA.csv","williamsonODIAusWiSA.csv"] names=["Rohit-ODI-AusWISA","Joe Root-ODI-AusWISA","Williamson-ODI-AusWISA"] ca.relativeBatsmanCumulativeAvgRuns(frames,names) ca.relativeBatsmanCumulativeStrikeRate(frames,names) ## 5. Plot the performance of Dhoni in T20s against specific opposition at all venues Plot the performances of Dhoni against Australia, West Indies, South Africa and England import cricpy.analytics as ca # Get the HA T20 data for Dhoni #df=ca.getPlayerDataHA(28081,tfile="dhoniT20HA.csv",type="batting",matchType="T20") #Subset the data df=ca.getPlayerDataOppnHA(infile="dhoniT20HA.csv",outfile="dhoniT20AusWISAEng.csv",opposition=["Australia", "West Indies", "South Africa","England"], homeOrAway=["all"]) ## 5a. Plot Dhoni’s performances in T20 Note You can use any of cricpy’s functions against the fine grained data import cricpy.analytics as ca ca.batsmanAvgRunsOpposition("dhoniT20AusWISAEng.csv","Dhoni") ca.batsmanAvgRunsGround("dhoniT20AusWISAEng.csv","Dhoni") ca.batsmanCumulativeStrikeRate("dhoniT20AusWISAEng.csv","Dhoni") ca.batsmanCumulativeAverageRuns("dhoniT20AusWISAEng.csv","Dhoni") ## 6. Compute and performances of Anil Kumble, Muralitharan and Warne in ‘away’ test matches Compute the performances of Kumble, Warne and Maralitharan against New Zealand, West Indies, South Africa and England in pitches that are not ‘home’ pithes import cricpy.analytics as ca # Get the bowling data for Kumble, Warne and Muralitharan in Test matches #df=ca.getPlayerDataHA(30176,tfile="kumbleTestHA.csv",type="bowling",matchType="Test") #df=ca.getPlayerDataHA(8166,tfile="warneTestHA.csv",type="bowling",matchType="Test") #df=ca.getPlayerDataHA(49636,tfile="muraliTestHA.csv",type="bowling",matchType="Test") # Subset the data df=ca.getPlayerDataOppnHA(infile="kumbleTestHA.csv",outfile="kumbleTest-NZWISAEng.csv",opposition=["New Zealand", "West Indies", "South Africa","England"], homeOrAway=["away"]) df=ca.getPlayerDataOppnHA(infile="warneTestHA.csv",outfile="warneTest-NZWISAEng.csv" ,opposition=["New Zealand", "West Indies", "South Africa","England"], homeOrAway=["away"]) df=ca.getPlayerDataOppnHA(infile="muraliTestHA.csv",outfile="muraliTest-NZWISAEng.csv" ,opposition=["New Zealand", "West Indies", "South Africa","England"], homeOrAway=["away"]) ## 6a. Plot the average wickets of Kumble, Warne and Murali import cricpy.analytics as ca ca.bowlerAvgWktsOpposition("kumbleTest-NZWISAEng.csv","Kumble-NZWISAEng-AN") ca.bowlerAvgWktsOpposition("warneTest-NZWISAEng.csv","Warne-NZWISAEng-AN") ca.bowlerAvgWktsOpposition("muraliTest-NZWISAEng.csv","Murali-NZWISAEng-AN") ## 6b. Plot the average wickets in different grounds of Kumble, Warne and Murali import cricpy.analytics as ca ca.bowlerAvgWktsGround("kumbleTest-NZWISAEng.csv","Kumble") ca.bowlerAvgWktsGround("warneTest-NZWISAEng.csv","Warne") ca.bowlerAvgWktsGround("muraliTest-NZWISAEng.csv","Murali") ## 6c. Plot the cumulative average wickets and cumulative economy rate of Kumble, Warne and Murali • Murali has the best economy rate followed by Kumble and then Warne • Again Murali has the best cumulative average wickets followed by Warne and then Kumble import cricpy.analytics as ca frames=["kumbleTest-NZWISAEng.csv","warneTest-NZWISAEng.csv","muraliTest-NZWISAEng.csv"] names=["Kumble","Warne","Murali"] ca.relativeBowlerCumulativeAvgEconRate(frames,names) ca.relativeBowlerCumulativeAvgWickets(frames,names) ## 7. Compute and plot the performances of Bumrah in 2016, 2017 and 2018 in ODIs import cricpy.analytics as ca # Get the HA data for Bumrah in ODI in bowling #df=ca.getPlayerDataHA(625383,tfile="bumrahODIHA.csv",type="bowling",matchType="ODI") # Slice the data for periods 2016, 2017 and 2018 df=ca.getPlayerDataOppnHA(infile="bumrahODIHA.csv",outfile="bumrahODI2016.csv", startDate="2016-01-01",endDate="2017-01-01") df=ca.getPlayerDataOppnHA(infile="bumrahODIHA.csv",outfile="bumrahODI2017.csv", startDate="2017-01-01",endDate="2018-01-01") df=ca.getPlayerDataOppnHA(infile="bumrahODIHA.csv",outfile="bumrahODI2018.csv", startDate="2018-01-01",endDate="2019-01-01") ## 7a. Compute the performances of Bumrah in 2016, 2017 and 2018 • Very clearly Bumrah is getting better at his art. His economy rate in 2018 is the best!!! • Bumrah has had a very prolific year in 2017. However all the years he seems to be quite effective import cricpy.analytics as ca frames=["bumrahODI2016.csv","bumrahODI2017.csv","bumrahODI2018.csv"] names=["Bumrah-2016","Bumrah-2017","Bumrah-2018"] ca.relativeBowlerCumulativeAvgEconRate(frames,names) ca.relativeBowlerCumulativeAvgWickets(frames,names) ## 8. Compute and plot the performances of Shakib, Bumrah and Jadeja in T20 matches for bowling import cricpy.analytics as ca # Get the HA bowling data for Shakib, Bumrah and Jadeja #df=ca.getPlayerDataHA(56143,tfile="shakibT20HA.csv",type="bowling",matchType="T20") #df=ca.getPlayerDataHA(625383,tfile="bumrahT20HA.csv",type="bowling",matchType="T20") #df=ca.getPlayerDataHA(234675,tfile="jadejaT20HA.csv",type="bowling",matchType="T20") # Slice the data for performances against Sri Lanka, Australia, South Africa and England df=ca.getPlayerDataOppnHA(infile="shakibT20HA.csv",outfile="shakibT20-SLAusSAEng.csv" ,opposition=["Sri Lanka","Australia", "South Africa","England"], homeOrAway=["all"]) df=ca.getPlayerDataOppnHA(infile="bumrahT20HA.csv",outfile="bumrahT20-SLAusSAEng.csv",opposition=["Sri Lanka","Australia", "South Africa","England"], homeOrAway=["all"]) df=ca.getPlayerDataOppnHA(infile="jadejaT20HA.csv",outfile="jadejaT20-SLAusSAEng.csv" ,opposition=["Sri Lanka","Australia", "South Africa","England"], homeOrAway=["all"]) ## 8a. Compare the relative performances of Shakib, Bumrah and Jadeja • Jadeja and Bumrah have comparable economy rates. Shakib is more expensive • Shakib pips Bumrah in number of cumulative wickets, though Bumrah is close behind import cricpy.analytics as ca frames=["shakibT20-SLAusSAEng.csv","bumrahT20-SLAusSAEng.csv","jadejaT20-SLAusSAEng.csv"] names=["Shakib-SLAusSAEng","Bumrah-SLAusSAEng","Jadeja-SLAusSAEng"] ca.relativeBowlerCumulativeAvgEconRate(frames,names) ca.relativeBowlerCumulativeAvgWickets(frames,names) ## Conclusion By getting the homeOrAway data for players using the profileNo, you can slice and dice the data based on your choice of opposition, whether you want matches that were played at home/away/neutral venues. Finally by specifying the period for which the data has to be subsetted you can create fine grained analysis. Hope you have a great time with cricpy!!! To see all posts click Index of posts # Getting started with Tensorflow, Keras in Python and R The Pale Blue Dot “From this distant vantage point, the Earth might not seem of any particular interest. But for us, it’s different. Consider again that dot. That’s here, that’s home, that’s us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and peasant, every young couple in love, every mother and father, hopeful child, inventor and explorer, every teacher of morals, every corrupt politician, every “superstar,” every “supreme leader,” every saint and sinner in the history of our species lived there—on the mote of dust suspended in a sunbeam.” Carl Sagan Tensorflow and Keras are Deep Learning frameworks that really simplify a lot of things to the user. If you are familiar with Machine Learning and Deep Learning concepts then Tensorflow and Keras are really a playground to realize your ideas. In this post I show how you can get started with Tensorflow in both Python and R ### Tensorflow in Python For tensorflow in Python, I found Google’s Colab an ideal environment for running your Deep Learning code. This is an Google’s research project where you can execute your code on GPUs, TPUs etc ### Tensorflow in R (RStudio) To execute tensorflow in R (RStudio) you need to install tensorflow and keras as shown below In this post I show how to get started with Tensorflow and Keras in R. # Install Tensorflow in RStudio #install_tensorflow() # Install Keras #install_packages("keras") library(tensorflow) libary(keras) This post takes 3 different Machine Learning problems and uses the Tensorflow/Keras framework to solve it Note: You can view the Google Colab notebook at Tensorflow in Python The RMarkdown file has been published at RPubs and can be accessed at Getting started with Tensorflow in R Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($14.99) and in kindle version($9.99/Rs449). ## 1. Multivariate regression with Tensorflow – Python This code performs multivariate regression using Tensorflow and keras on the advent of Parkinson disease through sound recordings see Parkinson Speech Dataset with Multiple Types of Sound Recordings Data Set . The clinician’s motorUPDRS score has to be predicted from the set of features In [0]: # Import tensorflow import tensorflow as tf from tensorflow import keras  In [2]: #Get the data rom the UCI Machine Learning repository dataset = keras.utils.get_file("parkinsons_updrs.data", "https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data")  Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data 917504/911261 [==============================] - 0s 0us/step  In [3]: # Read the CSV file import pandas as pd parkinsons = pd.read_csv(dataset, na_values = "?", comment='\t', sep=",", skipinitialspace=True) print(parkinsons.shape) print(parkinsons.columns) #Check if there are any NAs in the rows parkinsons.isna().sum()  (5875, 22) Index(['subject#', 'age', 'sex', 'test_time', 'motor_UPDRS', 'total_UPDRS', 'Jitter(%)', 'Jitter(Abs)', 'Jitter:RAP', 'Jitter:PPQ5', 'Jitter:DDP', 'Shimmer', 'Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5', 'Shimmer:APQ11', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA', 'PPE'], dtype='object')  Out[3]: subject# 0 age 0 sex 0 test_time 0 motor_UPDRS 0 total_UPDRS 0 Jitter(%) 0 Jitter(Abs) 0 Jitter:RAP 0 Jitter:PPQ5 0 Jitter:DDP 0 Shimmer 0 Shimmer(dB) 0 Shimmer:APQ3 0 Shimmer:APQ5 0 Shimmer:APQ11 0 Shimmer:DDA 0 NHR 0 HNR 0 RPDE 0 DFA 0 PPE 0 dtype: int64 Note: To see how to create dummy variables see my post Practical Machine Learning with R and Python – Part 2 In [4]: # Drop the columns subject number as it is not relevant parkinsons1=parkinsons.drop(['subject#'],axis=1) # Create dummy variables for sex (M/F) parkinsons2=pd.get_dummies(parkinsons1,columns=['sex']) parkinsons2.head() Out[4] age test_time motor_UPDRS total_UPDRS Jitter(%) Jitter(Abs) Jitter:RAP Jitter:PPQ5 Jitter:DDP Shimmer Shimmer(dB) Shimmer:APQ3 Shimmer:APQ5 Shimmer:APQ11 Shimmer:DDA NHR HNR RPDE DFA PPE sex_0 sex_1 0 72 5.6431 28.199 34.398 0.00662 0.000034 0.00401 0.00317 0.01204 0.02565 0.230 0.01438 0.01309 0.01662 0.04314 0.014290 21.640 0.41888 0.54842 0.16006 1 0 1 72 12.6660 28.447 34.894 0.00300 0.000017 0.00132 0.00150 0.00395 0.02024 0.179 0.00994 0.01072 0.01689 0.02982 0.011112 27.183 0.43493 0.56477 0.10810 1 0 2 72 19.6810 28.695 35.389 0.00481 0.000025 0.00205 0.00208 0.00616 0.01675 0.181 0.00734 0.00844 0.01458 0.02202 0.020220 23.047 0.46222 0.54405 0.21014 1 0 3 72 25.6470 28.905 35.810 0.00528 0.000027 0.00191 0.00264 0.00573 0.02309 0.327 0.01106 0.01265 0.01963 0.03317 0.027837 24.445 0.48730 0.57794 0.33277 1 0 4 72 33.6420 29.187 36.375 0.00335 0.000020 0.00093 0.00130 0.00278 0.01703 0.176 0.00679 0.00929 0.01819 0.02036 0.011625 26.126 0.47188 0.56122 0.19361 1 0  # Create a training and test data set with 80%/20% train_dataset = parkinsons2.sample(frac=0.8,random_state=0) test_dataset = parkinsons2.drop(train_dataset.index) # Select columns train_dataset1= train_dataset[['age', 'test_time', 'Jitter(%)', 'Jitter(Abs)', 'Jitter:RAP', 'Jitter:PPQ5', 'Jitter:DDP', 'Shimmer', 'Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5', 'Shimmer:APQ11', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA', 'PPE', 'sex_0', 'sex_1']] test_dataset1= test_dataset[['age','test_time', 'Jitter(%)', 'Jitter(Abs)', 'Jitter:RAP', 'Jitter:PPQ5', 'Jitter:DDP', 'Shimmer', 'Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5', 'Shimmer:APQ11', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA', 'PPE', 'sex_0', 'sex_1']]  In [7]: # Generate the statistics of the columns for use in normalization of the data train_stats = train_dataset1.describe() train_stats = train_stats.transpose() train_stats  Out[7]: count mean std min 25% 50% 75% max age 4700.0 64.792766 8.870401 36.000000 58.000000 65.000000 72.000000 85.000000 test_time 4700.0 93.399490 53.630411 -4.262500 46.852250 93.405000 139.367500 215.490000 Jitter(%) 4700.0 0.006136 0.005612 0.000830 0.003560 0.004900 0.006770 0.099990 Jitter(Abs) 4700.0 0.000044 0.000036 0.000002 0.000022 0.000034 0.000053 0.000396 Jitter:RAP 4700.0 0.002969 0.003089 0.000330 0.001570 0.002235 0.003260 0.057540 Jitter:PPQ5 4700.0 0.003271 0.003760 0.000430 0.001810 0.002480 0.003460 0.069560 Jitter:DDP 4700.0 0.008908 0.009267 0.000980 0.004710 0.006705 0.009790 0.172630 Shimmer 4700.0 0.033992 0.025922 0.003060 0.019020 0.027385 0.039810 0.268630 Shimmer(dB) 4700.0 0.310487 0.231016 0.026000 0.175000 0.251000 0.363250 2.107000 Shimmer:APQ3 4700.0 0.017125 0.013275 0.001610 0.009190 0.013615 0.020562 0.162670 Shimmer:APQ5 4700.0 0.020151 0.016848 0.001940 0.010750 0.015785 0.023733 0.167020 Shimmer:APQ11 4700.0 0.027508 0.020270 0.002490 0.015630 0.022685 0.032713 0.275460 Shimmer:DDA 4700.0 0.051375 0.039826 0.004840 0.027567 0.040845 0.061683 0.488020 NHR 4700.0 0.032116 0.060206 0.000304 0.010827 0.018403 0.031452 0.748260 HNR 4700.0 21.704631 4.288853 1.659000 19.447750 21.973000 24.445250 37.187000 RPDE 4700.0 0.542549 0.100212 0.151020 0.471235 0.543490 0.614335 0.966080 DFA 4700.0 0.653015 0.070446 0.514040 0.596470 0.643285 0.710618 0.865600 PPE 4700.0 0.219559 0.091506 0.021983 0.156470 0.205340 0.264017 0.731730 sex_0 4700.0 0.681489 0.465948 0.000000 0.000000 1.000000 1.000000 1.000000 sex_1 4700.0 0.318511 0.465948 0.000000 0.000000 0.000000 1.000000 1.000000 In [0]: # Create the target variable train_labels = train_dataset.pop('motor_UPDRS') test_labels = test_dataset.pop('motor_UPDRS')  In [0]: # Normalize the data by subtracting the mean and dividing by the standard deviation def normalize(x): return (x - train_stats['mean']) / train_stats['std'] # Create normalized training and test data normalized_train_data = normalize(train_dataset1) normalized_test_data = normalize(test_dataset1)  In [0]: # Create a Deep Learning model with keras model = tf.keras.Sequential([ keras.layers.Dense(6, activation=tf.nn.relu, input_shape=[len(train_dataset1.keys())]), keras.layers.Dense(9, activation=tf.nn.relu), keras.layers.Dense(6,activation=tf.nn.relu), keras.layers.Dense(1) ]) # Use the Adam optimizer with a learning rate of 0.01 optimizer=keras.optimizers.Adam(lr=.01, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False) # Set the metrics required to be Mean Absolute Error and Mean Squared Error.For regression, the loss is mean_squared_error model.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['mean_absolute_error', 'mean_squared_error'])  In [0]: # Create a model history=model.fit( normalized_train_data, train_labels, epochs=1000, validation_data = (normalized_test_data,test_labels), verbose=0)  In [26]: hist = pd.DataFrame(history.history) hist['epoch'] = history.epoch hist.tail()  Out[26]: loss mean_absolute_error mean_squared_error val_loss val_mean_absolute_error val_mean_squared_error epoch 995 15.773989 2.936990 15.773988 16.980803 3.028168 16.980803 995 996 15.238623 2.873420 15.238622 17.458752 3.101033 17.458752 996 997 15.437594 2.895500 15.437593 16.926016 2.971508 16.926018 997 998 15.867891 2.943521 15.867892 16.950249 2.985036 16.950249 998 999 15.846878 2.938914 15.846880 17.095623 3.014504 17.095625 999 In [30]: def plot_history(history): hist = pd.DataFrame(history.history) hist['epoch'] = history.epoch plt.figure() plt.xlabel('Epoch') plt.ylabel('Mean Abs Error') plt.plot(hist['epoch'], hist['mean_absolute_error'], label='Train Error') plt.plot(hist['epoch'], hist['val_mean_absolute_error'], label = 'Val Error') plt.ylim([2,5]) plt.legend() plt.figure() plt.xlabel('Epoch') plt.ylabel('Mean Square Error ') plt.plot(hist['epoch'], hist['mean_squared_error'], label='Train Error') plt.plot(hist['epoch'], hist['val_mean_squared_error'], label = 'Val Error') plt.ylim([10,40]) plt.legend() plt.show() plot_history(history)  ### Observation It can be seen that the mean absolute error is on an average about +/- 4.0. The validation error also is about the same. This can be reduced by playing around with the hyperparamaters and increasing the number of iterations ### 1a. Multivariate Regression in Tensorflow – R # Install Tensorflow in RStudio #install_tensorflow() # Install Keras #install_packages("keras") library(tensorflow) library(keras)  library(dplyr) library(dummies) ## dummies-1.5.6 provided by Decision Patterns library(tensorflow) library(keras) ## Multivariate regression This code performs multivariate regression using Tensorflow and keras on the advent of Parkinson disease through sound recordings see Parkinson Speech Dataset with Multiple Types of Sound Recordings Data Set. The clinician’s motorUPDRS score has to be predicted from the set of features. ### Read the data # Download the Parkinson's data from UCI Machine Learning repository dataset <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data") # Set the column names names(dataset) <- c("subject","age", "sex", "test_time","motor_UPDRS","total_UPDRS","Jitter","Jitter.Abs", "Jitter.RAP","Jitter.PPQ5","Jitter.DDP","Shimmer", "Shimmer.dB", "Shimmer.APQ3", "Shimmer.APQ5","Shimmer.APQ11","Shimmer.DDA", "NHR","HNR", "RPDE", "DFA","PPE") # Remove the column 'subject' as it is not relevant to analysis dataset1 <- subset(dataset, select = -c(subject)) # Make the column 'sex' as a factor for using dummies dataset1$sex=as.factor(dataset1sex) # Add dummy variables for categorical cariable 'sex' dataset2 <- dummy.data.frame(dataset1, sep = ".") ## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = ## FALSE): non-list contrasts argument ignored dataset3 <- na.omit(dataset2) ### Split the data as training and test in 80/20 ## Split data 80% training and 20% test sample_size <- floor(0.8 * nrow(dataset3)) ## set the seed to make your partition reproducible set.seed(12) train_index <- sample(seq_len(nrow(dataset3)), size = sample_size) train_dataset <- dataset3[train_index, ] test_dataset <- dataset3[-train_index, ] train_data <- train_dataset %>% select(sex.0,sex.1,age, test_time,Jitter,Jitter.Abs,Jitter.PPQ5,Jitter.DDP, Shimmer, Shimmer.dB,Shimmer.APQ3,Shimmer.APQ11, Shimmer.DDA,NHR,HNR,RPDE,DFA,PPE) train_labels <- select(train_dataset,motor_UPDRS) test_data <- test_dataset %>% select(sex.0,sex.1,age, test_time,Jitter,Jitter.Abs,Jitter.PPQ5,Jitter.DDP, Shimmer, Shimmer.dB,Shimmer.APQ3,Shimmer.APQ11, Shimmer.DDA,NHR,HNR,RPDE,DFA,PPE) test_labels <- select(test_dataset,motor_UPDRS) ## Normalize the data  # Normalize the data by subtracting the mean and dividing by the standard deviation normalize<-function(x) { y<-(x - mean(x)) / sd(x) return(y) } normalized_train_data <-apply(train_data,2,normalize) # Convert to matrix train_labels <- as.matrix(train_labels) normalized_test_data <- apply(test_data,2,normalize) test_labels <- as.matrix(test_labels) ### Create the Deep Learning Model model <- keras_model_sequential() model %>% layer_dense(units = 6, activation = 'relu', input_shape = dim(normalized_train_data)[2]) %>% layer_dense(units = 9, activation = 'relu') %>% layer_dense(units = 6, activation = 'relu') %>% layer_dense(units = 1) # Set the metrics required to be Mean Absolute Error and Mean Squared Error.For regression, the loss is # mean_squared_error model %>% compile( loss = 'mean_squared_error', optimizer = optimizer_rmsprop(), metrics = c('mean_absolute_error','mean_squared_error') ) # Fit the model # Use the test data for validation history <- model %>% fit( normalized_train_data, train_labels, epochs = 30, batch_size = 128, validation_data = list(normalized_test_data,test_labels) ) ### Plot mean squared error, mean absolute error and loss for training data and test data plot(history)  Fig1 ## 2. Binary classification in Tensorflow – Python This is a simple binary classification problem from UCI Machine Learning repository and deals with data on Breast cancer from the Univ. of Wisconsin Breast Cancer Wisconsin (Diagnostic) Data Set bold text In [31]: import tensorflow as tf from tensorflow import keras import pandas as pd # Read the data set from UCI ML site dataset_path = keras.utils.get_file("breast-cancer-wisconsin.data", "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data") raw_dataset = pd.read_csv(dataset_path, sep=",", na_values = "?", skipinitialspace=True,) dataset = raw_dataset.copy() #Check for Null and drop dataset.isna().sum() dataset = dataset.dropna() dataset.isna().sum() # Set the column names dataset.columns = ["id","thickness", "cellsize", "cellshape","adhesion","epicellsize", "barenuclei","chromatin","normalnucleoli","mitoses","class"] dataset.head()  Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data 24576/19889 [=====================================] - 0s 1us/step id thickness cellsize cellshape adhesion epicellsize barenuclei chromatin normalnucleoli mitoses class 0 1002945 5 4 4 5 7 10.0 3 2 1 2 1 1015425 3 1 1 1 2 2.0 3 1 1 2 2 1016277 6 8 8 1 3 4.0 3 7 1 2 3 1017023 4 1 1 3 2 1.0 3 1 1 2 4 1017122 8 10 10 8 7 10.0 9 7 1 4 # Create a training/test set in the ratio 80/20 train_dataset = dataset.sample(frac=0.8,random_state=0) test_dataset = dataset.drop(train_dataset.index) # Set the training and test set train_dataset1= train_dataset[['thickness','cellsize','cellshape','adhesion', 'epicellsize', 'barenuclei', 'chromatin', 'normalnucleoli','mitoses']] test_dataset1=test_dataset[['thickness','cellsize','cellshape','adhesion', 'epicellsize', 'barenuclei', 'chromatin', 'normalnucleoli','mitoses']]  In [34]: # Generate the stats for each column to be used for normalization train_stats = train_dataset1.describe() train_stats = train_stats.transpose() train_stats  Out[34]: count mean std min 25% 50% 75% max thickness 546.0 4.430403 2.812768 1.0 2.0 4.0 6.0 10.0 cellsize 546.0 3.179487 3.083668 1.0 1.0 1.0 5.0 10.0 cellshape 546.0 3.225275 3.005588 1.0 1.0 1.0 5.0 10.0 adhesion 546.0 2.921245 2.937144 1.0 1.0 1.0 4.0 10.0 epicellsize 546.0 3.261905 2.252643 1.0 2.0 2.0 4.0 10.0 barenuclei 546.0 3.560440 3.651946 1.0 1.0 1.0 7.0 10.0 chromatin 546.0 3.483516 2.492687 1.0 2.0 3.0 5.0 10.0 normalnucleoli 546.0 2.875458 3.064305 1.0 1.0 1.0 4.0 10.0 mitoses 546.0 1.609890 1.736762 1.0 1.0 1.0 1.0 10.0 In [0]: # Create target variables train_labels = train_dataset.pop('class') test_labels = test_dataset.pop('class')  In [0]: # Set the target variables as 0 or 1 train_labels[train_labels==2] =0 # benign train_labels[train_labels==4] =1 # malignant test_labels[test_labels==2] =0 # benign test_labels[test_labels==4] =1 # malignant  In [0]: # Normalize by subtracting mean and dividing by standard deviation def normalize(x): return (x - train_stats['mean']) / train_stats['std'] # Convert columns to numeric train_dataset1 = train_dataset1.apply(pd.to_numeric) test_dataset1 = test_dataset1.apply(pd.to_numeric) # Normalize normalized_train_data = normalize(train_dataset1) normalized_test_data = normalize(test_dataset1)  In [0]: # Create a model model = tf.keras.Sequential([ keras.layers.Dense(6, activation=tf.nn.relu, input_shape=[len(train_dataset1.keys())]), keras.layers.Dense(9, activation=tf.nn.relu), keras.layers.Dense(6,activation=tf.nn.relu), keras.layers.Dense(1) ]) # Use the RMSProp optimizer optimizer = tf.keras.optimizers.RMSprop(0.01) # Since this is binary classification use binary_crossentropy model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['acc']) # Fit a model history=model.fit( normalized_train_data, train_labels, epochs=1000, validation_data=(normalized_test_data,test_labels), verbose=0)  In [55]: hist = pd.DataFrame(history.history) hist['epoch'] = history.epoch hist.tail()  loss acc val_loss val_acc epoch 995 0.112499 0.992674 0.454739 0.970588 995 996 0.112499 0.992674 0.454739 0.970588 996 997 0.112499 0.992674 0.454739 0.970588 997 998 0.112499 0.992674 0.454739 0.970588 998 999 0.112499 0.992674 0.454739 0.970588 999 In [58]: # Plot training and test accuracy plt.plot(history.history['acc']) plt.plot(history.history['val_acc']) plt.title('model accuracy') plt.ylabel('accuracy') plt.xlabel('epoch') plt.legend(['train', 'test'], loc='upper left') plt.ylim([0.9,1]) plt.show() # Plot training and test loss plt.plot(history.history['loss']) plt.plot(history.history['val_loss']) plt.title('model loss') plt.ylabel('loss') plt.xlabel('epoch') plt.legend(['train', 'test'], loc='upper left') plt.ylim([0,0.5]) plt.show()  ### 2a. Binary classification in Tensorflow -R This is a simple binary classification problem from UCI Machine Learning repository and deals with data on Breast cancer from the Univ. of Wisconsin Breast Cancer Wisconsin (Diagnostic) Data Set # Read the data for Breast cancer (Wisconsin) dataset <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data") # Rename the columns names(dataset) <- c("id","thickness", "cellsize", "cellshape","adhesion","epicellsize", "barenuclei","chromatin","normalnucleoli","mitoses","class") # Remove the columns id and class dataset1 <- subset(dataset, select = -c(id, class)) dataset2 <- na.omit(dataset1) # Convert the column to numeric dataset2barenuclei <- as.numeric(dataset2$barenuclei) ## Normalize the data train_data <-apply(dataset2,2,normalize) train_labels <- as.matrix(select(dataset,class)) # Set the target variables as 0 or 1 as it binary classification train_labels[train_labels==2,]=0 train_labels[train_labels==4,]=1 ### Create the Deep Learning model model <- keras_model_sequential() model %>% layer_dense(units = 6, activation = 'relu', input_shape = dim(train_data)[2]) %>% layer_dense(units = 9, activation = 'relu') %>% layer_dense(units = 6, activation = 'relu') %>% layer_dense(units = 1) # Since this is a binary classification we use binary cross entropy model %>% compile( loss = 'binary_crossentropy', optimizer = optimizer_rmsprop(), metrics = c('accuracy') # Metrics is accuracy ) ### Fit the model. Use 20% of data for validation history <- model %>% fit( train_data, train_labels, epochs = 30, batch_size = 128, validation_split = 0.2 ) ### Plot the accuracy and loss for training and validation data plot(history)  ### 3. MNIST in Tensorflow – Python This takes the famous MNIST handwritten digits . It ca be seen that Tensorflow and Keras make short work of this famous problem of the late 1980s # Download MNIST data mnist=tf.keras.datasets.mnist # Set training and test data and labels (training_images,training_labels),(test_images,test_labels)=mnist.load_data() print(training_images.shape) print(test_images.shape)  (60000, 28, 28) (10000, 28, 28)  In [61]: # Plot a sample image from MNIST and show contents import matplotlib.pyplot as plt plt.imshow(training_images[1]) print(training_images[1]) [[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 159 253 159 50 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 238 252 252 252 237 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 54 227 253 252 239 233 252 57 6 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 10 60 224 252 253 252 202 84 252 253 122 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 163 252 252 252 253 252 252 96 189 253 167 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 51 238 253 253 190 114 253 228 47 79 255 168 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 48 238 252 252 179 12 75 121 21 0 0 253 243 50 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 38 165 253 233 208 84 0 0 0 0 0 0 253 252 165 0 0 0 0 0] [ 0 0 0 0 0 0 0 7 178 252 240 71 19 28 0 0 0 0 0 0 253 252 195 0 0 0 0 0] [ 0 0 0 0 0 0 0 57 252 252 63 0 0 0 0 0 0 0 0 0 253 252 195 0 0 0 0 0] [ 0 0 0 0 0 0 0 198 253 190 0 0 0 0 0 0 0 0 0 0 255 253 196 0 0 0 0 0] [ 0 0 0 0 0 0 76 246 252 112 0 0 0 0 0 0 0 0 0 0 253 252 148 0 0 0 0 0] [ 0 0 0 0 0 0 85 252 230 25 0 0 0 0 0 0 0 0 7 135 253 186 12 0 0 0 0 0] [ 0 0 0 0 0 0 85 252 223 0 0 0 0 0 0 0 0 7 131 252 225 71 0 0 0 0 0 0] [ 0 0 0 0 0 0 85 252 145 0 0 0 0 0 0 0 48 165 252 173 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 86 253 225 0 0 0 0 0 0 114 238 253 162 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 85 252 249 146 48 29 85 178 225 253 223 167 56 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 85 252 252 252 229 215 252 252 252 196 130 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 28 199 252 252 253 252 252 233 145 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 25 128 252 253 252 141 37 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]  # Normalize the images by dividing by 255.0 training_images = training_images/255.0 test_images = test_images/255.0 # Create a Sequential Keras model model = tf.keras.models.Sequential([tf.keras.layers.Flatten(), tf.keras.layers.Dense(1024,activation=tf.nn.relu), tf.keras.layers.Dense(10,activation=tf.nn.softmax)]) model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])  In [68]: history=model.fit(training_images,training_labels,validation_data=(test_images, test_labels), epochs=5, verbose=1)  Train on 60000 samples, validate on 10000 samples Epoch 1/5 60000/60000 [==============================] - 17s 291us/sample - loss: 0.0020 - acc: 0.9999 - val_loss: 0.0719 - val_acc: 0.9810 Epoch 2/5 60000/60000 [==============================] - 17s 284us/sample - loss: 0.0021 - acc: 0.9998 - val_loss: 0.0705 - val_acc: 0.9821 Epoch 3/5 60000/60000 [==============================] - 17s 286us/sample - loss: 0.0017 - acc: 0.9999 - val_loss: 0.0729 - val_acc: 0.9805 Epoch 4/5 60000/60000 [==============================] - 17s 284us/sample - loss: 0.0014 - acc: 0.9999 - val_loss: 0.0762 - val_acc: 0.9804 Epoch 5/5 60000/60000 [==============================] - 17s 280us/sample - loss: 0.0015 - acc: 0.9999 - val_loss: 0.0735 - val_acc: 0.9812 Fig 1 Fig 2 ## MNIST in Tensorflow – R The following code uses Tensorflow to learn MNIST’s handwritten digits ### Load MNIST data mnist <- dataset_mnist() x_train <- mnist$train$x y_train <- mnist$train$y x_test <- mnist$test$x y_test <- mnist$test$y ### Reshape and rescale # Reshape the array x_train <- array_reshape(x_train, c(nrow(x_train), 784)) x_test <- array_reshape(x_test, c(nrow(x_test), 784)) # Rescale x_train <- x_train / 255 x_test <- x_test / 255 ### Convert out put to One Hot encoded format y_train <- to_categorical(y_train, 10) y_test <- to_categorical(y_test, 10) ### Fit the model Use the softmax activation for recognizing 10 digits and categorical cross entropy for loss model <- keras_model_sequential() model %>% layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% layer_dense(units = 128, activation = 'relu') %>% layer_dense(units = 10, activation = 'softmax') # Use softmax model %>% compile( loss = 'categorical_crossentropy', optimizer = optimizer_rmsprop(), metrics = c('accuracy') ) ### Fit the model Note: A smaller number of epochs has been used. For better performance increase number of epochs history <- model %>% fit( x_train, y_train, epochs = 5, batch_size = 128, validation_data = list(x_test,y_test) ) ### Plot the accuracy and loss for training and test data plot(history)  Conclusion This post shows how to use Tensorflow and Keras in both Python & R Hope you have fun with Tensorflow!! To see all posts click Index of posts # Analyze cricket players and cricket teams with cricpy template # Introduction This post shows how you can analyze batsmen, bowlers see Introducing cricpy:A python package to analyze performances of cricketers and cricket teams see Cricpy adds team analytics to its arsenal! in Test, ODI and T20s using cricpy templates, with data from ESPN Cricinfo. # The cricpy package ## A. Analyzing batsmen and bowlers in Test, ODI and T20s The data for a particular player can be obtained with the getPlayerData() function. To do you will need to go to ESPN CricInfo Player and type in the name of the player for e.g Rahul Dravid, Virat Kohli, Alastair Cook etc. This will bring up a page which have the profile number for the player e.g. for Rahul Dravid this would be http://www.espncricinfo.com/india/content/player/28114.html. Hence, Dravid’s profile is 28114. This can be used to get the data for Rahul Dravid as shown below and select the player you want Please mindful of the ESPN Cricinfo Terms of Use My posts on Cripy were You can clone/download this cricpy template for your own analysis of players. This can be done using RStudio or IPython notebooks The cricpy package is now available with pip install cricpy!!! ## 1 Importing cricpy – Python # Install the package # Do a pip install cricpy # Import cricpy import cricpy.analytics as ca  ## C:\Users\Ganesh\ANACON~1\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead. ## from pandas.core import datetools ## 2. Invoking functions with Python package cricpy import cricpy.analytics as ca #ca.batsman4s("aplayer.csv","A Player") # 3. Getting help from cricpy – Python import cricpy.analytics as ca #help(ca.getPlayerData) The details below will introduce the different functions that are available in cricpy. ## 4. Get the player data for a player using the function getPlayerData() Important Note This needs to be done only once for a player. This function stores the player’s data in the specified CSV file (for e.g. dravid.csv as above) which can then be reused for all other functions). Once we have the data for the players many analyses can be done. This post will use the stored CSV file obtained with a prior getPlayerData for all subsequent analyses ## 4a. For Test players import cricpy.analytics as ca #player1 =ca.getPlayerData(profileNo1,dir="..",file="player1.csv",type="batting",homeOrAway=[1,2], result=[1,2,4]) #player1 =ca.getPlayerData(profileNo2,dir="..",file="player2.csv",type="batting",homeOrAway=[1,2], result=[1,2,4]) ## 4b. For ODI players import cricpy.analytics as ca #player1 =ca.getPlayerDataOD(profileNo1,dir="..",file="player1.csv",type="batting") #player1 =ca.getPlayerDataOD(profileNo2,dir="..",file="player2.csv",type="batting"") ## 4c For T20 players import cricpy.analytics as ca #player1 =ca.getPlayerDataTT(profileNo1,dir="..",file="player1.csv",type="batting") #player1 =ca.getPlayerDataTT(profileNo2,dir="..",file="player2.csv",type="batting"") ## 5 A Player’s performance – Basic Analyses The 3 plots below provide the following for Rahul Dravid 1. Frequency percentage of runs in each run range over the whole career 2. Mean Strike Rate for runs scored in the given range 3. A histogram of runs frequency percentages in runs ranges  import cricpy.analytics as ca import matplotlib.pyplot as plt #ca.batsmanRunsFreqPerf("aplayer.csv","A Player") #ca.batsmanMeanStrikeRate("aplayer.csv","A Player") #ca.batsmanRunsRanges("aplayer.csv","A Player")  ## 6. More analyses This gives details on the batsmen’s 4s, 6s and dismissals import cricpy.analytics as ca #ca.batsman4s("aplayer.csv","A Player") #ca.batsman6s("aplayer.csv","A Player") #ca.batsmanDismissals("aplayer.csv","A Player") # The below function is for ODI and T20 only #ca.batsmanScoringRateODTT("./kohli.csv","Virat Kohli")  ## 7. 3D scatter plot and prediction plane The plots below show the 3D scatter plot of Runs versus Balls Faced and Minutes at crease. A linear regression plane is then fitted between Runs and Balls Faced + Minutes at crease import cricpy.analytics as ca #ca.battingPerf3d("aplayer.csv","A Player") ## 8. Average runs at different venues The plot below gives the average runs scored at different grounds. The plot also the number of innings at each ground as a label at x-axis. import cricpy.analytics as ca #ca.batsmanAvgRunsGround("aplayer.csv","A Player") ## 9. Average runs against different opposing teams This plot computes the average runs scored against different countries. import cricpy.analytics as ca #ca.batsmanAvgRunsOpposition("aplayer.csv","A Player") ## 10. Highest Runs Likelihood The plot below shows the Runs Likelihood for a batsman. import cricpy.analytics as ca #ca.batsmanRunsLikelihood("aplayer.csv","A Player") # 11. A look at the Top 4 batsman Choose any number of players 1.Player1 2.Player2 3.Player3 … The following plots take a closer at their performances. The box plots show the median the 1st and 3rd quartile of the runs ## 12. Box Histogram Plot This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency import cricpy.analytics as ca #ca.batsmanPerfBoxHist("aplayer001.csv","A Player001") #ca.batsmanPerfBoxHist("aplayer002.csv","A Player002") #ca.batsmanPerfBoxHist("aplayer003.csv","A Player003") #ca.batsmanPerfBoxHist("aplayer004.csv","A Player004") ## 13. get Player Data special import cricpy.analytics as ca #player1sp = ca.getPlayerDataSp(profile1,tdir=".",tfile="player1sp.csv",ttype="batting") #player2sp = ca.getPlayerDataSp(profile2,tdir=".",tfile="player2sp.csv",ttype="batting") #player3sp = ca.getPlayerDataSp(profile3,tdir=".",tfile="player3sp.csv",ttype="batting") #player4sp = ca.getPlayerDataSp(profile4,tdir=".",tfile="player4sp.csv",ttype="batting") ## 14. Contribution to won and lost matches Note:This can only be used for Test matches import cricpy.analytics as ca #ca.batsmanContributionWonLost("player1sp.csv","A Player001") #ca.batsmanContributionWonLost("player2sp.csv","A Player002") #ca.batsmanContributionWonLost("player3sp.csv","A Player003") #ca.batsmanContributionWonLost("player4sp.csv","A Player004") ## 15. Performance at home and overseas Note:This can only be used for Test matches This function also requires the use of getPlayerDataSp() as shown above import cricpy.analytics as ca #ca.batsmanPerfHomeAway("player1sp.csv","A Player001") #ca.batsmanPerfHomeAway("player2sp.csv","A Player002") #ca.batsmanPerfHomeAway("player3sp.csv","A Player003") #ca.batsmanPerfHomeAway("player4sp.csv","A Player004") ## 16 Moving Average of runs in career import cricpy.analytics as ca #ca.batsmanMovingAverage("aplayer001.csv","A Player001") #ca.batsmanMovingAverage("aplayer002.csv","A Player002") #ca.batsmanMovingAverage("aplayer003.csv","A Player003") #ca.batsmanMovingAverage("aplayer004.csv","A Player004") ## 17 Cumulative Average runs of batsman in career This function provides the cumulative average runs of the batsman over the career. import cricpy.analytics as ca #ca.batsmanCumulativeAverageRuns("aplayer001.csv","A Player001") #ca.batsmanCumulativeAverageRuns("aplayer002.csv","A Player002") #ca.batsmanCumulativeAverageRuns("aplayer003.csv","A Player003") #ca.batsmanCumulativeAverageRuns("aplayer004.csv","A Player004") ## 18 Cumulative Average strike rate of batsman in career . import cricpy.analytics as ca #ca.batsmanCumulativeStrikeRate("aplayer001.csv","A Player001") #ca.batsmanCumulativeStrikeRate("aplayer002.csv","A Player002") #ca.batsmanCumulativeStrikeRate("aplayer003.csv","A Player003") #ca.batsmanCumulativeStrikeRate("aplayer004.csv","A Player004") ## 19 Future Runs forecast import cricpy.analytics as ca #ca.batsmanPerfForecast("aplayer001.csv","A Player001") ## 20 Relative Batsman Cumulative Average Runs The plot below compares the Relative cumulative average runs of the batsman for each of the runs ranges of 10 and plots them. import cricpy.analytics as ca frames = ["aplayer1.csv","aplayer2.csv","aplayer3.csv","aplayer4.csv"] names = ["A Player1","A Player2","A Player3","A Player4"] #ca.relativeBatsmanCumulativeAvgRuns(frames,names) ## 21 Plot of 4s and 6s import cricpy.analytics as ca frames = ["aplayer1.csv","aplayer2.csv","aplayer3.csv","aplayer4.csv"] names = ["A Player1","A Player2","A Player3","A Player4"] #ca.batsman4s6s(frames,names) ## 22. Relative Batsman Strike Rate The plot below gives the relative Runs Frequency Percetages for each 10 run bucket. The plot below show import cricpy.analytics as ca frames = ["aplayer1.csv","aplayer2.csv","aplayer3.csv","aplayer4.csv"] names = ["A Player1","A Player2","A Player3","A Player4"] #ca.relativeBatsmanCumulativeStrikeRate(frames,names)  ## 23. 3D plot of Runs vs Balls Faced and Minutes at Crease The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted import cricpy.analytics as ca #ca.battingPerf3d("aplayer001.csv","A Player001") #ca.battingPerf3d("aplayer002.csv","A Player002") #ca.battingPerf3d("aplayer003.csv","A Player003") #ca.battingPerf3d("aplayer004.csv","A Player004") ## 24. Predicting Runs given Balls Faced and Minutes at Crease A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease. import cricpy.analytics as ca import numpy as np import pandas as pd BF = np.linspace( 10, 400,15) Mins = np.linspace( 30,600,15) newDF= pd.DataFrame({'BF':BF,'Mins':Mins}) #aplayer = ca.batsmanRunsPredict("aplayer.csv",newDF,"A Player") #print(aplayer) The fitted model is then used to predict the runs that the batsmen will score for a given Balls faced and Minutes at crease. ## 25 Analysis of Top 3 wicket takers Take any number of bowlers from either Test, ODI or T20 1. Bowler1 2. Bowler2 3. Bowler3 … ## 26. Get the bowler’s data (Test) This plot below computes the percentage frequency of number of wickets taken for e.g 1 wicket x%, 2 wickets y% etc and plots them as a continuous line import cricpy.analytics as ca #abowler1 =ca.getPlayerData(profileNo1,dir=".",file="abowler1.csv",type="bowling",homeOrAway=[1,2], result=[1,2,4]) #abowler2 =ca.getPlayerData(profileNo2,dir=".",file="abowler2.csv",type="bowling",homeOrAway=[1,2], result=[1,2,4]) #abowler3 =ca.getPlayerData(profile3,dir=".",file="abowler3.csv",type="bowling",homeOrAway=[1,2], result=[1,2,4]) ## 26b For ODI bowlers import cricpy.analytics as ca #abowler1 =ca.getPlayerDataOD(profileNo1,dir=".",file="abowler1.csv",type="bowling") #abowler2 =ca.getPlayerDataOD(profileNo2,dir=".",file="abowler2.csv",type="bowling") #abowler3 =ca.getPlayerDataOD(profile3,dir=".",file="abowler3.csv",type="bowling") ## 26c For T20 bowlers import cricpy.analytics as ca #abowler1 =ca.getPlayerDataTT(profileNo1,dir=".",file="abowler1.csv",type="bowling") #abowler2 =ca.getPlayerDataTT(profileNo2,dir=".",file="abowler2.csv",type="bowling") #abowler3 =ca.getPlayerDataTT(profile3,dir=".",file="abowler3.csv",type="bowling") ## 27. Wicket Frequency Plot This plot below plots the frequency of wickets taken for each of the bowlers import cricpy.analytics as ca #ca.bowlerWktsFreqPercent("abowler1.csv","A Bowler1") #ca.bowlerWktsFreqPercent("abowler2.csv","A Bowler2") #ca.bowlerWktsFreqPercent("abowler3.csv","A Bowler3") ## 28. Wickets Runs plot The plot below create a box plot showing the 1st and 3rd quartile of runs conceded versus the number of wickets taken import cricpy.analytics as ca #ca.bowlerWktsRunsPlot("abowler1.csv","A Bowler1") #ca.bowlerWktsRunsPlot("abowler2.csv","A Bowler2") #ca.bowlerWktsRunsPlot("abowler3.csv","A Bowler3") ## 29 Average wickets at different venues The plot gives the average wickets taken bat different venues. import cricpy.analytics as ca #ca.bowlerAvgWktsGround("abowler1.csv","A Bowler1") #ca.bowlerAvgWktsGround("abowler2.csv","A Bowler2") #ca.bowlerAvgWktsGround("abowler3.csv","A Bowler3")  ## 30 Average wickets against different opposition The plot gives the average wickets taken against different countries. import cricpy.analytics as ca #ca.bowlerAvgWktsOpposition("abowler1.csv","A Bowler1") #ca.bowlerAvgWktsOpposition("abowler2.csv","A Bowler2") #ca.bowlerAvgWktsOpposition("abowler3.csv","A Bowler3") ## 31 Wickets taken moving average import cricpy.analytics as ca #ca.bowlerMovingAverage("abowler1.csv","A Bowler1") #ca.bowlerMovingAverage("abowler2.csv","A Bowler2") #ca.bowlerMovingAverage("abowler3.csv","A Bowler3") ## 32 Cumulative average wickets taken The plots below give the cumulative average wickets taken by the bowlers. import cricpy.analytics as ca #ca.bowlerCumulativeAvgWickets("abowler1.csv","A Bowler1") #ca.bowlerCumulativeAvgWickets("abowler2.csv","A Bowler2") #ca.bowlerCumulativeAvgWickets("abowler3.csv","A Bowler3") ## 33 Cumulative average economy rate The plots below give the cumulative average economy rate of the bowlers. import cricpy.analytics as ca #ca.bowlerCumulativeAvgEconRate("abowler1.csv","A Bowler1") #ca.bowlerCumulativeAvgEconRate("abowler2.csv","A Bowler2") #ca.bowlerCumulativeAvgEconRate("abowler3.csv","A Bowler3") ## 34 Future Wickets forecast import cricpy.analytics as ca #ca.bowlerPerfForecast("abowler1.csv","A bowler1") ## 35 Get player data special import cricpy.analytics as ca #abowler1sp =ca.getPlayerDataSp(profile1,tdir=".",tfile="abowler1sp.csv",ttype="bowling") #abowler2sp =ca.getPlayerDataSp(profile2,tdir=".",tfile="abowler2sp.csv",ttype="bowling") #abowler3sp =ca.getPlayerDataSp(profile3,tdir=".",tfile="abowler3sp.csv",ttype="bowling") ## 36 Contribution to matches won and lost Note:This can be done only for Test cricketers import cricpy.analytics as ca #ca.bowlerContributionWonLost("abowler1sp.csv","A Bowler1") #ca.bowlerContributionWonLost("abowler2sp.csv","A Bowler2") #ca.bowlerContributionWonLost("abowler3sp.csv","A Bowler3")  ## 37 Performance home and overseas Note:This can be done only for Test cricketers import cricpy.analytics as ca #ca.bowlerPerfHomeAway("abowler1sp.csv","A Bowler1") #ca.bowlerPerfHomeAway("abowler2sp.csv","A Bowler2") #ca.bowlerPerfHomeAway("abowler3sp.csv","A Bowler3") ## 38 Relative cumulative average economy rate of bowlers import cricpy.analytics as ca frames = ["abowler1.csv","abowler2.csv","abowler3.csv"] names = ["A Bowler1","A Bowler2","A Bowler3"] #ca.relativeBowlerCumulativeAvgEconRate(frames,names) ## 39 Relative Economy Rate against wickets taken import cricpy.analytics as ca frames = ["abowler1.csv","abowler2.csv","abowler3.csv"] names = ["A Bowler1","A Bowler2","A Bowler3"] #ca.relativeBowlingER(frames,names) ## 40 Relative cumulative average wickets of bowlers in career import cricpy.analytics as ca frames = ["abowler1.csv","abowler2.csv","abowler3.csv"] names = ["A Bowler1","A Bowler2","A Bowler3"] #ca.relativeBowlerCumulativeAvgWickets(frames,names) ## B. Analyzing cricket teams in Test, ODI and T20s The following functions will get the team data for Tests, ODI and T20s ### 1a. Get Test team data import cricpy.analytics as ca #country1Test= ca.getTeamDataHomeAway(dir=".",teamView="bat",matchType="Test",file="country1Test.csv",save=True,teamName="Country1") #country2Test= ca.getTeamDataHomeAway(dir=".",teamView="bat",matchType="Test",file="country2Test.csv",save=True,teamName="Country2") #country3Test= ca.getTeamDataHomeAway(dir=".",teamView="bat",matchType="Test",file="country3Test.csv",save=True,teamName="Country3") ### 1b. Get ODI team data import cricpy.analytics as ca #team1ODI= ca.getTeamDataHomeAway(dir=".",matchType="ODI",file="team1ODI.csv",save=True,teamName="team1") #team2ODI= ca.getTeamDataHomeAway(dir=".",matchType="ODI",file="team2ODI.csv",save=True,teamName="team2") #team3ODI= ca.getTeamDataHomeAway(dir=".",matchType="ODI",file="team3ODI.csv",save=True,teamName="team3") ### 1c. Get T20 team data import cricpy.analytics as ca #team1T20 = ca.getTeamDataHomeAway(matchType="T20",file="team1T20.csv",save=True,teamName="team1") #team2T20 = ca.getTeamDataHomeAway(matchType="T20",file="team2T20.csv",save=True,teamName="team2") #team3T20 = ca.getTeamDataHomeAway(matchType="T20",file="team3T20.csv",save=True,teamName="team3") ### 2a. Test – Analyzing test performances against opposition import cricpy.analytics as ca # Get the performance of Indian test team against all teams at all venues as a dataframe #df = ca.teamWinLossStatusVsOpposition("country1Test.csv",teamName="Country1",opposition=["all"],homeOrAway=["all"],matchType="Test",plot=False) #print(df.head()) # Plot the performance of Country1 Test team against all teams at all venues #ca.teamWinLossStatusVsOpposition("country1Test.csv",teamName="Country1",opposition=["all"],homeOrAway=["all"],matchType="Test",plot=True) # Plot the performance of Country1 Test team against specific teams at home/away venues #ca.teamWinLossStatusVsOpposition("country1Test.csv",teamName="Country1",opposition=["Country2","Country3","Country4"],homeOrAway=["home","away","neutral"],matchType="Test",plot=True) ### 2b. Test – Analyzing test performances against opposition at different grounds import cricpy.analytics as ca # Get the performance of Indian test team against all teams at all venues as a dataframe #df = ca.teamWinLossStatusAtGrounds("country1Test.csv",teamName="Country1",opposition=["all"],homeOrAway=["all"],matchType="Test",plot=False) #df.head() # Plot the performance of Country1 Test team against all teams at all venues #ca.teamWinLossStatusAtGrounds("country1Test.csv",teamName="Country1",opposition=["all"],homeOrAway=["all"],matchType="Test",plot=True) # Plot the performance of Country1 Test team against specific teams at home/away venues #ca.teamWinLossStatusAtGrounds("country1Test.csv",teamName="Country1",opposition=["Country2","Country3","Country4"],homeOrAway=["home","away","neutral"],matchType="Test",plot=True) ### 2c. Test – Plot time lines of wins and losses import cricpy.analytics as ca #ca.plotTimelineofWinsLosses("country1Test.csv",team="Country1",opposition=["all"], #startDate="1970-01-01",endDate="2017-01-01") #ca.plotTimelineofWinsLosses("country1Test.csv",team="Country1",opposition=["Country2","Count#ry3","Country4"], homeOrAway=["home",away","neutral"], startDate=<start Date> #,endDate=<endDate>) ### 3a. ODI – Analyzing test performances against opposition import cricpy.analytics as ca #df = ca.teamWinLossStatusVsOpposition("team1ODI.csv",teamName="Team1",opposition=["all"],homeOrAway=["all"],matchType="ODI",plot=False) #print(df.head()) # Plot the performance of team1 in ODIs against Sri Lanka, India at all venues #ca.teamWinLossStatusVsOpposition("team1ODI.csv",teamName="Team1",opposition=["all"],homeOrAway=[all"],matchType="ODI",plot=True) # Plot the performance of Team1 ODI team against specific teams at home/away venues #ca.teamWinLossStatusVsOpposition("team1ODI.csv",teamName="Team1",opposition=["Team2","Team3","Team4"],homeOrAway="home","away","neutral"],matchType="ODI",plot=True) ### 3b. ODI – Analyzing test performances against opposition at different venues import cricpy.analytics as ca #df = ca.teamWinLossStatusAtGrounds("team1ODI.csv",teamName="Team1",opposition=["all"],homeOrAway=["all"],matchType="ODI",plot=False) #print(df.head()) # Plot the performance of Team1s in ODIs specific ODI teams at all venues #ca.teamWinLossStatusAtGrounds("team1ODI.csv",teamName="Team1",opposition=["all"],homeOrAway=[all"],matchType="ODI",plot=True) # Plot the performance of Team1 against specific ODI teams at home/away venues #ca.teamWinLossStatusAtGrounds("team1ODI.csv",teamName="Team1",opposition=["Team2","Team3","Team4"],homeOrAway=["home","away","neutral"],matchType="ODI",plot=True) ### 3c. ODI – Plot time lines of wins and losses import cricpy.analytics as ca #Plot the time line of wins/losses of Bangladesh ODI team between 2 dates all venues #ca.plotTimelineofWinsLosses("team1ODI.csv",team="Team1",startDate=<start date> ,endDa#te=<end date>,matchType="ODI") #Plot the time line of wins/losses against specific opposition between 2 dates #ca.plotTimelineofWinsLosses("team1ODI.csv",team="Team1",opposition=["Team2","Team2"], homeOrAway=["home",away","neutral"], startDate=<start date>,endDate=<end date> ,matchType="ODI") ### 4a. T20 – Analyzing test performances against opposition import cricpy.analytics as ca #df = ca.teamWinLossStatusVsOpposition("teamT20.csv",teamName="Team1",opposition=["all"],homeOrAway=["all"],matchType="T20",plot=False) #print(df.head()) # Plot the performance of Team1 in T20s against all opposition at all venues #ca.teamWinLossStatusVsOpposition("teamT20.csv",teamName="Team1",opposition=["all"],homeOrAway=[all"],matchType="T20",plot=True) # Plot the performance of T20 Test team against specific teams at home/away venues #ca.teamWinLossStatusVsOpposition("teamT20.csv",teamName="Team1",opposition=["Team2","Team3","Team4"],homeOrAway=["home","away","neutral"],matchType="T20",plot=True) ### 4b. T20 – Analyzing test performances against opposition at different venues import cricpy.analytics as ca #df = ca.teamWinLossStatusAtGrounds("teamT20.csv",teamName="Team1",opposition=["all"],homeOrAway=["all"],matchType="T20",plot=False) #df.head() # Plot the performance of Team1s in ODIs specific ODI teams at all venues #ca.teamWinLossStatusAtGrounds("teamT20.csv",teamName="Team1",opposition=["all"],homeOrAway=["all"],matchType="T20",plot=True) # Plot the performance of Team1 against specific ODI teams at home/away venues #ca.teamWinLossStatusAtGrounds("teamT20.csv",teamName="Team1",opposition=["Team2","Team3","Team4"],homeOrAway=["home","away","neutral"],matchType="T20",plot=True) ### 4c. T20 – Plot time lines of wins and losses import cricpy.analytics as ca #Plot the time line of wins/losses of Bangladesh ODI team between 2 dates all venues #ca.plotTimelineofWinsLosses("teamT20.csv",team="Team1",startDate=<start date> ,endDa#te=<end date>,matchType="T20") #Plot the time line of wins/losses against specific opposition between 2 dates #ca.plotTimelineofWinsLosses("teamT20.csv",team="Team1",opposition=c("Team2","Team2"), homeOrAway=c("home",away","neutral"), startDate=<start date>,endDate=<end date> ,matchType="T20") ## Conclusion # Key Findings ## Analysis of batsman ## Analysis of bowlers ## Analysis of teams Have fun with cripy!!! # Cricpy adds team analytics to its arsenal!! I can’t sit still and see another man slaving and working. I want to get up and superintend, and walk round with my hands in my pockets, and tell him what to do. It is my energetic nature. I can’t help it. It always does seem to me that I am doing more work than I should do. It is not that I object to the work, mind you; I like work: it fascinates me. I can sit and look at it for hours. I love to keep it by me: the idea of getting rid of it nearly breaks my heart. Let your boat of life be light, packed with only what you need – a homely home and simple pleasures, one or two friends, worth the name, someone to love and someone to love you, a cat, a dog, and a pipe or two, enough to eat and enough to wear, and a little more than enough to drink; for thirst is a dangerous thing.  Three Men in a boat by Jerome K Jerome  ## Introduction Cricpy, the python avatar of my R package was born about a 9 months back see Introducing cricpy:A python package to analyze performances of cricketers. Cricpy, like its R twin, can analyze performance of batsmen & bowlers in Test, ODI and T20 formats. About a week and a half back, I added team analytics to my R package cricketr see Cricketr adds team analytics to its repertoire!!!. If cricketr has team analysis functions, then can cricpy be far behind? So, I have included the same 8 functions which can perform Team analytics into cricpy also. Team performance analysis can be done for Test, ODI and T20 matches. This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package can handle all formats of the game including Test, ODI and Twenty20 cricket. You should be able to install the package using pip install cricpy. Please be mindful of ESPN Cricinfo Terms of Use There are 5 functions which are used internally 1) getTeamData b) getTeamNumber c) getMatchType d) getTeamDataHomeAway e) cleanTeamData and the external functions which a) teamWinLossStatusVsOpposition b) teamWinLossStatusAtGrounds c) plotTimelineofhttps://drive.google.com/file/d/1l4nQsRZ0C2FyPosigZmo0t-kC2xZZ_wl/view?usp=sharingWinsLosses All the above functions are common to Test, ODI and T20 teams The data for a particular Team can be obtained with the getTeamDataHomeAway() function from the package. This will return a dataframe of the team’s win/loss status at home and away venues over a period of time. This can be saved as a CSV file. Once this is done, you can use this CSV file for all subsequent analysis This post has been published at Rpubs at teamAnalyticsCricpy You can download the PDF version of this post at teamAnalyticsCricpy As before you can get the help for any of the cricpy functions as below import cricpy.analytics as ca help(ca.teamWinLossStatusAtGrounds) ## Help on function teamWinLossStatusAtGrounds in module cricpy.analytics: ## ## teamWinLossStatusAtGrounds(file, teamName, opposition=['all'], homeOrAway=['all'], matchType='Test', plot=False) ## Compute the wins/losses/draw/tied etc for a Team in Test, ODI or T20 at venues ## ## Description ## ## This function computes the won,lost,draw,tied or no result for a team against other teams in home/away or neutral venues and either returns a dataframe or plots it for grounds ## ## Usage ## ## teamWinLossStatusAtGrounds(file,teamName,opposition=["all"],homeOrAway=["all"], ## matchType="Test",plot=FALSE) ## Arguments ## ## file ## The CSV file for which the plot is required ## teamName ## The name of the team for which plot is required ## opposition ## Opposition is a vector namely ["all")] or ["Australia", "India", "England"] ## homeOrAway ## This parameter is a vector which is either ["all")] or a vector of venues ["home","away","neutral"] ## matchType ## Match type - Test, ODI or T20 ## plot ## If plot=FALSE then a data frame is returned, If plot=TRUE then a plot is generated ## Value ## ## None ## ## Note ## ## Maintainer: Tinniam V Ganesh tvganesh.85@gmail.com ## ## Author(s) ## ## Tinniam V Ganesh ## ## References ## ## http://www.espncricinfo.com/ci/content/stats/index.html ## https://gigadom.in/ ## See Also ## ## teamWinLossStatusVsOpposition teamWinLossStatusAtGrounds plotTimelineofWinsLosses ## ## Examples ## ## ## Not run: ## #Get the team data for India for Tests ## ## df =getTeamDataHomeAway(teamName="India",file="indiaOD.csv",matchType="ODI") ## ca.teamWinLossStatusAtGrounds("india.csv",teamName="India",opposition=c("Australia","England","India"), ## homeOrAway=c("home","away"),plot=TRUE) ## ## ## End(Not run) ## 1. Get team data ### 1a. Test The teams in Test cricket are included below 1. Afghanistan 2.Bangladesh 3. England 4. World 5. India 6. Ireland 7. New Zealand 8. Pakistan 9. South Africa 10.Sri Lanka 11. West Indies 12.Zimbabwe You can use this for the teamName paramater. This will return a dataframe and also save the file as a CSV , if save=True Note: – Since I have already got the data as CSV files I am not executing the lines below import cricpy.analytics as ca # Get the data for the teams. Save as CSV #indiaTest= ca.getTeamDataHomeAway(dir=".",teamView="bat",matchType="Test",file="indiaTest.csv",save=True,teamName="India") #ca.getTeamDataHomeAway(teamName="South Africa", matchType="Test", file="southafricaTest.csv", save=True) #ca.getTeamDataHomeAway(teamName="West Indies", matchType="Test", file="westindiesTest.csv", save=True) #newzealandTest = ca.getTeamDataHomeAway(matchType="Test",file="newzealandTest.csv",save=True,teamName="New Zealand") ### 1b. ODI The ODI teams in the world are below. The data for these teams can be got by names as shown below 1. Afghanistan 2. Africa XI 3. Asia XI 4.Australia 5.Bangladesh 2. Bermuda 7. England 8. ICC World X1 9. India 11.Ireland 12. New Zealand 13. Pakistan 14. South Africa 15.Sri Lanka 17. West Indies 18. Zimbabwe 19 Canada 21. East Africa 22. Hong Kong 23.Ireland 24. Kenya 25. Namibia 26.Nepal 27.Netherlands 28. Oman 29.Papua New Guinea 30. Scotland 31 United Arab Emirates 32. United States of America import cricpy.analytics as ca #indiaODI= ca.getTeamDataHomeAway(dir=".",matchType="ODI",file="indiaODI.csv",save=True,teamName="India") #englandODI = ca.getTeamDataHomeAway(matchType="ODI",file="englandODI.csv",save=True,teamName="England") #westindiesODI = ca.getTeamDataHomeAway(matchType="ODI",file="westindiesODI.csv",save=True,teamName="West Indies") #irelandODI <- ca.getTeamDataHomeAway(matchType="ODI",file="irelandODI.csv",save=True,teamName="Ireland") ### 1c T20 The T20 teams in the world are 1. Afghanistan 2. Australia 3. Bahrain 4. Bangladesh 5. Belgium 6. Belize 2. Bermuda 8.Botswana 9. Canada 11. Costa Rica 12. Germany 13. Ghana 3. Guernsey 15. Hong Kong 16. ICC World X1 17.India 18. Ireland 19.Italy 4. Jersey 21. Kenya 22.Kuwait 23.Maldives 24.Malta 25.Mexico 26.Namibia 27.Nepal 28.Netherlands 29. New Zealand 30.Nigeria 31.Oman 32. Pakistan 33.Panama 34.Papua New Guinea 35. Philippines 36.Qatar 37.Saudi Arabia 38.Scotland 39.South Africa 40.Spain 41.Sri Lanka 42.Uganda 43.United Arab Emirates United States of America 44.Vanuatu 45.West Indies import cricpy.analytics as ca #southafricaT20 = ca.getTeamDataHomeAway(matchType="T20",file="southafricaT20.csv",save=True,teamName="South Africa") #srilankaT20 = ca.getTeamDataHomeAway(matchType="T20",file="srilankaT20.csv",save=True,teamName="Sri Lanka") #canadaT20 = ca.getTeamDataHomeAway(matchType="T20",file="canadaT20.csv",save=True,teamName="Canada") #afghanistanT20 = ca.getTeamDataHomeAway(matchType="T20",file="afghanistanT20.csv",save=True,teamName="Afghanistan") ## 2 Analysis of Test matches The functions below perform analysis of Test teams ## 2a. Wins vs Loss against opposition This function performs analysis of Test teams against other teams at home/away or neutral venue. Note:- The opposition can be a list of opposition teams. Similarly homeOrAway can also be a list of home/away/neutral venues. import cricpy.analytics as ca # Get the performance of Indian test team against all teams at all venues as a dataframe df =ca.teamWinLossStatusVsOpposition("indiaTest.csv",teamName="India",opposition=["all"], homeOrAway=["all"], matchType="Test", plot=False) print(df)  ## ha away home ## Opposition Result ## Afghanistan won 0.0 1.0 ## Australia draw 20.0 23.0 ## lost 58.0 26.0 ## tied 0.0 2.0 ## won 13.0 39.0 ## Bangladesh draw 3.0 0.0 ## won 9.0 2.0 ## England draw 35.0 48.0 ## lost 68.0 26.0 ## won 13.0 33.0 ## New Zealand draw 18.0 28.0 ## lost 16.0 4.0 ## won 10.0 28.0 ## Pakistan draw 29.0 34.0 ## lost 14.0 10.0 ## won 2.0 13.0 ## South Africa draw 13.0 3.0 ## lost 20.0 10.0 ## won 6.0 15.0 ## Sri Lanka draw 11.0 14.0 ## lost 14.0 0.0 ## won 16.0 13.0 ## West Indies draw 39.0 35.0 ## lost 32.0 28.0 ## won 13.0 21.0 ## Zimbabwe draw 1.0 1.0 ## lost 4.0 0.0 ## won 5.0 6.0 # Plot the performance of Indian Test team against all teams at all venues ca.teamWinLossStatusVsOpposition("indiaTest.csv",teamName="India",opposition=["all"],homeOrAway=["all"],matchType="Test",plot=True) # Get the performance of Australia against India, England and New Zealand at all venues in Tests df =ca.teamWinLossStatusVsOpposition("southafricaTest.csv",teamName="South Africa",opposition=["India","England","New Zealand"],homeOrAway=["all"],matchType="Test",plot=False) print(df) #Plot the performance of Australia against England, India and New Zealand only at home (Australia)  ## ha away home ## Opposition Result ## England draw 43 55 ## lost 60 62 ## won 26 34 ## India draw 5 14 ## lost 16 6 ## won 7 19 ## New Zealand draw 20 7 ## lost 2 6 ## won 14 29 ca.teamWinLossStatusVsOpposition("southafricaTest.csv",teamName="South Africa",opposition=["India","England","New Zealand"],homeOrAway=["home","away"],matchType="Test",plot=True)  ### 2b Wins vs losses of Test teams against opposition at different venues import cricpy.analytics as ca # Get the performance of Pakistan against India, West Indies, South Africa at all venues in Tests and show performances at the venues df = ca.teamWinLossStatusAtGrounds("westindiesTest.csv",teamName="West Indies",opposition=["India","Sri Lanka","South Africa"],homeOrAway=["all"],matchType="Test",plot=False) print(df) # Plot the performance of New Zealand Test team against England, Sri Lanka and Bangladesh at all grounds playes  ## ha away home ## Ground Result ## Ahmedabad won 2.0 0.0 ## Basseterre draw 0.0 3.0 ## Bengaluru draw 2.0 0.0 ## won 2.0 0.0 ## Bridgetown draw 0.0 6.0 ## lost 0.0 6.0 ## won 0.0 14.0 ## Cape Town draw 2.0 0.0 ## lost 6.0 0.0 ## Centurion lost 6.0 0.0 ## Chennai draw 4.0 0.0 ## lost 8.0 0.0 ## won 3.0 0.0 ## Colombo (PSS) lost 2.0 0.0 ## Colombo (RPS) draw 2.0 0.0 ## Colombo (SSC) lost 4.0 0.0 ## Delhi draw 6.0 0.0 ## lost 2.0 0.0 ## won 3.0 0.0 ## Durban lost 6.0 0.0 ## Galle draw 1.0 0.0 ## lost 4.0 0.0 ## Georgetown draw 0.0 10.0 ## Gros Islet draw 0.0 5.0 ## lost 0.0 2.0 ## Hyderabad (Deccan) lost 2.0 0.0 ## Johannesburg lost 4.0 0.0 ## Kandy lost 4.0 0.0 ## Kanpur draw 1.0 0.0 ## won 3.0 0.0 ## Kingston draw 0.0 8.0 ## lost 0.0 4.0 ## won 0.0 15.0 ## Kingstown draw 0.0 2.0 ## Kolkata draw 7.0 0.0 ## lost 6.0 0.0 ## won 3.0 0.0 ## Mohali won 2.0 0.0 ## Moratuwa draw 1.0 0.0 ## Mumbai draw 7.0 0.0 ## lost 6.0 0.0 ## won 2.0 0.0 ## Mumbai (BS) draw 5.0 0.0 ## won 2.0 0.0 ## Nagpur draw 2.0 0.0 ## North Sound lost 0.0 2.0 ## Pallekele draw 1.0 0.0 ## Port Elizabeth draw 1.0 0.0 ## lost 2.0 0.0 ## won 2.0 0.0 ## Port of Spain draw 0.0 12.0 ## lost 0.0 12.0 ## won 0.0 10.0 ## Providence lost 0.0 2.0 ## Rajkot lost 2.0 0.0 ## Roseau draw 0.0 2.0 ## St John's draw 0.0 6.0 ## lost 0.0 2.0 ## won 0.0 2.0 ca. teamWinLossStatusAtGrounds("newzealandTest.csv",teamName="New Zealand",opposition=["England","Sri Lanka","Bangladesh"],homeOrAway=["all"],matchType="Test",plot=True)  ### 2c. Plot the time line of wins vs losses of Test teams against opposition at different venues during an interval import cricpy.analytics as ca # Plot the time line of wins/losses of India against Australia, West Indies, South Africa in away/neutral venues #from 2000-01-01 to 2017-01-01 ca.plotTimelineofWinsLosses("indiaTest.csv",teamName="India",opposition=["Australia","West Indies","South Africa"], homeOrAway=["away","neutral"], startDate="2000-01-01",endDate="2017-01-01") #Plot the time line of wins/losses of Indian Test team from 1970 onwards   ca.plotTimelineofWinsLosses("indiaTest.csv",teamName="India",startDate="1970-01-01",endDate="2017-01-01")  ## 3 ODI The functions below perform analysis of ODI teams listed above ### 3a. Wins vs Loss against opposition ODI teams This function performs analysis of ODI teams against other teams at home/away or neutral venue. Note:- The opposition can be a vector of opposition teams. Similarly homeOrAway can also be a vector of home/away/neutral venues. import cricpy.analytics as ca # Get the performance of West Indies in ODIs against all other ODI teams at all venues and retirn as a dataframe df = ca.teamWinLossStatusVsOpposition("westindiesODI.csv",teamName="West Indies",opposition=["all"],homeOrAway=["all"],matchType="ODI",plot=False) print(df) # Plot the performance of West Indies in ODIs against Sri Lanka, India at all venues ## ha away home neutral ## Opposition Result ## Afghanistan lost 0.0 1.0 2.0 ## won 0.0 1.0 0.0 ## Australia lost 41.0 25.0 8.0 ## n/r 3.0 0.0 0.0 ## tied 1.0 2.0 0.0 ## won 35.0 18.0 7.0 ## Bangladesh lost 6.0 5.0 3.0 ## n/r 1.0 0.0 1.0 ## won 10.0 8.0 3.0 ## Bermuda won 0.0 0.0 1.0 ## Canada won 2.0 1.0 1.0 ## England lost 22.0 17.0 12.0 ## n/r 0.0 3.0 0.0 ## won 15.0 23.0 6.0 ## India lost 27.0 14.0 18.0 ## n/r 0.0 1.0 0.0 ## tied 1.0 0.0 1.0 ## won 27.0 20.0 15.0 ## Ireland lost 0.0 0.0 1.0 ## won 2.0 3.0 2.0 ## Kenya lost 0.0 0.0 1.0 ## won 3.0 0.0 2.0 ## Netherlands won 0.0 0.0 2.0 ## New Zealand lost 19.0 5.0 3.0 ## n/r 2.0 0.0 2.0 ## won 10.0 15.0 5.0 ## P.N.G. won 0.0 0.0 1.0 ## Pakistan lost 11.0 15.0 34.0 ## tied 1.0 2.0 0.0 ## won 14.0 16.0 41.0 ## Scotland won 0.0 0.0 3.0 ## South Africa lost 20.0 17.0 7.0 ## n/r 1.0 0.0 0.0 ## tied 0.0 0.0 1.0 ## won 5.0 7.0 3.0 ## Sri Lanka lost 9.0 5.0 11.0 ## n/r 2.0 1.0 0.0 ## won 3.0 5.0 20.0 ## U.A.E. won 0.0 0.0 2.0 ## Zimbabwe lost 4.0 1.0 5.0 ## n/r 0.0 1.0 0.0 ## tied 1.0 0.0 0.0 ## won 9.0 15.0 12.0 ca.teamWinLossStatusVsOpposition("westindiesODI.csv",teamName="West Indies",opposition=["Sri Lanka", "India"],homeOrAway=["all"],matchType="ODI",plot=True) #Plot the performance of Ireland in ODIs against Zimbabwe, Kenya, bermuda, UAE, Oman and Scotland at all venues  ca.teamWinLossStatusVsOpposition("irelandODI.csv",teamName="Ireland",opposition=["Zimbabwe","Kenya","Bermuda","U.A.E.","Oman","Scotland"],homeOrAway=["all"],matchType="ODI",plot=True) ### 3b Wins vs losses of ODI teams against opposition at different venues import cricpy.analytics as ca # Plot the performance of England ODI team against Bangladesh, West Indies and Australia at all venues ca.teamWinLossStatusAtGrounds("englandODI.csv",teamName="England",opposition=["West Indies"],homeOrAway=["all"],matchType="ODI",plot=True) #Plot the performance of India against South Africa, West Indies and Australia at 'home' venues ca.teamWinLossStatusAtGrounds("indiaODI.csv",teamName="India",opposition=["South Africa"],homeOrAway=["home"],matchType="ODI",plot=True) ### 3c. Plot the time line of wins vs losses of ODI teams against opposition at different venues during an interval  import cricpy.analytics as ca #Plot the time line of wins/losses of Bangladesh ODI team between 2015 and 2019 against all other teams and at # all venues ca.plotTimelineofWinsLosses("bangladeshOD.csv",teamName="Bangladesh",startDate="2015-01-01",endDate="2019-01-01",matchType="ODI") #Plot the time line of wins/losses of India ODI against Sri Lanka, Bangladesh from 2016 to 2019 ca.plotTimelineofWinsLosses("indiaODI.csv",teamName="India",opposition=["Sri Lanka","Bangladesh"],startDate="2016-01-01",endDate="2019-01-01",matchType="ODI")  ## 4 Twenty 20 The functions below perform analysis of Twenty 20 teams listed above ### 4a. Wins vs Loss against opposition ODI teams This function performs analysis of T20 teams against other T20 teams at home/away or neutral venue. Note:- The opposition can be a list of opposition teams. Similarly homeOrAway can also be a list of home/away/neutral venues. import cricpy.analytics as ca # Get the performance of South Africa T20 team against England, India and Sri Lanka at home grounds at England df = ca.teamWinLossStatusVsOpposition("southafricaT20.csv",teamName="South Africa",opposition=["England","India","Sri Lanka"], homeOrAway=["home"], matchType="T20", plot=False) print(df) #Plot the performance of South Africa T20 against England, India and Sri Lanka at all venues ## ha home ## Opposition Result ## England lost 1 ## won 4 ## India lost 5 ## won 2 ## Sri Lanka lost 2 ## tied 1 ## won 3 ca.teamWinLossStatusVsOpposition("southafricaT20.csv",teamName="South Africa", opposition=["England","India","Sri Lanka"],homeOrAway=["all"],matchType="T20",plot=True) #Plot the performance of Afghanistan T20 teams against all oppositions ca.teamWinLossStatusVsOpposition("afghanistanT20.csv",teamName="Afghanistan",opposition=["all"],homeOrAway=["all"],matchType="T20",plot=True)  ### 4b Wins vs losses of T20 teams against opposition at different venues # Compute the performance of Canada against all opposition at all venues and show by grounds. Return as dataframe df =ca.teamWinLossStatusAtGrounds("canadaT20.csv",teamName="Canada",opposition=["all"],homeOrAway=["all"],matchType="T20",plot=False) print(df) # Plot the performance of Sri Lanka T20 team against India and Bangladesh in different venues at home/away and neutral ## ha home neutral ## Ground Result ## Abu Dhabi lost 0.0 1.0 ## Belfast lost 0.0 1.0 ## won 0.0 2.0 ## Colombo (SSC) lost 0.0 1.0 ## won 0.0 1.0 ## Dubai (DSC) lost 0.0 5.0 ## ICCA Dubai lost 0.0 2.0 ## won 0.0 1.0 ## King City (NW) lost 3.0 0.0 ## tied 1.0 0.0 ## Sharjah lost 0.0 1.0 ca.teamWinLossStatusAtGrounds("srilankaT20.csv",teamName="Sri Lanka",opposition=["India", "Bangladesh"], homeOrAway=["all"], matchType="T20", plot=True)  ### 4c. Plot the time line of wins vs losses of T20 teams against opposition at different venues during an interval #Plot the time line of Sri Lanka T20 team agaibst all opposition ca.plotTimelineofWinsLosses("srilankaT20.csv",teamName="Sri Lanka",opposition=["Australia", "Pakistan"], startDate="2013-01-01", endDate="2019-01-01", matchType="T20") # Plot the time line of South Africa T20 between 2010 and 2015 against West Indies and Pakistan ca.plotTimelineofWinsLosses("southafricaT20.csv",teamName="South Africa",opposition=["West Indies", "Pakistan"], startDate="2010-01-01", endDate="2015-01-01", matchType="T20")  ## Conclusion With the above additional functions cricpy can now analyze batsmen, bowlers and teams in all formats of the game (Test, ODI and T20). Have fun with cricpy!!! You may also like To see all posts click Index of posts # Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR “There’s something so paradoxical about pi. On the one hand, it represents order, as embodied by the shape of a circle, long held to be a symbol of perfection and eternity. On the other hand, pi is unruly, disheveled in appearance, its digits obeying no obvious rule, or at least none that we can perceive. Pi is elusive and mysterious, forever beyond reach. Its mix of order and disorder is what makes it so bewitching. ” From Infinite Powers by Steven Strogatz Anybody who wants to be “anybody” in Big Data must necessarily be able to work on both large structured and unstructured data. Log analysis is critical in any enterprise which is usually unstructured. As I mentioned in my previous post Big Data: On RDDs, Dataframes,Hive QL with Pyspark and SparkR-Part 3 RDDs are typically used to handle unstructured data. Spark has the Dataframe abstraction over RDDs which performs better as it is optimized with the Catalyst optimization engine. Nevertheless, it is important to be able to process with RDDs. This post is a continuation of my 3 earlier posts on Big Data namely This post uses publicly available Webserver logs from NASA. The logs are for the months Jul 95 and Aug 95 and are a good place to start unstructured text analysis/log analysis. I highly recommend parsing these publicly available logs with regular expressions. It is only when you do that the truth of Jamie Zawinski’s pearl of wisdom “Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.” – Jamie Zawinksi hits home. I spent many hours struggling with regex!! For this post for the RDD part, I had to refer to Dr. Fisseha Berhane’s blog post Webserver Log Analysis and for the Pyspark part, to the Univ. of California Specialization which I had done 3 years back Big Data Analysis with Apache Spark. Once I had played around with the regex for RDDs and PySpark I managed to get SparkR and SparklyR versions to work. The notebooks used in this post have been published and are available at You can also download all the notebooks from Github at WebServerLogsAnalysis An essential and unavoidable aspect of Big Data processing is the need to process unstructured text.Web server logs are one such area which requires Big Data techniques to process massive amounts of logs. The Common Log Format also known as the NCSA Common log format, is a standardized text file format used by web servers when generating server log files. Because the format is standardized, the files can be readily analyzed. A publicly available webserver logs is the NASA-HTTP Web server logs. This is good dataset with which we can play around to get familiar to handling web server logs. The logs can be accessed at NASA-HTTP Description These two traces contain two month’s worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida. Format The logs are an ASCII file with one line per request, with the following columns: -host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up. -timestamp in the format “DAY MON DD HH:MM:SS YYYY”, where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400. -request given in quotes. -HTTP reply code. -bytes in the reply. ## 1 Parse Web server logs with RDDs ### 1.1 Read NASA Web server logs Read the logs files from NASA for the months Jul 95 and Aug 95 from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext conf = SparkConf().setAppName("Spark-Logs-Handling").setMaster("local[*]") sc = SparkContext.getOrCreate(conf) sqlcontext = SQLContext(sc) rdd = sc.textFile("/FileStore/tables/NASA_access_log_*.gz") rdd.count() Out[1]: 3461613  ### 1.2Check content Check the logs to identify the parsing rules required for the logs i=0 for line in rdd.sample(withReplacement = False, fraction = 0.00001, seed = 100).collect(): i=i+1 print(line) if i >5: break  ix-stp-fl2-19.ix.netcom.com – – [03/Aug/1995:23:03:09 -0400] “GET /images/faq.gif HTTP/1.0” 200 263 slip183-1.kw.jp.ibm.net – – [04/Aug/1995:18:42:17 -0400] “GET /shuttle/missions/sts-70/images/DSC-95EC-0001.gif HTTP/1.0” 200 107133 piweba4y.prodigy.com – – [05/Aug/1995:19:17:41 -0400] “GET /icons/menu.xbm HTTP/1.0” 200 527 ruperts.bt-sys.bt.co.uk – – [07/Aug/1995:04:44:10 -0400] “GET /shuttle/countdown/video/livevideo2.gif HTTP/1.0” 200 69067 dal06-04.ppp.iadfw.net – – [07/Aug/1995:21:10:19 -0400] “GET /images/NASA-logosmall.gif HTTP/1.0” 200 786 p15.ppp-1.directnet.com – – [10/Aug/1995:01:22:54 -0400] “GET /images/KSC-logosmall.gif HTTP/1.0” 200 1204 ### 1.3 Write the parsing rule for each of the fields • host • timestamp • path • status • content_bytes ### 1.21 Get IP address/host name This regex is at the start of the log and includes any non-white characted import re rslt=(rdd.map(lambda line: re.search('\S+',line) .group(0)) .take(3)) # Get the IP address \host name rslt  Out[3]: [‘in24.inetnebr.com’, ‘uplherc.upl.com’, ‘uplherc.upl.com’] ## 1.22 Get timestamp Get the time stamp rslt=(rdd.map(lambda line: re.search(‘(\S+ -\d{4})’,line)  .groups()) .take(3)) #Get the date rslt  [(‘[01/Aug/1995:00:00:01 -0400’,), (‘[01/Aug/1995:00:00:07 -0400’,), (‘[01/Aug/1995:00:00:08 -0400’,)] ### 1.23 HTTP request Get the HTTP request sent to Web server \w+ {GET} # Get the REST call with ” “ rslt=(rdd.map(lambda line: re.search('"\w+\s+([^\s]+)\s+HTTP.*"',line) .groups()) .take(3)) # Get the REST call rslt [(‘/shuttle/missions/sts-68/news/sts-68-mcc-05.txt’,), (‘/’,), (‘/images/ksclogo-medium.gif’,)] ### 1.23Get HTTP response status Get the HTTP response to the request rslt=(rdd.map(lambda line: re.search('"\s(\d{3})',line) .groups()) .take(3)) #Get the status rslt Out[6]: [(‘200’,), (‘304’,), (‘304’,)] ## 1.24 Get content size Get the HTTP response in bytes rslt=(rdd.map(lambda line: re.search(‘^.*\s(\d*)$’,line)
    .groups())
.take(3)) # Get the content size
rslt
Out[7]: [(‘1839’,), (‘0’,), (‘0’,)]

## 1.24 Putting it all together

Now put all the individual pieces together into 1 big regular expression and assign to the groups

1. Host 2. Timestamp 3. Path 4. Status 5. Content_size
rslt=(rdd.map(lambda line: re.search('^(\S+)((\s)(-))+\s($\S+ -\d{4}$)\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line) .groups()) .take(3)) rslt  [(‘in24.inetnebr.com’, ‘ -‘, ‘ ‘, ‘-‘, ‘[01/Aug/1995:00:00:01 -0400]’, ‘”GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0″‘, ‘/shuttle/missions/sts-68/news/sts-68-mcc-05.txt’, ‘200 1839’, ‘1839’), (‘uplherc.upl.com’, ‘ -‘, ‘ ‘, ‘-‘, ‘[01/Aug/1995:00:00:07 -0400]’, ‘”GET / HTTP/1.0″‘, ‘/’, ‘304 0’, ‘0’), (‘uplherc.upl.com’, ‘ -‘, ‘ ‘, ‘-‘, ‘[01/Aug/1995:00:00:08 -0400]’, ‘”GET /images/ksclogo-medium.gif HTTP/1.0″‘, ‘/images/ksclogo-medium.gif’, ‘304 0’, ‘0’)] ### 1.25 Add a log parsing function import re def parse_log1(line): match = re.search('^(\S+)((\s)(-))+\s($\S+ -\d{4}$)\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line)
if match is None:
return(line,0)
else:
return(line,1)


### 1.26 Check for parsing failure

Check how many lines successfully parsed with the parsing function

n_logs = rdd.count()
failed = rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1] == 0).count()
print('Out of a total of {} logs, {} failed to parse'.format(n_logs,failed))
# Get the failed records line[1] == 0
failed1=rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1]==0)
failed1.take(3)
Out of a total of 3461613 logs, 38768 failed to parse
Out[10]:
[(‘gw1.att.com – – [01/Aug/1995:00:03:53 -0400] “GET /shuttle/missions/sts-73/news HTTP/1.0” 302 -‘,
0),
(‘js002.cc.utsunomiya-u.ac.jp – – [01/Aug/1995:00:07:33 -0400] “GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0” 404 -‘,
0),
(‘pipe1.nyc.pipeline.com – – [01/Aug/1995:00:12:37 -0400] “GET /history/apollo/apollo-13/apollo-13-patch-small.gif” 200 12859’,
0)]

### 1.26 The above rule is not enough to parse the logs

It can be seen that the single rule only parses part of the logs and we cannot group the regex separately. There is an error “AttributeError: ‘NoneType’ object has no attribute ‘group'” which shows up

#rdd.map(lambda line: re.search(‘^(\S+)((\s)(-))+\s($\S+ -\d{4}$)\s(“\w+\s+([^\s]+)\s+HTTP.*”)\s(\d{3}\s(\d*)$)’,line[0]).group()).take(4) File “/databricks/spark/python/pyspark/util.py”, line 99, in wrapper return f(*args, **kwargs) File “<command-1348022240961444>”, line 1, in <lambda> AttributeError: ‘NoneType’ object has no attribute ‘group’ at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:490)

### 1.27 Add rule for parsing failed records

One of the issues with the earlier rule is the content_size has “-” for some logs

import re
def parse_failed(line):
match = re.search('^(\S+)((\s)(-))+\s($\S+ -\d{4}$)\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line) if match is None: return(line,0) else: return(line,1) ### 1.28 Parse records which fail Parse the records that fails with the new rule failed2=rdd.map(lambda line: parse_failed(line)).filter(lambda line: line[1]==1) failed2.take(5)  Out[13]: [(‘gw1.att.com – – [01/Aug/1995:00:03:53 -0400] “GET /shuttle/missions/sts-73/news HTTP/1.0” 302 -‘, 1), (‘js002.cc.utsunomiya-u.ac.jp – – [01/Aug/1995:00:07:33 -0400] “GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0” 404 -‘, 1), (‘tia1.eskimo.com – – [01/Aug/1995:00:28:41 -0400] “GET /pub/winvn/release.txt HTTP/1.0” 404 -‘, 1), (‘itws.info.eng.niigata-u.ac.jp – – [01/Aug/1995:00:38:01 -0400] “GET /ksc.html/facts/about_ksc.html HTTP/1.0” 403 -‘, 1), (‘grimnet23.idirect.com – – [01/Aug/1995:00:50:12 -0400] “GET /www/software/winvn/winvn.html HTTP/1.0” 404 -‘, 1)] ### 1.28 Add both rules Add both rules for parsing the log. Note it can be shown that even with both rules all the logs are not parse.Further rules may need to be added import re def parse_log2(line): # Parse logs with the rule below match = re.search('^(\S+)((\s)(-))+\s($\S+ -\d{4}$)\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line)
# If match failed then use the rule below
if match is None:
match = re.search('^(\S+)((\s)(-))+\s($\S+ -\d{4}$)\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line) if match is None: return (line, 0) # Return 0 for failure else: return (line, 1) # Return 1 for success  ### 1.29 Group the different regex to groups for handling def map2groups(line):  match = re.search('^(\S+)((\s)(-))+\s($\S+ -\d{4}$)\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line)
if match is None:
match = re.search('^(\S+)((\s)(-))+\s($\S+ -\d{4}$)\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(-)$',line) return(match.groups())  ### 1.30 Parse the logs and map the groups parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0]) parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line)) ## 2. Parse Web server logs with Pyspark ### 2.1Read data into a Pyspark dataframe import os logs_file_path="/FileStore/tables/" + os.path.join('NASA_access_log_*.gz') from pyspark.sql.functions import split, regexp_extract base_df = sqlContext.read.text(logs_file_path) #base_df.show(truncate=False) from pyspark.sql.functions import split, regexp_extract split_df = base_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'), regexp_extract('value', r'^.*\[(\d\d\/\w{3}\/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'), regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'), regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'), regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_df.show(5,truncate=False)

+———————+————————–+———————————————–+——+————+
|host |timestamp |path |status|content_size|
+———————+————————–+———————————————–+——+————+
|199.72.81.55 |01/Jul/1995:00:00:01 -0400|/history/apollo/ |200 |6245 |
|unicomp6.unicomp.net |01/Jul/1995:00:00:06 -0400|/shuttle/countdown/ |200 |3985 |
|199.120.110.21 |01/Jul/1995:00:00:09 -0400|/shuttle/missions/sts-73/mission-sts-73.html |200 |4085 |
|burger.letters.com |01/Jul/1995:00:00:11 -0400|/shuttle/countdown/liftoff.html |304 |0 |
|199.120.110.21 |01/Jul/1995:00:00:11 -0400|/shuttle/missions/sts-73/sts-73-patch-small.gif|200 |4179 |
+———————+————————–+———————————————–+——+————+
only showing top 5 rows

### 2.2 Check data

                              split_df['timestamp'].isNull() |
split_df['path'].isNull() |
split_df['status'].isNull() |
split_df['content_size'].isNull())
bad_rows_df.count()
Out[20]: 33905

### 2.3Check no of rows which do not have digits

We have already seen that the content_type field has ‘-‘ instead of digits in RDDs

#bad_content_size_df = base_df.filter(~ base_df[‘value’].rlike(r’\d+$’)) bad_content_size_df.count()  Out[21]: 33905 ### 2.4 Add ‘*’ to identify bad rows To identify the rows that are bad, concatenate ‘*’ to the content_size field where the field does not have digits. It can be seen that the content_size has ‘-‘ instead of a valid number from pyspark.sql.functions import lit, concat bad_content_size_df.select(concat(bad_content_size_df['value'], lit('*'))).show(4,truncate=False)  +—————————————————————————————————————————————————+ |concat(value, *) | +—————————————————————————————————————————————————+ |dd15-062.compuserve.com – – [01/Jul/1995:00:01:12 -0400] “GET /news/sci.space.shuttle/archive/sci-space-shuttle-22-apr-1995-40.txt HTTP/1.0” 404 -*| |dynip42.efn.org – – [01/Jul/1995:00:02:14 -0400] “GET /software HTTP/1.0” 302 -* | |ix-or10-06.ix.netcom.com – – [01/Jul/1995:00:02:40 -0400] “GET /software/winvn HTTP/1.0” 302 -* | |ix-or10-06.ix.netcom.com – – [01/Jul/1995:00:03:24 -0400] “GET /software HTTP/1.0” 302 -* | +—————————————————————————————————————————————————+ ### 2.5 Fill NAs with 0s # Replace all null content_size values with 0. cleaned_df = split_df.na.fill({‘content_size’: 0}) ## 3. Webserver logs parsing with SparkR library(SparkR) library(stringr) file_location = "/FileStore/tables/NASA_access_log_Jul95.gz" file_location = "/FileStore/tables/NASA_access_log_Aug95.gz" # Load the SparkR library # Initiate a SparkR session sparkR.session() sc <- sparkR.session() sqlContext <- sparkRSQL.init(sc) df <- read.text(sqlContext,"/FileStore/tables/NASA_access_log_Jul95.gz") #df=SparkR::select(df, "value") #head(SparkR::collect(df)) #m=regexp_extract(df$value,'\\\\S+',1)

a=df %>%
withColumn('host', regexp_extract(df$value, '^(\\S+)', 1)) %>% withColumn('timestamp', regexp_extract(df$value, "((\\S+ -\\d{4}))", 2)) %>%
withColumn('path', regexp_extract(df$value, '(\\"\\w+\\s+([^\\s]+)\\s+HTTP.*")', 2)) %>% withColumn('status', regexp_extract(df$value, '(^.*"\\s+([^\\s]+))', 2)) %>%
withColumn('content_size', regexp_extract(df$value, '(^.*\\s+(\\d+)$)', 2))
#b=a%>% select(host,timestamp,path,status,content_type)


1 199.72.81.55 – – [01/Jul/1995:00:00:01 -0400] “GET /history/apollo/ HTTP/1.0” 200 6245
2 unicomp6.unicomp.net – – [01/Jul/1995:00:00:06 -0400] “GET /shuttle/countdown/ HTTP/1.0” 200 3985
3 199.120.110.21 – – [01/Jul/1995:00:00:09 -0400] “GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0” 200 4085
4 burger.letters.com – – [01/Jul/1995:00:00:11 -0400] “GET /shuttle/countdown/liftoff.html HTTP/1.0” 304 0
5 199.120.110.21 – – [01/Jul/1995:00:00:11 -0400] “GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0” 200 4179
6 burger.letters.com – – [01/Jul/1995:00:00:12 -0400] “GET /images/NASA-logosmall.gif HTTP/1.0” 304 0
7 burger.letters.com – – [01/Jul/1995:00:00:12 -0400] “GET /shuttle/countdown/video/livevideo.gif HTTP/1.0” 200 0
8 205.212.115.106 – – [01/Jul/1995:00:00:12 -0400] “GET /shuttle/countdown/countdown.html HTTP/1.0” 200 3985
9 d104.aa.net – – [01/Jul/1995:00:00:13 -0400] “GET /shuttle/countdown/ HTTP/1.0” 200 3985
10 129.94.144.152 – – [01/Jul/1995:00:00:13 -0400] “GET / HTTP/1.0” 200 7074
host timestamp
1 199.72.81.55 [01/Jul/1995:00:00:01 -0400
2 unicomp6.unicomp.net [01/Jul/1995:00:00:06 -0400
3 199.120.110.21 [01/Jul/1995:00:00:09 -0400
4 burger.letters.com [01/Jul/1995:00:00:11 -0400
5 199.120.110.21 [01/Jul/1995:00:00:11 -0400
6 burger.letters.com [01/Jul/1995:00:00:12 -0400
7 burger.letters.com [01/Jul/1995:00:00:12 -0400
8 205.212.115.106 [01/Jul/1995:00:00:12 -0400
9 d104.aa.net [01/Jul/1995:00:00:13 -0400
10 129.94.144.152 [01/Jul/1995:00:00:13 -0400
path status content_size
1 /history/apollo/ 200 6245
2 /shuttle/countdown/ 200 3985
3 /shuttle/missions/sts-73/mission-sts-73.html 200 4085
4 /shuttle/countdown/liftoff.html 304 0
5 /shuttle/missions/sts-73/sts-73-patch-small.gif 200 4179
6 /images/NASA-logosmall.gif 304 0
7 /shuttle/countdown/video/livevideo.gif 200 0
8 /shuttle/countdown/countdown.html 200 3985
9 /shuttle/countdown/ 200 3985
10 / 200 7074

## 4 Webserver logs parsing with SparklyR

install.packages("sparklyr")
library(sparklyr)
library(dplyr)
library(stringr)
#sc <- spark_connect(master = "local", version = "2.1.0")
sc <- spark_connect(method = "databricks")
sdf <-spark_read_text(sc, name="df", path = "/FileStore/tables/NASA_access_log*.gz")
sdf

Installing package into ‘/databricks/spark/R/lib’
# Source: spark [?? x 1]
line

1 "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1…
2 "unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] \"GET /shuttle/countd…
3 "199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] \"GET /shuttle/missions/sts…
4 "burger.letters.com - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/countdow…
5 "199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/missions/sts…
6 "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /images/NASA-logo…
7 "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdow…
8 "205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdown/c…
9 "d104.aa.net - - [01/Jul/1995:00:00:13 -0400] \"GET /shuttle/countdown/ HTTP…
10 "129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] \"GET / HTTP/1.0\" 200 7074"
# … with more rows
#install.packages(“sparklyr”)
library(sparklyr)
library(dplyr)
library(stringr)
#sc <- spark_connect(master = "local", version = "2.1.0")
sc <- spark_connect(method = "databricks")
sdf <-spark_read_text(sc, name="df", path = "/FileStore/tables/NASA_access_log*.gz")
sdf <- sdf %>% mutate(host = regexp_extract(line, '^(\\\\S+)',1)) %>%
mutate(timestamp = regexp_extract(line, '((\\\\S+ -\\\\d{4}))',2)) %>%
mutate(path = regexp_extract(line, '(\\\\"\\\\w+\\\\s+([^\\\\s]+)\\\\s+HTTP.*")',2)) %>%
mutate(status = regexp_extract(line, '(^.*"\\\\s+([^\\\\s]+))',2)) %>%
mutate(content_size = regexp_extract(line, '(^.*\\\\s+(\\\\d+)$)',2))  ## 5 Hosts ### 5.1 RDD #### 5.11 Parse and map to hosts to groups parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0]) parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line)) # Create tuples of (host,1) and apply reduceByKey() and order by descending rslt=(parsed_rdd2.map(lambda x😦x[0],1)) .reduceByKey(lambda a,b:a+b) .takeOrdered(10, lambda x: -x[1])) rslt  Out[18]: [(‘piweba3y.prodigy.com’, 21988), (‘piweba4y.prodigy.com’, 16437), (‘piweba1y.prodigy.com’, 12825), (‘edams.ksc.nasa.gov’, 11962), (‘163.206.89.4’, 9697), (‘news.ti.com’, 8161), (‘www-d1.proxy.aol.com’, 8047), (‘alyssa.prodigy.com’, 8037), (‘siltb10.orl.mmc.com’, 7573), (‘www-a2.proxy.aol.com’, 7516)] #### 5.12Plot counts of hosts import seaborn as sns import pandas as pd import matplotlib.pyplot as plt df=pd.DataFrame(rslt,columns=[‘host’,‘count’]) sns.barplot(x=‘host’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.6, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8) display() ### 5.2 PySpark #### 5.21 Compute counts of hosts df= (cleaned_df  .groupBy('host') .count() .orderBy('count',ascending=False)) df.show(5)  +——————–+—–+ | host|count| +——————–+—–+ |piweba3y.prodigy….|21988| |piweba4y.prodigy….|16437| |piweba1y.prodigy….|12825| | edams.ksc.nasa.gov |11964| | 163.206.89.4 | 9697| +——————–+—–+ only showing top 5 rows ### 5.22 Plot count of hosts import matplotlib.pyplot as plt import pandas as pd import seaborn as sns df1=df.toPandas() df2 = df1.head(10) df2.count() sns.barplot(x='host',y='count',data=df2) plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9) plt.xlabel("Hosts") plt.ylabel('Count') plt.xticks(rotation="vertical",fontsize=10) display() ### 5.3 SparkR ### 5.31 Compute count of hosts c <- SparkR::select(a,a$host)
df=SparkR::summarize(SparkR::groupBy(c, a$host), noHosts = count(a$host))
df1 =head(arrange(df,desc(df$noHosts)),10) head(df1)  host noHosts 1 piweba3y.prodigy.com 17572 2 piweba4y.prodigy.com 11591 3 piweba1y.prodigy.com 9868 4 alyssa.prodigy.com 7852 5 siltb10.orl.mmc.com 7573 6 piweba2y.prodigy.com 5922 ### 5.32 Plot count of hosts library(ggplot2) p <-ggplot(data=df1, aes(x=host, y=noHosts,fill=host)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Host') + ylab('Count') p  ### 5.4 SparklyR #### 5.41 Compute count of Hosts df <- sdf %>% select(host,timestamp,path,status,content_size) df1 <- df %>% select(host) %>% group_by(host) %>% summarise(noHosts=n()) %>% arrange(desc(noHosts)) df2 <-head(df1,10)  ### 5.42 Plot count of hosts library(ggplot2) p <-ggplot(data=df2, aes(x=host, y=noHosts,fill=host)) + geom_bar(stat=identity”)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(Host’) + ylab(Count’) p ## 6 Paths #### 6.1 RDD ### 6.11 Parse and map to hosts to groups parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0]) parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line)) rslt=(parsed_rdd2.map(lambda x😦x[5],1)) .reduceByKey(lambda a,b:a+b) .takeOrdered(10, lambda x: -x[1])) rslt  [(‘”GET /images/NASA-logosmall.gif HTTP/1.0″‘, 207520), (‘”GET /images/KSC-logosmall.gif HTTP/1.0″‘, 164487), (‘”GET /images/MOSAIC-logosmall.gif HTTP/1.0″‘, 126933), (‘”GET /images/USA-logosmall.gif HTTP/1.0″‘, 126108), (‘”GET /images/WORLD-logosmall.gif HTTP/1.0″‘, 124972), (‘”GET /images/ksclogo-medium.gif HTTP/1.0″‘, 120704), (‘”GET /ksc.html HTTP/1.0″‘, 83209), (‘”GET /images/launch-logo.gif HTTP/1.0″‘, 75839), (‘”GET /history/apollo/images/apollo-logo1.gif HTTP/1.0″‘, 68759), (‘”GET /shuttle/countdown/ HTTP/1.0″‘, 64467)] #### 6.12 Plot counts of HTTP Requests import seaborn as sns df=pd.DataFrame(rslt,columns=[‘path’,‘count’]) sns.barplot(x=‘path’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.7, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8) display() ### 6.2 Pyspark #### 6.21 Compute count of HTTP Requests df= (cleaned_df .groupBy('path') .count() .orderBy('count',ascending=False)) df.show(5)  Out[20]: +——————–+——+ | path| count| +——————–+——+ |/images/NASA-logo…|208362| |/images/KSC-logos…|164813| |/images/MOSAIC-lo…|127656| |/images/USA-logos…|126820| |/images/WORLD-log…|125676| +——————–+——+ only showing top 5 rows #### 6.22 Plot count of HTTP Requests import matplotlib.pyplot as plt import pandas as pd import seaborn as sns df1=df.toPandas() df2 = df1.head(10) df2.count() sns.barplot(x=‘path’,y=‘count’,data=df2) plt.subplots_adjust(bottom=0.7, right=0.8, top=0.9) plt.xlabel(“HTTP Requests”) plt.ylabel(‘Count’) plt.xticks(rotation=90,fontsize=8) display() ### 6.3 SparkR #### 6.31Compute count of HTTP requests library(SparkR) c <- SparkR::select(a,a$path)
df=SparkR::summarize(SparkR::groupBy(c, a$path), numRequest = count(a$path))


#### 3.14 Plot count of HTTP Requests

library(ggplot2)
p <-ggplot(data=df1, aes(x=path, y=numRequest,fill=path)) +   geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))+ xlab('Path') + ylab('Count')
p



### 6.4 SparklyR

#### 6.41 Compute count of paths

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(path) %>% group_by(path) %>% summarise(noPaths=n()) %>% arrange(desc(noPaths))
df2

# Source: spark [?? x 2]
# Ordered by: desc(noPaths)
path                                    noPaths

1 /images/NASA-logosmall.gif               208362
2 /images/KSC-logosmall.gif                164813
3 /images/MOSAIC-logosmall.gif             127656
4 /images/USA-logosmall.gif                126820
5 /images/WORLD-logosmall.gif              125676
6 /images/ksclogo-medium.gif               121286
7 /ksc.html                                 83685
8 /images/launch-logo.gif                   75960
9 /history/apollo/images/apollo-logo1.gif   68858
10 /shuttle/countdown/                       64695

#### 6.42 Plot count of Paths

library(ggplot2)
p <-ggplot(data=df2, aes(x=path, y=noPaths,fill=path)) +   geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Path') + ylab('Count')
p



### 7.1 RDD

#### 7.11 Compute count of HTTP Status

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])

parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
rslt=(parsed_rdd2.map(lambda x😦x[7],1))
.reduceByKey(lambda a,b:a+b)
.takeOrdered(10, lambda x: -x[1]))
rslt

Out[22]:
[(‘200’, 3095682),
(‘304’, 266764),
(‘302’, 72970),
(‘404’, 20625),
(‘403’, 225),
(‘500’, 65),
(‘501’, 41)]

#### 1.37 Plot counts of HTTP response status’

import seaborn as sns

df=pd.DataFrame(rslt,columns=[‘status’,‘count’]) sns.barplot(x=‘status’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.4, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8)

display()

### 7.2 Pyspark

#### 7.21 Compute count of HTTP status

status_count=(cleaned_df
                .groupBy('status')
.count()
.orderBy('count',ascending=False))
status_count.show()
+——+——-+
|status| count|
+——+——-+
| 200|3100522|
| 304| 266773|
| 302| 73070|
| 404| 20901|
| 403| 225|
| 500| 65|
| 501| 41|
| 400| 15|
| null| 1|

### 7.22 Plot count of HTTP status

Plot the HTTP return status vs the counts

df1=status_count.toPandas()

df2 = df1.head(10) df2.count() sns.barplot(x=‘status’,y=‘count’,data=df2) plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9) plt.xlabel(“HTTP Status”) plt.ylabel(‘Count’) plt.xticks(rotation=“vertical”,fontsize=10) display()

### 7.3 SparkR

library(SparkR)
c <- SparkR::select(a,a$status) df=SparkR::summarize(SparkR::groupBy(c, a$status), numStatus = count(a$status)) df1=head(df) ## 3.16 Plot count of HTTP Response status library(ggplot2) p <-ggplot(data=df1, aes(x=status, y=numStatus,fill=status)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Status') + ylab('Count') p  #### 7.4 SparklyR #### 7.41 Compute count of status df <- sdf %>% select(host,timestamp,path,status,content_size) df1 <- df %>% select(status) %>% group_by(status) %>% summarise(noStatus=n()) %>% arrange(desc(noStatus)) df2 <-head(df1,10) df2  # Source: spark [?? x 2] # Ordered by: desc(noStatus) status noStatus 1 200 3100522 2 304 266773 3 302 73070 4 404 20901 5 403 225 6 500 65 7 501 41 8 400 15 9 "" 1 #### 7.42 Plot count of status library(ggplot2) p <-ggplot(data=df2, aes(x=status, y=noStatus,fill=status)) + geom_bar(stat=identity”)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(Status’) + ylab(Count’) p ### 8.1 RDD #### 8.12 Compute count of content size parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0]) parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line)) rslt=(parsed_rdd2.map(lambda x😦x[8],1)) .reduceByKey(lambda a,b:a+b) .takeOrdered(10, lambda x: -x[1])) rslt  Out[24]: [(‘0’, 280017), (‘786’, 167281), (‘1204’, 140505), (‘363’, 111575), (‘234’, 110824), (‘669’, 110056), (‘5866’, 107079), (‘1713’, 66904), (‘1173’, 63336), (‘3635’, 55528)] #### 8.21 Plot content size import seaborn as sns df=pd.DataFrame(rslt,columns=[‘content_size’,‘count’]) sns.barplot(x=‘content_size’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.4, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8) display() ### 8.2 Pyspark #### 8.21 Compute count of content_size size_counts=(cleaned_df  .groupBy('content_size') .count() .orderBy('count',ascending=False)) size_counts.show(10) +------------+------+ |content_size| count| +------------+------+ | 0|313932| | 786|167709| | 1204|140668| | 363|111835| | 234|111086| | 669|110313| | 5866|107373| | 1713| 66953| | 1173| 63378| | 3635| 55579| +------------+------+ only showing top 10 rows #### 8.22 Plot counts of content size Plot the path access versus the counts df1=size_counts.toPandas() df2 = df1.head(10) df2.count() sns.barplot(x=‘content_size’,y=‘count’,data=df2) plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9) plt.xlabel(“content_size”) plt.ylabel(‘Count’) plt.xticks(rotation=“vertical”,fontsize=10) display() ### 8.3 SparkR ### 8.31 Compute count of content size library(SparkR) c <- SparkR::select(a,a$content_size)
df=SparkR::summarize(SparkR::groupBy(c, a$content_size), numContentSize = count(a$content_size))
df1

content_size numContentSize
1        28426           1414
2        78382            293
3        60053              4
4        36067              2
5        13282            236
6        41785            174
8.32 Plot count of content sizes
library(ggplot2)

p <-ggplot(data=df1, aes(x=content_size, y=numContentSize,fill=content_size)) + geom_bar(stat=identity”) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(Content Size’) + ylab(Count’)

p

### 8.4 SparklyR

#### 8.41Compute count of content_size

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(content_size) %>% group_by(content_size) %>% summarise(noContentSize=n()) %>% arrange(desc(noContentSize))
df2

# Source: spark [?? x 2]
# Ordered by: desc(noContentSize)
content_size noContentSize

1 0                   280027
2 786                 167709
3 1204                140668
4 363                 111835
5 234                 111086
6 669                 110313
7 5866                107373
8 1713                 66953
9 1173                 63378
10 3635                 55579

#### 8.42 Plot count of content_size

library(ggplot2)
p <-ggplot(data=df2, aes(x=content_size, y=noContentSize,fill=content_size)) +   geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Content size') + ylab('Count')
p



Conclusion: I spent many,many hours struggling with Regex and getting RDDs,Pyspark to work. Also had to spend a lot of time trying to work out the syntax for SparkR and SparklyR for parsing. After you parse the logs plotting and analysis is a piece of cake! This is definitely worth a try!

Watch this space!!

To see all posts click Index of posts

# My book ‘Cricket analytics with cricketr and cricpy’ is now on Amazon

‘Cricket analytics with cricketr and cricpy – Analytics harmony with R and Python’ is now available on Amazon in both paperback ($21.99) and kindle ($9.99/Rs 449) versions. The book includes analysis of cricketers using both my R package ‘cricketr’ and my python package ‘cricpy’ for all formats of the game namely Test, ODI and T20. Both packages use data from ESPN Cricinfo Statsguru. The paperback is available on Amazon for $21.99 and the kindle version is available for$9.99/Rs 449

The book includes the following chapters

CONTENTS

Introduction 7
1. Cricket analytics with cricketr 9
1.1. Introducing cricketr! : An R package to analyze performances of cricketers 10
1.2. Taking cricketr for a spin – Part 1 48
1.2. cricketr digs the Ashes! 69
1.3. cricketr plays the ODIs! 97
1.4. cricketr adapts to the Twenty20 International! 139
1.5. Sixer – R package cricketr’s new Shiny avatar 168
1.6. Re-introducing cricketr! : An R package to analyze performances of cricketers 178
1.7. cricketr sizes up legendary All-rounders of yesteryear 233
1.8. cricketr flexes new muscles: The final analysis 277
1.9. The Clash of the Titans in Test and ODI cricket 300
1.10. Analyzing performances of cricketers using cricketr template 338
2. Cricket analytics with cricpy 352
2.1 Introducing cricpy:A python package to analyze performances of cricketers 353
2.2 Cricpy takes a swing at the ODIs 405
Analysis of Top 4 batsman 448
2.3 Cricpy takes guard for the Twenty20s 449
2.4 Analyzing batsmen and bowlers with cricpy template 490
9. Average runs against different opposing teams 493
3. Other cricket posts in R 500
3.1 Analyzing cricket’s batting legends – Through the mirage with R 500
3.2 Mirror, mirror … the best batsman of them all? 527
4. Appendix 541
Cricket analysis with Machine Learning using Octave 541
4.1 Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid 542
4.2 Informed choices through Machine Learning-2 Pitting together Kumble, Kapil, Chandra 555