Revitalizing R package yorkr

There is nothing so useless as doing efficiently that which should not be done at all. Peter Drucker

The most important thing in communication is to hear what isn’t being said. Peter Drucker

“Work expands to fill the time available for its completion.” Corollary: “Expenditure rises to meet income.” Parkinson’s law

Introduction

“Operation successful!!!, the Programmer Surgeon in me, thought to himself. What should have been a routine surgery, turned out to be a major operation in the end, which involved several grueling hours. The surgeon looked at the large chunks of programming logic in the operation tray, which had been surgically removed, as they had outlived their utility and had partly become dysfunctional. The surgeon glanced at the new, concise code logic which had replaced the earlier somewhat convoluted logic, with a smile of satisfaction,

To, those who tuned in late, I am referring to my R package yorkr which I had created in many years ago, in early 2016. The package had worked well for quite some time on data from Cricsheet. Cricsheet went into a hiatus in late 2017-2018, and came alive back in 2019. Unfortunately, a key function in the package, started to malfunction. The diagnosis was that the format of the YAML files had changed, in newer files, which resulted in the problem. I had got mails from users mentioning that yorkr was not converting the new YAML files. This was on my to do list for a long time, and a week or two back, I decided to “bite the bullet” and fix the issue. I hoped the fix would be trivial but it was anything but. Finally, I took the hard decision of re-designing the core of the yorkr package, which involved converting YAML files to RData (dataframes). Also, since it has been a while since I did R code, having done more of Python stuff in recent times, I had to jog my memory with my earlier 2 posts Essential R and R vs Python

I spent many hours, tweaking and fixing the new logic so that it worked on the older and new files. Finally, I am happy to say that the new code is much more compact and probably less error prone.

I also had to ensure that the converted files performed exactly on all the other yorkr functions. I ran all the my yorkr functions in my yorkr posts on ODI, Intl. T20 and IPL and made sure the results were identical. (Phew!!)

The changes will be available in CRAN in yorkr_0.0.8

Do take a look at my yorkr posts. All the functions work correctly. Do use help, as I have changed a few functions. I will have my posts reflect the correct usage, but some function or other may slip the cracks.

One Day Internationals ODI-Part1, ODI-Part2, ODI-Part3, ODI-Part4
International T20s – T20-Part1,T20-Part2,T20-Part3,T20-Part4
Indian Premier League IPL-Part1, IPL-Part2,IPL-Part3, IPL-Part4

While making the changes, I also touched up some functions and made them more user friendly (added additional arguments etc). But by and large, yorkr is still yorkr and is intact.It just sports some spanking, new YAML conversion logic.

Note:

The code is available in Github yorkr
This RMarkdown has been published at RPubs Revitalizing yorkr
I have already converted the YAML files for ODI, Intl T20 and IPL. You can access and download the converted data from Github at yorkrData2020

setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrgit")
install.packages("yorkr_0.0.8.tar.gz",repos = NULL, type="source")
library(yorkr)

Checkout my interactive Shiny apps GooglyPlus2021 (interactive plots ) and GooglyPlusPlus2021 (analysis in specific intervals) which can be used to analyze IPL players, teams and matches.

Below I rank batsmen and bowlers in ODIs, T20 and IPL based on the data from Cricsheet.

1a. Rank ODI Batsmen

dir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/odi/odiMenMatches"
odir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/odi/odiBattingBowlingDetails"

rankODIBatsmen(dir=dir,odir=odir,minMatches=50)

## # A tibble: 151 x 4
##    batsman        matches meanRuns meanSR
##    <chr>            <int>    <dbl>  <dbl>
##  1 Babar Azam          52     50.2   87.2
##  2 SD Hope             51     48.7   71.0
##  3 V Kohli            207     48.4   79.4
##  4 HM Amla            159     46.6   82.4
##  5 DA Warner          114     46.1   88.0
##  6 AB de Villiers     190     45.5   94.5
##  7 JE Root            108     44.9   82.5
##  8 SR Tendulkar        96     43.9   77.1
##  9 IJL Trott           63     43.1   68.9
## 10 Q de Kock          106     42.0   82.7
## # … with 141 more rows

1b. Rank ODI Bowlers

dir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/odi/odiMenMatches"
odir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/odi/odiBattingBowlingDetails"

rankODIBowlers(dir=dir,odir=odir,minMatches=30)

## # A tibble: 265 x 4
##    bowler           matches totalWickets meanER
##    <chr>              <int>        <dbl>  <dbl>
##  1 SL Malinga           191          308   5.25
##  2 MG Johnson           142          238   4.73
##  3 Shakib Al Hasan      157          214   4.72
##  4 Shahid Afridi        166          213   4.69
##  5 JM Anderson          143          207   4.96
##  6 KMDN Kulasekara      161          190   4.94
##  7 SCJ Broad            115          189   5.31
##  8 DW Steyn             114          188   4.96
##  9 Mashrafe Mortaza     139          180   4.97
## 10 Saeed Ajmal          106          180   4.17
## # … with 255 more rows

2a. Rank T20 Batsmen

dir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/t20/t20MenMatches"
odir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/t20/t20BattingBowlingDetails"

rankT20Batsmen(dir=dir,odir=odir,minMatches=50)

## # A tibble: 43 x 4
##    batsman          matches meanRuns meanSR
##    <chr>              <int>    <dbl>  <dbl>
##  1 V Kohli               61     39.0   132.
##  2 Mohammad Shahzad      52     31.8   123.
##  3 CH Gayle              50     31.1   124.
##  4 BB McCullum           69     30.7   126.
##  5 PR Stirling           66     29.6   116.
##  6 MJ Guptill            70     29.6   125.
##  7 DA Warner             75     29.1   128.
##  8 AD Hales              50     28.1   120.
##  9 TM Dilshan            78     26.7   105.
## 10 RG Sharma             72     26.4   120.
## # … with 33 more rows

2b. Rank T20 Bowlers

dir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/t20/t20MenMatches"
odir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/t20/t20BattingBowlingDetails"

rankT20Bowlers(dir=dir,odir=odir,,minMatches=30)

## # A tibble: 153 x 4
##    bowler          matches totalWickets meanER
##    <chr>             <int>        <dbl>  <dbl>
##  1 SL Malinga           78          115   7.39
##  2 Shahid Afridi        89           98   6.80
##  3 Saeed Ajmal          62           92   6.30
##  4 Umar Gul             56           87   7.40
##  5 KMDN Kulasekara      56           72   7.25
##  6 TG Southee           55           69   8.68
##  7 DJ Bravo             60           69   8.41
##  8 DW Steyn             47           69   7.00
##  9 Shakib Al Hasan      57           69   6.82
## 10 SCJ Broad            55           68   7.83
## # … with 143 more rows

3a. Rank IPL Batsmen

dir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplMatches"
odir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails"


rankIPLBatsmen(dir=dir,odir=odir,,minMatches=50)

## # A tibble: 69 x 4
##    batsman        matches meanRuns meanSR
##    <chr>            <int>    <dbl>  <dbl>
##  1 DA Warner          130     37.9   128.
##  2 CH Gayle           125     36.2   134.
##  3 SE Marsh            67     35.9   120.
##  4 MEK Hussey          59     33.8   105.
##  5 KL Rahul            59     33.5   128.
##  6 V Kohli            175     31.6   119.
##  7 AM Rahane          116     30.7   108.
##  8 AB de Villiers     141     30.3   135.
##  9 F du Plessis        65     29.4   117.
## 10 S Dhawan           140     29.0   114.
## # … with 59 more rows

3a. Rank IPL Bowlers

dir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplMatches"
odir="/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails"

rankIPLBowlers(dir=dir,odir=odir,,minMatches=30)

## # A tibble: 143 x 4
##    bowler          matches totalWickets meanER
##    <chr>             <int>        <dbl>  <dbl>
##  1 SL Malinga          120          184   6.99
##  2 SP Narine           108          137   6.71
##  3 Harbhajan Singh     131          134   7.11
##  4 DJ Bravo             85          118   8.18
##  5 B Kumar              86          116   7.43
##  6 YS Chahal            82          102   7.85
##  7 R Ashwin             92           98   6.81
##  8 JJ Bumrah            76           91   7.47
##  9 PP Chawla            85           87   8.02
## 10 RA Jadeja            89           85   7.93
## # … with 133 more rows

##Conclusion

Go ahead and give yorkr a spin once yorkr_0.0.8 is available in CRAN. I hope you have fun. Do get back to me if you have any issues.

I’ll be back. Watch this space!!

Getting started with Tensorflow, Keras in Python and R

The Pale Blue Dot

“From this distant vantage point, the Earth might not seem of any particular interest. But for us, it’s different. Consider again that dot. That’s here, that’s home, that’s us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and peasant, every young couple in love, every mother and father, hopeful child, inventor and explorer, every teacher of morals, every corrupt politician, every “superstar,” every “supreme leader,” every saint and sinner in the history of our species lived there—on the mote of dust suspended in a sunbeam.”

Carl Sagan

Tensorflow and Keras are Deep Learning frameworks that really simplify a lot of things to the user. If you are familiar with Machine Learning and Deep Learning concepts then Tensorflow and Keras are really a playground to realize your ideas. In this post I show how you can get started with Tensorflow in both Python and R

Tensorflow in Python

For tensorflow in Python, I found Google’s Colab an ideal environment for running your Deep Learning code. This is an Google’s research project where you can execute your code on GPUs, TPUs etc

Tensorflow in R (RStudio)

To execute tensorflow in R (RStudio) you need to install tensorflow and keras as shown below
In this post I show how to get started with Tensorflow and Keras in R.

# Install Tensorflow in RStudio
#install_tensorflow()
# Install Keras
#install_packages("keras")
library(tensorflow)
libary(keras)

This post takes 3 different Machine Learning problems and uses the
Tensorflow/Keras framework to solve it

Note:
You can view the Google Colab notebook at Tensorflow in Python
The RMarkdown file has been published at RPubs and can be accessed
at Getting started with Tensorflow in R

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($14.99) and in kindle version($9.99/Rs449).

1. Multivariate regression with Tensorflow – Python

This code performs multivariate regression using Tensorflow and keras on the advent of Parkinson disease through sound recordings see Parkinson Speech Dataset with Multiple Types of Sound Recordings Data Set . The clinician’s motorUPDRS score has to be predicted from the set of features

In [0]:

# Import tensorflow
import tensorflow as tf
from tensorflow import keras

In [2]:

#Get the data rom the UCI Machine Learning repository
dataset = keras.utils.get_file("parkinsons_updrs.data", "https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data")

Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data
917504/911261 [==============================] - 0s 0us/step

In [3]:

# Read the CSV file 
import pandas as pd
parkinsons = pd.read_csv(dataset, na_values = "?", comment='\t',
                      sep=",", skipinitialspace=True)
print(parkinsons.shape)
print(parkinsons.columns)
#Check if there are any NAs in the rows
parkinsons.isna().sum()

(5875, 22)
Index(['subject#', 'age', 'sex', 'test_time', 'motor_UPDRS', 'total_UPDRS',
       'Jitter(%)', 'Jitter(Abs)', 'Jitter:RAP', 'Jitter:PPQ5', 'Jitter:DDP',
       'Shimmer', 'Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'Shimmer:APQ11', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA', 'PPE'],
      dtype='object')

Out[3]:

subject#         0
age              0
sex              0
test_time        0
motor_UPDRS      0
total_UPDRS      0
Jitter(%)        0
Jitter(Abs)      0
Jitter:RAP       0
Jitter:PPQ5      0
Jitter:DDP       0
Shimmer          0
Shimmer(dB)      0
Shimmer:APQ3     0
Shimmer:APQ5     0
Shimmer:APQ11    0
Shimmer:DDA      0
NHR              0
HNR              0
RPDE             0
DFA              0
PPE              0
dtype: int64

Note: To see how to create dummy variables see my post Practical Machine Learning with R and Python – Part 2

In [4]:

# Drop the columns subject number as it is not relevant
parkinsons1=parkinsons.drop(['subject#'],axis=1)

# Create dummy variables for sex (M/F)
parkinsons2=pd.get_dummies(parkinsons1,columns=['sex'])
parkinsons2.head()

Out[4]
age test_time motor_UPDRS total_UPDRS Jitter(%) Jitter(Abs) Jitter:RAP Jitter:PPQ5 Jitter:DDP Shimmer Shimmer(dB) Shimmer:APQ3 Shimmer:APQ5 Shimmer:APQ11 Shimmer:DDA NHR HNR RPDE DFA PPE sex_0 sex_1
0 72 5.6431 28.199 34.398 0.00662 0.000034 0.00401 0.00317 0.01204 0.02565 0.230 0.01438 0.01309 0.01662 0.04314 0.014290 21.640 0.41888 0.54842 0.16006 1 0
1 72 12.6660 28.447 34.894 0.00300 0.000017 0.00132 0.00150 0.00395 0.02024 0.179 0.00994 0.01072 0.01689 0.02982 0.011112 27.183 0.43493 0.56477 0.10810 1 0
2 72 19.6810 28.695 35.389 0.00481 0.000025 0.00205 0.00208 0.00616 0.01675 0.181 0.00734 0.00844 0.01458 0.02202 0.020220 23.047 0.46222 0.54405 0.21014 1 0
3 72 25.6470 28.905 35.810 0.00528 0.000027 0.00191 0.00264 0.00573 0.02309 0.327 0.01106 0.01265 0.01963 0.03317 0.027837 24.445 0.48730 0.57794 0.33277 1 0
4 72 33.6420 29.187 36.375 0.00335 0.000020 0.00093 0.00130 0.00278 0.01703 0.176 0.00679 0.00929 0.01819 0.02036 0.011625 26.126 0.47188 0.56122 0.19361 1 0

# Create a training and test data set with 80%/20%
train_dataset = parkinsons2.sample(frac=0.8,random_state=0)
test_dataset = parkinsons2.drop(train_dataset.index)

# Select columns
train_dataset1= train_dataset[['age', 'test_time', 'Jitter(%)', 'Jitter(Abs)',
       'Jitter:RAP', 'Jitter:PPQ5', 'Jitter:DDP', 'Shimmer', 'Shimmer(dB)',
       'Shimmer:APQ3', 'Shimmer:APQ5', 'Shimmer:APQ11', 'Shimmer:DDA', 'NHR',
       'HNR', 'RPDE', 'DFA', 'PPE', 'sex_0', 'sex_1']]
test_dataset1= test_dataset[['age','test_time', 'Jitter(%)', 'Jitter(Abs)',
       'Jitter:RAP', 'Jitter:PPQ5', 'Jitter:DDP', 'Shimmer', 'Shimmer(dB)',
       'Shimmer:APQ3', 'Shimmer:APQ5', 'Shimmer:APQ11', 'Shimmer:DDA', 'NHR',
       'HNR', 'RPDE', 'DFA', 'PPE', 'sex_0', 'sex_1']]

In [7]:

# Generate the statistics of the columns for use in normalization of the data
train_stats = train_dataset1.describe()
train_stats = train_stats.transpose()
train_stats

Out[7]:

count mean std min 25% 50% 75% max
age 4700.0 64.792766 8.870401 36.000000 58.000000 65.000000 72.000000 85.000000
test_time 4700.0 93.399490 53.630411 -4.262500 46.852250 93.405000 139.367500 215.490000
Jitter(%) 4700.0 0.006136 0.005612 0.000830 0.003560 0.004900 0.006770 0.099990
Jitter(Abs) 4700.0 0.000044 0.000036 0.000002 0.000022 0.000034 0.000053 0.000396
Jitter:RAP 4700.0 0.002969 0.003089 0.000330 0.001570 0.002235 0.003260 0.057540
Jitter:PPQ5 4700.0 0.003271 0.003760 0.000430 0.001810 0.002480 0.003460 0.069560
Jitter:DDP 4700.0 0.008908 0.009267 0.000980 0.004710 0.006705 0.009790 0.172630
Shimmer 4700.0 0.033992 0.025922 0.003060 0.019020 0.027385 0.039810 0.268630
Shimmer(dB) 4700.0 0.310487 0.231016 0.026000 0.175000 0.251000 0.363250 2.107000
Shimmer:APQ3 4700.0 0.017125 0.013275 0.001610 0.009190 0.013615 0.020562 0.162670
Shimmer:APQ5 4700.0 0.020151 0.016848 0.001940 0.010750 0.015785 0.023733 0.167020
Shimmer:APQ11 4700.0 0.027508 0.020270 0.002490 0.015630 0.022685 0.032713 0.275460
Shimmer:DDA 4700.0 0.051375 0.039826 0.004840 0.027567 0.040845 0.061683 0.488020
NHR 4700.0 0.032116 0.060206 0.000304 0.010827 0.018403 0.031452 0.748260
HNR 4700.0 21.704631 4.288853 1.659000 19.447750 21.973000 24.445250 37.187000
RPDE 4700.0 0.542549 0.100212 0.151020 0.471235 0.543490 0.614335 0.966080
DFA 4700.0 0.653015 0.070446 0.514040 0.596470 0.643285 0.710618 0.865600
PPE 4700.0 0.219559 0.091506 0.021983 0.156470 0.205340 0.264017 0.731730
sex_0 4700.0 0.681489 0.465948 0.000000 0.000000 1.000000 1.000000 1.000000
sex_1 4700.0 0.318511 0.465948 0.000000 0.000000 0.000000 1.000000 1.000000

In [0]:

# Create the target variable
train_labels = train_dataset.pop('motor_UPDRS')
test_labels = test_dataset.pop('motor_UPDRS')

In [0]:

# Normalize the data by subtracting the mean and dividing by the standard deviation
def normalize(x):
  return (x - train_stats['mean']) / train_stats['std']

# Create normalized training and test data
normalized_train_data = normalize(train_dataset1)
normalized_test_data = normalize(test_dataset1)

In [0]:

# Create a Deep Learning model with keras
model = tf.keras.Sequential([
    keras.layers.Dense(6, activation=tf.nn.relu, input_shape=[len(train_dataset1.keys())]),
    keras.layers.Dense(9, activation=tf.nn.relu),
    keras.layers.Dense(6,activation=tf.nn.relu),
    keras.layers.Dense(1)
  ])

# Use the Adam optimizer with a learning rate of 0.01
optimizer=keras.optimizers.Adam(lr=.01, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)

# Set the metrics required to be Mean Absolute Error and Mean Squared Error.For regression, the loss is mean_squared_error
model.compile(loss='mean_squared_error',
                optimizer=optimizer,
                metrics=['mean_absolute_error', 'mean_squared_error'])

In [0]:

# Create a model
history=model.fit(
  normalized_train_data, train_labels,
  epochs=1000, validation_data = (normalized_test_data,test_labels), verbose=0)

In [26]:

hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

Out[26]:

	loss	mean_absolute_error	mean_squared_error	val_loss	val_mean_absolute_error	val_mean_squared_error	epoch
995	15.773989	2.936990	15.773988	16.980803	3.028168	16.980803	995
996	15.238623	2.873420	15.238622	17.458752	3.101033	17.458752	996
997	15.437594	2.895500	15.437593	16.926016	2.971508	16.926018	997
998	15.867891	2.943521	15.867892	16.950249	2.985036	16.950249	998
999	15.846878	2.938914	15.846880	17.095623	3.014504	17.095625	999

In [30]:

def plot_history(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Abs Error')
  plt.plot(hist['epoch'], hist['mean_absolute_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
           label = 'Val Error')
  plt.ylim([2,5])
  plt.legend()

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Square Error ')
  plt.plot(hist['epoch'], hist['mean_squared_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_squared_error'],
           label = 'Val Error')
  plt.ylim([10,40])
  plt.legend()
  plt.show()


plot_history(history)

Observation

It can be seen that the mean absolute error is on an average about +/- 4.0. The validation error also is about the same. This can be reduced by playing around with the hyperparamaters and increasing the number of iterations

1a. Multivariate Regression in Tensorflow – R

# Install Tensorflow in RStudio
#install_tensorflow()
# Install Keras
#install_packages("keras")
library(tensorflow)

library(keras)

library(dplyr)

library(dummies)

## dummies-1.5.6 provided by Decision Patterns

library(tensorflow)
library(keras)

Multivariate regression

This code performs multivariate regression using Tensorflow and keras on the advent of Parkinson disease through sound recordings see Parkinson Speech Dataset with Multiple Types of Sound Recordings Data Set. The clinician’s motorUPDRS score has to be predicted from the set of features.

Read the data

# Download the Parkinson's data from UCI Machine Learning repository
dataset <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data")

# Set the column names
names(dataset) <- c("subject","age", "sex", "test_time","motor_UPDRS","total_UPDRS","Jitter","Jitter.Abs",
                 "Jitter.RAP","Jitter.PPQ5","Jitter.DDP","Shimmer", "Shimmer.dB", "Shimmer.APQ3",
                 "Shimmer.APQ5","Shimmer.APQ11","Shimmer.DDA", "NHR","HNR", "RPDE", "DFA","PPE")

# Remove the column 'subject' as it is not relevant to analysis
dataset1 <- subset(dataset, select = -c(subject))

# Make the column 'sex' as a factor for using dummies
dataset1$sex=as.factor(dataset1$sex)
# Add dummy variables for categorical cariable 'sex'
dataset2 <- dummy.data.frame(dataset1, sep = ".")

## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts =
## FALSE): non-list contrasts argument ignored

dataset3 <- na.omit(dataset2)

Split the data as training and test in 80/20

## Split data 80% training and 20% test
sample_size <- floor(0.8 * nrow(dataset3))

## set the seed to make your partition reproducible
set.seed(12)
train_index <- sample(seq_len(nrow(dataset3)), size = sample_size)

train_dataset <- dataset3[train_index, ]
test_dataset <- dataset3[-train_index, ]

train_data <- train_dataset %>% select(sex.0,sex.1,age, test_time,Jitter,Jitter.Abs,Jitter.PPQ5,Jitter.DDP,
                              Shimmer, Shimmer.dB,Shimmer.APQ3,Shimmer.APQ11,
                              Shimmer.DDA,NHR,HNR,RPDE,DFA,PPE)

train_labels <- select(train_dataset,motor_UPDRS)
test_data <- test_dataset %>% select(sex.0,sex.1,age, test_time,Jitter,Jitter.Abs,Jitter.PPQ5,Jitter.DDP,
                              Shimmer, Shimmer.dB,Shimmer.APQ3,Shimmer.APQ11,
                              Shimmer.DDA,NHR,HNR,RPDE,DFA,PPE)
test_labels <- select(test_dataset,motor_UPDRS)

Normalize the data

 # Normalize the data by subtracting the mean and dividing by the standard deviation
normalize<-function(x) {
  y<-(x - mean(x)) / sd(x)
  return(y)
}

normalized_train_data <-apply(train_data,2,normalize)
# Convert to matrix
train_labels <- as.matrix(train_labels)
normalized_test_data <- apply(test_data,2,normalize)
test_labels <- as.matrix(test_labels)

Create the Deep Learning Model

model <- keras_model_sequential()
model %>% 
  layer_dense(units = 6, activation = 'relu', input_shape = dim(normalized_train_data)[2]) %>% 
  layer_dense(units = 9, activation = 'relu') %>%
  layer_dense(units = 6, activation = 'relu') %>%
  layer_dense(units = 1)

# Set the metrics required to be Mean Absolute Error and Mean Squared Error.For regression, the loss is 
# mean_squared_error
model %>% compile(
  loss = 'mean_squared_error',
  optimizer = optimizer_rmsprop(),
  metrics = c('mean_absolute_error','mean_squared_error')
)

# Fit the model
# Use the test data for validation
history <- model %>% fit(
  normalized_train_data, train_labels, 
  epochs = 30, batch_size = 128, 
  validation_data = list(normalized_test_data,test_labels)
)

Plot mean squared error, mean absolute error and loss for training data and test data

plot(history)

Fig1

2. Binary classification in Tensorflow – Python

This is a simple binary classification problem from UCI Machine Learning repository and deals with data on Breast cancer from the Univ. of Wisconsin Breast Cancer Wisconsin (Diagnostic) Data Set bold text

In [31]:

import tensorflow as tf
from tensorflow import keras
import pandas as pd
# Read the data set from UCI ML site
dataset_path = keras.utils.get_file("breast-cancer-wisconsin.data", "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
raw_dataset = pd.read_csv(dataset_path, sep=",", na_values = "?", skipinitialspace=True,)
dataset = raw_dataset.copy()

#Check for Null and drop
dataset.isna().sum()
dataset = dataset.dropna()
dataset.isna().sum()

# Set the column names
dataset.columns = ["id","thickness",	"cellsize",	"cellshape","adhesion","epicellsize",
                    "barenuclei","chromatin","normalnucleoli","mitoses","class"]
dataset.head()

Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
24576/19889 [=====================================] - 0s 1us/step
id	thickness	cellsize	cellshape	adhesion	epicellsize	barenuclei	chromatin	normalnucleoli	mitoses	class
0	1002945	5	4	4	5	7	10.0	3	2	1	2
1	1015425	3	1	1	1	2	2.0	3	1	1	2
2	1016277	6	8	8	1	3	4.0	3	7	1	2
3	1017023	4	1	1	3	2	1.0	3	1	1	2
4	1017122	8	10	10	8	7	10.0	9	7	1	4

# Create a training/test set in the ratio 80/20
train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)

# Set the training and test set
train_dataset1= train_dataset[['thickness','cellsize','cellshape','adhesion',
                'epicellsize', 'barenuclei', 'chromatin', 'normalnucleoli','mitoses']]
test_dataset1=test_dataset[['thickness','cellsize','cellshape','adhesion',
                'epicellsize', 'barenuclei', 'chromatin', 'normalnucleoli','mitoses']]

In [34]:

# Generate the stats for each column to be used for normalization
train_stats = train_dataset1.describe()
train_stats = train_stats.transpose()
train_stats

Out[34]:

	count	mean	std	min	25%	50%	75%	max
thickness	546.0	4.430403	2.812768	1.0	2.0	4.0	6.0	10.0
cellsize	546.0	3.179487	3.083668	1.0	1.0	1.0	5.0	10.0
cellshape	546.0	3.225275	3.005588	1.0	1.0	1.0	5.0	10.0
adhesion	546.0	2.921245	2.937144	1.0	1.0	1.0	4.0	10.0
epicellsize	546.0	3.261905	2.252643	1.0	2.0	2.0	4.0	10.0
barenuclei	546.0	3.560440	3.651946	1.0	1.0	1.0	7.0	10.0
chromatin	546.0	3.483516	2.492687	1.0	2.0	3.0	5.0	10.0
normalnucleoli	546.0	2.875458	3.064305	1.0	1.0	1.0	4.0	10.0
mitoses	546.0	1.609890	1.736762	1.0	1.0	1.0	1.0	10.0

In [0]:

# Create target variables
train_labels = train_dataset.pop('class')
test_labels = test_dataset.pop('class')

In [0]:

# Set the target variables as 0 or 1
train_labels[train_labels==2] =0 # benign
train_labels[train_labels==4] =1 # malignant

test_labels[test_labels==2] =0 # benign
test_labels[test_labels==4] =1 # malignant

In [0]:

# Normalize by subtracting mean and dividing by standard deviation
def normalize(x):
  return (x - train_stats['mean']) / train_stats['std']

# Convert columns to numeric
train_dataset1 = train_dataset1.apply(pd.to_numeric)
test_dataset1 = test_dataset1.apply(pd.to_numeric)

# Normalize
normalized_train_data = normalize(train_dataset1)
normalized_test_data = normalize(test_dataset1)

In [0]:

# Create a model
model = tf.keras.Sequential([
    keras.layers.Dense(6, activation=tf.nn.relu, input_shape=[len(train_dataset1.keys())]),
    keras.layers.Dense(9, activation=tf.nn.relu),
    keras.layers.Dense(6,activation=tf.nn.relu),
    keras.layers.Dense(1)
  ])

# Use the RMSProp optimizer
optimizer = tf.keras.optimizers.RMSprop(0.01)

# Since this is binary classification use binary_crossentropy
model.compile(loss='binary_crossentropy',
                optimizer=optimizer,
                metrics=['acc'])


# Fit a model
history=model.fit(
  normalized_train_data, train_labels,
  epochs=1000, validation_data=(normalized_test_data,test_labels), verbose=0)

In [55]:

hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

	loss	acc	val_loss	val_acc	epoch
995	0.112499	0.992674	0.454739	0.970588	995
996	0.112499	0.992674	0.454739	0.970588	996
997	0.112499	0.992674	0.454739	0.970588	997
998	0.112499	0.992674	0.454739	0.970588	998
999	0.112499	0.992674	0.454739	0.970588	999

In [58]:

# Plot training and test accuracy 
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.ylim([0.9,1])
plt.show()












# Plot training and test loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.ylim([0,0.5])
plt.show()

2a. Binary classification in Tensorflow -R

This is a simple binary classification problem from UCI Machine Learning repository and deals with data on Breast cancer from the Univ. of Wisconsin Breast Cancer Wisconsin (Diagnostic) Data Set

# Read the data for Breast cancer (Wisconsin)
dataset <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")

# Rename the columns
names(dataset) <- c("id","thickness",   "cellsize", "cellshape","adhesion","epicellsize",
                    "barenuclei","chromatin","normalnucleoli","mitoses","class")

# Remove the columns id and class
dataset1 <- subset(dataset, select = -c(id, class))
dataset2 <- na.omit(dataset1)

# Convert the column to numeric
dataset2$barenuclei <- as.numeric(dataset2$barenuclei)

Normalize the data

train_data <-apply(dataset2,2,normalize)
train_labels <- as.matrix(select(dataset,class))

# Set the target variables as 0 or 1 as it binary classification
train_labels[train_labels==2,]=0
train_labels[train_labels==4,]=1

Create the Deep Learning model

model <- keras_model_sequential()
model %>% 
  layer_dense(units = 6, activation = 'relu', input_shape = dim(train_data)[2]) %>% 
  layer_dense(units = 9, activation = 'relu') %>%
  layer_dense(units = 6, activation = 'relu') %>%
  layer_dense(units = 1)

# Since this is a binary classification we use binary cross entropy
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')  # Metrics is accuracy
)

Fit the model. Use 20% of data for validation

history <- model %>% fit(
  train_data, train_labels, 
  epochs = 30, batch_size = 128, 
  validation_split = 0.2
)

Plot the accuracy and loss for training and validation data

plot(history)

3. MNIST in Tensorflow – Python

This takes the famous MNIST handwritten digits . It ca be seen that Tensorflow and Keras make short work of this famous problem of the late 1980s

# Download MNIST data
mnist=tf.keras.datasets.mnist
# Set training and test data and labels
(training_images,training_labels),(test_images,test_labels)=mnist.load_data()

print(training_images.shape)
print(test_images.shape)

(60000, 28, 28)
(10000, 28, 28)

In [61]:

# Plot a sample image from MNIST and show contents
import matplotlib.pyplot as plt
plt.imshow(training_images[1])
print(training_images[1])
[[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 159 253
159 50 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 238 252 252
252 237 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 54 227 253 252 239
233 252 57 6 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 10 60 224 252 253 252 202
84 252 253 122 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 163 252 252 252 253 252 252
96 189 253 167 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 51 238 253 253 190 114 253 228
47 79 255 168 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 48 238 252 252 179 12 75 121 21
0 0 253 243 50 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 38 165 253 233 208 84 0 0 0 0
0 0 253 252 165 0 0 0 0 0]
[ 0 0 0 0 0 0 0 7 178 252 240 71 19 28 0 0 0 0
0 0 253 252 195 0 0 0 0 0]
[ 0 0 0 0 0 0 0 57 252 252 63 0 0 0 0 0 0 0
0 0 253 252 195 0 0 0 0 0]
[ 0 0 0 0 0 0 0 198 253 190 0 0 0 0 0 0 0 0
0 0 255 253 196 0 0 0 0 0]
[ 0 0 0 0 0 0 76 246 252 112 0 0 0 0 0 0 0 0
0 0 253 252 148 0 0 0 0 0]
[ 0 0 0 0 0 0 85 252 230 25 0 0 0 0 0 0 0 0
7 135 253 186 12 0 0 0 0 0]
[ 0 0 0 0 0 0 85 252 223 0 0 0 0 0 0 0 0 7
131 252 225 71 0 0 0 0 0 0]
[ 0 0 0 0 0 0 85 252 145 0 0 0 0 0 0 0 48 165
252 173 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 86 253 225 0 0 0 0 0 0 114 238 253
162 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 85 252 249 146 48 29 85 178 225 253 223 167
56 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 85 252 252 252 229 215 252 252 252 196 130 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 28 199 252 252 253 252 252 233 145 0 0 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 25 128 252 253 252 141 37 0 0 0 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0]]

# Normalize the images by dividing by 255.0
training_images = training_images/255.0
test_images = test_images/255.0

# Create a Sequential Keras model
model = tf.keras.models.Sequential([tf.keras.layers.Flatten(),
                                   tf.keras.layers.Dense(1024,activation=tf.nn.relu),
                                   tf.keras.layers.Dense(10,activation=tf.nn.softmax)])
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

In [68]:

history=model.fit(training_images,training_labels,validation_data=(test_images, test_labels), epochs=5, verbose=1)

Train on 60000 samples, validate on 10000 samples
Epoch 1/5
60000/60000 [==============================] - 17s 291us/sample - loss: 0.0020 - acc: 0.9999 - val_loss: 0.0719 - val_acc: 0.9810
Epoch 2/5
60000/60000 [==============================] - 17s 284us/sample - loss: 0.0021 - acc: 0.9998 - val_loss: 0.0705 - val_acc: 0.9821
Epoch 3/5
60000/60000 [==============================] - 17s 286us/sample - loss: 0.0017 - acc: 0.9999 - val_loss: 0.0729 - val_acc: 0.9805
Epoch 4/5
60000/60000 [==============================] - 17s 284us/sample - loss: 0.0014 - acc: 0.9999 - val_loss: 0.0762 - val_acc: 0.9804
Epoch 5/5
60000/60000 [==============================] - 17s 280us/sample - loss: 0.0015 - acc: 0.9999 - val_loss: 0.0735 - val_acc: 0.9812

Fig 1

Fig 2

MNIST in Tensorflow – R

The following code uses Tensorflow to learn MNIST’s handwritten digits ### Load MNIST data

mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y

Reshape and rescale

# Reshape the array
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
# Rescale
x_train <- x_train / 255
x_test <- x_test / 255

Convert out put to One Hot encoded format

y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)

Fit the model

Use the softmax activation for recognizing 10 digits and categorical cross entropy for loss

model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% 
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 10, activation = 'softmax') # Use softmax

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)

Fit the model

Note: A smaller number of epochs has been used. For better performance increase number of epochs

history <- model %>% fit(
  x_train, y_train, 
  epochs = 5, batch_size = 128, 
  validation_data = list(x_test,y_test)
)

Plot the accuracy and loss for training and test data

plot(history)

Conclusion
This post shows how to use Tensorflow and Keras in both Python & R
Hope you have fun with Tensorflow!!

To see all posts click Index of posts

Cricpy takes guard for the Twenty20s

“There are two ways to write error-free programs; only the third one works.”” Alan J. Perlis

“Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to produce bigger and better idiots. So far, the universe is winning. ” Rick Cook

“My software never has bugs. It just develops random features.” Anon

“If you make an ass out of yourself, there will always be someone to ride you.” Bruce Lee

Introduction

This is the 3rd and final post on cricpy, and is a continuation to my 2 earlier posts

1. Introducing cricpy:A python package to analyze performances of cricketers
2.Cricpy takes a swing at the ODIs

Cricpy, is the python avatar of my R package ‘cricketr’. To know more about my R package cricketr see Re-introducing cricketr! : An R package to analyze performances of cricketers

With this post cricpy, like cricketr, now becomes omnipotent, and is now capable of handling Test, ODI and T20 matches.

Cricpy uses the statistics info available in ESPN Cricinfo Statsguru.

You should be able to install the package using pip install cricpy and use the many functions available in the package. Please mindful of the ESPN Cricinfo Terms of Use

Cricpy can now analyze performances of teams in Test, ODI and T20 cricket see Cricpy adds team analytics to its arsenal!!

This post is also hosted on Rpubs at Int

This post is also hosted on Rpubs at Cricpy takes guard for the Twenty 20s. You can also download the pdf version of this post at cricpy-TT.pdf

You can fork/clone the package at Github cricpy

Note: If you would like to do a similar analysis for a different set of batsman and bowlers, you can clone/download my skeleton cricpy-template from Github (which is the R Markdown file I have used for the analysis below). You will only need to make appropriate changes for the players you are interested in. The functions can be executed in RStudio or in a IPython notebook.

If you are passionate about cricket, and love analyzing cricket performances, then check out my racy book on cricket ‘Cricket analytics with cricketr and cricpy – Analytics harmony with R & Python’! This book discusses and shows how to use my R package ‘cricketr’ and my Python package ‘cricpy’ to analyze batsmen and bowlers in all formats of the game (Test, ODI and T20). The paperback is available on Amazon at $21.99 and the kindle version at $9.99/Rs 449/-. A must read for any cricket lover! Check it out!!

You can download the latest PDF version of the book at ‘Cricket analytics with cricketr and cricpy: Analytics harmony with R and Python-6th edition‘

Untitled

The cricpy package

The data for a particular player in Twenty20s can be obtained with the getPlayerDataTT() function. To do this you will need to go to T20 Batting and T20 Bowling and click the player you are interested in This will bring up a page which have the profile number for the player e.g. for Virat Kohli this would be http://www.espncricinfo.com/india/content/player/253802.html. Hence,this can be used to get the data for Virat Kohlias shown below

The cricpy package is a clone of my R package cricketr. The signature of all the python functions are identical with that of its clone ‘cricketr’, with only the necessary variations between Python and R. It may be useful to look at my post R vs Python: Different similarities and similar differences. In fact if you are familar with one of the languages you can look up the package in the other and you will notice the parallel constructs.

You can fork/clone the package at Github cricpy

Note: The charts are self-explanatory and I have not added much of my own interpretation to it. Do look at the plots closely and check out the performances for yourself.

1 Importing cricpy – Python

# Install the package
# Do a pip install cricpy
# Import cricpy
import cricpy.analytics as ca

2. Invoking functions with Python package cricpy

import cricpy.analytics as ca 
ca.batsman4s("./kohli.csv","Virat Kohli")

3. Getting help from cricpy – Python

import cricpy.analytics as ca 
help(ca.getPlayerDataTT)

## Help on function getPlayerDataTT in module cricpy.analytics:
## 
## getPlayerDataTT(profile, opposition='', host='', dir='./data', file='player001.csv', type='batting', homeOrAway=[1, 2, 3], result=[1, 2, 3, 5], create=True)
##     Get the Twenty20 International player data from ESPN Cricinfo based on specific inputs and store in a file in a given directory~
##     
##     Description
##     
##     Get the Twenty20 player data given the profile of the batsman/bowler. The allowed inputs are home,away, neutralboth and won,lost,tied or no result of matches. The data is stored in a <player>.csv file in a directory specified. This function also returns a data frame of the player
##     
##     Usage
##     
##     getPlayerDataTT(profile, opposition="",host="",dir = "./data", file = "player001.csv", 
##     type = "batting", homeOrAway = c(1, 2, 3), result = c(1, 2, 3,5))
##     Arguments
##     
##     profile     
##     This is the profile number of the player to get data. This can be obtained from http://www.espncricinfo.com/ci/content/player/index.html. Type the name of the player and click search. This will display the details of the player. Make a note of the profile ID. For e.g For Virat Kohli this turns out to be 253802 http://www.espncricinfo.com/india/content/player/35263.html. Hence the profile for Sehwag is 35263
##     opposition  
##     The numerical value of the opposition country e.g.Australia,India, England etc. The values are Afghanistan:40,Australia:2,Bangladesh:25,England:1,Hong Kong:19,India:6,Ireland:29, New Zealand:5,Pakistan:7,Scotland:30,South Africa:3,Sri Lanka:8,United Arab Emirates:27, West Indies:4, Zimbabwe:9; Note: If no value is entered for opposition then all teams are considered
##     host        
##     The numerical value of the host country e.g.Australia,India, England etc. The values are Australia:2,Bangladesh:25,England:1,India:6,New Zealand:5, South Africa:3,Sri Lanka:8,United States of America:11,West Indies:4, Zimbabwe:9 Note: If no value is entered for host then all host countries are considered
##     dir 
##     Name of the directory to store the player data into. If not specified the data is stored in a default directory "./data". Default="./data"
##     file        
##     Name of the file to store the data into for e.g. kohli.csv. This can be used for subsequent functions. Default="player001.csv"
##     type        
##     type of data required. This can be "batting" or "bowling"
##     homeOrAway  
##     This is vector with either or all 1,2, 3. 1 is for home 2 is for away, 3 is for neutral venue
##     result      
##     This is a vector that can take values 1,2,3,5. 1 - won match 2- lost match 3-tied 5- no result
##     Details
##     
##     More details can be found in my short video tutorial in Youtube https://www.youtube.com/watch?v=q9uMPFVsXsI
##     
##     Value
##     
##     Returns the player's dataframe
##     
##     Note
##     
##     Maintainer: Tinniam V Ganesh <tvganesh.85@gmail.com>
##     
##     Author(s)
##     
##     Tinniam V Ganesh
##     
##     References
##     
##     http://www.espncricinfo.com/ci/content/stats/index.html
##     https://gigadom.in/
##     
##     See Also
##     
##     bowlerWktRateTT getPlayerData
##     
##     Examples
##     
##     ## Not run: 
##     # Only away. Get data only for won and lost innings
##     kohli =getPlayerDataTT(253802,dir="../cricketr/data", file="kohli1.csv",
##     type="batting")
##     
##     # Get bowling data and store in file for future
##     ashwin = getPlayerDataTT(26421,dir="../cricketr/data",file="ashwin1.csv",
##     type="bowling")
##     
##     kohli =getPlayerDataTT(253802,opposition = 2,host=2,dir="../cricketr/data", 
##     file="kohli1.csv",type="batting")

The details below will introduce the different functions that are available in cricpy.

4. Get the Twenty20 player data for a player using the function getPlayerDataOD()

Important Note This needs to be done only once for a player. This function stores the player’s data in the specified CSV file (for e.g. kohli.csv as above) which can then be reused for all other functions). Once we have the data for the players many analyses can be done. This post will use the stored CSV file obtained with a prior getPlayerDataTT for all subsequent analyses

import cricpy.analytics as ca
#kohli=ca.getPlayerDataTT(253802,dir=".",file="kohli.csv",type="batting")
#guptill=ca.getPlayerDataTT(226492,dir=".",file="guptill.csv",type="batting")
#shahzad=ca.getPlayerDataTT(419873,dir=".",file="shahzad.csv",type="batting")
#mccullum=ca.getPlayerDataTT(37737,dir=".",file="mccullum.csv",type="batting")

Included below are some of the functions that can be used for ODI batsmen and bowlers. For this I have chosen, Virat Kohli, ‘the run machine’ who is on-track for breaking many of the Test, ODI and Twenty20 records

5 Virat Kohli’s performance – Basic Analyses

The 3 plots below provide the following for Virat Kohli in T20s

Frequency percentage of runs in each run range over the whole career
Mean Strike Rate for runs scored in the given range
A histogram of runs frequency percentages in runs ranges

import cricpy.analytics as ca
import matplotlib.pyplot as plt
ca.batsmanRunsFreqPerf("./kohli.csv","Virat Kohli")

ca.batsmanMeanStrikeRate("./kohli.csv","Virat Kohli")

ca.batsmanRunsRanges("./kohli.csv","Virat Kohli")

6. More analyses

import cricpy.analytics as ca
ca.batsman4s("./kohli.csv","Virat Kohli")

ca.batsman6s("./kohli.csv","Virat Kohli")

ca.batsmanDismissals("./kohli.csv","Virat Kohli")

ca.batsmanScoringRateODTT("./kohli.csv","Virat Kohli")

7. 3D scatter plot and prediction plane

The plots below show the 3D scatter plot of Kohli’s Runs versus Balls Faced and Minutes at crease. A linear regression plane is then fitted between Runs and Balls Faced + Minutes at crease

import cricpy.analytics as ca
ca.battingPerf3d("./kohli.csv","Virat Kohli")

8. Average runs at different venues

The plot below gives the average runs scored by Kohli at different grounds. The plot also the number of innings at each ground as a label at x-axis.

import cricpy.analytics as ca
ca.batsmanAvgRunsGround("./kohli.csv","Virat Kohli")

9. Average runs against different opposing teams

This plot computes the average runs scored by Kohli against different countries.

import cricpy.analytics as ca
ca.batsmanAvgRunsOpposition("./kohli.csv","Virat Kohli")

10 . Highest Runs Likelihood

The plot below shows the Runs Likelihood for a batsman. For this the performance of Kohli is plotted as a 3D scatter plot with Runs versus Balls Faced + Minutes at crease. K-Means. The centroids of 3 clusters are computed and plotted. In this plot Kohli’s highest tendencies are computed and plotted using K-Means

import cricpy.analytics as ca
ca.batsmanRunsLikelihood("./kohli.csv","Virat Kohli")

11. A look at the Top 4 batsman – Kohli, Guptill, Shahzad and McCullum

The following batsmen have been very prolific in Twenty20 cricket and will be used for the analyses

Virat Kohli: Runs – 2167, Average:49.25 ,Strike rate-136.11
MJ Guptill : Runs -2271, Average:34.4 ,Strike rate-132.88
Mohammed Shahzad :Runs – 1936, Average:31.22 ,Strike rate-134.81
BB McCullum : Runs – 2140, Average:35.66 ,Strike rate-136.21

The following plots take a closer at their performances. The box plots show the median the 1st and 3rd quartile of the runs

12. Box Histogram Plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency

import cricpy.analytics as ca
ca.batsmanPerfBoxHist("./kohli.csv","Virat Kohli")

ca.batsmanPerfBoxHist("./guptill.csv","M J Guptill")

ca.batsmanPerfBoxHist("./shahzad.csv","M Shahzad")

ca.batsmanPerfBoxHist("./mccullum.csv","BB McCullum")

13 Moving Average of runs in career

Take a look at the Moving Average across the career of the Top 4 Twenty20 batsmen.

import cricpy.analytics as ca
ca.batsmanMovingAverage("./kohli.csv","Virat Kohli")

ca.batsmanMovingAverage("./guptill.csv","M J Guptill")
#ca.batsmanMovingAverage("./shahzad.csv","M Shahzad") # Gives error. Check!

ca.batsmanMovingAverage("./mccullum.csv","BB McCullum")

14 Cumulative Average runs of batsman in career

This function provides the cumulative average runs of the batsman over the career.Kohli’s average tops around 45 runs around 43 innings, though there is a dip downwards

import cricpy.analytics as ca
ca.batsmanCumulativeAverageRuns("./kohli.csv","Virat Kohli")

ca.batsmanCumulativeAverageRuns("./guptill.csv","M J Guptill")

ca.batsmanCumulativeAverageRuns("./shahzad.csv","M Shahzad")

ca.batsmanCumulativeAverageRuns("./mccullum.csv","BB McCullum")

15 Cumulative Average strike rate of batsman in career

Kohli, Guptill and McCullum average a strike rate of 125+

import cricpy.analytics as ca
ca.batsmanCumulativeStrikeRate("./kohli.csv","Virat Kohli")

ca.batsmanCumulativeStrikeRate("./guptill.csv","M J Guptill")

ca.batsmanCumulativeStrikeRate("./shahzad.csv","M Shahzad")

ca.batsmanCumulativeStrikeRate("./mccullum.csv","BB McCullum")

16 Relative Batsman Cumulative Average Runs

The plot below compares the Relative cumulative average runs of the batsman. Kohli is way above all the other 3 batsmen. Behind Kohli is McCullum and then Guptill

import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullumn"]
ca.relativeBatsmanCumulativeAvgRuns(frames,names)

17. Relative Batsman Strike Rate

The plot below gives the relative Runs Frequency Percetages for each 10 run bucket. The plot below show that Kohli tops the overall strike rate followed by McCullum and then Guptill

import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullum"]
ca.relativeBatsmanCumulativeStrikeRate(frames,names)

18. 3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A 3D prediction plane is fitted

import cricpy.analytics as ca
ca.battingPerf3d("./kohli.csv","Virat Kohli")

ca.battingPerf3d("./guptill.csv","M J Guptill")

ca.battingPerf3d("./shahzad.csv","M Shahzad")

ca.battingPerf3d("./mccullum.csv","BB McCullum")

19. 3D plot of Runs vs Balls Faced and Minutes at Crease

Guptill and McCullum have a large percentage of sixes in comparison to the 4s. Kohli has a relative lower number of 6s

import cricpy.analytics as ca
frames = ["./kohli.csv","./guptill.csv","./shahzad.csv","./mccullum.csv"]
names = ["Kohli","Guptill","Shahzad","McCullum"]
ca.batsman4s6s(frames,names)

20. Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

import cricpy.analytics as ca
import numpy as np
import pandas as pd
BF = np.linspace( 10, 400,15)
Mins = np.linspace( 30,600,15)
newDF= pd.DataFrame({'BF':BF,'Mins':Mins})
kohli= ca.batsmanRunsPredict("./kohli.csv",newDF,"Kohli")

print(kohli)

##             BF        Mins        Runs
## 0    10.000000   30.000000   14.753153
## 1    37.857143   70.714286   55.963333
## 2    65.714286  111.428571   97.173513
## 3    93.571429  152.142857  138.383693
## 4   121.428571  192.857143  179.593873
## 5   149.285714  233.571429  220.804053
## 6   177.142857  274.285714  262.014233
## 7   205.000000  315.000000  303.224414
## 8   232.857143  355.714286  344.434594
## 9   260.714286  396.428571  385.644774
## 10  288.571429  437.142857  426.854954
## 11  316.428571  477.857143  468.065134
## 12  344.285714  518.571429  509.275314
## 13  372.142857  559.285714  550.485494
## 14  400.000000  600.000000  591.695674

21 Analysis of Top Bowlers

The following 4 bowlers have had an excellent career and will be used for the analysis

Shakib Hasan:Wickets: 80, Average = 21.07, Economy Rate – 6.74
Mohammed Nabi : Wickets: 67, Average = 24.25, Economy Rate – 7.13
Rashid Khan: Wickets: 64, Average = 12.40, Economy Rate – 6.01
Imran Tahir : Wickets:62, Average – 14.95, Economy Rate – 6.77

22. Get the bowler’s data

This plot below computes the percentage frequency of number of wickets taken for e.g 1 wicket x%, 2 wickets y% etc and plots them as a continuous line

import cricpy.analytics as ca
#shakib=ca.getPlayerDataTT(56143,dir=".",file="shakib.csv",type="bowling")
#nabi=ca.getPlayerDataOD(25913,dir=".",file="nabi.csv",type="bowling")
#rashid=ca.getPlayerDataOD(793463,dir=".",file="rashid.csv",type="bowling")
#tahir=ca.getPlayerDataOD(40618,dir=".",file="tahir.csv",type="bowling")

23. Wicket Frequency Plot

This plot below plots the frequency of wickets taken for each of the bowlers

import cricpy.analytics as ca
ca.bowlerWktsFreqPercent("./shakib.csv","Shakib Al Hasan")

ca.bowlerWktsFreqPercent("./nabi.csv","Mohammad Nabi")

ca.bowlerWktsFreqPercent("./rashid.csv","Rashid Khan")

ca.bowlerWktsFreqPercent("./tahir.csv","Imran Tahir")

24. Wickets Runs plot

The plot below create a box plot showing the 1st and 3rd quartile of runs conceded versus the number of wickets taken.

import cricpy.analytics as ca
ca.bowlerWktsRunsPlot("./shakib.csv","Shakib Al Hasan")

ca.bowlerWktsRunsPlot("./nabi.csv","Mohammad Nabi")

ca.bowlerWktsRunsPlot("./rashid.csv","Rashid Khan")

ca.bowlerWktsRunsPlot("./tahir.csv","Imran Tahir")

25 Average wickets at different venues

The plot gives the average wickets taken by Muralitharan at different venues.

import cricpy.analytics as ca
ca.bowlerAvgWktsGround("./shakib.csv","Shakib Al Hasan")

ca.bowlerAvgWktsGround("./nabi.csv","Mohammad Nabi")

ca.bowlerAvgWktsGround("./rashid.csv","Rashid Khan")

ca.bowlerAvgWktsGround("./tahir.csv","Imran Tahir")

26 Average wickets against different opposition

The plot gives the average wickets taken by Muralitharan against different countries. The x-axis also includes the number of innings against each team

import cricpy.analytics as ca
ca.bowlerAvgWktsOpposition("./shakib.csv","Shakib Al Hasan")

ca.bowlerAvgWktsOpposition("./nabi.csv","Mohammad Nabi")

ca.bowlerAvgWktsOpposition("./rashid.csv","Rashid Khan")

ca.bowlerAvgWktsOpposition("./tahir.csv","Imran Tahir")

27 Wickets taken moving average

From the plot below it can be see

import cricpy.analytics as ca
ca.bowlerMovingAverage("./shakib.csv","Shakib Al Hasan")

ca.bowlerMovingAverage("./nabi.csv","Mohammad Nabi")

ca.bowlerMovingAverage("./rashid.csv","Rashid Khan")

ca.bowlerMovingAverage("./tahir.csv","Imran Tahir")

28 Cumulative average wickets taken

The plots below give the cumulative average wickets taken by the bowlers. Rashid Khan has been the most effective with almost 2.28 wickets per match

import cricpy.analytics as ca
ca.bowlerCumulativeAvgWickets("./shakib.csv","Shakib Al Hasan")

ca.bowlerCumulativeAvgWickets("./nabi.csv","Mohammad Nabi")

ca.bowlerCumulativeAvgWickets("./rashid.csv","Rashid Khan")

ca.bowlerCumulativeAvgWickets("./tahir.csv","Imran Tahir")

29 Cumulative average economy rate

The plots below give the cumulative average economy rate of the bowlers. Rashid Khan has the nest economy rate followed by Mohammed Nabi

import cricpy.analytics as ca
ca.bowlerCumulativeAvgEconRate("./shakib.csv","Shakib Al Hasan")

ca.bowlerCumulativeAvgEconRate("./nabi.csv","Mohammad Nabi")

ca.bowlerCumulativeAvgEconRate("./rashid.csv","Rashid Khan")

ca.bowlerCumulativeAvgEconRate("./tahir.csv","Imran Tahir")

30 Relative cumulative average economy rate of bowlers

The Relative cumulative economy rate is given below. It can be seen that Rashid Khan has the best economy rate followed by Mohammed Nabi and then Imran Tahir

import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlerCumulativeAvgEconRate(frames,names)

31 Relative Economy Rate against wickets taken

Rashid Khan has the best figures for wickets between 2-3.5 wickets. Mohammed Nabi pips Rashid Khan when takes a haul of 4 wickets.

import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlingER(frames,names)

32 Relative cumulative average wickets of bowlers in career

Rashid has the best performance with cumulative average wickets. He is followed by Imran Tahir in the wicket haul, followed by Shakib Al Hasan

import cricpy.analytics as ca
frames = ["./shakib.csv","./nabi.csv","./rashid.csv","tahir.csv"]
names = ["Shakib Al Hasan","Mohammad Nabi","Rashid Khan", "Imran Tahir"]
ca.relativeBowlerCumulativeAvgWickets(frames,names)

33. Key Findings

The plots above capture some of the capabilities and features of my cricpy package. Feel free to install the package and try it out. Please do keep in mind ESPN Cricinfo’s Terms of Use.

Here are the main findings from the analysis above

Analysis of Top 4 batsman

The analysis of the Top 4 test batsman Kohli, Guptill, Shahzad and McCullum
1.Kohli has the best overall cumulative average runs and towers over everybody else
2. Kohli, Guptill and McCullum has a very good strike rate of around 125+
3. Guptill and McCullum have a larger percentage of sixes as compared to Kohli
4. Rashid Khan has the best cumulative average wickets, followed by Imran Tahir and then Shakib Al Hasan
5. Rashid Khan is the most economical bowler, followed by Mohammed Nabi

You can fork/clone the package at Github cricpy

Conclusion

Cricpy now has almost all the functions and functionalities of my R package cricketr. There are still a few more features that need to be added to cricpy. I intend to do this as and when I find time.

Go ahead, take cricpy for a spin! Hope you enjoy the ride!

Watch this space!!!

Important note: Do check out my other posts using cricpy at cricpy-posts

To see all posts click Index of Posts

Cricpy takes a swing at the ODIs

No computer has ever been designed that is ever aware of what it’s doing; but most of the time, we aren’t either.” Marvin Minksy

“The competent programmer is fully aware of the limited size of his own skull. He therefore approaches his task with full humility, and avoids clever tricks like the plague” Edgser Djikstra

Introduction

In this post, cricpy, the Python avatar of my R package cricketr, learns some new tricks to be able to handle ODI matches. To know more about my R package cricketr see Re-introducing cricketr! : An R package to analyze performances of cricketers

Cricpy uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package supports only Test cricket

You should be able to install the package using pip install cricpy and use the many functions available in the package. Please mindful of the ESPN Cricinfo Terms of Use

Cricpy can now analyze performances of teams in Test, ODI and T20 cricket see Cricpy adds team analytics to its arsenal!!

This post is also hosted on Rpubs at Int

To know how to use cricpy see Introducing cricpy:A python package to analyze performances of cricketers. To the original version of cricpy, I have added 3 new functions for ODI. The earlier functions work for Test and ODI.

This post is also hosted on Rpubs at Cricpy takes a swing at the ODIs. You can also down the pdf version of this post at cricpy-odi.pdf

You can fork/clone the package at Github cricpy

You can download the latest PDF version of the book at ‘Cricket analytics with cricketr and cricpy: Analytics harmony with R and Python-6th edition‘

Untitled

The cricpy package

The data for a particular player in ODI can be obtained with the getPlayerDataOD() function. To do you will need to go to ESPN CricInfo Player and type in the name of the player for e.g Virat Kohli, Virendar Sehwag, Chris Gayle etc. This will bring up a page which have the profile number for the player e.g. for Virat Kohli this would be http://www.espncricinfo.com/india/content/player/253802.html. Hence, Kohli’s profile is 253802. This can be used to get the data for Virat Kohlis shown below

The cricpy package is a clone of my R package cricketr. The signature of all the python functions are identical with that of its clone ‘cricketr’, with only the necessary variations between Python and R. It may be useful to look at my post R vs Python: Different similarities and similar differences. In fact if you are familar with one of the lanuguages you can look up the package in the other and you will notice the parallel constructs.

You can fork/clone the package at Github cricpy

Note: The charts are self-explanatory and I have not added much of my owy interpretation to it. Do look at the plots closely and check out the performances for yourself.

1 Importing cricpy – Python

# Install the package
# Do a pip install cricpy
# Import cricpy
import cricpy.analytics as ca

2. Invoking functions with Python package crlcpy

import cricpy.analytics as ca 
ca.batsman4s("./kohli.csv","Virat Kohli")

3. Getting help from cricpy – Python

import cricpy.analytics as ca 
help(ca.getPlayerDataOD)

## Help on function getPlayerDataOD in module cricpy.analytics:
## 
## getPlayerDataOD(profile, opposition='', host='', dir='./data', file='player001.csv', type='batting', homeOrAway=[1, 2, 3], result=[1, 2, 3, 5], create=True)
##     Get the One day player data from ESPN Cricinfo based on specific inputs and store in a file in a given directory
##     
##     Description
##     
##     Get the player data given the profile of the batsman. The allowed inputs are home,away or both and won,lost or draw of matches. The data is stored in a .csv file in a directory specified. This function also returns a data frame of the player
##     
##     Usage
##     
##     getPlayerDataOD(profile, opposition="",host="",dir = "../", file = "player001.csv", 
##     type = "batting", homeOrAway = c(1, 2, 3), result = c(1, 2, 3,5))
##     Arguments
##     
##     profile     
##     This is the profile number of the player to get data. This can be obtained from http://www.espncricinfo.com/ci/content/player/index.html. Type the name of the player and click search. This will display the details of the player. Make a note of the profile ID. For e.g For Virender Sehwag this turns out to be http://www.espncricinfo.com/india/content/player/35263.html. Hence the profile for Sehwag is 35263
##     opposition      The numerical value of the opposition country e.g.Australia,India, England etc. The values are Australia:2,Bangladesh:25,Bermuda:12, England:1,Hong Kong:19,India:6,Ireland:29, Netherlands:15,New Zealand:5,Pakistan:7,Scotland:30,South Africa:3,Sri Lanka:8,United Arab Emirates:27, West Indies:4, Zimbabwe:9; Africa XI:405 Note: If no value is entered for opposition then all teams are considered
##     host            The numerical value of the host country e.g.Australia,India, England etc. The values are Australia:2,Bangladesh:25,England:1,India:6,Ireland:29,Malaysia:16,New Zealand:5,Pakistan:7, Scotland:30,South Africa:3,Sri Lanka:8,United Arab Emirates:27,West Indies:4, Zimbabwe:9 Note: If no value is entered for host then all host countries are considered
##     dir 
##     Name of the directory to store the player data into. If not specified the data is stored in a default directory "../data". Default="../data"
##     file        
##     Name of the file to store the data into for e.g. tendulkar.csv. This can be used for subsequent functions. Default="player001.csv"
##     type        
##     type of data required. This can be "batting" or "bowling"
##     homeOrAway  
##     This is vector with either or all 1,2, 3. 1 is for home 2 is for away, 3 is for neutral venue
##     result      
##     This is a vector that can take values 1,2,3,5. 1 - won match 2- lost match 3-tied 5- no result
##     Details
##     
##     More details can be found in my short video tutorial in Youtube https://www.youtube.com/watch?v=q9uMPFVsXsI
##     
##     Value
##     
##     Returns the player's dataframe
##     
##     Note
##     
##     Maintainer: Tinniam V Ganesh <tvganesh.85@gmail.com>
##     
##     Author(s)
##     
##     Tinniam V Ganesh
##     
##     References
##     
##     http://www.espncricinfo.com/ci/content/stats/index.html
##     https://gigadom.in/
##     
##     See Also
##     
##     getPlayerDataSp getPlayerData
##     
##     Examples
##     
##     
##     ## Not run: 
##     # Both home and away. Result = won,lost and drawn
##     sehwag =getPlayerDataOD(35263,dir="../cricketr/data", file="sehwag1.csv",
##     type="batting", homeOrAway=[1,2],result=[1,2,3,4])
##     
##     # Only away. Get data only for won and lost innings
##     sehwag = getPlayerDataOD(35263,dir="../cricketr/data", file="sehwag2.csv",
##     type="batting",homeOrAway=[2],result=[1,2])
##     
##     # Get bowling data and store in file for future
##     malinga = getPlayerData(49758,dir="../cricketr/data",file="malinga1.csv",
##     type="bowling")
##     
##     # Get Dhoni's ODI record in Australia against Australua
##     dhoni = getPlayerDataOD(28081,opposition = 2,host=2,dir=".",
##     file="dhoniVsAusinAusOD",type="batting")
##     
##     ## End(Not run)

The details below will introduce the different functions that are available in cricpy.

4. Get the ODI player data for a player using the function getPlayerDataOD()

import cricpy.analytics as ca
#sehwag=ca.getPlayerDataOD(35263,dir=".",file="sehwag.csv",type="batting")
#kohli=ca.getPlayerDataOD(253802,dir=".",file="kohli.csv",type="batting")
#jayasuriya=ca.getPlayerDataOD(49209,dir=".",file="jayasuriya.csv",type="batting")
#gayle=ca.getPlayerDataOD(51880,dir=".",file="gayle.csv",type="batting")

5 Virat Kohli’s performance – Basic Analyses

The 3 plots below provide the following for Virat Kohli

Frequency percentage of runs in each run range over the whole career
Mean Strike Rate for runs scored in the given range
A histogram of runs frequency percentages in runs ranges

import cricpy.analytics as ca
import matplotlib.pyplot as plt
ca.batsmanRunsFreqPerf("./kohli.csv","Virat Kohli")

ca.batsmanMeanStrikeRate("./kohli.csv","Virat Kohli")

ca.batsmanRunsRanges("./kohli.csv","Virat Kohli")

6. More analyses

import cricpy.analytics as ca
ca.batsman4s("./kohli.csv","Virat Kohli")

ca.batsman6s("./kohli.csv","Virat Kohli")

ca.batsmanDismissals("./kohli.csv","Virat Kohli")

ca.batsmanScoringRateODTT("./kohli.csv","Virat Kohli")

7. 3D scatter plot and prediction plane

The plots below show the 3D scatter plot of Kohli’s Runs versus Balls Faced and Minutes at crease. A linear regression plane is then fitted between Runs and Balls Faced + Minutes at crease

import cricpy.analytics as ca
ca.battingPerf3d("./kohli.csv","Virat Kohli")

Average runs at different venues

The plot below gives the average runs scored by Kohli at different grounds. The plot also the number of innings at each ground as a label at x-axis.

import cricpy.analytics as ca
ca.batsmanAvgRunsGround("./kohli.csv","Virat Kohli")

9. Average runs against different opposing teams

This plot computes the average runs scored by Kohli against different countries.

import cricpy.analytics as ca
ca.batsmanAvgRunsOpposition("./kohli.csv","Virat Kohli")

10 . Highest Runs Likelihood

import cricpy.analytics as ca
ca.batsmanRunsLikelihood("./kohli.csv","Virat Kohli")

A look at the Top 4 batsman – Kohli, Jayasuriya, Sehwag and Gayle

The following batsmen have been very prolific in ODI cricket and will be used for the analyses

Virat Kohli: Runs – 10232, Average:59.83 ,Strike rate-92.88
Sanath Jayasuriya : Runs – 13430, Average:32.36 ,Strike rate-91.2
Virendar Sehwag :Runs – 8273, Average:35.05 ,Strike rate-104.33
Chris Gayle : Runs – 9727, Average:37.12 ,Strike rate-85.82

The following plots take a closer at their performances. The box plots show the median the 1st and 3rd quartile of the runs

12. Box Histogram Plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency

import cricpy.analytics as ca
ca.batsmanPerfBoxHist("./kohli.csv","Virat Kohli")

ca.batsmanPerfBoxHist("./jayasuriya.csv","Sanath jayasuriya")

ca.batsmanPerfBoxHist("./gayle.csv","Chris Gayle")

ca.batsmanPerfBoxHist("./sehwag.csv","Virendar Sehwag")

13 Moving Average of runs in career

Take a look at the Moving Average across the career of the Top 4 (ignore the dip at the end of all plots. Need to check why this is so!). Kohli’s performance has been steadily improving over the years, so has Sehwag. Gayle seems to be on the way down

import cricpy.analytics as ca
ca.batsmanMovingAverage("./kohli.csv","Virat Kohli")

ca.batsmanMovingAverage("./jayasuriya.csv","Sanath jayasuriya")

ca.batsmanMovingAverage("./gayle.csv","Chris Gayle")

ca.batsmanMovingAverage("./sehwag.csv","Virendar Sehwag")

14 Cumulative Average runs of batsman in career

This function provides the cumulative average runs of the batsman over the career. Kohli seems to be getting better with time and reaches a cumulative average of 45+. Sehwag improves with time and reaches around 35+. Chris Gayle drops from 42 to 35

import cricpy.analytics as ca
ca.batsmanCumulativeAverageRuns("./kohli.csv","Virat Kohli")

ca.batsmanCumulativeAverageRuns("./jayasuriya.csv","Sanath jayasuriya")

ca.batsmanCumulativeAverageRuns("./gayle.csv","Chris Gayle")

ca.batsmanCumulativeAverageRuns("./sehwag.csv","Virendar Sehwag")

15 Cumulative Average strike rate of batsman in career

Sehwag has the best strike rate of almost 90. Kohli and Jayasuriya have a cumulative strike rate of 75.

import cricpy.analytics as ca
ca.batsmanCumulativeStrikeRate("./kohli.csv","Virat Kohli")

ca.batsmanCumulativeStrikeRate("./jayasuriya.csv","Sanath jayasuriya")

ca.batsmanCumulativeStrikeRate("./gayle.csv","Chris Gayle")

ca.batsmanCumulativeStrikeRate("./sehwag.csv","Virendar Sehwag")

16 Relative Batsman Cumulative Average Runs

The plot below compares the Relative cumulative average runs of the batsman . It can be seen that Virat Kohli towers above all others in the runs. He is followed by Chris Gayle and then Sehwag

import cricpy.analytics as ca
frames = ["./sehwag.csv","./gayle.csv","./jayasuriya.csv","./kohli.csv"]
names = ["Sehwag","Gayle","Jayasuriya","Kohli"]
ca.relativeBatsmanCumulativeAvgRuns(frames,names)

Relative Batsman Strike Rate

The plot below gives the relative Runs Frequency Percentages for each 10 run bucket. The plot below show Sehwag has the best strike rate, followed by Jayasuriya

import cricpy.analytics as ca
frames = ["./sehwag.csv","./gayle.csv","./jayasuriya.csv","./kohli.csv"]
names = ["Sehwag","Gayle","Jayasuriya","Kohli"]
ca.relativeBatsmanCumulativeStrikeRate(frames,names)

18. 3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A 3D prediction plane is fitted

import cricpy.analytics as ca
ca.battingPerf3d("./kohli.csv","Virat Kohli")

ca.battingPerf3d("./jayasuriya.csv","Sanath jayasuriya")

ca.battingPerf3d("./gayle.csv","Chris Gayle")

ca.battingPerf3d("./sehwag.csv","Virendar Sehwag")

3D plot of Runs vs Balls Faced and Minutes at Crease

From the plot below it can be seen that Sehwag has more runs by way of 4s than 1’s,2’s or 3s. Gayle and Jayasuriya have large number of 6s

import cricpy.analytics as ca
frames = ["./sehwag.csv","./kohli.csv","./gayle.csv","./jayasuriya.csv"]
names = ["Sehwag","Kohli","Gayle","Jayasuriya"]
ca.batsman4s6s(frames,names)

20. Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

import cricpy.analytics as ca
import numpy as np
import pandas as pd
BF = np.linspace( 10, 400,15)
Mins = np.linspace( 30,600,15)
newDF= pd.DataFrame({'BF':BF,'Mins':Mins})
kohli= ca.batsmanRunsPredict("./kohli.csv",newDF,"Kohli")

print(kohli)

##             BF        Mins        Runs
## 0    10.000000   30.000000    6.807407
## 1    37.857143   70.714286   36.034833
## 2    65.714286  111.428571   65.262259
## 3    93.571429  152.142857   94.489686
## 4   121.428571  192.857143  123.717112
## 5   149.285714  233.571429  152.944538
## 6   177.142857  274.285714  182.171965
## 7   205.000000  315.000000  211.399391
## 8   232.857143  355.714286  240.626817
## 9   260.714286  396.428571  269.854244
## 10  288.571429  437.142857  299.081670
## 11  316.428571  477.857143  328.309096
## 12  344.285714  518.571429  357.536523
## 13  372.142857  559.285714  386.763949
## 14  400.000000  600.000000  415.991375

The fitted model is then used to predict the runs that the batsmen will score for a given Balls faced and Minutes at crease.

21 Analysis of Top Bowlers

The following 4 bowlers have had an excellent career and will be used for the analysis

Muthiah Muralitharan:Wickets: 534, Average = 23.08, Economy Rate – 3.93
Wasim Akram : Wickets: 502, Average = 23.52, Economy Rate – 3.89
Shaun Pollock: Wickets: 393, Average = 24.50, Economy Rate – 3.67
Javagal Srinath : Wickets:315, Average – 28.08, Economy Rate – 4.44

How do Muralitharan, Akram, Pollock and Srinath compare with one another with respect to wickets taken and the Economy Rate. The next set of plots compute and plot precisely these analyses.

22. Get the bowler’s data

This plot below computes the percentage frequency of number of wickets taken for e.g 1 wicket x%, 2 wickets y% etc and plots them as a continuous line

import cricpy.analytics as ca
#akram=ca.getPlayerDataOD(43547,dir=".",file="akram.csv",type="bowling")
#murali=ca.getPlayerDataOD(49636,dir=".",file="murali.csv",type="bowling")
#pollock=ca.getPlayerDataOD(46774,dir=".",file="pollock.csv",type="bowling")
#srinath=ca.getPlayerDataOD(34105,dir=".",file="srinath.csv",type="bowling")

23. Wicket Frequency Plot

This plot below plots the frequency of wickets taken for each of the bowlers

import cricpy.analytics as ca
ca.bowlerWktsFreqPercent("./murali.csv","M Muralitharan")

ca.bowlerWktsFreqPercent("./akram.csv","Wasim Akram")

ca.bowlerWktsFreqPercent("./pollock.csv","Shaun Pollock")

ca.bowlerWktsFreqPercent("./srinath.csv","J Srinath")

24. Wickets Runs plot

The plot below create a box plot showing the 1st and 3rd quartile of runs conceded versus the number of wickets taken. Murali’s median runs for wickets ia around 40 while Akram, Pollock and Srinath it is around 32+ runs. The spread around the median is larger for these 3 bowlers in comparison to Murali

import cricpy.analytics as ca
ca.bowlerWktsRunsPlot("./murali.csv","M Muralitharan")

ca.bowlerWktsRunsPlot("./akram.csv","Wasim Akram")

ca.bowlerWktsRunsPlot("./pollock.csv","Shaun Pollock")

ca.bowlerWktsRunsPlot("./srinath.csv","J Srinath")

25 Average wickets at different venues

The plot gives the average wickets taken by Muralitharan at different venues. McGrath best performances are at Centurion, Lord’s and Port of Spain averaging about 4 wickets. Kapil Dev’s does good at Kingston and Wellington. Anderson averages 4 wickets at Dunedin and Nagpur

import cricpy.analytics as ca
ca.bowlerAvgWktsGround("./murali.csv","M Muralitharan")

ca.bowlerAvgWktsGround("./akram.csv","Wasim Akram")

ca.bowlerAvgWktsGround("./pollock.csv","Shaun Pollock")

ca.bowlerAvgWktsGround("./srinath.csv","J Srinath")

26 Average wickets against different opposition

The plot gives the average wickets taken by Muralitharan against different countries. The x-axis also includes the number of innings against each team

import cricpy.analytics as ca
ca.bowlerAvgWktsOpposition("./murali.csv","M Muralitharan")

ca.bowlerAvgWktsOpposition("./akram.csv","Wasim Akram")

ca.bowlerAvgWktsOpposition("./pollock.csv","Shaun Pollock")

ca.bowlerAvgWktsOpposition("./srinath.csv","J Srinath")

27 Wickets taken moving average

From the plot below it can be see James Anderson has had a solid performance over the years averaging about wickets

import cricpy.analytics as ca
ca.bowlerMovingAverage("./murali.csv","M Muralitharan")

ca.bowlerMovingAverage("./akram.csv","Wasim Akram")

ca.bowlerMovingAverage("./pollock.csv","Shaun Pollock")

ca.bowlerMovingAverage("./srinath.csv","J Srinath")

28 Cumulative average wickets taken

The plots below give the cumulative average wickets taken by the bowlers. Muralitharan has consistently taken wickets at an average of 1.6 wickets per game. Shaun Pollock has an average of 1.5

import cricpy.analytics as ca
ca.bowlerCumulativeAvgWickets("./murali.csv","M Muralitharan")

ca.bowlerCumulativeAvgWickets("./akram.csv","Wasim Akram")

ca.bowlerCumulativeAvgWickets("./pollock.csv","Shaun Pollock")

ca.bowlerCumulativeAvgWickets("./srinath.csv","J Srinath")

29 Cumulative average economy rate

The plots below give the cumulative average economy rate of the bowlers. Pollock is the most economical, followed by Akram and then Murali

import cricpy.analytics as ca
ca.bowlerCumulativeAvgEconRate("./murali.csv","M Muralitharan")

ca.bowlerCumulativeAvgEconRate("./akram.csv","Wasim Akram")

ca.bowlerCumulativeAvgEconRate("./pollock.csv","Shaun Pollock")

ca.bowlerCumulativeAvgEconRate("./srinath.csv","J Srinath")

30 Relative cumulative average economy rate of bowlers

The Relative cumulative economy rate shows that Pollock is the most economical of the 4 bowlers. He is followed by Akram and then Murali

import cricpy.analytics as ca
frames = ["./srinath.csv","./akram.csv","./murali.csv","pollock.csv"]
names = ["J Srinath","Wasim Akram","M Muralitharan", "S Pollock"]
ca.relativeBowlerCumulativeAvgEconRate(frames,names)

31 Relative Economy Rate against wickets taken

Pollock is most economical vs number of wickets taken. Murali has the best figures for 4 wickets taken.

import cricpy.analytics as ca
frames = ["./srinath.csv","./akram.csv","./murali.csv","pollock.csv"]
names = ["J Srinath","Wasim Akram","M Muralitharan", "S Pollock"]
ca.relativeBowlingER(frames,names)

32 Relative cumulative average wickets of bowlers in career

The plot below shows that McGrath has the best overall cumulative average wickets. While the bowlers are neck to neck around 130 innings, you can see Muralitharan is most consistent and leads the pack after 150 innings in the number of wickets taken.

import cricpy.analytics as ca
frames = ["./srinath.csv","./akram.csv","./murali.csv","pollock.csv"]
names = ["J Srinath","Wasim Akram","M Muralitharan", "S Pollock"]
ca.relativeBowlerCumulativeAvgWickets(frames,names)

33. Key Findings

The plots above capture some of the capabilities and features of my cricpy package. Feel free to install the package and try it out. Please do keep in mind ESPN Cricinfo’s Terms of Use.

Here are the main findings from the analysis above

Analysis of Top 4 batsman

The analysis of the Top 4 test batsman Tendulkar, Kallis, Ponting and Sangakkara show the folliwing

Kohli is a mean run machine and has been consistently piling on runs. Clearly records will lay shattered in days to come for Kohli
Virendar Sehwag has the best strike rate of the 4, followed by Jayasuriya and then Kohli
Shaun Pollock is the most economical of the bowlers followed by Wasim Akram
Muralitharan is the most consistent wicket of the lot.

Important note: Do check out my other posts using cricpy at cricpy-posts

Also see
1. Architecting a cloud based IP Multimedia System (IMS)
2. Exploring Quantum Gate operations with QCSimulator
3. Dabbling with Wiener filter using OpenCV
4. Deep Learning from first principles in Python, R and Octave – Part 5
5. Big Data-2: Move into the big league:Graduate from R to SparkR
6. Singularity
7. Practical Machine Learning with R and Python – Part 4
8. Literacy in India – A deepR dive
9. Modeling a Car in Android

To see all posts click Index of Posts

A video tutorial on R programming – The essentials

Here is a my video tutorial on R programming – The essentials. This tutorial is meant for those who would like to learn R, for R beginners or for those who would like to get a quick start on R. This tutorial tries to focus on those main functions that any R programmer is likely to use often rather than trying to cover every aspect of R with all its subtleties. You can clone the R tutorial used in this video along with the powerpoint presentation from Github. For this you will have to install Git on your laptop After you have installed Git you should be able to clone this repository and then try the functions out yourself in RStudio. Make your own variations to the functions, as you familiarize yourself with R.

git clone https://github.com/tvganesh/R-Programming-.git
Take a look at the video tutorial R programming – The essentials

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of performances of both batsmen, bowlers besides evaluating team & match performances in Tests , ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the kindle versions. The books can be accessed at Cricket analytics with cricketr and Beaten by sheer pace-Cricket analytics with yorkr A must read for any cricket lover! Check it out!!

You could supplement the video by reading on these topics. This tutorial will give you enough momentum for a relatively short take-off into the R. So good luck on your R journey

Also see
1. Designing a Social Web Portal
2. Design principles of scalable, distributed systems
3. A Cloud Medley with IBM’s Bluemix, Cloudant and Node.js
4. Programming Zen and now – Some essential tips -2
5. Fun simulation of a Chain in Android

The common alphabet of programming languages

“All animals are equal, but some animals are more equal than other.” “Four legs good, two legs bad.”

from Animal Farm by George Orwell

Note: This post is largely intended for those who are embarking on their journey into the world of programming. The article below highlights a set of constructs that recur in many imperative, dynamic and object-oriented languages. While these constructs cannot be applied directly to functional programming languages like Lisp,Haskell or Clojure, it may help. To some extent the programming language domain has been intentionally oversimplified to show that languages are not as daunting as they seem. Clearly there are a lot more subtle and complex differences among languages. Hope you have fun programming!

Introduction: Anybody who is about to venture into the deep waters of programming will be bewildered and awed by the almost limitless number of programming languages and the associated paradigms on which they are based on. It is easy to feel apprehensive of programming, when faced with this this array of languages, not to mention the seemingly quirky syntax of each language. Many opinions abound, about what is the best programming language. In my opinion each language is best suited for a particular class of problems and is usually clunky if used outside of this. As an aside here is an interesting link provided by reader AKS to Rosetta Code, which is stated to be a a programming chrestomathy (present solutions to the same task in as many different languages as possible, to demonstrate how languages are similar and different, and to aid a person with a grounding in one approach to a problem in learning another. Rosetta Code currently has 772 tasks, 165 draft tasks, and is aware of 582 languages)

You are likely to hear “All programming languages are equal, but some languages are more equal than others” from seasoned programmers who have their own pet language. There may also be others who swear that “procedural languages good, object oriented languages bad” or maybe “object oriented languages good, aspect oriented languages bad”. Unity in diversity Regardless of the language this post discusses a thread that is common to all programming languages. In fact any programming language can be expressed as

Lx = C + Sx

Where Lx is any programming language ‘x’. All programming languages have a set of core, common constructs which I have denoted as ‘C’ and a set of Specialized constructs, unique to each language ‘x’ which I have denoted as Sx. I would like to look at these constructs that are common to most programming languages like C,C++,Perl, Python, Ruby, C#, R, Octave etc. In my opinion knowing these core, common constructs and a few of the more specialized constructs should allow you to get started off in the language of your choice. You can pick up the more unique constructs as you go along. Here are the common constructs (C mentioned above) that you must familiarize yourself with when embarking on a new language

Reading user input and printing to screen
Reading and writing from a file
Conditional statement if-then-else if-else
Loops – For, while, repeat, do while etc.

Knowing these constructs and some of the basic concepts unique to each language for e.g.
– Structure, Pointers in C,
– Classes, inheritance in C++
– Subsetting in Octave, R
– car, cdr in Lisp will enable you to get started off in your chosen language.
I show the examples of these core constructs in many languages. Note the similarity between these constructs
1. C
Read from and write to console
scanf(x,”%d); printf(“The value of x is %d”, x);
Read from and write to filefread(buffer, strlen(c)+1, 1, fp); fwrite(c, strlen(c) + 1, 1, fp);

Conditional
if(x > 5) { printf(“x is greater than 5”); } else if (x < 5) { printf(“x is less than 5”); } else{ printf(“x is equal to 5”); }
Loops I will only consider for loops, though one could use while, repeat etC.
for(i =0; i <100; i++) { money = money++) }
2. C++
Read from and write to console
cin >> age; Cout << “The value is “ << value

Read from and write to a file // open a file in read mode.
ifstream infile; infile.open("afile.dat"); cout << "Reading from the file" << endl; infile >> data; ofstream outfile; outfile.open("afile.dat"); // write inputted data into the file. outfile << data << endl;
Conditional same as C
if(x > 5) { printf(“x is greater than 5”); }
else if (x < 5) { printf(“x is less than 5”); } else{ printf(“x is equal to 5”); }
Loops
for(i =0; i <100; i++) { money = money++)

}

2. C++ Read from and write to console
cin >> age;
Cout << “The value is “ << value
Read from and write to a file // open a file in read mode.
ifstream infile;
infile.open("afile.dat");
cout << "Reading from the file" << endl;
infile >> data; ofstream outfile;
outfile.open("afile.dat");
// write inputted data into the file.
outfile << data << endl;
Conditional same as C
if(x > 5) {
printf(“x is greater than 5”);
}
else if (x < 5) {
printf(“x is less than 5”);
}
else{ printf(“x is equal to 5”);
}
Loops
for(i =0; i <100; i++){
money = money++)
}
3. Java
Reading from and writing to standard input
Console c = System.console();
int val = c.readLine("Enter a value: ");
System.out.println("Value is "+ val);
Reading and writing from file
try {
in = new FileInputStream("input.txt");
out = new FileOutputStream("output.txt");
int c;
while ((c = in.read()) != -1) {
out.write(c); } } ...
Conditional (same as C)
if(x > 5) {
printf(“x is greater than 5”);
}
else if (x < 5) {
printf(“x is less than 5”);
}
else{ printf(“x is equal to 5”); }
Loops (same as C)
for(i =0; i <100; i++){
money = money++)
}

4. Perl Read from console
#!/usr/bin/perl
$userinput = ;
chomp ($userinput);
Write to console
print "User typed $userinput\n";
Reading and write to a file
open(IN,"infile") || die "cannot open input file";
open(OUT,"outfile") || die "cannot open output file";
while() {
print OUT $_;
# echo line read
}
close(IN);
close(OUT)
Conditional
if( $a == 20 ){
# if condition is true then print the following
printf "a has a value which is 20\n";
}
elsif( $a == 30 ){
# if condition is true then print the following
printf "a has a value which is 30\n";
}else{
# if none of the above conditions is true
printf "a has a value which is $a\n";
}
Loops
for (my $i=0; $i <= 9; $i++) {
print "$i\n";
}

5. Lisp
The syntax for Lisp will be different from the others as it is a functional language. You need to familiarize yourself with these constructs to move ahead
Read and write to console
To read from standard input use
(let ((temp 0))
(print ‘(Enter temp))
(setf temp (read))
(print (append ‘(the temp is) (list temp))))
Read from and write to file
(with-open-file (stream “C:\\acl82express\\lisp\\count.cl”)
(do ((line (read-line stream nil) (read-line stream nil)))
(with-open-file (stream “C:\\acl82express\\lisp\\test.txt” :direction :output :if-exists :supersede)
(write-line “test” stream) nil)
Conditional
$ (cond ((< x 5)
(setf x (+ x 8))
(setf y (* 2 y)))
((= x 10) (setf x (* x 2)))
(t (setf x 8)))
Loops
$ (setf x 5)
$ (let ((i 0))
(loop (setf y (* x i))
(when (> i 10) (return))
(print i) (prin1 y) (incf i )))

6. Python
Reading and writing from console
var = raw_input("Please enter something: ")
print “You entered: ” value
Reading and writing from files
f = open(filename, 'r')
a = f.readline().strip()
target = open(filename, 'w')
target.write(line1)
Conditionals
if x > 5:
print "x is greater than 5”
elif
x < 5:
print "x is less than 5"
else:
print "x is equal to 5"
Loops
for i in range(0, 6):
print "Value is :" % i 7.

R
x=5
paste('The value of x is =',x)
Reading and writing to a file
infile = read.csv(“file”)
write(x, file = "data", sep = " ")
Conditional
if(x > 5){
print(“x is greater than 5”)
}else if(x < 5){
print(“x is less than 5”)
}else {
print(“x is equal to 5”)
}
Loops
for (i in 1:10) print(i)

Conclusion
As can be seen the core constructs are very similar in different languages save for some minor variations. It is generally useful to get started with just knowing these constructs and few other important other features of the language that you are trying to learn. It is possible to code most programs with these Core constructs and a few of the Specialized constructs in the language. These Core constructs are the glue that hold your code together.

You can learn more compact and more powerful features of the language as you go along The above core constructs are like the letters of the programming language alphabet. You need to construct words by stringing together these constructs and form sensible sentences which will be your program. Good luck with your adventure in your next new programming language!!!

Also see
1.Programming languages in layman’s language
2. The mind of the programmer
3. How to program – Some essential tips
4. Programming Zen and now – Some essential tips -2

R incantations for the uninitiated

Here are some basic R incantations that will get you started with R

A) Scalars & Vectors:
Chant 1 – Now repeat after me, with your right hand forward at shoulder height “In R there are no scalars. There are only vectors of length 1”.
Just kidding:-)

To create an integer variable x with a value 5 we write
x <- 5 or x = 5
While the former notation may seem odd, it is actually more logical considering that the RHS is assigned to LHS. Anyway both seem to work
Vectors can be created as follows
a <- c( 2:10) b <- c("This", "is", 'R","language")

B) Sequences:
There are several ways of creating sequences of numbers which you intend to use for your computation
<- seq(5:25) # Sequence from 5 to 25

Other ways to create sequences
Increment by 2
> seq(5, 25, by=2) [1] 5 7 9 11 13 15 17 19 21 23 25

>seq(5,25,length=18) # Create sequence from 5 to 25 with a total length of 18 [1] 5.000000 6.176471 7.352941 8.529412 9.705882 10.882353 12.058824 13.235294 [9] 14.411765 15.588235 16.764706 17.941176 19.117647 20.294118 21.470588 22.647059 [17] 23.823529 25.000000

C) Conditions and loops
An if-else if-else construct goes like this
if(condition) { do something } else if (condition) { do something } else { do something }

Note: Make sure the statements appear as above with the else if and else appearing on the same line as the closing braces, otherwise R complains about ‘unexpected else’ in else statement

D) Loops
I would like to mention 2 ways of doing ‘for’ loops in R.
a) for (i in 1:10) { statement }
> a <- seq(5,25,length=10) > a [1] 5.000000 7.222222 9.444444 11.666667 13.888889 16.111111 18.333333 [8] 20.555556 22.777778 25.000000

b) Sequence along the vector sequence. Note: This is useful as we don’t have to know the length of the vector/sequence
for (i in seq_along(a)){ + print(a[i]) + }

[1] 5
[1] 7.222222
[1] 9.444444
[1] 11.66667
…

There are others ways of looping with ‘while’ and ‘repeat’ which I have not included in this post.

R makes manipulation of matrices and data frames really easy. All the elements in a matrix are numeric while data frames can have different types for each of the element

E) Matrix
> rnorm(12,5,2) [1] 2.699961 3.160208 5.087478 3.969129 3.317840 4.551565 2.585758 2.397780 [9] 5.297535 6.574757 7.468268 2.440835

a) Create a vector of 12 random numbers with a mean of 5 and SD of 2
> a <-rnorm(12,5,2) b) Convert the vector to a matrix with 4 rows and 3 columns > mat <- matrix(a,4,3) > mat[,1] [,2] [,3] [1,] 5.197010 3.839281 9.022818 [2,] 4.053590 5.321399 5.587495 [3,] 4.225763 4.873768 6.648151 [4,] 4.709784 4.129093 2.575523

c) Subset rows 1 & 2 from the matrix
> mat[1:2,] [,1]     [,2]     [,3] [1,] 5.19701 3.839281 9.022818 [2,] 4.05359 5.321399 5.587495
d) Subset matrix a rows 1& 2 and with columns 2 & 3
> mat[1:2,2:3] [,1]     [,2] [1,] 3.839281 9.022818 [2,] 5.321399 5.587495
e) Subset matrix a for all row elements for the column 3
> mat[,3] [1] 9.022818 5.587495 6.648151 2.575523
e) Add row names and column names for the matrix as follows
> names <- c(“tim”,”pat”,”joe”,”jim”)
> v <- data.frame(names,mat)
> v names       X1       X2       X3 1   tim 5.197010 3.839281 9.022818 2   pat 4.053590 5.321399 5.587495 3   joe 4.225763 4.873768 6.648151 4   jim 4.709784 4.129093 2.575523

> colnames(v) <- c("names","a","b","c") > v names a b c 1 tim 5.197010 3.839281 9.022818 2 pat 4.053590 5.321399 5.587495 3 joe 4.225763 4.873768 6.648151 4 jim 4.709784 4.129093 2.575523

F) Data Frames
In R data frames are the most important method to manipulate large amounts of data. One can read data in .csv format into data frame using
df <- read.csv(“mydata.csv”)
To get a feel of data frames it is useful to play around with the numerous data sets that are available with the installation of R
To check the available dataframes do
>data() AirPassengers Monthly Airline Passenger Numbers 1949-1960 BJsales Sales Data with Leading Indicator BJsales.lead (BJsales) Sales Data with Leading Indicator BOD Biochemical Oxygen Demand CO2 Carbon Dioxide Uptake in Grass Plants ChickWeight Weight versus age of chicks on different diets ...
…

I will be using the mtcars data frame. Here are some of the most important commands on data frames
a) load data from mtcars
data(mtcars)
b) > head(mtcars,3) # Display the top 3 rows of the data frame mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1

c) > tail(mtcars,4) # Display the boittom 4 rows of the data frame mpg cyl disp hp drat wt qsec vs am gear carb Ford Pantera L 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

d) > names(mtcars) # Display the names of the columns of the data frame [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"

e) > summary(mtcars) # Display the summary of the data frame mpg cyl disp hp drat wt Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.513 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695 Median :3.325 Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597 Mean :3.217 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930 Max. :5.424 qsec vs am gear carb Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000 Median :2.000 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000

f) > str(mtcars) # Generate a concise description of the data frame - values in each column, factors 'data.frame': 32 obs. of 11 variables: $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... $ disp: num 160 160 108 258 360 ... $ hp : num 110 110 93 110 175 105 245 62 95 123 ... $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... $ wt : num 2.62 2.88 2.32 3.21 3.44 ... $ qsec: num 16.5 17 18.6 19.4 17 ... $ vs : num 0 0 1 1 0 1 0 1 1 1 ... $ am : num 1 1 1 0 0 0 0 0 0 0 ... $ gear: num 4 4 4 3 3 3 3 4 4 4 ... $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
g) > mtcars[mtcars$mpg == 10.4,] #Select all rows in mtcars where the mpg column has a value 10.4 mpg cyl disp hp drat wt qsec vs am gear carb Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4

h) > mtcars[(mtcars$mpg >20) & (mtcars$mpg <24),] # Select all rows in mtcars where the mpg > 20 and mpg < 24 mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

i) > myset <- mtcars[(mtcars$cyl == 6) | (mtcars$cyl == 4),] # Get all calls which are either 4 or 6 cylinder > myset mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2… … …

j) > mean(myset$mpg) # Determine the mean of the set created above [1] 23.97222
k) > table(mtcars$cyl) #Create a table of cars which have 4,6, or 8 cylinders

4 6 8
11 7 14

G) lapply,sapply,tapply
I use the iris data set for these commands
a) > data(iris) #Load iris data set

b)> names(iris) #Show the column names of the data set [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" c) > lapply(iris,class) #Show the class of all the columns in iris $Sepal.Length [1] "numeric" $Sepal.Width [1] "numeric" $Petal.Length [1] "numeric" $Petal.Width [1] "numeric" $Species [1] "factor"

d) > sapply(iris,class) # Display a summary of the class of the iris data set Sepal.Length Sepal.Width Petal.Length Petal.Width Species "numeric" "numeric" "numeric" "numeric" "factor"

e) tapply: Instead of getting the mean for each of the species as below we can use tapply > a <-iris[iris$Species == "setosa",] > mean(a$Sepal.Length) [1] 5.006 > b <-iris[iris$Species == "versicolor",] > mean(b$Sepal.Length) [1] 5.936 > c <-iris[iris$Species == "virginica",] > mean(c$Sepal.Length) [1] 6.588

> tapply(iris$Sepal.Length,iris$Species,mean)
setosa versicolor virginica
5.006 5.936 6.588

Hopefully this highly condensed version of R will set you on a R-oll.

Programming Zen and now – Some essential tips-2

This post is a follow-up to my earlier post – How to program – Some essential tips. In this post I expand on some of the ideas of my earlier post.

Programming means different things to different people. To some programming is a drudgery almost akin to manual labor, to others programming is an insurmountable mountain full of frustrations and disappointments while to others it is an intense problem solving and a creative activity. In my opinion programming can mean anything to you. It is your attitude towards coding that make it a chore, a daunting task or something really creative.

Here are some my insights on how to go about learning to code

Eyes wide open: People generally get frustrated when a piece of code that they wrote does not do what they intended it to do. In some cases the code snippet will do nothing when they were expecting final result, sometimes the code will crash or it will go into an infinite loop and drive the person nuts. (Let me assure you – I have been there, done that!) The usual reaction when this happens is anger and frustration where we generally tinker around with the code only to get the same result. Soon the emotions will progress from anger to hopelessness.

The first thing that one needs to while coding is to keep your ‘eyes wide open’. We tend to be guilty of ignoring the error messages that show up. Here one way to attack coding

a) Fully understand the ‘what’ of the problem. If there is an infinite loop or a core dump check after which point does it happen? If there is an execution error, what is the error trying to tell us?
b) Next look into ‘why’ the error occurred. You could either use debugger or insert appropriate print statements to take the offending code apart.
c) Thirdly think ‘how‘ you can address the situation. Make appropriate changes and re-run the code
d) Did it solve the issue.If yes, move forward. Otherwise go to step a)

Remember that we learn more from our programming mistakes more than when our code just ‘happens’ to work! Mistakes in our code make us to explain every part of the program

Changing times:

Times have changed. Programming Zen and programming now are worlds apart. In many ways, IDEs, Git, Google etc. have made the programmer’s life a lot easier

‘Git’ing from here to there: Here is a trick that I learnt fairly recently, though it should have occurred to me more than 2 years back. This is using Git judiciously for all programming tasks (Note: I am saying nothing new here!). I find it really useful in writing code with incremental changes. I create my initial code on the master and then test out incremental changes on a ‘new branch’ even for personal projects. Once I have proved a small increment works, I merge it with the ‘main’ branch. I again start working on the ‘new’ for the next incremental change followed by a merge to the master

The steps are

Make initial changes

1. git add .
2. git commit –m “ Initial changes’

Create a new branch
3. git checkout –b ‘new

Make incremental changes. Test.
4.git add .
5. git commit –m “Change 1”

Merge with the master
6.git checkout master
7. git merge new

Continue to work with ‘new’.
8 . git checkout new
9. Go to step 4)

This process can be continued till you get your final product. I find this extremely useful instead of just using an IDE to make code changes. Invariably you can run into a situation where you had something working some time back and in the next instant it is broken and you can’t figure out all the changes you made to the working code. This can be extremely frustrating. With Git you have a history of changes and you can switch to an earlier version of working code and start from there.

Rarely do I find a reason to have more than 1 branch

Here is a pictorial version of this

Taking help from Dr. Google: For most questions and errors that you encounter you will find others who have hit similar bugs. Just google it. You will more than surprised that others went down the exact same path that you are treading. Besides the internet is full of tutorials, blogs and articles on key aspects of programming

Explore the cave of Stack overflow: Spend time exploring Stack overflow. Stack overflow is replete with code snippets and questions that you wanted to ask. There is so much information out there. If you really don’t find an answer to your problem, post it in Stack overflow and you are bound to get an answer or a link to a similar question asked previously

Finally programming requires dollops of patience. Develop patience along with your skill in coding and soon programming will much more enjoyable to you.

1. Programming languages in layman’s language
2. The common alphabet of programming languages
3. How to program – Some essential tips
4. The mind of a programmer

How to program – Some essential tips

If one follows the arrow of time from the early 1980s to the present day, the number of programming problems have not only proliferated but have also become more difficult. Fortunately programming in itself has become more manageable with massive increases in computing horsepower, smarter tools and instant availability of information on the internet, typically with the click of a mouse.

Learning to program is no easy task, but can be done with the right mix of attitude, curiosity and interest. Becoming adept at programming, however, is something else. An interesting essay in this context is Peter Norvig’s ‘Teach yourself programming in 10 years’

Back in the 1980s when I wrote my first Fortran program on my college Mainframe, programming was a lengthy exercise, spanning several days.

My first program was to plot a sine wave of characters on a computer printout. Running this program required the following several steps

Enter the program on a teletype terminal and create a stack of Hollerith (punched) cards
Submit the stack of cards to the computer center
The computer center would do a batch execute in the evening on the Mainframe
God forbid, if your program has a syntax error. If you did find an error, go back to step 1, the next day.
Assuming everything is fine, the computer center would run your program and your output (printout) would be placed in the appropriate pigeon hole which you would need to pick up the next day.

The whole exercise to write a small-sized program could take anywhere between a couple of days to a whole week.

In the early 1990s things got a little better where one could code, compile, link and execute sitting at one’s desk. However while the programming itself got much simpler than before, certain tasks were still difficult. Till the late 90s programs of any sort had to be written using a regular text editor (vi , emacs etc.) You would then have to go through the process of compiling, linking and executing.

An angry compiler would typically spew forth venom at missing semi-colons, undeclared variables, and uninitialized values. This would happen till you are able to iron out all syntax errors. Then you would link, get undefined symbols and have to include appropriate libraries etc. And then finally you would execute your code, only to have it crash. The process of debugging would then start.

Luckily technology has made life a whole lot easier except for the last step where you could still run into an execution errors . In these days an IDE (Interactive Development Environment) like Eclipse will flag syntax errors, missing definitions/declarations etc. as you write your code. Moreover Eclipse can also indicate which libraries (imports) you would need to include in your package for it to build. The only missing step in IDEs of these days is the ability to predict possible execution errors in your program. I wouldn’t be surprised, if in future, like Microsoft Word, the IDE is able to tell you if a programming construct does not make sense.

So things have gotten a lot easier for the programmer. The following tips for are particularly useful as you progress along in programming

These days when you are learning a new programming language it is not necessary to know the language from cover to cover by reading a book. In those days when we learnt C it was necessary to know everything from bit structures, macros, pragma etc. The reason being that every syntax or execution error one had to rush to get the textbook and thumb through it for the answer. Not so, in these days of Google. You have the world’s library at your fingertips.
To get started it is necessary to learn just the most important programming constructs of the language say structure, class, car, cdr besides the usual suspects like loops, conditions and case constructs
Download and install an IDE for the language. In most case Eclipse will work
Try to write a simple program and test out your code.
To do any sort of programming these days you will necessarily need to make 3 friends
1. Google
2. Stackoverflow
3. Git & GitHub
Honing your Googling skills is very important. There are answers to almost any sort of programming problems out there. You would be surprised to know that there are many others who did exactly the same stupid mistake that you did out there. Also googling will take you to interesting tutorials, blogs, articles that discuss different aspects of the programming language and the problem you are trying to solve
Stackoverflow is really a God send to all programmers. There are so many questions on so many aspects of every programming language on earth there. If you spend time searching Stackoverflow you are bound to find answers, code snippets that you can readily use in your code
Post your questions in stackoverflow when you don’t find the answers there. You are bound to get quick answers. Thanks to the gamification of Stackoverflow (points, upvotes,badges etc) that has been created on Stackoverflow.
Git & GitHub: I would suggest that you download and install GitHub for Windows. This will provide you with version control on your desktop. You can modify code while being to switch back to an earlier version with Git. Read up a good tutorial on Git for Windows
Once you have working code you push it onto GitHub and share with other programmers

Now that you have the basic setup here are few other extremely important tips

The most important criteria for programming is ‘attitude’. Initially you are bound to get frustrated, angry, irritated etc. But it is necessary to look at the errors that you get with the right attitude. Know that an error is telling you something. Usually the answers to your mistake are in the ‘error message’ itself. Look at it closely and try to understand it. You will learn a lot more when you learn from errors than from copy-pasting from somebody else’s code, even if works right the first time around!
Make sure you do something different each time. As Einstein said “ If you keep doing the same thing, you will keep getting the same result’
There are different ways to debug your code. You could use the debugger and single step through the code and keep checking the values of the variables. I personally prefer print statements to localize where things are going wrong. I then try to narrow down the problem to a few lines of code and try to take it apart.

Hopefully the above tips are useful. Programming can be creative activity and will be indispensable in our future.

Above all have fun coding, there are so many possibilities these days!

Also see

1. Programming languages in layman’s language
2. The common alphabet of programming languages
3. The mind of the programmer
4. Programming Zen and now – Some essential tips -2

Programming Zen and now – Some essential tips-2

The computer is not a dumb machine!

“The computer is a dumb machine. It needs to be told what to do at every step”. How often have we heard of this refrain from friends and those who have only an incidental interaction with computers? To them a computer is like a ball which has to be kicked from place to place. These people are either ignorant of computers, say it by force of habit or have a fear of computers. However this is so far from the truth. In this post, my 100^th, I come to the defense of the computer in a slightly philosophical way.

The computer is truly a marvel of technology. The computer truly embodies untapped intelligence. In my opinion even a safety pin is frozen intelligence. From a piece of metal the safety pin can now hold things together while pinning them, besides incorporating an aspect of safety.

Stating that the computer is a dumb machine is like saying that a television is dumb and an airplane is dumber. An airplane probably represents a modern miracle in which the laws of flight are built into every nut and bolt that goes into the plane. The electronics and the controls enable it to lift off, fly and land with precision and perform a miracle in every flight.

Similarly a computer from the bare hardware to the upper most layer of software is nothing but layer and layer of human ingenuity, creativity and innovation. At the bare metal the hardware of the computer is made up integrated chips that work at the rate of 1 billion+ instructions per second. The circuits are organized so precisely that they are able to work together and produce a coherent output, all at blazing speeds of less than a billionth of a second.

On top of the bare bones hardware we have some programs that work at the assembly and machine code made of 0’s and 1’s. The machine code is nothing more than an amorphous strings of 0’s and 1’s. At this level the thing that is worked on (object) and the thing that works on it (subject) are indistinguishable. There is no subject and object at this level. What distinguishes then is the context.

Over this layer we have the Operating System (OS) which I would like to refer to as the mind of the computer. The OS is managing many things all at once much like the mind has complete control over sense organs which receive external input. So the OS manages processes, memory, devices and CPU (resources)

As humans, we like to pride ourselves that we have consciousness. Rather than going into any metaphysical discussion on what consciousness is or isn’t it is clear that the OS keeps the computer completely conscious of the state of all its resources. So just like we react to data received through our sense organs the computer reacts to input received through devices (mouse, keyboard) or its memory etc. So does the computer have consciousness?

You say human beings are capable of thought. So what is thought but some sensible evaluation of known concepts? In a way the OS is also constant churning in the background trying to make sense of the state of the CPU, the memory or the disk.

Not to give in I can hear you say “But human beings understand choice”. Really! So here is my program for a human being

If provoked
Get angry

If insulted
Get hurt

If ego stoked
Go mad with joy

Just kidding! Anyway the recent advances in cognitive computing now show it is possible to have computers choose the best alternative. IBM’s Watson is capable of evaluation alternative choices.

Over the OS we have compilers and above that we have several applications.
The computer truly represents layers and layers of solidified human thought. Whether it is the precise hardware circuitry, the OS, compilers, or any application they are all result of human thought and they are constantly working in the computer.

So if your initial attempt to perform something useful did not quite work out, you must understand that you are working with decades of human thought embodied in the computer. So your instructions should be precise and logical. Otherwise your attempts will be thwarted.

So whether it’s the computer, the mobile or your car, we should look and appreciate the deep beauty that resides in these modern conveniences, gadgets or machinery.

Find me on Google+

Introduction

1a. Rank ODI Batsmen

1b. Rank ODI Bowlers

2a. Rank T20 Batsmen

2b. Rank T20 Bowlers

3a. Rank IPL Batsmen

3a. Rank IPL Bowlers

Share:

Like this:

Tensorflow in Python

Tensorflow in R (RStudio)

1. Multivariate regression with Tensorflow – Python

Observation

1a. Multivariate Regression in Tensorflow – R

Multivariate regression

Read the data

Split the data as training and test in 80/20

Normalize the data

Create the Deep Learning Model

Plot mean squared error, mean absolute error and loss for training data and test data

2. Binary classification in Tensorflow – Python

2a. Binary classification in Tensorflow -R

Normalize the data

Create the Deep Learning model

Fit the model. Use 20% of data for validation

Plot the accuracy and loss for training and validation data

3. MNIST in Tensorflow – Python

MNIST in Tensorflow – R

Reshape and rescale

Convert out put to One Hot encoded format

Fit the model

Fit the model

Plot the accuracy and loss for training and test data

Share:

Like this:

Introduction

The cricpy package

1 Importing cricpy – Python

2. Invoking functions with Python package cricpy

3. Getting help from cricpy – Python

4. Get the Twenty20 player data for a player using the function getPlayerDataOD()

5 Virat Kohli’s performance – Basic Analyses

6. More analyses

7. 3D scatter plot and prediction plane

8. Average runs at different venues

9. Average runs against different opposing teams

10 . Highest Runs Likelihood

11. A look at the Top 4 batsman – Kohli, Guptill, Shahzad and McCullum

12. Box Histogram Plot

13 Moving Average of runs in career

14 Cumulative Average runs of batsman in career

15 Cumulative Average strike rate of batsman in career

16 Relative Batsman Cumulative Average Runs

17. Relative Batsman Strike Rate

18. 3D plot of Runs vs Balls Faced and Minutes at Crease

19. 3D plot of Runs vs Balls Faced and Minutes at Crease

20. Predicting Runs given Balls Faced and Minutes at Crease

21 Analysis of Top Bowlers

22. Get the bowler’s data

23. Wicket Frequency Plot

24. Wickets Runs plot

25 Average wickets at different venues

26 Average wickets against different opposition

27 Wickets taken moving average

28 Cumulative average wickets taken

29 Cumulative average economy rate

30 Relative cumulative average economy rate of bowlers

31 Relative Economy Rate against wickets taken

32 Relative cumulative average wickets of bowlers in career

33. Key Findings

Analysis of Top 4 batsman

Conclusion

Share:

Like this:

Introduction

The cricpy package

1 Importing cricpy – Python

2. Invoking functions with Python package crlcpy

3. Getting help from cricpy – Python

4. Get the ODI player data for a player using the function getPlayerDataOD()