My book ‘Practical Machine Learning with R and Python’ on Amazon

Note: The 3rd edition of this book is now available: My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

My book ‘Practical Machine Learning with R and Python: Second Edition – Machine Learning in stereo’ is now available in both paperback ($10.99) and kindle ($7.99/Rs449) versions. In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code. This is almost like listening to parallel channels of music in stereo!
1. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Paperback – $12.99)
2. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Kindle – $8.99/Rs449)
This book is ideal for both beginners and experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Table of Contents
Essential R …………………………………….. 7
Essential Python for Datascience ………………..   54
R vs Python ……………………………………. 77
Regression of a continuous variable ………………. 96
Classification and Cross Validation ……………….113
Regression techniques and regularization …………. 134
SVMs, Decision Trees and Validation curves …………175
Splines, GAMs, Random Forests and Boosting …………202
PCA, K-Means and Hierarchical Clustering …………. 234

Pick up your copy today!!
Hope you have a great time learning as I did while implementing these algorithms!

Practical Machine Learning with R and Python – Part 4

This is the 4th installment of my ‘Practical Machine Learning with R and Python’ series. In this part I discuss classification with Support Vector Machines (SVMs), using both a linear and a radial basis kernel, and Decision Trees. Further, I take a closer look at some of the metrics associated with binary classification, namely accuracy vs. precision and recall. I also touch upon Validation curves, Precision-Recall curves, ROC curves and AUC, with equivalent code in R and Python.

This post is a continuation of my 3 earlier posts on Practical Machine Learning in R and Python
1. Practical Machine Learning with R and Python – Part 1
2. Practical Machine Learning with R and Python – Part 2
3. Practical Machine Learning with R and Python – Part 3

The RMarkdown file with the code and the associated data files can be downloaded from Github at MachineLearning-RandPython-Part4

Note: Please listen to my video presentations on Machine Learning on YouTube
1. Machine Learning in plain English-Part 1
2. Machine Learning in plain English-Part 2
3. Machine Learning in plain English-Part 3

Check out my compact and minimal book “Practical Machine Learning with R and Python: Third edition – Machine Learning in stereo” available on Amazon in paperback ($12.99) and kindle ($8.99) versions. My book includes implementations of key ML algorithms and associated measures and metrics. The book is ideal for anybody who is familiar with the concepts and would like a quick reference to the different ML algorithms that can be applied to problems, and how to select the best model. Pick up your copy today!!

 

Support Vector Machines (SVM) are another useful Machine Learning model that can be used for both regression and classification problems. SVMs used for classification compute the hyperplane that separates the 2 classes with the maximum margin. To do this, the features may be transformed into a higher-dimensional feature space. SVMs can be used with different kernels, namely linear, polynomial or radial basis, to determine the best fitting model for a given classification problem.
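
As a quick illustration of the kernel choice, here is a minimal Python sketch (my own, not part of the series code) that compares the cross-validated accuracy of the three kernels on the sklearn breast cancer data used throughout this post. For simplicity it scales the whole dataset before cross validation, which leaks a little information across folds; section 1.7a below shows the stricter approach.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Load and scale the cancer data (SVMs are sensitive to feature scales)
X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)

# Compare the kernels with 5-fold cross validation
for kernel in ['linear', 'poly', 'rbf']:
    scores = cross_val_score(SVC(kernel=kernel), X_scaled, y, cv=5)
    print('{:6s} kernel: mean CV accuracy = {:.3f}'.format(kernel, np.mean(scores)))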

In the 2nd part of this series, Practical Machine Learning with R and Python – Part 2, I had mentioned the various metrics that are used in classification ML problems, namely Accuracy, Precision, Recall and F1 score. Accuracy gives the fraction of data that were correctly classified as belonging to the +ve or -ve class. However ‘accuracy’ in itself is not a good enough measure, because it does not distinguish between the two kinds of errors, false positives and false negatives, and this distinction becomes critical in many domains. For e.g. a surgeon who would like to detect cancer would err on the side of caution, and classify even a possibly non-cancerous patient as possibly having cancer, rather than mis-classify a malignancy as benign. Here we would like to increase recall or sensitivity, which is given by Recall = TP/(TP+FN). In other words, we try to reduce mis-classification by either increasing the true positives (TP) or reducing the false negatives (FN).

On the other hand, search engines would like to increase precision, which reduces the number of irrelevant results returned. Precision = TP/(TP+FP). In other words, we do not want ‘false positives’, i.e. irrelevant results, to appear in the search results, so there is a need to reduce the false positives.
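
To see how these metrics work numerically, here is a minimal Python sketch with hypothetical confusion-matrix counts (the numbers are made up purely for illustration):

# Hypothetical counts of true/false positives/negatives
TP, FP, TN, FN = 80, 10, 50, 20

accuracy  = (TP + TN) / (TP + TN + FP + FN)            # 130/160 ~ 0.81
precision = TP / (TP + FP)                             # 80/90  ~ 0.89
recall    = TP / (TP + FN)                             # 80/100 = 0.80
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)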

When we try to increase ‘precision’, we do so at the cost of ‘recall’, and vice-versa. I found this explanation in Wikipedia very useful (Source: Wikipedia):

“Consider a brain surgeon tasked with removing a cancerous tumor from a patient’s brain. The surgeon needs to remove all of the tumor cells since any remaining cancer cells will regenerate the tumor. Conversely, the surgeon must not remove healthy brain cells since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of the brain she removes to ensure she has extracted all the cancer cells. This decision increases recall but reduces precision. On the other hand, the surgeon may be more conservative in the brain she removes to ensure she extracts only cancer cells. This decision increases precision but reduces recall. That is to say, greater recall increases the chances of removing healthy cells (negative outcome) and increases the chances of removing all cancer cells (positive outcome). Greater precision decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome).”

1.1a. Linear SVM – R code

In the R code below I use an SVM with a linear kernel.

source('RFunctions-1.R')
library(dplyr)
library(e1071)
library(caret)
library(reshape2)
library(ggplot2)
# Read data. Data from SKLearn
cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)

# Split into training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Fit an SVM with a linear kernel. Do not scale the data
svmfit=svm(target~., data=train, kernel="linear",scale=FALSE)
ypred=predict(svmfit,test)
#Print a confusion matrix
confusionMatrix(ypred,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 54  3
##          1  3 82
##                                           
##                Accuracy : 0.9577          
##                  95% CI : (0.9103, 0.9843)
##     No Information Rate : 0.5986          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9121          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9474          
##             Specificity : 0.9647          
##          Pos Pred Value : 0.9474          
##          Neg Pred Value : 0.9647          
##              Prevalence : 0.4014          
##          Detection Rate : 0.3803          
##    Detection Prevalence : 0.4014          
##       Balanced Accuracy : 0.9560          
##                                           
##        'Positive' Class : 0               
## 

1.1b Linear SVM – Python code

The code below creates an SVM with a linear kernel in Python and prints the training and test accuracy.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

from sklearn.datasets import make_classification, make_blobs

from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)
clf = LinearSVC().fit(X_train, y_train)
print('Breast cancer dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
## Breast cancer dataset
## Accuracy of Linear SVC classifier on training set: 0.92
## Accuracy of Linear SVC classifier on test set: 0.94

1.2 Dummy classifier

Often when we perform classification tasks using any ML model, namely logistic regression, SVM, neural networks etc., it is very useful to determine how well the ML model performs against a dummy classifier. A dummy classifier uses some simple rule, like the frequency of the majority class, instead of fitting an ML model. It is essential that our ML model does much better than the dummy classifier. This issue is even more important for imbalanced classes, where we may have only about 10% of +ve samples. If an ML model we create has an accuracy of about 0.90, then it is evident that our classifier is not doing any better than a dummy classifier which simply predicts the majority class of this imbalanced data and also comes up with 0.90. We need to be able to do better than that.
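
A minimal Python sketch of the imbalanced-class situation described above, on synthetic data (the 90/10 split and all other parameter values are my own choices for illustration):

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 90% -ve and 10% +ve samples
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A majority-class dummy scores close to 0.90 without learning anything
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print('Dummy accuracy: {:.2f}'.format(dummy.score(X_test, y_test)))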

In the examples below (1.3a & 1.3b) it can be seen that SVMs with a ‘radial basis’ kernel on unnormalized data, in both R and Python, do not perform any better than the dummy classifier.

1.2a Dummy classifier – R code

R does not seem to have an explicit dummy classifier, so I created a simple one that predicts the majority class. Sklearn in Python also includes other strategies like uniform, stratified etc., which should be possible to create in R as well (a quick Python comparison of these strategies is sketched at the end of 1.2b).

# Create a simple dummy classifier that computes the ratio of the majority class to the total
DummyClassifierAccuracy <- function(train,test,type="majority"){
  if(type=="majority"){
      count <- sum(train$target==1)/dim(train)[1]
  }
  count
}


cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)

# Create training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

#Dummy classifier majority class
acc=DummyClassifierAccuracy(train,test)
sprintf("Accuracy is %f",acc)
## [1] "Accuracy is 0.638498"

1.2b Dummy classifier – Python code

This dummy classifier uses the majority class.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)

# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
y_dummy_predictions = dummy_majority.predict(X_test)

print('Dummy classifier accuracy on test set: {:.2f}'
     .format(dummy_majority.score(X_test, y_test)))
## Dummy classifier accuracy on test set: 0.63
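
As noted in 1.2a, sklearn’s DummyClassifier supports other strategies besides ‘most_frequent’. A minimal sketch comparing a few of them on the same data (‘stratified’ and ‘uniform’ are randomized, so their scores vary from run to run):

from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)

# Compare some of the built-in dummy strategies
for strategy in ['most_frequent', 'stratified', 'uniform']:
    dummy = DummyClassifier(strategy=strategy).fit(X_train, y_train)
    print('{:13s} accuracy: {:.2f}'.format(strategy, dummy.score(X_test, y_test)))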

1.3a – Radial SVM (un-normalized) – R code

SVMs perform better when the data is normalized or scaled. The 2 examples below show that an SVM with a radial basis kernel does not perform any better than the dummy classifier when the data is not scaled.

library(dplyr)
library(e1071)
library(caret)
library(reshape2)
library(ggplot2)

# Radial SVM unnormalized
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]
# Unnormalized data
svmfit=svm(target~., data=train, kernel="radial",cost=10,scale=FALSE)
ypred=predict(svmfit,test)
confusionMatrix(ypred,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  0  0
##          1 57 85
##                                           
##                Accuracy : 0.5986          
##                  95% CI : (0.5131, 0.6799)
##     No Information Rate : 0.5986          
##     P-Value [Acc > NIR] : 0.5363          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : 1.195e-13       
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.5986          
##              Prevalence : 0.4014          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

1.3b – Radial SVM (un-normalized) – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)


clf = SVC(C=10).fit(X_train, y_train)
print('Breast cancer dataset (unnormalized features)')
print('Accuracy of RBF-kernel SVC on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RBF-kernel SVC on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
## Breast cancer dataset (unnormalized features)
## Accuracy of RBF-kernel SVC on training set: 1.00
## Accuracy of RBF-kernel SVC on test set: 0.63

1.5a – Radial SVM (Normalized) – R Code

The data is scaled (normalized) before using the SVM model. The SVM model has 2 parameters: a) C – a large C means less regularization (the model fits the training data more closely), while a small C means more regularization; b) gamma – a small gamma gives a smoother decision boundary with more misclassification, while a larger gamma gives a tighter decision boundary.
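
As a quick Python illustration of the gamma parameter (the gamma values below are my own, chosen only for illustration; the C sweep is done in the R code that follows):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)
# Scale the data as in the sections below
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Small gamma -> smoother boundary; large gamma -> tighter boundary that can overfit
for gamma in [0.01, 0.1, 1, 10]:
    clf = SVC(C=1, gamma=gamma).fit(X_train_scaled, y_train)
    print('gamma={:5.2f}: train={:.2f}, test={:.2f}'
         .format(gamma, clf.score(X_train_scaled, y_train), clf.score(X_test_scaled, y_test)))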

The R code below computes the accuracy as the regularization parameter C is changed.

trainingAccuracy <- NULL
testAccuracy <- NULL
C1 <- c(.01,.1, 1, 10, 20)
for(i in  C1){
  
    svmfit=svm(target~., data=train, kernel="radial",cost=i,scale=TRUE)
    ypredTrain <-predict(svmfit,train)
    ypredTest=predict(svmfit,test)
    a <-confusionMatrix(ypredTrain,train$target)
    b <-confusionMatrix(ypredTest,test$target)
    trainingAccuracy <-c(trainingAccuracy,a$overall[1])
    testAccuracy <-c(testAccuracy,b$overall[1])
    
}
print(trainingAccuracy)
##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy 
## 0.6384977 0.9671362 0.9906103 0.9976526 1.0000000
print(testAccuracy)
##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy 
## 0.5985915 0.9507042 0.9647887 0.9507042 0.9507042
a <-rbind(C1,as.numeric(trainingAccuracy),as.numeric(testAccuracy))
b <- data.frame(t(a))
names(b) <- c("C1","trainingAccuracy","testAccuracy")
df <- melt(b,id="C1")
ggplot(df) + geom_line(aes(x=C1, y=value, colour=variable),size=2) +
    xlab("C (SVC regularization)value") + ylab("Accuracy") +
    ggtitle("Training and test accuracy vs C(regularization)")

1.5b – Radial SVM (normalized) – Python

The Radial basis kernel is used on normalized data for a range of ‘C’ values and the result is plotted.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
   
print('Breast cancer dataset (normalized with MinMax scaling)')
trainingAccuracy=[]
testAccuracy=[]
for C1 in [.01,.1, 1, 10, 20]:
    clf = SVC(C=C1).fit(X_train_scaled, y_train)
    acctrain=clf.score(X_train_scaled, y_train)
    accTest=clf.score(X_test_scaled, y_test)
    trainingAccuracy.append(acctrain)
    testAccuracy.append(accTest)
    
# Create a dataframe
C1=[.01,.1, 1, 10, 20]   
trainingAccuracy=pd.DataFrame(trainingAccuracy,index=C1)
testAccuracy=pd.DataFrame(testAccuracy,index=C1)

# Plot training and test accuracy as a function of C
df=pd.concat([trainingAccuracy,testAccuracy],axis=1)
df.columns=['trainingAccuracy','testAccuracy']

fig1=df.plot()
fig1=plt.title('Training and test accuracy vs C (SVC)')
fig1.figure.savefig('fig1.png', bbox_inches='tight')
## Breast cancer dataset (normalized with MinMax scaling)

Output image: 

1.6a Validation curve – R code

Sklearn includes a validation curve function which varies a parameter such as gamma or C, computes the scores, and plots the accuracy as the parameter changes. I did not find an equivalent in R, but I think this is a useful function, so I have created the R equivalent below.

# The R equivalent of np.logspace
seqLogSpace <- function(start,stop,len){
  a=seq(log10(10^start),log10(10^stop),length=len)
  10^a
}

# Read the data. This is taken from the SKlearn cancer data
cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)

set.seed(6)

# Create the range of gamma in log space
param_range = seqLogSpace(-3,2,20)
# Initialize the overall training and test accuracy to NULL
overallTrainAccuracy <- NULL
overallTestAccuracy <- NULL

# Loop over the parameter range of Gamma
for(i in param_range){
    # Set no of folds
    noFolds=5
    # Create the rows which fall into different folds from 1..noFolds
    folds = sample(1:noFolds, nrow(cancer), replace=TRUE) 
    # Initialize the training and test accuracy of folds to 0
    trainingAccuracy <- 0
    testAccuracy <- 0
    
    # Loop through the folds
    for(j in 1:noFolds){
        # The training is all rows for which the row is != j (k-1 folds -> training)
        train <- cancer[folds!=j,]
        # The rows which have j as the index become the test set
        test <- cancer[folds==j,]
        # Create a SVM model for this
        svmfit=svm(target~., data=train, kernel="radial",gamma=i,scale=TRUE)
  
        # Add up all the fold accuracy for training and test separately  
        ypredTrain <-predict(svmfit,train)
        ypredTest=predict(svmfit,test)
        
        # Create confusion matrix 
        a <-confusionMatrix(ypredTrain,train$target)
        b <-confusionMatrix(ypredTest,test$target)
        # Get the accuracy
        trainingAccuracy <-trainingAccuracy + a$overall[1]
        testAccuracy <-testAccuracy+b$overall[1]

    }
    # Compute the average of accuracy for K folds for number of features 'i'
    overallTrainAccuracy=c(overallTrainAccuracy,trainingAccuracy/noFolds)
    overallTestAccuracy=c(overallTestAccuracy,testAccuracy/noFolds)
}
#Create a dataframe
a <- rbind(param_range,as.numeric(overallTrainAccuracy),
               as.numeric(overallTestAccuracy))
b <- data.frame(t(a))
names(b) <- c("gamma","trainingAccuracy","testAccuracy")
df <- melt(b,id="gamma")
#Plot in log axis
ggplot(df) + geom_line(aes(x=gamma, y=value, colour=variable),size=2) +
      xlab("Gamma") + ylab("Accuracy") +
      ggtitle("Training and test accuracy vs gamma (radial SVM)") + scale_x_log10()

1.6b Validation curve – Python

Compute and plot the validation curve as gamma is varied.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve


# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_cancer)

# Create 20 gamma values from 10^-3 to 10^2, equally spaced in log space
param_range = np.logspace(-3, 2, 20)
# Compute the validation curve
train_scores, test_scores = validation_curve(SVC(), X_scaled, y_cancer,
                                            param_name='gamma',
                                            param_range=param_range, cv=10)
                                            
#Plot the figure                                           
fig2=plt.figure()

#Compute the mean
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

fig2=plt.title('Validation Curve with SVM')
fig2=plt.xlabel('$\gamma$ (gamma)')
fig2=plt.ylabel('Score')
fig2=plt.ylim(0.0, 1.1)
lw = 2

fig2=plt.semilogx(param_range, train_scores_mean, label='Training score',
            color='darkorange', lw=lw)

fig2=plt.fill_between(param_range, train_scores_mean - train_scores_std,
                train_scores_mean + train_scores_std, alpha=0.2,
                color='darkorange', lw=lw)

fig2=plt.semilogx(param_range, test_scores_mean, label='Cross-validation score',
            color='navy', lw=lw)

fig2=plt.fill_between(param_range, test_scores_mean - test_scores_std,
                test_scores_mean + test_scores_std, alpha=0.2,
                color='navy', lw=lw)
fig2.figure.savefig('fig2.png', bbox_inches='tight')

Output image: 

1.7a Validation Curve (Preventing data leakage) – Python code

In this course, Applied Machine Learning in Python, the Professor states that when we apply the same data transformation to the entire dataset, it will cause data leakage. “The proper way to do cross-validation when you need to scale the data is not to scale the entire dataset with a single transform, since this will indirectly leak information into the training data about the whole dataset, including the test data (see the lecture on data leakage later in the course). Instead, scaling/normalizing must be computed and applied for each cross-validation fold separately”

So I apply separate scaling to the training and testing folds and plot the result. In the lecture the Prof states that this can be done using pipelines, as sketched below.
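
Here is a minimal sketch of that pipeline approach (my own sketch, not code from the course): sklearn’s make_pipeline re-fits the MinMaxScaler inside every cross-validation fold, so validation_curve can be used without leaking the test fold into the scaling statistics. The manual fold-by-fold version that follows makes the same idea explicit.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)

# The scaler is re-fitted on each training fold inside the pipeline
pipe = make_pipeline(MinMaxScaler(), SVC())
param_range = np.logspace(-3, 2, 20)
train_scores, test_scores = validation_curve(pipe, X_cancer, y_cancer,
                                            param_name='svc__C',
                                            param_range=param_range, cv=5)
print(np.mean(test_scores, axis=1))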

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Read the data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# Set the parameter range
param_range = np.logspace(-3, 2, 20)

# Set number of folds
folds=5
#Initialize
overallTrainAccuracy=[]
overallTestAccuracy=[]

# Loop over the parameter range
for c in  param_range:
    trainingAccuracy=0
    testAccuracy=0
    kf = KFold(n_splits=folds)
    # Partition into training and test folds
    for train_index, test_index in kf.split(X_cancer):
            # Partition the data according to the fold indices generated
            X_train, X_test = X_cancer[train_index], X_cancer[test_index]
            y_train, y_test = y_cancer[train_index], y_cancer[test_index]  

            
            # Scale the X_train and X_test 
            scaler = MinMaxScaler()
            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)
            # Fit a SVC model for each C
            clf = SVC(C=c).fit(X_train_scaled, y_train)
            #Compute the training and test score
            acctrain=clf.score(X_train_scaled, y_train)
            accTest=clf.score(X_test_scaled, y_test)
            trainingAccuracy += np.sum(acctrain)
            testAccuracy += np.sum(accTest)
    # Compute the mean training and testing accuracy
    overallTrainAccuracy.append(trainingAccuracy/folds)
    overallTestAccuracy.append(testAccuracy/folds)
        

overallTrainAccuracy=pd.DataFrame(overallTrainAccuracy,index=param_range)
overallTestAccuracy=pd.DataFrame(overallTestAccuracy,index=param_range)

# Plot training and test accuracy as a function of C
df=pd.concat([overallTrainAccuracy,overallTestAccuracy],axis=1)
df.columns=['trainingAccuracy','testAccuracy']


fig3=plt.title('Validation Curve with SVM')
fig3=plt.xlabel('C (SVC regularization parameter)')
fig3=plt.ylabel('Score')
fig3=plt.ylim(0.5, 1.1)
lw = 2

fig3=plt.semilogx(param_range, overallTrainAccuracy, label='Training score',
            color='darkorange', lw=lw)

fig3=plt.semilogx(param_range, overallTestAccuracy, label='Cross-validation score',
            color='navy', lw=lw)

fig3=plt.legend(loc='best')
fig3.figure.savefig('fig3.png', bbox_inches='tight')

Output image: 

1.8 a Decision trees – R code

Decision trees in R can be created with the rpart package and plotted with rpart.plot

library(rpart)
library(rpart.plot)
rpart = NULL
# Create a decision tree
m <-rpart(Species~.,data=iris)
#Plot
rpart.plot(m,extra=2,main="Decision Tree - IRIS")

 

1.8 b Decision trees – Python code

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
import graphviz 

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state = 3)
clf = DecisionTreeClassifier().fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names=iris.feature_names,  
                         class_names=iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)  
graph
## Accuracy of Decision Tree classifier on training set: 1.00
## Accuracy of Decision Tree classifier on test set: 0.97

1.9a Feature importance – R code

I found the following code which had a snippet for feature importance. Sklearn has a nice method for this. The results in R and Python turn out different, which is not surprising: the R code below computes variable importance from a random forest (caret with method="rf"), while the Python code uses a single decision tree, so the importance rankings will naturally differ.

set.seed(3)
# load the library
library(mlbench)
library(caret)
# load the dataset
cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)
# Split as data
data <- cancer[,1:31]
target <- cancer[,32]

# Train the model
model <- train(data, target, method="rf", preProcess="scale", trControl=trainControl(method = "cv"))
# Compute variable importance
importance <- varImp(model)
# summarize importance
print(importance)
# plot importance
plot(importance)

1.9b Feature importance – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import numpy as np
# Read the data
cancer= load_breast_cancer()
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
# Use the DecisionTreClassifier
clf = DecisionTreeClassifier(max_depth = 4, min_samples_leaf = 8,
                            random_state = 0).fit(X_train, y_train)

c_features=len(cancer.feature_names)
print('Breast cancer dataset: decision tree')
print('Accuracy of DT classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of DT classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

# Plot the feature importances
fig4=plt.figure(figsize=(10,6),dpi=80)

fig4=plt.barh(range(c_features), clf.feature_importances_)
fig4=plt.xlabel("Feature importance")
fig4=plt.ylabel("Feature name")
fig4=plt.yticks(np.arange(c_features), cancer.feature_names)
fig4=plt.tight_layout()
plt.savefig('fig4.png', bbox_inches='tight')
## Breast cancer dataset: decision tree
## Accuracy of DT classifier on training set: 0.96
## Accuracy of DT classifier on test set: 0.94

Output image: 

1.10a Precision-Recall, ROC curves & AUC- R code

I tried several R packages for plotting Precision-Recall and ROC curves; PRROC seems to work well. The Precision-Recall curves show the tradeoff between precision and recall: the higher the precision, the lower the recall and vice versa. ROC curves that hug the top left corner indicate high sensitivity and specificity, and hence excellent accuracy.

source("RFunctions-1.R")
library(dplyr)
library(caret)
library(e1071)
library(PRROC)
# Read the data (this data is from sklearn!)
d <- read.csv("digits.csv")
digits <- d[2:66]
digits$X64 <- as.factor(digits$X64)

# Split as training and test sets
train_idx <- trainTestSplit(digits,trainPercent=75,seed=5)
train <- digits[train_idx, ]
test <- digits[-train_idx, ]

# Fit a SVM model with linear basis kernel with probabilities
svmfit=svm(X64~., data=train, kernel="linear",scale=FALSE,probability=TRUE)
ypred=predict(svmfit,test,probability=TRUE)
head(attr(ypred,"probabilities"))
##               0            1
## 6  7.395947e-01 2.604053e-01
## 8  9.999998e-01 1.842555e-07
## 12 1.655178e-05 9.999834e-01
## 13 9.649997e-01 3.500032e-02
## 15 9.994849e-01 5.150612e-04
## 16 9.999987e-01 1.280700e-06
# Store the probability of 0s and 1s
m0<-attr(ypred,"probabilities")[,1]
m1<-attr(ypred,"probabilities")[,2]

# Create a dataframe of scores
scores <- data.frame(m1,test$X64)

# In pr.curve, scores.class0 takes the scores of the +ve class (digit 1) and scores.class1 those of the -ve class (digit 0)
#Compute Precision Recall
pr <- pr.curve(scores.class0=scores[scores$test.X64=="1",]$m1,
               scores.class1=scores[scores$test.X64=="0",]$m1,
               curve=T)

# Plot precision-recall curve
plot(pr)

#Plot the ROC curve
roc<-roc.curve(m0, m1,curve=TRUE)
plot(roc)

1.10b Precision-Recall, ROC curves & AUC- Python code

In Python, Logistic Regression is used to plot the Precision-Recall and ROC curves and to compute the AUC.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve, auc
#Load the digits
dataset = load_digits()
X, y = dataset.data, dataset.target
#Create 2 classes: i) digit 1 (the +ve class) ii) all other digits (the -ve class)
# Make a copy of the target
z= y.copy()
# Replace all non 1's as 0
z[z != 1] = 0

X_train, X_test, y_train, y_test = train_test_split(X, z, random_state=0)
# Fit a LR model
lr = LogisticRegression().fit(X_train, y_train)

#Compute the decision scores
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))

#Show the decision_function scores for first 20 instances
y_score_list

precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]
#Plot
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.axes().set_aspect('equal')
plt.savefig('fig5.png', bbox_inches='tight')

#Compute and plot the ROC
y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.axes()
plt.savefig('fig6.png', bbox_inches='tight')

output

output

1.10c Precision-Recall, ROC curves & AUC- Python code

In the code below, classification probabilities (obtained with CalibratedClassifierCV) are used to compute and plot the precision-recall curve.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

dataset = load_digits()
X, y = dataset.data, dataset.target
# Make a copy of the target
z= y.copy()
# Replace all non 1's as 0
z[z != 1] = 0


X_train, X_test, y_train, y_test = train_test_split(X, z, random_state=0)
svm = LinearSVC()
# Need to use CalibratedClassifierCV to predict probabilities for LinearSVC
clf = CalibratedClassifierCV(svm) 
clf.fit(X_train, y_train)
y_proba_lr = clf.predict_proba(X_test)
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_proba_lr[:,1])
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]
#plt.figure(figsize=(15,15),dpi=80)
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.axes().set_aspect('equal')
plt.savefig('fig7.png', bbox_inches='tight')

output

Note: As with other posts in this series on ‘Practical Machine Learning with R and Python’, this post is based on these 2 MOOC courses
1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Online Stanford
2. Applied Machine Learning in Python, Prof Kevyn Collins-Thompson, University of Michigan, Coursera

Conclusion

This 4th part looked at SVMs with linear and radial basis kernels, decision trees, dummy classifiers, the precision-recall tradeoff, ROC curves and AUC.

Stick around for further updates. I’ll be back!
Comments, suggestions and corrections are welcome.

Also see
1. A primer on Qubits, Quantum gates and Quantum Operations
2. Dabbling with Wiener filter using OpenCV
3. The mind of a programmer
4. Sea shells on the seashore
5. yorkr pads up for the Twenty20s: Part 1- Analyzing team’s match performance

To see all posts see Index of posts

Practical Machine Learning with R and Python – Part 2

In this 2nd part of the series “Practical Machine Learning with R and Python”, I continue where I left off in my first post, Practical Machine Learning with R and Python – Part 1. In this post I cover some classification algorithms and cross validation. Specifically I touch upon
-Logistic Regression
-K Nearest Neighbors (KNN) classification
-Leave-one-out Cross Validation (LOOCV)
-K Fold Cross Validation
in both R and Python.

As in my initial post the algorithms are based on the following courses.

You can download this R Markdown file along with the data from Github. I hope these posts can be used as a quick reference in R and Python and Machine Learning. I have tried to include the coolest part of either course in this post.


The following classification problem is based on Logistic Regression. The data is an included data set in Scikit-Learn, which I have saved as csv so that it can also be used in R. The fit of a classification Machine Learning model depends on how correctly it classifies the data. There are several measures for testing a model’s classification performance. They are

Accuracy = (TP + TN) / (TP + TN + FP + FN) – Fraction of all classes correctly classified
Precision = TP / (TP + FP) – Fraction of correctly classified positives among those classified as positive
Recall = TP / (TP + FN) – Also known as sensitivity, or True Positive Rate – Fraction of correctly classified positives among all positives in the data
F1 = 2 * Precision * Recall / (Precision + Recall) – The harmonic mean of precision and recall
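
To make these formulas concrete, here is a minimal Python sketch (with made-up labels) that recovers TP, TN, FP and FN from sklearn’s confusion_matrix and computes each measure:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, made up for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# sklearn's confusion_matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('Accuracy : {:.2f}'.format((tp + tn) / (tp + tn + fp + fn)))  # 8/10 = 0.80
print('Precision: {:.2f}'.format(tp / (tp + fp)))                   # 5/6 ~ 0.83
print('Recall   : {:.2f}'.format(tp / (tp + fn)))                   # 5/6 ~ 0.83
print('F1       : {:.2f}'.format(2*tp / (2*tp + fp + fn)))          # equivalent form of F1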

1a. Logistic Regression – R code

The caret and e1071 packages are required for using the confusionMatrix call

source("RFunctions.R")
library(dplyr)
library(caret)
library(e1071)
# Read the data (from sklearn)
cancer <- read.csv("cancer.csv")
# Rename the target variable
names(cancer) <- c(seq(1,30),"output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Fit a generalized linear logistic model, 
fit=glm(output~.,family=binomial,data=train,control = list(maxit = 50))
# Predict the output from the model
a=predict(fit,newdata=train,type="response")
# Set response >0.5 as 1 and <=0.5 as 0
b=ifelse(a>0.5,1,0)
# Compute the confusion matrix for training data
confusionMatrix(b,train$output)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 154   0
##          1   0 272
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9914, 1)
##     No Information Rate : 0.6385     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.3615     
##          Detection Rate : 0.3615     
##    Detection Prevalence : 0.3615     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 
m=predict(fit,newdata=test,type="response")
n=ifelse(m>0.5,1,0)
# Compute the confusion matrix for test output
confusionMatrix(n,test$output)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 52  4
##          1  5 81
##                                           
##                Accuracy : 0.9366          
##                  95% CI : (0.8831, 0.9706)
##     No Information Rate : 0.5986          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8677          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9123          
##             Specificity : 0.9529          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 0.9419          
##              Prevalence : 0.4014          
##          Detection Rate : 0.3662          
##    Detection Prevalence : 0.3944          
##       Balanced Accuracy : 0.9326          
##                                           
##        'Positive' Class : 0               
## 

1b. Logistic Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
os.chdir("C:\\Users\\Ganesh\\RandPython")
from sklearn.datasets import make_classification, make_blobs

from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)
# Fit a Logistic Regression model
clf = LogisticRegression().fit(X_train, y_train)

# Compute and print the Accuracy scores
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
y_predicted=clf.predict(X_test)
# Compute and print confusion matrix
confusion = confusion_matrix(y_test, y_predicted)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, y_predicted)))
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96
## Accuracy: 0.96
## Precision: 0.99
## Recall: 0.94
## F1: 0.97

2. Dummy variables

The following R and Python code show how dummy variables are handled in R and Python. Dummy variables are categorical variables which have to be converted into appropriate numeric values before they can be used in a Machine Learning model. For e.g. if we had currency as ‘dollar’, ‘rupee’ and ‘yen’ then the dummy (one-hot) encoding will convert this as
dollar 1 0 0
rupee 0 1 0
yen 0 0 1
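
A minimal pandas sketch of this encoding, with a hypothetical currency column (the full adult-dataset version follows in 2b):

import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({'currency': ['dollar', 'rupee', 'yen']})
# dtype=int keeps the 0/1 layout shown below
print(pd.get_dummies(df, columns=['currency'], dtype=int))
##    currency_dollar  currency_rupee  currency_yen
## 0                1               0             0
## 1                0               1             0
## 2                0               0             1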

2a. Logistic Regression with dummy variables- R code

# Load the dummies library
library(dummies) 
df <- read.csv("adult1.csv",stringsAsFactors = FALSE,na.strings = c(""," "," ?"))

# Remove rows which have NA
df1 <- df[complete.cases(df),]
dim(df1)
## [1] 30161    16
# Select specific columns
adult <- df1 %>% dplyr::select(age,occupation,education,educationNum,capitalGain,
                               capital.loss,hours.per.week,native.country,salary)
# Set the dummy data with appropriate values
adult1 <- dummy.data.frame(adult, sep = ".")

#Split as training and test
train_idx <- trainTestSplit(adult1,trainPercent=75,seed=1111)
train <- adult1[train_idx, ]
test <- adult1[-train_idx, ]

# Fit a binomial logistic regression
fit=glm(salary~.,family=binomial,data=train)
# Predict response
a=predict(fit,newdata=train,type="response")
# If response >0.5 then it is a 1 and 0 otherwise
b=ifelse(a>0.5,1,0)
confusionMatrix(b,train$salary)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 16065  3145
##          1   968  2442
##                                           
##                Accuracy : 0.8182          
##                  95% CI : (0.8131, 0.8232)
##     No Information Rate : 0.753           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4375          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9432          
##             Specificity : 0.4371          
##          Pos Pred Value : 0.8363          
##          Neg Pred Value : 0.7161          
##              Prevalence : 0.7530          
##          Detection Rate : 0.7102          
##    Detection Prevalence : 0.8492          
##       Balanced Accuracy : 0.6901          
##                                           
##        'Positive' Class : 0               
## 
# Compute and display confusion matrix
m=predict(fit,newdata=test,type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
n=ifelse(m>0.5,1,0)
confusionMatrix(n,test$salary)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5263 1099
##          1  357  822
##                                           
##                Accuracy : 0.8069          
##                  95% CI : (0.7978, 0.8158)
##     No Information Rate : 0.7453          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4174          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9365          
##             Specificity : 0.4279          
##          Pos Pred Value : 0.8273          
##          Neg Pred Value : 0.6972          
##              Prevalence : 0.7453          
##          Detection Rate : 0.6979          
##    Detection Prevalence : 0.8437          
##       Balanced Accuracy : 0.6822          
##                                           
##        'Positive' Class : 0               
## 

2b. Logistic Regression with dummy variables- Python code

Pandas has a get_dummies function for handling dummies

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Read data
df =pd.read_csv("adult1.csv",encoding="ISO-8859-1",na_values=[""," "," ?"])
# Drop rows with NA
df1=df.dropna()
print(df1.shape)
# Select specific columns
adult = df1[['age','occupation','education','educationNum','capitalGain','capital-loss', 
             'hours-per-week','native-country','salary']]

X=adult[['age','occupation','education','educationNum','capitalGain','capital-loss', 
             'hours-per-week','native-country']]
# Set appropriate values for dummy variables
X_adult=pd.get_dummies(X,columns=['occupation','education','native-country'])
y=adult['salary']

X_adult_train, X_adult_test, y_train, y_test = train_test_split(X_adult, y,
                                                   random_state = 0)
clf = LogisticRegression().fit(X_adult_train, y_train)

# Compute and display Accuracy and Confusion matrix
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_adult_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_adult_test, y_test)))
y_predicted=clf.predict(X_adult_test)
confusion = confusion_matrix(y_test, y_predicted)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, y_predicted)))
## (30161, 16)
## Accuracy of Logistic regression classifier on training set: 0.82
## Accuracy of Logistic regression classifier on test set: 0.81
## Accuracy: 0.81
## Precision: 0.68
## Recall: 0.41
## F1: 0.51

3a – K Nearest Neighbors Classification – R code

The Adult data set is taken from the UCI Machine Learning Repository.

source("RFunctions.R")
df <- read.csv("adult1.csv",stringsAsFactors = FALSE,na.strings = c(""," "," ?"))
# Remove rows which have NA
df1 <- df[complete.cases(df),]
dim(df1)
## [1] 30161    16
# Select specific columns
adult <- df1 %>% dplyr::select(age,occupation,education,educationNum,capitalGain,
                               capital.loss,hours.per.week,native.country,salary)
# Set dummy variables
adult1 <- dummy.data.frame(adult, sep = ".")

#Split train and test as required by the KNN classification model
train_idx <- trainTestSplit(adult1,trainPercent=75,seed=1111)
train <- adult1[train_idx, ]
test <- adult1[-train_idx, ]
train.X <- train[,1:76]
train.y <- train[,77]
test.X <- test[,1:76]
test.y <- test[,77]

# Fit a model for 1,3,5,10 and 15 neighbors
cMat <- NULL
neighbors <-c(1,3,5,10,15)
for(i in seq_along(neighbors)){
    fit =knn(train.X,test.X,train.y,k=neighbors[i])
    table(fit,test.y)
    a<-confusionMatrix(fit,test.y)
    cMat[i] <- a$overall[1]
    print(a$overall[1])
}
##  Accuracy 
## 0.7835831 
##  Accuracy 
## 0.8162047 
##  Accuracy 
## 0.8089113 
##  Accuracy 
## 0.8209787 
##  Accuracy 
## 0.8184591
#Plot the Accuracy for each of the KNN models
df <- data.frame(neighbors,Accuracy=cMat)
ggplot(df,aes(x=neighbors,y=Accuracy)) + geom_point() +geom_line(color="blue") +
    xlab("Number of neighbors") + ylab("Accuracy") +
    ggtitle("KNN regression - Accuracy vs Number of Neighors (Unnormalized)")

3b – K Nearest Neighbors Classification – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Read data
df =pd.read_csv("adult1.csv",encoding="ISO-8859-1",na_values=[""," "," ?"])
df1=df.dropna()
print(df1.shape)
# Select specific columns
adult = df1[['age','occupation','education','educationNum','capitalGain','capital-loss', 
             'hours-per-week','native-country','salary']]

X=adult[['age','occupation','education','educationNum','capitalGain','capital-loss', 
             'hours-per-week','native-country']]
             
#Set values for dummy variables
X_adult=pd.get_dummies(X,columns=['occupation','education','native-country'])
y=adult['salary']

X_adult_train, X_adult_test, y_train, y_test = train_test_split(X_adult, y,
                                                   random_state = 0)
                                                   
# KNN classification in Python requires the data to be scaled. 
# Scale the data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_adult_train)
# Apply scaling to test set also
X_test_scaled = scaler.transform(X_adult_test)
# Compute the KNN model for 1,3,5,10 & 15 neighbors
accuracy=[]
neighbors=[1,3,5,10,15]
for i in neighbors:
    knn = KNeighborsClassifier(n_neighbors = i)
    knn.fit(X_train_scaled, y_train)
    accuracy.append(knn.score(X_test_scaled, y_test))
    print('Accuracy test score: {:.3f}'
        .format(knn.score(X_test_scaled, y_test)))

# Plot the models with the Accuracy attained for each of these models    
fig1=plt.plot(neighbors,accuracy)
fig1=plt.title("KNN regression - Accuracy vs Number of neighbors")
fig1=plt.xlabel("Neighbors")
fig1=plt.ylabel("Accuracy")
fig1.figure.savefig('foo1.png', bbox_inches='tight')
## (30161, 16)
## Accuracy test score: 0.749
## Accuracy test score: 0.779
## Accuracy test score: 0.793
## Accuracy test score: 0.804
## Accuracy test score: 0.803

Output image:

4 MPG vs Horsepower

The following scatter plot shows the non-linear relation between mpg and horsepower. This will be used as the data input for computing K Fold Cross Validation Error

4a MPG vs Horsepower scatter plot – R Code

df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
df2 <- df1 %>% dplyr::select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]
ggplot(df3,aes(x=horsepower,y=mpg)) + geom_point() + xlab("Horsepower") + 
    ylab("Miles Per gallon") + ggtitle("Miles per Gallon vs Hosrsepower")

4b MPG vs Horsepower scatter plot – Python Code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3=autoDF2.dropna()
autoDF3.shape
#X=autoDF3[['cylinder','displacement','horsepower','weight']]
X=autoDF3[['horsepower']]
y=autoDF3['mpg']

fig11=plt.scatter(X,y)
fig11=plt.title("KNN regression - Accuracy vs Number of neighbors")
fig11=plt.xlabel("Neighbors")
fig11=plt.ylabel("Accuracy")
fig11.figure.savefig('foo11.png', bbox_inches='tight')

5 K Fold Cross Validation

K Fold Cross Validation is a technique in which the data set is divided into K folds or K partitions. The Machine Learning model is trained on K-1 folds and tested on the remaining fold, i.e. we will have K-1 folds for training data and 1 fold for testing the ML model. Since any one of the K folds can be chosen as the test fold, there are K such partitions. K Fold Cross Validation estimates the average validation error that we can expect on new unseen test data.

The formula for K Fold Cross Validation is as follows

MSE_{k} = \frac{\sum_{i \in fold\,k} (y_{i}-\hat{y}_{i})^{2}}{n_{k}}

where n_{k} is the number of elements in fold ‘k’ and N is the total number of elements. With equal-sized folds

n_{k} = \frac{N}{K}

The K Fold Cross Validation error is the weighted average of the fold errors

CV_{(K)} = \sum_{k=1}^{K} \frac{n_{k}}{N} MSE_{k}

which, when all folds have the same size n_{k} = N/K, simplifies to

CV_{(K)} = \frac{1}{K}\sum_{k=1}^{K} MSE_{k}

Leave-one-out Cross Validation (LOOCV) is the special case of K Fold Cross Validation where N-1 data points are used to train the model and 1 data point is used to test it. There are N such partitions of N-1 & 1 possible, and the mean error over them is measured. For least squares fits, the LOOCV error can be computed from a single fit as

CV_{(N)} = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{y_{i}-\hat{y}_{i}}{1-h_{i}}\right)^{2}

where h_{i} is the i-th diagonal element of the hat matrix (the leverage of point i).

see [Statistical Learning]

The above formula is also included in this blog post
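
The closed-form LOOCV formula above can be checked numerically against a brute-force leave-one-out loop. Here is a minimal Python sketch for a simple least-squares fit on made-up data (for linear models the identity holds exactly):

import numpy as np

# Made-up data for a simple linear regression
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 20)
y = 2.0 + 3.0*x + rng.normal(0, 1, 20)

# Design matrix with intercept, least squares fit and leverages h_i
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

# Closed-form LOOCV error
cv_closed = np.mean(((y - yhat)/(1 - h))**2)

# Brute-force LOOCV: refit leaving one point out each time
errs = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i
    b = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    errs.append((y[i] - X[i] @ b)**2)

print(cv_closed, np.mean(errs))   # the two values agree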

It took me a day and a half to implement the K Fold Cross Validation formula. I think it is correct. In any case, do let me know if you think it is off.
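
As a cross-check on the implementation below, sklearn’s built-in cross_val_score can compute the per-fold MSE directly. A minimal sketch, assuming the same auto_mpg.csv file used in the sections that follow:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

# Read the auto mpg data and keep complete numeric cases
autoDF = pd.read_csv("auto_mpg.csv", encoding="ISO-8859-1")
autoDF1 = autoDF[['mpg','horsepower']].apply(pd.to_numeric, errors='coerce').dropna()
X = autoDF1[['horsepower']]
y = autoDF1['mpg']

# 10-fold CV error for a degree-2 polynomial fit
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
scores = cross_val_score(LinearRegression(), X_poly, y, cv=10,
                        scoring='neg_mean_squared_error')
print('10-fold CV MSE: {:.2f}'.format(-np.mean(scores)))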

5a. Leave-one-out cross validation (LOOCV) – R Code

R uses the package ‘boot’ for performing Cross Validation error computation

library(boot)
library(reshape2)
# Read data
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
# Select complete cases
df2 <- df1 %>% dplyr::select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]
set.seed(17)
cv.error=rep(0,10)
# For polynomials 1,2,3... 10 fit a LOOCV model
for (i in 1:10){
    glm.fit=glm(mpg~poly(horsepower,i),data=df3)
    cv.error[i]=cv.glm(df3,glm.fit)$delta[1]
    
}
cv.error
##  [1] 24.23151 19.24821 19.33498 19.42443 19.03321 18.97864 18.83305
##  [8] 18.96115 19.06863 19.49093
# Create and display a plot
folds <- seq(1,10)
df <- data.frame(folds,cvError=cv.error)
ggplot(df,aes(x=folds,y=cvError)) + geom_point() +geom_line(color="blue") +
    xlab("Degree of Polynomial") + ylab("Cross Validation Error") +
    ggtitle("Leave one out Cross Validation - Cross Validation Error vs Degree of Polynomial")

5b. Leave-one-out cross validation (LOOCV) – Python Code

In Python there is no direct equivalent of R’s cv.glm for computing the Cross Validation error, so the above formula has to be implemented. I have done this after several hours. I think it is now in reasonable shape. Do let me know if you think otherwise. For LOOCV I use K Fold Cross Validation with K=N.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
# Read data
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
# Remove rows with NAs
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['horsepower']]
y=autoDF3['mpg']

# For polynomial degree 1,2,3... 10
def computeCVError(X,y,folds):
    deg=[]
    degree1=[1,2,3,4,5,6,7,8,9,10]

    # For degree 'j'
    for j in degree1:
        # Weighted MSEs of the folds for this degree
        mse=[]
        # Split as 'folds'
        kf = KFold(n_splits=folds)
        for train_index, test_index in kf.split(X):
            # Create the appropriate train and test partitions from the fold index
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]

            # For the polynomial degree 'j'
            poly = PolynomialFeatures(degree=j)
            # Transform the X_train and X_test
            X_train_poly = poly.fit_transform(X_train)
            X_test_poly = poly.transform(X_test)
            # Fit a model on the transformed data
            linreg = LinearRegression().fit(X_train_poly, y_train)
            # Compute yhat or ypred
            y_pred = linreg.predict(X_test_poly)
            # Compute MSE * n_K/N, the weight of this fold in the CV error
            test_mse = mean_squared_error(y_test, y_pred)*float(len(X_test))/float(len(X))
            # Add the weighted test_mse for this partition of the data
            mse.append(test_mse)
        # Sum the weighted fold errors to get the CV error for degree 'j'
        deg.append(np.sum(mse))

    return(deg)


df=pd.DataFrame()
print(len(X))
# Call the function once. For LOOCV K=N, hence len(X) is passed as the number of folds
cvError=computeCVError(X,y,len(X))

# Create and plot LOOCV
df=pd.DataFrame(cvError)
fig3=df.plot()
fig3=plt.title("Leave one out Cross Validation - Cross Validation Error vs Degree of Polynomial")
fig3=plt.xlabel("Degree of Polynomial")
fig3=plt.ylabel("Cross validation Error")
fig3.figure.savefig('foo3.png', bbox_inches='tight')

 

6a K Fold Cross Validation – R code

Here K Fold Cross Validation is done for 4, 5 and 10 folds using the R package boot and the glm() function

library(boot)
library(reshape2)
library(dplyr)
library(ggplot2)
set.seed(17)
#Read data
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
df2 <- df1 %>% dplyr::select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]
a=matrix(rep(0,30),nrow=3,ncol=10)
set.seed(17)
# Set the folds as 4,5 and 10
folds<-c(4,5,10)
for(i in seq_along(folds)){
    for (j in 1:10){
        # Fit a generalized linear model
        glm.fit=glm(mpg~poly(horsepower,j),data=df3)
        # Compute K Fold Validation error
        a[i,j]=cv.glm(df3,glm.fit,K=folds[i])$delta[1]
        
    }
    
}

# Create and display the K Fold Cross Validation Error
b <- t(a)
df <- data.frame(b)
df1 <- cbind(seq(1,10),df)
names(df1) <- c("PolynomialDegree","4-fold","5-fold","10-fold")

df2 <- melt(df1,id="PolynomialDegree")
ggplot(df2) + geom_line(aes(x=PolynomialDegree, y=value, colour=variable),size=2) +
    xlab("Degree of Polynomial") + ylab("Cross Validation Error") +
    ggtitle("K Fold Cross Validation - Cross Validation Error vs Degree of Polynomial")

6b. K Fold Cross Validation – Python code

The K-Fold Cross Validation error computation has to be implemented by hand, and I have done this below. There is a small discrepancy in the shapes of the curves with the R plot above. Not sure why!

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
# Read data
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
# Drop NA rows
autoDF3=autoDF2.dropna()
autoDF3.shape
#X=autoDF3[['cylinder','displacement','horsepower','weight']]
X=autoDF3[['horsepower']]
y=autoDF3['mpg']

# Create Cross Validation function
def computeCVError(X,y,folds):
    deg=[]
    # For degree 1,2,3,..10
    degree1=[1,2,3,4,5,6,7,8,9,10]
    for j in degree1: 
        # Split the data into 'folds'
        kf = KFold(n_splits=folds)
        # Reset the fold errors for each degree
        mse=[]
        for train_index, test_index in kf.split(X):
            # Partition the data according to the fold indices generated
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]  

            # Scale the X_train and X_test as per the polynomial degree 'j'
            poly = PolynomialFeatures(degree=j)             
            X_train_poly = poly.fit_transform(X_train)
            X_test_poly = poly.fit_transform(X_test)
            # Fit a polynomial regression
            linreg = LinearRegression().fit(X_train_poly, y_train)
            # Compute yhat or ypred
            y_pred = linreg.predict(X_test_poly)  
            # Compute MSE * (nK/N) for this fold
            test_mse = mean_squared_error(y_test, y_pred)*float(len(X_test))/float(len(X))  
            # Append the weighted fold error to the list
            mse.append(test_mse)
        # Sum the weighted fold errors to get the CV error for polynomial 'j' 
        deg.append(np.sum(mse))
        
    return(deg)

# Create and display a plot of K -Folds
df=pd.DataFrame()
for folds in [4,5,10]:
    cvError=computeCVError(X,y,folds)
    #print(cvError)
    df1=pd.DataFrame(cvError)
    df=pd.concat([df,df1],axis=1)
    #print(cvError)
    
df.columns=['4-fold','5-fold','10-fold']
df.index = np.arange(1,11)
df
fig2=df.plot()
fig2=plt.title("K Fold Cross Validation - Cross Validation Error vs Degree of Polynomial")
fig2=plt.xlabel("Degree of Polynomial")
fig2=plt.ylabel("Cross validation Error")
fig2.figure.savefig('foo2.png', bbox_inches='tight')


This concludes the 2nd part of this series. I will look into model tuning and model selection in R and Python in the coming parts. Comments, suggestions and corrections are welcome!
To be continued….
Watch this space!

Also see

  1. Design Principles of Scalable, Distributed Systems
  2. Re-introducing cricketr! : An R package to analyze performances of cricketers
  3. Spicing up an IBM Bluemix cloud app with MongoDB and NodeExpress
  4. Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket
  5. Simulating an Edge Shape in Android

To see all posts see Index of posts

Introducing cricket package yorkr: Part 4-In the block hole!

Introduction

“The nitrogen in our DNA, the calcium in our teeth, the iron in our blood, the carbon in our apple pies were made in the interiors of collapsing stars. We are made of starstuff.”

“If you wish to make an apple pie from scratch, you must first invent the universe.”

“We are like butterflies who flutter for a day and think it is forever.”

“The absence of evidence is not the evidence of absence.”

“We are star stuff which has taken its destiny into its own hands.”

                              Cosmos - Carl Sagan

This post is the 4th and possibly the last part of my introduction to my latest cricket package yorkr. The 3 earlier parts were

  1. Introducing cricket package yorkr-Part1:Beaten by sheer pace!.
  2. Introducing cricket package yorkr: Part 2-Trapped leg before wicket!
  3. Introducing cricket package yorkr: Part 3-Foxed by flight!

The 1st part included functions dealing with a specific match, the 2nd part dealt with functions between 2 opposing teams, and the 3rd part dealt with functions for a team against all oppositions in all matches. This 4th part includes individual batting and bowling performances in ODI matches and deals with Class 4 functions.

If you are passionate about cricket and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books I perform a detailed yet compact analysis of the performances of batsmen and bowlers, besides evaluating team & match performances in Tests, ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/Rs 320 and $6.99/Rs 448 respectively for the kindle versions. The books can be accessed at Cricket analytics with cricketr and Beaten by sheer pace-Cricket analytics with yorkr. A must read for any cricket lover! Check it out!!


This post has also been published at RPubs yorkr-Part4 and can also be downloaded as a PDF document from yorkr-Part4.pdf.

You can clone/fork the code for the package yorkr from Github at yorkr-package

Checkout my interactive Shiny apps GooglyPlus (plots & tables) and Googly (only plots) which can be used to analyze IPL players, teams and matches.

Important note 1: Do check out all the posts on the python avatar of yorkr, namely ‘yorkpy’, in my post ‘Pitching yorkpy … short of good length to IPL – Part 1’

Batsman functions

  1. batsmanRunsVsDeliveries
  2. batsmanFoursSixes
  3. batsmanDismissals
  4. batsmanRunsVsStrikeRate
  5. batsmanMovingAverage
  6. batsmanCumulativeAverageRuns
  7. batsmanCumulativeStrikeRate
  8. batsmanRunsAgainstOpposition
  9. batsmanRunsVenue
  10. batsmanRunsPredict

Bowler functions

  1. bowlerMeanEconomyRate
  2. bowlerMeanRunsConceded
  3. bowlerMovingAverage
  4. bowlerCumulativeAvgWickets
  5. bowlerCumulativeAvgEconRate
  6. bowlerWicketPlot
  7. bowlerWicketsAgainstOpposition
  8. bowlerWicketsVenue
  9. bowlerWktsPredict

Note: The yorkr package in its current avatar only supports ODI, T20 and IPL T20 matches.

library(yorkr)
library(gridExtra)
library(rpart.plot)
library(dplyr)
library(ggplot2)
rm(list=ls())

A. Batsman functions

1. Get Team Batting details

The function below gets the overall team batting details based on the RData files available for ODI matches. These are currently also available on Github at (https://github.com/tvganesh/yorkrData/tree/master/ODI/ODI-matches). However you may have to regenerate this as future matches are added! The batting details of the team in each match are extracted, and a large data frame is created by rbind-ing the individual data frames. This can be saved as an RData file.

setwd("C:/software/cricket-package/york-test/yorkrData/ODI/ODI-matches")
india_details <- getTeamBattingDetails("India",dir=".", save=TRUE)
dim(india_details)
## [1] 11085    15
sa_details <- getTeamBattingDetails("South Africa",dir=".",save=TRUE)
dim(sa_details)
## [1] 6375   15
nz_details <- getTeamBattingDetails("New Zealand",dir=".",save=TRUE)
dim(nz_details)
## [1] 6262   15
eng_details <- getTeamBattingDetails("England",dir=".",save=TRUE)
dim(eng_details)
## [1] 9001   15

2. Get batsman details

This function is used to get the individual batting record for the specified batsman of a country, as in the calls below. For analyzing the batting performances the following cricketers have been chosen

  1. Virat Kohli (Ind)
  2. M S Dhoni (Ind)
  3. AB De Villiers (SA)
  4. Q De Kock (SA)
  5. J Root (Eng)
  6. M J Guptill (NZ)
setwd("C:/software/cricket-package/york-test/yorkrData/ODI/ODI-matches")
kohli <- getBatsmanDetails(team="India",name="Kohli",dir=".")
## [1] "./India-BattingDetails.RData"
dhoni <- getBatsmanDetails(team="India",name="Dhoni")
## [1] "./India-BattingDetails.RData"
devilliers <-  getBatsmanDetails(team="South Africa",name="Villiers",dir=".")
## [1] "./South Africa-BattingDetails.RData"
deKock <-  getBatsmanDetails(team="South Africa",name="Kock",dir=".")
## [1] "./South Africa-BattingDetails.RData"
root <-  getBatsmanDetails(team="England",name="Root",dir=".")
## [1] "./England-BattingDetails.RData"
guptill <-  getBatsmanDetails(team="New Zealand",name="Guptill",dir=".")
## [1] "./New Zealand-BattingDetails.RData"

3. Runs versus deliveries

Kohli, De Villiers and Guptill have a good cluster of points that head towards 150 runs at 150 deliveries.

p1 <-batsmanRunsVsDeliveries(kohli,"Kohli")
p2 <- batsmanRunsVsDeliveries(dhoni, "Dhoni")
p3 <- batsmanRunsVsDeliveries(devilliers,"De Villiers")
p4 <- batsmanRunsVsDeliveries(deKock,"Q de Kock")
p5 <- batsmanRunsVsDeliveries(root,"JE Root")
p6 <- batsmanRunsVsDeliveries(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

runsVsDeliveries-1

4. Batsman Total runs, Fours and Sixes

The plots below show the total runs, fours and sixes by the batsmen

kohli46 <- select(kohli,batsman,ballsPlayed,fours,sixes,runs)
p1 <- batsmanFoursSixes(kohli46,"Kohli")
dhoni46 <- select(dhoni,batsman,ballsPlayed,fours,sixes,runs)
p2 <- batsmanFoursSixes(dhoni46,"Dhoni")
devilliers46 <- select(devilliers,batsman,ballsPlayed,fours,sixes,runs)
p3 <- batsmanFoursSixes(devilliers46, "De Villiers")
deKock46 <- select(deKock,batsman,ballsPlayed,fours,sixes,runs)
p4 <- batsmanFoursSixes(deKock46,"Q de Kock")
root46 <- select(root,batsman,ballsPlayed,fours,sixes,runs)
p5 <- batsmanFoursSixes(root46,"JE Root")
guptill46 <- select(guptill,batsman,ballsPlayed,fours,sixes,runs)
p6 <- batsmanFoursSixes(guptill46,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

foursSixes-1

5. Batsman dismissals

The type of dismissal for each batsman is shown below

p1 <-batsmanDismissals(kohli,"Kohli")
p2 <- batsmanDismissals(dhoni, "Dhoni")
p3 <- batsmanDismissals(devilliers, "De Villiers")
p4 <- batsmanDismissals(deKock,"Q de Kock")
p5 <- batsmanDismissals(root,"JE Root")
p6 <- batsmanDismissals(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

dismissal-1

6. Runs versus Strike Rate

De Villiers has the best strike rate among all, as there are more points to the right side of the plot for the same runs. Kohli and Dhoni do well too. Q de Kock and Joe Root also have a very good spread of points, though they have fewer innings.

p1 <-batsmanRunsVsStrikeRate(kohli,"Kohli")
p2 <- batsmanRunsVsStrikeRate(dhoni, "Dhoni")
p3 <- batsmanRunsVsStrikeRate(devilliers, "De Villiers")
p4 <- batsmanRunsVsStrikeRate(deKock,"Q de Kock")
p5 <- batsmanRunsVsStrikeRate(root,"JE Root")
p6 <- batsmanRunsVsStrikeRate(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

runsSR-1

7. Batsman moving average

Kohli’s average is on a gentle increase from below 50 to around the 60s. Joe Root’s performance is impressive, with his moving average of late tending towards the 70s. Q de Kock seemed to have a slump around 2015, but his performance is on the increase. De Villiers consistently averages around 50. Dhoni has also had a stable run in the last several years.

p1 <-batsmanMovingAverage(kohli,"Kohli")
p2 <- batsmanMovingAverage(dhoni, "Dhoni")
p3 <- batsmanMovingAverage(devilliers, "De Villiers")
p4 <- batsmanMovingAverage(deKock,"Q de Kock")
p5 <- batsmanMovingAverage(root,"JE Root")
p6 <- batsmanMovingAverage(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

ma-1

8. Batsman cumulative average

The functions below provide the cumulative average of runs scored. As can be seen, Kohli and De Villiers have a cumulative runs rate that averages around 48-50. Q de Kock seems to have had a rocky career with several highs and lows, as the cumulative average oscillates between 40 and 45. Root steadily improves to a cumulative average of around 42-43 from his 50th innings.

p1 <-batsmanCumulativeAverageRuns(kohli,"Kohli")
p2 <- batsmanCumulativeAverageRuns(dhoni, "Dhoni")
p3 <- batsmanCumulativeAverageRuns(devilliers, "De Villiers")
p4 <- batsmanCumulativeAverageRuns(deKock,"Q de Kock")
p5 <- batsmanCumulativeAverageRuns(root,"JE Root")
p6 <- batsmanCumulativeAverageRuns(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

cAvg-1
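A cumulative average like the one plotted above is just a running mean of the runs scored, innings by innings. Here is a minimal sketch (my own illustration, not the yorkr implementation, assuming the batsman data frame has one row per innings with the ‘runs’ column seen earlier) using dplyr’s cummean()

library(dplyr)
kohliCumAvg <- kohli %>%
    mutate(innings = row_number(),        # innings number in career order
           cumAvgRuns = cummean(runs))    # running mean of runs to date
head(kohliCumAvg[, c("innings", "cumAvgRuns")])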

9. Cumulative Average Strike Rate

The plots below show the cumulative average strike rate of the batsmen. Dhoni and De Villiers have the best cumulative average strike rate, of around 90. The rest average around a strike rate of 80. Guptill shows a slump towards the latter part of his career.

p1 <-batsmanCumulativeStrikeRate(kohli,"Kohli")
p2 <- batsmanCumulativeStrikeRate(dhoni, "Dhoni")
p3 <- batsmanCumulativeStrikeRate(devilliers, "De Villiers")
p4 <- batsmanCumulativeStrikeRate(deKock,"Q de Kock")
p5 <- batsmanCumulativeStrikeRate(root,"JE Root")
p6 <- batsmanCumulativeStrikeRate(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

cSR-1

10. Batsman runs against opposition

Kohli’s best performances are against Australia, West Indies and Sri Lanka

batsmanRunsAgainstOpposition(kohli,"Kohli")

runsOppn1-1

batsmanRunsAgainstOpposition(dhoni, "Dhoni")

runsOppn2-1

De Villiers’ best performances are against Australia, Pakistan and West Indies

batsmanRunsAgainstOpposition(devilliers, "De Villiers")

runsOppn3-1

Quinton de Kock averages almost 100 runs against India and 75 runs against England

batsmanRunsAgainstOpposition(deKock, "Q de Kock")

runsOppn4-1

Root’s best performances are against South Africa, Sri Lanka and West Indies

batsmanRunsAgainstOpposition(root, "JE Root")

runsOppn5-1

batsmanRunsAgainstOpposition(guptill, "MJ Guptill")

runsOppn6-1

11. Runs at different venues

The plots below give the performances of the batsmen at different grounds.

batsmanRunsVenue(kohli,"Kohli")

runsVenue1-1

batsmanRunsVenue(dhoni, "Dhoni")

runsVenue2-1

batsmanRunsVenue(devilliers, "De Villiers")

runsVenue3-1

batsmanRunsVenue(deKock, "Q de Kock")

runsVenue4-1

batsmanRunsVenue(root, "JE Root")

runsVenue5-1

batsmanRunsVenue(guptill, "MJ Guptill")

runsVenue6-1

12. Predict number of runs to deliveries

The plots below use an rpart tree to predict the number of runs scored in the leaf nodes from the number of deliveries faced. For e.g. Kohli takes 66 deliveries to score 64 runs, and for a higher number of deliveries scores around 115 runs. A sketch of fitting such a tree directly is shown after the plots.

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsPredict(kohli,"Kohli")
batsmanRunsPredict(dhoni, "Dhoni")
batsmanRunsPredict(devilliers, "De Villiers")

runsPredict1,runsVenue1-1

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsPredict(deKock,"Q de Kock")
batsmanRunsPredict(root,"JE Root")
batsmanRunsPredict(guptill,"MJ Guptill")

runsPredict2,runsVenue1-1
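For those curious about the trees behind these plots, here is a minimal sketch (my own illustration under assumptions, not necessarily how batsmanRunsPredict is implemented) of fitting such a tree directly with rpart, assuming the ‘ballsPlayed’ and ‘runs’ columns of the batsman data frame seen earlier

library(rpart)
library(rpart.plot)
# Regression tree relating runs scored to deliveries faced
fit <- rpart(runs ~ ballsPlayed, data = kohli)
rpart.plot(fit, main = "Kohli - predicted runs vs deliveries")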

B. Bowler functions

13. Get bowling details

The function below gets the overall team bowling details based on the RData files available for ODI matches. These are currently also available on Github at (https://github.com/tvganesh/yorkrData/tree/master/ODI/ODI-matches). The bowling details of the team in each match are extracted, and a large data frame is created by rbind-ing the individual data frames. This can be saved as an RData file.

setwd("C:/software/cricket-package/york-test/yorkrData/ODI/ODI-matches")
ind_bowling <- getTeamBowlingDetails("India",dir=".",save=TRUE)
dim(ind_bowling)
## [1] 7816   12
aus_bowling <- getTeamBowlingDetails("Australia",dir=".",save=TRUE)
dim(aus_bowling)
## [1] 9191   12
ban_bowling <- getTeamBowlingDetails("Bangladesh",dir=".",save=TRUE)
dim(ban_bowling)
## [1] 5665   12
sa_bowling <- getTeamBowlingDetails("South Africa",dir=".",save=TRUE)
dim(sa_bowling)
## [1] 3806   12
sl_bowling <- getTeamBowlingDetails("Sri Lanka",dir=".",save=TRUE)
dim(sl_bowling)
## [1] 3964   12

14. Get bowling details of the individual bowlers

This function is used to get the individual bowling record for a specified bowler of the country as in the functions below. For analyzing the bowling performances the following cricketers have been chosen

  1. R A Jadeja (Ind)
  2. Ravichandran Ashwin (Ind)
  3. Mitchell Starc (Aus)
  4. Shakib Al Hasan (Ban)
  5. Ajantha Mendis (SL)
  6. Dale Steyn (SA)
jadeja <- getBowlerWicketDetails(team="India",name="Jadeja",dir=".")
ashwin <- getBowlerWicketDetails(team="India",name="Ashwin",dir=".")
starc <-  getBowlerWicketDetails(team="Australia",name="Starc",dir=".")
shakib <-  getBowlerWicketDetails(team="Bangladesh",name="Shakib",dir=".")
mendis <-  getBowlerWicketDetails(team="Sri Lanka",name="Mendis",dir=".")
steyn <-  getBowlerWicketDetails(team="South Africa",name="Steyn",dir=".")

15. Bowler Mean Economy Rate

Shakib Al Hasan is expensive in the first 3 overs, after which he is very economical with an economy rate of 3-4 (the economy rate is the average number of runs conceded per over). Starc and Steyn average around an ER of 4.0.

p1<-bowlerMeanEconomyRate(jadeja,"RA Jadeja")
p2<-bowlerMeanEconomyRate(ashwin, "R Ashwin")
p3<-bowlerMeanEconomyRate(starc, "MA Starc")
p4<-bowlerMeanEconomyRate(shakib, "Shakib Al Hasan")
p5<-bowlerMeanEconomyRate(mendis, "A Mendis")
p6<-bowlerMeanEconomyRate(steyn, "D Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

meanER-1

16. Bowler Mean Runs conceded

Ashwin is expensive around the 6th & 7th overs

p1<-bowlerMeanRunsConceded(jadeja,"RA Jadeja")
p2<-bowlerMeanRunsConceded(ashwin, "R Ashwin")
p3<-bowlerMeanRunsConceded(starc, "M A Starc")
p4<-bowlerMeanRunsConceded(shakib, "Shakib Al Hasan")
p5<-bowlerMeanRunsConceded(mendis, "A Mendis")
p6<-bowlerMeanRunsConceded(steyn, "D Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

meanRunsConceded-1

17. Bowler Moving average

RA Jadeja’s and Mendis’ performances have dipped considerably, while Ashwin and Shakib have improving performances. Starc averages around 4 wickets.

p1<-bowlerMovingAverage(jadeja,"RA Jadeja")
p2<-bowlerMovingAverage(ashwin, "Ashwin")
p3<-bowlerMovingAverage(starc, "M A Starc")
p4<-bowlerMovingAverage(shakib, "Shakib Al Hasan")
p5<-bowlerMovingAverage(mendis, "Ajantha Mendis")
p6<-bowlerMovingAverage(steyn, "Dale Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

bowlerMA-1

17. Bowler cumulative average wickets

Starc is clearly the most consistent performer, with 3 wickets on average over his career, while Jadeja averages around 2.0. Ashwin seems to have dropped from 2.4 to 2.0 wickets, while Mendis drops from a high of 3.5 to 2.2 wickets. The fractional wickets only show a tendency to take another wicket.

p1<-bowlerCumulativeAvgWickets(jadeja,"RA Jadeja")
p2<-bowlerCumulativeAvgWickets(ashwin, "Ashwin")
p3<-bowlerCumulativeAvgWickets(starc, "M A Starc")
p4<-bowlerCumulativeAvgWickets(shakib, "Shakib Al Hasan")
p5<-bowlerCumulativeAvgWickets(mendis, "Ajantha Mendis")
p6<-bowlerCumulativeAvgWickets(steyn, "Dale Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

cumWkts-1

18. Bowler cumulative Economy Rate (ER)

The plots below are interesting. All of the bowlers seem to average around 4.5 runs/over. RA Jadeja’s ER improves and heads to 4.5, while Mendis is seen to be getting more expensive as his career progresses: from an ER of 3.0 he increases towards 4.5.

p1<-bowlerCumulativeAvgEconRate(jadeja,"RA Jadeja")
p2<-bowlerCumulativeAvgEconRate(ashwin, "Ashwin")
p3<-bowlerCumulativeAvgEconRate(starc, "M A Starc")
p4<-bowlerCumulativeAvgEconRate(shakib, "Shakib Al Hasan")
p5<-bowlerCumulativeAvgEconRate(mendis, "Ajantha Mendis")
p6<-bowlerCumulativeAvgEconRate(steyn, "Dale Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

cumER-1

19. Bowler wicket plot

The plot below gives the average wickets versus number of overs

p1<-bowlerWicketPlot(jadeja,"RA Jadeja")
p2<-bowlerWicketPlot(ashwin, "Ashwin")
p3<-bowlerWicketPlot(starc, "M A Starc")
p4<-bowlerWicketPlot(shakib, "Shakib Al Hasan")
p5<-bowlerWicketPlot(mendis, "Ajantha Mendis")
p6<-bowlerWicketPlot(steyn, "Dale Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

wktPlot-1

20. Bowler wicket against opposition

#Jadeja's best performances are against England, Pakistan and West Indies
bowlerWicketsAgainstOpposition(jadeja,"RA Jadeja")

wktsOppn1-1

#Ashwin's best performances are against England, Pakistan and South Africa
bowlerWicketsAgainstOpposition(ashwin, "Ashwin")

wktsOppn2-1

#Starc has good performances against India, New Zealand, Pakistan, West Indies
bowlerWicketsAgainstOpposition(starc, "M A Starc")

wktsOppn3-1

bowlerWicketsAgainstOpposition(shakib,"Shakib Al Hasan")

wktsOppn4-1

bowlerWicketsAgainstOpposition(mendis, "Ajantha Mendis")

wktsOppn5-1

#Steyn has good performances against India, Sri Lanka, Pakistan, West Indies
bowlerWicketsAgainstOpposition(steyn, "Dale Steyn")

wktsOppn6-1

21. Bowler wicket at cricket grounds

bowlerWicketsVenue(jadeja,"RA Jadeja")

wktsAve1-1

bowlerWicketsVenue(ashwin, "Ashwin")

wktsAve2-1

bowlerWicketsVenue(starc, "M A Starc")
## Warning: Removed 2 rows containing missing values (geom_bar).

wktsAve3-1

bowlerWicketsVenue(shakib,"Shakib Al Hasan")

wktsAve4-1

bowlerWicketsVenue(mendis, "Ajantha Mendis")

wktsAve5-1

bowlerWicketsVenue(steyn, "Dale Steyn")

wktsAve6-1

22. Get Delivery wickets for bowlers

This function creates a dataframe of deliveries and the wickets taken

setwd("C:/software/cricket-package/york-test/yorkrData/ODI/ODI-matches")
jadeja1 <- getDeliveryWickets(team="India",dir=".",name="Jadeja",save=FALSE)
ashwin1 <- getDeliveryWickets(team="India",dir=".",name="Ashwin",save=FALSE)
starc1 <- getDeliveryWickets(team="Australia",dir=".",name="MA Starc",save=FALSE)
shakib1 <- getDeliveryWickets(team="Bangladesh",dir=".",name="Shakib",save=FALSE)
mendis1 <- getDeliveryWickets(team="Sri Lanka",dir=".",name="Mendis",save=FALSE)
steyn1 <- getDeliveryWickets(team="South Africa",dir=".",name="Steyn",save=FALSE)

23. Predict number of deliveries to wickets

#Jadeja and Ashwin need around 22 to 28 deliveries to make a breakthrough
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerWktsPredict(jadeja1,"RA Jadeja")
bowlerWktsPredict(ashwin1,"RAshwin")

wktsPred1-1

#Starc and Shakib provide an early breakthrough, producing a wicket in around 16 balls. Starc's 2nd wicket comes around the 30th delivery
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerWktsPredict(starc1,"MA Starc")
bowlerWktsPredict(shakib1,"Shakib Al Hasan")

wktsPred2-1

#Steyn and Mendis take 20 deliveries to get their 1st wicket
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerWktsPredict(mendis1,"A Mendis")
bowlerWktsPredict(steyn1,"DSteyn")

wktsPred3-1

Conclusion

This concludes the 4-part introduction to my new R cricket package yorkr for ODIs. I will be enhancing the package to handle Twenty20 and IPL matches soon. You can fork/clone the code from Github at yorkr.

The yaml data from Cricsheet have already been converted into R consumable dataframes. The converted data can be downloaded from Github at yorkrData. There are 3 folders – ODI matches, ODI matches between 2 teams (oppnAllMatches), and ODI matches between a team and the rest of the world (all matches, all oppositions).

As I have already mentioned, I have around 67 functions for analysis; however I am certain that the data has a lot more secrets waiting to be tapped. So please do go ahead and run any machine learning or statistical learning algorithms on them. If you do come up with interesting insights, I would appreciate it if you attribute the source to Cricsheet (http://cricsheet.org), my package yorkr and my blog Giga thoughts, besides dropping me a note.

Hope you have a great time with my yorkr package!

Important note: Do check out my other posts using yorkr at yorkr-posts

Also see

  1. Introducing cricketr! : An R package to analyze performances of cricketers
  2. Cricket analytics with cricketr in paperback and Kindle versions
  3. My TEDx talk on the “Internet of Things”
  4. Bend it like Bluemix,MongoDB with autoscaling – Part 1
  5. The mind of a programmer
  6. Fun simulation of a chain in Android
  7. Taking cricketr for a spin-Part 1
  8. Latency,throughput implications for the cloud
  9. Hand detection through haar-training: A hands-on approach
  10. Cricket analytics with cricketr

Introducing cricket package yorkr: Part 3-Foxed by flight!

Introduction

He will win, who knows when to fight and when not to fight.

He will win, who knows how to handle both superior and inferior forces

If you know neither the enemy nor yourself, you will succumb in every battle.

Hence the skilful fighter puts himself in a position which makes defeat impossible, and does not miss the moment for defeating the enemy.

Hence that general is skillful in attack whose opponent does not know what to defend; and he is skilled in defense whose opponent does not know what to attack.

                                         The Art of War - Sun Tzu

This post is a continuation of my introduction to my latest cricket package yorkr. This is the 3rd part of the introduction; the 2 earlier parts were

  1. Introducing cricket package yorkr-Part1:Beaten by sheer pace!.
  2. Introducing cricket package yorkr: Part 2-Trapped leg before wicket!

This post deals with Class 3 functions, namely the performances of a team in all matches against all oppositions, for e.g. India/Australia/South Africa against all oppositions in all matches. In other words, it is the performance of the team against the rest of the world.

If you are passionate about cricket and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books I perform a detailed yet compact analysis of the performances of batsmen and bowlers, besides evaluating team & match performances in Tests, ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the kindle versions. The books can be accessed at Cricket analytics with cricketr and Beaten by sheer pace-Cricket analytics with yorkr. A must read for any cricket lover! Check it out!!


This post has also been published at RPubs yorkr-Part3 and can also be downloaded as a PDF document from yorkr-Part3.pdf.

You can clone/fork the code for the package yorkr from Github at yorkr-package

Checkout my interactive Shiny apps GooglyPlus (plots & tables) and Googly (only plots) which can be used to analyze IPL players, teams and matches.

Important note 1: Do check out all the posts on the python avatar of yorkr, namely ‘yorkpy’, in my post ‘Pitching yorkpy … short of good length to IPL – Part 1’

The list of functions in Class 3 are

  1. teamBattingScorecardAllOppnAllMatches()
  2. teamBatsmenPartnershipAllOppnAllMatches()
  3. teamBatsmenPartnershipAllOppnAllMatchesPlot()
  4. teamBatsmenVsBowlersAllOppnAllMatchesRept()
  5. teamBatsmenVsBowlersAllOppnAllMatchesPlot()
  6. teamBowlingScorecardAllOppnAllMatchesMain()
  7. teamBowlersVsBatsmenAllOppnAllMatchesRept()
  8. teamBowlersVsBatsmenAllOppnAllMatchesPlot()
  9. teamBowlingWicketKindAllOppnAllMatches()
  10. teamBowlingWicketRunsAllOppnAllMatches()

Note 1: The yorkr package in its current avatar supports ODI, T20 and IPL T20 matches. 

Note 2: As in the previous parts, the plot functions usually have a plot=TRUE/FALSE parameter. This allows the user to get the desired dataframe as a return value instead, and to plot it in any way he/she likes, for e.g. in interactive charts using rcharts, ggvis, googleVis, plotly etc, as sketched below.
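For instance, here is a minimal sketch (my own illustration, using the plot=TRUE/FALSE parameter mentioned above and the ind_matches dataframe loaded in section 4 below; the column names are assumed from the detailed partnership report shown later)

# Get the dataframe back instead of the plot
df <- teamBatsmenPartnershipAllOppnAllMatchesPlot(ind_matches, "India",
                                                  main = "India", plot = FALSE)
# Plot the top 10 rows yourself, e.g. with ggplot2
library(ggplot2)
ggplot(head(df, 10), aes(x = batsman, y = partnershipRuns)) +
    geom_bar(stat = "identity")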

1. Install the package from CRAN

The yorkr package can now be installed directly from CRAN!

if (!require("yorkr")) {
  install.packages("yorkr") 
  library("yorkr")
}
rm(list=ls())

2. Get data for all matches against all oppositions for a team

We can get all matches against all oppositions for a team/country using the function below. The dir parameter should point to the folder in which the RData files of the individual matches exist. This function creates a data frame of all the matches and also saves the resulting dataframe as RData

setwd("C:/software/cricket-package/york-test/yorkrData/ODI/ODI-team-allmatches-allOppositions")

# Get all matches against all oppositions for India and save as RData
matches <-getAllMatchesAllOpposition("India",dir=".",save=TRUE)
dim(matches)
## [1] 140655     25


3. Save data for all matches against all oppositions

This can be done locally using the function below. This function gets all the matches of the country/team against all other countries/teams, combines them into a single dataframe and saves it in the current folder. The current implementation expects that the RData files of individual matches are in the ../data folder. Since I have already converted this, I will not be running this again.

#saveAllMatchesAllOpposition(dir=".",odir=".")

4. Load data directly for all matches between 2 teams

As in my earlier posts (yorkr-Part1 & yorkr-Part2) I have however already saved the data for all matches of the individual countries against all oppositions. The data for these matches for the individual teams/countries can be downloaded directly from the Github folder at ODI-team-allmatches-allOppositions.

Note: The dataframes for all the matches of a country against all oppositions can be loaded directly into your code. As can be seen in the calls below, the dataframes are ~100,000+ rows x 25 columns. While I have 10+ functions to process these dataframes for a particular team, feel free to download these data frames and perform your own analysis. The data frames include ball-by-ball details, details on non-striker, bowler, runs, extras, venue, date etc. Certainly these data frames are a gold-mine of interesting insights. So do go ahead and unleash your bagging/boosting algorithms, SVM classifiers or Random Forest algorithms on them, as sketched below.
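As a starting point, here is a minimal sketch (my own illustration, not a yorkr function, assuming the ‘batsman’ and ‘runs’ columns of the ball-by-ball dataframe suggested by the scorecard outputs, and one of the RData files loaded as in section 4 below)

library(dplyr)
load("allMatchesAllOpposition-India.RData")   # loads the 'matches' dataframe
matches %>%
    group_by(batsman) %>%
    summarise(totalRuns = sum(runs),
              deliveries = n()) %>%           # rough count of deliveries faced
    arrange(desc(totalRuns)) %>%
    head(10)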

I plan to try out some algorithms of statistical/machine learning in the months to come. If you do come up with interesting insights, I would appreciate it if you attribute the source to Cricsheet (http://cricsheet.org), my package yorkr and my blog Giga thoughts, besides dropping me a note.

As in my earlier post I will be directly loading the saved files. For the illustration of the functions, I will use India in all the functions, (for obvious reasons) and will randomly use the data from the rest of the top 8 teams

setwd("C:/software/cricket-package/york-test/yorkrData/ODI/ODI-team-allmatches-allOppositions")
load("allMatchesAllOpposition-India.RData")
ind_matches <- matches
dim(ind_matches)
## [1] 140655     25
load("allMatchesAllOpposition-Australia.RData")
aus_matches <- matches
dim(aus_matches)
## [1] 128148     25
load("allMatchesAllOpposition-New Zealand.RData")
nz_matches <- matches
dim(nz_matches)
## [1] 98573    25
load("allMatchesAllOpposition-Pakistan.RData")
pak_matches <- matches
dim(pak_matches)
## [1] 117947     25
load("allMatchesAllOpposition-England.RData")
eng_matches <- matches
dim(eng_matches)
## [1] 118859     25
load("allMatchesAllOpposition-Sri Lanka.RData")
sl_matches <- matches
dim(sl_matches)
## [1] 125893     25
load("allMatchesAllOpposition-West Indies.RData")
wi_matches <- matches
dim(wi_matches)
## [1] 92716    25
load("allMatchesAllOpposition-South Africa.RData")
sa_matches <- matches
dim(sa_matches)
## [1] 100916     25

5. Team Batting Scorecard (all matches with opposition)

The following function shows the batting scorecard for each country. It returns a dataframe with the top batsmen of each country.

#Top ODI performers for India
m <-teamBattingScorecardAllOppnAllMatches(ind_matches,theTeam="India")
## Total= 58079
## Source: local data frame [68 x 5]
## 
##         batsman ballsPlayed fours sixes  runs
##          (fctr)       (int) (int) (int) (dbl)
## 1       V Kohli        7774   663    67  7039
## 2      MS Dhoni        7878   515   129  6885
## 3      SK Raina        5076   429   114  4964
## 4     G Gambhir        5138   472    15  4503
## 5     RG Sharma        5245   372    89  4385
## 6  SR Tendulkar        4708   504    43  4196
## 7  Yuvraj Singh        4472   403    96  3976
## 8      V Sehwag        3106   494    74  3681
## 9      S Dhawan        2956   314    37  2694
## 10    AM Rahane        2490   195    24  2009
## ..          ...         ...   ...   ...   ...
#Top ODI batsmen for Australia
m <-teamBattingScorecardAllOppnAllMatches(aus_matches,theTeam="Australia")
## Total= 54736
## Source: local data frame [70 x 5]
## 
##       batsman ballsPlayed fours sixes  runs
##        (fctr)       (int) (int) (int) (dbl)
## 1   MJ Clarke        7060   440    39  5485
## 2   SR Watson        5435   519   114  5035
## 3  RT Ponting        5301   447    43  4440
## 4  MEK Hussey        4990   286    60  4286
## 5   BJ Haddin        3308   266    69  2858
## 6   DA Warner        2701   264    43  2537
## 7   GJ Bailey        2805   176    43  2392
## 8   SPD Smith        2303   174    19  2082
## 9    CL White        2471   142    44  2018
## 10  ML Hayden        2276   219    37  2002
## ..        ...         ...   ...   ...   ...
#Top ODI batsmen for Pakistan
m <-teamBattingScorecardAllOppnAllMatches(pak_matches,theTeam="Pakistan")
## Total= NA
## Source: local data frame [74 x 5]
## 
##            batsman ballsPlayed fours sixes  runs
##             (fctr)       (int) (int) (int) (dbl)
## 1  Mohammad Hafeez        5714   471    71  4574
## 2      Younis Khan        4561   306    24  3465
## 3    Shahid Afridi        2316   264   132  3125
## 4     Shoaib Malik        3472   240    40  2897
## 5       Umar Akmal        3272   241    47  2843
## 6    Ahmed Shehzad        3386   259    18  2491
## 7  Mohammad Yousuf        2933   191    11  2241
## 8     Kamran Akmal        2533   247    25  2104
## 9      Salman Butt        2037   206     6  1653
## 10   Nasir Jamshed        1862   150    19  1418
## ..             ...         ...   ...   ...   ...
#Top ODI batsmen for New Zealand
m <-teamBattingScorecardAllOppnAllMatches(nz_matches,theTeam="New Zealand")
## Total= 39993
## Source: local data frame [68 x 5]
## 
##          batsman ballsPlayed fours sixes  runs
##           (fctr)       (int) (int) (int) (dbl)
## 1    LRPL Taylor        6153   418   103  5120
## 2    BB McCullum        4321   446   159  4489
## 3     MJ Guptill        5205   462   100  4460
## 4  KS Williamson        4044   325    25  3418
## 5      SB Styris        2324   167    23  1944
## 6     GD Elliott        2274   149    26  1889
## 7       JD Ryder        1232   139    33  1223
## 8       JDP Oram        1174    81    48  1195
## 9     DL Vettori        1238    97     8  1130
## 10      L Ronchi         927   108    32  1070
## ..           ...         ...   ...   ...   ...
#Top ODI batsmen for England
m <-teamBattingScorecardAllOppnAllMatches(eng_matches,theTeam="England")
## Total= 48152
## Source: local data frame [72 x 5]
## 
##           batsman ballsPlayed fours sixes  runs
##            (fctr)       (int) (int) (int) (dbl)
## 1         IR Bell        6401   488    31  5051
## 2      EJG Morgan        4249   323    98  3927
## 3    KP Pietersen        3828   315    44  3231
## 4         AN Cook        4052   360    10  3163
## 5  PD Collingwood        3693   213    48  2992
## 6       IJL Trott        3418   205     3  2653
## 7       RS Bopara        3326   202    32  2624
## 8      AJ Strauss        3062   276    20  2566
## 9         JE Root        2983   200    26  2543
## 10     JC Buttler        1467   155    54  1777
## ..            ...         ...   ...   ...   ...
#Top ODI batsmen for West Indies
m <-teamBattingScorecardAllOppnAllMatches(wi_matches,theTeam="West Indies")
## Total= 34622
## Source: local data frame [65 x 5]
## 
##          batsman ballsPlayed fours sixes  runs
##           (fctr)       (int) (int) (int) (dbl)
## 1       CH Gayle        3839   386   144  3635
## 2     MN Samuels        4057   294    72  3062
## 3  S Chanderpaul        3521   188    28  2469
## 4       DJ Bravo        2804   193    49  2390
## 5       DM Bravo        2916   174    41  2051
## 6      RR Sarwan        2682   172    20  1960
## 7     KA Pollard        2064   127    92  1947
## 8    LMP Simmons        2538   157    46  1863
## 9      DJG Sammy        1799   143    83  1835
## 10      D Ramdin        1817   115    23  1516
## ..           ...         ...   ...   ...   ...
#Top ODI batsmen for Sri Lanka
m <-teamBattingScorecardAllOppnAllMatches(sl_matches,theTeam="Sri Lanka")
## Total= NA
## Source: local data frame [60 x 5]
## 
##             batsman ballsPlayed fours sixes  runs
##              (fctr)       (int) (int) (int) (dbl)
## 1     KC Sangakkara       10449   852    64  8778
## 2        TM Dilshan        8838   914    45  7981
## 3  DPMD Jayawardene        7482   599    43  6260
## 4       WU Tharanga        5690   483    24  4232
## 5        AD Mathews        4383   288    59  3764
## 6     ST Jayasuriya        2266   297    61  2396
## 7   HDRL Thirimanne        3286   192    17  2371
## 8      LD Chandimal        3026   165    27  2308
## 9   KMDN Kulasekara        1406    83    37  1204
## 10      NLTC Perera        1007    90    42  1137
## ..              ...         ...   ...   ...   ...

6. Team Batting Scorecard

The following function shows the best batsmen from the opposition ‘theTeam’ in the ‘matches’. For e.g. when matches=ind_matches and theTeam=“England”, the returned dataframe shows the best English batsmen against India.

#Top England batsmen against India
m <-teamBattingScorecardAllOppnAllMatches(matches=ind_matches,theTeam="England")
## Total= 7620
## Source: local data frame [43 x 5]
## 
##           batsman ballsPlayed fours sixes  runs
##            (fctr)       (int) (int) (int) (dbl)
## 1         IR Bell        1238   110     9  1085
## 2    KP Pietersen         990    89    10   847
## 3         AN Cook        1049   103     2   822
## 4       RS Bopara         632    42     8   534
## 5  PD Collingwood         450    39     6   397
## 6         OA Shah         394    40     7   385
## 7       IJL Trott         410    33     2   349
## 8         JE Root         408    32     4   336
## 9        SR Patel         336    25    10   329
## 10   C Kieswetter         309    34    13   313
## ..            ...         ...   ...   ...   ...
#Top Australian batsmen against India
m <-teamBattingScorecardAllOppnAllMatches(matches=ind_matches,theTeam="Australia")
## Total= 9995
## Source: local data frame [47 x 5]
## 
##       batsman ballsPlayed fours sixes  runs
##        (fctr)       (int) (int) (int) (dbl)
## 1  RT Ponting        1107    86     8   876
## 2  MEK Hussey         816    56     5   753
## 3   GJ Bailey         578    51    13   614
## 4   SR Watson         653    81    10   609
## 5   MJ Clarke         786    45     5   607
## 6   ML Hayden         660    72     8   573
## 7   A Symonds         543    43    15   536
## 8    AJ Finch         617    52     9   525
## 9   SPD Smith         431    44     7   467
## 10  DA Warner         385    40     6   391
## ..        ...         ...   ...   ...   ...
#Top New Zealand batsmen against Australia
m <-teamBattingScorecardAllOppnAllMatches(aus_matches,theTeam="New Zealand")
## Total= 6106
## Source: local data frame [44 x 5]
## 
##        batsman ballsPlayed fours sixes  runs
##         (fctr)       (int) (int) (int) (dbl)
## 1  LRPL Taylor        1012    71    13   804
## 2  BB McCullum         768    71    25   761
## 3   MJ Guptill         618    50    17   485
## 4    PG Fulton         526    35     9   425
## 5   GD Elliott         469    29     4   405
## 6    SB Styris         415    36     5   369
## 7   DL Vettori         334    24     2   291
## 8    L Vincent         338    27     5   272
## 9  CD McMillan         227    28    10   266
## 10    JDP Oram         181    13     7   193
## ..         ...         ...   ...   ...   ...
#Top Sri Lankan batsmen against West Indies
m <-teamBattingScorecardAllOppnAllMatches(wi_matches,theTeam="Sri Lanka")
## Total= 1851
## Source: local data frame [28 x 5]
## 
##             batsman ballsPlayed fours sixes  runs
##              (fctr)       (int) (int) (int) (dbl)
## 1  DPMD Jayawardene         330    26     2   288
## 2     KC Sangakkara         326    16     2   238
## 3        TM Dilshan         173    18     7   224
## 4       WU Tharanga         349    22    NA   220
## 5        AD Mathews         171    10     3   161
## 6     ST Jayasuriya         146    19     4   160
## 7       ML Udawatte         138     8     1    87
## 8   HDRL Thirimanne         144     6    NA    67
## 9       MDKJ Perera          63     4     2    64
## 10    CK Kapugedera          68     2    NA    57
## ..              ...         ...   ...   ...   ...

7. Team Batting Partnerships

This gives the top batting partnerships in each team in all its matches against all oppositions. The report can either be a ‘summary’ or a ‘detailed’ breakup of the batting partnerships.

# The function gives the names of the highest partnerships for India. The default report parameter is "summary"
m <- teamBatsmenPartnershipAllOppnAllMatches(ind_matches,theTeam='India')
m
## Source: local data frame [68 x 2]
## 
##         batsman totalRuns
##          (fctr)     (dbl)
## 1       V Kohli      7039
## 2      MS Dhoni      6885
## 3      SK Raina      4964
## 4     G Gambhir      4503
## 5     RG Sharma      4385
## 6  SR Tendulkar      4196
## 7  Yuvraj Singh      3976
## 8      V Sehwag      3681
## 9      S Dhawan      2694
## 10    AM Rahane      2009
## ..          ...       ...
# When the report parameter is 'detailed' then the detailed break up of the partnership is returned as a data frame
m <- teamBatsmenPartnershipAllOppnAllMatches(ind_matches,theTeam='India',report="detailed")
head(m,30)
##     batsman      nonStriker partnershipRuns totalRuns
## 1   V Kohli        S Dhawan             661      7039
## 2   V Kohli       AM Rahane             502      7039
## 3   V Kohli       RG Sharma            1073      7039
## 4   V Kohli      KD Karthik             139      7039
## 5   V Kohli    SR Tendulkar             278      7039
## 6   V Kohli        R Dravid             132      7039
## 7   V Kohli        V Sehwag             255      7039
## 8   V Kohli    Yuvraj Singh             420      7039
## 9   V Kohli        SK Raina            1072      7039
## 10  V Kohli        MS Dhoni             534      7039
## 11  V Kohli Harbhajan Singh              13      7039
## 12  V Kohli       IK Pathan               1      7039
## 13  V Kohli       G Gambhir             962      7039
## 14  V Kohli      RV Uthappa              10      7039
## 15  V Kohli       RA Jadeja              91      7039
## 16  V Kohli        R Ashwin              71      7039
## 17  V Kohli       AT Rayudu             345      7039
## 18  V Kohli Gurkeerat Singh               1      7039
## 19  V Kohli       YK Pathan              68      7039
## 20  V Kohli       STR Binny               4      7039
## 21  V Kohli       MK Tiwary             111      7039
## 22  V Kohli        AR Patel              39      7039
## 23  V Kohli        PA Patel             180      7039
## 24  V Kohli         M Vijay              33      7039
## 25  V Kohli       KM Jadhav              10      7039
## 26  V Kohli        AM Nayar              25      7039
## 27  V Kohli     S Badrinath               9      7039
## 28 MS Dhoni        S Dhawan              49      6885
## 29 MS Dhoni       AM Rahane              50      6885
## 30 MS Dhoni       RG Sharma             300      6885

9. More Team Batting Partnerships

When we use the dataframe ind_matches (matches of India against all oppositions) and choose another country as theTeam, then we will get the names of the top batsmen of that country against India.

# Top England batting partnerships against India (report="summary")
m <- teamBatsmenPartnershipAllOppnAllMatches(ind_matches,theTeam='England')
m
## Source: local data frame [43 x 2]
## 
##           batsman totalRuns
##            (fctr)     (dbl)
## 1         IR Bell      1085
## 2    KP Pietersen       847
## 3         AN Cook       822
## 4       RS Bopara       534
## 5  PD Collingwood       397
## 6         OA Shah       385
## 7       IJL Trott       349
## 8         JE Root       336
## 9        SR Patel       329
## 10   C Kieswetter       313
## ..            ...       ...
# Top South Africa  batting partnerships against India (report="detailed")
m <- teamBatsmenPartnershipAllOppnAllMatches(ind_matches,theTeam='South Africa', report="detailed")
m[1:30,]
##           batsman       nonStriker partnershipRuns totalRuns
## 1  AB de Villiers       MN van Wyk              30      1179
## 2  AB de Villiers        JH Kallis             207      1179
## 3  AB de Villiers         HH Gibbs              20      1179
## 4  AB de Villiers        JP Duminy             168      1179
## 5  AB de Villiers       MV Boucher              37      1179
## 6  AB de Villiers          JM Kemp               5      1179
## 7  AB de Villiers      AN Petersen               8      1179
## 8  AB de Villiers       WD Parnell              56      1179
## 9  AB de Villiers         DW Steyn               5      1179
## 10 AB de Villiers    CK Langeveldt              19      1179
## 11 AB de Villiers          HM Amla              26      1179
## 12 AB de Villiers         GC Smith             106      1179
## 13 AB de Villiers     F du Plessis             133      1179
## 14 AB de Villiers        Q de Kock             113      1179
## 15 AB de Villiers        DA Miller             103      1179
## 16 AB de Villiers      F Behardien              64      1179
## 17 AB de Villiers        CH Morris              32      1179
## 18 AB de Villiers      AM Phangiso              37      1179
## 19 AB de Villiers       SM Pollock              10      1179
## 20        HM Amla       MN van Wyk              66       704
## 21        HM Amla   AB de Villiers               9       704
## 22        HM Amla        JH Kallis              88       704
## 23        HM Amla         HH Gibbs              10       704
## 24        HM Amla        JP Duminy              79       704
## 25        HM Amla        LE Bosman              43       704
## 26        HM Amla RE van der Merwe              17       704
## 27        HM Amla         GC Smith              92       704
## 28        HM Amla     F du Plessis              45       704
## 29        HM Amla      RJ Peterson               2       704
## 30        HM Amla        Q de Kock             211       704

10. Team Batting partnerships of other countries

#Top Indian batting partnerships  against England matches
m <- teamBatsmenPartnershipAllOppnAllMatches(eng_matches,theTeam='India',report="detailed")
head(m,30)
##     batsman    nonStriker partnershipRuns totalRuns
## 1  MS Dhoni     G Gambhir               6      1083
## 2  MS Dhoni      R Dravid              59      1083
## 3  MS Dhoni     PP Chawla               1      1083
## 4  MS Dhoni        Z Khan               4      1083
## 5  MS Dhoni      RP Singh              26      1083
## 6  MS Dhoni  Yuvraj Singh             157      1083
## 7  MS Dhoni      RR Powar              15      1083
## 8  MS Dhoni    RV Uthappa              29      1083
## 9  MS Dhoni     AM Rahane               1      1083
## 10 MS Dhoni       V Kohli              28      1083
## 11 MS Dhoni      SK Raina             372      1083
## 12 MS Dhoni       P Kumar              42      1083
## 13 MS Dhoni R Vinay Kumar              12      1083
## 14 MS Dhoni      R Ashwin              27      1083
## 15 MS Dhoni     RA Jadeja             238      1083
## 16 MS Dhoni     AT Rayudu              17      1083
## 17 MS Dhoni     STR Binny              41      1083
## 18 MS Dhoni     YK Pathan               8      1083
## 19 SK Raina     G Gambhir              23       918
## 20 SK Raina      R Dravid               1       918
## 21 SK Raina      MS Dhoni             450       918
## 22 SK Raina  Yuvraj Singh              56       918
## 23 SK Raina     AM Rahane              17       918
## 24 SK Raina       V Kohli             144       918
## 25 SK Raina     RG Sharma              58       918
## 26 SK Raina     MK Tiwary              28       918
## 27 SK Raina      R Ashwin              15       918
## 28 SK Raina     RA Jadeja              59       918
## 29 SK Raina     AT Rayudu              61       918
## 30 SK Raina      V Sehwag               6       918
#Top South Africa batting partnerships 
m <- teamBatsmenPartnershipAllOppnAllMatches(sa_matches,theTeam='South Africa', report="detailed")
head(m,30)
##           batsman       nonStriker partnershipRuns totalRuns
## 1  AB de Villiers         GC Smith             957      7693
## 2  AB de Villiers        JH Kallis             897      7693
## 3  AB de Villiers         HH Gibbs             295      7693
## 4  AB de Villiers       MV Boucher             143      7693
## 5  AB de Villiers          JM Kemp               8      7693
## 6  AB de Villiers       SM Pollock              16      7693
## 7  AB de Villiers    CK Langeveldt              19      7693
## 8  AB de Villiers          HM Amla            1437      7693
## 9  AB de Villiers        JP Duminy            1123      7693
## 10 AB de Villiers        JA Morkel             169      7693
## 11 AB de Villiers          J Botha              27      7693
## 12 AB de Villiers        Q de Kock             248      7693
## 13 AB de Villiers     F du Plessis             667      7693
## 14 AB de Villiers        DA Miller             571      7693
## 15 AB de Villiers        R McLaren             120      7693
## 16 AB de Villiers         DW Steyn              32      7693
## 17 AB de Villiers      AM Phangiso              37      7693
## 18 AB de Villiers         M Morkel              21      7693
## 19 AB de Villiers       WD Parnell              83      7693
## 20 AB de Villiers      F Behardien             223      7693
## 21 AB de Villiers     VD Philander              12      7693
## 22 AB de Villiers       RR Rossouw              90      7693
## 23 AB de Villiers      RJ Peterson               5      7693
## 24 AB de Villiers      AN Petersen             132      7693
## 25 AB de Villiers       MN van Wyk              89      7693
## 26 AB de Villiers        CH Morris              32      7693
## 27 AB de Villiers        KJ Abbott              21      7693
## 28 AB de Villiers          D Elgar              54      7693
## 29 AB de Villiers RE van der Merwe               1      7693
## 30 AB de Villiers        CA Ingram             138      7693
#Top Sri Lanka batting partnerships 
m <- teamBatsmenPartnershipAllOppnAllMatches(sl_matches,theTeam='Sri Lanka',report="summary")
m
## Source: local data frame [60 x 2]
## 
##             batsman totalRuns
##              (fctr)     (dbl)
## 1     KC Sangakkara      8778
## 2        TM Dilshan      7981
## 3  DPMD Jayawardene      6260
## 4       WU Tharanga      4232
## 5        AD Mathews      3764
## 6     ST Jayasuriya      2396
## 7   HDRL Thirimanne      2371
## 8      LD Chandimal      2308
## 9   KMDN Kulasekara      1204
## 10      NLTC Perera      1137
## ..              ...       ...
#Top England batting partnerships 
m <- teamBatsmenPartnershipAllOppnAllMatches(eng_matches,theTeam='England',report="summary")
m
## Source: local data frame [72 x 2]
## 
##           batsman totalRuns
##            (fctr)     (dbl)
## 1         IR Bell      5051
## 2      EJG Morgan      3927
## 3    KP Pietersen      3231
## 4         AN Cook      3163
## 5  PD Collingwood      2992
## 6       IJL Trott      2653
## 7       RS Bopara      2624
## 8      AJ Strauss      2566
## 9         JE Root      2543
## 10     JC Buttler      1777
## ..            ...       ...
#Top Australian batting partnerships in West Indian matches
m <- teamBatsmenPartnershipAllOppnAllMatches(wi_matches,theTeam='Australia',report="summary")
m
## Source: local data frame [39 x 2]
## 
##       batsman totalRuns
##        (fctr)     (dbl)
## 1   SR Watson       851
## 2  MEK Hussey       630
## 3  RT Ponting       503
## 4   MJ Clarke       435
## 5   GJ Bailey       341
## 6   A Symonds       252
## 7    SE Marsh       245
## 8   BJ Haddin       220
## 9   DJ Hussey       211
## 10   AC Voges       209
## ..        ...       ...
#Top England batting partnerships in New Zealand  matches
m <- teamBatsmenPartnershipAllOppnAllMatches(nz_matches,theTeam='England',report="summary")
m
## Source: local data frame [47 x 2]
## 
##           batsman totalRuns
##            (fctr)     (dbl)
## 1         IR Bell       654
## 2         JE Root       612
## 3  PD Collingwood       514
## 4      EJG Morgan       479
## 5         AN Cook       464
## 6       IJL Trott       362
## 7    KP Pietersen       358
## 8      JC Buttler       287
## 9         OA Shah       274
## 10      RS Bopara       222
## ..            ...       ...

11. Team Batting Partnership plots

Graphical plot of batting partnerships for the countries

# Plot of batting partnerships of India (Virat Kohli and M S Dhoni have the best partnerships)
teamBatsmenPartnershipAllOppnAllMatchesPlot(ind_matches,"India",main="India")

batsmenPartnership1-1

# Plot of batting partnerships of Pakistan
teamBatsmenPartnershipAllOppnAllMatchesPlot(pak_matches,"Pakistan",main="Pakistan")

batsmenPartnership1-2

# Plot of batting partnerships of Australia
teamBatsmenPartnershipAllOppnAllMatchesPlot(aus_matches,"Australia",main="Australia")

batsmenPartnership1-3

12. Top opposition batting partnerships.

This gives the top batting partnerships of a team against a specified country. For e.g. below are India’s partnerships against West Indies, Sri Lanka’s partnerships against India, and New Zealand’s partnerships against South Africa.

# Top India partnerships against West Indies
teamBatsmenPartnershipAllOppnAllMatchesPlot(ind_matches,"India",main="West Indies")

batsmenPartnership2-1

# Top Sri Lanka partnerships against India
teamBatsmenPartnershipAllOppnAllMatchesPlot(sl_matches,"Sri Lanka",main="India")

batsmenPartnership2-2

# Top New Zealand partnerships against South Africa
teamBatsmenPartnershipAllOppnAllMatchesPlot(nz_matches,"New Zealand",main="South Africa")

batsmenPartnership2-3

13. Batsmen vs Bowlers

The functions below give the top performances of batsmen against the opposition countries

# Top batsmen against bowlers when rank=0
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(ind_matches,"India",rank=0)
m
## Source: local data frame [68 x 2]
## 
##         batsman runsScored
##          (fctr)      (dbl)
## 1       V Kohli       7039
## 2      MS Dhoni       6885
## 3      SK Raina       4964
## 4     G Gambhir       4503
## 5     RG Sharma       4385
## 6  SR Tendulkar       4196
## 7  Yuvraj Singh       3976
## 8      V Sehwag       3681
## 9      S Dhawan       2694
## 10    AM Rahane       2009
## ..          ...        ...
# Performance of India batsman with rank=1 against international bowlers and runs scored against bowlers. This is Virat Kohli for India
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(ind_matches,"India",rank=1,dispRows=30)
m
## Source: local data frame [30 x 3]
## Groups: batsman [1]
## 
##    batsman          bowler  runs
##     (fctr)          (fctr) (dbl)
## 1  V Kohli     NLTC Perera   242
## 2  V Kohli KMDN Kulasekara   196
## 3  V Kohli      SL Malinga   175
## 4  V Kohli      AD Mathews   155
## 5  V Kohli      BAW Mendis   132
## 6  V Kohli       R Rampaul   127
## 7  V Kohli     JW Dernbach   121
## 8  V Kohli     JP Faulkner   118
## 9  V Kohli       DJG Sammy   116
## 10 V Kohli    HMRKB Herath   113
## ..     ...             ...   ...
# Performance of India batsman with rank=2 against international bowlers and runs scored against these bowlers. This is M S Dhoni for India
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(ind_matches,"India",rank=2,dispRows=50)
m
## Source: local data frame [50 x 3]
## Groups: batsman [1]
## 
##     batsman         bowler  runs
##      (fctr)         (fctr) (dbl)
## 1  MS Dhoni M Muralitharan   195
## 2  MS Dhoni  ST Jayasuriya   183
## 3  MS Dhoni     SL Malinga   144
## 4  MS Dhoni      SR Watson   135
## 5  MS Dhoni        ST Finn   130
## 6  MS Dhoni     MG Johnson   128
## 7  MS Dhoni    JP Faulkner   125
## 8  MS Dhoni  Shahid Afridi   120
## 9  MS Dhoni     TT Bresnan   111
## 10 MS Dhoni     AD Mathews   111
## ..      ...            ...   ...
# Performance of England batsman with rank=1 against international bowlers and runs scored against these bowlers. This returns a data frame of theTeam's batsmen against the bowlers in the 'matches' dataframe. This is IR Bell.
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(matches=ind_matches,theTeam="England",rank=1,dispRows=25)
m
## Source: local data frame [25 x 3]
## Groups: batsman [1]
## 
##    batsman       bowler  runs
##     (fctr)       (fctr) (dbl)
## 1  IR Bell       Z Khan   127
## 2  IR Bell    PP Chawla   111
## 3  IR Bell    RA Jadeja    94
## 4  IR Bell      B Kumar    78
## 5  IR Bell     MM Patel    77
## 6  IR Bell     R Ashwin    71
## 7  IR Bell   AB Agarkar    66
## 8  IR Bell     I Sharma    57
## 9  IR Bell     RP Singh    51
## 10 IR Bell Yuvraj Singh    51
## ..     ...          ...   ...
# All the best Australian batsmen against India in all of India's matches
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(ind_matches,"Australia",rank=0)
m
## Source: local data frame [47 x 2]
## 
##       batsman runsScored
##        (fctr)      (dbl)
## 1  RT Ponting        876
## 2  MEK Hussey        753
## 3   GJ Bailey        614
## 4   SR Watson        609
## 5   MJ Clarke        607
## 6   ML Hayden        573
## 7   A Symonds        536
## 8    AJ Finch        525
## 9   SPD Smith        467
## 10  DA Warner        391
## ..        ...        ...

14. Batsmen vs Bowlers (continued)

# The best Indian batsman (rank=1) against England and his performance against England's bowlers
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(eng_matches,"India",rank=1,dispRows=30)
m
## Source: local data frame [28 x 3]
## Groups: batsman [1]
## 
##     batsman      bowler  runs
##      (fctr)      (fctr) (dbl)
## 1  MS Dhoni     ST Finn   130
## 2  MS Dhoni  TT Bresnan   111
## 3  MS Dhoni    GP Swann   101
## 4  MS Dhoni JW Dernbach    95
## 5  MS Dhoni   SCJ Broad    92
## 6  MS Dhoni JM Anderson    89
## 7  MS Dhoni    SR Patel    83
## 8  MS Dhoni JC Tredwell    40
## 9  MS Dhoni   CR Woakes    38
## 10 MS Dhoni  MS Panesar    37
## ..      ...         ...   ...
# All the top Sri Lanka batsmen (rank=0) against Australia and performances against Australian bowlers
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(aus_matches,"Sri Lanka",rank=0)
m
## Source: local data frame [31 x 2]
## 
##             batsman runsScored
##              (fctr)      (dbl)
## 1     KC Sangakkara        888
## 2  DPMD Jayawardene        846
## 3        TM Dilshan        799
## 4       WU Tharanga        464
## 5      LD Chandimal        413
## 6        AD Mathews        404
## 7   HDRL Thirimanne        290
## 8   KMDN Kulasekara        232
## 9     ST Jayasuriya        117
## 10       SL Malinga         91
## ..              ...        ...
#All the top England batsmen (rank=0) and their performances against South African bowlers
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(sa_matches,"England",rank=0)
m
## Source: local data frame [39 x 2]
## 
##           batsman runsScored
##            (fctr)      (dbl)
## 1       IJL Trott        424
## 2         JE Root        372
## 3         IR Bell        362
## 4      EJG Morgan        335
## 5  PD Collingwood        319
## 6        AD Hales        271
## 7    KP Pietersen        192
## 8      A Flintoff        192
## 9         OA Shah        177
## 10     JC Buttler        154
## ..            ...        ...

15. Batsmen vs Bowlers Plot

The following functions plot the performances of the batsman based on the rank chosen against opposition bowlers. Note: The rank has to be >0

#The following plot displays the performance of the top India batsman (rank=1) against all opposition bowlers. This is Virat Kohli for India

d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(ind_matches,"India",rank=1,dispRows=50)
d
## Source: local data frame [50 x 3]
## Groups: batsman [1]
## 
##    batsman          bowler  runs
##     (fctr)          (fctr) (dbl)
## 1  V Kohli     NLTC Perera   242
## 2  V Kohli KMDN Kulasekara   196
## 3  V Kohli      SL Malinga   175
## 4  V Kohli      AD Mathews   155
## 5  V Kohli      BAW Mendis   132
## 6  V Kohli       R Rampaul   127
## 7  V Kohli     JW Dernbach   121
## 8  V Kohli     JP Faulkner   118
## 9  V Kohli       DJG Sammy   116
## 10 V Kohli    HMRKB Herath   113
## ..     ...             ...   ...
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)

batsmenVsBowler1-1

e <- teamBatsmenVsBowlersAllOppnAllMatchesPlot(d,plot=FALSE)
e
## Source: local data frame [50 x 3]
## Groups: batsman [1]
## 
##    batsman          bowler  runs
##     (fctr)          (fctr) (dbl)
## 1  V Kohli     NLTC Perera   242
## 2  V Kohli KMDN Kulasekara   196
## 3  V Kohli      SL Malinga   175
## 4  V Kohli      AD Mathews   155
## 5  V Kohli      BAW Mendis   132
## 6  V Kohli       R Rampaul   127
## 7  V Kohli     JW Dernbach   121
## 8  V Kohli     JP Faulkner   118
## 9  V Kohli       DJG Sammy   116
## 10 V Kohli    HMRKB Herath   113
## ..     ...             ...   ...
# The following plot displays the performance of the batsman (rank=2) against all opposition bowlers. This is M S Dhoni for India
d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(ind_matches,"India",rank=2,dispRows=50)
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)

batsmenVsBowler1-2

# Best batsman of South Africa against Indian  bowlers
d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(ind_matches,"South Africa",rank=1,dispRows=30)
d
## Source: local data frame [30 x 3]
## Groups: batsman [1]
## 
##           batsman          bowler  runs
##            (fctr)          (fctr) (dbl)
## 1  AB de Villiers Harbhajan Singh   133
## 2  AB de Villiers         B Kumar    93
## 3  AB de Villiers       RA Jadeja    90
## 4  AB de Villiers        A Mishra    77
## 5  AB de Villiers       MM Sharma    68
## 6  AB de Villiers          Z Khan    65
## 7  AB de Villiers     S Sreesanth    61
## 8  AB de Villiers         A Nehra    58
## 9  AB de Villiers        R Ashwin    55
## 10 AB de Villiers       IK Pathan    45
## ..            ...             ...   ...
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)

batsmenVsBowler1-3

# Best batsman of England (rank=1) against Indian bowlers (matches=ind_matches)
d <-teamBatsmenVsBowlersAllOppnAllMatchesRept(matches=ind_matches,"England",rank=1,dispRows=50)
d
## Source: local data frame [28 x 3]
## Groups: batsman [1]
## 
##    batsman       bowler  runs
##     (fctr)       (fctr) (dbl)
## 1  IR Bell       Z Khan   127
## 2  IR Bell    PP Chawla   111
## 3  IR Bell    RA Jadeja    94
## 4  IR Bell      B Kumar    78
## 5  IR Bell     MM Patel    77
## 6  IR Bell     R Ashwin    71
## 7  IR Bell   AB Agarkar    66
## 8  IR Bell     I Sharma    57
## 9  IR Bell     RP Singh    51
## 10 IR Bell Yuvraj Singh    51
## ..     ...          ...   ...
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)

batsmenVsBowler1-4

16. Batsmen vs Bowlers Plot (continued)

# Top batsman of South Africa and performance against opposition bowlers of all countries
d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(sa_matches,"South Africa",rank=1,dispRows=50)
d
## Source: local data frame [50 x 3]
## Groups: batsman [1]
## 
##           batsman          bowler  runs
##            (fctr)          (fctr) (dbl)
## 1  AB de Villiers   Shahid Afridi   227
## 2  AB de Villiers     Saeed Ajmal   174
## 3  AB de Villiers Mohammad Hafeez   151
## 4  AB de Villiers       JO Holder   138
## 5  AB de Villiers Harbhajan Singh   133
## 6  AB de Villiers      Wahab Riaz   130
## 7  AB de Villiers      MG Johnson   129
## 8  AB de Villiers        P Utseya   128
## 9  AB de Villiers       DJG Sammy   110
## 10 AB de Villiers        DJ Bravo   107
## ..            ...             ...   ...
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)

batsmenVsBowler2-1

# Do not display plot but return dataframe
e <- teamBatsmenVsBowlersAllOppnAllMatchesPlot(d,plot=FALSE)
e
## Source: local data frame [50 x 3]
## Groups: batsman [1]
## 
##           batsman          bowler  runs
##            (fctr)          (fctr) (dbl)
## 1  AB de Villiers   Shahid Afridi   227
## 2  AB de Villiers     Saeed Ajmal   174
## 3  AB de Villiers Mohammad Hafeez   151
## 4  AB de Villiers       JO Holder   138
## 5  AB de Villiers Harbhajan Singh   133
## 6  AB de Villiers      Wahab Riaz   130
## 7  AB de Villiers      MG Johnson   129
## 8  AB de Villiers        P Utseya   128
## 9  AB de Villiers       DJG Sammy   110
## 10 AB de Villiers        DJ Bravo   107
## ..            ...             ...   ...
# Top batsman of Sri Lanka against bowlers of all countries
d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(sl_matches,"Sri Lanka",rank=1,dispRows=50)
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)

batsmenVsBowler2-2

# Best West Indian batsman against English bowlers
d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(eng_matches,"West Indies",rank=1,dispRows=50)
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)

batsmenVsBowler2-3

17. Team bowling scorecard against all opposition

The function lists the top bowlers of each country in ODI matches. It returns a data frame when 'matches' contains the matches of a country and 'theTeam' is that same country, as in the calls below

teamBowlingScorecardAllOppnAllMatchesMain(matches=ind_matches,theTeam="India")
## Source: local data frame [57 x 5]
## 
##             bowler overs maidens  runs wickets
##             (fctr) (int)   (int) (dbl)   (dbl)
## 1        RA Jadeja    43       0  4749     153
## 2         R Ashwin    49       0  4225     146
## 3           Z Khan    47       0  3692     141
## 4  Harbhajan Singh    45       0  4040     123
## 5         I Sharma    51       0  3216     113
## 6         MM Patel    49       1  2400      92
## 7          P Kumar    50       2  2752      84
## 8         UT Yadav    51       0  2442      80
## 9   Mohammed Shami    43       0  1806      80
## 10    Yuvraj Singh    38       0  2588      77
## ..             ...   ...     ...   ...     ...
teamBowlingScorecardAllOppnAllMatchesMain(matches=aus_matches,theTeam="Australia")
## Source: local data frame [54 x 5]
## 
##          bowler overs maidens  runs wickets
##          (fctr) (int)   (int) (dbl)   (dbl)
## 1    MG Johnson    51       0  5635     245
## 2         B Lee    50       0  3400     147
## 3     SR Watson    45      NA    NA     136
## 4    NW Bracken    51       0  2763     114
## 5      CJ McKay    49      NA    NA     103
## 6      MA Starc    48       1  1769      97
## 7   JP Faulkner    44       0  2004      75
## 8      JR Hopes    43       0  2098      69
## 9       SW Tait    50       0  1461      66
## 10 DE Bollinger    51       0  1482      65
## ..          ...   ...     ...   ...     ...
teamBowlingScorecardAllOppnAllMatchesMain(eng_matches,"England")
## Source: local data frame [52 x 5]
## 
##            bowler overs maidens  runs wickets
##            (fctr) (int)   (int) (dbl)   (dbl)
## 1     JM Anderson    51       0  5688     202
## 2       SCJ Broad    51       0  5160     198
## 3      TT Bresnan    51       0  3730     117
## 4         ST Finn    49       0  2839     106
## 5        GP Swann    39       0  2760     106
## 6  PD Collingwood    40       1  2517      77
## 7      A Flintoff    45       0  1260      68
## 8     JC Tredwell    42       0  1614      62
## 9       CR Woakes    47       0  1859      57
## 10      RS Bopara    34       0  1508      42
## ..            ...   ...     ...   ...     ...
teamBowlingScorecardAllOppnAllMatchesMain(pak_matches,"Pakistan")
## Source: local data frame [55 x 5]
## 
##             bowler overs maidens  runs wickets
##             (fctr) (int)   (int) (dbl)   (dbl)
## 1    Shahid Afridi    45       0  6674     212
## 2      Saeed Ajmal    44       0  4089     184
## 3         Umar Gul    49       0  4127     151
## 4       Wahab Riaz    50       0  2954     111
## 5  Mohammad Hafeez    51       0  3502     109
## 6   Mohammad Irfan    49       0  2523      86
## 7    Sohail Tanvir    48       1  2534      75
## 8      Junaid Khan    48       1  2056      75
## 9   Iftikhar Anjum    49       2  1674      62
## 10    Shoaib Malik    41       1  2206      59
## ..             ...   ...     ...   ...     ...
teamBowlingScorecardAllOppnAllMatchesMain(sa_matches,"South Africa")
## Source: local data frame [41 x 5]
## 
##           bowler overs maidens  runs wickets
##           (fctr) (int)   (int) (dbl)   (dbl)
## 1       DW Steyn    51       0  4294     179
## 2       M Morkel    51       0  4012     172
## 3    LL Tsotsobe    42       0  2231     100
## 4    Imran Tahir    39       0  2124      93
## 5      R McLaren    41       1  1983      80
## 6      JH Kallis    44       0  2075      77
## 7     WD Parnell    44       0  1957      74
## 8        J Botha    44       0  2311      69
## 9    RJ Peterson    47       1  1872      68
## 10 CK Langeveldt    49       0  1829      65
## ..           ...   ...     ...   ...     ...
teamBowlingScorecardAllOppnAllMatchesMain(nz_matches,"New Zealand")
## Source: local data frame [51 x 5]
## 
##            bowler overs maidens  runs wickets
##            (fctr) (int)   (int) (dbl)   (dbl)
## 1        KD Mills    50       1  3918     160
## 2      DL Vettori    43       1  3767     147
## 3      TG Southee    51       0  3996     134
## 4  MJ McClenaghan    49       0  2252      85
## 5        JDP Oram    46       0  2064      78
## 6     NL McCullum    46       0  2840      67
## 7         SE Bond    37       1  1449      62
## 8        TA Boult    40       3  1324      58
## 9     CJ Anderson    41       0  1297      52
## 10       MJ Henry    41       0  1098      47
## ..            ...   ...     ...   ...     ...
teamBowlingScorecardAllOppnAllMatchesMain(sl_matches,"Sri Lanka")
## Source: local data frame [54 x 5]
## 
##             bowler overs maidens  runs wickets
##             (fctr) (int)   (int) (dbl)   (dbl)
## 1       SL Malinga    51       0  7214     281
## 2  KMDN Kulasekara    51       0  5481     179
## 3       BAW Mendis    47       0  2979     135
## 4      NLTC Perera    48       0  3624     129
## 5   M Muralitharan    45       0  2471     114
## 6       AD Mathews    51       0  3394     113
## 7       TM Dilshan    50       0  3049      73
## 8     CRD Fernando    51       1  2067      73
## 9     HMRKB Herath    41       0  2027      71
## 10     MF Maharoof    48       0  1860      70
## ..             ...   ...     ...   ...     ...
teamBowlingScorecardAllOppnAllMatchesMain(wi_matches,"West Indies")
## Source: local data frame [45 x 5]
## 
##        bowler overs maidens  runs wickets
##        (fctr) (int)   (int) (dbl)   (dbl)
## 1    DJ Bravo    51       0  4239     153
## 2   JE Taylor    50       0  2530     103
## 3   R Rampaul    46       1  2608     102
## 4   KAJ Roach    49       0  2500      98
## 5   SP Narine    47       0  1924      82
## 6   DJG Sammy    51       1  3584      79
## 7  AD Russell    48       0  1987      63
## 8    CH Gayle    38       0  1955      53
## 9   JO Holder    44       0  1542      50
## 10 MN Samuels    38       0  2209      48
## ..        ...   ...     ...   ...     ...

18. Team bowling scorecard against all opposition (continued)

The function lists the top bowlers of a country (‘matches’) against the opposition country

# Best Indian bowlers in matches against Australia
teamBowlingScorecardAllOppnAllMatches(ind_matches,'Australia')
## Source: local data frame [36 x 5]
## 
##             bowler overs maidens  runs wickets
##             (fctr) (int)   (int) (dbl)   (dbl)
## 1         I Sharma    44       1   739      26
## 2  Harbhajan Singh    40       0   926      25
## 3        IK Pathan    42       1   702      22
## 4         UT Yadav    37       2   606      18
## 5      S Sreesanth    34       0   454      18
## 6        RA Jadeja    39       0   867      16
## 7           Z Khan    33       1   500      15
## 8         R Ashwin    43       0   684      14
## 9          P Kumar    27       0   501      14
## 10   R Vinay Kumar    31       1   380      14
## ..             ...   ...     ...   ...     ...
# Best Australian bowlers in matches against India
teamBowlingScorecardAllOppnAllMatches(aus_matches,'India')
## Source: local data frame [39 x 5]
## 
##         bowler overs maidens  runs wickets
##         (fctr) (int)   (int) (dbl)   (dbl)
## 1   MG Johnson    47       0  1020      44
## 2        B Lee    41       3   671      28
## 3    SR Watson    36       1   532      18
## 4     CJ McKay    37       1   403      18
## 5      GB Hogg    33       0   427      17
## 6  JP Faulkner    26       0   598      16
## 7     JR Hopes    31       0   346      14
## 8   NW Bracken    35       1   429      13
## 9  JW Hastings    27       2   259      13
## 10    MA Starc    26       0   251      13
## ..         ...   ...     ...   ...     ...
# Best New Zealand bowlers in matches against England
teamBowlingScorecardAllOppnAllMatches(nz_matches,'England')
## Source: local data frame [33 x 5]
## 
##            bowler overs maidens  runs wickets
##            (fctr) (int)   (int) (dbl)   (dbl)
## 1      TG Southee    39       2   684      33
## 2      DL Vettori    27       1   561      28
## 3        KD Mills    27       0   742      24
## 4  MJ McClenaghan    25       1   515      20
## 5    JEC Franklin    23       0   418      12
## 6         SE Bond    16       0   205      12
## 7      GD Elliott    10       3   194      12
## 8       SB Styris     8       0   296       9
## 9     NL McCullum    24       0   425       7
## 10     MJ Santner    18       0   230       7
## ..            ...   ...     ...   ...     ...
# Best Sri Lankan bowlers in matches against West Indies
teamBowlingScorecardAllOppnAllMatches(sl_matches,"West Indies")
## Source: local data frame [24 x 5]
## 
##             bowler overs maidens  runs wickets
##             (fctr) (int)   (int) (dbl)   (dbl)
## 1       SL Malinga    28       1   280      14
## 2       BAW Mendis    15       0   267       9
## 3  KMDN Kulasekara    13       1   185       8
## 4       AD Mathews    14       0   191       7
## 5   M Muralitharan    20       1   157       6
## 6      MF Maharoof     9       2    14       6
## 7       WPUJC Vaas     7       2    82       5
## 8       RAS Lakmal     7       0    55       5
## 9     HMRKB Herath    10       1   124       4
## 10   ST Jayasuriya     1       0    38       4
## ..             ...   ...     ...   ...     ...

19. Team Bowlers versus Batsmen (against all oppositions)

The functions below give the performance of bowlers versus batsmen. They give the best bowlers, the total runs conceded and against whom the runs were conceded

# Best bowlers overall from India against all opposition (rank=0)
teamBowlersVsBatsmenAllOppnAllMatchesMain(ind_matches,theTeam="India",rank=0)
## Source: local data frame [10 x 2]
## 
##             bowler  runs
##             (fctr) (dbl)
## 1        RA Jadeja  4691
## 2         R Ashwin  4111
## 3  Harbhajan Singh  3858
## 4           Z Khan  3514
## 5         I Sharma  3100
## 6          P Kumar  2646
## 7     Yuvraj Singh  2542
## 8        IK Pathan  2359
## 9         UT Yadav  2343
## 10        MM Patel  2314
# Top ODI bowler of India and runs conceded against different opposition batsmen (rank=1)
m <-teamBowlersVsBatsmenAllOppnAllMatchesMain(ind_matches,theTeam="India",rank=1)
m
## Source: local data frame [207 x 3]
## Groups: bowler [1]
## 
##       bowler          batsman runsConceded
##       (fctr)           (fctr)        (dbl)
## 1  RA Jadeja    KC Sangakkara          172
## 2  RA Jadeja DPMD Jayawardene          117
## 3  RA Jadeja       TM Dilshan          108
## 4  RA Jadeja     LD Chandimal          103
## 5  RA Jadeja        GJ Bailey           99
## 6  RA Jadeja      LRPL Taylor           95
## 7  RA Jadeja          IR Bell           94
## 8  RA Jadeja    KS Williamson           92
## 9  RA Jadeja   AB de Villiers           90
## 10 RA Jadeja        SR Watson           85
## ..       ...              ...          ...
# Top ODI bowler of India and runs conceded against different opposition batsmen (rank=2)
m <-teamBowlersVsBatsmenAllOppnAllMatchesMain(ind_matches,theTeam="India",rank=2)
m
## Source: local data frame [177 x 3]
## Groups: bowler [1]
## 
##      bowler          batsman runsConceded
##      (fctr)           (fctr)        (dbl)
## 1  R Ashwin        GJ Bailey          132
## 2  R Ashwin    KC Sangakkara          117
## 3  R Ashwin          AN Cook          115
## 4  R Ashwin    KS Williamson          114
## 5  R Ashwin         DM Bravo          111
## 6  R Ashwin       AD Mathews          100
## 7  R Ashwin     LD Chandimal           98
## 8  R Ashwin      LRPL Taylor           93
## 9  R Ashwin DPMD Jayawardene           93
## 10 R Ashwin     KP Pietersen           81
## ..      ...              ...          ...

20. Team Bowlers versus Batsmen (against all oppositions, continued)

# Top bowlers versus batsmen of South Africa(rank=0)
teamBowlersVsBatsmenAllOppnAllMatchesMain(sa_matches,theTeam="South Africa",rank=0)
## Source: local data frame [10 x 2]
## 
##         bowler  runs
##         (fctr) (dbl)
## 1     DW Steyn  4116
## 2     M Morkel  3808
## 3      J Botha  2244
## 4  LL Tsotsobe  2147
## 5    JP Duminy  2111
## 6  Imran Tahir  2087
## 7    JH Kallis  2014
## 8   WD Parnell  1864
## 9    R McLaren  1863
## 10 RJ Peterson  1842
# Top bowlers versus batsmen of Pakistan(rank=0)
teamBowlersVsBatsmenAllOppnAllMatchesMain(pak_matches,theTeam="Pakistan",rank=0)
## Source: local data frame [10 x 2]
## 
##             bowler  runs
##             (fctr) (dbl)
## 1    Shahid Afridi  6444
## 2      Saeed Ajmal  3956
## 3         Umar Gul  3901
## 4  Mohammad Hafeez  3434
## 5       Wahab Riaz  2755
## 6   Mohammad Irfan  2399
## 7    Sohail Tanvir  2337
## 8     Shoaib Malik  2105
## 9      Junaid Khan  1974
## 10  Iftikhar Anjum  1626
# Top bowler of Sri Lanka versus opposition batsmen (rank=1). This is SL Malinga
teamBowlersVsBatsmenAllOppnAllMatchesMain(sl_matches,theTeam="Sri Lanka",rank=1)
## Source: local data frame [314 x 3]
## Groups: bowler [1]
## 
##        bowler         batsman runsConceded
##        (fctr)          (fctr)        (dbl)
## 1  SL Malinga Mohammad Hafeez          191
## 2  SL Malinga         V Kohli          175
## 3  SL Malinga       G Gambhir          170
## 4  SL Malinga        MS Dhoni          144
## 5  SL Malinga      Umar Akmal          142
## 6  SL Malinga        V Sehwag          140
## 7  SL Malinga         IR Bell          134
## 8  SL Malinga    SR Tendulkar          133
## 9  SL Malinga   Ahmed Shehzad          121
## 10 SL Malinga         AN Cook          120
## ..        ...             ...          ...

21. Team bowlers versus batsmen report (all oppositions)

#Top bowlers of other countries against India
teamBowlersVsBatsmenAllOppnAllMatchesRept(matches=ind_matches,theTeam="India",rank=0)
## Source: local data frame [10 x 2]
## 
##             bowler  runs
##             (fctr) (dbl)
## 1  KMDN Kulasekara  1448
## 2       SL Malinga  1319
## 3      NLTC Perera   959
## 4      JM Anderson   954
## 5       MG Johnson   939
## 6        SCJ Broad   877
## 7       BAW Mendis   783
## 8       AD Mathews   776
## 9          ST Finn   751
## 10      TT Bresnan   741
# Best performer against India is KMDN Kulasekara of Sri Lanka in ODIs
a <- teamBowlersVsBatsmenAllOppnAllMatchesRept(ind_matches,theTeam="India",rank=1)
a
## Source: local data frame [31 x 3]
## Groups: bowler [1]
## 
##             bowler      batsman runsConceded
##             (fctr)       (fctr)        (dbl)
## 1  KMDN Kulasekara     V Sehwag          199
## 2  KMDN Kulasekara      V Kohli          196
## 3  KMDN Kulasekara    G Gambhir          157
## 4  KMDN Kulasekara SR Tendulkar          127
## 5  KMDN Kulasekara Yuvraj Singh          118
## 6  KMDN Kulasekara    RG Sharma          114
## 7  KMDN Kulasekara     SK Raina          104
## 8  KMDN Kulasekara     MS Dhoni           80
## 9  KMDN Kulasekara   KD Karthik           56
## 10 KMDN Kulasekara   SC Ganguly           51
## ..             ...          ...          ...

22. Team bowlers versus batsmen report (all oppositions, continued)

#Top Indian bowlers against Sri Lanka 
teamBowlersVsBatsmenAllOppnAllMatchesRept(matches=ind_matches,theTeam="Sri Lanka",rank=0)
## Source: local data frame [10 x 2]
## 
##             bowler  runs
##             (fctr) (dbl)
## 1           Z Khan  1141
## 2        RA Jadeja   882
## 3         I Sharma   855
## 4  Harbhajan Singh   805
## 5          P Kumar   758
## 6         R Ashwin   740
## 7        IK Pathan   678
## 8          A Nehra   584
## 9         UT Yadav   544
## 10        MM Patel   488
#Top Indian bowlers against England
teamBowlersVsBatsmenAllOppnAllMatchesRept(ind_matches,"England",rank=0)
## Source: local data frame [10 x 2]
## 
##          bowler  runs
##          (fctr) (dbl)
## 1      R Ashwin   777
## 2     RA Jadeja   735
## 3        Z Khan   507
## 4      MM Patel   463
## 5      RP Singh   410
## 6      I Sharma   396
## 7     PP Chawla   375
## 8  Yuvraj Singh   370
## 9       B Kumar   353
## 10   AB Agarkar   336

23. Team bowlers versus batsmen report (all oppositions, continued-1)

#Top ODI opposition bowlers against New Zealand
teamBowlersVsBatsmenAllOppnAllMatchesRept(nz_matches,theTeam="New Zealand",rank=0)
## Source: local data frame [10 x 2]
## 
##             bowler  runs
##             (fctr) (dbl)
## 1      JM Anderson   889
## 2       MG Johnson   828
## 3    Shahid Afridi   751
## 4  KMDN Kulasekara   728
## 5        SCJ Broad   638
## 6       NW Bracken   626
## 7       SL Malinga   601
## 8         DW Steyn   556
## 9          ST Finn   482
## 10       SR Watson   468
# Top ODI opposition bowlers against Australia
teamBowlersVsBatsmenAllOppnAllMatchesRept(aus_matches,"Australia",rank=0)
## Source: local data frame [10 x 2]
## 
##             bowler  runs
##             (fctr) (dbl)
## 1      JM Anderson  1211
## 2       TT Bresnan  1087
## 3       SL Malinga  1078
## 4        SCJ Broad   948
## 5  Harbhajan Singh   890
## 6       DL Vettori   883
## 7  KMDN Kulasekara   875
## 8         DW Steyn   872
## 9        RA Jadeja   853
## 10        DJ Bravo   830
# Top ODI bowlers against Sri Lanka
teamBowlersVsBatsmenAllOppnAllMatchesRept(sl_matches,"Sri Lanka",rank=0)
## Source: local data frame [10 x 2]
## 
##             bowler  runs
##             (fctr) (dbl)
## 1    Shahid Afridi  1177
## 2           Z Khan  1141
## 3        RA Jadeja   882
## 4         I Sharma   855
## 5      Saeed Ajmal   814
## 6  Harbhajan Singh   805
## 7  Mohammad Hafeez   774
## 8          P Kumar   758
## 9         R Ashwin   740
## 10        Umar Gul   718

24. Team bowlers versus batsmen report (all oppositions) plot

This function can only be used for rank>0 (rank=1,2,3..)

# Top ODI bowler against India (KMDN Kulasekara)
df <- teamBowlersVsBatsmenAllOppnAllMatchesRept(ind_matches,theTeam="India",rank=1)
teamBowlersVsBatsmenAllOppnAllMatchesPlot(df,"India","India")

bowlerVsbatsmen1-1

# Top ODI Indian bowler versus England (R Ashwin)
df <- teamBowlersVsBatsmenAllOppnAllMatchesRept(ind_matches,theTeam="England",rank=1)
teamBowlersVsBatsmenAllOppnAllMatchesPlot(df,"India","England")

bowlerVsbatsmen1-2

#Top ODI Indian bowler against West Indies (RA Jadeja)
df <- teamBowlersVsBatsmenAllOppnAllMatchesRept(ind_matches,theTeam="West Indies",rank=1)
teamBowlersVsBatsmenAllOppnAllMatchesPlot(df,"India","West Indies")

bowlerVsbatsmen1-3

25. Team bowlers versus batsmen plot (all oppositions)

#Top bowler against South Africa (Shahid Afridi)
df <- teamBowlersVsBatsmenAllOppnAllMatchesRept(sa_matches,theTeam="South Africa",rank=1)
teamBowlersVsBatsmenAllOppnAllMatchesPlot(df,"South Africa","South Africa")

bowlerVsbatsmen2-1

# Top  bowler versus Pakistan (SL Malinga)
df <- teamBowlersVsBatsmenAllOppnAllMatchesRept(pak_matches,theTeam="Pakistan",rank=1)
teamBowlersVsBatsmenAllOppnAllMatchesPlot(df,"Pakistan","Pakistan")

bowlerVsbatsmen2-2

26. Team Bowler Wicket Kind

# Top opposition bowlers against India and the kind of wickets
teamBowlingWicketKindAllOppnAllMatches(ind_matches,t1="India",t2="All")

bowlingWicketkind1-1

# Get the data frame. Do not plot
m <-teamBowlingWicketKindAllOppnAllMatches(ind_matches,t1="India",t2="All",plot=FALSE)
m
## Source: local data frame [34 x 3]
## Groups: bowler [?]
## 
##         bowler        wicketKind     m
##         (fctr)             (chr) (int)
## 1   MG Johnson            bowled     8
## 2   MG Johnson            caught    27
## 3   MG Johnson caught and bowled     1
## 4   MG Johnson               lbw     6
## 5   MG Johnson           run out     2
## 6  JM Anderson            bowled     4
## 7  JM Anderson            caught    25
## 8  JM Anderson               lbw     1
## 9  JM Anderson           run out     3
## 10     ST Finn            bowled    10
## ..         ...               ...   ...
# Best Indian bowlers against South Africa
teamBowlingWicketKindAllOppnAllMatches(ind_matches,t1="India",t2="South Africa")

bowlingWicketkind1-2

# Best Indian bowlers against Pakistan
teamBowlingWicketKindAllOppnAllMatches(ind_matches,t1="India",t2="Pakistan")

bowlingWicketkind1-3

27. Team Bowler Wicket Kind (continued)

# Best ODI opposition bowlers against  England
teamBowlingWicketKindAllOppnAllMatches(eng_matches,t1="England",t2="All")

bowlingWicketkind2-1

# Best ODI opposition bowlers against Australia
teamBowlingWicketKindAllOppnAllMatches(aus_matches,t1="Australia",t2="All")

bowlingWicketkind2-2

# Best bowlers against Sri Lanka
teamBowlingWicketKindAllOppnAllMatches(sl_matches,t1="Sri Lanka",t2="All")

bowlingWicketkind2-3

28. Team Bowler Wicket Runs

# Opposition bowlers against India and runs conceded
teamBowlingWicketRunsAllOppnAllMatches(ind_matches,t1="India",t2="All",plot=TRUE)

bowlingWicketRuns1-1

# Opposition bowlers against India and runs conceded returned as dataframe
m <-teamBowlingWicketRunsAllOppnAllMatches(ind_matches,t1="India",t2="All",plot=FALSE)
m
## Source: local data frame [10 x 3]
## 
##             bowler runsConceded wickets
##             (fctr)        (dbl)   (dbl)
## 1       MG Johnson         1020      44
## 2  KMDN Kulasekara         1492      40
## 3         DW Steyn          714      34
## 4       BAW Mendis          810      34
## 5      JM Anderson          991      33
## 6       SL Malinga         1402      33
## 7       AD Mathews          800      31
## 8          ST Finn          775      30
## 9      NLTC Perera          983      30
## 10       SCJ Broad          903      29
# Top Indian bowlers against Australia and runs conceded
teamBowlingWicketRunsAllOppnAllMatches(ind_matches,t1="India",t2="Australia",plot=TRUE)

bowlingWicketRuns1-2

29. Team Bowler Wicket Runs (continued)

#Top opposition bowlers against Pakistan
teamBowlingWicketRunsAllOppnAllMatches(pak_matches,t1="Pakistan",t2="All",plot=TRUE)

bowlingWicketRuns2-1

#Top opposition bowlers against West Indies
teamBowlingWicketRunsAllOppnAllMatches(wi_matches,t1="West Indies",t2="All",plot=TRUE)

bowlingWicketRuns2-2

#Top opposition bowlers against Sri Lanka
teamBowlingWicketRunsAllOppnAllMatches(sl_matches,t1="Sri Lanka",t2="All",plot=TRUE)

bowlingWicketRuns2-3

#Top opposition bowlers against New Zealand
teamBowlingWicketRunsAllOppnAllMatches(nz_matches,t1="New Zealand",t2="All",plot=TRUE)

bowlingWicketRuns2-4

Conclusion

This post included all the functions for a team in all matches against all oppositions. As before, the data frames are already available; you can load the data and begin to use them. If you can draw more insights from the data frames, do go ahead, but please attribute the source to Cricsheet (http://cricsheet.org), my package yorkr and my blog. Do give the functions a spin for yourself.
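To get going quickly, here is a minimal sketch of loading a saved match data frame and invoking one of the functions above. The .RData file name and the object it contains are my assumptions; use the names under which you saved the data in the earlier parts of this series.

library(yorkr)
# Load previously saved match data (file/object names are illustrative;
# use the names you saved in the earlier parts of this series)
load("allMatchesAllOpposition-India.RData")
ind_matches <- matches
# Any of the functions above can now be applied, e.g.
teamBowlingScorecardAllOppnAllMatchesMain(ind_matches, theTeam = "India")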

I will be coming up with the last part of my introduction to the cricket package yorkr soon.

Watch this space!

Important note: Do check out my other posts using yorkr at yorkr-posts

You may also like

  1. Introducing cricketr! : An R package to analyze performances of cricketers
  2. Cricket analytics with cricketr
  3. Literacy in India: A deepR dive
  4. Simulating an Edge shape in Android
  5. Re-working the Lucy Richardson algorithm in OpenCV
  6. Design principles of scalable distributed systems
  7. TWS-4: Gossip protocol: Epidemics and rumors to the rescue

Simplifying ML: Impact of degree of polynomial on bias & variance and other insights

This post takes off from my earlier post Simplifying Machine Learning: Bias, variance, regularization and odd facts- Part 4. As discussed earlier, a poor hypothesis function can either underfit or overfit the data. If the number of features selected is small, of the order of 1 or 2, then we can plot the data and try to determine how well the hypothesis function fits it. We can also see whether the function is capable of predicting output target values for new data.

However, if the number of features is large, say of the order of tens of features, then there needs to be a method by which one can determine whether the learned hypothesis is a 'just right' fit for all the data.

Check out my book ‘Deep Learning from first principles Second Edition- In vectorized Python, R and Octave’. My book is available on Amazon as paperback ($18.99) and in kindle version ($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($12.99) and Kindle($9.99/Rs449) versions.

 

The following technique can be used to determine the 'goodness' of a hypothesis, i.e. how well the hypothesis fits the data and whether it can also generalize to new examples not in the training set.

Several insights on how to evaluate a hypothesis are given below

Consider a hypothesis function

hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x^2 + Ɵ3x^3 + Ɵ4x^4

a1

The above hypothesis does not generalize well enough for new examples in the data set.

Let us assume that there are 100 training examples. Instead of using the entire set of 100 examples to learn the hypothesis function, the data set is divided into a training set and a test set in a 70%:30% ratio.

The hypothesis is learned from the training set. The learned hypothesis is then checked against the 30% test set data to determine whether the hypothesis is able to generalize on the test set also.

This is done by determining the error when the hypothesis is used against the test set.

For linear regression the error is computed as the average mean square error of the predicted output value against the actual value. The test set error is computed as follows

Jtest(Ɵ) = 1/(2mtest) Σ (hƟ(xtest(i)) – ytest(i))^2
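As an illustration, here is a minimal R sketch of the 70:30 split and the test set error, computed on simulated data (the data set and the use of lm() are my own assumptions for the example)

# Minimal sketch of the 70:30 split and the test set error on simulated data
set.seed(42)
df <- data.frame(x = runif(100, 0, 10))
df$y <- 2 + 3 * df$x + rnorm(100)                   # linear relation with noise
trainIdx <- sample(nrow(df), 0.7 * nrow(df))        # 70% of the rows for training
train <- df[trainIdx, ]
test  <- df[-trainIdx, ]
fit <- lm(y ~ x, data = train)                      # learn the hypothesis on the training set
pred <- predict(fit, newdata = test)                # h(x) on the 30% test set
Jtest <- sum((pred - test$y)^2) / (2 * nrow(test))  # average mean square error as above
Jtest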

For logistic regression the test set error is similarly determined as

Jtest(Ɵ) = -1/mtest Σ [ ytest(i) * log(hƟ(xtest(i))) + (1 – ytest(i)) * log(1 – hƟ(xtest(i))) ]
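A similar minimal sketch for the logistic regression test set error, again on simulated data, using glm() with a binomial family (the data set is an assumption for the example)

# Minimal sketch of the logistic regression test set error on simulated data
set.seed(7)
df <- data.frame(x = runif(200, -3, 3))
df$y <- rbinom(200, 1, 1 / (1 + exp(-2 * df$x)))      # binary target
trainIdx <- sample(nrow(df), 0.7 * nrow(df))
train <- df[trainIdx, ]
test  <- df[-trainIdx, ]
fit <- glm(y ~ x, data = train, family = binomial)    # logistic regression
p <- predict(fit, newdata = test, type = "response")  # h(x) = P(y = 1 | x)
Jtest <- -mean(test$y * log(p) + (1 - test$y) * log(1 - p))
Jtest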

The idea is that the test set error should be as low as possible.

Model selection

A typical problem in determining the hypothesis is to choose the degree of the polynomial or to choose an appropriate model for the hypothesis

One method is to consider 10 polynomial models, one for each degree ‘d’ from 1 to 10

  1. hƟ(x) = Ɵ0 + Ɵ1x
  2. hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x^2
  3. hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x^2 + Ɵ3x^3
  …
  10. hƟ(x) = Ɵ0 + Ɵ1x + … + Ɵ10x^10

Here ‘d’ is the degree of the polynomial. Train all the 10 models, run each model’s hypothesis against the test set, and then choose the model with the smallest error cost.

While this appears to be a good technique to choose the best fit hypothesis, in reality it is not so. The reason is that the hypothesis chosen is based on the best fit, i.e. the least error, for the test data. However this does not generalize well for examples that were in neither the training set nor the test set.

So the correct method is to divide the data into 3 sets in the ratio 60:20:20, where 60% is the training set, 20% is the cross-validation set used to determine the best fit, and the remaining 20% is the test set used to measure how well the chosen model generalizes.

The steps carried out against the data are as follows (a minimal R sketch is given after the list)

  1. Train all 10 models against the training set (60%)
  2. Compute the cost value J against the cross-validation set (20%)
  3. Determine the lowest cost model
  4. Use this model against the test set and determine the generalization error.
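Here is a minimal R sketch of these 4 steps on simulated data, trying polynomial degrees 1 to 10 (the simulated data set and the lm()/poly() models are assumptions for the illustration)

# Minimal sketch of the 4 steps over polynomial degrees 1..10 on simulated data
set.seed(21)
df <- data.frame(x = runif(300, -2, 2))
df$y <- 1 + df$x - 2 * df$x^2 + rnorm(300, sd = 0.5)
idx   <- sample(nrow(df))
train <- df[idx[1:180], ]                        # 60% training set
cv    <- df[idx[181:240], ]                      # 20% cross-validation set
test  <- df[idx[241:300], ]                      # 20% test set
J <- function(fit, d) {                          # average mean square error
  sum((predict(fit, newdata = d) - d$y)^2) / (2 * nrow(d))
}
fits  <- lapply(1:10, function(deg) lm(y ~ poly(x, deg), data = train))  # step 1
cvErr <- sapply(fits, J, d = cv)                 # step 2: cost on the CV set
best  <- which.min(cvErr)                        # step 3: lowest cost model
J(fits[[best]], test)                            # step 4: generalization error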

Degree of the polynomial versus bias and variance

How does the degree of the polynomial affect the bias and variance of a hypothesis?

Clearly for a given training set when the degree is low the hypothesis will underfit the data and there will be a high bias error. However when the degree of the polynomial is high then the fit will get better and better on the training set (Note: This does not imply a good generalization)

We run all the models with different polynomial degrees on the cross validation set. What we will observe is that when the degree of the polynomial is low then the error will be high. This error will decrease as the degree of the polynomial increases as we will tend to get a better fit. However the error will again increase as higher degree polynomials that overfit the training set will be a poor fit for the cross validation set.

This is shown below

a2

Effect of regularization on bias & variance

Here is the technique to choose the optimum value for the regularization parameter λ

When λ is small the Ɵi values are large and we tend to overfit the data set. Hence the training error will be low but the cross validation error will be high. However when λ is large the values of Ɵi become negligible, almost reducing the hypothesis to a straight line. This will underfit the data and result in a high training error and a high cross validation error. Hence the chosen value of λ should be the one for which the cross validation error is the lowest, as shown below (a minimal R sketch follows the plot)

a3
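A minimal R sketch of this procedure: ridge-regularized fits are computed over a grid of λ values via the normal equation, and the λ with the lowest cross validation error is picked (the simulated data and the λ grid are assumptions)

# Minimal sketch: ridge fits over a grid of λ values via the normal equation
set.seed(5)
x <- runif(200, -2, 2)
y <- 1 + x - 2 * x^2 + rnorm(200, sd = 0.5)
X <- cbind(1, poly(x, 6, raw = TRUE))              # intercept + powers x..x^6
idx <- sample(200); tr <- idx[1:140]; cvi <- idx[141:200]
ridge <- function(X, y, lambda) {
  P <- diag(ncol(X)); P[1, 1] <- 0                 # do not penalize Ɵ0
  solve(t(X) %*% X + lambda * P, t(X) %*% y)
}
J <- function(theta, X, y) sum((X %*% theta - y)^2) / (2 * length(y))
lambdas <- c(0, 0.01, 0.1, 1, 10, 100)
cvErr <- sapply(lambdas, function(l) J(ridge(X[tr, ], y[tr], l), X[cvi, ], y[cvi]))
lambdas[which.min(cvErr)]                          # λ with the lowest CV error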

Plotting learning curves

This is another technique to identify if the learned hypothesis has a high bias or a high variance based on the number of training examples

A high bias indicates an underfit. When the number of samples in the training set is low, the training error will be low, since it is easy to fit a hypothesis to a few training examples, but the cross validation error will be high. As the number of samples increases, the error will increase for the training set and will decrease for the cross validation set. However for a high bias, or underfit, after a certain point increasing the number of samples will not change the error; both errors flatten out at a high value. This is the case of high bias

a4

In the case of high variance, where a high degree polynomial is used for the hypothesis, the training error will be low for a small number of training examples. As the number of training examples increases the training error will increase slowly. The cross validation error will be high for a small number of training samples but will slowly decrease as the number of samples grows, since the hypothesis learns better. Hence for the case of high variance, increasing the number of samples in the training set will decrease the gap between the cross validation and the training error, as shown below (a minimal R sketch for generating such learning curves follows the plot)

a5
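Here is a minimal R sketch for generating such learning curves: a model is trained on increasingly larger subsets of the training set and the training and cross validation errors are recorded (the simulated data and the cubic model are assumptions)

# Minimal sketch of learning curves: train on growing subsets, record both errors
set.seed(11)
df <- data.frame(x = runif(300, -2, 2))
df$y <- sin(2 * df$x) + rnorm(300, sd = 0.3)
idx <- sample(300)
train <- df[idx[1:200], ]; cv <- df[idx[201:300], ]
J <- function(fit, d) sum((predict(fit, newdata = d) - d$y)^2) / (2 * nrow(d))
sizes <- seq(10, 200, by = 10)
errs <- t(sapply(sizes, function(m) {
  fit <- lm(y ~ poly(x, 3), data = train[1:m, ])   # cubic hypothesis
  c(train = J(fit, train[1:m, ]), cv = J(fit, cv))
}))
matplot(sizes, errs, type = "l", lty = 1, xlab = "training set size", ylab = "error")
legend("topright", legend = c("Jtrain", "Jcv"), lty = 1, col = 1:2)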

Note: This post, like my previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng

Also see
1. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
2. My book ‘Deep Learning from first principles: Second Edition’ now on Amazon
3. The Clash of the Titans in Test and ODI cricket
4. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
5. Latency, throughput implications for the Cloud
6. Simulating a Web Joint in Android
7. Pitching yorkpy … short of good length to IPL – Part 1

Simplifying Machine Learning: Bias, Variance, regularization and odd facts – Part 4

In both linear and logistic regression the choice of the degree of the polynomial for the hypothesis function is extremely critical. A low degree for the polynomial can result in an underfit, while a very high degree can overfit the data as shown below

41

In the figure on the left the data is underfit, as we try to fit it with a first order polynomial, which is a straight line. This is a case of strong ‘bias’.

In the rightmost figure a much higher degree polynomial is used. All the data points are covered by the polynomial curve, but it is not effective in predicting other values. This is a case of overfitting, or high variance.

The middle figure is just right, as it intuitively fits the data points in the best possible way.

A similar problem exists with logistic regression as shown below

42

There are 2 ways to handle overfitting

a)      Reducing the number of features selected

b)      Using regularization

In regularization the magnitude of the parameters Ɵ is decreased to reduce the effect of overfitting

Hence if we choose a hypothesis function

hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x^2 + Ɵ3x^3 + Ɵ4x^4

 

The cost function for this, without regularization, as mentioned in earlier posts, is

J(Ɵ) = 1/(2m) Σ (hƟ(x(i)) – y(i))^2

where the key is to minimize the above function to obtain the least error

The cost function with regularization becomes

J(Ɵ) = 1/(2m) [ Σ (hƟ(x(i)) – y(i))^2 + λ Σ Ɵj^2 ]

 

As can be seen, regularization now adds a penalty term λ Σ Ɵj^2 to the cost function, which needs to be minimized.

Hence with the regularization factor the problem of overfitting can be addressed (a minimal R sketch of the regularized cost follows the plot below)

43
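In R the regularized linear regression cost can be written as a small function; a minimal sketch, assuming X includes a leading column of 1s and that Ɵ0 is, by convention, left out of the penalty

# Minimal sketch of the regularized linear regression cost J(Ɵ)
# X is assumed to include a leading column of 1s; Ɵ0 is not penalized
Jreg <- function(theta, X, y, lambda) {
  m <- length(y)
  (sum((X %*% theta - y)^2) + lambda * sum(theta[-1]^2)) / (2 * m)
}
# Toy example
X <- cbind(1, 1:5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
Jreg(c(0, 2), X, y, lambda = 1)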

However the trick is to determine the value of λ. If λ is too big it will result in underfitting, i.e. a high bias.

Similarly the regularized equation for logistic regression is as shown below

J(Ɵ) = -1/m Σ [ y(i) * log(hƟ(x(i))) + (1 – y(i)) * log(1 – hƟ(x(i))) ] + λ/(2m) Σ Ɵj^2
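A corresponding minimal R sketch for the regularized logistic regression cost, under the same assumptions about X and Ɵ0

# Minimal sketch of the regularized logistic regression cost J(Ɵ)
# X includes a leading column of 1s; Ɵ0 is excluded from the penalty
sigmoid <- function(z) 1 / (1 + exp(-z))
JregLogistic <- function(theta, X, y, lambda) {
  m <- length(y)
  h <- sigmoid(X %*% theta)
  -sum(y * log(h) + (1 - y) * log(1 - h)) / m + lambda / (2 * m) * sum(theta[-1]^2)
}
# Toy example
X <- cbind(1, c(-2, -1, 1, 2))
y <- c(0, 0, 1, 1)
JregLogistic(c(0, 1), X, y, lambda = 1)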

Some tips suggested by Prof Andrew Ng while determining the parameters and features for regression

a)      Get as many training examples as possible; it is worth spending the effort to obtain more examples

b)      Add additional features

c)      Observe changes to the learning algorithm with different values of λ

This post is continued in my next post – Simplifying ML: Impact of degree of polynomial on bias, variance and other insights

Note: This post, in line with my previous posts on Machine Learning,  is based on the Coursera course on Machine Learning by Professor Andrew Ng


Find me on Google+

Simplifying ML: Neural networks- Part 3

Neural networks try to overcome the shortcomings of logistic regression, in which we have to choose a non-linear hypothesis. Logistic regression requires that we choose an appropriate combination of polynomial terms and the order of the equation. The problem with this is that we sometimes tend to either overfit or underfit. Neural networks provide the ability to learn new model parameters from the basic raw features.

The neural network is modeled on the networking ability of neurons in the human brain. The brain is made of trillions of neurons. Each neuron is a processing unit which has several inputs (the dendrites) and an output (the axon). Neurons communicate through a combination of electrochemical signals at the synapses, the spaces between neurons.

neuron

A neural network mimics the working of the neuron.

So in a neural network the features of the problem serve as inputs. For example, in the case of determining whether a mail is spam or not, the features could be the words in the subject line, the from address, the contents etc. Based on a combination of these features we need to classify whether the mail is spam or not.

31

The above diagram shows a simple neural network with features x1, x2, x3 and a bias unit x0

 

With a hypothesis function hƟ(x) = 1/(1 + e^(-x))

The edges from the features xi  are the model parameters Ɵ. In other words the edges represent weights.

A typical neural network is a network of many logistic units organized in layers. The output of each layer forms the input to the next subsequent layer. This is shown below

32

As can be seen, in a multi-layer neural network the features x1, x2, …, xn are at the left.

These become the activation units of the next layer. The key advantage of neural networks over regular logistic regression is that the outputs of one layer, the learned activations, become the inputs to the next layer, which learns the model parameters more finely. Hence this gives a better fit for the combination of parameters.

The activation parameters at the next layer are

a1(2) = g(Ɵ10(1)x0 + Ɵ11(1)x1 + Ɵ12(1)x2 + Ɵ13(1)x3), where g is the logistic function or the sigmoid function discussed in my previous post Simplifying ML: Logistic regression – Part 2

33

Here a1(2) is the activation of unit 1 in layer 2.

Ɵ10(1) is the model parameter at layer 1 and is the 0th parameter. Similarly Ɵ11(1) is the model parameter at layer 1 and is the 1st parameter, and so on.

Similarly the other activation parameters can be written as

a2(2) = g(Ɵ20(1)x0 + Ɵ21(1)x1 + Ɵ22(1)x2 + Ɵ23(1)x3)

a3(2) = g(Ɵ30(1)x0 + Ɵ31(1)x1 + Ɵ32(1)x2 + Ɵ33(1)x3)

hƟ(x) = a1(3) = g(Ɵ10(2)a0(2) + Ɵ11(2)a1(2) + Ɵ12(2)a2(2) + Ɵ13(2)a3(2))  – (A)

 

The crux of neural networks is that instead of creating a hypothesis based on the set of raw features, a neural network with multiple hidden layers can learn its own features. In equation (A) we can see that the hypothesis is not a function of the input raw features x1, x2, …, xn but of a new set of features, the activation units a1, a2, …, an. In other words the network has ‘learned’ its own features. A minimal R sketch of this forward propagation is given below.
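The sketch propagates an input through one hidden layer, as in equation (A); the layer sizes and random weights are purely illustrative.

# Minimal sketch of forward propagation through one hidden layer, as in equation (A)
sigmoid <- function(z) 1 / (1 + exp(-z))
forward <- function(x, Theta1, Theta2) {
  a1 <- c(1, x)                        # input layer with bias unit x0 = 1
  a2 <- c(1, sigmoid(Theta1 %*% a1))   # hidden layer activations, bias a0 = 1
  sigmoid(Theta2 %*% a2)               # output hƟ(x)
}
set.seed(1)
Theta1 <- matrix(rnorm(3 * 4), nrow = 3)   # 3 hidden units, 3 inputs + bias
Theta2 <- matrix(rnorm(1 * 4), nrow = 1)   # 1 output unit, 3 hidden units + bias
forward(c(0.5, -1, 2), Theta1, Theta2)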

As mentioned above the output of each layer is the logistic function or the sigmoid function

The beauty of neural networks based on logistic functions is that we can easily realize the equivalent of logic gates like AND, OR, NOT, NOR etc.

The hypothesis for the above network would be

34

hƟ(x) = g(-30 + 20 * x1 + 20 * x2)

So for x1= 0 and x2 = 0 we would have

hƟ(x) = g(-30 + 0 + 0) = g(-30)

Since g(-30) ≈ 0 (it lies far below g(0) = 0.5) the output is 0. Only when x1 = 1 and x2 = 1 do we get g(-30 + 20 + 20) = g(10) ≈ 1, so the network computes the AND function (a quick sketch after the figure below verifies this).

37
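The following quick R sketch verifies the truth table for these weights.

# Quick check of the AND gate weights (-30, 20, 20)
sigmoid <- function(z) 1 / (1 + exp(-z))
andGate <- function(x1, x2) round(sigmoid(-30 + 20 * x1 + 20 * x2))
for (x1 in 0:1) for (x2 in 0:1)
  cat(x1, "AND", x2, "=", andGate(x1, x2), "\n")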

Similarly a NOT gate can be constructed with a neural network as follows

35

38

Neural networks can also be used for multi class classification.

36

Hence there are multiple advantages to neural networks. Neural networks are amenable to

a) creating complex logic models from combinations of AND, NOT and OR gates

b) learning the model parameters from the raw features, which makes them more flexible.

Interest in neural networks surged in the 1980s and then waned. The neural networks of that time were similar to the above and were based on forward propagation. In recent times, however, backward propagation (backpropagation) has been used successfully in the area of research known as ‘deep learning’

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. A highly enjoyable and classic course!!!


Find me on Google+