Mirror, mirror … the best batsman of them all?

“Full many a gem of purest ray serene,
The dark unfathom’d caves of ocean bear.”
Thomas Gray – Elegy Written in a Country Churchyard

In this post I do a fine-grained analysis of the batting performances of cricketing icons from India and from the international scene to determine how they stack up against each other. I perform two separate analyses: 1) between Indian legends (Sunil Gavaskar, Sachin Tendulkar and Rahul Dravid), and 2) between contemporary cricketing stars (Brian Lara, Sachin Tendulkar, Ricky Ponting and AB De Villiers).

In the world, and more so in India, Tendulkar is probably placed on a higher pedestal than all other cricketers. I was curious to know how much of this adulation is justified. In “Zen and the Art of Motorcycle Maintenance”, Robert Pirsig mentions that while we cannot define Quality (in a book, music or painting), we usually know it when we see it. So do people see an ineffable quality in Tendulkar, or are they intuiting his greatness based on overall averages?

In this context, we need to keep in mind the warning that Daniel Kahneman highlights in his book, ‘Thinking, Fast and Slow’. Kahneman suggests that we should regard “statistical intuition with proper suspicion and replace impression formation by computation wherever possible”. This is because our minds usually detect patterns and associations even when none actually exist.

So this analysis tries to look deeper into these aspects by performing a detailed statistical analysis.

The data for all the batsmen has been taken from ESPN Cricinfo. The data is then cleaned to remove rows marked ‘DNB’ (did not bat), ‘TDNB’ (team did not bat) etc. before generating the graphs.
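The cleaning step can be sketched in C as follows. This is only an illustration under stated assumptions, not the code from the repository: the function names parse_runs and clean_runs are hypothetical, and treating a trailing '*' as a not-out marker is an assumption about how the scorecard entries are written.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns the runs in one scorecard entry, or -1 when the entry is not a
 * score (e.g. "DNB", "TDNB"). A trailing '*' (assumed to mark a not-out
 * innings) is ignored. Hypothetical helper, not part of the original code. */
int parse_runs(const char *entry)
{
    char *end;
    long runs;

    assert(entry != NULL);
    if (!isdigit((unsigned char)entry[0]))
        return -1;                  /* "DNB", "TDNB" and similar markers */
    runs = strtol(entry, &end, 10);
    if (*end == '*')
        end++;                      /* not out: keep the runs */
    return (*end == '\0') ? (int)runs : -1;
}

/* Copies the valid scores from entries[0..n) into out and returns how
 * many were kept. out must have room for n values. */
size_t clean_runs(const char *const *entries, size_t n, int *out)
{
    size_t i, kept = 0;

    for (i = 0; i < n; i++) {
        int runs = parse_runs(entries[i]);
        if (runs >= 0)
            out[kept++] = runs;
    }
    return kept;
}
```

Calling clean_runs on the entries "45", "DNB", "100*", "TDNB", "0" keeps three innings: 45, 100 and 0.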

The code, data and plots can be cloned or forked from GitHub at the following link: bestBatsman. You should be able to use the code as-is for any other batsman you choose.

Feel free to agree, disagree, dispute or argue with my analysis.

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of the performances of both batsmen and bowlers, besides evaluating team and match performances in Tests, ODIs, T20s and IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the Kindle versions. The books can be accessed at ‘Cricket analytics with cricketr’ and ‘Beaten by sheer pace - Cricket analytics with yorkr’. A must read for any cricket lover! Check it out!!

Important note: Do check out the Python avatar of cricketr, ‘cricpy’, in my post ‘Introducing cricpy: A python package to analyze performances of cricketers’.

The batting performance of each of the cricketers is described in 3 plots: a) Combined boxplot and histogram b) Runs frequency vs Runs plot c) Mean Strike Rate vs Runs plot

A) Batting performance of Sachin Tendulkar

a) Combined Boxplot and histogram of runs scored
srt-boxhist1

The above graph is a combined boxplot and histogram. The boxplot at the top shows the 1st quartile (25th percentile), which is the left side of the green rectangle, the 3rd quartile (75th percentile), which is the right side of the green rectangle, and the mean and the median. These values are also shown in the histogram below. The histogram gives the frequency of Runs scored by Tendulkar in each range (e.g. 0-10, 11-20, 21-30 etc.).
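The quartiles and median marked on the boxplot can be computed along these lines. This is a minimal sketch, assuming linear interpolation between the closest ranks. The plotting code may use a different quantile rule, and the name quantile is only illustrative.

```c
#include <assert.h>
#include <stdlib.h>

/* Orders doubles ascending for qsort. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Returns the p-th quantile (p in [0,1]) of v[0..n), using linear
 * interpolation between the two closest ranks. Sorts v in place.
 * p = 0.25, 0.50 and 0.75 give the 1st quartile, median and 3rd quartile. */
double quantile(double *v, size_t n, double p)
{
    double pos, frac;
    size_t lo;

    assert(n > 0 && p >= 0.0 && p <= 1.0);
    qsort(v, n, sizeof v[0], cmp_double);
    pos = p * (double)(n - 1);      /* fractional rank in the sorted data */
    lo = (size_t)pos;
    frac = pos - (double)lo;
    if (lo + 1 >= n)
        return v[n - 1];
    return v[lo] + frac * (v[lo + 1] - v[lo]);
}
```

For the five scores 40, 0, 30, 10, 20 this gives a 1st quartile of 10, a median of 20 and a 3rd quartile of 30.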

b) Batting performance – Runs frequency vs Runs
srt-perf

The graph above plots the best-fitting curve for the frequency of Runs scored in each range.

c) Mean Strike Rate vs Runs
srt-sr

This plot computes the Mean Strike Rate for each interval. For example, if the Strike Rates for the innings in the range 11-20 were 40.5, 48.5, 32.7 and 56.8, then the Mean Strike Rate for the range 11-20 = (40.5 + 48.5 + 32.7 + 56.8)/4. This is done for all ranges, the Mean Strike Rate in each range is plotted, and a loess curve is fitted to this data.
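The per-range averaging can be sketched in C as below. The names bin_index and mean_strike_rate are hypothetical, and the binning (0-10, 11-20, 21-30, ...) is an assumption based on the ranges quoted in this post. The loess fit itself is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define BIN_WIDTH 10    /* runs per range: 0-10, 11-20, 21-30, ... */

/* Maps a score to its range index: 0-10 -> 0, 11-20 -> 1, 21-30 -> 2, ... */
size_t bin_index(int runs)
{
    assert(runs >= 0);
    return runs == 0 ? 0 : (size_t)(runs - 1) / BIN_WIDTH;
}

/* For each of the nbins ranges, stores the mean strike rate of the innings
 * whose runs fall in that range (0.0 for an empty range) in mean_sr, and
 * the number of such innings in count. Innings beyond the last range are
 * ignored. */
void mean_strike_rate(const int *runs, const double *strike_rate, size_t n,
                      double *mean_sr, size_t *count, size_t nbins)
{
    size_t i, b;

    for (b = 0; b < nbins; b++) {
        mean_sr[b] = 0.0;
        count[b] = 0;
    }
    for (i = 0; i < n; i++) {
        b = bin_index(runs[i]);
        if (b >= nbins)
            continue;
        mean_sr[b] += strike_rate[i];   /* accumulate the sum first */
        count[b]++;
    }
    for (b = 0; b < nbins; b++)
        if (count[b] > 0)
            mean_sr[b] /= (double)count[b];
}
```

With four innings in the range 11-20 at strike rates 40.5, 48.5, 32.7 and 56.8, the entry for that range comes out as 178.5/4 = 44.625.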

B) Batting performance of Rahul Dravid
a) Combined Boxplot and histogram of runs scored
dravid-boxhist1

The mean, median, and the 25th and 75th percentiles for the runs scored by Rahul Dravid are shown above.

b) Batting performance – Runs frequency vs Runs
dravid-perf

c) Mean Strike Rate vs Runs
dravid-sr

C) Batting performance of Sunil Gavaskar
a) Combined Boxplot and histogram of runs scored
gavaskar-boxhist1

The mean, median, and the 25th and 75th percentiles for the runs scored by Sunil Gavaskar are shown above.
b) Batting performance – Runs frequency vs Runs
gavaskar-perf

c) Mean Strike Rate vs Runs
gavaskar-sr
D) Relative performances of Tendulkar, Dravid and Gavaskar
relative-perf1

The above plot computes the percentage of the total career runs scored in a given range for each of the batsmen.
For example, if Dravid scored 23, 22, 28, 21 and 25 in the range 21-30, then for
Range 21 – 30 => percentageRuns = (23 + 22 + 28 + 21 + 25) / Total runs in career * 100
The above plot shows that Rahul Dravid has a higher contribution in the range 20-70, while Tendulkar has a larger percentage in the range 150-230.
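The percentage computation can be sketched as follows. This is an illustration only: percentage_runs and bin_index are hypothetical names, and the range boundaries (0-10, 11-20, 21-30, ...) are assumed from the ranges used in this post.

```c
#include <assert.h>
#include <stddef.h>

#define BIN_WIDTH 10    /* runs per range: 0-10, 11-20, 21-30, ... */

/* Maps a score to its range index: 0-10 -> 0, 11-20 -> 1, 21-30 -> 2, ... */
size_t bin_index(int runs)
{
    assert(runs >= 0);
    return runs == 0 ? 0 : (size_t)(runs - 1) / BIN_WIDTH;
}

/* Fills pct[b] with the percentage of total career runs that was scored
 * in range b. The total covers every innings, including scores beyond
 * the last of the nbins ranges. */
void percentage_runs(const int *runs, size_t n, double *pct, size_t nbins)
{
    long total = 0;
    size_t i, b;

    for (b = 0; b < nbins; b++)
        pct[b] = 0.0;
    for (i = 0; i < n; i++) {
        total += runs[i];
        b = bin_index(runs[i]);
        if (b < nbins)
            pct[b] += runs[i];          /* sum of runs in this range */
    }
    if (total == 0)
        return;                         /* no runs: all percentages stay 0 */
    for (b = 0; b < nbins; b++)
        pct[b] = pct[b] * 100.0 / (double)total;
}
```

For a toy career of 23, 22, 28, 21, 25 and 81 (200 runs in all), the range 21-30 accounts for 119/200 * 100 = 59.5 percent.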

E) Relative Strike Rates of Tendulkar, Dravid and Gavaskar
relative-SR

With respect to the Mean Strike Rate, Tendulkar is clearly superior to both Gavaskar and Dravid.

F) Analysis of Tendulkar, Dravid and Gavaskar
rel-perf1

The above table captures the career details of each of the batsmen.
The following points can be noted:
1) The ‘number of innings’ is the count of rows left after removing rows with DNB, TDNB etc.
2) Tendulkar has the highest average: Tendulkar (48.39) > Gavaskar (47.3) > Dravid (46.46)
3) The skew of Dravid (1.67) is the greatest, which implies that his runs are more skewed to the right (towards larger scores) relative to the mean
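For reference, one common definition of skew is the moment coefficient of skewness, written out below for n innings with runs x_1, ..., x_n and mean \bar{x}. The exact variant used to produce the table is not stated here, so this is an assumption. A positive value indicates a longer tail on the right of the mean.

```latex
g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}
           {\left(\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```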

G) Batting performance of Brian Lara
a) Combined Boxplot and histogram of runs scored
lara-boxhist1
The mean, median, 1st and 3rd quartile are shown above

b) Batting performance – Runs frequency vs Runs
lara-perf

c) Mean Strike Rate vs Runs
lara-sr

H) Batting performance of Ricky Ponting
a) Combined Boxplot and histogram of runs scored
ponting-boxhist1

b) Batting performance – Runs frequency vs Runs
ponting-perf

c) Mean Strike Rate vs Runs
ponting-SR

I) Batting performance of AB De Villiers
a) Combined Boxplot and histogram of runs scored
devilliers-boxhist1

b) Batting performance – Runs frequency vs Runs
devillier-perf

c) Mean Strike Rate vs Runs
devilliers-SR

J) Relative performances of Tendulkar, Lara, Ponting and De Villiers
relative-perf-intl1

Clearly De Villiers is ahead in the percentage of Runs scored in the range 30-80. Tendulkar is better in the range 80-120. Lara’s career has a long tail.

K) Relative Strike Rates of Tendulkar, Lara, Ponting and De Villiers
relative-SR-intl

The Mean Strike Rate of Lara is ahead of the lot, followed by De Villiers, Ponting and then Tendulkar.
L) Analysis of Tendulkar, Lara, Ponting and De Villiers
rel-perf-intl1
The following can be observed from the above table:
1) Brian Lara has the highest average: Brian Lara (51.52) > Sachin Tendulkar (48.39) > Ricky Ponting (46.61) > AB De Villiers (46.55)
2) Brian Lara also has the highest skew, which means that his data is more skewed to the right of the mean than that of the others

You can clone the code from GitHub at the following link: bestBatsman. You should be able to use the code as-is for any other batsman you choose.


Computer Vision: Ramblings on derivatives, histograms and contours

Images can be visualized as functions of the form f(x,y), where f(x,y) represents the intensity at the pixel position (x,y). Images can be grayscale, color or four-channel, and each channel may consist of integers or floating point numbers. Even so, the changes in the values can be viewed as a continuous function. Here is a nice representation of an image as a continuous function (courtesy: Prof Darrell’s lecture at Berkeley on Filters)

Given that the image can be viewed as a continuous function in 2 or 3 axes, we can take derivatives of the image. The derivatives locate the maxima and minima of this changing function. The key derivative filters in image processing are the Sobel, the Scharr and the Laplacian. The Sobel and Scharr filters are typically used for the 1st order derivative and the Laplacian gives the 2nd order derivative, and hence they can be used for determining the edges of an image.

I was keen on playing around with derivatives and also on understanding what the histograms look like.
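To make the idea of an image derivative concrete, here is a minimal plain-C sketch of the standard 3x3 Sobel kernels applied at a single interior pixel. It does not use OpenCV, and the helper names px, sobel_x and sobel_y are purely illustrative. The cvSobel call used later in this post works with a 7x7 aperture, so its output will differ in scale.

```c
#include <assert.h>

/* Pixel (x, y) of a w-pixel-wide, row-major 8-bit grayscale image. */
static int px(const unsigned char *img, int w, int x, int y)
{
    return img[y * w + x];
}

/* Horizontal derivative at an interior pixel, using the 3x3 kernel
 *   -1 0 +1
 *   -2 0 +2
 *   -1 0 +1
 * It is large where the intensity changes from left to right. */
int sobel_x(const unsigned char *img, int w, int h, int x, int y)
{
    assert(x > 0 && y > 0 && x < w - 1 && y < h - 1);
    return -px(img, w, x - 1, y - 1) + px(img, w, x + 1, y - 1)
           - 2 * px(img, w, x - 1, y) + 2 * px(img, w, x + 1, y)
           - px(img, w, x - 1, y + 1) + px(img, w, x + 1, y + 1);
}

/* Vertical derivative at an interior pixel (the transposed kernel). */
int sobel_y(const unsigned char *img, int w, int h, int x, int y)
{
    assert(x > 0 && y > 0 && x < w - 1 && y < h - 1);
    return -px(img, w, x - 1, y - 1) - 2 * px(img, w, x, y - 1) - px(img, w, x + 1, y - 1)
           + px(img, w, x - 1, y + 1) + 2 * px(img, w, x, y + 1) + px(img, w, x + 1, y + 1);
}
```

On a 3x3 patch whose right-hand column is 255 and the rest 0, sobel_x at the centre is 1020 while sobel_y is 0. The response does not fit in 8 bits, which is why a 16-bit signed image is used for the filter output.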

Here is the original image and its histogram. Clearly there is a nice spread of the values.
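What such a histogram counts can be sketched in plain C as below. This is an illustration, not the OpenCV implementation: gray_histogram is a hypothetical name, and splitting the 256 gray levels evenly across the bins is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many of the n 8-bit pixels fall into each of hist_size
 * equal-width bins covering the gray levels 0..255. */
void gray_histogram(const unsigned char *pixels, size_t n,
                    unsigned *bins, int hist_size)
{
    size_t i;
    int b;

    assert(hist_size > 0 && hist_size <= 256);
    for (b = 0; b < hist_size; b++)
        bins[b] = 0;
    for (i = 0; i < n; i++)
        bins[pixels[i] * hist_size / 256]++;    /* level -> bin index */
}
```

With 30 bins, the gray levels 0 to 8 land in bin 0, level 9 starts bin 1, and level 255 lands in the last bin, 29. A "nice spread" means that the counts are distributed over many of these bins rather than piled up in a few.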

Sobel filter and its histogram

The output of the Sobel filter on the original image is shown. The edges with Sobel’s derivative are somehow not too pronounced. The Sobel derivative can be used for obtaining the gradient of the image. The corresponding histogram of the Sobel gradient is shown.

The code snippet (the complete code is given below)

    IplImage* out_sobel = cvCreateImage( cvSize(img->width, img->height), IPL_DEPTH_16S, 1);
    cvSobel(in_gray, out_sobel, 1,1,7);
    cvShowImage("Sobel", out_sobel);

    //create an image to hold the histogram
    IplImage* histImage_Sobel = cvCreateImage(cvSize(300,400), 8, 1);

    //create a histogram to store the information from the image
    CvHistogram* histSobel = cvCreateHist(1, &hist_size, CV_HIST_ARRAY, ranges, 1);

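    //convert the 16-bit Sobel output to 8 bits, since cvCalcHist accepts only 8-bit or 32-bit float images
    IplImage* sobel_8u = cvCreateImage(cvGetSize(out_sobel), IPL_DEPTH_8U, 1);
    cvConvertScaleAbs(out_sobel, sobel_8u, 1, 0);

    //calculate the histogram of the Sobel output (not of the blank drawing canvas)
    cvCalcHist( &sobel_8u, histSobel, 0, NULL );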

    //grab the min and max values and their indeces
    cvGetMinMaxHistValue( histSobel, &min_value, &max_value, &min_idx, &max_idx);

    //scale the bin values so that they will fit in the image representation
    cvScale( histSobel->bins, histSobel->bins, ((double)histImage_Sobel->height)/max_value, 0 );

    //set all histogram values to 255
    cvSet( histImage_Sobel, cvScalarAll(255), 0 );

    //create a factor for scaling along the width
    bin_w = cvRound((double)histImage_Sobel->width/hist_size);

    for( i = 0; i < hist_size; i++ ) {
        //draw the histogram data onto the histogram image
        cvRectangle( histImage_Sobel, cvPoint(i*bin_w, histImage_Sobel->height),
                     cvPoint((i+1)*bin_w,
                     histImage_Sobel->height - cvRound(cvGetReal1D(histSobel->bins,i))),
                     cvScalarAll(0), -1, 8, 0 );
        //get the value at the current histogram bucket
        float* bins = cvGetHistValue_1D(histSobel,i);
        //increment the mean value
        mean += bins[0];
    }

    cvShowImage("Hist Sobel",histImage_Sobel);
...
...
(Please see Gavin S. Page's tutorial on histograms: vast.uccs.edu/~tboult/CS330/NOTES/OpenCVTutorial_II.ppt)

Laplacian and its histogram

The Laplacian provides the 2nd order derivative and hence can be used to determine local maxima and local minima. The Laplacian provides for much more pronounced edges and can be used to extract features of an object of interest. Its corresponding histogram is also included.
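The 2nd order derivative can be illustrated with the smallest common Laplacian kernel, applied at one interior pixel in plain C. This is a sketch under stated assumptions: laplacian_at is a hypothetical name, and the cvLaplace call in this post uses a 7x7 aperture rather than this 3x3 kernel.

```c
#include <assert.h>

/* Laplacian at an interior pixel of a w x h row-major 8-bit image,
 * using the 3x3 kernel
 *    0  1  0
 *    1 -4  1
 *    0  1  0
 * i.e. the sum of the second differences along x and along y. */
int laplacian_at(const unsigned char *img, int w, int h, int x, int y)
{
    assert(x > 0 && y > 0 && x < w - 1 && y < h - 1);
    return img[(y - 1) * w + x]         /* pixel above */
         + img[(y + 1) * w + x]         /* pixel below */
         + img[y * w + (x - 1)]         /* pixel to the left */
         + img[y * w + (x + 1)]         /* pixel to the right */
         - 4 * img[y * w + x];
}
```

On a uniform patch or a steady ramp (0, 10, 20 across each row) the result is 0, since the slope does not change. A single bright pixel of value 100 on a black background gives -400, a strong response at a local maximum.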

Canny filter and Contours

The third filter is cvCanny, which is most suitable for obtaining clear edges in an image. Canny is usually used along with cvFindContours to determine the general shape of an object. I passed the output of the Canny filter to a contour detecting function. However, the contour detecting function identified more than 228 contours, most of which were useless except for 1 which included the complete contour of the hand, as shown.

However, when I increased the max_level argument of cvDrawContours to 1, I found that it was immediately able to get the complete contour of the hand, besides a lot of extraneous contours.

I guess the challenge with the contour function is being able to programmatically reject all those contours which are of lesser importance (possibly a future post).
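One possible way to reject unimportant contours, offered only as a sketch and not something tried in this post, is to discard those that enclose a small area. The plain-C example below computes the enclosed area with the shoelace formula. Point is a hypothetical stand-in for CvPoint, and contour_area, keep_contour and the min_area threshold are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int x, y; } Point;     /* hypothetical stand-in for CvPoint */

/* Area enclosed by the closed polygon p[0..n), by the shoelace formula. */
double contour_area(const Point *p, size_t n)
{
    long twice_area = 0;
    size_t i;

    assert(p != NULL);
    for (i = 0; i < n; i++) {
        const Point *a = &p[i];
        const Point *b = &p[(i + 1) % n];   /* wrap around to close the polygon */
        twice_area += (long)a->x * b->y - (long)b->x * a->y;
    }
    if (twice_area < 0)
        twice_area = -twice_area;           /* orientation does not matter */
    return twice_area / 2.0;
}

/* Returns 1 when the contour has at least 3 points and encloses at least
 * min_area square pixels, 0 otherwise. */
int keep_contour(const Point *p, size_t n, double min_area)
{
    return n >= 3 && contour_area(p, n) >= min_area;
}
```

A 4 x 3 rectangle has area 12, so it is kept with min_area = 10 and rejected with min_area = 20. A suitable threshold would have to be found by experiment for each image.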

Code for Sobel, Laplacian and Histograms

#include "cv.h"
#include "highgui.h"
#include "stdio.h"
#include "math.h"

int main(int argc, char** argv)
{
	IplImage* img = cvLoadImage("gazelle.jpg",1);
	IplImage* dst;
	IplImage* in_gray;
	int hist_size=30;
	float gray_ranges[] = { 0, 255 };
	float* ranges[]     = { gray_ranges};
	int min_idx,max_idx;
	float min_value,max_value;
	int bin_w;
	int i;
	float mean = 0, variance = 0;

	cvNamedWindow("Original",CV_WINDOW_AUTOSIZE);
	cvNamedWindow("histogram",CV_WINDOW_AUTOSIZE);
	cvNamedWindow("Sobel",CV_WINDOW_AUTOSIZE);
	cvNamedWindow("Hist Sobel",CV_WINDOW_AUTOSIZE);

	cvNamedWindow("Laplacian",CV_WINDOW_AUTOSIZE);
	cvNamedWindow("Hist Laplace",CV_WINDOW_AUTOSIZE);

	in_gray = cvCreateImage(cvSize(img->width, img->height), IPL_DEPTH_8U, 1);
	cvCvtColor(img, in_gray, CV_BGR2GRAY);
	cvShowImage("Original", in_gray);

	//create a rectangular area to evaluate
	CvRect rect = cvRect(0, 0, 300, 400 );
	//apply the rectangle to the image and establish a region of interest
	cvSetImageROI(in_gray, rect);

	//create an image to hold the histogram
	IplImage* histImage = cvCreateImage(cvSize(300,400), 8, 1);

	//create a histogram to store the information from the image
	CvHistogram* hist = cvCreateHist(1, &hist_size, CV_HIST_ARRAY, ranges, 1);

	//calculate the histogram and apply to hist
	cvCalcHist( &in_gray, hist, 0, NULL );

	//grab the min and max values and their indeces
	cvGetMinMaxHistValue( hist, &min_value, &max_value, &min_idx, &max_idx);

	//scale the bin values so that they will fit in the image representation
	cvScale( hist->bins, hist->bins, ((double)histImage->height)/max_value, 0 );

	//set all histogram values to 255
	cvSet( histImage, cvScalarAll(255), 0 );

	//create a factor for scaling along the width
	bin_w = cvRound((double)histImage->width/hist_size);

	for( i = 0; i < hist_size; i++ ) {
		//draw the histogram data onto the histogram image
		cvRectangle( histImage, cvPoint(i*bin_w, histImage->height),
		   cvPoint((i+1)*bin_w,
		   histImage->height - cvRound(cvGetReal1D(hist->bins,i))),
		   cvScalarAll(0), -1, 8, 0 );
		//get the value at the current histogram bucket
		float* bins = cvGetHistValue_1D(hist,i);
		//increment the mean value
		mean += bins[0];
	}

	//finish mean calculation
	mean /= hist_size;

	//go back through now that mean has been calculated in order to calculate variance
	for( i = 0; i < hist_size; i++ ) {
		float* bins = cvGetHistValue_1D(hist,i);
		variance += pow((bins[0] - mean),2);
	}
	//finish variance calculation
	variance /= hist_size;

	cvShowImage("histogram",histImage);

    IplImage* out_sobel = cvCreateImage( cvSize(img->width, img->height), IPL_DEPTH_16S, 1);
    cvSobel(in_gray, out_sobel, 1,1,7);
    cvShowImage("Sobel", out_sobel);

    //create an image to hold the histogram
    IplImage* histImage_Sobel = cvCreateImage(cvSize(300,400), 8, 1);

    //create a histogram to store the information from the image
    CvHistogram* histSobel = cvCreateHist(1, &hist_size, CV_HIST_ARRAY, ranges, 1);

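    //convert the 16-bit Sobel output to 8 bits, since cvCalcHist accepts only 8-bit or 32-bit float images
    IplImage* sobel_8u = cvCreateImage(cvGetSize(out_sobel), IPL_DEPTH_8U, 1);
    cvConvertScaleAbs(out_sobel, sobel_8u, 1, 0);

    //calculate the histogram of the Sobel output (not of the blank drawing canvas)
    cvCalcHist( &sobel_8u, histSobel, 0, NULL );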

    //grab the min and max values and their indeces
    cvGetMinMaxHistValue( histSobel, &min_value, &max_value, &min_idx, &max_idx);

    //scale the bin values so that they will fit in the image representation
    cvScale( histSobel->bins, histSobel->bins, ((double)histImage_Sobel->height)/max_value, 0 );

    //set all histogram values to 255
    cvSet( histImage_Sobel, cvScalarAll(255), 0 );

    //create a factor for scaling along the width
    bin_w = cvRound((double)histImage_Sobel->width/hist_size);

    for( i = 0; i < hist_size; i++ ) {
        //draw the histogram data onto the histogram image
        cvRectangle( histImage_Sobel, cvPoint(i*bin_w, histImage_Sobel->height),
                     cvPoint((i+1)*bin_w,
                     histImage_Sobel->height - cvRound(cvGetReal1D(histSobel->bins,i))),
                     cvScalarAll(0), -1, 8, 0 );
        //get the value at the current histogram bucket
        float* bins = cvGetHistValue_1D(histSobel,i);
        //increment the mean value
        mean += bins[0];
    }

    cvShowImage("Hist Sobel",histImage_Sobel);

    // Create Laplacian and the histogram for it
    IplImage *output=cvCreateImage( cvSize(img->width, img->height), IPL_DEPTH_16S, 1);
    cvLaplace(in_gray, output, 7);
    cvShowImage("Laplacian", output);

    //create an image to hold the histogram
     IplImage* histImage_Laplace = cvCreateImage(cvSize(300,400), 8, 1);

     //create a histogram to store the information from the image
     CvHistogram* histLaplace = cvCreateHist(1, &hist_size, CV_HIST_ARRAY, ranges, 1);

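     //convert the 16-bit Laplacian output to 8 bits, since cvCalcHist accepts only 8-bit or 32-bit float images
     IplImage* laplace_8u = cvCreateImage(cvGetSize(output), IPL_DEPTH_8U, 1);
     cvConvertScaleAbs(output, laplace_8u, 1, 0);

     //calculate the histogram of the Laplacian output (not of the blank drawing canvas)
     cvCalcHist( &laplace_8u, histLaplace, 0, NULL );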

     //grab the min and max values and their indeces
     cvGetMinMaxHistValue( histLaplace, &min_value, &max_value, &min_idx, &max_idx);

     //scale the bin values so that they will fit in the image representation
     cvScale( histLaplace->bins, histLaplace->bins, ((double)histImage_Laplace->height)/max_value, 0 );

     //set all histogram values to 255
     cvSet( histImage_Laplace, cvScalarAll(255), 0 );

     //create a factor for scaling along the width
     bin_w = cvRound((double)histImage_Laplace->width/hist_size);

     for( i = 0; i < hist_size; i++ ) {
         //draw the histogram data onto the histogram image
         cvRectangle( histImage_Laplace, cvPoint(i*bin_w, histImage_Laplace->height),
                      cvPoint((i+1)*bin_w,
                      histImage_Laplace->height - cvRound(cvGetReal1D(histLaplace->bins,i))),
                      cvScalarAll(0), -1, 8, 0 );
         //get the value at the current histogram bucket
         float* bins = cvGetHistValue_1D(histLaplace,i);
         //increment the mean value
         mean += bins[0];
     }

     cvShowImage("Hist Laplace",histImage_Laplace);

	cvWaitKey(0);

	printf("Mean= %f\n",mean);
	printf("variance=%f\n",variance);

	//clean up images
	cvReleaseImage(&histImage_Laplace);
	cvReleaseImage(&histImage_Sobel);
	cvReleaseImage(&histImage);
	cvReleaseImage(&in_gray);
	cvReleaseImage(&img);

	//remove windows
	cvDestroyWindow("Original");

	cvDestroyWindow("histogram");
}

Code for Canny & Contours
#include "cv.h"
#include "highgui.h"

#define CVX_RED		CV_RGB(0xff,0x00,0x00)
#define CVX_GREEN	CV_RGB(0x00,0xff,0x00)
#define CVX_BLUE	CV_RGB(0x00,0x00,0xff)

int main(int argc, char* argv[])
{
	CvSeq* c;
	int i;
	cvNamedWindow("Original", 1 );
	cvNamedWindow("Canny_Edge", 1 );
	cvNamedWindow("Contours", 1 );
	IplImage* img_8uc1 = cvLoadImage( argv[1], CV_LOAD_IMAGE_GRAYSCALE );
	IplImage* img_edge = cvCreateImage( cvGetSize(img_8uc1), 8, 1 );
	IplImage* img_8uc3 = cvCreateImage( cvGetSize(img_8uc1), 8, 3 );
	cvThreshold( img_8uc1, img_edge, 128, 255, CV_THRESH_BINARY );
	CvMemStorage* storage =cvCreateMemStorage(0);

	CvSeq* first_contour = NULL;

	int Nc;
	int n=0;

	cvShowImage("Original", img_8uc1);

    IplImage *out_canny=cvCreateImage( cvSize(img_8uc1->width, img_8uc1->height), IPL_DEPTH_8U, 1);
	cvCanny(img_8uc1, out_canny, 50.0 ,100.0, 3);
	cvShowImage("Canny_Edge", out_canny);

/*	Nc = cvFindContours(
			img_edge,
			storage,
			&first_contour,
			sizeof(CvContour),
			CV_RETR_LIST,
			CV_CHAIN_APPROX_SIMPLE,
			cvPoint(0,0)// Try all four values and see what happens
	);*/

	Nc = cvFindContours(
			out_canny,
			storage,
			&first_contour,
			sizeof(CvContour),
			CV_RETR_TREE,
			CV_CHAIN_APPROX_SIMPLE,
			cvPoint(0,0)// Try all four values and see what happens
	);

	printf("Total contours detected: %d\n",Nc);

	for(c=first_contour; c!=NULL; c=c->h_next )
	{
		cvCvtColor( img_8uc1, img_8uc3, CV_GRAY2BGR );
		cvDrawContours(
					img_8uc3,
					c,
					CVX_RED,
					CVX_BLUE,
					1,        // Try different values of max_level, and see what happens
					2,
					8,
					cvPoint(0,0));
			printf("Contour #%d\n",n);

			cvShowImage("Contours", img_8uc3 );
			printf(" %d elements: \n",c->total);

			for(i=0; i<c->total; ++i ) {
				CvPoint* p = CV_GET_SEQ_ELEM( CvPoint, c, i );
				printf("(%d,%d)\n",p->x,p->y);

			}
			cvWaitKey(0);
			n++;
	}
	printf("Finished all contours\n");
	cvCvtColor( img_8uc1, img_8uc3, CV_GRAY2BGR );
	cvShowImage( argv[0], img_8uc3 );
	cvWaitKey(0);
	cvDestroyWindow( argv[0] );
	cvReleaseImage( &img_8uc1 );
	cvReleaseImage( &img_8uc3 );
	cvReleaseImage( &img_edge );
	return 0;
}

 
 
