Literacy in India – A deepR dive

Published in R-bloggers: Literacy in India – A deepR dive
You can do magic!
You can have anything,
That you desire
You can do magic – song by America (1982)

That is exactly how I feel when I write code in R. A few lines of R and, lo and behold, hundreds of rows and columns are magically transformed into easily understandable graphs, regression curves or choropleth maps. (By the way, the song is really cool! Listen to it if you have not heard it before.) You really can do magic with R.

In this post I do a deep dive into literacy in India. The dataset is taken from the Open Government Data (OGD) Platform India and is based on the 2001 census. Though the data is a little dated, it is extremely rich, with literacy details across different age groups and over all Indian states. The data includes the total number of persons/males/females at the primary, middle, matric, college, technical diploma and non-technical diploma levels, and so on. In fact the data also includes the educational background of people in the districts of each state. I slice and dice the data across multiple parameters. I have created an interactive Shiny app which provides very detailed visualization based on the parameters chosen.

Do try out my interactive Shiny app : IndiaLiteracy

The entire code for this app is on GitHub. Feel free to download/clone/fork/modify or enhance the code – IndiaLiteracy

For analyzing such a rich data set as the Census data of 2001, I created 4 tabs
1) State Literacy
2) Educational Levels vs Age
3) India Literacy and
4) District Literacy

Here are the details of these 4 tabs in my Shiny app

A. State Literacy
This tab provides the age wise distribution of people (Persons/Males/Females) who attend educational institutions. This is shown as a barplot. The plot also includes the national average. In the plot below which is for entire India we see that the national average


The distribution of females attending primary school in the state of Haryana is shown. Also included is the national average. As can be seen, there are options for (Total/Urban/Rural) against (Persons/Males/Females) and for whether these people attend educational institutions, are illiterate or literate.


I also have another option under “Who” which is “All”. This will plot the age-wise distribution of males/females/persons in urban/rural areas or the entire state.


B. Educational Institutions vs Age plot

This plot displays the educational institutions attended by people in a particular age group. So, for example, in the state of Orissa for the 18-year age group we can see that there are persons who are in (Primary, Matric, Higher Secondary, Non-Technical Diploma and Technical Diploma). The bar length for each color is the percentage of the total persons at that level of education.


C. Literacy across India
This tab plots a choropleth map for a region (Urban+Rural, Urban, Rural), Who (Persons, Males, Females) and the literacy level (attending educational institutions, primary, higher secondary, Matric etc.) across the whole of India.


D. Literacy within a state
This tab plots a choropleth map of literacy in the districts of a state. A sample plot for Karnataka is shown below


E. Key observations

There is a wealth of insights you can glean by looking at the various charts. Here are a few insights from my initial observations
1) The literacy in Kerala across ages is higher than the national average while in Bihar it is less than the national average

a) Kerala

b) Bihar

2) In Rajasthan the percentage of males attending educational institutions is higher than the national average, while for females it is less than the national average. However, the situation is reversed in Chandigarh, where the percentage of females attending educational institutions is higher than both the national average and the males.

a) Rajasthan

b) Chandigarh

3) When we look at the number of persons attending educational institution across India the north-eastern states lead with Manipur, Nagaland and Sikkim in the top 3.


We have heard that Kerala is the most literate state. But it looks like Manipur, Nagaland and Sikkim actually edge Kerala out. If we look at the State Literacy charts for Kerala and Manipur this becomes clearer

a) Kerala


b) Manipur


It can be seen that in Manipur the number of persons attending educational institutions in the age range 13-24 years is much higher than the national average and much higher than in Kerala

4) If we take a look at the district-wise literacy for the state of Bihar we see that literacy is lower in the north-eastern districts.


5) Here is another interesting observation I made. The top 3 states which are most ‘literate with no education’ are i) Rajasthan ii) Madhya Pradesh iii) Chhattisgarh


While I have included several charts with accompanying explanation, this is largely unnecessary as most of the charts are self-explanatory.

Do try out the Shiny app and see for yourself the literacy in each state/district/age group/educational level etc. – IndiaLiteracy

Feel free to clone/fork my code and make your own enhancements –IndiaLiteracy


Revisiting crimes against women in India

Here I go again, raking the muck about crimes against women in India. My earlier post “A crime map of India in R: Crimes against women in India” garnered a lot of responses from readers. In fact one of the readers even volunteered to create the only choropleth map in that post. You can download the data for this post from the link “Crimes against women in India”.

I was so impressed by the choropleth map that I decided to do that for all crimes against women. (Wikipedia definition: A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map.) Personally, I think pictures tell the story better. I am sure you will agree!

So here I have it: a Shiny app which will plot choropleth maps for a chosen crime in a given year.

You can try out my interactive Shiny app at  Crimes against women in India

Check out my book on Amazon, available in both paperback ($9.99) and Kindle ($6.99/Rs449) versions (see ‘Practical Machine Learning with R and Python – Machine Learning in stereo’).


In the picture below are the details of ‘Rape’ in the year 2015.

Interestingly the ‘Total Crime against women’ in 2001 shows the Top 5 as
1) Uttar Pradesh 2) Andhra Pradesh 3) Madhya Pradesh 4) Maharashtra 5) Rajasthan


But in 2015 West Bengal tops the list, as the real heavy weight in crimes against women. The new pecking order in 2015 for ‘Total Crimes against Women’ is

1) West Bengal 2) Andhra Pradesh 3) Uttar Pradesh  4) Rajasthan 5) Maharashtra


Similarly for rapes, West Bengal is nowhere in the top 5 list in 2001. In 2015, it is second only to the national rape leader Madhya Pradesh. Also, in 2001 West Bengal is not in the top 5 for any of 6 crime heads. But in 2015, West Bengal is in the top 5 of 6 crime heads. The emergence of West Bengal as the leader in crimes against women is due to the steep increase in crime rate over the years. Clearly the law and order situation in West Bengal is heading south.

In Dowry Deaths, UP, Bihar, MP, West Bengal lead the pack, and in that order in 2015.

The usual suspects for most crime categories are West Bengal, UP, MP, AP & Maharashtra.

The state-wise crime charts plot the incidence of the crime (rape, dowry death, assault on women etc.) over the years. Data for each state and for each crime was available for 2001-2013. The data for the period 2014-2018 is projected using linear regression. The shaded portion in the plots indicates the 95% confidence interval of the prediction (in other words, we can be 95% confident that the true mean of the crime rate in the projected years lies within the shaded region).
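As a rough illustration of how such a projection and its 95% band can be obtained in base R, here is a minimal sketch. The yearly counts are made up for illustration and the variable names are my own assumptions; this is not the code behind the Shiny app.

```r
# Hypothetical yearly counts for one state and one crime head (not real data)
year  <- 2001:2013
count <- c(120, 131, 128, 140, 151, 149, 160, 172, 169, 181, 190, 188, 201)

# Fit a straight line through the observed years
fit <- lm(count ~ year)

# Project 2014-2018 and ask for the 95% confidence interval of the mean
future <- data.frame(year = 2014:2018)
proj <- predict(fit, newdata = future, interval = "confidence", level = 0.95)

# proj has columns fit (point estimate), lwr and upr (edges of the band)
result <- cbind(future, round(proj, 1))
print(result)
```

The band widens as the projected year moves further from the observed data, which is why the shaded region in such plots fans out towards 2018.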


Several interesting requests came from readers of my earlier post. Some of them were to plot the crimes as a function of population and per capita income of the State/Union Territory to see if the plots throw up new crime leaders. I have not got the relevant state-wise population distribution data yet. I intend to update this when I get my hands on this data.

I have included the crimes.csv which has been used to generate the visualization. However for the Shiny app I save this as .RData for better performance of the app.
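The CSV-to-.RData step could look roughly like the sketch below. The tiny table and the temporary file paths are stand-ins I made up so the example runs on its own; the real crimes.csv has different columns.

```r
# Stand-in for crimes.csv: a tiny made-up table written to a temporary file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(state = c("A", "B"), year = c(2001, 2001), rape = c(10, 20)),
          csv_path, row.names = FALSE)

# One-off conversion: parse the CSV once and store the data frame in binary form
crimes <- read.csv(csv_path, stringsAsFactors = FALSE)
rdata_path <- tempfile(fileext = ".RData")
save(crimes, file = rdata_path)

# In the app: load() restores the object under its original name
rm(crimes)
load(rdata_path)
str(crimes)
```

Loading a binary .RData file skips the text parsing that read.csv has to do each time the app starts.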

You can clone/download  the code for the Shiny app from GitHub at  crimesAgainWomenIndia

Please check out my Shiny app : Crimes against women

I also intend to add further interactivity to my visualizations in a future version. Watch this space. I’ll be back!


Analyzing cricket’s batting legends – Through the mirage with R

In this post I do a deep dive into the records of the all-time batting legends of cricket to identify interesting information about their achievements. In my opinion, the usual currencies for a batsman’s performance, like the most centuries or the highest batting average, are too coarse in their significance. I wanted something finer where we can pinpoint the specific strengths of different players.

This post will answer the following questions.
– How many times has a batsman scored runs in a specific range say 20-40 or 80-100 and so on?
– How do different batsmen compare against each other?
– Which of the batsmen stayed well beyond their sell-by date?
– Which of the batsmen retired too soon?
– What is the propensity for a batsman to get caught, bowled, run out etc?

For this analysis I have chosen the batsmen below for the following reasons
Sir Don Bradman: With a batting average of 99.94 Bradman was an obvious choice
Sunil Gavaskar: One of India’s batting icons who amassed 774 runs in his debut series against the formidable West Indies in West Indies
Brian Lara: A West Indian batting hero who has double, triple and quadruple centuries under his belt
Sachin Tendulkar: A prolific run getter, India’s idol, who holds the record for most test centuries by any batsman (51 centuries)
Ricky Ponting: A dangerous batsman against any bowling attack who can demolish any bowler on his day
Rahul Dravid: India’s most dependable batsman who could weather any storm in a match single-handedly
AB De Villiers: The destructive South African batsman who can pulverize any attack when he gets going

The analysis has been performed on these batsmen on various parameters. Clearly different batsmen have shone in different batting aspects. The analysis focuses on each of these to see how the different players stack up against each other.

The data for the above batsmen has been taken from ESPN Cricinfo. Only the batting statistics of the above batsmen in Test cricket has been taken. The implementation for this analysis has been done using the R language.  The R implementation, datasets and the plots can be accessed at GitHub at analyze-batting-legends. Feel free to fork or clone the code. You should be able to use the code with minor modifications on other players. Also go ahead make your own modifications and hack away!

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of performances of both batsmen, bowlers besides evaluating team & match performances in Tests , ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the kindle versions. The books can be accessed at Cricket analytics with cricketr  and Beaten by sheer pace-Cricket analytics with yorkr  A must read for any cricket lover! Check it out!!


Important note: Do check out the python avatar of cricketr, ‘cricpy’, in my post ‘Introducing cricpy: A python package to analyze performances of cricketers’

Key insights from my analysis below
a) Sir Don Bradman’s unmatchable record of a 99.94 test average with several centuries, double and triple centuries makes him the gold standard of test batting, as seen in the ‘All-time best batsman’ chart below
b) Sunil Gavaskar is the king of batting in India, followed by Rahul Dravid and finally Sachin Tendulkar. See the charts below for details
c) Sunil Gavaskar and Rahul Dravid had at least 2 more years of good test cricket in them. Their retirement was premature. This is based on the individual batsmen’s career graph (moving average below)
d) Brian Lara, Sachin Tendulkar, Ricky Ponting, Vivian Richards retired at a time when their batting was clearly declining. The writing on the wall was clear and they had to go (see moving average below)
e) The biggest hitter of 4’s was Vivian Richards. In the 2nd place is Brian Lara. Tendulkar & Dravid follow behind. Dravid is a surprise as he has the image of a defender.
f) While Sir Don Bradman made huge scores, the number of 4’s in his innings was significantly lower. This could be because the grounds in those days did not carry the ball far enough
g) With respect to dismissals, Richards kept his wicket intact 11% of the time, followed by Ponting, Tendulkar, De Villiers and Dravid (10%), and Gavaskar & Bradman (7%)

A) Runs frequency table and charts
These plots normalize the batting performance of the different batsmen, since the number of innings played ranges from 89 (Bradman) to 348 (Tendulkar), by calculating the percentage frequency with which a batsman scores runs in a particular range. For example, Sunil Gavaskar made scores between 60-80 in 10% of his total innings.
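The percentage frequency can be computed along the following lines. This is only a sketch with made-up scores; the 20-run bin width matches the ranges quoted in this post, but the variable names and the choice to put a score of exactly 20 in the 20-40 bin are my assumptions.

```r
# Made-up innings scores for one batsman (not real career data)
runs <- c(4, 12, 35, 47, 61, 78, 0, 103, 22, 55, 19, 140, 8, 66, 31, 90, 15, 27, 72, 5)

# 20-run bins: [0,20), [20,40), ... ; right = FALSE puts a score of 20 in 20-40
breaks <- seq(0, 420, by = 20)
labels <- paste(head(breaks, -1), tail(breaks, -1), sep = "-")
bins <- cut(runs, breaks = breaks, labels = labels, right = FALSE)

# Percentage of innings falling in each range
pct <- round(100 * table(bins) / length(runs), 1)
print(pct[pct > 0])
```

Dividing by the number of innings is what makes a batsman with 89 innings comparable to one with 348.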

This is shown in a tabular form below

The individual charts for each of the players are shown below.
The top performers after removing ranges 0-20 & 20-40 are
Between 40-60 runs – 1) Ricky Ponting (16.4%) 2) Brian Lara (15.8%) 3) AB De Villiers (14.6%)
Between 60-80 runs – 1) Vivian Richards (18%) 2) AB De Villiers (10.2%) 3) Sunil Gavaskar (10%)
Between 80-100 runs – 1) Rahul Dravid (7.6%) 2) Brian Lara (7.4%) 3) AB De Villiers (6.4%)
Between 100 -120 runs – 1) Sunil Gavaskar (7.5%) 2) Sir Don Bradman (6.8%) 3) Vivian Richards (5.8%)
Between 120-140 runs – 1) Sir Don Bradman (6.8%) 2) Sachin Tendulkar (2.5%) 3) Vivian Richards (2.3%)

The percentage frequency for Brian Lara is included below
1) Brian Lara

The above chart shows that, out of the total number of innings played by Brian Lara, he scored runs in the range 40-60 16% of the time. The chart also shows that Lara scored between 0-20 in 40% of his innings, while also scoring in the ranges 360-380 & 380-400 around 1% of the time.
The same chart is displayed as a continuous graph below

The run frequency charts for the other batsmen are
2) Sir Don Bradman
a) Run frequency
Note: Notice the significant contributions by Sir Don Bradman in the ranges 120-140,140-160,220-240,all the way up to 340
b) Performance
3) Sunil Gavaskar
a) Runs frequency chart
b) Performance chart
4) Sachin Tendulkar
a) Runs frequency chart
b) Performance chart
5) Ricky Ponting
a) Runs frequency
b) Performance
6) Rahul Dravid
a) Runs frequency chart
b) Performance chart
7) Vivian Richards
a) Runs frequency chart
b) Performance chart
8) AB De Villiers
a) Runs frequency chart
b)  Performance chart

 B) Relative performance of the players
In this section I try to measure the relative performance of the players by superimposing the performance graphs obtained above.  You may say that “comparisons are odious!”. But equally odious are myths that are based on gross facts like highest runs, average or most number of centuries.
a) All-time best batsman
(Sir Don Bradman, Sunil Gavaskar, Vivian Richards, Sachin Tendulkar, Ricky Ponting, Brian Lara, Rahul Dravid, AB De Villiers)
From the above chart it is clear that Sir Don Bradman is the ‘gold’ standard in batting. He is well above the others for run ranges from 100 to 350
b) Best Indian batsman (Sunil Gavaskar, Sachin Tendulkar, Rahul Dravid)
The above chart shows that Gavaskar is ahead of the other two for the key range between 100 – 130, with almost 8% contribution of total runs. He is followed by Dravid, who is ahead of Tendulkar in the range 80-120. According to me the all-time best Indian batsmen are 1) Sunil Gavaskar 2) Rahul Dravid 3) Sachin Tendulkar

c) Best batsman (Brian Lara, Ricky Ponting, Sachin Tendulkar, AB De Villiers)
This chart was prepared since this comparison was often made in recent times


This chart shows the following ranking 1) AB De Villiers 2) Sachin Tendulkar 3) Brian Lara/Ricky Ponting
C) Chart of 4’s

This chart is plotted with a 2nd order curve of the number of 4’s versus the total runs in the innings
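One way to fit and draw such a 2nd order curve in base R is sketched below. The innings values are invented and the variable names are assumptions; the repository code may do this differently.

```r
# Made-up innings: runs scored and number of 4s hit (not real data)
runs  <- c(5, 18, 33, 47, 52, 68, 75, 89, 104, 121, 150, 176)
fours <- c(0, 2, 4, 6, 6, 9, 9, 12, 13, 16, 19, 24)

# Second-order (quadratic) least-squares fit: fours = a + b*runs + c*runs^2
fit <- lm(fours ~ poly(runs, 2, raw = TRUE))

# Evaluate the fitted curve on a fine grid and overlay it on the scatter
grid <- data.frame(runs = seq(min(runs), max(runs), length.out = 100))
grid$fours <- predict(fit, newdata = grid)
plot(runs, fours, pch = 16, xlab = "Runs in innings", ylab = "Number of 4s")
lines(grid$runs, grid$fours, col = "blue", lwd = 2)
```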
1) Brian Lara
2) Sir Don Bradman
3) Sunil Gavaskar
4) Sachin Tendulkar
5) Ricky Ponting
6) Rahul Dravid
7) Vivian Richards
8) AB De Villiers
D) Proclivity for type of dismissal
The below charts show how often the batsman was out bowled, caught, run out etc
1) Brian Lara
2) Sir Don Bradman
3) Sunil  Gavaskar
4) Sachin Tendulkar
5) Ricky Ponting
6) Rahul Dravid
7) Vivian Richards
8) AB De Villiers
E) Moving Average
The plots below show each batsman’s performance as a chronological time series, displayed as the continuous gray line. A moving average is computed using ‘loess regression’ and is shown as the dark line. This dark line represents the player’s improvement or decline in performance. The moving average plots are shown below
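A loess trend over a chronological series of scores can be sketched as follows. The scores are simulated, and the span value of 0.5 is an arbitrary choice of mine rather than the setting used for the charts in this post.

```r
# Made-up chronological innings scores (not real data)
set.seed(1)
innings <- 1:80
runs <- pmax(0, round(40 + 15 * sin(innings / 12) + rnorm(80, sd = 25)))

# Local regression of runs on innings order; span controls the smoothness
fit <- loess(runs ~ innings, span = 0.5)
trend <- predict(fit)

# Gray line: raw scores; dark line: smoothed trend
plot(innings, runs, type = "l", col = "gray",
     xlab = "Innings (chronological)", ylab = "Runs")
lines(innings, trend, col = "black", lwd = 2)
```

A smaller span follows short-term form more closely, while a larger span shows only the broad career arc.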
1) Brian Lara
2) Sir Don Bradman
Sir Don Bradman’s moving average shows a remarkably consistent performance over the years. He probably could have continued for a couple more years
3)Sunil Gavaskar


Gavaskar’s moving average does show a good improvement from a dip around 1983. Gavaskar retired bowing to public pressure, on a mistaken belief that he was underperforming. Gavaskar could have continued for a couple more years
4) Sachin Tendulkar


Tendulkar’s performance is clearly on the decline from 2011.  He could have announced his retirement at least 2 years prior
5) Ricky Ponting
Ponting’s peak performance was around 2005 and goes steeply downward from then on. Ponting could have also retired around 2012
6) Rahul Dravid


Dravid seems to have recovered very effectively from his poor form around 2009. His overall performance shows steady improvement. Dravid’s announcement appeared impulsive. Dravid had another 2 good years of test cricket in him
7) Vivian Richards
Richards’ performance seems to have dropped around 1984 and seems to remain that way.
8) AB De Villiers
AB De Villiers’ moving average shows a steady upward swing from 2009 onwards. De Villiers has at least 3-4 years of great test cricket ahead of him.

Finally, as mentioned above, the dataset, the R implementation and all the charts are available at GitHub at analyze-batting-legends. Feel free to fork and clone the code. The code should work for other batsmen as-is. Also go ahead and make any modifications for obtaining further insights.

Conclusion: The batting legends have been analyzed from various angles, namely i) What is the frequency of runs scored in a particular range? ii) How does each batsman compare with the others for relative runs in a specified range? iii) How does the batsman get out? iv) What were the peak and lean periods of the batsman, and did he recover or slump after these periods? While the batsmen themselves have played in different time periods, I think in an overall sense the performance under the conditions of the time will be similar.
Anyway feel free to let me know your thoughts. If you see other patterns in the data also do drop in your comment.

A crime map of India in R – Crimes against women

In this post I take a look at the gory crime scene across India to determine which states are the heavyweights in crimes. Who is the undisputed champion of rapes in a year? Which state excels in cruelty by husbands and relatives to wives? Which state leads in dowry deaths? To get the answers to these questions I analyze the state-wise data on crimes against women from the Open Government Data (OGD) Platform India.

(Do see my post Revisiting crimes against women in India which includes an interactive Shiny app)

The data in OGD is available for crimes against women in different states under different ‘crime heads’ like rape, dowry deaths, kidnapping & abduction etc. The data is available for the years 2001 to 2012. This data is plotted as a scatter plot and a linear regression line is then fit to the available data. Based on this linear model, the incidence of crimes like rape, dowry deaths and kidnapping & abduction is projected for each of the states. This is then used to build a table of the different crime heads for all the states, predicting the number of crimes till the year 2018. Fortunately, R crunches through the data sets quite easily. The overall projections of crimes against women, based on the linear regression for each of these states, are shown below

Projections over the next couple of years
The tables below are based on the projected incidence of crimes under various categories, assuming that these states maintain their torrid crime rate. A cursory look at the tables clearly indicates that Uttar Pradesh is the undisputed heavyweight champion in 4 of the 5 categories shown. Maharashtra and Andhra Pradesh take the 2nd and 3rd ranks in total crimes against women and are significant contenders in other categories too.

A) Projected rapes in India
The top 3 heavy weights in projected rapes over the next 5 years are 1) Madhya Pradesh  2) Uttar Pradesh 3) Maharashtra


Full table: Rape.csv
B) Projected Dowry deaths in India 

Full table: Dowry Deaths.csv
C) Kidnapping & Abduction

Full table: Kidnapping&Abduction.csv
D) Cruelty by husband & relatives

Full table: Cruelty by husbands_relatives.csv
E) Total crimes against women


Full table: Total crimes.csv
Here is a visualization of ‘Total crimes against women’  created as a choropleth map

The implementation for this analysis was done using the R language. The R code, dataset, output and the crime charts can be accessed at GitHub at crime-against-women

Directory structure
– R code
– dataset used

The analysis has been completely parametrized. A quick look at the implementation is shown below. A function statecrime was created as given below

This function (statecrime.R) does the following
a) Creates a scatter plot for the state for the crime head
b) Computes the best linear regression fit and draws this line
c) Uses the model parameters (coefficients) to compute the projected crime in the years to come
d) Writes the projected values to a text file
e) Creates a directory with the name of the state if it does not exist and stores the jpeg of the plot there.

statecrime <- function(indiacrime, row, state, crime) {
year <- c(2001:2012)
# Make separate folders for each state
if(!file.exists(state)) {
dir.create(state)
}
crimeplot <- paste(crime,".jpg")

# Plot the details of the crime
# (thecrime, atitle, ymin and ymax are set in parts of the function not shown here)
plot(year, thecrime, pch=15, col="red", xlab="Year", ylab=crime, main=atitle,
xlim=c(2001,2018), ylim=c(ymin,ymax), axes=FALSE)

A linear regression line is fit using ‘lm’

# Fit a linear regression model
lmfit <-lm(thecrime~year)
# Draw the lmfit line

The model parameters are then used to draw the line and also to project the crime for the years 2013 to 2018

nyears <-c(2013:2018)
nthecrime <- rep(0,length(nyears))
# Projected crime incidents from 2013 to 2018 using a linear regression model
for (i in seq_along(nyears)) {
nthecrime[i] <- lmfit$coefficients[2] * nyears[i] + lmfit$coefficients[1]
}

The projected data for each state is appended into an appropriate file which is then used to display the tables at the top of this post

# Write the projected crime rate in a file
nthecrime <- round(nthecrime,2)
nthecrime <- c(state, nthecrime, "\n")
#write(nthecrime,file=fileconn, ncolumns=9, append=TRUE,sep="\t")
filename <- paste(crime,".txt")
# Write the output in the ./output directory
cat(nthecrime, file=filename, sep=",",append=TRUE)

The above function is then called repeatedly for each state for the different crime heads. (Note: It is possible to read both the states and the crime heads with R and perform the computation in a loop. However, I have done this the manual way!)

# 1. Andhra Pradesh
i <- 1
statecrime(indiacrime, i, "Andhra Pradesh","Rape")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Kidnapping& Abduction")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Dowry Deaths")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Assault on Women")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Insult to modesty")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Cruelty by husband_relatives")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Imporation of girls from foreign country")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Immoral traffic act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Dowry prohibition act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Indecent representation of Women Act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Commission of Sati Act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Total crimes against women")

and so on for all the states
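The manual calls above could also be driven by a loop, as in this sketch. Here statecrime() is replaced by a stub that only records its arguments, the crime heads and states are truncated to a few entries, and the rule that state number s starts at row s is an assumption on my part based on the "# 1. Andhra Pradesh" block; only the stride of 38 rows per crime head comes from the calls above.

```r
# Stub standing in for the real statecrime(); it only records its arguments
calls <- list()
statecrime <- function(indiacrime, row, state, crime) {
  calls[[length(calls) + 1]] <<- list(row = row, state = state, crime = crime)
}

# First few crime heads and states only, for illustration
crimes <- c("Rape", "Kidnapping& Abduction", "Dowry Deaths", "Assault on Women")
states <- c("Andhra Pradesh", "Arunachal Pradesh")
stride <- 38  # rows between successive crime heads, as in the manual calls

indiacrime <- NULL  # placeholder; the real data frame is read from the dataset
for (s in seq_along(states)) {
  for (k in seq_along(crimes)) {
    # Assumed layout: state s occupies row s within each block of 38 rows
    row <- s + (k - 1) * stride
    statecrime(indiacrime, row, states[s], crimes[k])
  }
}
```

For Andhra Pradesh this reproduces the row sequence 1, 39, 77, 115 used in the manual calls.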

Charts for different crimes against women

1) Uttar Pradesh

The plots for  Uttar Pradesh  are shown below

Rapes in UP


Dowry deaths in UP

Cruelty by husband/relative

Total crimes against women in Uttar Pradesh

You can find more charts in GitHub by clicking Uttar Pradesh

2) Maharashtra : Some of the charts for Maharashtra



Kidnapping & Abduction

Total crimes against women in Maharashtra

More crime charts  for Maharashtra

Crime charts can be accessed for the following states from GitHub ( in alphabetical order)

3) Andhra Pradesh
4) Arunachal Pradesh
5) Assam
6) Bihar
7) Chhattisgarh
8) Delhi (Added as an exception based on its notoriety)
9) Goa
10) Gujarat
11) Haryana
12) Himachal Pradesh
13) Jammu & Kashmir
14) Jharkhand
15) Karnataka
16) Kerala
17) Madhya Pradesh
18) Manipur
19) Meghalaya
20) Mizoram
21) Nagaland
22) Odisha
23) Punjab
24) Rajasthan
25) Sikkim
26) Tamil Nadu
27) Tripura
28) Uttarakhand
29) West Bengal

The code, dataset and the charts can be cloned/forked from GitHub at crime-against-women

Let me know if you find any interesting patterns in the data.
Thoughts, comments welcome!


Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra

Continuing my earlier ‘innings’, of test driving my knowledge in Machine Learning acquired via Coursera,  I now turn my attention towards the bowling performances of our Indian bowling heroes. In this post I give a slightly different ‘spin’ to the bowling analysis and hope I can ‘swing’ your opinion based on my assessment.

I guess that is enough of my cricketing ‘double-speak’ for now and I will get down to the real business of my bowling analysis!




As in my earlier post Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid, the first part of the post has my analyses and the latter part has the details of the implementation of the algorithm. Feel free to read the first part and either scan or skip the latter.

To perform this analysis I have skipped the data on our recent crop of new bowlers. The reason is that data is scant on these bowlers; besides, they also seem to have a relatively shorter shelf life (hope there are a couple of finds in this Australian tour of Dec 2014). For the analyses I have chosen B S Chandrasekhar, Kapil Dev and Anil Kumble. My rationale for choosing these 3 is given below.

B S Chandrasekhar, also known as ‘Chandra’, was one of the most lethal leg spinners in the late 1970’s. He had a very dangerous combination of fast leg breaks and searing top spins interspersed with the occasional googly. On many occasions he would leave most batsmen completely clueless.

Kapil Nikhanj Dev, the Haryana Hurricane, could outwit the most technically sound batsmen through some really clever bowling. His variations were almost always effective and he would achieve the vital breakthrough by outsmarting the opponent.

And finally Anil Kumble, I chose Kumble because in my opinion he is truly the embodiment of the ‘thinking’ bowler. Many times I have seen Kumble repeatedly beat batsmen. It was like he was telling the batsman ‘check’ as he bowled faster leg breaks, flippers, a straighter delivery or top spins before finally crashing into the wickets or trapping the batsmen. It felt he was saying ‘checkmate dude!’

I have taken the data for the 3 bowlers from ESPN Cricinfo. Only the Test matches were considered for the analyses. All tests against all oppositions both at home and away were included

The assumptions taken and basis of the computation is included below
a. The data is based on the following 2 input variables: (i) Overs bowled and (ii) Runs given. The output variable is ‘Wickets taken’

b. To my surprise I found that in the late 1970’s, when BS Chandrasekhar used to bowl, an over had 8 balls for matches in Australia. So I had to normalize this data for Chandra to make it on par with the others. Hence for Chandra, where the overs were made up of 8 balls, the overs were calculated as follows
Overs (O) = (Overs * 8)/6

c. The Economy Rate E was calculated as below
E = Runs/Overs was chosen as an input variable to take into account fewer runs given by the bowler

d. The output variable was re-calculated as Strike Rate (SR) to determine the ‘bowling effectiveness’
Strike Rate = Wickets/Overs
(not to be confused with a batsman’s strike rate, where batsman strike rate = runs/balls faced)

e.Hence the analysis is based on
f(O,E) = SR
An outline of the Octave code and the data used can be cloned from GitHub at ml-bowling-analyze
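
The normalization and the two derived variables in (b) to (d) can be sketched in a few lines. This is a minimal sketch in Python rather than the Octave used for this post. The function names and the sample spell are illustrative assumptions and are not taken from the dataset. The economy rate is taken as runs per over, as in the steps of the next section.

```python
def normalize_overs(overs, balls_per_over=6):
    """Convert overs of any length into equivalent 6-ball overs."""
    return overs * balls_per_over / 6.0


def economy_rate(runs, overs):
    """Runs conceded per (6-ball) over."""
    return runs / overs


def strike_rate(wickets, overs):
    """Wickets per over, the 'bowling effectiveness' measure of this post."""
    return wickets / overs


# Hypothetical spell: 15 eight-ball overs, 60 runs conceded, 4 wickets.
overs = normalize_overs(15, balls_per_over=8)
print(overs, economy_rate(60, overs), strike_rate(4, overs))
```

Keeping the normalization in its own function means the 8-ball correction is applied only to the matches that need it, before E and SR are derived.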

1. Surface of Bowling Effectiveness (SBE)
In my earlier post I was able to fit a ‘prediction plane’ based on the minutes at crease and balls faced versus the runs scored. But in this case a plane did not make sense, as the wickets can only range from 0 to 10 and in most cases average between 3 and 5. So I plot the best fitting 3-D surface over the predicted hypothesis function. The steps performed are

1) The data for the different bowlers was cleaned by removing the entries marked DNB (Did not bowl)
2) The Economy Rate (E) = Runs given/Overs and Strike Rate (SR) = Wickets/Overs were calculated.
3) The product of Overs (O) and Economy (E) was stored as Over_Economy (OE)
4) The hypothesis function was computed as h(O, E, OE) = y
5) Theta was calculated using the Normal Equation. The Surface of Bowling Effectiveness (SBE) was then plotted. The plots for each of the bowlers are shown below
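
Steps 3 to 5 can be sketched as follows. This is a minimal Python sketch, not the Octave code from the repository. The data is synthetic, generated from a made-up theta so that the fit can be checked. The helper names are hypothetical, and the cost convention J = 1/(2m) * sum of squared errors is an assumption.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def features(o, e):
    """Feature vector [1 O E OE] for one bowling spell."""
    return [1.0, o, e, o * e]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def normal_equation(X, y):
    """theta = (X'X)^-1 X'y, obtained by solving (X'X) theta = X'y."""
    n = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    return solve(XtX, Xty)


def cost(X, y, theta):
    """J = 1/(2m) * sum of squared errors (assumed convention)."""
    return sum((dot(theta, r) - yi) ** 2 for r, yi in zip(X, y)) / (2 * len(y))


# Synthetic (overs, economy) spells and a made-up theta, for illustration only.
true_theta = [0.3, -0.004, -0.03, 0.0005]
spells = [(10, 2.0), (20, 2.5), (30, 3.0), (15, 4.0), (25, 3.5), (40, 2.2)]
X = [features(o, e) for o, e in spells]
y = [dot(true_theta, row) for row in X]

theta = normal_equation(X, y)
print(theta, cost(X, y, theta))
```

Solving the linear system (X'X) theta = X'y avoids forming an explicit inverse, which is the usual choice for a small feature set like this one.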

Here are the plots

A) Anil Kumble
The data of Kumble, based on Overs bowled & Economy rate versus the Strike Rate, is plotted as a 3-D scatter plot (pink crosses). The best fit, determined by solving for the optimum theta using the Normal Equation, is plotted as the 3-D surface shown below.
The 3-D surface is what I have termed the ‘Surface of Bowling Effectiveness (SBE)’, as it depicts a bowler’s overall effectiveness by plotting the overs (O) and ‘economy rate’ E against the predicted ‘strike rate’ SR.
Here is another view
The theta values obtained for Kumble are
Theta =

And the cost at this theta is
Cost Function J = 0.0046269

B) B S Chandrasekhar
Here is the best-fit surface plot for Chandra, with the data on O, E vs SR plotted as a 3D scatter plot. Note: The dataset for Chandrasekhar is smaller compared to the other two.
Another view for Chandra

Theta values for B S Chandrasekhar are
Theta =
and the cost is
Cost Function J = 0.0032980

C) Kapil Dev
The plots  for Kapil
Another view of SBE for Kapil
The Theta values and cost function for Kapil are
Theta =
Cost Function J = 0.0035123

2. Predicting wickets
In the previous section the optimum theta with the lowest Cost Function J was calculated. Based on the value of theta, the wickets that will be taken by a bowler can be predicted. The hypothesis is the product of the feature vector and theta, i.e.

y = h(x) = x * theta  =>  Strike Rate (SR) = [1 O E OE] * theta
Now predicted wickets can be calculated as

wickets = Strike Rate (SR) * Overs (O)
This is done for Kumble, Chandra and Kapil for different combinations of Overs (O) and Economy rate (E).
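
The prediction step can be sketched as below, again in Python rather than Octave. The theta values are made up purely for illustration and are not the values obtained for any of the three bowlers. The function name is hypothetical.

```python
def predicted_wickets(theta, overs, economy):
    """Strike rate from [1 O E OE] * theta, then wickets = SR * O."""
    feature_vector = [1.0, overs, economy, overs * economy]
    strike_rate = sum(t * f for t, f in zip(theta, feature_vector))
    return strike_rate * overs


# Made-up theta, for illustration only.
theta = [0.2, -0.002, -0.02, 0.0002]

# One row per economy rate, one column per number of overs.
for economy in (2.0, 3.0, 4.0):
    row = [round(predicted_wickets(theta, o, economy), 2) for o in (10, 20, 30)]
    print(economy, row)
```

Looping over a grid of overs and economy rates like this is one way to build the kind of predicted wickets table shown for each bowler.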

Here are the results
Predicted wickets for Anil Kumble
The plot of predicted wickets for Kumble is represented below
This can also be represented as a table

Predicted wickets for B S Chandrasekhar
The table for Chandra
 Predicted wickets for Kapil Dev

The plot

The predicted table from the hypothesis function for Kapil Dev

Observation: A closer look at the predicted wickets for Kapil, Kumble and B S Chandra shows an interesting aspect. The predicted number of wickets is higher for lower economy rates. With a little thought we can see that bowlers on turning pitches, or on pitches with a lot of movement, can not only be more economical but can also be destructive and take a lot of wickets. Hence the higher wickets for lower economy rates!

Implementation details
In this post I have used the Normal Equation to get the optimal values of theta that minimize the cost function J. As mentioned above, when I plotted the 3D scatter plot, fitting a 2D plane did not seem quite right. So I had to experiment with different polynomial equations, first trying 2nd order, 3rd order and also the sqrt.

I tried the following, where ‘O’ is Overs, ‘E’ stands for Economy Rate and ‘SR’ is the predicted Strike Rate. Theta is the computed theta from the Normal Equation. Each candidate is shown below in matrix notation

i) A linear plane
SR = [1 O E] * theta

ii) Using the sqrt function
SR = [1 sqrt(O) sqrt(E)]  * theta

iii) Using 2nd order polynomial
SR = [1 O^2 E^2] * theta

iv) Using the 3rd order polynomial
SR = [1 O^3 E^3] * theta

v) Before finally settling on
SR = [1 O E OE] * theta

where OE  = O .* E

The last one seemed to give me the lowest cost and also seemed the most logical visual choice.
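
The five candidates differ only in how each (O, E) pair is expanded into a feature vector. A minimal Python sketch of that expansion is given below. The dictionary keys and the sample spells are illustrative assumptions. Fitting theta for each candidate and comparing the cost is left out here.

```python
import math

# One feature map per candidate hypothesis, (O, E) -> row of the design matrix.
FEATURE_MAPS = {
    "linear plane": lambda o, e: [1.0, o, e],
    "sqrt": lambda o, e: [1.0, math.sqrt(o), math.sqrt(e)],
    "2nd order": lambda o, e: [1.0, o ** 2, e ** 2],
    "3rd order": lambda o, e: [1.0, o ** 3, e ** 3],
    "interaction": lambda o, e: [1.0, o, e, o * e],
}


def design_matrix(spells, feature_map):
    """Build the matrix X with one feature row per (overs, economy) spell."""
    return [feature_map(o, e) for o, e in spells]


# Hypothetical spells, for illustration only.
spells = [(16.0, 2.25), (25.0, 4.0)]
for name, feature_map in FEATURE_MAPS.items():
    print(name, design_matrix(spells, feature_map))
```

Since only the feature map changes, the same Normal Equation routine can be reused for every candidate and the resulting costs compared directly.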

A good resource to play around with different functions and check out the shapes of combinations of variables and polynomial order of equation is at WolframAlpha: Plotting and Graphics

Note 1: The fit with the Normal Equation has been performed on the entire data set (approx 220 for Kumble & Kapil, and 99 for Chandra). The proper process for verifying a Machine Learning algorithm is to split the data set into 60% training data, 20% cross validation data and 20% test data. We need to validate the prediction function against the cross-validation set, fine tune it and finally ensure that it fits the test set samples well. However, this split was not done as the data set itself was very small. The entire data set was used to perform the optimal surface fit
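
For reference, the 60/20/20 split described in Note 1 could look like the sketch below. It was not done for this post. The function name and the fixed seed are assumptions, and the rows are stand-in integers rather than the real match data.

```python
import random


def split_60_20_20(rows, seed=0):
    """Shuffle the rows, then cut them into training, cross validation and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(0.6 * len(rows))
    n_cv = int(0.2 * len(rows))
    train = rows[:n_train]
    cross_validation = rows[n_train:n_train + n_cv]
    held_out = rows[n_train + n_cv:]
    return train, cross_validation, held_out


# 99 stand-in rows, the size of Chandra's data set.
train, cross_validation, held_out = split_60_20_20(range(99))
print(len(train), len(cross_validation), len(held_out))
```

With only 99 rows the cross validation and test sets would hold about 20 samples each, which is why the split was skipped here.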

Note 2: The optimal theta values have been chosen with a feature vector that is of the form
[1 x y x .* y]. The ‘Surface of Bowling Effectiveness’ has been plotted above. It may appear that there is a ‘high bias’ in the fit and that an even better fit could be obtained by choosing higher order polynomials like
[1 x y x .* y x^2 y^2 (x^2) .* y x .* (y^2)] or
[1 x y x .* y x^2 y^2 x^3 y^3] etc.
While we can get a better fit, we could run into the problem of ‘high variance’, and without the cross validation and test set we will not be able to verify the results. Hence the simpler option [1 x y x .* y] was chosen

The Octave code outline and the data used can be cloned from GitHub at ml-bowling-analyze


1) Predicted wickets: The predicted number of wickets is higher at lower economy rates
2) Comparing performances: There are different ways of looking at the results. One possible way is to check, for a particular number of overs and economy rate, who is most effective. Here is one way: taking a small slice from each bowler’s predicted wickets table for an Economy Rate = 4.0, the predicted wickets are


From the above it does appear that Kapil is definitely more effective than the other two. However, one could slice and dice in different ways, maybe looking for the most economical bowler for a given number of overs and wickets, or the wickets taken in the fewest overs, etc. Do add your thoughts and comments on my assessment or analysis

Also see
1. Analyzing cricket’s batting legends – Through the mirage with R
2. Masters of spin: Unraveling the web with R

You may also like
1. A peek into literacy in India:Statistical learning with R
2. A crime map of India in R: Crimes against women
3.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1