A peek into literacy in India: Statistical Learning with R

In this post I take a peek into the literacy landscape across India as a whole using R language.  The dataset from Open Government Data (OGD) platform India was used for this purpose. This data is based on the 2011 census. The XL sheets for the states were downloaded for data for each state. The Union Territories were not included in the analysis.

A thin slice of the data from each data set was taken from the data for each individual state (Note: This could also have been done from the consolidated india.xls XL sheet which I came to know of, much later).

I calculate the following for age group

Males (%) attending education institutions = (Males attending educational institutions * 100)/ Total males
Females (%) attending education institutions = (Females attending educational institutions * 100)/ Total Females

This is then plotted as a bar chart with the age distribution. I then overlay the national average for each state over the barchart to check whether the literacy in the state is above or below the national average. The implementation in R is included below

The code and data can be forked/cloned from GitHub at india-literacy

The results based on the analysis is given below.

  1. Kerala is clearly the top ranker with the literacy rates for both males and females well above the average
  2. The states with above average literacy are – Kerala, Himachal Pradesh, Uttarakhand, Tamil Nadu, Haryana, Himachal Pradesh, Karnataka, Maharashtra, Punjab, Uttarakhand
  3. The states with just about average literacy – Karnataka, Andhra Pradesh, Chattisgarh, Gujarat, Madhya Pradesh, Odisha, West Bengal
  4. The states with below average literacy – Uttar Pradesh, Bihar, Jharkhand, Arunachal Pradesh, Assam, Jammu and Kashmir, Jharkhand, Rajasthan


A brief implementation of the basic code in R is shown bwelow

# Read the Arunachal Pradhesh literacy related data
arunachal = read.csv("arunachal.csv")
# Create as a matrix
arunachalmat = as.matrix(arunachal)
arunachalTotal = arunachalmat[2:19,7:28]
# Take transpose as this is necessary for plotting bar charts
arunachalmat = t(arunachalTotal)
# Set the scipen option to format the y axis (otherwise prints as e^05 etc.)
opt <- options("scipen" = 20)
#Create a vector of total Males & Females
arunachalTotalM = arunachalmat[3,]
arunachalTotalF = arunachalmat[4,]
#Create a vector of males & females attending education institution
arunachalM = arunachalmat[6,]
arunachalF = arunachalmat[7,]
#Calculate percent of males attending education of total
arunachalpercentM = round(as.numeric(arunachalM) *100/as.numeric(arunachalTotalM),1)
barplot(arunachalpercentM,names.arg=arunachalmat[1,],main ="Percentage males attending educational institutions in Arunachal Pradesh",
xlab = "Age", ylab= "Percentage",ylim = c(0,100), col ="lightblue", legend= c("Males"))
legend( x="bottomright",
legend=c("National average"),
col=c("red"), bty="n" , lwd=1, lty=c(2),
pch=c(15) )
#Calculate percent of females attending education of total
arunachalpercentF = round(as.numeric(arunachalF) *100/as.numeric(arunachalTotalF),1)
barplot(arunachalpercentF,names.arg=arunachalmat[1,],main ="Percentage females attending educational institutions in Arunachal Pradesh ",
xlab = "Age", ylab= "Percentage", ylim = c(0,100), col ="lightblue", legend= c("Females"))
legend( x="bottomright",
legend=c("National average"),
col=c("red"), bty="n" , lwd=1, lty=c(2),
pch=c(15) )

A) Overall plot for India

a) India – Males


b) India – females


The plots for each individual state is given below

1) Literacy in Tamil Nadu

Tamil Nadu is slightly over the national average. The women seem to do marginally better than the males

a) Tamil Nadu – males


b) Tamil Nadu – females


2) Literacy in Uttar Pradesh

UP is slightly below the national average. Women are comparatively below men here

a) Uttar Pradesh – males


b) Uttar Pradesh – females


3) Literacy in Bihar

Bihar is well below the national average for both men and women

a) Bihar – males


b) Bihar – females


4. Literacy in Kerala

Kerala is the winner all the way in literacy with almost 100% literacy across all age groups

a) Kerala – males


b) Kerala -females



5. Literacy in Andhra Pradesh

AP just meets the national average for literacy.

a) Andhra Pradesh – males


b) Andhra Pradesh – females


6. Literacy in Arunachal Pradesh

Arunachal Pradesh is below average for most of the age groups

a) Arunachal Pradesh – males


b) Arunachal Pradesh – females


7. Literacy in  Assam

Assam is below national average

a) Assam – males


b) Assam – females



8. Literacy in Chattisgarh

Chattisgarh is on par with the national average for both men and women

a) Chattisgarh – males


b) Chattisgarh – females



9. Literacy in Gujarat

Gujarat is just about average

a) Gujarat – males


b) Gujarat – females


10. Literacy in Haryana

Haryana is slightly above average

a) Haryana – males


b) Haryana – females

haryana-females11.  Literacy in Himachal Pradesh

Himachal Pradesh is cool and above average.

a) Himachal Pradesh – males



b) Himachal Pradesh – females


12. Literacy in Jammu and Kashmir

J & K is marginally below average

a) Jammu and Kashmir – males


b) Jammu and Kashmir – females



13. Literacy in Jharkhand

Jharkhand is some ways below average

a) Jharkhand – males


b) Jharkhand – females


14. Literacy in Karnataka

Karnataka is on average for men. Womem seem to do better than men here

a) Karnataka – males


b) Karnataka – females


15. Literacy in Madhya Pradesh

Madhya Pradesh meets the national average

a) Madhya Pradesh – males


b) Madhya Pradesh – females


16. Literacy in Maharashtra

Maharashtra is front-runner in literacy

a) Maharashtra – females


b) Maharastra – females



17. Literacy in Odisha

Odisha meets national average

a) Odisha – males


b) Odisha – females



18. Literacy in  Punjab

Punjab is marginally above average with women doing even better

a) Punjab – males


b) Punjab – females

punjab-females19. Literacy in Rajasthan

Rajasthan is average for males and below average for females

a) Rajasthan – males


b) Rajasthan – females

rajasthan-females20. Literacy in Uttarakhand

Uttarakhand rocks and is above average

a) Uttarakhand – males


b) Uttarakhand – females


21. Literacy in West Bengal

West Bengal just about meets the national average.

a) West Bengal – males



b) West Bengal – females


The code can be cloned/forked from GitHub  india-literacy. I have done my analysis on the overall data. The data is further sub-divided across districts in each state and further into urban and rural. Many different ways of analysing are possible. One method is shown here


  1. Kerala is clearly head and shoulders above all states when it comes to literacy
  2. Many states are above average. They are Kerala, Himachal Pradesh, Uttarakhand, Tamil Nadu, Haryana, Himachal Pradesh, Karnataka, Maharashtra, Punjab, Uttarakhand
  3. States with average literacy are – Karnataka, Andhra Pradesh, Chattisgarh, Gujarat, Madhya Pradesh, Odisha, West Bengal
  4. States which fall below the national average are – Uttar Pradesh, Bihar, Jharkhand, Arunachal Pradesh, Assam, Jammu and Kashmir, Jharkhand, Rajasthan

See also
– A crime map of India in R: Crimes against women
– What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
– Bend it like Bluemix, MongoDB with autoscaling – Part 1

Statistical learning with R: A look at literacy in Tamil Nadu

In this post I make my first foray into data mining using the R language. As a start, I picked up the data from the Open Government Data (OGD Platform of India from the Ministry of Human Resources. There are many data sets under Education. To get started I picked the data set on Tamil Nadu which deals with the population attending educational institutions by age, sex and institution type. Similar data is available for all states.

I wanted to start off on a small scale, primarily to checkout some of the features of the R language. R is clearly the language of choice for processing large amounts of data. R has close to 4000+ packages that can do various things like statistical, regression analysis etc. However I found this is no easy task. There are a zillion ways in which you can take cross-sections of a large dataset. Some of them will provide useful insights while others will lead you nowhere.

Data science, which is predicted to be the technology of the future based on with the mountains of data being generated daily, will in my opinion, will be more of an art and less of a science. There will be wizards who will be able to spot remarkable truths in the mundane data while others will not be that successful.

Anyway back to my attempt to divine intelligence in the Tamil Nadu(TN)  literacy data. The data downloaded was an Excel sheet with 1767 rows and 28 columns. The first 60 rows deal with the overall statistics of literacy in Tamil Nadu state as a whole. Further below are the statistics on the individual districts of Tamil nadu.
Each of this is further divided into urban and rural parts. The data covers persons from the age of 4 upto the age of 60 and whether they attended school, college, vocational institute etc. To make my initial attempt manageable I have just focused on the data for Tamil Nadu state as a whole including the breakup of the urban and rural data.

My analysis is included below. The code and the dataset for this implementation is in R language and can be cloned from GitHub at tamilnadu-literacy-analysis

Analysis of Tamil Nadu (total)
The total population of Tamil Nadu based on an age breakup is shown below

1) Total population Tamil Nadu 

2) Males  & Females attending education institutions in TN

tneduThere are marginally more males attending educational institutions. Also the number of persons attending educational institutions seems to drop from 11 years of age. There is a spike around 20-24 years and people go to school and college at this age. See pie chart 8) below

3) Percentage of males attending educational institution of the total males


4) Percentage females attending educational institutions in TN 


There is a very similar trend between males and females. The attendance peaks between 9 – 11 years of age and then falls to roughly 50% around 15-19 years and rapidly falls off

5) Boys and girls attending school in TN


For some reason there is a marked increase for boys and girls around 20-24. Possibly people repeat classes around this age

6) Persons attending college in TN


7) Educational institutions attended by persons between 15- 19 years


8) Educational institutions attended by persons between 20-24 tnschool-2

As can be seen there is a large percentage (30%)  of people in the 20-24 age group who are in school. This is probably the reason for the spike in “Boys and girls attending school in TN” see 2) for the 20-24 years of age

Education in rural Tamil Nadu

1) Total rural population Tamil Nadu 


2) Males  & Females attending education institutions in  rural TNruraledu3) Percentage of rural males attending educational institution


4) Percentage females attending educational institutions in rural TN of total females


The persons attending education drops rapidly to 40% between 15-19 years of age for both males and females

5) Boys and girls attending school in rural TN


6) Persons attending college in rural TN


7) Educational institutions attended by persons between 15- 19 years in rural TN


8) Educational institutions attended by persons between 20-24 in rural TN


As can be seen there is a large percentage (39%)  of rural people in the 20-24 age group who are in school

Education in urban Tamil Nadu

1) Total population in urban Tamil Nadu 


2) Males  & Females attending education institutions in urban TNurbanedu3) Percentage of males attending educational institution of the total males in urban TNpercentruralM4) Percentage females attending educational institutions in urban TN


5) Boys and girls attending school in urban TN


6) Persons attending college in urban TN


7) Educational institutions attended by persons between 15- 19 years in urban TN


8) Educational institutions attended by persons between 20-24 in urban TN



As can be seen there is a large percentage (25%) of rural people in the 20-24 age group who are in school

The R implementation and the Tamil Nadu dataset can be cloned from my repository in GitHub at tamilnadu-literacy-analysis 

The above analysis is just one of a million possible ways the data can be analyzed and visually represented. I hope to hone my skill as progress along in similar analysis.

Hasta la vista! I’ll be back.

Watch this space!

The rise of analytics

Published in The Hindu – The rise of analytics

We are slowly, but surely, heading towards the age of “information overload”. The Sloan Digital Sky Survey started in the year 2000 returned around 620 terabytes of data in 11 months — more data than had ever been amassed in the entire history of astronomy.

The Large Hadron Collider (LHC) at CERN, Europe’s particle physics laboratory, in Geneva will during its search for the origins of the universe and the elusive Higgs particle, early next year, spew out terabytes of data in its wake. Now there are upward of five billion devices connected to the Internet and the numbers are showing no signs of slowing down.

A recent report from Cisco, the data networking giant, states that the total data navigating the Net will cross 1/2 a zettabyte (10 {+2} {+1}) by the year 2013.

Such astronomical volumes of data are also handled daily by retail giants including Walmart and Target and telcos such as AT&T and Airtel. Also, advances in the Human Genome Project and technologies like the “Internet of Things” are bound to throw up large quantities of data.

The issue of storing data is now slowly becoming non-existent with the plummeting prices of semi-conductor memory and processors coupled with a doubling of their capacity every 18 months with the inevitability predicted by Moore’s law.

Plumbing the depths

Raw data is by itself quite useless. Data has to be classified, winnowed and analysed into useful information before if it can be utilised. This is where analytics and data mining come into play. Analytics, once the exclusive preserve of research labs and academia, has now entered the mainstream. Data mining and analytics are now used across a broad swath of industries — retail, insurance, manufacturing, healthcare and telecommunication. Analytics enables the extraction of intelligence, identification of trends and the ability to highlight the non-obvious from raw, amorphous data. Using the intelligence that is gleaned from predictive analytics, businesses can make strategic game-changing decisions.

Analytics uses statistical methods to classify data, determine correlations, identify patterns, and highlight and detect deviations among large data sets. Analytics includes in its realms complex software algorithms such as decision trees and neural nets to make predictions from existing data sets. For e.g. a retail store would be interested in knowing the buying patterns of its consumers. If the store could determine that product Y is almost always purchased when product X is purchased then the store could come up with clever schemes like an additional discount on product Z when both products X & Y are purchased. Similarly, telcos could use analytics to identify predominant trends that promote customer loyalty.

Studying behaviour

Telcos could come with voice and data plans that attract customers based on consumer behaviour, after analysing data from its point of sale and retail stores. They could use analytics to determine causes for customer churn and come with strategies to prevent it.

Analytics has also been used in the health industry in predicting and preventing fatal infections in infants based on patterns in real-time data like blood pressure, heart rate and respiration.

Analytics requires at its disposal large processing power. Advances in this field have been largely fuelled by similar advances in a companion technology, namely cloud computing. The latter allows computing power to be purchased on demand almost like a utility and has been a key enabler for analytics.

Data mining and analytics allows industries to plumb the data sets that are held in the organisations through the process of selecting, exploring and modelling large amount of data to uncover previously unknown data patterns which can be channelised to business advantage.

Analytics help in unlocking the secrets hidden in data and provide real insights to businesses; and enable businesses and industries to make intelligent and informed choices.

In this age of information deluge, data mining and analytics are bound to play an increasingly important role and will become indispensable to the future of businesses.

Find me on Google+

Cloud, analytics key tools for today’s telcos

Published in Telecom Asia Aug 20, 2010 – http://bit.ly/dxKbsR

Operators facing dwindling revenue from wireline subscribers, fierce tariff wars and exploding mobile data traffic are continually being pressured to do more for less. Spending on infrastructure is increasing as they look to provide better service within slender budgets.
In these tough times telcos have to devise new and innovative strategies and make judicious technology choices. Two promising technologies, cloud computing and analytics, are shaping up as among the best choices to make.
Cloud architecture does away with the worry of planning the computing resources needed, the real estate, the costs of the acquiring them and thoughts of its obsolescence. It allows the CSPs to purchase processing power, platforms and databases almost as a utility like electricity or water.
Cloud consumers only pay for what they use. The magic of this promising technology is the elasticity that the cloud provides – it expands to accommodate increasing demands and contracts when the demand drops.
The cloud architectures of Amazon, Google and Microsoft – currently the three biggest cloud providers – vary widely in their capabilities and features. These strengths and weaknesses should be taken into account while planning a cloud system. Each is best suited for only a certain class of applications unique to each individual cloud provider.
On one end of the spectrum Amazon’s EC2 (Elastic Compute Cloud) provides a virtual machine and a wealth of associated tools for storage and notifications. But the trade-off for increased flexibility is that users must take responsibility for designing resiliency into their systems.
On the other end is Google’s App Engine, a highly scalable cloud architecture that handles failures but is a lot more restrictive. Microsoft’s Azure is based on the .NET architecture and in terms of flexibility and features lies between these two.
When implementing such architecture, an organization should take a long hard look its computing software inventory to decide which applications are worthy of migrating to the cloud. The best candidates are processing intensive in-house applications that deliver standardized functionality and interface, and whose software architecture is made up of loosely coupled communicating systems.
Applications that deal with sensitive data should be retained within the organization’s internal computing infrastructure, because security is currently the most glaring issue with the cloud. Cloud providers do provide various levels of security to users, but this is an area in keen need of standardization.
But if the CSP decides to build components of an OSS system – rather than buying a pre-packaged system – it makes good business sense to develop for the cloud.
A cloud-based application must have a few essential properties. First, it is preferable if the application was designed on SOA principles. Second, it should be loosely coupled. And lastly, it needs to be an application that can be scaled rapidly up or down based on the varying demands.
The other question is which legacy systems can be migrated. If the OSS/BSS systems are based on commercial off-the-shelf systems these can be excluded, but an offline bill processing system, for example, is typically a good candidate for migration.
Mining wisdom from data
The cloud can serve as the perfect companion for another increasingly vital operational practice – data analytics. The cloud is capable of modeling large amounts of data, and running models to process and analyze this data. It is possible to run thousands of simultaneous instances on the cloud and mine for business intelligence in the oceans of telecom data operators generate.
Today’s CSP maintains software systems generating all kinds of customer data, covering areas ranging from billing and order management to POS, VAS and provisioning. But perhaps the largest and richest vein of subscriber information is the call detail records database.
All this data is worthless, though, if it cannot be mined and analyzed. Formal data mining and data analytics tools can be used to identify patterns and trends that will allow operators to make strategic, knowledge-driven decisions.
Analytics involves many complex areas like predictive analytics, neural nets, decision trees and classification. Some of the approaches used in data analytics include prediction, deviation detection, degree of influence and classification.
With the intelligence that comes through analytics it is possible to determine customer buying patterns, identify causes for churn and develop strategies to promote loyalty. Call patterns based on demography or time of day will enable the CSPs to create innovative tariff schemes.
Determining the relations and buying patterns of users will provide opportunities for up-selling and cross-selling. The ability to identify marked deviation in customer behavior patterns help the CSP in deciding ahead of time whether this trend is a warning bell or an opportunity waiting to be tapped.
Tinniam V Ganesh

Find me on Google+