Statistical learning with R: A look at literacy in Tamil Nadu

In this post I make my first foray into data mining using the R language. As a start, I picked up the data from the Open Government Data (OGD) Platform of India, from the Ministry of Human Resources. There are many data sets under Education. To get started I picked the data set on Tamil Nadu, which deals with the population attending educational institutions by age, sex and institution type. Similar data is available for all states.

I wanted to start off on a small scale, primarily to check out some of the features of the R language. R is clearly the language of choice for processing large amounts of data. R has about 4000 packages that can do various things such as statistical analysis, regression analysis etc. However, I found that this is no easy task. There are a zillion ways in which you can take cross-sections of a large dataset. Some of them will provide useful insights while others will lead you nowhere.

Also see my post Literacy in India – A deepR dive!

Data science, which is predicted to be the technology of the future given the mountains of data being generated daily, will in my opinion be more of an art and less of a science. There will be wizards who are able to spot remarkable truths in mundane data, while others will not be as successful.

Anyway, back to my attempt to divine intelligence in the Tamil Nadu (TN) literacy data. The data downloaded was an Excel sheet with 1767 rows and 28 columns. The first 60 rows deal with the overall statistics of literacy in Tamil Nadu state as a whole. Further below are the statistics on the individual districts of Tamil Nadu.
Each of these is further divided into urban and rural parts. The data covers persons from the age of 4 up to the age of 60 and whether they attended school, college, a vocational institute etc. To make my initial attempt manageable I have focused only on the data for Tamil Nadu state as a whole, including the breakup into urban and rural data.
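The row selection described above can be sketched in base R. This is only an illustration on a toy data frame: the column names (area, region, age_group, persons) and every number are made up and are not taken from the OGD spreadsheet. With the real file, the sheet would first have to be read in, for example with read_excel() from the readxl package, and the actual column names substituted.

```r
# Toy stand-in for the spreadsheet. Column names and all numbers are
# invented for illustration; they are NOT values from the OGD file.
literacy <- data.frame(
  area      = c("Tamil Nadu", "Tamil Nadu", "Tamil Nadu", "District A", "District A"),
  region    = c("Total", "Rural", "Urban", "Total", "Rural"),
  age_group = c("15-19", "15-19", "15-19", "15-19", "15-19"),
  persons   = c(1000, 600, 400, 100, 70),
  stringsAsFactors = FALSE
)

# Keep only the rows for the state as a whole, dropping the district rows.
state <- literacy[literacy$area == "Tamil Nadu", ]

# Split the state-level rows into the Total / Rural / Urban breakup.
by_region <- split(state, state$region)

print(names(by_region))
print(by_region$Rural)
```

Splitting once into a named list keeps the total, rural and urban parts in one object, so the same plotting code can be reused for each part.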

My analysis is included below. The code, written in R, and the dataset can be cloned from GitHub at tamilnadu-literacy-analysis

Analysis of Tamil Nadu (total)
The total population of Tamil Nadu based on an age breakup is shown below

1) Total population Tamil Nadu 
tntotal

2) Males & females attending educational institutions in TN

tnedu

There are marginally more males attending educational institutions. Also, the number of persons attending educational institutions seems to drop from 11 years of age. There is a spike around 20-24 years, as people go to both school and college at this age. See pie chart 8) below.
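A side-by-side comparison of males and females by age group can be drawn with a grouped bar plot. The following is a minimal sketch in base R; the age groups and counts are placeholders invented for the example, not figures from the Tamil Nadu data.

```r
# Invented counts per age group (placeholders, not the real TN figures).
counts <- rbind(
  Males   = c(500, 480, 300, 120),
  Females = c(480, 470, 280, 100)
)
colnames(counts) <- c("5-9", "10-14", "15-19", "20-24")

# Grouped bar plot: one pair of bars (males, females) per age group.
barplot(counts,
        beside = TRUE,
        legend.text = TRUE,
        xlab = "Age group",
        ylab = "Persons attending",
        main = "Males and females attending educational institutions (toy data)")

# Difference between males and females in each age group.
gap <- counts["Males", ] - counts["Females", ]
print(gap)
```

With beside = TRUE the bars for the two rows of the matrix are placed next to each other rather than stacked, which makes a marginal difference between the sexes easier to see.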

3) Percentage of males attending educational institutions, of the total males in TN

percenteduM

4) Percentage females attending educational institutions in TN 

percenteduF

There is a very similar trend for males and females. The attendance peaks between 9 and 11 years of age, falls to roughly 50% around 15-19 years, and then drops off rapidly.
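The percentages in charts 3) and 4) are the number of persons attending divided by the total number of persons in that age group. A small sketch of that calculation in base R follows. The data frame, its column names and all counts are invented for illustration and only mimic the shape of the trend, not the actual values.

```r
# Invented totals and attendance counts per age group (not the real data).
males <- data.frame(
  age_group = c("9-11", "15-19", "20-24"),
  total     = c(2000, 2500, 3000),
  attending = c(1900, 1250, 450)
)

# Percentage attending = attending / total * 100, rounded to one decimal.
males$pct_attending <- round(100 * males$attending / males$total, 1)
print(males)

# Plot the percentage per age group on a fixed 0-100 scale.
barplot(males$pct_attending,
        names.arg = males$age_group,
        ylim = c(0, 100),
        xlab = "Age group",
        ylab = "% attending",
        main = "Percentage of males attending (toy data)")
```

Fixing the y-axis to 0-100 makes the charts for males and females directly comparable.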

5) Boys and girls attending school in TN

tnschool

For some reason there is a marked increase for boys and girls around 20-24 years. Possibly people repeat classes around this age.

6) Persons attending college in TN

tncollege

7) Educational institutions attended by persons between 15-19 years

tnschool-1

8) Educational institutions attended by persons between 20-24 years

tnschool-2

As can be seen, a large percentage (30%) of people in the 20-24 age group are in school. This is probably the reason for the spike at 20-24 years of age in “Boys and girls attending school in TN”, see 5).
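The share of each institution type within one age group, as in pie charts 7) and 8), can be computed from the raw counts. Here is a minimal sketch in base R; the institution labels and counts are assumptions made for the example and do not come from the dataset.

```r
# Invented counts for a single age group (placeholders, not the TN data).
attending_20_24 <- c(School = 200, College = 700, "Vocational institute" = 100)

# Convert counts to percentage shares of the age group's attendees.
share <- round(100 * prop.table(attending_20_24), 1)
print(share)

# Label each slice with its name and percentage, then draw the pie chart.
slice_labels <- paste0(names(share), " (", share, "%)")
pie(attending_20_24,
    labels = slice_labels,
    main = "Institutions attended, 20-24 years (toy data)")
```

Putting the percentage into the slice label avoids having to read the share off the area of the slice.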

Education in rural Tamil Nadu

1) Total rural population Tamil Nadu 

ruraltotal

2) Males & females attending educational institutions in rural TN

ruraledu

3) Percentage of rural males attending educational institutions

percentruralM

4) Percentage females attending educational institutions in rural TN of total females

percentruralF

The percentage of persons attending educational institutions drops rapidly to 40% between 15-19 years of age for both males and females.

5) Boys and girls attending school in rural TN

ruralschool

6) Persons attending college in rural TN

ruralcollege

7) Educational institutions attended by persons between 15-19 years in rural TN

rural-1

8) Educational institutions attended by persons between 20-24 in rural TN

rural-2

As can be seen, a large percentage (39%) of rural people in the 20-24 age group are in school.

Education in urban Tamil Nadu

1) Total population in urban Tamil Nadu 

urbantotal

2) Males & females attending educational institutions in urban TN

urbanedu

3) Percentage of males attending educational institutions, of the total males in urban TN

percentruralM

4) Percentage of females attending educational institutions in urban TN

percenturbanF

5) Boys and girls attending school in urban TN

urbanschool

6) Persons attending college in urban TN

urbancollege

7) Educational institutions attended by persons between 15-19 years in urban TN

urban-1

8) Educational institutions attended by persons between 20-24 in urban TN

urban-2

As can be seen, a large percentage (25%) of urban people in the 20-24 age group are in school.

The R implementation and the Tamil Nadu dataset can be cloned from my repository in GitHub at tamilnadu-literacy-analysis 

The above analysis is just one of a million possible ways the data can be analyzed and visually represented. I hope to hone my skills as I progress with similar analyses.

Hasta la vista! I’ll be back.

Watch this space!

3 thoughts on “Statistical learning with R: A look at literacy in Tamil Nadu”

    1. Ankit,
      Sure. Go ahead and clone the code from GitHub. Also take a look at my next article, where I have done the same for other states, though not with all the charts.

      Please drop me a note (link) when you are done

      Regards
      Ganesh
