A crime map of India in R – Crimes against women

In this post I take a look at the gory crime scene across India to determine which states are the heavyweights in crime. Which state is the undisputed champion in rapes in a year? Which state excels in cruelty by husbands and relatives towards wives? Which state leads in dowry deaths? To answer these questions I analyze the state-wise data on crimes against women from the Open Government Data (OGD) Platform India. The dataset used for this analysis is the 'Crime against Women' dataset from OGD.

(Do see my post Revisiting crimes against women in India which includes an interactive Shiny app)

The data in OGD is available for crimes against women in different states under different 'crime heads' like rape, dowry deaths, kidnapping & abduction etc., for the years 2001 to 2012. For each state and crime head the data is plotted as a scatter plot and a linear regression line is fit to it. Based on this linear model, the incidence of crimes like rapes, dowry deaths and kidnapping & abduction is projected for each of the states. This is then used to build a table of the different crime heads for all the states, predicting the number of crimes up to the year 2018. Fortunately, R crunches through the data sets quite easily. The overall projections of crimes against women, based on the linear regression for each state, are shown below.

Projections over the next couple of years
The tables below are based on the projected incidence of crimes under various categories, assuming that these states maintain their torrid crime rate. A cursory look at the tables clearly indicates that Uttar Pradesh is the undisputed heavyweight champion in 4 of the 5 categories shown. Maharashtra and Andhra Pradesh take the 2nd and 3rd ranks in total crimes against women and are significant contenders in the other categories too.

A) Projected rapes in India
The top 3 heavyweights in projected rapes over the next 5 years are 1) Madhya Pradesh 2) Uttar Pradesh 3) Maharashtra

rapes

Full table: Rape.csv
B) Projected Dowry deaths in India 
dowrydeaths

Full table: Dowry Deaths.csv
C) Kidnapping & Abduction
kidnapping

Full table: Kidnapping&Abduction.csv
D) Cruelty by husband & relatives
cruelty

Full table: Cruelty by husbands_relatives.csv
E) Total crimes against women

total

Full table: Total crimes.csv
Here is a visualization of ‘Total crimes against women’  created as a choropleth map
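
(For reference, here is a minimal sketch of one way such a choropleth can be built with the sf and ggplot2 packages. This is not the code used for the map above; the shapefile name, its NAME column and the totals data frame are assumptions.)

# A minimal sketch of building a choropleth of total crimes by state
# (assumed shapefile, column names and 'totals' data frame; not the original code)
library(sf)
library(ggplot2)

states <- st_read("india_states.shp")                       # assumed state-boundary shapefile
states <- merge(states, totals, by.x = "NAME", by.y = "State")  # 'totals' has State & TotalCrimes

ggplot(states) +
  geom_sf(aes(fill = TotalCrimes)) +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  ggtitle("Total crimes against women (projected)")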

The implementation for this analysis was done using the R language. The R code, dataset, output and the crime charts can be accessed on GitHub at crime-against-women

Directory structure
– R code
– dataset used
– output
– statewise-crime-charts

The analysis has been completely parametrized. A quick look at the implementation is shown below. A function statecrime() was created as given below.

statecrime.R
This function (statecrime.R) does the following
a) Creates a scatter plot of the crime head for the state
b) Computes the best linear regression fit and draws this line
c) Uses the model parameters (coefficients) to compute the projected crime in the years to come
d) Writes the projected values to a text file
e) Creates a directory with the name of the state if it does not exist and stores the jpeg of the plot there.

statecrime <- function(indiacrime, row, state, crime) {
year <- c(2001:2012)
# Note: 'thecrime' (the yearly counts for this state and crime head), the plot
# title 'atitle' and the y-limits 'ymin'/'ymax' are derived from 'indiacrime'
# and 'row' in the full function on GitHub (not shown in this excerpt)
# Make separate folders for each state
if(!file.exists(state)) {
dir.create(state)
}
setwd(state)
crimeplot <- paste(crime,".jpg")
jpeg(crimeplot)

# Plot the details of the crime
plot(year, thecrime, pch= 15, col="red", xlab = "Year", ylab= crime, main = atitle,
xlim=c(2001,2018), ylim=c(ymin,ymax), axes=FALSE)

A linear regression line is fit using ‘lm’

# Fit a linear regression model
lmfit <-lm(thecrime~year)
# Draw the lmfit line
abline(lmfit)

The model parameters are then used to draw the line and also to project the crimes for the years 2013 to 2018

nyears <-c(2013:2018)
nthecrime <- rep(0,length(nyears))
# Projected crime incidents from 2013 to 2018 using a linear regression model
for (i in seq_along(nyears)) {
nthecrime[i] <- lmfit$coefficients[2] * nyears[i] + lmfit$coefficients[1]
}
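
Incidentally, the same projection can be obtained with R's predict() function on the fitted model instead of multiplying out the coefficients by hand. A small self-contained sketch (the crime counts below are hypothetical, purely for illustration):

# Projection via predict() on the linear model (hypothetical counts shown)
year      <- 2001:2012
thecrime  <- c(1500, 1540, 1610, 1650, 1700, 1760, 1800, 1870, 1910, 1950, 2020, 2080)
lmfit     <- lm(thecrime ~ year)
nyears    <- 2013:2018
nthecrime <- round(predict(lmfit, newdata = data.frame(year = nyears)), 2)
print(nthecrime)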

The projected data for each state is appended to an appropriate file, which is then used to display the tables at the top of this post

# Write the projected crime rate in a file
nthecrime <- round(nthecrime,2)
nthecrime <- c(state, nthecrime, "\n")
print(nthecrime)
#write(nthecrime,file=fileconn, ncolumns=9, append=TRUE,sep="\t")
filename <- paste(crime,".txt")
# Write the output in the ./output directory
setwd("./output")
cat(nthecrime, file=filename, sep=",",append=TRUE)

The above function is then called repeatedly for each state and for the different crime heads. (Note: It is possible to read both the states and the crime heads programmatically with R and loop over the computation; a sketch of such a loop is shown after the listing below. However, I have done this the manual way!)

crimereport.R
# 1. Andhra Pradesh
i <- 1
statecrime(indiacrime, i, "Andhra Pradesh","Rape")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Kidnapping& Abduction")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Dowry Deaths")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Assault on Women")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Insult to modesty")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Cruelty by husband_relatives")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Importation of girls from foreign country")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Immoral traffic act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Dowry prohibition act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Indecent representation of Women Act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Commission of Sati Act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Total crimes against women")
...
...

and so on for all the states
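
For completeness, here is one way the repeated calls could be automated (a sketch only, not the code in the repository). The crime head names are taken from the calls above, and it assumes the 38-row stride per crime head, and the ordering of states within each block, that the manual increments of i suggest.

# A possible automation of the repeated calls above (sketch only; assumes the
# same 38-row stride per crime head that the manual calls use, and that each
# state occupies the same row position within every 38-row block)
crimeheads <- c("Rape", "Kidnapping& Abduction", "Dowry Deaths",
                "Assault on Women", "Insult to modesty",
                "Cruelty by husband_relatives",
                "Importation of girls from foreign country",
                "Immoral traffic act", "Dowry prohibition act",
                "Indecent representation of Women Act",
                "Commission of Sati Act", "Total crimes against women")
states <- c("Andhra Pradesh", "Arunachal Pradesh", "Assam")  # ... and the rest

for (j in seq_along(states)) {
  for (k in seq_along(crimeheads)) {
    row <- (k - 1) * 38 + j   # row of indiacrime for this state and crime head
    statecrime(indiacrime, row, states[j], crimeheads[k])
  }
}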

Charts for different crimes against women

1) Uttar Pradesh

The plots for  Uttar Pradesh  are shown below

Rapes in UP

Rape

Dowry deaths in UP

Dowry Deaths

Cruelty by husband/relative

Cruelty by husband_relatives

Total crimes against women in Uttar Pradesh

Total crimes against women

You can find more charts in GitHub by clicking Uttar Pradesh

2) Maharashtra : Some of the charts for Maharashtra

Rape

Rape

Kidnapping & Abduction

Kidnapping& Abduction

Total crimes against women in Maharashtra

Total crimes against women

More crime charts  for Maharashtra

Crime charts can be accessed for the following states from GitHub ( in alphabetical order)

3) Andhra Pradesh
4) Arunachal Pradesh
5) Assam
6) Bihar
7) Chattisgarh
8) Delhi (Added as an exception based on its notoriety)
9) Goa
10) Gujarat
11) Haryana
12) Himachal Pradesh
13) Jammu & Kashmir
14) Jharkhand
15) Karnataka
16) Kerala
17) Madhya Pradesh
18) Manipur
19) Meghalaya
20) Mizoram
21) Nagaland
22) Odisha
23) Punjab
24) Rajasthan
25) Sikkim
26) Tamil Nadu
27) Tripura
28) Uttarkhand
29) West Bengal

The code, dataset and the charts can be cloned/forked from GitHub at crime-against-women

Let me know if you find any interesting patterns in the data.
Thoughts, comments welcome!


See also
My book ‘Practical Machine Learning with R and Python’ on Amazon
A peek into literacy in India: Statistical learning with R

You may also like
– Analyzing cricket’s batting legends – Through the mirage with R
– What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
– Bend it like Bluemix, MongoDB with autoscaling – Part 1

A peek into literacy in India: Statistical Learning with R

In this post I take a peek into the literacy landscape across India using the R language. The dataset from the Open Government Data (OGD) Platform India was used for this purpose. This data is based on the 2011 census. The Excel sheets with the data for each state were downloaded. The Union Territories were not included in the analysis.

A thin slice of the data was taken from the data set for each individual state. (Note: This could also have been done from the consolidated india.xls sheet, which I came to know of much later.)

I calculate the following for each age group:

Males (%) attending educational institutions = (Males attending educational institutions * 100) / Total males
Females (%) attending educational institutions = (Females attending educational institutions * 100) / Total females

This is then plotted as a bar chart against the age distribution. I then overlay the national average on each state's bar chart to check whether literacy in the state is above or below the national average. The implementation in R is included below.

The code and data can be forked/cloned from GitHub at india-literacy

The results of the analysis are given below.

  1. Kerala is clearly the top ranker, with the literacy rates for both males and females well above the national average
  2. The states with above average literacy are – Kerala, Himachal Pradesh, Uttarakhand, Tamil Nadu, Haryana, Karnataka, Maharashtra, Punjab
  3. The states with just about average literacy are – Karnataka, Andhra Pradesh, Chattisgarh, Gujarat, Madhya Pradesh, Odisha, West Bengal
  4. The states with below average literacy are – Uttar Pradesh, Bihar, Jharkhand, Arunachal Pradesh, Assam, Jammu and Kashmir, Rajasthan

 

A brief implementation of the basic code in R is shown below.

# Read the Arunachal Pradesh literacy related data
arunachal = read.csv("arunachal.csv")
# Create as a matrix
arunachalmat = as.matrix(arunachal)
arunachalTotal = arunachalmat[2:19,7:28]
# Take transpose as this is necessary for plotting bar charts
arunachalmat = t(arunachalTotal)
# Set the scipen option to format the y axis (otherwise prints as e^05 etc.)
getOption("scipen")
opt <- options("scipen" = 20)
getOption("scipen")
# Create a vector of total Males & Females
arunachalTotalM = arunachalmat[3,]
arunachalTotalF = arunachalmat[4,]
# Create a vector of males & females attending an educational institution
arunachalM = arunachalmat[6,]
arunachalF = arunachalmat[7,]
# Note: 'age' and the national averages 'indiapercentM'/'indiapercentF' used
# below are computed earlier from the all-India data (not shown in this excerpt)
# Calculate percent of males attending education of total
arunachalpercentM = round(as.numeric(arunachalM) *100/as.numeric(arunachalTotalM),1)
barplot(arunachalpercentM,names.arg=arunachalmat[1,],main ="Percentage males attending educational institutions in Arunachal Pradesh",
xlab = "Age", ylab= "Percentage",ylim = c(0,100), col ="lightblue", legend= c("Males"))
points(age,indiapercentM,pch=15)
lines(age,indiapercentM,col="red",pch=20,lty=2,lwd=3)
legend( x="bottomright",
legend=c("National average"),
col=c("red"), bty="n" , lwd=1, lty=c(2),
pch=c(15) )
# Calculate percent of females attending education of total
arunachalpercentF = round(as.numeric(arunachalF) *100/as.numeric(arunachalTotalF),1)
barplot(arunachalpercentF,names.arg=arunachalmat[1,],main ="Percentage females attending educational institutions in Arunachal Pradesh",
xlab = "Age", ylab= "Percentage", ylim = c(0,100), col ="lightblue", legend= c("Females"))
points(age,indiapercentF,pch=15)
lines(age,indiapercentF,col="red",pch=20,lty=2,lwd=3)
legend( x="bottomright",
legend=c("National average"),
col=c("red"), bty="n" , lwd=1, lty=c(2),
pch=c(15) )

A) Overall plot for India

a) India – Males

india-males

b) India – females

india-females

The plots for each individual state are given below

1) Literacy in Tamil Nadu

Tamil Nadu is slightly over the national average. The women seem to do marginally better than the males

a) Tamil Nadu – males

tn-males

b) Tamil Nadu – females

tn-females

2) Literacy in Uttar Pradesh

UP is slightly below the national average. Women fare somewhat worse than men here

a) Uttar Pradesh – males

UP-males

b) Uttar Pradesh – females

UP-females

3) Literacy in Bihar

Bihar is well below the national average for both men and women

a) Bihar – males

bihar-males

b) Bihar – females

bihar-females

4. Literacy in Kerala

Kerala is the winner all the way in literacy with almost 100% literacy across all age groups

a) Kerala – males


kerala-females

b) Kerala -females

kerala-females

 

5. Literacy in Andhra Pradesh

AP just meets the national average for literacy.

a) Andhra Pradesh – males

andhra-males

b) Andhra Pradesh – females

andhra-females

6. Literacy in Arunachal Pradesh

Arunachal Pradesh is below average for most of the age groups

a) Arunachal Pradesh – males

arunachal-males

b) Arunachal Pradesh – females

arunachal-females

7. Literacy in  Assam

Assam is below national average

a) Assam – males

assam-males

b) Assam – females

assam-females

 

8. Literacy in Chattisgarh

Chattisgarh is on par with the national average for both men and women

a) Chattisgarh – males

chattisgarh-males

b) Chattisgarh – females

chattisgarh-females

 

9. Literacy in Gujarat

Gujarat is just about average

a) Gujarat – males

gujarat-males

b) Gujarat – females

gujarat-females

10. Literacy in Haryana

Haryana is slightly above average

a) Haryana – males

haryana-males

b) Haryana – females

haryana-females

11. Literacy in Himachal Pradesh

Himachal Pradesh is cool and above average.

a) Himachal Pradesh – males

himachal-males

 

b) Himachal Pradesh – females

himachal-females

12. Literacy in Jammu and Kashmir

J & K is marginally below average

a) Jammu and Kashmir – males

jk-males

b) Jammu and Kashmir – females

jk-females

 

13. Literacy in Jharkhand

Jharkhand is some ways below average

a) Jharkhand – males

jharkand-males

b) Jharkhand – females

jharkand-feamles

14. Literacy in Karnataka

Karnataka is about average for men. Women seem to do better than men here

a) Karnataka – males

karnataka-males

b) Karnataka – females

karnataka-females

15. Literacy in Madhya Pradesh

Madhya Pradesh meets the national average

a) Madhya Pradesh – males

mp-males

b) Madhya Pradesh – females

mp-females

16. Literacy in Maharashtra

Maharashtra is a front-runner in literacy

a) Maharashtra – males

maharashtra

b) Maharashtra – females

maharashtra-feamles

 

17. Literacy in Odisha

Odisha meets national average

a) Odisha – males

odisha-males

b) Odisha – females

odisha-females

 

18. Literacy in  Punjab

Punjab is marginally above average with women doing even better

a) Punjab – males

punjab-males

b) Punjab – females

punjab-females

19. Literacy in Rajasthan

Rajasthan is average for males and below average for females

a) Rajasthan – males

rajashthan-males

b) Rajasthan – females

rajasthan-females

20. Literacy in Uttarakhand

Uttarakhand rocks and is above average

a) Uttarakhand – males

uttarkhan-males

b) Uttarakhand – females

uttarkhand-females

21. Literacy in West Bengal

West Bengal just about meets the national average.

a) West Bengal – males

wb-males

 

b) West Bengal – females

wb-females

The code can be cloned/forked from GitHub at india-literacy. I have done my analysis on the overall data. The data is further sub-divided across districts in each state, and further into urban and rural. Many different ways of analysing it are possible. One method is shown here

Conclusion

  1. Kerala is clearly head and shoulders above all states when it comes to literacy
  2. Many states are above average. They are Kerala, Himachal Pradesh, Uttarakhand, Tamil Nadu, Haryana, Karnataka, Maharashtra and Punjab
  3. States with average literacy are – Karnataka, Andhra Pradesh, Chattisgarh, Gujarat, Madhya Pradesh, Odisha, West Bengal
  4. States which fall below the national average are – Uttar Pradesh, Bihar, Jharkhand, Arunachal Pradesh, Assam, Jammu and Kashmir, Rajasthan

See also
– A crime map of India in R: Crimes against women
– What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
– Bend it like Bluemix, MongoDB with autoscaling – Part 1

Statistical learning with R: A look at literacy in Tamil Nadu

In this post I make my first foray into data mining using the R language. As a start, I picked up the data from the Open Government Data (OGD) Platform of India, from the Ministry of Human Resources. There are many data sets under Education. To get started I picked the data set for Tamil Nadu, which deals with the population attending educational institutions by age, sex and institution type. Similar data is available for all states.

I wanted to start off on a small scale, primarily to check out some of the features of the R language. R is clearly a language of choice for processing large amounts of data. R has over 4000 packages that do various things like statistical and regression analysis etc. However, I found that this is no easy task. There are a zillion ways in which you can take cross-sections of a large dataset. Some of them will provide useful insights while others will lead you nowhere.

Also see my post Literacy in India – A deepR dive!

Data science, which is predicted to be the technology of the future given the mountains of data being generated daily, will in my opinion be more of an art and less of a science. There will be wizards who will be able to spot remarkable truths in mundane data while others will not be that successful.

Anyway, back to my attempt to divine intelligence in the Tamil Nadu (TN) literacy data. The data downloaded was an Excel sheet with 1767 rows and 28 columns. The first 60 rows deal with the overall statistics of literacy in Tamil Nadu state as a whole. Further below are the statistics on the individual districts of Tamil Nadu.
Each of these is further divided into urban and rural parts. The data covers persons from the age of 4 up to the age of 60 and whether they attended school, college, a vocational institute etc. To make my initial attempt manageable I have focused only on the data for Tamil Nadu state as a whole, including the breakup into urban and rural data.

My analysis is included below. The code and the dataset for this implementation are in the R language and can be cloned from GitHub at tamilnadu-literacy-analysis

Analysis of Tamil Nadu (total)
The total population of Tamil Nadu based on an age breakup is shown below

1) Total population Tamil Nadu 
tntotal

2) Males & Females attending educational institutions in TN

tnedu

There are marginally more males attending educational institutions. Also, the number of persons attending educational institutions seems to drop from 11 years of age. There is a spike around 20-24 years as people go to school and college at this age. See pie chart 8) below.

3) Percentage of males attending educational institution of the total males

percenteduM

4) Percentage females attending educational institutions in TN 

percenteduF

There is a very similar trend between males and females. Attendance peaks between 9-11 years of age, falls to roughly 50% around 15-19 years and then drops off rapidly

5) Boys and girls attending school in TN

tnschool

For some reason there is a marked increase for boys and girls around 20-24. Possibly people repeat classes around this age

6) Persons attending college in TN

tncollege

7) Educational institutions attended by persons between 15- 19 years

tnschool-1

8) Educational institutions attended by persons between 20-24

tnschool-2

As can be seen there is a large percentage (30%) of people in the 20-24 age group who are in school. This is probably the reason for the spike in "Boys and girls attending school in TN" (see 5 above) for the 20-24 age group
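
For illustration, here is a minimal sketch of how one such pie chart can be produced from a slice of the data. The file name and column names below are hypothetical, since the actual layout of the downloaded sheet differs (see the repository for the real code).

# A sketch of a pie chart of institution types for the 20-24 age group
# (hypothetical file and column names, for illustration only)
tn <- read.csv("tamilnadu.csv")
row2024 <- tn[tn$Age == "20-24", ]
counts <- c(School     = row2024$School,
            College    = row2024$College,
            Vocational = row2024$Vocational,
            Other      = row2024$Other)
pie(counts, main = "Educational institutions attended by persons between 20-24 in TN")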

Education in rural Tamil Nadu

1) Total rural population Tamil Nadu 

ruraltotal

2) Males & Females attending education institutions in rural TN

ruraledu

3) Percentage of rural males attending educational institution

percentruralM

4) Percentage females attending educational institutions in rural TN of total females

percentruralF

The percentage attending an educational institution drops rapidly to 40% between 15-19 years of age for both males and females

5) Boys and girls attending school in rural TN

ruralschool

6) Persons attending college in rural TN

ruralcollege

7) Educational institutions attended by persons between 15- 19 years in rural TN

rural-1

8) Educational institutions attended by persons between 20-24 in rural TN

rural-2

As can be seen there is a large percentage (39%)  of rural people in the 20-24 age group who are in school

Education in urban Tamil Nadu

1) Total population in urban Tamil Nadu 

urbantotal

2) Males & Females attending education institutions in urban TN

urbanedu

3) Percentage of males attending educational institution of the total males in urban TN

percentruralM

4) Percentage females attending educational institutions in urban TN

percenturbanF

5) Boys and girls attending school in urban TN

urbanschool

6) Persons attending college in urban TN

urbancollege

7) Educational institutions attended by persons between 15- 19 years in urban TN

urban-1

8) Educational institutions attended by persons between 20-24 in urban TN

urban-2

As can be seen there is a large percentage (25%) of urban people in the 20-24 age group who are in school

The R implementation and the Tamil Nadu dataset can be cloned from my repository in GitHub at tamilnadu-literacy-analysis 

The above analysis is just one of a million possible ways the data can be analyzed and visually represented. I hope to hone my skills as I progress with similar analyses.

Hasta la vista! I’ll be back.

Watch this space!

Reducing to the Map-Reduce paradigm- Thinking Web Scale – Part 1

In physics there are 4 types of forces – gravitational forces among celestial bodies, electromagnetic forces, and the strong and weak forces at the sub-atomic level. The equations that seem to work among large bodies don't seem to apply at the sub-atomic level, though there have been several attempts at grand unification theories.

Similarly in computing we have computing at the personal level, the enterprise level, the data-center level and the web scale level. The problems and paradigms at each level are very different and unique. The sequential processing, relational database accesses and network speeds at the local area network level are very different from the parallel processing requirements, NoSQL based storage accesses and WAN latencies at web scale.

Here is the first of my posts on paradigms at the Web Scale.

The internet now contains in excess of 1 billion hosts.  This is based on a report in the World Fact Book published in 2012.

Across these billion-odd hosts there are at least ~1.5 billion pages that have been indexed. There must be several hundred million more that are not indexed by the major search engines.

Search engines like Google, Bing or Yahoo have to work on several hundred million pages.  Similarly social web sites like Facebook, Twitter or LinkedIn have to deal with several hundred million users who constantly perform status updates, upload images, tweet etc. To handle large quantities of data efficiently and quickly there is a need for web scale algorithms.

One such algorithm is map-reduce, which had its origins at Google. Map-reduce essentially consists of a set of mappers, each of which takes as input a key-value pair and outputs 0 or more key-value pairs. The reducer takes all tuples with the same key, combines them based on some function and emits a key-value pair.

map_reduce

Map-reduce, and its open source avatar Hadoop, are now used routinely to solve several large scale problems. To be honest, I was, and still am, puzzled as to whether the two simple task types of mapping and reducing can be used for such a large variety of problems. However, it appears so.

I would have assumed that there would have been other flavors, maybe an ‘identify-update’, ‘determine-solve’ or some such equivalent, unless a large set of problems can be expressed as some combination of the map reduce paradigm.

Anyway, here are a few examples for which the map-reduce algorithm is useful.

Word Counting: The standard example for map-reduce is the word counting program. Here the map-reduce algorithm generates a list of words with their corresponding counts from a set of input files. The Map task reads each document and breaks it into a sequence of words (w1, w2, w3 …). It then emits key-value pairs as follows

(w1,1),(w2,1),(w3,1),(w1,1) and so on. If a word is repeated in a document it occurs multiple times in the output. All the key-value pairs are then grouped by key and sent to one of the reducer tasks. Each reducer then sums the values for its keys, giving the total count for each word.

a
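
Here is a small simulation of the word-count map and reduce steps in R (just a sketch to illustrate the idea, not Hadoop code):

# A small R simulation of the word-count map-reduce (sketch for illustration)
docs <- c("the quick brown fox", "the lazy dog", "the quick dog")

# Map: break each document into words and emit a (word, 1) pair per word
words <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))
pairs <- data.frame(key = words, value = 1)

# Shuffle & Reduce: group the pairs by key and sum the values per word
wordcount <- tapply(pairs$value, pairs$key, sum)
print(wordcount)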

Matrix multiplication: Big Data is a typical challenge on the web, where there is a need to determine patterns and trends in mountains of data. Machine learning algorithms are utilized to determine structure in data that has the 3 characteristics of volume, variety and velocity. Machine learning algorithms typically depend on matrix operations. Map-reduce is ideally suited for this, and one of Google's original uses of map-reduce involved matrix multiplication.

Let us assume that we have an n x n matrix M whose element in row i and column j is m_ij.

Also let us assume that there is a vector v whose jth element is v_j. Then the matrix-vector product is the vector x of length n whose ith element is given as

x_i = Σ_j m_ij * v_j

 

Map function: The map function applies to each single element of the matrix M. For each element m_ij the map task outputs a key-value pair (i, m_ij * v_j). Hence we will have key-value pairs for all i from 1 to n.

Reduce function: The reduce function takes all pairs with the same key i and sums them up.

Hence each reducer will generate

x_i = Σ_j m_ij * v_j

(Reference: Mining of Massive Datasets– Anand Rajaraman, Jure Leskovec, Jeffrey D Ullman)

This link gives a good write-up on matrix × matrix multiplication.
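
Here is a small R simulation of the matrix-vector map-reduce described above (again only a sketch to illustrate the idea):

# A small R simulation of the matrix-vector map-reduce (sketch for illustration)
set.seed(7)
n <- 4
M <- matrix(rnorm(n * n), nrow = n)   # the n x n matrix M
v <- rnorm(n)                         # the vector v

# Map: for each element m_ij emit the pair (i, m_ij * v_j)
values <- M * matrix(v, nrow = n, ncol = n, byrow = TRUE)  # m_ij * v_j
keys   <- row(M)                                           # the key i for each element

# Reduce: sum all values that share the same key i
x <- tapply(as.vector(values), as.vector(keys), sum)

# Check against the direct matrix-vector product
all.equal(as.vector(x), as.vector(M %*% v))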

Map-reduce for Relational Operations: Map-reduce can be used to perform a number of operations on large scale data that are used in database operations. Multiple database operations can be performed on large scale data like selection, projection, union, intersection, difference, natural join, grouping etc.

Here is an example taken from the 'Web Intelligence & Big Data' course on Coursera by Gautam Shroff.

Let us assume that there are 2 tables, 'Sales by address' and 'City by address', and the need is to find the total 'Sales by City'. The SQL query for this is

SELECT SUM(Sale), City FROM Sales, Cities WHERE Sales.Addr_id = Cities.Addr_id GROUP BY City

This can be done by 2 map-reduce tasks.

The first map-reduce task GROUPs the Sales BY Address as follows

Map1: The first map task will emit (Address, rest of record (SALE/City))

Reduce1: The first reduce task will SUM (Sales) by Address for every City. Clearly this will have multiple occurrences of City.

At this point we will have the sum of the sales for every city. However each city can occur multiple times. Now we have to GROUP BY City

Map2: Now the mapper emits (City, rest of record (SALES))

Reduce2: The 2nd reduce now SUMS all the sales for each city.
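
For comparison, here is the same 'total sales by city' result computed directly in R on toy data (this is just a check of what the two map-reduce stages compute, not map-reduce code; the data is hypothetical):

# Total sales by city, computed directly in R (hypothetical toy data)
sales  <- data.frame(Addr_id = c(1, 1, 2, 3), Sale = c(100, 50, 200, 80))
cities <- data.frame(Addr_id = c(1, 2, 3), City = c("Chennai", "Mumbai", "Chennai"))

joined <- merge(sales, cities, by = "Addr_id")              # join on Addr_id
result <- aggregate(Sale ~ City, data = joined, FUN = sum)  # GROUP BY City
print(result)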

Clearly the map-reduce algorithm does address some major problem areas. It is extremely useful when there is a need to perform the same operation on multiple documents. It would definitely be useful in building an inverted index or in computing PageRank. Also, map-reduce is very powerful in handling matrix operations. A large class of problems like machine learning and computer vision use matrices extensively, and map-reduce is extremely useful when this has to be done on large volumes of data. Besides, the ability of map-reduce to perform a large set of database operations is something that can be used in many situations on the web.

However it is no silver bullet for all types of problems.


Simplifying Machine Learning – K-Means clusters – Part 6

Our brain is an extraordinary apparatus. It is amazing how we humans can instantaneously perceive shapes, objects and forms. For example, when we see a scene with many objects we are immediately able to identify the different objects in the scene. View this against the backdrop of Google's recent 'artificial brain' experiment, a neural network with 16,000 processors and a billion connections. This artificial brain had to be fed 10 million thumbnails of YouTube videos before it was able to recognize cat videos.

That’s an awful lot of work to recognize cat videos!

We can see that a lot of work is involved in getting a computer to do something as simple as this.

Consider how a baby learns to recognize objects, for e.g. a cat, a dog, a toy etc. The human brain does not try to measure the number of eyes, the spacing between the eyes, the shape of the mouth or face etc. The brain is immediately able to distinguish the different animals. How does it do it? Amazing, right?

In any case, here is a machine learning algorithm that is capable of identifying structure in data. It is known as K-Means and is a form of unsupervised learning.

The K-Means algorithm takes as input an unlabeled data set and identifies groups in the set. It tries to determine structure in the data set.

Take a look at the picture below

a

It is readily obvious that there are 2 clusters in the above diagram. However to the computer this is just a random set of points.

How does the K-Means cluster identify the clusters in the above diagram?

The algorithm is fairly simply and intuitive.

1)    Let us start by choosing 2 random points which we call 'cluster centroids'

2)    We then associate each centroid with the points in the dataset that are closest to it.

3)    We then compute the average of each group of associated points and move the centroid to that average.

4)    We then repeat steps 2 – 3 until there is no significant change in the centroids

This is shown below

4

The above algorithm can be implemented iteratively as follows

For the training set (x1, x2, x3 … xm)

Randomly initialize K cluster centroids μ1, μ2, μ3 … μK

Repeat {

for i = 1 to m

c(i) = the index (from 1 to K) of the cluster centroid closest to xi => (A)

end

for k = 1 to K

μk = average of all points assigned to cluster k => (B)

end

}

In step (A) each point xi is assigned to the centroid closest to it. Hence if points 1, 3, 4, 8 are closest to centroid 1 then

c(1) = 1, c(3) = 1, c(4) = 1, c(8) = 1

In step (B) the mean of the points 1, 3, 4, 8 is taken

So the new centroid is

μ1x = ¼ { x1 + x3 + x4 + x8} and μ1y = ¼ { y1 + y3 + y4 + y8}

This becomes the new μ1
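
The two update steps (A) and (B) above are exactly what R's built-in kmeans() function iterates until the centroids stop moving. Here is a quick illustration on synthetic two-cluster data (the data is simulated purely for illustration):

# K-Means on synthetic 2-cluster data using R's built-in kmeans()
set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))
fit <- kmeans(pts, centers = 2, nstart = 20)   # nstart re-runs from random starts
fit$centers                                    # the 2 cluster centroids
table(fit$cluster)                             # points assigned to each cluster
plot(pts, col = fit$cluster, pch = 15)
points(fit$centers, col = 1:2, pch = 8, cex = 2)

The nstart argument re-runs the algorithm from several random initializations and keeps the best result, which also helps with the local optima issue discussed below.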

 

However, there can be occasions where K-Means gets stuck in local optima. To choose the optimum cluster centroids we have to determine the assignment with the least cost. This can be done with the optimization objective.

The optimization objective of K-Means is as follows

K-Means cluster determination is the problem of minimizing the distance of each point from its assigned centroid. This objective is also known as the K-Means cost function or distortion function.

J(c(1), c(2) … c(m), μ1 … μK) = (1/m) Σ || xi – μc(i) ||²

I like to visualize the algorithm as follows.

In step 1 we can visualize that there is a force of attraction between the datapoints and the cluster centroid based on proximity of the centroid.

In step 2 we can visualize that each datapoint attracts the centroid towards it. The centroid moves to the point where the attraction from all the datapoints balances out, which is the mean of the assigned points.

As can be seen, the objective is to minimize the mean squared distance of each data point to its closest centroid.

Given a set of data points, how do we choose the initial random centroids? One way is to pick some random data points themselves as the initial cluster centroids. The algorithm is then iterated to identify the real cluster centroids.

As mentioned before, the algorithm can sometimes get stuck in local optima. One option is to choose another random set of data points as the initial centroids and iterate again. We need to run this several times to determine the best clustering.

There is also the problem of determining the number of cluster centroids. How do we determine how many clusters there are in a random data set? Visually we can easily identify the number of clusters. But a machine cannot.

One technique that can be used to determine the number of clusters is as follows. Run K-Means with 2, 3 … 10 clusters and plot the cost function against the number of clusters. Since the cost keeps decreasing as clusters are added, rather than simply picking the least cost, look for the 'elbow' beyond which adding more clusters gives only a small further reduction.
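
Here is a minimal sketch of this idea in R, using the total within-cluster sum of squares reported by kmeans() as the cost, on the same kind of simulated data as above:

# Plot the K-Means cost (distortion) for K = 2..10 and look for the elbow
set.seed(1)
pts  <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 4), ncol = 2))
ks   <- 2:10
cost <- sapply(ks, function(k) kmeans(pts, centers = k, nstart = 20)$tot.withinss)
plot(ks, cost, type = "b", xlab = "Number of clusters K", ylab = "Cost (distortion)")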

Note: This post, like my previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng



The moving edge of computing

Published in The Hindu – 30 Sep 2012 as "Three computing technologies that will power the world"

“The moving edge of computing computes and having computed moves on…” We could thus rephrase the Rubaiyat of Omar Khayyam’s “The moving hand…” Computing technology has really advanced by leaps and bounds. We are now in a new era of computing. We are in the midst of “intelligent and cognitive” computing.

From the initial days of number crunching in languages like FORTRAN, to the procedural methodology of Pascal or C, and later the object oriented paradigm of C++ and Java, we have now come a long way. In this age of information overload, technologies that simply solve problems through steps and procedures are no longer adequate. We need technology to detect complex patterns and trends, understand nuances in human language and automatically resolve problems. In this new era of computing the following 3 technologies are furthering the frontiers of computing technology.

Predictive Analytics

By 2016 130 Exabyte’s (130 * 2 ^ 60) will rip through the internet. The number of mobile devices will exceed the human population this year, 2012 and by 2016 the number of connected devices will touch almost 10 billion. The devices connected to the net will range from mobiles, laptops, tablets, sensors and the millions of devices based on the “internet of things”. All these devices will constantly spew data on the internet. A hot and happening trend in computing is the ability to make business and strategic decisions by determining patterns, trends and outliers among mountains of data. Predictive analytics will be a key discipline in our future and experts will be much sought after. Predictive analytics uses statistical methods to mine intelligence, information and patterns in structured, unstructured and streams of data. Predictive analytics will be applied across many domains from banking, insurance, retail, telecom, energy. There are also applications for energy grids, water management, besides determining user sentiment by mining data from social networks etc.

Cognitive Computing

The most famous technological product in the domain of cognitive computing is IBM’s supercomputer Watson. IBM’s Watson is an artificial intelligence computer system capable of answering questions posed in natural language. IBM’s supercomputer Watson is best known for successfully trouncing a national champion in the popular US TV quiz competition, Jeopardy. What makes this victory more astonishing is that IBM’s Watson had to successfully decipher the nuances of natural language and pick the correct answer.  Following the success at Jeopardy, IBM’s Watson supercomputer has now  been employed by a leading medical insurance firm in US to diagnose medical illnesses and to recommend treatment options for patients. Watson will be able to analyze 1 million books, or roughly 200 million pages of information. The other equally well known mobile app is Siri the voice recognition app on the iPhone. The earlier avatar of cognitive computing was expert systems based on Artificial Intelligence. These expert systems were inference engines that were based on knowledge rules. The most famous among the expert systems were “Dendral” and “Mycin”. We appear to be on the cusp of tremendous advancement in cognitive computing based on the success of IBM’s Watson.

Autonomic Computing

This is another computing trend that will become prevalent in the networks of tomorrow. Autonomic computing refers to the self-managing characteristics of a network. Typically it signifies the ability of a network to self-heal in the event of failures or faults. An autonomic network can quickly localize and isolate faults in the network while keeping other parts of the network unaffected. Besides, these networks can quickly correct and heal the faulty hardware without human intervention. Autonomic networks are typical in smart grids, where a fault can be quickly isolated and the network healed without resulting in a major outage in the electrical grid.

These are truly exciting times in computing as we move towards true intelligence!


The promise of predictive analytics

Published in Telecom Asia – Feb 20, 2012 –  The promise of predictive analytics

Published in Telecoms Europe – Feb 20, 2012 – Predictive analytics gold rush due

We are headed towards a more connected, more instrumented and more data driven world. This fact is underscored once again in Cisco's latest Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2011–2016. The statistics from this report are truly mind-boggling.

By 2016 130 exabytes (130 * 2 ^ 60) will rip through the internet. The number of mobile devices will exceed the human population this year, 2012. By 2016 the number of connected devices will touch almost 10 billion.

The devices that are connected to the net range from mobiles, laptops, tablets, sensors and the millions of devices based on the “internet of things”. All these devices will constantly spew data on the internet and business and strategic decisions will be made by determining patterns, trends and outliers among mountains of data.

Predictive analytics will be a key discipline in our future and experts will be much sought after. Predictive analytics uses statistical methods to mine information and patterns in structured, unstructured and streams of data. The data can be anything from click streams, browsing patterns, tweets, sensor data etc. The data can be static or it could be dynamic. Predictive analytics will have to identify trends from data streams from mobile call records, retail store purchasing patterns etc.

Predictive analytics will be applied across many domains from banking, insurance, retail, telecom, energy. In fact predictive analytics will be the new language of the future akin to what C was a couple of decades ago.  C language was used in all sorts of applications spanning the whole gamut from finance to telecom.

In this context it is worthwhile to mention The R Language. R language is used for statistical programming and graphics. The Wikipedia defines R Language as “R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others”.

Predictive analytics is already being used in traffic management in identifying and preventing traffic gridlocks. Applications have also been identified for energy grids, for water management, besides determining user sentiment by mining data from social networks etc.

One very ambitious undertaking is "the Data-Scope Project", which believes that the universe is made of information and there is a need for a "new eye" to look at this data. The Data-Scope project is described as "a new scientific instrument, capable of 'observing' immense volumes of data from various scientific domains such as astronomy, fluid mechanics, and bioinformatics. The system will have over 6PB of storage, about 500GBytes per sec aggregate sequential IO, about 20M IOPS, and about 130TFlops. The Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument, that enables people to do science with datasets ranging between 100TB and 1000TB". The Data-Scope project is based on the premise that new discoveries will come from the analysis of large amounts of data. Analytics is all about analyzing large datasets, and predictive analytics takes it one step further in being able to make intelligent predictions based on available data.

Predictive analytics does open up a whole new universe of possibilities and the applications are endless.  Predictive analytics will be the key tool that will be used in our data intensive future.

Afterthought

I started to wonder whether predictive analytics could be used for some of the problems confronting the world today. Here are a few problems where analytics could be employed

–          Can predictive analytics be used to analyze outbreaks of malaria, cholera or AIDS and help in preventing outbreaks in other places?

–          Can analytics analyze economic trends and predict an upward/downward trend ahead of time?


The rise of analytics

Published in The Hindu – The rise of analytics

We are slowly, but surely, heading towards the age of “information overload”. The Sloan Digital Sky Survey started in the year 2000 returned around 620 terabytes of data in 11 months — more data than had ever been amassed in the entire history of astronomy.

The Large Hadron Collider (LHC) at CERN, Europe's particle physics laboratory in Geneva, will, during its search for the origins of the universe and the elusive Higgs particle early next year, spew out terabytes of data in its wake. Now there are upwards of five billion devices connected to the Internet and the numbers are showing no signs of slowing down.

A recent report from Cisco, the data networking giant, states that the total data navigating the Net will cross half a zettabyte (10^21 bytes) by the year 2013.

Such astronomical volumes of data are also handled daily by retail giants including Walmart and Target and telcos such as AT&T and Airtel. Also, advances in the Human Genome Project and technologies like the “Internet of Things” are bound to throw up large quantities of data.

The issue of storing data is now slowly becoming non-existent with the plummeting prices of semi-conductor memory and processors coupled with a doubling of their capacity every 18 months with the inevitability predicted by Moore’s law.

Plumbing the depths

Raw data is by itself quite useless. Data has to be classified, winnowed and analysed into useful information before it can be utilised. This is where analytics and data mining come into play. Analytics, once the exclusive preserve of research labs and academia, has now entered the mainstream. Data mining and analytics are now used across a broad swath of industries — retail, insurance, manufacturing, healthcare and telecommunication. Analytics enables the extraction of intelligence, identification of trends and the ability to highlight the non-obvious from raw, amorphous data. Using the intelligence that is gleaned from predictive analytics, businesses can make strategic game-changing decisions.

Analytics uses statistical methods to classify data, determine correlations, identify patterns, and highlight and detect deviations among large data sets. Analytics includes in its realms complex software algorithms such as decision trees and neural nets to make predictions from existing data sets. For e.g. a retail store would be interested in knowing the buying patterns of its consumers. If the store could determine that product Y is almost always purchased when product X is purchased then the store could come up with clever schemes like an additional discount on product Z when both products X & Y are purchased. Similarly, telcos could use analytics to identify predominant trends that promote customer loyalty.

Studying behaviour

Telcos could come with voice and data plans that attract customers based on consumer behaviour, after analysing data from its point of sale and retail stores. They could use analytics to determine causes for customer churn and come with strategies to prevent it.

Analytics has also been used in the health industry in predicting and preventing fatal infections in infants based on patterns in real-time data like blood pressure, heart rate and respiration.

Analytics requires at its disposal large processing power. Advances in this field have been largely fuelled by similar advances in a companion technology, namely cloud computing. The latter allows computing power to be purchased on demand almost like a utility and has been a key enabler for analytics.

Data mining and analytics allows industries to plumb the data sets that are held in the organisations through the process of selecting, exploring and modelling large amount of data to uncover previously unknown data patterns which can be channelised to business advantage.

Analytics help in unlocking the secrets hidden in data and provide real insights to businesses; and enable businesses and industries to make intelligent and informed choices.

In this age of information deluge, data mining and analytics are bound to play an increasingly important role and will become indispensable to the future of businesses.


Cloud, analytics key tools for today’s telcos

Published in Telecom Asia Aug 20, 2010 – http://bit.ly/dxKbsR

Operators facing dwindling revenue from wireline subscribers, fierce tariff wars and exploding mobile data traffic are continually being pressured to do more for less. Spending on infrastructure is increasing as they look to provide better service within slender budgets.
In these tough times telcos have to devise new and innovative strategies and make judicious technology choices. Two promising technologies, cloud computing and analytics, are shaping up as among the best choices to make.
Cloud architecture does away with the worry of planning the computing resources needed, the real estate, the costs of the acquiring them and thoughts of its obsolescence. It allows the CSPs to purchase processing power, platforms and databases almost as a utility like electricity or water.
Cloud consumers only pay for what they use. The magic of this promising technology is the elasticity that the cloud provides – it expands to accommodate increasing demands and contracts when the demand drops.
The cloud architectures of Amazon, Google and Microsoft – currently the three biggest cloud providers – vary widely in their capabilities and features. These strengths and weaknesses should be taken into account while planning a cloud system. Each is best suited for only a certain class of applications unique to each individual cloud provider.
On one end of the spectrum Amazon’s EC2 (Elastic Compute Cloud) provides a virtual machine and a wealth of associated tools for storage and notifications. But the trade-off for increased flexibility is that users must take responsibility for designing resiliency into their systems.
On the other end is Google’s App Engine, a highly scalable cloud architecture that handles failures but is a lot more restrictive. Microsoft’s Azure is based on the .NET architecture and in terms of flexibility and features lies between these two.
When implementing such architecture, an organization should take a long hard look its computing software inventory to decide which applications are worthy of migrating to the cloud. The best candidates are processing intensive in-house applications that deliver standardized functionality and interface, and whose software architecture is made up of loosely coupled communicating systems.
Applications that deal with sensitive data should be retained within the organization’s internal computing infrastructure, because security is currently the most glaring issue with the cloud. Cloud providers do provide various levels of security to users, but this is an area in keen need of standardization.
But if the CSP decides to build components of an OSS system – rather than buying a pre-packaged system – it makes good business sense to develop for the cloud.
A cloud-based application must have a few essential properties. First, it is preferable if the application was designed on SOA principles. Second, it should be loosely coupled. And lastly, it needs to be an application that can be scaled rapidly up or down based on the varying demands.
The other question is which legacy systems can be migrated. If the OSS/BSS systems are based on commercial off-the-shelf systems these can be excluded, but an offline bill processing system, for example, is typically a good candidate for migration.
Mining wisdom from data
The cloud can serve as the perfect companion for another increasingly vital operational practice – data analytics. The cloud is capable of modeling large amounts of data, and running models to process and analyze this data. It is possible to run thousands of simultaneous instances on the cloud and mine for business intelligence in the oceans of telecom data operators generate.
Today’s CSP maintains software systems generating all kinds of customer data, covering areas ranging from billing and order management to POS, VAS and provisioning. But perhaps the largest and richest vein of subscriber information is the call detail records database.
All this data is worthless, though, if it cannot be mined and analyzed. Formal data mining and data analytics tools can be used to identify patterns and trends that will allow operators to make strategic, knowledge-driven decisions.
Analytics involves many complex areas like predictive analytics, neural nets, decision trees and classification. Some of the approaches used in data analytics include prediction, deviation detection, degree of influence and classification.
With the intelligence that comes through analytics it is possible to determine customer buying patterns, identify causes for churn and develop strategies to promote loyalty. Call patterns based on demography or time of day will enable the CSPs to create innovative tariff schemes.
Determining the relations and buying patterns of users will provide opportunities for up-selling and cross-selling. The ability to identify marked deviation in customer behavior patterns help the CSP in deciding ahead of time whether this trend is a warning bell or an opportunity waiting to be tapped.
Tinniam V Ganesh
