The language R

In the universe of programming languages there is a rising staR. It is moving fasteR and getting biggeR and brighteR!

Ok, you get the hint! It is the language R or the R Language.

The R language is the successor to the language S. R is extremely powerful for statistical computing and data processing. It is an interpreted language, much like Python or Perl. The power of R comes from the 4000+ software packages that make the language almost indispensable for any type of statistical computing.

As I mentioned above, in my opinion R is soon going to play a central role in the technological world. In today’s world we are flooded with data from all sides. To make sense of this information overload we need techniques like Big Data, analytics and machine learning. This is where R, with its numerous packages that make short work of data, becomes critical. R also has very interesting graphics packages that display the data in many forms for faster analysis and easier consumption.

The language R can easily ingest large sets of data in CSV format and perform many computations on them. R is being used in machine learning, data mining, classification and clustering, and text mining, besides also being utilized in sentiment analysis of social networks.
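
For example, here is a minimal sketch of ingesting a CSV file and computing a few summary statistics (the file sales.csv and its column names are hypothetical):

> sales <- read.csv("sales.csv")   # ingest the CSV into a data frame
> dim(sales)                       # number of rows and columns read in
> summary(sales)                   # summary statistics for every column
> mean(sales$amount)               # mean of one (hypothetical) numeric column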

The R language contains the usual programming constructs, namely logicals, loops, assignment etc. The language enables one to easily assign values to vectors, matrices and arrays and to perform all the associated operations on them.
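
For instance, a short illustrative snippet (the values are made up):

> v <- c(2, 4, 6, 8)           # a numeric vector
> v * 2                        # element-wise arithmetic
> m <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix
> t(m)                         # transpose
> m %*% t(m)                   # matrix multiplication (2 x 2 result)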

The R language can be installed from the R-project site. The R distribution comes with many datasets collected from various sources. One such dataset is the Iris dataset, which is about the Iris plant (Iris is a genus of 260–300 species of flowering plants with showy flowers).

The dataset contains 5 parameters

1)      Sepal length 2) Sepal Width 3) Petal length 4) Petal width 5) Species

This dataset has been used in many research papers. R allows you to easily perform a sophisticated set of statistical operations on this dataset. Included below is a sample set of operations you can perform on the Iris dataset or on any dataset.

> iris[1:5,]

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1          5.1         3.5          1.4         0.2  setosa

2          4.9         3.0          1.4         0.2  setosa

3          4.7         3.2          1.3         0.2  setosa

4          4.6         3.1          1.5         0.2  setosa

5          5.0         3.6          1.4         0.2  setosa

> summary(iris)

Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species

Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50

1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50

Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50

Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199

3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800

Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

> hist(iris$Sepal.Length)

[Figure: histogram of iris Sepal.Length]

Here is a scatter plot of the Petal width, sepal length and sepal width

> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)

[Figure: 3D scatter plot of Petal.Width, Sepal.Length and Sepal.Width]

 

As can be seen, R can really make short work of data with the numerous packages that come along with it. I have just skimmed the surface of the R language.

I hope this has whetted your appetite. Do give R a spin!

Watch this space!

You may also like
1. Introducing cricketr! : An R package to analyze performances of cricketers
2. Literacy in India : A deepR dive.
3. Natural Language Processing: What would Shakespeare say?
4. Revisiting crimes against women in India
5. Sixer – R package cricketr’s new Shiny Avatar

Also see
1. Designing a Social Web Portal
2. Design principles of scalable, distributed systems
3. A Cloud Medley with IBM’s Bluemix, Cloudant and Node.js
4. Programming Zen and now – Some essential tips -2 
5. Fun simulation of a Chain in Android

Find me on Google+

Natural selection of database technology through the years

Charles Darwin in his landmark book “The Origin of Species” discusses how flora and fauna evolved through the centuries. The different features of each individual species would undergo a process of natural selection by which modifications of attributes would naturally occur, enabling the species to adapt and propagate through time. Those modifications that failed to adapt would naturally become extinct.

In this post I discuss how database (DB) technology has evolved over the years. As new requirements arose, database technology has had to adapt and newer paradigms have evolved. However, unlike species which became extinct, the older versions still exist as they continue to address the earlier problems that remain today.

Here is a brief history of the evolution of databases.

Relational databases: Relational databases had their genesis when E.F. Codd of IBM came up with a relational model of organizing data. In this model all data is organized as tables with several rows. In a relational model each row has several columns, and one of the columns contains a unique value for each row, called the primary key. Relational databases have ruled the enterprise domain for more than 3 decades. An enterprise’s data is organized as a set of related tables. Users can query the database using Structured Query Language or SQL.

I remember in the late 1980’s when I started to work in the industry, programming jobs were much sought after by all of us engineering graduates. In those days database jobs were ‘uncool’ and system programming jobs dealing with writing assemblers, compilers were the really cool jobs. I was also susceptible to this prevailing opinion and stayed away from databases. As fate would have it I eventually moved into telecom and telecom protocol work in which I worked for more than 2 decades and have largely maintained my distance from DB.

However in recent times I did want to brush up whatever little I knew of DB. Recently I was listening to the Coursera course “Introduction to Data Science” by Bill Howe. In one of the lectures the professor said something that really caught my fancy. He mentions that SQL is probably the closest to natural language. How true! Once the DB schema and tables have been set up, querying the DB for all sorts of data can be done in SQL, which is close to natural language. For e.g.

SELECT a, b, c FROM S, T WHERE condition_X1 AND/OR condition_X2

The power of DBs comes from the fact that all the data is organized as tables, which enables one to retrieve any sort of data from them. Trying to accomplish this with any other high level programming language would take several hundreds of lines of code, and we would have to write functions for each individual query.

NoSQL databases: However the utility of relational databases decreases as we scale to hundreds of GBs of data. In this age of the internet and the worldwide web, data is easily of the order of several terabytes to a few petabytes. For e.g. weather modelling, and social networks like FB, Twitter or LinkedIn, all need to operate on millions of status updates or tweets per day. Traditional relational databases cannot handle such large sets of data. This is where the concept of the NoSQL DB came into existence. NoSQL databases typically store data as key-value pairs. The singular advantage of NoSQL is that the database can scale horizontally, or in other words the performance does not degrade with large increases in data size. In NoSQL databases data is hashed and uniformly distributed across commodity servers through a technique known as ‘consistent hashing’. Also, data in NoSQL databases is replicated across servers. This architecture of NoSQL databases is based on common, commodity servers which are expected to crash; however this does not prevent the NoSQL DB from functioning correctly. The strength of NoSQL databases comes from the fact that servers can join or leave the NoSQL DB without affecting the functioning of the DB. Some of the more popular examples of NoSQL DBs are CouchDB, MongoDB, Riak, Voldemort, Dynamo etc. Do take a look at my post “When NoSQL makes better sense than MySQL”.

NewSQL: This variation of DB came into existence as there was a need for extremely fast performance for computing tasks like analytics. These DBs exist completely in memory and so access is blazingly fast. The most famous DB of this paradigm is SAP’s HANA.

Graph Databases: Graph databases are the recent entrants into database technology. This strain of databases came into existence to handle associative data more efficiently. In a graph database data is represented as a graph: nodes in the graph can be entities and edges can be relationships. A search on a graph database results in a traversal from a specified start node to a specified terminating node. ‘Friends’ in Facebook, ‘followers/following’ in Twitter and ‘connections’ in LinkedIn all use graph databases to map associations and enable easy search. This is what allows these sites to make recommendations like ‘People you may know’. Examples of graph databases are Google’s graph DB and Neo4j.

As we move ahead, database technology will continue to evolve into newer architectures to handle ever-larger volumes and newer varieties of data.

Find me on Google+

The brave, new frontiers of computing

This article was published in Telecom Asia, 21 March 2014 – The brave new frontiers of computing

Von Neumann reference architecture and the sequential processing of Turing machines have been the basis for ‘classical’ computers for the last 6 decades. The juggernaut of technology has resulted in faster and denser processors being churned out inexorably by the semiconductor industry, substantiating Gordon Moore’s claim of transistor density in chips doubling every 18 months, now famously known as Moore’s law. These days we have processors with an excess of a billion transistors. We are now reaching the physical limit of the number of transistors on a chip. There is now an imminent need to look at alternative paradigms to crack problems of the internet age, confronting humanity, which cannot be solved by classical computing.

In the last decade or so 3 new, radical and lateral paradigms have surfaced which hold tremendous promise. They are

i) Deep learning ii) Quantum computing and iii) Genetic programming.

These techniques hold enormous potential and may offer solutions to problems which would take classical computers anywhere between a few years to a few decades to solve.

Deep Learning: Deep Learning is a new area of Machine Learning research. The objective of deep learning is to bring Machine Learning closer to one of its original goals, namely Artificial Intelligence. Deep Learning is based on multi-level neural networks called deep neural networks. Deep Learning works on large sets of unclassified data and is able to learn lower level patterns on which it builds higher level representations, much the same way the human brain works.

Deep learning tries to mimic the human brain. For example, the visual cortex shows a sequence of areas where signals flow from one level to the next. In the visual cortex the feature hierarchy represents the input at different levels of abstraction, with more abstract features further up in the hierarchy, defined in terms of the lower-level ones. Deep Learning is based on the premise that humans organize ideas hierarchically and compose more abstract concepts from simpler ones.

Deep Learning algorithms generally require powerful processors and work on enormous amounts of data to learn key features. The characteristic of Deep Learning algorithms is that the input is passed through several non-linearities before generating the output.
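
As a toy illustration (my own, not a real deep network), here is a single input passed through two non-linear layers in R with random weights:

set.seed(7)
sigmoid <- function(z) 1 / (1 + exp(-z))     # the non-linearity
x  <- c(0.5, -1.2, 3.0)                      # one input with 3 features
W1 <- matrix(rnorm(4 * 3), nrow = 4)         # layer 1: 3 inputs -> 4 hidden units
W2 <- matrix(rnorm(1 * 4), nrow = 1)         # layer 2: 4 hidden units -> 1 output
h   <- sigmoid(W1 %*% x)                     # first non-linear transformation
out <- sigmoid(W2 %*% h)                     # second non-linear transformation
out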


About 3 years ago, researchers at Google Brain ran a deep learning algorithm on 10 million still images extracted from YouTube, on thousands of extremely powerful processors called GPUs. Google Brain was able to independently infer that these images contained a preponderance of cat videos. A seemingly trivial result, but of great significance, as the algorithm inferred this result without any other input!

An interesting article in Nature, “The learning machines”, discusses how deep learning has proved useful for several scientific tasks including handwriting recognition, speech recognition, natural language processing, and in analyzing 3 dimensional images of brain slices etc.

The importance of Deep Learning has not been lost on the Tech titans like Google, Facebook, Microsoft and IBM which have all taken steps to stay ahead in this race.

Deep Learning is in its infancy and is still esoteric knowledge. Deep Learning is truly a fascinating area of research and may be the harbinger of the real breakthrough that Artificial Intelligence has been looking for over the decades.

Genetic Programming (GP) is another radical approach to computing. It had its origins in the early 1950’s and has been gaining traction in the last decade. Genetic programming (GP) is a branch of AI, based on the Darwinian evolutionary principles of ‘natural selection’ and ‘survival of the fittest’. Essentially GP is a set of instructions and a fitness function to measure how well a computer program has performed a task. It is a specialization of genetic algorithms (GA) where each individual is a computer program.

Genetic Programming is a machine learning technique in which a population of computer programs are optimized according to ‘fitness criteria’ determined by a program’s ability to perform a given computational task. Fit programs survive and are moved along the evolutionary process. Fitness usually denotes the optimum value for a given objective function. In other words the fitness represents the ‘quality’ of a given solution over others. Individuals in a new population are created by the method of ‘reproduction’ and ‘cross over’.

In other words, the ‘most fit’ programs are crossbred and also possibly randomly mutated, creating a new generation of child programs. The unfit programs are discarded and the best are bred again.
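
To give a feel for the evolutionary loop, here is a highly simplified sketch in R of a genetic algorithm (bit strings rather than full programs, with a made-up fitness function that counts 1s); selection, single-point crossover and random mutation are the same ingredients GP applies to programs:

set.seed(1)
pop_size <- 20; len <- 16
pop <- matrix(sample(0:1, pop_size * len, replace = TRUE), pop_size, len)

for (gen in 1:50) {
  fitness <- rowSums(pop)                                    # fitness = number of 1s
  parents <- pop[order(fitness, decreasing = TRUE)[1:10], ]  # survival of the fittest
  pop <- t(sapply(1:pop_size, function(i) {
    p   <- parents[sample(10, 2), ]                          # pick two fit parents
    cut <- sample(1:(len - 1), 1)                            # single-point crossover
    child <- c(p[1, 1:cut], p[2, (cut + 1):len])
    mut <- runif(len) < 0.01                                 # occasional random mutation
    child[mut] <- 1 - child[mut]
    child
  }))
}
max(rowSums(pop))   # best fitness after 50 generations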

Once set up, the genetic program runs and evolves by itself and needs no further human input. Genetic Programming was pioneered by Stanford’s John Koza who was able to invent an antenna for NASA, identify proteins and invent electrical controllers.

The eerie part of GP is that the code is inscrutable. The program evolves and mutates into variations that cannot be easily reproduced. Clearly this is fodder for science fiction-like scenarios of self-aware, paranoid & psychopathic programs. Here is an interesting article that discusses this- This is What Happens When You Teach Machines the Power of Natural Selection

Quantum computing

Computers of today, from hardy mainframes to smartphones, operate on binary logic. The entire edifice of today’s computing is based on the binary states of the semiconductor, which can be either in the state ‘0’ or ‘1’. All computation can be reduced to arithmetic and logical operations on binary digits or, more simply, binary arithmetic. Quantum computers deviate significantly from the binary arithmetic of classical computers. The unit in the quantum computer is the ‘qubit’, which can be in state ‘0’, state ‘1’, or both states ‘0’ and ‘1’ simultaneously through the principle of superposition.

To understand the power of quantum computing, here is an excerpt from the ArsTechnica article “A tale of two qubits: How quantum computers work”:

Bits, either classical or quantum, are the simplest possible units of information…. Measuring a bit, either classical or quantum, will result in one of two possible outcomes. At first glance, this makes it sound like there is no difference between bits and qubits. In fact, the difference is not in the possible answers, but in the possible questions. For normal bits, only a single measurement is permitted, meaning that only a single question can be asked: Is this bit a zero or a one? In contrast, a qubit is a system which can be asked many, many different questions, but to each question, only one of two answers can be given”

The article further goes on to state that “Classical computer memories are constrained to exist at any given time as a simple list of zeros and ones. In contrast, in a single quantum memory many such combinations can all exist simultaneously. During a quantum algorithm, this symphony of possibilities is split and merged, eventually coalescing around a single solution. “

Having more than 1 qubit results in an additional property called ‘quantum entanglement’. A pair of entangled qubits cannot be described by the states of the individual qubits alone. Those states which exhibit such extra correlations are described as ‘entangled’ states. Hence in the case of 2 qubits ‘the whole is greater than the sum of its parts’. Entanglement and superposition are the cornerstones that give quantum computing its power. Here is a short and interesting animation of quantum computing.
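
To make the notation concrete, here is the standard textbook way (my addition, in LaTeX notation) of writing a single-qubit superposition and a two-qubit Bell state; the Bell state cannot be factored into states of the individual qubits, which is precisely entanglement:

|\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \qquad |\alpha|^2 + |\beta|^2 = 1

|\Phi^{+}\rangle = \frac{1}{\sqrt{2}} \left( |00\rangle + |11\rangle \right)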

With classical computing techniques, searching an unsorted phonebook of 10,000 entries would require us to look up 5,000 entries on average, while a quantum search algorithm needs only about 100 guesses. In other words it would take a quantum computer only 5,000 guesses to search through a phonebook with 25 million names. That is the power of quantum computers!

Applications of quantum computers range from weather modeling and cryptography to solving problems that have been considered ‘intractable’ with classical computing methods. NASA is planning to use quantum computers in its search for exoplanets.

Deep Learning, Genetic Programming and Quantum Computing represent paradigmatic, lateral shifts in computing. They herald a new era in computing and will enable us to crack extremely complex problems in this Age of the Internet.

Classical computing will continue to play a role in our daily lives, but for the real world problems of the next decade & beyond it will be these 3 computing approaches that will hold the key to our future!

Find me on Google+

Pregelian philosophy – Thinking Web Scale – Part 2

“If you squint the right way, you will notice that graphs are everywhere”. This is a line from a research blog at Google, “Large scale graph computing at Google”. Transportation problems, disease outbreaks, computer networks and social networks all use graphs in one way or another. Social networks of today, with friends on Facebook, followers in Twitter and connections in LinkedIn, are all clearly graphs. The billions of pages in the World Wide Web with their incoming and outgoing links form a massive directed graph.

Pregel is Google’s computing model for graph processing. Google came up with this programming model to compute the Page Ranks of individual web sites. Pregel is a powerful framework for the processing of directed graphs. A directed graph has vertices and edges; edges are directed towards or away from the vertices. Pregel is a highly scalable model and is well suited for Web Scale problems. It is capable of processing directed graphs with billions of vertices and trillions of edges.

The Map-Reduce paradigm with its message passing mechanism is not particularly well suited for this purpose. Map-Reduce is capable of processing several documents, images, or matrices in parallel, but when handling directed graphs with Map-Reduce the combiners/reducers have to wait for the mappers to finish their tasks. In Pregel, programs are executed as a sequence of iterations in which each vertex receives messages sent to it in the previous iteration, executes code, and sends messages to other vertices. Each vertex can also modify its state and the topology of the graph.

Here is the model of Pregel.

[Figure: the Pregel superstep model]

This picture is taken from the lecture in Web Intelligence and Big Data course by Gautam Shroff on Coursera

Pregel works on a sequence of iterations known as supersteps. In each superstep, S, each vertex will receive messages sent to it in the previous superstep S-1, executes a user-defined function specified for the vertex and sends messages to other vertices which they will receive in superstep S+1. The supersteps at all vertices are conceptually supposed to occur in parallel.  At each superstep the vertex can alter its state and the state of the outgoing edges.  The synchronicity of the model is what makes the model semantically manageable.

Pregel is realized on hundreds of commodity servers. The input to the Pregel model is a directed graph. During initialization the vertices are partitioned and each server receives a set of vertices. Each vertex is associated with a modifiable user-defined value.

In each superstep each node computes the user defined function in parallel using the message sent to it in the previous superstep.  A vertex can modify its state, the outgoing edges or the topology of the graph.

Pregel has been used for a variety of different problems, ranging from determining Page Rank to shortest paths. Vertices are first class citizens in Pregel. Algorithms terminate by a process of voting to halt: when all vertices vote to halt, the computation in Pregel is assumed to have completed. The algorithm as a whole terminates when all the vertices have voted to halt and there are no messages in transit.

A simple example of how Pregel determines the maximum value is illustrated in the original Google research paper Pregel: A System for Large-Scale Graph Processing.

[Figure: the maximum-value example from the Pregel paper]
The pseudo code for this can be written as

compute() {
  i_val := val
  while (m) {                      // for each incoming message m
    if (m > val) {
      val := m                     // keep the largest value seen so far
    }
  }
  if (i_val == val) {
    vote_to_halt()                 // value unchanged in this superstep
  } else {
    for each neighbor v
      send_message(v, val)         // propagate the new maximum
  }
}

In the above diagram the 4 vertices are initialized with the values shown. In superstep 1, vertices 3 & 6 exchange their values. Vertex 3 updates its value to 6, while vertex 6 drops the smaller value it received and votes to halt. Similarly vertex 1 will update its value to 6, while vertex 2, which receives the value 1, will vote to halt. In superstep 2, vertex 2 will receive the value 6 from the vertex to its right, will be woken up and will update its value. In superstep 3 no messages are passed and all vertices will have voted to halt.
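
The superstep mechanics can be mimicked in a few lines of R. The sketch below uses an assumed undirected chain of 4 vertices (a generic example, not the exact topology of the paper's figure): in every superstep each vertex receives its neighbours' values, keeps the maximum, and 'votes to halt' once nothing changes.

values <- c(3, 6, 2, 1)                        # initial vertex values
adj    <- list(c(2), c(1, 3), c(2, 4), c(3))   # assumed undirected chain 1-2-3-4
repeat {
  received <- sapply(1:4, function(v) max(values[adj[[v]]]))  # messages from neighbours
  new_vals <- pmax(values, received)                          # keep the larger value
  if (all(new_vals == values)) break                          # every vertex votes to halt
  values <- new_vals
}
values   # all vertices converge to the maximum, 6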

Another interesting application is the evaluation of Page Rank. Page Rank essentially determines the probability of hitting a Page if a surfer clicked links on a Web page at random.

Page Rank is evaluated iteratively as follows (source: wikipedia.org)

PR(p) = (1 – d)/N + d * Σ PR(q)/L(q), where the sum runs over the pages q that link to p, N is the total number of pages, L(q) is the number of outbound links of q, and d is the damping factor (typically 0.85).

This can be done iteratively through Pregel as below

virtual void Compute(MessageIterator* msgs) {
  if (superstep() >= 1) {
    double sum = 0;
    for (; !msgs->Done(); msgs->Next()) {
      sum += msgs->Value();
    }
    *MutableValue() = 0.15 / NumVertices() + 0.85 * sum;   // --> (A)
  }
  if (superstep() < 30) {
    const int64 n = GetOutEdgeIterator().size();
    SendMessageToAllNeighbors(GetValue() / n);             // --> (B)
  } else {
    VoteToHalt();
  }
}

The Pregel computation is initialized such that in superstep 0, the value of each vertex is 1 / NumVertices(). In each of the first 30 supersteps, each vertex sends along each outgoing edge its tentative PageRank divided by the number of outgoing edges (see step B above). Starting from superstep 1, each vertex sums up the values arriving in messages into sum and sets its own tentative PageRank to 0.15/NumVertices() + 0.85 * sum (see step A). After reaching superstep 30, no further messages are sent and each vertex votes to halt.

Clearly the computation of PageRank of the pages indexed by Google in the World Wide Web consisting of billions of pages can be computed fairly efficiently by Pregel.

Pregel also includes ‘combiners’ that perform functions like SUM, MIN, MAX and AVERAGE to save computational steps.

Pregel also includes checkpointing where vertices save their states to revert to a previous state in case of failure.

The synchronous methodology of Pregel helps in avoiding issues of deadlocks and races which are prevalent in asynchronous communicating programs.

Pregel is a powerful programming model which will find applications in many Web Scale applications in the future.

Find me on Google+

Simplifying ML: Recommender Systems – Part 7

In this age of Amazon, Netflix and App stores, where products, movies and apps are purchased online, the method of up-selling and cross-selling online is through the use of recommender systems.

When you go to a site like Amazon/Flipkart or purchase apps on the App Store/Google Play, we often see things like “People who bought this book/app also bought X, Y, Z”. These recommendations are recommender system algorithms in action.

Recently, Netflix ran a competition in which users had to come up with the best algorithm to recommend films that a user would also like. The prize money for this was of the order of $1 million. That’s how critical recommender systems are to organizations of today, where most transactions happen on the web.

Typically users are asked to give a rating of 1 to 5, with 1 being the lowest and 5 being the highest. So for example if we had classics like Moby Dick and Great Expectations, current best sellers like The Client and The Da Vinci Code, and a science fiction novel like 2001: A Space Odyssey, we can expect that different people will rate the books differently. Obviously not everybody would have read every book in the list and some entries would be blank.

[Table: sample ratings given by different users to the books]

Recommender Systems are based on machine learning algorithms. The goal of these algorithms is to predict the score any user would give to books they did not rate. In other words, what rating would buyers give to books or apps they did not buy? If the algorithm predicts a high rating then we could recommend the item, as the user would be likely to ‘like’ it. Or we could give recommendations of books/apps bought by users who bought the same books/apps as this user.

The notation is

nu = Number of users

nb = Number of books

r(i,j) = Boolean whether user j rated a book i

y(i,j) = The rating user j gave book i

mj  = The number of books that user j rated

Content based recommendation

In a typical content based recommendation algorithm we assume that we have data about the items we want to recommend, e.g. books/products/apps. For the example of books bought in an online bookstore we assume some features, in our case ‘classic’, ‘fiction’ etc.

[Table: feature values (e.g. ‘classic’, ‘fiction’) for each book]

So each book has its own feature vector, where x1 is the feature vector of the first book, x2 the feature vector of the 2nd book and so on.

The parameters can be learned through linear regression by minimizing the cost function, i.e. the sum of squared errors between the predicted and actual ratings.

So for a parameter vector θj and a feature vector xi, the recommender system will try to predict the rating that user j will give to book i.

This can be written as

Number of stars (rating) = (θj)T xi

 

This reduces to the following minimization problem over θj, where the sum runs over all books i that user j has rated, i.e. r(i,j) = 1

min(θj) 1/2m Σi:r(i,j)=1 ((θj)T xi – y(i,j))2

Adding the regularization term this becomes

min(θj) 1/2m Σi:r(i,j)=1 ((θj)T xi – y(i,j))2 + λ/2m Σk (θjk)2

The recommender algorithm in essence tries to learn the parameters θj for a chosen set of features xi of the items, e.g. books in this case.

The recommender tries to learn the parameters for all the users

min(θ1,…,θnu) 1/2m Σj Σi:r(i,j)=1 ((θj)T xi – y(i,j))2 + λ/2m Σj Σk (θjk)2

The minimization is performed by gradient descent as

θjk := θjk – α ( Σi:r(i,j)=1 ((θj)T xi – y(i,j)) xik + λ θjk )

 

Recommender systems try to learn the parameters for a set of chosen features over all users. Based on the learnt parameters they then try to predict the rating the user would give to books/apps that he is yet to purchase, and push up those items for which the user is likely to give a high rating based on the given set of ratings.
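
Here is a minimal R sketch (with made-up feature values and ratings) of learning one user's parameter vector θ by gradient descent along the lines described above:

X <- matrix(c(1, 0.9, 0.1,     # book 1: intercept, 'classic', 'fiction'
              1, 0.2, 0.8,     # book 2
              1, 0.8, 0.3),    # book 3
            nrow = 3, byrow = TRUE)
y      <- c(5, 1, 4)           # ratings this user gave the 3 books
theta  <- rep(0, 3)            # parameter vector to be learned
alpha  <- 0.1; lambda <- 0.01; m <- length(y)

for (iter in 1:2000) {
  error <- X %*% theta - y
  grad  <- (t(X) %*% error + lambda * c(0, theta[-1])) / m   # regularize all but theta0
  theta <- theta - alpha * grad
}
X %*% theta   # predicted ratings; a new book's features are scored the same way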

Recommender systems contribute substantially to the revenues of e-commerce sites like Amazon, Flipkart, Netflix etc

Note: This post, like previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng


Find me on Google+

Perils and pitfalls of Big Data

Big Data is hurtling towards us in a big way. It is already in the news and the blip seems to be getting bigger. Big Data will soon become the key driver for almost any kind of decision that is to be made in manufacturing, retail and finance, all the way to astronomy, oceanography etc. The common aspect of all these industries and areas is that data is generated in the order of several petabytes to exabytes. Big Data is the technique to analyze such large volumes of data.

Big Data represents the techniques to handle the huge deluge of data that is already becoming enmeshed in our lives. Multiple disparate, varied streams of data (text, tweets, click streams, HTML) flow through with tremendous volume & velocity. The key aspects of data in the world are volume, variety and velocity. It is never ending and never seems to stop. How do we handle this deluge? How do we make sense of this data? That is what Big Data is all about.

Big Data provides algorithms to find patterns, determine trends or classify data depending on the features provided. It is supposed to enable decision makers to make key decisions based on the answers the algorithms spew forth.

Big Data is also complicated by the fact that data comes in multiple forms, from click streams, tweets, HTML and texts to CSVs, structured and unstructured data.

The ability to detect patterns, determine trends, classify, identify outliers is no easy task

In this post I take a philosophical look at Big Data and ask whether it can really help us. Will it help or will it take us on a wild goose chase? Can we trust the results?

Big Data depends on algorithms to make sense of data. Big Data deals with data that is in the order of petabytes to exabytes. At this scale, with multiple features, our cognitive abilities are of no use. We must rely on machines and algorithms to make sense of these large amounts of data. Our mind can handle a few hundred data points and at most 3 dimensions. Beyond that we can hardly make any sense of the data.

Data by itself, in the absence of features & algorithms, is indistinguishable from noise. It is data science that makes sense of data. Data science separates the signal from the noise.

It is the algorithms that try to determine the best fit for a given set of data. But how reliable are the results? For example, let us take the following case.

[Figure: data points forming a circle and a rectangle]

An unsupervised learning algorithm for the above data points could try to separate the data into 2 sets. Clearly this is one way, but what is more appropriate is that we have 2 shapes, the circle & the rectangle. A machine algorithm will work based on the features that we choose. Are we in a position to decide whether the answer the algorithm gives us is correct? We have no way of knowing, because the amount of data is beyond our cognitive capabilities.

In other words, Big Data is full of perils and pitfalls.

When we let the machine analyze on our behalf, the possibility of coming to a wrong conclusion is fairly high. This, coupled with the fact that we are sometimes led to erroneous judgments, as discussed below, compounds the problem further.

In his book “Thinking, Fast and Slow” Daniel Kahneman discusses several situations where our mind falls into the traps of lazy thinking and we come to wrong conclusions. Our minds also tend to detect patterns in data where there are none. According to Kahneman, ‘randomness appears as regularity or a tendency to cluster’ and ‘the tendency to see patterns in randomness is overwhelming’. One could argue that in Big Data, where it is the algorithm that determines the pattern, we could be tricked into coming to false conclusions. Sometimes the human mind sees causality where there is none. Occasionally we fail to see the obvious.

In the famous ‘invisible gorilla’ experiment the researchers tried to assess selective attention. The participants are asked to count the number of passes made by those in white t-shirts. Surprisingly, a large number of the participants were completely oblivious to a gorilla that appears midway in the video. When we, as humans, fail to see such large objects, can we expect the machine to accurately identify patterns and perform accurate classifications?

There are techniques that help in guarding against false positives, for e.g. the Bonferroni correction. Simply put, when one is testing 20 hypotheses simultaneously at a significance level of 0.05, the probability of getting at least 1 significant result purely by chance is

P(at least one significant result) = 1 – P(no significant results)

= 1 – (1 – 0.05)^20

= 0.64

So, with 20 tests being considered, we have a 64% chance of observing at least one significant result, even if all of the tests are actually not significant. This would be a false positive.
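
As a quick sanity check (my addition), this number can be verified in a single line of R:

> 1 - (1 - 0.05)^20
[1] 0.6415141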

Given that our ability to come to significant conclusions depends largely on being able to choose appropriate features, we must also be able to maneuver between false negatives and false positives. In addition we must also take into account the fallibility of the human mind.

Clearly, Big Data is the future! However with Big Data we are really on treacherous, slippery ground!

Find me on Google+

Reducing to the Map-Reduce paradigm- Thinking Web Scale – Part 1

In physics there are 4 types of forces – gravitational forces among celestial bodies, electromagnetic forces, and the strong and weak forces at the sub-atomic level. The equations that seem to work among large bodies don’t seem to apply at the sub-atomic level, though there have been several attempts at grand unification theories.

Similarly in computing we have computing at the personal level, the enterprise level, the data-center level and the web scale level. The problems and paradigms at each level are very different and unique. The sequential processing, relational database accesses and network speeds of the local area network level are very different from the parallel processing requirements, NoSQL based storage accesses and WAN latencies of the web scale level.

Here is the first of my posts on paradigms at the Web Scale.

The internet now contains in excess of 1 billion hosts.  This is based on a report in the World Fact Book published in 2012.

In these 1 billion and odd hosts there are at least ~1.5 billion pages that have been indexed. There must be several hundred million that are not indexed by the major search engines.

Search engines like Google, Bing or Yahoo have to work on several hundred million pages.  Similarly social web sites like Facebook, Twitter or LinkedIn have to deal with several hundred million users who constantly perform status updates, upload images, tweet etc. To handle large quantities of data efficiently and quickly there is a need for web scale algorithms.

One such algorithm is map-reduce, which had its origins at Google. Map-reduce essentially consists of a set of mappers which take as input a key-value pair and output 0 or more key-value pairs. The reducers take all tuples with the same key, combine them based on some function and emit a key-value pair.

[Figure: the map-reduce flow from mappers to reducers]

Map-reduce, and its open source avatar, Hadoop, are now used routinely to solve several large scale problems. To be honest, I was and still am puzzled whether the 2 simple task types of mapping & reducing can be used for a large variety of problems. However, it appears so.

I would have assumed that there would have been other flavors, maybe an ‘identify-update’, ‘determine-solve’ or some such equivalent, unless a large set of problems can be expressed as some combination of the map reduce paradigm.

Anyway, here are a few examples for which the map-reduce algorithm is useful.

Word Counting: The standard example for map-reduce is the word counting program. In this the map reduce algorithm generates a list of words with their corresponding word count from a set of input files. The Map task reads each document and breaks it into a sequence of words (w1, w2, w3 …). It then emits a key value pair as follows

(w1,1),(w2,1),(w3,1),(w1,1) and so on. If a word is repeated in the document it occurs multiple times in the output. Now all the key-value pairs are grouped by key and sent to one of the reducer tasks. Each reducer will then sum all the values for its key, thus giving the total count for each word.

[Figure: word count using map-reduce]
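
To make this concrete, here is a minimal sketch in base R (the three documents are made up); strsplit plays the role of the mapper emitting one entry per word occurrence, and table() plays the role of the shuffle and reduce steps that group by word and sum:

docs  <- c("the quick brown fox", "the lazy dog", "the quick dog")   # made-up documents
words <- unlist(strsplit(docs, " "))   # Map: one entry per word occurrence, i.e. (w, 1) pairs
table(words)                           # Reduce: group by word and sum the counts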

Matrix multiplication: Big Data is a typical challenge on the web, where there is a need to determine patterns and trends in mountains of data. Machine learning algorithms are utilized to determine structure in data that has the 3 characteristics of volume, variety and velocity. Machine learning algorithms typically depend on matrix operations. Map-reduce is ideally suited for these, and one of Google’s original purposes for map-reduce involved matrix multiplication.

Let us assume that we have an n x n matrix M whose element in row i and column j is mij. Also let us assume that there is a vector v whose jth element is vj. Then the matrix-vector product is the vector x of length n whose ith element is given as

xi = Σj mij vj

 

Map function: The map function applies to each single element of the matrix M. For each element mij the map task outputs a key-value pair (i, mij vj). Hence we will have key-value pairs for all ‘i’ from 1 to n.

Reduce function: The reduce function takes all pairs with the same key ‘i’ and sums them up.

Hence each reducer will generate

xi = Σj mij vj

(Reference: Mining of Massive Datasets– Anand Rajaraman, Jure Leskovec, Jeffrey D Ullman)
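
Here is a minimal R sketch (a toy 3 x 3 example of my own) of these map and reduce steps for the matrix-vector product:

M <- matrix(1:9, nrow = 3)   # a toy 3 x 3 matrix
v <- c(1, 2, 3)

# Map: for every element m_ij emit the key-value pair (i, m_ij * v_j)
keys   <- rep(1:3, times = 3)              # row index i of each element (column-major order)
values <- as.vector(sweep(M, 2, v, "*"))   # m_ij * v_j for every element

# Reduce: group the pairs by key i and sum the values
x <- tapply(values, keys, sum)
x   # identical to M %*% v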

This link gives a good write-up on a matrix x matrix multiplication.

Map-reduce for Relational Operations: Map-reduce can be used to perform a number of operations on large scale data that are used in database operations. Multiple database operations can be performed on large scale data like selection, projection, union, intersection, difference, natural join, grouping etc.

Here is an example taken from the ‘Web Intelligence & Big Data’ course on Coursera by Gautam Shroff.

Let us assume that there are 2 tables, ‘Sales by address’ and ‘City by address’, and the need is to find the total ‘Sales by City’. The SQL query for this is

SELECT SUM(Sale), City FROM Sales, Cities WHERE Sales.Addr_id = Cities.Addr_id GROUP BY City

This can be done by 2 map-reduce tasks.

The first map-reduce task GROUPs the Sales BY Address as follows

Map1: The first map task will emit (Address, rest of record (SALE/City))

Reduce1: The first reduce task will SUM (Sales) by Address for every City. Clearly this will have multiple occurrences of City.

At this point we will have the sum of the sales for every address along with its city. However each city can occur multiple times. Now we have to GROUP BY City.

Map2: Now the mapper emits (City, SALES) pairs

Reduce2: The 2nd reduce now SUMS all the sales for each city.

Clearly the map-reduce algorithm does address some major areas. It is extremely useful when there is a need to perform the same operation on multiple documents. It would definitely be useful in building an inverted index or in computing Page Rank. Also, map-reduce is very powerful in handling matrix operations. Large classes of problems like machine learning and computer vision use matrices extensively, and map-reduce is extremely valuable when this has to be done on large volumes of data. Besides, the ability of map-reduce to perform a large set of database operations is something that can be used in many situations on the web.

However it is no silver bullet for all types of problems.

Find me on Google+

Simplifying Machine Learning – K-Means clusters – Part 6

Our brain is an extraordinary apparatus. It is amazing how we humans can instantaneously perceive shapes, objects and forms. For e.g. when we see a scene with many objects, we are immediately able to identify the different objects in the scene. View this against the backdrop of Google’s recent artificial brain experiment, a neural network with 16,000 processors and a billion connections. This artificial brain was fed with 10 million thumbnails of YouTube videos before it was able to recognize cat videos.

That’s an awful lot of work to recognize cat videos!

We can see that a lot of work is involved in getting a computer to do something as simple as this.

Consider how a baby learns to recognize objects, for e.g. a cat, a dog, a toy etc. The human brain does not try to measure the number of eyes, the spacing between the eyes, the shape of the mouth etc. The brain is immediately able to distinguish the different animals. How does it do it? Amazing, right?

In any case here is a machine learning algorithm that is capable of identifying structure in data. This is also known as K-Means and is a form of unsupervised learning algorithm.

The K-Means algorithm takes as input an unlabeled data set and identifies groups in the set. It tries to determine structure in the data set.

Take a look at the picture below

[Figure: an unlabeled data set with 2 visible clusters]

It is readily obvious that there are 2 clusters in the above diagram. However to the computer this is just a random set of points.

How does the K-Means cluster identify the clusters in the above diagram?

The algorithm is fairly simply and intuitive.

1)    Let us start by choosing 2 random points which we call the ‘cluster centroids’

2)    We then associate each centroid with the points in the dataset that are closest to it.

3)    We then compute the average of each group of points associated with a centroid and move the centroid to that average.

4)    We then repeat steps 2 – 3 until there is no significant change in the centroids

This is shown below

[Figure: the cluster centroids moving over successive iterations]

The above algorithm can be implemented iteratively as follows

For training set (x1, x2, x3 …)

Randomly initialize K cluster centroids μ1, μ2, μ3 … μK

 

Repeat {

for i = 1 to m

c(i) = the index (from 1 to K) of the cluster centroid closest to xi   => (A)

end

for k = 1 to K

μk = the average (mean) of all points assigned to cluster k   => (B)

end

}

In step (A) each point xi is assigned to its closest centroid k. Hence if points 1, 3, 4, 8 are closest to centroid 1, then

c(1) = 1, c(3) = 1, c(4) = 1, c(8) = 1

In step (B) the mean of the points 1, 3, 4, 8 is taken

So the centroid

μ1x = ¼ (x1 + x3 + x4 + x8) and μ1y = ¼ (y1 + y3 + y4 + y8)

This becomes the new μ1
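
Here is a small, self-contained R sketch of the iterative algorithm above on made-up 2-dimensional data (it assumes every centroid keeps at least one point; in practice the built-in kmeans() function is the robust way to do this):

set.seed(2)
X <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),   # 25 points around (0, 0)
           matrix(rnorm(50, mean = 4), ncol = 2))   # 25 points around (4, 4)
K <- 2
centroids <- X[sample(nrow(X), K), ]                # random initialisation

for (iter in 1:20) {
  # step (A): assign every point to its closest centroid
  c_idx <- apply(X, 1, function(x) which.min(colSums((t(centroids) - x)^2)))
  # step (B): move each centroid to the mean of the points assigned to it
  centroids <- t(sapply(1:K, function(k) colMeans(X[c_idx == k, , drop = FALSE])))
}
centroids   # close to the true cluster centres (0, 0) and (4, 4)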

 

However there can be occasions where K-Means gets stuck in local optima. To choose the optimum cluster centroids we have to determine the least cost. This can be done with the optimization objective.

The optimization objective of K-Means is as follows

K-Means cluster determination is the problem of minimizing the squared distance of each point from its centroid. This is also known as the K-Means cost function or distortion function.

J(c(1),…,c(m), μ1,…,μK) = 1/m Σi || xi – μc(i) ||2

I like to visualize the algorithm as follows.

In step 1 we can visualize that there is a force of attraction between the datapoints and the cluster centroid based on proximity of the centroid.

In step 2 we can visualize that each datapoint attracts the centroid towards it. The centroid moves to the point where the attraction among all the datapoints balances out, which is the average (mean) of the assigned points.

As can be seen, the objective is to minimize the mean squared distance of each data point to its closest centroid.

Given a set of data points, how do we choose the initial random centroids? One way is to initially pick some random data points themselves as the cluster centroids. The algorithm is then iterated to identify the real cluster centroids.

As mentioned before, the algorithm can sometimes get stuck in local optima. One option is to choose another random set of initial centroids and iterate again. We need to run this several times, with different random initializations, to determine the best clustering.

There is also the problem of determining the number of cluster centroids. How do we determine how many clusters there are in a random data set? Visually we can easily identify the number of clusters, but a machine cannot.

One technique that can be used to determine the number of clusters is as follows. Run the algorithm with 2, 3 … 10 clusters and plot the cost function against the number of clusters. Then pick the value at the ‘elbow’ of the curve, beyond which adding more clusters reduces the cost only marginally.
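
As an illustration, here is a small R sketch using the built-in kmeans() function on two features of the iris dataset (my own example, not from the course); the total within-cluster sum of squares plays the role of the cost function:

set.seed(42)
data <- iris[, c("Petal.Length", "Petal.Width")]

# Total within-cluster sum of squares (the cost) for K = 1 ... 10
cost <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)
plot(1:10, cost, type = "b", xlab = "Number of clusters K", ylab = "Total within-cluster SS")

fit <- kmeans(data, centers = 3, nstart = 10)   # fit the chosen model
fit$centers                                     # the cluster centroids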

Note: This post, like previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng


Find me on Google+

The thing about the Internet of Things


Published in Smart World Jan-Feb 2014, The thing about the Internet of Things

Introduction: It is now common knowledge that the world is becoming more connected, instrumented and data driven. In a world of 7 billion people we have almost 10 billion devices connected to the internet. A recent report from Cisco suggests that the number of connected devices will almost touch 50 billion by the year 2020.

This huge increase in the number of connected devices will come largely from a couple of new technology trends, namely the Internet of Things (IoT), Smart Grids etc.

What exactly is the Internet of Things?

The first formal definition of the Internet of Things happened when ITU-T, the telecom wing of the United Nations’ ITU, came out with a report titled “The Internet of Things” in 2005. In this report ITU-T added a fourth dimension of ‘anything’ to the existing anyone, anywhere, anytime network. This report visualized a world where millions and millions of devices, whether passive, intelligent or sensors, collected data from the environment and sent it through the network to a backend processing system.

In Mark Weiser’s classic words, “the most profound technologies are those that disappear and weave themselves into the fabric of everyday life until they are indistinguishable from it”. Embedded intelligence in the things themselves will further enhance the power of the network. IoT is just this vision of Mark Weiser.

This fourth dimension of ‘things’, or intelligent sensors, gives the ability to gather data from the environment, which is then sent back through the wireless network to the internet for back-end processing. Analysis of the gathered data helps in forecasting events ahead of time.

The Internet of Things is also known as M2M or machine–to–machine computing, pervasive computing or ubiquitous computing.

The Maha Kumbh Mela experiment: Last year, 2013, coincided with the 12 year cycle of the Maha Kumbh Mela festival. More than 100 million people would have passed through the city of Allahabad for a holy dip at the Sangam, the confluence of the Ganges & Yamuna. Almost 95% of this human mass would have carried mobile phones equipped with location sensors. Researchers from Harvard, with the help of mobile telecom operators, ran an experiment to track the movement of people through the city of Allahabad to understand their behavior. It was hoped that the study of this large amount of data, as people moved through the city, would help in identifying signatures of impending disasters and how they can be avoided.

This is possible because mobile phones have the ability to send their location data back to the net for processing. This is an example of the Internet of Things.

Some applications of the Internet of Things are outlined below.

 

RFID or Radio Frequency Identification: RFID was one of the early enablers of this technology. The RFID tag is a passive device that responds with its identity when it is in the presence of an RFID receiver: the receiver transmits a signal and the RFID tag responds with its unique tag id. The RFID technology has been used extensively by large retail chains like Walmart of the US and Tesco of the UK. These stores RFID-tag all their products in the central warehouse. In the presence of an RFID receiver the RFID tags of all the products are read, so the warehouse has a complete list of its inventory. As the products move from the central warehouse to the regional warehouse and finally to the retail store, the products are tracked, so the retail chain knows exactly how many of each product are present in all its warehouses and stores. As customers buy products and check out at the counter, the count of the products in the store is also updated. So at any point in time each store will know the count of each of its products, and chains like Walmart can forecast if there is going to be a shortage of any of their products and move stock to the concerned store. In fact we can imagine a scenario where each shopping cart is equipped with an RFID receiver. As we keep putting products into our cart, the cart adds up the items we have taken so that the bill is ready when we reach the counter; we would not even need to scan the products at the check-out counter.

Highway tolls: An interesting application of IoT is the payment of highway tolls, in which vehicles do not need to stop to pay the toll. The toll is deducted from an RFID-tagged device carried in the vehicle. There are also applications in which the tires of cars are embedded with sensors to detect wear & tear. Insurance companies can use the driving data from these sensors to give discounts to safe drivers.

Car-to-car networks: Another certainty in the evolution of IoT is car-to-car networks. Vehicular Communication along with the Intelligent Transport Systems (ITS) achieves safety by enabling communication between vehicles, people and roads. Vehicle-to-vehicle communications are the fundamental building block of autonomous, self-driving cars. It enables the exchange of data between vehicles and allows automobiles to “see” and adapt to driving obstacles more completely, preventing accidents besides resulting in more efficient driving.

Intelligent homes: Rapid advances in technology will bring it closer to the home, both literally and figuratively. The future home will have the ability to detect the presence of people, pets and smoke, and changes in humidity, moisture, lighting and temperature. Smart devices will monitor the environment and take appropriate steps to save energy, improve safety and enhance the security of homes. Devices will start learning your habits and enhance your comfort and convenience. Everything from thermostats, fire detectors and washing machines to refrigerators will be equipped with electronics capable of adapting to the environment. ‘Nest’ is a smart thermostat that made headlines recently; the thermostat learns your requirements and adjusts the temperature accordingly. All gadgets in the Smart Home will be accessible through laptops, tablets or smartphones from anywhere. Other gadgets in intelligent homes are smart locks, smart lighting etc. Hence, we will be able to monitor all aspects of our intelligent home from anywhere.

Intelligent offices: Smart devices will also make major inroads into offices leading to the birth of intelligent offices where the lighting, heating, cooling will be based on the presence of people in the offices. This will result in an enormous savings in energy. The advances in intelligent homes and intelligent offices will be in the greater context of the Smart Grid.

eHealth: IoT is being used by some hospitals for the monitoring of heart patients. Here a device is implanted into the patient. The device regularly sends data to a doctor who can monitor the patient’s pulse rate, heart rate, blood pressure etc. It can warn the physician when it detects an irregularity in the patient’s heart rhythm, and the physician can then call the patient and advise on the appropriate medication to take, avoiding a real cardiac arrest.

Smart Cities: How often we sit fretting and fuming in a traffic jam, contributing to air pollution. Smart Cities are equipped with multiple devices that identify and measure traffic speed and volume on city roads. At the back end the systems analyze this continuous stream of real-time data and provide alternative routes based on predictive analytics over real-time and historical data. Studies have also shown that it is possible to control traffic by offering discounts to drivers on less crowded roads.

Smart Grid: The grid, or the legacy electrical network, has three components to it, namely energy generation, energy transmission and energy distribution. The conventional electrical grid which is prevalent in most countries throughout the world has extremely high transmission losses, besides having other issues. Typically an outage in one part of the network causes a cascading, domino effect throughout the network. Remember the infamous blackout in the US in 2003, which was the largest blackout in US history. Closer to home, in India, we had the world’s biggest blackout in Jul 2012, which left over 600 million people powerless for close to 2 days.

With the advent of Smart Grid the legacy electrical grid will have millions of electrical sensors which monitor the flow of energy. If there is a fault in any part of the network the sensors ensure that the failure is isolated so that outage does not spread to other parts.

Besides, instead of the regular electrical meters, Smart Grids include the concept of the Smart Home equipped with smart meters. These smart meters have two-way communication. The price of energy which we get from the grid varies like a stock price. With smart meters and smart appliances, the appliances can turn on when the price of drawing energy is low.

Wearable technologies: The latest entrants to IoT are wearable technologies like smart watches, Google Glass and health bands. These devices constantly monitor, measure and send data for processing to the backend. For e.g. Google Glass can immediately recognize prominent landmarks and display information about them. Similarly health bands like Fitbit, Nike FuelBand etc. can now measure steps and heart rate and provide feedback.

Challenges: There are still many challenges on the way to a future filled with M2M. There is still no universally accepted protocol. There are many competing protocols like WiFi, Zigbee, MQTT, XMPP etc., and there is yet to be a single common standard between the devices and the networks of the Internet of Things.

In any case, the Internet of Things or M2M is a happening technology that will soon come into our neighborhood, and we should all expect to be swamped by this tidal wave in the near future.

Find me on Google+

Simplifying ML: Impact of the degree of the polynomial on bias & variance and other insights

This post takes off from my earlier post Simplifying Machine Learning: Bias, variance, regularization and odd facts – Part 4. As discussed earlier, a poor hypothesis function could either underfit or overfit the data. If the number of features selected is small, of the order of 1 or 2 features, then we could plot the data and try to determine how the hypothesis function fits the data. We could also see whether the function is capable of predicting output target values for new data.

However if the number of features is large, for e.g. of the order of 10’s of features, then there needs to be a method by which one can determine if the learned hypothesis is a ‘just right’ fit for all the data.

Checkout my book ‘Deep Learning from first principles Second Edition- In vectorized Python, R and Octave’.  My book is available on Amazon  as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($12.99) and Kindle($9.99/Rs449) versions.

 

The following technique can be used to determine the ‘goodness’ of a hypothesis or how well the hypothesis can fit the data and can also generalize to new examples not in the training set.

Several insights on how to evaluate a hypothesis are given below.

Consider a hypothesis function

hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x^2 + Ɵ3x^3 + Ɵ4x^4

[Figure: a high-degree polynomial hypothesis that overfits the training data]

The above hypothesis does not generalize well enough for new examples in the data set.

Let us assume that there are 100 training examples. Instead of using the entire set of 100 examples to learn the hypothesis function, the data set is divided into a training set and a test set in a 70%:30% ratio respectively.

The hypothesis is learned from the training set. The learned hypothesis is then checked against the 30% test set data to determine whether the hypothesis is able to generalize on the test set also.

This is done by determining the error when the hypothesis is used against the test set.

For linear regression the error is computed by determining the mean squared error of the predicted output value against the actual value as follows.

The test set error is computed as follows

Jtest(Ɵ) = 1/(2 mtest) Σ (hƟ(xtest(i)) – ytest(i))^2

For logistic regression the test set error is similarly determined as

Jtest(Ɵ) = -1/mtest Σ [ ytest(i) * log(hƟ(xtest(i))) + (1 – ytest(i)) * log(1 – hƟ(xtest(i))) ]

The idea is that the test set error should be as low as possible.

Model selection

A typical problem in determining the hypothesis is to choose the degree of the polynomial or to choose an appropriate model for the hypothesis

The method that can be followed is to choose 10 polynomial models

  1. hƟ(x) = Ɵ0 + Ɵ1x
  2. hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x^2
  3. hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x^2 + Ɵ3x^3
  …

Here ‘d’ is the degree of the polynomial. One method is to train all the 10 models, run each model’s hypothesis against the test set, and then choose the model with the smallest error cost.

While this appears to a good technique to choose the best fit hypothesis, in reality it is not so. The reason is that the hypothesis chosen is based on the best fit and the least error for the test data. However this does not generalize well for examples not in the training or test set.

So the correct method is to divide the data into 3 sets in a 60:20:20 ratio, where 60% is the training set, 20% is the cross-validation set used to pick the best fit, and the remaining 20% is the test set.

The steps carried out against the data is

  1. Train all 10 models against the training set (60%)
  2. Compute the cost value J against the cross-validation set (20%)
  3. Determine the lowest cost model
  4. Use this model against the test set and determine the generalization error.
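
Here is a minimal R sketch of this procedure on synthetic data (the data, the 60:20:20 split and the use of lm() with poly() are my own illustration, not from the course):

set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x - 0.03 * x^2 + rnorm(100, sd = 0.5)
dat <- data.frame(x, y)

idx   <- sample(1:3, 100, replace = TRUE, prob = c(0.6, 0.2, 0.2))   # 60:20:20 split
train <- dat[idx == 1, ]; cv <- dat[idx == 2, ]; test <- dat[idx == 3, ]

# Train polynomial models of degree 1..10 and compute the cross-validation error
cv_error <- sapply(1:10, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = train)
  mean((predict(fit, cv) - cv$y)^2) / 2
})
best_d <- which.min(cv_error)

# Generalization error of the chosen model on the held-out test set
best_fit   <- lm(y ~ poly(x, best_d), data = train)
test_error <- mean((predict(best_fit, test) - test$y)^2) / 2
c(best_d, test_error)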

Degree of the polynomial versus bias and variance

How does the degree of the polynomial affect the bias and variance of a hypothesis?

Clearly for a given training set when the degree is low the hypothesis will underfit the data and there will be a high bias error. However when the degree of the polynomial is high then the fit will get better and better on the training set (Note: This does not imply a good generalization)

We run all the models with different polynomial degrees on the cross validation set. What we will observe is that when the degree of the polynomial is low then the error will be high. This error will decrease as the degree of the polynomial increases as we will tend to get a better fit. However the error will again increase as higher degree polynomials that overfit the training set will be a poor fit for the cross validation set.

This is shown below

[Figure: training and cross-validation error versus the degree of the polynomial]

Effect of regularization on bias & variance

Here is the technique to choose the optimum value for the regularization parameter λ

When λ is small the Ɵi values are large and we tend to overfit the data set. Hence the training error will be low but the cross-validation error will be high. However when λ is large the values of Ɵi become negligible, leaving almost only the constant term Ɵ0. This will underfit the data and result in both a high training error and a high cross-validation error. Hence the chosen value of λ should be the one for which the cross-validation error is the lowest.

[Figure: training and cross-validation error versus the regularization parameter λ]

Plotting learning curves

This is another technique to identify if the learned hypothesis has a high bias or a high variance based on the number of training examples

A high bias indicates an underfit. When the number of samples in the training set is low, the training error will be low, as it is easy to fit a hypothesis to a few training examples, while the cross-validation error will be high. As the number of samples increases, the error will increase for the training set and will slightly decrease for the cross-validation set. However for a high bias, or underfit, after a certain point increasing the number of samples will not change the error: both errors flatten out at a relatively high value. This is the case of high bias.

[Figure: learning curves for a hypothesis with high bias]

In the case of high variance where a high degree polynomial is used for the hypothesis the training error will be low for smaller number of training examples. As the number of training examples increase the error will increase slowly. The cross validation error will be high for lesser number of training samples but will slowly decrease as the number of samples grow as the hypothesis will learn better. Hence for the case of high variance increasing the number of samples in the training set size will decrease the gap between the cross validation and the training error as shown below

[Figure: learning curves for a hypothesis with high variance]

Note: This post, like previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng

Also see
1. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
2.My book ‘Deep Learning from first principles:Second Edition’ now on Amazon
3.The Clash of the Titans in Test and ODI cricket
4. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
5.Latency, throughput implications for the Cloud
6. Simulating a Web Joint in Android
7. Pitching yorkpy … short of good length to IPL – Part 1