Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid

Having just completed Stanford's highly stimulating and inspiring Machine Learning course at Coursera, taught by the incomparable Professor Andrew Ng, I wanted to give my newly acquired knowledge a try. As a start, I decided to try my hand at analyzing one of India's fastest rising stars, namely Virat Kohli. For the data on Virat Kohli I used the 'Statistics database' at ESPN Cricinfo. To make matters more interesting, I also pulled data on the iconic Sachin Tendulkar and Mr. Dependable, Rahul Dravid.

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of the performances of batsmen and bowlers, besides evaluating team and match performances in Tests, ODIs, T20s and the IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the Kindle versions. The books can be accessed at Cricket analytics with cricketr and Beaten by sheer pace – Cricket analytics with yorkr. A must read for any cricket lover! Check it out!!


(Also do check out my R package cricketr – Introducing cricketr!: An R package to analyze performances of cricketers – and my interactive Shiny app built on it – Sixer – R package cricketr's new Shiny avatar.)

Based on the data of these batsmen, I make some predictions with the help of machine learning algorithms. That I have a proclivity for prediction is not surprising, considering that my Dad was an astrologer who had reasonable success at this esoteric art. While he would be concerned with planetary positions, Rahu in the 7th house being malefic and so on, I focus my predictions on multivariate regression analysis and K-Means clustering. The first part of this post gives the results of my analysis and some predictions for Kohli, Tendulkar and Dravid.

The second part of the post contains a brief outline of the implementation, but not the actual implementation details. This is to ensure that I don't violate Coursera's Machine Learning Honor Code.

The code, the data used and the output obtained can be accessed on GitHub at ml-cricket-analysis.

Analysis and prediction of Kohli, Tendulkar and Dravid with Machine Learning

As mentioned above, I pulled the data for the 3 cricketers Virat Kohli, Sachin Tendulkar and Rahul Dravid. The data taken from the Cricinfo database for the 3 batsmen is based on the following assumptions

  • Only ‘Minutes at Crease’ and ‘Balls Faced’ were taken as features against the output variable ‘Runs scored’
  • Only Test matches were considered, both ‘home’ and ‘away’ Tests
  • The data was cleaned to remove any DNB (did not bat) entries
  • No extra weightage was given to ‘not out’ innings. So if Kohli made 28* (28 not out), this was taken as 28 runs

Regression Analysis for Virat Kohli

There are 51 data points for Virat Kohli from the Tests he has played. The data for Kohli is displayed as a 3D scatter plot where the x-axis is ‘minutes at crease’, the y-axis is ‘balls faced’ and the vertical z-axis is ‘runs scored’. Multivariate regression analysis was performed to find the best-fitting plane for the runs scored, based on the selected features of ‘minutes’ and ‘balls faced’.

This is based on minimizing the cost function and then performing gradient descent for 400 iterations to check for convergence. The resulting 3-D plane provides the best fit for Kohli's data points. The diagrams below show the prediction plane of expected runs for a combination of ‘minutes at crease’ and ‘balls faced’; here are 2 such plots for Virat Kohli (kohli), with another view of the prediction plane (kohli-1).

Prediction for Kohli

I have also computed the predicted runs that Kohli would score for different combinations of ‘minutes at crease’ and ‘balls faced’. As an example, from the table below (kohli-score), the predicted runs for Kohli after being at the crease for 110 minutes and facing 135 balls is 54 runs.

Regression analysis for Sachin Tendulkar

There was a lot more data on Tendulkar and I was able to extract close to 329 data points. As before, ‘minutes at crease’ and ‘balls faced’ vs ‘runs scored’ were plotted as a 3D scatter plot. The prediction plane is calculated using gradient descent and is shown in the diagrams below (srt, srt-1).

Predicted runs for Tendulkar

The table below (srt-score) gives the predicted runs for Tendulkar for combinations of time at crease and balls faced. From it, Tendulkar is predicted to score 57 runs in 110 minutes after facing 135 deliveries.

Regression Analysis for Rahul Dravid

The same was done for ‘the Wall’, Dravid. The prediction plane is shown below (dravid, dravid-1).

Predicted runs for Dravid

The predicted runs for Dravid for combinations of batting time and balls faced are included below (dravid-score). The predicted runs for Dravid after facing 135 deliveries in 110 minutes is 44.

Further analysis

While the ‘prediction plane’ is useful, it does not by itself give a clear picture of how effective each batsman is. The 3D plots show at least 3 clusters for each batsman: for all three, the clustering is densest near the origin, becomes less dense towards the middle and is sparse at the far end. This indicates the phase of an innings in which each batsman is most prone to get out. So I decided to perform K-Means clustering on the data for the 3 batsmen. This gives the 3 general tendencies for each batsman. The output is included below.

K-Means for Virat Kohli

The K-Means clustering for Virat Kohli indicates the following

Centroids found 255.000000 104.478261 19.900000
Centroids found 194.000000 80.000000 15.650000
Centroids found 103.000000 38.739130 7.000000

Analysis of Virat Kohli’s batting tendency
Kohli has a 45.098 percent tendency to bat for 104 minutes,  face 80 balls and score 38 runs
Kohli has a 39.216 percent tendency to bat for 19 minutes, face 15 balls and score 7 runs
Kohli has a 15.686 percent tendency to bat for 255 minutes, face 194 balls and score 103 runs

The computation of this is included in the diagram below

kohli-kmeans
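For readers who want to try the clustering themselves, here is a minimal sketch of K-Means in Python/NumPy (not the Octave code used for this post). The (minutes, balls, runs) rows below are made-up values purely for illustration, and the 'tendency' percentages are simply the fraction of innings assigned to each centroid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up innings: each row is (minutes at crease, balls faced, runs scored)
data = np.array([[20, 15, 7], [25, 18, 9], [18, 14, 6], [100, 78, 36], [110, 85, 40],
                 [95, 75, 35], [250, 190, 101], [260, 200, 105]], dtype=float)

def kmeans(points, k, iterations=100):
    """Plain K-Means: alternate between assigning points and moving centroids."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it
        # (keep the old centroid if a cluster happens to become empty)
        centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return centroids, labels

centroids, labels = kmeans(data, k=3)
for j, c in enumerate(centroids):
    share = 100.0 * np.mean(labels == j)   # the 'tendency' is just this cluster's share of innings
    print(f"{share:.1f}% tendency: {c[0]:.0f} min, {c[1]:.0f} balls, {c[2]:.0f} runs")
```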

K-means for Sachin Tendulkar

The K-Means for Sachin Tendulkar indicate the following

Centroids found 166.132530 353.092593 43.748691
Centroids found 121.421687 250.666667 30.486911
Centroids found 65.180723 138.740741 15.748691

Analysis of Sachin Tendulkar’s performance

Tendulkar has a 58.232 percent tendency to bat for 43 minutes, face 30 balls and score 15 runs
Tendulkar has a 25.305 percent tendency to bat for 166 minutes, face 121 balls and score 65 runs
Tendulkar has a 16.463 percent tendency to bat for 353 minutes, face 250 balls and score 138 runs
srt-kmeans

K-Means for Rahul Dravid

Centroids found 191.836364 409.000000 50.506024
Centroids found 137.381818 290.692308 34.493976
Centroids found 56.945455 131.500000 13.445783

Analysis of Rahul Dravid’s performance
Dravid has a 50.610 percent tendency to bat for 50 minutes,  face 34 balls and score 13 runs
Dravid has a 33.537 percent tendency to bat for 191 minutes,  face 137 balls and score 56 runs
Dravid has a 15.854 percent tendency to bat for 409 minutes, face 290 balls and score 131 runs
dravid-kmeans

Some implementation details

The entire analysis and coding was done with Octave 3.2.4. I have included the outline of the code for performing the multivariate regression. In essence, the pseudo code for this is:

  1. Read the batsman data (Minutes, balls faced versus Runs scored)
  2. Calculate the cost
  3. Perform Gradient descent

The cost was plotted against the number of iterations to ensure convergence while performing gradient descent (convergence-kohli), and the 3-D plane that best fits the data was then plotted.
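For the curious, here is a rough Python/NumPy equivalent of that outline. The actual analysis was done in Octave, and the handful of data points below are made-up values, not Kohli's actual innings; the steps are the same: read the data, minimize the cost with gradient descent for 400 iterations, and predict the runs for a given combination of minutes and balls.

```python
import numpy as np

# 1. Read the batsman data: made-up (minutes at crease, balls faced) -> runs scored
X_raw = np.array([[28, 22], [104, 80], [255, 194], [63, 51], [160, 120], [19, 15]], dtype=float)
y = np.array([10, 38, 103, 25, 60, 7], dtype=float)
m = len(y)

# Feature normalization so that gradient descent converges quickly
mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X = np.hstack([np.ones((m, 1)), (X_raw - mu) / sigma])   # prepend the intercept column

# 2. Calculate the cost J(theta) = 1/(2m) * sum((X.theta - y)^2)
def cost(theta):
    residual = X @ theta - y
    return residual @ residual / (2 * m)

# 3. Perform gradient descent for 400 iterations, keeping the cost history
theta = np.zeros(X.shape[1])
alpha = 0.1
history = []
for _ in range(400):
    theta -= alpha * (X.T @ (X @ theta - y)) / m
    history.append(cost(theta))   # plot this against the iteration number to check convergence

# Predict the runs for 110 minutes at the crease and 135 balls faced
query = np.hstack([1.0, (np.array([110.0, 135.0]) - mu) / sigma])
print("Predicted runs:", query @ theta)
```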
The outline of this code, data used and the output obtained  can be accessed at GitHub at ml-cricket-analysis

Conclusion

Comparing the results from the K-Means: Tendulkar has roughly a 42% tendency (25.3% + 16.5%) to make a score of 65 or more
Tendulkar has a 25.305 percent tendency to bat for 166 minutes, face 121 balls and score 65 runs
Tendulkar has a 16.463 percent tendency to bat for 353 minutes, face 250 balls and score 138 runs

And Dravid has a similar tendency, roughly 49% (33.5% + 15.9%), to score 56 runs or more
Dravid has a 33.537 percent tendency to bat for 191 minutes,  face 137 balls and score 56 runs
Dravid has a 15.854 percent tendency to bat for 409 minutes, face 290 balls and score 131 runs

Kohli has around a 45% tendency to score about 38 runs
Kohli has a 45.098 percent tendency to bat for 104 minutes,  face 80 balls and score 38 runs

Also, Kohli has a smaller tendency to get out for a low score than the other two
Kohli has a 39.216 percent tendency to bat for 19 minutes, face 15 balls and score 7 runs

The results must be looked at in proper perspective, as Kohli is just starting his career while the other 2 are veterans. Kohli has a long way to go, and I am certain that he will blaze a trail of glory in the years to come!

Watch this space!

Also see
1. My book ‘Practical Machine Learning with R and Python’ on Amazon
2.Introducing cricketr! : An R package to analyze performances of cricketers
3.Informed choices with Machine Learning 2 – Pitting together Kumble, Kapil and Chandra
4. Analyzing cricket’s batting legends – Through the mirage with R
5. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
6. Bend it like Bluemix, MongoDB with autoscaling – Part 1

 

Simplifying ML: Recommender Systems – Part 7

In this age of Amazon, Netflix and the app stores, where products, movies and apps are purchased online, up-selling and cross-selling is done through recommender systems.

When we go to a site like Amazon or Flipkart, or purchase apps on the App Store or Google Play, we often see things like “People who bought this book/app also bought X, Y, Z”. These recommendations are recommender system algorithms in action.

Recently, Netflix ran a competition in which participants had to come up with the best algorithm to recommend films that a user would also like. The prize money was of the order of $1 million. That is how critical recommender systems are to organizations of today, where most transactions happen on the web.

Typically users are asked to give a rating of 1 to 5, with 1 being the lowest and 5 the highest. So, for example, if we had classics like Moby Dick and Great Expectations, current best sellers like The Client and The Da Vinci Code, and science fiction like 2001: A Space Odyssey, we can expect that different people will rate the books differently. Obviously not everybody would have read every book in the list, so some entries would be blank.


Recommender systems are based on machine learning algorithms. The goal of these algorithms is to predict the score any user would give to books they did not rate – in other words, the rating buyers would give to books or apps they did not buy. If the algorithm predicts a high rating for an item, we can recommend it on the assumption that the user would also ‘like’ it. Alternatively we can recommend books/apps bought by users who bought the same books/apps as this user.

The notation is

nu = Number of users

nb = Number of books

r(i,j) = Boolean whether user j rated a book i

y(i,j) = The rating user j gave book i

mj  = The number of books that user j rated

Content based recommendation

In a typical content based recommendation algorithm we assume that we have data about the items we want to recommend, e.g. books, products or apps. In the example of books bought in an online bookstore, we assume some features for each book – in our case ‘classic’, ‘fiction’ etc.


So each book has its own feature vector, where x1 is the feature vector of the first book, x2 the feature vector of the 2nd book and so on.

The parameters can be learned through linear regression, by minimizing a cost function given by the sum of squared errors between the predicted and actual ratings.

So for a parameter vector θj and a feature vector xi, the recommender system will try to predict the rating that a user j will give a book i.

This can be written as

Number of stars (rating) = (θj)T xi

 

This reduces to the following minimization problem for user j, where the sum runs over the books i that the user has actually rated (r(i,j) = 1):

min over θj :  1/2m Σ ((θj)T xi – y(i,j))²

Adding the regularization term this becomes

min over θj :  1/2m Σ ((θj)T xi – y(i,j))²  +  λ/2m Σk (θjk)²

where the first sum is again over the books i with r(i,j) = 1 and θjk denotes the kth component of θj.

The recommender algorithm in essence tries to learn the parameters θj for the set of features xi of the chosen items – books in this case.

The recommender tries to learn the parameters for all the users

min over θ1,…,θnu :  1/2m Σj Σ ((θj)T xi – y(i,j))²  +  λ/2m Σj Σk (θjk)²

where for each user j the inner sum runs over the books i with r(i,j) = 1.

The minimization is performed by gradient descent as

θjk := θjk – α ( Σ ((θj)T xi – y(i,j)) xik + λ θjk )     (for k ≠ 0, with the sum over the books i that user j has rated)

 

Recommender systems thus try to learn parameters for a set of chosen features over all users. Based on the learnt parameters, they then predict the rating a user would give to books/apps that he is yet to purchase, and push up those items for which the user is likely to give a high rating, given the existing set of ratings.
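A minimal sketch of this in Python/NumPy is shown below. The 'book' feature vectors, the ratings matrix Y and the rated/not-rated mask R are made-up toy values, and the loop is plain batch gradient descent on the regularized cost described above.

```python
import numpy as np

# Toy setup: 5 books, each with 2 hand-crafted features (say, 'classic' and 'fiction' scores)
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]])
# Ratings matrix Y (books x users); R flags which entries were actually rated
Y = np.array([[5, 1], [4, 0], [1, 5], [0, 4], [3, 3]], dtype=float)
R = np.array([[1, 1], [1, 0], [1, 1], [0, 1], [1, 1]])

n_books, n_users = Y.shape
X = np.hstack([np.ones((n_books, 1)), X])     # intercept feature x0 = 1 for every book
Theta = np.zeros((n_users, X.shape[1]))       # one parameter vector theta_j per user
alpha, lam, iterations = 0.1, 0.1, 500        # learning rate, regularization, iterations

for _ in range(iterations):
    predictions = X @ Theta.T                 # predicted rating of book i by user j
    errors = (predictions - Y) * R            # only count the books each user actually rated
    grad = errors.T @ X                       # gradient of the squared error for every user at once
    grad[:, 1:] += lam * Theta[:, 1:]         # regularize every parameter except the intercept
    Theta -= alpha * grad / n_books           # batch gradient descent step

# Predict what user 1 (second column) would rate book 1 (second row), which was left unrated
print("Predicted rating:", X[1] @ Theta[1])
```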

Recommender systems contribute substantially to the revenues of e-commerce sites like Amazon, Flipkart, Netflix etc

Note: This post, like my previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng


Find me on Google+

Perils and pitfalls of Big Data

Big Data is hurtling towards us in a big way. It is already in the news and the blip seems to be getting bigger. Big Data will soon become the key driver for almost any kind of decision in manufacturing, retail and finance, all the way to astronomy, oceanography and more. The common aspect across these industries and areas is that data is generated in the order of several petabytes to exabytes. Big Data is the set of techniques used to analyze such large volumes of data.

Big Data represents the techniques for handling the huge deluge of data that is already becoming enmeshed in our lives. Multiple disparate, varied streams of data (text, tweets, click streams, HTML) flow through with tremendous volume and velocity. The key aspects of this data are its volume, variety and velocity. It is never ending and never seems to stop. How do we handle this deluge? How do we make sense of this data? That is what Big Data is all about.

Big Data provides algorithms to find patterns, determine trends or classify data depending on the features provided. It is supposed to enable decision makers to make key decisions based on the answers the algorithms spew forth.

Big Data is also complicated by the fact that data comes in multiple forms, from click streams, tweets, HTML and text files to CSVs, structured and unstructured data.

The ability to detect patterns, determine trends, classify and identify outliers is no easy task.

In this post I take a philosophical look at Big Data and ask whether it can really help us. Will it help, or will it take us on a wild goose chase? Can we trust the results?

Big Data depends on algorithms to make sense of data, and it deals with data in the order of petabytes to exabytes. At this scale, with multiple features, our cognitive abilities are of no use; we must rely on machines and algorithms to make sense of these large amounts of data. Our mind can handle a few hundred data points and at most 3 dimensions. Beyond that we can hardly make sense of the data.

Data by itself, in the absence of features & algorithms, is indistinguishable from noise. It is data science that makes sense of data. Data science separates the signal from the noise.

It is the algorithms that try to determine the best fit for a given set of data. But how reliable are the results? For example, let us take the following case.


An unsupervised learning algorithm for the above data points could try to separate the data into 2 sets. Clearly this is one way, but a more appropriate reading is that we have 2 shapes, a circle and a rectangle. A machine algorithm works based on the features that we choose. Are we in a position to decide whether the answer the algorithm gives us is correct? We have no way of knowing, because the amount of data is beyond our cognitive capabilities.

In other words, Big Data is full of perils and pitfalls.

When we let the machine analyze on our behalf, the possibility of coming to a wrong conclusion is fairly high. Coupled with the fact that we ourselves are sometimes led to erroneous judgments, as discussed below, the problem is further compounded.

In his book “Thinking, Fast and Slow” Daniel Kahneman discusses several situations where our mind falls into the traps of lazy thinking and we come to wrong conclusions. Our minds also tend to detect patterns in data where there are none. Sometimes, according to Kahneman, ‘randomness appears as regularity or a tendency to cluster’, and ‘the tendency to see patterns in randomness is overwhelming’. In Big Data it is the algorithm that determines the pattern, and we could be tricked into coming to false conclusions. Sometimes the human mind sees causality where there is none; occasionally we fail to see the obvious.

In the famous ‘invisible gorilla’ experiment, researchers assessed selective attention. The participants are asked to count the number of passes made by the players in white t-shirts. Surprisingly, a large number of participants were completely oblivious to a gorilla that appears midway through the video. When we, as humans, fail to see such large objects, can we expect a machine to accurately identify patterns and perform accurate classifications?

There are techniques that help in dealing with such false positives, for e.g. the Bonferroni correction. The issue it addresses is easy to see: suppose we test 20 hypotheses simultaneously, each at a significance level of 0.05. Then the probability of getting at least 1 significant result purely by chance is

P(at least one significant result) = 1 – P(no significant results)

= 1 – (1 – 0.05)^20

= 0.64

So, with 20 tests being considered, we have a 64% chance of observing at least one significant result even if none of the tests is actually significant: a false positive. The Bonferroni correction deals with this by testing each individual hypothesis at a significance level of 0.05/20 instead of 0.05.
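The arithmetic is easy to reproduce; the tiny Python check below also shows the effect of the Bonferroni-corrected per-test threshold (the overall significance level divided by the number of tests):

```python
# Probability of at least one 'significant' result across 20 independent tests
alpha, tests = 0.05, 20
print(1 - (1 - alpha) ** tests)              # ~0.64 at the uncorrected 0.05 level

bonferroni_alpha = alpha / tests             # Bonferroni-corrected per-test threshold
print(1 - (1 - bonferroni_alpha) ** tests)   # ~0.049, back near the desired 5% overall
```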

Given that our ability to come to significant conclusions depends largely on being able to choose appropriate features, we must also be able to maneuver between false negatives and false positives. In addition we must also take into account the fallibility of the human mind.

Clearly, Big Data is the future! However with Big Data we are really on treacherous, slippery ground!

Find me on Google+

Reducing to the Map-Reduce paradigm- Thinking Web Scale – Part 1

In physics there are 4 types of forces – gravitational forces among celestial bodies, electromagnetic forces, and the strong and weak forces at the sub-atomic level. The equations that seem to work among large bodies don't seem to apply at the sub-atomic level, though there have been several attempts at grand unification theories.

Similarly in computing we have computing at the personal level, the enterprise level, the data-center level and at web scale. The problems and paradigms at each level are very different and unique. The sequential processing, relational database accesses and network speeds at the local area network level are very different from the parallel processing requirements, NoSQL based storage accesses and WAN latencies at web scale.

Here is the first of my posts on paradigms at the Web Scale.

The internet now contains in excess of 1 billion hosts.  This is based on a report in the World Fact Book published in 2012.

Across these 1 billion odd hosts there are at least ~1.5 billion pages that have been indexed, and there must be several hundred million more that are not indexed by the major search engines.

Search engines like Google, Bing or Yahoo have to work on several hundred million pages.  Similarly social web sites like Facebook, Twitter or LinkedIn have to deal with several hundred million users who constantly perform status updates, upload images, tweet etc. To handle large quantities of data efficiently and quickly there is a need for web scale algorithms.

One such algorithm is map-reduce, which had its origins at Google. Map-reduce essentially consists of a set of mappers which take a key-value pair as input and output 0 or more key-value pairs. The reducer takes all tuples with the same key, combines them based on some function and emits a key-value pair.


Map-reduce, and its open source avatar Hadoop, are now used routinely to solve several large scale problems. To be honest, I was, and still am, puzzled that the 2 simple task types of mapping and reducing can be used for such a large variety of problems. However, it appears so.

I would have assumed that there would have been other flavors, maybe an ‘identify-update’, ‘determine-solve’ or some such equivalent, unless a large set of problems can be expressed as some combination of the map reduce paradigm.

Anyway, here are a few examples for which the map-reduce algorithm is useful.

Word Counting: The standard example for map-reduce is the word counting program. Here the map-reduce algorithm generates a list of words with their corresponding counts from a set of input files. The Map task reads each document and breaks it into a sequence of words (w1, w2, w3 …). It then emits key-value pairs as follows:

(w1,1), (w2,1), (w3,1), (w1,1) and so on. If a word is repeated in the document it occurs multiple times in the output. Now all the key-value pairs are grouped by key and sent to one of the reducer tasks. Each reducer then sums all the values for its key, giving the total count for each word.
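To make this concrete, here is a minimal single-machine simulation of the two phases in Python; it is not Hadoop code, just the map, shuffle and reduce steps spelled out on a toy list of documents.

```python
from itertools import groupby

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

def mapper(doc):
    """Map task: emit a (word, 1) pair for every word in the document."""
    for word in doc.split():
        yield (word, 1)

# Shuffle phase: gather all the pairs and group them by key (the word)
pairs = sorted(pair for doc in documents for pair in mapper(doc))

def reducer(word, counts):
    """Reduce task: sum the values for a single key."""
    return (word, sum(counts))

word_counts = [reducer(word, (count for _, count in group))
               for word, group in groupby(pairs, key=lambda kv: kv[0])]
print(word_counts)   # [('brown', 1), ('dog', 2), ('fox', 1), ('lazy', 1), ('quick', 2), ('the', 3)]
```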


Matrix multiplication: Big Data is a typical challenge on the web, where there is a need to determine patterns and trends in mountains of data. Machine learning algorithms are used to determine structure in data that has the 3 characteristics of volume, variety and velocity, and these algorithms typically depend on matrix operations. Map-reduce is ideally suited for this, and one of Google's original purposes for map-reduce involved such matrix computations.

Let us assume that we have an n x n matrix M whose element in row i and column j is mij.

Also let us assume that there is a vector ‘v’ whose jth element is vj. Then the matrix-vector product is the vector x of length n whose ith element is given as

xi = Σj mij vj

 

Map function: The map function applies to each element of the matrix M. For each element mij the map task outputs the key-value pair (i, mij vj). Hence we will have key-value pairs for every row index ‘i’ from 1 to n.

Reduce function: The reduce function takes all pairs with the same key ‘i’ and sums them up.

Hence each reducer will generate

xi = Σj mij vj
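A toy simulation of this in Python (again, not a real map-reduce framework; the 2 x 2 matrix and vector are made up) looks like the following.

```python
from collections import defaultdict

# A small n x n matrix M stored as explicit (i, j, m_ij) entries, and a vector v
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [5.0, 6.0]

def mapper(entry):
    """Map task: for element m_ij emit the key-value pair (i, m_ij * v_j)."""
    i, j, m_ij = entry
    return (i, m_ij * v[j])

# Shuffle: group the partial products by the row index i
grouped = defaultdict(list)
for key, value in map(mapper, M):
    grouped[key].append(value)

# Reduce task: sum the partial products for each row i to obtain x_i
x = {i: sum(values) for i, values in grouped.items()}
print(x)   # {0: 17.0, 1: 39.0}
```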

(Reference: Mining of Massive Datasets– Anand Rajaraman, Jure Leskovec, Jeffrey D Ullman)

This link gives a good write-up on matrix x matrix multiplication.

Map-reduce for Relational Operations: Map-reduce can be used to perform a number of operations on large scale data that are used in database operations. Multiple database operations can be performed on large scale data like selection, projection, union, intersection, difference, natural join, grouping etc.

Here is an example taken from the ‘Web Intelligence & Big Data’ course on Coursera by Gautam Shroff.

Let us assume that there are 2 tables, ‘Sales by address’ and ‘City by address’, and the need is to find the total ‘Sales by City’. The SQL query for this is

SELECT SUM(Sale), City FROM Sales, Cities WHERE Sales.Addr_id = Cities.Addr_id GROUP BY City

This can be done by 2 map-reduce tasks.

The first map-reduce task groups the records by address, as follows

Map1: The first map task keys every record from both tables by its address, emitting (Addr_id, rest of record (Sale or City))

Reduce1: The first reduce task sums the sales for each address and attaches the city recorded for that address, emitting (City, sales at that address). Clearly the same city can occur multiple times.

At this point we have the sum of the sales for every address, tagged with its city. However each city can occur multiple times, so we now have to GROUP BY City.

Map2: The second mapper emits (City, rest of record (Sales))

Reduce2: The 2nd reducer now sums all the sales for each city.
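Here is a small Python simulation of the two stages on made-up ‘Sales’ and ‘Cities’ tables; the table contents and column names are invented for illustration, and the join is done reduce-side by grouping records on the address key.

```python
from collections import defaultdict

# Toy tables: Sales(Addr_id, Sale) and Cities(Addr_id, City)
sales  = [("a1", 100), ("a1", 50), ("a2", 200), ("a3", 75)]
cities = [("a1", "Chennai"), ("a2", "Mumbai"), ("a3", "Chennai")]

# Map1: key every record by its address (a reduce-side join on Addr_id)
by_address = defaultdict(lambda: {"sales": [], "city": None})
for addr, sale in sales:
    by_address[addr]["sales"].append(sale)
for addr, city in cities:
    by_address[addr]["city"] = city

# Reduce1: for every address emit (City, sum of the sales at that address)
city_sales = [(rec["city"], sum(rec["sales"])) for rec in by_address.values()]

# Map2 + Reduce2: group the (City, sales) pairs by city and sum them
totals = defaultdict(int)
for city, amount in city_sales:
    totals[city] += amount
print(dict(totals))   # {'Chennai': 225, 'Mumbai': 200}
```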

Clearly the map-reduce paradigm does address some major areas. It is extremely useful when there is a need to perform the same operation on multiple documents; it would definitely be useful in building an inverted index or in computing PageRank. Map-reduce is also very powerful in handling matrix operations: a large class of problems in machine learning and computer vision use matrices extensively, and map-reduce becomes critical when this has to be done on large volumes of data. Besides, the ability of map-reduce to perform a large set of database operations is something that can be used in many situations on the web.

However it is no silver bullet for all types of problems.

Find me on Google+

Simplifying ML: Impact of degree of polynomial on bias & variance and other insights

This post takes off from my earlier post Simplifying Machine Learning: Bias, variance, regularization and odd facts – Part 4. As discussed earlier, a poor hypothesis function could either underfit or overfit the data. If the number of features selected is small, of the order of 1 or 2, then we can plot the data and try to determine how the hypothesis function fits the data. We can also see whether the function is capable of predicting output target values for new data.

However if the number of features is large, for e.g. of the order of 10's of features, then there needs to be a method by which one can determine if the learned hypothesis is a ‘just right’ fit for all the data.

Check out my book ‘Deep Learning from first principles Second Edition – In vectorized Python, R and Octave’. My book is available on Amazon as paperback ($18.99) and in Kindle version ($9.99/Rs 449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($12.99) and Kindle($9.99/Rs449) versions.

 

The following technique can be used to determine the ‘goodness’ of a hypothesis or how well the hypothesis can fit the data and can also generalize to new examples not in the training set.

Several insights on how to evaluate a hypothesis are given below.

Consider a hypothesis function

hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x² + Ɵ3x³ + Ɵ4x⁴


The above hypothesis does not generalize well enough for new examples in the data set.

Let us assume that there are 100 training examples or data points. Instead of using the entire set of 100 examples to learn the hypothesis function, the data set is divided into a training set and a test set in a 70%:30% ratio.

The hypothesis is learned from the training set. The learned hypothesis is then checked against the 30% test set data to determine whether the hypothesis is able to generalize on the test set also.

This is done by determining the error when the hypothesis is used against the test set.

For linear regression the error is computed by determining the average mean square error of the output value against the actual value as follows

The test set error is computed as follows

Jtest(Ɵ) = 1/2mtest Σ (hƟ(xtest) – ytest)²   (summed over the mtest test examples)

For logistic regression the test set error is similarly determined as

Jtest(Ɵ) = –1/mtest Σ [ ytest log(hƟ(xtest)) + (1 – ytest) log(1 – hƟ(xtest)) ]

The idea is that the test set error should be as low as possible.

Model selection

A typical problem in determining the hypothesis is to choose the degree of the polynomial or to choose an appropriate model for the hypothesis

The method that can be followed is to choose, say, 10 polynomial models of increasing degree

  1. hƟ(x) = Ɵ0 + Ɵ1x
  2. hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x²
  3. hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x² + Ɵ3x³
  …
  10. hƟ(x) = Ɵ0 + Ɵ1x + … + Ɵ10x^10

Here ‘d’ is the degree of the polynomial. One method is to train all 10 models, run each model's hypothesis against the test set and then choose the model with the smallest error cost.

While this appears to be a good technique to choose the best-fit hypothesis, in reality it is not. The reason is that the degree of the polynomial is itself chosen to give the least error on the test data, so the test set error of the chosen model is an optimistic estimate and does not generalize well to examples outside the training and test sets.

So the correct method is to divide the data into 3 sets in a 60:20:20 ratio, where 60% is the training set, 20% is the cross-validation set used to pick the best-fitting model, and the remaining 20% is the test set.

The steps carried out against the data are as follows (a short sketch of the procedure is given after the list)

  1. Train all 10 models against the training set (60%)
  2. Compute the cost value J against the cross-validation set (20%)
  3. Determine the lowest cost model
  4. Use this model against the test set and determine the generalization error.
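Here is a rough sketch of this procedure in Python/NumPy on synthetic data; the data, the 60:20:20 split and the 10 candidate degrees are all just illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: a noisy quadratic relationship between x and y
x = rng.uniform(-3, 3, 100)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 1.0, 100)

# 60:20:20 split into training, cross-validation and test sets
idx = rng.permutation(100)
train, cv, test = idx[:60], idx[60:80], idx[80:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on a subset of the data."""
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

# 1. Train one model per polynomial degree on the training set
# 2. Pick the degree with the lowest cross-validation error
fits = {}
for degree in range(1, 11):
    coeffs = np.polyfit(x[train], y[train], degree)
    fits[degree] = (coeffs, mse(coeffs, x[cv], y[cv]))

best_degree = min(fits, key=lambda d: fits[d][1])
best_coeffs, _ = fits[best_degree]

# 3. Report the generalization error of the chosen model on the untouched test set
print("Chosen degree:", best_degree)
print("Test set error:", mse(best_coeffs, x[test], y[test]))
```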

Degree of the polynomial versus bias and variance

How does the degree of the polynomial affect the bias and variance of a hypothesis?

Clearly for a given training set when the degree is low the hypothesis will underfit the data and there will be a high bias error. However when the degree of the polynomial is high then the fit will get better and better on the training set (Note: This does not imply a good generalization)

We run all the models with different polynomial degrees against the cross-validation set. What we observe is that when the degree of the polynomial is low the error is high. This error decreases as the degree of the polynomial increases, since we tend to get a better fit. However the error increases again for higher degree polynomials, which overfit the training set and are a poor fit for the cross-validation set.

This is shown below


Effect of regularization on bias & variance

Here is the technique to choose the optimum value for the regularization parameter λ

When λ is small the Ɵi values are large and we tend to overfit the data set; hence the training error will be low but the cross-validation error will be high. However, when λ is large the values of Ɵi become negligible, reducing the hypothesis almost to the constant Ɵ0. This will underfit the data and result in both a high training error and a high cross-validation error. Hence the chosen value of λ should be the one for which the cross-validation error is lowest.


Plotting learning curves

This is another technique to identify if the learned hypothesis has a high bias or a high variance based on the number of training examples

A high bias indicates an underfit. When the number of samples in the training set is low, the training error will be low, since it is easy to fit a hypothesis to a few training examples, but the cross-validation error will be high. As the number of samples increases, the training error will increase and the cross-validation error will decrease slightly. However for a high bias, or underfit, after a certain point increasing the number of samples will not change the error much: both errors flatten out at a high value. This is the case of high bias.


In the case of high variance, where a high degree polynomial is used for the hypothesis, the training error will be low for a small number of training examples and will increase slowly as the number of training examples grows. The cross-validation error will be high for a small number of training samples but will slowly decrease as the number of samples grows, since the hypothesis learns better. Hence for the case of high variance, increasing the number of samples in the training set decreases the gap between the cross-validation error and the training error, as shown below.
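A quick way to get a feel for such learning curves is to fit models on progressively larger slices of a synthetic training set and record both errors, as in the sketch below (the data and the chosen degree are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data drawn from a noisy cubic
x = rng.uniform(-2, 2, 200)
y = x**3 - x + rng.normal(0, 0.5, 200)
x_train, y_train, x_cv, y_cv = x[:150], y[:150], x[150:], y[150:]

def mse(coeffs, xs, ys):
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

# Grow the training set and record both errors at each size
for m in range(10, 151, 20):
    coeffs = np.polyfit(x_train[:m], y_train[:m], deg=3)
    print(f"m={m:3d}  train error={mse(coeffs, x_train[:m], y_train[:m]):.3f}  "
          f"cv error={mse(coeffs, x_cv, y_cv):.3f}")
```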


Note: This post, like my previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng

Also see
1. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
2.My book ‘Deep Learning from first principles:Second Edition’ now on Amazon
3.The Clash of the Titans in Test and ODI cricket
4. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
5.Latency, throughput implications for the Cloud
6. Simulating a Web Joint in Android
7. Pitching yorkpy … short of good length to IPL – Part 1

Simplifying Machine Learning: Bias, Variance, regularization and odd facts – Part 4

In both linear and logistic regression the choice of the degree of the polynomial for the hypothesis function is extremely critical. A low degree polynomial can result in an underfit, while a very high degree can overfit the data, as shown below.


In the figure on the left the data is underfit, as we try to fit the data with a first order polynomial, which is a straight line. This is a case of strong ‘bias’.

In the rightmost figure a much higher degree polynomial is used. All the data points are covered by the polynomial curve, however it is not effective in predicting other values. This is a case of overfitting, or high variance.

The middle figure is just right as it intuitively fits the data points the best possible way.

A similar problem exists with logistic regression as shown below


There are 2 ways to handle overfitting

a)      Reducing the number of features selected

b)      Using regularization

In regularization the magnitude of the parameters Ɵ is decreased to reduce the effect of overfitting

Hence if we choose a hypothesis function

hƟ(x) = Ɵ0 + Ɵ1x + Ɵ2x² + Ɵ3x³ + Ɵ4x⁴

 

The cost function for this, without regularization, is as mentioned in earlier posts

J(Ɵ) = 1/2m Σ (hƟ(xi) – yi)²

where the key is to minimize the above function for the least error

The cost function with regularization becomes

J(Ɵ) = 1/2m [ Σ (hƟ(xi) – yi)² + λ Σ Ɵj² ]

 

As can be seen, the regularization now adds the term λ Σ Ɵj² to the cost function that needs to be minimized.

Hence, with a suitably chosen regularization term, the problem of overfitting can be kept in check. A small sketch of the regularized cost and its gradient is given below.
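This is a minimal Python/NumPy illustration, assuming made-up data points; by the usual convention the intercept Ɵ0 is not regularized.

```python
import numpy as np

def regularized_cost_and_gradient(theta, X, y, lam):
    """Regularized linear regression cost and gradient.

    J(theta) = 1/(2m) * [ sum((X.theta - y)^2) + lam * sum(theta_j^2 for j >= 1) ]
    By convention the intercept theta_0 is not regularized.
    """
    m = len(y)
    residual = X @ theta - y
    cost = (residual @ residual + lam * np.sum(theta[1:] ** 2)) / (2 * m)

    grad = X.T @ residual / m
    grad[1:] += (lam / m) * theta[1:]
    return cost, grad

# Tiny illustrative example with made-up numbers
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is the intercept term
y = np.array([2.0, 2.5, 3.5])
theta = np.array([0.5, 1.0])
print(regularized_cost_and_gradient(theta, X, y, lam=1.0))
```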


However the trick is to determine the value of λ. If λ is too big it will result in underfitting, i.e. a high bias.

Similarly the regularized equation for logistic regression is as shown below

J(Ɵ) = –1/m Σ [ y log(hƟ(x)) + (1 – y) log(1 – hƟ(x)) ] + λ/2m Σ Ɵj²

Some tips suggested by Prof Andrew Ng while determining the parameters and features for regression

a)      Get as many training examples as possible; it is worth spending more effort in getting more examples

b)      Add additional features

c)      Observe changes to the learning algorithm with different values of λ

This post is continued in my next post – Simplifying ML: Impact of degree of polynomial on bias, variance and other insights

Note: This post, in line with my previous posts on Machine Learning,  is based on the Coursera course on Machine Learning by Professor Andrew Ng


Find me on Google+

Simplifying ML: Logistic regression – Part 2

Logistic regression is another class of Machine Learning algorithms which comes under supervised learning. In this technique we need to classify data. (Take a look at my earlier post Simplifying Machine Learning algorithms – Part 1, where I discussed linear regression.) For e.g., if we had data on tumor sizes versus whether the tumor was benign or malignant, the question is whether, given a tumor size, we can predict if the tumor will be benign or cancerous. So we need the ability to classify this data.

This is shown below


It is obvious that a line with a certain slope could easily separate the two.

As another example we could have an algorithm that is able to automatically classify mail as either spam or not spam based on the subject line. So for e.g if the subject line had words like medicine, prize, lottery etc we could with a fair degree of probability classify this as spam.

However some classification problems could be far more complex.  We may need to classify another problem as shown below.


From the above it can be seen that the hypothesis function is a second order equation, which is either a circle or an ellipse.

In the case of logistic regression the hypothesis function should be able to switch between 2 values 0 or 1 almost like a transistor either being in cutoff or in saturation state.

In the case of logistic regression 0 <= hƟ <= 1

The hypothesis function uses a function of the following form

g(z) = 1/(1 + e^(–z))

and hƟ(x) = g(ƟT x)

6

The function g(z) shown above has the characteristic required for logistic regression as it has the following shape

The function g(z) rapidly asymptotes to 1 as z becomes large and positive, and asymptotes to 0 as z becomes large and negative. Consequently hƟ(x) >= 0.5 when ƟT x >= 0, and hƟ(x) < 0.5 when ƟT x < 0.

As in linear regression we can have hypothesis function be of an appropriate order. So for e.g. in the ellipse figure above one could choose a hypothesis function as follows

hƟ(x) = Ɵ0 + Ɵ1x1² + Ɵ2x2² + Ɵ3x1 + Ɵ4x2

 

or

 

hƟ(x) = 1/(1 + e^–(Ɵ0 + Ɵ1x1² + Ɵ2x2² + Ɵ3x1 + Ɵ4x2))

We could choose the general form of a circle which is

f(x, y) = ax² + by² + 2gx + 2hy + d

The cost function for logistic regression is given below

Cost(hƟ(x), y) = –log(hƟ(x))         if y = 1

Cost(hƟ(x), y) = –log(1 – hƟ(x))     if y = 0

In the case of linear regression there was a single cost function which could determine the error of the data against the predicted value.

The cost in the event of logistic regression is given as above as a set of 2 equations one for the case where the data is 1 and another for the case where the data is 0.

The reason for this is as follows. If we consider y =1 as a positive value, then when our hypothesis correctly predicts 1 then we have a ‘true positive’ however if we predict 0 when it should be 1 then we have a false negative. Similarly when the data is 0 and we predict a 1 then this is the case of a false positive and if we correctly predict 0 when it is 0 it is true negative.

Here is how this cost function was arrived at. By definition the cost function gives the error between the predicted value and the actual data value.

The logic for determining the appropriate function is as follows

For y = 1

y=1 & hypothesis = 1 then cost = 0

y= 1 & hypothesis = 0 then cost = Infinity

Similarly for y = 0

y = 0 & hypothesis = 0 then cost = 0

y = 0 & hypothesis = 1 then cost = Infinity

and the functions above serve exactly this purpose, as can be seen


Hence for a single training example the cost can be written compactly as

Cost(hƟ(x), y) = –y log(hƟ(x)) – (1 – y) log(1 – hƟ(x))

This is the same as the piecewise definition above, and the overall cost is J(Ɵ) = 1/m Σ Cost(hƟ(xi), yi).

The same gradient descent algorithm can now be used to minimize the cost function

So we can iterate through

Ɵj := Ɵj – α ∂/∂Ɵj J(Ɵ0, Ɵ1, …, Ɵn)

This works out to a function that is similar to linear regression

Ɵj := Ɵj – α 1/m Σ (hƟ(xi) – yi) xji

This will enable the machine to fairly accurately determine the parameters Ɵj for the features x and provide the hypothesis function.
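Here is a minimal Python/NumPy sketch of this, on made-up one-feature data in the spirit of the tumor-size example; it is plain batch gradient descent on the cross-entropy cost, not the Octave code from the course.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^-z), which squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data in the spirit of the tumor-size example: one feature, 0/1 label
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

theta = np.zeros(X.shape[1])
alpha, iterations = 0.5, 5000
m = len(y)

for _ in range(iterations):
    h = sigmoid(X @ theta)              # hypothesis h_theta(x) for every example
    gradient = X.T @ (h - y) / m        # same form of update as in linear regression
    theta -= alpha * gradient

h = sigmoid(X @ theta)
cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
print("theta:", theta, " cost:", cost)
print("P(y=1) for a feature value of 2.5:", sigmoid(np.array([1.0, 2.5]) @ theta))
```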

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. Highly recommended!!!

Find me on Google+

Simplifying Machine Learning algorithms – Part 1

Machine learning, or the ability to use computers to predict values, classify data or identify patterns, is truly a fascinating field. It is amazing how algorithms can come to conclusions on data. Detecting patterns is an inborn ability of the human mind, but our mind cannot handle large quantities of data with many features. It is here that machines have an edge over us.

This post is inspired by the Machine Learning course at Coursera conducted by Professor Andrew Ng of Stanford. The lectures are truly lucid and delivered with amazing clarity. In a series of posts I will try to distil the meaning and motivation behind the algorithms that are part of machine learning.

There are 2 major types of learning

a)      Supervised learning b) Unsupervised learning

Supervised learning: In supervised learning we have to infer the relationship between input data and output values. The intention of supervised learning is to determine the likely output for some new input once the relationship has been learned. Some examples of supervised learning are linear regression, logistic regression etc.

Unsupervised learning: In unsupervised learning the problem is to determine patterns and structure in unlabeled data. Some examples of unsupervised learning are K-Means clustering, hidden Markov models etc.

In this post I would like to take a look at Supervised Learning algorithms

Linear Regression

In regression problems we try to infer the relationship between a set of input parameters and an output value. Let us say we have data for the number of rooms vs. price of a house, as shown below.


Depending on the data we could either fit a straight line (a linear fit) or fit a higher order curve to the data.

The function that determines the relationship is also known as hypothesis function. This can be represented as follows for e.g a hypothesis function with a single feature

hƟ(x) = Ɵ1x+ Ɵ0

 

The above equation is the hypothesis function where Ɵ is the parameter and x is the feature

We could have a higher order hypothesis function as follows

hƟ(x) = Ɵ2x² + Ɵ1x + Ɵ0

 

The function that evaluates how well the hypothesis function maps the inputs to the corresponding outputs is known as the ‘cost function’.

The cost function can be represented as

J(Ɵ) = 1/2m Σ (hƟ(xi) – yi)²

The cost function really calculates the ‘mean squared error’ of the actual data points (y) against the points predicted by the hypothesis function (hƟ). Clearly, the higher the value of J(Ɵ), the greater the error in predicting the output from a given set of input parameters. If we just took the error instead of the squared error, then for data points on either side of the predicted line the positive and negative errors could cancel out. Hence the approach is usually to take the mean of the squared error.


The goal would be to minimize the error which will result in the best fit.

So the approach would be to choose values for the parameters Ɵi that minimize this error.

The algorithm that is used for determining the values of the parameters that will result in the minimum error is gradient descent

The formula is

Ɵj := Ɵj – α ∂/∂Ɵj J(Ɵ)

Where α is the learning rate

Gradient descent starts by picking a random value for Ɵi. The algorithm then looks around to search for the next combination that will take us downhill fastest. By continuing this process a local minimum is reached.

Gradient descent is based on the observation that if a multivariable function F is defined and differentiable in a neighborhood of a point a, then F decreases fastest if one goes from a in the direction of the negative gradient of F at a. This is shown in the diagram below, taken from Wikipedia.

Gradient_descent.svg

For e.g for a curve as shown below


This is how I think of gradient descent. In the above diagram, at point A the slope is positive; subtracting the slope multiplied by the learning factor α from Ɵj results in a value that is smaller than Ɵj, that is, we move towards the minimum at C. Similarly, at point B the slope is negative, so subtracting α times a negative slope adds to Ɵj, and we move to the right, again towards point C.

By applying the iterative process of gradient descent we can get the combination of parameter values for  Ɵ that will provide the best fit for the set of data points

The iterative process of gradient descent is applied to minimize the cost function, which is a function of the error in the current hypothesis

∂/∂Ɵj J(Ɵ) = ∂/∂Ɵj [ 1/2m Σ (hƟ(xi) – yi)² ]

 

This process is applied iteratively to the below equation to arrive at the values of Ɵi

The formula is

Ɵj := Ɵj – α ∂/∂Ɵj J(Ɵ)

to obtain the values for the best fit equation

hƟ(x) = Ɵnx^n + Ɵn-1x^(n-1) + … + Ɵ0
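To see the whole loop in action, here is a small Python/NumPy sketch of gradient descent for a single-feature linear hypothesis on made-up 'rooms vs. price' style numbers; the printed cost falls every few hundred steps, which is exactly the movement towards the minimum described above.

```python
import numpy as np

# Made-up 'number of rooms vs. price' style data for a single-feature linear fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([15.0, 22.0, 31.0, 38.0, 45.0])

theta0, theta1 = 0.0, 0.0     # start from an arbitrary point on the cost surface
alpha = 0.02                  # learning rate
m = len(x)

for step in range(2001):
    h = theta0 + theta1 * x                      # current hypothesis h_theta(x)
    # Partial derivatives of J(theta) = 1/(2m) * sum((h - y)^2)
    d_theta0 = np.sum(h - y) / m
    d_theta1 = np.sum((h - y) * x) / m
    theta0 -= alpha * d_theta0                   # step against the gradient ...
    theta1 -= alpha * d_theta1                   # ... i.e. downhill, towards the minimum
    if step % 500 == 0:
        print(f"step {step:4d}: J = {np.sum((h - y) ** 2) / (2 * m):.3f}")

print(f"Best fit: h(x) = {theta0:.2f} + {theta1:.2f} x")
```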

Also read my post on Simplifying ML: Logistic regression – Part 2

You may like
1. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
2. Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra
3. Applying the principles of Machine Learning

Find me on Google+