How to program – Some essential tips

Following the arrow of time from the early 1980s to the present day, programming problems have not only proliferated but have also become more difficult. Fortunately, programming itself has become more manageable thanks to massive increases in computing horsepower, smarter tools and the instant availability of information on the internet, typically a mouse click away.

Learning to program is no easy task, but it can be done with the right mix of attitude, curiosity and interest. Becoming adept at programming, however, is something else. An interesting essay in this context is Peter Norvig’s ‘Teach yourself programming in 10 years’.

Back in the 1980s when I wrote my first Fortran program on my college Mainframe, programming was a lengthy exercise, spanning several days.

My first program plotted a sine wave of characters on a computer printout. Running this program required the following steps:

1. Enter the program on a teletype terminal and create a stack of Hollerith (punched) cards
2. Submit the stack of cards to the computer center
3. The computer center would do a batch execute in the evening on the mainframe
4. God forbid your program had a syntax error. If it did, you went back to step 1 the next day.
5. Assuming everything was fine, the computer center would run your program and place your output (printout) in the appropriate pigeon hole, to be picked up the next day.

The whole exercise of writing a small program could take anywhere from a couple of days to a whole week.

In the early 1990s things got a little better: one could code, compile, link and execute sitting at one’s desk. However, while programming itself got much simpler than before, certain tasks were still difficult. Until the late 90s, programs of any sort had to be written using a regular text editor (vi, emacs etc.). You would then have to go through the process of compiling, linking and executing.

An angry compiler would typically spew forth venom at missing semicolons, undeclared variables and uninitialized values. This would continue until you had ironed out all the syntax errors. Then you would link, get undefined symbols and have to include the appropriate libraries. Finally you would execute your code, only to have it crash. The process of debugging would then start.

Luckily technology has made life a whole lot easier, except for the last step, where you can still run into execution errors. These days an IDE (Integrated Development Environment) like Eclipse will flag syntax errors, missing definitions/declarations etc. as you write your code. Eclipse can also indicate which libraries (imports) you need to include in your package for it to build. The only step missing from today’s IDEs is the ability to predict possible execution errors in your program. I wouldn’t be surprised if in future the IDE, like Microsoft Word, is able to tell you when a programming construct does not make sense.

So things have gotten a lot easier for the programmer. The following tips are particularly useful as you progress in programming

These days, when you are learning a new programming language, it is not necessary to know the language from cover to cover by reading a book. Back when we learnt C it was necessary to know everything: bit structures, macros, pragmas etc. The reason was that for every syntax or execution error one had to rush to the textbook and thumb through it for the answer. Not so in these days of Google. You have the world’s library at your fingertips.
To get started it is necessary to learn just the most important programming constructs of the language, say structure, class, car, cdr, besides the usual suspects like loops, conditions and case constructs
Download and install an IDE for the language. In most cases Eclipse will work
Try to write a simple program and test out your code.
To do any sort of programming these days you will necessarily need to make 3 friends
1. Google
2. Stackoverflow
3. Git & GitHub
Honing your Googling skills is very important. There are answers to almost any sort of programming problem out there. You would be surprised to know how many others have made exactly the same stupid mistake that you did. Googling will also take you to interesting tutorials, blogs and articles that discuss different aspects of the programming language and the problem you are trying to solve
Stackoverflow is really a godsend to all programmers. It has questions on a great many aspects of almost every programming language. If you spend time searching Stackoverflow you are bound to find answers and code snippets that you can readily use in your code
Post your questions on Stackoverflow when you don’t find the answers there. You are likely to get quick answers, thanks to the gamification of Stackoverflow (points, upvotes, badges etc.).
Git & GitHub: I would suggest that you download and install GitHub for Windows. This will provide you with version control on your desktop. With Git you can modify code while being able to switch back to an earlier version. Read a good tutorial on Git for Windows
Once you have working code, push it to GitHub and share it with other programmers

Now that you have the basic setup, here are a few other extremely important tips

The most important criterion for programming is ‘attitude’. Initially you are bound to get frustrated, angry, irritated etc. But it is necessary to look at the errors that you get with the right attitude. Know that an error is telling you something. Usually the answer to your mistake is in the error message itself. Look at it closely and try to understand it. You will learn a lot more from your errors than from copy-pasting somebody else’s code, even if it works right the first time around!
Make sure you do something different each time. As the saying often attributed to Einstein goes, ‘If you keep doing the same thing, you will keep getting the same result’
There are different ways to debug your code. You could use the debugger and single step through the code and keep checking the values of the variables. I personally prefer print statements to localize where things are going wrong. I then try to narrow down the problem to a few lines of code and try to take it apart.

Hopefully the above tips are useful. Programming can be a creative activity and will be indispensable in our future.

Above all have fun coding, there are so many possibilities these days!

Also see

1. Programming languages in layman’s language
2. The common alphabet of programming languages
3. The mind of the programmer
4. Programming Zen and now – Some essential tips -2

Programming Zen and now – Some essential tips-2

Applying the principles of Machine Learning

While working with multivariate regression there are certain essential principles that must be applied to ensure the correctness of the solution and to pick the best solution. This is all the more important when the problem has a large number of features. In this post I apply these principles to a regression data set which I was able to pull off the internet. This data set was taken from the UCI Machine Learning repository and deals with Boston housing data. The housing data provides the cost of houses in Boston suburbs given the number of rooms, the connectivity to main highways, the crime rate in the area and several other attributes. There are a total of 506 data points in this data set with a total of 13 features.

This seemed a reasonable dataset to start to try out the principles of Machine Learning I had picked up from Coursera’s ML course.

Out of a total of 13 features, 2 features, ‘ZN’ and ‘CHAS’ (proximity to the Charles River), were dropped as the values in these columns were mostly zero. The remaining 11 features were used to map to the output variable, the price.

The following key rules have been applied to the dataset

The dataset was divided into training samples (60%), cross-validation set (20%) and test set (20%) using a random index
Try out different polynomial functions while performing gradient descent to determine the theta values
Different combinations of ‘alpha’ learning rate and ‘lambda’ the regularization parameter were tried while performing gradient descent
The error rate is then calculated on the cross-validation and test set
The theta values obtained at the lowest cost for a given polynomial were used to compute and plot the learning curve for that polynomial against an increasing number of training and cross-validation samples, to check for bias and variance.
The cost was plotted against the polynomial degree to find the best-fit polynomial for the data set.
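
The 60/20/20 split from the first rule above can be sketched in a few lines. This is a minimal Python illustration, not the Octave code from the repository; the function name `split_dataset`, the fixed seed and the stand-in rows are assumptions made for the example.

```python
import random


def split_dataset(rows, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the row indices, then cut them into train / cross-validation / test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # the "random index" (seed is an assumption)
    n_train = int(train_frac * len(rows))
    n_cv = int(cv_frac * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    cv = [rows[i] for i in indices[n_train:n_train + n_cv]]
    test = [rows[i] for i in indices[n_train + n_cv:]]  # whatever remains
    return train, cv, test


# Stand-in for the 506 housing rows; real rows would hold the feature values.
rows = list(range(506))
train, cv, test = split_dataset(rows)
print(len(train), len(cv), len(test))
```

Shuffling the indices rather than the rows keeps the original data untouched, and every row lands in exactly one of the three sets.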

A multivariate regression hypothesis can be represented as

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + θ₄x₄ + …
And the cost is determined as
J(θ₀, θ₁, θ₂, θ₃, …) = (1/(2m)) ∑ᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
where m is the number of samples and the sum runs over i = 1 … m.
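
To make the hypothesis and the cost concrete, here is a small Python sketch of the two formulas. It is only an illustration of the equations, not the Octave implementation; the names `hypothesis` and `compute_cost` and the toy data are assumptions.

```python
def hypothesis(theta, x):
    """h_theta(x) = theta_0 + theta_1*x_1 + theta_2*x_2 + ...; x excludes the constant term."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum over samples of (h_theta(x) - y)^2."""
    m = len(y)
    squared_errors = ((hypothesis(theta, x) - yi) ** 2 for x, yi in zip(X, y))
    return sum(squared_errors) / (2 * m)


# Toy data generated from y = 2*x, so theta = [0, 2] fits it perfectly.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
print(compute_cost(X, y, [0.0, 2.0]))  # perfect fit
print(compute_cost(X, y, [0.0, 0.0]))  # poor fit, larger cost
```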
The implementation was done using Octave. As in my previous posts some functions have not been included, to comply with Coursera’s Honor Code. The code can be cloned from GitHub at machine-learning-principles

a) housing compute.m. In this module I perform gradient descent for different polynomial degrees and check the error that is obtained when using the computed theta on the cross validation and test set

max_degrees = 4;
J_history = zeros(max_degrees, 1);
Jcv_history = zeros(max_degrees, 1);
for degree = 1:max_degrees
    [J Jcv alpha lambda] = train_samples(randidx, training, cross_validation, test_data, degree);
end

b) train_samples.m – This module uses gradient descent to check the best fit for a given polynomial degree for different combinations of alpha (learning rate) and lambda( regularization).

for i = 1:length(alpha_arr)
    for j = 1:length(lambda_arr)
        alpha = alpha_arr{i};
        lambda = lambda_arr{j};
        % Perform Gradient descent
        % Compute error for training sample for computed theta values
        % Compute the error rate for the cross validation samples
        % Compute the error rate against the test set
    end
end

c) cross_validation.m – This module uses the theta values to compute cost for the cross validation set

d) test-samples.m – This module computes the error when using the trained theta on the test set

e) poly.m – This module constructs polynomial vectors based on the degree as follows
function [x] = poly(xinput, n)
    x = [];
    for i = 1:n
        % append every input column raised to the i-th power
        xtemp = xinput .^ i;
        x = [x xtemp];
    end
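
For readers who do not use Octave, the column ordering produced by poly.m can be seen in this small Python rendering of the same idea. It is a sketch for illustration only; the list-of-lists representation is an assumption, and the Octave function above remains the actual implementation.

```python
def poly(rows, n):
    """For each row, return all features to the power 1, then power 2, ... up to n."""
    return [[value ** p for p in range(1, n + 1) for value in row] for row in rows]


# One sample with two features, expanded to degree 3:
# the result is ordered [x1, x2, x1^2, x2^2, x1^3, x2^3].
print(poly([[2, 3]], 3))
```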

f) learning_curve.m – The learning curve module plots the error rate for an increasing number of training and cross-validation samples. This is done as follows, for the theta with the lowest cost as determined by gradient descent
for i from 1 to 100

Compute the error for ‘i’ training samples
Compute the error for ‘i’ cross-validation samples
Plot the learning curve to determine the bias and variance of the polynomial fit

This is included below
for i = 1:100
    xsample = xtrain(1:i,:);
    ysample = ytrain(1:i,:);
    [xsample] = poly(xsample, degree);
    xsample = [ones(i, 1) xsample];
    [c d] = size(xsample);
    theta = zeros(d, 1);
    % Minimize using fmincg
    J = computeCost(xsample, ysample, theta);
    Jtrain(i) = J;
    xsample_cv = xcv(1:i,:);
    ysample_cv = ycv(1:i,:);
    [xsample_cv] = poly(xsample_cv, degree);
    xsample_cv = [ones(i, 1) xsample_cv];
    J_cv = computeCost(xsample_cv, ysample_cv, theta);
    Jcv(i) = J_cv;
end

Finally, the cost is plotted against different values of lambda.

The results are included below

A) Polynomial degree 1
Convergence graph

Learning curve

The above figure does show a stronger bias. Note: the learning curve was done with around 100 samples
B) Polynomial degree 2

Convergence graph

Learning curve

The learning curve for degree 2 shows a stronger variance.

C) Polynomial degree 3
Convergence graph

Learning curve

D) Polynomial degree 4
Convergence graph

E) Learning curve

This plot is useful to determine which polynomial degree will give the best fit for the dataset and the lowest cost

Clearly from the above it can be seen that degree 2 will give a good fit for the data set.

F) Lambda vs Cost function

The above code demonstrates some key principles while performing multivariate regression
The code can be cloned from GitHub at machine-learning-principles

Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra

Continuing my earlier ‘innings’, of test driving my knowledge in Machine Learning acquired via Coursera, I now turn my attention towards the bowling performances of our Indian bowling heroes. In this post I give a slightly different ‘spin’ to the bowling analysis and hope I can ‘swing’ your opinion based on my assessment.

I guess that is enough of my cricketing ‘double-speak’ for now and I will get down to the real business of my bowling analysis!

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of performances of both batsmen, bowlers besides evaluating team & match performances in Tests , ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the kindle versions. The books can be accessed at Cricket analytics with cricketr and Beaten by sheer pace-Cricket analytics with yorkr A must read for any cricket lover! Check it out!!

As in my earlier post Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid, the first part of the post has my analyses and the latter part has the details of the implementation of the algorithm. Feel free to read the first part and either scan or skip the latter.

To perform this analysis I have skipped the data on our recent crop of new bowlers. The reason is that data on these bowlers is scant; besides, they also seem to have a relatively shorter shelf life (hope there are a couple of finds in this Australian tour of Dec 2014). For the analyses I have chosen B S Chandrasekhar, Kapil Dev and Anil Kumble. My rationale for choosing these 3 is as follows.

B S Chandrasekhar, also known as ‘Chandra’, was one of the most lethal leg spinners in the late 1970s. He had a very dangerous combination of fast leg breaks and searing top spins, interspersed with the occasional googly. On many occasions he would leave batsmen completely clueless.

Kapil Nikhanj Dev, the Haryana Hurricane, could outwit the most technically sound batsmen through some really clever bowling. His variations were almost always effective and he would achieve the vital breakthrough by outsmarting the opponent.

And finally Anil Kumble, I chose Kumble because in my opinion he is truly the embodiment of the ‘thinking’ bowler. Many times I have seen Kumble repeatedly beat batsmen. It was like he was telling the batsman ‘check’ as he bowled faster leg breaks, flippers, a straighter delivery or top spins before finally crashing into the wickets or trapping the batsmen. It felt he was saying ‘checkmate dude!’

I have taken the data for the 3 bowlers from ESPN Cricinfo. Only the Test matches were considered for the analyses. All tests against all oppositions both at home and away were included

The assumptions made and the basis of the computation are given below
a. The data is based on the following 2 input variables: a) Overs bowled b) Runs given. The output variable is ‘Wickets taken’

b. To my surprise I found that in the late 1970s, when B S Chandrasekhar used to bowl, an over had 8 balls for matches in Australia. So I had to normalize this data for Chandra to put it on par with the others. Hence for Chandra, where the overs were made up of 8 balls, the overs were calculated as follows
Overs (O) = (Overs * 8)/6

c. The Economy Rate E was calculated as below
E = Runs given/Overs
It was chosen as an input variable to take into account fewer runs given by the bowler

d. The output variable was re-calculated as Strike Rate (SR) to determine the ‘bowling effectiveness’
Strike Rate = Wickets/Overs
(not to be confused with a batsman’s strike rate, where batsman strike rate = runs/balls faced)

e. Hence the analysis is based on
f(O,E) = SR
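
The derived quantities above amount to three small formulas. The Python sketch below works through them on made-up figures. The function names and the sample numbers are assumptions for illustration, overs are treated as plain decimal numbers, and the economy rate is taken as runs conceded per over (the conventional definition).

```python
def normalize_overs(overs, balls_per_over=8):
    """Convert overs of `balls_per_over` balls into equivalent 6-ball overs."""
    return overs * balls_per_over / 6.0


def economy_rate(runs, overs):
    """Runs conceded per (6-ball) over."""
    return runs / overs


def strike_rate(wickets, overs):
    """Bowling strike rate as defined in this post: wickets per over."""
    return wickets / overs


# Hypothetical spell: 15 eight-ball overs, 60 runs conceded, 4 wickets.
overs = normalize_overs(15)  # 15 * 8 / 6 six-ball overs
print(overs, economy_rate(60, overs), strike_rate(4, overs))
```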
An outline of the Octave code and the data used can be cloned from GitHub at ml-bowling-analyze

1. Surface of Bowling Effectiveness (SBE)
In my earlier post I was able to fit a ‘prediction plane’ based on the minutes at crease, balls faced versus the runs scored. But in this case a plane did not make sense as the wickets can only range from 0 – 10 and in most cases averaging between 3 and 5. So I plot the best fitting 3-D surface over the predicted hypothesis function. The steps performed are

1) The data for the different bowlers was cleaned by removing entries marked DNB (did not bowl)
2) The Economy Rate (E) = Runs given/Overs and Strike Rate (SR) = Wickets/Overs were calculated.
3) The product of Overs (O) and Economy (E) was stored as Over_Economy (OE)
4) The hypothesis function was computed as h(O, E, OE) = y
5) Theta was calculated using the Normal Equation. The Surface of Bowling Effectiveness (SBE) was then plotted. The plots for each of the bowlers are shown below
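
Steps 3 to 5 can be sketched in plain Python as follows. This is not the Octave code from the repository: the helper names (`features`, `solve`, `normal_equation`) and the sample rows are assumptions, and a small Gaussian elimination stands in for Octave's matrix solve. It computes theta = (XᵀX)⁻¹ Xᵀy for the feature vector [1 O E OE].

```python
def features(overs, economy):
    """Feature vector [1, O, E, O*E]."""
    return [1.0, overs, economy, overs * economy]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x


def normal_equation(X, y):
    """Return theta solving (X^T X) theta = X^T y."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return solve(XtX, Xty)


# Made-up (overs, economy, strike rate) rows, for illustration only.
data = [(10, 2.5, 0.20), (20, 3.0, 0.15), (30, 2.0, 0.18),
        (15, 4.0, 0.07), (25, 3.5, 0.12), (35, 2.8, 0.14)]
X = [features(o, e) for o, e, _ in data]
y = [sr for _, _, sr in data]
print(normal_equation(X, y))
```

Solving the linear system directly avoids forming an explicit matrix inverse, which is usually the more stable choice for small problems like this one.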

Here are the plots

A) Anil Kumble
The data of Kumble, based on Overs bowled & Economy rate versus the Strike Rate is plotted as a 3-D scatter plot (pink crosses). The best fit as determined by solving the optimum theta using the Normal Equation is plotted as 3-D surface shown below.

I have termed this 3-D surface the ‘Surface of Bowling Effectiveness (SBE)’, as it depicts a bowler’s overall effectiveness by plotting the overs (O) and economy rate (E) against the predicted strike rate (SR).
Here is another view

The theta values obtained for Kumble are
Theta =
0.104208
-0.043769
-0.016305
0.011949

And the cost at this theta is
Cost Function J = 0.0046269

B) B S Chandrasekhar
Here is the best-fit surface plot for Chandra with the data on O, E vs SR plotted as a 3D scatter plot. Note: The dataset for Chandrasekhar is smaller compared to the other two.
Another view for Chandra

Theta values for B S Chandrasekhar are
Theta =
0.095780
-0.025377
-0.024847
0.023415
and the cost is
Cost Function J = 0.0032980

C) Kapil Dev
The plots for Kapil

Another view of SBE for Kapil

The Theta values and cost function for Kapil are
Theta =
0.090219
0.027725
0.023894
-0.021434
Cost Function J = 0.0035123

2. Predicting wickets
In the previous section the optimum theta with the lowest Cost Function J was calculated. Based on the value of theta, the strike rate of a bowler can be predicted as the product of the feature vector and theta, from which the predicted wickets follow, i.e.

y= h(x) * theta => Strike Rate (SR) = [1 O E OE] * theta
Now predicted wickets can be calculated as

wickets = Strike rate(SR) * Overs(O)
This is done for Kumble, Chandra and Kapil for different combinations of Overs(O) and Economy(E) rate.
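
The two formulas above translate into a few lines of Python. This is a sketch only: the function names are assumptions, and the theta used here is an illustrative placeholder, not one of the fitted vectors reported earlier.

```python
def predict_strike_rate(theta, overs, economy):
    """SR = [1 O E OE] * theta."""
    x = [1.0, overs, economy, overs * economy]
    return sum(t * xi for t, xi in zip(theta, x))


def predict_wickets(theta, overs, economy):
    """wickets = SR * O."""
    return predict_strike_rate(theta, overs, economy) * overs


# Placeholder theta (hypothetical values chosen for the example).
theta = [0.05, 0.002, 0.01, -0.0005]

# Build a small table of predicted wickets for combinations of O and E.
for overs in (10, 20, 30):
    row = [round(predict_wickets(theta, overs, e), 2) for e in (2.0, 3.0, 4.0)]
    print(overs, row)
```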

Here are the results
Predicted wickets for Anil Kumble
The plot of predicted wickets for Kumble is represented below

This can also be represented as a table

Predicted wickets for B S Chandrasekhar

The table for Chandra

Predicted wickets for Kapil Dev

The plot

The predicted table from the hypothesis function for Kapil Dev

Observation: A closer look at the predicted wickets for Kapil, Kumble and B S Chandra shows an interesting aspect. The predicted number of wickets is higher for lower economy rates. With a little thought we can see that on turning pitches, or pitches with a lot of movement, bowlers can not only be more economical but can also be destructive and take a lot of wickets. Hence the higher wickets for lower economy rates!

Implementation details
In this post I have used the Normal Equation to get the optimal values of theta that minimize the cost function. As mentioned above, when I looked at the 3D scatter plot, fitting a 2D plane did not seem quite right. So I had to experiment with different polynomial equations, first trying 2nd order, 3rd order and also the sqrt function.

I tried the following, where ‘O’ is Overs, ‘E’ stands for Economy Rate and ‘SR’ is the predicted Strike Rate. Theta is the computed theta from the Normal Equation. The equations in matrix notation are shown below

i) A linear plane
SR = [1 O E] * theta

ii) Using the sqrt function
SR = [1 sqrt(O) sqrt(E)] * theta

iii) Using the 2nd order polynomial
SR = [1 O^2 E^2] * theta

iv) Using the 3rd order polynomial
SR = [1 O^3 E^3] * theta

v) Before finally settling on
SR = [1 O E OE] * theta

where OE = O .* E

The last one seemed to give me the lowest cost and also seemed the most logical visual choice.

A good resource to play around with different functions and check out the shapes of combinations of variables and polynomial order of equation is at WolframAlpha: Plotting and Graphics

Note 1: The Normal Equation fit has been performed on the entire data set (approx. 220 samples each for Kumble and Kapil, and 99 for Chandra). The proper process for verifying a Machine Learning algorithm is to split the data set into 60% training data, 20% cross-validation data and 20% test data. We need to validate the prediction function against the cross-validation set, fine-tune it and finally ensure that it fits the test set samples well. However, this split was not done as the data set itself was very small. The entire data set was used to perform the optimal surface fit

Note 2: The optimal theta values have been chosen with a feature vector that is of the form
[1 x y x .* y]. The ‘Surface of Bowling Effectiveness’ has been plotted above. It may appear that there is a ‘high bias’ in the fit and that an even better fit could be obtained by choosing higher order polynomials like
[1 x y x*y x^2 y^2 (x^2) .* y x .* (y^2)] or
[1 x y x*y x^2 y^2 x^3 y^3] etc.
While we could get a better fit, we could run into the problem of ‘high variance’, and without the cross-validation and test sets we would not be able to verify the results. Hence the simpler option [1 x y x*y] was chosen

The Octave code outline and the data used can be cloned from GitHub at ml-bowling-analyze

Conclusion:

1) Predicted wickets: The predicted number of wickets is higher at lower economy rates
2) Comparing performances: There are different ways of looking at the results. One possible way is to check who is most effective for a particular number of overs and economy rate. Here is one way. Taking a small slice from each bowler’s predicted wickets table for an Economy Rate = 4.0, the predicted wickets are

From the above it does appear that Kapil is more effective than the other two. However one could slice and dice in different ways, maybe the most economical for a given overs and wickets combination, or the wickets taken in the least overs etc. Do add your thoughts and comments on my assessment or analysis

Also see
1. Analyzing cricket’s batting legends – Through the mirage with R
2. Masters of spin: Unraveling the web with R

Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid

Having just completed the highly stimulating & inspiring Stanford’s Machine Learning course at Coursera, by the incomparable Professor Andrew Ng I wanted to give my newly acquired knowledge a try. As a start, I decided to try my hand at analyzing one of India’s fastest growing stars, namely Virat Kohli . For the data on Virat Kohli I used the ‘Statistics database’ at ESPN Cricinfo. To make matters more interesting, I also pulled data on the iconic Sachin Tendulkar and the Mr. Dependable, Rahul Dravid.

(Also do check out my R package cricketr Introducing cricketr! : An R package to analyze performances of cricketers and my interactive Shiny app implementation using my R package cricketr – Sixer – R package cricketr’s new Shiny avatar )

Based on the data of these batsmen I perform some predictions with the help of machine learning algorithms. That I have a proclivity for prediction, is not surprising, considering the fact that my Dad was an astrologer who had reasonable success at this esoteric art. While he would be concerned with planetary positions, about Rahu in the 7th house being in the malefic etc., I on the other hand focus my predictions on multivariate regression analysis and K-Means. The first part of my post gives the results of my analysis and some predictions for Kohli, Tendulkar and Dravid.

The second part of the post contains a brief outline of the implementation and not the actual details of implementation. This is to ensure that I don’t violate Coursera’s Machine Learning Honor Code.

This code, data used and the output obtained can be accessed at GitHub at ml-cricket-analysis

Analysis and prediction of Kohli, Tendulkar and Dravid with Machine Learning

As mentioned above, I pulled the data for the 3 cricketers Virat Kohli, Sachin Tendulkar and Rahul Dravid. The data taken from the Cricinfo database for the 3 batsmen is based on the following assumptions

Only ‘Minutes at Crease’ and ‘Balls Faced’ were taken as features against the output variable ‘Runs scored’
Only test matches were taken. This included both test ‘at home’ and ‘away tests’
The data was cleaned to remove any DNB (did not bat) values
No extra weightage was given to ‘not out’. So if Kohli made ‘28*’ 28 not out, this was taken to be 28 runs

Regression Analysis for Virat Kohli

There are 51 data points for Virat Kohli regarding Tests played. The data for Kohli is displayed as a 3D scatter plot where the x-axis is ‘minutes’ and the y-axis is ‘balls faced’. The vertical z-axis is the ‘runs scored’. Multivariate regression analysis was performed to find the best fitting plane for the runs scored based on the selected features of ‘minutes’ and ‘balls faced’.

This is based on minimizing the cost function by performing gradient descent for 400 iterations and checking for convergence. The result is the 3-D plane that provides the best fit for the data points for Kohli. The diagram below shows the prediction plane of expected runs for a combination of ‘minutes at crease’ and ‘balls faced’. Here are 2 such plots for Virat Kohli.

Another view of the prediction plane

Prediction for Kohli

I have also computed the predicted runs that will be scored by Kohli for different combinations of ‘minutes at crease’ and ‘balls faced’. As an example, from the table below, we can see that the predicted runs for Kohli after being at the crease for 110 minutes and facing 135 balls is 54 runs.

Regression analysis for Sachin Tendulkar

There was a lot more data on Tendulkar and I was able to dump close to 329 data points. As before the ‘minutes at crease’, ‘balls faced’ vs ‘runs scored’ were plotted as a 3D scatter plot. The prediction plane is calculated using gradient descent and is shown as a plane in the diagram below

Another view of this below

Predicted runs for Tendulkar

The table below gives the predicted runs for Tendulkar for a combination of time at crease and balls faced. Hence, Tendulkar will score 57 runs in 110 minutes after facing 135 deliveries

Regression Analysis for Rahul Dravid

The same was done for ‘the Wall’ Dravid. The prediction plane is below

Predicted runs for Dravid

The predicted runs for Dravid for combinations of batting time and balls faced are included below. The predicted runs for Dravid after facing 135 deliveries in 110 minutes is 44.

Further analysis

While the ‘prediction plane’ was useful, it somehow does not give a clear picture of how effective each batsman is. Clearly the 3D plots show at least 3 clusters for each batsman. For all batsmen, the clustering is densest near the origin, becomes less dense towards the middle and is sparse at the other end. This is an indication of the stage of the innings at which the batsman is most prone to get out. So I decided to perform K-Means clustering on the data for the 3 batsmen. This gives the 3 general tendencies for each batsman. The output is included below

K-Means for Virat

The K-Means for Virat Kohli indicate the following

Centroids found (one column per cluster)
Minutes at crease   255.000000   104.478261   19.900000
Balls faced         194.000000    80.000000   15.650000
Runs scored         103.000000    38.739130    7.000000

Analysis of Virat Kohli’s batting tendency
Kohli has a 45.098 percent tendency to bat for 104 minutes, face 80 balls and score 38 runs
Kohli has a 39.216 percent tendency to bat for 19 minutes, face 15 balls and score 7 runs
Kohli has a 15.686 percent tendency to bat for 255 minutes, face 194 balls and score 103 runs

The computation of this is included in the diagram below
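
The idea behind these tendencies can be sketched in Python: cluster the (minutes, balls faced, runs) triples with K-Means, then report each cluster's share of the innings as a percentage. This is a minimal illustration, not the Octave implementation. The function names, the toy innings and the starting centroids are assumptions, and the cluster sizes 23, 20 and 8 in the last line are inferred from the reported percentages of Kohli's 51 innings rather than taken from the original output.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Group every point with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k=3, initial=None, iterations=50, seed=0):
    """Plain K-Means; `initial` optionally fixes the starting centroids."""
    centroids = list(initial) if initial else random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, assign(points, centroids)


def tendencies(clusters):
    """Percentage of all innings that fall into each cluster."""
    total = sum(len(members) for members in clusters)
    return [100.0 * len(members) / total for members in clusters]


# Toy innings as (minutes, balls faced, runs); values are made up.
innings = [(19, 15, 7), (21, 16, 8), (20, 14, 6),
           (104, 80, 38), (100, 78, 40), (255, 194, 103)]
centroids, clusters = kmeans(innings, initial=[innings[0], innings[3], innings[5]])
print(centroids)
print(tendencies(clusters))

# Cluster sizes of 23, 20 and 8 out of 51 innings give the percentages quoted above.
print([round(t, 3) for t in tendencies([[0] * 23, [0] * 20, [0] * 8])])
```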

K-means for Sachin Tendulkar

The K-Means for Sachin Tendulkar indicate the following

Centroids found (one column per cluster)
Minutes at crease   166.132530   353.092593   43.748691
Balls faced         121.421687   250.666667   30.486911
Runs scored          65.180723   138.740741   15.748691

Analysis of Sachin Tendulkar’s performance

Tendulkar has a 58.232 percent tendency to bat for 43 minutes, face 30 balls and score 15 runs
Tendulkar has a 25.305 percent tendency to bat for 166 minutes, face 121 balls and score 65 runs
Tendulkar has a 16.463 percent tendency to bat for 353 minutes, face 250 balls and score 138 runs
K-Means for Rahul Dravid

Centroids found (one column per cluster)
Minutes at crease   191.836364   409.000000   50.506024
Balls faced         137.381818   290.692308   34.493976
Runs scored          56.945455   131.500000   13.445783

Analysis of Rahul Dravid’s performance
Dravid has a 50.610 percent tendency to bat for 50 minutes, face 34 balls and score 13 runs
Dravid has a 33.537 percent tendency to bat for 191 minutes, face 137 balls and score 56 runs
Dravid has a 15.854 percent tendency to bat for 409 minutes, face 290 balls and score 131 runs
Some implementation details

The entire analysis and coding was done with Octave 3.2.4. I have included the outline of the code for performing the multivariate regression. In essence the pseudo code for this is

Read the batsman data (Minutes, balls faced versus Runs scored)
Calculate the cost
Perform Gradient descent
Plot the 3-D plane that best fits the data

The cost was plotted against the number of iterations to ensure convergence while performing gradient descent
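
The cost and gradient descent steps of the outline can be illustrated with a short Python sketch. This is a generic batch gradient descent for linear regression, not the Octave code, which is withheld because of the Honor Code. The function name, the learning rate and the toy data are assumptions, and the features are assumed to be scaled already so that a fixed learning rate converges.

```python
def gradient_descent(X, y, alpha=0.1, iterations=400):
    """Batch gradient descent; each row of X starts with the constant 1.0.

    Returns the fitted theta and the cost recorded at every iteration.
    """
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    history = []
    for _ in range(iterations):
        # Prediction error for every sample under the current theta.
        errors = [sum(t * xi for t, xi in zip(theta, row)) - yi for row, yi in zip(X, y)]
        history.append(sum(e * e for e in errors) / (2 * m))
        # Gradient of the cost with respect to each theta_j.
        gradient = [sum(e * row[j] for e, row in zip(errors, X)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta, history


# Toy data generated from y = 1 + 2*x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta, history = gradient_descent(X, y)
print(theta)
print(history[0], history[-1])  # the cost should fall as iterations proceed
```

Plotting `history` against the iteration number gives the convergence curve mentioned above.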
The outline of this code, data used and the output obtained can be accessed at GitHub at ml-cricket-analysis

Conclusion: Comparing the results from the K-Means, Tendulkar has around a 42% tendency to make a score greater than 60
Tendulkar has a 25.305 percent tendency to bat for 166 minutes, face 121 balls and score 65 runs
Tendulkar has a 16.463 percent tendency to bat for 353 minutes, face 250 balls and score 138 runs

And Dravid has around a 49% tendency to score 56 runs or more
Dravid has a 33.537 percent tendency to bat for 191 minutes, face 137 balls and score 56 runs
Dravid has a 15.854 percent tendency to bat for 409 minutes, face 290 balls and score 131 runs

Kohli has around a 45% tendency to score around 38 runs
Kohli has a 45.098 percent tendency to bat for 104 minutes, face 80 balls and score 38 runs

Also Kohli has a lower tendency to score low runs than the other two
Kohli has a 39.216 percent tendency to bat for 19 minutes, face 15 balls and score 7 runs

The results must be looked at in proper perspective, as Kohli is just starting his career while the other 2 are veterans. Kohli has a long way to go and I am certain that he will blaze a trail of glory in the years to come!

Watch this space!

Also see
1. My book ‘Practical Machine Learning with R and Python’ on Amazon
2. Introducing cricketr! : An R package to analyze performances of cricketers
3. Informed choices with Machine Learning 2 – Pitting together Kumble, Kapil and Chandra
4. Analyzing cricket’s batting legends – Through the mirage with R
5. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
6. Bend it like Bluemix, MongoDB with autoscaling – Part 1

Presentation on ‘Evolution to LTE’

My presentation on ‘Evolution to LTE’

From developerWorks – What’s up, Watson? Using Watson QAAPI with Bluemix and NodeExpress

My post in IBM developerWorks – What’s up, Watson? Using Watson QAAPI with Bluemix and NodeExpress

Create a Bluemix™ application that uses Watson’s Question and Answer API (QAAPI). IBM’s Watson is capable of understanding the nuances of the English language. Bluemix now includes eight services from Watson, including Concept Expansion, Language Identification, Machine Translation, and Question and Answer. For more information on Watson’s QAAPI and the many services that have been included in Bluemix, see Watson Services.

The current release of Bluemix Watson is a corpus of medical facts. Watson has been made to ingest medical documents in multiple formats (doc, pdf, html, text, and so on), and the user can pose medical questions to the Watson QAAPI.

This tutorial shows how to use the Watson Question and Answer API to make queries and get results of various types.

In the application described in this tutorial, NodeExpress is used to create a web server and to post questions to Watson using REST APIs. Jade is used to format the results of Watson’s response.

For more details and the latest code please see my full article in IBM developerWorks What’s up, Watson? Using Watson QAAPI with Bluemix and NodeExpress

Bend it like Bluemix, MongoDB with auto-scaling – Part 3

In this last post of the series, I test the performance of Bluemix and MongoDB against concurrent queries and deletes to the cloud-based app, with auto-scaling on. Before I started this series of tests I moved the Overload policy a couple of notches higher, so that the app scales out if memory utilization is > 75% for 120 secs and scales in if it is < 30% for 120 secs (the earlier scale-out threshold was 55% memory utilization), as shown below.
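The policy above boils down to a sustained-threshold check. The Python sketch below only illustrates that idea and is not how the Bluemix Auto-scaling service is implemented. The names ScalingPolicy and decide and the 30 second sampling interval are assumptions made for the example, and the < 30% rule is read here as the scale-in condition. The 75%, 30% and 120 second values are the ones from the policy described above.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    scale_out_pct: float = 75.0  # upper memory threshold from the post
    scale_in_pct: float = 30.0   # lower memory threshold from the post
    breach_secs: int = 120       # how long a breach has to last
    sample_secs: int = 30        # assumed sampling interval (not from the post)


def decide(samples, policy):
    """Return 'scale_out', 'scale_in' or 'hold'.

    samples: memory utilization percentages, oldest first,
    one reading every policy.sample_secs seconds.
    """
    needed = max(1, policy.breach_secs // policy.sample_secs)
    if len(samples) < needed:
        return "hold"  # not enough history to cover the breach duration
    window = samples[-needed:]
    if all(s > policy.scale_out_pct for s in window):
        return "scale_out"
    if all(s < policy.scale_in_pct for s in window):
        return "scale_in"
    return "hold"
```

A single dip back under the threshold inside the window resets the decision to 'hold', which is why a short spike in memory does not add an instance.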

The code for the bluemixMongo app can be forked from Devops at bluemixMongo or can be cloned from GitHub at bluemix-mongo-autoscale. The multi-mechanize scripts can be downloaded from GitHub at multi-mechanize. Before starting the testing I checked the current number of documents inserted by the concurrent inserts (see Bend it like Bluemix, MongoDB using Auto-scaling – Part 2). The total number, as determined by checking the logs, was 1380. Sure enough, with the scaling policy change, after 2 minutes the number of instances dropped from 3 to 2.

1. Querying the bluemixMongo app with Multi-mechanize

The Multi-mechanize Python script used for querying the bluemixMongo app simply invokes the app’s userlist URL (resp=br.open(“http://bluemixmongo.mybluemix.net/userlist/”)).

v_user.py

def run(self):
    # create a Browser instance
    br = mechanize.Browser()
    # don't bother with robots.txt
    br.set_handle_robots(False)
    # start the timer
    start_timer = time.time()
    #print("Display userlist")
    # Display 5 random documents
    resp = br.open("http://bluemixmongo.mybluemix.net/userlist/")
    assert("Example Mongo Page" in resp.get_data())
    # stop the timer
    latency = time.time() - start_timer
    self.custom_timers["Userlist"] = latency
    r = random.uniform(1, 2)
    time.sleep(r)
    self.custom_timers['Example_Timer'] = r

The configuration setup for this script creates 2 sets of 10 concurrent threads

config.cfg
[global]
run_time = 300
rampup = 0
results_ts_interval = 10
progress_bar = on
console_logging = off
xml_report = off

[user_group-1]
threads = 10
script = v_user.py

[user_group-2]
threads = 10
script = v_user.py

The corresponding userlist.js for querying the app is shown below. Here the query is constructed by creating a ‘RegularExpression’ that matches any Firstname containing a random letter followed, somewhere later, by a random number. The query is also limited to 5 documents.

function(callback) {
    // Display a random set of 5 records based on a regular expression made with random letter, number
    var randnum = Math.floor((Math.random() * 10) + 1);
    var alpha = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','X','Y','Z'];
    var randletter = alpha[Math.floor(Math.random() * alpha.length)];
    var val = randletter + ".*" + randnum + ".*";
    // Limit the display to 5 documents
    var results = collection.find({"FirstName": new RegExp(val)}).limit(5).toArray(function(err, items){
        if(err) {
            console.log(err + " Error getting items for display");
        }
        else {
            res.render('userlist', {
                "userlist" : items
            }); // end res.render
        } //end else
        db.close(); // Ensure that the open connection is closed
    }); // end toArray function
    callback(null, 'two');
}
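To see which names such a pattern picks up, here is a small Python sketch of the same idea using the standard re module. It is an illustration only and does not talk to MongoDB. The helper names random_name_pattern and find_matches are made up for the example. An unanchored regular expression in a MongoDB query matches anywhere in the field, which re.search mimics here.

```python
import random
import re
import string


def random_name_pattern(rng):
    """Build a pattern like 'K.*7.*' from a random letter and a number 1..10."""
    letter = rng.choice(string.ascii_uppercase)
    number = rng.randint(1, 10)  # same 1..10 range as the JavaScript above
    return "{}.*{}.*".format(letter, number)


def find_matches(names, pattern, limit=5):
    """Return at most `limit` names in which the pattern is found anywhere."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)][:limit]
```

For example, the pattern 'A.*3.*' matches 'ABCDE12345' (an A with a 3 somewhere after it) but not '3A' or 'XAQ7'.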

2. Running the userlist query

The following screenshot shows the userlist query being executed concurrently with Multi-mechanize. Note that the number of instances also drops down to 1

3. Deleting documents with Multi-mechanize

The multi-mechanize script for deleting a document is shown below. This script calls the URL with resp = br.open(“http://bluemixmongo.mybluemix.net/remuser”). No values are required to be entered into the form and the ‘submit’ is simulated.

v_user.py

def run(self):
    # create a Browser instance
    br = mechanize.Browser()
    # don't bother with robots.txt
    br.set_handle_robots(False)
    br.addheaders = [("User-agent", "Mozilla/5.0Compatible")]
    # start the timer
    start_timer = time.time()
    # submit the request
    resp = br.open("http://bluemixmongo.mybluemix.net/remuser")
    #resp = br.open("http://localhost:3000/remuser")
    resp.read()
    # stop the timer
    latency = time.time() - start_timer
    # store the custom timer
    self.custom_timers["Load_Front_Page"] = latency
    # think-time
    time.sleep(2)
    # select first (zero-based) form on page
    br.select_form(nr=0)
    # set form field
    br.form["firstname"] = ""
    br.form["lastname"] = ""
    br.form["mobile"] = ""
    # start the timer
    start_timer = time.time()
    # submit the form
    resp = br.submit()
    resp.read()
    print("Removed")
    # stop the timer
    latency = time.time() - start_timer
    # store the custom timer
    self.custom_timers["Delete"] = latency
    # think-time
    time.sleep(2)

config.cfg

The config file is set to start 2 sets of 10 concurrent threads and execute for 300 secs

[global]
run_time = 300
rampup = 0
results_ts_interval = 10
progress_bar = on
console_logging = off
xml_report = off

[user_group-1]
threads = 10
script = v_user.py

[user_group-2]
threads = 10
script = v_user.py

deleteuser.js

This Node.js script fetches a document with findOne() and then removes it with ‘justOne’ set to true

collection.findOne(function(err, item) {
    // Delete just a single record
    collection.remove(item, {justOne:true},(function (err, doc) {
        if (err) {
            // If it failed, return error
            res.send("There was a problem removing the information to the database.");
        }
        else {
            // If it worked redirect to userlist
            res.location("userlist");
            // And forward to success page
            res.redirect("userlist");
        }
    }));
});
collection.find().toArray(function(err, items) {
    console.log("Length =----------------" + items.length);
    db.close();
});
callback(null, 'two');

4. Running the deleteuser multimechanize script

The output of the script executing and the reduction of the number of instances because of the change in the memory utilization policy is shown

5. Multi-mechanize

As mentioned in the previous posts, the multi-mechanize commands are executed as follows
To create a new project
multimech-newproject.exe userlist
This will create 2 folders a) results b) test_scripts and the file c) config.cfg. The v_user.py needs to be updated as required

To run the script
multimech-run.exe userlist

The details of the response times for the query are shown below.

More details on latency and throughput for the queries and the deletes are included in the results folder of multi-mechanize
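For readers who want to compute similar numbers themselves, the following Python sketch summarizes a list of timing samples. It does not parse Multi-mechanize's own result files. The input format, a list of (seconds since start, latency) pairs, and the function name summarize are assumptions for the example. The 10 second bucket mirrors results_ts_interval = 10 from config.cfg.

```python
from collections import Counter
from statistics import mean


def summarize(samples, interval_secs=10):
    """Summarize (elapsed_secs, latency_secs) samples.

    Returns the sample count, mean latency, 90th percentile latency
    (nearest-rank method) and throughput in requests per second for
    each interval, keyed by the interval's start time.
    """
    if not samples:
        raise ValueError("no samples to summarize")
    latencies = sorted(latency for _, latency in samples)
    rank = -(-90 * len(latencies) // 100)  # ceil(0.9 * n) without floats
    per_interval = Counter(int(t // interval_secs) for t, _ in samples)
    throughput = {
        bucket * interval_secs: count / interval_secs
        for bucket, count in sorted(per_interval.items())
    }
    return {
        "count": len(latencies),
        "mean": mean(latencies),
        "p90": latencies[rank - 1],
        "throughput": throughput,
    }
```

Latency answers how long one request took, while throughput answers how many requests completed per second in each interval, so the two are best read together.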

6. Autoscaling

The details of the auto-scaling service are shown below

a. Scaling Metric Statistics

b. Scaling history

7. Monitoring and Analytics (M & A)

The output from M & A is shown below

a. Performance Monitoring

b. Log Analysis output

The log analysis gives a detailed report on the calls made to the app, the console log output and other useful information

The series of 3 posts, Bend it like Bluemix, MongoDB with auto-scaling, demonstrated the ability of the cloud to expand and shrink based on the load on the cloud. An important requirement for Cloud Architects is to design applications that can scale horizontally without impacting the performance while keeping the costs optimum. The real challenge to auto-scaling is the need to make the application really distributed as opposed to the monolithic architectures we are generally used to. I hope to write another post on creating a simple distributed application later.

Hasta la Vista!

Also see
1. Bend it like Bluemix, MongoDB with autoscaling – Part 1
2. Bend it like Bluemix, MongoDB with autoscaling – Part 2

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

Bend it like Bluemix, MongoDB using Auto-scaling – Part 2!

This post takes off from my previous post Bend it like Bluemix, MongoDB using Auto-scale – Part 1! In this post I generate traffic using Multi-Mechanize, a performance test framework, and check out the auto-scaling on Bluemix, besides doing some rudimentary checks on the latency and throughput for this test application. In this particular post I generate concurrent threads which insert documents into MongoDB.

Note: As mentioned in my earlier post this is more of a prototype than the typical situation when architecting cloud applications. Clearly I have not optimized my cloud app (bluemixMongo) for maximum efficiency. Also, this is a simple 2 tier application with a rudimentary Web interface and a NoSQL DB at the back end. This is more of a Proof of Concept (PoC) for the auto-scaling service on Bluemix.

As mentioned earlier, the bluemixMongo app is a modification of the app from my earlier post Spicing up a IBM Bluemix cloud app with MongoDB and NodeExpress. The bluemixMongo cloud app that was used for this auto-scaling test can be forked from Devops at bluemixMongo or from GitHub at bluemix-mongo-autoscale. The Multi-mechanize config file, scripts and results can be found at GitHub in multi-mechanize.

The document to be inserted into MongoDB consists of 3 fields – Firstname, Lastname and Mobile. To simulate the insertion of records into MongoDB I created a Multi-Mechanize script that generates a random combination of letters and numbers for the First and Last names and a random 9 digit number for the mobile. The code for this script is shown below

1. The snippet below measures the latency for loading the ‘New User’ page

v_user.py
def run(self):
    # create a Browser instance
    br = mechanize.Browser()
    # don't bother with robots.txt
    br.set_handle_robots(False)
    print("Rendering new user")
    br.addheaders = [("User-agent", "Mozilla/5.0Compatible")]
    # start the timer
    start_timer = time.time()
    # submit the request
    resp = br.open("http://bluemixmongo.mybluemix.net/newuser")
    #resp = br.open("http://localhost:3000/newuser")
    resp.read()
    # stop the timer
    latency = time.time() - start_timer
    # store the custom timer
    self.custom_timers["Load Add User Page"] = latency
    # think-time
    time.sleep(2)

The script also measures the time taken to submit the form containing the Firstname, Lastname and Mobile

    # select first (zero-based) form on page
    br.select_form(nr=0)
    # Create random Firstname
    a = (''.join(random.choice(string.ascii_uppercase) for i in range(5)))
    b = (''.join(random.choice(string.digits) for i in range(5)))
    firstname = a + b
    # Create random Lastname
    a = (''.join(random.choice(string.ascii_uppercase) for i in range(5)))
    b = (''.join(random.choice(string.digits) for i in range(5)))
    lastname = a + b
    # Create a random mobile number
    mobile = (''.join(random.choice(string.digits) for i in range(9)))
    # set form field
    br.form["firstname"] = firstname
    br.form["lastname"] = lastname
    br.form["mobile"] = mobile
    # start the timer
    start_timer = time.time()
    # submit the form
    resp = br.submit()
    print("Submitted.")
    resp.read()
    # stop the timer
    latency = time.time() - start_timer
    # store the custom timer
    self.custom_timers["Add User"] = latency

2. The config.cfg file is set up to generate 2 asynchronous thread pools of 10 threads each for about 400 seconds

config.cfg
[global]
run_time = 400
rampup = 0
results_ts_interval = 10
progress_bar = on
console_logging = off
xml_report = off

[user_group-1]
threads = 10
script = v_user.py

[user_group-2]
threads = 10
script = v_user.py

3. The code to add a new user in the app (adduser.js) uses the ‘async’ Node module to enforce sequential processing.

adduser.js
async.series([
    function(callback) {
        collection = db.collection('phonebook', function(error, response) {
            if( error ) {
                return; // Return immediately
            }
            else {
                console.log("Connected to phonebook");
            }
        });
        callback(null, 'one');
    },
    function(callback) {
        // Insert the record into the DB
        collection.insert({
            "FirstName" : FirstName,
            "LastName" : LastName,
            "Mobile" : Mobile
        }, function (err, doc) {
            if (err) {
                // If it failed, return error
                res.send("There was a problem adding the information to the database.");
            }
            else {
                // If it worked, redirect to userlist - Display users
                res.location("userlist");
                // And forward to success page
                res.redirect("userlist")
            }
        });
        collection.find().toArray(function(err, items) {
            console.log("**************************>>>>>>>Length =" + items.length);
            db.close(); // Make sure that the open DB connection is closed
        });
        callback(null, 'two');
    }
]);
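The idea behind running steps in series is that a step reports completion only after its own asynchronous work has finished, so the next step never starts early. The Python sketch below shows that pattern with asyncio. It is an analogy to async.series rather than a port of adduser.js. The run_in_series helper and the two stand-in steps are hypothetical, and the sleeps merely simulate database round trips.

```python
import asyncio


async def run_in_series(steps):
    """Await each coroutine function in order and collect the results.

    A step counts as done only when its coroutine returns, and an
    exception in any step stops the remaining steps from running.
    """
    results = []
    for step in steps:
        results.append(await step())
    return results


async def demo():
    log = []

    async def connect():
        await asyncio.sleep(0.02)  # stand-in for opening the collection
        log.append("connected")
        return "one"

    async def insert():
        await asyncio.sleep(0.01)  # stand-in for the insert round trip
        log.append("inserted")
        return "two"

    results = await run_in_series([connect, insert])
    return results, log
```

Even though the first step takes longer than the second, the log always reads 'connected' and then 'inserted', because the second step is not started until the first has returned.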

4. To check out auto-scaling the instance memory was kept at 128 MB. The scale-up policy was memory based and triggered when the memory of the instance exceeded 55% of 128 MB for 120 secs. The scale-up based on CPU utilization was to happen when the utilization exceeded 80% for 300 secs.

5. Check the auto-scaling policy

6. Initially as seen there is just a single instance

7. At around 48% of the script, with around 623 transactions, the number of instances is increased by 1. Note that the available memory decreases to 640 MB – 128 MB = 512 MB.

8. At around 1324 transactions another instance is added

Note: Bear in mind

a) The memory threshold was artificially brought down to 55% of 128 MB.
b) The app itself is not optimized for maximum efficiency

9. The Metric Statistics tab for the Autoscaling service shows this memory breach and the trigger for autoscaling

10. The Scaling history Tab for the Auto-scaling service displays the scale-up and scale-down and the policy rules based on which the scaling happened

11. The response times and throughput are captured in the results folder of the Multi-mechanize tool.

The multi-mechanize commands are executed as follows
To create a new project
multimech-newproject.exe adduser
This will create 2 folders a) results b) test_scripts and the file c) config.cfg. The v_user.py needs to be updated as required

To run the script
multimech-run.exe adduser

12. The results are shown below

a) Load Add User page (Latency)

b) Load Add User (Throughput)

c) Load Add User (Latency)

d) Load Add User (Throughput)

The detailed results can be seen at GitHub at multi-mechanize

13. Check the Monitoring and Analytics Page

a) Availability

b) Performance monitoring

So once the auto-scaling happens the application can be fine-tuned for performance. Obviously one could do it the other way around too.

As can be seen, adding NoSQL databases like MongoDB, Redis, Cloudant DB etc. is painless. Setting up the auto-scaling policy is also painless as seen above.

Of course the real challenge in cloud applications is to make them distributed and scalable while keeping the applications themselves lean and mean!

a) Latency, throughput implications for the cloud

b) The many faces of latency

c) Design principles of scalable, distributed systems

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

Bend it like Bluemix, MongoDB using Auto-scale – Part 1!

In the next series of posts I turn up the heat on my cloud deployment in IBM Bluemix and check out the elastic nature of this PaaS offering. Handling traffic load and elastically expanding and contracting is what the cloud does best. This is where the ‘rubber really meets the road’. In this series of posts I generate the traffic load using Multi-Mechanize, a performance test framework created by Corey Goldberg.

This post is based on an earlier cloud app that I created on Bluemix namely Spicing up a IBM Bluemix Cloud app with MongoDB and NodeExpress. I had to make changes to this code to iron out issues while handling concurrent inserts, displays and deletes issued from the multi-mechanize tool and also to manage the asynchronous nightmare of Nodejs.

The code for this Bluemix, MongoDB with Auto-scaling can be forked from Devops at bluemixMongo. The code can also be cloned from GitHub at bluemix-mongo-autoscale

1. To get started, fork the code from Devops at bluemixMongo. Then change the host name in manifest.yml to something unique and click the Build and Deploy button on the top right in the page.

1a. Alternatively the code can be cloned from GitHub at bluemix-mongo-autoscale. From the directory where the code is cloned push the code using Cloud Foundry’s cf command as follows

cf login -a https://api.ng.bluemix.net

cf push bluemixMongo -p . -m 128M

2. Now add the MongoDB service and click ‘OK’ to restage the server.

3. Add the Monitoring and Analytics (M & A) service and also the Auto-scaling service. M & A gives a good report on Availability and Performance monitoring, and also provides Log Analysis. The Auto-scaling service is the service that allows the app to expand elastically with changing traffic loads.

4. You should see the bluemixMongo app running with 3 services MongoDB, Autoscaling and M&A

5. You should now be able to click bluemixMongo.mybluemix.net and check the application out.

6. Now configure the Overload Policy (the auto-scaling policy). This is a slightly contrived example and the scaling policy is set to scale up if the memory usage exceeds 55%. (Typically the scale up would be configured for > 80% memory usage.)

7. Now check the configured Auto-scaling policy

8. Change the Memory Quota as appropriate. In my case I have kept the memory quota as 128 MB. Note the available memory is 640 MB and hence allows up to 5 instances. (By the way it is also possible to set any other value like 100 MB).
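The arithmetic behind the instance limit is simple enough to check by hand, and a few lines of Python make it explicit. This is only a back-of-the-envelope sketch that treats 640 MB as the pool from which instances are carved out. The function names are made up for the example, and the 55% figure is the scale-up threshold from step 6.

```python
def max_instances(total_mb, quota_mb):
    """How many whole instances of quota_mb fit into total_mb."""
    return total_mb // quota_mb


def free_memory_mb(total_mb, quota_mb, instances):
    """Memory left in the pool after allocating the given number of instances."""
    if instances > max_instances(total_mb, quota_mb):
        raise ValueError("not enough memory for that many instances")
    return total_mb - instances * quota_mb


def scale_up_threshold_mb(quota_mb, threshold_pct):
    """Memory usage per instance at which the scale-up rule is breached."""
    return quota_mb * threshold_pct / 100
```

With a 128 MB quota the pool holds 5 instances, each one taking 128 MB out of the 640 MB, and the 55% rule is breached once an instance uses more than about 70 MB. With a 100 MB quota the same pool would hold 6 instances.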

9. Click the Monitoring and Analytics service and take a look at the output in the different tabs

10. Next you need to set up the Performance test tool – Multi-mechanize. Multi-mechanize creates concurrent threads to generate the load on a Web site or service. It is based on Python which makes it easy to modify the scripts for hitting a website, making a REST call or submitting a form.
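To get a feel for what "concurrent threads generating load" means before installing anything, here is a minimal Python sketch using only the standard library. It is not Multi-mechanize and does not reproduce its internals. The Transaction class with a custom_timers dictionary mirrors the shape of the v_user.py scripts used in this series, while run_user_group, the fixed number of iterations and the injected action are assumptions for illustration (the real tool runs each group for a configured run_time).

```python
import threading
import time


class Transaction(object):
    """One virtual user: runs an action and records how long it took."""

    def __init__(self, action):
        self.action = action
        self.custom_timers = {}

    def run(self):
        start_timer = time.time()
        self.action()  # e.g. open a URL or submit a form
        self.custom_timers["Example_Timer"] = time.time() - start_timer


def run_user_group(action, threads=10, iterations=3):
    """Run `threads` virtual users concurrently and return all latencies."""
    latencies = []
    lock = threading.Lock()

    def worker():
        txn = Transaction(action)
        for _ in range(iterations):
            txn.run()
            with lock:  # the list is shared between threads
                latencies.append(txn.custom_timers["Example_Timer"])

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return latencies
```

Calling run_user_group(lambda: time.sleep(0.01), threads=10) returns 30 timings, one per transaction, which is the raw material for the latency and throughput graphs.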

To set up Multi-mechanize you also need additional packages like numpy, matplotlib etc., as the tool generates traffic based on a user-provided script and measures latency and throughput, besides generating graphs for these.

For detailed steps to set up Multi-mechanize please follow Trying out multi-mechanize web performance and load testing framework. Note: I would suggest that you install Python 2.7.2 and not the later 3.x version, as some of the packages are based on the 2.7 version, which has a slightly different syntax for certain Python statements.

In the next post I will run a traffic test on the bluemixMongo application using Multi-mechanize and observe how the cloud app responds to the load.

Watch this space!
Also see
Bend it like Bluemix, MongoDB with autoscaling – Part 2!
Bend it like Bluemix, MongoDB with autoscaling – Part 3

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

Where is the Cloud Computing bus going?

Technological innovation patterns have often repeated themselves in history. So it is with Cloud Computing. Familiar patterns of change seem to emerge today.

Here are some of the main trends that I see in Cloud Computing

Advent of containers: Containers are the new hot topic in cloud computing. In virtualization, guest OSes run separately. Running a separate guest OS over the hypervisor carries a lot of overhead for each of the heavyweight OSes. Containers are a form of OS-level virtualization and can be used as an alternative to hypervisor-based virtual machines to run multiple isolated systems on a single host. Containers within a single operating system are much more efficient, being lightweight, while being able to provide a similar level of isolation. Containers share the kernel of the host. Here is an interesting article on containers: Containers, not virtual machines are the future of the cloud.

In many ways this containers over VM innovation pattern is reminiscent of the advantages of lightweight ‘threads’ over the heavy and slow ‘process’ approach in the OS world. It is inevitable that containers will eventually score over VMs

Open ‘something’ over proprietary’ness: Technology over the decades has always moved towards an ‘open’ approach over proprietary solutions. Hence, for example, we have OpenStack for creating instances and provisioning storage and network, doing many things that are being done separately by VMware, Citrix and Hyper-V. The intent is to have a common approach over several disparate approaches. In the networking world there is OpenFlow, which tries to provide a uniform interface to the many different standards maintained by the Ciscos, Junipers and Brocades of the world. There are also other technologies like OpenCV (Computer Vision processing), OpenVPN (VPN protocol) etc. In all these approaches there is a move either to unify or to provide a layer over and above the disparate approaches. I am not sure whether OpenStack will prevail, only time will tell. I personally think we will move to a level of abstraction that will be even above that of OpenStack.

Software Defined Everything: Cloud Computing started with the need to be able to provision computing resources through a user interface or a Web portal. This was made possible thanks to virtualization. Users could now define and request computing resources. Soon this led to the need for being able to programmatically request storage. The trick in storage is to do ‘thin-provisioning’, that is, to provision resources that barely satisfy the needs of the application. The application is then able to request more storage programmatically. Not to be outdone, networking followed suit, and Software Defined Networking became a reality when Stanford and the University of California came up with the OpenFlow protocol. We have now entered the era of the Software Defined Datacenter. This is a dominant theme in Cloud Computing.

These are some of the predominant trends that are emerging in the Cloud Computing arena.

I have spent more than 2 decades of my career in telecom, implementing telecom protocols, starting in the mid-1980s. The mid-1980s was the time when digital switches started to emerge. This was followed by a spate of protocols and dizzying innovations like mobile telephony, ISDN, Intelligent Networks, Softswitch, UMTS, 3G, HSDPA, LTE etc.

I personally think that Cloud computing, to use a very frayed and hackneyed term, is at a similar ‘inflexion point’. Trends are emerging and we will soon be caught in the maelstrom of rapid change and innovation.

In this post I am going to do a Marty McFly of the ‘Back to the Future’ trilogy. I am going to set the clock of the DeLorean DMC-12 to 2020 and ‘Whoosh…..’

21 Apr 2020:

It is 21 Apr 2020 and a sunny day. Here is a look at the Cloud Computing landscape

The Organization of Cloud Computing Standards (OCCS) now sets and governs the standards for all Cloud Providers of the world
Common APIs govern provisioning of instances on the cloud regardless of the Cloud Provider. Instances are defined by RPE values, RAM and IOPS, LB, DNS requirements
Networking bandwidth, security and storage are also standards based
Enterprises use a ‘diffuse deployment’ strategy where the organization’s workloads are deployed to multiple cloud providers.
Workloads are Cloud Provider agnostic.
Enterprise applications themselves may span multiple cloud providers for e.g. the e-commerce in Cloud Provider 1, Analytics on HPC instances on Cloud Provider 2 and secure applications on Private Cloud of Cloud Provider 3. Appropriate contracts are maintained between the Cloud Providers for charging for the usage.
Algorithms are used by enterprises to deploy workloads to cloud providers. The algorithms match the SLA and cost requirements of the application with those offered by the cloud provider to minimize the cost while meeting the SLA requirements of the applications.
Compute, storage and networking costs fluctuate and enterprises use algorithms to optimize the deployment of workloads. Workloads are migrated to take advantage of these price changes
Consolidation and acquisitions happen at an alarming pace. Cloud providers, storage, network and HPC providers also compete fiercely
Cloud providers are swallowed by others and some lose out. The battle scene is bloody
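The workload placement idea in this imagined 2020 can be sketched in a few lines of Python: filter the offers that meet the SLA, then pick the cheapest. Everything here is hypothetical, including the Offer fields, the function name cheapest_offer and any numbers fed into it. A real placement algorithm would also have to weigh migration costs and fluctuating prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    provider: str
    availability_pct: float  # promised availability
    max_latency_ms: float    # promised worst-case latency
    cost_per_hour: float


def cheapest_offer(offers, min_availability_pct, max_latency_ms):
    """Return the lowest-cost offer that satisfies the SLA, or None."""
    eligible = [
        offer for offer in offers
        if offer.availability_pct >= min_availability_pct
        and offer.max_latency_ms <= max_latency_ms
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda offer: offer.cost_per_hour)
```

An offer that is cheap but misses the SLA is never chosen, so the cost is minimized only among the providers that meet the application's requirements.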

Time to get back to the DeLorean. This time the clock on the DeLorean is set to 2025

18 Sep 2025

Today it is 18 Sep 2025, and it is sunny again, coincidentally.

Cloud Computing is dead, mate. These days technology has moved to ‘Cloud Computing in a box’.
The technology of these times is ‘Haze works’ where the computation happens in the stratosphere over the ether …

So much for looking into the future. It is now time to get back to the reality of VMs