Working with Node.js and PostgreSQL

In this post I create a simple Webserver with Node.js which uses PostgreSQL database in the backend. This post shows how to perform simple database operations of insert, select, update and delete using REST APIs (POST, GET, PUT & DELETE).

Assuming that you already have Node.js installed here are the steps to create this CRUD (Create, Remove Update & Delete) Webserver.

1.Install Node.js if you already haven’t from Nodejs.org
2. Create a test directory pg for this PostgreSQL based Node.js Webserver
3. Open a command prompt and run
npm install pg
4.You will also need to install the PostgreSQL Enterprise DB with installer. Choose the appropriate OS and CPU
5.For this example I create a simple Employee Database.
6. In Windows Click Start->All Programs->PostgreSQL 9.3->pgAdmin III
7. Right click Databases-> New Database and Enter employees

8. The next step is to create Node.js Webserver and connect to this Database. PostgreSQL accepts DB connections through port 5432.
9.In this post The Webserver accepts connections from port 5433. Here is the shell of the Node.js Webserver
var pg = require("pg") var http = require("http") var port = 5433; var host = '127.0.0.1';
http.createServer(function(req, res) { if(req.method == 'POST') { insert_records(req,res); } else if(req.method == 'GET') { list_records(req,res); } else if(req.method == 'PUT') { update_record(req,res); } else if(req.method == 'DELETE') { delete_record(req,res); } }).listen(port,host); console.log("Connected to " + port + " " + host);

The Webserver accepts the usual 4 REST API calls – POST, GET, UPDATE, and DELETE for which there are 4 separate function calls. The REST API calls correspond to the database operations insert, select, update and delete respectively.

10. Prior to performing each operation the a client connects to the database as follows

var conString = "pg://postgres:postgres@localhost:5432/employees"; var client = new pg.Client(conString); client.connect();

11. The POST operation is performed as follows

var insert_records = function(req, res) { … … //Drop table if it exists client.query("DROP TABLE IF EXISTS emps"); // Creat table and insert 2 records into it client.query("CREATE TABLE IF NOT EXISTS emps(firstname varchar(64), lastname varchar(64))"); client.query("INSERT INTO emps(firstname, lastname) values($1, $2)", ['Tinniam', 'Ganesh']); client.query("INSERT INTO emps(firstname, lastname) values($1, $2)", ['Anand', 'Karthik']);

12. To display the contents of the database the list_records function is used as follows

var list_records = function(req, res) { console.log("In listing records"); // Select all rows in the table var query = client.query("SELECT firstname, lastname FROM emps ORDER BY lastname, firstname"); query.on("row", function (row, result) { result.addRow(row); }); query.on("end", function (result) {

13. The REST API Update is performed as below

var update_record = function(req, res) { // Update the record where the firstname is Anand query = client.query("UPDATE emps set firstname = 'Kumar' WHERE firstname='Anand' AND lastname='Karthik'");

14.Finally a delete is performed using a delete_record method

var delete_record = function(req, res) { // Delete the record where the lastname is Karthik client.query("DELETE FROM emps WHERE lastname = 'Karthik'");

15.The output of each operation is sent back as HTML as

query.on("row", function (row, result) { result.addRow(row); }); query.on("end", function (result) { // On end JSONify and write the results to console and to HTML output console.log(JSON.stringify(result.rows, null, " ")); res.writeHead(200, {'Content-Type': 'text/plain'}); res.write(JSON.stringify(result.rows) + "\n"); res.end(); });

16.To test the Webserver you need to install a REST API client for the browser you use. I installed the SureUtils-.REST API Client. his is a Chrome extension nand can be installed from Chrome Web Store (search for REST API client). You could choose any REST API client of your choice for the browser you intend to use (Chrome, Firefox)

17.Here are the tests I performed

18.The POST API call

19. The GET API call

20.The PUT call followed by GET call

The PUT API updates Anand Karthik to Kumar Karthik. This is shown in the GET API call.
21. The DELETE call followed by the GET call

Here the DELETE API call deletes the Kumar Karthik record. The GET API call now displays only 1 record.

22. The console.log output for the operations above is shown below

The code for the Node.js- PostgreSQL can be cloned from GitHub at node-pg

Checkout my book ‘Deep Learning from first principles Second Edition- In vectorized Python, R and Octave’. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($12.99) and Kindle($9.99/Rs449) versions.

Also see
1. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
2.My book ‘Deep Learning from first principles:Second Edition’ now on Amazon
3.The Clash of the Titans in Test and ODI cricket
4. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
5.Latency, throughput implications for the Cloud
6. Simulating a Web Joint in Android
5. Pitching yorkpy … short of good length to IPL – Part 1

Also see
– Introducing cricketr: An R package for analyzing performances of cricketers
– A crime map of India in R: Crimes against women
– What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
– Bend it like Bluemix, MongoDB with autoscaling – Part 1
– Analyzing cricket’s batting legends – Through the mirage with R
– Masters of spin: Unraveling the web with R

Find me on Google+

Mixing Twilio with IBM Bluemix

This post walks you through the steps to get started with Twilio on IBM’s Bluemix. Twilio comes as a service that you can add to your Mobile Cloud or Node.js app. Here’s a quick look at Twilio. Twilio, is a cloud communications IaaS organization which allows you use standard web languages to build voice, SMS and VOIP applications via a Web API.

Twilio provides the ability to build VOIP applications using APIs. Twilio itself resides in the cloud and is always available. It also provides SIP integration which means that it can be integrated with Soft switches. Twilio looks really interesting with its ability to combine the cloud, Web and VOIP, SMS and the like.

This post barely scratches the surface of Twilio & Blue mix. This article provides aa hands-on experience for integration of Twilio with Bluemix and is based on this Twilio blog post. It enables you to send a SMS to your mobile phone by typing in a URL.

As in my earlier post the steps are

1) Fire-up a Node.js Webstarter application from the Bluemix dashboard. In my case I have named the application websms. Once this is up and running

2) Click Add a Service and under ‘Web and Application’ choose Twilio.

3) Enter a name for the Twilio service. You will also need the Account SID and Authorization token

4) For this go to http://www.twilio.com and sign up

5) Once you have registered, go to your Dashboard for the Account SID and Auth Token. If the Auth token is encrypted, you can click the ‘lock’ symbol to display the Auth token in plain text.

6) Enter the Accout SID and Auth Token in the Twilio service in Bluemix

7) To get started you can simply fork my Twilio websms code from devops.

8) Now clone the code into a folder you create as follows

git clone https://hub.jazz.net/git/tvganesh/websms

9) You will need to modify the following files

package.json

manifest.yml

app.js

10) You can create package.json by running
npm init. Make sure you enter the name of the application you created in Bluemix. In my case it is “websms’ For the rest of the options you can choose the default. Here is the package.json file
"name": "websms", "version": "0.0.0", "description": "This README.md file is displayed on your project page. You should edit this \r file to describe your project, including instructions for building and \r running the project, pointers to the license under which you are making the \r project available, and anything else you think would be useful for others to\r know.", "main": "app.js", "dependencies": { "gopher": "^0.0.7", "express": "^3.12.0", "twilio": "^1.6.0", "ejs": "^1.0.0" }, "devDependencies": {}, "scripts": { "test": "echo \"Error: no test specified\" && exit 1" }, "repository": { "type": "git", "url": "https://hub.jazz.net/git/tvganesh/websms" }, "author": "", "license": "ISC" }

11) In the manifest.yml make sure you enter the name of your application and the host

applications: - host: websms disk: 1024M name: websms command: node app.js path: . domain: <your domain> mem: 128M instances: 1

12) Lastly make changes to your app.js.

// dependencies var app = require('gopher'), twilio = require('twilio'); var config = JSON.parse(process.env.VCAP_SERVICES); var twilioSid, twilioToken; config['user-provided'].forEach(function(service) { if (service.name == 'Twilio') { twilioSid = service.credentials.accountSID; twilioToken = service.credentials.authToken; } }); // URL test app.get('/', function(request, response) { var client = new twilio.RestClient(twilioSid, twilioToken); client.sendMessage({ to:'<Your mobile number>', from:'<Number from Twilio dashboard', body:'Twilio notification through Bluemix!' }, function(err, message) { response.send('Message sent! ID: '+message.sid); }); });

13) After you have made the changes you will need to push the changes to Bluemix using the command line based ‘cf’ tool
14) Login into cf with
cf login – a http://api.ng.bluemix.net

15) Push the websms onto bluemix

16) In the folder where you websms files reside entr the following command
cf push websms -p . -m 512M

17) This should push the code to Bluemix.
Note: If you happen to get a
Server error, status code: 400, error code: 170001, message: Staging error: cannot get instances since staging failed
then you need to make sure to check the changes made to files app.js, package.,json or the manigfest,yml.

18) If all things went smoothly, go to your Bluemix dashboard and click the link adjacent to the Routes. You should see that an SMS has been sent as shown

19) Your mobile should now display the message that was sent as shown below

20) Check the analytics in your Twilio dashboard

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

Find me on Google+

Introducing the Software Defined Computing Pattern

We are on the verge of a new ‘Software Defined’ revolution. The phrase ‘software defined’ refers to the ability to be able to programmatically control computing elements namely compute, storage, network. We are entering into a bold, brave ‘software defined’ era. Before we delve into the ‘whats’ of this revolution I would rather like to outline the ‘whys’. What motivated this new thinking in computing?

Why “Software Defined’?

In the late 90s, IT infrastructure was unwieldy and unmanageable, Whenever new IT infrastructure had to be procured there was the need to accurately size the required hardware infrastructure, software, software licenses, routers, switches and storage elements The problem in those days had to do with dimensioning. The CIO and IT managers had to be able to calculate the requisite hardware, and software elements. The problem was that if the estimate was too conservative the infrastructure would be under-dimensioned and would not be able to handle the load. On the other hand if it was over-dimensioned then hardware and software would lie idle and would result in a wasted resources and money. So it used to be a fine balancing act. Even if the IT managers got lucky and got the size right, it is quite likely that conditions in the enterprise changed resulting in them having to take a relook at their infrastructure.

This problem of dimensioning IT infrastructure was effectively solved by a technology called ‘virtualization’. In the mid 1960s IBM created a CP-67 Mainframe computer, which had the elements of virtualization. Much later in 1998, VMWare created the VMWare workstation that could run multiple Operating Systems (OS’es). In essence virtualization abstracts the hardware of the computer, storage and network ports through a software known as the hypervisor. Over the hypervisor, the user can run any operating system like Windows, Linux, AIX etc. These OS’es which run on top of the hypervisor are known as guest OS’es. Besides, virtualization technology, enables different virtual servers to share one physical server. This process, called server consolidation, helps to increase hardware utilization, load balancing, and optimization of the IT resources.

The ability to virtualize the computer hardware really triggered some major advancements in computing. Prior to virtualization each server would run a single OS with a single application resulting in the server being idle for close to 60% of the time. Virtualization now made it possible for enterprises to run several OS’es each with its own application on a single computer. Hence the computing resources were used more effectively and efficiently. This is shown below

Virtualization and the dotcom bust around the year 2000 effectively paved the way for a ‘Software Defined’ future. In others words there was a need to control resources programmatically aimed at more efficient utilization of the resources.

The move to the Cloud: Prior to the advent of the cloud, enterprises hosted their applications in their internal IT infrastructure with virtualization technology. With the pay-per-use, utility style computing, spearheaded by the likes of Amazon, many enterprises moved their applications to shared, multi-tenant (multiple customers) , 3rd party hosting service provider, also known as the cloud providers

With the advent of Cloud Computing the software defined era made major advances. Here is the reason why. Computing as such stands on 3 main pillars- computing, storage and networking.

As mentioned earlier in the post, one of the thorny issues in procuring & managing IT infrastructure is the problem of dimensioning or right sizing. Virtualization did solve this problem to some extent but there was a need to provide more control to the user. This is where the ‘Software Defined’ technologies emerged. This ‘Software Defined’ paradigm is based on prudence and sound engineering judgment. The whole premise of making anything ‘software defined’ is to ensure that resources allocated for any task (computing, storage or networking) are optimal. The idea is that resources should be allocated exactly as needed and released and included into a shared, common pool, when idle. Hence we have the advent of

Software Defined Compute
Software Defined Storage
Software Defined Network

Software Defined Compute (SDC): In the clouds these days it is possible to precisely control the computing elements that will make up your application. So you can choose your CPU type, CPU speed, hypervisor, OS, RAM size, disks etc. You can also provision your application to expand or contract elastically to the demands of the times rather than under-provisioning or over-provisioning, This is done through a process called auto scaling. The desired configuration can be controlled through APIs provided by the Cloud Provider.

Software Defined Storage (SDS): There are multiple storage technologies that span DAS, SATA drives, SAN and NAS storage. These different storage technologies address different needs of price, storage capacity and performance, The Software Defined Storage allows the user to control the type of storage that is needed for the application through software APIs. In storage the initial allocation to each application is rather conservative. Additional storage is assigned from a common pool of storage to the applications that needs it the most. Once the storage is no longer needed it is reclaimed.

Software Defined Network(SDN): SDN is the result of pioneering effort by Stanford University and University of California, Berkeley and is based on the Open Flow Protocol and represents a paradigm shift to the way networking elements operate. Software Defined Networks (SDN) decouples the routing and switching of the data flows and moves the control of the flow to a separate network element namely, the flow controller. The motivation for this is that the flow of data packets through the network can be controlled in a programmatic manner allowing for multiple data streams to flow over the communicating paths with each stream individually defined for speed, latency, QoS etc.

Software Defined Datacenter (SDDC): A datacenter has racks and racks of servers, storage boxes, and networking equipment. A datacenter where one is able to provision, manage and operate these equipment through APIs or through programs is a Software Defined Datacenter. Imagine being able to put together a car with the body of a BMW, the interior of a Merc, the engine of a Ferrari and the electronics of a Tesla! That is what a SDDC allows you to do!

Software Defined Computing Pattern (SDCP): Once the SDC, SDS and SDN reach a level of maturity I think the next logical step would be a move to Software Defined Computing Patterns. This is what I am implying by this. Theoretically we can reduce the different types of enterprise applications to a set of computing patterns for e.g. e-commerce, social network, email server, Web portal etc. The Software Defined Computing Pattern would allow the user to choose a computing pattern based on the enterprise application. This would result in the setting up of the appropriate computing resources, storage resources, middleware and networking elements in a cloud. . The user would them need to host their applications on this environment. Here is a good link to cloud patterns.

In this context I would like to bring to your notice that there is another parallel trend called Software Defined Architecture (SDA) coined by Gartner in 2014. The SDA Gateway is responsible for virtualizing the internal API, protocols and models used to external API, User Interface and resources. Here is a diagram of SDA

The pace of progress in the last couple of years has been really scorching. The ability to have solve most large problem through a Software Defined Computing Pattern is sure to happen.

The mind of a programmer

Here is a short essay on the minds of the programmer and programming in general. Programming has been variously described as a science, an art, as black magic, as the work of craftsman etc. It is true, programming can be any or all of that described above. Programming in my opinion is going to become increasingly important in the years ahead. I would certainly advocate some knowledge and grasp of programming. There are many books that claim to teach programming anywhere between 3 to 21 days etc. This is not true. Learning to program is just the beginning of a never ending process. Here is a great piece by Peter Norvig – Teach yourself programming in 10 years.

Programming can be considered to be a language to express your thoughts on the solution to a problem. The ability to express in a programming language can vary between being simply pedestrian to being absolutely poetic! There are those who can wax eloquent in a programming language. In any case, programming is a means to an end, the end being the solution to a problem. Typically the solution to the problem, is expressed as an algorithm, which is then is coded through a programming language,. Programming can be a highly analytical and creative activity.

Programming is different from most other professions that I can think of. To get started all you need is a computer and an Integrated Development Environments (IDE) for e.g. Eclipse, which can be downloaded for free. The IDE can be used for writing code. There are no other associated costs.

Programming is also different from other professions in the sense, that you get your response immediately. For e.g. a painter can paint anything and imagine that he/she is the next Rembrandt or Picasso. A guitarist can create the most hideous sound and think he is Jimi Hendrix’s re-incarnation. Other professions like architects, civil engineers, scientists have to wait for several months to know whether they are in the right direction or not. It is not so with programming. You write code. When you compile it or execute it, the verdict is instantaneous. It is simply a “no go”, if you are wrong. There is no middle path. You are either right or you are wrong.

Having said that, I would like to look at the typical experiences of a programmer?

Tears, sweat and frustration: In the beginning programming is usually very intimidating and frustrating. In the initial stages when you grapple with the quirky syntax of the language, and try to formulate your thoughts around the problem, you will hit many speed bumps. It can be exhausting, tiring and nerve racking. There are no shortcuts in learning how to program. You have to go through the grind, memorize certain phrases and hope that your program works. Once you have you arms around the syntax, you are on your way to actually writing code that achieves something. Here again you will run into all sorts of problems, like loops that never end, inexplicable program crashes and mysterious run time errors etc. The early stages can be difficult and quite unforgiving. This phase requires patience to get through.

Feelings of megalomania: Someone with 5 to 7 years of programming experience knows most of the typical constructs by heart and will be able to quickly churn out programs, rather fast. This is a dangerous phase. Since you have been doing the same thing for a couple of years you are typically aware of the problems and can possibly tweak code to make it solve a slightly different problem. This is usually the stage when programmers start to experience a sense of megalomania. There are delusions of grandeur. You may remember the programmer shown in Golden Eye who keeps saying “I am invincible!” whenever he is able to solve a knotty problem. These programmers have the feeling that “Nothing is impossible”

Programming is a great leveler. Programming can be a great boost to your ego. When you are able to visualize a problem, strategize the solution and actually get it to work, it does wonders to your ego. Programming can really boost your self-esteem. But you should not just stick to your comfort zone and write code in exactly the same language in exactly the same domain. It really helps to move to a different language, preferably a different paradigm – for example a move from procedural (C) to Object Oriented (Java, C++) or from object oriented to functional (Lisp, Haskell). Similarly moving from Web programming to protocol design or from data communication to app design will do wonders. The shift to a new programming paradigm and new technical domain will put you on even keel. All your knowledge and expertise will evaporate when you move to a new domain. Moving around in technology will keep you more grounded. You will realize that there is still so much more to learn. There is yet another universe.

In other words, programming keeps you honest!

My journey of 25+ years as a programming has helped me to learn technology in all its flavors. More importantly I was able to learn about myself. I have seen it all. Sweat, tears, frustration, fear, anger, pride and ecstasy.

A few years back, once you learned the basics, if your work did not involve coding, there was not much to do. But these days you can really do some fun things. You can imagine any app you want and actually start to realize it. Who knows, your app may be the next block buster! I am certain all of us have ideas which we want to implement. Programming allows you to just that!

Programming really makes you exercise your grey cells. Who knows we will soon hear that research has proved that programming helps prevent Alzheimer’s and Parkinson’s disease.:-)

In any case, learning to program is one good thing.

Also see
1. Programming languages in layman’s language
2. The common alphabet of programming languages
3. How to program – Some essential tips
4. Programming Zen and now – Some essential tips -2

Find me on Google+

Divining Twitterverse with R

In this post I continue my journey into Twitterverse with R and capture the tweet frequency for the hashtags #NaMo, #AAP and #RaGa over the last 7 days. This seemed the most appropriate thing to do given that the 16^th Indian General Election 2014 is just around the corner. The handshake that has to be established with Twitter is the same as mentioned in my last post “To R is human …”

Here is a great blog post on measuring tweet frequencies – Getting Genetics done by Stephen Turner.

Once the initial handshake is done the following has to be done. It appears that searchTwitter can only search tweets within the last 7 days and that too for a maximum of 1500 tweets.

This is done as follows for the hashtag #NaMo. The dates variable creates 7 date strings. The for loop performs a searchTwitter everyday for the last 7 days

#Search the last 7 days for the hashtag #NaMo everyday

dates <- paste(“2014-03-“,10:17,sep=””) # need to go to 18th to catch tweets from 17th

for (i in 2:length(dates)) {

print(paste(dates[i-1], dates[i]))

tweets <- c(tweets, searchTwitter(“#Namo”, since=dates[i-1], until=dates[i], n=1500))

}

The tweets are then converted to dataframes for processing

# Create a dataframe from the tweets

tweets <- twListToDF(tweets)

tweets <- unique(tweets)

Finally the tweets are plotted using ggplot

#Plot the frequency of tweets in 2 hour windows

minutes <- 120

ggplot(data=tweets, aes(x=created)) +

geom_bar(aes(fill=..count..), binwidth=60*minutes) +

scale_x_datetime(“Date”) +

scale_y_continuous(“Frequency”) +

opts(title=”#NaMo Tweet Frequency March 11-17″, legend.position=’none’)

ggsave(file=’NaMo-frequency.png’, width=7, height=7, dpi=100)

The plot for #NaMo is shown below

The same is performed for

#AAP

And for #RaGa

While the number of tweets for #NaMo is very high, #RaGa seems to occur in lower number but consistently everyday

Of course we can check the tweets whether is sentiment is positive or negative for the hashtags. Thats for another day though.

The code can be cloned at Rtweet-frequency

Find me on Google+

Simplifying ML: Recommender Systems – Part 7

In this age of Amazon, Netflix and App stores where products, movies and apps are purchased online the method of up-selling and cross-selling online is through the use of recommender based systems.

When you go to site like Amazon/Flipkart or purchase apps on App store/Google Play we often see things like “People who bought this book/app also bought X, Y, Z”. These recommendations are the recommender system algorithms in action.

Recently, Netflix ran a competition in which users had to come with the best algorithm to recommend films that a user would also like. The prize money for this was of the order of $1 million. That’s how critical recommender systems are to organizations of today where most of the transactions happen on the web.

Typically users are asked to give a rating of 1 to 5 with 1 being the lowest and 5 being the highest. So for example if we had classics like Moby Dick, Great Expectations and current best sellers like The Client, The da Vinci Code and a Science Fiction like 2001- A Space Odyssey we can expect that different people will rate the books differently. Obviously not everybody would have read every book in the list and some elements would be blank.

Recommender Systems are based on machine learning algorithms. The goal of these algorithms is to predict what score any user would give to books they did not rate. In other words what would be rating the buyers would give for books or apps they did not buy. So if the algorithm predicts a high rating then we could recommend that the user would also ‘like’ them. Or we could give recommendations of books/apps bought by users who bought the books/apps bought by this user.

The notation is

n_u = Number of users

n_b= Number of books

r^(i,j) = Boolean whether user j rated a book i

y^(i,j) = The rating user j gave book i

m_j = The number of books that user j rated

Content based recommendation

In a typical content based recommendation algorithm we assume that we have data about some items we want to recommend rating for e.g. books/products/apps. In the example for books bought in an online bookstore we assume some features in our case ‘classic’, “fiction” etc

So each book has its own feature vector where x¹is the feature vector of the first book x² feature vector of the 2nd book and so on

This can be done through linear regression by minimizing the cost function of the sum of squared errors from the predicted value

So for a parameter vector Ɵ^jand a feature vector xⁱ the recommender system will try to predict the rating that a user j will give a book i.

This can be written as

Number of stars (rating) = (θ^j)^T xⁱ

This reduces to the minimization problem over all θ^j for r=1

min 1/2m Σ ((θj)^T xⁱ – y ^i,j)²

θ^ji:r=1

Adding the regularization term this becomes

min 1/2m Σ((θ^j)^T xⁱ – y ^i,j)² + λ/2m(Σ θ^j)²

θ^ji:r=1

The recommender algorithm in essence tries to learn parameters θ^j for a set of features of xⁱthe chosen system for e.g. books in this case.

The recommender tries to learn the parameters for all the users

min 1/2m Σ Σ((θ^j)^T xⁱ – y ^i,j)² + λ/2m(Σ Σ θ^j)²

θ¹…θⁿi:r=1

The minimization is performed by gradient descent as

Θ^j_k:= Θ^j_k – α (Σ((θ^j)^T xⁱ – y ^i,j)xⁱ + λ Θ^j_k

Recommender systems tries to learn the parameters for a set of chosen features over all users. Based on the learnt paramaters it then tries to predict the rating the user would give to books/apps that he is yet to purchase and push up those apps for which the user is likely to give a high rating based on the given set of ratings.

Recommender systems contribute substantially to the revenues of e-commerce sites like Amazon, Flipkart, Netflix etc

Note: This post, line previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng

Find me on Google+

Perils and pitfalls of Big Data

Big Data is hurtling towards us in a big way. It is already in the news and the blip seems to getting bigger. Big Data will soon become the key driver for almost any kind of decision that is to be made in manufacturing, retail, finance all the way to astronomy, oceanography etc. The common aspect of all industries and areas is that data is generated in the order of several petabytes to exabytes. Big Data is the technique to analyze such large volumes of data.

Big Data represents the technique to handle the huge deluge of data that is already becoming enmeshed in our lives. Multiple disparate, varied streams of data (text. tweets, click streams, html) flow through with tremendous volume & velocity. The key aspects of data in the world are the volume, variety and the velocity. It is never ending and never seems to stop. How do we handle this deluge? How do we make sense of this data is what Big Data is all about.

Big Data provides algorithms to find patterns, determine trends or classify data depending on the features provided. It is supposed to enable the decision makes to make key decisions based on the answers the algorithms spew forth.

Big Data is also complicated by the fact that data comes is multiple forms from click streams, tweets, html, texts, CSVs, structured and non –structured data.

The ability to detect patterns, determine trends, classify, identify outliers is no easy task

In this post I try to take a philosophical look at Big Data and ask whether it can really help us. Will it help or will take is on wild goose chase? Can we trust the results?

Big Data depends on algorithms to make sense of data. Big Data deals with data that is in the order of Petabytes to Exabytes. At this scale with multiple features our cognitive abilities are of no use. We must rely on machines and algorithms to make sense of these large amounts of data. Our mind can handle a few hundred data points and at most 3 dimensions. Beyond that the data can hardly make any sense.

Data by itself, in the absence of features & algorithms, is indistinguishable from noise. It is data science that makes sense of data. Data science separates the signal from the noise.

It is the algorithms that try to determine the best fit for a given set of data. But how reliable are the results. For example let us take the following case

An unsupervised learning algorithm for the above data points could try to separate the data into 2 sets. Clearly this is one way but what is more appropriate is that we have 2 shapes, the circle & the rectangle. A machine algorithm would try to work based on the features that we choose. Are we in a position to decide whether the answer the algorithm gives us is correct? We have no way of knowing because the amount of data is beyond our cognitive capabilities,

In other words, Big Data is full of perils and pitfalls.

When we let the machine to analyze on our behalf the possibility of coming to a wrong conclusion is fairly high. This coupled with the fact that we are sometimes led to erroneous judgments, as discussed below, the problem is further compounded.

In his book “Thinking fast, thinking slow” Daniel Kahneman discusses several situations where our mind falls into the traps of lazy thinking. We come to wrong conclusions. Also our minds tend to detect patterns in data where there are none. Sometimes according to Kahneman ‘randomness appears as regularity or a tendency to cluster’. Also he says ‘the tendency to see patterns in randomness is overwhelming’. We could argue that in Big Data it is the algorithm that is determining the pattern we could be tricked into coming to false conclusions. Sometimes the human mind sees causality where there is none. Occasionally we fail to see the obvious.

In the ‘famous gorilla experiment’ the researchers tried to assess selective attention. The participants are asked to count the number of passes those in white t-shirts make. Surprisingly a large number of the participants were complete oblivious a gorilla that appears midway in the video. When we, as human fail to see such large objects, can we expect the machine to accurately identify patterns and perform accurate classifications?

There are techniques that help in determining false positives for e.g. the Bonferroni correction. Simply put the Bonferroni correction tries to determine the possibility of getting at least 1 significant result when one is testing 20 hypothesis simultaneously. If we want to test 20 hypotheses with the significance of 0.05 then the probability of at least 1 significant result is

P(at least one significant result) = 1 – P(no significant results)

= 1 – (1 – 0:05)^20

= 0.64

So, with 20 tests being considered, we have a 64% chance of observing at least one significant result, even if all of the tests are actually not significant. This would be a false positive.

Given that our ability to come to significant conclusions depends largely on being able to choose appropriate features, we must also be able to maneuver between false negatives and false positives. In addition we must also take into account the fallibility of the human mind.

Clearly, Big Data is the future! However with Big Data we are really on treacherous, slippery ground!

Find me on Google+

Simplifying Machine Learning – K- Means clusters – Part 6

Our brain is an extraordinary apparatus. It is amazing how we humans can instantaneously perceive shapes, objects, forms. For e.g. when see a scene with many objects we are immediately able to identify the different objects in the scene. View this against the backdrop of a recent Google’s artificial brain experiment of a neural network with 16000 processors and a billion connections. This artificial brain was fed with 10 million thumbnails of you tube videos before it was able to recognize cat videos.

That’s an awful lot of work to recognize cat videos!

We can see that a lot of work involved getting a computer to do something as simple thing as this.

Consider how a baby learns to recognize objects for e.g. cat, dog, toy etc. The human brain does not try to measure the number of eyes, spacing between the eyes, the mouth shape of face etc. The brain immediately is able to distinguish the different animals. How does it do it? Amazing right?

In any case here is a machine learning algorithm that is capable of identifying structure in data. This is also known as K-Means and is a form of unsupervised learning algorithm.

The K-Means algorithm takes as input an unlabeled data set and identifies groups in the set. It tries to determine structure in the data set.

Take a look at the picture below

It is readily obvious that there are 2 clusters in the above diagram. However to the computer this is just a random set of points.

How does the K-Means cluster identify the clusters in the above diagram?

The algorithm is fairly simply and intuitive.

1) Let us start by choosing 2 random points which we call as ‘cluster centroids

2) We then associate each centroid with the points in the dataset that are closest to it.

3) We then compute the average of each group of associated points in the centroid and move the centroid to that average.

4) We then repeat steps 2 – 4 until there is no significant change in the centroid

This is shown below

The above algorithm can be implemented iteratively as follows

For training set (x¹, x², x³ …)

Randomly initialize K cluster centroids μ₁, μ₂, μ₃ … μ_K

Repeat {

for 1 to m

c(i) = The cluster index from 1 to K that is closest to xⁱ => (A)

end

for k = 1 to K

u(k) = average of all points assigned to K => (B)

end

}

In step (A) the points xⁱ closest to the centroid k is added to the centroid’s set. Hence if points 1,3,4,8 are in centroid 1 then

x¹ = 1, x³ = 1, x⁴ = 1, x⁸ =1

In step (B) the mean of the points 1, 3, 4, 8 is taken

So the centroid

c_1x = ¼ { x₁ + x₃ + x₄ + x₈} and c_1y = ¼ { y₁ + y₃ + y₄ + y₈}

This becomes the new c₁

However there can be occasions where the K-means cluster would get stuck in local optima. To choose optimum cluster centroid we have to determine the least cost. This can be done with the optimization objective.

The optimization objective of K-Means is as follows

K-Mean cluster determination is the problem of minimizing the distance of each point from its centroid. This is also known as the K-Means cost function or distortion function.

J(c¹,c²…c^m,, μ₁,… μ₂) = 1/m Σ|| xⁱ – μ_c(i) ||²

I like to visualize the algorithm as follows.

In step 1 we can visualize that there is a force of attraction between the datapoints and the cluster centroid based on proximity of the centroid.

In step 2 we can visualize that each datapoint attracts the centroid towards it. The centroid moves to the point where the attraction among all the datapoints balances out. This is average mean squared difference.

As can be seen the objective is to determine the average of the mean squared error of each data point to its closest centroid.

Given a set of data points how we choose the random centroids? One way is to initially pick some random data points themselves as the cluster centroid. The algorithm is then iterated to identify the real cluster centroids.

As mentioned before the algorithm can sometimes get stuck in local optima. One option is to choose another random set of data points and continue to iterate. We need to run this several times to determine the best clustering

There is also the problem of determining the number of cluster centroids. How we to determine how many clusters are would be there in a random data set? Visually we can easily identify the number of clusters. But a machine cannot.

One technique that can be used to determine the number of clusters is as follows. Start with 2, 3… 10 clusters and plot the cost function. Then pick the one with the least cost.

Note: This post, line previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng

Find me on Google+

Unraveling the mysteries of life

This article was published in Gigaom, Nov 23, “Unraveling the mysteries of life”

SUMMARY:

The future of technology will bring big changes, including advances in AI, brain-to-brain interfaces, and the ability to halt death

Time, space and matter were created 13.7 billion years ago, when the Big Bang occurred. This pale, blue planet, so termed by Carl Sagan, our earth, came into existence about 4.5 billion years ago. Life originated on earth about 3.8 billion years ago. Our species, the home sapiens, came much later at about 0.2 million years while recorded history is merely 6000 years old.

However in the last 60 years or so, man has started to unravel many secrets of his own existence. There have been extremely rapid advances in science and mankind is now grappling with very profound aspects of life from intelligence, perception, aging all the way to death itself…. more

Find me on Google+

Dissecting the Cloud – Part 1

“The Cloud brings it with it the promise of utility-style computing and the ability to pay according to usage.

Cloud Computing provides elasticity or the ability to grow and shrink based on traffic patterns.

Cloud Computing does away with CAPEX and the need to buy infrastructure upfront and replaces it with OPEX model and so on”.

All this old news and has been repeated many times. But what exactly constitutes cloud computing? What brings about the above features? What are its building blocks of the cloud that enable one to realize the above?

This post tries to look deeper into the innards of the Cloud to determine what the cloud really is.

Before we get to this I would like to dwell on an analogy to understand the Cloud better.

Let us assume, Mr. A owns a large building of about 15,000 sq feet and about 100 feet tall. Let us assume that Mr. A wants to rent this building.

Now, assume that the door of this building opens to single, large room on the inside!

Mr. X comes to rent this building. If this was the case then poor Mr. X would have to pay through his nose, presumably, for the entire building even though his requirement would have been for a small room of about 600 x 600 feet. Imagine the waste of space. Moreover this would also have resulted in an enormous waste of electricity. Imagine the lighting needed. Also an inordinate amount of water would have to be utilized if this single, large room needed to be cleaned. The cost for all of this would have to be borne by Mr. X.

This is clearly not a pleasant state of affairs for either Mr. X or for the owner Mr. A of the building.

The solution to this is easy. What Mr. A needs to do, is to partition the building into self-contained rooms (600 x 600 sq feet) with all the amenities. Each self-contained unit would need to have its own electricity and water meter.

Now Mr. A can rent rooms to different tenants on their need basis. This is a win-win situation both for Mr, A and Mr. X. The tenants only need to pay for the rooms they occupy and the electricity and water they consume.

This is exactly the principle behind cloud computing and is known as ‘virtualization’

There are 3 computing components that one must consider. CPU, Network and Storage. The below picture shows the virtualization of CPU,RAM, NIC (network card), Disk (storage)

The Cloud is essentially made up of anywhere between 100 servers to 100,000 servers. The servers are akin to the large building. Running a single OS and application(s) on the entire server is a waste of computing, storage and network resources.

Virtualization abstracts the hardware, storage and network through the use of software known as the ‘hypervisor’. On top of the hypervisor several ‘guest OSes’ can run. Applications can then run on these guest OSes.

Hence over the CPU (single, dual or multi-core) of the server, multiple guest OS’es can run each with its own set of applications

This is similar to partitioning the large CPU resource of the server into smaller units.

There are 3 main Virtualization technologies namely VMware, Citrix and MS Hyper-V

Here is a diagram showing the 3 main the virtualization technologies