The thing about the Internet of Things

Published in Smart World Jan-Feb 2014, The thing about the Internet of Things

Introduction: It is now common knowledge that the world is becoming more connected, instrumented and data driven. In a world of 7 billion people we have almost 10 billion devices connected to the internet. A recent report from Cisco suggests that the number of connected devices will almost touch 50 billion by the year 2020.

This huge increase in the number of connected devices will come largely from new technology trends such as the Internet of Things (IoT) and Smart Grids.

What exactly is the Internet of Things?

The first formal definition of the Internet of Things came in 2005, when ITU-T, the telecom wing of the United Nations, came out with a report titled “The Internet of Things”. In this report ITU-T added a fourth dimension of ‘anything’ to the existing anyone, anywhere, anytime network. The report visualized a world where millions and millions of devices, whether passive tags, intelligent devices or sensors, collect data from the environment and send it through the network to a backend processing system.

In Mark Weiser’s classic words, “the most profound technologies are those that disappear and weave themselves into the fabric of everyday life until they are indistinguishable from it”. Embedded intelligence in the things themselves will further enhance the power of the network. IoT is just this vision of Mark Weiser.

This fourth dimension of ‘things’, or intelligent sensors, gives the ability to gather data from the environment, which is then sent back through the wireless network to the internet for backend processing. Analysis of the gathered data helps in forecasting events ahead of time.

The Internet of Things is also known as M2M or machine–to–machine computing, pervasive computing or ubiquitous computing.

The Maha Kumbh Mela experiment: Last year, 2013, coincided with the 12 year cycle of the Maha Kumbh Mela festival. More than 100 million people would have passed through the city of Allahabad for a holy dip at the Sangam, the confluence of the Ganges and the Yamuna. Almost 95% of this human mass would have carried mobile phones equipped with location sensors. Researchers from Harvard University, with the help of mobile telecom operators, ran an experiment to track the movement of people through the city of Allahabad and understand their behavior. It was hoped that the study of this large amount of data, gathered as people moved through the city, would help in identifying signatures of disaster and how they can be avoided.

This is possible because mobile phones have the ability to send their location data back to the net for processing. This is an example of the Internet of Things.

Some applications of the Internet of Things are outlined below.

RFID or Radio Frequency Identification: RFID was one of the early enablers of this technology. An RFID tag is a passive device that responds with its identity when it is in the presence of an RFID receiver. The receiver transmits a signal and the tag responds with its unique tag id. RFID technology has been used extensively by large retail stores like Walmart in the US and Tesco in the UK. These stores RFID tag all their products in the central warehouse. In the presence of an RFID receiver the tags of all the products are read, so the warehouse has a complete list of its inventory. As the products move from the central warehouse to the regional warehouse and finally to the retail store, they are tracked, so the retailer knows exactly how many of each product are present in all its warehouses and stores. As customers buy products and check them out at the counter, the count of the products in the store is also updated. So at any point in time each store knows the count of each of its products. Stores like Walmart can therefore forecast whether there is going to be a shortage of any of their products and move stock to the concerned store. In fact we can imagine a scenario where each shopping cart is equipped with an RFID receiver. As we keep putting products into our cart, the cart can add up each of the items we have taken so that the bill is ready when we reach the counter. We need not scan the products at the checkout counter.

Highway Tolls: An interesting application of IoT is the payment of highway tolls without the vehicle needing to stop. The toll is deducted from an RFID-tagged device carried by the driver. There are also applications in which the tires of cars are embedded with sensors to detect the wear & tear of the tires. Insurance companies can use the driving data from such sensors to give discounts to safe drivers.

Car-to-car networks: Another certainty in the evolution of IoT is car-to-car networks. Vehicular communication, along with Intelligent Transport Systems (ITS), achieves safety by enabling communication between vehicles, people and roads. Vehicle-to-vehicle communications are the fundamental building block of autonomous, self-driving cars. They enable the exchange of data between vehicles and allow automobiles to “see” and adapt to driving obstacles more completely, preventing accidents besides resulting in more efficient driving.

Intelligent homes: Rapid advances in technology will be closer to home, both literally and figuratively. The future home will have the ability to detect the presence of people, pets and smoke, and changes in humidity, moisture, lighting and temperature. Smart devices will monitor the environment and take appropriate steps to save energy, improve safety and enhance the security of homes. Devices will start learning your habits and enhance your comfort and convenience. Everything from thermostats, fire detectors and washing machines to refrigerators will be equipped with electronics capable of adapting to the environment. ‘Nest’ is a smart thermostat that made headlines recently. The thermostat learns your requirements and adjusts the temperature accordingly. Other gadgets in Intelligent Homes are smart locks, smart lighting etc. All gadgets in the Smart Home will be accessible through laptops, tablets or smartphones, so we will be able to monitor all aspects of our intelligent home from anywhere.

Intelligent offices: Smart devices will also make major inroads into offices, leading to the birth of intelligent offices where the lighting, heating and cooling will be based on the presence of people in the offices. This will result in enormous savings in energy. The advances in intelligent homes and intelligent offices will be in the greater context of the Smart Grid.

eHealth: IoT is being used by some hospitals for monitoring heart patients. Here a device is implanted into the patient. The device regularly sends data to a doctor who can monitor the patient’s pulse rate, heart rate, blood pressure etc. It can warn the physician when it detects an irregularity in the patient’s heart rhythm; the physician can then call the patient and advise on the appropriate medication to take, avoiding a real cardiac arrest.

Smart Cities: How often do we sit fretting and fuming in a traffic jam, contributing to air pollution? Smart Cities are equipped with multiple devices that identify and measure traffic speed and volume on city roads. At the back end the systems analyze this continuous stream of real-time data and provide alternative routes using predictive analytics on real-time and historical data. Studies have also shown that it is possible to control traffic by offering discounts to drivers on less crowded roads.

Smart Grid: The grid or the legacy electrical network has three components, namely energy generation, energy transmission and energy distribution. The conventional electrical grid, which is prevalent in most countries throughout the world, has extremely high transmission losses besides having other issues. Typically an outage in one part of the network causes a cascading effect throughout the network. Remember the infamous blackout in the US in 2003, which was the largest blackout in US history. Closer to home, in India, we had the world’s biggest blackout on Jul 31, 2012, which left 600 million people powerless for close to 2 days. This is because of the domino effect, where an issue in one part causes a cascading effect.

With the advent of Smart Grid the legacy electrical grid will have millions of electrical sensors which monitor the flow of energy. If there is a fault in any part of the network the sensors ensure that the failure is isolated so that outage does not spread to other parts.

Besides, instead of the regular electrical meters, Smart Grids include the concept of the Smart Home equipped with smart meters. These smart meters have two way communication. The price of energy which we get from the grid varies like a stock price. With smart meters and smart appliances, the appliances turn on when the price of drawing energy is low.

Wearable Technologies: The latest entrants to IoT are wearable technologies like smart watches, Google Glass and health bands. These devices constantly monitor, measure and send data to the backend for processing. For e.g. Google Glass can immediately recognize prominent landmarks and display information about them. Similarly health bands like Fitbit, Nike FuelBand etc. can now measure steps and heart rate and provide feedback.

Challenges: There are still many challenges on the way to a future filled with M2M. There is still no universally accepted protocol. There are many competing protocols like WiFi, Zigbee, MQTT, XMPP etc. and there is yet to be a single common standard between devices and the networks for the Internet of Things.

In any case, the Internet of Things or M2M is a happening technology that will soon come into our neighborhood, and we should all be pretty swamped by this tidal wave in our future.

Find me on Google+

Simplifying ML: Impact of degree of polynomial on bias & variance and other insights

This post takes off from my earlier post Simplifying Machine Learning: Bias, variance, regularization and odd facts- Part 4. As discussed earlier a poor hypothesis function could either underfit or overfit the data. If the number of features selected were small of the order of 1 or 2 features, then we could plot the data and try to determine how the hypothesis function fits the data. We could also see whether the function is capable of predicting output target values for new data.

However if the number of features is large, for e.g. of the order of 10’s of features, then there needs to be a method by which one can determine if the learned hypothesis is a ‘just right’ fit for all the data.

The following technique can be used to determine the ‘goodness’ of a hypothesis or how well the hypothesis can fit the data and can also generalize to new examples not in the training set.

Several insights on how to evaluate a hypothesis are given below

Consider a hypothesis function

h_Ɵ (x) = Ɵ₀ + Ɵ₁x₁ + Ɵ₂x₂² + Ɵ₃x₃³ + Ɵ₄x₄⁴

A high degree hypothesis like the one above may fit the training data closely and yet not generalize well enough to new examples that are not in the data set.

Let us assume that there are 100 training examples. Instead of using the entire set of 100 examples to learn the hypothesis function, the data set is divided into a training set and a test set in a 70%:30% ratio respectively.

The hypothesis is learned from the training set. The learned hypothesis is then checked against the 30% test set data to determine whether the hypothesis is able to generalize on the test set also.

This is done by determining the error when the hypothesis is used against the test set.

For linear regression the error is computed as the mean squared error of the predicted value against the actual value over the test set:

$$J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2$$
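The 70:30 split and the test set error for linear regression can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the synthetic data (y = 2x + 1) and the two candidate hypotheses are hypothetical and are not taken from the course.

```python
import random

def split_train_test(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def squared_error_cost(hypothesis, examples):
    """J = 1/(2m) * sum((h(x) - y)^2) over the given examples."""
    m = len(examples)
    return sum((hypothesis(x) - y) ** 2 for x, y in examples) / (2 * m)

# Hypothetical data set of 100 examples lying exactly on y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(100)]
train, test = split_train_test(data)

def good_hypothesis(x):
    return 2 * x + 1      # matches the data exactly

def poor_hypothesis(x):
    return 2 * x          # off by 1 on every example

good_error = squared_error_cost(good_hypothesis, test)
poor_error = squared_error_cost(poor_hypothesis, test)
```

The hypothesis that is off by 1 everywhere gets a test error of 1/2, while the exact one gets 0, which is what "as low as possible" means in practice.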

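For logistic regression the test set error is similarly determined as

$$J_{test}(\theta) = -\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left[ y_{test}^{(i)} \log h_\theta(x_{test}^{(i)}) + (1 - y_{test}^{(i)}) \log\left(1 - h_\theta(x_{test}^{(i)})\right) \right]$$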

The idea is that the test set error should be as low as possible.

Model selection

A typical problem in determining the hypothesis is to choose the degree of the polynomial or to choose an appropriate model for the hypothesis

The method that can be followed is to choose 10 polynomial models

h_Ɵ (x) = Ɵ₀ + Ɵ₁x₁ (d = 1)
h_Ɵ (x) = Ɵ₀ + Ɵ₁x₁ + Ɵ₂x₂² (d = 2)
h_Ɵ (x) = Ɵ₀ + Ɵ₁x₁ + Ɵ₂x₂² + Ɵ₃x₃³ (d = 3)
… (up to d = 10)

Here ‘d’ is the degree of the polynomial. One method is to train all the 10 models, run each model’s hypothesis against the test set and then choose the model with the smallest error cost.

While this appears to be a good technique to choose the best fit hypothesis, in reality it is not so. The reason is that the degree ‘d’ has itself been chosen to give the least error on the test data, so the test error becomes an optimistic estimate. The chosen hypothesis may not generalize as well to examples that are in neither the training nor the test set.

So the correct method is to divide the data into 3 sets as 60:20:20, where 60% is the training set, 20% is the cross-validation set used to determine the best fit and the remaining 20% is the test set used to estimate the generalization error.

The steps carried out against the data are

Train all 10 models against the training set (60%)
Compute the cost value J against the cross-validation set (20%)
Determine the lowest cost model
Use this model against the test set and determine the generalization error.
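The four steps above can be sketched with NumPy as follows. This is a rough illustration under my own assumptions: the data is a synthetic noisy quadratic, a single feature x is raised to the powers 1..d (rather than separate features x₁, x₂, …), and `np.polyfit` stands in for whatever training procedure is actually used.

```python
import numpy as np

def cost(coeffs, x, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    residual = np.polyval(coeffs, x) - y
    return float(residual @ residual) / (2 * len(x))

# Hypothetical data: a noisy quadratic, 200 examples.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, 200)

# 60:20:20 split into training, cross-validation and test sets.
idx = rng.permutation(200)
train, cv, test = idx[:120], idx[120:160], idx[160:]

# 1. Train all 10 models against the training set.
degrees = range(1, 11)
models = {d: np.polyfit(x[train], y[train], d) for d in degrees}

# 2. Compute the cost J against the cross-validation set.
cv_cost = {d: cost(models[d], x[cv], y[cv]) for d in degrees}

# 3. Determine the lowest cost model.
best_d = min(cv_cost, key=cv_cost.get)

# 4. Use this model against the test set for the generalization error.
generalization_error = cost(models[best_d], x[test], y[test])
```

The test set is touched only once, after the degree has been fixed, so the final number is not biased by the selection step.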

Degree of the polynomial versus bias and variance

How does the degree of the polynomial affect the bias and variance of a hypothesis?

Clearly for a given training set when the degree is low the hypothesis will underfit the data and there will be a high bias error. However when the degree of the polynomial is high then the fit will get better and better on the training set (Note: This does not imply a good generalization)

We run all the models with different polynomial degrees on the cross validation set. What we will observe is that when the degree of the polynomial is low then the error will be high. This error will decrease as the degree of the polynomial increases as we will tend to get a better fit. However the error will again increase as higher degree polynomials that overfit the training set will be a poor fit for the cross validation set.

This is shown below

Effect of regularization on bias & variance

Here is the technique to choose the optimum value for the regularization parameter λ

When λ is small the Ɵ_i values are large and we tend to overfit the data set. Hence the training error will be low but the cross validation error will be high. However when λ is large the values of Ɵ_i become negligible, leaving an almost flat hypothesis. This will underfit the data and result in a high training error and a high cross validation error. Hence the chosen value of λ should be such that the cross validation error is the lowest.
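A possible way to pick λ along these lines is sketched below. The assumptions are mine: a degree 8 polynomial on synthetic data, a closed-form regularized least squares fit in which Ɵ₀ is not penalized, and a hand-picked list of candidate λ values.

```python
import numpy as np

def design(x, degree):
    """Feature matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize 1/(2m) * [sum((X @ theta - y)^2) + lam * sum(theta[1:]^2)]."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0            # theta_0 (bias) is not regularized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def cost(X, y, theta):
    """Unregularized squared-error cost, used to compare models."""
    r = X @ theta - y
    return float(r @ r) / (2 * len(y))

# Hypothetical data: noisy sine curve, split 60:20:20.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(3 * x) + rng.normal(0.0, 0.2, 60)
X = design(x, 8)
X_train, y_train = X[:36], y[:36]
X_cv, y_cv = X[36:48], y[36:48]
X_test, y_test = X[48:], y[48:]

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
thetas = {lam: fit_regularized(X_train, y_train, lam) for lam in lambdas}
train_cost = {lam: cost(X_train, y_train, thetas[lam]) for lam in lambdas}
cv_cost = {lam: cost(X_cv, y_cv, thetas[lam]) for lam in lambdas}

best_lam = min(cv_cost, key=cv_cost.get)     # lowest cross validation error
test_cost = cost(X_test, y_test, thetas[best_lam])
```

Note that the errors being compared are computed without the penalty term; λ is used only while fitting.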

Plotting learning curves

This is another technique to identify if the learned hypothesis has a high bias or a high variance based on the number of training examples

A high bias indicates an underfit. When the number of samples in the training set is low the training error will be low, as it is easy to fit a hypothesis to a few training examples, but the cross validation error will be high. As the number of samples increases the error will increase for the training set and will slightly decrease for the cross validation set. However for a high bias, or underfit, after a certain point increasing the number of samples will not change the error: both errors level off at a high value. This is the case of a high bias.

In the case of high variance, where a high degree polynomial is used for the hypothesis, the training error will be low for a small number of training examples. As the number of training examples increases the training error will increase slowly. The cross validation error will be high for a small number of training samples but will slowly decrease as the number of samples grows, as the hypothesis will learn better. Hence for the case of high variance, increasing the number of samples in the training set will decrease the gap between the cross validation and the training error as shown below
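The numbers behind such learning curves can be generated with a short NumPy sketch. Everything here is an illustrative assumption: the data is a synthetic noisy sine curve, a degree 1 polynomial plays the high bias model and a degree 8 polynomial the high variance model.

```python
import numpy as np

def cost(coeffs, x, y):
    """Squared-error cost J = 1/(2m) * sum((h(x) - y)^2)."""
    r = np.polyval(coeffs, x) - y
    return float(r @ r) / (2 * len(x))

def learning_curve(degree, x_train, y_train, x_cv, y_cv, sizes):
    """Return (m, training cost, cross validation cost) for each size m."""
    curve = []
    for m in sizes:
        coeffs = np.polyfit(x_train[:m], y_train[:m], degree)
        curve.append((m,
                      cost(coeffs, x_train[:m], y_train[:m]),
                      cost(coeffs, x_cv, y_cv)))
    return curve

# Hypothetical data: 160 training and 100 cross validation examples.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 260)
y = np.sin(3 * x) + rng.normal(0.0, 0.2, 260)
x_train, y_train, x_cv, y_cv = x[:160], y[:160], x[160:], y[160:]

sizes = [10, 20, 40, 80, 160]
high_bias = learning_curve(1, x_train, y_train, x_cv, y_cv, sizes)
high_variance = learning_curve(8, x_train, y_train, x_cv, y_cv, sizes)
```

Plotting the second and third entries of each tuple against m gives the two curves for each model; for the straight line both errors should settle at a level well above that of the degree 8 fit.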

Note: This post, in line with my previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng

Simplifying Machine Learning: Bias, Variance, regularization and odd facts – Part 4

In both linear and logistic regression the choice of the degree of the polynomial for the hypothesis function is extremely critical. A low degree for the polynomial can result in an underfit, while a very high degree can overfit the data as shown below

In the figure on the left the data is underfit, as we try to fit the data with a first order polynomial, which is a straight line. This is a case of strong ‘bias’.

In the rightmost figure a much higher order polynomial is used. All the data points are covered by the polynomial curve, however it is not effective in predicting other values. This is a case of overfitting or high variance.

The middle figure is just right as it intuitively fits the data points the best possible way.

A similar problem exists with logistic regression as shown below

There are 2 ways to handle overfitting

a) Reducing the number of features selected

b) Using regularization

In regularization the magnitude of the parameters Ɵ is decreased to reduce the effect of overfitting

Hence if we choose a hypothesis function

h_Ɵ(x) = Ɵ₀ + Ɵ₁x₁ + Ɵ₂x₂² + Ɵ₃x₃³ + Ɵ₄x₄⁴

The cost function for this without regularization, as mentioned in earlier posts, is

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where the key is to minimize the above function for the least error.

The cost function with regularization becomes

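$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

As can be seen, regularization adds a penalty term λ Σ Ɵ_j² to the cost function, so minimizing the cost now also keeps the parameters Ɵ_j small.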
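A small sketch of the regularized cost for linear regression is given below. It assumes the common convention in which the penalty shares the 1/2m factor and Ɵ₀ is not penalized; the function name and the toy data are hypothetical.

```python
def regularized_cost(theta, X, y, lam):
    """J = 1/(2m) * [sum((h(x) - y)^2) + lam * sum(theta_j^2 for j >= 1)].

    Each row of X starts with 1.0 for the bias term theta_0.
    """
    m = len(y)
    predictions = [sum(t * xi for t, xi in zip(theta, row)) for row in X]
    squared_error = sum((p - actual) ** 2 for p, actual in zip(predictions, y))
    penalty = lam * sum(t ** 2 for t in theta[1:])   # theta_0 is skipped
    return (squared_error + penalty) / (2 * m)

# Toy data lying exactly on y = 2x, so theta = [0, 2] has zero squared error.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = [0.0, 2.0]

cost_without_penalty = regularized_cost(theta, X, y, lam=0.0)   # 0.0
cost_with_penalty = regularized_cost(theta, X, y, lam=3.0)      # 3 * 2^2 / (2 * 3) = 2.0
```

Even a perfect fit now carries a cost when the parameters are large, which is what pushes the minimization towards smaller Ɵ_j.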

Hence with the regularization factor the problem of overfitting can be controlled.

However the trick is to determine the value of λ. If λ is too big it will result in underfitting, or a high bias, while if λ is too small the overfitting remains.

Similarly the regularized equation for logistic regression is as shown below

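$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$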
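The regularized cost for logistic regression can be sketched in the same way. Again this is only an illustration: it assumes Ɵ₀ is left out of the penalty, and the function names and the two-example data set are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_logistic_cost(theta, X, y, lam):
    """Average log loss plus lam/(2m) * sum(theta_j^2 for j >= 1).

    Each row of X starts with 1.0 for the bias term theta_0.
    """
    m = len(y)
    total = 0.0
    for row, label in zip(X, y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, row)))
        total += -label * math.log(h) - (1 - label) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t ** 2 for t in theta[1:])
    return total / m + penalty

X = [[1.0, 1.0], [1.0, -1.0]]
y = [1, 0]

# With all parameters at zero the hypothesis is 0.5 everywhere: cost = log(2).
cost_at_zero = regularized_logistic_cost([0.0, 0.0], X, y, lam=1.0)

# With theta = [0, 2] both examples are classified correctly, but the
# penalty lam/(2m) * 2^2 = 1.0 is added on top of the log loss.
cost_with_penalty = regularized_logistic_cost([0.0, 2.0], X, y, lam=1.0)
```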

Some tips suggested by Prof Andrew Ng while determining the parameters and features for regression

a) Get as many training examples as possible. It is worth spending more effort in getting more examples

b) Add additional features

c) Observe changes to the learning algorithm with different values of λ

This post is continued in my next post – Simplifying ML: Impact of degree of polynomial on bias, variance and other insights

Note: This post, in line with my previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng

A method to crowd source pothole marking on (Indian) roads

In India, roads and potholes are 2 sides of the same coin! You cannot think of one to the exclusion of the other. This post of mine looks at a novel technique of rapidly identifying & marking potholes in (Indian) roads. This approach can be used for any city in the world but is very pertinent to Indian roads.

This idea of mine provides a technique for quickly marking potholes in roads through crowd sourcing.

Introduction: It is a well known fact that Indian roads are riddled with potholes. Some may even say that there are potholes with patches of road in between them. This disclosure looks at a novel technique of rapidly identifying & marking potholes in (Indian) roads. The approach can be used for any city in the world. However this disclosure will focus on Indian roads. This disclosure proposes a novel technique of crowd-sourcing the marking of potholes on roads rather than having any single government body (NHAI etc) travel on roads to make the markings.

Description: This post proposes a novel crowd-sourced method for pothole marking that will be easy to conduct and extremely rapid. The crowd-sourced pothole marking application will be made of the following components: Pothole marking app, Backend server, Map matching utility and Pothole ranking utility.

Pothole marking App: A location based smartphone app will need to be created preferably both on Android and iOS. The app will display the map with buttons to mark the following

a) Points in map of potholes

b) Bad segments of roads with potholes

Backend Server: The backend server will collect all the data (marked potholes and bad segments of roads) and will update a database. A map-matching utility will map the latitude and longitude of the marked point on to a map. When the geographical location of a pothole is received (latitude, longitude) the backend server will also store the time stamp.

Pothole ranker: This module will run on a periodical basis, say once every 3 minutes. This module will determine all the potholes that have been entered in the last 3 minutes and add to the accumulated count of marked potholes. Each marked pothole will hold the count of the marks and also the time stamp of the mark. It will also rank the criticality of the pothole based on the accumulated count of potholes over the period.

The pothole ranker will maintain the following metrics

Pothole criticality = Total accumulated count/ Total time
Pothole impact measure = Max rate of pothole marks (Pothole marks/hr)
Bad stretches of roads with many potholes = Number of adjacent potholes/ Distance in meters
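The three metrics can be sketched in Python as below. This is only one possible reading of them: the function names are hypothetical, time stamps are assumed to be in seconds since the start of the observation period, and the maximum rate is approximated by counting marks in fixed one-hour buckets.

```python
from collections import Counter

def criticality(mark_times, period_hours):
    """Pothole criticality = total accumulated count / total time (marks per hour)."""
    return len(mark_times) / period_hours

def impact_measure(mark_times):
    """Pothole impact measure = max number of marks seen in any one-hour bucket."""
    per_hour = Counter(int(t // 3600) for t in mark_times)
    return max(per_hour.values()) if per_hour else 0

def stretch_density(adjacent_potholes, distance_m):
    """Bad stretch measure = number of adjacent potholes / distance in meters."""
    return adjacent_potholes / distance_m

# Hypothetical time stamps (in seconds) of the marks received for one pothole.
marks = [0, 600, 1200, 4000, 8000, 8100]

pothole_criticality = criticality(marks, period_hours=4)   # 6 marks / 4 hours
pothole_impact = impact_measure(marks)                     # 3 marks in the first hour
segment_density = stretch_density(adjacent_potholes=4, distance_m=200)
```

A ranker that runs every few minutes would only need to append the new time stamps for each pothole and recompute these values.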

Description: This is how the scheme will work in practice. The app will be uploaded to Google Play and Apple’s App Store. All users who would like to participate in the pothole marking exercise can download and install the app on their smart phones. These users, when they are traveling on a road, can mark potholes as they encounter them. It is assumed that the users are passengers in vehicles or pillion riders. The fact that users all over the city can simultaneously mark potholes as they encounter them will make the gathering of pothole data rapid and extremely accurate. A map of the city would need to be generated with circles/points for the locations of potholes, color-coded appropriately. We could use the color red for higher ranked potholes and yellow for lower ranked potholes, with intermediate colors like purple, pink etc. This data can then be used by Government bodies in fixing the roads.

There are three advantages of crowd sourcing the pothole marking

1) The process of gathering data is rapid

2) Roads where the traffic is heaviest will have potholes with a higher rank and can be addressed first

3) The process will be very accurate

Crowd sourcing of pothole marking will have the following benefits

1) The marking of potholes will be extremely rapid

2) The potholes will be ranked based on accumulated count

3) Ranking of potholes can be done on

– Total accumulated count/Total time

– Rate of pothole mark

– Critical segments with major potholes

4) It will be easy to segregate

– Critical potholes
– Max impactful potholes
– Bad road segment

5) The process will be very accurate

Conclusion: The process of crowd sourcing pothole marking of Indian roads will be extremely efficient in marking potholes and bringing them to the attention of the Government.

A map of a city with the circles for locations of potholes, color-coded appropriately, to indicate higher marked potholes versus the lower ranked potholes could be generated. This map can be used to bring to the attention of the government the really bad roads and terrible road segments. Rather than having a couple of vehicles trying to ply roads and mark roads this will be very fast and extremely accurate.

Afterword: The concept of crowd sourcing for traffic is not new. Waze, which Google bought for close to $2bn, does just that. It crowd sources traffic conditions and alerts users of the app. Also I did a Google search on using mobile apps for pothole marking and, not surprisingly, there were others who had also thought of a similar idea in Boston & Florida (see the links below).

However, I personally think that the situation in India is different, where there are ‘roads in between potholes’ ;-). While in the above 2 cases in US, only the location of the potholes is important, my idea ranks potholes based on the accumulated count and the rate of pothole marks. These metrics can be used by the government in addressing those sections of roads where the potholes have a higher rank i.e. where the traffic is highest.

Your thoughts are welcome.

Unraveling the mysteries of life

This article was published in Gigaom, Nov 23, “Unraveling the mysteries of life”

SUMMARY:

The future of technology will bring big changes, including advances in AI, brain-to-brain interfaces, and the ability to halt death

Time, space and matter were created 13.7 billion years ago, when the Big Bang occurred. This pale, blue planet, so termed by Carl Sagan, our earth, came into existence about 4.5 billion years ago. Life originated on earth about 3.8 billion years ago. Our species, Homo sapiens, came much later, about 0.2 million years ago, while recorded history is merely 6000 years old.

However in the last 60 years or so, man has started to unravel many secrets of his own existence. There have been extremely rapid advances in science and mankind is now grappling with very profound aspects of life from intelligence, perception, aging all the way to death itself…

Simplifying ML: Neural networks- Part 3

Neural networks try to overcome the shortcomings of logistic regression, in which we have to choose a non-linear hypothesis ourselves. Logistic regression requires that we choose an appropriate combination of polynomial terms and the order of the equation. The problem with this is that we sometimes tend to either overfit or underfit. Neural networks have the ability to learn new model parameters from the basic raw parameters.

The neural network is modeled on the neural networking ability of the human brain. The brain is made of billions of neurons. Each neuron is a processing unit which has several inputs, the dendrites, and an output, the axon. The neurons communicate through a combination of electrochemical signals at the synapses, the spaces between neurons.

A neural network mimics the working of the neuron.

So in a neural network the features of the problem serve as input. For e.g in the case of being able to determine if a mail is spam or not the features could be the words in the subject line, the from address, the contents etc. Based on a combination of these features we need to classify whether the mail is spam or not.

The above diagram shows a simple neural network with features x₁, x₂, x₃ and a bias unit x₀

with a hypothesis function

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

The edges from the features x_i are the model parameters Ɵ. In other words the edges represent weights.

A typical neural network is a network of many logistic units organized in layers. The output of each layer forms the input to the next subsequent layer. This is shown below

As can be seen, in a multi-layer neural network we have the features x₁, x₂, … x_n at the left.

These features feed the activation units of the next layer. The key advantage of neural networks over regular logistic regression is that the outputs (activations) computed at one layer become the input to the subsequent layers, which learn the model more finely. Hence this gives a better fit for the combination of parameters.

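The activation units at the next layer are

$$a_1^{(2)} = g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3)$$

where g is the logistic function or the sigmoid function discussed in my previous post Simplifying ML: Logistic regression – Part 2

Here $a_1^{(2)}$ is the activation of unit 1 in layer 2.

$\theta_{10}^{(1)}$ is the model parameter (weight) that connects the 0th input of layer 1, the bias unit $x_0$, to unit 1 of layer 2. Similarly $\theta_{11}^{(1)}$ connects the 1st input $x_1$ to unit 1 of layer 2, and so on.

Similarly the other activation units can be written as

$$a_2^{(2)} = g(\theta_{20}^{(1)} x_0 + \theta_{21}^{(1)} x_1 + \theta_{22}^{(1)} x_2 + \theta_{23}^{(1)} x_3)$$

$$a_3^{(2)} = g(\theta_{30}^{(1)} x_0 + \theta_{31}^{(1)} x_1 + \theta_{32}^{(1)} x_2 + \theta_{33}^{(1)} x_3)$$

$$h_\theta(x) = a_1^{(3)} = g(\theta_{10}^{(2)} a_0^{(2)} + \theta_{11}^{(2)} a_1^{(2)} + \theta_{12}^{(2)} a_2^{(2)} + \theta_{13}^{(2)} a_3^{(2)}) \qquad (A)$$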

The crux of neural networks is that instead of creating a hypothesis based on the set of raw features, the neural network with multiple hidden layers can learn its own features. In the equation (A) we can see that the hypothesis is not a function of the input raw features x₁, x₂, … x_n but of a new set of features, the activation units a₁, a₂, … a_n. In other words the network has ‘learned’ its own features.
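The computation from the raw features to the hypothesis in equation (A) can be sketched as a small forward pass. The weights below are arbitrary, hypothetical numbers chosen only to make the example run; in practice they are learned from the data.

```python
import math

def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def activate(theta, inputs):
    """One layer: prepend the bias unit (1.0), then apply g to each weighted sum.

    Each row of theta holds the weights of one unit, bias weight first.
    """
    with_bias = [1.0] + list(inputs)
    return [g(sum(w * v for w, v in zip(row, with_bias))) for row in theta]

def forward(theta1, theta2, x):
    """Return the hidden activations (a1, a2, a3) and the hypothesis h."""
    hidden = activate(theta1, x)            # the 'learned' features
    h = activate(theta2, hidden)[0]         # equation (A)
    return hidden, h

# Hypothetical weights: 3 hidden units with 4 weights each, 1 output unit.
theta1 = [[0.1, 0.4, -0.2, 0.3],
          [-0.3, 0.2, 0.5, -0.1],
          [0.2, -0.5, 0.1, 0.4]]
theta2 = [[-0.2, 0.6, -0.4, 0.3]]

features, output = forward(theta1, theta2, [1.0, 0.0, 1.0])
```

The output layer never sees x₁, x₂, x₃ directly; it only sees the three hidden activations.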

As mentioned above the output of each layer is the logistic function or the sigmoid function

The beauty of neural networks based on logistic functions is that we can easily realize the equivalent of logic gates like AND, OR, NOT, NOR etc.

The hypothesis for the above network would be

h_Ɵ(x) = g(-30 + 20 * x₁ + 20 * x₂)

So for x₁= 0 and x₂ = 0 we would have

h_Ɵ(x) = g(-30 + 0 + 0) = g(-30)

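Since g(-30) ≈ 0, which is far below g(0) = 0.5, the output is 0. Only when x₁ = 1 and x₂ = 1 do we get g(10) ≈ 1, so the network behaves like an AND gate.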
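The full truth table for this hypothesis can be checked with a few lines of Python. The weights -30, 20, 20 are the ones from the hypothesis above; the function names are only illustrative.

```python
import math

def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def and_gate(x1, x2):
    """h(x) = g(-30 + 20*x1 + 20*x2)."""
    return g(-30 + 20 * x1 + 20 * x2)

# Round each output to the nearest of 0 and 1 to read it as a logic level.
truth_table = {(x1, x2): round(and_gate(x1, x2))
               for x1 in (0, 1) for x2 in (0, 1)}
```

The bias weight of -30 keeps the output near 0 unless both inputs contribute their +20.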

Similarly a NOT gate can be constructed with a neural network as follows

Neural networks can also be used for multi class classification.

Hence there are multiple advantages to neural networks.

a) Neural networks are amenable to creating complex logic models from combinations of AND, NOT, OR gates

b) The model parameters are learned from the raw parameters and can be more flexible.

It appears that the interest in neural networks surged in the 1980s and then waned. The neural networks were similar to the above and were based on forward propagation. However it appears that in recent times backward propagation has been used successfully in areas of research known as ‘deep learning’.

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. A highly enjoyable and classic course!!!

Simplifying ML: Logistic regression – Part 2

Logistic regression is another class of Machine Learning algorithms which comes under supervised learning. In this regression technique we need to classify data. In my earlier post Simplifying Machine Learning algorithms – Part 1 I had discussed linear regression. For e.g. if we had data on tumor sizes and whether each tumor was benign or malignant, the question is whether, given a tumor size, we can predict whether this tumor would be benign or malignant. So we need to have the ability to classify this data.

This is shown below

It is obvious that a line with a certain slope could easily separate the two.

As another example we could have an algorithm that is able to automatically classify mail as either spam or not spam based on the subject line. So for e.g if the subject line had words like medicine, prize, lottery etc we could with a fair degree of probability classify this as spam.

However some classification problems could be far more complex. We may need to classify another problem as shown below.

From the above it can be seen that the hypothesis function is a second order equation which is either a circle or an ellipse.

In the case of logistic regression the hypothesis function should be able to switch between two values, 0 or 1, almost like a transistor being either in the cutoff or in the saturation state.

In the case of logistic regression 0 <= h_Ɵ(x) <= 1

The hypothesis function uses a function of the following form

g(z) = 1/(1 + e^-z)

and h_Ɵ(x) = g(Ɵ^T x)

The function g(z) shown above has the characteristic required for logistic regression as it has the following shape

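The function g(z) rapidly approaches 1 as z becomes large and positive and approaches 0 as z becomes large and negative, crossing 0.5 at z = 0. We therefore predict y = 1 when h_Ɵ(x) >= 0.5 (that is, when Ɵ^T x >= 0) and y = 0 when h_Ɵ(x) < 0.5.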
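The behaviour of g(z) can be checked with a few lines of Python. This is only a minimal sketch: the names sigmoid and predict and the sample values of z are my own choices for illustration, not something from the course material.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(h):
    """Threshold the hypothesis value at 0.5 to get a class label (0 or 1)."""
    return 1 if h >= 0.5 else 0

# Evaluate g(z) at a few illustrative points on either side of z = 0.
for z in (-10, -1, 0, 1, 10):
    h = sigmoid(z)
    print(f"z = {z:>3}  g(z) = {h:.5f}  predicted class = {predict(h)}")
```

The values flatten out quickly on both sides of z = 0, which is what gives the hypothesis its switch-like character.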

As in linear regression, the hypothesis function can be of an appropriate order. For example, in the ellipse figure above one could choose

z = Ɵ₀ + Ɵ₁x₁² + Ɵ₂x₂² + Ɵ₃x₁ + Ɵ₄x₂

h_Ɵ(x) = g(z) = 1/(1 + e^-(Ɵ₀ + Ɵ₁x₁² + Ɵ₂x₂² + Ɵ₃x₁ + Ɵ₄x₂))

We could choose the general form of a circle (when a = b) or an axis-aligned ellipse, which is

f(x, y) = ax² + by² + 2gx + 2hy + d

The cost function for logistic regression is given below

Cost(h_Ɵ(x), y) = -log(h_Ɵ(x)) if y = 1

Cost(h_Ɵ(x), y) = -log(1 – h_Ɵ(x)) if y = 0

In the case of linear regression there was a single cost function that measured the error of the predicted value against the data.

In logistic regression the cost is given, as above, by a set of two equations: one for the case where the data is 1 and another for the case where the data is 0.

The reason for this is as follows. If we consider y = 1 as a positive value, then when our hypothesis correctly predicts 1 we have a ‘true positive’. However, if we predict 0 when it should be 1, we have a ‘false negative’. Similarly, when the data is 0 and we predict 1, we have a ‘false positive’, and if we correctly predict 0 when the data is 0, we have a ‘true negative’.

Here is how the cost function

Cost(h_Ɵ(x), y) = -log(h_Ɵ(x)) if y = 1

Cost(h_Ɵ(x), y) = -log(1 – h_Ɵ(x)) if y = 0

was arrived at. By definition the cost function gives the error between the predicted value and the data value.

The logic for determining the appropriate function is as follows

For y = 1

y = 1 & hypothesis = 1 then cost = 0

y = 1 & hypothesis = 0 then cost = Infinity

Similarly for y = 0

y = 0 & hypothesis = 0 then cost = 0

y = 0 & hypothesis = 1 then cost = Infinity

The functions above serve exactly this purpose: -log(h) is 0 when h = 1 and grows to infinity as h approaches 0, while -log(1 – h) is 0 when h = 0 and grows to infinity as h approaches 1.

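Hence the cost for a single example can be written as

Cost(h_Ɵ(x), y) = -y * log(h_Ɵ(x)) – (1 – y) * log(1 – h_Ɵ(x))

and the overall cost function is the average over all m training examples

J(Ɵ) = (1/m) Σ Cost(h_Ɵ(xⁱ), yⁱ)

This is the same as the pair of equations above: when y = 1 the second term vanishes, and when y = 0 the first term vanishes.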
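A quick way to convince oneself that the single-line form and the two-case form agree is to evaluate both. The sketch below uses function names and sample hypothesis values that are purely illustrative.

```python
import math

def cost_piecewise(h, y):
    """Two-case form: -log(h) if y = 1, -log(1 - h) if y = 0."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)

def cost_combined(h, y):
    """Single-line form: -y*log(h) - (1 - y)*log(1 - h)."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

# Compare both forms for a confident-wrong, an unsure and a confident-right prediction.
for y in (0, 1):
    for h in (0.01, 0.5, 0.99):
        print(f"y = {y}  h = {h:<4}  piecewise = {cost_piecewise(h, y):.4f}  "
              f"combined = {cost_combined(h, y):.4f}")
```

The cost is close to zero when the hypothesis agrees with the label and climbs steeply as the hypothesis moves towards the wrong end of the range.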

The same gradient descent algorithm can now be used to minimize the cost function.

So we can iterate through

Ɵ_j := Ɵ_j – α ∂/∂Ɵ_j J(Ɵ₀, Ɵ₁, …, Ɵ_n)

This works out to an update rule that looks the same as the one for linear regression, except that h_Ɵ is now the logistic function

Ɵ_j := Ɵ_j – α (1/m) Σ (h_Ɵ(xⁱ) – yⁱ) x_jⁱ

This will enable the machine to fairly accurately determine the parameters Ɵ_j for the features x and provide the hypothesis function.
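To make the update rule concrete, here is a minimal sketch of batch gradient descent for logistic regression in plain Python. The tumor sizes and labels are made-up numbers, and the learning rate and iteration count are assumptions that happen to suit this tiny data set; none of this comes from the course itself.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), where x[0] is the constant feature 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y):
    """J(theta): average of -y*log(h) - (1 - y)*log(1 - h) over the examples."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    return total / len(y)

def gradient_descent(X, y, alpha=0.1, iterations=10000):
    """Repeat theta_j := theta_j - alpha * (1/m) * sum((h - y) * x_j)."""
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [hypothesis(theta, xi) - yi for xi, yi in zip(X, y)]
        # Build the new list first so that all theta_j are updated simultaneously.
        theta = [theta[j] - alpha / m * sum(e * xi[j] for e, xi in zip(errors, X))
                 for j in range(n)]
    return theta

def predict(theta, x):
    return 1 if hypothesis(theta, x) >= 0.5 else 0

# Hypothetical data: tumor size and label (0 = benign, 1 = malignant).
sizes = [1, 2, 3, 4, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
X = [[1.0, float(s)] for s in sizes]

theta = gradient_descent(X, labels)
print("theta =", theta)
print("cost before training =", cost([0.0, 0.0], X, labels))
print("cost after training  =", cost(theta, X, labels))
print("predictions =", [predict(theta, xi) for xi in X])
```

The only difference from the linear regression version is the hypothesis function; the update loop itself is unchanged.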

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. Highly recommended!!!


Simplifying Machine Learning algorithms – Part 1

Machine learning, or the ability to use computers to predict values, classify data or identify patterns, is truly a fascinating field. It is amazing how algorithms can come to conclusions on data. Detecting patterns is an inborn ability of the human mind. But our mind cannot handle large quantities of data with many features. It is here that machines have an edge over us.

This post is inspired by the Machine Learning course at Coursera conducted by Professor Andrew Ng of Stanford. The lectures are truly lucid and delivered with amazing clarity. In a series of posts I will be trying to distil the meaning and motivation behind the algorithms that are part of machine learning.

There are 2 major types of learning

a) Supervised learning b) Unsupervised learning

Supervised learning: In supervised learning we have to infer the relationship between input data and output values. The intention of supervised learning is to determine the likely output for a new input once the relationship has been determined. Some examples of supervised learning are linear regression, logistic regression etc.

Unsupervised learning: In unsupervised learning the problem is to determine patterns and structure in unlabeled data. Some examples of unsupervised learning are K-Means clustering, hidden Markov models etc.

In this post I would like to take a look at Supervised Learning algorithms

Linear Regression

In regression problems we try to infer the relationship between a set of input parameters and an output value. Let us say we have data for the number of rooms vs. the price of a house, as shown below

Depending on the data we could fit a straight line, that is, use a linear fit. Alternatively we could fit a higher order curve to the data.

The function that determines the relationship is also known as the hypothesis function. For example, a hypothesis function with a single feature can be represented as follows

h_Ɵ(x) = Ɵ₁x + Ɵ₀

The above equation is the hypothesis function, where Ɵ₀ and Ɵ₁ are the parameters and x is the feature

We could have a higher order hypothesis function as follows

h_Ɵ(x) = Ɵ₂x² + Ɵ₁x + Ɵ₀

The function used to evaluate whether the hypothesis function maps the input to the related output accurately is known as the ‘cost function’.

The cost function can be represented as

J(Ɵ) = (1/2m) Σ (h_Ɵ(xⁱ) – yⁱ)²

The cost function really calculates the ‘mean squared error’ (halved, for convenience) between the actual data points (y) and the points on the hypothesis function (h_Ɵ). Clearly, the higher the value of J(Ɵ), the greater the error in predicting the output from a set of input parameters. If we just took the error instead of the squared error, then data points on either side of the predicted line would produce positive and negative errors that could cancel out. Hence the approach is usually to take the mean of the squared error.
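The point about errors cancelling out is easy to see with numbers. In the sketch below the room counts and prices are invented for illustration, and the candidate line h(x) = 100x is chosen so that the raw errors sum to zero even though no point lies on the line.

```python
def hypothesis(theta0, theta1, x):
    """Single-feature hypothesis h(x) = theta1 * x + theta0."""
    return theta0 + theta1 * x

def cost(theta0, theta1, xs, ys):
    """J(theta) = 1/(2m) * sum of squared errors."""
    m = len(xs)
    return sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Hypothetical data: number of rooms vs. price (in thousands).
rooms = [1.0, 2.0, 3.0, 4.0]
prices = [110.0, 190.0, 310.0, 390.0]

raw_errors = [hypothesis(0.0, 100.0, x) - y for x, y in zip(rooms, prices)]
print("raw errors:", raw_errors)
print("sum of raw errors:", sum(raw_errors))
print("J(theta):", cost(0.0, 100.0, rooms, prices))
```

The raw errors add up to nothing, while the squared-error cost still reports that the fit is imperfect.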

The goal would be to minimize the error which will result in the best fit.

So the approach would be to choose values for the parameters Ɵi that minimize J(Ɵ).

The algorithm that is used for determining the values of the parameters that will result in the minimum error is gradient descent.

The formula is

Ɵ_j := Ɵ_j – α ∂/∂Ɵ_j J(Ɵ)

where α is the learning rate

Gradient descent starts by picking a random value for each Ɵi. The algorithm then looks around for the direction that will take us down fastest and takes a step that way. By continuing this process a local minimum is reached.

Gradient descent is based on the observation that if a multivariable function F(x) is defined and differentiable in a neighborhood of a point a, then F(x) decreases fastest if one goes from a in the direction of the negative gradient of F at a. This is shown in the diagram below, taken from Wikipedia.

For e.g for a curve as shown below

This is how I think gradient descent works. In the above diagram, at point A the slope is +ve, so multiplying the slope by the learning rate α and subtracting the result from Ɵj gives a value that is less than Ɵj. That is, we move towards the minimum at C. Similarly, at point B the slope is -ve, so multiplying by –α means we add to Ɵj. Hence we move to the right, again towards point C.
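The same argument can be traced on a simple bowl-shaped curve. The sketch below assumes the curve J(Ɵ) = (Ɵ – 3)², whose minimum is at Ɵ = 3; the curve, the starting points and the learning rate are all made up for illustration and stand in for the points A, B and C of the diagram.

```python
def slope(theta):
    """Derivative of the assumed curve J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def step(theta, alpha=0.1):
    """One gradient descent update: theta := theta - alpha * dJ/dtheta."""
    return theta - alpha * slope(theta)

def descend(theta, alpha=0.1, steps=50):
    for _ in range(steps):
        theta = step(theta, alpha)
    return theta

point_a = 8.0   # right of the minimum, slope is positive
point_b = -2.0  # left of the minimum, slope is negative

print("A: slope", slope(point_a), "-> next theta", step(point_a))
print("B: slope", slope(point_b), "-> next theta", step(point_b))
print("after 50 steps from A:", descend(point_a))
print("after 50 steps from B:", descend(point_b))
```

From A the update decreases Ɵ and from B it increases Ɵ, so both starting points end up near the same minimum.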

By applying the iterative process of gradient descent we can get the combination of parameter values for Ɵ that will provide the best fit for the set of data points.

The iterative process of gradient descent is applied to minimize the cost function, which is a function of the error in the current hypothesis

∂/∂Ɵ_j J(Ɵ) = ∂/∂Ɵ_j (1/2m) Σ (h_Ɵ(xⁱ) – yⁱ)²

The update rule below is applied iteratively to arrive at the values of Ɵ_j

Ɵ_j := Ɵ_j – α ∂/∂Ɵ_j J(Ɵ)

to obtain the values for the best fit equation

h_Ɵ(x) = Ɵ_n xⁿ + Ɵ_(n-1) x^(n-1) + … + Ɵ₁x + Ɵ₀
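For the simplest case of a straight line, h_Ɵ(x) = Ɵ₁x + Ɵ₀, the whole procedure fits in a few lines of Python. This is a sketch under my own assumptions: the data points lie exactly on the line y = 2x + 1, and the learning rate and number of iterations were picked to suit this small example.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit h(x) = theta1 * x + theta0 by minimizing J = 1/(2m) * sum((h - y)^2)."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print("theta0 =", theta0, "theta1 =", theta1)
```

A higher order hypothesis works the same way, with x², x³ and so on treated as additional features, each with its own Ɵ_j.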

Also read my post on Simplifying ML: Logistic regression – Part 2


Dissecting the Cloud – Part 2

This post delves a little more deeply into the cloud. In the last post, Dissecting the Cloud – Part 1, I used the analogy of a person partitioning a large house into self-contained units. In the cloud this partitioning is done by a hypervisor, which abstracts the underlying hardware (CPU, storage and NICs) into virtual CPUs, virtual NICs and virtual disks.

Hence the cloud has several instances, each with its own CPU, NIC and storage. In fact several tenants can reside on the same cloud with their own individual CPU, NIC and storage. This is known as multi-tenancy.

However multi-tenancy creates a unique set of associated issues similar to that of a multi-tenanted house. For e.g. how does one isolate one tenant from another? How does one charge each tenant? Are the tenants secured from the prying eyes of their neighbors? How can the owner ensure that one particular tenant does not consume an inordinate amount of water or electricity at the expense of other tenants?

These are typical problems in a multi-tenanted cloud. A common and a high profile issue in the cloud is that of the ‘noisy neighbor’. In this situation one of the instances of the cloud hogs the network bandwidth or the storage tier, resulting in a severe bandwidth crunch or storage access problems for other instances. Here is an interesting article on the noisy neighbor issue “The Problem with noisy neighbors in the cloud”.

It appears that IBM has patented a solution for the bandwidth crunch caused by noisy neighbors: IBM patents ‘noisy neighbor’ problem with SDN.

In order to ensure that multi-tenancy can be realized in the cloud it is essential to isolate the virtual CPUs, network and storage in the cloud.

Network isolation: Network isolation is achieved through the use of VPNs (Virtual Private Networks), VLANs (Virtual LANs) and subnetting.

A VPN creates a secure tunnel between a user and the cloud instance while accessing the instance from the internet. The data in motion is encrypted using IPSec. Also, vNICs belonging to a client are logically grouped together in a VLAN. Groups of vNICs can be sub-netted together to allow broadcast between them. A VLAN can effectively isolate its traffic from that of other VLANs. A very good write-up of VLANs and sub-netting can be seen at “What is the difference between subnetting and VLAN”.
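The subnetting part of this can be illustrated with Python's standard ipaddress module. The address range, the prefix lengths and the tenant and vNIC names below are all hypothetical; the sketch only shows how one range can be carved into non-overlapping subnets, and it does not configure any real network.

```python
import ipaddress

# Hypothetical private address range for the whole cloud.
cloud_range = ipaddress.ip_network("10.0.0.0/16")

# Carve the range into /24 subnets and hand the first two to two tenants.
subnets = list(cloud_range.subnets(new_prefix=24))
tenant_a, tenant_b = subnets[0], subnets[1]

# Hypothetical vNIC addresses, one per tenant.
vnic_a1 = ipaddress.ip_address("10.0.0.10")
vnic_b1 = ipaddress.ip_address("10.0.1.10")

print("tenant A subnet:", tenant_a, "broadcast address:", tenant_a.broadcast_address)
print("tenant B subnet:", tenant_b, "broadcast address:", tenant_b.broadcast_address)
print("vnic_a1 in tenant A subnet:", vnic_a1 in tenant_a)
print("vnic_b1 in tenant A subnet:", vnic_b1 in tenant_a)
print("subnets overlap:", tenant_a.overlaps(tenant_b))
```

Each subnet has its own broadcast address, so a broadcast sent within one tenant's subnet is not addressed to the vNICs of the other tenant.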

Storage isolation: Storage in cloud can be made of block storage, SAN or NAS storage. Storage isolation is typically achieved through the hypervisor and zoning. Zoning is the partitioning of a Fibre Channel fabric into smaller subsets to restrict interference, add security, and to simplify management. While a SAN makes available several devices and/or ports to a single device, each system connected to the SAN should only be allowed access to a controlled subset of these devices/ports.

CPU isolation: The hypervisor does create individual instances, all fairly isolated from one another. However, this area is prone to attack and, because of the security concerns, is receiving more attention than storage or network isolation. In fact I was greatly surprised to hear that there is a technique called a ‘side channel’ attack, by which an intruder can reverse engineer the actual instructions just by observing the time taken for computations and the temperatures generated. This is really a scary thought!

This is how multi-tenancy is achieved in clouds. I hope to revisit this topic again in the future.


Close encounters with the future

Published in Telecom Asia, Oct 22, 2013 – Close encounters with the future

Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers in the future may have only 1,000 vacuum tubes and perhaps weigh 1.5 tons.—POPULAR MECHANICS, 1949

Introduction: Ray Kurzweil, in his non-fiction book “The Singularity Is Near: When Humans Transcend Biology”, predicts that by the year 2045 the Singularity will allow humans to transcend our ‘frail biological bodies’ and our ‘petty, derivative and circumscribed brains’. Specifically the book claims “that there will be a ‘technological singularity’ in the year 2045, a point where progress is so rapid it outstrips humans’ ability to comprehend it. Irreversibly transformed, people will augment their minds and bodies with genetic alterations, nanotechnology, and artificial intelligence”.

He believes that advances in robotics, AI, nanotechnology and genetics will grow exponentially and will lead us into a future realm of intelligence that will far exceed biological intelligence. This explosion will be the result of ‘accelerating returns from significant advances in technology’.

Futurescape

Here is a look at some of the more fascinating key trends in technology. You can decide whether we are heading to Singularity or not.

Autonomous Vehicles (AVs): Self-driving cars have moved from the realm of science fiction to reality in recent times. Google’s autonomous cars have already driven around half a million miles. All the major car manufacturers of the world, including BMW, Mercedes, Toyota, Nissan, Ford and GM, are coming out with their own versions of autonomous cars. These cars are equipped with Adaptive Cruise Control and Collision Avoidance technologies and are already taking control away from drivers. Moreover, AVs alert drivers if their attention strays from the road ahead for too long. Autonomous Vehicles work with the help of Vehicular Communication Technology.

Vehicular Communication along with the Intelligent Transport Systems (ITS) achieves safety by enabling communication between vehicles, people and roads. Vehicle-to-vehicle communications are the fundamental building block of autonomous, self-driving cars. It enables the exchange of data between vehicles and allows automobiles to “see” and adapt to driving obstacles more completely, preventing accidents besides resulting in more efficient driving.

Smart Assistants: From the defeat of Kasparov in chess by IBM’s Deep Blue in 1997, to the resounding victory in Jeopardy of IBM’s Watson, which is capable of understanding natural human language, to Apple’s more prevalent intelligent assistant Siri, Artificially Intelligent (AI) systems have come a long way. The newest trend in this area is Smart Assistants. Robots are currently analyzing documents, filling prescriptions, and handling other tasks that were once exclusively done by humans. Smart Assistants are already taking over the tasks of BPO operators, paralegals, store clerks and baby sitters. Robots, in many ways, are not only smarter than humans, but also do not get bored easily.

Intelligent homes and intelligent offices: Rapid advances in technology will be closer to the home both literally and figuratively. The future home will have the ability to detect the presence of people, pets and smoke, and changes to humidity, moisture, lighting and temperature. Smart devices will monitor the environment and take appropriate steps to save energy, improve safety and enhance the security of homes. Devices will start learning your habits and enhance your comfort and convenience. Everything from thermostats and fire detectors to washing machines and refrigerators will be equipped with electronics capable of adapting to the environment. All gadgets at home will be accessible through laptops, tablets or smartphones, so we will be able to monitor all aspects of our intelligent home from anywhere.

Smart devices will also make major inroads into offices, leading to the birth of intelligent offices where the lighting, heating and cooling will be based on the presence of people in the offices. This will result in enormous savings in energy. The advances in intelligent homes and intelligent offices will be in the greater context of the Smart Grid.

Swarms of drones: In contrast to the use of weaponized drones for unmanned aerial survey of enemy territory, we will soon have commercial drones used for civilian purposes. The most compelling aspect of drones these days is the fact that they can be easily manufactured in large quantities, are cheap and can perform complex tasks either singly or collectively. Remotely controlled drones can perform hundreds of civilian jobs, including traffic monitoring, aerial surveying, oil pipeline inspections and monitoring of crop conditions. Drones are also being employed for the conservation of wildlife. In the wilderness of Africa, drones are already helping to provide aerial footage of the landscape, track poachers and even herd elephants. However, before drones become a common sight, it is necessary to ensure that appropriate laws are made for maintaining the safety and security of civilians. This is likely to happen in the US in 2015, when the Federal Aviation Administration (FAA) will come up with rules to safely integrate drones into the American skies.

MOOC (Massive Open Online Course): The concept of the MOOC, or the ‘Massive Open Online Course’ from top colleges, though just a few years old, is already taking the world by storm. Coursera, edX and Udacity are the top 3 MOOC providers, besides many others, and offer a variety of courses on technology, philosophy, sociology, computer science etc. As more courses become available online, the requirement of having a uniform start and end date will diminish gradually. The availability of course lectures at all times and through all devices, namely the laptop, tablet or smartphone, will result in large scale adoption by students of all ages.

In contrast to regimented classes, MOOCs allow students to take classes at their own pace. It is likely that some students will breeze through an entire semester’s worth of classes in a few weeks. It is also likely that a few students will graduate in 4 years with more than a couple of degrees. MOOCs are a natural development considering that the world is going to be more knowledge driven, with a need for experts with a diverse set of in-depth skills. Here is an interesting article in WSJ: “What College will be like in 2023”.

3D Printing: This is another technology that is bound to become ubiquitous in our future. 3D printers will revolutionize manufacturing in ways we could never imagine. A 3D printer is similar to a hot-glue gun attached to a robotic arm. It creates an object by stacking one layer of material, typically plastic or metal, on top of another. 3D printers have been used for making everything from prosthetic limbs, phone cases and lamps all the way to a NASA funded 3D pizza. Here is a great article in the New York Times: “Dinner is Printed”. It is likely that a 3D printer will be as indispensable to our future homes as the refrigerator and microwave.

Artificial sense organs: A recent news item in Science 2.0, “The Future touch sensitive prosthetic limbs”, discusses the invention of a prosthetic limb that can actually provide the sense of touch by stimulating the regions of the brain that deal with the sense of touch. The researchers identified the neural activity that occurs when grasping or feeling an object and successfully induced these patterns in the brain. Two parallel efforts are underway to understand how the human brain works. They are “The Human Brain Project”, which has 130 members from the European Union, and Obama’s BRAIN project. Both these projects attempt ‘to give us a deeper and more meaningful understanding of how the human brain operates’. Possibilities as in the movies ‘Avatar’ or ‘Terminator’ may not be far away.

The Others: Besides the above, technologies like Big Data, Cloud Computing, Semantic Web, Internet of Things and Smart Grid will also swamp us in the future, and much has already been said about them.

Conclusion: The above sets of technologies represent seismic shifts and are bound to explode in our future in a million ways.

Given the advances in bionic limbs, intelligent AI systems, MOOCs and Autonomous Vehicles, are we on target for the Singularity?

I wouldn’t be surprised at all!
