The language R

In the universe of programming languages there is a rising staR. It is moving fasteR and getting biggeR and brighteR!

Ok, you get the hint! It is the language R or the R Language.

The R language is the successor to the language S. R is extremely powerful for statistical computing and data processing. It is an interpreted language, much like Python or Perl. The power of R comes from the 4000+ software packages that make it almost indispensable for any type of statistical computing.

As I mentioned above, in my opinion R is soon going to play a central role in the technological world. In today's world we are flooded with data from all sides, and we need techniques like big data, analytics and machine learning to make sense of this deluge. This is where R, with its numerous packages that make short work of data, becomes critical. R also has very capable graphics packages that display data in many forms for faster analysis and easier consumption.

The language R can easily ingest large datasets in CSV format and perform many computations on them. R is used in machine learning, data mining, classification and clustering, and text mining, and is also utilized in sentiment analysis of social networks.
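
As an illustration, here is a minimal sketch of reading and exploring a CSV file in R; the file name sales.csv and the column revenue are hypothetical, just for the example.

# Read a hypothetical CSV file into a data frame (assumes a header row)
sales <- read.csv("sales.csv", header = TRUE)
head(sales)          # first few rows
str(sales)           # structure: column names and types
summary(sales)       # summary statistics for each column
# A simple computation, e.g. the mean of a numeric column named 'revenue'
mean(sales$revenue, na.rm = TRUE)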

The R language contains the usual programming constructs, namely conditionals, loops, assignment etc. The language lets you easily assign values to vectors, matrices and arrays and perform all the associated operations on them, as the short sketch below shows.
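
A small sketch of vector and matrix assignment and the associated operations (the values are arbitrary):

# Vector assignment and element-wise operations
v <- c(2, 4, 6, 8)
v * 2              # multiply every element by 2
sum(v)             # sum of the elements
# Matrix assignment and operations
m <- matrix(1:6, nrow = 2, ncol = 3)
t(m)               # transpose
m %*% t(m)         # matrix multiplication
# A simple loop and conditional
for (x in v) {
  if (x > 4) print(x)
}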

The R language can be installed from the R-project site. The R distribution comes with many datasets collected from various sources. One such dataset is the Iris dataset, which describes the Iris plant (Iris is a genus of 260–300 species of flowering plants with showy flowers).

The dataset contains 5 parameters:

1) Sepal length
2) Sepal width
3) Petal length
4) Petal width
5) Species

This dataset has been used in many research papers. R allows you to easily perform sophisticated statistical operations on it. Included below is a sample set of operations you can perform on the Iris dataset, or on any dataset.

> iris[1:5,]

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1          5.1         3.5          1.4         0.2  setosa

2          4.9         3.0          1.4         0.2  setosa

3          4.7         3.2          1.3         0.2  setosa

4          4.6         3.1          1.5         0.2  setosa

5          5.0         3.6          1.4         0.2  setosa

> summary(iris)

Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species

Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50

1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50

Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50

Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199

3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800

Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

>hist(iris$Sepal.Length)

[Figure: histogram of iris Sepal.Length]

Here is a 3D scatter plot of the petal width, sepal length and sepal width (this uses the scatterplot3d package, which must be installed and loaded first):

> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)

[Figure: 3D scatter plot of Petal.Width, Sepal.Length and Sepal.Width]

As can be seen, R can really make short work of data with the numerous packages that come along with it. I have just skimmed the surface of the R language.

I hope this has whetted your appetite. Do give R a spin!

Watch this space!



The dark side of the Internet

Published in Telecom Asia 26 Sep 2012 – The dark side of the internet

Imagine a life without the internet. You can’t! That’s how inextricably enmeshed the internet is in our lives. Kids learn to play “angry birds” on the PC before they learn to say “duh”, school children hobnob on Facebook and many of us regularly browse, upload photos, watch videos and do a dozen other things on the internet.

So on one side of the internet is the user with a laptop, smartphone or iPad. So what's on the other side, and what is the Internet? The Internet is a global system of interconnected computer networks that uses the TCP/IP protocol suite. The Internet, or more generally the internet, is a network of networks made up of hundreds of millions of computers.

During its early days the internet was mostly used for document retrieval, email and browsing. But with the passage of time the internet and its uses have assumed gigantic proportions. Nowadays we use the internet to search billions of documents, share photographs with our online communities, blog and stream video. So, while the early internet was populated with large computers performing these tasks, the computations of today's internet require a substantially larger infrastructure. The internet is now powered by datacenters. Datacenters contain anywhere between hundreds and hundreds of thousands of servers. A server is a beefed-up computer designed for high performance, sans a screen and a keyboard. Datacenters contain servers stacked one over another on racks.

These datacenters are capable of handling thousands of simultaneous users and delivering results in a split second. In this age of exploding data and information overload, where split-second responses and blazing throughputs are the need of the hour, datacenters really fill the need. But there is a dark side to these datacenters: they are extremely power hungry. In fact, of the utility power supplied to a datacenter, only 6–12% is used for actual computation. The rest is either used for air conditioning or lost in power distribution.

In fact a recent article “Power, pollution and the Internet” in the New York Times claims that “Worldwide, the digital warehouses use about 30 billion watts of electricity, roughly equivalent to the output of 30 nuclear power plants.”  Further the article states that “it is estimated that Google’s data centers consume nearly 300 million watts and Facebook’s about 60 million watts or 60 MW”

For example, it is claimed that Facebook annually draws 509 million kilowatt hours of power for its datacenters (see Estimate: Facebook running 180,000 servers). This article further concludes "that the social network is delivering 54.27 megawatts (MW) to servers", or approximately 60 MW to its datacenters. The other behemoths in this domain, including Google, Yahoo, Twitter, Amazon, Microsoft and Apple, all have equally large or larger datacenters consuming similar amounts of energy. Recent guesstimates have placed Google's server count at more than 1 million, consuming approximately 220 MW. Looking at the power generation capacities of power plants in India, 60 MW is between 20% and 50% of the generation capacity of many plants, while 220 MW is the entire capacity of a medium-sized power plant (see "List of power stations in India").

One of the challenges these organizations face is the need to make the datacenter efficient. New techniques are constantly being used in the ongoing battle to reduce energy consumption in a datacenter. These tools are also designed to boost a datacenter's Power Usage Effectiveness (PUE) rating, the ratio of total facility power to the power actually delivered to the IT equipment. Google, Facebook, Yahoo and Microsoft compete to get to the lowest possible PUE in their newest datacenters. Earlier datacenters used to average a PUE of 2.0, while advanced datacenters these days aim for ratings of the order of 1.22 or 1.16, or lower.
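
As a rough illustration of what those PUE numbers mean, here is a minimal R sketch; the power figures are made up for the example.

# PUE = total facility power / power delivered to the IT equipment
pue <- function(total_facility_kw, it_equipment_kw) {
  total_facility_kw / it_equipment_kw
}
# Hypothetical figures: a 10 MW facility delivering 8.2 MW to its servers
pue(10000, 8200)        # ~1.22, comparable to an efficient modern datacenter
# An older facility delivering only 5 MW of its 10 MW to servers
pue(10000, 5000)        # 2.0, typical of earlier datacenters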

In the early days of datacenter technology the air-conditioning systems cooled by brute force. Later designs segregated the aisles into hot and cold aisles to improve efficiency. Other techniques use water as a coolant along with heat exchangers. A novel technique was used by Intel recently in which servers were dipped in oil. While Intel claimed that this improved the PUE rating, there are questions about the viability of this method, considering the messiness of removing or inserting circuit boards from the servers.

Datacenters are going to proliferate in the coming days as information continues to explode. The hot new technology "cloud computing" is nothing more than datacenters that use virtualization, the ability to run different operating systems on the same hardware, to improve server utilization.

Clearly the thrust of technology in the days to come will be on identifying renewable sources of energy and making datacenters more efficient.

Datacenters, and the technologies that make them efficient, will become more and more prevalent on the internet as we move to a more data-driven world.


The Next Frontier

Published in Telecom Asia – The next frontier, 21, Mar, 2012

In his classic book “The Innovator’s Dilemma” Prof. Clayton Christensen of Harvard Business School presents several compelling cases of great organizations that fail because they did not address disruptive technologies, occurring in the periphery, with the unique mindset required in managing these disruptions.

In the book the author claims that when these disruptive technologies appeared on the horizon there were few takers, because there were no immediate applications for them. For example, when the hydraulic excavator appeared, its performance was inferior to that of the then predominant excavator technology. But in course of time the technology behind hydraulic excavators improved significantly and displaced the existing technology. Similarly, the 3.5-inch disk drive had no immediate takers among desktop computers but made its way into laptops.

Similarly the mini computer giant Digital Equipment Corporation (DEC) ignored the advent of the PC era and focused all its attention on making more powerful mini-computers. This led to the ultimate demise of DEC and several other organizations in this space. This book includes several such examples of organizations that went defunct because disruptive technologies ended up cannibalizing established technologies.

In the last couple of months we have seen technology trends pouring in. It is now accepted that cloud computing, mobile broadband, social networks, big data, LTE, smart grids and the Internet of Things will be key players in the world of our future. We are now at a point in time when serious disruption is not just possible but seems extremely likely. The IT market research firm IDC, in its Directions 2012, believes that we are on the cusp of a Third Platform that will dominate the IT landscape.

There are several technologies that have appeared on the periphery and have gleaned only marginal interest, for example Super Wi-Fi or White Spaces, which uses unlicensed spectrum to reach distances of up to 100 km. White Spaces has been trialed by a few companies in the last year. Another interesting technology is WiMAX, which provides speeds of 40 Mbps over distances of up to 50 km. WiMAX's deployment has been spotty and has not led to widespread adoption in comparison to its apparent competitor, LTE.

In the light of these technology entrants, the disruption in the near future may occur because of a paradigm shift which I would like to refer to as the "Neighborhood Area Computing (NAC)" paradigm. It appears that technology will veer towards neighborhood computing given the bandwidth congestion issues of the WAN. A neighborhood area network (NAN) will supplant the WAN for networks that address a community in a smaller geographical area.

This will lead to three main trends

Neighborhood Area Networks (NAN):  Major improvements in Neighborhood Area Networks (NAN) are inevitable given the rising importance of smart grids and M2M technology in the context of WAN latencies. Residential homes of the future will have a Home Area Network (HAN) based on Bluetooth or Zigbee protocols connecting all electrical appliances. In a smart grid context the NAN provides the connectivity between the Home Area Network (HAN) of a future smart home and the WAN. While it is possible that the utility HAN will be separate from the residential subscriber's IP access network, the more likely possibility is that the HAN will be a subnet within the home network and will connect to the NAN.

The data generated from smart grids, M2M networks and mobile broadband will need to be stored and processed immediately through big data analytics in a neighborhood datacenter. Shorter-range technologies like WiMAX and Super Wi-Fi/White Spaces will transport the data to a neighborhood cloud on which Hadoop-based big data analytics will provide real-time insights.

Death of the Personal Computer:  The PC/laptop will soon give way to a cloud-based computing platform similar to Google's Chromebook. Not only will we store all our data in the cloud (music, photos, videos), we will also use the cloud for our daily computing needs. Given the high speeds of the NAN this should be quite feasible in the future. The cloud will remove our worries about virus attacks, patch updates and the need to buy new software. We will also begin to trust our data to the cloud as we progress into the future. Moreover, pay-per-use will be very attractive to consumers.

Exploding Datacenters:  As mentioned above, a serious drawback of the cloud is WAN latency. It is quite likely that, with increases in processing power and storage capacity coupled with dropping prices, cloud providers will have hundreds of datacenters with around 1000 servers in each city rather than a few mega datacenters with tens of thousands of servers. These datacenters will address the computing needs of a community in a small geographical area. Such smaller datacenters, typically in a small city, will solve two problems: they will build geographical redundancy into the cloud, and they will provide excellent performance, as NAN latencies will be significantly lower than WAN latencies.

These technologies will improve significantly and fill the need for handling neighborhood high-speed data.

The future definitely points to computing in the neighborhood.


The promise of predictive analytics

Published in Telecom Asia – Feb 20, 2012 –  The promise of predictive analytics

Published in Telecoms Europe – Feb 20, 2012 – Predictive analytics gold rush due

We are headed towards a more connected, more instrumented and more data-driven world. This fact is underscored once again in Cisco's latest Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2011–2016. The statistics from this report are truly mind-boggling.

By 2016, 130 exabytes (130 × 2^60 bytes) will rip through the internet annually. The number of mobile devices will exceed the human population this year, 2012. By 2016 the number of connected devices will touch almost 10 billion.

The devices connected to the net range from mobiles, laptops and tablets to sensors and the millions of devices that make up the "internet of things". All these devices will constantly spew data onto the internet, and business and strategic decisions will be made by determining patterns, trends and outliers among mountains of data.

Predictive analytics will be a key discipline in our future, and its experts will be much sought after. Predictive analytics uses statistical methods to mine information and patterns from structured data, unstructured data and data streams. The data can be anything from click streams, browsing patterns, tweets and sensor readings, and it can be static or dynamic. Predictive analytics will have to identify trends in data streams from mobile call records, retail store purchasing patterns and so on.

Predictive analytics will be applied across many domains, from banking, insurance, retail and telecom to energy. In fact, predictive analytics will be the new language of the future, akin to what C was a couple of decades ago. The C language was used in all sorts of applications spanning the whole gamut from finance to telecom.

In this context it is worthwhile to mention the R language. R is used for statistical programming and graphics. Wikipedia describes it thus: "R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others".
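
As a small taste of what this looks like in practice, here is a minimal R sketch that fits a linear model and clusters the built-in iris dataset; it is only an illustration of the kind of modeling R supports, not a real analytics pipeline.

# Fit a simple linear model: predict petal length from sepal length
model <- lm(Petal.Length ~ Sepal.Length, data = iris)
summary(model)
# Predict petal length for a new (hypothetical) sepal length of 6.1 cm
predict(model, newdata = data.frame(Sepal.Length = 6.1))
# Unsupervised pattern finding: cluster the measurements into 3 groups
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster, iris$Species)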

Predictive analytics is already being used in traffic management in identifying and preventing traffic gridlocks. Applications have also been identified for energy grids, for water management, besides determining user sentiment by mining data from social networks etc.

One very ambitious undertaking is the Data-Scope project, which believes that the universe is made of information and that there is a need for a "new eye" to look at this data. The Data-Scope is described as "a new scientific instrument, capable of 'observing' immense volumes of data from various scientific domains such as astronomy, fluid mechanics, and bioinformatics. The system will have over 6PB of storage, about 500GBytes per sec aggregate sequential IO, about 20M IOPS, and about 130TFlops. The Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument, that enables people to do science with datasets ranging between 100TB and 1000TB". The Data-Scope project is based on the premise that new discoveries will come from the analysis of large amounts of data. Analytics is all about analyzing large datasets, and predictive analytics takes it one step further by making intelligent predictions based on available data.

Predictive analytics does open up a whole new universe of possibilities and the applications are endless.  Predictive analytics will be the key tool that will be used in our data intensive future.

Afterthought

I started to wonder whether predictive analytics could be used for some of the problems confronting the world today. Here are a few problems where analytics could be employed:

– Can predictive analytics be used to analyze outbreaks of malaria, cholera or AIDS and help prevent outbreaks in other places?

– Can analytics analyze economic trends and predict an upward or downward trend ahead of time?


Technological hurdles: 2012 and beyond

Published in Telecom Asia, Jan 11, 2012 – Technological hurdles – 2012 and beyond

You must have heard it all by now: the technological trends for 2012 and beyond. The predictions range over big data, cloud computing, the internet of things, LTE, the semantic web, social commerce and so on.

In this post I thought I should focus on what seem to be significant hurdles as we advance into the future. So for a change I wanted to play the doomsayer rather than the soothsayer. The positive trends are bound to continue, and in our exuberance we may lose sight of the hurdles before us. Besides, "problems are usually opportunities in disguise". So here is my list of the top issues facing the industry now.

Bandwidth shortage: A key issue of the computing infrastructure of today is data affinity, which is the result of the dual issues of data latency and the economics of data transfer. Jim Gray (Turing Award, 1998), in his paper "Distributed Computing Economics", states that programs need to be migrated to the data on which they operate rather than transferring large amounts of data to the programs. In this paper Jim Gray tells us that the economics of today's computing depends on four factors, namely computation, networking, database storage and database access. He then equates $1 as follows:

One dollar equates to

≈ 1 GB sent over the WAN
≈ 10 Tops (tera CPU operations)
≈ 8 hours of CPU time
≈ 1 GB of disk space
≈ 10 M database accesses
≈ 10 TB of disk bandwidth
≈ 10 TB of LAN bandwidth

As can be seen from the above breakup, there is a disproportionate contribution by WAN bandwidth in comparison to the others. In other words, while the processing power of CPUs and storage capacities have multiplied, accompanied by dropping prices, the cost of bandwidth has remained high. Moreover, the available bandwidth is insufficient to handle the explosion of data traffic.
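
To make the disproportion concrete, here is a small R sketch using Gray's equivalences, taking each line above as an approximate $1 cost:

# Approximate cost in dollars to move 1 TB of data, per Gray's equivalences
tb <- 1024                          # 1 TB expressed in GB
cost_wan_per_tb <- tb / 1           # $1 buys ~1 GB over the WAN  -> ~$1024 per TB
cost_lan_per_tb <- 1 / 10           # $1 buys ~10 TB of LAN bandwidth -> ~$0.10 per TB
cost_wan_per_tb / cost_lan_per_tb   # WAN is roughly 10,000x costlier than LAN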

In fact it has been found that the "cheapest and fastest way to move a terabyte cross-country is sneakernet" (i.e. the transfer of electronic information, especially computer files, by physically carrying removable media such as magnetic tape, compact discs, DVDs, USB flash drives or external drives from one computer to another).

With the burgeoning of bandwidth-hungry applications it is obvious that we are going to face a bandwidth shortage. The industry will have to come up with innovative solutions to provide what I would like to refer to as "bandwidth-on-demand".

The Spectrum Crunch: Powerful smartphones, extremely fast networks, content-rich applications and increasing user awareness have together resulted in a virtual explosion of mobile broadband data usage. There are two key drivers behind this phenomenal growth in mobile data. One is the explosion of devices: smartphones, tablet PCs, e-readers and laptops with wireless access, all delivering high-speed content and web browsing on the move. The second is video: over 30% of overall mobile data traffic is video streaming, which is extremely bandwidth hungry. The rest of the traffic is web browsing, file downloads and email.

The growth in mobile data traffic has been exponential. According to a report by Ericsson, mobile data is expected to double annually until 2015. Mobile broadband will see a billion subscribers this year (2011), and could possibly touch 5 billion by 2015.

According to a report by IDATE (a consulting firm), total mobile data will exceed 127 exabytes (an exabyte is 10^18 bytes, or one million terabytes) by 2020, a more than 33-fold increase from 2010.

Given current usage trends, coupled with the theoretical limits of the available spectrum, the world will run out of spectrum for the growing army of mobile users. The current spectrum availability cannot support the surge in mobile data traffic indefinitely, and demand for wireless capacity will outstrip spectrum availability by the middle of this decade, around 2014.

This is a really serious problem. In fact, it is a serious enough issue for the White House to have issued a memo titled "Unleashing the Wireless Broadband Revolution". The US Federal Communications Commission (FCC) has now taken steps to meet the demand by letting wireless users access content via unused airwaves in the broadcast spectrum, known as "White Spaces". Google and Microsoft are already working on this technology, which will allow laptops, smartphones and other wireless devices to transfer gigabytes instead of megabytes over Wi-Fi.

But the spectrum shortage is a problem that needs to be addressed immediately.

IPv4 exhaustion: IPv4 address space exhaustion has been around for quite some time and warrants serious attention in the not too distant future. This problem may be even more serious than the Y2K problem. The issue is that IPv4 can address only 2^32, or about 4.3 billion, devices. The pool has already been exhausted because of new technologies like IMS, which uses an all-IP core, and the Internet of Things, with ever more devices and sensors connected to the internet, each identified by an IP address. The solution to this problem was worked out long ago: the Internet must adopt the IPv6 addressing scheme. IPv6 uses 128-bit addresses and allows 3.4 x 10^38, or 340 trillion trillion trillion, unique addresses. However, the conversion to IPv6 is not happening at the required pace and will pretty soon have to be taken up on a war footing. It is clear that while the transition takes place both IPv4 and IPv6 will co-exist, so devices on the internet will additionally need to be able to convert from one to the other.
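
A quick back-of-the-envelope check of those address-space numbers in R:

# IPv4: 32-bit addresses
2^32                      # ~4.29 billion addresses
# IPv6: 128-bit addresses (double precision is fine for an order-of-magnitude
# check; exact integer arithmetic would need a big-number package such as gmp)
2^128                     # ~3.4e38 addresses
# Addresses per person for a world population of ~7 billion
2^128 / 7e9               # ~4.9e28 addresses per person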

We are bound to run into a wall if organizations and enterprises do not upgrade their devices to handle IPv6.

Conclusion: These are some of the technological hurdles that confront the computing industry.  Given mankind’s ability to come up with innovative solutions we may find new industries being spawned in solving these bottlenecks.


Monetizing mobile data traffic

Published in Telecom Asia – May 31, 2010 – Monetizing mobile data traffic

Abstract: In the last couple of years mobile data traffic has seen explosive growth and has in fact crossed voice traffic. CSPs have been forced to upgrade their access, core and backhaul networks to handle the increased demands on the network. Despite the heavy growth in mobile data traffic, the corresponding ARPU for mobile data has grown only marginally. So what is the way out for the service providers? This article looks at some of the avenues through which CSPs can increase their revenue in the face of increasing traffic demands, converting the growth in mobile traffic from a challenge into an opportunity.

Growth in Mobile Data Traffic
Mobile data traffic is exploding in carrier networks. A recent report by Ericsson finds that mobile data traffic globally grew 280% during each of the last two years and is forecast to double annually over the next five years. This exponential growth in data traffic has been fueled by the entry of smartphones, laptops with dongles and other devices hungry for bandwidth. The advent of smartphones like the iPhone, Nexus One and Droid has resulted in several data-hungry applications squeezing the available bandwidth of carriers. Smartphones are here to stay. Social networking sites on mobile devices and mobile broadband-based PCs also now account for a large percentage of mobile data traffic. In fact it is rumored that a major carrier's network started to choke as a result of these bandwidth-hungry smartphones.

Marginal ARPU issue
Nowadays professionals everywhere use their laptops with dongles for checking email, browsing or performing high-bandwidth downloads. However, the ARPU from data traffic has been relatively flat or at best marginal. In fact one report claims that despite the phenomenal growth in data traffic, the ARPU from data has not grown proportionately; the ARPU from voice traffic continues to exceed that of data traffic. This clearly defies logic: on the one hand there is enormous growth in data traffic, but there are no corresponding returns for the service provider. To add to this situation there are now new devices like the iPad and its soon-to-be competitors, which will place their own demands on the wireless network. One of the reasons why the growth in ARPU for data is not proportional to the growth in data traffic is charging schemes like "all-you-can-eat" or flat-rate plans. Such schemes result in excessive usage with little or no consequent increase in revenue. To make matters worse, over-the-top (OTT) video services and other third-party services place a heavy data load on the networks while siphoning away the revenue. The increased demands on the network also necessitate upgrading the access, core and backhaul networks to handle the increasing data traffic loads. The CSPs are forced to upgrade to LTE/WiMAX to improve the access network and move their backhaul to the Evolved Packet Core (EPC). Hence the CSPs are faced with a situation where they do "more for less": while they have to increase their CAPEX there is no corresponding ROI on the new hardware. This article looks at some of the possible ways the CSPs can monetize this growth in traffic.

Avenues for CSPs
There are four ways of turning this bandwidth crunch into an opportunity.
1) Policy Based Traffic
The first technique is to study the usage patterns of the subscribers. The CSPs need to identify the applications that are most frequently used and have high bandwidth demands. The CSPs may be required to perform Deep Packet Inspection (DPI) to determine the kind of traffic in the network. The CSPs can then apply premium charging for these types of traffic. The CSPs need policy servers that apply different policies based on the type of traffic (data, video etc.). The service providers can charge a premium for specific kinds of traffic based on the policies set, as sketched below. However, the downside of this approach is that it may not go down well with subscribers who have been used to flat-rate charging.
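
Purely as an illustration of the idea (the traffic classes and per-GB rates below are invented for the example), a tiny R sketch of rating usage by traffic type might look like this:

# Hypothetical per-GB rates for different traffic classes
rates <- c(video = 0.50, p2p = 0.40, web = 0.20, email = 0.10)   # $ per GB
# A subscriber's monthly usage in GB, as classified by DPI
usage <- c(video = 12, p2p = 3, web = 5, email = 0.5)
# Charge for each class and the total bill
charges <- rates[names(usage)] * usage
charges
sum(charges)        # total monthly data charge under this policy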

2) Mobile Ads
The second method which carriers can use is mobile ads. This method avoids increasing the charge on customers. The carriers can maintain a fixed charge for the subscriber, or in other words subsidize the subscriber, by having the subscriber receive commercial advertisements. The service providers can have a business model with the advertiser and receive a small fee for carrying the commercial to the mobile. The mobile ads should be non-intrusive and non-distracting to the user. They can be displayed at the top or the bottom of the mobile screen, for example when the subscriber is looking through his or her contact list. Alternatively, if the subscriber is using a data-intensive application, the user may be required to watch a 30-second commercial prior to the start of the video clip.

3) Revenue sharing with Content Owners
The third technique for the CSPs is to enter into an innovative business model with the content provider or content owner. Some lessons can be learnt from the business models of successful enterprises like Google, Yahoo, eBay or PayPal. These organizations receive a small fee for facilitating a particular service, for example hosting an ad on a web page or facilitating a payment. Similarly, the carriers should enter into a business model with the content owners in which the service providers receive a small fee for providing the network infrastructure for the music or video service. This would be akin to paying a toll for using a well-maintained highway; the carriers should levy a small toll for the usage of their network highway.

4) App Stores
The carriers can also maintain app stores which, besides providing downloadable applications, also provide downloadable content, e.g. music or video. The carriers can then generate revenue both from providing the content and from providing the infrastructure for transporting the content to the mobile.

Conclusion: In these times when data traffic is growing at a tremendous pace, some avenues for revenue generation for the service providers are to:
1) Have the ability to differentiate traffic, use a policy manager and charge based on the data being transported; provide a personalized service to individual users based on traffic type and charge appropriately.
2) Subsidize usage for the subscriber through the delivery of mobile ads and enter into a revenue share with the organizations whose commercials are delivered.
3) Levy a small charge on content owners for the delivery of their content to mobile users.
4) Creatively use app stores to provide apps, music, video and other differentiated content.

Posted by T V Ganesh
