To Hadoop, or not to Hadoop

Published in Telecom Asia, Jul 23, 2012 – To Hadoop or not to Hadoop

To Hadoop, or not to Hadoop: that is the question. In many of my discussions I find that Hadoop, with its implementation of Map-Reduce, crops up time and again. To many, Map-Reduce is the panacea for all kinds of performance evils. It appears that somehow using Map-Reduce in your application will magically transform it into a high-performing, screaming application.

The fact is that the Map-Reduce algorithm is applicable to only a certain class of problems. Ideally it is suited to what is commonly referred to as the “embarrassingly parallel” class of problems: problems that are inherently parallel, for example the creation of inverted indices from web-crawled documents.

Map-Reduce is an algorithm that was popularized by Google. The term map-reduce actually originates from Lisp, in which the “map” function takes a list of arguments and performs the same operation on each of them, and “reduce” then combines the resulting values according to some criterion. Google uses Map-Reduce to create its inverted index. An inverted index provides a mapping from a word to the list of documents in which it occurs. This typically happens in two stages. A set of parallel “map” tasks take documents as input, parse them and emit a sequence of (word, document id) pairs. In other words, the map takes an input key-value pair (k1, v1) and maps it into intermediate (k2, v2) pairs. The reduce tasks take the (word, document id) pairs, group them, and emit a (word, list {document id}) pair for each word. Clearly applications like the inverted index make sense for the Map-Reduce algorithm, as several map tasks can work in parallel on separate documents. Other typical applications are counting the occurrences of words in documents or counting the number of times a URL has been hit in a traffic log.
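
To make the two stages concrete, here is a minimal, single-machine Python sketch of the inverted-index flow described above. The document names and contents are made up for illustration; in Hadoop the map and reduce tasks would run in parallel across many servers.

```python
from collections import defaultdict

# Hypothetical stand-in for a set of web-crawled documents.
docs = {
    "doc1": "to hadoop or not to hadoop",
    "doc2": "hadoop popularized map reduce",
}

def map_phase(doc_id, text):
    # Map: (k1, v1) -> list of intermediate (word, document id) pairs.
    return [(word, doc_id) for word in text.split()]

def reduce_phase(pairs):
    # Reduce: group the intermediate pairs into (word, list{document id}).
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

intermediate = []
for doc_id, text in docs.items():   # in Hadoop, many map tasks run in parallel
    intermediate.extend(map_phase(doc_id, text))

inverted_index = reduce_phase(intermediate)
print(inverted_index)   # e.g. {'hadoop': ['doc1', 'doc2'], ...}
```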

The key point in all these classes of problems is that they can be handled in parallel. Tasks that can execute independently, besides being inherently parallel, are eminently suitable for Hadoop processing. Such tasks also typically work on extraordinarily large data sets, which is another criterion for Hadoop-worthy applications.

Hadoop uses a large number of commodity servers to execute the algorithm. Complementing the Map-Reduce engine is the Hadoop Distributed File System (HDFS), a storage system in which the input data is partitioned and replicated across several servers. Google uses the Google File System (GFS) for its inverted index and PageRank computations.

Typical applications that are prime candidates for Hadoop are those that have to operate on terabytes of data. The additional requirement is that the application can run some transformation or “map” step on the data independently and produce intermediate results for the “reduce” part of the algorithm. The “reduce” then applies some criterion to each group of intermediate values, typically producing zero or one output value per key.

However, several real-world applications do not fall into this category where execution can be parallelized. For example, an e-retail application which allows users to search for books and electronic products, add them to a shopping cart and finally make the purchase is, in my opinion, not really suited for Hadoop, as each individual transaction is separate and typically has its own unique sequential flow. An ad-serving application is also not ideally suited for Hadoop; each individual transaction has its own flow in time.

However, on closer look we can see that there are certain aspects of such applications that are conducive to the Hadoop-based Map-Reduce algorithm. For example, the application may need to search through large data sets: the e-retail application will have tens of thousands of electronic products and books from different vendors, each with its own product id. We could use Hadoop to pre-process these large amounts of data, classify them and create smaller data sets which the e-retail or other application can use, as sketched below. Hadoop is a clear winner when large data sets have to be searched, sorted or have some subset selected from them.
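
As a sketch of that pre-processing step, the following hypothetical Hadoop Streaming mapper and reducer bucket a large product catalogue into smaller per-category data sets. The record layout (“product_id,category,vendor,title” per line) and the script name are assumptions made for illustration.

```python
#!/usr/bin/env python
# classify.py - hypothetical Hadoop Streaming scripts that group a product
# catalogue ("product_id,category,vendor,title" per line) by category.
import sys

def mapper():
    # Emit "category <tab> product_id" for each catalogue record on stdin.
    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) >= 2:
            product_id, category = fields[0], fields[1]
            print(f"{category}\t{product_id}")

def reducer():
    # Hadoop sorts the mapper output by key, so all products of a category
    # arrive together; emit one line per category with its product ids.
    current, products = None, []
    for line in sys.stdin:
        category, product_id = line.rstrip("\n").split("\t", 1)
        if current is not None and category != current:
            print(f"{current}\t{','.join(products)}")
            products = []
        current = category
        products.append(product_id)
    if current is not None:
        print(f"{current}\t{','.join(products)}")

if __name__ == "__main__":
    # Run via Hadoop Streaming, e.g. with -mapper "classify.py map"
    # and -reducer "classify.py reduce".
    mapper() if sys.argv[1:] == ["map"] else reducer()
```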

In these kinds of applications Hadoop has a clear edge over other approaches as it can really crunch data. Hadoop is also resilient to failures and is based on the principle of “data locality”, which allows the “map” or “reduce” tasks to use data stored locally on their own server or on a neighboring machine.

Hence, while Hadoop is no silver bullet for all types of applications, if due diligence is performed we can identify those aspects of an application whose data can be crunched by Hadoop.

The Future is C-cubed: Computing, Communication and the Cloud

We are on the verge of the next great stage of technological evolution. The confluence of different trends clearly points to what I would like to term C-cubed (C3): the merger of computing technologies, communication advances and the cloud.

There are no surprises in this assessment. Clearly it does not fall into the category of chaos theory's “butterfly effect”, where a seemingly unrelated cause has a far-reaching effect, as in the proverbial fluttering of a butterfly in Puerto Rico being enough to cause an earthquake in China.

The C-cubed future that seems very probable is based on the advances in mobile broadband and communication technologies and the emergence of cloud computing. Some years back, Scott McNealy of Sun Microsystems held that “the network is the computer”. Now, with the introduction of Google's Chromebook, this trend will soon catch on. In fact I can easily visualize a ubiquitous device which I would like to call the “cloudbook”.

The cloudbook would be a device that resembles a tablet like the iPad or PlayBook but carries little or no hard disk. Local storage would be through USB devices or SD cards, which these days offer capacities of 80 GB and above. The cloudbook would have no operating system of its own. It would simply have a bootstrap program which would allow the user to choose from several different operating systems (OS), namely Windows, Linux, Solaris or Mac OS, which would execute on the cloud. All applications would be executed directly on the cloud, and the user would also store all programs and data on the cloud. Some amount of offline storage would be possible on portable storage devices like memory sticks and SD cards.

The cloudbook will be a ubiquitous device. It will access the internet through mobile broadband, over a GPRS, WCDMA or LTE connection. With the blazing speeds of 56 Mbps promised by LTE, accessing the public cloud to execute programs and store data is extremely feasible, and access should be almost instantaneous. Using mobile broadband for access and the cloud for computing and storage will be the trend in the future.

Besides its use for computing, the cloudbook will also be used for making voice and video calls. This is the promise of IP Multimedia Subsystem (IMS) technology. IMS is a technology that has been waiting in the wings for quite some time. It envisages an all-IP core network that will be used for transporting voice, data and video. As the IP pipes become faster and the algorithms to iron out QoS issues are worked out, the complete magnificence of the IMS vision will become a reality and high-speed video applications will become commonplace.

The cloudbook will use the WCDMA (3G) network to make voice and video calls to others. The 3G RNC or the 4G eNodeBs will enable the transmission and reception of voice, data and video to and from the core network. LTE networks will use either Circuit Switched Fallback (CSFB) or VoLTE (Voice over LTE) to carry voice and video over the 3G network or over the Evolved Packet Core (EPC) respectively. In the future, high-speed video-based calls and applications will be extremely prevalent, and a device like the cloudbook will enhance the user experience manifold.

IMS also envisions Application Servers (AS) spread across the network providing other services like video-on-demand and real-time multi-player gaming. It is clear that these application servers may actually be instances running on the public cloud.

Hence the future clearly points to a marriage of computing, communication and the cloud, where each will have a symbiotic relationship with the others. The network can be visualized as one large ambient network of IMS Call Session Control Function servers (CSCFs), virtualized servers on the cloud and application servers (AS).

Mobile broadband will become commonplace and all computing and communication will be through 3G or 4G networks.

The future is almost here and the future is C-cubed (C3)!!!

Published in Telecom Asia, Jul 8 2011 – The Future is C-cubed

Managing Multi-Region Deployments

If there is one lesson from this year's major Amazon EC2 outage, it is “don't deploy all your application instances in a single region”. The outage clearly demonstrated that entire regions are not immune to disasters. Thus it has become imperative for designers and architects to deploy applications spanning major regions. Currently there are four major regions: US-East, US-West, Europe and APAC.

Both fundamentally and from a strategic point of view it makes sense to deploy web applications in different regions, for example in both US-East and US-West. This builds into the application a certain amount of geographical resiliency. In this way you are protected from major debacles like the Amazon EC2 outage in April 2011, or a possible meteor crashing into one of the data centers.

Deploying instances in different regions is much like minimizing risk by diversifying your portfolio. The design of the application, besides including other methods of fault tolerance, should also incorporate geographical resilience.

Currently Amazon's ELB does not support load balancing across regions; it can only distribute traffic among instances in different availability zones of a single region. The solution is to use a DNS service such as UltraDNS, DNSMadeEasy or DynDNS.

These DNS services provide a geoIP-based load balancer that can distribute traffic based on the region from which it originates. Besides balancing the load by origination, geoIP-based traffic distribution has the added benefit of directing users to the application instances closest to them, thus reducing latency.

The geoIP-based traffic distributor directs traffic to the closest region, and an Amazon ELB can then internally distribute the traffic among the instances within that region. For a look at some typical problems in multi-region cloud deployments, do look at my post “Cache-22”.
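
The routing decision itself is simple; the heavy lifting is done at the DNS layer by services like those above. The toy Python sketch below shows the idea, with hypothetical regional ELB host names standing in for real ones.

```python
# Toy sketch of a geoIP-style routing decision. Each (hypothetical) ELB
# host name fronts the instances deployed in one region.
REGIONAL_ELBS = {
    "NA": "myapp-us-east.elb.amazonaws.com",   # US-East / US-West
    "EU": "myapp-eu-west.elb.amazonaws.com",   # Europe
    "AS": "myapp-apac.elb.amazonaws.com",      # APAC
}
DEFAULT_ELB = REGIONAL_ELBS["NA"]

def resolve_region(continent_code: str) -> str:
    # Return the regional ELB closest to the caller; fall back to a default.
    return REGIONAL_ELBS.get(continent_code, DEFAULT_ELB)

print(resolve_region("EU"))   # -> myapp-eu-west.elb.amazonaws.com
```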

Deploying across regions

Cloud, analytics key tools for today’s telcos

Published in Telecom Asia Aug 20, 2010 – http://bit.ly/dxKbsR

Operators facing dwindling revenue from wireline subscribers, fierce tariff wars and exploding mobile data traffic are continually being pressured to do more for less. Spending on infrastructure is increasing as they look to provide better service within slender budgets.
In these tough times telcos have to devise new and innovative strategies and make judicious technology choices. Two promising technologies, cloud computing and analytics, are shaping up as among the best choices to make.
Cloud architecture does away with the worry of planning the computing resources needed, the real estate, the costs of acquiring them and concerns about their obsolescence. It allows the CSPs to purchase processing power, platforms and databases almost as a utility, like electricity or water.
Cloud consumers only pay for what they use. The magic of this promising technology is the elasticity that the cloud provides – it expands to accommodate increasing demands and contracts when the demand drops.
The cloud architectures of Amazon, Google and Microsoft, currently the three biggest cloud providers, vary widely in their capabilities and features. These strengths and weaknesses should be taken into account while planning a cloud system, as each provider's offering is best suited to a particular class of applications.
On one end of the spectrum Amazon’s EC2 (Elastic Compute Cloud) provides a virtual machine and a wealth of associated tools for storage and notifications. But the trade-off for increased flexibility is that users must take responsibility for designing resiliency into their systems.
On the other end is Google’s App Engine, a highly scalable cloud architecture that handles failures but is a lot more restrictive. Microsoft’s Azure is based on the .NET architecture and in terms of flexibility and features lies between these two.
When implementing such an architecture, an organization should take a long hard look at its software inventory to decide which applications are worth migrating to the cloud. The best candidates are processing-intensive in-house applications that deliver standardized functionality and interfaces, and whose software architecture is made up of loosely coupled communicating systems.
Applications that deal with sensitive data should be retained within the organization’s internal computing infrastructure, because security is currently the most glaring issue with the cloud. Cloud providers do provide various levels of security to users, but this is an area in keen need of standardization.
But if the CSP decides to build components of an OSS system – rather than buying a pre-packaged system – it makes good business sense to develop for the cloud.
A cloud-based application must have a few essential properties. First, it is preferable if the application was designed on SOA principles. Second, it should be loosely coupled. And lastly, it needs to be an application that can be scaled rapidly up or down based on varying demand.
The other question is which legacy systems can be migrated. If the OSS/BSS systems are based on commercial off-the-shelf systems these can be excluded, but an offline bill processing system, for example, is typically a good candidate for migration.
Mining wisdom from data
The cloud can serve as the perfect companion for another increasingly vital operational practice – data analytics. The cloud is capable of modeling large amounts of data, and running models to process and analyze this data. It is possible to run thousands of simultaneous instances on the cloud and mine for business intelligence in the oceans of telecom data operators generate.
Today’s CSP maintains software systems generating all kinds of customer data, covering areas ranging from billing and order management to POS, VAS and provisioning. But perhaps the largest and richest vein of subscriber information is the call detail records database.
All this data is worthless, though, if it cannot be mined and analyzed. Formal data mining and data analytics tools can be used to identify patterns and trends that will allow operators to make strategic, knowledge-driven decisions.
Analytics involves many complex areas like predictive analytics, neural nets, decision trees and classification. Some of the approaches used in data analytics include prediction, deviation detection, degree of influence and classification.
With the intelligence that comes through analytics it is possible to determine customer buying patterns, identify causes for churn and develop strategies to promote loyalty. Call patterns based on demography or time of day will enable the CSPs to create innovative tariff schemes.
Determining the relations and buying patterns of users will provide opportunities for up-selling and cross-selling. The ability to identify marked deviations in customer behavior patterns helps the CSP decide ahead of time whether a trend is a warning bell or an opportunity waiting to be tapped.
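
As a small illustration of the classification approach mentioned above, the sketch below trains a decision tree to flag likely churners. The choice of scikit-learn, the CDR-derived features (monthly minutes, data usage, complaints) and the sample values are all assumptions made purely for illustration.

```python
# Minimal churn-classification sketch using a decision tree.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical CDR-derived features: [monthly minutes, data usage in MB, complaints]
X_train = [
    [520, 1200, 0],
    [45,  80,   3],
    [610, 2500, 1],
    [30,  10,   4],
]
y_train = [0, 1, 0, 1]   # 1 = churned, 0 = stayed

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

print(model.predict([[50, 60, 2]]))   # e.g. [1] -> likely churner
```
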
Tinniam V Ganesh
