Reducing to the Map-Reduce paradigm – Thinking Web Scale – Part 1

In physics there are four fundamental forces: gravity, which governs celestial bodies, electromagnetism, and the strong and weak forces at the sub-atomic level. The equations that work for large bodies do not seem to apply at the sub-atomic level, though there have been several attempts at grand unification theories.

Similarly, in computing we have computing at the personal level, the enterprise level, the data-center level and the web scale level. The problems and paradigms at each level are very different. The sequential processing, relational database accesses and local area network speeds of the smaller scales are very different from the parallel processing requirements, NoSQL-based storage accesses and WAN latencies of the web scale.

Here is the first of my posts on paradigms at the Web Scale.

The internet now contains in excess of 1 billion hosts, based on a report in the World Factbook published in 2012.

Across these billion-odd hosts there are at least ~1.5 billion pages that have been indexed, and there must be several hundred million more that are not indexed by the major search engines.

Search engines like Google, Bing or Yahoo have to work on several hundred million pages. Similarly, social web sites like Facebook, Twitter or LinkedIn have to deal with several hundred million users who constantly perform status updates, upload images, tweet and so on. To handle such large quantities of data efficiently and quickly, web scale algorithms are needed.

One such algorithm is map-reduce, which had its origins at Google. Map-reduce essentially consists of a set of mappers, each of which takes a key-value pair as input and outputs zero or more key-value pairs. The reducer then takes all tuples with the same key, combines them using some function and emits a key-value pair.

[Figure: map-reduce]

Map-reduce, and its open source avatar Hadoop, are now used routinely to solve several large scale problems. To be honest, I was, and still am, puzzled that the two simple task types of mapping and reducing can handle such a large variety of problems. However, it appears they can.

I would have assumed that there would be other flavors, maybe an ‘identify-update’, a ‘determine-solve’ or some such equivalent, but it seems that a large set of problems can indeed be expressed as some combination of the map-reduce paradigm.

Anyway, here are a few examples for which the map-reduce algorithm is useful.

Word Counting: The standard example for map-reduce is the word counting program, in which the algorithm generates a list of words with their corresponding counts from a set of input files. The map task reads each document and breaks it into a sequence of words (w1, w2, w3, …). It then emits key-value pairs as follows:

(w1, 1), (w2, 1), (w3, 1), (w1, 1) and so on. If a word is repeated in the document it occurs multiple times in the output. The key-value pairs are then grouped by key and each group is sent to one of the reducer tasks. Each reducer sums all the values, giving the total count for each word.
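Here is a minimal sketch of this in plain Python (not Hadoop code; the function names and the in-memory ‘shuffle’ step are purely illustrative of what the framework does between the two phases):

```python
from collections import defaultdict
import re

def map_word_count(doc_id, text):
    """Map task: emit (word, 1) for every word occurrence in the document."""
    for word in re.findall(r"\w+", text.lower()):
        yield (word, 1)

def shuffle(mapped_pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_word_count(word, counts):
    """Reduce task: sum the counts for each word."""
    return (word, sum(counts))

docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
mapped = (pair for doc_id, text in docs.items() for pair in map_word_count(doc_id, text))
result = dict(reduce_word_count(w, c) for w, c in shuffle(mapped))
print(result)   # e.g. {'the': 3, 'quick': 1, 'fox': 2, ...}
```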


Matrix multiplication: Big Data is a typical challenge on the web, where there is a need to determine patterns and trends in mountains of data. Machine learning algorithms are used to determine structure in data characterized by the three V's of volume, variety and velocity, and these algorithms typically depend on matrix operations. Map-reduce is ideally suited for this; one of the original purposes of map-reduce at Google was the very large matrix-vector multiplications needed for PageRank.

Let us assume that we have an n x n matrix M whose element in row i and column j is m_ij.

Also assume that there is a vector v whose jth element is v_j. Then the matrix-vector product is the vector x of length n whose ith element is given by

x_i = ∑_j m_ij v_j

 

Map function: The map function applies to each element of the matrix M. For each element m_ij the map task outputs a key-value pair (i, m_ij v_j). Hence we will have key-value pairs for all i from 1 to n.

Reduce function: The reduce function takes all pairs with the same key i and sums them up.

Hence each reducer will generate

x_i = ∑_j m_ij v_j
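Here is a toy sketch of this computation in plain Python, assuming the matrix M is supplied as sparse (i, j, m_ij) triples and that the vector v is small enough to be available in memory at every map task, as in the basic formulation in the reference below; the names are purely illustrative:

```python
from collections import defaultdict

def map_matrix_vector(matrix_entries, v):
    """Map: for each matrix element m_ij emit (i, m_ij * v_j)."""
    for i, j, m_ij in matrix_entries:
        yield (i, m_ij * v[j])

def reduce_matrix_vector(mapped_pairs):
    """Reduce: sum all values with the same key i to obtain x_i."""
    x = defaultdict(float)
    for i, product in mapped_pairs:
        x[i] += product
    return dict(x)

# M represented as sparse (row, col, value) triples, v as a list indexed by j
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 20.0]
print(reduce_matrix_vector(map_matrix_vector(M, v)))   # {0: 50.0, 1: 110.0}
```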

(Reference: Mining of Massive Datasets – Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman)

This link gives a good write-up on matrix x matrix multiplication.

Map-reduce for Relational Operations: Map-reduce can also be used to perform the standard database operations on large scale data, such as selection, projection, union, intersection, difference, natural join, grouping and so on.

Here is an example taken from the ‘Web Intelligence & Big Data’ course on Coursera by Gautam Shroff.

Let us assume that there are two tables, ‘Sales by address’ and ‘City by address’, and the need is to find the total sales by city. The SQL query for this is:

SELECT SUM(Sale), City FROM Sales, Cities WHERE Sales.Addr_id = Cities.Addr_id GROUP BY City

This can be done with two map-reduce tasks.

The first map-reduce task groups the records by Address, as follows:

Map1: The first map task emits (Address, rest of record), where the rest of the record is either a Sale or a City.

Reduce1: The first reduce task sums the Sales for each Address and attaches the corresponding City, emitting (City, Sales) records. Clearly a City will occur multiple times, once for every Address in it.

At this point we have the sum of the sales for every address, tagged with its city; each city can therefore occur multiple times. Now we have to GROUP BY City.

Map2: The second map task emits (City, Sales).

Reduce2: The second reduce task sums all the sales for each city.
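Here is a plain Python sketch of the two stages. The record layouts (addr_id, sale) and (addr_id, city) and the sample data are hypothetical, and Map2 is effectively the identity here, since the output of Reduce1 is already keyed by City:

```python
from collections import defaultdict

# Hypothetical inputs: Sales as (addr_id, sale) and Cities as (addr_id, city)
sales = [(1, 100), (1, 50), (2, 70), (3, 30)]
cities = [(1, "Chennai"), (2, "Bangalore"), (3, "Chennai")]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

# Stage 1: Map1 keys every record by Address; Reduce1 sums the sales for each
# address, attaches the city and emits (City, Sales) pairs.
def map1():
    for addr, sale in sales:
        yield (addr, ("SALE", sale))
    for addr, city in cities:
        yield (addr, ("CITY", city))

def reduce1(grouped):
    for addr, records in grouped.items():
        total = sum(v for tag, v in records if tag == "SALE")
        city = next(v for tag, v in records if tag == "CITY")
        yield (city, total)

# Stage 2: Map2 is the identity (records are already (City, Sales));
# Reduce2 sums the sales for each city.
def reduce2(grouped):
    return {city: sum(values) for city, values in grouped.items()}

stage1 = reduce1(shuffle(map1()))
print(reduce2(shuffle(stage1)))   # {'Chennai': 180, 'Bangalore': 70}
```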

Clearly the map-reduce algorithm addresses some major classes of problems. It is extremely useful when there is a need to perform the same operation on many documents, for instance in building an inverted index or in computing PageRank. Map-reduce is also very powerful for matrix operations; large classes of problems in machine learning and computer vision use matrices extensively, and map-reduce becomes critical when these operations have to be done on large volumes of data. Besides, the ability of map-reduce to perform a large set of database operations can be used in many situations on the web.

However, it is no silver bullet for all types of problems.


Towards an auction-based Internet

The post below was quoted and discussed extensively in GigaOM, 14 Jan 2011 – Software Defined Networks could create an auction-based bazaar.

Published in Telecom Asia, Jan 13, 2012 – Towards an auction-based internet

Are we headed to an auction-based Internet? This train of thought (no pun intended), which struck me while I was travelling from Chennai to Bangalore last evening, was the result of the synthesis of different ideas and technologies I had read about in the recent past.

The current state of technology and the technology trends do seem to indicate such a possibility. An auction-based internet would be a business model in which bandwidth would be allocated to different data traffic on the internet based on dynamic bidding by different network elements. Such an eventuality is a distinct possibility considering the economics and latencies involved in data transfer, the evolution of the smart grid concept and the emergence of the promising technology known as the OpenFlow protocol. This is further elaborated below.

Firstly, in the book “Grids, Clouds and Virtualization” by Massimo Cafaro and Giovanni Aloisio, the authors highlight a typical problem of the computing infrastructure of today. They contend that a key issue in large scale computing is data affinity, which is the result of the dual issues of data latency and the economics of data transfer. They quote Jim Gray (Turing Award, 1998), whose paper on “Distributed Computing Economics” states that programs need to be migrated to the data on which they operate rather than transferring large amounts of data to the programs. This is in fact used in the Hadoop paradigm, where the principle of locality is maintained by keeping the programs close to the data on which they operate.

The book highlights another interesting fact. It says the “cheapest and fastest way to move a Terabyte cross country is sneakernet” (i.e. the transfer of electronic information, especially computer files, by physically carrying removable media such as magnetic tape, compact discs, DVDs, USB flash drives or external drives from one computer to another). Google used sneakernet to transfer 120 TB of data. The SETI@home project also used sneakernet to transfer data recorded by its telescope in Arecibo, Puerto Rico, stored on magnetic tapes, to Berkeley, California.

It is now a well known fact that mobile and fixed line data has virtually exploded, clogging the internet. YouTube, video downloads and other streaming data choke the data pipes of the internet, and Service Providers have not found a good way to monetize this data explosion. While there have been tremendous advances in CPU processing power (in the range of petaflops) and enormous increases in storage capacity (of the order of petabytes), coupled with dropping prices, there has been no corresponding drop in bandwidth prices relative to bandwidth capacity.

Secondly, in the book “Hot, Flat, and Crowded”, Thomas L. Friedman describes the “Smart Homes” of the future, in which all home appliances will have sensors and will participate in real-time energy auctions as part of the Smart Grid. The price of energy on the Energy Grid fluctuates like a stock price as enterprises bid for energy during the day. In his Smart Home, Friedman envisions a situation in which the washing machine turns on during off-peak hours, when the price of energy on the grid is low. In this way all the appliances in the homes of the future will minimize energy consumption by adjusting their cycles accordingly.

Why could the internet not behave in a similar fashion? The internet pipes get crowded at different periods of the day, in different seasons and during popular sporting events. Why can we not have an intelligent network in which the price of different data transfer rates varies depending on the time of day, the type of traffic and the quality of service required? Could the internet be based on an auction mechanism in which different devices bid for bandwidth based on the urgency, speed and quality of service required? Is this possible with the routers and switches of today?

The answer is yes. This can be achieved with the path-breaking innovation known as Software Defined Networks (SDNs), based on the OpenFlow protocol. SDN is the result of pioneering work by Stanford University and the University of California, Berkeley, and represents a paradigm shift in the way networking elements operate. Do read my post Software Defined Networks: A glimpse of tomorrow for a more detailed look at SDNs. SDNs can be made to route traffic flows dynamically based on decisions taken in real time. The flow of data packets through the network can be controlled programmatically through the OpenFlow protocol. In order to dynamically allocate smaller or fatter pipes to different flows, the logic in the flow controller has to be updated dynamically based on the bid price.

For example, we could assume that an enterprise has three different classes of flows: immediate, ASAP (as soon as possible) and ‘price below $x’. Based on the upper ceiling for the bid price, the OpenFlow controller will allocate a flow for the enterprise’s immediate traffic. For the ASAP flow, the enterprise would have requested that the flow be arranged when the bid price falls within a range of $a – $b, and the OpenFlow controller will arrange for such a flow. The last type of traffic, which is not urgent, will be sent during non-peak hours. This requires that the OpenFlow controller be able to allocate different flows dynamically based on the outcome of the auction process. In contrast, the current protocols of the internet, namely RSVP and DiffServ, allocate pipes based on traffic type and class, which is static once allocated. The auction strategy enables OpenFlow to adjust traffic flows dynamically based on the bid price currently prevailing in that part of the network.
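Purely as a hypothetical illustration, the controller-side admission logic could look something like the sketch below. None of the class names, prices or thresholds come from the OpenFlow specification; OpenFlow itself only provides the means to program the flow tables once such a decision has been taken.

```python
# Hypothetical bid-based flow admission policy (illustrative only; not part of
# the OpenFlow protocol, which only programs flow tables once a decision is made).
def admit_flows(flow_requests, current_price):
    """Decide which requested flows get a pipe at the prevailing bandwidth price.

    flow_requests is a list of dicts such as:
        {"name": "immediate", "max_bid": 10.0}            # send now, almost any price
        {"name": "asap", "min_bid": 2.0, "max_bid": 5.0}  # send when price is in range
        {"name": "bulk", "max_bid": 1.0}                  # send only when price is low
    """
    admitted, deferred = [], []
    for f in flow_requests:
        lo = f.get("min_bid", 0.0)
        if lo <= current_price <= f["max_bid"]:
            admitted.append(f["name"])   # program a flow entry for this traffic
        else:
            deferred.append(f["name"])   # retry when the auction price changes
    return admitted, deferred

requests = [{"name": "immediate", "max_bid": 10.0},
            {"name": "asap", "min_bid": 2.0, "max_bid": 5.0},
            {"name": "bulk", "max_bid": 1.0}]
print(admit_flows(requests, current_price=3.0))  # (['immediate', 'asap'], ['bulk'])
```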

The ability of the OpenFlow protocol to dynamically allocate different flows could once and for all solve the problem of monetizing mobile and fixed line data. Users can decide the type of service they are interested in and choose appropriately. This would be a win-win for both the Service Provider and the consumer. The Service Provider would be able to get a return on investment for the infrastructure based on the traffic flowing through the network, while a consumer with low bandwidth usage could pay a smaller charge rather than a fixed access charge.

An auction-based internet is not just a possibility but would also be a worthwhile business model to pursue. The ability to route traffic dynamically through an auction mechanism enables the internet infrastructure to be utilized optimally. It would serve the dual purpose of easing traffic congestion, since the highest bidders get the pipe, and of monetizing data traffic based on its importance to the end user.

An auction based internet is a very distinct possibility in our future given the promise of the OpenFlow protocol.

All thoughts, ideas or counter opinions are welcome!


Big Data – Getting bigger!

Published in Telecom Asia – Big Data is getting bigger

There are two very significant ways in which our world has changed in the past decade. Firstly, we are more “connected”. Secondly, we are “awash with data”. On a planet with 7 billion people there are now 2 billion PCs and upward of 6 billion mobile connections. Besides the connections we as human beings have, there are now numerous connections to the internet from devices, sensors and actuators. In other words, the world is getting more and more instrumented. There are in excess of 30 billion RFID tags which enable tracking of goods as they move from warehouse to retail store, sensors on cars and bridges, and cardiac implants in the human body, all constantly sending streams of data to the network (do look at my post “The Internet of Things”). In addition we have the emergence of the Smart Grid, with its millions and millions of smart meters that are capable of sensing power loads, appropriately redistributing power and drawing less power during peak hours.

All these devices, be they laptops, cell phones, sensors, RFIDs or smart meters, are sending enormous amounts of data to the network. In other words, there is an enormous data overload happening in the networks of today. According to a Cisco report, the projected increase in data traffic between 2014 and 2015 is of the order of 200 exabytes (10^18 bytes). In addition, the report states that the total number of devices connected to the network will be twice the world population, or around 15 billion.

Fortunately, the explosion in data has been accompanied by falling storage prices and extraordinary increases in processing capacity. The data generated by these devices, cell phones, PCs and so on is by itself useless. However, if processed, it can provide insights into trends and patterns which can be used to make key decisions. For example, the data exhaust from a user’s browsing trail and click stream provides important insight into user behavior, which can be mined to make important decisions. Similarly, inputs from social media like Twitter and Facebook provide businesses with key inputs for making business decisions. Call Detail Records created for mobile calls can also be a source of insight into user behavior, and data from retail stores provides insights into consumer choices. For all this to happen, the enormous amounts of data have to be analyzed using algorithms that determine statistical trends, patterns and tendencies in the data.

It is here that Big Data enters the picture. Big Data enables the management of the 3 V’s of data, namely volume, velocity and variety. As mentioned above, the volume of data is growing at an exponential rate and should exceed 200 exabytes by 2015. The rate at which the data is generated, or the velocity, is also growing phenomenally given the variety and the number of devices that are connected to the network. Besides, there is tremendous variety in the data: it is structured, semi-structured and unstructured, and logs could be in plain text, CSV, XML, JSON and so on. These 3 V’s make Big Data techniques the most suitable approach for crunching this enormous proliferation of data at the velocity at which it is generated.

Big Data: Big Data, or analytics (see my post “The Rise of Analytics”), deals with algorithms that analyze petabytes (10^15 bytes) of data and identify key patterns in them. The patterns so identified can be used to make important predictions about the future. For example, Big Data has been used by energy companies to identify key locations for positioning their wind turbines; identifying the precise location requires that petabytes of data be crunched rapidly and the appropriate patterns identified. There are several other applications of Big Data, from identifying brand sentiment from social media and customer behavior from click exhaust to identifying optimal power usage by consumers.

The key differences between Big Data and traditional processing methods are the volume of data that has to be processed and the speed with which it has to be processed. As mentioned before, the 3 V’s of volume, velocity and variety make traditional methods unsuitable for handling this data. In this context, besides the key algorithms of analytics, another player is extremely important in Big Data – Hadoop. Hadoop is a processing framework that involves tremendous parallelization of tasks (for details see To Hadoop, or not to Hadoop).

The Hadoop Ecosystem – Hadoop had its origins in Google’s work on the Google File System (GFS) and the Map-Reduce programming paradigm.

HDFS and Map-Reduce: Hadoop in essence is the Hadoop Distributed File System (HDFS) plus the Map-Reduce paradigm. A Hadoop system can be made up of thousands of distributed commodity servers. Data is stored in HDFS in blocks of 64 MB or 128 MB and is replicated across two or more servers for redundancy. Since Hadoop runs on regular commodity servers which are prone to failures, fault tolerance is included by design. The Map-Reduce paradigm essentially breaks a job into multiple tasks which are executed in parallel. First the “map” part processes the input data and outputs key-value tuples. The “reduce” part then scans these tuples and generates a consolidated output. For example, the “map” part could count the occurrences of different words in different sets of files and output the words and their counts as pairs, and the “reduce” would then sum the counts of each word from the individual “map” parts and provide the total occurrences of the words across the files.
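As a rough illustration, a Hadoop Streaming style version of this word count example could be written as two small scripts along these lines (the file names mapper.py and reducer.py are just illustrative; Hadoop Streaming feeds the input splits to the mapper on standard input and delivers the mapper output to each reducer sorted by key):

```python
#!/usr/bin/env python3
# mapper.py -- emits one "word<TAB>1" line per word read from standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts per word; because the framework sorts by key,
# all lines for the same word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```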

Pig and Pig Latin: Pig is a platform developed at Yahoo to relieve programmers of the intricacies of programming Map-Reduce jobs and assigning tasks to individual parts. Pig is made up of two parts, namely Pig Latin, the language, and the runtime environment in which it executes.

Hive: Hive is a Hadoop run-time support structure that was developed by Facebook. Hive has a distinct SQL flavor to it and also simplifies the task of Hadoop programming.

JAQL: JAQL is a declarative query language developed by IBM for handling JSON objects. JAQL is another way of programming Hadoop.

Conclusion: It is a foregone conclusion that Big Data and Hadoop will take center stage in the not too distant future, given the explosion of data and the dire need to glean useful business insights from it. Big Data and its algorithms provide the way to identify useful pearls of wisdom in otherwise useless data. Big Data is bound to become mission critical in the enterprises of the future.


To Hadoop, or not to Hadoop

Published in Telecom Asia, Jul 23, 2012 – To Hadoop or not to Hadoop

To Hadoop, or not to Hadoop: that is the question. In many of my discussions I find that Hadoop, with its implementation of Map-Reduce, crops up time and time again. To many, Map-Reduce is the panacea for all kinds of performance evils: it appears that somehow using Map-Reduce in your application will magically transform it into a high-performing, screaming application.

The fact is that the Map-Reduce algorithm is applicable only to certain classes of problems. It is ideally suited to what is commonly referred to as the “embarrassingly parallel” class of problems – problems that are inherently parallel, for example the creation of inverted indices from web-crawled documents.

Map-Reduce is an algorithm that was popularized by Google. The term map-reduce actually originates from Lisp, in which the “map” function takes a list of arguments and performs the same operation on each of them, and “reduce” then applies a common criterion to pick a reduced set of values from this list. Google uses Map-Reduce to create an inverted index. An inverted index basically provides a mapping from a word to the list of documents in which it occurs. This typically happens in two stages. A set of parallel “map” tasks take documents as input, parse them and emit a sequence of (word, document id) pairs. In other words, the map takes as input a key-value pair (k1, v1) and maps it into intermediate (k2, v2) pairs. The reduce tasks take the (word, document id) pairs, reduce them, and emit (word, list {document id}). Clearly, applications like the inverted index make sense for the Map-Reduce algorithm, as several mapping tasks can work in parallel on separate documents. Other typical applications are counting the occurrences of words in documents or the number of times a web URL has been hit in a traffic log.
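A minimal sketch of these two stages in plain Python might look as follows (illustrative only; in a real deployment the map and reduce tasks would run in parallel across many machines):

```python
from collections import defaultdict
import re

def map_inverted_index(doc_id, text):
    """Map: parse a document and emit a (word, doc_id) pair for each distinct word."""
    for word in set(re.findall(r"\w+", text.lower())):
        yield (word, doc_id)

def reduce_inverted_index(pairs):
    """Reduce: collect, for each word, the list of documents that contain it."""
    index = defaultdict(list)
    for word, doc_id in pairs:
        index[word].append(doc_id)
    return dict(index)

docs = {"doc1": "map reduce at web scale", "doc2": "hadoop implements map reduce"}
pairs = (p for d, t in docs.items() for p in map_inverted_index(d, t))
print(reduce_inverted_index(pairs))
# e.g. {'map': ['doc1', 'doc2'], 'web': ['doc1'], 'hadoop': ['doc2'], ...}
```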

The key point in all these typical classes of problems is that the problem can be handled in parallel. Tasks that can execute independently, besides being inherently parallel, are eminently suitable for Hadoop processing. Such tasks also work on extraordinarily large data sets, which is another criterion for Hadoop-worthy applications.

Hadoop uses a large number of commodity servers to execute the algorithm. A complementary technology along with Hadoop is the Hadoop Distributed File System (HDFS). The HDFS is a storage system in which the input data is partitioned across several servers. Google uses the Google File System (GFS) for its inverted index and page ranking algorithm.

Typical applications that are prime candidates for Hadoop are those that have to operate on terabytes of data. The additional requirement is that the application can run some sort of transformation or “map” algorithm on the data independently and produce an intermediate result for the “reduce” part of the algorithm. The “reduce” essentially applies some criterion to the intermediate sets to produce zero or one output value per key.

However, several real world applications do not fall into this category where the execution of the application can be parallelized. For example, an e-retail application which allows users to search for books and electronic products, add them to a shopping cart and finally make the purchase is, in my opinion, not really suited for Hadoop, as each individual transaction is separate and typically has its own unique sequential flow. An ad serving application is also not ideally suited for Hadoop, since each individual transaction has its own individual flow in time.

However, on closer look we can see that there are certain aspects of such applications that are conducive to the Hadoop-based Map-Reduce algorithm. For example, the application may need to search through large data sets: the e-retail application will have tens of thousands of electronic products and books from different vendors, each with its own product id. We could use Hadoop to pre-process these large amounts of data, classify them and create smaller data sets which the e-retail or other application can then use. Hadoop is a clear winner when large data sets have to be searched, sorted or have some subset selected from them.

In these kinds of applications Hadoop has a clear edge over other approaches, as it can really crunch data. Hadoop is also resilient to failures and is based on the principle of “data locality”, which allows a “map” or “reduce” task to use data stored locally on its server or on a neighboring machine.

Hence, while Hadoop is no silver bullet for all types of applications, if due diligence is performed we can identify aspects of the application whose data can be crunched by Hadoop.
