Reducing to the Map-Reduce paradigm- Thinking Web Scale – Part 1

In physics there are 4 types of forces – gravitational forces among celestial bodies, electro-magnetic forces and strong and weak forces at the sub-atomic level. The equations that seem to work among large bodies don’t seem to apply at the sub-atomic level though there have been several attempts at grand unification theories

Similarly in computing we have: – computing at personal level, enterprise level, data-center level and a web scale level. The problems and paradigms at each level are very different and unique. The sequential processing, relational database accesses or network speeds at the local area network level are very different to the parallel processing requirements, NoSQL based storage accesses  and WAN latencies.

Here is the first of my posts on paradigms at the Web Scale.

The internet now contains in excess of 1 billion hosts.  This is based on a report in the World Fact Book published in 2012.

In these 1 billion and odd hosts there are at least ~1.5 billion pages that have been indexed. There must be several hundred million that are not indexed by the major search engines.

Search engines like Google, Bing or Yahoo have to work on several hundred million pages.  Similarly social web sites like Facebook, Twitter or LinkedIn have to deal with several hundred million users who constantly perform status updates, upload images, tweet etc. To handle large quantities of data efficiently and quickly there is a need for web scale algorithms.

One such algorithm is the map-reduce, that had its origins in Google. The map reduce essentially consists of a set of mappers which take as input a key-value pair and outputs 0 or more key value pairs. The reducer takes all tuples with the same key and combines them based on some function and emits a key value pair

map_reduce

Map-reduce, and its open source avatar, Hadoop, are now used routinely to solve several large scale problems. To be honest, I was and still am, puzzled whether the 2 simple tasks types of mapping & reducing can be used for a large variety of problems. However, it appears so.

I would have assumed that there would have been other flavors, maybe an ‘identify-update’, ‘determine-solve’ or some such equivalent, unless a large set of problems can be expressed as some combination of the map reduce paradigm.

Anyway here a few examples for which the map reduce algorithm is useful.

Word Counting: The standard example for map-reduce is the word counting program. In this the map reduce algorithm generates a list of words with their corresponding word count from a set of input files. The Map task reads each document and breaks it into a sequence of words (w1, w2, w3 …). It then emits a key value pair as follows

(w1,1),(w2,1),(w3,1),(w1,1) and so on. If a word is repeated in the document it occurs multiple times in the output.  Now the entire key, value pairs are grouped by keys and sent to one of the reducer tasks. Each reducer will then sum all the values thus giving the total for each word.

a

Matrix multiplication: Big Data is a typical challenge in the web where there is a need to determine patterns and trends in mountains of data. Machine learning algorithms are utilized to determine structure in data that has 3 characteristics of volume, variety and velocity. Machine learning algorithms typically depend on matrix operations. Map-reduce is ideally suited for this and one of the original purposes of Google for map-reduce was with matrix multiplication.

Let us assume that we have a n x n matrix M whose element in row i and column j is mij

Also let us assume that there is a vector ‘v’ whose jth element is vj . Then the matrix vector product can be is the vector x of the length n whose ith element is given as

xi = ∑ mijvj

 

Map function: The map function applies to each single element of the matrix M. For each element mij the map task outputs a key-value pair as follows (i, mijvj).  Hence we will have a key-value pairs for all ‘i’ from 1 to n.

Reduce function:  The reduce function takes all pairs with the same key ‘i’ and sum it up.

Hence each reducer will generate

xi = ∑ mijvj

(Reference: Mining of Massive Datasets– Anand Rajaraman, Jure Leskovec, Jeffrey D Ullman)

This link gives a good write-up on a matrix x matrix multiplication,

Map-reduce for Relational Operations: Map-reduce can be used to perform a number of operations on large scale data that are used in database operations. Multiple database operations can be performed on large scale data like selection, projection, union, intersection, difference, natural join, grouping etc.

Here is a an example taken from ‘Web Intelligence & Big Data’ course from Coursera any Gautam Shroff.

Let us assume that there are 2 tables ‘Sales by address’ and “City by address’ and the need is to find the total ‘Sales by City’. The SQL query for this

SELECT SUM(Sale),City FROM Sales, City WHERE Sales.Addr_id = Cities.Addr_id GROUP BY City

This can be done by 2 map-reduce tasks.

The first map-reduce task GROUPs BY Sales as follows

Map1: The first map task will emit (Address, rest of record (SALE/City))

Reduce1: The first reduce task will SUM (Sales) by Address for every City. Clearly this will have multiple occurrences of City.

At this point we will have the sum of the sales for every city. However each city can occur multiple times. Now we have to GROUP BY City

Map2: Now the mapper emits the (City, rest of record (SALES)

Reduce2: The 2nd reduce now SUMS all the sales for each city.

Clearly the map-reduce algorithm does solve some major areas. It is extremely useful when there is a need to perform the same operation on multiple documents. It would definitely be useful in building the inverted index or in Page rank. Also, map-reduce is very powerful in handling matrix operations. Large class of problems like machine learning, computer vision all use matrices extensively and map-reduce is extremely critical when it has done in large volumes of data.  Besides, the ability of map-reduce to perform a large set of database operations is something that can be used in many situations in the web.

However it is no silver bullet for all types of problems.

Find me on Google+

Test driving Apache Hadoop: Standalone & pseudo-distributed mode

The Hadoop paradigm originated from Google and is used in crunching large data sets. It is ideally suited for applications like Big Data, creating an inverted index used by search engines and other problems which require terabytes of data to processed in parallel. One key aspect of Hadoop is that it is made up of commodity servers. Hence server, disk crashes or network issues are assumed to be norm rather than an exception.

The Hadoop paradigm is made of the Map-Reduce  & the HDFS parts. The Map-Reduce has 2 major components to it. The Map part takes as input key- value pairs and emits a transformed key-value pair. For e.g Map could count the number of occurrences of words or created an inverted index of a word and its location in a document. The Reduce part takes as input the emitted key-value pairs of the Map output and performs another operation on the inputs from the Map part for.e.g summing up the counts of words. A great tutorial on Map-Reduce can be found at http://developer.yahoo.com/hadoop/tutorial/module4.html

The HDFS (Hadoop Distributed File System) is the special storage that is used in tandem with the Map-Reduce algorithm. The HDFS distributes data among Datanodes. A Namenode maintains the meta data of where individual pieces of data are stored.

To get started with Apache Hadoop download a stable release of Hadoop from (e.g. hadoop-1.0.4.tar.gz)

http://hadoop.apache.org/common/releases.html#Download

a) Install Hadoop on your system preferably in /usr/local/
tar xzf ./Downloads/hadoop-1.0.4.tar.gz

sudo mv hadoop-1.0.4 hadoop (rename hadoop-1.0.4 to hadoop for convenience)
sudo chown -R hduser:hadoop hadoop
Apache Hadoop requires Java to be installed. Download and install Java on your machine from

http://www.oracle.com/technetwork/java/javase/downloads/jdk-7u4-downloads-1591156.html

b) After you have installed java set $JAVA_HOME in
usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/bin/java (uncomment and set the correct path)

c) Create a user hduser in group hadoop
For this click  Applications->Other->User & Groups
Choose Add User – hduser  &  Add Group – hadoop
Choose properties and add hduser to the hadoop group.

Standalone Operation
Under root do
/usr/sbin/sshd
then
$ssh localhost

If you cannot do a ssh to localhost with passphrase then do the following
$ ssh-keygen -t rsa -P “”

You will get the following

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.


$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now  re-run
$ssh localhost – This time it should go fine

Create a directory input and copy *.xml files from conf/
$mkdir input
$cp /usr/local/hadoop/share/hadoop/templates/conf/*.xml input

Then execute the following. This searches for the string “dfs*” in all the XML files under the input directory
$/usr/local/hadoop/bin/hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar grep input output ‘dfs[a-z.]+’
You should see
[root@localhost hadoop]# /usr/local/hadoop/bin/hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar grep input output ‘dfs[a-z.]+’
12/06/10 13:00:51 INFO util.NativeCodeLoader: Loaded the native-hadoop library
..

12/06/10 13:01:45 INFO mapred.JobClient:     Reduce output records=38
12/06/10 13:01:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
12/06/10 13:01:45 INFO mapred.JobClient:     Map output records=38

Where it indicates that there are 38 such record strings with dfs* in it.
If you get an error
java.lang.OutOfMemoryError: Java heap space
then increase the heap size from 128 to 1024 as below

<property>
<name>mapred.child.java.opts</name>
<value>-server -Xmx1024m -Djava.net.preferIPv4Stack=true</value>
</property>
….

Pseudo distributed mode
In the pseudo distributed mode separate Java processes are started for the Job Tracker which schedules tasks, the Task tracker which executes tasks and the Namenode which contains the data
A good post on Hadoop standalone mode is given in Michael Nolls post – Running Hadoop on Ubuntu Linux (single node cluster)

a) Execute the following commands
. ./.bashrc
$mkdir -p /home/hduser/hadoop/tmp
$chown hduser:hadoop /home/hduser/hadoop/tmp
$chmod 750 /home/hduser/hadoop/tmp

b) Do the following
Note:Files  core-site.xml, mapred-site,xml & hdfs-site.xml exist under
/usr/local/hadoop/share/hadoop/templates/conf& usr/local/hadoop/conf
It appears that Apache hadoop gives precedence to /usr/local/hadoop/conf. So add the following between <configuration and </configuration>
In file /usr/local//hadoop/conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system.  Either the
literal string “local” or a host:port for NDFS.
</description>
<final>true</final>
</property>

In  /usr/local//hadoop/conf//mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<final>true</final
</property>

In  /usr/local//hadoop/conf//hdfs-site.xml add
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

Now perform
$sudo /usr/sbin/ssh
$ssh localhost (Note you may have to generate key as above if you get an error)

c) Since the pseudo distributed mode will use the HDFS file system we need to format this.So run the following command
$$HADOOP_HOME/bin/hadoop namenode -format
[root@localhost hduser]#  /usr/local/hadoop/bin/hadoop namenode -format
12/06/10 15:48:16 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by ‘hortonfo’ on Tue May  8 20:16:59 UTC 2012
************************************************************/

12/06/10 15:48:17 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/06/10 15:48:18 INFO common.Storage: Storage directory /home/hduser/hadoop/tmp/dfs/name has been successfully formatted.
12/06/10 15:48:18 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
d) Now start all the Hadoop processes
$/usr/local/hadoop/bin/start-all.sh
starting namenode, logging to /var/log/hadoop/root/hadoop-root-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /var/log/hadoop/root/hadoop-root-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /var/log/hadoop/root/hadoop-root-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to /var/log/hadoop/root/hadoop-root-jobtracker-localhost.localdomain.out ocalhost: starting tasktracker, logging to /var/log/hadoop/root/hadoop-root-tasktracker-localhost.localdomain.out
Verify that all processes have started by executing /usr/java/jdk1.7.0_04/bin/jps
[root@localhost hduser]# /usr/java/jdk1.7.0_04/bin/jps
10971 DataNode
10866 NameNode
11077 SecondaryNameNode
11147 JobTracker
11264 TaskTracker
11376 Jps

You will see JobTracker,taskTracker,NameNode,DataNode and SecondaryNameNode

You can also do netstat -plten | grep java
tcp        0      0 0.0.0.0:50090               0.0.0.0:*                   LISTEN      0          166832     11077/java
tcp        0      0 0.0.0.0:50060               0.0.0.0:*                   LISTEN      0          167407     11264/java
tcp        0      0 0.0.0.0:50030               0.0.0.0:*                   LISTEN      0          166747     11147/java
tcp        0      0 0.0.0.0:50070               0.0.0.0:*                   LISTEN      0          165669     10866/java
tcp        0      0 0.0.0.0:50010               0.0.0.0:*                   LISTEN      0          166951     10971/java
tcp        0      0 0.0.0.0:50075               0.0.0.0:*                   LISTEN      0          166955     10971/java
tcp        0      0 127.0.0.1:55839             0.0.0.0:*                   LISTEN      0          166816     11264/java
tcp        0      0 0.0.0.0:50020               0.0.0.0:*                   LISTEN      0          165843     10971/java
tcp        0      0 127.0.0.1:54310             0.0.0.0:*                   LISTEN      0          165535     10866/java
tcp        0      0 127.0.0.1:54311             0.0.0.0:*                   LISTEN      0          166733     11147/java

e) Now copy files from your local directory /home/hduser/input  to the HDFS file system
Now you can check the web interface for JobTracker & NameNode
This is as per mapred-site.xml & hdfs-site.xml in /conf directory. They are at
ñJobTracker – http://localhost:50030/

ñNameNode – http://localhost:50070/

f) Copy files from your local directory to HDFS
$/usr/local/hadoop/bin/hadoop dfs -copyFromLocal /home/hduser/input /user/hduser/input
Ensure that the files have been copies by listing the contents of HDFS

g) Check that files have been copied
$/usr/local/hadoop/bin/hadoop dfs -ls /user/hduser/input
Found 9 items
-rw-r–r–   1 root supergroup       7457 2012-06-10 10:31 /user/hduser/input/capacity-scheduler.xml
-rw-r–r–   1 root supergroup       2447 2012-06-10 10:31 /user/hduser/input/core-site.xml
-rw-r–r–   1 root supergroup       2300 2012-06-10 10:31 /user/hduser/input/core-site_old.xml
-rw-r–r–   1 root supergroup       5044 2012-06-10 10:31 /user/hduser/input/hadoop-policy.xml
-rw-r–r–   1 root supergroup       7595 2012-06-10 10:31 /user/hduser/input/hdfs-site.xml

h) Now execute the grep functionality
[root@localhost hduser]# /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar grep /user/hduser/input /user/hduser/output ‘dfs[a-z.]+’
12/06/10 10:34:22 INFO util.NativeCodeLoader: Loaded the native-hadoop library
….
12/06/10 10:34:23 INFO mapred.JobClient: Running job: job_201206101010_0003
12/06/10 10:34:24 INFO mapred.JobClient:  map 0% reduce 0%
12/06/10 10:34:48 INFO mapred.JobClient:  map 11% reduce 0%


12/06/10 10:35:21 INFO mapred.JobClient:  map 88% reduce 22%
12/06/10 10:35:24 INFO mapred.JobClient:  map 100% reduce 22%
12/06/10 10:35:27 INFO mapred.JobClient:  map 100% reduce 29%
12/06/10 10:35:36 INFO mapred.JobClient:  map 100% reduce 100%
12/06/10 10:35:42 INFO mapred.JobClient: Job complete: job_201206101010_0003
….
12/06/10 10:36:16 INFO mapred.JobClient:     Reduce input groups=3
12/06/10 10:36:16 INFO mapred.JobClient:     Combine output records=0
12/06/10 10:36:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=180502528
12/06/10 10:36:16 INFO mapred.JobClient:     Reduce output records=36
12/06/10 10:36:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=695119872
12/06/10 10:36:16 INFO mapred.JobClient:     Map output records=36
i) Check the result
[root@localhost hduser]# /usr/local/hadoop/bin/hadoop dfs -cat /user/hduser/output/*
6          dfs.data.dir
2          dfs.
2          dfs.block.access.token.enable
2          dfs.cluster.administrators
2          dfs.datanode.address
2          dfs.datanode.data.dir.perm
2          dfs.datanode.http.address
2          dfs.datanode.kerberos.principal
2          dfs.datanode.keytab.file
2          dfs.exclude
…..

j)/usr/local/hadoop/bin/stop-all.sh
Have fun with hadoop…

Find me on Google+

Towards an auction-based Internet

The post below was quoted and discussed extensively in (see the link) GigaOM, 14 Jan 2011 – Software Defined Networks could create an auction-based bazaar.

Published in Telecom Asia, Jan 13,2012 – Towards an auction-based internet

Are we headed to an auction-based Internet? This train of thought (no pun intended), which struck me while I was travelling from Chennai to Bangalorelast evening, was the result of the synthesis  of different ideas and technologies which I had read  in the recent past.

The current state of technology and the technology trends do seem to indicate such a possibility.  An auction-based internet would be a business model in which bandwidth would be allocated to different data traffic on the internet based on dynamic bidding by different network elements. Such an eventuality is a distinct possibility considering the economics and latencies involved in data transfer, the evolution of the smart grid concept and the emergence of the promising technology known as the OpenFlow protocol.  This is further elaborated below

Firstly, in the book “Grids, cloud and virtualization”, by Massimo Caforo and Giovanni Aloisio, the authors highlight a typical problem of the computing infrastructure of today. In the book, the authors contend that a key issue in large scale computing is data affinity, which is the result of the dual issues of data latency and the economics of data transfer. They quote, Jim Gray (Turing award in 1998) whose paper on “Distributed Computing Economics” states that that programs need to be migrated to the data on which they operate rather than transferring large amounts of data to the programs.  This is in fact used in the Hadoop paradigm, where the principle of locality is maintained by keeping the programs close to the data on which they operate.

The book highlights another interesting fact. It says “cheapest and fastest way to move a Terabyte cross country is sneakernet (i.e. the transfer of electronic information, especially computer files, by physically carrying removable media such as magnetic tape, compact discs, DVDs, USB flash drives, or external drives from one computer to another). Google used sneakernet to transfer 120 TB of data. The SETI@home also used sneakernet to transfer data recorded by their telescopes inArecibo, Puerto Rico stored in magnetic tapes toBerkeley,California.

It is now a well known fact that mobile and fixed line data has virtually exploded clogging the internet. YouTube, video downloads and other streaming data choke the data pipes of the internet and Service Providers have not found a good way to monetize this data explosion. While there has been a tremendous advancement in CPU processing power (CPU horsepower in the range of petaflops) and enormous increases in storage capacity(of the order of petabytes) coupled with dropping prices,  there has been no corresponding drop in bandwidth prices in relation to the bandwidth capacity.

Secondly, in the book “Hot, flat and crowded” Thomas L. Friedman  describes the “Smart Homes” of the future in which all the home appliances will have sensors and will participate in the energy auction in real time as a part of the Smart Grid.  The price of energy in the Energy Grid fluctuates like stock prices since enterprises are bidding for energy during the day. In his Smart Home, Friedman envisions a situation in which the washing machine will turn on during off-peak hours when the prices of energy in the energy grid is low. In this way all the appliances in the homes of the future will minimize energy consumption by adjusting the cycles accordingly.

Why could not the internet also behave in a similar fashion? The internet pipes get crowded at different periods of the day, during seasons and during popular sporting events. Why cannot we have an intelligent network in place in which price of different data transfer rates vary depending on the time of the day, the type of traffic and the quality of service required.  Could the internet be based on an auction-mechanism in which different devices bid for bandwidth based on the urgency, speed and quality of services required? Is this possible with the routers, switches of today?

The answer is yes. This can be achieved by the new, path breaking innovation known as Software Defined Networks (SDNs) based on the OpenFlow protocol. SDN is the result of pioneering effort by Stanford University and University of California, Berkeley and is based on the Open Flow Protocol and represents a paradigm shift to the way networking elements operate.  Do read my post Software Defined Networks : A glimpse of tomorrow   for a more detailed look at SDNs. SDNs can be made to dynamically route traffic flows based on decisions in real time.  The flow of data packets through the network can be controlled in a programmatic manner through the OpenFlow protocol. In order to dynamically allocate smaller or fatter pipes for different flows, it necessary for the logic in the Flow Controller to be updated dynamically based on the bid price.

For e.g. we could assume that a corporate has 3 different flows namely, immediate, (ASAP), price below $x. Based on the upper ceiling for the bid price, the OpenFlow controller will allocate a flow for the immediate traffic of the corporation. For the ASAP flow, the corporate would have requested that the flow be arranged when the bid price falls between a range $a – $b. The OpenFlow Controller will ensure that it can arrange for such a flow. The last type of traffic which is not important it will be send during non-peak hours. This will require that the OpenFlow controller be able to allocate different flows dynamically based on winning the auction process that happens in this scheme. The current protocols of the internet of today namely RSVP, DiffServ allocate pipes based on the traffic type & class which is static once allocated. This strategy enables OpenFlow to dynamically adjust the traffic flows based on the current bid price prevailing in that part of the network.

The ability of the OpenFlow protocol to be able to dynamically allocate different flows will once and for all solve the problem of being able to monetize mobile and fixed line data.  Users can decide the type of service they are interested and choose appropriately. This will be a win-win for both the Service Providers and the consumer. The Service Provider will be able to get a ROI for the infrastructure based on the traffic flowing through his network. The consumer rather than paying a fixed access charge could have a smaller charge because of low bandwidth usage.

An auction-based internet is not just a possibility but would also be a worthwhile business model to pursue. The ability to route traffic dynamically based on an auction mechanism in the internet enables the internet infrastructure to be utilized optimally. It will serve the dual purpose of solving traffic congestion, as highest bidders will get the pipe but will also monetize data traffic based on its importance to the end user.

An auction based internet is a very distinct possibility in our future given the promise of the OpenFlow protocol.

All  thoughts, ideas or counter opinions are welcome!

Find me on Google+

Big Data – Getting bigger!

Published in Telecom Asia – Big Data is getting bigger

There are two very significant ways that our world has changed in the past decade. Firstly, we are more “connected”. Secondly we are “awash with data.” In a planet with 7 billion people there are now 2 billion PCs and upward of 6 billion mobile connections. Besides the connection which we as human beings have there are now numerous connections to the internet from devices, sensors and actuators. In other words the world is getting more and more instrumented. There are in excess of 30 billion RFID tags which enable tracking of goods as they move from warehouse, to retail store, sensors on cars and bridges besides cardiac implants in the human body that are constantly sending a stream of data to the network (do look at my post The Internet of Things” . In addition we have the emergence of the Smart Grid with its millions and millions of smart meters that are capable of sensing power loads and appropriately redistributing power and drawing less power during peak hours.

All these devices be it laptops, cell phones, sensors, RFIDs or smart meters are sending enormous amounts of data to the network. In other words there is an enormous data overload happening in the networks of today. According to a Cisco report the projected increase in data traffic between 2014 and 2015 is of the order of 200 exabytes (10^18)). In addition the report states that the total number of connected to the network will be twice the world population or around 15 billion).

Fortunately the explosion in data has been accompanied by falling prices in storage and extraordinary increases in processing capacity. The data that is generated by the devices by the devices, cell phones, PC etc by themselves are useless. However if processed they can provide insights into trends and patterns which can be used to make key decisions. For e.g. the data exhaust that comes from a user’s browsing trail, click stream provide important insight into user behavior which can be mined to make important decisions. Similarly inputs from social media like Twitter, Facebook provide businesses with key inputs which can be used for making business decisions. Call Detail records that are created for mobile calls can also be a source of user behavior. Data from retail store provide insights into consumer choices. For all these to happen the enormous amounts of data has to be analyzed using algorithms to determine statistical trends, patterns and tendencies in the data.

It is here that Big Data enters the picture. Big Data enables the management of the 3 V’s of data , namely volume, velocity and variety. As mentioned above the volume of data is growing at an exponential rate and should exceed 200 exabytes by 2015. The rate at which the data is generated, or the velocity, is also growing phenomenally given the variety and the number of devices that are connected to the network. Besides there is a tremendous variety to the data. Data is both structured, semi-structured and unstructured. Logs could be in plain text, CSV,XML, JSON and so on. The issue of 3 V’s of data makes Big Data most suited for crunching this enormous proliferation of data at the velocity at which it is generated.

Big Data : Big Data or Analytics (see my post “The Rise of Analytics” ) deals with the algorithms that analyze petabytes (10^15)of data and identify key patterns in them. The patterns that are so identified can be used to make important predictions in the future. For example Big Data has been used by energy companies in identifying key locations for positioning their wind turbines. To identify the precise location requires that petabytes of data be crunched rapidly and appropriate patterns be identified. There are several applications of Big Data including identifying brand sentiment from social media, to customer behavior from click exhaust to identifying optimal power usage by consumers.

The key difference between Big Data and traditional processing methods are that the volume of data that has be processed and the speed with which it has to be processed. As mentioned before the 3 V’s of volume, velocity and variety make traditional methods unsuitable for handling this data. In this context, besides the key algorithms of analytics another player is extremely important in Big Data – that is Hadoop. Hadoop is a processing technique that involves tremendous parallelization of the task (for details look at To Hadoop, or not to Hadoop)

The Hadoop Ecosystem – Hadoop had its origins at Google during its work with the Google’s File System (GFS) and the Map Reduce programming paradigm.

HDFS and Map-Reduce : Hadoop in essence is the Hadoop Distributed File System (HDFS) and the Map Reduce paradigm. The Hadoop System is made up of thousands of distributed commodity servers. The data is stored in the HDFS in blocks of 64 MB or 128 MB. The data is replicated among two or more servers to maintain redundancy. Since Hadoop is made of regular commodity servers which are prone to failures, fault tolerance is included by design. The Map Reduce Paradigm essentially breaks a job into multiple tasks which are executed in parallel. Initially the “Map” part processes the input data and outputs a pair of tuples. The “Reduce” part then scans the pair of tuples and generates a consolidated output. For e.g. The “map” part could count the number of occurrences of different words in different sets of files and output the words and their count as pairs. The “reduce” would then sum up the counts of the word from the individual ‘map’ parts and provide the total occurrences of the words in multiple files.

Pig and PigLatin : This is a programming language developed at Yahoo to relieve programmers of the intricacies of programming the Map-Reduce and assigning tasks to individual parts. Pig is made up of two parts namely PigLatin, the language and the environment in which it will execute.

Hive: Hive is a Hadoop run-time support structure that was developed by Facebook. Hive has a distinct SQL flavor to it and also simplifies the task of Hadoop programming.

JAQL : JAQL is a declarative query language developed by IBM for handling JSON objects. JAQL is another programming paradigm that is used to programming Hadoop.

Conclusion: It is a foregone conclusion that Big Data and Hadoop will take center stage in the not too distant future given the explosion of data and the dire need of being able to glean useful business insights from them. Big Data and its algorithms provide the way for identifying useful pearls of wisdom from otherwise useless data. Big Data is bound to become mission critical in the enterprises of the future.

Find me on Google+