TWS-5: Google’s Page Rank: Predicting the movements of a random web walker

Internet history can be divided into 2 epochs. The epoch before the Google search and that after. Prior to Google there were many unsuccessful attempts to organize the Web, which a miniscule fraction of what we have today, through Web portals. So we had Yahoo, Excite, Alta-vista, Lycos etc. trying to categorize the pages of the Web into News, Sports, and Finance etc. Navigating through them was an exercise an frustration but one had to live with this for quite some time. ( The material for this post is taken from Mining Massive Datasets lecture from Coursera – Lecture by Prof. Jure Leskovec, Stanford University)

The Google Search powered by the Page Rank algorithm arrived at a time when the internet was exploding. This was precisely what ‘the doctor ordered’ as navigating the web became synonymous with the Web search. This post takes a look at the Page Rank algorithm behind Google Search.

The Web can be viewed as a large directed graph with out-links from Web pages to other pages (links from a page to external Web pages) and in-links into Web pages from other pages.

For the Google search, Google uses Web crawlers to index the pages of the Web and probably creates an inverted index of keywords to documents that contain them. It then uses the Page Rank algorithm to determine the relevance and importance of the Web page

How does Google identify the importance of a Web page?

The importance of a Web page is determined by the number of in-links to the page. Each in-link is considered a vote for this page. Also the in-link from an important page is higher than another in-link from a less important page. So for example an in-link from New York Times will be much larger than an in-link from the National Enquirer for example

In the figure above it can B has a highest Page Rank because it has the highest number of in-links. In addition the out-link from B to C increases the Page Rank of C.

A) Flow formulation: The Flow formulation for Page Rank is based on the following

Each Web page’s vote (in-link) is proportional to the importance of the source page
If a page ‘j’ with page rank rj has n out-links each link gets rj/n votes
Page ‘j’s own importance is the sum of all the votes on its in-links

Where rj = ri/3 + rk/4 as seen from the above figure

According to the Flow equation for Page rank, the rank rj for a page j is
rj = ∑ ri/d
I -> j

In other words the rank rj is the sum of the the in-links from all pages ri divided by its out-degree.

The flow equations for the above simple view of a Web links can be expressed as based on the rank ri of each node divided by its out-degree. So ry and ra have an out-degree of 2 and hence they are ry/2 and ra/2 per out-link

ry = ry/2 + ra/2
ra = rm + ry/2
rm = ra/2

B) The Matrix formulation

In the Matirx formulation for Page Rank an Adjacent matrix Mji is defined as follows
If a page I has di out-links
If page I has an out-link to page j then
I -> j Mji = 1/di else Mji =0

The Rank vector ri is the importance of page i
It is also assumed that ∑ri = 1

The Flow formulation for the above was shown to be
ry = ry/2 + ra/2
ra = rm + ry/2
rm = ra/2

The Matrix formulation is

However when we a billions of Web pages with several hundred thousand in-links and out-links the Page rank is iteratively calculated

If we start with

To start the page rank of ra=ry=rm = 1/3 so that the sum ∑ri =1
This is then iterated
Using the

r = M x r to arrive at values that converge
ry            ½     ½     0                             1/3
ra    =     ½     0      1          x               1/3
r m         0     ½      0                           1/3

This will eventually converge at ry=2/5 ra=2/5 and rm =1/5

The ability to rank Web pages on the order of importance was a real breakthrough for Google

The Page Rank also implies the probability that a Web surfer who randomly clicks the ou-links of a the Web pages will land on after some time. It is the probability of a random walk of the Web when clicking the Web links on pages at random.

While Google does a great job in crawling and serving pages it is rumored that more than 75% of the Web is inaccessible to Web search engines. This is known as the “Dark Net‘ or “Dark Web” much like the dark matter of the universe

Also see
1. A crime map of India in R: Crimes against women
2. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3. Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data
6. Deblurring with OpenCV:Weiner filter reloaded

Programming Zen and now – Some essential tips-2

This post is a follow-up to my earlier post – How to program – Some essential tips. In this post I expand on some of the ideas of my earlier post.

Programming means different things to different people. To some programming is a drudgery almost akin to manual labor, to others programming is an insurmountable mountain full of frustrations and disappointments while to others it is an intense problem solving and a creative activity. In my opinion programming can mean anything to you. It is your attitude towards coding that make it a chore, a daunting task or something really creative.

Here are some my insights on how to go about learning to code

Eyes wide open: People generally get frustrated when a piece of code that they wrote does not do what they intended it to do. In some cases the code snippet will do nothing when they were expecting final result, sometimes the code will crash or it will go into an infinite loop and drive the person nuts. (Let me assure you – I have been there, done that!) The usual reaction when this happens is anger and frustration where we generally tinker around with the code only to get the same result. Soon the emotions will progress from anger to hopelessness.

The first thing that one needs to while coding is to keep your ‘eyes wide open’. We tend to be guilty of ignoring the error messages that show up. Here one way to attack coding

a) Fully understand the ‘what’ of the problem. If there is an infinite loop or a core dump check after which point does it happen? If there is an execution error, what is the error trying to tell us?
b) Next look into ‘why’ the error occurred. You could either use debugger or insert appropriate print statements to take the offending code apart.
c) Thirdly think ‘how‘ you can address the situation. Make appropriate changes and re-run the code
d) Did it solve the issue.If yes, move forward. Otherwise go to step a)

Remember that we learn more from our programming mistakes more than when our code just ‘happens’ to work! Mistakes in our code make us to explain every part of the program

Changing times:

Times have changed. Programming Zen and programming now are worlds apart. In many ways, IDEs, Git, Google etc. have made the programmer’s life a lot easier

‘Git’ing from here to there: Here is a trick that I learnt fairly recently, though it should have occurred to me more than 2 years back. This is using Git judiciously for all programming tasks (Note: I am saying nothing new here!). I find it really useful in writing code with incremental changes. I create my initial code on the master and then test out incremental changes on a ‘new branch’ even for personal projects. Once I have proved a small increment works, I merge it with the ‘main’ branch. I again start working on the ‘new’ for the next incremental change followed by a merge to the master

The steps are

Make initial changes

1. git add .
2. git commit –m “ Initial changes’

Create a new branch
3. git checkout –b ‘new

Make incremental changes. Test.
4.git add .
5. git commit –m “Change 1”

Merge with the master
6.git checkout master
7. git merge new

Continue to work with ‘new’.
8 . git checkout new
9. Go to step 4)

This process can be continued till you get your final product. I find this extremely useful instead of just using an IDE to make code changes. Invariably you can run into a situation where you had something working some time back and in the next instant it is broken and you can’t figure out all the changes you made to the working code. This can be extremely frustrating. With Git you have a history of changes and you can switch to an earlier version of working code and start from there.

Rarely do I find a reason to have more than 1 branch

Here is a pictorial version of this

Taking help from Dr. Google: For most questions and errors that you encounter you will find others who have hit similar bugs. Just google it. You will more than surprised that others went down the exact same path that you are treading. Besides the internet is full of tutorials, blogs and articles on key aspects of programming

Explore the cave of Stack overflow: Spend time exploring Stack overflow. Stack overflow is replete with code snippets and questions that you wanted to ask. There is so much information out there. If you really don’t find an answer to your problem, post it in Stack overflow and you are bound to get an answer or a link to a similar question asked previously

Finally programming requires dollops of patience. Develop patience along with your skill in coding and soon programming will much more enjoyable to you.

1. Programming languages in layman’s language
2. The common alphabet of programming languages
3. How to program – Some essential tips
4. The mind of a programmer

How to program – Some essential tips

If one follows the arrow of time from the early 1980s to the present day, the number of programming problems have not only proliferated but have also become more difficult. Fortunately programming in itself has become more manageable with massive increases in computing horsepower, smarter tools and instant availability of information on the internet, typically with the click of a mouse.

Learning to program is no easy task, but can be done with the right mix of attitude, curiosity and interest. Becoming adept at programming, however, is something else. An interesting essay in this context is Peter Norvig’s ‘Teach yourself programming in 10 years’

Back in the 1980s when I wrote my first Fortran program on my college Mainframe, programming was a lengthy exercise, spanning several days.

My first program was to plot a sine wave of characters on a computer printout. Running this program required the following several steps

Enter the program on a teletype terminal and create a stack of Hollerith (punched) cards
Submit the stack of cards to the computer center
The computer center would do a batch execute in the evening on the Mainframe
God forbid, if your program has a syntax error. If you did find an error, go back to step 1, the next day.
Assuming everything is fine, the computer center would run your program and your output (printout) would be placed in the appropriate pigeon hole which you would need to pick up the next day.

The whole exercise to write a small-sized program could take anywhere between a couple of days to a whole week.

In the early 1990s things got a little better where one could code, compile, link and execute sitting at one’s desk. However while the programming itself got much simpler than before, certain tasks were still difficult. Till the late 90s programs of any sort had to be written using a regular text editor (vi , emacs etc.) You would then have to go through the process of compiling, linking and executing.

An angry compiler would typically spew forth venom at missing semi-colons, undeclared variables, and uninitialized values. This would happen till you are able to iron out all syntax errors. Then you would link, get undefined symbols and have to include appropriate libraries etc. And then finally you would execute your code, only to have it crash. The process of debugging would then start.

Luckily technology has made life a whole lot easier except for the last step where you could still run into an execution errors . In these days an IDE (Interactive Development Environment) like Eclipse will flag syntax errors, missing definitions/declarations etc. as you write your code. Moreover Eclipse can also indicate which libraries (imports) you would need to include in your package for it to build. The only missing step in IDEs of these days is the ability to predict possible execution errors in your program. I wouldn’t be surprised, if in future, like Microsoft Word, the IDE is able to tell you if a programming construct does not make sense.

So things have gotten a lot easier for the programmer. The following tips for are particularly useful as you progress along in programming

These days when you are learning a new programming language it is not necessary to know the language from cover to cover by reading a book. In those days when we learnt C it was necessary to know everything from bit structures, macros, pragma etc. The reason being that every syntax or execution error one had to rush to get the textbook and thumb through it for the answer. Not so, in these days of Google. You have the world’s library at your fingertips.
To get started it is necessary to learn just the most important programming constructs of the language say structure, class, car, cdr besides the usual suspects like loops, conditions and case constructs
Download and install an IDE for the language. In most case Eclipse will work
Try to write a simple program and test out your code.
To do any sort of programming these days you will necessarily need to make 3 friends
1. Google
2. Stackoverflow
3. Git & GitHub
Honing your Googling skills is very important. There are answers to almost any sort of programming problems out there. You would be surprised to know that there are many others who did exactly the same stupid mistake that you did out there. Also googling will take you to interesting tutorials, blogs, articles that discuss different aspects of the programming language and the problem you are trying to solve
Stackoverflow is really a God send to all programmers. There are so many questions on so many aspects of every programming language on earth there. If you spend time searching Stackoverflow you are bound to find answers, code snippets that you can readily use in your code
Post your questions in stackoverflow when you don’t find the answers there. You are bound to get quick answers. Thanks to the gamification of Stackoverflow (points, upvotes,badges etc) that has been created on Stackoverflow.
Git & GitHub: I would suggest that you download and install GitHub for Windows. This will provide you with version control on your desktop. You can modify code while being to switch back to an earlier version with Git. Read up a good tutorial on Git for Windows
Once you have working code you push it onto GitHub and share with other programmers

Now that you have the basic setup here are few other extremely important tips

The most important criteria for programming is ‘attitude’. Initially you are bound to get frustrated, angry, irritated etc. But it is necessary to look at the errors that you get with the right attitude. Know that an error is telling you something. Usually the answers to your mistake are in the ‘error message’ itself. Look at it closely and try to understand it. You will learn a lot more when you learn from errors than from copy-pasting from somebody else’s code, even if works right the first time around!
Make sure you do something different each time. As Einstein said “ If you keep doing the same thing, you will keep getting the same result’
There are different ways to debug your code. You could use the debugger and single step through the code and keep checking the values of the variables. I personally prefer print statements to localize where things are going wrong. I then try to narrow down the problem to a few lines of code and try to take it apart.

Hopefully the above tips are useful. Programming can be creative activity and will be indispensable in our future.

Above all have fun coding, there are so many possibilities these days!

Also see

1. Programming languages in layman’s language
2. The common alphabet of programming languages
3. The mind of the programmer
4. Programming Zen and now – Some essential tips -2

Programming Zen and now – Some essential tips-2

Pregelian philosophy – Thinking Web Scale – Part 2

“If you squint the right way, you will notice there are graphs are everywhere”. This is a line from a Research blog at Google “Large scale graph computing at Google”. Transportation problems, disease outbreaks, computer networks and social networks all use graph in one way or another. Social networks of today, with friends of Facebook, followers in Twitter and connections in LinkedIn are all clearly graphs. The billions of pages in the World Wide Web with incoming and outgoing links is a massive directed graph.

Pregel is Google’s computing model for graph processing. Google came up with this programming model to compute Page Ranks of individual web sites. Pregel is a powerful framework for processing of directed graphs. A directed graph has vertices and edges. Edges are directed towards or away from the vertices. Pregel is a highly scalable model and is well suited for Web Scale problems. It is capable of processing directed graphs with billions of vertices and trillion of edges.

The Map Reduce paradigm with its message passing mechanism is not particularly well suited for this purpose. Map Reduce is capable of processing several documents, images, or matrices in parallel. But handling directed graphs with Map-reduce the combiners/reducers have to wait for the mappers to finish their tasks. In Pregel programs are executed as a sequence of iterations in which each vertex receives messages sent to it in the previous iteration, execute code, and send messages to other vertices. Each vertex can modify its state and the topology of the graph

Here is a the model of the Pregel.

This picture is taken from the lecture in Web Intelligence and Big Data course by Gautam Shroff on Coursera

Pregel works on a sequence of iterations known as supersteps. In each superstep, S, each vertex will receive messages sent to it in the previous superstep S-1, executes a user-defined function specified for the vertex and sends messages to other vertices which they will receive in superstep S+1. The supersteps at all vertices are conceptually supposed to occur in parallel. At each superstep the vertex can alter its state and the state of the outgoing edges. The synchronicity of the model is what makes the model semantically manageable.

Pregel is realized on hundreds of commodty serves. The input to the Pregel model is a directed graph. During initialization the vertices are partitioned and each server receives a set of vertices. Each vertex is associated with a modifiable user defined value.

In each superstep each node computes the user defined function in parallel using the message sent to it in the previous superstep. A vertex can modify its state, the outgoing edges or the topology of the graph.

Pregel has been used for a variety of different problems ranging from determining Page Rank, Shortest path etc. Vertices are first class citizens in Pregel. Algorithms terminate by a process of voting to halt. When all vertices vote to halt then the computation in Pregel is assumed to have completed. The algorithm as a whole terminates when all the nodes have voted to halt and there are no messages in transit.

A simple example of how Pregel determines the maximum value is illustrated in the original Google research paper Pregel: A System for Large-Scale Graph Processing.

The pseudo code for this can be written as

compute(){

i_val := val
while (m) {
if (m > val) {
val = m;
}
else if (i_val == val) {
vote_to_halt
}
else {
for each neighbor v
send_message(v, val)
}
}

In the above diagram the 4 vertices are initialized with the values shown. In superstep 1,

vertices 3 & 6 exchange their values. While 3 updates its value to 6 at its vertex, vertex 6 drops it and vote to halt. Similarly vertex 1 will update its value to 6 while vertex 2 which receives the value 1 will vote to halt. In superstep 2vertex 2 will receive the value 6 from the vertex to its right, will be woken up and will update its value. In superstep 3 no messages are passed and all nodes would have now voted

Another interesting application is the evaluation of Page Rank. Page Rank essentially determines the probability of hitting a Page if a surfer clicked links on a Web page at random.

Page Rank is evaluated iteratively as follows (source wkipedia.org)

This can be done iteratively through Pregel as below

virtual void Compute(MessageIterator* msgs) {
if (superstep() >= 1) {

for (; !msgs->Done(); msgs->Next()) {
sum += msgs->Value();
*MutableValue() = 0.15 / NumVertices() + 0.85 * sum; – – > (A)
}
if (superstep() < 30) {
const int64 n = GetOutEdgeIterator().size();
SendMessageToAllNeighbors(GetValue() / n); – – > (B)
} else {
VoteToHalt();
}
}
}

The Pregel computation is initialized such that in superstep 0, the value of each vertex

is 1 / NumVertices() . In each of the first 30 supersteps, each vertex sends along each outgoing edge its tentative PageRank divided by the number of outgoing edges (see step B above).

Starting from superstep 1, each vertex sums up the values arriving on messages into sum and sets its own tentative PageRank to 0.15/NumVertices() + 0.85 * sum (see step A) After reaching superstep 30, no further messages are sent and each vertex votes to halt.

Clearly the computation of PageRank of the pages indexed by Google in the World Wide Web consisting of billions of pages can be computed fairly efficiently by Pregel.

Pregel also includes ‘combiners’ that perform functions like SUM, MIN,MAX and AVERAGE to save computational steps.

Pregel also includes checkpointing where vertices save their states to revert to a previous state in case of failure.

The synchronous methodology of Pregel helps in avoiding issues of deadlocks and races which are prevalent in asynchronous communicating programs.

Pregel is a powerful programming model which will find applications in many Web Scale applications in the future.

Find me on Google+

‘The Search’ is not yet over!

Published in Telecom Asia, Oc9, 2013 – ‘The search’ is not yet over!

In this post I take a look at the technologies that power the now indispensable and ubiquitous ‘search’ that we do over the web. It would be easy to rewind the world by at least 3 decades by simply removing ‘the search’ from our lives.

A classic article in the New York Times, ‘The Twitter trap’ discusses how technology is slowly replacing some of our more common faculties like the ability to memorize or perform simple calculations mentally.

For e.g. until the 15^th century people had to remember a lot of information. Then came Gutenberg with his landmark invention, the printing press, which did away with the need to store information. Closer to the 2oth century the ‘Google Search’ now obviates the need to remember facts about anything. Detailed information is just a mouse click away.

Here’s a closer look at evolution of search technologies

The Inverted Index: The inverted index is a way to search the existence of key words or phrases in documents. The inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. The ability to store words and the documents in which it is present, allows for an quick retrieval of the related documents in which the word(s) are present. Search engines like Google, Bing or Yahoo typically crawl of the web continuously and keep updating this index of words versus the documents as new web pages and web sites are added. The inverted index is a simplistic method and is neither accurate nor efficient.

Google’s Page Rank: As mentioned before merely the presence of words in documents alone is not sufficient to return good search results. So Google came along with its PageRank algorithm. PageRank is an algorithm used by the Google’s web search engine to rank websites in their search engine results. According to Google PageRank works by “counting the number and quality of links to a page to determine a rough estimate of how important the website is.” The underlying assumption is that more important websites are likely to receive more links from other websites.

In essence the PageRank algorithm tries to determine the likelihood that a surfer will land on a particular page by randomly clicking on links. Clearly the PageRank algorithm has been very effective for searches as now ‘googling’ is synonymous to searching (see below from Wikipedia)

Graph database: While the ‘Google Search’ is extremely powerful it would make more sense if the search could be precisely tailored to what the user is trying to search. A typical Google search throws up a few 100 million results. This has led to even more powerful techniques, one of which is the ‘Graph database’. In a Graph database data is represented as a graph. Nodes in the graph can be entities and edges could be relationships. A search on the graph database will result in the traversal of the graph from a specific start node to specific terminating node. Here is a fun representation of a simple Graph database representation from InfoQ

Google has recently come out with its Knowledge Graph which is based on this technology. Facebook allows users to create complex queries of status updates using the graph database.

Finally, what next??

A Cognitive Search??: Even with the graph database the results cannot be very context specific. What I would hope to happen in the future is have a sort of a ‘Cognitive Search’ where the results would be bounded and would take into account the semantics and context of a user specified phrase or request.

So for e.g. if a user specified ‘Events leading to the Iraq war’ the search should throw all results which finally culminated in the Iraq war.

Alternatively if I was interested in knowing for e.g. ‘the impact of iPad on computing’ then the search should throw precise results from the making of the iPad, the launch of iPad, the various positive and negative reviews and impact iPad has had on the tablet and computing industry as a whole.

Another interesting query would ‘The events that led to the downfall of a particular government in election of 1998’ the search should precisely output all those events that were significant during this specific period.

I would assume that the results themselves would come out as a graph with nodes and edges specifying which event(s) triggered which other event(s) with varying degrees of importance.

However this kind of ability where the search is cognitive is probably several years away!

Find me on Google+

The dark side of the Internet

Published in Telecom Asia 26 Sep 2012 – The dark side of the internet

Imagine a life without the internet. You can’t! That’s how inextricably enmeshed the internet is in our lives. Kids learn to play “angry birds” on the PC before they learn to say “duh”, school children hobnob on Facebook and many of us regularly browse, upload photos, watch videos and do a dozen other things on the internet.

So on one side of the internet is the user with his laptop, smartphones or iPad. So what’s on the other side of the Internet and what is the Internet? The Internet is a global system of interconnected computer network that uses the TCP/IP protocol. The Internet or more generally the internet is network of networks made of hundreds of millions of computers.

During the early days the internet was most probably used for document retrieval, email and browsing. But with the passage of time the internet and the uses of the internet have assumed gigantic proportions. Nowadays we use the internet to search billions of documents, share photographs with our online community, blog and stream video. So, while the early internet was populated with large computers to perform the tasks, the computations of the internet of today require a substantially larger infrastructure. The internet is now powered by datacenters. Datacenters contain anywhere between 100s to 100,000s servers. A server is a more beefed up computer that is designed for high performance sans a screen and a keyboard. Datacenters contain servers stacked over one another on a rack.

These datacenters are capable of handling thousands of simultaneous users and delivering results in split second. In this age of exploding data and information overload where split second responses and blazing throughputs are the need of the hour, datacenters really fill the need. But there is a dark side to these data centers. The issue is that these datacenters consume a lot of energy and are extremely power hungry besides. In fact out of a 100% of utility power supplied to datacenter only 6 – 12 % is used for actual computation. The rest of the power is either used for air conditioning or is lost through power distribution.

In fact a recent article “Power, pollution and the Internet” in the New York Times claims that “Worldwide, the digital warehouses use about 30 billion watts of electricity, roughly equivalent to the output of 30 nuclear power plants.” Further the article states that “it is estimated that Google’s data centers consume nearly 300 million watts and Facebook’s about 60 million watts or 60 MW”

For e.g. It is claimed that Facebook annually draws 509 million kilowatt hours of power for its data centers (see Estimate: Facebook running 180,000 servers). This article further concludes “that the social network is delivering 54.27 megawatts (MW) to servers” or approximately 60 MW to its datacenter. The other behemoths in this domain including Google, Yahoo, Twitter, Amazon, Microsoft, and Apple all have equally large or larger data centers consuming similar amounts of energy. Recent guesstimates have placed Google’s server count at more than 1 million and consuming approximately 220 MW. Taking a look at the power generation capacities of power plants in India we can see that 60 MW is between to 20%-50% of the power generation capacity of power plants while 220 MW is entire capacity of medium sized power plants (see List of power stations in India”)

One of the challenges that these organizations face is the need to make the datacenter efficient. New techniques are constantly being used in the ongoing battle to reduce energy consumption in a data center. These tools are also designed to boost a data center’s Power Usage Effectiveness (PUE) rating. Google, Facebook, Yahoo, and Microsoft compete to get to the lowest possible PUE measure in their newest data centers. The earlier datacenters used to average 2.0 PUE while advanced data centers these days aim for lower ratings of the order of 1.22 or 1.16 or lower.

In the early days of datacenter technology the air-conditioning systems used to cool by brute force. Later designs segregated the aisles as hot & cold aisle to improve efficiency. Other technique use water as a coolant along with heat exchangers. A novel technique was used by Intel recently in which servers were dipped in oil. While Intel claimed that this improved the PUE rating there are questions about the viability of this method considering the messiness of removing or inserting new circuit board from the servers.

Datacenters are going to proliferate in the coming days as information continues to explode. The hot new technology “Cloud Computing” is nothing more that datacenters which uses virtualization technique or the ability to run different OS on the hardware improving server utilization.

Clearly the thrust of technology in the days to come will be on identifying renewable sources of energy and making datacenters more efficient.

Datacenters will become more and more prevalent in the internet and technologies to make them efficient as we move to a more data driven world

Find me on Google+

Test driving Apache Hadoop: Standalone & pseudo-distributed mode

The Hadoop paradigm originated from Google and is used in crunching large data sets. It is ideally suited for applications like Big Data, creating an inverted index used by search engines and other problems which require terabytes of data to processed in parallel. One key aspect of Hadoop is that it is made up of commodity servers. Hence server, disk crashes or network issues are assumed to be norm rather than an exception.

The Hadoop paradigm is made of the Map-Reduce & the HDFS parts. The Map-Reduce has 2 major components to it. The Map part takes as input key- value pairs and emits a transformed key-value pair. For e.g Map could count the number of occurrences of words or created an inverted index of a word and its location in a document. The Reduce part takes as input the emitted key-value pairs of the Map output and performs another operation on the inputs from the Map part for.e.g summing up the counts of words. A great tutorial on Map-Reduce can be found at http://developer.yahoo.com/hadoop/tutorial/module4.html

The HDFS (Hadoop Distributed File System) is the special storage that is used in tandem with the Map-Reduce algorithm. The HDFS distributes data among Datanodes. A Namenode maintains the meta data of where individual pieces of data are stored.

To get started with Apache Hadoop download a stable release of Hadoop from (e.g. hadoop-1.0.4.tar.gz)

http://hadoop.apache.org/common/releases.html#Download

a) Install Hadoop on your system preferably in /usr/local/
tar xzf ./Downloads/hadoop-1.0.4.tar.gz

sudo mv hadoop-1.0.4 hadoop (rename hadoop-1.0.4 to hadoop for convenience)
sudo chown -R hduser:hadoop hadoop
Apache Hadoop requires Java to be installed. Download and install Java on your machine from

http://www.oracle.com/technetwork/java/javase/downloads/jdk-7u4-downloads-1591156.html

b) After you have installed java set $JAVA_HOME in
usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/bin/java (uncomment and set the correct path)

c) Create a user hduser in group hadoop
For this click Applications->Other->User & Groups
Choose Add User – hduser & Add Group – hadoop
Choose properties and add hduser to the hadoop group.

Standalone Operation
Under root do
/usr/sbin/sshd
then
$ssh localhost

If you cannot do a ssh to localhost with passphrase then do the following
$ ssh-keygen -t rsa -P “”

You will get the following

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.

…
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now re-run
$ssh localhost – This time it should go fine

Create a directory input and copy *.xml files from conf/
$mkdir input
$cp /usr/local/hadoop/share/hadoop/templates/conf/*.xml input

Then execute the following. This searches for the string “dfs*” in all the XML files under the input directory
$/usr/local/hadoop/bin/hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar grep input output ‘dfs[a-z.]+’
You should see
[root@localhost hadoop]# /usr/local/hadoop/bin/hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar grep input output ‘dfs[a-z.]+’
12/06/10 13:00:51 INFO util.NativeCodeLoader: Loaded the native-hadoop library
..
…
12/06/10 13:01:45 INFO mapred.JobClient:     Reduce output records=38
12/06/10 13:01:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
12/06/10 13:01:45 INFO mapred.JobClient:     Map output records=38

Where it indicates that there are 38 such record strings with dfs* in it.
If you get an error
java.lang.OutOfMemoryError: Java heap space
then increase the heap size from 128 to 1024 as below
…
<property>
<name>mapred.child.java.opts</name>
<value>-server -Xmx1024m -Djava.net.preferIPv4Stack=true</value>
</property>
….

Pseudo distributed mode
In the pseudo distributed mode separate Java processes are started for the Job Tracker which schedules tasks, the Task tracker which executes tasks and the Namenode which contains the data
A good post on Hadoop standalone mode is given in Michael Nolls post – Running Hadoop on Ubuntu Linux (single node cluster)

a) Execute the following commands
. ./.bashrc
$mkdir -p /home/hduser/hadoop/tmp
$chown hduser:hadoop /home/hduser/hadoop/tmp
$chmod 750 /home/hduser/hadoop/tmp

b) Do the following
Note:Files core-site.xml, mapred-site,xml & hdfs-site.xml exist under
/usr/local/hadoop/share/hadoop/templates/conf& usr/local/hadoop/conf
It appears that Apache hadoop gives precedence to /usr/local/hadoop/conf. So add the following between <configuration and </configuration>
In file /usr/local//hadoop/conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. Either the
literal string “local” or a host:port for NDFS.
</description>
<final>true</final>
</property>

In /usr/local//hadoop/conf//mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<final>true</final
</property>

In /usr/local//hadoop/conf//hdfs-site.xml add
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

Now perform
$sudo /usr/sbin/ssh
$ssh localhost (Note you may have to generate key as above if you get an error)

c) Since the pseudo distributed mode will use the HDFS file system we need to format this.So run the following command
$$HADOOP_HOME/bin/hadoop namenode -format
[root@localhost hduser]# /usr/local/hadoop/bin/hadoop namenode -format
12/06/10 15:48:16 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by ‘hortonfo’ on Tue May 8 20:16:59 UTC 2012
************************************************************/
…
12/06/10 15:48:17 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/06/10 15:48:18 INFO common.Storage: Storage directory /home/hduser/hadoop/tmp/dfs/name has been successfully formatted.
12/06/10 15:48:18 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
d) Now start all the Hadoop processes
$/usr/local/hadoop/bin/start-all.sh
starting namenode, logging to /var/log/hadoop/root/hadoop-root-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /var/log/hadoop/root/hadoop-root-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /var/log/hadoop/root/hadoop-root-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to /var/log/hadoop/root/hadoop-root-jobtracker-localhost.localdomain.out ocalhost: starting tasktracker, logging to /var/log/hadoop/root/hadoop-root-tasktracker-localhost.localdomain.out
Verify that all processes have started by executing /usr/java/jdk1.7.0_04/bin/jps
[root@localhost hduser]# /usr/java/jdk1.7.0_04/bin/jps
10971 DataNode
10866 NameNode
11077 SecondaryNameNode
11147 JobTracker
11264 TaskTracker
11376 Jps

You will see JobTracker,taskTracker,NameNode,DataNode and SecondaryNameNode

You can also do netstat -plten | grep java
tcp        0      0 0.0.0.0:50090               0.0.0.0:*                   LISTEN      0          166832     11077/java
tcp        0      0 0.0.0.0:50060               0.0.0.0:*                   LISTEN      0          167407     11264/java
tcp        0      0 0.0.0.0:50030               0.0.0.0:*                   LISTEN      0          166747     11147/java
tcp        0      0 0.0.0.0:50070               0.0.0.0:*                   LISTEN      0          165669     10866/java
tcp        0      0 0.0.0.0:50010               0.0.0.0:*                   LISTEN      0          166951     10971/java
tcp        0      0 0.0.0.0:50075               0.0.0.0:*                   LISTEN      0          166955     10971/java
tcp        0      0 127.0.0.1:55839             0.0.0.0:*                   LISTEN      0          166816     11264/java
tcp        0      0 0.0.0.0:50020               0.0.0.0:*                   LISTEN      0          165843     10971/java
tcp        0      0 127.0.0.1:54310             0.0.0.0:*                   LISTEN      0          165535     10866/java
tcp        0      0 127.0.0.1:54311             0.0.0.0:*                   LISTEN      0          166733     11147/java

e) Now copy files from your local directory /home/hduser/input to the HDFS file system
Now you can check the web interface for JobTracker & NameNode
This is as per mapred-site.xml & hdfs-site.xml in /conf directory. They are at
ñJobTracker – http://localhost:50030/

ñNameNode – http://localhost:50070/

f) Copy files from your local directory to HDFS
$/usr/local/hadoop/bin/hadoop dfs -copyFromLocal /home/hduser/input /user/hduser/input
Ensure that the files have been copies by listing the contents of HDFS

g) Check that files have been copied
$/usr/local/hadoop/bin/hadoop dfs -ls /user/hduser/input
Found 9 items
-rw-r–r–   1 root supergroup       7457 2012-06-10 10:31 /user/hduser/input/capacity-scheduler.xml
-rw-r–r–   1 root supergroup       2447 2012-06-10 10:31 /user/hduser/input/core-site.xml
-rw-r–r–   1 root supergroup       2300 2012-06-10 10:31 /user/hduser/input/core-site_old.xml
-rw-r–r–   1 root supergroup       5044 2012-06-10 10:31 /user/hduser/input/hadoop-policy.xml
-rw-r–r–   1 root supergroup       7595 2012-06-10 10:31 /user/hduser/input/hdfs-site.xml

h) Now execute the grep functionality
[root@localhost hduser]# /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar grep /user/hduser/input /user/hduser/output ‘dfs[a-z.]+’
12/06/10 10:34:22 INFO util.NativeCodeLoader: Loaded the native-hadoop library
….
12/06/10 10:34:23 INFO mapred.JobClient: Running job: job_201206101010_0003
12/06/10 10:34:24 INFO mapred.JobClient: map 0% reduce 0%
12/06/10 10:34:48 INFO mapred.JobClient: map 11% reduce 0%
…
…
12/06/10 10:35:21 INFO mapred.JobClient: map 88% reduce 22%
12/06/10 10:35:24 INFO mapred.JobClient: map 100% reduce 22%
12/06/10 10:35:27 INFO mapred.JobClient: map 100% reduce 29%
12/06/10 10:35:36 INFO mapred.JobClient: map 100% reduce 100%
12/06/10 10:35:42 INFO mapred.JobClient: Job complete: job_201206101010_0003
….
12/06/10 10:36:16 INFO mapred.JobClient:     Reduce input groups=3
12/06/10 10:36:16 INFO mapred.JobClient:     Combine output records=0
12/06/10 10:36:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=180502528
12/06/10 10:36:16 INFO mapred.JobClient:     Reduce output records=36
12/06/10 10:36:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=695119872
12/06/10 10:36:16 INFO mapred.JobClient:     Map output records=36
i) Check the result
[root@localhost hduser]# /usr/local/hadoop/bin/hadoop dfs -cat /user/hduser/output/*
6          dfs.data.dir
2          dfs.
2          dfs.block.access.token.enable
2          dfs.cluster.administrators
2          dfs.datanode.address
2          dfs.datanode.data.dir.perm
2          dfs.datanode.http.address
2          dfs.datanode.kerberos.principal
2          dfs.datanode.keytab.file
2          dfs.exclude
…..