TWS-5: Google’s Page Rank: Predicting the movements of a random web walker

Internet history can be divided into 2 epochs: the epoch before Google search and the one after. Prior to Google there were many unsuccessful attempts to organize the Web, then a minuscule fraction of what we have today, through Web portals. So we had Yahoo, Excite, AltaVista, Lycos etc. trying to categorize the pages of the Web into News, Sports, Finance etc. Navigating through them was an exercise in frustration, but one had to live with this for quite some time. (The material for this post is taken from the Mining Massive Datasets lectures on Coursera by Prof. Jure Leskovec, Stanford University)

The Google Search powered by the Page Rank algorithm arrived at a time when the internet was exploding. This was precisely what ‘the doctor ordered’ as navigating the web became synonymous with the Web search. This post takes a look at the Page Rank algorithm behind Google Search.

The Web can be viewed as a large directed graph with out-links from Web pages to other pages (links from a page to external Web pages) and in-links into Web pages from other pages.

For the Google search, Google uses Web crawlers to index the pages of the Web and probably creates an inverted index of keywords to the documents that contain them. It then uses the Page Rank algorithm to determine the relevance and importance of each Web page.

How does Google identify the importance of a Web page?

The importance of a Web page is determined by the number of in-links to the page. Each in-link is considered a vote for this page. Moreover, a vote from an important page carries more weight than a vote from a less important page. So, for example, an in-link from the New York Times counts for much more than an in-link from the National Enquirer.

[Figure 1: A small Web graph showing pages with in-links and out-links]

In the figure above, B has the highest Page Rank because it has the highest number of in-links. In addition, the out-link from B to C increases the Page Rank of C.

A) Flow formulation: The Flow formulation for Page Rank is based on the following

  • Each Web page’s vote (in-link) is proportional to the importance of the source page
  • If a page ‘j’ with page rank rj has n out-links each link gets rj/n votes
  • Page ‘j’s own importance is the sum of all the votes on its in-links

[Figure 2: Page j with in-links from pages i and k]

From the figure above, rj = ri/3 + rk/4, since page i has 3 out-links and page k has 4 out-links.

According to the Flow equation for Page Rank, the rank rj for a page j is

rj = ∑ ri/di   (summed over all pages i that link to j)

In other words, the rank rj is the sum, over all pages i with an in-link to j, of the rank ri divided by the out-degree di of page i.

[Figure 3: A simple Web graph with three pages y, a and m]

The flow equations for the simple Web graph above are based on the rank of each node divided by its out-degree. Pages y and a each have an out-degree of 2, so each of their out-links carries ry/2 and ra/2 respectively.

ry = ry/2 + ra/2
ra = rm + ry/2
rm = ra/2
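To make the flow formulation concrete, here is a minimal Python sketch (not from the original post) that applies the flow equation once to the three-page graph above. The link structure is reconstructed from the flow equations (y links to itself and to a, a links to y and to m, m links to a); the function name flow_update and the starting ranks are illustrative assumptions.

```python
# Toy sketch of the flow equation r_j = sum over i->j of r_i / d_i.
# The out-link structure below is reconstructed from the flow equations
# in the post: y -> {y, a}, a -> {y, m}, m -> {a}.

out_links = {
    'y': ['y', 'a'],
    'a': ['y', 'm'],
    'm': ['a'],
}

def flow_update(out_links, ranks):
    """Apply the flow equation once: each page j receives r_i/d_i
    from every page i that links to it."""
    new_ranks = {page: 0.0 for page in out_links}
    for i, targets in out_links.items():
        d_i = len(targets)              # out-degree of page i
        for j in targets:
            new_ranks[j] += ranks[i] / d_i
    return new_ranks

# Starting from equal ranks (1/3 each), one update gives
# approximately y: 1/3, a: 1/2, m: 1/6.
ranks = {'y': 1/3, 'a': 1/3, 'm': 1/3}
print(flow_update(out_links, ranks))
```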

B) The Matrix formulation

In the Matrix formulation for Page Rank, a stochastic adjacency matrix M is defined as follows. If page i has di out-links, then

Mji = 1/di   if page i has an out-link to page j (i -> j)
Mji = 0      otherwise

The rank vector r has an entry ri for each page i, where ri is the importance of page i.
It is also assumed that ∑ri = 1
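As a rough sketch of how such a matrix could be assembled (reusing the three-page graph reconstructed from the flow equations above; the helper name build_matrix and the node ordering y, a, m are assumptions made for illustration):

```python
import numpy as np

# Toy sketch: build the column-stochastic matrix M from out-links.
# Column i of M distributes page i's rank equally over its out-links.
nodes = ['y', 'a', 'm']                              # assumed ordering
out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}

def build_matrix(nodes, out_links):
    idx = {n: k for k, n in enumerate(nodes)}
    M = np.zeros((len(nodes), len(nodes)))
    for i, targets in out_links.items():
        for j in targets:
            M[idx[j], idx[i]] = 1.0 / len(targets)   # M[j, i] = 1/d_i
    return M

M = build_matrix(nodes, out_links)
print(M)            # each column sums to 1
# [[0.5 0.5 0. ]
#  [0.5 0.  1. ]
#  [0.  0.5 0. ]]
```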


For the Web graph of Figure 3 above, the flow equations were shown to be
ry = ry/2 + ra/2
ra = rm + ry/2
rm = ra/2

The Matrix formulation is

r = M x r

which for the above graph is

| ry |   | 1/2  1/2   0 |   | ry |
| ra | = | 1/2   0    1 | x | ra |
| rm |   |  0   1/2   0 |   | rm |

However, when we have billions of Web pages, each with large numbers of in-links and out-links, the Page Rank has to be calculated iteratively.

To start, the Page Rank of each page is set to ry = ra = rm = 1/3, so that the sum ∑ri = 1.

This is then iterated, using r = M x r, to arrive at values that converge. The first iteration is
| ry |   | 1/2  1/2   0 |   | 1/3 |
| ra | = | 1/2   0    1 | x | 1/3 |
| rm |   |  0   1/2   0 |   | 1/3 |

This will eventually converge to ry = 2/5, ra = 2/5 and rm = 1/5.
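A minimal power-iteration sketch of this calculation in Python with NumPy (the iteration cap and convergence tolerance below are arbitrary assumptions, not part of the original description):

```python
import numpy as np

# Power iteration: repeatedly apply r = M x r until r stops changing.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

r = np.array([1/3, 1/3, 1/3])               # start with equal ranks, sum = 1
for _ in range(100):                         # iteration cap (arbitrary)
    r_next = M @ r
    if np.abs(r_next - r).sum() < 1e-9:      # convergence tolerance (arbitrary)
        break
    r = r_next

print(r)   # converges to approximately [0.4, 0.4, 0.2], i.e. 2/5, 2/5, 1/5
```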

The ability to rank Web pages on the order of importance was a real breakthrough for Google

The Page Rank of a page can also be interpreted as the probability that a Web surfer, who randomly clicks the out-links of Web pages, will land on that page after some time. It is the stationary probability of a random walk of the Web, clicking the links on pages at random.
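The random-surfer interpretation can be illustrated with a toy simulation over the same three-page graph; the number of steps and the random seed below are arbitrary assumptions:

```python
import random
from collections import Counter

# Toy random surfer: at each step, follow a random out-link of the current
# page. The visit frequencies approximate the Page Rank values.
out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}

random.seed(42)                       # arbitrary seed for reproducibility
page = 'y'
visits = Counter()
steps = 100_000                       # arbitrary number of steps
for _ in range(steps):
    page = random.choice(out_links[page])
    visits[page] += 1

for p in 'yam':
    print(p, visits[p] / steps)       # roughly 0.4, 0.4, 0.2
```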

While Google does a great job in crawling and serving pages, it is rumored that more than 75% of the Web is inaccessible to Web search engines. This is known as the 'Dark Net' or 'Dark Web', much like the dark matter of the universe.

Also see
1. A crime map of India in R: Crimes against women
2.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3.  Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data
6. Deblurring with OpenCV:Weiner filter reloaded

Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data

In the last decade and a half, there has arisen a class of problems that is becoming very critical in the computing domain. These problems deal with computing in highly distributed environments. A key characteristic of this domain is the need to grow elastically with increasing workloads while tolerating failures without missing a beat. In short, I would like to refer to this as 'Web Scale Computing', where the number of servers runs into several hundreds and the data size is of the order of a few hundred terabytes to several exabytes.

There are several features that are unique to large scale distributed systems

  1. The servers used are not specialized machines but regular commodity, off-the-shelf servers
  2. Failures are not the exception but the norm. The design must be resilient to failures
  3. There is no global clock. Each individual server has its own internal clock with its own skew and drift rates. Algorithms exist that can create a notion of a global clock
  4. Operations happen at these machines concurrently. The order of the operations, things like causality and concurrency, can be evaluated through special algorithms like Lamport or Vector clocks
  5. The distributed system must be able to handle failures where servers crash, disk fails or there is a network problem. For this reason data is replicated across servers, so that if one server fails the data can still be obtained from copies residing on other servers.
  6. Since data is replicated there are associated issues of consistency. Algorithms exist that ensure that the replicated data is either ‘strongly’ consistent or ‘eventually’ consistent. Trade-offs are often considered when choosing one of the consistency mechanisms
  7. Leaders are elected democratically.  Then there are dictators who get elected through ‘bully’ing.

In some ways distributed systems behave like a murmuration of starlings (or a school of fish),  where a leader is elected on the fly (pun unintended) and the starlings or fishes change direction based on a few (typically 6) closest neighbors.

This series of posts, Thinking Web Scale (TWS), will be about Web Scale problems and the algorithms designed to address them. I would like to keep these posts more essay-like and less pedantic.

In the early days, computing used to be done on a single monolithic machine with its own CPU, RAM and disk. This situation was fine for a long time, as technology promptly kept its date with Moore's Law, which stated that computing power and memory capacity would double every 18 months. However this situation changed drastically as the data generated from machines grew exponentially – whether it was call detail records, records from retail stores, click streams, tweets, or the status updates of today's social networks.

These massive amounts of data cannot be handled by a single machine. We need to 'divide' and 'conquer' this data for processing. Hence there is a need for hundreds of servers, each handling a slice of the data.

The first post is about the fairly recent computing paradigm 'Map-Reduce'. Map-Reduce is a product of Google Research and was developed to address their need to create an inverted index of Web pages, to compute the Page Rank, etc. The algorithm was initially described in a white paper published by Google on the Map-Reduce algorithm. The Page Rank algorithm now powers Google's search, which is now almost indispensable in our daily lives.

Map-Reduce assumes that these servers are not perfect, failure-proof machines. Rather, Map-Reduce folds into its design the assumption that the servers are regular, commodity servers performing a part of the task. The hundreds of terabytes of data are split into 16 MB to 64 MB chunks and distributed into a file system known as a 'Distributed File System (DFS)'. There are several implementations of the Distributed File System. Each chunk is replicated across servers. One of the servers is designated as the 'Master'. This 'Master' allocates tasks to 'worker' nodes. The Master node also keeps track of the location of the chunks and their replicas.

When the Map or Reduce has to process data, the process is started on the server in which the chunk of data resides.

The data is not transferred to the application from another server. The compute is brought to the data, and not the other way around. In other words, the process is started on the server where the data and intermediate results reside.

The reason for this is that it is more expensive to transmit data than to move the computation. Besides, the latencies associated with data transfer can become significant with increasing distances.

Map-Reduce had its genesis in the Lisp constructs of the same name, where one could apply ('map') a common operation over a list of elements and then combine ('reduce') the resulting list of elements with a reduce operation.
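Python has the same functional building blocks; a tiny illustrative sketch (the word-lengths example is made up):

```python
from functools import reduce

# 'map' applies a common operation to every element of a list;
# 'reduce' then combines the resulting list into a single value.
words = ["map", "reduce", "at", "web", "scale"]      # illustrative data

lengths = list(map(len, words))                      # [3, 6, 2, 3, 5]
total = reduce(lambda acc, n: acc + n, lengths, 0)   # 19

print(lengths, total)
```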

Map-Reduce was originally created by Google to solve the Page Rank problem. Now Map-Reduce is used across a wide variety of problems.

The main components of Map-Reduce are the following

  1. Mapper: Convert all d ∈ D to (key (d), value (d))
  2. Shuffle: Moves all (k, v) and (k’, v’) with k = k’ to same machine.
  3. Reducer: Transforms {(k, v1), (k, v2), . . .} to an output (k, f(v1, v2, . . .))
  4. Combiner: If one machine has multiple (k, v1), (k, v2) with same k then it can perform part of Reduce before Shuffle
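These components can be sketched as a toy, single-machine simulation in Python. The driver below (the names map_reduce, mapper and reducer, and the made-up temperature records in the usage example, are illustrative assumptions; the Combiner is omitted for brevity) only mimics the phases; a real implementation distributes them across servers.

```python
from collections import defaultdict

def map_reduce(data, mapper, reducer):
    """Toy in-memory simulation of the Map-Reduce phases."""
    # Map phase: every input record is turned into (key, value) pairs
    pairs = []
    for record in data:
        pairs.extend(mapper(record))

    # Shuffle phase: all pairs with the same key are brought together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce phase: each key's list of values is combined into one output
    return {key: reducer(key, values) for key, values in groups.items()}

# Example usage: maximum temperature per city (made-up records)
print(map_reduce(
    ["delhi 32", "chennai 35", "delhi 30"],
    mapper=lambda rec: [(rec.split()[0], int(rec.split()[1]))],
    reducer=lambda key, values: max(values)))
# {'delhi': 32, 'chennai': 35}
```

The word-count example that follows maps directly onto this skeleton.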

A schematic of Map-Reduce is included below.

[Figure: Map-Reduce schematic]

Map-Reduce is usually a perfect fit for problems that have an inherent property of parallelism. For this class of problems, the map-reduce paradigm can be applied simultaneously to large sets of data. The 'Hello World' equivalent of Map-Reduce is the word count problem. Here we simultaneously count the occurrences of words in millions of documents.

The map operation scans the documents in parallel and outputs key-value pairs. In this case 'map' will scan each word and emit the word as the key with the value 1, so that the reduce phase can later add up the 1s to get the number of occurrences of each word.

So, if the document contained

“All men are equal. Some men are more equal than others”

Map would output

(all,1), (men,1), (are,1), (equal,1), (some,1), (men,1), (are,1), (more,1), (equal,1), (than,1), (others,1)

The Reduce phase will take the above output and sum all key-value pairs that have the same key

(all,1), (men,2), (are,2), (equal,2), (some,1), (more,1), (than,1), (others,1)

So we get to count all the words in the document
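A self-contained word-count sketch in Python along these lines (the lowercase, period-stripping tokenization is an assumption made purely for illustration):

```python
from collections import Counter

# Word count, the "Hello World" of Map-Reduce, as a toy single-machine sketch.
documents = ["All men are equal. Some men are more equal than others"]

# Map phase: emit (word, 1) for every word.
pairs = [(word.strip('.').lower(), 1)
         for doc in documents
         for word in doc.split()]

# Shuffle + Reduce phase: group by word and sum the 1s.
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(counts)
# Counter({'men': 2, 'are': 2, 'equal': 2, 'all': 1, 'some': 1,
#          'more': 1, 'than': 1, 'others': 1})
```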

In the Map-Reduce the Master node assigns tasks to Worker nodes which process the data on the individual chunks

[Figure: The Master node assigning tasks to Worker nodes]

Map-Reduce also makes short work of dealing with large matrices and can crunch matrix operations like matrix addition, subtraction, multiplication etc.

Matrix-Vector multiplication

As an example, consider a matrix-vector multiplication (taken from the book Mining Massive Datasets by Jure Leskovec, Anand Rajaraman et al).

For an n x n matrix M with the value mij in the ith row and jth column, multiplied by a vector v with components vj, the matrix-vector product x = M v has components xi, where

xi = ∑j mij vj   (summing over j from 1 to n)

Here the product mij x vj can be performed by the map function and the summation can be performed by a reduce operation. The obvious question is, what if the vector v or the matrix M does not fit into memory? In such a situation the vector and the matrix are divided into equal-sized slices which are processed across machines. The application would then have to consolidate the partial results.
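A toy, single-machine sketch of this idea in Python, with the map and reduce roles written out explicitly (the small 2 x 2 matrix and the vector are made-up illustrative data):

```python
from collections import defaultdict

# Matrix-vector multiplication, Map-Reduce style (toy, in-memory sketch).
# The matrix is stored as (i, j, m_ij) triples, as it might be on disk.
M = [(0, 0, 1.0), (0, 1, 2.0),
     (1, 0, 3.0), (1, 1, 4.0)]          # illustrative 2 x 2 matrix
v = [10.0, 20.0]                        # illustrative vector

# Map: each matrix entry m_ij becomes the pair (i, m_ij * v_j)
mapped = [(i, m_ij * v[j]) for (i, j, m_ij) in M]

# Shuffle + Reduce: sum all values that share the same row index i
x = defaultdict(float)
for i, value in mapped:
    x[i] += value

print([x[i] for i in sorted(x)])        # [50.0, 110.0], i.e. M x v
```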

Several problems in Machine Learning, Computer Vision, Regression and Analytics require large matrix operations, and Map-Reduce can be used very effectively for such matrix manipulations. The computation of Page Rank itself involves such matrix operations, which was one of the triggers for the Map-Reduce paradigm.

Handling failures: As mentioned earlier, the Map-Reduce implementation must be resilient to failures, where failures are the norm and not the exception. To handle this, the 'master' node periodically checks the health of the 'worker' nodes by pinging them. If the ping response does not arrive, the master marks the worker as 'failed' and restarts the task allocated to that worker, so that the output is generated on a server that is accessible.

Stragglers: Executing a job in parallel brings forth the famous saying 'A chain is only as strong as its weakest link'. If there is one node which is a straggler and is delayed in its computation due to, say, disk errors, the Master node starts a backup worker and monitors its progress. When either the straggler or the backup completes, the master kills the other process.

Mining social networks and sentiment analysis of the Twitterverse also utilize Map-Reduce.

However, Map-Reduce is not a panacea for all of the industry’s computing problems (see To Hadoop, or not to Hadoop)

But Map-Reduce is a very critical paradigm in the distributed computing domain as it is able to handle mountains of data, can handle multiple simultaneous failures, and is blazingly fast.

Also see
1. A crime map of India in R: Crimes against women
2.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3.  Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid

To see all posts click 'Index of Posts'

‘The Search’ is not yet over!

Published in Telecom Asia, Oct 9, 2013 – 'The search' is not yet over!

In this post I take a look at the technologies that power the now indispensable and ubiquitous ‘search’ that we do over the web. It would be  easy to rewind the world by at least 3 decades by simply removing ‘the search’ from our lives.

A classic article in the New York Times, ‘The Twitter trap’ discusses how technology is slowly replacing some of our more common faculties like the ability to memorize or perform simple calculations mentally.

For example, until the 15th century people had to remember a lot of information. Then came Gutenberg with his landmark invention, the printing press, which did away with the need to store information. Closer to the 20th century, 'Google Search' now obviates the need to remember facts about anything. Detailed information is just a mouse click away.

Here's a closer look at the evolution of search technologies.

The Inverted Index: The inverted index is a way to search for the existence of key words or phrases in documents. The inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. The ability to store words along with the documents in which they are present allows for quick retrieval of the documents in which the word(s) occur. Search engines like Google, Bing or Yahoo typically crawl the web continuously and keep updating this index of words versus documents as new web pages and web sites are added. The inverted index by itself is a simplistic method and is neither accurate nor efficient.

[Figure: An inverted index]
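A minimal sketch of building and querying an inverted index in Python (the toy documents and the names build_inverted_index and search are illustrative assumptions):

```python
from collections import defaultdict

# Toy inverted index: word -> set of document ids containing that word.
documents = {
    0: "the quick brown fox",
    1: "the lazy dog",
    2: "the quick dog jumps",
}

def build_inverted_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, *words):
    """Return ids of documents containing all the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

index = build_inverted_index(documents)
print(search(index, "quick"))          # {0, 2}
print(search(index, "quick", "dog"))   # {2}
```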

Google's Page Rank: As mentioned before, merely the presence of words in documents alone is not sufficient to return good search results. So Google came along with its PageRank algorithm. PageRank is an algorithm used by Google's web search engine to rank websites in its search engine results. According to Google, PageRank works by "counting the number and quality of links to a page to determine a rough estimate of how important the website is." The underlying assumption is that more important websites are likely to receive more links from other websites.

In essence the PageRank algorithm tries to determine the likelihood that a surfer will land on a particular page by randomly clicking on links. Clearly the PageRank algorithm has been very effective for searches, as 'googling' is now synonymous with searching (see below, from Wikipedia).

[Figure: PageRank values for a small network of pages (from Wikipedia)]

Graph database: While 'Google Search' is extremely powerful, it would make more sense if the search could be precisely tailored to what the user is trying to find. A typical Google search throws up a few hundred million results. This has led to even more powerful techniques, one of which is the 'graph database'. In a graph database, data is represented as a graph. Nodes in the graph can be entities and edges can be relationships. A search on the graph database results in a traversal of the graph from a specific start node to a specific terminating node. Here is a fun representation of a simple graph database from InfoQ.

[Figure: A simple graph database representation (from InfoQ)]
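As a toy illustration of graph traversal (not the API of any particular graph database product), here is a breadth-first search over a small entity-relationship graph in Python; the entities and relationships are made up:

```python
from collections import deque

# Toy graph: entities as nodes, relationships as labelled edges.
graph = {
    "Alice":  [("KNOWS", "Bob")],
    "Bob":    [("KNOWS", "Carol"), ("WORKS_AT", "Acme")],
    "Carol":  [("WORKS_AT", "Acme")],
    "Acme":   [],
}

def find_path(graph, start, goal):
    """Breadth-first traversal from start to goal, returning one path."""
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [f"-{relation}->", neighbour]))
    return None

print(find_path(graph, "Alice", "Acme"))
# ['Alice', '-KNOWS->', 'Bob', '-WORKS_AT->', 'Acme']
```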

Google has recently come out with its Knowledge Graph which is based on this technology. Facebook allows users to create complex queries of status updates using the graph database.

Finally, what next??

A Cognitive Search??: Even with the graph database the results cannot be very context specific. What I would hope to see happen in the future is a sort of 'Cognitive Search', where the results would be bounded and would take into account the semantics and context of a user-specified phrase or request.

So, for example, if a user specified 'Events leading to the Iraq war', the search should throw up all the results which finally culminated in the Iraq war.

Alternatively, if I were interested in knowing, for example, 'the impact of the iPad on computing', then the search should throw up precise results from the making of the iPad, the launch of the iPad, the various positive and negative reviews, and the impact the iPad has had on the tablet and computing industry as a whole.

Another interesting query would be 'The events that led to the downfall of a particular government in the election of 1998'; the search should precisely output all those events that were significant during this specific period.

I would assume that the results themselves would come out as a graph with nodes and edges specifying which event(s) triggered which other event(s) with varying degrees of importance.

However this kind of ability where the search is cognitive is probably several years away!

Find me on Google+