Internet history can be divided into 2 epochs. The epoch before the Google search and that after. Prior to Google there were many unsuccessful attempts to organize the Web, which a miniscule fraction of what we have today, through Web portals. So we had Yahoo, Excite, Alta-vista, Lycos etc. trying to categorize the pages of the Web into News, Sports, and Finance etc. Navigating through them was an exercise an frustration but one had to live with this for quite some time. ( The material for this post is taken from Mining Massive Datasets lecture from Coursera – Lecture by Prof. Jure Leskovec, Stanford University)

The Google Search powered by the Page Rank algorithm arrived at a time when the internet was exploding. This was precisely what ‘the doctor ordered’ as navigating the web became synonymous with the Web search. This post takes a look at the Page Rank algorithm behind Google Search.

The Web can be viewed as a large directed graph with out-links from Web pages to other pages (links from a page to external Web pages) and in-links into Web pages from other pages.

For the Google search, Google uses Web crawlers to index the pages of the Web and probably creates an inverted index of keywords to documents that contain them. It then uses the Page Rank algorithm to determine the relevance and importance of the Web page

How does Google identify the importance of a Web page?

The importance of a Web page is determined by the number of in-links to the page. Each in-link is considered a vote for this page. Also the in-link from an important page is higher than another in-link from a less important page. So for example an in-link from New York Times will be much larger than an in-link from the National Enquirer for example

In the figure above it can B has a highest Page Rank because it has the highest number of in-links. In addition the out-link from B to C increases the Page Rank of C.

**A) Flow formulation**: The Flow formulation for Page Rank is based on the following

- Each Web page’s vote (in-link) is proportional to the importance of the source page
- If a page ‘j’ with page rank rj has n out-links each link gets rj/n votes
- Page ‘j’s own importance is the sum of all the votes on its in-links

Where rj = ri/3 + rk/4 as seen from the above figure

According to the Flow equation for Page rank, the rank rj for a page j is

rj = ∑ ri/d

I -> j

In other words the rank rj is the sum of the the in-links from all pages ri divided by its out-degree.

The flow equations for the above simple view of a Web links can be expressed as based on the rank ri of each node divided by its out-degree. So ry and ra have an out-degree of 2 and hence they are ry/2 and ra/2 per out-link

ry = ry/2 + ra/2

ra = rm + ry/2

rm = ra/2

**B) The Matrix formulation**

In the Matirx formulation for Page Rank an Adjacent matrix Mji is defined as follows

If a page I has di out-links

If page I has an out-link to page j then

I -> j Mji = 1/di else Mji =0

The Rank vector ri is the importance of page i

It is also assumed that ∑ri = 1

The Flow formulation for the above was shown to be

ry = ry/2 + ra/2

ra = rm + ry/2

rm = ra/2

The Matrix formulation is

However when we a billions of Web pages with several hundred thousand in-links and out-links the Page rank is iteratively calculated

If we start with

To start the page rank of ra=ry=rm = 1/3 so that the sum ∑ri =1

This is then iterated

Using the

r = M x r to arrive at values that converge

ry ½ ½ 0 1/3

ra = ½ 0 1 x 1/3

r m 0 ½ 0 1/3

This will eventually converge at ry=2/5 ra=2/5 and rm =1/5

The ability to rank Web pages on the order of importance was a real breakthrough for Google

The Page Rank also implies the probability that a Web surfer who randomly clicks the ou-links of a the Web pages will land on after some time. It is the probability of a random walk of the Web when clicking the Web links on pages at random.

While Google does a great job in crawling and serving pages it is rumored that more than 75% of the Web is inaccessible to Web search engines. This is known as the “Dark Net‘ or “Dark Web” much like the dark matter of the universe

Also see

1. A crime map of India in R: Crimes against women

2. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1

3. Bend it like Bluemix, MongoDB with autoscaling – Part 2

4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid

5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data

6. Deblurring with OpenCV:Weiner filter reloaded