TWS-5: Google’s Page Rank: Predicting the movements of a random web walker

Internet history can be divided into 2 epochs. The epoch before the Google search and that after. Prior to Google there were many unsuccessful attempts to organize the Web, which a miniscule fraction of what we have today, through Web portals. So we had Yahoo, Excite, Alta-vista, Lycos etc. trying to categorize the pages of the Web into News, Sports, and Finance etc. Navigating through them was an exercise an frustration but one had to live with this for quite some time. ( The material for this post is taken from Mining Massive Datasets lecture from Coursera – Lecture by Prof. Jure Leskovec, Stanford University)

The Google Search powered by the Page Rank algorithm arrived at a time when the internet was exploding. This was precisely what ‘the doctor ordered’ as navigating the web became synonymous with the Web search. This post takes a look at the Page Rank algorithm behind Google Search.

The Web can be viewed as a large directed graph with out-links from Web pages to other pages (links from a page to external Web pages) and in-links into Web pages from other pages.

For the Google search, Google uses Web crawlers to index the pages of the Web and probably creates an inverted index of keywords to documents that contain them. It then uses the Page Rank algorithm to determine the relevance and importance of the Web page

How does Google identify the importance of a Web page?

The importance of a Web page is determined by the number of in-links to the page. Each in-link is considered a vote for this page. Also the in-link from an important page is higher than another in-link from a less important page. So for example an in-link from New York Times will be much larger than an in-link from the National Enquirer for example

In the figure above it can B has a highest Page Rank because it has the highest number of in-links. In addition the out-link from B to C increases the Page Rank of C.

A) Flow formulation: The Flow formulation for Page Rank is based on the following

Each Web page’s vote (in-link) is proportional to the importance of the source page
If a page ‘j’ with page rank rj has n out-links each link gets rj/n votes
Page ‘j’s own importance is the sum of all the votes on its in-links

Where rj = ri/3 + rk/4 as seen from the above figure

According to the Flow equation for Page rank, the rank rj for a page j is
rj = ∑ ri/d
I -> j

In other words the rank rj is the sum of the the in-links from all pages ri divided by its out-degree.

The flow equations for the above simple view of a Web links can be expressed as based on the rank ri of each node divided by its out-degree. So ry and ra have an out-degree of 2 and hence they are ry/2 and ra/2 per out-link

ry = ry/2 + ra/2
ra = rm + ry/2
rm = ra/2

B) The Matrix formulation

In the Matirx formulation for Page Rank an Adjacent matrix Mji is defined as follows
If a page I has di out-links
If page I has an out-link to page j then
I -> j Mji = 1/di else Mji =0

The Rank vector ri is the importance of page i
It is also assumed that ∑ri = 1

The Flow formulation for the above was shown to be
ry = ry/2 + ra/2
ra = rm + ry/2
rm = ra/2

The Matrix formulation is

However when we a billions of Web pages with several hundred thousand in-links and out-links the Page rank is iteratively calculated

If we start with

To start the page rank of ra=ry=rm = 1/3 so that the sum ∑ri =1
This is then iterated
Using the

r = M x r to arrive at values that converge
ry            ½     ½     0                             1/3
ra    =     ½     0      1          x               1/3
r m         0     ½      0                           1/3

This will eventually converge at ry=2/5 ra=2/5 and rm =1/5

The ability to rank Web pages on the order of importance was a real breakthrough for Google

The Page Rank also implies the probability that a Web surfer who randomly clicks the ou-links of a the Web pages will land on after some time. It is the probability of a random walk of the Web when clicking the Web links on pages at random.

While Google does a great job in crawling and serving pages it is rumored that more than 75% of the Web is inaccessible to Web search engines. This is known as the “Dark Net‘ or “Dark Web” much like the dark matter of the universe

Also see
1. A crime map of India in R: Crimes against women
2. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3. Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data
6. Deblurring with OpenCV:Weiner filter reloaded

Programming Zen and now – Some essential tips-2

This post is a follow-up to my earlier post – How to program – Some essential tips. In this post I expand on some of the ideas of my earlier post.

Programming means different things to different people. To some programming is a drudgery almost akin to manual labor, to others programming is an insurmountable mountain full of frustrations and disappointments while to others it is an intense problem solving and a creative activity. In my opinion programming can mean anything to you. It is your attitude towards coding that make it a chore, a daunting task or something really creative.

Here are some my insights on how to go about learning to code

Eyes wide open: People generally get frustrated when a piece of code that they wrote does not do what they intended it to do. In some cases the code snippet will do nothing when they were expecting final result, sometimes the code will crash or it will go into an infinite loop and drive the person nuts. (Let me assure you – I have been there, done that!) The usual reaction when this happens is anger and frustration where we generally tinker around with the code only to get the same result. Soon the emotions will progress from anger to hopelessness.

The first thing that one needs to while coding is to keep your ‘eyes wide open’. We tend to be guilty of ignoring the error messages that show up. Here one way to attack coding

a) Fully understand the ‘what’ of the problem. If there is an infinite loop or a core dump check after which point does it happen? If there is an execution error, what is the error trying to tell us?
b) Next look into ‘why’ the error occurred. You could either use debugger or insert appropriate print statements to take the offending code apart.
c) Thirdly think ‘how‘ you can address the situation. Make appropriate changes and re-run the code
d) Did it solve the issue.If yes, move forward. Otherwise go to step a)

Remember that we learn more from our programming mistakes more than when our code just ‘happens’ to work! Mistakes in our code make us to explain every part of the program

Changing times:

Times have changed. Programming Zen and programming now are worlds apart. In many ways, IDEs, Git, Google etc. have made the programmer’s life a lot easier

‘Git’ing from here to there: Here is a trick that I learnt fairly recently, though it should have occurred to me more than 2 years back. This is using Git judiciously for all programming tasks (Note: I am saying nothing new here!). I find it really useful in writing code with incremental changes. I create my initial code on the master and then test out incremental changes on a ‘new branch’ even for personal projects. Once I have proved a small increment works, I merge it with the ‘main’ branch. I again start working on the ‘new’ for the next incremental change followed by a merge to the master

The steps are

Make initial changes

1. git add .
2. git commit –m “ Initial changes’

Create a new branch
3. git checkout –b ‘new

Make incremental changes. Test.
4.git add .
5. git commit –m “Change 1”

Merge with the master
6.git checkout master
7. git merge new

Continue to work with ‘new’.
8 . git checkout new
9. Go to step 4)

This process can be continued till you get your final product. I find this extremely useful instead of just using an IDE to make code changes. Invariably you can run into a situation where you had something working some time back and in the next instant it is broken and you can’t figure out all the changes you made to the working code. This can be extremely frustrating. With Git you have a history of changes and you can switch to an earlier version of working code and start from there.

Rarely do I find a reason to have more than 1 branch

Here is a pictorial version of this

Taking help from Dr. Google: For most questions and errors that you encounter you will find others who have hit similar bugs. Just google it. You will more than surprised that others went down the exact same path that you are treading. Besides the internet is full of tutorials, blogs and articles on key aspects of programming

Explore the cave of Stack overflow: Spend time exploring Stack overflow. Stack overflow is replete with code snippets and questions that you wanted to ask. There is so much information out there. If you really don’t find an answer to your problem, post it in Stack overflow and you are bound to get an answer or a link to a similar question asked previously

Finally programming requires dollops of patience. Develop patience along with your skill in coding and soon programming will much more enjoyable to you.

1. Programming languages in layman’s language
2. The common alphabet of programming languages
3. How to program – Some essential tips
4. The mind of a programmer

How to program – Some essential tips

If one follows the arrow of time from the early 1980s to the present day, the number of programming problems have not only proliferated but have also become more difficult. Fortunately programming in itself has become more manageable with massive increases in computing horsepower, smarter tools and instant availability of information on the internet, typically with the click of a mouse.

Learning to program is no easy task, but can be done with the right mix of attitude, curiosity and interest. Becoming adept at programming, however, is something else. An interesting essay in this context is Peter Norvig’s ‘Teach yourself programming in 10 years’

Back in the 1980s when I wrote my first Fortran program on my college Mainframe, programming was a lengthy exercise, spanning several days.

My first program was to plot a sine wave of characters on a computer printout. Running this program required the following several steps

Enter the program on a teletype terminal and create a stack of Hollerith (punched) cards
Submit the stack of cards to the computer center
The computer center would do a batch execute in the evening on the Mainframe
God forbid, if your program has a syntax error. If you did find an error, go back to step 1, the next day.
Assuming everything is fine, the computer center would run your program and your output (printout) would be placed in the appropriate pigeon hole which you would need to pick up the next day.

The whole exercise to write a small-sized program could take anywhere between a couple of days to a whole week.

In the early 1990s things got a little better where one could code, compile, link and execute sitting at one’s desk. However while the programming itself got much simpler than before, certain tasks were still difficult. Till the late 90s programs of any sort had to be written using a regular text editor (vi , emacs etc.) You would then have to go through the process of compiling, linking and executing.

An angry compiler would typically spew forth venom at missing semi-colons, undeclared variables, and uninitialized values. This would happen till you are able to iron out all syntax errors. Then you would link, get undefined symbols and have to include appropriate libraries etc. And then finally you would execute your code, only to have it crash. The process of debugging would then start.

Luckily technology has made life a whole lot easier except for the last step where you could still run into an execution errors . In these days an IDE (Interactive Development Environment) like Eclipse will flag syntax errors, missing definitions/declarations etc. as you write your code. Moreover Eclipse can also indicate which libraries (imports) you would need to include in your package for it to build. The only missing step in IDEs of these days is the ability to predict possible execution errors in your program. I wouldn’t be surprised, if in future, like Microsoft Word, the IDE is able to tell you if a programming construct does not make sense.

So things have gotten a lot easier for the programmer. The following tips for are particularly useful as you progress along in programming

These days when you are learning a new programming language it is not necessary to know the language from cover to cover by reading a book. In those days when we learnt C it was necessary to know everything from bit structures, macros, pragma etc. The reason being that every syntax or execution error one had to rush to get the textbook and thumb through it for the answer. Not so, in these days of Google. You have the world’s library at your fingertips.
To get started it is necessary to learn just the most important programming constructs of the language say structure, class, car, cdr besides the usual suspects like loops, conditions and case constructs
Download and install an IDE for the language. In most case Eclipse will work
Try to write a simple program and test out your code.
To do any sort of programming these days you will necessarily need to make 3 friends
1. Google
2. Stackoverflow
3. Git & GitHub
Honing your Googling skills is very important. There are answers to almost any sort of programming problems out there. You would be surprised to know that there are many others who did exactly the same stupid mistake that you did out there. Also googling will take you to interesting tutorials, blogs, articles that discuss different aspects of the programming language and the problem you are trying to solve
Stackoverflow is really a God send to all programmers. There are so many questions on so many aspects of every programming language on earth there. If you spend time searching Stackoverflow you are bound to find answers, code snippets that you can readily use in your code
Post your questions in stackoverflow when you don’t find the answers there. You are bound to get quick answers. Thanks to the gamification of Stackoverflow (points, upvotes,badges etc) that has been created on Stackoverflow.
Git & GitHub: I would suggest that you download and install GitHub for Windows. This will provide you with version control on your desktop. You can modify code while being to switch back to an earlier version with Git. Read up a good tutorial on Git for Windows
Once you have working code you push it onto GitHub and share with other programmers

Now that you have the basic setup here are few other extremely important tips

The most important criteria for programming is ‘attitude’. Initially you are bound to get frustrated, angry, irritated etc. But it is necessary to look at the errors that you get with the right attitude. Know that an error is telling you something. Usually the answers to your mistake are in the ‘error message’ itself. Look at it closely and try to understand it. You will learn a lot more when you learn from errors than from copy-pasting from somebody else’s code, even if works right the first time around!
Make sure you do something different each time. As Einstein said “ If you keep doing the same thing, you will keep getting the same result’
There are different ways to debug your code. You could use the debugger and single step through the code and keep checking the values of the variables. I personally prefer print statements to localize where things are going wrong. I then try to narrow down the problem to a few lines of code and try to take it apart.

Hopefully the above tips are useful. Programming can be creative activity and will be indispensable in our future.

Above all have fun coding, there are so many possibilities these days!

Also see

1. Programming languages in layman’s language
2. The common alphabet of programming languages
3. The mind of the programmer
4. Programming Zen and now – Some essential tips -2