‘The Search’ is not yet over!

Published in Telecom Asia, Oct 9, 2013 – ‘The search’ is not yet over!

In this post I take a look at the technologies that power the now indispensable and ubiquitous ‘search’ that we do over the web. It would be easy to rewind the world by at least three decades simply by removing ‘the search’ from our lives.

A classic article in the New York Times, ‘The Twitter Trap’, discusses how technology is slowly replacing some of our more common faculties, like the ability to memorize or to perform simple calculations mentally.

For example, until the 15th century people had to remember a lot of information. Then came Gutenberg with his landmark invention, the printing press, which did away with the need to store information in one’s head. Closer to the 20th century, ‘Google Search’ now obviates the need to remember facts about anything. Detailed information is just a mouse click away.

Here’s a closer look at the evolution of search technologies.

The Inverted Index: The inverted index is a way to search for the presence of key words or phrases in documents. It is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents. Storing each word together with the documents in which it appears allows for quick retrieval of the documents containing the searched word(s). Search engines like Google, Bing or Yahoo typically crawl the web continuously and keep updating this index of words versus documents as new web pages and web sites are added. On its own, though, the inverted index is a simplistic method and is neither accurate nor efficient.

[Figure: Inverted index]
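
To make this concrete, here is a minimal sketch in Python (the documents are made up purely for illustration) that builds an inverted index and looks up the documents containing a word

from collections import defaultdict

# Toy documents keyed by document id (hypothetical content)
docs = {
    1: "the quick brown fox",
    2: "the lazy dog sleeps",
    3: "the quick dog barks",
}

# Build the inverted index: word -> set of document ids containing it
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

# Retrieve the documents in which a word occurs
print(sorted(inverted_index["quick"]))   # [1, 3]
print(sorted(inverted_index["dog"]))     # [2, 3]

A multi-word query can then be answered by intersecting the document sets of the individual words.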

Google’s PageRank: As mentioned above, the mere presence of words in documents is not sufficient to return good search results. So Google came along with its PageRank algorithm. PageRank is an algorithm used by Google’s web search engine to rank websites in its search engine results. According to Google, PageRank works by “counting the number and quality of links to a page to determine a rough estimate of how important the website is.” The underlying assumption is that more important websites are likely to receive more links from other websites.

In essence the PageRank algorithm tries to determine the likelihood that a surfer will land on a particular page by randomly clicking on links. Clearly the PageRank algorithm has been very effective for search, as ‘googling’ is now synonymous with searching (see the illustration below, from Wikipedia).

[Figure: PageRanks (Wikipedia)]
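
To get a feel for the idea, here is a small sketch in Python of the power-iteration scheme that underlies PageRank, run on a tiny made-up link graph (the damping factor of 0.85 is the value commonly quoted in the literature)

# Tiny made-up web: page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

# Repeatedly redistribute each page's rank along its outgoing links
for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

print(rank)   # page C, with the most inbound links, ends up with the highest rank

Google’s production algorithm is of course far more elaborate, but the intuition of rank flowing along links is the same.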

Graph database: While the ‘Google Search’ is extremely powerful, it would make even more sense if the search could be precisely tailored to what the user is trying to find. A typical Google search throws up a few hundred million results. This has led to even more powerful techniques, one of which is the ‘graph database’. In a graph database data is represented as a graph: nodes in the graph are entities and edges are the relationships between them. A search on the graph database results in a traversal of the graph from a specific start node to a specific terminating node. Here is a fun representation of a simple graph database from InfoQ

[Figure: A simple graph database representation from InfoQ (Neo4j Matrix example)]
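
As a rough sketch of the idea (this is not the API of any particular graph database product), entities and relationships can be held in an adjacency structure, and a search then becomes a traversal from a start node to a terminating node

from collections import deque

# Hypothetical graph: node -> list of (relationship, neighbour) edges
graph = {
    "Neo":      [("KNOWS", "Morpheus")],
    "Morpheus": [("KNOWS", "Trinity"), ("KNOWS", "Cypher")],
    "Cypher":   [("KNOWS", "Agent Smith")],
    "Trinity":  [("LOVES", "Neo")],
}

def find_path(start, goal):
    """Breadth-first traversal from start to goal, returning the edges on the path."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(node, rel, neighbour)]))
    return None

print(find_path("Neo", "Agent Smith"))
# [('Neo', 'KNOWS', 'Morpheus'), ('Morpheus', 'KNOWS', 'Cypher'), ('Cypher', 'KNOWS', 'Agent Smith')]

Dedicated graph databases such as Neo4j expose this kind of traversal through their own query languages rather than hand-written code.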

Google has recently come out with its Knowledge Graph, which is based on this technology. Facebook allows users to create complex queries on status updates using a graph database.

Finally, what next?

A Cognitive Search?: Even with a graph database the results are not very context specific. What I would hope to see in the future is a sort of ‘Cognitive Search’, where the results would be bounded and would take into account the semantics and context of a user-specified phrase or request.

So, for example, if a user specified ‘Events leading to the Iraq war’, the search should return all the events that finally culminated in the Iraq war.

Alternatively, if I were interested in, say, ‘the impact of the iPad on computing’, the search should return precise results covering the making of the iPad, its launch, the various positive and negative reviews, and the impact the iPad has had on the tablet and computing industry as a whole.

Another interesting query would be ‘The events that led to the downfall of a particular government in the election of 1998’, for which the search should precisely output all those events that were significant during this specific period.

I would assume that the results themselves would come out as a graph with nodes and edges specifying which event(s) triggered which other event(s) with varying degrees of importance.

However, this kind of cognitive search ability is probably several years away!

Find me on Google+

Test driving Apache Hadoop: Standalone & pseudo-distributed mode

The Hadoop paradigm originated from Google and is used for crunching large data sets. It is ideally suited for applications like Big Data, creating the inverted index used by search engines, and other problems that require terabytes of data to be processed in parallel. One key aspect of Hadoop is that it runs on clusters of commodity servers. Hence server crashes, disk crashes or network issues are assumed to be the norm rather than the exception.

The Hadoop paradigm is made up of the Map-Reduce and HDFS parts. Map-Reduce has 2 major components to it. The Map part takes as input key-value pairs and emits transformed key-value pairs. For example, Map could count the number of occurrences of words, or create an inverted index of a word and its location in a document. The Reduce part takes as input the key-value pairs emitted by the Map part and performs another operation on them, for example summing up the counts of words. A great tutorial on Map-Reduce can be found at http://developer.yahoo.com/hadoop/tutorial/module4.html
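
As a rough illustration (a toy in-memory simulation in Python, not the actual Hadoop API), the classic word-count example can be written with an explicit map, shuffle and reduce step

from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) key-value pair for every word in the input."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    """Reduce: sum up all the counts emitted for a given word."""
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Shuffle/sort step: group the values emitted by Map by their key
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

# Reduce step: one call per key
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(word_counts)   # e.g. {'the': 3, 'quick': 2, 'dog': 2, ...}

In Hadoop proper, the Map and Reduce functions run as tasks spread across the cluster and the framework performs the shuffle between them.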

The HDFS (Hadoop Distributed File System) is the special storage that is used in tandem with the Map-Reduce algorithm. HDFS distributes data among DataNodes, while a NameNode maintains the metadata of where the individual pieces of data are stored.
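
Conceptually, the NameNode’s metadata is just a mapping from files to blocks and from blocks to the DataNodes holding their replicas. A toy sketch of that bookkeeping in Python (all file, block and node names here are made up) might look like

# Hypothetical view of the metadata held by a NameNode
file_to_blocks = {
    "/user/hduser/input/core-site.xml": ["blk_001"],
    "/user/hduser/logs/big-log.txt":    ["blk_002", "blk_003"],   # a large file is split into blocks
}

block_to_datanodes = {
    "blk_001": ["datanode1", "datanode2", "datanode3"],   # with dfs.replication = 3
    "blk_002": ["datanode1", "datanode2", "datanode4"],
    "blk_003": ["datanode2", "datanode3", "datanode4"],
}

def locate(path):
    """Return, for each block of a file, the DataNodes that hold a replica."""
    return {blk: block_to_datanodes[blk] for blk in file_to_blocks[path]}

print(locate("/user/hduser/logs/big-log.txt"))

A client first asks the NameNode where the blocks of a file are, and then reads or writes the data directly from or to the DataNodes.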

To get started with Apache Hadoop, download a stable release of Hadoop (e.g. hadoop-1.0.4.tar.gz) from

http://hadoop.apache.org/common/releases.html#Download

a) Install Hadoop on your system, preferably under /usr/local/ (change to that directory before extracting)
tar xzf ./Downloads/hadoop-1.0.4.tar.gz

sudo mv hadoop-1.0.4 hadoop (rename hadoop-1.0.4 to hadoop for convenience)
sudo chown -R hduser:hadoop hadoop
Apache Hadoop requires Java to be installed. Download and install Java on your machine from

http://www.oracle.com/technetwork/java/javase/downloads/jdk-7u4-downloads-1591156.html

b) After you have installed Java, set $JAVA_HOME in
/usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_04 (uncomment and set this to the directory where your JDK is installed)

c) Create a user hduser in group hadoop
For this click  Applications->Other->User & Groups
Choose Add User – hduser  &  Add Group – hadoop
Choose properties and add hduser to the hadoop group.

Standalone Operation
Under root do
/usr/sbin/sshd
then
$ssh localhost

If ssh to localhost prompts for a password, then do the following to enable passwordless ssh
$ ssh-keygen -t rsa -P ""

You will get the following

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.


$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now re-run
$ssh localhost – this time it should go through without prompting for a password

Create a directory input and copy *.xml files from conf/
$mkdir input
$cp /usr/local/hadoop/share/hadoop/templates/conf/*.xml input

Then execute the following. This searches for strings matching the regular expression 'dfs[a-z.]+' in all the XML files under the input directory
$/usr/local/hadoop/bin/hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar grep input output 'dfs[a-z.]+'
You should see
[root@localhost hadoop]# /usr/local/hadoop/bin/hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar grep input output 'dfs[a-z.]+'
12/06/10 13:00:51 INFO util.NativeCodeLoader: Loaded the native-hadoop library
..

12/06/10 13:01:45 INFO mapred.JobClient:     Reduce output records=38
12/06/10 13:01:45 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
12/06/10 13:01:45 INFO mapred.JobClient:     Map output records=38

This indicates that there are 38 records matching the pattern 'dfs[a-z.]+'.
If you get the error
java.lang.OutOfMemoryError: Java heap space
then increase the heap size from 128 MB to 1024 MB by setting mapred.child.java.opts (typically in conf/mapred-site.xml) as below

<property>
<name>mapred.child.java.opts</name>
<value>-server -Xmx1024m -Djava.net.preferIPv4Stack=true</value>
</property>
….

Pseudo-distributed mode
In the pseudo-distributed mode separate Java processes are started for the JobTracker which schedules jobs, the TaskTracker which executes tasks, the NameNode which holds the HDFS metadata and the DataNode which stores the actual data.
A good post on setting up a single-node Hadoop cluster is Michael Noll's post – Running Hadoop on Ubuntu Linux (Single-Node Cluster)

a) Execute the following commands
. ./.bashrc
$mkdir -p /home/hduser/hadoop/tmp
$chown hduser:hadoop /home/hduser/hadoop/tmp
$chmod 750 /home/hduser/hadoop/tmp

b) Do the following
Note: The files core-site.xml, mapred-site.xml & hdfs-site.xml exist under both
/usr/local/hadoop/share/hadoop/templates/conf and /usr/local/hadoop/conf
It appears that Apache Hadoop gives precedence to /usr/local/hadoop/conf. So add the following between <configuration> and </configuration>
In the file /usr/local/hadoop/conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system.  Either the
literal string "local" or a host:port for NDFS.
</description>
<final>true</final>
</property>

In /usr/local/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<final>true</final>
</property>

In /usr/local/hadoop/conf/hdfs-site.xml add
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

Now perform
$sudo /usr/sbin/sshd
$ssh localhost (Note: you may have to generate a key as above if you get an error)

c) Since the pseudo-distributed mode uses the HDFS file system, we need to format it. So run the following command
$$HADOOP_HOME/bin/hadoop namenode -format
[root@localhost hduser]#  /usr/local/hadoop/bin/hadoop namenode -format
12/06/10 15:48:16 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by ‘hortonfo’ on Tue May  8 20:16:59 UTC 2012
************************************************************/

12/06/10 15:48:17 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/06/10 15:48:18 INFO common.Storage: Storage directory /home/hduser/hadoop/tmp/dfs/name has been successfully formatted.
12/06/10 15:48:18 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
d) Now start all the Hadoop processes
$/usr/local/hadoop/bin/start-all.sh
starting namenode, logging to /var/log/hadoop/root/hadoop-root-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /var/log/hadoop/root/hadoop-root-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /var/log/hadoop/root/hadoop-root-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to /var/log/hadoop/root/hadoop-root-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to /var/log/hadoop/root/hadoop-root-tasktracker-localhost.localdomain.out
Verify that all processes have started by executing /usr/java/jdk1.7.0_04/bin/jps
[root@localhost hduser]# /usr/java/jdk1.7.0_04/bin/jps
10971 DataNode
10866 NameNode
11077 SecondaryNameNode
11147 JobTracker
11264 TaskTracker
11376 Jps

You will see the JobTracker, TaskTracker, NameNode, DataNode and SecondaryNameNode processes

You can also do netstat -plten | grep java
tcp        0      0 0.0.0.0:50090               0.0.0.0:*                   LISTEN      0          166832     11077/java
tcp        0      0 0.0.0.0:50060               0.0.0.0:*                   LISTEN      0          167407     11264/java
tcp        0      0 0.0.0.0:50030               0.0.0.0:*                   LISTEN      0          166747     11147/java
tcp        0      0 0.0.0.0:50070               0.0.0.0:*                   LISTEN      0          165669     10866/java
tcp        0      0 0.0.0.0:50010               0.0.0.0:*                   LISTEN      0          166951     10971/java
tcp        0      0 0.0.0.0:50075               0.0.0.0:*                   LISTEN      0          166955     10971/java
tcp        0      0 127.0.0.1:55839             0.0.0.0:*                   LISTEN      0          166816     11264/java
tcp        0      0 0.0.0.0:50020               0.0.0.0:*                   LISTEN      0          165843     10971/java
tcp        0      0 127.0.0.1:54310             0.0.0.0:*                   LISTEN      0          165535     10866/java
tcp        0      0 127.0.0.1:54311             0.0.0.0:*                   LISTEN      0          166733     11147/java

e) Check the web interfaces of the JobTracker & NameNode
The ports are as per mapred-site.xml & hdfs-site.xml in the conf directory. They are at

JobTracker – http://localhost:50030/

NameNode – http://localhost:50070/

f) Copy the files from your local directory /home/hduser/input to HDFS
$/usr/local/hadoop/bin/hadoop dfs -copyFromLocal /home/hduser/input /user/hduser/input
Ensure that the files have been copied by listing the contents of HDFS

g) Check that files have been copied
$/usr/local/hadoop/bin/hadoop dfs -ls /user/hduser/input
Found 9 items
-rw-r--r--   1 root supergroup       7457 2012-06-10 10:31 /user/hduser/input/capacity-scheduler.xml
-rw-r--r--   1 root supergroup       2447 2012-06-10 10:31 /user/hduser/input/core-site.xml
-rw-r--r--   1 root supergroup       2300 2012-06-10 10:31 /user/hduser/input/core-site_old.xml
-rw-r--r--   1 root supergroup       5044 2012-06-10 10:31 /user/hduser/input/hadoop-policy.xml
-rw-r--r--   1 root supergroup       7595 2012-06-10 10:31 /user/hduser/input/hdfs-site.xml

h) Now execute the grep functionality
[root@localhost hduser]# /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar grep /user/hduser/input /user/hduser/output 'dfs[a-z.]+'
12/06/10 10:34:22 INFO util.NativeCodeLoader: Loaded the native-hadoop library
….
12/06/10 10:34:23 INFO mapred.JobClient: Running job: job_201206101010_0003
12/06/10 10:34:24 INFO mapred.JobClient:  map 0% reduce 0%
12/06/10 10:34:48 INFO mapred.JobClient:  map 11% reduce 0%


12/06/10 10:35:21 INFO mapred.JobClient:  map 88% reduce 22%
12/06/10 10:35:24 INFO mapred.JobClient:  map 100% reduce 22%
12/06/10 10:35:27 INFO mapred.JobClient:  map 100% reduce 29%
12/06/10 10:35:36 INFO mapred.JobClient:  map 100% reduce 100%
12/06/10 10:35:42 INFO mapred.JobClient: Job complete: job_201206101010_0003
….
12/06/10 10:36:16 INFO mapred.JobClient:     Reduce input groups=3
12/06/10 10:36:16 INFO mapred.JobClient:     Combine output records=0
12/06/10 10:36:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=180502528
12/06/10 10:36:16 INFO mapred.JobClient:     Reduce output records=36
12/06/10 10:36:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=695119872
12/06/10 10:36:16 INFO mapred.JobClient:     Map output records=36
i) Check the result
[root@localhost hduser]# /usr/local/hadoop/bin/hadoop dfs -cat /user/hduser/output/*
6          dfs.data.dir
2          dfs.
2          dfs.block.access.token.enable
2          dfs.cluster.administrators
2          dfs.datanode.address
2          dfs.datanode.data.dir.perm
2          dfs.datanode.http.address
2          dfs.datanode.kerberos.principal
2          dfs.datanode.keytab.file
2          dfs.exclude
…..

j) Finally, stop all the Hadoop processes
$/usr/local/hadoop/bin/stop-all.sh

Have fun with Hadoop…

Find me on Google+