Published in Telecom Asia, Oc9, 2013 – ‘The search’ is not yet over!
In this post I take a look at the technologies that power the now indispensable and ubiquitous ‘search’ that we do over the web. It would be easy to rewind the world by at least 3 decades by simply removing ‘the search’ from our lives.
A classic article in the New York Times, ‘The Twitter trap’ discusses how technology is slowly replacing some of our more common faculties like the ability to memorize or perform simple calculations mentally.
For e.g. until the 15th century people had to remember a lot of information. Then came Gutenberg with his landmark invention, the printing press, which did away with the need to store information. Closer to the 2oth century the ‘Google Search’ now obviates the need to remember facts about anything. Detailed information is just a mouse click away.
Here’s a closer look at evolution of search technologies
The Inverted Index: The inverted index is a way to search the existence of key words or phrases in documents. The inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. The ability to store words and the documents in which it is present, allows for an quick retrieval of the related documents in which the word(s) are present. Search engines like Google, Bing or Yahoo typically crawl of the web continuously and keep updating this index of words versus the documents as new web pages and web sites are added. The inverted index is a simplistic method and is neither accurate nor efficient.
Google’s Page Rank: As mentioned before merely the presence of words in documents alone is not sufficient to return good search results. So Google came along with its PageRank algorithm. PageRank is an algorithm used by the Google’s web search engine to rank websites in their search engine results. According to Google PageRank works by “counting the number and quality of links to a page to determine a rough estimate of how important the website is.” The underlying assumption is that more important websites are likely to receive more links from other websites.
In essence the PageRank algorithm tries to determine the likelihood that a surfer will land on a particular page by randomly clicking on links. Clearly the PageRank algorithm has been very effective for searches as now ‘googling’ is synonymous to searching (see below from Wikipedia)
Graph database: While the ‘Google Search’ is extremely powerful it would make more sense if the search could be precisely tailored to what the user is trying to search. A typical Google search throws up a few 100 million results. This has led to even more powerful techniques, one of which is the ‘Graph database’. In a Graph database data is represented as a graph. Nodes in the graph can be entities and edges could be relationships. A search on the graph database will result in the traversal of the graph from a specific start node to specific terminating node. Here is a fun representation of a simple Graph database representation from InfoQ
Google has recently come out with its Knowledge Graph which is based on this technology. Facebook allows users to create complex queries of status updates using the graph database.
Finally, what next??
A Cognitive Search??: Even with the graph database the results cannot be very context specific. What I would hope to happen in the future is have a sort of a ‘Cognitive Search’ where the results would be bounded and would take into account the semantics and context of a user specified phrase or request.
So for e.g. if a user specified ‘Events leading to the Iraq war’ the search should throw all results which finally culminated in the Iraq war.
Alternatively if I was interested in knowing for e.g. ‘the impact of iPad on computing’ then the search should throw precise results from the making of the iPad, the launch of iPad, the various positive and negative reviews and impact iPad has had on the tablet and computing industry as a whole.
Another interesting query would ‘The events that led to the downfall of a particular government in election of 1998’ the search should precisely output all those events that were significant during this specific period.
I would assume that the results themselves would come out as a graph with nodes and edges specifying which event(s) triggered which other event(s) with varying degrees of importance.
However this kind of ability where the search is cognitive is probably several years away!