Natural selection of database technology through the years

Charles Darwin, in his landmark book "The Origin of Species", discusses how flora and fauna evolved through the centuries. The different features of each individual species would undergo a process of natural selection, by which modifications of attributes would naturally occur that enabled the species to adapt and propagate through time. Those modifications that failed to adapt would naturally become extinct.

In this post I discuss how database (DB) technology has evolved over the years. As new requirements arose, database technology had to adapt and newer paradigms evolved. However, unlike species which became extinct, the older versions still exist, as they continue to address the earlier problems that remain today.

Here is a brief history of the evolution of databases.

Relational databases: Relational databases had their genesis when E.F. Codd of IBM came up with a relational model of organizing data. In this model all data is organized as tables with several rows. Each row has several columns, and one of the columns contains a unique value for each row, called the primary key. Relational databases have ruled the enterprise domain for more than three decades. An enterprise's data is organized as a set of related tables, and users can query the database using Structured Query Language, or SQL.

I remember in the late 1980s, when I started to work in the industry, programming jobs were much sought after by all of us engineering graduates. In those days database jobs were 'uncool', and system programming jobs dealing with writing assemblers and compilers were the really cool jobs. I was also susceptible to this prevailing opinion and stayed away from databases. As fate would have it, I eventually moved into telecom and telecom protocol work, in which I worked for more than two decades, and have largely maintained my distance from databases.

However, in recent times I did want to brush up whatever little I knew of databases. Recently I was listening to the Coursera course "Introduction to Data Science" by Bill Howe. In one of the lectures the professor said something that really caught my fancy: he mentions that SQL is probably the closest to natural language. How true! Once the DB schema and tables have been set up, querying the DB for all sorts of data can be done in SQL, which is close to natural language. For example:

SELECT a, b, c FROM S JOIN T ON S.key = T.key WHERE condition_1 AND condition_2;

The power of relational DBs comes from the fact that all the data is organized as tables, which enables one to retrieve any sort of data from them. Trying to accomplish this with any other high-level programming language would take several hundreds of lines of code, and we would have to write functions for each individual query.
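For contrast, here is a rough Python sketch of what even the simple query above would involve if the data merely sat in program memory. The toy records, field names and filter conditions are assumptions for illustration:

s_rows = [{"key": 1, "a": "x", "b": 10}, {"key": 2, "a": "y", "b": 20}]
t_rows = [{"key": 1, "c": 3.5}, {"key": 2, "c": 7.0}]

# Hand-rolled equivalent of a join plus a filter, written for one specific query
result = []
for s in s_rows:
    for t in t_rows:
        if s["key"] == t["key"] and s["b"] > 5 and t["c"] < 5:
            result.append((s["a"], s["b"], t["c"]))

print(result)   # [('x', 10, 3.5)]

Every new question would need another hand-written loop like this, whereas SQL lets the database plan the retrieval for us.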

NoSQL databases: However, the utility of relational databases decreases as we scale to hundreds of gigabytes of data. In this age of the internet and the World Wide Web, data is easily of the order of several terabytes to a few petabytes. For example, weather modelling and social networks like Facebook, Twitter or LinkedIn all need to operate on millions of status updates or tweets per day. Traditional relational databases cannot handle such large sets of data, and this is where the concept of the NoSQL DB came into existence. NoSQL databases typically store data as key-value pairs. The singular advantage of NoSQL is that the database can scale horizontally; in other words, performance does not degrade with large increases in data size. In NoSQL databases data is hashed and uniformly distributed across commodity servers through a technique known as 'consistent hashing', and the data is also replicated across servers. This architecture is based on common, commodity servers which are expected to crash, yet the NoSQL DB continues to function correctly. The strength of NoSQL databases comes from the fact that servers can join or leave the cluster without affecting the functioning of the DB. Some of the more popular examples of NoSQL DBs are CouchDB, MongoDB, Riak, Voldemort and Dynamo. Do take a look at my post "When NoSQL makes better sense than MySQL".
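To make the idea of consistent hashing a little more concrete, here is a minimal Python sketch; the hash function, ring layout and server names are illustrative assumptions rather than any particular product's implementation:

import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Toy consistent-hash ring: a key maps to the first server clockwise on the ring."""
    def __init__(self, servers):
        self.ring = sorted((self._hash(s), s) for s in servers)

    def _hash(self, key):
        # Map any string to a point on the ring using MD5 (an illustrative choice)
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, key):
        h = self._hash(key)
        points = [p for p, _ in self.ring]
        idx = bisect_right(points, h) % len(self.ring)   # wrap around the ring
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.server_for("user:42"))   # the same key is always routed to the same server

When a server joins or leaves the ring, only the keys between it and its neighbour move to a new home, which is what lets a NoSQL cluster grow or shrink without reshuffling all the data.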

NewSQL: This variation of DB came into existence as there was a need for extremely fast performance for computing tasks like analytics. These DBs hold data entirely in memory, so access is blazingly fast. The most famous DB of this paradigm is SAP's HANA.

Graph Databases: Graph databases are the most recent entrants into database technology. This strain of databases came into existence to handle associative data more efficiently. In a graph database data is represented as a graph: nodes in the graph can be entities and edges can be relationships. A search on a graph database results in a traversal from a specified start node to a specified terminating node. 'Friends' on Facebook, 'followers/following' on Twitter and 'connections' on LinkedIn all use graph databases to map associations and enable easy search. This structure is also what allows these sites to make recommendations like 'People you may know'. Examples of graph databases are Google's Knowledge Graph and Neo4j.
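Under the hood, answering such a query is essentially a traversal of the graph. Here is a minimal Python sketch with a toy friend graph and a breadth-first search; the data and function are illustrative assumptions, not any graph database's API:

from collections import deque

# A toy social graph: nodes are people, edges are 'friend' relationships
friends = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave":  ["bob", "carol", "eve"],
    "eve":   ["dave"],
}

def shortest_path(graph, start, end):
    """Breadth-first traversal from start to end, returning the path of nodes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path(friends, "alice", "eve"))   # ['alice', 'bob', 'dave', 'eve']

People who are two hops away but not yet direct friends are exactly the 'You may know' candidates.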

As we move ahead, database technology will continue to evolve into newer architectures to handle the newer requirements that arise.


Pregelian philosophy – Thinking Web Scale – Part 2

"If you squint the right way, you will notice that graphs are everywhere." This is a line from a research blog at Google, "Large-scale graph computing at Google". Transportation problems, disease outbreaks, computer networks and social networks all involve graphs in one way or another. The social networks of today, with friends on Facebook, followers on Twitter and connections on LinkedIn, are all clearly graphs. The billions of pages in the World Wide Web with their incoming and outgoing links form a massive directed graph.

Pregel is Google's computing model for graph processing. Google came up with this programming model to compute the PageRanks of individual web sites. Pregel is a powerful framework for processing directed graphs. A directed graph has vertices and edges, with edges directed towards or away from the vertices. Pregel is a highly scalable model and is well suited for Web Scale problems: it is capable of processing directed graphs with billions of vertices and trillions of edges.

The Map-Reduce paradigm is not particularly well suited for this kind of iterative, message-passing computation. Map-Reduce is capable of processing several documents, images, or matrices in parallel, but when handling directed graphs with Map-Reduce the combiners/reducers have to wait for the mappers to finish their tasks. In Pregel, programs are executed as a sequence of iterations in which each vertex receives messages sent to it in the previous iteration, executes code, and sends messages to other vertices. Each vertex can also modify its state and the topology of the graph.

Here is the model of Pregel.

[Figure: the Pregel computation model]

This picture is taken from a lecture in the Web Intelligence and Big Data course by Gautam Shroff on Coursera.

Pregel works on a sequence of iterations known as supersteps. In each superstep S, each vertex receives the messages sent to it in the previous superstep S-1, executes a user-defined function specified for the vertex, and sends messages to other vertices, which they will receive in superstep S+1. The supersteps at all vertices are conceptually supposed to occur in parallel. At each superstep the vertex can alter its state and the state of its outgoing edges. The synchronicity of the model is what makes it semantically manageable.
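To make the superstep model concrete, here is a minimal, single-machine Python sketch of the vertex-centric loop. The Vertex class, its field names and the driver are my own illustrative assumptions, not the actual Pregel API:

class Vertex:
    def __init__(self, vid, value, neighbours):
        self.id, self.value, self.neighbours = vid, value, neighbours
        self.active = True
        self.outbox = []             # (destination id, message) pairs for superstep S+1

    def vote_to_halt(self):
        self.active = False

    def send(self, dest, msg):
        self.outbox.append((dest, msg))

    def compute(self, messages):
        raise NotImplementedError    # the user-defined function, one per algorithm

def run(vertices, max_supersteps=50):
    inbox = {v.id: [] for v in vertices}
    for superstep in range(max_supersteps):
        for v in vertices:           # a halted vertex is woken up by an incoming message
            if inbox[v.id]:
                v.active = True
        if not any(v.active for v in vertices):
            break                    # all vertices voted to halt and no messages in transit
        for v in vertices:           # conceptually these all run in parallel
            if v.active:
                v.compute(inbox[v.id])
        inbox = {v.id: [] for v in vertices}
        for v in vertices:           # deliver messages to be read in the next superstep
            for dest, msg in v.outbox:
                inbox[dest].append(msg)
            v.outbox = []
    return vertices

An algorithm plugs in by subclassing Vertex and overriding compute(); the maximum-value pseudocode later in this post maps directly onto this skeleton.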

Pregel is realized on hundreds of commodity servers. The input to the Pregel model is a directed graph. During initialization the vertices are partitioned and each server receives a set of vertices. Each vertex is associated with a modifiable, user-defined value.

In each superstep each vertex computes the user-defined function in parallel, using the messages sent to it in the previous superstep. A vertex can modify its state, its outgoing edges or the topology of the graph.

Pregel has been used for a variety of problems ranging from computing PageRank to finding shortest paths. Vertices are first-class citizens in Pregel. Algorithms terminate by a process of voting to halt: the computation as a whole is complete when every vertex has voted to halt and there are no messages in transit.

A simple example of how Pregel determines the maximum value is illustrated in the original Google research paper Pregel: A System for Large-Scale Graph Processing.

[Figure: finding the maximum value in a graph with Pregel (from the Pregel paper)]
The pseudo code for this can be written as

compute(messages) {
    i_val := val
    for each message m in messages {
        if (m > val)
            val := m
    }
    // in superstep 0 there are no messages, so every vertex sends out its value
    if (superstep > 0 and i_val == val)
        vote_to_halt()
    else
        for each neighbor v
            send_message(v, val)
}

In the above diagram the four vertices are initialized with the values shown. In superstep 1, vertices 3 and 6 exchange their values. Vertex 3 updates its value to 6, while vertex 6 drops the incoming value and votes to halt. Similarly vertex 1 updates its value to 6, while vertex 2, which receives the value 1, votes to halt. In superstep 2, vertex 2 receives the value 6 from the vertex to its right, is woken up and updates its value. In superstep 3 no messages are passed and all vertices have voted to halt, so the computation terminates.

Another interesting application is the evaluation of PageRank. PageRank essentially determines the probability of landing on a page if a surfer clicked links on web pages at random.

PageRank is evaluated iteratively as follows (source: wikipedia.org):

PR(p) = (1 - d)/N + d * ( PR(q1)/L(q1) + PR(q2)/L(q2) + ... ), where q1, q2, ... are the pages that link to p, L(q) is the number of outbound links on page q, N is the total number of pages, and d is the damping factor (typically 0.85).

This can be done iteratively through Pregel as below

virtual void Compute(MessageIterator* msgs) {
  if (superstep() >= 1) {
    double sum = 0;
    for (; !msgs->Done(); msgs->Next())
      sum += msgs->Value();
    *MutableValue() = 0.15 / NumVertices() + 0.85 * sum;   // --> (A)
  }
  if (superstep() < 30) {
    const int64 n = GetOutEdgeIterator().size();
    SendMessageToAllNeighbors(GetValue() / n);             // --> (B)
  } else {
    VoteToHalt();
  }
}

The Pregel computation is initialized such that in superstep 0, the value of each vertex is 1 / NumVertices(). In each of the first 30 supersteps, each vertex sends along each outgoing edge its tentative PageRank divided by the number of outgoing edges (see step B above).

Starting from superstep 1, each vertex sums up the values arriving on messages into sum and sets its own tentative PageRank to 0.15/NumVertices() + 0.85 * sum (see step A). After reaching superstep 30, no further messages are sent and each vertex votes to halt.

Clearly the PageRank of the billions of pages indexed by Google in the World Wide Web can be computed fairly efficiently by Pregel.

Pregel also includes 'combiners', which collapse the messages destined for a vertex using functions like SUM, MIN, MAX and AVERAGE, saving both message traffic and computational steps.
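As a rough illustration, a combiner for something like the PageRank computation could simply add up all the messages headed for one vertex before they are delivered. This is a minimal sketch against the toy framework sketched earlier, not the real Pregel combiner API:

def sum_combiner(pending_messages):
    """Collapse all messages destined for one vertex into a single summed message."""
    return [sum(pending_messages)] if pending_messages else []

# Three PageRank contributions for one vertex become a single message
print(sum_combiner([0.05, 0.10, 0.02]))   # one message carrying the total, instead of three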

Pregel also includes checkpointing, in which the workers save the state of their vertices at superstep boundaries so that the computation can be rolled back to the last saved state in case of failure.

The synchronous methodology of Pregel helps in avoiding the deadlocks and races which are prevalent in asynchronously communicating programs.

Pregel is a powerful programming model which will find applications in many Web Scale applications in the future.


‘The Search’ is not yet over!

Published in Telecom Asia, Oct 9, 2013 – 'The search' is not yet over!

In this post I take a look at the technologies that power the now indispensable and ubiquitous ‘search’ that we do over the web. It would be  easy to rewind the world by at least 3 decades by simply removing ‘the search’ from our lives.

A classic article in the New York Times, ‘The Twitter trap’ discusses how technology is slowly replacing some of our more common faculties like the ability to memorize or perform simple calculations mentally.

For example, until the 15th century people had to remember a lot of information. Then came Gutenberg with his landmark invention, the printing press, which did away with the need to commit information to memory. Closer to the 20th century, 'Google Search' now obviates the need to remember facts about anything; detailed information is just a mouse click away.

Here's a closer look at the evolution of search technologies.

The Inverted Index: The inverted index is a way to search for the existence of key words or phrases in documents. It is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. Storing each word along with the documents in which it appears allows for quick retrieval of the related documents containing the word(s). Search engines like Google, Bing or Yahoo typically crawl the web continuously and keep updating this index of words versus documents as new web pages and web sites are added. By itself, though, the inverted index is a simplistic method: it says nothing about which of the matching documents are the most relevant.

[Figure: an inverted index mapping words to the documents that contain them]
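Here is a minimal Python sketch of the idea; the toy documents and the simple whitespace tokenisation are illustrative assumptions:

from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog jumps",
}

# Build the inverted index: word -> set of document ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(*words):
    """Return the ids of documents containing every query word."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(search("quick", "dog"))   # {3}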

Google's Page Rank: As mentioned before, merely the presence of words in documents is not sufficient to return good search results. So Google came along with its PageRank algorithm. PageRank is an algorithm used by Google's web search engine to rank websites in its search engine results. According to Google, PageRank works by "counting the number and quality of links to a page to determine a rough estimate of how important the website is." The underlying assumption is that more important websites are likely to receive more links from other websites.

In essence the PageRank algorithm tries to determine the likelihood that a surfer will land on a particular page by randomly clicking on links. Clearly the PageRank algorithm has been very effective for search; 'googling' is now synonymous with searching (see the figure below, from Wikipedia).

[Figure: PageRank values for a small network of web pages (from Wikipedia)]
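To make the 'random surfer' intuition concrete, here is a minimal Python sketch of the iterative PageRank computation on a toy link graph. The graph, damping factor and iteration count are illustrative assumptions (real web-scale PageRank is computed on frameworks like Pregel, as discussed earlier):

# Toy web: page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, d=0.85, iterations=30):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}             # start with a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1 - d) / n for p in pages}
        for p, outgoing in links.items():
            share = rank[p] / len(outgoing)         # this page's rank, split over its links
            for q in outgoing:
                new_rank[q] += d * share
        rank = new_rank
    return rank

print(pagerank(links))   # C accumulates the highest rank since most pages link to it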

Graph database: While 'Google Search' is extremely powerful, it would make more sense if the search could be precisely tailored to what the user is trying to find. A typical Google search throws up a few hundred million results. This has led to even more powerful techniques, one of which is the graph database. In a graph database data is represented as a graph: nodes in the graph can be entities and edges can be relationships. A search on the graph database results in a traversal of the graph from a specific start node to a specific terminating node. Here is a fun representation of a simple graph database from InfoQ.

[Figure: a simple graph database example from InfoQ (Neo4j)]

Google has recently come out with its Knowledge Graph, which is based on this technology. Facebook allows users to create complex queries over status updates using its graph database.

Finally, what next??

A Cognitive Search??: Even with the graph database the results cannot be very context specific. What I would hope to see in the future is a sort of 'Cognitive Search', where the results would be bounded and would take into account the semantics and context of a user-specified phrase or request.

So, for example, if a user specified 'Events leading to the Iraq war', the search should return all the events that finally culminated in the Iraq war.

Alternatively, if I were interested in, say, 'the impact of the iPad on computing', then the search should return precise results spanning the making of the iPad, the launch of the iPad, the various positive and negative reviews, and the impact the iPad has had on the tablet and computing industry as a whole.

Another interesting query would be 'The events that led to the downfall of a particular government in the election of 1998'; here the search should precisely output all those events that were significant during this specific period.

I would assume that the results themselves would come out as a graph with nodes and edges specifying which event(s) triggered which other event(s) with varying degrees of importance.

However this kind of ability where the search is cognitive is probably several years away!
