Presentation on the “Design principles of scalable, distributed systems”

Design principles of scalable, distributed systems from Tinniam Ganesh

Also see my blog post on this topic “Design principles of scalable, distributed system“

Design Principles of Scalable, Distributed Systems

Designing scalable, distributed systems involves a completely different set of principles and paradigms when compared to regular monolithic client-server systems. Typical large distributed systems of Google, Facebook or Amazon are made up of commodity servers. These servers are expected to fail, have disk crashes, run into network issues or be struck by any natural disasters.

Rather than assuming that failures and disasters will be the exception these systems are designed assuming the worst will happen. The principles and protocols assume that failures are the rule rather than the exception. Designing distributed systems to accommodate failures is the key to a good design of distributed scalable systems. A key consideration of distributed system is the need to maintain consistency, availability and reliability. This of course is limited by the CAP Theorem postulated by Eric Brewer which states that a system can only provide any two of “consistency, availability and partition tolerance” fully,

Some key techniques in distributed systems

Vector Clocks: An obvious issue in distributed systems with hundreds of servers is that each server will have its own clock running at a slightly different rate. It is difficult to get a view of a global time considering that each system has slightly different clock speeds. How does one determine causality in such a distributed system? The solution to this problem is provided by Vector Clocks devised by Leslie Lamport. Vector Clocks provide a way of determining the causal ordering of events. Each system maintains an array of timestamps based on its own internal clock which it keeps incrementing. When a system needs to send an event to another system it sends the message with the timestamp generated from its internal array. When the receiving system receives the message at a timestamp that is less than the sender’s timestamp it increments its own timestamp by 1 and continues to increments its internal array through its own internal clock. In the figure the event sent from System 1 to System 2 is assumed to be fine since the timestamp of the sender “2” < “15. However when System 3 sends an event with timestamp 40 to System 2 which received it timestamp 35, to ensure a causal ordering where System 2 knows that it received the event after it was sent from System the vector clock is incremented by 1 i.e 40 + 1 = 41 and System 2 increments at it did before, This ensures that partial ordering of events is maintained across systems.

Vector clocks have been used in Amazon’s e-retail website to reconcile updates. The use of vector clocks to manage consistency has been mentioned in Amazon’s Dynamo Architecture

Distributed Hash Table (DHT): The Distributed Hash Table uses a 128 bit hash mechanism to distribute keys over several nodes that can be conceptually assumed to reside on the circumference of a circle. The hash of the largest key coincides with the hash of the smallest key. There are several algorithms that are used to distribute the keys over this conceptual circle. One such algorithm is the Chord System. These algorithms try to get to the exact node in the smallest number of hops by storing a small amount of local data at each node. The Chord System maintains a finger table that allows it to get to the destination node in O (log n) number of hops. Other algorithms try to get to the desired node in O (1) number of hops. Databases like Cassandra, Big Table, and Amazon use a consistent hashing technique. Cassandra spreads the keys of records over distributed servers by using a 128 bit hash key.

Quorum Protocol: Since systems are essentially limited to choosing two of the three parameters of consistency, availability and partition tolerance, tradeoffs are made based on cost, performance and user experience. Google’s BigTable chooses consistency over availability while Amazon’s Dynamo chooses ‘availability over consistency”. While the CAP theorem maintains that only 2 of the 3 parameters of consistency, availability and partition tolerance are possible it does not mean that Google’s system does not support some minimum availability or the Dynamo does not support consistency. In fact Amazon’s Dynamo provides for “eventual consistency” by which data become consistent after a period of time.

Since failures are inevitable and a number of servers will fail at any instant of time writes are replicated across many servers. Since data is replicated across servers a write is considered “successful” if the data can be replicated in N/2 +1 servers. When the acknowledgement comes from N/2+1 server the write is considered successful. Similarly a quorum of reads from >N/2 servers is considered successful. Typical designs have W+R > N as their design criterion where N is the total number of servers in the system. This ensures that one can read their writes in a consistent way. Amazon’s Dynamo uses the sloppy quorum technique where data is replicated on N healthy nodes as opposed to N nodes obtained through consistent hashing.

Gossip Protocol: This is the most preferred protocol to allow the servers in the distributed system to become aware of server crashes or new servers joining into the system, Membership changes and failure detection are performed by propagating the changes to a set of randomly chosen neighbors, who in turn propagate to another set of neighbors. This ensures that after a certain period of time the view becomes consistent.

Hinted Handoff and Merkle trees: To handle server failures replicas are sometimes sent to a healthy node if the node to which it was destined was temporarily down. For e.g. data destined for Node A is delivered to Node D which maintains a hint in its metadata that the data is to be eventually handed off to Node A when it is healthy. Merkle trees are used to synchronize replicas amongst nodes. Merkle trees minimize the amount of data that needs to be transferred for synchronization.

These are some of the main design principles that are used while designing scalable, distributed systems. A related post is “Designing a scalable architecture for the cloud”

Find me on Google+

When NoSQL makes better sense than MySQL

In large web applications where performance and scalability are key concerns a non –relational database like NoSQL is a better choice to the more traditional databases like MySQL, ORACLE, PostgreSQL etc. While the traditional databases are designed to preserve the ACID (atomic, consistent, isolated and durable) properties of data, these databases are capable of only small and frequent reads/writes.

However when there is a need to scale the application to be capable of handling millions of transactions the NoSQL model works better. There are several examples of such databases – the more reputed are Google’s BigTable, HBase, Amazon’s Dynamo, CouchDB & MongoDB. These databases are based on a large number of regular commodity servers. Accesses to the data are based on get(key) or set(key,value) type of APIs.

The database is itself distributed across several commodity servers. Accesses to the data are based on a consistent hashing scheme for example the Distributed Hash Table (DHT) method. In this method the key is hashed efficiently to one of the servers which can be visualized as lying on the circumference of the circle. The Chord System is one such example of the DHT algorithm. Once the destination server is identified the server does a local search in its data for the key value. Hence the key benefit of the DHT is that it is able to spread the data across multiple servers rather than having a monolithic database with a hot standby present.

The ability to distribute data and the queries to one of several servers provides the key benefit of scalability. Clearly having a single database handling an enormous amount of transactions will result in performance degradation as the number of transaction increases.

However the design of distributing data across several commodity servers has its own challenges, besides the ability to have an appropriate function to distribute the queries to. For e.g. the NoSQL database has to be able handle the requirement of new servers joining the system. Similarly since the NoSQL database is based on general purpose commodity servers the DHT algorithm must be able to handle server crashes and failures. In a distributed system this is usually done as follows. The servers in the system periodically convey messages to each other in order to update and maintain their list of the active servers in the database system. This is performed through a method known as “gossip protocol”

While databases like NoSQL, HBase, Dynamo etc do not have ACID properties they generally follow the CAP postulate. The CAP (Consistency, Availability and Partition Tolerance) theorem states that it is difficult to achieve all the 3 CAP features simultaneously. The NoSQL types of databases in order to provide for availability, typically also replicates data across servers in order to be able to handle server crashes. Since data is replicated across servers there is the issue of maintaining consistency across the servers. Amazon’s Dynamo system is based on a concept called “eventual consistency” where the data becomes consistent after a few seconds. What this signifies is that there is a small interval in which it is not consistent.

The NoSQL since it is non-relational does not provide for the entire spectrum of SQL queries. Since NoSQL is not based on the relational model queries that are based on JOINs must necessarily be iterated over in these applications. Hence the design of any application that needs to leverage the benefits of such non-relational databases must clearly separate the data management layer from the data storage layer. By separating the data management layer from how the data is stored we can easily accrue the benefits of databases like NoSQL.

While NoSQL kind of databases clearly have an excellent advantage over regular relational databases where high performance and scalability are key requirements the applications must be appropriately be tailored to take full advantage of the non-relation and distributed aspect of the database. You may also find the post “To Hadoop, or not to Hadoop” interesting.

Find me on Google+

Scaling out

Web Applications have challenges that are unique to the domain of the web. The key differentiating fact between internet technologies and other technologies is the need to scale up to handle sudden and large increases in traffic. While telecommunication and data communication also have the need to handle high traffic these products can be dimensioned on some upper threshold. Typical web applications have the need to provide low latencies, handle large throughput and also be extremely scalable to changing demands.

The ability to scale seems trivial. It appears that one could just add more CPU horsepower and throw in memory and bandwidth in large measure to get the performance we need. But unfortunately this is not as simple as it appears. For one, adding more CPU horsepower may not necessarily result in better performance. A badly designed application will only improve marginally.

Some applications are “embarrassingly parallel”. There is no dependency of one task on another. For example if the task is to search for a particular string in documents or one that requires the conversion from AVI to MPEG format then this can be done in parallel. In a public cloud this could be achieved by running more instances.

However, most real world applications are more sequential than parallel. There is a lot of dependency of data between the modules of the application. When multiple instances have to run in a public cloud the design considerations can be quite daunting.

For example if we had an application with parts 1,2,3,4 as follows

Now let us further assume that this is a web application and thousands of requests come to it. For simplicity sake, if we assume that for each request there is counter that has to be incremented. How does one keep track of the total requests that are coming to application? The challenge is how to manage such a global counter when there are multiple instances. In a monolithic application this does not pose a problem. But with multiple instances handling web requests, each having its own copy of this counter the design becomes a challenge

Each instance has its own copy of the counter which it will update based on the requests that come to that instance through a load balancer. However, how does one compute the total number of requests that e come to all the instances

One possible solution is to use the memcached approach. Memcached was developed as solution by Danga Corporation for Livejournal. Memcached is a distributed caching mechanism that stores data in multiple participating servers. Memcached has some simple API calls like get(key) and set (key,value) . Memcached uses a consistent hashing mechanism which hashes the key to one of the servers among several servers. This method is known as the Distributed Hashing Table (DHT) by which it is able to distribute the keys to one of the servers. The Consistent Hashing technique is able to handle server crashes and new servers joining in the distributed cache. Since data is distributed among servers participating in the distributed cache when a server crashes all its data is distributed to the remaining servers. Similarly a server joining in the distributed cache also distributes some of the data to it. Memcached has been used in Facebook, Zynga and Livejournal.