Design Principles of Scalable, Distributed Systems

Designing scalable, distributed systems involves a completely different set of principles and paradigms when compared to regular monolithic client-server systems. Typical large distributed systems of Google, Facebook or Amazon are made up of commodity servers.  These servers are expected to fail, have disk crashes, run into network issues or be struck by any natural disasters.

Rather than assuming that failures and disasters will be the exception these systems are designed assuming the worst will happen.  The principles and protocols assume that failures are the rule rather than the exception. Designing distributed systems to accommodate failures is the key to a  good design of distributed scalable systems. A key consideration of distributed system is the need to maintain consistency, availability and reliability. This of course is limited by the CAP Theorem postulated by Eric Brewer which states that a system can only provide any two of “consistency, availability and partition tolerance” fully,

Some key techniques in distributed systems

Vector Clocks: An obvious issue in distributed systems with hundreds of servers is that each server will have its own clock running at a slightly different rate. It is difficult to get a view of a global time considering that each system has slightly different clock speeds. How does one determine causality in such a distributed system? The solution to this problem is provided by Vector Clocks devised by Leslie Lamport. Vector Clocks provide a way of determining the causal ordering of events.  Each system maintains an array of timestamps based on its own internal clock which it keeps incrementing. When a system needs to send an event to another system it sends the message with the timestamp generated from its internal array.  When the receiving system receives the message at a timestamp that is less than the sender’s timestamp it increments its own timestamp by 1 and continues to increments its internal array through its own internal clock. In the figure the event sent from System 1 to System 2  is assumed to be fine since the timestamp of the sender “2”  < “15. However when System 3 sends an event with timestamp 40 to System 2 which received it timestamp 35, to ensure a causal ordering where System 2 knows that it received the event after it was sent from System the vector clock is incremented by 1 i.e 40 + 1 = 41 and System 2 increments at it did before, This ensures that partial ordering of events is maintained across systems.

Vector clocks have been used in Amazon’s e-retail website to reconcile updates.  The use of vector clocks to manage consistency has been mentioned in Amazon’s Dynamo Architecture

Distributed Hash Table (DHT): The Distributed Hash Table uses a 128 bit hash mechanism to distribute keys over several nodes that can be conceptually assumed to reside on the circumference of a circle. The hash of the largest key coincides with the hash of the smallest key. There are several algorithms that are used to distribute the keys over this conceptual circle. One such algorithm is the Chord System. These algorithms try to get to the exact node in the smallest number of hops by storing a small amount of local data at each node. The Chord System maintains a finger table that allows it to get to the destination node in O (log n) number of hops. Other algorithms try to get to the desired node in O (1) number of hops.  Databases like Cassandra, Big Table, and Amazon use a consistent hashing technique. Cassandra spreads the keys of records over distributed servers by using a 128 bit hash key.

Quorum Protocol:  Since systems are essentially limited to choosing two of the three parameters of consistency, availability and partition tolerance, tradeoffs are made based on cost, performance and user experience. Google’s BigTable chooses consistency over availability while Amazon’s Dynamo chooses ‘availability over consistency”. While the CAP theorem maintains that only 2 of the 3 parameters of consistency, availability and partition tolerance are possible it does not mean that Google’s system does not support some minimum availability or the Dynamo does not support consistency. In fact Amazon’s Dynamo provides for “eventual consistency” by which data become consistent after a period of time.

Since failures are inevitable and a number of servers will fail at any instant of time writes are replicated across many servers. Since data is replicated across servers a write is considered “successful” if the data can be replicated in N/2 +1 servers. When the acknowledgement comes from N/2+1 server the write is considered successful. Similarly a quorum of reads from >N/2 servers is considered successful. Typical designs have W+R > N as their design criterion where N is the total number of servers in the system. This ensures that one can read their writes in a consistent way.  Amazon’s Dynamo uses the sloppy quorum technique where data is replicated on N healthy nodes as opposed to N nodes obtained through consistent hashing.

Gossip Protocol: This is the most preferred protocol to allow the servers in the distributed system to become aware of server crashes or new servers joining into the system, Membership changes and failure detection are performed by propagating the changes to a set of randomly chosen neighbors, who in turn propagate to another set of neighbors. This ensures that after a certain period of time the view becomes consistent.

Hinted Handoff and Merkle trees: To handle server failures replicas are sometimes sent to a healthy node if the node to which it was destined was temporarily down. For e.g.  data destined for Node A is delivered to Node D which maintains a hint in its metadata that the data is to be eventually handed off to  Node A when it is healthy.  Merkle trees are used to synchronize replicas amongst nodes. Merkle trees minimize the amount of data that needs to be transferred for synchronization.

These are some of the main design principles that are used while designing scalable, distributed systems. A related  post is “Designing a scalable architecture for the cloud

Find me on Google+

Spectrum: The Big Crunch is Coming

Published in The Hindu “Scarce spectrum impacts mobile broadband

Published in Voice & Data: Spectrum: The Big Crunch is Coming

The ubiquity of the mobile phone and its ability to access the internet has been nothing short of miraculous. Mobile broadband has had such a powerful impact in recent times that it was described as the “Mobile Miracle” by the ITU-T.

A report by the Broadband Commission (set up by ITU-T and UNESCO) says that mobile users grew from 740 mn in 2000 to 5 bn in 2010, of which 1.8 bn were mobile broadband users. And this report says that for a 10% increase in mobile penetration, there is an increase of 1.38% in the GDP of the region.

Powerful smartphones, extremely fast networks, content-rich applications, and increasing user awareness, have together resulted in a virtual explosion of mobile broadband data usage. This explosion has begun to ring warning bells the world over. For it is predicted that with the existing spectrum availability, the world will run out of spectrum capacity by the middle of this decade.

The reasons behind this are fairly obvious. The growth in mobile data traffic has been exponential. According to a report by Ericsson, mobile data is expected to double annually till 2015. Mobile broadband will see a billion subscribers this year (2011), and possibly touch 5 bn by 2015.

According to IDATE, a consulting firm, the total mobile data will exceed 127 exabytes (an exabyte is 1018 bytes, or 1 mn terabytes) by 2020, an increase of over 33% from 2010.

There are 2 key drivers behind this phenomenal growth in mobile data. One is the explosion of devices-smartphones, tablet PCs, e-readers, laptops with wireless access. All these devices deliver high-speed content and web browsing on the move. The second is video. Over 30% of overall mobile data traffic is video streaming, which is extremely bandwidth hungry. The rest of the traffic is web browsing, file downloads, and email.

The growth has been fuelled by advances in wireless technology, as it evolved from EDGE, HSPA to LTE. There’s high growth of HSPA networks in the US, Canada and Latin America. And there will be over 25 operators with commercial deployments of LTE by 2015. EDGE, HSPA, and LTE have been enabling the delivery of extremely high-speed data to and from the internet and between devices.
However the ability to squeeze more and more bits per hertz of spectrum comes with additional costs and increased complexity. And despite all the advances, there is a technological limit to the bandwidth possible in the existing spectrum. This upper bound is determined by Shannon’s theorem, which provides the theoretical limits to the capacity of a channel for sending or receiving data.

Given the current usage trends, coupled with the theoretical limits of available spectrum, the world will run out of available spectrum for the growing army of mobile users. The current spectrum availability cannot support the surge in mobile data traffic indefinitely, and demand for wireless capacity will outstrip spectrum availability by the middle of this decade.

According a report published by the International Telecommunication Union–Radio (ITU-R), the spectrum requirement for regions in the world will be between 500 MHz and 1 GHz by 2020. The demand for spectrum bandwidth, based on average mobile broadband spectrum usage, clearly indicates that this demand will exceed the supply of spectral capacity by the middle of 2014.
Mobile Spectrum is a scarce resource and the governments of all the nations must work to optimize the usage of this resource. The ITU-R allocates spectrum frequencies for the use of various countries. In this context, the NGMN alliance (a global alliance of operators) states that “a timely and globally aligned spectrum allocation policy will play a key role in the development of a viable ecosystem on a national, regional and global scale, whose benefits will last well beyond the next decade”. Hence, there is a need for global harmonization in spectrum allocation, to prevent fragmentation, and to promote innovation for the next generation of networks.

The issue of spectrum scarcity is the real problem which must be addressed immediately by all nations going forward, given the fact that it typically takes some 6 years for spectrum to be operational, from the time it is allocated.

Find me on Google+

Designing a Scalable Architecture for the Cloud

The promise of the cloud is the unlimited computing power and storage capacities coupled with the pay-per-use policy. This makes the cloud particularly irresistible for hosting web applications and applications whose demand vary periodically. In order to take full advantage of the cloud the application must be designed for optimum performance. Though the cloud provides resources on-demand a badly designed application can hog resources and prove to be extremely expensive in the long run.

One of the first requirements for deploying applications on the cloud is that it should be scalable. Scalability denotes the ability to handle increasing traffic simply by adding more computing resources of the same kind rather than adding resources with greater horse power. This is also referred to scaling horizontally.

Assuming that the application has been sufficiently profiled and tuned for high performance there are certain key considerations that need to be taken into account while deploying on the cloud – public or private.  Some of them are being able to scale on demand, providing for high availability, resiliency and having sufficient safeguards against failures.

Given these requirements a scalable design for the Cloud can be viewed as being made up of the following 5 tiers of layers

The DNS tier – In this tier the user domain is hosted on a DNS service like Ultra DNS or Route 53. These DNS services distribute the DNS lookups geographically. This results in connecting to a DNS Server that is geographically closer to the user thus speeding the DNS lookup times. Moreover since the DNS lookups are distributed geographically it also builds geographic resiliency as far as DNS lookups are concerned

Load Balancer-Auto Scaling Tier – This tier is responsible for balancing the incoming traffic among compute instances in the cloud. The load balancing may be made on a simple round-robin technique or may be based on the actual CPU utilization of the individual instances. Typically at this layer we should also have an auto-scaling policy which will add more instances if the traffic to the application increases above a threshold or terminate instances when the traffic falls below a specific threshold.

Compute-Instance Tier – This layer hosts the actual application in individual compute instances on the cloud. It is assumed that the application has been tuned for maximum performance. The choice of small, medium or large CPU should be based on the traffic handling capacity of the instance type versus the cost/hr of the instance.

Cache Tier – This is an important layer in the cloud application where there are multiple instances. The cache tier provides a distributed cache for all the instances. With a distributed caching system like memcached it is possible to share global data between instances. The memcached application uses a consistent-hashing technique to distribute data among a set of participating servers. The consistent hashing method allow for handling of server crashes and new servers joining into the cache layer.

Database Tier – The Database tier is one of the most critical layers of the application. At a minimum the database should be configured in an active-standby mode. Ideally it is always better to have the active and standby in different availability zones to better handle disasters in a particular zone. Another consideration is have separate read replicas that handle reads to database while the primary database handles the write operations

Besides the above considerations it is always good to host the web application in different availability zone thus safeguarding against disasters in a particular region.

Working with Amazon’s EBS, ELB and Route 53

Here are some key learning’s  to get going on Amazon’s Elastic Block Storage (EBS), Elastic Load Balancer (ELB) and Route 53 which Amazon’s DNS  service

Amazon’s EBS: Amazon’s Elastic Block Storage provided persistent storage for your applications. It is extremely useful when migrating from a small/medium instance to a large/extra large instance. The EBS is akin to a hard disk. The steps that are needed to migrate are

– Create an EBS volume from your snapshot of your small/medium instance

– Launch a large instance

– Attach your EBS volume to your large instance (for e.g. /dev/sda2)

– Open a ssh window to your large instance

– Create a test directory (/home/ec2-user/test)

– Mount your volume (mount /dev/sda2 /home/ec2-user/test)

– Copy all your files and directories to their appropriate location

– Unmount the mounted volume (umount /dev/sda2)

– Now you have all the files from your medium instance

– Detach the volume

Amazon’s ELB: The key thing about the Amazon’s ELB is the fact that the ELB created (my-load-balancer-nnnn-abc.amazon.com) actually maps to a set of IP addresses internally. Amazon suggests CNAMEing a subdomain to point to the ELB for better performance. Also an important thing to understand about Amazon’s ELB is that it performs significantly better if user requests come from different IPs rather from a single machine. So a performance tool that simulates users from multiple IPs will give a better throughput. The alternative is run the performance tool from multiple machines

Amazon’s Route 53: Route 53 is Amazon’s DNS service.  Route 53 distributes your domains to multiple geographical zones enabling quicker DNS lookup. To use Route 53 you need to

– create a hosted zone for your domain (for e.g http://www.mydomain.com) in Route 53

– migrate all your A, MX, CNAME resource records from your current registered domain to Route 53.

Since Route 53 is distributed it will speed name lookups. Currently updates to Route 53 are through dnscurl.pl a Perl script. However there are good GUI tools that make the job very simple.

This should get you started on the EBS, ELB and Route 53. Do also take a look at my post “Managing multi-region deployments“.

Find me on Google+