The Anatomy of Latency

Latency is a measure of the time delay experienced in a system. In data communications, latency is measured as the round-trip delay between sending a packet and receiving a response from the destination. In the world of web applications, latency is the response time of the web site, and it depends both on the round-trip time over the communication link and on the processing time of the application. Hence we could say that

latency = round-trip time + processing time

The round-trip time is less susceptible to increasing traffic than the processing time taken to handle the increased load. The processing time of the application is particularly pernicious in that it is highly sensitive to changing traffic. This article tries to analyze why the latency, or response time, of web applications typically increases with increasing traffic. While the latency grows steeply as the traffic increases, the throughput increases up to a point and then starts to drop substantially. The ideal situation for all internet applications is the ability to scale horizontally, handling increasing traffic by simply adding more commodity servers while keeping response times within acceptable limits. However, in the real world this rarely happens.
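As a rough illustration of these two components, one can time a request from the client side. The sketch below, in Python using only the standard library and a placeholder URL, measures the total response time, which bundles the network round trip and the server processing time together:

    import time
    import urllib.request

    URL = "http://example.com/"        # placeholder; substitute the application URL

    start = time.perf_counter()
    with urllib.request.urlopen(URL) as resp:
        resp.read()
    elapsed = time.perf_counter() - start

    # elapsed is approximately round-trip time + server processing time
    print(f"End-to-end latency: {elapsed * 1000:.1f} ms")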

The price of Latency

Latency hurts business. Amazon found that every additional 100 ms of latency cost it 1% in sales. Similarly, Google found that a 0.5 second increase in the time to generate search results dropped search traffic by 20%. Latency really matters. Reactions to poor response times in web sites range from minor annoyance to complete frustration and the loss of users and business.

The cause of processing latency

One of the fundamental requirements of scalable systems is that they be loosely coupled. The application needs a modular architecture with well-defined interfaces between modules. Ideally, applications designed with fairly efficient algorithms, of the order of O(log n) or O(n log n), will be largely immune to changes in load, though they will be affected by growth in the number of data elements. So the algorithms adopted by the application do not themselves account for the increase in response times as traffic increases. What, then, is really the performance bottleneck behind increasing latency and decreasing throughput under increased load?
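To make the effect of algorithmic complexity concrete, here is a small illustrative sketch (with made-up data sizes) comparing an O(n) linear scan against an O(log n) binary search; the per-lookup cost of the latter barely changes as the data set grows:

    import bisect
    import random
    import time

    def linear_lookup(items, key):
        # O(n): scans the whole list in the worst case
        return key in items

    def binary_lookup(sorted_items, key):
        # O(log n): repeatedly halves the search interval
        i = bisect.bisect_left(sorted_items, key)
        return i < len(sorted_items) and sorted_items[i] == key

    for n in (10_000, 100_000):
        data = sorted(random.sample(range(10 * n), n))
        probes = [random.randrange(10 * n) for _ in range(500)]

        start = time.perf_counter()
        for p in probes:
            linear_lookup(data, p)
        linear_t = time.perf_counter() - start

        start = time.perf_counter()
        for p in probes:
            binary_lookup(data, p)
        binary_t = time.perf_counter() - start

        print(f"n={n}: linear {linear_t:.3f}s, binary {binary_t:.3f}s")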

Contention - the culprit

One of the culprits behind the deteriorating response times is thread locking and resource contention. Assuming the application has been designed with reader-writer locks or a message-queue based synchronization mechanism, the time spent waiting for resources to become free, as traffic increases, results in degraded performance.

Let us assume that the application is read-heavy and write-light and has implemented a reader-writer synchronization mechanism. Further, let us assume that a writer thread locks a resource for 250 ms. At low loads we could have 4 such threads, each locking the resource for 250 ms, spanning a total of 1 s. Hence in 1 s there can be at most 4 writer threads, each holding the write lock for 250 ms. During these intervals all reader threads are forced to wait. When the traffic load is low, the number of reader threads waiting for the lock to be released is small and the impact is minor, but as the traffic increases, the number of threads waiting for the lock grows. Since a write lock takes a finite amount of time to complete its processing, we cannot exceed 4 writer threads per second at the given CPU speed.
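To illustrate the mechanism, here is a minimal reader-writer lock sketch in Python. The standard library does not provide one, so this is a simplified illustration rather than production code; every reader that arrives while a writer holds the lock simply queues up, and it is exactly this waiting that grows with traffic:

    import threading
    import time

    class ReaderWriterLock:
        """Simplified reader-writer lock: many concurrent readers, one writer."""
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0
            self._writer = False

        def acquire_read(self):
            with self._cond:
                while self._writer:          # readers wait while a writer holds the lock
                    self._cond.wait()
                self._readers += 1

        def release_read(self):
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

        def acquire_write(self):
            with self._cond:
                while self._writer or self._readers > 0:   # wait for exclusive access
                    self._cond.wait()
                self._writer = True

        def release_write(self):
            with self._cond:
                self._writer = False
                self._cond.notify_all()

    rw = ReaderWriterLock()

    def writer():
        rw.acquire_write()
        time.sleep(0.25)                     # models the 250 ms write-lock hold time
        rw.release_write()

    def reader():
        rw.acquire_read()                    # blocks while any writer holds the lock
        rw.release_read()

    threads = [threading.Thread(target=writer) for _ in range(4)]
    threads += [threading.Thread(target=reader) for _ in range(100)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()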

However, as the traffic increases further, the waiting threads not only grow in number but also consume CPU and memory. This adversely impacts the writer threads, which now have fewer CPU cycles and less memory available and hence take longer to complete. This downward spiral worsens, resulting in increased response times and deteriorating throughput in the application.

The solution to this problem is not easy. We need to revisit the areas where the application blocks waiting for something. Besides causing threads to wait, locking also adds the overhead of the waiting thread having to be scheduled again before it can execute. We need to minimize the time a thread holds a resource before allowing other threads access to it.
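One common way of minimizing lock hold times is to copy the shared data while holding the lock and do the expensive work outside it. A small sketch of this pattern (the shared_stats structure and report_snapshot function are hypothetical):

    import copy
    import threading

    lock = threading.Lock()
    shared_stats = {"requests": 0, "errors": 0}

    def report_snapshot():
        # Hold the lock only long enough to copy the shared state ...
        with lock:
            snapshot = copy.deepcopy(shared_stats)
        # ... and do the slow work (formatting, I/O, aggregation) outside the lock
        return ",".join(f"{k}={v}" for k, v in sorted(snapshot.items()))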


Designing for Cloud Worthiness

Cloud Computing is changing the rules of computing for the enterprise. Enterprises are no longer constrained by the capital costs of upfront equipment purchases. Rather, they can concentrate on the application, deploy it on the cloud, and pay in a utility style based on usage. Cloud computing essentially presents a virtualized platform on which applications can be deployed.

The Cloud exhibits the property of elasticity by automatically adding more resources to the application as demand grows and shrinking the resources when the demand drops. It is this property of elasticity of the cloud and the ability to pay based on actual usage that makes Cloud Computing so alluring.

However, to take full advantage of the Cloud, the application must use the available cloud resources judiciously. It is important for applications that are to be deployed on the cloud to have the property of scaling horizontally. This implies that the application should be able to handle more transactions per second as more resources are added to it. For example, if the application has been designed to run in a small CPU instance (1.7 GHz, 32-bit, 160 GB of instance storage) with a throughput of 800 transactions per second, then one should be able to add 4 more such instances and scale to handling 4000 transactions per second.

However, there is a catch. How does one determine the theoretical limit of transactions per second for a single instance? Ideally we should maximize the throughput and minimize the latency of each instance before going to the next step of adding more instances on the cloud. One should squeeze the maximum performance from the application on the instance of choice prior to using multiple instances. Typical applications perform reasonably well under small loads, but as the traffic increases the response time grows and the throughput starts to dip.

There is a need to run profiling tools and remove bottlenecks in the application. The standard refrain for applications to be deployed on the cloud is that they should be loosely coupled and stateless. However, most applications tend to be multi-threaded, with resources shared among various modules. The performance impact of locks and semaphores should be given due consideration, since a lot of time is typically wasted with threads in the wait state. A suitable technique should be used for providing concurrency among threads. The application should be analyzed to determine whether it is read-heavy and write-light or write-heavy and read-light, and a suitable synchronization technique, such as reader-writer locks, message-queue based exclusion, or monitors, should be chosen.
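For instance, for a write-heavy module a message-queue based approach can replace locks altogether: writers post their updates to a queue and a single consumer thread applies them, so the shared structure is only ever touched by one thread. A minimal sketch, with illustrative names:

    import queue
    import threading

    updates = queue.Queue()
    shared_counts = {}            # only ever modified by the consumer thread

    def producer(key):
        # Writers never touch shared_counts directly; they just enqueue work
        updates.put(key)

    def consumer():
        while True:
            key = updates.get()
            if key is None:       # sentinel to shut down
                break
            shared_counts[key] = shared_counts.get(key, 0) + 1
            updates.task_done()

    worker = threading.Thread(target=consumer, daemon=True)
    worker.start()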

I have found callgrind, for profiling and gathering performance characteristics, along with KCachegrind, for a graphical display of the performance data, extremely useful.

Another important technique for improving performance is to maintain an in-memory cache of frequently accessed data. Rather than making frequent queries to the database, periodic updates from the database can be stored in an in-memory cache. While this technique works fine for a single instance, handling in-memory caches across multiple instances in the cloud is quite a challenge. When there are multiple instances, there is a need for a distributed cache that is shared among them. Memcached is an appropriate choice for maintaining a distributed cache in the cloud.
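A typical cache-aside pattern with memcached would look roughly like the sketch below. It assumes the pymemcache client library and a hypothetical load_user_from_db function; the host, port and expiry time are placeholders:

    import json
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))            # placeholder memcached endpoint

    def get_user(user_id):
        key = f"user:{user_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)               # cache hit: no database round trip
        user = load_user_from_db(user_id)           # hypothetical database query
        cache.set(key, json.dumps(user), expire=300)   # cache for 5 minutes
        return user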

Once the application has been tuned for maximum performance, it can be deployed on the cloud and stress tested at peak loads.

Some good tools for generating load on the application are loadUI and multi-mechanize. Personally I prefer multi-mechanize, as its test scripts are written in Python and can easily be modified for the testing at hand. With Python, multi-mechanize can also simulate browser functionality to some extent, which can prove useful.
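A multi-mechanize test script is simply a Python module that defines a Transaction class whose run() method performs one user action and optionally records custom timers. A minimal sketch against a placeholder URL:

    import time
    import urllib.request

    class Transaction:
        def __init__(self):
            self.custom_timers = {}

        def run(self):
            start = time.time()
            # Placeholder URL: point this at the application under test
            resp = urllib.request.urlopen("http://localhost:8080/")
            resp.read()
            self.custom_timers["Home_Page"] = time.time() - start
            assert resp.getcode() == 200, "unexpected HTTP status"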

Hence, while the cloud provides CPU, memory, and database resources on demand, the enterprise needs to design applications so that these resources are used judiciously. Otherwise, the enterprise will not be able to reap the benefits of utility computing if it deploys inefficient applications that hog resources without delivering commensurate, revenue-generating performance.
