Eliminating the Performance Drag

Few things matter more to designers and architects of web applications than performance. Everybody dreams of applications that can roar and spit fire! The attributes most sought after in large-scale web applications are low latency and high throughput. Performance is money, so it is really important to work through the design and arrive at the best possible one.

Squeezing performance from your application is truly an art. The challenges and excitement are somewhat similar to what a race car designer or an aeronautical engineer would experience in reducing the drag on the machine while increasing the thrust of the engine. Understanding the physics of program execution is a rare skill.

This post will attempt to touch upon some key aspects that have to be looked at a little more closely to wring the maximum performance from your application.

The Store Clerk Pattern: This is an oft-quoted analogy to describe the relation between latency and throughput. In this example the application is likened to a retail store with store clerks checking out customers who queue with their purchases. Assume that there are 5 store clerks and each takes 1 sec to process a customer. If there are 5 customers, the response time for each customer will be 1 sec and a total throughput of 5 customers/sec can be achieved. However, if a 6th customer enters, he or she will have to wait 1 sec in the queue while the first 5 customers are being handled by the store clerks. For this customer the response time will be 1 sec (waiting) + 1 sec (processing), or a total of 2 secs. It can be readily seen that as more customers queue up, the wait times, and therefore the latency, will keep increasing. Note also that the throughput will plateau at 5 customers/sec and cannot go above that.
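The arithmetic in the store clerk example can be written down as a small model. This is only a sketch: it assumes that all customers arrive at the same instant and that every clerk takes exactly the same time, and the function names are made up for illustration.

```python
def response_time(position, clerks=5, service_time=1.0):
    """Response time for the customer at 1-based `position` in the line.

    Assumes everyone arrives at once and each clerk takes `service_time`.
    """
    # Number of full "waves" of customers served before this one starts.
    waves_ahead = (position - 1) // clerks
    waiting = waves_ahead * service_time
    return waiting + service_time


def max_throughput(clerks=5, service_time=1.0):
    """Upper bound on customers served per second, however long the queue."""
    return clerks / service_time


if __name__ == "__main__":
    for position in (1, 5, 6, 11):
        print(position, response_time(position))
    print("throughput ceiling:", max_throughput())
```

Under these assumptions the 5th customer sees 1 sec, the 6th sees 2 secs and the 11th sees 3 secs, while the throughput ceiling stays at 5 customers/sec.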

The first point of attack in improving performance is to identify the store clerk pattern in your application. Identify where your application has a queue of incoming requests and a thread pool that processes them. The latency and throughput are governed by the number of parallel threads (store clerks) doing the processing and by the wait times in the queue. One naïve technique is to increase the number of threads in the pool or to increase the number of pools. However, this may be limited by the CPU and other system constraints. What is extremely important is to identify the factors that contribute to the processing time of each request. While processing, do threads need to access and retrieve data? Do they have to make API or SQL calls? Identify the worst-case performance of a thread and determine whether this worst case can be improved by a different algorithm. So the key in the store clerk pattern is a) to optimize the number of threads in the pool and b) to improve the worst-case performance of processing a request.
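A rough way to see the queue-plus-pool effect in code is to submit more requests than there are worker threads and record how long each one takes from submission to completion. The sketch below uses Python's standard ThreadPoolExecutor; the handler, the sleep that stands in for real work, and the numbers are all assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle(submitted_at, service_time):
    """Pretend to process one request and return its end-to-end latency."""
    time.sleep(service_time)  # stand-in for data access, API or SQL calls
    return time.perf_counter() - submitted_at


def run(requests=4, workers=2, service_time=0.05):
    """Submit all requests at once to a fixed-size pool; return latencies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        start = time.perf_counter()
        futures = [pool.submit(handle, start, service_time)
                   for _ in range(requests)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    for i, latency in enumerate(run(), start=1):
        print(f"request {i}: {latency:.3f}s")
```

With 2 workers and 4 requests, the last two requests should show roughly twice the latency of the first two, because they spend one service time waiting in the queue. Raising `workers` shortens the wait only until the CPU or a shared resource becomes the limit.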

Resource Contention: This is another area of the application that needs to be looked at very closely. It is quite likely that data is being shared by many threads. Access to shared data is going to involve locks and waits. Identify and determine the worst-case wait for threads. Is your application read-heavy and write-light, or write-heavy and read-light? In the former situation it may be worthwhile to use a Reader-Writer locking algorithm in which any number of readers can read the data simultaneously, each updating a semaphore as it does so. A write, which happens only occasionally, will lock the resource and make all reader threads wait. If the application is write-heavy, then other alternatives such as message-based locking could be used. Clearly, thread waits can be a drain on performance.
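Python's standard library does not ship a reader-writer lock, so the following is a hand-rolled sketch of the classic readers-preference variant, using a reader count in place of a semaphore. The class name and method names are made up here. One known weakness of this variant is that a steady stream of readers can starve a writer.

```python
import threading


class ReadWriteLock:
    """Minimal readers-preference reader-writer lock (illustrative only)."""

    def __init__(self):
        self._readers = 0
        self._count_lock = threading.Lock()  # protects the reader count
        self._resource = threading.Lock()    # held by a writer or by the readers as a group

    def acquire_read(self):
        with self._count_lock:
            self._readers += 1
            if self._readers == 1:
                # First reader in locks writers out.
                self._resource.acquire()

    def release_read(self):
        with self._count_lock:
            self._readers -= 1
            if self._readers == 0:
                # Last reader out lets writers back in.
                self._resource.release()

    def acquire_write(self):
        self._resource.acquire()

    def release_write(self):
        self._resource.release()
```

Readers call `acquire_read()` and `release_read()` around the read and never block each other. A writer calling `acquire_write()` waits until the reader count drops to zero and then holds the resource exclusively.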

Algorithmic changes: If there are modules that perform an enormous number of insertions, updates or deletions on data in memory, then these have to be looked at closely. Determine the type of data structures or STL containers being used. The goal is to reorganize the data so that the operation happens much more efficiently, ideally approaching O(1). The data may need to be organized as a hash map of lists or a hash map pointing to n-ary trees instead of a list of lists. This will require deep thought and careful analysis to identify the approach that gives the lowest possible time for the most common operation.
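As a small illustration of the list-of-lists versus hash-map-of-lists point, the sketch below compares a linear scan with a lookup through a dictionary index. The data layout (pairs of key and list of values, with unique keys) and the function names are assumptions for the example.

```python
from collections import defaultdict


def find_in_list_of_lists(rows, key):
    """Linear scan: O(n) in the number of rows."""
    for k, values in rows:
        if k == key:
            return values
    return None


def build_index(rows):
    """Reorganize the same data as a hash map of lists, built once."""
    index = defaultdict(list)
    for k, values in rows:
        index[k].extend(values)
    return index


if __name__ == "__main__":
    rows = [("alice", [1, 2]), ("bob", [3]), ("carol", [4, 5, 6])]
    index = build_index(rows)
    print(find_in_list_of_lists(rows, "carol"))  # scans every row before it
    print(index["carol"])                        # average O(1) lookup
```

Building the index costs one pass over the data, so it pays off when lookups by key are the most common operation.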

From Relational to NoSQL: Although the decision to move from an RDBMS to a NoSQL database such as Cassandra or CouchDB is usually driven by scalability, there is a performance angle as well. The ability to partition data horizontally and to hash the key for reads, updates and deletions can make these operations very fast, and is an avenue worth looking into.
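The idea of hashing a key to pick a horizontal partition can be sketched in a few lines. This is a deliberately simplified model with in-memory dictionaries as partitions and a plain modulo over an MD5 digest; it is not the scheme used by any particular database, and all names are hypothetical.

```python
import hashlib


def partition_for(key, partitions):
    """Map a string key to a partition number in a stable, repeatable way."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


class PartitionedStore:
    """Toy key-value store split across `partitions` in-memory dicts."""

    def __init__(self, partitions=4):
        self._shards = [{} for _ in range(partitions)]

    def _shard(self, key):
        return self._shards[partition_for(key, len(self._shards))]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key):
        return self._shard(key).get(key)

    def delete(self, key):
        self._shard(key).pop(key, None)
```

Every operation touches exactly one partition, chosen by a hash computation rather than a search. A plain modulo reshuffles most keys when the partition count changes, which is one reason real systems tend to use more elaborate schemes.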

Caching: This is a widely used technique to reduce frequent SQL queries to the database. Data that is commonly used can be cached in memory. One such technique is to use memcached, which caches data across several servers. Access to data is through hashing and is of the order of O(1). Access is through simple get and set methods, in which the key is hashed to identify the server on which the data is stored. If the data is not found in the memcached servers, it is fetched through a SQL query.
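The read path described here, check the cache first and fall back to the database on a miss, can be sketched as follows. A plain dictionary stands in for memcached and a caller-supplied function stands in for the SQL query, so none of these names belong to a real client library.

```python
class CacheAside:
    """Look in the cache first; on a miss, load from the database and store."""

    def __init__(self, load_from_db):
        self._cache = {}           # stand-in for a memcached client
        self._load = load_from_db  # stand-in for the SQL query
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        # Cache miss: pay for the slow query once, then remember the result.
        self.misses += 1
        value = self._load(key)
        self._cache[key] = value
        return value


if __name__ == "__main__":
    def slow_query(key):
        print("querying database for", key)
        return key.upper()

    cache = CacheAside(slow_query)
    print(cache.get("profile:7"))  # goes to the database
    print(cache.get("profile:7"))  # served from the cache
```

A production version would also need expiry and invalidation when the underlying row changes, which this sketch leaves out.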

Profiling: The judicious use of profiling tools is extremely important in optimizing performance. Tools like Valgrind truly help in identifying bottlenecks. Other tools help in monitoring thread pools and identifying where resource contention is taking place. It may also be worthwhile to timestamp different modules, collect data over several thousand runs, average the results and pinpoint trouble spots.
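The timestamping idea can be tried out with a small helper that records how long a labelled section takes and averages the samples over many runs. This is a minimal sketch; the `timed` helper, the labels and the module-level `samples` store are assumptions, not part of any profiling tool.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of elapsed times in seconds
samples = defaultdict(list)


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        samples[label].append(time.perf_counter() - start)


def averages():
    """Mean duration per label across all recorded runs."""
    return {label: sum(v) / len(v) for label, v in samples.items()}


if __name__ == "__main__":
    for _ in range(5000):
        with timed("parse"):
            sum(range(100))
        with timed("lookup"):
            sorted(range(200))
    for label, mean in sorted(averages().items(), key=lambda kv: -kv[1]):
        print(f"{label}: {mean * 1e6:.1f} us")
```

Sorting the averages in descending order puts the likeliest trouble spot at the top. Looking at the worst samples as well as the mean is often worthwhile, since averages can hide occasional long stalls.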

These are some techniques that can be used for optimizing performance. However, improving performance beyond a point will really depend on being able to visualize the application in execution and divining problem hot spots.
