Bend it like Bluemix, MongoDB using Auto-scale – Part 1!

In the next series of posts I turn on the heat on my cloud deployment in IBM Bluemix and check out the elastic nature of this PaaS offering. Handling traffic load by elastically expanding and contracting is what the cloud does best. This is where the ‘rubber really meets the road’. In this series I generate the traffic load using Multi-Mechanize, a performance test framework created by Corey Goldberg.

This post is based on an earlier cloud app that I created on Bluemix, namely Spicing up a IBM Bluemix Cloud app with MongoDB and NodeExpress. I had to make changes to that code to iron out issues in handling concurrent inserts, displays and deletes issued from the Multi-Mechanize tool, and also to manage the asynchronous nightmare of Node.js.

The code for this Bluemix, MongoDB app with auto-scaling can be forked from Devops at bluemixMongo. The code can also be cloned from GitHub at bluemix-mongo-autoscale.

1. To get started, fork the code from Devops at bluemixMongo. Then change the host name in manifest.yml to something unique and click the Build and Deploy button at the top right of the page.


1a. Alternatively, the code can be cloned from GitHub at bluemix-mongo-autoscale. From the directory where the code is cloned, push the code using Cloud Foundry’s cf command as follows:

cf login -a https://api.ng.bluemix.net

cf push bluemixMongo -p . -m 128M

2. Now add the MongoDB service and click ‘OK’ to restage the app.


3. Add the Monitoring and Analytics (M&A) service and also the Auto-scaling service. M&A gives a good report on Availability and Performance, and also provides Logging Analysis. The Auto-scaling service is what allows the app to expand elastically to changing traffic loads.


4. You should see the bluemixMongo app running with 3 services: MongoDB, Auto-scaling and M&A.


5. You should now be able to click bluemixMongo.mybluemix.net and check out the application.

6. Now configure the Overload (auto-scaling) policy. This is a slightly contrived example and the policy is set to scale up if memory usage exceeds 55%. (Typically the scale-up would be configured for > 80% memory usage.)


7. Now check the configured Auto-scaling policy.


8. Change the Memory Quota as appropriate. In my case I have kept the memory quota at 128 MB. Note that the available memory is 640 MB, which hence allows up to 5 instances. (It is also possible to set any other value, like 100 MB.)


9. Click the Monitoring and Analytics service and take a look at the output in the different tabs.


10. Next you need to set up the performance test tool, Multi-Mechanize. Multi-Mechanize creates concurrent threads to generate load on a web site or service. It is based on Python, which makes it easy to modify the scripts for hitting a website, making a REST call or submitting a form.

To set up Multi-Mechanize you also need additional packages like numpy and matplotlib, as the tool generates traffic based on a user-provided script, measures latency and throughput, and also generates graphs for these.

For detailed steps on setting up Multi-Mechanize please follow Trying out multi-mechanize web performance and load testing framework. Note: I would suggest that you install Python 2.7.x and not the later 3.x version, as some of the packages are based on the 2.7 version, which has a slightly different syntax for certain Python statements.
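As a quick illustration, a minimal Multi-Mechanize virtual-user script (the kind that lives under the project’s test_scripts directory) could look like the sketch below. The host name bluemixmongo.mybluemix.net is just an example; substitute whatever unique host you chose in manifest.yml. This is only a sketch of the Transaction/run() structure that Multi-Mechanize expects, written for Python 2.7 as recommended above.

import time
import urllib2

class Transaction(object):
    def __init__(self):
        # Multi-Mechanize picks up custom_timers for its latency graphs
        self.custom_timers = {}

    def run(self):
        start = time.time()
        # Hit the home page of the Bluemix app (replace with your own host name)
        response = urllib2.urlopen('http://bluemixmongo.mybluemix.net/')
        response.read()
        self.custom_timers['Home_Page'] = time.time() - start
        assert response.getcode() == 200, 'Unexpected HTTP response code'

if __name__ == '__main__':
    # Quick standalone check before handing the script to Multi-Mechanize
    trans = Transaction()
    trans.run()
    print(trans.custom_timers)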

In the next post I will run a traffic test on the bluemixMongo application using Multi-mechanize and observe how the cloud app responds to the load.

Watch this space!
Also see
Bend it like Bluemix, MongoDB with autoscaling – Part 2!
Bend it like Bluemix, MongoDB with autoscaling – Part 3

You may like :
a) Latency, throughput implications for the cloud
b) The many faces of latency
c) Brewing a potion with Bluemix, PostgreSQL & Node.js in the cloud
d) A Bluemix recipe with MongoDB and Node.js
e) Spicing up IBM Bluemix with MongoDB and NodeExpress
f) Rock N’ Roll with Bluemix, Cloudant & NodeExpress

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

Profiting from a cloud deployment

Cloud computing offers enterprises and organizations a mixed bag of goodies. For one, it provides utility-style computing, the ability to grow and shrink with changing loads, zero upfront costs etc. The benefits of cloud computing are many, but does it all add up to profit for an enterprise? That is the critical question that needs to be answered.

This post takes a look at what it takes for a cloud deployment to be profitable for an organization.

The critical parameters for any web application are latency and throughput. A well designed web application, whether it is an e-retail site or an ad serving application, will try to minimize the latency or response time while at the same time maximizing the throughput of the application. For any application, even while the latency is kept within specified limits, the throughput will tend to plateau at a certain level and will not increase with increasing traffic. Utilizing a larger instance can raise the throughput plateau slightly. In any case the reality is that throughput tends to flatten as the traffic is increased.

A typical cloud application will be made up of several compute instances, database instances, DNS services etc. Cloud usage is billed by the hour. Hence we can represent the cost of a cloud deployment as follows:

Cost (cloud deployment) = m * compute instance + n * database instance + o * network bytes + P

where P = cost of DNS + Elastic IPs + other costs.

This can be simplified to

C = a * D * t

where C is the cumulative cost of the cloud deployment, D is the cost per hour of the deployment, ‘a’ is some arbitrary constant and ‘t’ is the time.

Let us assume that for the cloud deployment we get a throughput of T.

The revenue for a web application, whether it is an e-commerce site, an e-ticketing site or an ad serving engine, will depend on the throughput, i.e. the larger the throughput, the larger the revenue and hence the profit. We can then say that the revenue ‘R’ is

R (revenue) ∝ T * t, or R = k * T * t for some constant k

In other words the revenue is proportional to the throughput.

Hence to determine the profitability of a particular cloud deployment we need to compare the cost of the deployment for a given throughput against the projected revenue. As long as the cost of the deployment is less than the revenue arising from the throughput, the deployment will be profitable. This can be represented pictorially as below.

The graph clearly shows that for a profitable deployment

d/dt (k * T * t) > d/dt (a * D * t), or

k * T > a * D

Hence, as can be seen from the picture, as long as the slope of the cumulative deployment cost is less than the slope of the cumulative revenue, the deployment will be profitable.
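A minimal sketch of this break-even check, with purely hypothetical numbers for the hourly deployment cost, the revenue per transaction and the measured throughput (expressed per hour so that the units line up):

# Hypothetical figures - substitute your own tariffs and measurements
D = 2.50             # deployment cost per hour (compute + database + network + P)
a = 1.0              # arbitrary constant from C = a * D * t
k = 0.002            # revenue earned per transaction
T = 90000.0          # sustained throughput, expressed in transactions per hour
t = 24.0             # period under consideration, in hours

cost = a * D * t     # cumulative cost of the deployment, C
revenue = k * T * t  # cumulative revenue, R

print('Cumulative cost    : %.2f' % cost)
print('Cumulative revenue : %.2f' % revenue)
if k * T > a * D:
    print('The deployment is profitable (k * T > a * D)')
else:
    print('The deployment runs at a loss')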


Eliminating the Performance Drag

Nothing is more important to designers and architects of web applications than performance. Everybody dreams of applications that can roar and spit fire! The attributes that are most sought after in large scale web applications are low latency and high throughput. Performance is money. So it is really important to work the design and come up with the best possible one.

Squeezing performance from your application is truly an art. The challenges and excitement are somewhat similar to what a race car designer or an aeronautical engineer would experience in reducing the drag on the machine while increasing the thrust of the engine. Understanding the physics of program execution is a rare art.

This post will attempt to touch upon some key aspects that have to be looked at a little more closely to wring the maximum performance from your application.

The Store Clerk Pattern: This is an oft-quoted analogy to describe the relation between latency and throughput. In this example the application can be likened to a retail store with store clerks checking out customers who queue up with their purchases. Assume that there are 5 store clerks and each takes 1 sec to process a customer. If there are 5 customers the response time for each customer will be 1 sec and a total throughput of 5 customers/sec can be achieved. However if a 6th customer enters then he/she will have to wait 1 sec in the queue while the 5 customers are being handled by the store clerks. For this customer the response time will be 1 sec (waiting) + 1 sec (processing), or a total response time of 2 secs. It can be readily seen that as more customers queue up the wait times will increase and the latency will keep increasing. It also has to be noted that the throughput will plateau at 5 customers/sec and cannot go above that.

The first point of attack in improving performance is to identify the store clerk pattern in your application. Identify where your application has a queue of incoming requests and a thread pool to address these requests as they get processed. The latency and throughput are governed by the number of parallel threads (store clerks) that process requests and by the wait times in the queue. One naïve technique is to increase the number of threads in the pool or increase the number of pools. However this may be limited by the CPU and system limitations. What is extremely important is to identify what factors contribute to the processing of each request. While processing, do threads need to access and retrieve data? Do they have to make API or SQL calls? Identify the worst case performance of the thread and determine if this worst case can be improved by a different algorithm. So the key in the store clerk pattern is a) to optimize the threads in the pool and b) to improve the worst case performance of processing the request.
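As a toy illustration of the store clerk pattern (not code from any real application), the sketch below uses a pool of 5 worker threads, each taking 1 second per request; the 6th, 7th and 8th requests have to wait in the queue, so their response times come out close to 2 seconds.

import threading
import time
import Queue                              # 'queue' in Python 3

NUM_CLERKS = 5                            # size of the thread pool (the "store clerks")
SERVICE_TIME = 1.0                        # seconds each clerk takes to process a request

requests = Queue.Queue()
response_times = []
results_lock = threading.Lock()

def clerk():
    while True:
        arrival = requests.get()
        if arrival is None:               # poison pill - stop this worker
            break
        wait = time.time() - arrival      # time the request spent queued
        time.sleep(SERVICE_TIME)          # simulated processing
        with results_lock:
            response_times.append(wait + SERVICE_TIME)
        requests.task_done()

workers = [threading.Thread(target=clerk) for _ in range(NUM_CLERKS)]
for w in workers:
    w.start()

for _ in range(8):                        # 8 customers arrive at once
    requests.put(time.time())
requests.join()

for _ in workers:                         # shut the pool down
    requests.put(None)
for w in workers:
    w.join()

print('Response times: %s' % [round(r, 2) for r in response_times])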

Resource Contention: This is another area of the application that needs to be looked at very closely. It is quite likely that data is being shared by many threads. Access to shared data is going to involve locks and waits. Identify and determine the worst case wait for threads. Is your application read-heavy and write-light, or write-heavy and read-light? In the former situation it may be worthwhile to use a Reader-Writer locking algorithm in which any number of readers can simultaneously read data by updating a semaphore. However a write, which happens occasionally, will result in locking the resource and cause all reader threads to wait. If the application is write-heavy then other alternatives like message based locking could be used. Clearly thread waits can be a drain on performance.
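A bare-bones Reader-Writer lock of the kind described above can be sketched in Python as follows. It lets any number of readers proceed in parallel while a writer gets exclusive access; it is only an illustration of the idea (it has no writer preference, so a steady stream of readers can starve writers).

import threading

class ReadWriteLock(object):
    """Many concurrent readers, one exclusive writer (no writer preference)."""

    def __init__(self):
        self._readers = 0
        self._readers_lock = threading.Lock()   # guards the reader count
        self._resource = threading.Lock()       # held by a writer, or on behalf of all readers

    def acquire_read(self):
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:              # first reader locks out writers
                self._resource.acquire()

    def release_read(self):
        with self._readers_lock:
            self._readers -= 1
            if self._readers == 0:              # last reader lets writers in
                self._resource.release()

    def acquire_write(self):
        self._resource.acquire()

    def release_write(self):
        self._resource.release()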

Algorithmic changes: If there are modules that perform an enormous number of insertions, updates or deletions on data in memory then this has to be looked at closely. Determine the type of data structures or STL containers being used. The solution is to re-organize the data so that the operation happens much more efficiently, ideally approaching O(1). Maybe the data needs to be organized as a hash map of lists or a hash map pointing to n-ary trees instead of a list of lists. This will really require deep thought and careful analysis to identify the approach that provides the least possible time for the most common operation.
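The kind of difference such a re-organization makes can be seen in this small, contrived sketch: looking an order up in a list of lists requires a linear scan, O(n), while the same lookup in a hash map (a Python dict) keyed by the order id is close to O(1).

import timeit

# A list of [order_id, amount] pairs - lookup requires a linear scan, O(n)
orders_list = [[i, i * 10.0] for i in range(100000)]

def find_in_list(order_id):
    for oid, amount in orders_list:
        if oid == order_id:
            return amount
    return None

# The same data keyed by order_id in a dict - lookup is a hash probe, roughly O(1)
orders_dict = dict((i, i * 10.0) for i in range(100000))

def find_in_dict(order_id):
    return orders_dict.get(order_id)

print('list : %.4f s' % timeit.timeit(lambda: find_in_list(99999), number=100))
print('dict : %.4f s' % timeit.timeit(lambda: find_in_dict(99999), number=100))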

From Relational to NoSQL: Though the transition from an RDBMS to a NoSQL database like Cassandra, CouchDB etc. would really be driven by scalability, the ability to partition data horizontally and hash the key for accesses, updates and deletions makes these operations really fast and is an avenue that is worth looking into.

Caching: This is a widely used technique to reduce frequent SQL queries to the database. Data that is commonly used can be cached in memory. One such technique is to use memcached. Memcached caches data across several servers. Access to data is through hashing and is of the order of O(1). If there is a miss in the memcached servers then the data is accessed through a SQL query. Access to data is through simple get and put methods in which the key is hashed to identify the server on which the data is stored.
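The usual cache-aside flow with memcached looks roughly like the sketch below. It assumes the python-memcached client library and a hypothetical query_db() helper standing in for the real SQL query; neither is part of the application described here.

import memcache                               # python-memcached client (assumed installed)

mc = memcache.Client(['127.0.0.1:11211'])     # one or more memcached servers

def query_db(user_id):
    # Placeholder for the real SQL query against the database
    raise NotImplementedError

def get_user(user_id):
    key = 'user:%s' % user_id
    user = mc.get(key)                        # the key is hashed to pick the server - O(1)
    if user is None:                          # cache miss - fall back to the database
        user = query_db(user_id)
        mc.set(key, user, time=300)           # cache the result for 5 minutes
    return user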

Profiling: The judicious use of profiling tools is extremely important in optimizing performance. Tools like valgrind truly help in identifying bottlenecks. Other tools help in monitoring thread pools and identifying where resource contention is taking place. It may also be worthwhile to timestamp different modules, collect data over several thousand runs, average them and pin-point trouble spots.
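Timestamping modules, as suggested above, can be as simple as a small decorator that accumulates elapsed times over many runs so the averages can be compared later. A minimal sketch:

import time
from collections import defaultdict

timings = defaultdict(list)        # function name -> list of elapsed times

def timed(func):
    """Record the elapsed wall-clock time of every call to func."""
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            timings[func.__name__].append(time.time() - start)
    return wrapper

@timed
def handle_request():
    time.sleep(0.01)               # stand-in for real work

for _ in range(1000):
    handle_request()

samples = timings['handle_request']
print('handle_request: %d runs, average %.4f s' % (len(samples), sum(samples) / len(samples)))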

These are some techniques that can be used for optimizing performance. However improving performance beyond a point will really depend on being able to visualize the application in execution and divining problem hot spots.


Latency, throughput implications for the Cloud

The key considerations for any website are latency and throughput. These two parameters are extremely important to web designers as the response time of the web site and the ability to handle large amounts of traffic are directly related to the user experience and the loyalty of returning users.

What are these two parameters and why are they significant? Before looking at latency we need to understand what the response time of the web application is. Ideally this could be defined as the time between the receipt of the HTTP request and the emitting of the corresponding response. Unfortunately, for any web site hosted on the World Wide Web a lot more delay is added on top of the response time. This additional delay makes up the latency of the web site and is primarily due to the propagation and transmission delays on the internet. There are many contributors to this latency, from the DNS lookup to the link bandwidth.

Throughput on the other hand represents the maximum simultaneous queries or transactions per second that the web application is capable of handling. This is usually measured as transactions-per-second (tps) or queries-per-second (qps).

A good way to understand response time and throughput is to use an oft-used example of a retail store handling customers. Assuming that there are 5 counter clerks who take 1 minute to check out a customer, we can readily see that as the number of customers to the store increases the throughput increases from 1 customer/minute to a maximum of 5 customers/minute. Since the cashiers are able to process a customer in 1 minute the response time is 1 minute/customer. Assuming a 6th customer enters and needs to check out, he/she will have to wait, e.g. 1 minute, while the 5 counter clerks are busy processing the 5 other clients. Hence the response time for this customer will be 1 minute (waiting) + 1 minute (servicing) = 2 minutes. The response time increases from 1 minute to 2 minutes. If further clients are ready to check out, the length of the wait in the queue will increase and hence so will the response time. Clearly the throughput cannot increase beyond 5 customers/minute, while the response time will increase non-linearly as the clients enter the store faster than they can be checked out by the counter clerks.

This is precisely the behavior of web applications. When the traffic to a web site is increased the throughput increases linearly and finally reaches a throughput “plateau”. After this point, as the load is increased the throughput remains saturated at this level. The response time, on the other hand, is low at low traffic but starts to increase non-linearly with increasing load and continues to increase as the load maxes out system resources like the CPU and memory.

When deploying applications on the cloud, latency and throughput are the key considerations used to determine the kind of computing resources needed in the cloud. Assuming the web application has been optimized and performance tuned, what needs to be done is to load test the application on the cloud using different CPU instances. For example, assume that the application is load tested on a small CPU instance. We need to get the response time and throughput plots with increasing loads. Similarly we then deploy the web application on a medium instance and plot the response times and throughput plateaus on the medium instance.

Now the choice as to whether to go for a small CPU instance or a medium CPU instance can be made as follows. Assuming that the requirement of the web application is to have a response time of ‘t’ seconds, we determine the corresponding traffic handling capacity for the small CPU instance, say ‘c’, and for the medium CPU instance, say ‘C’. If the web site has to handle a total traffic of T then we determine the number of instances needed in each case. For the

small CPU instance it will be n = (T/c) + 1

and for

the medium CPU instance it will be N = (T/C) + 1.

Now we compute the relative costs of the small and medium CPU instances and identify which is more economical. For example, if r1 is the cost per hour of the small CPU instance and R1 is the cost per hour of the medium CPU instance, we choose

the small CPU instance if r1 * n < R1 * N (per hour),

while on the other hand if R1 * N < r1 * n then we choose the medium instance.
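Putting the numbers together, a small sketch with hypothetical load test results and hourly tariffs:

T = 10000.0       # total traffic the site must handle (requests/sec)
c = 600.0         # measured capacity of one small instance at the target response time 't'
C = 1400.0        # measured capacity of one medium instance at the target response time 't'
r1 = 0.06         # hourly cost of a small instance (hypothetical tariff)
R1 = 0.12         # hourly cost of a medium instance (hypothetical tariff)

n = int(T / c) + 1            # number of small instances needed, n = (T/c) + 1
N = int(T / C) + 1            # number of medium instances needed, N = (T/C) + 1

small_cost = r1 * n           # hourly cost with small instances
medium_cost = R1 * N          # hourly cost with medium instances

print('%d small instances  : %.2f per hour' % (n, small_cost))
print('%d medium instances : %.2f per hour' % (N, medium_cost))
print('Choose the %s instance' % ('small' if small_cost < medium_cost else 'medium'))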

Hence the determination of which CPU instance to use and how to configure the web application on the cloud will depend on appropriate performance tuning and proper load testing on the cloud. Do also read my other posts on latency, namely “The Many Faces of Latency” and “The Anatomy of Latency”.

Also see latency and throughput in action in the following series of posts

– Bend it like Bluemix, MongoDB with autoscaling – Part 1

– Bend it like Bluemix, MongoDB with autoscaling – Part 2

– Bend it like Bluemix, MongoDB with autoscaling – Part 3


The Anatomy of Latency

Latency is a measure of the time delay experienced in a system. In data communications, latency would be measured as the round-trip delay between sending a packet and receiving a response from the destination. In the world of web applications latency is the response time of a web site. In web applications latency is dependent on both the round trip time on the communication link and the processing time of the application. Hence we could say that

latency = 2 * round trip time + processing time

The round trip time is probably less susceptible to increasing traffic than the processing time taken for handling the increased loads. The processing time of the application is particularly pernicious in that it is susceptible to changing traffic. This article tries to analyze why the latency or response times of web applications typically increase with increasing traffic. While the latency increases exponentially as the traffic increases, the throughput increases to a point and then finally starts to drop substantially. The ideal situation for all internet applications is to have the ability to scale horizontally, allowing the application to handle increasing traffic by simply adding more commodity servers while keeping the response times within acceptable limits. However in the real world this never happens.

The price of Latency

Latency hurts business. Amazon found that every 100 ms of latency cost them 1% of sales. Similarly Google realized that an extra 0.5 seconds in generating search results dropped search traffic by 20%. Latency really matters. Reactions to bad response times in web sites range from minor annoyance to complete frustration and loss of users and business.

The cause of processing latency

One of the fundamental requirements of scalable systems is that they should be loosely coupled. The application needs to have a modular architecture with well defined interfaces to the other modules. Ideally, applications which have been designed with fairly efficient processing times, of the order of O(log n) or O(n log n), will be immune to changing loads but will be impacted by changes in the number of data elements. So the algorithms adopted by the applications themselves do not contribute to the increasing response times under increased traffic. So what really is the performance bottleneck behind increasing latencies and decreasing throughput at increased loads?

Contention – the culprit

One of the culprits behind the deteriorating response is thread locking and resource contention. Assuming that the application has been designed with Reader-Writer locks or a message queue based synchronization mechanism, the time spent waiting for resources to become free as traffic increases will result in degraded performance.

Let us assume that the application is read-heavy and write-light and has implemented a Reader-Writer synchronization mechanism. Further let us assume that a write-thread locks a resource for 250 ms. At low loads we could have 4 such threads, each locking the resource for 250 ms, for a total span of 1 s. In this interval all reader threads will be forced to wait. When the traffic load is low the number of reader threads waiting for the lock to be released will be low and will not have much impact, but as the traffic increases the number of threads waiting for the lock to be released will increase. Since a write lock takes a finite amount of time to complete processing we cannot go over the 4 write threads per second with the given CPU speed.

However as the traffic further increases the number of waiting threads not only increases but also consumes CPU and memory. This now adversely impacts the writer threads, which find that they have fewer CPU cycles and less memory and hence take longer to complete. This downward cycle worsens, resulting in an increase in the response time and a worsening throughput in the application.
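The effect described above can be made visible with a small sketch: a writer thread grabs a plain lock for 250 ms at a time while reader threads queue up behind it, and the average time a reader spends waiting grows as more readers are added. An ordinary threading.Lock is used here purely to make the waiting measurable; the 250 ms hold time is taken from the example above.

import threading
import time

resource = threading.Lock()
reader_waits = []
waits_guard = threading.Lock()

def writer(iterations):
    for _ in range(iterations):
        with resource:
            time.sleep(0.250)          # hold the resource for 250 ms, as in the example

def reader():
    start = time.time()
    with resource:                     # readers queue up behind the writer here
        waited = time.time() - start
    with waits_guard:
        reader_waits.append(waited)

w = threading.Thread(target=writer, args=(4,))       # 4 write locks of 250 ms = 1 s
readers = [threading.Thread(target=reader) for _ in range(20)]

w.start()
for r in readers:
    r.start()
for r in readers:
    r.join()
w.join()

print('average reader wait: %.3f s' % (sum(reader_waits) / len(reader_waits)))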

The solution to this problem is not easy. We need to revisit the areas where the application blocks waiting for something. Locking, besides causing threads to wait, also adds the overhead of getting scheduled before being able to execute again. We need to minimize the time a thread holds a resource before allowing other threads access to it.
