Bend it like Bluemix, MongoDB with auto-scaling – Part 3

In this last post of this series, I test the performance of Bluemix & MongoDB against  concurrent queries and deletes to the cloud based app with Mongo DB, with auto-scaling on. Before I started these series of tests I moved the Overload policy a couple of notches higher and made it scale out if  memory utilization > 75% for 120 secs and < 30% for 120 secs (from the earlier 55% memory utilization) as shown below.

27

The code for bluemixMongo app can be forked from Devops at bluemixMongo or can be cloned from GitHub at  bluemix-mongo-autoscale. The multi-mechanize scripts can be downloaded from GitHub at multi-mechanize     Before starting the testing I checked the current number of documents inserted by the concurrent inserts (see Bend it like Bluemix., MongoDB using Auto-scaling – Part 2). The total number as determined by checking the logs was 1380 17   Sure enough with the scaling policy change after 2 minutes the number of instanced dropped from 3 to 2

18

1. Querying the bluemixMongo app with Multi-mechanize

The Multi-mechanize Python script used for querying the bluemixMongo app simply invokes the app’s userlist URL (resp=br.open(“http://bluemixmongo.mybluemix.net/userlist/&#8221;)

v_user.py

def run(self):
# create a Browser instance
br = mechanize.Browser()
# don"t bother with robots.txt
br.set_handle_robots(False)
# start the timer
start_timer = time.time()
#print("Display userlist")
# Display 5 random documents
resp=br.open("http://bluemixmongo.mybluemix.net/userlist/")
assert("Example Mongo Page" in resp.get_data())
# stop the timer
latency = time.time() - start_timer
self.custom_timers["Userlist"] = latency
r = random.uniform(1, 2)
time.sleep(r)
self.custom_timers['Example_Timer'] = r

The configuration setup for this script creates 2 sets of 10 concurrent threads

config.cfg
run_time = 300
rampup = 0
results_ts_interval = 10
progress_bar = on
console_logging = off
xml_report = off
[user_group-1]
threads = 10
script = v_user.py
[user_group-2]
threads = 10
script = v_user.py

The corresponding userlist.js for querying the app is shown below. Here the query is constructed by creating a ‘RegularExpression’ with  a random Firstname, consisting of a random letter and a random number. Also the query is also limited to 5 documents.

function(callback)
{
// Display a random set of 5 records based on a regular expression made with random letter, number
var randnum = Math.floor((Math.random() * 10) + 1);
var alpha = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','X','Y','Z'];
var randletter = alpha[Math.floor(Math.random() * alpha.length)];
var val =  randletter + ".*" + randnum + ".*";
// Limit the display to 5 documents
var results = collection.find({"FirstName": new RegExp(val)}).limit(5).toArray(function(err, items){
if(err) {
console.log(err + " Error getting items for display");
}
else {
res.render('userlist',
{ "userlist" : items
}); // end res.render
} //end else
db.close(); // Ensure that the open connection is closed
}); // end toArray function
callback(null, 'two');
}

2. Running the userlist query

The following screenshot shows the userlist query being executed concurrently with Multi-mechanize. Note that the number of  instances also drops down to 1

18

3. Deleting documents with Multi-mechanize

The multi-mechanize script for deleting a document is shown below. This script calls the URL with resp = br.open(“http://bluemixmongo.mybluemix.net/remuser&#8221;). No values are required to be entered into the form and the ‘submit’ is simulated.

v_user.py
def run(self):
# create a Browser instance
br = mechanize.Browser()
# don"t bother with robots.txt
br.set_handle_robots(False)
br.addheaders = [("User-agent", "Mozilla/5.0Compatible")]
# start the timer
start_timer = time.time()
# submit the request
resp = br.open("http://bluemixmongo.mybluemix.net/remuser")
#resp = br.open("http://localhost:3000/remuser")
resp.read()
# stop the timer
latency = time.time() - start_timer
# store the custom timer
self.custom_timers["Load_Front_Page"] = latency
# think-time
time.sleep(2)
# select first (zero-based) form on page
br.select_form(nr=0)
# set form field
br.form["firstname"] = ""
br.form["lastname"] = ""
br.form["mobile"] = ""
# start the timer
start_timer = time.time()
# submit the form
resp = br.submit()
resp.read()
print("Removed")
# stop the timer
latency = time.time() - start_timer
# store the custom timer
self.custom_timers["Delete"] = latency
# think-time
time.sleep(2)

config.cfg

The config file is set to start 2 sets of 10 concurrent threads and execute for 300 secs

[global]
run_time = 300
rampup = 0
results_ts_interval = 10
progress_bar = on
console_logging = off
xml_report = off
[user_group-1]
threads = 10
script = v_user.py
[user_group-2]
threads = 10
script = v_user.py
;

deleteuser.js

This Node.js script does a findOne() document and does a remove with the ‘justOne’ set to true

collection.findOne(function(err, item) {
// Delete just a single record
collection.remove(item, {justOne:true},(function (err, doc) {
if (err) {
// If it failed, return error
res.send("There was a problem removing the information to the database.");
}
else {
// If it worked redirect to userlist
res.location("userlist");
// And forward to success page
res.redirect("userlist");
}
}));
});
collection.find().toArray(function(err, items) {
console.log("Length =----------------" + items.length);
db.close();
});
callback(null, 'two');

4. Running the deleteuser multimechanize script

The output of the script executing and the reduction of the number of instances because of the change in the memory utilization policy is shown

21

5. Multi-mechanize

As mentioned in the previous posts

The multi-mechanize commands are executed as follows
To create a new project
multimech-newproject.exe userlist
This will create 2 folders a) results b) test_scripts and the file c) config.cfg. The v_user.py needs to be updated as required

To run the script
multimech-run.exe userlist

The details of the response times for the query is shown below .

All_Transactions_response_times_intervals

 

More details on latency and throughput for the queries and the deletes are included in the results folder of multi-mechanize    

6. Autoscaling The details of the auto-scaling service is shown below

a. Scaling Metric Statistics

22

b. Scaling history 23

7. Monitoring and Analytics (M & A) The output from M & A is shown  below

 

a. Performance Monitoring 24  

b. Log Analysis output The log analysis give a detailed report on the calls made to the app, the console log output and other useful information

25

The series of the 3 posts Bend it like Bluemix, MongoDB with auto-scaling demonstrated the ability of the cloud to expand and shrink based on the load on the cloud.An important requirement for Cloud Architects is design applications that can  scale horizontally without impacting the performance while keeping the costs optimum. The real challenge to auto-scaling is the need to make the application really distributed as opposed to the monolithic architectures we are generally used to.   I hope to write another post on creating simple distributed application later.

Hasta la Vista!

Also see
1.  Bend it like Bluemix, MongoDB with autoscaling – Part 1
2. Bend it like Bluemix, MongoDB with autoscaling – Part 2

You may like :
a) Latency, throughput implications for the cloud
b) The many faces of latency
c) Brewing a potion with Bluemix, PostgreSQL & Node.js in the cloud
d)  A Bluemix recipe with MongoDB and Node.js
e)Spicing up IBM Bluemix with MongoDB and NodeExpress
f) Rock N’ Roll with Bluemix, Cloudant & NodeExpress

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

Bend it like Bluemix, MongoDB using Auto-scale – Part 1!

In the next series of posts I turn on the heat on my cloud deployment in IBM Bluemix and check out the elastic nature of this PaaS offering. Handling traffic load and elastically expanding and contracting is what the cloud does best. This is  where the ‘rubber really meets the road”. In this series of posts I generate the traffic load using Multi –Mechanize a performance test framework created by Corey Goldberg.

This post is based on an earlier cloud app that I created on Bluemix namely Spicing up a IBM Bluemix Cloud app with MongoDB and NodeExpress. I had to make changes to this code to iron out issues while handling concurrent  inserts, displays and deletes issued from the multi-mechanize tool and also to manage the asynchronous nightmare of Nodejs.

The code for this Bluemix, MongoDB with Auto-scaling can be forked  from Devops at bluemixMongo. The code can also be cloned from GitHub at bluemix-mongo-autoscale

1.  To get started, fork the code from Devops at bluemixMongo. Then change the host name in manifest.yml to something unique and click the Build and Deploy button on the top right in the page.

26

1a.  Alternatively the code can be cloned from GitHub at bluemix-mongo-autoscale. From the directory where the code is cloned push the code using Cloud Foundry’s cf command as follows

cf login -a https://api.ng.bluemix.net

cf push bluemixMongo –p . –m 128M

2. Now add the MongoDB service and click ‘OK’ to restage the server.

3

3. Add the Monitoring and Analytics (M & A) and also the Auto-scaling service. The M& A gives a good report on the Availability, Performance logging, and also provides Logging Analysis. The Auto-scaling service is the service that allows the app to expand elastically to changing traffic loads.

4

4. You should see the bluemixMongo app running with 3 services MongoDB, Autoscaling and M&A

5

5. You should now be able click the bluemixMongo.mybluemix.net and check the application out.

6.Now you configure the Overload Policy (auto scaling) policy. This is a slightly contrived example and the scaling policy is set to scale up if the Memory exceeds 55%. (Typically the scale up would be configured for > 80% memory usage)

6

7. Now check the configured Auto-scaling policy

7

8. Change the Memory Quota as appropriate. In my case I have kept the memory quota as 128 MB. Note the available memory is 640 MB and hence allows up to 5 instances. (By the way it is also possible to set any other value like 100 MB).

5

9. Click the Monitoring and Analytics service and take a look at the output in the different tabs

8

10. Next you need to set up the Performance test tool – Multi mechanize. Multi-mechanize creates concurrent threads to generate the load on a Web site or service. It is based on Python which  makes it easy to modify the scripts for hitting a website, making a REST call or submitting a form.

To setup Multi-mechanize you also need additional packages like numpy  matplotlib etc as the tool generates traffic based on a user provided script, measures latency and throughput besides also generating graphs for these.

For a detailed steps for setup of Multi mechanize please follow the steps in Trying out multi-mechanize web performance and load testing framework. Note: I would suggest that you install Python 2.7.2 and not the later 3.x version as some of the packages are based on the 2.7 version which has a slightly different syntax for certain Python statements

In the next post I will run a traffic test on the bluemixMongo application using Multi-mechanize and observe how the cloud app responds to the load.

Watch this space!
Also see
Bend it like Bluemix, MongoDB with autoscaling – Part 2!
Bend it like Bluemix, MongoDB with autoscaling – Part 3

You may like :
a) Latency, throughput implications for the cloud
b) The many faces of latency
c) Brewing a potion with Bluemix, PostgreSQL & Node.js in the cloud
d)  A Bluemix recipe with MongoDB and Node.js
e)Spicing up IBM Bluemix with MongoDB and NodeExpress
f) Rock N’ Roll with Bluemix, Cloudant & NodeExpress

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

Profiting from a cloud deployment

Cloud computing does offer enterprises and organizations a mixed bag of goodies. For one it provides for a utility style computing, the ability to grow and shrink with changing loads, zero upfront costs etc. The benefits of cloud computing are many but does it all add up to profit for an enterprise? That is the critical question that needs to be answered.

This post will take a look on what it takes for a cloud deployment to be profitable for an organization.

The critical parameters for any web application are latency and throughput.  A well designed web application whether it is an e-retail site or an ad serving application will try to minimize the latency or response time while at the same time maximizing the throughput of the application. For any application while the latency can be kept within specified limits the throughput will tend to plateau at a certain level and will not increase with increasing traffic. Utilizing a larger instance can improve the throughput plateau slightly. In any case the reality is that throughput tends to flatten as the traffic is increased.

A typical cloud application will be made of several compute instances, database instances, DNS services etc. Cloud usage is billed by the hour. Hence we can represent the cost of a cloud deployment as follows

Cost (cloud deployment) = m * compute instance + n * database instance + o * network bytes + P

Where P = cost of DNS + Elastic IPs + other costs.

This can be represented by the formula

C = a * D * t

where C = cost of cloud deployment

D = costs per hour of the deployment

and ‘a’ is some arbitrary constant and ‘t’ is the time

Let us assume that for the cloud deployment we get a throughput of T.

The revenue for a web application whether it is an e-commerce site, an e-ticketing site or an ad serving engine will all depend on the throughput i.e. larger the throughput, larger the revenue and hence profit. We can then say that ‘R’ the revenue is

R (revenue) α k * T * t

In others words  the revenue is proportional to the throughput.

Hence to determine the profitability of a particular cloud deployment we need to compare the cost of the deployment for a given throughput versus a projected  profit margin. As long the cost of the deployment is less than the revenue arising from the throughput, the deployment will be profitable.  This can be represented pictorially as below.

The graph clearly shows that for a profitable deployment

d/dt (k * T *t) > d/dt (a * D * t) or

k * T > a * D

Hence as can seen from the picture as long as the slope of the cumulative deployment costs are less that the slope of the revenue the deployment will be profitable.

Find me on Google+

The Anatomy of Latency

Latency is a measure of the time delay experienced in a system. In data communications, latency would be measured as the round-trip delay between sending a packet and receiving response from the destination. In the world of web applications latency is the response time of a web site. In web applications latency is dependent on both the round trip time on the communication link and also the processing time of the application, Hence we could say that

latency = 2 * round trip time  + Processing time

The round trip time is probably less susceptible to increasing traffic than the processing time taken for handling the increased loads. The processing time of the application is particularly pernicious in that it susceptible to changing traffic. This article tries to analyze why the latency or response times of web applications typically increase with increasing traffic. While the latency increases exponentially as the traffic increases the throughput increases to a point and then finally starts to drop substantially.  The ideal situation for all internet applications is to have the ability to scale horizontally allowing the application to handle increasing traffic by simply adding more commodity servers to the application while maintaining the response times to acceptable limits. However in the real world this never happens.

The price of Latency

Latency hurts business. Amazon found out that every 100 ms of latency cost them 1% of sales.  Similarly Google realized that a 0.5 second increase in search results dropped the search traffic by 20%. Latency really matters.    Reactions to bad response times in web sites range from minor annoyance to complete frustration and loss of users and business.

The cause of processing latency

One of the fundamental requirements of scalable systems is that they should be loosely coupled. The application needs to have a modular architecture with well defined interfaces with the other modules.  Ideally, applications which have been designed with fairly efficient processing times of the order of O(logn) or O(nlogn)  will be immune to changing loads but will be impacted by changes in number of data elements  So the algorithms adopted by the applications themselves do not contribute the increasing response times for increase traffic. So finally what really is the performance bottleneck for increasing latencies and decreasing throughput for increased loads?

Contention- the culprit

One of the culprits behind the deteriorating response is the thread locking and resource contention. Assuming that application has been designed with Reader-Writer locks or message queue based synchronization mechanism then the time spent in waiting for resources to become free, while traffic increases, will result in the degraded performance.

Let us assume that the application is read-heavy, write-light and has implemented Reader-Writer synchronization mechanism. Further let us assume that a write-thread locks a resource for 250 ms.  At low loads we could have 4 such threads each locking the resource for 250 ms for a total span of 1s.  Hence in 1s there can be a maximum of 4 threads each of which has executed a write lock for 250 ms for a total of 1s. In this interval all reader threads will be forced to wait. When the traffic load is low the number of reader threads waiting for the lock to be released will be low and will not have much impact but as the traffic increases the number of threads that are waiting for the lock to be released will be increase. Since a write lock takes a finite amount of time to complete processing we cannot go over the 4 write threads in 1 second with the given CPU speed.

However as the traffic further increases the number of waiting threads not only increases but also consume CPU and memory. Now this adversely impacts the writer threads which find that they have lesser CPU cycles and less memory and hence take longer times to complete. This downward cycle worsens and hence results in an increase in the response time and a worsening throughput in the application.

The solution to this problem is not easy. We need to revisit the areas where the application blocks waiting for something. Locking besides causing threads to wait also adds the overhead of getting scheduled prior to being able to execute again. We need to minimize the time a thread holds a resource before allowing others threads access to it.

Find me on Google+

Designing for Cloud Worthiness

Cloud Computing is changing the rules of computing to the enterprise. Enterprises are no longer constrained by capital costs of upfront equipment purchase. Rather they can concentrate on the application and deploy it on the cloud and pay in a utility style based on usage. Cloud computing essentially presents a virtualized platform on which applications can be deployed.

The Cloud exhibits the property of elasticity by automatically adding more resources to the application as demand grows and shrinking the resources when the demand drops. It is this property of elasticity of the cloud and the ability to pay based on actual usage that makes Cloud Computing so alluring.

However to take full advantage of the Cloud the application must use the available cloud resources judiciously.  It is important for applications that are to be deployed on the cloud to have the property of scaling horizontally.  What this implies is that the application should be able to handle more transactions per second when more resources are added to application. For example if the application has been designed to run in a small CPU instance of 1.7GHz,32 bit and 160 GB of instance storage with a throughput of 800 transactions per second then one should be able to add 4 such instances and scale to handling 4000 transactions per second.

However there is a catch in this. How does one determine what should be theoretical limit of transactions per second for a single instance?  Ideally we should maximize the throughput and minimize the latency for each instance prior to going to the next step of adding more instances on the cloud. One should squeeze the maximum performance from the application in the instance of choice prior to using multiple instances on the cloud. Typical applications perform reasonably well under small loads but as the traffic is increased the response time increases and the throughput also starts dipping.

There is a need to run some profiling tools and remove bottlenecks in the application. The standard refrain for applications to be deployed on the cloud is that they should be loosely coupled and also be stateless. However, most applications tend to be multi-threaded with resource sharing in various modules.  The performance of the application because of locks and semaphores should be given due consideration. Typically a lot of time wasted in the wait state of threads in the application. A suitable technique should be used for providing concurrency among threads. The application should be analyzed whether it read-heavy and write-light or write-heavy and read-light. Suitable synchronization techniques like reader-Writer, message queue based exclusion or monitors should be used.

I have found callgrind for profiling and gathering performance characteristics along with KCachegrind for providing a graphical display of performance times extremely useful.

Another important technique to improve performance is the need to maintain in-memory cache of frequently accessed data. Rather than making frequent queries to the database periodic updates from the database need to be made and stored in in-memory cache. However while this technique works fine with a single instance the question of how to handle in-memory caches for multiple instances in the cloud represents quite a challenge. In the cloud when there are multiple instances there is a need for a distributed cache which is  shared among multiple instances. Memcached is appropriate technique for maintaining a distributed cache in the cloud.

Once the application has been ironed out for maximum performance the application can be deployed on the cloud and stress tested for peak loads.

Some good tools that can be used for generating loads on the application are loadUI and multi-mechanize. Personally I prefer multi-mechanize as it uses test scripts that are based on Python which can be easily modified for the testing. One can simulate browser functionality to some extent with Python in multi-mechanize which can prove useful.

Hence while the cloud provides CPUs, memory and database resources on demand the enterprise needs to design applications such that the use of these resources are done judiciously. Otherwise the enterprise will not be able to reap the benefits of utility computing if it deploys inefficient applications that hog a lot of resources without appropriate revenue generating performance.

INWARDi Technologies