TWS-4: Gossip protocol: Epidemics and rumors to the rescue

Having successfully completed a grueling yet enjoyable ‘Cloud Computing Concepts’ course at Coursera, from the University of Illinois at Urbana-Champaign, by Prof Indranil Gupta, I continue on my “Thinking Web Scale (TWS)” series of posts. In this post, I would like to dwell on Gossip Protocol.

Gossip protocol finds its way into distributed system from Epidemiology, a branch of science, which studies and models how diseases, rumors spread through society. The gossip protocol disseminates information – the way diseases, rumors spread in society or the way a computer virus is able to infect large networks very rapidly

Gossip protocol is particularly relevant in large distributed systems with hundreds and hundreds of servers spread across multiple data centers for e.g. Social networks like Facebook, Google or Twitter etc.. The servers that power Google’s search, or the Facebook or Twitter engine is made of hundreds of commercial off the shelf (COTS) computers. This is another way of saying that the designers of these systems should fold extremely high failure rates of the servers into their design. In other words “failures will be the norm and not the exception”

As mentioned in my earlier post, in these large distributed systems servers will be fail and new servers will be continuously joining the system. The distributed system must be able to accommodate servers joining or leaving the system. There is no global clock and each server has its own clock. To handle server failures data is replicated over many servers which obviously leads to issues of maintaining data consistency between the replicas.

A well-designed distributed system must include in its design key properties of

Availability – Data should be available when you want it
Consistency – Data should consistent across multiple copes
Should be fault tolerant
Should be scalable
Handle servers joining or leaving the systems transparently

One interesting aspect of Distributed Systems much like Operating System (OS) is the fact that a lot of the design choices are based on engineering judgments. The design choices are usually a trade-off of slightly different performance characteristics. Some of them are obvious and some not so obvious.

Why Gossip protocol? What makes it attractive?

Here are some approaches

Centralized Server:

Let us assume that in a network of servers we have a server (Server A) has some piece of information which it needs to spread to other servers. One way is to have this server send the message to all the servers. While this would work there are 2 obvious deficiencies with this approach

The Server A will hog the bandwidth in transmitting the information to all other servers
Server A will be a hot spot besides also being a Single Point of Failure

Cons: In other words if we have a central server always disseminating information then we run into the issue of ‘Single point of Failure’ of this central server.

Directed Graph

Assuming that we construct a directed overlay graph over the network of servers, we could transmit the message from server A to all other servers. While this approach, has the advantage of lesser traffic as each server node will typically have around a 1 -3 children. This will result in lesser bandwidth utilization. However the disadvantage to this approach, will be that , when an intermediate non-leaf node fails then information will not reach all children of the failed nodes.

Cons: Does not handle failures of non-leaf nodes well

Ring Architecture

In this architecture we could have Server A, pass the message round the ring till it gets to the desired server. Clearly each node has one predecessor and one successor. Like the previous example this has the drawback that if one or more servers of the ring fail then the message does not get to its destination.

Cons: Does not handle failures of nodes in the ring well

Note: We should note that these engineering choices only make sense in certain circumstances. So for e.g. the directed graph or the ring structure discussed below have deficiencies for the distributed system case, however these are accepted design patterns in computer networking for e.g. the Token Ring IEEE 802.5 and graph of nodes in a network. Hierarchical trees are the norm in telecom networks where international calls reach the main trunk exchange, then the central office and finally to the local office in a route that is a root-non-leaf-leaf route.

Gossip protocol

Enter the Gossip protocol (here is a good summary on gossip protocol). In the gossip protocol each server sends the message to ‘b’ random peers. The value ‘b’ typically a small number is called the fan-out. The server A which has the data is assumed to be ‘infected’. In the beginning only server A is infected while all other servers are ‘susceptible’. Each server receiving the message is now considered to be infected. Each infected server transmits to ‘b’ other servers. It is likely that the receiving sever is already infected in which case it will drop the message.

In many ways this is similar to the spread of a disease is through a virus. The disease spreads when an infected person comes in contact with another person.

The nice part about the gossip protocol is that is light weight and it can infect the entire set of servers in the order of O (log N)

This is fairly obvious as each round the ‘b’ infected servers will infect ‘b*n’ other servers where ‘n’ is the fan-out.
The computation is as follows

Let x0 = n (Initial state, all un-infected) and y0 =1 (1 infected server) at time t = 0
With x0 + y= n + 1 at all times

Let β be the contact rate between the ‘susceptible’ and ‘infected’ (x*y), then the rate of infection can be represents as
dx/dt= -βxy

The negative sign indicates that the number of ‘non-infected’ servers will decrease over time
(It is amazing how we can capture the entire essence of the spread of disease through a simple, compact equation)

The solution for the above equation (which I have taken in good faith, as my knowledge in differential equations is a faint memory. Hope to refresh my memory when I get the chance, though!)
x=n(n+1)/(n+e^β(n+1)t ) – 1
y=(n+1)/(1+ne^(-β(n+1)t)) – 2

The solution (1) clearly shows that the number ‘x’ of un-infected servers at time‘t’ rapidly to 0 as the denominator becomes too large. The number of infected units ‘y’ as t increases tends to n+1, or in other words all servers get infected

This method where infected server sends a message to ‘b’ servers is known as the ‘push’ approach.

Pros: The Gossip protocol clearly is more resilient to servers failing as the gossip message is sent a ‘b’ random targets and can handle failures better.
Cons: There is a possibility that the ‘b’ random targets selected for infection are already infected, in which case the infection can die rapidly if these infected servers fail.

The solution for the above is to have a ‘pull’ approach where after a time ‘t’ the un-infected servers pull the data from random servers. This way the un-infected servers will also get infected if they pull the data from already infected servers

A third approach is to have a combination of a push-pull approach.
Gossip has been used extensively in Facebook’s and Apache’s Cassandra NoSQL database. Amazon’s Dynamo DB and Riak NoSQL DB also use forms of Gossip Protocol

Failure detection: Gossip protocol has been used extensively in detecting failures. The failed servers are removed from the membership list and this is list is gossiped so that all servers have a uniform view of the set of live servers. However, as with any approach this is prone to high rate false-positives, where servers are assumed to have failed even though this may have been marked as ‘failed’ because of a temporary network failure. Moreover the network load on epidemic style membership lists are also high.

Some methods to handle false positives is to initially place failed servers under a ‘suspicion’. When the number of messages attributing failure to this server increases above a threshold ‘t’, then the server is assumed to have failed and removed from the membership list.

Cassandra uses a failure ‘accrual’ mechanism to detect failures in the distributed NoSQL datanase

Epidemic protocols, like the gossip protocol are particularly useful in large scale distributed systems where servers leave and join the system.

One interesting application of the epidemic protocol is to simply to collect the overall state of the system. If we consider an information exchange where all nodes have set an internal value xi = 0 except node 1 which has x1=1 (infected) (from the book Distributed Systems: Principles & paradigms by Andrew Tannenbaum and Maarten Van Steen)

where xi = 1 if i =1, or 0 if i > 1
If the nodes gossip this value and compute the average (xi + xj) /2, then after a period of time this value will tend towards 1/N where N is the total number of nodes in the system. Hence all the servers in the system will become aware of the total size of the system.

Conclusion: Gossip protocol has widespread application in distributed systems of today, from spreading information, membership, failure detection, monitoring and alarming. It is really interesting to note that the theory of epidemics or disease spread from a branch of sociology become so important in a field of computer science.

Also see
1. A crime map of India in R: Crimes against women
2. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3. Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data

Bend it like Bluemix, MongoDB using Auto-scale – Part 1!

In the next series of posts I turn on the heat on my cloud deployment in IBM Bluemix and check out the elastic nature of this PaaS offering. Handling traffic load and elastically expanding and contracting is what the cloud does best. This is where the ‘rubber really meets the road”. In this series of posts I generate the traffic load using Multi –Mechanize a performance test framework created by Corey Goldberg.

This post is based on an earlier cloud app that I created on Bluemix namely Spicing up a IBM Bluemix Cloud app with MongoDB and NodeExpress. I had to make changes to this code to iron out issues while handling concurrent inserts, displays and deletes issued from the multi-mechanize tool and also to manage the asynchronous nightmare of Nodejs.

The code for this Bluemix, MongoDB with Auto-scaling can be forked from Devops at bluemixMongo. The code can also be cloned from GitHub at bluemix-mongo-autoscale

1. To get started, fork the code from Devops at bluemixMongo. Then change the host name in manifest.yml to something unique and click the Build and Deploy button on the top right in the page.

1a. Alternatively the code can be cloned from GitHub at bluemix-mongo-autoscale. From the directory where the code is cloned push the code using Cloud Foundry’s cf command as follows

cf login -a https://api.ng.bluemix.net

cf push bluemixMongo –p . –m 128M

2. Now add the MongoDB service and click ‘OK’ to restage the server.

3. Add the Monitoring and Analytics (M & A) and also the Auto-scaling service. The M& A gives a good report on the Availability, Performance logging, and also provides Logging Analysis. The Auto-scaling service is the service that allows the app to expand elastically to changing traffic loads.

4. You should see the bluemixMongo app running with 3 services MongoDB, Autoscaling and M&A

5. You should now be able click the bluemixMongo.mybluemix.net and check the application out.

6.Now you configure the Overload Policy (auto scaling) policy. This is a slightly contrived example and the scaling policy is set to scale up if the Memory exceeds 55%. (Typically the scale up would be configured for > 80% memory usage)

7. Now check the configured Auto-scaling policy

8. Change the Memory Quota as appropriate. In my case I have kept the memory quota as 128 MB. Note the available memory is 640 MB and hence allows up to 5 instances. (By the way it is also possible to set any other value like 100 MB).

9. Click the Monitoring and Analytics service and take a look at the output in the different tabs

10. Next you need to set up the Performance test tool – Multi mechanize. Multi-mechanize creates concurrent threads to generate the load on a Web site or service. It is based on Python which makes it easy to modify the scripts for hitting a website, making a REST call or submitting a form.

To setup Multi-mechanize you also need additional packages like numpy matplotlib etc as the tool generates traffic based on a user provided script, measures latency and throughput besides also generating graphs for these.

For a detailed steps for setup of Multi mechanize please follow the steps in Trying out multi-mechanize web performance and load testing framework. Note: I would suggest that you install Python 2.7.2 and not the later 3.x version as some of the packages are based on the 2.7 version which has a slightly different syntax for certain Python statements

In the next post I will run a traffic test on the bluemixMongo application using Multi-mechanize and observe how the cloud app responds to the load.

Watch this space!
Also see
Bend it like Bluemix, MongoDB with autoscaling – Part 2!
Bend it like Bluemix, MongoDB with autoscaling – Part 3

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

Where is the Cloud Computing bus going?

Technological innovation patterns have often repeated themselves in history. So it is with Cloud Computing. Familiar patterns of change seem to emerge today

Here are some of main trends that I see in Cloud Computing

Advent of containers: Containers are the new hot topic in cloud computing. In virtualization guest OS’es run separately. Running separate guest OS over the hypervisor is associated with a lot of overhead for each of the heavy weight OS’es. Containers can be used as an alternative to OS-level virtualization to run multiple isolated systems on a single host. Containers within a single operating system are much more efficient being light weight while being able to provide the same level of isolation. Containers run the same kernel as the host. Here is an interesting article on containers Containers, not virtual machines are the future of the cloud.

In many ways this containers over VM innovation pattern is reminiscent of the advantages of lightweight ‘threads’ over the heavy and slow ‘process’ approach in the OS world. It is inevitable that containers will eventually score over VMs

Open ‘something’ over proprietary’ness: Technology over the decades has always moved into an ‘open’ approach over proprietary solutions. Hence, for example, we have OpenStack for creating instances, provisioning storage, network to do many things that are being done separately by VMWare, Citrix, Hyper-V. The intent is to have a common approach over several disparate approaches. In the networking world there is OpenFlow which tries to have a uniform interface to the many different standards maintained by the Ciscos, Junipers and Brocades of the world. There are also other technologies like OpenCV (Computer Vision processing), Open VPN (VPN protocol) etc. In all these approaches there is either to move to unify or to provide a layer over and above the disparate approaches. I am not sure whether Openstack will prevail, only time will tell. I personally think we will move to a level abstraction that will be even above that of Open Stack.

Software Defined Everything: Cloud Computing started with the need to be able to provision computing resources through a user interface or the Web portal. This was made possible, thanks to virtualization. Users could now define and request computing resources. Soon this led to the need for being able to programmatically request storage. The trick in storage is to do ‘thin-provisioning’ or to provision resources that barely satisfies the needs of the application. The application will be able to request more storage programmatically. Not to be outdone, networking followed suit when Software Defined Networking became a reality when Stanford and University of California came with the Open Flow protocol. We have now entered into the era of Software Defined Datacenter. This is a dominant theme in Cloud Computing.

These are some of the predominant trends that are emerging in the Cloud Computing arena.

I have spent more than 2 decades of my career in telecom, implementing telecom protocols, starting in the mid-1980s. The mid 1980s was the time when digital switches started to emerge. This was followed by a spate of protocols and dizzying innovations like mobile telephony, ISDN, Intelligent Networks, Softswitch, UMTS,3G, HSDPA, LTE etc.

I personally think that Cloud computing, to use a very frayed and hackneyed term, is at a similar ‘inflexion point’. Trends are emerging and we will soon be caught in the maelstrom of rapid change and innovation.

In this post I am going to do a Marty McFly of the ‘Back to Future’ trilogy. I am going to set the clock of the Delorean DMC-12 to 2020 and ‘Whoosh…..’

21 Apr 2020:

It is 21 Apr 2020 and a sunny day. Here is a look at the Cloud Computing landscape

The Organization of Cloud Computing Standards (OCCS) now sets and governs the standards for all Cloud Providers of the world
Common APIs govern provisioning of instances on the cloud regardless of the Cloud Provider. Instances are defined by RPE values, RAM and IOPS, LB, DNS requirements
Networking bandwidth, security and storage are also standards based
Enterprises use a ‘diffuse deployment’ strategy where the organization’s workloads are deployed to multiple cloud providers.
Workloads are Cloud Provider agnostic.
Enterprise applications themselves may span multiple cloud providers for e.g. the e-commerce in Cloud Provider 1, Analytics on HPC instances on Cloud Provider 2 and secure applications on Private Cloud of Cloud Provider 3. Appropriate contracts are maintained between the Cloud Providers for charging for the usage.
Algorithms are used by enterprises to deploy workloads to cloud providers. The algorithms match the SLA and cost requirements of the application with those offered by the cloud provider to minimize the cost while meeting the SLA requirements of the applications.
Compute, storage and networking costs fluctuate and enterprises use algorithms to optimize the deployment of workloads. Workloads are migrated to take advantage of these price changes
Consolidation and acquisitions happen at an alarming pace. Cloud providers, storage, network and HPC providers aslo compete fiercely
Cloud providers are swallowed by others and some lose out. The battle scene is bloody

Time to get back to Delorean. This time the clock on Delorean is set to 2025

18 Sep 2025

Today it is 18 Sep 2025, and it is sunny again, coincidentally.

Cloud Computing is dead, mate. These days technology has moved to ‘Cloud Computing in a box’.
The technology of these times are ‘Haze works’ where the computation happens in the stratosphere over the ether …

So much for looking into the future. It is now time to get back to the reality of VMs

What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1

Published in IBM developerWorks ‘Whats up Watson? Using Watson QAAPI with Bluemix and NodeExpress‘

In this post I take the famed IBM Watson through the paces (yes, that’s right!, this post is about using the same IBM Watson which trounced 2 human Jeopardy titans in a classic duel in 2011). IBM’s Watson (see What is Watson?) is capable of understanding the nuances of the English language and heralds a new era in the domain of cognitive computing. IBM Bluemix now includes 8 services from Watson ranging from Concept Expansion, Language Identification, Machine Translation, Question-Answer etc. For more information on Watson’s QAAPI and the many services that have been included in Bluemix please see Watson Services.

In this article I create an application on IBM Bluemix and use Watson’s QAAPI (Question-Answer API) as a service to the Bluemix application. For the application I have used NodeExpress to create a Webserver and post the REST queries to Watson. Jade is used format the results of Watson’s Response.

In this current release of Bluemix Watson comes with a corpus of medical facts. In other words Watson has been made to ingest medical documents in multiple formats (doc, pdf, html, text etc) and the user can pose medical questions to Dr.Watson. In its current avatar, its medical diet consisted of dishes from (CDC Health Topics, National Heart, Lung, and Blood Institute (NHLBI) National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Neurological Disorders and Stroke (NINDS), Cancer.gov (physician data query) etc.)

Try out my Watson app on Bluemix here – Whats up Watson?

To get down to Watson QAAPI business with Bluemix app you can fork the code from Devops at whatsup. This can then be downloaded to your local machine. You can also clone the code from GitHub at whatsup

To get started go to the directory where you have cloned the code for Whatsup app

2.Push the app to Bluemix using Cloud Foundry’s ‘cf’ commands as shown below

cf login -a https://api.ng.bluemix.net
3. Next push the app to Bluemix
cf push whatsup –p . –m 512M

In the Bluemix dashboard you should see ‘whatsup’ app running. Now click ‘Add Service’ and under Watson add ‘Question Answer’

Add Qatson QAAPI

You will be prompted with ‘Restage Application’. Click ‘Ok’. Once you have the app running you should be able to get started with Doc Watson.

The code for this Bluemix app with QAAPI as a Service is based on the following article Examples using the Question and Answer API

Here’s a look at the code for the Bluemix & Watson app.

In this Bluemix app I show the different types of Questions we can ask Watson and the responses we get from it. The app has a route for each of the different types of questions and options

a. Simple Synchronous Query: Post a simple synchronous query to Watson
This is the simplest query that we can pose to Watson. Here we need to just include the text of the question and the also a Sync Timeout. The Sync Timeout denotes the time client will wait for responses from the Watson service
// Ask Watson a simple synchronous query
app.get('/question',question.list); app.post('/simplesync',simplesync.list);
b. Evidence based question: Ask Watson to respond to evidence given to it
Ask Watson for responses based on evidence given like medical conditions etc. This would be a used for diagnostic purposes I would presume.
// Ask Watson for responses based on evidence provided
app.get('/evidence',evidence.list); app.post('/evidencereq',evidencereq.list);
c. Request for a specified set of answers to a question: Ask Dr. Watson to give a specified number of responses to a question
// Ask Watson to provide specified number of responses to a query
app.get('/items',items.list); app.post('/itemsreq',itemsreq.list);
d. Get a formatted response to a question: Ask Dr. Watson to format the response to the question
// Get a formatted response from Watson for a query
app.get('/format',format.list); app.post('/formatreq',formatreq.list);

To get started with Watson we would need to connect the Bluemix app to the Watson’s QAAPI as a service by parsing the environment variable. This is shown below

//Get the VCAP environment variables to connect Watson service to the Bluemix application

question.js
o o o if (process.env.VCAP_SERVICES) { var VCAP_SERVICES = JSON.parse(process.env.VCAP_SERVICES); // retrieve the credential information from VCAP_SERVICES for Watson QAAPI var hostname = VCAP_SERVICES["Watson QAAPI-0.1"][0].name; var passwd = VCAP_SERVICES["Watson QAAPI-0.1"][0].credentials.password; var userid = VCAP_SERVICES["Watson QAAPI-0.1"][0].credentials.userid; var watson_url = VCAP_SERVICES["Watson QAAPI-0.1"][0].credentials.url;

Next we need to format the header for the POST request

var parts = url.parse(watson_url); // Create the request options to POST our question to Watson var options = {host: parts.hostname, port: 443, path: parts.pathname, method: 'POST', headers: headers, rejectUnauthorized: false, // ignore certificates requestCert: true, agent: false};

The question that is to be asked of Watson needs to be formatted appropriately based on the input received in the appropriate form (for e.g. simplesync.jade)

question.js
// Get the values from the form var syncTimeout = req.body.timeout; var query = req.body.query; // create the Question text to ask Watson var question = {question : {questionText :query }}; var evidence = {"evidenceRequest":{"items":1,"profile":"yes"}}; // Set the POST body and send to Watson req.write(JSON.stringify(question)); req.write("\n\n"); req.end();

Now you POST the Question to Dr. Watson and receive the stream of response using Node.js’ .on(‘data’,) & .on(‘end’) shown below

question.js
…..
var req = https.request(options, function(result) {
// Retrieve and return the result back to the client
result.on(“data”, function(chunk) {
output += chunk;
});

result.on('end', function(chunk) { // Capture Watson's response in output. Parse Watson's answer for the fields var results = JSON.parse(output); res.render( 'answer', { "results":results }); }); });

The results are parsed and formatted displayed using Jade. For the Jade templates I have used a combination of Jade and inline HTML tags (Jade can occasionally be very stubborn and make you sweat quite a bit. So I took the easier route of inline HTML tagging. In a later post I will try out CSS stylesheets to format the response.)

Included below is the part of the jade template with inline HTML tagging

Answer.jade
o o o <h2 style="color:blueviolet"> Question Details </style> </h2> for result in results.question.qclasslist p <font color="blueviolet"> Value = <font color="black "> #{result.value} </font> p <font color="blueviolet"> Focuslist </font> = <font color="black "> #{results.question.focuslist[0].value} </font> // The 'How' query's response does not include latlist. Hence conditional added. if latlist p <font color="blueviolet"> Latlist </font> = <font color="black "> #{results.question.latlist[0].value} </font>

o o o

Now that the code is all set you can fire the Watson. To do this click on the route

Click the route whatsup.mybluemix.net and ‘Lo and behold’ you should see Watson ready and raring to go.

As the display shows there are 4 different Question-Answer options that there is for Watson QAAPI

Simple Synchronous Question-Answer
This option is the simplest option. Here we need to just include the text of the question and the also a Sync Timeout. The question can be any medical related question as Watson in its current Bluemix avatar has a medical corpus

For e.g.1) What is carotid artery disease?

2) What is the difference between hepatitis A and hepatitis B etc.

The Sync Timeout parameter specifies the number of seconds the QAAPI client will wait for the streaming response from Watson. An example question and Watson’s response are included below

;

When we click Submit Watson spews out the following response

Evidence based response:

In this mode of operation, questions can be posed to Watson based on observed evidence. Watson will output all relevant information based on the evidence provided. As seen in the output Watson provides a “confidence factor” for each of its response

Watson gives response with appropriate confidence values based on the given evidence

Question with specified number of responses
In this option we can ask Watson to provide us with at least ‘n’ items in its response. If it cannot provide as many items it will give an error notification

This will bring up the following screen where the question asked is “What is the treatment for Down’s syndrome?” and Items as 3.

Watson gives 3 items in the response as shown below

Formatted Response: Here Watson gives a formatted response to question asked. Since I had already formatted the response using Jade it does not do extra formatting as seen in the screen shot.

Updated synonym based response. In this response we can change the synonym list based on which Watson will search its medical corpus and modify its response. The synonym list for the the question “What is fever?” is shown below. We can turn off synonyms by setting to ‘false’ and possibly adding other synonyms for the search

This part of the code has not been included in this post and is left as an exercise to the reader 🙂

As mentioned before you can fork and clone the code from IBM devops at whatsup or clone from GitHub at whatsup

There are many sections to Watson’s answer which cannot be included in this post as the amount of information is large and really needs to be pared to pick out important details. I am including small sections from each part of Watson’s response below to the question “How is carotid artery disease treated/”

I will follow up this post with another post where I will take a closer look at Watson’s response which has many parts to it
namely

– Question Details

– Evidence list

– Synonym list

– Answers

– Error notifications

There you have it. Go ahead and get all your medical questions answered by Watson.

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

Introducing the Software Defined Computing Pattern

We are on the verge of a new ‘Software Defined’ revolution. The phrase ‘software defined’ refers to the ability to be able to programmatically control computing elements namely compute, storage, network. We are entering into a bold, brave ‘software defined’ era. Before we delve into the ‘whats’ of this revolution I would rather like to outline the ‘whys’. What motivated this new thinking in computing?

Why “Software Defined’?

In the late 90s, IT infrastructure was unwieldy and unmanageable, Whenever new IT infrastructure had to be procured there was the need to accurately size the required hardware infrastructure, software, software licenses, routers, switches and storage elements The problem in those days had to do with dimensioning. The CIO and IT managers had to be able to calculate the requisite hardware, and software elements. The problem was that if the estimate was too conservative the infrastructure would be under-dimensioned and would not be able to handle the load. On the other hand if it was over-dimensioned then hardware and software would lie idle and would result in a wasted resources and money. So it used to be a fine balancing act. Even if the IT managers got lucky and got the size right, it is quite likely that conditions in the enterprise changed resulting in them having to take a relook at their infrastructure.

This problem of dimensioning IT infrastructure was effectively solved by a technology called ‘virtualization’. In the mid 1960s IBM created a CP-67 Mainframe computer, which had the elements of virtualization. Much later in 1998, VMWare created the VMWare workstation that could run multiple Operating Systems (OS’es). In essence virtualization abstracts the hardware of the computer, storage and network ports through a software known as the hypervisor. Over the hypervisor, the user can run any operating system like Windows, Linux, AIX etc. These OS’es which run on top of the hypervisor are known as guest OS’es. Besides, virtualization technology, enables different virtual servers to share one physical server. This process, called server consolidation, helps to increase hardware utilization, load balancing, and optimization of the IT resources.

The ability to virtualize the computer hardware really triggered some major advancements in computing. Prior to virtualization each server would run a single OS with a single application resulting in the server being idle for close to 60% of the time. Virtualization now made it possible for enterprises to run several OS’es each with its own application on a single computer. Hence the computing resources were used more effectively and efficiently. This is shown below

Virtualization and the dotcom bust around the year 2000 effectively paved the way for a ‘Software Defined’ future. In others words there was a need to control resources programmatically aimed at more efficient utilization of the resources.

The move to the Cloud: Prior to the advent of the cloud, enterprises hosted their applications in their internal IT infrastructure with virtualization technology. With the pay-per-use, utility style computing, spearheaded by the likes of Amazon, many enterprises moved their applications to shared, multi-tenant (multiple customers) , 3rd party hosting service provider, also known as the cloud providers

With the advent of Cloud Computing the software defined era made major advances. Here is the reason why. Computing as such stands on 3 main pillars- computing, storage and networking.

As mentioned earlier in the post, one of the thorny issues in procuring & managing IT infrastructure is the problem of dimensioning or right sizing. Virtualization did solve this problem to some extent but there was a need to provide more control to the user. This is where the ‘Software Defined’ technologies emerged. This ‘Software Defined’ paradigm is based on prudence and sound engineering judgment. The whole premise of making anything ‘software defined’ is to ensure that resources allocated for any task (computing, storage or networking) are optimal. The idea is that resources should be allocated exactly as needed and released and included into a shared, common pool, when idle. Hence we have the advent of

Software Defined Compute
Software Defined Storage
Software Defined Network

Software Defined Compute (SDC): In the clouds these days it is possible to precisely control the computing elements that will make up your application. So you can choose your CPU type, CPU speed, hypervisor, OS, RAM size, disks etc. You can also provision your application to expand or contract elastically to the demands of the times rather than under-provisioning or over-provisioning, This is done through a process called auto scaling. The desired configuration can be controlled through APIs provided by the Cloud Provider.

Software Defined Storage (SDS): There are multiple storage technologies that span DAS, SATA drives, SAN and NAS storage. These different storage technologies address different needs of price, storage capacity and performance, The Software Defined Storage allows the user to control the type of storage that is needed for the application through software APIs. In storage the initial allocation to each application is rather conservative. Additional storage is assigned from a common pool of storage to the applications that needs it the most. Once the storage is no longer needed it is reclaimed.

Software Defined Network(SDN): SDN is the result of pioneering effort by Stanford University and University of California, Berkeley and is based on the Open Flow Protocol and represents a paradigm shift to the way networking elements operate. Software Defined Networks (SDN) decouples the routing and switching of the data flows and moves the control of the flow to a separate network element namely, the flow controller. The motivation for this is that the flow of data packets through the network can be controlled in a programmatic manner allowing for multiple data streams to flow over the communicating paths with each stream individually defined for speed, latency, QoS etc.

Software Defined Datacenter (SDDC): A datacenter has racks and racks of servers, storage boxes, and networking equipment. A datacenter where one is able to provision, manage and operate these equipment through APIs or through programs is a Software Defined Datacenter. Imagine being able to put together a car with the body of a BMW, the interior of a Merc, the engine of a Ferrari and the electronics of a Tesla! That is what a SDDC allows you to do!

Software Defined Computing Pattern (SDCP): Once the SDC, SDS and SDN reach a level of maturity I think the next logical step would be a move to Software Defined Computing Patterns. This is what I am implying by this. Theoretically we can reduce the different types of enterprise applications to a set of computing patterns for e.g. e-commerce, social network, email server, Web portal etc. The Software Defined Computing Pattern would allow the user to choose a computing pattern based on the enterprise application. This would result in the setting up of the appropriate computing resources, storage resources, middleware and networking elements in a cloud. . The user would them need to host their applications on this environment. Here is a good link to cloud patterns.

In this context I would like to bring to your notice that there is another parallel trend called Software Defined Architecture (SDA) coined by Gartner in 2014. The SDA Gateway is responsible for virtualizing the internal API, protocols and models used to external API, User Interface and resources. Here is a diagram of SDA

The pace of progress in the last couple of years has been really scorching. The ability to have solve most large problem through a Software Defined Computing Pattern is sure to happen.

Dissecting the Cloud – Part 1

“The Cloud brings it with it the promise of utility-style computing and the ability to pay according to usage.

Cloud Computing provides elasticity or the ability to grow and shrink based on traffic patterns.

Cloud Computing does away with CAPEX and the need to buy infrastructure upfront and replaces it with OPEX model and so on”.

All this old news and has been repeated many times. But what exactly constitutes cloud computing? What brings about the above features? What are its building blocks of the cloud that enable one to realize the above?

This post tries to look deeper into the innards of the Cloud to determine what the cloud really is.

Before we get to this I would like to dwell on an analogy to understand the Cloud better.

Let us assume, Mr. A owns a large building of about 15,000 sq feet and about 100 feet tall. Let us assume that Mr. A wants to rent this building.

Now, assume that the door of this building opens to single, large room on the inside!

Mr. X comes to rent this building. If this was the case then poor Mr. X would have to pay through his nose, presumably, for the entire building even though his requirement would have been for a small room of about 600 x 600 feet. Imagine the waste of space. Moreover this would also have resulted in an enormous waste of electricity. Imagine the lighting needed. Also an inordinate amount of water would have to be utilized if this single, large room needed to be cleaned. The cost for all of this would have to be borne by Mr. X.

This is clearly not a pleasant state of affairs for either Mr. X or for the owner Mr. A of the building.

The solution to this is easy. What Mr. A needs to do, is to partition the building into self-contained rooms (600 x 600 sq feet) with all the amenities. Each self-contained unit would need to have its own electricity and water meter.

Now Mr. A can rent rooms to different tenants on their need basis. This is a win-win situation both for Mr, A and Mr. X. The tenants only need to pay for the rooms they occupy and the electricity and water they consume.

This is exactly the principle behind cloud computing and is known as ‘virtualization’

There are 3 computing components that one must consider. CPU, Network and Storage. The below picture shows the virtualization of CPU,RAM, NIC (network card), Disk (storage)

The Cloud is essentially made up of anywhere between 100 servers to 100,000 servers. The servers are akin to the large building. Running a single OS and application(s) on the entire server is a waste of computing, storage and network resources.

Virtualization abstracts the hardware, storage and network through the use of software known as the ‘hypervisor’. On top of the hypervisor several ‘guest OSes’ can run. Applications can then run on these guest OSes.

Hence over the CPU (single, dual or multi-core) of the server, multiple guest OS’es can run each with its own set of applications

This is similar to partitioning the large CPU resource of the server into smaller units.

There are 3 main Virtualization technologies namely VMware, Citrix and MS Hyper-V

Here is a diagram showing the 3 main the virtualization technologies

To be continued …

Find me on Google+

Architecting a cloud based IP Multimedia System (IMS)

Here is an idea of mine that has been slow cooking in my head for more than 1 and a 1/2 year. Finally managed to work its way to IP.com. See link below

Architecting a cloud based IP Multimedia System (IMS)

The full article is included below

Abstract

This article describes an innovative technique of “cloudifying” the network elements of the IP Multimedia (IMS) framework in order to take advantage of keys benefits of the cloud like elasticity and the utility style pricing. This approach will provide numerous advantages to the Service Provider like better Return-on-Investment(ROI), reduction in capital expenditure and quicker deployment times, besides offering the end customer benefits like the availability of high speed and imaginative IP multimedia services

Introduction

IP Multimedia Systems (IMS) is the architectural framework proposed by 3GPP body to establish and maintain multimedia sessions using an all IP network. IMS is a grand vision that is access network agnostic, uses an all IP backbone to begin, manage and release multimedia sessions. This is done through network elements called Call Session Control Function (CSCFs), Home Subscriber Systems (HSS) and Application Servers (AS). The CSCFs use SDP over SIP protocol to communicate with other CSCFs and the Application Servers (AS’es). The CSCFs also use DIAMETER to talk to the Home Subscriber System (HSS’es).

Session Initiation Protocol (SIP) is used for signaling between the CSCFs to begin, control and release multi-media sessions and Session Description Protocol (SDP) is used to describe the type of media (voice, video or data). DIAMETER is used by the CSCFs to access the HSS. All these protocols work over IP. The use of an all IP core network for both signaling and transmitting bearer media makes the IMS a very prospective candidate for the cloud system.

This article proposes a novel technique of “cloudifying” the network elements of the IMS framework (CSCFs) in order to take advantage of the cloud technology for an all IP network. Essentially this idea proposes deploying the CSCFs (P-CSCF, I-CSCF, S-CSCF, BGCF) over a public cloud. The HSS and AS’es can be deployed over a private cloud for security reasons. The above network elements either use SIP/SDP over IP or DIAMETER over IP. Hence these network elements can be deployed as instances on the servers in the cloud with NIC cards. Note: This does not include the Media Gateway Control Function (MGCF) and the Media Gate Way (MGW) as they require SS7 interfaces. Since IP is used between the servers in the cloud the network elements can setup, maintain and release SIP calls over the servers of the cloud. Hence the IMS framework can be effectively “cloudified” by adopting a hybrid solution of public cloud for the CSCF entities and the private cloud for the HSS’es and AS’es.

This idea enables the deployment of IMS and the ability for the Operator, Equipment Manufacturer and the customer to quickly reap the benefits of the IMS vision while minimizing the risk of such a deployment.

Summary

IP Multimedia Systems (IMS) has been in the wings for some time. There have been several deployments by the major equipment manufacturers, but IMS is simply not happening. The vision of IMS is truly grandiose. IMS envisages an all-IP core with several servers known as Call Session Control Function (CSCF) participating to setup, maintain and release of multi-media call sessions. The multi-media sessions can be any combination of voice, data and video.

In the 3GPP Release 5 Architecture IMS draws an architecture of Proxy CSCF (P-CSCF), Serving CSCF(S-CSCF), Interrogating CSCF(I-CSCF), and Breakout CSCF(BGCF), Media Gateway Control Function (MGCF), Home Subscriber Server(HSS) and Application Servers (AS) acting in concert in setting up, maintaining and release media sessions. The main protocols used in IMS are SIP/SDP for managing media sessions which could be voice, data or video and DIAMETER to the HSS.

IMS is also access agnostic and is capable of handling landline or wireless calls over multiple devices from the mobile, laptop, PDA, smartphones or tablet PCs. The application possibilities of IMS are endless from video calling, live multi-player games to video chatting and mobile handoffs of calls from mobile phones to laptop. Despite the numerous possibilities IMS has not made prime time.

The technology has not turned into a money spinner for Operators. One of the reasons may be that Operators are averse to investing enormous amounts into new technology and turning their network upside down.

The IMS framework uses CSCFs which work in concert to setup, manage and release multi media sessions. This is done by using SDP over SIP for signaling and media description. Another very prevalent protocol used in IMS is DIAMETER. DIAMETER is the protocol that is used for authorizing, authenticating and accounting of subscribers which are maintained in the Home Subscriber System (HSS). All the above protocols namely SDP/SIP and DIAMETER protocols work over IP which makes the entire IMS framework an excellent candidate for deploying on the cloud.

Benefits

There are 6 key benefits that will accrue directly from the above cloud deployment for the IMS. Such a cloud deployment will

i. Obviate the need for upfront costs for the Operator

ii. The elasticity and utility style pricing of the cloud will have multiple benefits for the Service Provider and customer

iii. Provider quicker ROI for the Service Provider by utilizing a innovative business model of revenue-sharing for the Operator and the equipment manufacturer

iv. Make headway in IP Multimedia Systems

v. Enable users of the IMS to avail of high speed and imaginative new services combining voice, data, video and mobility.

vi. The Service Provider can start with a small deployment and grow as the subscriber base and traffic grows in his network

Also, a cloud deployment of the IMS solution has multiple advantages to all the parties involved namely

a) The Equipment manufacturer

b) The Service Provider

c) The customer

A cloud deployment of IMS will serve to break the inertia that Operators have for deploying new architectures in the network.

a) The Equipment manufactures for e.g. the telecommunication organizations that create the software for the CSCFs can license the applications to the Operators based on innovative business model of revenue sharing with the Operator based on usage

b) The Service Provider or the Operator does away with the Capital Expenditure (CAPEX) involved in buying CSCFs along with the hardware. The cost savings can be passed on to the consumers whose video, data or voice calls will be cheaper. Besides, the absence of CAPEX will provide better margins to the operator. A cloud based IMS will also greatly reduce the complexity of dimensioning a core network. Inaccurate dimensioning can result in either over-provisioning or under-provisioning of the network. Utilizing a cloud for deploying the CSCFs, HSS and AS can obviate the need upfront infrastructure expenses for the Operator. As mentioned above the Service Provider can pay the equipment manufactured based on the number of calls or traffic through the system

c) Lastly the customer stands to gain as the IMS vision truly allows for high speed multimedia sessions with complex interactions like multi-party video conferencing, handoffs from mobile to laptop or vice versa. Besides IMS also allows for whiteboarding and multi-player gaming sessions.

Also the elasticity of the cloud can be taken advantage of by the Operator who can start small and automatically scale as the user base grows.

Description

This article describes a method in which the Call Session Control Function (CSCFs) namely the P-CSCF, S-CSCF,I-CSCF and BGCF can be deployed on a public cloud. This is possible because there are no security risks associated with deploying the CSCFs on the public cloud. Moreover the elasticity and the pay per use of the public cloud are excellent attributes for such a cloudifying process. Similarly the HSS’es and AS’es can be deployed on a private cloud. This is required because the HSS and the AS do have security considerations as they hold important subscriber data like the IMS Public User Identity (IMPU) and the IMS Private User Identity (IMPI). However, the Media Gateway Control Function (MGCF) and Media Gateway (MGW) are not included this architecture as these 2 elements require SS7 interfaces

Using the cloud for deployment can bring in the benefits of zero upfront costs, utility style charging based on usage and the ability to grow or shrink elastically as the call traffic expands or shrinks.

This is shown diagrammatically below where all the IMS network elements are deployed on a cloud.

In Fig 1., all the network elements are shown as being part of a cloud.

Fig 1. Cloudifying the IMS architecture.

Detailed description

This idea requires that the IMS solution be “cloudified “i.e. the P-CSCF, I-CSCF, S-CSCF and the BGCF should be deployed on a public cloud. These CSCFs are used to setup, manage and release calls and the information that is used for the call does not pose any security risk. These network elements use SIP for signaling and SDP over SIP for describing the media sessions. The media sessions can be voice, video or data.

However the HSS and AS which contain the Public User Identity (IMPU) and Private User Identity (IMPI) and other important data can be deployed in a private cloud. Hence the IMS solution needs a hybrid solution that uses both the public and private cloud. Besides the proxy SIP servers, Registrars and redirect SIP servers also can be deployed on the public cloud.

The figure Fig 2. below shows how a hybrid cloud solution can be employed for deploying the IMS framework

Fig 2: Utilizing a hybrid cloud solution for deploying the IMS architecture

The call from a user typically originated from a SIP phone and will initially reach the P-CSCF. After passing through several SIP servers it will reach a I-CSCF. The I-CSCF will use DIAMETER to query the HSS for the correct S-CSCF to handle the call. Once the S-CSCF is identified the I-CSCF then signals the S-CSCF to reach a terminating a P-CSCF and finally the end user on his SIP phone. Since the call uses SDP over SIP we can imagine that the call is handled by P-CSCF, I-CSCF, S-CSCF and BGCF instances on the cloud. Each of the CSCFs will have the necessary stacks for communicating to the next CSCF. The CSCF typically use SIP/SDP over TCP or UDP and finally over IP. Moreover query from the I-CSCF or S-CSCF to the HSS will use DIAMTER over UDP/IP. Since IP is the prevalent technology between servers in the cloud communication between CSCFs is possible.

Methodology

The Call Session Control Functions (CSCFs P-CSCF, I-CSCF, S-CSCF, BGCF) typically handle the setup, maintenance and release of SIP sessions. These CSCFs use either SIP/SDP to communicate to other CSCFs, AS’es or SIP proxies or they use DIAMETER to talk to the HSS. SIP/SDP is used over either the TCP or the UDP protocol.

We can view each of the CSCF, HSS or AS as an application capable of managing SIP or DIAMETER sessions. For this these CSCFs need to maintain different protocol stacks towards other network elements. Since these CSCFs are primarily applications which communicate over IP using protocols over it, it makes eminent sense for deploying these CSCFs over the cloud.

The public cloud contains servers in which instances of applications can run in virtual machines (VMs). These instances can communicate to other instances on other servers using IP. In essence the entire IMS framework can be viewed as CSCF instances which communicate to other CSCF instances, HSS or AS over IP. Hence to setup, maintain and release SIP sessions we can view that instances of P-CSCF, I-CSCF, S-CSCF and B-CSCF executed as separate instances on the servers of a public cloud and communicated using the protocol stacks required for the next network element. The protocol stacks for the different network elements is shown below

The CSCF’s namely the P-CSCF, I-CSCF, S-CSCF & the BGCF all have protocol interfaces that use IP. The detailed protocol stacks for each of these network elements are shown below. Since they communicate over IP the servers need to support 100 Base T Network Interface Cards (NIC) and can typically use RJ-45 connector cables, Hence it is obvious that high performance servers which have 100 Base T NIC cards can be used for hosting the instances of the CSCFs (P-CSCF, I-CSCF, S-CSCF and BGCF). Similarly the private cloud can host the HSS which uses DIAMETER/TCP-SCTP/IP and AS uses SDP/SIP/UDP/IP. Hence these can be deployed on the private cloud.

Network Elements on the Public Cloud

The following network elements will be on the public cloud

a) P-CSCF b) I-CSCF c) S-CSCF d) BGCF

The interfaces of each of the above CSCFs are shown below

a) Proxy CSCF (P-CSCF) interface

As can be seen from above all the interfaces (Gm, Gq, Go and Mw) of the P-CSCF are either UDP/IP or SCTP/TCP/IP.

b) Interrogating CSCF(I- CSCF) interface

As can be seen from above all the interfaces (Cx, Mm and Mw) of the I-CSCF are either UDP/IP or SCTP/TCP/IP.

c) Serving CSCF (S-CSCF) interfaces

The interfaces of the S-CSCF (Mw, Mg, Mi, Mm, ISC and Cx) are all either UDP/IP or SCTP/TCP/IP

d) Breakout CSCF (BGCF) interface

The interfaces of the BGCF (Mi, Mj, Mk) are all UDP/IP.

Network elements on the private cloud

The following network elements will be on the private cloud

a) HSS b) AS

a) Home Subscriber Service (HSS) interface

The HSS interface (Cx) is DIAMETER/SCTP/TCP over IP.

b) Application Server (AS) Interface

The AS interface ISC is SDP/SIP/UDP over IP.

As can be seen the interfaces the different network elements have towards other elements are over either UDP/IP or TCP/IP.

Hence we can readily see that a cloud deployment of the IMS framework is feasible.

Conclusion

claimtoken-515bdc0894cdb

Thus it can be seen that a cloud based IMS deployment is feasible given the IP interface of the CSCFs, HSS and AS. Key features of the cloud like elasticity and utility style charging will be make the service attractive to the Service providers. A cloud based IMS deployment is truly a great combination for all parties involved namely the subscriber, the Operator and the equipment manufactures. A cloud based deployment will allow the Operator to start with a small customer base and grow as the service becomes popular. Besides the irresistibility of IMS’ high speed data and video applications are bound to capture the subscribers imagination while proving a lot cheaper.

Also see my post on “Envisioning a Software Defined Ip Multimedia System (SD-IMS)”

Find me on Google+

Technologies to watch: 2012 and beyond

Published in Telecom Asia – Technologies to watch:2012 and beyond

Published in Telecoms Europe – Hot technologies for 2012 and beyond

A keen observer of the technological firmament, today, will observe a grand spectacle of diverse technological events. Some technological trends will blaze a trail and will become trend setters while others will vanish without a trace. The factors that make certain technologies to endure in comparison to others could be many, ranging from pure necessity to a coolness factor, from innovativeness to a cost factor. This article looks at some of the technologies that are certain to be trail blazers in the years to come

Software Defined Networks (SDNs): Software Defined Networks (SDNs) are based on the path breaking paradigm of separating the control of a network flow from the actual flow of data. SDN is the result of pioneering effort by Stanford University and University of California, Berkeley and is based on the Open Flow Protocol and represents a paradigm shift to the way networking elements operate. Software Defined Networks (SDN) decouples the routing and switching of the data flows and moves the control of the flow to a separate network element namely, the Flow controller. The motivation for this is that the flow of data packets through the network can be controlled in a programmatic manner. The OpenFlow Protocol has 3 components to it. The Flow Controller that controls the flows, the OpenFlow switch and the Flow Table and a secure connection between the Flow Controller and the OpenFlow switch. Software Define Networks (SDNs) also include the ability to virtualize the network resources. Virtualized network resources are known as a “network slice”. A slice can span several network elements including the network backbone, routers and hosts. The ability to control multiple traffic flows programmatically provides enormous flexibility and power in the hands of users. SDNs are bound to be the networks elements of the future.

Smart Grids: The energy industry is delicately poised for a complete transformation with the evolution of the smart grid concept. There is now an imminent need for an increased efficiency in power generation, transmission and distribution coupled with a reduction of energy losses. In this context many leading players in the energy industry are coming up with a connected end-to-end digital grid to smartly manage energy transmission and distribution. The digital grid will have smart meters, sensors and other devices distributed throughout the grid capable of sensing, collecting, analyzing and distributing the data to devices that can take action on them. The huge volume of collected data will be sent to intelligent device which will use the wireless 3G networks to transmit the data. Appropriate action like alternate routing and optimal energy distribution would then happen. Smart Grids are a certainty given that this technology addresses the dire need of efficient energy management. Smart Grids besides managing energy efficiently also save costs by preventing inefficiency and energy losses.

The NoSQL Paradigm: In large web applications where performance and scalability are key concerns a non –relational database like NoSQL is a better choice to the more traditional relational databases. There are several examples of such databases – the more reputed are Google’s BigTable, HBase, Amazon’s Dynamo, CouchDB & MongoDB. These databases partition the data horizontally and distribute it among many regular commodity servers. Accesses to the data are based on get(key) or set(key, value) type of APIs. Accesses to the data are based on a consistent hashing scheme for example the Distributed Hash Table (DHT) method. The ability to distribute data and the queries to one of several servers provides the key benefit of scalability. Clearly having a single database handling an enormous amount of transactions will result in performance degradation as the number of transaction increases. Applications that have to frequently access and manage petabytes of data will clearly have to move to the NoSQL paradigm of databases.

Near Field Communications (NFC): Near Field Communications (NFC) is a technology whose time has come. Mobile phones enabled with NFC technology can be used for a variety of purposes. One such purpose is integrating credit card functionality into mobile phones using NFC. Already the major players in mobile are integrating NFC into their newer versions of mobile phones including Apple’s iPhone, Google’s Android, and Nokia. We will never again have to carry in our wallets with a stack of credit cards. Our mobile phone will double up as a Visa, MasterCard, etc. NFC also allows retail stores to send promotional coupons to subscribers who are in the vicinity of the shopping mall. Posters or trailers of movies running in a theatre can be sent as multi-media clips when travelling near a movie hall. NFC also allows retail stores to send promotional coupons to subscribers who are in the vicinity of the shopping mall besides allowing exchanging contact lists with friends when they are close proximity.

The Other Suspects: Besides the above we have other usual suspects

Long Term Evolution (LTE): LTE enables is latest wireless technology that enables wireless access speeds of up to 56 Mbps. With the burgeoning interest in tablets, smartphones with the countless apps LTE will be used heavily as we move along. For a vision of where telecom is headed, do read my post ‘The Future of Telecom“.

Cloud Computing: Cloud Computing is the other technology that is bound to gain momentum in the years ahead. Besides obviating the need for upfront capital expenditure the cloud enables quick and easy deployment of applications. Moreover the elasticity of the cloud will make it irresistible to large enterprises and corporations.

The above is a list of technologies to watch as create new paths and blaze new trails. All these technologies are bound to transform the world as we know it and make our lives easier, better and more comfortable. These are the technologies that we need to focus on as we move bravely into our future. Do read my post for the year 2011 “Technology Trends – 2011 and beyond”

Find me on Google+

Designing a Scalable Architecture for the Cloud

The promise of the cloud is the unlimited computing power and storage capacities coupled with the pay-per-use policy. This makes the cloud particularly irresistible for hosting web applications and applications whose demand vary periodically. In order to take full advantage of the cloud the application must be designed for optimum performance. Though the cloud provides resources on-demand a badly designed application can hog resources and prove to be extremely expensive in the long run.

One of the first requirements for deploying applications on the cloud is that it should be scalable. Scalability denotes the ability to handle increasing traffic simply by adding more computing resources of the same kind rather than adding resources with greater horse power. This is also referred to scaling horizontally.

Assuming that the application has been sufficiently profiled and tuned for high performance there are certain key considerations that need to be taken into account while deploying on the cloud – public or private. Some of them are being able to scale on demand, providing for high availability, resiliency and having sufficient safeguards against failures.

Given these requirements a scalable design for the Cloud can be viewed as being made up of the following 5 tiers of layers

The DNS tier – In this tier the user domain is hosted on a DNS service like Ultra DNS or Route 53. These DNS services distribute the DNS lookups geographically. This results in connecting to a DNS Server that is geographically closer to the user thus speeding the DNS lookup times. Moreover since the DNS lookups are distributed geographically it also builds geographic resiliency as far as DNS lookups are concerned

Load Balancer-Auto Scaling Tier – This tier is responsible for balancing the incoming traffic among compute instances in the cloud. The load balancing may be made on a simple round-robin technique or may be based on the actual CPU utilization of the individual instances. Typically at this layer we should also have an auto-scaling policy which will add more instances if the traffic to the application increases above a threshold or terminate instances when the traffic falls below a specific threshold.

Compute-Instance Tier – This layer hosts the actual application in individual compute instances on the cloud. It is assumed that the application has been tuned for maximum performance. The choice of small, medium or large CPU should be based on the traffic handling capacity of the instance type versus the cost/hr of the instance.

Cache Tier – This is an important layer in the cloud application where there are multiple instances. The cache tier provides a distributed cache for all the instances. With a distributed caching system like memcached it is possible to share global data between instances. The memcached application uses a consistent-hashing technique to distribute data among a set of participating servers. The consistent hashing method allow for handling of server crashes and new servers joining into the cache layer.

Database Tier – The Database tier is one of the most critical layers of the application. At a minimum the database should be configured in an active-standby mode. Ideally it is always better to have the active and standby in different availability zones to better handle disasters in a particular zone. Another consideration is have separate read replicas that handle reads to database while the primary database handles the write operations

Besides the above considerations it is always good to host the web application in different availability zone thus safeguarding against disasters in a particular region.

Working with Amazon’s EBS, ELB and Route 53

Here are some key learning’s to get going on Amazon’s Elastic Block Storage (EBS), Elastic Load Balancer (ELB) and Route 53 which Amazon’s DNS service

Amazon’s EBS: Amazon’s Elastic Block Storage provided persistent storage for your applications. It is extremely useful when migrating from a small/medium instance to a large/extra large instance. The EBS is akin to a hard disk. The steps that are needed to migrate are

– Create an EBS volume from your snapshot of your small/medium instance

– Launch a large instance

– Attach your EBS volume to your large instance (for e.g. /dev/sda2)

– Open a ssh window to your large instance

– Create a test directory (/home/ec2-user/test)

– Mount your volume (mount /dev/sda2 /home/ec2-user/test)

– Copy all your files and directories to their appropriate location

– Unmount the mounted volume (umount /dev/sda2)

– Now you have all the files from your medium instance

– Detach the volume

Amazon’s ELB: The key thing about the Amazon’s ELB is the fact that the ELB created (my-load-balancer-nnnn-abc.amazon.com) actually maps to a set of IP addresses internally. Amazon suggests CNAMEing a subdomain to point to the ELB for better performance. Also an important thing to understand about Amazon’s ELB is that it performs significantly better if user requests come from different IPs rather from a single machine. So a performance tool that simulates users from multiple IPs will give a better throughput. The alternative is run the performance tool from multiple machines

Amazon’s Route 53: Route 53 is Amazon’s DNS service. Route 53 distributes your domains to multiple geographical zones enabling quicker DNS lookup. To use Route 53 you need to

– create a hosted zone for your domain (for e.g http://www.mydomain.com) in Route 53

– migrate all your A, MX, CNAME resource records from your current registered domain to Route 53.

Since Route 53 is distributed it will speed name lookups. Currently updates to Route 53 are through dnscurl.pl a Perl script. However there are good GUI tools that make the job very simple.

This should get you started on the EBS, ELB and Route 53. Do also take a look at my post “Managing multi-region deployments“.

Find me on Google+