Posted By: Ali Hodroj on March 4, 2013
The following is Part 1 of 2 on architecting highly available cloud applications using Cloudify. The first part introduces the concepts, challenges, and solutions for achieving the highest levels of availability and disaster recovery in the cloud. The second part revisits the concept to provide a solution architecture and demo source code showing a real-world implementation for a GigaSpaces Cloudify customer.
Introduction
Failure in the cloud is inevitable. Since the end of 2012 we’ve witnessed two major incidents in the cloud computing world, the latest in a string of high-profile cloud computing outages. The outages, within Amazon Web Services as well as Microsoft Windows Azure, brought down major online services. Such outages are an alarming reminder of the true nature of cloud resources: they are still just shared infrastructure of servers and data centers behind an API. Even though cloud vendors provide a set of infrastructure features to implement high availability, one cannot escape the increase in deployment complexity when trying to use those features to build highly available systems. In this two-part blog post, I’ll outline a real-world scenario with a GigaSpaces customer that illustrates how Cloudify was leveraged to implement cloud-scale high availability without a matching increase in accidental complexity.
Approaches for Cloud High Availability
Cloud computing enables economies of scale that make high redundancy and geographically separate deployments practical. Approaching high availability in a cloud environment requires implementing patterns and practices that introduce high redundancy and strive towards a shared-nothing architecture. Such patterns can be achieved through techniques such as geographically redundant cloud deployments and multi-site data replication, which provide better fault tolerance and disaster recovery. This eliminates single points of failure that could stop the entire system from operating in case of a disaster or node failure. There are three key principles to focus on:
Geographic Separation – Maintain high fault tolerance by utilizing multiple zones, regions, and cloud providers. For instance, Amazon’s EC2 currently supports three regions in the US, each split into several availability zones that are physically segregated and served by different power sources.
Replication and Failover – Ensure data redundancy and backup by continuous replication of data across geographically separated sites. In addition, provide proactive monitoring to automate failover operations allowing applications to rapidly come back online after a failure occurs.
Monitoring, Elastic Scaling, and Cloud Bursting – Use proactive and actionable monitoring to enable: 1) elastic expansion of resources as load changes, allowing applications to automatically scale up and down, 2) recovery from failure by automatically scaling up a DR environment to handle production load, and finally 3) cloud bursting, by dynamically deploying your application onto a more powerful cloud to address a spike in demand.
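The failover and elastic scaling principles above can be sketched as two small decision functions. This is a minimal illustration only: the `Site` type, the function names, and the load thresholds are assumptions made for the example, and none of it is taken from Cloudify or any cloud provider API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    """A hypothetical deployment site reported by a monitoring system."""
    name: str
    region: str
    healthy: bool


def choose_active_site(sites: List[Site], primary: str) -> Optional[str]:
    """Pick the site that should serve traffic.

    Prefer the primary while it is healthy. Otherwise fail over to a
    healthy site in a different region, then to any healthy site.
    """
    by_name = {s.name: s for s in sites}
    primary_site = by_name[primary]
    if primary_site.healthy:
        return primary_site.name
    for site in sites:
        if site.healthy and site.region != primary_site.region:
            return site.name
    for site in sites:
        if site.healthy:
            return site.name
    return None  # nothing healthy: a full outage


def desired_instances(current: int, load: float,
                      high: float = 0.75, low: float = 0.25,
                      minimum: int = 1, maximum: int = 10) -> int:
    """Threshold-based scaling: add an instance above `high` load,
    remove one below `low`, and stay within [minimum, maximum]."""
    if load > high:
        return min(current + 1, maximum)
    if load < low:
        return max(current - 1, minimum)
    return current
```

A real deployment would feed these functions from live health checks and metrics, and would usually add hysteresis or cool-down periods so that a briefly flapping site does not trigger repeated failovers.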
Challenges and Complexities – Portability and Automation
There are two significant impediments to climbing the availability spectrum: lack of portability and automation immaturity. As we move beyond relatively simple web applications (deployed within an out-of-the-box PaaS or simply hosted in an IaaS environment), addressing the essential complexity of a resilient cloud architecture triggers a direct increase in the accidental complexity of that architecture. This complexity emerges mostly from the need to design and implement custom automation solutions for automated deployment across traditional data centers, zones, regions, and private/public clouds. According to recent research, the second and third highest cloud enablement investments in the enterprise are automation and orchestration.
In addition to automation immaturity leading to significant IT or DevOps investment, the lack of portable, cloud-agnostic orchestration solutions makes it difficult to retarget your cloud-deployed application to another provider, because API sets, resource semantics, and levels of data center abstraction all vary. To summarize, the issues are:
Different levels of abstraction – The premise of cloud computing is that it provides a useful abstraction at the infrastructure, platform, or application level. However, the specifics of these abstractions vary greatly as we look into the ontology of their data centers and the hierarchy of resources within. As you try to move your architecture to a different cloud, you will run into some form of impedance mismatch between the generics of your deployment model and the specifics of a certain IaaS model.
Different API – Cloud APIs vary drastically across cloud vendors. Beyond the nuts and bolts of provisioning resources, security policies, authentication keys, and the general administrative workflow also differ greatly from one cloud provider to another. This presents a challenge when trying to architect a deployment model that is adaptable to arbitrary regions and clouds.
Cloudify – Increasing Resiliency without the Complexity
Although cloud vendors provide simplified access to a large pool of resources through API calls, this does not mitigate the increasing cost of implementing high availability and disaster recovery through custom automation approaches. To address this problem, Cloudify proposes a higher level of abstraction designed to isolate an application from the underlying cloud environment and to provide a common foundation for integration with all major cloud and virtualization vendors. This is realized through cloud drivers that support both public clouds (Amazon EC2, Windows Azure, Rackspace, HP Cloud, etc.) and private clouds (OpenStack, cloud.com, VMware vCenter, Citrix XenServer, etc.).
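The general idea behind a driver layer can be illustrated with a short sketch: deployment logic is written once against a small interface, and each provider supplies its own implementation. The interface, class, and method names below are hypothetical and chosen only for this example; they are not Cloudify's actual cloud driver API, and the driver shown is an in-memory stand-in rather than a real provider integration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class CloudDriver(ABC):
    """Hypothetical provider-neutral interface for machine provisioning."""

    @abstractmethod
    def start_machine(self, template: str) -> str:
        """Provision a machine from a template and return its id."""

    @abstractmethod
    def stop_machine(self, machine_id: str) -> None:
        """Release a previously provisioned machine."""


class InMemoryDriver(CloudDriver):
    """Stand-in driver that only records what it was asked to do.

    A real driver would translate these calls into one provider's API.
    """

    def __init__(self, provider: str) -> None:
        self.provider = provider
        self.machines: Dict[str, str] = {}
        self._counter = 0

    def start_machine(self, template: str) -> str:
        self._counter += 1
        machine_id = f"{self.provider}-{self._counter}"
        self.machines[machine_id] = template
        return machine_id

    def stop_machine(self, machine_id: str) -> None:
        self.machines.pop(machine_id, None)


def deploy(driver: CloudDriver, templates: List[str]) -> List[str]:
    """Deployment logic that depends only on the interface,
    so it can be retargeted by passing a different driver."""
    return [driver.start_machine(t) for t in templates]
```

Calling `deploy(InMemoryDriver("cloud-a"), ["web", "db"])` and `deploy(InMemoryDriver("cloud-b"), ["web", "db"])` runs the same deployment logic against two different targets, which is the portability property the driver abstraction is meant to provide.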
With this flexibility at hand, creating a new cloud with your application stack is a matter of creating a set of recipes, written in a Groovy-based DSL, that describe the different lifecycle phases of the deployment. In addition to solving the portability challenge through cloud driver abstraction and recipes, Cloudify simplifies portability even further through cloud overrides: an easy way to parameterize your cloud deployment through a set of property files. This parameterization allows us to move from multi-region and multi-zone to multi-cloud with a simple change in configuration files, minimizing cycle time and coping with cloud platform incompatibility. Simply put, it’s an implementation of the “infrastructure as code” pattern from a high availability perspective, such that provisioning a DR cloud in a different zone, region, or provider becomes a matter of passing a configuration file to bootstrap another cloud. The details of how this can be implemented will be presented in the second part of this blog post.
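As a rough illustration of the parameterization idea (and not of Cloudify's actual override file format, which this post does not show), a base configuration can be layered with a small set of overrides so that only the differing values need to be supplied for each target. The property keys and values below are made up for the example.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple `key=value` lines, skipping blanks and `#` comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def apply_overrides(base: Dict[str, str],
                    overrides: Dict[str, str]) -> Dict[str, str]:
    """Return a new configuration where override values win over base values."""
    merged = dict(base)
    merged.update(overrides)
    return merged


# Hypothetical base settings shared by every deployment.
BASE = parse_properties("""
# production defaults
provider=cloud-a
region=region-1
machineTemplate=medium
""")

# Hypothetical overrides for a DR site: only the differing value is listed.
DR_OVERRIDES = parse_properties("region=region-2")

dr_config = apply_overrides(BASE, DR_OVERRIDES)
```

Here `dr_config` keeps the provider and machine template from the base settings and changes only the region, so standing up the DR environment requires one extra file rather than a second copy of the whole configuration.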
Conclusion
The first part of the series aimed to present the complexities, challenges, and potential approaches for achieving a resilient cloud architecture through a cloud high availability maturity model. We’ve shown how increasing the availability of a cloud application triggers a direct increase in deployment complexity. With Cloudify, we are able to climb up the model while keeping the complexity of the deployment workflow to a minimum by utilizing the core portability and lifecycle management automation features of the framework.
The second part of this series will dive into the design and implementation of a real-world GigaSpaces customer scenario, revisiting the cloud availability model and applying it to an actual architecture on Amazon EC2, along with Cloudify recipes and sample code.