http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
http://don.blogs.smugmug.com/2011/04/24/how-smugmug-survived-the-amazonpocalypse/
http://blog.rightscale.com/2011/04/25/amazon-ec2-outage-summary-and-lessons-learned/
http://www.readwriteweb.com/cloud/2011/04/almost-as-galling-as-the.php
http://developers.simplegeo.com/blog/2011/04/26/how-simplegeo-stayed-up/
http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html
The Five Levels of Redundancy
In a cloud computing environment, there are five possible levels of redundancy:
• Physical
• Virtual resource
• Availability zone
• Region
• Cloud
Amazon Web Services
Amazon's cloud-hosted Web Services experienced a catastrophic failure in April at one of its east coast datacenters, knocking hundreds of sites off the web, including well-known websites such as Foursquare. This lesson shows the importance of solid distributed architectural design when building cloud services.
Reportedly, several companies came through the Amazon outage unscathed. For example, Twilio's service never went down. The company did not detail how its operations in the northern Virginia availability zones were affected, but its co-founder and CTO Evan Cooke described the design principles behind its infrastructure in a blog post. These principles include decomposing resources into independent pools, and supporting connection timeouts and retries with backoff.
The service outage lasted as long as three days (over 10 hours at a minimum).
Unaffected by the AWS Outage: Twilio's Cloud Architecture Principles
As we grew and scaled Twilio on Amazon Web Services, we followed a set of architectural design principles to minimize the impact of the occasional but inevitable problems in the underlying infrastructure.
• Unit of failure is a single host
Build simple services that consist of a single host, rather than services that depend on multiple hosts; replicated service instances then protect against the failure of any individual host.
• Short timeouts and quick retries
When a failure occurs, software should detect the failure quickly and retry the request. Every service runs multiple redundant copies, so a request can time out quickly and be retried around the failed or unreachable service.
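This pattern can be sketched in a few lines of Python. The replica list, the 0.5-second timeout, and the callables standing in for remote services are illustrative assumptions, not Twilio's actual code:

```python
import random

class ServiceUnavailable(Exception):
    """Raised when a replica fails or times out."""

def call_with_retry(replicas, request, attempts=3):
    """Try redundant replicas in random order, skipping failed ones quickly."""
    last_error = None
    for replica in random.sample(replicas, min(attempts, len(replicas))):
        try:
            # A short per-call timeout keeps one slow or dead replica
            # from stalling the whole request path.
            return replica(request, timeout=0.5)
        except ServiceUnavailable as exc:
            last_error = exc  # fall through and retry the next replica
    raise last_error

# Usage: three replicas, two of which are down.
def healthy(request, timeout):
    return "ok:" + request

def broken(request, timeout):
    raise ServiceUnavailable("host unreachable")

print(call_with_retry([broken, healthy, broken], "ping"))  # prints ok:ping
```

Because the retry targets a different redundant copy rather than hammering the failed host, a single bad instance costs one short timeout instead of an outage.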
• Idempotent service interfaces (http://en.wikipedia.org/wiki/Idempotence)
Build services that allow requests to be safely retried. If you aren’t familiar with the concept, go read up on the wonderful world of idempotency.
“In computer science, the term idempotent is used more comprehensively to describe an operation that will produce the same results if executed once or multiple times.”
If the API of a dependent service is idempotent, that means it is safe to retry failed requests. (See #2 above.) For example, if a service provides the capability to add money to a user's account, an idempotent interface to that service allows failed requests to be safely retried. There's a lot to this topic; we'll make a point of covering it in much more detail in the future.
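One common way to make the account-credit example idempotent is to tag each logical operation with a client-supplied request ID, so that a retried request is applied only once. The `Account` class and the request-ID scheme below are assumptions for illustration, not any particular service's API:

```python
class Account:
    """An account whose credit operation is idempotent per request ID."""

    def __init__(self):
        self.balance = 0
        self.seen_requests = set()

    def credit(self, amount, request_id):
        # If this request was already applied, do nothing: executing
        # the operation once or many times yields the same state.
        if request_id in self.seen_requests:
            return self.balance
        self.seen_requests.add(request_id)
        self.balance += amount
        return self.balance

acct = Account()
acct.credit(10, request_id="req-1")
acct.credit(10, request_id="req-1")  # a retry of the same request is safe
print(acct.balance)  # prints 10
```

A client that times out can simply resend `req-1` without fear of double-crediting, which is exactly what makes the quick-retry principle above safe to apply.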
• Small stateless services
Decompose business logic into small stateless services that can be organized into simple, homogeneous pools.
• Relaxed consistency requirements
Where strict consistency is not required, replicate data and keep redundant copies for reads.
One of the most important conceptual separations you can make at the application level is to partition the reading and writing of data. For example, if there is a large pool of data that is written infrequently, separate the reads and writes to that data. By making this separation, you can create redundant read copies that independently service requests. For example, by writing to a database master and reading from database slaves, you can scale up the number of read slaves to improve availability and performance.
Stateless Services
One of the major design goals of the Netflix re-architecture was to move to stateless services. These services are designed such that any service instance can serve any request in a timely fashion, so if a server fails it's not a big deal. In the failure case, requests can be routed to another service instance, and we can automatically spin up a new node to replace it.
Data Stored Across Zones
In cases where it was impractical to re-architect in a stateless fashion we ensure that there are multiple redundant hot copies of the data spread across zones. In the case of a failure we retry in another zone, or switch over to the hot standby.
Graceful Degradation
Our systems are designed for failure. With that in mind we have put a lot of thought into what we do when (not if) a component fails. The general principles are:
Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.
Fallbacks: Each feature is designed to degrade or fall back to a lower quality representation. For example if we cannot generate personalized rows of movies for a user we will fall back to cached (stale) or un-personalized results.
Feature Removal: If a non-critical feature is slow, we may remove it from any given page to prevent it from impacting the member experience.
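The first two principles compose naturally: fail fast on the personalized path, then fall back to a cached, un-personalized result. The function names and the cached row list below are hypothetical stand-ins, not Netflix code:

```python
class Timeout(Exception):
    pass

# Stale but always-available fallback content (illustrative).
CACHED_ROWS = ["Popular Now", "Top Picks"]

def personalized_rows(user_id):
    # Stand-in for a recommendation call guarded by an aggressive
    # timeout (Fail Fast); here it always times out to show the path.
    raise Timeout("recommendation service too slow")

def rows_for(user_id):
    try:
        return personalized_rows(user_id)
    except Timeout:
        # Fallback: degrade to cached, un-personalized results
        # instead of letting the whole page hang.
        return CACHED_ROWS

print(rows_for("user-1"))  # prints ['Popular Now', 'Top Picks']
```

The page renders either way; the failing component only lowers the quality of one row rather than taking the experience down with it.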
"N+1" Redundancy
Our cloud architecture is designed with N+1 redundancy in mind. In other words, we allocate more capacity than we actually need at any point in time. This capacity gives us the ability to cope with large spikes in load caused by member activity or the ripple effects of transient failures, as well as the failure of up to one complete AWS zone. All zones are active, so we don't have one hot zone and one idle zone, as in simple master/slave redundancy. The term N+1 indicates that one extra is needed; the larger N is, the less overhead is needed for redundancy.

We spread our systems and services out as evenly as we can across three of the four zones. When provisioning capacity, we use AWS reserved instances in every zone, and reserve more than we actually use, so that we will have guaranteed capacity to allocate if any one zone fails. As a result, we have higher confidence that the other zones are able to grow to pick up the excess load from a zone that is not functioning properly. This does cost money (reservations are for one to three years with an advance payment), but it is money well spent, since it makes our systems more resilient to failures.
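The capacity arithmetic behind "the larger N is, the less overhead" can be made concrete: with N active zones, surviving the loss of one zone means each zone must normally hold 1/N of its capacity in reserve. The numbers below are illustrative, not Netflix's actual provisioning figures:

```python
def headroom_per_zone(zones):
    """Fraction of each zone's capacity that must stay free so the
    surviving zones can absorb the load of one failed zone."""
    # Beforehand, each zone may run at only (zones - 1) / zones of
    # capacity, leaving 1 / zones as headroom.
    return 1 - (zones - 1) / zones

for n in (2, 3, 4):
    print(n, round(headroom_per_zone(n), 3))
# With 2 active zones, 50% of each zone sits idle; with 3 zones,
# about 33%; with 4, only 25% -- N+1 overhead shrinks as N grows.
```

This is why active-active across three zones is cheaper per unit of protection than a simple hot/standby pair, even though it requires every zone to be over-provisioned.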
Load Balancing
Netflix uses Amazon's Elastic Load Balancing (ELB) service to route traffic to our front-end services. We utilize ELB for almost all our web services. There is one architectural limitation with the service: losing a large number of servers in a zone can create a service interruption.
ELBs are set up such that load is balanced across zones first, then instances, because ELB is a two-tier load balancing scheme. The first tier consists of basic DNS-based round-robin load balancing, which gets a client to an ELB endpoint in the cloud, in one of the zones that your ELB is configured to use. The second tier of the ELB service is an array of load balancer instances (provisioned directly by AWS), which do round-robin load balancing over our own instances behind them in the same zone.
For middle-tier load balancing, Netflix uses its own software load balancing service, which balances across instances evenly, independent of which zone they are in. Services using middle-tier load balancing can handle uneven zone capacity with no intervention.
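The consequence of the two schemes can be sketched with a small simulation. Because the zone-first scheme picks a zone before it picks an instance, a zone that has lost most of its servers still receives a full zone's share of traffic, concentrating it on the survivors; instance-level balancing spreads load evenly regardless of zone. The topology and round-robin model below are illustrative assumptions, not the actual ELB algorithm:

```python
import itertools
from collections import Counter

# Zone "a" has lost most of its instances; "b" and "c" are healthy.
zones = {"a": ["a1"], "b": ["b1", "b2", "b3"], "c": ["c1", "c2", "c3"]}

def zone_first(n):
    """Tier 1: round-robin across zones; tier 2: round-robin in-zone."""
    load = Counter()
    zone_cycle = itertools.cycle(zones)
    in_zone = {z: itertools.cycle(insts) for z, insts in zones.items()}
    for _ in range(n):
        load[next(in_zone[next(zone_cycle)])] += 1
    return load

def instance_even(n):
    """Middle-tier style: round-robin across all instances directly."""
    load = Counter()
    everyone = itertools.cycle([i for insts in zones.values() for i in insts])
    for _ in range(n):
        load[next(everyone)] += 1
    return load

print(zone_first(2100)["a1"])     # prints 700: a1 absorbs a third of all traffic
print(instance_even(2100)["a1"])  # prints 300: an even 1/7 share
```

Under the zone-first scheme the lone survivor in zone "a" takes more than double its fair share, which is exactly how losing many servers in one zone can turn into a service interruption.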
Lesson 1: Distributing an application across different zones does not spare you from outages.