设计和部署internet级可扩展服务-On Designing and Deploying Internet-Scale Services

 

James第一条经验“Design for failure”是所有互联网架构成功的一个关键。互联网系统的工程理论其实非常简单,James paper中内容几乎称不上理论,而是多条实践经验分享,每个公司对这些经验的理解及执行力决定了架构成败。

 

http://www.mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf

 

Three simple tenets
1. Expect failures
2. Keep things simple
3. Automate everything

1. Overall Application Design
1. Design for failure
2. Redundancy and fault recovery
3. Commodity hardware slice
4. Single-version software
5. Multi-tenancy
6. Quick service health check
7. Develop in the full environment
8. Zero trust of underlying components
9. Do not build the same functionality in multiple components
10. One pod or cluster should not affect another pod or cluster
11. Allow rare emergency human intervention
12. Keep things simple and robust
13. Enforce admission control at all levels
14. Partition the service
15. Understand the network design
16. Analyze throughput and latency
17. Treat operations utilityies as part of the service
18. Understand access patterns
19. Version everything
20. Keep the unit/functional tests from the last release.
21. Avoid single points of failure

2. Automatic Management and Provisioning
1. Be restartable and redundant
2. Support geo-distribution
3. Automatic provisioning and installation
4. Configuration and code as a unit
5. Manage server roles or personalities rather than servers
6. Multi-system failures are common
7. Recover at the service level
8. Never rely on local storage for non-recoverable information
9. Keep deployment simple
10. Fail services regularly

3. Dependency Management
1. Expect letency
2. Isolate failures
3. Use shipping and proven components
4. Implement inter-service monitoring and alerting
5. Dependent services require the same desgin point
6. Decouple components

4. Release Cycle and Testing
1. Ship often
2. Use production data to find problems
3. Invest in engineering
4. Support version roll-back
5. Maintain forward and backward compatibility
6. Single-server deployment
7. Stress test for load
8. Perform capacity and performance testing prior to new releases
9. Build and deploy shallowly and iteratively
10. Test with real data
11. Run system-level acceptance tests
12. Test and develop in full environments

5. Hardware Selection and Standardization
1. Use only stantard SKUs
2. Purchase full racks
3. Write to a hardware abstraction
4. Abstract the network and naming

6. Operations and Capacity Planning
1. Make the development team responsible
2. Soft delete only
3. Track resource allocation
4. Make one change at a time
5. Make Everything Configurable

7. Auditing, Monitoring and Alerting
1. Instrument everything
2. Data is the most valuable asset
3. Have a customer view of service
4. Instrmentation required for production testing
5. Latencies are the toughest problem
6. Have sufficient production data
7. Configurable logging
8. Expose health information for monitoring
9. Make all reported errors actionable
10. Enable quick diagnosis of production problems

8. Graceful Degradation and Admission Control
1. Support a "big red switch"
2. Control admission
3. Meter admission

9. Customer and Press Communication Plan

10. Customer Self-Provisioning and Self-Help

你可能感兴趣的:(设计和部署internet级可扩展服务-On Designing and Deploying Internet-Scale Services)