On Designing and Deploying Internet-Scale Services
Three simple tenets
1. Expect failures
2. Keep things simple
3. Automate everything
1. Overall Application Design
1. Design for failure
2. Redundancy and fault recovery
3. Commodity hardware slice
4. Single-version software
5. Multi-tenancy
6. Quick service health check
7. Develop in the full environment
8. Zero trust of underlying components
9. Do not build the same functionality in multiple components
10. One pod or cluster should not affect another pod or cluster
11. Allow rare emergency human intervention
12. Keep things simple and robust
13. Enforce admission control at all levels
14. Partition the service
15. Understand the network design
16. Analyze throughput and latency
17. Treat operations utilities as part of the service
18. Understand access patterns
19. Version everything
20. Keep the unit/functional tests from the last release
21. Avoid single points of failure
2. Automatic Management and Provisioning
1. Be restartable and redundant
2. Support geo-distribution
3. Automatic provisioning and installation
4. Configuration and code as a unit
5. Manage server roles or personalities rather than servers
6. Multi-system failures are common
7. Recover at the service level
8. Never rely on local storage for non-recoverable information
9. Keep deployment simple
10. Fail services regularly
3. Dependency Management
1. Expect latency
2. Isolate failures
3. Use shipping and proven components
4. Implement inter-service monitoring and alerting
5. Dependent services require the same design point
6. Decouple components
4. Release Cycle and Testing
1. Ship often
2. Use production data to find problems
3. Invest in engineering
4. Support version roll-back
5. Maintain forward and backward compatibility
6. Single-server deployment
7. Stress test for load
8. Perform capacity and performance testing prior to new releases
9. Build and deploy shallowly and iteratively
10. Test with real data
11. Run system-level acceptance tests
12. Test and develop in full environments
5. Hardware Selection and Standardization
1. Use only standard SKUs
2. Purchase full racks
3. Write to a hardware abstraction
4. Abstract the network and naming
6. Operations and Capacity Planning
1. Make the development team responsible
2. Soft delete only
3. Track resource allocation
4. Make one change at a time
5. Make everything configurable
7. Auditing, Monitoring and Alerting
1. Instrument everything
2. Data is the most valuable asset
3. Have a customer view of service
4. Instrumentation required for production testing
5. Latencies are the toughest problem
6. Have sufficient production data
7. Configurable logging
8. Expose health information for monitoring
9. Make all reported errors actionable
10. Enable quick diagnosis of production problems
8. Graceful Degradation and Admission Control
1. Support a "big red switch"
2. Control admission
3. Meter admission
9. Customer and Press Communication Plan
10. Customer Self-Provisioning and Self-Help