Site Reliability Engineering

Original link: https://medium.com/@charleschangyangxu/site-reliability-engineering-ab8297334eab

This is my interpretation of the recommended SRE principles, based on my own experience and on ideas from https://landing.google.com/sre/books/. It does not represent Google or any other organization.

Overview:

SRE distinguishes itself from DevOps by being a subset of the latter, focused mainly on reliability.

The most common challenges for SRE are managing workload as the system scales and dealing with repetitive work.

There are three types of monitoring output: alerts (a human must act now), tickets (a human must act eventually), and logging (no one needs to read it unless they are investigating something). Monitoring is only effective if it notifies someone who can quickly take the needed action to fix the issue.
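
As a rough illustration of that three-way split, here is a minimal routing sketch. The Signal class, its fields, and the example signal names are invented for this illustration; they are not from the article or the SRE books.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    needs_human_now: bool         # e.g. user-facing error rate above threshold
    needs_human_eventually: bool  # e.g. disk will fill up in a few days

def route(signal: Signal) -> str:
    """Route a monitoring signal to one of the three output types."""
    if signal.needs_human_now:
        return "alert"   # page the on-call engineer immediately
    if signal.needs_human_eventually:
        return "ticket"  # queue for the team to handle during work hours
    return "log"         # keep for later debugging; nobody is notified

print(route(Signal("error_rate_above_slo", True, False)))   # alert
print(route(Signal("disk_80_percent_full", False, True)))   # ticket
print(route(Signal("request_trace", False, False)))         # log
```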

System availability is driven by two factors: time between failures and time to repair. The number of manual steps taken during a repair is a crucial metric, and it can vary widely within the same team. This is why an up-to-date 'playbook' is important.
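
One standard way to relate those two factors (a textbook formula, not something specific to this article) is availability = MTTF / (MTTF + MTTR): the numbers below are illustrative only.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# With a failure every 30 days (720 hours) on average, cutting mean time
# to repair from 4 hours to 1 hour moves availability from ~99.45% to ~99.86%.
print(f"{availability(720, 4):.4%}")  # 99.4475%
print(f"{availability(720, 1):.4%}")  # 99.8613%
```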

Allow releases only when there is error budget remaining for the specific service (error budget = 1 - availability target), with the budget resetting monthly or quarterly. This forces the dev team to self-police in order to get their own features out.
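
A minimal sketch of how such a release gate might be computed, assuming the budget is measured over requests in the current window. The function names and the 99.9% target are illustrative assumptions, not from the article.

```python
def error_budget_remaining(availability_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of this period's error budget still unspent.

    Error budget = 1 - availability target, measured over the current
    window (e.g. a month or a quarter).
    """
    allowed_failures = (1.0 - availability_target) * total_requests
    return 1.0 - (failed_requests / allowed_failures)

def release_allowed(availability_target: float,
                    total_requests: int,
                    failed_requests: int) -> bool:
    """Gate risky launches on having budget left over."""
    return error_budget_remaining(availability_target,
                                  total_requests,
                                  failed_requests) > 0.0

# A 99.9% target over 10M requests allows 10,000 failed requests;
# 7,500 failures so far leaves 25% of the budget, so releases may proceed.
print(release_allowed(0.999, 10_000_000, 7_500))  # True
```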

Limit work in progress to a maximum of two things. Eradicate as much unplanned work as possible; interrupts, however, are unavoidable. In practice, SREs should spend 50%-70% of their time writing code, even if that means expanding the team, pushing back releases, or including the product team in the on-call rotation and SRE meetings. Google strongly recommends enough SRE staffing to avoid fatigue, unsustainable workloads, and high turnover, since it takes time to respond, fix outages, start the postmortem, file bugs, and so on. More frequent incidents degrade the quality of response, and may indicate that something is wrong with the system's design or that monitoring sensitivity needs to be adjusted.

Taking humans out of the release process will paradoxically reduce SRE toil and increase system reliability.

Outage impact should be minimized along three dimensions: duration, scope, and frequency. Always practice and rehearse; the 'Wheel of Misfortune' game, based on past postmortems, is a fantastic way to do so.

Book Chapter Summary

Embracing Risk: 100% is not always the right reliability target. Not only is it impossible to achieve, it is typically more reliability than users can notice. Translate the target level of service availability into an error budget, which aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to effectively defuse discussions about outages with stakeholders.

Service Level Objectives: Most "SLAs" are actually SLOs, because a true SLA carries explicit consequences when it is missed. Choose service level indicators (SLIs) based on the user; for example, measure the Gmail client rather than the Gmail server. Don't overachieve, because users build on the reality of what you offer rather than on what you say you will supply; it is sometimes wise to be conservative and even to plan outages to keep user expectations realistic. An SLO is a must-have for SRE and pays off in many of the areas covered in later chapters.

Eliminating Toil: Toil is work that is manual, repetitive, automatable, and tactical, has no enduring value, and grows with service growth. An innovative, design-driven engineering approach should be used to eliminate toil little by little, day by day.

Monitoring Distributed Systems: The monitoring system should be as simple as possible and focus on latency, traffic, errors, and saturation, framed in terms of symptoms and causes. The pipeline should be easy to reason about, reserving cause-oriented heuristics as aids for debugging problems. Dashboards are better than email alerts.

Automation: The highest-leverage approach happens in the design phase. Autonomous operation is difficult to retrofit onto sufficiently large systems, but standard good practices in software engineering help considerably: decoupled subsystems, well-defined APIs, minimal side effects, and so on. Reliability is always the fundamental feature.

Release Engineering: Release engineers define best practices for using tools so that projects are released using consistent and repeatable methodologies. Examples include compiler flags, formats for build identification tags, and required steps during a build.

Simplicity: Remove people's emotional attachment to code. Balance stability with agility. Being simple is not the same as being lazy; it is about knowing what to achieve in the easiest way, which requires innovation and real engineering.

Being On-Call: This is the final, and continuing, stage of education before a newcomer can assume full SRE responsibility. Google suggests that each individual spend at least 50% of their time on engineering, no more than 25% on-call, and up to another 25% on other operational and non-project work. Even if this split is not immediately applicable in every context, the idea is solid for fostering sustainable, high-quality work. Ultimately the goal is for SRE to scale with the growing volume of on-call work; by analogy, think of how a handful of pilots keep up with advancing technology in the cockpit.

Effective Troubleshooting: Set aside instinct and past experience; instead, use a systematic approach to every problem. The initial problem report should be as detailed as possible and visible to everyone. Make the system work again before trying to find the root cause. Examine the system from both white-box and black-box perspectives; if that is not possible, the system needs to be designed with observable interfaces between components. Ask what, where, and why. Bisection can be quicker than tracing linearly through large distributed systems. Avoid wild goose chases, latching on to the causes of past problems, and hunting down spurious correlations.

Emergency Response: Emergencies can be test-induced, change-induced, or process-induced. It is important to capture what was learned as well as what went well, and to follow up with more proactive testing.

Managing Incidents: A good way to manage incidents is to have a recursive separation of responsibilities, a recognized command post, a live incident state document, and clear, live handoffs. Having the same person work extended hours on the same incident is detrimental to their health and to the quality of the recovery; instead, that person should use the next shift to write a complete postmortem. Some keywords for a good incident management cycle: prioritize, prepare, trust, introspect, consider alternatives, practice, change it around.

Postmortem Culture: Avoid blame and keep it constructive. No postmortem should be left unreviewed. Visibly reward people for doing the right thing, and ask the SRE and development teams for feedback on postmortem effectiveness. Record the timeline, action items, necessary and unnecessary manual steps taken, and what was good, bad, and lucky. Most importantly, postmortems should be well organized and studied regularly, rather than left unattended in a digital archive. It is a valuable and fun exercise for the SRE and dev teams to randomly pick an outage each week and role-play it, with a moderator supplying the distractions and responses from other teams according to the timeline. For senior SREs this is a good refresher and a chance to see different approaches from newcomers.

Tracking Outages: At Google, all alert notifications for SRE share a central replicated system that tracks whether a human has acknowledged receipt of the notification. If no acknowledgement is received after a configured interval, the system escalates to the next configured destination.

Testing for Reliability: The hierarchy of traditional tests is unit tests, integration tests, and system tests; system tests include smoke tests, performance tests, and regression tests. Tests in the production environment include configuration tests, stress tests, and canary tests. The canary test takes its name from using a live bird to detect toxic gases before humans were poisoned, and works much like a structured user acceptance test.

Software Engineering in SRE: This matters because production-specific knowledge within SRE allows engineers to design and create software with the appropriate consideration for system scalability, graceful degradation, and easy interfacing with other infrastructure and tools. Having the people with direct, day-to-day production experience write code contributes to efficiency, uptime, logging, high-signal bug reports, and fresh approaches to age-old problems. Fully fledged software development projects within SRE serve as an outlet for engineers who don't want their coding skills to get rusty, and provide balance against on-call work and systems engineering. Beyond the design of automation tools and other efforts to reduce SRE workload, software development projects further benefit the SRE organization by attracting and helping to retain engineers with a broad variety of skills. The desirability of team diversity is doubly true for SRE, where a variety of backgrounds and problem-solving approaches can help prevent blind spots in problem solving; often, generalists need to team up with in-house specialists. That said, not every project is suited to SRE. A good project provides noticeable, advocatable benefits, such as reducing toil for SRE, improving an existing piece of infrastructure, or streamlining a complex process. Poor SRE projects involve too many moving parts, operate outside the SRE world, require long iterative development, or lack a product manager. Finally, SRE teams need to avoid writing single-purpose solutions that can't be shared, or software built to low standards, which ultimately leads to duplicated effort and wasted time.

Addressing Cascading Failures: When systems are overloaded, something needs to give in order to remedy the situation. Once a service passes its breaking point, it is better to allow some user-visible errors or lower-quality results to slip through than to try to fully serve every request. Understanding where those breaking points are and how the system behaves beyond them, including through planned outages, is critical. Be careful: retries can amplify low error rates into much higher levels of traffic, leading to cascading failures (see the sketch at the end of this summary).

Distributed Consensus for Reliability: The distributed consensus problem deals with reaching agreement among a group of processes connected by an unreliable communications network. The CAP theorem holds that a distributed system cannot simultaneously have all three of the following properties: a consistent view of the data at each node, availability of the data at each node, and tolerance to network partitions. In practice, Google approaches distributed consensus in bounded time (messages will always be passed with specific timing guarantees) by ensuring that the system has sufficient healthy replicas and network connectivity to make progress reliably most of the time.

Distributed Periodic Scheduling with Cron: Using the Paxos distributed consensus algorithm allows Google to build a robust distributed cron service. Paxos operates as a sequence of proposals; if a proposal isn't accepted, it fails. Each proposal has a sequence number, which imposes a strict ordering on all of the operations in the system.

Data Processing Pipelines: A multiphase pipeline chains the output of one phase into the input of the next for big-data processing. A periodic pipeline can be thought of as cron. Used in a distributed environment, periodic pipelines suffer from high resource cost (for example, launching via SSH) and from preemptions, because some cron jobs are not idempotent and cannot safely be launched twice. The leader-follower workflow is therefore a better solution. The Task Master uses the system prevalence pattern to hold all job state in memory for fast availability while synchronously journaling mutations to persistent disk. The views are the workers, which continually and transactionally update the system state with the master according to their perspective as subcomponents of the pipeline. Although all pipeline data may be stored in the Task Master, the best performance is usually achieved when only pointers to work are stored there and the actual input and output data is kept in a common filesystem or other storage.

Data Integrity: Data backups can be loaded back into an application, while archives cannot; archives are meant to be retrieved without having to meet SLO requirements on recovery time. From the user's point of view, data integrity without expected and regular availability is effectively no data at all! SRE and software engineers should deliver a recovery system rather than a backup system, which is classically neglected, deferred, and viewed as a delegated tax. There are three ways Google faces the data integrity problem: (1) soft deletion, (2) backups and related recovery methods, and (3) early detection, that is, knowing that recovery will work. Recognizing that not just anything can go wrong, but that everything will go wrong, is a significant step toward preparation for any real emergency. A matrix of all possible combinations of disasters, with plans to address each of them, permits you to sleep soundly for at least one night; keeping your recovery plans current and exercised permits you to sleep the other 364 nights of the year; and switching from planning recovery to planning prevention permits you to sit on the beach on that well-deserved vacation.

Reliable Product Launches at Scale: Google has its own Launch Coordination Engineering (LCE) team to audit launches, act as a liaison, drive the technical aspects of a launch, gatekeep unsafe releases, and educate teams about making changes with reliability in mind.

Accelerating SREs to On-Call and Beyond: Trial by fire and menial work for newcomers should be replaced by cumulative, orderly learning paths. If the set of work encountered in a ticket queue adequately provides training for the job, then it is not an SRE position; doing is not the same as reasoning. Stellar engineers and improvisational thinkers build on fundamentals, strong reverse-engineering skills, and statistical and scientific thinking to uncloak problems. Sharing, reading, and role-playing postmortems are great ways to ramp up new SREs and mix them with senior SREs.

Dealing with Interrupts: Minimize context switches and preserve as much of your cognitive flow state as possible. Stress is often caused by treating on-call work as interrupts, but an engineer who is explicitly on-call will not experience that same work as interrupts or stress. The solution is to define specific roles that let anyone on the team take up the mantle, and to rotate them with a complete handover process. For example, there can be an on-call engineer, a ticket engineer, a release engineer, and an engineer on project work; add additional SREs if one role cannot handle the workload. Ideally, with a large enough staffing pool, Google prefers specialization over generalization within the SRE team.

Embedding an SRE to Recover from Operational Overload: If an SRE team carries too heavy a burden, it effectively becomes an Ops team, focusing on quickly addressing emergencies rather than on reducing the number of emergencies. A good way to resolve this is to have another SRE or an outside consultant shadow the team and its on-callers, suggesting good habits, automation, and simplification from a third-person view. First, the embedded engineer needs to scope the pain points. Examples of kindling include knowledge gaps, temporary fixes that were never removed, alerts that are not being diagnosed, capacity plans that are not acted on, and services the SREs don't understand. Example improvements include agreeing on SLOs, writing things down and distributing them, explaining the reasoning behind decisions, writing postmortems rather than merely reviewing them, and recording what you were thinking at each point in time during an outage so you can find where the system misled you and where the cognitive demands were too high.

Communication and Collaboration in SRE: Specialization is good for technical mastery but risks ignoring the big picture. An SRE team should be diverse and have each role's responsibilities well defined. People with a specific goal are generally more motivated and sustain their contributions better.

The Evolving SRE Engagement Model: Engagement used to be either full SRE support or approximately no SRE engagement, depending on a service's reliability and availability needs. The need for SRE support will always be greater than the bandwidth of the SRE team, especially with the trend toward microservices, so SRE has to come up with a third engagement model alongside the production readiness review. Frameworks are the structural solution: codified best practices, reusable solutions for scalability and reliability concerns, a common production platform with a common control surface (operations, monitoring, logging, configuration), and easier automation and smarter systems (SRE can readily get a single view of the relevant information for an outage, rather than hand-collecting and analyzing mostly raw data from disparate sources). This new model reduces SRE staffing needs through common technology shared across products, and it enables the creation of production platforms with a separation of concerns between SRE and the service-specific support done by the development team.

Lessons Learned from Other Industries: This chapter focuses on four principles: preparedness and disaster testing, postmortem culture, automating away operational overhead, and structured, rational decision making. In the nuclear, aviation, and medical industries, for example, lives can be at stake in the event of an outage, so each industry must balance innovation, velocity, scale, and complexity to reach the ultimate goal of reliability.
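
To make the retry-amplification warning from the cascading failures chapter concrete, here is a minimal sketch of a client that bounds its retries and uses randomized exponential backoff, a mitigation in the spirit of the SRE book's advice. The three-attempt cap, the delay values, and the flaky_send() helper are illustrative assumptions, not the book's implementation.

```python
import random
import time

def call_with_retries(send, max_attempts: int = 3, base_delay_s: float = 0.1):
    """Call send() with a bounded number of attempts and jittered
    exponential backoff, so a failing backend sees at most max_attempts
    times the normal load instead of an unbounded retry storm.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the error to the caller
            # Sleep 0..(base * 2^attempt) seconds; the randomness spreads
            # retries out so clients don't hammer the backend in lockstep.
            time.sleep(random.uniform(0, base_delay_s * 2 ** attempt))

if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_send():
        calls["n"] += 1
        raise RuntimeError("backend overloaded")

    try:
        call_with_retries(flaky_send)
    except RuntimeError:
        pass
    print(calls["n"])  # 3 attempts in total, not an unbounded retry storm
```

Without a cap, a backend that starts failing every request would see its traffic multiplied by however many times clients are willing to retry, which is exactly how a small overload turns into a cascading failure.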

Useful templates for:
incidents, postmortems, the launch coordination checklist, and production meeting minutes: Google SRE templates
