《Site.Reliability.Engineering.2016.3》Site Reliability Engineering: How Google Runs Production Systems

[TOC]

Introduction

What is SRE?

The essence of SRE:
- availability
- latency
- performance
- efficiency
- change management
- monitoring
- emergency response
- capacity planning
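The difference between 100% and 99.9% availability:

Reaching 100% takes far more effort, yet users can barely tell 99.9% from 100%. Many intermediaries sit between the service and the user (Wi-Fi, the network path, and so on), so even a 100% available service may be experienced as only 99.9% available.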

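Core tenets

Ensuring a durable focus on engineering

Goal: keep operational work under 50% of SRE time; when it exceeds that share, SREs build automation to bring it back under 50%.
Methods:
- Shift work to the development team
- Assign bugs and tickets to the development team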

Pursuing maximum change velocity without violating the SLO

SLO: service level objective

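Defining availability

What level of availability will users be happy with?
What alternatives do users have when they are dissatisfied?
How does users' usage of the product change at different availability levels?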

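Monitoring

Valid kinds of monitoring output:

alerts: a human must respond and act immediately
tickets: akin to a warning; action is needed, but it can be deferred
logging: nobody needs to look at this; it is recorded for later reference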
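
As a rough illustration of the three outputs above, the routing decision can be reduced to two questions: does a human need to act, and does it have to happen right now? This is only a sketch; the `Output` enum and the `route` function are hypothetical names, not part of any real monitoring system.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but it can wait
    LOG = "log"        # nobody needs to look; kept for later reference


def route(needs_human_action: bool, urgent: bool) -> Output:
    """Pick the monitoring output for an event (hypothetical helper)."""
    if not needs_human_action:
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET


print(route(needs_human_action=True, urgent=True).value)
print(route(needs_human_action=True, urgent=False).value)
print(route(needs_human_action=False, urgent=False).value)
```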

Emergency Response

Metrics:

MTTF: mean time to failure
MTTR: mean time to repair (time to restore service)
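
One common way to relate these two metrics to availability is the steady-state formula MTTF / (MTTF + MTTR). The notes do not state this formula, so the sketch below is an illustration under that assumption; the function name and the hour figures are made up.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Assumed relation: availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers: same failure frequency, recovery twice as fast.
baseline = steady_state_availability(mttf_hours=999, mttr_hours=1)
faster_recovery = steady_state_availability(mttf_hours=999, mttr_hours=0.5)
print(f"{baseline:.4%}")
print(f"{faster_recovery:.4%}")
```

Under this relation, shortening recovery raises availability even when failures are just as frequent, which is one way to read the emphasis on preparation.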

Approach: prepare playbooks for failure scenarios in advance

Change Management

Implementing progressive rollouts
Quickly and accurately detecting problems
Rolling back changes safely when problems arise
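
The three practices above fit together as a simple loop: deploy to a small slice, check for problems, and either continue or roll back. The sketch below assumes hypothetical `deploy`, `healthy`, and `rollback` callbacks supplied by the caller; real release tooling will look different.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Roll out stage by stage; roll back and stop at the first unhealthy stage.

    Returns True if every stage completed, False if the change was rolled back.
    """
    for percent in stages:
        deploy(percent)
        if not healthy():
            rollback()
            return False
    return True


# Toy usage with in-memory fakes standing in for real deploy tooling.
log = []
ok = progressive_rollout(
    stages=[1, 10, 50, 100],
    deploy=lambda pct: log.append(f"deploy {pct}%"),
    healthy=lambda: len(log) < 3,  # pretend the third stage looks unhealthy
    rollback=lambda: log.append("rollback"),
)
print(ok, log)
```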

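Demand Forecasting and Capacity Planning

Capacity planning needs to take into account:

An accurate forecast of organic demand
A forecast that also incorporates inorganic demand sources
Regular load testing to correlate raw capacity with service capacity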

Provisioning

Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive.

Efficiency and Performance

SREs need to pay attention to efficiency and performance, which tie in closely with provisioning.

Google Environment

Terminology

Machine: A piece of hardware (or perhaps a VM)
Server: A piece of software that implements a service
Rack: Tens of machines are placed in a rack.
Row: Racks stand in a row
Cluster: One or more rows form a cluster
Datacenter: A datacenter building houses multiple clusters
Campus: Multiple datacenter buildings that are located close together form a campus

Embracing Risk

Calculating availability

Time-based: availability = uptime / (uptime + downtime)
Request-based (for distributed services): availability = successful requests / total requests
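
Both definitions above translate directly into code. The following is a minimal sketch; the function names and the sample figures are illustrative assumptions, not measurements from any real service.

```python
def time_based_availability(uptime_s: float, downtime_s: float) -> float:
    """availability = uptime / (uptime + downtime)"""
    total = uptime_s + downtime_s
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_s / total


def request_based_availability(successful: int, total: int) -> float:
    """availability = successful requests / total requests"""
    if total <= 0:
        raise ValueError("total requests must be positive")
    return successful / total


# Hypothetical figures: 2592 s of downtime in a 30-day window,
# and 2,500 failed requests out of 2.5 million.
window_s = 30 * 24 * 3600
print(time_based_availability(window_s - 2592, 2592))
print(request_based_availability(2_497_500, 2_500_000))
```

The request-based form is usually the more practical one for a globally distributed service, since such a service is rarely entirely "up" or entirely "down".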

Risk Tolerance of Consumer Services

Target level of availability
Types of failures
Cost
Other service metrics

Risk Tolerance of Infrastructure Services

Target level of availability
Types of failures
Cost

Forming Your Error Budget

Product Management defines an SLO, which sets an expectation of how much
uptime the service should have per quarter
The actual uptime is measured by a neutral third party: our monitoring system.
The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.
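
The budget arithmetic described above can be sketched as follows. The function names are hypothetical, and the SLO and measured values are made-up examples.

```python
def error_budget_remaining(slo: float, measured: float) -> float:
    """Fraction of the period's error budget that is still unspent.

    slo and measured are availabilities in [0, 1], e.g. 0.999.
    Returns about 1.0 when nothing is spent and 0.0 or less when exhausted.
    """
    budget = 1.0 - slo        # allowed unreliability
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    spent = 1.0 - measured    # observed unreliability
    return (budget - spent) / budget


def can_release(slo: float, measured: float) -> bool:
    """New releases may go out only while some budget remains."""
    return error_budget_remaining(slo, measured) > 0


print(error_budget_remaining(0.999, 0.9995))
print(can_release(0.999, 0.9995))
print(can_release(0.999, 0.998))
```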

Service Level Objective

service level indicator in practice

Collecting Indicators (what users care about):
- User-facing serving systems: availability, latency, and throughput
- Storage systems: latency, availability, and durability
- Big data systems (e.g., data processing pipelines): throughput and end-to-end latency
- All systems: correctness
- Others: error rate
Aggregation
- Using percentiles for indicators
Standardize Indicators
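
To show why percentiles are preferred for aggregation, here is a small sketch using the nearest-rank definition of a percentile (one of several common definitions; the choice is an assumption of this example). The latency samples are invented.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented data: most requests are fast, a few are very slow.
latencies_ms = [10] * 95 + [200] * 4 + [5000]
mean_ms = sum(latencies_ms) / len(latencies_ms)

print("mean:", mean_ms)
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
print("p100:", percentile(latencies_ms, 100))
```

The mean sits far from what a typical request sees and says nothing about the slowest requests, while the percentiles describe both.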

service level objective in practice

example:
- lower bound ≤ SLI ≤ upper bound.
- SLI ≤ target
Defining Objectives:
- For maximum clarity, SLOs should specify how they’re measured and the conditions
  under which they’re valid.
- eg:
  - 99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers).
  - 90% of Get RPC calls will complete in less than 1 ms
  - 99% of Get RPC calls will complete in less than 10 ms
Choosing Targets:
- Don’t pick a target based on current performance
- Keep it simple
- Avoid absolutes
- Have as few SLOs as possible
- Perfection can wait: It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.
Control Measures:
- Monitor and measure the system’s SLIs
- Compare the SLIs to the SLOs, and decide whether or not action is needed
- If action is needed, figure out what needs to happen in order to meet the target
- Take that action
SLOs Set Expectations:
- Keep a safety margin
- Don’t overachieve
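
The "measure, compare, decide" steps under Control Measures can be sketched against the Get RPC latency objectives quoted earlier. The `LatencySLO` class, the helper names, and the sample latencies are all hypothetical; a real system would read SLIs from monitoring rather than from a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    fraction: float      # e.g. 0.99 means "99% of calls"
    threshold_ms: float  # calls must complete in less than this


def fraction_within(latencies_ms, threshold_ms):
    """SLI: share of calls that completed in less than threshold_ms."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    return sum(1 for x in latencies_ms if x < threshold_ms) / len(latencies_ms)


def violated(slos, latencies_ms):
    """Compare each SLI to its SLO and return the objectives being missed."""
    return [
        slo for slo in slos
        if fraction_within(latencies_ms, slo.threshold_ms) < slo.fraction
    ]


slos = [LatencySLO(0.90, 1), LatencySLO(0.99, 10)]
sample = [0.5] * 92 + [5] * 6 + [50] * 2  # invented measurements

for slo in violated(slos, sample):
    print(f"action needed: fewer than {slo.fraction:.0%} of calls under {slo.threshold_ms} ms")
```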

service level agreements in practice

SLA: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
SRE’s role is to help them understand the likelihood and difficulty of meeting the SLOs contained in the SLA.
It is wise to be conservative in what you advertise to users, as the broader the constituency, the harder it is to change or delete SLAs that prove to be unwise or difficult to work with.

Eliminating Toil

Toil Defined

manual
repetitive
automatable
tactical
no enduring value
O(n) with service growth

calculating toil

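Toil share = an individual's on-call time / the length of one full rotation. With four engineers who each take one week on call, operations takes up 1/4 = 25% of each person's time.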
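
The same arithmetic as a tiny sketch, assuming equal-length shifts shared evenly across the team. `on_call_share` and its parameters are hypothetical names.

```python
from fractions import Fraction


def on_call_share(team_size: int, on_call_at_once: int = 1) -> Fraction:
    """Share of each engineer's time spent on call in an even rotation."""
    if team_size <= 0 or not 0 < on_call_at_once <= team_size:
        raise ValueError("need 0 < on_call_at_once <= team_size")
    return Fraction(on_call_at_once, team_size)


share = on_call_share(team_size=4)
print(f"{float(share):.0%}")
```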

What Qualifies as Engineering

Software engineering
Systems engineering: configuring and tuning the production environment. One-off work that removes repeated effort, such as initial setup and parameter tuning.
Toil: Work directly tied to running a service that is repetitive, manual, etc.
Overhead: Administrative work not tied directly to running a service. Examples include hiring, HR paperwork, team/company meetings, bug queue hygiene, snippets, peer reviews and self-assessments, and training courses.

Why toil is bad

Career stagnation
Low morale
Creates confusion
Slows progress
Sets bad precedent
Promotes attrition
Causes breach of faith

monitoring

Definitions

Monitoring
White-box monitoring
Black-box monitoring
Dashboard
Alert
Root cause
Node and machine
Push

Four Golden Signals

Latency
Traffic
Errors
Saturation

Worrying About Your Tail

Use a histogram (the distribution of latencies) instead of a mean (average) metric.
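
A minimal sketch of the idea: count requests into latency buckets so the tail stays visible. The bucket boundaries and the sample data are assumptions chosen for illustration.

```python
from bisect import bisect_right


def latency_histogram(latencies_ms, bounds_ms=(10, 30, 100, 300, 1000)):
    """Count samples per bucket: [0, 10), [10, 30), ..., [1000, inf)."""
    counts = [0] * (len(bounds_ms) + 1)
    for value in latencies_ms:
        # bisect_right returns how many boundaries are <= value,
        # which is exactly the index of the bucket the value falls into.
        counts[bisect_right(bounds_ms, value)] += 1
    return counts


bounds = (10, 30, 100, 300, 1000)
latencies_ms = [8] * 95 + [200] * 4 + [5000]  # invented samples
counts = latency_histogram(latencies_ms, bounds)
mean_ms = sum(latencies_ms) / len(latencies_ms)

print("mean:", mean_ms)
edges = (0,) + bounds
for i, count in enumerate(counts):
    upper = bounds[i] if i < len(bounds) else "inf"
    print(f"[{edges[i]}, {upper}) ms: {count}")
```

The mean alone suggests every request is moderately slow; the histogram shows that most are fast and a handful are very slow.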

Choosing an Appropriate Resolution for Measurements

Collect
Set the granularity and sample
Aggregate

Principles

Alerts on different latency thresholds, at different percentiles, on all kinds of different metrics
Extra code to detect and expose possible causes
Associated dashboards for each of these possible causes
The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.
Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal.
Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
Every page should be actionable.
Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page.
Pages should be about a novel problem or an event that hasn’t been seen before.

Short-term fixes

Adjust some thresholds
Use a temporary workaround as a stopgap

Effective Troubleshooting

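The best approach: know how the system is designed and how it was built (not necessarily in fine detail), then troubleshoot by working through the model below.

Model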

1. Problem Report
2. Triage
3. Examine
4. Diagnose
5. Test/Treat (loops back to steps 2/3)
6. Cure

Problem Report

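A report should include the following information:

expected behavior
actual behavior
optional: how to reproduce this behavior.

Supporting tools:

An alert information platform that shows the context related to each alert, ideally enough to locate the cause and fix it from that information alone.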

Triage

Incident severity: assess it calmly
Mitigating the damage takes priority over finding the root cause

Examine

Monitoring system: watch the relevant metrics
Logging:
- Log levels
- Sampling
- A log query platform that supports a query language

Diagnose

Simplify and reduce
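- Black-box testing
  - Positive tests
  - Negative tests
- Divide and conquer
  - Split in two, e.g. by partition or by region
  - Split by layer
Ask "what", "where" and "why": work backward recursively to the cause
Event log:
- Configuration changes
- Code releases
- System configuration changes
- Node changes
- Other
Specialized tools: diagnostic systems built specifically for particular services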
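
Divide and conquer combines naturally with the event log: if the recorded changes are ordered in time and the problem persists once introduced, a bisection finds the first bad change in a logarithmic number of tests. The sketch below rests on exactly that assumption; `first_bad_change` and the change identifiers are hypothetical.

```python
def first_bad_change(changes, is_bad):
    """Return the index of the first change for which is_bad(change) is True.

    Assumes every change before the culprit tests good and every change
    from the culprit onward tests bad. Returns None if the latest change
    is still good (or the list is empty).
    """
    if not changes or not is_bad(changes[-1]):
        return None
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(changes[mid]):
            hi = mid        # culprit is at mid or earlier
        else:
            lo = mid + 1    # culprit is after mid
    return lo


# Invented event log, oldest first; pretend "cfg-3" introduced the problem.
changes = ["cfg-1", "rel-2", "cfg-3", "rel-4", "node-5"]
culprit = first_bad_change(changes, lambda c: changes.index(c) >= 2)
print(changes[culprit])
```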

Test And Treat

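List several possible causes
Design the tests
- Start with the test that is easiest to run
- The alternatives a test distinguishes between should be mutually exclusive
- Test results can be misleading
- One test can affect the next, e.g. by raising the load
- Some tests are hard to carry out; avoid them where possible
Summary:
- Be clear about what you are testing, which tests you ran, and what each result was
- When the tests are numerous or complex, take notes as you go so the steps need not be repeated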

Negative Results Are Magic

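Negative results should not be ignored
Negative results are crucial
The tools and methods used in an experiment carry over into future work
Publishing negative results helps the whole industry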

Cure

Confirm the cause
Write the incident report (postmortem)
Fix the problem

Make Troubleshooting Easier

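Key principles:

Make services observable: expose useful metrics and logs, and plan for this when the service is designed
Design component interfaces that are clean and easy to understand
A good end-to-end tracing system makes it easy to follow requests upstream and downstream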