[TOC]
Introduction
what is SRE?
The essence of SRE:
- availability
- latency
- performance
- efficiency
- change management
- monitoring
- emergency response
- capacity planning
The difference between 100% and 99.9% availability:
Reaching 100% takes far more effort, yet to users 99.9% and 100% feel much the same: there are many intermediaries between the service and the user (Wi-Fi, the local network environment, etc.), so even a 100% available service may only feel like 99.9% from the user's side.
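As a rough back-of-the-envelope illustration (not from the book text), the downtime each target allows per year:

```python
# Yearly downtime permitted by a few availability targets (illustrative arithmetic).
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (0.999, 0.9999, 0.99999):
    allowed_minutes = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} availability allows ~{allowed_minutes:,.0f} minutes of downtime per year")
```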
Core methodology
Keep a long-term focus on engineering work
- Goal: cap operational work at 50% of time; when it exceeds that, SREs build automation to bring it back under 50%
- Methods:
- Move work back to the development team
- Assign bugs and tickets to the development team
Pursue maximum change velocity without violating the SLO
SLO: service level objective
Defining availability
- What level of availability will users be satisfied with?
- If users are dissatisfied, what alternatives do they have?
- How does user behavior differ at different availability levels?
Monitoring
Sensible monitoring output:
- alerts: a human must respond and act immediately
- tickets: essentially warnings; no immediate action required, can be handled later
- logging: information nobody needs to look at right now, recorded for later inspection
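A toy sketch of this three-way split (my own illustration; the predicates are placeholders, not a real monitoring API):

```python
from enum import Enum

class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # needs a human, but can wait
    LOG = "log"        # recorded only, for later inspection

def route(needs_human: bool, needs_action_now: bool) -> Output:
    """Classify a monitoring signal into alert / ticket / log."""
    if not needs_human:
        return Output.LOG
    return Output.ALERT if needs_action_now else Output.TICKET

print(route(needs_human=True, needs_action_now=False))  # Output.TICKET
```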
Emergency Response
Metrics:
MTTF: mean time to failure
MTTR: mean time to restoration (repair)
Method: prepare incident playbooks in advance
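These two metrics also connect back to availability; a minimal sketch of the standard relationship (example numbers are made up):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Failing on average once every 30 days and taking 1 hour to restore:
print(f"{availability(mttf_hours=30 * 24, mttr_hours=1):.4%}")  # ~99.86%
```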
Change Management
- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back changes safely when problems arise
Demand Forecasting and Capacity Planning
What capacity planning must take into account:
- An accurate organic growth forecast
- Forecasts for inorganic sources of demand
- Regular load testing to correlate raw capacity with service capacity
Provisioning
Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive.
Efficiency and Performance
SRE must pay attention to efficiency and performance, which is closely tied to provisioning
Google Environments
terminology
- Machine: A piece of hardware (or perhaps a VM)
- Server: A piece of software that implements a service
- Racks: Tens of machines are placed in a rack.
- Row: Racks stand in a row
- Cluster: One or more rows form a cluster
- Datacenter: A datacenter building houses multiple clusters
- Campus: Multiple datacenter buildings that are located close together form a campus
Embracing Risk
Calculating availability
- Time-based: availability = uptime / (uptime + downtime)
- Request-based (aggregate): availability = successful requests / total requests
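A minimal sketch of both calculations (sample numbers are made up):

```python
def time_based_availability(uptime_s: float, downtime_s: float) -> float:
    # availability = uptime / (uptime + downtime)
    return uptime_s / (uptime_s + downtime_s)

def request_based_availability(successful: int, total: int) -> float:
    # availability = successful requests / total requests
    return successful / total if total else 1.0

# A 30-day window with 12 hours of downtime, and a request-level view of the same period:
print(f"{time_based_availability(uptime_s=29.5 * 86400, downtime_s=0.5 * 86400):.3%}")  # ~98.333%
print(f"{request_based_availability(successful=2_499_000, total=2_500_000):.3%}")       # 99.960%
```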
Risk Tolerance of Consumer Services
- Target level of availability
- Types of failures
- Cost
- Other service metrics
Risk Tolerance of Infrastructure Services
- Target level of availability
- Types of failures
- Cost
Forming Your Error Budget
- Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter
- The actual uptime is measured by a neutral third party: our monitoring system.
- The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
- As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.
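A minimal sketch of that arithmetic in its request-based form (the SLO value and counts are illustrative):

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent this quarter.

    slo is the target success ratio, e.g. 0.999 for 99.9%.
    """
    budget = (1 - slo) * total_requests  # failures the SLO permits
    return (budget - failed_requests) / budget if budget else 0.0

# A 99.9% SLO over 10M requests permits 10,000 failures; 2,500 have failed so far:
print(f"{error_budget_remaining(0.999, 10_000_000, 2_500):.0%} of the budget remains")  # 75%
```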
Service Level Objective
service level indicator in practice
- Collecting Indicators (users care about):
- User-facing serving systems: availability, latency, and throughput
- Storage systems: latency, availability, and durability
- Big data systems: data processing pipelines, throughput, end-to-end latency
- All systems: correctness
- Others: error rate
- Aggregation
- Using percentiles for indicators
- Standardize Indicators
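A minimal sketch of the percentile aggregation mentioned above (the latency samples are made up; a nearest-rank percentile is used for simplicity):

```python
import math

latencies_ms = [12, 14, 15, 15, 16, 18, 22, 30, 45, 480]  # one slow outlier

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.1f} ms  p50={percentile(latencies_ms, 50)} ms  "
      f"p90={percentile(latencies_ms, 90)} ms  p99={percentile(latencies_ms, 99)} ms")
```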
service level objective in practice
- example:
- lower bound ≤ SLI ≤ upper bound.
- SLI ≤ target
- Defining Objectives:
- For maximum clarity, SLOs should specify how they’re measured and the conditions under which they’re valid.
- e.g.:
- 99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers).
- 90% of Get RPC calls will complete in less than 1 ms
- 99% of Get RPC calls will complete in less than 10 ms
- Choosing Targets:
- Don’t pick a target based on current performance
- Keep it simple
- Avoid absolutes
- Have as few SLOs as possible
- Perfection can wait: It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable.
- Control Measures (see the sketch after this list):
- Monitor and measure the system’s SLIs
- Compare the SLIs to the SLOs, and decide whether or not action is needed
- If action is needed, figure out what needs to happen in order to meet the target
- Take that action
- SLOs Set Expectations:
- Keep a safety margin
- Don’t overachieve
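A minimal sketch of the Control Measures loop above: measure the SLI, compare it to the SLO, and decide whether action is needed (the target and request counts are illustrative):

```python
SLO_TARGET = 0.999  # 99.9% of requests should succeed

def check_slo(successful: int, total: int) -> str:
    sli = successful / total if total else 1.0  # 1. monitor and measure the SLI
    if sli >= SLO_TARGET:                       # 2. compare the SLI to the SLO
        return f"SLI {sli:.4%} meets the SLO; no action needed"
    return f"SLI {sli:.4%} is below the SLO; investigate and act"  # 3./4. decide and act

print(check_slo(successful=999_200, total=1_000_000))
```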
service level agreements in practice
- an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
- SRE’s role is to help them understand the likelihood and difficulty of meeting the SLOs contained in the SLA.
- It is wise to be conservative in what you advertise to users, as the broader the constituency, the harder it is to change or delete SLAs that prove to be unwise or difficult to work with.
Eliminating Toil
Defining Toil
- manual
- repetitive
- automatable
- tactical
- no enduring value
- O(n) with service growth
calculating toil
- An individual's on-call time divided by the length of one full rotation. With four SREs each on call for one week, the operational share per person is 1/4 = 25%.
What Qualifies as Engineering
- Software engineering
- Systems engineering: production configuration and tuning; one-off work that removes repeated manual effort, e.g. initial setup and parameter optimization.
- Toil: Work directly tied to running a service that is repetitive, manual, etc.
- Overhead: Administrative work not tied directly to running a service. Examples include hiring, HR paperwork, team/company meetings, bug queue hygiene, snippets, peer reviews and self-assessments, and training courses.
why toil is bad
- Career stagnation
- Low morale
- Creates confusion
- Slows progress
- Sets bad precedent
- Promotes attrition
- Causes breach of faith
monitoring
Definitions
- Monitoring
- White-box monitoring
- Black-box monitoring
- Dashboard
- Alert
- Root cause
- Node and machine
- Push
Four Golden Signals
- Latency
- Traffic
- Errors
- Saturation
Worrying About Your Tail
- use histograms instead of mean (average) metrics
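A small illustration of why (samples and bucket boundaries are made up): a single average cannot distinguish a uniformly slow service from a mostly fast one with a slow tail, whereas a histogram of buckets keeps the shape of the distribution.

```python
from collections import Counter

BUCKET_UPPER_BOUNDS_MS = [10, 25, 50, 100, 250, 500, 1000]

def bucket(latency_ms: float) -> str:
    for upper in BUCKET_UPPER_BOUNDS_MS:
        if latency_ms <= upper:
            return f"<={upper}ms"
    return f">{BUCKET_UPPER_BOUNDS_MS[-1]}ms"

samples = [12, 14, 15, 16, 18, 22, 30, 45, 480, 950]
print(dict(Counter(bucket(s) for s in samples)))  # the two slow requests stand out
print(sum(samples) / len(samples))                # mean = 160.2 ms, which says nothing about the tail
```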
Choosing an Appropriate Resolution for Measurements
- Collect measurements
- Choose a granularity / sampling interval
- Aggregate
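A toy sketch of that pipeline: collect fine-grained samples, then aggregate them into a coarser resolution for alerting (the metric and numbers are made up):

```python
import random

# Collect: five minutes of per-second CPU utilisation samples.
per_second = [random.uniform(0.2, 0.9) for _ in range(300)]

# Aggregate: roll the 1 s samples up into 1-minute averages for alerting.
WINDOW = 60
per_minute = [sum(per_second[i:i + WINDOW]) / WINDOW
              for i in range(0, len(per_second), WINDOW)]

print([round(m, 2) for m in per_minute])  # 5 coarse points instead of 300 fine ones
```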
Principles
- Alerts on different latency thresholds, at different percentiles, on all kinds of different metrics
- Extra code to detect and expose possible causes
- Associated dashboards for each of these possible causes
- The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.
- Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal.
- Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
- Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
- Every page should be actionable.
- Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page.
- Pages should be about a novel problem or an event that hasn’t been seen before.
Interim measures
- Adjust some alert thresholds
- Bridge the gap with a temporary workaround
Effective Troubleshooting
The best preparation: know how the system was designed and how it was built (not necessarily in great detail; the model below then guides the troubleshooting process).
model
- Problem Report
- Triage
- Examine
- Diagnose
- Test/Treat (loop back to Examine/Diagnose as needed)
- Cure
Problem Report
Should contain the following information:
- expected behavior
- actual behavior
- optional: how to reproduce the behavior
Supporting tools:
- An alerting platform that shows the context associated with each alert, ideally enough to pinpoint the cause and fix it from that information alone.
Triage
- Assess severity: grade the incident calmly
- Stopping the bleeding takes priority over finding the root cause
Examine
- Monitoring system: watch the relevant metrics
- logging:
- Log levels
- Sampling
- A log query platform that supports some query language
Diagnose
- Simplify and reduce
- Black-box testing
- Positive tests
- Negative tests
- Divide and conquer (see the bisection sketch after this list)
- Split the system in half: e.g. by partition or by region
- Split by layer, bisecting through the stack
- Ask "what", "where" and "why": work backwards from symptom to cause
- Keep a record of recent events:
- Configuration changes
- Code releases
- System configuration changes
- Node/fleet changes
- Other
- Purpose-built systems: troubleshooting tools designed for specific services
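A toy sketch of the divide-and-conquer idea, in the spirit of `git bisect`: bisect an ordered list of recent changes to find the first one that breaks the service. `is_healthy_with` is a hypothetical check, e.g. probing a canary built at that change.

```python
def first_bad_change(changes, is_healthy_with):
    """Return the first change for which the service is unhealthy.

    Assumes the changes are ordered and that once a change breaks the
    service, every later change is also unhealthy.
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy_with(changes[mid]):
            lo = mid + 1  # still healthy here; the culprit is later
        else:
            hi = mid      # unhealthy here; the culprit is here or earlier
    return changes[lo]

# Toy usage: change "c4" introduced the fault.
changes = ["c1", "c2", "c3", "c4", "c5", "c6"]
print(first_bad_change(changes, lambda c: c < "c4"))  # -> "c4"
```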
Test And Treat
- List the probable causes
- Design tests for them
- Start with the easiest test to run
- Ideally, tests should be mutually exclusive
- Test results can be misleading
- Tests can affect one another, e.g. an earlier test drives load up for a later one
- Some tests are hard or risky to run; avoid them where possible
- Summary:
- Be clear about what you are testing, which tests to run, and what the results mean
- For complex or numerous tests, write the steps and results down as you go so they don't have to be repeated
Negative Results Are Magic
- Negative results should not be ignored
- Negative results matter
- The tools and methods used in these experiments will be useful in future work
- Publishing negative results helps the whole industry
Cure
- Confirm the cause
- Write up the incident (postmortem)
- Fix it
Make Troubleshooting Easier
Key principles:
- Observability: expose useful metrics and logs; design for this from the start
- Well-designed, easy-to-understand interfaces between components
- A good end-to-end tracing system, so requests can be followed upstream and downstream
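A minimal sketch of what designed-in observability can look like: structured log lines that carry a trace id so events can be correlated across components (standard library only; field names are illustrative, not a specific tracing standard).

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("frontend")

def handle_request(user_id: str) -> None:
    trace_id = uuid.uuid4().hex  # in practice, propagated from the caller
    log.info(json.dumps({"event": "request_received", "trace_id": trace_id,
                         "user_id": user_id}))
    # ... call downstream services, forwarding trace_id ...
    log.info(json.dumps({"event": "request_finished", "trace_id": trace_id,
                         "status": "ok", "latency_ms": 42}))

handle_request("user-123")
```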