Brief reading notes on *Systems Performance: Enterprise and the Cloud, 2nd Edition* (2020)
Performance analysis perspective | Who uses it |
---|---|
Workload analysis | chiefly the concern of application developers |
Resource analysis | chiefly the concern of system administrators |

Why performance is hard | Notes |
---|---|
Subjectivity | Technology is generally objective, but performance is often subjective. Whether "The average disk I/O response time is 1 ms" is "good" or "bad" depends on who is asking and on their expectations. |
Complexity | Complexity of the system as a whole; complexity of interactions between modules (each module may be fine in isolation, yet their interaction causes a performance problem); bottleneck analysis; complexity of the workload itself. |
Multiple causes | Performance problems are often hard to pin on a single root cause; for example, several events that look harmless individually can combine to cause a problem. |
Multiple coexisting issues | Even a mature system has many known but unfixed performance issues. The hard part of performance analysis is not finding *a* performance issue, but identifying the issues most relevant to *your* problem, quantifying (quantify) their importance, and estimating the improvement from fixing each one. Latency is especially well suited for quantifying performance. |
What is latency: the time spent waiting for an operation to complete.
Latency is a measure of time spent waiting, and is an essential performance metric
For example, if a database query has a latency of 100 ms, of which 80 ms is spent blocked on disk I/O, then eliminating the disk I/O cost (e.g., by caching) could reduce the latency to 20 ms, a 5x improvement. The gain is directly quantifiable.
The same calculation does not necessarily work for other metrics.
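The arithmetic behind that 5x figure, written out with the example's own numbers:

$$\text{speedup} = \frac{T_{\text{before}}}{T_{\text{after}}} = \frac{100\ \text{ms}}{(100 - 80)\ \text{ms}} = 5$$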
The system is observable, and observability tools generally fall into a few categories:
Observability approach | Description |
---|---|
Counters | Applications and the kernel typically provide data on their state and activity: operation counts, byte counts, latency measurements, resource utilization, and error rates. They are usually implemented as integer variables called counters, hard-coded in the software; some are cumulative and only ever increment. |
Profiling | In systems performance, "profiling" usually refers to tools that perform sampling: taking a subset of measurements to paint a coarse picture of the target. |
Tracing | Tracing is event-driven data collection, e.g., strace for system calls, tcpdump for network packets, and general-purpose software/hardware event tracers such as ftrace, BCC, and bpftrace. |
Benchmark tools are usually not counted as observability tools, because they apply a workload to obtain their measurements, which changes the original state of the system.
# What are counters?
# The book gives a fairly authoritative definition:
Applications and the kernel typically provide data on their state and activity: operation counts, byte counts, latency measurements, resource utilization, and error rates. They are typically implemented as integer variables called counters that are hard-coded in the software, some of which are cumulative and always increment.
# Counters: integer variables hard-coded into applications or the kernel, used for operation counts, byte counts, latency measurements, resource utilization, error counts, and so on; for example, the various statistics files under /proc.
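A minimal sketch of reading one such cumulative counter (my example, assuming a Linux /proc/stat with its usual "ctxt" line giving total context switches since boot):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/stat", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    unsigned long long ctxt = 0;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "ctxt ", 5) == 0) {   /* cumulative counter: always increments */
            sscanf(line + 5, "%llu", &ctxt);
            break;
        }
    }
    fclose(f);

    printf("context switches since boot: %llu\n", ctxt);
    return 0;
}
```

Sampling such a counter twice and dividing by the interval turns the cumulative count into a rate, which is how tools like vmstat present it.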
# What is profiling?
In systems performance, the term profiling usually refers to the use of tools that perform sampling: taking a subset (a sample) of measurements to paint a coarse picture of the target.
# An analysis method based on tool-driven sampling of the target, using the sampled data to draw a coarse but intuitive picture of where time is spent.
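A toy illustration of what "sampling" means (my sketch, not a real profiler): a profiling timer fires at a fixed rate and each firing is one sample; real profilers such as perf record the instruction pointer and stack at each sample instead of merely counting.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile sig_atomic_t samples = 0;

static void on_prof(int sig) { (void)sig; samples++; }   /* one sample per SIGPROF */

int main(void) {
    struct sigaction sa = {0};
    sa.sa_handler = on_prof;
    sigaction(SIGPROF, &sa, NULL);

    /* Fire SIGPROF every 10 ms of CPU time consumed by this process (~100 Hz). */
    struct itimerval it = { {0, 10000}, {0, 10000} };
    setitimer(ITIMER_PROF, &it, NULL);

    volatile double x = 0;
    for (long i = 0; i < 200000000L; i++)   /* CPU-bound work being "profiled" */
        x += i * 0.5;

    printf("collected %d samples\n", (int)samples);
    return 0;
}
```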
Tracing is generally event-driven data collection, e.g., strace for system call tracing, tcpdump for network packet capture, and general-purpose software/hardware event tracers such as ftrace, BCC, and bpftrace. These are all built on event sources: static instrumentation points, dynamic instrumentation, and the programmable BPF runtime.
Static instrumentation in the Linux kernel is called tracepoints: instrumentation points hard-coded into the source.
Dynamic instrumentation works like a dynamic debugger: instrumentation points are inserted into the live instructions in memory while the software is running; DTrace is worth studying in this area.
BPF originally stood for Berkeley Packet Filter, but it has outgrown that meaning; it is no longer treated as an acronym and now names the in-kernel execution/tracing technology itself. eBPF was initially used to describe this extended BPF; however, the technology is now referred to as just BPF.
macro-benchmark tools: benchmark the system as a whole with a realistic overall workload;
micro-benchmark tools: benchmark one narrow component or operation, e.g., networking or CPU.
Key terminology of systems performance
Term | Definition |
---|---|
IOPS | A measure of data transfer: input/output operations per second. For disks, IOPS counts reads and writes per second. |
throughput | The rate at which work is performed. For data transfer, throughput is the data rate (bytes/s or bits/s); in some contexts, such as databases, throughput means the operation rate (operations per second or transactions per second). |
response time | The time for an operation to complete, including time spent waiting, time spent being serviced, and time to return the result. |
latency | The time an operation spends waiting to be serviced; in some contexts it is used to mean the same thing as response time. |
utilization | How busy a resource is. |
saturation | The degree to which a resource has queued work that it cannot yet service. |
bottleneck | The resource that limits the performance of the system (its performance ceiling). |
workload | The input to the system, or the load applied: the work to be done. |
cache | A fast storage area that holds a copy of, or buffers, a limited amount of data, used to improve performance. |
SUT: system under test
Perturbations can affect the results of a test on the SUT.
Typical perturbations: scheduled system activity, other users of the system, other workloads.
Queueing system
Many components and scenarios can be modeled as a queueing system, for example disk I/O.
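As a hedged illustration of why the queueing model matters (standard queueing theory, not a worked example from these notes): for the simple M/M/1 model the mean response time is

$$R = \frac{S}{1 - U}$$

where $S$ is the mean service time and $U$ the utilization, so at 80% utilization the response time is already 5x the service time, and it grows without bound as $U \to 1$.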
Latency plays a key role in performance studies.
Latency and throughput are the most critical metrics in some performance studies.
What is latency: the time an operation spends waiting to be serviced.
The operations of different system components span wildly different time scales; for example, a single register access takes about 0.3 ns, which scaled up to everyday terms corresponds to 1 s.
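A quick back-of-the-envelope version of that scaling (my own arithmetic, not a table copied from the book):

$$\frac{1\ \text{s}}{0.3\ \text{ns}} \approx 3.3 \times 10^{9}, \qquad 1\ \text{ms of disk I/O} \times 3.3 \times 10^{9} \approx 3.3 \times 10^{6}\ \text{s} \approx 39\ \text{days}$$

so on the scale where a register access takes one second, a 1 ms disk I/O lasts more than a month.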
A common performance tuning trade-off is between CPU and memory: memory can be used to cache results, reducing CPU usage.
Common OS terminology | Definition |
---|---|
operating system | The software and files installed on a system so that it can boot and run programs, including the kernel, administration tools, and system libraries. |
kernel | The program that manages the system: hardware devices, memory, and CPU scheduling; it runs in a privileged CPU mode. |
process | An OS abstraction for executing a program. |
thread | An executable context that can be scheduled onto a CPU. The kernel has multiple threads, and a process contains one or more threads. |
task | A Linux runnable entity: a process (with a single thread), a thread of a multithreaded process, or a kernel thread. |
BPF program | A kernel-mode program running in the BPF execution environment. |
main memory | The physical memory of the system (e.g., RAM). |
virtual memory | An abstraction of main memory that supports multitasking. |
kernel space | The virtual memory address space used by the kernel. |
user space | The virtual memory address space used by user processes. |
user land | User-level programs and libraries (/usr/bin, /usr/lib…). |
context switch | A switch from running one process or thread to another. This is a basic function of the CPU scheduler and involves saving and restoring CPU registers. |
mode switch | A switch between user mode and kernel mode. |
system call | syscall: the protocol for user programs to request services from the kernel. |
processor | A physical chip containing one or more CPUs. |
trap | A signal sent to the kernel requesting execution of a (privileged) kernel routine; traps can be triggered by system calls, exceptions, and interrupts. |
hardware interrupt | A signal sent by a physical device to the kernel, usually to request servicing of I/O; an interrupt is a type of trap. |
In the book's kernel diagram, the system libraries are drawn as a broken ring rather than a complete one, because Linux allows applications to invoke system calls directly to enter the kernel, without going through the system libraries.
Newer Linux kernels let BPF programs call a kernel API known as BPF helpers; this allows selected kernel functions to be used from BPF programs while providing safety and high performance.
Switching from user mode to kernel mode is called a mode switch.
While in the kernel, a process that blocks (e.g., waiting on disk or network I/O) also causes a context switch, so that another process can be scheduled to run.
Mode switches and context switches cost extra CPU cycles; how can this overhead be avoided?
Ways to avoid mode/context switch overhead | Details |
---|---|
user-mode syscalls | Part of some system calls can be implemented in user mode. The Linux kernel does this by mapping a virtual dynamic shared object (vDSO) into the process address space, used by calls such as gettimeofday(2) and getcpu(2); see the sketch after this table. |
memory mappings | Memory mapping, used for demand paging (avoiding explicit read/write system calls). |
kernel bypass | Bypassing the kernel entirely, e.g., DPDK (the Data Plane Development Kit) for networking. |
kernel-mode applications | Running application logic in kernel mode, e.g., the TUX web server, or eBPF programs. |
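A small sketch of the user-mode syscall point above (my example, not the book's): on typical Linux/glibc systems, gettimeofday() and clock_gettime(CLOCK_MONOTONIC, ...) are served from the vDSO, so a tight loop of such calls performs no mode switches, unlike a loop of real syscalls issued via syscall(2).

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);      /* usually vDSO-accelerated */
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void) {
    const int N = 1000000;
    struct timeval tv;

    long long t0 = now_ns();
    for (int i = 0; i < N; i++)
        gettimeofday(&tv, NULL);              /* typically no trap into the kernel */
    long long t1 = now_ns();

    for (int i = 0; i < N; i++)
        syscall(SYS_getpid);                  /* always a real mode switch */
    long long t2 = now_ns();

    printf("gettimeofday: %lld ns/call\n", (t1 - t0) / N);
    printf("getpid via syscall(2): %lld ns/call\n", (t2 - t1) / N);
    return 0;
}
```

Running this under strace -c would show the getpid calls but, on vDSO-enabled systems, essentially none for gettimeofday.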
User mode and kernel mode each have their own execution context, including a stack and registers. Some processor architectures use separate address spaces for kernel and user mode, in which case a mode switch also requires switching the virtual memory context.
The purpose of system calls such as write() and open() is usually self-evident. The purpose of the following is less obvious:
System call | Purpose |
---|---|
ioctl | Sets I/O properties, and serves as a catch-all for miscellaneous device requests. |
mmap | Commonly used to map executables and libraries into the process address space, and for memory-mapped files; see the sketch after this table. |
brk | Extends the heap pointer, which defines the size of the process's heap memory. |
futex | Fast user-space mutex, used to implement user-level locking and synchronization primitives. |
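A hedged sketch of the mmap(2) row (my example; the file path is an arbitrary choice): map a file read-only and access it as ordinary memory instead of issuing read(2) calls; the pages are faulted in on demand.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);      /* arbitrary example file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    fwrite(p, 1, st.st_size, stdout);              /* reads memory, not the fd */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```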
Synchronous and asynchronous interrupts
Examples of asynchronous interrupts: hardware events that arrive independently of the currently executing instructions, such as a disk device signaling I/O completion or a network interface signaling packet arrival.
Synchronous interrupts:
Synchronous interrupts are generated by software instructions. They come in three kinds:
Term | Details |
---|---|
traps | A deliberate call into the kernel, such as an interrupt instruction, e.g., the int 0x80 software interrupt on x86 Linux. |
exceptions | An exceptional condition, e.g., a divide-by-zero. |
fault | A term usually reserved for memory events, e.g., a page fault. |
Clock interrupts: Linux counts clock interrupts since boot in jiffies. A key component of the original Unix kernel is the clock() routine, run from the timer interrupt. The interrupt rate is expressed as a frequency (CONFIG_HZ in Linux, e.g., 1000 Hz, i.e., 1000 clock interrupts per second, one every 1 ms); each interval is called a tick, and the tick itself is one source of performance overhead.
idle thread: when a CPU has no other work, the scheduler runs the idle thread; the idle task can power down the CPU to save energy.
What is a process?
A process is an environment for executing a user-level program. It consists of a memory address space, file descriptors, thread stacks, and registers.
What is a thread?
A process contains one or more threads, which operate in the process address space and share the same file descriptors. A thread is an executable context consisting of a stack, registers, and an instruction pointer (also called a program counter).
# Threads share the process address space and file descriptors, but each has its own stack, registers, and instruction pointer (program counter).
In Linux, both processes and threads are called tasks.
Process creation via fork() and clone() uses copy-on-write (COW) to improve performance: the address space is shared until either side writes to a page, at which point that page is copied.
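A minimal sketch of the COW semantics (my example): fork(2) logically copies the address space, but pages are physically shared until written; the child's write below triggers a private page copy, so the parent's value is unchanged.

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int value = 1;

    pid_t pid = fork();
    if (pid == 0) {                 /* child */
        value = 99;                 /* write triggers a COW page copy */
        printf("child sees value=%d\n", value);
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    printf("parent still sees value=%d\n", value);   /* prints 1 */
    return 0;
}
```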
Process life cycle (figure in the book):
Process environment overview (figure in the book):
That figure is not entirely accurate either: the kernel space should be drawn smaller than the user space.
What is the purpose of the scheduler?
The basic intent is to divide CPU time among the active processes and threads, and to maintain a notion of priority so that more important work can execute sooner
# The basic goal is to divide CPU time among the active processes and threads and, based on priority, ensure that more important work runs sooner.
The scheduler only schedules processes in the ready-to-run state; runnable processes wait on per-priority queues, known as run queues.
The scheduler can dynamically adjust process priorities to improve overall system performance. Workloads fall into two broad categories: CPU-bound and I/O-bound.
File systems are attached to the tree by mounting them. A classic top-level file system layout is shown in the book's figure.
Virtual file system (VFS): an abstract kernel interface to file systems, which makes it easy to add and manage new file system types.
I/O stack:
The path from user-level software down to the storage devices is called the I/O stack. In the book's figure, the left-hand path bypasses the file system and goes straight to the block device interface; this raw path is often used by administration tools and databases.
Because disk I/O latency is high, many layers of the software stack use read caching and write buffering to hide it.
TCP/IP stack
Character device drivers and block device drivers
Multiprocessor systems are usually implemented as SMP (symmetric multiprocessing: all CPUs are treated equally). On large SMP systems, memory access is typically NUMA (non-uniform memory access), which is itself a performance challenge.
IPIs
In an SMP system the CPUs must also coordinate with each other; this is done with inter-processor interrupts (IPIs), also called SMP calls or CPU cross calls. IPIs are also used to implement preemption.
The kernel can allow a higher-priority user-level thread to preempt the kernel and run.
A kernel that supports this is called fully preemptible, although some small critical code paths still cannot be interrupted.
Linux also supports voluntary kernel preemption, where kernel code checks for pending preemption at logical stopping points; this avoids some of the complexity of full preemption.
#CONFIG_PREEMPT_VOLUNTARY ---- voluntary kernel preemption
#CONFIG_PREEMPT ---- preemption enabled for all kernel code (except critical sections)
#CONFIG_PREEMPT_NONE ---- preemption disabled
For example, CPU scheduling priority can be adjusted with nice(), and resource limits set with ulimit(1); a sketch follows below.
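A hedged sketch of those resource controls (my example; the values are arbitrary): lower the process's CPU priority with nice(2), and cap open file descriptors with setrlimit(2), the programmatic form of the shell's ulimit.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    /* Add 10 to the nice value: lower CPU scheduling priority under contention. */
    errno = 0;
    if (nice(10) == -1 && errno != 0)
        perror("nice");

    /* Equivalent of `ulimit -n 256`: allow at most 256 open file descriptors. */
    struct rlimit rl = { .rlim_cur = 256, .rlim_max = 256 };
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");

    printf("nice value is now %d\n", getpriority(PRIO_PROCESS, 0));
    return 0;
}
```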
In Linux, **control groups (cgroups)** are important for managing the performance impact of OS virtualization in cloud computing.
Performance-related milestones in Linux kernel development (kernel version in parentheses):
Feature (version) | Notes |
---|---|
CPU scheduling classes | |
I/O scheduling classes | |
TCP congestion algorithms | Support for pluggable TCP congestion control algorithms |
Overcommit | Along with the out-of-memory (OOM) killer, this is a strategy for doing more with less main memory |
Futex (2.5.7) | Fast user-space mutex; used to provide high-performing user-level synchronization primitives |
Huge pages (2.5.36) | |
OProfile (2.5.43) | A system profiler for studying CPU usage and other events, for both the kernel and applications. |
RCU (2.5.43) | |
epoll (2.5.46) | |
Modular I/O scheduling (2.6.10) | Linux provides pluggable scheduling algorithms for scheduling block device I/O |
DebugFS (2.6.11) | |
Cpusets (2.6.12) | exclusive CPU grouping for processes |
Voluntary kernel preemption (2.6.13) | |
inotify (2.6.13) | A framework for monitoring file system events |
blktrace (2.6.17) | A framework and tool for tracing block I/O events (later migrated onto tracepoints) |
splice (2.6.17) | A system call to move data quickly between file descriptors and pipes, without a trip through user-space |
Delay accounting (2.6.18) | Tracks per-task delay states |
IO accounting (2.6.20) | Measures various storage I/O statistics per process |
DynTicks (2.6.21) | Dynamic ticks allow the kernel timer interrupt (clock) to not fire during idle, saving CPU resources and power |
SLUB (2.6.22) | new and simplified version of the slab memory allocator |
CFS (2.6.23) | Completely fair scheduler |
cgroups (2.6.24) | Control groups allow resource usage to be measured and limited for groups of processes |
TCP LRO (2.6.24) | |
latencytop (2.6.25) | A tool for observing sources of latency |
Tracepoints (2.6.28) | |
perf (2.6.31) | Linux Performance Events (perf) is a set of tools for performance observability, including CPU performance counter profiling and static and dynamic tracing |
No BKL (2.6.37) | Removal of the big kernel lock (BKL) from the Linux kernel |
KVM (2.6.20) | The Kernel-based Virtual Machine (KVM) technology allows virtual operating system instances to be created, each running its own kernel |
BPF JIT (3.0) | A Just-In-Time (JIT) compiler for the Berkeley Packet Filter (BPF) to improve packet filtering performance by compiling BPF bytecode to native instructions |
CFS bandwidth control (3.2) | A CPU scheduling algorithm that supports CPU quotas and throttling |
TCP anti-bufferbloat (3.3+) | Features to combat bufferbloat. (My reading: ever-growing TCP buffers are not a good thing; forwarding devices effectively become storage devices and latency keeps increasing.) |
uprobes (3.5) | The infrastructure for dynamic tracing of user-level software, used by other tools (perf, SystemTap, etc.). |
TCP early retransmit (3.5) | |
TFO (3.6, 3.7, 3.13) | TCP Fast Open (TFO) can reduce the TCP three-way handshake to a single SYN packet with a TFO cookie, improving performance |
NUMA balancing (3.8+) | |
SO_REUSEPORT (3.9) | A socket option to allow multiple listener sockets to bind to the same port, improving multi-threaded scalability. (Interesting that this only arrived in Linux 3.9; see the socket sketch after this table.) |
SSD cache devices (3.9) | Device mapper support for an SSD device to be used as a cache for a slower rotating disk |
bcache (3.10) | An SSD cache technology for the block interface |
TCP TLP (3.10) | TCP Tail Loss Probe (TLP) is a scheme to avoid costly timer-based retransmits by sending new data or the last unacknowledged segment after a shorter probe timeout, to trigger faster recovery |
NO_HZ_FULL (3.10, 3.12) | Also known as timerless multitasking or a tickless kernel, this allows non-idle threads to run without clock ticks, avoiding workload perturbations |
Multiqueue block I/O (3.13) | This provides per-CPU I/O submission queues rather than a single request queue |
SCHED_DEADLINE (3.14) | |
TCP autocorking (3.14) | |
MCS locks and qspinlocks (3.15) | Efficient kernel locks |
Extended BPF (3.18+) | An in-kernel execution environment for running secure kernelmode programs |
Overlayfs (3.18) | A union mount file system included in Linux. It creates virtual file systems on top of others, which can also be modified without changing the first. Heavily used by containers |
DCTCP (3.18) | The Data Center TCP (DCTCP) congestion control algorithm |
DAX (4.0) | Direct Access (DAX) allows user space to read from persistent-memory storage devices directly, without buffer overheads |
Queued spinlocks (4.2) | Offering better performance under contention |
TCP lockless listener (4.4) | The TCP listener fast path became lockless (note to self: related to the TCP fast path?) |
cgroup v2 (4.5, 4.15) | |
epoll scalability (4.5) | |
KCM (4.6) | |
TCP NV (4.8) | |
XDP (4.8, 4.18) | eXpress Data Path (XDP) is a BPF-based programmable fast path for high-performance networking |
TCP BBR (4.9) | Bottleneck Bandwidth and RTT (BBR) is a TCP congestion control algorithm that provides improved latency and throughput over networks suffering packet loss and bufferbloat. A newer congestion control algorithm proposed by Google |
Hardware latency tracer (4.9) | An Ftrace tracer that can detect system latency caused by hardware and firmware, including system management interrupts (SMIs) |
perf c2c (4.10) | The cache-to-cache (c2c) perf subcommand can help identify CPU cache performance issues, including false sharing |
Intel CAT (4.10) | Support for Intel Cache Allocation Technology (CAT) allowing tasks to have dedicated CPU cache space. This can be used by containers to help with the noisy neighbor problem |
Multiqueue I/O schedulers: BFQ, Kyber (4.12) | |
Kernel TLS (4.13, 4.17) | |
MSG_ZEROCOPY (4.14) | A send(2) flag to avoid extra copies of packet bytes between an application and the network interface |
PCID (4.14) | Linux added support for process-context ID (PCID), a processor MMU feature to help avoid TLB flushes on context switches |
PSI (4.20, 5.2) | Pressure stall information (PSI) is a set of new metrics to show time spent stalled on CPU, memory, or I/O |
TCP EDT (4.20) | The TCP stack switched to Early Departure Time (EDT) |
Multi-queue I/O (5.0) | Multi-queue block I/O schedulers became the default in 5.0, and classic schedulers were removed. |
UDP GRO (5.0) | UDP Generic Receive Offload (GRO) improves performance by allowing packets to be aggregated by the driver and card and passed up stack |
io_uring (5.1) | A generic asynchronous interface for fast communication between applications and the kernel, making use of shared ring buffers. Primary uses include fast disk and network I/O |
MADV_COLD, MADV_PAGEOUT (5.4) | These madvise(2) flags are hints to the kernel that memory is needed but not anytime soon |
MultiPath TCP (5.6) | Multiple network links (e.g., 3G and WiFi) can be used to improve the performance and reliability of a single TCP connection |
Boot-time tracing (5.6) | Allows Ftrace to trace the early boot process |
Thermal pressure (5.7) | The scheduler accounts for thermal throttling to make better placement decisions. (Note to self: what exactly does this feature do? It is scheduler/performance related.) |
perf flame graphs (5.8) | Flame graph support in perf(1) |
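A hedged sketch of the SO_REUSEPORT feature referenced above (my example; port 8080 is arbitrary): with the option set, several sockets can bind and listen on the same port, and the kernel distributes incoming connections among them, which is how multi-process servers scale accept() across CPUs.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int make_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        perror("bind");
        return -1;
    }
    listen(fd, 128);
    return fd;
}

int main(void) {
    /* Two listeners on the same port; the second bind succeeds only because
     * both sockets set SO_REUSEPORT. Normally each worker process/thread
     * would create its own listener like this. */
    int a = make_listener(8080);
    int b = make_listener(8080);
    printf("listener fds: %d and %d\n", a, b);
    close(a);
    close(b);
    return 0;
}
```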
Performance-related Linux topics covered next: systemd, KPTI, and extended BPF.
Two systemd capabilities worth noting:
One performance angle is system boot time; systemd-analyze shows where boot time can be tuned:
root@ubuntu:~# systemd-analyze
Startup finished in 7.197s (kernel) + 1min 32.950s (userspace) = 1min 40.147s
graphical.target reached after 19.606s in userspace
root@ubuntu:~#
root@ubuntu:~# systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.
graphical.target @19.606s
└─multi-user.target @19.606s
└─kerneloops.service @19.567s +39ms
└─network-online.target @19.552s
└─vmware-tools.service @10.629s +8.922s
└─basic.target @10.414s
└─sockets.target @10.414s
└─snapd.socket @10.412s +1ms
└─sysinit.target @10.349s
└─snapd.apparmor.service @10.007s +341ms
└─apparmor.service @9.259s +747ms
└─local-fs.target @9.258s
└─run-user-1000-gvfs.mount @45.084s
└─run-user-1000.mount @44.854s
└─local-fs-pre.target @3.865s
└─systemd-tmpfiles-setup-dev.service @3.625s +239ms
└─systemd-sysusers.service @3.286s +338ms
└─systemd-remount-fs.service @3.174s +53ms
└─systemd-journald.socket @2.945s
└─system.slice @2.937s
└─-.slice @2.937s
# vmware-tools.service took 8.922 s, the longest in the critical chain
# The blame subcommand lists units by startup time, longest first
root@ubuntu:~# systemd-analyze blame
54.409s apt-daily.service
30.781s fwupd-refresh.service
8.922s vmware-tools.service
8.457s networkd-dispatcher.service
8.066s snapd.service
7.526s man-db.service
5.914s dev-sda5.device
5.505s udisks2.service
5.248s NetworkManager.service
5.047s accounts-daemon.service
4.869s polkit.service
3.258s dev-loop8.device
3.234s avahi-daemon.service
3.216s dev-loop9.device
3.177s dev-loop2.device
3.130s switcheroo-control.service
...
systemd-analyze plot > xxx.svg
generates a detailed SVG chart of the boot timeline.
KPTI (Kernel Page Table Isolation) evolved from the KAISER patches. Previously, the process address space was divided into a kernel portion and a user portion: the kernel portion mapped the entire physical address space, while the user portion mapped only the designated physical memory, and both shared a single page global directory (the PGD describes the whole address space of a process). The Meltdown vulnerability exploited exactly this: in the window between an illegal access to a kernel address and the CPU raising the exception, an attacker could leak kernel data via speculative memory accesses. To completely prevent user programs from reading kernel data, the kernel and user address spaces are given separate page table sets (i.e., two PGDs).
KPTI has a significant performance cost: it adds TLB flushes on context switches (and mode switches). Linux's PCID support can avoid these TLB flushes in some scenarios.
BPF programs run in kernel mode.
BPF bytecode must first pass through a verifier that checks for safety, ensuring that the BPF program will not crash or corrupt the kernel. It may also use a BPF Type Format (BTF) system for understanding data types and structures. BPF programs can output data via a perf ring buffer, an efficient way to emit per-event data, or via maps, which are suited for statistics.
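A hedged sketch of the "maps" part (my example, using the raw bpf(2) syscall rather than libbpf; requires root or CAP_BPF on a recent kernel): create the kind of BPF hash map that tracing programs update per event and user space reads for statistics.

```c
#include <linux/bpf.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(unsigned int);        /* e.g., a PID */
    attr.value_size  = sizeof(unsigned long long);  /* e.g., an event count */
    attr.max_entries = 1024;

    /* A verified BPF program would update this map from kernel context;
     * here we only show that user space can create and hold it by fd. */
    int map_fd = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    if (map_fd < 0) {
        perror("bpf(BPF_MAP_CREATE)");
        return 1;
    }
    printf("created BPF hash map, fd=%d\n", map_fd);
    close(map_fd);
    return 0;
}
```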