蜗牛_Wolf

Ceph Source Code Analysis

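2.1 SDS
2.2 Disk
2.2.1 Block vs. Stream Storage
2.2.2 Magnetic Disks
2.2.3 Solid-State Drives
2.3 Block Storage Commands and Protocols
2.3.1 Physical Disk Interfaces
2.3.2 The SCSI Command Set
2.3.3 Block Storage Command Transport Protocols
2.4 RAID
2.4.1 Basic Terms
2.4.2 The Six RAID Levels
2.4.3 RAID Card Structure
2.4.4 RAID and LVM
2.4.5 Drawbacks of RAID
3 Storage Architecture
3.1 Traditional Storage Architecture
3.2 Evolution of Storage Architecture
3.3 Distributed Storage Architecture
3.3.1 General Logical Structure of a Distributed Storage System
3.3.2 Theory Behind Distributed Storage Systems
3.3.3 HDFS: Distributed File Storage Architecture
3.3.4 Swift: Distributed Object Storage Architecture
3.3.5 Ceph: Distributed Unified Storage Architecture
3.3.6 Comparison
4 Ceph Framework Analysis
4.1 Supporting Interfaces
4.1.1 bufferraw/bufferptr/bufferlist
4.1.2 Serialization (encode) and Deserialization (decode)
4.2 Logical Structure
4.2.1 Level-0 Decomposition
4.2.2 Level-1 Decomposition
4.3 Key Concepts
4.3.1 Object
4.3.2 Pool
4.3.3 PG Map
4.3.4 OSD Map
4.3.5 Monitor Map
4.3.6 CRUSH Map
4.4 Main Workflows
4.4.1 Command Dispatch and Parsing
4.4.2 RBD Client Write Path
4.4.3 PG Data Recovery
4.4.4 PG Data Scrubbing
4.5 The CRUSH Algorithm
4.5.1 CRUSH Map
4.5.2 Data Placement Rules (ruleset, replica placement)
4.5.3 CRUSH Map Changes and Data Movement
5 Installation and Building
5.1 Installation
5.2 Building from Source
6 Debugging and Tuning
7 Appendix
7.1 The C++ Language
8 References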

Introduction
1. Purpose

This document is one of a series of research notes prepared for planning a hyper-converged product.

Background

http://ceph.com/

Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability.

Why does the official site not mention that Ceph also provides a block storage service?

The following passage is quoted from the "Ceph Cookbook": "Ceph is currently the hottest Software Defined Storage (SDS) technology that is shaking up the entire storage industry. It is an open source project that provides unified software defined solutions for Block, File, and Object storage. The core idea of Ceph is to provide a distributed storage system that is massively scalable and high performing with no single point of failure. From the roots, it has been designed to be highly scalable (up to the exabyte level and beyond) while running on general-purpose commodity hardware."

Basic Concepts
1. SDS

SDS is what is needed to reduce TCO for your storage infrastructure. In addition to reduced storage cost, an SDS can offer flexibility, scalability, and reliability.

Cloud Storage

Cloud Storage is a storage system that should be fully integrated with cloud systems and can provide lower TCO without any compromise to reliability and scalability. The cloud systems are software defined and are built on top of commodity hardware; similarly, it needs a storage system that follows the same methodology, that is, being software defined on top of commodity hardware.

Unified Storage

A storage system that supports blocks, files, and object storage from a single system.

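Disk

"Disk" is best translated as "hard drive". Today the term mainly covers magnetic disks and solid-state drives (SSDs): the former store data in the two polarities of magnetic particles, the latter store data as electric charge.

Block vs. Stream Storage

In block storage, data is laid out on the medium block by block, and a storage location is addressed in units of blocks. In stream storage, data is laid out on the medium as a continuous run of bits, and a location is addressed in units of bits. The fundamental difference lies only in how a location is addressed; physically, the data is stored on the medium bit by bit in both cases.

For example, an MP3 player can only start playing a stored song from the beginning of that song, whereas a tape deck can start playing from any position.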

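Magnetic Disks

[Physical structure of a magnetic disk]

[Data storage]

A disk usually contains several platters. Each platter has two surfaces; each surface has its own head and is divided into many concentric tracks; each track is divided into equal-sized sectors.
The sector is the smallest storage unit of a disk. Inside a sector data is stored as a stream, but the disk as a whole reads and writes whole sectors, which is what makes it "block" storage.
CHS = Cylinder + Head + Sector. These three values uniquely determine the address of a sector.
LBA (Logical Block Address) numbers sectors sequentially. It is a logical address layered over the physical CHS address so that the disk controller can work with a simple linear address; LBA defines the mapping between logical and physical addresses.
Each sector is an arc, so the length of a sector is the length of that arc. What about its width? It is easy to assume that the width is the distance between two adjacent tracks, but it is actually the width of the magnetic particles, which is also the size of the head.
An N pole represents 1 and an S pole represents 0, so the magnetic particles on each track must be packed very densely to store a large amount of data.
Reads and writes proceed cylinder by cylinder. They start on cylinder 0, head 0 (surface 0); when that track has been filled or fully read, they continue on the track of surface 1 in the same cylinder 0, and so on through all surfaces. Only when the whole of cylinder 0 has been written or read does the disk move on to cylinder 1, that is, switch tracks.
Disk performance is determined mainly by the speed of track switching. Switching tracks means physically moving the heads, which is slow; switching between surfaces within a cylinder only selects a different head and is much faster.
The drive's circuit board is effectively a small computer system with an MCU, a DSP, digital circuitry and a BIOS. The control software (for example the low-level formatting routine) is kept in the BIOS, which can also hold dynamic information such as the head position.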
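
The relationship between CHS and LBA addresses can be made concrete with the conventional conversion formula. The sketch below is only an illustration: the function names and the example geometry (16 heads per cylinder, 63 sectors per track) are assumptions, not values taken from this document. It follows the usual convention that cylinders and heads count from 0 while sectors count from 1.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Map a CHS address to a linear block address.

    All tracks of one cylinder are numbered before moving to the next
    cylinder, matching the cylinder-by-cylinder order described above.
    """
    if cylinder < 0 or not 0 <= head < heads_per_cylinder:
        raise ValueError("cylinder/head out of range")
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector out of range (sectors start at 1)")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    """Inverse mapping: linear block address back to (cylinder, head, sector)."""
    cylinder, rest = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector0 = divmod(rest, sectors_per_track)
    return cylinder, head, sector0 + 1


if __name__ == "__main__":
    # Hypothetical geometry: 16 heads per cylinder, 63 sectors per track.
    print(chs_to_lba(0, 1, 1, 16, 63))   # 63: first sector on the second surface
    print(lba_to_chs(1008, 16, 63))      # (1, 0, 1): first sector of cylinder 1
```

Modern drives expose only LBA and hide the real geometry, so this mapping is mainly useful for understanding the addressing model.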

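[Disk performance]

There are two main disk performance metrics: read/write IOPS and read/write throughput.

Four main factors affect disk performance:

Rotational speed: the biggest factor in disk throughput
Seek speed: the biggest factor in random IOPS
Capacity
Interface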
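
As a rough illustration of how seek time and rotational speed bound random IOPS, the sketch below uses the common back-of-the-envelope model in which one random I/O costs an average seek plus half a revolution. The function name and the sample figures (8.5 ms seek, 7200 rpm) are assumptions for illustration only; transfer time and caching are ignored.

```python
def estimate_random_iops(avg_seek_ms, rpm):
    """Rough upper bound on random IOPS for a single spinning disk."""
    # On average the target sector is half a revolution away from the head.
    rotational_latency_ms = 0.5 * 60_000 / rpm
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)


if __name__ == "__main__":
    # Hypothetical 7200 rpm drive with an 8.5 ms average seek time.
    print(round(estimate_random_iops(8.5, 7200), 1))  # 78.9
```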

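Solid-State Drives

SSD stands for Solid State Drive (also written Solid State Disk). It can no longer be called a magnetic disk, because it does not store data in the N and S poles of magnetic particles; it stores data in whether each cell holds a charge, that is, in the cell's potential.

There are two kinds of SSD. One stores data in DRAM chips and is also called a RAM disk; when external power is removed, a battery is needed to keep the DRAM contents. The other is based on Flash media.

Advantages of solid-state storage: there is no seek overhead and the cost of accessing any address is the same, so random I/O performance is very good and barely varies with the address.

RAM: Random-Access Memory. Variants include DRAM, SRAM and SDRAM. DRAM must be refreshed continuously to keep its data. SRAM needs no refresh but is more expensive, so it is generally used for CPU caches and CMOS chips. SDRAM (Synchronous DRAM) is DRAM whose operation is synchronized to a clock.

ROM: Read-Only Memory. PROM is Programmable ROM; EPROM is Erasable Programmable ROM; EEPROM is Electrically Erasable Programmable ROM. Flash ROM is a true single-voltage chip. In use it closely resembles EEPROM, with one difference: EEPROM erases with a byte as the smallest unit, whereas Flash ROM erases with a block as the smallest unit.

Whichever of these erasable ROMs is used, data is stored in floating-gate field-effect transistors. Each transistor is one smallest unit, a cell. There are two kinds of cell: SLC (Single-Level Cell), which holds 1 bit of data, and MLC (Multi-Level Cell), which holds 2 bits.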

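[Logical structure of an SSD]

[Data storage, reads and writes]

A cell string is one vertical column in figure 3-34 above; only one cell in a column can be charged at a time. The cells on the same horizontal line form a page.

Logically, the page is the smallest I/O unit of Flash; a number of pages form a block, multiple blocks form a plane, and multiple planes form the device.

Flash read procedure:

Voltages are applied to the cells of one page, the sensed potentials are decoded into 1s and 0s, and the result is placed in the RAM buffer; that completes one read. The smallest unit of a Flash read is therefore a page.

Flash write procedure:

First all cells in a block are discharged, so that every cell reads as 1. Then the data is written: where the bit is already 1 nothing is done; where a 0 must be written, the cell is charged.

Why must Flash erase before writing, and why must it erase a whole block rather than a single page? Erasing before writing mainly avoids interference between different cells in the same page. Erasing a whole block is mainly a matter of efficiency.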

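[Chronic problems of SSDs]

Problem 1: erase-before-write adds considerable overhead and causes a large write penalty, so a fairly large cache is usually needed.

Problem 2: repeated charging and discharging degrades the insulation of the silicon dioxide layer until cells can no longer hold enough charge and the drive is declared failed; this is known as wear-out.

Common remedies for these problems:

Remedy 1: use free space whenever possible, then reclaim pages already marked as garbage in batches.

Remedy 2: clean up periodically with an external tool such as Wiper.

Remedy 3: TRIM, where the file system tells the SSD which space to reclaim after a file is deleted.

Remedy 4: I/O optimization such as delayed write, where consecutive writes to the same I/O address are merged into one.

Remedy 5: reserve part of the space for the SSD controller's own use so that the drive is never completely full.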

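Block Storage Commands and Protocols
1. Physical Disk Interfaces

Disks mainly use two command sets: ATA and SCSI.

The physical interfaces for the ATA command set are IDE and SATA: IDE is the parallel ATA interface and SATA is the serial ATA interface.

The physical interfaces for the SCSI command set are:

Parallel SCSI
Serial Attached SCSI (SAS)
IBM's proprietary serial SCSI interface (SSA)
The serial Fibre Channel interface, which carries the SCSI command set over the FC protocol (FCP)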

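The SCSI Command Set

The SCSI interface comprises a physical interface, a command set and a protocol.

SCSI stands for Small Computer System Interface. It is used not only by disks; scanners, optical drives and printers have also commonly used it.

A disk with a SCSI interface requires a SCSI controller on the host side, and that controller has its own CPU. This is an important difference from an ATA controller, and it is the reason SCSI disks are relatively expensive and mostly used in commercial systems.

The physical layer of the SCSI protocol is the SCSI physical interface described above.

The link layer of the SCSI protocol is responsible for delivering a frame to the other end of the "wire". Note that this means only the other end of the link; if the two endpoints are several hops apart, delivering the frame to the far endpoint is the job of the transport layer.

The network layer of the SCSI protocol mainly deals with assigning addresses and locating devices by address.

The SCSI bus is addressed by SCSI ID. The SCSI controller takes ID 7, which has the highest priority, and another 15 IDs are available for SCSI devices.

SCSI addressing uses the tuple "controller - channel - SCSI ID - LUN ID". A host can attach several SCSI controllers through PCI; each controller can have several channels (several SCSI buses); each channel can carry at most 15 SCSI disks (or arrays); and a disk array can further be divided logically into multiple LUNs.

Communication on the SCSI bus uses an arbitration mechanism.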

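Block Storage Command Transport Protocols

Communication protocols generally follow the OSI model.

Protocols are usually combined in one of three ways: a "use" relationship, a MAP relationship or a Tunnel relationship. In a use relationship a protocol lacks some function and relies on another protocol to provide it; TCP, for example, has none of IP's addressing capability, so TCP and IP are normally used together. A MAP relationship is protocol translation: everything except the payload, across all seven layers, is translated from one protocol into the other; iFCP, for example, translates between the FC protocol and Ethernet + TCP/IP. A Tunnel relationship is encapsulation: FCIP, for example, wraps complete FC frames inside TCP/IP packets carried over Ethernet.

A network protocol generally has four parts: an addressing layer; an interaction-logic layer, which defines how a message is handled once it has been received; an information-representation layer, which is like an envelope, that is, the address information written on the envelope; and the payload.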

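RAID

RAID is a technique that protects data against disk failure so that the data can be recovered when a disk is lost.

Basic Terms

Disk, Stripe, Segment, Block, Sector.
Stripes are numbered from top to bottom, starting at 0.
Segments are numbered from left to right, starting at 0.
Blocks within one stripe run from top to bottom and then from left to right.
Blocks are numbered globally across the RAID set.

The Six RAID Levels

RAID Card Structure

RAID and LVM

RAID and LVM both use software (a RAID card is in essence software too) to virtualize multiple "disks" into one logical disk. The logical disk produced by RAID is usually called a LUN; the logical disk produced by LVM is usually called an LV (Logical Volume).

The SCSI protocol defines three levels of unit: target ID → SCSI ID → LUN ID.

A LUN is a logical disk virtualized by the RAID card; PV (Physical Volume) is simply the name the logical volume manager gives to a LUN.
VG (Volume Group): made up of multiple PVs.
PP (Physical Partition): each PP consists of a number of contiguous sectors; a VG is divided into many PPs.
LP (Logical Partition): made up of multiple PPs, which can be combined in modes similar to RAID 0, RAID 1 and so on.
LV (Logical Volume): the smallest unit the volume management software can recognize.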

Drawbacks of RAID

RAID rebuilds are painful
RAID spare disks increase TCO
RAID requires a set of identical disk drives in a single RAID group
RAID-based systems often require expensive hardware components, such as RAID controllers, which significantly increases the system cost
After a point, you cannot grow your RAID-based system
RAID cannot ensure data reliability after a two-disk failure. This is one of the biggest drawbacks with RAID systems

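Storage Architecture
1. Traditional Storage Architecture

Traditional storage architecture from the point of view of the I/O path:

Traditional storage architectures fall broadly into three kinds: DAS, NAS and SAN.

[DAS]

Direct Attached Storage: the storage medium can serve only one host server. A host's internal disks, or a JBOD enclosure with a single SCSI interface, are both DAS.

[NAS]

Network Attached Storage: file operation commands are sent over an Ethernet network to a remote server, which executes them.

[SAN]

Storage Area Network. Originally SAN meant a network covering all storage components, on which the protocols exchanged could be FC, SCSI, NAS protocols (CIFS, NFS) and so on.

As the figure above shows, SAN and NAS differ in the following ways:

Whether the file system and the disks are together: with SAN the file system and the disks are not part of the same system (device); with NAS they are.
The kind of command sent over the network: with SAN disk operation commands travel over the network; with NAS file operation commands do.
The network protocols: SAN may use FC, SCSI, FCoE or iSCSI; NAS uses CIFS, NFS and the like.

Depending on the network medium that carries the back-end disk commands, SAN splits into two camps: FC SAN and IP SAN.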

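Evolution of Storage Architecture

[Stage 1] DAS stage 1: disks and file system both inside one server

[Stage 2] DAS stage 2: disks moved outside the server as a JBOD

[Stage 3] Early SAN: standalone disk arrays

[Stage 4] SAN stage 2: networked standalone disk arrays

[Stage 5] NAS stage 1: thin server side, standalone NAS side

[Stage 6] NAS stage 2: standalone NAS gateway

[Stage 7] SAN/NAS convergence: multi-protocol disk arrays, SAN disk arrays, NAS devices, NAS gateways

[Stage 8] Distributed storage stage 1: standalone distributed disk arrays

Previously, the scale-out capability of a disk array was limited by its "head", that is, by the processing power of the array controller. With the rise of cloud computing, storage systems are expected to scale out horizontally without limit, which changed the architecture of disk arrays: only a distributed storage architecture can deliver this.

[Stage 9] Distributed storage stage 2: software-defined storage (SDS)

Basic requirements of software-defined storage:

Storage moves closer to compute; compute and storage converge
The storage system runs on ordinary x86 servers and does not depend on dedicated hardware
Unlimited scale-out is supported

[Stage 10] Unified storage: object, block and file storage at the same time

Ceph, for example, is an open-source implementation of the unified storage idea.

Stages 8, 9 and 10 are the inevitable development of the current cloud computing era. Stages 8 and 9 coexist, while stage 10 is the ideal of the storage industry but is not yet a must-have.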

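Distributed Storage Architecture

The evolution described above shows that distributed storage comes in standalone and non-standalone forms. That is not the key distinction, however. There are currently two families of distributed storage architecture, and what separates them is whether there is a centralized, explicit metadata service (server).

General Logical Structure of a Distributed Storage System

The general logical structure of a distributed storage system is as follows:

A piece of data usually has four attributes: its representation in the computer system (how it is stored), its owner, its accessors, and its storage location.

A typical distributed storage system needs at least six functional modules:

(1) Data representation

Everything in a computer system is represented by data structures (objects in object-oriented programming are data structures too), for example the number of replicas, the owner of the data, the data itself, and where the data is stored.

(2) Data owner management

The attributes of a data owner usually include account, role and level. Suppose some data belongs to Xiao Li in the support group of the R&D department of a company: Xiao Li's account might be xiaoli, the role system administrator, and the level fourth: company - R&D department - support group.

(3) Data placement management

In a distributed storage system, data placement management usually involves:

A piece of data usually needs multiple replicas so that it can be recovered after loss;
How the original data is spread evenly over the storage nodes of the cluster;
How replicas are placed across nodes so that the original data can be restored from a replica after loss or corruption;
The logical representation of a storage location, for example logical levels such as zone, node and partition;
The mappings between original data and location, between replica and location, and between original data and its replicas.

(4) Data placement balancing

The number of storage nodes in a distributed storage system changes over time: a node fails, a new node joins. The placement of data must then be adjusted so that data is as balanced as possible in the new cluster state.

(5) Failure detection and recovery

A distributed storage system must have no single point of failure, and data must be recovered promptly after a storage node fails.

(6) Data access balancing

Data access balancing must consider:

Balancing read/write operations between the original data and its replicas;
Keeping the original data and its replicas updated in sync.

1. Theory Behind Distributed Storage Systems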

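Every distributed storage system involves three key properties: Consistency, Availability and Partition tolerance.

Consistency (C): whether all copies of the data in the distributed system have the same value at the same moment.

Availability (A): whether the cluster as a whole can still answer client read/write requests after some of its nodes have failed.

Partition tolerance (P): whether the system keeps working when the network splits the nodes into groups that cannot communicate with each other. While a partition lasts, for example while an update has not yet been synchronized between the data and its replicas and a client issues a read or write, the system must either refuse the request (preserving consistency) or tolerate inconsistent data (preserving availability). In other words, under partition the system has to choose between C and A.

[CAP theorem]

The CAP theorem states that a distributed storage system cannot satisfy all three properties at once. Since partition tolerance P has to be provided in practice, the real choice is between C and A.

For the same reason, every distributed storage system needs to isolate the locations where data is stored in order to achieve partition tolerance.

[NWR strategy]

NWR is a strategy for controlling the consistency level in a distributed storage system.

N: the number of replicas of a piece of data;

W: the number of replicas that must be updated successfully when a data object is written;

R: the number of replicas that must be read when data is read.

Different combinations of N, W and R give different consistency behaviour. When W+R>N the system guarantees strong consistency to clients; when W+R<=N strong consistency cannot be guaranteed.

Take N=3, W=2, R=2:

N=3 means every object must have three replicas. W=2 means a write returns once it has completed on 2 of the 3 replicas. R=2 means a read returns only after 2 of the three replicas have been read.

A single point of data is not acceptable in a distributed system. Running with only one live replica is very dangerous, because one more failure of that replica may lose the data permanently. If N were set to 2, the failure of a single storage node would leave a single point. So N must be greater than 2. The larger N is, the higher the maintenance and overall cost of the system; N is usually set to 3.

With W=2 and R=2, W+R>N, so clients see strong consistency.

In the system shown in the figure above, W is 2, so an update only has to complete on two replicas; it returns regardless of whether the third copy on Node3 has been written. Suppose a later read fetches the data from Node1 and Node3, and Node3 has not yet applied the earlier update. The client will see two different versions, and it only needs to pick the newest one.

The inequality shows that when W+R>N the system guarantees R>N-W, which means every read reaches at least one up-to-date copy, so returning the newest data is enough. Although the three replicas on the servers may disagree, the client always reads the latest data, and that is why this configuration is strongly consistent from the client's point of view.

<N,W,R>=<1,1,1> is the same configuration as a database running on a single node.

<N,W,R>=<2,1,1> corresponds to the Master-Slave pattern. Since 1+1 is not greater than 2, a read may return stale data, so this configuration is not consistent.

The larger W is, the worse write performance becomes; the larger R is, the worse read performance becomes. The larger N is, the more reliable the data and, of course, the higher the cost.

To guarantee consistency while balancing read and write performance, the usual configuration is W=Q, R=Q with Q=(N/2)+1, using integer division (N=3, R=2, W=2 satisfies this formula).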
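
The NWR rules can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, and each replica reply is modelled as a (version, value) pair so that a reader can pick the newest copy among the R replies it receives.

```python
def quorum_size(n):
    """Q = N/2 + 1 with integer division."""
    return n // 2 + 1


def is_strongly_consistent(n, w, r):
    """True when every read set must overlap every write set (W + R > N)."""
    return w + r > n


def read_latest(replies):
    """Pick the newest value among the replies gathered from R replicas.

    `replies` is a list of (version, value) pairs; the version number is an
    assumption made for this sketch.
    """
    version, value = max(replies, key=lambda reply: reply[0])
    return value


if __name__ == "__main__":
    print(is_strongly_consistent(3, 2, 2))          # True
    print(is_strongly_consistent(2, 1, 1))          # False
    # Node1 already has version 2, Node3 still holds version 1.
    print(read_latest([(2, "new"), (1, "old")]))    # new
```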

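[Distributed hash tables (DHT)]

1. Hash functions

A hash function is a computation that maps a value A to some value within a given range [begin, end]. For a set of values {k1, k2, … , kN}, the hash function spreads them evenly over that range, so that the mapped addresses {index1, index2, … , indexN} can be found quickly from the values. For the same input, a hash function must always produce the same result.

A hash function has to be designed carefully to work well, and even then it never behaves ideally: several values may map to the same address. This is a collision, shown by the red lines in the figure. Hash functions should be designed to keep collisions to a minimum.

The simplest hash function is a modulo operation: hash(A) = A % N, which maps A into the range [0, N-1].

2. Hash tables

The result of a hash function is always a number. If the VALUE is not a number (a person's name, say), index=hash(KEY) has to act as the intermediate link (the index) that ties KEY to VALUE. The table that stores the relationship between index and VALUE is called a hash table.

Example: books in a library are borrowed by people, so "book title" and "person's name" form a KEY/VALUE relationship. Suppose there are three records:

This is the correspondence between book title and person's name: it says who borrowed which book. The title is the KEY and the name is the VALUE. Suppose index=hash(KEY) gives the following results:

We can then store the names in a table:

The three names are stored in slots 0, 1 and 2. To find out who borrowed "How the Steel Was Tempered", we hash() the title, obtain its index, 2, and look up the name in the table.

This method is very effective when a large number of KEY/VALUE pairs has to be stored.

3. Distributed hash tables

A hash table keeps everything on one machine; if that machine breaks, everything stored in it is gone. A distributed hash table splits one hash table into several parts stored on different machines, which lowers the risk of losing all the data at once.

4. Consistent hash functions

Distributed hash tables usually use a consistent hash function that is applied uniformly to both machines and data.

A consistent hash function hash() is applied to machines (usually their IP address) and to data (usually the KEY) alike and maps both into one address space. Suppose a consistent hash function maps a value into a 32-bit address space, from 0 to 2^32 - 1. We draw this address space as a ring.

With N machines, hash() maps them to N points on the ring. The address space is then partitioned so that each machine owns one range of addresses. To add data to the system (a part of a hash table, say), we first compute the index of the data with hash(), then find which address range on the ring that index falls into, and store the data on the corresponding machine. In this way one hash table is distributed over different machines, as the figure below shows:

Here the blue dots are machines and the red dots are the addresses obtained by applying hash() to data items. In this figure, going anticlockwise, each machine owns the range from itself up to the next machine. Seen clockwise, each machine owns the stretch of address space just before it. The dashed lines show which machine each data item is stored on.

Four criteria for judging a hash function:

1) Balance: the hash results should be spread over all index ranges as far as possible, so that every index range is used.

2) Monotonicity: if some KEYs have already been assigned to index ranges and new index ranges are added, the hash must guarantee that a previously assigned KEY maps either to its original index range or to a new one, never to a different range of the old set.

3) Spread: the same KEY being hashed to different indexes. A good hash algorithm should avoid this inconsistency as far as possible, that is, keep spread low.

4) Load: load is spread seen from the other side. Different KEYs hashing to the same index increase the load on that index. Like spread, this should be avoided; a good hash algorithm keeps the load on each index low.

The discussion of distributed hash tables above used a function hash(IP, KEY). A hash function that is required to be monotonic in this sense is called a consistent hash function.

In a distributed storage system nodes are added and removed all the time, so the hash function must be a consistent one, so that data can be remapped to the new set of nodes and never becomes impossible to locate.

1) Removing a node (machine)

As the figure below shows, if NODE2 fails and is removed, then under the clockwise migration rule object3 moves to NODE3. Only the mapping of object3 changes; no other object is affected.

2) Adding a node (machine)

Suppose a new node NODE4 is added to the cluster and hashes to the position on the ring shown in the figure below:

Under the clockwise migration rule, object2 moves to NODE4 and all other objects keep their original locations.

The analysis of adding and removing nodes shows that consistent hashing preserves monotonicity and also keeps data migration to a minimum. Such an algorithm suits a distributed cluster well: it avoids large-scale data migration and reduces the pressure on the servers.

The node addition and removal above also show that a consistent hash function gives a distributed storage system monotonicity, low spread and load balancing, but it can still be unbalanced. If, in the figure above, only NODE1 and NODE3 are deployed, object1 is stored on NODE1 while object2, object3 and object4 are all stored on NODE3, which is a very unbalanced state.

To achieve balance, consistent hashing introduces virtual nodes.

A virtual node is a replica of a real node (machine) in the hash index space. One real node corresponds to several virtual nodes, and the virtual nodes are arranged in the index space by their hash value, that is, by index.

Now suppose each real node gets two virtual nodes, so that the hash ring holds 4 virtual nodes in total. The resulting object mapping is shown below:

From the figure above the mapping is: object1->NODE1-1, object2->NODE1-2, object3->NODE3-2, object4->NODE3-1. With virtual nodes the objects are distributed much more evenly.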
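
The ring with virtual nodes can be sketched in a few lines. This is an illustration only, not the code of any particular system: the class name HashRing, the use of MD5 truncated to 32 bits as the ring hash, and the "NODE-i" naming of virtual nodes are all assumptions. A key is owned by the first virtual node found clockwise from the key's position.

```python
import bisect
import hashlib


def ring_hash(key):
    """Map a string onto the 32-bit ring (assumption: truncated MD5)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions of all virtual nodes
        self._owner = {}    # ring position -> real node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            point = ring_hash(f"{node}-{i}")
            if point in self._owner:
                continue  # extremely rare position clash: keep the existing owner
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key, wrapping around at the end.
        i = bisect.bisect_left(self._points, ring_hash(key))
        return self._owner[self._points[i % len(self._points)]]


if __name__ == "__main__":
    ring = HashRing(["NODE1", "NODE2", "NODE3"], vnodes=50)
    keys = [f"object{i}" for i in range(1000)]
    before = {k: ring.lookup(k) for k in keys}
    ring.add_node("NODE4")
    moved = [k for k in keys if ring.lookup(k) != before[k]]
    # Monotonicity: every key that moved now lives on the new node.
    print(all(ring.lookup(k) == "NODE4" for k in moved))
```

Raising `vnodes` evens out the share of keys per node at the cost of a larger position table.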

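HDFS: Distributed File Storage Architecture

HDFS has a master/slave architecture. To the end user it looks like a traditional file system: files can be created, read, updated and deleted (CRUD) through directory paths. Because the storage is distributed, an HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode manages the file system metadata and the DataNodes store the actual data. Clients access the file system by talking to both: they contact the NameNode to obtain a file's metadata, while the actual file I/O goes directly to the DataNodes.

Swift: Distributed Object Storage Architecture

Swift is one of the subprojects of the OpenStack open-source cloud computing project. It is known as the object storage service and provides strong scalability, redundancy and durability.

Swift has three main components: Proxy Service, Storage Service and Consistency Service. The Storage and Consistency services both run on Storage Nodes; the Proxy Service is usually deployed on a separate proxy node.

The Proxy Server is the server process that exposes the Swift API and is responsible for communication among the other Swift components. For each client request it looks up the location of the Account, Container or Object in the Ring and forwards the request accordingly. The proxy offers a RESTful API that conforms to the standard HTTP specification, so developers can quickly build custom clients that talk to Swift.

The Storage Server provides the storage service on disk devices. How is data stored on a Storage Server?

[Swift data owner management]

Swift has a three-level data structure: Account, Container and Object. An Account corresponds to a tenant and records which Containers it contains; a Container records which Objects it contains. Account, container and object can be thought of as three SQLite databases related to one another as a tree.

As described below, Swift places data by four levels, Zone, Node (Device), Partition and Replica, and uses the Ring data structure to map an Object to these four levels. Looking up account, container or object information means consulting the Ring to find where the data lives.

[Swift data placement management]

Swift manages placement through Zone, Node (Device), Partition and Replica.

Zone:

If all Nodes sat in one rack or one machine room, a power or network failure would cut users off entirely. A mechanism is needed to isolate machines by physical location in order to satisfy partition tolerance (the P in CAP). Swift therefore introduces the Zone and assigns the cluster's Nodes to Zones.

The replicas of one piece of data must not be placed in the same Node or the same Zone.

Note that the size of a Zone can be chosen according to business needs and hardware: it can be a single disk, a storage server, a rack or even a whole IDC. Below is a medium-sized Swift cluster:

[Swift data placement balancing]

When the number of Nodes changes, Swift has to adjust where data and replicas are stored to restore balance while preserving reliability. Its dynamic balancing uses the consistent hashing algorithm.

Swift uses a data structure called the Ring to relate object data (including replicas) to storage locations.

The Ring file is created when the system is initialized. Each time storage nodes are added or removed, the entries in the Ring file must be rebalanced so that as little data as possible is migrated.

The Ring file therefore plays the role of a metadata server, and a centralized one at that.

[Swift failure recovery]

The Consistency Servers exist to find and repair errors caused by data corruption and hardware failure. There are three main servers: Auditor, Updater and Replicator. The Auditor runs in the background on every Swift server and continuously scans the disks to check the integrity of objects, containers and accounts. If it finds corrupted data, the Auditor moves the file to a quarantine area, and the Replicator then replaces it with a good copy.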

Ceph: Distributed Unified Storage Architecture

Ceph provides three services: object storage, block storage and file storage.

Ceph does not follow such traditional storage architecture; in fact, the architecture has been completely reinvented. Rather than storing and manipulating metadata, Ceph introduces a newer way: the CRUSH algorithm.

CRUSH stands for Controlled Replication Under Scalable Hashing. Instead of performing lookup in the metadata table for every client request, the CRUSH algorithm computes on demand where the data should be written to or read from. By computing metadata, the need to manage a centralized table for metadata is no longer there. The modern computers are amazingly fast and can perform a CRUSH lookup very quickly; moreover, this computing load, which is generally not too much, can be distributed across cluster nodes, leveraging the power of distributed storage. In addition to this, CRUSH has a unique property, which is infrastructure awareness. It understands the relationship between various components of your infrastructure and stores your data in a unique failure zone, such as a disk, node, rack, row, and datacenter room, among others. CRUSH stores all the copies of your data such that it is available even if a few components fail in a failure zone. It is due to CRUSH that Ceph can handle multiple component failures and provide reliability and durability.

The CRUSH algorithm makes Ceph self-managing and self-healing. In an event of component failure in a failure zone, CRUSH senses which component has failed and determines the effect on the cluster. Without any administrative intervention, CRUSH self-manages and self-heals by performing a recovering operation for the data lost due to failure. CRUSH regenerates the data from the replica copies that the cluster maintains. If you have configured the Ceph CRUSH map in the correct order, it makes sure that at least one copy of your data is always accessible. Using CRUSH, we can design a highly reliable storage infrastructure with no single point of failure. It makes Ceph a highly scalable and reliable storage system that is future ready.

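The architecture of CephFS is as follows:

The underlying layer, RADOS:

RADOS (Reliable, Autonomic Distributed Object Store) is one of the cores of Ceph. It is a subproject of the Ceph distributed file system designed specifically for Ceph's needs: on top of a dynamically changing, heterogeneous cluster of storage devices it provides a stable, scalable, high-performance single logical object storage interface, together with a storage system whose nodes adapt and manage themselves. RADOS can in fact also be used on its own as a distributed data store that provides storage to any distributed file system with matching requirements.

A RADOS system is made up of OSDs (Object Storage Devices), MDSs (MetaData Servers) and Monitors, and each of the three roles can run as a cluster.

The Monitor manages the Cluster Map, the key data structure of the whole RADOS system, which tracks all members of the cluster, their relationships and attributes, and the distribution of data.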

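Comparison

Only distributed file systems are compared here; object storage and block storage are left out.

Architecture: HDFS and Ceph both use a "client - metadata server - storage node" structure.

Metadata management: HDFS manages metadata centrally in the NameNode and keeps it in memory, so the amount of metadata HDFS can handle is bounded by the NameNode's memory. HDFS supports NameNode HA to remove the single point of failure in metadata management. Ceph manages metadata jointly through the metadata server (MDS) and the Monitor service, both of which can be clustered to remove single points of failure.

Data storage: HDFS stores data on DataNodes, Ceph on OSDs. The default HDFS block size is 64 MB; by what rules does the NameNode split a file into blocks and decide which DataNodes should hold them? In Ceph, data is assigned to specific OSDs (PG to OSD) by CRUSH, a consistent-hashing style algorithm.

Consistency and redundancy: HDFS uses a simple consistency model (master-slave) and supports multiple replicas. Ceph uses Paxos or the Zab algorithm from ZooKeeper, and supports both multiple replicas and erasure coding.

Use cases: HDFS suits very large files and streaming access (write once, read many); it does not support concurrent writes by multiple users or arbitrary modification of files (append only). Ceph suits scenarios such as large numbers of small files and random reads and writes.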

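Ceph Framework Analysis
1. Supporting Interfaces
  1. bufferraw/bufferptr/bufferlist

bufferlist is the most important class in Ceph, because it manages all memory in Ceph. Every memory-related operation, from msg allocating memory to receive a message, through the OSD building the persistent representation of its data structures (encode/decode), down to the actual disk operations, is built on bufferlist.

The bufferlist class is Ceph's core buffer class, used to hold serialization results, cached data, network traffic and so on. A bufferlist can be thought of as a variable-length char array.

Three main classes are involved: buffer::raw (bufferraw), buffer::ptr (bufferptr) and buffer::list (bufferlist). All three are defined in include/buffer.h as inner classes of the buffer class, which has no content of its own and serves only as a namespace.

Each of the three classes has its own responsibility:

buffer::raw: corresponds to a real piece of physical memory and maintains that memory's reference count nref and its release.

buffer::ptr: corresponds to a piece of memory in use in Ceph, that is, part or all of some bufferraw.

buffer::list: a list of ptrs (std::list), which turns N ptrs into one larger, virtually contiguous piece of memory.

The relationship among the three buffer classes can be shown in the following figure:

The figure makes the design of bufferlist clear. A bufferlist cares only about individual ptrs; it chains them together and uses them as one contiguous piece of memory. The whole content of a bufferlist can therefore be iterated byte by byte through bufferlist::iterator without caring how many ptrs there are, let alone how those ptrs correspond to system memory. The content can also be written straight to a file with bufferlist::write_file, or to a file descriptor with bufferlist::write_fd.

At the opposite end from bufferlist is bufferraw, which manages system memory. bufferraw cares about one thing only: maintaining the reference count of the system memory it manages and releasing that memory when the count drops to 0, that is, when no ptr uses it any more.

bufferptr connects bufferlist and bufferraw. A bufferptr is concerned with how memory is used: every bufferptr has a bufferraw that supplies its system memory, and the ptr decides which part of that memory to use. A bufferlist reaches system memory only through ptrs, whereas a bufferptr can exist on its own; most ptrs nevertheless serve a bufferlist, and standalone ptrs are not common.

Introducing ptr as an intermediate layer makes the way bufferlist uses memory very flexible. Two scenarios illustrate this:

1. Fast encode/decode

Ceph often needs to encode one bufferlist into another. When msg sends a message, for instance, it usually receives a bufferlist from a logic layer such as the osd and then has to add a message header and footer, which are themselves bufferlists. msg typically constructs an empty bufferlist and encodes the header, the footer and the content into it. Encoding between bufferlists only copies ptrs and involves no allocation or copying of system memory, so it is efficient.

2. Allocate once, use many times

Calling malloc and similar functions to allocate memory is a heavyweight operation. The ptr layer eases this: we can allocate one large block of memory, that is, one large bufferraw, and each time memory is needed construct a bufferptr pointing to a different part of that bufferraw. No further allocation from the system is needed. Finally, adding these ptrs to one bufferlist forms a virtually contiguous memory.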
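
The division of labour between raw, ptr and list can be mimicked in a small model. The sketch below is a Python analogy only, not Ceph's C++ API: the class names Raw, Ptr and BufferList and the method append_list are hypothetical, and reference counting is reduced to a plain counter. It shows the two scenarios above: one allocation sliced into several ptrs, and appending one list to another without copying the underlying bytes.

```python
class Raw:
    """Owns one real block of memory plus a reference count (cf. buffer::raw)."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.nref = 0


class Ptr:
    """A window (offset, length) onto part or all of a Raw (cf. buffer::ptr)."""

    def __init__(self, raw, offset=0, length=None):
        if length is None:
            length = len(raw.data) - offset
        if offset < 0 or length < 0 or offset + length > len(raw.data):
            raise ValueError("ptr lies outside of the raw buffer")
        self.raw, self.offset, self.length = raw, offset, length
        raw.nref += 1

    def view(self):
        # memoryview slicing shares the bytes instead of copying them.
        return memoryview(self.raw.data)[self.offset:self.offset + self.length]


class BufferList:
    """An ordered list of Ptrs read as one contiguous buffer (cf. buffer::list)."""

    def __init__(self):
        self.ptrs = []

    def append(self, ptr):
        self.ptrs.append(ptr)

    def append_list(self, other):
        # Only new windows are created; the underlying memory is shared.
        for p in other.ptrs:
            self.ptrs.append(Ptr(p.raw, p.offset, p.length))

    def __len__(self):
        return sum(p.length for p in self.ptrs)

    def __iter__(self):
        for p in self.ptrs:
            yield from p.view()

    def to_bytes(self):
        return b"".join(bytes(p.view()) for p in self.ptrs)


if __name__ == "__main__":
    raw = Raw(8)                      # allocate once
    raw.data[:] = b"headbody"
    head, body = Ptr(raw, 0, 4), Ptr(raw, 4, 4)   # use many times
    msg = BufferList()
    msg.append(body)
    msg.append(head)
    print(msg.to_bytes(), raw.nref)   # b'bodyhead' 2
```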

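Serialization (encode) and Deserialization (decode)

1. How Ceph serializes

Serialization (which Ceph calls encode) turns a data structure into a binary stream so that it can be sent over the network or saved to disk or other storage media; the reverse process is deserialization (decode in Ceph). For the string "abc", for example, the serialized result is 7 bytes: 03 00 00 00 61 62 63. The first four bytes (03 00 00 00) give the length of the string, 3 characters, and the last 3 bytes (61 62 63) are the hexadecimal ASCII codes of the characters "abc".

Ceph serializes in little-endian order, that is, the least significant byte is stored at the lowest address, so the 32-bit integer 0x12345678 serializes to 78 56 34 12.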
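
The byte layouts quoted above can be reproduced with Python's struct module. This is a sketch of the wire format only, assuming a 32-bit little-endian length prefix as in the "abc" example; the helper names are hypothetical and are not Ceph functions.

```python
import struct


def encode_u32(value):
    """32-bit unsigned integer, little-endian."""
    return struct.pack("<I", value)


def decode_u32(buf, offset=0):
    (value,) = struct.unpack_from("<I", buf, offset)
    return value, offset + 4


def encode_string(text):
    """Length prefix (u32, little-endian) followed by the raw bytes."""
    data = text.encode("utf-8")
    return encode_u32(len(data)) + data


def decode_string(buf, offset=0):
    length, start = decode_u32(buf, offset)
    return buf[start:start + length].decode("utf-8"), start + length


if __name__ == "__main__":
    print(encode_string("abc").hex())      # 03000000616263
    print(encode_u32(0x12345678).hex())    # 78563412
```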

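Because serialization is such a basic and frequently used function across the system, Ceph designs it as a general framework: every data structure that supports serialization must provide a pair of encode/decode functions defined in the global namespace. If we define a struct inode, for example, we must define the following two functions in the global namespace:

void encode(const struct inode &i, bufferlist &bl);

void decode(struct inode &i, bufferlist::iterator &p);

With this in place, using serialization is easy. For an instance instance_T of any serializable type T, the statement

::encode(instance_T, instance_bufferlist);

serializes instance_T and stores it in instance_bufferlist, an instance of the bufferlist class.

The following code serializes a timestamp and an inode into a bufferlist:

Serialized data can be read back with the deserialization functions. The following snippet, for example, deserializes a timestamp and an inode from a bufferlist (provided that a utime_t and an inode were previously serialized into that bl; otherwise an error is raised):

2. Serialization of data structures

Ceph provides serialization and deserialization functions for all the data types it uses, including most basic types (int, bool and so on), struct types (ceph_mds_request_head and so on), collection types (vector, list, set, map and so on) and custom complex types (such as inode_t, which represents an inode).

2.1 Serialization of basic data types

The serialized form of a basic data type is essentially its in-memory representation. The functions for basic types are written by hand; to avoid repetitive code they rely on two macros, WRITE_RAW_ENCODER and WRITE_INTTYPE_ENCODER:

The WRITE_RAW_ENCODER macro is implemented by calling encode_raw, which calls bufferlist's append method to copy the data structure into the bufferlist.

These are defined in include/encoding.h and cover the following basic data types:

2.2 Serialization of collection types

Serializing a collection takes two steps: first serialize the size of the collection (its element count), then serialize every element in it.

For example, the serialization function for std::vector<T>& v:

Each element is serialized by calling that element's encode function.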
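
The two-step scheme for collections (element count first, then each element through its own encoder) can be sketched as follows. This is a Python model of the idea under the assumption of a 32-bit little-endian count; the function names are hypothetical and the original C++ template is not reproduced here.

```python
import struct


def encode_u32(value):
    return struct.pack("<I", value)


def decode_u32(buf, offset):
    (value,) = struct.unpack_from("<I", buf, offset)
    return value, offset + 4


def encode_vector(items, encode_item):
    """Element count as u32, followed by each element's own encoding."""
    out = bytearray(encode_u32(len(items)))
    for item in items:
        out += encode_item(item)
    return bytes(out)


def decode_vector(buf, decode_item, offset=0):
    count, offset = decode_u32(buf, offset)
    items = []
    for _ in range(count):
        item, offset = decode_item(buf, offset)
        items.append(item)
    return items, offset


if __name__ == "__main__":
    blob = encode_vector([1, 2], encode_u32)
    print(blob.hex())                       # 020000000100000002000000
    print(decode_vector(blob, decode_u32))  # ([1, 2], 12)
```

Passing the element encoder as a parameter plays the role that the per-type encode overload plays in a templated implementation.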

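Serialization of the common collection types is already implemented by Ceph in include/encoding.h and covers the following collection types:

The collection serializers are all implemented with generics (class templates) and apply to every instantiation of the templates.

2.3 Serialization of struct types

Struct types are serialized in the same way as basic data types.

Once a struct has been defined, the WRITE_RAW_ENCODER macro is invoked to generate the global encode function for it, for example:

2.4 Serialization of complex data types

Apart from the business-independent data types above, every other data type implements serialization in two parts: first the type implements an encode method internally, then that internal encode method is re-exposed as a global function.

Take the CrushWrapper class as an example (it is used later in the command dispatch and parsing workflow):

CrushWrapper implements the two methods encode and decode internally, and the WRITE_CLASS_ENCODER macro turns them into global functions.

The WRITE_CLASS_ENCODER macro is defined in include/encoding.h as follows:

The encode method of a complex data structure is usually implemented by calling the encode methods of its main internal data structures.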

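Logical Structure
1. Level-0 Decomposition

The level-0 decomposition describes how Ceph relates to the outside world.

Ceph has at least two external parties: the operating system on bare metal, usually Linux, and the clients that use Ceph's services. The figure below shows what these clients may be:

From the point of view of network structure:

Ceph recommends using two networks:

A front-side (northbound) network, the public network, which connects clients to the cluster
A back-side (east-west) network, the cluster network, which connects the Ceph storage nodes to each other

This is mainly for performance (OSD nodes copy large amounts of data between themselves) and for security (the two networks are separated). Both networks can be configured in the [global] section of the Ceph configuration file:

Level-1 Decomposition
Module decomposition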

Ceph Object Storage Device (OSD): As soon as your application issues a write operation to the Ceph cluster, data gets stored in the OSD in the form of objects. This is the only component of the Ceph cluster where actual user data is stored, and the same data is retrieved when the client issues a read operation. Usually, one OSD daemon is tied to one physical disk in your cluster. So, in general, the total number of physical disks in your Ceph cluster is the same as the number of OSD daemons working underneath to store user data on each physical disk.

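Ceph supports the xfs, ext4 and btrfs file systems on data disks. All of them are journaling file systems (a journaling file system records each committed change in a journal so that data can be recovered after a crash or power loss). Each has its strengths and weaknesses:

btrfs (B-tree file system) is a very new file system (Oracle released the first stable version in August 2014). It is set to support many advanced features, such as transparent compression, writable copy-on-write (COW) snapshots, deduplication and encryption.
Compared with ext3/4, xfs and btrfs have the advantage for highly scalable data storage.
Ceph explicitly recommends XFS for production and btrfs for development, testing and non-critical applications.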

Ceph Monitors (MON): Ceph monitors track the health of the entire cluster by keeping a map of the cluster state. It maintains a separate map of information for each component, which includes an OSD map, MON map, PG map, and CRUSH map. All the cluster nodes report to Monitor nodes and share information about every change in their state. The monitor does not store actual data; this is the job of the OSD.

Ceph Metadata Server (MDS): The MDS keeps track of file hierarchy and stores metadata only for the CephFS file system. The Ceph block device and RADOS gateway does not require metadata, hence they do not need the Ceph MDS daemon. The MDS does not serve data directly to clients, thus removing the single point of failure from the system.

RADOS: The Reliable Autonomic Distributed Object Store (RADOS) is the foundation of the Ceph storage cluster. Everything in Ceph is stored in the form of objects, and the RADOS object store is responsible for storing these objects irrespective of their data types. The RADOS layer makes sure that data always remains consistent. To do this, it performs data replication, failure detection and recovery, as well as data migration and rebalancing across cluster nodes.

librados: The librados library is a convenient way to gain access to RADOS with support to the PHP, Ruby, Java, Python, C, and C++ programming languages. It provides a native interface for the Ceph storage cluster (RADOS), as well as a base for other services such as RBD, RGW, and CephFS, which are built on top of librados. Librados also supports direct access to RADOS from applications with no HTTP overhead.

The implementation logic of librados is typically as follows:

RadosClient is therefore the core of librados.

RADOS Block Devices (RBDs): RBDs which are now known as the Ceph block device, provides persistent block storage, which is thin-provisioned, resizable, and stores data striped over multiple OSDs. The RBD service has been built as a native interface on top of librados.

RADOS Gateway interface (RGW): RGW provides object storage service. It uses librgw (the Rados Gateway Library) and librados, allowing applications to establish connections with the Ceph object storage. The RGW provides RESTful APIs with interfaces that are compatible with Amazon S3 and OpenStack Swift.

CephFS: The Ceph File system provides a POSIX-compliant file system that uses the Ceph storage cluster to store user data on a file system. Like RBD and RGW, the CephFS service is also implemented as a native interface to librados.

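Process view

Key Concepts

Taking a client writing data to a Ceph cluster as the example, the flow and key concepts involved are shown in the figure below.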

For a read-and-write operation to Ceph clusters,

clients first contact a Ceph monitor and retrieve a copy of the cluster map, which is inclusive of 5 maps, namely the monitor, OSD, MDS, and CRUSH and PG maps. These cluster maps help clients know the state and configuration of the Ceph cluster.
Next, the data is converted to objects using an object name and pool names/IDs.
This object is then hashed with the number of PGs to generate a final PG within the required Ceph pool.
This calculated PG then goes through a CRUSH lookup function to determine the primary, secondary, and tertiary OSD locations to store or retrieve data.

Once the client gets the exact OSD ID, it contacts the OSDs directly and stores the data. All of these compute operations are performed by the clients; hence, they do not affect the cluster performance.
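
The client-side computation described in these steps (object name to PG, then PG to an ordered list of OSDs) can be sketched as below. This is a simplified stand-in and not Ceph's real algorithm: Ceph uses its own hash function and the CRUSH map with its hierarchy and rules, whereas this sketch assumes SHA-256 as the hash, a plain modulo for the PG, and a "highest hash wins" ranking over a flat list of OSD ids in place of CRUSH. All function names are hypothetical.

```python
import hashlib


def stable_hash(text):
    """Deterministic 32-bit hash (assumption: SHA-256, not Ceph's hash)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "little")


def object_to_pg(pool_id, object_name, pg_num):
    """Hash the object name into one of the pool's pg_num placement groups."""
    return pool_id, stable_hash(object_name) % pg_num


def pg_to_osds(pg, osd_ids, replicas):
    """Stand-in for the CRUSH lookup: rank OSDs by a per-PG hash.

    The first entry plays the role of the primary OSD, the rest are replicas.
    """
    pool_id, seed = pg
    ranked = sorted(
        osd_ids,
        key=lambda osd: stable_hash(f"{pool_id}.{seed}:{osd}"),
        reverse=True,
    )
    return ranked[:replicas]


if __name__ == "__main__":
    pg = object_to_pg(pool_id=1, object_name="rbd_data.1234", pg_num=128)
    print(pg, pg_to_osds(pg, osd_ids=range(10), replicas=3))
```

Because both steps are pure functions of the object name and the cluster map, any client holding the same map computes the same OSD list without asking a central server.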

Object

An object has an identifier, binary data, and metadata consisting of a set of name/value pairs. The semantics are completely up to Ceph Clients. For example, CephFS uses metadata to store file attributes such as the file owner, created date, last modified date, and so forth.

What does an object look like on a storage medium such as a disk? Each object corresponds to a file in a filesystem, which is stored on an Object Storage Device. Ceph OSD Daemons handle the read/write operations on the storage disks.

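Pool

To manage multiple PGs, Ceph introduces the logical concept of a Pool.

Creating a Ceph cluster:

ceph-deploy new {host [host], ...}

A Ceph cluster is thus built from physical nodes as its smallest unit.

Creating a Ceph pool:

ceph osd pool create {pool-name} {pg-num} [{pgp-num}] [replicated] \
[crush-ruleset-name] [expected-num-objects]

ceph osd pool create {pool-name} {pg-num} {pgp-num} erasure \
[erasure-code-profile] [crush-ruleset-name] [expected_num_objects]

A cluster can have multiple pools. The attributes of a pool include:

Ownership and access permissions
Number of objects
Number of PGs
CRUSH ruleset

There are two types of Ceph pool:

Replicated pool: keeps multiple copies of each object so that no data is lost when some OSDs are lost. This type of pool needs more raw storage space, but it supports all pool operations.
Erasure-coded pool (similar to software RAID): each object is stored as K+M chunks; the object is split into K data chunks and M coding chunks.

An erasure-coded pool therefore stores data in less raw space. It has two drawbacks: it is slower, because of the coding computation, and it supports only a subset of object operations (partial writes, for example, are not supported).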
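
The space saving of an erasure-coded pool over a replicated pool is simple arithmetic: K+M chunks of size 1/K each versus one full copy per replica. The sketch below works this out; the function names and the sample sizes (a 4 MiB object, 3 copies, K=4 and M=2) are assumptions chosen for illustration.

```python
def replicated_raw_bytes(object_bytes, copies):
    """Raw space used by a replicated pool that keeps `copies` full copies."""
    return object_bytes * copies


def erasure_coded_raw_bytes(object_bytes, k, m):
    """Raw space used when an object is split into k data and m coding chunks."""
    chunk_bytes = -(-object_bytes // k)   # ceiling division
    return chunk_bytes * (k + m)


if __name__ == "__main__":
    obj = 4 * 2**20   # hypothetical 4 MiB object
    print(replicated_raw_bytes(obj, 3) // 2**20)        # 12 (MiB), survives 2 lost copies
    print(erasure_coded_raw_bytes(obj, 4, 2) // 2**20)  # 6 (MiB), survives 2 lost chunks
```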

When you first deploy a cluster without creating a pool, Ceph uses the default pools for storing data. A pool provides you with:

Resilience: You can set how many OSDs are allowed to fail without losing data. For replicated pools, it is the desired number of copies/replicas of an object. A typical configuration stores an object and one additional copy (i.e., size = 2), but you can determine the number of copies/replicas. For erasure coded pools, it is the number of coding chunks (i.e. m=2 in the erasure code profile)
Placement Groups: You can set the number of placement groups for the pool. A typical configuration uses approximately 100 placement groups per OSD to provide optimal balancing without using up too many computing resources. When setting up multiple pools, be careful to ensure you set a reasonable number of placement groups for both the pool and the cluster as a whole.
CRUSH Rules: When you store data in a pool, a CRUSH ruleset mapped to the pool enables CRUSH to identify a rule for the placement of the object and its replicas (or chunks for erasure coded pools) in your cluster. You can create a custom CRUSH rule for your pool.
Snapshots: When you create snapshots with ceph osd pool mksnap, you effectively take a snapshot of a particular pool.
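
The "approximately 100 placement groups per OSD" guideline is often turned into a starting value for pg-num. The sketch below follows a commonly cited rule of thumb that goes beyond what the text above states: divide by the pool's replica count (each PG occupies that many OSDs) and round up to a power of two. Treat both steps, and the function name, as assumptions rather than an official formula.

```python
def suggested_pg_count(num_osds, pool_size, target_pgs_per_osd=100):
    """Rule-of-thumb PG count for a cluster hosting a single pool."""
    if num_osds <= 0 or pool_size <= 0:
        raise ValueError("num_osds and pool_size must be positive")
    raw = num_osds * target_pgs_per_osd / pool_size
    power = 1
    while power < raw:        # round up to the next power of two
        power *= 2
    return power


if __name__ == "__main__":
    print(suggested_pg_count(num_osds=9, pool_size=3))   # 512
```

With several pools, the same PG budget has to be shared among them rather than granted to each pool in full.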

PG Map

It holds the PG version, time stamp, last OSD map epoch, full ratio, and near full ratio information. It also keeps track of each PG ID, object count, state, state stamp, up and acting OSD sets, and finally, the scrub details.

To check your cluster PG map, execute the following:

# ceph pg dump

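The PG class holds the information of one PG; the PG Map holds the information of all PGs of a Cluster.pool.

The core purpose of a PG is to manage which OSDs an object and its replicas are stored on.

Ceph introduces PGs to reduce the complexity of mapping objects directly to OSDs;
An object belongs to exactly one PG, but one PG can hold many objects; in other words, one PG manages the placement of many objects;
The number of OSDs associated with a PG equals the replica count of the objects it serves (including the object itself, e.g. 2 or 3 replicas); several PGs may share the same OSD;
A PG uses its acting set and up set to record which OSDs it is stored on and in what order; for example [0,8,3] means osd.0 is the primary OSD and osd.8 and osd.3 are replica OSDs.
The OSDs related to a PG's acting/up set are of three kinds:
- Primary OSD: the first OSD in the acting set. It receives client writes and by default also serves reads, although this behaviour can be changed. It also drives the peering process and requests a PG temp when needed;
- Replica OSD: every OSD in the acting set other than the first;
- Stray OSD: an OSD that is no longer in the acting set but has not yet been told to delete its data;
The PG is the basic unit of scrubbing in a Ceph cluster, that is, scrubbing is carried out one PG at a time;
The mapping between PGs and OSDs is determined by the CRUSH map according to the CRUSH rules.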

OSD Map

It stores some common fields, such as cluster ID, epoch for OSD map creation and last changed, and information related to pools, such as pool names, pool ID, type, replication level, and PGs. It also stores OSD information such as count, state, weight, last clean interval, and OSD host information.

You can check your cluster's OSD maps by executing the following:

# ceph osd dump

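The OSD class holds the information of one OSD node; the OSD Map holds the information of all OSD nodes of a Cluster.pool.

The OSDService class maintains and manages OSD node information.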

Monitor Map

It holds end-to-end information about the monitor node, which includes the Ceph cluster id, monitor hostname, and IP address with the port number. It also stores the current epoch for map creation and last changed time too.

You can check your cluster's monitor map by executing the following:

# ceph mon dump

CRUSH Map

It holds information on your clusters devices, buckets, failure domain hierarchy, and the rules defined for the failure domain when storing data.

To check your cluster CRUSH map, execute the following:

# ceph osd crush dump

1) Get the CRUSH map as a binary file

ceph osd getcrushmap -o {compiled-crushmap-filename}

# ceph osd getcrushmap -o crushmap.map

2) Decompile it, converting the binary file to a text file

crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}

# crushtool -d crushmap.map -o crushmap.txt

3) View the CRUSH map

# vim crushmap.txt

In the decompiled CRUSH map shown below:

weight: Ceph writes data evenly across the cluster disks, which helps in performance and better data distribution. This forces all the disks to participate in the cluster and make sure that all cluster disks are equally utilized, irrespective of their capacity. To do so, Ceph uses a weighting mechanism. CRUSH allocates weights to each OSD. The higher the weight of an OSD, the more physical storage capacity it will have. A weight is the relative difference between device capacities. We recommend using 1.00 as the relative weight for a 1 TB storage device. Similarly, a weight of 0.5 would represent approximately 500 GB, and a weight of 3.00 would represent approximately 3 TB.

alg: Ceph supports multiple algorithm bucket types for your selection. These algorithms differ from each other on the basis of performance and reorganizational efficiency. Let's briefly cover these bucket types:

Uniform: The uniform bucket can be used if the storage devices have exactly the same weight. For non-uniform weights, this bucket type should not be used. The addition or removal of devices in this bucket type requires the complete reshuffling of data, which makes this bucket type less efficient.

List: List buckets aggregate their contents as linked lists and can contain storage devices with arbitrary weights. In the case of cluster expansion, new storage devices can be added to the head of a linked list with minimum data migration. However, storage device removal requires a significant amount of data movement. So, this bucket type is suitable for scenarios in which devices are rarely or never removed from the cluster (the quoted source says "addition of new devices" here, which appears to be reversed). In addition, list buckets are efficient for small sets of items, but they may not be appropriate for large sets.

Tree: Tree buckets store their items in a binary tree. They are more efficient than list buckets when a bucket contains a larger set of items. Tree buckets are structured as a weighted binary search tree with items at the leaves. Each interior node knows the total weight of its left and right subtrees and is labeled according to a fixed strategy. Tree buckets are a good all-round choice, providing excellent performance and decent reorganization efficiency.

Straw: To select an item using List and Tree buckets, a limited number of hash values need to be calculated and compared by weight. They use a divide-and-conquer strategy, which gives precedence to certain items (for example, those at the beginning of a list). This improves the performance of the replica placement process but introduces moderate reorganization when bucket contents change due to addition, removal, or re-weighting.

The straw bucket type allows all items to compete fairly against each other for replica placement through a process analogous to a draw of straws. In a scenario where removal is expected and reorganization efficiency is critical, straw buckets provide optimal migration behavior between subtrees.

Straw2: This is an improved straw bucket that correctly avoids any data movement between items A and B, when neither A's nor B's weights are changed. In other words, if we adjust the weight of item C by adding a new device to it, or by removing it completely, the data movement will take place to or from C, never between other items in the bucket. Thus, the straw2 bucket algorithm reduces the amount of data migration required when changes are made to the cluster.
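The straw2 property described above can be illustrated with a small sketch. It is not the Ceph implementation: the hash is a SHA-256 stand-in (an assumption; Ceph uses its own integer hash and a fixed-point logarithm), and the item names and weights are hypothetical. Each item draws a straw of length ln(u) / weight, and the longest straw wins.

```python
import hashlib
import math

def unit_hash(x, item):
    """Map (x, item) to a float in (0, 1]; a stand-in for CRUSH's integer hash."""
    digest = hashlib.sha256(f"{x}:{item}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64

def straw2_choose(x, weights):
    """Return the item whose straw ln(u) / weight is longest (closest to zero)."""
    best_item, best_draw = None, -math.inf
    for item, weight in sorted(weights.items()):
        if weight <= 0:
            continue  # zero-weight items never win
        draw = math.log(unit_hash(x, item)) / weight
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item

# Hypothetical bucket: only C's weight changes between the two maps.
before = {"A": 1.0, "B": 1.0, "C": 1.0}
after = {"A": 1.0, "B": 1.0, "C": 2.0}
moved = [(straw2_choose(x, before), straw2_choose(x, after)) for x in range(2000)]
changed = [(old, new) for old, new in moved if old != new]
```

Because the straws drawn by A and B do not depend on C's weight, every input whose placement changes in this sketch moves to C; nothing moves between A and B.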

Rules: The CRUSH maps contain CRUSH rules that determine the data placement for pools. As the name suggests, these are the rules that define the pool properties and the way data gets stored in the pools. They define the replication and placement policy that allows CRUSH to store objects in a Ceph cluster. The default CRUSH map contains a rule for the default pool, that is, rbd. The general syntax of a CRUSH rule looks like this:

ruleset: An integer value; it classifies a rule as belonging to a set of rules.

type: A string value; it's the type of pool that is either replicated or erasure coded.

min_size: An integer value; if a pool makes fewer replicas than this number, CRUSH will not select this rule.

max_size: An integer value; if a pool makes more replicas than this number, CRUSH will not select this rule.

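step take: Defines the bucket in the CRUSH map from which the search starts; default means starting from the root node.

step choose firstn {num} type {bucket-type}: This selects a number of buckets of the given type. N is usually the number of replicas in the pool (that is, the pool size):

If num == 0, select N buckets
If num > 0 && num < N, select num buckets
If num < 0, select N - |num| buckets

The difference between choose and chooseleaf: choose only selects num buckets of type bucket-type, whereas chooseleaf selects num buckets of type bucket-type and then descends into the subtree of each bucket to select a leaf node.

firstn versus indep: firstn returns the first N suitable targets and suits replicated pools, while indep keeps each position in the result stable and suits erasure-coded pools. (The original note described firstn as breadth-first and indep as depth-first; in the Ceph source it is crush_choose_indep that is described as breadth-first.)

step chooseleaf firstn {num} type {bucket-type}: This first selects a set of buckets of a bucket type, and then chooses the leaf node from the subtree of each bucket in the set of buckets. The number of buckets in the set (N) is usually the number of replicas in the pool:

If num == 0, select N buckets
If num > 0 && num < N, select num buckets
If num < 0, select N - |num| buckets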

step emit: This first outputs the current value and empties the stack. This is typically used at the end of a rule but may also be used to form different trees in the same rule.
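The three num cases of a choose or chooseleaf step reduce to one small calculation. The sketch below is illustrative only; resolve_numrep is a hypothetical helper name, and pool_size stands for N, the pool's replica count.

```python
def resolve_numrep(num, pool_size):
    """Return how many buckets a choose/chooseleaf step selects (0 means skip).

    num > 0  -> exactly num buckets
    num == 0 -> pool_size buckets
    num < 0  -> pool_size - |num| buckets
    """
    if num > 0:
        return num
    numrep = num + pool_size  # num is zero or negative here
    return max(numrep, 0)

# A pool with 3 replicas (hypothetical values).
examples = {num: resolve_numrep(num, 3) for num in (0, 2, -1, -5)}
```

With a pool size of 3, the step values 0, 2, -1 and -5 resolve to 3, 2, 2 and 0 buckets respectively; the last case means the step selects nothing.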

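Main flows
1. Command dispatch and parsing

Most Ceph commands are executed and parsed on the monitor nodes.

First, the user logs in to the Ceph cluster over ssh, which in practice means logging in to a monitor node; the monitor daemon provides the command interface. For example:

Extract the CRUSH map from any of the monitor nodes:

# ceph osd getcrushmap -o crushmap_compiled_file

So how does Ceph dispatch and parse a command?

1. All command text is defined in mon/MonCommands.h:

2. Command parsing and execution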

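RBD client write path
How an RBD client uses RBD

Assume the RBD client is a virtual machine:

A client generally uses rbd in one of two ways:

The first is kernel rbd.

After the rbd device is created, it is mapped into the kernel and appears as a virtual block device. It behaves like any other block device, and its device file is typically /dev/rbd0, which can then be used directly. You can also format /dev/rbd0 and mount it on a directory, then use it like an ordinary filesystem.

The second is librbd. After the rbd device is created, the librbd and librados libraries are used to access and manage the block device. Nothing is mapped into the kernel; the application calls the librbd interfaces directly to access and manage the rbd device.

How an application writes to an rbd block device: the application either calls the librbd interface or writes binary blocks to the Linux kernel virtual block device. The following uses librbd as the Ceph client:

1. cluster = rados.Rados(conffile = 'ceph.conf') creates a rados handle from the current Ceph configuration file. This step mainly parses the cluster configuration parameters in ceph.conf and stores their values in the rados handle.

2. cluster.connect() creates a RadosClient structure, which contains several functional modules: the message management module Messenger, the data processing module Objecter, and the finisher thread module.

3. ioctx = cluster.open_ioctx('mypool') creates an ioctx for the pool named mypool. The ioctx references the RadosClient and the Objecter module, and also records information about mypool, including the pool parameters.

4. rbd_inst.create(ioctx,'myimage',size) creates an rbd device named myimage; data is subsequently written to this device.

5. image = rbd.Image(ioctx,'myimage') creates the image structure, which ties myimage to the ioctx so that the ioctx can later be reached directly through the image. The ioctx is copied twice, as data_ioctx and md_ctx: one handles the data stored in the rbd image, the other handles the rbd management metadata.

6. image.write(data,0) starts the life of a write request through the image. It specifies the two basic elements of the request: buffer=data and offset=0. From here on the request enters the Ceph world.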
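The six steps above can be put together as one function using the rados and rbd Python bindings. This is a sketch under stated assumptions rather than a tested deployment script: the configuration path, pool name, image name, size and payload are placeholders, a reachable cluster is required to actually call it, and the bindings are imported inside the function so that the module still loads on a machine without them.

```python
def write_first_bytes(conffile="ceph.conf", pool="mypool", image_name="myimage",
                      size=4 * 1024**3, data=b"hello ceph"):
    """Create an RBD image and write `data` at offset 0 (needs a live cluster)."""
    import rados  # Ceph Python bindings, imported lazily
    import rbd

    cluster = rados.Rados(conffile=conffile)           # step 1: parse ceph.conf
    cluster.connect()                                   # step 2: build the client, contact the MONs
    try:
        ioctx = cluster.open_ioctx(pool)                # step 3: I/O context bound to the pool
        try:
            rbd.RBD().create(ioctx, image_name, size)   # step 4: create the image
            image = rbd.Image(ioctx, image_name)        # step 5: open the image
            try:
                return image.write(data, 0)             # step 6: write at offset 0
            finally:
                image.close()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

The nested try/finally blocks release the image, the I/O context and the cluster handle in reverse order of creation even if a step fails.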

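The Python call image.write(data,0) is bound to Image::write() in librbd.cc:

Mappings involved in a client write

Taking a client write to the Ceph cluster as the example, the flow and key concepts are shown in the figure below.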

For a read-and-write operation to Ceph clusters:

Clients first contact a Ceph monitor and retrieve a copy of the cluster map, which comprises five maps: the monitor, OSD, MDS, CRUSH, and PG maps. These cluster maps tell clients the state and configuration of the Ceph cluster.
Next, the data is converted to objects using an object name and pool names/IDs.
This object is then hashed with the number of PGs to generate a final PG within the required Ceph pool.
This calculated PG then goes through a CRUSH lookup function to determine the primary, secondary, and tertiary OSD locations to store or retrieve data.

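A client write stores binary data blocks on the underlying OSDs/disks. The client must first know which OSD the data should go to, which requires the full mapping chain (Pool, Object) → (Pool, PG) → OSD set → OSD/Disk. The chain involves two mappings:

The first maps data x to a PG. A PG is an abstract storage node; PGs are not added or removed as physical nodes join or leave, so the data-to-PG mapping is stable.

The second maps the PG ID to OSD IDs.

The first mapping covers the following two relations:

file : object = 1 : n (computed by the client on the fly)
object : PG = n : 1 (computed by the client with a hash algorithm)

The second mapping covers the following relation:

PG : OSD = m : n (computed by the MON using the CRUSH algorithm)

First mapping:

(1) Create the pool and its PGs. Following the calculation described above, once the pool is created the MON uses CRUSH to work out which OSDs each PG belongs on, and the PGs are created on those OSDs. In other words, by the time a client writes an object the PGs already exist and the PG-to-OSD mapping is already determined.

How many PGs should a pool have?

Ceph does not compute this itself; it provides guidelines and leaves the calculation to the user:

Fewer than 5 OSDs: 128 is recommended
5 to 10 OSDs: 512 is recommended
10 to 50 OSDs: 4096 is recommended
More than 50 OSDs: more trade-offs have to be weighed to settle on a PG count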

If you have more than 50 OSDs, we recommend approximately 50-100 placement groups per OSD to balance out resource usage, data durability and distribution.

For a single pool of objects, you can use the following formula to get a baseline:

Where pool size is either the number of replicas for replicated pools or the K+M sum for erasure coded pools (as returned by ceph osd erasure-code-profile get).

The result should be rounded up to the nearest power of two. Rounding up is optional, but recommended for CRUSH to evenly balance the number of objects among placement groups.

As an example, for a cluster with 200 OSDs and a pool size of 3 replicas, you would estimate your number of PGs as follows:

When using multiple data pools for storing objects, you need to ensure that you balance the number of placement groups per pool with the number of placement groups per OSD so that you arrive at a reasonable total number of placement groups that provides reasonably low variance per OSD without taxing system resources or making the peering process too slow.

The formula used in the calculation above:
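The formula itself did not survive in this copy. The baseline given in the Ceph placement-group documentation is Total PGs = (OSDs × 100) / pool size, rounded up to a power of two. The sketch below applies it; baseline_pg_count is a hypothetical helper name, and the default of 100 PGs per OSD is the upper end of the 50-100 range quoted above.

```python
def baseline_pg_count(num_osds, pool_size, pgs_per_osd=100):
    """Baseline PG count for a single pool, rounded up to a power of two.

    pool_size is the replica count for a replicated pool, or K+M for an
    erasure-coded pool.
    """
    raw = num_osds * pgs_per_osd / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power

# The example from the text: 200 OSDs, 3 replicas.
# 200 * 100 / 3 is about 6667, and the next power of two is 8192.
example = baseline_pg_count(200, 3)
```

For a cluster with several pools this baseline applies to the total, so each pool receives only a share of it.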

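The PG count of a pool is usually chosen from the guidelines above. What is the reasoning behind them?

For data reliability, an OSD should not host too many PGs; when OSDs are added, the PG count should grow accordingly

Consider a pool of size 3, so that each PG keeps its data on 3 OSDs. When one OSD goes down, recovery starts after a set interval, and until it finishes some PGs have only two replicas. How long this lasts depends on the amount of data to recover: the more PGs on the failed OSD, the longer recovery takes and the higher the risk. If a second OSD goes down during this window, some PGs are left with a single replica while recovery continues. If a third OSD then goes down, some data may be lost. The number of PGs per OSD should therefore not be too large, or data durability suffers. It also follows that after OSDs are added, the PG count should be increased when needed.

For even data distribution, a pool should not have too few PGs

CRUSH pseudo-randomly selects the PGs that store client data and tries to spread all PGs evenly across all OSDs. For example, with 10 OSDs and a single pool of size 3 that has only one PG, only three of the ten OSDs are used. CRUSH does not, however, consider how much data an OSD already holds. For example, if one million 4K objects (4G in total) are spread evenly over 1000 PGs on 10 OSDs, each OSD holds roughly 400M. Adding one 400M object (assuming it is not split) leaves three OSDs with 400M + 400M = 800M, while the other seven still hold 400M.

For resource consumption, more PGs is not always better

A PG is a logical entity that consumes resources, including memory, CPU, and bandwidth. Too many PGs use too many resources.

For scrub time, a PG should not hold too many objects or too much data

Ceph scrubs one PG at a time. If a PG holds too much data, scrubbing it takes a long time.

The PG-to-OSD relation is fixed when the pool is created. How is it determined?

As the pool creation parameters above show, you specify the pool name, the PG count, the crush-ruleset name, and the expected number of objects. For how the PG-to-OSD mapping is built, see the second mapping below.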

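(2) The Ceph client uses a hash to compute the ID of the PG that stores the object:

The client supplies the pool name and the object ID (for example pool = "liverpool" and object-id = "john")
Ceph hashes the object ID
Ceph takes that hash modulo the total number of PGs to get the PG number (for example 58); hashing followed by the modulo keeps all PGs of a pool in roughly even use
Ceph looks up the pool ID for the pool name (for example "liverpool" = 4)
Ceph combines the pool ID and the PG number (for example 4.58) to get the full PG ID. That is: PG-id = pool-id . (hash(object-id) % PG-number)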
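The object-to-PG calculation can be sketched in a few lines. This is an illustration only: zlib.crc32 stands in for Ceph's own object-name hash, a plain modulo stands in for Ceph's stable-mod variant, and the pool ID and PG count are made-up values, so the PG numbers it prints will not match a real cluster.

```python
import zlib

def pg_id(pool_id, object_name, pg_num):
    """Return a PG ID string of the form <pool-id>.<pg-number in hex>."""
    h = zlib.crc32(object_name.encode("utf-8"))  # stand-in for Ceph's hash
    pg = h % pg_num                              # stand-in for Ceph's stable mod
    return f"{pool_id}.{pg:x}"

# Hypothetical pool with ID 4 and 128 PGs.
pgid = pg_id(4, "john", 128)
pool_part, pg_part = pgid.split(".")
```

The same object name always lands in the same PG as long as pg_num is unchanged, which is why the client can compute the PG without asking any server.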

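Second mapping: PG ID → OSD ID

(3) The client uses CRUSH to compute (or look up) which OSDs of the PG the object should be stored on. (Note "should" rather than "will": the PG-to-OSD relation is already determined, so all the client has to do is find out on which OSDs the chosen PG will create the object.) This step is also called the CRUSH lookup.

Once a Ceph client has the cluster map, it can use CRUSH to compute the ID of the OSD that will hold an object and then talk to that OSD directly:

The Ceph client gets the latest cluster map from the MON
The Ceph client computes the ID of the PG that will hold the object, as in step (2) above
The Ceph client then uses CRUSH to compute the IDs of the PG's primary and secondary OSDs. That is: OSD-ids = CRUSH(PG-id, cluster-map, crush-rules)

First look at pool creation, which sets the PG count, object count, ruleset, and so on:

When are PGs created, and how is their mapping to OSDs established?

This is only a brief introduction to crush_do_rule(); it is analyzed in detail in the CRUSH algorithm section.

Building the OSD map (the CRUSH map is created as part of OSD map creation):

An example OSD map:

As shown, the OSD map includes the fsid, the pools and their parameters, and the OSD states and weights.

Creating the CRUSH map:

A CRUSH map has four main parts: devices, types, buckets, and rules.

A CRUSH map is essentially a tree. The leaves are devices (that is, OSDs) and all other nodes are buckets. Buckets are virtual nodes that abstract the physical layout. The tree has a single top-level node called root, and the intermediate buckets can represent data centers, rooms, racks, hosts, and so on, as in the figure below:

Building the CRUSH ruleset:

From the relations above we can see:

The OSD map stores the state and attributes of the pools and OSDs;

The PG map stores the mapping between PGs and OSDs;

The CRUSH map stores the physical and logical topology of the cluster (types, buckets);

The rulesets in the CRUSH map express the data placement policy. These policies give flexible control over where objects are placed. For example, all objects of pool1 can be placed on rack 1, with the first replica of every object on server A in rack 1 and the second replica on server B in rack 1. All objects of pool2 can be spread over racks 2, 3 and 4, with the first replica of every object on a server in rack 2, the second on a server in rack 3, and the third on a server in rack 4.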

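Data structures involved in a client write

Before walking through the function calls, here are the data structures involved and what each one does:

1. The rados structure. First the I/O environment is set up and the rados information is created; the contents of the configuration file are parsed into structured form inside rados.

2. A RadosClient structure is created from rados. It contains three important modules: the finisher callback thread, the Messenger message-handling structure, and the Objecter data-processing structure. All data is ultimately wrapped in messages and sent to the target OSD through the Messenger.

3. An ioctx is created from the pool information and the RadosClient. It holds the pool-related information, which is used later during data processing.

4. The ioctx is then copied into the imagectx, becoming the data_ioctx and md_ioctx channels. The former handles the data written to or read from the image; the latter handles management metadata. Finally the imagectx is wrapped in the image structure. All subsequent writes go through this image, and every structure created earlier can be reached by following the image structure.

5. Reads and writes go through the image shown in the top-right corner. When the target of a read or write is an image, the image starts processing the request, which is then split into requests on objects. The split requests are handed to the Objecter, which locates the target OSDs using the CRUSH algorithm and finds the target OSD set and the primary OSD.

6. The request op is wrapped in an MOSDOp message and handed to the SimpleMessenger. The SimpleMessenger looks for a matching entry among the existing osd_session objects; if none is found, it creates a new OSDSession together with a data channel (pipe) for it, and keeps the pipe in the SimpleMessenger for reuse.

7. The pipe establishes a socket connection to the target OSD and has a dedicated writer thread for socket communication. The writer thread first connects to the target IP. Messages received from the SimpleMessenger are placed in the pipe's outq queue; the writer thread also watches this queue and, whenever a message is waiting, writes it to the socket and sends it to the target OSD.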

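RBD client write flow

An RBD client write involves three layers: librbd, (librados + rados), and the OSD.

Each layer sees the data in a different form: librbd sees binary blocks, librados and rados see stripes and objects, and the OSD sees the file that backs each object.

Layer 1: librbd splits the binary block into chunks, 4M by default, each of which becomes an object

The Ceph client, here librbd, sees one complete, contiguous block of binary data.

How is the data split into objects?

librbd applies striping to the binary block being written, so that the data is spread over multiple objects. So how are the stripe width and count, the object size and count, and the PG count defined?

The PG count is set when the pool is created or modified:

The stripe and object attributes are set when the RBD image is created:

In general, the pool size is the number of replicas in the pool, 3 by default; order is the object size expressed as a power of two, 22 by default, which gives a default object size of 4M.

The code that sets the stripe and object attributes at image creation:

From the code above, how does the object size relate to the stripe width and count?

object size = 2^order
The stripe width must be less than or equal to the object size, and the object size must be an exact multiple of the stripe width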
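The two constraints above can be checked with a few lines of code. This is a sketch rather than the librbd validation routine: check_layout is a hypothetical helper, and the defaults it applies (stripe unit equal to the object size, stripe count of 1) are the values commonly documented for rbd.

```python
def check_layout(order, stripe_unit=None, stripe_count=1):
    """Validate an image layout and return (object_size, stripe_unit, stripe_count)."""
    object_size = 1 << order                  # object size = 2^order bytes
    if stripe_unit is None:
        stripe_unit = object_size             # assumed default: one stripe unit per object
    if stripe_unit <= 0 or stripe_unit > object_size:
        raise ValueError("stripe unit must be in (0, object size]")
    if object_size % stripe_unit != 0:
        raise ValueError("object size must be a multiple of the stripe unit")
    if stripe_count < 1:
        raise ValueError("stripe count must be at least 1")
    return object_size, stripe_unit, stripe_count

default_layout = check_layout(22)               # 4M objects, no extra striping
striped_layout = check_layout(22, 1 << 20, 4)   # 1M stripe units over 4 objects
```

A 3M stripe unit with 4M objects is rejected, because 4M is not a multiple of 3M.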

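Layer 2: librados

librados creates the objects in RADOS. The object size is determined by order; with the default order = 22 the object size is 4MB.

librados controls which stripe is written to which OSD (stripe → written to which object → located on which OSD).

librados splits a block of binary data (a file) into multiple objects, so that concurrent reads and writes of one file are spread over multiple objects. This prevents a single node from becoming the bottleneck when a file is very large or very busy. An object can be further divided into multiple stripe units. The stripe unit is the basic unit in which librados writes data through the OSDs. The benefit is that, for the same number of objects, the granularity of parallel reads and writes shrinks from an object to a stripe unit, which improves read and write efficiency.

Ceph's striping behavior (how stripes are laid out and written) is controlled by three parameters:

order: A RADOS object is 2^[order] bytes. The default order is 22, which gives an object size of 4MB. The minimum is 4K, the maximum 32M, and the default 4M.
stripe_unit: The size of a stripe unit. Each run of [stripe_unit] contiguous bytes is stored contiguously in the same object; after writing one stripe unit the client moves to the next object and writes the next stripe unit there. The default is the object size, in which case one stripe unit is one object.
stripe_count: After writing [stripe_unit] bytes to each of [stripe_count] objects, Ceph loops back to the first of those objects and writes another stripe, until the objects reach their maximum size. At that point Ceph moves on to the next [stripe_count] objects. The default is 1.

Take the figure below as an example:

(1) The client data is stored in 8 RADOS objects (client data size divided by 2^order).

(2) stripe_unit is one quarter of the object size, so each object holds 4 stripe units.

(3) stripe_count is 4, so each object set contains four objects. The client writes stripe units to the objects of an object set in turn, in a cycle of 4; after the 16th stripe unit it writes the second object set in the same way.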
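The layout in this example can be reproduced with a short calculation that maps a byte offset in the client data to an object number and an offset inside that object. The sketch follows the striping rules described above; locate is a hypothetical helper name, and sizes are given in stripe units to keep the numbers small.

```python
def locate(offset, object_size, stripe_unit, stripe_count):
    """Map a logical offset to (object number, offset inside that object)."""
    units_per_object = object_size // stripe_unit
    unit_no = offset // stripe_unit            # which stripe unit overall
    stripe_no = unit_no // stripe_count        # which stripe (row across the object set)
    stripe_pos = unit_no % stripe_count        # which object within the object set
    object_set = stripe_no // units_per_object
    object_no = object_set * stripe_count + stripe_pos
    offset_in_object = (stripe_no % units_per_object) * stripe_unit + offset % stripe_unit
    return object_no, offset_in_object

# The example above: object size = 4 stripe units, stripe_count = 4,
# 32 stripe units of client data.
layout = [locate(off, 4, 1, 4) for off in range(32)]
objects_used = sorted({obj for obj, _ in layout})
```

Stripe units 0 to 3 go to objects 0 to 3, unit 4 wraps back to object 0, and unit 16 starts the second object set at object 4, so the 32 units occupy 8 objects.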

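The detailed process is as follows:

Continuing from AbstractAioImageWrite::send_request(): once the data has been split into objects, how is it written?

_calc_target is important and is analyzed later. First, the message path in _op_submit that sends the request to the target OSD.

Next, the _calc_target() function in detail.

Layer 3: OSD

The primary OSD calls the filesystem interface to write the binary data to a file on disk (each object corresponds to one file, and the file content is one or more stripe units).

After the primary OSD has written the data, it uses CRUSH to compute the locations of the secondary OSD and the tertiary OSD and copies the object to both. When both are done, it reports to the Ceph client that the object has been stored.

OSD initialization:

The main steps of a write at the OSD layer:

1) rados sends the data to the primary OSD. The primary OSD preprocesses the write and then sends write messages to the replica OSDs so that they update their copies of the PG;

2) Each replica OSD writes the operation to its journal through FileJournal, then tells the primary OSD it is done and proceeds to step 4);

3) Once the primary OSD has received completion messages from all replica OSDs, it writes its own operation to the journal through FileJournal and then notifies the client that the write is complete;

4) Threads on the primary and replica OSDs then call FileStore to write the journaled data to the underlying filesystem.

The OSD receiving the write message sent by RADOS: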

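PG data recovery
PG-related concepts

1. Recovering and backfill

Both operations restore data after a change in the OSDs. A full copy is called backfill; an incremental repair is called recovering.

OSDs in a cluster may be unstable. For example:

a. osd.n goes offline for reasons beyond anyone's control and comes back within a certain time. While it was down, any PG on osd.n may have received writes, so osd.n now holds stale data (old_object) that must be brought up to new_object. If an incremental repair is possible, recovering is performed; if not, new_object has to be copied to osd.n in full, which is backfill.

b. osd.n goes offline and does not come back within the allowed time, so a new osd.m is chosen to replace it. The data that lived on osd.n must be copied to osd.m (the data on osd.n can no longer be read, so it is restored from the other replicas). This can only be done by backfill, that is, a full copy.

2. PG Peering

PG state is managed by a mechanism called state_machine. It contains a state called peering, and the evolution of that state forms a process that is also called peering.

Peering is the process of synchronizing state information among all replicas of a PG, which includes exchanging that information.

3. acting set, up set, pg_temp

Every PG has these two OSD sets. The acting set is the set of OSDs that hold the replicas of the PG. For example, acting [0,1,2] means the replicas of this PG are on OSD.0, OSD.1 and OSD.2, and OSD.0, listed first, holds the primary replica. Normally the up set and the acting set are identical.

pg_temp: Suppose a PG is short of replicas, with acting/up = [1,2]/[1,2], and OSD.3 is added as a replica. CRUSH determines that OSD.3 should be the primary of this PG, but OSD.3 does not yet hold any of the PG's data and cannot act as primary. A pg_temp is therefore requested that keeps OSD.1 as primary. At this point the up set is [3,1,2] and the acting set, supplied by pg_temp, is [1,2,3]. Once all data has been restored on OSD.3, the sets become [3,1,2]/[3,1,2].

So the up set is the OSD set that CRUSH computes as the final placement, and the acting set is the set that actually serves the PG, which may temporarily differ from the up set during the transition.

4. epoch

Whenever the OSDs in the cluster change, a new OSD map is produced. Every OSD map has an epoch, and epochs increase monotonically over time. The larger the epoch, the newer the OSD map.

5. current_interval, past_interval

Every PG has intervals.

An interval is a sequence of epochs: the epochs during which the OSD membership of the PG did not change. When the membership of the PG changes, the PG starts a new interval.

6. last_epoch_started, last_epoch_clean

last_epoch_started: the OSD map epoch after peering completed.

last_epoch_clean: the OSD map epoch after data recovery (recovery or backfill) completed.

A PG peers first and recovers data afterwards, so in general last_epoch_clean >= last_epoch_started.

7. pg_log

pg_log is an important structure for data recovery; every PG has its own log.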

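PG state machine

All PG state is managed by a class called RecoveryMachine. The states form a tree-like structure in which a state may have sub-states, as in the figure below:

The transitions between states are shown in the figure below:

1. PG states

Creating: the PG is being created;
Peering: a process in which all OSDs of a PG communicate with each other to agree on the state of the PG's objects and their metadata. A PG in this state cannot serve I/O requests. Peering is in effect the PG's transition from its initial state to active+clean. After an OSD starts, its PGs begin in the initial state; the pg log and pg_info on all OSDs are compared, all PG information is synchronized, and the primary and replica OSDs are determined. When peering ends, its result is handed to recovering, which restores the data;
Active: once peering completes, the PG is active. In this state the PG data on the primary and replica OSDs is available;
Clean: the primary and replica OSDs have completed peering and every replica is in place;
Down: the PG is offline because a replica holding some of its critical data (such as the pg log and pg info, which are also stored on OSDs) is down;
Degraded: when an OSD is found to have stopped (down), the Ceph MON marks all PGs on that OSD as degraded, and the peer OSDs continue to serve data. There are two possible outcomes. Either the OSD comes back (for example after a reboot), goes through peering again to reach clean, and Ceph starts recovering to bring the stale data on that OSD up to date; or the OSD stays down for 300 seconds and is marked out, in which case Ceph chooses another OSD to join the acting set and backfills data to the new OSD to bring the PG back to the required number of replicas;
Recovering: after an OSD goes down, the PG contents on it fall behind the replicas on other OSDs. When it restarts (for example after a reboot), Ceph starts recovering to bring its data up to date;
Backfilling: after a new OSD joins the cluster, Ceph tries to move some PGs from other OSDs to it, which is called backfill. Compared with recovery, backfill is a full copy onto an OSD with no data, whereas recovery is an incremental repair on top of existing data;
Remapped: whenever the acting set of a PG changes, data migrates from the old acting set to the new one. Until this finishes, the primary OSD of the old acting set keeps serving requests. Once it finishes, Ceph uses the primary OSD of the new acting set;
Stale: OSDs report their state to the MON every 0.5 seconds. If for any reason the primary OSD fails to report, or other OSDs have reported the primary OSD as down, the Ceph MON marks its PGs as stale.

The cluster is HEALTH_OK only when all PGs are active + clean.

2. PG creation

The PGMonitor on the MON node notices that a pool has been created and checks whether the pool has PGs. If it does, each PG is checked for existence, and for any PG that does not exist the creation process below begins;
At the start of creation, the PG state is set to Creating and the PG is added to the creating_pgs queue to wait for processing;
When processing starts, CRUSH is used with the current OSD map to find the up/acting set, and the PG is added to creating_pgs_by_osd, a queue in the PG map indexed by the OSDs in that set (it appears to be added only to the primary OSD's queue);
The queue handler merges the PGs to be created on that OSD into an MOSDPGCreate message and sends it to the OSD over the message channel;
The OSD receives the message of type MSG_OSD_PG_CREATE, reads the PGs to be created from it, checks the type, obtains the other OSDs of each PG, and adds it to the creating_pgs queue (it appears that the primary OSD initiates PG creation on the replica OSDs), then creates the actual PG;
Once the PG has been created, peering begins.

PG creation also involves how many PGs each pool should have and how the PG map to OSD map relation is established; for more code details see the section "Mappings involved in a client write".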

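Data recovery

Data recovery has two major phases: PG peering, and recovering/backfill.

1. PG Peering

The handling of PG peering is analyzed below:

Peering mainly performs the following preparation:

Determine the set of OSDs taking part in peering
Merge the most authoritative log from that set
Determine the objects each OSD is missing and needs to recover
Determine from which OSD each object to recover can be copied

The handler for the replies: PG::RecoveryState::GetInfo::react(const MNotifyRec& infoevt)

Enter the next state, GetLog

With the work in GetLog done, a log request is sent to best_osd and the PG waits for its reply.

best_osd replies, handled by PG::RecoveryState::GetLog::react(const GotLog&):

Next comes the GetMissing handling,

(PG::RecoveryState::GetMissing::GetMissing(my_context ctx)):

Handling in the active state, PG::RecoveryState::Active::Active(my_context ctx):

2. PG recovering or backfill

After all OSDs have sent back their acks (the log differences having been sent to them), processing enters PG::RecoveryState::Active::react(const AllReplicasActivated &evt):

The operation above adds the PG to osd->recovery_queue, a work queue served by dedicated threads. The main function of those threads is void OSD::do_recovery(PG *pg, ThreadPool::TPHandle &handle):

Summary:

All data recovery goes through the primary OSD:

1) If the primary OSD is missing an object, the primary OSD pulls the object data from a replica OSD;

2) If a replica OSD is missing an object, the primary OSD pushes the object data to that replica OSD;

3) If the primary OSD and some replica OSDs are all missing an object, the primary OSD first pulls the data from a healthy replica OSD and recovers locally, then pushes the data to the OSDs that need it in a later pass.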
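The three rules in the summary can be expressed as a small planning function. It is a sketch of the decision logic only, not of OSD::do_recovery: plan_recovery, the OSD numbers and the object names are all hypothetical, and the missing sets stand in for what peering derives from the PG logs.

```python
def plan_recovery(primary, missing):
    """Plan recovery for one PG.

    missing maps each OSD id to the set of object names it lacks.
    Returns (pulls, pushes): pulls are (object, source_osd) pairs executed by
    the primary first; pushes are (object, target_osd) pairs executed afterwards.
    """
    replicas = sorted(osd for osd in missing if osd != primary)
    primary_missing = missing.get(primary, set())

    pulls = []
    for obj in sorted(primary_missing):
        sources = [osd for osd in replicas if obj not in missing[osd]]
        if sources:                       # rules 1 and 3: primary pulls from a good replica
            pulls.append((obj, sources[0]))
    pulled = {obj for obj, _ in pulls}

    pushes = []
    for osd in replicas:
        for obj in sorted(missing[osd]):  # rules 2 and 3: primary pushes what it now has
            if obj not in primary_missing or obj in pulled:
                pushes.append((obj, osd))
    return pulls, pushes

# osd.0 is primary and lacks obj1; osd.2 lacks obj1 and obj2; osd.1 is complete.
pulls, pushes = plan_recovery(0, {0: {"obj1"}, 1: set(), 2: {"obj1", "obj2"}})
```

In this example the primary pulls obj1 from osd.1 and afterwards pushes obj1 and obj2 to osd.2; no replica ever copies directly from another replica.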

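The following example illustrates the recovery process (from the blog of "一只小江"):

1) Initial state

Suppose a PG with three replicas on osd.0, osd.1 and osd.2. The up set is [0,1,2], the acting set is [0,1,2], and acting_primary is osd 0.

Suppose object1 has already been written to this PG, so osd0, osd1 and osd2 all hold object1.

The PG is in interval 0;

2) osd3 joins

When osd3 is added to the cluster, data in some PGs moves. This PG happens to have its up set changed from [0,1,2] to [0,1,3]; the acting set also becomes [0,1,3], and acting_primary is osd 0. But osd.3 does not hold object1, so osd.0 copies object1 (drawn with a dashed border on a yellow background) to osd.3. Whether this is recovery or backfill is decided from the pg log. While the copy is in progress, a write arrives.

After osd3 joins, the up set of the PG changes from [0,1,2] to [0,1,3], so the PG is now in interval 1 (in this walkthrough the interval advances with changes to the up set rather than the acting set);

3) object2 is written

A write to the PG writes three replicas, to osd0, osd1 and osd3. osd0 and osd1 now hold object1 and object2, but object1 is still being recovered on osd3, so osd3 holds only object2.

The up set has not changed, so the PG is still in interval 1;

4) osd4 joins

Suppose osd4 is added before object1 has finished recovering on osd3, and suppose that osd3 happens to be selected as the primary OSD of the PG. The up set of the PG changes from [0,1,3] to [0,4,3], so the PG enters interval 2.

osd0 holds object1 and object2. osd3 holds object2 and needs object1. osd4 needs object1 and object2. The acting set is now [0,4,3] and acting_primary is osd 0.

The PG re-enters data recovery to restore the data on osd4 and osd3.

5) object3 is written

Suppose object3 is written before recovery on osd4 and osd3 has finished.

object3 is written to osd0, osd4 and osd3. When that write completes, osd4 still needs object1 and object2, and osd3 still needs object1.

6) osd5 joins

Suppose osd5 joins before recovery on osd4 and osd3 has finished, changing the up set of the PG from [0,4,3] to [5,4,3].

Because the up set changed, the PG enters interval 3.

osd5, osd4 and osd3 now all have data to recover. The acting set has to draw on the acting set of interval 2 ([0,4,3]); so that the data can be recovered, the acting set becomes [0,4,5,3].

When choosing acting_primary: if the PG enters recovering, osd5 is chosen as acting_primary; if the PG enters backfill, osd0 is chosen.

Assume backfill here, so osd0 is elected acting_primary.

The role of acting_primary is to handle client requests. osd3, osd4 and osd5 all have objects to recover, and data recovery starts again.

7) object4 is written

osd4, osd5 and osd3 all have data to recover and osd0 is acting_primary, so osd0 receives the data and writes four copies, to osd0, osd3, osd4 and osd5. osd0 now holds object1, object2, object3 and object4. osd3 holds object2, object3 and object4 and needs object1. osd4 holds object3 and object4 and needs object1 and object2. osd5 holds only object4 and needs object1, object2 and object3.

8) Wait for recovery to finish

When recovery completes, an event is sent to the PGMonitor and peering starts again. The up set is now [5,4,3], and since recovery is complete there is no need to draw on interval 2, so the acting set is [5,4,3]. osd5 is elected acting_primary from the acting set and osd0 is removed from the PG. From then on osd5 handles client requests.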

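PG scrubbing

Scrubbing is the data check that Ceph performs per PG to ensure data integrity. Its role is similar to the fsck tool of a filesystem.

Ceph OSDs periodically start a scrub thread that scans some objects and compares them with the other replicas. If an inconsistency is found, an error is raised for the user to resolve manually. An administrator can also start a scrub by hand.

Scrub works per PG. For each PG, Ceph examines all objects in it and builds a data structure resembling a digest of metadata, such as object size and attributes, called a scrubmap. The primary and replica scrubmaps are compared to detect missing or mismatched objects.
Ceph examines the objects of a PG in one of two ways:
- light scrubbing: compares object sizes and attributes; usually run daily
- deep scrubbing: reads the object data and compares checksums; usually run weekly
Scrubbing has to extract verification data for each object and compare it with the other replicas, and the object being verified must not change in the meantime, so write requests are blocked. Ceph therefore has two modes, classic and chunky:
- Chunky scrub: each comparison covers only a subset of objects, so only writes to that small subset are blocked
- Classic scrub: blocks all write requests

This mechanism is essential for data integrity, but it also consumes a large amount of cluster resources and blocks writes to some objects, which lowers cluster performance, especially when several OSDs on the same OSD server deep-scrub at the same time.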
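The scrubmap comparison described above can be sketched as follows. This is a toy model, not the OSD code: objects are held in plain dictionaries, zlib.crc32 stands in for whatever checksum Ceph uses during deep scrub, and the function names are hypothetical.

```python
import zlib

def build_scrub_map(objects, deep=False):
    """Summarize objects as name -> (size, digest); the digest is only computed for deep scrub."""
    return {name: (len(data), zlib.crc32(data) if deep else None)
            for name, data in objects.items()}

def compare_scrub_maps(primary_map, replica_maps):
    """Return (osd, object, problem) tuples for every inconsistency found."""
    problems = []
    for osd, replica_map in sorted(replica_maps.items()):
        for name in sorted(set(primary_map) | set(replica_map)):
            if name not in replica_map:
                problems.append((osd, name, "missing on replica"))
            elif name not in primary_map:
                problems.append((osd, name, "missing on primary"))
            elif primary_map[name] != replica_map[name]:
                problems.append((osd, name, "mismatch"))
    return problems

primary = {"a": b"1234", "b": b"xy"}
replicas = {1: {"a": b"1243", "b": b"xy"},   # same size as the primary, different content
            2: {"a": b"1234"}}               # object b is missing

light = compare_scrub_maps(build_scrub_map(primary),
                           {osd: build_scrub_map(objs) for osd, objs in replicas.items()})
deep = compare_scrub_maps(build_scrub_map(primary, deep=True),
                          {osd: build_scrub_map(objs, deep=True) for osd, objs in replicas.items()})
```

The light pass only notices the missing object on osd 2, because the corrupted copy on osd 1 has the right size; the deep pass also reports the content mismatch.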

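CRUSH algorithm

Two functions met in the analysis so far have not been examined in detail:

One is crush_do_rule(), the core of establishing the PG-to-OSD relation when the monitor initializes. The other is _calc_target(), the core of finding the primary OSD an object should be written to before the write. Understanding either function requires a thorough understanding of how CRUSH works.

Large systems are inherently dynamic, so CRUSH must handle the addition and removal of storage devices while minimizing the data migration they cause. To achieve this, Ceph has to balance both the computational load and the stored data across nodes. Simple hash-based distribution cannot cope with changes in the number of devices and causes massive data migration. Ceph therefore developed CRUSH (Controlled Replication Under Scalable Hashing), a pseudo-random data distribution algorithm that distributes objects efficiently across a hierarchically structured storage cluster.

CRUSH has two key advantages:

Any node can independently compute the location of any object (decentralization)
Very little metadata (the cluster map) is needed, and it changes only when devices are added or removed

The goals of CRUSH are to distribute data so that available resources are used optimally, to reorganize data efficiently when storage devices are added or removed, and to place object replicas under flexible constraints that maximize data safety in the presence of coincident or correlated hardware failures.

CRUSH supports a wide variety of data safety mechanisms, including n-way replication, RAID parity schemes or other forms of erasure coding, and hybrid approaches (such as RAID-10).

CRUSH implements a pseudo-random (deterministic) function that takes an object id or object group id as its parameter and returns a set of storage devices (the OSDs that store the object):

CRUSH uses a multi-parameter hash function whose parameters include x, so the mapping from x to the OSD set is deterministic and independent. CRUSH is pseudo-random: the results for similar inputs are uncorrelated.

CRUSH takes three inputs:

cluster map: includes the OSD map, PG map and CRUSH map, describing the state and logical structure of the cluster;
placement rules: the data mapping rules, that is, the rulesets in the CRUSH map, for example how many replicas an object has and what constraints apply to them (such as 3 replicas placed in different racks);
x: the object id or object group id.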

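CRUSH MAP

A CRUSH map is composed of devices and buckets, both of which have numerical identifiers and weight values.

A bucket can contain any number of devices or other buckets, which lets buckets form the interior nodes of a storage hierarchy in which devices are always at the leaves. Storage device weights are set by the administrator to control the relative amount of data each device is responsible for storing.

Although the devices in a large system differ in capacity and performance, a randomized data distribution statistically correlates device utilization with workload, so that average device load is proportional to the amount of data stored. The weight of a bucket is the sum of the weights of the items it contains.

Buckets can be composed arbitrarily to build a hierarchy representing the available storage. For example, a cluster map can use "host" buckets at the lowest level to contain the disks of each host, and "rack" buckets to contain the hosts mounted in the same rack. In a large system, "rack" buckets may in turn be contained in "row" or "room" buckets. Data is placed in the hierarchy by recursively selecting nested bucket items with a pseudo-random hash-like function. With conventional hashing techniques, any change in the number of target devices causes massive data migration; CRUSH instead is based on four bucket types, each with its own selection algorithm, to address the data movement caused by adding or removing devices and the overall computational complexity.

Data mapping rules (ruleset, replica placement)

The code corresponding to the pseudocode of Algorithm 1:

Pseudocode analysis 1: the structure of a typical placement rule

take(a): selects an item, usually a bucket, and returns all items that the bucket contains. These items are the input to subsequent operations and form the vector i.
select(n, t): iterates over each item in the vector i and, descending through the items nested beneath it, returns n distinct items of type t, which are placed into the vector i. select calls the function c(r, x), which pseudo-randomly chooses one item in each bucket.
emit: moves the vector i into the result.
A placement rule may contain several {take, select, emit} blocks so that storage targets can be drawn from different pools of storage.

In the example in the table above, the rule starts at the root of the hierarchy shown in the figure above. The first select(1,row) chooses a single bucket of type row. The following select(3,cabinet) chooses three distinct cabinets nested beneath that row, row2 (cab21, cab23, cab24), and the final select(1,disk) iterates over the three cabinet buckets in the input vector and chooses a single disk nested beneath each of them. The result is three disks spread over three cabinets, but all in the same row. This approach therefore allows replicas to be simultaneously separated across and constrained within container types.

The ruleset format in the CRUSH map is as follows:

The code below implements the handling of the format above:

Pseudocode analysis 2: collisions, failure and overload

Collision: the item is already in the vector i, that is, it has already been selected
Failure: the device has failed and cannot be selected
Overload: the device's used capacity is above the warning threshold and it has no room left for data objects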

During this process, CRUSH may reject and reselect items using a modified input r′ for three different reasons:

if an item has already been selected in the current set (a collision; the select(n,t) result must be distinct),
if a device is failed, or
if a device is overloaded.

Failed or overloaded devices are marked as such in the cluster map, but left in the hierarchy to avoid unnecessary shifting of data. CRUSH selectively diverts a fraction of an overloaded device's data by pseudo-randomly rejecting with the probability specified in the cluster map, typically related to its reported over-utilization.

Algorithm 1 line 11: For failed or overloaded devices, CRUSH uniformly redistributes items across the storage cluster by restarting the recursion at the beginning of the select(n,t).

Algorithm 1 line 14: In the case of collisions, an alternate r′ is used first at inner levels of the recursion to attempt a local search and avoid skewing the overall data distribution away from subtrees where collisions are more probable (e.g., where buckets are smaller than n).

Pseudocode analysis 3: replica ranks

Parity and erasure coding schemes have slightly different placement requirements than replication.

Algorithm 1 line 16: In primary copy replication schemes, it is often desirable after a failure for a previous replica target (that already has a copy of the data) to become the new primary. In such situations, CRUSH can use the "first n" suitable targets by reselecting using r′ = r + f, where f is the number of failed placement attempts by the current select(n,t).

Algorithm 1 line 18: With parity and erasure coding schemes, however, the rank or position of a storage device in the CRUSH output is critical because each target stores different bits of the data object. In particular, if a storage device fails, it should be replaced in CRUSH's output list R in place, such that other devices in the list retain the same rank (i.e. position in R, see Figure 2). In such cases, CRUSH reselects using r′ = r + f_r · n, where f_r is the number of failed attempts on r, thus defining a sequence of candidates for each replica rank that are probabilistically independent of others' failures.
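The two reselection formulas are easy to compare side by side. The sketch below only evaluates r′ for a few failure counts; the function names are hypothetical, and n = 3 is an arbitrary replica count.

```python
def firstn_retry(r, failures):
    """Replication ("first n"): r' = r + f, where f counts failures in the current select."""
    return r + failures

def indep_retry(r, failures_on_r, n):
    """Parity / erasure coding: r' = r + f_r * n, where f_r counts failures on rank r only."""
    return r + failures_on_r * n

n = 3
# Candidate sequence for each rank under the positionally stable scheme.
indep_candidates = {r: [indep_retry(r, f, n) for f in range(4)] for r in range(n)}
# Candidate sequence for rank 1 under "first n".
firstn_candidates = [firstn_retry(1, f) for f in range(4)]
```

With n = 3 the ranks draw from 0, 3, 6, 9 and 1, 4, 7, 10 and 2, 5, 8, 11, so a failure on one rank never changes the candidates of another. Under "first n" a failure shifts rank 1 to the values 2, 3, 4, which are values the later ranks would otherwise use.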

See Figure 2: In contrast, RUSH has no special handling of failed devices; like other existing hashing distribution functions, it implicitly assumes the use of a "first n" approach to skip over failed devices in the result, making it unwieldy for parity schemes.

Pseudocode analysis 4: bucket types and their data mapping rules

Generally speaking, CRUSH is designed to reconcile two competing goals: efficiency and scalability of the mapping algorithm, and minimal data migration to restore a balanced distribution when the cluster changes due to the addition or removal of devices. To this end, CRUSH defines four different kinds of buckets to represent internal (non-leaf) nodes in the cluster hierarchy: uniform buckets, list buckets, tree buckets, and straw buckets. Each bucket type is based on a different internal data structure and utilizes a different function c(r, x) for pseudo-randomly choosing nested items during the replica placement process, representing a different tradeoff between computation and reorganization efficiency.

Uniform buckets are restricted in that they must contain items that are all of the same weight (much like a conventional hash-based distribution function), while the other bucket types can contain a mix of items with any combination of weights.

1. uniform buckets

Devices are rarely added individually in a large system. Instead, new storage is typically deployed in blocks of identical devices, often as an additional shelf in a server rack or perhaps an entire cabinet. Devices reaching their end of life are often similarly decommissioned as a set (individual failures aside), making it natural to treat them as a unit. CRUSH uniform buckets are used to represent an identical set of devices in such circumstances. The key advantage in doing so is performance related: CRUSH can map replicas into uniform buckets in constant time. In cases where the uniformity restrictions are not appropriate, other bucket types can be used.

Given a CRUSH input value of x and a replica number r, we choose an item from a uniform bucket of size m using the function c(r,x) = (hash(x) + rp) mod m, where p is a randomly (but deterministically) chosen prime number greater than m.

For any r ≤ m we can show that we will always select a distinct item using a few simple number theory lemmas.
For r > m this guarantee no longer holds, meaning two different replicas r with the same input x may resolve to the same item.

In practice, this means nothing more than a non-zero probability of collisions and subsequent backtracking by the placement algorithm.

If the size of a uniform bucket changes, there is a complete reshuffling of data between devices, much like conventional hash-based distribution strategies.
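The uniform-bucket function c(r,x) = (hash(x) + rp) mod m can be tried out directly. In this sketch zlib.crc32 stands in for the CRUSH hash, and m = 6 with p = 7 are arbitrary illustrative values (p only needs to be a prime greater than m).

```python
import zlib

def uniform_choose(x, r, m, p):
    """c(r, x) = (hash(x) + r * p) mod m for a uniform bucket of m items."""
    h = zlib.crc32(str(x).encode())   # stand-in for CRUSH's hash of x
    return (h + r * p) % m

m, p = 6, 7
first_m = [uniform_choose(42, r, m, p) for r in range(m)]
wrapped = uniform_choose(42, m, m, p)   # one replica number past the guarantee
```

Because p and m share no common factor, the first m replica numbers hit m different items. One step further, r = m lands on the same item as r = 0, which is the collision case that the placement algorithm resolves by backtracking.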

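So a uniform bucket is suitable in the following cases:

All child nodes have the same weight, and items are rarely added to or removed from the bucket

Lookup is fastest in this case. A uniform bucket ignores weights when choosing a child and selects purely at random, so no weight receives special treatment; for fairness the children should all have the same weight.

The set of child nodes rarely changes

When the number of children changes, size changes, and the perm array must be completely rearranged even for the same x. This means the data already stored on the children is reshuffled, causing a large amount of migration. A uniform bucket is therefore unsuitable when its children change, because a large amount of stored data would move and data could move between any pair of items.

Code analysis:

All children of the bucket are stored in the item[] array. perm_x records the value of x used for the current random arrangement, and perm[] is the result of randomly permuting the items for perm_x. r is the index of the replica.

The logic of this function is shown in the figure below: check whether the input x equals perm_x. If it does not, rebuild the perm[] array and record perm_x = x. Then compute R = r % size and return perm[R].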

2. list bucket

List buckets structure their contents as a linked list, and can contain items with arbitrary weights. To place a replica, CRUSH begins at the head of the list with the most recently added item and compares its weight to the sum of all remaining items' weights. Depending on the value of hash(x, r, item), either the current item is chosen with the appropriate probability, or the process continues recursively down the list. This approach, derived from RUSH_P, recasts the placement question into that of "most recently added item, or older items?" This is a natural and intuitive choice for an expanding cluster: either an object is relocated to the newest device with some appropriate probability, or it remains on the older devices as before. The result is optimal data migration when items are added to the bucket. Items removed from the middle or tail of the list, however, can result in a significant amount of unnecessary movement, making list buckets most suitable for circumstances in which they never (or very rarely) shrink.

The RUSH_P algorithm is approximately equivalent to a two-level CRUSH hierarchy consisting of a single list bucket containing many uniform buckets.

So a list bucket is suitable for:

a. Expanding clusters

Adding an item causes optimal data movement. A new item in a list bucket is always appended at the tail, so the sum_weight of every other node is unchanged; the sum_weight of new_tail is simply the sum_weight of old_tail plus the weight of the new item. No data has to move between the other items. Data on the other items only needs to be compared against the tail: if the computed value w is smaller than the weight of the tail, the data moves to the tail; otherwise it stays on its original item. This gives the smallest possible data movement.

b. A list bucket has one drawback: finding an item requires a sequential search, with time complexity O(n). It is therefore only suitable for small sets of nodes.

The theory behind choosing an item from a list bucket:

The search for a replica location starts from the tail of the list, that is, the most recently added item. CRUSH takes the weight Wh of the current item and the cumulative weight Ws of that item plus all items remaining in the list, and uses hash(x, r, i), where i is the item id, to obtain a value v in [0, 1). If v falls in [0, Wh/Ws), this item is chosen as the OSD for the replica and its id is returned. Otherwise the walk continues through the rest of the list.
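The list-bucket walk can be sketched as follows. It mirrors the theory above, with the array tail playing the role of the most recently added item, but it is not the Ceph source: SHA-256 replaces the CRUSH hash, the 16-bit scaling is a simplification, and the item names and integer weights are made up.

```python
import hashlib

def hash16(x, item, r):
    """Stand-in for the CRUSH hash, reduced to 16 bits (0..65535)."""
    digest = hashlib.sha256(f"{x}:{item}:{r}".encode()).digest()
    return int.from_bytes(digest[:2], "big")

def list_choose(x, r, items, weights):
    """Walk from the newest item (tail) to the oldest; weights are positive integers."""
    sum_weights, total = [], 0
    for w in weights:                 # sum_weights[i] = weight of items[0..i]
        total += w
        sum_weights.append(total)
    for i in range(len(items) - 1, -1, -1):
        # Scale the hash into [0, sum_weights[i]) and accept with probability Wh / Ws.
        w = (hash16(x, items[i], r) * sum_weights[i]) >> 16
        if w < weights[i]:
            return items[i]
    return items[0]

items, weights = ["osd.0", "osd.1", "osd.2"], [100, 100, 200]
grown_items, grown_weights = items + ["osd.3"], weights + [100]
placements = [(list_choose(x, 0, items, weights),
               list_choose(x, 0, grown_items, grown_weights)) for x in range(500)]
```

After osd.3 is appended, every input either keeps its old item or moves to osd.3; nothing moves between the older items, which is the optimal-migration property claimed for expansion.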

The code logic for choosing an item in a list bucket:

3. tree bucket

Like any linked list data structure, list buckets are efficient for small sets of items but may not be appropriate for large sets, where their O(n) running time may be excessive. Tree buckets, derived from RUSH_T, address this problem by storing their items in a binary tree. This reduces the placement time to O(log n), making them suitable for managing much larger sets of devices or nested buckets. RUSH_T is equivalent to a two-level CRUSH hierarchy consisting of a single tree bucket containing many uniform buckets.

Tree buckets are structured as a weighted binary search tree with items at the leaves. Each interior node knows the total weight of its left and right subtrees and is labeled according to a fixed strategy (described below). In order to select an item within a bucket, CRUSH starts at the root of the tree and calculates the hash of the input key x, replica number r, the bucket identifier, and the label at the current tree node (initially the root). The result is compared to the weight ratio of the left and right subtrees to decide which child node to visit next. This process is repeated until a leaf node is reached, at which point the associated item in the bucket is chosen. Only log n hashes and node comparisons are needed to locate an item.

The bucket's binary tree nodes are labeled with binary values using a simple, fixed strategy designed to avoid label changes when the tree grows or shrinks. The leftmost leaf in the tree is always labeled "1." Each time the tree is expanded, the old root becomes the new root's left child, and the new root node is labeled with the old root's label shifted one bit to the left (1, 10, 100, etc.). The labels for the right side of the tree mirror those on the left side except with a "1" prepended to each value. A labeled binary tree with six leaves is shown in Figure 4.

This strategy ensures that as new items are added to (or removed from) the bucket and the tree grows (or shrinks), the path taken through the binary tree for any existing leaf item only changes by adding (or removing) additional nodes at the root, at the beginning of the placement decision tree. Once an object is placed in a particular subtree, its final mapping will depend only on the weights and node labels within that subtree and will not change as long as that subtree's items remain fixed.

Although the hierarchical decision tree introduces some additional data migration between nested items, this strategy keeps movement to a reasonable level, while offering efficient mapping even for very large buckets.

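The key property of a tree bucket is that the node_id of every other node in the decision tree stays the same when leaf nodes are added or removed. A node's node_id is determined by an in-order traversal of the binary tree (node_id is neither the item's id nor the node's weight).

The theory behind selecting an item in a tree bucket:

CRUSH starts looking for the replica's position at the root. At each node it takes the weight Wl of the node's left subtree and the weight Wn of the node itself, then uses hash(x, r, node_id) to obtain a value v in [0, 1). If v falls in [0, Wl/Wn), the replica is in the left subtree; otherwise it is in the right subtree. The descent continues until a leaf node is reached.

Code logic:

A tree bucket uses an array called node_weight[ ] to help locate an item.

node_weight[ ] contains not only the items but also many intermediate nodes; the items are all leaf nodes. A parent's weight equals the sum of the weights of its left and right children, recursively up to the root, as shown in the figure below: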
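
A minimal Python sketch of this layout and of the root-to-leaf descent follows. It is not Ceph's C code. The index arithmetic is an assumption modelled on the labelling strategy quoted above: leaves sit at the odd indices 1, 3, 5, ..., interior nodes at the even indices, and the root at half the array length, so index 0 is unused. The unit_hash helper is a SHA-256 stand-in for CRUSH's hash, and the item ids and weights are made up.

```python
import hashlib


def unit_hash(*parts):
    """Stand-in hash mapping its inputs to a float in [0, 1); not Ceph's hash."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def height(n):
    """Height of node n: its number of trailing zero bits (leaves are odd)."""
    h = 0
    while n & 1 == 0:
        h += 1
        n >>= 1
    return h


def left(n):
    return n - (1 << (height(n) - 1))


def right(n):
    return n + (1 << (height(n) - 1))


def build_node_weights(item_weights):
    """Place item i at leaf index 2*i + 1 and sum weights up to the root."""
    depth = 1
    while (1 << (depth - 1)) < len(item_weights):
        depth += 1
    num_nodes = 1 << depth
    node_weights = [0] * num_nodes
    for i, w in enumerate(item_weights):
        node_weights[2 * i + 1] = w
    for h in range(1, depth):               # fill interior nodes bottom-up
        step = 1 << h
        for n in range(step, num_nodes, 2 * step):
            node_weights[n] = node_weights[left(n)] + node_weights[right(n)]
    return node_weights


def tree_choose(node_weights, items, x, r):
    n = len(node_weights) >> 1              # start at the root
    while n & 1 == 0:                       # even index: interior node
        t = unit_hash(x, r, n) * node_weights[n]
        l = left(n)
        # t < Wl is the same test as v < Wl / Wn.
        n = l if t < node_weights[l] else right(n)
    return items[n >> 1]                    # leaf 2*i + 1 holds item i


items = [100, 101, 102, 103, 104, 105]
node_weights = build_node_weights([1, 2, 3, 4, 5, 6])
picks = [tree_choose(node_weights, items, x, 0) for x in range(1000)]
print(node_weights)
```

With six items the array has 16 slots and the root is index 8. Slots for leaves that do not exist keep weight 0, so the descent never reaches them as long as the bucket's total weight is positive.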

4.straw bucket

List and tree buckets are structured such that a limited number of hash values need to be calculated and compared to weights in order to select a bucket item. In doing so, they divide and conquer in a way that either gives certain items precedence (e.g., those at the beginning of a list) or obviates the need to consider entire subtrees of items at all. That improves the performance of the replica placement process, but can also introduce suboptimal reorganization behavior when the contents of a bucket change due to an addition, removal, or re-weighting of an item.

The straw bucket type allows all items to fairly "compete" against each other for replica placement through a process analogous to a draw of straws. To place a replica, a straw of random length is drawn for each item in the bucket. The item with the longest straw wins. The length of each straw is initially a value in a fixed range, based on a hash of the CRUSH input x, replica number r, and bucket item i. Each straw length is scaled by a factor $f(w_i)$ based on the item's weight so that heavily weighted items are more likely to win the draw, i.e. $c(r, x) = \max_i \left( f(w_i)\,\mathrm{hash}(x, r, i) \right)$.

Although this process is almost twice as slow (on average) as a list bucket and even slower than a tree bucket (which scales logarithmically), straw buckets result in optimal data movement between nested items when modified.

Code logic:
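
The draw itself can be sketched as below. This is a hedged illustration rather than Ceph's implementation: unit_hash is a SHA-256 stand-in for CRUSH's hash, and the scaling factors f(w_i) are passed in as hypothetical, precomputed values, because deriving them from the weights is a separate procedure not shown here. The example also checks the property claimed for straw buckets, that when an item is removed, only the objects that lived on it move.

```python
import hashlib


def unit_hash(*parts):
    """Stand-in hash mapping its inputs to a float in [0, 1); not Ceph's hash."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def straw_choose(items, factors, x, r):
    """Return the item whose scaled straw is longest for input x, replica r."""
    best_item, best_length = None, -1.0
    for item in items:
        # Straw length: a per-item hash scaled by that item's factor f(w_i).
        length = factors[item] * unit_hash(x, r, item)
        if length > best_length:
            best_item, best_length = item, length
    return best_item


# Hypothetical factors; in a real bucket they are derived from the weights.
factors = {20: 1.0, 21: 1.0, 22: 1.5, 23: 1.5}
items = [20, 21, 22, 23]
before = {x: straw_choose(items, factors, x, 0) for x in range(2000)}

remaining = [item for item in items if item != 21]   # remove item 21
after = {x: straw_choose(remaining, factors, x, 0) for x in range(2000)}
moved = [x for x in before if before[x] != after[x]]
print(len(moved), "objects moved; all came from item 21:",
      all(before[x] == 21 for x in moved))
```

Each item's straw depends only on that item, so removing one item cannot change which of the remaining straws is longest. The cost is that every placement hashes all n items.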

The choice of bucket type can be guided based on expected cluster growth patterns to trade mapping function computation for data movement efficiency where it is appropriate to do so:

When buckets are expected to be fixed (e.g., a shelf of identical disks), uniform buckets are fastest.
If a bucket is only expected to expand, list buckets provide optimal data movement when new items are added at the head of the list. This allows CRUSH to divert exactly as much data to the new device as is appropriate, without any shuffle between other bucket items. The downside is O(n) mapping speed and extra data movement when older items are removed or reweighted.
In circumstances where removal is expected and reorganization efficiency is critical (e.g., near the root of the storage hierarchy), straw buckets provide optimal migration behavior between subtrees.
Tree buckets are an all-around compromise, providing excellent performance and decent reorganization efficiency.

These differences are summarized in Table 2.

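CRUSH MAP changes and data movement

A typical concern in a large file system is how data is added to and moved among storage resources. To avoid the system stress and under-utilization of resources that an asymmetric load causes, CRUSH aims for a balanced distribution of data and workload.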

When an individual device fails, CRUSH flags the device but leaves it in the hierarchy, where it will be rejected and its contents uniformly redistributed by the placement algorithm (see pseudocode analysis 2). Such cluster map changes result in an optimal (minimum) fraction, $w_{\mathrm{failed}} / W$ (where $W$ is the total weight of all devices), of total data to be remapped to new storage targets because only data on the failed device is moved.

Figure 3:

The situation is more complex when the cluster hierarchy is modified, as with the addition or removal of storage resources. The CRUSH mapping process, which uses the cluster map as a weighted hierarchical decision tree, can result in additional data movement beyond the theoretical optimum of $\Delta w / W$. At each level of the hierarchy, when a shift in relative subtree weights alters the distribution, some data objects must move from subtrees with decreased weight to those with increased weight. Because the pseudo-random placement decision at each node in the hierarchy is statistically independent, data moving into a subtree is uniformly redistributed beneath that point, and does not necessarily get remapped to the leaf item ultimately responsible for the weight change.

Only at subsequent (deeper) levels of the placement process does (often different) data get shifted to maintain the correct overall relative distributions. The amount of data movement in a hierarchy has a lower bound of $\Delta w / W$, the fraction of data that would reside on a newly added device with weight $\Delta w$. Data movement increases with the height $h$ of the hierarchy, with a conservative asymptotic upper bound of $h \, \Delta w / W$. The amount of movement approaches this upper bound when $\Delta w$ is small relative to $W$, because data objects moving into a subtree at each step of the recursion have a very low probability of being mapped to an item with a small relative weight.
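
As a worked example of these two bounds, the short Python sketch below evaluates them for made-up numbers. The function name movement_bounds and the values are assumptions for illustration, and W is taken to be the total weight including the new device, following the description of the lower bound as the fraction of data that would reside on that device.

```python
def movement_bounds(delta_w, total_w, height):
    """Return (lower, upper) bounds on the fraction of data that moves.

    lower = delta_w / total_w          (fraction that belongs on the new device)
    upper = height * delta_w / total_w (conservative asymptotic bound)
    """
    lower = delta_w / total_w
    upper = height * lower
    return lower, upper


# Hypothetical cluster: total weight 200 after adding a device of weight 2,
# in a hierarchy that is 4 levels high.
lower, upper = movement_bounds(delta_w=2.0, total_w=200.0, height=4)
print(f"between {lower:.1%} and {upper:.1%} of the data may move")
```

For these numbers the bounds are 1% and 4% of the data: the new device should end up holding 1% of it, and reshuffling at each of the four levels may move up to that much again.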

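Installation and compilation
1. Installation

Before installing and using Ceph, it helps to learn some basics first. Installation goes much more smoothly once you understand at least the following:

1) The basic principles, elements, and frameworks of distributed systems

2) A distributed storage system is a concrete instance of a distributed system; what sets it apart from a general-purpose distributed system

3) The similarities and differences among several typical distributed storage systems: HDFS, Swift, Ceph

4) Basic concepts: block storage, object storage, file systems

If physical resources are scarce (a distributed system needs at least 3 nodes), virtual machine nodes are a convenient alternative.

On Linux, use the open-source KVM + virt-manager, or commercial software such as VMware vSphere or 汉柏 OPV-Suite.

Installing and using Ceph itself is covered clearly on the official website; refer directly to the official guide: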

http://docs.ceph.com/docs/master/start/

Compiling from source

To do further development or debugging on top of Ceph, compiling the Ceph source code is essential:

1) Download the source

http://ceph.com/resources/downloads/

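There are two ways: fetch the source directly from the git repository, or download a tar package.

Downloading the tar package locally, then unpacking and compiling it, is recommended.

2) Unpack locally

3) Install the libraries Ceph depends on, online

Ceph depends on many libraries, and they must be installed in dependency order, so installing them online is simpler.

If you want to download all the dependencies locally first and then compile, be prepared for the following challenges:

All dependencies must be downloadable (some are very hard to find, and even once found, an extra step is still needed before they can be downloaded from within China)
The dependency relationships among the libraries; the dependency configuration files may need to be edited repeatedly

4) Once the dependencies are installed, compile the source

5) When compilation finishes, enter the src directory to start, check, or stop the cluster

After modifying the code, recompile and restart the cluster (stop the cluster, then start it again).

Debugging and tuning

Getting good performance out of Ceph depends on tuning it for the specific situation and environment.

To be continued.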
