Several Approaches to Ceph Data Rebalancing

When a cluster has just been built, pools are tuned by reweighting OSDs. After several rounds of reweight, a given pool can usually reach a reasonably even distribution across OSDs, but uneven data on OSDs can still appear later on.

Routine operations

When the cluster raises an osd full warning, the usual first step is to check OSD utilization and weights with ceph osd df.
If the data is clearly unbalanced, we normally lower the weight of the over-utilized OSD, wait for the data to rebalance, and then restore the normal weight.

ceph osd reweight 1 0.8
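
Before deciding which OSD to reweight, it helps to list the OSDs sorted by utilization. A minimal sketch, assuming jq is installed; the field names (.nodes, .utilization, .pgs) follow the JSON output of ceph osd df and may differ slightly between versions:

// show the five most-utilized OSDs: name, utilization %, PG count
ceph osd df -f json | jq -r '.nodes | sort_by(-.utilization) | .[:5][] | "\(.name)  \(.utilization)%  \(.pgs) pgs"'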

If recovery gets stuck during peering after the weight change, or if the OSD write threshold simply needs to be raised, adjust the osd full ratio so that writes can continue:

ceph tell mon.* injectargs "--mon-osd-full-ratio 0.96" // default 0.95
ceph tell osd.* injectargs "--mon-osd-full-ratio 0.96"
ceph osd unpause
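
Note that on Luminous and later the full/nearfull ratios are also stored in the OSDMap, so injectargs alone may not be enough there; a sketch of the equivalent OSDMap-level adjustment (the values are examples):

ceph osd set-full-ratio 0.96
ceph osd set-nearfull-ratio 0.90
ceph osd dump | grep ratio    // verify the ratios now recorded in the OSDMap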

Method 1: Upmap

Starting with version 12.2.x, Ceph's osdmaptool can run calculations against a specified osdmap, and combined with the ceph osd pg-upmap-items command this enables manual migration at the level of a single PG. In other words, we can dictate which OSDs a PG is placed on; deliberately changing PG placement like this goes against the spirit of the CRUSH algorithm to some extent.

upmap introduction:

Starting in Luminous v12.2.z there is a new pg-upmap exception table in the OSDMap that allows the cluster to explicitly map specific PGs to specific OSDs. This allows the cluster to fine-tune the data distribution to, in most cases, perfectly distributed PGs across OSDs. The key caveat to this new mechanism is that it requires that all clients understand the new pg-upmap structure in the OSDMap.

Upmap lets us pin PG placement by hand, but it requires clients that can understand the new pg-upmap structure. Unlike placement derived purely from CRUSH, once a PG's location has been changed manually the result can no longer be obtained from the algorithm alone, so a new structure has to be introduced to record the override.

Below is the hands-on part. There are two prerequisites for using upmap: first, the Ceph version must be 12.2.x or later; second, the client feature level must be at least luminous, so that clients can interpret the new pg-upmap structure.

First, export the cluster's osdmap:

ceph osd getmap -o osdmap

Then inspect the PG distribution for the pool:

[root@~]$  osdmaptool --test-map-pgs --pool 5 ./osdmap
osdmaptool: osdmap file './osdmap'
pool 5 pg_num 1024
#osd    count    first    primary    c wt    wt
osd.0    95    36    36    5.45749    1
osd.1    108    37    37    5.45749    1
osd.2    114    34    34    5.45749    1
osd.3    95    25    25    5.45749    1
......
 in 30
 avg 102 stddev 10.6927 (0.10483x) (expected 9.9492 0.0975412x))
 min osd.6 83
 max osd.18 118
size 0    0
size 1    0
size 2    0
size 3    1024

The exported information shows a sizeable gap between the most- and least-loaded OSDs, so a round of PG migration needs to be computed for this pool. But first, since upmap only works on Ceph 12.2.x and later, we set the minimum client version requirement with the following command:

[root@ ~]$ ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it
// check client feature versions
[root@CNSZ335523 ~]$ ceph features
{
    "mon": {
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 1
        }
    },
    "osd": {
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 11
        }
    },
    "client": {
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 5
        }
    }
}

Next, use the exported osdmap to calculate which PGs should be moved.

[root@ ~]$ osdmaptool osdmap --upmap balanceupmap --upmap-pool default.rgw.buckets.data  --upmap-max 10 --upmap-deviation 0.01
osdmaptool: osdmap file 'osdmap'
writing upmap command output to: balanceupmap
checking for upmap cleanups
upmap, max-count 10, max deviation 0.01
 limiting to pools default.rgw.buckets.data (5)

--upmap <file>: calculate pg upmap entries to balance pg layout, writing commands to <file> [default: - for stdout]

--upmap-max: set max upmap entries to calculate [default: 10]

--upmap-deviation: max deviation from target [default: 5]

--upmap-pool: restrict upmap balancing to 1 pool or the option can be repeated for multiple pools

The calculation produces the PG moves; here at most 10 upmap entries are generated, with a target deviation of 1%.

[root@ ~]$ head -n5 balanceupmap
ceph osd pg-upmap-items 5.0 21 20
ceph osd pg-upmap-items 5.1 7 6
ceph osd pg-upmap-items 5.4 7 3
ceph osd pg-upmap-items 5.5 18 10 27 22
ceph osd pg-upmap-items 5.6 7 6 17 10

osdmaptool has generated all the commands needed to move these PGs. Here is what one of them means:

// remap pg 5.0 from osd.21 to osd.20
// from up set [2,21,5] to up set [2,20,5]
// syntax: ceph osd pg-upmap-items <pgid> <osd-from> <osd-to> [<osd-from> <osd-to> ...]
ceph osd pg-upmap-items 5.0 21 20

Then simply execute these commands:

[root@ ~]$ source balanceupmap

Once the commands have run, the PGs start migrating and you will see backfill and related states; just wait for the data migration to finish.
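
To follow the migration and confirm that the exception-table entries were actually applied, something along these lines works (pg 5.0 is just the example from above):

// watch backfill/recovery progress
ceph -s
// list the pg-upmap entries currently stored in the OSDMap
ceph osd dump | grep pg_upmap_items
// confirm the remapped placement of a single pg
ceph pg map 5.0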

Note: this change makes the PG locations computed by CRUSH differ from the actual locations. The remapping requires client-side (including kernel client) support for pg-upmap, otherwise the workload will break.
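
If a manual mapping later needs to be backed out, the corresponding exception-table entry can be removed again and CRUSH placement takes over, for example:

ceph osd rm-pg-upmap-items 5.0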

Method 2: Balancer

The balancer can optimize PG distribution across OSDs automatically or according to a custom policy, to reach an even distribution.

Listing the mgr modules shows that a balancer plugin is available:

[root@ ~]# ceph mgr module ls
{
    "enabled_modules": [
        "balancer",
        "restful",
        "status"
    ],
    "disabled_modules": [
        "dashboard",
        "influx",
        "localpool",
        "prometheus",
        "selftest",
        "zabbix"
    ]
}

Check the balancer status:

[root@~]# ceph balancer status
{
    "active": false,
    "plans": [],
    "mode": "none"
}

Balancer modes

The default mode is crush-compat. The mode can be changed with:

ceph balancer mode upmap

or:

ceph balancer mode crush-compat

crush-compat mode

The crush-compat mode uses the compat weight-set feature (introduced in Luminous). It balances PGs by adjusting CRUSH weights, so it remains compatible with older clients. The balancer, implemented in Python, computes a distribution score and derives the weight-set entries that need to be adjusted.
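
The compat weight set created by this mode can be inspected and, if necessary, dropped; roughly:

// list and dump the weight sets defined in the CRUSH map
ceph osd crush weight-set ls
ceph osd crush weight-set dump
// remove the compat weight set and fall back to the plain CRUSH weights
ceph osd crush weight-set rm-compat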

upmap mode

Upmap is a feature added in Luminous. It remaps specific PGs to specific OSDs at per-PG granularity by adding updated PG mappings to the OSDMap, which gives fine-grained control over PG placement. The prerequisite is that all daemons and all clients in the cluster are running Luminous. Since upmap itself was covered above, this mode is not discussed separately here.
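
If the balancer is to run in upmap mode, the same client prerequisite as in Method 1 applies; a minimal setup sketch:

ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it
ceph balancer mode upmap
ceph balancer on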

Usage guide

The balancer workflow breaks down into the following steps:

  1. Create a plan
  2. Evaluate the current data distribution and PG layout, and what the PG layout would look like after the plan is executed
  3. Execute the plan

Evaluate the current cluster score:

// overall score of the current cluster's data distribution
[root@ ~]# ceph balancer eval
current cluster score 0.001878 (lower is better)

// score of the data distribution for a single pool
[root@~]# ceph balancer eval default.rgw.buckets.data
pool "default.rgw.buckets.data" score 0.003757 (lower is better)

// eval-verbose shows more detail
[root@ ~]# ceph balancer eval-verbose default.rgw.buckets.data
pool "default.rgw.buckets.data"
target_by_root {'hddRoom': {0: 0.09998210519552231, 1: 0.11980094015598297, 2: 0.1016131192445755, 3: 0.09000761806964874, 4: 0.09637156873941422, 5: 0.10553421825170517, 6: 0.0991905927658081, 7: 0.09007509052753448, 8: 0.10044334828853607, 9: 0.09698139876127243}}
actual_by_pool {'default.rgw.buckets.data': {'objects': {0: 0.09931374219633479, 1: 0.11927666671830382, 2: 0.10097645862056501, 3: 0.09210519521426838, 4: 0.09771299035934297, 5: 0.10265982990720803, 6: 0.09979913146303554, 7: 0.09119638126810528, 8: 0.09976298545381315, 9: 0.09719661879902303}, 'bytes': {0: 0.09974366394301833, 1: 0.11889147120344006, 2: 0.10119391850981176, 3: 0.09131441282712302, 4: 0.09770311522171916, 5: 0.10257730725597898, 6: 0.0992708294374905, 7: 0.09146176128248446, 8: 0.0997894495234637, 9: 0.09805407079547004}, 'pgs': {0: 0.1005859375, 1: 0.119140625, 2: 0.1005859375, 3: 0.0908203125, 4: 0.0966796875, 5: 0.1044921875, 6: 0.099609375, 7: 0.0908203125, 8: 0.099609375, 9: 0.09765625}}}
actual_by_root {'hddRoom': {'objects': {0: 0.09931374219633479, 1: 0.11927666671830382, 2: 0.10097645862056501, 3: 0.09210519521426838, 4: 0.09771299035934297, 5: 0.10265982990720803, 6: 0.09979913146303554, 7: 0.09119638126810528, 8: 0.09976298545381315, 9: 0.09719661879902303}, 'bytes': {0: 0.09974366394301833, 1: 0.11889147120344006, 2: 0.10119391850981176, 3: 0.09131441282712302, 4: 0.09770311522171916, 5: 0.10257730725597898, 6: 0.0992708294374905, 7: 0.09146176128248446, 8: 0.0997894495234637, 9: 0.09805407079547004}, 'pgs': {0: 0.1005859375, 1: 0.119140625, 2: 0.1005859375, 3: 0.0908203125, 4: 0.0966796875, 5: 0.1044921875, 6: 0.099609375, 7: 0.0908203125, 8: 0.099609375, 9: 0.09765625}}}
count_by_pool {'default.rgw.buckets.data': {'objects': {0: 19233L, 1: 23099L, 2: 19555L, 3: 17837L, 4: 18923L, 5: 19881L, 6: 19327L, 7: 17661L, 8: 19320L, 9: 18823L}, 'bytes': {0: 67594318518L, 1: 80570410750L, 2: 68577127503L, 3: 61881980889L, 4: 66211478799L, 5: 69514622837L, 6: 67273887877L, 7: 61981836038L, 8: 67625346505L, 9: 66449314486L}, 'pgs': {0: 103, 1: 122, 2: 103, 3: 93, 4: 99, 5: 107, 6: 102, 7: 93, 8: 102, 9: 100}}}
count_by_root {'hddRoom': {'objects': {0: 19233.0, 1: 23099.0, 2: 19555.0, 3: 17837.0, 4: 18923.0, 5: 19881.0, 6: 19327.0, 7: 17661.0, 8: 19320.0, 9: 18823.0}, 'bytes': {0: 67594318518.0, 1: 80570410750.0, 2: 68577127503.0, 3: 61881980889.0, 4: 66211478799.0, 5: 69514622837.0, 6: 67273887877.0, 7: 61981836038.0, 8: 67625346505.0, 9: 66449314486.0}, 'pgs': {0: 103.0, 1: 122.0, 2: 103.0, 3: 93.0, 4: 99.0, 5: 107.0, 6: 102.0, 7: 93.0, 8: 102.0, 9: 100.0}}}
total_by_pool {'default.rgw.buckets.data': {'objects': 193659L, 'bytes': 677680324202L, 'pgs': 1024}}
total_by_root {'hddRoom': {'objects': 193659L, 'bytes': 677680324202L, 'pgs': 1024}}
stats_by_root {'hddRoom': {'objects': {'score': 0.0042956366348764, 'avg': 19365.9, 'sum_weight': 0.472626268863678, 'stddev': 275.7838050769265}, 'bytes': {'score': 0.0041312513939099306, 'avg': 67768032420.2, 'sum_weight': 0.472626268863678, 'stddev': 922972545.7397375}, 'pgs': {'score': 0.0028432381035250064, 'avg': 102.4, 'sum_weight': 0.5726083740592003, 'stddev': 0.8091224925010055}}}
score_by_pool {}
score_by_root {'hddRoom': {'objects': 0.0042956366348764, 'bytes': 0.0041312513939099306, 'pgs': 0.0028432381035250064}}
score 0.003757 (lower is better)

Next, generate a plan with the balancer, using the current default mode:

[root@~]# ceph balancer optimize thisplan  default.rgw.buckets.data

Then inspect the plan; it contains the actions (commands) that would be executed:

[root@ ~]# ceph balancer show  thisplan
# starting osdmap epoch 454
# starting crush version 263
# mode crush-compat
ceph osd crush weight-set reweight-compat 0 7.232340
ceph osd crush weight-set reweight-compat 1 8.775111
ceph osd crush weight-set reweight-compat 2 7.439869
ceph osd crush weight-set reweight-compat 3 6.489997
ceph osd crush weight-set reweight-compat 4 6.992526
ceph osd crush weight-set reweight-compat 5 7.761090
ceph osd crush weight-set reweight-compat 6 7.193554
ceph osd crush weight-set reweight-compat 7 6.541217
ceph osd crush weight-set reweight-compat 8 7.337742
ceph osd crush weight-set reweight-compat 9 7.049573
ceph osd crush weight-set reweight-compat 10 1.699997

Evaluate the projected score of the data pool after the plan is executed:

[root@ ~]# ceph balancer eval thisplan
plan thisplan final score 0.002719 (lower is better)

The data pool's score was originally 0.003757 and would drop to 0.002719; the lower the score, the more evenly the data is distributed. We can therefore execute the plan:

[root@ ~]# ceph balancer execute thisplan
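
After executing, the score can be re-checked against the live cluster, and plans that are no longer needed can be listed and removed, for example:

ceph balancer eval default.rgw.buckets.data
ceph balancer ls
ceph balancer rm thisplan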

The above is the manual mode, where we decide when to run the plan. With automatic mode enabled, the system recomputes the cluster score every 60 seconds by default and runs the PG balancing whenever a better score is found.

[root@~]# ceph balancer on
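
Automatic mode can be paused again at any time, for example before maintenance:

ceph balancer off
ceph balancer status    // "active" goes back to false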

The period of the automatic runs can be adjusted by hand; the current settings can be inspected with:

[root@ ~]# ceph config-key dump
{
    "mgr/balancer/active": "1",              // run automatically on a schedule
    "mgr/balancer/mode": "crush-compat",     // balancer mode
    "mgr/balancer/sleep_interval": "100",    // seconds between runs
    "mgr/balancer/begin_time": "0000",
    "mgr/balancer/end_time": "2400"          // time window in which runs are allowed
}

Set the run interval to 3000 seconds:

[root@ ~]# ceph config-key set "mgr/balancer/sleep_interval" "3000"
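
The active time window can be restricted the same way, so automatic balancing only runs off-peak; a sketch using the begin_time/end_time keys shown in the dump above (times are HHMM, the values here are examples; depending on the version the balancer module may need to be reloaded before new keys are picked up):

ceph config-key set "mgr/balancer/begin_time" "0100"
ceph config-key set "mgr/balancer/end_time" "0600"
// reload the module so the new keys take effect, if needed
ceph mgr module disable balancer
ceph mgr module enable balancer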

Method 3: reweight-by-pg, adjusting OSD weights by placement-group distribution

First run a dry run to see how many PGs would move:

[root@~]# ceph osd  test-reweight-by-pg 105 .2 5
no change
moved 241 / 2240 (10.7589%)
avg 203.636
stddev 320.223 -> 251.918 (expected baseline 13.606)
min osd.3 with 93 -> 93 pgs (0.456696 -> 0.456696 * mean)
max osd.10 with 1216 -> 975 pgs (5.97143 -> 4.78795 * mean)

oload 105
max_change 0.2
max_change_osds 5
average_utilization 30.0619
overload_utilization 31.5650
osd.10 weight 1.0000 -> 0.8000

Once the evaluation looks reasonable, pick a time that minimizes the impact on the workload and run the actual rebalancing command:


[root@~]# ceph osd reweight-by-pg 105 .2 5
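
After the reweight takes effect you can follow the resulting backfill and, once the cluster is healthy again, repeat the dry run to see whether another pass would still move a meaningful number of PGs:

// follow the recovery/backfill triggered by the weight change
ceph -s
ceph osd df
// once the cluster is healthy, re-check whether a further pass is worthwhile
ceph osd test-reweight-by-pg 105 .2 5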

Method 4: reweight-by-utilization, adjusting OSD weights by utilization

First run a dry run to see how many PGs would move:

[root@ ~]# ceph osd test-reweight-by-utilization 105 .2 2
no change
moved 22 / 2240 (0.982143%)
avg 203.636
stddev 320.223 -> 320.26 (expected baseline 13.606)
min osd.3 with 93 -> 95 pgs (0.456696 -> 0.466518 * mean)
max osd.10 with 1216 -> 1216 pgs (5.97143 -> 5.97143 * mean)

oload 105
max_change 0.2
max_change_osds 2
average_utilization 0.0096
overload_utilization 0.0101
osd.6 weight 1.0000 -> 0.8795
osd.8 weight 1.0000 -> 0.8934

Once the evaluation looks reasonable, pick a time that minimizes the impact on the workload and run the actual rebalancing command:

[root@ ~]# ceph osd reweight-by-utilization 105 .2 2
