
Several ways to rebalance data in Ceph



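When a Ceph cluster raises an OSD full warning, the usual first step is to check each OSD's utilization and weight with ceph osd df, and then lower the reweight value of the fullest OSD, for example: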

ceph osd reweight 1 0.8
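
To choose which OSD to reweight, the JSON form of the same report (ceph osd df -f json) can be filtered by utilization. The sketch below assumes the report has a top-level "nodes" list whose entries carry "name", "utilization" and "reweight" fields. The function name overfull_osds, the 85% threshold and the sample values are made up for illustration.

```python
import json

def overfull_osds(osd_df_json: str, threshold_pct: float = 85.0):
    """Return (name, utilization, reweight) for OSDs above threshold_pct, fullest first."""
    nodes = json.loads(osd_df_json)["nodes"]
    hot = [(n["name"], n["utilization"], n["reweight"])
           for n in nodes if n["utilization"] > threshold_pct]
    return sorted(hot, key=lambda item: item[1], reverse=True)

# Made-up sample shaped like `ceph osd df -f json` output (values are illustrative).
sample = json.dumps({"nodes": [
    {"id": 0, "name": "osd.0", "utilization": 71.2, "reweight": 1.0},
    {"id": 1, "name": "osd.1", "utilization": 93.4, "reweight": 1.0},
    {"id": 2, "name": "osd.2", "utilization": 88.0, "reweight": 0.9},
]})

for name, util, reweight in overfull_osds(sample):
    print(f"{name}: {util:.1f}% used, reweight {reweight}")
```

On a real cluster the string would come from the command's output instead of the sample.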

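If recovery stalls after peering once the weight has been changed, or if you simply want to raise the OSD write threshold, raise the OSD full ratio so the cluster keeps accepting writes. On Luminous and later the full ratio is stored in the OSDMap and is normally changed with ceph osd set-full-ratio 0.96; the injectargs form below is the one used on older releases.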

ceph tell mon.* injectargs "--mon-osd-full-ratio 0.96" // default is 0.95
ceph tell osd.* injectargs "--mon-osd-full-ratio 0.96"
ceph osd unpause

Method 1: Upmap

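Starting with 12.2.x, Ceph's osdmaptool can run balancing calculations against a given osdmap, and the ceph osd pg-upmap-items command can then move individual PGs by hand. In other words, we can dictate which OSDs a PG is placed on. Changing PG placement by hand like this arguably goes against the intent of the CRUSH algorithm.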

upmap introduction:

Starting in Luminous v12.2.z there is a new pg-upmap exception table in the OSDMap that allows the cluster to explicitly map specific PGs to specific OSDs. This allows the cluster to fine-tune the data distribution to, in most cases, perfectly distributed PGs across OSDs. The key caveat to this new mechanism is that it requires that all clients understand the new pg-upmap structure in the OSDMap.




ceph osd getmap -o osdmap


[root@~]$  osdmaptool --test-map-pgs --pool 5 ./osdmap
osdmaptool: osdmap file './osdmap'
pool 5 pg_num 1024
#osd    count    first    primary    c wt    wt
osd.0    95    36    36    5.45749    1
osd.1    108    37    37    5.45749    1
osd.2    114    34    34    5.45749    1
osd.3    95    25    25    5.45749    1
 in 30
 avg 102 stddev 10.6927 (0.10483x) (expected 9.9492 0.0975412x))
 min osd.6 83
 max osd.18 118
size 0    0
size 1    0
size 2    0
size 3    1024
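
The "expected" figure in the output above can be reproduced with a simple binomial model: each of the pg_num × size PG replicas lands on any one of the "in" OSDs with equal probability. This is a sketch under that equal-weight assumption, not a statement of how osdmaptool computes it internally. The function name expected_stddev is made up for illustration.

```python
import math

def expected_stddev(pg_num: int, size: int, num_osds: int) -> float:
    """Standard deviation of per-OSD PG counts if replicas were placed uniformly at random."""
    placements = pg_num * size          # total PG replicas to place
    p = 1.0 / num_osds                  # chance that one replica lands on a given OSD
    return math.sqrt(placements * p * (1.0 - p))

# Values taken from the output above: pg_num 1024, all PGs of size 3, 30 OSDs in.
baseline = expected_stddev(pg_num=1024, size=3, num_osds=30)
print(f"mean {1024 * 3 / 30:.1f}, expected stddev {baseline:.4f}")
```

With these inputs the model gives about 9.9492, the same value the tool reports as expected, while the measured stddev of 10.6927 is slightly worse than random placement.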

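The output shows a wide gap between the OSDs with the fewest and the most PGs (83 on osd.6 versus 118 on osd.18), so we want to compute a round of PG moves for this pool. Upmap is only understood by Ceph 12.2.x and later, though, so first set the minimum required client version with the command below: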

[root@ ~]$ ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it
// Check the feature release of connected clients
[root@CNSZ335523 ~]$ ceph features
    "mon": {
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 1
    "osd": {
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 11
    "client": {
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 5


[root@ ~]$ osdmaptool osdmap --upmap balanceupmap --upmap-pool default.rgw.buckets.data  --upmap-max 10 --upmap-deviation 0.01
osdmaptool: osdmap file 'osdmap'
writing upmap command output to: balanceupmap
checking for upmap cleanups
upmap, max-count 10, max deviation 0.01
 limiting to pools default.rgw.buckets.data (5)

--upmap <file>: calculate pg upmap entries to balance pg layout, writing commands to <file> [default: - for stdout]

--upmap-max: set max upmap entries to calculate [default: 10]

--upmap-deviation: max deviation from target [default: 5]

--upmap-pool: restrict upmap balancing to 1 pool, or the option can be repeated for multiple pools


[root@ ~]$ head -n5 balanceupmap
ceph osd pg-upmap-items 5.0 21 20
ceph osd pg-upmap-items 5.1 7 6
ceph osd pg-upmap-items 5.4 7 3
ceph osd pg-upmap-items 5.5 18 10 27 22
ceph osd pg-upmap-items 5.6 7 6 17 10


// Remap PG 5.0 from osd.21 to osd.20
// from up set [2,21,5] to up set [2,20,5]
// ceph osd pg-upmap-items <pgid> <from-osd> <to-osd> [...]
ceph osd pg-upmap-items 5.0 21 20
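
To make the argument order concrete, the sketch below parses one generated line into a PG id plus (from, to) OSD pairs and applies the pairs to an up set. It only models the substitution for illustration. The helper names parse_upmap_items and apply_upmap are made up and are not part of any Ceph tool.

```python
def parse_upmap_items(line: str):
    """Split 'ceph osd pg-upmap-items <pgid> <from> <to> ...' into (pgid, [(from, to), ...])."""
    parts = line.split()
    # 4 fixed tokens plus at least one from/to pair, so the token count is even and >= 6.
    if parts[:3] != ["ceph", "osd", "pg-upmap-items"] or len(parts) < 6 or len(parts) % 2:
        raise ValueError(f"not a pg-upmap-items command: {line!r}")
    pgid = parts[3]
    ids = [int(tok) for tok in parts[4:]]
    return pgid, list(zip(ids[0::2], ids[1::2]))

def apply_upmap(up_set, pairs):
    """Replace each 'from' OSD in the up set with its 'to' OSD."""
    mapping = dict(pairs)
    return [mapping.get(osd, osd) for osd in up_set]

pgid, pairs = parse_upmap_items("ceph osd pg-upmap-items 5.0 21 20")
print(pgid, pairs, apply_upmap([2, 21, 5], pairs))
```

A line with two pairs, such as the one for PG 5.5 above, parses to [(18, 10), (27, 22)] in the same way.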


[root@ ~]$ source balanceupmap






Method 2: the balancer mgr module

[root@ ~]# ceph mgr module ls
    "enabled_modules": [
    "disabled_modules": [


[root@~]# ceph balancer status
    "active": false,
    "plans": [],
    "mode": "none"


The default mode is crush-compat. The mode can be changed:

ceph balancer mode upmap


ceph balancer mode crush-compat


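Crush-compat mode uses the compat weight-set feature (introduced in Luminous). It balances PGs by changing CRUSH weights, which keeps it compatible with older clients. The balancer, implemented in Python, computes a score and derives the list of OSDs that need to be reweighted.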





  1. Create a plan
  2. Evaluate the current data and PG distribution, and the PG distribution expected after the plan runs
  3. Execute the plan


// Evaluate the overall data-distribution score of the current cluster
[root@ ~]# ceph balancer eval
current cluster score 0.001878 (lower is better)

// Data-distribution score of a single pool in the current cluster
[root@~]# ceph balancer eval default.rgw.buckets.data
pool "default.rgw.buckets.data" score 0.003757 (lower is better)

// Use eval-verbose to see more detail
[root@ ~]# ceph balancer eval-verbose default.rgw.buckets.data
pool "default.rgw.buckets.data"
target_by_root {'hddRoom': {0: 0.09998210519552231, 1: 0.11980094015598297, 2: 0.1016131192445755, 3: 0.09000761806964874, 4: 0.09637156873941422, 5: 0.10553421825170517, 6: 0.0991905927658081, 7: 0.09007509052753448, 8: 0.10044334828853607, 9: 0.09698139876127243}}
actual_by_pool {'default.rgw.buckets.data': {'objects': {0: 0.09931374219633479, 1: 0.11927666671830382, 2: 0.10097645862056501, 3: 0.09210519521426838, 4: 0.09771299035934297, 5: 0.10265982990720803, 6: 0.09979913146303554, 7: 0.09119638126810528, 8: 0.09976298545381315, 9: 0.09719661879902303}, 'bytes': {0: 0.09974366394301833, 1: 0.11889147120344006, 2: 0.10119391850981176, 3: 0.09131441282712302, 4: 0.09770311522171916, 5: 0.10257730725597898, 6: 0.0992708294374905, 7: 0.09146176128248446, 8: 0.0997894495234637, 9: 0.09805407079547004}, 'pgs': {0: 0.1005859375, 1: 0.119140625, 2: 0.1005859375, 3: 0.0908203125, 4: 0.0966796875, 5: 0.1044921875, 6: 0.099609375, 7: 0.0908203125, 8: 0.099609375, 9: 0.09765625}}}
actual_by_root {'hddRoom': {'objects': {0: 0.09931374219633479, 1: 0.11927666671830382, 2: 0.10097645862056501, 3: 0.09210519521426838, 4: 0.09771299035934297, 5: 0.10265982990720803, 6: 0.09979913146303554, 7: 0.09119638126810528, 8: 0.09976298545381315, 9: 0.09719661879902303}, 'bytes': {0: 0.09974366394301833, 1: 0.11889147120344006, 2: 0.10119391850981176, 3: 0.09131441282712302, 4: 0.09770311522171916, 5: 0.10257730725597898, 6: 0.0992708294374905, 7: 0.09146176128248446, 8: 0.0997894495234637, 9: 0.09805407079547004}, 'pgs': {0: 0.1005859375, 1: 0.119140625, 2: 0.1005859375, 3: 0.0908203125, 4: 0.0966796875, 5: 0.1044921875, 6: 0.099609375, 7: 0.0908203125, 8: 0.099609375, 9: 0.09765625}}}
count_by_pool {'default.rgw.buckets.data': {'objects': {0: 19233L, 1: 23099L, 2: 19555L, 3: 17837L, 4: 18923L, 5: 19881L, 6: 19327L, 7: 17661L, 8: 19320L, 9: 18823L}, 'bytes': {0: 67594318518L, 1: 80570410750L, 2: 68577127503L, 3: 61881980889L, 4: 66211478799L, 5: 69514622837L, 6: 67273887877L, 7: 61981836038L, 8: 67625346505L, 9: 66449314486L}, 'pgs': {0: 103, 1: 122, 2: 103, 3: 93, 4: 99, 5: 107, 6: 102, 7: 93, 8: 102, 9: 100}}}
count_by_root {'hddRoom': {'objects': {0: 19233.0, 1: 23099.0, 2: 19555.0, 3: 17837.0, 4: 18923.0, 5: 19881.0, 6: 19327.0, 7: 17661.0, 8: 19320.0, 9: 18823.0}, 'bytes': {0: 67594318518.0, 1: 80570410750.0, 2: 68577127503.0, 3: 61881980889.0, 4: 66211478799.0, 5: 69514622837.0, 6: 67273887877.0, 7: 61981836038.0, 8: 67625346505.0, 9: 66449314486.0}, 'pgs': {0: 103.0, 1: 122.0, 2: 103.0, 3: 93.0, 4: 99.0, 5: 107.0, 6: 102.0, 7: 93.0, 8: 102.0, 9: 100.0}}}
total_by_pool {'default.rgw.buckets.data': {'objects': 193659L, 'bytes': 677680324202L, 'pgs': 1024}}
total_by_root {'hddRoom': {'objects': 193659L, 'bytes': 677680324202L, 'pgs': 1024}}
stats_by_root {'hddRoom': {'objects': {'score': 0.0042956366348764, 'avg': 19365.9, 'sum_weight': 0.472626268863678, 'stddev': 275.7838050769265}, 'bytes': {'score': 0.0041312513939099306, 'avg': 67768032420.2, 'sum_weight': 0.472626268863678, 'stddev': 922972545.7397375}, 'pgs': {'score': 0.0028432381035250064, 'avg': 102.4, 'sum_weight': 0.5726083740592003, 'stddev': 0.8091224925010055}}}
score_by_pool {}
score_by_root {'hddRoom': {'objects': 0.0042956366348764, 'bytes': 0.0041312513939099306, 'pgs': 0.0028432381035250064}}
score 0.003757 (lower is better)


[root@~]# ceph balancer optimize thisplan  default.rgw.buckets.data


[root@ ~]# ceph balancer show  thisplan
# starting osdmap epoch 454
# starting crush version 263
# mode crush-compat
ceph osd crush weight-set reweight-compat 0 7.232340
ceph osd crush weight-set reweight-compat 1 8.775111
ceph osd crush weight-set reweight-compat 2 7.439869
ceph osd crush weight-set reweight-compat 3 6.489997
ceph osd crush weight-set reweight-compat 4 6.992526
ceph osd crush weight-set reweight-compat 5 7.761090
ceph osd crush weight-set reweight-compat 6 7.193554
ceph osd crush weight-set reweight-compat 7 6.541217
ceph osd crush weight-set reweight-compat 8 7.337742
ceph osd crush weight-set reweight-compat 9 7.049573
ceph osd crush weight-set reweight-compat 10 1.699997

Evaluate the score the data pool is expected to have once the plan has been executed.

[root@ ~]# ceph balancer eval thisplan
plan thisplan final score 0.002719 (lower is better)

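The data pool's score was 0.003757 before and becomes 0.002719 with the plan. A lower score means the data is more evenly distributed, so we can go ahead and execute the plan.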

[root@ ~]# ceph balancer execute thisplan
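
The before/after comparison that justified executing the plan can be scripted by pulling the score out of the two eval lines shown earlier. This is a minimal sketch. It assumes the "score N (lower is better)" wording seen in this transcript, and the names parse_score and improvement are made up for illustration.

```python
import re

SCORE_RE = re.compile(r"score ([0-9.]+) \(lower is better\)")

def parse_score(line: str) -> float:
    """Extract the numeric score from a `ceph balancer eval` output line."""
    match = SCORE_RE.search(line)
    if match is None:
        raise ValueError(f"no score found in: {line!r}")
    return float(match.group(1))

def improvement(before: float, after: float) -> float:
    """Relative reduction of the score; positive means the plan helps."""
    return (before - after) / before

# The two lines are copied from the eval output in this section.
before = parse_score('pool "default.rgw.buckets.data" score 0.003757 (lower is better)')
after = parse_score("plan thisplan final score 0.002719 (lower is better)")
print(f"score {before} -> {after}, {improvement(before, after):.1%} lower")
```

For the numbers in this section the plan lowers the score by roughly 27.6%.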


[root@~]# ceph balancer on


[root@ ~]# ceph config-key dump
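"mgr/balancer/active": "1",  // run automatically on a schedule
 "mgr/balancer/mode": "crush-compat", // mode
"mgr/balancer/sleep_interval": "100", // seconds between balancer runs
"mgr/balancer/end_time": "2400", // end of the time window in which the balancer may run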


[root@ ~]# ceph config-key set "mgr/balancer/sleep_interval" "3000"

Method 3: reweight-by-pg, adjusting OSD weights by placement-group distribution


[root@~]# ceph osd  test-reweight-by-pg 105 .2 5
no change
moved 241 / 2240 (10.7589%)
avg 203.636
stddev 320.223 -> 251.918 (expected baseline 13.606)
min osd.3 with 93 -> 93 pgs (0.456696 -> 0.456696 * mean)
max osd.10 with 1216 -> 975 pgs (5.97143 -> 4.78795 * mean)

oload 105
max_change 0.2
max_change_osds 5
average_utilization 30.0619
overload_utilization 31.5650
osd.10 weight 1.0000 -> 0.8000
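
The last lines of the dry run follow from the three arguments (oload 105, max_change 0.2, max_change_osds 5). The sketch below reproduces two of the numbers: the overload threshold as the average scaled by oload, and a new weight scaled toward the average but lowered by no more than max_change. It is a simplified model rather than Ceph's own code. It uses raw PG counts as a stand-in for utilization, and the function names overload_threshold and proposed_weight are made up for illustration.

```python
def overload_threshold(average_utilization: float, oload: float) -> float:
    """OSDs above this value are candidates for a lower weight (oload is a percentage)."""
    return average_utilization * oload / 100.0

def proposed_weight(weight: float, utilization: float,
                    average_utilization: float, max_change: float) -> float:
    """Scale the weight toward the average, lowering it by at most max_change."""
    target = weight * average_utilization / utilization
    return max(target, weight - max_change)

# Numbers from the dry-run output above.
threshold = overload_threshold(30.0619, 105)
# Assumption: osd.10's 1216 PGs against the mean of 203.636 stand in for its utilization.
new_weight = proposed_weight(1.0, 1216, 203.636, 0.2)
print(f"overload_utilization {threshold:.4f}, osd.10 weight 1.0000 -> {new_weight:.4f}")
```

Because osd.10 sits at almost six times the mean, the unclamped target would be far below 0.8, so the max_change limit of 0.2 is what produces the 1.0000 -> 0.8000 step.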


[root@~]# ceph osd reweight-by-pg 105 .2 5

Method 4: reweight-by-utilization, adjusting OSD weights by utilization


[root@ ~]# ceph osd test-reweight-by-utilization 105 .2 2
no change
moved 22 / 2240 (0.982143%)
avg 203.636
stddev 320.223 -> 320.26 (expected baseline 13.606)
min osd.3 with 93 -> 95 pgs (0.456696 -> 0.466518 * mean)
max osd.10 with 1216 -> 1216 pgs (5.97143 -> 5.97143 * mean)

oload 105
max_change 0.2
max_change_osds 2
average_utilization 0.0096
overload_utilization 0.0101
osd.6 weight 1.0000 -> 0.8795
osd.8 weight 1.0000 -> 0.8934


[root@ ~]# ceph osd reweight-by-utilization 105 .2 2
