Author: Yang Honggang
By default, when Ceph creates a pool its PGs are distributed quite unevenly across the OSDs. Some OSDs end up very busy while others sit mostly idle, so the cluster cannot deliver its full aggregate performance.
This article takes the EC pool rgwecpool as an example and shows how to spread a pool's PGs evenly across all OSDs. The cluster used here runs Jewel (v10.2.2).
On the upstream master branch there is already an mgr balancer plugin that adjusts the PG distribution automatically (https://www.spinics.net/lists/ceph-devel/msg37730.html).
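For reference, on Luminous and later the balancer ships as an mgr module and can be switched on roughly as follows; this does not apply to the Jewel cluster used below, so treat it only as a sketch:
# ceph mgr module enable balancer
# ceph balancer mode crush-compat
# ceph balancer on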
$ git clone https://github.com/yanghonggang/python-crush.git
$ cd python-crush
$ git checkout -b v1.0.38
$ python setup.py bdist_wheel
$ pip install dist/crush-1.0.38.dev4-cp27-cp27mu-linux_x86_64.whl
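A quick sanity check that the wheel installed and the crush CLI is available (assuming a standard pip environment):
$ pip show crush
$ crush --help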
# ceph osd pool ls detail --cluster xtao
...
pool 185 'testpool' replicated size 3 min_size 1 crush_ruleset 6 object_hash rjenkins pg_num 256 pgp_num 256 last_change 4996 flags hashpspool stripe_width 0
pool 186 'rgwecpool' erasure size 4 min_size 3 crush_ruleset 4 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 5001 flags hashpspool stripe_width 4128
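The analysis below needs the pool id (186), pg_num/pgp_num (1024) and the replication count, which for an EC pool is k+m (size 4 here). If in doubt, the erasure-code profile can be inspected with the following commands (the profile name is cluster specific):
# ceph osd erasure-code-profile ls --cluster xtao
# ceph osd erasure-code-profile get <profile-name> --cluster xtao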
# ceph osd crush dump --cluster xtao > crushmap-hehe.json
# crush analyze --rule rgwecpool --type device --replication-count 4 --pool 186 --pg-num 1024 --pgp-num 1024 --crushmap crushmap-hehe.json
~id~ ~weight~ ~PGs~ ~over/under filled %~
~name~
osd.3 3 6.0 101 23.29
osd.45 45 6.0 95 15.97
osd.57 57 6.0 94 14.75
osd.16 16 6.0 92 12.30
osd.30 30 6.0 92 12.30
osd.32 32 6.0 91 11.08
osd.33 33 6.0 90 9.86
osd.41 41 6.0 90 9.86
osd.42 42 6.0 88 7.42
osd.19 19 6.0 87 6.20
osd.35 35 6.0 87 6.20
osd.44 44 6.0 86 4.98
osd.23 23 6.0 86 4.98
osd.21 21 6.0 86 4.98
osd.50 50 6.0 85 3.76
osd.5 5 6.0 85 3.76
osd.31 31 6.0 85 3.76
osd.58 58 6.0 85 3.76
osd.34 34 6.0 84 2.54
osd.10 10 6.0 83 1.32
osd.4 4 6.0 83 1.32
osd.51 51 6.0 83 1.32
osd.43 43 6.0 82 0.10
osd.40 40 6.0 82 0.10
osd.39 39 6.0 82 0.10
osd.38 38 6.0 82 0.10
osd.20 20 6.0 82 0.10
osd.2 2 6.0 81 -1.12
osd.22 22 6.0 81 -1.12
osd.56 56 6.0 81 -1.12
osd.7 7 6.0 80 -2.34
osd.11 11 6.0 80 -2.34
osd.28 28 6.0 80 -2.34
osd.18 18 6.0 79 -3.56
osd.29 29 6.0 78 -4.79
osd.8 8 6.0 78 -4.79
osd.14 14 6.0 77 -6.01
osd.47 47 6.0 77 -6.01
osd.55 55 6.0 76 -7.23
osd.46 46 6.0 76 -7.23
osd.27 27 6.0 76 -7.23
osd.6 6 6.0 75 -8.45
osd.53 53 6.0 74 -9.67
osd.9 9 6.0 74 -9.67
osd.26 26 6.0 73 -10.89
osd.17 17 6.0 73 -10.89
osd.54 54 6.0 72 -12.11
osd.52 52 6.0 71 -13.33
osd.15 15 6.0 69 -15.77
osd.59 59 6.0 67 -18.21
Worst case scenario if a host fails:
~over filled %~
~type~
device 23.29
host 2.54
root 0.00
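The over/under filled percentages match a quick back-of-the-envelope check: with pg_num 1024, EC size 4 and 50 equal-weight OSDs under root hdd, each OSD should ideally carry 1024 * 4 / 50 = 81.92 PGs. osd.3 with 101 PGs is 101 / 81.92 - 1 ≈ +23.29%, and osd.59 with 67 PGs is 67 / 81.92 - 1 ≈ -18.21%, exactly the extremes reported above.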
# ceph osd df --cluster xtao // matches the crush analyze estimate above
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
...
50 6.00000 1.00000 5581G 151M 5581G 0.00 0.83 85
51 6.00000 1.00000 5581G 135M 5581G 0.00 0.74 83
52 6.00000 1.00000 5581G 139M 5581G 0.00 0.76 71
53 6.00000 1.00000 5581G 136M 5581G 0.00 0.75 74
54 6.00000 1.00000 5581G 132M 5581G 0.00 0.72 72
55 6.00000 1.00000 5581G 134M 5581G 0.00 0.73 76
56 6.00000 1.00000 5581G 138M 5581G 0.00 0.76 81
57 6.00000 1.00000 5581G 144M 5581G 0.00 0.79 94
58 6.00000 1.00000 5581G 139M 5581G 0.00 0.76 85
59 6.00000 1.00000 5581G 133M 5581G 0.00 0.73 67
38 6.00000 1.00000 5581G 129M 5581G 0.00 0.71 82
39 6.00000 1.00000 5581G 146M 5581G 0.00 0.80 82
40 6.00000 1.00000 5581G 133M 5581G 0.00 0.73 82
41 6.00000 1.00000 5581G 135M 5581G 0.00 0.74 90
42 6.00000 1.00000 5581G 130M 5581G 0.00 0.71 88
43 6.00000 1.00000 5581G 140M 5581G 0.00 0.76 82
44 6.00000 1.00000 5581G 142M 5581G 0.00 0.78 86
45 6.00000 1.00000 5581G 158M 5581G 0.00 0.86 95
46 6.00000 1.00000 5581G 142M 5581G 0.00 0.78 76
47 6.00000 1.00000 5581G 139M 5581G 0.00 0.76 77
26 6.00000 1.00000 5581G 134M 5581G 0.00 0.74 73
27 6.00000 1.00000 5581G 125M 5581G 0.00 0.68 76
28 6.00000 1.00000 5581G 137M 5581G 0.00 0.75 80
29 6.00000 1.00000 5581G 133M 5581G 0.00 0.73 78
30 6.00000 1.00000 5581G 145M 5581G 0.00 0.80 92
31 6.00000 1.00000 5581G 134M 5581G 0.00 0.73 85
32 6.00000 1.00000 5581G 140M 5581G 0.00 0.76 91
33 6.00000 1.00000 5581G 145M 5581G 0.00 0.79 90
34 6.00000 1.00000 5581G 136M 5581G 0.00 0.74 84
35 6.00000 1.00000 5581G 128M 5581G 0.00 0.70 87
14 6.00000 1.00000 5581G 141M 5581G 0.00 0.77 77
15 6.00000 1.00000 5581G 135M 5581G 0.00 0.74 69
16 6.00000 1.00000 5581G 136M 5581G 0.00 0.74 92
17 6.00000 1.00000 5581G 136M 5581G 0.00 0.74 73
18 6.00000 1.00000 5581G 127M 5581G 0.00 0.70 79
19 6.00000 1.00000 5581G 140M 5581G 0.00 0.77 87
20 6.00000 1.00000 5581G 148M 5581G 0.00 0.81 82
21 6.00000 1.00000 5581G 151M 5581G 0.00 0.83 86
22 6.00000 1.00000 5581G 144M 5581G 0.00 0.79 81
23 6.00000 1.00000 5581G 128M 5581G 0.00 0.70 86
2 6.00000 1.00000 5581G 134M 5581G 0.00 0.74 81
3 6.00000 1.00000 5581G 131M 5581G 0.00 0.71 101
4 6.00000 1.00000 5581G 130M 5581G 0.00 0.71 83
5 6.00000 1.00000 5581G 125M 5581G 0.00 0.68 85
6 6.00000 1.00000 5581G 136M 5581G 0.00 0.74 75
7 6.00000 1.00000 5581G 144M 5581G 0.00 0.79 80
8 6.00000 1.00000 5581G 146M 5581G 0.00 0.80 78
9 6.00000 1.00000 5581G 136M 5581G 0.00 0.74 74
10 6.00000 1.00000 5581G 138M 5581G 0.00 0.75 83
11 6.00000 1.00000 5581G 140M 5581G 0.00 0.77 80
TOTAL 327T 11008M 327T 0.00
MIN/MAX VAR: 0.68/5.04 STDDEV: 0.00
// make sure the cluster status is not ERROR before optimizing
# ceph report --cluster xtao > report.json
# crush optimize --crushmap report.json --out-path op.crush --pg-num 1024 --pgp-num 1024 --pool 186 --rule rgwecpool --out-format crush --out-version jewel --type device
2017-12-08 15:21:36,796 argv = optimize --crushmap report.json --out-path op.crush --pg-num 1024 --pgp-num 1024 --pool 186 --rule rgwecpool --out-format txt --out-version jewel --type device --replication-count=4 --pg-num=1024 --pgp-num=1024 --rule=rgwecpool --out-version=j --no-positions --choose-args=186
2017-12-08 15:21:36,858 hdd optimizing
2017-12-08 15:21:44,447 hdd wants to swap 246 PGs
2017-12-08 15:21:44,487 xt5-hdd optimizing
2017-12-08 15:21:44,497 xt4-hdd optimizing
2017-12-08 15:21:44,507 xt3-hdd optimizing
2017-12-08 15:21:44,521 xt2-hdd optimizing
2017-12-08 15:21:44,530 xt1-hdd optimizing
2017-12-08 15:21:52,316 xt1-hdd wants to swap 34 PGs
2017-12-08 15:21:53,629 xt2-hdd wants to swap 31 PGs
2017-12-08 15:21:53,862 xt5-hdd wants to swap 43 PGs
2017-12-08 15:22:00,223 xt3-hdd wants to swap 44 PGs
2017-12-08 15:23:52,603 xt4-hdd wants to swap 30 PGs
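Before applying the result, the expected data movement can be estimated by comparing the original map with the optimized one. A sketch, assuming crush compare accepts the ceph report and the binary output file directly:
# crush compare --rule rgwecpool --replication-count 4 --pool 186 --pg-num 1024 --pgp-num 1024 --origin report.json --destination op.crush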
// back up the cluster's current crushmap
# ceph osd getcrushmap --cluster xtao > crushmap.bak
got crush map from osdmap epoch 5003
# ceph osd setcrushmap -i new.bin --cluster xtao
set crush map
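If the new map turns out to cause problems, the backup taken above can be restored at any time:
# ceph osd setcrushmap -i crushmap.bak --cluster xtao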
# ceph -s --cluster xtao
cluster 4acceaa6-136f-11e7-9e17-ac1f6b1196ad
health HEALTH_OK
monmap e3: 3 mons at {xt1=192.168.10.1:6789/0,xt2=192.168.10.2:6789/0,xt3=192.168.10.3:6789/0}
election epoch 21560, quorum 0,1,2 xt1,xt2,xt3
osdmap e5007: 60 osds: 60 up, 60 in
flags sortbitwise
pgmap v3248856: 1760 pgs, 17 pools, 215 kB data, 3338 objects
10424 MB used, 327 TB / 327 TB avail
1760 active+clean
# ceph osd df --cluster xtao
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
....
50 5.83199 1.00000 5581G 141M 5581G 0.00 0.82 81
51 6.00600 1.00000 5581G 126M 5581G 0.00 0.73 82
52 6.87099 1.00000 5581G 129M 5581G 0.00 0.74 82
53 6.32100 1.00000 5581G 127M 5581G 0.00 0.74 82
54 6.37900 1.00000 5581G 121M 5581G 0.00 0.70 82
55 6.05899 1.00000 5581G 123M 5581G 0.00 0.71 82
56 6.00000 1.00000 5581G 128M 5581G 0.00 0.74 82
57 4.90799 1.00000 5581G 134M 5581G 0.00 0.77 82
58 5.00699 1.00000 5581G 130M 5581G 0.00 0.75 82
59 6.61899 1.00000 5581G 123M 5581G 0.00 0.71 82
38 6.17099 1.00000 5581G 119M 5581G 0.00 0.69 82
39 6.00000 1.00000 5581G 136M 5581G 0.00 0.78 82
40 6.16800 1.00000 5581G 123M 5581G 0.00 0.71 81
41 5.49399 1.00000 5581G 125M 5581G 0.00 0.72 81
42 6.00000 1.00000 5581G 120M 5581G 0.00 0.69 81
43 6.28400 1.00000 5581G 131M 5581G 0.00 0.76 81
44 5.93999 1.00000 5581G 133M 5581G 0.00 0.77 82
45 5.53699 1.00000 5581G 148M 5581G 0.00 0.85 83
46 6.34799 1.00000 5581G 129M 5581G 0.00 0.74 84
47 6.05899 1.00000 5581G 129M 5581G 0.00 0.75 82
26 6.71700 1.00000 5581G 124M 5581G 0.00 0.72 82
27 6.63899 1.00000 5581G 116M 5581G 0.00 0.67 80
28 6.21700 1.00000 5581G 127M 5581G 0.00 0.73 82
29 6.36800 1.00000 5581G 122M 5581G 0.00 0.71 82
30 5.14200 1.00000 5581G 137M 5581G 0.00 0.79 82
31 5.97299 1.00000 5581G 124M 5581G 0.00 0.72 81
32 5.59200 1.00000 5581G 130M 5581G 0.00 0.75 83
33 5.64899 1.00000 5581G 130M 5581G 0.00 0.75 82
34 5.93999 1.00000 5581G 126M 5581G 0.00 0.73 83
35 5.76399 1.00000 5581G 118M 5581G 0.00 0.68 82
14 6.32700 1.00000 5581G 131M 5581G 0.00 0.75 80
15 6.63899 1.00000 5581G 126M 5581G 0.00 0.73 82
16 5.42599 1.00000 5581G 126M 5581G 0.00 0.73 82
17 6.69398 1.00000 5581G 127M 5581G 0.00 0.73 82
18 6.22299 1.00000 5581G 118M 5581G 0.00 0.68 82
19 5.59200 1.00000 5581G 130M 5581G 0.00 0.75 82
20 5.93999 1.00000 5581G 139M 5581G 0.00 0.80 82
21 5.31898 1.00000 5581G 141M 5581G 0.00 0.82 82
22 6.12599 1.00000 5581G 134M 5581G 0.00 0.78 82
23 5.71300 1.00000 5581G 118M 5581G 0.00 0.68 83
2 6.60899 1.00000 5581G 124M 5581G 0.00 0.72 82
3 4.76199 1.00000 5581G 122M 5581G 0.00 0.70 82
4 5.76399 1.00000 5581G 120M 5581G 0.00 0.69 83
5 5.88100 1.00000 5581G 116M 5581G 0.00 0.67 82
6 6.11499 1.00000 5581G 126M 5581G 0.00 0.73 84
7 6.00000 1.00000 5581G 134M 5581G 0.00 0.77 82
8 6.20900 1.00000 5581G 131M 5581G 0.00 0.76 81
9 6.55099 1.00000 5581G 126M 5581G 0.00 0.73 81
10 6.10999 1.00000 5581G 122M 5581G 0.00 0.71 82
11 6.00000 1.00000 5581G 130M 5581G 0.00 0.75 81
TOTAL 327T 10424M 327T 0.00
MIN/MAX VAR: 0.67/5.27 STDDEV: 0.00
# ceph osd tree --cluster xtao
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-43 30.00000 root test
-44 6.00000 host xt5-test
49 6.00000 osd.49 up 1.00000 1.00000
-45 6.00000 host xt4-test
37 6.00000 osd.37 up 1.00000 1.00000
-46 6.00000 host xt3-test
25 6.00000 osd.25 up 1.00000 1.00000
-47 6.00000 host xt2-test
13 6.00000 osd.13 up 1.00000 1.00000
-48 6.00000 host xt1-test
1 6.00000 osd.1 up 1.00000 1.00000
-37 0 root yhg
-38 0 host xt5-yhg
-39 0 host xt4-yhg
-40 0 host xt3-yhg
-41 0 host xt2-yhg
-42 0 host xt1-yhg
-36 30.00000 root meta
-35 6.00000 host xt5-meta
48 6.00000 osd.48 up 1.00000 1.00000
-34 6.00000 host xt4-meta
36 6.00000 osd.36 up 1.00000 1.00000
-33 6.00000 host xt3-meta
24 6.00000 osd.24 up 1.00000 1.00000
-32 6.00000 host xt2-meta
12 6.00000 osd.12 up 1.00000 1.00000
-31 6.00000 host xt1-meta
0 6.00000 osd.0 up 1.00000 1.00000
-30 0 root default
-29 0 host xt7-default
-28 0 host xt6-default
-27 0 host xt5-default
-26 0 host xt4-default
-25 0 host xt3-default
-24 0 host xt2-default
-23 0 host xt1-default
-22 0 host xt9-default
-21 0 host xt8-default
-20 0 root ssd
-19 0 host xt7-ssd
-18 0 host xt6-ssd
-17 0 host xt5-ssd
-16 0 host xt4-ssd
-15 0 host xt3-ssd
-14 0 host xt2-ssd
-13 0 host xt1-ssd
-12 0 host xt9-ssd
-11 0 host xt8-ssd
-10 299.99997 root hdd
-9 0 host xt7-hdd
-8 0 host xt6-hdd
-7 64.66899 host xt5-hdd
50 5.83199 osd.50 up 1.00000 1.00000
51 6.00600 osd.51 up 1.00000 1.00000
52 6.87099 osd.52 up 1.00000 1.00000
53 6.32100 osd.53 up 1.00000 1.00000
54 6.37900 osd.54 up 1.00000 1.00000
55 6.05899 osd.55 up 1.00000 1.00000
56 6.00000 osd.56 up 1.00000 1.00000
57 4.90799 osd.57 up 1.00000 1.00000
58 5.00699 osd.58 up 1.00000 1.00000
59 6.61899 osd.59 up 1.00000 1.00000
-6 55.80899 host xt4-hdd
38 6.17099 osd.38 up 1.00000 1.00000
39 6.00000 osd.39 up 1.00000 1.00000
40 6.16800 osd.40 up 1.00000 1.00000
41 5.49399 osd.41 up 1.00000 1.00000
42 6.00000 osd.42 up 1.00000 1.00000
43 6.28400 osd.43 up 1.00000 1.00000
44 5.93999 osd.44 up 1.00000 1.00000
45 5.53699 osd.45 up 1.00000 1.00000
46 6.34799 osd.46 up 1.00000 1.00000
47 6.05899 osd.47 up 1.00000 1.00000
-5 55.06900 host xt3-hdd
26 6.71700 osd.26 up 1.00000 1.00000
27 6.63899 osd.27 up 1.00000 1.00000
28 6.21700 osd.28 up 1.00000 1.00000
29 6.36800 osd.29 up 1.00000 1.00000
30 5.14200 osd.30 up 1.00000 1.00000
31 5.97299 osd.31 up 1.00000 1.00000
32 5.59200 osd.32 up 1.00000 1.00000
33 5.64899 osd.33 up 1.00000 1.00000
34 5.93999 osd.34 up 1.00000 1.00000
35 5.76399 osd.35 up 1.00000 1.00000
-4 63.15900 host xt2-hdd
14 6.32700 osd.14 up 1.00000 1.00000
15 6.63899 osd.15 up 1.00000 1.00000
16 5.42599 osd.16 up 1.00000 1.00000
17 6.69398 osd.17 up 1.00000 1.00000
18 6.22299 osd.18 up 1.00000 1.00000
19 5.59200 osd.19 up 1.00000 1.00000
20 5.93999 osd.20 up 1.00000 1.00000
21 5.31898 osd.21 up 1.00000 1.00000
22 6.12599 osd.22 up 1.00000 1.00000
23 5.71300 osd.23 up 1.00000 1.00000
-3 61.29399 host xt1-hdd
2 6.60899 osd.2 up 1.00000 1.00000
3 4.76199 osd.3 up 1.00000 1.00000
4 5.76399 osd.4 up 1.00000 1.00000
5 5.88100 osd.5 up 1.00000 1.00000
6 6.11499 osd.6 up 1.00000 1.00000
7 6.00000 osd.7 up 1.00000 1.00000
8 6.20900 osd.8 up 1.00000 1.00000
9 6.55099 osd.9 up 1.00000 1.00000
10 6.10999 osd.10 up 1.00000 1.00000
11 6.00000 osd.11 up 1.00000 1.00000
-2 0 host xt9-hdd
-1 0 host xt8-hdd