When a disk in the cluster fails during operation, how do we replace it with a new one? The following walks through the procedure.

Check the Ceph status:
root@pve-1:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.43822 root default
-3 0.14607 host pve-1
0 hdd 0.04869 osd.0 up 1.00000 1.00000
3 hdd 0.04869 osd.3 up 1.00000 1.00000
7 hdd 0.04869 osd.7 up 1.00000 1.00000
-5 0.14607 host pve-2
2 hdd 0.04869 osd.2 up 1.00000 1.00000
4 hdd 0.04869 osd.4 up 1.00000 1.00000
6 hdd 0.04869 osd.6 up 1.00000 1.00000
-7 0.14607 host pve-3
1 hdd 0.04869 osd.1 up 1.00000 1.00000
5 hdd 0.04869 osd.5 up 1.00000 1.00000
8 hdd 0.04869 osd.8 up 1.00000 1.00000
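
Besides ceph osd tree, it is worth checking the overall cluster health before touching anything. A quick sketch using standard Ceph commands:

ceph -s              # overall status: health, monitors, OSD count, PG states
ceph health detail   # detailed description of any warnings or errors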

Everything is up and healthy. Now I will pull one of the disks to simulate a failure:
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.43822 root default
-3 0.14607 host pve-1
0 hdd 0.04869 osd.0 up 1.00000 1.00000
3 hdd 0.04869 osd.3 up 1.00000 1.00000
7 hdd 0.04869 osd.7 up 1.00000 1.00000
-5 0.14607 host pve-2
2 hdd 0.04869 osd.2 up 1.00000 1.00000
4 hdd 0.04869 osd.4 up 1.00000 1.00000
6 hdd 0.04869 osd.6 up 1.00000 1.00000
-7 0.14607 host pve-3
1 hdd 0.04869 osd.1 up 1.00000 1.00000
5 hdd 0.04869 osd.5 up 1.00000 1.00000
8 hdd 0.04869 osd.8 down 1.00000 1.00000

osd.8 has gone down.
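
If you are not sure which physical disk backs the failed OSD, Ceph can tell you. For example (output fields vary with the Ceph release):

ceph osd find 8        # reports which host osd.8 lives on
ceph osd metadata 8    # includes the name of the backing device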

Next we simulate removing the failed disk and adding a new one.
ceph osd out osd.8
ceph auth del osd.8
ceph osd rm 8
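
These three steps mark the OSD out of the data distribution, delete its authentication key, and remove it from the OSD map. If the failed OSD daemon were still running on the host, you would normally stop it first; on a systemd-based deployment such as Proxmox VE that would be:

systemctl stop ceph-osd@8

Then verify the result: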
root@pve-1:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.43822 root default
-3 0.14607 host pve-1
0 hdd 0.04869 osd.0 up 1.00000 1.00000
3 hdd 0.04869 osd.3 up 1.00000 1.00000
7 hdd 0.04869 osd.7 up 1.00000 1.00000
-5 0.14607 host pve-2
2 hdd 0.04869 osd.2 up 1.00000 1.00000
4 hdd 0.04869 osd.4 up 1.00000 1.00000
6 hdd 0.04869 osd.6 up 1.00000 1.00000
-7 0.14607 host pve-3
1 hdd 0.04869 osd.1 up 1.00000 1.00000
5 hdd 0.04869 osd.5 up 1.00000 1.00000
8 hdd 0.04869 osd.8 DNE 0

The status of osd.8 is now DNE (does not exist).

Remove the failed OSD from the CRUSH map as follows:
ceph osd crush rm osd.8
root@pve-1:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.38953 root default
-3 0.14607 host pve-1
0 hdd 0.04869 osd.0 up 1.00000 1.00000
3 hdd 0.04869 osd.3 up 1.00000 1.00000
7 hdd 0.04869 osd.7 up 1.00000 1.00000
-5 0.14607 host pve-2
2 hdd 0.04869 osd.2 up 1.00000 1.00000
4 hdd 0.04869 osd.4 up 1.00000 1.00000
6 hdd 0.04869 osd.6 up 1.00000 1.00000
-7 0.09738 host pve-3
1 hdd 0.04869 osd.1 up 1.00000 1.00000
5 hdd 0.04869 osd.5 up 1.00000 1.00000
osd.8 no longer appears in the tree, which means the removal succeeded.
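
In this demo the replacement disk is brand new. If you were reusing a disk that had previously carried an OSD, it should be wiped first so that the old LVM/partition metadata does not get in the way; a sketch, assuming the device is /dev/sdd:

ceph-volume lvm zap /dev/sdd --destroy   # clears LVM volumes and partition data left by a previous OSD
# or, more generically:
wipefs -a /dev/sdd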

Check the disk we just added:
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 20G 0 disk
├─sda1 8:1 0 1007K 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 19.5G 0 part
├─pve-swap 253:0 0 2.4G 0 lvm [SWAP]
├─pve-root 253:1 0 4.8G 0 lvm /
├─pve-data_tmeta 253:2 0 1G 0 lvm
│ └─pve-data 253:4 0 8G 0 lvm
└─pve-data_tdata 253:3 0 8G 0 lvm
└─pve-data 253:4 0 8G 0 lvm
sdb 8:16 0 50G 0 disk
├─sdb1 8:17 0 100M 0 part /var/lib/ceph/osd/ceph-1
└─sdb2 8:18 0 49.9G 0 part
sdc 8:32 0 50G 0 disk
├─sdc1 8:33 0 100M 0 part /var/lib/ceph/osd/ceph-5
└─sdc2 8:34 0 49.9G 0 part
sdd 8:48 0 50G 0 disk
sr0 11:0 1 655.3M 0 rom
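
The new, still empty 50G disk shows up as sdd. Create an OSD on it with the command below (on newer Proxmox VE releases the equivalent command is pveceph osd create /dev/sdd):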
pveceph createosd /dev/sdd
root@pve-3:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.43822 root default
-3 0.14607 host pve-1
0 hdd 0.04869 osd.0 up 1.00000 1.00000
3 hdd 0.04869 osd.3 up 1.00000 1.00000
7 hdd 0.04869 osd.7 up 1.00000 1.00000
-5 0.14607 host pve-2
2 hdd 0.04869 osd.2 up 1.00000 1.00000
4 hdd 0.04869 osd.4 up 1.00000 1.00000
6 hdd 0.04869 osd.6 up 1.00000 1.00000
-7 0.14607 host pve-3
1 hdd 0.04869 osd.1 up 1.00000 1.00000
5 hdd 0.04869 osd.5 up 1.00000 1.00000
8 hdd 0.04869 osd.8 up 1.00000 1.00000
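
osd.8 is back in the tree under pve-3 with status up, so the new disk has joined the cluster and Ceph will start backfilling data onto it. You can watch the recovery progress, for example:

ceph -s    # recovery/backfill progress is shown in the pgs and io sections
ceph -w    # follows cluster events live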

If an entire node has failed and needs to be removed from the Ceph cluster, delete its host bucket from the CRUSH map:

ceph osd crush rm pve-3
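
This removes the empty host bucket from the CRUSH map (it only succeeds once the host no longer contains any OSDs). If the failed node also ran a Ceph monitor, that would have to be removed from the monitor map as well; assuming the monitor was named pve-3:

ceph mon remove pve-3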

Remove the failed node from the Proxmox cluster
Log in to any healthy node in the cluster and run the following command to evict it:
pvecm delnode pve-3
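
Note that pvecm delnode only evicts the node from the cluster; its directory may remain under /etc/pve/nodes/pve-3 on the surviving nodes so that any guest configuration stored there can still be recovered. Once nothing in it is needed any more, it can be cleaned up manually:

rm -rf /etc/pve/nodes/pve-3   # optional cleanup, only after salvaging any guest configs from it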
Recovering the failed machine

The best approach is to wipe it completely, reinstall the operating system, and rejoin it to the cluster with a new IP address.