Creating a new CephFS fails with Error EINVAL: pool 'rbd-ssd' already contains some objects. Use an empty pool instead. Workaround:
ceph fs new cephfs rbd-ssd rbd-hdd --force
The following problem appeared after a power outage. The MDS process reported: Error recovering journal 0x200: (5) Input/output error. Diagnosis:
# Cluster health
ceph health detail
# HEALTH_ERR mds rank 0 is damaged; mds cluster is degraded
# mds.0 is damaged
# Filesystem details; the only MDS, Boron, cannot start
ceph fs status
# cephfs - 0 clients
# ======
# +------+--------+-----+----------+-----+------+
# | Rank | State | MDS | Activity | dns | inos |
# +------+--------+-----+----------+-----+------+
# | 0 | failed | | | | |
# +------+--------+-----+----------+-----+------+
# +---------+----------+-------+-------+
# | Pool | type | used | avail |
# +---------+----------+-------+-------+
# | rbd-ssd | metadata | 138k | 106G |
# | rbd-hdd | data | 4903M | 2192G |
# +---------+----------+-------+-------+
# +-------------+
# | Standby MDS |
# +-------------+
# | Boron |
# +-------------+
# Show the damage details
ceph tell mds.0 damage
# terminate called after throwing an instance of 'std::out_of_range'
# what(): map::at
# Aborted
# Try marking the rank as repaired (no effect)
ceph mds repaired 0
# Try exporting the CephFS journal (no effect)
cephfs-journal-tool journal export backup.bin
# 2019-10-17 16:21:34.179043 7f0670f41fc0 -1 Header 200.00000000 is unreadable
# 2019-10-17 16:21:34.179062 7f0670f41fc0 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`Error ((5) Input/output error)
# Try recovering from the journal (no effect)
# Write all recoverable inodes/dentries in the journal back to the backing store, if their versions are newer than what the backing store holds
cephfs-journal-tool event recover_dentries summary
# Events by type:
# Errors: 0
# 2019-10-17 16:22:00.836521 7f2312a86fc0 -1 Header 200.00000000 is unreadable
# Try resetting the journal (no effect)
cephfs-journal-tool journal reset
# got error -5from Journaler, failing
# 2019-10-17 16:22:14.263610 7fe6717b1700 0 client.6494353.journaler.resetter(ro) error getting journal off disk
# Error ((5) Input/output error)
# Delete and recreate the filesystem (data is lost)
ceph fs rm cephfs --yes-i-really-mean-it
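After removing the damaged filesystem it has to be created again before clients can mount it. A minimal sketch, reusing the two pools from above (the old CephFS metadata is gone at this point):
# Recreate the filesystem on the existing pools; --force because the pools are not empty
ceph fs new cephfs rbd-ssd rbd-hdd --force
# Verify
ceph fs status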
## Hit this problem again
# A deep scrub reveals a data inconsistency on object 200.00000000
ceph osd deep-scrub all
40.14 shard 14: soid 40:292cf221:::200.00000000:head data_digest
0x6ebfd975 != data_digest 0x9e943993 from auth oi 40:292cf221:::200.00000000:head
(22366'34 mds.0.902:1 dirty|data_digest|omap_digest s 90 uv 34 dd 9e943993 od ffffffff alloc_hint [0 0 0])
40.14 deep-scrub 0 missing, 1 inconsistent objects
40.14 deep-scrub 1 errors
# Inspect the inconsistent RADOS object in detail
rados list-inconsistent-obj 40.14 --format=json-pretty
{
"epoch": 23060,
"inconsistents": [
{
"object": {
"name": "200.00000000",
},
"errors": [],
"union_shard_errors": [
# Cause of the error: digest (checksum) mismatch
"data_digest_mismatch_info"
],
"selected_object_info": {
"oid": {
"oid": "200.00000000",
},
},
"shards": [
{
"osd": 7,
"primary": true,
"errors": [],
"size": 90,
"omap_digest": "0xffffffff"
},
{
"osd": 14,
"primary": false,
# errors: the shards are inconsistent and Ceph cannot tell which one is bad. Possible values:
# data_digest_mismatch: this replica's digest differs from the primary's
# size_mismatch: this replica's size differs from the primary's
# read_error: a disk error probably occurred
"errors": [
# Here the two replicas' digests disagree
"data_digest_mismatch_info"
],
"size": 90,
"omap_digest": "0xffffffff",
"data_digest": "0x6ebfd975"
}
]
}
]
}
# Switch to handling the inconsistency: stop osd.14, flush its journal, start osd.14, then run a PG repair.
# No effect... After a PG repair, Ceph overwrites the inconsistent replica with the authoritative copy, but this does not always work,
# e.g. here, where the primary replica's data digest is missing.
# Remove the damaged object
rados -p rbd-ssd rm 200.00000000
This can usually be considered a Ceph bug, often triggered by a particular data state. Sometimes zeroing the weight of the crashing OSD lets it recover:
# Try to work around osd.17 crashing immediately after start
ceph osd reweight 17 0
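Once the OSD runs stably again, the weight can be restored. A minimal sketch, continuing the osd.17 example (restore whatever weight it had before, 1 being the usual default):
# Confirm the OSD stays up this time
ceph osd tree
# Restore the weight
ceph osd reweight 17 1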
If all PGs of a newly created pool are stuck in this state, the CRUSH map is probably abnormal. You can set osd_crush_update_on_start to true so the cluster adjusts the CRUSH map automatically, as sketched below.
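A minimal sketch of enabling this, assuming the setting is distributed via ceph.conf on the OSD nodes (in stock Ceph the option already defaults to true, so first check whether it was turned off):
[osd]
osd crush update on start = true
# restart the OSDs on the node afterwards, e.g.
systemctl restart ceph-osd.target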
ceph -s shows the following state, which does not recover over time:
cluster:
health: HEALTH_WARN
Reduced data availability: 2 pgs inactive, 2 pgs peering
19 slow requests are blocked > 32 sec
data:
pgs: 0.391% pgs not active
510 active+clean
2 peering
In this case, the Pods using these PGs were stuck in the Unknown state.
Check the PGs stuck in the inactive state:
ceph pg dump_stuck inactive
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
17.68 peering [3,12] 3 [3,12] 3
16.32 peering [4,12] 4 [4,12] 4
Query one of the PGs; an excerpt of the diagnostic output:
// ceph pg 17.68 query
{
"info": {
"stats": {
"state": "peering",
"stat_sum": {
"num_objects_dirty": 5
},
"up": [
3,
12
],
"acting": [
3,
12
],
// the OSD that is blocking this PG
"blocked_by": [
12
],
"up_primary": 3,
"acting_primary": 3
}
},
"recovery_state": [
// when healthy, the first element should be "name": "Started/Primary/Active"
{
"name": "Started/Primary/Peering/GetInfo",
"enter_time": "2018-06-11 18:32:39.594296",
// but it is stuck requesting info from OSD 12
"requested_info_from": [
{
"osd": "12"
}
]
},
{
"name": "Started/Primary/Peering",
},
{
"name": "Started",
}
]
}
This does not give an explicit reason why osd.12 blocks peering.
Check the logs: osd.12 is on 10.0.0.104 and osd.3 on 10.0.0.100; the latter is the primary OSD.
Starting around 18:26 the osd.3 log shows heartbeat failures against all other OSDs; at that time 10.0.0.100 was under heavy load and effectively hung.
Around 18:26 the osd.12 log is flooded with entries like:
osd.12 466 heartbeat_check: no reply from 10.0.0.100:6803 osd.4 since back 2018-06-11 18:26:44.973982 ...
Heartbeats were still failing at 18:44; after restarting osd.12, everything returned to normal.
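For reference, a minimal sketch of that restart, assuming a systemd-managed OSD:
# on the host running osd.12 (10.0.0.104 here)
systemctl restart ceph-osd@12
# then confirm the PGs leave the peering state
ceph pg dump_stuck inactive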
Check the PGs that cannot complete:
ceph pg dump_stuck
# PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
# 17.79 incomplete [9,17] 9 [9,17] 9
# 32.1c incomplete [16,9] 16 [16,9] 16
# 17.30 incomplete [16,9] 16 [16,9] 16
# 31.35 incomplete [9,17] 9 [9,17] 9
Query PG 17.30:
// ceph pg 17.30 query
{
"state": "incomplete",
"info": {
"pgid": "17.30",
"stats": {
// blocked by osd.11, which no longer exists
"blocked_by": [
11
],
"up_primary": 16,
"acting_primary": 16
}
},
// recovery history
"recovery_state": [
{
"name": "Started/Primary/Peering/Incomplete",
"enter_time": "2018-06-17 04:48:45.185352",
// final state: there is no complete copy of this PG
"comment": "not enough complete instances of this PG"
},
{
"name": "Started/Primary/Peering",
"enter_time": "2018-06-17 04:48:45.131904",
"probing_osds": [
"9",
"16",
"17"
],
// wants to probe an OSD that no longer exists
"down_osds_we_would_probe": [
11
],
"peering_blocked_by_detail": [
{
"detail": "peering_blocked_by_history_les_bound"
}
]
}
]
}
We can see that 17.30 expects to find authoritative data on osd.11, which is permanently lost. In this situation you can try to force-mark the PG as complete.
First, stop the PG's primary OSD: service ceph-osd@16 stop
Then run the following tool:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 --pgid 17.30 --op mark-complete
# Marking complete
# Marking complete succeeded
Finally, start the PG's primary OSD again: service ceph-osd@16 start
Without replication, a single OSD going down makes data unavailable:
ceph health detail
# Note this acting set has only one member
# pg 2.21 is stuck stale for 688.372740, current state stale+active+clean, last acting [7]
# while other PGs' acting sets do not
# pg 3.4f is active+recovering+degraded, acting [9,1]
If the OSD really has suffered a hardware failure, the data is lost. You also cannot query such a PG.
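To avoid ending up here, keep at least two replicas per pool. A minimal sketch of checking and raising the replica count, with rbd-hdd used as an example pool name:
ceph osd pool get rbd-hdd size
ceph osd pool get rbd-hdd min_size
# raise to two copies; this triggers backfill, so expect recovery traffic
ceph osd pool set rbd-hdd size 2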
Locate the primary OSD of the problematic PG, stop it, flush its journal, then repair the PG:
ceph health detail
# HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
# OSD_SCRUB_ERRORS 2 scrub errors
# PG_DAMAGED Possible data damage: 2 pgs inconsistent
# pg 15.33 is active+clean+inconsistent, acting [8,9]
# pg 15.61 is active+clean+inconsistent, acting [8,16]
# Find the host the OSD runs on
ceph osd find 8
# Log in to the host running osd.8
systemctl stop [email protected]
ceph-osd -i 8 --flush-journal
systemctl start [email protected]
ceph pg repair 15.61
This problem appears when the OSD holding an object's authoritative copy goes down or is removed, for example with two paired OSDs that jointly serve a PG: in such an event sequence, osd.1 knows a newer authoritative copy exists but cannot find it, and requests against the affected objects block until the OSD holding that copy comes back online.
Run the following to locate the affected PGs:
ceph health detail | grep unfound
# OBJECT_UNFOUND 1/90055 objects unfound (0.001%)
# pg 33.3e has 1 unfound objects
# pg 33.3e is active+recovery_wait+degraded, acting [17,6], 1 unfound
Then locate the missing objects:
// ceph pg 33.3e list_missing
{
"offset": {
"oid": "",
"key": "",
"snapid": 0,
"hash": 0,
"max": 0,
"pool": -9223372036854775808,
"namespace": ""
},
"num_missing": 1,
"num_unfound": 1,
"objects": [
{
"oid": {
// the missing object
"oid": "obj_delete_at_hint.0000000066",
"key": "",
"snapid": -2,
"hash": 2846662078,
"max": 0,
"pool": 33,
"namespace": ""
},
"need": "1723'1412",
"have": "0'0",
"flags": "none",
"locations": []
}
],
"more": false
}
If there are more missing objects than can be listed at once, more is shown as true.
The PG's diagnostic info can be viewed with:
// ceph pg 33.3e query
{
"state": "active+recovery_wait+degraded",
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2018-06-16 15:03:32.873855",
// OSDs that might hold the unfound object
"might_have_unfound": [
{
"osd": "6",
"status": "already probed"
},
{
"osd": "11",
"status": "osd is down"
}
],
}
]
}
osd.11 in the output above had failed earlier due to a hardware fault and was removed, which means the unfound object is unrecoverable. You can mark it:
# Roll back to the previous version, or forget the object entirely if it was newly created. Not supported for EC pools
ceph pg 33.3e mark_unfound_lost revert
# Make Ceph forget that the unfound object ever existed
ceph pg 33.3e mark_unfound_lost delete
Replace line 376 of /usr/lib/python2.7/dist-packages/ceph_deploy/osd.py with:
LOG.info(line.decode('utf-8'))
Install ceph-deploy 1.5.39; version 2.0.0 only supports Luminous:
apt remove ceph-deploy
apt install ceph-deploy=1.5.39 -y
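To keep apt from pulling 2.0.0 back in during a later upgrade, the package can be held; a minimal sketch:
apt-mark hold ceph-deploy
# verify the installed version
ceph-deploy --version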
In my environment this was caused by the MON nodes picking up the IP of an LVS virtual interface as their public addr. Fix it by explicitly specifying each MON's IP address in the configuration:
[mon.master01-10-5-38-24]
public addr = 10.5.38.24
cluster addr = 10.5.38.24
[mon.master02-10-5-38-39]
public addr = 10.5.38.39
cluster addr = 10.5.38.39
[mon.master03-10-5-39-41]
public addr = 10.5.39.41
cluster addr = 10.5.39.41
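The edited configuration then has to reach the MON hosts and the MONs must be restarted. A minimal sketch, assuming ceph-deploy manages the cluster and the hostnames match the section names above:
ceph-deploy --overwrite-conf config push master01-10-5-38-24 master02-10-5-38-39 master03-10-5-39-41
# on each MON host
systemctl restart ceph-mon.target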
Deploying in my environment ran into a series of permission-related problems. If you hit the same issues and do not care about security, you can change the configuration:
# kubectl -n ceph edit configmap ceph-etc
apiVersion: v1
data:
ceph.conf: |
[global]
fsid = 08adecc5-72b1-4c57-b5b7-a543cd8295e7
mon_host = ceph-mon.ceph.svc.k8s.gmem.cc
# add the following three lines
auth client required = none
auth cluster required = none
auth service required = none
[osd]
# in large clusters a separate "cluster" network noticeably improves performance
cluster_network = 10.0.0.0/16
ms_bind_port_max = 7100
public_network = 10.0.0.0/16
kind: ConfigMap
If you need to keep the cluster secure, refer to the following cases instead.
Symptom
The Pod never starts; the container log shows:
timeout 10 ceph --cluster ceph auth get-or-create mgr.xenial-100 mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-xenial-100/keyring
0 librados: client.admin authentication error (1) Operation not permitted
Analysis
Connect to a reachable ceph-mon and run:
kubectl -n ceph exec -it ceph-mon-nhx52 -c ceph-mon -- ceph
It reports the same error, which suggests the client.admin keyring is wrong. Log in to the ceph-mon and list the auth entries:
# kubectl -n ceph exec -it ceph-mon-nhx52 -c ceph-mon bash
# ceph --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-xenial-100/keyring auth list
installed auth entries:
client.admin
key: AQAXPdtaAAAAABAA6wd1kCog/XtV9bSaiDHNhw==
auid: 0
caps: [mds] allow
caps: [mgr] allow *
caps: [mon] allow *
caps: [osd] allow *
client.bootstrap-mds
key: AQAgPdtaAAAAABAAFPgqn4/zM5mh8NhccPWKcw==
caps: [mon] allow profile bootstrap-mds
client.bootstrap-osd
key: AQAUPdtaAAAAABAASbfGQ/B/PY4Imoa4Gxsa2Q==
caps: [mon] allow profile bootstrap-osd
client.bootstrap-rgw
key: AQAJPdtaAAAAABAAswtFjgQWahHsuy08Egygrw==
caps: [mon] allow profile bootstrap-rgw
The client.admin keyring currently in use, however, contains:
[client.admin]
key = AQAda9taAAAAABAAgWIsgbEiEsFRJQq28hFgTQ==
auid = 0
caps mds = "allow"
caps mon = "allow *"
caps osd = "allow *"
caps mgr = "allow *"
The contents differ. The client.admin keyring shown by auth list turns out to be the valid one:
ceph --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-xenial-100/keyring auth get client.admin > client.admin.keyring
ceph --name client.admin --keyring client.admin.keyring # OK
Checking /etc/ceph/ceph.client.admin.keyring in each Pod shows it is mounted from the Secret ceph-client-admin-keyring. So how is that Secret generated? Run:
kubectl -n ceph get job --output=yaml --export | grep ceph-client-admin-keyring -B 50
This shows that the Job ceph-storage-keys-generator produces the Secret. Its Pod log records generating the keyring and creating the Secret, and the Pod spec shows that the responsible script /opt/ceph/ceph-storage-key.sh is mounted from ceph-storage-key.sh in the ConfigMap ceph-bin.
The simplest fix is to modify the Secret so that it holds the keyring that is actually valid in the cluster:
# Export the Secret definition
kubectl -n ceph get secret ceph-client-admin-keyring --output=yaml --export > ceph-client-admin-keyring
# Base64-encode the valid keyring
cat client.admin.keyring | base64
# Replace the encoded value in the Secret with the Base64 above, then re-apply the Secret
kubectl -n ceph apply -f ceph-client-admin-keyring
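The export/edit/apply round trip can also be done in one step; a sketch that assumes the Secret stores the keyring under the data key ceph.client.admin.keyring (check your Secret's data field first):
kubectl -n ceph create secret generic ceph-client-admin-keyring \
  --from-file=ceph.client.admin.keyring=client.admin.keyring \
  --dry-run -o yaml | kubectl -n ceph apply -f -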
The Secret pvc-ceph-client-key also stores the admin user's key; its content must likewise be replaced with the valid one:
kubectl -n ceph edit secret pvc-ceph-client-key
The cause is similar to the previous issue: again a permissions problem.
Check the events of the PVC that cannot bind:
# kubectl -n ceph describe pvc
Normal Provisioning 53s ceph.com/rbd ceph-rbd-provisioner-5544dcbcf5-n846s 708edb2c-4619-11e8-abf2-e672650d97a2 External provisioner is provisioning volume for claim
"ceph/ceph-pvc"
Warning ProvisioningFailed 53s ceph.com/rbd ceph-rbd-provisioner-5544dcbcf5-n846s 708edb2c-4619-11e8-abf2-e672650d97a2 Failed to provision volume with StorageClass "general"
: failed to create rbd image: exit status 1, command output: 2018-04-22 13:44:35.269967 7fb3e3e3ad80 -1 did not load config file, using default settings.
2018-04-22 13:44:35.297828 7fb3e3e3ad80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2)
No such file or directory
7fb3e3e3ad80  0 librados: client.admin authentication error (1) Operation not permitted
The rbd-provisioner reads the StorageClass definition to obtain the credentials it needs:
# kubectl -n ceph get storageclass --output=yaml
apiVersion: v1
items:
- apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: general
parameters:
adminId: admin
adminSecretName: pvc-ceph-conf-combined-storageclass
adminSecretNamespace: ceph
imageFeatures: layering
imageFormat: "2"
monitors: ceph-mon.ceph.svc.k8s.gmem.cc:6789
pool: rbd
userId: admin
userSecretName: pvc-ceph-client-key
provisioner: ceph.com/rbd
reclaimPolicy: Delete
Two Secrets are involved: pvc-ceph-conf-combined-storageclass and pvc-ceph-client-key. The correct keyring content must be written into both.
Symptom:
The PVC can be provisioned and the RBD image can be mapped with Ceph commands, but the Pod fails to start; describing it shows:
auth: unable to find a keyring on /etc/ceph/keyring: (2) No such file or directory
monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
librados: client.admin authentication error (95) Operation not supported
Fix:
Copy ceph.client.admin.keyring to /etc/ceph/keyring.
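A minimal sketch, run on the node that mounts the volume:
cp /etc/ceph/ceph.client.admin.keyring /etc/ceph/keyring
chmod 600 /etc/ceph/keyring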
Same cause as the previous issue. Check the log of the container that fails to start:
kubectl -n ceph logs ceph-osd-dev-vdb-bjnbm -c osd-prepare-pod
# ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring health
# 0 librados: client.bootstrap-osd authentication error (1) Operation not permitted
# [errno 1] error connecting to the cluster
Further inspection shows that /var/lib/ceph/bootstrap-osd/ceph.keyring is mounted from the ceph.keyring entry of the Secret ceph-bootstrap-osd-keyring:
# kubectl -n ceph get secret ceph-bootstrap-osd-keyring --output=yaml --export
apiVersion: v1
data:
ceph.keyring: W2NsaWVudC5ib290c3RyYXAtb3NkXQogIGtleSA9IEFRQVlhOXRhQUFBQUFCQUFSQ2l1bVY1NFpOU2JGVWwwSDZnYlJ3PT0KICBjYXBzIG1vbiA9ICJhbGxvdyBwcm9maWxlIGJvb3RzdHJhcC1vc2QiCgo=
kind: Secret
metadata:
creationTimestamp: null
name: ceph-bootstrap-osd-keyring
selfLink: /api/v1/namespaces/ceph/secrets/ceph-bootstrap-osd-keyring
type: Opaque
# After Base64 decoding:
[client.bootstrap-osd]
key = AQAYa9taAAAAABAARCiumV54ZNSbFUl0H6gbRw==
caps mon = "allow profile bootstrap-osd"
Obtain the actually valid keyring:
kubectl -n ceph exec -it ceph-mon-nhx52 -c ceph-mon -- ceph --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-xenial-100/keyring auth get client.bootstrap-osd
# Note: the first line of the output, exported keyring for client.bootstrap-osd, is not part of the keyring
[client.bootstrap-osd]
key = AQAUPdtaAAAAABAASbfGQ/B/PY4Imoa4Gxsa2Q==
caps mon = "allow profile bootstrap-osd"
Edit the Secret with kubectl -n ceph edit secret ceph-bootstrap-osd-keyring and replace its content with the keyring above.
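To produce the Base64 value for the Secret non-interactively, something like the following should work; it assumes, per the note above, that the first stdout line is the exported-keyring banner (drop the tail if your Ceph version prints that banner to stderr instead):
kubectl -n ceph exec ceph-mon-nhx52 -c ceph-mon -- \
  ceph --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-xenial-100/keyring auth get client.bootstrap-osd \
  | tail -n +2 | base64 -w0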
Error message:
# kubectl -n ceph logs ceph-osd-dev-vdc-cpkxh -c osd-activate-pod
ceph_disk.main.Error: Error: No cluster conf found in /etc/ceph with fsid 08adecc5-72b1-4c57-b5b7-a543cd8295e7
# every OSD reports the same error
The corresponding configuration file contains:
kubectl -n ceph get configmap ceph-etc --output=yaml
apiVersion: v1
data:
ceph.conf: |
[global]
fsid = a4426e8a-c46d-4407-95f1-911a23a0dd6e
mon_host = ceph-mon.ceph.svc.k8s.gmem.cc
[osd]
cluster_network = 10.0.0.0/16
ms_bind_port_max = 7100
public_network = 10.0.0.0/16
kind: ConfigMap
metadata:
name: ceph-etc
namespace: ceph
As you can see, the fsid does not match. Correcting the fsid in the ConfigMap resolves the problem.
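A minimal sketch of confirming the cluster's real fsid before editing:
ceph fsid
# e.g. 08adecc5-72b1-4c57-b5b7-a543cd8295e7, the fsid from the OSD error above
kubectl -n ceph edit configmap ceph-etc   # set fsid under [global] to that value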
Error messages:
describe pod reports: timeout expired waiting for volumes to attach/mount for pod
kubelet reports: executable file not found in $PATH, rbd output
Cause:
A dynamically provisioned persistent volume goes through two phases: provisioning (the provisioner creates the RBD image) and attach/mount (the kubelet on the node maps and mounts it, which requires the rbd binary locally).
Fix:
# Install the client packages
apt install -y ceph-common
# Copy the following files from a ceph-mon node:
# /etc/ceph/ceph.client.admin.keyring
# /etc/ceph/ceph.conf
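A minimal sketch of the copy, with ceph-mon-host standing in for whatever MON host you can reach over SSH:
scp root@ceph-mon-host:/etc/ceph/ceph.conf /etc/ceph/ceph.conf
scp root@ceph-mon-host:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring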
If, after applying the fix above, you still get rbd: map failed exit status 110, rbd output: rbd: sysfs write failed In some cases useful info is found in syslog, check the kernel log:
dmesg | tail
# [ 3004.833252] libceph: mon0 10.0.0.100:6789 feature set mismatch, my 106b84a842a42
# < server's 40106b84a842a42, missing 400000000000000
# [ 3004.840980] libceph: mon0 10.0.0.100:6789 missing required protocol features
Comparing against the feature table earlier in this article, the kernel must be 4.5+ (CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING).
The simplest option is to upgrade the kernel:
# Desktop
apt install --install-recommends linux-generic-hwe-16.04 xserver-xorg-hwe-16.04 -y
# Server
apt install --install-recommends linux-generic-hwe-16.04 -y
sudo apt-get remove linux-headers-4.4.* -y && \
sudo apt-get remove linux-image-4.4.* -y && \
sudo apt-get autoremove -y && \
sudo update-grub
Alternatively, lower the CRUSH tunables profile to hammer:
ceph osd crush tunables hammer
# adjusted tunables profile to hammer
Error message: ERROR: osd init failed: (36) File name too long
Cause: the OSD filesystem is EXT4, which limits the size of stored xattrs; use XFS if at all possible.
Workaround: add the following to the configuration file:
osd_max_object_name_len = 256
osd_max_object_namespace_len = 64
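A minimal sketch of where the settings go, assuming they are applied through ceph.conf and the OSDs are restarted afterwards:
# /etc/ceph/ceph.conf
[osd]
osd_max_object_name_len = 256
osd_max_object_namespace_len = 64
# then, on each OSD node
systemctl restart ceph-osd.target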
Error message: Fail to open '/proc/0/cmdline' error No such file or directory
Cause: on CentOS 7, deploying ceph-mon and a directory-based ceph-osd on the same node (via Helm) triggered this error; separating them made it disappear. In addition, the MON nodes carried virtual IPs in the same subnet as Ceph's cluster/public network, which caused some OSDs to listen on the wrong address.
The problem occurred again later because a virtual interface lo:ngress used the same subnet as eth0, so the OSD picked the wrong network.
The fix is to hard-code the OSD's listen addresses:
[osd.2]
public addr = 10.0.4.1
cluster addr = 10.0.4.1
Error message: Input/output error; dmesg | tail shows more detail.
Possible causes:
CephFS from external-storage provisions fine, but reading or writing data fails with this error. The cause is an over-long file path, which depends on the underlying filesystem: to stay compatible with machines using Ext filesystems we had limited osd_max_object_name_len.
Workaround: name the directory after namespace + PVC name instead of a UUID. Modify cephfs-provisioner.go at line 118:
// name the share after the PVC instead of a random ID
share := fmt.Sprintf("%s-%s", options.PVC.Namespace, options.PVC.Name)
// name the CephFS user the same way
user := fmt.Sprintf("%s-%s", options.PVC.Namespace, options.PVC.Name)
Then recompile.
describe pod shows:
rbd image rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 is still being used
This means another client is still using the image. Attempting to delete it fails:
rbd rm rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6
librbd::image::RemoveRequest: 0x560e39df9af0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.
To find out who the watcher is, run:
rbd status rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6
Watchers:
watcher=10.5.39.12:0/1652752791 client.94563 cookie=18446462598732840961
So 10.5.39.12 is holding the image.
Another way to find the watcher is via the image's header object. Get the image info first:
rbd info rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6
rbd image 'kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6':
size 8192 MB in 2048 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.134474b0dc51
format: 2
features: layering
flags:
create_timestamp: Wed Jul 11 17:49:51 2018
The block_name_prefix field is rbd_data.134474b0dc51; replacing data with header gives the header object name. Then run:
rados listwatchers -p rbd-unsafe rbd_header.134474b0dc51
watcher=10.5.39.12:0/1652752791 client.94563 cookie=18446462598732840961
Since 10.5.39.12 is holding the image, simply disconnect it. On that machine, list the currently mapped RBD images:
rbd showmapped
id pool image snap device
0 rbd-unsafe kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6 - /dev/rbd0
1 rbd-unsafe kubernetes-dynamic-pvc-0729f9a6-84f0-11e8-9b75-5a3f858854b1 - /dev/rbd1
rbd0 is mapped on this machine but not mounted, so unmap it:
rbd unmap /dev/rbd0
Checking the image status again shows no watchers:
rbd status rbd-unsafe/kubernetes-dynamic-pvc-c0ac2cff-84ef-11e8-9a2a-566b651a72d6
Watchers: none
kubectl describe reports: Unable to mount volumes for pod... timeout expired waiting for volumes to attach or mount for pod...
Inspection shows the target RBD image has no watcher, and the kubelet on the Pod's node reports rbd: map failed signal: aborted (core dumped). An rbd unmap had previously been run on that machine.
Manually running rbd map made the problem disappear.
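For reference, a hedged sketch of the manual map, with <pool> and <image> as placeholders for the PV's actual pool and image:
rbd map <pool>/<image>    # prints the mapped device, e.g. /dev/rbd0
rbd showmapped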
journal do_read_entry: bad header magic
Error message: journal do_read_entry(156389376): bad header magic... FAILED assert(interval.last > last)
This is a known bug in 12.2: after a power loss an OSD may fail to start, possibly with data loss.
An RGW instance fails to start; journalctl shows the message above.
For more detail, check the RGW log:
2020-10-22 16:51:55.771035 7fb1b0f20e80 0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 2546439
2020-10-22 16:51:55.792872 7fb1b0f20e80 0 librados: client.rgw.ceph02 authentication error (22) Invalid argument
2020-10-22 16:51:55.793450 7fb1b0f20e80 -1 Couldn't init storage provider (RADOS)
This points to an authentication problem.
Get the command line from systemctl status ceph-radosgw@rgw.$RGW_HOST and run it manually:
radosgw -f --cluster ceph --name client.rgw.ceph02 --setuser ceph --setgroup ceph -d --debug_ms 1
It fails with the same error. Adding the --keyring argument resolves it:
radosgw -f --cluster ceph --name client.rgw.ceph02 \
--setuser ceph --setgroup ceph -d --debug_ms 1 \
--keyring=/var/lib/ceph/radosgw/ceph-rgw.ceph02/keyring
So the systemd service apparently fails to find the keyring.
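A hedged, more permanent fix is to point the RGW at its keyring in ceph.conf so the systemd unit finds it (section name and path taken from the command above):
[client.rgw.ceph02]
keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph02/keyring
# then
systemctl restart ceph-radosgw@rgw.ceph02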
Error message: Unhandled exception from module 'prometheus' while running on mgr.master01-10-5-38-24: error('No socket could be created',)
Fix: ceph config-key set mgr/prometheus/server_addr 0.0.0.0
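The module usually has to be reloaded for the new bind address to take effect; a minimal sketch, with <mgr-host> as a placeholder (the module listens on port 9283 by default):
ceph mgr module disable prometheus
ceph mgr module enable prometheus
curl http://<mgr-host>:9283/metrics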
The cause is that the clock-skew warning threshold is too low. Add the following to the global section and restart the MONs:
mon clock drift allowed = 2
mon clock drift warn backoff = 30
Or apply the settings immediately with:
ceph tell mon.* injectargs '--mon_clock_drift_allowed=2'
ceph tell mon.* injectargs '--mon_clock_drift_warn_backoff=30'
Or check the NTP setup to ensure sufficiently accurate time synchronization.
Deep scrubbing is I/O-intensive. If it cannot finish for a long time, it can be disabled:
ceph osd set noscrub
ceph osd set nodeep-scrub
Re-enable it once the problem is resolved:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
When CFQ is the I/O scheduler, the priority of the OSD disk threads can be adjusted:
# Set the scheduler
echo cfq > /sys/block/sda/queue/scheduler
# Check an OSD's current disk-thread I/O priority class
ceph daemon osd.4 config get osd_disk_thread_ioprio_class
# Change the I/O priority
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
# IOPRIO_CLASS_RT is the highest, IOPRIO_CLASS_IDLE the lowest
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
If the measures above do not solve the problem, consider tuning the following parameters:
osd_deep_scrub_stride = 131072
# Range of chunks scrubbed per pass
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5
osd scrub during recovery = false
osd deep scrub interval = 2592000
osd scrub max interval = 2592000
# Number of concurrent scrubs per OSD
osd max scrubs = 1
# Scrub time window
osd scrub begin hour = 2
osd scrub end hour = 6
# Do not scrub when the system load exceeds this value
osd scrub load threshold = 4
# Force a 0.1 s sleep after each scrub chunk
osd scrub sleep = 0.1
# Disk-thread I/O priority
osd disk thread ioprio priority = 7
osd disk thread ioprio class = idle
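Most of these can also be injected at runtime without restarting the OSDs, though injected values do not survive a daemon restart; a minimal sketch for two of them:
ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
ceph tell osd.* injectargs '--osd_max_scrubs 1'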
If the watcher has been blacklisted, trying to unmap the image fails with: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy
You can force the unmap with: rbd unmap -o force ...
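A minimal sketch of checking for a blacklist entry and then forcing the unmap, with /dev/rbd0 as a placeholder device:
# see whether the client address appears in the blacklist
ceph osd blacklist ls
# force the unmap
rbd unmap -o force /dev/rbd0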
Some PGs are stuck because the number of PGs allowed per OSD is capped; raise the global option mon_max_pg_per_osd and restart the MONs.
Also note: after adjusting PG counts, always wait until the PGs reach active+clean before making the next adjustment.
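A minimal sketch, with 400 as an arbitrary example value:
# /etc/ceph/ceph.conf on the MON hosts
[global]
mon_max_pg_per_osd = 400
# restart the MONs
systemctl restart ceph-mon.target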
The K8S PV for the second image below has already been deleted:
rbd ls
# kubernetes-dynamic-pvc-35350b13-46b8-11e8-bde0-a2c14c93573f
# kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4
but the corresponding RBD image was not, so delete it manually:
rbd remove kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4
This fails:
2018-04-23 13:37:25.559444 7f919affd700 -1 librbd::image::RemoveRequest: 0x5598e77831d0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete…failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.
Check the RBD status:
# rbd info kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4
rbd image 'kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4':
size 8192 MB in 2048 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1003e238e1f29
format: 2
features: layering
flags:
create_timestamp: Mon Apr 23 11:42:59 2018
#rbd status kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4
Watchers:
watcher=10.0.0.101:0/4275384344 client.65597 cookie=18446462598732840963
On 10.0.0.101, check:
# df | grep e6e3339859d4
/dev/rbd2 8125880 251560 7438508 4% /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-kubernetes-dynamic-pvc-78740b26-46eb-11e8-8349-e6e3339859d4
After restarting the kubelet, the RBD image can be deleted.
# Delete the key
ceph auth del osd.9
# Re-gather keys from the target host
ceph-deploy --username ceph-ops gatherkeys Carbon
pgs: 12.413% pgs unknown
20.920% pgs not active
768 active+clean
241 creating+activating
143 unknown
This was probably caused by too many PGs in total; after the PG count was reduced, they quickly became active+clean.
Error message: Orphaned pod "a9621c0e-41ee-11e8-9407-deadbeef00a0" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them
Temporary workaround:
rm -rf /var/lib/kubelet/pods/a9621c0e-41ee-11e8-9407-deadbeef00a0/volumes/rook.io~rook/
A likely cause is that the keyring the OSD uses does not match what the MON has. For OSD id 14, replace the content of /var/lib/ceph/osd/ceph-14/keyring on the host with the first two lines of the output of ceph auth get osd.14, as sketched below.
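A hedged sketch of that replacement; it assumes, per the text above, that the first two stdout lines of ceph auth get are the [osd.14] header and its key:
ceph auth get osd.14 | head -n 2 > /var/lib/ceph/osd/ceph-14/keyring
chown ceph:ceph /var/lib/ceph/osd/ceph-14/keyring
systemctl restart ceph-osd@14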
Running a ceph-objectstore-tool command without stopping the OSD first produces this error.
Error: neither public_addr nor public_network keys are defined for monitors. This appears when adding a MON node via ceph-deploy; add a public_network setting to the global section of the configuration file.
Possible cause:
chown: cannot access ‘/var/log/ceph’: No such file or directory
The OSD fails to start with the error above; you can configure:
ceph:
storage:
osd_log: /var/log
HEALTH_WARN application not enabled on
# usage: ceph osd pool application enable <pool> <application>
ceph osd pool application enable rbd block-devices
Debug logging
Note: verbose logging can exceed 1 GB per hour; if the system disk fills up, the node stops working.
# Set it remotely via ceph tell
ceph tell osd.0 config set debug_osd 0/5
# Or on the target host, via the daemon's admin socket
ceph daemon osd.0 config set debug_osd 0/5
Log levels can be customized per subsystem:
# debug {subsystem} = {log-level}/{memory-level}
[global]
debug ms = 1/5
[mon]
debug mon = 20
debug paxos = 1/5
debug auth = 2
[osd]
debug osd = 1/5
debug filestore = 1/5
debug journal = 1
debug monc = 5/20
[mds]
debug mds = 1
debug mds balancer = 1
debug mds log = 1
debug mds migrator = 1
Subsystem list:
Subsystem | Log Level | Memory Level |
---|---|---|
default | 0 | 5 |
lockdep | 0 | 1 |
context | 0 | 1 |
crush | 1 | 1 |
mds | 1 | 5 |
mds balancer | 1 | 5 |
mds locker | 1 | 5 |
mds log | 1 | 5 |
mds log expire | 1 | 5 |
mds migrator | 1 | 5 |
buffer | 0 | 1 |
timer | 0 | 1 |
filer | 0 | 1 |
striper | 0 | 1 |
objecter | 0 | 1 |
rados | 0 | 5 |
rbd | 0 | 5 |
rbd mirror | 0 | 5 |
rbd replay | 0 | 5 |
journaler | 0 | 5 |
objectcacher | 0 | 5 |
client | 0 | 5 |
osd | 1 | 5 |
optracker | 0 | 5 |
objclass | 0 | 5 |
filestore | 1 | 3 |
journal | 1 | 3 |
ms | 0 | 5 |
mon | 1 | 5 |
monc | 0 | 10 |
paxos | 1 | 5 |
tp | 0 | 5 |
auth | 1 | 5 |
crypto | 1 | 5 |
finisher | 1 | 1 |
reserver | 1 | 1 |
heartbeatmap | 1 | 5 |
perfcounter | 1 | 5 |
rgw | 1 | 5 |
rgw sync | 1 | 5 |
civetweb | 1 | 10 |
javaclient | 1 | 5 |
asok | 1 | 5 |
throttle | 1 | 1 |
refs | 0 | 0 |
compressor | 1 | 5 |
bluestore | 1 | 5 |
bluefs | 1 | 5 |
bdev | 1 | 3 |
kstore | 1 | 5 |
rocksdb | 4 | 5 |
leveldb | 4 | 5 |
memdb | 4 | 5 |
fuse | 1 | 5 |
mgr | 1 | 5 |
mgrc | 1 | 5 |
dpdk | 1 | 5 |
eventtrace | 1 | 5 |
If disk space is limited, configure /etc/logrotate.d/ceph to rotate logs more aggressively:
rotate 7
weekly
size 500M
compress
sharedscripts
Then add a cron job to run logrotate periodically:
30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1