To run stateful services, k8s needs a persistent storage solution; we use ceph as the underlying storage. There are generally two ways to connect k8s to ceph:
- Deploy and manage ceph through rook, serving ceph from k8s itself. The official rook documentation is very detailed and already covers fixes for the common problems; it worked smoothly for me, so I won't repeat it here. See:
https://rook.io/docs/rook/v1.3/ceph-quickstart.html
https://rook.io/docs/rook/v1.3/ceph-toolbox.html
https://rook.io/docs/rook/v1.3/ceph-cluster-crd.html#storage-selection-settings
https://rook.io/docs/rook/v1.3/ceph-block.html
- Connect k8s to an external ceph cluster.
This post documents the second approach: connecting a k8s cluster to an external ceph cluster, along with the problems hit along the way.
Environment preparation
The k8s and ceph environments we use are described in:
https://blog.51cto.com/leejia/2495558
https://blog.51cto.com/leejia/2499684
Static persistent volumes
With static provisioning, every time storage space is needed, a storage administrator must first create the corresponding image on the storage backend by hand before k8s can use it.
Create the ceph secret
k8s needs a secret for accessing ceph, mainly used when it maps the rbd image.
1. On the ceph master node, run the following command to get the base64-encoded key of the admin user (in production you would create a dedicated user for k8s instead; a sketch follows the output below):
# ceph auth get-key client.admin | base64
QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==
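For production, a dedicated ceph user scoped to the rbd pool is preferable to handing out the admin key. A hedged sketch; the client.k8s name is an assumption and the capabilities should be tightened to your needs:
# ceph auth get-or-create client.k8s mon 'allow r' osd 'allow rwx pool=rbd' -o ceph.client.k8s.keyring
# ceph auth get-key client.k8s | base64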
2. Create the secret in k8s via a manifest:
# vim ceph-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
data:
  key: QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==
# kubectl apply -f ceph-secret.yaml
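An optional quick check that the secret exists and that the stored key decodes back to the original:
# kubectl get secret ceph-secret
# kubectl get secret ceph-secret -o jsonpath='{.data.key}' | base64 -d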
Create the image
By default, the pool used after a ceph install is rbd. Create the image with the following commands, either on a host with the ceph client installed or directly on the ceph master node:
# rbd create image1 -s 1024
# rbd info rbd/image1
rbd image 'image1':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.374d6b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
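As will become clear below, the older kernel on the k8s nodes cannot map images that have the newer features enabled. If you know this up front, you can create the image with only the layering feature and skip the later "feature disable" step; a sketch:
# rbd create image1 -s 1024 --image-feature layering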
Create the persistent volume
Create it in k8s via a manifest:
# vim pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ceph-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  rbd:
    monitors:
      - 172.18.2.172:6789
      - 172.18.2.178:6789
      - 172.18.2.189:6789
    pool: rbd
    image: image1
    user: admin
    secretRef:
      name: ceph-secret
    fsType: ext4
  persistentVolumeReclaimPolicy: Retain
# kubectl apply -f pv.yaml
persistentvolume/ceph-pv created
# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
ceph-pv 1Gi RWO,ROX Retain Available 76s
The main fields are explained below:
1. accessModes:
RWO (ReadWriteOnce): the volume can be mounted read-write by a single node only;
ROX (ReadOnlyMany): the volume can be mounted read-only by many nodes;
RWX (ReadWriteMany): the volume can be mounted read-write by many nodes.
2. fsType:
If the PersistentVolume's volumeMode is Filesystem, this field specifies the filesystem to use when mounting the volume. If the volume has not been formatted yet and formatting is supported, this value is used to format it.
3. persistentVolumeReclaimPolicy, one of three reclaim policies (a patch sketch for changing the policy of an existing PV follows this list):
Delete: the default for dynamically provisioned PersistentVolumes. When the user deletes the corresponding PersistentVolumeClaim, the dynamically provisioned volume is deleted as well.
Retain: the right choice when the volume holds important data. With Retain, deleting the PersistentVolumeClaim does not delete the PersistentVolume; it moves to the Released state and the data can be recovered manually.
Recycle: when the user deletes the PersistentVolumeClaim, the data on the volume is scrubbed but the volume itself is kept. (Recycle is deprecated; dynamic provisioning is recommended instead.)
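For reference, the reclaim policy of an existing PV can also be changed after creation with a patch; a minimal sketch against the PV created above:
# kubectl patch pv ceph-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'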
Create the persistent volume claim
Create it in k8s via a manifest:
# vim pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-claim
spec:
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi
# kubectl apply -f pvc.yaml
Once the claim is created, k8s binds the best-matching PV to it: the PV's capacity must satisfy the claim's request, and the PV's access modes must include the access modes the claim asks for. The PVC above therefore binds to the PV we just created.
Check the binding:
# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
ceph-claim Bound ceph-pv 1Gi RWO,ROX 13m
Using the persistent volume in a pod
Create the pod in k8s via a manifest:
# vim ubuntu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ceph-pod
spec:
  containers:
    - name: ceph-ubuntu
      image: phusion/baseimage
      command: ["sh", "/sbin/my_init"]
      volumeMounts:
        - name: ceph-mnt
          mountPath: /mnt
          readOnly: false
  volumes:
    - name: ceph-mnt
      persistentVolumeClaim:
        claimName: ceph-claim
# kubectl apply -f ubuntu.yaml
pod/ceph-pod created
Checking the pod status, it stays stuck in ContainerCreating, and kubectl describe shows errors:
# kubectl get pods
NAME READY STATUS RESTARTS AGE
ceph-pod 0/1 ContainerCreating 0 75s
# kubectl describe pods ceph-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 48m (x6 over 75m) kubelet, work3 Unable to attach or mount volumes: unmounted volumes=[ceph-mnt], unattached volumes=[default-token-tlsjd ceph-mnt]: timed out waiting for the condition
Warning FailedMount 8m59s (x45 over 84m) kubelet, work3 MountVolume.WaitForAttach failed for volume "ceph-pv" : fail to check rbd image status with: (executable file not found in $PATH), rbd output: ()
Warning FailedMount 3m13s (x23 over 82m) kubelet, work3 Unable to attach or mount volumes: unmounted volumes=[ceph-mnt], unattached volumes=[ceph-mnt default-token-tlsjd]: timed out waiting for the condition
This happens because k8s relies on kubelet to attach (rbd map) and detach (rbd unmap) the RBD image, and kubelet runs on every k8s node. Every k8s node therefore needs the ceph-common package so that kubelet can find the rbd command; we installed it on every machine from the Aliyun ceph repo, roughly as sketched below.
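A minimal install sketch for each node; the Aliyun mirror URL and the ceph release/OS paths (rpm-mimic, el7) are assumptions, adjust them to match your ceph version and distribution:
# cat > /etc/yum.repos.d/ceph.repo <<'EOF'
[ceph]
name=ceph packages
baseurl=https://mirrors.aliyun.com/ceph/rpm-mimic/el7/x86_64/
enabled=1
gpgcheck=0
EOF
# yum install -y ceph-common
After ceph-common was installed on every node, describing the pod showed a new error: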
# kubectl describe pods ceph-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
MountVolume.WaitForAttach failed for volume "ceph-pv" : rbd: map failed exit status 6, rbd output: 2020-06-02 17:12:18.575338 7f0171c3ed80 -1 did not load config file, using default settings.
2020-06-02 17:12:18.603861 7f0171c3ed80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
rbd: sysfs write failed
2020-06-02 17:12:18.620447 7f0171c3ed80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable".
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (6) No such device or address
Warning FailedMount 15s kubelet, work3 MountVolume.WaitForAttach failed for volume "ceph-pv" : rbd: map failed exit status 6, rbd output: 2020-06-02 17:12:19.257006 7fc330e14d80 -1 did not load config file, using default settings.
More digging turned up two problems to solve:
1) The k8s nodes and the ceph cluster run different kernel versions; the k8s nodes' kernel is older and does not support some rbd image features, so those features have to be disabled:
# rbd info rbd/image1
rbd image 'image1':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.374d6b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
# rbd feature disable rbd/image1 exclusive-lock object-map fast-diff deep-flatten
2) The "unable to find a keyring" error occurs because each k8s node talks to ceph directly when mapping the image locally, so the ceph.client.admin.keyring file must be present in /etc/ceph on every k8s node for authentication. Create /etc/ceph on each node and copy the keyring over (a small distribution loop is sketched after the scp command below).
# scp /etc/ceph/ceph.client.admin.keyring root@k8s-node:/etc/ceph
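A small distribution loop run from the ceph master; the node names work1 work2 work3 are assumptions, substitute your own k8s node names:
# for node in work1 work2 work3; do ssh root@$node mkdir -p /etc/ceph; scp /etc/ceph/ceph.client.admin.keyring root@$node:/etc/ceph/; done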
Check the pod status again; it is finally running:
# kubectl get pods
NAME READY STATUS RESTARTS AGE
ceph-pod 1/1 Running 0 29s
Enter the ubuntu container and look at the mounts: the image has been formatted and mounted:
# kubectl exec ceph-pod -it sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
# df -hT
Filesystem Type Size Used Avail Use% Mounted on
overlay overlay 50G 3.6G 47G 8% /
tmpfs tmpfs 64M 0 64M 0% /dev
tmpfs tmpfs 2.9G 0 2.9G 0% /sys/fs/cgroup
/dev/rbd0 ext4 976M 2.6M 958M 1% /mnt
/dev/mapper/centos-root xfs 50G 3.6G 47G 8% /etc/hosts
shm tmpfs 64M 0 64M 0% /dev/shm
tmpfs tmpfs 2.9G 12K 2.9G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs tmpfs 2.9G 0 2.9G 0% /proc/acpi
tmpfs tmpfs 2.9G 0 2.9G 0% /proc/scsi
tmpfs tmpfs 2.9G 0 2.9G 0% /sys/firmware
On the node where ceph-pod runs, df shows the rbd mount:
# df -hT|grep rbd
/dev/rbd0 ext4 976M 2.6M 958M 1% /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/rbd-image-image2
Dynamic persistent volumes
Dynamic provisioning needs no storage administrator: image creation is automated, and storage is requested and provisioned on demand. You first define one or more StorageClasses; every StorageClass must be configured with a provisioner, which decides which volume plugin is used to provision PVs. When a PersistentVolumeClaim requests a StorageClass, that class's provisioner creates the persistent volume on the corresponding storage backend.
The volume plugins supported by k8s are listed at https://kubernetes.io/zh/docs/concepts/storage/storage-classes/
Create a regular user for k8s to map rbd images
Create a dedicated pool and user for k8s on the ceph cluster:
# ceph osd pool create kube 8192
# ceph auth get-or-create client.kube mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=kube' -o ceph.client.kube.keyring
Create a secret for the kube user in the k8s cluster:
# ceph auth get-key client.kube|base64
QVFBS090WmVDcUxvSHhBQWZma1YxWUNnVzhuRTZUcjNvYS9yclE9PQ==
# vim ceph-kube-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-kube-secret
  namespace: default
data:
  key: QVFBS090WmVDcUxvSHhBQWZma1YxWUNnVzhuRTZUcjNvYS9yclE9PQ==
type: kubernetes.io/rbd
# kubectl create -f ceph-kube-secret.yaml
# kubectl get secret
NAME TYPE DATA AGE
ceph-kube-secret kubernetes.io/rbd 1 68s
Create a StorageClass (or use an existing one)
# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: default
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"
# kubectl apply -f sc.yaml
# kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
ceph-rbd (default) kubernetes.io/rbd Delete Immediate false 6s
The main fields are explained below:
1. storageclass.beta.kubernetes.io/is-default-class: if set to true, this becomes the default StorageClass; a PVC that does not name a StorageClass is served by the default one (a sketch of a claim that names the class explicitly follows this list).
2. adminId: the ceph client ID used to create images in the pool. Defaults to "admin".
3. userId: the ceph client ID used to map rbd images. Defaults to the same value as adminId.
4. imageFormat: the ceph rbd image format, "1" or "2". Defaults to "1".
5. imageFeatures: optional, and only used when imageFormat is set to "2". The only feature currently supported is layering. Defaults to "", i.e. no features enabled.
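Item 1 above assumes the default class. If the StorageClass were not marked as default, the claim would have to reference it explicitly through spec.storageClassName; a minimal sketch (the claim name ceph-explicit-claim is hypothetical, the class is the one created above):
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-explicit-claim
spec:
  storageClassName: ceph-rbd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi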
Create the persistent volume claim
Since we marked the StorageClass as the default, the PVC can be created without naming a class. With the Immediate volume binding mode shown above, the provisioner is triggered as soon as the PVC is created:
# vim pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-sc-claim
spec:
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  resources:
    requests:
      storage: 500Mi
# kubectl apply -f pvc.yaml
persistentvolumeclaim/ceph-sc-claim created
# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
ceph-sc-claim Pending ceph-rbd 50s
After creating the PVC it never binds to a PV and stays Pending; describing the PVC shows the following error:
# kubectl describe pvc ceph-sc-claim
Name: ceph-sc-claim
Namespace: default
StorageClass: ceph-rbd
Status: Pending
Volume:
Labels:
Annotations: volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/rbd
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Mounted By:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 5s (x7 over 103s) persistentvolume-controller Failed to provision volume with StorageClass "ceph-rbd": failed to get admin secret from ["default"/"ceph-secret"]: failed to get secret from ["default"/"ceph-secret"]: Cannot get secret of type kubernetes.io/rbd
The error tells us that the controller failed to fetch the ceph admin secret: it expects a secret of type kubernetes.io/rbd, but our ceph-secret was created without a type (and lives in the default namespace while provisioning runs from kube-system). So we recreate ceph-secret in the kube-system namespace with type kubernetes.io/rbd, delete the PVC and StorageClass, point adminSecretNamespace at kube-system, and recreate the StorageClass and PVC:
# cat ceph-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
  namespace: kube-system
data:
  key: QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==
type: kubernetes.io/rbd
# kubectl apply -f ceph-secret.yaml
# kubectl get secret ceph-secret -n kube-system
NAME TYPE DATA AGE
ceph-secret kubernetes.io/rbd 1 19m
# kubectl delete pvc ceph-sc-claim
# kubectl delete sc ceph-rbd
# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"
# kubectl apply -f sc.yaml
# kubectl apply -f pvc.yaml
# kubectl describe pvc ceph-sc-claim
Name: ceph-sc-claim
Namespace: default
StorageClass: ceph-rbd
Status: Pending
Volume:
Labels:
Annotations: volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/rbd
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Mounted By:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 33s (x59 over 116m) persistentvolume-controller Failed to provision volume with StorageClass "ceph-rbd": failed to create rbd image: executable file not found in $PATH, command output:
The PVC still fails to bind, so we keep digging. ceph-common is installed on every k8s node, so why is the rbd command still not found? It turns out:
When k8s dynamically provisions ceph storage through a StorageClass, it is the controller-manager that runs the rbd command to talk to the ceph cluster, and the default k8s.gcr.io/kube-controller-manager image does not include the ceph rbd client. The k8s project recommends using an external provisioner for this case; these standalone programs follow a specification defined by k8s.
Following that recommendation, we deploy the external rbd-provisioner. Run the following on the k8s master:
# git clone https://github.com/kubernetes-incubator/external-storage.git
# cd external-storage/ceph/rbd/deploy
# sed -r -i "s/namespace: [^ ]+/namespace: kube-system/g" ./rbac/clusterrolebinding.yaml ./rbac/rolebinding.yaml
# kubectl -n kube-system apply -f ./rbac
# kubectl describe deployments.apps -n kube-system rbd-provisioner
Name: rbd-provisioner
Namespace: kube-system
CreationTimestamp: Wed, 03 Jun 2020 18:59:14 +0800
Labels:
Annotations: deployment.kubernetes.io/revision: 1
Selector: app=rbd-provisioner
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: Recreate
MinReadySeconds: 0
Pod Template:
Labels: app=rbd-provisioner
Service Account: rbd-provisioner
Containers:
rbd-provisioner:
Image: quay.io/external_storage/rbd-provisioner:latest
Port:
Host Port:
Environment:
PROVISIONER_NAME: ceph.com/rbd
Mounts:
Volumes:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets:
NewReplicaSet: rbd-provisioner-c968dcb4b (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 6m5s deployment-controller Scaled up replica set rbd-provisioner-c968dcb4b to 1
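A quick sanity check that the provisioner pod itself came up (the app=rbd-provisioner label is the one shown in the deployment above):
# kubectl -n kube-system get pods -l app=rbd-provisioner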
Change the StorageClass provisioner to the newly deployed one:
# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: ceph.com/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"
# kubectl delete pvc ceph-sc-claim
# kubectl delete sc ceph-rbd
# kubectl apply -f sc.yaml
# kubectl apply -f pvc.yaml
Wait for the provisioner to allocate storage and for the PVC to bind to a PV, about 3 minutes here. It finally binds:
# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
ceph-sc-claim Bound pvc-0b92a433-adb0-46d9-a0c8-5fbef28eff5f 2Gi RWO ceph-rbd 7m49s
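On the ceph side, you can confirm that the provisioner actually created an image in the kube pool; the image name typically looks like kubernetes-dynamic-pvc-<uuid> (output omitted here):
# rbd ls kube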
Using the persistent volume in a pod
Create a pod and check the mount:
# vim ubuntu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ceph-sc-pod
spec:
  containers:
    - name: ceph-sc-ubuntu
      image: phusion/baseimage
      command: ["/sbin/my_init"]
      volumeMounts:
        - name: ceph-sc-mnt
          mountPath: /mnt
          readOnly: false
  volumes:
    - name: ceph-sc-mnt
      persistentVolumeClaim:
        claimName: ceph-sc-claim
# kubectl apply -f ubuntu.yaml
# kubectl get pods
NAME READY STATUS RESTARTS AGE
ceph-sc-pod 1/1 Running 0 24s
# kubectl exec ceph-sc-pod -it sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 50G 3.8G 47G 8% /
tmpfs 64M 0 64M 0% /dev
tmpfs 2.9G 0 2.9G 0% /sys/fs/cgroup
/dev/rbd0 2.0G 6.0M 1.9G 1% /mnt
/dev/mapper/centos-root 50G 3.8G 47G 8% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 2.9G 12K 2.9G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 2.9G 0 2.9G 0% /proc/acpi
tmpfs 2.9G 0 2.9G 0% /proc/scsi
tmpfs 2.9G 0 2.9G 0% /sys/firmware
After all this fiddling, the external ceph cluster is finally hooked up.
Summary
1. k8s relies on kubelet to attach (rbd map) and detach (rbd unmap) RBD images, and kubelet runs on every k8s node, so every k8s node needs the ceph-common package to provide the rbd command to kubelet.
2. When k8s dynamically creates ceph storage through a StorageClass, the controller-manager needs the rbd command to talk to the ceph cluster, but the default k8s.gcr.io/kube-controller-manager image does not ship the ceph rbd client. The k8s project recommends using an external provisioner instead; these standalone programs follow a specification defined by k8s.
References
https://kubernetes.io/zh/docs/concepts/storage/storage-classes/
https://kubernetes.io/zh/docs/concepts/storage/volumes/
https://groups.google.com/forum/#!topic/kubernetes-sig-storage-bugs/4w42QZxboIA