To run stateful services we need to give k8s a persistent storage backend, and we use Ceph as the underlying storage. There are generally two ways to connect k8s to Ceph:

  • Deploy and manage Ceph with Rook, so that Ceph is served from within k8s. The Rook documentation is very detailed and even covers fixes for common problems; it worked smoothly for me end to end, so I will not repeat it here. The docs are:
    https://rook.io/docs/rook/v1.3/ceph-quickstart.html
    https://rook.io/docs/rook/v1.3/ceph-toolbox.html
    https://rook.io/docs/rook/v1.3/ceph-cluster-crd.html#storage-selection-settings
    https://rook.io/docs/rook/v1.3/ceph-block.html
  • Connect k8s to an external Ceph cluster.

This article documents how to connect a k8s cluster to an external Ceph cluster, along with the quite a few problems I ran into on the way.

Environment preparation

The k8s and Ceph environments we use are described in:
https://blog.51cto.com/leejia/2495558
https://blog.51cto.com/leejia/2499684

Static persistent volumes

With static provisioning, every time storage is needed a storage administrator has to create the corresponding image on the storage first; only then can k8s use it.

Creating the Ceph secret

k8s needs a secret for accessing Ceph, which it mainly uses when mapping RBD images.
1. On the Ceph master node, get the admin key base64-encoded (in production you may prefer a dedicated user for k8s; a sketch follows the command below):

# ceph auth get-key client.admin | base64
QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==
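
If you do not want to hand the admin key to k8s, a dedicated Ceph user could be created instead; a minimal sketch (the client.k8s name and its capabilities are illustrative and not part of the original setup):

# ceph auth get-or-create client.k8s mon 'allow r' osd 'allow rwx pool=rbd' -o /etc/ceph/ceph.client.k8s.keyring
# ceph auth get-key client.k8s | base64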

2. Create the secret in k8s from a manifest:

# vim ceph-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
data:
  key: QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==

# kubectl apply -f ceph-secret.yaml

Creating the image

By default, a freshly set up Ceph cluster uses the pool named rbd. Create an image with the following commands, either on a host with the Ceph client installed or directly on the Ceph master node:

# rbd create image1 -s 1024
# rbd info rbd/image1
rbd image 'image1':
    size 1024 MB in 256 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.374d6b8b4567
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:

Creating the persistent volume

Create it in k8s from a manifest:

# vim pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ceph-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  rbd:
    monitors:
      - 172.18.2.172:6789
      - 172.18.2.178:6789
      - 172.18.2.189:6789
    pool: rbd
    image: image1
    user: admin
    secretRef:
      name: ceph-secret
    fsType: ext4
  persistentVolumeReclaimPolicy: Retain

# kubectl apply -f pv.yaml
persistentvolume/ceph-pv created

# kubectl get pv
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
ceph-pv   1Gi        RWO,ROX        Retain           Available                                   76s

The main fields are explained below:
1. accessModes:

RWO: ReadWriteOnce, the volume can be mounted read-write by a single node only;
ROX: ReadOnlyMany, the volume can be mounted read-only by many nodes;
RWX: ReadWriteMany, the volume can be mounted read-write by many nodes.

2. fsType

If the PersistentVolume's VolumeMode is Filesystem, this field specifies the filesystem to use when mounting the volume. If the volume has not been formatted yet and formatting is supported, this value is used to format it.

3. persistentVolumeReclaimPolicy:

There are three reclaim policies (a sketch of changing the policy on an existing PV follows this list):
Delete: the default for dynamically provisioned PersistentVolumes. When the user deletes the corresponding PersistentVolumeClaim, the dynamically provisioned volume is deleted automatically.

Retain: appropriate when the volume holds important data. With Retain, deleting the PersistentVolumeClaim does not delete the corresponding PersistentVolume; instead it becomes Released, and the data can still be recovered manually.

Recycle (deprecated): when the user deletes the PersistentVolumeClaim, the data on the volume is scrubbed but the volume itself is kept.
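
As a side note, the reclaim policy of an existing PV can be changed in place; a minimal sketch using the ceph-pv created above:

# kubectl patch pv ceph-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'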

Creating the persistent volume claim

Create it in k8s from a manifest:

# vim pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-claim
spec:
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi

# kubectl apply -f pvc.yaml

Once the claim is created, k8s matches it with the most suitable PV and binds the two: the PV's capacity must satisfy the claim's request, and the PV's access modes must include those requested by the claim. So the PVC above binds to the PV we just created.

Check the binding:

# kubectl get pvc
NAME         STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ceph-claim   Bound    ceph-pv   1Gi        RWO,ROX                       13m
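
The PV side can be checked as well; after binding, its STATUS should change from Available to Bound and CLAIM should show default/ceph-claim (output omitted here):

# kubectl get pv ceph-pv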

Using the persistent volume in a pod

Create the pod in k8s from a manifest:

# vim ubuntu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ceph-pod
spec:
  containers:
  - name: ceph-ubuntu
    image: phusion/baseimage
    command: ["sh", "/sbin/my_init"]
    volumeMounts:
    - name: ceph-mnt
      mountPath: /mnt
      readOnly: false
  volumes:
  - name: ceph-mnt
    persistentVolumeClaim:
      claimName: ceph-claim

# kubectl apply -f ubuntu.yaml
pod/ceph-pod created

Checking the pod, it stays stuck in ContainerCreating, and kubectl describe shows the following errors:

# kubectl get pods
NAME                     READY   STATUS              RESTARTS   AGE
ceph-pod                 0/1     ContainerCreating   0          75s

# kubectl describe pods ceph-pod
Events:
  Type     Reason       Age                   From            Message
  ----     ------       ----                  ----            -------
  Warning  FailedMount  48m (x6 over 75m)     kubelet, work3  Unable to attach or mount volumes: unmounted volumes=[ceph-mnt], unattached volumes=[default-token-tlsjd ceph-mnt]: timed out waiting for the condition
  Warning  FailedMount  8m59s (x45 over 84m)  kubelet, work3  MountVolume.WaitForAttach failed for volume "ceph-pv" : fail to check rbd image status with: (executable file not found in $PATH), rbd output: ()
  Warning  FailedMount  3m13s (x23 over 82m)  kubelet, work3  Unable to attach or mount volumes: unmounted volumes=[ceph-mnt], unattached volumes=[ceph-mnt default-token-tlsjd]: timed out waiting for the condition

This happens because k8s relies on the kubelet, which runs on every k8s node, to attach (rbd map) and detach (rbd unmap) RBD images. So every k8s node needs the ceph-common package installed to provide the rbd command to the kubelet. We installed it on every machine from Aliyun's Ceph yum repo (a repo sketch follows), but then the pod failed with a new error, shown after the sketch:
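
A minimal sketch of such a repo file, assuming CentOS 7 nodes and a Nautilus mirror path; adjust the baseurl to match your Ceph release and distribution:

# cat > /etc/yum.repos.d/ceph.repo <<'EOF'
[ceph]
name=Ceph packages for x86_64
baseurl=https://mirrors.aliyun.com/ceph/rpm-nautilus/el7/x86_64/
enabled=1
gpgcheck=0
EOF
# yum install -y ceph-common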

# kubectl describe pods ceph-pod
Events:
  Type     Reason       Age                   From            Message
  ----     ------       ----                  ----            -------
MountVolume.WaitForAttach failed for volume "ceph-pv" : rbd: map failed exit status 6, rbd output: 2020-06-02 17:12:18.575338 7f0171c3ed80 -1 did not load config file, using default settings.
2020-06-02 17:12:18.603861 7f0171c3ed80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
rbd: sysfs write failed
2020-06-02 17:12:18.620447 7f0171c3ed80 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable".
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (6) No such device or address
  Warning  FailedMount  15s  kubelet, work3  MountVolume.WaitForAttach failed for volume "ceph-pv" : rbd: map failed exit status 6, rbd output: 2020-06-02 17:12:19.257006 7fc330e14d80 -1 did not load config file, using default settings.

More digging revealed two problems to solve:
1) The k8s nodes and the Ceph nodes run different kernel versions, and the k8s nodes' older kernel does not support some RBD image features, so those features have to be disabled (see the ceph.conf sketch after this list for a way to avoid repeating this on every new image). Disable them as follows:

# rbd info rbd/image1
rbd image 'image1':
    size 1024 MB in 256 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.374d6b8b4567
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:

# rbd  feature disable rbd/image1 exclusive-lock object-map fast-diff deep-flatten  

2) The missing-keyring error occurs because each k8s node has to talk to Ceph to map the image locally, and it needs /etc/ceph/ceph.client.admin.keyring for authentication during the mapping. So I created /etc/ceph on every node and pushed the keyring out with a small script.

# scp /etc/ceph/ceph.client.admin.keyring root@k8s-node:/etc/ceph
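
Regarding issue 1), to avoid disabling features on every new image, the default feature set can also be restricted on the Ceph client that creates the images; a sketch of the ceph.conf setting (1 is the feature bitmask for layering only):

# cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
rbd_default_features = 1
EOF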

Check the pod status again; it is finally running:

# kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
ceph-pod                 1/1     Running   0          29s

Entering the Ubuntu container and looking at the mounts, the image has already been mapped and formatted:

# kubectl exec ceph-pod -it sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
# df -hT
Filesystem              Type     Size  Used Avail Use% Mounted on
overlay                 overlay   50G  3.6G   47G   8% /
tmpfs                   tmpfs     64M     0   64M   0% /dev
tmpfs                   tmpfs    2.9G     0  2.9G   0% /sys/fs/cgroup
/dev/rbd0               ext4     976M  2.6M  958M   1% /mnt
/dev/mapper/centos-root xfs       50G  3.6G   47G   8% /etc/hosts
shm                     tmpfs     64M     0   64M   0% /dev/shm
tmpfs                   tmpfs    2.9G   12K  2.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                   tmpfs    2.9G     0  2.9G   0% /proc/acpi
tmpfs                   tmpfs    2.9G     0  2.9G   0% /proc/scsi
tmpfs                   tmpfs    2.9G     0  2.9G   0% /sys/firmware

On the node where ceph-pod is running, df shows the RBD mount:

# df -hT|grep rbd
/dev/rbd0               ext4      976M  2.6M  958M   1% /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/rbd-image-image2

Dynamic persistent volumes

Dynamic provisioning takes the storage administrator out of the loop: the images backing k8s storage are created automatically, so storage can be requested and provisioned on demand. You first define one or more StorageClasses; every StorageClass must be configured with a provisioner, which decides which volume plugin is used to provision PVs. When a PersistentVolumeClaim requests a StorageClass, that class's provisioner creates the persistent volume on the corresponding storage.

The volume plugins supported upstream are listed at: https://kubernetes.io/zh/docs/concepts/storage/storage-classes/

Create an ordinary user for k8s to map RBD images

Create a dedicated pool and user for k8s in the Ceph cluster:

# ceph osd pool create kube 8192
# ceph auth get-or-create client.kube mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=kube' -o ceph.client.kube.keyring
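
To double-check the new user's capabilities before handing it to k8s, you can dump it back out (output omitted here):

# ceph auth get client.kube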

Create a secret for the kube user in the k8s cluster:

# ceph auth get-key client.kube|base64
QVFBS090WmVDcUxvSHhBQWZma1YxWUNnVzhuRTZUcjNvYS9yclE9PQ==

# vim ceph-kube-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-kube-secret
  namespace: default
data:
  key: QVFBS090WmVDcUxvSHhBQWZma1YxWUNnVzhuRTZUcjNvYS9yclE9PQ==
type: kubernetes.io/rbd

# kubectl create -f ceph-kube-secret.yaml
# kubectl get secret
NAME                  TYPE                                  DATA   AGE
ceph-kube-secret      kubernetes.io/rbd                     1      68s

Create a StorageClass (or use an existing one)

# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
     storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: default
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"

# kubectl apply -f sc.yaml
# kubectl get storageclass
NAME                 PROVISIONER         RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-rbd (default)   kubernetes.io/rbd   Delete          Immediate           false                  6s

The main parameters are explained below:
1. storageclass.beta.kubernetes.io/is-default-class
If set to true, this becomes the default StorageClass. When a PVC requests storage without specifying a StorageClass, it is served by the default one (see the sketch after this list for a claim that names the class explicitly).
2. adminId: Ceph client ID used to create images in the pool. Defaults to "admin".
3. userId: Ceph client ID used to map RBD images. Defaults to the same value as adminId.
4. imageFormat: Ceph RBD image format, "1" or "2". Defaults to "1".
5. imageFeatures: optional, and only usable when imageFormat is "2". The only feature currently supported is layering. Defaults to "", i.e. no features enabled.
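
For comparison, a claim can also name the class explicitly instead of relying on the default; a minimal sketch (ceph-explicit-claim is an illustrative name, not used elsewhere in this article):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-explicit-claim
spec:
  storageClassName: ceph-rbd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi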

Creating the persistent volume claim

Since we have already designated a default StorageClass, we can create the PVC directly. Right after creation it shows up as Pending, waiting for the provisioner to create and bind a volume:

# vim pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-sc-claim
spec:
  accessModes:
    - ReadWriteOnce
    - ReadOnlyMany
  resources:
    requests:
      storage: 500Mi

# kubectl apply -f pvc.yaml
persistentvolumeclaim/ceph-sc-claim created

# kubectl get pvc
NAME            STATUS    VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ceph-sc-claim   Pending                                        ceph-rbd       50s

After creating the PVC, it never binds to a PV and stays Pending. Describing the PVC shows the problem:

# kubectl describe pvc  ceph-sc-claim
Name:          ceph-sc-claim
Namespace:     default
StorageClass:  ceph-rbd
Status:        Pending
Volume:
Labels:        
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/rbd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    
Events:
  Type     Reason              Age                From                         Message
  ----     ------              ----               ----                         -------
  Warning  ProvisioningFailed  5s (x7 over 103s)  persistentvolume-controller  Failed to provision volume with StorageClass "ceph-rbd": failed to get admin secret from ["default"/"ceph-secret"]: failed to get secret from ["default"/"ceph-secret"]: Cannot get secret of type kubernetes.io/rbd

The error shows that the persistentvolume-controller failed to use the Ceph admin secret: it found "default"/"ceph-secret" but rejected it with "Cannot get secret of type kubernetes.io/rbd", because our original ceph-secret was created without a type and therefore defaulted to Opaque, while the rbd provisioner requires a secret of type kubernetes.io/rbd. So we recreate the admin secret with the correct type in the kube-system namespace (where the controller-manager runs), delete the old PVC and StorageClass, point adminSecretNamespace at kube-system, and re-create the StorageClass and PVC:

# cat ceph-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
  namespace: kube-system
data:
  key: QVFCd3BOQmVNMCs5RXhBQWx3aVc3blpXTmh2ZjBFMUtQSHUxbWc9PQ==
type: kubernetes.io/rbd

# kubectl apply -f ceph-secret.yaml
# kubectl get secret ceph-secret -n kube-system
NAME          TYPE                DATA   AGE
ceph-secret   kubernetes.io/rbd   1      19m

# kubectl delete pvc ceph-sc-claim
# kubectl delete sc ceph-rbd
# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
     storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"
# kubectl apply -f sc.yaml
# kubectl apply -f pvc.yaml

# kubectl describe  pvc ceph-sc-claim
Name:          ceph-sc-claim
Namespace:     default
StorageClass:  ceph-rbd
Status:        Pending
Volume:
Labels:        
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/rbd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    
Events:
  Type     Reason              Age                  From                         Message
  ----     ------              ----                 ----                         -------
  Warning  ProvisioningFailed  33s (x59 over 116m)  persistentvolume-controller  Failed to provision volume with StorageClass "ceph-rbd": failed to create rbd image: executable file not found in $PATH, command output:

The binding still fails, so we keep digging. We installed ceph-common on every k8s node, so why can the rbd command still not be found? The reason turns out to be:
When k8s provisions Ceph storage dynamically through a StorageClass, it is the controller-manager that runs the rbd command to talk to the Ceph cluster, and the default k8s.gcr.io/kube-controller-manager image does not ship Ceph's rbd client. The upstream recommendation is to use an external provisioner instead; these are standalone programs that follow a specification defined by k8s.
Following that recommendation, we deploy the external rbd-provisioner. Run the following on the k8s master:

# git clone https://github.com/kubernetes-incubator/external-storage.git
# cd external-storage/ceph/rbd/deploy
# sed -r -i "s/namespace: [^ ]+/namespace: kube-system/g" ./rbac/clusterrolebinding.yaml ./rbac/rolebinding.yaml
# kubectl -n kube-system apply -f ./rbac

# kubectl describe deployments.apps -n kube-system rbd-provisioner
Name:               rbd-provisioner
Namespace:          kube-system
CreationTimestamp:  Wed, 03 Jun 2020 18:59:14 +0800
Labels:             
Annotations:        deployment.kubernetes.io/revision: 1
Selector:           app=rbd-provisioner
Replicas:           1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:       Recreate
MinReadySeconds:    0
Pod Template:
  Labels:           app=rbd-provisioner
  Service Account:  rbd-provisioner
  Containers:
   rbd-provisioner:
    Image:      quay.io/external_storage/rbd-provisioner:latest
    Port:       
    Host Port:  
    Environment:
      PROVISIONER_NAME:  ceph.com/rbd
    Mounts:              
  Volumes:               
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  
NewReplicaSet:   rbd-provisioner-c968dcb4b (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  6m5s  deployment-controller  Scaled up replica set rbd-provisioner-c968dcb4b to 1
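
Before moving on, it is worth confirming that the provisioner pod itself is up (the app=rbd-provisioner label comes from the deployment above; output omitted):

# kubectl -n kube-system get pods -l app=rbd-provisioner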

Change the StorageClass provisioner to the one we just deployed:

# vim sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
  annotations:
     storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: ceph.com/rbd
parameters:
  monitors: 172.18.2.172:6789,172.18.2.178:6789,172.18.2.189:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-kube-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"

# kubectl delete pvc ceph-sc-claim
# kubectl delete sc ceph-rbd
# kubectl apply -f sc.yaml
# kubectl apply -f pvc.yaml

Wait for the provisioner to allocate storage and for the PVC to bind to a PV, which took about 3 minutes. It finally binds successfully:

# kubectl get pvc
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ceph-sc-claim   Bound    pvc-0b92a433-adb0-46d9-a0c8-5fbef28eff5f   2Gi        RWO            ceph-rbd       7m49s
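
To confirm what happened on the Ceph side, the dynamically provisioned image can be listed in the kube pool from any Ceph client (output omitted; the exact image name is generated by the provisioner):

# rbd ls kube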

Using the persistent volume in a pod

Create a pod and check the mount:

# vim ubuntu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ceph-sc-pod
spec:
  containers:
  - name: ceph-sc-ubuntu
    image: phusion/baseimage
    command: ["/sbin/my_init"]
    volumeMounts:
    - name: ceph-sc-mnt
      mountPath: /mnt
      readOnly: false
  volumes:
  - name: ceph-sc-mnt
    persistentVolumeClaim:
      claimName: ceph-sc-claim

# kubectl apply -f ubuntu.yaml
# kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
ceph-sc-pod              1/1     Running   0          24s

# kubectl exec ceph-sc-pod -it  sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
# df -h
Filesystem               Size  Used Avail Use% Mounted on
overlay                   50G  3.8G   47G   8% /
tmpfs                     64M     0   64M   0% /dev
tmpfs                    2.9G     0  2.9G   0% /sys/fs/cgroup
/dev/rbd0                2.0G  6.0M  1.9G   1% /mnt
/dev/mapper/centos-root   50G  3.8G   47G   8% /etc/hosts
shm                       64M     0   64M   0% /dev/shm
tmpfs                    2.9G   12K  2.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                    2.9G     0  2.9G   0% /proc/acpi
tmpfs                    2.9G     0  2.9G   0% /proc/scsi
tmpfs                    2.9G     0  2.9G   0% /sys/firmware

After all this back and forth, the external Ceph cluster is finally wired up.

Summary

1. k8s relies on the kubelet, which runs on every k8s node, to attach (rbd map) and detach (rbd unmap) RBD images, so every k8s node must have the ceph-common package installed to provide the rbd command to the kubelet.
2. When k8s creates Ceph storage dynamically through a StorageClass, the controller-manager needs the rbd command to talk to the Ceph cluster, but the default k8s.gcr.io/kube-controller-manager image does not include Ceph's rbd client. The upstream recommendation is to use an external provisioner, a standalone program that follows a specification defined by k8s.

References

https://kubernetes.io/zh/docs/concepts/storage/storage-classes/
https://kubernetes.io/zh/docs/concepts/storage/volumes/
https://groups.google.com/forum/#!topic/kubernetes-sig-storage-bugs/4w42QZxboIA