Ceph RBD currently only supports the RWO and ROX access modes, so the PV YAML below declares ReadWriteOnce, which at least allows writes. After that, prepare the PVC and the workload YAML so that the volume is mounted at the database's read-write data directory inside the container.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: user-pv
  namespace: sock-shop
  labels:
    pv: user-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  rbd:
    monitors:
      - 192.168.22.47:6789
      - 192.168.22.48:6789
      - 192.168.22.49:6789
    pool: sock_shop
    image: user-pv
    user: admin
    secretRef:
      name: ceph-secret
    fsType: ext4
    readOnly: false
  persistentVolumeReclaimPolicy: Recycle
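For completeness, a minimal sketch of the matching PVC and the volume wiring in the database workload could look like the following. The claim name user-db-pvc, the label selector, and the mount path /data/db are illustrative assumptions, not taken from the original manifests; only the volume name data-volume comes from the error output quoted later.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: user-db-pvc            # illustrative name
  namespace: sock-shop
spec:
  accessModes:
    - ReadWriteOnce            # must be a mode the PV above offers
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      pv: user-pv              # bind to the PV labeled above
---
# In the user-db Deployment's pod template (sketch):
#   volumes:
#     - name: data-volume
#       persistentVolumeClaim:
#         claimName: user-db-pvc
#   containers[].volumeMounts:
#     - name: data-volume
#       mountPath: /data/db    # the database's data directory (assumed)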
During testing, I shut down the node hosting the Pod that had this PV mounted and found that the Pod could not be migrated. The errors were:
Multi-Attach error for volume "user-pv" Volume is already used by pod(s) user-db-79f7876cbc-dl8b8
Unable to mount volumes for pod "user-db-79f7876cbc-chddt_sock-shop(1dd2393e-ed51-11e8-95af-001a4ad9b270)": timeout expired waiting for volumes to attach or mount for pod "sock-shop"/"user-db-79f7876cbc-chddt". list of unmounted volumes=[data-volume]. list of unattached volumes=[tmp-volume data-volume default-token-l2g8x]
Kubernetes first checks whether multi-attach is forbidden for the volume; if it is, and the volume is already attached to one or more nodes, the newly started Pod is not allowed to attach it. See the reconciler code below:
if rc.isMultiAttachForbidden(volumeToAttach.VolumeSpec) {
    nodes := rc.actualStateOfWorld.GetNodesForVolume(volumeToAttach.VolumeName)
    if len(nodes) > 0 {
        if !volumeToAttach.MultiAttachErrorReported {
            rc.reportMultiAttachError(volumeToAttach, nodes)
            rc.desiredStateOfWorld.SetMultiAttachError(volumeToAttach.VolumeName, volumeToAttach.NodeName)
        }
        continue
    }
}
isMultiAttachForbidden makes its decision from the PV's AccessModes, and Ceph RBD does not support RWX, only RWO (leaving read-only modes aside). After the node running the previous Pod went down, the cluster still considered that Pod Running, so its volume was never released, and the newly started Pod could not mount the volume. See below:
if volumeSpec.PersistentVolume != nil {
    // Check for persistent volume types which do not fail when trying to multi-attach
    if volumeSpec.PersistentVolume.Spec.VsphereVolume != nil {
        return false
    }
    if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
        // No access mode specified so we don't know for sure. Let the attacher fail if needed
        return false
    }
    // check if this volume is allowed to be attached to multiple PODs/nodes, if yes, return false
    for _, accessMode := range volumeSpec.PersistentVolume.Spec.AccessModes {
        if accessMode == v1.ReadWriteMany || accessMode == v1.ReadOnlyMany {
            return false
        }
    }
    return true
}
Now the problem is clear. When a Pod is migrated, the old Pod is left in the Unknown state, so Kubernetes' internal RWO restriction ends up blocking the migration. Although Ceph RBD does not support RWX, nothing in the PV/PVC handling enforces that: the PV/PVC side never rejects a claim just because the backend is Ceph RBD or GlusterFS. Setting accessModes to RWX therefore resolves the problem. One question remains, though: since Ceph RBD does not actually support RWX, will forcing accessModes to RWX cause trouble?
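Concretely, the workaround is just to declare ReadWriteMany in the manifests above (the PVC's requested mode has to be one the PV offers, so both end up listing it). A minimal sketch of the change:
# In both the PV and the PVC spec:
spec:
  accessModes:
    - ReadWriteMany   # declared RWX purely to bypass the multi-attach check;
                      # Ceph RBD still cannot actually be written from multiple nodes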
In practice it will not, as the code below shows. AttachDisk first inspects accessModes (there was actually a bug here, which I have since fixed and will cover in a later article; the code below is the fixed version). Unless the access modes contain only ReadOnlyMany, util.rbdStatus is always consulted to determine whether the RBD image is already in use.
func (util *RBDUtil) AttachDisk(b rbdMounter) (string, error) {
    ...
    if b.accessModes != nil {
        // If accessModes only contain ReadOnlyMany, we don't need check rbd status of being used.
        if len(b.accessModes) == 1 && b.accessModes[0] == v1.ReadOnlyMany {
            needValidUsed = false
        }
    }
    err := wait.ExponentialBackoff(backoff, func() (bool, error) {
        used, rbdOutput, err := util.rbdStatus(&b)
        if err != nil {
            return false, fmt.Errorf("fail to check rbd image status with: (%v), rbd output: (%s)", err, rbdOutput)
        }
        return !needValidUsed || !used, nil
    })
    ...
}
So how does it decide whether an RBD PV is in use? The implementation runs rbd status and looks for watchers: once the RBD image has been mapped, a watcher is guaranteed to appear. In other words, for both RWX and RWO the image can only be mapped once. Coming back to the original problem: even if the node is down, even in a split-brain situation, as long as the PV is mapped (i.e. in use by some Pod), the new Pod cannot start. Put differently, the PV migration flow becomes: start the new Pod and wait on the old one. There are two cases. First, a split brain occurs and the old Pod still has the PV mapped; even though the new Pod is scheduled, the watcher check finds an existing watcher, the new Pod's start fails, and RBD never ends up with multiple read-write mappings. Second, there is no split brain and the old host is really down; the new Pod finds no watcher and is rescheduled successfully.
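For reference, the same watcher check can be done by hand with the rbd CLI against the image defined in the PV above. The output shape follows the comments in the Kubernetes source quoted next; the watcher address shown here is illustrative, and exact formatting may differ between Ceph versions.
# image currently mapped somewhere -> a watcher is listed (exit code 0)
$ rbd status user-pv --pool sock_shop
Watchers:
        watcher=192.168.22.50:0/710245699 client.14163 cookie=1

# image not mapped anywhere -> safe for the new Pod to attach (exit code 0)
$ rbd status user-pv --pool sock_shop
Watchers: none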
// rbdStatus runs `rbd status` command to check if there is watcher on the image.
func (util *RBDUtil) rbdStatus(b *rbdMounter) (bool, string, error) {
    var err error
    var output string
    var cmd []byte

    // If we don't have admin id/secret (e.g. attaching), fallback to user id/secret.
    id := b.adminId
    secret := b.adminSecret
    if id == "" {
        id = b.Id
        secret = b.Secret
    }

    mon := util.kernelRBDMonitorsOpt(b.Mon)
    // cmd "rbd status" list the rbd client watch with the following output:
    //
    // # there is a watcher (exit=0)
    // Watchers:
    //   watcher=10.16.153.105:0/710245699 client.14163 cookie=1
    //
    // # there is no watcher (exit=0)
    // Watchers: none
    //
    // Otherwise, exit is non-zero, for example:
    //
    // # image does not exist (exit=2)
    // rbd: error opening image kubernetes-dynamic-pvc-: (2) No such file or directory
    //
    glog.V(4).Infof("rbd: status %s using mon %s, pool %s id %s key %s", b.Image, mon, b.Pool, id, secret)
    cmd, err = b.exec.Run("rbd",
        "status", b.Image, "--pool", b.Pool, "-m", mon, "--id", id, "--key="+secret)
    output = string(cmd)

    if err, ok := err.(*exec.Error); ok {
        if err.Err == exec.ErrNotFound {
            glog.Errorf("rbd cmd not found")
            // fail fast if command not found
            return false, output, err
        }
    }

    // If command never succeed, returns its last error.
    if err != nil {
        return false, output, err
    }

    if strings.Contains(output, imageWatcherStr) {
        glog.V(4).Infof("rbd: watchers on %s: %s", b.Image, output)
        return true, output, nil
    } else {
        glog.Warningf("rbd: no watchers on %s", b.Image)
        return false, output, nil
    }
}