Ceph pools offer two mechanisms for data redundancy: erasure coding (erasure code, abbreviated ec) and replication (replicated, abbreviated rc).
A Ceph pool must ensure that data is not lost when some OSDs fail (normally one disk is set up as one OSD). By default a pool is created with the replicated rule type, meaning each object is stored as copies on multiple disks; the other rule type is erasure, which saves space.
How erasure coding works
For example, K=3, M=2, so K+M=5.
This means:
K is the number of original data chunks (disks), i.e. the number of chunks needed to recover the data
M is the number of coding (parity) chunks, i.e. the number of disks that are allowed to fail
The encoding algorithm turns the K original chunks into K+M chunks
Any K of those K+M chunks can reconstruct the original K chunks
So up to M disks can fail without any data being lost.
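As a quick sanity check of the space overhead this implies (an illustrative calculation, not a step from the original procedure):
k=3; m=2
echo "scale=2; ($k + $m) / $k" | bc   # prints 1.66, i.e. every GiB of user data costs roughly 1.67 GiB of raw space
# writing an object splits it into k data chunks plus m coding chunks of the same size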
Reference: 浅谈Ceph纠删码 (niuanxins, CSDN blog)
We want to use RBD as the Kubernetes backend storage, but by default erasure-coded pools only work well with RGW. Starting with the L (Luminous) release, erasure-coded pools support the allow_ec_overwrites flag, which allows them to serve as the data pool for rbd or CephFS. Supporting rbd requires the following conditions:
Ceph version at or above L (Luminous); check with ceph --version
allow_ec_overwrites enabled: ceph osd pool set ec_pool allow_ec_overwrites true
The BlueStore object store backend
Check with: ceph daemon osd.0 config show | grep osd_objectstore
An erasure-coded pool does not support omap, so a replicated pool is needed to store the block device's metadata. That replicated pool can be initialized with the rbd application,
and the image is then created in the rc pool with --data-pool pointing to the ec pool.
Reference: ceph 之 纠删码操作
ceph osd erasure-code-profile ls
ceph osd erasure-code-profile get default
ceph osd erasure-code-profile rm <profile-name>
ceph osd erasure-code-profile set hdd-3-2 k=3 m=2 crush-device-class=hdd
The available options include k, m, plugin, crush-root, crush-failure-domain, crush-device-class, and so on.
ceph osd crush rule create-erasure default default
ceph osd crush rule dump default
Create a pool that uses the erasure-code rule default:
ceph osd pool create ec 128 128 erasure default default
Syntax: osd pool create <poolname> <pg_num> [<pgp_num>] [replicated|erasure] [<erasure-code-profile>] [<crush-rule-name>]
Although the crush rule itself is created from an erasure_code_profile, the erasure_code_profile still has to be specified explicitly when creating the erasure-coded pool.
Note: the number of PGs cannot be determined automatically by the system; it must be calculated for your environment and specified. There are plenty of PG calculation guides online.
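If you just need a starting point, one commonly cited rule of thumb (an assumption to adjust for your own cluster, not an official requirement) is pg_num roughly equal to (number of OSDs * 100) / (k + m), rounded to a power of two:
# example with 10 OSDs and the k=3, m=2 profile above (the OSD count is made up)
echo $(( 10 * 100 / (3 + 2) ))   # 200 -> round to the nearest power of two, i.e. 256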
Because an erasure-coded pool cannot store omap data, and the metadata of an rbd image is implemented on top of rados omap, an rbd image cannot be created in an erasure-coded pool; a replicated pool must be created to hold the image metadata.
Attempting to create an image directly in the erasure-coded pool therefore fails with an error.
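For illustration (a command added here for clarity, not taken from the original write-up; the image name is hypothetical), the failure can be reproduced like this:
rbd create --size 1G ec/test-image   # expected to be rejected: the image's omap metadata cannot live in an EC pool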
ceph osd pool set ec fast_read 1
Currently, fast_read only takes effect on erasure-coded pools.
ceph osd pool set ec allow_ec_overwrites true
ceph osd pool create rc 128
ceph health detail
You can see warnings; they are caused by the pools having no application type set. We can set both pools to rbd:
ceph osd pool application enable ec rbd
ceph osd pool application enable rc rbd
ceph health detail
ceph auth get-or-create client.kubernetes-ec mon 'allow r' osd 'allow rwx pool=ec, allow rwx pool=rc' -o /etc/ceph/ceph.client.kubernetes-ec.keyring
ceph auth get client.kubernetes-ec
Prerequisite: ceph-rbd-csi is already installed; for the installation steps see my first article, KubeSphere使用rbd-csi创建快照.
[root@ceph01 ~]# ceph auth get client.kubernetes-ec
exported keyring for client.kubernetes-ec
[client.kubernetes-ec]
key = AQDyPatgdw0UGRAA3Kkg3uvH9pYr0Epa3TV3rw==
caps mon = "allow r"
caps osd = "allow rwx pool=ec, allow rwx pool=rc"
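If you only need the raw key (as used for the userKey field in the Secret below), it can be printed on its own:
ceph auth print-key client.kubernetes-ec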
---
apiVersion: v1
kind: Secret
metadata:
name: csi-rbd-ec-secret
namespace: kube-system
stringData:
# Key values correspond to a user name and its key, as defined in the
# ceph cluster. User ID should have required access to the 'pool'
# specified in the storage class
userID: kubernetes-ec # the Ceph user created above
userKey: AQDyPatgdw0UGRAA3Kkg3uvH9pYr0Epa3TV3rw== # the key obtained in the previous step
# Encryption passphrase
# encryptionPassphrase: test_passphrase
kubectl apply -f secret.yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: csi-rbd-ec-sc
annotations:
storageclass.kubernetes.io/is-default-class: "true"
storageclass.kubesphere.io/support-snapshot: 'true' # must be set to true, otherwise KubeSphere cannot create snapshots
storageclass.kubesphere.io/supported-access-modes: '["ReadWriteOnce"]'
provisioner: rbd.csi.ceph.com
# If topology based provisioning is desired, delayed provisioning of
# PV is required and is enabled using the following attribute
# For further information read TODO
# volumeBindingMode: WaitForFirstConsumer
parameters:
# (required) String representing a Ceph cluster to provision storage from.
# Should be unique across all Ceph clusters in use for provisioning,
# cannot be greater than 36 bytes in length, and should remain immutable for
# the lifetime of the StorageClass in use.
# Ensure to create an entry in the configmap named ceph-csi-config, based on
# csi-config-map-sample.yaml, to accompany the string chosen to
# represent the Ceph cluster in clusterID below
clusterID: b93a2e42-43e1-4975-bc7d-5998ca61a7c4 # the Ceph cluster ID (the fsid mentioned earlier)
# (optional) If you want to use erasure coded pool with RBD, you need to
# create two pools. one erasure coded and one replicated.
# You need to specify the replicated pool here in the `pool` parameter, it is
# used for the metadata of the images.
# The erasure coded pool must be set as the `dataPool` parameter below.
dataPool: ec # the erasure-coded pool
# (required) Ceph pool into which the RBD image shall be created
# eg: pool: rbdpool
pool: rc # the replicated pool
# Set thickProvision to true if you want RBD images to be fully allocated on
# creation (thin provisioning is the default).
thickProvision: "false"
# (required) RBD image features, CSI creates image with image-format 2
# CSI RBD currently supports `layering`, `journaling`, `exclusive-lock`
# features. If `journaling` is enabled, must enable `exclusive-lock` too.
# imageFeatures: layering,journaling,exclusive-lock
imageFeatures: layering
# (optional) mapOptions is a comma-separated list of map options.
# For krbd options refer
# https://docs.ceph.com/docs/master/man/8/rbd/#kernel-rbd-krbd-options
# For nbd options refer
# https://docs.ceph.com/docs/master/man/8/rbd-nbd/#options
# mapOptions: lock_on_read,queue_depth=1024
# (optional) unmapOptions is a comma-separated list of unmap options.
# For krbd options refer
# https://docs.ceph.com/docs/master/man/8/rbd/#kernel-rbd-krbd-options
# For nbd options refer
# https://docs.ceph.com/docs/master/man/8/rbd-nbd/#options
# unmapOptions: force
# The secrets have to contain Ceph credentials with required access
# to the 'pool'.
csi.storage.k8s.io/provisioner-secret-name: csi-rbd-ec-secret
csi.storage.k8s.io/provisioner-secret-namespace: kube-system
csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-ec-secret
csi.storage.k8s.io/controller-expand-secret-namespace: kube-system
csi.storage.k8s.io/node-stage-secret-name: csi-rbd-ec-secret
csi.storage.k8s.io/node-stage-secret-namespace: kube-system
# (optional) Specify the filesystem type of the volume. If not specified,
# csi-provisioner will set default as `ext4`.
csi.storage.k8s.io/fstype: ext4
# (optional) uncomment the following to use rbd-nbd as mounter
# on supported nodes
# mounter: rbd-nbd
# (optional) Prefix to use for naming RBD images.
# If omitted, defaults to "csi-vol-".
# volumeNamePrefix: "foo-bar-"
# (optional) Instruct the plugin it has to encrypt the volume
# By default it is disabled. Valid values are "true" or "false".
# A string is expected here, i.e. "true", not true.
# encrypted: "true"
# (optional) Use external key management system for encryption passphrases by
# specifying a unique ID matching KMS ConfigMap. The ID is only used for
# correlation to configmap entry.
# encryptionKMSID:
# Add topology constrained pools configuration, if topology based pools
# are setup, and topology constrained provisioning is required.
# For further information read TODO
# topologyConstrainedPools: |
# [{"poolName":"pool0",
# "dataPool":"ec-pool0" # optional, erasure-coded pool for data
# "domainSegments":[
# {"domainLabel":"region","value":"east"},
# {"domainLabel":"zone","value":"zone1"}]},
# {"poolName":"pool1",
# "dataPool":"ec-pool1" # optional, erasure-coded pool for data
# "domainSegments":[
# {"domainLabel":"region","value":"east"},
# {"domainLabel":"zone","value":"zone2"}]},
# {"poolName":"pool2",
# "dataPool":"ec-pool2" # optional, erasure-coded pool for data
# "domainSegments":[
# {"domainLabel":"region","value":"west"},
# {"domainLabel":"zone","value":"zone1"}]}
# ]
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
- discard
kubectl apply -f storageclass.yaml
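Before pointing workloads at it, the StorageClass can be smoke-tested with a small PVC (a minimal sketch; the PVC name rbd-ec-test-pvc is made up for this example):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-ec-test-pvc   # hypothetical test PVC
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: csi-rbd-ec-sc
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc rbd-ec-test-pvc   # should reach Bound if provisioning against rc/ec works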
---
# Snapshot API version compatibility matrix:
# v1beta1:
# v1.17 =< k8s < v1.20
# 2.x =< snapshot-controller < v4.x
# v1:
# k8s >= v1.20
# snapshot-controller >= v4.x
# We recommend to use {sidecar, controller, crds} of same version
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: csi-rbd-ec-sc # note: must have the same name as the StorageClass, otherwise KubeSphere cannot create snapshots of its volumes
driver: rbd.csi.ceph.com
parameters:
# String representing a Ceph cluster to provision storage from.
# Should be unique across all Ceph clusters in use for provisioning,
# cannot be greater than 36 bytes in length, and should remain immutable for
# the lifetime of the StorageClass in use.
# Ensure to create an entry in the configmap named ceph-csi-config, based on
# csi-config-map-sample.yaml, to accompany the string chosen to
# represent the Ceph cluster in clusterID below
clusterID: b93a2e42-43e1-4975-bc7d-5998ca61a7c4 # the Ceph cluster ID (the fsid mentioned earlier)
# Prefix to use for naming RBD snapshots.
# If omitted, defaults to "csi-snap-".
# snapshotNamePrefix: "foo-bar-"
csi.storage.k8s.io/snapshotter-secret-name: csi-rbd-ec-secret
csi.storage.k8s.io/snapshotter-secret-namespace: kube-system
deletionPolicy: Delete
kubectl apply -f snapshotclass.yaml
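Similarly, the snapshot class can be exercised against a PVC from this StorageClass (a sketch; the names below reuse the hypothetical test PVC from the earlier example):
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: rbd-ec-test-snap   # hypothetical snapshot name
spec:
  volumeSnapshotClassName: csi-rbd-ec-sc
  source:
    persistentVolumeClaimName: rbd-ec-test-pvc
EOF
kubectl get volumesnapshot rbd-ec-test-snap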
Write about 5 GiB of data into a volume provisioned from the new StorageClass:
dd if=/dev/zero of=test-file1 bs=10M count=500
rbd showmapped | grep -E rc\|ec
Clearly, the image that is mapped comes from the replicated pool.
ceph df |grep -E rc\|ec
Indeed, the erasure-coded pool received the 5 GiB of written data while consuming 7.6 GiB of raw space, which is consistent with the erasure code's K and M.
rbd ls rc
There is one image.
This confirms what was said earlier: the replicated pool stores the image metadata, while the erasure-coded pool stores the user data.
rbd info rc/172-vo-d3fc8d23-bc54-11eb-bb49-22f8864ed6ad
The output shows the data_pool that was specified.
This is the same effect as manually creating an image and specifying --data-pool:
rbd create --size 10G --data-pool ec rc/<image-name>
rados -p rc ls
As you can see, the replicated pool contains only a handful of objects, while the erasure-coded pool holds a huge number of them (the command is rados -p ec ls; the output is very large and not shown here, run it yourself to check). This matches expectations.
The object rbd_header.7357774f37c deserves special mention: it stores important parameters of the rbd image, such as its size, features, order, and so on.
rados -p rc listomapvals rbd_header.7357774f37c
According to the documentation, an erasure-coded pool cannot store omap data like that shown above, which is why an image cannot be created directly in an erasure-coded pool; the exact reason would require a deeper dive into Ceph.
If you do not want to change how images are created on a replicated pool (for example, OpenStack Nova, Cinder and Glance do not natively support specifying a data pool) and you do not want to do custom development, you can try this approach: create a replicated cache pool for the erasure-coded pool, overlay the replicated pool on top of the erasure-coded pool, and redirect all requests to the cache pool. This achieves an effect similar to creating images directly in the erasure-coded pool.
References:
ceph 之 纠删码操作
管理ceph缓存池
1. Create the cache pool
ceph osd pool create cache 128
2. Set up the cache tier
ceph osd tier add ec cache
ceph osd tier cache-mode cache writeback
Note: in this scenario it seems only writeback mode can be used; readonly causes image creation to fail. This needs further investigation.
Appendix: the cache tier configuration modes (reference: ceph cache teir配置模式以及参数说明)
writeback mode: the Ceph client writes data to the cache tier and receives its acknowledgement from the cache tier. Data in the cache tier is migrated to the storage tier in due course and flushed from the cache. On reads, if the data is in the cache tier it is served directly from there; otherwise it is read from the storage tier into the cache tier and then returned to the client.
readproxy mode: if the data is in the cache tier it is served from there; otherwise the request is proxied to the storage tier. This mode is useful if the cache tier suddenly has to be disabled, because requests can go straight to the storage tier; even if the cache is exhausted, the cluster keeps serving reads and writes from the storage tier, so IO performance drops but IO keeps responding.
readonly mode: Ceph clients write data directly to the backing storage tier. On reads, Ceph copies the requested objects from the storage tier into the cache tier. When objects in the storage tier are updated, Ceph does not propagate those updates to the corresponding objects in the cache tier, so this mode is not recommended for production.
none: caching is disabled.
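To double-check which mode is currently active, the pool's cache_mode is visible in the cache pool's entry in the OSD map:
ceph osd dump | grep "'cache'"   # the pool line shows e.g. cache_mode writeback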
ceph osd tier set-overlay ec cache
In fact, it is not only erasure-coded pools that can have a cache tier; a replicated pool can have one too. For example, with a batch of SSDs we can create a pool on the SSDs to act as a cache tier for a pool on SAS disks and improve performance. By combining erasure coding, replication, SAS and SSD, we can build storage tiers with different performance profiles for different scenarios.
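As a sketch of that idea (the names ssd-rule, ssd-cache and sas-pool are hypothetical, not part of this article's setup), an SSD-backed replicated pool could be layered over an HDD/SAS-backed pool like this:
# replicated crush rule restricted to the ssd device class
ceph osd crush rule create-replicated ssd-rule default host ssd
# create the cache pool on that rule and attach it as a writeback tier
ceph osd pool create ssd-cache 64 64 replicated ssd-rule
ceph osd tier add sas-pool ssd-cache
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay sas-pool ssd-cache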
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache target_max_objects 1000
target_max_objects is set to 1000 here only to make the effect easy to observe; in a real environment this value would be far too small.
Other parameters can also be set, for example:
Start flushing and evicting when the amount of data in the cache pool reaches a given number of bytes:
ceph osd pool set cache target_max_bytes 10000000000
Start flushing when dirty objects reach 10% of the cache:
ceph osd pool set cache cache_target_dirty_ratio 0.1
Start flushing at high speed when dirty objects reach 60% of the cache:
ceph osd pool set cache cache_target_dirty_high_ratio 0.6
When the cache pool's usage reaches a certain percentage of its capacity, the cache tiering agent evicts objects to maintain free capacity (at this limit the cache pool is considered full), and unmodified (clean) objects are flushed:
ceph osd pool set cache cache_target_full_ratio 0.8
Define how long the cache tier waits before flushing an object to the storage tier or evicting it:
ceph osd pool set cache cache_min_flush_age 600
ceph osd pool set cache cache_min_evict_age 600
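To review the values that were just configured, each of these parameters can be read back with ceph osd pool get:
for p in hit_set_count hit_set_type target_max_objects target_max_bytes \
         cache_target_dirty_ratio cache_target_dirty_high_ratio \
         cache_target_full_ratio cache_min_flush_age cache_min_evict_age; do
  ceph osd pool get cache $p
done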
The cache pool's application type can also be set to rbd:
ceph osd pool application enable cache rbd
rbd create --size 1GiB ec/test.image
This succeeds, and the cache pool also contains an image with the same name:
rbd ls cache
rbd info cache/test.image
4. Removing the cache pool
ceph osd tier cache-mode cache forward
rados -p cache cache-flush-evict-all
rados -p cache ls
In fact, not all objects can be flushed into the ec pool, because the ec pool cannot store omap metadata,
so the image has to be deleted instead:
rbd rm cache/test.image
ceph osd tier remove-overlay ec
ceph osd tier remove ec cache
ceph osd pool delete cache cache --yes-i-really-really-mean-it
Note: Ceph guards against accidental deletion, which is why the command is so long.