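Ceph pools offer two data-redundancy mechanisms: erasure coding (erasure code, abbreviated EC) and replication (replicated, abbreviated RC).
A Ceph pool must ensure that no data is lost when some OSDs fail (normally one disk is configured as one OSD). By default, a pool is created with the rule type replicated, meaning each object is stored as copies on multiple disks. The other rule type is erasure, which saves space.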
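For example, with K=3 and M=2, each object is split into 3 data chunks plus 2 coding chunks, so K+M=5 chunks are stored and any 2 of them can be lost without losing data.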
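To make the K=3, M=2 example concrete, the sketch below computes the raw-to-usable space ratio (K+M)/K and compares it with a 3-replica pool, which can be treated as K=1, M=2. The function names ec_overhead and ec_min_osds are hypothetical helpers, and awk is assumed to be available. The script only does arithmetic and does not touch a cluster.

```shell
#!/bin/sh
# ec_overhead K M: raw space consumed per unit of user data, (K+M)/K.
ec_overhead() {
    awk -v k="$1" -v m="$2" 'BEGIN { printf "%.2f\n", (k + m) / k }'
}

# ec_min_osds K M: each object is spread over K+M chunks, one per OSD.
ec_min_osds() {
    echo $(( $1 + $2 ))
}

echo "EC 3+2: overhead $(ec_overhead 3 2)x, needs $(ec_min_osds 3 2) OSDs, tolerates 2 lost chunks"
echo "3 replicas: overhead $(ec_overhead 1 2)x, tolerates 2 lost copies"
```

Under these assumptions, the 3+2 profile stores about 1.67 bytes of raw data per byte written, versus 3.00 for three replicas, while tolerating the same number of failures.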
Reference: 浅谈Ceph纠删码 (A brief discussion of Ceph erasure coding), niuanxins' column, CSDN blog
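The Ceph version must be Luminous (L) or later: ceph --version
Enable allow_ec_overwrites: ceph osd pool set ec_pool allow_ec_overwrites true
How to check the OSD object store backend (EC overwrites require BlueStore): ceph daemon osd.0 config show | grep osd_objectstore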
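The version prerequisite above can be checked in a script. The sketch below parses the output of ceph --version and succeeds when the major version is 12 (Luminous) or higher. The function name supports_ec_overwrites is hypothetical, and it assumes the version number is the third whitespace-separated field, as in "ceph version 12.2.13 (...) luminous (stable)".

```shell
#!/bin/sh
# supports_ec_overwrites "VERSION_LINE": succeed if the major version is
# at least 12 (Luminous), the first release with EC overwrite support.
supports_ec_overwrites() {
    # The third field of `ceph --version` looks like "12.2.13".
    major=$(printf '%s\n' "$1" | awk '{ split($3, v, "."); print v[1] }')
    [ "${major:-0}" -ge 12 ] 2>/dev/null
}

# Usage on a cluster node (assumes the ceph CLI is installed):
#   supports_ec_overwrites "$(ceph --version)" && echo "version OK"
```

This checks the version only. The object store backend still has to be confirmed separately with the osd_objectstore query shown above.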
Reference: ceph 之 纠删码操作 (Ceph erasure code operations)
ceph osd erasure-code-profile ls
ceph osd erasure-code-profile get default
ceph osd erasure-code-profile rm
ceph osd erasure-code-profile set hdd-3-2 k=3 m=2 crush-device-class=hdd
ceph osd crush rule create-erasure default default
ceph osd crush rule dump default
ceph osd pool create ec 128 128 erasure default default
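Syntax: osd pool create
Although the crush rule is itself created from an erasure_code_profile, the erasure_code_profile must still be specified explicitly when creating the erasure-coded pool.
Erasure-coded pools cannot store omap data, but RBD image metadata is implemented on top of RADOS omap. An RBD image therefore cannot be created directly in an erasure-coded pool; a replicated pool must be created to hold the image metadata.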
ceph osd pool set ec fast_read 1
Currently, fast_read only takes effect on erasure-coded pools.
ceph osd pool set ec allow_ec_overwrites true
ceph osd pool create rc 128
ceph health detail
ceph osd pool application enable ec rbd
ceph osd pool application enable rc rbd
ceph health detail
ceph auth get-or-create client.kubernetes-ec mon 'allow r' osd 'allow rwx pool=ec, allow rwx pool=rc' -o /etc/ceph/ceph.client.kubernetes-ec.keyring
ceph auth get client.kubernetes-ec
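Prerequisite: ceph-rbd-csi is already installed. For the installation steps, see my first document, "KubeSphere使用rbd-csi创建快照" (Creating snapshots with rbd-csi in KubeSphere).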
ceph auth get client.kubernetes-ec
[root@ceph01 ~]# ceph auth get client.kubernetes-ec
exported keyring for client.kubernetes-ec
key = AQDyPatgdw0UGRAA3Kkg3uvH9pYr0Epa3TV3rw==
caps mon = "allow r"
caps osd = "allow rwx pool=ec, allow rwx pool=rc"
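The key shown in this output is what the Kubernetes Secret needs. As a small sketch, the key can be pulled out of the keyring text with awk instead of being copied by hand. The function name keyring_key is hypothetical, and it assumes the keyring line has the form "key = VALUE" as in the output above.

```shell
#!/bin/sh
# keyring_key: read `ceph auth get` output on stdin and print only the key.
keyring_key() {
    awk '$1 == "key" && $2 == "=" { print $3 }'
}

# Usage on a cluster node (assumes the ceph CLI is installed):
#   ceph auth get client.kubernetes-ec | keyring_key
# The ceph CLI can also print the bare key directly:
#   ceph auth get-key client.kubernetes-ec
```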
apiVersion: v1
kind: Secret
metadata:
  name: csi-rbd-ec-secret
  namespace: kube-system
stringData:
  # Key values correspond to a user name and its key, as defined in the
  # ceph cluster. User ID should have required access to the 'pool'
  # specified in the storage class
  userID: kubernetes-ec # Ceph user
  userKey: AQDyPatgdw0UGRAA3Kkg3uvH9pYr0Epa3TV3rw== # the key from the previous step
  # Encryption passphrase
  # encryptionPassphrase: test_passphrase
kubectl apply -f secret.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-ec-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
    storageclass.kubesphere.io/support-snapshot: 'true' # must be set to true, otherwise KubeSphere does not support creating snapshots
    storageclass.kubesphere.io/supported-access-modes: '["ReadWriteOnce"]'
provisioner: rbd.csi.ceph.com
# If topology based provisioning is desired, delayed provisioning of
# PV is required and is enabled using the following attribute
# For further information read TODO
# volumeBindingMode: WaitForFirstConsumer
parameters:
  # (required) String representing a Ceph cluster to provision storage from.
  # Should be unique across all Ceph clusters in use for provisioning,
  # cannot be greater than 36 bytes in length, and should remain immutable for
  # the lifetime of the StorageClass in use.
  # Ensure to create an entry in the configmap named ceph-csi-config, based on
  # csi-config-map-sample.yaml, to accompany the string chosen to
  # represent the Ceph cluster in clusterID below
  clusterID: b93a2e42-43e1-4975-bc7d-5998ca61a7c4 # Ceph cluster ID, the fsid referred to earlier
  # (optional) If you want to use erasure coded pool with RBD, you need to
  # create two pools. one erasure coded and one replicated.
  # You need to specify the replicated pool here in the `pool` parameter, it is
  # used for the metadata of the images.
  # The erasure coded pool must be set as the `dataPool` parameter below.
  dataPool: ec # erasure-coded pool
  # (required) Ceph pool into which the RBD image shall be created
  # eg: pool: rbdpool
  pool: rc # replicated pool
  # Set thickProvision to true if you want RBD images to be fully allocated on
  # creation (thin provisioning is the default).
  thickProvision: "false"
  # (required) RBD image features, CSI creates image with image-format 2
  # CSI RBD currently supports `layering`, `journaling`, `exclusive-lock`
  # features. If `journaling` is enabled, must enable `exclusive-lock` too.
  # imageFeatures: layering,journaling,exclusive-lock
  imageFeatures: layering
  # (optional) mapOptions is a comma-separated list of map options.
  # For krbd options refer
  # https://docs.ceph.com/docs/master/man/8/rbd/#kernel-rbd-krbd-options
  # For nbd options refer
  # https://docs.ceph.com/docs/master/man/8/rbd-nbd/#options
  # mapOptions: lock_on_read,queue_depth=1024
  # (optional) unmapOptions is a comma-separated list of unmap options.
  # For krbd options refer
  # https://docs.ceph.com/docs/master/man/8/rbd/#kernel-rbd-krbd-options
  # For nbd options refer
  # https://docs.ceph.com/docs/master/man/8/rbd-nbd/#options
  # unmapOptions: force
  # The secrets have to contain Ceph credentials with required access
  # to the 'pool'.
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-ec-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-ec-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: kube-system
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-ec-secret
  csi.storage.k8s.io/node-stage-secret-namespace: kube-system
  # (optional) Specify the filesystem type of the volume. If not specified,
  # csi-provisioner will set default as `ext4`.
  csi.storage.k8s.io/fstype: ext4
  # (optional) uncomment the following to use rbd-nbd as mounter
  # on supported nodes
  # mounter: rbd-nbd
  # (optional) Prefix to use for naming RBD images.
  # If omitted, defaults to "csi-vol-".
  # volumeNamePrefix: "foo-bar-"
  # (optional) Instruct the plugin it has to encrypt the volume
  # By default it is disabled. Valid values are "true" or "false".
  # A string is expected here, i.e. "true", not true.
  # encrypted: "true"
  # (optional) Use external key management system for encryption passphrases by
  # specifying a unique ID matching KMS ConfigMap. The ID is only used for
  # correlation to configmap entry.
  # encryptionKMSID:
  # Add topology constrained pools configuration, if topology based pools
  # are setup, and topology constrained provisioning is required.
  # For further information read TODO
  # topologyConstrainedPools: |
  #   [{"poolName":"pool0",
  #     "dataPool":"ec-pool0" # optional, erasure-coded pool for data
  #     "domainSegments":[
  #       {"domainLabel":"region","value":"east"},
  #       {"domainLabel":"zone","value":"zone1"}]},
  #    {"poolName":"pool1",
  #     "dataPool":"ec-pool1" # optional, erasure-coded pool for data
  #     "domainSegments":[
  #       {"domainLabel":"region","value":"east"},
  #       {"domainLabel":"zone","value":"zone2"}]},
  #    {"poolName":"pool2",
  #     "dataPool":"ec-pool2" # optional, erasure-coded pool for data
  #     "domainSegments":[
  #       {"domainLabel":"region","value":"west"},
  #       {"domainLabel":"zone","value":"zone1"}]}
  #   ]
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
  - discard
kubectl apply -f storageclass.yaml
# Snapshot API version compatibility matrix:
# v1beta1:
#   v1.17 =< k8s < v1.20
#   2.x =< snapshot-controller < v4.x
# v1:
#   k8s >= v1.20
#   snapshot-controller >= v4.x
# We recommend to use {sidecar, controller, crds} of same version
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-rbd-ec-sc # Note: must have the same name as the StorageClass, otherwise KubeSphere cannot create snapshots for the volume
driver: rbd.csi.ceph.com
parameters:
  # String representing a Ceph cluster to provision storage from.
  # Should be unique across all Ceph clusters in use for provisioning,
  # cannot be greater than 36 bytes in length, and should remain immutable for
  # the lifetime of the StorageClass in use.
  # Ensure to create an entry in the configmap named ceph-csi-config, based on
  # csi-config-map-sample.yaml, to accompany the string chosen to
  # represent the Ceph cluster in clusterID below
  clusterID: b93a2e42-43e1-4975-bc7d-5998ca61a7c4 # Ceph cluster ID, the fsid referred to earlier
  # Prefix to use for naming RBD snapshots.
  # If omitted, defaults to "csi-snap-".
  # snapshotNamePrefix: "foo-bar-"
  csi.storage.k8s.io/snapshotter-secret-name: csi-rbd-ec-secret
  csi.storage.k8s.io/snapshotter-secret-namespace: kube-system
deletionPolicy: Delete
kubectl apply -f snapshotclass.yaml
dd if=/dev/zero of=test-file1 bs=10M count=500
rbd showmapped | grep -E rc\|ec
ceph df |grep -E rc\|ec
rbd ls rc
rbd info rc/172-vo-d3fc8d23-bc54-11eb-bb49-22f8864ed6ad
rbd create --size 10G --data-pool ec rc/
rados -p rc ls
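As shown, the replicated pool holds only a small number of objects, while the erasure-coded pool holds a huge number of objects (command: rados -p ec ls).
The rbd_header.7357774f37c object deserves special mention: it stores important parameters of the RBD image, such as size, features, and order.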
rados -p rc listomapvals rbd_header.7357774f37c
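If you do not want to change the existing way of creating images on a replicated pool (for example, OpenStack Nova, Cinder, and Glance do not natively support specifying a data pool, and you do not want to do custom development), you can try the following approach. The idea is to create a replicated cache pool for the erasure-coded pool, overlay the replicated pool on the erasure-coded pool, and direct all requests to the cache pool, which gives an effect similar to creating images on the erasure-coded pool.
Reference: ceph 之 纠删码操作 (Ceph erasure code operations)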
ceph osd pool create cache 128
2. Set up the cache tier
ceph osd tier add ec cache
ceph osd tier cache-mode cache writeback
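Note: in this scenario it seems that only writeback mode works; readonly causes image creation to fail. This needs further investigation.
Appendix: cache tier modes (reference: ceph cache tier配置模式以及参数说明, "Ceph cache tier modes and parameters")
writeback mode: the Ceph client writes data to the cache tier and receives the write acknowledgement from the cache tier. Data in the cache tier is migrated to the storage tier in due course and flushed from the cache. On a read, if the data is in the cache tier it is served directly from there; if not, the data is first read from the storage tier into the cache tier and then sent to the client.
readproxy mode: if the data is in the cache tier, the operation is served from the cache; if it is not, the operation goes to the storage tier. This mode is useful when caching is suddenly disabled, because requests can go straight to the storage tier. In this mode, even if the cache is exhausted, the cluster keeps serving reads and writes from the storage tier; I/O performance drops, but I/O continues to be served.
readonly mode: on a write, the Ceph client writes data directly to the backing storage tier. On a read, Ceph copies the requested object from the storage tier to the cache tier. When an object is updated in the storage tier, Ceph does not propagate the update to the corresponding object in the cache tier, so this mode is not recommended in production.
none: caching is disabled.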
ceph osd tier set-overlay ec cache
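In fact, cache tiering is not limited to erasure-coded pools; it also works with replicated pools. For example, given a batch of SSDs, we can create a pool on the SSDs to act as a cache tier for SAS disks and improve performance. By combining erasure coding, replication, SAS, and SSD, we can build storage with different performance profiles for different scenarios.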
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache target_max_objects 1000
ceph osd pool set cache target_max_bytes 10000000000
ceph osd pool set cache cache_target_dirty_ratio 0.1
ceph osd pool set cache cache_target_dirty_high_ratio 0.6
ceph osd pool set cache cache_target_full_ratio 0.8
ceph osd pool set cache cache_min_flush_age 600
ceph osd pool set cache cache_min_evict_age 600
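The ratio settings above are fractions of the cache pool's configured capacity. The sketch below works out the byte thresholds they imply for target_max_bytes 10000000000, assuming the ratios are applied to target_max_bytes; target_max_objects acts as a separate limit that is not modeled here. The function name cache_thresholds is hypothetical, and the script only does arithmetic.

```shell
#!/bin/sh
# cache_thresholds MAX_BYTES DIRTY_RATIO DIRTY_HIGH_RATIO FULL_RATIO
# Print the byte counts at which the tiering agent is expected to act.
cache_thresholds() {
    awk -v cap="$1" -v dirty="$2" -v high="$3" -v full="$4" 'BEGIN {
        printf "flush starts:     %.0f bytes\n", cap * dirty
        printf "aggressive flush: %.0f bytes\n", cap * high
        printf "eviction starts:  %.0f bytes\n", cap * full
    }'
}

# Values taken from the pool settings above.
cache_thresholds 10000000000 0.1 0.6 0.8
```

With these values, dirty objects start being flushed to the erasure-coded pool at about 1 GB, flushing becomes more aggressive at about 6 GB, and clean objects start being evicted at about 8 GB.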
ceph osd pool application enable cache rbd
rbd create --size 1GiB ec/test.image
rbd ls cache
rbd info cache/test.image
4. Delete the cache pool
ceph osd tier cache-mode cache forward
rados -p cache cache-flush-evict-all
rados -p cache ls
rbd rm cache/test.image
ceph osd tier remove-overlay ec
ceph osd tier remove ec cache
ceph osd pool delete cache cache --yes-i-really-really-mean-it