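I started working with Ceph last year and have run into quite a few problems along the way, but never found the time to write them down. I will gradually record some of them here. Today's topic actually dates from about a year ago.
Background:
When Ceph serves as the backend storage of a cloud platform, we use RBD to provide block storage to OpenStack, and we need to keep that data safe. Data safety is a complex topic with many possible approaches and deserves its own post; here I cover only one of them: periodic backup.
A quick investigation shows that the rbd subcommands "rbd export" and "rbd import" look like the right tools for this, so I gave them a try. I then ran into a problem; see http://tracker.ceph.com/issues/13186
That is, after exporting an image and importing it back, all of its snapshots are gone. Only the data content is the same; the rest of the image, such as the features and striping settings in the transcript that follows, is no longer recognizable.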
[root@node-1 ~]# rbd create test_export_v2 -s 1G
[root@node-1 ~]# rbd info test_export_v2
rbd image 'test_export_v2':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.f5516b8b4567
format: 2
features: layering, striping, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
stripe unit: 128 kB
stripe count: 16
[root@node-1 ~]# rm -rf export.file
[root@node-1 ~]# rbd export test_export_v2 export.file
Exporting image: 100% complete...done.
[root@node-1 ~]# ll export.file
-rw-r--r-- 1 root root 1073741824 Oct 17 20:04 export.file
[root@node-1 ~]# rbd import export.file test_import_v2
Importing image: 100% complete...done.
[root@node-1 ~]# rbd info test_import_v2
rbd image 'test_import_v2':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.f55a6b8b4567
format: 2
features: layering
flags:
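This does not meet the requirements of periodic backup at all: if the system crashes, the image cannot be fully restored from the backup file.
Cause:
Reading the code shows that rbd export simply reads the content of the image and writes it to a file, while rbd import creates a new image, reads the file from beginning to end, and writes the data into that image. Neither step carries metadata or snapshots, so this design cannot satisfy our needs.
Solution:
To solve this problem, I submitted a set of patches to the community that introduces a v2 format for rbd export and rbd import: https://github.com/ceph/ceph/pull/10487
Here is a brief description of the implementation: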
RBD Export & Import
===================
This is a file format of an RBD image or snapshot. It's a sparse format
for the full image. There are three recording sections in the file.
(1) Header.
(2) Metadata.
(3) Diffs.
Header
~~~~~~
"rbd image v2\n"
Metadata records
~~~~~~~~~~~~~~~~
Every record has a one byte "tag" that identifies the record type,
followed by length of data, and then some other data.
Metadata records come in the first part of the image. Order is not
important, as long as all the metadata records come before the data
records.
In v2, each metadata record has the following layout:
(1 byte) tag.
(8 bytes) length.
(n bytes) data.
In this way, unrecognized tags can be skipped.
Image order
-----------
- u8: 'O'
- le64: length of appending data (8)
- le64: image order
Image format
------------
- u8: 'F'
- le64: length of appending data (8)
- le64: image format
Image Features
--------------
- u8: 'T'
- le64: length of appending data (8)
- le64: image features
Image Stripe unit
-----------------
- u8: 'U'
- le64: length of appending data (8)
- le64: image striping unit
Image Stripe count
------------------
- u8: 'C'
- le64: length of appending data (8)
- le64: image striping count
Final Record
~~~~~~~~~~~~
End
---
- u8: 'E'
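
The metadata layout above can be sketched in a few lines of Python. This is only an illustration of the tag/length/data idea, not the code from the PR: the helper names (`pack_record`, `read_metadata`) and the unknown tag 'X' are made up for the example, and the sample values are taken from the rbd info output in this post.

```python
import io
import struct

BANNER = b"rbd image v2\n"

# Tags from the spec above, mapped to illustrative field names.
KNOWN_TAGS = {b"O": "order", b"F": "format", b"T": "features",
              b"U": "stripe_unit", b"C": "stripe_count"}


def pack_record(tag, payload):
    """One record: 1-byte tag, le64 payload length, then the payload."""
    return tag + struct.pack("<Q", len(payload)) + payload


def pack_u64_record(tag, value):
    """Record whose payload is a single le64 value."""
    return pack_record(tag, struct.pack("<Q", value))


def read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated stream")
    return data


def read_metadata(stream):
    """Read the banner and metadata records up to the 'E' tag."""
    if read_exact(stream, len(BANNER)) != BANNER:
        raise ValueError("not an rbd image v2 export")
    meta = {}
    while True:
        tag = read_exact(stream, 1)
        if tag == b"E":  # final record: a bare tag with no length
            return meta
        (length,) = struct.unpack("<Q", read_exact(stream, 8))
        payload = read_exact(stream, length)
        name = KNOWN_TAGS.get(tag)
        if name is None:
            continue  # unknown tag: the length field lets us skip it
        (meta[name],) = struct.unpack("<Q", payload)


blob = (BANNER
        + pack_u64_record(b"O", 22)            # order 22 (4096 kB objects)
        + pack_u64_record(b"U", 128 * 1024)    # stripe unit 128 kB
        + pack_u64_record(b"C", 32)            # stripe count 32
        + pack_record(b"X", b"future data")    # hypothetical unknown tag
        + b"E")
meta = read_metadata(io.BytesIO(blob))
print(meta)
```

Because every record carries its own length, the reader skips the hypothetical 'X' record without knowing what it means. This is what lets new record types be added later without breaking older readers.
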
Diffs records
~~~~~~~~~~~~~~~~~
This section records all snapshots and the HEAD.
- le64: number of diffs
- Diffs ...
For details, please refer to rbd-diff.rst:
Header
~~~~~~
"rbd diff v2\n"
Metadata records
~~~~~~~~~~~~~~~~
Every record has a one byte "tag" that identifies the record type,
followed by length of data, and then some other data.
Metadata records come in the first part of the image. Order is not
important, as long as all the metadata records come before the data
records.
In v2, each metadata record has the following layout:
(1 byte) tag.
(8 bytes) length.
(n bytes) data.
In this way, unrecognized tags can be skipped.
From snap
---------
- u8: 'f'
- le64: length of appending data (4 + length)
- le32: snap name length
- snap name
To snap
-------
- u8: 't'
- le64: length of appending data (4 + length)
- le32: snap name length
- snap name
Size
----
- u8: 's'
- le64: length of appending data (8)
- le64: (ending) image size
Data Records
~~~~~~~~~~~~
These records come in the second part of the sequence.
Updated data
------------
- u8: 'w'
- le64: length of appending data (8 + 8 + length)
- le64: offset
- le64: length
- length bytes of actual data
Zero data
---------
- u8: 'z'
- le64: length of appending data (8 + 8)
- le64: offset
- le64: length
Final Record
~~~~~~~~~~~~
End
---
- u8: 'e'
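That is the scheme introduced in this PR. The old v1 export wrote only the image data into the exported file, which left no way to store metadata or snapshot information.
So I added a header to the exported file to hold the metadata, and split the exported file into a number of sections, each carrying a tag that identifies its meaning.
On import, we can then parse the individual sections to restore all of the metadata: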
+static const std::string RBD_IMAGE_BANNER_V2 ("rbd image v2\n");
+static const std::string RBD_IMAGE_DIFFS_BANNER_V2 ("rbd image diffs v2\n");
+static const std::string RBD_DIFF_BANNER_V2 ("rbd diff v2\n");
+
+#define RBD_DIFF_FROM_SNAP 'f'
+#define RBD_DIFF_TO_SNAP 't'
+#define RBD_DIFF_IMAGE_SIZE 's'
+#define RBD_DIFF_WRITE 'w'
+#define RBD_DIFF_ZERO 'z'
+#define RBD_DIFF_END 'e'
+
+#define RBD_EXPORT_IMAGE_ORDER 'O'
+#define RBD_EXPORT_IMAGE_FEATURES 'T'
+#define RBD_EXPORT_IMAGE_STRIPE_UNIT 'U'
+#define RBD_EXPORT_IMAGE_STRIPE_COUNT 'C'
+#define RBD_EXPORT_IMAGE_END 'E'
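
To show how the lowercase RBD_DIFF_* tags are used, here is a rough Python sketch that applies one "rbd diff v2" stream, laid out as in the rbd-diff description earlier, to an in-memory buffer. It is not the import code from the PR. The function `apply_diff`, the helpers `rec` and `snap`, and the sample stream are all invented for the example, and the sketch assumes the size record arrives before any data record.

```python
import io
import struct

DIFF_BANNER = b"rbd diff v2\n"


def read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated stream")
    return data


def apply_diff(stream, image):
    """Apply one diff to `image` (a bytearray); return (from_snap, to_snap)."""
    if read_exact(stream, len(DIFF_BANNER)) != DIFF_BANNER:
        raise ValueError("not an rbd diff v2 stream")
    from_snap = to_snap = None
    while True:
        tag = read_exact(stream, 1)
        if tag == b"e":  # RBD_DIFF_END: bare tag, no length
            return from_snap, to_snap
        (length,) = struct.unpack("<Q", read_exact(stream, 8))
        payload = read_exact(stream, length)
        if tag in (b"f", b"t"):  # snap name: le32 length + name
            (name_len,) = struct.unpack_from("<I", payload)
            name = payload[4:4 + name_len].decode()
            if tag == b"f":
                from_snap = name
            else:
                to_snap = name
        elif tag == b"s":  # (ending) image size: grow or shrink the buffer
            (size,) = struct.unpack("<Q", payload)
            if size < len(image):
                del image[size:]
            else:
                image.extend(bytes(size - len(image)))
        elif tag in (b"w", b"z"):  # updated data / zero data
            offset, n = struct.unpack_from("<QQ", payload)
            if offset + n > len(image):
                raise ValueError("extent beyond image size")
            data = payload[16:16 + n] if tag == b"w" else bytes(n)
            if len(data) != n:
                raise ValueError("short data record")
            image[offset:offset + n] = data
        # any other tag is skipped; its payload was already consumed


def rec(tag, payload):
    return tag + struct.pack("<Q", len(payload)) + payload


def snap(name):
    raw = name.encode()
    return struct.pack("<I", len(raw)) + raw


stream = (DIFF_BANNER
          + rec(b"t", snap("snap1"))
          + rec(b"s", struct.pack("<Q", 16))
          + rec(b"w", struct.pack("<QQ", 4, 4) + b"abcd")
          + rec(b"z", struct.pack("<QQ", 12, 4))
          + b"e")
image = bytearray(b"\xff" * 8)
snaps = apply_diff(io.BytesIO(stream), image)
print(snaps, bytes(image))
```
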
The result now looks like this:
[root@node-1 ~]# rbd create test_export_v2 -s 1G
[root@node-1 ~]# rbd info test_export_v2
rbd image 'test_export_v2':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8d9ca36b8b4567
format: 2
features: layering, striping, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
create_timestamp: Tue Oct 17 22:25:02 2017
stripe unit: 128 kB
stripe count: 32
[root@node-1 ~]# rbd export --export-format 2 test_export_v2 export.file
Exporting image: 100% complete...done.
[root@node-1 ~]# rbd import --export-format 2 export.file test_import_v2
Importing image: 100% complete...done.
[root@node-1 ~]# rbd info test_import_v2
rbd image 'test_import_v2':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8ea2566b8b4567
format: 2
features: layering, striping, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
create_timestamp: Tue Oct 17 22:25:55 2017
stripe unit: 128 kB
stripe count: 32
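
For the periodic backup use case that motivated this work, the two commands above can be wrapped in a small script and run from a scheduler such as cron. The following Python sketch only assembles the same command lines shown in the transcript. The backup directory, the file naming scheme, and the function names are assumptions made for illustration.

```python
import datetime
import subprocess


def export_command(image, dest):
    """Command line for a v2 export, as used in the transcript above."""
    return ["rbd", "export", "--export-format", "2", image, dest]


def import_command(src, image):
    """Command line for the matching v2 import."""
    return ["rbd", "import", "--export-format", "2", src, image]


def backup_path(image, when, directory="/var/backups/rbd"):
    """Hypothetical naming scheme: <directory>/<image>-<timestamp>.rbdv2"""
    return f"{directory}/{image}-{when:%Y%m%d-%H%M%S}.rbdv2"


def backup(image, when=None):
    """Export `image` to a timestamped file; raises if rbd fails."""
    when = when or datetime.datetime.now()
    dest = backup_path(image, when)
    subprocess.run(export_command(image, dest), check=True)
    return dest


# Build (but do not run) the commands for one backup cycle.
stamp = datetime.datetime(2017, 10, 17, 22, 25, 2)
path = backup_path("test_export_v2", stamp)
cmd = export_command("test_export_v2", path)
print(" ".join(cmd))
```

Calling `backup("test_export_v2")` on a node with a working rbd client would perform the actual export; the lines at the bottom only print the command so the sketch can be run anywhere.
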
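Follow-up:
There is still some follow-up work that could be done, for example:
(1) image-meta data was not exported. A colleague has since completed this work.
(2) An MD5 checksum could be written into the exported file and verified on import, to detect any modification of the exported file's content.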
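
Until something like item (2) exists in the format itself, a similar check can be approximated outside of rbd. The Python sketch below does not embed the checksum in the exported file as proposed above. Instead it stores the MD5 in a separate ".md5" file next to the export, which is a simpler stand-in. The function names and the sidecar convention are assumptions for illustration.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large exports fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path):
    """Record the checksum in a sidecar file (hypothetical convention)."""
    checksum = file_md5(path)
    with open(path + ".md5", "w") as f:
        f.write(checksum + "\n")
    return checksum


def verify_checksum(path):
    """Return True if the file still matches its recorded checksum."""
    with open(path + ".md5") as f:
        expected = f.read().strip()
    return file_md5(path) == expected


# Demonstration on a throwaway file standing in for export.file.
with tempfile.TemporaryDirectory() as tmp:
    export_file = os.path.join(tmp, "export.file")
    with open(export_file, "wb") as f:
        f.write(b"hello")
    recorded = write_checksum(export_file)
    ok_before = verify_checksum(export_file)
    with open(export_file, "ab") as f:
        f.write(b"tampered")  # simulate modification after export
    ok_after = verify_checksum(export_file)
print(recorded, ok_before, ok_after)
```

The check would run right after rbd export (`write_checksum`) and right before rbd import (`verify_checksum`). Note that MD5 catches accidental corruption but is not a defense against deliberate tampering.
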