I. Problem description
A rocksdb database error prevents the mon process from starting.
II. Symptoms
The call trace from the first mon crash is as follows:
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,237 2020-07-31 19:36:31.926040 7fdc0142d700 -1 /root/rpmbuild/BUILD/ceph-12.2.12-1/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair, std::basic_string >*, int*)' thread 7fdc0142d700 time 2020-07-31 19:36:31.895145
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,238 /root/rpmbuild/BUILD/ceph-12.2.12-1/src/mon/Monitor.cc: 5374: FAILED assert(err == 0)
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,239 ceph version 12.2.12-30-ged2e5c3 (ed2e5c3c26215c395ed024dabce34321e1f650b3) luminous (stable)
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,240 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5583990f8b10]
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,241 2: (Monitor::_scrub(ScrubResult*, std::pair*, int*)+0xc11) [0x558398e7db61]
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,242 3: (Monitor::handle_scrub(boost::intrusive_ptr)+0x22f) [0x558398e8adbf]
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,243 4: (Monitor::dispatch_op(boost::intrusive_ptr)+0xc08) [0x558398ea6ab8]
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,244 5: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x558398ea7a4b]
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,245 6: (Monitor::ms_dispatch(Message*)+0x23) [0x558398ed4323]
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,246 7: (DispatchQueue::entry()+0x792) [0x5583993c2ef2]
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,247 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x5583991a120d]
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,248 9: (()+0x7e25) [0x7fdc0a21ae25]
Jul 31, 2020 @ 19:36:31.000 node-3 ceph ceph-mon 228,249 10: (clone()+0x6d) [0x7fdc077b534d]
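III. Analysis
1. Investigation shows that while handling a scrub message, the ceph mon hit an error reading from rocksdb, which tripped assert(err == 0) in Monitor::_scrub. The database may be corrupted.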
bool Monitor::_scrub(ScrubResult *r,
                     pair<string,string> *start,
                     int *num_keys)
{
  ...
  MonitorDBStore::Synchronizer it = store->get_synchronizer(*start, prefixes);
  ...
  while (it->has_next_chunk()) {
    ...
    pair<string,string> k = it->get_next_key();
    ...
    bufferlist bl;
    int err = store->get(k.first, k.second, bl);  // the read from rocksdb fails here; the database may be corrupted
    assert(err == 0);
    ...
  }
  ...
}
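As a rough illustration of this failure mode, the sketch below walks a set of keys the way the scrub loop does, but reports the first unreadable key and its error code instead of asserting. ToyStore, Key and scrub_all_readable are hypothetical names invented for this example. This is not Ceph's MonitorDBStore and not a proposed patch for the mon.

```cpp
#include <cerrno>
#include <map>
#include <set>
#include <string>
#include <utility>

using Key = std::pair<std::string, std::string>;  // (prefix, key)

// Hypothetical in-memory stand-in for the mon store; NOT Ceph's MonitorDBStore.
struct ToyStore {
  std::map<Key, std::string> data;
  std::set<Key> unreadable;  // keys whose reads fail, simulating corruption

  // Returns 0 on success or a negative errno value on failure.
  int get(const Key& k, std::string* out) const {
    if (unreadable.count(k) > 0) return -EIO;
    auto it = data.find(k);
    if (it == data.end()) return -ENOENT;
    *out = it->second;
    return 0;
  }
};

// Reads every key in order. Returns true if all reads succeed; otherwise
// stores the first failing key and its error code and returns false.
bool scrub_all_readable(const ToyStore& store, Key* bad_key, int* bad_err) {
  for (const auto& kv : store.data) {
    std::string value;
    const int err = store.get(kv.first, &value);
    if (err != 0) {
      *bad_key = kv.first;
      *bad_err = err;
      return false;
    }
  }
  return true;
}
```

With assert(err == 0) in place of the early return, the first failed read terminates the whole process, which matches the call trace above.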
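2. This problem had already been reported as a bug in the rocksdb repository. I reviewed the comments on the issue and the fix made by the rocksdb community.
Issue: https://github.com/facebook/rocksdb/issues/5558
rocksdb fix: https://github.com/facebook/rocksdb/pull/5744
The description of the rocksdb fix states: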
Open-source users recently reported two occurrences of LSM-tree corruption (#5558 is one), which would be caught by options.force_consistency_checks = true. options.force_consistency_checks has a usability limitation because it crashes the service once inconsistency is detected. This makes the feature hard to use. Most users serve from multiple RocksDB shards per server and the impacts of crashing the service is higher than it should be.
Instead, we just pass the error back to users without killing the service, and ask them to deal with the problem accordingly.
When user uses options.force_consistency_check in RocksDb, instead of crashing the process, we now pass the error back to the users without killing the process.
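3. With the force_consistency_checks option enabled, rocksdb calls CheckConsistency or CheckConsistencyForDeletes during the Apply operation to run consistency checks. In version 5.4.0, force_consistency_checks defaults to false, so in release builds (NDEBUG defined) these checks are not run at all.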
void CheckConsistency(VersionStorageInfo* vstorage) {
#ifdef NDEBUG
  if (!vstorage->force_consistency_checks()) {
    // Dont run consistency checks in release mode except if
    // explicitly asked to
    return;
  }
#endif
  ...
}
void CheckConsistencyForDeletes(VersionEdit* edit, uint64_t number,
                                int level) {
#ifdef NDEBUG
  if (!base_vstorage_->force_consistency_checks()) {
    // Dont run consistency checks in release mode except if
    // explicitly asked to
    return;
  }
#endif
  ...
}
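The gating in both functions reduces to a small truth table. The sketch below restates it as a plain function, assuming that NDEBUG is what distinguishes a release build. checks_run and kReleaseBuild are names made up for this illustration and do not exist in rocksdb.

```cpp
// True when this translation unit is compiled with NDEBUG (release mode).
constexpr bool kReleaseBuild =
#ifdef NDEBUG
    true;
#else
    false;
#endif

// Restates the gating logic: debug builds always run the checks,
// release builds run them only when force_consistency_checks is set.
bool checks_run(bool release_build, bool force_consistency_checks) {
  return !release_build || force_consistency_checks;
}

// Whether the checks would run in the current build for a given option value.
bool checks_run_in_this_build(bool force_consistency_checks) {
  return checks_run(kReleaseBuild, force_consistency_checks);
}
```

The only combination that skips the checks is a release build with the option left at false, which is the configuration described for 5.4.0.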
4. Set force_consistency_checks to true in the configuration so that rocksdb runs these checks.
mon_rocksdb_options = write_buffer_size=33554432,compression=kNoCompression,level_compaction_dynamic_level_bytes=true,force_consistency_checks=true
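5. With step 4 applied, rocksdb does run the checks. However, when version 5.4.0 detects an inconsistency (for example, SST files that are out of order or overlapping, or a file to be deleted that cannot be found), it calls abort(), which terminates the process immediately. https://github.com/facebook/rocksdb/pull/5744 improves this: when such an inconsistency is detected, the error is returned to the caller instead of killing the process. Part of the change is shown below: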
// rocksdb 5.4.0
void CheckConsistency(VersionStorageInfo* vstorage) {
  ...
  // make sure the files are sorted correctly
  for (int level = 0; level < vstorage->num_levels(); level++) {
    auto& level_files = vstorage->LevelFiles(level);
    for (size_t i = 1; i < level_files.size(); i++) {
      auto f1 = level_files[i - 1];
      auto f2 = level_files[i];
      if (level == 0) {
        if (!level_zero_cmp_(f1, f2)) {
          fprintf(stderr, "L0 files are not sorted properly");
          abort();  // the process exits immediately
        }

        if (f2->smallest_seqno == f2->largest_seqno) {
          // This is an external file that we ingested
          SequenceNumber external_file_seqno = f2->smallest_seqno;
          if (!(external_file_seqno < f1->largest_seqno ||
                external_file_seqno == 0)) {
            fprintf(stderr, "L0 file with seqno %" PRIu64 " %" PRIu64
                            " vs. file with global_seqno %" PRIu64 "\n",
                    f1->smallest_seqno, f1->largest_seqno,
                    external_file_seqno);
            abort();  // the process exits immediately
          }
        } else if (f1->smallest_seqno <= f2->smallest_seqno) {
          fprintf(stderr, "L0 files seqno %" PRIu64 " %" PRIu64
                          " vs. %" PRIu64 " %" PRIu64 "\n",
                  f1->smallest_seqno, f1->largest_seqno, f2->smallest_seqno,
                  f2->largest_seqno);
          abort();  // the process exits immediately
        }
      } else {
        if (!level_nonzero_cmp_(f1, f2)) {
          fprintf(stderr, "L%d files are not sorted properly", level);
          abort();  // the process exits immediately
        }
        ...
      }
    }
  }
}
// rocksdb v6.10.2
Status CheckConsistency(VersionStorageInfo* vstorage) {
  ...
      if (level == 0) {
        if (!level_zero_cmp_(f1, f2)) {
          fprintf(stderr, "L0 files are not sorted properly");
          return Status::Corruption("L0 files are not sorted properly");  // no exit; the error is returned
        }

        if (f2->fd.smallest_seqno == f2->fd.largest_seqno) {
          // This is an external file that we ingested
          SequenceNumber external_file_seqno = f2->fd.smallest_seqno;
          if (!(external_file_seqno < f1->fd.largest_seqno ||
                external_file_seqno == 0)) {
            fprintf(stderr,
                    "L0 file with seqno %" PRIu64 " %" PRIu64
                    " vs. file with global_seqno %" PRIu64 "\n",
                    f1->fd.smallest_seqno, f1->fd.largest_seqno,
                    external_file_seqno);
            return Status::Corruption(
                "L0 file with seqno " +
                NumberToString(f1->fd.smallest_seqno) + " " +
                NumberToString(f1->fd.largest_seqno) +
                " vs. file with global_seqno" +
                NumberToString(external_file_seqno) + " with fileNumber " +
                NumberToString(f1->fd.GetNumber()));  // no exit; the error is returned
          }
        } else if (f1->fd.smallest_seqno <= f2->fd.smallest_seqno) {
          fprintf(stderr,
                  "L0 files seqno %" PRIu64 " %" PRIu64 " vs. %" PRIu64
                  " %" PRIu64 "\n",
                  f1->fd.smallest_seqno, f1->fd.largest_seqno,
                  f2->fd.smallest_seqno, f2->fd.largest_seqno);
          return Status::Corruption(
              "L0 files seqno " + NumberToString(f1->fd.smallest_seqno) +
              " " + NumberToString(f1->fd.largest_seqno) + " " +
              NumberToString(f1->fd.GetNumber()) + " vs. " +
              NumberToString(f2->fd.smallest_seqno) + " " +
              NumberToString(f2->fd.largest_seqno) + " " +
              NumberToString(f2->fd.GetNumber()));  // no exit; the error is returned
        }
      } else {
        if (!level_nonzero_cmp_(f1, f2)) {
          fprintf(stderr, "L%d files are not sorted properly", level);
          return Status::Corruption("L" + NumberToString(level) +
                                    " files are not sorted properly");
        }

        // Make sure there is no overlap in levels > 0
        if (vstorage->InternalComparator()->Compare(f1->largest,
                                                    f2->smallest) >= 0) {
          fprintf(stderr, "L%d have overlapping ranges %s vs. %s\n", level,
                  (f1->largest).DebugString(true).c_str(),
                  (f2->smallest).DebugString(true).c_str());
          return Status::Corruption(
              "L" + NumberToString(level) + " have overlapping ranges " +
              (f1->largest).DebugString(true) + " vs. " +
              (f2->smallest).DebugString(true));  // no exit; the error is returned
        }
      }
  ...
}
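The difference between the two versions can be reduced to a self-contained sketch: the check returns a status object describing the corruption, and the caller decides what to do with it. ToyStatus, ToyFile and CheckLevelConsistency are hypothetical simplifications written for this note. Key ranges are plain integers here, and none of this is rocksdb's actual API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Minimal stand-in for a status type; hypothetical, NOT rocksdb::Status.
struct ToyStatus {
  bool ok;
  std::string message;
  static ToyStatus OK() { return {true, ""}; }
  static ToyStatus Corruption(const std::string& msg) {
    return {false, "Corruption: " + msg};
  }
};

// Simplified file metadata: the key range is a pair of integers.
struct ToyFile {
  int smallest;
  int largest;
};

// Checks that the files of a level > 0 are sorted and do not overlap.
// On failure it returns an error instead of calling abort().
ToyStatus CheckLevelConsistency(int level, const std::vector<ToyFile>& files) {
  for (std::size_t i = 1; i < files.size(); i++) {
    const ToyFile& f1 = files[i - 1];
    const ToyFile& f2 = files[i];
    if (f1.smallest > f2.smallest) {
      return ToyStatus::Corruption("L" + std::to_string(level) +
                                   " files are not sorted properly");
    }
    if (f1.largest >= f2.smallest) {
      return ToyStatus::Corruption("L" + std::to_string(level) +
                                   " have overlapping ranges");
    }
  }
  return ToyStatus::OK();
}
```

A caller can test the returned ok flag and log or propagate the message, so one corrupted store does not have to bring down a process that serves several stores.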
6. The following rocksdb versions include this fix:
v6.11.4 v6.10.2 v6.10.1 v6.8.1 v6.7.3 v6.6.4 v6.6.3 v6.5.3 v6.5.2
7. Default rocksdb version in the L (Luminous) release.
The rocksdb version is printed in the log when an OSD restarts:
2020-08-10 16:08:22.350071 7f57988afd00 4 rocksdb: RocksDB version: 5.4.0
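To check a node quickly, the version can be pulled out of that log line and compared with the versions listed in step 6. The sketch below does only an exact membership test against that list. ExtractRocksdbVersion and IsListedFixedVersion are hypothetical helper names, and a version missing from the list is not necessarily unfixed.

```cpp
#include <cstddef>
#include <string>

// Extracts the version that follows "RocksDB version: " in a log line.
// Returns an empty string if the marker is not present.
std::string ExtractRocksdbVersion(const std::string& log_line) {
  const std::string marker = "RocksDB version: ";
  const std::size_t pos = log_line.find(marker);
  if (pos == std::string::npos) return "";
  const std::size_t start = pos + marker.size();
  const std::size_t end = log_line.find_first_not_of("0123456789.", start);
  if (end == std::string::npos) return log_line.substr(start);
  return log_line.substr(start, end - start);
}

// Exact match against the versions listed in step 6 of this note.
bool IsListedFixedVersion(const std::string& version) {
  static const char* const kListed[] = {"6.11.4", "6.10.2", "6.10.1",
                                        "6.8.1",  "6.7.3",  "6.6.4",
                                        "6.6.3",  "6.5.3",  "6.5.2"};
  for (const char* v : kListed) {
    if (version == v) return true;
  }
  return false;
}
```

Applied to the log line above, the extracted version is 5.4.0, which is not among the listed versions.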
IV. Summary
We recommend upgrading to a newer rocksdb version that includes the fix.
V. Workaround
1. SSH to node-3 and back up the mon database files on that node.
cd /var/lib/ark/ceph/ceph/mon/mon/
mv ceph-node-3 bak.ceph-node-3
2. SSH to the primary mon node and copy its mon files to node-3.
scp -rp /var/lib/ceph//mon/ceph-node-1 node-3:/var/lib/ceph//mon/ceph-node-3
In k8s deployments, the mon local directory may be /var/lib/ark/ceph/ceph/mon/mon/
3. Restart the mon service.