I. Problem description

A RocksDB database error prevents the ceph-mon process from starting.

II. Symptoms

The first call trace captured when the mon failed:

  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,237 2020-07-31 19:36:31.926040 7fdc0142d700 -1 /root/rpmbuild/BUILD/ceph-12.2.12-1/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair, std::basic_string >*, int*)' thread 7fdc0142d700 time 2020-07-31 19:36:31.895145
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,238 /root/rpmbuild/BUILD/ceph-12.2.12-1/src/mon/Monitor.cc: 5374: FAILED assert(err == 0)
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,239  ceph version 12.2.12-30-ged2e5c3 (ed2e5c3c26215c395ed024dabce34321e1f650b3) luminous (stable)
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,240  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5583990f8b10]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,241  2: (Monitor::_scrub(ScrubResult*, std::pair*, int*)+0xc11) [0x558398e7db61]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,242  3: (Monitor::handle_scrub(boost::intrusive_ptr)+0x22f) [0x558398e8adbf]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,243  4: (Monitor::dispatch_op(boost::intrusive_ptr)+0xc08) [0x558398ea6ab8]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,244  5: (Monitor::_ms_dispatch(Message*)+0x7eb) [0x558398ea7a4b]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,245  6: (Monitor::ms_dispatch(Message*)+0x23) [0x558398ed4323]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,246  7: (DispatchQueue::entry()+0x792) [0x5583993c2ef2]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,247  8: (DispatchQueue::DispatchThread::entry()+0xd) [0x5583991a120d]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,248  9: (()+0x7e25) [0x7fdc0a21ae25]
  Jul 31, 2020 @ 19:36:31.000 node-3  ceph  ceph-mon  228,249  10: (clone()+0x6d) [0x7fdc077b534d]

A second kind of call trace was also observed:
[Figure: screenshot of the second call trace; image not available]

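III. Analysis

1. Investigation shows that while ceph-mon was handling a scrub message, a read from RocksDB returned an error. The database is probably corrupted.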

bool Monitor::_scrub(ScrubResult *r,
                     pair<string,string> *start,
                     int *num_keys)
{
  ...
  MonitorDBStore::Synchronizer it = store->get_synchronizer(*start, prefixes);
  while (it->has_next_chunk()) {
    pair<string,string> k = it->get_next_key();
    bufferlist bl;
    int err = store->get(k.first, k.second, bl);  // the RocksDB read fails here; the database may be corrupted
    assert(err == 0);
    ...
  }
  ...
}
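
To make the failure mode concrete, here is a minimal sketch of the same loop shape with the assert replaced by an error return. FakeStore, Key and scrub_keys are hypothetical names invented for this illustration; none of this is Ceph code, and the "corrupted" set only simulates a read that fails the way a damaged SST file might.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the mon store; NOT Ceph's MonitorDBStore.
using Key = std::pair<std::string, std::string>;  // (prefix, key)

struct FakeStore {
  std::map<Key, std::string> data;
  std::set<Key> corrupted;  // keys whose read fails, simulating damaged data

  // Same shape as store->get(prefix, key, bl): 0 on success, negative errno otherwise.
  int get(const std::string& prefix, const std::string& key,
          std::string* out) const {
    Key k(prefix, key);
    if (corrupted.count(k)) return -EIO;
    auto it = data.find(k);
    if (it == data.end()) return -ENOENT;
    *out = it->second;
    return 0;
  }
};

// Reads every key, as the scrub loop does. Where _scrub has assert(err == 0),
// this sketch records the failing key and returns the first error code.
int scrub_keys(const FakeStore& store, std::vector<Key>* failed) {
  assert(failed != nullptr);
  int first_err = 0;
  for (const auto& kv : store.data) {
    std::string value;
    int err = store.get(kv.first.first, kv.first.second, &value);
    if (err != 0) {
      failed->push_back(kv.first);
      if (first_err == 0) first_err = err;
    }
  }
  return first_err;
}
```

With the assert in place, a single unreadable key takes the whole daemon down; with an error return, the caller can at least log which key failed before deciding what to do.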

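2. The same problem had already been reported as a bug in the rocksdb repository. I read through the comments on that issue and the fix made by the RocksDB community.

Issue: https://github.com/facebook/rocksdb/issues/5558

RocksDB fix: https://github.com/facebook/rocksdb/pull/5744

The description of the fix says: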

Open-source users recently reported two occurrences of LSM-tree corruption (#5558 is one), which would be caught by options.force_consistency_checks = true. options.force_consistency_checks has a usability limitation because it crashes the service once inconsistency is detected. This makes the feature hard to use. Most users serve from multiple RocksDB shards per server and the impacts of crashing the service is higher than it should be.

Instead, we just pass the error back to users without killing the service, and ask them to deal with the problem accordingly.

When user uses options.force_consistency_check in RocksDb, instead of crashing the process, we now pass the error back to the users without killing the process.
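3. With the force_consistency_checks option enabled, RocksDB calls CheckConsistency or CheckConsistencyForDeletes during the Apply operation to run consistency checks. In 5.4.0, force_consistency_checks defaults to false, so in a release build these checks are never actually run.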
void CheckConsistency(VersionStorageInfo* vstorage) {
#ifdef NDEBUG
    if (!vstorage->force_consistency_checks()) {
      // Dont run consistency checks in release mode except if
      // explicitly asked to
      return;
    }
#endif
...
}

void CheckConsistencyForDeletes(VersionEdit* edit, uint64_t number,
                                  int level) {
#ifdef NDEBUG
    if (!base_vstorage_->force_consistency_checks()) {
      // Dont run consistency checks in release mode except if
      // explicitly asked to
      return;
    }
#endif    
...
}
4. Setting force_consistency_checks to true in the configuration should enable these RocksDB checks:
mon_rocksdb_options = write_buffer_size=33554432,compression=kNoCompression,level_compaction_dynamic_level_bytes=true,force_consistency_checks=true
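
Before rolling such a setting out, it can help to confirm that the flag really is present in the option string. The sketch below splits a comma-separated key=value string and looks the flag up. parse_options and consistency_checks_enabled are hypothetical helpers written for this note; this is not how Ceph or RocksDB parses the string, and it assumes commas are the only separator.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: splits "k1=v1,k2=v2" into a map.
// Assumes ',' is the only separator; not Ceph's own parser.
std::map<std::string, std::string> parse_options(const std::string& opts) {
  std::map<std::string, std::string> out;
  std::istringstream in(opts);
  std::string item;
  while (std::getline(in, item, ',')) {
    std::string::size_type eq = item.find('=');
    if (eq == std::string::npos) continue;  // skip entries without '='
    out[item.substr(0, eq)] = item.substr(eq + 1);
  }
  return out;
}

// Returns true only when force_consistency_checks is present and set to "true".
bool consistency_checks_enabled(const std::string& opts) {
  std::map<std::string, std::string> parsed = parse_options(opts);
  auto it = parsed.find("force_consistency_checks");
  return it != parsed.end() && it->second == "true";
}
```

Passing the value of mon_rocksdb_options shown above to consistency_checks_enabled returns true; the same string without the last entry returns false.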

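5. With step 4 applied, RocksDB does run the checks. However, when 5.4.0 detects an inconsistency (for example SST files that are out of order, files that overlap, or a file to be deleted that cannot be found), it calls abort(), which kills the process outright. https://github.com/facebook/rocksdb/pull/5744 improves this: when such an inconsistency is detected, the error is reported back to the caller instead of killing the process. Part of the code change is shown below: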

//rocksdb 5.4.0

  void CheckConsistency(VersionStorageInfo* vstorage) {
...
    // make sure the files are sorted correctly
    for (int level = 0; level < vstorage->num_levels(); level++) {
      auto& level_files = vstorage->LevelFiles(level);
      for (size_t i = 1; i < level_files.size(); i++) {
        auto f1 = level_files[i - 1];
        auto f2 = level_files[i];
        if (level == 0) {
          if (!level_zero_cmp_(f1, f2)) {
            fprintf(stderr, "L0 files are not sorted properly");
            abort();  // the process exits immediately
          }

          if (f2->smallest_seqno == f2->largest_seqno) {
            // This is an external file that we ingested
            SequenceNumber external_file_seqno = f2->smallest_seqno;
            if (!(external_file_seqno < f1->largest_seqno ||
                  external_file_seqno == 0)) {
              fprintf(stderr, "L0 file with seqno %" PRIu64 " %" PRIu64
                              " vs. file with global_seqno %" PRIu64 "\n",
                      f1->smallest_seqno, f1->largest_seqno,
                      external_file_seqno);
              abort();  // the process exits immediately
            }
          } else if (f1->smallest_seqno <= f2->smallest_seqno) {
            fprintf(stderr, "L0 files seqno %" PRIu64 " %" PRIu64
                            " vs. %" PRIu64 " %" PRIu64 "\n",
                    f1->smallest_seqno, f1->largest_seqno, f2->smallest_seqno,
                    f2->largest_seqno);
            abort();  // the process exits immediately
          }
        } else {
          if (!level_nonzero_cmp_(f1, f2)) {
            fprintf(stderr, "L%d files are not sorted properly", level);
            abort();  // the process exits immediately
          }
...
  }

//rocksdb v6.10.2

  Status CheckConsistency(VersionStorageInfo* vstorage) {
...
        if (level == 0) {
          if (!level_zero_cmp_(f1, f2)) {
            fprintf(stderr, "L0 files are not sorted properly");
            return Status::Corruption("L0 files are not sorted properly");  // no exit; the error is returned
          }

          if (f2->fd.smallest_seqno == f2->fd.largest_seqno) {
            // This is an external file that we ingested
            SequenceNumber external_file_seqno = f2->fd.smallest_seqno;
            if (!(external_file_seqno < f1->fd.largest_seqno ||
                  external_file_seqno == 0)) {
              fprintf(stderr,
                      "L0 file with seqno %" PRIu64 " %" PRIu64
                      " vs. file with global_seqno %" PRIu64 "\n",
                      f1->fd.smallest_seqno, f1->fd.largest_seqno,
                      external_file_seqno);
              return Status::Corruption(
                  "L0 file with seqno " +
                  NumberToString(f1->fd.smallest_seqno) + " " +
                  NumberToString(f1->fd.largest_seqno) +
                  " vs. file with global_seqno" +
                  NumberToString(external_file_seqno) + " with fileNumber " +
                  NumberToString(f1->fd.GetNumber()));  // no exit; the error is returned
            }
          } else if (f1->fd.smallest_seqno <= f2->fd.smallest_seqno) {
            fprintf(stderr,
                    "L0 files seqno %" PRIu64 " %" PRIu64 " vs. %" PRIu64
                    " %" PRIu64 "\n",
                    f1->fd.smallest_seqno, f1->fd.largest_seqno,
                    f2->fd.smallest_seqno, f2->fd.largest_seqno);
            return Status::Corruption(
                "L0 files seqno " + NumberToString(f1->fd.smallest_seqno) +
                " " + NumberToString(f1->fd.largest_seqno) + " " +
                NumberToString(f1->fd.GetNumber()) + " vs. " +
                NumberToString(f2->fd.smallest_seqno) + " " +
                NumberToString(f2->fd.largest_seqno) + " " +
                NumberToString(f2->fd.GetNumber()));  // no exit; the error is returned
          }
        } else {
          if (!level_nonzero_cmp_(f1, f2)) {
            fprintf(stderr, "L%d files are not sorted properly", level);
            return Status::Corruption("L" + NumberToString(level) +
                                      " files are not sorted properly");
          }

          // Make sure there is no overlap in levels > 0
          if (vstorage->InternalComparator()->Compare(f1->largest,
                                                      f2->smallest) >= 0) {
            fprintf(stderr, "L%d have overlapping ranges %s vs. %s\n", level,
                    (f1->largest).DebugString(true).c_str(),
                    (f2->smallest).DebugString(true).c_str());
            return Status::Corruption(
                "L" + NumberToString(level) + " have overlapping ranges " +
                (f1->largest).DebugString(true) + " vs. " +
                (f2->smallest).DebugString(true));  // no exit; the error is returned
          }
        }
...
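
The difference between the two versions comes down to who decides what happens after a failed check. The following is a stripped-down sketch of the newer pattern only, under simplifying assumptions: FileMeta, Status, CheckLevelConsistency and ApplyEdit are toy definitions written for this note (they are not the RocksDB types), keys are compared as plain strings, and only the overlap check for a level above 0 is modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-ins for illustration; these are NOT the RocksDB types.
struct FileMeta {
  uint64_t number;
  std::string smallest;  // smallest key in the file
  std::string largest;   // largest key in the file
};

class Status {
 public:
  static Status OK() { return Status(true, ""); }
  static Status Corruption(const std::string& m) { return Status(false, m); }
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }

 private:
  Status(bool ok, const std::string& m) : ok_(ok), msg_(m) {}
  bool ok_;
  std::string msg_;
};

// Checks one level > 0: adjacent files must not overlap.
// On failure the error is returned rather than calling abort().
Status CheckLevelConsistency(int level, const std::vector<FileMeta>& files) {
  for (std::size_t i = 1; i < files.size(); i++) {
    const FileMeta& f1 = files[i - 1];
    const FileMeta& f2 = files[i];
    if (f1.largest >= f2.smallest) {
      return Status::Corruption("L" + std::to_string(level) +
                                " have overlapping ranges " + f1.largest +
                                " vs. " + f2.smallest);
    }
  }
  return Status::OK();
}

// The caller now chooses how to react: here it just reports the message.
bool ApplyEdit(int level, const std::vector<FileMeta>& files,
               std::string* error) {
  assert(error != nullptr);
  Status s = CheckLevelConsistency(level, files);
  if (!s.ok()) {
    *error = s.message();
    return false;
  }
  return true;
}
```

In the 5.4.0 style, the body of the `if` would call abort() and ApplyEdit would never get a chance to run its error path.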

6. The following releases include this fix:

 v6.11.4  v6.10.2 v6.10.1 v6.8.1 v6.7.3 v6.6.4 v6.6.3 v6.5.3 v6.5.2

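7. The RocksDB version currently bundled by default with the L (Luminous) release:

The RocksDB version is printed in the log when an OSD restarts: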
2020-08-10 16:08:22.350071 7f57988afd00  4 rocksdb: RocksDB version: 5.4.0

IV. Summary

Upgrading to a newer RocksDB release that contains this fix is recommended.

V. Workaround

1. SSH to node-3 and back up that node's mon database files:
cd /var/lib/ark/ceph/ceph/mon/mon/
mv ceph-node-3 bak.ceph-node-3

2. SSH to the primary mon node and copy its mon store to node-3:

scp -rp /var/lib/ceph//mon/ceph-node-1 node-3:/var/lib/ceph//mon/ceph-node-3

On a k8s deployment, the mon's local directory may instead be /var/lib/ark/ceph/ceph/mon/mon/.

3. Restart the mon service.