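Problem:
The Ceph cluster keeps reporting "XXX daemons have recently crashed", and the number keeps growing.
Solution:
This warning means that one or more Ceph daemons have crashed recently and the crash has not yet been archived (acknowledged) by an administrator. It may indicate a software bug, a hardware problem (for example, a failing disk), or some other issue.
All crashes recorded in the system can be listed with: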
# ceph crash ls
ID ENTITY NEW
2020-05-02_00:53:25.028694Z_b29d405c-2512-4b80-916f-46c45c2cd6a9 osd.94
2020-05-02_00:56:33.807897Z_feea566f-f237-42fd-aadf-45a5e8047896 osd.94
2020-05-02_05:41:03.542296Z_21a06b0b-f2bc-42d1-8d50-5c104e150c9e mon.node01
2020-05-02_09:52:51.146773Z_4e637ead-80df-42df-93f0-42c84ab8feb3 osd.19
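When the list grows long, it can help to see which daemons crash most often. The following is a minimal sketch, assuming the listing keeps the whitespace-separated ID / ENTITY layout shown above; the function name count_crashes_by_entity is made up for illustration and is not part of Ceph.

```shell
# Hypothetical helper (not a Ceph command): reads "ceph crash ls" text on
# stdin and prints "<entity> <crash count>", one entity per line.
count_crashes_by_entity() {
    # Skip the header row, then tally the second column (ENTITY).
    awk '$1 != "ID" && NF >= 2 { count[$2]++ }
         END { for (e in count) print e, count[e] }' | LC_ALL=C sort
}

# Usage on a live cluster:
#   ceph crash ls | count_crashes_by_entity
```

For the listing above this would report two crashes for osd.94 and one each for mon.node01 and osd.19, which points at osd.94 as the first daemon to investigate.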
New crashes (those not yet archived) can be listed with:
# ceph crash ls-new
Information about a specific crash can be examined with:
# ceph crash info <crash-id>
### Example ###
# ceph crash info 2020-05-02_05:41:03.542296Z_21a06b0b-f2bc-42d1-8d50-5c104e150c9e
{
"os_version_id": "18.04",
"utsname_release": "4.15.0-55-generic",
"os_name": "Ubuntu",
"entity_name": "mon.node01",
"timestamp": "2020-05-02 05:41:03.542296Z",
"process_name": "ceph-mon",
"utsname_machine": "x86_64",
"utsname_sysname": "Linux",
"os_version": "18.04.3 LTS (Bionic Beaver)",
"os_id": "ubuntu",
"utsname_version": "#60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019",
"backtrace": [
"(()+0x12890) [0x7f6c9f2f3890]",
"(gsignal()+0xc7) [0x7f6c9e3ebe97]",
"(abort()+0x141) [0x7f6c9e3ed801]",
"(()+0x8c957) [0x7f6c9ede0957]",
"(()+0x92ab6) [0x7f6c9ede6ab6]",
"(()+0x92af1) [0x7f6c9ede6af1]",
"(()+0x92d24) [0x7f6c9ede6d24]",
"(()+0x1424b) [0x7f6c9f51424b]",
"(tc_new()+0x283) [0x7f6c9f535943]",
"(rocksdb::Arena::AllocateNewBlock(unsigned long)+0x6c) [0x55c1aabe88ac]",
"(rocksdb::Arena::AllocateFallback(unsigned long, bool)+0x4b) [0x55c1aabe89db]",
"(rocksdb::Arena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)+0x110) [0x55c1aabe8b80]",
"(rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)+0xd4) [0x55c1aaaff004]",
"(()+0x5a3273) [0x55c1aab6b273]",
"(()+0x5a32f0) [0x55c1aab6b2f0]",
"(rocksdb::MemTable::Add(unsigned long, rocksdb::ValueType, rocksdb::Slice const&, rocksdb::Slice const&, bool, rocksdb::MemTablePostProcessInfo*)+0xfc) [0x55c1aaafa5bc]",
"(rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::ValueType)+0x1bd) [0x55c1aab609ed]",
"(rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, rocksdb::Slice const&)+0x26) [0x55c1aab615d6]",
"(rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0xa19) [0x55c1aab58de9]",
"(rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, bool)+0x14b) [0x55c1aab5cecb]",
"(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x13f6) [0x55c1aaa80f06]",
"(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x30) [0x55c1aaa82660]",
"(RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr)+0x88) [0x55c1aaa342f8]" ,
"(RocksDBStore::submit_transaction_sync(std::shared_ptr)+0x8c) [0x55c1aaa34c3c]" ,
"(MonitorDBStore::apply_transaction(std::shared_ptr)+0x76b) [0x55c1aa80a02b]" ,
"(Paxos::begin(ceph::buffer::v14_2_0::list&)+0x562) [0x55c1aa90bca2]",
"(Paxos::propose_pending()+0x127) [0x55c1aa90d5f7]",
"(Paxos::finish_round()+0x50a) [0x55c1aa90de1a]",
"(Paxos::commit_finish()+0x5fc) [0x55c1aa90fd6c]",
"(C_Committed::finish(int)+0x34) [0x55c1aa913d54]",
"(Context::complete(int)+0x9) [0x55c1aa84a359]",
"(MonitorDBStore::C_DoTransaction::finish(int)+0x94) [0x55c1aa913ac4]",
"(Context::complete(int)+0x9) [0x55c1aa84a359]",
"(Finisher::finisher_thread_entry()+0x17f) [0x7f6ca05227bf]",
"(()+0x76db) [0x7f6c9f2e86db]",
"(clone()+0x3f) [0x7f6c9e4ce88f]"
],
"utsname_hostname": "node01",
"crash_id": "2020-05-02_05:41:03.542296Z_21a06b0b-f2bc-42d1-8d50-5c104e150c9e",
"archived": "2020-05-06 14:13:12.975173",
"ceph_version": "14.2.6"
}
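The full report is long, and often only a few fields matter at first glance (which daemon, which Ceph version). The sketch below pulls one top-level string field out of the report. It assumes the pretty-printed, one-key-per-line layout shown above; crash_field is a made-up helper name, and a real JSON tool such as jq would be more robust if it is installed.

```shell
# Hypothetical helper (not a Ceph command): reads "ceph crash info" output
# on stdin and prints the value of the string field named in $1.
# Assumes one "key": "value" pair per line, as in the example above.
crash_field() {
    sed -n "s/^[[:space:]]*\"$1\":[[:space:]]*\"\(.*\)\",\{0,1\}[[:space:]]*\$/\1/p"
}

# Usage on a live cluster:
#   ceph crash info <crash-id> | crash_field entity_name
#   ceph crash info <crash-id> | crash_field ceph_version
```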
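The warning can be silenced by "archiving" the crash (ideally after an administrator has examined it), so that it no longer triggers this warning:
# ceph crash archive <crash-id>
Similarly, all new crashes can be archived at once with: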
# ceph crash archive-all
Archived crashes remain visible via ceph crash ls, but no longer appear in ceph crash ls-new.
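Between archiving a single crash and archiving everything, it may be useful to archive only the crashes of one daemon that has already been investigated. The following is a sketch under the assumption that ceph crash ls-new prints the same ID / ENTITY columns as ceph crash ls; new_crash_ids_for_entity is a made-up helper name.

```shell
# Hypothetical helper (not a Ceph command): reads a crash listing on stdin
# and prints the crash IDs that belong to the entity given in $1.
new_crash_ids_for_entity() {
    awk -v entity="$1" '$1 != "ID" && $2 == entity { print $1 }'
}

# Usage on a live cluster: archive only the new crashes of osd.94.
#   ceph crash ls-new | new_crash_ids_for_entity osd.94 |
#   while read -r id; do
#       ceph crash archive "$id"
#   done
```

Crashes from other daemons stay in ceph crash ls-new, so the health warning remains until they have been reviewed as well.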
The time period that counts as "recent" is controlled by the option mgr/crash/warn_recent_interval (default: two weeks).
These warnings can be disabled entirely with:
# ceph config set mgr mgr/crash/warn_recent_interval 0
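Instead of disabling the warning, the window can be shortened. The sketch below assumes the option value is expressed in seconds (which matches the two-week default being 1209600); days_to_seconds is a made-up helper name, so check the option's description on your Ceph version before relying on it.

```shell
# Hypothetical helper: converts a number of days to seconds.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

# Usage on a live cluster: only warn about crashes from the last 7 days.
#   ceph config set mgr mgr/crash/warn_recent_interval "$(days_to_seconds 7)"
```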
References:
https://docs.ceph.com/docs/master/rados/operations/health-checks/?highlight=backfillfull%20ratio
https://docs.ceph.com/docs/master/mgr/crash/?highlight=crash