When the data on an OSD is corrupted, the OSD may hit one of the following two assertions:
Case 1:
0> 2019-06-20 14:58:16.760629 7f557da58700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f557da58700 time 2019-06-20 14:58:16.758196
os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
1: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xa8d) [0x94fe6d]
2: (ECBackend::handle_sub_read(pg_shard_t, ECSubRead&, ECSubReadReply*)+0x1f5) [0x9e5aa5]
3: (ECBackend::handle_message(std::tr1::shared_ptr
4: (ReplicatedPG::do_request(std::tr1::shared_ptr
5: (OSD::dequeue_op(boost::intrusive_ptr
6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59f) [0x6698ef]
7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x795) [0xb9c0f5]
8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb9f980]
9: /lib64/libpthread.so.0() [0x3d00407a51]
10: (clone()+0x6d) [0x3d000e896d]
NOTE: a copy of the executable, or `objdump -rdS
Case 2:
2019-07-04 19:29:54.217741 7fe47dfb9700 -1 osd.34 529047 heartbeat_check: no reply from osd.153 ever on either front or back, first ping sent 2019-07-04 19:29:42.649390 (cutoff 2019-07-04 19:29:44.217737)
2019-07-04 19:29:54.604043 7fe449c63700 -1 osd.34 529047 heartbeat_check: no reply from osd.153 ever on either front or back, first ping sent 2019-07-04 19:29:42.649390 (cutoff 2019-07-04 19:29:44.604001)
osd/ECBackend.cc: In function 'ECUtil::HashInfoRef ECBackend::get_hash_info(const hobject_t&)' thread 7fe461889700 time 2019-07-04 19:30:22.387106
osd/ECBackend.cc: 1496: FAILED assert(hinfo.get_total_chunk_size() == (uint64_t)st.st_size)
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
1: (ECBackend::get_hash_info(hobject_t const&)+0xa67) [0x9d5027]
2: (ECBackend::submit_transaction(hobject_t const&, eversion_t const&, PGBackend::PGTransaction*, eversion_t const&, eversion_t const&, std::vector
3: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*)+0x916) [0x8296e6]
4: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1a9d) [0x88151d]
5: (ReplicatedPG::do_op(std::tr1::shared_ptr
6: (ReplicatedPG::do_request(std::tr1::shared_ptr
7: (OSD::dequeue_op(boost::intrusive_ptr
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59f) [0x6698ef]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x795) [0xb9c0f5]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb9f980]
11: /lib64/libpthread.so.0() [0x3fc8c07a51]
12: (clone()+0x6d) [0x3fc88e896d]
Both cases are caused by corrupted data inside a PG. Use the procedure below to find the ID of the problematic PG, then remove that PG's data.
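Both backtraces above record the failing source file and line number on the "FAILED assert" line, which is what you look up in the Ceph source. As a minimal sketch (the function name find_failed_asserts is hypothetical, and the pattern assumes the line format seen in the two logs above), those locations could be pulled out of an OSD log like this:

```python
import re

# Matches lines such as:
#   os/FileStore.cc: 2854: FAILED assert(...)
# Assumption: the format is the one shown in the two backtraces above.
ASSERT_RE = re.compile(
    r"^\s*(?P<file>\S+):\s*(?P<line>\d+):\s*FAILED assert\((?P<expr>.*)\)\s*$"
)


def find_failed_asserts(log_text):
    """Return (source_file, line_number, expression) for each FAILED assert line."""
    results = []
    for line in log_text.splitlines():
        m = ASSERT_RE.match(line)
        if m:
            results.append((m.group("file"), int(m.group("line")), m.group("expr")))
    return results
```

The returned file and line number point at the assertion in the source tree of the matching Ceph version (0.94.10 in the logs above).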
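Quick procedure for handling the OSD crash:
1. From the OSD log, go to the assertion in the source code of the matching Ceph version and check the log level of the statement that prints the PG ID.
2. Edit ceph.conf and set the OSD log level in the [global] section.
3. The setting has the form: debug_osd = xx/xx
4. Restart the OSD.
5. When the OSD hits the assertion again, read the PG ID from the log, find that PG under /var/lib/ceph/osd/ceph-xx/current, and add a prefix to its name so that the OSD does not load the problematic PG at startup.
6. Restart the OSD again and watch whether it still goes down.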
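Step 5 can be sketched as a small helper. This is only an illustration under stated assumptions: the function name sideline_pg_dirs and the prefix "bad_" are hypothetical, and it assumes the PG's directories under current are named either "<pgid>" or "<pgid>_<suffix>". It defaults to a dry run so that nothing is renamed until the planned list has been checked, and it should only be run while the OSD is stopped.

```python
import os


def sideline_pg_dirs(current_dir, pgid, prefix="bad_", dry_run=True):
    """Prefix every directory in current_dir that belongs to pgid.

    Assumption: PG directories are named "<pgid>" or "<pgid>_<suffix>".
    Returns a list of (old_path, new_path) pairs; renames only if dry_run is False.
    """
    planned = []
    # Snapshot the listing first so renames do not affect the iteration.
    for name in sorted(os.listdir(current_dir)):
        old_path = os.path.join(current_dir, name)
        if not os.path.isdir(old_path):
            continue
        if name == pgid or name.startswith(pgid + "_"):
            new_path = os.path.join(current_dir, prefix + name)
            planned.append((old_path, new_path))
            if not dry_run:
                os.rename(old_path, new_path)
    return planned
```

Calling it first with dry_run=True and reviewing the returned pairs, then again with dry_run=False, keeps the rename reversible: removing the prefix restores the original names.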
Example:
In the read function, the log statement that prints the PG ID is at level 15.
Edit ceph.conf and add debug_osd = 15/15.
In the get_hash_info function, the PG ID is printed at log level 10.
Add: debug_osd = 10/15
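The ceph.conf change can also be scripted. The following is a minimal sketch, not a full ceph.conf parser: the helper name set_global_option is hypothetical, and it works on the file's text only, setting or replacing one key inside the [global] section and leaving every other line untouched.

```python
def set_global_option(conf_text, key, value):
    """Return conf_text with `key = value` set inside the [global] section."""
    out = []
    in_global = False
    done = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [global] without having seen the key: add it here.
            if in_global and not done:
                out.append(f"{key} = {value}")
                done = True
            in_global = stripped == "[global]"
        elif (in_global and not done and "=" in stripped
              and not stripped.startswith(("#", ";"))):
            name = stripped.split("=", 1)[0].strip()
            # Ceph accepts "debug osd" and "debug_osd" as the same option name.
            if name.replace(" ", "_") == key:
                out.append(f"{key} = {value}")
                done = True
                continue
        out.append(line)
    if not done:
        # Either [global] is the last section or it does not exist yet.
        if not in_global:
            out.append("[global]")
        out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"
```

For the two cases above, this would be called as set_global_option(text, "debug_osd", "15/15") or set_global_option(text, "debug_osd", "10/15"), with the result written back to ceph.conf before the OSD is restarted.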