Learning Ceph: OSD down, resolving assertion failures when the OSD loads PGs at startup

When data on an OSD is corrupted, one of the following two assertions may fire:

Case 1:

     0> 2019-06-20 14:58:16.760629 7f557da58700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f557da58700 time 2019-06-20 14:58:16.758196
os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xa8d) [0x94fe6d]
 2: (ECBackend::handle_sub_read(pg_shard_t, ECSubRead&, ECSubReadReply*)+0x1f5) [0x9e5aa5]
 3: (ECBackend::handle_message(std::tr1::shared_ptr)+0x45d) [0x9e861d]
 4: (ReplicatedPG::do_request(std::tr1::shared_ptr&, ThreadPool::TPHandle&)+0x186) [0x815ea6]
 5: (OSD::dequeue_op(boost::intrusive_ptr, std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x178) [0x664d58]
 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59f) [0x6698ef]
 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x795) [0xb9c0f5]
 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb9f980]
 9: /lib64/libpthread.so.0() [0x3d00407a51]
 10: (clone()+0x6d) [0x3d000e896d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Case 2:

2019-07-04 19:29:54.217741 7fe47dfb9700 -1 osd.34 529047 heartbeat_check: no reply from osd.153 ever on either front or back, first ping sent 2019-07-04 19:29:42.649390 (cutoff 2019-07-04 19:29:44.217737)
2019-07-04 19:29:54.604043 7fe449c63700 -1 osd.34 529047 heartbeat_check: no reply from osd.153 ever on either front or back, first ping sent 2019-07-04 19:29:42.649390 (cutoff 2019-07-04 19:29:44.604001)
osd/ECBackend.cc: In function 'ECUtil::HashInfoRef ECBackend::get_hash_info(const hobject_t&)' thread 7fe461889700 time 2019-07-04 19:30:22.387106
osd/ECBackend.cc: 1496: FAILED assert(hinfo.get_total_chunk_size() == (uint64_t)st.st_size)
 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: (ECBackend::get_hash_info(hobject_t const&)+0xa67) [0x9d5027]
 2: (ECBackend::submit_transaction(hobject_t const&, eversion_t const&, PGBackend::PGTransaction*, eversion_t const&, eversion_t const&, std::vector > const&, boost::optional&, Context*, Context*, Context*, unsigned long, osd_reqid_t, std::tr1::shared_ptr)+0x6cc) [0x9e3d9c]
 3: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*)+0x916) [0x8296e6]
 4: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1a9d) [0x88151d]
 5: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0x3487) [0x885457]
 6: (ReplicatedPG::do_request(std::tr1::shared_ptr&, ThreadPool::TPHandle&)+0x4e3) [0x816203]
 7: (OSD::dequeue_op(boost::intrusive_ptr, std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x178) [0x664d58]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59f) [0x6698ef]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x795) [0xb9c0f5]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb9f980]
 11: /lib64/libpthread.so.0() [0x3fc8c07a51]
 12: (clone()+0x6d) [0x3fc88e896d]

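Both cases are caused by corrupted data inside a PG. The procedure below identifies the ID of the problematic PG so that the data of that PG can be removed.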

Quick procedure for handling an OSD crash:

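1. Starting from the OSD log, open the source code of the matching Ceph version at the assertion and check the debug level of the log statements there.

2. Edit ceph.conf and set the OSD log level in the global section:

3. debug_osd = xx/xx

4. Restart the OSD.

5. When the OSD hits the assertion again, find the corresponding PG ID under /var/lib/ceph/osd/ceph-xx/current and add a prefix to the PG ID, so that the OSD does not load the problematic PG at startup.

6. Restart the OSD again and check whether it still goes down.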
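Step 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes the FileStore layout in which the directories of a PG are named `<pgid>_head` and `<pgid>_TEMP`, and the function name `quarantine_pg` and the default prefix `bad_` are hypothetical choices, not part of Ceph.

```python
import os


def quarantine_pg(current_dir, pgid, prefix="bad_"):
    """Add `prefix` to every directory in `current_dir` that belongs to `pgid`.

    Assumption: PG directories are named "<pgid>_head" and "<pgid>_TEMP".
    Returns the list of new paths.
    """
    renamed = []
    for name in sorted(os.listdir(current_dir)):
        # Require "<pgid>_" so that pgid "1.2" does not match "1.2f_head".
        if not name.startswith(pgid + "_"):
            continue
        src = os.path.join(current_dir, name)
        if not os.path.isdir(src):
            continue
        dst = os.path.join(current_dir, prefix + name)
        if os.path.exists(dst):
            # Refuse to overwrite an earlier quarantine of the same PG.
            raise FileExistsError(dst)
        os.rename(src, dst)
        renamed.append(dst)
    return renamed
```

A call might look like `quarantine_pg("/var/lib/ceph/osd/ceph-xx/current", "1.2f")`, where `1.2f` is a made-up PG ID. Renaming rather than deleting keeps the data available in case it has to be inspected or restored later, and it should be done only while the OSD is stopped.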

Example:

[Image 1 of the original post]

Here, the statement in the read function that prints the PG ID is logged at debug level 15.

Edit ceph.conf and add debug_osd = 15/15.
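If the change has to be applied on several hosts, it can be scripted. The following is a minimal sketch, assuming ceph.conf is a plain INI-style file with a `[global]` section; the helper name `set_global_option` is hypothetical. It treats `debug osd` and `debug_osd` as the same key, since Ceph accepts both spellings.

```python
def _norm(name):
    # Ceph accepts spaces and underscores interchangeably in option names.
    return name.strip().replace(" ", "_").replace("-", "_")


def set_global_option(conf_text, key, value):
    """Return conf_text with `key = value` set inside the [global] section."""
    out = []
    in_global = False
    done = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [global] without having seen the key: add it first.
            if in_global and not done:
                out.append(f"{key} = {value}")
                done = True
            in_global = stripped == "[global]"
        elif in_global and not done and "=" in stripped:
            name = stripped.split("=", 1)[0]
            if _norm(name) == _norm(key):
                # Replace the existing setting in place.
                line = f"{key} = {value}"
                done = True
        out.append(line)
    if not done:
        if not in_global:
            out.append("[global]")  # file had no [global] section at all
        out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"
```

The function works on text only and leaves reading and writing the file to the caller, which makes it easy to review the result before replacing the real ceph.conf.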

[Image 2 of the original post]

In the get_hash_info function, debug level 10 is enough to print the PG ID.

Add: debug_osd = 10/15.
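Once the debug level is raised, the PG ID should appear in what the crashing thread logged shortly before the assertion. The sketch below collects those lines from an OSD log. It relies only on the two patterns visible in the dumps above: the `In function ... thread <id> time ...` line and the `FAILED assert` line that follows it. The function name `context_before_assert` is hypothetical, and the exact content of the returned lines depends on the Ceph version.

```python
import re

# Matches the thread ID in "In function '...' thread 7fe461889700 time ...".
THREAD_RE = re.compile(r"thread ([0-9a-f]+) time")


def context_before_assert(log_lines, max_lines=20):
    """Return up to max_lines lines logged by the asserting thread
    before the first FAILED assert found in log_lines."""
    for i, line in enumerate(log_lines):
        match = THREAD_RE.search(line)
        if match is None or "In function" not in line:
            continue
        # The assert message is expected on the line right after the header.
        if i + 1 >= len(log_lines) or "FAILED assert" not in log_lines[i + 1]:
            continue
        thread = match.group(1)
        # Keep only earlier lines whose fields include the same thread ID.
        earlier = [l for l in log_lines[:i] if thread in l.split()]
        return earlier[-max_lines:] if max_lines > 0 else []
    return []
```

Typical use is to read the OSD log into a list of lines, pass it to the function, and look through the result for the PG ID printed by the read or get_hash_info log statement.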
