This page was originally a brain dump of my investigation into the Ceph memcpy size distributions. I've kept the content mostly as is (after this introduction section).
The take-away message is essentially:
This page should hopefully serve as a warning for those who rely too much on probes and fail to measure actual CPU impact.
First, find out where memcpy resides in the C library:
$ nm -D /lib64/libc.so.6 | grep memcpy
0000000000084fc0 T memcpy
00000000000ebaf0 T __memcpy_chk
000000000008c70c W wmemcpy
00000000000ed41c T __wmemcpy_chk
Now place a uprobe:
# echo "p:memcpy /lib64/libc.so.6:0x84fc0 len=%x2" > /sys/kernel/debug/tracing/uprobe_events
The uprobe is recorded via:
perf record -e uprobes:memcpy -a -g sleep 20
This records system-wide for twenty seconds. The per-process size distributions can then be analysed with the following perf script (fed to perf via perf script -s &lt;script&gt;.py).
from __future__ import division
import os
import sys

import collections
import string

sys.path.append(os.environ['PERF_EXEC_PATH'] + \
    '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *
from Core import *


sizes = dict()

sizeones = dict()

def trace_begin():
    print "in trace_begin"

def trace_end():
    print "in trace_end\n\n"

    for key in sizes.keys():
        print "\n\nTop 10 memcpy sizes of %s:" % (key)
        total = len(sizes[key])

        for pair in collections.Counter(sizes[key]).most_common(10):
            print "%s\t%d\t(%f%%)" % (pair[0], pair[1], 100.0*pair[1]/total)

    for key in sizeones.keys():
        print "\n\nTop 10 memcpy size 1 callers of %s:" % (key)

        for pair in collections.Counter(sizeones[key]).most_common(10):
            print "%d\t%s" % (pair[1], pair[0])

def uprobes__memcpy(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, __probe_ip, len):

    if (common_comm not in sizes):
        sizes[common_comm] = list()

    sizes[common_comm].append(len)

    if (len == 1):
        node = common_callchain[1]
        if ('sym' in node):
            caller = format("[%x] %s %s" % (node['ip'], node['dso'], node['sym']['name']))
        else:
            caller = format("[%x]" % (node['ip']))

        if (common_comm not in sizeones):
            sizeones[common_comm] = list()

        sizeones[common_comm].append(caller)

def trace_unhandled(event_name, context, event_fields_dict):
    print ' '.join(['%s=%s' % (k, str(v)) for k, v in sorted(event_fields_dict.items())])
The following distributions were obtained via the uprobe on memcpy:
Top 10 memcpy sizes of ceph-mds:
1	3349	(30.542636%)
4	1411	(12.868217%)
8	1007	(9.183767%)
2	663	(6.046512%)
46	350	(3.191974%)
6	313	(2.854537%)
5	307	(2.799818%)
85	280	(2.553580%)
3	270	(2.462380%)
80	231	(2.106703%)

Top 10 memcpy sizes of sshd:
4	80	(30.769231%)
16	40	(15.384615%)
100	40	(15.384615%)
76	38	(14.615385%)
6	20	(7.692308%)
60	20	(7.692308%)
40	19	(7.307692%)
231	2	(0.769231%)
8	1	(0.384615%)

Top 10 memcpy sizes of perf:
16	1	(20.000000%)
4	1	(20.000000%)
30	1	(20.000000%)
6	1	(20.000000%)
5	1	(20.000000%)

Top 10 memcpy sizes of chronyd:
24	5	(50.000000%)
128	2	(20.000000%)
72	2	(20.000000%)
192	1	(10.000000%)

Top 10 memcpy sizes of systemd:
10	1	(25.000000%)
39	1	(25.000000%)
54	1	(25.000000%)
7	1	(25.000000%)

Top 10 memcpy sizes of lt-rados:
1	6138	(35.294118%)
4	3350	(19.262837%)
8	1549	(8.906906%)
2	1025	(5.893853%)
16	890	(5.117590%)
3	556	(3.197056%)
7	442	(2.541544%)
13	437	(2.512794%)
6	329	(1.891783%)
5	314	(1.805532%)

Top 10 memcpy sizes of ceph-mon:
1	56354	(28.211418%)
4	34085	(17.063317%)
8	20027	(10.025731%)
3	9962	(4.987084%)
2	9248	(4.629648%)
6	7521	(3.765093%)
5	6256	(3.131821%)
14	4419	(2.212199%)
80	4063	(2.033981%)
9	3999	(2.001942%)

Top 10 memcpy sizes of ceph-osd:
1	354007	(39.020946%)
4	115021	(12.678360%)
2	109050	(12.020198%)
8	51911	(5.721967%)
3	40176	(4.428459%)
6	36285	(3.999568%)
80	25808	(2.844725%)
5	19347	(2.132552%)
7	18729	(2.064432%)
11	15813	(1.743011%)

Top 10 memcpy sizes of sleep:
12	14	(73.684211%)
10	2	(10.526316%)
1	1	(5.263158%)
4	1	(5.263158%)
6	1	(5.263158%)

Top 10 memcpy size 1 callers of lt-rados:
2716	[98db8]	/usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert
1666	[1df68]	/usr/lib64/libsoftokn3.so sftk_AddAttributeType
522	[8ce30]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_int
245	[48984]	/home/steven/ceph/src/.libs/librados.so.2.0.0 ceph::buffer::list::append
198	[8c868]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_int
125	[266944]	/home/steven/ceph/src/.libs/librados.so.2.0.0 Pipe::tcp_read_nonblocking
122	[8d554]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_int
77	[1c7298]	/home/steven/ceph/src/.libs/librados.so.2.0.0 request_redirect_t::decode
77	[1c72a8]	/home/steven/ceph/src/.libs/librados.so.2.0.0 request_redirect_t::decode
77	[229b98]	/home/steven/ceph/src/.libs/librados.so.2.0.0 pg_t::decode

Top 10 memcpy size 1 callers of ceph-osd:
231102	[98db8]	/usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert
39477	[8c868]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_int
35877	[8ce30]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_int
19712	[6ccbf8]	/home/steven/ceph/src/ceph-osd ceph::buffer::list::append
6734	[1df68]	/usr/lib64/libsoftokn3.so sftk_AddAttributeType
4496	[8ccb0]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::do_put
1496	[8d554]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_int
883	[68e628]	/home/steven/ceph/src/ceph-osd ghobject_t::decode
883	[68e770]	/home/steven/ceph/src/ceph-osd ghobject_t::decode
883	[68e6c8]	/home/steven/ceph/src/ceph-osd ghobject_t::decode

Top 10 memcpy size 1 callers of sleep:
1	[2a698]	/usr/lib64/libc-2.20.so _nl_find_locale

Top 10 memcpy size 1 callers of ceph-mds:
1403	[98db8]	/usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert
352	[8c868]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_int
351	[8ce30]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_int
241	[557f78]	/home/steven/ceph/src/ceph-mds ceph::buffer::list::append
214	[8dde0]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::do_put
211	[8dc1c]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_float
196	[1df68]	/usr/lib64/libsoftokn3.so sftk_AddAttributeType
77	[8ccb0]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::do_put
70	[43989c]	/home/steven/ceph/src/ceph-mds DecayCounter::decode
70	[4397fc]	/home/steven/ceph/src/ceph-mds DecayCounter::decode

Top 10 memcpy size 1 callers of ceph-mon:
29454	[98db8]	/usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert
9161	[8ce30]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_int
5614	[1df68]	/usr/lib64/libsoftokn3.so sftk_AddAttributeType
5226	[8c868]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::_M_insert_int
2473	[3cc1d0]	/home/steven/ceph/src/ceph-mon ceph::buffer::list::append
463	[455ef4]	/home/steven/ceph/src/ceph-mon Pipe::tcp_read_nonblocking
387	[1948e8]	/home/steven/ceph/src/ceph-mon MonitorDBStore::Transaction::decode
387	[194944]	/home/steven/ceph/src/ceph-mon MonitorDBStore::Transaction::decode
387	[1948f8]	/home/steven/ceph/src/ceph-mon MonitorDBStore::Transaction::decode
385	[8ccb0]	/usr/lib64/libstdc++.so.6.0.20 std::num_put::do_put
$ addr2line -f -e /usr/lib64/libstdc++.so.6.0.20 0x98db8
__ostream_write
/usr/src/debug/gcc-4.9.2-20141101/obj-aarch64-redhat-linux/aarch64-redhat-linux/libstdc++-v3/include/bits/ostream_insert.h:51
This leads to:
  template<typename _CharT, typename _Traits>
    inline void
    __ostream_write(basic_ostream<_CharT, _Traits>& __out,
                    const _CharT* __s, streamsize __n)
    {
      typedef basic_ostream<_CharT, _Traits>       __ostream_type;
      typedef typename __ostream_type::ios_base    __ios_base;

      const streamsize __put = __out.rdbuf()->sputn(__s, __n);
      if (__put != __n)                              // <=== line 51
        __out.setstate(__ios_base::badbit);
    }
So we need to investigate sputn().
This leads to:
      streamsize
      sputn(const char_type* __s, streamsize __n)
      { return this->xsputn(__s, __n); }
Let's investigate __streambuf_type::xsputn. This leads to:
  template<typename _CharT, typename _Traits>
    streamsize
    basic_streambuf<_CharT, _Traits>::
    xsputn(const char_type* __s, streamsize __n)
    {
      streamsize __ret = 0;
      while (__ret < __n)
        {
          const streamsize __buf_len = this->epptr() - this->pptr();
          if (__buf_len)
            {
              const streamsize __remaining = __n - __ret;
              const streamsize __len = std::min(__buf_len, __remaining);
              traits_type::copy(this->pptr(), __s, __len);
              __ret += __len;
              __s += __len;
              this->__safe_pbump(__len);
            }

          if (__ret < __n)
            {
              int_type __c = this->overflow(traits_type::to_int_type(*__s));
              if (!traits_type::eq_int_type(__c, traits_type::eof()))
                {
                  ++__ret;
                  ++__s;
                }
              else
                break;
            }
        }
      return __ret;
    }
Let's investigate traits_type::copy.
      static char_type*
      copy(char_type* __s1, const char_type* __s2, size_t __n)
      { return static_cast<char_type*>(__builtin_memcpy(__s1, __s2, __n)); }
So a single-character stream insert bottoms out in __builtin_memcpy with a length of 1. Here is the output:
Top ten ceph callers leading to memcpy size 1
19712	[6ccbf8]	ceph::buffer::list::append
13346	[334f04]	operator<<
9571	[25e724]	operator<<
7426	[308410]	operator<<
7426	[3083fc]	operator<<
6734	[5a85f4]	nss_aes_operation
6657	[68e2b0]	operator<<
6647	[68e33c]	operator<<
6635	[68e2d0]	operator<<
6577	[68e368]	operator<<
Investigating the first sample:
$ addr2line -Cf -e /home/steven/ceph/src/ceph-osd 0x6ccbf4
Objecter::_session_linger_op_remove(Objecter::OSDSession*, Objecter::LingerOp*)
/home/steven/ceph/src/osdc/Objecter.cc:2570
Which gives the following function:
void Objecter::_session_linger_op_remove(OSDSession *from, LingerOp *op)
{
  assert(from == op->session);
  assert(from->lock.is_locked());

  if (from->is_homeless()) {
    num_homeless_ops.dec();
  }

  from->linger_ops.erase(op->linger_id);
  put_session(from);
  op->session = NULL;

  ldout(cct, 15) << __func__ << " " << from->osd << " " << op->linger_id << dendl;
}
Looks like this is due to debugging messages!
The IPs need to have 0x400000 added to them for ceph-osd (perf bug?). Let's try again:
$ addr2line -Cf -e ./src/ceph-osd 0xaccbf8
ceph::buffer::ptr::end() const
/home/steven/ceph/src/./include/buffer.h:207

$ addr2line -Cf -e ./src/ceph-osd 0xaccbf4
ceph::buffer::list::append(char const*, unsigned int)
/home/steven/ceph/src/common/buffer.cc:1310
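Resolving each caller address by hand gets tedious once this fix-up is involved, so the addr2line step can be scripted. A minimal sketch, assuming the same binary path and the 0x400000 fix-up that ceph-osd needed above:

    import subprocess

    BINARY = "/home/steven/ceph/src/ceph-osd"
    FIXUP = 0x400000  # offset that had to be added to the probe IPs for ceph-osd

    def resolve(ip):
        # addr2line -Cf prints the demangled function name, then file:line.
        out = subprocess.check_output(
            ["addr2line", "-Cf", "-e", BINARY, "0x%x" % (ip + FIXUP)])
        func, location = out.splitlines()[:2]
        return "%s  %s" % (func, location)

    # e.g. the top ceph-osd size 1 caller from the table above:
    print resolve(0x6ccbf8)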
The buffer.cc:1310 hit leads to:
void buffer::list::append(const char *data, unsigned len)
{
  while (len > 0) {
    // put what we can into the existing append_buffer.
    unsigned gap = append_buffer.unused_tail_length();
    if (gap > 0) {
      if (gap > len) gap = len;
      //cout << "append first char is " << data[0] << ", last char is " << data[len-1] << std::endl;
      append_buffer.append(data, gap);                           // <--- this is line 1310
      append(append_buffer, append_buffer.end() - gap, gap);     // add segment to the list
      len -= gap;
      data += gap;
    }
    if (len == 0)
      break;  // done!

    // make a new append_buffer!
    unsigned alen = CEPH_PAGE_SIZE * (((len-1) / CEPH_PAGE_SIZE) + 1);
    append_buffer = create_page_aligned(alen);
    append_buffer.set_length(0);  // unused, so far.
  }
}
I think the problem could be attributed to:
void buffer::ptr::append(const char *p, unsigned l)
{
  assert(_raw);
  assert(l <= unused_tail_length());
  memcpy(c_str() + _len, p, l);
  _len += l;
}
A flame graph of size 1 memcpy call stacks can be found here: ceph-size1.svg
And the inverted graph here: ceph-size1-rev.svg
The stack data was generated via the following Python script; its output is one line per unique stack in the folded "frame;frame;... count" form that flamegraph.pl consumes.
import os
import sys

sys.path.append(os.environ['PERF_EXEC_PATH'] + \
    '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *
from Core import *


def trace_begin():
    pass

def trace_end():
    for callframe in callframes.keys():
        print "%s %d" % (callframe, callframes[callframe])

callframes = dict()

def uprobes__memcpy(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, __probe_ip, len):

    if (len != 1):
        return

    callframe = common_comm

    for node in reversed(common_callchain):
        callframe += ';'

        if 'sym' in node:
            callframe += node['sym']['name']
        else:
            callframe += '[unknown]'

    if (callframe not in callframes.keys()):
        callframes[callframe] = 1
    else:
        callframes[callframe] += 1

def trace_unhandled(event_name, context, event_fields_dict):
    print ' '.join(['%s=%s' % (k, str(v)) for k, v in sorted(event_fields_dict.items())])
Analysing via a uprobe has one significant disadvantage that I overlooked: it gives us the proportion of all memcpy calls that are size 1. It does not take into account which memcpy calls are expensive and which are cheap, i.e. which memcpy calls are actually taking up the most time.
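One cheap way to get a little closer (still call-count based, not time based) is to weight each probe hit by its length and rank by bytes copied rather than by number of calls. A minimal sketch in the style of the scripts above, not something that was used for the numbers on this page:

    import os
    import sys

    import collections

    sys.path.append(os.environ['PERF_EXEC_PATH'] + \
        '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

    from perf_trace_context import *
    from Core import *

    # total bytes passed to memcpy, per process
    bytes_copied = collections.Counter()

    def trace_begin():
        pass

    def trace_end():
        for comm, total in bytes_copied.most_common(10):
            print "%s\t%d bytes" % (comm, total)

    def uprobes__memcpy(event_name, context, common_cpu,
            common_secs, common_nsecs, common_pid, common_comm,
            common_callchain, __probe_ip, len):
        bytes_copied[common_comm] += len

    def trace_unhandled(event_name, context, event_fields_dict):
        pass

Even bytes copied is only a proxy for cost, though; a cycles profile is what actually answers the question.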
If we look at the call graph for the hot memcpy calls we get:
10.55%  ceph-osd  /usr/lib64/libc-2.20.so  0x8503c  B  [.] memcpy
                  |
                  ---memcpy
                     |
                     |--98.83%-- ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)
                     |           FileJournal::align_bl(long, ceph::buffer::list&)
                     |           FileJournal::do_write(ceph::buffer::list&)
                     |           FileJournal::write_thread_entry()
                     |           FileJournal::Writer::entry()
                     |           Thread::entry_wrapper()
                     |           start_thread
                     |           thread_start
i.e. the hot memcpy calls come from buffer management rather than single-character log appends.