Profiling Ceph memcpy Usage


 

Contents

  1. Profiling Ceph memcpy Usage
    1. Introduction
    2. Placing the uprobe on memcpy
    3. Simple Python Perf script to analyse memcpy sizes
    4. Distribution of memcpy sizes
    5. Steps to analyse `std::__ostream_insert`
    6. Locating ceph-osd callers that lead to size 1 memcpy
    7. Being tricked by perf
    8. Flame graph of call stack for size 1 memcpy's
    9. Actual hot memcpy's

 

Introduction

This page was originally a brain dump of my investigation into the Ceph memcpy size distributions. I've kept the content mostly as is (after this introduction section). 

The takeaway messages from this page are essentially: 

  • Yes, Ceph has a lot of memcpy's of size 1, ... but 
  • these only really show up via a probe point rather than via natural PC sampling, ... meaning 
  • they are probably not worth worrying about :-(. 

This page should hopefully serve as a warning for those who rely too much on probes and fail to measure actual CPU impact. 

 

Placing the uprobe on memcpy

First, find out where memcpy resides in the C library: 

$ nm -D /lib64/libc.so.6 | grep memcpy
0000000000084fc0 T memcpy
00000000000ebaf0 T __memcpy_chk
000000000008c70c W wmemcpy
00000000000ed41c T __wmemcpy_chk

Now place a uprobe: 

# echo "p:memcpy /lib64/libc.so.6:0x84fc0 len=%x2" > /sys/kernel/debug/tracing/uprobe_events
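
Here 0x84fc0 is the memcpy offset found with nm above, and len=%x2 records the third argument register (x2 on this aarch64 box, which holds the copy length) as a probe field named len.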

The uprobe is recorded via: 

perf record -e uprobes:memcpy -a -g sleep 20

This records system-wide for twenty seconds. The results can then be analysed per process with a perf script. 
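
Assuming the script in the next section is saved as, say, memcpy-sizes.py (the filename is arbitrary), the report is produced with: 

perf script -s memcpy-sizes.py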

 

Simple Python Perf script to analyse memcpy sizes

 


from __future__ import division
import os
import sys

import collections
import string

sys.path.append(os.environ['PERF_EXEC_PATH'] + \
        '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *
from Core import *


sizes = dict()

sizeones = dict()

def trace_begin():
        print "in trace_begin"

def trace_end():
        print "in trace_end\n\n"

        for key in sizes.keys():
                print "\n\nTop 10 memcpy sizes of %s:" % (key)
                total = len(sizes[key])

                for pair in collections.Counter(sizes[key]).most_common(10):
                        print "%s\t%d\t(%f%%)" % (pair[0], pair[1], 100.0*pair[1]/total)

        for key in sizeones.keys():
                print "\n\nTop 10 memcpy size 1 callers of %s:" % (key)

                for pair in collections.Counter(sizeones[key]).most_common(10):
                        print "%d\t%s" % (pair[1], pair[0])

def uprobes__memcpy(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, __probe_ip, len):

                if (common_comm not in sizes):
                        sizes[common_comm] = list()

                sizes[common_comm].append(len)

                if (len == 1):
                        node = common_callchain[1]
                        if ('sym' in node):
                                caller = format("[%x] %s %s" % (node['ip'], node['dso'], node['sym']['name']))
                        else:
                                caller = format("[%x]" % (node['ip']))

                        if (common_comm not in sizeones):
                                sizeones[common_comm] = list()

                        sizeones[common_comm].append(caller)

def trace_unhandled(event_name, context, event_fields_dict):
                print ' '.join(['%s=%s'%(k,str(v))for k,v in sorted(event_fields_dict.items())])

 

Distribution of memcpy sizes

These were obtained via the uprobe on memcpy. 

 

Top 10 memcpy sizes of ceph-mds:
1       3349    (30.542636%)
4       1411    (12.868217%)
8       1007    (9.183767%)
2       663     (6.046512%)
46      350     (3.191974%)
6       313     (2.854537%)
5       307     (2.799818%)
85      280     (2.553580%)
3       270     (2.462380%)
80      231     (2.106703%)


Top 10 memcpy sizes of sshd:
4       80      (30.769231%)
16      40      (15.384615%)
100     40      (15.384615%)
76      38      (14.615385%)
6       20      (7.692308%)
60      20      (7.692308%)
40      19      (7.307692%)
231     2       (0.769231%)
8       1       (0.384615%)


Top 10 memcpy sizes of perf:
16      1       (20.000000%)
4       1       (20.000000%)
30      1       (20.000000%)
6       1       (20.000000%)
5       1       (20.000000%)


Top 10 memcpy sizes of chronyd:
24      5       (50.000000%)
128     2       (20.000000%)
72      2       (20.000000%)
192     1       (10.000000%)


Top 10 memcpy sizes of systemd:
10      1       (25.000000%)
39      1       (25.000000%)
54      1       (25.000000%)
7       1       (25.000000%)


Top 10 memcpy sizes of lt-rados:
1       6138    (35.294118%)
4       3350    (19.262837%)
8       1549    (8.906906%)
2       1025    (5.893853%)
16      890     (5.117590%)
3       556     (3.197056%)
7       442     (2.541544%)
13      437     (2.512794%)
6       329     (1.891783%)
5       314     (1.805532%)


Top 10 memcpy sizes of ceph-mon:
1       56354   (28.211418%)
4       34085   (17.063317%)
8       20027   (10.025731%)
3       9962    (4.987084%)
2       9248    (4.629648%)
6       7521    (3.765093%)
5       6256    (3.131821%)
14      4419    (2.212199%)
80      4063    (2.033981%)
9       3999    (2.001942%)


Top 10 memcpy sizes of ceph-osd:
1       354007  (39.020946%)
4       115021  (12.678360%)
2       109050  (12.020198%)
8       51911   (5.721967%)
3       40176   (4.428459%)
6       36285   (3.999568%)
80      25808   (2.844725%)
5       19347   (2.132552%)
7       18729   (2.064432%)
11      15813   (1.743011%)


Top 10 memcpy sizes of sleep:
12      14      (73.684211%)
10      2       (10.526316%)
1       1       (5.263158%)
4       1       (5.263158%)
6       1       (5.263158%)


Top 10 memcpy size 1 callers of lt-rados:
2716    [98db8] /usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert >
1666    [1df68] /usr/lib64/libsoftokn3.so sftk_AddAttributeType
522     [8ce30] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_int
245     [48984] /home/steven/ceph/src/.libs/librados.so.2.0.0 ceph::buffer::list::append
198     [8c868] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_int
125     [266944] /home/steven/ceph/src/.libs/librados.so.2.0.0 Pipe::tcp_read_nonblocking
122     [8d554] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_int
77      [1c7298] /home/steven/ceph/src/.libs/librados.so.2.0.0 request_redirect_t::decode
77      [1c72a8] /home/steven/ceph/src/.libs/librados.so.2.0.0 request_redirect_t::decode
77      [229b98] /home/steven/ceph/src/.libs/librados.so.2.0.0 pg_t::decode


Top 10 memcpy size 1 callers of ceph-osd:
231102  [98db8] /usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert >
39477   [8c868] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_int
35877   [8ce30] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_int
19712   [6ccbf8] /home/steven/ceph/src/ceph-osd ceph::buffer::list::append
6734    [1df68] /usr/lib64/libsoftokn3.so sftk_AddAttributeType
4496    [8ccb0] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::do_put
1496    [8d554] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_int
883     [68e628] /home/steven/ceph/src/ceph-osd ghobject_t::decode
883     [68e770] /home/steven/ceph/src/ceph-osd ghobject_t::decode
883     [68e6c8] /home/steven/ceph/src/ceph-osd ghobject_t::decode


Top 10 memcpy size 1 callers of sleep:
1       [2a698] /usr/lib64/libc-2.20.so _nl_find_locale


Top 10 memcpy size 1 callers of ceph-mds:
1403    [98db8] /usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert >
352     [8c868] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_int
351     [8ce30] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_int
241     [557f78] /home/steven/ceph/src/ceph-mds ceph::buffer::list::append
214     [8dde0] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::do_put
211     [8dc1c] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_float
196     [1df68] /usr/lib64/libsoftokn3.so sftk_AddAttributeType
77      [8ccb0] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::do_put
70      [43989c] /home/steven/ceph/src/ceph-mds DecayCounter::decode
70      [4397fc] /home/steven/ceph/src/ceph-mds DecayCounter::decode


Top 10 memcpy size 1 callers of ceph-mon:
29454   [98db8] /usr/lib64/libstdc++.so.6.0.20 std::__ostream_insert >
9161    [8ce30] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_int
5614    [1df68] /usr/lib64/libsoftokn3.so sftk_AddAttributeType
5226    [8c868] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::_M_insert_int
2473    [3cc1d0] /home/steven/ceph/src/ceph-mon ceph::buffer::list::append
463     [455ef4] /home/steven/ceph/src/ceph-mon Pipe::tcp_read_nonblocking
387     [1948e8] /home/steven/ceph/src/ceph-mon MonitorDBStore::Transaction::decode
387     [194944] /home/steven/ceph/src/ceph-mon MonitorDBStore::Transaction::decode
387     [1948f8] /home/steven/ceph/src/ceph-mon MonitorDBStore::Transaction::decode
385     [8ccb0] /usr/lib64/libstdc++.so.6.0.20 std::num_put > >::do_put

 

Steps to analyse `std::__ostream_insert`

 

$ addr2line -f -e /usr/lib64/libstdc++.so.6.0.20 0x98db8
__ostream_write >
/usr/src/debug/gcc-4.9.2-20141101/obj-aarch64-redhat-linux/aarch64-redhat-linux/libstdc++-v3/include/bits/ostream_insert.h:51

This leads to: 


  template<typename _CharT, typename _Traits>
    inline void
    __ostream_write(basic_ostream<_CharT, _Traits>& __out,
                    const _CharT* __s, streamsize __n)
    {
      typedef basic_ostream<_CharT, _Traits>       __ostream_type;
      typedef typename __ostream_type::ios_base    __ios_base;

      const streamsize __put = __out.rdbuf()->sputn(__s, __n);
      if (__put != __n) //<=== line 51
        __out.setstate(__ios_base::badbit);
    }

So we need to investigate sputn(). 

This leads to: 


      streamsize
      sputn(const char_type* __s, streamsize __n)
      { return this->xsputn(__s, __n); }

Let's investigate __streambuf_type::xsputn. This leads to: 


  template<typename _CharT, typename _Traits>
    streamsize
    basic_streambuf<_CharT, _Traits>::
    xsputn(const char_type* __s, streamsize __n)
    {
      streamsize __ret = 0;
      while (__ret < __n)
        {
          const streamsize __buf_len = this->epptr() - this->pptr();
          if (__buf_len)
            {
              const streamsize __remaining = __n - __ret;
              const streamsize __len = std::min(__buf_len, __remaining);
              traits_type::copy(this->pptr(), __s, __len);
              __ret += __len;
              __s += __len;
              this->__safe_pbump(__len);
            }

          if (__ret < __n)
            {
              int_type __c = this->overflow(traits_type::to_int_type(*__s));
              if (!traits_type::eq_int_type(__c, traits_type::eof()))
                {
                  ++__ret;
                  ++__s;
                }
              else
                break;
            }
        }
      return __ret;
    }

Let's investigate traits_type::copy. 


      static char_type*
      copy(char_type* __s1, const char_type* __s2, size_t __n)
      { return static_cast<char_type*>(__builtin_memcpy(__s1, __s2, __n)); }
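
In other words, whenever a one-character string is streamed — for instance the " " separators that the debug messages shown later on this page insert between fields — __n is 1 and traits_type::copy() boils down to a single-byte memcpy, which is exactly what the uprobe is counting.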

 

Locating ceph-osd callers that lead to size 1 memcpy

Here is the output: 

Top ten ceph callers leading to memcpy size 1
19712   [6ccbf8] ceph::buffer::list::append
13346   [334f04] operator<<
9571    [25e724] operator<< 
7426    [308410] operator<<
7426    [3083fc] operator<<
6734    [5a85f4] nss_aes_operation
6657    [68e2b0] operator<<
6647    [68e33c] operator<<
6635    [68e2d0] operator<<
6577    [68e368] operator<<

Investigating the first sample (the address given to addr2line is the listed 0x6ccbf8 minus 4, presumably to point at the call instruction rather than the return address): 

$ addr2line -Cf -e /home/steven/ceph/src/ceph-osd 0x6ccbf4
Objecter::_session_linger_op_remove(Objecter::OSDSession*, Objecter::LingerOp*)
/home/steven/ceph/src/osdc/Objecter.cc:2570

Which gives the following function: 


void Objecter::_session_linger_op_remove(OSDSession *from, LingerOp *op)
{
  assert(from == op->session);
  assert(from->lock.is_locked());

  if (from->is_homeless()) {
    num_homeless_ops.dec();
  }

  from->linger_ops.erase(op->linger_id);
  put_session(from);
  op->session = NULL;

  ldout(cct, 15) << __func__ << " " << from->osd << " " << op->linger_id << dendl;
}

Looks like this is due to debugging messages! 

 

Being tricked by perf

The IPs need to have 0x400000 added to them for ceph-osd (a perf bug?).... Let's try again.... 
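
One way to sanity-check that offset — assuming it is simply the executable's text load address — is to look at the ELF program headers: 

$ readelf -lW ./src/ceph-osd | grep LOAD

If the first LOAD segment has a virtual address of 0x400000, adding that base to the probe IPs should make them line up with addr2line.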

 

$ addr2line -Cf -e ./src/ceph-osd 0xaccbf8
ceph::buffer::ptr::end() const
/home/steven/ceph/src/./include/buffer.h:207

$ addr2line -Cf -e ./src/ceph-osd 0xaccbf4
ceph::buffer::list::append(char const*, unsigned int)
/home/steven/ceph/src/common/buffer.cc:1310

This leads to: 


  void buffer::list::append(const char *data, unsigned len)
  {
    while (len > 0) {
      // put what we can into the existing append_buffer.
      unsigned gap = append_buffer.unused_tail_length();
      if (gap > 0) {
        if (gap > len) gap = len;
        //cout << "append first char is " << data[0] << ", last char is " << data[len-1] << std::endl;
        append_buffer.append(data, gap); // <--- this is line 1310
        append(append_buffer, append_buffer.end() - gap, gap);  // add segment to the list
        len -= gap;
        data += gap;
      }
      if (len == 0)
        break;  // done!

      // make a new append_buffer!
      unsigned alen = CEPH_PAGE_SIZE * (((len-1) / CEPH_PAGE_SIZE) + 1);
      append_buffer = create_page_aligned(alen);
      append_buffer.set_length(0);   // unused, so far.
    }
  }

I think the problem can be attributed to the following, where the memcpy is a single byte whenever l == 1: 


  void buffer::ptr::append(const char *p, unsigned l)
  {
    assert(_raw);
    assert(l <= unused_tail_length());
    memcpy(c_str() + _len, p, l);
    _len += l;
  }

 

Flame graph of call stack for size 1 memcpy's

A flame graph of size 1 memcpy call stacks can be found here: ceph-size1.svg

And the inverted graph here: ceph-size1-rev.svg

The stack data was generated via the following Python script: 


import os
import sys

sys.path.append(os.environ['PERF_EXEC_PATH'] + \
        '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *
from Core import *


def trace_begin():
        pass

def trace_end():
        for callframe in callframes.keys():
                print "%s %d" % (callframe, callframes[callframe])

callframes = dict()

def uprobes__memcpy(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, __probe_ip, len):

                if (len != 1):
                        return

                callframe = common_comm

                for node in reversed(common_callchain):
                        callframe += ';'

                        if 'sym' in node:
                                callframe += node['sym']['name']
                        else:
                                callframe += '[unknown]'

                if (callframe not in callframes.keys()):
                        callframes[callframe] = 1
                else:
                        callframes[callframe] += 1

def trace_unhandled(event_name, context, event_fields_dict):
                print ' '.join(['%s=%s'%(k,str(v))for k,v in sorted(event_fields_dict.items())])
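
For reference, the "stack count" lines this script prints are in the folded format expected by Brendan Gregg's FlameGraph tools, so the SVGs above can be produced with something along these lines (memcpy-stacks.py is a stand-in name for the script, and the exact flamegraph.pl options used are a guess): 

perf script -s memcpy-stacks.py > memcpy-size1.folded
./flamegraph.pl memcpy-size1.folded > ceph-size1.svg
./flamegraph.pl --reverse memcpy-size1.folded > ceph-size1-rev.svg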

 

Actual hot memcpy's

Analysing via a uprobe has one significant disadvantage that I overlooked: it tells us what proportion of all memcpy's are size 1, but it does not take into account which memcpy's are expensive and which are cheap, i.e. which memcpy's are actually taking up the most time. 
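
A plain CPU-cycle sampling profile avoids that problem — presumably something along the lines of: 

perf record -a -g sleep 20
perf report

(the exact options used here are a guess; the point is that samples are taken on CPU time rather than on every memcpy call).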

If we look at the call graph for the hot memcpy calls we get: 

    10.55%  ceph-osd         /usr/lib64/libc-2.20.so                        0x8503c            B [.] memcpy                                                           
                   |
                   ---memcpy
                      |          
                      |--98.83%-- ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)
                      |          FileJournal::align_bl(long, ceph::buffer::list&)
                      |          FileJournal::do_write(ceph::buffer::list&)
                      |          FileJournal::write_thread_entry()
                      |          FileJournal::Writer::entry()
                      |          Thread::entry_wrapper()
                      |          start_thread
                      |          thread_start

i.e. the expensive memcpy's show up as buffer management rather than as single-character log appends. 
