Who Stole gettimeofday() System Calls From Oracle strace() sessions?

When strace showed gettimeofday calls, I asked about it on Zhihu and R大 pointed me to this reference article, which I am reposting here. On CentOS 6.5,

cat /proc/sys/kernel/vsyscall64 is already 1, and cat /proc/pid/maps shows the vdso mapping, yet gettimeofday can still be traced with strace in JVM threads, which is indeed strange.


Original post:  http://www.scaleabilities.co.uk/2012/12/18/who-stole-gettimeofday-from-oracle-straces/

I've been meaning to write this blog post for a while now but never seemed to find the time. Hopefully this posting will be useful information for anyone that spends a lot of time tracing processes in Oracle from the Linux Operating System. I'm one of those people, so it was good to get to the bottom of this particular mystery.

Background

Oracle, as any ardent tracer knows, loves to call the gettimeofday(2) system call. This is a system call that returns the current system timestamp with microsecond granularity, which Oracle uses as the source of timing information for the wait interface. The wait interface calls gettimeofday(2) before and after the event that causes the wait and uses the delta between the two times to report the time waited in v$session[_wait], v$system_event/v$session_wait and in extended sql_trace files, as well as some newer locations such as v$event_histogram. This information has, since Oracle7, become the cornerstone of Oracle performance diagnostics. And if you simply use strace (Linux) or truss (Solaris), you can clearly see these calls being made. In fact, it's pretty much the dominant system call because Oracle is so heavily instrumented. This is a good thing. Consider the following example, where I have used strace on my Linux test box in conjunction with an extended sql_trace to show where the calls take place:

1  gettimeofday({1353543405, 652924}, NULL) = 0
2  pread(256, "\6\242\0\0\372\1@\0@[\3\0\0\0\1\4\321\204\0\0\2\0\0\0009\0\0\0@[\3\0\0\0\0\0\2\370\2\0\0\0\0\0\1\0\r\0\2\0\0\0\211\0\300\0\1\0\1\0\0\340\0\0b]\2\0\1\0\0\0\2\0\0\0\202\0\300\0\0\0]\0\0\240\0\0b]\2\0\0\0\200\3\1\0\0\0"..., 8192, 4145152) = 8192
3  gettimeofday({1353543405, 654424}, NULL) = 0
4  lseek(10, 0, SEEK_CUR) = 81949
5  write(10, "WAIT #140678405599440: nam='db file sequential read' ela= 1500 file#=1 block#=506 blocks=1 obj#=57 t"..., 119) = 119

So, some things to notice in this output. Starting at the bottom of the output, we can see the process is writing some wait event information to the trace file. We can see this because strace is dereferencing the pointer that is supplied to the write(2) system call and showing us the contents of the buffer rather than just the buffer address. You can see that we waited for a 'db file sequential read' event, and that it took 1500 microseconds (ela=1500). The call to lseek(2) is doing an effectively useless seek within the trace file by advancing the file position by zero bytes. Just ignore that. Now for the interesting stuff. Lines 1-3 contain the actual 'db file sequential read' action along with the timing provided by the wait interface. Line 2 is the actual system call to pread(2) which is reading data from disk, and lines 1 and 3 are the required calls to allow Oracle to calculate the elapsed time of the event. It's pretty simple, and very easy to see what is going on behind the scenes with the wait interface.
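
To make that bracketing pattern concrete, here is a minimal sketch in C of the same idea: take a timestamp, perform the I/O, take a second timestamp, and report the delta in microseconds. This is only an illustration of the technique the wait interface uses, not Oracle's code; the file name and read size are arbitrary choices for the example.

/* timed_pread.c - illustrative sketch of bracketing an I/O call with
 * gettimeofday(2), in the style of the wait interface. Not Oracle code;
 * the file name and read size are arbitrary examples. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
    char buf[8192];
    struct timeval t1, t2;

    int fd = open("/etc/hosts", O_RDONLY);    /* any readable file will do */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    gettimeofday(&t1, NULL);                  /* timestamp before the event */
    ssize_t n = pread(fd, buf, sizeof(buf), 0);
    gettimeofday(&t2, NULL);                  /* timestamp after the event  */

    long ela = (t2.tv_sec - t1.tv_sec) * 1000000L + (t2.tv_usec - t1.tv_usec);
    printf("read %zd bytes, ela= %ld microseconds\n", n, ela);

    close(fd);
    return 0;
}

Run under strace on a system where gettimeofday(2) is still a real system call, this should show the same gettimeofday/pread/gettimeofday sandwich as lines 1-3 above.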

Weird Things

At some point in the lifecycle of Red Hat Enterprise Linux, I believe RHEL5.5, this all changed and life became momentarily confusing. Consider the same strace session output, running on RHEL5.5. I've started the extract of output with the previous wait event just to show that I'm not cheating:

write(10, "WAIT #140678405594432: nam='SQL*Net message to client' ela= 5 driver id=1650815232 #bytes=1 p3=0 obj"..., 126) = 126
write(10, "\n", 1) = 1
getrusage(RUSAGE_SELF, {ru_utime={0, 551916}, ru_stime={0, 183972}, ...}) = 0
pread(256, "\20\242\0\0008\6@\0d\376\34\0\0\0\1\4\373:\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\377\5\0\0 \20\0\0\32\0\0\0s\0\0\0\200\0\0\0\3637A\0\0\0\0\0\32\0\0\0\2\0\0\0\362\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0"..., 8192, 13041664) = 8192
lseek(10, 0, SEEK_CUR) = 119298
write(10, "WAIT #140678405594432: nam='db file sequential read' ela= 886 file#=1 block#=1592 blocks=1 obj#=235 "..., 120) = 120

Some things are the same: the pread(2) is there, the useless lseek(2) is still there, and the write of the wait event to the trace file is still there. But where did the gettimeofday(2) calls go? My initial thoughts when I first saw this were that perhaps Oracle have created some kind of user-mode method of capturing the current time. User mode processing does not show up in strace output because there are, by definition, no system calls: strace only traces system calls. But it turns out that Oracle have done precisely nothing different in the code, and the change is all taking place under the hood in the Linux libraries.

System Call Interface

Historically, the method for making a system call was via a software interrupt. This interrupt signalled the kernel to switch into the kernel context and to process the system call on behalf of the user process. There is lots of detail surrounding this such as how parameters are passed, security checks (because the kernel operates with full access to hardware, including all system memory), invalidating TLB entries, and so forth, but we won't go into that here. The important part is that this interface became increasingly unscalable, particularly across multiple SMP CPUs.  Intel addressed the scalability issue in the Pentium II generation through the creation of a Fast System Call interface using the SYSENTER and SYSEXIT instructions instead of interrupts. As this method did not exist in previous processor generations, the Linux kernel needed a way to determine which method of invoking system calls was optimal for all processors that were/are supported by the Operating System. This implementation needed to insulate the user code against any kind of porting between different processor types, and the result was the implementation of the Virtual Dynamic Shared Object library. This is a special library that the kernel maps into the address space of any user process and implements the code for the Fast System Call interface. Using ldd we can see that it is not actually represented by any underlying file, just a memory address:

[morlej@ol6 ~]$ ldd /opt/oracle/11202/bin/oracle
  linux-vdso.so.1 => (0x00007fff057a3000)
  libodm11.so => /opt/oracle/11202/lib/libodm11.so (0x00007fbb22e15000)
  libcell11.so => /opt/oracle/11202/lib/libcell11.so (0x00007fbb22cb1000)
  libskgxp11.so => /opt/oracle/11202/lib/libskgxp11.so (0x00007fbb22adb000)
  librt.so.1 => /lib64/librt.so.1 (0x0000003b0f800000)
  libnnz11.so => /opt/oracle/11202/lib/libnnz11.so (0x00007fbb22701000)
  libclsra11.so => /opt/oracle/11202/lib/libclsra11.so (0x00007fbb225e9000)
  libdbcfg11.so => /opt/oracle/11202/lib/libdbcfg11.so (0x00007fbb224c6000)
  libhasgen11.so => /opt/oracle/11202/lib/libhasgen11.so (0x00007fbb21cf2000)
  libskgxn2.so => /opt/oracle/11202/lib/libskgxn2.so (0x00007fbb21bf0000)
  libocr11.so => /opt/oracle/11202/lib/libocr11.so (0x00007fbb21a4a000)
  libocrb11.so => /opt/oracle/11202/lib/libocrb11.so (0x00007fbb218aa000)
  libocrutl11.so => /opt/oracle/11202/lib/libocrutl11.so (0x00007fbb2179f000)
  libaio.so.1 => /lib64/libaio.so.1 (0x0000003b0e800000)
  libdl.so.2 => /lib64/libdl.so.2 (0x00007fbb2159a000)
  libm.so.6 => /lib64/libm.so.6 (0x0000003b0f400000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003b0f000000)
  libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003b1dc00000)
  libc.so.6 => /lib64/libc.so.6 (0x0000003b0ec00000)
  /lib64/ld-linux-x86-64.so.2 (0x0000003b0e400000)

The linux-vdso.so.1 entry at the top there is the VDSO library, which used to be referred to as linux-gate in previous versions. This article has some further information on how this is all structured, including disassembly of the actual library showing the call to SYSENTER.
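
As an aside (my addition, not from the original post), a process can also locate the vDSO mapping at run time via the ELF auxiliary vector. The sketch below uses glibc's getauxval(3) with the AT_SYSINFO_EHDR tag, which vdso(7) documents as returning the base address of the vDSO image; it requires glibc 2.16 or later.

/* find_vdso.c - print the base address of the vDSO mapping.
 * Compile with something like: gcc -o find_vdso find_vdso.c */
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval() */
#include <elf.h>        /* AT_SYSINFO_EHDR */

int main(void)
{
    /* AT_SYSINFO_EHDR is the auxiliary vector entry holding the address
     * of the vDSO ELF header that the kernel mapped into this process. */
    unsigned long vdso = getauxval(AT_SYSINFO_EHDR);

    if (vdso == 0)
        printf("no vDSO mapping reported by the kernel\n");
    else
        printf("vDSO mapped at 0x%lx (compare with /proc/self/maps)\n", vdso);

    return 0;
}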

Anyway, that's one of the purposes of the VDSO library, but it is more than that in concept; it is an abstraction of what a system call actually is. With this abstraction in place it becomes possible to perform other magic tricks under the hood without any change to the user code. The particular trick that we are interested in for this blog entry is the speedup of the gettimeofday(2) call.

Vsyscall64

As documented here, the VDSO also contains some mapping directly to kernel code. Specifically, it allows the user process some shortcuts to the system call interface by directly executing some/all of the code in user mode rather than kernel mode. This is controlled via a kernel flag which can be viewed and modified via the /proc/sys/kernel/vsyscall64 special file. The possible values are as follows:

0 - Provides the most accurate time intervals at μs (microsecond) resolution, but also produces the highest call overhead, as it uses a regular system call

1 - Slightly less accurate, although still at μs resolution, with a lower call overhead

2 - The least accurate, with time intervals at the ms (millisecond) level, but offers the lowest call overhead

As of RHEL 5.5, the default value for this has become "1", which implies that it no longer "uses a regular system call" as it did when the default was "0". I have not fully investigated exactly what is executed in the various modes from a kernel code perspective, but I could make a fair guess from the descriptions:

0 - Regular system call, via Fast System Call interface if available

1 - Mostly user mode, probably just reading from a page of kernel-updated memory which is mapped into the user address space. Some kernel code must be necessary to ensure the value in memory reflects the current timestamp

2 - Just user mode. Read the mapped memory page and rely on the kernel to update the value every millisecond.

Like I said, that's just an educated guess. If anyone's looked into this, please do comment and I'll update accordingly.
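
For what it's worth, the guess for setting "1" matches how user-mode timekeeping is generally implemented: the kernel maintains a timestamp in a page shared with user space, guarded by a sequence counter so a reader can detect a concurrent update and retry. The code below is my own simplified, self-contained toy model of that reader/updater pattern; the structure, names and updater thread are invented for the illustration and are not the kernel's actual data layout.

/* seqcount_time.c - toy model of a seqlock-style shared timestamp page:
 * an updater refreshes the time, readers retry if they observe a
 * concurrent update. Illustration only; the real kernel/vDSO structures
 * differ. Requires a C11 compiler.
 * Compile with something like: gcc -O2 -pthread seqcount_time.c -o seqcount_time */
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/time.h>
#include <unistd.h>

struct shared_clock {
    atomic_uint    seq;  /* odd while an update is in progress */
    struct timeval tv;   /* the published timestamp            */
};

static struct shared_clock clk;

/* Updater: plays the role of the kernel refreshing the shared page. */
static void *updater(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        atomic_fetch_add(&clk.seq, 1);   /* seq becomes odd: write in progress */
        gettimeofday(&clk.tv, NULL);     /* refresh the published value        */
        atomic_fetch_add(&clk.seq, 1);   /* seq becomes even: write complete   */
        usleep(1000);
    }
    return NULL;
}

/* Reader: plays the role of the user-mode gettimeofday() fast path. */
static void read_clock(struct timeval *out)
{
    unsigned int s1, s2;
    do {
        s1 = atomic_load(&clk.seq);                 /* snapshot the sequence      */
        *out = clk.tv;                              /* copy the candidate value   */
        atomic_thread_fence(memory_order_acquire);  /* keep the copy before s2    */
        s2 = atomic_load(&clk.seq);
    } while ((s1 & 1) || s1 != s2);                 /* retry if torn or in-flight */
}

int main(void)
{
    pthread_t tid;
    struct timeval tv;

    pthread_create(&tid, NULL, updater, NULL);
    for (int i = 0; i < 5; i++) {
        read_clock(&tv);
        printf("shared clock: %ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
        usleep(2000);
    }
    pthread_join(tid, NULL);
    return 0;
}

In terms of the list above, setting 1 would correspond roughly to this kind of user-mode read of a kernel-maintained value, while setting 2 simply accepts a value that is refreshed only at millisecond granularity.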

Let's run a few tests. To check this out I wrote a small C program to essentially just call gettimeofday(2) over and over again:

/* gtod.c - call gettimeofday(2) one million times.
 * Compile with something like: gcc -o gtod gtod.c */
#include <stdio.h>      /* perror() */
#include <sys/time.h>   /* gettimeofday(), struct timeval */

int
main(void) {

     int i;
     struct timeval tv;

     for (i=1000000;i>0;i--)
         if (gettimeofday(&tv, 0))
             perror("gtod");

     return 0;
}

Next up, I just run this program with the different settings for vsyscall64. This was all performed on my Oracle Linux 6.2 virtual machine. Note: kernel mode activity has a higher cost in a virtualised environment and so the effect is somewhat amplified.

[root@ol6 morlej]# echo 0 > /proc/sys/kernel/vsyscall64
[root@ol6 morlej]# time strace -c ./gtod
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.004388           0   1000000           gettimeofday
0.00    0.000000           0         1           read
0.00    0.000000           0         2           open
0.00    0.000000           0         2           close
0.00    0.000000           0         2           fstat
0.00    0.000000           0         8           mmap
0.00    0.000000           0         3           mprotect
0.00    0.000000           0         1           munmap
0.00    0.000000           0         1           brk
0.00    0.000000           0         1         1 access
0.00    0.000000           0         1           execve
0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.004388               1000023         1 total
 
real    0m12.737s
user    0m3.084s
sys 0m9.466s
[root@ol6 morlej]# echo 1 > /proc/sys/kernel/vsyscall64
[root@ol6 morlej]# time strace -c ./gtod
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
-nan    0.000000           0         1           read
-nan    0.000000           0         2           open
-nan    0.000000           0         2           close
-nan    0.000000           0         2           fstat
-nan    0.000000           0         8           mmap
-nan    0.000000           0         3           mprotect
-nan    0.000000           0         1           munmap
-nan    0.000000           0         1           brk
-nan    0.000000           0         1         1 access
-nan    0.000000           0         1           execve
-nan    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    23         1 total
 
real    0m0.052s
user    0m0.046s
sys 0m0.004s
[root@ol6 morlej]# echo 2 > /proc/sys/kernel/vsyscall64
[root@ol6 morlej]# time strace -c ./gtod
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
-nan    0.000000           0         1           read
-nan    0.000000           0         2           open
-nan    0.000000           0         2           close
-nan    0.000000           0         2           fstat
-nan    0.000000           0         8           mmap
-nan    0.000000           0         3           mprotect
-nan    0.000000           0         1           munmap
-nan    0.000000           0         1           brk
-nan    0.000000           0         1         1 access
-nan    0.000000           0         1           execve
-nan    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    23         1 total
 
real    0m0.043s
user    0m0.038s
sys 0m0.003s

Clearly, the big performance gain is to be had at level 1, and it's not really worth losing the microsecond accuracy (at least for database workloads) by going to level 2.
Anyway, if you're doing some strace sessions on Linux systems, it's probably worth considering setting vsyscall64 to "1" so that you can see the full picture. This is particularly true if you would like to understand where the Oracle wait interface is taking its timing snapshots in relation to other underlying system calls.
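
If you want a rough feel for the per-call cost without strace in the picture at all, a small timing harness along the following lines can be used. This is my own addition, not from the original post; clock_gettime(CLOCK_MONOTONIC) is only acting as the stopwatch around the loop, and on older glibc you need to link with -lrt.

/* gtod_bench.c - rough per-call cost of gettimeofday(2), measured with
 * clock_gettime(CLOCK_MONOTONIC) around a tight loop. Results will vary
 * with the vsyscall64 setting, hardware and virtualisation.
 * Compile with something like: gcc -O2 -o gtod_bench gtod_bench.c -lrt */
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define ITERATIONS 1000000

int main(void)
{
    struct timespec start, end;
    struct timeval tv;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ITERATIONS; i++)
        gettimeofday(&tv, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%d calls in %.0f ns (%.1f ns/call)\n", ITERATIONS, ns, ns / ITERATIONS);
    return 0;
}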

That's about it for this posting, I hope that was useful information to somebody!

