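aa.stp:
# Print the caller and both stack traces the first time sys_sync is entered.
probe kernel.function("sys_sync") {
    printf("probfunc:%s fun:%s\n", execname(), ppfunc());
    print_backtrace();   # kernel-space stack
    print_ubacktrace();  # user-space stack (symbols come from the -d options)
    exit();              # stop after the first hit
}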
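Terminal A (triggers the probe):
[root@localhost ~]# sync
Terminal B (runs the script; start it before running sync in terminal A):
[root@localhost ~]# stap -v aa.stp -d /lib64/libc-2.5.so -d /bin/sync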
probfunc:sync fun:sys_sync
 0xffffffff810e73e7 : sys_sync+0x0/0x2e [kernel]
 0xffffffff8100bb29 : tracesys+0xd9/0xde [kernel]
 0x34688ce477 : sync+0x7/0x30 [/lib64/libc-2.5.so]
 0x4011b5 : usage+0x1f5/0x240 [/bin/sync]
 0x346881d9f4 : __libc_start_main+0xf4/0x1b0 [/lib64/libc-2.5.so]
 0x400f09 [/bin/sync+0xf09/0x4000]
int fsync(int fd);
int fdatasync(int fd);

fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) where that file resides. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see stat(2)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush.

The aim of fdatasync(2) is to reduce disk activity for applications that do not require all metadata to be synchronised with the disk.
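The directory-entry caveat in the man page text is easy to miss, so here is a minimal sketch of it in C. The function name durable_write and its parameters are illustrative assumptions, not taken from the man page or from MySQL; it writes a file, calls fsync() on it, and then calls fsync() on the containing directory.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `data` to dir/name, then flush the file and
 * the directory that holds its entry. Returns 0 on success, -1 on error. */
int durable_write(const char *dir, const char *name, const char *data)
{
    assert(dir != NULL && name != NULL && data != NULL);

    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    size_t len = strlen(data);
    int rc = 0;
    if (write(fd, data, len) != (ssize_t)len)
        rc = -1;                      /* treat short writes as failure */
    if (rc == 0 && fsync(fd) == -1)   /* file data and file metadata */
        rc = -1;
    if (close(fd) == -1)
        rc = -1;
    if (rc != 0)
        return -1;

    int dfd = open(dir, O_RDONLY);    /* a directory can be opened read-only */
    if (dfd == -1)
        return -1;
    if (fsync(dfd) == -1)             /* the directory entry itself */
        rc = -1;
    close(dfd);
    return rc;
}
```

A call such as durable_write("/tmp", "example.txt", "hello\n") returns 0 only if both fsync() calls succeeded. Replacing the first fsync() with fdatasync() would skip metadata such as st_mtime, as described above.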
int open(const char *pathname, int flags);
int open(const char *pathname, int flags, mode_t mode);

flags:

O_APPEND
    The file is opened in append mode. Before each write(), the file offset is positioned at the end of the file, as if with lseek(). O_APPEND may lead to corrupted files on NFS file systems if more than one process appends data to a file at once. This is because NFS does not support appending to a file, so the client kernel has to simulate it, which can't be done without a race condition.

O_ASYNC
    Enable signal-driven I/O: generate a signal (SIGIO by default, but this can be changed via fcntl(2)) when input or output becomes possible on this file descriptor. This feature is only available for terminals, pseudo-terminals, sockets, and (since Linux 2.6) pipes and FIFOs. See fcntl(2) for further details.

O_CREAT
    If the file does not exist it will be created. The owner (user ID) of the file is set to the effective user ID of the process. The group ownership (group ID) is set either to the effective group ID of the process or to the group ID of the parent directory (depending on filesystem type and mount options, and the mode of the parent directory, see, e.g., the mount options bsdgroups and sysvgroups of the ext2 filesystem, as described in mount(8)).

O_DIRECT
    Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The I/O is synchronous, i.e., at the completion of a read(2) or write(2), data is guaranteed to have been transferred. Under Linux 2.4 transfer sizes, and the alignment of user buffer and file offset must all be multiples of the logical block size of the file system. Under Linux 2.6 alignment must fit the block size of the device. A semantically similar (but deprecated) interface for block devices is described in raw(8).

O_DIRECTORY
    If pathname is not a directory, cause the open to fail. This flag is Linux-specific, and was added in kernel version 2.1.126, to avoid denial-of-service problems if opendir(3) is called on a FIFO or tape device, but should not be used outside of the implementation of opendir.

O_EXCL
    When used with O_CREAT, if the file already exists it is an error and the open() will fail. In this context, a symbolic link exists, regardless of where it points to. O_EXCL is broken on NFS file systems; programs which rely on it for performing locking tasks will contain a race condition. The solution for performing atomic file locking using a lockfile is to create a unique file on the same file system (e.g., incorporating hostname and pid), use link(2) to make a link to the lockfile. If link() returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful.

O_LARGEFILE
    (LFS) Allow files whose sizes cannot be represented in an off_t (but can be represented in an off64_t) to be opened.

O_NOATIME
    (Since Linux 2.6.8) Do not update the file last access time (st_atime in the inode) when the file is read(2). This flag is intended for use by indexing or backup programs, where its use can significantly reduce the amount of disk activity. This flag may not be effective on all filesystems. One example is NFS, where the server maintains the access time.

O_NOCTTY
    If pathname refers to a terminal device (see tty(4)) it will not become the process's controlling terminal even if the process does not have one.

O_NOFOLLOW
    If pathname is a symbolic link, then the open fails. This is a FreeBSD extension, which was added to Linux in version 2.1.126. Symbolic links in earlier components of the pathname will still be followed.

O_NONBLOCK or O_NDELAY
    When possible, the file is opened in non-blocking mode. Neither the open() nor any subsequent operations on the file descriptor which is returned will cause the calling process to wait. For the handling of FIFOs (named pipes), see also fifo(7). For a discussion of the effect of O_NONBLOCK in conjunction with mandatory file locks and with file leases, see fcntl(2).

O_SYNC
    The file is opened for synchronous I/O. Any write()s on the resulting file descriptor will block the calling process until the data has been physically written to the underlying hardware. But see the RESTRICTIONS section of open(2).

O_TRUNC
    If the file already exists and is a regular file and the open mode allows writing (i.e., is O_RDWR or O_WRONLY) it will be truncated to length 0. If the file is a FIFO or terminal device file, the O_TRUNC flag is ignored. Otherwise the effect of O_TRUNC is unspecified.
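Two of the flags above, O_EXCL and O_APPEND, have behavior that is easy to check on a local file system. The following is a minimal sketch; the helper names create_exclusive and append_text are made up for this note. It shows O_CREAT|O_EXCL failing with EEXIST on an existing file and O_APPEND placing each write() at the end of the file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create `path` only if it does not exist yet.
 * Returns 0 if created, 1 if it already existed, -1 on any other error. */
int create_exclusive(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd == -1)
        return errno == EEXIST ? 1 : -1;
    close(fd);
    return 0;
}

/* Hypothetical helper: append `text` to an existing file.
 * O_APPEND moves the offset to end of file before every write().
 * Returns the file size after the write, or -1 on error. */
long append_text(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd == -1)
        return -1;

    size_t len = strlen(text);
    long size = -1;
    struct stat st;
    if (write(fd, text, len) == (ssize_t)len && fstat(fd, &st) == 0)
        size = (long)st.st_size;
    close(fd);
    return size;
}
```

On a freshly created file, appending "abc" and then "de" gives sizes 3 and 5. As the man page text warns, neither guarantee should be assumed to hold on NFS.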
[root@localhost ~]# stap -L 'kernel.function("sys_*sync")'
kernel.function("sys_fdatasync@fs/sync.c:284") $fd:unsigned int
kernel.function("sys_fsync@fs/sync.c:279") $fd:unsigned int
kernel.function("sys_msync@mm/msync.c:32") $start:long unsigned int $len:size_t $flags:int $mm:struct mm_struct*
kernel.function("sys_sync@fs/sync.c:129")
[root@localhost ~]# stap -v aa.stp -d /lib64/libc-2.5.so -d /lib64/libpthread-2.5.so -d /usr/local/mysql56/bin/mysqld
probfunc:mysqld fun:sys_fsync
 0xffffffff810e718d : sys_fsync+0x0/0x10 [kernel]
 0xffffffff8100bb29 : tracesys+0xd9/0xde [kernel]
 0x346940e1d7 : __fsync_nocancel+0x2e/0x67 [/lib64/libpthread-2.5.so]
 0xba81a5 : _Z13os_file_fsynci+0x1b/0xda [/usr/local/mysql56/bin/mysqld]
 0xba8277 : _Z18os_file_flush_funci+0x13/0x94 [/usr/local/mysql56/bin/mysqld]
 0xd4d3b5 : _Z22pfs_os_file_flush_funciPKcm+0x7d/0xb4 [/usr/local/mysql56/bin/mysqld]
 0xd4dbf9 : _Z9fil_flushm+0x363/0x486 [/usr/local/mysql56/bin/mysqld]
 0xb8dcef : _Z15log_write_up_tommm+0x5b3/0x7c0 [/usr/local/mysql56/bin/mysqld]
 0xc96813 : _Z27trx_flush_log_if_needed_lowm+0x53/0x88 [/usr/local/mysql56/bin/mysqld]
 0xc96873 : _Z23trx_flush_log_if_neededmP5trx_t+0x2b/0x40 [/usr/local/mysql56/bin/mysqld]
 0xc97434 : _Z29trx_commit_complete_for_mysqlP5trx_t+0x84/0x96 [/usr/local/mysql56/bin/mysqld]
 0xb32ca8 : _Z15innobase_commitP10handlertonP3THDb+0x2a4/0x2f4 [/usr/local/mysql56/bin/mysqld]
 0x625e75 : _Z13ha_commit_lowP3THDbb+0xa1/0x1e6 [/usr/local/mysql56/bin/mysqld]
 0x70300f : _ZN12TC_LOG_DUMMY6commitEP3THDb+0x25/0x3e [/usr/local/mysql56/bin/mysqld]
 0x6264c0 : _Z15ha_commit_transP3THDbb+0x506/0x612 [/usr/local/mysql56/bin/mysqld]
 0x89ee02 : _Z17trans_commit_stmtP3THD+0x1cc/0x292 [/usr/local/mysql56/bin/mysqld]
 0x7d60b7 : _Z21mysql_execute_commandP3THD+0x7bc3/0x7ec8 [/usr/local/mysql56/bin/mysqld]
 0x7d67c4 : _Z11mysql_parseP3THDPcjP12Parser_state+0x408/0x690 [/usr/local/mysql56/bin/mysqld]
 0x7d83f4 : _Z16dispatch_command19enum_server_commandP3THDPcj+0xd0a/0x227e [/usr/local/mysql56/bin/mysqld]
 0x7d9c80 : _Z10do_commandP3THD+0x318/0x394 [/usr/local/mysql56/bin/mysqld]
 0x78e3fb : _Z24do_handle_one_connectionP3THD+0x1ad/0x246 [/usr/local/mysql56/bin/mysqld]
 0x78e4c1 : handle_one_connection+0x2d/0x34 [/usr/local/mysql56/bin/mysqld]
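The innodb_flush_method parameter and file system I/O
The MySQL InnoDB engine uses the innodb_flush_method parameter to set how it interacts with the file system.
The options on Linux are:
fdatasync
O_DIRECT
O_DSYNC
The default is fdatasync (the MySQL documentation quoted at the end of this note lists the same default under the name fsync).
How the three settings affect MySQL's handling of the log files and data files: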
Setting   | Open log | Flush log | Open datafile | Flush data
----------|----------|-----------|---------------|-----------
fdatasync |          | fsync()   |               | fsync()
O_DSYNC   | O_SYNC   |           |               | fsync()
O_DIRECT  |          | fsync()   | O_DIRECT      | fsync()
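Notes:
1) The fdatasync setting actually uses the fsync() function. fsync() acts only on the single file specified by the file descriptor filedes, and it waits for the disk write to finish before returning. fsync() suits applications such as databases, which need to be sure that modified blocks are written to disk immediately. fsync() is called in the flush stage.
2) The O_DIRECT setting tells the operating system to bypass its cache; fsync() is then used to flush the data to disk. O_DIRECT is a flag set in the open stage.
3) The O_DSYNC setting actually uses O_SYNC as the flag for opening the log files. O_SYNC is a flag set in the open stage. It also requests synchronous write I/O: a write returns only after the cached data has been written to disk.
The difference between O_SYNC and O_DIRECT is that O_SYNC does not bypass the operating system's cache. Data still passes through the OS cache, but each write() returns only once the data has been pushed to the device. Whether the device's own write cache is also flushed depends on the kernel and the hardware.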
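The table and notes above can be restated as a small lookup in C. This is only a sketch of the table as written in this note, not InnoDB's actual code; struct flush_plan, its fields, and plan_for are hypothetical names.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>

/* Hypothetical summary of one table row. */
struct flush_plan {
    int log_open_flags;   /* extra open() flags for log files (0 = none) */
    int data_uses_direct; /* 1 if data files are opened with O_DIRECT */
    int fsync_log;        /* 1 if fsync() is called after log writes */
    int fsync_data;       /* 1 if fsync() is called after data writes */
};

/* Fill `out` for one of the three settings listed in this note.
 * Returns 0 on success, -1 for a setting the table does not cover. */
int plan_for(const char *method, struct flush_plan *out)
{
    assert(method != NULL && out != NULL);

    if (strcmp(method, "fdatasync") == 0) {
        *out = (struct flush_plan){0, 0, 1, 1};       /* fsync() for both */
    } else if (strcmp(method, "O_DSYNC") == 0) {
        *out = (struct flush_plan){O_SYNC, 0, 0, 1};  /* log opened with O_SYNC */
    } else if (strcmp(method, "O_DIRECT") == 0) {
        *out = (struct flush_plan){0, 1, 1, 1};       /* data files bypass OS cache */
    } else {
        return -1;
    }
    return 0;
}
```

Reading the result for "O_DSYNC" shows the one case where no fsync() follows log writes, because the O_SYNC open flag already makes each write synchronous.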
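The program, whose three calls are quoted below, shows the three stages of ordinary file I/O: open, write and fdatasync. They open the file, write to it, and flush it (push the file's cached data to disk).

1. Open stage
open("test.file", O_WRONLY|O_APPEND|O_DSYNC)
The open system call gives the process a file descriptor fd [Appendix 2]. Here the file is opened with O_WRONLY|O_APPEND|O_DSYNC:
    O_WRONLY opens the file for writing, telling the kernel that we need to write data to it.
    O_APPEND tells the kernel to write in append mode.
    O_DSYNC tells the kernel that a write counts as complete (write returns success) only once the data has reached the disk.
Flags of the same kind as O_DSYNC are O_SYNC, O_RSYNC and O_DIRECT:
    O_SYNC is stricter than O_DSYNC. Besides the data reaching the disk, the file's attributes (for example the file length) must also be updated before the write counts as successful, so O_SYNC does more work than O_DSYNC.
    O_RSYNC means that when the file is read, all of its data in the OS cache must already have been flushed to disk [Appendix 3].
    With O_DIRECT, reads and writes skip the OS cache and go directly to the device (disk). Because the OS cache is not used, O_DIRECT reduces the efficiency of sequential reads and writes.

2. Write stage
write(fd, buf, 6)
Once open has returned a file descriptor, we can call write to write data. How write behaves depends on the flags passed to open.

3. Flush stage
fdatasync(fd) == -1
After the write we also call fdatasync to make sure the file data has been flushed to disk. Once fdatasync returns success, the data can be considered written to disk. Other flush functions of this kind are fsync and sync.
The difference between fsync and fdatasync is the same as the difference between O_SYNC and O_DSYNC: fdatasync is similar to fsync but flushes only the data part of the file, not the metadata (modification time and so on).
sync only places the data held in the OS cache on the write queue and, per its POSIX description, may return before the data is actually on disk, so sync cannot be relied on.
Leaving aside how the file was opened, "writing a file" is usually described as two stages: calling write is the write-data stage (in practice affected by the open flags), and calling fsync (or fdatasync) is the flush stage.
Traditional UNIX implementations keep a buffer cache or page cache in the kernel, and most disk I/O goes through it. When data is written to a file, the kernel normally copies it into one of the buffers first. If that buffer is not yet full, it is not placed on the output queue; the kernel waits until the buffer fills up, or until it needs to reuse the buffer for other disk block data, and only then queues it. The actual I/O happens when the buffer reaches the head of the queue. This style of output is called delayed write (Chapter 3 of Bach [1986] discusses the buffer cache in detail).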
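The three stages can be put together as one C function. This is a sketch under stated assumptions rather than the original program: append_and_flush is a made-up name, O_CREAT is added so it works on a fresh file, and the file is opened without O_DSYNC so that the explicit fdatasync() in stage 3 is what forces the data out.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: append `len` bytes from `buf` to `path` and flush.
 * Returns 0 on success, -1 on failure. */
int append_and_flush(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    /* Stage 1: open for appending (no O_DSYNC or O_SYNC here). */
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd == -1)
        return -1;

    /* Stage 2: write. On return the data may still sit in the OS cache. */
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* Stage 3: flush the file data (not all metadata) to the device. */
    if (fdatasync(fd) == -1) {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

For example, append_and_flush("test.file", "hello\n", 6) mirrors the write(fd, buf, 6) call quoted above. Adding O_DSYNC to the open flags would make each write() wait for the device, and the fdatasync() call would then have little left to do.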
The innodb_flush_method options for Unix-like systems include:

fsync
    InnoDB uses the fsync() system call to flush both the data and log files. fsync is the default setting.

O_DSYNC
    InnoDB uses O_SYNC to open and flush the log files, and fsync() to flush the data files. InnoDB does not use O_DSYNC directly because there have been problems with it on many varieties of Unix.

littlesync
    This option is used for internal performance testing and is currently unsupported. Use at your own risk.

nosync
    This option is used for internal performance testing and is currently unsupported. Use at your own risk.

O_DIRECT
    InnoDB uses O_DIRECT (or directio() on Solaris) to open the data files, and uses fsync() to flush both the data and log files. This option is available on some GNU/Linux versions, FreeBSD, and Solaris.

O_DIRECT_NO_FSYNC
    InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call afterwards. This setting is suitable for some types of file systems but not others. For example, it is not suitable for XFS. If you are not sure whether the file system you use requires an fsync(), for example to preserve all file metadata, use O_DIRECT instead. This option was introduced in MySQL 5.6.7 (Bug #11754304, Bug #45892).