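On Linux we always need to store data somewhere: either in a file system (such as ext3) or on a raw device. When we use that data we access it through the file abstraction; the operating system hands us the data we ask for, and we usually never have to deal with block devices directly.
The figure below shows this clearly:
IO is a deeply layered subsystem with complicated data paths.
How the operating system stores and retrieves the data is a complete black box to us, and that is usually not a problem. But if our workload is IO-intensive, we need to understand how IO actually works, so that we avoid abusing IO and introducing design problems.
That is when you need a tool like blktrace.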
blktrace is a block layer IO tracing mechanism which provides detailed information about request queue operations up to user space.
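Its author, Jens Axboe, is the maintainer of the kernel IO subsystem and currently works at FusionIO. He is a very nice guy, and he is also the author of fio, the well-known IO benchmarking tool.
Related documentation:
Users guide: http://pdfedit.petricek.net/bt/file_download.php?file_id=17&type=bug
A guide written by people at HP: http://www.gelato.org/pdf/apr2006/gelato_ICE06apr_blktrace_brunelle_hp.pdf
A write-up by a guy on CU (ChinaUnix): http://linux.chinaunix.net/bbs/viewthread.php?tid=1115851&extra=&ordertype=2
Most Linux distributions now ship blktrace, so it is easy to install and use: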
$ sudo yum install blktrace
$ sudo blktrace /dev/sda5 -o - | blkparse -i -
8,5 2 1 0.000000000 0 C W 40247824 + 8 [0]
0,0 2 2 0.000040884 4271 A W 31105920 + 8 <- (8,3) 132600
8,5 2 3 0.000041214 4271 Q W 31105920 + 8 [(null)]
8,5 2 4 0.000045947 4271 G W 31105920 + 8 [(null)]
8,5 2 5 0.000046707 4271 P N [(null)]
8,5 2 6 0.000047073 4271 I W 31105920 + 8 [(null)]
0,0 2 7 0.000048282 4271 A W 31105928 + 8 <- (8,3) 132608
8,5 2 8 0.000048357 4271 Q W 31105928 + 8 [(null)]
8,5 2 9 0.000049137 4271 M W 31105928 + 8 [(null)]
0,0 2 10 0.000050167 4271 A W 31105936 + 8 <- (8,3) 132616
8,5 2 11 0.000050241 4271 Q W 31105936 + 8 [(null)]
8,5 2 12 0.000050417 4271 M W 31105936 + 8 [(null)]
0,0 2 13 0.000050984 4271 A W 31105944 + 8 <- (8,3) 132624
8,5 2 14 0.000051047 4271 Q W 31105944 + 8 [(null)]
8,5 2 15 0.000051258 4271 M W 31105944 + 8 [(null)]
8,5 2 16 0.000051829 4271 U N [(null)] 1
8,5 2 17 0.000052699 4271 D W 31105920 + 32 [(null)]
8,5 2 18 0.000108292 0 C W 31105920 + 32 [0]
0,0 2 19 0.000127791 4271 A W 31105952 + 8 <- (8,3) 132632
8,5 2 20 0.000128001 4271 Q W 31105952 + 8 [(null)]
8,5 2 21 0.000128874 4271 G W 31105952 + 8 [(null)]
8,5 2 22 0.000129373 4271 P N [(null)]
8,5 2 23 0.000129706 4271 I W 31105952 + 8 [(null)]
8,5 2 24 0.000130551 4271 U N [(null)] 1
8,5 2 25 0.000131330 4271 D W 31105952 + 8 [(null)]
8,5 2 26 0.000172705 0 C W 31105952 + 8 [0]
0,0 13 1 1266874889.709337223 4271 A W 40247824 + 8 <- (8,3) 9274504
8,5 13 2 1266874889.709338011 4271 Q W 40247824 + 8 [kjournald]
8,5 13 3 1266874889.709343974 4271 G W 40247824 + 8 [kjournald]
8,5 13 4 1266874889.709346653 4271 P N [kjournald]
8,5 13 5 1266874889.709347728 4271 I W 40247824 + 8 [kjournald]
8,5 13 6 1266874889.709350795 4271 U N [kjournald] 1
8,5 13 7 1266874889.709355396 4271 D W 40247824 + 8 [kjournald]
0,0 21 1 0.504685570 4267 A W 92640335 + 8 <- (8,6) 234392
8,5 21 2 0.504686212 4267 Q W 92640335 + 8 [kjournald]
8,5 21 3 0.504690614 4267 G W 92640335 + 8 [kjournald]
8,5 21 4 0.504691826 4267 P N [kjournald]
8,5 21 5 0.504692896 4267 I W 92640335 + 8 [kjournald]
0,0 21 6 0.504694268 4267 A W 92640343 + 8 <- (8,6) 234400
8,5 21 7 0.504694448 4267 Q W 92640343 + 8 [kjournald]
8,5 21 8 0.504695115 4267 M W 92640343 + 8 [kjournald]
0,0 21 9 0.504696227 4267 A W 92640351 + 8 <- (8,6) 234408
8,5 21 10 0.504696357 4267 Q W 92640351 + 8 [kjournald]
8,5 21 11 0.504696615 4267 M W 92640351 + 8 [kjournald]
0,0 21 12 0.504697422 4267 A W 92640359 + 8 <- (8,6) 234416
8,5 21 13 0.504697565 4267 Q W 92640359 + 8 [kjournald]
8,5 21 14 0.504697787 4267 M W 92640359 + 8 [kjournald]
0,0 21 15 0.504698549 4267 A W 92640367 + 8 <- (8,6) 234424
8,5 21 16 0.504698677 4267 Q W 92640367 + 8 [kjournald]
8,5 21 17 0.504698939 4267 M W 92640367 + 8 [kjournald]
8,5 21 18 0.504699954 4267 U N [kjournald] 1
8,5 21 19 0.504704050 4267 D W 92640335 + 40 [kjournald]
8,5 2 27 0.504810390 0 C W 92640335 + 40 [0]
0,0 2 28 0.504842324 4267 A W 92640375 + 8 <- (8,6) 234432
8,5 2 29 0.504842594 4267 Q W 92640375 + 8 [kjournald]
8,5 2 30 0.504844133 4267 G W 92640375 + 8 [kjournald]
8,5 2 31 0.504845233 4267 P N [kjournald]
8,5 2 32 0.504845703 4267 I W 92640375 + 8 [kjournald]
8,5 2 33 0.504846958 4267 U N [kjournald] 1
8,5 2 34 0.504848547 4267 D W 92640375 + 8 [kjournald]
8,5 2 35 0.504879109 0 C W 92640375 + 8 [0]
Reads Queued: 0, 0KiB Writes Queued: 6, 24KiB
Read Dispatches: 0, 0KiB Write Dispatches: 3, 24KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 5, 48KiB
Read Merges: 0, 0KiB Write Merges: 3, 12KiB
Read depth: 0 Write depth: 2
IO unplugs: 3 Timer unplugs: 0
Reads Queued: 0, 0KiB Writes Queued: 1, 4KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 4KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 0, 0KiB
Read depth: 0 Write depth: 2
IO unplugs: 1 Timer unplugs: 0
Reads Queued: 0, 0KiB Writes Queued: 5, 20KiB
Read Dispatches: 0, 0KiB Write Dispatches: 1, 20KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB
Read Merges: 0, 0KiB Write Merges: 4, 16KiB
Read depth: 0 Write depth: 2
IO unplugs: 1 Timer unplugs: 0
Reads Queued: 0, 0KiB Writes Queued: 12, 48KiB
Read Dispatches: 0, 0KiB Write Dispatches: 5, 48KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 0, 0KiB Writes Completed: 5, 48KiB
Read Merges: 0, 0KiB Write Merges: 7, 28KiB
IO unplugs: 5 Timer unplugs: 0
Throughput (R/W): 0KiB/s / 95KiB/s
Skips: 0 forward (0 - 0.0%)
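With this information we can see clearly what our IO device is doing and how much time it takes, and through it understand how our system behaves. The manual explains in detail how to read the output: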
$ man blkparse
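As a rough illustration of reading the trace by hand, the sketch below pairs each D event (request issued to the driver) with its matching C event (request completed) and prints how long the device took. This is not part of blktrace: the function name d2c_latency is made up here, and it assumes the default blkparse text format (timestamp in field 4, action in field 6, start sector in field 8, size in field 10) and a single traced device with no two identical requests in flight at once.

```shell
# d2c_latency: read blkparse text output on stdin and print, for every
# completed request, "<start sector> <sectors> <D-to-C latency in microseconds>".
# Assumption: default blkparse line layout
#   dev cpu seq time pid action rwbs sector + nsectors [process]
d2c_latency() {
    awk '
        # Remember when each request was handed to the driver.
        $6 == "D" && $9 == "+" { issued[$8 "+" $10] = $4 }

        # On completion, look up the matching dispatch and print the difference.
        $6 == "C" && $9 == "+" {
            key = $8 "+" $10
            if (key in issued) {
                printf "%s %s %.3f\n", $8, $10, ($4 - issued[key]) * 1e6
                delete issued[key]
            }
        }
    '
}
```

It can be fed straight from the pipeline, for example `sudo blktrace /dev/sda5 -o - | blkparse -i - | d2c_latency`. For a request of 32 sectors at 31105920 that is dispatched at 0.000052699 and completed at 0.000108292, it reports about 55.6 microseconds. Completions whose dispatch was never seen are skipped silently.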
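If you find this output too raw, tools such as btt and seekwatcher build on blktrace's data to dig deeper into the system's behavior, and they are also easier to use.
In our day-to-day work we have used blktrace to track down many problems, such as fsync latency and IO scheduler issues. It is a genuinely practical tool.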
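A typical btt run might look like the sketch below: record to files for a fixed time, merge the per-CPU files into one binary dump, then let btt summarize the latencies. The helper names run and trace_and_summarize and the DRY_RUN variable are hypothetical, introduced only for this example. It assumes blktrace, blkparse and btt are installed, debugfs is mounted, and the commands run as root.

```shell
# run: execute a command, or just print it when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# trace_and_summarize DEVICE [SECONDS] [BASENAME]
# Trace DEVICE for SECONDS (default 10), then print a btt summary.
trace_and_summarize() {
    ts_dev=$1
    ts_secs=${2:-10}
    ts_base=${3:-trace}

    # 1. Record events into per-CPU files named BASENAME.blktrace.N.
    run blktrace -d "$ts_dev" -w "$ts_secs" -o "$ts_base" &&
    # 2. Merge them into a single binary dump; -O suppresses the text output.
    run blkparse -i "$ts_base" -d "$ts_base.bin" -O &&
    # 3. Let btt compute its latency statistics from that dump.
    run btt -i "$ts_base.bin"
}
```

Calling `trace_and_summarize /dev/sda5 30` would trace the partition for 30 seconds. Setting `DRY_RUN=1` first only prints the three commands, which is a cheap way to check them before running anything as root.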