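I had the opportunity to apply these newly acquired capabilities while developing a small tool to trace when, and which parts of, each file is first accessed during a program's initialization. The request came with some constraints: the information had to be collected independently of the buffered I/O method in use (synchronous, aio, memory-mapped). It also had to be trivial to correlate the block data with the files being accessed and, most importantly, the tracing code must not have a large performance impact on the observed system.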

What is the best level to install the probe?
I started by investigating where we would want to place our probe:

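Tracing at the Block Layer level
A tracer in the Block Layer examines requests in terms of disk blocks, which are not directly related to file system files. Tracing at this level gives insight into which areas of the disk are being read or written, and whether they are physically laid out contiguously. It does not, however, give you a higher-level, per-file view of the system. Other tools already exist for tracing block-level accesses, such as the eBPF script biosnoop and the traditional blktrace.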

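Tracing at the file system level
A tracer at the file system level exposes data in terms of file blocks, each of which resolves to one or more data blocks on disk. In an example scenario with an Ext4 file system using a 4 KiB file block size, one system page may correspond to 1 physical block (on a 4K disk) or to 8 physical blocks on a disk with 512-byte sectors. Tracing at the file system level lets us look at the file in terms of offsets, so we can ignore disk fragmentation. Different installations may fragment the disk differently, and from the application's point of view we should not care about the disk layout so much as about which file blocks of data we need, in case we want to optimize by prefetching them.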
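The ratio between a file system block and the sectors that back it is plain division, and a short sketch makes it easy to check for other configurations. The helper name `sectors_per_file_block` is hypothetical and exists only for this illustration.

```python
def sectors_per_file_block(fs_block_size: int, sector_size: int) -> int:
    """Return how many disk sectors back one file system block."""
    if fs_block_size % sector_size != 0:
        raise ValueError("file system block must be a multiple of the sector size")
    return fs_block_size // sector_size


FS_BLOCK = 4096  # 4 KiB Ext4 block, as in the example scenario

print(sectors_per_file_block(FS_BLOCK, 4096))  # disk with 4K sectors
print(sectors_per_file_block(FS_BLOCK, 512))   # disk with 512-byte sectors
```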

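Tracing at the Page Cache level
The page cache is a structure that sits between the VFS/memory-mapping system and the file system layer, and is responsible for managing the portions of memory that have already been read from disk. By tracing the insertion and removal of pages in this cache, we get the "first access" behavior for free. When a new page is first accessed it is brought into the cache, and further accesses do not need to go to disk. If the page is eventually dropped from the cache because it is no longer needed, a new use has to go to disk again, and a new access entry is logged. For our performance investigation, this is exactly the behavior we are looking for.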
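The "first access for free" property can be illustrated with a toy model. This is only a sketch of the idea, not how the kernel implements its page cache; the class `ToyPageCache` and its methods are hypothetical names.

```python
class ToyPageCache:
    """Toy model: an access is logged only when the page is not cached."""

    def __init__(self):
        self.cached = set()  # (inode, page_index) pairs currently cached
        self.log = []        # one entry per miss, in order

    def access(self, inode, page_index):
        key = (inode, page_index)
        if key in self.cached:
            return "hit"      # served from memory, nothing to log
        self.cached.add(key)  # miss: the page is brought in from disk
        self.log.append(key)
        return "miss"

    def evict(self, inode, page_index):
        self.cached.discard((inode, page_index))


cache = ToyPageCache()
cache.access(14, 34)  # miss, logged
cache.access(14, 34)  # hit, not logged
cache.evict(14, 34)   # page dropped from the cache
cache.access(14, 34)  # miss again, logged a second time
print(cache.log)
```

Only misses reach the log, which is why a probe placed on the miss path stays silent for repeated accesses to data that is still cached.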

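The probe
The probe we implemented tracks page cache misses inside the kernel's page cache handling functions, so it identifies the first request for a block before that request is even submitted to the disk. Further requests to the same memory area (as long as the data is still mapped) result in a cache hit, which we neither care about nor trace. This keeps our code from interfering with subsequent accesses, greatly reducing the impact the probe can have on performance.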

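By tracing the page cache we are also able to distinguish blocks requested directly by the user program from blocks requested by the Read Ahead logic in the kernel. Knowing which blocks are read ahead is very useful information for application developers and system administrators, since it allows them to tune their systems or applications to prefetch the blocks they want in a sane manner.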

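As with any eBPF application, the code is quite simple. If we ignore some of the boilerplate needed to get eBPF compiling, the probe boils down to the following function: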

int fblktrace_read_pages(struct pt_regs *ctx, struct address_space *mapping,
                         struct list_head *pages, struct page *page,
                         unsigned nr_pages, bool is_readahead)
{
    u64 index;
    unsigned blkbits = mapping->host->i_blkbits;  /* log2 of the FS block size */
    unsigned long ino = mapping->host->i_ino;
    u64 block_in_file;

    /* Bounded loop: at most 32 pages are reported per call. */
    for (int i = 0; i < 32 && nr_pages--; i++) {
        if (pages) {
            /* Multi-page request: walk the page list backwards. */
            pages = pages->prev;
            page = container_of(pages, struct page, lru);
        }
        index = page->index;
        /* Computed but not printed; equals index when blocks are 4 KiB. */
        block_in_file = (unsigned long)index << (12 - blkbits);

        bpf_trace_printk("=> inode: %ld: FSBLK=%lu BSIZ=%lu %s\\n",
                         ino, index, 1 << blkbits, is_readahead ? "[RA]" : "");
    }
    return 0;
}

The function above is installed as a tracer on the kernel function ext4_mpage_readpages through the following snippet:

b.attach_kprobe(event="ext4_mpage_readpages", fn_name="fblktrace_read_pages")

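The probe runs every time the kernel page cache asks the file system to fetch some pages from disk for the first time. The region to be read is identified indirectly by the index of each page within the address space (the mapping argument) that the pages belong to. We use that information to compute the offset into the file, in file blocks, and pass it to the print function together with the inode number that identifies the file.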
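The shift used in the probe, `index << (12 - blkbits)`, converts a page index into the first file block contained in that page. The sketch below mirrors that arithmetic outside the kernel; it assumes 4 KiB pages (the constant 12 in the probe), and the names `first_block_in_page` and `byte_offset` are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, matching the constant in the probe


def first_block_in_page(page_index: int, blkbits: int) -> int:
    """First file block covered by a page, for blocks no larger than a page."""
    if blkbits > PAGE_SHIFT:
        raise ValueError("block size larger than the page size is not handled")
    return page_index << (PAGE_SHIFT - blkbits)


def byte_offset(page_index: int) -> int:
    """Byte offset of the page within the file."""
    return page_index << PAGE_SHIFT


print(first_block_in_page(34, 12))  # 4 KiB blocks: block number equals page index
print(first_block_in_page(34, 10))  # 1 KiB blocks: four blocks per page
print(byte_offset(100))
```

With 4 KiB blocks the shift is zero, which is why the page index and the file block number coincide in the traces of this article.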

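Usage example
For demonstration purposes, we wrote a small program called touchblk that reads a file in two ways: using the synchronous read/write system calls, and using the mmap facility. In both cases we read two arbitrarily chosen areas of the file, block 34 followed by block 100.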
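The source of touchblk is not shown here, so the following is only a hypothetical re-creation of the access pattern it is described as performing: touch block 34 and then block 100, once through read system calls and once through a memory mapping. The 4096-byte block size and the function names are assumptions made for the sketch.

```python
import mmap
import os

BLOCK_SIZE = 4096   # assumption: 4 KiB file blocks
BLOCKS = (34, 100)  # the two blocks named in the text


def touch_with_read(path):
    """Read one byte at the start of each block using lseek() + read()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        result = []
        for blk in BLOCKS:
            os.lseek(fd, blk * BLOCK_SIZE, os.SEEK_SET)
            result.append(os.read(fd, 1))
        return result
    finally:
        os.close(fd)


def touch_with_mmap(path):
    """Read the same bytes through a read-only memory mapping of the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return [m[blk * BLOCK_SIZE:blk * BLOCK_SIZE + 1] for blk in BLOCKS]
```

Calling either function on a file of at least 101 blocks touches the same two offsets, so the only difference the probe should see comes from the I/O path taken.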

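To run the probe you need the eBPF compiler shipped in the BCC tools package. Besides running this example, the BCC package, already available in many Linux distributions, contains a large number of eBPF-based probe examples that you can use to learn the tool and to write utilities specific to your needs.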

The bcc compiler is provided by the iovisor project:

https://github.com/iovisor/bcc

Now, let's see the probe in action.

Read/write system calls

[remote:root@fblktrace ~]$ ./fblktrace
Printing...
touchblk-2143 [002] d... 91137.791064: 0: => Open inode 14: fname=test.img
touchblk-2143 [002] .N.. 91137.811093: 0: => inode: 14: FSBLK=34 BSIZ=4096 [RA]
touchblk-2143 [002] .... 91137.828293: 0: => inode: 14: FSBLK=100 BSIZ=4096 [RA]

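The output above shows the exact behavior of the test application I described earlier. Since read/write will not necessarily trigger readahead, only two entries show up, one for each of the blocks we were looking for. The timestamp and the inode number are also given. To improve the output, a second probe was installed to map the inode number to a file name, but this is not required. It is only there to make the user's life easier.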
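For post-processing, trace lines like the ones above can be turned into records with a regular expression. This is a sketch that assumes the fields follow the probe's format string (inode, FSBLK, BSIZ and an optional [RA] marker); `parse_line` is a hypothetical helper, and the pattern is deliberately tolerant of extra whitespace.

```python
import re

LINE_RE = re.compile(
    r"inode:\s*(?P<ino>\d+)\s*:\s*FSBLK\s*=\s*(?P<blk>\d+)\s+"
    r"BSIZ\s*=\s*(?P<bsize>\d+)\s*(?P<ra>\[RA\])?"
)


def parse_line(line):
    """Return a dict for a block-access line, or None for any other line."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "inode": int(m["ino"]),
        "block": int(m["blk"]),
        "block_size": int(m["bsize"]),
        "readahead": m["ra"] is not None,
    }


sample = "touchblk-2143 [002] .N.. 91137.811093: 0: => inode: 14: FSBLK=34 BSIZ=4096 [RA]"
print(parse_line(sample))
```

Lines from the second probe, which report the file name for an opened inode, do not match the pattern and come back as None, so they can be filtered out or handled separately.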

Memory-mapped access
Below is the output of the memory-mapped version. It is long...

[remote:root@fblktrace ~]$ ./fblktrace
Printing...
touchblk-2147 [003] d... 91258.462486: 0: => Open inode 14: fname=image
touchblk-2147 [003] .... 91258.480927: 0: => inode: 14: FSBLK=18 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480940: 0: => inode: 14: FSBLK=19 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480942: 0: => inode: 14: FSBLK=20 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480943: 0: => inode: 14: FSBLK=21 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480944: 0: => inode: 14: FSBLK=22 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480945: 0: => inode: 14: FSBLK=23 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480946: 0: => inode: 14: FSBLK=24 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480947: 0: => inode: 14: FSBLK=25 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480948: 0: => inode: 14: FSBLK=26 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480949: 0: => inode: 14: FSBLK=27 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480952: 0: => inode: 14: FSBLK=28 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480952: 0: => inode: 14: FSBLK=29 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480954: 0: => inode: 14: FSBLK=30 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480955: 0: => inode: 14: FSBLK=31 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480955: 0: => inode: 14: FSBLK=32 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480956: 0: => inode: 14: FSBLK=33 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480957: 0: => inode: 14: FSBLK=34 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480958: 0: => inode: 14: FSBLK=35 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480959: 0: => inode: 14: FSBLK=36 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480960: 0: => inode: 14: FSBLK=37 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480961: 0: => inode: 14: FSBLK=38 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480962: 0: => inode: 14: FSBLK=39 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480963: 0: => inode: 14: FSBLK=40 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480966: 0: => inode: 14: FSBLK=41 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480967: 0: => inode: 14: FSBLK=42 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480968: 0: => inode: 14: FSBLK=43 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480969: 0: => inode: 14: FSBLK=44 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480970: 0: => inode: 14: FSBLK=45 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480971: 0: => inode: 14: FSBLK=46 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480972: 0: => inode: 14: FSBLK=47 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480973: 0: => inode: 14: FSBLK=48 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.480974: 0: => inode: 14: FSBLK=49 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498554: 0: => inode: 14: FSBLK=84 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498565: 0: => inode: 14: FSBLK=85 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498566: 0: => inode: 14: FSBLK=86 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498567: 0: => inode: 14: FSBLK=87 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498568: 0: => inode: 14: FSBLK=88 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498569: 0: => inode: 14: FSBLK=89 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498570: 0: => inode: 14: FSBLK=90 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498573: 0: => inode: 14: FSBLK=91 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498574: 0: => inode: 14: FSBLK=92 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498575: 0: => inode: 14: FSBLK=93 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498576: 0: => inode: 14: FSBLK=94 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498577: 0: => inode: 14: FSBLK=95 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498578: 0: => inode: 14: FSBLK=96 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498579: 0: => inode: 14: FSBLK=97 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498580: 0: => inode: 14: FSBLK=98 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498581: 0: => inode: 14: FSBLK=99 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498582: 0: => inode: 14: FSBLK=100 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498583: 0: => inode: 14: FSBLK=101 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498584: 0: => inode: 14: FSBLK=102 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498585: 0: => inode: 14: FSBLK=103 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498586: 0: => inode: 14: FSBLK=104 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498587: 0: => inode: 14: FSBLK=105 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498588: 0: => inode: 14: FSBLK=106 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498589: 0: => inode: 14: FSBLK=107 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498591: 0: => inode: 14: FSBLK=108 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498592: 0: => inode: 14: FSBLK=109 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498593: 0: => inode: 14: FSBLK=110 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498594: 0: => inode: 14: FSBLK=111 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498595: 0: => inode: 14: FSBLK=112 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498596: 0: => inode: 14: FSBLK=113 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498597: 0: => inode: 14: FSBLK=114 BSIZ=4096 [RA]
touchblk-2147 [003] .... 91258.498598: 0: => inode: 14: FSBLK=115 BSIZ=4096 [RA]

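Why is it so much longer than the read()/write() system call example? Because the kernel, in order to optimize expensive I/O operations, assumes that when a specific address is accessed in a file that is not being read sequentially, nearby areas will soon be needed too, so it performs Read Ahead (RA) I/O operations.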

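The kernel cannot assume that the next area needed will come right after the block that was accessed, so it fetches neighbors both before and after the target block. The number of neighbors it looks at is defined by a file-system-specific parameter in sysfs.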

[krisman@dilma sda2]$ cat /sys/fs/ext4/sda2/inode_readahead_blks
32

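This parameter instructs the kernel to load 32 blocks around the target block during readahead. If you go back to the output of the second version of the trace and count the blocks read for each of the two accesses, you will see that exactly 32 blocks were read each time: the 16 blocks immediately before the target, the target block itself, and the 15 blocks immediately after it. This gives a very interesting insight into how the readahead mechanism works.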
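The count can be reproduced from the FSBLK ranges in the memory-mapped trace, which run from 18 to 49 around block 34 and from 84 to 115 around block 100. A minimal sketch, where `split_window` is a hypothetical helper:

```python
def split_window(blocks, target):
    """Count how many traced blocks fall before and after the target block."""
    before = sum(1 for b in blocks if b < target)
    after = sum(1 for b in blocks if b > target)
    return before, after


first = range(18, 50)    # FSBLK 18..49, traced around block 34
second = range(84, 116)  # FSBLK 84..115, traced around block 100

print(len(first), split_window(first, 34))     # 32 (16, 15)
print(len(second), split_window(second, 100))  # 32 (16, 15)
```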

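Other types of I/O and limitations
This approach captures I/O accesses as they pass through the page cache, so non-buffered mechanisms such as Direct I/O are not traced. The example is also limited to ext4, but it could easily be extended to any other Linux file system.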

Full code
As usual, the full code is made available in the only way we know: under a free software license in a public repository. Enjoy!

https://gitlab.collabora.com/krisman/bcc/blob/master/tools/fblktrace.py