Linux Kernel Debugging Techniques: Fault-injection

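When developing kernel features or tracking down a problem, we often need to simulate kernel error conditions, such as a failed memory allocation or a disk IO error or timeout, either to verify the robustness of the code or to reproduce a problem faster. The Linux kernel ships with a practical facility for this called "Fault-injection", which can be used to construct a number of common kernel failure scenarios. It can simulate slab allocation failures, page allocation failures, disk IO errors, disk IO timeouts, futex errors, and IO errors specific to mmc, and users can also add their own injection points on top of the same mechanism. This article covers memory allocation and disk IO: it shows how to inject failures with Fault-injection and then analyzes the implementation in detail.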

Kernel version: Linux 4.11.y

Test environment: Raspberry Pi 3 (Rpi 3)


 

Fault-injection overview

Fault injection types

Fault-injection implements six injection types by default: failslab, fail_page_alloc, fail_futex, fail_make_request, fail_io_timeout and fail_mmc_request. Their functions are as follows:

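1) failslab

Injects allocation failures into the slab allocator, mainly kmalloc(), kmem_cache_alloc() and similar functions.

2) fail_page_alloc

Injects page allocation failures, mainly alloc_pages(), get_free_pages() and similar functions (a lower layer than failslab).

3) fail_futex

Injects futex deadlock and uaddr errors.

4) fail_make_request

Injects disk IO errors. It hooks generic_make_request() in the block core layer. Injection can be enabled for a specific disk or partition through /sys/block//make-it-fail or /sys/block///make-it-fail (for example /sys/block/sda/make-it-fail for a disk and /sys/block/sda/sda2/make-it-fail for a partition, the files used later in this article).

5) fail_io_timeout

Injects IO timeout errors. It hooks blk_complete_request(), the completion function in the IO path, and drops the completion "notification". It only works for drivers that use the generic timeout handling, for example the standard scsi path.

6) fail_mmc_request

Injects mmc data errors and only works for mmc devices. The error is injected by making the mmc core return a data error, which exercises the error handling and retry logic of the mmc block driver. It is configured through /sys/kernel/debug/mmcx/fail_mmc_request.

These six injection types are already implemented in the kernel. Users can add their own on top of the core framework by following the same pattern. This article analyzes the four most commonly used types, failslab, fail_page_alloc, fail_make_request and fail_io_timeout, in detail; the other two work in much the same way.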

 

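debugfs configuration for fault injection

Fault-injection provides a kernel option that enables a debugfs control interface for starting, stopping and tuning fault injection. The main files are:

1) /sys/kernel/debug/fail*/probability:

Sets the likelihood of a failure, in percent. If the minimum of 1% is still too frequent, set probability to 100 and use interval to control how often failures are triggered. The default is 0.

2) /sys/kernel/debug/fail*/interval:

Sets the interval between failures. To use it, set a value greater than 1 and set probability to 100. The default is 1.

3) /sys/kernel/debug/fail*/times:

Sets the maximum number of failures; once this many failures have been injected, no more occur. A value of -1 means no limit. The default is 1.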

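4) /sys/kernel/debug/fail*/space:

Sets a size budget. Each time an injection point is reached while the remaining budget is larger than the size of the current request, size is subtracted from space and no failure is injected; injection starts only after the budget has been used up. The meaning of size depends on the injection type: for IO failures it is the number of bytes in the IO, for memory allocation it is the size of the allocation. The default is 0.

5) /sys/kernel/debug/fail*/verbose:

Format: { 0 | 1 | 2 }

Controls the kernel log output when a failure is injected. 0 prints nothing; 1 prints the basic information that starts with "FAULT_INJECTION", including the type, interval, probability and so on; 2 additionally prints a backtrace, which is very useful when locating a problem. The default is 2.

6) /sys/kernel/debug/fail*/verbose_ratelimit_interval_ms, /sys/kernel/debug/fail*/verbose_ratelimit_burst:

Control the interval and burst parameters of the log ratelimit, which adjust how often log messages are printed. If failures are too frequent, some messages are dropped. The defaults are 0 and 10 respectively.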

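7) /sys/kernel/debug/fail*/task-filter:

Format: { 'Y' | 'N' }

Sets the task filter. N disables filtering. Y restricts injection to tasks that have make-it-fail enabled (set through /proc//make-it-fail=1); other tasks and code running in interrupt context do not trigger injection. The default is N.

8) /sys/kernel/debug/fail*/require-start, /sys/kernel/debug/fail*/require-end, /sys/kernel/debug/fail*/reject-start, /sys/kernel/debug/fail*/reject-end:

Filter by the virtual address ranges of the call chain. A failure is injected only if the call chain contains code (text segment) inside require-start~require-end and none inside reject-start~reject-end. This can be used to limit injection to one or several modules. The default required range is [0, ULONG_MAX) (the whole virtual address space) and the default rejected range is [0, 0).

9) /sys/kernel/debug/fail*/stacktrace-depth:

Sets the maximum call-stack depth that is searched when matching [require-start, require-end) and [reject-start, reject-end). The default is 32.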

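10) /sys/kernel/debug/fail_page_alloc/ignore-gfp-highmem:

Format: { 'Y' | 'N' }

Sets the highmem filter for page allocation. When set to Y, allocations whose flags include __GFP_HIGHMEM are not injected. The default is N.

11) /sys/kernel/debug/failslab/ignore-gfp-wait, /sys/kernel/debug/fail_page_alloc/ignore-gfp-wait:

Format: { 'Y' | 'N' }

Sets the allocation mode filter. When set to Y, only non-sleeping allocations (GFP_ATOMIC) are injected. The default is N.

12) /sys/kernel/debug/fail_page_alloc/min-order:

Sets the order filter for page allocation. Allocations whose order is smaller than this value are not injected. The default is 1.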
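The numeric attributes above are plain text files, so they can be set from a program as well as from the shell. The following is a minimal user-space sketch in C, not part of the kernel; the helper name fault_attr_write is hypothetical, and it only covers attributes that take a decimal number (probability, interval, times, space and so on).

```c
#include <stdio.h>

/*
 * Write a decimal value into <dir>/<name>, for example
 * dir = "/sys/kernel/debug/failslab" and name = "probability".
 * Hypothetical helper for illustration only.
 * Returns 0 on success and -1 on failure.
 */
int fault_attr_write(const char *dir, const char *name, long value)
{
	char path[512];
	FILE *f;
	int n, ok;

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || n >= (int)sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%ld\n", value) > 0;
	/* fclose() flushes the buffer, so a rejected write shows up here. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

With debugfs mounted at /sys/kernel/debug, calling fault_attr_write("/sys/kernel/debug/failslab", "probability", 100) would be the counterpart of echoing 100 into that file; it needs root privileges in the same way.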

 

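Boot parameter configuration

The debugfs interfaces described above are only available once debugfs is up. For the early boot stage, or when the debugfs option is not configured, the default values for Fault-injection can be passed as boot parameters: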

failslab=
fail_page_alloc=
fail_make_request=
fail_futex=
mmc_core.fail_request=,,,

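Boot parameters are limited: only interval, probability, space and times can be passed, in that order (the remaining attributes keep the kernel defaults), which is enough in most cases.

For example, to enable failslab with 100% probability and no limit from early boot, pass the following boot parameter: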

failslab=1,100,0,-1
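To make the field order concrete, here is a minimal user-space sketch in C that splits such a value into its four fields. The struct fault_args type and the function parse_fault_args are hypothetical names used only for this illustration; the sketch assumes the comma-separated order interval, probability, space, times described above.

```c
#include <stdio.h>

/* Hypothetical user-space holder for the four boot-time fields. */
struct fault_args {
	unsigned long interval;
	unsigned long probability;
	int space;
	int times;
};

/*
 * Parse "interval,probability,space,times".
 * Returns 1 on success and 0 on a malformed string; *out is only
 * written when all four fields were parsed.
 */
int parse_fault_args(const char *str, struct fault_args *out)
{
	struct fault_args tmp;

	if (sscanf(str, "%lu,%lu,%d,%d",
		   &tmp.interval, &tmp.probability,
		   &tmp.space, &tmp.times) < 4)
		return 0;

	*out = tmp;
	return 1;
}
```

For the value "1,100,0,-1" this yields interval 1, probability 100, space 0 and times -1, that is, every call fails and there is no upper limit.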


 

Using Fault-injection

Kernel configuration options

Fault-injection involves the following kernel configuration options, one per injection type, which can be enabled as needed:

CONFIG_FAULT_INJECTION: master switch
    CONFIG_FAILSLAB: enables failslab injection
    CONFIG_FAIL_PAGE_ALLOC: enables fail_page_alloc injection
    CONFIG_FAIL_MAKE_REQUEST: enables fail_make_request injection
    CONFIG_FAIL_IO_TIMEOUT: enables fail_io_timeout injection
    CONFIG_FAIL_MMC_REQUEST: enables fail_mmc_request injection
    CONFIG_FAIL_FUTEX: enables fail_futex injection
    CONFIG_FAULT_INJECTION_DEBUG_FS: enables the debugfs interface

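Only the four memory and IO injection types are covered here, so the five options CONFIG_FAULT_INJECTION, CONFIG_FAILSLAB, CONFIG_FAIL_PAGE_ALLOC, CONFIG_FAIL_IO_TIMEOUT and CONFIG_FAIL_MAKE_REQUEST need to be enabled. For convenience, also enable CONFIG_FAULT_INJECTION_DEBUG_FS to get dynamic configuration through debugfs, then rebuild and install the kernel.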


 

Using fail_make_request

In the debugfs mount point, the following directories now exist:

[root@centos-rpi3 debug]# ls | grep fail 
fail_futex
fail_io_timeout
fail_make_request
fail_page_alloc
failslab

The names show which injection type each directory configures. The fail_make_request directory contains the following configuration files:

[root@centos-rpi3 fail_make_request]# ls
interval     reject-start   space             times                    verbose_ratelimit_interval_ms
probability  require-end    stacktrace-depth  verbose
reject-end   require-start  task-filter       verbose_ratelimit_burst

These parameters were described earlier. The following example triggers make request errors with 100% probability and no upper limit:

[root@centos-rpi3 fail_make_request]# echo 1 > interval 
[root@centos-rpi3 fail_make_request]# echo -1 > times 
[root@centos-rpi3 fail_make_request]# echo 100 > probability

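The probability is set to 100% with no upper limit on the number of failures; the other parameters keep their default values. That completes the parameter configuration for fail_make_request. What remains is the per-device switch.

Every disk block device and partition has a make-it-fail file in its sysfs directory, for example under sda and mmcblk on my Raspberry Pi: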

[root@centos-rpi3 block]# find -name make-it-fail
./sda/sda2/make-it-fail
./sda/make-it-fail
./sda/sda1/make-it-fail

[root@centos-rpi3 mmcblk1]# find -name make-it-fail
./mmcblk1p3/make-it-fail
./make-it-fail
./mmcblk1p1/make-it-fail
./mmcblk1p4/make-it-fail
./mmcblk1p2/make-it-fail

The make-it-fail file is the fault injection switch of the corresponding block device. Writing 1 to it enables fault injection for that device:

[root@centos-rpi3 sda]# echo 1 > make-it-fail   
[root@centos-rpi3 sda]# dd if=/dev/zero of=/dev/sda2 bs=4k count=1 oflag=direct

[13744.902281] FAULT_INJECTION: forcing a failure.
[13744.902281] name fail_make_request, interval 1, probability 100, space 0, times -1

[13744.922972] CPU: 2 PID: 1649 Comm: dd Not tainted 4.11.0-v7+ #1
[13744.933280] Hardware name: BCM2835
[13744.941091] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[13744.957492] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[13744.973606] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[13744.989915] [<80490414>] (should_fail) from [<80433310>] (should_fail_request+0x28/0x30)
[13745.006664] [<80433310>] (should_fail_request) from [<804334ac>] (generic_make_request_checks+0xe4/0x668)
[13745.025003] [<804334ac>] (generic_make_request_checks) from [<80435868>] (generic_make_request+0x20/0x228)
[13745.043590] [<80435868>] (generic_make_request) from [<80435b18>] (submit_bio+0xa8/0x194)
[13745.060677] [<80435b18>] (submit_bio) from [<802b7cac>] (__blkdev_direct_IO_simple+0x158/0x2e0)
[13745.078294] [<802b7cac>] (__blkdev_direct_IO_simple) from [<802b8224>] (blkdev_direct_IO+0x3c4/0x400)
[13745.096168] [<802b8224>] (blkdev_direct_IO) from [<8021520c>] (generic_file_direct_write+0xac/0x1c0)
[13745.113872] [<8021520c>] (generic_file_direct_write) from [<802153e0>] (__generic_file_write_iter+0xc0/0x204)
[13745.132466] [<802153e0>] (__generic_file_write_iter) from [<802b8e50>] (blkdev_write_iter+0xb0/0x130)
[13745.150240] [<802b8e50>] (blkdev_write_iter) from [<80279d2c>] (__vfs_write+0xd4/0x124)
[13745.166717] [<80279d2c>] (__vfs_write) from [<8027b6d4>] (vfs_write+0xb0/0x1c4)
[13745.182487] [<8027b6d4>] (vfs_write) from [<8027ba40>] (SyS_write+0x4c/0x98)
[13745.198050] [<8027ba40>] (SyS_write) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)

As shown, the fail_make_request failure has been injected successfully for the sda2 device. Trying to mount the Ext4 file system on the device now fails:

[root@centos-rpi3 sda]# mount /dev/sda1 /mnt/
mount: /dev/sda1: can't read superblock


 

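Using fail_io_timeout

fail_io_timeout is used in much the same way as fail_make_request. The fail_io_timeout directory under the debugfs mount point contains the same configuration files, which are set up in the same way: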

[root@centos-rpi3 fail_io_timeout]# echo 1 > interval 
[root@centos-rpi3 fail_io_timeout]# echo -1 > times
[root@centos-rpi3 fail_io_timeout]# echo 100 > probability

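After the configuration, injection also has to be enabled for the block device. The interface is /sys/block/sdx/io-timeout-fail. Note that this failure can only be injected for a whole disk (struct gendisk), not for a partition.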

[root@centos-rpi3 sda]# echo 1 > io-timeout-fail 
[root@centos-rpi3 sda]# dd if=/dev/zero of=/dev/sda2 bs=4k count=1 oflag=direct

[15198.056490] FAULT_INJECTION: forcing a failure.
[15198.056490] name fail_io_timeout, interval 1, probability 100, space 0, times -1
[15198.081768] CPU: 0 PID: 1405 Comm: usb-storage Not tainted 4.11.0-v7+ #1
[15198.097541] Hardware name: BCM2835
[15198.105454] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[15198.122090] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[15198.138443] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[15198.155013] [<80490414>] (should_fail) from [<8043edbc>] (blk_should_fake_timeout+0x30/0x38)
[15198.172426] [<8043edbc>] (blk_should_fake_timeout) from [<8043ed68>] (blk_complete_request+0x20/0x44)
[15198.190450] [<8043ed68>] (blk_complete_request) from [<8053e2c4>] (scsi_done+0x24/0x98)
[15198.207148] [<8053e2c4>] (scsi_done) from [<805a5b34>] (usb_stor_control_thread+0x130/0x28c)
[15198.224387] [<805a5b34>] (usb_stor_control_thread) from [<8013c0b0>] (kthread+0x12c/0x168)
[15198.241451] [<8013c0b0>] (kthread) from [<80108268>] (ret_from_fork+0x14/0x2c)

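Because the completion call is dropped, ordinary IO times out and is retried. The dd command, however, waits in the synchronous call path below for the barrier command to return, so it enters the D state and triggers a hung task warning: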

[15235.156646] INFO: task dd:1738 blocked for more than 120 seconds.
[15235.167371]       Not tainted 4.11.0-v7+ #1
[15235.176039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[15235.393326] dd              D    0  1738   1371 0x00000000
[15235.403207] [<80723ccc>] (__schedule) from [<8072447c>] (schedule+0x44/0xa8)
[15235.418950] [<8072447c>] (schedule) from [<80727bdc>] (schedule_timeout+0x1f8/0x338)
[15235.435528] [<80727bdc>] (schedule_timeout) from [<80724f44>] (wait_for_common+0xe8/0x190)
[15235.452572] [<80724f44>] (wait_for_common) from [<8072500c>] (wait_for_completion+0x20/0x24)
[15235.469657] [<8072500c>] (wait_for_completion) from [<80134cac>] (flush_work+0x11c/0x1a0)
[15235.486675] [<80134cac>] (flush_work) from [<801369a0>] (__cancel_work_timer+0x138/0x208)
[15235.503478] [<801369a0>] (__cancel_work_timer) from [<80136a8c>] (cancel_delayed_work_sync+0x1c/0x20)
[15235.521301] [<80136a8c>] (cancel_delayed_work_sync) from [<8044ad58>] (disk_block_events+0x74/0x78)
[15235.538733] [<8044ad58>] (disk_block_events) from [<802b9838>] (__blkdev_get+0x108/0x430)
[15235.555331] [<802b9838>] (__blkdev_get) from [<802b9cfc>] (blkdev_get+0x19c/0x310)
[15235.571404] [<802b9cfc>] (blkdev_get) from [<802ba40c>] (blkdev_open+0x7c/0x88)
[15235.587364] [<802ba40c>] (blkdev_open) from [<80276f08>] (do_dentry_open+0x100/0x30c)
[15235.603973] [<80276f08>] (do_dentry_open) from [<80278530>] (vfs_open+0x60/0x8c)
[15235.620265] [<80278530>] (vfs_open) from [<8028929c>] (path_openat+0x410/0xef4)
[15235.636666] [<8028929c>] (path_openat) from [<8028ac30>] (do_filp_open+0x70/0xc4)
[15235.653430] [<8028ac30>] (do_filp_open) from [<802788fc>] (do_sys_open+0x11c/0x1d4)
[15235.670269] [<802788fc>] (do_sys_open) from [<802789e0>] (SyS_open+0x2c/0x30)
[15235.686748] [<802789e0>] (SyS_open) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)

Running touch on an ext4 file system on the disk produces the following hung task warning:

[ 1964.356864] INFO: task touch:1363 blocked for more than 120 seconds.
[ 1964.367450]       Not tainted 4.11.0-v7+ #1
[ 1964.375846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1964.392079] touch           D    0  1363   1078 0x00000000
[ 1964.401834] [<80723ccc>] (__schedule) from [<8072447c>] (schedule+0x44/0xa8)
[ 1964.417429] [<8072447c>] (schedule) from [<80149104>] (io_schedule+0x20/0x40)
[ 1964.432969] [<80149104>] (io_schedule) from [<80724998>] (bit_wait_io+0x1c/0x64)
[ 1964.448858] [<80724998>] (bit_wait_io) from [<80724d08>] (__wait_on_bit+0x94/0xcc)
[ 1964.465131] [<80724d08>] (__wait_on_bit) from [<80724e50>] (out_of_line_wait_on_bit+0x78/0x84)
[ 1964.482734] [<80724e50>] (out_of_line_wait_on_bit) from [<802b1dcc>] (__wait_on_buffer+0x3c/0x44)
[ 1964.500710] [<802b1dcc>] (__wait_on_buffer) from [<80308a44>] (ext4_read_inode_bitmap+0x6b0/0x758)
[ 1964.518775] [<80308a44>] (ext4_read_inode_bitmap) from [<803095b0>] (__ext4_new_inode+0x470/0x15dc)
[ 1964.536764] [<803095b0>] (__ext4_new_inode) from [<8031c064>] (ext4_create+0xb0/0x178)
[ 1964.553559] [<8031c064>] (ext4_create) from [<8028996c>] (path_openat+0xae0/0xef4)
[ 1964.570016] [<8028996c>] (path_openat) from [<8028ac30>] (do_filp_open+0x70/0xc4)
[ 1964.586650] [<8028ac30>] (do_filp_open) from [<802788fc>] (do_sys_open+0x11c/0x1d4)
[ 1964.603310] [<802788fc>] (do_sys_open) from [<802789e0>] (SyS_open+0x2c/0x30)
[ 1964.619605] [<802789e0>] (SyS_open) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)

After fault injection is turned off, the hung tasks recover.

 

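Using failslab for memory allocation

The failslab directory under the debugfs mount point has a similar set of configuration files, plus two of its own: ignore-gfp-wait and cache-filter. The former is a switch that excludes __GFP_RECLAIM allocations from injection; the latter restricts injection to the slab caches selected by the user, so that the system does not break down the moment injection is enabled.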

[root@centos-rpi3 failslab]# ls
cache-filter     probability   require-end    stacktrace-depth  verbose
ignore-gfp-wait  reject-end    require-start  task-filter       verbose_ratelimit_burst
interval         reject-start  space          times             verbose_ratelimit_interval_ms

The kmem_cache types to inject into are selected through /sys/kernel/slab/xxx/failslab:

[root@centos-rpi3 slab]# ls /sys/kernel/slab
:at-0000016   :t-0001024             dentry                   inotify_inode_mark    nsproxy
:at-0000024   :t-0001536             dio                      ip4-frags             pid
:at-0000032   :t-0002048             discard_cmd              ip_dst_cache          pid_namespace
:at-0000040   :t-0003072             discard_entry            ip_fib_alias          pool_workqueue
:at-0000048   :t-0004032             dmaengine-unmap-2        ip_fib_trie           posix_timers_cache
:at-0000064   :t-0004096             dnotify_mark             ip_mrt_cache          proc_inode_cache
:at-0000072   :t-0008192             dnotify_struct           jbd2_inode            radix_tree_node
:at-0000104   :tA-0000032            dquot                    jbd2_journal_handle   request_queue
:at-0000112   :tA-0000064            eventpoll_epi            jbd2_journal_head     request_sock_TCP
:at-0000184   :tA-0000088            eventpoll_pwq            jbd2_revoke_record_s  rpc_buffers
:at-0000192   :tA-0000128            ext4_allocation_context  jbd2_revoke_table_s   rpc_inode_cache
:atA-0000136  :tA-0000256            ext4_extent_status       jbd2_transaction_s    rpc_tasks
:atA-0000528  :tA-0000448            ext4_free_data           kernfs_node_cache     scsi_data_buffer
:t-0000024    :tA-0000704            ext4_groupinfo_4k        key_jar               scsi_sense_cache
:t-0000032    :tA-0003776            ext4_inode_cache         kioctx                sd_ext_cdb
:t-0000040    PING                   ext4_io_end              kmalloc-1024          secpath_cache
:t-0000048    RAW                    ext4_prealloc_space      kmalloc-128           sgpool-128
:t-0000056    TCP                    ext4_system_zone         kmalloc-192           sgpool-16
:t-0000064    UDP                    f2fs_extent_node         kmalloc-2048          sgpool-32
:t-0000080    UDP-Lite               f2fs_extent_tree         kmalloc-256           sgpool-64
:t-0000088    UNIX                   f2fs_ino_entry           kmalloc-4096          sgpool-8
:t-0000112    aio_kiocb              f2fs_inode_cache         kmalloc-512           shmem_inode_cache
:t-0000120    anon_vma               f2fs_inode_entry         kmalloc-64            sighand_cache
:t-0000128    anon_vma_chain         fanotify_event_info      kmalloc-8192          signal_cache
:t-0000144    bdev_cache             fasync_cache             kmem_cache            sigqueue
:t-0000152    bio-0                  fat_cache                kmem_cache_node       sit_entry_set
:t-0000176    bio-1                  fat_inode_cache          mbcache               skbuff_fclone_cache
:t-0000192    biovec-128             file_lock_cache          mm_struct             skbuff_head_cache
:t-0000208    biovec-16              file_lock_ctx            mnt_cache             sock_inode_cache
:t-0000256    biovec-256             files_cache              mqueue_inode_cache    task_delay_info
:t-0000320    biovec-64              filp                     names_cache           task_group
:t-0000328    blkdev_ioc             flow_cache               nat_entry             task_struct
:t-0000344    blkdev_requests        free_nid                 nat_entry_set         taskstats
:t-0000384    bsg_cmd                fs_cache                 net_namespace         tcp_bind_bucket
:t-0000448    buffer_head            fscache_cookie_jar       nfs_commit_data       trace_event_file
:t-0000512    cachefiles_object_jar  fsnotify_mark            nfs_direct_cache      tw_sock_TCP
:t-0000576    cfq_io_cq              ftrace_event_field       nfs_inode_cache       uid_cache
:t-0000704    cfq_queue              inet_peer_cache          nfs_page              user_namespace
:t-0000768    configfs_dir_cache     inmem_page_entry         nfs_read_data         vm_area_struct
:t-0000904    cred_jar               inode_cache              nfs_write_data        xfrm_dst_cache

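The entries named :t-0000xxx mostly correspond to the fixed-size caches used by kmalloc(), and many of the other entries are symbolic links that point to them: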

[root@centos-rpi3 slab]# ll kmalloc-1024
lrwxrwxrwx 1 root root 0 May 21 09:26 kmalloc-1024 -> :t-0001024

The following example injects failures into ext4_inode_cache:

[root@centos-rpi3 failslab]# echo -1 > times
[root@centos-rpi3 failslab]# echo 100 > probability
[root@centos-rpi3 failslab]# echo 1 > cache-filter
[root@centos-rpi3 failslab]# echo 1 > /sys/kernel/slab/ext4_inode_cache/failslab
[root@centos-rpi3 failslab]# echo N >  ignore-gfp-wait

Once enabled, running commands that create files on the ext4 file system prints injection messages like the following:

[  157.633204] FAULT_INJECTION: forcing a failure.
[  157.633204] name failslab, interval 1, probability 100, space 0, times -1
[  157.659029] CPU: 1 PID: 379 Comm: in:imjournal Not tainted 4.11.0-v7+ #1
[  157.675420] Hardware name: BCM2835
[  157.683660] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[  157.701176] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[  157.718337] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[  157.735689] [<80490414>] (should_fail) from [<8026950c>] (should_failslab+0x60/0x8c)
[  157.753309] [<8026950c>] (should_failslab) from [<80266120>] (kmem_cache_alloc+0x44/0x230)
[  157.771433] [<80266120>] (kmem_cache_alloc) from [<80323390>] (ext4_alloc_inode+0x24/0x104)
[  157.789647] [<80323390>] (ext4_alloc_inode) from [<80296254>] (alloc_inode+0x2c/0xb0)
[  157.807213] [<80296254>] (alloc_inode) from [<80297d90>] (new_inode_pseudo+0x18/0x5c)
[  157.824793] [<80297d90>] (new_inode_pseudo) from [<80297df0>] (new_inode+0x1c/0x30)
[  157.842200] [<80297df0>] (new_inode) from [<803091d0>] (__ext4_new_inode+0x90/0x15dc)
[  157.859751] [<803091d0>] (__ext4_new_inode) from [<8031c064>] (ext4_create+0xb0/0x178)
[  157.877373] [<8031c064>] (ext4_create) from [<8028996c>] (path_openat+0xae0/0xef4)
[  157.894604] [<8028996c>] (path_openat) from [<8028ac30>] (do_filp_open+0x70/0xc4)
[  157.911620] [<8028ac30>] (do_filp_open) from [<802788fc>] (do_sys_open+0x11c/0x1d4)
[  157.928965] [<802788fc>] (do_sys_open) from [<802789e0>] (SyS_open+0x2c/0x30)
[  157.945881] [<802789e0>] (SyS_open) from [<801081e0>] (ret_fast_syscall+0x0/0x1c)
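The two failslab-specific switches can be summarized as a small predicate. The following is a user-space sketch in C based only on the description above, not on the kernel source; all names are hypothetical, and SIM_GFP_RECLAIM is a made-up bit that stands in for __GFP_RECLAIM.

```c
#include <stdbool.h>

/* Made-up flag bit for this sketch; not the real __GFP_RECLAIM value. */
#define SIM_GFP_RECLAIM 0x1u

/* Hypothetical copy of the two failslab-specific switches. */
struct sim_failslab {
	bool ignore_gfp_wait;	/* debugfs: ignore-gfp-wait */
	bool cache_filter;	/* debugfs: cache-filter */
};

/* Hypothetical slab cache with its per-cache failslab switch. */
struct sim_cache {
	const char *name;
	bool failslab;		/* /sys/kernel/slab/<name>/failslab */
};

/* Returns true when an allocation may be considered for injection. */
bool sim_failslab_eligible(const struct sim_failslab *f,
			   const struct sim_cache *c, unsigned int gfp)
{
	/* Allocations that may reclaim (sleep) are skipped on request. */
	if (f->ignore_gfp_wait && (gfp & SIM_GFP_RECLAIM))
		return false;
	/* With cache-filter on, only caches marked by the user qualify. */
	if (f->cache_filter && !c->failslab)
		return false;
	return true;
}
```

In the ext4_inode_cache example, cache-filter is on and only ext4_inode_cache is marked, so an allocation from any other cache would be rejected by the second check.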

 

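Using fail_page_alloc for memory allocation

The fail_page_alloc directory under the debugfs mount point has a similar set of configuration files, plus three of its own: min-order, ignore-gfp-wait and ignore-gfp-highmem. ignore-gfp-wait is a switch that excludes __GFP_DIRECT_RECLAIM allocations from injection, ignore-gfp-highmem is a switch that excludes highmem allocations (__GFP_HIGHMEM), and min-order filters by allocation order: only allocations whose order is not smaller than this value can be injected.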

[root@centos-rpi3 fail_page_alloc]# ls
ignore-gfp-highmem  probability   require-start     times
ignore-gfp-wait     reject-end    space             verbose
interval            reject-start  stacktrace-depth  verbose_ratelimit_burst
min-order           require-end   task-filter       verbose_ratelimit_interval_ms

[root@centos-rpi3 fail_page_alloc]# echo 2 > times
[root@centos-rpi3 fail_page_alloc]# echo 100 > probability

[18950.321696] FAULT_INJECTION: forcing a failure.
[18950.321696] name fail_page_alloc, interval 1, probability 100, space 0, times -1
[18950.439402] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O    4.11.0-v7+ #1
[18950.516300] Hardware name: BCM2835
[18950.553611] [<8010f4a0>] (unwind_backtrace) from [<8010ba24>] (show_stack+0x20/0x24)
[18950.628930] [<8010ba24>] (show_stack) from [<80465264>] (dump_stack+0xc0/0x114)
[18950.703954] [<80465264>] (dump_stack) from [<80490414>] (should_fail+0x198/0x1ac)
[18950.780256] [<80490414>] (should_fail) from [<8021d208>] (__alloc_pages_nodemask+0xc0/0xf88)
[18950.858001] [<8021d208>] (__alloc_pages_nodemask) from [<8021e210>] (page_frag_alloc+0x68/0x188)
[18950.937115] [<8021e210>] (page_frag_alloc) from [<80625774>] (__netdev_alloc_skb+0xb0/0x154)
[18951.016021] [<80625774>] (__netdev_alloc_skb) from [<8055fe0c>] (rx_submit+0x3c/0x20c)
[18951.094485] [<8055fe0c>] (rx_submit) from [<80560450>] (rx_complete+0x1e0/0x204)
[18951.172404] [<80560450>] (rx_complete) from [<80569aa4>] (__usb_hcd_giveback_urb+0x80/0x154)
[18951.251386] [<80569aa4>] (__usb_hcd_giveback_urb) from [<80569cc8>] (usb_hcd_giveback_urb+0x4c/0xf4)
[18951.331268] [<80569cc8>] (usb_hcd_giveback_urb) from [<805931e8>] (completion_tasklet_func+0x6c/0x98)
[18951.411890] [<805931e8>] (completion_tasklet_func) from [<805a1060>] (tasklet_callback+0x20/0x24)
[18951.493310] [<805a1060>] (tasklet_callback) from [<80122f74>] (tasklet_hi_action+0x74/0x108)
[18951.575189] [<80122f74>] (tasklet_hi_action) from [<8010162c>] (__do_softirq+0x134/0x3ac)
[18951.658448] [<8010162c>] (__do_softirq) from [<80122b44>] (irq_exit+0xf8/0x164)
[18951.741056] [<80122b44>] (irq_exit) from [<80175384>] (__handle_domain_irq+0x68/0xc0)
[18951.824219] [<80175384>] (__handle_domain_irq) from [<801014f0>] (bcm2836_arm_irqchip_handle_irq+0xa8/0xb0)
[18951.909282] [<801014f0>] (bcm2836_arm_irqchip_handle_irq) from [<807293fc>] (__irq_svc+0x5c/0x7c)

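Of course, memory allocation failures are rarely meant to take effect globally, and there are not many filters available, so users usually need to narrow the scope (for example to a module or a call path) and tune the probability.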
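The three fail_page_alloc filters can be written down as a predicate. This is a user-space sketch in C that follows the descriptions of min-order, ignore-gfp-highmem and ignore-gfp-wait given in this article; the type and function names are hypothetical, and the SIM_GFP_* bits are made up for the example rather than taken from the kernel headers.

```c
#include <stdbool.h>

/* Made-up flag bits for this sketch; the real __GFP_* values differ. */
#define SIM_GFP_HIGHMEM        0x1u
#define SIM_GFP_DIRECT_RECLAIM 0x2u

/* Hypothetical copy of the three fail_page_alloc-specific settings. */
struct sim_page_alloc_filter {
	bool ignore_gfp_highmem;
	bool ignore_gfp_wait;
	unsigned int min_order;
};

/* Returns true when a page allocation may be considered for injection. */
bool sim_page_alloc_eligible(const struct sim_page_alloc_filter *f,
			     unsigned int gfp, unsigned int order)
{
	/* Orders below min-order are never injected. */
	if (order < f->min_order)
		return false;
	if (f->ignore_gfp_highmem && (gfp & SIM_GFP_HIGHMEM))
		return false;
	if (f->ignore_gfp_wait && (gfp & SIM_GFP_DIRECT_RECLAIM))
		return false;
	return true;
}
```

With the default min-order of 1, an order-0 (single page) allocation is rejected by the first check, which keeps the most common allocations out of reach unless min-order is lowered to 0.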

 

Fault-injection implementation

Core data structure

/*
 * For explanation of the elements of this struct, see
 * Documentation/fault-injection/fault-injection.txt
 */
struct fault_attr {
	unsigned long probability;
	unsigned long interval;
	atomic_t times;
	atomic_t space;
	unsigned long verbose;
	bool task_filter;
	unsigned long stacktrace_depth;
	unsigned long require_start;
	unsigned long require_end;
	unsigned long reject_start;
	unsigned long reject_end;

	unsigned long count;
	struct ratelimit_state ratelimit_state;
	struct dentry *dname;
};
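This structure is the core of the fault-injection implementation. Most of its fields should look familiar, because they map directly to the debugfs configuration files. The last three fields are used internally: count counts how many times the injection point has been reached, ratelimit_state controls the log output rate, and dname identifies the injection type (fail_make_request, failslab and so on). The following sections trace the code paths of fail_make_request, fail_io_timeout, failslab and fail_page_alloc one by one.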

fail_make_request

static DECLARE_FAULT_ATTR(fail_make_request);

static int __init setup_fail_make_request(char *str)
{
	return setup_fault_attr(&fail_make_request, str);
}
__setup("fail_make_request=", setup_fail_make_request);
The code first statically defines a struct fault_attr instance named fail_make_request to describe this injection type. DECLARE_FAULT_ATTR is a macro:

#define FAULT_ATTR_INITIALIZER {					\
		.interval = 1,						\
		.times = ATOMIC_INIT(1),				\
		.require_end = ULONG_MAX,				\
		.stacktrace_depth = 32,					\
		.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED,	\
		.verbose = 2,						\
		.dname = NULL,						\
	}

#define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
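Here the common fields of fail_make_request are initialized to their default values. The __setup macro then registers setup_fail_make_request as the handler for the "fail_make_request=xxx" boot parameter, which is processed during kernel initialization. The handler calls the generic function setup_fault_attr() to initialize the fail_make_request structure further.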
/*
 * setup_fault_attr() is a helper function for various __setup handlers, so it
 * returns 0 on error, because that is what __setup handlers do.
 */
int setup_fault_attr(struct fault_attr *attr, char *str)
{
	unsigned long probability;
	unsigned long interval;
	int times;
	int space;

	/* ",,," */
	if (sscanf(str, "%lu,%lu,%d,%d",
			&interval, &probability, &space, &times) < 4) {
		printk(KERN_WARNING
			"FAULT_INJECTION: failed to parse arguments\n");
		return 0;
	}

	attr->probability = probability;
	attr->interval = interval;
	atomic_set(&attr->times, times);
	atomic_set(&attr->space, space);

	return 1;
}
EXPORT_SYMBOL_GPL(setup_fault_attr);
As described earlier, boot parameters only accept interval, probability, space and times; setup_fault_attr() parses them and stores them in the fail_make_request structure. The debugfs entry point looks like this:

static int __init fail_make_request_debugfs(void)
{
	struct dentry *dir = fault_create_debugfs_attr("fail_make_request",
						NULL, &fail_make_request);

	return PTR_ERR_OR_ZERO(dir);
}

late_initcall(fail_make_request_debugfs);
This function is also called during kernel initialization. It creates an attribute directory named fail_make_request under debugfs, as follows:

struct dentry *fault_create_debugfs_attr(const char *name,
			struct dentry *parent, struct fault_attr *attr)
{
	umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
	struct dentry *dir;

	dir = debugfs_create_dir(name, parent);
	if (!dir)
		return ERR_PTR(-ENOMEM);

	if (!debugfs_create_ul("probability", mode, dir, &attr->probability))
		goto fail;
	if (!debugfs_create_ul("interval", mode, dir, &attr->interval))
		goto fail;
	if (!debugfs_create_atomic_t("times", mode, dir, &attr->times))
		goto fail;
	if (!debugfs_create_atomic_t("space", mode, dir, &attr->space))
		goto fail;
	if (!debugfs_create_ul("verbose", mode, dir, &attr->verbose))
		goto fail;
	if (!debugfs_create_u32("verbose_ratelimit_interval_ms", mode, dir,
				&attr->ratelimit_state.interval))
		goto fail;
	if (!debugfs_create_u32("verbose_ratelimit_burst", mode, dir,
				&attr->ratelimit_state.burst))
		goto fail;
	if (!debugfs_create_bool("task-filter", mode, dir, &attr->task_filter))
		goto fail;

#ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER

	if (!debugfs_create_stacktrace_depth("stacktrace-depth", mode, dir,
				&attr->stacktrace_depth))
		goto fail;
	if (!debugfs_create_ul("require-start", mode, dir,
				&attr->require_start))
		goto fail;
	if (!debugfs_create_ul("require-end", mode, dir, &attr->require_end))
		goto fail;
	if (!debugfs_create_ul("reject-start", mode, dir, &attr->reject_start))
		goto fail;
	if (!debugfs_create_ul("reject-end", mode, dir, &attr->reject_end))
		goto fail;

#endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */

	attr->dname = dget(dir);
	return dir;
fail:
	debugfs_remove_recursive(dir);

	return ERR_PTR(-ENOMEM);
}
EXPORT_SYMBOL_GPL(fault_create_debugfs_attr);
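The parent passed in is NULL, so the fail_make_request directory is created in the debugfs root. The attribute files seen earlier (probability, interval, times and so on) are then created in that directory, and finally the dentry of the directory is saved in attr->dname. The switch interface of the block device is defined as follows: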

#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
	__ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store);
#endif
...
#ifdef CONFIG_FAIL_MAKE_REQUEST
ssize_t part_fail_show(struct device *dev,
		       struct device_attribute *attr, char *buf)
{
	struct hd_struct *p = dev_to_part(dev);

	return sprintf(buf, "%d\n", p->make_it_fail);
}

ssize_t part_fail_store(struct device *dev,
			struct device_attribute *attr,
			const char *buf, size_t count)
{
	struct hd_struct *p = dev_to_part(dev);
	int i;

	if (count > 0 && sscanf(buf, "%d", &i) > 0)
		p->make_it_fail = (i == 0) ? 0 : 1;

	return count;
}
#endif
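This is a sysfs interface. When the user writes a non-zero value to /sys/block/sdx/make-it-fail, the make_it_fail field of the corresponding struct hd_struct is set to 1 and the switch is on; writing 0 sets it to 0 and the switch is off.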
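The store semantics are easy to overlook: any parsable non-zero integer turns the switch on, and input that cannot be parsed leaves the previous state unchanged. A minimal user-space sketch in C of the same logic follows; struct part_switch and part_switch_store are hypothetical stand-ins for struct hd_struct and part_fail_store.

```c
#include <stdio.h>

/* Hypothetical stand-in for the make_it_fail field of struct hd_struct. */
struct part_switch {
	int make_it_fail;
};

/* Any parsable non-zero integer turns the switch on, zero turns it off. */
void part_switch_store(struct part_switch *p, const char *buf)
{
	int i;

	if (buf[0] != '\0' && sscanf(buf, "%d", &i) > 0)
		p->make_it_fail = (i == 0) ? 0 : 1;
	/* Unparsable input leaves the previous state untouched. */
}
```

Writing "1", "42" or "-5" therefore all enable injection for the device, while writing "abc" has no effect.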

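With the configuration interfaces covered, we come to the main question: how does fail_make_request actually inject a failure, and how does it decide whether to inject?

Start with the generic IO submission path: submit_bio()->generic_make_request()->generic_make_request_checks():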

static noinline_for_stack bool
generic_make_request_checks(struct bio *bio)
{
	...

	part = bio->bi_bdev->bd_part;
	if (should_fail_request(part, bio->bi_iter.bi_size) ||
	    should_fail_request(&part_to_disk(part)->part0,
				bio->bi_iter.bi_size))
		goto end_io;

	...
}
generic_make_request_checks() calls should_fail_request() to decide whether to inject a failure. If it returns true (inject), the IO submission stops there, the caller generic_make_request() returns the cookie BLK_QC_T_NONE, and the failure has been injected.

static bool should_fail_request(struct hd_struct *part, unsigned int bytes)
{
	return part->make_it_fail && should_fail(&fail_make_request, bytes);
}
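Whether a failure is injected depends on should_fail_request(): true means inject, false means do not. Its first argument is the hd_struct of the disk or partition, the second is the number of bytes in this IO. This is where the make_it_fail switch in hd_struct takes effect; it is ANDed with the result of the generic function should_fail(). should_fail() is the core of the injection decision and evaluates the parameters configured in struct fault_attr.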

/*
 * This code is stolen from failmalloc-1.0
 * http://www.nongnu.org/failmalloc/
 */

bool should_fail(struct fault_attr *attr, ssize_t size)
{
	/* No need to check any other properties if the probability is 0 */
	if (attr->probability == 0)
		return false;

	if (attr->task_filter && !fail_task(attr, current))
		return false;

	if (atomic_read(&attr->times) == 0)
		return false;

	if (atomic_read(&attr->space) > size) {
		atomic_sub(size, &attr->space);
		return false;
	}

	if (attr->interval > 1) {
		attr->count++;
		if (attr->count % attr->interval)
			return false;
	}

	if (attr->probability <= prandom_u32() % 100)
		return false;

	if (!fail_stacktrace(attr))
		return false;

	fail_dump(attr);

	if (atomic_read(&attr->times) != -1)
		atomic_dec_not_zero(&attr->times);

	return true;
}
EXPORT_SYMBOL_GPL(should_fail);
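1) If probability is 0, do not inject;

2) If the task filter is enabled, call fail_task(): if the make_it_fail flag of the current task is not set, or the code is running in interrupt context, do not inject;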

static bool fail_task(struct fault_attr *attr, struct task_struct *task)
{
	return !in_interrupt() && task->make_it_fail;
}

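3) If the remaining injection count (times) has dropped to 0, do not inject;

4) If the remaining size budget (space) is larger than the number of bytes in this IO, subtract the size from the budget and do not inject;

5) If interval is greater than 1, increment the call counter; if the counter is not a multiple of interval, do not inject;

6) Apply the probability: prandom_u32() % 100 yields a random value below 100, and the failure is skipped when probability is not greater than that value;

7) If the kernel is built with CONFIG_FAULT_INJECTION_STACKTRACE_FILTER, fail_stacktrace() checks the code addresses in the call stack against the configured ranges [require-start, require-end) and [reject-start, reject-end). (The function also shows how to obtain the call-stack addresses, a handy technique worth noting):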

#ifdef CONFIG_FAULT_INJECTION_STACKTRACE_FILTER

static bool fail_stacktrace(struct fault_attr *attr)
{
	struct stack_trace trace;
	int depth = attr->stacktrace_depth;
	unsigned long entries[MAX_STACK_TRACE_DEPTH];
	int n;
	bool found = (attr->require_start == 0 && attr->require_end == ULONG_MAX);

	if (depth == 0)
		return found;

	trace.nr_entries = 0;
	trace.entries = entries;
	trace.max_entries = depth;
	trace.skip = 1;

	save_stack_trace(&trace);
	for (n = 0; n < trace.nr_entries; n++) {
		if (attr->reject_start <= entries[n] &&
			       entries[n] < attr->reject_end)
			return false;
		if (attr->require_start <= entries[n] &&
			       entries[n] < attr->require_end)
			found = true;
	}
	return found;
}

#else

static inline bool fail_stacktrace(struct fault_attr *attr)
{
	return true;
}

#endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */
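The variable found starts out as true when the require range still has its default value, so that by default every call chain passes (injection is allowed); if stacktrace-depth is 0 this value is returned without walking the stack. Otherwise save_stack_trace() captures the call stack upwards, level by level. The depth is given by attr->stacktrace_depth, with a maximum of 32 levels, and the address of each calling function is stored in the trace.entries array. Each entry is then checked, first against [reject_start, reject_end) and then against [require-start, require-end).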
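The effect of the two ranges is easier to see with concrete numbers. The following user-space sketch in C applies the same reject-then-require check to an array of addresses; struct range_filter and stack_allows_injection are hypothetical names, and the addresses used with it are invented values, not real kernel text addresses.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space copy of the four range fields. */
struct range_filter {
	unsigned long require_start, require_end;
	unsigned long reject_start, reject_end;
};

/*
 * entries[] holds the return addresses of a call chain.
 * Any address in [reject_start, reject_end) vetoes the injection;
 * otherwise at least one address must fall in
 * [require_start, require_end), unless that range is the default.
 */
bool stack_allows_injection(const struct range_filter *f,
			    const unsigned long *entries, size_t n)
{
	bool found = (f->require_start == 0 && f->require_end == ULONG_MAX);
	size_t i;

	for (i = 0; i < n; i++) {
		if (f->reject_start <= entries[i] &&
		    entries[i] < f->reject_end)
			return false;
		if (f->require_start <= entries[i] &&
		    entries[i] < f->require_end)
			found = true;
	}
	return found;
}
```

Suppose a module's text occupied the invented range 0x2000 to 0x2800: setting require-start and require-end to that range would allow injection only for call chains that pass through the module, while setting reject-start and reject-end to it would exclude exactly those chains.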

Back in should_fail(): if all of the checks above pass, the failure can be injected. The last step before injecting is to print the log message:

static void fail_dump(struct fault_attr *attr)
{
	if (attr->verbose > 0 && __ratelimit(&attr->ratelimit_state)) {
		printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n"
		       "name %pd, interval %lu, probability %lu, "
		       "space %d, times %d\n", attr->dname,
		       attr->interval, attr->probability,
		       atomic_read(&attr->space),
		       atomic_read(&attr->times));
		if (attr->verbose > 1)
			dump_stack();
	}
}
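These messages have already appeared in the usage sections above. If verbose is 2, dump_stack() is also called to print the kernel call stack.

That completes the analysis of fail_make_request, including the core of the framework. The remaining three injection types work in much the same way.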
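The interaction of probability, interval, times and space can be tried out in user space. The following C sketch reproduces the order of checks walked through above, leaving out the task filter, the stack filter and the logging. struct sim_attr and sim_should_fail are hypothetical names, plain int fields replace the atomic counters, and rand() stands in for prandom_u32(); these are assumptions of the sketch, not kernel behavior.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space subset of struct fault_attr. */
struct sim_attr {
	unsigned long probability;	/* percent, 0..100 */
	unsigned long interval;
	int times;			/* -1 means unlimited */
	int space;
	unsigned long count;
};

/* Same order of checks as should_fail(), minus the filters and logging. */
bool sim_should_fail(struct sim_attr *a, int size)
{
	if (a->probability == 0)
		return false;
	if (a->times == 0)
		return false;
	if (a->space > size) {
		a->space -= size;	/* consume the budget first */
		return false;
	}
	if (a->interval > 1) {
		a->count++;
		if (a->count % a->interval)
			return false;
	}
	/* rand() replaces prandom_u32() in this sketch. */
	if (a->probability <= (unsigned long)(rand() % 100))
		return false;
	if (a->times != -1)
		a->times--;
	return true;
}
```

With probability 100 the random check can never skip the failure, so the result is deterministic: interval 4 and times -1 fail exactly every fourth call, and times 2 with interval 1 fails the first two calls and none after that.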

 


fail_io_timeout

static DECLARE_FAULT_ATTR(fail_io_timeout);

static int __init setup_fail_io_timeout(char *str)
{
	return setup_fault_attr(&fail_io_timeout, str);
}
__setup("fail_io_timeout=", setup_fail_io_timeout);

fail_io_timeout is also defined with the DECLARE_FAULT_ATTR macro, and its boot parameter is handled by setup_fail_io_timeout().

static int __init fail_io_timeout_debugfs(void)
{
	struct dentry *dir = fault_create_debugfs_attr("fail_io_timeout",
						NULL, &fail_io_timeout);

	return PTR_ERR_OR_ZERO(dir);
}

The debugfs interface is created in the debugfs root by fail_io_timeout_debugfs(). Both points are the same as for fail_make_request.

int blk_should_fake_timeout(struct request_queue *q)
{
	if (!test_bit(QUEUE_FLAG_FAIL_IO, &q->queue_flags))
		return 0;

	return should_fail(&fail_io_timeout, 1);
}

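blk_should_fake_timeout() decides whether to inject a fault. Before calling should_fail(), it checks the feature switch, which here is the QUEUE_FLAG_FAIL_IO flag. The flag is set through the /sys/block/sdx/io-timeout-fail interface, whose sysfs handlers are: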

ssize_t part_timeout_show(struct device *dev, struct device_attribute *attr,
			  char *buf)
{
	struct gendisk *disk = dev_to_disk(dev);
	int set = test_bit(QUEUE_FLAG_FAIL_IO, &disk->queue->queue_flags);

	return sprintf(buf, "%d\n", set != 0);
}

ssize_t part_timeout_store(struct device *dev, struct device_attribute *attr,
			   const char *buf, size_t count)
{
	struct gendisk *disk = dev_to_disk(dev);
	int val;

	if (count) {
		struct request_queue *q = disk->queue;
		char *p = (char *) buf;

		val = simple_strtoul(p, &p, 10);
		spin_lock_irq(q->queue_lock);
		if (val)
			queue_flag_set(QUEUE_FLAG_FAIL_IO, q);
		else
			queue_flag_clear(QUEUE_FLAG_FAIL_IO, q);
		spin_unlock_irq(q->queue_lock);
	}

	return count;
}

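When the user writes a non-zero value, part_timeout_store() sets QUEUE_FLAG_FAIL_IO in the queue_flags of the disk's struct request_queue, which turns on fail_io_timeout injection for that disk. Writing 0 clears the flag and turns the switch off.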
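For test programs that need to flip this switch themselves, a small helper is enough. The sketch below is only an illustration: write_knob() and enable_io_timeout_injection() are hypothetical names, and it assumes debugfs is mounted at /sys/kernel/debug, the kernel was built with the fault injection debugfs options, and the program runs as root. The file names are the ones described in this article.

```c
#include <assert.h>
#include <stdio.h>

/* Write a short string to a sysfs/debugfs file; 0 on success, -1 on error. */
int write_knob(const char *path, const char *value)
{
	FILE *fp;
	int ok;

	assert(path != NULL && value != NULL);

	fp = fopen(path, "w");
	if (fp == NULL)
		return -1;
	ok = (fputs(value, fp) >= 0);
	if (fclose(fp) != 0)	/* the write may only fail at flush time */
		ok = 0;
	return ok ? 0 : -1;
}

/* Configure the generic knobs, then turn the per-disk switch on. */
int enable_io_timeout_injection(const char *disk)
{
	char path[128];
	int n;

	assert(disk != NULL);

	n = snprintf(path, sizeof(path), "/sys/block/%s/io-timeout-fail", disk);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	/* Always inject, with no limit on the number of injections. */
	if (write_knob("/sys/kernel/debug/fail_io_timeout/probability", "100") ||
	    write_knob("/sys/kernel/debug/fail_io_timeout/times", "-1"))
		return -1;

	return write_knob(path, "1");
}
```

A call such as enable_io_timeout_injection("sda") configures the generic knobs first and sets the per-disk flag last, so that no completions are dropped before the configuration is in place. Writing "0" to the same per-disk file switches injection off again.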

blk_should_fake_timeout() is called from two places, which are the injection points:

1. blk_complete_request

void blk_complete_request(struct request *req)
{
	if (unlikely(blk_should_fake_timeout(req->q)))
		return;
	if (!blk_mark_rq_complete(req))
		__blk_complete_request(req);
}
EXPORT_SYMBOL(blk_complete_request);

The low-level hardware driver calls this function once the underlying IO has completed or failed. In the normal flow it calls __blk_complete_request(), which raises a BLOCK_SOFTIRQ softirq:

void __blk_complete_request(struct request *req)
{
	...
do_local:
		if (list->next == &req->ipi_list)
			raise_softirq_irqoff(BLOCK_SOFTIRQ);

	...
}
static __latent_entropy void blk_done_softirq(struct softirq_action *h)
{
	...
	while (!list_empty(&local_list)) {
		struct request *rq;

		rq = list_entry(local_list.next, struct request, ipi_list);
		list_del_init(&rq->ipi_list);
		rq->q->softirq_done_fn(rq);
	}
}

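BLOCK_SOFTIRQ is handled by blk_done_softirq(), which calls back the function registered in the request_queue's softirq_done_fn pointer. For a SCSI device, for example, the flow continues as scsi_softirq_done()->scsi_finish_command()->scsi_io_completion()->scsi_end_request()->blk_update_request()->req_bio_endio()->bio_endio(), which completes the IO and finally notifies the upper layer.

If a fault is injected in blk_complete_request(), however, the completion callback is deliberately dropped instead of being passed up, and the request eventually times out. The call flow is:

blk_timeout_work()->blk_rq_check_expired()->blk_rq_timed_out()->scsi_times_out(). The SCSI driver then handles the timeout, and a background work queue periodically retries the IO.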

2. blk_mq_complete_request

This is the completion callback of the multi-queue block layer. IO takes this branch when the kernel's multi-queue support is enabled.

scsi_mq_done()->blk_mq_complete_request()->__blk_mq_complete_request()->blk_mq_end_request()->blk_update_request()

 

failslab

The failslab control structure:

static struct {
	struct fault_attr attr;
	bool ignore_gfp_reclaim;
	bool cache_filter;
} failslab = {
	.attr = FAULT_ATTR_INITIALIZER,
	.ignore_gfp_reclaim = true,
	.cache_filter = false,
};

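failslab wraps the generic struct fault_attr and adds two parameters of its own, ignore_gfp_reclaim and cache_filter. The former is a switch that excludes allocations carrying __GFP_RECLAIM, and the latter restricts injection to the slab caches the user selects. The two parameters let the user target specific kmem_caches, and they also prevent the system from failing so broadly once failslab is enabled that no further debugging is possible.

The failslab boot-time configuration interface: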

static int __init setup_failslab(char *str)
{
	return setup_fault_attr(&failslab.attr, str);
}
__setup("failslab=", setup_failslab);

The failslab debugfs configuration interface:

static int __init failslab_debugfs_init(void)
{
	struct dentry *dir;
	umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;

	dir = fault_create_debugfs_attr("failslab", NULL, &failslab.attr);
	if (IS_ERR(dir))
		return PTR_ERR(dir);

	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
				&failslab.ignore_gfp_reclaim))
		goto fail;
	if (!debugfs_create_bool("cache-filter", mode, dir,
				&failslab.cache_filter))
		goto fail;

	return 0;
fail:
	debugfs_remove_recursive(dir);

	return -ENOMEM;
}

When creating its debugfs directory and files, failslab adds two extra files, ignore-gfp-wait and cache-filter, which correspond to the ignore_gfp_reclaim and cache_filter members of the failslab structure.

The failslab cache_filter configuration interface:

#ifdef CONFIG_FAILSLAB
static ssize_t failslab_show(struct kmem_cache *s, char *buf)
{
	return sprintf(buf, "%d\n", !!(s->flags & SLAB_FAILSLAB));
}

static ssize_t failslab_store(struct kmem_cache *s, const char *buf,
							size_t length)
{
	if (s->refcount > 1)
		return -EINVAL;

	s->flags &= ~SLAB_FAILSLAB;
	if (buf[0] == '1')
		s->flags |= SLAB_FAILSLAB;
	return length;
}
SLAB_ATTR(failslab);
#endif

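This interface is located at /sys/kernel/slab/xxx/failslab. Writing a value that starts with '1' sets SLAB_FAILSLAB in the kmem_cache flags, and any other value clears it. The write is rejected with -EINVAL when the cache's refcount is greater than 1. The SLAB_FAILSLAB flag is checked in the injection decision function should_failslab():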

bool should_failslab(struct kmem_cache *s, gfp_t gfpflags)
{
	/* No fault-injection for bootstrap cache */
	if (unlikely(s == kmem_cache))
		return false;

	if (gfpflags & __GFP_NOFAIL)
		return false;

	if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_RECLAIM))
		return false;

	if (failslab.cache_filter && !(s->flags & SLAB_FAILSLAB))
		return false;

	return should_fail(&failslab.attr, s->object_size);
}

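The function never injects into the bootstrap cache (kmem_cache) or into allocations marked __GFP_NOFAIL. Next, if ignore_gfp_reclaim is enabled, allocations carrying __GFP_RECLAIM are skipped. If the cache_filter is enabled, allocations from caches without SLAB_FAILSLAB are skipped as well. Finally, should_fail() performs the generic checks.

Now let's see where failslab injects the fault: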

static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
						     gfp_t flags)
{
	flags &= gfp_allowed_mask;
	lockdep_trace_alloc(flags);
	might_sleep_if(gfpflags_allow_blocking(flags));

	if (should_failslab(s, flags))
		return NULL;

	if (memcg_kmem_enabled() &&
	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
		return memcg_kmem_get_cache(s);

	return s;
}

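The normal allocation call flow is kmem_cache_alloc()->slab_alloc()->slab_alloc_node()->slab_pre_alloc_hook() (the kmalloc path is similar). If the check decides to inject a fault, the hook returns NULL right away, the allocation fails, and the fault has been injected.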
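What failslab ultimately exercises is the caller's handling of a NULL return. The same idea can be tried in userspace with a tiny injector. The sketch below is not related to the kernel implementation: fi_reset(), fi_malloc(), struct pair and pair_init() are hypothetical names, and the injector simply fails the Nth allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace injector: fail the Nth call (1-based); 0 disables. */
static int fail_at;
static int call_count;

void fi_reset(int nth)
{
	fail_at = nth;
	call_count = 0;
}

void *fi_malloc(size_t size)
{
	if (fail_at && ++call_count == fail_at)
		return NULL;	/* injected allocation failure */
	return malloc(size);
}

struct pair {
	char *a;
	char *b;
};

/* Returns 0 on success, -1 on failure; releases partial state on failure. */
int pair_init(struct pair *p)
{
	assert(p != NULL);

	p->a = fi_malloc(16);
	if (!p->a)
		goto fail;
	p->b = fi_malloc(16);
	if (!p->b)
		goto free_a;
	return 0;

free_a:
	free(p->a);
	p->a = NULL;
fail:
	p->b = NULL;
	return -1;
}
```

Calling fi_reset(1) and then fi_reset(2) before pair_init() walks through both error paths in turn. Kernel code tested under failslab needs the same kind of cleanup on every path.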

 

fail_page_alloc

fail_page_alloc injects faults at a lower level than failslab, directly in the buddy allocator. It also has its own structure definition:

static struct {
	struct fault_attr attr;

	bool ignore_gfp_highmem;
	bool ignore_gfp_reclaim;
	u32 min_order;
} fail_page_alloc = {
	.attr = FAULT_ATTR_INITIALIZER,
	.ignore_gfp_reclaim = true,
	.ignore_gfp_highmem = true,
	.min_order = 1,
};

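On top of the generic struct fault_attr parameters it adds three more: ignore_gfp_reclaim, ignore_gfp_highmem and min_order. The first is a switch that excludes allocations carrying __GFP_DIRECT_RECLAIM. The second is a switch that excludes high-memory allocations (__GFP_HIGHMEM). The last filters on the allocation order: only allocations whose order is greater than or equal to min_order can have a fault injected.

The fail_page_alloc boot parameter interface: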

static int __init setup_fail_page_alloc(char *str)
{
	return setup_fault_attr(&fail_page_alloc.attr, str);
}
__setup("fail_page_alloc=", setup_fail_page_alloc);

The fail_page_alloc debugfs configuration interface:

static int __init fail_page_alloc_debugfs(void)
{
	umode_t mode = S_IFREG | S_IRUSR | S_IWUSR;
	struct dentry *dir;

	dir = fault_create_debugfs_attr("fail_page_alloc", NULL,
					&fail_page_alloc.attr);
	if (IS_ERR(dir))
		return PTR_ERR(dir);

	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
				&fail_page_alloc.ignore_gfp_reclaim))
		goto fail;
	if (!debugfs_create_bool("ignore-gfp-highmem", mode, dir,
				&fail_page_alloc.ignore_gfp_highmem))
		goto fail;
	if (!debugfs_create_u32("min-order", mode, dir,
				&fail_page_alloc.min_order))
		goto fail;

	return 0;
fail:
	debugfs_remove_recursive(dir);

	return -ENOMEM;
}

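In addition to the generic files it creates three more, ignore-gfp-wait, ignore-gfp-highmem and min-order, which correspond to the ignore_gfp_reclaim, ignore_gfp_highmem and min_order members of the fail_page_alloc structure.

Next comes the injection decision function: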

static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
{
	if (order < fail_page_alloc.min_order)
		return false;
	if (gfp_mask & __GFP_NOFAIL)
		return false;
	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
		return false;
	if (fail_page_alloc.ignore_gfp_reclaim &&
			(gfp_mask & __GFP_DIRECT_RECLAIM))
		return false;

	return should_fail(&fail_page_alloc.attr, 1 << order);
}

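1) If the order of the requested allocation is smaller than the min_order filter, no fault is injected.

2) If the request has __GFP_NOFAIL set, no fault is injected either.

3) If the ignore_gfp_highmem switch is on, no fault is injected into high-memory allocations that have __GFP_HIGHMEM set.

4) If the ignore_gfp_reclaim switch is on, no fault is injected into allocations that have __GFP_DIRECT_RECLAIM set.

5) Finally, should_fail() performs the generic checks, with the number of pages (1 << order) passed as the size.

fail_page_alloc injects the fault in prepare_alloc_pages(), which is part of the page allocation flow: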

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
		struct zonelist *zonelist, nodemask_t *nodemask,
		struct alloc_context *ac, gfp_t *alloc_mask,
		unsigned int *alloc_flags)
{
	ac->high_zoneidx = gfp_zone(gfp_mask);
	ac->zonelist = zonelist;
	ac->nodemask = nodemask;
	ac->migratetype = gfpflags_to_migratetype(gfp_mask);

	if (cpusets_enabled()) {
		*alloc_mask |= __GFP_HARDWALL;
		if (!ac->nodemask)
			ac->nodemask = &cpuset_current_mems_allowed;
		else
			*alloc_flags |= ALLOC_CPUSET;
	}

	lockdep_trace_alloc(gfp_mask);

	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

	if (should_fail_alloc_page(gfp_mask, order))
		return false;

	if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
		*alloc_flags |= ALLOC_CMA;

	return true;
}

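One of the normal page allocation flows is page_frag_alloc()->alloc_pages_node()->__alloc_pages_node()->__alloc_pages()->__alloc_pages_nodemask()->prepare_alloc_pages(). The function returns true when the allocation may proceed. When a fault is injected it returns false, no page is allocated, and the caller has to handle the failure.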

 

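Summary

When writing and debugging kernel code, it is easy to consider only the normal execution path and to leave uncommon error paths without effective handling. The resulting lack of robustness tends to surface later as kernel hangs, panics and other outcomes that users do not want to see. This article covered four common fault injection types in the kernel, for disk IO and memory allocation. They can simulate failure scenarios while debugging or while trying to reproduce a problem, which makes development and verification considerably more efficient.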

 

References:

Documentation/fault-injection/fault-injection.txt

