Linux Device Drivers, Part 4: Driver Debugging Methods, Using Oops Messages to Locate a Line of Code as an Example

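In the previous part we sketched how to write a simple character device driver. Nobody writes flawless code, so debugging is a constant part of development. In an ordinary C application we usually print with printf or step through the program with gdb, but how do we debug a driver? The bugs we hit most often are wild pointers and out-of-bounds array accesses. In a user-space program these produce a segmentation fault. A driver runs in kernel space, so the same bug typically makes the kernel emit an oops message and can bring the whole system down. How do we read an oops, and how do we get from it to the exact line of code at fault? The simple example below shows how.
How to locate the faulting line from an Oops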

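We reuse the hello world module from Part 2 of this series (Building and Running Modules) to demonstrate the failure. The hello world module with the bug added is as follows: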

#include <linux/init.h>
#include <linux/module.h>
MODULE_LICENSE("Dual BSD/GPL");

static int hello_init(void)
{
        char *p = NULL;
        memcpy(p, "test", 4);   /* BUG: writes 4 bytes through a NULL pointer */
        printk(KERN_ALERT "Hello, world\n");
        return 0;
}
static void hello_exit(void)
{

        printk(KERN_ALERT "Goodbye, cruel world\n");
}

module_init(hello_init);
module_exit(hello_exit);

The Makefile is as follows:

ifneq ($(KERNELRELEASE),)
obj-m := helloworld.o
else
KERNELDIR ?= /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
        $(MAKE) -C $(KERNELDIR) M=$(PWD) modules
endif

clean:
        rm -rf *.o *~ core .depend .*.cmd *.ko *.mod.c .tmp_versions modules.order  Module.symvers

Clearly, line 8 of the code above (the memcpy call) writes through a null pointer. After insmod, the following oops message appears:

[  459.516441] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  459.516445] 
[  459.516448] PGD 0 
[  459.516450] Oops: 0002 [#1] SMP 
[  459.516452] Modules linked in: helloworld(OE+) vmw_vsock_vmci_transport vsock coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel vmw_balloon snd_ens1371 aes_x86_64 lrw snd_ac97_codec gf128mul glue_helper ablk_helper cryptd ac97_bus gameport snd_pcm serio_raw snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer vmwgfx btusb ttm snd drm_kms_helper drm soundcore shpchp vmw_vmci i2c_piix4 rfcomm bnep bluetooth 6lowpan_iphc parport_pc ppdev mac_hid lp parport hid_generic usbhid hid psmouse ahci libahci floppy e1000 vmw_pvscsi vmxnet3 mptspi mptscsih mptbase scsi_transport_spi pata_acpi [last unloaded: helloworld]
[  459.516476] CPU: 0 PID: 4531 Comm: insmod Tainted: G           OE 3.16.0-33-generic #44~14.04.1-Ubuntu
[  459.516478] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
[  459.516479] task: ffff88003821f010 ti: ffff880038fa0000 task.ti: ffff880038fa0000
[  459.516480] RIP: 0010:[]  [] hello_init+0xd/0x30 [helloworld]
[  459.516483] RSP: 0018:ffff880038fa3d40  EFLAGS: 00010246
[  459.516484] RAX: ffff88000c31d901 RBX: ffffffff81c1a020 RCX: 000000000004b29f
[  459.516485] RDX: 000000000004b29e RSI: 0000000000000017 RDI: ffffffffc0615024
[  459.516485] RBP: ffff880038fa3db8 R08: 0000000000015e80 R09: ffff88003d615e80
[  459.516486] R10: ffffea000030c740 R11: ffffffff81002138 R12: ffff88000c31d0c0
[  459.516487] R13: 0000000000000000 R14: ffffffffc0614000 R15: ffffffffc0616000
[  459.516488] FS:  00007f8a6fa86740(0000) GS:ffff88003d600000(0000) knlGS:0000000000000000
[  459.516489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  459.516490] CR2: 0000000000000000 CR3: 0000000038760000 CR4: 00000000003407f0
[  459.516522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  459.516524] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  459.516524] Stack:
[  459.516537]  ffff880038fa3db8 ffffffff81002144 0000000000000001 0000000000000001
[  459.516540]  0000000000000001 ffff880028ab5040 0000000000000001 ffff880038fa3da0
[  459.516541]  ffffffff8119d0b2 ffffffffc0616018 00000000bd1141ac ffffffffc0616018
[  459.516543] Call Trace:
[  459.516548]  [] ? do_one_initcall+0xd4/0x210
[  459.516550]  [] ? __vunmap+0xb2/0x100
[  459.516554]  [] load_module+0x13c1/0x1b80
[  459.516557]  [] ? store_uevent+0x40/0x40
[  459.516560]  [] SyS_finit_module+0x86/0xb0
[  459.516563]  [] system_call_fastpath+0x1a/0x1f
[  459.516564] Code:  04 25 00 00 00 00 74 65 73 74 31 c0 48 89 e5 e8 a2 86 14 c1 31 
[  459.516573] RIP  [] hello_init+0xd/0x30 [helloworld]
[  459.516575]  RSP 
[  459.516576] CR2: 0000000000000000
[  459.516578] ---[ end trace 7c52cc8624b7ea60 ]---

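Let's briefly walk through the oops.
The line BUG: unable to handle kernel NULL pointer dereference at (null) tells us the cause is a null pointer dereference. The RIP line, hello_init+0xd/0x30 [helloworld] (highlighted in red in the original post), identifies the function where the fault occurred. The [helloworld] tag on that line, together with helloworld heading the Modules linked in list, identifies the module responsible. Call Trace lists the chain of function calls that led to the fault. Of all this, the RIP line is the most useful, because it lets us find the exact line of code at fault. The steps below show how.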
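The Oops: 0002 field in the log is also informative. On x86 it is the page-fault error code, whose low bits say whether the page was present, whether the access was a write, and whether it came from user mode. The following is a minimal user-space sketch of that decoding; the helper name describe_oops_code and the PF_* macro names are made up for illustration and are not kernel APIs.

```c
#include <stdio.h>

/* Low bits of the x86 page-fault error code (defined by the hardware). */
#define PF_PROT  0x1u  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2u  /* 0: read access,      1: write access         */
#define PF_USER  0x4u  /* 0: kernel mode,      1: user mode            */

/*
 * Hypothetical helper: write a readable summary of the error code into buf.
 * Returns the number of characters snprintf would have written.
 */
static int describe_oops_code(unsigned int code, char *buf, size_t len)
{
    return snprintf(buf, len, "%s, %s, %s",
                    (code & PF_PROT)  ? "protection violation" : "page not present",
                    (code & PF_WRITE) ? "write" : "read",
                    (code & PF_USER)  ? "user mode" : "kernel mode");
}
```

For the value in this log, describe_oops_code(0x0002, buf, sizeof buf) produces "page not present, write, kernel mode", which matches a memcpy writing to address 0 from inside the kernel.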
Step one is to disassemble the object file produced by the build with objdump, in our case helloworld.o. The following command saves the disassembly to err.txt:
objdump helloworld.o -D > err.txt
The contents of err.txt are as follows:

helloworld.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <init_module>:
   0:    e8 00 00 00 00           callq  5 
   5:    55                       push   %rbp
   6:    48 c7 c7 00 00 00 00     mov    $0x0,%rdi
   d:    c7 04 25 00 00 00 00     movl   $0x74736574,0x0
  14:    74 65 73 74 
  18:    31 c0                    xor    %eax,%eax
  1a:    48 89 e5                 mov    %rsp,%rbp
  1d:    e8 00 00 00 00           callq  22 
  22:    31 c0                    xor    %eax,%eax
  24:    5d                       pop    %rbp
  25:    c3                       retq   
  26:    66 2e 0f 1f 84 00 00     nopw   %cs:0x0(%rax,%rax,1)
  2d:    00 00 00 
0000000000000030 <cleanup_module>:
  30:    e8 00 00 00 00           callq  35 
  35:    55                       push   %rbp
  36:    48 c7 c7 00 00 00 00     mov    $0x0,%rdi
  3d:    31 c0                    xor    %eax,%eax
  3f:    48 89 e5                 mov    %rsp,%rbp
  42:    e8 00 00 00 00           callq  47 
  47:    5d                       pop    %rbp
  48:    c3                       retq   

Disassembly of section .rodata.str1.1:
<.rodata.str1.1>:
   0:    01 31                    add    %esi,(%rcx)
   2:    48                       rex.W
   3:    65                       gs
   4:    6c                       insb   (%dx),%es:(%rdi)
   5:    6c                       insb   (%dx),%es:(%rdi)
   6:    6f                       outsl  %ds:(%rsi),(%dx)
   7:    2c 20                    sub    $0x20,%al
   9:    77 6f                    ja     7a 
   b:    72 6c                    jb     79 
   d:    64 0a 00                 or     %fs:(%rax),%al
  10:    01 31                    add    %esi,(%rcx)
  12:    47 6f                    rex.RXB outsl %ds:(%rsi),(%dx)
  14:    6f                       outsl  %ds:(%rsi),(%dx)
  15:    64                       fs
  16:    62                       (bad)  
  17:    79 65                    jns    7e 
  19:    2c 20                    sub    $0x20,%al
  1b:    63 72 75                 movslq 0x75(%rdx),%esi
  1e:    65                       gs
  1f:    6c                       insb   (%dx),%es:(%rdi)
  20:    20 77 6f                 and    %dh,0x6f(%rdi)
  23:    72 6c                    jb     91 
  25:    64 0a 00                 or     %fs:(%rax),%al

Disassembly of section .modinfo:
<__UNIQUE_ID_license0>:
   0:    6c                       insb   (%dx),%es:(%rdi)
   1:    69 63 65 6e 73 65 3d     imul   $0x3d65736e,0x65(%rbx),%esp
   8:    44 75 61                 rex.R jne 6c 
   b:    6c                       insb   (%dx),%es:(%rdi)
   c:    20 42 53                 and    %al,0x53(%rdx)
   f:    44 2f                    rex.R (bad) 
  11:    47 50                    rex.RXB push %r8
  13:    4c                       rex.WR
    ...

Disassembly of section .comment:
<.comment>:
   0:    00 47 43                 add    %al,0x43(%rdi)
   3:    43 3a 20                 rex.XB cmp (%r8),%spl
   6:    28 55 62                 sub    %dl,0x62(%rbp)
   9:    75 6e                    jne    79 
   b:    74 75                    je     82 
   d:    20 34 2e                 and    %dh,(%rsi,%rbp,1)
  10:    38 2e                    cmp    %ch,(%rsi)
  12:    32 2d 31 39 75 62        xor    0x62753931(%rip),%ch        # 62753949 
  18:    75 6e                    jne    88 
  1a:    74 75                    je     91 
  1c:    31 29                    xor    %ebp,(%rcx)
  1e:    20 34 2e                 and    %dh,(%rsi,%rbp,1)
  21:    38 2e                    cmp    %ch,(%rsi)
  23:    32 00                    xor    (%rax),%al

Disassembly of section __mcount_loc:
<__mcount_loc>:

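From the oops we know the fault is at offset 0xd from the start of hello_init. From the dump we know that hello_init shares its address with init_module, namely 0x0, because hello_init is this module's initialization entry point. Had the fault occurred in a different function, the dump would show that symbol's address instead. The faulting address is therefore 0x0 + 0xd = 0xd, and at that offset the dump shows the movl that stores the bytes "test" (0x74736574) to address 0x0, which is the inlined memcpy. Next we can use addr2line to map the address to a line of source. Note that addr2line can only report a file and line number if the object was built with debug information.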
addr2line -C -f -e helloworld.o d
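The arithmetic behind the address passed to addr2line can be sketched in user-space C as follows. This is only an illustration: parse_oops_location and fault_address are hypothetical helper names, and the sketch assumes the location text has the symbol+0xOFFSET/0xSIZE form seen on the RIP line.

```c
#include <stdio.h>

/*
 * Hypothetical helper: split "symbol+0xOFF/0xSIZE" as printed on the RIP line.
 * sym must point to a buffer of at least 64 bytes.
 * Returns 0 on success, -1 if the text does not have that form.
 */
static int parse_oops_location(const char *text, char *sym,
                               unsigned long *offset, unsigned long *size)
{
    /* %63[^+] reads the symbol name up to '+'; %lx accepts a 0x prefix. */
    if (sscanf(text, "%63[^+]+%lx/%lx", sym, offset, size) != 3)
        return -1;
    return 0;
}

/*
 * Address to hand to addr2line: where the symbol starts in the objdump
 * listing, plus the offset reported by the oops.
 */
static unsigned long fault_address(unsigned long sym_start,
                                   unsigned long offset)
{
    return sym_start + offset;
}
```

With the text "hello_init+0xd/0x30" the parser yields the symbol hello_init, offset 0xd and function size 0x30. Since the listing places the function at 0x0, fault_address(0x0, 0xd) gives 0xd, the value used in the addr2line command.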

Other debugging techniques

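The above shows how to get from an oops to the line of code that caused the crash. It applies when a serious error has taken the kernel down. The other common technique is printing messages with printk. printk is used much like printf, except that you need to pay attention to the log level, which Part 2 of this series (Building and Running Modules) covers in detail. Also keep in mind that heavy use of printk can slow the system down considerably.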
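The log level is visible in the objdump listing earlier: the .rodata.str1.1 section begins with the bytes 01 31, followed by "Hello, world". In kernels since 3.6, KERN_ALERT expands to the string "\001" "1", so the level travels as a two-byte prefix of the format string. The following user-space sketch reads that prefix back; printk_level is a hypothetical name used only for illustration, not a kernel function.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the log level (0 to 7) encoded at the start
 * of a printk format string, or -1 if the string carries no level prefix.
 * Assumes the "\001" + digit encoding used by kernels 3.6 and later.
 */
static int printk_level(const char *fmt)
{
    if (fmt != NULL && fmt[0] == '\001' && fmt[1] >= '0' && fmt[1] <= '7')
        return fmt[1] - '0';
    return -1;
}
```

Passing "\001" "1Hello, world\n", the same bytes as in the dump, returns 1, which is the KERN_ALERT level. A string with no prefix returns -1.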

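These two techniques are the ones I use most at work. There are others, such as the /proc filesystem, user-space tools like strace, and gdb or kgdb, but they are generally harder or less convenient to use, so I won't cover them here.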

With driver debugging covered, the next part will look at concurrency and race conditions in Linux drivers. Stay tuned.
