Analysis of Linux kernel crashes

From: http://stablebits.blogspot.hk/
  • Introduction
  • Tools
  • Format of a crash report
  • Analysis
    • Simple crash to learn the basics
    • Crash in a binary kernel module
    • Suspected memory corruption
  • Extra Details

Introduction

The aim of this post is to illustrate the analysis of Linux kernel crashes by studying a few real-life examples. The examples come from a MIPS platform, but the general approach is applicable to other architectures.

It's implied that readers have knowledge of C programming and of basic operating system concepts, like virtual memory.

We begin by analyzing a simple crash to illustrate the basics. Further, we reconstruct what happened in case of a crash in a kernel module that has no source code available. Finally, we consider a crash caused by "memory corruption".

The information provided here is by no means comprehensive. We take a minimalist approach and don't consider tools such as Kdump and crash.

Tools

Any general-purpose disassembler is sufficient. We'll use objdump with the '-d' option here.

If a binary was built with debugging information, objdump -S can display source code intermixed with disassembly [1]. Also, addr2line can be used to match addresses with source code file names and lines.

In order to interpret disassembly, we need to have the MIPS Instruction Reference and the Compiler Register Usage information at hand [2], so please keep these pages open while reading further material.

If you are not familiar with how the virtual memory space is divided on MIPS, please refer to 'Virtual Memory Layout' [3] in the last section.

Format of a crash report

Here is an example of a crash report:

CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 8023afd0, ra == 8023b024
Oops[#1]:
Cpu 0
$ 0 : 00000000 1000fc00 8555fc54 00000000
$ 4 : 00000000 00000000 0000000b 00000001
$ 8 : 00000008 800445f4 00000000 00000000
$12 : 0000004f 0000004e 00000041 00000001
$16 : 00000000 8555fc54 0000000b 8555fc54
$20 : c01eded0 8555fbd0 80240000 7f7ff0c4
$24 : 00000002 c01d6edc
$28 : 8555e000 8555fab0 7f7ff0a0 8023b024
Hi : 00000000
Lo : 3b9aca00
epc : 8023afd0 strlen+0x0/0x28
    Tainted: PF
ra : 8023b024 strlcpy+0x2c/0x7c
Status: 1000fc04 IEp
Cause : 00000008
BadVA : 00000000
PrId : 0000c401 (Fusiv MIPS1)
Modules linked in: xt_CLASSIFY [ skipped proprietary (aka evil) modules ] ip6_tunnel tunnel6
Process controllerd (pid: 751, threadinfo=8555e000, task=8783add8, tls=00000000)
Stack : 1000fc01 7f7ff0d0 8008f4a4 7f7ff268 8555fc5f 8555fc54 8023aff8 80044500
    c01d6cc8 c01d6ab8 000000a4 7f7ff0c4 00000000 80050000 00000000 c026ca78
    c026ca78 8555fb18 8555fbd0 7f7ff0d0 7f7ff174 7f7ff0d0 7f7ff0c0 86458400
    000000a4 c01d4440 80631224 8026d168 87008838 8026ca84 00000000 00000001
    00000006 00000001 80631224 806312c0 1000fc01 fffffffe 805e5778 805e0000
...
Call Trace:
[<8023afd0>] strlen+0x0/0x28
[<8023b024>] strlcpy+0x2c/0x7c
[] controller_get_info+0x2c4/0x37c [controller_lkm]
[] controller_init+0x3e0/0xa64 [controller_lkm]

Code: 00000000  03e00008  01031023 <80820000> 0808ebfa  00801821 24630001  80620000  00000000

Let's review the parts one by one.

CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 8023afd0, ra == 8023b024
Oops[#1]:

The header indicates the particular reason for this crash.

On CPU #0 a load or store instruction at address epc accessed a virtual address 0x00000000. There was no valid virtual to physical address translation available - hence, the crash. In the middle of the report the same virtual address is shown as:

    BadVA : 00000000

BadVA is a register of the MIPS Coprocessor 0 that holds the virtual address at which the most recent address exception occurred.

Unable to handle kernel paging request

is by far one of the most common reasons for crashes. You may also encounter:

Kernel bug detected

It is triggered by one of the sanity checks in the kernel code, such as BUG() or BUG_ON(condition). This mechanism does what assert() does for user-space applications.

Other reasons can be found by running 'grep -rn die_if_kernel arch/mips/' in a Linux kernel tree.

Further, the content of the registers is displayed.

$ 0   : 00000000 1000fc00 8555fc54 00000000
$ 4   : 00000000 00000000 0000000b 00000001
[ ... ]
$24   : 00000002 c01d6edc
$28   : 8555e000 8555fab0 7f7ff0a0 8023b024
Hi    : 00000000
Lo    : 3b9aca00
[ ... ]
Status: 1000fc04    IEp
Cause : 00000008

Registers $0-$31 are the general purpose MIPS registers. To simplify reading, each of them has a mnemonic name in assembler code. Now it's time to take a quick look at Compiler Register Usage. For example, a0-a3 correspond to $4-$7, which are used in the o32 calling convention to pass the first 4 arguments to a function. o32 is the most commonly used calling convention on 32-bit MIPS [2] and our examples here relate to it.

An ideal case is to have a complete dump of the memory used by the kernel. That would allow us to restore the environment - to see the content of local variables, various kernel data structures, etc. Kdump can do this (no MIPS support at the moment). Nevertheless, in many cases the content of the registers alone reveals enough information to understand a problem.

The content of Status and Cause registers may be very useful in some cases, but we won't consider them here.

epc   : 8023afd0 strlen+0x0/0x28
ra    : 8023b024 strlcpy+0x2c/0x7c

epc shows the address of the instruction that caused the crash. ra ($31) contains the return address of the last function call made prior to the crash. In practice, ra usually points either into the caller of the function that epc belongs to or into the same function as epc. An invalid address in ra can indicate stack corruption (at least for non-leaf functions).

The names of the functions that epc and ra belong to are displayed if the kernel was built with CONFIG_KALLSYMS enabled. In any case, these names can be located with objdump.

The +0x0/0x28 notation stands for +offset/size, where offset is the offset of the instruction within the function it belongs to, and size is the size of this function.
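The +offset/size notation is easy to take apart mechanically. The following is a minimal sketch in C, assuming the text has exactly the symbol+0xOFF/0xSIZE form shown in the report; struct sym_ref, parse_sym_ref() and sym_start() are hypothetical names introduced here for illustration only.

```c
#include <stdio.h>

/* Hypothetical helper type: one "symbol+offset/size" reference from a report. */
struct sym_ref {
    char name[64];
    unsigned long offset;   /* offset of the instruction within the function */
    unsigned long size;     /* total size of the function in bytes */
};

/* Parses text such as "strlcpy+0x2c/0x7c"; returns 0 on success, -1 otherwise. */
static int parse_sym_ref(const char *text, struct sym_ref *out)
{
    /* %63[^+] copies the symbol name up to (not including) the '+' sign. */
    if (sscanf(text, "%63[^+]+%lx/%lx", out->name, &out->offset, &out->size) != 3)
        return -1;
    return 0;
}

/* The function starts at the reported address minus the offset. */
static unsigned long sym_start(unsigned long addr, const struct sym_ref *ref)
{
    return addr - ref->offset;
}
```

Applied to "ra : 8023b024 strlcpy+0x2c/0x7c", subtracting the offset gives 0x8023aff8 as the start of strlcpy(), which is the address to look for in the disassembly.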

In most cases, ra points to the 2nd instruction that follows an instruction representing a function call (usually jal or jalr). For example,

  801f9bb8 <pci_bus_read_config_byte>:
  [...]
  801f9c14:       02202821        move    a1,s1
  801f9c18:       0040f809        jalr    v0
  801f9c1c:       24070001        li      a3,1
  801f9c20:       00408021        move    s0,v0
  801f9c24:       8fa20018        lw      v0,24(sp)

A function call is at 0x801f9c18. Control gets back to pci_bus_read_config_byte at 0x801f9c20. jalr saves this address into ra before jumping to the address held in v0 - the start of the called function.

The instruction at 0x801f9c1c is located in the delay slot of jalr and is executed before any instruction in the called function. Delay slots of jalr are often used to initialize one of the function arguments. If the called function above has at least 4 arguments, its 4th argument will be 1.

Branch instructions, like bltz (branch on less than zero), are another example of instructions with delay slots.

We mentioned earlier that both epc and ra may point to the same function. To illustrate this case, let's suppose that a crash occurs at 0x801f9c24 in the above disassembly. Provided that a function call at 0x801f9c18 took place, ra would point to 0x801f9c20 inside the same function as epc.
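The relation between ra and the call instruction can be turned into a quick plausibility check. The sketch below is only an illustration: the function names are hypothetical, and the encodings (primary opcode 3 for jal, SPECIAL opcode 0 with function field 0x09 for jalr) are taken from the MIPS32 instruction set.

```c
#include <stdbool.h>
#include <stdint.h>

/* jal: the primary opcode (bits 31..26) is 3. */
static bool is_jal(uint32_t insn)
{
    return (insn >> 26) == 0x03;
}

/* jalr: primary opcode 0 (SPECIAL) with function field (bits 5..0) 0x09. */
static bool is_jalr(uint32_t insn)
{
    return (insn >> 26) == 0x00 && (insn & 0x3f) == 0x09;
}

/* The call sits two words before ra: the call itself plus its delay slot. */
static uint32_t call_site_from_ra(uint32_t ra)
{
    return ra - 8;
}

/* A plausible ra is word-aligned and preceded by a call instruction. */
static bool ra_looks_valid(uint32_t ra, uint32_t insn_at_call_site)
{
    return (ra & 3) == 0 &&
           (is_jal(insn_at_call_site) || is_jalr(insn_at_call_site));
}
```

For ra == 0x801f9c20 the call site is 0x801f9c18, and the word found there in the listing, 0x0040f809, passes the jalr test.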

  Modules linked in: xt_CLASSIFY [ ... ] ip6_tunnel tunnel6

This is a list of loaded kernel modules.

  Process controllerd (pid: 751, threadinfo=8555e000, task=8783add8, tls=00000000)

This is information about a process that was running at the moment of a crash.

In an ideal world where kernels and, especially, kernel modules behave well, user-space actions can never trigger a kernel crash, no matter what these actions are. Kernel-mode tasks have full privileges to cause havoc though.

A user-space task appears here in one of the following cases:

  • it has triggered a kernel action (usually via a system call like ioctl()) that, due to a kernel bug or memory corruption, results in a crash.

  • a crash has occurred in the interrupt context. Unless your kernel supports threaded interrupt handlers (i.e. interrupts are handled by dedicated threads) or separate interrupt stacks, the kernel-space stack of a current task is used by the interrupt handling code. The displayed task is usually unrelated in this case.

  Stack : 1000fc01 7f7ff0d0 8008f4a4 7f7ff268 8555fc5f 8555fc54 8023aff8 80044500
          [ ... ]
          00000006 00000001 80631224 806312c0 1000fc01 fffffffe 805e5778 805e0000
         ...

This is a partial dump of the kernel-space stack of the current task, which we have just discussed.

  Call Trace:
  [<8023afd0>] strlen+0x0/0x28
  [<8023b024>] strlcpy+0x2c/0x7c
  [] controller_get_info+0x2c4/0x37c [controller_lkm]
  [] controller_init+0x3e0/0xa64 [controller_lkm]
  [ ... ]

As the name suggests, this is a call trace.

It does not always represent a genuine call trace though. When epc points to an invalid address on MIPS or raw_show_trace is enabled on the kernel command line, the so-called raw call trace is displayed. It contains all the values from the stack that look like valid return addresses. So there can be 'ghost' traces of previously run and completely unrelated functions. For curious readers, the implementation of both methods is in show_backtrace() and show_raw_backtrace() in arch/mips/kernel/traps.c.

  Code: 00000000  03e00008  01031023 <80820000> 0808ebfa  00801821 24630001  80620000  00000000

Finally, this last section displays a sequence of instructions (binary representation) at and around epc, with the instruction at epc being indicated by <> symbols.
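When many reports have to be processed, the marked word can be pulled out of the Code: line programmatically. This is a minimal sketch that assumes the line format shown above; code_word_at_epc() is a hypothetical helper, not something provided by the kernel.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Stores the instruction word enclosed in <> into *insn and returns 0.
 * Returns -1 if the line carries no marked word, as in
 * "Code: (Bad address in epc)".
 */
static int code_word_at_epc(const char *line, uint32_t *insn)
{
    const char *mark = strchr(line, '<');
    unsigned int word;

    if (mark == NULL || sscanf(mark, "<%8x>", &word) != 1)
        return -1;
    *insn = (uint32_t)word;
    return 0;
}
```

The extracted word can then be compared with the word that objdump shows at the epc address, to confirm that the image being disassembled matches the one that crashed.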

Analysis

Simple crash to learn the basics

Let's now analyze the crash that was used as an example in the previous section.

  CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 8023afd0, ra == 8023b024
  [...]
  epc   : 8023afd0 strlen+0x0/0x28
  ra    : 8023b024 strlcpy+0x2c/0x7c

The kernel was built with CONFIG_KALLSYMS enabled, and that allows us to see the names of the functions that epc and ra belong to. If that were not the case, we could note that both epc and ra belong to kseg0 [3], so we would expect to find them inside the kernel image (vmlinux).

A quick sanity check for ra (recall that property of ra we mentioned above):

  8023aff8 <strlcpy>:
  [...]
  8023b018:       afbf0020        sw      ra,32(sp)
  8023b01c:       0c08ebf4        jal     8023afd0 <strlen>
  8023b020:       00a08021        move    s0,a1
  8023b024:       00408821        move    s1,v0                  <=== 'ra' points to this location
  [...]

ra is indeed one instruction away (delay slot) from jal. It's also clear that a function being called is strlen(). In cases when the address of a called function is not known at build time (kernel modules), disassembly may look as follows:

  b6d38:        3c020000        lui     v0,0x0
  b6d3c:        24420000        addiu   v0,v0,0
  b6d40:        0040f809        jalr    v0 

The first 2 instructions are changed by the loader at run time. In such cases, don't get confused when disassembly and the Code: sequence of a crash dump display different instructions at the same address. Usually, there is a remote resemblance though. For instance, the instructions above might have been changed as follows:

  b6d38:        3c02804a        lui     v0,0x804a
  b6d3c:        2442346c        addiu   v0,v0,13420

This basically corresponds to v0 = 0x804a346c.
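The arithmetic behind a lui/addiu pair is worth spelling out, because addiu sign-extends its 16-bit immediate and objdump prints that immediate in decimal. Below is a minimal sketch; lui_addiu() is a hypothetical helper name.

```c
#include <stdint.h>

/* Value left in the register after "lui reg,upper ; addiu reg,reg,lower". */
static uint32_t lui_addiu(uint16_t upper, uint16_t lower)
{
    /* addiu sign-extends its 16-bit immediate before adding. */
    int32_t simm = (lower & 0x8000) ? (int32_t)lower - 0x10000 : (int32_t)lower;

    return ((uint32_t)upper << 16) + (uint32_t)simm;
}
```

For the pair above, lui_addiu(0x804a, 13420) gives 0x804a346c, since 13420 decimal is 0x346c. When the lower half has its top bit set, the sign extension makes the compiler emit an upper half that is one larger than the upper 16 bits of the final address.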

For 'epc',

  8023afd0 <strlen>:
  8023afd0:       80820000        lb      v0,0(a0)               <=== 'epc' is here
  8023afd4:       0808ebfa        j       8023afe8 <strlen+0x18>

  8023afd8:       00801821        move    v1,a0
  8023afdc:       24630001        addiu   v1,v1,1
  [...]

we may make the following observations:

  • it's indeed the first instruction (offset 0x0) of strlen();
  • a0 ($4) is indeed 0. The instruction at epc loads a byte from address MEM[a0 + 0] and BadVA is reported to be 0. Hence, a0 should have been 0 too.

    $ 4   : 00000000 [...]
    
  • The instructions from disassembly match the ones shown in the Code: sequence.

    Code: [...] <80820000> 0808ebfa  00801821 24630001  80620000  00000000
    

These checks can also be applied as sanity checks to ensure you have got the right image for disassembly.
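The check "base register plus offset equals BadVA" can be written down as a small decoder for I-type loads and stores. This is a sketch under the assumption that the instruction is a plain base+offset memory access such as lb, lw or sw; struct mem_insn and both functions are hypothetical names.

```c
#include <stdint.h>

/* Fields of an I-type load/store such as "lb v0,0(a0)". */
struct mem_insn {
    unsigned opcode;   /* bits 31..26, e.g. 0x20 for lb, 0x2b for sw */
    unsigned base;     /* bits 25..21, number of the base register */
    unsigned rt;       /* bits 20..16, register being loaded or stored */
    int32_t offset;    /* bits 15..0, sign-extended */
};

static struct mem_insn decode_mem_insn(uint32_t insn)
{
    struct mem_insn m;
    uint32_t imm = insn & 0xffff;

    m.opcode = insn >> 26;
    m.base = (insn >> 21) & 0x1f;
    m.rt = (insn >> 16) & 0x1f;
    m.offset = (imm & 0x8000) ? (int32_t)imm - 0x10000 : (int32_t)imm;
    return m;
}

/* regs[] holds $0..$31 exactly as printed in the register dump. */
static uint32_t effective_address(const struct mem_insn *m, const uint32_t regs[32])
{
    return regs[m->base] + (uint32_t)m->offset;
}
```

Decoding 0x80820000 yields opcode 0x20 (lb), base register 4 (a0), target register 2 (v0) and offset 0. With $4 == 0 from the dump, the effective address is 0, which agrees with BadVA.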

Now, a0 is supposed to hold the 1st (and only) argument of strlen() [Compiler Register Usage]. Given that epc points to the 1st instruction, a0 has not yet been reused for anything else inside strlen(). This can be verified by analyzing the instruction flow. The current working hypothesis is that strlen(s) has been called with s == NULL.

Let's see if we can figure out where this s == NULL is coming from by examining the caller of strlen - strlcpy(). But before doing so, we should consider a few more aspects common to all functions.

At the beginning of most functions, there is a sequence of instructions called a "prologue". For example,

  800a28d0 :
  800a28d0:       27bdffd0        addiu   sp,sp,-48
  800a28d4:       afb30024        sw      s3,36(sp)
  800a28d8:       afb20020        sw      s2,32(sp)
  800a28dc:       afb1001c        sw      s1,28(sp)
  800a28e0:       afb00018        sw      s0,24(sp)
  800a28e4:       afbf0028        sw      ra,40(sp)
  [...]

The first instruction creates a stack frame by reserving space on the stack. As per the o32 calling convention, it is the job of the called function to preserve non-temporary registers (like the $s-registers) if they are to be reused. This is what those sw $reg, off(sp) instructions are concerned with - saving to-be-reused registers to the stack. The same applies to ra for non-leaf functions (those calling other functions). Obviously, local variables also reside in this stack frame.
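Knowing the frame size and the slot where ra is saved helps when walking the stack dump by hand. The following sketch recognizes only the two prologue forms shown above (addiu sp,sp,-N and sw ra,off(sp)); the function names are hypothetical, and real prologues may use other instruction sequences.

```c
#include <stdint.h>

#define REG_SP 29
#define REG_RA 31

/* Sign-extended 16-bit immediate of an I-type instruction. */
static int32_t simm16(uint32_t insn)
{
    uint32_t imm = insn & 0xffff;

    return (imm & 0x8000) ? (int32_t)imm - 0x10000 : (int32_t)imm;
}

/* Frame size if insn is "addiu sp,sp,-N" (opcode 0x09), otherwise -1. */
static int32_t frame_size(uint32_t insn)
{
    if ((insn >> 26) != 0x09 ||
        ((insn >> 21) & 0x1f) != REG_SP || ((insn >> 16) & 0x1f) != REG_SP)
        return -1;
    return simm16(insn) < 0 ? -simm16(insn) : -1;
}

/* Stack offset if insn is "sw ra,off(sp)" (opcode 0x2b), otherwise -1. */
static int32_t saved_ra_offset(uint32_t insn)
{
    if ((insn >> 26) != 0x2b ||
        ((insn >> 21) & 0x1f) != REG_SP || ((insn >> 16) & 0x1f) != REG_RA)
        return -1;
    return simm16(insn);
}
```

For the prologue above, frame_size(0x27bdffd0) is 48 and saved_ra_offset(0xafbf0028) is 40, so the caller's return address lives 40 bytes above the callee's sp.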

An "epilogue" sequence does the opposite actions.

  800a2908:       8fbf0028        lw      ra,40(sp)
  800a290c:       8fb30024        lw      s3,36(sp)
  800a2910:       8fb20020        lw      s2,32(sp)
  800a2914:       8fb1001c        lw      s1,28(sp)
  800a2918:       8fb00018        lw      s0,24(sp)
  800a291c:       03e00008        jr      ra
  800a2920:       27bd0030        addiu   sp,sp,48

The content of reused registers is restored. The stack frame is deleted - usually, by the last instruction in the delay slot of jr. Finally, control is given back to the caller by jr ra.

Stack corruption may overwrite the saved value of ra, resulting in control being given to unexpected places (unless this is the result of a deliberate security attack). This is likely to result in "Unable to handle kernel paging request", "Unaligned access", or "Invalid instruction". Quite often in such cases epc is equal to ra (or close to it) or to both ra and BadVA.

"Epilogue" is not necessarily placed at the very end of a function. Moreover, a function may have more than one "epilogue".

One more thing before we get back to the analysis. Function calls look as follows:

  800a2aa0:       00c03821        move    a3,a2
  800a2aa4:       00002021        move    a0,zero
  800a2aa8:       02202821        move    a1,s1
  800a2aac:       0c028848        jal     800a2120 <rw_verify_area>
  800a2ab0:       02603021        move    a2,s3

This sequence corresponds to rw_verify_area(a0, a1, a2, a3) with a0 being 0 (zero is register $0). The instructions initializing a0-a3 do not have to be placed immediately next to the jal instruction, but usually they are in some proximity.

The return value of a function, if any, is stored in v0 ($2).

Let's get back to our analysis.

  8023aff8 <strlcpy>:
  8023aff8:       27bdffd8        addiu   sp,sp,-40
  8023affc:       afb3001c        sw      s3,28(sp)
  8023b000:       00809821        move    s3,a0
  8023b004:       00a02021        move    a0,a1                    <=== the argument for strlen()
  8023b008:       afb20018        sw      s2,24(sp)
  8023b00c:       afb10014        sw      s1,20(sp)
  8023b010:       afb00010        sw      s0,16(sp)
  8023b014:       00c09021        move    s2,a2
  8023b018:       afbf0020        sw      ra,32(sp)
  8023b01c:       0c08ebf4        jal     8023afd0 <strlen>        <=== the call is here
  8023b020:       00a08021        move    s0,a1
  8023b024:       00408821        move    s1,v0                    <=== 'ra' points here
  [...]

strlen() accepts a single argument that is passed via a0. A few instructions above the actual call (see remarks) - at 0x8023b004, we can see that a0 is being loaded with the content of a1. After examining the remaining instructions it becomes clear that a1 still holds its initial value that corresponds to the 2nd argument of strlcpy(dst, src, len).

Now we can update our working hypothesis. It looks like strlcpy() has been called with its 2nd argument, src, being NULL.

Real-life shortcut: we may simply examine the source code of strlcpy() and notice that there is a single call to strlen(). This is in accordance with our hypothesis indeed.

  size_t strlcpy(char *dest, const char *src, size_t size)
  {
          size_t ret = strlen(src);
  [...]

What's next? We can do the same analysis for controller_get_info() that has supposedly called strlcpy().

  Call Trace:
  [<8023afd0>] strlen+0x0/0x28
  [<8023b024>] strlcpy+0x2c/0x7c
  [] controller_get_info+0x2c4/0x37c [controller_lkm]
  [...]

Recall the remarks above regarding the validity of call traces. Basic sanity checks won't take much time. At the very least, check that 0xc01d6cc8 could have been a valid ra (one instruction away from jalr/jal). If it is the case, verify the code of (in this example) controller_get_info() to confirm that it does call strlcpy(). Having disassembly intermixed with source code is helpful here [1].

In any case, there is obviously a limit as to how far "in the past" we would be able to look by analyzing a crash report even if we had a complete memory dump. Nevertheless, the results of this analysis - if not sufficient to reveal a root cause - are usually very helpful in further debugging. As to this particular example, we would still need to analyze controller_get_info() to understand why strlcpy() might have been called with src == NULL.

Crash in a binary kernel module

This crash occurred in a kernel module for which no source code is available.

  CPU 1 Unable to handle kernel paging request at virtual address 00000000, epc == c1c52470, ra == c1c63d64
  [...]
  $ 0   : 00000000 10008d00 c1c523f0 00000000
  $ 4   : 00000000 c1f18f5c 0000008c ffff00fe
  $ 8   : 80008fe1 15941794 8e038b00 fefe7dfd
  $12   : faf9fdfe 7dfffe7f fb7eff7d 7b7e7e7c
  $16   : 00000000 00000800 c1e2d178 c1e2cf98
  $20   : 842ffe08 c1e2d1b4 00000050 c1c51fc0
  $24   : 00000000 00000000
  $28   : 842fc000 842ffdf0 00000000 c1c63d64
  [...]
  epc   : c1c52470 fast_memcpy+0x80/0x1cc [binary_blob_module]
      Tainted: P
  ra    : c1c63d64 net_egress+0x80/0x2b78 [binary_blob_module]
  [...]

The load address of a kernel module is not known at build time, so we see relative addresses in the disassembly of binary_blob_module. We can use '+0x80/0x1cc' to locate the instruction at epc:

  a53f0 <fast_memcpy>:
  [...]
  a5468:       98ab000f        lwr     t3,15(a1)
  a546c:       98af001f        lwr     t7,31(a1)
  a5470:       a8880000        swl     t0,0(a0)      <== 'epc' points here at offset 0x80

The instruction at epc accesses MEM[a0 + 0], so a0 should have been 0 to result in a memory access at virtual address 0x00000000. We can confirm this by verifying the content of the a0 ($4) register:

  $ 4   : 00000000 [...]

The next step is to examine the flow of instructions to trace the source of the value in a0. A full listing is not provided here, but what it revealed is that a0 is used read-only. At epc it still holds the 1st input argument of the function. Moreover, there are no explicit validity checks prior to its use. Thus, the 1st argument is expected to be a valid address.

The name of the function, fast_memcpy(), suggests its memcpy-like nature, so the 1st argument is likely to be dst (of course, this can be verified by a careful analysis of disassembly).

Let's examine the caller, net_egress().

  b6ce4 <net_egress>:
  [...]
  b6d38:       3c020000        lui     v0,0x0           (1)
  b6d3c:       24420000        addiu   v0,v0,0          (2)
  b6d40:       0040f809        jalr    v0               (3)
  b6d44:       97a4001a        lhu     a0,26(sp)        (4)
  b6d48:       97a6001a        lhu     a2,26(sp)        (5)
  b6d4c:       00408021        move    s0,v0            (6)
  b6d50:       00402021        move    a0,v0            (7)
  b6d54:       3c020000        lui     v0,0x0           (8)
  b6d58:       24420000        addiu   v0,v0,0          (9)
  b6d5c:       0040f809        jalr    v0               (10)
  b6d60:       8fa5001c        lw      a1,28(sp)        (11)
  b6d64:       02202021        move    a0,s1             <== 'ra' points here at offset 0x80
  [...]

There are 2 function calls here. The first one, at line (3), seems to have a single argument which gets initialized at line (4). Let's refer to this called function as unknown_function. The second one, at line (10), takes 3 arguments, a0, a1 and a2, that are initialized at lines (7), (11), and (5) respectively. Supposedly, this is a call of fast_memcpy() where the crash occurred.

Can we say something specific about those arguments?

  • the 1st argument, a0, of fast_memcpy() gets initialized with the return value, v0, of unknown_function() at line (7);
  • this return value is passed as is, i.e. there is no validity check;
  • the 1st argument of unknown_function() and the 3rd one of fast_memcpy() get initialized with the same value loaded from 26(sp) at lines (4) and (5).

These observations suggest that the source code may look as follows:

  dst = unknown_function(len);
  fast_memcpy(dst, src, len);

In this particular case, unknown_function() returned NULL - hence, the crash.
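For comparison, here is what the same pattern looks like with the return value checked. This is purely a hypothetical sketch: the real unknown_function() and fast_memcpy() are not available, so a fixed-size pool and memcpy() stand in for them, and egress_copy() is an invented name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the buffer that the unknown function hands out. */
static unsigned char pool[64];

/* Hypothetical stand-in: fails (returns NULL) when the request does not fit. */
static void *unknown_function(size_t len)
{
    return (len != 0 && len <= sizeof(pool)) ? pool : NULL;
}

/* Returns 0 on success, -1 if no destination buffer could be obtained. */
static int egress_copy(const void *src, size_t len)
{
    void *dst = unknown_function(len);

    if (dst == NULL)        /* the check that appears to be missing */
        return -1;
    memcpy(dst, src, len);  /* stands in for fast_memcpy(dst, src, len) */
    return 0;
}
```

Whether an error return is the right reaction depends on what unknown_function() actually does, which is exactly the question to put to the supplier of the module.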

Further, a question regarding the nature of that unknown_function(), accompanied by the analysis of the crash, can be sent to a supplier of binary_blob_module. By submitting a more detailed report and asking concrete questions for hard-to-reproduce problems, we can somewhat decrease the chances of having a (sometimes) default reply such as "please try reproducing it on our reference software and/or hardware".

Suspected memory corruption

Finally, let's consider a case where memory corruption is suspected.

  CPU 0 Unable to handle kernel paging request at virtual address 0004349c, epc == 0004349c, ra == 80012224
  Oops[#1]:
  Cpu 0
  $ 0   : 00000000 7f99bcc0 00000069 2abc7ea0
  $ 4   : 00000000 7f99bd20 7f99be60 00000001
  $ 8   : 00000000 80000008 0004349c fffffff4
  $12   : 7f99bd08 00000001 00000000 00000000
  $16   : 2ab01000 2ab01000 7f99bdc8 00000000
  $20   : 2aafc6a8 00410000 7f99bde0 7f99be60
  $24   : 00000000 2abc7e80
  $28   : 85248000 85249f30 0040484c 80012224
  Hi    : 0000ba1a
  Lo    : ff98c506
  epc   : 0004349c 0x4349c
      Tainted: PF       W
  ra    : 80012224 stack_done+0x20/0x3c
  Status: 1100ff03    KERNEL EXL IE
  Cause : 10800008
  BadVA : 0004349c
  PrId  : 00019554 (MIPS 34Kc)
  [...]
  Process screen_plugin (pid: 799, threadinfo=85248000, task=87140038, tls=00000000)
  Stack : 2ab85040 00000000 00000001 00000000 00000000 00000000 00000000 7f99bcc0
          [...]
          00000001 00000000 2ac2d530 7f99bce8 7f99bd18 2aae83d4 0100ff13 00028675
  Call Trace:
  Code: (Bad address in epc)

Note that epc == BadVA. The CPU has tried to fetch an instruction at 0x0004349c, but this is not a valid kernel-space address.

Let's examine the code at ra:

  ra    : 80012224 stack_done+0x20/0x3c

  80012140 :
  [...]
  800121e0:       000240c0        sll     t0,v0,0x3
  800121e4:       3c098001        lui     t1,0x8001
  800121e8:       25292460        addiu   t1,t1,9312
  800121ec:       01284821        addu    t1,t1,t0
  800121f0:       8d2a0000        lw      t2,0(t1)
  800121f4:       1140005e        beqz    t2,80012370 <illegal_syscall>
  800121f8:       8d2b0004        lw      t3,4(t1)
  800121fc:       05610040        bgez    t3,80012300 
  80012200:       afa70080        sw      a3,128(sp)
  80012204 <stack_done>:
  80012204:       8f880008        lw      t0,8(gp)
  80012208:       3c098000        lui     t1,0x8000
  8001220c:       35290008        ori     t1,t1,0x8
  80012210:       01094024        and     t0,t0,t1
  80012214:       15000016        bnez    t0,80012270 

  80012218:       00000000        nop
  8001221c:       0140f809        jalr    t2
  80012220:       00000000        nop
  80012224:       2408fb92        li      t0,-1134             <=== 'ra' points here

ra is the valid return address for a function call at 0x8001221c. The address of that function is taken from t2 ($10), which indeed contains 0x0004349c:

  $ 8   : 00000000 80000008 0004349c fffffff4

So what we have is a call through a function pointer that contains a bogus (corrupted?) value.

Let's try to figure out where t2 is coming from:

  800121e0:       000240c0        sll     t0,v0,0x3
  800121e4:       3c098001        lui     t1,0x8001
  800121e8:       25292460        addiu   t1,t1,9312
  800121ec:       01284821        addu    t1,t1,t0
  800121f0:       8d2a0000        lw      t2,0(t1)
  800121f4:       1140005e        beqz    t2,80012370 <illegal_syscall>

The instruction at 0x800121f0 corresponds to 't2 = *t1' and is followed by an instruction that compares t2 to 0. If t2 is 0, control is given to illegal_syscall.

Well, some knowledge of the kernel internals would be helpful here. In any case, the appearance of names such as illegal_syscall, handle_sys, and syscall_trace_entry suggests that the code in question has something to do with the handling of system calls. The relevant code can indeed be found in arch/mips/kernel/scall32-o32.S.

Can we guess what syscall it was?

  800121e4:       3c098001        lui     t1,0x8001
  800121e8:       25292460        addiu   t1,t1,9312

t1 = 0x80010000 + 9312 = 0x80012460;

nm shows that this value corresponds to sys_call_table, which is an array that contains addresses of all the system calls.

  800121ec:       01284821        addu    t1,t1,t0

t1 = t1 + t0;

t0 is the offset in the table, which is being calculated as follows:

  800121e0:       000240c0        sll     t0,v0,0x3

t0 = v0 * 8;

and the value of v0 ($2) is 0x69 (105 decimal):

  $ 0   : 00000000 7f99bcc0 00000069 2abc7ea0

The analysis of the source code of handle_sys (it's written in assembler) reveals that v0 represents a syscall number.

The syscall numbers are defined in arch/mips/include/asm/unistd.h. The one we are interested in corresponds to sys_getitimer():

  #define __NR_getitimer                  (__NR_Linux + 105)
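The table lookup performed by the disassembled code can be summarized in a few lines of C. This is a sketch under two assumptions: the 8-byte entry stride implied by the shift by 3, and __NR_Linux being 4000 for the o32 ABI. The function names are hypothetical, and the table base is passed in as a parameter (0x80012460 is the value formed by lui 0x8001 and addiu 9312).

```c
#include <stdint.h>

#define NR_LINUX_O32 4000u   /* __NR_Linux for the o32 ABI */

/* Address of the table entry read for a given index. */
static uint32_t syscall_slot(uint32_t table_base, uint32_t index)
{
    return table_base + (index << 3);   /* sll t0,v0,0x3 ; addu t1,t1,t0 */
}

/* Index recovered from an entry address, or UINT32_MAX if misaligned. */
static uint32_t syscall_index(uint32_t table_base, uint32_t slot)
{
    uint32_t delta = slot - table_base;

    return (delta & 7) ? UINT32_MAX : delta >> 3;
}

/* User-visible syscall number, e.g. __NR_getitimer for index 105. */
static uint32_t syscall_number(uint32_t index)
{
    return NR_LINUX_O32 + index;
}
```

With the base 0x80012460 and index 0x69, the entry is at 0x800127a8. Its first word is what lw t2,0(t1) loads and what jalr t2 later jumps to.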

sys_call_table is not modified at run time. Also, it was not the first and only call to sys_getitimer() - the system has been functioning properly for days prior to this crash. So where do we go from here?

Memory corruption often results in seemingly unrelated crashes, both in kernel and user space. What they have in common, though, is that all these crashes may look "weird". That is, careful analysis does not reveal any obvious problem with the code and, moreover, suggests possible external influence, be it stack/memory corruption or hardware issues. Having multiple crashes in different parts of the core kernel code is usually a good indicator too.

Of course, it's always possible to overlook something. So the larger a set of crashes from which a conclusion is drawn, the better.

The strategy then is to look for common patterns.

In this particular case, there was another crash in the same location (among a dozen crashes in yet other areas) where the syscall number and epc were 0x68 and 0x000430d8 respectively.

  syscall 0x69 (sys_getitimer) and epc: 0x0004349c
  syscall 0x68 (sys_setitimer) and epc: 0x000430d8

These 2 slots are neighbors in sys_call_table. Maybe it's just a coincidence, but it is worth taking into account. Now, what about the content of epc? Do we actually know what the correct values would look like? We can look them up with nm or from disassembly:

  8004349c <sys_getitimer>:
  800430d8 <sys_setitimer>:

Is there another pattern? Yes, the only difference between good and bad values is in the high bit 0x80000000. ...
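Comparing expected and observed values bit by bit is a simple XOR. The sketch below uses hypothetical helper names and shows one way to test whether a corrupted value differs from the good one in exactly one bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits that differ between the expected and the observed value. */
static uint32_t flipped_bits(uint32_t expected, uint32_t observed)
{
    return expected ^ observed;
}

/* True if exactly one bit is set in mask. */
static bool is_single_bit(uint32_t mask)
{
    return mask != 0 && (mask & (mask - 1)) == 0;
}
```

For both crashes, 0x8004349c versus 0x0004349c and 0x800430d8 versus 0x000430d8, the difference is the single bit 0x80000000. Running the same comparison over every crash where the good value can be inferred is a quick way to see whether the pattern holds.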

p.s. This missing-high-bit theory had explanatory power when applied to some of the other "weird" crashes for which it was possible to infer the good value. How could this bit be cleared? In the end, it has been found that DDR timing settings were not properly set in the bootloader. However, as of the moment of this writing, it's not yet clear whether the problem has been completely resolved.

Perhaps, we can dedicate another post specifically to the analysis of "weird" crashes.

Many thanks to Yuri Leikind, Bero Brekalo, and Alina Krynina for review and useful suggestions.

Extra Details

[1] objdump

'objdump -d' alone may be sufficient in many cases (not to mention all the fun of matching disassembly and source code on your own). Alternatively, you can reproduce the original binary (if possible) with debugging information enabled and then use '-dS'. Be careful though to double-check that the addresses you are interested in correspond to the same instructions in both original and new disassembly files. If it's not the case, code shifts/changes should be taken into account.

[2] Calling conventions

Be sure to verify the options used by your toolchain, if in doubt. For gcc, '-mabi=type' options are used. For example, '-mabi=32' corresponds to o32.

[3] Virtual Memory Layout on MIPS

Please refer to MIPS Address Space for a general review.

Regarding the use in Linux:

1) kuseg range [0x00000000, 0x80000000) is user-space addresses.

A private address space of user-space processes resides in this range. From kernel-space this area can be safely accessed only by means of special-purpose functions, like copy_to_user() and copy_from_user(). Direct accesses are always a bug, even though, given the nature of MIPS's MMU, such accesses may appear to be working properly under certain circumstances.

2) kseg0 range [0x80000000, 0xa0000000) is kernel-space addresses used by the kernel code and data (vmlinux).

Dynamic allocations via general purpose allocators, such as kmalloc() and __get_free_pages() (but not from the ZONE_HIGHMEM zone) return addresses in this range. [ to-be-continued ]

3) kseg2 range [0xc0000000, 0xffffffff] is kernel-space addresses used by the code and data of kernel modules.

vmalloc() and vmap() allocations return addresses in this range.

For kuseg and kseg2, the translation of virtual addresses into physical ones is done via the MMU. Conversely, kseg0 addresses don't require MMU translation; the translation is done simply by stripping off the top bit. For example, 0x80100000 corresponds to 0x00100000 (1 MB) in RAM.

kseg0 ranges are both virtually and physically contiguous, while kuseg and kseg2 are only virtually contiguous.
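The layout above can be captured in a small classifier, which is handy for deciding at a glance whether an address from a report belongs to user space, the kernel image, or a module. This is a minimal sketch with hypothetical names. It also lists kseg1, [0xa0000000, 0xc0000000), the unmapped and uncached window that is not discussed in this post.

```c
#include <stdint.h>

enum mips_segment { KUSEG, KSEG0, KSEG1, KSEG2 };

/* Segment of a 32-bit virtual address, following the ranges listed above. */
static enum mips_segment classify(uint32_t va)
{
    if (va < 0x80000000u)
        return KUSEG;   /* user space, mapped via the MMU */
    if (va < 0xa0000000u)
        return KSEG0;   /* kernel image and kmalloc(), unmapped */
    if (va < 0xc0000000u)
        return KSEG1;   /* unmapped and uncached, not covered in this post */
    return KSEG2;       /* modules and vmalloc(), mapped via the MMU */
}

/* Physical address of a kseg0 address: strip the top bit. */
static uint32_t kseg0_to_phys(uint32_t va)
{
    return va & 0x7fffffffu;
}
```

Applied to the reports above: 0x8023afd0 (strlen) is in kseg0, 0xc01d6cc8 (the module code) is in kseg2, and the bogus epc 0x0004349c falls into kuseg.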

