Using crash to analyze a GPU illegal address access


1. Problem description

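During monkey (burn-in) testing of our product, the GPU driver very occasionally accesses an illegal address and triggers a kernel panic. After the panic, the ramdump mechanism is triggered on purpose, and the captured ramdump file is analyzed offline with the crash tool.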

2. Analyzing the ramdump with crash

2.1 Getting the log

In crash, use the dmesg or log command to dump the kernel log buffer (logbuf) as it was at the time of the panic. The key part is:

[24145.263446] aliperm##error[sess_perm:1692]can't find session with id 651
[24145.291018] Unhandled fault: level 3 address size fault (0x96000043) at 0xffff00001fbaf000
[24145.291021] Mem abort info:
[24145.291024]   Exception class = DABT (current EL), IL = 32 bits
[24145.291025]   SET = 0, FnV = 0
[24145.291026]   EA = 0, S1PTW = 0
[24145.291027] Data abort info:
[24145.291028]   ISV = 0, ISS = 0x00000043
[24145.291029]   CM = 0, WnR = 1
[24145.291033] Internal error: : 96000043 [#1] PREEMPT SMP
[24145.291039] Modules linked in: 8021q garp mrp bridge stp llc fts_ts veth himax_mmi xrp_hw_semidrive aes_neon_bs crc32_ce aes_neon_blk x9_ref_mach_ak7738 dwmac_dwc_qos_eth alitks_mod(PO) bcmdhd snd_sdrv_i2s_sc aliperm_mod(P) alisec_mod(P) alipatch_mod(P) alintgr_mod(P) alimbedtls_mod(P)
[24145.291075] CPU: 5 PID: 1803 Comm: pvr_defer_free Tainted: P        W  O    4.14.61+ #1
[24145.291076] Hardware name: Semidrive kunlun x9 REF Board (DT)
[24145.291079] task: ffff8001794fab80 task.stack: ffff00000e7d8000
[24145.291088] PC is at DeviceMemSet+0x8c/0xc8
[24145.291094] LR is at _ZeroPageArray+0x78/0x118
[24145.291096] pc : [<ffff00000876d444>] lr : [<ffff0000087609c0>] pstate: 20c00145
[24145.291097] sp : ffff00000e7dbd00
[24145.291098] x29: ffff00000e7dbd00 x28: ffff80015bc9fb00 
[24145.291102] x27: 00000000000186a0 x26: 0000000000000000 
[24145.291105] x25: 00e8000000000f0f x24: 0000000000000080 
[24145.291108] x23: ffff80012d0db000 x22: ffff00001fb7c000 
[24145.291111] x21: 0000000000000000 x20: 0000000000080000 
[24145.291114] x19: ffff00001fbaf000 x18: 0000ffffac000bcc 
[24145.291117] x17: 0000ffff97c22ec8 x16: ffff00000818ded0 
[24145.291121] x15: 0000ffffac000bc8 x14: 0140000000000000 
[24145.291124] x13: ffff00001fbfc000 x12: 0000000000000000 
[24145.291127] x11: 0000000000000000 x10: 0000000000000040 
[24145.291130] x9 : 0040000000000041 x8 : 0040000000000001 
[24145.291132] x7 : 0000000000000001 x6 : 000000017fffd7e8 
[24145.291135] x5 : ffff8001398feb98 x4 : ffff8001398feb98 
[24145.291138] x3 : ffff00001fbfbfff x2 : 0000000000000000 
[24145.291140] x1 : 0000000000000000 x0 : ffff00001fbfc000 
[24145.291146] 
               X4: 0xffff8001398feb18:
[24145.291147] eb18  398feb10 ffff8001 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291155] eb38  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291163] eb58  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291170] eb78  00000000 00007e28 1fac5000 ffff0000 1fb7c000 ffff0000 00000000 00000000
[24145.291178] eb98  33317c19 ffff8000 1e1c9018 ffff8000 1e1e2618 ffff8000 1e1c9030 ffff8000
[24145.291186] ebb8  1e1e2630 ffff8000 398fe5c0 ffff8001 00000000 00000000 00000000 00000000
[24145.291193] ebd8  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291201] ebf8  00000000 00000000 39a12880 ffff8001 00000000 00000000 00000020 00000000
[24145.291210] 
               X5: 0xffff8001398feb18:
[24145.291210] eb18  398feb10 ffff8001 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291218] eb38  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291225] eb58  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291233] eb78  00000000 00007e28 1fac5000 ffff0000 1fb7c000 ffff0000 00000000 00000000
[24145.291240] eb98  33317c19 ffff8000 1e1c9018 ffff8000 1e1e2618 ffff8000 1e1c9030 ffff8000
[24145.291248] ebb8  1e1e2630 ffff8000 398fe5c0 ffff8001 00000000 00000000 00000000 00000000
[24145.291256] ebd8  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291263] ebf8  00000000 00000000 39a12880 ffff8001 00000000 00000000 00000020 00000000
[24145.291276] 
               X23: 0xffff80012d0daf80:
[24145.291277] af80  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291284] afa0  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291292] afc0  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291300] afe0  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291307] b000  04fcd9c0 ffff7e00 04fcd980 ffff7e00 04fce8c0 ffff7e00 04fce880 ffff7e00
[24145.291315] b020  04fcebc0 ffff7e00 04fceb80 ffff7e00 04fcf0c0 ffff7e00 04fcf080 ffff7e00
[24145.291323] b040  04fcf2c0 ffff7e00 04fcf280 ffff7e00 04fcfac0 ffff7e00 04fcfa80 ffff7e00
[24145.291330] b060  04fcffc0 ffff7e00 04fcff80 ffff7e00 0433aec0 ffff7e00 0433ae80 ffff7e00
[24145.291340] 
               X28: 0xffff80015bc9fa80:
[24145.291341] fa80  786f7270 5f632e79 31393533 ffff8000 5bc9fa88 ffff8001 718dbc00 ffff8001
[24145.291349] faa0  6596aa00 ffff8001 00000000 00000000 2cf5a295 00000003 00000002 00000000
[24145.291356] fac0  0000001d 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291364] fae0  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291371] fb00  00000000 00000000 00000000 00000000 08760a60 ffff0000 5bc9fb00 ffff8001
[24145.291379] fb20  00000000 00000000 000000c8 00000000 00000000 00000000 eb2d59e0 ffff8000
[24145.291386] fb40  0000003c 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291394] fb60  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00005a47
[24145.291402] 
[24145.291404] Process pvr_defer_free (pid: 1803, stack limit = 0xffff00000e7d8000)
[24145.291405] Call trace:
[24145.291408] Exception stack(0xffff00000e7dbbc0 to 0xffff00000e7dbd00)
[24145.291411] bbc0: ffff00001fbfc000 0000000000000000 0000000000000000 ffff00001fbfbfff
[24145.291414] bbe0: ffff8001398feb98 ffff8001398feb98 000000017fffd7e8 0000000000000001
[24145.291417] bc00: 0040000000000001 0040000000000041 0000000000000040 0000000000000000
[24145.291420] bc20: 0000000000000000 ffff00001fbfc000 0140000000000000 0000ffffac000bc8
[24145.291422] bc40: ffff00000818ded0 0000ffff97c22ec8 0000ffffac000bcc ffff00001fbaf000
[24145.291425] bc60: 0000000000080000 0000000000000000 ffff00001fb7c000 ffff80012d0db000
[24145.291428] bc80: 0000000000000080 00e8000000000f0f 0000000000000000 00000000000186a0
[24145.291431] bca0: ffff80015bc9fb00 ffff00000e7dbd00 ffff0000087609c0 ffff00000e7dbd00
[24145.291434] bcc0: ffff00000876d444 0000000020c00145 ffff00000e7dbd30 ffff0000087609ac
[24145.291436] bce0: ffffffffffffffff ffff00000877e72c ffff00000e7dbd00 ffff00000876d444
[24145.291440] [] DeviceMemSet+0x8c/0xc8
[24145.291444] [] _ZeroPageArray+0x78/0x118
[24145.291447] [] _CleanupThread_CleanPages+0xfc/0x2b8
[24145.291453] [] CleanupThread+0x148/0x3a0
[24145.291455] [] OSThreadRun+0x28/0x60
[24145.291461] [] kthread+0x138/0x140
[24145.291465] [] ret_from_fork+0x10/0x1c
[24145.291469] Code: 927cec00 91004000 8b000260 b3607c42 (a9000a62) 
[24145.291483] SMP: stopping secondary CPUs
[24145.291503] ---[ end trace 2abaad5900994446 ]---
[24145.335190] Kernel panic - not syncing: Fatal exception
[24145.335308] Kernel Offset: disabled
[24145.335312] CPU features: 0x0802210
[24145.335313] Memory Limit: none
[24146.006485] Rebooting in 1 seconds..
[24147.010058] flush all cache
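
The ESR value in the panic header (0x96000043) can be decoded by hand to cross-check the "Mem abort info" and "Data abort info" lines. The sketch below follows the ESR_ELx layout from the Arm architecture (EC in bits [31:26], IL in bit 25, and for data aborts WnR in bit 6 and DFSC in bits [5:0]). The helper name decode_esr and the small lookup tables are made up for illustration and only cover the fault classes relevant here.

```python
# Illustrative decoder for an AArch64 ESR_ELx value (data abort case only).
# decode_esr and the tables below are hypothetical helpers, not kernel code.

EC_NAMES = {
    0x24: "DABT (lower EL)",
    0x25: "DABT (current EL)",
}

# DFSC[5:2] selects the fault class; DFSC[1:0] is the translation level.
DFSC_CLASSES = {
    0b0000: "address size fault",
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}

def decode_esr(esr):
    ec = (esr >> 26) & 0x3F          # exception class
    il = (esr >> 25) & 0x1           # instruction length bit
    iss = esr & 0x1FFFFFF            # instruction specific syndrome
    wnr = (iss >> 6) & 0x1           # 1 = write, 0 = read
    dfsc = iss & 0x3F                # data fault status code
    return {
        "ec": EC_NAMES.get(ec, hex(ec)),
        "il_bits": 32 if il else 16,
        "iss": iss,
        "write": bool(wnr),
        "fault": DFSC_CLASSES.get(dfsc >> 2, "other"),
        "level": dfsc & 0x3,
    }

info = decode_esr(0x96000043)
print(info)
```

For 0x96000043 this gives a write (WnR = 1) data abort taken at the current EL, classified as an address size fault at level 3, which agrees with the first line of the panic log.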

2.2 Analysis based on the log

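The log shows that the GPU driver hit an illegal address while doing a memset. The first guess is an out-of-bounds access on the buffer. Start from the PC to look at the faulting site: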

crash> dis DeviceMemSet+0x8c -l
/drivers/gpu/rogue_km/services/shared/common/mem_utils.c: 335
0xffff00000876d444 <DeviceMemSet+140>:  stp     x2, x2, [x19]

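The corresponding source is:
[Figure 1: screenshot of the source code at mem_utils.c line 335]
Use crash to get the register context at the time of the exception: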

crash> bt -e
PID: 1803   TASK: ffff8001794fab80  CPU: 5   COMMAND: "pvr_defer_free"

KERNEL-MODE EXCEPTION FRAME AT: ffff00000e7dbbc0
     PC: ffff00000876d444  [DeviceMemSet+140]
     LR: ffff0000087609c0  [_ZeroPageArray+120]
     SP: ffff00000e7dbd00  PSTATE: 20c00145
    X29: ffff00000e7dbd00  X28: ffff80015bc9fb00  X27: 00000000000186a0
    X26: 0000000000000000  X25: 00e8000000000f0f  X24: 0000000000000080
    X23: ffff80012d0db000  X22: ffff00001fb7c000  X21: 0000000000000000
    X20: 0000000000080000  X19: ffff00001fbaf000  X18: 0000ffffac000bcc
    X17: 0000ffff97c22ec8  X16: ffff00000818ded0  X15: 0000ffffac000bc8
    X14: 0140000000000000  X13: ffff00001fbfc000  X12: 0000000000000000
    X11: 0000000000000000  X10: 0000000000000040   X9: 0040000000000041
     X8: 0040000000000001   X7: 0000000000000001   X6: 000000017fffd7e8
     X5: ffff8001398feb98   X4: ffff8001398feb98   X3: ffff00001fbfbfff
     X2: 0000000000000000   X1: 0000000000000000   X0: ffff00001fbfc000

The register context shows X19 = ffff00001fbaf000, which matches the fault address reported in the panic log.
Disassemble the whole DeviceMemSet function:

crash> dis DeviceMemSet
0xffff00000876d3b8 <DeviceMemSet>:      stp     x29, x30, [sp, #-48]!
0xffff00000876d3bc <DeviceMemSet+4>:    mov     x29, sp
0xffff00000876d3c0 <DeviceMemSet+8>:    stp     x19, x20, [sp, #16]
0xffff00000876d3c4 <DeviceMemSet+12>:   str     x21, [sp, #32]
0xffff00000876d3c8 <DeviceMemSet+16>:   mov     x19, x0
0xffff00000876d3cc <DeviceMemSet+20>:   and     w21, w1, #0xff
0xffff00000876d3d0 <DeviceMemSet+24>:   mov     x20, x2
0xffff00000876d3d4 <DeviceMemSet+28>:   mov     x0, x30
0xffff00000876d3d8 <DeviceMemSet+32>:   nop
0xffff00000876d3dc <DeviceMemSet+36>:   ands    x1, x19, #0xf
0xffff00000876d3e0 <DeviceMemSet+40>:   b.eq    0xffff00000876d418 <DeviceMemSet+96>  // b.none
0xffff00000876d3e4 <DeviceMemSet+44>:   mov     x0, #0x10                       // #16
0xffff00000876d3e8 <DeviceMemSet+48>:   sub     x0, x0, x1
0xffff00000876d3ec <DeviceMemSet+52>:   cmp     x0, x20
0xffff00000876d3f0 <DeviceMemSet+56>:   csel    x0, x0, x20, ls  // ls = plast
0xffff00000876d3f4 <DeviceMemSet+60>:   sub     x20, x20, x0
0xffff00000876d3f8 <DeviceMemSet+64>:   cbz     x0, 0xffff00000876d418 <DeviceMemSet+96>
0xffff00000876d3fc <DeviceMemSet+68>:   add     x0, x19, x0
0xffff00000876d400 <DeviceMemSet+72>:   mov     x1, x19
0xffff00000876d404 <DeviceMemSet+76>:   strb    w21, [x1]
0xffff00000876d408 <DeviceMemSet+80>:   add     x19, x19, #0x1
0xffff00000876d40c <DeviceMemSet+84>:   mov     x1, x19
0xffff00000876d410 <DeviceMemSet+88>:   cmp     x0, x19
0xffff00000876d414 <DeviceMemSet+92>:   b.ne    0xffff00000876d404 <DeviceMemSet+76>  // b.any
0xffff00000876d418 <DeviceMemSet+96>:   cmp     x20, #0xf
0xffff00000876d41c <DeviceMemSet+100>:  b.ls    0xffff00000876d458 <DeviceMemSet+160>  // b.plast
0xffff00000876d420 <DeviceMemSet+104>:  lsl     w2, w21, #16
0xffff00000876d424 <DeviceMemSet+108>:  orr     w1, w21, w21, lsl #8
0xffff00000876d428 <DeviceMemSet+112>:  orr     w2, w2, w21, lsl #24
0xffff00000876d42c <DeviceMemSet+116>:  sub     x0, x20, #0x10
0xffff00000876d430 <DeviceMemSet+120>:  orr     w2, w2, w1
0xffff00000876d434 <DeviceMemSet+124>:  and     x0, x0, #0xfffffffffffffff0
0xffff00000876d438 <DeviceMemSet+128>:  add     x0, x0, #0x10
0xffff00000876d43c <DeviceMemSet+132>:  add     x0, x19, x0
0xffff00000876d440 <DeviceMemSet+136>:  bfi     x2, x2, #32, #32
0xffff00000876d444 <DeviceMemSet+140>:  stp     x2, x2, [x19]            //panic
...
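
To make the disassembly easier to follow, the sketch below models in Python what the listed instructions compute: the byte-wise head loop that aligns the destination to 16 bytes, the 64-bit fill pattern built from the byte value, and the end address of the 16-byte stp loop. It is a reading of the disassembly above, not the driver source, and the function name memset_plan and its return layout are hypothetical.

```python
MASK64 = (1 << 64) - 1

def memset_plan(dst, value, size):
    """Model the register arithmetic of the listed DeviceMemSet instructions.

    Returns (head_bytes, pattern, aligned_dst, stp_end), where stp_end is
    None when fewer than 16 bytes remain after alignment.
    """
    byte = value & 0xFF                      # and  w21, w1, #0xff
    head = 0
    misalign = dst & 0xF                     # ands x1, x19, #0xf
    if misalign:
        head = min(16 - misalign, size)      # csel x0, x0, x20, ls
    dst += head                              # strb loop advances x19
    size -= head                             # sub  x20, x20, x0

    # w2 = b<<16 | b<<24 | b | b<<8, then bfi copies it into the top half.
    w2 = (byte << 16) | (byte << 24) | byte | (byte << 8)
    pattern = (w2 | (w2 << 32)) & MASK64

    stp_end = None
    if size > 0xF:                           # cmp x20, #0xf ; b.ls
        span = ((size - 0x10) & ~0xF) + 0x10 # sub / and / add on x0
        stp_end = (dst + span) & MASK64      # add x0, x19, x0
    return head, pattern, dst, stp_end

print([hex(v) if v is not None else None
       for v in memset_plan(0x1003, 0xAB, 0x40)])
```

With a 16-byte-aligned destination and a fill value of 0, the head loop is skipped and the pattern is 0, which is consistent with X1 = 0 and X2 = 0 in the register dump.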

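Putting the source and the disassembly together, the most likely cause is that the pointer pDst went out of bounds. From the surrounding code, the buffer that pDst points to and its size are obtained as follows:
[Figure 2: screenshot of the source code that sets up the buffer and size passed in as pDst]
The source shows that the buffer and size accessed through pDst are determined by the function's parameters, so pDst must stay within the bounds those parameters give. Because a callee saves in its own stack frame the callee-saved registers it is about to modify, information about the call to DeviceMemSet() can be recovered from the stack frames: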

crash> bt -f
PID: 1803   TASK: ffff8001794fab80  CPU: 5   COMMAND: "pvr_defer_free"
 #0 [ffff00000e7dbd00] DeviceMemSet at ffff00000876d440
    ffff00000e7dbd00: ffff00000e7dbd30 ffff0000087609c0 
    ffff00000e7dbd10: 0000000000000080 0000000000000080 
    ffff00000e7dbd20: 0000000000000080 ffff00000877fcb4 
 #1 [ffff00000e7dbd30] _ZeroPageArray at ffff0000087609bc
    ffff00000e7dbd30: ffff00000e7dbd80 ffff000008760b5c 
    ffff00000e7dbd40: ffff80015bc9fb00 ffff8000eb2d59e0 
    ffff00000e7dbd50: ffff000008760a60 ffff80015bc9f800 
    ffff00000e7dbd60: 0000000000000000 0000000000000000 
    ffff00000e7dbd70: ffff000009334000 ffff00000877283c 
 #2 [ffff00000e7dbd80] _CleanupThread_CleanPages at ffff000008760b58
    ffff00000e7dbd80: ffff00000e7dbdc0 ffff000008781be0 
    ffff00000e7dbd90: ffff80017910b200 ffff80017910b290 
    ffff00000e7dbda0: ffff000008760a60 ffff000008781bd4 
    ffff00000e7dbdb0: ffff80017910b200 ffff80017910b290 
 #3 [ffff00000e7dbdc0] CleanupThread at ffff000008781bdc
    ffff00000e7dbdc0: ffff00000e7dbe50 ffff00000875c258 
    ffff00000e7dbdd0: ffff8001783be180 ffff80017918d080 
    ffff00000e7dbde0: ffff8001794fab80 ffff000009b8cf70 
    ffff00000e7dbdf0: ffff00000a563be8 ffff8001783be180 
    ffff00000e7dbe00: ffff00000875c230 ffff000009313368 
    ffff00000e7dbe10: 0000000000000000 0000000000000000 
    ffff00000e7dbe20: ffff000009391ad8 ffff000009334968 
    ffff00000e7dbe30: ffff8001783be180 ffff80017918d900 
    ffff00000e7dbe40: ffff80017918d380 168cfdb4a63a5800 
 #4 [ffff00000e7dbe50] OSThreadRun at ffff00000875c254
    ffff00000e7dbe50: ffff00000e7dbe70 ffff000008108c70 
    ffff00000e7dbe60: ffff8001783be980 168cfdb4a63a5800 
 #5 [ffff00000e7dbe70] kthread at ffff000008108c6c

The callee DeviceMemSet() saves the following registers in its stack frame:

crash> dis DeviceMemSet
0xffff00000876d3b8 <DeviceMemSet>:      stp     x29, x30, [sp, #-48]!
0xffff00000876d3bc <DeviceMemSet+4>:    mov     x29, sp
0xffff00000876d3c0 <DeviceMemSet+8>:    stp     x19, x20, [sp, #16]      // save X19, X20 on the stack
0xffff00000876d3c4 <DeviceMemSet+12>:   str     x21, [sp, #32]           // save X21 on the stack

The SP of DeviceMemSet() is 0xffff00000e7dbd00.
Now look at the caller of DeviceMemSet():

crash> dis _ZeroPageArray 
0xffff000008760948 <_ZeroPageArray>:    stp     x29, x30, [sp, #-80]!
0xffff00000876094c <_ZeroPageArray+4>:  mov     x29, sp
0xffff000008760950 <_ZeroPageArray+8>:  stp     x20, x21, [sp, #24]
0xffff000008760954 <_ZeroPageArray+12>: str     x23, [sp, #48]
0xffff000008760958 <_ZeroPageArray+16>: str     x25, [sp, #64]
0xffff00000876095c <_ZeroPageArray+20>: mov     w20, w0
0xffff000008760960 <_ZeroPageArray+24>: mov     x23, x1
0xffff000008760964 <_ZeroPageArray+28>: mov     x25, x2
0xffff000008760968 <_ZeroPageArray+32>: mov     x0, x30
0xffff00000876096c <_ZeroPageArray+36>: nop
0xffff000008760970 <_ZeroPageArray+40>: mov     w21, #0x400                     // #1024
0xffff000008760974 <_ZeroPageArray+44>: cmp     w20, w21
0xffff000008760978 <_ZeroPageArray+48>: csel    w21, w20, w21, ls  // ls = plast
0xffff00000876097c <_ZeroPageArray+52>: cbz     w20, 0xffff0000087609e4 <_ZeroPageArray+156>
0xffff000008760980 <_ZeroPageArray+56>: str     x19, [x29, #16]
0xffff000008760984 <_ZeroPageArray+60>: str     x22, [x29, #40]
0xffff000008760988 <_ZeroPageArray+64>: str     x24, [x29, #56]
0xffff00000876098c <_ZeroPageArray+68>: cmp     w20, w21
0xffff000008760990 <_ZeroPageArray+72>: mov     x3, x25
0xffff000008760994 <_ZeroPageArray+76>: csel    w19, w20, w21, ls  // ls = plast
0xffff000008760998 <_ZeroPageArray+80>: mov     w2, #0xffffffff                 // #-1
0xffff00000876099c <_ZeroPageArray+84>: mov     w1, w19
0xffff0000087609a0 <_ZeroPageArray+88>: mov     x0, x23
0xffff0000087609a4 <_ZeroPageArray+92>: mov     w24, w19
0xffff0000087609a8 <_ZeroPageArray+96>: bl      0xffff000008267958 
0xffff0000087609ac <_ZeroPageArray+100>:        mov     x22, x0
0xffff0000087609b0 <_ZeroPageArray+104>:        cbz     x0, 0xffff000008760a00 <_ZeroPageArray+184>
0xffff0000087609b4 <_ZeroPageArray+108>:        lsl     x2, x24, #12
0xffff0000087609b8 <_ZeroPageArray+112>:        mov     w1, #0x0                        // #0
0xffff0000087609bc <_ZeroPageArray+116>:        bl      0xffff00000876d3b8 <DeviceMemSet>
...

DeviceMemSet() is called as follows:

0xffff0000087609a4 <_ZeroPageArray+92>: mov     w24, w19
0xffff0000087609a8 <_ZeroPageArray+96>: bl      0xffff000008267958 
0xffff0000087609ac <_ZeroPageArray+100>:        mov     x22, x0
0xffff0000087609b0 <_ZeroPageArray+104>:        cbz     x0, 0xffff000008760a00 <_ZeroPageArray+184>
0xffff0000087609b4 <_ZeroPageArray+108>:        lsl     x2, x24, #12
0xffff0000087609b8 <_ZeroPageArray+112>:        mov     w1, #0x0                        // #0
0xffff0000087609bc <_ZeroPageArray+116>:        bl      0xffff00000876d3b8 <DeviceMemSet>

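So the arguments passed to DeviceMemSet() are X1 = 0, X0 = X22 and X2 = X24 << 12, where X24 = X19. DeviceMemSet() saves the caller's X19 at its own SP + 16, and SP = 0xffff00000e7dbd00, so the caller's X19 can be read from the stack: 0x80 (X24 in the exception frame holds the same value). The buffer size is therefore 0x80 * 4096 bytes = 0x80000. DeviceMemSet() does not use X22, so at the time of the exception X22 still holds the value set by the caller _ZeroPageArray(), which is 0xffff00001fb7c000.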
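
The stack read-out can be written down as a small calculation. The sketch below takes the three 16-byte rows that bt -f printed for frame #0 and maps them onto the slots used by the DeviceMemSet prologue (x29/x30 at SP, x19/x20 at SP + 16, x21 at SP + 32). The dictionary layout and the names are illustrative, and the 4 KiB page size is taken from the lsl #12 in the caller.

```python
SP = 0xFFFF00000E7DBD00

# 64-bit words of frame #0 as printed by `bt -f`, lowest address first.
frame_words = [
    0xFFFF00000E7DBD30, 0xFFFF0000087609C0,   # SP + 0:  x29, x30
    0x0000000000000080, 0x0000000000000080,   # SP + 16: x19, x20
    0x0000000000000080, 0xFFFF00000877FCB4,   # SP + 32: x21, slot not written by the prologue
]

# Byte offsets taken from the stp/str instructions in the prologue.
SAVED_SLOTS = {"x29": 0, "x30": 8, "x19": 16, "x20": 24, "x21": 32}

def saved_registers(words, slots):
    """Return {register: value} for each saved slot in the frame."""
    return {reg: words[off // 8] for reg, off in slots.items()}

saved = saved_registers(frame_words, SAVED_SLOTS)
PAGE_SHIFT = 12                       # from `lsl x2, x24, #12`
num_pages = saved["x19"]              # caller's x19, copied to x24 before the call
size = num_pages << PAGE_SHIFT

print({reg: hex(val) for reg, val in saved.items()})
print(hex(size))
```

The saved x30 in this frame is the return address into _ZeroPageArray, the same value that bt -e reports as LR, which is a useful sanity check that the right frame is being read.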

That gives both the start address and the size of the buffer to be memset, so the range is 0xffff00001fb7c000 ~ 0xffff00001fb7c000 + 0x80000 - 1, i.e. 0xffff00001fb7c000 ~ 0xffff00001fbfc000 - 1.

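The faulting instruction is the stp:

stp     x2, x2, [x19]

It stores 16 bytes at X19, so the last in-range store happens at X19 = 0xffff00001fbfbff0, and 0xffff00001fbfbff0 + 0x10 = 0xffff00001fbfc000 (the value held in X0) is the first address past the buffer. The X19 captured at the fault, however, is 0xffff00001fbaf000, which is 0x33000 bytes into the buffer and therefore still inside the requested range. Together with the "level 3 address size fault" in the log, this suggests that the level-3 page-table entry backing that page of the mapping is invalid, rather than the pointer having run past the end of the buffer.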
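
A quick arithmetic check of the range against the addresses in the log helps at this step. The sketch below uses the start address from X22, the size derived from the stack, and the fault address from the panic header. The variable names are illustrative.

```python
START = 0xFFFF00001FB7C000   # X22: buffer start passed to DeviceMemSet as X0
SIZE = 0x80 << 12            # 0x80 pages of 4 KiB each
END = START + SIZE           # first address past the buffer
FAULT = 0xFFFF00001FBAF000   # fault address from the log (X19)
STORE_BYTES = 16             # one stp of two 64-bit registers

last_store = END - STORE_BYTES
offset = FAULT - START
in_range = START <= FAULT and FAULT + STORE_BYTES <= END

print(hex(END))              # same value as X0 in the register dump
print(hex(last_store))
print(hex(offset), offset >> 12, in_range)
```

With these inputs the fault address sits 0x33000 bytes from the start, that is page 0x33 of 0x80, so the mapping of that particular page is the thing worth inspecting next.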

With that, the culprit has been tracked down. All that is left is to find whoever owns it...
