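This is a record of debugging an intermittent crash. Because the problem only reproduces by chance, nothing here could be verified; what follows is reasoning from the material available. If anything is wrong, corrections are welcome.
Symptom:
One device crashed intermittently while running the monkey test. Platform: mt6580-N.
The mtklog showed that both an NE (native exception) and a KE (kernel exception) had occurred. Unpacking the db with the GAT tool (LogView -> Open AEE DB) gives: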
Build Info: 'alps-mp-n0.mp2:alps-mp-n0.mp2-V1_tinno6580.we.n_P181:mt6580:S01,TINNO/k300/k300:7.0/NRD90M/978301282079:user/release-keys'
Flavor Info: 'None'
Exception Log Time:[Thu Nov 16 18:16:15 CST 2017] [2664.454478]
Exception Class: Native (NE)
Exception Type: SIGSEGV
Current Executing Process:
pid: 813, tid: 818
system_server
Backtrace:
#00 pc 0015295a /system/lib/libart.so (_ZN3art2gc9allocator8RosAlloc3Run15InspectAllSlotsEPFvPvS4_jS4_ES4_+73)
#01 pc 00153a7f /system/lib/libart.so (_ZN3art2gc9allocator8RosAlloc10InspectAllEPFvPvS3_jS3_ES3_+358)
#02 pc 001b0ce1 /system/lib/libart.so (_ZN3art2gc5space13RosAllocSpace18InspectAllRosAllocEPFvPvS3_jS3_ES3_b+112)
#03 pc 001b122b /system/lib/libart.so (_ZThn36_N3art2gc5space13RosAllocSpace19GetObjectsAllocatedEv+30)
#04 pc 001924a1 /system/lib/libart.so (_ZNK3art2gc4Heap19GetObjectsAllocatedEv+528)
#05 pc 001967c5 /system/lib/libart.so (_ZN3art2gc4Heap14DumpForSigQuitERNSt3__113basic_ostreamIcNS2_11char_traitsIcEEEE+292)
#06 pc 003208f9 /system/lib/libart.so (_ZN3art7Runtime14DumpForSigQuitERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+144)
#07 pc 003251b7 /system/lib/libart.so (_ZN3art13SignalCatcher13HandleSigQuitEv+1394)
#08 pc 00324325 /system/lib/libart.so (_ZN3art13SignalCatcher3RunEPv+336)
#09 pc 000482a3 /system/lib/libc.so (_ZL15__pthread_startPv+22)
#10 pc 00019d0d /system/lib/libc.so (__start_thread+6)
Resolving the call stack:
arm32-linux-androideabi-addr2line -fC -e libart.so 0015295a 00153a7f 001b0ce1 001b122b 001924a1 001967c5 003208f9 003251b7 00324325
art::gc::allocator::RosAlloc::Slot::Next() const
/proc/self/cwd/art/runtime/gc/allocator/rosalloc.h:119
art::gc::allocator::RosAlloc::InspectAll(void (*)(void*, void*, unsigned int, void*), void*)
/proc/self/cwd/art/runtime/gc/allocator/rosalloc.cc:1470
art::gc::space::RosAllocSpace::InspectAllRosAlloc(void (*)(void*, void*, unsigned int, void*), void*, bool)
/proc/self/cwd/art/runtime/gc/space/rosalloc_space.cc:322
art::gc::space::RosAllocSpace::GetObjectsAllocated()
/proc/self/cwd/art/runtime/gc/space/rosalloc_space.cc:297
art::gc::Heap::GetObjectsAllocated() const
/proc/self/cwd/art/runtime/gc/heap.cc:1877
art::gc::Heap::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)
/proc/self/cwd/art/runtime/gc/heap.cc:3491 (discriminator 4)
art::Runtime::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)
/proc/self/cwd/art/runtime/runtime.cc:1401 (discriminator 1)
art::SignalCatcher::HandleSigQuit()
/proc/self/cwd/art/runtime/signal_catcher.cc:144
art::SignalCatcher::Run(void*)
/proc/self/cwd/art/runtime/signal_catcher.cc:211
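The address list handed to addr2line above was typed by hand; it can also be pulled out of the AEE backtrace with a few lines of script. The following is only a minimal sketch: the function name `pcs_by_library` and the two-frame sample are made up for illustration, and the regular expression assumes the `#NN pc ADDR LIB (symbol)` frame layout shown in the backtrace above.

```python
import re

# Two frames copied from the backtrace above; real input would be the whole block.
BACKTRACE = """\
#00 pc 0015295a /system/lib/libart.so (_ZN3art2gc9allocator8RosAlloc3Run15InspectAllSlotsEPFvPvS4_jS4_ES4_+73)
#09 pc 000482a3 /system/lib/libc.so (_ZL15__pthread_startPv+22)
"""

# Assumed frame layout: "#<index> pc <hex address> <library path> ..."
FRAME_RE = re.compile(r"#(\d+)\s+pc\s+([0-9a-f]+)\s+(\S+)")


def pcs_by_library(text):
    """Group frame addresses by library, keeping the frame order."""
    grouped = {}
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            _, pc, lib = match.groups()
            grouped.setdefault(lib, []).append(pc)
    return grouped


if __name__ == "__main__":
    for lib, pcs in pcs_by_library(BACKTRACE).items():
        # Each line is the argument list for one addr2line run on that library.
        print(lib, " ".join(pcs))
```

Grouping by library matters because each addr2line run takes exactly one ELF file, so the libc.so frames must not be mixed into the libart.so run.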
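SIGSEGV means the current thread accessed illegal memory, but nothing here looks obviously wrong.
Judging by the function names, the call stack above is handling a signal, so my first impression is that this is not the original point of failure.
Next, look at what thread tid=818 was doing by searching for tid=818 in NE_JBT_TRACE: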
"Signal Catcher" daemon prio=5 tid=2 Runnable
| group="system" sCount=0 dsCount=0 obj=0x12c010d0 self=0xa0a8d000
| sysTid=818 nice=0 cgrp=default sched=0/0 handle=0xa6728920
| state=R schedstat=( 109230311 3211154 108 ) utm=6 stm=4 core=2 HZ=100
| stack=0xa662c000-0xa662e000 stackSize=1014KB
| held mutexes= "mutator lock"(shared held)
(no managed stack frames)
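Searching a large trace file by hand is error-prone, because `tid=2` in the header line is the runtime's own thread index while `sysTid=818` is the kernel thread id that the tombstone reports. A small sketch of a lookup by `sysTid` follows; the helper name `find_thread` is hypothetical, the sample glues together excerpts of two dumps quoted in this post, and it assumes thread blocks are separated by blank lines.

```python
import re

# Excerpts of two thread dumps from this post, joined only for illustration.
TRACES = '''\
"Signal Catcher" daemon prio=5 tid=2 Runnable
  | group="system" sCount=0 dsCount=0 obj=0x12c010d0 self=0xa0a8d000
  | sysTid=818 nice=0 cgrp=default sched=0/0 handle=0xa6728920

"main" prio=5 tid=1 Waiting
  | group="main" sCount=1 dsCount=0 obj=0x744a1f18 self=0xa7005400
  | sysTid=11945 nice=0 cgrp=default sched=0/0 handle=0xaa302534
'''


def find_thread(traces, sys_tid):
    """Return the thread block whose sysTid equals sys_tid, or None."""
    # Assumption: blocks are separated by at least one blank line.
    for block in re.split(r"\n\s*\n", traces):
        if re.search(r"\bsysTid=%d\b" % sys_tid, block):
            return block.strip()
    return None


if __name__ == "__main__":
    print(find_thread(TRACES, 818))
```

The word boundaries in the pattern keep `sysTid=1` from matching `sysTid=11945`.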
This also looks fairly normal and still gives no clue. Moving on to the Android log:
11-16 18:16:07.106 813 839 I Process : Sending signal. PID: 813 SIG: 3
11-16 18:16:07.106 813 818 I art : Thread[2,tid=818,WaitingInMainSignalCatcherLoop,Thread*=0xa0a8d000,peer=0x12c010d0,"Signal Catcher"]: reacting to signal 3 ==> SIGQUIT
11-16 18:16:07.106 813 818 I art :
11-16 18:16:07.107 813 818 I art : Enter while loop.
11-16 18:16:07.113 11945 11951 I art : Wrote stack traces to '/data/anr/traces.txt'
11-16 18:16:07.117 813 818 F libc : Fatal signal 11 (SIGSEGV), code 1, fault addr 0xe3d8d8d8 in tid 818 (Signal Catcher)
11-16 18:16:07.128 562 562 D AEE_AED : $===AEE===AEE===AEE===$
11-16 18:16:07.129 562 562 D AEE_AED : p 2 poll events 1 revents 1
11-16 18:16:07.129 562 562 D AEE_AED : PPM cpu cores:4, online:2
11-16 18:16:07.130 562 562 D AEE_AED : aed_main_fork_worker: generator 0xa9d94df8, worker 0xbecce944, recv_fd 0
11-16 18:16:07.132 12737 12737 I AEE_AED : handle_request(0)
11-16 18:16:07.132 12737 12737 I AEE_AED : check process 813 name:system_server
11-16 18:16:07.132 12737 12737 I AEE_AED : tid 818 abort msg address:0x00000000, si_code:1 (request from 813:1000)
Analyzing SYS_KERNEL_LOG, searching for the keyword sig:
[ 2656.013962] -(0)[839:ActivityManager][name:mtprof&][signal][839:ActivityManager] send death sig 3 to [11945:com.myos.camera:S]
[ 2656.402325] -(0)[839:ActivityManager][name:mtprof&][signal][839:ActivityManager] send death sig 3 to [813:system_server:S]
[ 2656.503634] -(0)[818:Signal Catcher][name:mtprof&][signal][818:Signal Catcher] send death sig 11 to [818:Signal Catcher:R]
[ 2662.629715] (1)... UTC;android time 2017-11-16 18:16:13.333643
2662.629715 ==> 2017-11-16 18:16:13, so
2656.402325 ==> 2017-11-16 18:16:07, which lines up exactly with the Android log.
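The conversion between kernel timestamps and wall-clock time is plain arithmetic on the one line that carries both values. A minimal sketch, assuming the "android time" value is in the same local time zone as the Android log and that the clock was not adjusted between the two points (the names `ANCHOR_UPTIME`, `ANCHOR_WALL` and `kernel_to_wall` are illustrative only):

```python
from datetime import datetime, timedelta

# Anchor pair taken from the "android time" line in the kernel log above.
ANCHOR_UPTIME = 2662.629715
ANCHOR_WALL = datetime(2017, 11, 16, 18, 16, 13, 333643)


def kernel_to_wall(uptime_seconds):
    """Map a kernel log timestamp to wall-clock time via the anchor pair."""
    return ANCHOR_WALL + timedelta(seconds=uptime_seconds - ANCHOR_UPTIME)


if __name__ == "__main__":
    # Timestamp of "send death sig 3 to [813:system_server:S]".
    print(kernel_to_wall(2656.402325))
```

For 2656.402325 this lands at about 18:16:07.106, the same instant as the `Sending signal. PID: 813 SIG: 3` entry in the Android log.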
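Putting these together:
ActivityManager (tid 839) sent sig 3 (SIGQUIT) first to com.myos.camera and then to system_server. In system_server the Signal Catcher thread (tid=818) picked up sig 3 and started dumping traces, and about 0.1 s later that same thread hit a segmentation fault, so the Linux kernel delivered sig 11 to it. The Java VM catches sig 11 itself: it prints the backtrace and then re-sends sig 11 to the process, which is what finally kills it. This explains the order and timing of the signals above. Note that sending sig 3 to a Java process only makes it print its traces, whereas sending sig 3 directly to a native process terminates it.
At this point the specific cause is still unclear. Now look at the KE information: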
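The order of the signals can be read off the kernel log mechanically instead of by eye. The sketch below parses the three `send death sig` lines quoted above into a sorted timeline; the regular expression is an assumption based only on the layout of those three lines, and `signal_timeline` is a made-up helper name.

```python
import re

KERNEL_LOG = """\
[ 2656.013962] -(0)[839:ActivityManager][name:mtprof&][signal][839:ActivityManager] send death sig 3 to [11945:com.myos.camera:S]
[ 2656.402325] -(0)[839:ActivityManager][name:mtprof&][signal][839:ActivityManager] send death sig 3 to [813:system_server:S]
[ 2656.503634] -(0)[818:Signal Catcher][name:mtprof&][signal][818:Signal Catcher] send death sig 11 to [818:Signal Catcher:R]
"""

# Assumed layout: [time] ... [signal][sender pid:name] send death sig N to [pid:name:state]
SIG_RE = re.compile(
    r"\[\s*(\d+\.\d+)\].*\[signal\]\[(\d+):([^\]]+)\] "
    r"send death sig (\d+) to \[(\d+):([^:\]]+):"
)


def signal_timeline(log):
    """Return (time, signal, sender, target) tuples sorted by kernel time."""
    events = []
    for line in log.splitlines():
        match = SIG_RE.search(line)
        if match:
            ts, spid, sname, sig, tpid, tname = match.groups()
            events.append((float(ts), int(sig), f"{spid}:{sname}", f"{tpid}:{tname}"))
    return sorted(events)


if __name__ == "__main__":
    for ts, sig, sender, target in signal_timeline(KERNEL_LOG):
        print(f"{ts:.6f}  sig {sig:<2}  {sender} -> {target}")
```

On these three lines the two sig 3 entries come first and the sig 11 entry follows roughly 0.1 s after the sig 3 sent to system_server.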
Build Info: 'alps-mp-n0.mp2:alps-mp-n0.mp2-V1_tinno6580.we.n_P181:mt6580:S01,TINNO/k300/k300:7.0/NRD90M/978301282079:user/release-keys'
Flavor Info: 'None'
Exception Log Time:[Thu Nov 16 18:44:14 CST 2017] [10.540823]
Exception Class: Kernel (KE)
PC is at [] load_elf_binary+0x8b4/0x1280
LR is at [] padzero+0x54/0x60
Current Executing Process:
[logcat, 18228][WifiStateMachin, 13150][main, 12787]
Backtrace:
[] do_undefinstr+0x1a4/0x1ec
[] __und_svc_finish+0x0/0x34
[] load_elf_binary+0x8b4/0x1280
[] search_binary_handler+0x6c/0x110
[] do_execve+0x438/0x5a8
[] SyS_execve+0x24/0x28
[] ret_fast_syscall+0x0/0x38
[] 0xffffffff
This clearly indicates an und (undefined instruction) exception. The detailed call stack in the kernel log:
[ 3675.201056] -(1)[18228:logcat]Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP ARM
...
[ 3683.747415] -(1)[18228:logcat]Backtrace:
[ 3683.747780] -(1)[18228:logcat][] (load_elf_binary) from [] (search_binary_handler+0x6c/0x110)
[ 3683.747927] -(1)[18228:logcat] r10:dd376fc0 r9:00004734 r8:c287c000 r7:c1034218 r6:fffffff8 r5:c10351e8
[ 3683.748471] -(1)[18228:logcat] r4:cea01f00
[ 3683.748833] -(1)[18228:logcat][] (search_binary_handler) from [] (do_execve+0x438/0x5a8)
[ 3683.748976] -(1)[18228:logcat] r7:cea01f00 r6:cc8c5400 r5:c287c008 r4:c1853000
[ 3683.749569] -(1)[18228:logcat][] (do_execve) from [] (SyS_execve+0x24/0x28)
[ 3683.749710] -(1)[18228:logcat] r10:00000000 r9:c287c000 r8:c01071a4 r7:0000000b r6:8e5f2d88 r5:8876d310
[ 3683.750249] -(1)[18228:logcat] r4:befc6adc
[ 3683.750601] -(1)[18228:logcat][] (SyS_execve) from [] (ret_fast_syscall+0x0/0x38)
[ 3683.750746] -(1)[18228:logcat] r5:8876d310 r4:8bb61088
[ 3683.751139] -(1)[18228:logcat]Code: ff3d1813 ff3d1c17 ff3e1c19 ff3d1d19 (ff3d1914)
[ 3683.751370] -(1)[18228:logcat]---[ end trace eaaf6e60bd76e65a ]---
[ 3684.553220] -(1)[18228:logcat]Kernel panic - not syncing: Fatal exception
There is an explicit kernel panic message!
From this it looks like the und exception occurred while loading the code segment of an executable.
[ 3683.672806] -(1)[18228:logcat]Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
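The exception was raised in the context of a user-space program (logcat calling execve), although the faulting instruction itself was executing in kernel mode (Mode SVC_32).
Next, the Android log: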
11-16 18:44:14.114 541 541 I AEE_AED : Exception Class: Kernel (KE)
11-16 18:44:14.114 541 541 I AEE_AED : PC is at [] load_elf_binary+0x8b4/0x1280
11-16 18:44:14.114 240 240 E PQ : [PQ][PQWhiteList] libwlparser.so is absent
11-16 18:44:14.114 541 541 I AEE_AED : LR is at [] padzero+0x54/0x60
11-16 18:44:14.114 541 541 I AEE_AED :
11-16 18:44:14.114 541 541 I AEE_AED : Current Executing Process:
11-16 18:44:14.114 541 541 I AEE_AED : [logcat, 18228][WifiStateMachin, 13150][main, 12787]
11-16 18:44:14.114 541 541 I AEE_AED :
11-16 18:44:14.115 541 541 I AEE_AED : Backtrace:
11-16 18:44:14.115 541 541 I AEE_AED : [] do_undefinstr+0x1a4/0x1ec
11-16 18:44:14.115 541 541 I AEE_AED : [] __und_svc_finish+0x0/0x34
11-16 18:44:14.115 541 541 I AEE_AED : [] load_elf_binary+0x8b4/0x1280
11-16 18:44:14.115 541 541 I AEE_AED : [] search_binary_handler+0x6c/0x110
11-16 18:44:14.115 541 541 I AEE_AED : [] do_execve+0x438/0x5a8
11-16 18:44:14.115 541 541 I AEE_AED : [] SyS_execve+0x24/0x28
11-16 18:44:14.115 541 541 I AEE_AED : [] ret_fast_syscall+0x0/0x38
11-16 18:44:14.115 541 541 I AEE_AED : [] 0xffffffff
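Next, analyze the mini ramdump to find the cause of the KE. The call stack above shows an undefined instruction exception, so what caused it?
Opening the dump in trace32, the current call stack is:
The current pc stops at 0xc026adf8:
The memory dump shows:
From the above, the instruction at address C026ADF8 is svc 0x3d1914, whereas in vmlinux it is a bl instruction: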
c026ada0: e2600a01 rsb r0, r0, #4096 ; 0x1000
c026ada4: e068100c rsb r1, r8, ip
c026ada8: e51bc054 ldr ip, [fp, #-84] ; 0x54
c026adac: e1500001 cmp r0, r1
c026adb0: e3cc203f bic r2, ip, #63 ; 0x3f
c026adb4: 31a01000 movcc r1, r0
c026adb8: 21a01001 movcs r1, r1
c026adbc: e5922008 ldr r2, [r2, #8]
c026adc0: e0930001 adds r0, r3, r1
c026adc4: 30d00002 sbcscc r0, r0, r2
c026adc8: 33a02000 movcc r2, #0
c026adcc: e3520000 cmp r2, #0
c026add0: 1affff10 bne c026aa18
c026add4: e1a00003 mov r0, r3
c026add8: eb03e392 bl c0363c28 <__clear_user_std>
c026addc: eaffff0d b c026aa18
c026ade0: e51b9044 ldr r9, [fp, #-68] ; 0x44
c026ade4: e1a02000 mov r2, r0
c026ade8: e51b8040 ldr r8, [fp, #-64] ; 0x40
c026adec: eafffe54 b c026a744
c026adf0: e3e0200d mvn r2, #13
c026adf4: eafffe52 b c026a744
c026adf8: ebfa7020 bl c0106e80
c026adfc: e3500000 cmp r0, #0
c026ae00: 0affffd6 beq c026ad60
==> The 16 words (64 bytes) of memory at C026ADC0-C026ADFC were overwritten, which is what caused the undefined instruction exception!
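To double-check that the vmlinux word at the faulting address really is a valid branch while the word found in RAM is not, the A32 B/BL encoding can be decoded by hand. This is a small sketch, not a full disassembler: `decode_arm_branch` is a hypothetical helper that only understands conditional-space B/BL words (bits 27..25 equal to 101) and returns None for everything else.

```python
def decode_arm_branch(addr, word):
    """Decode an A32 B/BL word at addr; return (mnemonic, target) or None."""
    cond = word >> 28
    # cond 0b1111 is the unconditional-instruction space, not a plain B/BL.
    if cond == 0xF or (word >> 25) & 0x7 != 0b101:
        return None
    imm24 = word & 0xFFFFFF
    if imm24 & 0x800000:
        imm24 -= 1 << 24  # sign-extend the 24-bit word offset
    # In ARM state the PC reads as the instruction address plus 8.
    target = (addr + 8 + (imm24 << 2)) & 0xFFFFFFFF
    return ("bl" if word & (1 << 24) else "b", target)


if __name__ == "__main__":
    # Word from the vmlinux disassembly above, then the word seen in the oops.
    for word in (0xEBFA7020, 0xFF3D1914):
        decoded = decode_arm_branch(0xC026ADF8, word)
        print(hex(word), decoded and (decoded[0], hex(decoded[1])))
```

Decoding 0xebfa7020 at 0xc026adf8 yields a bl to 0xc0106e80, which agrees with the disassembly line, while 0xff3d1914 from the oops falls outside the branch encoding altogether.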
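As shown above, the instructions in the ramdump differ from those in vmlinux. This is a clear case of memory corruption:
The instructions in vmlinux are effectively the ones written to ROM, that is, the instructions that are supposed to execute, while the instructions in the ramdump are what was sitting in RAM after the CPU loaded them from ROM. Under normal conditions the two should be identical; a mismatch like this appears only after memory has been trampled (illegally modified by some other code).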
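The same comparison can be done without trace32 for the few words that the oops itself prints. The sketch below assumes the ARM-mode `Code:` line lists 4-byte words with the parenthesised one located at the PC; `oops_words` and `mismatches` are illustrative names, and the expected values are copied from the vmlinux disassembly above.

```python
OOPS_CODE = "Code: ff3d1813 ff3d1c17 ff3e1c19 ff3d1d19 (ff3d1914)"
PC = 0xC026ADF8

# Expected words copied from the vmlinux disassembly above.
VMLINUX = {
    0xC026ADE8: 0xE51B8040,
    0xC026ADEC: 0xEAFFFE54,
    0xC026ADF0: 0xE3E0200D,
    0xC026ADF4: 0xEAFFFE52,
    0xC026ADF8: 0xEBFA7020,
}


def oops_words(code_line, pc):
    """Map each word of the Code: line to its address (assumes 4-byte words)."""
    tokens = code_line.split(":", 1)[1].split()
    pc_index = next(i for i, tok in enumerate(tokens) if tok.startswith("("))
    return {
        pc + 4 * (i - pc_index): int(tok.strip("()"), 16)
        for i, tok in enumerate(tokens)
    }


def mismatches(ram, expected):
    """List (address, expected word, word found in RAM) for every difference."""
    return [
        (addr, expected[addr], word)
        for addr, word in sorted(ram.items())
        if addr in expected and expected[addr] != word
    ]


if __name__ == "__main__":
    for addr, want, got in mismatches(oops_words(OOPS_CODE, PC), VMLINUX):
        print(f"{addr:08x}: vmlinux {want:08x}  ram {got:08x}")
```

All five words printed in the oops differ from vmlinux, which is consistent with a corrupted region rather than a single flipped instruction.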
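The hardest part of memory corruption is that the place where memory is trampled and the place where the crash shows up are not the same, which makes tracing difficult. Bringing in the MMU buddy system protection scheme improves the odds of solving it: the basic idea is to mark memory that has not been allocated as non-writable, so that any attempt to write to it triggers a kernel panic on the spot and catches the culprit directly.
With the material at hand, no further analysis is possible for now. Going back a little in the Android log for traces: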
11-16 18:44:12.947 263 263 E SoundTriggerHwService: couldn't load sound trigger module sound_trigger.primary (No such file or directory)
11-16 18:44:12.958 482 482 D AEE_AED : Rtt command(type:0, file_path:3 arg0:233)
11-16 18:44:12.958 482 482 D AEE_AED : Rtt waiting daemon finish the job...
11-16 18:44:13.003 448 448 D BootAnimation: [BootAnimation main 95]before new BootAnimation...
11-16 18:44:13.010 225 425 I SurfaceFlinger: [SF client] NEW(0xa7d15000) for (448:/system/bin/bootanimation)
11-16 18:44:13.011 448 448 D BootAnimation: [BootAnimation BootAnimation 162]bBootOrShutDown=1,bPlayMP3=1,bShutRotate=0
11-16 18:44:13.011 448 448 D BootAnimation: joinThreadPool...
11-16 18:44:13.011 448 448 D BootAnimation: [BootAnimation main 98]before joinThreadPool...
11-16 18:44:13.322 448 495 D BootAnimation: initialize opengl and egl
11-16 18:44:13.325 448 495 E : appName=/system/bin/bootanimation, acAppName=/system/bin/surfaceflinger
11-16 18:44:13.325 448 495 E : 0
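This clearly shows Android restarting at this point, and it can be attributed to the NE.
Going through these details: the load_elf_binary above hit the und exception while loading the code segment of an executable, and that executable could be /system/bin/bootanimation, /system/bin/surfaceflinger, or some other program. The earlier NE crashed Android (the kernel stayed up), after which the bootanimation process was started again. During that startup the abnormal condition left over from the NE may still have been present and corrupted memory, so load_elf_binary hit the und exception ==> kernel panic.
However, the NE was logged at 11-16 18:16:15 and the KE at 11-16 18:44:14, about half an hour apart. One guess is therefore that the NE happened more than once, and that one of those NEs crashed Android and then corrupted memory, leading to the KE.
The hypothesis is that the KE is a consequence of repeated NEs. With no better way to debug the KE, the priority is to fix the NE first and then schedule stress testing to verify the effect.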
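Continuing with the cause of the NE, back to the Android log:
The analysis above shows sig 3 being sent to com.myos.camera first, then sig 3 and sig 11 to system_server, which killed system_server. Could the root cause of this issue lie in the com.myos.camera module? With that question in mind, searching the Android log
turns up this detail: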
11-16 18:16:14.842 266 12646 E AeAlgo : [checkNightScene_v2p0()] Err: 1929:, NS_BrightTone_THDS < NS_OE_THDFLAT (checkNightScene_v2p0){#1929:vendor/mediatek/proprietary/hardware/libcamera/hardware/mt6580/lib/lib3a/ae/ae_algo.cpp}
11-16 18:16:15.047 12737 12737 D AEE_AED : 1, 8, 1, 1510827174, com.myos.camera
11-16 18:16:15.215 266 12646 E AeAlgo : [checkNightScene_v2p0()] Err: 1929:, NS_BrightTone_THDS < NS_OE_THDFLAT (checkNightScene_v2p0){#1929:vendor/mediatek/proprietary/hardware/libcamera/hardware/mt6580/lib/lib3a/ae/ae_algo.cpp}
11-16 18:16:15.886 266 12646 E AeAlgo : [checkNightScene_v2p0()] Err: 1929:, NS_BrightTone_THDS < NS_OE_THDFLAT (checkNightScene_v2p0){#1929:vendor/mediatek/proprietary/hardware/libcamera/hardware/mt6580/lib/lib3a/ae/ae_algo.cpp}
This appears several times.
Now the point in the kernel log where sig 3 is sent to com.myos.camera:
[ 2656.013962] -(0)[839:ActivityManager][name:mtprof&][signal][839:ActivityManager] send death sig 3 to [11945:com.myos.camera:S]
Searching for tid=11945 and checking its call stack:
"main" prio=5 tid=1 Waiting
| group="main" sCount=1 dsCount=0 obj=0x744a1f18 self=0xa7005400
| sysTid=11945 nice=0 cgrp=default sched=0/0 handle=0xaa302534
| state=S schedstat=( 2196790856 2283773067 5583 ) utm=148 stm=71 core=0 HZ=100
| stack=0xbe14d000-0xbe14f000 stackSize=8MB
| held mutexes=
at java.lang.Object.wait!(Native method)
- waiting on <0x08a4d8ef> (a android.os.ConditionVariable)
at android.os.ConditionVariable.block(ConditionVariable.java:97)
- locked <0x08a4d8ef> (a android.os.ConditionVariable)
at com.myos.camera.scheduler.CameraScheduler.onFirstPreviewReceived(unavailable:-1)
at com.myos.camera.activity.CameraActivity.onFirstPreviewArrived(unavailable:-1)
at com.myos.camera.activity.ActivityBase.onFirstPreviewOpened(unavailable:-1)
at com.myos.camera.activity.ActivityBase$MyAppBridge.onFirstPreviewOpened(unavailable:-1)
at com.myos.camera.glui.CameraScreenNail.onFrameAvailable(unavailable:-1)
- locked <0x00235ffc> (a java.lang.Object)
at android.graphics.SurfaceTexture$1.handleMessage(SurfaceTexture.java:203)
at android.os.Handler.dispatchMessage(Handler.java:110)
at android.os.Looper.loop(Looper.java:203)
at android.app.ActivityThread.main(ActivityThread.java:6251)
at java.lang.reflect.Method.invoke!(Native method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1063)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:924)
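The app frames show unavailable:-1, which looks a bit unusual? (It may simply mean these classes carry no source file or line number information, so it is not necessarily a fault in itself.)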
Handed over to the owners of the camera module for analysis.