Intermittent crash on a device running the monkey test

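   This is a record of debugging an intermittent crash. An intermittent problem like this cannot be verified directly, so what follows is only inference from the material at hand. If anything here is wrong, corrections are welcome.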

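Symptoms:
One device crashed intermittently while running the monkey test. The platform is mt6580-N.
The mtklog shows that both an NE (native exception) and a KE (kernel exception) occurred. Extracting the db with the GAT tool (logview -> Open Aee db) gives: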
Build Info: 'alps-mp-n0.mp2:alps-mp-n0.mp2-V1_tinno6580.we.n_P181:mt6580:S01,TINNO/k300/k300:7.0/NRD90M/978301282079:user/release-keys'
Flavor Info: 'None'
Exception Log Time:[Thu Nov 16 18:16:15 CST 2017] [2664.454478]

Exception Class: Native (NE)
Exception Type: SIGSEGV

Current Executing Process: 
  pid: 813, tid: 818
  system_server

Backtrace: 
    #00 pc 0015295a  /system/lib/libart.so (_ZN3art2gc9allocator8RosAlloc3Run15InspectAllSlotsEPFvPvS4_jS4_ES4_+73)
    #01 pc 00153a7f  /system/lib/libart.so (_ZN3art2gc9allocator8RosAlloc10InspectAllEPFvPvS3_jS3_ES3_+358)
    #02 pc 001b0ce1  /system/lib/libart.so (_ZN3art2gc5space13RosAllocSpace18InspectAllRosAllocEPFvPvS3_jS3_ES3_b+112)
    #03 pc 001b122b  /system/lib/libart.so (_ZThn36_N3art2gc5space13RosAllocSpace19GetObjectsAllocatedEv+30)
    #04 pc 001924a1  /system/lib/libart.so (_ZNK3art2gc4Heap19GetObjectsAllocatedEv+528)
    #05 pc 001967c5  /system/lib/libart.so (_ZN3art2gc4Heap14DumpForSigQuitERNSt3__113basic_ostreamIcNS2_11char_traitsIcEEEE+292)
    #06 pc 003208f9  /system/lib/libart.so (_ZN3art7Runtime14DumpForSigQuitERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+144)
    #07 pc 003251b7  /system/lib/libart.so (_ZN3art13SignalCatcher13HandleSigQuitEv+1394)
    #08 pc 00324325  /system/lib/libart.so (_ZN3art13SignalCatcher3RunEPv+336)
    #09 pc 000482a3  /system/lib/libc.so (_ZL15__pthread_startPv+22)
    #10 pc 00019d0d  /system/lib/libc.so (__start_thread+6)

Resolving the call stack:
arm32-linux-androideabi-addr2line -fC -e libart.so 0015295a 00153a7f 001b0ce1 001b122b 001924a1 001967c5 003208f9 003251b7 00324325  
art::gc::allocator::RosAlloc::Slot::Next() const
/proc/self/cwd/art/runtime/gc/allocator/rosalloc.h:119
art::gc::allocator::RosAlloc::InspectAll(void (*)(void*, void*, unsigned int, void*), void*)
/proc/self/cwd/art/runtime/gc/allocator/rosalloc.cc:1470
art::gc::space::RosAllocSpace::InspectAllRosAlloc(void (*)(void*, void*, unsigned int, void*), void*, bool)
/proc/self/cwd/art/runtime/gc/space/rosalloc_space.cc:322
art::gc::space::RosAllocSpace::GetObjectsAllocated()
/proc/self/cwd/art/runtime/gc/space/rosalloc_space.cc:297
art::gc::Heap::GetObjectsAllocated() const
/proc/self/cwd/art/runtime/gc/heap.cc:1877
art::gc::Heap::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)
/proc/self/cwd/art/runtime/gc/heap.cc:3491 (discriminator 4)
art::Runtime::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char> >&)
/proc/self/cwd/art/runtime/runtime.cc:1401 (discriminator 1)
art::SignalCatcher::HandleSigQuit()
/proc/self/cwd/art/runtime/signal_catcher.cc:144
art::SignalCatcher::Run(void*)
/proc/self/cwd/art/runtime/signal_catcher.cc:211
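
The address list handed to addr2line can be collected from the backtrace instead of being copied by hand. A minimal Python sketch, assuming the frame format shown in the Backtrace section (`#NN pc ADDR PATH ...`); the helper name `addr2line_command` and its parameters are made up for illustration, and only frames that belong to the chosen library are kept:

```python
import re
import shlex

# Matches frames such as: "#00 pc 0015295a  /system/lib/libart.so (...)"
FRAME_RE = re.compile(r"#(\d+)\s+pc\s+([0-9a-f]+)\s+(\S+)")


def addr2line_command(backtrace: str, library: str, tool: str = "addr2line") -> str:
    """Build an addr2line command line for every frame that lives in `library`.

    `tool` is whatever addr2line binary the toolchain provides (hypothetical default).
    """
    addrs = [
        m.group(2)
        for m in FRAME_RE.finditer(backtrace)
        if m.group(3).endswith("/" + library)
    ]
    return shlex.join([tool, "-fC", "-e", library, *addrs])


if __name__ == "__main__":
    sample = """
    #00 pc 0015295a  /system/lib/libart.so (_ZN3art2gc9allocator8RosAlloc3Run15InspectAllSlotsEPFvPvS4_jS4_ES4_+73)
    #01 pc 00153a7f  /system/lib/libart.so (_ZN3art2gc9allocator8RosAlloc10InspectAllEPFvPvS3_jS3_ES3_+358)
    #09 pc 000482a3  /system/lib/libc.so (_ZL15__pthread_startPv+22)
    """
    print(addr2line_command(sample, "libart.so", "arm32-linux-androideabi-addr2line"))
```

Filtering by library matters because each address is an offset into one specific .so; the libc.so frames need a separate run against libc.so.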

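SIGSEGV means the current thread accessed invalid memory, but nothing obviously wrong stands out here.
Judging by the function names, the stack above is handling a signal, so my first impression is that this is not the original scene of the fault.
Next, look at the backtrace of thread tid=818 by searching NE_JBT_TRACE for tid=818: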
"Signal Catcher" daemon prio=5 tid=2 Runnable
  | group="system" sCount=0 dsCount=0 obj=0x12c010d0 self=0xa0a8d000
  | sysTid=818 nice=0 cgrp=default sched=0/0 handle=0xa6728920
  | state=R schedstat=( 109230311 3211154 108 ) utm=6 stm=4 core=2 HZ=100
  | stack=0xa662c000-0xa662e000 stackSize=1014KB
  | held mutexes= "mutator lock"(shared held)
  (no managed stack frames)
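
When the trace file is large, the lookup by thread id can be scripted. A small Python sketch, assuming each thread section starts with a line that begins with the quoted thread name, as in the dump above; `find_thread` is a hypothetical helper name:

```python
import re
from typing import Optional


def find_thread(trace_text: str, sys_tid: int) -> Optional[str]:
    """Return the dump section whose sysTid equals `sys_tid`, or None."""
    # Split in front of every line that starts with a double-quoted thread name.
    sections = re.split(r'(?m)^(?=")', trace_text)
    for section in sections:
        # \b keeps sysTid=81 from matching sysTid=818.
        if re.search(rf"\bsysTid={sys_tid}\b", section):
            return section.strip()
    return None


if __name__ == "__main__":
    sample = (
        '"Signal Catcher" daemon prio=5 tid=2 Runnable\n'
        "  | sysTid=818 nice=0 cgrp=default sched=0/0 handle=0xa6728920\n"
        '"main" prio=5 tid=1 Waiting\n'
        "  | sysTid=11945 nice=0 cgrp=default sched=0/0 handle=0xaa302534\n"
    )
    print(find_thread(sample, 818))
```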

That also looks fairly normal and gives no clue, so move on to the Android log:
11-16 18:16:07.106   813   839 I Process : Sending signal. PID: 813 SIG: 3
11-16 18:16:07.106   813   818 I art     : Thread[2,tid=818,WaitingInMainSignalCatcherLoop,Thread*=0xa0a8d000,peer=0x12c010d0,"Signal Catcher"]: reacting to signal 3 ==> SIGQUIT
11-16 18:16:07.106   813   818 I art     : 
11-16 18:16:07.107   813   818 I art     : Enter while loop.
11-16 18:16:07.113 11945 11951 I art     : Wrote stack traces to '/data/anr/traces.txt'
11-16 18:16:07.117   813   818 F libc    : Fatal signal 11 (SIGSEGV), code 1, fault addr 0xe3d8d8d8 in tid 818 (Signal Catcher)
11-16 18:16:07.128   562   562 D AEE_AED : $===AEE===AEE===AEE===$
11-16 18:16:07.129   562   562 D AEE_AED : p 2 poll events 1 revents 1
11-16 18:16:07.129   562   562 D AEE_AED : PPM cpu cores:4, online:2
11-16 18:16:07.130   562   562 D AEE_AED : aed_main_fork_worker: generator 0xa9d94df8, worker 0xbecce944, recv_fd 0
11-16 18:16:07.132 12737 12737 I AEE_AED : handle_request(0)
11-16 18:16:07.132 12737 12737 I AEE_AED : check process 813 name:system_server
11-16 18:16:07.132 12737 12737 I AEE_AED : tid 818 abort msg address:0x00000000, si_code:1 (request from 813:1000)

Analyzing SYS_KERNEL_LOG, searching for the keyword sig:
[ 2656.013962] -(0)[839:ActivityManager][name:mtprof&][signal][839:ActivityManager] send death sig 3 to [11945:com.myos.camera:S]
[ 2656.402325] -(0)[839:ActivityManager][name:mtprof&][signal][839:ActivityManager] send death sig 3 to [813:system_server:S]
[ 2656.503634] -(0)[818:Signal Catcher][name:mtprof&][signal][818:Signal Catcher] send death sig 11 to [818:Signal Catcher:R]
[ 2662.629715]  (1)... UTC;android time 2017-11-16 18:16:13.333643

2662.629715 ==> 2017-11-16 18:16:13, so
2656.402325 ==> 2017-11-16 18:16:07, which lines up exactly with the Android log.
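
The same alignment can be done mechanically: take the one kernel line that carries both a kernel timestamp and an "android time", and shift every other kernel timestamp by the difference. A minimal Python sketch, assuming the log format shown above; `kernel_to_wall` and `signal_events` are hypothetical helper names, and the result is only valid as long as the device did not suspend between the two timestamps:

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "[ 2656.402325] ... send death sig 3 to [813:system_server:S]"
SIGNAL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\].*send death sig (\d+) to \[(\d+):([^:\]]+):"
)


def kernel_to_wall(kernel_ts: float, anchor_ts: float, anchor_wall: datetime) -> datetime:
    """Convert a kernel timestamp to wall-clock time using one known anchor pair."""
    return anchor_wall + timedelta(seconds=kernel_ts - anchor_ts)


def signal_events(kernel_log: str, anchor_ts: float, anchor_wall: datetime):
    """Return (wall_time, signal, target_pid, target_name) for each signal line."""
    events = []
    for line in kernel_log.splitlines():
        m = SIGNAL_RE.search(line)
        if m:
            wall = kernel_to_wall(float(m.group(1)), anchor_ts, anchor_wall)
            events.append((wall, int(m.group(2)), int(m.group(3)), m.group(4)))
    return events


if __name__ == "__main__":
    # Anchor taken from: [ 2662.629715] ... android time 2017-11-16 18:16:13.333643
    anchor_wall = datetime(2017, 11, 16, 18, 16, 13, 333643)
    log = (
        "[ 2656.402325] -(0)[839:ActivityManager][name:mtprof&][signal]"
        "[839:ActivityManager] send death sig 3 to [813:system_server:S]\n"
        "[ 2656.503634] -(0)[818:Signal Catcher][name:mtprof&][signal]"
        "[818:Signal Catcher] send death sig 11 to [818:Signal Catcher:R]\n"
    )
    for wall, sig, pid, name in signal_events(log, 2662.629715, anchor_wall):
        print(wall.strftime("%H:%M:%S.%f"), sig, pid, name)
```

With that anchor, the sig 3 sent to system_server at 2656.402325 maps to 18:16:07.106, the same millisecond as the "Sending signal. PID: 813 SIG: 3" entry in the Android log.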

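Stringing the above together:
ActivityManager (tid 839) sent sig 3 (SIGQUIT) first to com.myos.camera and then to system_server in order to dump their stacks. In system_server the Signal Catcher thread (tid=818) picked up the SIGQUIT and, while dumping the heap state (Heap::DumpForSigQuit -> RosAlloc::InspectAll), accessed an invalid address (0xe3d8d8d8). The Linux kernel then delivered sig 11 to that thread, which brought the whole process down. That accounts for the order and timing of the signals above. Note that sending sig 3 to a Java process only makes it print its traces, whereas sending sig 3 directly to a native process that has no handler for it terminates the process.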
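
The "dump and keep running" behaviour of a runtime that handles SIGQUIT can be reproduced outside Android. This is a rough Python analogy, not ART code: the standard `faulthandler` module installs a SIGQUIT handler that writes the stack of every thread, and the process carries on afterwards. The function name `dump_on_sigquit` is made up for illustration, and the sketch only runs on Unix-like systems:

```python
import faulthandler
import signal
import tempfile


def dump_on_sigquit() -> str:
    """Install a SIGQUIT dump handler, signal ourselves, and return the dump text."""
    with tempfile.NamedTemporaryFile(mode="w+", suffix=".txt") as dump_file:
        # From here on SIGQUIT no longer has its default (terminate) action.
        faulthandler.register(signal.SIGQUIT, file=dump_file, all_threads=True)
        try:
            # Deliver SIGQUIT to the calling thread; the handler writes the stacks.
            signal.raise_signal(signal.SIGQUIT)
            with open(dump_file.name) as reader:
                return reader.read()
        finally:
            faulthandler.unregister(signal.SIGQUIT)


if __name__ == "__main__":
    print(dump_on_sigquit())
    print("still alive after SIGQUIT")
```

Without the `register` call the same signal would terminate the interpreter, which is the native-process case described above.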

None of this points to a concrete cause yet. Now look at the KE information:
Build Info: 'alps-mp-n0.mp2:alps-mp-n0.mp2-V1_tinno6580.we.n_P181:mt6580:S01,TINNO/k300/k300:7.0/NRD90M/978301282079:user/release-keys'
Flavor Info: 'None'
Exception Log Time:[Thu Nov 16 18:44:14 CST 2017] [10.540823]

Exception Class: Kernel (KE)
PC is at [] load_elf_binary+0x8b4/0x1280
LR is at [] padzero+0x54/0x60

Current Executing Process:
[logcat, 18228][WifiStateMachin, 13150][main, 12787]

Backtrace:
[] do_undefinstr+0x1a4/0x1ec 
[] __und_svc_finish+0x0/0x34 
[] load_elf_binary+0x8b4/0x1280      
[] search_binary_handler+0x6c/0x110  
[] do_execve+0x438/0x5a8     
[] SyS_execve+0x24/0x28      
[] ret_fast_syscall+0x0/0x38 
[] 0xffffffff
This clearly reports an und (undefined instruction) exception. The detailed call stack in the kernel log:
[ 3675.201056] -(1)[18228:logcat]Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP ARM
...
[ 3683.747415] -(1)[18228:logcat]Backtrace: 
[ 3683.747780] -(1)[18228:logcat][] (load_elf_binary) from [] (search_binary_handler+0x6c/0x110)
[ 3683.747927] -(1)[18228:logcat] r10:dd376fc0 r9:00004734 r8:c287c000 r7:c1034218 r6:fffffff8 r5:c10351e8
[ 3683.748471] -(1)[18228:logcat] r4:cea01f00
[ 3683.748833] -(1)[18228:logcat][] (search_binary_handler) from [] (do_execve+0x438/0x5a8)
[ 3683.748976] -(1)[18228:logcat] r7:cea01f00 r6:cc8c5400 r5:c287c008 r4:c1853000
[ 3683.749569] -(1)[18228:logcat][] (do_execve) from [] (SyS_execve+0x24/0x28)
[ 3683.749710] -(1)[18228:logcat] r10:00000000 r9:c287c000 r8:c01071a4 r7:0000000b r6:8e5f2d88 r5:8876d310
[ 3683.750249] -(1)[18228:logcat] r4:befc6adc
[ 3683.750601] -(1)[18228:logcat][] (SyS_execve) from [] (ret_fast_syscall+0x0/0x38)
[ 3683.750746] -(1)[18228:logcat] r5:8876d310 r4:8bb61088
[ 3683.751139] -(1)[18228:logcat]Code: ff3d1813 ff3d1c17 ff3e1c19 ff3d1d19 (ff3d1914) 
[ 3683.751370] -(1)[18228:logcat]---[ end trace eaaf6e60bd76e65a ]---
[ 3684.553220] -(1)[18228:logcat]Kernel panic - not syncing: Fatal exception
There is an explicit kernel panic.
From this it looks as though the und exception occurred while the code segment of an executable was being loaded.
[ 3683.672806] -(1)[18228:logcat]Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
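The exception occurred on behalf of a user-space program (Segment user), although the CPU itself was in SVC (kernel) mode at the time.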

Next, the Android log:
11-16 18:44:14.114   541   541 I AEE_AED : Exception Class: Kernel (KE)
11-16 18:44:14.114   541   541 I AEE_AED : PC is at [] load_elf_binary+0x8b4/0x1280
11-16 18:44:14.114   240   240 E PQ      : [PQ][PQWhiteList] libwlparser.so is absent
11-16 18:44:14.114   541   541 I AEE_AED : LR is at [] padzero+0x54/0x60
11-16 18:44:14.114   541   541 I AEE_AED : 
11-16 18:44:14.114   541   541 I AEE_AED : Current Executing Process:
11-16 18:44:14.114   541   541 I AEE_AED : [logcat, 18228][WifiStateMachin, 13150][main, 12787]
11-16 18:44:14.114   541   541 I AEE_AED : 
11-16 18:44:14.115   541   541 I AEE_AED : Backtrace:
11-16 18:44:14.115   541   541 I AEE_AED : [] do_undefinstr+0x1a4/0x1ec 
11-16 18:44:14.115   541   541 I AEE_AED : [] __und_svc_finish+0x0/0x34 
11-16 18:44:14.115   541   541 I AEE_AED : [] load_elf_binary+0x8b4/0x1280      
11-16 18:44:14.115   541   541 I AEE_AED : [] search_binary_handler+0x6c/0x110  
11-16 18:44:14.115   541   541 I AEE_AED : [] do_execve+0x438/0x5a8     
11-16 18:44:14.115   541   541 I AEE_AED : [] SyS_execve+0x24/0x28      
11-16 18:44:14.115   541   541 I AEE_AED : [] ret_fast_syscall+0x0/0x38 
11-16 18:44:14.115   541   541 I AEE_AED : [] 0xffffffff

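Next, analyze the mini ramdump to find the cause of the KE. The call stack above shows an undefined instruction exception, so what caused it?
Open trace32; the current call stack:

The current pc is stopped at 0xc026adf8

The memory dump shows:

As seen above, the instruction at address C026ADF8 is svc 0x3d1914, whereas in vmlinux it is a bl instruction: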
c026ada0:	e2600a01 	rsb	r0, r0, #4096	; 0x1000
c026ada4:	e068100c 	rsb	r1, r8, ip
c026ada8:	e51bc054 	ldr	ip, [fp, #-84]	; 0x54
c026adac:	e1500001 	cmp	r0, r1
c026adb0:	e3cc203f 	bic	r2, ip, #63	; 0x3f
c026adb4:	31a01000 	movcc	r1, r0
c026adb8:	21a01001 	movcs	r1, r1
c026adbc:	e5922008 	ldr	r2, [r2, #8]
c026adc0:	e0930001 	adds	r0, r3, r1
c026adc4:	30d00002 	sbcscc	r0, r0, r2
c026adc8:	33a02000 	movcc	r2, #0
c026adcc:	e3520000 	cmp	r2, #0
c026add0:	1affff10 	bne	c026aa18 
c026add4:	e1a00003 	mov	r0, r3
c026add8:	eb03e392 	bl	c0363c28 <__clear_user_std>
c026addc:	eaffff0d 	b	c026aa18 
c026ade0:	e51b9044 	ldr	r9, [fp, #-68]	; 0x44
c026ade4:	e1a02000 	mov	r2, r0
c026ade8:	e51b8040 	ldr	r8, [fp, #-64]	; 0x40
c026adec:	eafffe54 	b	c026a744 
c026adf0:	e3e0200d 	mvn	r2, #13
c026adf4:	eafffe52 	b	c026a744 
c026adf8:	ebfa7020 	bl	c0106e80 
c026adfc:	e3500000 	cmp	r0, #0
c026ae00:	0affffd6 	beq	c026ad60 

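==> The memory at addresses C026ADC0-C026ADFC (16 words, 64 bytes) was overwritten, which is what caused the undefined instruction exception!
As shown above, the instructions in the ramdump differ from those in vmlinux. This is a clear case of memory corruption:
The instructions in vmlinux are effectively the instructions written to ROM, the ones that should normally be executed, while the instructions in the ramdump are what sits in RAM after they have been loaded from ROM. Under normal conditions the two are identical; a mismatch like this appears only after the memory has been trampled (illegally modified by some other code).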
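
Part of this comparison can be done without trace32, using only the `Code:` line of the oops and the vmlinux disassembly. A minimal Python sketch, assuming the ARM-mode layout of the `Code:` line (the four words before the pc, then the word at the pc in parentheses); `parse_code_line` and `corrupted_words` are hypothetical helper names, and the expected words are copied from the vmlinux listing above:

```python
import re


def parse_code_line(code_line: str, pc: int) -> dict:
    """Map each word of an ARM-mode 'Code:' line to its address."""
    tokens = re.findall(r"\(?[0-9a-f]{8}\)?", code_line.split("Code:", 1)[1])
    # The word in parentheses is the one at the faulting pc.
    pc_index = next(i for i, tok in enumerate(tokens) if tok.startswith("("))
    return {
        pc + 4 * (i - pc_index): int(tok.strip("()"), 16)
        for i, tok in enumerate(tokens)
    }


def corrupted_words(ram_words: dict, image_words: dict) -> dict:
    """Return {address: (ram_word, image_word)} for every mismatching word."""
    return {
        addr: (ram, image_words[addr])
        for addr, ram in sorted(ram_words.items())
        if addr in image_words and image_words[addr] != ram
    }


if __name__ == "__main__":
    code = "Code: ff3d1813 ff3d1c17 ff3e1c19 ff3d1d19 (ff3d1914)"
    vmlinux = {
        0xC026ADE8: 0xE51B8040,  # ldr r8, [fp, #-64]
        0xC026ADEC: 0xEAFFFE54,  # b   c026a744
        0xC026ADF0: 0xE3E0200D,  # mvn r2, #13
        0xC026ADF4: 0xEAFFFE52,  # b   c026a744
        0xC026ADF8: 0xEBFA7020,  # bl  c0106e80
    }
    diff = corrupted_words(parse_code_line(code, 0xC026ADF8), vmlinux)
    for addr, (ram, image) in diff.items():
        print(f"{addr:08x}: ram={ram:08x} vmlinux={image:08x}")
```

All five words covered by the `Code:` line differ from vmlinux, which agrees with the memory dump: the text around the pc was overwritten rather than a single instruction being flipped.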
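The hardest part of memory corruption is that the place where the memory is trampled is not the place where the crash happens, which makes it hard to trace. Bringing in the MMU buddy system protection scheme raises the odds of solving it. The basic idea is to mark memory that has not been allocated as non-writable, so that any attempt to write to it triggers a kernel panic immediately and the culprit is caught in the act.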

Nothing more can be deduced from the material at hand, so go back a little and look for traces in the Android log:
11-16 18:44:12.947   263   263 E SoundTriggerHwService: couldn't load sound trigger module sound_trigger.primary (No such file or directory)
11-16 18:44:12.958   482   482 D AEE_AED : Rtt command(type:0, file_path:3 arg0:233)
11-16 18:44:12.958   482   482 D AEE_AED : Rtt waiting daemon finish the job...
11-16 18:44:13.003   448   448 D BootAnimation: [BootAnimation main 95]before new BootAnimation...
11-16 18:44:13.010   225   425 I SurfaceFlinger: [SF client] NEW(0xa7d15000) for (448:/system/bin/bootanimation)
11-16 18:44:13.011   448   448 D BootAnimation: [BootAnimation BootAnimation 162]bBootOrShutDown=1,bPlayMP3=1,bShutRotate=0
11-16 18:44:13.011   448   448 D BootAnimation: joinThreadPool...
11-16 18:44:13.011   448   448 D BootAnimation: [BootAnimation main 98]before joinThreadPool...
11-16 18:44:13.322   448   495 D BootAnimation: initialize opengl and egl
11-16 18:44:13.325   448   495 E         : appName=/system/bin/bootanimation, acAppName=/system/bin/surfaceflinger
11-16 18:44:13.325   448   495 E         : 0

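Clearly Android is restarting at this point, and the restart can be attributed to an NE.
These details suggest the following picture: the load_elf_binary above hit an und exception while loading the code segment of an executable, and the executable here may be
/system/bin/bootanimation, /system/bin/surfaceflinger, or some other program. The earlier NE crashed Android (the kernel stayed up), and the bootanimation process was then started again. During that startup the condition behind the NE may still have been present, memory was trampled, and load_elf_binary hit the und exception ==> kernel panic.
However, the NE was logged at about 11-16 18:16 and the KE at about 11-16 18:44, roughly half an hour apart. So one guess is that there was more than one NE, and that one of them crashed Android and then trampled memory, leading to the KE.
The working hypothesis is that the KE is a follow-on failure after several NEs. With no better way to debug the KE, the priority is to fix the NE first and then schedule stress tests to verify the result.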

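Continuing with the cause of the NE, back to the Android log:
The analysis above shows sig 3 being sent to com.myos.camera first, followed by sig 3 and then sig 11 to system_server, which brought system_server down. Could the root cause of this problem lie in the com.myos.camera module? With that question in mind, dig through the Android log: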

This detail shows up:
11-16 18:16:14.842   266 12646 E AeAlgo  : [checkNightScene_v2p0()] Err:  1929:, NS_BrightTone_THDS < NS_OE_THDFLAT (checkNightScene_v2p0){#1929:vendor/mediatek/proprietary/hardware/libcamera/hardware/mt6580/lib/lib3a/ae/ae_algo.cpp}
11-16 18:16:15.047 12737 12737 D AEE_AED :  1,  8,	    1,	 1510827174,	 com.myos.camera 
11-16 18:16:15.215   266 12646 E AeAlgo  : [checkNightScene_v2p0()] Err:  1929:, NS_BrightTone_THDS < NS_OE_THDFLAT (checkNightScene_v2p0){#1929:vendor/mediatek/proprietary/hardware/libcamera/hardware/mt6580/lib/lib3a/ae/ae_algo.cpp}
11-16 18:16:15.886   266 12646 E AeAlgo  : [checkNightScene_v2p0()] Err:  1929:, NS_BrightTone_THDS < NS_OE_THDFLAT (checkNightScene_v2p0){#1929:vendor/mediatek/proprietary/hardware/libcamera/hardware/mt6580/lib/lib3a/ae/ae_algo.cpp}
It appears several times.

Now look at the point in the kernel log where sig 3 is sent to com.myos.camera:
[ 2656.013962] -(0)[839:ActivityManager][name:mtprof&][signal][839:ActivityManager] send death sig 3 to [11945:com.myos.camera:S]

Search for tid=11945 and look at its call stack:
"main" prio=5 tid=1 Waiting
  | group="main" sCount=1 dsCount=0 obj=0x744a1f18 self=0xa7005400
  | sysTid=11945 nice=0 cgrp=default sched=0/0 handle=0xaa302534
  | state=S schedstat=( 2196790856 2283773067 5583 ) utm=148 stm=71 core=0 HZ=100
  | stack=0xbe14d000-0xbe14f000 stackSize=8MB
  | held mutexes=
  at java.lang.Object.wait!(Native method)
  - waiting on <0x08a4d8ef> (a android.os.ConditionVariable)
  at android.os.ConditionVariable.block(ConditionVariable.java:97)
  - locked <0x08a4d8ef> (a android.os.ConditionVariable)
  at com.myos.camera.scheduler.CameraScheduler.onFirstPreviewReceived(unavailable:-1)
  at com.myos.camera.activity.CameraActivity.onFirstPreviewArrived(unavailable:-1)
  at com.myos.camera.activity.ActivityBase.onFirstPreviewOpened(unavailable:-1)
  at com.myos.camera.activity.ActivityBase$MyAppBridge.onFirstPreviewOpened(unavailable:-1)
  at com.myos.camera.glui.CameraScreenNail.onFrameAvailable(unavailable:-1)
  - locked <0x00235ffc> (a java.lang.Object)
  at android.graphics.SurfaceTexture$1.handleMessage(SurfaceTexture.java:203)
  at android.os.Handler.dispatchMessage(Handler.java:110)
  at android.os.Looper.loop(Looper.java:203)
  at android.app.ActivityThread.main(ActivityThread.java:6251)
  at java.lang.reflect.Method.invoke!(Native method)
  at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1063)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:924)

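The com.myos.camera frames show unavailable:-1 in place of file and line information, which looks a little abnormal. Handed over to the camera module owners for analysis.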
