一次新的视频播放卡死异常:4路视频10秒切换拷机,出现测试app画面卡死的异常,可以切换到其它界面。
查看负载,发现异常app有线程跑满CPU,且CPU大多消耗在内核层
User 2%, System 23%, IOW 0%, IRQ 0%
User 30 + Nice 0 + Sys 283 + Idle 899 + IOW 0 + IRQ 0 + SIRQ 0 = 1212
PID TID PR CPU% S VSS RSS PCY UID Thread Proc
2768 29663 1 24% R 1398368K 182332K fg u0_a53 Thread-13095407 com.test33.demo
32103 32103 2 0% R 2364K 948K fg root top top
77 77 0 0% S 0K 0K fg root fb-vsync
193 222 0 0% S 171340K 15512K fg system mali-renderer /system/bin/surfaceflinger
190 190 0 0% S 2472K 792K fg root lmkd /system/bin/lmkd
8 8 0 0% S 0K 0K fg root rcu_preempt
9 9 0 0% S 0K 0K fg root rcu_bh
10 10 3 0% S 0K 0K fg root rcu_sched
使用debuggerd获得该进程的backtrace及寄存器信息:
ABI: 'arm'
pid: 2768, tid: 29663, name: Thread-13095407 >>> com.test33.demo <<<
signal 19 (SIGSTOP), code 0 (SI_USER), fault addr --------
r0 00000000 r1 00001000 r2 00000098 r3 9c29b000
r4 9c19f000 r5 9c29c564 r6 9c29c000 r7 b8b65468
r8 9c1a0000 r9 9c1a0000 sl 9c19f000 fp ffffffff
ip b8b65468 sp 9c29c558 lr b4c8982b pc b4c89832 cpsr 80070030
...
backtrace:
#00 pc 0034d832 /system/lib/libart.so (_ZN3art6Thread25InstallImplicitProtectionEv+233)
#01 pc 0034da73 /system/lib/libart.so (_ZN3art6Thread12InitStackHwmEv+210)
#02 pc 0034dca1 /system/lib/libart.so (_ZN3art6Thread4InitEPNS_10ThreadListEPNS_9JavaVMExtEPNS_9JNIEnvExtE+112)
#03 pc 00354b65 /system/lib/libart.so (_ZN3art6Thread6AttachEPKcbP8_jobjectb+136)
#04 pc 0032fc4f /system/lib/libart.so (_ZN3art7Runtime19AttachCurrentThreadEPKcbP8_jobjectb+14)
#05 pc 002586cb /system/lib/libart.so (_ZN3art3JII19AttachCurrentThreadEP7_JavaVMPP7_JNIEnvPv+158)
#06 pc 00089335 /data/app/com.test33.demo-1/lib/arm/libPlayCtrl.so (_ZN13CHKMediaCodec10OutputDataERl+64)
...
memory map:
9c19f000-9c29cfff rw- 0 fe000 [stack:29663]
for (uint8_t* p = stack_top; p >= pregion; p -= kPageSize) {
dont_optimize_this = *p;
}
34d826: f557 eb52 blx a4ecc
34d82a: 42b4 cmp r4, r6
34d82c: bf98 it ls
34d82e: 4633 movls r3, r6
34d830: d807 bhi.n 34d842 <_ZN3art6Thread25InstallImplicitProtectionEv+0xfa>
34d832: 781a ldrb r2, [r3, #0]
当前PC=0x34d832,该指令从r3寄存器所指向地址中取数据保存到r2,r3指向线程栈,寄存器的数据看起来没什么问题。
我们再来看下此时内核在做什么,为什么会消耗这么多的CPU? 使用perf工具抓异常线程的信息如下:
# Overhead Command Shared Object Symbol
# ........ ............... ................. ....................................................................................................................
#
29.60% com.test33.demo [kernel.kallsyms] [k] do_page_fault
|
--- do_page_fault
do_DataAbort
__dabt_usr
art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const
26.83% com.test33.demo [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
|
--- _raw_spin_unlock_irqrestore
|
|--52.88%-- do_page_fault
| do_DataAbort
| __dabt_usr
| art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const
|
|--46.95%-- __down_read_trylock
| do_page_fault
| do_DataAbort
| __dabt_usr
| art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const
--0.17%-- [...]
9.65% com.test33.demo libart.so [.] art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const
|
--- art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const
7.86% com.test33.demo [kernel.kallsyms] [k] _raw_spin_unlock
|
--- _raw_spin_unlock
handle_pte_fault
__handle_mm_fault
handle_mm_fault
do_page_fault
do_DataAbort
__dabt_usr
art::StackVisitor::GetVRegFromOptimizedCode(art::ArtMethod*, unsigned short, art::VRegKind, unsigned int*) const
看起来是程序在访问线程栈时候,那块page已经不存在,所以触发了kernel的page fault机制去分配page,但为何kernel一直陷在page fault中无法返回?
查看了系统可用内存还很充足,不存在分配不到内存的情况。
用gdb连接上去修改r3寄存器内容,让它指向其它page则能成功执行当前指令,进入下一条指令,否则如果还指向异常的page,则会停留在当前指令上:
(gdb) info register
r0 0x0 0
r1 0x1000 4096
r2 0x98 152
r3 0x9c29b000 2619977728
r4 0x9c19f000 2618945536
r5 0x9c29c564 2619983204
r6 0x9c29c000 2619981824
r7 0xb8b65468 3098956904
r8 0x9c1a0000 2618949632
r9 0x9c1a0000 2618949632
r10 0x9c19f000 2618945536
r11 0xffffffff 4294967295
r12 0xb8b65468 3098956904
sp 0x9c29c558 0x9c29c558
lr 0xb4c8982b -1261922261
pc 0xb4c89832 0xb4c89832
cpsr 0x80070030 -2147024848
(gdb) set $r3=0x9c29c000
(gdb) si
0xb4c89834 in ?? ()
修改r3的值为另外page时候,si指令能成功执行
(gdb) set $r3=0x9c29bccc
(gdb) info register
r0 0x0 0
r1 0x1000 4096
r2 0x98 152
r3 0x9c29bccc 2619981004
r4 0x9c19f000 2618945536
r5 0x9c29c564 2619983204
r6 0x9c29c000 2619981824
r7 0xb8b65468 3098956904
r8 0x9c1a0000 2618949632
r9 0x9c1a0000 2618949632
r10 0x9c19f000 2618945536
r11 0xffffffff 4294967295
r12 0xb8b65468 3098956904
sp 0x9c29c558 0x9c29c558
lr 0xb4c8982b -1261922261
pc 0xb4c89832 0xb4c89832
cpsr 0x80070030 -2147024848
(gdb) si
^C
Program received signal SIGINT, Interrupt.
修改r3的值为当前page的其它地址时候,si指令执行失败。
所以,是部分page出现了异常,但异常的根本原因呢?
我提供了一个patch给客户,用于把kernel里面CMA内存借用的机制给禁用掉,长时间拷机不再出现该异常。
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf6c928..aaf8342 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -933,7 +933,7 @@ static int fallbacks[MIGRATE_TYPES][4] = {
[MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
#ifdef CONFIG_CMA
- [MIGRATE_MOVABLE] = { MIGRATE_CMA, MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
+ [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
[MIGRATE_CMA] = { MIGRATE_RESERVE }, /* Never used */
#else
[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
@@ -1360,7 +1360,7 @@ void free_hot_cold_page(struct page *page, int cold)
* excessively into the page allocator
*/
if (migratetype >= MIGRATE_PCPTYPES) {
- if (unlikely(is_migrate_isolate(migratetype))) {
+ if (unlikely(is_migrate_isolate(migratetype)) || is_migrate_cma(migratetype)) {
free_one_page(zone, page, 0, migratetype);
goto out;
}
由此我猜测:该app从cma中借用了一个page做为其stack的一部分,视频播放时候需要CMA内存,在CMA内存紧张时候会收回被借出去的page,包括被app借走的那块;接着,当app访问stack时候发现对应的page已经不存在,于是触发kernel的page fault机制去分配内存。可能是当前app在执行的代码比较特殊,必须要使用原来的那块page,但是那块page已经被用作视频buffer无法收回,于是kernel一直陷在page fault中。
这猜测过于勉强,但我已没有时间继续就这问题深入下去,暂且如此吧。