Android系统稳定性简述

Android系统稳定性简述

稳定性问题从表现来看有: 死机重启, 自动关机, 无法开机,冻屏,黑屏以及闪退, 无响应等情况;

从技术层面来划分无外乎两大类: 长时间无法执行完成(Timeout) 以及异常崩溃(crash). 主要分类如下:
Android系统稳定性简述_第1张图片

1 kernel panic

kernel panic相关错误信息代码处理:

arch/arm64/mm/fault.c
arch/arm64/kernel/traps.c
kernel/panic.c

1.1 常见错误打印

常见异常原因有hungtask,panic,oom,watchdog,softlockup

1 hungtask

内核某进程/线程长期处于D状态,无法唤醒

INFO: task xxx:xxx blocked for more than 120 seconds.

2 panic

内核崩溃,错误原因很多,比如页表异常、空指针访问等。

"Data abort info:\n"
"Mem abort info:\n"
"Unable to handle kernel %s at virtual address %016lx\n"
"Kernel panic - not syncing: %s\n"
Internal error: Oops: 96000006 [#1] PREEMPT SMP
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[2022/2/27 21:26:18] [254094.838383] !(0)[675:tosMsgAdvTask] list_add corruption. prev->next should be next (ffffff802e905648), but was ffffff8016ca8540. (prev=ffffff8016ca81c0).
[2022/2/27 21:26:18] [254094.851569] -(0)[675:tosMsgAdvTask] ------------[ cut here ]------------
[2022/2/27 21:26:18] [254094.854755] -(0)[675:tosMsgAdvTask] kernel BUG at lib/list_debug.c:28!
[2022/2/27 21:26:18] [254094.861344] -(0)[675:tosMsgAdvTask] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP

[251:irq/63-vdec-1] Call trace:
[251:irq/63-vdec-1] [ffffffc020253d20+  96][<ffffffc00a1401b8>] vh264_isr_thread_fn+0x22d8/0x2640 [amvdec_mh264_v4l]
[251:irq/63-vdec-1] [ffffffc020253d80+  48][<ffffffc009f10864>] vdec_thread_isr+0x8c/0xe8 [decoder_common]
[251:irq/63-vdec-1] [ffffffc020253db0+  80][<ffffffc01016ef80>] irq_thread_fn+0x30/0x78
[251:irq/63-vdec-1] [ffffffc020253e00+  96][<ffffffc01016ee30>] irq_thread+0x1e8/0x260
[251:irq/63-vdec-1] [ffffffc020253e60+   0][<ffffffc01011de70>] kthread+0x14c/0x15c
[251:irq/63-vdec-1] [0000000000000000+   0][<ffffffc010084070>] ret_from_fork+0x10/0x18
[251:irq/63-vdec-1] Code: 9400183c f85f03a8 52801009 f9539908 (39138109) 
[251:irq/63-vdec-1] ---[ end trace c5cc2e4cf0888320 ]---
[251:irq/63-vdec-1] dyn_fsync: panic_flush_device_cache
[251:irq/63-vdec-1] dyn_fsync: panic_flush_device_cache: flush device cache

  
3 memory分配异常
[9:ksoftirqd/0] warn_alloc: 203 callbacks suppressed
[9:ksoftirqd/0] ksoftirqd/0: page allocation failure: order:0, mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
[9:ksoftirqd/0] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Tainted: P        WC O      5.4.125-android11-2-gdc4d0f1d4a41-dirty #1
[9:ksoftirqd/0] Hardware name: Amlogic (DT)
[9:ksoftirqd/0] Call trace:
[9:ksoftirqd/0] [ffffffc020023860+  64][<ffffffc0100c61f4>] dump_backtrace+0x0/0x18c
[9:ksoftirqd/0] [ffffffc0200238a0+  32][<ffffffc0100c63a4>] show_stack+0x24/0x34
[9:ksoftirqd/0] [ffffffc0200238c0+ 160][<ffffffc010de41e0>] dump_stack+0xbc/0x108
[9:ksoftirqd/0] [ffffffc020023960+ 240][<ffffffc0102d640c>] warn_alloc+0xd8/0x128
[9:ksoftirqd/0] [ffffffc020023a50+ 144][<ffffffc0102d7744>] __alloc_pages_slowpath+0xb4c/0xb78
[9:ksoftirqd/0] [ffffffc020023ae0+  80][<ffffffc0102d67b8>] __alloc_pages_nodemask+0x2ec/0x374
[9:ksoftirqd/0] [ffffffc020023b30+  48][<ffffffc0102d7a28>] page_frag_alloc+0x100/0x1a0
[9:ksoftirqd/0] [ffffffc020023b60+ 256][<ffffffc010b760c0>] __napi_alloc_skb+0x74/0x198
[9:ksoftirqd/0] [ffffffc020023c60+  96][<ffffffc0099083d0>] stmmac_napi_poll_rx+0x3d4/0xacc [stmmac]
[9:ksoftirqd/0] [ffffffc020023cc0+  96][<ffffffc010b9df34>] napi_poll+0x70/0x224
[9:ksoftirqd/0] [ffffffc020023d20+  96][<ffffffc010b9d5c4>] net_rx_action+0xc0/0x238
[9:ksoftirqd/0] [ffffffc020023d80+  96][<ffffffc010081384>] __do_softirq+0x184/0x394
[9:ksoftirqd/0] [ffffffc020023de0+  32][<ffffffc0100f6858>] run_ksoftirqd+0x30/0x7c
[9:ksoftirqd/0] [ffffffc020023e00+  96][<ffffffc010122954>] smpboot_thread_fn+0x1c4/0x2c0
[9:ksoftirqd/0] [ffffffc020023e60+   0][<ffffffc01011d5e4>] kthread+0x14c/0x15c
[9:ksoftirqd/0] [0000000000000000+   0][<ffffffc010084070>] ret_from_fork+0x10/0x18
[9:ksoftirqd/0] Mem-Info:

4 oom

Linux内核有个机制叫OOM killer(Out-Of-Memory killer),该机制会监控那些占用内存过大,尤其是瞬间很快消耗大量内存的进程,为了防止内存耗尽,内核会按优先级等相关来杀掉一些进程,直到free内存在一定范围内

Out of memory: Killed process 225 (busybox) total-vm:1180kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:20kB oom_score_adj:0
oom_reaper: reaped process 225 (busybox), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB Killed
busybox invoked oom-killer: gfp_mask=0x2100cca(GFP_HIGHUSER_MOVABLE|0x2000000), order=0, oom_score_adj=0
CPU: 0 PID: 228 Comm: busybox Tainted: G        W         5.4.125-android11-2-g58da01d48f87-dirty #1
Hardware name: Amlogic (DT)
Call trace:
[ffffffc0201239e0+  48][<ffffffc0100be19c>] dump_backtrace+0x0/0x110
[ffffffc020123a10+  32][<ffffffc0100be378>] show_stack+0x24/0x34
[ffffffc020123a30+  64][<ffffffc010ae2338>] dump_stack+0xbc/0x108
[ffffffc020123a70+  48][<ffffffc010245d30>] dump_header+0x50/0x1d4
[ffffffc020123aa0+  48][<ffffffc010245bd4>] oom_kill_process+0x9c/0x1a8
[ffffffc020123ad0+ 208][<ffffffc010245670>] out_of_memory+0x1b4/0x27c
[ffffffc020123ba0+ 144][<ffffffc01028fba0>] __alloc_pages_slowpath+0x740/0x9cc
[ffffffc020123c30+  80][<ffffffc01028f0e8>] __alloc_pages_nodemask+0x2b8/0x30c
[ffffffc020123c80+  80][<ffffffc01027a368>] wp_page_copy+0x98/0x4d0
[ffffffc020123cd0+  48][<ffffffc010278cc8>] do_wp_page+0x304/0x45c
[ffffffc020123d00+ 224][<ffffffc01027b054>] handle_pte_fault+0xe0/0xac4
[ffffffc020123de0+  64][<ffffffc01027944c>] handle_mm_fault+0x1c8/0x3c8
[ffffffc020123e20+  96][<ffffffc010aff754>] do_page_fault+0x2a4/0x420
[ffffffc020123e80+ 368][<ffffffc010080a2c>] do_mem_abort+0x50/0xbc
[ffffffc020123ff0+   0][<ffffffc010083848>] el0_da+0x1c/0x20
Mem-Info:

5 soft lockup

soft lockup是指内核中有BUG导致在内核模式下一直循环的时间超过n s(n为配置参数),而其他进程得不到运行的机会;实现方式:内核对于每一个cpu都有一个监控进程watchdog/x 每秒钟会对比进程时间戳,对比时间戳就可以知道运行情况了,进程长时间没有运行,时间戳没有更新,超过一定时间就报错。hard lockup的发生是由于禁止了CPU的所有中断超过一定时间(几秒)这种情况下,外部设备发生的中断无法处理,内核认为此时发生了所谓的lockup。
下图log中发生lockup的线程为aml-ditest-0

BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 79s) not found.
[8865:cae_front] dyn_fsync: panic_flush_device_cache
[8865:cae_front] dyn_fsync: panic_flush_device_cache: flush device cache
[0:swapper/2] RT throttling on cpu:3 rt_time:991ms, curr:aml-ditest-0/165 prio:0 sum_runtime:248810ms
[0:swapper/2] RT throttling on cpu:3 rt_time:991ms, curr:aml-ditest-0/165 prio:0 sum_runtime:249802ms
[8865:cae_front] dyn_fsync: panic_flush_device_cache: remaining timeout = 0
[8865:cae_front] Kernel panic - not syncing: workqueue lockup !!!

1.2 案列分析

Unable to handle kernel NULL pointer dereference at virtual address 00000000000002c0
Mem abort info:
ESR = 0x96000006
EC = 0x25: DABT (current EL), IL = 32 bits
Data abort info:
ISV = 0, ISS = 0x00000006
user pgtable: 4k pages, 39-bit VAs, pgdp=0000000065733000
pgd=0000000062020003, pud=0000000062020003, pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP

从Unable to handle kernel NULL pointer dereference at virtual address 00000000000002c0和Mem abort info两个地方看出panic的原因为空指针引起,ESR = 0x96000006, ESR为ARM异常寄存器错误信息,96000006 [#1] 后面的数字1表示发生的次数

Modules linked in: adc_keypad(O) ad82584f(O) snd_soc_aml_t9015(O) snd_soc(O) snd_soc_dummy_codec(O) aml_media(O) aml_dvb_extern(O) aml_spicc snd_soc_notify(O) meson_clk_measure aml_gki_tool(O) aml_mailbox(O)
PC : ffffffc009c329f0, PFN:8d4a4 V
SP : ffffff809612ad70, PFN:9612a L
FAR : 00000000000002c0  U
R0  : ffffffc009c38a5f, PFN:8d4aa V
R1  : ffffffc009c38000, PFN:8d4aa V
R3  : ffffffc01649b000, PFN:***** V
R9  : ffffffc0113a5cb0, PFN: 23a5 V
R15 : ffffffc0110e56d7, PFN: 20e5 V
R16 : ffffffc010debb00, PFN:***** V
R18 : ffffff809610b840, PFN:9610b L
R19 : ffffffc009c53000, PFN:8d4c5 V
R29 : ffffff809612ada0, PFN:9612a L
R30 : ffffffc009c329d0, PFN:8d4a4 V
CPU: 1 PID: 2286 Comm: log_out Tainted: P         C O      5.4.125-android11-2-ge3776401faa0-dirty-ab97 #1
Hardware name: Amlogic (DT)
pstate: 20400089 (nzCv daIf +PAN -UAO)
pc : _0d503df0a3108831179ecc8bd8eab8d7+0xfd0/0x1294 [pm_kara_release]
lr : _0d503df0a3108831179ecc8bd8eab8d7+0xfb0/0x1294 [pm_kara_release]

[ffffffc02162be20+  32][<ffffffc01030b8bc>] __arm64_sys_write+0x24/0x34
[ffffffc02162be40+  96][<ffffffc0100d0f40>] el0_svc_common+0xc8/0x258
[ffffffc02162bea0+ 336][<ffffffc0100d10fc>] el0_svc_compat_handler+0x2c/0x3c
[ffffffc02162bff0+   0][<ffffffc010083524>] el0_svc_compat+0x8/0x24
[ffffff809612ada0+  96][] _0d503df0a3108831179ecc8bd8eab8d7+0xfd0/0x1294 [pm_kara_release]

Modules linked in 后面是动态加载的ko模块,如果panic是模块中问题,会在堆栈后面打印对应的模块名,从下面这个log中发现出错的地方为pm_kara_release模块 ,同时会把PC等相关的寄存器信息打印出来

Call trace:
[ffffff809612ada0+  96][<ffffffc009c329f0>] _0d503df0a3108831179ecc8bd8eab8d7+0xfd0/0x1294 [pm_kara_release]
[ffffff809612ae00+  32][<ffffffc009c31650>] _d0571ed82a857ce697fd4a1931c7f2f9+0xad8/0xb18 [pm_kara_release]
[ffffffc02162bd60+  48][<ffffffc01030b440>] __vfs_write+0x160/0x184
[ffffffc02162bd90+  80][<ffffffc01030b608>] vfs_write+0xe8/0x1ac
[ffffffc02162bde0+  64][<ffffffc01030b844>] ksys_write+0x78/0xcc
[ffffffc02162be20+  32][<ffffffc01030b8bc>] __arm64_sys_write+0x24/0x34
[ffffffc02162be40+  96][<ffffffc0100d0f40>] el0_svc_common+0xc8/0x258
[ffffffc02162bea0+ 336][<ffffffc0100d10fc>] el0_svc_compat_handler+0x2c/0x3c
[ffffffc02162bff0+   0][<ffffffc010083524>] el0_svc_compat+0x8/0x24
Code: f9408908 d0000020 d0000021 91297c00 (b942c104) 
---[ end trace 589d34c66e567ecd ]---
Kernel panic - not syncing: Fatal exception in interrupt

Call trace将panic的堆栈信息打印出来,panic出现的地方大概有两种情况,一种是出现在进程上下文和中断上下文中,进程上下文中发生的panic为用户空间通过系统调用驱动程序引起,中断上下文中发生的panic为处理中断程序时发生。下面的el0_svc_compat->el0_svc_compat_handler->el0_svc_common->这个过程发现引发的panic为用户空间调用write系统调用时触发。

从上面这个信息可以清楚知道发生的模块为pm_kara_release,不能用add2line工具解析[ffffff809612ada0+ 96]这个地址,这个地址为vmlinux中的地址,如果直接转换可能会是?号或者行数不对应等。

  1. 通过addr2line工具解析pm_kara_release模块中0d503df0a3108831179ecc8bd8eab8d7这个地址,解析出对应的函数,大致发生的地方在哪个位置
  2. 然后通过objdump反汇编出来,在0d503df0a3108831179ecc8bd8eab8d7地址的基础上+0xfd0为对应的行,这个地方为最终出问题的地方也可以通过nm工具来搜索0d503df0a3108831179ecc8bd8eab8d7这个地址

解析ko地址时,首先要确认ko上基于32位编译的还是64位编译的,否则会报如下错误,ko文件需要使用带有符号的ko或者lib

arm-linux-androideabi-addr2line: out/media.ko: File format not recognized

1.3 常用工具

// objdump -d : 将代码段反汇编
// objdump -S : 将代码段反汇编 arm-linux-objdump -d -S file
arm-linux-objdump
arm-linux-nm 
arm-linux-add2line
aarch64-linux-gnu-addr2line
aarch64-none-elf

2 tombstoned

tombstoned主要信息如下

1 死亡进程相关信息
pid : tombstoned的进程ID
tid :线程ID
name: 线程名
pid: 438, tid: 562, name: HwBinder:438_1 >>> /vendor/tvos/bin/sitatvservice <<<

2异常退出原因
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x1 Cause: null pointer dereference

3 backtrace

3 寄存器相关信息

4 虚拟内存映射关系

5 各个线程的堆栈信息

6 logcat相关信息

2.1 案列分析

Build fingerprint: 'TCL/t982_ar31a8/t982_ar31a8:11/RQ3A.210705.001/15:userdebug/release-keys'
Revision: '0'
ABI: 'arm'
Timestamp: 2022-03-02 09:27:08+0800
pid: 438, tid: 562, name: HwBinder:438_1  >>> /vendor/tvos/bin/sitatvservice <<<
uid: 0
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x1
Cause: null pointer dereference
    r0  eae36ee8  r1  eae36ee0  r2  00000000  r3  00000200
    r4  00000006  r5  eae36ee0  r6  eae36edc  r7  00000001
    r8  eae36ed8  r9  eadef2d8  r10 eadf0ced  r11 eadf0cda
    ip  eca8da30  sp  e8a5fd40  lr  eca10a85  pc  eae1495e
​
backtrace:
    NOTE: Function names and BuildId information is missing for some frames due
    NOTE: to unreadable libraries. For unwinds of apps, only shared libraries
    NOTE: found under the lib/ directory are readable.
    NOTE: On this device, run setenforce 0 to make the libraries readable.
      #00 pc 0004895e  /vendor/lib/libamlhalcore.so (removeCommandValue+146) (BuildId: 5cf25626a221d171013afd9d6a81d5c5)
      #01 pc 000491a9  /vendor/lib/libamlhalcore.so (Aml_Hal_SetMasterVol+68) (BuildId: 5cf25626a221d171013afd9d6a81d5c5)
      #02 pc 0005191f  /vendor/lib/libamlhalcore.so (AML_HAL_CEC_setDeviceConnectionState+678) (BuildId: 5cf25626a221d171013afd9d6a81d5c5)
      #03 pc 000107a7  /vendor/lib/libsystemcontrolservice.so

进程退出原因为空指针异常,backtrace为出错时的对应线程的堆栈打印信息:在libamlhalcore.so中的removeCommandValue+146处挂掉了,通过addr2line工具定位到具体出错是哪个源文件里的哪个函数,确保debug的libamlhalcore.so和对应测试的版本一致,要不然可能会解析错误或者?等

arm-linux-add2line -e libamlhalcore.so -Cfs 0004895e

EX_TRANSACTION_FAILED

Build fingerprint: 'TCL/t982_ar31a8/t982_ar31a8:11/RQ3A.210705.001/AR01:userdebug/release-keys'
Revision: '0'
ABI: 'arm'
Timestamp: 2022-02-10 17:01:29+0800
pid: 449, tid: 497, name: binder_server >>> /vendor/tvos/bin/sitatvservice <<<
uid: 0
signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
Abort message: 'Failed HIDL return status not checked. Usually this happens because of a transport error (error parceling, binder driver, or from unparceling). If you see this in code calling into "Bn" classes in for a HAL server process, then it is likely that the code there is returning transport errors there (as opposed to errors defined within its protocol). Error is: Status(EX_TRANSACTION_FAILED): 'DEAD_OBJECT: ''
r0 00000000 r1 000001f1 r2 00000006 r3 f1484bd0
r4 f1484be4 r5 f1484bc8 r6 000001c1 r7 0000016b
r8 f1484bd0 r9 f1484be0 r10 f1484c00 r11 f1484bf0
ip 000001f1 sp f1484ba0 lr f19eeb0d pc f19eeb20

出现Error is: Status(EX_TRANSACTION_FAILED): 'DEAD_OBJECT:的情况为是hidl binder客户端向服务端发送请求时,binder服务端进程挂掉了,没有添加错误检测,导致引发SIGABRT,添加Return<>和linktoDeatch机制


2.2 debuggerd

debuggerd -b [pid],可输出进程的Native调用栈

[2022-08-12 11:10:02]  ----- pid 9346 at 2022-08-12 11:10:03 -----
[2022-08-12 11:10:02]  Cmd line: system_server
[2022-08-12 11:10:02]  ABI: 'arm'
[2022-08-12 11:10:02]  
[2022-08-12 11:10:02]  "system_server" sysTid=9346
[2022-08-12 11:10:02]    #00 pc 00053ebc  /system/lib/libc.so (__ioctl+8)
[2022-08-12 11:10:02]    #01 pc 00021bb9  /system/lib/libc.so (ioctl+36)
[2022-08-12 11:10:02]    #02 pc 001d768c  /system/lib/libsitatv.so (sita_android::IPCThreadState::talkWithDriver(bool)+164)
[2022-08-12 11:10:02]    #03 pc 001d7e74  /system/lib/libsitatv.so (sita_android::IPCThreadState::waitForResponse(sita_android::Parcel*, int*)+48)
[2022-08-12 11:10:02]    #04 pc 001d87a0  /system/lib/libsitatv.so (sita_android::IPCThreadState::transact(int, unsigned int, sita_android::Parcel const&, sita_android::Parcel*, unsigned int)+160)
[2022-08-12 11:10:02]    #05 pc 001d6054  /system/lib/libsitatv.so (sita_android::BpBinder::transact(unsigned int, sita_android::Parcel const&, sita_android::Parcel*, unsigned int)+68)
[2022-08-12 11:10:02]    #06 pc 000889ec  /system/lib/libsitatv.so (tos_factory_get_project_id+164)
[2022-08-12 11:10:02]    #07 pc 00017b15  /system/lib/libcom_tcl_tv_jni.so
[2022-08-12 11:10:02]    #08 pc 0000808f  /system/framework/arm/boot-com.tcl.tvmanager.oat (offset 0x8000) (com.tcl.tosapi.atv.TFactoryApi.native_addCallBack [DEDUPED]+78)
[2022-08-12 11:10:02]    #09 pc 00417f75  /system/lib/libart.so (art_quick_invoke_stub_internal+68)
[2022-08-12 11:10:02]    #10 pc 003f150f  /system/lib/libart.so (art_quick_invoke_stub+226)
[2022-08-12 11:10:02]    #11 pc 000a1031  /system/lib/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+136)
[2022-08-12 11:10:02]    #12 pc 001e8a25  /system/lib/libart.so (art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread*, art::ArtMethod*, art::ShadowFrame*, unsigned short, art::JValue*)+232)
[2022-08-12 11:10:02]    #13 pc 001e3701  /system/lib/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+776)

2.3 常见错误码

出现问题时,一帮是signal 11 (SIGSEGV) 和 signal 6 (SIGABRT)

#define SIGHUP       1
#define SIGINT       2
#define SIGQUIT      3
#define SIGILL       4
#define SIGTRAP      5
#define SIGABRT      6
#define SIGIOT       6
#define SIGBUS       7
#define SIGFPE       8
#define SIGKILL      9
#define SIGUSR1     10
#define SIGSEGV     11
#define SIGUSR2     12
#define SIGPIPE     13
#define SIGALRM     14
#define SIGTERM     15
#define SIGSTKFLT   16
#define SIGCHLD     17
#define SIGCONT     18
#define SIGSTOP     19
#define SIGTSTP     20
#define SIGTTIN     21
#define SIGTTOU     22
#define SIGURG      23
#define SIGXCPU     24
#define SIGXFSZ     25
#define SIGVTALRM   26
#define SIGPROF		27
#define SIGWINCH	28
#define SIGIO		29
#define SIGPOLL		SIGIO

3 system_server 异常

// 异常log位置
/data/anr/*
/data/tombstones/*
/data/system/dropbox/*
/sys/fs/pstore/*

3.1 watchdog

Android中Watchdog用来看护system_server进程,system_server进程运行着系统最终要的服务,譬如AMS、PKMS、WMS等, 当这些服务不能正常运转时,Watchdog可能会杀掉system_server,让系统重启。Watchdog的实现利用了锁和消息队列机制。当system_server发生死锁或消息队列一直处于忙碌状态时,则认为系统已经没有响应了(System Not Responding)。在分析Watchdog问题的时候,首先要有详尽的日志,其次要能定位出导致Watchdog超时的直接原因,最重要的是能还原出问题发生的场景。watchdog不是引发重启的原因,其它模块获取锁等相关导致系统卡住,才导致引发watchdog.
dropboxTag 含义

dropboxTag 含义
system_server_anr system进程无响应
system_server_watchdog system进程发生watchdog
system_server_crash system进程崩溃
system_server_native_crash system进程native出现崩溃
system_server_wtf system进程发生严重错误
system_server_lowmem system进程内存不足

1 watchdog异常打印
常见watchdog产生的场景有死锁,或者binder等待
下面log中看到Blocked在android.fg线程中, android.fg线程等待thread 130释放锁,需要根据thread 130 线程堆栈情况查看在等什么,层层分析,直到找到最后block的原因,有held by thread字眼打印基本上就可以确定有死锁发生

Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
Process: system_server
Subject: Blocked in monitor com.android.server.am.TclActivityManagerService on foreground thread (android.fg), Blocked in handler on main thread (main), Blocked in handler on display thread (android.display)

"android.fg" prio=5 tid=12 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x14e00568 self=0xe955fe00
  | sysTid=718 nice=0 cgrp=default sched=0/0 handle=0xbe4651c0
  | state=S schedstat=( 1528549224 5198733872 10016 ) utm=96 stm=56 core=2 HZ=100
  | stack=0xbe362000-0xbe364000 stackSize=1040KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:19133)
  - waiting to lock <0x06f44807> (a com.android.server.am.TclActivityManagerService) held by thread 130
  at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:251)
  at android.os.Handler.handleCallback(Handler.java:967)
  at android.os.Handler.dispatchMessage(Handler.java:104)
  at android.os.Looper.loop(Looper.java:250)
  at android.os.HandlerThread.run(HandlerThread.java:67)
  at com.android.server.ServiceThread.run(ServiceThread.java:44)

system_server卡在IPCThreadState::talkWithDriver处,卡在binder处等待服务端返回

Process: system_server
Subject: Blocked in handler on animation thread (android.anim)
Build: TCL/t982_ar31a8/t982_ar31a8:11/unknown/AR01:userdebug/release-keys

"android.anim" prio=5 tid=16 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13280800 self=0xdcf4fc10
  | sysTid=19110 nice=-10 cgrp=default sched=0/0 handle=0xb7aec1c0
  | state=S schedstat=( 61934857681 8470932389 87042 ) utm=3119 stm=3073 core=0 HZ=100
  | stack=0xb79e9000-0xb79eb000 stackSize=1040KB
  | held mutexes=
  native: #00 pc 00070654  /apex/com.android.runtime/lib/bionic/libc.so (__ioctl+12)
  native: #01 pc 0003ebef  /apex/com.android.runtime/lib/bionic/libc.so (ioctl+26)
  native: #02 pc 00039a43  /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+238)
  native: #03 pc 0003a69b  /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+86)
  native: #04 pc 0003a43f  /system/lib/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+122)
  native: #05 pc 00035297  /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+98)
  native: #06 pc 0006cf15  /system/lib/libgui.so (android::BpGraphicBufferProducer::dequeueBuffer(int*, android::sp<android::Fence>*, unsigned int, unsigned int, int, unsigned long long, unsigned long long*, android::FrameEventHistoryDelta*)+192)
  native: #07 pc 0008320d  /system/lib/libgui.so (android::Surface::dequeueBuffer(ANativeWindowBuffer**, int*)+316)
  native: #08 pc 00085ef5  /system/lib/libgui.so (android::Surface::lock(ANativeWindow_Buffer*, ARect*)+92)
  native: #09 pc 000a4a21  /system/lib/libandroid_runtime.so (android::nativeLockCanvas(_JNIEnv*, _jclass*, long long, _jobject*, _jobject*)+352)
  at android.view.Surface.nativeLockCanvas(Native method)
  at android.view.Surface.lockCanvas(Surface.java:405)
  - locked <0x0d37119c> (a java.lang.Object)
  at android.view.ViewRootImpl.drawSoftware(ViewRootImpl.java:4327)
  at android.view.ViewRootImpl.draw(ViewRootImpl.java:4288)
  at android.view.ViewRootImpl.performDraw(ViewRootImpl.java:3979)
  at android.view.ViewRootImpl.performTraversals(ViewRootImpl.java:3231)
  at android.view.ViewRootImpl.doTraversal(ViewRootImpl.java:2079)
  at android.view.ViewRootImpl$TraversalRunnable.run(ViewRootImpl.java:8376)
  at android.view.Choreographer$CallbackRecord.run(Choreographer.java:1091)
  at android.view.Choreographer.doCallbacks(Choreographer.java:909)
  at android.view.Choreographer.doFrame(Choreographer.java:844)
  at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:1076)
  at android.os.Handler.handleCallback(Handler.java:967)
  at android.os.Handler.dispatchMessage(Handler.java:104)
  at android.os.Looper.loop(Looper.java:250)
  at android.os.HandlerThread.run(HandlerThread.java:67)
  at com.android.server.ServiceThread.run(ServiceThread.java:44)

3.2 app异常

E AndroidRuntime: FATAL EXCEPTION: main
E AndroidRuntime: Process: com.tcl.tv, PID: 1195
E AndroidRuntime: java.lang.IllegalStateException: SharedPreferences in credential encrypted storage are not available until after user is unlocked
E AndroidRuntime: 	at android.app.ContextImpl.getSharedPreferences(ContextImpl.java:494)
E AndroidRuntime: 	at android.app.ContextImpl.getSharedPreferences(ContextImpl.java:479)
E AndroidRuntime: 	at android.content.ContextWrapper.getSharedPreferences(ContextWrapper.java:188)
E AndroidRuntime: 	at android.preference.PreferenceManager.getDefaultSharedPreferences(PreferenceManager.java:545)
E AndroidRuntime: 	at com.tcl.tv.SharedData.<init>(SharedData.java:12)
E AndroidRuntime: 	at com.tcl.tv.SharedData.getInst(SharedData.java:17)
E AndroidRuntime: 	at com.tcl.tv.recommendchannel.RecommendChannelManager.<init>(RecommendChannelManager.java:65)
E AndroidRuntime: 	at com.tcl.tv.recommendchannel.RecommendChannelManager.getInstance(RecommendChannelManager.java:75)
E AndroidRuntime: 	at com.tcl.tv.TVRootApp$CommonHandler.handleMessage(TVRootApp.java:565)
E AndroidRuntime: 	at android.os.Handler.dispatchMessage(Handler.java:111)
E AndroidRuntime: 	at android.os.Looper.loop(Looper.java:250)
E AndroidRuntime: 	at android.app.ActivityThread.main(ActivityThread.java:7848)
E AndroidRuntime: 	at java.lang.reflect.Method.invoke(Native Method)
E AndroidRuntime: 	at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:609)
E AndroidRuntime: 	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1078)

java.lang.OutOfMemoryError: Failed to allocate a 1583242928 byte allocation with 1556480 free bytes and 255MB until OOM, target footprint 2169528, growth limit 268435456
        at java.util.Arrays.copyOf(Arrays.java:3257)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:137)
        at com.tcl.tvservice.proxy.TvChannelProxy.updateChannelLists(Unknown Source:12)
        at com.tcl.tvservice.proxy.TvChannelProxy.updateChannelInfoInner(Unknown Source:3)
        at com.tcl.tvservice.proxy.TvChannelProxy.access$100(Unknown Source:0)
        at com.tcl.tvservice.proxy.TvChannelProxy$ChannelProxyHander.handleMessage(Unknown Source:23)
        at android.os.Handler.dispatchMessage(Handler.java:111)
        at android.os.Looper.loop(Looper.java:250)
        at android.os.HandlerThread.run(HandlerThread.java:67)

3.3 Bugreport

bugreport也是收集android系统异常信息的小工具,常见用法如下:

bugreport  bugreportz
adb bugreport > Bugreport.txt

bugreport主要收集系统如下信息:

1 系统build以及运行时长等相关信息;
2 内存/CPU/进程等信息;
3 kernel log;
4 lsof、map及Wait-Channels;
5 system log;
6 event log;
7 radio log;
8 vm traces:
9 network相关信息;
10 last kernel log;
11 last system log;
12 ip相关信息;
13 中断向量表
14 property以及fs等信息
15 last radio log;
16 Binder相关信息;
17 dumpsys all:
18 dumpsys checkin相关:
19 dumpsys batterystats电池统计;
20 dumpsys meminfo内存
21 dumpsys netstats网络统计;
22 dumpsys procstats进程统计;
23 dumpsys usagestats使用情况;
24 dumpsys package.
25 dumpsys app相关

3.4 ChkBugReport

ChkBugReport是索尼开发一个jar包用来解析bugreport的工具
https://github.com/sonyxperiadev/ChkBugReport

git clone git://github.com/sonyxperiadev/ChkBugReport.git

3.5 sysrq-trigger

// 导出内存分配信息
echo m > /proc/sysrq-trigger

// 导出当前任务状态信息
echo t > /proc/sysrq-trigger 

// 导出当前CPU寄存器和标志位信息
echo p > /proc/sysrq-trigger

3.6 内存debug命令

1 free
2 cat /proc/meminfo
3 procrank
4 dumpsys meminfo
5 cat /proc/slabinfo
6 cat /proc/pagetrace
7 echo m > /proc/sysrq-trigger
8 cat /proc/buddyinfo
9 cat /proc/zoneinfo

4 kernel 内存异常检测工具

4.1 Kasan

CONFIG_KASAN=y
CONFIG_KASAN_STACK_ENABLE=y

4.2 slub_debug

CONFIG_SLUB=y
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB_DEBUG_ON=y
CONFIG_SLUB_STATS=y

slabinfo.c 位于tools\vm路径下,通过aarch64-linux-gnu-gcc -o slabinfo slabinfo.c

cat /proc/slabinfo
echo 1 > /sys/kernel/slab/<leaking_slab>/trace
slabinfo -v

4.3 kmemleak

CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y则不需要在命令行添加kmemleak=on

CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=4096
CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y

打开kmemleak扫描功能:

echo scan > sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak

你可能感兴趣的:(Android,linux,system,android)