System reboots repeatedly: RescueParty triggers recovery (debugging notes)

Rebooting into recovery in an endless loop

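Contents

  • Rebooting into recovery in an endless loop
    • Preface
    • 1. Symptoms:
    • 2. Analysis
      • 2.1 First pass over the serial console output:
        • Step 1: Find the reboot reason:
        • Step 2: Trace back through what init was doing
        • Step 3: Check system_server:
      • 2.2 Ways into recovery
      • 2.3 Comparison experiments
      • 2.4 logcat analysis
        • 2.4.1 Capturing logcat
        • 2.4.2 Analyzing the logcat output
          • 2.4.2.1 Look at system_server:
          • 2.4.2.2 Search for /system/bin/uncrypt|ShutdownThread|RescueParty:
          • 2.4.2.3 Why the application crashed
    • 3. Summary:
    • 4. Appendix
      • 4.1 RescueParty
        • 4.1.1 Rescue levels
        • 4.1.2 Trigger logic
        • 4.1.3 When it is disabled
        • 4.1.4 Call flow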

Preface

Input – internalize – output – value

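1. Symptoms:

  1. Build a user-variant image and flash it;
  2. The device reboots partway through boot;
  3. After the reboot the screen shows "Erasing";
  4. When the erase finishes, the device reboots again;
  5. Steps 2 to 4 repeat indefinitely.

What the symptoms suggest:
1. The system hits a fault serious enough to trigger a reboot;
2. After the reboot it enters recovery (possibly a factory reset) and wipes data.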

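2. Analysis

Start from the symptoms and the first logs, pick out suspects, narrow the range step by step with experiments, and finally locate and fix the problem.

2.1 First pass over the serial console output:

Step 1: Find the reboot reason:

The log shows ARCH_RESET with the command recovery. These lines are printed from the function arch_reset: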

(200709_09:11:00.173)[ 101.418221] <0>-(0)[1:init]ARCH_RESET happen!!!
(200709_09:11:00.173)[ 101.418795] <0>-(0)[1:init]arch_reset: cmd = recovery
(200709_09:11:00.173)[ 101.419441] <0>-(0)[1:init]CPU: 0 PID: 1 Comm: init Tainted: G W O 4.9.117 #3
(200709_09:11:00.173)[ 101.420456] <0>-(0)[1:init]Hardware name: AC8257V/WAB (DT)
(200709_09:11:00.173)[ 101.421149] <0>-(0)[1:init]Call trace:
(200709_09:11:00.173)[ 101.421643] <0>-(0)[1:init][] dump_backtrace+0x0/0x294
(200709_09:11:00.173)[ 101.422492] <0>-(0)[1:init][] show_stack+0x18/0x20
(200709_09:11:00.173)[ 101.423300] <0>-(0)[1:init][] dump_stack+0xcc/0x104
(200709_09:11:00.173)[ 101.424119] <0>-(0)[1:init][] arch_reset+0x48/0x164
(200709_09:11:00.173)[ 101.424934] <0>-(0)[1:init][] mtk_arch_reset_handle+0x24/0x40
(200709_09:11:00.173)[ 101.425861] <0>-(0)[1:init][] atomic_notifier_call_chain+0x4c/0x84
(200709_09:11:00.173)[ 101.426838] <0>-(0)[1:init][] do_kernel_restart+0x24/0x2c
(200709_09:11:00.173)[ 101.427717] <0>-(0)[1:init][] machine_restart+0x40/0x50
(200709_09:11:00.173)[ 101.428575] <0>-(0)[1:init][] kernel_restart+0x16c/0x17c
(200709_09:11:00.173)[ 101.429442] <0>-(0)[1:init][] SyS_reboot+0x140/0x20c
(200709_09:11:00.174)[ 101.430267] <0>-(0)[1:init][] el0_svc_naked+0x34/0x38
(200709_09:11:00.174)[ 101.431099] <0>-(0)[1:init]we can get console_sem
(200709_09:11:00.174)[ 101.431695] <0>-(0)[1:init]mtk_rtc_common: rtc_mark_recovery
(200709_09:11:00.174)[ 101.432412] <0>-(0)[1:init]mtk_rtc_hal_common: hal_rtc_set_spare_register: cmd[2], set rg[0x5b4, 0x3 , 4] = 0x1
(200709_09:11:00.174)[ 101.434382] <0>-(0)[1:init]mtk_rtc_hal_common: mon = 1, day = 20481, hour = 16384

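When cmd is recovery, arch_reset calls rtc_mark_recovery. This function is defined in two places, and the CONFIG_MTK_RTC option controls which one is used:

  1. In mtk_rtc.h: #define rtc_mark_recovery() ({0;}), a stub that simply evaluates to 0;
  2. In mtk_rtc_common.c: rtc_mark_recovery() with a real implementation, which:
    1. prints the function name;
    2. calls hal_rtc_set_spare_register(), which does the actual register read/write;
    3. clears the alarm setting;
    4. writes the time to the RTC registers.

So all that happens here is a write to the RTC. What we actually need to find is whoever triggered arch_reset.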

Step 2: Trace back through what init was doing

init received sys.powerctl='reboot,recovery' from system_server, so system_server is the one that requested this reboot.

(200709_09:10:56.087)[ 97.324635] <3>.(1)[1:init]init: Received sys.powerctl=‘reboot,recovery’ from pid: 709 (system_server)
(200709_09:10:56.087)[ 97.324995] <3>.(1)[1:init]init: PropSet [sys.powerctl]=[reboot,recovery] Done
(200709_09:10:56.087)[ 97.325108] <3>.(1)[1:init]init: Clear action queue and start shutdown trigger
(200709_09:10:56.087)[ 97.325360] <3>.(1)[1:init]init: processing action (shutdown_done) from (:0)
(200709_09:10:56.087)[ 97.325390] <3>.(1)[1:init]init: Reboot start, reason: reboot,recovery, rebootTarget: recovery
(200709_09:10:56.206)[ 97.467814] <1>.(1)[1:init]init: PropSet [persist.sys.boot.reason]=[recovery] Done
(200709_09:10:56.518)[ 97.759500] <1>.(1)[1:init]init: Shutdown timeout: 1000 ms
[1:init]reboot: Restarting system with command ‘recovery’

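A quick look at sys.powerctl: init watches this property, and when it changes init runs the matching shutdown/reboot handling, which is what ends up in arch_reset.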
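To find the requester quickly in a long serial capture, the init line can be picked apart with a regular expression. This is a minimal sketch, assuming the message format shown in the log above; the function name find_powerctl_requests is made up for illustration, and the pattern accepts either ASCII or typographic quotes around the value because captures differ.

```python
import re

# Matches init's "Received sys.powerctl=..." line. The quote characters around
# the value vary between captures (ASCII or typographic), so accept either.
POWERCTL_RE = re.compile(
    r"init: Received sys\.powerctl=[‘'\"]?(?P<value>[^’'\"\s]+)[’'\"]?"
    r" from pid: (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_powerctl_requests(lines):
    """Return (value, pid, process_name) for each sys.powerctl request seen."""
    hits = []
    for line in lines:
        m = POWERCTL_RE.search(line)
        if m:
            hits.append((m.group("value"), int(m.group("pid")), m.group("name")))
    return hits


sample = [
    "(200709_09:10:56.087)[ 97.324635] <3>.(1)[1:init]init: Received "
    "sys.powerctl=‘reboot,recovery’ from pid: 709 (system_server)",
]
print(find_powerctl_requests(sample))
# [('reboot,recovery', 709, 'system_server')]
```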

Step 3: Check system_server:

Line 1988: (200709_09:10:52.712)[   93.962201] <3>.(3)[2212:Binder:709_3]binder: 709:2212 transaction failed 29189/0, size 4-0 line 3628
Line 2009: (200709_09:10:53.035)[   94.286605] <0>.(0)[2212:Binder:709_3]binder: 709:2212 transaction failed 29189/0, size 4-0 line 3628

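The only useful information in the serial log at this point is the binder message above. 29189 (0x7205) is the binder return code BR_DEAD_REPLY, not a PID: the target of a transaction sent by system_server (pid 709) had already died, but the log does not say which process that was.
So this path is a dead end for now, and we need a way to get the logcat output.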

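2.2 Ways into recovery

Since the logs so far don't show the actual trigger, list the possible ways into recovery and check each one:

  1. The Volume + Power key combination;
  2. The misc partition: sometimes a command is written to misc, and after the restart the device detects it and enters recovery.
    Possible cause 1: RescueParty ==> system self-protection against processes that keep crashing
    Possible cause 2: AutoRecovery ==> checks the file partitions; if it hits a problem it cannot fix, autorecovery wipes the partition to recover
    Possible cause 3: something else outside what I currently know;
    Also note: the reboot path through the WDT writes the RTC, so the RTC write is a result of taking this entry, not its cause;

So the key to finding how system_server ended up requesting recovery is getting more log output.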

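2.3 Comparison experiments

While the log analysis was going on, I also asked the customer to run comparison experiments in parallel, to narrow the range of suspects and provide information from another angle:

  1. Confirm a known-good original build (user)
  2. The project's current latest build (user)
    Experiment 1: swap partitions between the two builds one at a time (A/B) to find which partitions matter;
    Experiment 2: bisect the change history of the partitions that matter;

The result: the problem depends on system.img and vendor.img together. Because there were so many changes, continuing with the bisection would have taken a long time.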
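The bisection in experiment 2 can be sketched as a binary search over an ordered change list. This is only an illustration: the function name and the is_bad callback (build and boot-test with the first i+1 changes applied) are hypothetical. It also assumes a single first bad change, which may not hold when two images have to change together, as was the case here.

```python
def first_bad_change(changes, is_bad):
    """Binary-search an ordered change list for the first change that breaks.

    Assumes every prefix before some index k is good and every prefix from k
    onward is bad; is_bad(i) tests a build with changes[0..i] applied.
    Returns k, or None if even the full list is good.
    """
    lo, hi = 0, len(changes)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid        # the first bad change is at mid or earlier
        else:
            lo = mid + 1    # everything up to mid is good
    return lo if lo < len(changes) else None


# Hypothetical history of 8 changes where the sixth one (index 5) breaks boot.
changes = ["change-%d" % i for i in range(8)]
print(first_bad_change(changes, lambda i: i >= 5))  # 5
```

With n changes this needs about log2(n) build-and-flash cycles instead of n.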

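2.4 logcat analysis

2.4.1 Capturing logcat

On the current user build, commands typed into the serial shell have no effect. Testing showed that logcat can be captured over adb instead:

  1. Change the default USB mode to device;
  2. Enable USB debugging (developer options) by default;
  3. Build and flash.
    Then run adb logcat during boot and redirect the output to a file;
    PS: in practice, swapping in the system.img from a userdebug build also works and can be used directly.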
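Because the device reboots on its own, it helps to start the capture before adbd is even up. A minimal sketch, assuming adb is on the PATH; the function names are made up, and the output file name in the usage note is hypothetical.

```python
import subprocess


def logcat_commands(serial=None):
    """Build the adb command lines: wait for the device, then stream logcat."""
    base = ["adb"] + (["-s", serial] if serial else [])
    return [base + ["wait-for-device"], base + ["logcat", "-v", "threadtime"]]


def capture_logcat(out_path, serial=None):
    """Block until adbd answers, then write logcat to out_path until it ends."""
    wait_cmd, logcat_cmd = logcat_commands(serial)
    subprocess.run(wait_cmd, check=True)
    with open(out_path, "wb") as out:
        # Returns when the device reboots and the adb connection drops.
        subprocess.run(logcat_cmd, stdout=out)
```

Calling capture_logcat("boot_logcat.txt") right after power-on records everything from the moment adbd starts until the device reboots.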

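2.4.2 Analyzing the logcat output

With logcat captured, continue the analysis. Things to check:
1. What did system_server do?
2. Did the auto recovery mechanism fire? Searching for autorecovery found nothing, and the partition mount sequence showed no obvious errors.

2.4.2.1 Look at system_server:
  1. Searching for system_server turned up nothing useful in the related log lines;
  2. Searching for its PID returned too many lines to filter. Matching against the serial-log timestamps would still work, just with more effort;
  3. Searching for FATAL EXCEPTION finally turned up something: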

01-01 16:09:56.040 3076 3104 E AndroidRuntime: FATAL EXCEPTION: Thread-3
01-01 16:09:56.040 3076 3104 E AndroidRuntime: Process: com.customer.txvoiceadapter, PID: 3076
01-01 16:09:56.040 3076 3104 E AndroidRuntime: java.lang.IllegalArgumentException: Invalid audio buffer size -2 (frame size 4)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.audioBuffSizeCheck(AudioRecord.java:731)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.<init>(AudioRecord.java:380)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.<init>(AudioRecord.java:284)

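This exception shows up 38 times. Going back to the possible ways into recovery listed earlier, one of them fits: when system_server or an application crashes too many times, the system applies its own protection.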
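Counting crashes per process by hand is tedious in a log this size. The sketch below tallies them by relying on the fact that each application crash report from AndroidRuntime carries one "Process: <name>, PID: <n>" line, in the format seen above. The function name is made up for illustration.

```python
import re
from collections import Counter

# One such line appears in every application FATAL EXCEPTION report.
PROCESS_RE = re.compile(
    r"E AndroidRuntime: Process: (?P<name>[^,]+), PID: (?P<pid>\d+)"
)


def count_fatal_exceptions(lines):
    """Count AndroidRuntime crash reports per process name."""
    counts = Counter()
    for line in lines:
        m = PROCESS_RE.search(line)
        if m:
            counts[m.group("name")] += 1
    return counts


sample = [
    "01-01 16:09:56.040 3076 3104 E AndroidRuntime: FATAL EXCEPTION: Thread-3",
    "01-01 16:09:56.040 3076 3104 E AndroidRuntime: Process: com.customer.txvoiceadapter, PID: 3076",
    "01-01 16:09:56.040 3076 3104 E AndroidRuntime: java.lang.IllegalArgumentException: Invalid audio buffer size -2 (frame size 4)",
]
print(count_fatal_exceptions(sample).most_common())
# [('com.customer.txvoiceadapter', 1)]
```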

2.4.2.2 Search for /system/bin/uncrypt|ShutdownThread|RescueParty:

Line 22423: 01-01 16:09:58.752 707 1196 W RescueParty: Noticed 2 events for UID 1000 in last 0 sec
Line 22751: 01-01 16:09:59.329 707 1196 W RescueParty: Noticed 3 events for UID 1000 in last 1 sec
Line 23454: 01-01 16:10:00.173 707 945 W RescueParty: Noticed 4 events for UID 1000 in last 2 sec
Line 23803: 01-01 16:10:00.619 707 2377 W RescueParty: Noticed 5 events for UID 1000 in last 2 sec
Line 23805: 01-01 16:10:00.625 707 2377 W RescueParty: Attempting rescue level FACTORY_RESET
Line 24448: 01-01 16:10:01.527 707 1212 W RescueParty: Noticed 2 events for UID 1000 in last 0 sec
Line 24505: 01-01 16:10:01.642 3574 3574 I /system/bin/uncrypt: received command: [--prompt_and_wipe_data
Line 24506: 01-01 16:10:01.642 3574 3574 I /system/bin/uncrypt: --reason=RescueParty
Line 24507: 01-01 16:10:01.643 3574 3574 I /system/bin/uncrypt: --locale=zh_CN
Line 24508: 01-01 16:10:01.643 3574 3574 I /system/bin/uncrypt: ] (59)
Line 24513: 01-01 16:10:01.664 3574 3574 I /system/bin/uncrypt: received 0, exiting now
Line 24515: 01-01 16:10:01.671 707 723 D ShutdownThread: Notifying thread to start shutdown longPressBehavior=3
Line 24516: 01-01 16:10:01.675 707 723 I MtkShutdownThread: mShutOffAnimation: 0 (NONE)
Line 24794: 01-01 16:10:02.073 707 3686 I ShutdownThread: Sending shutdown broadcast…
Line 24903: 01-01 16:10:02.259 707 1204 W RescueParty: Noticed 3 events for UID 1000 in last 1 sec
Line 25249: 01-01 16:10:02.645 707 718 W RescueParty: Noticed 4 events for UID 1000 in last 1 sec
Line 25349: 01-01 16:10:02.763 707 3686 I ShutdownThread: Shutting down activity manager…
Line 25567: 01-01 16:10:03.062 707 3686 I ShutdownThread: Shutting down package manager…
Line 25569: 01-01 16:10:03.100 707 3755 I ShutdownThread: Waiting for Radio…
Line 25570: 01-01 16:10:03.101 707 3755 I ShutdownThread: Radio shutdown complete.
Line 25571: 01-01 16:10:03.101 707 3686 I MtkShutdownThread: setBacklightBrightness: Off
Line 25586: 01-01 16:10:03.121 707 3686 I ShutdownThread: Rebooting, reason: recovery
Line 25764: 01-01 16:10:03.287 707 718 W RescueParty: Noticed 5 events for UID 1000 in last 2 sec

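That confirms it: once RescueParty had counted 5 events for UID 1000 within about 2 seconds, it decided the process above had crashed too many times and went to the FACTORY_RESET rescue level.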
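The RescueParty lines can be reduced to a short timeline, which makes the count-then-escalate pattern easy to see. A minimal sketch, assuming the two message formats seen in the excerpt above; the function name and tuple layout are made up for illustration.

```python
import re

NOTICED_RE = re.compile(
    r"RescueParty: Noticed (?P<count>\d+) events for UID (?P<uid>\d+)"
    r" in last (?P<sec>\d+) sec"
)
LEVEL_RE = re.compile(r"RescueParty: Attempting rescue level (?P<level>\w+)")


def rescue_timeline(lines):
    """Return ('noticed', uid, count, sec) and ('rescue', level) in log order."""
    events = []
    for line in lines:
        m = NOTICED_RE.search(line)
        if m:
            events.append(("noticed", int(m.group("uid")),
                           int(m.group("count")), int(m.group("sec"))))
            continue
        m = LEVEL_RE.search(line)
        if m:
            events.append(("rescue", m.group("level")))
    return events


sample = [
    "01-01 16:10:00.619 707 2377 W RescueParty: Noticed 5 events for UID 1000 in last 2 sec",
    "01-01 16:10:00.625 707 2377 W RescueParty: Attempting rescue level FACTORY_RESET",
    "01-01 16:10:01.527 707 1212 W RescueParty: Noticed 2 events for UID 1000 in last 0 sec",
]
for event in rescue_timeline(sample):
    print(event)
# ('noticed', 1000, 5, 2)
# ('rescue', 'FACTORY_RESET')
# ('noticed', 1000, 2, 0)
```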

2.4.2.3 Why the application crashed

This is an Android application, so when a FATAL EXCEPTION occurs the runtime dumps the corresponding stack trace:

01-01 16:09:56.040 3076 3104 E AndroidRuntime: FATAL EXCEPTION: Thread-3
01-01 16:09:56.040 3076 3104 E AndroidRuntime: Process: com.customer.txvoiceadapter, PID: 3076
01-01 16:09:56.040 3076 3104 E AndroidRuntime: java.lang.IllegalArgumentException: Invalid audio buffer size -2 (frame size 4)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.audioBuffSizeCheck(AudioRecord.java:731)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.<init>(AudioRecord.java:380)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.<init>(AudioRecord.java:284)

Look for the cause in the log lines around it:

  1. The txvoiceadapter process throws in audioBuffSizeCheck;
  2. Next, look for where the buffer size was obtained:

01-01 16:09:56.031 3076 3104 I TXVoiceAdapterLog: start record
01-01 16:09:56.032 1916 2550 E AudioRecord-JNI: Error creating AudioRecord instance: initialization check failed with status -22.
01-01 16:09:56.033 1916 2550 E android.media.AudioRecord: Error code -20 when initializing native AudioRecord object.
01-01 16:09:56.035 3076 3104 E AudioSystem: AudioSystem::getInputBufferSize failed sampleRate 44100 format 0x1 channelMask 0xc
01-01 16:09:56.036 3076 3104 E AudioRecord: AudioSystem could not query the input buffer size for sampleRate 44100, format 0x1, channelMask 0xc; status -22
01-01 16:09:56.036 3076 3104 I TXVoiceAdapterLog: mixBufSize=-2

Initialization of the native-layer AudioRecord failed ("when initializing native AudioRecord object"). A little earlier in the log:

01-01 16:09:55.991 3076 3076 I TXVoiceAdapterLog: tx voice adapter on create.
01-01 16:09:56.007 3076 3103 I TXVoiceAdapterLog: start audiotrack play for open blk
01-01 16:09:56.020 3076 3103 E AudioTrack: Unable to query output sample rate for stream type -1; status -1
01-01 16:09:56.021 3076 3103 E AudioTrack-JNI: AudioTrack::getMinFrameCount() for sample rate 44100 failed with status -1
01-01 16:09:56.021 3076 3103 E android.media.AudioTrack: getMinBufferSize(): error querying hardware

01-01 16:09:56.030 387 1246 D APM_AudioPolicyManager: getInputForAttr() source 1, sampling rate 16000, format 0x1, channel mask 0xc,session 201, flags 0
01-01 16:09:56.030 3076 3103 I TXVoiceAdapterLog: minBufferSize=-1
01-01 16:09:56.030 387 1246 E APM::AudioPolicyEngine: getDeviceForInputSource() no default device defined
01-01 16:09:56.030 387 1246 W APM_AudioPolicyManager: getInputForAttr() could not find device for source 1

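So the audio device was never created. This is native audio depending on the HAL layer, which suggests two possible causes:

  1. Is the application starting too early?
    Moving the start to after boot complete gave the same result;
  2. Since this is a freshly built image, is a dependency broken?
    The earlier experiments narrowed the difference down to system.img and vendor.img. The crashing application lives in the system partition, and the audio HAL it depends on lives in vendor.img.

Comparing the contents of the two vendor.img builds showed that one of the audio .so libraries was configured not to be included in user builds!

After adding it back the device boots normally, and the problem is fixed.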
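A comparison like this can be sketched as a set difference over two file trees. The sketch assumes both vendor images have already been unpacked or mounted into directories; the function names are made up, and the directory names in the usage note are hypothetical.

```python
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def missing_libraries(good_root, bad_root):
    """Shared libraries present in the good tree but absent from the bad one."""
    missing = list_files(good_root) - list_files(bad_root)
    return sorted(p for p in missing if p.endswith(".so"))
```

For example, missing_libraries("vendor_good", "vendor_bad") lists every .so that the known-good build ships and the failing build does not.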

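3. Summary:

  1. Debugging is a process of forming hypotheses from the information at hand, testing and eliminating them, and narrowing the range step by step;
  2. How much you know and how you reason (plus experience) only affect how fast you get there; every problem can be solved in the end.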

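4. Appendix

4.1 RescueParty

Introduced in Android O. Core function: when an application or system_server keeps crashing, the rescue routine kicks in and tries to repair the problem so that the user experience is not affected.
What "rescue" means: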

  1. Reset settings to their defaults (the three RESET_SETTINGS levels in 4.1.1)
  2. Factory reset

4.1.1 Rescue levels

private static final int LEVEL_NONE = 0;
private static final int LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS = 1;
private static final int LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES = 2;
private static final int LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS = 3;
private static final int LEVEL_FACTORY_RESET = 4;

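4.1.2 Trigger logic

  1. system_server restarts more than 5 times within 5 minutes: escalate one level
  2. A persistent system app crashes more than 5 times within 30 seconds: escalate one level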
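The "N events inside a time window, then escalate one level" idea can be illustrated with a sliding-window counter. This is a simplified sketch, not the AOSP implementation. The class and method names are made up. It escalates on the 5th event in the window, which is where the logcat excerpt in 2.4.2.2 shows the rescue attempt, and the threshold is a parameter in case the real boundary differs.

```python
from collections import deque

LEVELS = [
    "NONE",
    "RESET_SETTINGS_UNTRUSTED_DEFAULTS",
    "RESET_SETTINGS_UNTRUSTED_CHANGES",
    "RESET_SETTINGS_TRUSTED_DEFAULTS",
    "FACTORY_RESET",
]


class CrashWindow:
    """Escalate one level each time `threshold` events land within `window_s`."""

    def __init__(self, threshold=5, window_s=30.0):
        self.threshold = threshold
        self.window_s = window_s
        self.events = deque()
        self.level = 0

    def note_event(self, now_s):
        """Record one crash at now_s; return the new level name if escalated."""
        self.events.append(now_s)
        # Drop events that have fallen out of the window.
        while self.events and now_s - self.events[0] > self.window_s:
            self.events.popleft()
        if len(self.events) >= self.threshold:
            self.events.clear()  # start counting afresh for the next level
            self.level = min(self.level + 1, len(LEVELS) - 1)
            return LEVELS[self.level]
        return None


window = CrashWindow()
print([window.note_event(t) for t in (0, 1, 2, 3, 4)])
# [None, None, None, None, 'RESET_SETTINGS_UNTRUSTED_DEFAULTS']
```

Crashes that are spread out (for example one every 20 seconds) never fill the window, so nothing escalates.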

4.1.3 When it is disabled

  1. eng builds: always disabled
  2. userdebug builds while USB is connected
  3. getprop persist.sys.disable_rescue returns true
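The three conditions above can be written as a single predicate. This is an illustrative sketch only: the function and argument names are made up, and the property value is passed in as a string rather than read from a device.

```python
def rescue_disabled(build_type, usb_connected, disable_prop):
    """Apply the three disable conditions listed above.

    build_type: "eng", "userdebug" or "user";
    disable_prop: value of persist.sys.disable_rescue, or None if unset.
    """
    if build_type == "eng":
        return True
    if build_type == "userdebug" and usb_connected:
        return True
    return disable_prop == "true"


# The build in this article was a user build, so USB being connected for
# adb logcat did not switch the rescue off.
print(rescue_disabled("user", True, None))  # False
```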

4.1.4 Call flow

to be done
