Input, internalize, output, value
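From the symptoms:
1. The system hits an exception severe enough to trigger a reboot;
2. After the reboot it enters the recovery screen (possibly a factory reset) and performs a wipe.
Starting from the symptoms and the initial log, identify the suspects, narrow the range step by step through experiments, and finally locate and fix the problem.
The kernel log prints ARCH_RESET with the command recovery; these lines are printed in the function arch_reset: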
(200709_09:11:00.173)[ 101.418221] <0>-(0)[1:init]ARCH_RESET happen!!!
(200709_09:11:00.173)[ 101.418795] <0>-(0)[1:init]arch_reset: cmd = recovery
(200709_09:11:00.173)[ 101.419441] <0>-(0)[1:init]CPU: 0 PID: 1 Comm: init Tainted: G W O 4.9.117 #3
(200709_09:11:00.173)[ 101.420456] <0>-(0)[1:init]Hardware name: AC8257V/WAB (DT)
(200709_09:11:00.173)[ 101.421149] <0>-(0)[1:init]Call trace:
(200709_09:11:00.173)[ 101.421643] <0>-(0)[1:init][] dump_backtrace+0x0/0x294
(200709_09:11:00.173)[ 101.422492] <0>-(0)[1:init][] show_stack+0x18/0x20
(200709_09:11:00.173)[ 101.423300] <0>-(0)[1:init][] dump_stack+0xcc/0x104
(200709_09:11:00.173)[ 101.424119] <0>-(0)[1:init][] arch_reset+0x48/0x164
(200709_09:11:00.173)[ 101.424934] <0>-(0)[1:init][] mtk_arch_reset_handle+0x24/0x40
(200709_09:11:00.173)[ 101.425861] <0>-(0)[1:init][] atomic_notifier_call_chain+0x4c/0x84
(200709_09:11:00.173)[ 101.426838] <0>-(0)[1:init][] do_kernel_restart+0x24/0x2c
(200709_09:11:00.173)[ 101.427717] <0>-(0)[1:init][] machine_restart+0x40/0x50
(200709_09:11:00.173)[ 101.428575] <0>-(0)[1:init][] kernel_restart+0x16c/0x17c
(200709_09:11:00.173)[ 101.429442] <0>-(0)[1:init][] SyS_reboot+0x140/0x20c
(200709_09:11:00.174)[ 101.430267] <0>-(0)[1:init][] el0_svc_naked+0x34/0x38
(200709_09:11:00.174)[ 101.431099] <0>-(0)[1:init]we can get console_sem
(200709_09:11:00.174)[ 101.431695] <0>-(0)[1:init]mtk_rtc_common: rtc_mark_recovery
(200709_09:11:00.174)[ 101.432412] <0>-(0)[1:init]mtk_rtc_hal_common: hal_rtc_set_spare_register: cmd[2], set rg[0x5b4, 0x3 , 4] = 0x1
(200709_09:11:00.174)[ 101.434382] <0>-(0)[1:init]mtk_rtc_hal_common: mon = 1, day = 20481, hour = 16384
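In arch_reset, the handling for cmd = recovery is rtc_mark_recovery. This function is implemented in two places:
init received sys.powerctl='reboot,recovery' from system_server (pid 709), so system_server is the one that requested this reboot: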
(200709_09:10:56.087)[ 97.324635] <3>.(1)[1:init]init: Received sys.powerctl='reboot,recovery' from pid: 709 (system_server)
(200709_09:10:56.087)[ 97.324995] <3>.(1)[1:init]init: PropSet [sys.powerctl]=[reboot,recovery] Done
(200709_09:10:56.087)[ 97.325108] <3>.(1)[1:init]init: Clear action queue and start shutdown trigger
(200709_09:10:56.087)[ 97.325360] <3>.(1)[1:init]init: processing action (shutdown_done) from (:0)
(200709_09:10:56.087)[ 97.325390] <3>.(1)[1:init]init: Reboot start, reason: reboot,recovery, rebootTarget: recovery
(200709_09:10:56.206)[ 97.467814] <1>.(1)[1:init]init: PropSet [persist.sys.boot.reason]=[recovery] Done
(200709_09:10:56.518)[ 97.759500] <1>.(1)[1:init]init: Shutdown timeout: 1000 ms
[1:init]reboot: Restarting system with command 'recovery'
A quick look at sys.powerctl: once init detects a change to this property it runs the corresponding handler, which ends up triggering arch_reset.
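As a rough illustration of how the property value maps to a reboot target, the sketch below splits a sys.powerctl value on commas. It is written in Java for readability only; init itself is native code, and the names `PowerctlSketch` and `rebootTarget` are hypothetical, not init's actual implementation.

```java
public class PowerctlSketch {
    /**
     * Returns the reboot target encoded in a sys.powerctl value such as
     * "reboot,recovery". Returns an empty string when the command is not
     * "reboot" or when no target is given.
     */
    public static String rebootTarget(String powerctl) {
        String[] parts = powerctl.split(",");
        if (parts.length >= 2 && parts[0].equals("reboot")) {
            return parts[1];
        }
        return "";
    }

    public static void main(String[] args) {
        // The value seen in the init log of this case.
        System.out.println(rebootTarget("reboot,recovery"));
    }
}
```

With the value from the init log, the target is `recovery`, which matches the `rebootTarget: recovery` field that init prints.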
Line 1988: (200709_09:10:52.712)[ 93.962201] <3>.(3)[2212:Binder:709_3]binder: 709:2212 transaction failed 29189/0, size 4-0 line 3628
Line 2009: (200709_09:10:53.035)[ 94.286605] <0>.(0)[2212:Binder:709_3]binder: 709:2212 transaction failed 29189/0, size 4-0 line 3628
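The only useful information in the serial log so far is the binder messages above. Note that the 29189 in "transaction failed 29189/0" is binder's return code rather than a PID: 29189 = 0x7205, which is BR_DEAD_REPLY, meaning the remote end of the transaction is dead. The serial log does not show which process that remote end is;
so this path is a dead end, and we need to find a way to get the logcat output;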
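One way to read the 29189 in `transaction failed 29189/0` is as a binder return code. The sketch below recomputes the value of `BR_DEAD_REPLY`, which the binder UAPI header defines as `_IO('r', 5)`. It assumes the common Linux ioctl encoding in which `_IO` uses a direction and size of 0 (true on arm64); the class name `BinderReturnCode` is hypothetical.

```java
public class BinderReturnCode {
    /**
     * _IO(type, nr) in the Linux ioctl encoding, assuming the direction
     * and size fields are both 0, so only type and nr contribute.
     */
    public static int io(char type, int nr) {
        return (type << 8) | nr;
    }

    public static void main(String[] args) {
        // BR_DEAD_REPLY is defined as _IO('r', 5) in the binder header.
        int brDeadReply = io('r', 5);
        // prints "29189 = 0x7205"
        System.out.println(brDeadReply + " = 0x" + Integer.toHexString(brDeadReply));
    }
}
```

So the number says that the target of the transaction is already dead; it does not identify the process.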
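Since the information above does not yet show the exact trigger, list the possible entry points into recovery and verify each one:
So the key to finding where system_server triggers the reboot into recovery is getting more log output:
In parallel with the log analysis above, the customer was asked to run comparison experiments, to narrow the range of suspects and provide information from another angle;
These finally confirmed that the problem depends on the combination of system.img and vendor.img. Because there are too many change records, continuing this experiment would take a long time.
On the current USER build the command has no effect in the serial shell; testing showed that the log can be captured with adb logcat:
With the logcat output captured, continue the analysis there. Points to check:
1. What did system_server do?
2. Was the auto recovery mechanism triggered? Searching for autorecovery found no matching keyword, and the partition mount sequence shows no obvious anomaly either;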
01-01 16:09:56.040 3076 3104 E AndroidRuntime: FATAL EXCEPTION: Thread-3
01-01 16:09:56.040 3076 3104 E AndroidRuntime: Process: com.customer.txvoiceadapter, PID: 3076
01-01 16:09:56.040 3076 3104 E AndroidRuntime: java.lang.IllegalArgumentException: Invalid audio buffer size -2 (frame size 4)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.audioBuffSizeCheck(AudioRecord.java:731)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.<init>(AudioRecord.java:380)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.<init>(AudioRecord.java:284)
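This exception appears 38 times. Among the possible entry points listed earlier, one more is that when system_server or an application crashes too many times, the system applies a corresponding protection: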
Line 22423: 01-01 16:09:58.752 707 1196 W RescueParty: Noticed 2 events for UID 1000 in last 0 sec
Line 22751: 01-01 16:09:59.329 707 1196 W RescueParty: Noticed 3 events for UID 1000 in last 1 sec
Line 23454: 01-01 16:10:00.173 707 945 W RescueParty: Noticed 4 events for UID 1000 in last 2 sec
Line 23803: 01-01 16:10:00.619 707 2377 W RescueParty: Noticed 5 events for UID 1000 in last 2 sec
Line 23805: 01-01 16:10:00.625 707 2377 W RescueParty: Attempting rescue level FACTORY_RESET
Line 24448: 01-01 16:10:01.527 707 1212 W RescueParty: Noticed 2 events for UID 1000 in last 0 sec
Line 24505: 01-01 16:10:01.642 3574 3574 I /system/bin/uncrypt: received command: [--prompt_and_wipe_data
Line 24506: 01-01 16:10:01.642 3574 3574 I /system/bin/uncrypt: --reason=RescueParty
Line 24507: 01-01 16:10:01.643 3574 3574 I /system/bin/uncrypt: --locale=zh_CN
Line 24508: 01-01 16:10:01.643 3574 3574 I /system/bin/uncrypt: ] (59)
Line 24513: 01-01 16:10:01.664 3574 3574 I /system/bin/uncrypt: received 0, exiting now
Line 24515: 01-01 16:10:01.671 707 723 D ShutdownThread: Notifying thread to start shutdown longPressBehavior=3
Line 24516: 01-01 16:10:01.675 707 723 I MtkShutdownThread: mShutOffAnimation: 0 (NONE)
Line 24794: 01-01 16:10:02.073 707 3686 I ShutdownThread: Sending shutdown broadcast…
Line 24903: 01-01 16:10:02.259 707 1204 W RescueParty: Noticed 3 events for UID 1000 in last 1 sec
Line 25249: 01-01 16:10:02.645 707 718 W RescueParty: Noticed 4 events for UID 1000 in last 1 sec
Line 25349: 01-01 16:10:02.763 707 3686 I ShutdownThread: Shutting down activity manager…
Line 25567: 01-01 16:10:03.062 707 3686 I ShutdownThread: Shutting down package manager…
Line 25569: 01-01 16:10:03.100 707 3755 I ShutdownThread: Waiting for Radio…
Line 25570: 01-01 16:10:03.101 707 3755 I ShutdownThread: Radio shutdown complete.
Line 25571: 01-01 16:10:03.101 707 3686 I MtkShutdownThread: setBacklightBrightness: Off
Line 25586: 01-01 16:10:03.121 707 3686 I ShutdownThread: Rebooting, reason: recovery
Line 25764: 01-01 16:10:03.287 707 718 W RescueParty: Noticed 5 events for UID 1000 in last 2 sec
This confirms it: once RescueParty noticed that the process above had crashed too many times, it triggered the FACTORY_RESET mechanism;
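The "Noticed N events for UID 1000 in last M sec" lines suggest a counter over a time window. The sketch below is a minimal model of that idea, not the framework's code: the trigger count of 5 is taken from the log (the rescue attempt follows the fifth event), while the 30 second window, the class name `CrashThresholdSketch`, and the timestamps in `main` are assumptions for illustration.

```java
public class CrashThresholdSketch {
    private final int triggerCount;
    private final long triggerWindowMs;
    private int count = 0;
    private long windowStartMs = 0;

    public CrashThresholdSketch(int triggerCount, long triggerWindowMs) {
        this.triggerCount = triggerCount;
        this.triggerWindowMs = triggerWindowMs;
    }

    /** Records one crash at time nowMs; returns true once the threshold is reached. */
    public boolean incrementAndTest(long nowMs) {
        long elapsed = nowMs - windowStartMs;
        if (count == 0 || elapsed > triggerWindowMs) {
            // First event, or the previous window expired: start a new window.
            count = 1;
            windowStartMs = nowMs;
            return false;
        }
        count++;
        return count >= triggerCount;
    }

    public static void main(String[] args) {
        // Assumed values: 5 events within 30 seconds.
        CrashThresholdSketch threshold = new CrashThresholdSketch(5, 30_000L);
        // Illustrative timestamps (ms), spaced roughly like the events in the log.
        long[] crashTimes = {0L, 500L, 1100L, 1900L, 2400L};
        for (long t : crashTimes) {
            System.out.println("t=" + t + "ms triggered=" + threshold.incrementAndTest(t));
        }
    }
}
```

Under these assumptions only the fifth crash returns true, which corresponds to the point in the log where "Attempting rescue level FACTORY_RESET" follows "Noticed 5 events".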
The crashing process is an Android application, so when a FATAL EXCEPTION occurs the runtime dumps the corresponding stack trace:
01-01 16:09:56.040 3076 3104 E AndroidRuntime: FATAL EXCEPTION: Thread-3
01-01 16:09:56.040 3076 3104 E AndroidRuntime: Process: com.customer.txvoiceadapter, PID: 3076
01-01 16:09:56.040 3076 3104 E AndroidRuntime: java.lang.IllegalArgumentException: Invalid audio buffer size -2 (frame size 4)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.audioBuffSizeCheck(AudioRecord.java:731)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.<init>(AudioRecord.java:380)
01-01 16:09:56.040 3076 3104 E AndroidRuntime: at android.media.AudioRecord.<init>(AudioRecord.java:284)
Look for the cause in the log lines around this exception:
01-01 16:09:56.031 3076 3104 I TXVoiceAdapterLog: start record
01-01 16:09:56.032 1916 2550 E AudioRecord-JNI: Error creating AudioRecord instance: initialization check failed with status -22.
01-01 16:09:56.033 1916 2550 E android.media.AudioRecord: Error code -20 when initializing native AudioRecord object.
01-01 16:09:56.035 3076 3104 E AudioSystem: AudioSystem::getInputBufferSize failed sampleRate 44100 format 0x1 channelMask 0xc
01-01 16:09:56.036 3076 3104 E AudioRecord: AudioSystem could not query the input buffer size for sampleRate 44100, format 0x1, channelMask 0xc; status -22
01-01 16:09:56.036 3076 3104 I TXVoiceAdapterLog: mixBufSize=-2
Initialization of the native-layer AudioRecord failed: "when initializing native AudioRecord object". Earlier in the same log:
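The values `mixBufSize=-2` and `minBufferSize=-1` in the log look like error codes that were passed on as buffer sizes, which is what makes the constructor throw. The sketch below shows the kind of check an app could make before constructing the recorder. It is plain Java with no Android dependency: the two constants mirror `AudioRecord.ERROR` and `AudioRecord.ERROR_BAD_VALUE` from the Android SDK, while the class `BufferSizeCheck` and its methods are hypothetical and say nothing about how the customer app is actually written.

```java
public class BufferSizeCheck {
    // Same values as AudioRecord.ERROR and AudioRecord.ERROR_BAD_VALUE.
    public static final int ERROR = -1;
    public static final int ERROR_BAD_VALUE = -2;

    /** A minimum buffer size is only usable when it is a positive byte count. */
    public static boolean isUsable(int minBufferSize) {
        return minBufferSize > 0;
    }

    /** Turns a getMinBufferSize-style result into a readable description. */
    public static String describe(int minBufferSize) {
        if (minBufferSize == ERROR_BAD_VALUE) {
            return "bad value: parameters not supported or input unavailable";
        }
        if (minBufferSize == ERROR) {
            return "error: could not query the audio hardware";
        }
        if (minBufferSize <= 0) {
            return "unexpected result " + minBufferSize;
        }
        return "ok: " + minBufferSize + " bytes";
    }

    public static void main(String[] args) {
        // The two values printed by TXVoiceAdapterLog in this case.
        for (int size : new int[] {-2, -1}) {
            System.out.println(size + " usable=" + isUsable(size) + " (" + describe(size) + ")");
        }
    }
}
```

A check like this would only stop the app from crashing; the underlying fault, the missing input device, still has to be fixed on the platform side.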
01-01 16:09:55.991 3076 3076 I TXVoiceAdapterLog: tx voice adapter on create.
01-01 16:09:56.007 3076 3103 I TXVoiceAdapterLog: start audiotrack play for open blk
01-01 16:09:56.020 3076 3103 E AudioTrack: Unable to query output sample rate for stream type -1; status -1
01-01 16:09:56.021 3076 3103 E AudioTrack-JNI: AudioTrack::getMinFrameCount() for sample rate 44100 failed with status -1
01-01 16:09:56.021 3076 3103 E android.media.AudioTrack: getMinBufferSize(): error querying hardware
…
01-01 16:09:56.030 387 1246 D APM_AudioPolicyManager: getInputForAttr() source 1, sampling rate 16000, format 0x1, channel mask 0xc,session 201, flags 0
01-01 16:09:56.030 3076 3103 I TXVoiceAdapterLog: minBufferSize=-1
01-01 16:09:56.030 387 1246 E APM::AudioPolicyEngine: getDeviceForInputSource() no default device defined
01-01 16:09:56.030 387 1246 W APM_AudioPolicyManager: getInputForAttr() could not find device for source 1
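The input device was not created successfully. This is really native audio's dependency on the HAL layer, so the possible causes are:
So compare the audio-related differences in the vendor img: it turns out that one audio .so library was configured not to be included in USER builds!!!
After adding it back the problem no longer occurs, so this issue is fixed.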
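RescueParty was introduced in Android O. Its core function: when an application or system_server crashes repeatedly, the system enters a rescue procedure that tries to repair the problem and avoid affecting the user experience:
The rescue levels are: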
private static final int LEVEL_NONE = 0;
private static final int LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS = 1;
private static final int LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES = 2;
private static final int LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS = 3;
private static final int LEVEL_FACTORY_RESET = 4;
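To show how these levels could escalate, here is a minimal sketch in which each time the crash threshold is reached the level moves up by one and is capped at LEVEL_FACTORY_RESET. The constants are the ones listed above; the class `RescueLevelSketch` and the methods `nextLevel` and `levelToString` are hypothetical names for illustration, and the one-step escalation is an assumption about the mechanism rather than a copy of the framework code.

```java
public class RescueLevelSketch {
    public static final int LEVEL_NONE = 0;
    public static final int LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS = 1;
    public static final int LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES = 2;
    public static final int LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS = 3;
    public static final int LEVEL_FACTORY_RESET = 4;

    /** Assumed escalation rule: one step up per trigger, capped at factory reset. */
    public static int nextLevel(int current) {
        return Math.min(current + 1, LEVEL_FACTORY_RESET);
    }

    /** Readable name for a level, in the style of the "Attempting rescue level" log line. */
    public static String levelToString(int level) {
        switch (level) {
            case LEVEL_NONE: return "NONE";
            case LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS: return "RESET_SETTINGS_UNTRUSTED_DEFAULTS";
            case LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES: return "RESET_SETTINGS_UNTRUSTED_CHANGES";
            case LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS: return "RESET_SETTINGS_TRUSTED_DEFAULTS";
            case LEVEL_FACTORY_RESET: return "FACTORY_RESET";
            default: return Integer.toString(level);
        }
    }

    public static void main(String[] args) {
        int level = LEVEL_NONE;
        // Simulate five consecutive triggers starting from LEVEL_NONE.
        for (int i = 0; i < 5; i++) {
            level = nextLevel(level);
            System.out.println("Attempting rescue level " + levelToString(level));
        }
    }
}
```

Under this model the settings-reset levels are tried first, and FACTORY_RESET, the level seen in the log of this case, is the last resort once the earlier levels have not stopped the crashes.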
to be done