现象
使用错误手指触摸指纹5次,指纹被锁定后,按power灭屏再亮屏,手动解图形锁概率发生卡住,另外插入USB不复现。
分析
Exception Class: SWT
Exception Type: system_server_watchdog
Current Executing Process:
system_server
Trigger time:[2018-01-02 12:51:47.93449] pid:631
Backtrace:
Process: system_server
Subject: Blocked in handler on
main
thread (main)
Build: alps/full_k400/k400:8.1.0/O11019/1516598058:userdebug/dev-keys
main thread
这个是system_server进程的主线程,等待锁:0x0ae821b0
"main" prio=5
tid=1
Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x70e1c9f8 self=0xabe2d000
|
sysTid=631
nice=-2 cgrp=default sched=0/0 handle=0xb046e4a4
| state=S schedstat=( 9786847713 6915040794 14170 ) utm=749 stm=229 core=0 HZ=100
| stack=0xbe0c8000-0xbe0ca000 stackSize=8MB
| held mutexes=
at com.android.server.locksettings.LockSettingsService.havePattern(LockSettingsService.java:990)
- waiting to lock <
0x0ae821b0
> (a com.android.server.locksettings.SyntheticPasswordManager) held by
thread 111
at com.android.internal.widget.LockPatternUtils.savedPatternExists(LockPatternUtils.java:536)
at com.android.internal.widget.LockPatternUtils.isLockPatternEnabled(LockPatternUtils.java:1157)
搜索:0x0ae821b0:
这个是system_server进程的binder线程线程,名称是Binder:631_19,对于内核中的pid是4148.
"Binder:
631_19
" prio=5
tid=111
Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x1358ab20 self=0x94c09800
| sysTid=
4148
nice=0 cgrp=default sched=0/0 handle=0x8b836970
| state=S schedstat=( 2006656770 1821935538 4272 ) utm=152 stm=48 core=2 HZ=100
| stack=0x8b73c000-0x8b73e000 stackSize=1006KB
| held mutexes=
kernel: binder_thread_read+0x458/0x1054
kernel: binder_ioctl+0x46c/0x9e8
kernel: do_vfs_ioctl+0x9c/0x69c
kernel: SyS_ioctl+0x54/0x78
native: #00 pc 000493a4 /system/lib/libc.so (__ioctl+8)
native: #01 pc 0001df8f /system/lib/libc.so (ioctl+38)
native: #02 pc 0004242f /system/lib/libbinder.so (android::IPCThreadState::
talkWithDriver
(bool)+170)
native: #03 pc 00042de9 /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+236)
native: #04 pc 0003d2e5 /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+36)
native: #05 pc 000bcdf5 /system/lib/libandroid_runtime.so (???)
native: #06 pc 00298fb5 /system/framework/arm/boot-framework.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+132)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:764)
at android.service.gatekeeper.IGateKeeperService$Stub$Proxy.verifyChallenge(
IGateKeeperService
.java:253)
at com.android.server.locksettings.SyntheticPasswordManager.unwrapPasswordBasedSyntheticPassword(SyntheticPasswordManager.java:810)
at com.android.server.locksettings.LockSettingsService.spBasedDoVerifyCredential(LockSettingsService.java:2105)
- locked <
0x0ae821b0
> (a com.android.server.locksettings.SyntheticPasswordManager)
at com.android.server.locksettings.LockSettingsService.doVerifyCredential(LockSettingsService.java:1553)
at com.android.server.locksettings.LockSettingsService.checkCredential(LockSettingsService.java:1526)
at com.android.internal.widget.ILockSettings$Stub.onTransact(ILockSettings.java:164)
binder call一般是同步调用,很明显这里就是.调用栈显示binder client端4118线程发生了binder call长等待,
迟迟收不到来自binder server端的回应导致,那么这个binder server端是谁呢?
分析binder proc信息数据(搜索4148):
proc 631
context binder
thread 4148: l 11 need_return 0 tr 0
outgoing transaction 386736: dde3c180 from
631:4148
to
314:314
code 3 flags 10 pri 0:120 r1
很明显,这个binder server端就是pid=314进程了,搜索314进程:
Line 108828: <4>[ 1412.757750] -(2)[899:watchdog]
gatekeeperd
314 273297.350405 292 120 273297.350405 131.026859 1306310.341713 /
所以sysTid=4148的binder线程的一直没等到314进程
gatekeeperd 的回应,长久占用binder资源导致system_server的main线程卡住超时,最终被watchdog发送:
kill -9 631
<7>[ 1414.774647] -(1)[899:watchdog][name:mtprof&][signal][899:watchdog] send death
sig 9 to [631:system_server:R]
<6>[ 1415.102373] (2)[4300:kworker/u8:9]binder: release 631:4148 transaction 386736 out, still active
<6>[ 1415.102398] (2)[4300:kworker/u8:9]binder: release 631:4148 transaction 386652 in, still active
下一步需要分析gatekeeperd线程为什么会卡住(进入深度睡眠)不动了。
kernel log没发现gatekeeperd的相关调用栈,按道理watchdog会把所有卡住的进程调用栈打印出来的,这是怎么回事?
<4>[ 1412.710618] (2)[899:watchdog]Backtrace:
<4>[ 1412.710638] (2)[899:watchdog][] (__schedule) from [] (schedule+0x90/0x9c)
<4>[ 1412.710647] (2)[899:watchdog] r10:c0e1fe48 r9:00000007 r8:c0e1fe48 r7:dd975480
<4>[ 1412.710681] (2)[899:watchdog][] (schedule) from [] (schedule_timeout+0x2c/0x214)
<4>[ 1412.710698] (2)[899:watchdog][] (schedule_timeout) from [] (__down+0x78/0xcc)
<4>[ 1412.710707] (2)[899:watchdog] r9:00000007 r8:00000002 r7:dd975480 r6:7fffffff
<4>[ 1412.710743] (2)[899:watchdog][] (__down) from [] (down+0x34/0x44)
<4>[ 1412.710752] (2)[899:watchdog] r8:aa0b0000 r7:00005403 r6:c0e1fe48 r5:80000013
<4>[ 1412.710788] (2)[899:watchdog][] (
down
) from [] (
fp_ioctl
+0x54/0x498)
<4>[ 1412.710797] (2)[899:watchdog] r5:00005403 r4:aa0b0000
<4>[ 1412.710824] (2)[899:watchdog][] (fp_ioctl) from [] (do_vfs_ioctl+0x9c/0x69c)
<4>[ 1412.710833] (2)[899:watchdog] r9:00000007 r8:aa0b0000 r7:00005403 r6:c7509780
<4>[ 1412.710867] (2)[899:watchdog][] (do_vfs_ioctl) from [] (SyS_ioctl+0x54/0x78)
<4>[ 1412.710876] (2)[899:watchdog] r10:00000000 r9:00000007 r8:aa0b0000 r7:00005403
分析
fp_ioctl
代码,这个是kernel tee driver代码里面的.
89 static long
fp_ioctl(struct file *filp, unsigned cmd, unsigned long arg)
90 {
91 unsigned int args_len = 0;
92 unsigned int fp_cid = 0xFF;
93 unsigned int fp_fid = 0xFF;
94 unsigned char args[16] = {0};
95 unsigned int buff_len = 0;
96 if (_IOC_TYPE(cmd) != TEEI_IOC_MAGIC)
97 return -ENOTTY;
98 if (_IOC_NR(cmd) > TEEI_IOC_MAXNR)
99 return -ENOTTY;
100 down(&
fp_api_lock);
代码分析结合调用栈显示拿不到
fp_api_lock信号量
锁而被阻塞,需要看谁拿住了锁,继续分析D进程.
android.hardwar
排查D进程发现android.hardwar可疑,可以结合代码调用栈分析得出是它拿住了上面的信号量锁
fp_api_lock
,
执行send_fp_command后久久不返回导致长久持锁,导致gatekeeperd久久拿不到这个锁而被卡住.
<6>[ 1412.710976] (2)[899:watchdog]
android.hardwar D
c0958ff8 0 630 1 0x00000000
<4>[ 1412.710996] (2)[899:watchdog]Backtrace:
<4>[ 1412.711016] (2)[899:watchdog][] (__schedule) from [] (schedule+0x90/0x9c)
<4>[ 1412.711025] (2)[899:watchdog] r10:c0e1fe48 r9:d7480000 r8:c0e1fe48 r7:dd1bb400
<4>[ 1412.711060] (2)[899:watchdog][] (schedule) from [] (schedule_timeout+0x2c/0x214)
<4>[ 1412.711076] (2)[899:watchdog][] (schedule_timeout) from [] (__down+0x78/0xcc)
<4>[ 1412.711085] (2)[899:watchdog] r9:d7480000 r8:00000002 r7:dd1bb400 r6:7fffffff
<4>[ 1412.711120] (2)[899:watchdog][] (__down) from [] (down+0x34/0x44)
<4>[ 1412.711129] (2)[899:watchdog] r8:00000051 r7:c100d2f4 r6:00000000 r5:60010013
<4>[ 1412.711167] (2)[899:watchdog][] (down) from [] (send_fp_command+0xe8/0x194)
<4>[ 1412.711176] (2)[899:watchdog] r5:c100d178 r4:c0e1fe48
<4>[ 1412.711202] (2)[899:watchdog][] (
send_fp_command
) from [] (
fp_ioctl
+0x270/0x498)
<4>[ 1412.711211] (2)[899:watchdog] r6:c0e1fe48 r5:00002418 r4:b387f7a8
<4>[ 1412.711240] (2)[899:watchdog][] (fp_ioctl) from [] (do_vfs_ioctl+0x9c/0x69c)
<4>[ 1412.711249] (2)[899:watchdog] r9:00000005 r8:b387f7a8 r7:00005402 r6:db10b840
这里代码调用栈的行为是LINUX内核使用SMC CALL指令跟EL3(TEE OS)通信,但是很不幸一直没等待到回应.
故接下来需要分析EL3(TEE OS)为什么会没返回了.
TZ_LOG
<7>[ 1278.054685] -(0)[52:cfinteractive][name:mt_freqhopping&](PLL_CON1): 0x9a000
<7>[ 1278.054733] -(0)[52:cfinteractive][name:mt_freqhopping&]PLL_CON1: 0x000c8000<7>[ 1278.054768] (0)[52:cfinteractive][name:mt_cpufreq&][Power/cpufreq] @_cpufreq_set_locked(): Vproc = 1200mv, freq = 1300000 KHz
<4>[ 1278.055076] -(0)[139:teei_switch_thr][TZ_LOG] [01;32mREEagent| [E][tee_listener.cc:110/smc_listener_list_driver_handle]____________________________ smc_listener_list_driver_handle : 110 _______________________________[0m
<4>[ 1278.055093] -(0)[139:teei_switch_thr][TZ_LOG] [01;32mREEagent| [0m
<4>[ 1278.055108] -(0)[139:teei_switch_thr][TZ_LOG] [01;32mREEagent| [E][tee_listener.cc:123/smc_listener_list_driver_handle]____________________________ smc_listener_list_driver_handle : 123 _______________________________[0m
<4>[ 1278.055119] -(0)[139:teei_switch_thr][TZ_LOG] [01;32mREEagent| [0m
<4>[ 1278.055130] -(0)[139:teei_switch_thr][
TZ_LOG
] [01;31mFP_S | Doing
SIGSEGV
[0m
<4>[ 1278.055142] -(0)[139:teei_switch_thr][
TZ_LOG
] [01;31mFP_S |
Invalid user memory for sigframe
...[0m
<2>[ 1278.559001] (0)[93:hps_main]Boot slave CPU
log显示TEE内部发生了
SIGSEGV
.所以是需要提供给TEE提供商分析的。
根本原因
CPU off的时候没有按ARM规范flush D-cache影响了mem的一致性所致.