由Monkey测试引发的跨多个进程的Android系统死锁问题分析

一、问题现象

1、界面定住,没有任何刷新,所有输入事件无效,包括power key

2、adb shell可以连接并操作手机

3、手机的data和sdcard存储空间已满

4、watchdog没有重启system server

二、解决方案

通过初步分析、深入分析(具体分析过程和关键代码及log在下面)我们知道了问题的原因:

1、monkey测试的过程中起了bugreport,并将bugreport输出到stdout的log通过同步pipe循环读取出来然后write到sdcard的文件中,但是由于sdcard空间满了,write的时候发生了IO error导致monkey异常跳出了读取同步pipe的流程,由于bugreport还没有将log输出完毕所以会一直等待,而monkey则继续运行。

2、bugreport起来之后通过set property的方式触发了dumpstate去获取log,然后通过同步local socket(以下为了描述方便简短起见简称socket)与dumpstate建立连接,将dumpstate通过socket输出的log读取出来写到stdout。

3、dumpstate起来之后会首先将与bugreport的socket通过dup2重定向到stdout,然后执行一系列的抓取log信息的动作,包括cat proc节点、执行dumpsys batterystats等,这些动作都是通过run command的方式起的即fork一个进程,然后执行execvp,fork会继承dumpstate的文件描述符,所以起dumpsys batterystats之后dumpsys中的stdout就是与bugreport通信的socket,dumpsys batterystats会将stdout 文件描述符通过binder传递到system server中。

4、binder driver负责在system server进程中重新分配一个文件描述符来指向与bugreport通信的socket,system server收到dump batterystats的请求后开始把信息输出到bugreport的socket,由于bugreport被write block,导致system server输出的log无法被读取,从而引起system server的dump thread拿着batterystats的一个lock而block了进程内多个其他thread。

5、monkey继续运行向system server发起了dumpsys meminfo的请求,由于dump batterystats的线程拿着一个关键的lock没有释放,导致dumpsys meminfo一直无法得到执行而block了monkey主线程。

6、watchdog线程检测到system server被block之后开始执行相关逻辑,其中就有调用IActivityController的systemNotResponding方法,这个IActivityController是monkey注册的,所有就会跨进程调用到了monkey端,monkey的binder thread收到这个请求之后开始执行,但是执行的过程中需要monkey主线dump完需要的信息后notify它,由于monkey主线程dumpsys meminfo的过程中block,所以导致watchdog线程也被block而无法kill system server。
由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第1张图片

针对问题的原因,给出以下解决方案:
解开死锁问题的根本死结,当monkey向sdcard写log的时候,如果发生IO error,我们捕获它并继续将monkey起的bugreport的输出读完,从而使system server的dump线程释放持有的lock,使其他线程继续正常工作,dumpsys、dumpstate、cat、bugreport等都能完成自己的同步输入输出,monkey主线程也会恢复,整个系统恢复正常。

patch已同步提交给AOSP review,https://android-review.googlesource.com/#/c/234141/

由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第2张图片

三、初步分析

1、查看system server的traces发现多线程互锁,主线程需要的锁在thread 33手里

"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 dsCount=0 obj=0x74869fb8 self=0x5588e4e100
  | sysTid=3230 nice=-2 cgrp=default sched=0/0 handle=0x7fa92efe80
  | state=S schedstat=( 153835271056 66149983041 217214 ) utm=9061 stm=6322 core=4 HZ=100
  | stack=0x7ff873d000-0x7ff873f000 stackSize=8MB
  | held mutexes=
  at com.android.server.AlarmManagerService$ResultReceiver.onSendFinished(AlarmManagerService.java:2266)
  - waiting to lock <0x2310f49f> (a java.lang.Object) held by thread 33
  at android.app.PendingIntent$FinishedDispatcher.run(PendingIntent.java:219)
  at android.os.Handler.handleCallback(Handler.java:739)
  at android.os.Handler.dispatchMessage(Handler.java:95)
  at android.os.Looper.loop(Looper.java:135)
  at com.android.server.SystemServer.run(SystemServer.java:296)
  at com.android.server.SystemServer.main(SystemServer.java:188)
  at java.lang.reflect.Method.invoke!(Native method)
  at java.lang.reflect.Method.invoke(Method.java:372)
  at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:908)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:703)

2、thread 33需要的锁在thread 86手里

"AlarmManager" prio=5 tid=33 Blocked
  | group="main" sCount=1 dsCount=0 obj=0x132f14c0 self=0x7f92814780
  | sysTid=3773 nice=0 cgrp=default sched=0/0 handle=0x7f93cb7d50
  | state=S schedstat=( 7677732166 3315508747 11830 ) utm=569 stm=198 core=5 HZ=100
  | stack=0x7f9233f000-0x7f92341000 stackSize=1036KB
  | held mutexes=
  at com.android.server.power.PowerManagerService.acquireWakeLockInternal(PowerManagerService.java:753)
  - waiting to lock <0x012c60c0> (a java.lang.Object) held by thread 86
  at com.android.server.power.PowerManagerService.access$3300(PowerManagerService.java:87)
  at com.android.server.power.PowerManagerService$BinderService.acquireWakeLock(PowerManagerService.java:2990)
  at android.os.PowerManager$WakeLock.acquireLocked(PowerManager.java:986)
  at android.os.PowerManager$WakeLock.acquire(PowerManager.java:954)
  - locked <@addr=0x1300b8a0> (a android.os.Binder)
  at com.android.server.AlarmManagerService.deliverAlarmsLocked(AlarmManagerService.java:1820)
  at com.android.server.AlarmManagerService$AlarmThread.run(AlarmManagerService.java:1981)
  - locked <0x2310f49f> (a java.lang.Object)

3、thread 86需要的锁在thread 84手里

"Binder_E" prio=5 tid=86 Blocked
  | group="main" sCount=1 dsCount=0 obj=0x141230a0 self=0x7f8b096630
  | sysTid=5467 nice=0 cgrp=default sched=0/0 handle=0x7f8b0960e0
  | state=S schedstat=( 169694491361 86744323304 358303 ) utm=11590 stm=5379 core=5 HZ=100
  | stack=0x7f8ada5000-0x7f8ada7000 stackSize=1012KB
  | held mutexes=
  at com.android.server.am.BatteryStatsService.noteStartWakelock(BatteryStatsService.java:253)
  - waiting to lock <0x052efb00> (a com.android.internal.os.BatteryStatsImpl) held by thread 84
  at com.android.server.power.Notifier.onWakeLockAcquired(Notifier.java:158)
  at com.android.server.power.PowerManagerService.notifyWakeLockAcquiredLocked(PowerManagerService.java:963)
  at com.android.server.power.PowerManagerService.acquireWakeLockInternal(PowerManagerService.java:805)
  - locked <0x012c60c0> (a java.lang.Object)
  at com.android.server.power.PowerManagerService.access$3300(PowerManagerService.java:87)
  at com.android.server.power.PowerManagerService$BinderService.acquireWakeLock(PowerManagerService.java:2990)
  at android.os.IPowerManager$Stub.onTransact(IPowerManager.java:66)
  at android.os.Binder.execTransact(Binder.java:446)

4、thread 84拿着锁在做IO操作,但是这个IO操作为什么会一直block?

"Binder_C" prio=5 tid=84 Native
  | group="main" sCount=1 dsCount=0 obj=0x1411e0a0 self=0x7f8b093e10
  | sysTid=5394 nice=-20 cgrp=default sched=0/0 handle=0x7f8b091de0
  | state=S schedstat=( 164975127905 86670873575 354263 ) utm=11207 stm=5290 core=4 HZ=100
  | stack=0x7f8aaa1000-0x7f8aaa3000 stackSize=1012KB
  | held mutexes=
  kernel: __switch_to+0x70/0x7c
  kernel: sock_alloc_send_pskb+0x25c/0x314
  kernel: sock_alloc_send_skb+0x18/0x24
  kernel: unix_stream_sendmsg+0x15c/0x340
  kernel: sock_aio_write+0x124/0x140
  kernel: do_sync_write+0x74/0xa0
  kernel: vfs_write+0xd4/0x178
  kernel: SyS_write+0x44/0x74
  kernel: cpu_switch_to+0x48/0x4c
  native: #00 pc 0005fca4  /system/lib64/libc.so (write+4)
  native: #01 pc 0000136c  /system/vendor/lib64/libNimsWrap.so (write+44)
  native: #02 pc 000368d4  /system/lib64/libjavacore.so (???)
  native: #03 pc 003c0300  /data/dalvik-cache/arm64/system@[email protected] (Java_libcore_io_Posix_writeBytes__Ljava_io_FileDescriptor_2Ljava_lang_Object_2II+212)
  at libcore.io.Posix.writeBytes(Native method)
  at libcore.io.Posix.write(Posix.java:258)
  at libcore.io.BlockGuardOs.write(BlockGuardOs.java:313)
  at libcore.io.IoBridge.write(IoBridge.java:497)
  at java.io.FileOutputStream.write(FileOutputStream.java:186)
  at com.android.internal.util.FastPrintWriter.flushBytesLocked(FastPrintWriter.java:334)
  at com.android.internal.util.FastPrintWriter.flushLocked(FastPrintWriter.java:355)
  at com.android.internal.util.FastPrintWriter.appendLocked(FastPrintWriter.java:303)
  at com.android.internal.util.FastPrintWriter.print(FastPrintWriter.java:466)
  - locked <@addr=0x13c93200> (a com.android.internal.util.FastPrintWriter$DummyWriter)
  at android.os.BatteryStats$HistoryPrinter.printNextItem(BatteryStats.java:3503)
  at android.os.BatteryStats.dumpHistoryLocked(BatteryStats.java:3915)
  at android.os.BatteryStats.dumpLocked(BatteryStats.java:3962)
  at com.android.internal.os.BatteryStatsImpl.dumpLocked(BatteryStatsImpl.java:8843)
  at com.android.server.am.BatteryStatsService.dump(BatteryStatsService.java:947)
  - locked <0x052efb00> (a com.android.internal.os.BatteryStatsImpl)
  at android.os.Binder.dump(Binder.java:319)
  at android.os.Binder.onTransact(Binder.java:285)
  at com.android.internal.app.IBatteryStats$Stub.onTransact(IBatteryStats.java:832)
  at android.os.Binder.execTransact(Binder.java:446)

带着问题与朴老师还有明浩兄沟通,明浩兄给出的结论是同步IO操作,如果读端没有将写端写入到buffer或者写端自带buffer中的数据读取完,一旦写端将buffer写满仍然需要继续写或者写端自带buffer没人读,写端就会block,明浩兄自己模拟出了类似的写端调用栈:

kernel: __switch_to+0x70/0x7c
kernel: sock_alloc_send_pskb+0x25c/0x314
kernel: sock_alloc_send_skb+0x18/0x24
kernel: unix_stream_sendmsg+0x15c/0x340
kernel: sock_aio_write+0x124/0x140
kernel: do_sync_write+0x74/0xa0
kernel: vfs_write+0xd4/0x178
kernel: SyS_write+0x44/0x74
kernel: cpu_switch_to+0x48/0x4c

读端的调用栈:

[<0000000000000000>] __switch_to+0x70/0x7c
[<0000000000000000>] __skb_recv_datagram+0x408/0x49c
[<0000000000000000>] skb_recv_datagram+0x38/0x5c
[<0000000000000000>] unix_accept+0x6c/0x154
[<0000000000000000>] SyS_accept4+0x104/0x1e0
[<0000000000000000>] el0_svc_naked+0x20/0x28
[<0000000000000000>] 0xffffffffffffffff

与此同时明浩兄还带来了两个好问题:

1.读端的线程是哪个?这时候在干什么?没有找到对应的bugreport.

2.watchdog为什么没有杀掉system_server呢?

当然还少不了建议的解决方案:

1.采用异步写.

2.读端要及时关闭socket.

问题就这样结束了吗?当然不能!

四、深入分析

首先解释明浩兄提出的第二个问题,watchdog为什么没有杀掉system_server呢?

watchdog thread在调用android.app.IActivityController Stub Proxy.systemNotResponding时被block,导致不能及时顺序的执行kill system server的动作:

"watchdog" prio=5 tid=71 Native
group="main" sCount=1 dsCount=0 obj=0x12edbe00 self=0x7f90d1bdc0
sysTid=5070 nice=0 cgrp=default sched=0/0 handle=0x7f90d1c6b0
state=S schedstat=( 876766452 162402010 4730 ) utm=17 stm=70 core=1 HZ=100
stack=0x7f8c410000-0x7f8c412000 stackSize=1036KB
held mutexes=
kernel: __switch_to+0x70/0x7c
kernel: binder_thread_read+0x464/0xe8c
kernel: binder_ioctl+0x3f8/0x824
kernel: do_vfs_ioctl+0x4a4/0x578
kernel: SyS_ioctl+0x5c/0x88
kernel: cpu_switch_to+0x48/0x4c
native: #00 pc 0005ec3c /system/lib64/libc.so (__ioctl+4)
native: #01 pc 00068eb0 /system/lib64/libc.so (ioctl+100)
native: #02 pc 000275f0 /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+164)
native: #03 pc 00028050 /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+112)
native: #04 pc 000282c4 /system/lib64/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+176)
native: #05 pc 0001ff38 /system/lib64/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+64)
native: #06 pc 000dcd58 /system/lib64/libandroid_runtime.so (???)
native: #07 pc 005fcfb0 /data/dalvik-cache/arm64/system@[email protected] (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+212)
at android.os.BinderProxy.transactNative(Native method)
at android.os.BinderProxy.transact(Binder.java:496)
at android.app.IActivityController$Stub$Proxy.systemNotResponding(IActivityController.java:293)
at com.android.server.Watchdog.run(Watchdog.java:487)

由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第3张图片

执行android.app.IActivityController Stub Proxy.systemNotResponding调用到了何方神圣?为什么会被block?

通过搜索代码和trace定位到是调用到了monkey中,而且发现了条件循环和wait:
由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第4张图片

"Binder_1" prio=5 tid=15 Waiting
  | group="main" sCount=1 dsCount=0 obj=0x12cbb0a0 self=0x55cc7ec1e0
  | sysTid=12015 nice=0 cgrp=default sched=0/0 handle=0x55cc801c50
  | state=S schedstat=( 959522781 386691057 2262 ) utm=41 stm=54 core=5 HZ=100
  | stack=0x7f82c72000-0x7f82c74000 stackSize=1012KB
  | held mutexes=
  at java.lang.Object.wait!(Native method)
  - waiting on <0x121ce47e> (a com.android.commands.monkey.Monkey)
  at com.android.commands.monkey.Monkey$ActivityController.systemNotResponding(Monkey.java:432)
  - locked <0x121ce47e> (a com.android.commands.monkey.Monkey)
  at android.app.IActivityController$Stub.onTransact(IActivityController.java:130)
  at android.os.Binder.execTransact(Binder.java:446)

mWatchdogWaiting置为false的条件是跑完runMonkeyCycles之后走到下面才设置false并notifyall,但是当前monkey的main thread也block了:
由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第5张图片

monkey的main thread执行reportDumpsysMemInfo时被block的trace:

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 obj=0x12c410a0 self=0x55cc7c7510
  | sysTid=12001 nice=0 cgrp=default sched=0/0 handle=0x7f8cf59e80
  | state=S schedstat=( 216727507148 81167424419 501554 ) utm=17390 stm=4282 core=4 HZ=100
  | stack=0x7fcefa3000-0x7fcefa5000 stackSize=8MB
  | held mutexes=
  kernel: __switch_to+0x70/0x7c
  kernel: pipe_wait+0x60/0x88
  kernel: pipe_read+0x34c/0x3cc
  kernel: do_sync_read+0x74/0xa0
  kernel: vfs_read+0xa0/0x12c
  kernel: SyS_read+0x44/0x74
  kernel: cpu_switch_to+0x48/0x4c
  native: #00 pc 0005f68c  /system/lib64/libc.so (read+4)
  native: #01 pc 0003947c  /system/lib64/libjavacore.so (???)
  native: #02 pc 003c0300  /data/dalvik-cache/arm64/system@[email protected] (Java_libcore_io_Posix_readBytes__Ljava_io_FileDescriptor_2Ljava_lang_Object_2II+212)
  at libcore.io.Posix.readBytes(Native method)
  at libcore.io.Posix.read(Posix.java:165)
  at libcore.io.BlockGuardOs.read(BlockGuardOs.java:230)
  at libcore.io.IoBridge.read(IoBridge.java:472)
  at java.io.FileInputStream.read(FileInputStream.java:177)
  at java.io.InputStreamReader.read(InputStreamReader.java:231)
  - locked <@addr=0x12cc22a0> (a java.lang.ProcessManager$ProcessInputStream)
  at java.io.BufferedReader.fillBuf(BufferedReader.java:145)
  at java.io.BufferedReader.readLine(BufferedReader.java:397)
  - locked <@addr=0x12cc2460> (a java.io.InputStreamReader)
  at com.android.commands.monkey.Monkey.commandLineReport(Monkey.java:502)
  at com.android.commands.monkey.Monkey.reportDumpsysMemInfo(Monkey.java:469)
  at com.android.commands.monkey.Monkey.runMonkeyCycles(Monkey.java:1192)
  at com.android.commands.monkey.Monkey.run(Monkey.java:700)
  at com.android.commands.monkey.Monkey.main(Monkey.java:553)
  at com.android.internal.os.RuntimeInit.nativeFinishInit(Native method)
  at com.android.internal.os.RuntimeInit.main(RuntimeInit.java:251)

又看到了同步IO被block的调用栈,这次是读端block。那么问题就来了,读端是谁?为什么写端没有写?

接下来watchdog没有kill system server的问题就转变为monkey的主线程为什么会被block?导致不能及时的设置条件为false并notifyall从而使watchdog继续工作?

好了,再捋一捋,有点乱:

watchdog thread等待monkey的binder thread执行完systemNotResponding,monkey的binder thread又等着monkey的main thread执行完reportDumpsysMemInfo去notify自己,而reportDumpsysMemInfo的执行需要拿到一个BatteryStatsImpl实例的锁,但是BatteryStatsImpl实例的锁又被tid=84的binder thread拿着,也就是上面最早执行同步socket操作的thread,绕了一大圈发现这是一个连环死锁,最终死锁的关键还是这个tid=84的thread.

system server中dumpsys meminfo被block的trace:

"Binder_1" prio=5 tid=16 Blocked
  | group="main" sCount=1 dsCount=0 obj=0x12ceb0a0 self=0x5589025e20
  | sysTid=3256 nice=0 cgrp=default sched=0/0 handle=0x5588f5c810
  | state=S schedstat=( 187370245727 86634518556 362219 ) utm=12721 stm=6016 core=6 HZ=100
  | stack=0x7f93e8c000-0x7f93e8e000 stackSize=1012KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.updateCpuStatsNow(ActivityManagerService.java:2304)
  - waiting to lock <0x052efb00> (a com.android.internal.os.BatteryStatsImpl) held by thread 84
  - locked <0x1a8a1243> (a com.android.internal.os.ProcessCpuTracker)
  at com.android.server.am.ActivityManagerService.dumpApplicationMemoryUsage(ActivityManagerService.java:14708)
  at com.android.server.am.ActivityManagerService$MemBinder.dump(ActivityManagerService.java:2002)
  at android.os.Binder.dump(Binder.java:319)
  at android.os.Binder.onTransact(Binder.java:285)
  at android.os.Binder.execTransact(Binder.java:446)

好你个thread 84,竟然这么多人被你block,你到底在干啥?我们再看看它的trace:

"Binder_C" prio=5 tid=84 Native
  | group="main" sCount=1 dsCount=0 obj=0x1411e0a0 self=0x7f8b093e10
  | sysTid=5394 nice=-20 cgrp=default sched=0/0 handle=0x7f8b091de0
  | state=S schedstat=( 164975127905 86670873575 354263 ) utm=11207 stm=5290 core=4 HZ=100
  | stack=0x7f8aaa1000-0x7f8aaa3000 stackSize=1012KB
  | held mutexes=
  kernel: __switch_to+0x70/0x7c
  kernel: sock_alloc_send_pskb+0x25c/0x314
  kernel: sock_alloc_send_skb+0x18/0x24
  kernel: unix_stream_sendmsg+0x15c/0x340
  kernel: sock_aio_write+0x124/0x140
  kernel: do_sync_write+0x74/0xa0
  kernel: vfs_write+0xd4/0x178
  kernel: SyS_write+0x44/0x74
  kernel: cpu_switch_to+0x48/0x4c
  native: #00 pc 0005fca4  /system/lib64/libc.so (write+4)
  native: #01 pc 0000136c  /system/vendor/lib64/libNimsWrap.so (write+44)
  native: #02 pc 000368d4  /system/lib64/libjavacore.so (???)
  native: #03 pc 003c0300  /data/dalvik-cache/arm64/system@[email protected] (Java_libcore_io_Posix_writeBytes__Ljava_io_FileDescriptor_2Ljava_lang_Object_2II+212)
  at libcore.io.Posix.writeBytes(Native method)
  at libcore.io.Posix.write(Posix.java:258)
  at libcore.io.BlockGuardOs.write(BlockGuardOs.java:313)
  at libcore.io.IoBridge.write(IoBridge.java:497)
  at java.io.FileOutputStream.write(FileOutputStream.java:186)
  at com.android.internal.util.FastPrintWriter.flushBytesLocked(FastPrintWriter.java:334)
  at com.android.internal.util.FastPrintWriter.flushLocked(FastPrintWriter.java:355)
  at com.android.internal.util.FastPrintWriter.appendLocked(FastPrintWriter.java:303)
  at com.android.internal.util.FastPrintWriter.print(FastPrintWriter.java:466)
  - locked <@addr=0x13c93200> (a com.android.internal.util.FastPrintWriter$DummyWriter)
  at android.os.BatteryStats$HistoryPrinter.printNextItem(BatteryStats.java:3503)
  at android.os.BatteryStats.dumpHistoryLocked(BatteryStats.java:3915)
  at android.os.BatteryStats.dumpLocked(BatteryStats.java:3962)
  at com.android.internal.os.BatteryStatsImpl.dumpLocked(BatteryStatsImpl.java:8843)
  at com.android.server.am.BatteryStatsService.dump(BatteryStatsService.java:947)
  - locked <0x052efb00> (a com.android.internal.os.BatteryStatsImpl)
  at android.os.Binder.dump(Binder.java:319)
  at android.os.Binder.onTransact(Binder.java:285)
  at com.android.internal.app.IBatteryStats$Stub.onTransact(IBatteryStats.java:832)
  at android.os.Binder.execTransact(Binder.java:446)

仔细看这个traces,再仔细看这个traces,原来它在做dumpsys batterystats的操作,这个操作是谁发起的?
赶紧ps查看一下所有的dumpsys,发现有两个,一个是Z状态,僵尸了,一个是S状态。

Z状态?发生了啥?赶紧看看父进程是谁,发现是dumpstate,为啥dumpstate没有帮它儿子收尸?赶紧看代码:
由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第6张图片
由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第7张图片

原来dumpstate起了dumpsys batterystats之后如果waitpid等待超时就会通过kill来发送SIGTERM优雅的结束dumpsys batterystats,但是貌似它忘记了一件事情,就是再次waitpid替儿子收尸。。。太大意了,导致儿子变成僵尸!

但是尽管如此,处于僵尸状态的进程已经把占用的资源已经释放掉了,其中就包括文件描述符,所以还是解释不了thread 84为啥被block?

接着看dumpsys batterystats是如何调用system server去做dump操作的,继续看代码:
由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第8张图片

原来dumpsys把STDOUT传递给了system server去dump batterystats,但是上面dumpsys已经释放了所有文件资源,并且传递的是STDOUT,不应该block system server的thread 84啊?
带着这个问题继续烧脑,看源码,发现dumpstate是bugreport通过set property起的,然后通过socket与dumpstate建立连接,bugreport读取这个socket中的数据并write到标准输出,而dumpstate又通过dup2把与bugreport的socket重定向到了stdout:
由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第9张图片
由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第10张图片
由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第11张图片

run_command是通过先fork然后在子进程中用execvp的方式执行的dumpsys batterystats,这样dumpsys batterystats就继承了dumpstate的stdout,对应的就是与bugreport通信的socket,所以就算dumpsys进程的stdout被关闭也只会对这个socket的引用计数减1,现在这个socket对应的引用计数还有2,分别是dumpstate进程自己以及system server,因此dumpsys的Z状态不会中断system server的thread 84,所以问题的关键还是在于thread 84在向这个socket同步写数据,但是socket的另一端没有读。

我们知道socket的另一端是bugreport,但是它为什么没有及时的把socket中的数据读走呢?赶紧用debuggerd dump出来bugreport的调用栈,发现bugreport在往标准输出里面写数据的时候被block了:

backtrace:
    #00 pc 000000000005fca4  /system/lib64/libc.so (write+4)
    #01 pc 00000000000524dc  /system/lib64/libc.so (__swrite+76)
    #02 pc 00000000000509d8  /system/lib64/libc.so (__sfvwrite+564)
    #03 pc 0000000000050e30  /system/lib64/libc.so (fwrite+188)
    #04 pc 0000000000000960  /system/bin/bugreport (main+192)
    #05 pc 0000000000013504  /system/lib64/libc.so (__libc_init+100)
    #06 pc 0000000000000a18  /system/bin/bugreport

bugreport往标准输出里面写数据时为什么会被block?

先通过ps看一下bugreport的父进程是谁,发现是monkey进程,那monkey进程为什么不把bugreport的输出读走?从上面我们知道monkey的主线程block在了dumpsys meminfo,看它的其他线程也没有调用bugreport的调用栈,那bugreport是怎么起来的?什么时候起来的?百思不得其解,通过一遍一遍的看代码以及出问题时sdcard没有空间的种种线索发现了以下可疑代码和一个推论:
由Monkey测试引发的跨多个进程的Android系统死锁问题分析_第12张图片
monkey起bugreport的方式就是调用上面的commandLineReport函数,这个函数起了bugreport之后会通过同步pipe的方式读取bugreport的标准输出,正常情况下会一直读完,mRequestBugreport为true就会把这些数据写到sdcard指定的文件里。如果在向sdcard写入这些log数据的时候存储空间满了,就会发生IO exception,那么就会跳出readline的while循环,从而继续执行其他操作,导致bugreport在write的时候block,进一步导致system server的thread 84在write的时候block,因为bugreport无法读取socket中的数据。

monkey主线程跳出readline的while循环之后继续执行,然后执行到了dumpsys meminfo的操作,上面ps出来的另外一个S状态的dumpsys的父进程就是monkey,由于system server的thread 84拿着dumpsys meminfo线程需要的锁,所以monkey主线程也被block了,monkey主线程block导致system server的watchdog调用systemNotResponding一直无法返回而无法继续kill system server。

好了,分析到这里就回答完明浩兄的两个问题了,system server不被watchdog kill的原因,以及thread 84的同步socket的读端是谁也找到了,为什么不读的原因也找到了。

五、后续延伸

1、在发生问题时,通过进入到bugreport进程的proc的fd目录下,cat 1(stdout)整个系统就全部恢复了,原理如上

2、通过dumpsys batterystats|more可以使system server被block而引发watchdog重启,原理就是上面的分析

3、有测试的同学发现它们通过java代码写的模拟shell执行的调用logcat的代码时而有效,时而logcat起来之后没有响应,调用进程一直waitfor无法返回,原因就是它们调用的logcat命令在执行时有时手机里面的crash信息比较多导致同步pipe的buffer被用完,而他们的调用进程没有将同步pipe中的数据读出,导致logcat进程在同步写pipe的时候block,而他的调用进程一直在空waitfor,时而有效是因为手机里面crash信息比较少甚至没有,执行他这行代码后logcat进程不会因为数据太多而在write的时候被block,他们的具体代码如下:

        String returnInfo = Utils.SHELL(new String[]{"sh", "-c", "logcat -d -s crashmonitor"});
        if(returnInfo.contains("CRASH") == true){
            Utils.LOG("Found crash.");
            Utils.SHELL(new String[]{"sh", "-c", "logcat -v time -d > " + Utils.WORK_PATH + "logcat_crash.txt"});
        }
    public static String SHELL(String[] command){
        Process p = null;
        int status = 0;
        String text = null;
        BufferedInputStream in = null;
        BufferedReader resultReader  = null;

        try{
            p = Runtime.getRuntime().exec(command);
            status = p.waitFor();
            if(status != 0){
                throw new RuntimeException(String.format("Run shell command: %s,  status: %s", command, status));
            }

可以看到调用“Runtime.getRuntime().exec(command);”之后直接waitfor了,没有管p的输出,正确的用法应该是:

    public static String SHELL(String[] command){
        Process p = null;
        int status = 0;
        String text = null;
        BufferedInputStream in = null;
        BufferedReader resultReader  = null;

        try{
            p = Runtime.getRuntime().exec(command);

            in = new BufferedInputStream(p.getInputStream());
            resultReader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            String line;
            while((line = resultReader.readLine()) != null){
                text += line + "\n";
            }

            status = p.waitFor();
            if(status != 0){
                throw new RuntimeException(String.format("Run shell command: %s,  status: %s", command, status));
            }

4、问题是在L版本上发生,通过搜索和查看AOSP的最新master分支发现他们也提交了一些patch

首先Google重构了bugreport的代码,https://android-review.googlesource.com/#/c/124474/

Refactor of the bugreport code.
Implemented these changes:
Make this code C++.
- Avoid hangs by adding a timeout.
- Add the necessary TEMP_FAILURE_RETRY calls.
Restructure the code a bit.

为dumpstate的run command实现更好的超时和waitpid机制,解决上面没有收尸的问题,https://android-review.googlesource.com/#/c/128514/

Add better timeout mechanism for running commands.

移除无效dumpstate gzip压缩选项,https://android-review.googlesource.com/#/c/130384/

Remove -z option.
This option doesn't work with the current selinux protections and
doesn't serve much of a purpose. You can get the same results running
this:
dumpstate | gzip > /data/local/tmp/dump.txt.gz

为dumpstate的dump_file增加超时机制避免bugreport永远hang住,https://android-review.googlesource.com/#/c/134103/

Add timeout for dump_file.
It turns out dump_file is used on a number of /proc and system files.
In one case, the read of a file stalled and caused a bugreport to
hang forever. It's still possible if there is a kernel bug that this
could stall forever, but less likely.
Also, change the return type of nanotime to uint64_t.
Testing:
Created a named fifo and verified that dump_file fails with a timeout.
Created a large /data/anr/traces.txt to verify that large files still
dump properly and that the additional NONBLOCK parameter doesn't cause
a problem.
Created a dummy /data/tombstones/tombstone_00 to verify that the
dump of these files still works.
Compared a dump using the old dumpstate to the new dumpstate to verify
nothing obviously different.

https://android-review.googlesource.com/#/c/150928/

Don't use TEMP_FAILURE_RETRY on close in frameworks/native.

https://android-review.googlesource.com/#/c/142952/

dumpstate: add O_CLOEXEC

Add O_CLOEXEC to various file descriptor calls, to avoid
leaking file descriptors to dumpstate's child processes.

你可能感兴趣的:(Android)