首先检查SWT 发生具体时间,检查SYS_ANDROID_EVENT_LOG 搜索Watchdog:
01-05 04:54:40.811 785 1160 I watchdog: surfaceflinger hang.
检查SYS_ANDROID_LOG 文件确认sf hang 时间
01-05 04:54:40.778 785 1160 V Watchdog: **SF hang Time **42201
01-05 04:54:40.779 785 1160 E Watchdog: **SWT happen **
且从上log得知 SF hang的总时间为42201。且倒推可知SF hang的开始发生时间约为04:53:58。
然后检查SF_RTT_DUMP 和 SF_RTT_DUMP1 文件 搜索确认有效时间内的SF 堆栈信息
----- pid 384 at 2018-01-05 04:53:58 -----
Cmd line: /system/bin/surfaceflinger
----- pid 384 at 2018-01-05 04:54:02 -----
Cmd line: /system/bin/surfaceflinger
----- pid 384 at 2018-01-05 04:54:06 -----
Cmd line: /system/bin/surfaceflinger
----- pid 384 at 2018-01-05 04:54:10 -----
Cmd line: /system/bin/surfaceflinger
----- pid 384 at 2018-01-05 04:54:14 -----
Cmd line: /system/bin/surfaceflinger
----- pid 384 at 2018-01-05 04:54:18 -----
Cmd line: /system/bin/surfaceflinger
----- pid 384 at 2018-01-05 04:54:22 -----
Cmd line: /system/bin/surfaceflinger
----- pid 384 at 2018-01-05 04:54:27 -----
Cmd line: /system/bin/surfaceflinger
----- pid 384 at 2018-01-05 04:54:31 -----
Cmd line: /system/bin/surfaceflinger
----- pid 384 at 2018-01-05 04:54:35 -----
Cmd line: /system/bin/surfaceflinger
检查384主线程
找到code
status_t BufferQueueConsumer::setTransformHint(uint32_t hint) {
777 ATRACE_CALL();
778 BQ_LOGV("setTransformHint: %#x", hint);
779 Mutex::Autolock lock(mCore->mMutex);
780 mCore->mTransformHint = hint;
781 return NO_ERROR;
782 }
发现主线程一直等锁 mCore->mMutex
检查后面线程发现 423线程
"Binder:384_2" sysTid=423
#00 pc 00048c90 /system/lib/libc.so (__ioctl+8)
#01 pc 0001dd75 /system/lib/libc.so (ioctl+32)
#02 pc 000429bd /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+168)
#03 pc 000433fd /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+236)
#04 pc 0003d841 /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+36)
#05 pc 000518d1 /system/lib/libgui.so
#06 pc 000430b3 /system/lib/libgui.so (android::BufferQueueProducer::connect(android::sp const&, int, bool, android::IGraphicBufferProducer::QueueBufferOutput*)+974)
也一直卡住。
检查code 发现BufferQueueProducer connect函数中
status_t BufferQueueProducer::connect(const sp& listener,
1235 int api, bool producerControlledByApp, QueueBufferOutput *output) {
1236 ATRACE_CALL();
1237 Mutex::Autolock lock(mCore->mMutex);
mCore->mMutex 锁拿住了。所以确认此处卡住了SF 主线程导致的SWT。
由于后面的堆栈打印不清楚。看不到具体的调用。
所以详细分析 SYS_ANDROID_LOG
看到一行 log
01-05 04:54:46.724 384 423 E : IProducerListener: binder call 'needsReleaseNotify' failed
在code 中搜索发现是
调用到 IProducerListener.cpp
virtual bool needsReleaseNotify() 函数里面:
status_t err = remote()->transact(NEEDS_RELEASE_NOTIFY, data, &reply);---这里一直卡住了
47 if (err != NO_ERROR) {
48 ALOGE("IProducerListener: binder call \'needsReleaseNotify\' failed");
49 return true;
50 }
检查确认needsReleaseNotify 此函数就是
BufferQueueProducer.cpp connect里面的
1307 if (listener->needsReleaseNotify()) {
1308 mCore->mConnectedProducerListener = listener;
1309 }
所以从上面可以知道 SF hang的原因是
SF主线程需要拿的锁mCore->mMutex 被 线程423 拿住了。线程423 在调用到needsReleaseNotify 的时候 在binder 通讯是卡住。
由于不清楚IProducerListener 谁注册下来的。所以继续看了下 SYS_ANDROID_LOG
发现一个问题;
com.manboker.headportrait 应用发生NE 后 SF 开始打印
01-05 04:53:58.185 384 429 W : [SF-WD] ============================================
01-05 04:53:58.185 384 429 W : [SF-WD] SF performance down, TID=384 SpendTime=797ms Threshold=500ms AnchorTime=261951612172457 Now=261952410170919.
01-05 04:53:58.185 384 429 W : [SF-WD] reset anchor by [handleTransaction]
01-05 04:53:58.186 384 429 W : [SF-WD] Start to dump RTT
01-05 04:53:58.186 384 429 W : [SF-WD] ============================================
01-05 04:53:58.568 384 429 W : [SF-WD] ============================================
01-05 04:53:58.568 384 429 W : [SF-WD] SF maybe hang, TID=384 SpendTime=1180ms Threshold=500ms AnchorTime=261951612172457 Now=261952792703919.
01-05 04:53:58.568 384 429 W : [SF-WD] reset anchor by [handleTransaction]
01-05 04:53:58.568 384 429 W : [SF-WD] RTT is still dumping
01-05 04:53:58.568 384 429 W : [SF-WD] ============================================
在NE发生之后必现 SF hang的log
而且在
01-05 04:54:46.724 384 423 E : IProducerListener: binder call 'needsReleaseNotify' failed
01-05 04:54:46.724 384 538 I BufferQueueProducer: [com.manboker.headportrait/com.manboker.headportrait.activities.CameraActivity#0](this:0xae088000,id:103885,api:1,p:7968,c:384) disconnect(P): api 1
01-05 04:54:46.726 384 1299 I SurfaceFlinger: EventThread Client Pid (7968) disconnected by (384)
01-05 04:54:46.726 384 1299 I chatty : uid=1000(system) Binder:384_4 identical 1 line
01-05 04:54:46.726 384 1299 I SurfaceFlinger: EventThread Client Pid (7968) disconnected by (384)
01-05 04:54:46.727 384 422 E IPCThreadState: binder thread pool (4 threads) starved for 33728 ms
01-05 04:54:46.730 481 481 I Zygote : Process 7968 exited due to signal (6)
01-05 04:54:46.730 481 481 I Zygote : Process 7968 dumped core.
7968 died SF 就好了。
检查kernal log
[262000.955432] (3)[26663:kworker/3:2]binder: release 7968:7968 transaction 298427269 out, still active
[262000.955512] (3)[26663:kworker/3:2]binder: release 7968:8025 transaction 298427286 out, still active
[262000.955522] (3)[26663:kworker/3:2]binder: send failed reply for transaction 298427287 to 384:423
[262000.955579] (3)[26663:kworker/3:2]binder: undelivered transaction 298432612, process died.
[262000.955859] (3)[26663:kworker/3:2]binder: undelivered transaction 298431965, process died.
[262000.955867] (3)[26663:kworker/3:2]binder: undelivered transaction 298432005, process died.
明确打印出 binder 对端是7968 进程,即com.manboker.headportrait
所以此问题的根本原因是 SF 和 com.manboker.headportrait 通讯的时候。com.manboker.headportrait发生异常开启了coredump功能后 抓coredump信息 挂起了。
从而导致 SF一直卡死。