Workflow for analyzing an SWT caused by surfaceflinger

First, pin down exactly when the SWT happened by searching SYS_ANDROID_EVENT_LOG for "Watchdog":

01-05 04:54:40.811   785  1160 I watchdog: surfaceflinger  hang.

Then check SYS_ANDROID_LOG to confirm how long SF had been hung:

01-05 04:54:40.778   785  1160 V Watchdog: SF hang Time 42201

01-05 04:54:40.779   785  1160 E Watchdog: SWT happen

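The log above shows that SF had been hung for a total of 42201 (the value is treated as milliseconds here, i.e. about 42 seconds). Counting back from 04:54:40.778, the hang started at roughly 04:53:58.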
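The back-calculation can be checked mechanically. The following is a minimal sketch, assuming the "SF hang Time" value is in milliseconds and that both timestamps fall on the same day; the helper name `subtractMillis` is hypothetical and not part of any Android tool.

```cpp
#include <cstdio>
#include <string>

// Subtract a duration in milliseconds from a logcat-style "HH:MM:SS.mmm"
// time of day. Assumption: both times are on the same day (no midnight wrap).
std::string subtractMillis(const std::string& timeOfDay, long long durationMs) {
    int h = 0, m = 0, s = 0, ms = 0;
    if (std::sscanf(timeOfDay.c_str(), "%d:%d:%d.%d", &h, &m, &s, &ms) != 4) {
        return "";  // not in the expected format
    }
    long long total = ((h * 60LL + m) * 60 + s) * 1000 + ms - durationMs;
    if (total < 0) {
        return "";  // would cross midnight; not handled in this sketch
    }
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02lld:%02lld:%02lld.%03lld",
                  total / 3600000, (total / 60000) % 60,
                  (total / 1000) % 60, total % 1000);
    return buf;
}
```

Calling `subtractMillis("04:54:40.778", 42201)` yields `04:53:58.577`, which matches the first RTT dump and the first [SF-WD] warning later in this analysis.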

 

Next, search SF_RTT_DUMP and SF_RTT_DUMP1 for the SF stack dumps taken inside that time window:

----- pid 384 at 2018-01-05 04:53:58 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:02 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:06 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:10 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:14 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:18 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:22 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:27 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:31 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:35 -----

Cmd line: /system/bin/surfaceflinger
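Picking out the dumps that fall inside the hang window can also be scripted. This is only a sketch: the `DumpHeader` struct and both function names are hypothetical, and the window is passed in as seconds since midnight (04:53:58 is 17638, 04:54:40 is 17680).

```cpp
#include <cstdio>
#include <string>

struct DumpHeader {
    int pid = -1;
    int secondsOfDay = -1;  // time of day of the dump, in seconds
};

// Parse a header such as "----- pid 384 at 2018-01-05 04:53:58 -----".
// Returns false if the line is not a dump header.
bool parseDumpHeader(const std::string& line, DumpHeader* out) {
    int pid = 0, y = 0, mo = 0, d = 0, h = 0, mi = 0, s = 0;
    if (std::sscanf(line.c_str(), "----- pid %d at %d-%d-%d %d:%d:%d",
                    &pid, &y, &mo, &d, &h, &mi, &s) != 7) {
        return false;
    }
    out->pid = pid;
    out->secondsOfDay = (h * 60 + mi) * 60 + s;
    return true;
}

// True if the line is a dump header for `pid` taken within [startSec, endSec].
bool dumpInWindow(const std::string& line, int pid, int startSec, int endSec) {
    DumpHeader header;
    if (!parseDumpHeader(line, &header)) {
        return false;
    }
    return header.pid == pid &&
           header.secondsOfDay >= startSec && header.secondsOfDay <= endSec;
}
```

All ten headers listed above pass `dumpInWindow(line, 384, 17638, 17680)`, so every one of them is a usable snapshot of the hung process.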

Check the main thread of pid 384 in these dumps.

 

Then locate the corresponding code:

status_t BufferQueueConsumer::setTransformHint(uint32_t hint) {
    ATRACE_CALL();
    BQ_LOGV("setTransformHint: %#x", hint);
    Mutex::Autolock lock(mCore->mMutex);
    mCore->mTransformHint = hint;
    return NO_ERROR;
}

The main thread is stuck waiting for the lock mCore->mMutex.

 

Going through the remaining threads turns up thread 423:

"Binder:384_2" sysTid=423

  #00 pc 00048c90  /system/lib/libc.so (__ioctl+8)

  #01 pc 0001dd75  /system/lib/libc.so (ioctl+32)

  #02 pc 000429bd  /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+168)

  #03 pc 000433fd  /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+236)

  #04 pc 0003d841  /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+36)

  #05 pc 000518d1  /system/lib/libgui.so

  #06 pc 000430b3  /system/lib/libgui.so (android::BufferQueueProducer::connect(android::sp<android::IProducerListener> const&, int, bool, android::IGraphicBufferProducer::QueueBufferOutput*)+974)

This thread is stuck as well.

Looking at the code, BufferQueueProducer::connect begins as follows:

status_t BufferQueueProducer::connect(const sp<IProducerListener>& listener,
        int api, bool producerControlledByApp, QueueBufferOutput *output) {
    ATRACE_CALL();
    Mutex::Autolock lock(mCore->mMutex);

Thread 423 has taken mCore->mMutex here and is still holding it. This confirms that this is what blocks the SF main thread and causes the SWT.

 

 

The rest of the stack trace is not clear enough to show the exact call.

So go through SYS_ANDROID_LOG in detail.

One line stands out:

01-05 04:54:46.724   384   423 E         : IProducerListener: binder call 'needsReleaseNotify' failed

 

Searching the code shows that this message comes from the needsReleaseNotify() function in IProducerListener.cpp:

virtual bool needsReleaseNotify()

status_t err = remote()->transact(NEEDS_RELEASE_NOTIFY, data, &reply);  // blocked here the whole time
if (err != NO_ERROR) {
    ALOGE("IProducerListener: binder call \'needsReleaseNotify\' failed");
    return true;
}


This needsReleaseNotify call is the one made inside connect in BufferQueueProducer.cpp:

if (listener->needsReleaseNotify()) {
    mCore->mConnectedProducerListener = listener;
}

 

From the above, the reason for the SF hang is:

The lock mCore->mMutex that the SF main thread needs is held by thread 423, and thread 423 is blocked in a binder call while invoking needsReleaseNotify.
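The pattern behind the hang, a lock held across a blocking remote call, can be reproduced outside Android. The sketch below is not the SurfaceFlinger code: `std::timed_mutex` stands in for mCore->mMutex, a sleep stands in for the binder transaction that does not return, and the function name `mainThreadStarved` is hypothetical. The timed lock attempt exists only so that the example terminates; the real main thread waits without a timeout.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

// Returns true if the "main thread" cannot get the lock within waitMs while a
// "binder thread" holds it across a blocking call lasting remoteCallMs.
bool mainThreadStarved(int remoteCallMs, int waitMs) {
    std::timed_mutex coreMutex;  // stands in for mCore->mMutex
    std::promise<void> lockTaken;
    std::future<void> lockTakenFuture = lockTaken.get_future();

    std::thread binderThread([&] {
        // Like connect(): take the lock first ...
        std::lock_guard<std::timed_mutex> lock(coreMutex);
        lockTaken.set_value();
        // ... then make a synchronous call to a peer that does not answer.
        std::this_thread::sleep_for(std::chrono::milliseconds(remoteCallMs));
    });

    lockTakenFuture.wait();  // make sure the lock is already held
    // Like setTransformHint() on the main thread, but with a timeout.
    bool acquired = coreMutex.try_lock_for(std::chrono::milliseconds(waitMs));
    if (acquired) {
        coreMutex.unlock();
    }
    binderThread.join();
    return !acquired;
}
```

With `mainThreadStarved(300, 50)` the waiter gives up while the lock is still held, which is the situation the RTT dumps show for tid 384 and tid 423. The general remedy for this pattern is to avoid holding a shared lock across a synchronous call into another process.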

 

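Since it was not clear who registered this IProducerListener, I kept reading SYS_ANDROID_LOG

and noticed something:

After the app com.manboker.headportrait hit an NE (native exception), SF started printing: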

01-05 04:53:58.185   384   429 W         : [SF-WD] ============================================

01-05 04:53:58.185   384   429 W         : [SF-WD] SF performance down, TID=384 SpendTime=797ms Threshold=500ms AnchorTime=261951612172457 Now=261952410170919.

01-05 04:53:58.185   384   429 W         : [SF-WD] reset anchor by [handleTransaction]

01-05 04:53:58.186   384   429 W         : [SF-WD] Start to dump RTT

01-05 04:53:58.186   384   429 W         : [SF-WD] ============================================

01-05 04:53:58.568   384   429 W         : [SF-WD] ============================================

01-05 04:53:58.568   384   429 W         : [SF-WD] SF maybe hang, TID=384 SpendTime=1180ms Threshold=500ms AnchorTime=261951612172457 Now=261952792703919.

01-05 04:53:58.568   384   429 W         : [SF-WD] reset anchor by [handleTransaction]

01-05 04:53:58.568   384   429 W         : [SF-WD] RTT is still dumping

01-05 04:53:58.568   384   429 W         : [SF-WD] ============================================
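The SpendTime in these warnings can be cross-checked against the AnchorTime and Now fields. The sketch below assumes both fields are nanosecond timestamps from the same clock, which is consistent with the numbers in the log but is not stated there; `fieldAfter` and `elapsedMsFromLine` are hypothetical helper names.

```cpp
#include <cstdint>
#include <string>

// Read the integer that follows `key` in `line`; returns -1 if the key is
// absent. std::stoll throws if no digits follow the key.
int64_t fieldAfter(const std::string& line, const std::string& key) {
    std::size_t pos = line.find(key);
    if (pos == std::string::npos) {
        return -1;
    }
    return std::stoll(line.substr(pos + key.size()));
}

// Elapsed time between AnchorTime and Now in whole milliseconds.
// Assumption: both fields are nanoseconds. Returns -1 if either is missing.
int64_t elapsedMsFromLine(const std::string& line) {
    int64_t anchorNs = fieldAfter(line, "AnchorTime=");
    int64_t nowNs = fieldAfter(line, "Now=");
    if (anchorNs < 0 || nowNs < 0) {
        return -1;
    }
    return (nowNs - anchorNs) / 1000000;
}
```

For the two warnings above this gives 797 and 1180, the same values the watchdog reports as SpendTime. The anchor is identical in both lines, so the main thread made no progress between them.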

These SF hang logs show up every time after the NE occurs.

Moreover, look at the log around 04:54:46:

01-05 04:54:46.724   384   423 E         : IProducerListener: binder call 'needsReleaseNotify' failed

01-05 04:54:46.724   384   538 I BufferQueueProducer: [com.manboker.headportrait/com.manboker.headportrait.activities.CameraActivity#0](this:0xae088000,id:103885,api:1,p:7968,c:384) disconnect(P): api 1

01-05 04:54:46.726   384  1299 I SurfaceFlinger: EventThread Client Pid (7968) disconnected by (384)

01-05 04:54:46.726   384  1299 I chatty  : uid=1000(system) Binder:384_4 identical 1 line

01-05 04:54:46.726   384  1299 I SurfaceFlinger: EventThread Client Pid (7968) disconnected by (384)

01-05 04:54:46.727   384   422 E IPCThreadState: binder thread pool (4 threads) starved for 33728 ms

01-05 04:54:46.730   481   481 I Zygote  : Process 7968 exited due to signal (6)

01-05 04:54:46.730   481   481 I Zygote  : Process 7968 dumped core.

As soon as process 7968 died, SF recovered.

 

Check the kernel log:

[262000.955432] (3)[26663:kworker/3:2]binder: release 7968:7968 transaction 298427269 out, still active

[262000.955512] (3)[26663:kworker/3:2]binder: release 7968:8025 transaction 298427286 out, still active

[262000.955522] (3)[26663:kworker/3:2]binder: send failed reply for transaction 298427287 to 384:423

[262000.955579] (3)[26663:kworker/3:2]binder: undelivered transaction 298432612, process died.

[262000.955859] (3)[26663:kworker/3:2]binder: undelivered transaction 298431965, process died.

[262000.955867] (3)[26663:kworker/3:2]binder: undelivered transaction 298432005, process died.

 

This states explicitly that the binder peer is process 7968, i.e. com.manboker.headportrait.
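The pairing between the dying process and the blocked SF thread can be extracted from these kernel lines with a small parser. This is a sketch that only understands the two message shapes shown above; the struct and function names are hypothetical.

```cpp
#include <cstdio>
#include <string>

struct FailedReply {
    long long transaction = 0;
    int pid = -1;  // process that was waiting for the reply
    int tid = -1;  // thread that was waiting for the reply
};

// Parse "binder: send failed reply for transaction N to PID:TID".
bool parseFailedReply(const std::string& line, FailedReply* out) {
    std::size_t pos = line.find("binder: send failed reply for transaction");
    if (pos == std::string::npos) {
        return false;
    }
    return std::sscanf(line.c_str() + pos,
                       "binder: send failed reply for transaction %lld to %d:%d",
                       &out->transaction, &out->pid, &out->tid) == 3;
}

// Parse "binder: release PID:TID transaction N ..." and return the pid of the
// process being released, or -1 if the line has a different shape.
int parseReleasedPid(const std::string& line) {
    std::size_t pos = line.find("binder: release");
    if (pos == std::string::npos) {
        return -1;
    }
    int pid = -1, tid = -1;
    long long transaction = 0;
    if (std::sscanf(line.c_str() + pos, "binder: release %d:%d transaction %lld",
                    &pid, &tid, &transaction) != 3) {
        return -1;
    }
    return pid;
}
```

Applied to the excerpt above, the release lines give pid 7968 and the failed reply is addressed to 384:423, the same thread that the RTT dump shows blocked inside BufferQueueProducer::connect.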

 

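So the root cause is this: while SF was in a binder call to com.manboker.headportrait, the app hit an exception. Because coredump was enabled, the process stayed suspended while the coredump was being captured.

As a result, SF stayed stuck until that process finally exited.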
