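App and System Stability, Part 5: How Watchdog Works and How to Analyze Watchdog Problems

This series already has four installments. Among crash-and-reboot problems, Watchdog issues are the most common, so this post continues with how to analyze Watchdog problems and how the mechanism works.
App and System Stability, Part 1: A General Approach to ANR Analysis
App and System Stability, Part 2: ANR Detection and Information Collection
App and System Stability, Part 3: Notes on FD Leaks
App and System Stability, Part 4: Analyzing a Null Pointer Problem Caused by a Single Thread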

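I. Watchdog Basics

1. What is Watchdog?

A watchdog bites if it is not "fed" on time; here the deadline is one minute. Android runs well over a hundred system services, and the Watchdog mechanism exists to keep a hang in one of SystemServer's core services from freezing the screen. When it detects such a fault, Watchdog calls Process.killProcess(Process.myPid()) to kill the SystemServer process. system_server is the first process that zygote forks, and the two together hold up the Java world: if either one dies, the Java world goes down with it. So when the child process system_server dies, zygote kills itself, and every process that zygote spawned is restarted. The effect is a soft reboot of the phone, so the user is not left with a frozen, unusable device.

That is how the system responds to a Watchdog event. What we engineers care about is where exactly the Watchdog fired. As with ANR, traces are dumped while a Watchdog event unfolds, and they are what we use to locate and fix the problem. So we need to understand the mechanism that decides a timeout has occurred.

The Watchdog code lives in /frameworks/base/services/core/java/com/android/server/Watchdog.java

Two kinds of log are common: "Blocked in handler" and "Blocked in monitor". The difference is explained later in this article.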

11-15 06:56:39.696 24203 24902 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main), Blocked in handler on ui thread (android.ui)
11-15 06:56:39.696 24203 24902 W Watchdog: main thread stack trace:
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.nativePollOnce(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.MessageQueue.next(MessageQueue.java:323)
11-15 06:56:39.696 24203 24902 W Watchdog:     at android.os.Looper.loop(Looper.java:142)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:377)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:239)
11-15 06:56:39.696 24203 24902 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:901)
11-15 06:56:39.696 24203 24902 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:791)
11-15 06:56:39.696 24203 24902 W Watchdog: ui thread stack trace:
......
10-26 00:07:00.884 1000 17132 17312 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
10-26 00:07:00.884 1000 17132 17312 W Watchdog: foreground thread stack trace:
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:381)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:353)
10-26 00:07:00.885 1000 17132 17312 W Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.Looper.loop(Looper.java:193)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
10-26 00:07:00.886 1000 17132 17312 W Watchdog: *** GOODBYE!
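When triaging many logs, it can help to split the "WATCHDOG KILLING SYSTEM PROCESS" line into its individual blocked entries. The following is a minimal sketch in plain Java, assuming the line format seen in the two logs above; WatchdogSubjectParser and its Entry class are hypothetical helper names, not part of AOSP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not AOSP code): splits a Watchdog kill line into blocked entries. */
final class WatchdogSubjectParser {
    static final String MARKER = "*** WATCHDOG KILLING SYSTEM PROCESS: ";

    // Matches "Blocked in handler on <checker> (<thread>)" and
    // "Blocked in monitor <class> on <checker> (<thread>)".
    private static final Pattern ENTRY = Pattern.compile(
            "Blocked in (handler|monitor)(?: (\\S+))? on (.+?) \\(([^)]+)\\)");

    /** One blocked checker; monitorClass is null for "handler" entries. */
    static final class Entry {
        final String kind;
        final String monitorClass;
        final String checkerName;
        final String threadName;

        Entry(String kind, String monitorClass, String checkerName, String threadName) {
            this.kind = kind;
            this.monitorClass = monitorClass;
            this.checkerName = checkerName;
            this.threadName = threadName;
        }
    }

    /** Returns every blocked entry found after the marker, or an empty list. */
    static List<Entry> parse(String logLine) {
        List<Entry> out = new ArrayList<>();
        int at = logLine.indexOf(MARKER);
        if (at < 0) {
            return out;
        }
        Matcher m = ENTRY.matcher(logLine.substring(at + MARKER.length()));
        while (m.find()) {
            out.add(new Entry(m.group(1), m.group(2), m.group(3), m.group(4)));
        }
        return out;
    }
}
```

A "handler" entry points at a thread whose message queue is stuck, while a "monitor" entry names the class whose monitor() call did not return.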
2. Initialization
[Figure: Watchdog initialization]

Watchdog itself extends Thread, and it is initialized while SystemServer starts up.

public final class SystemServer {
  ... ...
    /**
     * Starts a miscellaneous grab bag of stuff that has yet to be refactored
     * and organized.
     */
    private void startOtherServices() {
    ......
        try {
          ......
            traceBeginAndSlog("InitWatchdog");
            final Watchdog watchdog = Watchdog.getInstance(); // obtain the Watchdog singleton
            watchdog.init(context, mActivityManagerService); // register a receiver for the system reboot broadcast
            Trace.traceEnd(Trace.TRACE_TAG_SYSTEM_SERVER);
          ......
        }
         ......
        mActivityManagerService.systemReady(new Runnable() {
            @Override
            public void run() {
              ......
                Watchdog.getInstance().start();
              ......
             }
        });
    }

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

To implement timeout detection, the Watchdog constructor builds a number of HandlerChecker objects, which fall into two categories:

  • Monitor Checker: checks for possible deadlocks on Monitor objects. Core system services such as AMS, PKMS and WMS are all Monitor objects.
  • Looper Checker: checks whether a thread's message queue has been stuck on one message for a long time. The foreground thread queue that Watchdog itself uses and the shared ui, io and display queues are all checked. The message queues of some other important threads, such as those of AMS and PKMS, are also added to the Looper Checkers when the corresponding objects are initialized.

    /* This handler will be used to post message back onto the main thread */
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();

    private Watchdog() {
        // This calls the Thread superclass constructor, which sets the thread name
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
        // New in Android O: monitoring for FD leaks
        mOpenFdMonitor = OpenFdMonitor.create();
        ......
    }

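DEFAULT_TIMEOUT is normally one minute. A checker can also be registered with a longer timeout, which is 10 minutes for the thread that waits on installd.
The two kinds of HandlerChecker focus on different things:

Monitor Checker warns us not to hold the object lock of a core system service for long, because that blocks many other calls.
Looper Checker warns us not to hog a message queue for long, because other messages then go unprocessed.

These two kinds of checker are what Watchdog does its work with.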
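To make the Monitor Checker idea concrete, here is a minimal sketch in plain Java with no Android dependencies. It assumes a service that guards its state with its own intrinsic lock; MonitorProbeSketch, FakeService and probe are hypothetical names, and the helper thread stands in for the foreground thread that Watchdog actually uses.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch (not AOSP code) of a lock probe in the style of Watchdog.Monitor. */
final class MonitorProbeSketch {
    /** Same shape as Watchdog.Monitor: a single monitor() method. */
    interface Monitor {
        void monitor();
    }

    /** Stand-in for a system service that synchronizes on itself. */
    static final class FakeService implements Monitor {
        @Override
        public void monitor() {
            synchronized (this) {
                // Nothing to do: getting here proves the lock could be acquired.
            }
        }
    }

    /** Calls monitor() on a helper thread; returns true if it came back within timeoutMs. */
    static boolean probe(Monitor monitor, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread helper = new Thread(() -> {
            monitor.monitor();
            done.countDown();
        }, "monitor-probe");
        helper.setDaemon(true); // a stuck probe must not keep the JVM alive
        helper.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

While some other thread holds the service lock, probe() returns false after the timeout, which corresponds to the "Blocked in monitor" case.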

3. Basic principle
3.1 How Checker objects are added

Take AMS as an example: it registers both a Monitor Checker and a Looper Checker, and it implements the Watchdog.Monitor interface by overriding monitor().

public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
  ......
    public ActivityManagerService(Context systemContext) {
     ......
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
     ......
    }
  ......
    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }
  ......
}

When AMS is constructed, it calls Watchdog's addMonitor and addThread to register itself and mHandler, its MainHandler object.

    public void addThread(Handler thread) {
        addThread(thread, DEFAULT_TIMEOUT);
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Threads can't be added once the Watchdog is running");
            }
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }

    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            mMonitorChecker.addMonitor(monitor);
        }
    }

mMonitorChecker is a HandlerChecker, so addMonitor ends up in HandlerChecker.addMonitor, whereas mHandlerCheckers is an ArrayList, so a new checker is simply added to it.

    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName;
        private final long mWaitMax;
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

        public void addMonitor(Monitor monitor) {
            mMonitors.add(monitor);
        }
        ......
    }
3.2 Core principle

Once the checkers are registered, how are they used? Since Watchdog extends Thread, go straight to its run method.

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            // Tracks whether a debugger is, or recently was, attached
            int debuggerWasConnected = 0;
            synchronized (this) {
                // CHECK_INTERVAL is half of DEFAULT_TIMEOUT, normally 30 s
                long timeout = CHECK_INTERVAL;
                // 1. Schedule a check on every HandlerChecker
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // 2. Wait out CHECK_INTERVAL, measured with uptimeMillis() so that
                // time spent asleep does not count
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                // 3. Get the state; the three non-timeout states are handled here
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // Halfway to the timeout: dump stack traces for the first time
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }

                // Getting here means at least one HandlerChecker has timed out
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            // Record in the event log that a watchdog fired
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            // Dump stack traces for the processes in pids and in getInterestingNativePids()
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, getInterestingNativePids());

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        // Write the error to the DropBox
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            ......

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                // Print the stack of every blocked checker's thread, then kill system_server
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }

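Summary of the principle:

  • 1. Every service that needs monitoring calls Watchdog's addMonitor, which adds a monitor to the Monitor Checker's mMonitors list, or addThread, which adds a Looper Checker to the mHandlerCheckers list.
  • 2. Once the Watchdog thread is started, its run method executes and loops forever:
    • Step one calls HandlerChecker#scheduleCheckLocked on every entry in mHandlerCheckers.
    • Step two checks for timeouts periodically. The interval between checks is set by the CHECK_INTERVAL constant, 30 seconds. Each check calls evaluateCheckerCompletionLocked() to evaluate the completion state of the HandlerCheckers:
      COMPLETED: the check has finished.
      WAITING and WAITED_HALF: still waiting but not yet timed out; traces are dumped once on reaching WAITED_HALF.
      OVERDUE: timed out. The default timeout is one minute.
  • 3. If a HandlerChecker is still unfinished when the timeout expires (OVERDUE), getBlockedCheckersLocked() collects the blocked HandlerCheckers, a description is generated, and logs are saved, including runtime stack traces.
  • 4. Finally the SystemServer process is killed.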
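The four states in step two can be sketched as a pure function of elapsed time. This is a standalone model in plain Java, assuming the thresholds described above (half the timeout, then the full timeout); CompletionStateSketch and its Checker class are hypothetical names, and the real logic lives inside Watchdog.java.

```java
import java.util.List;

/** Hypothetical model (not AOSP code) of how a checker's wait state follows from elapsed time. */
final class CompletionStateSketch {
    // Ordered by severity so the overall state is simply the maximum.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Snapshot of one checker: whether its check finished, when it started, and its timeout. */
    static final class Checker {
        final boolean completed;
        final long startTimeMs;
        final long waitMaxMs;

        Checker(boolean completed, long startTimeMs, long waitMaxMs) {
            this.completed = completed;
            this.startTimeMs = startTimeMs;
            this.waitMaxMs = waitMaxMs;
        }
    }

    /** State of a single checker at time nowMs. */
    static int stateOf(Checker c, long nowMs) {
        if (c.completed) {
            return COMPLETED;
        }
        long latency = nowMs - c.startTimeMs;
        if (latency < c.waitMaxMs / 2) {
            return WAITING;
        } else if (latency < c.waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Overall state: the worst state among all checkers. */
    static int evaluate(List<Checker> checkers, long nowMs) {
        int state = COMPLETED;
        for (Checker c : checkers) {
            state = Math.max(state, stateOf(c, nowMs));
        }
        return state;
    }
}
```

With a 60-second timeout and 30-second checks, a checker that never completes is seen as WAITED_HALF on the first check after scheduling and as OVERDUE on the second, which is why traces are dumped at roughly 30 seconds and again at one minute.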


    [Figure: How Watchdog works]

That is the rough outline. A few details still need a closer look.

3.2.1 What does HandlerChecker#scheduleCheckLocked do?
        public void scheduleCheckLocked() {
            // No monitors AND the message queue is polling (idle): nothing is blocked,
            // so set mCompleted = true and return
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                // If the target looper has recently been polling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread.  Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }
            ......
            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            // Post this checker to the very front of mHandler's message queue
            mHandler.postAtFrontOfQueue(this);
        }

If the posted message gets to run, the run method below executes and tries to acquire each lock by calling monitor().

    public final class HandlerChecker implements Runnable {
       .......
        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }

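For a Looper Checker, scheduleCheckLocked looks at whether the thread's message queue is idle. If the monitored queue stays busy and never gets to the posted check, the thread has probably been blocked for a long time.

If the message posted by scheduleCheckLocked does get to run, then for the Monitor Checker it calls the monitor() method of each implementing class, such as the AMS.monitor() shown earlier. The implementation is usually trivial: it just acquires the object's own lock. If that lock is already held, monitor() stays blocked until the timeout expires.

If the message posted by scheduleCheckLocked never gets to run, the message ahead of it in the queue is still executing, and the check times out as well. It is a clever design: postAtFrontOfQueue kills two birds with one stone, detecting both a lock that is held too long and a Message that runs too long.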
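The front-of-queue trick can be tried out without Android. The following plain-Java sketch assumes a single worker thread draining a deque as a stand-in for a Looper; FrontOfQueueProbeSketch, MiniLooper and checkResponsive are hypothetical names, and only postAtFrontOfQueue mirrors the name of the real Handler method.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch (not AOSP code) of probing a message loop from the front of its queue. */
final class FrontOfQueueProbeSketch {
    /** Minimal stand-in for a Looper thread: runs queued tasks one at a time. */
    static final class MiniLooper {
        private final LinkedBlockingDeque<Runnable> queue = new LinkedBlockingDeque<>();

        MiniLooper(String name) {
            Thread thread = new Thread(() -> {
                try {
                    while (true) {
                        queue.takeFirst().run();
                    }
                } catch (InterruptedException e) {
                    // Interrupted: stop looping.
                }
            }, name);
            thread.setDaemon(true);
            thread.start();
        }

        /** Normal post: the task waits behind everything already queued. */
        void post(Runnable task) {
            queue.addLast(task);
        }

        /** Probe post: the task runs as soon as the current task finishes. */
        void postAtFrontOfQueue(Runnable task) {
            queue.addFirst(task);
        }
    }

    /** Posts a probe ahead of all pending work; returns true if it ran within timeoutMs. */
    static boolean checkResponsive(MiniLooper looper, long timeoutMs) {
        CountDownLatch ran = new CountDownLatch(1);
        looper.postAtFrontOfQueue(ran::countDown);
        try {
            return ran.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Because the probe skips the backlog, a timeout here means the task currently running is the slow one, not that the queue is merely long. That is the property a "Blocked in handler" report relies on.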

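II. Case Study

When analyzing a Watchdog problem, first establish whether the traces are valid. As shown above, Watchdog dumps traces at 30 seconds and again at one minute. Suppose we see the following log.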

09-24 11:25:43.442 1000 1540 2033 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on ActivityManager (ActivityManager)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: ActivityManager stack trace:
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.MessageQueue.nativePollOnce(Native Method)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.MessageQueue.next(MessageQueue.java:325)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.Looper.loop(Looper.java:148)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46)
09-24 11:25:43.442 1000 1540 2033 W Watchdog: *** GOODBYE!

Then we look at the ActivityManager thread in the trace.

"ActivityManager" prio=5 tid=12 Blocked
group="main" sCount=1 dsCount=0 flags=1 obj=0x13180c38 self=0x73bb923600
sysTid=1579 nice=-2 cgrp=default sched=0/0 handle=0x73adbcf4f0
state=S schedstat=( 3039883125048 14149853235996 6778200 ) utm=112965 stm=191023 core=6 HZ=100
stack=0x73adacd000-0x73adacf000 stackSize=1037KB
held mutexes=
at com.android.server.am.ActiveServices.serviceTimeout(ActiveServices.java:3486)
waiting to lock <0x0748826a> (a com.android.server.am.ActivityManagerService) held by thread 10
at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2032)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:173)
at android.os.HandlerThread.run(HandlerThread.java:65)
at com.android.server.ServiceThread.run(ServiceThread.java:46)

ActivityManager is blocked by thread 10, so next look at the trace of thread 10.

"Binder:1540_1C" prio=5 tid=10 Native

group="main" sCount=1 dsCount=0 flags=1 obj=0x1318deb8 self=0x73c0817600
sysTid=8946 nice=-4 cgrp=default sched=0/0 handle=0x739db674f0
state=S schedstat=( 2025031009459 6852098325718 5020435 ) utm=136019 stm=66484 core=1 HZ=100
stack=0x739da6d000-0x739da6f000 stackSize=1005KB
held mutexes=
kernel: __switch_to+0x9c/0xd0
kernel: futex_wait_queue_me+0xc4/0x13c
kernel: futex_wait+0xe4/0x204
kernel: do_futex+0x170/0x500
kernel: SyS_futex+0x90/0x1b0
kernel: __sys_trace+0x4c/0x4c
native: #00 pc 000000000001db2c /system/lib64/libc.so (syscall+28)
native: #01 pc 00000000000e74c8 /system/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+152)
native: #02 pc 00000000005227a8 /system/lib64/libart.so (art::GoToRunnable(art::Thread*)+440)
native: #03 pc 00000000005225a8 /system/lib64/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+28)
native: #04 pc 0000000000cb8fc0 /system/framework/arm64/boot-framework.oat (Java_android_os_Process_setThreadPriority__II+176)
at android.os.Process.setThreadPriority(Native method)
at com.android.server.ThreadPriorityBooster.boost(ThreadPriorityBooster.java:49)
at com.android.server.wm.WindowManagerThreadPriorityBooster.boost(WindowManagerThreadPriorityBooster.java:58)
at com.android.server.wm.WindowManagerService.boostPriorityForLockedSection(WindowManagerService.java:930)
at com.android.server.wm.WindowManagerService.containsDismissKeyguardWindow(WindowManagerService.java:3116)
locked <0x0b54880e> (a com.android.server.wm.WindowHashMap)
at com.android.server.am.ActivityRecord.hasDismissKeyguardWindows(ActivityRecord.java:1364)
at com.android.server.am.ActivityStack.checkKeyguardVisibility(ActivityStack.java:2070)
at com.android.server.am.ActivityStack.ensureActivitiesVisibleLocked(ActivityStack.java:1924)
at com.android.server.am.ActivityStackSupervisor.ensureActivitiesVisibleLocked(ActivityStackSupervisor.java:3626)
at com.android.server.am.ActivityStackSupervisor.attachApplicationLocked(ActivityStackSupervisor.java:1043)
at com.android.server.am.ActivityManagerService.attachApplicationLocked(ActivityManagerService.java:7471)
at com.android.server.am.ActivityManagerService.attachApplication(ActivityManagerService.java:7538)
locked <0x0748826a> (a com.android.server.am.ActivityManagerService)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:292)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3026)
at android.os.Binder.execTransact(Binder.java:704)
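Following "held by thread N" from one thread to the next is mechanical, so it can be scripted. Here is a minimal sketch in plain Java, assuming the stack text of each thread has already been split out and keyed by tid; LockHolderFinder and its methods are hypothetical names, and the regular expression only covers the "waiting to lock" form shown in the trace above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not AOSP code): follows "held by thread N" records through a trace. */
final class LockHolderFinder {
    private static final Pattern WAITING = Pattern.compile(
            "waiting to lock <(0x[0-9a-fA-F]+)> \\(a ([^)]+)\\) held by thread (\\d+)");

    /** Returns the tid holding the lock this stack waits for, or -1 if it is not waiting. */
    static int holderTid(String stackText) {
        Matcher m = WAITING.matcher(stackText);
        return m.find() ? Integer.parseInt(m.group(3)) : -1;
    }

    /**
     * Walks from startTid to the thread at the end of the wait chain.
     * Stops when a thread is not waiting, is missing from the map, or repeats (a deadlock cycle).
     */
    static List<Integer> chain(Map<Integer, String> stackByTid, int startTid) {
        List<Integer> path = new ArrayList<>();
        int tid = startTid;
        while (tid != -1 && !path.contains(tid)) {
            path.add(tid);
            String stack = stackByTid.get(tid);
            tid = (stack == null) ? -1 : holderTid(stack);
        }
        return path;
    }
}
```

For the trace above, starting from tid 12 yields the chain 12 then 10; the last element is the thread worth reading in detail.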

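Did setThreadPriority time out? We lack the one-minute trace, so we cannot conclude that the thread is stuck here. When traces are dumped, each thread in the Suspended state has its suspend_count_ incremented by 1 and is added to the suspended_count_modified_threads list. The threads in that list are then dumped together, and suspend_count_ is decremented by 1 for each thread once its dump is done. A Suspended thread returning from JNI to Java code (the Runnable state) checks suspend_count_ in GoToRunnable and waits there until it drops to 0. So this trace only shows that tid=10 was executing the native method setThreadPriority when the traces were dumped. To tell whether it was really stuck there, the two traces have to be compared.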
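Comparing the two dumps for one thread can be automated. The sketch below is plain Java and makes some assumptions: each thread section starts with a quoted name followed by "tid=N" as in the traces above, and only the "at ..." Java frames are compared. TraceDiffSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not AOSP code): checks whether a thread's Java stack changed between dumps. */
final class TraceDiffSketch {
    /** Collects the "at ..." frames of the thread whose header line contains tid=<tid>. */
    static List<String> javaFrames(String dump, int tid) {
        List<String> frames = new ArrayList<>();
        boolean inThread = false;
        for (String raw : dump.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // Thread header, e.g. "Binder:1540_1C" prio=5 tid=10 Native
                inThread = (" " + line + " ").contains(" tid=" + tid + " ");
            } else if (inThread && line.startsWith("at ")) {
                frames.add(line);
            }
        }
        return frames;
    }

    /** True when the thread has the same non-empty Java stack in both dumps. */
    static boolean sameStack(String firstDump, String secondDump, int tid) {
        List<String> first = javaFrames(firstDump, tid);
        return !first.isEmpty() && first.equals(javaFrames(secondDump, tid));
    }
}
```

An identical stack in the 30-second and one-minute dumps suggests the thread made no progress in between; a different stack means the frame seen in a single dump was only a snapshot.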

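2.1 Case 1

On some phones, a Watchdog event during Monkey testing does not lead to a reboot, and the symptom may be a frozen screen. traces_SystemServer_WDT05_1月_23_50_59.974.txt shows that every thread is blocked by thread 73, and the two traces are identical.


[Figure: WDT]

Let's see what thread 73 is doing.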

"Binder:1300_3" prio=5 tid=73 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x14f89110 self=0x7ee794d600
  | sysTid=1774 nice=-10 cgrp=default sched=0/0 handle=0x7ecbc474f0
  | state=S schedstat=( 59882636556 104471794509 273786 ) utm=3455 stm=2533 core=6 HZ=100
  | stack=0x7ecbb4d000-0x7ecbb4f000 stackSize=1005KB
  | held mutexes=
  kernel: __switch_to+0x94/0xa8
  kernel: binder_thread_read+0x460/0x10a0
  kernel: binder_ioctl_write_read+0x21c/0x360
  kernel: binder_ioctl+0x50c/0x798
  kernel: do_vfs_ioctl+0xb8/0x800
  kernel: SyS_ioctl+0x84/0x98
  kernel: el0_svc_naked+0x24/0x28
  native: #00 pc 00000000000690a4  /system/lib64/libc.so (__ioctl+4)
  native: #01 pc 0000000000024638  /system/lib64/libc.so (ioctl+132)
  native: #02 pc 0000000000061a10  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14talkWithDriverEb+256)
  native: #03 pc 00000000000627a8  /system/lib64/libbinder.so (_ZN7android14IPCThreadState15waitForResponseEPNS_6ParcelEPi+340)
  native: #04 pc 00000000000624c8  /system/lib64/libbinder.so (_ZN7android14IPCThreadState8transactEijRKNS_6ParcelEPS1_j+216)
  native: #05 pc 0000000000056d98  /system/lib64/libbinder.so (_ZN7android8BpBinder8transactEjRKNS_6ParcelEPS1_j+72)
  native: #06 pc 000000000008a86c  /system/lib64/libgui.so (???)
  native: #07 pc 000000000009ec88  /system/lib64/libgui.so (_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj+260)
  native: #08 pc 00000000000fb058  /system/lib64/libandroid_runtime.so (???)
  native: #09 pc 0000000001326998  /system/framework/arm64/boot-framework.oat (Java_android_view_SurfaceControl_nativeScreenshot__Landroid_os_IBinder_2Landroid_graphics_Rect_2IIIIZZI+264)
  at android.view.SurfaceControl.nativeScreenshot(Native method)
  at android.view.SurfaceControl.screenshot(SurfaceControl.java:877)
  at com.android.server.wm.DisplayContent.-com_android_server_wm_DisplayContent-mthref-0(DisplayContent.java:2863)
  at com.android.server.wm.-$Lambda$OzPvdnGprtQoLZLCvw2GU8IaGyI.$m$0(unavailable:-1)
  at com.android.server.wm.-$Lambda$OzPvdnGprtQoLZLCvw2GU8IaGyI.screenshot(unavailable:-1)
  at com.android.server.wm.DisplayContent.screenshotApplications(DisplayContent.java:3125)
  - locked <0x036ec628> (a com.android.server.wm.WindowHashMap)
  at com.android.server.wm.DisplayContent.screenshotApplications(DisplayContent.java:2862)
  at com.android.server.wm.AppWindowContainerController.screenshotApplications(AppWindowContainerController.java:749)
  at com.android.server.am.ActivityRecord.screenshotActivityLocked(ActivityRecord.java:1650)
  at com.android.server.am.ActivityRecord.setVisible(ActivityRecord.java:1675)
  at com.android.server.am.ActivityStack.makeInvisible(ActivityStack.java:2078)
  at com.android.server.am.ActivityStack.ensureActivitiesVisibleLocked(ActivityStack.java:1896)
  at com.android.server.am.ActivityStackSupervisor.ensureActivitiesVisibleLocked(ActivityStackSupervisor.java:3575)
  at com.android.server.am.ActivityManagerService.ensureConfigAndVisibilityAfterUpdate(ActivityManagerService.java:20965)
  at com.android.server.am.ActivityManagerService.updateDisplayOverrideConfigurationLocked(ActivityManagerService.java:20897)
  at com.android.server.am.ActivityManagerService.updateDisplayOverrideConfigurationLocked(ActivityManagerService.java:20867)
  at com.android.server.am.ActivityStack.resumeTopActivityInnerLocked(ActivityStack.java:2608)
  at com.android.server.am.ActivityStack.resumeTopActivityUncheckedLocked(ActivityStack.java:2246)
  at com.android.server.am.ActivityStackSupervisor.resumeFocusedStackTopActivityLocked(ActivityStackSupervisor.java:2148)
  at com.android.server.am.ActivityStack.completePauseLocked(ActivityStack.java:1480)
  at com.android.server.am.ActivityStack.activityPausedLocked(ActivityStack.java:1406)
  at com.android.server.am.ActivityManagerService.activityPaused(ActivityManagerService.java:7542)
  - locked <0x08abeada> (a com.android.server.am.ActivityManagerService)
  at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:317)
  at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3018)
  at android.os.Binder.execTransact(Binder.java:677)

It ends up stopped at the following two frames:

 native: #06 pc 000000000008a86c  /system/lib64/libgui.so (???)
  native: #07 pc 000000000009ec88  /system/lib64/libgui.so (_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj+260)

Running addr2line -Cfe ./system/lib64/libgui.so 000000000009ec88 gives:

_ZN7android16ScreenshotClient6updateERKNS_2spINS_7IBinderEEENS_4RectEjjiibj
frameworks/native/libs/gui/SurfaceComposerClient.cpp:1018 (discriminator 1)
1003 status_t ScreenshotClient::captureToBuffer(const sp<IBinder>& display,
1004        Rect sourceCrop, uint32_t reqWidth, uint32_t reqHeight,
1005        int32_t minLayerZ, int32_t maxLayerZ, bool useIdentityTransform,
1006        uint32_t rotation,
1007        sp<GraphicBuffer>* outBuffer) {
1008    sp<ISurfaceComposer> s(ComposerService::getComposerService());
1009    if (s == NULL) return NO_INIT;
1010
1011    sp<IGraphicBufferConsumer> gbpConsumer;
1012    sp<IGraphicBufferProducer> producer;
1013    BufferQueue::createBufferQueue(&producer, &gbpConsumer);
1014    sp<BufferItemConsumer> consumer(new BufferItemConsumer(gbpConsumer,
1015           GRALLOC_USAGE_HW_TEXTURE | GRALLOC_USAGE_SW_READ_NEVER | GRALLOC_USAGE_SW_WRITE_NEVER,
1016           1, true));
1017
1018    status_t ret = s->captureScreen(display, producer, sourceCrop, reqWidth, reqHeight,
1019            minLayerZ, maxLayerZ, useIdentityTransform,
1020            static_cast<ISurfaceComposer::Rotation>(rotation));
1021    if (ret != NO_ERROR) {
1022        return ret;
1023    }
1024    BufferItem b;
1025    consumer->acquireBuffer(&b, 0, true);
1026    *outBuffer = b.mGraphicBuffer;
1027    return ret;
1028 }

Line 1018 calls captureScreen, which takes the screenshot, so the Watchdog fired during a screenshot. Following captureScreen, the corresponding surfaceflinger trace is:

"Binder:820_3" sysTid=1331
  #00 pc 0000000000068fb8  /system/lib64/libc.so (__epoll_pwait+8)
  #01 pc 000000000001fc68  /system/lib64/libc.so (epoll_pwait+48)
  #02 pc 0000000000015c84  /system/lib64/libutils.so (_ZN7android6Looper9pollInnerEi+144)
  #03 pc 0000000000015b6c  /system/lib64/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+108)
  #04 pc 00000000000b921c  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger13captureScreenERKNS_2spINS_7IBinderEEERKNS1_INS_22IGraphicBufferProducerEEENS_4RectEjjiibNS_16ISurfaceComposer8RotationE+672)
  #05 pc 0000000000088660  /system/lib64/libgui.so (_ZN7android17BnSurfaceComposer10onTransactEjRKNS_6ParcelEPS1_j+1788)
  #06 pc 00000000000b8828  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger10onTransactEjRKNS_6ParcelEPS1_j+144)
  #07 pc 00000000000559ac  /system/lib64/libbinder.so (_ZN7android7BBinder8transactEjRKNS_6ParcelEPS1_j+136)
  #08 pc 0000000000061ecc  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14executeCommandEi+536)
  #09 pc 0000000000061c04  /system/lib64/libbinder.so (_ZN7android14IPCThreadState20getAndExecuteCommandEv+156)
  #10 pc 0000000000062250  /system/lib64/libbinder.so (_ZN7android14IPCThreadState14joinThreadPoolEb+60)
  #11 pc 0000000000082bcc  /system/lib64/libbinder.so (_ZN7android10PoolThread10threadLoopEv+24)
  #12 pc 0000000000011674  /system/lib64/libutils.so (_ZN7android6Thread11_threadLoopEPv+280)
  #13 pc 0000000000066970  /system/lib64/libc.so (_ZL15__pthread_startPv+36)
  #14 pc 000000000001f474  /system/lib64/libc.so (__start_thread+68)

The surfaceflinger main thread trace is:

----- pid 820 at 2018-01-05 23:49:16 -----
Cmd line: /system/bin/surfaceflinger
ABI: 'arm64'

"surfaceflinger" sysTid=820
  #00 pc 00000000000690a4  /system/lib64/libc.so (__ioctl+4)
  #01 pc 0000000000024638  /system/lib64/libc.so (ioctl+132)
  #02 pc 0000000000015210  /system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState14talkWithDriverEb+256)
  #03 pc 0000000000015f58  /system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState15waitForResponseEPNS0_6ParcelEPi+60)
  #04 pc 0000000000015d84  /system/lib64/libhwbinder.so (_ZN7android8hardware14IPCThreadState8transactEijRKNS0_6ParcelEPS2_j+216)
  #05 pc 00000000000128d4  /system/lib64/libhwbinder.so (_ZN7android8hardware10BpHwBinder8transactEjRKNS0_6ParcelEPS2_jNSt3__18functionIFvRS2_EEE+72)
  #06 pc 0000000000038bc4  /system/lib64/android.hardware.graphics.composer@2.1.so (_ZN7android8hardware8graphics8composer4V2_118BpHwComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+240)
  #07 pc 0000000000091cc0  /system/lib64/libsurfaceflinger.so (_ZN7android4Hwc28Composer11createLayerEmPm+100)
  #08 pc 000000000009a930  /system/lib64/libsurfaceflinger.so (_ZN4HWC27Display11createLayerEPNSt3__110shared_ptrINS_5LayerEEE+72)
  #09 pc 00000000000c5304  /system/lib64/libsurfaceflinger.so (_ZN7android10HWComposer11createLayerEi+152)
  #10 pc 00000000000b1524  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger15setUpHWComposerEv+1560)
  #11 pc 00000000000b096c  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger20handleMessageRefreshEv+108)
  #12 pc 00000000000aa660  /system/lib64/libsurfaceflinger.so (_ZN7android16ExSurfaceFlinger20handleMessageRefreshEv+16)
  #13 pc 00000000000b03e4  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger17onMessageReceivedEi+260)
  #14 pc 0000000000015d40  /system/lib64/libutils.so (_ZN7android6Looper9pollInnerEi+332)
  #15 pc 0000000000015b6c  /system/lib64/libutils.so (_ZN7android6Looper8pollOnceEiPiS1_PPv+108)
  #16 pc 000000000008b944  /system/lib64/libsurfaceflinger.so (_ZN7android12MessageQueue11waitMessageEv+84)
  #17 pc 00000000000af338  /system/lib64/libsurfaceflinger.so (_ZN7android14SurfaceFlinger3runEv+20)
  #18 pc 0000000000002cfc  /system/bin/surfaceflinger (main+948)
  #19 pc 000000000001b8b0  /system/lib64/libc.so (__libc_init+88)
  #20 pc 00000000000028a8  /system/bin/surfaceflinger (do_arm64_start+80)

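The main thread is in createLayer, which in turn calls over hwbinder from the surfaceflinger process into /vendor/bin/hw/android.hardware.graphics.composer@2.1-service, so next look at the trace of the corresponding graphics.composer thread.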

"HwBinder:738_1" sysTid=1273
  #00 pc 000000000001dc2c  /system/lib64/libc.so (syscall+28)
  #01 pc 0000000000066014  /system/lib64/libc.so (pthread_cond_wait+96)
  #02 pc 000000000001fc3c  /vendor/lib64/hw/hwcomposer.sdm845.so (_ZN3sdm10HWCSession11CreateLayerEP11hwc2_devicemPm+120)
  #03 pc 00000000000140e0  /vendor/lib64/hw/android.hardware.graphics.composer@2.1-impl.so (_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+84)
  #04 pc 0000000000044840  /system/lib64/android.hardware.graphics.composer@2.1.so (_ZN7android8hardware8graphics8composer4V2_116BsComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+160)
  #05 pc 000000000003fd10  /system/lib64/android.hardware.graphics.composer@2.1.so (_ZN7android8hardware8graphics8composer4V2_118BnHwComposerClient10onTransactEjRKNS0_6ParcelEPS5_jNSt3__18functionIFvRS5_EEE+2224)
  #06 pc 0000000000011be0  /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware9BHwBinder8transactEjRKNS0_6ParcelEPS2_jNSt3__18functionIFvRS2_EEE+132)
  #07 pc 00000000000156fc  /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState14executeCommandEi+584)
  #08 pc 0000000000015404  /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState20getAndExecuteCommandEv+156)
  #09 pc 0000000000015b0c  /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware14IPCThreadState14joinThreadPoolEb+60)
  #10 pc 000000000001f5c8  /system/lib64/vndk-sp/libhwbinder.so (_ZN7android8hardware10PoolThread10threadLoopEv+24)
  #11 pc 0000000000011674  /system/lib64/vndk-sp/libutils.so (_ZN7android6Thread11_threadLoopEPv+280)
  #12 pc 0000000000066970  /system/lib64/libc.so (_ZL15__pthread_startPv+36)
  #13 pc 000000000001f474  /system/lib64/libc.so (__start_thread+68)

At last we have found where things are ultimately blocked:

 #02 pc 000000000001fc3c  /vendor/lib64/hw/hwcomposer.sdm845.so (_ZN3sdm10HWCSession11CreateLayerEP11hwc2_devicemPm+120)
  #03 pc 00000000000140e0  /vendor/lib64/hw/android.hardware.graphics.composer@2.1-impl.so (_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE+84)

Run addr2line again:
addr2line -f -e hwcomposer.sdm845.so 1fc3c
_ZN3sdm6Locker4WaitEv
hardware/qcom/display/include/../sdm/include/utils/locker.h:141

addr2line -f -e android.hardware.graphics.composer@2.1-impl.so 140e0
_ZN7android8hardware8graphics8composer4V2_114implementation14ComposerClient11createLayerEmjNSt3__18functionIFvNS3_5ErrorEmEEE
hardware/interfaces/graphics/composer/2.1/default/ComposerClient.cpp:299

hardware/interfaces/graphics/composer/2.1/default/ComposerClient.cpp#299
295 Return<void> ComposerClient::createLayer(Display display,
296         uint32_t bufferSlotCount, createLayer_cb hidl_cb)
297 {
298     Layer layer = 0;
299     Error err = mHal.createLayer(display, &layer);
300     if (err == Error::NONE) {
301         std::lock_guard<std::mutex> lock(mDisplayDataMutex);
302
303         auto dpy = mDisplayData.find(display);
304         if (dpy != mDisplayData.end()) {
305             auto ly = dpy->second.Layers.emplace(layer, LayerBuffers()).first;
306             ly->second.Buffers.resize(bufferSlotCount);
307         } else {
308             layer = 0;
309             err = Error::BAD_DISPLAY;
310         }
311     }
312
313     hidl_cb(err, layer);
314     return Void();
315 }

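It looks like createLayer is where things went wrong, so the issue was handed to the colleagues who own the low-level display module for further analysis. There is more about Watchdog worth thinking about, such as how Watchdog has changed across Android versions and what happens when the Watchdog thread itself is blocked. Watchdog problems are also varied and each module's logic differs, so for reasons of length these are left for the reader to investigate.

Reference: https://duanqz.github.io/2015-10-12-Watchdog-Analysis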
