The Watchdog in SystemServer

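  A watchdog is familiar as an important Linux mechanism: it monitors whether the system is still running and, when the system locks up or hangs, reboots the machine promptly (depending on the configured policy) and collects a crash dump. Android's SystemServer has a Watchdog of its own that monitors a handful of important threads. If any of them stays blocked for too long, it kills the SystemServer process, which restarts Android.
  Watchdog is a singleton: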

/frameworks/base/services/core/java/com/android/server/Watchdog.java

    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

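  Watchdog maintains an ArrayList named mHandlerCheckers whose elements are HandlerChecker objects. The HandlerChecker constructor takes three arguments: (1) a thread's Handler, (2) a name, and (3) a timeout.
  As the constructor below shows, the Handlers of FgThread, the SystemServer main thread, UiThread, IoThread and DisplayThread are each wrapped in a HandlerChecker and stored in mHandlerCheckers, which places those threads under the Watchdog's supervision.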

/frameworks/base/services/core/java/com/android/server/Watchdog.java

    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
    }

/frameworks/base/services/core/java/com/android/server/Watchdog.java

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

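  HandlerChecker takes a Handler so that it can detect whether the Looper behind that Handler has been blocked for a long time. HandlerChecker also has an mMonitors member, used to detect whether threads have been blocked for a long time on a lock taken with the synchronized keyword. A Monitor is added by calling Watchdog#addMonitor; Monitor is an interface that declares a single method, monitor(). All of these Monitors are added to the HandlerChecker named mMonitorChecker. Familiar core services such as ActivityManagerService and WindowManagerService implement the Watchdog.Monitor interface. For example: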

/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java

    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }

/frameworks/base/services/core/java/com/android/server/wm/WindowManagerService.java

    // Called by the heartbeat to ensure locks are not held indefinitely (for deadlock detection).
    @Override
    public void monitor() {
        synchronized (mWindowMap) { }
    }
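
  To see why an empty synchronized block works as a deadlock probe, here is a minimal plain-Java sketch with no Android dependencies. The names Monitor, LockedService, LockProbeDemo, probe and probeWhileLockHeld are hypothetical stand-ins, and the helper-thread-plus-timeout arrangement is only an approximation of what the Watchdog does, not its actual implementation. If monitor() returns, the lock was obtainable; if it does not return in time, some other thread is holding the lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for Watchdog.Monitor. */
interface Monitor {
    void monitor();
}

/** Hypothetical service whose monitor() only acquires and releases its own lock. */
class LockedService implements Monitor {
    @Override
    public void monitor() {
        synchronized (this) { }
    }
}

class LockProbeDemo {
    /** Runs monitor() on a helper thread; returns true if it finished within timeoutMs. */
    static boolean probe(Monitor m, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            m.monitor();
            done.countDown();
        }, "probe");
        t.setDaemon(true); // do not keep the JVM alive if the probe stays blocked
        t.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Probes the service while another thread holds its lock for holdMs. */
    static boolean probeWhileLockHeld(LockedService service, long holdMs, long timeoutMs) {
        CountDownLatch held = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service) {
                held.countDown();
                try {
                    Thread.sleep(holdMs);
                } catch (InterruptedException ignored) { }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        try {
            held.await(); // make sure the lock is really taken before probing
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return probe(service, timeoutMs);
    }

    public static void main(String[] args) {
        LockedService service = new LockedService();
        System.out.println("free lock: " + probe(service, 500));
        System.out.println("held lock: " + probeWhileLockHeld(service, 2000, 200));
    }
}
```

  The probe never touches the service's state. Success or failure in acquiring the lock is the whole signal, which is why the monitor() bodies in the listings above are empty.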

  Watchdog extends Thread. Here is its run method:

/frameworks/base/services/core/java/com/android/server/Watchdog.java

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
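                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {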
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                // ... (rest of this branch omitted: it goes on to kill the SystemServer process)
            }
        }
    }

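  First, run() iterates over all HandlerCheckers and calls scheduleCheckLocked on each to start a blocking check. If a HandlerChecker has no Monitors (that is, it is not mMonitorChecker) and the Looper behind its Handler is idle, the method marks the check as completed and returns immediately. mCompleted records whether the blocking check has passed: it is initialized to true, set to false just before a check is posted, and set back to true once the check has run to completion.
  The blocking check itself is HandlerChecker's run method, posted to the front of the target thread's message queue; it calls the monitor method of every Monitor in mMonitors in turn.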

/frameworks/base/services/core/java/com/android/server/Watchdog.java

        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().isIdling()) {
                // If the target looper is or just recently was idling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread.  Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }

/frameworks/base/services/core/java/com/android/server/Watchdog.java

        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
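
  The schedule-then-complete handshake can be tried outside Android with a rough sketch like the one below. It assumes a single-thread ExecutorService as a stand-in for the Handler/Looper pair, so it cannot reproduce postAtFrontOfQueue or the idle-Looper shortcut. ExecutorChecker, ExecutorCheckerDemo, checkAfter and sleepQuietly are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical, simplified analogue of HandlerChecker built on an executor. */
class ExecutorChecker implements Runnable {
    private final ExecutorService target; // stands in for the Handler/Looper
    private boolean completed = true;     // no check in flight initially

    ExecutorChecker(ExecutorService target) {
        this.target = target;
    }

    /** Posts this checker to the target thread unless a check is already pending. */
    synchronized void scheduleCheck() {
        if (!completed) {
            return; // a check is already in flight
        }
        completed = false;
        target.execute(this);
    }

    /** Runs on the target thread; getting here shows the thread is still processing work. */
    @Override
    public void run() {
        synchronized (this) {
            completed = true;
        }
    }

    synchronized boolean isCompleted() {
        return completed;
    }
}

class ExecutorCheckerDemo {
    /** Blocks the target thread for blockMs, schedules a check, and samples it after waitMs. */
    static boolean checkAfter(long blockMs, long waitMs) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            worker.execute(() -> sleepQuietly(blockMs)); // simulate a busy or stuck thread
            ExecutorChecker checker = new ExecutorChecker(worker);
            checker.scheduleCheck();
            sleepQuietly(waitMs);
            return checker.isCompleted();
        } finally {
            worker.shutdownNow(); // stop the worker so the JVM can exit
        }
    }

    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("responsive thread: " + checkAfter(0, 300));
        System.out.println("blocked thread:    " + checkAfter(2000, 300));
    }
}
```

  The flag only flips back to true when the target thread actually runs the checker, so an unfinished check is direct evidence that the thread has not processed its queue.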
    

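  Next, the Watchdog thread waits for 30 s and, on waking, calls evaluateCheckerCompletionLocked to examine the state of each HandlerChecker in mHandlerCheckers. If a checker's blocking check has completed, getCompletionStateLocked returns COMPLETED. Otherwise the return value depends on how much time has passed since the check was started: WAITING if less than 30 s, WAITED_HALF if between 30 s and 60 s, and OVERDUE if 60 s or more.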

/frameworks/base/services/core/java/com/android/server/Watchdog.java

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
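        // ... (loop over mHandlerCheckers omitted: it collects each checker's getCompletionStateLocked() result)
        return state;
    }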
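
  The thresholds described above can be modeled as a small pure function. This is a sketch under stated assumptions, not the framework code: the class name CompletionStates, the methods stateOf and worstOf, the numeric values of the constants, the treatment of exactly half the timeout as WAITED_HALF, and the "worst state wins" aggregation are all assumptions made for illustration.

```java
/** Hypothetical stand-alone model of the completion states described in the text. */
class CompletionStates {
    // The numeric ordering (larger means worse) is an assumption of this sketch.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Classifies one checker from its completion flag and the time since its check started. */
    static int stateOf(boolean completed, long elapsedMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (elapsedMs < waitMaxMs / 2) {
            return WAITING;
        }
        if (elapsedMs < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Combines several checker states by keeping the worst one (assumed aggregation rule). */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // the 60 s timeout used in the text
        System.out.println(stateOf(false, 10_000, waitMax)); // still WAITING
        System.out.println(stateOf(false, 45_000, waitMax)); // WAITED_HALF
        System.out.println(stateOf(false, 60_000, waitMax)); // OVERDUE
        System.out.println(worstOf(COMPLETED, WAITED_HALF, WAITING));
    }
}
```

  With an aggregation like this, a single slow checker is enough to move the whole Watchdog into the WAITED_HALF or OVERDUE branch, even when every other thread is healthy.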

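  If the result is COMPLETED, the loop calls continue and starts a fresh round of checks. If it is WAITING, the loop also calls continue and keeps checking. If a check is still pending on a later pass, evaluateCheckerCompletionLocked returns WAITED_HALF; at that point ActivityManagerService#dumpStackTraces is called to dump a first set of traces, and the loop continues again. If the check has still not finished by the next pass, evaluateCheckerCompletionLocked returns OVERDUE, which means a monitored thread really is blocked, so the Watchdog dumps more detailed information and carries out the shutdown steps. ActivityManagerService#dumpStackTraces is called a second time here; the difference is that the first call clears the existing traces.txt file before writing, while the second call appends to what the first call wrote.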
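
  The sequence of passes can be replayed with a tiny model of the waitedHalf flag. This is a simplified sketch: WatchdogPassModel, its State enum, replay and the action strings are hypothetical, and it ignores the debugger, activity-controller and allowRestart conditions that can prevent the kill in the real run method.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of how the waitedHalf flag steers each pass of the loop. */
class WatchdogPassModel {
    enum State { COMPLETED, WAITING, WAITED_HALF, OVERDUE }

    /** Replays one evaluated state per wake-up and records what the loop would do. */
    static List<String> replay(State... statesPerPass) {
        List<String> actions = new ArrayList<>();
        boolean waitedHalf = false;
        for (State s : statesPerPass) {
            switch (s) {
                case COMPLETED:
                    waitedHalf = false; // monitors returned, start over
                    actions.add("reset");
                    break;
                case WAITING:
                    actions.add("keep waiting");
                    break;
                case WAITED_HALF:
                    if (!waitedHalf) {
                        // first dump: the trace file is cleared before writing
                        actions.add("dump stacks (clear trace file)");
                        waitedHalf = true;
                    } else {
                        actions.add("keep waiting");
                    }
                    break;
                case OVERDUE:
                    // the trace file is cleared only if no halfway dump happened
                    actions.add(waitedHalf
                            ? "dump stacks (append), then kill"
                            : "dump stacks (clear trace file), then kill");
                    return actions; // the process would not survive this pass
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        System.out.println(replay(State.WAITING, State.WAITED_HALF, State.OVERDUE));
        System.out.println(replay(State.WAITED_HALF, State.COMPLETED, State.WAITING));
    }
}
```

  The second replay shows the recovery path: a COMPLETED result after a halfway dump resets the flag, so a later stall starts again with a fresh trace file.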
