Introduction to Android WatchDog

Contents

    • Android WatchDog
      • WatchDog Initialization
      • HandlerChecker Overview
      • WatchDog Detection Logic
      • References

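WatchDog originated in early embedded systems as a last-resort safeguard for software that has run away: it reboots the device. The approach is heavy-handed, but a reboot usually clears up many intermittent bugs, at least temporarily.

A WatchDog design generally needs the following three functions:

  • Feeding mechanism
  • Dumping diagnostic logs on failure
  • Recovery

The feeding mechanism comes in two forms:

  • Passive - waits for the system to feed it "food"
  • Active - checks on its own whether "food" is available

In either form, when no "food" reaches the WatchDog, a fault is triggered; the WatchDog then dumps diagnostic logs and attempts recovery.

In early embedded systems the WatchDog was usually a hardware device, so it was fed by the software system.

A WatchDog implemented in software for a software system is more flexible, so its feeding mechanism can be designed as needed.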
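To make the passive feeding model concrete, here is a minimal sketch in plain Java. It is not related to the Android implementation: the class FeedWatchdog, its methods, and the injectable clock are all hypothetical names chosen for illustration. It only shows the order described above: no feed within the timeout, then dump logs, then attempt recovery.

```java
import java.util.function.LongSupplier;

/**
 * Minimal software watchdog using the passive feeding model.
 * All names are hypothetical; this is an illustration, not Android code.
 */
final class FeedWatchdog {
    private final long timeoutMillis;
    private final LongSupplier clock;   // injectable time source, in milliseconds
    private long lastFedAt;

    FeedWatchdog(long timeoutMillis, LongSupplier clock) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.lastFedAt = clock.getAsLong();
    }

    /** Passive feeding: the monitored system calls this periodically. */
    synchronized void feed() {
        lastFedAt = clock.getAsLong();
    }

    /** True once no feed has arrived for at least timeoutMillis. */
    synchronized boolean isStarved() {
        return clock.getAsLong() - lastFedAt >= timeoutMillis;
    }

    /**
     * One check cycle: when starved, dump diagnostics first, then attempt recovery.
     * Returns true if recovery was triggered.
     */
    boolean check(Runnable dumpLogs, Runnable recover) {
        if (!isStarved()) {
            return false;
        }
        dumpLogs.run();
        recover.run();
        return true;
    }
}
```

Passing the clock in as a LongSupplier is a design choice of this sketch: it keeps the timeout logic testable without sleeping. A real deployment would call check from a dedicated thread or timer.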

Android WatchDog

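Android also has a WatchDog. It mainly monitors the service threads running inside system_server. When system_server starts its services during initialization, it also configures and starts the WatchDog: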

private void startOtherServices() {
    ...
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.init(context, mActivityManagerService);
    ...
    mActivityManagerService.systemReady(new Runnable() {
        @Override
        public void run() {
            ...
            Watchdog.getInstance().start();
        }
    });
}

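init is called first, and the WatchDog is started once AMS.systemReady completes. So how are monitored threads and callbacks registered with the WatchDog? Take the AMS setup code as an example: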

public ActivityManagerService(Context systemContext) {
   ...
   Watchdog.getInstance().addMonitor(this);
   Watchdog.getInstance().addThread(mHandler);
}

Before the constructor returns, it registers a monitor callback and a Handler bound to the thread to be monitored.
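What a monitor callback does is up to the service that registers it. One common pattern, sketched below under stated assumptions, is a lock probe: monitor() merely acquires the service's lock and returns, so it blocks exactly when some other thread is stuck holding that lock. The Monitor interface here is a stand-in with an assumed shape, and DemoService and its helper methods are hypothetical; none of this is copied from the Android sources.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Stand-in for the callback type passed to addMonitor (assumed shape). */
interface Monitor {
    void monitor();
}

/** Hypothetical service that guards its state with a single lock. */
final class DemoService implements Monitor {
    private final Object lock = new Object();
    private final CountDownLatch release = new CountDownLatch(1);

    /** Lock probe: returns only once the service lock can be acquired. */
    @Override
    public void monitor() {
        synchronized (lock) {
            // Nothing to do: acquiring the lock is the whole check.
        }
    }

    /** Starts a thread that takes the lock and keeps it; returns once the lock is held. */
    void simulateStuckOperation() {
        CountDownLatch acquired = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                acquired.countDown();
                awaitQuietly(release);
            }
        });
        holder.setDaemon(true);
        holder.start();
        awaitQuietly(acquired);
    }

    /** Lets the simulated stuck operation finish and drop the lock. */
    void finishStuckOperation() {
        release.countDown();
    }

    /** Runs monitor() on a helper thread; true if it returned within the timeout. */
    boolean monitorReturnsWithin(long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread probe = new Thread(() -> {
            monitor();
            done.countDown();
        });
        probe.setDaemon(true);
        probe.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

While the lock is free, monitorReturnsWithin reports true almost immediately; after simulateStuckOperation it reports false until finishStuckOperation is called.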

WatchDog Initialization

Next, the code itself, starting with the WatchDog constructor:

public class Watchdog extends Thread {
    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
    }
    ...
}

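WatchDog extends Thread. Its constructor mainly initializes:

  • mMonitorChecker - the HandlerChecker bound to the thread on which monitor callbacks run
  • mHandlerCheckers - the list of preset HandlerCheckers

A HandlerChecker detects when the thread bound to a Handler fails to respond in time; the timeout is configurable at construction. This is the default behavior, and it builds on Android's Handler/Looper mechanism.

Beyond the default behavior, custom checks can be added by registering monitor callbacks on a HandlerChecker.

All monitor callbacks registered with the WatchDog are stored in mMonitorChecker.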
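The default check can be pictured without any Android classes. The sketch below is an approximation under stated assumptions: a single-threaded ExecutorService stands in for the Looper thread, and ThreadProbe is a hypothetical name. Unlike a Handler, the executor queues the probe at the back rather than the front, so this only illustrates the idea that a blocked thread never gets to set the completion flag.

```java
import java.util.concurrent.ExecutorService;

/**
 * Rough stand-in for the default HandlerChecker behavior (hypothetical class).
 * A probe task is queued on the target thread; the flag flips back to true
 * only if that thread actually gets around to running the task.
 */
final class ThreadProbe {
    private final ExecutorService target;      // single-threaded, plays the Looper thread
    private volatile boolean completed = true; // true when no probe is in flight

    ThreadProbe(ExecutorService target) {
        this.target = target;
    }

    /** Queues a probe unless one is still in flight. */
    void scheduleCheck() {
        if (!completed) {
            return; // a previous probe has not run yet
        }
        completed = false;
        target.execute(() -> completed = true);
    }

    /** False while a probe is queued but has not been executed. */
    boolean isCompleted() {
        return completed;
    }
}
```

A caller would invoke scheduleCheck, wait for the configured timeout, and treat a still-false isCompleted as a sign that the target thread is blocked.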

HandlerChecker Overview

The core of HandlerChecker:

  1. Post a message to the front of the message queue
        public void scheduleCheckLocked() {
            // No monitor callbacks and the looper is idle: mark as completed and return
            if (mMonitors.size() == 0 && mHandler.getLooper().isIdling()) {
                // If the target looper is or just recently was idling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread.  Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            // Insert the message at the front of the queue
            mHandler.postAtFrontOfQueue(this);
        }
  2. The Runnable attached to the message is executed
        public void run() {
            final int size = mMonitors.size();
            // Run the monitor callbacks
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }
            // Mark the check as completed
            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
  3. Query the completion state
        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }

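As the code above shows, once scheduleCheckLocked() has been called, only two things can keep a HandlerChecker from reaching the COMPLETED state:

  • The thread associated with the HandlerChecker is blocked, so the Runnable attached to the posted message is not executed within the timeout
  • The Runnable does run, but a registered monitor callback does not return within the timeout

WatchDog Detection Logic

As mentioned above, WatchDog is itself a thread, and detection begins once that thread starts. Here is the code: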
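The state decision in getCompletionStateLocked() is easy to try out as a pure function. The sketch below restates it in standalone form; the class name CompletionState, the method name of, and the numeric values of the constants are assumptions made for illustration (the listing above does not show how the constants are defined).

```java
/** Standalone restatement of the completion-state logic (hypothetical class). */
final class CompletionState {
    // Assumed values, ordered from least to most severe.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /**
     * @param completed whether the probe Runnable has finished
     * @param startTime when the probe was scheduled, in milliseconds
     * @param now       current time, in milliseconds
     * @param waitMax   timeout configured for this checker, in milliseconds
     */
    static int of(boolean completed, long startTime, long now, long waitMax) {
        if (completed) {
            return COMPLETED;
        }
        long latency = now - startTime;
        if (latency < waitMax / 2) {
            return WAITING;        // less than half the timeout has elapsed
        } else if (latency < waitMax) {
            return WAITED_HALF;    // at least half, but not the full timeout
        }
        return OVERDUE;            // the full timeout has elapsed
    }
}
```

With a timeout of 60 seconds, for example, an unfinished probe is WAITING for the first 30 seconds, WAITED_HALF from 30 up to 60 seconds, and OVERDUE from 60 seconds on.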

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
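                // Check interval, 30 seconds by default
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                // Iterate over the HandlerCheckers and trigger a check on each
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }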

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        // Block this thread for the remaining interval
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }

                // Timed out (OVERDUE)
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
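                // The original listing is cut off here, in the middle of a loop.
                // As summarized below, the process is then killed:
                Process.killProcess(Process.myPid());
                System.exit(10);
            }
            ...
        }
    }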

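The overall logic is clear from the code:

  1. An infinite loop repeats the check
  2. Before each check, every HandlerChecker is visited and its scheduleCheckLocked is called
  3. The thread suspends itself for a while by calling wait with a timeout
  4. When the wait ends, the thread resumes and calls evaluateCheckerCompletionLocked to obtain the overall completion state of the HandlerCheckers; OVERDUE means at least one check did not finish in time
  5. Stack traces are saved via ActivityManagerService.dumpStackTraces
  6. The error log is saved to the dropbox via mActivity.addErrorToDropBox
  7. The system_server process is killed via Process.killProcess(Process.myPid()) and System.exit(10), which triggers a soft reboot of the Android device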
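The per-cycle decision in the steps above can be condensed into a small sketch. Two assumptions are made here and are not shown in the listing: that evaluateCheckerCompletionLocked reports the most severe state among all checkers, and that the state constants have the numeric values used below. WatchdogCycle, worstState, decide, and the Action enum are hypothetical names.

```java
import java.util.List;

/** Condensed view of one watchdog cycle (hypothetical class, not Android code). */
final class WatchdogCycle {
    // Assumed values, ordered from least to most severe.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    enum Action { RESET, KEEP_WAITING, DUMP_HALFWAY_STACKS, DUMP_AND_KILL }

    /** Assumption: the overall state is the most severe state of any checker. */
    static int worstState(List<Integer> checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    /** Maps the overall state to what the loop does next. */
    static Action decide(int waitState, boolean waitedHalf) {
        switch (waitState) {
            case COMPLETED:
                return Action.RESET;               // all checks returned; clear waitedHalf
            case WAITING:
                return Action.KEEP_WAITING;        // still within the first half
            case WAITED_HALF:
                // Dump stacks once at the halfway point, then keep waiting.
                return waitedHalf ? Action.KEEP_WAITING : Action.DUMP_HALFWAY_STACKS;
            default:
                return Action.DUMP_AND_KILL;       // OVERDUE: dump, report, kill
        }
    }
}
```

In the real loop the DUMP_AND_KILL path can still be vetoed, as the listing shows: an attached debugger, an activity controller asking to keep waiting, or mAllowRestart being false all prevent the kill.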

References

Android 7.0 Watchdog mechanism
