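SystemServer Startup Flow: WatchDog Analysis (Part 4)

1. Overview

In the previous article we looked at what SystemServer does once it has started, which essentially comes down to starting a large number of system services. If you are not yet familiar with that process, have a look at "SystemServer Startup Flow: SystemServer Analysis (Part 3)" first.

Today's subject is not SystemServer itself, though, but the WatchDog that gets initialized inside it. You may have heard of WatchDog before: it is simply a class (named Watchdog in the source) that monitors the system services in Android, a watchdog in the literal sense.

Android provides the Watchdog class as a software watchdog for the SystemServer process. Once it detects a problem, WatchDog kills the SystemServer process. Zygote, the parent process of SystemServer, kills itself when it receives SystemServer's death signal. When Zygote's death signal reaches the init process, init kills Zygote together with all of its child processes and restarts Zygote, which is effectively the same as the phone rebooting.

Enough preamble, let's get to the main topic.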

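2. Source Code Analysis

First, let's see where WatchDog is started and initialized.

(1) WatchDog initialization

Remember that WatchDog was mentioned at the end of the previous article, "SystemServer Startup Flow: SystemServer Analysis (Part 3)"? We pick up the analysis from there.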

private void startOtherServices() {

    //......
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.init(context, mActivityManagerService);
    Watchdog.getInstance().start();
}

WatchDog is initialized in SystemServer's startOtherServices method. Let's look at its getInstance() method first.

public class Watchdog extends Thread {

    static Watchdog sWatchdog;

    /* This handler will be used to post message back onto the main thread */
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<HandlerChecker>();

    final HandlerChecker mMonitorChecker;

    static final long DEFAULT_TIMEOUT = DB ? 10*1000 : 60*1000;
    static final long CHECK_INTERVAL = DEFAULT_TIMEOUT / 2;

    //Obtain the Watchdog instance via getInstance; this uses the singleton pattern
    public static Watchdog getInstance() {
        if (sWatchdog == null) {
            sWatchdog = new Watchdog();
        }

        return sWatchdog;
    }

    //The Watchdog constructor
    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);

        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));

        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));

        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));

        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
    }
}

The WatchDog constructor mainly creates several HandlerChecker objects and stores them all in the mHandlerCheckers list. Each HandlerChecker corresponds to one monitored thread. Next, let's look at the init method.

public void init(Context context, ActivityManagerService activity) {
        mResolver = context.getContentResolver();
        mActivity = activity;

        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }

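init registers the RebootRequestReceiver broadcast receiver, which listens for the reboot Intent (ACTION_REBOOT). Here is the receiver's onReceive method; note that it calls rebootSystem only when the Intent carries a non-zero "nowait" extra.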

final class RebootRequestReceiver extends BroadcastReceiver {
        @Override
        public void onReceive(Context c, Intent intent) {
            if (intent.getIntExtra("nowait", 0) != 0) {
                rebootSystem("Received ACTION_REBOOT broadcast");
                return;
            }
        }
    }

    /**
     * Perform a full reboot of the system.
     */
    void rebootSystem(String reason) {
        Slog.i(TAG, "Rebooting system because: " + reason);
        IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
        try {
            pms.reboot(false, reason, false);
        } catch (RemoteException ex) {
        }
    }

(2) Services and threads monitored by WatchDog

Before getting into this, let's look at the HandlerChecker class that was instantiated in the WatchDog constructor.

// These are temporally ordered: larger values as lateness increases
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

/**
     * Used for checking status of handle threads and scheduling monitor callbacks.
     */
    public final class HandlerChecker implements Runnable {
        private final Handler mHandler;
        private final String mName;
        private final long mWaitMax;
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
        private boolean mCompleted;
        private Monitor mCurrentMonitor;
        private long mStartTime;

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

        //Adds a service to be monitored (used later)
        public void addMonitor(Monitor monitor) {
            mMonitors.add(monitor);
        }

        //Schedules a check of the monitored thread and services by posting this checker to the handler (used later)
        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().isIdling()) {
                // If the target looper is or just recently was idling, then
                // there is no reason to enqueue our checker on it since that
                // is as good as it not being deadlocked.  This avoid having
                // to do a context switch to check the thread.  Note that we
                // only do this if mCheckReboot is false and we have no
                // monitors, since those would need to be executed at this point.
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }

        public boolean isOverdueLocked() {
            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
        }

        //Returns a value describing the current state of this check (used later)
        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }

        public Thread getThread() {
            return mHandler.getLooper().getThread();
        }

        public String getName() {
            return mName;
        }

        public String describeBlockedStateLocked() {
            if (mCurrentMonitor == null) {
                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
            } else {
                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                        + " on " + mName + " (" + getThread().getName() + ")";
            }
        }

        //The method that actually performs the check (scheduleCheckLocked posted this Runnable; run() executes on the monitored thread)
        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }

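That is the HandlerChecker class created in WatchDog. Its main job is to check whether any of the registered services and threads is suffering from a deadlock or a similar problem.
Now let's look at the two methods WatchDog provides for adding monitored threads and services: addThread() and addMonitor().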

public void addThread(Handler thread) {
        addThread(thread, DEFAULT_TIMEOUT);
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Threads can't be added once the Watchdog is running");
            }
            final String name = thread.getLooper().getThread().getName();
            //Create a new HandlerChecker and add it to the mHandlerCheckers list
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }

addThread simply creates a HandlerChecker for the thread to be monitored and adds it to the mHandlerCheckers list so that it is included in the periodic checks.

public void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            //Delegates to addMonitor on the mMonitorChecker object created in the WatchDog constructor
            mMonitorChecker.addMonitor(monitor);
        }
    }

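As the code shows, services are also monitored through HandlerChecker, but a single HandlerChecker object (mMonitorChecker) is enough to check all of them: a new service that needs monitoring is registered by calling that object's addMonitor method.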

public final class HandlerChecker implements Runnable {

    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();

    public void addMonitor(Monitor monitor) {
            mMonitors.add(monitor);
        }
}

In other words, each service to be monitored is simply stored in an ArrayList so that it can be iterated over and checked later.
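Both addThread() and addMonitor() throw a RuntimeException once the watchdog thread is alive, so all registration has to happen before start() is called. The following plain-Java sketch illustrates that guard only. RegistrySketch and its methods are hypothetical names made up for this example, and Runnable stands in for the Monitor interface; none of this is Android code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Watchdog: a Thread that only accepts registrations before it starts.
class RegistrySketch extends Thread {
    private final List<Runnable> mMonitors = new ArrayList<>();

    RegistrySketch() {
        super("registry-sketch");
        setDaemon(true);
    }

    // Same guard as in the addMonitor() excerpt above: reject registrations once the thread is alive.
    public void addMonitor(Runnable monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the thread is running");
            }
            mMonitors.add(monitor);
        }
    }

    public int monitorCount() {
        synchronized (this) {
            return mMonitors.size();
        }
    }

    @Override
    public void run() {
        // Stay alive until interrupted so that isAlive() keeps returning true.
        try {
            Thread.sleep(Long.MAX_VALUE);
        } catch (InterruptedException e) {
            // Interrupted by the demo below: just exit.
        }
    }

    // Returns true if a registration before start() succeeds and one after start() is rejected.
    static boolean rejectsLateRegistration() {
        RegistrySketch registry = new RegistrySketch();
        registry.addMonitor(() -> { });
        registry.start();
        try {
            registry.addMonitor(() -> { });
            return false;
        } catch (RuntimeException expected) {
            return registry.monitorCount() == 1;
        } finally {
            registry.interrupt();
        }
    }
}
```

Calling RegistrySketch.rejectsLateRegistration() exercises both cases: the first registration is accepted, the second one (made after start()) is refused and the monitor list keeps its single entry.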

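So which services in Android are actually monitored by WatchDog?
We already saw that the WatchDog constructor sets up monitoring for several threads:
1. The main thread
2. FgThread
3. UiThread
4. IoThread
5. DisplayThread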

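Besides these shared threads, a number of important services are also monitored. Note that anything that wants to be monitored by WatchDog must implement WatchDog's Monitor interface.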

public interface Monitor {
        void monitor();
    }

A service implements the Monitor interface, overrides its monitor method, and calls WatchDog's addMonitor() method to add itself to WatchDog's list of monitored services. The services in SystemServer that implement Monitor and call addMonitor() are:
1. ActivityManagerService
2. InputManagerService
3. MediaRouterService
4. MountService
5. NativeDaemonConnector
6. NetworkManagementService
7. PowerManagerService
8. WindowManagerService
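To make the Monitor contract more concrete, here is a minimal plain-Java sketch of the usual pattern, where monitor() does nothing except acquire and release the service's main lock. The Monitor interface mirrors the one shown above; DemoService, holdLockUntil and MonitorDemo are hypothetical names invented for this illustration, and the real services each have their own locks and details.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Mirrors the Watchdog.Monitor interface shown above.
interface Monitor {
    void monitor();
}

// Hypothetical service, for illustration only.
class DemoService implements Monitor {
    private final Object mLock = new Object();

    // Holds the service lock until 'release' is counted down, simulating a stuck call.
    void holdLockUntil(CountDownLatch acquired, CountDownLatch release) {
        synchronized (mLock) {
            acquired.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Returns as soon as the lock can be taken; blocks while another thread holds it.
    @Override
    public void monitor() {
        synchronized (mLock) { }
    }
}

class MonitorDemo {
    // Runs monitor() on a helper thread and reports whether it returned within waitMillis.
    static boolean returnsWithin(Monitor m, long waitMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> { m.monitor(); done.countDown(); });
        t.setDaemon(true);
        t.start();
        try {
            return done.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Returns true if monitor() stays blocked while another thread holds the service lock.
    static boolean blocksWhileLockHeld(DemoService service) {
        CountDownLatch acquired = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> service.holdLockUntil(acquired, release));
        holder.setDaemon(true);
        holder.start();
        try {
            acquired.await();
            return !returnsWithin(service, 200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown();
        }
    }
}
```

The point of the pattern is that monitor() needs no logic of its own: if some thread holds the service lock and never releases it, monitor() never returns, and that is exactly what the watchdog observes.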

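(3) How WatchDog monitoring works

The services and threads to be monitored are now registered, so how are they actually monitored? This is where Watchdog.getInstance().start() comes in. WatchDog extends Thread, so calling start() spawns a new thread that executes its run method. Let's look at run.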

@Override
    public void run() {
        boolean waitedHalf = false;

        //Infinite while loop
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval

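                //(1) Loop over all monitored threads and post a check message to each
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }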

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {

                        //(2) After posting the messages, call wait to sleep for a while
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                //(3) Check every service and thread, determine the overall state, and act on it
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);

                        // SPRD: add for debug watchdog @{
                        // The system's been hanging for 30s, another five won't hurt much.
                        SystemClock.sleep(3000);
                        if (RECORD_KERNEL_THREADS) {
                            dumpKernelStackTraces();
                            SystemClock.sleep(2000);
                        }
                        // @}

                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            // SPRD: modify 2000 to 3000
            SystemClock.sleep(3000);

            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
                // SPRD: wait for dump info writen
                SystemClock.sleep(2000);
            }

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            if (Debug.isDebug()){//SPRD: TEMP
                doSysRq('w');
                doSysRq('l');
            }

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");

                //Kill the SystemServer process
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }

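The run method of WatchDog mainly checks whether the monitored services and threads are deadlocked or otherwise stuck. It is an infinite while loop that does three main things.

1. Call scheduleCheckLocked to post a message to every monitored thread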

for (int i=0; i<mHandlerCheckers.size(); i++) {
    HandlerChecker hc = mHandlerCheckers.get(i);
    hc.scheduleCheckLocked();
}

It iterates over mHandlerCheckers, the list that holds the HandlerChecker objects, takes each HandlerChecker and calls its scheduleCheckLocked method. Let's look at that method.

public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().isIdling()) {
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }

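A HandlerChecker has to monitor both services and a thread, so the method first checks whether mMonitors is empty. If it is, this HandlerChecker monitors no services, and if the monitored thread's message queue is also idle (as reported by isIdling()), the thread is considered healthy: mCompleted is set to true and the method returns. If an earlier check is still in flight (mCompleted is false), the method also returns without posting again. Otherwise it sets mCompleted to false, records the time at which the message is sent, and calls postAtFrontOfQueue() to post a message to the monitored thread. That message is handled by the run() method of HandlerChecker, shown below: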

@Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }

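If run() gets to execute at all, the monitored thread itself is fine, but the state of the monitored services still has to be checked. This is done by calling each service's monitor() implementation. A typical monitor() implementation acquires the service's lock; if the lock cannot be obtained, the thread blocks and mCompleted is never set to true.
When mCompleted is true, the thread or services watched by this HandlerChecker are healthy. Otherwise there may be a problem, and whether there really is one is decided by whether the wait has exceeded the allowed time.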
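The schedule-then-run handshake can be reproduced outside Android with a small simulation. In the sketch below a single-thread ExecutorService plays the role of the monitored Handler thread, which is an assumption made purely for illustration: an executor queues tasks in FIFO order and has no equivalent of postAtFrontOfQueue() or isIdling(). CheckerSketch and CheckerDemo are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simplified stand-in for HandlerChecker: the executor plays the monitored Handler thread.
class CheckerSketch implements Runnable {
    private final ExecutorService mThread;
    private boolean mCompleted = true;

    CheckerSketch(ExecutorService monitoredThread) {
        mThread = monitoredThread;
    }

    synchronized void scheduleCheck() {
        if (!mCompleted) {
            return;               // a check is already in flight
        }
        mCompleted = false;
        mThread.execute(this);    // the real code calls mHandler.postAtFrontOfQueue(this)
    }

    synchronized boolean isCompleted() {
        return mCompleted;
    }

    // Runs on the monitored thread; reaching this point proves the thread is responsive.
    @Override
    public void run() {
        synchronized (this) {
            mCompleted = true;
        }
    }
}

class CheckerDemo {
    // Schedules one check and reports whether it completed within waitMillis.
    // If blockThread is true, the monitored thread is first stuck in a task that never finishes in time.
    static boolean checkCompletes(boolean blockThread, long waitMillis) {
        ExecutorService thread = Executors.newSingleThreadExecutor();
        CountDownLatch unblock = new CountDownLatch(1);
        try {
            if (blockThread) {
                thread.execute(() -> {
                    try {
                        unblock.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            CheckerSketch checker = new CheckerSketch(thread);
            checker.scheduleCheck();
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitMillis);
            while (!checker.isCompleted() && System.nanoTime() < deadline) {
                Thread.sleep(5);
            }
            return checker.isCompleted();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            unblock.countDown();
            thread.shutdown();
        }
    }
}
```

With a healthy thread, checkCompletes(false, ...) sees the flag flip back to true almost immediately. With a stuck thread, checkCompletes(true, ...) finds the flag still false when the wait expires, which corresponds to the situation the watchdog goes on to classify by elapsed time.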

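2. After posting the messages to the monitored threads, call wait() to put the WatchDog thread to sleep for CHECK_INTERVAL (half of DEFAULT_TIMEOUT).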
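The wait in run() sits inside a loop that recomputes the remaining time after every wakeup, so an early return from wait() does not shorten the interval. A minimal plain-Java sketch of that loop follows. WaitSketch is a hypothetical name, and System.nanoTime() is used here as a stand-in for SystemClock.uptimeMillis(), which only exists on Android.

```java
class WaitSketch {
    // Blocks the calling thread for at least intervalMillis, waiting again after early wakeups.
    // Returns the elapsed time in milliseconds.
    static long waitFullInterval(Object lock, long intervalMillis) {
        long start = System.nanoTime();
        synchronized (lock) {
            long timeout = intervalMillis;
            while (timeout > 0) {
                try {
                    lock.wait(timeout);
                } catch (InterruptedException e) {
                    // The original logs the exception and keeps waiting; this sketch just keeps waiting.
                }
                long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
                timeout = intervalMillis - elapsedMillis;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

The loop only exits once the measured elapsed time has reached the requested interval, regardless of how many times wait() returned early.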

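3. Loop over the registered threads and services and check each one for problems. If something is wrong, the process is killed. Let's look at the evaluateCheckerCompletionLocked method.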

private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerChecker hc = mHandlerCheckers.get(i);
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        //Return the worst state among all checked objects
        return state;
    }

public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }

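evaluateCheckerCompletionLocked() calls getCompletionStateLocked() on every checker to obtain its state and returns the largest (worst) one. There are four possible state values:

1. COMPLETED (0): everything is healthy.
2. WAITING (1): waiting for the posted message to be handled.
3. WAITED_HALF (2): still waiting, and at least half of the allowed time has passed.
4. OVERDUE (3): the wait has exceeded the allowed time. In this state the SystemServer process is normally killed.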
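The classification can be restated as two small pure functions, which makes the thresholds easy to try out with concrete numbers. This is only a sketch: StateSketch, stateOf and worstState are hypothetical names, and the inputs that the real methods read from fields and the clock are passed in as parameters here.

```java
class StateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Same decision logic as getCompletionStateLocked(), with the inputs passed in explicitly.
    static int stateOf(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // Same aggregation as evaluateCheckerCompletionLocked(): the worst state wins.
    static int worstState(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }
}
```

For example, with a 60-second limit an unfinished check is WAITING at 10 seconds, WAITED_HALF at 30 seconds and OVERDUE at 60 seconds, and one WAITED_HALF checker among otherwise healthy ones makes the overall result WAITED_HALF.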

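That's it for today's walk through the WatchDog source. Let's summarize the overall process.

3. Summary

Those are the main points of this post about WatchDog. In short, it does the following:

1. A WatchDog object is created first (it is a singleton), and its constructor creates several HandlerChecker objects for the threads to be monitored.

2. WatchDog's init method registers the RebootRequestReceiver, which listens for the system reboot action, Intent(ACTION_REBOOT).

3. Threads and services to be monitored are registered through WatchDog's addThread() and addMonitor() methods: addThread() adds a new HandlerChecker to the mHandlerCheckers list, and addMonitor() adds the service to mMonitorChecker's monitor list, so that both are covered by the periodic checks.

4. WatchDog's run method, started via start(), checks the monitored threads and services in a loop. Services that are deadlocked or overdue are handled accordingly, up to killing the SystemServer process so that the system restarts.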
