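Watchdog is initialized and started inside the SystemServer process. SystemServer's run method registers and starts the various Android services, and the initialization and startup of Watchdog are part of that sequence. The code is as follows: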
final Watchdog watchdog = Watchdog.getInstance();
watchdog.init(context, mActivityManagerService);
In the second half of startOtherServices in SystemServer, the systemReady interface is used to signal that the system is ready. Watchdog is started inside the callback passed to ActivityManagerService's systemReady:
Watchdog.getInstance().start();
The code above is located in frameworks/base/services/java/com/android/server/SystemServer.java. Watchdog.getInstance() and the constructor it calls look like this:
public static Watchdog getInstance() {
if (sWatchdog == null) {
sWatchdog = new Watchdog(); // create the instance (singleton pattern)
}
return sWatchdog;
}
private Watchdog() {
super("watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
}
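The Watchdog constructor adds the foreground thread, main thread, UI thread, I/O thread, and display thread to the mHandlerCheckers list. Finally it creates a BinderThreadMonitor and registers it through addMonitor, which hands it to mMonitorChecker, the HandlerChecker of the foreground thread.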
public void addMonitor(Monitor monitor) {
synchronized (this) {
if (isAlive()) {
throw new RuntimeException("Monitors can't be added once the Watchdog is running");
}
mMonitorChecker.addMonitor(monitor);
}
}
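The code above only starts the Watchdog service; Watchdog does not yet know which system services it should monitor. To keep the Watchdog module independent and extensible, system services register themselves with Watchdog. Watchdog offers two kinds of monitoring: one uses the monitor() callback to check whether a service's critical section is deadlocked or blocked, and the other posts a message to check whether a service's thread is blocked. ActivityManagerService, for example, first implements the Watchdog.Monitor interface: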
public class ActivityManagerService extends ActivityManagerNativeEx
implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
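It then registers itself with Watchdog in its constructor. Note that there are two checks here. The first is addMonitor: in every check cycle Watchdog uses the foreground thread's HandlerChecker to call back the monitor() method the service registered, which acquires the lock on the service's critical section and releases it immediately, to detect a deadlock or blockage in that section. The second is addThread: Watchdog periodically posts a message to the service's thread through a HandlerChecker to detect whether that thread is blocked. This is why a Watchdog restart reports one of two messages, "Blocked in handler on ..." or "Blocked in monitor ...", each corresponding to a different kind of blockage.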
Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);
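Finally the class implements the monitor method required by Watchdog.Monitor. While running, Watchdog calls this method back every 30 seconds to lock the critical section once. If the lock still cannot be acquired after 60 seconds, the service is considered deadlocked and the device has to be restarted.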
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
synchronized (this) { }
}
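The empty synchronized block acts as a lock probe: it returns only when the lock can be acquired, so a caller that never sees it return knows the lock is stuck. The following is a minimal sketch of that idea in plain Java, outside the Android framework. The class LockProbeDemo and the helpers probe and probeWhileHeld are hypothetical names introduced only for illustration, and the timeouts are arbitrary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone demo; not part of the Android framework.
class LockProbeDemo {
    /** Same shape as monitor() above: returns only once the lock can be taken. */
    static void monitor(Object lock) {
        synchronized (lock) { }
    }

    /** Returns true if monitor(lock) finished within timeoutMillis. */
    static boolean probe(Object lock, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            monitor(lock);
            done.countDown();
        }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Runs the probe while another thread holds the lock, simulating a stuck service. */
    static boolean probeWhileHeld(Object lock, long timeoutMillis) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        try {
            held.await(); // make sure the lock is really held before probing
            return probe(lock, timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown(); // let the holder thread exit
        }
    }

    public static void main(String[] args) {
        Object lock = new Object();
        System.out.println("free lock probed in time: " + probe(lock, 1000));
        System.out.println("held lock probed in time: " + probeWhileHeld(lock, 200));
    }
}
```

The probe itself is cheap because the block body is empty; all the information is in whether the call returns.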
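As the analysis above shows, the Watchdog constructor wraps the foreground thread, the main thread, and the other common threads in HandlerChecker objects. This class is what actually performs Watchdog's timeout detection. There are multiple HandlerChecker instances: every service that registers itself with Watchdog through addThread gets its own HandlerChecker instance.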
public void addThread(Handler thread) {
addThread(thread, DEFAULT_TIMEOUT);
}
public void addThread(Handler thread, long timeoutMillis) {
synchronized (this) {
if (isAlive()) {
throw new RuntimeException("Threads can't be added once the Watchdog is running");
}
final String name = thread.getLooper().getThread().getName();
mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
}
}
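HandlerChecker implements Runnable. Each HandlerChecker runs on the thread of the Handler it was given and performs its check there, so the checkers do not interfere with one another.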
/**
* Used for checking status of handle threads and scheduling monitor callbacks.
*/
public final class HandlerChecker implements Runnable {
private final Handler mHandler;
private final String mName;
private final long mWaitMax;
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
private boolean mCompleted;
private Monitor mCurrentMonitor;
private long mStartTime;
HandlerChecker(Handler handler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
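Every service registered through addThread has its own HandlerChecker instance, so who checks the services registered through addMonitor()? The answer is the mMonitorChecker seen earlier, that is, the HandlerChecker of the foreground thread. Besides detecting whether its own thread is blocked, it also calls back the monitor() methods registered by system services to detect deadlocks or blockages in their critical sections.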
public void scheduleCheckLocked() {
if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if mCheckReboot is false and we have no
// monitors, since those would need to be executed at this point.
mCompleted = true;
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);
}
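If the thread's message queue is not blocked, postAtFrontOfQueue quickly triggers the HandlerChecker's run method. The foreground thread's HandlerChecker calls back the monitor method of each monitored service, locking its critical section and releasing it immediately to check for a deadlock or blockage. For the other threads, run only needs to set mCompleted to true to show that the message has been processed.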
@Override
public void run() {
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
}
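The post-and-flag mechanism can be tried outside Android. The sketch below is an illustration under stated assumptions, not the framework code: a single-thread ExecutorService stands in for the Looper and Handler, so the check is queued at the back rather than at the front as postAtFrontOfQueue would do. ThreadCheckerDemo and checkCompletes are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for HandlerChecker on a plain JVM.
class ThreadCheckerDemo implements Runnable {
    private final ExecutorService target;     // plays the role of the monitored thread
    private volatile boolean completed = true;

    ThreadCheckerDemo(ExecutorService target) {
        this.target = target;
    }

    /** Queues this checker on the target thread unless a check is already in flight. */
    void scheduleCheck() {
        if (!completed) {
            return;
        }
        completed = false;
        target.execute(this);
    }

    /** Runs on the target thread; reaching this point proves the thread is responsive. */
    @Override
    public void run() {
        completed = true;
    }

    boolean isCompleted() {
        return completed;
    }

    /** Schedules one check, waits waitMillis, and reports whether the check completed. */
    static boolean checkCompletes(boolean blockTarget, long waitMillis) {
        ExecutorService target = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            if (blockTarget) {
                // Simulate a stuck thread: this task never finishes on its own.
                target.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException ignored) {
                    }
                });
            }
            ThreadCheckerDemo checker = new ThreadCheckerDemo(target);
            checker.scheduleCheck();
            Thread.sleep(waitMillis);
            return checker.isCompleted();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown();
            target.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("responsive thread: " + checkCompletes(false, 300));
        System.out.println("blocked thread:    " + checkCompletes(true, 300));
    }
}
```

The watcher never touches the monitored thread directly; it only observes whether the flag was flipped back, which is why one slow thread cannot stall the checks on the others.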
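If a service's message loop is blocked, mCompleted stays false. In every check cycle Watchdog calls getCompletionStateLocked on each HandlerChecker in turn to measure how long its check has been pending. If any monitored thread has not responded for 30 s, Watchdog dumps stack traces early in preparation for a restart; if it has not responded for 60 s, Watchdog enters the restart flow.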
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
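To make the thresholds concrete, here is a small stand-alone sketch of the same classification as a pure function. The numeric values of the four constants are an assumption chosen so that a larger value means a worse state; CompletionStateDemo and completionState are hypothetical names.

```java
// Hypothetical stand-alone version of the classification in getCompletionStateLocked.
class CompletionStateDemo {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Classifies a pending check by how long it has waited relative to waitMaxMillis. */
    static int completionState(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // 60 s, matching the timeout described in the text
        System.out.println(completionState(false, 10_000, waitMax)); // still WAITING
        System.out.println(completionState(false, 30_000, waitMax)); // WAITED_HALF
        System.out.println(completionState(false, 60_000, waitMax)); // OVERDUE
    }
}
```

Note that the boundaries are half-open: exactly half of the timeout already counts as WAITED_HALF, and exactly the full timeout already counts as OVERDUE.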
Watchdog main loop
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final ArrayList<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;
synchronized (this) {
long timeout = CHECK_INTERVAL;
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
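for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
hc.scheduleCheckLocked();
}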
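At the start of each check cycle the timeout counter is reset. A check is then scheduled in turn on each thread registered in the Watchdog constructor (foreground thread, main thread, UI thread, I/O thread, display thread) and on every thread registered by a service through addThread, to find out whether it is blocked.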
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
long start = SystemClock.uptimeMillis(); // uptimeMillis() excludes time spent asleep; services sleep with the device and cannot respond, so counting that time would cause false kills
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
} // CHECK_INTERVAL defaults to 30 s; this is the first wait. The longest Watchdog waits before declaring a deadlock is 1 min
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
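The inner loop recomputes the remaining time after every wakeup, so an early return from wait (a notify, an interrupt, or a spurious wakeup) cannot shorten the interval. A minimal sketch of the same pattern on a plain JVM follows. It assumes System.nanoTime() as a stand-in for SystemClock.uptimeMillis(); unlike uptimeMillis(), its behavior while the machine sleeps is platform-dependent. IntervalWaitDemo and its methods are hypothetical names.

```java
// Hypothetical stand-alone demo of the "wait for the full interval" loop.
class IntervalWaitDemo {
    /** Blocks for at least intervalMillis, re-waiting after early wakeups; returns elapsed ms. */
    static long waitFullInterval(Object lock, long intervalMillis) {
        long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the loop above, keep waiting for the rest of the interval.
                }
                long elapsed = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMillis - elapsed;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Runs waitFullInterval while another thread calls notifyAll() early. */
    static long waitWithEarlyNotify(long intervalMillis, long notifyAfterMillis) {
        final Object lock = new Object();
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(notifyAfterMillis);
            } catch (InterruptedException ignored) {
                return;
            }
            synchronized (lock) {
                lock.notifyAll();
            }
        }, "early-notifier");
        notifier.setDaemon(true);
        notifier.start();
        return waitFullInterval(lock, intervalMillis);
    }

    public static void main(String[] args) {
        System.out.println("elapsed ms: " + waitWithEarlyNotify(300, 50));
    }
}
```

Even though the notifier wakes the waiter after about 50 ms, the method still returns only after the full 300 ms have elapsed.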
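Once the 30-second wait is over, Watchdog checks whether the messages posted earlier have been processed. evaluateCheckerCompletionLocked iterates over all HandlerCheckers and returns the largest waitState value. There are four waitState values. COMPLETED means the message has been processed and the thread is not blocked. WAITING means the message has been pending for less than 30 seconds and Watchdog should keep waiting. WAITED_HALF means it has been pending for at least 30 but less than 60 seconds, so the thread may be blocked; the stacks of the current process are dumped through ActivityManagerService.dumpStackTraces so that they are available if the timeout does occur. OVERDUE means it has been pending for 60 seconds or more, at which point Watchdog moves on to the next stage, dumps stack information, and restarts the phone.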
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false; // all services are healthy; reset
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
ActivityManagerService.dumpStackTraces(true, pids, null, null,
NATIVE_STACKS_OF_INTEREST);
waitedHalf = true;
}
continue;
}
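Because the loop acts on the largest state across all checkers, a single slow thread is enough to move the whole Watchdog to WAITED_HALF or OVERDUE. The aggregation can be sketched as follows. The constant values are assumed to be ordered from best to worst, and WorstStateDemo and worstState are hypothetical names, not the framework's.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of aggregating per-checker states into one value.
class WorstStateDemo {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Returns the largest (worst) state among all checkers, or COMPLETED if there are none. */
    static int worstState(List<Integer> checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        // Four healthy threads and one that has waited more than half the timeout.
        List<Integer> states = Arrays.asList(COMPLETED, COMPLETED, WAITING, COMPLETED, WAITED_HALF);
        System.out.println(worstState(states) == WAITED_HALF);
    }
}
```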
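At this point a Watchdog timeout has occurred, but evaluateCheckerCompletionLocked does not care which service is blocked; it only returns the largest waitState across all checkers. getBlockedCheckersLocked is now called to determine exactly which checkers are blocked, and describeCheckersLocked builds the reason for the blockage. This is the description of the blocking cause that we see in dropbox. The Java-level and kernel call stacks are then dumped in turn.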
// something is overdue!
blockedCheckers = getBlockedCheckersLocked(); // Watchdog timed out: find which checkers are blocked
subject = describeCheckersLocked(blockedCheckers); // build the description of the blockage
allowRestart = mAllowRestart; // whether a restart is allowed
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
// Pass !waitedHalf so that just in case we somehow wind up here without having
// dumped the halfway stacks, we properly re-initialize the trace file.
final File stack = ActivityManagerService.dumpStackTraces(
!waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(2000);
// Pull our own kernel thread stacks as well if we're configured for that
if (RECORD_KERNEL_THREADS) {
dumpKernelStackTraces();
}
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, stack, null);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
for (int i=0; i
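Watchdog writes the error to dropbox and then checks whether the debugger attached through the activity controller can handle this Watchdog timeout. If the activity controller does not ask for a restart, the timeout is ignored and the Watchdog loop starts over from the top. Otherwise Watchdog kills SystemServer and the phone restarts.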
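The final decision depends on three inputs visible in the code above: the activity controller's answer, whether a debugger is or was connected, and mAllowRestart. As a reading aid, the branching can be condensed into a pure function. This is a sketch of the logic as shown in the excerpt, not the framework's own structure; KillDecisionDemo, Action, and decide are hypothetical names.

```java
// Hypothetical condensation of the end-of-loop decision in the Watchdog main loop.
class KillDecisionDemo {
    enum Action { KEEP_WAITING, SKIP_KILL_DEBUGGER, SKIP_KILL_NOT_ALLOWED, KILL }

    /**
     * @param controllerResult null when no activity controller is set; otherwise the
     *                         value returned by systemNotResponding (>= 0 keeps waiting)
     * @param debuggerWasConnected greater than 0 if a debugger is or was attached
     * @param allowRestart the value of mAllowRestart
     */
    static Action decide(Integer controllerResult, int debuggerWasConnected, boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Action.KEEP_WAITING; // the controller asked to keep waiting
        }
        if (debuggerWasConnected > 0) {
            return Action.SKIP_KILL_DEBUGGER; // never kill while being debugged
        }
        if (!allowRestart) {
            return Action.SKIP_KILL_NOT_ALLOWED;
        }
        return Action.KILL;
    }

    public static void main(String[] args) {
        System.out.println(decide(null, 0, true));
        System.out.println(decide(1, 0, true));
        System.out.println(decide(-1, 2, true));
        System.out.println(decide(null, 0, false));
    }
}
```

Only the last branch actually kills the system process; every other outcome leaves it running and logs why.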