前言:之前在梳理SystemServer的时候有注意到Watchdog的初始化,很早之前也听说过看门狗,梳理一下。
参考:https://blog.csdn.net/fu_kevin0606/article/details/64479489
“对手机系统而言,因为肩负着接听电话和接收短信的“重任”,所以被寄予7x24小 时正常工作的希望。但是作为一个在嵌入式设备上运行的操作系统,Android运行中必须面对各种软硬件干扰,从最简单的代码出现死锁或者被阻塞,到内存越界导致的内存破坏,或者由于硬件问题导致的内存反转,甚至是极端工作环境下出现的CPU电子迁移和存储器消磁。这一切问题都可能导致系统服务发生难以预料的崩溃和死机。
想解决这一问题,可以从正反两个方向出发,其一是提高软硬件在极端状态下的可靠性,如进行程序终止性验证,或选用抗辐射加固器件。但是基于成本考虑,普通的手机系统很难做到完全不出故障;另一个方法是及时发现系统崩溃并重启系统。手机系统的大部分的故障都会在重启后消失,不会影响继续使用。所以简单的办法是,如果检测到系统不正常了,将设备重新启动,这样用户就能继续使用了。那么如何才能判断系统是否正常呢。在早期的手机平台上通常的做法是在设备中增加一个硬件看门狗,软件系统必须定 时的向看门狗硬件中写值来表示自己没出故障(俗称“喂狗”),否则超过了规定的时间看门狗就会重新启动设备。
硬件看门狗的问题是它的功能比较单一,只能监控整个系统。早期的手机操作系统大多是单任务的,硬件看门狗勉强能胜任。Android的SystemServer是一个非常复杂的进程,里面运行的服务超过五十种,是最可能出问题的进程,因此有必要对SystemServer中运行的各种线程实施监控。但是如果使用硬件看门狗的工作方式,每个线程隔一段时间去喂狗,不但非常浪费CPU,而且会导致程序设计更加复杂。因此Android开发了WatchDog类作为软件看门狗来监控SystemServer中的线程。一旦发现问题,WatchDog会杀死SystemServer进程。
SystemServer的父进程Zygote接收到SystemServer的死亡信号后,会杀死自己。Zygote进程死亡的信号传递到Init进程后,Init进程会杀死Zygote进程所有的子进程并重启Zygote。这样整个手机相当于重启一遍。通常SystemServer出现问题和kernel并没有关系,所以这种“软重启”大部分时候都能够解决问题。而且这种“软重启”的速度更快,对用户的影响也更小。”
盗一下图=-=
Watchdog之前梳理SystemServer的时候有遇到,所以还是先从SystemServer开始吧。
Watchdog在SystemServer中主要完成了init和start的操作
private void startOtherServices() {
...
traceBeginAndSlog("InitWatchdog");
final Watchdog watchdog = Watchdog.getInstance();
watchdog.init(context, mActivityManagerService);
traceEnd();
...
traceBeginAndSlog("StartWatchdog");
Watchdog.getInstance().start();
traceEnd();
其实在SystemServer完成init和start WatchDog之前AMS已经调用了WatchDog的addMonitor和addThread了,并且实现了Watchdog.Monitor接口。
原因:SystemServer先调用startBootstrapServices启动起ams,SystemServiceManager启动代码会对应的通过反射调用构造方法。
mActivityManagerService = mSystemServiceManager.startService(
ActivityManagerService.Lifecycle.class).getService();
AMS中与Watchdog相关代码:
public class ActivityManagerService extends IActivityManager.Stub
implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
synchronized (this) { }
}
这个monitor方法其实就是获取一下锁,然后快速释放,测试下有没有死锁。
// Note: This method is invoked on the main thread but may need to attach various
// handlers to other threads. So take care to be explicit about the looper.
public ActivityManagerService(Context systemContext) {
...
Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);
...
}
看下watchdog对应方法
AMS中调用的:
public static Watchdog getInstance() {
if (sWatchdog == null) {
sWatchdog = new Watchdog();
}
return sWatchdog;
}
getInstance方法表明了Watchdog是个单例模式,但是没有加synchronize=-=
private Watchdog() {
super("watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
// The shared foreground thread is the main checker. It is where we
// will also dispatch monitor checks and do other work.
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
mOpenFdMonitor = OpenFdMonitor.create();
// See the notes on DEFAULT_TIMEOUT.
assert DB ||
DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}
这边HandlerCheckers添加成员:foreground thread/main thread/ui thread/io thread/display thread
public void addMonitor(Monitor monitor) {
synchronized (this) {
if (isAlive()) {
throw new RuntimeException("Monitors can't be added once the Watchdog is running");
}
mMonitorChecker.addMonitor(monitor);
}
}
public void addThread(Handler thread) {
addThread(thread, DEFAULT_TIMEOUT);
}
public void addThread(Handler thread, long timeoutMillis) {
synchronized (this) {
if (isAlive()) {
throw new RuntimeException("Threads can't be added once the Watchdog is running");
}
final String name = thread.getLooper().getThread().getName();
mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
}
}
addMonitor和addThread看起来就是简单的ArrayList拓展
SystemServer中调用的:
public void init(Context context, ActivityManagerService activity) {
mResolver = context.getContentResolver();
mActivity = activity;
context.registerReceiver(new RebootRequestReceiver(),
new IntentFilter(Intent.ACTION_REBOOT),
android.Manifest.permission.REBOOT, null);
}
注册了个监听器
final class RebootRequestReceiver extends BroadcastReceiver {
@Override
public void onReceive(Context c, Intent intent) {
if (intent.getIntExtra("nowait", 0) != 0) {
rebootSystem("Received ACTION_REBOOT broadcast");
return;
}
Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
}
}
收到ACTON_REBOOT会重启
至于start方法,首先看下Watchdog是继承的Thread,所以Watchdog首先是一个线程,跑在SystemServer的线程。
那么start方法其实是调用的run方法
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final List blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;
synchronized (this) {
long timeout = CHECK_INTERVAL;
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
for (int i=0; i 0) {
debuggerWasConnected--;
}
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
boolean fdLimitTriggered = false;
if (mOpenFdMonitor != null) {
fdLimitTriggered = mOpenFdMonitor.monitor();
}
if (!fdLimitTriggered) {
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
ArrayList pids = new ArrayList();
pids.add(Process.myPid());
ActivityManagerService.dumpStackTraces(true, pids, null, null,
getInterestingNativePids());
waitedHalf = true;
}
continue;
}
// something is overdue!
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
} else {
blockedCheckers = Collections.emptyList();
subject = "Open FD high water mark reached";
}
allowRestart = mAllowRestart;
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
ArrayList pids = new ArrayList<>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
// Pass !waitedHalf so that just in case we somehow wind up here without having
// dumped the halfway stacks, we properly re-initialize the trace file.
final File stack = ActivityManagerService.dumpStackTraces(
!waitedHalf, pids, null, null, getInterestingNativePids());
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(2000);
// Pull our own kernel thread stacks as well if we're configured for that
if (RECORD_KERNEL_THREADS) {
dumpKernelStackTraces();
}
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, stack, null);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
一长段代码,慢慢分析吧=-=
先看到while(ture)的无线循环,说明Watchdog是一直在跑着的,里面接着一个for循环:
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
for (int i=0; i
之前应该注意到AMS在构造函数里传给Watchdog两个参数,monitor和handler,handler就是这边for循环取出来的。
看下内部类HandlerChecker,实现了Runnable
public final class HandlerChecker implements Runnable {
...
scheduleCheckLocked方法就是将handler放在消息队列的开头
public void scheduleCheckLocked() {
if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if mCheckReboot is false and we have no
// monitors, since those would need to be executed at this point.
mCompleted = true;
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);
}
之后调用
@Override
public void run() {
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
从我分析来看通过下面加入的HandlerChecker是没有addMonitor过的,说到底就是发个消息没有具体的逻辑,将mCompleted这个标志位置位false,但还是要处理的,表示监控的Service 的handler loop没有问题,处理完了会把mCompleted置为true,表示完成。
Watchdog.getInstance().addThread(mHandler);
只有通过以下代码加入到mMonitorChecker是有monitor方法实现的,比如AMS实现了。
Watchdog.getInstance().addMonitor(this);
mMonitorChecker其实一开始就加入到了mHandlerChecks里面了,相当于将monitor方法剥离出来。
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
这块代码应该是计算timeout的,默认timeout是CHECK_INTERVAL是30s,这里注意使用了uptimeMillis()
/**
* Returns milliseconds since boot, not counting time spent in deep sleep.
*
* @return milliseconds of non-sleep uptime since boot.
*/
@CriticalNative
native public static long uptimeMillis();
这个时间是不算deep sleep的时间的,我理解为是监控的Service喂狗的真实时间,watchdog睡眠也许会大于30s的。
看下evaluateCheckerCompletionLocked方法,遍历所有的HandlerChecker,取state的最大值
private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
这边mCompleted是衡量消息有没有被处理的标志位
static final long DEFAULT_TIMEOUT = DB ? 10*1000 : 60*1000;
再结合这边的代码对评估的结果进行对应处理
if (!fdLimitTriggered) {
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
ArrayList pids = new ArrayList();
pids.add(Process.myPid());
ActivityManagerService.dumpStackTraces(true, pids, null, null,
getInterestingNativePids());
waitedHalf = true;
}
continue;
}
对应处理如下
allowRestart = mAllowRestart;
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
ArrayList pids = new ArrayList<>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
// Pass !waitedHalf so that just in case we somehow wind up here without having
// dumped the halfway stacks, we properly re-initialize the trace file.
final File stack = ActivityManagerService.dumpStackTraces(
!waitedHalf, pids, null, null, getInterestingNativePids());
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(2000);
// Pull our own kernel thread stacks as well if we're configured for that
if (RECORD_KERNEL_THREADS) {
dumpKernelStackTraces();
}
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, stack, null);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
看起来就是eventlog+dumpsys+dropbox
,留下重启原因,
并检查activity controller连接的调试器是否可以处理这次watchdog无响应,如果可以则继续
,否则调用Process.killProcess(Process.myPid())和System.exit(10)重启。
现在分析来看watchdog就是给所有监控对象下发一个任务,隔了个30s后检查,然后通过检查结果判断系统是否出了问题,超时60s一般会导致重启。