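Preface
A watchdog is a mechanism for monitoring whether a system is running properly, implemented through a combination of software and hardware. Software that is running normally "feeds the dog" after it finishes specific instructions. If the watchdog receives no feed signal from the software within a set period, the system is considered faulty, and the watchdog either enters an interrupt handler or forces a system reset. Depending on the platform they run on, watchdogs are divided into hardware watchdogs and software watchdogs.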
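The feed-and-timeout idea can be sketched in a few lines of plain Java. This is only an illustration of the general concept, not Android code: the class name SimpleWatchdog and its methods are hypothetical, and the current time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Minimal software watchdog sketch: remembers when it was last fed. */
final class SimpleWatchdog {
    private final long timeoutMillis;
    private final AtomicLong lastFedAtMillis;

    SimpleWatchdog(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastFedAtMillis = new AtomicLong(nowMillis);
    }

    /** Called by the monitored code after it completes a unit of work. */
    void feed(long nowMillis) {
        lastFedAtMillis.set(nowMillis);
    }

    /** Called periodically by the monitor; true means no feed arrived in time. */
    boolean isExpired(long nowMillis) {
        return nowMillis - lastFedAtMillis.get() > timeoutMillis;
    }
}
```

A monitor would call isExpired on a timer and trigger its recovery action (an interrupt handler or a reset, in the terms used above) once it returns true.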
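Android's SystemServer is a very complex and very important process that hosts many system services, so the service threads running inside it need to be monitored. Android provides the Watchdog class as a software watchdog for these system service threads. Once it detects that a system service is stuck, Watchdog kills the SystemServer process.
The source code analysis in this article is based on Android Pie (http://androidxref.com/9.0.0_r3).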
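I. How Watchdog Works
Watchdog mainly serves system services and key threads. It periodically posts a task to the message queue of each monitored thread and checks whether that task completes within the specified time. If it does not complete in time, the thread is treated as deadlocked: Watchdog records the event, performs follow-up dumps, and then kills the current SystemServer process. SystemServer's parent process, Zygote, kills itself when it receives the signal that SystemServer has died. When the death of Zygote is reported to the Init process, Init kills all of Zygote's child processes, restarts Zygote, and reports a Framework exception, so the whole phone effectively restarts.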
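The "post a task and see whether it comes back in time" principle can be sketched outside Android as follows. This is a simplified illustration under stated assumptions: a single-threaded ExecutorService stands in for a thread with a Looper and message queue, and the class and method names (QueueProbe, isResponsive) are hypothetical. The real Watchdog does not block on a latch; it posts a HandlerChecker and polls its state later.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class QueueProbe {
    /** Returns true if the worker ran a no-op probe task within timeoutMillis. */
    static boolean isResponsive(ExecutorService worker, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        // The probe does nothing except report that the worker got around to it.
        worker.execute(done::countDown);
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

If the worker is stuck in an earlier task, the probe never runs and isResponsive returns false once the timeout elapses, which is the condition Watchdog treats as a hang.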
II. Watchdog Source Code Analysis
In SystemServer, Watchdog goes through two main steps: init and start.
/frameworks/base/services/java/com/android/server/SystemServer.java:
private void startOtherServices() {
    ...
    traceBeginAndSlog("InitWatchdog");
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.init(context, mActivityManagerService);
    traceEnd();
    ...
    traceBeginAndSlog("StartWatchdog");
    Watchdog.getInstance().start();
    traceEnd();
    ...
}
The Watchdog class is a singleton:
frameworks/base/services/core/java/com/android/server/Watchdog.java:
public static Watchdog getInstance() {
    if (sWatchdog == null) {
        sWatchdog = new Watchdog();
    }

    return sWatchdog;
}
The Watchdog constructor:
private Watchdog() {
    super("watchdog");
    // Initialize handler checkers for each common thread we want to check.  Note
    // that we are not currently checking the background thread, since it can
    // potentially hold longer running operations with no guarantees about the timeliness
    // of operations there.

    // The shared foreground thread is the main checker.  It is where we
    // will also dispatch monitor checks and do other work.
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread.  We only do a quick check since there
    // can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread.
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread.
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread.
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());

    mOpenFdMonitor = OpenFdMonitor.create();

    // See the notes on DEFAULT_TIMEOUT.
    assert DB ||
            DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}
When the Watchdog instance is created, the main system threads (foreground, main, ui, i/o, display) are added to the mHandlerCheckers list as HandlerChecker objects.
An important class inside Watchdog is HandlerChecker:
/**
 * Used for checking status of handle threads and scheduling monitor callbacks.
 */
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }

    public void addMonitor(Monitor monitor) {
        mMonitors.add(monitor);
    }

    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            // If the target looper has recently been polling, then
            // there is no reason to enqueue our checker on it since that
            // is as good as it not being deadlocked.  This avoid having
            // to do a context switch to check the thread.  Note that we
            // only do this if mCheckReboot is false and we have no
            // monitors, since those would need to be executed at this point.
            mCompleted = true;
            return;
        }

        if (!mCompleted) {
            // we already have a check in flight, so no need
            return;
        }

        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        mHandler.postAtFrontOfQueue(this);
    }

    public boolean isOverdueLocked() {
        return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
    }

    public int getCompletionStateLocked() {
        if (mCompleted) {
            return COMPLETED;
        } else {
            long latency = SystemClock.uptimeMillis() - mStartTime;
            if (latency < mWaitMax/2) {
                return WAITING;
            } else if (latency < mWaitMax) {
                return WAITED_HALF;
            }
        }
        return OVERDUE;
    }

    public Thread getThread() {
        return mHandler.getLooper().getThread();
    }

    public String getName() {
        return mName;
    }

    public String describeBlockedStateLocked() {
        if (mCurrentMonitor == null) {
            return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
        } else {
            return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                    + " on " + mName + " (" + getThread().getName() + ")";
        }
    }

    @Override
    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor();
        }

        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}
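HandlerChecker is what Watchdog uses to detect whether the main system threads (foreground, main, ui, i/o, display) are blocked. It works through the MessageQueue of each Handler's Looper: the checker posts itself to the front of the queue, and if it has not run within the allowed time, the thread is considered stuck.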
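The time-based classification in getCompletionStateLocked can be isolated as a pure function, which makes the thresholds easier to see. This is a sketch, not the AOSP code: the class name CheckerState is hypothetical, and the numeric values of the four constants are not shown in the excerpt above, so they are assumed here to increase with lateness.

```java
final class CheckerState {
    // Assumed values, ordered from healthy to most overdue.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Classifies one checker from its completion flag and how long it has been pending. */
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;      // pending for less than half of the timeout
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;  // pending for at least half, but not yet the full timeout
        }
        return OVERDUE;          // pending for the full timeout or longer
    }
}
```

With a 60000 ms timeout, a check that has been pending for 30000 ms is already WAITED_HALF, and one pending for exactly 60000 ms is OVERDUE.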
Watchdog is itself a thread class. Its run method is as follows:
@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final List<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }

            boolean fdLimitTriggered = false;
            if (mOpenFdMonitor != null) {
                fdLimitTriggered = mOpenFdMonitor.monitor();
            }

            if (!fdLimitTriggered) {
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
            } else {
                blockedCheckers = Collections.emptyList();
                subject = "Open FD high water mark reached";
            }
            allowRestart = mAllowRestart;
        }

        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Pass !waitedHalf so that just in case we somehow wind up here without having
        // dumped the halfway stacks, we properly re-initialize the trace file.
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, getInterestingNativePids());

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(2000);

        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked.  (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    mActivity.addErrorToDropBox(
                            "watchdog", null, "system_server", null, null,
                            subject, null, stack, null);
                }
            };
        dropboxThread.start();
        try {
            dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}
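The inner wait loop in run recomputes the remaining time after every wake-up, so an early notify or a spurious wake-up cannot shorten the check interval. The sketch below shows the same pattern in plain Java. It is an illustration under assumptions: FullIntervalWait and waitFullInterval are hypothetical names, and System.nanoTime() stands in for SystemClock.uptimeMillis(), which is only available on Android.

```java
final class FullIntervalWait {
    /** Blocks on lock for at least intervalMillis and returns the elapsed milliseconds. */
    static long waitFullInterval(Object lock, long intervalMillis) {
        final long startNanos = System.nanoTime();
        long elapsedMillis = 0;
        synchronized (lock) {
            long remaining = intervalMillis;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // As in the loop above, an interrupt does not end the wait early.
                }
                // Recompute from the fixed start time instead of trusting wait().
                elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000L;
                remaining = intervalMillis - elapsedMillis;
            }
        }
        return elapsedMillis;
    }
}
```

Measuring from a fixed start time, rather than subtracting the requested wait on each pass, keeps the total wait bounded below by the interval no matter how many times the thread is woken.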
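Watchdog's run method is an infinite loop. Each iteration mainly does the following:
1. Iterate over all HandlerCheckers and call scheduleCheckLocked on each one, which records the start time;
2. Wait 30s (CHECK_INTERVAL);
3. Evaluate the state of the checkers with evaluateCheckerCompletionLocked, which iterates over every HandlerChecker and takes the largest return value;
4. Act on the evaluated state and maintain the waitedHalf flag:
When the state is COMPLETED, every check has finished and the threads are healthy; waitedHalf is reset to false;
When the state is WAITING, a check is still pending but has waited less than 30s, so Watchdog keeps waiting;
When the state is WAITED_HALF and waitedHalf is false, a check has been blocked for more than 30s; Watchdog dumps the thread stacks and sets waitedHalf to true;
When the state is OVERDUE, a check has been blocked for more than 60s, and Watchdog kills the system_server process, which restarts the system.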
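The aggregation in step 3 and the branching in step 4 can be sketched together as two small functions. This is not the AOSP implementation: WatchdogDecision, worstState and nextStep are hypothetical names, the constant values are assumed to increase with lateness, and the returned strings only describe what the real loop does at each branch.

```java
import java.util.List;

final class WatchdogDecision {
    // Assumed values, ordered from healthy to most overdue.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** The overall state is the worst (largest) state among all checkers. */
    static int worstState(List<Integer> checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    /** Describes the loop's next step for a given overall state and waitedHalf flag. */
    static String nextStep(int worstState, boolean waitedHalf) {
        switch (worstState) {
            case COMPLETED:
                return "reset waitedHalf and continue";
            case WAITING:
                return "continue";
            case WAITED_HALF:
                return waitedHalf ? "continue" : "dump stacks, set waitedHalf, continue";
            default:
                return "collect diagnostics and kill system_server";
        }
    }
}
```

Because the states are ordered by lateness, taking the maximum means a single overdue thread is enough to put the whole loop on the kill path, even if every other checker has completed.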
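Note: when the evaluated state is OVERDUE, Watchdog does the following before restarting the system:
a. Writes an EventLog entry;
b. Dumps the stacks of system_server, of the phone process (when mPhonePid is set), and of the native processes returned by getInterestingNativePids(), appending to the trace file if the halfway dump has already been written;
c. Asks the kernel to dump all blocked threads (doSysRq('w'));
d. Asks the kernel to dump backtraces on all CPUs (doSysRq('l'));
e. Writes the error to the dropbox;
f. Checks whether a debugger is attached. If not, and if restart is allowed and no activity controller asks to keep waiting, it logs that the system process is being killed and kills it, which restarts the system.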
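The dropbox step runs on a separate thread that is joined with a timeout, so a deadlocked ActivityManager cannot hang the watchdog itself. The pattern can be sketched in plain Java as below. BoundedCall and runWithTimeout are hypothetical names, and marking the helper as a daemon thread is an assumption made for this demo so that a stuck task does not keep the JVM alive; the Watchdog code above does not do that.

```java
final class BoundedCall {
    /**
     * Runs task on a helper thread and waits at most timeoutMillis (must be > 0)
     * for it. Returns true if the task finished in time, false if it is still running.
     */
    static boolean runWithTimeout(Runnable task, String threadName, long timeoutMillis) {
        Thread helper = new Thread(task, threadName);
        helper.setDaemon(true); // demo-only assumption, see the note above
        helper.start();
        try {
            helper.join(timeoutMillis); // note: join(0) would wait forever
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !helper.isAlive();
    }
}
```

The caller continues after the timeout whether or not the task finished, which matches the best-effort nature of the dropbox write: the diagnostics are useful, but killing the hung process must not depend on them.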
Finally, to make the Watchdog workflow easier to follow, the flow is drawn as a diagram:
[Figure: Watchdog workflow]