The Watchdog Mechanism

http://blog.csdn.net/guoqifa29/article/details/42400943

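1. addMonitor()
Services such as WindowManagerService, ActivityManagerService, PowerManagerService, NetworkManagementService, MountService, and InputManagerService implement Watchdog.Monitor and register themselves through Watchdog.getInstance().addMonitor(this), which adds them to the Watchdog.mMonitorChecker.mMonitors list. Watchdog repeatedly calls monitor() on every entry in that list. The method is very simple: it only acquires the corresponding lock. If the lock is held by a deadlocked or otherwise blocked thread, it cannot be acquired, so monitor() itself blocks. Watchdog uses this behavior to decide whether system_server is deadlocked.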
    public void monitor() {
        synchronized (mLock) { }
        nativeMonitor(mPtr);
    }
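To make the lock-probe idea concrete, here is a minimal sketch in plain Java, without any Android classes. LockProbeDemo, probe(), and holdLockFor() are hypothetical names invented for this illustration; only the shape of monitor() follows the snippet above. The sketch assumes that "monitor() did not return within a timeout" is taken as the sign of a stuck lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a service that exposes a lock probe.
class LockProbeDemo {
    private final Object lock = new Object();

    // Same shape as the monitor() above: returns only once the lock is free.
    void monitor() {
        synchronized (lock) { }
    }

    // Runs monitor() on a helper thread and reports whether it returned
    // within timeoutMs. A false result means the lock was still held.
    boolean probe(long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> { monitor(); done.countDown(); }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Holds the lock on a daemon thread for holdMs to simulate a stuck service.
    // Returns once the lock has actually been acquired.
    void holdLockFor(long holdMs) {
        CountDownLatch acquired = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            synchronized (lock) {
                acquired.countDown();
                try { Thread.sleep(holdMs); } catch (InterruptedException ignored) { }
            }
        }, "holder");
        t.setDaemon(true);
        t.start();
        try {
            acquired.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        LockProbeDemo demo = new LockProbeDemo();
        System.out.println("free lock answered: " + demo.probe(200));
        demo.holdLockFor(2000);
        System.out.println("held lock answered: " + demo.probe(200));
    }
}
```

The probe is expected to succeed while the lock is free and to time out while another thread holds it, which is the signal Watchdog looks for.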
2. addThread()
The main-thread Handlers of WindowManagerService, PowerManagerService, PackageManagerService, and ActivityManagerService are saved in the Watchdog.mHandlerCheckers list. The mMonitorChecker from point 1 is stored in the same list, and so are the Handlers of UiThread, IOThread, and MainThread. All of these threads run inside the system_server process, which gives Handlers for 8 threads in total. Watchdog keeps checking whether each thread's Looper is idle; if a Looper stays busy the whole time, the thread must be blocked.
UiThread: mainly used for overlay displays (OverlayDisplay), showing the touch pointer (PointerEventDispatcher), and possibly PhoneWindowManager.
IOThread: mainly used by BluetoothManagerService, JobStore, the OBB operations in MountService, writeSessionsAsync() in PackageInstallerService, Tethering, the handling of the ACTION_PAC_REFRESH broadcast in PacManager, and mWatchLogHandler in TvInputManagerService.
MainThread: presumably the main thread of SystemServer.
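The per-thread check can be sketched without Android as follows. This is only an illustration under assumptions: a single-thread ExecutorService stands in for a Handler/Looper, and QueueChecker, QueueCheckerDemo, and respondsWithin() are hypothetical names. The field names completed and startTimeMs mirror the mCompleted and mStartTime fields mentioned in this article. The idea is that the checker posts itself to the watched thread; if the thread is alive it runs the checker and flips the flag, and if it is blocked the flag stays false.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical per-thread checker; the executor stands in for a Handler.
class QueueChecker implements Runnable {
    private final ExecutorService queue;
    private boolean completed = true;
    private long startTimeMs;

    QueueChecker(ExecutorService queue) { this.queue = queue; }

    // Posts this checker to the watched thread unless a check is pending.
    synchronized void scheduleCheck() {
        if (!completed) return;            // previous check still outstanding
        completed = false;
        startTimeMs = System.nanoTime() / 1_000_000;
        queue.execute(this);
    }

    // Runs on the watched thread; reaching this proves the thread is alive.
    @Override public synchronized void run() { completed = true; }

    synchronized boolean isCompleted() { return completed; }

    // Milliseconds since the outstanding check was posted (0 if none).
    synchronized long pendingForMs() {
        return completed ? 0 : System.nanoTime() / 1_000_000 - startTimeMs;
    }
}

class QueueCheckerDemo {
    // True if a thread that is busy for busyMs answers a check within waitMs.
    static boolean respondsWithin(long busyMs, long waitMs) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "watched");
            t.setDaemon(true);
            return t;
        });
        if (busyMs > 0) {
            worker.execute(() -> sleep(busyMs));   // simulate a blocked thread
        }
        QueueChecker checker = new QueueChecker(worker);
        checker.scheduleCheck();
        sleep(waitMs);
        boolean ok = checker.isCompleted();
        worker.shutdownNow();
        return ok;
    }

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("idle thread answered: " + respondsWithin(0, 300));
        System.out.println("busy thread answered: " + respondsWithin(2000, 300));
    }
}
```

An idle worker should answer almost immediately, while a worker stuck in a long task leaves the check pending, and pendingForMs() then grows until the task finishes.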
3. run()

 

    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                // ① mHandlerCheckers holds mMonitorChecker plus the HandlerCheckers of
                // 7 important threads (the most important ones in system_server). This
                // loop checks the key locks registered with mMonitorChecker (deadlock
                // detection) and whether the message queues of the 7 threads are idle
                // (blocked-thread detection). Both are judged by timeout.
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    // ② If a thread is deadlocked or blocked, its HandlerChecker keeps
                    // mCompleted == false and mStartTime stays at the time the pending
                    // check was scheduled. These two fields drive the timeout decision.
                    hc.scheduleCheckLocked();
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                // ③ Wait for CHECK_INTERVAL, which is 30 seconds by default
                // (half of the 60-second timeout).
                while (timeout > 0) {
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                // ④ Evaluate the wait state from mCompleted and mStartTime.
                final int waitState = evaluateCheckerCompletionLocked();

                if (waitState == COMPLETED) {
                    // ⑤ Everything ran smoothly: go to the next iteration, so a
                    // healthy system is checked once per CHECK_INTERVAL.
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // ⑤ Blocked for less than 30 seconds: keep checking.
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    // ⑤ Blocked for 30 to 60 seconds: dump stacks once, keep checking.
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }

                // something is overdue!
                // More than 60 seconds have passed, so something is wrong.
                // Collect the blocked checkers.
                blockedCheckers = getBlockedCheckersLocked();
                // Describe the blocked threads in one string for the event log below.
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            // ⑥ ActivityManagerService.dumpStackTraces() calls
            // Process.sendSignal(firstPids.get(i), Process.SIGNAL_QUIT) to send signal 3
            // to system_server so that the VM prints its traces. It also collects traces
            // of the three native processes /system/bin/mediaserver, /system/bin/sdcard,
            // and /system/bin/surfaceflinger, and prints the CPU usage.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            // (This puts the watchdog thread to sleep for 2 seconds.)
            SystemClock.sleep(2000);
            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                // ⑥ Also dump the kernel traces, mainly the information found in
                // /proc/%pid/task and /proc/%tid/stack.
                dumpKernelStackTraces();
            }

            // Trigger the kernel to dump all blocked threads to the kernel log
            try {
                // ⑥ Writing the character "w" to /proc/sysrq-trigger makes the kernel
                // write its blocked threads to the kernel log.
                FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");

                sysrq_trigger.write("w");
                sysrq_trigger.close();
            } catch (IOException e) {
                Slog.e(TAG, "Failed to write to /proc/sysrq-trigger");
                Slog.e(TAG, e.getMessage());
            }

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            // ⑦ Create a thread that adds the error to DropBox.
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                // ⑧ Kill system_server; the system restarts.
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }
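The wait states that run() branches on can be illustrated with a small, self-contained sketch. It assumes that a checker counts as WAITING below half of its timeout, WAITED_HALF between half and the full timeout, and overdue after that, which matches the 30-second and 60-second annotations in the listing. It also assumes that the overall state is the worst state among all checkers. CompletionState, stateOf(), worst(), and the OVERDUE constant are hypothetical names for this illustration, not the real Watchdog code.

```java
// Hypothetical, simplified evaluation of checker states.
class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // State of one checker from its completed flag, start time, and timeout.
    static int stateOf(boolean completed, long startMs, long nowMs, long waitMaxMs) {
        if (completed) return COMPLETED;
        long latency = nowMs - startMs;
        if (latency < waitMaxMs / 2) return WAITING;
        if (latency < waitMaxMs) return WAITED_HALF;
        return OVERDUE;
    }

    // Overall state: the worst (largest) state among all checkers.
    static int worst(int... states) {
        int result = COMPLETED;
        for (int s : states) {
            result = Math.max(result, s);
        }
        return result;
    }

    public static void main(String[] args) {
        long timeout = 60_000;  // assumed 60-second timeout
        int a = stateOf(true, 0, 45_000, timeout);   // finished its check
        int b = stateOf(false, 0, 45_000, timeout);  // pending for 45 seconds
        System.out.println("overall state: " + worst(a, b));
    }
}
```

Taking the maximum means that a single blocked checker is enough to move the whole watchdog into the half-way or overdue branch, even when every other thread is healthy.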

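Summary: Watchdog runs in its own thread. Once per CHECK_INTERVAL (30 seconds by default) it checks several important locks in system_server (those of WindowManagerService, ActivityManagerService, PowerManagerService, NetworkManagementService, MountService, InputManagerService, and others), and it also checks whether the message queues of the 7 most important threads (WindowManagerService, PowerManagerService, PackageManagerService, ActivityManagerService, UiThread, IOThread, MainThread) are idle. It then uses mCompleted and mStartTime to decide whether anything has been blocked for more than 60 seconds. If a timeout has occurred, it dumps the trace logs and the kernel trace logs, and finally kills SystemServer so that the system restarts.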
