WatchDog 工作原理

作者 :-gityuan

链接:http://android.jobbole.com/84881/


一、概述


Android系统中,有硬件WatchDog用于定时检测关键硬件是否正常工作,类似地,在framework层有一个软件WatchDog用于定期检测关键系统服务是否发生死锁事件。WatchDog功能主要是分析系统核心服务和重要线程是否处于Blocked状态。


  • 监视reboot广播;

  • 监视mMonitors关键系统服务是否死锁。


二、启动流程


2.1 startOtherServices


[-> SystemServer.java]


private void startOtherServices() {

    ...

    //创建watchdog【见小节2.2】

    final Watchdog watchdog = Watchdog.getInstance();

    //注册reboot广播【见小节2.3】

    watchdog.init(context, mActivityManagerService);

    ...

    mSystemServiceManager.startBootPhase(SystemService.PHASE_LOCK_SETTINGS_READY);

    ...

    mActivityManagerService.systemReady(new Runnable() {

       @Override

       public void run() {

           mSystemServiceManager.startBootPhase(

                   SystemService.PHASE_ACTIVITY_MANAGER_READY);

           ...

           // watchdog启动【见小节3.1】

           Watchdog.getInstance().start();

           mSystemServiceManager.startBootPhase(

                   SystemService.PHASE_THIRD_PARTY_APPS_CAN_START);

        }

    }

}


2.2 getInstance


[-> Watchdog.java]


public static Watchdog getInstance() {

    if (sWatchdog == null) {

        //单例模式,创建实例对象【见小节2.2.1 】

        sWatchdog = new Watchdog();

    }

    return sWatchdog;

}


2.2.1 创建Watchdog


[-> Watchdog.java]


public class Watchdog extends Thread {

    ...

 

    private Watchdog() {

        super("watchdog");

        //【见小节2.2.2 】

        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),

                "foreground thread", DEFAULT_TIMEOUT);

        mHandlerCheckers.add(mMonitorChecker);

        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),

                "main thread", DEFAULT_TIMEOUT));

        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),

                "ui thread", DEFAULT_TIMEOUT));

        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),

                "i/o thread", DEFAULT_TIMEOUT));

        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),

                "display thread", DEFAULT_TIMEOUT));

        //【见小节2.2.3】

        addMonitor(new BinderThreadMonitor());

    }

 

}


Watchdog继承于Thread,创建的线程名为”watchdog”。mHandlerCheckers是记录着所有的HandlerChecker对象的列表。


Watchdog监控的线程有:


线程名 对应handler 含义
main thread new Handler(MainLooper) 当前主线程
android.fg FgThread.getHandler 前台线程
android.ui UiThread.getHandler UI线程
android.io IoThread.getHandler I/O线程
android.display DisplayThread.getHandler Display线程


DEFAULT_TIMEOUT默认为60s,调试时才为10s方便找出潜在的ANR问题。


2.2.2 HandlerChecker


[-> Watchdog.java]


public final class HandlerChecker implements Runnable {

    ...

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {

        mHandler = handler;

        mName = name;

        mWaitMax = waitMaxMillis;

        mCompleted = true;

    }

}


mMonitors记录所有Watchdog目前正在监控的服务。


2.2.3 监控Binder线程


在小节【2.2.1】创建Watchdog时,通过addMonitor(new BinderThreadMonitor())来监控Binder线程, 这里拆分两步骤:


  • addMonitor

  • new BinderThreadMonitor


2.2.3.1 addMonitor


public class Watchdog extends Thread {

    public void addMonitor(Monitor monitor) {

        synchronized (this) {

            if (isAlive()) {

                throw new RuntimeException("Monitors can't be added once the Watchdog is running");

            }

            //此处mMonitorChecker数据类型为HandlerChecker

            mMonitorChecker.addMonitor(monitor);

        }

    }

 

    public final class HandlerChecker implements Runnable {

        private final ArrayList mMonitors = new ArrayList();

 

        public void addMonitor(Monitor monitor) {

            //将上面的BinderThreadMonitor添加到mMonitors队列

            mMonitors.add(monitor);

        }

        ...

    }

}


将monitor添加到HandlerChecker的成员变量mMonitors列表中。


2.2.3.2 BinderThreadMonitor


private static final class BinderThreadMonitor implements Watchdog.Monitor {

    public void monitor() {

        Binder.blockUntilThreadAvailable();

    }

}


blockUntilThreadAvailable最终调用的是IPCThreadState,等待有空闲的binder线程


void IPCThreadState::blockUntilThreadAvailable()

{

    pthread_mutex_lock(&mProcess->mThreadCountLock);

    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {

        //等待正在执行的binder线程小于进程最大binder线程上限(16个)

        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);

    }

    pthread_mutex_unlock(&mProcess->mThreadCountLock);

}


可见addMonitor(new BinderThreadMonitor())是将Binder线程添加到android.fg线程的handler(mMonitorChecker)来检查是否工作正常。


2.2.4 Monitor


public class Watchdog extends Thread {

    public interface Monitor {

        void monitor();

    }

}


能够被Watchdog监控的系统服务都实现了Watchdog.Monitor接口。 实现该接口类:


  • ActivityManagerService

  • PowerManagerService

  • WindowManagerService

  • InputManagerService

  • NetworkManagementService

  • MountService

  • NativeDaemonConnector

  • BinderThreadMonitor

  • MediaProjectionManagerService

  • MediaRouterService

  • MediaSessionService


2.3 init


[-> Watchdog.java]


public void init(Context context, ActivityManagerService activity) {

    mResolver = context.getContentResolver();

    mActivity = activity;

    //注册reboot广播接收者【见小节2.3.1】

    context.registerReceiver(new RebootRequestReceiver(),

            new IntentFilter(Intent.ACTION_REBOOT),

            android.Manifest.permission.REBOOT, null);

}


2.3.1 RebootRequestReceiver


[-> Watchdog.java]


final class RebootRequestReceiver extends BroadcastReceiver {

    @Override

    public void onReceive(Context c, Intent intent) {

        if (intent.getIntExtra("nowait", 0) != 0) {

            //【见小节2.3.2】

            rebootSystem("Received ACTION_REBOOT broadcast");

            return;

        }

        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);

    }

}


2.3.2 rebootSystem


[-> Watchdog.java]


void rebootSystem(String reason) {

    Slog.i(TAG, "Rebooting system because: " + reason);

    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);

    try {

        //通过PowerManager执行reboot操作

        pms.reboot(false, reason, false);

    } catch (RemoteException ex) {

    }

}


最终是通过PowerManagerService来完成重启操作,具体的重启流程后续会单独讲述。


2.4 小节


获取watchdog实例对象,并注册reboot广播


  • mHandlerCheckers记录所有的HandlerChecker对象的列表,包括foreground, main, ui, i/o, display线程的handler;

  • mMonitors记录所有Watchdog目前正在监控Monitor,此处为BinderThreadMonitor;

  • 注册reboot广播,最终是通过PowerManagerService来完成;

  • 系统每间隔1分钟,执行一次monitor操作, 当系统hang时间超过1分钟则执行run()方法;


下面来看看当系统hang超过1分钟时,进入watchdog.run()的过程:


三、Watchdog


3.1 run


public void run() {

    boolean waitedHalf = false;

    while (true) {

        final ArrayList blockedCheckers;

        final String subject;

        final boolean allowRestart;

        int debuggerWasConnected = 0;

        synchronized (this) {

            //timeout=30s

            long timeout = CHECK_INTERVAL;

            for (int i=0; i<mHandlerCheckers.size(); i++) {

                HandlerChecker hc = mHandlerCheckers.get(i);

                //【见小节3.2】

                hc.scheduleCheckLocked();

            }

 

            if (debuggerWasConnected > 0) {

                debuggerWasConnected--;

            }

 

            long start = SystemClock.uptimeMillis();

            //等待30s

            while (timeout > 0) {

                if (Debug.isDebuggerConnected()) {

                    debuggerWasConnected = 2;

                }

                try {

                    wait(timeout);

                } catch (InterruptedException e) {

                    Log.wtf(TAG, e);

                }

                if (Debug.isDebuggerConnected()) {

                    debuggerWasConnected = 2;

                }

                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);

            }



            
//获取等待状态【见小节3.3】

            final int waitState = evaluateCheckerCompletionLocked();

            if (waitState == COMPLETED) {

                waitedHalf = false;

                continue;

            } else if (waitState == WAITING) {

                continue;

            } else if (waitState == WAITED_HALF) {

                if (!waitedHalf) {

                    //第一次进入等待时间过半的状态

                    ArrayList pids = new ArrayList();

                    pids.add(Process.myPid());

                    //则输出栈信息【见小节3.4】

                    ActivityManagerService.dumpStackTraces(true, pids, null, null,

                            NATIVE_STACKS_OF_INTEREST);

                    waitedHalf = true;

                }

                continue;

            }

            //获取被阻塞的checkers

            blockedCheckers = getBlockedCheckersLocked();

            subject = describeCheckersLocked(blockedCheckers);

            allowRestart = mAllowRestart;

        }

 

        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

 

        ArrayList pids = new ArrayList();

        pids.add(Process.myPid());

        if (mPhonePid > 0) pids.add(mPhonePid);

        //waitedHalf=true,则追加输出栈信息【见小节3.4】

        final File stack = ActivityManagerService.dumpStackTraces(

                !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

        //系统已被阻塞1分钟,也不在乎多等待2s来确保stack trace信息输出

        SystemClock.sleep(2000);

 

        if (RECORD_KERNEL_THREADS) {

            //输出kernel栈信息【见小节3.5】

            dumpKernelStackTraces();

        }

 

        //触发kernel来dump所有阻塞线程【见小节3.6】

        doSysRq('l');

        //输出dropbox信息【见小节3.7】

        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {

                public void run() {

                    mActivity.addErrorToDropBox(

                            "watchdog", null, "system_server", null, null,

                            subject, null, stack, null);

                }

            };

        dropboxThread.start();

        try {

            //等待dropbox线程工作2s

            dropboxThread.join(2000);

        } catch (InterruptedException ignored) {}

 

        IActivityController controller;

        synchronized (this) {

            controller = mController;

        }

        if (controller != null) {

            //将阻塞状态报告给activity controller,

            try {

                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");

                //返回值为1表示继续等待,-1表示杀死系统

                int res = controller.systemNotResponding(subject);

                if (res >= 0) {

                    waitedHalf = false; //继续等待

                    continue;

                }

            } catch (RemoteException e) {

            }

        }

 

        //当debugger没有attach时,才杀死进程

        if (Debug.isDebuggerConnected()) {

            debuggerWasConnected = 2;

        }

        if (debuggerWasConnected >= 2) {

            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");

        } else if (debuggerWasConnected > 0) {

            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");

        } else if (!allowRestart) {

            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");

        } else {

            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);

            //遍历输出阻塞线程的栈信息

            for (int i=0; i<blockedCheckers.size(); i++) {

                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");

                StackTraceElement[] stackTrace

                        = blockedCheckers.get(i).getThread().getStackTrace();

                for (StackTraceElement element: stackTrace) {

                    Slog.w(TAG, "    at " + element);

                }

            }

            Slog.w(TAG, "*** GOODBYE!");

            //杀死进程system_server【见小节3.8】

            Process.killProcess(Process.myPid());

            System.exit(10);

        }

 

        waitedHalf = false;

    }

}



3.2 scheduleCheckLocked


public final class HandlerChecker implements Runnable {

    ...

    public void scheduleCheckLocked() {

        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {

            mCompleted = true;

            return;

        }

 

        if (!mCompleted) {

            return; //有一个check正在处理中,则无需重复发送

        }

 

        mCompleted = false;

        mCurrentMonitor = null;

        mStartTime = SystemClock.uptimeMillis();

        //发送消息,插入消息队列最开头【见3.2.1】

        mHandler.postAtFrontOfQueue(this);

    }

}


postAtFrontOfQueue(this),该方法输入参数为Runnable对象,根据消息机制,回调HandlerChecker中的run方法。


3.2.1 HandlerChecker.run


public final class HandlerChecker implements Runnable {

    public void run() {

        final int size = mMonitors.size();

        for (int i = 0 ; i < size ; i++) {

            synchronized (Watchdog.this) {

                mCurrentMonitor = mMonitors.get(i);

            }

            //回调具体服务的monitor方法

            mCurrentMonitor.monitor();

        }

 

        synchronized (Watchdog.this) {

            mCompleted = true;

            mCurrentMonitor = null;

        }

    }

}


回调的方法,例如BinderThreadMonitor.monitor


3.3 evaluateCheckerCompletionLocked


private int evaluateCheckerCompletionLocked() {

    int state = COMPLETED;

    for (int i=0; i<mHandlerCheckers.size(); i++) {

        HandlerChecker hc = mHandlerCheckers.get(i);

        //【见小节3.3.1】

        state = Math.max(state, hc.getCompletionStateLocked());

    }

    return state;

}


获取mHandlerCheckers列表中等待状态值最大的state.


3.3.1 getCompletionStateLocked


public int getCompletionStateLocked() {

    if (mCompleted) {

        return COMPLETED;

    } else {

        long latency = SystemClock.uptimeMillis() - mStartTime;

        if (latency < mWaitMax/2) {

            return WAITING;

        } else if (latency < mWaitMax) {

            return WAITED_HALF;

        }

    }

    return OVERDUE;

}


  • COMPLETED = 0:等待完成;

  • WAITING = 1:等待时间小于DEFAULT_TIMEOUT的一半,即30s;

  • WAITED_HALF = 2:等待时间处于30s~60s之间;

  • OVERDUE = 3:等待时间大于或等于60s。


3.4 AMS.dumpStackTraces


public static File dumpStackTraces(boolean clearTraces, ArrayList firstPids,

        ProcessCpuTracker processCpuTracker, SparseArray lastPids, String[] nativeProcs) {

    //默认为 data/anr/traces.txt

    String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);

    if (tracesPath == null || tracesPath.length() == 0) {

        return null;

    }

 

    File tracesFile = new File(tracesPath);

    try {

        //当clearTraces,则删除已存在的traces文件

        if (clearTraces && tracesFile.exists()) tracesFile.delete();

        //创建traces文件

        tracesFile.createNewFile();

        // -rw-rw-rw-

        FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1);

    } catch (IOException e) {

        return null;

    }

    //输出trace内容

    dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);

    return tracesFile;

}


关于trace内容,这里就不细说,直接说说结论:


  1. 调用Process.sendSignal()向目标进程发送信号SIGNAL_QUIT;

  2. 分别调用backtrace.dump_backtrace(),输出/system/bin/mediaserver,/system/bin/sdcard,/system/bin/surfaceflinger这3个进程的backtrace;

  3. 统计CPU使用率;

  4. 调用Process.sendSignal()向其他进程发送信号SIGNAL_QUIT。


3.5 dumpKernelStackTraces


private File dumpKernelStackTraces() {

    // 路径为data/anr/traces.txt

    String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);

    if (tracesPath == null || tracesPath.length() == 0) {

        return null;

    }

 

    native_dumpKernelStacks(tracesPath);

    return new File(tracesPath);

}


native_dumpKernelStacks调用到android_server_Watchdog.dumpKernelStacks


3.6 doSysRq


private void doSysRq(char c) {

    try {

        FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");

        sysrq_trigger.write(c);

        sysrq_trigger.close();

    } catch (IOException e) {

        Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);

    }

}


通过向节点/proc/sysrq-trigger写入字符,触发kernel来dump所有阻塞线程,输出所有CPU的backtrace到kernel log。


3.7 dropBox


关于dropbox已在dropBox源码篇详细讲解过,输出文件到/data/system/dropbox,比如system_app_crash。


3.8 killProcess


Process.killProcess已经在文章理解杀进程的实现原理已详细讲解,通过发送信号9给目标进程来完成杀进程的过程。


当杀死system_server进程,从而导致zygote进程自杀,进而触发init执行重启Zygote进程,这便出现了手机framework重启的现象。


四、小结


watchdog在check过程中出现阻塞1分钟的情况,则会输出:


  • AMS.dumpStackTraces

  • kill -3

  • backtrace.dump_backtrace()

  • dumpKernelStackTraces,输出kernel栈信息

  • dropBox,输出文件到/data/system/dropbox


WatchDog 工作原理_第1张图片

关于Java和Android大牛频道

Java和Android大牛频道是一个数万人关注的探讨Java和Android开发的公众号,分享和原创最有价值的干货文章,让你成为这方面的大牛!

我们探讨android和Java开发最前沿的技术:android性能优化 ,插件化,跨平台,动态化,加固和反破解等,也讨论设计模式/软件架构等。由群来自BAT的工程师组成的团队

关注即送红包,回复:“百度” 、“阿里”、“腾讯” 有惊喜!!!关注后可用入微信群。群里都是来自百度阿里腾讯的大牛。

欢迎关注我们,一起讨论技术,扫描和长按下方的二维码可快速关注我们。搜索微信公众号:JANiubility。

WatchDog 工作原理_第2张图片

公众号:JANiubility

你可能感兴趣的:(Android框架源码)