应用与系统稳定性第二篇---ANR的监测与信息采集

第一篇文章-应用与系统稳定性第一篇---ANR问题分析的一般套路主要梳理了几个ANR案例,讲了分析ANR问题的一般思路。本文基于Android O,深入理解ANR发生时系统的处理逻辑,加深对ANR问题的实战处理。实际处理ANR问题中,对于很多的信息还是比较陌生。比如ANR的trace是怎么输出来的?有时候为什么ANR的trace是无效的?一个进程发生了ANR的过程中又发生了ANR,系统是怎么处理的?什么是后台ANR?等等小问题如果你都能回答,那么可以不用看了。

一、ANR的trace是如何生成的

1、appNotResponding方法解读

发生ANR的时候,一般先搜索ANR in获取最直观的信息,如下:

06-16 16:16:28.590  1853  2073 E ActivityManager: ANR in com.android.camera (com.android.camera/.Camera)
06-16 16:16:28.590  1853  2073 E ActivityManager: PID: 27661
06-16 16:16:28.590  1853  2073 E ActivityManager: Reason: Input dispatching timed out (com.android.camera/com.android.camera.Camera, Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago.  Wait queue length: 24.  Wait queue head age: 5511.1ms.)
06-16 16:16:28.590  1853  2073 E ActivityManager: Load: 16.25 / 29.48 / 38.33
06-16 16:16:28.590  1853  2073 E ActivityManager: CPU usage from 0ms to 8058ms later:
06-16 16:16:28.590  1853  2073 E ActivityManager:   58% 291/mediaserver: 51% user + 6.7% kernel / faults: 2457 minor 4 major
06-16 16:16:28.590  1853  2073 E ActivityManager:   27% 317/mm-qcamera-daemon: 21% user + 5.8% kernel / faults: 15965 minor
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.4% 288/debuggerd: 0% user + 0.3% kernel / faults: 21615 minor 87 major
06-16 16:16:28.590  1853  2073 E ActivityManager:   17% 27661/com.android.camera: 10% user + 6.8% kernel / faults: 2412 minor 34 major
06-16 16:16:28.590  1853  2073 E ActivityManager:   16% 1853/system_server: 10% user + 6.4% kernel / faults: 1754 minor 87 major
06-16 16:16:28.590  1853  2073 E ActivityManager:   10% 539/sensors.qcom: 7.8% user + 2.6% kernel / faults: 16 minor
06-16 16:16:28.590  1853  2073 E ActivityManager:   4.4% 277/surfaceflinger: 1.8% user + 2.6% kernel / faults: 14 minor
06-16 16:16:28.590  1853  2073 E ActivityManager:   4% 203/mmcqd/0: 0% user + 4% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   2.6% 3510/com.android.phone: 1.9% user + 0.6% kernel / faults: 1148 minor 8 major
06-16 16:16:28.590  1853  2073 E ActivityManager:   2.1% 2902/com.android.systemui: 1.6% user + 0.4% kernel / faults: 1272 minor 32 major
06-16 16:16:28.590  1853  2073 E ActivityManager:   1.6% 3110/com.miui.whetstone: 1.6% user + 0% kernel / faults: 2614 minor 22 major
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.8% 99/kswapd0: 0% user + 0.8% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   1.4% 217/jbd2/mmcblk0p25: 0% user + 1.4% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   1.4% 223/logd: 0.7% user + 0.7% kernel / faults: 4 minor
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.9% 12808/kworker/0:1: 0% user + 0.9% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.8% 35/kworker/u:2: 0% user + 0.8% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0% 3222/com.miui.sysbase: 0% user + 0% kernel / faults: 1314 minor 12 major
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.8% 3446/com.android.nfc: 0.4% user + 0.3% kernel / faults: 1223 minor 9 major
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.7% 10866/kworker/u:1: 0% user + 0.7% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.6% 642/mdss_fb0: 0% user + 0.6% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.6% 29336/kworker/u:7: 0% user + 0.6% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.4% 6/kworker/u:0: 0% user + 0.4% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.4% 22924/kworker/u:6: 0% user + 0.4% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.3% 4421/mpdecision: 0% user + 0.3% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.2% 276/servicemanager: 0.1% user + 0.1% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.2% 289/rild: 0.2% user + 0% kernel / faults: 20 minor
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 4161/mcd: 0% user + 0% kernel / faults: 9 minor 1 major
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 3/ksoftirqd/0: 0% user + 0.1% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 5/kworker/0:0H: 0% user + 0.1% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 7/kworker/u:0H: 0% user + 0.1% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0% 215/flush-179:0: 0% user + 0% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 321/displayfeature: 0.1% user + 0% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 368/irq/33-cpubw_hw: 0% user + 0.1% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 403/qmuxd: 0% user + 0.1% kernel / faults: 60 minor
06-16 16:16:28.590  1853  2073 E ActivityManager:   0% 3491/com.xiaomi.finddevice: 0% user + 0% kernel / faults: 706 minor
06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 29330/ksoftirqd/1: 0% user + 0.1% kernel
06-16 16:16:28.590  1853  2073 E ActivityManager: 96% TOTAL: 56% user + 29% kernel + 6.3% iowait + 4.1% softirq

应用在发生ANR的时候,不论是Provider,input还是BroadcastReceiver和Service处理超时等,系统都会调用appNotResponding方法,进行trace的收集,主要记录当时的案发现场,各个相关线程的状态一目了然。下面从appNotResponding入手一起看看源码是怎么写的。

这个方法的参数有5个,如下:

  • app 当前发生ANR的进程
  • activity 发生ANR的界面
  • parent 发生ANR的界面的上一级界面
  • aboveSystem 这个参数系统中用的全部都是False,还没有看到有true的地方
  • annotation 发生ANR的原因
776    final void appNotResponding(ProcessRecord app, ActivityRecord activity,
777            ActivityRecord parent, boolean aboveSystem, final String annotation) {
            // 填充firstPids和lastPids数组,从最近运行进程中挑选
            // firstPids: 用于保存ANR进程及其父进程,system_server进程和persistent的进程
            // lastPids: 用于保存除firstPids外的其他进程
778        ArrayList firstPids = new ArrayList(5);
779        SparseArray lastPids = new SparseArray(20);
780       
781        if (mService.mController != null) {
782            try {
783                // 0 == continue, -1 = kill process immediately
784                int res = mService.mController.appEarlyNotResponding(
785                        app.processName, app.pid, annotation);
786                if (res < 0 && app.pid != MY_PID) {
787                    app.kill("anr", true);
788                }
789            } catch (RemoteException e) {
790                mService.mController = null;
791                Watchdog.getInstance().setActivityController(null);
792            }
793        }
794
795        long anrTime = SystemClock.uptimeMillis();
              //更新Cpu的信息
796        if (ActivityManagerService.MONITOR_CPU_USAGE) {
797            mService.updateCpuStatsNow();
798        }
799
800        // Unless configured otherwise, swallow ANRs in background processes & kill the process.
    //showBackground意味在后台的时候是否让ANR有感知,如果showBackground为fasle就不会弹出ANR的dialog了
801        boolean showBackground = Settings.Secure.getInt(mContext.getContentResolver(),
802                Settings.Secure.ANR_SHOW_BACKGROUND, 0) != 0;
803
804        boolean isSilentANR;
805
806        synchronized (mService) {
         //这些场景的ANR直接做忽略处理(关机,app已经ANR了,APP处在crash流程中,被AMS或者其他人杀死的)
807            // PowerManager.reboot() can block for a long time, so ignore ANRs while shutting down.
808            if (mService.mShuttingDown) {
809                Slog.i(TAG, "During shutdown skipping ANR: " + app + " " + annotation);
810                return;
811            } else if (app.notResponding) {
812                Slog.i(TAG, "Skipping duplicate ANR: " + app + " " + annotation);
813                return;
814            } else if (app.crashing) {
815                Slog.i(TAG, "Crashing app skipping ANR: " + app + " " + annotation);
816                return;
817            } else if (app.killedByAm) {
818                Slog.i(TAG, "App already killed by AM skipping ANR: " + app + " " + annotation);
819                return;
820            } else if (app.killed) {
821                Slog.i(TAG, "Skipping died app ANR: " + app + " " + annotation);
822                return;
823            }
824
825            // In case we come through here for the same app before completing
826            // this one, mark as anring now so we will bail out.
827            app.notResponding = true;
828
829            // Log the ANR to the event log.常常搜索的am_anr关键字,记录ANR信息到eventlog中
830            EventLog.writeEvent(EventLogTags.AM_ANR, app.userId, app.pid,
831                    app.processName, app.info.flags, annotation);
832
833            // Dump thread traces as quickly as we can, starting with "interesting" processes.
         //ANR进程放到firstPids集合中
834            firstPids.add(app.pid);
835
836            // Don't dump other PIDs if it's a background ANR
         //如果是后台的ANR,就不dump了
837            isSilentANR = !showBackground && !isInterestingForBackgroundTraces(app);
838            if (!isSilentANR) {
839                int parentPid = app.pid;
840                if (parent != null && parent.app != null && parent.app.pid > 0) {
841                    parentPid = parent.app.pid;
842                }
           //把ANR的父进程也记录firstPids中
843                if (parentPid != app.pid) firstPids.add(parentPid);
844          //MY_PID是system_server的pid,如果发生的进程不是system,需要把system进程加到firstPids中
845                if (MY_PID != app.pid && MY_PID != parentPid) firstPids.add(MY_PID);
846         //最近运行的进程在mLruProcesses中,遍历这个List
847                for (int i = mService.mLruProcesses.size() - 1; i >= 0; i--) {
848                    ProcessRecord r = mService.mLruProcesses.get(i);
849                    if (r != null && r.thread != null) {
850                        int pid = r.pid;
851                        if (pid > 0 && pid != app.pid && pid != parentPid && pid != MY_PID) {
                  //如果persistent进程,这类是“不死进程”,
                  //这类进程在启动的时候会针对目标用户进程的IApplicationThread接口,注册一个binder讣告监听器,
                  //一旦日后用户进程意外挂掉,AMS就能在第一时间感知到,
                              //并采取相应的措施。如果AMS发现意外挂掉的应用是persistent的,它会尝试重新启动这个应用
852                            if (r.persistent) {
853                                firstPids.add(pid);
854                                if (DEBUG_ANR) Slog.i(TAG, "Adding persistent proc: " + r);
855                            } else if (r.treatLikeActivity) {
856                                firstPids.add(pid);
857                                if (DEBUG_ANR) Slog.i(TAG, "Adding likely IME: " + r);
858                            } else {
                                   //其他的都加到lastPids中
859                                lastPids.put(pid, Boolean.TRUE);
860                                if (DEBUG_ANR) Slog.i(TAG, "Adding ANR proc: " + r);
861                            }
862                        }
863                    }
864                }
865            }
866        }
867
868        // Log the ANR to the main log.输出到main log里面
869        StringBuilder info = new StringBuilder();
870        info.setLength(0);
871        info.append("ANR in ").append(app.processName);
872        if (activity != null && activity.shortComponentName != null) {
873            info.append(" (").append(activity.shortComponentName).append(")");
874        }
875        info.append("\n");
876        info.append("PID: ").append(app.pid).append("\n");
877        if (annotation != null) {
878            info.append("Reason: ").append(annotation).append("\n");
879        }
880        if (parent != null && parent != activity) {
881            info.append("Parent: ").append(parent.shortComponentName).append("\n");
882        }
883  
884        ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);
885
886        // don't dump native PIDs for background ANRs unless it is the process of interest
887        String[] nativeProcs = null;
888        if (isSilentANR) {
        //NATIVE_STACKS_OF_INTEREST是Watchog中定义的几个关键的进程
889            for (int i = 0; i < NATIVE_STACKS_OF_INTEREST.length; i++) {
890                if (NATIVE_STACKS_OF_INTEREST[i].equals(app.processName)) {
891                    nativeProcs = new String[] { app.processName };
892                    break;
893                }
894            }
895        } else {
896            nativeProcs = NATIVE_STACKS_OF_INTEREST;
897        }
898  //获取nativeProcs这些进程的pid
899        int[] pids = nativeProcs == null ? null : Process.getPidsForCommands(nativeProcs);
900        ArrayList nativePids = null;
901
902        if (pids != null) {
903            nativePids = new ArrayList(pids.length);
904            for (int i : pids) {
905                nativePids.add(i);
906            }
907        }
908
909        // For background ANRs, don't pass the ProcessCpuTracker to
910        // avoid spending 1/2 second collecting stats to rank lastPids.
              //
911        File tracesFile = mService.dumpStackTraces(true, firstPids,
912                                                   (isSilentANR) ? null : processCpuTracker,
913                                                   (isSilentANR) ? null : lastPids,
914                                                   nativePids);
915
916        String cpuInfo = null;
917        if (ActivityManagerService.MONITOR_CPU_USAGE) {
           //更新Cpu的信息
918            mService.updateCpuStatsNow();
919            synchronized (mService.mProcessCpuTracker) {
           // 记录ANR之前的cpu使用情况(CPU usage from xxxms to 0ms ago)
920                cpuInfo = mService.mProcessCpuTracker.printCurrentState(anrTime);
921            }
            // 记录从anr时间开始的Cpu负载
922            info.append(processCpuTracker.printCurrentLoad());
923            info.append(cpuInfo);
924        }
925          // 记录从anr时间开始的cpu使用情况(CPU usage from xxms to xxms later)
926        info.append(processCpuTracker.printCurrentState(anrTime));
927
928        Slog.e(TAG, info.toString());
929        if (tracesFile == null) {
930            // 如果trace为空,则发送singal 3到发送ANR的进程,相当于执行adb shell kill -3 pid命令
931            Process.sendSignal(app.pid, Process.SIGNAL_QUIT);
932        }
933      //写到DropBox文件夹里面(data/system/dropbox)
934        mService.addErrorToDropBox("anr", app, app.processName, activity, parent, annotation,
935                cpuInfo, tracesFile, null);
936
937        if (mService.mController != null) {
938            try {
939                // 0 == show dialog, 1 = keep waiting, -1 = kill process immediately
940                int res = mService.mController.appNotResponding(
941                        app.processName, app.pid, info.toString());
942                if (res != 0) {
943                    if (res < 0 && app.pid != MY_PID) {
944                        app.kill("anr", true);
945                    } else {
946                        synchronized (mService) {
947                            mService.mServices.scheduleServiceTimeoutLocked(app);
948                        }
949                    }
950                    return;
951                }
952            } catch (RemoteException e) {
953                mService.mController = null;
954                Watchdog.getInstance().setActivityController(null);
955            }
956        }
957
958        synchronized (mService) {
959            mService.mBatteryStatsService.noteProcessAnr(app.processName, app.uid);
960        //如果是后台ANR,直接kill
961            if (isSilentANR) {
962                app.kill("bg anr", true);
963                return;
964            }
965
966            // Set the app's notResponding state, and look up the errorReportReceiver
967             吧   (app,
968                    activity != null ? activity.shortComponentName : null,
969                    annotation != null ? "ANR " + annotation : "ANR",
970                    info.toString());
971
972            // 弹出无响应对话框
973            Message msg = Message.obtain();
974            HashMap map = new HashMap();
975            msg.what = ActivityManagerService.SHOW_NOT_RESPONDING_UI_MSG;
976            msg.obj = map;
977            msg.arg1 = aboveSystem ? 1 : 0;
978            map.put("app", app);
979            if (activity != null) {
980                map.put("activity", activity);
981            }
982    给android.ui线程发生一个消息,弹出对话框
983            mService.mUiHandler.sendMessage(msg);
984        }
985    }
986

上面的代码撸了一遍,总结一下appNotResponding的业务逻辑:

  • 1、输出ANR Reason信息到Events log中,关键字是am_anr,一般都是用这个信息来确定Anr的发生时间
  • 2、如果是后台的anr,即isSilentANR值为true,那么不显示ANR对话框,直接kill掉ANR所在进程
  • 3、输出ANR 信息到main log中,关键字是ANR in,含有Cpu负载信息,一般都是用这个看ANR发生时候的其他进程的资源占用情况
  • 4、调用dumpStackTraces去生成trace文件,这个trace文件有哪些进程呢?
    a.firstPids队列:第一个是ANR进程,第二个是system_server,剩余是所有persistent进程;
    b.Native队列:是指/system/bin/目录的mediaserver,sdcard 以及surfaceflinger进程;
    c.lastPids队列: 是指mLruProcesses中的不属于firstPids的所有进程。
  • 5、将traces文件和 CPU使用情况信息保存到dropbox,即data/system/dropbox目录
  • 6、弹窗告知用户

除了上面的业务总结,我们发现有几个小知识点

  • 1、关机过程虽然比较耗时,但是不会出现ANR,不必多说,你说人家都关机了,还有提示发生ANR的必要了吗?
  • 2、ANR有两种,前台ANR和后台ANR,我在调试中发现,后台ANR非常容易发生,系统直接kill掉这个应用进程,但是用户没有感知。
  • 3、am_anr关键字比ANR in要先打印,所以更接近ANR的实际发生时间
2、dumpStackTraces(1/2)方法解读

继续看appNotResponding中的核心逻辑dumpStackTraces

http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5492
5483    /**
5484     * If a stack trace dump file is configured, dump process stack traces.
5485     * @param clearTraces causes the dump file to be erased prior to the new
5486     *    traces being written, if true; when false, the new traces will be
5487     *    appended to any existing file content.
5488     * @param firstPids of dalvik VM processes to dump stack traces for first
5489     * @param lastPids of dalvik VM processes to dump stack traces for last
5490     * @param nativePids optional list of native pids to dump stack crawls
5491     */
5492    public static File dumpStackTraces(boolean clearTraces, ArrayList firstPids,
5493            ProcessCpuTracker processCpuTracker, SparseArray lastPids,
5494            ArrayList nativePids) {
5495            ArrayList extraPids = null;
5496            ......
5524        boolean useTombstonedForJavaTraces = false;
5525        File tracesFile;
5526        //如果dalvik.vm.stack-trace-dir没有配置,trace就写到全局文件data/anr/traces.txt中
5527        final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");
5528        if (tracesDirProp.isEmpty()) {
5529            // When dalvik.vm.stack-trace-dir is not set, we are using the "old" trace
5530            // dumping scheme. All traces are written to a global trace file (usually
5531            // "/data/anr/traces.txt") so the code below must take care to unlink and recreate
5532            // the file if requested.
5533            //
5534            // This mode of operation will be removed in the near future.
5535
5536
5537            String globalTracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
5538            if (globalTracesPath.isEmpty()) {
            // 没有配置dalvik.vm.stack-trace-file,则返回null
5539                Slog.w(TAG, "dumpStackTraces: no trace path configured");
5540                return null;
5541            }
5542
5543            tracesFile = new File(globalTracesPath);
5544            try {
            //删除旧的,创建新的
5545                if (clearTraces && tracesFile.exists()) {
5546                    tracesFile.delete();
5547                }
5548
5549                tracesFile.createNewFile();
5550                FileUtils.setPermissions(globalTracesPath, 0666, -1, -1); // -rw-rw-rw-
5551            } catch (IOException e) {
5552                Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesFile, e);
5553                return null;
5554            }
5555        } else {
5556            File tracesDir = new File(tracesDirProp);
5557            // When dalvik.vm.stack-trace-dir is set, we use the "new" trace dumping scheme.
5558            // Each set of ANR traces is written to a separate file and dumpstate will process
5559            // all such files and add them to a captured bug report if they're recent enough.
5560            maybePruneOldTraces(tracesDir);
5561
5562            // NOTE: We should consider creating the file in native code atomically once we've
5563            // gotten rid of the old scheme of dumping and lot of the code that deals with paths
5564            // can be removed.
        // 创建trace文件,格式为anr_yyyy-MM-dd-HH-mm-ss-SSS
5565            tracesFile = createAnrDumpFile(tracesDir);
5566            if (tracesFile == null) {
5567                return null;
5568            }
5569
5570            useTombstonedForJavaTraces = true;
5571        }
5572
5573        dumpStackTraces(tracesFile.getAbsolutePath(), firstPids, nativePids, extraPids,
5574                useTombstonedForJavaTraces);
5575        return tracesFile;
5576    }

这段代码并没有真正的执行dump操作,只是确定了trace文件路径,如果”dalvik.vm.stack-trace-dir”这个属性没有配置,就使用旧的dump策略,trace信息写入到全局trace文件/data/anr/traces.txt中,删除旧的traces文件,创建新的traces文件;如果”dalvik.vm.stack-trace-dir”属性被配置了,就创建格式为anr_yyyy-MM-dd-HH-mm-ss-SSS的文件。

dalvik.vm.stack-trace-file特性
dalvik.vm.stack-trace-file特性允许你指定要将线程堆栈追踪写入的文件名,如果不存在,将创建,新的信息将追加到文件尾,文件名通过-Xstacktracefile参数写入虚拟机。例如:
adb shell setprop dalvik.vm.stack-trace-file /tmp/stack-traces.txt
如果这个特性没有被定义,虚拟机会在收到这个信号时将堆栈追踪信息写入android log。

http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5581
5581    private static synchronized File createAnrDumpFile(File tracesDir) {
5582        if (sAnrFileDateFormat == null) {
5583            sAnrFileDateFormat = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss-SSS");
5584        }
5585
5586        final String formattedDate = sAnrFileDateFormat.format(new Date());
5587        final File anrFile = new File(tracesDir, "anr_" + formattedDate);
5588
5589        try {
5590            if (anrFile.createNewFile()) {
5591                FileUtils.setPermissions(anrFile.getAbsolutePath(), 0600, -1, -1); // -rw-------
5592                return anrFile;
5593            } else {
5594                Slog.w(TAG, "Unable to create ANR dump file: createNewFile failed");
5595            }
5596        } catch (IOException ioe) {
5597            Slog.w(TAG, "Exception creating ANR dump file:", ioe);
5598        }
5599
5600        return null;
5601    }
5602
3、dumpStackTraces(2/2)方法解读
http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5704
5704    private static void dumpStackTraces(String tracesFile, ArrayList firstPids,
5705            ArrayList nativePids, ArrayList extraPids,
5706            boolean useTombstonedForJavaTraces) {
5707
5708        // We don't need any sort of inotify based monitoring when we're dumping traces via
5709        // tombstoned. Data is piped to an "intercept" FD installed in tombstoned so we're in full
5710        // control of all writes to the file in question.
5711        final DumpStackFileObserver observer;
5712        if (useTombstonedForJavaTraces) {
5713            observer = null;
5714        } else {
5715            // Use a FileObserver to detect when traces finish writing.
5716            // The order of traces is considered important to maintain for legibility.
5717            observer = new DumpStackFileObserver(tracesFile);
5718        }
5719
5720        // 所有的dump操作需要在20秒之内完成
5721        long remainingTime = 20 * 1000;
5722        try {
5723            if (observer != null) {
5724                observer.startWatching();
5725            }
5726
5727            // 第一步收集firstPids中进程的trace,超时会自动放弃
5728            if (firstPids != null) {
5729                int num = firstPids.size();
5730                for (int i = 0; i < num; i++) {
5731                    if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for pid "
5732                            + firstPids.get(i));
5733                    final long timeTaken;
5734                    if (useTombstonedForJavaTraces) {
5735                        timeTaken = dumpJavaTracesTombstoned(firstPids.get(i), tracesFile, remainingTime);
5736                    } else {
5737                        timeTaken = observer.dumpWithTimeout(firstPids.get(i), remainingTime);
5738                    }
5739
5740                    remainingTime -= timeTaken;
5741                    if (remainingTime <= 0) {
5742                        Slog.e(TAG, "Aborting stack trace dump (current firstPid=" + firstPids.get(i) +
5743                            "); deadline exceeded.");
5744                        return;
5745                    }
5746
5747                    if (DEBUG_ANR) {
5748                        Slog.d(TAG, "Done with pid " + firstPids.get(i) + " in " + timeTaken + "ms");
5749                    }
5750                }
5751            }
5752
5753              // 第二步收集nativePids中进程的trace,超时会自动放弃
5754            if (nativePids != null) {
5755                for (int pid : nativePids) {
5756                    if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for native pid " + pid);
5757                    final long nativeDumpTimeoutMs = Math.min(NATIVE_DUMP_TIMEOUT_MS, remainingTime);
5758
5759                    final long start = SystemClock.elapsedRealtime();
5760                    Debug.dumpNativeBacktraceToFileTimeout(
5761                            pid, tracesFile, (int) (nativeDumpTimeoutMs / 1000));
5762                    final long timeTaken = SystemClock.elapsedRealtime() - start;
5763
5764                    remainingTime -= timeTaken;
5765                    if (remainingTime <= 0) {
5766                        Slog.e(TAG, "Aborting stack trace dump (current native pid=" + pid +
5767                            "); deadline exceeded.");
5768                        return;
5769                    }
5770
5771                    if (DEBUG_ANR) {
5772                        Slog.d(TAG, "Done with native pid " + pid + " in " + timeTaken + "ms");
5773                    }
5774                }
5775            }
5776
5777            //第三步收集extraPids中进程的trace,超时会自动放弃,这个代码和第一步一样的
5778            if (extraPids != null) {
5779                for (int pid : extraPids) {
5780                    if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for extra pid " + pid);
5781
5782                    final long timeTaken;
5783                    if (useTombstonedForJavaTraces) {
5784                        timeTaken = dumpJavaTracesTombstoned(pid, tracesFile, remainingTime);
5785                    } else {
5786                        timeTaken = observer.dumpWithTimeout(pid, remainingTime);
5787                    }
5788
5789                    remainingTime -= timeTaken;
5790                    if (remainingTime <= 0) {
5791                        Slog.e(TAG, "Aborting stack trace dump (current extra pid=" + pid +
5792                                "); deadline exceeded.");
5793                        return;
5794                    }
5795
5796                    if (DEBUG_ANR) {
5797                        Slog.d(TAG, "Done with extra pid " + pid + " in " + timeTaken + "ms");
5798                    }
5799                }
5800            }
5801        } finally {
5802            if (observer != null) {
5803                observer.stopWatching();
5804            }
5805        }
5806    }

firstPids和extraPids中的进程dump trace有两种方式,看useTombstonedForJavaTraces的取值;

  • 如果useTombstonedForJavaTraces为真,采用dumpJavaTracesTombstoned获取trace
  • 如果useTombstonedForJavaTraces为假,采用DumpStackFileObserver#dumpWithTimeout获取trace

对于nativePids中的进程,只能通过 Debug.dumpNativeBacktraceToFileTimeout来获取trace。

除了上面的总结还需要思考几个小问题

  • 问题1: 利用am_anr来确定ANR的发生时间真的准确吗?
    在上一篇文章就已经说了,不准确,可能由于系统资源比较紧张造成一定程度的滞后,十多秒都可能,此时需要结合eventlog进一步分析。

问题2:ANR中经常出现没有trace,或者无效trace是怎么回事?

  • 从appNotResponding方法中,我们看到有不少return的地方,在dumpStackTraces方法,也会有return出现,比如这么多进程的trace需要在20秒之内完成,在系统资源紧张的时候,能不能完成,需要打一个问号了
  • 有一些ANR问题需要binder对端的trace,通过上面的分析看,大概率是没有对端的trace的,除非binder call到system_server或者比较重要的几个native进程里面
  • 另外发生ANR的时候,线程并没有暂停掉,有可能等到我们去抓这个进程的trace的时候,这个进程的主线程都恢复正常了,比如经常性出现下面的nativePollOnce状态的堆栈。
"main" prio=5 tid=1 Native
 | group="main" sCount=1 dsCount=0 obj=0x7685ba50 self=0x7fb0840a00
 | sysTid=1460 nice=-10 cgrp=default sched=0/0 handle=0x7fb4824aa0
 | state=S schedstat=( 213445473008 99617457476 724519 ) utm=14518 stm=6826 core=0 HZ=100
 | stack=0x7fe82c6000-0x7fe82c8000 stackSize=8MB
 | held mutexes=
 kernel: __switch_to+0x70/0x7c
 kernel: SyS_epoll_wait+0x2d4/0x394
 kernel: SyS_epoll_pwait+0xc4/0x150
 kernel: el0_svc_naked+0x24/0x28
 native: #00 pc 000000000001bf2c  /system/lib64/libc.so (syscall+28)
 native: #01 pc 00000000000e813c  /system/lib64/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+160)
 native: #02 pc 000000000054b6f8  /system/lib64/libart.so (_ZN3artL12GoToRunnableEPNS_6ThreadE+328)
 native: #03 pc 000000000054b56c  /system/lib64/libart.so (_ZN3art12JniMethodEndEjPNS_6ThreadE+28)
 native: #04 pc 00000000008ccc50  /system/framework/arm64/boot-framework.oat (Java_android_os_MessageQueue_nativePollOnce__JI+156)
 at android.os.MessageQueue.nativePollOnce(Native method)
 at android.os.MessageQueue.next(MessageQueue.java:323)
 at android.os.Looper.loop(Looper.java:142)
 at com.android.server.SystemServer.run(SystemServer.java:387)
 at com.android.server.SystemServer.main(SystemServer.java:245)
 at java.lang.reflect.Method.invoke!(Native method)
 at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:904)
 at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:794)
  • 问题3: ANR中的dumpStackTraces是耗时操作,会不会引起系统的ANR?
    我觉得是有可能的,而是在系统稳定性很差的情况下,大量应用都ANR了,都需要dumpStackTraces,那么都有引起系统watchdog的可能。AOSP在P上有多个changelist对此进行优化,有兴趣可以自己查看。
3.1、DumpStackFileObserver方式获取trace
http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5631
5625    /**
5626     * Legacy code, do not use. Existing users will be deleted.
5627     *
5628     * @deprecated
5629     */
5630    @Deprecated
5631    public static class DumpStackFileObserver extends FileObserver {
5632        // Keep in sync with frameworks/native/cmds/dumpstate/utils.cpp
5633        private static final int TRACE_DUMP_TIMEOUT_MS = 10000; // 10 seconds
5634
5635        private final String mTracesPath;
5636        private boolean mClosed;
5637
5638        public DumpStackFileObserver(String tracesPath) {
5639            super(tracesPath, FileObserver.CLOSE_WRITE);
5640            mTracesPath = tracesPath;
5641        }
5642
5643        @Override
5644        public synchronized void onEvent(int event, String path) {
5645            mClosed = true;
5646            notify();
5647        }
5648
5649        public long dumpWithTimeout(int pid, long timeout) {
           //发生singal开输出trace,相当于adb shell kill -3
5650            sendSignal(pid, SIGNAL_QUIT);
5651            final long start = SystemClock.elapsedRealtime();
5652            // timeout为20s,TRACE_DUMP_TIMEOUT_MS为10s,取小则为10s
5653            final long waitTime = Math.min(timeout, TRACE_DUMP_TIMEOUT_MS);
5654            synchronized (this) {
5655                try {
5656                    wait(waitTime); // Wait for traces file to be closed.
5657                } catch (InterruptedException e) {
5658                    Slog.wtf(TAG, e);
5659                }
5660            }
5661
5662            // This avoids a corner case of passing a negative time to the native
5663            // trace in case we've already hit the overall timeout.
5664            final long timeWaited = SystemClock.elapsedRealtime() - start;
5665            if (timeWaited >= timeout) {
5666                return timeWaited;
5667            }
5668
.......
5683            return (end - start);
5684        }
5685    }

这个方法将会被废弃了,原理上就是用adb shell kill -3抓trace

3.2、dumpJavaTracesTombstoned方式获取trace
5687    /**
5688     * Dump java traces for process {@code pid} to the specified file. If java trace dumping
5689     * fails, a native backtrace is attempted. Note that the timeout {@code timeoutMs} only applies
5690     * to the java section of the trace, a further {@code NATIVE_DUMP_TIMEOUT_MS} might be spent
5691     * attempting to obtain native traces in the case of a failure. Returns the total time spent
5692     * capturing traces.
5693     */
5694    private static long dumpJavaTracesTombstoned(int pid, String fileName, long timeoutMs) {
5695        final long timeStart = SystemClock.elapsedRealtime();
5696        if (!Debug.dumpJavaBacktraceToFileTimeout(pid, fileName, (int) (timeoutMs / 1000))) {
5697            Debug.dumpNativeBacktraceToFileTimeout(pid, fileName,
5698                    (NATIVE_DUMP_TIMEOUT_MS / 1000));
5699        }
5700
5701        return SystemClock.elapsedRealtime() - timeStart;
5702    }

实质上还是由dumpJavaBacktraceToFileTimeout来获取trace。dumpNativeBacktraceToFileTimeout是一个native方法,下面来看一看

1131static jboolean android_os_Debug_dumpNativeBacktraceToFileTimeout(JNIEnv* env, jobject clazz,
1132        jint pid, jstring fileName, jint timeoutSecs) {
1133    const bool ret = dumpTraces(env, pid, fileName, timeoutSecs, kDebuggerdNativeBacktrace);
1134    return ret ? JNI_TRUE : JNI_FALSE;
1135}
1107static bool dumpTraces(JNIEnv* env, jint pid, jstring fileName, jint timeoutSecs,
1108                       DebuggerdDumpType dumpType) {
1109    const ScopedUtfChars fileNameChars(env, fileName);
1110    if (fileNameChars.c_str() == nullptr) {
1111        return false;
1112    }
1113 //打开 /data/anr/anr_yyyy-MM-dd-HH-mm-ss-SSS文件
1114    android::base::unique_fd fd(open(fileNameChars.c_str(),
1115                                     O_CREAT | O_WRONLY | O_NOFOLLOW | O_CLOEXEC | O_APPEND,
1116                                     0666));
1117    if (fd < 0) {
1118        fprintf(stderr, "Can't open %s: %s\n", fileNameChars.c_str(), strerror(errno));
1119        return false;
1120    }
1121
1122    return (dump_backtrace_to_file_timeout(pid, dumpType, timeoutSecs, fd) == 0);
1123}
int dump_backtrace_to_file_timeout(pid_t tid, DebuggerdDumpType dump_type, int timeout_secs,   int fd) {
 android::base::unique_fd copy(dup(fd));
 if (copy == -1) {
   return -1;
 }
 int timeout_ms = timeout_secs > 0 ? timeout_secs * 1000 : 0;
 return debuggerd_trigger_dump(tid, dump_type, timeout_ms, std::move(copy)) ? 0 : -1;
}

可以看到实质是调用debuggerd_trigger_dump,可以看debuggerd_client.cpp,可向debuggerd发送命令,利用debuggerd抓trace。

总结:在dumpStackTraces方法中,主要根据useTombstonedForJavaTraces这个属性来设置不同的抓取trace的方式,java的进程实质上是使用adb shell kill -3来抓trace,对于native的进程实质上是使用debuggerd(debuggerd -b [pid])抓trace。而useTombstonedForJavaTraces的值是false还是true取决与"dalvik.vm.stack-trace-dir"这个属性在开机的时候是否设置了。

你可能感兴趣的:(应用与系统稳定性第二篇---ANR的监测与信息采集)