第一篇文章-应用与系统稳定性第一篇---ANR问题分析的一般套路主要梳理了几个ANR案例,讲了分析ANR问题的一般思路。本文基于Android O,深入理解ANR发生时系统的处理逻辑,加深对ANR问题的实战处理。实际处理ANR问题中,对于很多的信息还是比较陌生。比如ANR的trace是怎么输出来的?有时候为什么ANR的trace是无效的?一个进程发生了ANR的过程中又发生了ANR,系统是怎么处理的?什么是后台ANR?等等小问题如果你都能回答,那么可以不用看了。
一、ANR的trace是如何生成的
1、appNotResponding方法解读
发生ANR的时候,一般先搜索ANR in获取最直观的信息,如下:
06-16 16:16:28.590 1853 2073 E ActivityManager: ANR in com.android.camera (com.android.camera/.Camera)
06-16 16:16:28.590 1853 2073 E ActivityManager: PID: 27661
06-16 16:16:28.590 1853 2073 E ActivityManager: Reason: Input dispatching timed out (com.android.camera/com.android.camera.Camera, Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago. Wait queue length: 24. Wait queue head age: 5511.1ms.)
06-16 16:16:28.590 1853 2073 E ActivityManager: Load: 16.25 / 29.48 / 38.33
06-16 16:16:28.590 1853 2073 E ActivityManager: CPU usage from 0ms to 8058ms later:
06-16 16:16:28.590 1853 2073 E ActivityManager: 58% 291/mediaserver: 51% user + 6.7% kernel / faults: 2457 minor 4 major
06-16 16:16:28.590 1853 2073 E ActivityManager: 27% 317/mm-qcamera-daemon: 21% user + 5.8% kernel / faults: 15965 minor
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.4% 288/debuggerd: 0% user + 0.3% kernel / faults: 21615 minor 87 major
06-16 16:16:28.590 1853 2073 E ActivityManager: 17% 27661/com.android.camera: 10% user + 6.8% kernel / faults: 2412 minor 34 major
06-16 16:16:28.590 1853 2073 E ActivityManager: 16% 1853/system_server: 10% user + 6.4% kernel / faults: 1754 minor 87 major
06-16 16:16:28.590 1853 2073 E ActivityManager: 10% 539/sensors.qcom: 7.8% user + 2.6% kernel / faults: 16 minor
06-16 16:16:28.590 1853 2073 E ActivityManager: 4.4% 277/surfaceflinger: 1.8% user + 2.6% kernel / faults: 14 minor
06-16 16:16:28.590 1853 2073 E ActivityManager: 4% 203/mmcqd/0: 0% user + 4% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 2.6% 3510/com.android.phone: 1.9% user + 0.6% kernel / faults: 1148 minor 8 major
06-16 16:16:28.590 1853 2073 E ActivityManager: 2.1% 2902/com.android.systemui: 1.6% user + 0.4% kernel / faults: 1272 minor 32 major
06-16 16:16:28.590 1853 2073 E ActivityManager: 1.6% 3110/com.miui.whetstone: 1.6% user + 0% kernel / faults: 2614 minor 22 major
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.8% 99/kswapd0: 0% user + 0.8% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 1.4% 217/jbd2/mmcblk0p25: 0% user + 1.4% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 1.4% 223/logd: 0.7% user + 0.7% kernel / faults: 4 minor
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.9% 12808/kworker/0:1: 0% user + 0.9% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.8% 35/kworker/u:2: 0% user + 0.8% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0% 3222/com.miui.sysbase: 0% user + 0% kernel / faults: 1314 minor 12 major
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.8% 3446/com.android.nfc: 0.4% user + 0.3% kernel / faults: 1223 minor 9 major
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.7% 10866/kworker/u:1: 0% user + 0.7% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.6% 642/mdss_fb0: 0% user + 0.6% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.6% 29336/kworker/u:7: 0% user + 0.6% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.4% 6/kworker/u:0: 0% user + 0.4% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.4% 22924/kworker/u:6: 0% user + 0.4% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.3% 4421/mpdecision: 0% user + 0.3% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.2% 276/servicemanager: 0.1% user + 0.1% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.2% 289/rild: 0.2% user + 0% kernel / faults: 20 minor
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.1% 4161/mcd: 0% user + 0% kernel / faults: 9 minor 1 major
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.1% 3/ksoftirqd/0: 0% user + 0.1% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.1% 5/kworker/0:0H: 0% user + 0.1% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.1% 7/kworker/u:0H: 0% user + 0.1% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0% 215/flush-179:0: 0% user + 0% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.1% 321/displayfeature: 0.1% user + 0% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.1% 368/irq/33-cpubw_hw: 0% user + 0.1% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.1% 403/qmuxd: 0% user + 0.1% kernel / faults: 60 minor
06-16 16:16:28.590 1853 2073 E ActivityManager: 0% 3491/com.xiaomi.finddevice: 0% user + 0% kernel / faults: 706 minor
06-16 16:16:28.590 1853 2073 E ActivityManager: 0.1% 29330/ksoftirqd/1: 0% user + 0.1% kernel
06-16 16:16:28.590 1853 2073 E ActivityManager: 96% TOTAL: 56% user + 29% kernel + 6.3% iowait + 4.1% softirq
应用在发生ANR的时候,不论是Provider,input还是BroadcastReceiver和Service处理超时等,系统都会调用appNotResponding方法,进行trace的收集,主要记录当时的案发现场,各个相关线程的状态一目了然。下面从appNotResponding入手一起看看源码是怎么写的。
这个方法的参数有5个,如下:
- app 当前发生ANR的进程
- activity 发生ANR的界面
- parent 发生ANR的界面的上一级界面
- aboveSystem 这个参数系统中用的全部都是False,还没有看到有true的地方
- annotation 发生ANR的原因
776 final void appNotResponding(ProcessRecord app, ActivityRecord activity,
777 ActivityRecord parent, boolean aboveSystem, final String annotation) {
// 填充firstPids和lastPids数组,从最近运行进程中挑选
// firstPids: 用于保存ANR进程及其父进程,system_server进程和persistent的进程
// lastPids: 用于保存除firstPids外的其他进程
778 ArrayList firstPids = new ArrayList(5);
779 SparseArray lastPids = new SparseArray(20);
780
781 if (mService.mController != null) {
782 try {
783 // 0 == continue, -1 = kill process immediately
784 int res = mService.mController.appEarlyNotResponding(
785 app.processName, app.pid, annotation);
786 if (res < 0 && app.pid != MY_PID) {
787 app.kill("anr", true);
788 }
789 } catch (RemoteException e) {
790 mService.mController = null;
791 Watchdog.getInstance().setActivityController(null);
792 }
793 }
794
795 long anrTime = SystemClock.uptimeMillis();
//更新Cpu的信息
796 if (ActivityManagerService.MONITOR_CPU_USAGE) {
797 mService.updateCpuStatsNow();
798 }
799
800 // Unless configured otherwise, swallow ANRs in background processes & kill the process.
//showBackground意味在后台的时候是否让ANR有感知,如果showBackground为fasle就不会弹出ANR的dialog了
801 boolean showBackground = Settings.Secure.getInt(mContext.getContentResolver(),
802 Settings.Secure.ANR_SHOW_BACKGROUND, 0) != 0;
803
804 boolean isSilentANR;
805
806 synchronized (mService) {
//这些场景的ANR直接做忽略处理(关机,app已经ANR了,APP处在crash流程中,被AMS或者其他人杀死的)
807 // PowerManager.reboot() can block for a long time, so ignore ANRs while shutting down.
808 if (mService.mShuttingDown) {
809 Slog.i(TAG, "During shutdown skipping ANR: " + app + " " + annotation);
810 return;
811 } else if (app.notResponding) {
812 Slog.i(TAG, "Skipping duplicate ANR: " + app + " " + annotation);
813 return;
814 } else if (app.crashing) {
815 Slog.i(TAG, "Crashing app skipping ANR: " + app + " " + annotation);
816 return;
817 } else if (app.killedByAm) {
818 Slog.i(TAG, "App already killed by AM skipping ANR: " + app + " " + annotation);
819 return;
820 } else if (app.killed) {
821 Slog.i(TAG, "Skipping died app ANR: " + app + " " + annotation);
822 return;
823 }
824
825 // In case we come through here for the same app before completing
826 // this one, mark as anring now so we will bail out.
827 app.notResponding = true;
828
829 // Log the ANR to the event log.常常搜索的am_anr关键字,记录ANR信息到eventlog中
830 EventLog.writeEvent(EventLogTags.AM_ANR, app.userId, app.pid,
831 app.processName, app.info.flags, annotation);
832
833 // Dump thread traces as quickly as we can, starting with "interesting" processes.
//ANR进程放到firstPids集合中
834 firstPids.add(app.pid);
835
836 // Don't dump other PIDs if it's a background ANR
//如果是后台的ANR,就不dump了
837 isSilentANR = !showBackground && !isInterestingForBackgroundTraces(app);
838 if (!isSilentANR) {
839 int parentPid = app.pid;
840 if (parent != null && parent.app != null && parent.app.pid > 0) {
841 parentPid = parent.app.pid;
842 }
//把ANR的父进程也记录firstPids中
843 if (parentPid != app.pid) firstPids.add(parentPid);
844 //MY_PID是system_server的pid,如果发生的进程不是system,需要把system进程加到firstPids中
845 if (MY_PID != app.pid && MY_PID != parentPid) firstPids.add(MY_PID);
846 //最近运行的进程在mLruProcesses中,遍历这个List
847 for (int i = mService.mLruProcesses.size() - 1; i >= 0; i--) {
848 ProcessRecord r = mService.mLruProcesses.get(i);
849 if (r != null && r.thread != null) {
850 int pid = r.pid;
851 if (pid > 0 && pid != app.pid && pid != parentPid && pid != MY_PID) {
//如果persistent进程,这类是“不死进程”,
//这类进程在启动的时候会针对目标用户进程的IApplicationThread接口,注册一个binder讣告监听器,
//一旦日后用户进程意外挂掉,AMS就能在第一时间感知到,
//并采取相应的措施。如果AMS发现意外挂掉的应用是persistent的,它会尝试重新启动这个应用
852 if (r.persistent) {
853 firstPids.add(pid);
854 if (DEBUG_ANR) Slog.i(TAG, "Adding persistent proc: " + r);
855 } else if (r.treatLikeActivity) {
856 firstPids.add(pid);
857 if (DEBUG_ANR) Slog.i(TAG, "Adding likely IME: " + r);
858 } else {
//其他的都加到lastPids中
859 lastPids.put(pid, Boolean.TRUE);
860 if (DEBUG_ANR) Slog.i(TAG, "Adding ANR proc: " + r);
861 }
862 }
863 }
864 }
865 }
866 }
867
868 // Log the ANR to the main log.输出到main log里面
869 StringBuilder info = new StringBuilder();
870 info.setLength(0);
871 info.append("ANR in ").append(app.processName);
872 if (activity != null && activity.shortComponentName != null) {
873 info.append(" (").append(activity.shortComponentName).append(")");
874 }
875 info.append("\n");
876 info.append("PID: ").append(app.pid).append("\n");
877 if (annotation != null) {
878 info.append("Reason: ").append(annotation).append("\n");
879 }
880 if (parent != null && parent != activity) {
881 info.append("Parent: ").append(parent.shortComponentName).append("\n");
882 }
883
884 ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);
885
886 // don't dump native PIDs for background ANRs unless it is the process of interest
887 String[] nativeProcs = null;
888 if (isSilentANR) {
//NATIVE_STACKS_OF_INTEREST是Watchog中定义的几个关键的进程
889 for (int i = 0; i < NATIVE_STACKS_OF_INTEREST.length; i++) {
890 if (NATIVE_STACKS_OF_INTEREST[i].equals(app.processName)) {
891 nativeProcs = new String[] { app.processName };
892 break;
893 }
894 }
895 } else {
896 nativeProcs = NATIVE_STACKS_OF_INTEREST;
897 }
898 //获取nativeProcs这些进程的pid
899 int[] pids = nativeProcs == null ? null : Process.getPidsForCommands(nativeProcs);
900 ArrayList nativePids = null;
901
902 if (pids != null) {
903 nativePids = new ArrayList(pids.length);
904 for (int i : pids) {
905 nativePids.add(i);
906 }
907 }
908
909 // For background ANRs, don't pass the ProcessCpuTracker to
910 // avoid spending 1/2 second collecting stats to rank lastPids.
//
911 File tracesFile = mService.dumpStackTraces(true, firstPids,
912 (isSilentANR) ? null : processCpuTracker,
913 (isSilentANR) ? null : lastPids,
914 nativePids);
915
916 String cpuInfo = null;
917 if (ActivityManagerService.MONITOR_CPU_USAGE) {
//更新Cpu的信息
918 mService.updateCpuStatsNow();
919 synchronized (mService.mProcessCpuTracker) {
// 记录ANR之前的cpu使用情况(CPU usage from xxxms to 0ms ago)
920 cpuInfo = mService.mProcessCpuTracker.printCurrentState(anrTime);
921 }
// 记录从anr时间开始的Cpu负载
922 info.append(processCpuTracker.printCurrentLoad());
923 info.append(cpuInfo);
924 }
925 // 记录从anr时间开始的cpu使用情况(CPU usage from xxms to xxms later)
926 info.append(processCpuTracker.printCurrentState(anrTime));
927
928 Slog.e(TAG, info.toString());
929 if (tracesFile == null) {
930 // 如果trace为空,则发送singal 3到发送ANR的进程,相当于执行adb shell kill -3 pid命令
931 Process.sendSignal(app.pid, Process.SIGNAL_QUIT);
932 }
933 //写到DropBox文件夹里面(data/system/dropbox)
934 mService.addErrorToDropBox("anr", app, app.processName, activity, parent, annotation,
935 cpuInfo, tracesFile, null);
936
937 if (mService.mController != null) {
938 try {
939 // 0 == show dialog, 1 = keep waiting, -1 = kill process immediately
940 int res = mService.mController.appNotResponding(
941 app.processName, app.pid, info.toString());
942 if (res != 0) {
943 if (res < 0 && app.pid != MY_PID) {
944 app.kill("anr", true);
945 } else {
946 synchronized (mService) {
947 mService.mServices.scheduleServiceTimeoutLocked(app);
948 }
949 }
950 return;
951 }
952 } catch (RemoteException e) {
953 mService.mController = null;
954 Watchdog.getInstance().setActivityController(null);
955 }
956 }
957
958 synchronized (mService) {
959 mService.mBatteryStatsService.noteProcessAnr(app.processName, app.uid);
960 //如果是后台ANR,直接kill
961 if (isSilentANR) {
962 app.kill("bg anr", true);
963 return;
964 }
965
966 // Set the app's notResponding state, and look up the errorReportReceiver
967 吧 (app,
968 activity != null ? activity.shortComponentName : null,
969 annotation != null ? "ANR " + annotation : "ANR",
970 info.toString());
971
972 // 弹出无响应对话框
973 Message msg = Message.obtain();
974 HashMap map = new HashMap();
975 msg.what = ActivityManagerService.SHOW_NOT_RESPONDING_UI_MSG;
976 msg.obj = map;
977 msg.arg1 = aboveSystem ? 1 : 0;
978 map.put("app", app);
979 if (activity != null) {
980 map.put("activity", activity);
981 }
982 给android.ui线程发生一个消息,弹出对话框
983 mService.mUiHandler.sendMessage(msg);
984 }
985 }
986
上面的代码撸了一遍,总结一下appNotResponding的业务逻辑:
- 1、输出ANR Reason信息到Events log中,关键字是am_anr,一般都是用这个信息来确定Anr的发生时间
- 2、如果是后台的anr,即isSilentANR值为true,那么不显示ANR对话框,直接kill掉ANR所在进程
- 3、输出ANR 信息到main log中,关键字是ANR in,含有Cpu负载信息,一般都是用这个看ANR发生时候的其他进程的资源占用情况
- 4、调用dumpStackTraces去生成trace文件,这个trace文件有哪些进程呢?
a.firstPids队列:第一个是ANR进程,第二个是system_server,剩余是所有persistent进程;
b.Native队列:是指/system/bin/目录的mediaserver,sdcard 以及surfaceflinger进程;
c.lastPids队列: 是指mLruProcesses中的不属于firstPids的所有进程。- 5、将traces文件和 CPU使用情况信息保存到dropbox,即data/system/dropbox目录
- 6、弹窗告知用户
除了上面的业务总结,我们发现有几个小知识点
- 1、关机过程虽然比较耗时,但是不会出现ANR,不必多说,你说人家都关机了,还有提示发生ANR的必要了吗?
- 2、ANR有两种,前台ANR和后台ANR,我在调试中发现,后台ANR非常容易发生,系统直接kill掉这个应用进程,但是用户没有感知。
- 3、am_anr关键字比ANR in要先打印,所以更接近ANR的实际发生时间
2、dumpStackTraces(1/2)方法解读
继续看appNotResponding中的核心逻辑dumpStackTraces
http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5492
5483 /**
5484 * If a stack trace dump file is configured, dump process stack traces.
5485 * @param clearTraces causes the dump file to be erased prior to the new
5486 * traces being written, if true; when false, the new traces will be
5487 * appended to any existing file content.
5488 * @param firstPids of dalvik VM processes to dump stack traces for first
5489 * @param lastPids of dalvik VM processes to dump stack traces for last
5490 * @param nativePids optional list of native pids to dump stack crawls
5491 */
5492 public static File dumpStackTraces(boolean clearTraces, ArrayList firstPids,
5493 ProcessCpuTracker processCpuTracker, SparseArray lastPids,
5494 ArrayList nativePids) {
5495 ArrayList extraPids = null;
5496 ......
5524 boolean useTombstonedForJavaTraces = false;
5525 File tracesFile;
5526 //如果dalvik.vm.stack-trace-dir没有配置,trace就写到全局文件data/anr/traces.txt中
5527 final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");
5528 if (tracesDirProp.isEmpty()) {
5529 // When dalvik.vm.stack-trace-dir is not set, we are using the "old" trace
5530 // dumping scheme. All traces are written to a global trace file (usually
5531 // "/data/anr/traces.txt") so the code below must take care to unlink and recreate
5532 // the file if requested.
5533 //
5534 // This mode of operation will be removed in the near future.
5535
5536
5537 String globalTracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
5538 if (globalTracesPath.isEmpty()) {
// 没有配置dalvik.vm.stack-trace-file,则返回null
5539 Slog.w(TAG, "dumpStackTraces: no trace path configured");
5540 return null;
5541 }
5542
5543 tracesFile = new File(globalTracesPath);
5544 try {
//删除旧的,创建新的
5545 if (clearTraces && tracesFile.exists()) {
5546 tracesFile.delete();
5547 }
5548
5549 tracesFile.createNewFile();
5550 FileUtils.setPermissions(globalTracesPath, 0666, -1, -1); // -rw-rw-rw-
5551 } catch (IOException e) {
5552 Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesFile, e);
5553 return null;
5554 }
5555 } else {
5556 File tracesDir = new File(tracesDirProp);
5557 // When dalvik.vm.stack-trace-dir is set, we use the "new" trace dumping scheme.
5558 // Each set of ANR traces is written to a separate file and dumpstate will process
5559 // all such files and add them to a captured bug report if they're recent enough.
5560 maybePruneOldTraces(tracesDir);
5561
5562 // NOTE: We should consider creating the file in native code atomically once we've
5563 // gotten rid of the old scheme of dumping and lot of the code that deals with paths
5564 // can be removed.
// 创建trace文件,格式为anr_yyyy-MM-dd-HH-mm-ss-SSS
5565 tracesFile = createAnrDumpFile(tracesDir);
5566 if (tracesFile == null) {
5567 return null;
5568 }
5569
5570 useTombstonedForJavaTraces = true;
5571 }
5572
5573 dumpStackTraces(tracesFile.getAbsolutePath(), firstPids, nativePids, extraPids,
5574 useTombstonedForJavaTraces);
5575 return tracesFile;
5576 }
这段代码并没有真正的执行dump操作,只是确定了trace文件路径,如果”dalvik.vm.stack-trace-dir”这个属性没有配置,就使用旧的dump策略,trace信息写入到全局trace文件/data/anr/traces.txt中,删除旧的traces文件,创建新的traces文件;如果”dalvik.vm.stack-trace-dir”属性被配置了,就创建格式为anr_yyyy-MM-dd-HH-mm-ss-SSS的文件。
dalvik.vm.stack-trace-file特性
dalvik.vm.stack-trace-file特性允许你指定要将线程堆栈追踪写入的文件名,如果不存在,将创建,新的信息将追加到文件尾,文件名通过-Xstacktracefile参数写入虚拟机。例如:
adb shell setprop dalvik.vm.stack-trace-file /tmp/stack-traces.txt
如果这个特性没有被定义,虚拟机会在收到这个信号时将堆栈追踪信息写入android log。
http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5581
5581 private static synchronized File createAnrDumpFile(File tracesDir) {
5582 if (sAnrFileDateFormat == null) {
5583 sAnrFileDateFormat = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss-SSS");
5584 }
5585
5586 final String formattedDate = sAnrFileDateFormat.format(new Date());
5587 final File anrFile = new File(tracesDir, "anr_" + formattedDate);
5588
5589 try {
5590 if (anrFile.createNewFile()) {
5591 FileUtils.setPermissions(anrFile.getAbsolutePath(), 0600, -1, -1); // -rw-------
5592 return anrFile;
5593 } else {
5594 Slog.w(TAG, "Unable to create ANR dump file: createNewFile failed");
5595 }
5596 } catch (IOException ioe) {
5597 Slog.w(TAG, "Exception creating ANR dump file:", ioe);
5598 }
5599
5600 return null;
5601 }
5602
3、dumpStackTraces(2/2)方法解读
http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5704
5704 private static void dumpStackTraces(String tracesFile, ArrayList firstPids,
5705 ArrayList nativePids, ArrayList extraPids,
5706 boolean useTombstonedForJavaTraces) {
5707
5708 // We don't need any sort of inotify based monitoring when we're dumping traces via
5709 // tombstoned. Data is piped to an "intercept" FD installed in tombstoned so we're in full
5710 // control of all writes to the file in question.
5711 final DumpStackFileObserver observer;
5712 if (useTombstonedForJavaTraces) {
5713 observer = null;
5714 } else {
5715 // Use a FileObserver to detect when traces finish writing.
5716 // The order of traces is considered important to maintain for legibility.
5717 observer = new DumpStackFileObserver(tracesFile);
5718 }
5719
5720 // 所有的dump操作需要在20秒之内完成
5721 long remainingTime = 20 * 1000;
5722 try {
5723 if (observer != null) {
5724 observer.startWatching();
5725 }
5726
5727 // 第一步收集firstPids中进程的trace,超时会自动放弃
5728 if (firstPids != null) {
5729 int num = firstPids.size();
5730 for (int i = 0; i < num; i++) {
5731 if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for pid "
5732 + firstPids.get(i));
5733 final long timeTaken;
5734 if (useTombstonedForJavaTraces) {
5735 timeTaken = dumpJavaTracesTombstoned(firstPids.get(i), tracesFile, remainingTime);
5736 } else {
5737 timeTaken = observer.dumpWithTimeout(firstPids.get(i), remainingTime);
5738 }
5739
5740 remainingTime -= timeTaken;
5741 if (remainingTime <= 0) {
5742 Slog.e(TAG, "Aborting stack trace dump (current firstPid=" + firstPids.get(i) +
5743 "); deadline exceeded.");
5744 return;
5745 }
5746
5747 if (DEBUG_ANR) {
5748 Slog.d(TAG, "Done with pid " + firstPids.get(i) + " in " + timeTaken + "ms");
5749 }
5750 }
5751 }
5752
5753 // 第二步收集nativePids中进程的trace,超时会自动放弃
5754 if (nativePids != null) {
5755 for (int pid : nativePids) {
5756 if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for native pid " + pid);
5757 final long nativeDumpTimeoutMs = Math.min(NATIVE_DUMP_TIMEOUT_MS, remainingTime);
5758
5759 final long start = SystemClock.elapsedRealtime();
5760 Debug.dumpNativeBacktraceToFileTimeout(
5761 pid, tracesFile, (int) (nativeDumpTimeoutMs / 1000));
5762 final long timeTaken = SystemClock.elapsedRealtime() - start;
5763
5764 remainingTime -= timeTaken;
5765 if (remainingTime <= 0) {
5766 Slog.e(TAG, "Aborting stack trace dump (current native pid=" + pid +
5767 "); deadline exceeded.");
5768 return;
5769 }
5770
5771 if (DEBUG_ANR) {
5772 Slog.d(TAG, "Done with native pid " + pid + " in " + timeTaken + "ms");
5773 }
5774 }
5775 }
5776
5777 //第三步收集extraPids中进程的trace,超时会自动放弃,这个代码和第一步一样的
5778 if (extraPids != null) {
5779 for (int pid : extraPids) {
5780 if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for extra pid " + pid);
5781
5782 final long timeTaken;
5783 if (useTombstonedForJavaTraces) {
5784 timeTaken = dumpJavaTracesTombstoned(pid, tracesFile, remainingTime);
5785 } else {
5786 timeTaken = observer.dumpWithTimeout(pid, remainingTime);
5787 }
5788
5789 remainingTime -= timeTaken;
5790 if (remainingTime <= 0) {
5791 Slog.e(TAG, "Aborting stack trace dump (current extra pid=" + pid +
5792 "); deadline exceeded.");
5793 return;
5794 }
5795
5796 if (DEBUG_ANR) {
5797 Slog.d(TAG, "Done with extra pid " + pid + " in " + timeTaken + "ms");
5798 }
5799 }
5800 }
5801 } finally {
5802 if (observer != null) {
5803 observer.stopWatching();
5804 }
5805 }
5806 }
firstPids和extraPids中的进程dump trace有两种方式,看useTombstonedForJavaTraces的取值;
- 如果useTombstonedForJavaTraces为真,采用dumpJavaTracesTombstoned获取trace
- 如果useTombstonedForJavaTraces为假,采用DumpStackFileObserver#dumpWithTimeout获取trace
对于nativePids中的进程,只能通过 Debug.dumpNativeBacktraceToFileTimeout来获取trace。
除了上面的总结还需要思考几个小问题
- 问题1: 利用am_anr来确定ANR的发生时间真的准确吗?
在上一篇文章就已经说了,不准确,可能由于系统资源比较紧张造成一定程度的滞后,十多秒都可能,此时需要结合eventlog进一步分析。
问题2:ANR中经常出现没有trace,或者无效trace是怎么回事?
- 从appNotResponding方法中,我们看到有不少return的地方,在dumpStackTraces方法,也会有return出现,比如这么多进程的trace需要在20秒之内完成,在系统资源紧张的时候,能不能完成,需要打一个问号了
- 有一些ANR问题需要binder对端的trace,通过上面的分析看,大概率是没有对端的trace的,除非binder call到system_server或者比较重要的几个native进程里面
- 另外发生ANR的时候,线程并没有暂停掉,有可能等到我们去抓这个进程的trace的时候,这个进程的主线程都恢复正常了,比如经常性出现下面的nativePollOnce状态的堆栈。
"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x7685ba50 self=0x7fb0840a00
| sysTid=1460 nice=-10 cgrp=default sched=0/0 handle=0x7fb4824aa0
| state=S schedstat=( 213445473008 99617457476 724519 ) utm=14518 stm=6826 core=0 HZ=100
| stack=0x7fe82c6000-0x7fe82c8000 stackSize=8MB
| held mutexes=
kernel: __switch_to+0x70/0x7c
kernel: SyS_epoll_wait+0x2d4/0x394
kernel: SyS_epoll_pwait+0xc4/0x150
kernel: el0_svc_naked+0x24/0x28
native: #00 pc 000000000001bf2c /system/lib64/libc.so (syscall+28)
native: #01 pc 00000000000e813c /system/lib64/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+160)
native: #02 pc 000000000054b6f8 /system/lib64/libart.so (_ZN3artL12GoToRunnableEPNS_6ThreadE+328)
native: #03 pc 000000000054b56c /system/lib64/libart.so (_ZN3art12JniMethodEndEjPNS_6ThreadE+28)
native: #04 pc 00000000008ccc50 /system/framework/arm64/boot-framework.oat (Java_android_os_MessageQueue_nativePollOnce__JI+156)
at android.os.MessageQueue.nativePollOnce(Native method)
at android.os.MessageQueue.next(MessageQueue.java:323)
at android.os.Looper.loop(Looper.java:142)
at com.android.server.SystemServer.run(SystemServer.java:387)
at com.android.server.SystemServer.main(SystemServer.java:245)
at java.lang.reflect.Method.invoke!(Native method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:904)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:794)
- 问题3: ANR中的dumpStackTraces是耗时操作,会不会引起系统的ANR?
我觉得是有可能的,而是在系统稳定性很差的情况下,大量应用都ANR了,都需要dumpStackTraces,那么都有引起系统watchdog的可能。AOSP在P上有多个changelist对此进行优化,有兴趣可以自己查看。
3.1、DumpStackFileObserver方式获取trace
http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5631
5625 /**
5626 * Legacy code, do not use. Existing users will be deleted.
5627 *
5628 * @deprecated
5629 */
5630 @Deprecated
5631 public static class DumpStackFileObserver extends FileObserver {
5632 // Keep in sync with frameworks/native/cmds/dumpstate/utils.cpp
5633 private static final int TRACE_DUMP_TIMEOUT_MS = 10000; // 10 seconds
5634
5635 private final String mTracesPath;
5636 private boolean mClosed;
5637
5638 public DumpStackFileObserver(String tracesPath) {
5639 super(tracesPath, FileObserver.CLOSE_WRITE);
5640 mTracesPath = tracesPath;
5641 }
5642
5643 @Override
5644 public synchronized void onEvent(int event, String path) {
5645 mClosed = true;
5646 notify();
5647 }
5648
5649 public long dumpWithTimeout(int pid, long timeout) {
//发生singal开输出trace,相当于adb shell kill -3
5650 sendSignal(pid, SIGNAL_QUIT);
5651 final long start = SystemClock.elapsedRealtime();
5652 // timeout为20s,TRACE_DUMP_TIMEOUT_MS为10s,取小则为10s
5653 final long waitTime = Math.min(timeout, TRACE_DUMP_TIMEOUT_MS);
5654 synchronized (this) {
5655 try {
5656 wait(waitTime); // Wait for traces file to be closed.
5657 } catch (InterruptedException e) {
5658 Slog.wtf(TAG, e);
5659 }
5660 }
5661
5662 // This avoids a corner case of passing a negative time to the native
5663 // trace in case we've already hit the overall timeout.
5664 final long timeWaited = SystemClock.elapsedRealtime() - start;
5665 if (timeWaited >= timeout) {
5666 return timeWaited;
5667 }
5668
.......
5683 return (end - start);
5684 }
5685 }
这个方法将会被废弃了,原理上就是用adb shell kill -3抓trace
3.2、dumpJavaTracesTombstoned方式获取trace
5687 /**
5688 * Dump java traces for process {@code pid} to the specified file. If java trace dumping
5689 * fails, a native backtrace is attempted. Note that the timeout {@code timeoutMs} only applies
5690 * to the java section of the trace, a further {@code NATIVE_DUMP_TIMEOUT_MS} might be spent
5691 * attempting to obtain native traces in the case of a failure. Returns the total time spent
5692 * capturing traces.
5693 */
5694 private static long dumpJavaTracesTombstoned(int pid, String fileName, long timeoutMs) {
5695 final long timeStart = SystemClock.elapsedRealtime();
5696 if (!Debug.dumpJavaBacktraceToFileTimeout(pid, fileName, (int) (timeoutMs / 1000))) {
5697 Debug.dumpNativeBacktraceToFileTimeout(pid, fileName,
5698 (NATIVE_DUMP_TIMEOUT_MS / 1000));
5699 }
5700
5701 return SystemClock.elapsedRealtime() - timeStart;
5702 }
实质上还是由dumpJavaBacktraceToFileTimeout来获取trace。dumpNativeBacktraceToFileTimeout是一个native方法,下面来看一看
1131static jboolean android_os_Debug_dumpNativeBacktraceToFileTimeout(JNIEnv* env, jobject clazz,
1132 jint pid, jstring fileName, jint timeoutSecs) {
1133 const bool ret = dumpTraces(env, pid, fileName, timeoutSecs, kDebuggerdNativeBacktrace);
1134 return ret ? JNI_TRUE : JNI_FALSE;
1135}
1107static bool dumpTraces(JNIEnv* env, jint pid, jstring fileName, jint timeoutSecs,
1108 DebuggerdDumpType dumpType) {
1109 const ScopedUtfChars fileNameChars(env, fileName);
1110 if (fileNameChars.c_str() == nullptr) {
1111 return false;
1112 }
1113 //打开 /data/anr/anr_yyyy-MM-dd-HH-mm-ss-SSS文件
1114 android::base::unique_fd fd(open(fileNameChars.c_str(),
1115 O_CREAT | O_WRONLY | O_NOFOLLOW | O_CLOEXEC | O_APPEND,
1116 0666));
1117 if (fd < 0) {
1118 fprintf(stderr, "Can't open %s: %s\n", fileNameChars.c_str(), strerror(errno));
1119 return false;
1120 }
1121
1122 return (dump_backtrace_to_file_timeout(pid, dumpType, timeoutSecs, fd) == 0);
1123}
int dump_backtrace_to_file_timeout(pid_t tid, DebuggerdDumpType dump_type, int timeout_secs, int fd) {
android::base::unique_fd copy(dup(fd));
if (copy == -1) {
return -1;
}
int timeout_ms = timeout_secs > 0 ? timeout_secs * 1000 : 0;
return debuggerd_trigger_dump(tid, dump_type, timeout_ms, std::move(copy)) ? 0 : -1;
}
可以看到实质是调用debuggerd_trigger_dump,可以看debuggerd_client.cpp,可向debuggerd发送命令,利用debuggerd抓trace。
总结:在dumpStackTraces方法中,主要根据useTombstonedForJavaTraces这个属性来设置不同的抓取trace的方式,java的进程实质上是使用adb shell kill -3来抓trace,对于native的进程实质上是使用debuggerd(debuggerd -b [pid])抓trace。而useTombstonedForJavaTraces的值是false还是true取决与"dalvik.vm.stack-trace-dir"这个属性在开机的时候是否设置了。