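(Because of the special requirements of our project, we have to use rugged (water-, dust- and shock-proof) phones from small manufacturers rather than the consumer phones we normally use.)
Early testing was done on consumer phones, six or seven models (Xiaomi, Huawei, ZTE, OPPO), and none of them ever showed an ANR. On an engineering unit the company had purchased, however, an ANR reliably appeared after the app had been in use for a while, so it had to be fixed. Looking back, those days were painful, but they were also a chance to improve my skills.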
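First, a summary of the situations that produce an ANR (an ANR is always about the main thread failing to respond in time):
1. The user presses a key or touches the screen and the app does not handle the event within the default limit (5 s). Frankly, never mind 5 s: users already find anything over 1 s unacceptable. This kind of ANR is detected by the InputDispatcher in the system_server process, which keeps checking whether the user's input events have been handled; once the limit is exceeded, an ANR is raised.
2. The main thread does not finish a BroadcastReceiver's onReceive callback within 10 s (the limit for foreground broadcasts).
3. The main thread spends more than 20 s in one of a Service's lifecycle callbacks (the limit for foreground services).
4. A large number of threads spin in endless loops doing work, so the app cannot get CPU time slices to handle user input.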
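These four cases boil down to two root causes:
1. The main thread performs a time-consuming operation, so subsequent user input is not handled in time. Google considers this unacceptable because it badly hurts the user experience, so the system reports it and the developer has to fix it.
2. Every piece of work in an app, whether in a process or a thread, ultimately needs CPU time slices. If the CPU is overloaded, say at 100%, and new events keep entering the queue, those new events can only wait. If the new event is a user input event, we are back at cause 1, with one difference: the new event itself may not be slow at all. Even so, if InputDispatcher finds that it has not been handled within the limit, it raises an ANR, and the developer has to fix the CPU load problem. (This second cause is in fact what I ran into this time: the engineering unit the company purchased uses a low-end MediaTek CPU.)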
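The limits quoted above can be put side by side in a small sketch. This is not framework code: the class, enum and method names are hypothetical, and the only facts taken from the text are the 5 s, 10 s and 20 s thresholds. In the real system the checks are spread across InputDispatcher and ActivityManagerService.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, NOT framework code: it only encodes the limits quoted in this post.
class AnrTimeouts {
    enum Kind {
        INPUT_EVENT(5),         // key or touch event not handled within 5 s
        BROADCAST_RECEIVER(10), // onReceive() not finished within 10 s
        SERVICE(20);            // service lifecycle callback not finished within 20 s

        final long limitMillis;

        Kind(long seconds) {
            this.limitMillis = TimeUnit.SECONDS.toMillis(seconds);
        }
    }

    // True when the work has been pending longer than the limit for its kind.
    static boolean wouldTriggerAnr(Kind kind, long startedAtMillis, long nowMillis) {
        return nowMillis - startedAtMillis > kind.limitMillis;
    }

    public static void main(String[] args) {
        // 10212 ms is the "wait queue head age" that appears in the log later in this post.
        System.out.println(wouldTriggerAnr(Kind.INPUT_EVENT, 0, 10_212)); // prints true
        System.out.println(wouldTriggerAnr(Kind.SERVICE, 0, 10_212));     // prints false
    }
}
```

Note that the check says nothing about why the work is late: a slow handler (cause 1) and a starved CPU (cause 2) look identical from the outside.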
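An ANR is not like the runtime exceptions we usually meet during development (NullPointerException, IndexOutOfBoundsException), where the error log is easy to read and a dialog lets the user quit or restart. Nor is it like an NDK crash, where the app simply exits with no dialog at all. An ANR is more painful. If it happens in the app process, the worst case is that the app stops responding; but if it happens in a system process, the phone becomes unusable for a while and feels completely out of our control, which is really depressing.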
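With the causes clear, how do we avoid ANRs in everyday development?
1. Do no time-consuming work on the main thread; move it to a worker thread.
2. Watch how threads are used during development and check whether they drive the CPU load too high.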
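The first rule can be sketched in plain Java. This is only an illustration under stated assumptions: a single-thread executor named "fake-main" stands in for the Android main thread (on a device the result would typically be posted back through a Handler bound to the main Looper), and slowLoad() and the other names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class OffloadSketch {
    // Hypothetical stand-in for slow I/O (disk, network, database).
    static String slowLoad() throws InterruptedException {
        Thread.sleep(200);
        return "loaded";
    }

    // Runs slowLoad() on a worker thread and hands the result back to the "main" executor.
    // Returns "<receiving thread name>:<result>", or null if nothing arrived in time.
    static String loadOffMain() {
        ExecutorService uiExecutor = Executors.newSingleThreadExecutor(r -> new Thread(r, "fake-main"));
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> new Thread(r, "worker"));
        AtomicReference<String> delivered = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        try {
            worker.execute(() -> {
                try {
                    String result = slowLoad(); // the slow part never touches "fake-main"
                    uiExecutor.execute(() -> {
                        // Only this short hand-over runs on the stand-in main thread.
                        delivered.set(Thread.currentThread().getName() + ":" + result);
                        done.countDown();
                    });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    done.countDown();
                }
            });
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            worker.shutdown();
            uiExecutor.shutdown();
        }
        return delivered.get();
    }

    public static void main(String[] args) {
        System.out.println(loadOffMain()); // prints fake-main:loaded
    }
}
```

While the worker sleeps, "fake-main" stays free to run other tasks, which is exactly what keeps input events from piling up.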
Since the ANR has appeared, let's work out how to fix it.
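My first thought was that the code around the four main components might contain some slow operation, so I started checking. After going through everything I found nothing slow on the main thread; all slow work was already done in worker threads. Besides, if there were something, the ANR should have shown up on the other phones too. At that point I had not thought of the CPU, and could not think of any other cause either, so the only option was to pull the ANR log. How? When an app hits an ANR, the appNotResponding method of com\android\server\am\ActivityManagerService is called, and the ANR information is written to /data/anr/traces.txt: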
final void appNotResponding(ProcessRecord app, ActivityRecord activity,
        ActivityRecord parent, boolean aboveSystem, final String annotation) {
    ArrayList<Integer> firstPids = new ArrayList<Integer>(5);
    SparseArray<Boolean> lastPids = new SparseArray<Boolean>(20);

    if (mController != null) {
        try {
            // 0 == continue, -1 = kill process immediately
            int res = mController.appEarlyNotResponding(app.processName, app.pid, annotation);
            if (res < 0 && app.pid != MY_PID) {
                app.kill("anr", true);
            }
        } catch (RemoteException e) {
            mController = null;
            Watchdog.getInstance().setActivityController(null);
        }
    }

    long anrTime = SystemClock.uptimeMillis();
    if (MONITOR_CPU_USAGE) {
        updateCpuStatsNow(); // refresh the CPU usage statistics
    }

    synchronized (this) {
        // PowerManager.reboot() can block for a long time, so ignore ANRs while shutting down.
        if (mShuttingDown) {
            Slog.i(TAG, "During shutdown skipping ANR: " + app + " " + annotation);
            return;
        } else if (app.notResponding) {
            Slog.i(TAG, "Skipping duplicate ANR: " + app + " " + annotation);
            return;
        } else if (app.crashing) {
            Slog.i(TAG, "Crashing app skipping ANR: " + app + " " + annotation);
            return;
        }

        // In case we come through here for the same app before completing
        // this one, mark as anring now so we will bail out.
        app.notResponding = true;

        // Log the ANR to the event log.
        EventLog.writeEvent(EventLogTags.AM_ANR, app.userId, app.pid,
                app.processName, app.info.flags, annotation);

        // Dump thread traces as quickly as we can, starting with "interesting" processes.
        firstPids.add(app.pid);

        int parentPid = app.pid;
        if (parent != null && parent.app != null && parent.app.pid > 0) parentPid = parent.app.pid;
        if (parentPid != app.pid) firstPids.add(parentPid);

        if (MY_PID != app.pid && MY_PID != parentPid) firstPids.add(MY_PID);

        for (int i = mLruProcesses.size() - 1; i >= 0; i--) {
            ProcessRecord r = mLruProcesses.get(i);
            if (r != null && r.thread != null) {
                int pid = r.pid;
                if (pid > 0 && pid != app.pid && pid != parentPid && pid != MY_PID) {
                    if (r.persistent) {
                        firstPids.add(pid);
                    } else {
                        lastPids.put(pid, Boolean.TRUE);
                    }
                }
            }
        }
    }

    // Log the ANR to the main log.
    StringBuilder info = new StringBuilder();
    info.setLength(0);
    info.append("ANR in ").append(app.processName);
    if (activity != null && activity.shortComponentName != null) {
        info.append(" (").append(activity.shortComponentName).append(")");
    }
    info.append("\n");
    info.append("PID: ").append(app.pid).append("\n");
    if (annotation != null) {
        info.append("Reason: ").append(annotation).append("\n");
    }
    if (parent != null && parent != activity) {
        info.append("Parent: ").append(parent.shortComponentName).append("\n");
    }

    final ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);

    // dumpStackTraces is the function that writes the traces file
    File tracesFile = dumpStackTraces(true, firstPids, processCpuTracker, lastPids,
            NATIVE_STACKS_OF_INTEREST);

    String cpuInfo = null;
    if (MONITOR_CPU_USAGE) {
        updateCpuStatsNow(); // refresh the CPU statistics again
        synchronized (mProcessCpuTracker) {
            // CPU usage over a period of time before the ANR
            cpuInfo = mProcessCpuTracker.printCurrentState(anrTime);
        }
        info.append(processCpuTracker.printCurrentLoad());
        info.append(cpuInfo);
    }

    // CPU usage over a period of time after the ANR
    info.append(processCpuTracker.printCurrentState(anrTime));

    Slog.e(TAG, info.toString());
    if (tracesFile == null) {
        // There is no trace file, so dump (only) the alleged culprit's threads to the log
        Process.sendSignal(app.pid, Process.SIGNAL_QUIT);
    }

    // Also write the ANR information to DropBox
    addErrorToDropBox("anr", app, app.processName, activity, parent, annotation,
            cpuInfo, tracesFile, null);

    if (mController != null) {
        try {
            // 0 == show dialog, 1 = keep waiting, -1 = kill process immediately
            int res = mController.appNotResponding(app.processName, app.pid, info.toString());
            if (res != 0) {
                if (res < 0 && app.pid != MY_PID) {
                    app.kill("anr", true);
                } else {
                    synchronized (this) {
                        mServices.scheduleServiceTimeoutLocked(app);
                    }
                }
                return;
            }
        } catch (RemoteException e) {
            mController = null;
            Watchdog.getInstance().setActivityController(null);
        }
    }

    // Unless configured otherwise, swallow ANRs in background processes & kill the process.
    boolean showBackground = Settings.Secure.getInt(mContext.getContentResolver(),
            Settings.Secure.ANR_SHOW_BACKGROUND, 0) != 0;

    synchronized (this) {
        mBatteryStatsService.noteProcessAnr(app.processName, app.uid);

        if (!showBackground && !app.isInterestingToUserLocked() && app.pid != MY_PID) {
            app.kill("bg anr", true);
            return;
        }

        // Set the app's notResponding state, and look up the errorReportReceiver
        makeAppNotRespondingLocked(app,
                activity != null ? activity.shortComponentName : null,
                annotation != null ? "ANR " + annotation : "ANR",
                info.toString());

        // Set the trace file name to app name + current date format to avoid overwriting trace file
        String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
        if (tracesPath != null && tracesPath.length() != 0) {
            File traceRenameFile = new File(tracesPath);
            String newTracesPath;
            int lpos = tracesPath.lastIndexOf(".");
            if (-1 != lpos)
                newTracesPath = tracesPath.substring(0, lpos) + "_" + app.processName + "_"
                        + mTraceDateFormat.format(new Date()) + tracesPath.substring(lpos);
            else
                newTracesPath = tracesPath + "_" + app.processName;
            traceRenameFile.renameTo(new File(newTracesPath));
        }

        // Show the ANR dialog
        // Bring up the infamous App Not Responding dialog
        Message msg = Message.obtain();
        HashMap<String, Object> map = new HashMap<String, Object>();
        msg.what = SHOW_NOT_RESPONDING_MSG;
        msg.obj = map;
        msg.arg1 = aboveSystem ? 1 : 0;
        map.put("app", app);
        if (activity != null) {
            map.put("activity", activity);
        }

        mUiHandler.sendMessage(msg);
    }
}
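The rename step near the end of the listing is one reason the file in /data/anr is not always called traces.txt. The sketch below isolates just that string manipulation so it can be run on a desktop JVM. The class and method names are hypothetical, and the timestamp is passed in as a plain string because the pattern behind mTraceDateFormat is not shown in the listing.

```java
class TraceNameSketch {
    // Mirrors the string logic of the rename step: insert "_<process>_<timestamp>" before the
    // extension, or append "_<process>" when the path has no '.' in it.
    static String renamedTracePath(String tracesPath, String processName, String timestamp) {
        int lpos = tracesPath.lastIndexOf('.');
        if (lpos != -1) {
            return tracesPath.substring(0, lpos) + "_" + processName + "_" + timestamp
                    + tracesPath.substring(lpos);
        }
        return tracesPath + "_" + processName;
    }

    public static void main(String[] args) {
        // The timestamp format here is an assumption made for illustration only.
        System.out.println(renamedTracePath("/data/anr/traces.txt", "com.xxx.xxx", "2018-03-24-18-15-04"));
        // prints /data/anr/traces_com.xxx.xxx_2018-03-24-18-15-04.txt
    }
}
```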
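Next, export this file. Connect the phone to the computer and open a command prompt; if adb is not on your PATH, change to the directory that contains adb.exe first. Then enter the following commands: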
> adb shell
$ cat data/anr/traces.txt > /mnt/sdcard/traces.txt
$ exit
> adb pull /mnt/sdcard/traces.txt d:\ANR
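There is a simpler way, of course. The extra step above is only there because I wanted a copy on the phone's SD card; if you don't need that, pull the file straight to the computer. Listing the directory first shows the actual file name, since not every manufacturer names it traces.txt:
> adb shell
$ cd data/anr
$ ls
$ exit
> adb pull data/anr/traces.txt d:\ANR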
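This copies traces.txt to the ANR folder on drive D of the computer. On some phones running newer Android versions, however, the pull fails with a permission error: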
adb: error: failed to copy 'data/anr/anr_2019-01-30-13-35-18-005' to '.\anr_2019-01-30-13-35-18-005': remote open failed: Permission denied
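In that case another command is needed, one that exports the system log for analysis; see the Google documentation for details.
// Android 6.0 and lower
adb bugreport > bugreport.txt
// Android 7.0 and higher
adb bugreport bugreport.zip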
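This command captures the system log, and the ANR information for the app is included in it.
Finally, open the file. Every time the app hit an ANR, the contents of traces.txt were the same, as follows:
----- pid 792 at 2018-03-24 18:15:04 ----- // PID of the process in which the ANR occurred, and the time
Cmd line: system_server // name of the process in which the ANR occurred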
ABI: arm64
Build type: optimized
Zygote loaded classes=3684 post zygote classes=2785
Intern table: 55484 strong; 3891 weak
JNI: CheckJNI is off; globals=2334 (plus 88 weak)
....................
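DALVIK THREADS (101):
"main" prio=5 tid=1 Native // "main" is the thread name; prio is the thread priority (default 5); tid is the thread's ID inside the runtime; Native is one of the thread states and means the thread is executing native (JNI) code
| group="main" sCount=1 dsCount=0 obj=0x757dafb8 self=0x7f7c0af800 // group is the thread group name; sCount is the number of times the thread has been suspended; dsCount is the number of times it has been suspended by the debugger; obj is the address of the thread's Java object; self is the address of the native thread object
| sysTid=792 nice=-2 cgrp=default sched=0/0 handle=0x7f7ff8feb0 // sysTid is the Linux kernel thread ID; nice is the scheduling niceness; cgrp is the scheduling cgroup; sched gives the scheduling policy and priority; handle is the address of the thread's pthread handle
| state=S schedstat=( 768518653810 256839614325 1023817 ) utm=72051 stm=4800 core=0 HZ=100 // state is the kernel scheduling state (S = sleeping); the three schedstat values are the time spent running on a CPU, the time spent waiting on a run queue, and the number of time slices run; utm is the time spent in user mode (in jiffies); stm is the time spent in kernel mode (in jiffies); core is the index of the CPU core that last ran the thread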
| stack=0x7fe61f6000-0x7fe61f8000 stackSize=8MB
| held mutexes=
kernel: __switch_to+0x74/0x8c
kernel: SyS_epoll_wait+0x304/0x44c
kernel: SyS_epoll_pwait+0x118/0x124
kernel: cpu_switch_to+0x48/0x4c
native: #00 pc 000199cc /system/lib64/libc.so (syscall+28)
native: #01 pc 000d2ca4 /system/lib64/libart.so (art::ConditionVariable::Wait(art::Thread*)+140)
native: #02 pc 003a22d8 /system/lib64/libart.so (art::GoToRunnable(art::Thread*)+1252)
native: #03 pc 000a517c /system/lib64/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+24)
native: #04 pc 0010ff54 /data/dalvik-cache/arm64/system@framework@boot.oat (Java_android_os_MessageQueue_nativePollOnce__JI+168)
at android.os.MessageQueue.nativePollOnce(Native method)
at android.os.MessageQueue.next(MessageQueue.java:148)
at android.os.Looper.loop(Looper.java:151)
at com.android.server.SystemServer.run(SystemServer.java:379)
at com.android.server.SystemServer.main(SystemServer.java:231)
at java.lang.reflect.Method.invoke!(Native method)
at java.lang.reflect.Method.invoke(Method.java:372)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:959)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:754)
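When a traces file contains a hundred threads, it helps to pull out just the header lines. The following is a minimal sketch that parses the first line of a thread entry as it appears in the dump above; the class name and the regular expression are my own and only cover that header shape, including the optional "daemon" marker.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadHeaderSketch {
    // Matches lines such as:  "main" prio=5 tid=1 Native
    private static final Pattern HEADER =
            Pattern.compile("^\"([^\"]+)\"( daemon)? prio=(\\d+) tid=(\\d+) (\\w+)");

    // Returns {name, prio, tid, state}, or null when the line is not a thread header.
    static String[] parse(String line) {
        Matcher m = HEADER.matcher(line.trim());
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        String[] h = parse("\"main\" prio=5 tid=1 Native");
        System.out.println(h[0] + " prio=" + h[1] + " tid=" + h[2] + " state=" + h[3]);
        // prints main prio=5 tid=1 state=Native
    }
}
```

Lines that are not headers (the "|" attribute lines and the stack frames) return null, so the same method can be used as a filter while scanning the file.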
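Analysing traces.txt gives some information about the ANR, such as which process it occurred in and the exact call stacks. If the process name is your app's, the main thread entry shows quite clearly which part of your code was running when the ANR occurred, and the problem is easy to track down. In my case, however, the process in the dump was system_server, and its main thread was simply parked in MessageQueue.nativePollOnce waiting on the message queue, which gives very little to go on. So what next? Since this file was not conclusive, I turned on the system log, used the app until the ANR occurred, and then pulled the runtime log, where this section appears: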
AEE/AED : CPU usage from 3051ms to 14ms ago with 99% awake:
AEE/AED : 98% 4191/com.xxx.xxx: 98% user + 0% kernel / faults: 12 minor
AEE/AED : 2.9% 803/system_server: 0.9% user + 1.9% kernel / faults: 476 minor
AEE/AED : 0.9% 1114/com.android.systemui: 0.6% user + 0.3% kernel / faults: 38 minor
AEE/AED : 0.6% 222/surfaceflinger: 0% user + 0.6% kernel / faults: 32 minor
AEE/AED : 0.6% 329/mobile_log_d: 0% user + 0.6% kernel
AEE/AED : 0.3% 8/rcu_preempt: 0% user + 0.3% kernel
AEE/AED : 0% 131/btif_rxd: 0% user + 0% kernel
AEE/AED : 0.3% 237/adbd: 0% user + 0.3% kernel / fau
AEE/AED : Process: com.xxx.xxx
AEE/AED : Flags: 0x98be46
AEE/AED : Package: com.xxx.xxx v1 (3.00.01build001)
AEE/AED : Activity: com.xxx.xxx/.xxxActvitivty
AEE/AED : Subject: Input dispatching timed out (Waiting to send non-key event because the touched window
has not finished processing certain input events that were delivered to it over 500.0ms ago. Wait queue length: 15. Wait queue head age: 10212.5ms.)
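This log shows very clearly that the app's CPU usage had reached 98% (com.xxx.xxx is my app's package name). The final description says the dispatcher was waiting to send a non-key event because the touched window had not finished processing certain input events delivered to it more than 500 ms earlier; the wait queue length was 15 and the event at the head of the queue had been waiting for over 10 s. This description is printed by the appNotResponding method, but the reason text itself is passed in from the inputDispatchingTimedOut method of the same class: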
public boolean inputDispatchingTimedOut(final ProcessRecord proc,
        final ActivityRecord activity, final ActivityRecord parent,
        final boolean aboveSystem, String reason) {
    if (checkCallingPermission(android.Manifest.permission.FILTER_EVENTS)
            != PackageManager.PERMISSION_GRANTED) {
        throw new SecurityException("Requires permission "
                + android.Manifest.permission.FILTER_EVENTS);
    }

    final String annotation;
    if (reason == null) {
        annotation = "Input dispatching timed out";
    } else {
        annotation = "Input dispatching timed out (" + reason + ")"; // the ANR reason text is assembled here
    }

    if (proc != null) {
        synchronized (this) {
            if (proc.debugging) {
                return false;
            }

            if (mDidDexOpt) {
                // Give more time since we were dexopting.
                mDidDexOpt = false;
                return false;
            }

            if (proc.instrumentationClass != null) {
                Bundle info = new Bundle();
                info.putString("shortMsg", "keyDispatchingTimedOut");
                info.putString("longMsg", annotation);
                finishInstrumentationLocked(proc, Activity.RESULT_CANCELED, info);
                return true;
            }
        }
        mHandler.post(new Runnable() {
            @Override
            public void run() {
                // writes the log and brings up the ANR dialog
                appNotResponding(proc, activity, parent, aboveSystem, annotation);
            }
        });
    }

    return true;
}
As you can see, this reason is also passed in from another method. Tracing upward leads to the notifyANR method of the InputMonitor class under com\android\server\wm:
@Override
public long notifyANR(InputApplicationHandle inputApplicationHandle,
        InputWindowHandle inputWindowHandle, String reason) {
    AppWindowToken appWindowToken = null;
    WindowState windowState = null;
    boolean aboveSystem = false;
    synchronized (mService.mWindowMap) {
        if (inputWindowHandle != null) {
            windowState = (WindowState) inputWindowHandle.windowState;
            if (windowState != null) {
                appWindowToken = windowState.mAppToken;
            }
        }
        if (appWindowToken == null && inputApplicationHandle != null) {
            appWindowToken = (AppWindowToken) inputApplicationHandle.appWindowToken;
        }

        if (windowState != null) {
            Slog.i(WindowManagerService.TAG, "Input event dispatching timed out "
                    + "sending to " + windowState.mAttrs.getTitle()
                    + ". Reason: " + reason);
            // Figure out whether this window is layered above system windows.
            // We need to do this here to help the activity manager know how to
            // layer its ANR dialog.
            int systemAlertLayer = mService.mPolicy.windowTypeToLayerLw(
                    WindowManager.LayoutParams.TYPE_SYSTEM_ALERT);
            aboveSystem = windowState.mBaseLayer > systemAlertLayer;
        } else if (appWindowToken != null) {
            Slog.i(WindowManagerService.TAG, "Input event dispatching timed out "
                    + "sending to application " + appWindowToken.stringName
                    + ". Reason: " + reason);
        } else {
            Slog.i(WindowManagerService.TAG, "Input event dispatching timed out "
                    + ". Reason: " + reason);
        }

        mService.saveANRStateLocked(appWindowToken, windowState, reason);
    }

    if (appWindowToken != null && appWindowToken.appToken != null) {
        try {
            // Notify the activity manager about the timeout and let it decide whether
            // to abort dispatching or keep waiting.
            boolean abort = appWindowToken.appToken.keyDispatchingTimedOut(reason);
            if (! abort) {
                // The activity manager declined to abort dispatching.
                // Wait a bit longer and timeout again later.
                return appWindowToken.inputDispatchingTimeoutNanos;
            }
        } catch (RemoteException ex) {
        }
    } else if (windowState != null) {
        try {
            // Notify the activity manager about the timeout and let it decide whether
            // to abort dispatching or keep waiting.
            // InputMonitor and ActivityManagerService both live in the SystemServer
            // process, so this is a direct call.
            long timeout = ActivityManagerNative.getDefault().inputDispatchingTimedOut(
                    windowState.mSession.mPid, aboveSystem, reason);
            if (timeout >= 0) {
                // The activity manager declined to abort dispatching.
                // Wait a bit longer and timeout again later.
                return timeout;
            }
        } catch (RemoteException ex) {
        }
    }
    return 0; // abort dispatching
}
Here too the reason is passed in, so keep tracing upward to the notifyANR method of the InputManagerService class under com\android\server\input:
// Native callback.
private long notifyANR(InputApplicationHandle inputApplicationHandle,
        InputWindowHandle inputWindowHandle, String reason) {
    return mWindowManagerCallbacks.notifyANR(
            inputApplicationHandle, inputWindowHandle, reason);
}
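mWindowManagerCallbacks is an interface defined in this class, and InputMonitor is its implementation, so calling this method on the interface ends up in InputMonitor. The comment on the method makes it clear that it is invoked from the native layer: key presses and touches are detected by the low-level driver and then passed up layer by layer. How the native layer makes this call is not traced here. Since the cause is now clear, the code tracing stops at this point.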
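From the analysis above, the app's CPU usage was too high, so other input events could not get CPU time; the event queue stayed blocked for too long, and an ANR followed.
Now let's use adb to look at the app's CPU usage while it is running.
Enter the following command: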
> adb shell dumpsys cpuinfo | find "com.xxx.xxx"
This shows the CPU usage of a single process; the process name is normally the app's package name.
The top command can also be used for this; for the differences between the two commands, see https://blog.csdn.net/xiaodanpeng/article/details/51838237
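Instead of reading the output by eye, the per-process lines can be picked apart with a few lines of Java. This sketch assumes the "<total>% <pid>/<process>: ..." line shape seen in the log excerpt earlier in this post; the class and method names are hypothetical, and other Android versions may print the lines differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CpuInfoSketch {
    // "<total>% <pid>/<process name>: " as it appears in the log excerpt.
    private static final Pattern ENTRY =
            Pattern.compile("(\\d+(?:\\.\\d+)?)% (\\d+)/(\\S+): ");

    // Name of the process with the highest total CPU share, or null if no line matches.
    static String busiestProcess(List<String> lines) {
        String best = null;
        double bestPercent = -1;
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            if (!m.find()) {
                continue; // header or unrelated line
            }
            double percent = Double.parseDouble(m.group(1));
            if (percent > bestPercent) {
                bestPercent = percent;
                best = m.group(3);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "AEE/AED : CPU usage from 3051ms to 14ms ago with 99% awake:",
                "AEE/AED : 98% 4191/com.xxx.xxx: 98% user + 0% kernel / faults: 12 minor",
                "AEE/AED : 2.9% 803/system_server: 0.9% user + 1.9% kernel / faults: 476 minor");
        System.out.println(busiestProcess(lines)); // prints com.xxx.xxx
    }
}
```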
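As soon as the app was opened, its CPU usage was already at 99%, which gave me a real scare. I then looked at what the app does at startup and, after some digging, found that it starts a background thread whose run method spins in an endless loop, checking the push-message queue and then giving the user a voice prompt. The fix was to sleep for 1 s whenever the message queue is empty. After that the app's CPU usage never went above 10%, and the ANR did not appear again.
In fact this problem only ever showed up on this phone; on the other test phones CPU usage never went above 10%.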
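The fix described here was to sleep for 1 s when the queue is empty. A variant of the same idea, and not what the app actually shipped, is to let the queue do the waiting: BlockingQueue.take() parks the thread until a message arrives, so an idle worker uses no CPU and still reacts immediately. The class name, the STOP sentinel and the message strings below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class PushWorkerSketch {
    static final String STOP = "__stop__"; // hypothetical sentinel that ends the loop

    // Handles messages until STOP arrives. take() blocks while the queue is empty,
    // so the loop never spins the way the original busy loop did.
    static List<String> drain(BlockingQueue<String> queue) {
        List<String> handled = new ArrayList<>();
        try {
            while (true) {
                String msg = queue.take();
                if (STOP.equals(msg)) {
                    break;
                }
                handled.add(msg); // the real app would trigger the voice prompt here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and stop draining
        }
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread worker = new Thread(() -> System.out.println(drain(queue)), "push-worker");
        worker.start();
        queue.put("msg-1");
        queue.put("msg-2");
        queue.put(STOP);
        worker.join(); // prints [msg-1, msg-2]
    }
}
```

The sleep-based fix is simpler to retrofit into an existing loop; the blocking version avoids both the busy spin and the up-to-1 s delay before a new message is noticed.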