「Android」ANR入门学习笔记

「Android」ANR入门学习笔记

ANR 全称是 Applicatipon Not Response ,Android 设计 ANR 的用意,是系统通过与之交互的组件以及用户交互进行超时监控,用来判断应用进程是否存在卡死或响应过慢的问题

ANR 的触发原因

应用层导致 ANR 的原因:

  • 函数阻塞:如死循环、主线程 IO、处理大数据
  • 锁出错:主线程等待子线程的锁
  • 内存紧张:系统分配给一个应用的内存是有上限的,长期处于内存紧张,会导致频繁内存交换,进而导致应用的一些操作超时

系统导致ANR 的原因:

  • CPU 被抢占:一般来说,前台在玩游戏,可能会导致你的后台广播被抢占
  • 系统服务无法及时响应:比如获取系统联系人等,系统的服务都是 Binder 机制,服务能力也是有限的,有可能系统服务长时间不响应导致 ANR
  • 其他应用占用大量内存

ANR 的触发标准

简单来说,ANR 就是系统响应超时,Android对于系统是否响应超时有如下标准:

  • Service 触发ANR Service Timeout:比如前台服务在20s内未执行完成,后台服务 Timeout 时间是前台服务的10倍,200s;
  • Broadcast 触发ANR BroadcastQueue Timeout:比如前台广播在10s内未执行完成,后台60s;
  • Provider 触发ANR ContentProvider Timeout:内容提供者,在 publish 过超时10s;
  • Input 触发ANR InputDispatching Timeout:输入事件分发超时5s,包括按键和触摸事件。
//ActiveServices.java
// How long we wait for a service to finish executing.
static final int SERVICE_BACKGROUND_TIMEOUT = SERVICE_TIMEOUT * 10;
// How long the startForegroundService() grace period is to get around to
// calling startForeground() before we ANR + stop it.
static final int SERVICE_START_FOREGROUND_TIMEOUT = 10*1000;

//ActivityManagerService.java
// How long we allow a receiver to run before giving up on it.
static final int BROADCAST_FG_TIMEOUT = 10*1000;
static final int BROADCAST_BG_TIMEOUT = 60*1000;
// How long we wait until we timeout on key dispatching.
static final int KEY_DISPATCHING_TIMEOUT = 5*1000;

ANR 的触发流程

对于Service、Broadcast、Provider触发的ANR来说,类似于是一个埋炸弹、拆炸弹、引爆炸弹的过程,例如startService的时候:

常规场景:埋炸弹 -- 拆炸弹

App->AMS: 启动Service
App->Handler: 启动标准时间20s倒计时(埋炸弹)
AMS-->App: 启动完成
App->Handler: 取消倒计时(拆炸弹)

ANR场景:埋炸弹 -- 引爆炸弹

App->AMS: 启动Service
App->Handler: 启动标准时间20s倒计时(埋炸弹)
Note right of AMS: 20s内未完成Service启动
Note right of Handler: 20s内未取消
Handler->AMS: 触发ANR(引爆炸弹)

对于Input触发的ANR来说,与Service、Broadcast、Provider触发的ANR有所不同,其并非时间到了就一定会引爆炸弹(触发ANR),而是在后续的Input事件上报的过程中,才会去主动检测前一个正在处理的事件是否超时。若超时则触发ANR,反之则重置计时器。

ANR 的dump流程

不管 ANR 是怎么发生的,最终逻辑都会调用appNotResponding函数:

graph TD
ActivityManagerService#inputDispatchingTimedOut
--> AnrHelper#appNotResponding
--> AnrConsumerThread#run
--> AnrRecord#appNotResponding
--> ProcessRecord#appNotResponding
//com.android.server.am.ProcessRecord.java
void appNotResponding(String activityShortComponentName, ApplicationInfo aInfo,
        String parentShortComponentName, WindowProcessController parentProcess,
        boolean aboveSystem, String annotation, boolean onlyDumpSelf) {
    ArrayList firstPids = new ArrayList<>(5);
    SparseArray lastPids = new SparseArray<>(20);

    mWindowProcessController.appEarlyNotResponding(annotation, () -> kill("anr",
                ApplicationExitInfo.REASON_ANR, true));

    long anrTime = SystemClock.uptimeMillis();
    if (isMonitorCpuUsage()) {
        mService.updateCpuStatsNow();
    }

    final boolean isSilentAnr;
    synchronized (mService) {
        //注释1
        // PowerManager.reboot() can block for a long time, so ignore ANRs while shutting down.
        //正在重启
        if (mService.mAtmInternal.isShuttingDown()) {
            Slog.i(TAG, "During shutdown skipping ANR: " + this + " " + annotation);
            return;
        } else if (isNotResponding()) {
            //已经处于ANR流程中
            Slog.i(TAG, "Skipping duplicate ANR: " + this + " " + annotation);
            return;
        } else if (isCrashing()) {
            //正在crash的状态
            Slog.i(TAG, "Crashing app skipping ANR: " + this + " " + annotation);
            return;
        } else if (killedByAm) {
            //app已经被killed
            Slog.i(TAG, "App already killed by AM skipping ANR: " + this + " " + annotation);
            return;
        } else if (killed) {
            //app已经死亡了
            Slog.i(TAG, "Skipping died app ANR: " + this + " " + annotation);
            return;
        }

        // In case we come through here for the same app before completing
        // this one, mark as anring now so we will bail out.
        //做个标记
        setNotResponding(true);

        // Log the ANR to the event log.
        EventLog.writeEvent(EventLogTags.AM_ANR, userId, pid, processName, info.flags,
                annotation);

        // Dump thread traces as quickly as we can, starting with "interesting" processes.
        firstPids.add(pid);

        // Don't dump other PIDs if it's a background ANR or is requested to only dump self.
        //注释2
        //沉默的anr : 这里表示后台anr
        isSilentAnr = isSilentAnr();
        if (!isSilentAnr && !onlyDumpSelf) {
            int parentPid = pid;
            if (parentProcess != null && parentProcess.getPid() > 0) {
                parentPid = parentProcess.getPid();
            }
            if (parentPid != pid) firstPids.add(parentPid);

            if (MY_PID != pid && MY_PID != parentPid) firstPids.add(MY_PID);

            //选择需要dump的进程
            for (int i = getLruProcessList().size() - 1; i >= 0; i--) {
                ProcessRecord r = getLruProcessList().get(i);
                if (r != null && r.thread != null) {
                    int myPid = r.pid;
                    if (myPid > 0 && myPid != pid && myPid != parentPid && myPid != MY_PID) {
                        if (r.isPersistent()) {
                            firstPids.add(myPid);
                            if (DEBUG_ANR) Slog.i(TAG, "Adding persistent proc: " + r);
                        } else if (r.treatLikeActivity) {
                            firstPids.add(myPid);
                            if (DEBUG_ANR) Slog.i(TAG, "Adding likely IME: " + r);
                        } else {
                            lastPids.put(myPid, Boolean.TRUE);
                            if (DEBUG_ANR) Slog.i(TAG, "Adding ANR proc: " + r);
                        }
                    }
                }
            }
        }
    }

    ......

    int[] pids = nativeProcs == null ? null : Process.getPidsForCommands(nativeProcs);
    ArrayList nativePids = null;

    if (pids != null) {
        nativePids = new ArrayList<>(pids.length);
        for (int i : pids) {
            nativePids.add(i);
        }
    }

    // For background ANRs, don't pass the ProcessCpuTracker to
    // avoid spending 1/2 second collecting stats to rank lastPids.
    StringWriter tracesFileException = new StringWriter();
    // To hold the start and end offset to the ANR trace file respectively.
    final long[] offsets = new long[2];
    //注释4
    File tracesFile = ActivityManagerService.dumpStackTraces(firstPids,
            isSilentAnr ? null : processCpuTracker, isSilentAnr ? null : lastPids,
            nativePids, tracesFileException, offsets);
        ......
}

注释1处针对几种特殊情况进行处理:正在重启、已经处于 ANR 流程中、正在crash、app 已经被 killed 和 app 已经死亡了,不用处理 ANR,直接 return。

注释2处 isSilentAnr 是表示当前是否为一个后台 ANR,前台 ANR 会弹出无响应的 Dialog,后台 ANR 会直接杀死进程。什么是前台 ANR:发生ANR的进程对用户来说有感知,就是前台 ANR,否则就是后台 ANR。

注释3处,选择需要 dump 的进程。发生 ANR 时,为了方便定位问题,会 dump 很多信息到 Trace 文件中。而 Trace 文件里包含着与 ANR 相关联的进程的 Trace 信息,因为产生 ANR 的原因有可能是其他的进程抢占了太多资源,或者 IPC 到其他进程的时候卡住导致的。需要被 dump 的进程分为3类:

  • firstPids:firstPids 是需要首先 dump 的重要进程,发生 ANR 的进程无论如何是一定要被 dump 的,也是首先被 dump 的,所以第一个被加到 firstPids 中。如果是 SilentAnr(即后台 ANR),不用再加入任何其他的进程。如果不是,需要进一步添加其他的进程:如果发生 ANR 的进程不是 system_server 进程的话,需要添加 system_server 进程;接下来轮询 AMS 维护的一个 LRU 的进程 List,如果最近访问的进程包含了 persistent 的进程,或者带有 BIND_TREAT_LIKE_ACTVITY 标签的进程,都添加到 firstPids 中。
  • extraPids:LRU 进程 List 中的其他进程,都会首先添加到lastPids中,然后lastPids会进一步被选出最近CPU使用率高的进程,进一步组成 extraPids;
  • nativePids:nativePids 最为简单,是一些固定的 native 的系统进程,定义在 WatchDog.java 中

注释4处,拿到需要 dump 的所有进程的 pid 后,AMS 开始按照 firstPids、 nativePids、extraPids 的顺序 dump 这些进程的堆栈:

public static Pair dumpStackTraces(String tracesFile, ArrayList firstPids,
        ArrayList nativePids, ArrayList extraPids) {

    // 最多dump 20秒
    long remainingTime = 20 * 1000;

    // First collect all of the stacks of the most important pids.
    if (firstPids != null) {
        int num = firstPids.size();
        for (int i = 0; i < num; i++) {
            final int pid = firstPids.get(i);
            final long timeTaken = dumpJavaTracesTombstoned(pid, tracesFile, remainingTime);
            remainingTime -= timeTaken;
            if (remainingTime <= 0) {
                Slog.e(TAG, "Aborting stack trace dump (current firstPid=" + pid
                        + "); deadline exceeded.");
                return firstPidStart >= 0 ? new Pair<>(firstPidStart, firstPidEnd) : null;
            }
        }
    }
    ......
}

根据顺序取出前面传入的 firstPids、nativePids 、extraPids 的 pid,然后逐一去 dump 这些进程中所有的线程。由于此处有多个进程且每个进程通常有多个线程,所以这里规定了个最长 dump 时间为20秒,超过则及时返回,这样可以确保ANR弹窗可以及时弹出(或者被 kill 掉)。

接下来的调用链为:

graph TD
ActivityManagerService#dumpJavaTracesTombstoned
--> Debug#dumpJavaBacktraceToFileTimeout
--> android_os_Debug#android_os_Debug_dumpJavaBacktraceToFileTimeout
--> android_os_Debug#dumpTraces
--> debuggerd_client#dump_backtrace_to_file_timeout
--> debuggerd_client#debuggerd_trigger_dump
bool debuggerd_trigger_dump(pid_t tid, DebuggerdDumpType dump_type, unsigned int timeout_ms, unique_fd output_fd) {
    //pid是从AMS那边传过来的,即需要dump堆栈的进程
        pid_t pid = tid;
    //......

    // Send the signal.
        //从android_os_Debug_dumpJavaBacktraceToFileTimeout过来的,dump_type为kDebuggerdJavaBacktrace
    const int signal = (dump_type == kDebuggerdJavaBacktrace) ? SIGQUIT : BIONIC_SIGNAL_DEBUGGER;
    sigval val = {.sival_int = (dump_type == kDebuggerdNativeBacktrace) ? 1 : 0};
        //sigqueue:在队列中向指定进程发送一个信号和数据,成功返回0
    if (sigqueue(pid, signal, val) != 0) {
      log_error(output_fd, errno, "failed to send signal to pid %d", pid);
      return false;
    }
    //......
    LOG(INFO) << TAG "done dumping process " << pid;
    return true;
}

除 Zygote 进程外,每个进程都会创建一个 SignalCatcher 守护线程,用于捕获 SIGQUIT、SIGUSR1 信号,并采取相应的行为。
AMS 进程间接给需要 dump 堆栈那个进程发送了一个 SIGQUIT 信号,进程收到 SIGQUIT 信号之后便开始 dump。也就是说,每当一个进程发生 ANR 时,则会收到 SIGQUIT 信号。换言之,如果能监控到系统发送的 SIGQUIT 信号,就能监控到发生了 ANR。

总结 ANR 的dump流程:
系统监控到 app 发生 ANR 后,收集了一些相关进程 pid(包括发生 ANR 的进程),准备让这些进程 dump 堆栈,从而生成 ANR Trace 文件,接着系统开始向这些进程发送 SIGQUIT 信号,进程收到 SIGQUIT 信号之后开始 dump 堆栈。
「Android」ANR入门学习笔记_第1张图片

ANR 的分析

获取 ANR 日志的方式:

  • adb pull /data/anr/
  • adb bugreport

trace文件分析

----- pid 7761 at 2022-11-02 07:02:26 -----
Cmd line: com.xfhy.watchsignaldemo
Build fingerprint: 'HUAWEI/LYA-AL00/HWLYA:10/HUAWEILYA-AL00/10.1.0.163C00:user/release-keys'
ABI: 'arm64'
Build type: optimized
Zygote loaded classes=11918 post zygote classes=729
Dumping registered class loaders
#0 dalvik.system.PathClassLoader: [], parent #1
#1 java.lang.BootClassLoader: [], no parent
#2 dalvik.system.PathClassLoader: [/system/app/FeatureFramework/FeatureFramework.apk], no parent
#3 dalvik.system.PathClassLoader: [/data/app/com.xfhy.watchsignaldemo-4tkKMWojrpHAf-Q3iecaHQ==/base.apk:/data/app/com.xfhy.watchsignaldemo-4tkKMWojrpHAf-Q3iecaHQ==/base.apk!classes2.dex:/data/app/com.xfhy.watchsignaldemo-4tkKMWojrpHAf-Q3iecaHQ==/base.apk!classes4.dex:/data/app/com.xfhy.watchsignaldemo-4tkKMWojrpHAf-Q3iecaHQ==/base.apk!classes3.dex], parent #1
Done dumping class loaders
Intern table: 44132 strong; 436 weak
JNI: CheckJNI is off; globals=681 (plus 67 weak)
Libraries: /data/app/com.xfhy.watchsignaldemo-4tkKMWojrpHAf-Q3iecaHQ==/lib/arm64/libwatchsignaldemo.so libandroid.so libcompiler_rt.so libhitrace_jni.so libhiview_jni.so libhwapsimpl_jni.so libiAwareSdk_jni.so libimonitor_jni.so libjavacore.so libjavacrypto.so libjnigraphics.so libmedia_jni.so libopenjdk.so libsoundpool.so libwebviewchromium_loader.so (15)
//已分配堆内存大小26M,其中2442kb医用,总分配74512个对象
Heap: 90% free, 2442KB/26MB; 74512 objects

Total number of allocations 120222 //进程创建到现在一共创建了多少对象
Total bytes allocated 10MB         //进程创建到现在一共申请了多少内存
Total bytes freed 8173KB           //进程创建到现在一共释放了多少内存
Free memory 23MB                   //不扩展堆的情况下可用的内存
Free memory until GC 23MB          //GC前的可用内存
Free memory until OOME 381MB       //OOM之前的可用内存,这个值很小的话,说明已经处于内存紧张状态,app可能是占用了过多的内存
Total memory 26MB                  //当前总内存(已用+可用)
Max memory 384MB                   //进程最多能申请的内存

.....//省略GC相关信息


//当前进程共17个线程
DALVIK THREADS (17):

//Signal Catcher线程调用栈
"Signal Catcher" daemon prio=5 tid=4 Runnable
  | group="system" sCount=0 dsCount=0 flags=0 obj=0x18c84570 self=0x7252417800
  | sysTid=7772 nice=0 cgrp=default sched=0/0 handle=0x725354ad50
  | state=R schedstat=( 16273959 1085938 5 ) utm=0 stm=1 core=4 HZ=100
  | stack=0x7253454000-0x7253456000 stackSize=991KB
  | held mutexes= "mutator lock"(shared held)
  native: #00 pc 000000000042f8e8  /apex/com.android.runtime/lib64/libart.so (art::DumpNativeStack(std::__1::basic_ostream>&, int, BacktraceMap*, char const*, art::ArtMethod*, void*, bool)+140)
  native: #01 pc 0000000000523590  /apex/com.android.runtime/lib64/libart.so (art::Thread::DumpStack(std::__1::basic_ostream>&, bool, BacktraceMap*, bool) const+508)
  native: #02 pc 000000000053e75c  /apex/com.android.runtime/lib64/libart.so (art::DumpCheckpoint::Run(art::Thread*)+844)
  native: #03 pc 000000000053735c  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::RunCheckpoint(art::Closure*, art::Closure*)+504)
  native: #04 pc 0000000000536744  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::Dump(std::__1::basic_ostream>&, bool)+1048)
  native: #05 pc 0000000000536228  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::DumpForSigQuit(std::__1::basic_ostream>&)+884)
  native: #06 pc 00000000004ee4d8  /apex/com.android.runtime/lib64/libart.so (art::Runtime::DumpForSigQuit(std::__1::basic_ostream>&)+196)
  native: #07 pc 000000000050250c  /apex/com.android.runtime/lib64/libart.so (art::SignalCatcher::HandleSigQuit()+1356)
  native: #08 pc 0000000000501558  /apex/com.android.runtime/lib64/libart.so (art::SignalCatcher::Run(void*)+268)
  native: #09 pc 00000000000cf7c0  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36)
  native: #10 pc 00000000000721a8  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64)
  (no managed stack frames)

"main" prio=5 tid=1 Sleeping
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x73907540 self=0x725f010800
  | sysTid=7761 nice=-10 cgrp=default sched=1073741825/2 handle=0x72e60080d0
  | state=S schedstat=( 281909898 5919799 311 ) utm=20 stm=7 core=4 HZ=100
  | stack=0x7fca180000-0x7fca182000 stackSize=8192KB
  | held mutexes=
  at java.lang.Thread.sleep(Native method)
  - sleeping on <0x00f895d9> (a java.lang.Object)
  at java.lang.Thread.sleep(Thread.java:443)
  - locked <0x00f895d9> (a java.lang.Object)
  at java.lang.Thread.sleep(Thread.java:359)
  at android.os.SystemClock.sleep(SystemClock.java:131)
  at com.xfhy.watchsignaldemo.MainActivity.makeAnr(MainActivity.kt:35)
  at java.lang.reflect.Method.invoke(Native method)
  at androidx.appcompat.app.AppCompatViewInflater$DeclaredOnClickListener.onClick(AppCompatViewInflater.java:441)
  at android.view.View.performClick(View.java:7317)
  at com.google.android.material.button.MaterialButton.performClick(MaterialButton.java:1219)
  at android.view.View.performClickInternal(View.java:7291)
  at android.view.View.access$3600(View.java:838)
  at android.view.View$PerformClick.run(View.java:28247)
  at android.os.Handler.handleCallback(Handler.java:900)
  at android.os.Handler.dispatchMessage(Handler.java:103)
  at android.os.Looper.loop(Looper.java:219)
  at android.app.ActivityThread.main(ActivityThread.java:8668)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:513)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1109)

  ... //此处省略剩余的N个线程

trace 参数详细解读:

"Signal Catcher" daemon prio=5 tid=4 Runnable
  | group="system" sCount=0 dsCount=0 flags=0 obj=0x18c84570 self=0x7252417800
  | sysTid=7772 nice=0 cgrp=default sched=0/0 handle=0x725354ad50
  | state=R schedstat=( 16273959 1085938 5 ) utm=0 stm=1 core=4 HZ=100
  | stack=0x7253454000-0x7253456000 stackSize=991KB
  | held mutexes= "mutator lock"(shared held)

第1行:"Signal Catcher" daemon prio=5 tid=4 Runnable

  • “Signal Catcher” daemon:线程名,有 daemon 表示守护线程
  • prio:线程优先级
  • tid:线程内部 id
  • 线程状态:Runnable

    一般来说,main线程处于 BLOCK、WAITING、TIMEWAITING 状态,基本上是函数阻塞导致的 ANR,如果 main 线程无异常,则应该排查 CPU 负载和内存环境。

第2行:| group="system" sCount=0 dsCount=0 flags=0 obj=0x18c84570 self=0x7252417800

  • group:线程所属的线程组
  • sCount:线程挂起次数
  • dsCount:用于调试的线程挂起次数
  • obj:当前线程关联的 Java 线程对象
  • self:当前线程地址

第3行:| sysTid=7772 nice=0 cgrp=default sched=0/0 handle=0x725354ad50

  • sysTid:线程真正意义上的 tid
  • nice:调度优先级,值越小则优先级越高
  • cgrp:进程所属的进程调度组
  • sched:调度策略
  • handle:函数处理地址

第4行:| state=R schedstat=( 16273959 1085938 5 ) utm=0 stm=1 core=4 HZ=100

  • state:线程状态
  • schedstat:CPU调度时间统计(schedstat 括号中的3个数字依次是Running、Runable、Switch,Running 时间:CPU 运行的时间,单位 ns,Runable 时间:RQ 队列的等待时间,单位 ns,Switch 次数:CPU 调度切换次数)
  • utm/stm:用户态/内核态的 CPU 时间
  • core:该线程的最后运行所在核
  • HZ:时钟频率

第5行:| stack=0x7253454000-0x7253456000 stackSize=991KB

  • stack:线程栈的地址区间
  • stackSize:栈的大小

第6行:| held mutexes= "mutator lock"(shared held)

  • mutex:所持有 mutex 类型,有独占锁 exclusive 和共享锁 shared 两类

案例分析

主线程无卡顿,处于正常状态堆栈
"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x74b38080 self=0x7ad9014c00
  | sysTid=23081 nice=0 cgrp=default sched=0/0 handle=0x7b5fdc5548
  | state=S schedstat=( 284838633 166738594 505 ) utm=21 stm=7 core=1 HZ=100
  | stack=0x7fc95da000-0x7fc95dc000 stackSize=8MB
  | held mutexes=
  kernel: __switch_to+0xb0/0xbc
  kernel: SyS_epoll_wait+0x288/0x364
  kernel: SyS_epoll_pwait+0xb0/0x124
  kernel: cpu_switch_to+0x38c/0x2258
  native: #00 pc 000000000007cd8c  /system/lib64/libc.so (__epoll_pwait+8)
  native: #01 pc 0000000000014d48  /system/lib64/libutils.so (android::Looper::pollInner(int)+148)
  native: #02 pc 0000000000014c18  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+60)
  native: #03 pc 00000000001275f4  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:330)
  at android.os.Looper.loop(Looper.java:169)
  at android.app.ActivityThread.main(ActivityThread.java:7073)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:536)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:876)

主线程空闲,因为它正处于 nativePollOnce,正在等待新消息。处于这个状态还发生了 ANR,可能有2个原因:

  • dump 堆栈时机太晚了,ANR 已经发生过了,才去 dump 堆栈,此时主线程已经恢复正常了
  • CPU 抢占或者内存紧张等其他因素引起

遇到这种情况,要先去分析 CPU、内存的使用情况。其次可以关注抓取日志的时间和 ANR 发生的时间是否相隔太久,时间太久这个堆栈就没有分析的意义了。

主线程执行耗时操作
suspend all histogram:    Sum: 206us 99% C.I. 0.098us-46us Avg: 7.629us Max: 46us
DALVIK THREADS (16):
"main" prio=5 tid=1 Runnable
  | group="main" sCount=0 dsCount=0 flags=0 obj=0x73907540 self=0x725f010800
  | sysTid=32298 nice=-10 cgrp=default sched=1073741825/2 handle=0x72e60080d0
  | state=R schedstat=( 6746757297 5887495 256 ) utm=670 stm=4 core=6 HZ=100
  | stack=0x7fca180000-0x7fca182000 stackSize=8192KB
  | held mutexes= "mutator lock"(shared held)
  at com.xfhy.watchsignaldemo.MainActivity.makeAnr(MainActivity.kt:58)
  at java.lang.reflect.Method.invoke(Native method)
  at androidx.appcompat.app.AppCompatViewInflater$DeclaredOnClickListener.onClick(AppCompatViewInflater.java:441)
  at android.view.View.performClick(View.java:7317)
  at com.google.android.material.button.MaterialButton.performClick(MaterialButton.java:1219)
  at android.view.View.performClickInternal(View.java:7291)
  at android.view.View.access$3600(View.java:838)
  at android.view.View$PerformClick.run(View.java:28247)
  at android.os.Handler.handleCallback(Handler.java:900)
  at android.os.Handler.dispatchMessage(Handler.java:103)
  at android.os.Looper.loop(Looper.java:219)
  at android.app.ActivityThread.main(ActivityThread.java:8668)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:513)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1109)

主线程处于执行状态,不是空闲状态,导致 ANR 了,说明com.xfhy.watchsignaldemo.MainActivity.makeAnr有耗时操作。

主线程被锁阻塞
"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x73907540 self=0x725f010800
  | sysTid=19900 nice=-10 cgrp=default sched=0/0 handle=0x72e60080d0
  | state=S schedstat=( 542745832 9516666 182 ) utm=48 stm=5 core=4 HZ=100
  | stack=0x7fca180000-0x7fca182000 stackSize=8192KB
  | held mutexes=
  at com.xfhy.watchsignaldemo.MainActivity.makeAnr(MainActivity.kt:59)
  - waiting to lock <0x0c6f8c52> (a java.lang.Object) held by thread 22   //注释1
  - locked <0x01abeb23> (a java.lang.Object)
  at java.lang.reflect.Method.invoke(Native method)
  at androidx.appcompat.app.AppCompatViewInflater$DeclaredOnClickListener.onClick(AppCompatViewInflater.java:441)
  at android.view.View.performClick(View.java:7317)
  at com.google.android.material.button.MaterialButton.performClick(MaterialButton.java:1219)
  at android.view.View.performClickInternal(View.java:7291)
  at android.view.View.access$3600(View.java:838)
  at android.view.View$PerformClick.run(View.java:28247)
  at android.os.Handler.handleCallback(Handler.java:900)
  at android.os.Handler.dispatchMessage(Handler.java:103)
  at android.os.Looper.loop(Looper.java:219)
  at android.app.ActivityThread.main(ActivityThread.java:8668)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:513)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1109)

"卧槽" prio=5 tid=22 Blocked  //注释2
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x12c8a118 self=0x71d625f800
  | sysTid=20611 nice=0 cgrp=default sched=0/0 handle=0x71d4513d50
  | state=S schedstat=( 486459 0 3 ) utm=0 stm=0 core=4 HZ=100
  | stack=0x71d4411000-0x71d4413000 stackSize=1039KB
  | held mutexes=
  at com.xfhy.watchsignaldemo.MainActivity$makeAnr$1.invoke(MainActivity.kt:52)
  - waiting to lock <0x01abeb23> (a java.lang.Object) held by thread 1
  - locked <0x0c6f8c52> (a java.lang.Object)  
  at com.xfhy.watchsignaldemo.MainActivity$makeAnr$1.invoke(MainActivity.kt:49)
  at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)

......

其中:

"main" prio=5 tid=1 Blocked
  - waiting to lock <0x0c6f8c52> (a java.lang.Object) held by thread 22
  - locked <0x01abeb23> (a java.lang.Object)

"卧槽" prio=5 tid=22 Blocked
  - waiting to lock <0x01abeb23> (a java.lang.Object) held by thread 1
  - locked <0x0c6f8c52> (a java.lang.Object) 

主线程的 tid 是1,线程状态是 Blocked,正在等待0x0c6f8c52这个 Object,而这个 Object 被 thread 22这个线程所持有,主线程当前持有的是0x01abeb23的锁。而卧槽的tid是22,也是 Blocked 状态,它想请求的和已有的锁刚好与主线程相反。这样的话,ANR原因也就找到了。线程22持有了一把锁,并且一直不释放,主线程等待这把锁发生超时。在线上环境,常见因锁而 ANR 的场景是 SharePreference 写入。

CPU被抢占
CPU usage from 0ms to 10625ms later (2020-03-09 14:38:31.633 to 2020-03-09 14:38:42.257):
  543% 2045/com.test.demo: 54% user + 89% kernel / faults: 4608 minor 1 major //注意看这里
  99% 674/[email protected]: 81% user + 18% kernel / faults: 403 minor
  24% 32589/com.wang.test: 22% user + 1.4% kernel / faults: 7432 minor 1 major
  ......

该进程占据 CPU 高达543%,抢占了大部分 CPU 资源,因为导致发生 ANR,这种 ANR 通常与 App 无关。

内存紧张导致ANR
10-31 22:37:19.749 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
10-31 22:37:33.458 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
10-31 22:38:00.153 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
10-31 22:38:58.731 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
10-31 22:39:02.816 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0

如果一份 ANR 日志的 CPU 和堆栈都很正常,可以考虑是内存紧张。看一下 ANR 日志里面的内存相关部分。还可以去日志里面搜一下 onTrimMemory,如果 dump ANR 日志的时间附近有相关日志,可能是内存比较紧张了。

系统服务超时导致ANR
"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x727851e8 self=0x78d7060e00
  | sysTid=4894 nice=0 cgrp=default sched=0/0 handle=0x795cc1e9a8
  | state=S schedstat=( 8292806752 1621087524 7167 ) utm=707 stm=122 core=5 HZ=100
  | stack=0x7febb64000-0x7febb66000 stackSize=8MB
  | held mutexes=
  kernel: __switch_to+0x90/0xc4
  kernel: binder_thread_read+0xbd8/0x144c
  kernel: binder_ioctl_write_read.constprop.58+0x20c/0x348
  kernel: binder_ioctl+0x5d4/0x88c
  kernel: do_vfs_ioctl+0xb8/0xb1c
  kernel: SyS_ioctl+0x84/0x98
  kernel: cpu_switch_to+0x34c/0x22c0
  native: #00 pc 000000000007a2ac  /system/lib64/libc.so (__ioctl+4)
  native: #01 pc 00000000000276ec  /system/lib64/libc.so (ioctl+132)
  native: #02 pc 00000000000557d4  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+252)
  native: #03 pc 0000000000056494  /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+60)
  native: #04 pc 00000000000562d0  /system/lib64/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+216)
  native: #05 pc 000000000004ce1c  /system/lib64/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+72)
  native: #06 pc 00000000001281c8  /system/lib64/libandroid_runtime.so (???)
  native: #07 pc 0000000000947ed4  /system/framework/arm64/boot-framework.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+196)
  at android.os.BinderProxy.transactNative(Native method) ————————————————关键行!!!
  at android.os.BinderProxy.transact(Binder.java:804)
  at android.net.IConnectivityManager$Stub$Proxy.getActiveNetworkInfo(IConnectivityManager.java:1204)—关键行!
  at android.net.ConnectivityManager.getActiveNetworkInfo(ConnectivityManager.java:800)
  at com.xiaomi.NetworkUtils.getNetworkInfo(NetworkUtils.java:2)
  at com.xiaomi.frameworkbase.utils.NetworkUtils.getNetWorkType(NetworkUtils.java:1)
  at com.xiaomi.frameworkbase.utils.NetworkUtils.isWifiConnected(NetworkUtils.java:1)

系统服务超时一般会包含 BinderProxy.transactNative 关键字。
从日志堆栈中可以看到是获取网络信息发生了 ANR:getActiveNetworkInfo。系统的服务都是 Binder 机制(16个线程),服务能力也是有限的,有可能系统服务长时间不响应导致 ANR。如果其他应用占用了所有 Binder 线程,那么当前应用只能等待。可进一步搜索 blockUntilThreadAvailable 关键字,at android.os.Binder.blockUntilThreadAvailable(Native method)。
如果有发现某个线程的堆栈,包含此字样,可进一步看其堆栈,确定是调用了什么系统服务。此类ANR也是属于系统环境的问题,如果某类型手机上频繁发生此问题,应用层可以考虑规避策略。

参考

https://mp.weixin.qq.com/s/qQ...
https://xfhy666.blog.csdn.net...

你可能感兴趣的:(android)