GC suspend abort

问题现象

  • 复现步骤
    • 手机(Android 7.0)
    • Monkey测试,Camera2 app偶现native crash

定位分析

  • 相关log
    • tombstone
      Revision: '0'
      ABI: 'arm' 
      pid: 20403, tid: 20414, name: HeapTaskDaemon  >>> com.android.camera2 <<<
      signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
      Abort message: 'art/runtime/thread_list.cc:621] Check failed: cur_val == 0 (cur_val=1, 0=0) '
      r0 00000000  r1 00004fbe  r2 00000006  r3 00000008
      r4 a0c7f978  r5 00000006  r6 a0c7f920  r7 0000010c
      r8 a7085900  r9 a6fae8c7  sl a0c7f2ec  fp 00000001
      ip 0000000b  sp a0c7f1f0  lr a7224597  pc a7226df4  cpsr 20010010
      d0  0000000000000000  d1  0000000000000000
      d2  0000000000000000  d3  0000000000000000
      d4  0000000000000000  d5  0000000000000000
      d6  0000000000000000  d7  0000000000000000
      d8  0000000000000000  d9  0000000000000000
      d10 0000000000000000  d11 0000000000000000
      d12 0000000000000000  d13 0000000000000000
      d14 0000000000000000  d15 0000000000000000
      d16 2e74736973726570  d17 2e6761742e676f6c
      d18 a72360a3a723609b  d19 a72360afa72360ab
      d20 656d72612c613776  d21 0065006d00720061
      d22 656d72612c613776  d23 0000000000000000
      d24 0000003f0000003f  d25 0000000000000000
      d26 0000000000000000  d27 0000000000000000
      d28 0000000000000000  d29 0000000000000000
      d30 0000000000000000  d31 00000000a70e7940
      scr 80000093
      
      backtrace:
      #00 pc 00049df4  /system/lib/libc.so (tgkill+12)
      #01 pc 00047593  /system/lib/libc.so (pthread_kill+34)
      #02 pc 0001d855  /system/lib/libc.so (raise+10)
      #03 pc 000193a1  /system/lib/libc.so (__libc_android_abort+34)
      #04 pc 00017014  /system/lib/libc.so (abort+4)
      #05 pc 003188f5  /system/lib/libart.so (_ZN3art7Runtime5AbortEv+252)
      #06 pc 000b4e79  /system/lib/libart.so (_ZN3art10LogMessageD2Ev+864)
      #07 pc 0033c0cd  /system/lib/libart.so (_ZN3art10ThreadList18SuspendAllInternalEPNS_6ThreadES2_S2_b+1400)
      #08 pc 0033c2cf  /system/lib/libart.so (_ZN3art10ThreadList10SuspendAllEPKcb+374)
      #09 pc 0016ebc5  /system/lib/libart.so (_ZN3art2gc9collector16GarbageCollector11ScopedPauseC2EPS2_+32)
      #10 pc 0017385d  /system/lib/libart.so (_ZN3art2gc9collector9MarkSweep9RunPhasesEv+160)
      #11 pc 0016e5d5  /system/lib/libart.so (_ZN3art2gc9collector16GarbageCollector3RunENS0_7GcCauseEb+248)
      #12 pc 00191aed  /system/lib/libart.so (_ZN3art2gc4Heap22CollectGarbageInternalENS0_9collector6GcTypeENS0_7GcCauseEb+2364)
      #13 pc 001973cd  /system/lib/libart.so (_ZN3art2gc4Heap12ConcurrentGCEPNS_6ThreadEb+68)
      #14 pc 0019bf03  /system/lib/libart.so (_ZN3art2gc4Heap16ConcurrentGCTask3RunEPNS_6ThreadE+18)
      #15 pc 001b42c3  /system/lib/libart.so (_ZN3art2gc13TaskProcessor11RunAllTasksEPNS_6ThreadE+30)
      #16 pc 7413f047  /data/dalvik-cache/arm/system@[email protected] (offset 0x2845000)
      

初步分析

从堆栈来看,GC时去做suspend ,但suspend过程中检测到异常并主动abort.

01-02 00:04:16.124 20403 20414 F art     : art/runtime/thread_list.cc:621] Check failed: cur_val == 0 (cur_val=1, 0=0)


01-02 00:04:16.765 20403 20414 F art     : art/runtime/runtime.cc:404] Runtime aborting...
01-02 00:04:16.765 20403 20414 F art     : art/runtime/runtime.cc:404] Aborting thread:
01-02 00:04:16.765 20403 20414 F art     : art/runtime/runtime.cc:404] "HeapTaskDaemon" prio=5 tid=8 WaitingPerformingGc
01-02 00:04:16.765 20403 20414 F art     : art/runtime/runtime.cc:404]   | group="" sCount=0 dsCount=0 obj=0x12c3e4c0 self=0xa7085900
01-02 00:04:16.765 20403 20414 F art     : art/runtime/runtime.cc:404]   | sysTid=20414 nice=0 cgrp=default sched=0/0 handle=0xa0c7f920
01-02 00:04:16.766 20403 20414 F art     : art/runtime/runtime.cc:404]   | state=R schedstat=( 0 0 0 ) utm=16 stm=2 core=2 HZ=100
01-02 00:04:16.766 20403 20414 F art     : art/runtime/runtime.cc:404]   | stack=0xa0b7d000-0xa0b7f000 stackSize=1038KB
01-02 00:04:16.766 20403 20414 F art     : art/runtime/runtime.cc:404]   | held mutexes= "abort lock"
01-02 00:04:16.766 20403 20414 F art     : art/runtime/runtime.cc:404]   native: #00 pc 0034e109  /system/lib/libart.so (_ZN3art15DumpNativeStackERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEEiP12BacktraceMapPKcPNS_9ArtMethodEPv+128)
01-02 00:04:16.766 20403 20414 F art     : art/runtime/runtime.cc:404]   native: #01 pc 0032ea05  /system/lib/libart.so (_ZNK3art6Thread9DumpStackERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEEbP12BacktraceMap+308)
01-02 00:04:16.766 20403 20414 F art     : art/runtime/runtime.cc:404]   native: #02 pc 00321ba5  /system/lib/libart.so (_ZNK3art10AbortState10DumpThreadERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEEPNS_6ThreadE+24)
01-02 00:04:16.766 20403 20414 F art     : art/runtime/runtime.cc:404]   native: #03 pc 00321a2d  /system/lib/libart.so (_ZNK3art10AbortState4DumpERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+424)

深入分析

  • 关键code如下

    ... ...
    while (true) {
      int32_t cur_val = pending_threads.LoadRelaxed();
      if (LIKELY(cur_val > 0)) {
    #if ART_USE_FUTEXES
        if (futex(pending_threads.Address(), FUTEX_WAIT, cur_val, &wait_timeout, nullptr, 0) != 0) {
          // EAGAIN and EINTR both indicate a spurious failure, try again from the beginning.
          if ((errno != EAGAIN) && (errno != EINTR)) {
            if (errno == ETIMEDOUT) {
              LOG(kIsDebugBuild ? FATAL : ERROR) << "Unexpected time out during suspend all.";
            } else {
              PLOG(FATAL) << "futex wait failed for SuspendAllInternal()";
            }   
          }   
        } else {
          cur_val = pending_threads.LoadRelaxed();
          CHECK_EQ(cur_val, 0);              ==>check failed
          break;
        }   
    #else
        // Spin wait. This is likely to be slow, but on most architecture ART_USE_FUTEXES is set.
    #endif
      } else {
        CHECK_EQ(cur_val, 0); 
        break;
      }   
    }
    ... ...
    
    

    CHECK_EQ(cur_val, 0);失败的原因在于当前获取的cur_val为1,表示当前仍有一个线程在运行,并未挂起,但Google原生逻辑认为此时所有的线程已suspend.

  • 查看futex()用法如下

    int futex(int *uaddr, int op, int val, const struct timespec *timeout, int *uaddr2, int val3);
    
    FUTEX_WAIT
          This  operation  atomically  verifies that the futex address uaddr still contains the value val, and sleeps awaiting FUTEX_WAKE on this futex address.
          If the timeout argument is non-NULL, its contents specify the duration of the wait.  (This interval will be rounded up to the system  clock  granular‐
          ity,  and  kernel scheduling delays mean that the blocking interval may overrun by a small amount.) If timeout is NULL, the call blocks indefinitely.
          The arguments uaddr2 and val3 are ignored.
    

    可看到,futex()正常返回0的时候表示uaddr监测的内存内容发生变化,但变化并不等同于val变为一个特定的值(如0).

  • 另外,GC suspend慢的原因在于某些thread做了耗时操作,但目前log和debug 信息已无法确认哪些threads可能存在问题(线程均已suspend).

    Thread[tid = 20403, name = main                                    ]: flag = 1, state = 90
    Thread[tid = 20408, name = Jit thread pool worker thread 0         ]: flag = 1, state = 89
    Thread[tid = 20409, name = Signal Catcher                          ]: flag = 1, state = 81
    Thread[tid = 20410, name = JDWP                                    ]: flag = 1, state = 77
    Thread[tid = 20411, name = ReferenceQueueDaemon                    ]: flag = 1, state = 71
    Thread[tid = 20412, name = FinalizerDaemon                         ]: flag = 1, state = 71
    Thread[tid = 20413, name = FinalizerWatchdogDaemon                 ]: flag = 1, state = 69
    Thread[tid = 20414, name = HeapTaskDaemon                          ]: flag = 0, state = 74
    Thread[tid = 20415, name = Binder:20403_1                          ]: flag = 1, state = 89
    Thread[tid = 20416, name = Binder:20403_2                          ]: flag = 1, state = 89
    Thread[tid = 20417, name = Profile Saver                           ]: flag = 1, state = 89
    Thread[tid = 20425, name = pool-2-thread-1                         ]: flag = 1, state = 71
    Thread[tid = 20427, name = OnCreateAsync Task                      ]: flag = 1, state = 89
    Thread[tid = 20428, name = CameraRequest Task                      ]: flag = 1, state = 89
    Thread[tid = 20429, name = pool-2-thread-2                         ]: flag = 1, state = 71
    Thread[tid = 20430, name = pool-2-thread-3                         ]: flag = 1, state = 71
    Thread[tid = 20431, name = pool-2-thread-4                         ]: flag = 1, state = 71
    Thread[tid = 20432, name = pool-2-thread-5                         ]: flag = 1, state = 71
    Thread[tid = 20433, name = Camera Handler Thread                   ]: flag = 1, state = 89
    Thread[tid = 20435, name = Camera2 Handler Thread                  ]: flag = 1, state = 89
    Thread[tid = 20434, name = Camera Job Dispatch Thread              ]: flag = 1, state = 71
    Thread[tid = 20436, name = Camera Job Dispatch Thread              ]: flag = 1, state = 71
    Thread[tid = 20437, name = pool-5-thread-1                         ]: flag = 1, state = 71
    Thread[tid = 20438, name = AsyncTask #1                            ]: flag = 1, state = 68
    Thread[tid = 20442, name = dreamcamera                             ]: flag = 1, state = 89
    Thread[tid = 20491, name = RenderThread                            ]: flag = 1, state = 89
    Thread[tid = 20496, name = AsyncTask #2                            ]: flag = 1, state = 90
    Thread[tid = 20497, name = pool-3-thread-1                         ]: flag = 1, state = 89
    Thread[tid = 20499, name =     ]: flag = 1, state = 89
    

Root Cause

  • futex()系统调用正常返回后,当前监控的内存内容不一定为减少为0, 判断逻辑有误.移除后可再次循环判断是否所有线程已完成suspend.

解决方案

  • patch

    • Google已于去年AOSP提交关于此bug的修复patch
    • Android 7.x均未包含该patch,cherry pick过来即可
    • Android 8.0已包含此patch
  • 验证

    编译通过,手机能正常启动和运行.

你可能感兴趣的:(GC suspend abort)