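The device does not respond to swipes or key presses, and the screen content never refreshes;
watchdog did not restart system_server;
adb can still connect to the device in its failed state;
For a hang like this, some preparation is needed before analysis:
(1) Get hold of the device in its failed state and put it on charge promptly so that the failure scene is not lost;
(2) Run kill -3 followed by the system_server pid to generate a fresh traces file (skip this step if no live device is available);
(3) If a fresh traces file cannot be generated, run debuggerd -b followed by the system_server pid to dump the native call stacks of all threads to a file;
(4) Pull everything under /data/anr with adb;
(5) Pull everything under /data/tombstones with adb;
On the affected device, kill -3 did not produce a traces file with a current timestamp, so the only option was the most recent traces file under /data/anr, and its timestamp turned out to be from the previous day: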
----- pid 1487 at 2017-04-25 22:44:52 -----
Cmd line: system_server
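Moreover, in this traces file from the previous day every system_server thread is in a normal state, with no obvious problem or blocking.
Next we analyze the native call stacks printed by debuggerd -b. First, check the current state of the watchdog thread to see why it did not restart the phone: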
"watchdog" sysTid=1877
#00 pc 000000000001bf6c /system/lib64/libc.so (syscall+28)
#01 pc 00000000000e7ac8 /system/lib64/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+160)
#02 pc 000000000037ac68 /system/lib64/libart.so (_ZN3art7Monitor4WaitEPNS_6ThreadElibNS_11ThreadStateE+896)
#03 pc 000000000054e980 /system/framework/arm64/boot.oat (offset 0x54e000) (java.lang.Object.wait+140)
#04 pc 000000000054e8b8 /system/framework/arm64/boot.oat (offset 0x54e000) (java.lang.Object.wait+52)
#05 pc 00000000011035a8 /system/framework/oat/arm64/services.odex (offset 0xf0c000)
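The watchdog thread is waiting in ConditionVariable's WaitHoldingLocks method. Why is it waiting here, and is that normal?
With these questions in mind, we use the addresses in the call stack and the addr2line tool to locate the code frame by frame. First, Object.wait calls Monitor::Wait, shown below: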
/* art/runtime/monitor.cc */
void Monitor::Wait(Thread* self, int64_t ms, int32_t ns,
                   bool interruptShouldThrow, ThreadState why) {
  ...

  bool was_interrupted = false;
  {
    // Update thread state. If the GC wakes up, it'll ignore us, knowing
    // that we won't touch any references in this state, and we'll check
    // our suspend mode before we transition out.
    ScopedThreadSuspension sts(self, why);
    ...

    // Handle the case where the thread was interrupted before we called wait().
    if (self->IsInterruptedLocked()) {
      was_interrupted = true;
    } else {
      // Wait for a notification or a timeout to occur.
      if (why == kWaiting) {
        self->GetWaitConditionVariable()->Wait(self);
      } else {
        DCHECK(why == kTimedWaiting || why == kSleeping) << why;
        self->GetWaitConditionVariable()->TimedWait(self, ms, ns);
      }
      was_interrupted = self->IsInterruptedLocked();
    }
  }
In Monitor::Wait, before self->GetWaitConditionVariable()->Wait or TimedWait is called, the constructor of ScopedThreadSuspension switches the thread from the Runnable state to a suspended state. The transition code is as follows:
/* art/runtime/scoped_thread_state_change.h */
// Annotalysis helper for going to a suspended state from runnable.
class ScopedThreadSuspension : public ValueObject {
 public:
  explicit ScopedThreadSuspension(Thread* self, ThreadState suspended_state)
  ...
  {
    DCHECK(self_ != nullptr);
    self_->TransitionFromRunnableToSuspended(suspended_state);
  }
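Once self->GetWaitConditionVariable()->Wait or TimedWait returns, that is, once the wait condition is met or the wait times out, execution continues. When control leaves the scope of the ScopedThreadSuspension object sts, its destructor runs and switches the thread state again, from suspended back to Runnable. The transition code is as follows: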
/* art/runtime/thread-inl.h */
inline ThreadState Thread::TransitionFromSuspendedToRunnable() {
  ...
  do {
    ...
    } else if ((old_state_and_flags.as_struct.flags & kActiveSuspendBarrier) != 0) {
      PassActiveSuspendBarriers(this);
    } else if ((old_state_and_flags.as_struct.flags & kCheckpointRequest) != 0) {
      // Impossible
      LOG(FATAL) << "Transitioning to runnable with checkpoint flag, "
                 << " flags=" << old_state_and_flags.as_struct.flags
                 << " state=" << old_state_and_flags.as_struct.state;
    } else if ((old_state_and_flags.as_struct.flags & kSuspendRequest) != 0) {
      // Wait while our suspend count is non-zero.
      ...
      while ((old_state_and_flags.as_struct.flags & kSuspendRequest) != 0) {
        // Re-check when Thread::resume_cond_ is notified.
        Thread::resume_cond_->Wait(this);
        old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
        DCHECK_EQ(old_state_and_flags.as_struct.state, old_state);
      }
      DCHECK_EQ(GetSuspendCount(), 0);
    }
  } while (true);
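The scoped state change described above is an RAII pattern: the constructor leaves the Runnable state and the destructor tries to return to it. The following is a minimal sketch of that idea only. FakeThread, ScopedSuspension and their fields are hypothetical stand-ins invented for illustration and are far simpler than the ART classes quoted above.

```cpp
// Toy model only: these types are hypothetical and much simpler than ART's.
enum class ThreadState { kRunnable, kWaiting };

struct FakeThread {
  ThreadState state = ThreadState::kRunnable;
  bool suspend_requested = false;  // stands in for the kSuspendRequest flag
  bool parked_on_resume = false;   // stands in for blocking on resume_cond_
};

class ScopedSuspension {
 public:
  ScopedSuspension(FakeThread* self, ThreadState suspended_state) : self_(self) {
    self_->state = suspended_state;  // Runnable -> suspended on scope entry
  }

  ~ScopedSuspension() {
    if (self_->suspend_requested) {
      // A real thread would block here until it is resumed; the model only
      // records that fact and leaves the thread in its suspended state.
      self_->parked_on_resume = true;
      return;
    }
    self_->state = ThreadState::kRunnable;  // suspended -> Runnable on scope exit
  }

  ScopedSuspension(const ScopedSuspension&) = delete;
  ScopedSuspension& operator=(const ScopedSuspension&) = delete;

 private:
  FakeThread* const self_;
};
```

In this model a thread whose suspend_requested flag is set while it is inside the scope ends up parked instead of Runnable, which mirrors the position of the watchdog thread in the stack above.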
During the transition from suspended to Runnable, the thread checks whether a suspend request is pending. The watchdog call stack is here because someone raised kSuspendRequest, so the thread reached Thread::resume_cond_->Wait, which in turn calls WaitHoldingLocks:
/* art/runtime/base/mutex.cc */
void ConditionVariable::Wait(Thread* self) {
  guard_.CheckSafeToWait(self);
  WaitHoldingLocks(self);
}

void ConditionVariable::WaitHoldingLocks(Thread* self) {
  ...
  if (futex(sequence_.Address(), FUTEX_WAIT, cur_sequence, nullptr, nullptr, 0) != 0) {
    // Futex failed, check it is an expected error.
    // EAGAIN == EWOULDBLK, so we let the caller try again.
    // EINTR implies a signal was sent to this thread.
    if ((errno != EINTR) && (errno != EAGAIN)) {
      PLOG(FATAL) << "futex wait failed for " << name_;
    }
  }
WaitHoldingLocks calls the futex function and ends up waiting in the futex system call:
/* art/runtime/base/mutex-inl.h */
static inline int futex(volatile int *uaddr, int op, int val, const struct timespec *timeout,
                        volatile int *uaddr2, int val3) {
  return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
}
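The analysis above shows that the watchdog thread is waiting in the futex system call because someone raised kSuspendRequest, which makes it wait while switching from the suspended state to Runnable. So what raises kSuspendRequest?
The common, normal cases are the GC thread during its second mark-sweep pass and the Signal Catcher thread while dumping traces: both call SuspendAll, which raises kSuspendRequest on every thread. Let us first check whether the GC thread is in SuspendAll. Its call stack is: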
"HeapTaskDaemon" sysTid=1497
#00 pc 000000000001bf6c /system/lib64/libc.so (syscall+28)
#01 pc 000000000046035c /system/lib64/libart.so (_ZN3art10ThreadList18SuspendAllInternalEPNS_6ThreadES2_S2_b+628)
#02 pc 00000000004609c8 /system/lib64/libart.so (_ZN3art10ThreadList10SuspendAllEPKcb+536)
#03 pc 00000000001ea0e0 /system/lib64/libart.so (_ZN3art2gc9collector9MarkSweep9RunPhasesEv+232)
...
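The GC thread's call stack shows that it is indeed in SuspendAll, which explains why watchdog is waiting in Thread::resume_cond_->WaitHoldingLocks.
Normally SuspendAll completes very quickly, and ResumeAll then wakes every thread waiting in Thread::resume_cond_->WaitHoldingLocks so that it can continue. However, printing the native call stacks several times with debuggerd -b confirms that the GC thread never completes SuspendAll, so many threads, watchdog included, stay parked in Thread::resume_cond_->WaitHoldingLocks while switching from suspended to Runnable. Why can the GC thread not complete SuspendAll?
With this first clue and question in hand, we continue. The GC thread can only complete SuspendAll once every thread other than itself has switched to a non-Runnable state, which is what protects the data and state of the Java heap. If some thread can never leave the Runnable state, the GC thread can never complete SuspendAll. So we look for the system_server thread that is still Runnable. Without a complete traces file containing Java stacks, only native stacks are available, so we cannot tell directly which thread is still Runnable. What can we do?
We change approach and use elimination. A thread waiting in Thread::resume_cond_->WaitHoldingLocks must have responded to the GC thread's kSuspendRequest and switched to a non-Runnable state, so as a first filter we exclude every thread waiting there. The remaining threads fall roughly into two groups. The first group is in a non-Runnable state, executing native methods or blocked in standard system calls and libc functions; these threads do not affect SuspendAll. The second group is executing a FastJNI method in the Runnable state; if such a thread blocks instead of finishing promptly, it directly blocks the GC thread's SuspendAll. FastJNI calls of this kind usually sit in the middle of business-logic code.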
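The first elimination pass can be automated over a saved dump. The sketch below is a hypothetical helper, not part of any Android tool. It assumes the dump uses the layout shown in this article, with each thread introduced by a header line that starts with a double quote, and it keeps every thread that has no frame mentioning WaitHoldingLocks.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the names of threads in a native stack dump
// that have NO frame containing "WaitHoldingLocks". These are the threads
// that survive the first elimination pass and still need manual inspection.
std::vector<std::string> ThreadsNotParked(const std::string& dump) {
  std::vector<std::string> candidates;
  std::istringstream in(dump);
  std::string line;
  std::string current;       // name of the thread whose frames are being read
  bool parked = false;       // has the current thread shown WaitHoldingLocks?
  bool have_thread = false;  // has any thread header been seen yet?

  auto flush = [&]() {
    if (have_thread && !parked) candidates.push_back(current);
  };

  while (std::getline(in, line)) {
    if (!line.empty() && line[0] == '"') {
      // Assumed header layout: "watchdog" sysTid=1877
      flush();
      const std::size_t end = line.find('"', 1);
      current = line.substr(1, end == std::string::npos ? std::string::npos : end - 1);
      parked = false;
      have_thread = true;
    } else if (line.find("WaitHoldingLocks") != std::string::npos) {
      parked = true;
    }
  }
  flush();
  return candidates;
}
```

The GC thread itself also survives this filter, because it sits in SuspendAllInternal rather than WaitHoldingLocks, so the result is a shortlist to inspect by hand rather than a verdict.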
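JNI stands for Java Native Interface, the API through which Java code and native code interoperate.
Following this approach and printing the call stacks several times, we find a suspicious thread, android.display, that is always blocked at the same place while executing business-logic code. Its call stack is: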
"android.display" sysTid=1509
#00 pc 000000000001bf6c /system/lib64/libc.so (syscall+28)
#01 pc 0000000000068cb8 /system/lib64/libc.so (_ZL33__pthread_mutex_lock_with_timeoutP24pthread_mutex_internal_tbPK8timespec+248)
#02 pc 00000000000fc9f8 /system/lib64/libandroid_runtime.so
#03 pc 0000000001cf2078 /system/framework/arm64/boot-framework.oat (offset 0x1965000) (android.content.res.AssetManager.applyStyle+244)
#04 pc 0000000001d19bbc /system/framework/arm64/boot-framework.oat (offset 0x1965000) (android.content.res.ResourcesImpl$ThemeImpl.obtainStyledAttributes+280)
#05 pc 0000000001d18628 /system/framework/arm64/boot-framework.oat (offset 0x1965000) (android.content.res.Resources$Theme.obtainStyledAttributes+100)
#06 pc 0000000002654068 /system/framework/arm64/boot-framework.oat (offset 0x1965000) (android.view.animation.DecelerateInterpolator.<init>+132)
...
Having found a first suspect, we confirm whether the method it is blocked in is a FastJNI method. A code search shows that android.content.res.AssetManager.applyStyle is indeed registered as FastJNI:
/* frameworks/base/core/jni/android_util_AssetManager.cpp */
    { "applyStyle","!(JIIJ[I[I[I)Z",
        (void*) android_content_AssetManager_applyStyle },
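At this point it is essentially confirmed that android.display is at least one of the threads blocking the GC thread's SuspendAll. We continue by analyzing why android.display stays blocked in this FastJNI method.
FastJNI is a fast variant of the Java Native Interface. What distinguishes it from regular JNI is speed: when Java code calls a FastJNI method, the thread state is not switched from Runnable to Native, and the thread stays Runnable throughout. A method is declared FastJNI by prefixing its parameter signature with !.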
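As a small illustration of the ! marker, the sketch below checks a registration signature for the prefix and models the thread state the text associates with each kind of call. The function names and the two-value ThreadState enum are hypothetical; this is a toy model of the rule described above, not ART code.

```cpp
#include <string>

// Toy model only: ART has many more thread states than these two.
enum class ThreadState { kRunnable, kNative };

// True when the signature carries the leading '!' marker described above.
bool IsFastJniSignature(const std::string& signature) {
  return !signature.empty() && signature[0] == '!';
}

// The signature a regular JNI registration of the same method would use.
std::string StripFastJniMarker(const std::string& signature) {
  return IsFastJniSignature(signature) ? signature.substr(1) : signature;
}

// State the calling thread is in while the native code runs, per the rule in
// the text: FastJNI keeps it Runnable, regular JNI switches it to Native.
ThreadState StateWhileInNativeCode(const std::string& signature) {
  return IsFastJniSignature(signature) ? ThreadState::kRunnable
                                       : ThreadState::kNative;
}
```

A thread that stays Runnable while it blocks inside native code is exactly the kind of thread that SuspendAll ends up waiting for.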
Using the call stack and addr2line, the blocking code is located at line 1432 below:
/* frameworks/base/core/jni/android_util_AssetManager.cpp */
1342 static jboolean android_content_AssetManager_applyStyle(JNIEnv* env, jobject clazz,
...
1350 {
...
1369
1370     ResTable::Theme* theme = reinterpret_cast<ResTable::Theme*>(themeToken);
1371     const ResTable& res = theme->getResTable();
...
1430
1431     // Now lock down the resource object and start pulling stuff from it.
1432     res.lock();
The thread is blocked because it cannot acquire mLock of the AssetManager's ResTable:
/* frameworks/base/libs/androidfw/ResourceTypes.cpp */
void ResTable::lock() const
{
    mLock.lock();
}
Since android.display cannot acquire mLock, some other thread must already hold it. Searching the system_server call stacks for threads executing AssetManager or ResTable code turns up a suspicious thread, Binder:1487_17:
"Binder:1487_17" sysTid=4827
#00 pc 000000000001bf6c /system/lib64/libc.so (syscall+28)
#01 pc 00000000000e7ac8 /system/lib64/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+160)
#02 pc 000000000034aeb8 /system/lib64/libart.so (_ZN3art3JNI12NewStringUTFEP7_JNIEnvPKc+300)
#03 pc 00000000000f9508 /system/lib64/libandroid_runtime.so
#04 pc 0000000001cf2b38 /system/framework/arm64/boot-framework.oat (offset 0x1965000) (android.content.res.AssetManager.getArrayStringResource+132)
#05 pc 0000000001cf5914 /system/framework/arm64/boot-framework.oat (offset 0x1965000) (android.content.res.AssetManager.getResourceStringArray+48)
...
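Judging from the call stack, Binder:1487_17 is already waiting in Thread::resume_cond_->WaitHoldingLocks and has successfully left the Runnable state. But the stack also shows AssetManager operations, so it may well hold the mLock that android.display needs. To pin down exactly who is blocking android.display, we run addr2line on the Binder:1487_17 stack and find that it does hold mLock: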
/* frameworks/base/core/jni/android_util_AssetManager.cpp */
static jobjectArray android_content_AssetManager_getArrayStringResource(JNIEnv* env, jobject clazz, jint arrayResId)
{
    ...
    const ssize_t N = res.lockBag(arrayResId, &startOfBag);
    ...
    for (size_t i=0; ((ssize_t)i)<N; i++, bag++) {
        ...
                str = env->NewStringUTF(str8);
            } else {
                ...
            }

    ...
}
lockBag acquires mLock. Its definition is:
/* frameworks/base/libs/androidfw/ResourceTypes.cpp */
ssize_t ResTable::lockBag(uint32_t resID, const bag_entry** outBag) const
{
    mLock.lock();
    ...
}
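After acquiring ResTable's mLock, Binder:1487_17 goes on to call NewStringUTF, which needs to switch the thread state to Runnable. During that transition it sees the kSuspendRequest raised by the GC thread and parks in Thread::resume_cond_->WaitHoldingLocks. The deadlock cycle is now complete. One question remains: at first glance the GC thread's SuspendAll has a wait timeout, so why did the timeout not take effect?
The timeout logic of SuspendAll is as follows: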
/* art/runtime/thread_list.cc */
void ThreadList::SuspendAll(const char* cause, bool long_suspend) {
  ...
  SuspendAllInternal(self, self);
  ...
    if (Locks::mutator_lock_->ExclusiveLockWithTimeout(self, kThreadSuspendTimeoutMs, 0)) {
      break;
    } else if (!long_suspend_) {
      ...
      UnsafeLogFatalForThreadSuspendAllTimeout();
    }
  }
SuspendAllInternal runs first, and the mutator lock is then acquired exclusively with a timeout of kThreadSuspendTimeoutMs; that is, the exclusive lock must be obtained within 30s. kThreadSuspendTimeoutMs is defined as:
/* art/runtime/thread_list.cc */
static constexpr uint64_t kThreadSuspendTimeoutMs = 30 * 1000; // 30s.
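The mutator lock, as its name suggests, exists to keep the state of the virtual machine, including Java objects and heap memory, from being mutated unexpectedly. It is typically involved in thread state transitions, GC and trace dumps. When a thread switches from a non-Runnable state to Runnable, it acquires the mutator lock in shared mode. During the second mark-sweep pass, GC calls SuspendAll to move all threads to a non-Runnable state and takes the mutator lock exclusively. When dumping traces, the signal catcher thread in the stock AOSP flow likewise calls SuspendAll and takes the mutator lock exclusively, mainly because a snapshot of the heap requires all threads to stop so that they cannot modify the heap while it is being dumped.
If the mutator lock cannot be acquired within 30s, UnsafeLogFatalForThreadSuspendAllTimeout is called, which exits the process. Its definition is: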
/* art/runtime/thread_list.cc */
NO_RETURN static void UnsafeLogFatalForThreadSuspendAllTimeout() {
  ...
  LOG(FATAL) << ss.str();
  exit(0);
}
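Normally, once SuspendAllInternal finishes, all threads have passed the suspend barrier, released the mutator lock and are in a non-Runnable state. However, mapping the GC thread's call stack to source with addr2line shows that the GC thread never finished SuspendAllInternal, so it never reached the timed exclusive acquisition of the mutator lock. Instead it is blocked in the futex wait inside SuspendAllInternal. The key code is: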
/* art/runtime/thread_list.cc */
void ThreadList::SuspendAllInternal(Thread* self, Thread* ignore1, Thread* ignore2, bool debug_suspend) {
  ...
  InitTimeSpec(true, CLOCK_MONOTONIC, 10000, 0, &wait_timeout);
  while (true) {
    ...
#if ART_USE_FUTEXES
    if (futex(pending_threads.Address(), FUTEX_WAIT, cur_val, &wait_timeout, nullptr, 0) != 0) {
      // EAGAIN and EINTR both indicate a spurious failure, try again from the beginning.
      if ((errno != EAGAIN) && (errno != EINTR)) {
        if (errno == ETIMEDOUT) {
          LOG(kIsDebugBuild ? FATAL : ERROR) << "Unexpected time out during suspend all.";
        } else {
          PLOG(FATAL) << "futex wait failed for SuspendAllInternal()";
        }
  ...
}
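At first glance this futex wait also has a timeout, so why does it still block?
Printing the call stacks several times and checking the logs shows that the GC thread is in fact still running and prints the time out log roughly every 10s. Reviewing the code with this clue reveals the defect: when kIsDebugBuild is false, a suspend-all timeout only prints an error log and the while loop keeps spinning with no exit condition. Comparing Android M and Android N shows that this wait was newly added in N and did not cover every case. Android M does not have the problem, because it goes straight to the exclusive acquisition of the mutator lock with a timeout. The key code of the Android M suspend-all timeout is: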
/* art/runtime/thread_list.cc */
void ThreadList::SuspendAll(const char* cause, bool long_suspend) {
  ...
  // Block on the mutator lock until all Runnable threads release their share of access.
#if HAVE_TIMED_RWLOCK
  while (true) {
    if (Locks::mutator_lock_->ExclusiveLockWithTimeout(self, kThreadSuspendTimeoutMs, 0)) {
      break;
    } else if (!long_suspend_) {
      ...
      UnsafeLogFatalForThreadSuspendAllTimeout();
    }
  }
This explains why the SuspendAll timeout mechanism did not take effect.
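To summarize the deadlock flow:
(1) Binder:1487_17 acquires ResTable's mLock in lockBag, then calls NewStringUTF, sees the GC thread's kSuspendRequest while switching to Runnable, and parks in Thread::resume_cond_->WaitHoldingLocks while still holding mLock;
(2) android.display runs the FastJNI method applyStyle in the Runnable state and blocks on mLock, so it never leaves the Runnable state;
(3) HeapTaskDaemon (the GC thread) cannot finish SuspendAllInternal because android.display stays Runnable, and on non-debug builds the timed-out wait only logs an error and retries forever;
(4) watchdog and the other threads stay parked in Thread::resume_cond_->WaitHoldingLocks waiting for ResumeAll, so system_server is never restarted;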
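The dependencies found in the analysis can be written down as a wait-for graph and checked for a cycle. The sketch below is a simplified model: each blocked thread is assumed to wait for exactly one other thread, the thread names are the ones from this case, and WaitForGraph, FindCycle and DeadlockFromThisCase are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Simplified model: every blocked thread waits for exactly one other thread.
using WaitForGraph = std::map<std::string, std::string>;

// Follows the wait-for edges starting at `start`. Returns the threads that
// form a cycle (in wait order) if the walk runs into one, otherwise an empty
// vector.
std::vector<std::string> FindCycle(const WaitForGraph& graph, const std::string& start) {
  std::vector<std::string> path;
  std::map<std::string, std::size_t> index_in_path;
  std::string node = start;
  while (true) {
    const auto seen = index_in_path.find(node);
    if (seen != index_in_path.end()) {
      const auto first = path.begin() + static_cast<std::ptrdiff_t>(seen->second);
      return std::vector<std::string>(first, path.end());
    }
    index_in_path[node] = path.size();
    path.push_back(node);
    const auto edge = graph.find(node);
    if (edge == graph.end()) return {};  // reached a thread that is not blocked
    node = edge->second;
  }
}

// The edges established by the analysis in this article.
WaitForGraph DeadlockFromThisCase() {
  return {
      {"watchdog", "HeapTaskDaemon"},         // parked until ResumeAll
      {"Binder:1487_17", "HeapTaskDaemon"},   // holds mLock, parked until ResumeAll
      {"android.display", "Binder:1487_17"},  // Runnable in FastJNI, waits for mLock
      {"HeapTaskDaemon", "android.display"},  // SuspendAll waits for it to leave Runnable
  };
}
```

Starting the walk at watchdog reaches the three-thread cycle HeapTaskDaemon, android.display, Binder:1487_17; watchdog itself is outside the cycle and merely waits on it. Removing any one edge of the cycle breaks the deadlock in this model.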
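Having gone through the preliminary analysis, the in-depth analysis and the summary, the cause is clear. Next, how to fix it:
The patch for the SuspendAll timeout defect removes the kIsDebugBuild check: in every build, a timed-out wait now logs at FATAL, and the abort performed when the FATAL log object is destroyed terminates the process. The key part of the fix is: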
/* art/runtime/thread_list.cc */
void ThreadList::SuspendAllInternal(Thread* self,
                                    Thread* ignore1,
                                    Thread* ignore2,
                                    bool debug_suspend) {
  ...
  InitTimeSpec(false, CLOCK_MONOTONIC, kIsDebugBuild ? 50000 : 60000, 0, &wait_timeout);
  ...
  while (true) {
    ...
    if (futex(pending_threads.Address(), FUTEX_WAIT, cur_val, &wait_timeout, nullptr, 0) != 0) {
      // EAGAIN and EINTR both indicate a spurious failure, try again from the beginning.
      if ((errno != EAGAIN) && (errno != EINTR)) {
        if (errno == ETIMEDOUT) {
          LOG(FATAL)
              << "Timed out waiting for threads to suspend, waited for "
              << PrettyDuration(NanoTime() - start_time);
        }
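The difference between the defective wait loop and the patched one comes down to what happens on a timeout. The sketch below is a toy model of that difference, not the ART implementation: WaitResult, SuspendOutcome and WaitForSuspend are hypothetical names, wait_once stands in for the futex wait, and max_rounds exists only so that the model terminates, whereas the loop quoted from Android N has no such bound.

```cpp
#include <functional>

enum class WaitResult { kWoken, kTimedOut };

struct SuspendOutcome {
  bool suspended;  // true if a wait eventually succeeded
  bool aborted;    // true if a timeout was treated as fatal
  int timeouts;    // number of timeouts observed
};

// Toy model of the wait loop. With fatal_on_timeout == false it mimics the
// non-debug behaviour described above (log and retry); with true it mimics
// the patched behaviour (the first timeout ends the process).
SuspendOutcome WaitForSuspend(const std::function<WaitResult()>& wait_once,
                              bool fatal_on_timeout, int max_rounds) {
  SuspendOutcome outcome{false, false, 0};
  for (int round = 0; round < max_rounds; ++round) {
    if (wait_once() == WaitResult::kWoken) {
      outcome.suspended = true;
      return outcome;
    }
    ++outcome.timeouts;
    if (fatal_on_timeout) {
      outcome.aborted = true;  // stands in for LOG(FATAL) aborting the process
      return outcome;
    }
    // Defective behaviour: only an error is logged, then the loop retries.
  }
  return outcome;
}
```

With a wait that never succeeds, the non-fatal variant simply accumulates timeouts for as long as it is allowed to run, while the fatal variant stops at the first one, which lets the system recover by restarting the process.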
LOG(FATAL) is a macro that expands to ::art::LogMessage(__FILE__, __LINE__, severity, -1).stream(). The LogMessage destructor decides whether to abort based on whether severity is FATAL. The macro is defined as:
/* art/runtime/base/logging.h */
// Logs a message to logcat on Android otherwise to stderr. If the severity is FATAL it also causes
// an abort. For example: LOG(FATAL) << "We didn't expect to reach here";
#define LOG(severity) ::art::LogMessage(__FILE__, __LINE__, severity, -1).stream()
The key abort code in the LogMessage destructor is:
LogMessage::~LogMessage() {
  ...
  // Abort if necessary.
  if (data_->GetSeverity() == FATAL) {
    Runtime::Abort(msg.c_str());
  }
}
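The abort-in-destructor idiom can be shown with a much smaller stand-in. In the sketch below, the LogMessage class, the Severity enum and the on_fatal callback are all hypothetical simplifications: the callback replaces the real abort so that the behaviour can be observed without terminating the program.

```cpp
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

enum class Severity { kInfo, kError, kFatal };

// Simplified stand-in for a streaming log object. The message is assembled
// through stream(); when the temporary is destroyed at the end of the
// statement, a fatal severity triggers the handler.
class LogMessage {
 public:
  LogMessage(Severity severity, std::function<void(const std::string&)> on_fatal)
      : severity_(severity), on_fatal_(std::move(on_fatal)) {}

  ~LogMessage() {
    // In this model the handler replaces the abort performed for FATAL.
    if (severity_ == Severity::kFatal && on_fatal_) {
      on_fatal_(buffer_.str());
    }
  }

  std::ostream& stream() { return buffer_; }

 private:
  Severity severity_;
  std::function<void(const std::string&)> on_fatal_;
  std::ostringstream buffer_;
};
```

Because the check lives in the destructor, a statement such as LogMessage(Severity::kFatal, handler).stream() << "text" first builds the complete message and only then invokes the handler, which is why a FATAL log line can both print its text and end the process.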
The second patch removes the ! marker from the JNI signature of android.content.res.AssetManager.applyStyle, turning it back into a regular (non-FastJNI) method:
/* frameworks/base/core/jni/android_util_AssetManager.cpp */
    { "applyStyle","(JIIJ[I[I[I)Z",
        (void*) android_content_AssetManager_applyStyle },
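A JNI method must not be casually converted to FastJNI. Once it is FastJNI, calling it performs no thread state transition, so the thread stays Runnable until the FastJNI method returns;
A FastJNI method must be fast and free of blocking dependencies; otherwise an inappropriate FastJNI declaration or optimization can cause unexpected deadlocks or thread-state problems;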