1. First, look at the backtrace of the NE (native exception), shown below:
#00 pc 000000000001d808 /system/lib64/libc.so (abort+120)
#01 pc 0000000000476644 /system/lib64/libart.so (art::Runtime::Abort(char const*)+552)
#02 pc 000000000056c5ec /system/lib64/libart.so (android::base::LogMessage::~LogMessage()+1004)
#03 pc 0000000000264258 /system/lib64/libart.so (art::IndirectReferenceTable::Add(art::IRTSegmentState, art::ObjPtr<art::mirror::Object>)+764)
#04 pc 00000000002ff750 /system/lib64/libart.so (art::JavaVMExt::AddGlobalRef(art::Thread*, art::ObjPtr<art::mirror::Object>)+68)
#05 pc 0000000000343788 /system/lib64/libart.so (art::JNI::NewGlobalRef(_JNIEnv*, _jobject*)+572)
#06 pc 000000000011f838 /system/lib64/libandroid_runtime.so (JavaDeathRecipient::JavaDeathRecipient(_JNIEnv*, _jobject*, android::sp<DeathRecipientList> const&)+136)
The mobile log captured when the exception occurred is as follows:
08-14 06:21:12.819 838 1526 F zygote64: runtime.cc:531] JNI ERROR (app bug): global reference table overflow (max=51200)
08-14 06:21:12.819 838 1526 F zygote64: runtime.cc:531] global reference table dump:
08-14 06:21:12.819 838 1526 F zygote64: runtime.cc:531] Last 10 entries (of 51198):
08-14 06:21:12.819 838 1526 F zygote64: runtime.cc:531] 51197: 0x14e79af0 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-14 06:21:12.819 838 1526 F zygote64: runtime.cc:531] 51196: 0x13a74a08 com.android.server.am.ServiceRecord
08-14 06:21:12.819 838 1526 F zygote64: runtime.cc:531] 51195: 0x14e71ce8 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-14 06:21:12.819 838 1526 F zygote64: runtime.cc:531] 51194: 0x14e74530 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-14 06:21:12.819 838 1526 F zygote64: runtime.cc:531] 51193: 0x14e70958 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-14 06:21:12.819 838 1526 F zygote64: runtime.cc:531] 51192: 0x14e708b8 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-14 06:21:12.819 838 1526 F zygote64: runtime.cc:531] 51191: 0x14e750b0 com.android.server.wm.WindowState$DeathRecipient
08-14 06:21:12.820 838 1526 F zygote64: runtime.cc:531] 51190: 0x14e6db40 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-14 06:21:12.820 838 1526 F zygote64: runtime.cc:531] 51189: 0x13a723f0 java.lang.ref.WeakReference (referent is a android.os.BinderProxy)
08-14 06:21:12.820 838 1526 F zygote64: runtime.cc:531] 51188: 0x14e6c508 android.os.Binder
08-14 06:21:12.820 838 1526 F zygote64: runtime.cc:531] Summary:
08-14 06:21:12.820 838 1526 F zygote64: runtime.cc:531] 40143 of com.android.server.am.ServiceRecord (40143 unique instances)
08-14 06:21:12.820 838 1526 F zygote64: runtime.cc:531] 6023 of java.lang.ref.WeakReference (6023 unique instances)
08-14 06:21:12.820 838 1526 F zygote64: runtime.cc:531] 2931 of android.os.RemoteCallbackList$Callback (2931 unique instances)
08-14 06:21:12.820 838 1526 F zygote64: runtime.cc:531] 355 of com.android.server.content.ContentService$ObserverNode$ObserverEntry (355 unique instances)
08-14 06:21:12.820 838 1526 F zygote64: runtime.cc:531] 313 of java.lang.Class (235 unique instances)
08-14 06:21:12.820 838 1526 F zygote64: runtime.cc:531] 247 of com.android.server.am.ActivityRecord$Token (247 unique instances)
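The log above states the cause of the error directly: JNI ERROR (app bug): global reference table overflow (max=51200)
This means the global reference table reached its maximum of 51200 entries, and the overflow of global references triggered the NE.
Right below that line, the log shows what the table contains: "Last 10 entries (of 51198):" is followed by the 10 most recent entries, and "Summary:" is followed by a per-class summary of all entries.
The relevant code is as follows: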
IndirectRef IndirectReferenceTable::Add(IRTSegmentState previous_state,
                                        ObjPtr<mirror::Object> obj) {
  size_t top_index = segment_state_.top_index;
  CHECK(obj != nullptr);
  VerifyObject(obj);
  DCHECK(table_ != nullptr);
  if (top_index == max_entries_) {
    if (resizable_ == ResizableCapacity::kNo) {
      LOG(FATAL) << "JNI ERROR (app bug): " << kind_ << " table overflow "
                 << "(max=" << max_entries_ << ")\n"
                 << MutatorLockedDumpable<IndirectReferenceTable>(*this);
      UNREACHABLE();
    }
  }
  // ... (rest of the function omitted)
}
Here max_entries_ is currently 51200. The maximum number of global references is defined as:
static constexpr size_t kGlobalsMax = 51200; // Arbitrary sanity check. (Must fit in 16 bits.)
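To make the overflow check concrete, here is a minimal Java sketch of a fixed-capacity reference table. It is not ART code: the class BoundedRefTable and its methods are hypothetical, and an exception stands in for the process abort that LOG(FATAL) causes in ART.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a fixed-capacity reference table; not ART code. */
final class BoundedRefTable {
    private final int maxEntries;
    private final List<Object> entries = new ArrayList<>();

    BoundedRefTable(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    /** Adds a reference and returns its index; fails once the table is full. */
    int add(Object obj) {
        if (obj == null) {
            throw new NullPointerException("obj");
        }
        if (entries.size() == maxEntries) {
            // ART calls LOG(FATAL) at this point, which aborts the process.
            throw new IllegalStateException(
                    "global reference table overflow (max=" + maxEntries + ")");
        }
        entries.add(obj);
        return entries.size() - 1;
    }

    int size() {
        return entries.size();
    }
}
```

The point of the sketch is that the table never shrinks by itself: unless references are removed, every add() brings the process one step closer to the limit.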
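The lines we saw in the mobile log are written by the LOG(FATAL) statement in IndirectReferenceTable::Add(). The << MutatorLockedDumpable part streams the table through the following operator: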
template<typename T>
inline std::ostream& operator<<(std::ostream& os, const MutatorLockedDumpable<T>& rhs) {
  Locks::mutator_lock_->AssertSharedHeld(Thread::Current());
  rhs.Dump(os);
  return os;
}
*this refers to the IndirectReferenceTable object, so its Dump() method is the one that gets called:
void IndirectReferenceTable::Dump(std::ostream& os) const {
  os << kind_ << " table dump:\n";
  ReferenceTable::Table entries;
  for (size_t i = 0; i < Capacity(); ++i) {
    ObjPtr<mirror::Object> obj = table_[i].GetReference()->Read<kWithoutReadBarrier>();
    if (obj != nullptr) {
      obj = table_[i].GetReference()->Read();
      entries.push_back(GcRoot<mirror::Object>(obj));
    }
  }
  // Core printing function: it prints both the last 10 references and the
  // summary of all objects. Worth reading if you are interested.
  ReferenceTable::Dump(os, entries);
}
The Summary shows as many as 40143 ServiceRecord instances. This is what needs to be analyzed first.
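As an illustration of how a summary of the form "N of Class (M unique instances)" can be derived from a list of table entries, here is a small Java sketch. It is not the ART implementation of ReferenceTable::Dump(): the class RefTableSummary is hypothetical, and sorting by descending count is an assumption based on the order seen in the log above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical re-creation of the "Summary:" section; not ART code. */
final class RefTableSummary {
    /** Returns lines such as "3 of java.lang.Object (2 unique instances)". */
    static List<String> summarize(List<Object> entries) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        Map<String, Set<Object>> uniques = new LinkedHashMap<>();
        for (Object o : entries) {
            String name = o.getClass().getName();
            totals.merge(name, 1, Integer::sum);
            // Identity-based set: the same object added twice counts once.
            uniques.computeIfAbsent(name, k -> Collections.<Object>newSetFromMap(
                    new IdentityHashMap<Object, Boolean>())).add(o);
        }
        List<String> names = new ArrayList<>(totals.keySet());
        // Largest count first, matching the order in the log.
        names.sort((a, b) -> totals.get(b) - totals.get(a));
        List<String> lines = new ArrayList<>();
        for (String n : names) {
            lines.add(totals.get(n) + " of " + n
                    + " (" + uniques.get(n).size() + " unique instances)");
        }
        return lines;
    }
}
```

When the total and the unique count are equal, as for ServiceRecord in the log, every entry is a distinct object, which points to many leaked instances rather than one object being referenced repeatedly.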
2. The dumpsys activity output in the DB shows which ServiceRecord instances actually exist in the system. Most of the ServiceRecord entries in that file look like this:
* Destroy ServiceRecord{871ec04 u0 com.google.android.googlequicksearchbox/com.google.android.voicesearch.ime.VoiceInputMethodService}
intent={act=android.view.InputMethod cmp=com.google.android.googlequicksearchbox/com.google.android.voicesearch.ime.VoiceInputMethodService}
packageName=com.google.android.googlequicksearchbox
processName=com.google.android.googlequicksearchbox:search
permission=android.permission.BIND_INPUT_METHOD
baseDir=/system/priv-app/Velvet/Velvet.apk
dataDir=/data/user/0/com.google.android.googlequicksearchbox
app=ProcessRecord{e8026a 23595:com.google.android.googlequicksearchbox:search/u0a37}
createTime=-15h53m25s487ms startingBgTimeout=--
lastActivity=-15h53m25s486ms restartTime=-- createdFromFg=true
executeNesting=4 executeFg=true executingStart=-15h53m25s474ms
destroying=true destroyTime=-15h53m25s474ms
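This is a ServiceRecord that is pending destruction but has stayed in the mDestroyingServices list without ever being removed. To find out why, we can follow the code path that destroys a ServiceRecord.
The code that finally removes the object is: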
private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
        boolean finishing) {
    r.executeNesting--;
    if (r.executeNesting <= 0) {
        if (r.app != null) {
            // unrelated code omitted
            if (inDestroying) {
                if (DEBUG_SERVICE) Slog.v(TAG_SERVICE,
                        "doneExecuting remove destroying " + r);
                mDestroyingServices.remove(r); // this is where the record is removed
                r.bindings.clear();
            }
        }
    }
}
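Since the code above only removes a record once executeNesting has dropped to 0, the stuck records in a dumpsys dump are those with destroying=true and executeNesting greater than 0. The following Java sketch scans dump lines for them. The class StuckServiceScanner is hypothetical, and it assumes the layout of the sample above: each record starts with a line beginning with "*", and the executeNesting line comes before the destroying line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that lists ServiceRecords stuck in the destroying state. */
final class StuckServiceScanner {
    // Matches "ServiceRecord{<hash> <user> <component>}" and captures the component.
    private static final Pattern HEADER =
            Pattern.compile("ServiceRecord\\{\\S+ \\S+ ([^}]+)\\}");
    private static final Pattern NESTING = Pattern.compile("executeNesting=(\\d+)");

    static List<String> findStuck(List<String> dumpLines) {
        List<String> stuck = new ArrayList<>();
        String current = null; // component of the record being read
        int nesting = 0;
        for (String line : dumpLines) {
            String t = line.trim();
            if (t.startsWith("*")) { // a new record block begins
                Matcher h = HEADER.matcher(t);
                current = h.find() ? h.group(1) : null;
                nesting = 0;
                continue;
            }
            if (current == null) {
                continue;
            }
            Matcher n = NESTING.matcher(t);
            if (n.find()) {
                nesting = Integer.parseInt(n.group(1));
            }
            if (t.contains("destroying=true") && nesting > 0) {
                stuck.add(current + " executeNesting=" + nesting);
                current = null; // report each record only once
            }
        }
        return stuck;
    }
}
```

Grouping the result by component name would show which service accounts for most of the leaked records.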
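3. Combining the code with the ServiceRecord dump, the record is not destroyed because executeNesting=4, so the condition if (r.executeNesting <= 0) is never satisfied. We therefore need to find out why this value never drops to 0.
First, what executeNesting means: it counts the Service operations that have been dispatched but not yet reported as finished. Across the normal create, bind and unbind flows it is hard for executeNesting to fail to return to 0, because every increment is paired with a decrement:
<1> bumpServiceExecutingLocked() executes r.executeNesting++;
<2> serviceDoneExecutingLocked() executes r.executeNesting--;
Take the stopService path as an example. The bindService, unbindService and startService paths work in a similar way and are not repeated here.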
private final void bringDownServiceLocked(ServiceRecord r) {
    // unrelated code omitted
    bumpServiceExecutingLocked(r, false, "destroy");
    mDestroyingServices.add(r);
    r.destroying = true;
    mAm.updateOomAdjLocked(r.app, true);
    // unrelated code omitted
    r.app.thread.scheduleStopService(r);
}
The statement r.app.thread.scheduleStopService(r); eventually reaches, through binder, the handleStopService() method of the ActivityThread in the target application.
4. The code of handleStopService() is as follows:
private void handleStopService(IBinder token) {
    Service s = mServices.remove(token);
    if (s != null) {
        try {
            if (localLOGV) Slog.v(TAG, "Destroying service " + s);
            s.onDestroy(); // calls the Service's onDestroy(), which destroys the Service
            s.detachAndCleanUp();
            Context context = s.getBaseContext();
            if (context instanceof ContextImpl) {
                final String who = s.getClassName();
                ((ContextImpl) context).scheduleFinalCleanup(who, "Service");
            }
            QueuedWork.waitToFinish();
            try {
                ActivityManager.getService().serviceDoneExecuting(
                        token, SERVICE_DONE_EXECUTING_STOP, 0, 0);
            } catch (RemoteException e) {
                throw e.rethrowFromSystemServer();
            }
        } catch (Exception e) {
            if (!mInstrumentation.onException(s, e)) {
                throw new RuntimeException(
                        "Unable to stop service " + s
                        + ": " + e.toString(), e);
            }
            Slog.i(TAG, "handleStopService: exception for " + token, e);
        }
    }
    // (else branch for s == null omitted in this excerpt)
}
ActivityManager.getService().serviceDoneExecuting(token, SERVICE_DONE_EXECUTING_STOP, 0, 0); eventually calls serviceDoneExecutingLocked() in ActiveServices, so <1> and <2> above are paired.
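The pairing of <1> and <2> can be illustrated with a small model of the counter. This is only a sketch: ExecutingModel and its members are hypothetical names, and the real ActiveServices does much more bookkeeping than shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the executeNesting bookkeeping; not AOSP code. */
final class ExecutingModel {
    static final class Record {
        final String name;
        int executeNesting;

        Record(String name) {
            this.name = name;
        }
    }

    /** Stand-in for mDestroyingServices. */
    final List<Record> destroying = new ArrayList<>();

    /** <1>: an operation is dispatched to the application. */
    void bump(Record r) {
        r.executeNesting++;
    }

    /** Stop path: count the pending destroy and park the record. */
    void bringDown(Record r) {
        bump(r);
        destroying.add(r);
    }

    /** <2>: the application reports that one operation has finished. */
    void done(Record r) {
        r.executeNesting--;
        if (r.executeNesting <= 0) {
            destroying.remove(r);
        }
    }
}
```

In this model a record leaves the destroying list only after exactly as many done() calls as bump() calls, so a single missing acknowledgement keeps it there forever.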
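That is the normal flow, and it works. Even if the binder call throws, the exception is caught and the process is restarted, which releases the ServiceRecord. One case may have been overlooked, though: the callback ActivityManager.getService().serviceDoneExecuting(token, SERVICE_DONE_EXECUTING_STOP, 0, 0) is only sent to ActiveServices after the object s returned by Service s = mServices.remove(token); has passed the null check. Can s really be null? Yes, it can.
If ActivityManager.getService().serviceDoneExecuting() is not called, the executeNesting value of the corresponding ServiceRecord in ActiveServices is never decremented, never reaches 0, and the record is never destroyed. The fix is to also send the ActivityManager.getService().serviceDoneExecuting() callback from the else branch of the if (s != null) check (not shown in the excerpt above), so that ActiveServices is still notified.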
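The idea of the fix can be sketched as follows. This is a simplified model, not a patch against ActivityThread: StopServiceSketch, DoneCallback and the ackWhenMissing flag are hypothetical names, a Runnable stands in for the Service, and a String stands in for the IBinder token.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the client-side stop handler; not AOSP code. */
final class StopServiceSketch {
    /** Stand-in for ActivityManager.getService().serviceDoneExecuting(). */
    interface DoneCallback {
        void serviceDoneExecuting(String token);
    }

    private final Map<String, Runnable> services = new HashMap<>();
    private final DoneCallback callback;
    private final boolean ackWhenMissing;

    StopServiceSketch(DoneCallback callback, boolean ackWhenMissing) {
        this.callback = callback;
        this.ackWhenMissing = ackWhenMissing;
    }

    void register(String token, Runnable onDestroy) {
        services.put(token, onDestroy);
    }

    void handleStopService(String token) {
        Runnable s = services.remove(token);
        if (s != null) {
            s.run(); // models s.onDestroy()
            callback.serviceDoneExecuting(token);
        } else if (ackWhenMissing) {
            // Proposed fix: acknowledge even when no Service is found, so the
            // system side can still decrement executeNesting.
            callback.serviceDoneExecuting(token);
        }
    }
}
```

With ackWhenMissing set to false the handler behaves like the original code and stays silent for an unknown token. With it set to true every stop request is answered exactly once.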
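5. Finally, why does each ServiceRecord instance end up with a global reference in ART?
The class is declared as final class ServiceRecord extends Binder. Because ServiceRecord directly extends Binder, it is a Binder entity, and each Binder entity has a corresponding JavaBBinder instance in the JNI layer.
The call chain is outlined below: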
<1> writeStrongBinder(IBinder val) ->
<2> android_os_Parcel_writeStrongBinder() ->
<3> ibinderForJavaObject(env, object) ->
<4> JavaBBinderHolder::get(env, obj) ->
<5> new JavaBBinder(env, obj) ->
<6> mObject(env->NewGlobalRef(object)) ->
<7> JNI::NewGlobalRef() ->
<8> art::JavaVMExt::AddGlobalRef() ->
<9> art::IndirectReferenceTable::Add()
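The chain above can be modelled in a few lines of Java to show where the global reference comes from. This is a deliberately simplified sketch: BinderHolderSketch, NativeStub and Holder are hypothetical stand-ins for the global reference table, JavaBBinder and JavaBBinderHolder, and the model ignores how the native object is later released.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of steps <4> to <9>; not the real JNI code. */
final class BinderHolderSketch {
    /** Stand-in for ART's global reference table. */
    static final List<Object> GLOBAL_REFS = new ArrayList<>();

    /** Stand-in for JavaBBinder: pins its Java object with a global reference. */
    static final class NativeStub {
        final Object javaObject;

        NativeStub(Object javaObject) {
            this.javaObject = javaObject;
            GLOBAL_REFS.add(javaObject); // models env->NewGlobalRef(object)
        }
    }

    /** Stand-in for JavaBBinderHolder: creates the stub on first use. */
    static final class Holder {
        private final Object owner;
        private NativeStub stub;

        Holder(Object owner) {
            this.owner = owner;
        }

        NativeStub get() {
            if (stub == null) {
                stub = new NativeStub(owner);
            }
            return stub;
        }
    }
}
```

In this model each Binder object that is written to a Parcel costs one table entry for as long as its stub exists, so tens of thousands of ServiceRecord objects that are never released are enough to fill a table of 51200 entries.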