Preface
For a long time I held a misconception about Service ANRs. I assumed that for the following operations
bumpServiceExecutingLocked(r, execInFg, "bind");
bumpServiceExecutingLocked(r, execInFg, "create");
bumpServiceExecutingLocked(r, execInFg, "start");
bumpServiceExecutingLocked(r, false, "bring down unbind");
bumpServiceExecutingLocked(r, false, "destroy");
bumpServiceExecutingLocked(s, false, "unbind");
it was enough for each one, taken individually, to finish within the timeout. It later turned out that this is not the case. Take the following demo as an example:
Client
class MainHandler extends Handler {
    public static final int MSG_TEST_1 = 1;
    public static final int MSG_TEST_2 = 2;
    public static final int MSG_TEST_3 = 3;
    public static final int MSG_TEST_4 = 4;
    public static final int MSG_TEST_5 = 5;
    MainHandler(Looper loop) {
        super(loop);
    }
    @Override
    public void handleMessage(Message msg) {
        switch (msg.what) {
            case MSG_TEST_1:
                Log.i("weijuncheng", "msg 1 start");
                try {
                    Thread.sleep(3500);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                Log.i("weijuncheng", "msg 1 end");
                break;
            case MSG_TEST_2:
                Log.i("weijuncheng", "msg 2 start");
                Log.i("weijuncheng", "msg 2 end");
                break;
            case MSG_TEST_3:
                Log.i("weijuncheng", "msg 3 start");
                try {
                    Thread.sleep(3500);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                Log.i("weijuncheng", "msg 3 end");
                break;
            case MSG_TEST_4:
                Log.i("weijuncheng", "msg 4 start");
                Log.i("weijuncheng", "msg 4 end");
                break;
            case MSG_TEST_5:
                Log.i("weijuncheng", "msg 5 start");
                Log.i("weijuncheng", "msg 5 end");
                break;
        }
    }
}
public class MainActivity extends AppCompatActivity {
    private Handler mHandler;
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        mHandler = new MainHandler(MainActivity.this.getMainLooper());
        Button btn = (Button) findViewById(R.id.btn);
        btn.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View v) {
                mHandler.post(new Runnable() {
                    @Override
                    public void run() {
                        Log.i("weijuncheng", "message 0 start");
                        try {
                            Thread.sleep(3500);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                        Log.i("weijuncheng", "message 0 end");
                    }
                });
                // The Message posted above does not get printed; could that be because it was sent via post()?
                mHandler.sendEmptyMessage(MainHandler.MSG_TEST_1);
                mHandler.sendEmptyMessage(MainHandler.MSG_TEST_2);
                mHandler.sendEmptyMessage(MainHandler.MSG_TEST_4);
                mHandler.sendEmptyMessageDelayed(MainHandler.MSG_TEST_5, 100);
                startService(new Intent(MainActivity.this, TestAnrService.class));
            }
        });
        mHandler.sendEmptyMessageDelayed(MainHandler.MSG_TEST_3, 1000);
    }
}
Server
public class TestAnrService extends Service {
    private ITestAnrService.Stub IService = new ITestAnrService.Stub() {
        @Override
        public void basicTypes(int anInt, long aLong, boolean aBoolean, float aFloat, double aDouble, String aString) throws RemoteException {
        }
        @Override
        public void Method1() throws RemoteException {
        }
        @Override
        public void Method2() throws RemoteException {
        }
    };
    public TestAnrService() {
    }
    @Override
    public void onCreate() {
        Log.i("weijuncheng", "onCreate start");
        super.onCreate();
        try {
            Thread.sleep(3500);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        Log.i("weijuncheng", "onCreate end");
    }
    @Override
    public IBinder onBind(Intent intent) {
        // TODO: Return the communication channel to the service.
        return IService;
    }
    @Override
    public int onStartCommand(Intent intent, int flags, int startId) {
        Log.i("weijuncheng", "onStartCommand start");
        try {
            Thread.sleep(30000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        Log.i("weijuncheng", "onStartCommand end");
        return super.onStartCommand(intent, flags, startId);
    }
}
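By my earlier understanding, no ANR should occur as long as each AMS operation finishes within the timeout. That raises a question. onCreate did wait a long time before it got to run, but once onCreate had finished, the Handler of the ActivityManager thread in system_server should have removed the corresponding timeout message.
Analysis
Let us look again at where the message is removed, serviceDoneExecutingLocked: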
private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
        boolean finishing) {
    if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "<<< DONE EXECUTING " + r
            + ": nesting=" + r.executeNesting
            + ", inDestroying=" + inDestroying + ", app=" + r.app);
    else if (DEBUG_SERVICE_EXECUTING) Slog.v(TAG_SERVICE_EXECUTING,
            "<<< DONE EXECUTING " + r.shortName);
    r.executeNesting--;
    if (r.executeNesting <= 0) {
        if (r.app != null) {
            if (DEBUG_SERVICE) Slog.v(TAG_SERVICE,
                    "Nesting at 0 of " + r.shortName);
            r.app.execServicesFg = false;
            r.app.executingServices.remove(r);
            if (r.app.executingServices.size() == 0) {
                if (DEBUG_SERVICE || DEBUG_SERVICE_EXECUTING) Slog.v(TAG_SERVICE_EXECUTING,
                        "No more executingServices of " + r.shortName);
                mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
            } else if (r.executeFg) {
                // Need to re-evaluate whether the app still needs to be in the foreground.
                for (int i=r.app.executingServices.size()-1; i>=0; i--) {
                    if (r.app.executingServices.valueAt(i).executeFg) {
                        r.app.execServicesFg = true;
                        break;
                    }
                }
            }
            if (inDestroying) {
                if (DEBUG_SERVICE) Slog.v(TAG_SERVICE,
                        "doneExecuting remove destroying " + r);
                mDestroyingServices.remove(r);
                r.bindings.clear();
            }
            mAm.updateOomAdjLocked(r.app, true);
        }
        r.executeFg = false;
        if (r.tracker != null) {
            r.tracker.setExecuting(false, mAm.mProcessStats.getMemFactorLocked(),
                    SystemClock.uptimeMillis());
            if (finishing) {
                r.tracker.clearCurrentOwner(r, false);
                r.tracker = null;
            }
        }
        if (finishing) {
            if (r.app != null && !r.app.persistent) {
                r.app.services.remove(r);
                if (r.whitelistManager) {
                    updateWhitelistManagerLocked(r.app);
                }
            }
            r.app = null;
        }
    }
}
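As you can see, the timeout Message is only removed when r.executeNesting <= 0. Each operation increments executeNesting by one; when the operation finishes, the app process notifies AMS, and serviceDoneExecutingLocked decrements it.
Taking startService as an example, though, more than one operation is involved. When the client process calls startService through a binder call, a binder thread A in system_server handles it, for example by running realStartServiceLocked. Along the way, A makes further binder calls to the server process; a binder thread in the server process receives them and posts the corresponding messages to the server process's main thread. Once such a message has been handled, the server makes a binder call back to AMS, where a binder thread B runs serviceDoneExecutingLocked and decrements r.executeNesting.
Binder thread A, however, does not wait for B's result before continuing, so executeNesting can easily reach 2 or more, as in the case above.
In other words, a Service ANR is judged per operation sequence. The granularity is not create, start, or bind individually but the whole startService operation, which must be fully processed before the first timeout message is actually handled (removeMessages removes every pending Message with the same what value and, here, the same obj, r.app).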
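The counter behaviour described above can be sketched with a small model. This is a simplified illustration, not the AOSP code: ServiceTimeoutModel and its methods are hypothetical names, it tracks a single service, and it assumes the timeout is armed only when the counter goes from 0 to 1.

```java
// Hypothetical, simplified model of ServiceRecord.executeNesting; not the AOSP implementation.
final class ServiceTimeoutModel {
    private int executeNesting = 0;          // mirrors r.executeNesting
    private boolean timeoutPending = false;  // stands in for the queued SERVICE_TIMEOUT_MSG

    // Called where AMS would call bumpServiceExecutingLocked(...).
    void bump(String why) {
        if (executeNesting == 0) {
            // Assumption: only the first operation arms the timeout.
            timeoutPending = true;
        }
        executeNesting++;
    }

    // Called where AMS would call serviceDoneExecutingLocked(...).
    void done() {
        executeNesting--;
        if (executeNesting <= 0) {
            // Corresponds to removeMessages(SERVICE_TIMEOUT_MSG, r.app).
            timeoutPending = false;
        }
    }

    int nesting() {
        return executeNesting;
    }

    boolean isTimeoutPending() {
        return timeoutPending;
    }

    public static void main(String[] args) {
        ServiceTimeoutModel m = new ServiceTimeoutModel();
        m.bump("create");
        m.bump("start"); // sent before the app reports that create has finished
        m.done();        // onCreate finished
        // prints "after create: nesting=1, timeoutPending=true"
        System.out.println("after create: nesting=" + m.nesting()
                + ", timeoutPending=" + m.isTimeoutPending());
        m.done();        // onStartCommand finished
        // prints "after start: nesting=0, timeoutPending=false"
        System.out.println("after start: nesting=" + m.nesting()
                + ", timeoutPending=" + m.isTimeoutPending());
    }
}
```

In this model, finishing onCreate alone leaves the timeout pending, which is the situation the demo runs into.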
Verifying this with the debug service log enabled:
12-21 11:28:26.309 1000 1487 9392 V ActivityManager: >>> EXECUTING create of ServiceRecord{da8991b u0 com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService} in app ProcessRecord{5519ab8 11495:com.test.weijuncheng.testanr_ipc_server/u0a172}
12-21 11:28:26.311 1000 1487 9392 V ActivityManager: Sending arguments to: ServiceRecord{da8991b u0 com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService} android.content.Intent$FilterComparison@e16fe3c9 args=Intent { cmp=com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService }
12-21 11:28:26.311 1000 1487 9392 V ActivityManager: >>> EXECUTING start of ServiceRecord{da8991b u0 com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService} in app ProcessRecord{5519ab8 11495:com.test.weijuncheng.testanr_ipc_server/u0a172}
12-21 11:28:36.863 10172 11495 11495 I weijuncheng: onCreate end
12-21 11:28:36.868 1000 1487 4648 V ActivityManager: <<< DONE EXECUTING ServiceRecord{da8991b u0 com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService}: nesting=2, inDestroying=false, app=ProcessRecord{5519ab8 11495:com.test.weijuncheng.testanr_ipc_server/u0a172}
12-21 11:28:36.871 10172 11495 11495 I weijuncheng: onStartCommand start
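To make the timing in this log concrete, here is a rough sketch that adds up how long the main thread stays busy. The durations come from the demo (the posted Runnable, MSG_TEST_1 and onCreate each sleep 3.5s, and onStartCommand sleeps 30s). The 20s timeout is an assumed value for the service timeout, and MainThreadTimeline is a hypothetical helper, not part of Android.

```java
// Hypothetical helper: models items that run back to back on a single thread.
final class MainThreadTimeline {
    // Time (ms after the first item starts) at which the last item finishes.
    static long finishTimeMillis(long... durationsMillis) {
        long t = 0;
        for (long d : durationsMillis) {
            t += d;
        }
        return t;
    }

    // True if the whole sequence is still running when the timeout fires.
    static boolean exceedsTimeout(long timeoutMillis, long... durationsMillis) {
        return finishTimeMillis(durationsMillis) > timeoutMillis;
    }

    public static void main(String[] args) {
        long timeout = 20_000; // assumed service timeout in ms

        // Runnable, MSG_TEST_1, onCreate
        System.out.println(finishTimeMillis(3500, 3500, 3500));              // prints 10500
        System.out.println(exceedsTimeout(timeout, 3500, 3500, 3500));      // prints false

        // Runnable, MSG_TEST_1, onCreate, onStartCommand
        System.out.println(finishTimeMillis(3500, 3500, 3500, 30_000));         // prints 40500
        System.out.println(exceedsTimeout(timeout, 3500, 3500, 3500, 30_000)); // prints true
    }
}
```

The 10.5s figure matches the gap between the "EXECUTING create" and "onCreate end" log lines; under the assumed timeout, the sequence only overruns once onStartCommand is counted as part of it.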
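A tip for Xiaomi devices
Many Xiaomi devices have a Message statistics feature that records which Messages waited or ran for a long time. The result is written to anr_info_processName.txt under /data/anr. The demo above produced the following file: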
ANR in com.test.weijuncheng.testanr_ipc_server
PID: 10504
Reason: executing service com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService
package com.test.weijuncheng.testanr_ipc_server version Code: 1 version Name: 1.0 cur loop is : Looper (main, tid 2) {16ea0c0}
Dump time : 2018-12-14_09:42:55.746
---------- History of long time messages on Looper (main, tid 2) {16ea0c0}----------
#0: { what=114 target=android.app.ActivityThread$H when=2018-12-14_09:42:35.635 latency=+7s8ms processing=+3s518ms }
#1: { what=2 target=com.test.weijuncheng.testanr_ipc_server.MainHandler when=2018-12-14_09:42:35.631 latency=+7s12ms processing=+1ms }
#2: { what=1 target=com.test.weijuncheng.testanr_ipc_server.MainHandler when=2018-12-14_09:42:35.631 latency=+3s509ms processing=+3s503ms }
#3: { callback=com.test.weijuncheng.testanr_ipc_server.MainActivity$1$1 target=com.test.weijuncheng.testanr_ipc_server.MainHandler when=2018-12-14_09:42:35.631 latency=+7ms processing=+3s501ms }
#4: { what=3 target=com.test.weijuncheng.testanr_ipc_server.MainHandler when=2018-12-14_09:42:32.085 latency=+1ms processing=+3s502ms }
#5: { what=159 target=android.app.ActivityThread$H when=2018-12-14_09:42:30.820 latency=+80ms processing=+217ms }
-------------------------- END --------------------------
---------- Dump Current Running Message ----------
{ what=115 target=android.app.ActivityThread$H when=2018-12-14_09:42:35.636 latency=+10s527ms }
-------------------------- END --------------------------
As shown, there are three key fields: when, latency, and processing
when: the time at which the Message was ideally expected to be dequeued
b.append(" when=" + DATE_FORMATTER.format(new Date(planCurrentTime)));
long planCurrentTime; // using the java.lang.System.currentTimeMillis() time-base.
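Judging by its behaviour, this looks like the actual enqueue time. (More precisely, it is the time at which the Message is expected to be dequeued. Suppose the queue is empty and when lies 10s in the future: the Message has already been enqueued, but it has to wait 10s and is expected to be dequeued and run after that. When it is actually dequeued is another matter.)
latency: the waiting time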
b.append(" latency=" + TimeUtils.formatDuration(getLatencyMillis()));
// unexpected delay time
long getLatencyMillis() {
    return dispatchTime - planTime; // time actually dequeued from the MessageQueue for execution minus the enqueue time, i.e. the wait before execution really starts
}
#0: { what=114 target=android.app.ActivityThread$H when=2018-12-14_06:43:07.762 latency=+7s9ms processing=+3s518ms }
It waited 7s because the two Messages ahead of it in the queue each took 3.5s
processing: the handling time
b.append(" processing=" + TimeUtils.formatDuration(getProcessMillis()));
long getProcessMillis() {
    if (isFinished()) {
        return finishTime - dispatchTime; // how long the message actually ran
    } else {
        return 0; // the message had not finished when the dump was taken, so return 0
    }
}
#0: { what=114 target=android.app.ActivityThread$H when=2018-12-14_06:43:07.762 latency=+7s9ms processing=+3s518ms }
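This one ran for 3.5s, which matches the demo logic
So the time the current message had been running = Dump time (2018-12-14_06:43:27.901) - when (2018-12-14_06:43:07.762) - latency (+10s527ms) ≈ 9.6s. In other words, it had been running for about 9.6s when the ANR was finally raised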
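The same subtraction can be written as a short sketch. It uses the values from the dump file shown earlier (Dump time 09:42:55.746, when 09:42:35.636, latency +10s527ms); RunningMessageTime is a hypothetical helper, and the timestamp pattern is inferred from how the dump prints its dates.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for reading the "Dump Current Running Message" section.
final class RunningMessageTime {
    // Pattern inferred from timestamps such as 2018-12-14_09:42:55.746.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd_HH:mm:ss.SSS");

    // running time = dump time - when - latency
    static long runningMillis(String dumpTime, String when, long latencyMillis) {
        LocalDateTime dump = LocalDateTime.parse(dumpTime, FMT);
        LocalDateTime planned = LocalDateTime.parse(when, FMT);
        return Duration.between(planned, dump).toMillis() - latencyMillis;
    }

    public static void main(String[] args) {
        long running = runningMillis(
                "2018-12-14_09:42:55.746",  // Dump time
                "2018-12-14_09:42:35.636",  // when of the current message (what=115)
                10_527);                    // latency=+10s527ms
        System.out.println(running + " ms"); // prints "9583 ms"
    }
}
```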
---------- History of long time messages on Looper (main, tid 2) {16ea0c0}----------
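This list is in reverse order; from earliest to latest enqueue time the entries are #5, #4, #3, #2, #1, #0
The threshold for printing is set as follows: an entry is recorded when latency + processing > 200 ms
With this information you can work out which Message was slow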
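As a rough illustration of reading these entries, the sketch below parses the latency and processing fields and separates a Message that was itself slow from one that merely waited behind others. The duration format is inferred from the dump above, the sample lines are abbreviated, and the class name and the 1000 ms cutoff are hypothetical choices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for history lines such as
// "{ what=114 ... latency=+7s8ms processing=+3s518ms }".
final class AnrHistoryEntry {
    // "ms" must be tried before "m" and "s" so that "8ms" is not read as minutes.
    private static final Pattern UNIT = Pattern.compile("(\\d+)(ms|d|h|m|s)");

    // Converts text such as "7s8ms" into milliseconds.
    static long parseDurationMillis(String text) {
        long total = 0;
        Matcher m = UNIT.matcher(text);
        while (m.find()) {
            long value = Long.parseLong(m.group(1));
            switch (m.group(2)) {
                case "d": total += value * 86_400_000L; break;
                case "h": total += value * 3_600_000L; break;
                case "m": total += value * 60_000L; break;
                case "s": total += value * 1_000L; break;
                default:  total += value; break; // "ms"
            }
        }
        return total;
    }

    // Extracts "name=+<duration>" from a line; returns 0 if the field is absent.
    static long field(String line, String name) {
        Matcher m = Pattern.compile(name + "=\\+?(\\S+)").matcher(line);
        return m.find() ? parseDurationMillis(m.group(1)) : 0;
    }

    // Assumed cutoff of 1000 ms, chosen only for this illustration.
    static String classify(String line) {
        if (field(line, "processing") >= 1000) return "slow handler";
        if (field(line, "latency") >= 1000) return "delayed by earlier messages";
        return "ok";
    }

    public static void main(String[] args) {
        // prints "slow handler"
        System.out.println(classify("{ what=114 latency=+7s8ms processing=+3s518ms }"));
        // prints "delayed by earlier messages"
        System.out.println(classify("{ what=2 latency=+7s12ms processing=+1ms }"));
    }
}
```

Entries with high latency but low processing, such as #1 above, are victims of earlier Messages rather than the cause.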
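Summary
The time can therefore be lost in three places: the Handler of the ActivityManager thread in system_server, the main-thread Handler of the process that hit the ANR, or the binder calls. It is usually one of the latter two; the first is rarely the cause.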