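ANR Notes: A Misconception About the Service ANR Flow

Preface

For a long time I misunderstood how a Service ANR is triggered. I assumed that, for the operations below,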

bumpServiceExecutingLocked(r, execInFg, "bind");
bumpServiceExecutingLocked(r, execInFg, "create");
bumpServiceExecutingLocked(r, execInFg, "start");
bumpServiceExecutingLocked(r, false, "bring down unbind");
bumpServiceExecutingLocked(r, false, "destroy");
bumpServiceExecutingLocked(s, false, "unbind");

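it was enough for each one to finish within the timeout on its own. I later found that this is not the case. Take the following demo as an example:

Client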

// Handler bound to the main looper; MSG_TEST_1 and MSG_TEST_3 each block the thread for 3.5 s.
class MainHandler extends Handler {
    public static final int MSG_TEST_1 = 1;
    public static final int MSG_TEST_2 = 2;
    public static final int MSG_TEST_3 = 3;
    public static final int MSG_TEST_4 = 4;
    public static final int MSG_TEST_5 = 5;

    MainHandler(Looper loop){
        super(loop);
    }

    @Override
    public void handleMessage(Message msg) {
        switch (msg.what){
            case MSG_TEST_1:
                Log.i("weijuncheng","msg 1 start");
                try {
                    Thread.sleep(3500);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                Log.i("weijuncheng","msg 1 end");
                break;
            case MSG_TEST_2:
                Log.i("weijuncheng","msg 2 start");
                Log.i("weijuncheng","msg 2 end");
                break;
            case MSG_TEST_3:
                Log.i("weijuncheng","msg 3 start");
                try {
                    Thread.sleep(3500);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                Log.i("weijuncheng","msg 3 end");
                break;
            case MSG_TEST_4:
                Log.i("weijuncheng","msg 4 start");
                Log.i("weijuncheng","msg 4 end");
                break;
            case MSG_TEST_5:
                Log.i("weijuncheng","msg 5 start");
                Log.i("weijuncheng","msg 5 end");
                break;

        }
    }
}


public class MainActivity extends AppCompatActivity {

    private Handler mHandler;


    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        mHandler = new MainHandler(MainActivity.this.getMainLooper());

        Button btn = (Button)findViewById(R.id.btn);
        btn.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View v) {
                mHandler.post(new Runnable(){
                    @Override
                    public void run() {
                        Log.i("weijuncheng","message 0 start");
                        try {
                            Thread.sleep(3500);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                        Log.i("weijuncheng","message 0 end");
                    }});
                // The Runnable posted above does not show up as a Message in the log; could that be because it was sent with post()?
                mHandler.sendEmptyMessage(MainHandler.MSG_TEST_1);
                mHandler.sendEmptyMessage(MainHandler.MSG_TEST_2);
                mHandler.sendEmptyMessage(MainHandler.MSG_TEST_4);
                mHandler.sendEmptyMessageDelayed(MainHandler.MSG_TEST_5,100);
                startService(new Intent(MainActivity.this, TestAnrService.class));
            }

        });

        mHandler.sendEmptyMessageDelayed(MainHandler.MSG_TEST_3,1000);
    }
}

Server

public class TestAnrService extends Service {

    private ITestAnrService.Stub IService = new ITestAnrService.Stub() {
        @Override
        public void basicTypes(int anInt, long aLong, boolean aBoolean, float aFloat, double aDouble, String aString) throws RemoteException {

        }

        @Override
        public void Method1() throws RemoteException {

        }

        @Override
        public void Method2() throws RemoteException {

        }
    };

    public TestAnrService() {
    }


    @Override
    public void onCreate() {
        Log.i("weijuncheng","onCreate start");
        super.onCreate();
        try {
            Thread.sleep(3500);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        Log.i("weijuncheng","onCreate end");
    }

    @Override
    public IBinder onBind(Intent intent) {
        // TODO: Return the communication channel to the service.
        return IService;
    }

    @Override
    public int onStartCommand(Intent intent, int flags, int startId) {
        Log.i("weijuncheng","onStartCommand start");
        try {
            Thread.sleep(30000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        Log.i("weijuncheng","onStartCommand end");
        return super.onStartCommand(intent, flags, startId);
    }
}

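According to my earlier understanding, each AMS operation only has to stay within the timeout on its own. That raises a question: onCreate did wait a long time before it got to run, but once onCreate has finished, the Handler of the ActivityManager thread in system_server should have removed the corresponding timeout message, so no ANR would be expected.

Analysis

Let us look again at where the timeout message is removed, serviceDoneExecutingLocked: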

    private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
            boolean finishing) {
        if (DEBUG_SERVICE) Slog.v(TAG_SERVICE, "<<< DONE EXECUTING " + r
                + ": nesting=" + r.executeNesting
                + ", inDestroying=" + inDestroying + ", app=" + r.app);
        else if (DEBUG_SERVICE_EXECUTING) Slog.v(TAG_SERVICE_EXECUTING,
                "<<< DONE EXECUTING " + r.shortName);
        r.executeNesting--;
        if (r.executeNesting <= 0) {
            if (r.app != null) {
                if (DEBUG_SERVICE) Slog.v(TAG_SERVICE,
                        "Nesting at 0 of " + r.shortName);
                r.app.execServicesFg = false;
                r.app.executingServices.remove(r);
                if (r.app.executingServices.size() == 0) {
                    if (DEBUG_SERVICE || DEBUG_SERVICE_EXECUTING) Slog.v(TAG_SERVICE_EXECUTING,
                            "No more executingServices of " + r.shortName);
                    mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
                } else if (r.executeFg) {
                    // Need to re-evaluate whether the app still needs to be in the foreground.
                    for (int i=r.app.executingServices.size()-1; i>=0; i--) {
                        if (r.app.executingServices.valueAt(i).executeFg) {
                            r.app.execServicesFg = true;
                            break;
                        }
                    }
                }
                if (inDestroying) {
                    if (DEBUG_SERVICE) Slog.v(TAG_SERVICE,
                            "doneExecuting remove destroying " + r);
                    mDestroyingServices.remove(r);
                    r.bindings.clear();
                }
                mAm.updateOomAdjLocked(r.app, true);
            }
            r.executeFg = false;
            if (r.tracker != null) {
                r.tracker.setExecuting(false, mAm.mProcessStats.getMemFactorLocked(),
                        SystemClock.uptimeMillis());
                if (finishing) {
                    r.tracker.clearCurrentOwner(r, false);
                    r.tracker = null;
                }
            }
            if (finishing) {
                if (r.app != null && !r.app.persistent) {
                    r.app.services.remove(r);
                    if (r.whitelistManager) {
                        updateWhitelistManagerLocked(r.app);
                    }
                }
                r.app = null;
            }
        }
    }

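As the code shows, the timeout Message is only cancelled when r.executeNesting <= 0 (and r.app.executingServices has become empty). Each operation increments the counter by 1; when the operation finishes, the process notifies AMS, which calls serviceDoneExecutingLocked and decrements executeNesting.

Taking startService as an example, however, more than one operation is involved. When the client process calls startService through a binder call, a binder thread A in system_server handles it, for example by running realStartServiceLocked. Along the way, A makes further binder calls to operate on the server side. A binder thread in the server process receives each call and posts the corresponding message to the server process's main thread. Once the main thread has finished the work, it makes a binder call back into AMS, where a binder thread B runs serviceDoneExecutingLocked and decrements r.executeNesting. Binder thread A in system_server does not wait for B's result before it continues, so executeNesting can easily reach 2 or more, as in the case above.

In other words, a Service ANR is judged per overall operation. The granularity is not create, start, or bind, but the whole operation such as startService, and all of it has to be finished before the first timeout message is actually handled (removeMessages removes every pending Message with the same what value and the same obj, here r.app).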
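The counter logic described above can be sketched in plain Java. This is a toy model rather than the real ActiveServices code: the class ServiceTimeoutModel and its methods are hypothetical, and it assumes a single pending timeout that is armed by the first operation and cleared only when the nesting count drops back to zero.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the executeNesting bookkeeping; not the real ActiveServices code. */
class ServiceTimeoutModel {
    private int executeNesting = 0;         // mirrors r.executeNesting
    private boolean timeoutPending = false; // stands in for a queued SERVICE_TIMEOUT_MSG
    private final List<String> log = new ArrayList<>();

    /** AMS dispatches one operation ("create", "start", ...) to the app. */
    void bump(String why) {
        if (executeNesting == 0) {
            timeoutPending = true; // assumption: only the first operation arms the timeout
        }
        executeNesting++;
        log.add(">>> EXECUTING " + why + ": nesting=" + executeNesting);
    }

    /** The app reports that one operation has finished. */
    void done() {
        log.add("<<< DONE EXECUTING: nesting=" + executeNesting);
        executeNesting--;
        if (executeNesting <= 0) {
            timeoutPending = false; // the timeout is removed only at this point
        }
    }

    boolean isTimeoutPending() { return timeoutPending; }
    int nesting() { return executeNesting; }
    List<String> log() { return log; }

    public static void main(String[] args) {
        ServiceTimeoutModel r = new ServiceTimeoutModel();
        r.bump("create"); // both operations are dispatched back to back,
        r.bump("start");  // without waiting for the app to answer
        r.done();         // onCreate has finished
        for (String line : r.log()) {
            System.out.println(line);
        }
        System.out.println("timeout still pending: " + r.isTimeoutPending());
    }
}
```

In this model the timeout is still pending after onCreate reports back, because the "start" operation has not finished yet. That is the same situation as nesting=2 in a DONE EXECUTING log line.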

Verifying this with the debug service logs enabled:

12-21 11:28:26.309  1000  1487  9392 V ActivityManager: >>> EXECUTING create of ServiceRecord{da8991b u0 com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService} in app ProcessRecord{5519ab8 11495:com.test.weijuncheng.testanr_ipc_server/u0a172}
12-21 11:28:26.311  1000  1487  9392 V ActivityManager: Sending arguments to: ServiceRecord{da8991b u0 com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService} android.content.Intent$FilterComparison@e16fe3c9 args=Intent { cmp=com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService }
12-21 11:28:26.311  1000  1487  9392 V ActivityManager: >>> EXECUTING start of ServiceRecord{da8991b u0 com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService} in app ProcessRecord{5519ab8 11495:com.test.weijuncheng.testanr_ipc_server/u0a172}

12-21 11:28:36.863 10172 11495 11495 I weijuncheng: onCreate end
12-21 11:28:36.868  1000  1487  4648 V ActivityManager: <<< DONE EXECUTING ServiceRecord{da8991b u0 com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService}: nesting=2, inDestroying=false, app=ProcessRecord{5519ab8 11495:com.test.weijuncheng.testanr_ipc_server/u0a172}
12-21 11:28:36.871 10172 11495 11495 I weijuncheng: onStartCommand start

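A Tip for Xiaomi Devices

Many Xiaomi devices have a Message statistics feature that records which Messages waited or ran for a long time. The result is written to anr_info_processName.txt under /data/anr. The demo above produced the following file: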

ANR in com.test.weijuncheng.testanr_ipc_server
PID: 10504
Reason: executing service com.test.weijuncheng.testanr_ipc_server/.Service.TestAnrService
 package com.test.weijuncheng.testanr_ipc_server version Code: 1 version Name: 1.0 cur loop is : Looper (main, tid 2) {16ea0c0}
Dump time : 2018-12-14_09:42:55.746
---------- History of long time messages on Looper (main, tid 2) {16ea0c0}----------
#0: { what=114 target=android.app.ActivityThread$H when=2018-12-14_09:42:35.635 latency=+7s8ms processing=+3s518ms }
#1: { what=2 target=com.test.weijuncheng.testanr_ipc_server.MainHandler when=2018-12-14_09:42:35.631 latency=+7s12ms processing=+1ms }
#2: { what=1 target=com.test.weijuncheng.testanr_ipc_server.MainHandler when=2018-12-14_09:42:35.631 latency=+3s509ms processing=+3s503ms }
#3: { callback=com.test.weijuncheng.testanr_ipc_server.MainActivity$1$1 target=com.test.weijuncheng.testanr_ipc_server.MainHandler when=2018-12-14_09:42:35.631 latency=+7ms processing=+3s501ms }
#4: { what=3 target=com.test.weijuncheng.testanr_ipc_server.MainHandler when=2018-12-14_09:42:32.085 latency=+1ms processing=+3s502ms }
#5: { what=159 target=android.app.ActivityThread$H when=2018-12-14_09:42:30.820 latency=+80ms processing=+217ms }
-------------------------- END --------------------------
---------- Dump Current Running Message ----------
{ what=115 target=android.app.ActivityThread$H when=2018-12-14_09:42:35.636 latency=+10s527ms }
-------------------------- END --------------------------

As shown, there are three key fields: when, latency, and processing.

when: the time at which the message is ideally dequeued

b.append(" when=" + DATE_FORMATTER.format(new Date(planCurrentTime)));
long planCurrentTime;     // using the java.lang.System.currentTimeMillis() time-base. 

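Judging from the output, this looks like the actual enqueue time. (More precisely, it is the time at which the Message is expected to be dequeued. Suppose the queue is empty and when = 10s: the Message was enqueued long before, but it has to wait 10s and is expected to be dequeued and run after that. The time at which it is actually dequeued may differ.)

latency: wait time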

b.append(" latency=" + TimeUtils.formatDuration(getLatencyMillis()));
        // unexpected delay time
        long getLatencyMillis() {
            return dispatchTime - planTime; // time the message was actually taken from the MessageQueue for execution, minus its planned time; i.e. how long it waited before it really started running
        }
#0: { what=114 target=android.app.ActivityThread$H when=2018-12-14_06:43:07.762 latency=+7s9ms processing=+3s518ms }

It waited 7s because each of the two Messages ahead of it in the queue took 3.5s.

processing: processing time
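The 7 s wait can be reproduced with a small simulation of a single-threaded queue. This is only a sketch: the field names planTime, dispatchTime and finishTime follow the snippets quoted in this article, while the class LooperLatencyDemo and the helper drain are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Toy single-threaded message queue; all times are in milliseconds. */
class LooperLatencyDemo {
    static final class Msg {
        final String name;
        final long planTime;   // when the message is due
        final long workMillis; // how long its handler blocks the thread
        long dispatchTime;     // when it was really taken off the queue
        long finishTime;       // when its handler returned

        Msg(String name, long planTime, long workMillis) {
            this.name = name;
            this.planTime = planTime;
            this.workMillis = workMillis;
        }

        long getLatencyMillis() { return dispatchTime - planTime; }
        long getProcessMillis() { return finishTime - dispatchTime; }
    }

    /** Dispatches the messages strictly one after another, in list order. */
    static void drain(List<Msg> queue) {
        long now = 0;
        for (Msg m : queue) {
            now = Math.max(now, m.planTime); // stay idle until the message is due
            m.dispatchTime = now;
            now += m.workMillis;
            m.finishTime = now;
        }
    }

    public static void main(String[] args) {
        // Same shape as the demo: two 3.5 s handlers queued ahead of a fast one.
        List<Msg> queue = Arrays.asList(
                new Msg("posted Runnable", 0, 3500),
                new Msg("MSG_TEST_1", 0, 3500),
                new Msg("MSG_TEST_2", 0, 0));
        drain(queue);
        for (Msg m : queue) {
            System.out.println(m.name + " latency=" + m.getLatencyMillis()
                    + "ms processing=" + m.getProcessMillis() + "ms");
        }
    }
}
```

The third message is due at time 0 but is only dispatched after both 3.5 s handlers have returned, so its latency is the sum of their processing times.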

b.append(" processing=" + TimeUtils.formatDuration(getProcessMillis()));

        long getProcessMillis() {
            if (isFinished()) {
                return finishTime - dispatchTime; // how long the message actually ran
            } else {
                return 0; // the message had not finished when the dump was taken
            }
        }

#0: { what=114 target=android.app.ActivityThread$H when=2018-12-14_06:43:07.762 latency=+7s9ms processing=+3s518ms }

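This message ran for 3.5s, which matches the demo's logic.

So the time the current message has already been running = Dump time (2018-12-14_06:43:27.901) - when (2018-12-14_06:43:07.762) - latency (+10s527ms) = 20.139s - 10.527s ≈ 9.6s. In other words, it had been running for about 9.6s when the ANR finally fired.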
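The same arithmetic can be checked with java.time. This is a minimal sketch: the class RunningMessageElapsed and the method elapsedMillis are hypothetical helpers, and the timestamp pattern is an assumption based on how the dump prints its dates.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Computes how long the currently running message had been executing at dump time. */
class RunningMessageElapsed {
    // Assumed pattern, inferred from values such as 2018-12-14_06:43:27.901.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd_HH:mm:ss.SSS");

    /** elapsed = (dump time - when) - latency, all in milliseconds. */
    static long elapsedMillis(String dumpTime, String when, long latencyMillis) {
        LocalDateTime dump = LocalDateTime.parse(dumpTime, FMT);
        LocalDateTime planned = LocalDateTime.parse(when, FMT);
        return Duration.between(planned, dump).toMillis() - latencyMillis;
    }

    public static void main(String[] args) {
        long elapsed = elapsedMillis(
                "2018-12-14_06:43:27.901", // Dump time
                "2018-12-14_06:43:07.762", // when
                10_527);                   // latency=+10s527ms
        System.out.println("running for " + elapsed + " ms");
    }
}
```

With the timestamps quoted above this comes out at 9612 ms, roughly 9.6 s of execution before the dump was taken.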

---------- History of long time messages on Looper (main, tid 2) {16ea0c0}---------- 

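The list is in reverse order: sorted by enqueue time from earliest to latest, it reads #5, #4, #3, #2, #1, #0.
The logging threshold is set as follows: a Message is recorded when latency + processing > 200 (ms).
With this information you can tell which Message took too long.

Summary

The time can therefore be spent in three places: the Handler of the ActivityManager thread in system_server, the main-thread Handler of the process that hit the ANR, or the binder calls. It is usually one of the latter two; the first is rather unlikely.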
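The two rules above (the recording threshold and the reversed order) can be written down as a short sketch. The class LongMessageHistory and its methods are hypothetical; only the 200 threshold and the newest-first ordering come from the text, and the unit is assumed to be milliseconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch of the recording rule and of reading the history in chronological order. */
class LongMessageHistory {
    static final long THRESHOLD_MS = 200; // threshold quoted in the text, assumed to be ms

    /** A message is recorded when latency + processing exceeds the threshold. */
    static boolean shouldRecord(long latencyMs, long processingMs) {
        return latencyMs + processingMs > THRESHOLD_MS;
    }

    /** The dump lists the newest entry first; return a copy with the oldest first. */
    static <T> List<T> oldestFirst(List<T> newestFirst) {
        List<T> copy = new ArrayList<>(newestFirst);
        Collections.reverse(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Entry #5 of the dump: latency=+80ms processing=+217ms.
        System.out.println(shouldRecord(80, 217));
        // A quick message such as 1 ms + 1 ms stays below the threshold.
        System.out.println(shouldRecord(1, 1));
        System.out.println(oldestFirst(Arrays.asList("#0", "#1", "#2", "#3", "#4", "#5")));
    }
}
```

Reading the history oldest first makes it easier to see which slow handler delayed the ones queued behind it.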
