Analysis of a SurfaceView Leak in Android N

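I recently hit a bug where a SurfaceView's Layer was never destroyed, so it stayed on screen permanently. The case is interesting enough that I'm recording the analysis and the fix here, as a reference for ROM developers who already have some framework background.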


Symptom

The UI of 开心消消乐 (package com.happyelements.AndroidAnimal) stays on screen and cannot be dismissed by any means.


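Analysis

The most directly relevant module is SurfaceFlinger. Since the image is visible, the Layer should exist and take part in composition; if it does not, the problem lies in SurfaceFlinger itself. Dump its state with the following command: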

adb shell dumpsys SurfaceFlinger

Only the part related to this Layer is shown here:

+ Layer 0x71b57b0400 (SurfaceView - com.happyelements.AndroidAnimal/com.happyelements.hellolua.MainActivity)
  Region transparentRegion (this=0x71b57b0708, count=1)
    [  0,   0,   0,   0]
  Region visibleRegion (this=0x71b57b0410, count=1)
    [  0,   0, 1080, 1920]
  Region surfaceDamageRegion (this=0x71b57b0488, count=1)
    [  0,   0,   0,   0]
      layerStack=   0, z=    21015, pos=(0,0), size=(1080,1920), crop=(   0,   0,1080,1920), finalCrop=(   0,   0,  -1,  -1), isOpaque=1, invalidate=0, alpha=0xff, flags=0x00000002, tr=[1.00, 0.00][0.00, 1.00]
      FilterRender Layer= 0, FilterMode= 0 availableRect =(   0,   0,   0,   0)
      client=0x71b86f0f40
      format= 4, activeBuffer=[1080x1920:1088,  1], queued-frames=0, mRefreshPending=0
      mSecure=0, mProtectedByApp=0, mFiltering=0, mNeedsFiltering=0
            mTexName=54 mCurrentTexture=-1
            mCurrentCrop=[0,0,0,0] mCurrentTransform=0
            mAbandoned=0
            -BufferQueue mMaxAcquiredBufferCount=1, mMaxDequeuedBufferCount=3, mDequeueBufferCannotBlock=0 mAsyncMode=0, default-size=[1080x1920], default-format=4, transform-hint=00, FIFO(0)={}
             this=0x71b55e3000 (mConsumerName=SurfaceView - com.happyelements.AndroidAnimal/com.happyelements.hellolua.MainActivity, mConnectedApi=0, mConsumerUsageBits=0x900, mId=39, mPid=15358, producer=[-1:com.happyelements.AndroidAnimal], consumer=[15358:/system/bin/surfaceflinger])
             [00:0x0] state=FREE    
             [01:0x0] state=FREE    
             [02:0x0] state=FREE    
             [03:0x0] state=FREE    
                *BufferQueueDump mIsBackupBufInited=0, mAcquiredBufs(size=0), mMode=TRACK_CONSUMER
                 [-1] mLastAcquiredBuf->mGraphicBuffer->handle=0x71b7636900

This tells us:

  1. flags=0x00000002: the Layer is shown and opaque
  2. alpha=0xff: the alpha value is fully opaque
  3. visibleRegion is [ 0, 0, 1080, 1920]: there is a visible region, and it covers the whole screen
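
The reading of the flags field above can be sketched as a small bit test. This is only an illustration: the class and constant names are hypothetical, and the bit values (0x01 for hidden, 0x02 for opaque) are an assumption that should be checked against layer_state_t in your own source tree.

```java
// Hypothetical helper, not framework code. Bit values are assumed to match
// the native layer_state_t flags; verify them against your tree.
class LayerFlags {
    static final int LAYER_HIDDEN = 0x01; // assumed: set when the Layer is hidden
    static final int LAYER_OPAQUE = 0x02; // assumed: set when the Layer is opaque

    static boolean isShown(int flags) {
        return (flags & LAYER_HIDDEN) == 0;
    }

    static boolean isOpaque(int flags) {
        return (flags & LAYER_OPAQUE) != 0;
    }

    static String describe(int flags, int alpha) {
        return (isShown(flags) ? "shown" : "hidden")
                + (isOpaque(flags) ? ", opaque" : ", translucent")
                + ", alpha=" + alpha + "/255";
    }

    public static void main(String[] args) {
        // Values taken from the dump above.
        System.out.println(describe(0x00000002, 0xff));
    }
}
```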

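Together with the composition information in the dump, this shows that SurfaceFlinger's state is self-consistent and matches what we see on screen.

A few things do look odd, though, in that they differ from a Layer that is taking part in composition normally:

  1. All GraphicBuffer slots are FREE; normally at least one would be ACQUIRED
  2. mCurrentTexture=-1; normally it would be >=0
  3. mConnectedApi=0; normally it would be >0

Of course, a system that has already reached this buggy state cannot be judged entirely by normal rules, so let's leave SurfaceFlinger here for now.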


Next, turn to WMS (WindowManagerService) and dump its state with the following command:

adb shell dumpsys window

The only information related to this SurfaceView is:

WINDOW MANAGER SURFACES (dumpsys window surfaces)
  Surface #0: #75499c8 SurfaceView - com.happyelements.AndroidAnimal/com.happyelements.hellolua.MainActivity
    mLayerStack=0 mLayer=21015
    mShown=true mAlpha=1.0 mIsOpaque=false
    mPosition=0.0,0.0 mSize=1080x1920
    mCrop=[0,0][1080,1920]
    mFinalCrop=[0,0][0,0]
    Transform: (1.0, 0.0, 0.0, 1.0)

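This is not printed from the window stack. To keep this post from getting too long, here are the conclusions directly:

  1. The information comes from a static collection of SurfaceTrace objects
  2. SurfaceTrace is a subclass of SurfaceControl, and each SurfaceControl corresponds to one Layer on the SurfaceFlinger side
  3. Constructing a new SurfaceTrace instance adds an element to the static collection; destroying it removes that element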
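
The register-on-construct, unregister-on-destroy pattern described in these three points can be sketched as follows. This is a simplified stand-in, not the framework's SurfaceTrace: the class name TrackedSurface and its members are hypothetical and only show why an instance that is never destroyed keeps appearing in the dump.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for SurfaceTrace; NOT the framework class.
class TrackedSurface {
    // Static registry, playing the role of the collection that
    // "dumpsys window surfaces" prints.
    static final List<TrackedSurface> sSurfaces = new ArrayList<>();

    final String name;

    TrackedSurface(String name) {
        this.name = name;
        synchronized (sSurfaces) {
            sSurfaces.add(this); // registered when created
        }
    }

    void destroy() {
        synchronized (sSurfaces) {
            sSurfaces.remove(this); // unregistered only when destroyed
        }
    }

    // Names of every surface that was created but not yet destroyed.
    static List<String> dump() {
        List<String> names = new ArrayList<>();
        synchronized (sSurfaces) {
            for (TrackedSurface s : sSurfaces) {
                names.add(s.name);
            }
        }
        return names;
    }
}
```

If destroy() is never called for an instance, that instance stays in dump() forever, which is the situation the WMS dump shows.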

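A SurfaceTrace is still present in that static collection, which means it was created and never destroyed. That is the most direct cause of the bug and our initial point of entry. But this is the only thing WMS has to say: there is no matching state in the window stack or in any token, which is rather discouraging. With that state we might have spotted a clue; without it, reading the code cold to find the cause is like looking for a needle in a haystack.

There are no logs, only the live device, but a few more facts can still be established:

  • The ps command shows that the target process is dead (odd: how can the Layer outlive its process?)
  • Recall the odd details about the Layer noted above. After digging through the code, they turn out to be the result of a call to SurfaceControl.disconnect(). This API is new in Android N and is called only from the saved-Surface logic; saved Surfaces are an optimization added in Android N to make the UI respond faster. So we know the code went through a particular path at some point, which helps the analysis somewhat.

Without further clues the analysis would end here, and all that would remain is to dive "happily" into the sea of code in search of the bug and pray. Fortunately I was able to capture an hprof of system_server, and suddenly things looked a lot more hopeful.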


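Now for the hprof file. To keep the analysis short, I won't paste large amounts of data.

Start from the SurfaceControl that showed up in the WMS dump. According to the code, the instance can only be a SurfaceTrace or its subclass SurfaceControlWithBackground, and it turns out to be a SurfaceControlWithBackground. In N, a child window whose mSubLayer is less than 0 (that is, one positioned below its parent window) gets a SurfaceControlWithBackground by default when its SurfaceControl is created, and a SurfaceView is exactly such a window. Its GC root confirms that it is held in a static collection.

Following that thread leads to the corresponding WindowState, whose GC root is in WMS.mWindowMap; its parent WindowState is still there as well. At this point we can state an important conclusion:
What leaked is not only the SurfaceView window but also its parent window.
And a question that we will answer later:
Why is the parent window's Layer not visible on the SurfaceFlinger side?

First, though, a more pressing question: didn't we just say that WMS can no longer dump these windows?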

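To answer that, we first need to look at how WindowState objects are organized. A WindowState is stored in several places in the system, including:

  • WMS.mWindowMap: looks up a WindowState by IBinder key
  • DisplayContent.mWindows: a list of the WindowStates on a single display
  • WindowToken.windows or AppWindowToken.allAppWindows: a list of the WindowStates that belong to the token
  • WindowState.mChildWindows: a list of child windows

Note: every list above stores windows in an ArrayList, and a larger index means a higher Z-order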

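The answer: the dump is produced from DisplayContent.mWindows. Since there is no matching entry, the leaked WindowStates have already been removed from that list. Checking the other places listed above:

  • WMS.mWindowMap: present
  • DisplayContent.mWindows: absent
  • AppWindowToken.allAppWindows: absent
  • WindowState.mChildWindows: present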
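
The kind of cross-check done by hand on the hprof can be modeled in a few lines. This is a toy model under stated assumptions: WindowBookkeeping, Win, and findInconsistent are hypothetical names, and the three fields merely play the roles of WMS.mWindowMap, DisplayContent.mWindows, and AppWindowToken.allAppWindows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the WMS bookkeeping; none of these are framework classes.
class WindowBookkeeping {
    static class Win {
        final String name;
        final List<Win> children = new ArrayList<>();   // role of WindowState.mChildWindows
        Win(String name) { this.name = name; }
    }

    final Map<Object, Win> windowMap = new HashMap<>(); // role of WMS.mWindowMap
    final List<Win> displayWindows = new ArrayList<>(); // role of DisplayContent.mWindows
    final List<Win> tokenWindows = new ArrayList<>();   // role of AppWindowToken.allAppWindows

    // Sorted names of windows that are still in the map but are missing
    // from the display list or the token list.
    List<String> findInconsistent() {
        List<String> names = new ArrayList<>();
        for (Win w : windowMap.values()) {
            if (!displayWindows.contains(w) || !tokenWindows.contains(w)) {
                names.add(w.name);
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

In the leaked state, both the SurfaceView window and its parent would be reported by such a check, since they remain in the map while the two lists no longer contain them.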

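Normally, when a WindowState is removed, every container that holds it should drop its reference. Given the current inconsistency, we want to start from whichever container is easiest to investigate. Judging from the code, that is WMS.mWindowMap, because only one piece of code removes a WindowState from it: WMS.removeWindowInnerLocked():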

void removeWindowInnerLocked(WindowState win) {
    if (win.mRemoved) {
        // Nothing to do.
        if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM,
                "removeWindowInnerLocked: " + win + " Already removed...");
        return;
    }

    for (int i = win.mChildWindows.size() - 1; i >= 0; i--) {
        WindowState cwin = win.mChildWindows.get(i);
        Slog.w(TAG_WM, "Force-removing child win " + cwin + " from container " + win);
        removeWindowInnerLocked(cwin);
    }

    win.mRemoved = true;
    ...
    mPolicy.removeWindowLw(win);
    // Removes win from WindowState.mChildWindows
    win.removeLocked();

    if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM, "removeWindowInnerLocked: " + win);
    // Removes win from WMS.mWindowMap
    mWindowMap.remove(win.mClient.asBinder());
    ...
    final WindowToken token = win.mToken;
    final AppWindowToken atoken = win.mAppToken;
    if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM, "Removing " + win + " from " + token);
    // Removes win from WindowToken.windows
    token.windows.remove(win);
    if (atoken != null) {
        // Removes win from AppWindowToken.allAppWindows
        atoken.allAppWindows.remove(win);
    }
    ...
    final WindowList windows = win.getWindowList();
    if (windows != null) {
        // Removes win from DisplayContent.mWindows
        windows.remove(win);
    }
}

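In other words, this code was definitely never reached for the leaked WindowState, which is confirmed by WindowState.mRemoved being false. The only place that removes a window from WindowState.mChildWindows is WindowState.removeLocked():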

void removeLocked() {
    disposeInputChannel();

    if (isChildWindow()) {
        if (DEBUG_ADD_REMOVE) Slog.v(TAG, "Removing " + this + " from " + mAttachedWindow);
        // Removes this window from WindowState.mChildWindows
        mAttachedWindow.mChildWindows.remove(this);
    }
    mWinAnimator.destroyDeferredSurfaceLocked();
    mWinAnimator.destroySurfaceLocked();
    mSession.windowRemovedLocked();
    try {
        mClient.asBinder().unlinkToDeath(mDeathRecipient, 0);
    } catch (RuntimeException e) {
        // Ignore if it has already been removed (usually because
        // we are doing this as part of processing a death note.)
    }
}

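So WMS.removeWindowInnerLocked() looks like the place that does the final cleanup, because every container holding the WindowState is cleared from here. Since the containers are now inconsistent, some other code must also remove certain references. The suspects are DisplayContent.mWindows and AppWindowToken.allAppWindows.

Start with AppWindowToken.allAppWindows. After some digging through the code, I found AppWindowToken.removeAllWindows():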

void removeAllWindows() {
    ...
    // Clears AppWindowToken.allAppWindows
    allAppWindows.clear();
    // Clears WindowToken.windows
    windows.clear();
}

One of the call paths is:

WMS.removeAppToken()->AppWindowToken.removeAppFromTaskLocked()->AppWindowToken.removeAllWindows()

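Put simply, we know the target process is dead, so WMS.removeAppToken would have been called at least as part of death-notification handling. Reasoning backwards from the result, this part is explained.

What about DisplayContent.mWindows? The culprit there is WMS.rebuildAppWindowListLocked():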

private void rebuildAppWindowListLocked(final DisplayContent displayContent) {
    final WindowList windows = displayContent.getWindowList();
    int NW = windows.size();
    int i;
    int lastBelow = -1;
    int numRemoved = 0;

    if (mRebuildTmp.length < NW) {
        mRebuildTmp = new WindowState[NW+10];
    }

    // First remove all existing app windows.
    i=0;
    while (i < NW) {
        WindowState w = windows.get(i);
        if (w.mAppToken != null) {
            // Removed from DisplayContent.mWindows first; may be re-added later
            WindowState win = windows.remove(i);
            win.mRebuilding = true;
            mRebuildTmp[numRemoved] = win;
            mWindowsChanged = true;
            if (DEBUG_WINDOW_MOVEMENT) Slog.v(TAG_WM, "Rebuild removing window: " + win);
            NW--;
            numRemoved++;
            continue;
        } else if (lastBelow == i-1) {
            if (w.mAttrs.type == TYPE_WALLPAPER) {
                lastBelow = i;
            }
        }
        i++;
    }

    // Keep whatever windows were below the app windows still below,
    // by skipping them.
    lastBelow++;
    i = lastBelow;

    // First add all of the exiting app tokens...  these are no longer
    // in the main app list, but still have windows shown.  We put them
    // in the back because now that the animation is over we no longer
    // will care about them.
    final ArrayList<TaskStack> stacks = displayContent.getStacks();
    final int numStacks = stacks.size();
    for (int stackNdx = 0; stackNdx < numStacks; ++stackNdx) {
        AppTokenList exitingAppTokens = stacks.get(stackNdx).mExitingAppTokens;
        int NT = exitingAppTokens.size();
        for (int j = 0; j < NT; j++) {
            i = reAddAppWindowsLocked(displayContent, i, exitingAppTokens.get(j));
        }
    }

    // And add in the still active app tokens in Z order.
    for (int stackNdx = 0; stackNdx < numStacks; ++stackNdx) {
        final ArrayList<Task> tasks = stacks.get(stackNdx).getTasks();
        final int numTasks = tasks.size();
        for (int taskNdx = 0; taskNdx < numTasks; ++taskNdx) {
            final AppTokenList tokens = tasks.get(taskNdx).mAppTokens;
            final int numTokens = tokens.size();
            for (int tokenNdx = 0; tokenNdx < numTokens; ++tokenNdx) {
                final AppWindowToken wtoken = tokens.get(tokenNdx);
                if (wtoken.mIsExiting && !wtoken.waitingForReplacement()) {
                    continue;
                }
                i = reAddAppWindowsLocked(displayContent, i, wtoken);
            }
        }
    }

    i -= lastBelow;
    if (i != numRemoved) {
        displayContent.layoutNeeded = true;
        Slog.w(TAG_WM, "On display=" + displayContent.getDisplayId() + " Rebuild removed "
                + numRemoved + " windows but added " + i + " rebuildAppWindowListLocked() "
                + " callers=" + Debug.getCallers(10));
        for (i = 0; i < numRemoved; i++) {
            WindowState ws = mRebuildTmp[i];
            if (ws.mRebuilding) {
                StringWriter sw = new StringWriter();
                PrintWriter pw = new FastPrintWriter(sw, false, 1024);
                ws.dump(pw, "", true);
                pw.flush();
                Slog.w(TAG_WM, "This window was lost: " + ws);
                Slog.w(TAG_WM, sw.toString());
                ws.mWinAnimator.destroySurfaceLocked();
            }
        }
        Slog.w(TAG_WM, "Current app token list:");
        dumpAppTokensLocked();
        Slog.w(TAG_WM, "Final window list:");
        dumpWindowsLocked();
    }
    Arrays.fill(mRebuildTmp, null);
}

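The logic, briefly: first remove all app windows, then re-add them according to the latest ordering of the AppWindowTokens. For a window to be re-added, WindowToken.windows must not be empty; but as analyzed above it has already been cleared, so once the window is removed it cannot be added back. WindowState.mRebuilding being true confirms this. So this part is explained too, and it ties back to WindowToken.windows having been cleared.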
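
The remove-then-re-add logic, and how a cleared token list makes a window disappear from the display list, can be illustrated with a reduced model. The names RebuildModel, Token, Win, and rebuild are hypothetical; the sketch assumes, as in the framework code above, that a re-added window has its rebuilding flag reset while a lost window keeps it set.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Reduced model of rebuildAppWindowListLocked(); not framework code.
class RebuildModel {
    static class Token {
        final List<Win> windows = new ArrayList<>(); // role of WindowToken.windows
    }

    static class Win {
        final String name;
        final Token appToken; // null for non-app windows
        boolean rebuilding;   // role of WindowState.mRebuilding

        Win(String name, Token appToken) {
            this.name = name;
            this.appToken = appToken;
        }
    }

    // Removes every app window from the display list, re-adds the windows
    // still listed by a token, and returns the windows that were lost.
    static List<Win> rebuild(List<Win> display, List<Token> tokens) {
        List<Win> removed = new ArrayList<>();
        Iterator<Win> it = display.iterator();
        while (it.hasNext()) {
            Win w = it.next();
            if (w.appToken != null) {
                it.remove();
                w.rebuilding = true;
                removed.add(w);
            }
        }
        for (Token t : tokens) {
            for (Win w : t.windows) {
                display.add(w);
                w.rebuilding = false; // successfully re-added
            }
        }
        List<Win> lost = new ArrayList<>();
        for (Win w : removed) {
            if (w.rebuilding) {
                lost.add(w); // removed but never re-added
            }
        }
        return lost;
    }
}
```

A window whose token list was cleared beforehand ends up in the returned list with rebuilding still true, which corresponds to the mRebuilding value seen in the hprof.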

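So why was WMS.removeWindowInnerLocked(), the cleanup code, never reached? Back to death notifications. Every WindowState registers a death recipient, and when the process that owns the window dies, WMS.removeWindowLocked() is called; there is no doubt about that. It would normally go on to call WMS.removeWindowInnerLocked(), but it can return early before getting there. There is too much code to show in full, so only the parts that can return early are listed; the comments rule them out one by one: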

if (win.mHasSurface && okToDisplay()) {
    final AppWindowToken appToken = win.mAppToken;
    if (win.mWillReplaceWindow) { // mWillReplaceWindow is false
        // This window is going to be replaced. We need to keep it around until the new one
        // gets added, then we will get rid of this one.
        if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM, "Preserving " + win + " until the new one is "
                + "added");
        // TODO: We are overloading mAnimatingExit flag to prevent the window state from
        // been removed. We probably need another flag to indicate that window removal
        // should be deffered vs. overloading the flag that says we are playing an exit
        // animation.
        win.mAnimatingExit = true;
        win.mReplacingRemoveRequested = true;
        Binder.restoreCallingIdentity(origId);
        return;
    }
    // The only possibility is that this condition was met and the method returned here
    if (win.isAnimatingWithSavedSurface() && !appToken.allDrawnExcludingSaved) {
        // We started enter animation early with a saved surface, now the app asks to remove
        // this window. If we remove it now and the app is not yet drawn, we'll show a
        // flicker. Delay the removal now until it's really drawn.
        if (DEBUG_ADD_REMOVE) {
            Slog.d(TAG_WM, "removeWindowLocked: delay removal of " + win
                    + " due to early animation");
        }
        // Do not set mAnimatingExit to true here, it will cause the surface to be hidden
        // immediately after the enter animation is done. If the app is not yet drawn then
        // it will show up as a flicker.
        setupWindowForRemoveOnExit(win);
        Binder.restoreCallingIdentity(origId);
        return;
    }

    // If we are not currently running the exit animation, we need to see about starting one
    wasVisible = win.isWinVisibleLw();

    if (keepVisibleDeadWindow) { // definitely not entered
        if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM,
                "Not removing " + win + " because app died while it's visible");

        win.mAppDied = true;
        win.setDisplayLayoutNeeded();
        mWindowPlacerLocked.performSurfacePlacement();

        // Set up a replacement input channel since the app is now dead.
        // We need to catch tapping on the dead window to restart the app.
        win.openInputChannel(null);
        mInputMonitor.updateInputWindowsLw(true /*force*/);

        Binder.restoreCallingIdentity(origId);
        return;
    }

    final WindowStateAnimator winAnimator = win.mWinAnimator;
    if (wasVisible) {
        final int transit = (!startingWindow) ? TRANSIT_EXIT : TRANSIT_PREVIEW_DONE;

        // Try starting an animation.
        if (winAnimator.applyAnimationLocked(transit, false)) {
            win.mAnimatingExit = true;
        }
        //TODO (multidisplay): Magnification is supported only for the default display.
        if (mAccessibilityController != null
                && win.getDisplayId() == Display.DEFAULT_DISPLAY) {
            mAccessibilityController.onWindowTransitionLocked(win, transit);
        }
    }
    final boolean isAnimating =
            winAnimator.isAnimationSet() && !winAnimator.isDummyAnimation();
    final boolean lastWindowIsStartingWindow = startingWindow && appToken != null
            && appToken.allAppWindows.size() == 1;
    // We delay the removal of a window if it has a showing surface that can be used to run
    // exit animation and it is marked as exiting.
    // Also, If isn't the an animating starting window that is the last window in the app.
    // We allow the removal of the non-animating starting window now as there is no
    // additional window or animation that will trigger its removal.
    if (winAnimator.getShown() && win.mAnimatingExit
            && (!lastWindowIsStartingWindow || isAnimating)) { // mAnimatingExit is false
        // The exit animation is running or should run... wait for it!
        if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM,
                "Not removing " + win + " due to exit animation ");
        setupWindowForRemoveOnExit(win);
        if (appToken != null) {
            appToken.updateReportedVisibilityLocked();
        }
        Binder.restoreCallingIdentity(origId);
        return;
    }
}

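Only one possibility remains, and from the code it is related to saved Surfaces. Does that ring a bell? Yes: as noted above, the leaked Layer went through exactly this code. This path sets WindowState.mRemoveOnExit to true. If WindowState.mAnimatingExit were also true, the final removal would be carried out in WindowStateAnimator.finishExit(). But the hprof shows the former as true and the latter as false, so the window is never removed. In particular, since the target process is dead by now, no further calls arrive and the state stays stuck here. That is how the problem arises. The conclusion:
The target app had been started before and moved to the background. While it was being started again, the target process suddenly died, and at that moment neither the parent window nor the child window had finished redrawing (that is, neither had called WMS.finishDrawingWindow). That is when the problem occurs.

As you can imagine, this is very hard to hit in everyday use, so the bug shows up extremely rarely. After adjusting the code according to this conclusion so that the trigger conditions are met, it reproduces every time.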
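
The stuck state can be reduced to two flags. The sketch below is a deliberately simplified model of the behavior described above, not the real finishExit(): ExitModel and its members are hypothetical, and the only rule it encodes is that removal proceeds when both flags are set.

```java
// Simplified model of the two flags involved; not framework code.
class ExitModel {
    boolean removeOnExit;  // role of WindowState.mRemoveOnExit
    boolean animatingExit; // role of WindowState.mAnimatingExit
    boolean removed;       // role of WindowState.mRemoved

    // The saved-surface path: marks the window for removal on exit but
    // intentionally leaves animatingExit untouched.
    void deferRemovalForSavedSurface() {
        removeOnExit = true;
    }

    // Simplified exit handling: nothing happens unless an exit animation
    // was flagged, so removeOnExit alone never removes the window.
    void finishExit() {
        if (!animatingExit) {
            return;
        }
        if (removeOnExit) {
            removed = true;
        }
    }
}
```

With removeOnExit set and animatingExit clear, calling finishExit() any number of times leaves removed false, which matches the field values found in the hprof.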


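Now back to the question left open earlier:
Why is the parent window's Layer not visible on the SurfaceFlinger side?
The answer is that a Layer corresponds to a SurfaceControl, and a leaked WindowState does not imply a leaked SurfaceControl. In other words, the child window's SurfaceControl was not destroyed while the parent window's was. The SurfaceControl-related references of the two windows in the hprof look like this:

  • Parent window: mSurfaceController and mPendingDestroySurface are both null, so the surface has been destroyed
  • Child window: mSurfaceController is null but mPendingDestroySurface is not, so destruction was deferred

In fact, WindowStateAnimator.destroySurfaceLocked() was called for both: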

void destroySurfaceLocked() {
    ...
    if (mSurfaceDestroyDeferred) {
        // mSurfaceDestroyDeferred is true for the child window
        if (mSurfaceController != null && mPendingDestroySurface != mSurfaceController) {
            if (mPendingDestroySurface != null) {
                if (SHOW_TRANSACTIONS || SHOW_SURFACE_ALLOC) {
                    WindowManagerService.logSurface(mWin, "DESTROY PENDING", true);
                }
                mPendingDestroySurface.destroyInTransaction();
            }
            mPendingDestroySurface = mSurfaceController;
        }
    } else {
        if (SHOW_TRANSACTIONS || SHOW_SURFACE_ALLOC) {
            WindowManagerService.logSurface(mWin, "DESTROY", true);
        }
        destroySurface();
    }
    ...
}

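Now it is clear. The child window's SurfaceControl had its destruction deferred because WindowStateAnimator.mSurfaceDestroyDeferred was true. It was true because SurfaceView passes the RELAYOUT_DEFER_SURFACE_DESTROY flag when it does a relayout. Normally SurfaceView calls WMS.performDeferredDestroyWindow() shortly afterwards to destroy mPendingDestroySurface, but the process died before that, so the call never came.

If you are interested, look at SurfaceView.updateWindow(). In the normal case it makes the following sequence of calls:

WMS.relayoutWindow()->WMS.finishDrawingWindow()->WMS.performDeferredDestroyWindow()

If the process dies before WMS.finishDrawingWindow(), which matches our conclusion exactly, mPendingDestroySurface is never destroyed. The parent window did not leak its SurfaceControl simply because it was destroyed immediately.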
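
The difference between the parent and child windows can be reproduced with a small model of immediate versus deferred destruction. DeferredDestroyModel, its nested Surface class, and performDeferredDestroy are hypothetical names chosen for illustration; the branch structure follows the destroySurfaceLocked() excerpt above, and the assumption that the controller reference is cleared afterwards comes from the hprof observation that mSurfaceController is null for both windows.

```java
// Toy model of immediate vs. deferred surface destruction; not framework code.
class DeferredDestroyModel {
    static class Surface {
        boolean destroyed;
        void destroy() { destroyed = true; }
    }

    Surface surfaceController = new Surface(); // role of mSurfaceController
    Surface pendingDestroySurface;             // role of mPendingDestroySurface
    boolean surfaceDestroyDeferred;            // role of mSurfaceDestroyDeferred

    void destroySurfaceLocked() {
        if (surfaceController == null) {
            return;
        }
        if (surfaceDestroyDeferred) {
            // Park the surface; destroy whatever was parked before it.
            if (pendingDestroySurface != null && pendingDestroySurface != surfaceController) {
                pendingDestroySurface.destroy();
            }
            pendingDestroySurface = surfaceController;
        } else {
            surfaceController.destroy();
        }
        surfaceController = null; // assumed from the hprof: null in both cases
    }

    // Role of the follow-up call the client makes after it finishes drawing.
    void performDeferredDestroy() {
        if (pendingDestroySurface != null) {
            pendingDestroySurface.destroy();
            pendingDestroySurface = null;
        }
    }
}
```

If the client dies before the follow-up call, the parked surface is never destroyed, so its Layer stays alive on the SurfaceFlinger side.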

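The cause is now known, so how do we fix it? In fact, if WMS.removeWindowInnerLocked() had been called, nothing would have leaked. A framework must keep its state sane at all times, regardless of whether an app happens to die in some unusual situation.

So the question becomes what to do when we end up in the scenario above. The code is: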

if (win.isAnimatingWithSavedSurface() && !appToken.allDrawnExcludingSaved) {
    // We started enter animation early with a saved surface, now the app asks to remove
    // this window. If we remove it now and the app is not yet drawn, we'll show a
    // flicker. Delay the removal now until it's really drawn.
    if (DEBUG_ADD_REMOVE) {
        Slog.d(TAG_WM, "removeWindowLocked: delay removal of " + win
                + " due to early animation");
    }
    // Do not set mAnimatingExit to true here, it will cause the surface to be hidden
    // immediately after the enter animation is done. If the app is not yet drawn then
    // it will show up as a flicker.
    setupWindowForRemoveOnExit(win);
    Binder.restoreCallingIdentity(origId);
    return;
}

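The intent is: if the window is animating with a saved Surface and the app has not finished drawing, delay the removal and set mRemoveOnExit to true. The comment even stresses that mAnimatingExit must not be set to true, because that would hide the Surface immediately after the animation ends; all of this is supposedly to avoid a flicker. But nothing guarantees that mAnimatingExit will ever become true.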


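Solution

One option is to set mAnimatingExit to true at the same time, but it is quite possible that WindowStateAnimator.finishExit() never gets a chance to run, in which case that would not help.

The change I finally made was to comment out this block. Since the window is being destroyed anyway, why wait until drawing completes and the animation ends? Is there any guarantee of a later opportunity to remove it? Is this so-called optimization really worthwhile? After the change, the bug no longer reproduces.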
