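I recently ran into a bug in which a SurfaceView's Layer was never destroyed, so that Layer stayed on screen permanently. The case is interesting enough that I am recording the analysis and the fix here, as a reference for ROM developers with some framework background.
The symptom: the UI of the game 开心消消乐 (com.happyelements.AndroidAnimal) stayed on screen and could not be dismissed by any means.
The most directly related module is SurfaceFlinger. Since the image is visible, the Layer should still exist and be composited; if it is not, the problem lies there. Dump the state with: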
adb shell dumpsys SurfaceFlinger
Only the part relevant to this Layer is quoted here:
+ Layer 0x71b57b0400 (SurfaceView - com.happyelements.AndroidAnimal/com.happyelements.hellolua.MainActivity)
Region transparentRegion (this=0x71b57b0708, count=1)
[ 0, 0, 0, 0]
Region visibleRegion (this=0x71b57b0410, count=1)
[ 0, 0, 1080, 1920]
Region surfaceDamageRegion (this=0x71b57b0488, count=1)
[ 0, 0, 0, 0]
layerStack= 0, z= 21015, pos=(0,0), size=(1080,1920), crop=( 0, 0,1080,1920), finalCrop=( 0, 0, -1, -1), isOpaque=1, invalidate=0, alpha=0xff, flags=0x00000002, tr=[1.00, 0.00][0.00, 1.00]
FilterRender Layer= 0, FilterMode= 0 availableRect =( 0, 0, 0, 0)
client=0x71b86f0f40
format= 4, activeBuffer=[1080x1920:1088, 1], queued-frames=0, mRefreshPending=0
mSecure=0, mProtectedByApp=0, mFiltering=0, mNeedsFiltering=0
mTexName=54 mCurrentTexture=-1
mCurrentCrop=[0,0,0,0] mCurrentTransform=0
mAbandoned=0
-BufferQueue mMaxAcquiredBufferCount=1, mMaxDequeuedBufferCount=3, mDequeueBufferCannotBlock=0 mAsyncMode=0, default-size=[1080x1920], default-format=4, transform-hint=00, FIFO(0)={}
this=0x71b55e3000 (mConsumerName=SurfaceView - com.happyelements.AndroidAnimal/com.happyelements.hellolua.MainActivity, mConnectedApi=0, mConsumerUsageBits=0x900, mId=39, mPid=15358, producer=[-1:com.happyelements.AndroidAnimal], consumer=[15358:/system/bin/surfaceflinger])
[00:0x0] state=FREE
[01:0x0] state=FREE
[02:0x0] state=FREE
[03:0x0] state=FREE
*BufferQueueDump mIsBackupBufInited=0, mAcquiredBufs(size=0), mMode=TRACK_CONSUMER
[-1] mLastAcquiredBuf->mGraphicBuffer->handle=0x71b7636900
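This yields the following information:
Taken together with the composition information in the same dump, this shows that SurfaceFlinger's state is fine and matches what we see on screen.
There are also some odd details, odd in the sense that they differ from a Layer that is taking part in composition normally:
Of course, a system that has reached this buggy state cannot be judged entirely by normal standards, so let us leave SurfaceFlinger here for now.
Next, turn to WMS and dump its state with: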
adb shell dumpsys window
The only information related to this SurfaceView is:
WINDOW MANAGER SURFACES (dumpsys window surfaces)
Surface #0: #75499c8 SurfaceView - com.happyelements.AndroidAnimal/com.happyelements.hellolua.MainActivity
mLayerStack=0 mLayer=21015
mShown=true mAlpha=1.0 mIsOpaque=false
mPosition=0.0,0.0 mSize=1080x1920
mCrop=[0,0][1080,1920]
mFinalCrop=[0,0][0,0]
Transform: (1.0, 0.0, 0.0, 1.0)
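This is not output from the window stack. To keep this article from getting too long, here is the conclusion directly:
A SurfaceTrace is still present in that static collection, which means it was created but never destroyed. This is the most direct cause of the bug and our starting point. WMS shows only this one entry, with no corresponding window-stack or token state, which is rather disheartening; with those we might have found some clues, whereas digging straight into the code for the cause is like looking for a needle in a haystack.
There are no logs, only the live state of the device. The following can still be established:
Without further clues the analysis would end here, and what remained would be to dive "happily" into the sea of code hunting for the bug while praying for luck. Fortunately, an hprof of system_server could be captured, and suddenly life was full of hope again.
Now to the hprof file. To keep the analysis simple, I will not paste large amounts of data.
Start from the SurfaceControl that WMS dumped. According to the code, this instance can only be a SurfaceTrace or its subclass SurfaceControlWithBackground, and it turned out to be a SurfaceControlWithBackground. On Android N, a child window whose mSubLayer is less than 0 (that is, one placed below its parent window) gets a SurfaceControlWithBackground by default when its SurfaceControl is created, and a SurfaceView is exactly such a window. Checking its GC root confirms that it is held in a static array.
Following the trail leads to the corresponding WindowState, whose GC root is WMS.mWindowMap; its parent WindowState is still there as well. At this point an important conclusion can be drawn:
It is not only the SurfaceView window that leaked, but also its parent window.
And a question to be answered later:
Why is the parent window's Layer not visible on the SurfaceFlinger side?
First, a question that needs an immediate answer: didn't we say above that WMS could no longer dump these windows?
Answering it requires explaining how WindowState objects are organized. They are stored in several places in the system, including:
Note: all of the lists above store windows in an ArrayList, and a larger index means a higher layer.
The answer: the dumped information is read from DisplayContent.mWindows. Since nothing shows up, the leaked WindowState has already been removed from that list. Check whether it still exists in the other places listed above:
Under normal logic, once a WindowState is removed, every place that organizes it should drop its reference. Given the current state, we need to pick the easiest of these to investigate. Judging from the code, WMS.mWindowMap is the simplest, because only one piece of code removes a WindowState from it, namely WMS.removeWindowInnerLocked():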
void removeWindowInnerLocked(WindowState win) {
if (win.mRemoved) {
// Nothing to do.
if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM,
"removeWindowInnerLocked: " + win + " Already removed...");
return;
}
for (int i = win.mChildWindows.size() - 1; i >= 0; i--) {
WindowState cwin = win.mChildWindows.get(i);
Slog.w(TAG_WM, "Force-removing child win " + cwin + " from container " + win);
removeWindowInnerLocked(cwin);
}
win.mRemoved = true;
...
mPolicy.removeWindowLw(win);
// Removed from WindowState.mChildWindows
win.removeLocked();
if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM, "removeWindowInnerLocked: " + win);
// Removed from WMS.mWindowMap
mWindowMap.remove(win.mClient.asBinder());
...
final WindowToken token = win.mToken;
final AppWindowToken atoken = win.mAppToken;
if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM, "Removing " + win + " from " + token);
// Removed from WindowToken.windows
token.windows.remove(win);
if (atoken != null) {
// Removed from AppWindowToken.allAppWindows
atoken.allAppWindows.remove(win);
}
...
final WindowList windows = win.getWindowList();
if (windows != null) {
// Removed from DisplayContent.mWindows
windows.remove(win);
}
}
In other words, the leaked WindowState never reached this code, which is corroborated by WindowState.mRemoved being false. The only place that removes a window from WindowState.mChildWindows is WindowState.removeLocked():
void removeLocked() {
disposeInputChannel();
if (isChildWindow()) {
if (DEBUG_ADD_REMOVE) Slog.v(TAG, "Removing " + this + " from " + mAttachedWindow);
// Removed from WindowState.mChildWindows
mAttachedWindow.mChildWindows.remove(this);
}
mWinAnimator.destroyDeferredSurfaceLocked();
mWinAnimator.destroySurfaceLocked();
mSession.windowRemovedLocked();
try {
mClient.asBinder().unlinkToDeath(mDeathRecipient, 0);
} catch (RuntimeException e) {
// Ignore if it has already been removed (usually because
// we are doing this as part of processing a death note.)
}
}
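So WMS.removeWindowInnerLocked() looks like the place that does the final removal, since every container holding a WindowState listed above is cleaned up here. The inconsistency we now see means that some other code removes some of these references; the suspects are DisplayContent.mWindows and AppWindowToken.allAppWindows.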
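The inconsistency described here can be pictured with a small stand-alone model. This is only a sketch, not framework code: the class `WindowRefAudit`, its fields, and the string keys are hypothetical stand-ins for the containers named in the text, and the "partial cleanup" in `main` is an assumption chosen to mirror the state inferred from the hprof.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not framework code): one window referenced from several containers. */
class WindowRefAudit {
    // Hypothetical stand-ins for the containers named in the text.
    final Map<String, Object> windowMap = new HashMap<>();   // WMS.mWindowMap
    final List<Object> childWindows = new ArrayList<>();     // parent's WindowState.mChildWindows
    final List<Object> tokenWindows = new ArrayList<>();     // WindowToken.windows
    final List<Object> allAppWindows = new ArrayList<>();    // AppWindowToken.allAppWindows
    final List<Object> displayWindows = new ArrayList<>();   // DisplayContent.mWindows

    /** Registers a window in every container, as a healthy system would. */
    void add(String clientKey, Object win) {
        windowMap.put(clientKey, win);
        childWindows.add(win);
        tokenWindows.add(win);
        allAppWindows.add(win);
        displayWindows.add(win);
    }

    /** Reports, per container, whether it still references the window. */
    Map<String, Boolean> audit(Object win) {
        Map<String, Boolean> report = new LinkedHashMap<>();
        report.put("mWindowMap", windowMap.containsValue(win));
        report.put("mChildWindows", childWindows.contains(win));
        report.put("WindowToken.windows", tokenWindows.contains(win));
        report.put("allAppWindows", allAppWindows.contains(win));
        report.put("DisplayContent.mWindows", displayWindows.contains(win));
        return report;
    }

    /** Consistent means every container agrees: all hold the window, or none do. */
    static boolean isConsistent(Map<String, Boolean> report) {
        return !(report.containsValue(true) && report.containsValue(false));
    }

    public static void main(String[] args) {
        WindowRefAudit model = new WindowRefAudit();
        Object surfaceViewWindow = new Object();
        model.add("client-binder", surfaceViewWindow);

        // Assumed partial cleanup by code other than the final removal path.
        model.tokenWindows.clear();
        model.allAppWindows.clear();
        model.displayWindows.remove(surfaceViewWindow);

        System.out.println(model.audit(surfaceViewWindow));
        System.out.println("consistent: " + isConsistent(model.audit(surfaceViewWindow)));
    }
}
```

An audit of this kind is roughly what the hprof inspection does by hand: for each container, check whether the leaked object is still reachable from it.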
Start with AppWindowToken.allAppWindows. Searching the code turns up AppWindowToken.removeAllWindows():
void removeAllWindows() {
...
// Clears AppWindowToken.allAppWindows
allAppWindows.clear();
// Clears WindowToken.windows
windows.clear();
}
Part of the call path is:
WMS.removeAppToken()->AppWindowToken.removeAppFromTaskLocked()->AppWindowToken.removeAllWindows()
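Put simply, we know the target process has died, and WMS.removeAppToken is called at least as part of handling the death notification. Reasoning backwards from the observed result, this part is explained.
What about DisplayContent.mWindows? The problem lies in WMS.rebuildAppWindowListLocked():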
private void rebuildAppWindowListLocked(final DisplayContent displayContent) {
final WindowList windows = displayContent.getWindowList();
int NW = windows.size();
int i;
int lastBelow = -1;
int numRemoved = 0;
if (mRebuildTmp.length < NW) {
mRebuildTmp = new WindowState[NW+10];
}
// First remove all existing app windows.
i=0;
while (i < NW) {
WindowState w = windows.get(i);
if (w.mAppToken != null) {
// Removed from DisplayContent.mWindows first; may be re-added later
WindowState win = windows.remove(i);
win.mRebuilding = true;
mRebuildTmp[numRemoved] = win;
mWindowsChanged = true;
if (DEBUG_WINDOW_MOVEMENT) Slog.v(TAG_WM, "Rebuild removing window: " + win);
NW--;
numRemoved++;
continue;
} else if (lastBelow == i-1) {
if (w.mAttrs.type == TYPE_WALLPAPER) {
lastBelow = i;
}
}
i++;
}
// Keep whatever windows were below the app windows still below,
// by skipping them.
lastBelow++;
i = lastBelow;
// First add all of the exiting app tokens... these are no longer
// in the main app list, but still have windows shown. We put them
// in the back because now that the animation is over we no longer
// will care about them.
final ArrayList<TaskStack> stacks = displayContent.getStacks();
final int numStacks = stacks.size();
for (int stackNdx = 0; stackNdx < numStacks; ++stackNdx) {
AppTokenList exitingAppTokens = stacks.get(stackNdx).mExitingAppTokens;
int NT = exitingAppTokens.size();
for (int j = 0; j < NT; j++) {
i = reAddAppWindowsLocked(displayContent, i, exitingAppTokens.get(j));
}
}
// And add in the still active app tokens in Z order.
for (int stackNdx = 0; stackNdx < numStacks; ++stackNdx) {
final ArrayList<Task> tasks = stacks.get(stackNdx).getTasks();
final int numTasks = tasks.size();
for (int taskNdx = 0; taskNdx < numTasks; ++taskNdx) {
final AppTokenList tokens = tasks.get(taskNdx).mAppTokens;
final int numTokens = tokens.size();
for (int tokenNdx = 0; tokenNdx < numTokens; ++tokenNdx) {
final AppWindowToken wtoken = tokens.get(tokenNdx);
if (wtoken.mIsExiting && !wtoken.waitingForReplacement()) {
continue;
}
i = reAddAppWindowsLocked(displayContent, i, wtoken);
}
}
}
i -= lastBelow;
if (i != numRemoved) {
displayContent.layoutNeeded = true;
Slog.w(TAG_WM, "On display=" + displayContent.getDisplayId() + " Rebuild removed "
+ numRemoved + " windows but added " + i + " rebuildAppWindowListLocked() "
+ " callers=" + Debug.getCallers(10));
for (i = 0; i < numRemoved; i++) {
WindowState ws = mRebuildTmp[i];
if (ws.mRebuilding) {
StringWriter sw = new StringWriter();
PrintWriter pw = new FastPrintWriter(sw, false, 1024);
ws.dump(pw, "", true);
pw.flush();
Slog.w(TAG_WM, "This window was lost: " + ws);
Slog.w(TAG_WM, sw.toString());
ws.mWinAnimator.destroySurfaceLocked();
}
}
Slog.w(TAG_WM, "Current app token list:");
dumpAppTokensLocked();
Slog.w(TAG_WM, "Final window list:");
dumpWindowsLocked();
}
Arrays.fill(mRebuildTmp, null);
}
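The logic in short: remove all app windows first, then re-add them following the current ordering of the AppWindowTokens. A window can only be re-added if WindowToken.windows is non-empty, and according to the analysis above that list has already been cleared, so once removed the window can never be added back. WindowState.mRebuilding being true proves it. This part is now explained too, and it ties back to WindowToken.windows having been cleared.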
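The remove-then-re-add logic can be condensed into a toy model. This is a sketch under stated assumptions, not the framework implementation: `RebuildSketch`, `Win`, and `Token` are hypothetical names, Z-order handling (`lastBelow`, exiting tokens) is left out, and clearing the `rebuilding` flag on re-add is my assumption about how a re-added window is distinguished from a lost one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not framework code) of the remove-then-re-add rebuild. */
class RebuildSketch {
    static class Win {
        final String name;
        final boolean isAppWindow;
        boolean rebuilding;                            // models WindowState.mRebuilding
        Win(String name, boolean isAppWindow) {
            this.name = name;
            this.isAppWindow = isAppWindow;
        }
    }

    static class Token {
        final List<Win> windows = new ArrayList<>();   // models WindowToken.windows
    }

    /** Rebuilds displayWindows in place and returns the windows that were lost. */
    static List<Win> rebuild(List<Win> displayWindows, List<Token> tokens) {
        List<Win> removed = new ArrayList<>();
        // Step 1: pull every app window out of the display list.
        for (int i = 0; i < displayWindows.size(); ) {
            Win w = displayWindows.get(i);
            if (w.isAppWindow) {
                displayWindows.remove(i);
                w.rebuilding = true;
                removed.add(w);
            } else {
                i++;
            }
        }
        // Step 2: re-add only what the tokens still list (ordering simplified).
        for (Token t : tokens) {
            for (Win w : t.windows) {
                displayWindows.add(w);
                w.rebuilding = false;
            }
        }
        // Step 3: anything still flagged was removed but never re-added.
        List<Win> lost = new ArrayList<>();
        for (Win w : removed) {
            if (w.rebuilding) {
                lost.add(w);
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        Win wallpaper = new Win("wallpaper", false);
        Win app = new Win("SurfaceView window", true);
        List<Win> display = new ArrayList<>(Arrays.asList(wallpaper, app));
        Token token = new Token();                     // its window list was already cleared
        List<Token> tokens = new ArrayList<>();
        tokens.add(token);
        for (Win w : rebuild(display, tokens)) {
            System.out.println("This window was lost: " + w.name);
        }
    }
}
```

With the token's list empty, the app window drops out of the display list and stays flagged, which corresponds to the "This window was lost" branch in the listing above.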
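So why was the cleanup in WMS.removeWindowInnerLocked() never reached? Back to the death notification: every WindowState registers a death recipient, and after the window's process dies WMS.removeWindowLocked() is called. There is no doubt about that, and it subsequently calls WMS.removeWindowInnerLocked(), but it may return early before getting there. The code is long, so only the parts that can return early are listed; the comments rule them out one by one: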
if (win.mHasSurface && okToDisplay()) {
final AppWindowToken appToken = win.mAppToken;
if (win.mWillReplaceWindow) { // mWillReplaceWindow is false
// This window is going to be replaced. We need to keep it around until the new one
// gets added, then we will get rid of this one.
if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM, "Preserving " + win + " until the new one is "
+ "added");
// TODO: We are overloading mAnimatingExit flag to prevent the window state from
// been removed. We probably need another flag to indicate that window removal
// should be deffered vs. overloading the flag that says we are playing an exit
// animation.
win.mAnimatingExit = true;
win.mReplacingRemoveRequested = true;
Binder.restoreCallingIdentity(origId);
return;
}
// The only possibility is that this condition is met and we return here
if (win.isAnimatingWithSavedSurface() && !appToken.allDrawnExcludingSaved) {
// We started enter animation early with a saved surface, now the app asks to remove
// this window. If we remove it now and the app is not yet drawn, we'll show a
// flicker. Delay the removal now until it's really drawn.
if (DEBUG_ADD_REMOVE) {
Slog.d(TAG_WM, "removeWindowLocked: delay removal of " + win
+ " due to early animation");
}
// Do not set mAnimatingExit to true here, it will cause the surface to be hidden
// immediately after the enter animation is done. If the app is not yet drawn then
// it will show up as a flicker.
setupWindowForRemoveOnExit(win);
Binder.restoreCallingIdentity(origId);
return;
}
// If we are not currently running the exit animation, we need to see about starting one
wasVisible = win.isWinVisibleLw();
if (keepVisibleDeadWindow) { // Cannot be entered in this case
if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM,
"Not removing " + win + " because app died while it's visible");
win.mAppDied = true;
win.setDisplayLayoutNeeded();
mWindowPlacerLocked.performSurfacePlacement();
// Set up a replacement input channel since the app is now dead.
// We need to catch tapping on the dead window to restart the app.
win.openInputChannel(null);
mInputMonitor.updateInputWindowsLw(true /*force*/);
Binder.restoreCallingIdentity(origId);
return;
}
final WindowStateAnimator winAnimator = win.mWinAnimator;
if (wasVisible) {
final int transit = (!startingWindow) ? TRANSIT_EXIT : TRANSIT_PREVIEW_DONE;
// Try starting an animation.
if (winAnimator.applyAnimationLocked(transit, false)) {
win.mAnimatingExit = true;
}
//TODO (multidisplay): Magnification is supported only for the default display.
if (mAccessibilityController != null
&& win.getDisplayId() == Display.DEFAULT_DISPLAY) {
mAccessibilityController.onWindowTransitionLocked(win, transit);
}
}
final boolean isAnimating =
winAnimator.isAnimationSet() && !winAnimator.isDummyAnimation();
final boolean lastWindowIsStartingWindow = startingWindow && appToken != null
&& appToken.allAppWindows.size() == 1;
// We delay the removal of a window if it has a showing surface that can be used to run
// exit animation and it is marked as exiting.
// Also, If isn't the an animating starting window that is the last window in the app.
// We allow the removal of the non-animating starting window now as there is no
// additional window or animation that will trigger its removal.
if (winAnimator.getShown() && win.mAnimatingExit
&& (!lastWindowIsStartingWindow || isAnimating)) { // mAnimatingExit is false
// The exit animation is running or should run... wait for it!
if (DEBUG_ADD_REMOVE) Slog.v(TAG_WM,
"Not removing " + win + " due to exit animation ");
setupWindowForRemoveOnExit(win);
if (appToken != null) {
appToken.updateReportedVisibilityLocked();
}
Binder.restoreCallingIdentity(origId);
return;
}
}
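In the end only one possibility remains, and the code shows it is related to saved surfaces. Does that ring a bell? Right: as mentioned above, the leaked layer went through this code path. Here WindowState.mRemoveOnExit is set to true. If WindowState.mAnimatingExit were also true, the final removal would be performed in WindowStateAnimator.finishExit(); but the observed state is that the former is true and the latter false, so the window is never removed. In particular, because the target process is already dead, no further calls arrive and the state stays stuck here. That is how the problem occurs. The conclusion:
The target app had been started before and moved to the background. While it was being relaunched, its process died suddenly, at a moment when neither the parent window nor the child window had finished redrawing (that is, neither had called WMS.finishDrawingWindow). The problem then occurs.
As you can imagine, this situation is very hard to hit in everyday use, so the occurrence rate is extremely low. After adjusting the code according to this conclusion so that the reproduction conditions are met, the bug reproduced every time.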
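The stuck state can be reduced to two flags. The sketch below is a deliberately simplified model, not the real WMS code: `DelayedRemovalSketch` and its methods are hypothetical, and `finishExit` here only encodes the rule stated in this article (final removal requires both mRemoveOnExit and mAnimatingExit to be true).

```java
/** Toy model (not framework code) of the delayed-removal flags. */
class DelayedRemovalSketch {
    static class Win {
        boolean removeOnExit;    // models WindowState.mRemoveOnExit
        boolean animatingExit;   // models WindowState.mAnimatingExit
        boolean removed;         // models WindowState.mRemoved
    }

    /** Models the one early-return branch that the analysis could not rule out. */
    static void removeWindow(Win w, boolean animatingWithSavedSurface,
                             boolean allDrawnExcludingSaved) {
        if (animatingWithSavedSurface && !allDrawnExcludingSaved) {
            // Removal is delayed; animatingExit is deliberately left false.
            w.removeOnExit = true;
            return;
        }
        w.removed = true;        // stands in for removeWindowInnerLocked()
    }

    /** Per the article: the final removal needs both flags to be true. */
    static void finishExit(Win w) {
        if (w.removeOnExit && w.animatingExit) {
            w.removed = true;
        }
    }

    public static void main(String[] args) {
        Win leaked = new Win();
        // Process died while animating with a saved surface and before redraw.
        removeWindow(leaked, true, false);
        finishExit(leaked);      // even if this were called, nothing happens
        System.out.println("removed = " + leaked.removed
                + ", removeOnExit = " + leaked.removeOnExit
                + ", animatingExit = " + leaked.animatingExit);
    }
}
```

In this model the only way out of the delayed state is a later call that sets `animatingExit` or removes the window directly, and a dead process will never make that call.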
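Now back to the question left open earlier:
Why is the parent window's Layer not visible on the SurfaceFlinger side?
The answer is that a Layer corresponds to a SurfaceControl, and a leaked WindowState does not imply a leaked SurfaceControl. In other words, the child window's SurfaceControl was not destroyed while the parent's was. The SurfaceControl-related references of the two windows in the hprof are as follows:
In fact both went through WindowStateAnimator.destroySurfaceLocked():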
void destroySurfaceLocked() {
...
if (mSurfaceDestroyDeferred) {
// mSurfaceDestroyDeferred is true for the child window
if (mSurfaceController != null && mPendingDestroySurface != mSurfaceController) {
if (mPendingDestroySurface != null) {
if (SHOW_TRANSACTIONS || SHOW_SURFACE_ALLOC) {
WindowManagerService.logSurface(mWin, "DESTROY PENDING", true);
}
mPendingDestroySurface.destroyInTransaction();
}
mPendingDestroySurface = mSurfaceController;
}
} else {
if (SHOW_TRANSACTIONS || SHOW_SURFACE_ALLOC) {
WindowManagerService.logSurface(mWin, "DESTROY", true);
}
destroySurface();
}
...
}
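Now it is clear. The child window's SurfaceControl had its destruction deferred because WindowStateAnimator.mSurfaceDestroyDeferred was true. It was true because SurfaceView passes the RELAYOUT_DEFER_SURFACE_DESTROY flag when it relayouts. Normally SurfaceView would shortly afterwards call WMS.performDeferredDestroyWindow() to destroy mPendingDestroySurface, but the process died before that, so the call never came.
Those interested can look at SurfaceView.updateWindow(). Normally the call sequence is: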
WMS.relayoutWindow()->WMS.finishDrawingWindow()->WMS.performDeferredDestroyWindow()
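If the process dies before WMS.finishDrawingWindow(), this matches our conclusion exactly: mPendingDestroySurface is never destroyed. The parent window did not leak its SurfaceControl precisely because its surface was destroyed immediately.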
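The difference between the two windows can be shown with a small model of immediate versus deferred destruction. This is a sketch only: `DeferredDestroySketch`, `Surface`, and `Animator` are hypothetical names, the branch structure follows the destroySurfaceLocked() excerpt above, and clearing `surface` at the end stands in for the elided parts of that method, which is an assumption.

```java
/** Toy model (not framework code) of immediate vs. deferred surface destruction. */
class DeferredDestroySketch {
    static class Surface {
        final String name;
        boolean destroyed;
        Surface(String name) { this.name = name; }
    }

    static class Animator {
        boolean surfaceDestroyDeferred;   // models mSurfaceDestroyDeferred
        Surface surface;                  // models mSurfaceController
        Surface pendingDestroySurface;    // models mPendingDestroySurface

        Animator(Surface surface, boolean deferred) {
            this.surface = surface;
            this.surfaceDestroyDeferred = deferred;
        }

        /** Follows the branch structure of the destroySurfaceLocked() excerpt. */
        void destroySurface() {
            if (surface == null) {
                return;
            }
            if (surfaceDestroyDeferred) {
                if (pendingDestroySurface != surface) {
                    if (pendingDestroySurface != null) {
                        pendingDestroySurface.destroyed = true;
                    }
                    pendingDestroySurface = surface;   // parked, not destroyed
                }
            } else {
                surface.destroyed = true;
            }
            surface = null;                            // assumed cleanup of the elided code
        }

        /** Models the follow-up call the client normally makes after drawing. */
        void performDeferredDestroy() {
            if (pendingDestroySurface != null) {
                pendingDestroySurface.destroyed = true;
                pendingDestroySurface = null;
            }
            surfaceDestroyDeferred = false;
        }
    }

    public static void main(String[] args) {
        Surface parentSurface = new Surface("parent");
        Surface childSurface = new Surface("SurfaceView");
        new Animator(parentSurface, false).destroySurface();
        new Animator(childSurface, true).destroySurface();
        // The process dies here, so performDeferredDestroy() is never called.
        System.out.println(parentSurface.name + " destroyed: " + parentSurface.destroyed);
        System.out.println(childSurface.name + " destroyed: " + childSurface.destroyed);
    }
}
```

The parked surface is released only by the follow-up call, so a client that dies between the two steps leaves it alive, which is the Layer still seen in SurfaceFlinger.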
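The cause is found, so how to fix it? In fact, had WMS.removeWindowInnerLocked() been called, nothing would have leaked. A framework must keep its state sane at all times, regardless of whether an app dies in some special situation.
So the question comes back to what to do when we land in the scenario above. The code is: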
if (win.isAnimatingWithSavedSurface() && !appToken.allDrawnExcludingSaved) {
// We started enter animation early with a saved surface, now the app asks to remove
// this window. If we remove it now and the app is not yet drawn, we'll show a
// flicker. Delay the removal now until it's really drawn.
if (DEBUG_ADD_REMOVE) {
Slog.d(TAG_WM, "removeWindowLocked: delay removal of " + win
+ " due to early animation");
}
// Do not set mAnimatingExit to true here, it will cause the surface to be hidden
// immediately after the enter animation is done. If the app is not yet drawn then
// it will show up as a flicker.
setupWindowForRemoveOnExit(win);
Binder.restoreCallingIdentity(origId);
return;
}
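It means: if an animation is running on a saved Surface and the app has not finished drawing, delay the removal of the window and set mRemoveOnExit to true. The comment even stresses that mAnimatingExit must not be set to true, because that would hide the Surface right after the animation ends, all supposedly to avoid a flicker. But nothing guarantees that mAnimatingExit ever becomes true afterwards.
One possible change is to set mAnimatingExit to true here as well, but WindowStateAnimator.finishExit() may well never get a chance to be called, so that would not help either.
My final change was to comment out this block. If the window is to be destroyed now, why wait for drawing to complete and the animation to end? Is there guaranteed to be another chance to remove it later? Does this so-called optimization really make sense? After the change, the bug no longer reproduced.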