UE4 4.27: Modifying the Mobile Forward Pipeline to Support Clustered Multi-Light Culling

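Preface

I had some free time recently and wanted to get more familiar with UE4's graphics module, so I went looking for a feature to implement. In UE4 4.27 the Mobile Forward pipeline does not support many lights: by default it handles at most 4 point lights. That gave me the idea of reworking UE's Mobile Forward pipeline to support culling of many lights.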

Analyzing how UE Mobile Forward implements multiple point lights

If you look at MobileBasePassPixelShader.usf, you will find that UE implements multiple lights "by hand", with a hard-coded maximum of 4 point lights, as shown below.

[Figure 1]

[Figure 2]

So to support many lights, this part has to be replaced.

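Thinking about a new approach for UE Mobile Forward

First of all, this article does not address GPU bandwidth or overdraw, because no PreZ pass is used; it covers light culling only.

Brute-force for loop

The most direct approach is to loop over all lights in the shader and accumulate their contribution. With a few hundred lights, the per-pixel loop becomes far too slow, so this option is ruled out immediately.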

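Tiled Based Cull

Tile-based point light culling, as the name suggests, divides the screen into N tiles and uses a compute shader to compute, in parallel, the list of point light indices stored for each tile. During pixel shading, the tile that contains the pixel is derived from the pixel position, that tile's light list is fetched, and lighting is evaluated. Culling lights this way is much more efficient than brute-force iteration over every light.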

[Figure 3]

[Figure 4]

[Figure 5]

[Figure 6]
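
To make the per-pixel tile lookup concrete, here is a minimal C++ sketch of mapping a pixel coordinate to a flat tile index. The helper name `PixelToTileIndex` and the tile size in pixels are assumptions made for illustration; they do not come from the engine or from the original code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: maps a pixel coordinate to a flat (row-major) tile index.
// TileSizePx is an assumed tile edge length in pixels, for example 16.
inline uint32_t PixelToTileIndex(uint32_t PixelX, uint32_t PixelY,
                                 uint32_t ScreenWidth, uint32_t TileSizePx)
{
    assert(TileSizePx > 0 && ScreenWidth > 0);
    // Tiles per row, rounded up so a partial tile at the right edge still counts.
    const uint32_t TilesPerRow = (ScreenWidth + TileSizePx - 1) / TileSizePx;
    const uint32_t TileX = PixelX / TileSizePx;
    const uint32_t TileY = PixelY / TileSizePx;
    return TileX + TileY * TilesPerRow;
}
```

The returned index would then be used to look up that tile's light list during shading.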

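Tiled Based Cull culls lights quite well, but its efficiency partly depends on PreZ: besides the frustum cull, it also relies on a depth-range (DepthRange) culling optimization. As mentioned above, there is no PreZ here, so the DepthRange optimization is unavailable and the efficiency drops further, as shown below:

[Figure 7]

Of course, even with PreZ, the DepthRange optimization is not reliable. Suppose a tile contains one small object at the far end of its frustum and another object at the near end. The DepthBounds then become meaningless, and many pixels in the tile evaluate lights that cannot affect them. This phenomenon is known as "Depth Discontinuity".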

[Figure 8]

[Figure 9]
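
The depth discontinuity problem can be illustrated with a small CPU-side sketch. All names here (`FDepthRange`, `ComputeTileDepthBounds`, and so on) are hypothetical, and depth is reduced to a single scalar per pixel for simplicity. The point is that a light sitting in the empty gap between a near object and a far object still passes the tile's depth-bounds test.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct FDepthRange { float Min; float Max; };

// Min/max view depth over all pixels of one tile (what a PreZ pass would provide).
inline FDepthRange ComputeTileDepthBounds(const std::vector<float>& PixelDepths)
{
    assert(!PixelDepths.empty());
    const auto MinMax = std::minmax_element(PixelDepths.begin(), PixelDepths.end());
    return { *MinMax.first, *MinMax.second };
}

// A light passes the depth-bounds test if its depth interval overlaps the tile's.
inline bool LightOverlapsDepthBounds(float LightDepth, float LightRadius, FDepthRange Bounds)
{
    return LightDepth + LightRadius >= Bounds.Min && LightDepth - LightRadius <= Bounds.Max;
}

// True only if at least one pixel really lies inside the light's depth interval.
inline bool LightTouchesAnyPixel(float LightDepth, float LightRadius,
                                 const std::vector<float>& PixelDepths)
{
    return std::any_of(PixelDepths.begin(), PixelDepths.end(), [=](float Depth) {
        return Depth >= LightDepth - LightRadius && Depth <= LightDepth + LightRadius;
    });
}
```

With pixel depths clustered around 100 and 5000, a light at depth 2500 with radius 200 overlaps the tile bounds yet touches no pixel, so every pixel in the tile pays for it.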

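For these reasons, the Tiled Based Cull approach is set aside for now.

Cluster Based Cull

Cluster-based point light culling differs from Tiled Based Cull, which subdivides screen space only: it builds on Tiled Based Cull by adding the Z dimension and subdividing the whole view frustum into clusters, then uses a compute shader to compute the light index list of each cluster. It culls effectively even without a PreZ pass.

[Figure 10]

There are three steps:

(1) At game start, pre-allocate the clusters in camera space, each represented by an AABB. (This does not need to be recomputed every frame.)

(2) Iterate over all clusters and compute the set of lights that affects each cluster.

(3) Shade: from the pixel's ScreenPos.xy and ViewZ, determine which cluster the pixel falls in, then fetch that cluster's light set and evaluate lighting.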

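UE4 implementation

Reworking the UE4 MobileForward rendering pipeline

[Figure 11]

Implementation steps

MobileBuildCluster

At game start, pre-allocate the clusters in camera space, each represented by an AABB. This stage does not need to run every frame, but the clusters have to be rebuilt whenever something like the size of the view frustum changes.

The X and Y subdivision is the same as in tiled shading, but there are several possible schemes for subdividing Z.

[Figure 12]

[Figure 13]

The Doom (id Software) scheme

The clusters here are subdivided using the DOOM scheme, with a grid of 16 x 12 x 24 = 4608 clusters.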
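
As a rough illustration of the exponential depth distribution that the build shader uses for its Z slices, the following C++ sketch computes the near boundary of slice k as Near * (Far / Near)^(k / SliceCount). The function names are hypothetical, and the near and far values in the usage note are made-up numbers, not values from the project.

```cpp
#include <cassert>
#include <cmath>

// Near boundary of depth slice SliceIndex under the exponential distribution:
// Near * (Far / Near)^(SliceIndex / SliceCount).
// The far boundary of slice k is simply the near boundary of slice k + 1.
inline float SliceNearDepth(float NearPlane, float FarPlane, int SliceIndex, int SliceCount)
{
    assert(NearPlane > 0.0f && FarPlane > NearPlane && SliceCount > 0);
    const float Exponent = static_cast<float>(SliceIndex) / static_cast<float>(SliceCount);
    return NearPlane * std::pow(FarPlane / NearPlane, Exponent);
}

// Total number of clusters for a given grid resolution.
inline int TotalClusterCount(int SizeX, int SizeY, int SizeZ)
{
    assert(SizeX > 0 && SizeY > 0 && SizeZ > 0);
    return SizeX * SizeY * SizeZ;
}
```

For example, with an assumed near plane of 10 and far plane of 10000 split into 24 slices, slice 12 starts at roughly 316, so half of the slices cover only about the first 3% of the depth range. This keeps clusters close to the camera thin, where most of the detail is.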

class FMobileBuildClusterCS : public FGlobalShader
{
	DECLARE_SHADER_TYPE(FMobileBuildClusterCS, Global);
	SHADER_USE_PARAMETER_STRUCT(FMobileBuildClusterCS, FGlobalShader);

	BEGIN_SHADER_PARAMETER_STRUCT(FParameters, )
		SHADER_PARAMETER_STRUCT_REF(FViewUniformShaderParameters, View)
		SHADER_PARAMETER(float, FarPlane)
		SHADER_PARAMETER(float, NearPlane)
		SHADER_PARAMETER(FVector4, TileSizes)
		SHADER_PARAMETER_UAV(RWBuffer<float4>, ClusterList)
	END_SHADER_PARAMETER_STRUCT()

public:

	static bool ShouldCompilePermutation(const FShaderPermutationParameters& Parameters)
	{
		return IsFeatureLevelSupported(Parameters.Platform, ERHIFeatureLevel::ES3_1);
	}
};

 MobileLightClusterShader.usf

#include "/Engine/Private/Common.ush"

float FarPlane;     //LightCullFarPlane
float NearPlane;    //LightCullNearPlane
float4 TileSizes;

RWBuffer<float4> ClusterList;

float3 LineIntersectionToZPlane(float3 a, float3 b, float z)
{
	float3 normal = float3(0.0, 0.0, 1.0);  // Normal of a constant-depth plane in view space (Z is depth here)
	float3 ab = b - a;
	float t = (z - dot(normal, a)) / dot(normal, ab);
	float3 result = a + t * ab;
	return result;
}

[numthreads(1, 1, 1)]
void MainCS(
    uint3 GroupId : SV_GroupID,
	uint3 GroupThreadId : SV_GroupThreadID,
	uint GroupIndex : SV_GroupIndex,
	uint3 DispatchThreadId : SV_DispatchThreadID)
{
    const float3 EyePos = float3(0.0, 0.0, 0.0);
    uint TileIndex = GroupId.x + GroupId.y * (uint)TileSizes.x + GroupId.z * (uint)TileSizes.x * (uint)TileSizes.y;
	float Px = 1.0 / (float)TileSizes.x;
	float Py = 1.0 / (float)TileSizes.y;

    //Calculate the tile's min and max corner points on the far plane (using the near plane here produced an error: the result was always zero)
	float2 MaxPointViewportUV = float2(GroupId.x + 1, GroupId.y + 1) * float2(Px, Py);
	float2 MinPointViewportUV = float2(GroupId.xy) * float2(Px, Py);
    float3 MaxPointViewPos = ScreenToViewPos(MaxPointViewportUV, FarPlane);
    float3 MinPointViewPos = ScreenToViewPos(MinPointViewportUV, FarPlane);

    //Near and far values of the cluster in view space, the split cluster method from siggraph 2016 idtech6
	float TileNear = NearPlane * pow(FarPlane / NearPlane, GroupId.z / TileSizes.z);
	float TileFar = NearPlane * pow(FarPlane / NearPlane, (GroupId.z + 1) / TileSizes.z);

    //find cluster min/max 4 point in view space
    float3 MinPointNear = LineIntersectionToZPlane(EyePos, MinPointViewPos, TileNear);
	float3 MinPointFar = LineIntersectionToZPlane(EyePos, MinPointViewPos, TileFar);
	float3 MaxPointNear = LineIntersectionToZPlane(EyePos, MaxPointViewPos, TileNear);
	float3 MaxPointFar = LineIntersectionToZPlane(EyePos, MaxPointViewPos, TileFar);

    float3 MinPointAABB = min(min(MinPointNear, MinPointFar), min(MaxPointNear, MaxPointFar));
	float3 MaxPointAABB = max(max(MinPointNear, MinPointFar), max(MaxPointNear, MaxPointFar));

	ClusterList[2 * TileIndex] = float4(MinPointAABB.x, MinPointAABB.y, MinPointAABB.z, 1.0);
	ClusterList[2 * TileIndex + 1] = float4(MaxPointAABB.x, MaxPointAABB.y, MaxPointAABB.z, 1.0);
}

void FMobileSceneRenderer::AddBuildLightPass(FRDGBuilder& GraphBuilder, const FViewInfo* View, FRWBuffer ClusterListBuffer)
{
	FMobileBuildClusterCS::FParameters* BuildClusterParameters = GraphBuilder.AllocParameters<FMobileBuildClusterCS::FParameters>();
	BuildClusterParameters->ClusterList = ClusterListBuffer.UAV;
	FIntPoint ViewPortSize = View->ViewRect.Size();
	BuildClusterParameters->TileSizes = FVector4(MobileClusterSizeX, MobileClusterSizeY, MobileClusterSizeZ, (float)ViewPortSize.X / (float)MobileClusterSizeX);
	BuildClusterParameters->NearPlane = View->NearClippingDistance;
	BuildClusterParameters->FarPlane = GPointLightFarClippingPlane;
	BuildClusterParameters->View = View->ViewUniformBuffer;
	TShaderMapRef<FMobileBuildClusterCS> ComputeShader(View->ShaderMap);

	FComputeShaderUtils::AddPass(
		GraphBuilder,
		RDG_EVENT_NAME("MobileBuildLightClusterPass"),
		ComputeShader,
		BuildClusterParameters,
		FIntVector(MobileClusterSizeX, MobileClusterSizeY, MobileClusterSizeZ));
}

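MobileCullLight

Iterate over all clusters, compute the set of lights that affects each one, and record the result in a global index offset table and a global light index list. To keep the data structures simple, both are one-dimensional arrays. Each entry of the global index offset table stores the number of lights in a cluster and the offset of that cluster's lights within the global light index list.

[Figure 14]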
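
The two flat arrays described above can be sketched on the CPU as follows. This is only an illustration of the layout: the type `FLightGridCPU` and both functions are hypothetical names, and the sequential loop stands in for the GPU version, where each cluster reserves its range with an atomic add and the ranges therefore land in no particular order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// LightGrid holds two entries per cluster (offset, count);
// GlobalLightIndexList holds the light indices of all clusters back to back.
struct FLightGridCPU
{
    std::vector<uint32_t> LightGrid;
    std::vector<uint32_t> GlobalLightIndexList;
};

// PerClusterLights[c] lists the indices of the lights that affect cluster c.
inline FLightGridCPU BuildLightGrid(const std::vector<std::vector<uint32_t>>& PerClusterLights)
{
    FLightGridCPU Result;
    Result.LightGrid.reserve(PerClusterLights.size() * 2);
    for (const std::vector<uint32_t>& Lights : PerClusterLights)
    {
        // The cluster's range starts at the current end of the global list.
        const uint32_t Offset = static_cast<uint32_t>(Result.GlobalLightIndexList.size());
        Result.LightGrid.push_back(Offset);
        Result.LightGrid.push_back(static_cast<uint32_t>(Lights.size()));
        Result.GlobalLightIndexList.insert(Result.GlobalLightIndexList.end(),
                                           Lights.begin(), Lights.end());
    }
    return Result;
}

// Reads the i-th light index of a cluster, the same way a pixel shader lookup would.
inline uint32_t GetClusterLight(const FLightGridCPU& Grid, uint32_t ClusterIndex, uint32_t i)
{
    const uint32_t Offset = Grid.LightGrid[2 * ClusterIndex];
    const uint32_t Count = Grid.LightGrid[2 * ClusterIndex + 1];
    assert(i < Count);
    return Grid.GlobalLightIndexList[Offset + i];
}
```

Storing only an offset and a count per cluster keeps the per-cluster record at a fixed size, while the variable-length light lists share one tightly packed buffer.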

class FMobileLightCullCS : public FGlobalShader
{
	DECLARE_SHADER_TYPE(FMobileLightCullCS, Global);
	SHADER_USE_PARAMETER_STRUCT(FMobileLightCullCS, FGlobalShader);

	BEGIN_SHADER_PARAMETER_STRUCT(FParameters, )
		SHADER_PARAMETER_STRUCT_REF(FViewUniformShaderParameters, View)
		SHADER_PARAMETER_STRUCT_REF(FMobileLocalLightData, LocalLightData)
		SHADER_PARAMETER_RDG_BUFFER_UAV(RWStructuredBuffer<uint>, GlobalIndexCount)
		SHADER_PARAMETER_UAV(RWBuffer<uint>, LightGridList)
		SHADER_PARAMETER_UAV(RWBuffer<uint>, GlobalLightIndexList)
		SHADER_PARAMETER_SRV(StrongTypedBuffer<float4>, ClusterList)
		SHADER_PARAMETER_SRV(StrongTypedBuffer<float4>, LightViewSpacePositionAndRadius)
	END_SHADER_PARAMETER_STRUCT()

public:

	static bool ShouldCompilePermutation(const FShaderPermutationParameters& Parameters)
	{
		return IsFeatureLevelSupported(Parameters.Platform, ERHIFeatureLevel::ES3_1);
	}
};

 MobileLightCullShader.usf

#include "/Engine/Private/Common.ush"
#define THREAD_GROUD_X 16
#define THREAD_GROUD_Y 12
#define THREAD_GROUD_Z 2

// Parenthesized so the macro expands safely inside larger expressions
#define GROUD_THREAD_TOTAL_NUM (THREAD_GROUD_X * THREAD_GROUD_Y * THREAD_GROUD_Z)

Buffer<float4> LightViewSpacePositionAndRadius;
Buffer<float4> ClusterList;
RWBuffer<uint> LightGridList;
RWBuffer<uint> GlobalLightIndexList;
RWStructuredBuffer<uint> GlobalIndexCount;


float GetSqdisPointAABB(float3 SphereViewPos, uint CluterIndex)
{
	float SqDistance = 0.0;
	float4 MinPoint = ClusterList[2 * CluterIndex];
	float4 MaxPoint = ClusterList[2 * CluterIndex + 1];

	for (int Index = 0; Index < 3; Index++)
	{
		float V = SphereViewPos[Index];

		if (V < MinPoint[Index])
		{
			float Diff = MinPoint[Index] - V;
			SqDistance += Diff * Diff;
		}

		if (V > MaxPoint[Index])
		{
			float Diff = V - MaxPoint[Index];
			SqDistance += Diff * Diff;
		}
	}

	return SqDistance;
}

bool TestSphereAABB(float3 LightViewPos, float LightRadius, uint CluterIndex)
{
	float SqDistance = GetSqdisPointAABB(LightViewPos, CluterIndex);
	return SqDistance <= (LightRadius * LightRadius);
}

[numthreads(THREAD_GROUD_X, THREAD_GROUD_Y, THREAD_GROUD_Z)]
void MainCS(
    uint3 GroupId : SV_GroupID,
	uint3 GroupThreadId : SV_GroupThreadID,
	uint GroupIndex : SV_GroupIndex,
	uint3 DispatchThreadId : SV_DispatchThreadID)
{
	const uint ThreadCount = GROUD_THREAD_TOTAL_NUM;
	uint LightCountInt = (uint)MobileLocalLightData.NumLocalLights;
	uint PassCount = (LightCountInt + ThreadCount - 1) / ThreadCount;
	uint ClusterIndex = GroupIndex + ThreadCount * GroupId.z;
	uint VisibleLightCount = 0;

	//one cluster max light num <= GROUD_THREAD_TOTAL_NUM
	uint VisibleLightIndexs[GROUD_THREAD_TOTAL_NUM];

	//TODO: directly loop for all points
	for (uint PassIndex = 0; PassIndex < PassCount; ++PassIndex)
	{
		for (uint Light = 0; Light < ThreadCount; ++Light)
		{
			uint LightRealIndex = Light + PassIndex * ThreadCount;

			if (LightRealIndex < LightCountInt)
			{
				float4 LightPositionAndRadius = LightViewSpacePositionAndRadius[LightRealIndex];
				float3 ViewSpaceLightPosition = LightPositionAndRadius.xyz;
				float LightRadius = LightPositionAndRadius.w;

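				// Guard against writing past the end of the fixed-size VisibleLightIndexs array
				if (VisibleLightCount < ThreadCount && TestSphereAABB(ViewSpaceLightPosition, LightRadius, ClusterIndex))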
				{
					VisibleLightIndexs[VisibleLightCount] = LightRealIndex;
					VisibleLightCount += 1;
				}
			}
		}
	}

	//We want all thread groups to have completed the light tests before continuing
	GroupMemoryBarrierWithGroupSync();

	uint Offset;
	InterlockedAdd(GlobalIndexCount[0], VisibleLightCount, Offset);
	for (uint Index = 0; Index < VisibleLightCount; ++Index)
	{
		GlobalLightIndexList[Offset + Index] = VisibleLightIndexs[Index];
	}

	LightGridList[ClusterIndex * 2] = Offset;
	LightGridList[ClusterIndex * 2 + 1] = VisibleLightCount;
}

void FMobileSceneRenderer::AddLightCullPass(FRDGBuilder& GraphBuilder, const FViewInfo* View, int32 ViewIndex, FRDGBufferRef ClusterListBuffer, FSortedLightSetSceneInfo &SortedLightSet, bool bCullLightsToGrid)
{
	//.................... A large chunk of code is omitted here; see the GitHub commits for details

	if (ForwardLocalLightData.Num() == 0)
	{
		// Make sure the buffer gets created even though we're not going to read from it in the shader, for platforms like PS4 that assert on null resources being bound
		ForwardLocalLightData.AddZeroed();
	}

	FIntPoint ViewPortSize = View->ViewRect.Size();
	FVector4 TileSizes = FVector4(MobileClusterSizeX, MobileClusterSizeY, MobileClusterSizeZ, 0);
	FVector2D ClusterFactor;
	ClusterFactor.X = (float)MobileClusterSizeZ / FMath::Log2(GPointLightFarClippingPlane / View->NearClippingDistance);
	ClusterFactor.Y = -((float)MobileClusterSizeZ *  FMath::Log2(View->NearClippingDistance)) / FMath::Log2(GPointLightFarClippingPlane / View->NearClippingDistance);
	UpdateDynamicVector4BufferData(ForwardLocalLightData, View->ForwardLightingResources->ForwardLocalLightBuffer);
	LocalLightData.ForwardLocalLightBuffer = View->ForwardLightingResources->ForwardLocalLightBuffer.SRV;
	LocalLightData.NumLocalLights = NumLocalLightsFinal;
	LocalLightData.TileSizes = TileSizes;
	LocalLightData.ClusterFactor = ClusterFactor;

	const bool bShouldCacheTemporaryBuffers = View->ViewState != nullptr;
	FForwardLightingCullingResources& ForwardLightingCullingResources = bShouldCacheTemporaryBuffers
		? View->ViewState->ForwardLightingCullingResources
		: *GraphBuilder.AllocObject<FForwardLightingCullingResources>();

	if (ViewSpacePosAndRadiusData.Num() == 0)
	{
		// Make sure the buffer gets created even though we're not going to read from it in the shader, for platforms like PS4 that assert on null resources being bound
		ViewSpacePosAndRadiusData.AddZeroed();
		ViewSpaceDirAndPreprocAngleData.AddZeroed();
	} 

	// Alloc Large RWBuffer
	if (Scene->UniformBuffers.MobileLightGrid.NumBytes != sizeof(uint32) * 2 * MobileClusterNum)
	{
		Scene->UniformBuffers.MobileLightGrid.Initialize(sizeof(uint32), 2 * MobileClusterNum, EPixelFormat::PF_R32_UINT);
	}

	if (Scene->UniformBuffers.MobileGlobalLightIndexList.NumBytes != sizeof(uint32) * 4 * MobileClusterNum)
	{
		Scene->UniformBuffers.MobileGlobalLightIndexList.Initialize(sizeof(uint32), 4 * MobileClusterNum, EPixelFormat::PF_R32_UINT);
	}

	check(ViewSpacePosAndRadiusData.Num() == ForwardLocalLightData.Num());
	check(ViewSpaceDirAndPreprocAngleData.Num() == ForwardLocalLightData.Num());
	UpdateDynamicVector4BufferData(ViewSpacePosAndRadiusData, ForwardLightingCullingResources.ViewSpacePosAndRadiusData);
	UpdateDynamicVector4BufferData(ViewSpaceDirAndPreprocAngleData, ForwardLightingCullingResources.ViewSpaceDirAndPreprocAngleData);
	if (!Scene->UniformBuffers.MobileLocalLightUniformBuffer.IsValid())
	{
		Scene->UniformBuffers.MobileLocalLightUniformBuffer = TUniformBufferRef<FMobileLocalLightData>::CreateUniformBufferImmediate(LocalLightData, UniformBuffer_MultiFrame);
	}
	else
	{
		Scene->UniformBuffers.MobileLocalLightUniformBuffer.UpdateUniformBufferImmediate(LocalLightData);
	}

	// Add Clear GlobalIndexCountUAV Pass
	FRDGBufferDesc GlobalIndexCountDesc = FRDGBufferDesc::CreateStructuredDesc(sizeof(uint32), 1);
	FRDGBufferRef GlobalIndexCountBuffer = GraphBuilder.CreateBuffer(GlobalIndexCountDesc, TEXT("GlobalIndexCount"));
	FRDGBufferUAVRef GlobalIndexCountUAV = GraphBuilder.CreateUAV(GlobalIndexCountBuffer);
	AddClearUAVPass(GraphBuilder, GlobalIndexCountUAV, 0);
	
	FMobileLightCullCS::FParameters* LightCullParameters = GraphBuilder.AllocParameters<FMobileLightCullCS::FParameters>();
	LightCullParameters->ClusterList = ClusterListBuffer.SRV;
	LightCullParameters->LightGridList = Scene->UniformBuffers.MobileLightGrid.UAV;
	LightCullParameters->GlobalLightIndexList = Scene->UniformBuffers.MobileGlobalLightIndexList.UAV;
	LightCullParameters->GlobalIndexCount = GlobalIndexCountUAV;
	LightCullParameters->LocalLightData = Scene->UniformBuffers.MobileLocalLightUniformBuffer;
	LightCullParameters->LightViewSpacePositionAndRadius = ForwardLightingCullingResources.ViewSpacePosAndRadiusData.SRV;

	TShaderMapRef<FMobileLightCullCS> ComputeShader(View->ShaderMap);
	FComputeShaderUtils::AddPass(
		GraphBuilder,
		RDG_EVENT_NAME("MobileLightCullPass"),
		ComputeShader,
		LightCullParameters,
		FIntVector(1, 1, MobileClusterSizeZ / 2));
}
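
The sphere versus AABB test at the heart of the culling shader can be checked in isolation on the CPU. The following is a minimal C++ sketch of the same closest-point idea; `FVec3` and both function names are hypothetical stand-ins, not engine types.

```cpp
#include <cassert>

struct FVec3 { float X, Y, Z; };

// Squared distance from point P to the closest point of the box [Min, Max].
// It is zero when P lies inside the box.
inline float SquaredDistancePointAABB(FVec3 P, FVec3 Min, FVec3 Max)
{
    const float Point[3] = { P.X, P.Y, P.Z };
    const float Lo[3] = { Min.X, Min.Y, Min.Z };
    const float Hi[3] = { Max.X, Max.Y, Max.Z };

    float SqDistance = 0.0f;
    for (int Axis = 0; Axis < 3; ++Axis)
    {
        assert(Lo[Axis] <= Hi[Axis]);
        if (Point[Axis] < Lo[Axis])
        {
            const float Diff = Lo[Axis] - Point[Axis];
            SqDistance += Diff * Diff;
        }
        else if (Point[Axis] > Hi[Axis])
        {
            const float Diff = Point[Axis] - Hi[Axis];
            SqDistance += Diff * Diff;
        }
    }
    return SqDistance;
}

// A light's sphere of influence touches the cluster if the closest point of the
// cluster's AABB is within the light radius.
inline bool SphereIntersectsAABB(FVec3 Center, float Radius, FVec3 Min, FVec3 Max)
{
    return SquaredDistancePointAABB(Center, Min, Max) <= Radius * Radius;
}
```

Comparing squared distances avoids a square root per test, which matters when every cluster tests every light.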

MobileBasePass

Finally, modify MobileBasePass: find the cluster that each pixel belongs to, then loop over that cluster's light list to shade the pixel.

#if !MATERIAL_SHADINGMODEL_SINGLELAYERWATER
	// Local lights
	float DeviceZ = SvPosition.z / SvPosition.w;
	float PixelDepth = GetPixelDepth(MaterialParameters);

	// ViewPosZ
	float2 ScreenUV = SvPositionToBufferUV(SvPosition);
	float ViewPosZ = PixelDepth;
	float2 ClusterFactor = MobileLocalLightData.ClusterFactor;
	float4 TileSizes = MobileLocalLightData.TileSizes;
	uint ClusterZ = uint(max(log2(ViewPosZ) * ClusterFactor.x + ClusterFactor.y, 0.0));
	uint3 Clusters = uint3(uint(ScreenUV.x * TileSizes.x), uint(ScreenUV.y * TileSizes.y), ClusterZ);
	uint ClusterIndex = Clusters.x + Clusters.y * (uint)TileSizes.x + Clusters.z * (uint)TileSizes.x * (uint)TileSizes.y;
	uint LightOffset = LightGridList[2 * ClusterIndex];
	uint LightCount = LightGridList[2 * ClusterIndex + 1];
	for (uint Index = 0; Index < LightCount; Index++)
	{
		uint LocalLightIndex = GlobalLightIndexList[LightOffset + Index];
		uint LocalLightBaseIndex = LocalLightIndex * LOCAL_LIGHT_DATA_STRIDE;
		float4 LightPositionAndInvRadius = MobileLocalLightData.ForwardLocalLightBuffer[LocalLightBaseIndex + 0];
		float4 LightColorAndFalloffExponent = MobileLocalLightData.ForwardLocalLightBuffer[LocalLightBaseIndex + 1];
		float4 LightDirectionAndShadowMask = MobileLocalLightData.ForwardLocalLightBuffer[LocalLightBaseIndex + 2];
		float4 SpotAnglesAndSourceRadiusPacked = MobileLocalLightData.ForwardLocalLightBuffer[LocalLightBaseIndex + 3];
		float4 LightTangentAndSoftSourceRadius = MobileLocalLightData.ForwardLocalLightBuffer[LocalLightBaseIndex + 4];

		AccumulateLightingOfDynamicPointLight(MaterialParameters,
											ShadingModelContext,
											GBuffer,
											LightPositionAndInvRadius,
											LightColorAndFalloffExponent,
											float4(0, 0, 0, 1),
											float4(0, 0, 0, 1),
											Color);

	}
#endif
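
The ClusterFactor values follow from inverting the exponential slice formula: if a slice boundary is z = Near * (Far / Near)^(k / N), then k = log2(z) * N / log2(Far / Near) - N * log2(Near) / log2(Far / Near). The C++ sketch below mirrors that mapping on the CPU. The struct and function names are hypothetical, the near and far values in the usage note are made up, and the clamp to the last slice is an addition of this sketch that the shader above does not perform.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct FClusterFactor { float Scale; float Bias; };

// Precomputes the two constants so the per-pixel work is one log2 and one multiply-add.
inline FClusterFactor MakeClusterFactor(float NearPlane, float FarPlane, int SliceCount)
{
    assert(NearPlane > 0.0f && FarPlane > NearPlane && SliceCount > 0);
    const float LogRange = std::log2(FarPlane / NearPlane);
    const float Scale = static_cast<float>(SliceCount) / LogRange;
    const float Bias = -static_cast<float>(SliceCount) * std::log2(NearPlane) / LogRange;
    return { Scale, Bias };
}

// Maps a view-space depth to a slice index. Depths in front of the near plane map
// to slice 0; depths beyond the far plane are clamped to the last slice (assumption).
inline uint32_t DepthToSlice(float ViewZ, FClusterFactor Factor, int SliceCount)
{
    assert(ViewZ > 0.0f && SliceCount > 0);
    const float Slice = std::log2(ViewZ) * Factor.Scale + Factor.Bias;
    const float Clamped = std::min(std::max(Slice, 0.0f), static_cast<float>(SliceCount - 1));
    return static_cast<uint32_t>(Clamped);
}

// Flattens a 3D cluster coordinate in the same order the shaders use.
inline uint32_t FlatClusterIndex(uint32_t X, uint32_t Y, uint32_t Z, uint32_t SizeX, uint32_t SizeY)
{
    return X + Y * SizeX + Z * SizeX * SizeY;
}
```

With an assumed near plane of 10, far plane of 10000 and 24 slices, a depth of 400 falls into slice 12, and the last cluster of a 16 x 12 x 24 grid has flat index 4607.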

Performance test

Android phone demo (Redmi K40, Snapdragon 870): in a test scene with 100 lights, culling takes 0.07 ms to 0.09 ms.

[Figure 15]

[Figure 16]

Pitfalls on mobile

When packaging for an Android phone, I found that packaging fails when RWStructuredBuffer<> is used with a struct element type. I changed every struct-typed RWStructuredBuffer to a basic data type (uint, float, float4, and so on), after which packaging succeeded. For example:

[Figure 17]

Code links

https://github.com/2047241149/UnrealEngine/commit/4acdcb34933c7f6910e53834d2947f57be4e49bb

https://github.com/2047241149/UnrealEngine/commit/259ca8f37b4d6a367990d42f5130faf2586c4d93

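Summary

Many parts are still unoptimized and untested, and spot lights are disabled for now. I will work on these gradually when I have time.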

References

[1] http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf

[2] https://ubm-twvideo01.s3.amazonaws.com/o1/vault/gdc2015/presentations/Thomas_Gareth_Advancements_in_Tile-Based.pdf?tdsourcetag=s_pctim_aiomsg

[3] http://www.humus.name/Articles/PracticalClusteredShading.pdf

[4] https://www.slideshare.net/TiagoAlexSousa/siggraph2016-the-devil-is-in-the-details-idtech-666?next_slideshow=1

[5] https://newq.net/dl/pub/SA2014Practical.pdf

[6] A Primer On Efficient Rendering Algorithms & Clustered Shading.
