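I had some free time recently and wanted to get familiar with UE4's graphics module, so I went looking for a feature to build. As of UE4 4.27, Mobile Forward does not really support many lights: by default it handles at most 4 point lights. That gave me the idea of modifying UE's Mobile Forward pipeline to support culling for many lights.
If you look at MobileBasePassPixelShader.usf, you will find that UE implements multiple lights "by hand", with a hard-coded maximum of 4 point lights.
So to support many lights, this part has to be replaced.
First, note that this article does not address GPU bandwidth or overdraw: there is no PreZ pass, and the work here is purely light culling.
The most direct approach is to loop over all lights in the shader and accumulate their contributions. With several hundred lights, though, performance collapses, so this option is ruled out immediately.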
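Tile-based point light culling, as the name suggests, divides the screen into N tiles and uses a compute shader to build, in parallel, the list of point light indices stored for each tile. During pixel shading, the pixel position determines which tile the pixel belongs to, that tile's light list is fetched, and lighting is computed from it. This culls lights far more efficiently than brute-force iteration over every light.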
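As a rough illustration of the lookup side of tile-based culling, here is a minimal CPU-side sketch of mapping a pixel to its tile. The `TileGrid` struct, the `TileIndexForPixel` function and the 16-pixel tile size used in the example are assumptions made for this sketch; none of them come from UE.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical screen-space tile grid (illustrative only, not a UE type).
struct TileGrid {
    uint32_t ScreenWidth;
    uint32_t ScreenHeight;
    uint32_t TileSize; // edge length of a square tile, in pixels

    // Tiles per row/column, rounded up so partial tiles at the border count.
    uint32_t TilesX() const { return (ScreenWidth + TileSize - 1) / TileSize; }
    uint32_t TilesY() const { return (ScreenHeight + TileSize - 1) / TileSize; }
};

// Row-major index of the tile that contains pixel (X, Y).
// The pixel shader would use this index to fetch the tile's light list.
uint32_t TileIndexForPixel(const TileGrid& Grid, uint32_t X, uint32_t Y) {
    assert(X < Grid.ScreenWidth && Y < Grid.ScreenHeight);
    return (Y / Grid.TileSize) * Grid.TilesX() + (X / Grid.TileSize);
}
```

With a 1920 x 1080 screen and 16-pixel tiles this gives a 120 x 68 grid, so every pixel only has to consider the lights binned into one of 8160 tiles.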
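Tiled Based Cull culls lights well, but its efficiency depends to some extent on PreZ, because besides the frustum cull it also relies on DepthRange culling. As noted above, there is no PreZ here, so the DepthRange optimization is unavailable and efficiency drops further.
Even with PreZ, the DepthRange optimization is not robust. Suppose a tile happens to contain one small object at the far end of its frustum and another at the near end: the DepthBounds then become meaningless, and a large number of pixels in the tile run lighting for lights that cannot affect them. This is known as "Depth Discontinuity".
For these reasons, I set the Tiled Based Cull approach aside for now.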
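Cluster-based point light culling differs from TiledBasedCull, which partitions screen space only: it builds on TiledBasedCull by adding Z, partitioning the whole view frustum directly into clusters and using a compute shader to compute each cluster's light index list. It culls effectively even without a PreZ pass.
There are three steps:
(1) At game start, pre-allocate the clusters in camera space, each represented by an AABB (this does not need to be recomputed every frame).
(2) Iterate over all clusters and compute the set of lights affecting each one.
(3) Shade: use the pixel's ScreenPos.xy and ViewZ to find which cluster it is in, then fetch that cluster's light set and compute lighting.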
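At game start, clusters are pre-allocated in camera space, each represented by an AABB. This stage does not need to run every frame, but the clusters must be rebuilt whenever something such as the size of the view frustum changes.
The X and Y subdivision is the same as in tile shading, but there are several schemes for subdividing Z.
The Doom (id Software) scheme
Here the clusters are divided using the DOOM scheme, with 16 x 12 x 24 = 4608 clusters.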
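The exponential Z slicing can be written as Z_k = Near * (Far / Near)^(k / NumSlices), which is the same expression the cluster-building shader in this article uses for `TileNear` and `TileFar`. The following is a small CPU-side sketch of it; the function names and `kClusterCount` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Grid size used in this article: 16 x 12 x 24 clusters.
constexpr uint32_t kClusterCount = 16 * 12 * 24;
static_assert(kClusterCount == 4608, "16 x 12 x 24 clusters");

// View-space depth at which slice K begins (K == NumSlices gives the far end):
//   Z_K = Near * (Far / Near)^(K / NumSlices)
float SliceNearDepth(float Near, float Far, uint32_t K, uint32_t NumSlices) {
    assert(Near > 0.0f && Far > Near && NumSlices > 0 && K <= NumSlices);
    const float Exponent = static_cast<float>(K) / static_cast<float>(NumSlices);
    return Near * std::pow(Far / Near, Exponent);
}

// Depth extent of slice K; it grows with K, so slices near the camera are thin
// and distant slices are thick.
float SliceThickness(float Near, float Far, uint32_t K, uint32_t NumSlices) {
    assert(K < NumSlices);
    return SliceNearDepth(Near, Far, K + 1, NumSlices) -
           SliceNearDepth(Near, Far, K, NumSlices);
}
```

For example, with Near = 1, Far = 4096 and 24 slices, each slice boundary is a factor of sqrt(2) farther than the previous one, so depth resolution is concentrated close to the camera.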
class FMobileBuildClusterCS : public FGlobalShader
{
DECLARE_SHADER_TYPE(FMobileBuildClusterCS, Global);
SHADER_USE_PARAMETER_STRUCT(FMobileBuildClusterCS, FGlobalShader);
BEGIN_SHADER_PARAMETER_STRUCT(FParameters, )
SHADER_PARAMETER_STRUCT_REF(FViewUniformShaderParameters, View)
SHADER_PARAMETER(float, FarPlane)
SHADER_PARAMETER(float, NearPlane)
SHADER_PARAMETER(FVector4, TileSizes)
SHADER_PARAMETER_UAV(RWBuffer<float4>, ClusterList)
END_SHADER_PARAMETER_STRUCT()
public:
static bool ShouldCompilePermutation(const FShaderPermutationParameters& Parameters)
{
return IsFeatureLevelSupported(Parameters.Platform, ERHIFeatureLevel::ES3_1);
}
};
MobileLightClusterShader.usf
#include "/Engine/Private/Common.ush"
float FarPlane; //LightCullFarPlane
float NearPlane; //LightCullNearPlane
float4 TileSizes;
RWBuffer<float4> ClusterList;
float3 LineIntersectionToZPlane(float3 a, float3 b, float z)
{
float3 normal = float3(0.0, 0.0, 1.0); // normal of a constant-Z plane in view space
float3 ab = b - a;
float t = (z - dot(normal, a)) / dot(normal, ab);
float3 result = a + t * ab;
return result;
}
[numthreads(1, 1, 1)]
void MainCS(
uint3 GroupId : SV_GroupID,
uint3 GroupThreadId : SV_GroupThreadID,
uint GroupIndex : SV_GroupIndex,
uint3 DispatchThreadId : SV_DispatchThreadID)
{
const float3 EyePos = float3(0.0, 0.0, 0.0);
uint TileIndex = GroupId.x + GroupId.y * (uint)TileSizes.x + GroupId.z * (uint)TileSizes.x * (uint)TileSizes.y;
float Px = 1.0 / (float)TileSizes.x;
float Py = 1.0 / (float)TileSizes.y;
// Compute the tile's min and max corners in view space on the far plane (using the near plane gave wrong results: always zero)
float2 MaxPointViewportUV = float2(GroupId.x + 1, GroupId.y + 1) * float2(Px, Py);
float2 MinPointViewportUV = float2(GroupId.xy) * float2(Px, Py);
float3 MaxPointViewPos = ScreenToViewPos(MaxPointViewportUV, FarPlane);
float3 MinPointViewPos = ScreenToViewPos(MinPointViewportUV, FarPlane);
// Near and far depths of the cluster in view space, using the slicing scheme from the SIGGRAPH 2016 idTech 6 talk
float TileNear = NearPlane * pow(FarPlane / NearPlane, GroupId.z / TileSizes.z);
float TileFar = NearPlane * pow(FarPlane / NearPlane, (GroupId.z + 1) / TileSizes.z);
// Find the cluster's four min/max corner points in view space
float3 MinPointNear = LineIntersectionToZPlane(EyePos, MinPointViewPos, TileNear);
float3 MinPointFar = LineIntersectionToZPlane(EyePos, MinPointViewPos, TileFar);
float3 MaxPointNear = LineIntersectionToZPlane(EyePos, MaxPointViewPos, TileNear);
float3 MaxPointFar = LineIntersectionToZPlane(EyePos, MaxPointViewPos, TileFar);
float3 MinPointAABB = min(min(MinPointNear, MinPointFar), min(MaxPointNear, MaxPointFar));
float3 MaxPointAABB = max(max(MinPointNear, MinPointFar), max(MaxPointNear, MaxPointFar));
ClusterList[2 * TileIndex] = float4(MinPointAABB.x, MinPointAABB.y, MinPointAABB.z, 1.0);
ClusterList[2 * TileIndex + 1] = float4(MaxPointAABB.x, MaxPointAABB.y, MaxPointAABB.z, 1.0);
}
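The AABB construction in the shader above can be mirrored on the CPU, which is handy for checking cluster bounds by hand. This is only a sketch under the same assumptions the shader makes (eye at the view-space origin, Z as the depth axis); `Vec3`, `Aabb`, `EyeRayAtDepth` and `ClusterBounds` are hypothetical names, not UE types.

```cpp
#include <algorithm>
#include <cassert>

struct Vec3 { float X, Y, Z; };
struct Aabb { Vec3 Min, Max; };

// Point where the ray from the eye (origin) through P crosses the plane Z = PlaneZ.
// Same result as LineIntersectionToZPlane with a = (0, 0, 0) and b = P.
Vec3 EyeRayAtDepth(const Vec3& P, float PlaneZ) {
    assert(P.Z != 0.0f);
    const float T = PlaneZ / P.Z;
    return {P.X * T, P.Y * T, PlaneZ};
}

// Bounding box of one cluster, given the view-space positions of the tile's
// min and max corners (at any common depth) and the slice's near/far depths.
Aabb ClusterBounds(const Vec3& MinCorner, const Vec3& MaxCorner, float NearZ, float FarZ) {
    const Vec3 Points[4] = {
        EyeRayAtDepth(MinCorner, NearZ), EyeRayAtDepth(MinCorner, FarZ),
        EyeRayAtDepth(MaxCorner, NearZ), EyeRayAtDepth(MaxCorner, FarZ)};
    Aabb Box{Points[0], Points[0]};
    for (const Vec3& P : Points) {
        Box.Min = {std::min(Box.Min.X, P.X), std::min(Box.Min.Y, P.Y), std::min(Box.Min.Z, P.Z)};
        Box.Max = {std::max(Box.Max.X, P.X), std::max(Box.Max.Y, P.Y), std::max(Box.Max.Z, P.Z)};
    }
    return Box;
}
```

Because the box must enclose a frustum-shaped cell, it is conservative: neighbouring cluster AABBs overlap, and a light can be assigned to a cluster whose exact cell it does not touch.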
void FMobileSceneRenderer::AddBuildLightPass(FRDGBuilder& GraphBuilder, const FViewInfo* View, FRWBuffer ClusterListBuffer)
{
FMobileBuildClusterCS::FParameters* BuildClusterParameters = GraphBuilder.AllocParameters<FMobileBuildClusterCS::FParameters>();
BuildClusterParameters->ClusterList = ClusterListBuffer.UAV;
FIntPoint ViewPortSize = View->ViewRect.Size();
BuildClusterParameters->TileSizes = FVector4(MobileClusterSizeX, MobileClusterSizeY, MobileClusterSizeZ, (float)ViewPortSize.X / (float)MobileClusterSizeX);
BuildClusterParameters->NearPlane = View->NearClippingDistance;
BuildClusterParameters->FarPlane = GPointLightFarClippingPlane;
BuildClusterParameters->View = View->ViewUniformBuffer;
TShaderMapRef<FMobileBuildClusterCS> ComputeShader(View->ShaderMap);
FComputeShaderUtils::AddPass(
GraphBuilder,
RDG_EVENT_NAME("MobileBuildLightClusterPass"),
ComputeShader,
BuildClusterParameters,
FIntVector(MobileClusterSizeX, MobileClusterSizeY, MobileClusterSizeZ));
}
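Iterate over all clusters, compute the set of lights affecting each one, and record the result in a global index offset table and a global light index table. To keep the data structures simple, both tables are one-dimensional arrays. Each entry of the global index offset table records a cluster's light count and the offset of that cluster's lights within the global light index table.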
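To make the two tables concrete, here is a serial CPU reference sketch of the same layout: for cluster `i`, element `2 * i` holds the offset and `2 * i + 1` holds the count, and the light indices themselves sit contiguously in a second flat array. The type and function names are assumptions for illustration only. On the GPU the offset comes from an atomic add, so the order of clusters within the index list is not deterministic as it is here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Aabb { float Min[3]; float Max[3]; };
struct Sphere { float Center[3]; float Radius; };

// Sphere vs AABB test: squared distance from the center to the box, per axis.
bool SphereIntersectsAabb(const Sphere& S, const Aabb& Box) {
    assert(S.Radius >= 0.0f);
    float SqDist = 0.0f;
    for (int Axis = 0; Axis < 3; ++Axis) {
        const float V = S.Center[Axis];
        if (V < Box.Min[Axis]) { const float D = Box.Min[Axis] - V; SqDist += D * D; }
        else if (V > Box.Max[Axis]) { const float D = V - Box.Max[Axis]; SqDist += D * D; }
    }
    return SqDist <= S.Radius * S.Radius;
}

struct LightGrid {
    std::vector<uint32_t> OffsetAndCount; // [2*i] = offset, [2*i+1] = count
    std::vector<uint32_t> LightIndices;   // all clusters' light indices, back to back
};

LightGrid BuildLightGrid(const std::vector<Aabb>& Clusters, const std::vector<Sphere>& Lights) {
    LightGrid Grid;
    Grid.OffsetAndCount.resize(2 * Clusters.size());
    for (std::size_t C = 0; C < Clusters.size(); ++C) {
        const uint32_t Offset = static_cast<uint32_t>(Grid.LightIndices.size());
        for (std::size_t L = 0; L < Lights.size(); ++L) {
            if (SphereIntersectsAabb(Lights[L], Clusters[C])) {
                Grid.LightIndices.push_back(static_cast<uint32_t>(L));
            }
        }
        Grid.OffsetAndCount[2 * C] = Offset;
        Grid.OffsetAndCount[2 * C + 1] = static_cast<uint32_t>(Grid.LightIndices.size()) - Offset;
    }
    return Grid;
}
```

A shader reading this layout only needs two fetches (offset and count) before it can walk its cluster's slice of `LightIndices`.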
class FMobileLightCullCS : public FGlobalShader
{
DECLARE_SHADER_TYPE(FMobileLightCullCS, Global);
SHADER_USE_PARAMETER_STRUCT(FMobileLightCullCS, FGlobalShader);
BEGIN_SHADER_PARAMETER_STRUCT(FParameters, )
SHADER_PARAMETER_STRUCT_REF(FViewUniformShaderParameters, View)
SHADER_PARAMETER_STRUCT_REF(FMobileLocalLightData, LocalLightData)
SHADER_PARAMETER_RDG_BUFFER_UAV(RWStructuredBuffer<uint>, GlobalIndexCount)
SHADER_PARAMETER_UAV(RWBuffer<uint>, LightGridList)
SHADER_PARAMETER_UAV(RWBuffer<uint>, GlobalLightIndexList)
SHADER_PARAMETER_SRV(StrongTypedBuffer<float4>, ClusterList)
SHADER_PARAMETER_SRV(StrongTypedBuffer<float4>, LightViewSpacePositionAndRadius)
END_SHADER_PARAMETER_STRUCT()
public:
static bool ShouldCompilePermutation(const FShaderPermutationParameters& Parameters)
{
return IsFeatureLevelSupported(Parameters.Platform, ERHIFeatureLevel::ES3_1);
}
};
MobileLightCullShader.usf
#include "/Engine/Private/Common.ush"
#define THREAD_GROUD_X 16
#define THREAD_GROUD_Y 12
#define THREAD_GROUD_Z 2
#define GROUD_THREAD_TOTAL_NUM (THREAD_GROUD_X * THREAD_GROUD_Y * THREAD_GROUD_Z)
Buffer<float4> LightViewSpacePositionAndRadius;
Buffer<float4> ClusterList;
RWBuffer<uint> LightGridList;
RWBuffer<uint> GlobalLightIndexList;
RWStructuredBuffer<uint> GlobalIndexCount;
float GetSqdisPointAABB(float3 SphereViewPos, uint CluterIndex)
{
float SqDistance = 0.0;
float4 MinPoint = ClusterList[2 * CluterIndex];
float4 MaxPoint = ClusterList[2 * CluterIndex + 1];
for (int Index = 0; Index < 3; Index++)
{
float V = SphereViewPos[Index];
if (V < MinPoint[Index])
{
float Diff = MinPoint[Index] - V;
SqDistance += Diff * Diff;
}
if (V > MaxPoint[Index])
{
float Diff = V - MaxPoint[Index];
SqDistance += Diff * Diff;
}
}
return SqDistance;
}
bool TestSphereAABB(float3 LightViewPos, float LightRadius, uint CluterIndex)
{
float SqDistance = GetSqdisPointAABB(LightViewPos, CluterIndex);
return SqDistance <= (LightRadius * LightRadius);
}
[numthreads(THREAD_GROUD_X, THREAD_GROUD_Y, THREAD_GROUD_Z)]
void MainCS(
uint3 GroupId : SV_GroupID,
uint3 GroupThreadId : SV_GroupThreadID,
uint GroupIndex : SV_GroupIndex,
uint3 DispatchThreadId : SV_DispatchThreadID)
{
const uint ThreadCount = GROUD_THREAD_TOTAL_NUM;
uint LightCountInt = (uint)MobileLocalLightData.NumLocalLights;
uint PassCount = (LightCountInt + ThreadCount - 1) / ThreadCount;
uint ClusterIndex = GroupIndex + ThreadCount * GroupId.z;
uint VisibleLightCount = 0;
// Per-thread scratch list: a cluster is assumed to hold at most GROUD_THREAD_TOTAL_NUM lights (not bounds-checked below)
uint VisibleLightIndexs[GROUD_THREAD_TOTAL_NUM];
//TODO: directly loop for all points
for (uint PassIndex = 0; PassIndex < PassCount; ++PassIndex)
{
for (uint Light = 0; Light < ThreadCount; ++Light)
{
uint LightRealIndex = Light + PassIndex * ThreadCount;
if (LightRealIndex < LightCountInt)
{
float4 LightPositionAndRadius = LightViewSpacePositionAndRadius[LightRealIndex];
float3 ViewSpaceLightPosition = LightPositionAndRadius.xyz;
float LightRadius = LightPositionAndRadius.w;
if(TestSphereAABB(ViewSpaceLightPosition, LightRadius, ClusterIndex))
{
VisibleLightIndexs[VisibleLightCount] = LightRealIndex;
VisibleLightCount += 1;
}
}
}
}
// Wait until every thread in this group has finished its light tests (this barrier does not synchronize across thread groups)
GroupMemoryBarrierWithGroupSync();
uint Offset;
InterlockedAdd(GlobalIndexCount[0], VisibleLightCount, Offset);
for (uint Index = 0; Index < VisibleLightCount; ++Index)
{
GlobalLightIndexList[Offset + Index] = VisibleLightIndexs[Index];
}
LightGridList[ClusterIndex * 2] = Offset;
LightGridList[ClusterIndex * 2 + 1] = VisibleLightCount;
}
void FMobileSceneRenderer::AddLightCullPass(FRDGBuilder& GraphBuilder, const FViewInfo* View, int32 ViewIndex, FRDGBufferRef ClusterListBuffer, FSortedLightSetSceneInfo &SortedLightSet, bool bCullLightsToGrid)
{
// ... (a large chunk of code omitted here; see the GitHub commits for details)
if (ForwardLocalLightData.Num() == 0)
{
// Make sure the buffer gets created even though we're not going to read from it in the shader, for platforms like PS4 that assert on null resources being bound
ForwardLocalLightData.AddZeroed();
}
FIntPoint ViewPortSize = View->ViewRect.Size();
FVector4 TileSizes = FVector4(MobileClusterSizeX, MobileClusterSizeY, MobileClusterSizeZ, 0);
FVector2D ClusterFactor;
ClusterFactor.X = (float)MobileClusterSizeZ / FMath::Log2(GPointLightFarClippingPlane / View->NearClippingDistance);
ClusterFactor.Y = -((float)MobileClusterSizeZ * FMath::Log2(View->NearClippingDistance)) / FMath::Log2(GPointLightFarClippingPlane / View->NearClippingDistance);
UpdateDynamicVector4BufferData(ForwardLocalLightData, View->ForwardLightingResources->ForwardLocalLightBuffer);
LocalLightData.ForwardLocalLightBuffer = View->ForwardLightingResources->ForwardLocalLightBuffer.SRV;
LocalLightData.NumLocalLights = NumLocalLightsFinal;
LocalLightData.TileSizes = TileSizes;
LocalLightData.ClusterFactor = ClusterFactor;
const bool bShouldCacheTemporaryBuffers = View->ViewState != nullptr;
FForwardLightingCullingResources& ForwardLightingCullingResources = bShouldCacheTemporaryBuffers
? View->ViewState->ForwardLightingCullingResources
: *GraphBuilder.AllocObject<FForwardLightingCullingResources>();
if (ViewSpacePosAndRadiusData.Num() == 0)
{
// Make sure the buffer gets created even though we're not going to read from it in the shader, for platforms like PS4 that assert on null resources being bound
ViewSpacePosAndRadiusData.AddZeroed();
ViewSpaceDirAndPreprocAngleData.AddZeroed();
}
// Alloc Large RWBuffer
if (Scene->UniformBuffers.MobileLightGrid.NumBytes != sizeof(uint32) * 2 * MobileClusterNum)
{
Scene->UniformBuffers.MobileLightGrid.Initialize(sizeof(uint32), 2 * MobileClusterNum, EPixelFormat::PF_R32_UINT);
}
if (Scene->UniformBuffers.MobileGlobalLightIndexList.NumBytes != sizeof(uint32) * 4 * MobileClusterNum)
{
Scene->UniformBuffers.MobileGlobalLightIndexList.Initialize(sizeof(uint32), 4 * MobileClusterNum, EPixelFormat::PF_R32_UINT);
}
check(ViewSpacePosAndRadiusData.Num() == ForwardLocalLightData.Num());
check(ViewSpaceDirAndPreprocAngleData.Num() == ForwardLocalLightData.Num());
UpdateDynamicVector4BufferData(ViewSpacePosAndRadiusData, ForwardLightingCullingResources.ViewSpacePosAndRadiusData);
UpdateDynamicVector4BufferData(ViewSpaceDirAndPreprocAngleData, ForwardLightingCullingResources.ViewSpaceDirAndPreprocAngleData);
if (!Scene->UniformBuffers.MobileLocalLightUniformBuffer.IsValid())
{
Scene->UniformBuffers.MobileLocalLightUniformBuffer = TUniformBufferRef<FMobileLocalLightData>::CreateUniformBufferImmediate(LocalLightData, UniformBuffer_MultiFrame);
}
else
{
Scene->UniformBuffers.MobileLocalLightUniformBuffer.UpdateUniformBufferImmediate(LocalLightData);
}
// Add Clear GlobalIndexCountUAV Pass
FRDGBufferDesc GlobalIndexCountDesc = FRDGBufferDesc::CreateStructuredDesc(sizeof(uint32), 1);
FRDGBufferRef GlobalIndexCountBuffer = GraphBuilder.CreateBuffer(GlobalIndexCountDesc, TEXT("GlobalIndexCount"));
FRDGBufferUAVRef GlobalIndexCountUAV = GraphBuilder.CreateUAV(GlobalIndexCountBuffer);
AddClearUAVPass(GraphBuilder, GlobalIndexCountUAV, 0);
FMobileLightCullCS::FParameters* LightCullParameters = GraphBuilder.AllocParameters<FMobileLightCullCS::FParameters>();
LightCullParameters->ClusterList = ClusterListBuffer.SRV;
LightCullParameters->LightGridList = Scene->UniformBuffers.MobileLightGrid.UAV;
LightCullParameters->GlobalLightIndexList = Scene->UniformBuffers.MobileGlobalLightIndexList.UAV;
LightCullParameters->GlobalIndexCount = GlobalIndexCountUAV;
LightCullParameters->LocalLightData = Scene->UniformBuffers.MobileLocalLightUniformBuffer;
LightCullParameters->LightViewSpacePositionAndRadius = ForwardLightingCullingResources.ViewSpacePosAndRadiusData.SRV;
TShaderMapRef<FMobileLightCullCS> ComputeShader(View->ShaderMap);
FComputeShaderUtils::AddPass(
GraphBuilder,
RDG_EVENT_NAME("MobileLightCullPass"),
ComputeShader,
LightCullParameters,
FIntVector(1, 1, MobileClusterSizeZ / 2));
}
Finally, modify MobileBasePass: determine which cluster each pixel falls in, then iterate over that cluster's light list to shade the pixel.
#if !MATERIAL_SHADINGMODEL_SINGLELAYERWATER
// Local lights
float DeviceZ = SvPosition.z / SvPosition.w;
float PixelDepth = GetPixelDepth(MaterialParameters);
// ViewPosZ
float2 ScreenUV = SvPositionToBufferUV(SvPosition);
float ViewPosZ = PixelDepth;
float2 ClusterFactor = MobileLocalLightData.ClusterFactor;
float4 TileSizes = MobileLocalLightData.TileSizes;
uint ClusterZ = uint(max(log2(ViewPosZ) * ClusterFactor.x + ClusterFactor.y, 0.0));
uint3 Clusters = uint3(uint(ScreenUV.x * TileSizes.x), uint(ScreenUV.y * TileSizes.y), ClusterZ);
uint ClusterIndex = Clusters.x + Clusters.y * (uint)TileSizes.x + Clusters.z * (uint)TileSizes.x * (uint)TileSizes.y;
uint LightOffset = LightGridList[2 * ClusterIndex];
uint LightCount = LightGridList[2 * ClusterIndex + 1];
for (uint Index = 0; Index < LightCount; Index++)
{
uint LocalLightIndex = GlobalLightIndexList[LightOffset + Index];
uint LocalLightBaseIndex = LocalLightIndex * LOCAL_LIGHT_DATA_STRIDE;
float4 LightPositionAndInvRadius = MobileLocalLightData.ForwardLocalLightBuffer[LocalLightBaseIndex + 0];
float4 LightColorAndFalloffExponent = MobileLocalLightData.ForwardLocalLightBuffer[LocalLightBaseIndex + 1];
float4 LightDirectionAndShadowMask = MobileLocalLightData.ForwardLocalLightBuffer[LocalLightBaseIndex + 2];
float4 SpotAnglesAndSourceRadiusPacked = MobileLocalLightData.ForwardLocalLightBuffer[LocalLightBaseIndex + 3];
float4 LightTangentAndSoftSourceRadius = MobileLocalLightData.ForwardLocalLightBuffer[LocalLightBaseIndex + 4];
AccumulateLightingOfDynamicPointLight(MaterialParameters,
ShadingModelContext,
GBuffer,
LightPositionAndInvRadius,
LightColorAndFalloffExponent,
float4(0, 0, 0, 1),
float4(0, 0, 0, 1),
Color);
}
#endif
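The cluster lookup in the pixel shader above, together with the `ClusterFactor` computed on the C++ side, can be checked with a small CPU-side sketch. The names `ClusterGridParams`, `MakeClusterGridParams` and `ClusterIndexForPixel` are hypothetical, and the clamping to the last cluster on each axis is an addition of this sketch (the shader only clamps Z at zero).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical container for the values the shader reads from TileSizes and ClusterFactor.
struct ClusterGridParams {
    uint32_t SizeX, SizeY, SizeZ;
    float Scale; // ClusterFactor.x = SizeZ / log2(Far / Near)
    float Bias;  // ClusterFactor.y = -SizeZ * log2(Near) / log2(Far / Near)
};

ClusterGridParams MakeClusterGridParams(uint32_t SizeX, uint32_t SizeY, uint32_t SizeZ,
                                        float Near, float Far) {
    assert(SizeX > 0 && SizeY > 0 && SizeZ > 0 && Near > 0.0f && Far > Near);
    const float LogRange = std::log2(Far / Near);
    const float Scale = static_cast<float>(SizeZ) / LogRange;
    const float Bias = -(static_cast<float>(SizeZ) * std::log2(Near)) / LogRange;
    return {SizeX, SizeY, SizeZ, Scale, Bias};
}

// Convert a float coordinate to a cell index in [0, Size - 1].
static uint32_t ClampToCell(float Value, uint32_t Size) {
    const float Clamped = std::min(std::max(Value, 0.0f), static_cast<float>(Size - 1));
    return static_cast<uint32_t>(Clamped);
}

// U, V: viewport UV in [0, 1]; ViewZ: view-space depth of the pixel.
uint32_t ClusterIndexForPixel(const ClusterGridParams& P, float U, float V, float ViewZ) {
    const uint32_t X = ClampToCell(U * static_cast<float>(P.SizeX), P.SizeX);
    const uint32_t Y = ClampToCell(V * static_cast<float>(P.SizeY), P.SizeY);
    uint32_t Z = 0; // depths at or in front of the near plane land in slice 0
    if (ViewZ > 0.0f) {
        Z = ClampToCell(std::log2(ViewZ) * P.Scale + P.Bias, P.SizeZ);
    }
    return X + Y * P.SizeX + Z * P.SizeX * P.SizeY;
}
```

With the article's 16 x 12 x 24 grid, Near = 1 and Far = 4096, a pixel at the screen center with a view depth of 3 maps to cluster (8, 6, 3), that is flat index 8 + 6 * 16 + 3 * 192 = 680.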
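Android phone demo (Redmi K40, Snapdragon 870): in a test scene with 100 lights, culling takes 0.07 ms to 0.09 ms.
When packaging for Android, I found that packaging failed whenever an RWStructuredBuffer<> used a struct as its element type, so I changed every struct-typed RWStructuredBuffer to a basic data type (uint, float, float4, and so on), after which packaging succeeded.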
https://github.com/2047241149/UnrealEngine/commit/4acdcb34933c7f6910e53834d2947f57be4e49bb
https://github.com/2047241149/UnrealEngine/commit/259ca8f37b4d6a367990d42f5130faf2586c4d93
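Many parts are still unoptimized and untested, and spot lights are disabled for now; I will work on these gradually when I have time.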
【1】http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf
【2】https://ubm-twvideo01.s3.amazonaws.com/o1/vault/gdc2015/presentations/Thomas_Gareth_Advancements_in_Tile-Based.pdf?tdsourcetag=s_pctim_aiomsg
【3】http://www.humus.name/Articles/PracticalClusteredShading.pdf
【4】https://www.slideshare.net/TiagoAlexSousa/siggraph2016-the-devil-is-in-the-details-idtech-666?next_slideshow=1
【5】https://newq.net/dl/pub/SA2014Practical.pdf
【6】 A Primer On Efficient Rendering Algorithms & Clustered Shading.