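Modern graphics APIs are designed to be friendly to multithreaded rendering. "Multithreading" here does not mean rendering in parallel on the GPU. It means that the work the CPU does when submitting draw calls can be parallelized. In other words, multithreaded rendering improves performance on the CPU side.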
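With D3D11 or OpenGL, the relevant state has to be updated and the required resources bound before each draw call, and parameters are validated when the call is submitted. For a single draw call this cost is negligible. But when a scene contains many kinds of geometry and materials, uses many shaders, and runs many passes per frame, the number of draw calls grows large, and this CPU-side work can easily become the bottleneck.
A natural optimization is to do this work in parallel, that is, to perform state changes, parameter validation, and so on across multiple threads. Traditional graphics APIs make this difficult. Both D3D11 and OpenGL are built around a Context that handles resource binding, state changes, and draw calls, and this model does not map well onto multiple threads. Submitting from several threads is possible in principle, but it is cumbersome and needs a fair amount of complex synchronization, so the overall performance may fall short of what one would hope for.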
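Modern graphics APIs change this model to make multithreading easier. In Vulkan the design shows up mainly in Queue and CommandBuffer. As mentioned in earlier posts, every command that the GPU executes must be recorded into a CommandBuffer. This covers not only draw calls but also compute dispatches and memory operations. All the state needed for rendering (shaders, DescriptorSets, and so on) must be bound inside the CommandBuffer. Each CommandBuffer carries its own state, so none of these bindings can be skipped when recording one. This is very different from traditional APIs, where a state stays as it is until it is changed. Queue is the only channel through which commands reach the GPU in Vulkan, and it replaces the Context bound to a single thread. Work is submitted to a Queue, and waiting for a particular piece of submitted work to finish requires explicit synchronization, using the mechanisms introduced in the previous post.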
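A simple multithreading pattern in Vulkan therefore looks like this: in each frame, every thread records its own CommandBuffers, and once all threads have finished recording, all the CommandBuffers are submitted to the Queue together.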
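The pattern above can be sketched without any Vulkan calls. The following is a minimal illustration, not the implementation used later in this post: `MockCommandBuffer` and `recordFrame` are hypothetical names, and a list of strings stands in for a recorded VkCommandBuffer. Each worker writes only to its own slot, so the recording phase needs no locking, and nothing is handed over until every worker has finished.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Stand-in for a recorded VkCommandBuffer: just a list of command names.
using MockCommandBuffer = std::vector<std::string>;

// Records one frame with `threadCount` workers and returns the buffers in
// submission order (hypothetical helper, not part of the original program).
std::vector<MockCommandBuffer> recordFrame(uint32_t threadCount, uint32_t drawsPerThread)
{
    // Sized up front and never resized, so each worker can safely write to its own element.
    std::vector<MockCommandBuffer> perThread(threadCount);
    std::vector<std::thread> workers;
    workers.reserve(threadCount);

    for (uint32_t t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([&perThread, t, drawsPerThread] {
            MockCommandBuffer& cmd = perThread[t];
            cmd.push_back("bind_pipeline");
            for (uint32_t i = 0; i < drawsPerThread; ++i)
            {
                cmd.push_back("draw " + std::to_string(t) + ":" + std::to_string(i));
            }
        });
    }

    // Synchronization point: nothing is submitted until every worker is done.
    for (std::thread& worker : workers)
    {
        worker.join();
    }

    // A real renderer would now hand these buffers to the Queue together.
    return perThread;
}
```

Calling `recordFrame(4, 3)` yields four buffers, each holding one bind followed by three draws. Spawning threads every frame is wasteful, which is why the real program keeps long-lived worker threads instead.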
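The scene rendered in this post consists of a large number of UFOs. The middle section of each UFO has a different color, so the render state has to be updated for every UFO that is drawn. Each UFO's position also has to be updated every frame.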
The rest of this post walks through how this multithreading pattern is implemented.
First, a Thread class has to be implemented by hand:
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class Thread
{
private:
    bool destroying = false;
    std::thread worker;
    std::queue<std::function<void()>> jobQueue;
    std::mutex queueMutex;
    std::condition_variable condition;

    // Loop through all remaining jobs
    void queueLoop()
    {
        while (true)
        {
            std::function<void()> job;
            {
                // Sleep until there is a job to run or the thread is being destroyed
                std::unique_lock<std::mutex> lock(queueMutex);
                condition.wait(lock, [this] { return !jobQueue.empty() || destroying; });
                if (destroying)
                {
                    break;
                }
                job = jobQueue.front();
            }

            job();

            {
                // Pop only after the job has run, so wait() returns once all work is done
                std::lock_guard<std::mutex> lock(queueMutex);
                jobQueue.pop();
                condition.notify_one();
            }
        }
    }

public:
    Thread()
    {
        worker = std::thread(&Thread::queueLoop, this);
    }

    ~Thread()
    {
        if (worker.joinable())
        {
            wait();
            {
                std::lock_guard<std::mutex> lock(queueMutex);
                destroying = true;
                condition.notify_one();
            }
            worker.join();
        }
    }

    // Add a new job to the thread's queue
    void addJob(std::function<void()> function)
    {
        std::lock_guard<std::mutex> lock(queueMutex);
        jobQueue.push(std::move(function));
        condition.notify_one();
    }

    // Wait until all work items have been finished
    void wait()
    {
        std::unique_lock<std::mutex> lock(queueMutex);
        condition.wait(lock, [this]() { return jobQueue.empty(); });
    }
};

class ThreadPool
{
public:
    std::vector<std::unique_ptr<Thread>> threads;

    // Sets the number of threads to be allocated in this pool
    void setThreadCount(uint32_t count)
    {
        threads.clear();
        for (uint32_t i = 0; i < count; i++)
        {
            threads.push_back(std::make_unique<Thread>());
        }
    }

    // Wait until all threads have finished their work items
    void wait()
    {
        for (auto &thread : threads)
        {
            thread->wait();
        }
    }
};
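Each Thread keeps the jobs it has to run in a jobQueue. While the jobQueue is empty the thread sleeps, and when a new job is added the thread is woken up to run it.
ThreadPool creates the Threads. In each frame its wait function blocks until every thread has finished all of its jobs.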
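To make the add-then-wait usage concrete, here is a condensed, self-contained variant of the same job-queue idea. `MiniWorker` and `runJobs` are hypothetical names introduced only for this sketch. The sketch also uses `notify_all` rather than `notify_one`, an assumption made here so that the worker and a waiting caller sharing one condition variable cannot miss each other's wakeups.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical condensed worker that mirrors the Thread class above.
class MiniWorker
{
public:
    MiniWorker() : worker_([this] { loop(); }) {}

    ~MiniWorker()
    {
        wait();  // let queued jobs finish before shutting down
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        condition_.notify_all();
        worker_.join();
    }

    void addJob(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        condition_.notify_all();
    }

    // Blocks until the queue is empty, i.e. every job has finished running.
    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        condition_.wait(lock, [this] { return jobs_.empty(); });
    }

private:
    void loop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                condition_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (stopping_) return;
                job = jobs_.front();  // stays queued while it runs
            }
            job();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                jobs_.pop();          // only now does wait() see it as done
            }
            condition_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable condition_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist before the thread starts
};

// Queues `jobCount` increments on one worker and returns the final counter.
int runJobs(int jobCount)
{
    std::atomic<int> counter{0};
    MiniWorker worker;
    for (int i = 0; i < jobCount; ++i)
    {
        worker.addJob([&counter] { counter.fetch_add(1); });
    }
    worker.wait();  // the per-frame synchronization point
    return counter.load();
}
```

The job is popped only after it has run, so an empty queue really means "all work finished" rather than "all work picked up". That property is what makes `wait` usable as the once-per-frame barrier.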
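The data updated by the worker threads is shown below. Each thread has its own VkCommandPool, because a command pool must not be used from several threads at the same time without external synchronization: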
// Per-object data pushed to the vertex shader
struct PushConstantBlock
{
    glm::mat4 mvp;
    glm::vec3 color;
};

// Animation state of a single UFO
struct ObjectData
{
    glm::mat4 model;
    glm::vec3 pos;
    glm::vec3 rotation;
    float rotationDir;
    float rotationSpeed;
    float scale;
    float deltaT;
    float stateT = 0;
    bool visible = true;
};

// Everything one worker thread owns; the vectors hold one entry per UFO
struct ThreadData
{
    VkCommandPool commandPool;
    std::vector<VkCommandBuffer> commandBufferVec;
    std::vector<PushConstantBlock> pushConstantBlockVec;
    std::vector<ObjectData> objectDataVec;
};
Note that all UFOs are drawn with the same shaders, a vertex shader followed by a fragment shader:
#version 450
// Vertex shader

layout (location = 0) in vec3 inPos;
layout (location = 1) in vec3 inNormal;
layout (location = 2) in vec3 inColor;

layout (std140, push_constant) uniform PushConsts
{
    mat4 mvp;
    vec3 color;
} pushConsts;

layout (location = 0) out vec3 outNormal;
layout (location = 1) out vec3 outColor;
layout (location = 3) out vec3 outViewVec;
layout (location = 4) out vec3 outLightVec;

void main()
{
    // Vertices authored as pure red take the per-object color instead
    if ((inColor.r == 1.0) && (inColor.g == 0.0) && (inColor.b == 0.0))
    {
        outColor = pushConsts.color;
    }
    else
    {
        outColor = inColor;
    }

    vec4 pos = pushConsts.mvp * vec4(inPos, 1.0);
    gl_Position = pos;
    outNormal = mat3(pushConsts.mvp) * inNormal;

    // vec3 lPos = ubo.lightPos.xyz;
    vec3 lPos = vec3(0.0);
    outLightVec = lPos - pos.xyz;
    outViewVec = -pos.xyz;
}

#version 450
// Fragment shader

layout (location = 0) in vec3 inNormal;
layout (location = 1) in vec3 inColor;
layout (location = 3) in vec3 inViewVec;
layout (location = 4) in vec3 inLightVec;

layout (location = 0) out vec4 outFragColor;

void main()
{
    vec3 N = normalize(inNormal);
    vec3 L = normalize(inLightVec);
    vec3 V = normalize(inViewVec);
    vec3 R = reflect(-L, N);
    vec3 diffuse = max(dot(N, L), 0.0) * inColor;
    vec3 specular = pow(max(dot(R, V), 0.0), 8.0) * vec3(0.75);
    outFragColor = vec4(diffuse + specular, 1.0);
}
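The different colors in the middle of the UFOs are produced in the vertex shader, which replaces the color of vertices carrying a special marker color (pure red) with the specified color. The fragment shader is a fairly simple Phong shader. Rendering each UFO requires updating PushConsts in the vertex shader: its MVP matrix determines the UFO's position, and color determines the color of the UFO's middle section. Nothing else in the shaders has to be updated at render time, so this part is fairly simple. Push constants are recorded into a CommandBuffer, so the key task of each thread is to update CommandBuffers.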
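Because the C++ struct is copied byte for byte into the push constant block, its layout has to match what the shader expects. The sketch below checks this at compile time. `Mat4` and `Vec3` are stand-ins introduced here on the assumption that `glm::mat4` and `glm::vec3` are tightly packed arrays of 16 and 3 floats, and `pushConstantRangeSize` is a hypothetical helper. The 128-byte figure is the minimum push constant size that Vulkan guarantees on every implementation.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins assumed to match the size of glm::mat4 and glm::vec3.
struct Mat4 { float m[16]; };
struct Vec3 { float v[3]; };

struct PushConstantBlock
{
    Mat4 mvp;    // std140: offset 0, 64 bytes
    Vec3 color;  // std140: a vec3 aligns to 16 bytes, so it lands at offset 64
};

// Vulkan guarantees at least this many bytes of push constant storage.
constexpr std::size_t kMinGuaranteedPushConstantBytes = 128;

static_assert(offsetof(PushConstantBlock, mvp) == 0, "mvp must start the block");
static_assert(offsetof(PushConstantBlock, color) == 64, "color must directly follow the matrix");
static_assert(sizeof(PushConstantBlock) <= kMinGuaranteedPushConstantBytes,
              "block must fit in the guaranteed push constant space");
static_assert(sizeof(PushConstantBlock) % 4 == 0,
              "vkCmdPushConstants requires a size that is a multiple of 4");

// Number of bytes to declare in the push constant range and pass to vkCmdPushConstants.
constexpr std::size_t pushConstantRangeSize()
{
    return sizeof(PushConstantBlock);
}
```

If a second `vec3` were appended to the block, the std140 offset in the shader (80) would no longer match the packed C++ offset (76), and the first `static_assert` on that new member would be the place to catch it.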
void UpdateCommandBuffer(int ind)
{
    Update();

    // Secondary command buffers inherit the render pass and framebuffer of the primary one
    VkCommandBufferInheritanceInfo inheritanceInfo = {};
    inheritanceInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
    inheritanceInfo.renderPass = render_pass_;
    inheritanceInfo.framebuffer = frame_buffer_[ind];

    VkCommandBufferBeginInfo beginInfo = {};
    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    beginInfo.pInheritanceInfo = &inheritanceInfo;
    beginInfo.flags = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;

    // Record the UI into its own secondary command buffer
    vkBeginCommandBuffer(ui_command_buffer_, &beginInfo);

    VkViewport viewport = {};
    viewport.width = static_cast<float>(width_);
    viewport.height = static_cast<float>(height_);
    viewport.minDepth = 0.0f;
    viewport.maxDepth = 1.0f;

    VkRect2D scissor = {};
    scissor.extent.width = width_;
    scissor.extent.height = height_;

    vkCmdSetViewport(ui_command_buffer_, 0, 1, &viewport);
    vkCmdSetScissor(ui_command_buffer_, 0, 1, &scissor);
    vkCmdBindPipeline(ui_command_buffer_, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline_);
    imgui_->draw(ui_command_buffer_);
    vkEndCommandBuffer(ui_command_buffer_);

    VkClearValue clearValues[2];
    clearValues[0].color = { 0.0f, 0.0f, 0.0f, 1.0f };
    clearValues[1].depthStencil = { 1.0f, 0 };

    VkRenderPassBeginInfo renderPassBeginInfo = {};
    renderPassBeginInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO;
    renderPassBeginInfo.renderArea.extent.width = width_;
    renderPassBeginInfo.renderArea.extent.height = height_;
    renderPassBeginInfo.framebuffer = frame_buffer_[ind];
    renderPassBeginInfo.clearValueCount = 2;
    renderPassBeginInfo.pClearValues = clearValues;
    renderPassBeginInfo.renderPass = render_pass_;

    // Primary command buffer: the inheritance info and the continue flag are ignored here
    vkBeginCommandBuffer(draw_command_buffer_[ind], &beginInfo);
    // The render pass contents come exclusively from secondary command buffers
    vkCmdBeginRenderPass(draw_command_buffer_[ind], &renderPassBeginInfo, VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);

    // Hand out one job per UFO; each thread records into its own command buffers
    for (uint32_t t = 0; t < thread_count; t++)
    {
        for (uint32_t i = 0; i < object_count / thread_count; i++)
        {
            thread_pool_.threads[t]->addJob([=] { UpdateThreadData(t, i, inheritanceInfo); });
        }
    }

    // Block until every thread has finished recording
    thread_pool_.wait();

    // Gather all secondary command buffers, with the UI drawn last
    std::vector<VkCommandBuffer> commandBufferVec;
    for (uint32_t t = 0; t < thread_count; t++)
    {
        for (uint32_t i = 0; i < object_count / thread_count; i++)
        {
            commandBufferVec.push_back(threadDataVec[t].commandBufferVec[i]);
        }
    }
    commandBufferVec.push_back(ui_command_buffer_);

    vkCmdExecuteCommands(draw_command_buffer_[ind], static_cast<uint32_t>(commandBufferVec.size()), commandBufferVec.data());
    vkCmdEndRenderPass(draw_command_buffer_[ind]);
    vkEndCommandBuffer(draw_command_buffer_[ind]);
}
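This is the top-level update function called every frame. All the CommandBuffers recorded by the threads are secondary command buffers executed inside a single RenderPass of one large primary CommandBuffer. Each UFO has its own CommandBuffer, and each thread handles several CommandBuffers per frame.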
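One detail of the loops above is that each thread is given `object_count / thread_count` objects, so with integer division any remainder is never assigned to a thread unless the object count is a multiple of the thread count. As a hedged sketch of one way to cover that case, the hypothetical helper `partitionObjects` below spreads the remainder over the first few threads. It is not part of the original program.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A contiguous block of object indices assigned to one thread.
struct Range
{
    uint32_t first;
    uint32_t count;
};

// Splits `objectCount` objects across `threadCount` threads so that every
// object is assigned exactly once and the counts differ by at most one.
std::vector<Range> partitionObjects(uint32_t objectCount, uint32_t threadCount)
{
    assert(threadCount > 0);

    std::vector<Range> ranges;
    ranges.reserve(threadCount);

    const uint32_t base = objectCount / threadCount;
    const uint32_t remainder = objectCount % threadCount;

    uint32_t first = 0;
    for (uint32_t t = 0; t < threadCount; ++t)
    {
        // The first `remainder` threads each take one extra object.
        const uint32_t count = base + (t < remainder ? 1u : 0u);
        ranges.push_back({first, count});
        first += count;
    }
    return ranges;
}
```

With 10 objects and 4 threads this gives counts of 3, 3, 2 and 2, whereas plain integer division would hand out 2 objects per thread and leave two unassigned.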
void UpdateThreadData(uint32_t threadIndex, uint32_t commandBufferIndex, VkCommandBufferInheritanceInfo inheritanceInfo)
{
    ThreadData &threadData = threadDataVec[threadIndex];
    ObjectData &objectData = threadData.objectDataVec[commandBufferIndex];
    VkCommandBuffer cmd = threadData.commandBufferVec[commandBufferIndex];

    // Secondary command buffer that continues the render pass of the primary one
    VkCommandBufferBeginInfo beginInfo = {};
    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    beginInfo.flags = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;
    beginInfo.pInheritanceInfo = &inheritanceInfo;
    vkBeginCommandBuffer(cmd, &beginInfo);

    // State that never changes still has to be recorded again
    VkViewport viewport = {};
    viewport.width = static_cast<float>(width_);
    viewport.height = static_cast<float>(height_);
    viewport.minDepth = 0.0f;
    viewport.maxDepth = 1.0f;

    VkRect2D scissor = {};
    scissor.extent.width = width_;
    scissor.extent.height = height_;

    vkCmdSetViewport(cmd, 0, 1, &viewport);
    vkCmdSetScissor(cmd, 0, 1, &scissor);
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline_);

    // Update Object Data
    objectData.rotation.y += 2.5f * objectData.rotationSpeed * frame_timer;
    if (objectData.rotation.y > 360.0f)
    {
        objectData.rotation.y -= 360.0f;
    }
    objectData.deltaT += 0.15f * frame_timer;
    if (objectData.deltaT > 1.0f)
    {
        objectData.deltaT -= 1.0f;
    }
    objectData.pos.y = sin(glm::radians(objectData.deltaT * 360.0f)) * 2.5f;

    objectData.model = glm::translate(glm::mat4(1.0f), objectData.pos);
    objectData.model = glm::rotate(objectData.model, -sinf(glm::radians(objectData.deltaT * 360.0f)) * 0.25f, glm::vec3(objectData.rotationDir, 0.0f, 0.0f));
    objectData.model = glm::rotate(objectData.model, glm::radians(objectData.rotation.y), glm::vec3(0.0f, objectData.rotationDir, 0.0f));
    objectData.model = glm::rotate(objectData.model, glm::radians(objectData.deltaT * 360.0f), glm::vec3(0.0f, objectData.rotationDir, 0.0f));
    objectData.model = glm::scale(objectData.model, glm::vec3(objectData.scale));

    // Update Push Constant
    threadData.pushConstantBlockVec[commandBufferIndex].mvp = uboVS.projectionMatrix * uboVS.viewMatrix * objectData.model;
    vkCmdPushConstants(cmd,
                       pipeline_layout_,
                       VK_SHADER_STAGE_VERTEX_BIT,
                       0,
                       sizeof(PushConstantBlock),
                       &threadData.pushConstantBlockVec[commandBufferIndex]);

    // Bind the shared UFO mesh and draw it
    VkBuffer vertBuffer = ufo.vertices->GetDesc().buffer;
    VkDeviceSize offset = 0;
    vkCmdBindVertexBuffers(cmd, 0, 1, &vertBuffer, &offset);
    vkCmdBindIndexBuffer(cmd, ufo.indices->GetDesc().buffer, 0, VK_INDEX_TYPE_UINT32);
    vkCmdDrawIndexed(cmd, ufo.indexCount, 1, 0, 0, 0);

    vkEndCommandBuffer(cmd);
}
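The function above is the job that each thread runs, and it is fairly straightforward. After fetching the CommandBuffer that belongs to the UFO, it first updates the UFO's own data and then re-records the CommandBuffer. It may feel as if only the push constant command needs to be recorded again, but all the other, unchanged state has to be recorded again too: the VertexBuffer, IndexBuffer, Scissor, and Viewport. This is where the difference from traditional APIs shows. In D3D11 it is enough to modify one ConstantBuffer, leave everything else untouched, and issue the draw call. In Vulkan a CommandBuffer does not carry over state from other CommandBuffers or from its own previous recording, so as soon as one piece of state in it has to change, every other state has to be set again as well.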
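The point about state can be illustrated with a toy model. This is not the Vulkan API: `MockCommandBuffer` and the two helper functions are hypothetical and only track which pieces of state a recording contains. The assumption modeled here is that beginning a new recording discards whatever the buffer held before.

```cpp
#include <cassert>

// Toy model (not the Vulkan API): records which state a command buffer holds.
struct MockCommandBuffer
{
    bool pipelineBound = false;
    bool vertexBufferBound = false;
    bool indexBufferBound = false;
    bool pushConstantsSet = false;

    // Starting a new recording discards whatever was recorded before.
    void begin() { *this = MockCommandBuffer{}; }

    // An indexed draw only makes sense once all of this state is recorded.
    bool readyToDrawIndexed() const
    {
        return pipelineBound && vertexBufferBound && indexBufferBound && pushConstantsSet;
    }
};

// Re-records a buffer but sets only the per-object data.
bool rerecordPushConstantsOnly(MockCommandBuffer cmd)
{
    cmd.begin();
    cmd.pushConstantsSet = true;
    return cmd.readyToDrawIndexed();
}

// Re-records a buffer with the full state, as UpdateThreadData does.
bool rerecordEverything(MockCommandBuffer cmd)
{
    cmd.begin();
    cmd.pipelineBound = true;
    cmd.vertexBufferBound = true;
    cmd.indexBufferBound = true;
    cmd.pushConstantsSet = true;
    return cmd.readyToDrawIndexed();
}
```

Even when the buffer passed in was fully set up in the previous frame, `rerecordPushConstantsOnly` reports that it is not ready to draw, while `rerecordEverything` does.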
That is the overall structure of the program. For the details, see the source code: https://github.com/syddf/VulkanRenderExample
(Based on Sascha Willems's samples: https://github.com/SaschaWillems/Vulkan)