Metal and Graphics Rendering, Part 9: Multithreaded Rendering

0. Preface

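In the introductory Metal article of this series we mentioned that Metal supports multithreaded work and lets threads share resources, but we only touched on it in passing and never put it into practice. This time we look at how multithreaded rendering is actually done in Metal.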

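Using YUV-to-RGB conversion and bouncing balls as examples, this article introduces multithreaded rendering in Metal. The bouncing-ball effect is shown below.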

1. Textures, Pixels, Threads, and Threadgroups

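A texture can be viewed as a grid. The grid is divided into a number of groups, and each group contains a number of pixels. The GPU computes in parallel, with each thread processing one pixel, so we bundle threads into threadgroups and let a number of threadgroups carry out the rendering work.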

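Multithreaded work runs on an MTLComputePipelineState. When dispatching it through a compute command encoder, you pass the number of threadgroups, threadGroupCount, and the number of threads in each threadgroup, threadGroupSize.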

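MTLComputePipelineState exposes two relevant properties: threadExecutionWidth is the number of threads the GPU executes together (the SIMD width), and maxTotalThreadsPerThreadgroup is the maximum number of threads a threadgroup may contain. Apple recommends making the number of threads in a threadgroup a multiple of threadExecutionWidth, which gives: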

NSUInteger w = pipelineState.threadExecutionWidth;
NSUInteger h = pipelineState.maxTotalThreadsPerThreadgroup / w;
MTLSize threadGroupSize = MTLSizeMake(w, h, 1);

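threadGroupCount can then be computed from the threadGroupSize obtained above. The number of threadgroups per row of the grid should be (outTexture.width - 1) / threadGroupSize.width + 1, and likewise the number per column should be (outTexture.height - 1) / threadGroupSize.height + 1. With integer division this equals the (n + size - 1) / size form used in the code: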

NSUInteger w = pipelineState.threadExecutionWidth;
NSUInteger h = pipelineState.maxTotalThreadsPerThreadgroup / w;
MTLSize threadGroupSize = MTLSizeMake(w, h, 1);

MTLSize threadGroupCount;
threadGroupCount.width = (outTexture.width + threadGroupSize.width - 1) / threadGroupSize.width;
threadGroupCount.height = (outTexture.height + threadGroupSize.height - 1) / threadGroupSize.height;
threadGroupCount.depth = 1;

[computeEncoder dispatchThreadgroups:threadGroupCount threadsPerThreadgroup:threadGroupSize];
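
The two ways of writing the threadgroup count, (n - 1) / size + 1 in the text and (n + size - 1) / size in the code, are both integer ceiling division. As a minimal sketch in plain C (which also compiles inside an Objective-C file), the helpers below reproduce only the arithmetic so it can be checked on the CPU. The names threadgroups_for and threadgroups_for_alt are hypothetical and are not part of Metal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Metal API): number of threadgroups needed to
 * cover `extent` pixels when each threadgroup spans `group` threads. */
static size_t threadgroups_for(size_t extent, size_t group) {
    assert(group > 0);
    return (extent + group - 1) / group;   /* ceiling division */
}

/* The alternative form from the text; valid only for extent >= 1. */
static size_t threadgroups_for_alt(size_t extent, size_t group) {
    assert(group > 0 && extent > 0);
    return (extent - 1) / group + 1;
}
```

Assuming a 1920 x 1080 output and a 32 x 16 threadgroup, both forms give 60 x 68 threadgroups. The last row of groups reaches 68 x 16 = 1088 rows, 8 more than the texture has, which is why a kernel dispatched this way needs a bounds check.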

2. Converting YUV to RGB with Multithreaded Rendering

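In MSL, only kernel functions can be dispatched as compute work across threads. A kernel function is neither a vertex function nor a fragment function, but it can read and write textures and buffers. Here we use a kernel function to convert YUV to RGB.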

kernel void
ycbcrToRgb(texture2d<float, access::read> textureY [[ texture(0) ]],
           texture2d<float, access::read> textureUV [[ texture(1) ]],
           texture2d<float, access::write> outTexture [[ texture(2) ]],
           constant CCAlphaVideoMetalConvertMatrix *convertMatrix [[ buffer(0) ]],
           uint2 gid [[ thread_position_in_grid ]]) {
    if (gid.x >= outTexture.get_width() || gid.y >= outTexture.get_height()) {
        return;
    }
    float y = textureY.read(gid).r;
    float2 uv = textureUV.read(gid / 2).rg;
    float3 rgb = convertMatrix->matrix * (float3(y, uv) + convertMatrix->offset);
    outTexture.write(float4(rgb, 1), gid);
}

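In this kernel function, the thread_position_in_grid attribute gives the position of the current thread within the whole grid. Because the threadgroup count was computed from outTexture, each gid corresponds one-to-one to a pixel of the output texture.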

For example, a gid of (9, 10) corresponds to the pixel at (9, 10) in outTexture.

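Note that the texture's width and height are not necessarily multiples of the threadgroup size, so the dispatched grid can extend past the texture's edges; threads that fall outside the texture must return early.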

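The Y texture has the same width and height as the output texture, while the UV texture is half as wide and half as tall, so the two are read at different coordinates (gid and gid / 2).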

Finally, calling write stores the result in the output texture.
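
To make the per-pixel arithmetic concrete, here is a CPU-side sketch in plain C of what one thread of the kernel computes. The article does not list the values in convertMatrix or its offset, so full-range BT.601 coefficients with an offset of (0, -0.5, -0.5) are assumed here; the real values come from the host and may differ. The names RGB, ycbcr_to_rgb, and chroma_coord are hypothetical.

```c
#include <assert.h>

typedef struct { float r, g, b; } RGB;

/* Assumed coefficients: full-range BT.601. The article's convertMatrix and
 * offset are supplied by the host and may differ (e.g. BT.709, video range). */
static RGB ycbcr_to_rgb(float y, float cb, float cr) {
    assert(y >= 0.0f && y <= 1.0f);
    /* Same shape as the kernel: add the offset first, then apply the matrix. */
    const float u = cb - 0.5f;
    const float v = cr - 0.5f;
    RGB out;
    out.r = y + 1.402f * v;
    out.g = y - 0.344136f * u - 0.714136f * v;
    out.b = y + 1.772f * u;
    return out;
}

/* Chroma is stored at half resolution, so a 2x2 block of output pixels
 * shares one CbCr sample, mirroring textureUV.read(gid / 2). */
static unsigned chroma_coord(unsigned luma_coord) {
    return luma_coord / 2;
}
```

With these assumed coefficients, neutral chroma (Cb = Cr = 0.5) leaves the luma value unchanged in all three channels, which is a quick sanity check for any conversion matrix.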

The corresponding Objective-C code is as follows:

- (void)computePipeline:(id<MTLComputePipelineState>)pipelineState
          inputTextures:(NSArray<HobenMetalTexture *> *)inputTextures
                buffers:(nullable NSArray<id<MTLBuffer>> *)buffers
             outTexture:(id<MTLTexture>)outTexture
          commandBuffer:(id<MTLCommandBuffer>)commandBuffer {
    id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
    [computeEncoder setComputePipelineState:pipelineState];
    
    // Input textures occupy indices 0..count-1; the output texture follows them.
    // The element type HobenMetalTexture is assumed from its use elsewhere in this article.
    for (NSUInteger i = 0; i < inputTextures.count; i++) {
        [computeEncoder setTexture:inputTextures[i].texture atIndex:i];
    }
    [computeEncoder setTexture:outTexture atIndex:inputTextures.count];
    
    for (NSUInteger i = 0; i < buffers.count; i++) {
        id<MTLBuffer> buffer = buffers[i];
        [computeEncoder setBuffer:buffer offset:0 atIndex:i];
    }
    
    NSUInteger w = pipelineState.threadExecutionWidth;
    NSUInteger h = pipelineState.maxTotalThreadsPerThreadgroup / w;
    MTLSize threadGroupSize = MTLSizeMake(w, h, 1);
    
    MTLSize threadGroupCount;
    threadGroupCount.width = (outTexture.width + threadGroupSize.width - 1) / threadGroupSize.width;
    threadGroupCount.height = (outTexture.height + threadGroupSize.height - 1) / threadGroupSize.height;
    threadGroupCount.depth = 1;
    
    [computeEncoder dispatchThreadgroups:threadGroupCount threadsPerThreadgroup:threadGroupSize];
    [computeEncoder endEncoding];
}

3. Rendering Bouncing Balls

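Besides modifying the contents of a texture, a kernel function can also modify the contents of a buffer, which can then be fed into a vertex or fragment shader.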

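In the bouncing-ball renderer, the kernel function computes each ball's speed and position; the vertex function takes that result, normalizes the position, and sets the ball's size; the fragment function renders the ball's color.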

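We can therefore define the data for a ball. First comes the data the kernel and vertex functions compute with, which is shared between the Objective-C layer and the MSL layer: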

typedef struct {
    vector_float2 position;
    float speed;
    float size;
} BallData;

The rasterizer data needs to define the ball's vertex position, its size, and the color passed to the fragment shader:

typedef struct {
    float4 position [[ position ]];
    float size [[point_size]];
    float4 color;
} BallRasterizerData;

In the kernel function we compute the next frame's position and speed from the ball's position and speed in the previous frame:

kernel void
bouncingBallCompute(device BallData *data [[ buffer(0) ]],
                    uint gid [[ thread_position_in_grid ]]) {
    constexpr float tickTime = 0.16;  // simulation time step per frame
    constexpr float g = 9.8;  // gravitational acceleration

    auto d = data[gid];

    d.position.y = d.position.y + d.speed * tickTime;

    if (d.position.y >= kHobenBallOutTextureHeight - d.size / 2) {
        // Hit the floor: clamp the position and bounce back with 90% of the speed
        d.position.y = kHobenBallOutTextureHeight - d.size / 2;
        d.speed = -d.speed * 0.9;
    }

    // Make the acceleration differ between balls of different sizes
    float drag = (abs(d.speed) * d.size) / 100;

    d.speed = d.speed + tickTime * (g - drag);
    data[gid] = d;
}
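
The update rule is easy to check on the CPU. The sketch below mirrors bouncingBallCompute for a single ball in plain C. The names Ball and step_ball are hypothetical, and floor_y stands in for kHobenBallOutTextureHeight, whose value the article does not give.

```c
#include <assert.h>

typedef struct {
    float x, y;    /* position in pixels, y grows downward */
    float speed;   /* vertical speed, positive means falling */
    float size;    /* diameter in pixels */
} Ball;

/* CPU mirror of bouncingBallCompute for a single ball. `floor_y` stands in
 * for kHobenBallOutTextureHeight, whose value the article does not give. */
static Ball step_ball(Ball b, float floor_y) {
    const float tick = 0.16f;   /* time step per frame, as in the kernel */
    const float g = 9.8f;       /* gravitational acceleration */
    assert(b.size > 0.0f);

    b.y += b.speed * tick;
    if (b.y >= floor_y - b.size / 2) {
        b.y = floor_y - b.size / 2;   /* clamp to the floor */
        b.speed = -b.speed * 0.9f;    /* bounce, losing 10% of the speed */
    }
    const float abs_speed = b.speed < 0.0f ? -b.speed : b.speed;
    const float drag = abs_speed * b.size / 100.0f;
    b.speed += tick * (g - drag);
    return b;
}
```

Because the drag term scales with the ball's size, larger balls accelerate more slowly at the same speed, which is what makes the balls drift apart over time.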

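In the vertex function we take the computed position and the size supplied by the host, normalize the position, and set the point size. While we are at it, we give balls of different sizes different colors: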

vertex BallRasterizerData
bouncingBallVertex(const device BallData *data [[ buffer(0) ]],
                   uint vid [[ vertex_id ]],
                   uint instance [[ instance_id ]]) {
    BallRasterizerData vout;
    vout.position = float4(0, 0, 0, 1);
    
    auto d = data[instance];
    float2 p = d.position / float2(kHobenBallOutTextureWidth, kHobenBallOutTextureHeight);
    
    // Flip y to match normalized device coordinates, where y points up
    p.y = 1 - p.y;
    vout.position.xy = (p - 0.5) * 2;
    
    vout.size = d.size;
    vout.color = float4(1.0, d.size / 48, 1 - d.size / 48, 1.0);
    return vout;
}
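
The position math in the vertex function is a plain change of coordinate system, sketched here in C so the corner cases can be checked. Vec2 and pixel_to_ndc are hypothetical names, and the width and height arguments stand in for kHobenBallOutTextureWidth and kHobenBallOutTextureHeight.

```c
#include <assert.h>

typedef struct { float x, y; } Vec2;

/* Map a pixel position (origin top-left, y down) to Metal's normalized
 * device coordinates (origin at the center, y up, range [-1, 1]). */
static Vec2 pixel_to_ndc(Vec2 p, float width, float height) {
    assert(width > 0.0f && height > 0.0f);
    Vec2 n = { p.x / width, p.y / height };   /* normalize to [0, 1] */
    n.y = 1.0f - n.y;                          /* flip the y axis */
    n.x = (n.x - 0.5f) * 2.0f;                 /* [0, 1] -> [-1, 1] */
    n.y = (n.y - 0.5f) * 2.0f;
    return n;
}
```

The top-left pixel maps to (-1, 1), the bottom-right corner to (1, -1), and the center of the texture to (0, 0).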

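The fragment function only has to make the ball round. point_coord gives the position of the current fragment within the point, with each component in the range 0 to 1.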

fragment float4
bouncingBallFragment(BallRasterizerData vertexIn [[ stage_in ]],
                     float2 pointCoor [[ point_coord ]]) {

    if (distance(pointCoor, float2(0.5, 0.5)) >= 0.5) {
        // Discard fragments outside the inscribed circle so the point is drawn round
        discard_fragment();
    }
    return vertexIn.color;
}
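
The discard test can be sketched on the CPU as follows. This is an illustration in plain C rather than shader code, and outside_ball is a hypothetical name. It compares squared distances, which gives the same result as the distance call in the fragment function without needing a square root.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when a fragment at point-sprite coordinate (u, v), each in
 * [0, 1], lies on or outside the inscribed circle and should be discarded. */
static bool outside_ball(float u, float v) {
    assert(u >= 0.0f && u <= 1.0f && v >= 0.0f && v <= 1.0f);
    const float dx = u - 0.5f;
    const float dy = v - 0.5f;
    /* distance >= 0.5 is equivalent to distance^2 >= 0.25. */
    return dx * dx + dy * dy >= 0.25f;
}
```

The center of the point is kept, while its corners and the midpoints of its edges are discarded, leaving a disc.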

With the MSL for the balls in place, let's look at the rendering on the application side.

First we initialize the ball data. Here we create 1024 balls, each with a random size and position:

static const NSInteger HobenMetalBouncingBallCount = 1024;

@interface HobenMetalBouncingBallOutput() {
    BallData _ballDatas[HobenMetalBouncingBallCount];
}

...

- (instancetype)initWithRenderContext:(HobenMetalRenderContext *)renderContext {
    if (self = [super initWithRenderContext:renderContext]) {
        for (int i = 0; i < HobenMetalBouncingBallCount; i++) {
            BallData ballData;
            
            ballData.size = 8 + arc4random() % 28;
            ballData.speed = 0;
            vector_float2 position = {24 + arc4random() % (kHobenBallOutTextureWidth - 48), 24 + arc4random() % 400};
            ballData.position = position;
            _ballDatas[i] = ballData;
        }
        // Create the shared buffer once, after every ball has been initialized,
        // instead of allocating a new buffer on each loop iteration.
        _buffer = [_renderContext.device newBufferWithBytes:_ballDatas
                                                     length:sizeof(BallData) * HobenMetalBouncingBallCount
                                                    options:MTLResourceStorageModeShared];
    }
    return self;
}

Because the balls are stored in a one-dimensional array and each ball is drawn from a single vertex, threadGroupSize and threadGroupCount only need one dimension:

if (!_outputTexture) {
    _outputTexture = [HobenMetalTexture defaultTextureByWidth:kHobenBallOutTextureWidth height:kHobenBallOutTextureHeight];
}
    
id<MTLCommandBuffer> commandBuffer = _renderContext.commandBuffer;
    
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
id<MTLComputePipelineState> computePipeline = [_renderContext computePipelineStateWithName:@"bouncingBallCompute"];
[computeEncoder setComputePipelineState:computePipeline];

[computeEncoder setBuffer:_buffer offset:0 atIndex:0];

MTLSize threadGroupSize = MTLSizeMake(computePipeline.threadExecutionWidth, 1, 1);

NSUInteger w = (HobenMetalBouncingBallCount + threadGroupSize.width - 1) / threadGroupSize.width;
MTLSize threadGroupCount = MTLSizeMake(w, 1, 1);

[computeEncoder dispatchThreadgroups:threadGroupCount threadsPerThreadgroup:threadGroupSize];
[computeEncoder endEncoding];
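
bouncingBallCompute indexes the buffer with gid and has no bounds check. That is safe only while the ball count is an exact multiple of the threadgroup width. The plain C sketch below counts how many extra threads a dispatch of whole threadgroups launches; excess_threads is a hypothetical helper, and a width of 32 is assumed only as an example value for threadExecutionWidth.

```c
#include <assert.h>
#include <stddef.h>

/* Threads launched beyond `count` items when dispatching whole threadgroups
 * of `group` threads each. A nonzero result means the kernel must check
 * gid < count before touching the buffer. */
static size_t excess_threads(size_t count, size_t group) {
    assert(group > 0);
    size_t groups = (count + group - 1) / group;   /* threadgroup count */
    return groups * group - count;
}
```

With 1024 balls and a width of 32 there are no extra threads. With 1000 balls the same dispatch would launch 24 threads past the end of the buffer, so the kernel would need the ball count passed in and an early return for gid values at or beyond it.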

The draw call now uses MTLPrimitiveTypePoint. The key part is the instanceCount parameter, which says how many points to draw and corresponds to the [[ instance_id ]] attribute in MSL:

MTLRenderPassDescriptor *renderPass = [MTLRenderPassDescriptor renderPassDescriptor];
renderPass.colorAttachments[0].texture = _outputTexture;
renderPass.colorAttachments[0].clearColor = MTLClearColorMake(0, 0, 0, 0);
renderPass.colorAttachments[0].storeAction = MTLStoreActionStore;
renderPass.colorAttachments[0].loadAction = MTLLoadActionClear;

id<MTLRenderCommandEncoder> renderEncoder = [commandBuffer renderCommandEncoderWithDescriptor:renderPass];

[renderEncoder setRenderPipelineState:[_renderContext pipelineStateWithVertexName:@"bouncingBallVertex" fragmentName:@"bouncingBallFragment"]];
[renderEncoder setVertexBuffer:_buffer offset:0 atIndex:0];
[renderEncoder drawPrimitives:MTLPrimitiveTypePoint vertexStart:0 vertexCount:1 instanceCount:HobenMetalBouncingBallCount];
[renderEncoder endEncoding];

Finally, the main project wires up the render chain and triggers a render on every frame callback, and we are done:

- (void)viewDidLoad {
    [super viewDidLoad];
        
    _renderContext = [[HobenMetalRenderContext alloc] init];

    self.view.backgroundColor = [UIColor grayColor];
    
    CGRect frame = self.view.frame;
    
    HobenMetalRenderView *renderView = [[HobenMetalRenderView alloc] initWithRenderContext:_renderContext];
    renderView.backgroundColor = [UIColor blackColor];
    renderView.frame = CGRectMake(0, 200, frame.size.width, frame.size.height - 400);
    renderView.userInteractionEnabled = NO;
    renderView.paused = NO;
    renderView.delegate = self;
    [self.view addSubview:renderView];
    self.renderView = renderView;
    
    HobenMetalBouncingBallOutput *balloutPut = [[HobenMetalBouncingBallOutput alloc] initWithRenderContext:_renderContext];
    [balloutPut addTarget:self.renderView];
    self.ballOutput = balloutPut;
}

- (void)drawInMTKView:(MTKView *)view {
    [self.ballOutput beginRender];
}

4. Cooperation Between Multiple Command Encoders

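You may have noticed in the ball rendering above that we used an MTLComputeCommandEncoder for the multithreaded work, which corresponds to the kernel function, and an MTLRenderCommandEncoder for the rendering, which corresponds to the vertex and fragment functions. Once both encoders have finished encoding, the MTLCommandBuffer is committed.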

In other words, one MTLCommandBuffer can contain several encoders of different types, and the developer decides whether each encoder performs compute or render work.

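Once the developer has finished encoding an MTLCommandBuffer on the CPU and committed it, it is submitted to its MTLCommandQueue, and from there the work is carried out by the GPU.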

The figure below uses a more concrete example to show how MTLCommandEncoder, MTLCommandBuffer, and MTLCommandQueue relate:

5. Summary

Multithreaded rendering in Metal is built on threadgroups: a texture can be divided into a number of groups, and each threadgroup handles one group.

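A threadgroup contains a number of threads, each responsible for one pixel. The thread_position_in_grid attribute gives the position of the current thread, and therefore of the pixel it processes, within the whole grid.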

A kernel function can modify textures and buffer data, which the vertex and fragment functions then consume.

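Different encoders do different jobs. Kernel functions belong to the compute encoder, which is placed in a command buffer together with other encoders. A command queue holds multiple command buffers, which are ultimately handed to the GPU for execution.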
