Metal Official Documentation - Performing Calculations on a GPU

Example In GitHub

After reading a few Metal articles by more experienced developers, I have a rough picture of Metal, so today I'm starting on the official documentation. My English vocabulary is limited, so I lean on Baidu Translate (the company network restricts Google Translate, and Baidu turned out to be surprisingly usable). I didn't write down some of the earlier material (all of my notes can be found in Example In GitHub). Since I'll be coming back to review this later, I'm recording the rest here to avoid translating things twice; it also slows me down and gives my brain time to absorb the material.

Following along with the official demo really is reassuring; there's no risk of wandering off track.

Get a Reference to the Metal Function

The first thing the initializer does is load the function and prepare it to run on the GPU. When you build the app, Xcode compiles the add_arrays function and adds it to a default Metal library that it embeds in the app. You use MTLLibrary and MTLFunction objects to get information about Metal libraries and the functions contained in them. To get an object representing the add_arrays function, ask the MTLDevice to create a MTLLibrary object for the default library, and then ask the library for a MTLFunction object that represents the shader function.

Notes:
Functions defined in .metal files are compiled into the default Metal library at build time;
you retrieve them later through MTLLibrary and MTLFunction.

- (instancetype) initWithDevice: (id<MTLDevice>) device
{
    self = [super init];
    if (self)
    {
        _mDevice = device;

        NSError* error = nil;

        // Load the shader files with a .metal file extension in the project

        id<MTLLibrary> defaultLibrary = [_mDevice newDefaultLibrary];
        if (defaultLibrary == nil)
        {
            NSLog(@"Failed to find the default library.");
            return nil;
        }

        id<MTLFunction> addFunction = [defaultLibrary newFunctionWithName:@"add_arrays"];
        if (addFunction == nil)
        {
            NSLog(@"Failed to find the adder function.");
            return nil;
        }

Prepare a Metal Pipeline

The function object is a proxy for the MSL function, but it’s not executable code. You convert the function into executable code by creating a pipeline. A pipeline specifies the steps that the GPU performs to complete a specific task. In Metal, a pipeline is represented by a pipeline state object. Because this sample uses a compute function, the app creates a MTLComputePipelineState object.

_mAddFunctionPSO = [_mDevice newComputePipelineStateWithFunction: addFunction error:&error];

A compute pipeline runs a single compute function, optionally manipulating the input data before running the function, and the output data afterwards.
When you create a pipeline state object, the device object finishes compiling the function for this specific GPU. This sample creates the pipeline state object synchronously and returns it directly to the app. Because compiling does take a while, avoid creating pipeline state objects synchronously in performance-sensitive code.

Notes:
Compilation takes time, so create the pipeline state object once, during initialization.
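If even the one-time synchronous compile is undesirable, Metal also offers an asynchronous variant. A hedged Objective-C sketch (error handling kept minimal; this is not part of the sample project):

```objectivec
// Asynchronous alternative: compile the pipeline off the critical path.
// The completion handler runs once the device finishes compiling.
[_mDevice newComputePipelineStateWithFunction:addFunction
                            completionHandler:^(id<MTLComputePipelineState> pso,
                                                NSError *error)
{
    if (pso == nil)
    {
        NSLog(@"Failed to create pipeline state object: %@", error);
        return;
    }
    self->_mAddFunctionPSO = pso;
}];
```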

Note
All of the objects returned by Metal in the code you’ve seen so far are returned as objects that conform to protocols. Metal defines most GPU-specific objects using protocols to abstract away the underlying implementation classes, which may vary for different GPUs. Metal defines GPU-independent objects using classes. The reference documentation for any given Metal protocol makes it clear whether you can implement that protocol in your app.

Create a Command Queue

To send work to the GPU, you need a command queue. Metal uses command queues to schedule commands. Create a command queue by asking the MTLDevice for one.

_mCommandQueue = [_mDevice newCommandQueue];

Create Data Buffers and Load Data

After initializing the basic Metal objects, you load data for the GPU to execute. This task is less performance critical, but still useful to do early in your app’s launch.
A GPU can have its own dedicated memory, or it can share memory with the operating system. Metal and the operating system kernel need to perform additional work to let you store data in memory and make that data available to the GPU. Metal abstracts this memory management using resource objects. (MTLResource). A resource is an allocation of memory that the GPU can access when running commands. Use a MTLDevice to create resources for its GPU.
The sample app creates three buffers and fills the first two with random data. The third buffer is where add_arrays will store its results.

_mBufferA = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
_mBufferB = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
_mBufferResult = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];

[self generateRandomFloatData:_mBufferA];
[self generateRandomFloatData:_mBufferB];

The resources in this sample are (MTLBuffer) objects, which are allocations of memory without a predefined format. Metal manages each buffer as an opaque collection of bytes. However, you specify the format when you use a buffer in a shader. This means that your shaders and your app need to agree on the format of any data being passed back and forth.

When you allocate a buffer, you provide a storage mode to determine some of its performance characteristics and whether the CPU or GPU can access it. The sample app uses shared memory (storageModeShared), which both the CPU and GPU can access.

To fill a buffer with random data, the app gets a pointer to the buffer’s memory and writes data to it on the CPU. The add_arrays function in Listing 2 declared its arguments as arrays of floating-point numbers, so you provide buffers in the same format:

- (void) generateRandomFloatData: (id<MTLBuffer>) buffer
{
    float* dataPtr = buffer.contents;

    for (unsigned long index = 0; index < arrayLength; index++)
    {
        dataPtr[index] = (float)rand()/(float)(RAND_MAX);
    }
}
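The fill pattern itself is plain C and can be exercised on the CPU alone. A minimal sketch, with a malloc-able float array standing in for the buffer's `contents` pointer (the function name is mine, not Metal API):

```c
#include <stdlib.h>

// Fill a plain float array the same way the sample fills the Metal buffer:
// rand()/RAND_MAX yields values in the range [0.0, 1.0].
void fill_random_floats(float *dataPtr, unsigned long length)
{
    for (unsigned long index = 0; index < length; index++)
    {
        dataPtr[index] = (float)rand() / (float)RAND_MAX;
    }
}
```

Because the shader declared its arguments as `device const float*`, any array filled this way is in exactly the format add_arrays expects.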

Create a Command Buffer

Ask the command queue to create a command buffer.

id<MTLCommandBuffer> commandBuffer = [_mCommandQueue commandBuffer];

Create a Command Encoder

To write commands into a command buffer, you use a command encoder for the specific kind of commands you want to code. This sample creates a compute command encoder, which encodes a compute pass. A compute pass holds a list of commands that execute compute pipelines. Each compute command causes the GPU to create a grid of threads to execute on the GPU.

id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];

To encode a command, you make a series of method calls on the encoder. Some methods set state information, like the pipeline state object (PSO) or the arguments to be passed to the pipeline. After you make those state changes, you encode a command to execute the pipeline. The encoder writes all of the state changes and command parameters into the command buffer.

(figure omitted)

Set Pipeline State and Argument Data

Set the pipeline state object of the pipeline you want the command to execute. Then set data for any arguments that the pipeline needs to send into the add_arrays function. For this pipeline, that means providing references to three buffers. Metal automatically assigns indices for the buffer arguments in the order that the arguments appear in the function declaration in Listing 2, starting with 0. You provide arguments using the same indices.

[computeEncoder setComputePipelineState:_mAddFunctionPSO];
[computeEncoder setBuffer:_mBufferA offset:0 atIndex:0];
[computeEncoder setBuffer:_mBufferB offset:0 atIndex:1];
[computeEncoder setBuffer:_mBufferResult offset:0 atIndex:2];

You also specify an offset for each argument. An offset of 0 means the command will access the data from the beginning of a buffer. However, you could use one buffer to store multiple arguments, specifying an offset for each argument.
You don’t specify any data for the index argument because the add_arrays function defined its values as being provided by the GPU.

Specify Thread Count and Organization

Next, decide how many threads to create and how to organize those threads. Metal can create 1D, 2D, or 3D grids. The add_arrays function uses a 1D array, so the sample creates a 1D grid of size (dataSize x 1 x 1), from which Metal generates indices between 0 and dataSize-1.

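In the sample project this grid is a single line of Objective-C, one thread per array element (`arrayLength` is the element count used throughout):

```objectivec
// 1D grid: width = arrayLength, height = depth = 1.
// Metal generates thread_position_in_grid values 0 through arrayLength - 1.
MTLSize gridSize = MTLSizeMake(arrayLength, 1, 1);
```

This `gridSize` is what the dispatchThreads call below passes as its first argument.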
Specify Threadgroup Size

Metal subdivides the grid into smaller grids called threadgroups. Each threadgroup is calculated separately. Metal can dispatch threadgroups to different processing elements on the GPU to speed up processing. You also need to decide how large to make the threadgroups for your command.

NSUInteger threadGroupSize = _mAddFunctionPSO.maxTotalThreadsPerThreadgroup;
if (threadGroupSize > arrayLength)
{
    threadGroupSize = arrayLength;
}
MTLSize threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);

The app asks the pipeline state object for the largest possible threadgroup and shrinks it if that size is larger than the size of the data set. The maxTotalThreadsPerThreadgroup property gives the maximum number of threads allowed in the threadgroup, which varies depending on the complexity of the function used to create the pipeline state object.

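The clamping above, and the way the grid is later split into threadgroups, come down to simple integer arithmetic. A C sketch of both (the function names are mine, not Metal API):

```c
// Clamp the threadgroup width to the data size, as the sample does with
// maxTotalThreadsPerThreadgroup.
unsigned long clamped_threadgroup_size(unsigned long maxPerThreadgroup,
                                       unsigned long dataLength)
{
    return (maxPerThreadgroup > dataLength) ? dataLength : maxPerThreadgroup;
}

// dispatchThreads must cover the whole grid, so conceptually Metal needs
// ceil(gridWidth / threadgroupWidth) threadgroups along the dimension;
// with dispatchThreads the last group may be only partially filled.
unsigned long threadgroups_needed(unsigned long gridWidth,
                                  unsigned long threadgroupWidth)
{
    return (gridWidth + threadgroupWidth - 1) / threadgroupWidth;
}
```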
Encode the Compute Command to Execute the Threads

Finally, encode the command to dispatch the grid of threads.

[computeEncoder dispatchThreads:gridSize
          threadsPerThreadgroup:threadgroupSize];

When the GPU executes this command, it uses the state you previously set and the command’s parameters to dispatch threads to perform the computation.

You can follow the same steps using the encoder to encode multiple compute commands into the compute pass without performing any redundant steps. For example, you might set the pipeline state object once, and then set arguments and encode a command for each collection of buffers to process.

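Encoding two additions back to back might look like this sketch; `_mBufferC`, `_mBufferD`, and `_mBufferResult2` are hypothetical names I've introduced, not part of the sample:

```objectivec
// Set the pipeline state once...
[computeEncoder setComputePipelineState:_mAddFunctionPSO];

// ...then set arguments and encode one command per set of buffers.
[computeEncoder setBuffer:_mBufferA offset:0 atIndex:0];
[computeEncoder setBuffer:_mBufferB offset:0 atIndex:1];
[computeEncoder setBuffer:_mBufferResult offset:0 atIndex:2];
[computeEncoder dispatchThreads:gridSize threadsPerThreadgroup:threadgroupSize];

[computeEncoder setBuffer:_mBufferC offset:0 atIndex:0];
[computeEncoder setBuffer:_mBufferD offset:0 atIndex:1];
[computeEncoder setBuffer:_mBufferResult2 offset:0 atIndex:2];
[computeEncoder dispatchThreads:gridSize threadsPerThreadgroup:threadgroupSize];
```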
End the Compute Pass

When you have no more commands to add to the compute pass, you end the encoding process to close out the compute pass.

[computeEncoder endEncoding];

Commit the Command Buffer to Execute Its Commands

Run the commands in the command buffer by committing the command buffer to the queue.

[commandBuffer commit];

The command queue created the command buffer, so committing the buffer always places it on that queue. After you commit the command buffer, Metal asynchronously prepares the commands for execution and then schedules the command buffer to execute on the GPU. After the GPU executes all the commands in the command buffer, Metal marks the command buffer as complete.

Wait for the Calculation to Complete

Your app can do other work while the GPU is processing your commands. This sample doesn’t need to do any additional work, so it simply waits until the command buffer is complete.

[commandBuffer waitUntilCompleted];

Alternatively, to be notified when Metal has processed all of the commands, add a completion handler to the command buffer (addCompletedHandler(_:)), or check the status of a command buffer by reading its status property.

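A sketch of the completion-handler alternative (handlers must be registered before the command buffer is committed, and the handler runs on a Metal-owned thread, so anything it touches must be thread-safe):

```objectivec
// Register the handler first, then commit.
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb)
{
    if (cb.status == MTLCommandBufferStatusCompleted)
    {
        NSLog(@"GPU work finished.");
    }
    else
    {
        NSLog(@"Command buffer error: %@", cb.error);
    }
}];
[commandBuffer commit];
```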
Read the Results From the Buffer

After the command buffer completes, the GPU’s calculations are stored in the output buffer and Metal performs any necessary steps to make sure the CPU can see them. In a real app, you would read the results from the buffer and do something with them, such as displaying the results onscreen or writing them to a file. Because the calculations are only used to illustrate the process of creating a Metal app, the sample reads the values stored in the output buffer and tests to make sure the CPU and the GPU calculated the same results.

- (void) verifyResults
{
    float* a = _mBufferA.contents;
    float* b = _mBufferB.contents;
    float* result = _mBufferResult.contents;

    for (unsigned long index = 0; index < arrayLength; index++)
    {
        if (result[index] != (a[index] + b[index]))
        {
            printf("Compute ERROR: index=%lu result=%g vs %g=a+b\n",
                   index, result[index], a[index] + b[index]);
            assert(result[index] == (a[index] + b[index]));
        }
    }
    printf("Compute results as expected\n");
}
