Arm Mali Bifrost and Valhall OpenCL Developer Guide - Chapter 9 - OpenCL optimizations list

Arm Mali Bifrost and Valhall OpenCL Developer Guide - Version 4.0
https://developer.arm.com/documentation/101574/0400

9. OpenCL optimizations list

This chapter lists optimizations to apply when writing OpenCL code for Mali GPUs.

It contains the following sections:

  • 9.1 General optimizations.
  • 9.2 Kernel optimizations.
  • 9.3 Code optimizations.
  • 9.4 Execution optimizations.
  • 9.5 Reducing the effect of serial computations.
  • 9.6 Mali Bifrost and Valhall GPU specific optimizations.

9.1 General optimizations

Arm recommends general optimizations such as processing large amounts of data, using the correct data types, and compiling kernels once.

  • Use the best processor for the job

GPUs are designed for parallel processing.

Application processors are designed for high-speed serial computations.

All applications contain sections that perform control functions and sections that perform computation. For optimal performance, use the best processor for each task:

  1. Control and serial functions are best performed on an application processor using a traditional language.
  2. Use OpenCL on Mali GPUs for the parallelizable compute functions.

  • Compile the kernel once at the start of your application

Ensure that you compile the kernel once at the start of your application, not each time you run it. This can significantly reduce the fixed overhead.

  • Enqueue many work-items

To make full use of all your processor or shader cores, you must enqueue many work-items.

  • Process large amounts of data

You must process a relatively large amount of data to benefit from OpenCL, because starting OpenCL tasks carries fixed overheads. The exact data set size at which you start to see benefits depends on the processors that run your OpenCL code.

For example, simple image processing on a single 640x480 image is unlikely to be faster on a GPU, whereas processing a 1920x1080 image is more likely to be beneficial. Benchmarking a GPU with small images is likely to measure only the start-up time of the driver.

Do not extrapolate these results to estimate the performance of processing a larger data set. Run the benchmark on a data size that is representative of your application.
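The fixed-overhead argument can be made concrete with a simple linear cost model. This is an illustrative assumption, not a formula from the Arm guide: T_0 stands for the fixed cost of starting the OpenCL task, t_cpu and t_gpu are hypothetical per-element processing times, and n is the number of elements.

```latex
\begin{aligned}
T_{\mathrm{cpu}}(n) &= n\,t_{\mathrm{cpu}} \\
T_{\mathrm{gpu}}(n) &= T_0 + n\,t_{\mathrm{gpu}} \\
T_{\mathrm{gpu}}(n) < T_{\mathrm{cpu}}(n) &\iff n > \frac{T_0}{t_{\mathrm{cpu}} - t_{\mathrm{gpu}}}
\qquad (t_{\mathrm{cpu}} > t_{\mathrm{gpu}})
\end{aligned}
```

Under this model, a run on a small data set is dominated by T_0, which is why results from small images say little about larger workloads. Real overheads and per-element costs depend on the device and driver, so measure them rather than assume them.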

  • Align data on 128-bit (16-byte) boundaries

Align data on 128-bit (16-byte) boundaries. This can improve the speed of loading and storing data. If you can, align data on 64-byte boundaries. This ensures that data fits evenly into the cache on Mali GPUs.
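As a minimal host-side sketch of the alignment advice, the following C11 code allocates a buffer on a 64-byte boundary, which also satisfies the 16-byte requirement. The helper names (alloc_aligned_buffer, is_buffer_aligned, round_up_to_alignment) are hypothetical and chosen for illustration, and the sketch assumes a C11 library that provides aligned_alloc.

```c
#include <stdint.h>
#include <stdlib.h>

/* 64 bytes: the alignment the text recommends when possible. */
#define BUFFER_ALIGNMENT 64u

/* Round size up to the next multiple of BUFFER_ALIGNMENT.
 * C11 aligned_alloc expects the size to be a multiple of the alignment. */
static size_t round_up_to_alignment(size_t size)
{
    return (size + BUFFER_ALIGNMENT - 1u) / BUFFER_ALIGNMENT * BUFFER_ALIGNMENT;
}

/* Hypothetical helper: allocate a host buffer whose start address is
 * 64-byte aligned. Returns NULL on failure; release it with free(). */
void *alloc_aligned_buffer(size_t size)
{
    return aligned_alloc(BUFFER_ALIGNMENT, round_up_to_alignment(size));
}

/* Returns 1 if ptr sits on a 64-byte boundary, 0 otherwise. */
int is_buffer_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % BUFFER_ALIGNMENT) == 0;
}
```

A caller would request the buffer with alloc_aligned_buffer(size), check the result for NULL, and release it with free() when it is no longer needed.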

  • Use the correct data types

Check each variable to see what range it requires.

Using smaller data types has several advantages:

  1. More operations can be performed per cycle with smaller variables.
  2. You can load or store more in a single cycle.
  3. If you store your data in smaller containers, it is more cacheable.

If accuracy is not critical, check whether a short, ushort, or char can be used in place of an int.

For example, if you add two relatively small numbers, you probably do not require an int. However, check whether an overflow might occur.

Only use float values if you require their additional range, for example, if you require very small or very large numbers.
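The range check described above can be sketched in plain C with fixed-width types, where uint8_t and uint16_t play the role of the OpenCL C types uchar and ushort. The function names are hypothetical. The point is only that the sum of two 8-bit values needs 9 bits, so a 16-bit result is sufficient, and that storing it back into 8 bits requires an explicit overflow policy.

```c
#include <stdint.h>

/* Sum of two 8-bit pixels: the largest result is 255 + 255 = 510,
 * which needs 9 bits, so a 16-bit result is enough and a 32-bit int
 * is not required. */
uint16_t add_pixels(uint8_t a, uint8_t b)
{
    return (uint16_t)((uint16_t)a + (uint16_t)b);
}

/* The same sum stored back into 8 bits would overflow, so this
 * variant clamps (saturates) at 255 instead of wrapping around. */
uint8_t add_pixels_saturated(uint8_t a, uint8_t b)
{
    uint16_t sum = add_pixels(a, b);
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}
```

Inside an OpenCL C kernel, the built-in add_sat() provides the saturating behavior directly for integer types.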

  • Use the right data types

You can store image and other data as images or as buffers:

  1. If your algorithm can be vectorized, use buffers.
  2. If your algorithm requires interpolation or automatic edge clamping, use images.

  • Do not merge buffers as an optimization

Merging multiple buffers into a single buffer as an optimization is unlikely to provide a performance benefit.

For example, if you have two input buffers, you can merge them into a single buffer and use offsets to compute the addresses of data. However, this means that every kernel must perform the offset calculations.

It is better to use two buffers and pass the addresses to the kernel as a pair of kernel arguments.
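The difference can be sketched with two plain C functions that stand in for kernel bodies. This is an illustration under stated assumptions, not OpenCL code: the function names are hypothetical, the index i stands in for the work-item index that a kernel would obtain from get_global_id(0), and offset_b is the assumed position of the second input inside the merged buffer.

```c
#include <stddef.h>

/* Merged layout: input A occupies [0, offset_b) and input B starts at
 * offset_b. Every access to B pays for an extra address calculation. */
void add_merged(const float *merged, size_t offset_b, float *out, size_t i)
{
    out[i] = merged[i] + merged[offset_b + i];
}

/* Separate buffers: each input arrives as its own argument, so the
 * body indexes both directly with the work-item index. */
void add_separate(const float *a, const float *b, float *out, size_t i)
{
    out[i] = a[i] + b[i];
}
```

Both variants compute the same result. The separate-argument form simply leaves no offset arithmetic for the kernel to repeat on every access.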

  • Use asynchronous operations

If possible, use asynchronous operations between the control threads and OpenCL threads. For example:

  1. Do not make the application processor wait for results.
  2. Ensure that the application processor has other operations to process before it requires results from the OpenCL thread.
  3. Ensure that the application processor does not interact with OpenCL kernels while they are executing.

  • Avoid application processor and GPU interactions in the middle of processing

Enqueue all the kernels first, and call clFinish() at the end if possible.

Call clFlush() after one or more clEnqueueNDRangeKernel() calls, and call clFinish() before checking the final result.

  • Avoid blocking calls in the submission thread

Avoid clFinish(), clWaitForEvents(), or any other blocking calls in the submission thread.

If you want to check the result while computations are in progress, wait for an asynchronous callback if possible.

If you use blocking operations in your submission thread, try double buffering.
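For readers unfamiliar with double buffering, the following single-threaded C sketch shows only the buffer rotation: while one buffer is being processed, the other is prepared for the next step. All names are hypothetical, and fill_chunk and consume_chunk are stand-ins. In a real application, the consume step would be a kernel enqueued on the GPU and running concurrently with the host preparing the next buffer, which this sequential sketch does not attempt to reproduce.

```c
enum { CHUNK_SIZE = 4 };

/* Hypothetical stand-in for the host preparing input for one chunk. */
static void fill_chunk(int *buf, int chunk_index)
{
    for (int j = 0; j < CHUNK_SIZE; ++j)
        buf[j] = chunk_index * CHUNK_SIZE + j;
}

/* Hypothetical stand-in for the work that would be enqueued for the GPU. */
static long consume_chunk(const int *buf)
{
    long sum = 0;
    for (int j = 0; j < CHUNK_SIZE; ++j)
        sum += buf[j];
    return sum;
}

/* Process num_chunks chunks using two buffers that swap roles each step. */
long process_stream(int num_chunks)
{
    int buffers[2][CHUNK_SIZE];
    long total = 0;

    if (num_chunks <= 0)
        return 0;

    fill_chunk(buffers[0], 0); /* prime the first buffer */
    for (int i = 0; i < num_chunks; ++i) {
        int current = i & 1;    /* buffer being processed this step */
        int next = current ^ 1; /* buffer free to be refilled */

        /* Prepare the next chunk in the buffer that is not in use. */
        if (i + 1 < num_chunks)
            fill_chunk(buffers[next], i + 1);

        /* Process the current buffer. */
        total += consume_chunk(buffers[current]);
    }
    return total;
}
```

Because the buffer being refilled is never the one being processed, the submission thread does not have to block until the current step finishes before it prepares the next one.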

  • Batch kernel submissions

From version r17p0 onwards, the OpenCL driver batches kernels that are flushed together for submission to the hardware. Batching kernels can significantly reduce runtime overheads and cache maintenance costs. For example, this reduction is useful when the application accesses, in separate kernels, multiple sub-buffers created from a buffer imported using clImportMemoryARM.

The application should flush kernels in groups that are as large as possible. When the GPU is idle, however, reaching optimal performance requires the application to flush an initial batch of kernels early, so that GPU execution overlaps with the queuing of further kernels.
