GPU arch

In the first two blogs of this series I introduced the frame-level pipelining [The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining] and tile based rendering architecture [The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering] used by the Mali GPUs, aiming to develop a mental model which developers can use to explain the behavior of the graphics stack when optimizing the performance of their applications.

In this blog I will finish the construction of this abstract machine, forming the final component: the Mali GPU itself.  This blog assumes you have read the first two parts in the series, so I would recommend starting with those if you have not read them already.

GPU Architecture

The "Midgard" family of Mali GPUs (the Mali-T600 and Mali-T700 series) uses a unified shader core architecture, meaning that only a single type of shader core exists in the design. This single core can execute all types of programmable shader code, including vertex shaders, fragment shaders, and compute kernels.

The exact number of shader cores present in a particular silicon chip varies; our silicon partners can choose how many shader cores they implement based on their performance needs and silicon area constraints. The Mali-T760 GPU can scale from a single core for low-end devices all the way up to 16 cores for the highest performance designs, but between 4 and 8 cores are the most common implementations.

The graphics work for the GPU is queued in a pair of queues, one for vertex/tiling workloads and one for fragment workloads, with all work for one render target being submitted as a single submission into each queue. Workloads from both queues can be processed by the GPU at the same time, so vertex processing and fragment processing for different render targets can be running in parallel (see the first blog for more details on this pipelining methodology). The workload for a single render target is broken into smaller pieces and distributed across all of the shader cores in the GPU, or in the case of tiling workloads (see the second blog in this series for an overview of tiling) a fixed function tiling unit.

The shader cores in the system share a level 2 cache to improve performance, and to reduce memory bandwidth caused by repeated data fetches. Like the number of cores, the size of the L2 is configurable by our silicon partners, but is typically in the range of 32-64KB per shader core in the GPU depending on how much silicon area is available. The number and bus width of the memory ports this cache has to external memory is configurable, again allowing our partners to tune the implementation to meet their performance, power, and area needs. In general we aim to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256-bits of memory bandwidth (for both read and write) per clock cycle.

Mali GPU Shader Core

The Mali shader core is structured as a number of fixed-function hardware blocks wrapped around a programmable "tripipe" execution core. The fixed function units perform the setup for a shader operation - such as rasterizing triangles or performing depth testing - or handling the post-shader activities - such as blending, or writing back a whole tile's worth of data at the end of rendering. The tripipe itself is the programmable part responsible for the execution of shader programs.

The Tripipe

There are three classes of execution pipeline in the tripipe design: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is one load/store and one texture pipe per shader core, but the number of arithmetic pipelines can vary depending on which GPU you are using; most silicon shipping today will have two arithmetic pipelines, but GPU variants with up to four pipelines are also available.

Massively Multi-threaded Machine

Unlike a traditional CPU architecture, where you will typically only have a single thread of execution at a time on a single core, the tripipe is a massively multi-threaded processing engine. There may well be hundreds of hardware threads running at the same time in the tripipe, with one thread created for each vertex or fragment which is shaded. This large number of threads exists to hide memory latency; it doesn't matter if some threads are stalled waiting for memory, as long as at least one thread is available to execute then we maintain efficient execution.
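The latency-hiding argument can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of the real hardware scheduler: every thread is assumed to alternate between a fixed number of busy cycles and a fixed memory stall, and the pipe is assumed to switch between ready threads at no cost. The function names and the cycle counts used in the example are hypothetical.

```python
import math


def pipeline_utilization(threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles the pipe is busy in the toy round-robin model.

    Assumption: each thread does `compute_cycles` of work, then waits
    `stall_cycles` for memory, and thread switching is free.
    """
    if threads <= 0 or compute_cycles <= 0:
        return 0.0
    period = compute_cycles + stall_cycles
    return min(1.0, threads * compute_cycles / period)


def threads_to_hide_latency(compute_cycles: int, stall_cycles: int) -> int:
    """Smallest thread count that keeps the pipe fully busy in this model."""
    return math.ceil((compute_cycles + stall_cycles) / compute_cycles)


if __name__ == "__main__":
    # Hypothetical numbers: 4 cycles of work per 196-cycle memory stall.
    for n in (1, 10, 50, 200):
        print(n, "threads ->", pipeline_utilization(n, 4, 196))
    print("threads needed:", threads_to_hide_latency(4, 196))
```

In this model a single thread leaves the pipe almost idle, while a few tens of threads are enough to keep it busy, which is the intuition behind running hundreds of threads per core.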

Arithmetic Pipeline: Vector Core

The arithmetic pipeline (A-pipe) is a SIMD (single instruction multiple data) vector processing engine, with arithmetic units which operate on 128-bit quad-word registers. The registers can be flexibly accessed as either 2 x FP64, 4 x FP32, 8 x FP16, 2 x int64, 4 x int32, 8 x int16, or 16 x int8. It is therefore possible for a single arithmetic vector task to operate on 8 "mediump" values in a single operation, and for OpenCL kernels operating on 8-bit luminance data to process 16 pixels per SIMD unit per clock cycle.
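The lane counts quoted above follow directly from dividing the 128-bit register width by the element width. The short sketch below just reproduces that arithmetic; the names `REGISTER_BITS`, `FORMATS`, and `lanes` are illustrative and not part of any Mali API.

```python
REGISTER_BITS = 128  # width of one quad-word register, as stated in the text

# Element widths (in bits) for the data types listed in the text.
FORMATS = {
    "FP64": 64, "FP32": 32, "FP16": 16,
    "int64": 64, "int32": 32, "int16": 16, "int8": 8,
}


def lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one register."""
    if element_bits <= 0 or REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return REGISTER_BITS // element_bits


if __name__ == "__main__":
    for name, bits in FORMATS.items():
        print(f"{lanes(bits):2d} x {name}")
```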

While I can't disclose the internal architecture of the arithmetic pipeline, our public performance data for each GPU can be used to give some idea of the number of maths units available. For example, the Mali-T760 with 16 cores is rated at 326 FP32 GFLOPS at 600MHz. Dividing by 16 cores and 600 million clock cycles per second gives approximately 34 FP32 FLOPS per clock cycle per shader core; each core has two pipelines, so that's 17 FP32 FLOPS per pipeline per clock cycle. The available performance in terms of operations will increase for FP16/int16/int8 and decrease for FP64/int64 data types.
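The per-pipeline figure can be re-derived from the published rating. The sketch below uses only the numbers quoted in the text (326 GFLOPS, 16 cores, 600MHz, two arithmetic pipelines per core); the function name is illustrative.

```python
def flops_per_core_per_clock(gflops: float, cores: int, clock_mhz: float) -> float:
    """FP32 FLOPS delivered by one shader core in one clock cycle."""
    return (gflops * 1e9) / (cores * clock_mhz * 1e6)


# Figures quoted in the text for the Mali-T760 MP16.
per_core = flops_per_core_per_clock(gflops=326, cores=16, clock_mhz=600)
per_pipe = per_core / 2  # two arithmetic pipelines per core

if __name__ == "__main__":
    print(f"per core per clock: {per_core:.2f}")  # rounds to 34
    print(f"per pipe per clock: {per_pipe:.2f}")  # rounds to 17
```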

Texture Pipeline

The texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory, so requires a second clock cycle to complete.
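As a rough illustration of how filtering mode affects texture pipeline cost, the sketch below counts issue cycles using only the two rates given above: one cycle per bilinear sample and two per trilinear sample. It is a lower bound under those assumptions and ignores cache misses and any other filtering modes; the function name is hypothetical.

```python
BILINEAR_CYCLES = 1   # one bilinear filtered texel per clock (from the text)
TRILINEAR_CYCLES = 2  # trilinear needs a second cycle (from the text)


def texture_cycles(bilinear_samples: int, trilinear_samples: int) -> int:
    """Minimum T-pipe cycles for one fragment's texture lookups."""
    return (bilinear_samples * BILINEAR_CYCLES
            + trilinear_samples * TRILINEAR_CYCLES)


if __name__ == "__main__":
    # A hypothetical shader sampling three textures.
    print("all bilinear: ", texture_cycles(3, 0))
    print("all trilinear:", texture_cycles(0, 3))
```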

Load/Store Pipeline

The load/store pipeline (LS-pipe) is responsible for all memory accesses which are not related to texturing.  For graphics workloads this means reading attributes and writing varyings during vertex shading, and reading varyings during fragment shading. In general every instruction is a single memory access operation, although like the arithmetic pipeline they are vector operations and so could load an entire "highp" vec4 varying in a single instruction.

Early ZS Testing and Late ZS Testing

In the OpenGL ES specification "fragment operations" - which include depth and stencil testing - happen at the end of the pipeline, after fragment shading has completed. This makes the specification very simple, but implies that you have to spend lots of time shading something, only to throw it away at the end of the frame if it turns out to be killed by ZS testing. Coloring fragments just to discard them would cost a huge amount of performance and waste energy, so where possible we will do ZS testing early (i.e. before fragment shading), only falling back to late ZS testing (i.e. after fragment shading) where it is unavoidable (e.g. a dependency on a fragment which may call "discard" and as such has indeterminate depth state until it exits the tripipe).

In addition to the traditional early-z schemes, we also have some overdraw removal capability which can stop fragments which have already been rasterized from turning into real rendering work if they do not contribute to the output scene in a useful way. My colleague seanellis has a great blog looking at this technology - Killing Pixels - A New Optimization for Shading on ARM Mali GPUs - so I won't dive into any more detail here.

Memory System

This section is an after-the-fact addition to this blog, so if you have read this blog before and don't remember this section, don't worry, you're not going crazy. We have been getting a lot of questions from developers writing OpenCL kernels and OpenGL ES compute shaders asking for more information about the GPU cache structure, as it can be really beneficial to lay out data structures and buffers to optimize cache locality. The salient facts are:

  • Two 16KB L1 data caches per shader core; one for texture access and one for generic memory access.
  • A single logical L2 which is shared by all of the shader cores. The size of this is variable and can be configured by the silicon integrator, but is typically between 32 and 64 KB per instantiated shader core.
  • Both cache levels use 64 byte cache lines.
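To see why data layout matters with these cache parameters, the sketch below counts how many distinct 64-byte cache lines a run of accesses touches for a given stride. It models only address arithmetic, not replacement policy or associativity (which are not described here), and the function names are illustrative.

```python
CACHE_LINE_BYTES = 64    # line size for both cache levels (from the text)
L1_BYTES = 16 * 1024     # size of each L1 data cache (from the text)


def elements_per_line(element_bytes: int) -> int:
    """How many tightly packed elements share one cache line."""
    return CACHE_LINE_BYTES // element_bytes


def lines_touched(count: int, element_bytes: int, stride_bytes: int, base: int = 0) -> int:
    """Distinct cache lines touched by `count` accesses at a fixed stride."""
    lines = set()
    for i in range(count):
        start = base + i * stride_bytes
        end = start + element_bytes - 1
        lines.update(range(start // CACHE_LINE_BYTES, end // CACHE_LINE_BYTES + 1))
    return len(lines)


if __name__ == "__main__":
    print("lines in one L1:", L1_BYTES // CACHE_LINE_BYTES)
    print("packed 32-bit values per line:", elements_per_line(4))
    print("16 packed loads touch", lines_touched(16, 4, 4), "line(s)")
    print("16 loads at a 64-byte stride touch", lines_touched(16, 4, 64), "line(s)")
```

Sixteen adjacent 32-bit values share a single line, whereas the same sixteen values spread at a 64-byte stride each pull in a line of their own, so the strided layout moves sixteen times as much data for the same useful payload.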

If you are new to optimization of massively multi-threaded algorithms on massively multi-threaded architectures I would heartily recommend the SGEMM matrix multiplication video on our Mali Developer portal here:

  • http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/#example

... as the overall system behavior can be very different to what you are used to if you are coming from a traditional CPU background.

GPU Limits

Based on this simple model it is possible to outline some of the fundamental properties underpinning the GPU performance.

  • The GPU can issue one vertex per shader core per clock
  • The GPU can issue one fragment per shader core per clock
  • The GPU can retire one pixel per shader core per clock
  • We can issue one instruction per pipe per clock, so for a typical shader core we can issue four instructions in parallel if we have them available to run
    • We can achieve 17 FP32 operations per A-pipe per clock
    • One vector load, one vector store, or one vector varying per LS-pipe
    • One bilinear filtered texel per T-pipe
  • The GPU will typically have 32-bits of DDR access (read and write) per core per clock [configurable]

If we scale this to a Mali-T760 MP8 running at 600MHz we can calculate the theoretical peak performance as:

  • Fillrate:
    • 8 pixels per clock = 4.8 GPix/s
    • That's 2314 complete 1080p frames per second!
  • Texture rate:
    • 8 bilinear texels per clock = 4.8 GTex/s
    • That's 38 bilinear filtered texture lookups per pixel for 1080p @ 60 FPS!
  • Arithmetic rate:
    • 17 FP32 FLOPS per pipe per clock, with two pipes per core = 163 FP32 GFLOPS
    • That's 1311 FLOPS per pixel for 1080p @ 60 FPS!
  • Bandwidth:
    • 256-bits of memory access per clock = 19.2GB/s read and write bandwidth [1].
    • That's 154 bytes per pixel for 1080p @ 60 FPS!
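The peak figures above can be reproduced with a few lines of arithmetic. The sketch below uses only the per-clock rates from the GPU Limits list and the Mali-T760 MP8 at 600MHz configuration; the variable and function names are illustrative, and two arithmetic pipes per core is assumed, as for the "typical" core described earlier.

```python
CORES = 8
CLOCK_HZ = 600e6
A_PIPES_PER_CORE = 2          # assumption: the typical two-pipe configuration
FLOPS_PER_PIPE_PER_CLOCK = 17
BITS_PER_CORE_PER_CLOCK = 32
PIXELS_1080P = 1920 * 1080
TARGET_FPS = 60


def peak_rates() -> dict:
    """Theoretical peak rates derived from the per-clock limits in the text."""
    pixels_per_s = CORES * CLOCK_HZ            # one pixel per core per clock
    texels_per_s = CORES * CLOCK_HZ            # one bilinear texel per core per clock
    flops_per_s = FLOPS_PER_PIPE_PER_CLOCK * A_PIPES_PER_CORE * CORES * CLOCK_HZ
    bytes_per_s = BITS_PER_CORE_PER_CLOCK / 8 * CORES * CLOCK_HZ
    pixels_per_s_at_target = PIXELS_1080P * TARGET_FPS
    return {
        "gpix_per_s": pixels_per_s / 1e9,
        "frames_per_s": pixels_per_s / PIXELS_1080P,
        "texels_per_pixel": texels_per_s / pixels_per_s_at_target,
        "gflops": flops_per_s / 1e9,
        "flops_per_pixel": flops_per_s / pixels_per_s_at_target,
        "gbytes_per_s": bytes_per_s / 1e9,
        "bytes_per_pixel": bytes_per_s / pixels_per_s_at_target,
    }


if __name__ == "__main__":
    for name, value in peak_rates().items():
        print(f"{name}: {value:.2f}")
```

The headline numbers in the list are these values rounded down to whole units.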

OpenCL and Compute

The observant reader will have noted that I've talked a lot about vertices and fragments - the staple of graphics work - but have mentioned very little about how OpenCL and RenderScript compute threads come into being inside the core. Both of these types of work behave almost identically to vertex threads - you can view running a vertex shader over an array of vertices as a 1-dimensional compute problem. So the vertex thread creator also spawns compute threads, although more accurately I would say the compute thread creator also spawns vertices.
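The "vertex shading is a 1-dimensional compute problem" view can be sketched in a few lines. This is a serial Python illustration of the idea only, not how the hardware or any driver spawns threads; `run_1d` and `vertex_kernel` are hypothetical names, and the uniform scale stands in for whatever transform a real vertex shader would apply.

```python
def run_1d(kernel, global_size, *buffers):
    """Invoke `kernel` once per index, as a 1-D compute dispatch would.

    On a GPU each index would be an independent hardware thread; here the
    invocations simply run one after another.
    """
    for gid in range(global_size):
        kernel(gid, *buffers)


def vertex_kernel(gid, positions_in, positions_out):
    """One 'vertex thread': read vertex `gid`, transform it, write it out."""
    x, y, z = positions_in[gid]
    # Hypothetical transform: a uniform scale by 2.
    positions_out[gid] = (2.0 * x, 2.0 * y, 2.0 * z)


if __name__ == "__main__":
    vertices = [(1.0, 2.0, 3.0), (0.0, -1.0, 0.5)]
    shaded = [None] * len(vertices)
    run_1d(vertex_kernel, len(vertices), vertices, shaded)
    print(shaded)
```

Each invocation reads and writes only its own index, which is what allows the same work to be spread across many independent threads.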

Next Time ...

This blog concludes the first chapter of this series, developing the abstract machine which defines the basic behaviors which an application developer should expect to see for a Mali GPU in the Midgard family. Over the rest of this series I'll start to put this new knowledge to work, investigating some common application development pitfalls, and useful optimization techniques, which can be identified and debugged using the Mali integration into the ARM DS-5 Streamline profiling tools.

EDIT: Next blog now available:

  • Mali Performance 1: Checking the Pipeline

Comments and questions welcomed as always,

TTFN,

Pete

Footnotes

  1. ... 19.2GB/s subject to the ability of the rest of the memory system outside of the GPU to give us data this quickly. Like most features of an ARM-based chip, the down-stream memory system is highly configurable in order to allow different vendors to tune power, performance, and silicon area according to their needs. For most SoC parts the rest of the system will throttle the available bandwidth before the GPU runs out of an ability to request data. It is unlikely you would want to sustain this kind of bandwidth for prolonged periods, but short burst performance is important.

Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.


Comments


  • seanlumly, 2014-5-27 7:03 PM

    Thanks peterharris,

    As promised, I have come to brighten your day with a few questions! But first a bit of praise about the Mali GPU. I'm extremely impressed that the A-Pipe is flexible enough to do 64, 32, 16, and 8-bit vec4 workloads! I'm not sure if this flexibility is normal, but it's amazing to think that a T760MP16 running a mediump workload has a peak that is not only far beyond last-gen consoles, but is actually approaching the marketed performance of current gen consoles. In addition to that, the memory bandwidth gains enabled by ASTC, TBDR, Transaction Elimination, AFBC, Pixel Local Storage, Forward Pixel Kill, etc, should also cease to make comparisons 1:1 -- I bet, when taken into account the potential is there to move significantly more data than peak bandwidth would suggest. The performance over duration delta between outlet (mains) powered hardware vs. mobile seems to be shrinking thanks to the focus of mobile engineers on efficiency, rather than just a smaller process node and a larger die. I guess this is what ARM meant by a hyper-Moore's Law curve.

    Ok, onto the questions:

    1) Does the compiler attempt to pack the Vec4 as much as possible in the event that there are different same-precision operations that can be combined? Is it possible to mix unit precision in a single vec4 operation?

    2) Do the figures for doing 1 interpolated texel per core (1 T-Pipe per core), per clock apply equally to ETC2 and ASTC for a given amount of bandwidth? Assuming no caches, I would guess that ASTC needs more data considering that the block sizes are different (8 bytes vs 4 bytes for ETC2).

    3) Is the Mali T760 MP16 suitable for high performance smartphones or tablets at 28nm (~100mm2 chip)? Is its power and area consumption suitable for the current process node? If this is the case, there must have been a drastic reduction of wiring to accommodate these performance gains!

    4) Are OpenCL (or ES31 Compute Shader) computations roughly as efficient as those of a Vertex Shader? Could geometry transformation in compute yield similar results as OpenGL?

    5) What is the L2 cache latency?

    6) How well does the A-Pipe deal with SQRTs, DIVs, TRIG, etc that typically require a number of steps?

    Thank you for considering these. Please only answer what you can.

    Sean
