Mali Midgard Family Performance Counters

Version 41

Analysis and optimization of graphics and compute content running on a GPU is an important task when trying to build a top quality system integration, or a compelling high performance application. For developers working with the public APIs, such as OpenGL ES and OpenCL, the GPU is a black box which is very difficult to analyze based solely on the API-visible behaviors. Frame pipelining and asynchronous processing effectively decouple the application's performance from the API calls which define the workload, making analysis of performance an activity based on expert knowledge and intuition rather than direct measurement.

Tools such as ARM® DS-5 Streamline provide developers access to the GPU hardware performance counters, the principal means to determine the behavior inside the black box beneath the API and identify any problem areas which need optimization. This document doesn't explain how to use the tools - other documents and blogs have covered this - but instead focuses on what the counters actually mean to a system integrator or application developer.

Table of Contents

  • 1 Performance Counter Infrastructure
    • 1.1 Supported Counters
    • 1.2 Counter Implementation Caveats
    • 1.3 Counter Naming Convention
  • 2 Job Manager Counters
    • 2.1 Top Level Activity
      • 2.1.1 GPU_ACTIVE
      • 2.1.2 JS0_ACTIVE
      • 2.1.3 JS1_ACTIVE
      • 2.1.4 JS2_ACTIVE
      • 2.1.5 IRQ_ACTIVE
    • 2.2 Task Dispatch
      • 2.2.1 JS0_TASKS (1)
      • 2.2.2 JS0_TASKS (2)
      • 2.2.3 JS1_TASKS
      • 2.2.4 JS2_TASKS
  • 3 Shader Core Counters
    • 3.1 Shader Core Activity
      • 3.1.1 FRAG_ACTIVE
      • 3.1.2 COMPUTE_ACTIVE
      • 3.1.3 TRIPIPE_ACTIVE
    • 3.2 Compute Frontend Events
      • 3.2.1 COMPUTE_TASKS
      • 3.2.2 COMPUTE_THREADS
    • 3.3 Fragment Frontend Events
      • 3.3.1 FRAG_PRIMITIVES
      • 3.3.2 FRAG_PRIMITIVES_DROPPED
      • 3.3.3 FRAG_QUADS_RAST
      • 3.3.4 FRAG_QUADS_EZS_TEST
      • 3.3.5 FRAG_QUADS_EZS_KILLED
      • 3.3.6 FRAG_CYCLES_NO_TILE
      • 3.3.7 FRAG_CYCLES_FPKQ_ACTIVE
      • 3.3.8 FRAG_THREADS
      • 3.3.9 FRAG_DUMMY_THREADS
    • 3.4 Fragment Backend Events
      • 3.4.1 FRAG_THREADS_LZS_TEST
      • 3.4.2 FRAG_THREADS_LZS_KILLED
      • 3.4.3 FRAG_NUM_TILES (1)
      • 3.4.4 FRAG_NUM_TILES (2)
      • 3.4.5 FRAG_TRANS_ELIM
    • 3.5 Arithmetic Pipe Events
      • 3.5.1 ARITH_WORDS (1)
      • 3.5.2 ARITH_WORDS (2)
    • 3.6 Load/Store Pipe Events
      • 3.6.1 LS_WORDS
      • 3.6.2 LS_ISSUES (1)
      • 3.6.3 LS_ISSUES (2)
    • 3.7 Load/Store Cache Events
      • 3.7.1 LSC_READ_HITS
      • 3.7.2 LSC_READ_MISSES
      • 3.7.3 LSC_READ_OPS
      • 3.7.4 LSC_WRITE_HITS
      • 3.7.5 LSC_WRITE_MISSES
      • 3.7.6 LSC_WRITE_OPS
      • 3.7.7 LSC_ATOMIC_HITS
      • 3.7.8 LSC_ATOMIC_MISSES
      • 3.7.9 LSC_ATOMIC_OPS
      • 3.7.10 LSC_LINE_FETCHES
      • 3.7.11 LSC_DIRTY_LINE
      • 3.7.12 LSC_SNOOPS
    • 3.8 Texture Pipe Events
      • 3.8.1 TEX_WORDS
      • 3.8.2 TEX_ISSUES (1)
      • 3.8.3 TEX_ISSUES (2)
      • 3.8.4 TEX_RECIRC_FMISS
  • 4 Tiler Counters
    • 4.1 Tiler Activity
      • 4.1.1 TI_ACTIVE
    • 4.2 Tiler Primitive Occurrence
      • 4.2.1 TI_POINTS
      • 4.2.2 TI_LINES
      • 4.2.3 TI_TRIANGLES
    • 4.3 Tiler Visibility and Culling Occurrence
      • 4.3.1 TI_PRIM_VISIBLE
      • 4.3.2 TI_PRIM_CULLED
      • 4.3.3 TI_PRIM_CLIPPED
      • 4.3.4 TI_FRONT_FACING
      • 4.3.5 TI_BACK_FACING
  • 5 L2 Cache Counters
    • 5.1 Internal Read Traffic Events
      • 5.1.1 L2_READ_LOOKUP
      • 5.1.2 L2_READ_HITS
      • 5.1.3 L2_READ_SNOOP
    • 5.2 Internal Write Traffic Events
      • 5.2.1 L2_WRITE_LOOKUP
      • 5.2.2 L2_WRITE_HITS
      • 5.2.3 L2_WRITE_SNOOP
    • 5.3 External Read Traffic Events
      • 5.3.1 L2_EXT_READ_BEATS
      • 5.3.2 L2_EXT_R_BUF_FULL
      • 5.3.3 L2_EXT_RD_BUF_FULL
      • 5.3.4 L2_EXT_AR_STALL
    • 5.4 External Write Traffic Events
      • 5.4.1 L2_EXT_WRITE_BEATS
      • 5.4.2 L2_EXT_W_BUF_FULL
      • 5.4.3 L2_EXT_W_STALL
  • 6 Conclusions

1 Performance Counter Infrastructure

The Midgard GPU family supports many performance counters which can all be captured simultaneously. Performance counters are provided for each functional block in the design:

  • Job Manager
  • Shader core(s)
  • Tiler
  • L2 cache(s)

See my earlier blog series for an introduction to the Midgard GPU architecture - they introduce some of the fundamental concepts which are important to understand and which place the more detailed information in this document in context.

  • The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining
  • The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering
  • The Mali GPU: An Abstract Machine, Part 3 - The Shader Core

1.1 Supported Counters

The GPUs in the Midgard family implement a large number of performance counters natively in the hardware, and it is also generally useful to generate some pseudo-counters by combining one or more of the raw hardware counters in useful and interesting ways. This document will describe all of the counters exported from DS-5 Streamline, and some of the useful pseudo-counters which can be derived from them. DS-5 Streamline allows custom performance counter graphs to be created using equations, so all of these performance counters can be directly visualized in the GUI.
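
DS-5 Streamline evaluates these equations directly in its GUI, but the same derivations can be reproduced offline if the raw counter values are exported. The Python sketch below shows the general pattern; the per-window sample layout and the counter values are hypothetical, purely for illustration:

  # A minimal sketch of offline pseudo-counter derivation, assuming raw
  # counter samples are available as {counter_name: value} dicts per window.
  def derived_ratio(sample, numerator, denominator):
      """Compute a ratio pseudo-counter for one sample window."""
      denom = sample.get(denominator, 0)
      if denom == 0:
          return 0.0  # treat idle windows as zero rather than dividing by zero
      return sample.get(numerator, 0) / denom

  samples = [{"GPU_ACTIVE": 480_000, "TRIPIPE_ACTIVE": 312_000}]  # hypothetical
  for s in samples:
      print(derived_ratio(s, "TRIPIPE_ACTIVE", "GPU_ACTIVE"))  # -> 0.65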

1.2 Counter Implementation Caveats

The hardware counter implementation in the GPU is designed to be low cost, such that it has minimal impact on performance and power. Many of the counters are close approximations of the behavior described in this document in order to minimize the amount of additional hardware logic required to generate the counter signals, so you may encounter some small deviations from the behavior you expect.

1.3 Counter Naming Convention

The counters in the Midgard GPU family have evolved slightly as we have released new hardware models. The Midgard counters in Streamline use the following naming convention:

  ARM_Mali-T<GPUID>_<NAME>

For example:

  ARM_Mali-T76x_GPU_ACTIVE

The counter descriptions in this document are based on the "<NAME>" sub-string of the overall name, as many GPUs in the family implement similar counters. Availability of the counters in each GPU is documented alongside the description.

2 Job Manager Counters

This section describes the counters implemented by the Mali Job Manager component.

2.1 Top Level Activity

These counters define the overall number of cycles that the GPU was processing workloads of one type or another.

2.1.1 GPU_ACTIVE

Availability: All

This counter increments every cycle that the GPU either has any workload queued in a Job slot, or the GPU cycle counter is running for OpenCL profiling. Note that this counter may increment even though a Job is stalled waiting for memory - that is still counted as "active" time even though no forward progress was made.

If the GPU operating frequency is known then overall GPU utilization can be calculated as:

  GPU_UTILIZATION = GPU_ACTIVE / GPU_MHZ
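
As an illustration, here is a minimal Python sketch of this calculation, assuming a DVFS-locked GPU clock and a fixed Streamline sample window; the 600MHz frequency, 1ms window, and counter value are all assumptions for the example:

  GPU_FREQ_HZ = 600_000_000   # assumed clock: 600MHz, locked (see DVFS note below)
  SAMPLE_PERIOD_S = 0.001     # assumed sample window: 1ms

  def gpu_utilization(gpu_active_cycles):
      available_cycles = GPU_FREQ_HZ * SAMPLE_PERIOD_S
      return gpu_active_cycles / available_cycles

  print(f"{gpu_utilization(588_000):.1%}")  # -> 98.0%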

Well pipelined applications which are not running at vsync and keeping the GPU busy should achieve a utilization of around 98%. Lower utilization than this typically indicates one of the following scenarios:

  • Content running at vsync.
    • In this scenario the GPU goes idle, as it has no need to run until the next vsync signal.
  • Content which is bottlenecked by the CPU.
    • In this scenario the application or driver is causing high CPU load, and cannot build new workloads for the GPU quickly enough to keep it busy.
  • Content which is oscillating between CPU and GPU activity.
    • In this scenario the application is using APIs which break the frame-level pipeline needed to keep the GPU busy. The most common causes are calls to glReadPixels() or glFinish(), as these explicitly drain the pipeline, but other API calls can cause stalls if used in a blocking manner before their result is ready. These include calls such as glClientWaitSync(), glWaitSync(), or glGetQueryObjectuiv().

Collecting GPU activity and CPU activity as part of the same DS-5 Streamline data capture can help disambiguate between the cases above. This type of analysis is explored in more detail in this blog: Mali Performance 1: Checking the Pipeline.

Note: Most modern devices support Dynamic Voltage and Frequency Scaling (DVFS) to optimize energy usage, which means that the GPU frequency is often not constant while running a piece of content. If possible, it is recommended that platform DVFS is disabled, locking the CPU, GPU, and memory bus at fixed frequencies, as this makes performance analysis much easier and results more reproducible. The method for doing this is device specific, and may not be possible at all on production devices - please refer to your platform's documentation for details.

2.1.2 JS0_ACTIVE

Availability: All

This counter increments every cycle that the GPU has a Job chain running in Job slot 0. This Job slot is used solely for the processing of fragment Jobs, so this corresponds directly to fragment shading workloads.

For most content there are orders of magnitude more fragments than vertices, so this Job slot will usually be the dominant slot with the highest processing load. For content which is not hitting vsync, where the GPU is the performance bottleneck, it is normal for JS0_ACTIVE to be approximately equal to GPU_ACTIVE. In this scenario vertex processing can run in parallel with the fragment processing, allowing fragment processing to run all of the time.

2.1.3 JS1_ACTIVE

Availability: All

This counter increments every cycle the GPU has a Job chain running in Job slot 1. This Job slot can be used for compute shaders, vertex shaders, and tiling workloads. This counter cannot disambiguate between these workloads.

2.1.4 JS2_ACTIVE

Availability: All

This counter increments every cycle the GPU has a Job chain running in Job slot 2. This Job slot can be used for compute shaders, and vertex shaders.

In most system configurations vertex shading and tiling workloads are submitted together as a single batched Job chain via Job slot 1, so even though this slot can technically execute vertex shading, it does not often do so. For most graphics content this Job slot is therefore idle.

2.1.5 IRQ_ACTIVE

Availability: All

This counter increments every cycle the GPU has an interrupt pending, awaiting handling by the driver running on the CPU. Note that this does not necessarily indicate lost performance as the GPU can still process Job chains from other Job slots, as well as process the next work item in the pending Job slot, while an interrupt is pending.

2.2 Task Dispatch

This section looks at the counters related to how the Job Manager issues work to shader cores.

2.2.1 JS0_TASKS (1)

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments every time the Job Manager issues a task to a shader core. For JS0 these tasks correspond to a single 16x16 pixel screen region, although not all of these pixels may be rendered due to viewport or scissor settings.

2.2.2 JS0_TASKS (2)

Availability: Mali-T760, Mali-T800 series

This counter increments every time the Job Manager issues a task to a shader core. For JS0 these tasks correspond to a single 32x32 pixel screen region, although not all of these pixels may be rendered due to viewport or scissor settings.

2.2.3 JS1_TASKS

Availability: All

This counter increments every time the Job Manager issues a task to a shader core or the tiler. For JS1 these tasks correspond to a range of vertices or compute work items (shader cores), or a range of indices (tiler). The size of these tasks is driver controlled, although for compute tasks it must be a multiple of the workgroup size.

2.2.4 JS2_TASKS

Availability: All

This counter increments every time the Job Manager issues a task to a shader core. For JS2 these tasks correspond to a range of vertices or compute work items. The size of these tasks is driver controlled, although for compute tasks it must be a multiple of the workgroup size.

3 Shader Core Counters

This section describes the counters implemented by the Mali Shader Core component. For the purposes of clarity this section talks about either fragment workloads or compute workloads. Vertex workloads are treated as a one dimensional compute problem by the shader core, so are counted as a compute workload from the point of view of the counters in this section.

The GPU hardware records separate counters per shader core in the system. DS-5 Streamline shows the average of all of the shader core counters.

3.1 Shader Core Activity

These counters show the total activity level of the shader core.

3.1.1 FRAG_ACTIVE

Availability: All

This counter increments every cycle at least one fragment task is active anywhere inside the shader core, including the fixed-function fragment frontend, the programmable tripipe, or the fixed-function fragment backend.

3.1.2 COMPUTE_ACTIVE

Availability: All

This counter increments every cycle at least one compute task is active anywhere inside the shader core, including the fixed-function compute frontend, or the programmable tripipe.

3.1.3 TRIPIPE_ACTIVE

Availability: All

This counter increments every cycle at least one thread is active inside the programmable tripipe. Note that this counter does not give any idea of the total utilization of the tripipe resources; it simply indicates that something was running. An approximation of the overall utilization of the tripipe can be achieved via the following equation:

  TRIPIPE_UTILIZATION = TRIPIPE_ACTIVE / GPU_ACTIVE

A low tripipe utilization can indicate:

  • Content which is tiler limited, so the shader cores are going idle.
  • Content which contains many micro-polygons - triangles smaller than one pixel - which generate no rasterizer sample coverage.
  • Content which contains many tiles containing no geometry, i.e. rendering just the clear color.

3.2 Compute Frontend Events

These counters show the task and thread issue behavior of the shader core's fixed function compute frontend which issues work into the programmable tri-pipe.

3.2.1 COMPUTE_TASKS

Availability: All

This counter increments for every compute task handled by the shader core. When totalled across all shader cores this should show the total number of shader core tasks issued by JS1 and JS2.

3.2.2 COMPUTE_THREADS

Availability: All

This counter increments for every compute thread spawned by the shader core. One compute thread is spawned for every work item (compute shaders) or vertex (vertex shaders).

3.3 Fragment Frontend Events

These counters show the task and thread issue behavior of the shader core's fixed-function fragment frontend. This unit is significantly more complicated than the compute frontend, so there are a large number of counters available.

3.3.1 FRAG_PRIMITIVES

Availability: All

This counter increments for every primitive read from the tile list. Not all of these primitives will necessarily be visible in the current tile, due to the use of the hierarchical tiler, but it gives some idea of the internal tile list access rate.

3.3.2 FRAG_PRIMITIVES_DROPPED

Availability: All

This counter increments for every primitive read from the tile list which is subsequently discarded because it is not relevant for the tile currently being rendered. The number of fragment threads which are issued per primitive can be given by the following equation:

  THREADS_PER_PRIMITIVE_LOAD = FRAG_THREADS / (FRAG_PRIMITIVES - FRAG_PRIMITIVES_DROPPED)

It is recommended that the number of fragment threads per primitive load is kept above 10 to ensure that the tripipe is kept filled with work for as much of the frame rendering time as possible.

Note that averaging this metric over a large time window can hide problematic cases; a sequence of 1000 primitives of 256 fragments each (i.e. a full tile) followed by 1000 primitives of 1 fragment each (i.e. a single-pixel micro-triangle) would give an average of ~128 pixels per primitive, but would run slower than a test rendering 2000 primitives of 128 pixels each.
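
This pitfall is easy to check for offline if per-window counter values are available. The Python sketch below reproduces the two synthetic workloads from the note; the window layout and the ~10 threads/primitive target are taken from the text above, and the values are illustrative:

  def threads_per_primitive(frag_threads, frag_prims, frag_prims_dropped):
      loaded = frag_prims - frag_prims_dropped
      return frag_threads / loaded if loaded else 0.0

  # Each window is (FRAG_THREADS, FRAG_PRIMITIVES, FRAG_PRIMITIVES_DROPPED).
  windows = [(256_000, 1000, 0),   # 1000 full-tile primitives, 256 threads each
             (1_000, 1000, 0)]     # 1000 micro-triangles, 1 thread each
  for threads, prims, dropped in windows:
      tpp = threads_per_primitive(threads, prims, dropped)
      if tpp < 10:
          print(f"{tpp:.1f} threads/primitive: below the ~10 target")

  # The whole-capture average hides the problem window entirely:
  avg = sum(w[0] for w in windows) / sum(w[1] - w[2] for w in windows)
  print(f"average: {avg:.1f}")  # -> 128.5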

3.3.3 FRAG_QUADS_RAST

Availability: All

This counter increments for every 2x2 pixel quad which is rasterized by the rasterization unit. The quads generated have at least some coverage based on the current sample pattern, but may subsequently be killed by early depth and stencil testing and as such never issued to the tri-pipe.

3.3.4 FRAG_QUADS_EZS_TEST

Availability: All

This counter increments for every 2x2 pixel quad which is subjected to early depth and stencil (ZS) test and update. We want as many quads as possible to be subject to early ZS testing as it is significantly more efficient than late ZS testing. It is more efficient because it lets us kill work before it becomes executable threads which consume tripipe resources, rather than after those threads have completed execution.

Note: this counter will only increment for quads which complete all necessary ZS test and ZS update operations. If a fragment performs an early ZS test but requires a late ZS update - possible, for example, if the shader contains a discard statement - then this counter will not increment. In this scenario it is possible for the FRAG_QUADS_EZS_KILLED counter to be higher than this FRAG_QUADS_EZS_TEST counter, which can seem slightly counter-intuitive.

3.3.5 FRAG_QUADS_EZS_KILLED

Availability: All

This counter increments for every 2x2 pixel quad which is completely killed by early depth and stencil (ZS) testing. These killed quads will not generate any further processing in the shader core.

3.3.6 FRAG_CYCLES_NO_TILE

Availability: All

This counter increments every cycle the shader core early ZS unit is blocked from making progress because there is no physical tile memory for a new color, depth, or stencil buffer available. Note that this may not actually cause lost performance if there is sufficient work buffered after the EZS unit that is waiting to run.

A high counter value here generally means that the GPU cannot free up tile memory fast enough to satisfy the rate at which it is processing. Ensure that the GPU is not bottlenecking on external memory bandwidth, and use glDiscardFramebufferEXT() or glInvalidateFramebuffer() to discard unneeded transient tile state in order to make that memory available for the next tile as quickly as possible. See this blog on framebuffer lifetime for more information on the correct use of these two API calls: Mali Performance 2: How to Correctly Handle Framebuffers.

3.3.7 FRAG_CYCLES_FPKQ_ACTIVE

Availability: All except Mali-T600

This counter increments every cycle the pre-pipe FPK buffer contains at least one 2x2 pixel quad waiting to be executed in the tripipe. If this buffer drains, the frontend will be unable to spawn a new thread when a thread slot becomes free. However, an empty buffer is only an indication of possible lost performance, rather than actual lost performance, as the tripipe may already contain enough threads to fully utilize the critical-path functional units.

3.3.8 FRAG_THREADS

Availability: All

This counter increments for every fragment thread created by the GPU. These may be real threads, or dummy threads (see below for a description of a dummy thread).

3.3.9 FRAG_DUMMY_THREADS

Availability: All

This counter increments for every dummy fragment thread created by the GPU. These are real executable threads which have no active sample coverage, but which share a quad with at least one thread which does have sample coverage. These dummy threads are needed to ensure correctly calculated derivatives if the fragment shader uses the dFdx() or dFdy() built-in functions, or implicit mipmap level selection.

If this number is a significant percentage of the total number of fragment threads it generally indicates that the content is generating micro-triangles, which have a high percentage of edge pixels. In these cases performance can be improved by reducing triangle counts in the model meshes.
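
A simple derived percentage makes this check easy to automate; the sketch below is illustrative, and the 25% alert threshold is a rule of thumb rather than a documented limit:

  def dummy_thread_fraction(frag_dummy_threads, frag_threads):
      return frag_dummy_threads / frag_threads if frag_threads else 0.0

  frac = dummy_thread_fraction(120_000, 400_000)  # hypothetical sample values
  if frac > 0.25:  # illustrative threshold, not a documented limit
      print(f"{frac:.0%} dummy threads: likely micro-triangles in the meshes")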

3.4 Fragment Backend Events

These counters record the fragment backend behavior.

3.4.1 FRAG_THREADS_LZS_TEST

Availability: All

This counter increments for every thread triggering late depth and stencil (ZS) testing.

3.4.2 FRAG_THREADS_LZS_KILLED

Availability: All

This counter increments for every thread killed by late depth and stencil (ZS) testing. These threads are killed after their fragment program has executed, so a significant number of threads being killed at late ZS implies a significant amount of lost performance and/or wasted energy performing rendering which has no useful visual output. Late depth and stencil testing is generally triggered by shaders which modify depth or stencil programmatically, or which have shader-dependent sample coverage (the shader uses "discard" or alpha-to-coverage). The driver will aggressively try to use early ZS whenever possible, even in these cases, so using these features does not guarantee an occurrence of late ZS.
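
The two late ZS counters can be combined into a rough measure of wasted fragment work. The sketch below treats every thread as equal cost, which is only an approximation, and the counter values are hypothetical:

  def late_zs_stats(lzs_test, lzs_killed, frag_threads):
      kill_rate = lzs_killed / lzs_test if lzs_test else 0.0       # of tested threads
      wasted = lzs_killed / frag_threads if frag_threads else 0.0  # of all threads
      return kill_rate, wasted

  kill_rate, wasted = late_zs_stats(90_000, 60_000, 400_000)
  print(f"late ZS kill rate {kill_rate:.0%}, wasted fragment work ~{wasted:.0%}")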

3.4.3 FRAG_NUM_TILES (1)

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every 16x16 pixel tile rendered by the shader core. In a single core system this should be the same as the JS0_TASKS (1) counter.

3.4.4 FRAG_NUM_TILES (2)

Availability: Mali-T760, Mali-T800 series

This counter increments for every 32x32 pixel screen region rendered by the shader core. In a single core system this should be the same as the JS0_TASKS (2) counter.

3.4.5 FRAG_TRANS_ELIM

Availability: All

This counter increments for every physical rendered tile which has its writeback cancelled due to a matching transaction elimination CRC hash. If a high percentage of the tile writes are being eliminated this implies that you are re-rendering the entire screen when not much has changed, so consider using scissor rectangles to minimize the amount of area which is redrawn. This isn't always easy, especially for window surfaces which are pipelined using multiple buffers, but EGL extensions such as the following may be supported on your platform to help manage partial frame updates:

  • https://www.khronos.org/registry/egl/extensions/KHR/EGL_KHR_partial_update.txt
  • https://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_swap_buffers_with_damage.txt

Note that in the Mali-T760 and Mali-T800 series the size of a rendered tile is not the same as the region size counted in the shader core tile issue counter (FRAG_NUM_TILES (2)). Physical render tiles can vary from 16x16 pixels (largest) down to 4x4 pixels (smallest), with a number of intermediate sizes possible. The physical tile size used depends on the number of bytes of memory needed to store the working set for each pixel.

This in turn depends on the use of:

  • Multi-sample anti-aliasing (MSAA)
  • OpenGL ES 3.0 multiple render targets (MRT)
  • OpenGL ES 3.0 large color formats, e.g. 16 or 32-bit per channel data formats
  • Pixel local storage (PLS)

This counter cannot be used to determine the physical tile size, and it is possible that one application may use different physical tile sizes for different render passes. However, for most content using a single render target, RGBA color formats up to 32 bits per pixel, and up to 4x MSAA, the rendered tile size will be 16x16 pixels.

3.5 Arithmetic Pipe Events

These counters look at the behavior of the arithmetic pipe.

3.5.1 ARITH_WORDS (1)

Availability: All except Mali-T720, Mali-T820, and Mali-T830

This counter increments for every arithmetic instruction architecturally executed. This counter is normalized based on the number of arithmetic pipelines implemented in the design, so gives the "per pipe" performance, rather than the total executed workload. The peak performance is one arithmetic instruction per arithmetic pipeline per cycle, so the effective utilization of the arithmetic hardware can be computed as:

  ARITH_ARCH_UTILIZATION = ARITH_WORDS / TRIPIPE_ACTIVE

3.5.2 ARITH_WORDS (2)

Availability: Mali-T720, Mali-T820, Mali-T830

This counter increments for every batched arithmetic instruction executed. Mali-T720, Mali-T820, and Mali-T830 implement a different arithmetic pipeline to the other Midgard GPU cores, which improves performance density at the expense of lower peak performance. In this design the GPU automatically batches together multiple threads which are not fully utilizing the lanes of a SIMD vector unit, in order to improve lane utilization. This counter counts the number of batched execution issues (where the batch size may be a single thread), rather than the number of individual instruction issues from each logical thread.

Due to the dynamic batching of instructions the architectural performance of this pipeline is not trivial to measure using counters, as it may batch 1, 2, or 4 threads into a single issue cycle and its ability to do so depends heavily on the code generated by the shader compiler. However it is still true that we can issue one batch per clock, so the headline utilization is still measurable via the equation:

  ARITH_ARCH_UTILIZATION = ARITH_WORDS / TRIPIPE_ACTIVE

3.6 Load/Store Pipe Events

These counters look at the behavior of the load/store pipe.

3.6.1 LS_WORDS

Availability: All

This counter increments for every load/store instruction architecturally executed. Under ideal circumstances we can issue one instruction per clock, so the architectural utilization of the load store pipe is:

  LS_ARCH_UTILIZATION = LS_WORDS / TRIPIPE_ACTIVE

3.6.2 LS_ISSUES (1)

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every load/store instruction issued, including any reissues due to varying or data cache misses. We can issue or reissue a single instruction per clock cycle, so the microarchitectural utilization can be computed as:

  LS_UARCH_UTILIZATION = LS_ISSUES / TRIPIPE_ACTIVE

Under ideal circumstances we can execute a single load/store word in a single pass, so we can calculate a Cycles Per Instruction (CPI) metric as:

  LS_CPI = LS_ISSUES / LS_WORDS

A good CPI figure is generally around 1.2; anything significantly higher than this generally means that one of the data caches is being thrashed. For vertex shaders you might want to look at:

  • Reducing the number of uniforms or vertex attributes.
  • Reducing the precision of uniforms or vertex attributes.
  • Ensuring that no unused fields are interleaved in your attribute streams or uniform buffers.
  • Ensuring that your geometry data has good spatial locality (vertices that are near to other vertices in the model mesh are also near to each other in memory).

For fragment shaders the same rules apply, but to varying values instead of input attributes; for compute shaders the rules apply to shader storage buffer objects (SSBOs).
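
The CPI check itself is trivial to automate. Here is a minimal Python sketch using the ~1.2 guideline quoted above, with hypothetical counter values:

  def ls_cpi(ls_issues, ls_words):
      return ls_issues / ls_words if ls_words else 0.0

  cpi = ls_cpi(ls_issues=150_000, ls_words=100_000)
  if cpi > 1.2:  # guideline from the text above
      print(f"LS CPI {cpi:.2f}: a data cache may be thrashing; "
            "review attribute, uniform, and varying layout")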

3.6.3 LS_ISSUES (2)

Availability: Mali-T760, Mali-T800 series

This counter increments for every load/store instruction issued, including any reissues due to varying cache misses. It should be noted that Mali-T760 onwards does not count retry cycles due to misses in the main data cache as part of this metric, so the cache analysis has to be done using the LSC counters described in the next section.

The derived counters are the same as those in "LS_ISSUES (1)" above.

3.7 Load/Store Cache Events

These events monitor the performance of the load/store cache.

3.7.1 LSC_READ_HITS

Availability: All

This counter increments for every load/store L1 cache read access which is a hit.

3.7.2 LSC_READ_MISSES

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every load/store L1 cache read access which is a miss.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. If the hit rate is particularly poor it may indicate cache thrashing due to badly packed data, with poor spatial and/or temporal locality of access.

  LSC_READ_HITRATE = LSC_READ_HITS / (LSC_READ_HITS + LSC_READ_MISSES)

3.7.3 LSC_READ_OPS

Availability: Mali-T760, Mali-T800 series

This counter increments for every load/store L1 cache read access.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. See the section above for a further description; the updated derived counter for the supported GPUs is:

  LSC_READ_HITRATE = LSC_READ_HITS / LSC_READ_OPS
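
Since the two counter generations differ only in whether misses or total operations are reported, a single helper can cover both; the same shape applies to the write and atomic counters in the sections below. A minimal sketch with hypothetical values:

  def lsc_hit_rate(hits, misses=None, ops=None):
      # Pass misses on Mali-T600/T620/T720, or ops on Mali-T760/T800 series.
      total = ops if ops is not None else hits + (misses or 0)
      return hits / total if total else 0.0

  print(lsc_hit_rate(hits=90_000, misses=10_000))  # older GPUs -> 0.9
  print(lsc_hit_rate(hits=90_000, ops=100_000))    # newer GPUs -> 0.9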

3.7.4 LSC_WRITE_HITS

Availability: All

This counter increments for every load/store L1 cache write access which is a hit.

3.7.5 LSC_WRITE_MISSES

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every load/store L1 cache write access which is a miss.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. If the hit rate is particularly poor it may indicate cache thrashing due to badly packed data, with poor spatial and/or temporal locality of access.

  LSC_WRITE_HITRATE = LSC_WRITE_HITS / (LSC_WRITE_HITS + LSC_WRITE_MISSES)

3.7.6 LSC_WRITE_OPS

Availability: Mali-T760, Mali-T800 series

This counter increments for every load/store L1 cache write access.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. See the section above for a further description; the updated derived counter for the supported GPUs is:

  LSC_WRITE_HITRATE = LSC_WRITE_HITS / LSC_WRITE_OPS

3.7.7 LSC_ATOMIC_HITS

Availability: All

This counter increments for every atomic memory access which hits in the L1 atomic cache.

3.7.8 LSC_ATOMIC_MISSES

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every atomic memory access which misses in the L1 atomic cache.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. If the hit rate is particularly poor it may indicate cache thrashing due to badly packed data, with poor spatial and/or temporal locality of access.

  LSC_ATOMIC_HITRATE = LSC_ATOMIC_HITS / (LSC_ATOMIC_HITS + LSC_ATOMIC_MISSES)

Atomics are sometimes used for synchronizing multiple workgroups, requiring the same atomic memory location to be accessed by multiple shader cores. This makes them more susceptible than other memory types to cache thrashing, as the line in the atomics cache can be deliberately stolen by another shader core rather than evicted due to normal cache pressure. Application developers should therefore ensure that there is an adequate amount of processing relative to the number of global atomic accesses; otherwise the memory synchronization overheads of the atomic fields will dominate the processing workload.

3.7.9 LSC_ATOMIC_OPS

Availability: Mali-T760, Mali-T800 series

This counter increments for every atomic memory access to the L1 atomic cache.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. See the section above for a further description; the updated derived counter for the supported GPUs is:

  LSC_ATOMIC_HITRATE = LSC_ATOMIC_HITS / LSC_ATOMIC_OPS

3.7.10 LSC_LINE_FETCHES

Availability: All

This counter increments for every line fetched by the L1 cache from the L2 memory system.

3.7.11 LSC_DIRTY_LINE

Availability: All

This counter increments for every dirty line evicted from the L1 cache into the L2 memory system.

3.7.12 LSC_SNOOPS

Availability: All

This counter increments for every snoop into the L1 cache from the L2 memory system.

3.8 Texture Pipe Events

This counter set looks at the texture pipe behavior.

3.8.1 TEX_WORDS

Availability: All

This counter increments for every architecturally executed texture instruction.

3.8.2 TEX_ISSUES (1)

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every texture issue cycle used. Some instructions take more than one cycle due to data cache misses, as well as a number of multi-cycle filtering operations:

  • 2D bilinear filtering takes one cycle
  • 2D trilinear filtering takes two cycles
  • 3D bilinear filtering takes two cycles
  • 3D trilinear filtering takes four cycles

We can calculate a texture pipe CPI in the same way we can with the LS pipe:

  TEX_CPI = TEX_ISSUES / TEX_WORDS

This gives some indication of the number of recirculations which are occurring in the design. For content using 2D textures with bilinear filtering a good CPI figure is generally around 1.2; anything significantly higher than this generally means that one of the data caches is being thrashed. In this case consider:

  • Using mipmaps - for 3D scenes you should always use mipmapped textures as they reduce aliasing artefacts and they go faster.
  • Using texture compression - ASTC provides excellent quality and flexibility so there is no reason not to.
  • In extreme cases, reducing texture resolution or the number of textures being used.

3.8.3 TEX_ISSUES (2)

Availability: Mali-T760, Mali-T800 series

This counter increments for every texture issue cycle used. Some instructions take more than one cycle due to multi-cycle filtering operations. Unlike the older version of this counter, this implementation does not count retries due to misses in the main data cache.

The cycle counts for each operation and the CPI derived counter are the same as shown in "TEX_ISSUES (1)" above, although the CPI counter will generally be lower as cache miss overheads are not included. If this number is substantially over 1.0 and you are texture limited then reduce use of trilinear filtering and 3D textures.

3.8.4 TEX_RECIRC_FMISS

Availability: Mali-T600, Mali-T620, Mali-T760, Mali-T800 series

This counter counts the number of full cache misses, where no samples of a texture are present in the texture cache and the thread must wait for a cache access. The following equation provides an approximation of the percentage of read hits, although it does assume only one cache line is needed per texture instruction, which is not true for some texture formats.

  TEX_READ_HITRATE = (TEX_WORDS - TEX_RECIRC_FMISS) / TEX_WORDS
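
For example, a one-line Python version of this approximation, with hypothetical counter values, bearing in mind the single-line-per-instruction caveat above:

  def tex_read_hitrate(tex_words, tex_recirc_fmiss):
      return (tex_words - tex_recirc_fmiss) / tex_words if tex_words else 0.0

  print(f"{tex_read_hitrate(200_000, 14_000):.1%}")  # -> 93.0%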

4 Tiler Counters

The tiler counters provide details of the workload of the fixed function tiling unit, which places primitives into the tile lists which are subsequently read by the fragment frontend during fragment shading.

4.1 Tiler Activity

These counters show the overall activity of the tiling unit.

4.1.1 TI_ACTIVE

Availability: Mali-T600, Mali-T620, Mali-T760, Mali-T860, Mali-T880

This counter increments every cycle the tiler is processing a task. The tiler can run in parallel to vertex shading and fragment shading, so a high cycle count here does not necessarily imply a bottleneck, unless the COMPUTE_ACTIVE counters in the shader cores are very low relative to this.

4.2 Tiler Primitive Occurrence

These counters give a functional breakdown of the workload given to the GPU by the application.

4.2.1 TI_POINTS

Availability: All

This counter increments for every point primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.

4.2.2 TI_LINES

Availability: All

This counter increments for every line segment primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.

4.2.3 TI_TRIANGLES

Availability: All

This counter increments for every triangle primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.

4.3 Tiler Visibility and Culling Occurrence

These counters give a breakdown of how the workload has been affected by clipping and culling.

4.3.1 TI_PRIM_VISIBLE

Availability: All

This counter is incremented for every primitive which is visible according to its type, clip-space coordinates, and front-face or back-face orientation.

4.3.2 TI_PRIM_CULLED

Availability: All

This counter is incremented for every primitive which is culled due to the application of front-face or back-face culling rules. For most meshes approximately half of the triangles are back-facing, so this counter should typically be similar to the number of visible primitives, although lower is always better.

4.3.3 TI_PRIM_CLIPPED

Availability: All

This counter is incremented for every primitive which is culled due to being totally outside of the clip-space volume. Application-side culling should be used to minimize the amount of out-of-shot geometry being sent to the GPU, as it is expensive in terms of bandwidth and power. One of my blogs looks at application-side culling in more detail: Mali Performance 5: An Application's Performance Responsibilities.

4.3.4 TI_FRONT_FACING

Availability: All

This counter is incremented for every triangle which is front-facing. This counter is incremented after culling, so only counts primitives which are actually emitted into the tile list.

4.3.5 TI_BACK_FACING

Availability: All

This counter is incremented for every triangle which is back-facing. This counter is incremented after culling, so only counts primitives which are actually emitted into the tile list. If you are not using back-facing triangles for some special algorithmic purpose, such as Refraction Based on Local Cubemaps, then a high value here relative to the total number of triangles may indicate that the application has forgotten to turn on back-face culling. For most opaque geometry no back-facing triangles should be expected.

5 L2 Cache Counters

This section documents the behavior of the L2 memory system counters. In systems which implement multiple L2 caches or bus interfaces the counters presented in DS-5 Streamline are generally the sum of the counters from all of the L2 counter blocks present, so care needs to be taken when comparing these counters against the total GPU cycles.

5.1 Internal Read Traffic Events

These counters profile the internal read traffic into the L2 cache from the various internal masters.

5.1.1 L2_READ_LOOKUP

Availability: All

The counter increments for every read transaction received by the L2 cache.

5.1.2 L2_READ_HITS

Availability: All

The counter increments for every read transaction received by the L2 cache which also hits in the cache.

  L2_READ_HITRATE = L2_READ_HITS / L2_READ_LOOKUP

5.1.3 L2_READ_SNOOP

Availability: All

The counter increments for every inner coherent read snoop transaction received by the L2 cache.

5.2 Internal Write Traffic Events

These counters profile the internal write traffic into the L2 cache from the various internal masters.

5.2.1 L2_WRITE_LOOKUP

Availability: All

The counter increments for every write transaction received by the L2 cache.

5.2.2 L2_WRITE_HITS

Availability: All

The counter increments for every write transaction received by the L2 cache which also hits in the cache.

  L2_WRITE_HITRATE = L2_WRITE_HITS / L2_WRITE_LOOKUP

5.2.3 L2_WRITE_SNOOP

Availability: All

The counter increments for every inner coherent write snoop transaction received by the L2 cache.

5.3 External Read Traffic Events

These counters profile the external read memory interface behavior. Note that this includes traffic from the entire GPU L2 memory subsystem, not just traffic from the L2 cache, as some types of access will bypass the L2 cache.

5.3.1 L2_EXT_READ_BEATS

Availability: All

This counter increments on every clock cycle a read beat is read off the AXI bus. With knowledge of the bus width used in the GPU this can be converted into a raw bandwidth counter.

  L2_EXT_READ_BYTES = L2_EXT_READ_BEATS * L2_AXI_WIDTH_BYTES

Most implementations of the Midgard GPU use a 128-bit (16 byte) AXI interface, but a 64-bit (8 byte) interface is also possible to reduce the area used by a design. This information can be obtained from your chipset manufacturer.

It is also possible to determine the total percentage of available AXI port bandwidth used, using the equation below. Note that we need to normalize the number of read beats, which are accumulated by DS-5 Streamline, into a per-port count. If a design uses 1-4 shader cores then only a single AXI port will be present; otherwise two AXI ports will be present.

  L2_EXT_READ_UTILIZATION = L2_EXT_READ_BEATS / (L2_AXI_PORT_COUNT * GPU_ACTIVE)

This utilization metric ignores any frequency changes which may occur downstream of the GPU. If you have, for example, a 600MHz GPU connected to a 300MHz AXI of the same data width then it will be impossible for the GPU to achieve more than 50% utilization of its native interface as AXI will be unable to provide the data as quickly as the GPU can consume it.
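
Putting the bandwidth and utilization equations together, the sketch below converts read beats into MB/s and a per-port utilization figure. The 128-bit bus, two AXI ports, 1ms sample window, and counter values are all assumptions for a hypothetical configuration:

  AXI_WIDTH_BYTES = 16     # assumed 128-bit interface
  AXI_PORT_COUNT = 2       # 1 for designs with 1-4 shader cores
  SAMPLE_PERIOD_S = 0.001  # assumed sample window: 1ms

  def ext_read_stats(read_beats, gpu_active):
      bandwidth_mb_s = read_beats * AXI_WIDTH_BYTES / SAMPLE_PERIOD_S / 1e6
      utilization = read_beats / (AXI_PORT_COUNT * gpu_active) if gpu_active else 0.0
      return bandwidth_mb_s, utilization

  bw, util = ext_read_stats(read_beats=300_000, gpu_active=600_000)
  print(f"{bw:.0f} MB/s read, {util:.0%} of theoretical port bandwidth")

The identical calculation applies to L2_EXT_WRITE_BEATS in section 5.4.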

5.3.2 L2_EXT_R_BUF_FULL

Availability: All except Mali-T720

This counter increments every cycle that the GPU is unable to create a new read transaction because there are no free entries in the internal response buffer. If this number is high it may indicate that the AXI interface has been built with too few outstanding transactions. This counter is mostly useful to system integrators tuning the AXI implementation, rather than application developers.

5.3.3 L2_EXT_RD_BUF_FULL

Availability: All except Mali-T720

This counter increments if a read response is received and the internal read data buffer is full. This can happen if read responses are interleaved by the bus. This counter is mostly useful to system integrators tuning the AXI implementation, rather than application developers.

5.3.4 L2_EXT_AR_STALL

Availability: All

This counter increments every cycle that the GPU is unable to issue a new read transaction to AXI, because AXI is unable to accept the request. If this number is high it may indicate that the AXI bus is suffering from high contention. In these cases application developers can improve performance by reducing the total amount of bandwidth their application is using, be that by reducing geometry complexity, compressing textures, or reducing resolution.

5.4 External Write Traffic Events

These counters profile the external write memory interface behavior. Note that this includes traffic from the entire GPU L2 memory subsystem, not just traffic from the L2 cache.

5.4.1 L2_EXT_WRITE_BEATS

Availability: All

This counter increments on every clock cycle a write data beat is sent on the AXI bus. With knowledge of the bus width used in the GPU this can be converted into a raw bandwidth counter.

  L2_EXT_WRITE_BYTES = L2_EXT_WRITE_BEATS * L2_AXI_WIDTH_BYTES

Most implementations of the Midgard GPU use a 128-bit (16 byte) AXI interface, but a 64-bit (8 byte) interface is also possible to reduce the area used by a design. This information can be obtained from your chipset manufacturer.

It is also possible to determine the total percentage of available AXI port bandwidth used, using the equation below. Note that we need to normalize the number of write beats, which are accumulated by DS-5 Streamline, into a per-port count.

  L2_EXT_WRITE_UTILIZATION = L2_EXT_WRITE_BEATS / (L2_AXI_PORT_COUNT * GPU_ACTIVE)

This utilization metric ignores any frequency changes which may occur downstream of the GPU. If you have, for example, a 600MHz GPU connected to a 300MHz AXI of the same data width then it will be impossible for the GPU to achieve more than 50% utilization of its native interface as AXI will be unable to accept the data as quickly as the GPU can generate it.
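
The read and write sides can also be combined into a single total external bandwidth figure, which is often the number a system integrator cares about most. A minimal sketch, reusing the bus width and sample window assumptions from section 5.3:

  def total_ext_bandwidth_mb_s(read_beats, write_beats,
                               width_bytes=16, sample_period_s=0.001):
      return (read_beats + write_beats) * width_bytes / sample_period_s / 1e6

  print(f"{total_ext_bandwidth_mb_s(300_000, 150_000):.0f} MB/s")  # -> 7200 MB/s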

5.4.2 L2_EXT_W_BUF_FULL

Availability: All

This counter increments every cycle that the GPU is unable to create a new write transaction because there are no free entries in the internal write buffer. If this number is high it may indicate that the AXI interface has been built with too few outstanding transactions. This counter is mostly useful to system integrators tuning the AXI implementation, rather than application developers.

5.4.3 L2_EXT_W_STALL

Availability: All

This counter increments every cycle that the GPU is unable to issue a new write transaction to AXI, because AXI is unable to accept the request. If this number is high it may indicate that the AXI bus is suffering from high contention, or that the device bus clock is too low. In these cases application developers can improve performance by reducing the total amount of bandwidth their application is using, be that by reducing geometry complexity, compressing textures, or reducing resolution.

6 Conclusions

This document has defined all of the Mali Midgard family performance counters available via DS-5 Streamline, as well as some pseudo-counters which can be derived from them. Hopefully this provides a useful starting point for your application optimization activity when using Mali GPUs.

We also publish a Mali Application Optimization Guide on the Mali Developer Center:

  • http://malideveloper.arm.com/develop-for-mali/tutorials-developer-guides/developer-guides/mali-gpu-application-optimization-guide/

Other blogs and documents can be found in the ARM Mali Graphics area which explore using the performance tools, as well as specific optimization strategies, in more detail.

 
