Mali Midgard Family Performance Counters

Version 41

Analysis and optimization of graphics and compute content running on a GPU is an important task when trying to build a top quality system integration, or a compelling high performance application. For developers working with the public APIs, such as OpenGL ES and OpenCL, the GPU is a black box which is very difficult to analyze based solely on the API-visible behaviors. Frame pipelining and asynchronous processing effectively decouple the application's performance from the API calls which define the workload, making analysis of performance an activity based on expert knowledge and intuition rather than direct measurement.

Tools such as ARM® DS-5 Streamline provide developers access to the GPU hardware performance counters, the principal means to determine the behavior inside the black box beneath the API and identify any problem areas which need optimization. This document doesn't explain how to use the tools - other documents and blogs have covered this - but instead focuses on what the counters actually mean to a system integrator or application developer.

Table of Contents

  • 1 Performance Counter Infrastructure
    • 1.1 Supported Counters
    • 1.2 Counter Implementation Caveats
    • 1.3 Counter Naming Convention
  • 2 Job Manager Counters
    • 2.1 Top Level Activity
      • 2.1.1 GPU_ACTIVE
      • 2.1.2 JS0_ACTIVE
      • 2.1.3 JS1_ACTIVE
      • 2.1.4 JS2_ACTIVE
      • 2.1.5 IRQ_ACTIVE
    • 2.2 Task Dispatch
      • 2.2.1 JS0_TASKS (1)
      • 2.2.2 JS0_TASKS (2)
      • 2.2.3 JS1_TASKS
      • 2.2.4 JS2_TASKS
  • 3 Shader Core Counters
    • 3.1 Shader Core Activity
      • 3.1.1 FRAG_ACTIVE
      • 3.1.2 COMPUTE_ACTIVE
      • 3.1.3 TRIPIPE_ACTIVE
    • 3.2 Compute Frontend Events
      • 3.2.1 COMPUTE_TASKS
      • 3.2.2 COMPUTE_THREADS
    • 3.3 Fragment Frontend Events
      • 3.3.1 FRAG_PRIMITIVES
      • 3.3.2 FRAG_PRIMITIVES_DROPPED
      • 3.3.3 FRAG_QUADS_RAST
      • 3.3.4 FRAG_QUADS_EZS_TEST
      • 3.3.5 FRAG_QUADS_EZS_KILLED
      • 3.3.6 FRAG_CYCLES_NO_TILE
      • 3.3.7 FRAG_CYCLES_FPKQ_ACTIVE
      • 3.3.8 FRAG_THREADS
      • 3.3.9 FRAG_DUMMY_THREADS
    • 3.4 Fragment Backend Events
      • 3.4.1 FRAG_THREADS_LZS_TEST
      • 3.4.2 FRAG_THREADS_LZS_KILLED
      • 3.4.3 FRAG_NUM_TILES (1)
      • 3.4.4 FRAG_NUM_TILES (2)
      • 3.4.5 FRAG_TRANS_ELIM
    • 3.5 Arithmetic Pipe Events
      • 3.5.1 ARITH_WORDS (1)
      • 3.5.2 ARITH_WORDS (2)
    • 3.6 Load/Store Pipe Events
      • 3.6.1 LS_WORDS
      • 3.6.2 LS_ISSUES (1)
      • 3.6.3 LS_ISSUES (2)
    • 3.7 Load/Store Cache Events
      • 3.7.1 LSC_READ_HITS
      • 3.7.2 LSC_READ_MISSES
      • 3.7.3 LSC_READ_OPS
      • 3.7.4 LSC_WRITE_HITS
      • 3.7.5 LSC_WRITE_MISSES
      • 3.7.6 LSC_WRITE_OPS
      • 3.7.7 LSC_ATOMIC_HITS
      • 3.7.8 LSC_ATOMIC_MISSES
      • 3.7.9 LSC_ATOMIC_OPS
      • 3.7.10 LSC_LINE_FETCHES
      • 3.7.11 LSC_DIRTY_LINE
      • 3.7.12 LSC_SNOOPS
    • 3.8 Texture Pipe Events
      • 3.8.1 TEX_WORDS
      • 3.8.2 TEX_ISSUES (1)
      • 3.8.3 TEX_ISSUES (2)
      • 3.8.4 TEX_RECIRC_FMISS
  • 4 Tiler Counters
    • 4.1 Tiler Activity
      • 4.1.1 TI_ACTIVE
    • 4.2 Tiler Primitive Occurrence
      • 4.2.1 TI_POINTS
      • 4.2.2 TI_LINES
      • 4.2.3 TI_TRIANGLES
    • 4.3 Tiler Visibility and Culling Occurrence
      • 4.3.1 TI_PRIM_VISIBLE
      • 4.3.2 TI_PRIM_CULLED
      • 4.3.3 TI_PRIM_CLIPPED
      • 4.3.4 TI_FRONT_FACING
      • 4.3.5 TI_BACK_FACING
  • 5 L2 Cache Counters
    • 5.1 Internal Read Traffic Events
      • 5.1.1 L2_READ_LOOKUP
      • 5.1.2 L2_READ_HITS
      • 5.1.3 L2_READ_SNOOP
    • 5.2 Internal Write Traffic Events
      • 5.2.1 L2_WRITE_LOOKUP
      • 5.2.2 L2_WRITE_HITS
      • 5.2.3 L2_WRITE_SNOOP
    • 5.3 External Read Traffic Events
      • 5.3.1 L2_EXT_READ_BEATS
      • 5.3.2 L2_EXT_R_BUF_FULL
      • 5.3.3 L2_EXT_RD_BUF_FULL
      • 5.3.4 L2_EXT_AR_STALL
    • 5.4 External Write Traffic Events
      • 5.4.1 L2_EXT_WRITE_BEATS
      • 5.4.2 L2_EXT_W_BUF_FULL
      • 5.4.3 L2_EXT_W_STALL
  • 6 Conclusions

1 Performance Counter Infrastructure

The Midgard GPU family supports many performance counters which can all be captured simultaneously. Performance counters are provided for each functional block in the design:

  • Job Manager
  • Shader core(s)
  • Tiler
  • L2 cache(s)

See my earlier blog series for an introduction to the Midgard GPU architecture - they introduce some of the fundamental concepts which are important to understand and which place the more detailed information in this document in context.

  • The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining
  • The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering
  • The Mali GPU: An Abstract Machine, Part 3 - The Shader Core

1.1 Supported Counters

The GPUs in the Midgard family implement a large number of performance counters natively in the hardware, and it is also generally useful to generate some pseudo-counters by combining one or more of the raw hardware counters in useful and interesting ways. This document will describe all of the counters exported from DS-5 Streamline, and some of the useful pseudo-counters which can be derived from them. DS-5 Streamline allows custom performance counter graphs to be created using equations, so all of these performance counters can be directly visualized in the GUI.
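
DS-5 Streamline evaluates these equations directly in its GUI, but the same derivations can be reproduced offline if the raw counter values are exported. The Python sketch below shows the general pattern; the per-window sample layout and the counter values are hypothetical, purely for illustration:

  # A minimal sketch of offline pseudo-counter derivation, assuming raw
  # counter samples are available as {counter_name: value} dicts per window.
  def derived_ratio(sample, numerator, denominator):
      """Compute a ratio pseudo-counter for one sample window."""
      denom = sample.get(denominator, 0)
      if denom == 0:
          return 0.0  # treat idle windows as zero rather than dividing by zero
      return sample.get(numerator, 0) / denom

  samples = [{"GPU_ACTIVE": 480_000, "TRIPIPE_ACTIVE": 312_000}]  # hypothetical
  for s in samples:
      print(derived_ratio(s, "TRIPIPE_ACTIVE", "GPU_ACTIVE"))  # -> 0.65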

1.2 Counter Implementation Caveats

The hardware counter implementation in the GPU is designed to be low cost, such that it has minimal impact on performance and power. Many of the counters are close approximations of the behavior described in this document in order to minimize the amount of additional hardware logic required to generate the counter signals, so you may encounter some small deviations from the behavior you expect.

1.3 Counter Naming Convention

The counters in the Midgard GPU family have evolved slightly as we have released new hardware models. The Midgard counters in Streamline use the following naming convention:

  ARM_Mali-T<GPUID>_<NAME>

For example:

  ARM_Mali-T76x_GPU_ACTIVE

The counter descriptions in this document are based on the "<NAME>" sub-string of the overall name, as many GPUs in the family implement similar counters. Availability of the counters in each GPU is documented alongside the description.

2 Job Manager Counters

This section describes the counters implemented by the Mali Job Manager component.

2.1 Top Level Activity

These counters define the overall number of cycles that the GPU was processing workloads of one type or another.

2.1.1 GPU_ACTIVE

Availability: All

This counter increments every cycle that the GPU either has any workload queued in a Job slot, or the GPU cycle counter is running for OpenCL profiling. Note that this counter may increment even though a Job is stalled waiting for memory - that is still counted as "active" time even though no forward progress was made.

If the GPU operating frequency is known then overall GPU utilization can be calculated as:

  GPU_UTILIZATION = GPU_ACTIVE / GPU_MHZ
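
As an illustration, here is a minimal Python sketch of this calculation, assuming a DVFS-locked GPU clock and a fixed Streamline sample window; the 600MHz frequency, 1ms window, and counter value are all assumptions for the example:

  GPU_FREQ_HZ = 600_000_000   # assumed clock: 600MHz, locked (see DVFS note below)
  SAMPLE_PERIOD_S = 0.001     # assumed sample window: 1ms

  def gpu_utilization(gpu_active_cycles):
      available_cycles = GPU_FREQ_HZ * SAMPLE_PERIOD_S
      return gpu_active_cycles / available_cycles

  print(f"{gpu_utilization(588_000):.1%}")  # -> 98.0%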

Well pipelined applications which are not running at vsync and keeping the GPU busy should achieve a utilization of around 98%. Lower utilization than this typically indicates one of the following scenarios:

  • Content running at vsync.
    • In this scenario the GPU goes idle, as it has no need to run until the next vsync signal.
  • Content which is bottlenecked by the CPU.
    • In this scenario the application or driver is causing high CPU load, and cannot build new workloads for the GPU quickly enough to keep it busy.
  • Content which is oscillating between CPU and GPU activity.
    • In this scenario the application is using APIs which break the frame-level pipeline needed to keep the GPU busy. The most common causes are calls to glReadPixels() or glFinish(), as these explicitly drain the pipeline, but other API calls can cause stalls if used in a blocking manner before their result is ready. These include calls such as glClientWaitSync(), glWaitSync(), or glGetQueryObjectuiv().

Collecting GPU activity and CPU activity as part of the same DS-5 Streamline data capture can help disambiguate between the cases above. This type of analysis is explored in more detail in this blog: Mali Performance 1: Checking the Pipeline.

Note: Most modern devices support Dynamic Voltage and Frequency Scaling (DVFS) to optimize energy usage, which means that the GPU frequency is often not constant while running a piece of content. If possible, it is recommended that platform DVFS is disabled, locking the CPU, GPU, and memory bus at fixed frequencies, as this makes performance analysis much easier and results more reproducible. The method for doing this is device specific, and may not be possible at all on production devices - please refer to your platform's documentation for details.

2.1.2 JS0_ACTIVE

Availability: All

This counter increments every cycle that the GPU has a Job chain running in Job slot 0. This Job slot is used solely for the processing of fragment Jobs, so this corresponds directly to fragment shading workloads.

For most content there are orders of magnitude more fragments than vertices, so this Job slot will usually be the dominant slot with the highest processing load. For content which is not hitting vsync, where the GPU is the performance bottleneck, it is normal for JS0_ACTIVE to be approximately equal to GPU_ACTIVE. In this scenario vertex processing can run in parallel with the fragment processing, allowing fragment processing to run all of the time.

2.1.3 JS1_ACTIVE

Availability: All

This counter increments every cycle the GPU has a Job chain running in Job slot 1. This Job slot can be used for compute shaders, vertex shaders, and tiling workloads. This counter cannot disambiguate between these workloads.

2.1.4 JS2_ACTIVE

Availability: All

This counter increments every cycle the GPU has a Job chain running in Job slot 2. This Job slot can be used for compute shaders, and vertex shaders.

In most system configurations vertex shading and tiling workloads are submitted together as a single batched Job chain via Job slot 1, so even though this slot can technically execute vertex shading, it does not often do so. For most graphics content this Job slot is therefore idle.

2.1.5 IRQ_ACTIVE

Availability: All

This counter increments every cycle the GPU has an interrupt pending, awaiting handling by the driver running on the CPU. Note that this does not necessarily indicate lost performance as the GPU can still process Job chains from other Job slots, as well as process the next work item in the pending Job slot, while an interrupt is pending.

2.2 Task Dispatch

This section looks at the counters related to how the Job Manager issues work to shader cores.

2.2.1 JS0_TASKS (1)

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments every time the Job Manager issues a task to a shader core. For JS0 these tasks correspond to a single 16x16 pixel screen region, although not all of these pixels may be rendered due to viewport or scissor settings.

2.2.2 JS0_TASKS (2)

Availability: Mali-T760, Mali-T800 series

This counter increments every time the Job Manager issues a task to a shader core. For JS0 these tasks correspond to a single 32x32 pixel screen region, although not all of these pixels may be rendered due to viewport or scissor settings.

2.2.3 JS1_TASKS

Availability: All

This counter increments every time the Job Manager issues a task to a shader core or the tiler. For JS1 these tasks correspond to a range of vertices or compute work items (shader cores), or a range of indices (tiler). The size of these tasks is driver controlled, although for compute tasks it must be a multiple of the workgroup size.

2.2.4 JS2_TASKS

Availability: All

This counter increments every time the Job Manager issues a task to a shader core. For JS2 these tasks correspond to a range of vertices or compute work items. The size of these tasks is driver controlled, although for compute tasks it must be a multiple of the workgroup size.

3 Shader Core Counters

This section describes the counters implemented by the Mali Shader Core component. For the purposes of clarity this section talks about either fragment workloads or compute workloads. Vertex workloads are treated as a one dimensional compute problem by the shader core, so are counted as a compute workload from the point of view of the counters in this section.

The GPU hardware records separate counters per shader core in the system. DS-5 Streamline shows the average of all of the shader core counters.

3.1 Shader Core Activity

These counters show the total activity level of the shader core.

3.1.1 FRAG_ACTIVE

Availability: All

This counter increments every cycle at least one fragment task is active anywhere inside the shader core, including the fixed-function fragment frontend, the programmable tripipe, or the fixed-function fragment backend.

3.1.2 COMPUTE_ACTIVE

Availability: All

This counter increments every cycle at least one compute task is active anywhere inside the shader core, including the fixed-function compute frontend, or the programmable tripipe.

3.1.3 TRIPIPE_ACTIVE

Availability: All

This counter increments every cycle at least one thread is active inside the programmable tripipe. Note that this counter does not give any idea of the total utilization of the tripipe resources; it simply indicates that something was running. An approximation of the overall utilization of the tripipe can be achieved via the following equation:

  TRIPIPE_UTILIZATION = TRIPIPE_ACTIVE / GPU_ACTIVE

A low tripipe utilization can indicate:

  • Content which is tiler limited, so the shader cores are going idle.
  • Content which contains many micro-polygons - triangles smaller than one pixel - which generate no rasterizer sample coverage.
  • Content which contains many tiles containing no geometry, i.e. rendering just the clear color.

3.2 Compute Frontend Events

These counters show the task and thread issue behavior of the shader core's fixed function compute frontend which issues work into the programmable tri-pipe.

3.2.1 COMPUTE_TASKS

Availability: All

This counter increments for every compute task handled by the shader core. When totalled across all shader cores this should show the total number of shader core tasks issued by JS1 and JS2.

3.2.2 COMPUTE_THREADS

Availability: All

This counter increments for every compute thread spawned by the shader core. One compute thread is spawned for every work item (compute shaders) or vertex (vertex shaders).

3.3 Fragment Frontend Events

These counters show the task and thread issue behavior of the shader core's fixed-function fragment frontend. This unit is significantly more complicated than the compute frontend, so there are a large number of counters available.

3.3.1 FRAG_PRIMITIVES

Availability: All

This counter increments for every primitive read from the tile list. Not all of these primitives will necessarily be visible in the current tile, due to the use of the hierarchical tiler, but it gives some idea of the internal tile list access rate.

3.3.2 FRAG_PRIMITIVES_DROPPED

Availability: All

This counter increments for every primitive read from the tile list which is subsequently discarded because it is not relevant for the tile currently being rendered. The number of fragment threads which are issued per primitive can be given by the following equation:

  THREADS_PER_PRIMITIVE_LOAD = FRAG_THREADS / (FRAG_PRIMITIVES - FRAG_PRIMITIVES_DROPPED)

It is recommended that the number of fragment threads per primitive load is kept above 10 to ensure that the tripipe is kept filled with work for as much of the frame rendering time as possible.

Note that averaging this metric over a large time window can hide problematic cases; a sequence of 1000 primitives of 256 fragments each (i.e. a full tile) followed by 1000 primitives of 1 fragment each (i.e. a single-pixel micro-triangle) would give an average of ~128 pixels per primitive, but would run slower than a test rendering 2000 primitives of 128 pixels each.
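
This pitfall is easy to check for offline if per-window counter values are available. The Python sketch below reproduces the two synthetic workloads from the note; the window layout and the ~10 threads/primitive target are taken from the text above, and the values are illustrative:

  def threads_per_primitive(frag_threads, frag_prims, frag_prims_dropped):
      loaded = frag_prims - frag_prims_dropped
      return frag_threads / loaded if loaded else 0.0

  # Each window is (FRAG_THREADS, FRAG_PRIMITIVES, FRAG_PRIMITIVES_DROPPED).
  windows = [(256_000, 1000, 0),   # 1000 full-tile primitives, 256 threads each
             (1_000, 1000, 0)]     # 1000 micro-triangles, 1 thread each
  for threads, prims, dropped in windows:
      tpp = threads_per_primitive(threads, prims, dropped)
      if tpp < 10:
          print(f"{tpp:.1f} threads/primitive: below the ~10 target")

  # The whole-capture average hides the problem window entirely:
  avg = sum(w[0] for w in windows) / sum(w[1] - w[2] for w in windows)
  print(f"average: {avg:.1f}")  # -> 128.5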

3.3.3 FRAG_QUADS_RAST

Availability: All

This counter increments for every 2x2 pixel quad which is rasterized by the rasterization unit. The quads generated have at least some coverage based on the current sample pattern, but may subsequently be killed by early depth and stencil testing and as such never issued to the tri-pipe.

3.3.4 FRAG_QUADS_EZS_TEST

Availability: All

This counter increments for every 2x2 pixel quad which is subjected to early depth and stencil (ZS) test and update. We want as many quads as possible to be subject to early ZS testing as it is significantly more efficient than late ZS testing. It is more efficient because it lets us kill work before it becomes executable threads which consume tripipe resources, rather than after those threads have completed execution.

Note: this counter will only increment for quads which complete all necessary ZS test and ZS update operations. If a fragment performs an early ZS test but requires a late ZS update - possible, for example, if the shader contains a discard statement - then this counter will not increment. In this scenario it is possible for the FRAG_QUADS_EZS_KILLED counter to be higher than this FRAG_QUADS_EZS_TEST counter, which can seem slightly counter-intuitive.

3.3.5 FRAG_QUADS_EZS_KILLED

Availability: All

This counter increments for every 2x2 pixel quad which is completely killed by early depth and stencil (ZS) testing. These killed quads will not generate any further processing in the shader core.

3.3.6 FRAG_CYCLES_NO_TILE

Availability: All

This counter increments every cycle the shader core early ZS unit is blocked from making progress because there is no physical tile memory for a new color, depth, or stencil buffer available. Note that this may not actually cause lost performance if there is sufficient work buffered after the EZS unit that is waiting to run.

A high counter value here generally means that the GPU cannot free up tile memory fast enough to satisfy the rate at which it is processing. Ensure that the GPU is not bottlenecking on external memory bandwidth, and use glDiscardFramebufferEXT() or glInvalidateFramebuffer() to discard unneeded transient tile state in order to make that memory available for the next tile as quickly as possible. See this blog on framebuffer lifetime for more information on the correct use of these two API calls: Mali Performance 2: How to Correctly Handle Framebuffers.

3.3.7 FRAG_CYCLES_FPKQ_ACTIVE

Availability: All except Mali-T600

This counter increments every cycle the pre-pipe FPK buffer contains at least one 2x2 pixel quad waiting to be executed in the tripipe. If this buffer drains, the frontend will be unable to spawn a new thread when a thread slot becomes free. However, an empty buffer is only an indication of possible lost performance, rather than actual lost performance, as the tripipe may already contain enough threads to fully utilize the critical-path functional units.

3.3.8 FRAG_THREADS

Availability: All

This counter increments for every fragment thread created by the GPU. These may be real threads, or dummy threads (see below for a description of a dummy thread).

3.3.9 FRAG_DUMMY_THREADS

Availability: All

This counter increments for every dummy fragment thread created by the GPU. These are real executable threads which have no active sample coverage, but which share a quad with at least one thread which does have sample coverage. These dummy threads are needed to ensure correctly calculated derivatives if the fragment shader uses the dFdx() or dFdy() built-in functions, or implicit mipmap level selection.

If this number is a significant percentage of the total number of fragment threads it generally indicates that the content is generating micro-triangles, which have a high percentage of edge pixels. In these cases performance can be improved by reducing triangle counts in the model meshes.
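
A simple derived percentage makes this check easy to automate; the sketch below is illustrative, and the 25% alert threshold is a rule of thumb rather than a documented limit:

  def dummy_thread_fraction(frag_dummy_threads, frag_threads):
      return frag_dummy_threads / frag_threads if frag_threads else 0.0

  frac = dummy_thread_fraction(120_000, 400_000)  # hypothetical sample values
  if frac > 0.25:  # illustrative threshold, not a documented limit
      print(f"{frac:.0%} dummy threads: likely micro-triangles in the meshes")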

3.4 Fragment Backend Events

These counters record the fragment backend behavior.

3.4.1 FRAG_THREADS_LZS_TEST

Availability: All

This counter increments for every thread triggering late depth and stencil (ZS) testing.

3.4.2 FRAG_THREADS_LZS_KILLED

Availability: All

This counter increments for every thread killed by late depth and stencil (ZS) testing. These threads are killed after their fragment program has executed, so a significant number of threads being killed at late ZS implies a significant amount of lost performance and/or wasted energy performing rendering which has no useful visual output. Late depth and stencil testing is generally triggered by shaders which modify depth or stencil programmatically, or which have shader-dependent sample coverage (the shader uses "discard" or alpha-to-coverage). The driver will aggressively try to use early ZS whenever possible, even in these cases, so using these features does not guarantee an occurrence of late ZS.
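
The two late ZS counters can be combined into a rough measure of wasted fragment work. The sketch below treats every thread as equal cost, which is only an approximation, and the counter values are hypothetical:

  def late_zs_stats(lzs_test, lzs_killed, frag_threads):
      kill_rate = lzs_killed / lzs_test if lzs_test else 0.0       # of tested threads
      wasted = lzs_killed / frag_threads if frag_threads else 0.0  # of all threads
      return kill_rate, wasted

  kill_rate, wasted = late_zs_stats(90_000, 60_000, 400_000)
  print(f"late ZS kill rate {kill_rate:.0%}, wasted fragment work ~{wasted:.0%}")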

3.4.3 FRAG_NUM_TILES (1)

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every 16x16 pixel tile rendered by the shader core. In a single core system this should be the same as the JS0_TASKS (1) counter.

3.4.4 FRAG_NUM_TILES (2)

Availability: Mali-T760, Mali-T800 series

This counter increments for every 32x32 pixel screen region rendered by the shader core. In a single core system this should be the same as the JS0_TASKS (2) counter.

3.4.5 FRAG_TRANS_ELIM

Availability: All

This counter increments for every physical rendered tile which has its writeback cancelled due to a matching transaction elimination CRC hash. If a high percentage of the tile writes are being eliminated this implies that you are re-rendering the entire screen when not much has changed, so consider using scissor rectangles to minimize the amount of area which is redrawn. This isn't always easy, especially for window surfaces which are pipelined using multiple buffers, but EGL extensions such as the following may be supported on your platform to help manage partial frame updates:

  • https://www.khronos.org/registry/egl/extensions/KHR/EGL_KHR_partial_update.txt
  • https://www.khronos.org/registry/egl/extensions/EXT/EGL_EXT_swap_buffers_with_damage.txt

Note that in the Mali-T760 and Mali-T800 series the size of a rendered tile is not the same as the region size counted in the shader core tile issue counter (FRAG_NUM_TILES (2)). Physical render tiles can vary from 16x16 pixels (largest) down to 4x4 pixels (smallest), with a number of intermediate sizes possible. The physical tile size used depends on the number of bytes of memory needed to store the working set for each pixel.

This in turn depends on the use of:

  • Multi-sample anti-aliasing (MSAA)
  • OpenGL ES 3.0 multiple render targets (MRT)
  • OpenGL ES 3.0 large color formats, e.g. 16 or 32-bit per channel data formats
  • Pixel local storage (PLS)

This counter cannot be used to determine the physical tile size, and it is possible that one application may use different physical tile sizes for different render passes. However, for most content using a single render target, RGBA color formats up to 32 bits per pixel, and up to 4x MSAA, the rendered tile size will be 16x16 pixels.

3.5 Arithmetic Pipe Events

These counters look at the behavior of the arithmetic pipe.

3.5.1 ARITH_WORDS (1)

Availability: All except Mali-T720, Mali-T820, and Mali-T830

This counter increments for every arithmetic instruction architecturally executed. This counter is normalized based on the number of arithmetic pipelines implemented in the design, so gives the "per pipe" performance, rather than the total executed workload. The peak performance is one arithmetic instruction per arithmetic pipeline per cycle, so the effective utilization of the arithmetic hardware can be computed as:

  ARITH_ARCH_UTILIZATION = ARITH_WORDS / TRIPIPE_ACTIVE

3.5.2 ARITH_WORDS (2)

Availability: Mali-T720, Mali-T820, Mali-T830

This counter increments for every batched arithmetic instruction executed. Mali-T720, Mali-T820, and Mali-T830 implement a different arithmetic pipeline to the other Midgard GPU cores, which improves performance density at the expense of lower peak performance. In this design the GPU automatically batches together multiple threads which are not fully utilizing the lanes of a SIMD vector unit, in order to improve lane utilization. This counter counts the number of batched execution issues (where the batch size may be a single thread), rather than the number of individual instruction issues from each logical thread.

Due to the dynamic batching of instructions the architectural performance of this pipeline is not trivial to measure using counters, as it may batch 1, 2, or 4 threads into a single issue cycle and its ability to do so depends heavily on the code generated by the shader compiler. However it is still true that we can issue one batch per clock, so the headline utilization is still measurable via the equation:

  ARITH_ARCH_UTILIZATION = ARITH_WORDS / TRIPIPE_ACTIVE

3.6 Load/Store Pipe Events

These counters look at the behavior of the load/store pipe.

3.6.1 LS_WORDS

Availability: All

This counter increments for every load/store instruction architecturally executed. Under ideal circumstances we can issue one instruction per clock, so the architectural utilization of the load store pipe is:

  LS_ARCH_UTILIZATION = LS_WORDS / TRIPIPE_ACTIVE

3.6.2 LS_ISSUES (1)

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every load/store instruction issued, including any reissues due to varying or data cache misses. We can issue or reissue a single instruction per clock cycle, so the microarchitectural utilization can be computed as:

  LS_UARCH_UTILIZATION = LS_ISSUES / TRIPIPE_ACTIVE

Under ideal circumstances we can execute a single load/store word in a single pass, so we can calculate a Cycles Per Instruction (CPI) metric as:

  LS_CPI = LS_ISSUES / LS_WORDS

A good CPI figure is generally around 1.2; anything significantly higher than this generally means that one of the data caches is being thrashed. For vertex shaders you might want to look at:

  • Reducing the number of uniforms or vertex attributes.
  • Reducing the precision of uniforms or vertex attributes.
  • Ensuring that no unused fields are interleaved in your attribute streams or uniform buffers.
  • Ensuring that your geometry data has good spatial locality (vertices that are near to other vertices in the model mesh are also near to each other in memory).

For fragment shaders the same rules apply, but to varying values instead of input attributes; for compute shaders the rules apply to shader storage buffer objects (SSBOs).
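
The CPI check itself is trivial to automate. Here is a minimal Python sketch using the ~1.2 guideline quoted above, with hypothetical counter values:

  def ls_cpi(ls_issues, ls_words):
      return ls_issues / ls_words if ls_words else 0.0

  cpi = ls_cpi(ls_issues=150_000, ls_words=100_000)
  if cpi > 1.2:  # guideline from the text above
      print(f"LS CPI {cpi:.2f}: a data cache may be thrashing; "
            "review attribute, uniform, and varying layout")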

3.6.3 LS_ISSUES (2)

Availability: Mali-T760, Mali-T800 series

This counter increments for every load/store instruction issued, including any reissues due to varying cache misses. It should be noted that Mali-T760 onwards does not count retry cycles due to misses in the main data cache as part of this metric, so the cache analysis has to be done using the LSC counters described in the next section.

The derived counters are the same as those in "LS_ISSUES (1)" above.

3.7 Load/Store Cache Events

These events monitor the performance of the load/store cache.

3.7.1 LSC_READ_HITS

Availability: All

This counter increments for every load/store L1 cache read access which is a hit.

3.7.2 LSC_READ_MISSES

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every load/store L1 cache read access which is a miss.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. If the hit rate is particularly poor it may indicate cache thrashing due to badly packed data, with poor spatial and/or temporal locality of access.

  LSC_READ_HITRATE = LSC_READ_HITS / (LSC_READ_HITS + LSC_READ_MISSES)

3.7.3 LSC_READ_OPS

Availability: Mali-T760, Mali-T800 series

This counter increments for every load/store L1 cache read access.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. See the section above for a further description; the updated derived counter for the supported GPUs is:

  LSC_READ_HITRATE = LSC_READ_HITS / LSC_READ_OPS
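
Since the two counter generations differ only in whether misses or total operations are reported, a single helper can cover both; the same shape applies to the write and atomic counters in the sections below. A minimal sketch with hypothetical values:

  def lsc_hit_rate(hits, misses=None, ops=None):
      # Pass misses on Mali-T600/T620/T720, or ops on Mali-T760/T800 series.
      total = ops if ops is not None else hits + (misses or 0)
      return hits / total if total else 0.0

  print(lsc_hit_rate(hits=90_000, misses=10_000))  # older GPUs -> 0.9
  print(lsc_hit_rate(hits=90_000, ops=100_000))    # newer GPUs -> 0.9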

3.7.4 LSC_WRITE_HITS

Availability: All

This counter increments for every load/store L1 cache write access which is a hit.

3.7.5 LSC_WRITE_MISSES

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every load/store L1 cache write access which is a miss.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. If the hit rate is particularly poor it may indicate cache thrashing due to badly packed data, with poor spatial and/or temporal locality of access.

  LSC_WRITE_HITRATE = LSC_WRITE_HITS / (LSC_WRITE_HITS + LSC_WRITE_MISSES)

3.7.6 LSC_WRITE_OPS

Availability: Mali-T760, Mali-T800 series

This counter increments for every load/store L1 cache write access.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. See the section above for a further description; the updated derived counter for the supported GPUs is:

  LSC_WRITE_HITRATE = LSC_WRITE_HITS / LSC_WRITE_OPS

3.7.7 LSC_ATOMIC_HITS

Availability: All

This counter increments for every atomic memory access which hits in the L1 atomic cache.

3.7.8 LSC_ATOMIC_MISSES

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every atomic memory access which misses in the L1 atomic cache.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. If the hit rate is particularly poor it may indicate cache thrashing due to badly packed data, with poor spatial and/or temporal locality of access.

  LSC_ATOMIC_HITRATE = LSC_ATOMIC_HITS / (LSC_ATOMIC_HITS + LSC_ATOMIC_MISSES)

Atomics are sometimes used for synchronizing multiple workgroups, requiring the same atomic memory location to be accessed by multiple shader cores. This makes them more susceptible than other memory types to cache thrashing, as the line in the atomics cache can be deliberately stolen by another shader core rather than evicted due to normal cache pressure. Application developers should therefore ensure that there is an adequate amount of processing relative to the number of global atomic accesses; otherwise the memory synchronization overheads of the atomic fields will dominate the processing workload.

3.7.9 LSC_ATOMIC_OPS

Availability: Mali-T760, Mali-T800 series

This counter increments for every atomic memory access to the L1 atomic cache.

It can be a useful exercise to review the percentage of hits versus the total number of accesses. See the section above for a further description; the updated derived counter for the supported GPUs is:

  LSC_ATOMIC_HITRATE = LSC_ATOMIC_HITS / LSC_ATOMIC_OPS

3.7.10 LSC_LINE_FETCHES

Availability: All

This counter increments for every line fetched by the L1 cache from the L2 memory system.

3.7.11 LSC_DIRTY_LINE

Availability: All

This counter increments for every dirty line evicted from the L1 cache into the L2 memory system.

3.7.12 LSC_SNOOPS

Availability: All

This counter increments for every snoop into the L1 cache from the L2 memory system.

3.8 Texture Pipe Events

This counter set looks at the texture pipe behavior.

3.8.1 TEX_WORDS

Availability: All

This counter increments for every architecturally executed texture instruction.

3.8.2 TEX_ISSUES (1)

Availability: Mali-T600, Mali-T620, Mali-T720

This counter increments for every texture issue cycle used. Some instructions take more than one cycle due to data cache misses, as well as a number of multi-cycle filtering operations:

  • 2D bilinear filtering takes one cycle
  • 2D trilinear filtering takes two cycles
  • 3D bilinear filtering takes two cycles
  • 3D trilinear filtering takes four cycles

We can calculate a texture pipe CPI in the same way we can with the LS pipe:

  TEX_CPI = TEX_ISSUES / TEX_WORDS

This gives some indication of the number of recirculations which are occurring in the design. For content using 2D textures with bilinear filtering a good CPI figure is generally around 1.2; anything significantly higher than this generally means that one of the data caches is being thrashed. In this case consider:

  • Using mipmaps - for 3D scenes you should always use mipmapped textures as they reduce aliasing artefacts and they go faster.
  • Using texture compression - ASTC provides excellent quality and flexibility so there is no reason not to.
  • In extreme cases, reducing texture resolution or the number of textures being used.

3.8.3 TEX_ISSUES (2)

Availability: Mali-T760, Mali-T800 series

This counter increments for every texture issue cycle used. Some instructions take more than one cycle due to multi-cycle filtering operations. Unlike the older version of this counter, this implementation does not count retries due to misses in the main data cache.

The cycle counts for each operation and the CPI derived counter are the same as shown in "TEX_ISSUES (1)" above, although the CPI counter will generally be lower as cache miss overheads are not included. If this number is substantially over 1.0 and you are texture limited then reduce use of trilinear filtering and 3D textures.

3.8.4 TEX_RECIRC_FMISS

Availability: Mali-T600, Mali-T620, Mali-T760, Mali-T800 series

This counter counts the number of full cache misses, where no samples of a texture are present in the texture cache and the thread must wait for a cache access. The following equation provides an approximation of the percentage of read hits, although it does assume only one cache line is needed per texture instruction, which is not true for some texture formats.

  TEX_READ_HITRATE = (TEX_WORDS - TEX_RECIRC_FMISS) / TEX_WORDS
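
For example, a one-line Python version of this approximation, with hypothetical counter values, bearing in mind the single-line-per-instruction caveat above:

  def tex_read_hitrate(tex_words, tex_recirc_fmiss):
      return (tex_words - tex_recirc_fmiss) / tex_words if tex_words else 0.0

  print(f"{tex_read_hitrate(200_000, 14_000):.1%}")  # -> 93.0%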

4 Tiler Counters

The tiler counters provide details of the workload of the fixed function tiling unit, which places primitives into the tile lists which are subsequently read by the fragment frontend during fragment shading.

4.1 Tiler Activity

These counters show the overall activity of the tiling unit.

4.1.1 TI_ACTIVE

Availability: Mali-T600, Mali-T620, Mali-T760, Mali-T860, Mali-T880

This counter increments every cycle the tiler is processing a task. The tiler can run in parallel to vertex shading and fragment shading, so a high cycle count here does not necessarily imply a bottleneck, unless the COMPUTE_ACTIVE counters in the shader cores are very low relative to this.

4.2 Tiler Primitive Occurrence

These counters give a functional breakdown of the workload given to the GPU by the application.

4.2.1 TI_POINTS

Availability: All

This counter increments for every point primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.

4.2.2 TI_LINES

Availability: All

This counter increments for every line segment primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.

4.2.3 TI_TRIANGLES

Availability: All

This counter increments for every triangle primitive processed by the tiler. This counter is incremented before any clipping or culling, so reflects the raw workload from the application.

4.3 Tiler Visibility and Culling Occurrence

These counters give a breakdown of how the workload has been affected by clipping and culling.

4.3.1 TI_PRIM_VISIBLE

Availability: All

This counter is incremented for every primitive which is visible according to its type, clip-space coordinates, and front-face or back-face orientation.

4.3.2 TI_PRIM_CULLED

Availability: All

This counter is incremented for every primitive which is culled due to the application of front-face or back-face culling rules. For most meshes approximately half of the triangles are back-facing, so this counter should typically be similar to the number of visible primitives, although lower is always better.

4.3.3 TI_PRIM_CLIPPED

Availability: All

This counter is incremented for every primitive which is culled due to being totally outside of the clip-space volume. Application-side culling should be used to minimize the amount of out-of-shot geometry being sent to the GPU, as it is expensive in terms of bandwidth and power. One of my blogs looks at application-side culling in more detail: Mali Performance 5: An Application's Performance Responsibilities.

4.3.4 TI_FRONT_FACING

Availability: All

This counter is incremented for every triangle which is front-facing. This counter is incremented after culling, so only counts primitives which are actually emitted into the tile list.

4.3.5 TI_BACK_FACING

Availability: All

This counter is incremented for every triangle which is back-facing. This counter is incremented after culling, so only counts primitives which are actually emitted into the tile list. If you are not using back-facing triangles for some special algorithmic purpose, such as Refraction Based on Local Cubemaps, then a high value here relative to the total number of triangles may indicate that the application has forgotten to turn on back-face culling. For most opaque geometry no back-facing triangles should be expected.

5 L2 Cache Counters

This section documents the behavior of the L2 memory system counters. In systems which implement multiple L2 caches or bus interfaces the counters presented in DS-5 Streamline are generally the sum of the counters from all of the L2 counter blocks present, so care needs to be taken when comparing these counters against the total GPU cycles.

5.1 Internal Read Traffic Events

These counters profile the internal read traffic into the L2 cache from the various internal masters.

5.1.1 L2_READ_LOOKUP

Availability: All

The counter increments for every read transaction received by the L2 cache.

5.1.2 L2_READ_HITS

Availability: All

The counter increments for every read transaction received by the L2 cache which also hits in the cache.

  L2_READ_HITRATE = L2_READ_HITS / L2_READ_LOOKUP

5.1.3 L2_READ_SNOOP

Availability: All

The counter increments for every inner coherent read snoop transaction received by the L2 cache.

5.2 Internal Write Traffic Events

These counters profile the internal write traffic into the L2 cache from the various internal masters.

5.2.1 L2_WRITE_LOOKUP

Availability: All

The counter increments for every write transaction received by the L2 cache.

5.2.2 L2_WRITE_HITS

Availability: All

The counter increments for every write transaction received by the L2 cache which also hits in the cache.

  L2_WRITE_HITRATE = L2_WRITE_HITS / L2_WRITE_LOOKUP

5.2.3 L2_WRITE_SNOOP

Availability: All

The counter increments for every inner coherent write snoop transaction received by the L2 cache.

5.3 External Read Traffic Events

These counters profile the external read memory interface behavior. Note that this includes traffic from the entire GPU L2 memory subsystem, not just traffic from the L2 cache, as some types of access will bypass the L2 cache.

5.3.1 L2_EXT_READ_BEATS

Availability: All

This counter increments on every clock cycle a read beat is read off the AXI bus. With knowledge of the bus width used in the GPU this can be converted into a raw bandwidth counter.

  L2_EXT_READ_BYTES = L2_EXT_READ_BEATS * L2_AXI_WIDTH_BYTES

Most implementations of the Midgard GPU use a 128-bit (16 byte) AXI interface, but a 64-bit (8 byte) interface is also possible to reduce the area used by a design. This information can be obtained from your chipset manufacturer.

It is also possible to determine the total percentage of available AXI port bandwidth used, using the equation below. Note that we need to normalize the number of read beats, which are accumulated by DS-5 Streamline, into a per-port count. If a design uses 1-4 shader cores then only a single AXI port will be present; otherwise two AXI ports will be present.

  L2_EXT_READ_UTILIZATION = L2_EXT_READ_BEATS / (L2_AXI_PORT_COUNT * GPU_ACTIVE)

This utilization metric ignores any frequency changes which may occur downstream of the GPU. If you have, for example, a 600MHz GPU connected to a 300MHz AXI of the same data width then it will be impossible for the GPU to achieve more than 50% utilization of its native interface as AXI will be unable to provide the data as quickly as the GPU can consume it.
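
Putting the bandwidth and utilization equations together, the sketch below converts read beats into MB/s and a per-port utilization figure. The 128-bit bus, two AXI ports, 1ms sample window, and counter values are all assumptions for a hypothetical configuration:

  AXI_WIDTH_BYTES = 16     # assumed 128-bit interface
  AXI_PORT_COUNT = 2       # 1 for designs with 1-4 shader cores
  SAMPLE_PERIOD_S = 0.001  # assumed sample window: 1ms

  def ext_read_stats(read_beats, gpu_active):
      bandwidth_mb_s = read_beats * AXI_WIDTH_BYTES / SAMPLE_PERIOD_S / 1e6
      utilization = read_beats / (AXI_PORT_COUNT * gpu_active) if gpu_active else 0.0
      return bandwidth_mb_s, utilization

  bw, util = ext_read_stats(read_beats=300_000, gpu_active=600_000)
  print(f"{bw:.0f} MB/s read, {util:.0%} of theoretical port bandwidth")

The identical calculation applies to L2_EXT_WRITE_BEATS in section 5.4.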

5.3.2 L2_EXT_R_BUF_FULL

Availability: All except Mali-T720

This counter increments every cycle that the GPU is unable to create a new read transaction because there are no free entries in the internal response buffer. If this number is high it may indicate that the AXI interface has been built with too few outstanding transactions. This counter is mostly useful to system integrators tuning the AXI implementation, rather than application developers.

5.3.3 L2_EXT_RD_BUF_FULL

Availability: All except Mali-T720

This counter increments if a read response is received and the internal read data buffer is full. This can happen if read responses are interleaved by the bus. This counter is mostly useful to system integrators tuning the AXI implementation, rather than application developers.

5.3.4 L2_EXT_AR_STALL

Availability: All

This counter increments every cycle that the GPU is unable to issue a new read transaction to AXI, because AXI is unable to accept the request. If this number is high it may indicate that the AXI bus is suffering from high contention. In these cases application developers can improve performance by reducing the total amount of bandwidth their application is using, be that by reducing geometry complexity, compressing textures, or reducing resolution.

5.4 External Write Traffic Events

These counters profile the external write memory interface behavior. Note that this includes traffic from the entire GPU L2 memory subsystem, not just traffic from the L2 cache.

5.4.1 L2_EXT_WRITE_BEATS

Availability: All

This counter increments on every clock cycle a write data beat is sent on the AXI bus. With knowledge of the bus width used in the GPU this can be converted into a raw bandwidth counter.

  L2_EXT_WRITE_BYTES = L2_EXT_WRITE_BEATS * L2_AXI_WIDTH_BYTES

Most implementations of the Midgard GPU use a 128-bit (16 byte) AXI interface, but a 64-bit (8 byte) interface is also possible to reduce the area used by a design. This information can be obtained from your chipset manufacturer.

It is also possible to determine the total percentage of available AXI port bandwidth used, using the equation below. Note that we need to normalize the number of write beats, which are accumulated by DS-5 Streamline, into a per-port count.

  L2_EXT_WRITE_UTILIZATION = L2_EXT_WRITE_BEATS / (L2_AXI_PORT_COUNT * GPU_ACTIVE)

This utilization metric ignores any frequency changes which may occur downstream of the GPU. If you have, for example, a 600MHz GPU connected to a 300MHz AXI of the same data width then it will be impossible for the GPU to achieve more than 50% utilization of its native interface as AXI will be unable to accept the data as quickly as the GPU can generate it.
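
The read and write sides can also be combined into a single total external bandwidth figure, which is often the number a system integrator cares about most. A minimal sketch, reusing the bus width and sample window assumptions from section 5.3:

  def total_ext_bandwidth_mb_s(read_beats, write_beats,
                               width_bytes=16, sample_period_s=0.001):
      return (read_beats + write_beats) * width_bytes / sample_period_s / 1e6

  print(f"{total_ext_bandwidth_mb_s(300_000, 150_000):.0f} MB/s")  # -> 7200 MB/s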

5.4.2 L2_EXT_W_BUF_FULL

Availability: All

This counter increments every cycle that the GPU is unable to create a new write transaction because there are no free entries in the internal write buffer. If this number is high it may indicate that the AXI interface has been built with too few outstanding transactions. This counter is mostly useful to system integrators tuning the AXI implementation, rather than application developers.

5.4.3 L2_EXT_W_STALL

Availability: All

This counter increments every cycle that the GPU is unable to issue a new write transaction to AXI, because AXI is unable to accept the request. If this number is high it may indicate that the AXI bus is suffering from high contention, or that the device bus clock is too low. In these cases application developers can improve performance by reducing the total amount of bandwidth their application is using, be that by reducing geometry complexity, compressing textures, or reducing resolution.

6 Conclusions

This document has defined all of the Mali Midgard family performance counters available via DS-5 Streamline, as well as some pseudo-counters which can be derived from them. Hopefully this provides a useful starting point for your application optimization activity when using Mali GPUs.

We also publish a Mali Application Optimization Guide on the Mali Developer Center:

  • http://malideveloper.arm.com/develop-for-mali/tutorials-developer-guides/developer-guides/mali-gpu-application-optimization-guide/

Other blogs and documents can be found in the ARM Mali Graphics area which explore using the performance tools, as well as specific optimization strategies, in more detail.

 
