秋山丶雪绪

TensorRT from Scratch (4), Command-Line Tools: Basic Functions of trtexec

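Preface

Learning materials:
TensorRT source code samples
Bilibili video: TensorRT Tutorial | Based on version 8.6.1
Companion code for the video: cookbook

Reference source: cookbook → 07-Tool → trtexec
Official documentation: trtexec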

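The command-line tool trtexec is located in the TensorRT installation directory xxx/TensorRT-8.6.1.6/bin. Its main functions are:
(1) Build a TensorRT engine from an ONNX file and serialize it to a plan file
(2) Inspect the layer-by-layer network information of an ONNX or plan file
(3) Benchmark a model, i.e. measure the performance of a TensorRT engine with random or user-supplied inputs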

Example 1: Parse ONNX, build an engine, and benchmark inference

Directory structure

├── resnet18.onnx
└── test_trtexec.sh

Temporarily add the trtexec directory to PATH in the terminal, then run the script

export PATH=xxx/TensorRT-8.6.1.6/bin${PATH:+:${PATH}}
bash test_trtexec.sh
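
The `${PATH:+:${PATH}}` part of the export line is a shell parameter expansion: it expands to a colon followed by the old value only when the variable is already set and non-empty, which avoids leaving a stray trailing colon. A minimal sketch of that behavior, using a stand-in variable named OLD (an assumption for illustration, so the real PATH is left untouched):

```shell
# Same placeholder directory as in the export line above.
trt_bin="xxx/TensorRT-8.6.1.6/bin"

# Case 1: the old value is non-empty, so ":<old value>" is appended.
OLD="/usr/bin:/bin"
with_old="${trt_bin}${OLD:+:${OLD}}"

# Case 2: the old value is empty, so nothing is appended (no trailing colon).
OLD=""
without_old="${trt_bin}${OLD:+:${OLD}}"

echo "$with_old"
echo "$without_old"
```

Because the change is made with export in the current terminal only, it disappears when that terminal is closed.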

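The script is shown below. It builds a TensorRT engine from the ResNet18 ONNX file generated in the earlier Python part of this series and then runs an inference benchmark. The log output is saved to result-fp32.txt.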

trtexec \
    --onnx=resnet18.onnx \
    --saveEngine=resnet18-fp32.plan \
    --minShapes=x:1x3x224x224 \
    --optShapes=x:4x3x224x224 \
    --maxShapes=x:16x3x224x224 \
    --memPoolSize=workspace:1024MiB \
    --verbose \
    > result-fp32.txt 2>&1

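> is the output redirection operator; it sends standard output to the given file.
In 2>&1, 2 is the standard error stream (stderr) and 1 is the standard output stream (stdout). Together with the redirection, this saves error messages and normal output to the same file.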
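
As a quick check of how `> file 2>&1` behaves, the sketch below uses a stand-in function that writes one line to stdout and one to stderr, then confirms that both end up in the same file. The function emit_both and the temporary file are illustrative assumptions; they only mimic a command such as trtexec.

```shell
# Stand-in for a real command: one line on stdout, one on stderr.
# (emit_both is a hypothetical helper used only for this demonstration.)
emit_both() {
    echo "normal output"
    echo "error output" >&2
}

log_file="$(mktemp)"

# Order matters: first point stdout at the file, then point stderr at stdout.
emit_both > "$log_file" 2>&1

line_count=$(wc -l < "$log_file" | tr -d ' ')
echo "lines captured: $line_count"
cat "$log_file"
```

Writing `2>&1 > file` instead would attach stderr to the terminal before stdout is redirected, so only the normal output would reach the file.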

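Common command-line options

(1) Build phase

--onnx=resnet18.onnx                Input model file
--saveEngine=resnet18-fp32.plan     Output engine file name

--minShapes=x:1x3x224x224
--optShapes=x:4x3x224x224
--maxShapes=x:16x3x224x224          Minimum, optimal (most common), and maximum input shapes
--memPoolSize=workspace:1024MiB     Maximum GPU memory the optimizer may use as workspace
--verbose                           Print verbose logs (the log level itself cannot be chosen)
--skipInference                     Build the engine only, without running inference

--fp16
--int8
--noTF32
--best                              Engine precision settings
--sparsity=[disable|enable|force]   Sparsity setting
--builderOptimizationLevel=5        Builder optimization level (default 3)
--timingCacheFile=timing.cache      File used to store the builder timing cache
--profilingVerbosity=detailed       Keep more per-layer information at build time

(2) Run phase

--loadEngine=resnet18-fp32.plan     Engine file to load
--shapes=x:4x3x224x224              Input shape used for inference
--warmUp=1000                       Minimum warm-up time (unit: ms)
--duration=10                       Minimum measurement time (unit: s)
--iterations=100                    Minimum number of measurement iterations
--useCudaGraph                      Capture and run inference with a CUDA graph
--noDataTransfers                   Disable data transfers between host and device
--streams=2                         Run inference on multiple streams
--dumpProfile                       Print per-layer performance data
--exportProfile=layerProfile.txt    Save per-layer performance data (written as JSON)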
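
In this example the three shape options share the same input name and differ only in the batch dimension, so they can be generated from three batch sizes. The function below is a hypothetical convenience wrapper, not something trtexec provides; it only prints the option strings and assumes that the leading dimension is the only dynamic one.

```shell
# shape_flags NAME DIMS MIN OPT MAX
# Prints --minShapes/--optShapes/--maxShapes for one input whose only dynamic
# dimension is the leading batch dimension.
# (Hypothetical helper for illustration; not part of trtexec.)
shape_flags() {
    name="$1"; dims="$2"; min="$3"; opt="$4"; max="$5"
    printf '%s\n' "--minShapes=${name}:${min}x${dims} --optShapes=${name}:${opt}x${dims} --maxShapes=${name}:${max}x${dims}"
}

flags=$(shape_flags x 3x224x224 1 4 16)
echo "$flags"
```

The result can be spliced into a command line with an unquoted substitution, for example `trtexec --onnx=resnet18.onnx $(shape_flags x 3x224x224 1 4 16)`, so that the shell splits it into three separate options.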

Reading the output log

The log is very long, so it is examined block by block.

(1) Option settings

=== Model Options ===
Format: ONNX
Model: resnet18.onnx
Output:

=== Build Options ===
Max batch: explicit batch
Memory Pools: workspace: 1024 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
minTiming: 1
avgTiming: 8
Precision: FP32
LayerPrecisions: 
Layer Device Types: 
Calibration: 
Refit: Disabled
Version Compatible: Disabled
TensorRT runtime: full
Lean DLL Path: 
Tempfile Controls: { in_memory: allow, temporary: allow }
Exclude Lean Runtime: Disabled
Sparsity: Disabled
Safe mode: Disabled
Build DLA standalone loadable: Disabled
Allow GPU fallback for DLA: Disabled
DirectIO mode: Disabled
Restricted mode: Disabled
Skip inference: Disabled
Save engine: resnet18-fp32.plan
Load engine: 
Profiling verbosity: 0
Tactic sources: Using default tactic sources
timingCacheMode: local
timingCacheFile: 
Heuristic: Disabled
Preview Features: Use default preview flags.
MaxAuxStreams: -1
BuilderOptimizationLevel: -1
Input(s)s format: fp32:CHW
Output(s)s format: fp32:CHW
Input build shape: x=1x3x224x224+4x3x224x224+16x3x224x224
Input calibration shapes: model

=== System Options ===
Device: 0
DLACore: 
Plugins:
setPluginsToSerialize:
dynamicPlugins:
ignoreParsedPluginLibs: 0

=== Inference Options ===
Batch: Explicit
Input inference shape: x=4x3x224x224
Iterations: 10
Duration: 3s (+ 200ms warm up)
Sleep time: 0ms
Idle time: 0ms
Inference Streams: 1
ExposeDMA: Disabled
Data transfers: Enabled
Spin-wait: Disabled
Multithreading: Disabled
CUDA Graph: Disabled
Separate profiling: Disabled
Time Deserialize: Disabled
Time Refit: Disabled
NVTX verbosity: 0
Persistent Cache Ratio: 0
Inputs:

=== Reporting Options ===
Verbose: Enabled
Averages: 10 inferences
Percentiles: 90,95,99
Dump refittable layers:Disabled
Dump output: Disabled
Profile: Disabled
Export timing to JSON file: 
Export output to JSON file: 
Export profile to JSON file: 

=== Device Information ===
Selected Device: NVIDIA GeForce RTX 3090
Compute Capability: 8.6
SMs: 82
Device Global Memory: 24258 MiB
Shared Memory per SM: 100 KiB
Memory Bus Width: 384 bits (ECC disabled)
Application Compute Clock Rate: 1.695 GHz
Application Memory Clock Rate: 9.751 GHz

Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.

TensorRT version: 8.6.1
"A long stretch of the log is spent loading the standard plugins"
Loading standard plugins
Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
...
Registered plugin creator - ::VoxelGeneratorPlugin version 1
[MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 19, GPU 774 (MiB)
Trying to load shared library libnvinfer_builder_resource.so.8.6.1
Loaded shared library libnvinfer_builder_resource.so.8.6.1
[MemUsageChange] Init builder kernel library: CPU +1450, GPU +266, now: CPU 1545, GPU 1034 (MiB)
CUDA lazy loading is enabled.
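
When the log is long, it can help to pull out just the settings that matter for a run. The sketch below works on a few lines copied from the options block above and splits the build shape into its three parts. The variable names are illustrative; the patterns match anywhere in a line because real trtexec logs usually carry a timestamp and severity prefix that has been stripped in this article.

```shell
# Excerpt copied from the log above.
log_excerpt=$(cat <<'EOF'
Precision: FP32
Save engine: resnet18-fp32.plan
Input build shape: x=1x3x224x224+4x3x224x224+16x3x224x224
Input inference shape: x=4x3x224x224
EOF
)

precision=$(printf '%s\n' "$log_excerpt" | sed -n 's/.*Precision: //p')
build_shape=$(printf '%s\n' "$log_excerpt" | sed -n 's/.*Input build shape: //p')

# The build shape is printed as name=min+opt+max.
shapes=${build_shape#*=}
min_shape=${shapes%%+*}
rest=${shapes#*+}
opt_shape=${rest%%+*}
max_shape=${shapes##*+}

echo "precision=$precision min=$min_shape opt=$opt_shape max=$max_shape"
```

For a real run, the here-document would be replaced by the log file itself, for example by feeding result-fp32.txt to the same sed commands.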

(2) Parsing

Start parsing network model.
----------------------------------------------------------------
"ONNX model information"
Input filename:   resnet18.onnx
ONNX IR version:  0.0.7
Opset version:    12
Producer name:    pytorch
Producer version: 2.0.1
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
"Plugins already loaded"
Plugin creator already registered - ::xxx

"Adding inputs and building the network"
Adding network input: x with dtype: float32, dimensions: (-1, 3, 224, 224)
Registering tensor: x for ONNX tensor: x
Importing initializer: fc.weight
Importing initializer: fc.bias
Importing initializer: onnx::Conv_193
...
Importing initializer: onnx::Conv_251

"Read each Parsing node as one block; it mainly covers the inputs, outputs, and the layer"
Parsing node: /conv1/Conv [Conv]
Searching for input: x
Searching for input: onnx::Conv_193
Searching for input: onnx::Conv_194
/conv1/Conv [Conv] inputs: [x -> (-1, 3, 224, 224)[FLOAT]], [onnx::Conv_193 -> (64, 3, 7, 7)[FLOAT]], [onnx::Conv_194 -> (64)[FLOAT]], 
Convolution input dimensions: (-1, 3, 224, 224)
Registering layer: /conv1/Conv for ONNX node: /conv1/Conv
Using kernel: (7, 7), strides: (2, 2), prepadding: (3, 3), postpadding: (3, 3), dilations: (1, 1), numOutputs: 64
Convolution output dimensions: (-1, 64, 112, 112)
Registering tensor: /conv1/Conv_output_0 for ONNX tensor: /conv1/Conv_output_0
/conv1/Conv [Conv] outputs: [/conv1/Conv_output_0 -> (-1, 64, 112, 112)[FLOAT]], 

Parsing node: /relu/Relu [Relu]
Searching for input: /conv1/Conv_output_0
/relu/Relu [Relu] inputs: [/conv1/Conv_output_0 -> (-1, 64, 112, 112)[FLOAT]], 
Registering layer: /relu/Relu for ONNX node: /relu/Relu
Registering tensor: /relu/Relu_output_0 for ONNX tensor: /relu/Relu_output_0
/relu/Relu [Relu] outputs: [/relu/Relu_output_0 -> (-1, 64, 112, 112)[FLOAT]], 
...
Finished parsing network model. Parse time: 0.144613

"A series of optimization and fusion passes applied to the model, with the resulting layer counts and elapsed time"
Original: 53 layers
After dead-layer removal: 53 layers
Graph construction completed in 0.00150762 seconds.
Running: ConstShuffleFusion on fc.bias
ConstShuffleFusion: Fusing fc.bias with (Unnamed Layer* 56) [Shuffle]
After Myelin optimization: 52 layers
...
After scale fusion: 49 layers
...
After dupe layer removal: 24 layers
After final dead-layer removal: 24 layers
After tensor merging: 24 layers
After vertical fusions: 24 layers
After dupe layer removal: 24 layers
After final dead-layer removal: 24 layers
After tensor merging: 24 layers
After slice removal: 24 layers
After concat removal: 24 layers
Trying to split Reshape and strided tensor
Graph optimization time: 0.0239815 seconds.
Building graph using backend strategy 2
Local timing cache in use. Profiling results in this builder pass will not be stored.
Constructing optimization profile number 0 [1/1].
Applying generic optimizations to the graph for inference.
Reserving memory for host IO tensors. Host: 0 bytes
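
To see at a glance how far the fusion passes shrink the network, the layer counts can be pulled out of these lines. The sketch below runs on a handful of lines copied from the log above; treating every line that ends in "layers" as one pass is an assumption about the log layout rather than a documented format.

```shell
# Excerpt copied from the optimization log above.
log_excerpt=$(cat <<'EOF'
Original: 53 layers
After dead-layer removal: 53 layers
After Myelin optimization: 52 layers
After scale fusion: 49 layers
After dupe layer removal: 24 layers
After concat removal: 24 layers
EOF
)

summary=$(printf '%s\n' "$log_excerpt" | awk '
    / layers$/ {
        count = $(NF - 1)            # the number just before the word "layers"
        if (!seen) { first = count; seen = 1 }
        last = count
    }
    END { printf "%d -> %d", first, last }
')

echo "layer count: $summary"
```

For this excerpt the network goes from 53 layers to 24, which matches the first and last counts printed in the log.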

TODO:
(1) Which layers can be optimized, and how?
(2) How does the network structure differ before and after optimization?

(3) Auto-tuning

=============== Computing costs for /conv1/Conv + /relu/Relu
*************** Autotuning format combination: Float(150528,50176,224,1) -> Float(802816,12544,112,1) ***************
--------------- Timing Runner: /conv1/Conv + /relu/Relu (CaskConvolution[0x80000009])
"Tactic Name: name of the tactic, Tactic: tactic ID, Time: elapsed time"
Tactic Name: ampere_scudnn_128x128_relu_medium_nn_v1 Tactic: 0xf067e6205da31c2e Time: 0.11264
...
Tactic Name: ampere_scudnn_128x64_relu_medium_nn_v1 Tactic: 0xf64396b97c889179 Time: 0.0694857
"Profiling time, plus the ID and time of the fastest tactic"
/conv1/Conv + /relu/Relu (CaskConvolution[0x80000009]) profiling completed in 0.124037 seconds. Fastest Tactic: 0xf64396b97c889179 Time: 0.0694857
--------------- Timing Runner: /conv1/Conv + /relu/Relu (CudnnConvolution[0x80000000])
"No valid tactics, skipped"
CudnnConvolution has no valid tactics for this config, skipping
--------------- Timing Runner: /conv1/Conv + /relu/Relu (CaskFlattenConvolution[0x80000036])
CaskFlattenConvolution has no valid tactics for this config, skipping
>>>>>>>>>>>>>>> Chose Runner Type: CaskConvolution Tactic: 0xf64396b97c889179

*************** Autotuning format combination: Float(150528,1,672,3) -> Float(802816,1,7168,64) ***************
--------------- Timing Runner: /conv1/Conv + /relu/Relu (CaskConvolution[0x80000009])
Tactic Name: sm80_xmma_fprop_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize64x64x8_stage3_warpsize1x4x1_g1_ffma_aligna4_alignc4 Tactic: 0x19b688348f983aa0 Time: 0.155648
...
Tactic Name: sm80_xmma_fprop_implicit_gemm_indexed_f32f32_f32f32_f32_nhwckrsc_nhwc_tilesize128x32x8_stage3_warpsize2x2x1_g1_ffma_aligna4_alignc4 Tactic: 0xa6448a1e79f1ca6f Time: 0.174373
/conv1/Conv + /relu/Relu (CaskConvolution[0x80000009]) profiling completed in 0.0186587 seconds. Fastest Tactic: 0xf231cca3335919a4 Time: 0.063488
--------------- Timing Runner: /conv1/Conv + /relu/Relu (CaskFlattenConvolution[0x80000036])
CaskFlattenConvolution has no valid tactics for this config, skipping
>>>>>>>>>>>>>>> Chose Runner Type: CaskConvolution Tactic: 0xf231cca3335919a4

*************** Autotuning format combination: Float(50176,1:4,224,1) -> Float(802816,12544,112,1) ***************
--------------- Timing Runner: /conv1/Conv + /relu/Relu (CaskConvolution[0x80000009])
Tactic Name: sm80_xmma_fprop_implicit_gemm_indexed_f32f32_tf32f32_f32_nhwckrsc_nchw_tilesize128x128x16_stage4_warpsize2x2x1_g1_tensor16x8x8_alignc4 Tactic: 0xe8f7b6a5bab325f8 Time: 0.35957
Tactic Name: sm80_xmma_fprop_implicit_gemm_indexed_f32f32_tf32f32_f32_nhwckrsc_nchw_tilesize128x128x16_stage4_warpsize2x2x1_g1_tensor16x8x8 Tactic: 0xe0a307ffe0ffb6a5 Time: 0.344503
/conv1/Conv + /relu/Relu (CaskConvolution[0x80000009]) profiling completed in 0.0105646 seconds. Fastest Tactic: 0xe0a307ffe0ffb6a5 Time: 0.344503
--------------- Timing Runner: /conv1/Conv + /relu/Relu (CaskFlattenConvolution[0x80000036])
CaskFlattenConvolution has no valid tactics for this config, skipping
>>>>>>>>>>>>>>> Chose Runner Type: CaskConvolution Tactic: 0xe0a307ffe0ffb6a5

*************** Autotuning format combination: Float(50176,1:4,224,1) -> Float(200704,1:4,1792,16) ***************
--------------- Timing Runner: /conv1/Conv + /relu/Relu (CaskConvolution[0x80000009])
Tactic Name: ampere_scudnn_128x64_sliced1x2_ldg4_relu_exp_large_nhwc_tn_v1 Tactic: 0xbdfdef6b84f7ccc9 Time: 0.279845
...
Tactic Name: sm80_xmma_fprop_implicit_gemm_indexed_wo_smem_f32f32_tf32f32_f32_nhwckrsc_nhwc_tilesize128x16x32_stage1_warpsize4x1x1_g1_tensor16x8x8 Tactic: 0xae48d3ccfe1edfcd Time: 0.0678278
/conv1/Conv + /relu/Relu (CaskConvolution[0x80000009]) profiling completed in 0.0911707 seconds. Fastest Tactic: 0x9cb304e2edbc1221 Time: 0.059392
--------------- Timing Runner: /conv1/Conv + /relu/Relu (CaskFlattenConvolution[0x80000036])
CaskFlattenConvolution has no valid tactics for this config, skipping
>>>>>>>>>>>>>>> Chose Runner Type: CaskConvolution Tactic: 0x9cb304e2edbc1221
...

"Computing reformatting costs"
=============== Computing reformatting costs
=============== Computing reformatting costs: 
*************** Autotuning Reformat: Float(150528,50176,224,1) -> Float(150528,1,672,3) ***************
--------------- Timing Runner: Optimizer Reformat(x -> <out>) (Reformat[0x80000006])
Tactic: 0x00000000000003e8 Time: 0.00663667
Tactic: 0x00000000000003ea Time: 0.016319
Tactic: 0x0000000000000000 Time: 0.0147323
Optimizer Reformat(x -> <out>) (Reformat[0x80000006]) profiling completed in 0.00680021 seconds. Fastest Tactic: 0x00000000000003e8 Time: 0.00663667
*************** Autotuning Reformat: Float(150528,50176,224,1) -> Float(50176,1:4,224,1) ***************
--------------- Timing Runner: Optimizer Reformat(x -> <out>) (Reformat[0x80000006])
Tactic: 0x00000000000003e8 Time: 0.0116016
Tactic: 0x00000000000003ea Time: 0.0174405
Tactic: 0x0000000000000000 Time: 0.0147739
Optimizer Reformat(x -> <out>) (Reformat[0x80000006]) profiling completed in 0.00589612 seconds. Fastest Tactic: 0x00000000000003e8 Time: 0.0116016
...
=============== Computing reformatting costs

"Adding reformat layers"
Adding reformat layer: Reformatted Input Tensor 0 to /fc/Gemm (/avgpool/GlobalAveragePool_output_0) from Float(512,1,1,1) to Float(128,1:4,128,128)
Adding reformat layer: Reformatted Input Tensor 0 to reshape_after_/fc/Gemm (/fc/Gemm_out_tensor) from Float(250,1:4,250,250) to Float(1000,1,1,1)
Formats and tactics selection completed in 6.81485 seconds.
After reformat layers: 26 layers
Total number of blocks in pre-optimized block assignment: 26
Detected 1 inputs and 1 output network tensors.
"Host memory, device memory, and scratch memory"
Layer: /conv1/Conv + /relu/Relu Host Persistent: 4016 Device Persistent: 75776 Scratch Memory: 0
...
Layer: /fc/Gemm Host Persistent: 7200 Device Persistent: 0 Scratch Memory: 0
Skipped printing memory information for 3 layers with 0 memory size i.e. Host Persistent + Device Persistent + Scratch Memory == 0.
Total Host Persistent Memory: 86272
Total Device Persistent Memory: 75776
Total Scratch Memory: 4608
"Peak usage"
[MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 25 MiB, GPU 98 MiB

"Block shifts: used to manage and assign memory blocks to the different layers so that computation runs efficiently on the GPU"
[BlockAssignment] Started assigning block shifts. This will take 27 steps to complete.
STILL ALIVE: Started step 26 of 27
[BlockAssignment] Algorithm ShiftNTopDown took 0.430364ms to assign 4 blocks to 27 nodes requiring 77074944 bytes.
Total number of blocks in optimized block assignment: 4
Total Activation Memory: 77074944
Finalize: /conv1/Conv + /relu/Relu Set kernel index: 0
...
Finalize: /fc/Gemm Set kernel index: 10
Total number of generated kernels selected for the engine: 11
Kernel: 0 CASK_STATIC
...
Kernel: 10 CASK_STATIC
"Disabling unused tactic sources to make engine generation more efficient"
Disabling unused tactic source: JIT_CONVOLUTIONS
"Total engine generation time"
Engine generation completed in 6.96604 seconds.
"Deleting the timing cache"
Deleting timing cache: 144 entries, served 172 hits since creation.
Engine Layer Information:
Layer(CaskConvolution): /conv1/Conv + /relu/Relu, Tactic: 0xf64396b97c889179, x (Float[-1,3,224,224]) -> /relu/Relu_output_0 (Float[-1,64,112,112])
...
Layer(NoOp): reshape_after_/fc/Gemm, Tactic: 0x0000000000000000, Reformatted Input Tensor 0 to reshape_after_/fc/Gemm (Float[-1,1000,1,1]) -> y (Float[-1,1000])
[MemUsageChange] TensorRT-managed a
Adding 1 engine(s) to plan file.
Engine built in 18.0591 sec.
Loaded engine size: 73 MiB
"Deserialization took 20415 microseconds"
Deserialization required 20415 microseconds.
"Memory change during deserialization"
[MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +72, now: CPU 0, GPU 72 (MiB)
Engine deserialized in 0.0260023 sec.
Total per-runner device persistent memory is 75776
Total per-runner host persistent memory is 86272
Allocated activation device memory of size 77074944
"Memory change during execution context creation"
[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +74, now: CPU 0, GPU 146 (MiB)
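
The memory lines mix two units: per-layer and activation figures are in bytes, while the [MemUsageChange] lines are in MiB. The sketch below converts the byte figures from the log above so the two can be compared. Reading the context-creation increase as roughly activation memory plus device persistent memory is an assumption, not something the log states.

```shell
activation_bytes=77074944        # "Allocated activation device memory of size"
device_persistent_bytes=75776    # "Total per-runner device persistent memory"

# 1 MiB = 1048576 bytes
activation_mib=$(awk -v b="$activation_bytes" 'BEGIN { printf "%.1f", b / 1048576 }')
context_mib=$(awk -v a="$activation_bytes" -v p="$device_persistent_bytes" \
    'BEGIN { printf "%.1f", (a + p) / 1048576 }')

echo "activation memory:       $activation_mib MiB"
echo "activation + persistent: $context_mib MiB"
```

Both values are close to the GPU +74 MiB reported for IExecutionContext creation, which is at least consistent with that reading.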

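TODO:
(1) The tactic name encodes information such as the hardware architecture, operation type, data type, and optimization level. What exactly does each part mean?
(2) During auto-tuning, Autotuning format combination tries different data formats. How are these input and output layouts determined?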
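
Even without decoding the tactic names, the Timing Runner lines have a regular shape, so the fastest candidate can be recomputed from them. The sketch below does this for the two tactic lines that are visible in the first Timing Runner block above (the lines elided by "..." are not included); the field positions are assumptions based on those two lines.

```shell
# Tactic lines copied from the first Timing Runner block above.
tactics=$(cat <<'EOF'
Tactic Name: ampere_scudnn_128x128_relu_medium_nn_v1 Tactic: 0xf067e6205da31c2e Time: 0.11264
Tactic Name: ampere_scudnn_128x64_relu_medium_nn_v1 Tactic: 0xf64396b97c889179 Time: 0.0694857
EOF
)

fastest=$(printf '%s\n' "$tactics" | awk '
    /Tactic: 0x/ && /Time: / {
        t = $NF + 0                                  # last field is the time
        for (i = 1; i < NF; i++)
            if ($i == "Tactic:") id = $(i + 1)       # ID follows the "Tactic:" label
        if (!found || t < best_time) { best = id; best_time = t; found = 1 }
    }
    END { print best }
')

echo "fastest tactic: $fastest"
```

The result agrees with the Fastest Tactic reported by the log for that block.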

(4) Inference benchmark

CUDA lazy loading is enabled.
Setting persistentCacheLimit to 0 bytes.
Using enqueueV3.
Using random values for input x
Input binding for x with dimensions 4x3x224x224 is created.
Output binding for y with dimensions 4x1000 is created.
Starting inference
Warmup completed 185 queries over 200 ms
Timing trace has 2523 queries over 3.00267 s

=== Trace details ===
Trace averages of 10 runs:
Average on 10 runs - GPU latency: 1.11339 ms - Host latency: 1.52892 ms (enqueue 0.235403 ms)
...
Average on 10 runs - GPU latency: 1.11299 ms - Host latency: 1.55708 ms (enqueue 0.262207 ms)

=== Performance summary ===
Throughput: 840.251 qps
Latency: min = 1.4353 ms, max = 5.20972 ms, mean = 1.62134 ms, median = 1.55011 ms, percentile(90%) = 1.91345 ms, percentile(95%) = 2.07056 ms, percentile(99%) = 2.51172 ms
Enqueue Time: min = 0.174561 ms, max = 4.94312 ms, mean = 0.261935 ms, median = 0.261475 ms, percentile(90%) = 0.29071 ms, percentile(95%) = 0.315125 ms, percentile(99%) = 0.348938 ms
H2D Latency: min = 0.362305 ms, max = 0.622559 ms, mean = 0.430081 ms, median = 0.433594 ms, percentile(90%) = 0.449829 ms, percentile(95%) = 0.45874 ms, percentile(99%) = 0.493317 ms
GPU Compute Time: min = 1.05573 ms, max = 4.69727 ms, mean = 1.18524 ms, median = 1.11011 ms, percentile(90%) = 1.48682 ms, percentile(95%) = 1.6311 ms, percentile(99%) = 2.10425 ms
D2H Latency: min = 0.00427246 ms, max = 0.0371094 ms, mean = 0.00601927 ms, median = 0.00561523 ms, percentile(90%) = 0.00732422 ms, percentile(95%) = 0.00775146 ms, percentile(99%) = 0.0085144 ms
Total Host Walltime: 3.00267 s
Total GPU Compute Time: 2.99037 s
"GPU compute time is unstable (coefficient of variance = 18.1606%); locking the GPU clock frequency or adding --useSpinWait may improve stability"
* GPU compute time is unstable, with coefficient of variance = 18.1606%.
  If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
Explanations of the performance metrics are printed in the verbose logs.
"Explanation of the performance metrics; the annotations here are supplemented from the official documentation"
=== Explanations of the performance metrics ===
"Time from when the first query after warm-up is enqueued until the last query completes"
Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.

"GPU latency to execute one query"
GPU Compute Time: the GPU latency to execute the kernels for a query.

"Sum of the GPU compute time over all queries"
"If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overhead or data transfers"
Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.

"Throughput: number of queries completed per second"
"If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overhead or data transfers"
"throughput = the number of inferences / Total Host Walltime"
"Using CUDA graphs (--useCudaGraph) or disabling H2D/D2H transfers (--noDataTransfers) may improve GPU utilization"
"When the GPU is detected to be under-utilized, the log provides guidance"
Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.

"Host latency to enqueue a query"
"If this is longer than GPU Compute Time, the GPU is under-utilized and throughput may be dominated by host-side overhead"
"Includes calling the H2D/D2H CUDA APIs, running host-side heuristics, and launching CUDA kernels"
"Using CUDA graphs (--useCudaGraph) can reduce Enqueue Time"
Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.

H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.

D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.

"Called Host Latency in the official documentation: the latency of a single inference"
"Host Latency = H2D Latency + GPU Compute Time + D2H Latency"
Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
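
These definitions can be checked against the numbers in the performance summary above. The sketch below recomputes throughput, mean latency, and the share of wall time the GPU spent computing. All inputs are copied from the log; only the variable names are made up for the example.

```shell
queries=2523            # "Timing trace has 2523 queries over 3.00267 s"
walltime_s=3.00267      # Total Host Walltime
h2d_ms=0.430081         # H2D Latency, mean
gpu_ms=1.18524          # GPU Compute Time, mean
d2h_ms=0.00601927       # D2H Latency, mean
total_gpu_s=2.99037     # Total GPU Compute Time

# Throughput = number of queries / Total Host Walltime
throughput=$(awk -v q="$queries" -v w="$walltime_s" 'BEGIN { printf "%.2f", q / w }')

# Latency = H2D Latency + GPU Compute Time + D2H Latency
latency=$(awk -v a="$h2d_ms" -v b="$gpu_ms" -v c="$d2h_ms" 'BEGIN { printf "%.5f", a + b + c }')

# Share of the wall time during which the GPU was computing
gpu_busy=$(awk -v g="$total_gpu_s" -v w="$walltime_s" 'BEGIN { printf "%.1f", 100 * g / w }')

echo "throughput: $throughput qps"
echo "latency:    $latency ms"
echo "GPU busy:   $gpu_busy %"
```

The recomputed throughput (about 840.25 qps) and mean latency (1.62134 ms) agree with the reported values up to rounding. Total GPU Compute Time is almost equal to Total Host Walltime here, so by the criteria above the GPU is not under-utilized in this run.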

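What the log calls a query is called an inference in the official documentation; it can be understood as one inference pass. Host-side heuristics refers to heuristic algorithms that run on the host; what exactly they do is not yet clear to me.

Figure (from the official documentation): an aid to understanding the inference process.

Example 2: Seven functions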

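# Exit the script immediately when a command returns a non-zero status; later commands are not run
set -e
# Print each command and its arguments before executing it
set -x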

clear
rm -rf ./*log ./*.plan ./*.cache ./*.lock ./*.json ./*.raw

# 01 Run the ONNX model directly
trtexec \
    --onnx=resnet18.onnx \
    > 01-run_onnx.log 2>&1

# 02 Parse the ONNX model and build an engine
trtexec \
    --onnx=resnet18.onnx \
    --saveEngine=resnet18.plan \
    --timingCacheFile=resnet18.cache \
    --minShapes=x:1x3x224x224 \
    --optShapes=x:4x3x224x224 \
    --maxShapes=x:16x3x224x224 \
    --fp16 \
    --noTF32 \
    --memPoolSize=workspace:1024MiB \
    --builderOptimizationLevel=5 \
    --maxAuxStreams=4 \
    --skipInference \
    --verbose \
    > 02-generate_engine.log 2>&1

# 03 Run the engine
trtexec \
    --loadEngine=resnet18.plan \
    --shapes=x:4x3x224x224 \
    --noDataTransfers \
    --useSpinWait \
    --useCudaGraph \
    --verbose \
    > 03-run_engine.log 2>&1

# 04 Export engine layer information
trtexec \
    --onnx=resnet18.onnx \
    --skipInference \
    --profilingVerbosity=detailed \
    --dumpLayerInfo \
    --exportLayerInfo="./04-exportLayerInfo.log" \
    > 04-export_layer_info.log 2>&1

# 05 Export profiling information
trtexec \
    --loadEngine=resnet18.plan \
    --dumpProfile \
    --exportTimes="./05-exportTimes.json" \
    --exportProfile="./05-exportProfile.json" \
    > 05-export_profile.log 2>&1

# 06 Save input and output data
trtexec \
    --loadEngine=resnet18.plan \
    --dumpOutput \
    --dumpRawBindingsToFile \
    > 06-save_data.log 2>&1

# 07 Load input data from a file and run inference
trtexec \
    --loadEngine=resnet18.plan \
    --loadInputs=x:x.input.1.3.224.224.Float.raw \
    --dumpOutput \
    > 07-load_data.log 2>&1
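
Step 07 reads the input tensor from a raw binary file. Assuming the file holds the tensor as densely packed float32 values with no header (an assumption about the format, suggested by the shape and type embedded in the file name), its size should equal the element count times 4 bytes. The sketch below computes that size from the shape string and writes a zero-filled placeholder of the same size; the temporary file is illustrative and is not the file trtexec produced.

```shell
shape="1x3x224x224"
bytes_per_elem=4     # float32

# Turn "1x3x224x224" into the arithmetic expression 1*3*224*224.
elems=$(( $(printf '%s' "$shape" | tr 'x' '*') ))
expected_bytes=$((elems * bytes_per_elem))

# Zero-filled stand-in with the same size as the expected input file.
raw_file="$(mktemp)"
head -c "$expected_bytes" /dev/zero > "$raw_file"
actual_bytes=$(wc -c < "$raw_file" | tr -d ' ')

echo "elements: $elems, expected bytes: $expected_bytes, file bytes: $actual_bytes"
```

Comparing the expected size with the size of the file written in step 06 is a cheap sanity check before passing it to --loadInputs.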

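In the cookbook, the third command did not run correctly for me, so I modified it slightly. The eighth command requires a plugin, so I removed it and will study it separately in the plugin part. For reference, the cookbook version of the third command is: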

# 03-Load TensorRT engine built above and do inference
trtexec model-02.plan \
    --trt \
    --shapes=tensorX:4x1x28x28 \
    --noDataTransfers \
    --useSpinWait \
    --useCudaGraph \
    --verbose \
    > result-03.log 2>&1
