TensorRT trtexec的用法说明

TensorRT Command-Line Wrapper: trtexec

Table Of Contents

  • Description
  • Building trtexec
  • Using trtexec
    • Example 1: Simple MNIST model from Caffe
    • Example 2: Profiling a custom layer
    • Example 3: Running a network on DLA
    • Example 4: Running an ONNX model with full dimensions and dynamic shapes
    • Example 5: Collecting and printing a timing trace
    • Example 6: Tune throughput with multi-streaming
  • Tool command line arguments
  • Additional resources
  • License
  • Changelog
  • Known issues

Description

Included in the samples directory is a command line wrapper tool, called trtexec. trtexec is a tool to quickly utilize TensorRT without having to develop your own application. The trtexec tool has two main purposes:

  • It’s useful for benchmarking networks on random data.
  • It’s useful for generating serialized engines from models.

Benchmarking network - If you have a model saved as a UFF file, ONNX file, or if you have a network description in a Caffe prototxt format, you can use the trtexec tool to test the performance of running inference on your network using TensorRT. The trtexec tool has many options for specifying inputs and outputs, iterations for performance timing, precision allowed, and other options.

Serialized engine generation - If you generate a saved serialized engine file, you can pull it into another application that runs inference. For example, you can use the TensorRT Laboratory to run the engine with multiple execution contexts from multiple threads in a fully pipelined asynchronous way to test parallel inference performance. There are some caveats, for example, if you used a Caffe prototxt file and a model is not supplied, random weights are generated. Also, in INT8 mode, random weights are used, meaning trtexec does not provide calibration capability.

Building trtexec

trtexec can be used to build engines, using different TensorRT features (see command line arguments), and run inference. trtexec also measures and reports execution time and can be used to understand performance and possibly locate bottlenecks.

Compile this sample by running make in the /samples/trtexec directory. The binary named trtexec will be created in the /bin directory.

cd /samples/trtexec
make

Where is where you installed TensorRT.

Using trtexec

trtexec can build engines from models in Caffe, UFF, or ONNX format.

Example 1: Simple MNIST model from Caffe

The example below shows how to load a model description and its weights, build the engine that is optimized for batch size 16, and save it to a file.
trtexec --deploy=/path/to/mnist.prototxt --model=/path/to/mnist.caffemodel --output=prob --batch=16 --saveEngine=mnist16.trt

Then, the same engine can be used for benchmarking; the example below shows how to load the engine and run inference on batch 16 inputs (randomly generated).
trtexec --loadEngine=mnist16.trt --batch=16

Example 2: Profiling a custom layer

You can profile a custom layer using the IPluginRegistry for the plugins and trtexec. You’ll need to first register the plugin with IPluginRegistry.

If you are using TensorRT shipped plugins, you should load the libnvinfer_plugin.so file, as these plugins are pre-registered.

If you have your own plugin, then it has to be registered explicitly. The following macro can be used to register the plugin creator YourPluginCreator with the IPluginRegistry.
REGISTER_TENSORRT_PLUGIN(YourPluginCreator);

Example 3: Running a network on DLA

To run the AlexNet network on NVIDIA DLA (Deep Learning Accelerator) using trtexec in FP16 mode, issue:

./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt --output=prob --useDLACore=1 --fp16 --allowGPUFallback

To run the AlexNet network on DLA using trtexec in INT8 mode, issue:

./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt --output=prob --useDLACore=1 --int8 --allowGPUFallback

To run the MNIST network on DLA using trtexec, issue:

./trtexec --deploy=data/mnist/mnist.prototxt --output=prob --useDLACore=0 --fp16 --allowGPUFallback

For more information about DLA, see Working With DLA.

Example 4: Running an ONNX model with full dimensions and dynamic shapes

To run an ONNX model in full-dimensions mode with static input shapes:

./trtexec --onnx=model.onnx

The following examples assumes an ONNX model with one dynamic input with name input and dimensions [-1, 3, 244, 244]

To run an ONNX model in full-dimensions mode with an given input shape:

./trtexec --onnx=model.onnx --shapes=input:32x3x244x244

To benchmark your ONNX model with a range of possible input shapes:

./trtexec --onnx=model.onnx --minShapes=input:1x3x244x244 --optShapes=input:16x3x244x244 --maxShapes=input:32x3x244x244 --shapes=input:5x3x244x244

Example 5: Collecting and printing a timing trace

When running, trtexec prints the measured performance, but can also export the measurement trace to a json file:

./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt --output=prob --exportTimes=trace.json

Once the trace is stored in a file, it can be printed using the tracer.py utility. This tool prints timestamps and duration of input, compute, and output, in different forms:

./tracer.py trace.json

Similarly, profiles can also be printed and stored in a json file. The utility profiler.py can be used to read and print the profile from a json file.

Example 6: Tune throughput with multi-streaming

Tuning throughput may require running multiple concurrent streams of execution. This is the case for example when the latency achieved is well within the desired
threshold, and we can increase the throughput, even at the expense of some latency. For example, saving engines for batch sizes 1 and 2 and assume that both
execute within 2ms, the latency threshold:

trtexec --deploy=GoogleNet_N2.prototxt --output=prob --batch=1 --saveEngine=g1.trt --int8 --buildOnly
trtexec --deploy=GoogleNet_N2.prototxt --output=prob --batch=2 --saveEngine=g2.trt --int8 --buildOnly

Now, the saved engines can be tried to find the combination batch/streams below 2 ms that maximizes the throughput:

trtexec --loadEngine=g1.trt --batch=1 --streams=2
trtexec --loadEngine=g1.trt --batch=1 --streams=3
trtexec --loadEngine=g1.trt --batch=1 --streams=4
trtexec --loadEngine=g2.trt --batch=2 --streams=2

Tool command line arguments

To see the full list of available options and their descriptions, issue the ./trtexec --help command.

Note: Specifying the --safe parameter turns the safety mode switch ON. By default, the --safe parameter is not specified; the safety mode switch is OFF. The layers and parameters that are contained within the --safe subset are restricted if the switch is set to ON. The switch is used for prototyping the safety restricted flows until the TensorRT safety runtime is made available. For more information, see the Working With Automotive Safety section in the TensorRT Developer Guide.

Additional resources

The following resources provide more details about trtexec:

Documentation

  • NVIDIA trtexec
  • TensorRT Sample Support Guide
  • NVIDIA’s TensorRT Documentation Library

License

For terms and conditions for use, reproduction, and distribution, see the TensorRT Software License Agreement
documentation.

Changelog

April 2019
This is the first release of this README.md file.

Known issues

There are no known issues in this sample.


=== Model Options ===
  --uff=<file>                UFF model
  --onnx=<file>               ONNX model
  --model=<file>              Caffe model (default = no model, random weights used)
  --deploy=<file>             Caffe prototxt file
  --output=<name>[,<name>]*   Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
  --uffInput=<name>,X,Y,Z     Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
  --uffNHWC                   Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)

=== Build Options ===
  --maxBatch                  Set max batch size and build an implicit batch engine (default = 1)
  --explicitBatch             Use explicit batch sizes when building the engine (default = implicit)
  --minShapes=spec            Build with dynamic shapes using a profile with the min shapes provided
  --optShapes=spec            Build with dynamic shapes using a profile with the opt shapes provided
  --maxShapes=spec            Build with dynamic shapes using a profile with the max shapes provided
                              Note: if any of min/max/opt is missing, the profile will be completed using the shapes
                                    provided and assuming that opt will be equal to max unless they are both specified;
                                    partially specified shapes are applied starting from the batch size;
                                    dynamic shapes imply explicit batch
                                    input names can be wrapped with single quotes (ex: 'Input:0')
                              Input shapes spec ::= Ishp[","spec]
                                           Ishp ::= name":"shape
                                          shape ::= N[["x"N]*"*"]
  --inputIOFormats=spec       Type and formats of the input tensors (default = all inputs in fp32:chw)
  --outputIOFormats=spec      Type and formats of the output tensors (default = all outputs in fp32:chw)
                              IO Formats: spec  ::= IOfmt[","spec]
                                          IOfmt ::= type:fmt
                                          type  ::= "fp32"|"fp16"|"int32"|"int8"
                                          fmt   ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32")["+"fmt]
  --workspace=N               Set workspace size in megabytes (default = 16)
  --minTiming=M               Set the minimum number of iterations used in kernel selection (default = 1)
  --avgTiming=M               Set the number of times averaged in each iteration for kernel selection (default = 8)
  --fp16                      Enable fp16 algorithms, in addition to fp32 (default = disabled)
  --int8                      Enable int8 algorithms, in addition to fp32 (default = disabled)
  --calib=<file>              Read INT8 calibration cache file
  --safe                      Only test the functionality available in safety restricted flows
  --saveEngine=<file>         Save the serialized engine
  --loadEngine=<file>         Load a serialized engine

=== Inference Options ===
  --batch=N                   Set batch size for implicit batch engines (default = 1)
  --shapes=spec               Set input shapes for dynamic shapes inputs. Input names can be wrapped with single quotes(ex: 'Input:0')
                              Input shapes spec ::= Ishp[","spec]
                                           Ishp ::= name":"shape
                                          shape ::= N[["x"N]*"*"]
  --loadInputs=spec           Load input values from files (default = generate random inputs). Input names can be wrapped with single quotes (ex: 'Input:0')
                              Input values spec ::= Ival[","spec]
                                           Ival ::= name":"file
  --iterations=N              Run at least N inference iterations (default = 10)
  --warmUp=N                  Run for N milliseconds to warmup before measuring performance (default = 200)
  --duration=N                Run performance measurements for at least N seconds wallclock time (default = 3)
  --sleepTime=N               Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
  --streams=N                 Instantiate N engines to use concurrently (default = 1)
  --exposeDMA                 Serialize DMA transfers to and from device. (default = disabled)
  --useSpinWait               Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = disabled)
  --threads                   Enable multithreading to drive engines with independent threads (default = disabled)
  --useCudaGraph              Use cuda graph to capture engine execution and then launch inference (default = disabled)
  --buildOnly                 Skip inference perf measurement (default = disabled)

=== Build and Inference Batch Options ===
                              When using implicit batch, the max batch size of the engine, if not given,
                              is set to the inference batch size;
                              when using explicit batch, if shapes are specified only for inference, they
                              will be used also as min/opt/max in the build profile; if shapes are
                              specified only for the build, the opt shapes will be used also for inference;
                              if both are specified, they must be compatible; and if explicit batch is
                              enabled but neither is specified, the model must provide complete static
                              dimensions, including batch size, for all inputs

=== Reporting Options ===
  --verbose                   Use verbose logging (default = false)
  --avgRuns=N                 Report performance measurements averaged over N consecutive iterations (default = 10)
  --percentile=P              Report performance for the P percentage (0<=P<=100, 0 representing max perf, and 100 representing min perf; (default = 99%)
  --dumpOutput                Print the output tensor(s) of the last inference iteration (default = disabled)
  --dumpProfile               Print profile information per layer (default = disabled)
  --exportTimes=<file>        Write the timing results in a json file (default = disabled)
  --exportOutput=<file>       Write the output tensors to a json file (default = disabled)
  --exportProfile=<file>      Write the profile information per layer in a json file (default = disabled)

=== System Options ===
  --device=N                  Select cuda device N (default = 0)
  --useDLACore=N              Select DLA core N for layers that support DLA (default = none)
  --allowGPUFallback          When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
  --plugins                   Plugin library (.so) to load (can be specified multiple times)

=== Help ===
  --help                      Print this message
Note: CUDA graphs is not supported in this version.

你可能感兴趣的:(TensorRT)