TensorRT in Practice

1. Background

Mainstream deep learning frameworks (Caffe, MXNet, TensorFlow, PyTorch, etc.) are not particularly fast at model inference, so deploying models directly with these frameworks is often inefficient in real projects. Deploying models trained in these frameworks through NVIDIA's TensorRT can greatly speed up inference, typically making it at least twice as fast as the original framework while also using less device memory.

2. Key Techniques


  • Model quantization (see the small illustration after this list)
  • Dynamic memory optimization
  • Layer and tensor fusion
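Model quantization is the technique that most directly trades accuracy for speed. As a rough hand-written illustration (not TensorRT's internal code), symmetric INT8 quantization maps each FP32 value to an 8-bit integer through a per-tensor scale of abs(max)/127, the same scale that later appears in the INT8 calibration table:

#include <algorithm>
#include <cmath>
#include <cstdint>

//illustration only: symmetric INT8 quantization with a per-tensor scale of maxAbs / 127
inline int8_t quantizeInt8(float x, float maxAbs)
{
    const float scale = maxAbs / 127.0f;            //dynamic range found by calibration
    float q = std::round(x / scale);
    q = std::max(-127.0f, std::min(127.0f, q));     //clamp to the symmetric INT8 range
    return static_cast<int8_t>(q);
}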

3. Deploying TensorFlow Models with TensorRT

  1. Freezing the model

The first step in deploying a TensorFlow model is to freeze it, saving the model structure and weights into a single .pb file.

pb_graph = tf.graph_util.convert_variables_to_constants(sess, sess.graph.as_graph_def(), [v.op.name for v in outputs])
with tf.gfile.FastGFile('./pbmodel_name.pb', mode='wb') as f:
    f.write(pb_graph.SerializeToString())

Simply run the code above after the model has been defined and its weights loaded. tf.graph_util.convert_variables_to_constants converts the weights into constants, where outputs is the list of tensors to be used as outputs; finally pb_graph.SerializeToString() serializes the graph and writes it into the .pb file, producing the frozen model.

  2. Generating the UFF model

With the .pb model in hand, it must be converted to a UFF model that TensorRT can consume; just invoke the convert script that ships with the uff package:

python /usr/lib/python2.7/site-packages/uff/bin/convert_to_uff.py   pbmodel_name.pb

After a successful conversion it prints information like the following, including the total number of nodes and the inferred input and output nodes.

(converter log output omitted)

  3. Deploying the UFF model with TensorRT

To deploy the generated UFF model with TensorRT, first import the model weights and network structure stored in the UFF file, then run the optimization to generate the corresponding inference engine. The steps are:

  • Create an IBuilder* builder
  • Create a parser that will parse the UFF file
  • Create a network with the builder
  • Register the input and output nodes with the parser
  • Have the parser parse the model weights and network structure from the UFF file into the network
  • From the parsed network the builder can create the engine. Before creating the engine you must specify the maximum batch size; afterwards, the batch size used with the engine must not exceed this value or inference will fail, and inference is most efficient when the batch size equals the configured maximum. For example, with a maximum batch size of 10, if a full batch of 10 images averages 4 ms per image, then a batch with fewer than 10 images will average more than 4 ms per image.
//initialize the NvInfer plugins
initLibNvInferPlugins(&gLogger.getTRTLogger(), "");
//1. create the IBuilder
IBuilder* builder = createInferBuilder(gLogger.getTRTLogger());
assert(builder != nullptr);
//create the UFF parser
auto parser = createUffParser();
//register the input node name, shape and channel order
parser->registerInput(inputtensor_name, DimsCHW(INPUT_C, INPUT_H, INPUT_W), UffInputOrder::kNCHW);
// MarkOutput_0 is a node created by the UFF converter when we specify an output with -O.
parser->registerOutput(outputtensor_name);
// Parse the UFF model to populate the network, then set the outputs.
INetworkDefinition* network = builder->createNetwork();
gLogInfo << "Begin parsing model..." << std::endl;
if (!parser->parse(uffFile, *network, nvinfer1::DataType::kFLOAT))
{
    gLogError << "Failure while parsing UFF file" << std::endl;
    return nullptr;
}
builder->setMaxBatchSize(maxBatchSize);
// The _GB literal operator is defined in common/common.h
builder->setMaxWorkspaceSize(MAX_WORKSPACE); // We need about 1GB of scratch space for the plugin layer for batch size 5.
if (gArgs.runInInt8)
{
    builder->setInt8Mode(gArgs.runInInt8);
    builder->setInt8Calibrator(calibrator);
}
builder->setFp16Mode(gArgs.runInFp16);
samplesCommon::enableDLA(builder, gArgs.useDLACore);
gLogInfo << "Begin building engine..." << std::endl;
ICudaEngine* engine = builder->buildCudaEngine(*network);
if (!engine)
{
    gLogError << "Unable to create engine" << std::endl;
    return nullptr;
}
gLogInfo << "End building engine..." << std::endl;

// We don't need the network any more, and we can destroy the parser.
network->destroy();
parser->destroy();

builder->destroy();
shutdownProtobufLibrary();

Once the engine has been built, inference can be performed. Running inference requires an execution context, IExecutionContext* context, obtained via engine->createExecutionContext().

context = engine->createExecutionContext();
assert(context != nullptr);

The core inference call is:

 context.execute(batchSize, &buffers[0]);

Here buffers is a void* array holding the device addresses of the model's input and output tensors. Device memory for the inputs and outputs is allocated with cudaMalloc and the resulting pointers are stored in the buffers array. Before calling execute, the input data (e.g. the input image) is copied to the corresponding input device memory with cudaMemcpy; after execute, the results are copied back from the device, again with cudaMemcpy.

// Run inference.
doInference(*context, &data[0], &detectionOut[0], &keepCount[0], N); 
void doInference(IExecutionContext& context, float* inputData, float* detectionOut, int* keepCount, int batchSize)
{
    //set up device buffers and run inference
    //auto t_start = std::chrono::high_resolution_clock::now();
    const ICudaEngine& engine = context.getEngine();
    // Input and output buffer pointers that we pass to the engine - the engine requires exactly IEngine::getNbBindings(),
    // of these, but in this case we know that there is exactly 1 input and 2 output.
    int nbBindings = engine.getNbBindings();
    std::vector<void*> buffers(nbBindings);
    std::vector<std::pair<int64_t, nvinfer1::DataType>> buffersSizes = calculateBindingBufferSizes(engine, nbBindings, batchSize);
    for (int i = 0; i < nbBindings; ++i)
    {
        auto bufferSizesOutput = buffersSizes[i];
        buffers[i] = samplesCommon::safeCudaMalloc(bufferSizesOutput.first * samplesCommon::getElementSize(bufferSizesOutput.second));
    }
    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings().
    int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME),
        outputIndex0 = engine.getBindingIndex(OUTPUT_BLOB_NAME0),
        outputIndex1 = outputIndex0 + 1; //engine.getBindingIndex(OUTPUT_BLOB_NAME1);
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    // DMA the input to the GPU,  execute the batch asynchronously, and DMA it back:
    CHECK(cudaMemcpyAsync(buffers[inputIndex], inputData, batchSize * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));

    auto t_start = std::chrono::high_resolution_clock::now();
    context.execute(batchSize, &buffers[0]);
    auto t_end = std::chrono::high_resolution_clock::now();
    float total = std::chrono::duration<float, std::milli>(t_end - t_start).count();
    gLogInfo << "Time taken for inference is " << total << " ms." << std::endl;
    for (int bindingIdx = 0; bindingIdx < nbBindings; ++bindingIdx)
    {
        if (engine.bindingIsInput(bindingIdx))
            continue;
        auto bufferSizesOutput = buffersSizes[bindingIdx];
        printOutput(bufferSizesOutput.first, bufferSizesOutput.second,
                    buffers[bindingIdx]);
    }

    CHECK(cudaMemcpyAsync(detectionOut, buffers[outputIndex0], batchSize * detectionOutputParam.keepTopK * 7 * sizeof(float), cudaMemcpyDeviceToHost, stream));
    CHECK(cudaMemcpyAsync(keepCount, buffers[outputIndex1], batchSize * sizeof(int), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);

    // Release the stream and the buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex0]));
    CHECK(cudaFree(buffers[outputIndex1]));
}

4. Deploying Caffe Models with TensorRT

Compared with TensorFlow models, converting a Caffe model is simpler: there is no intermediate step such as converting to UFF, because TensorRT can parse the prototxt and caffemodel files directly to obtain the network structure and weights. The overall flow is the same as described above, with two differences. First, the Caffe parser does not need the input layer registered in advance, since the prototxt already defines it and the parser recovers the input automatically. Second, after parsing the network the Caffe parser returns an IBlobNameToTensor* blobNameToTensor that records the mapping between tensors in the network and names in the prototxt; using this mapping, each tensor listed in the output name list outputs is looked up and marked as an output with network->markOutput. After that the engine can be built.

IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
ICaffeParser* parser = createCaffeParser();
DataType modelDataType = DataType::kFLOAT;
const IBlobNameToTensor *blobNameToTensor =	parser->parse(deployFile.c_str(),
                                                          modelFile.c_str(),
                                                          *network,
                                                          modelDataType);
assert(blobNameToTensor != nullptr);
for (auto& s : outputs) network->markOutput(*blobNameToTensor->find(s.c_str()));

builder->setMaxBatchSize(maxBatchSize);
builder->setMaxWorkspaceSize(1 << 30);
engine = builder->buildCudaEngine(*network);

5. Adding Custom Layers to TensorRT

TensorRT only supports a set of very common operations; many operations, such as upsampling (Upsample), are not supported. In that case we have to implement them ourselves as TensorRT plugin layers so that these unsupported operations can still be used inside TensorRT. Taking an Upsample layer as the example, we first define an Upsample class that inherits from one of TensorRT's plugin base classes.

class Upsample : public IPluginExt
{
public:
    //constructor from parameters
    //scale is the upsampling factor
    Upsample(int scale = 2) : mScale(scale) {
        assert(mScale > 0);
    }
    //constructor from a serialized byte stream
    Upsample(const void *data, size_t length) {
        const char *d = reinterpret_cast<const char *>(data), *a = d;
        mScale = read<int>(d);
        mDtype = read<DataType>(d);
        mCHW = read<DimsCHW>(d);
        assert(mScale > 0);
        assert(d == a + length);
    }
    ~Upsample()
    {

    }
    //methods describing the layer's outputs
    //number of outputs of the layer
    int getNbOutputs() const override {
        return 1;
    }

    //get the shape of the layer's output
    Dims getOutputDimensions(int index, const Dims *inputs, int nbInputDims) override {
        // std::cout << "Get outputdims!!!" << std::endl;
        assert(nbInputDims == 1);
        assert(inputs[0].nbDims == 3);
        return DimsCHW(inputs[0].d[0], inputs[0].d[1] * mScale, inputs[0].d[2] * mScale);
    }

    //methods for checking validity and configuring the layer from the input shapes, counts and data type
    //check whether the layer supports the given data type and format
    bool supportsFormat(DataType type, PluginFormat format) const override {
        return (type == DataType::kFLOAT || type == DataType::kHALF || type == DataType::kINT8)
                && format == PluginFormat::kNCHW;
    }
    //configure the layer's parameters
    void configureWithFormat(const Dims *inputDims, int nbInputs, const Dims *outputDims, int nbOutputs,
                             DataType type, PluginFormat format, int maxBatchSize) override
    {
        mDtype = type;
        mCHW.c() = inputDims[0].d[0];
        mCHW.h() = inputDims[0].d[1];
        mCHW.w() = inputDims[0].d[2];
    }


    //serialization methods
    //return the size needed to serialize the layer
    size_t getSerializationSize() override {
        return sizeof(mScale) + sizeof(mDtype) + sizeof(mCHW);
    }
    //serialize the layer parameters into a byte stream

    void serialize(void *buffer) override {
        char *d = reinterpret_cast<char *>(buffer), *a = d;
        write(d, mScale);
        write(d, mDtype);
        write(d, mCHW);
        assert(d == a + getSerializationSize());
    }


    //computation methods
    //size of the temporary workspace the layer needs
    size_t getWorkspaceSize(int maxBatchSize) const override {
        return 0;
    }
    //the layer's actual computation
    int enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace,
                cudaStream_t stream) override;
    //in enqueue we call a hand-written CUDA kernel to perform the Upsample computation

    int mScale;
    DataType mDtype;
    DimsCHW mCHW;
};
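The body of enqueue is not shown here. As a rough sketch (not the original implementation), a nearest-neighbour upsampling kernel and an enqueue that launches it on the stream TensorRT passes in could look like the following; the kernel name and launch configuration are assumptions, and only the mScale and mCHW members defined above are used.

//hypothetical nearest-neighbour upsampling kernel; each thread writes one output element
__global__ void upsampleNearestKernel(const float* in, float* out, int chans, int inH, int inW, int scale)
{
    int outH = inH * scale;
    int outW = inW * scale;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= chans * outH * outW)
        return;
    int ow = idx % outW;
    int oh = (idx / outW) % outH;
    int c = idx / (outW * outH);
    //map the output pixel back to its nearest source pixel
    out[idx] = in[(c * inH + oh / scale) * inW + ow / scale];
}

int Upsample::enqueue(int batchSize, const void* const* inputs, void** outputs, void* workspace,
                      cudaStream_t stream)
{
    int chans = batchSize * mCHW.c();   //fold the batch into the channel dimension (NCHW is contiguous)
    int total = chans * mCHW.h() * mScale * mCHW.w() * mScale;
    int block = 256;
    int grid = (total + block - 1) / block;
    upsampleNearestKernel<<<grid, block, 0, stream>>>(static_cast<const float*>(inputs[0]),
                                                      static_cast<float*>(outputs[0]),
                                                      chans, mCHW.h(), mCHW.w(), mScale);
    return cudaGetLastError() == cudaSuccess ? 0 : -1;
}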

With the Upsample class defined, we can add our plugin directly to the network. The statement below adds an upsampling layer with a factor of 2. The first argument of addPluginExt has type ITensor** to support the multi-input case, the second argument is the number of inputs, and the third is the plugin object to add.

Upsample up(2);
auto upsamplelayer = network->addPluginExt(inputtensor, 1, up);

6. Adding Custom Layer Support to the Caffe Parser

If one of our custom layers appears in the Caffe prototxt, the Caffe parser will report an error when the model is parsed during deployment. Taking Upsample as the example again, suppose the prototxt adds an upsample layer with the following block:

layer {
  name: "upsample0"
  type: "Upsample"
  bottom: "ReLU11"
  top: "Upsample1"
}

If we now call

const IBlobNameToTensor *blobNameToTensor =	parser->parse(deployFile.c_str(),
                                                              modelFile.c_str(),
                                                              *network,
                                                              modelDataType);

the following error occurs:

could not parse layer type Upsample

We have already written the Upsample plugin; how do we make TensorRT's Caffe parser recognize the upsample layer in the prototxt and automatically build our plugin? We need to define a plugin factory class that inherits from the base classes nvinfer1::IPluginFactory and nvcaffeparser1::IPluginFactoryExt.

class PluginFactory : public nvinfer1::IPluginFactory, public nvcaffeparser1::IPluginFactoryExt
{
public:
    //the methods that must be implemented are:
    //a method that decides whether a layer is a plugin;
    //its argument is the layer name from the prototxt, and the name determines whether the layer is registered as a plugin
    bool isPlugin(const char *name) override {
        return isPluginExt(name);
    }
    //check whether the layer name is that of an upsample layer
    bool isPluginExt(const char *name) override {
        
        char *aa = new char[6];
        memcpy(aa, name, 5);
        aa[5] = 0;
        int res = !strcmp(aa, "upsam");
        return res;
    }
    //methods that create a plugin from its name; there are two variants:
    //one builds the plugin from weights,
    //the other builds it from a serialized byte stream,
    //matching the plugin's two constructors. Upsample has no weights; for plugins that do have weights, the passed-in weights can be used to initialize the layer. mPlugin is a vector that stores all created plugin layers.
    IPlugin *createPlugin(const char *layerName, const nvinfer1::Weights *weights, int nbWeights) override {
        assert(isPlugin(layerName));
        mPlugin.push_back(std::unique_ptr<IPlugin>(new Upsample(2)));
        return mPlugin[mPlugin.size() - 1].get();
    }
    IPlugin *createPlugin(const char *layerName, const void *serialData, size_t serialLength) override {
        assert(isPlugin(layerName));
        
        return new Upsample(serialData, serialLength);
    }
    std::vector<std::unique_ptr<IPlugin>> mPlugin;
    //finally, a destroy method releases all created plugin layers
    void destroyPlugin()
    {
        for (unsigned int i = 0; i < mPlugin.size(); i++) 
        {
            mPlugin[i].reset();
        }
    } 
};

If the prototxt contains several plugins of different types, add extra branches to isPlugin and createPlugin and build the corresponding plugin layer based on the layer name, as sketched below.
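For instance, assuming a second, hypothetical Prelu plugin class alongside Upsample (the layer-name prefixes used here are also assumptions), the name-based branching could look roughly like this:

bool isPluginExt(const char *name) override
{
    //treat any layer whose name starts with "upsample" or "prelu" as a plugin
    return strncmp(name, "upsample", 8) == 0 || strncmp(name, "prelu", 5) == 0;
}

IPlugin *createPlugin(const char *layerName, const nvinfer1::Weights *weights, int nbWeights) override
{
    assert(isPlugin(layerName));
    if (strncmp(layerName, "upsample", 8) == 0)
    {
        mPlugin.push_back(std::unique_ptr<IPlugin>(new Upsample(2)));
    }
    else
    {
        //a plugin that has weights can be initialized from the parsed caffemodel weights
        mPlugin.push_back(std::unique_ptr<IPlugin>(new Prelu(weights, nbWeights)));
    }
    return mPlugin.back().get();
}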

After implementing PluginFactory, the Caffe parser must be told to use it; add the following before calling parser->parse:

PluginFactory pluginFactory;
parser->setPluginFactoryExt(&pluginFactory);

The parser will then create plugin layers according to the rules defined in pluginFactory, and the earlier error about being unable to parse the Upsample layer no longer occurs.

  • Besides the IPluginExt used in this article, TensorRT also has the plugin base classes IPlugin and IPluginV2. The class methods that must be implemented for each base differ slightly; see include/NvInfer.h under the TensorRT installation directory for details. Likewise, the functions for adding your own layer to a network are addPlugin, addPluginExt and addPluginV2, corresponding one-to-one with IPlugin, IPluginExt and IPluginV2; they must not be mixed, otherwise some methods are silently never called. For example, an IPluginExt layer added with addPlugin never has configureWithFormat called, because IPlugin has no such method. Similarly, the Caffe parser's setPluginFactory and setPluginFactoryExt must not be mixed either.
  • A "cuda failure" at runtime is usually caused by an illegal memory access while copying data; check that the size of the allocated buffer matches the size of the data being copied.
  • Some operations are not supported by TensorRT but can be replaced by a combination of supported ones (example figure omitted), which saves the effort of writing a custom layer.
  • TensorFlow's flatten defaults to keepdims=False, but the UFF converter converts it as if keepdims=True. Consequently, applying transpose, expand_dims and similar operations to a flattened tensor in TensorFlow easily leads to errors when the UFF model is parsed by TensorRT, such as "Order size is not matching the number dimensions of TensorRT". It is best to set keepdims=True for TensorFlow's reduce and flatten operations so that layer outputs stay 4-dimensional, which avoids many strange errors when moving to TensorRT.
  • TensorRT's slice layer is problematic: after adding a slice layer with network->addSlice, building the engine fails with nvinfer1::builder::checkSanity(const nvinfer1::builder::Graph&): Assertion `tensors.size() == g.tensors.size()' failed. It is best to avoid the slice layer when constructing the network, or to implement slicing as a custom layer.
  • TensorRT's GitHub repository contains part of the source code and plenty of sample code; studying it helps you master TensorRT faster.

7. samples

7.1 mnist_caffe

API layers and operations used:

  • Activation layer: implements element-wise activation functions, here kRELU.
  • Convolution layer: implements 2D (channel, height, width) convolution, with or without bias.
  • FullyConnected layer: implements a matrix-vector product, with or without bias.
  • Pooling layer: implements pooling over the input channels; the supported modes are maximum, average and maximum-average blend.
  • Scale layer: implements per-tensor, per-channel or per-element affine transforms and/or exponentiation by a constant.
  • SoftMax layer: applies softmax along the user-chosen axis.

Preparing the data:

After TensorRT is installed, download the PGM images from /usr/src/tensorrt/data:

export TRT_DATADIR=/usr/src/tensorrt/data
pushd $TRT_DATADIR/mnist
pip install Pillow
python3 download_pgms.py
popd

Running the sample:

#include "common/argsParser.h"
#include "common/buffers.h"
#include "common/common.h"
#include "common/logger.h"

#include "NvCaffeParser.h"
#include "NvInfer.h"

#include 
#include 
#include 
#include 
#include 
#include 
#include 
const std::string gSampleName = "TensorRT.sample_mnist";
class SampleMNIST
{
    template <typename T>
    using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;
public:
    SampleMNIST(const samplesCommon::CaffeSampleParams& params)
        :mParams(params)
    {

    }
    bool build();
    bool infer();
    bool teardown();

private:
    bool constructNetwork(SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network);
    bool processInput(const samplesCommon::BufferManager& buffers,const std::string& inputTensorName,int inputFileIndex) const;
    bool verifyOutput(const samplesCommon::BufferManager& buffers,const std::string& outputTensorName, int groundTruthDigit)const;
    std::shared_ptr<nvinfer1::ICudaEngine> mEngine{nullptr};
    samplesCommon::CaffeSampleParams mParams;
    nvinfer1::Dims mInputDims;
    SampleUniquePtr<nvcaffeparser1::IBinaryProtoBlob> mMeanBlob;

};
bool SampleMNIST::build()
{
    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
    if(!builder)
    {
        return false;
    }
    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork());
    if(!network)
    {
        return false;
    }
    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if(!config)
    {
        return false;
    }
    auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::createCaffeParser());
    if(!parser)
    {
        return false;
    }
    if(!constructNetwork(parser,network))
    {
        return false;
    }
    builder->setMaxBatchSize(mParams.batchSize);
    config->setMaxWorkspaceSize(16_MiB);
    config->setFlag(BuilderFlag::kGPU_FALLBACK);
    config->setFlag(BuilderFlag::kSTRICT_TYPES);
    if(mParams.fp16)
    {
        config->setFlag(BuilderFlag::kFP16);
    }
    if(mParams.int8)
    {
        config->setFlag(BuilderFlag::kINT8);
    }
    samplesCommon::enableDLA(builder.get(),config.get(),mParams.dlaCore);

    mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());
    if(!mEngine)
    {
        return false;
    }
    assert(network->getNbInputs() == 1);
    mInputDims = network->getInput(0)->getDimensions();
    assert(mInputDims.nbDims == 3);

    return true;
}
bool SampleMNIST::constructNetwork(SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network)
{
    const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor = parser->parse(
                mParams.prototxtFileName.c_str(),mParams.weightsFileName.c_str(),*network,nvinfer1::DataType::kFLOAT);
    for(auto& s : mParams.outputTensorNames)
    {
        network->markOutput(*blobNameToTensor->find(s.c_str()));
    }

    nvinfer1::Dims inputDims = network->getInput(0)->getDimensions();
    mMeanBlob =
            SampleUniquePtr<nvcaffeparser1::IBinaryProtoBlob>(parser->parseBinaryProto(mParams.meanFileName.c_str()));
    nvinfer1::Weights meanWeights{nvinfer1::DataType::kFLOAT, mMeanBlob->getData(), inputDims.d[1] * inputDims.d[2]};
    float maxMean
        = samplesCommon::getMaxValue(static_cast<const float*>(meanWeights.values), samplesCommon::volume(inputDims));

    auto mean = network->addConstant(nvinfer1::Dims3(1, inputDims.d[1], inputDims.d[2]), meanWeights);
    if (!mean->getOutput(0)->setDynamicRange(-maxMean, maxMean))
    {
        return false;
    }
    if (!network->getInput(0)->setDynamicRange(-maxMean, maxMean))
    {
        return false;
    }
    auto meanSub = network->addElementWise(*network->getInput(0), *mean->getOutput(0), ElementWiseOperation::kSUB);
    if (!meanSub->getOutput(0)->setDynamicRange(-maxMean, maxMean))
    {
        return false;
    }
    network->getLayer(0)->setInput(0, *meanSub->getOutput(0));
    samplesCommon::setAllTensorScales(network.get(), 127.0f, 127.0f);

    return true;

}
bool SampleMNIST::infer()
{
    samplesCommon::BufferManager buffers(mEngine,mParams.batchSize);
    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if(!context)
    {
        return false;
    }
    srand(time(NULL));
    const int digit = rand() % 10;
    assert(mParams.inputTensorNames.size() == 1);
    if (!processInput(buffers, mParams.inputTensorNames[0], digit))
    {
        return false;
    }
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    buffers.copyInputToDeviceAsync(stream);
    if(!context->enqueue(mParams.batchSize, buffers.getDeviceBindings().data(),stream,nullptr))
    {
        return false;
    }
    buffers.copyOutputToHostAsync(stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    assert(mParams.outputTensorNames.size()==1);
    bool outputCorrect = verifyOutput(buffers, mParams.outputTensorNames[0], digit);

    return outputCorrect;
}
bool SampleMNIST::processInput(const samplesCommon::BufferManager& buffers, const std::string& inputTensorName, int inputFileIdx) const
{
    const int inputH = mInputDims.d[1];
    const int inputW = mInputDims.d[2];

    srand(unsigned(time(nullptr)));
    std::vector<uint8_t> fileData(inputH * inputW);
    readPGMFile(locateFile(std::to_string(inputFileIdx) + ".pgm",mParams.dataDirs),fileData.data(),inputH,inputW);

    sample::gLogInfo<<"Input:\n";
    for (int i = 0; i < inputH * inputW; i++)
    {
        sample::gLogInfo << (" .:-=+*#%@"[fileData[i] / 26]) << (((i + 1) % inputW) ? "" : "\n");
    }
    sample::gLogInfo << std::endl;

    float* hostInputBuffer = static_cast<float*>(buffers.getHostBuffer(inputTensorName));
    for (int i = 0; i < inputH * inputW; i++)
    {
        hostInputBuffer[i] = float(fileData[i]);
    }
    return true;
}
bool SampleMNIST::verifyOutput(const samplesCommon::BufferManager& buffers, const std::string& outputTensorName, int groundTruthDigit) const
{
    const float* prob = static_cast<const float*>(buffers.getHostBuffer(outputTensorName));

    sample::gLogInfo <<"Output:\n";
    float val{0.0f};
    int idx{0};
    const int kDIGITS = 10;
    for (int i = 0; i < kDIGITS; i++)
    {
        val = std::max(val, prob[i]);
        if (val == prob[i])
        {
            idx = i;
        }
        sample::gLogInfo << i << ": " << std::string(int(std::floor(prob[i] * 10 + 0.5f)), '*') << "\n";
    }
    sample::gLogInfo << std::endl;

    return (idx == groundTruthDigit && val > 0.9f);
}
bool SampleMNIST::teardown()
{
    nvcaffeparser1::shutdownProtobufLibrary();
    return true;
}
samplesCommon::CaffeSampleParams initializeSampleParams(const samplesCommon::Args& args)
{
    samplesCommon::CaffeSampleParams params;
    if(args.dataDirs.empty())
    {
        params.dataDirs.push_back("./data/");
//        params.dataDirs.push_back("./data/");
    }
    else
    {
        params.dataDirs = args.dataDirs;
    }

    params.prototxtFileName = locateFile("mnist.prototxt",params.dataDirs);
    params.weightsFileName = locateFile("mnist.caffemodel",params.dataDirs);
    params.meanFileName = locateFile("mnist_mean.binaryproto",params.dataDirs);
    params.inputTensorNames.push_back("data");
    params.batchSize=1;
    params.outputTensorNames.push_back("prob");
    params.dlaCore=args.useDLACore;
    params.int8=args.runInInt8;
    params.fp16=args.runInFp16;
    return params;

}
void printHelpInfo()
{
    std::cout
        << "Usage: ./sample_mnist [-h or --help] [-d or --datadir=] [--useDLACore=]\n";
    std::cout << "--help          Display help information\n";
    std::cout << "--datadir       Specify path to a data directory, overriding the default. This option can be used "
                 "multiple times to add multiple directories. If no data directories are given, the default is to use "
                 "(data/samples/mnist/, data/mnist/)"
              << std::endl;
    std::cout << "--useDLACore=N  Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
                 "where n is the number of DLA engines on the platform."
              << std::endl;
    std::cout << "--int8          Run in Int8 mode.\n";
    std::cout << "--fp16          Run in FP16 mode.\n";
}
int main(int argc,char** argv)
{
    samplesCommon::Args args;
    bool argsOK= samplesCommon::parseArgs(args,argc,argv);
    if(!argsOK)
    {
        sample::gLogError << "Invalid arguments" << std::endl;
        printHelpInfo();
        return EXIT_FAILURE;
    }
    if (args.help)
    {
        printHelpInfo();
        return EXIT_SUCCESS;
    }
    auto sampleTest=sample::gLogger.defineTest(gSampleName,argc,argv);
    sample::gLogger.reportTestStart(sampleTest);

    samplesCommon::CaffeSampleParams params = initializeSampleParams(args);
    SampleMNIST sample(params);
    sample::gLogInfo <<"Building and running a GPU inference engine for MNIST" << std::endl;
    if (!sample.build())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    if (!sample.infer())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    if (!sample.teardown())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    return sample::gLogger.reportPass(sampleTest);
}

cmake_minimum_required(VERSION 3.19 FATAL_ERROR)
project(perception)

#Debug build
SET(CMAKE_BUILD_TYPE "Debug")

#Release build
#SET(CMAKE_BUILD_TYPE "Release")

set(CMAKE_CXX_FLAGS   " -std=c++11 -fno-stack-protector" )

#Optimization levels for the Debug and Release builds
#set(CMAKE_CXX_FLAGS_DEBUG   "-O0 -g -fstack-protector -fstack-protector-all -lefence" )             # no optimization for debug builds
#set(CMAKE_CXX_FLAGS_DEBUG   "-O0 -g -fstack-protector-all -lefence" )             # no optimization for debug builds
#set(CMAKE_CXX_FLAGS_RELEASE "-O3 -DNDEBUG " )   # optimized release build

set(CMAKE_CXX_FLAGS_DEBUG "$ENV{CXXFLAGS} -O0 -Wall -g2 -ggdb")
set(CMAKE_CXX_FLAGS_RELEASE "$ENV{CXXFLAGS} -O3 -Wall")

#executable output path
set(EXECUTABLE_OUTPUT_PATH ${PROJECT_SOURCE_DIR}/bin)
#library output path
set(LIBRARY_OUTPUT_PATH ${PROJECT_SOURCE_DIR}/lib)

#use relative rpaths, otherwise the program cannot run after the executable and libraries are moved
SET(CMAKE_BUILD_WITH_INSTALL_RPATH TRUE)
SET(CMAKE_INSTALL_RPATH "\${ORIGIN}/../lib") #runtime search path for shared libraries; ORIGIN is the directory of the executable

find_package(CUDA REQUIRED)
# cuda
include_directories(/usr/local/cuda/include)
link_directories(/usr/local/cuda/lib64)
# tensorrt
include_directories(/usr/include/x86_64-linux-gnu/)
link_directories(/usr/lib/x86_64-linux-gnu/)
find_package(OpenCV REQUIRED)

#find_package(Boost REQUIRED)
#find_package(Threads)
#include_directories(${Boost_INCLUDE_DIR})
#ADD_DEFINITIONS(-DBOOST_LOG_DYN_LINK)


#add_subdirectory(common)
#add_subdirectory(lidar)
set(DEOS_LIST "")
list(APPEND DEPS_LIST nvcaffeparser)
#target_include_directories(${TARGET_NAME}
#	PRIVATE common)

add_executable (${PROJECT_NAME}
                samplemnist.cpp
                common/logger.cpp)

target_link_libraries (${PROJECT_NAME}
	nvinfer
	nvcaffe_parser
	cudart
       	${OpenCV_LIBS}
	-lrt)

#redefine_file_macro(${PROJECT_NAME})


7.2 mnist_tensorrt

#include "common/argsParser.h"
#include "common/buffers.h"
#include "common/common.h"
#include "common/logger.h"

#include "NvUffParser.h"
#include "NvInfer.h"
#include 

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
const std::string gSampleName = "TensorRT.sample_uff_mnist";
class SampleUffMNIST
{
    template <typename T>
    using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;
public:
    SampleUffMNIST(const samplesCommon::UffSampleParams& params)
        :mParams(params)
    {

    }
    bool build();
    bool infer();
    bool teardown();
private:
    bool constructNetwork(SampleUniquePtr<nvuffparser::IUffParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network);
    bool processInput(const samplesCommon::BufferManager& buffers, const std::string& inputTensorName, int inputFileIdx)const;
    bool verifyOutput(const samplesCommon::BufferManager& buffers,const std::string& outputTensorName,int groundTruthDigit)const;

    std::shared_ptr<nvinfer1::ICudaEngine> mEngine{nullptr};
    samplesCommon::UffSampleParams mParams;

    nvinfer1::Dims mInputDims;
    const int kDIGITS{10};
};
bool SampleUffMNIST::build()
{
    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
    if(!builder)
    {
        return false;
    }
    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork());
    if(!network)
    {
        return false;
    }
    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if(!config)
    {
        return false;
    }
    auto parser = SampleUniquePtr<nvuffparser::IUffParser>(nvuffparser::createUffParser());
    if(!parser)
    {
        return false;
    }
    if(!constructNetwork(parser,network))
    {
        return false;
    }
    builder->setMaxBatchSize(mParams.batchSize);
    config->setMaxWorkspaceSize(16_MiB);
    config->setFlag(BuilderFlag::kGPU_FALLBACK);
    //config->setFlag(BuilderFlag::kSTRICT_TYPES);
    if(mParams.fp16)
    {
        config->setFlag(BuilderFlag::kFP16);
    }
    if(mParams.int8)
    {
        config->setFlag(BuilderFlag::kINT8);
    }
    samplesCommon::enableDLA(builder.get(),config.get(),mParams.dlaCore);

    mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());
    if(!mEngine)
    {
        return false;
    }
    assert(network->getNbInputs() == 1);
    mInputDims = network->getInput(0)->getDimensions();
    assert(mInputDims.nbDims == 3);

    return true;
}
bool SampleUffMNIST::constructNetwork(SampleUniquePtr<nvuffparser::IUffParser>& parser, SampleUniquePtr<nvinfer1::INetworkDefinition>& network)
{
    assert(mParams.inputTensorNames.size()==1);
    assert(mParams.outputTensorNames.size() == 1);
    parser->registerInput(mParams.inputTensorNames[0].c_str(),nvinfer1::Dims3(1,28,28),nvuffparser::UffInputOrder::kNCHW);
    parser->registerOutput(mParams.outputTensorNames[0].c_str());
    parser->parse(mParams.uffFileName.c_str(),*network,nvinfer1::DataType::kFLOAT);

    if(mParams.int8)
    {
        samplesCommon::setAllTensorScales(network.get(),127.0f,127.0f);
    }
    return true;

}
bool SampleUffMNIST::infer()
{
    samplesCommon::BufferManager buffers(mEngine,mParams.batchSize);
    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if(!context)
    {
        return false;
    }
    bool outputCorrect = true;
    float total = 0;
    // Try to infer each digit 0-9
    for (int digit = 0; digit < kDIGITS; digit++)
    {
        if (!processInput(buffers, mParams.inputTensorNames[0], digit))
        {
            return false;
        }
        // Copy data from host input buffers to device input buffers
        buffers.copyInputToDevice();

        const auto t_start = std::chrono::high_resolution_clock::now();

        // Execute the inference work
        if (!context->execute(mParams.batchSize, buffers.getDeviceBindings().data()))
        {
            return false;
        }

        const auto t_end = std::chrono::high_resolution_clock::now();
        const float ms = std::chrono::duration<float, std::milli>(t_end - t_start).count();
        total += ms;

        // Copy data from device output buffers to host output buffers
        buffers.copyOutputToHost();

        // Check and print the output of the inference
        outputCorrect &= verifyOutput(buffers, mParams.outputTensorNames[0], digit);
    }

    total /= kDIGITS;

    sample::gLogInfo << "Average over " << kDIGITS << " runs is " << total << " ms." << std::endl;

    return outputCorrect;
}
bool SampleUffMNIST::processInput(const samplesCommon::BufferManager& buffers, const std::string& inputTensorName, int inputFileIdx) const
{
    const int inputH = mInputDims.d[1];
    const int inputW = mInputDims.d[2];

    std::vector<uint8_t> fileData(inputH * inputW);
    readPGMFile(locateFile(std::to_string(inputFileIdx) + ".pgm",mParams.dataDirs),fileData.data(),inputH,inputW);

    sample::gLogInfo<<"Input:\n";
    for (int i = 0; i < inputH * inputW; i++)
    {
        sample::gLogInfo << (" .:-=+*#%@"[fileData[i] / 26]) << (((i + 1) % inputW) ? "" : "\n");
    }
    sample::gLogInfo << std::endl;

    float* hostInputBuffer = static_cast<float*>(buffers.getHostBuffer(inputTensorName));
    for (int i = 0; i < inputH * inputW; i++)
    {
        hostInputBuffer[i] = 1.0 - float(fileData[i] / 255.0);
    }
    return true;
}
bool SampleUffMNIST::verifyOutput(const samplesCommon::BufferManager& buffers, const std::string& outputTensorName, int groundTruthDigit) const
{
    const float* prob = static_cast<const float*>(buffers.getHostBuffer(outputTensorName));

    sample::gLogInfo <<"Output:\n";
    float val{0.0f};
    int idx{0};
    for (int i = 0; i < kDIGITS; i++)
    {
        val = std::max(val, prob[i]);
        if (val == prob[i])
        {
            idx = i;
        }
    }
    // Print output values for each index
    for (int j = 0; j < kDIGITS; j++)
    {
        sample::gLogInfo << j << "=> " << std::setw(10) << prob[j] << "\t : ";

        // Emphasize index with highest output value
        if (j == idx)
        {
            sample::gLogInfo << "***";
        }
        sample::gLogInfo << "\n";
    }

    sample::gLogInfo << std::endl;
    return (idx == groundTruthDigit);
}
bool SampleUffMNIST::teardown()
{
    nvuffparser::shutdownProtobufLibrary();
    return true;
}
void printHelpInfo()
{
    std::cout
        << "Usage: ./sample_uff_mnist [-h or --help] [-d or --datadir=<path to data directory>] [--useDLACore=<int>]\n";
    std::cout << "--help          Display help information\n";
    std::cout << "--datadir       Specify path to a data directory, overriding the default. This option can be used "
                 "multiple times to add multiple directories. If no data directories are given, the default is to use "
                 "(data/samples/mnist/, data/mnist/)"
              << std::endl;
    std::cout << "--useDLACore=N  Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
                 "where n is the number of DLA engines on the platform."
              << std::endl;
    std::cout << "--int8          Run in Int8 mode.\n";
    std::cout << "--fp16          Run in FP16 mode.\n";
}
int main(int argc,char** argv)
{
    samplesCommon::Args args;
    bool argsOK= samplesCommon::parseArgs(args,argc,argv);
    if(!argsOK)
    {
        sample::gLogError << "Invalid arguments" << std::endl;
        printHelpInfo();
        return EXIT_FAILURE;
    }
    if (args.help)
    {
        printHelpInfo();
        return EXIT_SUCCESS;
    }
    auto sampleTest=sample::gLogger.defineTest(gSampleName,argc,argv);
    sample::gLogger.reportTestStart(sampleTest);

    samplesCommon::UffSampleParams params = initializeSampleParams(args);
    SampleUffMNIST sample(params);
    sample::gLogInfo <<"Building and running a GPU inference engine for MNIST" << std::endl;
    if (!sample.build())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    if (!sample.infer())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    if (!sample.teardown())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    return sample::gLogger.reportPass(sampleTest);
}


7.3 mnist_onnx

#include "common/argsParser.h"
#include "common/buffers.h"
#include "common/common.h"
#include "common/logger.h"
#include "common/parserOnnxConfig.h"

#include "NvInfer.h"
#include 


#include 
#include 
#include 
#include 
#include 
const std::string gSampleName = "TensorRT.sample_onnx_mnist";
class SampleOnnxMNIST
{
    template <typename T>
    using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;
public:
    SampleOnnxMNIST(const samplesCommon::OnnxSampleParams& params)
        :mParams(params)
        , mEngine(nullptr)
    {

    }
    bool build();
    bool infer();
private:
    samplesCommon::OnnxSampleParams mParams;
    nvinfer1::Dims mInputDims;
    nvinfer1::Dims mOutputDims;

    int mNumber;

    std::shared_ptr<nvinfer1::ICudaEngine> mEngine;

    bool constructNetwork(SampleUniquePtr<nvinfer1::IBuilder>& builder,
                          SampleUniquePtr<nvinfer1::INetworkDefinition>& network,
                          SampleUniquePtr<nvinfer1::IBuilderConfig>& config,
                          SampleUniquePtr<nvonnxparser::IParser>& parser);
    bool processInput(const samplesCommon::BufferManager& buffers);
    bool verifyOutput(const samplesCommon::BufferManager& buffers)const;
};
bool SampleOnnxMNIST::build()
{
    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
    if(!builder)
    {
        return false;
    }
    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(explicitBatch));
    if(!network)
    {
        return false;
    }
    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if(!config)
    {
        return false;
    }
    auto parser = SampleUniquePtr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, sample::gLogger.getTRTLogger()));
    if(!parser)
    {
        return false;
    }

    auto constructed = constructNetwork(builder,network, config,parser);
    if(!constructed)
    {
        return false;
    }
    mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());
    if(!mEngine)
    {
        return false;
    }
    assert(network->getNbInputs() == 1);
    mInputDims = network->getInput(0)->getDimensions();
    assert(mInputDims.nbDims == 4);

    assert(network->getNbOutputs() == 1);
    mOutputDims = network->getOutput(0)->getDimensions();
    assert(mOutputDims.nbDims == 2);
    return true;
}
bool SampleOnnxMNIST::constructNetwork(SampleUniquePtr<nvinfer1::IBuilder>& builder, SampleUniquePtr<nvinfer1::INetworkDefinition>& network, SampleUniquePtr<nvinfer1::IBuilderConfig>& config, SampleUniquePtr<nvonnxparser::IParser>& parser)
{
    auto parsed = parser->parseFromFile(locateFile(mParams.onnxFileName, mParams.dataDirs).c_str(),
         static_cast<int>(sample::gLogger.getReportableSeverity()));
    if(!parsed)
    if(!parserd)
    {
        return false;
    }
    config->setMaxWorkspaceSize(16_MiB);
    if(mParams.fp16)
    {
        config->setFlag(BuilderFlag::kFP16);
    }
    if(mParams.int8)
    {
        config->setFlag(BuilderFlag::kINT8);
        samplesCommon::setAllTensorScales(network.get(),127.0f,127.0f);
    }
    samplesCommon::enableDLA(builder.get(),config.get(),mParams.dlaCore);
    return true;

}
bool SampleOnnxMNIST::infer()
{
    // Create RAII buffer manager object
    samplesCommon::BufferManager buffers(mEngine);

    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if (!context)
    {
        return false;
    }

    // Read the input data into the managed buffers
    assert(mParams.inputTensorNames.size() == 1);
    if (!processInput(buffers))
    {
        return false;
    }

    // Memcpy from host input buffers to device input buffers
    buffers.copyInputToDevice();

    bool status = context->executeV2(buffers.getDeviceBindings().data());
    if (!status)
    {
        return false;
    }

    // Memcpy from device output buffers to host output buffers
    buffers.copyOutputToHost();

    // Verify results
    if (!verifyOutput(buffers))
    {
        return false;
    }

    return true;
}
bool SampleOnnxMNIST::processInput(const samplesCommon::BufferManager& buffers) 
{
    const int inputH = mInputDims.d[2];
    const int inputW = mInputDims.d[3];

    // Read a random digit file
    srand(unsigned(time(nullptr)));
    std::vector<uint8_t> fileData(inputH * inputW);
    mNumber = rand() % 10;
    readPGMFile(locateFile(std::to_string(mNumber) + ".pgm", mParams.dataDirs), fileData.data(), inputH,inputW);

    // Print ASCII representation of digit image
    sample::gLogInfo << "\nInput:\n" << std::endl;
    for (int i = 0; i < inputH * inputW; i++)
    {
        sample::gLogInfo << (" .:-=+*#%@"[fileData[i] / 26]) << (((i + 1) % inputW) ? "" : "\n");
    }
    sample::gLogInfo << std::endl;

    float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer(mParams.inputTensorNames[0]));
    for (int i = 0; i < inputH * inputW; i++)
    {
        hostDataBuffer[i] =1.0- float(fileData[i]/255.0);
    }

    return true;

}
bool SampleOnnxMNIST::verifyOutput(const samplesCommon::BufferManager& buffers) const
{
    const int outputSize = mOutputDims.d[1];
    float* output = static_cast<float*>(buffers.getHostBuffer(mParams.outputTensorNames[0]));
    float val{0.0f};
    int idx{0};

    // Calculate Softmax
    float sum{0.0f};
    for (int i = 0; i < outputSize; i++)
    {
        output[i] = exp(output[i]);
        sum += output[i];
    }

    sample::gLogInfo << "Output:" << std::endl;
    for (int i = 0; i < outputSize; i++)
    {
        output[i] /= sum;
        val = std::max(val, output[i]);
        if (val == output[i])
        {
            idx = i;
        }

        sample::gLogInfo << " Prob " << i << "  " << std::fixed << std::setw(5) << std::setprecision(4) << output[i]
                         << " "
                         << "Class " << i << ": " << std::string(int(std::floor(output[i] * 10 + 0.5f)), '*')
                         << std::endl;
    }
    sample::gLogInfo << std::endl;

    return idx == mNumber && val > 0.9f;
}

samplesCommon::OnnxSampleParams initializeSampleParams(const samplesCommon::Args& args)
{
    samplesCommon::OnnxSampleParams params;
    if(args.dataDirs.empty())
    {
        params.dataDirs.push_back("./data/");
//        params.dataDirs.push_back("./data/");
    }
    else
    {
        params.dataDirs = args.dataDirs;
    }

    params.onnxFileName = "mnist.onnx";
    params.inputTensorNames.push_back("Input3");
    params.outputTensorNames.push_back("Plus214_Output_0");
    params.dlaCore = args.useDLACore;
    params.int8 = args.runInInt8;
    params.fp16 = args.runInFp16;

    return params;

}
void printHelpInfo()
{
    std::cout
        << "Usage: ./sample_mnist [-h or --help] [-d or --datadir=] [--useDLACore=]\n";
    std::cout << "--help          Display help information\n";
    std::cout << "--datadir       Specify path to a data directory, overriding the default. This option can be used "
                 "multiple times to add multiple directories. If no data directories are given, the default is to use "
                 "(data/samples/mnist/, data/mnist/)"
              << std::endl;
    std::cout << "--useDLACore=N  Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
                 "where n is the number of DLA engines on the platform."
              << std::endl;
    std::cout << "--int8          Run in Int8 mode.\n";
    std::cout << "--fp16          Run in FP16 mode.\n";
}
int main(int argc,char** argv)
{
    samplesCommon::Args args;
    bool argsOK= samplesCommon::parseArgs(args,argc,argv);
    if(!argsOK)
    {
        sample::gLogError << "Invalid arguments" << std::endl;
        printHelpInfo();
        return EXIT_FAILURE;
    }
    if (args.help)
    {
        printHelpInfo();
        return EXIT_SUCCESS;
    }
    auto sampleTest=sample::gLogger.defineTest(gSampleName,argc,argv);
    sample::gLogger.reportTestStart(sampleTest);

    SampleOnnxMNIST sample(initializeSampleParams(args));
    sample::gLogInfo <<"Building and running a GPU inference engine for MNIST" << std::endl;
    if (!sample.build())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    if (!sample.infer())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    return sample::gLogger.reportPass(sampleTest);
}

7.4 mnist_int8

7.4.1 INT8 Calibration and Inference

  • Defining the network

Defining a network for INT8 execution is exactly the same as for any other precision. Weights are imported as FP32, and the builder calibrates the network to find appropriate quantization factors that reduce it to INT8 precision. This sample imports the network with NvCaffeParser:

 const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor =
            parser->parse(locateFile(mParams.prototxtFileName, mParams.dataDirs).c_str(),
            locateFile(mParams.weightsFileName, mParams.dataDirs).c_str(), *network,
            dataType == DataType::kINT8 ? DataType::kFLOAT : dataType);
  • Setting up the calibrator

Calibration is an additional step when building an INT8 network. The application must provide TensorRT with sample inputs, the calibration data. TensorRT then runs inference in FP32 and collects statistics about the intermediate activation layers, which are used to build the reduced-precision INT8 engine.

Create the INT8 calibrator:

std::unique_ptr<IInt8Calibrator> calibrator;

    config->setAvgTimingIterations(1);
    config->setMinTimingIterations(1);
    config->setMaxWorkspaceSize(1_GiB);

(1) Calibration data

 if(dataType == DataType::kINT8)
    {
        MNISTBatchStream calibrationStream(mParams.calBatchSize,mParams.nbCalBatches,"train-images-idx3-ubyte",
                                           "train-labels-idx1-ubyte",mParams.dataDirs);
    }

The MNISTBatchStream class provides helper methods for retrieving batches of data; the calibrator uses this batch stream object to fetch batches during calibration. In general, a batch stream class looks like:

class IBatchStream
{
public:
    virtual void reset(int firstBatch) = 0;
    virtual bool next() = 0;
    virtual void skip(int skipCount) = 0;
    virtual float* getBatch() = 0;
    virtual float* getLabels() = 0;
    virtual int getBatchesRead() const = 0;
    virtual int getBatchSize() const = 0;
    virtual nvinfer1::Dims getDims() const = 0;
};

Note: the calibration data must be representative of the inputs provided to TensorRT at runtime; for example, for an image classification network it should not contain images from only a small subset of the classes. For ImageNet, roughly 500 images are sufficient for calibration.
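As a concrete illustration, the following is a minimal sketch (not the samples' actual implementation) of an IBatchStream over data that is already resident in host memory; the MNISTBatchStream in the samples' common/BatchStream.h reads the raw MNIST files instead, but exposes the same interface.

//minimal batch stream sketch over data already resident in host memory
class HostDataBatchStream : public IBatchStream
{
public:
    HostDataBatchStream(int batchSize, int nbBatches, nvinfer1::Dims dims,
                        std::vector<float> data, std::vector<float> labels)
        : mBatchSize(batchSize), mNbBatches(nbBatches), mDims(dims)
        , mData(std::move(data)), mLabels(std::move(labels)) {}

    void reset(int firstBatch) override { mBatchCount = firstBatch; }
    bool next() override
    {
        if (mBatchCount >= mNbBatches)
            return false;
        ++mBatchCount;
        return true;
    }
    void skip(int skipCount) override { mBatchCount += skipCount; }
    float* getBatch() override
    {
        //batches are stored back to back; next() has already advanced mBatchCount
        return mData.data() + (mBatchCount - 1) * mBatchSize * samplesCommon::volume(mDims);
    }
    float* getLabels() override { return mLabels.data() + (mBatchCount - 1) * mBatchSize; }
    int getBatchesRead() const override { return mBatchCount; }
    int getBatchSize() const override { return mBatchSize; }
    nvinfer1::Dims getDims() const override { return mDims; }

private:
    int mBatchSize{0};
    int mNbBatches{0};
    int mBatchCount{0};
    nvinfer1::Dims mDims{};
    std::vector<float> mData;
    std::vector<float> mLabels;
};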

(2) Calibration inference

The application must implement the IInt8Calibrator interface to provide the calibration data and helper methods for reading and writing the calibration table file. TensorRT offers four implementations of IInt8Calibrator:

IInt8EntropyCalibrator
IInt8EntropyCalibrator2
IInt8MinMaxCalibrator
IInt8LegacyCalibrator

This example uses IInt8EntropyCalibrator2:

 calibrator.reset(new Int8EntropyCalibrator2(
                             calibrationStream,0,mParams.networkName.c_str(),mParams.inputTensorNames[0].c_str()));

The calibrator object uses the calibration batch stream.

To perform calibration, the interface must provide implementations of getBatchSize() and getBatch() that retrieve data from the batch stream object.

The builder calls getBatchSize() once at the start to obtain the batch size of the calibration set.

config->setInt8Calibrator(calibrator.get());

It then calls getBatch() repeatedly to obtain batches from the application until the method returns false. Each calibration batch must contain exactly the number of images specified as the batch size.

 float* getBatch() override
    {
        return mData.data() + (mBatchCount * mBatchSize * samplesCommon::volume(mDims));
    }
float* getBatch() override
    {
        return mBatch.data();
    }
 while(batchStream.next())
    {
        assert(mParams.inputTensorNames.size() == 1);
        if(!processInput(buffers,batchStream.getBatch()))
        {
            return false;
        }
     ...
 }

For each input tensor, a pointer to the input data in GPU memory must be written into the bindings array. The names array contains the names of the input tensors; the position of each tensor in the bindings array matches the position of its name in the names array, and both arrays have size nbBindings. Because calibration is time-consuming, you can implement writeCalibrationCache to write the calibration table to a suitable location so it can be reused in later runs, and readCalibrationCache to read the calibration table file back from that location. During calibration the builder uses readCalibrationCache() to check whether a calibration file exists; it only recalibrates if the file does not exist or is incompatible with the current TensorRT version or the calibrator variant that produced it. A sketch of such a calibrator is given below.
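The sketch below condenses an IInt8EntropyCalibrator2 implementation in the spirit of the samples' common/EntropyCalibrator.h; the class and member names are illustrative, and only the IBatchStream interface shown earlier plus the samples' CHECK macro are assumed.

#include <fstream>
#include <iterator>
#include <string>
#include <vector>

//illustrative INT8 entropy calibrator: feeds batches from an IBatchStream and caches the calibration table
class EntropyCalibratorSketch : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    EntropyCalibratorSketch(IBatchStream& stream, const std::string& cacheFile, const char* inputBlobName)
        : mStream(stream), mCacheFile(cacheFile), mInputBlobName(inputBlobName)
    {
        mInputCount = samplesCommon::volume(mStream.getDims()) * mStream.getBatchSize();
        CHECK(cudaMalloc(&mDeviceInput, mInputCount * sizeof(float)));
    }
    ~EntropyCalibratorSketch() { CHECK(cudaFree(mDeviceInput)); }

    int getBatchSize() const override { return mStream.getBatchSize(); }

    bool getBatch(void* bindings[], const char* names[], int nbBindings) override
    {
        if (!mStream.next())
            return false;                          //no more calibration data
        CHECK(cudaMemcpy(mDeviceInput, mStream.getBatch(), mInputCount * sizeof(float), cudaMemcpyHostToDevice));
        assert(!strcmp(names[0], mInputBlobName.c_str()));
        bindings[0] = mDeviceInput;                //device pointer for the input tensor
        return true;
    }

    const void* readCalibrationCache(size_t& length) override
    {
        mCache.clear();
        std::ifstream input(mCacheFile, std::ios::binary);
        input >> std::noskipws;
        if (input.good())
            std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(), std::back_inserter(mCache));
        length = mCache.size();
        return length ? mCache.data() : nullptr;   //returning nullptr forces recalibration
    }

    void writeCalibrationCache(const void* cache, size_t length) override
    {
        std::ofstream output(mCacheFile, std::ios::binary);
        output.write(reinterpret_cast<const char*>(cache), length);
    }

private:
    IBatchStream& mStream;
    std::string mCacheFile;
    std::string mInputBlobName;
    size_t mInputCount{0};
    void* mDeviceInput{nullptr};
    std::vector<char> mCache;
};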

(3) The calibration file

The calibration file stores an activation scale for every tensor in the network. The scales are computed from the dynamic ranges produced by the calibration algorithm: scale = abs(max dynamic range) / 127.0f.

The calibration file is named CalibrationTable<NetworkName>, where <NetworkName> is the name of your network, for example mnist. The file is located in the TensorRT-x.x.x.x/data/mnist directory, where x.x.x.x is the installed TensorRT version.

If no CalibrationTable file is found, the builder runs the calibration algorithm again to create one. The contents of the calibration table look like:

TRT-7000-EntropyCalibration2 //TRT-<TensorRT version of the calibration>-<calibration algorithm>
//layer name: floating-point activation scale determined for each tensor in the network during calibration
data: 3c008912
conv1: 3c88edfc
pool1: 3c88edfc
conv2: 3ddc858b
pool2: 3ddc858b
ip1: 3db6bd6e
ip2: 3e691968
prob: 3c010a14

The CalibrationTable file is generated during the build phase while the calibration algorithm runs. Once it has been created, it can be read in subsequent runs without running calibration again. You can provide an implementation of readCalibrationCache() to load the calibration file from a desired location. If the file that is read is compatible with the calibrator type that produced it and with the TensorRT version, the builder skips the calibration step and uses the per-tensor scale values from the file.

  • Configuring the builder
 config->setAvgTimingIterations(1); //set the number of averaging timing iterations
    config->setMinTimingIterations(1);//set the minimum number of timing iterations
    config->setMaxWorkspaceSize(1_GiB);//set the maximum workspace size
    //besides FP32, allow the builder to use kFP16 precision
    if(dataType == DataType::kHALF)
    {
        config->setFlag(BuilderFlag::kFP16);
    }
    //besides FP32, allow the builder to use kINT8 precision
    if(dataType == DataType::kINT8)
    {
        config->setFlag(BuilderFlag::kINT8);
    }
    //set the maximum batch size
    builder->setMaxBatchSize(mParams.batchSize);
   
    if(dataType == DataType::kINT8)
    {
       //create the calibration batch stream
        MNISTBatchStream calibrationStream(mParams.calBatchSize,mParams.nbCalBatches,"train-images-idx3-ubyte",
                                           "train-labels-idx1-ubyte",mParams.dataDirs);
        //create the calibrator
        calibrator.reset(new Int8EntropyCalibrator2(
                             calibrationStream,0,mParams.networkName.c_str(),mParams.inputTensorNames[0].c_str()));
        //pass the calibrator object to the builder config
        config->setInt8Calibrator(calibrator.get());
    }
  • Building the engine
 mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
                builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());
  • Running the engine

The inputs and outputs remain 32-bit floating point.

bool SampleINT8::infer(std::vector<float>& score, int firstScoreBatch, int nbScoreBatches)
{
    float ms{0.0f};
    //allocate the input and output memory buffers
    samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);
    //create the execution context
    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if(!context)
    {
        return false;
    }
    MNISTBatchStream batchStream(mParams.batchSize, nbScoreBatches + firstScoreBatch, "train-images-idx3-ubyte",
                                 "train-labels-idx1-ubyte", mParams.dataDirs);
    batchStream.skip(firstScoreBatch);
    //query the output dimensions
    Dims outputDims = context->getEngine().getBindingDimensions(
                context->getEngine().getBindingIndex(mParams.outputTensorNames[0].c_str()));
    int outputSize = samplesCommon::volume(outputDims);
    int top1{0},top5{0};
    float totalTime{0.0f};
    while(batchStream.next())
    {
        assert(mParams.inputTensorNames.size() == 1);
        //read the input data into the managed buffers
        if(!processInput(buffers,batchStream.getBatch()))
        {
            return false;
        }
        //copy the input data from host to device
        buffers.copyInputToDevice();
        cudaStream_t stream;
        CHECK(cudaStreamCreate(&stream));

        cudaEvent_t start,end;
        CHECK(cudaEventCreateWithFlags(&start, cudaEventBlockingSync));
        CHECK(cudaEventCreateWithFlags(&end, cudaEventBlockingSync));
        cudaEventRecord(start, stream);
        //run inference
 
        bool status = context->enqueue(mParams.batchSize,buffers.getDeviceBindings().data(),stream,nullptr);
        if(!status)
        {
            return false;
        }
        cudaEventRecord(end,stream);
        cudaEventSynchronize(end);
        cudaEventElapsedTime(&ms,start,end);
        cudaEventDestroy(start);
        cudaEventDestroy(end);

        totalTime += ms;
        //copy the outputs from device back to host
        buffers.copyOutputToHost();

        CHECK(cudaStreamDestroy(stream));
        //this sample reports Top-1 and Top-5 metrics for FP32 and INT8,
        //and for FP16 if the hardware supports it natively; the numbers should agree to within 1%.
        top1 += calculateScore(buffers, batchStream.getLabels(), mParams.batchSize, outputSize, 1);
        top5 += calculateScore(buffers, batchStream.getLabels(), mParams.batchSize, outputSize, 5);

        if (batchStream.getBatchesRead() % 100 == 0)
        {
            sample::gLogInfo << "Processing next set of max 100 batches" << std::endl;
        }
    }
    int imagesRead = (batchStream.getBatchesRead() - firstScoreBatch) * mParams.batchSize;
    score[0] = float(top1) / float(imagesRead);
    score[1] = float(top5) / float(imagesRead);
    return true;
}
  • Verifying the output

7.4.2 sample_int8

#include "common/BatchStream.h"
#include "common/EntropyCalibrator.h"
#include "common/argsParser.h"
#include "common/buffers.h"
#include "common/common.h"
#include "common/logger.h"
#include "common/logging.h"

#include "NvCaffeParser.h"
#include "NvInfer.h"
#include 

#include 
#include 
#include 
#include 

const std::string gSampleName = "TensorRT.sample_int8";

struct SampleINT8Params : public samplesCommon::CaffeSampleParams
{
    int nbCalBatches;
    int calBatchSize;
    std::string networkName;
};
class SampleINT8
{
    template <typename T>
    using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;
public:
    SampleINT8(const SampleINT8Params& params)
        : mParams(params)
        , mEngine(nullptr)
    {
        initLibNvInferPlugins(&sample::gLogger.getTRTLogger(), "");
    }

    bool build(DataType dataType);
    bool isSupported(DataType dataType);
    bool infer(std::vector<float>& score, int firstScoreBatch, int nbScoreBatches);
    bool teardown();
private:
    SampleINT8Params mParams;
    std::shared_ptr<nvinfer1::ICudaEngine> mEngine;

    nvinfer1::Dims mInputDims;

    bool constructNetwork(SampleUniquePtr<nvinfer1::IBuilder>& builder,
                          SampleUniquePtr<nvinfer1::INetworkDefinition>& network,
                          SampleUniquePtr<nvinfer1::IBuilderConfig>& config,
                          SampleUniquePtr<nvcaffeparser1::ICaffeParser>& parser,
                          DataType dataType);
    bool processInput(const samplesCommon::BufferManager& buffers,const float* data);
    int calculateScore(const samplesCommon::BufferManager& buffers,float* labels,int batchSize,int outputSize,int threshold);
};
bool SampleINT8::build(DataType dataType)
{
    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
    if(!builder)
    {
        return false;
    }
    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetwork());
    if(!network)
    {
        return false;
    }
    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    if(!config)
    {
        return false;
    }
    auto parser = SampleUniquePtr<nvcaffeparser1::ICaffeParser>(nvcaffeparser1::createCaffeParser());
    if(!parser)
    {
        return false;
    }
    if((dataType == DataType::kINT8 && !builder->platformHasFastInt8())
            || (dataType == DataType::kHALF && !builder->platformHasFastFp16()))
    {
        return false;
    }
    auto constructed = constructNetwork(builder,network,config,parser,dataType);
    if(!constructed)
    {
        return false;
    }
    assert(network->getNbInputs() == 1);
    mInputDims = network->getInput(0)->getDimensions();
    assert(mInputDims.nbDims == 3);

    return true;

}
bool SampleINT8::isSupported(DataType dataType)
{
    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
    if(!builder)
    {
        return false;
    }
    if((dataType == DataType::kINT8 && !builder->platformHasFastInt8())
        || (dataType == DataType::kHALF && !builder->platformHasFastFp16()))
    {
        return false;
    }
    return true;
}
bool SampleINT8::constructNetwork(SampleUniquePtr &builder,
                                  SampleUniquePtr &network,
                                  SampleUniquePtr &config, int &parser, int datattype)
{
    mEngine = nullptr;
    const nvcaffeparser1::IBlobNameToTensor* blobNameToTensor =
            parser->parse(locateFile(mParams.prototxtFileName, mParams.dataDirs).c_str(),
            locateFile(mParams.weightsFileName, mParams.dataDirs).c_str(), *network,
            dataType == DataType::kINT8 ? DataType::kFLOAT : dataType);
    for(auto& s : mParams.outputTensorNames)
    {
        network->markOutput(*blobNameToTensor->find(s.c_str()));
    }
    std::unique_ptr<IInt8Calibrator> calibrator;

    config->setAvgTimingIterations(1);
    config->setMinTimingIterations(1);
    config->setMaxWorkspaceSize(1_GiB);
    if(dataType == DataType::kHALF)
    {
        config->setFlag(BuilderFlag::kFP16);
    }
    if(dataType == DataType::kINT8)
    {
        config->setFlag(BuilderFlag::kINT8);
    }
    builder->setMaxBatchSize(mParams.batchSize);
    if(dataType == DataType::kINT8)
    {
        MNISTBatchStream calibrationStream(mParams.calBatchSize,mParams.nbCalBatches,"train-images-idx3-ubyte",
                                           "train-labels-idx1-ubyte",mParams.dataDirs);
        calibrator.reset(new Int8EntropyCalibrator2<MNISTBatchStream>(
                             calibrationStream, 0, mParams.networkName.c_str(), mParams.inputTensorNames[0].c_str()));
        config->setInt8Calibrator(calibrator.get());
    }
    if(mParams.dlaCore >= 0)
    {
        samplesCommon::enableDLA(builder.get(),config.get(),mParams.dlaCore);
        if(mParams.batchSize > builder->getMaxDLABatchSize())
        {
            sample::gLogError << "Requested batch size "<getMaxDLABatchSize()
                              << ". Reducing batch size accordingly."<(
                builder->buildEngineWithConfig(*network,*config),samplesCommon::InferDeleter());
    if(!mEngine)
    {
        return false;
    }
    return true;
}
bool SampleINT8::infer(std::vector<float>& score, int firstScoreBatch, int nbScoreBatches)
{
    float ms{0.0f};
    samplesCommon::BufferManager buffers(mEngine,mParams.batchSize);
    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    if(!context)
    {
        return false;
    }
    MNISTBatchStream batchStream(mParams.batchSize, nbScoreBatches + firstScoreBatch, "train-images-idx3-ubyte",
                                 "train-labels-idx1-ubyte", mParams.dataDirs);
    batchStream.skip(firstScoreBatch);

    Dims outputDims = context->getEngine().getBindingDimensions(
                context->getEngine().getBindingIndex(mParams.outputTensorNames[0].c_str()));
    int outputSize = samplesCommon::volume(outputDims);
    int top1{0},top5{0};
    float totalTime{0.0f};
    while(batchStream.next())
    {
        assert(mParams.inputTensorNames.size() == 1);
        if(!processInput(buffers,batchStream.getBatch()))
        {
            return false;
        }
        buffers.copyInputToDevice();
        cudaStream_t stream;
        CHECK(cudaStreamCreate(&stream));

        cudaEvent_t start,end;
        CHECK(cudaEventCreateWithFlags(&start,cudaEventBlockingSync));
        CHECK(cudaEventCreateWithFlags(&end, cudaEventBlockingSync));
        cudaEventRecord(start,stream);

        bool status = context->enqueue(mParams.batchSize,buffers.getDeviceBindings().data(),stream,nullptr);
        if(!status)
        {
            return false;
        }
        cudaEventRecord(end,stream);
        cudaEventSynchronize(end);
        cudaEventElapsedTime(&ms,start,end);
        cudaEventDestroy(start);
        cudaEventDestroy(end);

        totalTime += ms;

        buffers.copyOutputToHost();

        CHECK(cudaStreamDestroy(stream));

        top1 += calculateScore(buffers,batchStream.getLabels(),mParams.batchSize,outputSize,1);
        top5 += calculateScore(buffers, batchStream.getLabels(), mParams.batchSize, outputSize, 5);

        if(batchStream.getBatchesRead() % 100 == 0)
        {
            sample::gLogInfo << "Processing next set of max 100 batches" << std::endl;
        }
    }

    int imagesRead = (batchStream.getBatchesRead() - firstScoreBatch) * mParams.batchSize;
    score[0] = float(top1) / float(imagesRead);
    score[1] = float(top5) / float(imagesRead);

    sample::gLogInfo << "Top1: " << score[0] << ", Top5: " << score[1] << std::endl;
    sample::gLogInfo << "Processing " << imagesRead << " images averaged " << totalTime / imagesRead
                     << " ms/image and " << totalTime / batchStream.getBatchesRead() << " ms/batch." << std::endl;

    return true;
}
bool SampleINT8::teardown()
{
    nvcaffeparser1::shutdownProtobufLibrary();
    return true;
}
bool SampleINT8::processInput(const samplesCommon::BufferManager& buffers, const float* data)
{
    // Fill the host input buffer with the current batch.
    float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer(mParams.inputTensorNames[0]));
    std::memcpy(hostDataBuffer, data, mParams.batchSize * samplesCommon::volume(mInputDims) * sizeof(float));
    return true;
}
int SampleINT8::calculateScore(
        const samplesCommon::BufferManager& buffers,float* labels,int batchSize,int outputSize,int threshold)
{
    float* probs = static_cast<float*>(buffers.getHostBuffer(mParams.outputTensorNames[0]));

    int success = 0;
    for(int i = 0; i < batchSize; i++)
    {
        float* prob = probs + outputSize * i;
        float correct = prob[(int) labels[i]];

        int better = 0;
        for(int d = 0; d < outputSize; d++)
        {
            if(prob[d] >= correct)
            {
                better++;
            }
        }
        if(better <= threshold)
        {
            success++;
        }
    }
    return success;
}
SampleINT8Params initializeSampleParams(const samplesCommon::Args& args,int batchSize)
{
    SampleINT8Params params;
    params.dataDirs = args.dataDirs;
    params.dataDirs.emplace_back("data/");

    params.batchSize = batchSize;
    params.dlaCore = args.useDLACore;
    params.nbCalBatches = 10;
    params.calBatchSize = 50;
    params.inputTensorNames.push_back("data");
    params.outputTensorNames.push_back("prob");
    params.prototxtFileName = "deploy.prototxt";
    params.weightsFileName = "mnist_lenet.caffemodel";
    params.networkName = "mnist";
    return params;
}
void printHelpInfo()
{
    std::cout << "Usage: ./sample_int8 [-h or --help] [-d or --datadir=] "
                 "[--useDLACore=]"
              << std::endl;
    std::cout << "--help, -h      Display help information" << std::endl;
    std::cout << "--datadir       Specify path to a data directory, overriding the default. This option can be used "
                 "multiple times to add multiple directories."
              << std::endl;
    std::cout << "--useDLACore=N  Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, "
                 "where n is the number of DLA engines on the platform."
              << std::endl;
    std::cout << "batch=N         Set batch size (default = 32)." << std::endl;
    std::cout << "start=N         Set the first batch to be scored (default = 16). All batches before this batch will "
                 "be used for calibration."
              << std::endl;
    std::cout << "score=N         Set the number of batches to be scored (default = 1800)." << std::endl;
}
int main(int argc,char** argv)
{
    if(argc >= 2 && (!strncmp(argv[1],"--help",6) || !strncmp(argv[1],"-h",2)))
    {
        printHelpInfo();
        return EXIT_SUCCESS;
    }
    int batchSize = 32;
    int firstScoreBatch = 16;
    int nbScoreBatches = 1800;

    // Parse the extra arguments (batch=, start=, score=) accepted by this sample.
    for(int i = 1; i < argc; ++i)
    {
        if(!strncmp(argv[i], "batch=", 6))
        {
            batchSize = atoi(argv[i] + 6);
        }
        else if(!strncmp(argv[i], "start=", 6))
        {
            firstScoreBatch = atoi(argv[i] + 6);
        }
        else if(!strncmp(argv[i], "score=", 6))
        {
            nbScoreBatches = atoi(argv[i] + 6);
        }
    }
    if((firstScoreBatch + nbScoreBatches) * batchSize > 60000)
    {
        sample::gLogError << "Only 60000 images available" << std::endl;
        return EXIT_FAILURE;
    }
    samplesCommon::Args args;
    samplesCommon::parseArgs(args, argc, argv);

    SampleINT8 sample(initializeSampleParams(args, batchSize));

    auto sampleTest = sample::gLogger.defineTest(gSampleName, argc, argv);

    sample::gLogger.reportTestStart(sampleTest);

    sample::gLogInfo << "Building and running a GPU inference engine for INT8 sample" << std::endl;

    std::vector<std::string> dataTypeNames = {"FP32", "FP16", "INT8"};
    std::vector<std::string> topNames = {"Top1", "Top5"};
    std::vector<DataType> dataTypes = {DataType::kFLOAT, DataType::kHALF, DataType::kINT8};
    std::vector<std::vector<float>> scores(3, std::vector<float>(2, 0.0f));
    for (size_t i = 0; i < dataTypes.size(); i++)
    {
        sample::gLogInfo << dataTypeNames[i] << " run:" << nbScoreBatches << " batches of size " << batchSize
                         << " starting at " << firstScoreBatch << std::endl;

        if (!sample.build(dataTypes[i]))
        {
            if (!sample.isSupported(dataTypes[i]))
            {
                sample::gLogWarning << "Skipping " << dataTypeNames[i]
                                    << " since the platform does not support this data type." << std::endl;
                continue;
            }
            return sample::gLogger.reportFail(sampleTest);
        }
        if (!sample.infer(scores[i], firstScoreBatch, nbScoreBatches))
        {
            return sample::gLogger.reportFail(sampleTest);
        }
    }

    auto isApproximatelyEqual = [](float a, float b, double tolerance) { return (std::abs(a - b) <= tolerance); };
    const double tolerance{0.01};
    const double goldenMNIST{0.99};

    if ((scores[0][0] < goldenMNIST) || (scores[0][1] < goldenMNIST))
    {
        sample::gLogError << "FP32 accuracy is less than 99%: Top1 = " << scores[0][0] << ", Top5 = " << scores[0][1]
                          << "." << std::endl;
        return sample::gLogger.reportFail(sampleTest);
    }

    for (unsigned i = 0; i < topNames.size(); i++)
    {
        for (unsigned j = 1; j < dataTypes.size(); j++)
        {
            if (scores[j][i] != 0.0f && !isApproximatelyEqual(scores[0][i], scores[j][i], tolerance))
            {
                sample::gLogError << "FP32(" << scores[0][i] << ") and " << dataTypeNames[j] << "(" << scores[j][i]
                                  << ") " << topNames[i] << " accuracy differ by more than " << tolerance << "."
                                  << std::endl;
                return sample::gLogger.reportFail(sampleTest);
            }
        }
    }

    if (!sample.teardown())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    return sample::gLogger.reportPass(sampleTest);

}

7.6 Dynamic Shapes and Digit Recognition

This example demonstrates how to use dynamic input dimensions in TensorRT by building an engine that can

  1. accept inputs of different shapes,
  2. resize the input, and
  3. convert it into the shape the prediction model expects.

7.6.1 Creating the preprocessor and prediction networks

  1. Create the preprocessor network

Create a network with full-dims support:

 auto preprocessorNetwork = makeUnique(
                builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    

Add an input layer that accepts dynamic shapes, plus a resize layer that converts the input to the shape the prediction model expects:

auto input = preprocessorNetwork->addInput("input",nvinfer1::DataType::kFLOAT,Dims4{-1,1,-1,-1});
    auto resizeLayer = preprocessorNetwork->addResize(*input);
    resizeLayer->setOutputDimensions(mPredictionInputDims);
    preprocessorNetwork->markOutput(*resizeLayer->getOutput(0));
  • -1 means that the dimension is supplied at runtime (see the sketch below).
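As a minimal sketch (assuming the preprocessor execution context created later in this sample, and a hypothetical 40x40 input), the wildcard dimensions are fixed at inference time by binding a concrete shape before execution:

// Hypothetical 1x1x40x40 input; the actual sample derives the shape from the loaded PGM file.
mPreprocessorContext->setBindingDimensions(0, Dims4{1, 1, 40, 40});
// executeV2 may only be called once all dynamic input dimensions have been specified.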
  2. Parse the ONNX MNIST model

Create a full-dims network and a parser:

const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = makeUnique(builder->createNetworkV2(explicitBatch));
    if(!network)
    {
        sample::gLogError << "Create network failed." << std::endl;
        return false;
    }

Parse the model file to populate the network:

bool parsingSuccess = parser->parseFromFile(locateFile(mParams.onnxFileName, mParams.dataDirs).c_str(),
                                                static_cast<int>(sample::gLogger.getReportableSeverity()));
    if(!parsingSuccess)
    {
        sample::gLogError<<"Failed to parse model."<
  3. Create the engines

When building the preprocessor engine, also provide an optimization profile so that TensorRT knows which input shapes to optimize for:

 auto preprocessorConfig = makeUnique(builder->createBuilderConfig());
    if(!preprocessorConfig)
    {
        sample::gLogError<<"Create builder config failed."<createOptimizationProfile();
 profile->setDimensions(input->getName(),OptProfileSelector::kMIN,Dims4{1,1,1,1});//指定配置文件最小维度
    profile->setDimensions(input->getName(),OptProfileSelector::kOPT,Dims4{1,1,28,28}); //指定配置文件优化的维度
    profile->setDimensions(input->getName(),OptProfileSelector::kMAX,Dims4{1,1,56,56});//指定配置文件最大维度
    preprocessorConfig->addOptimizationProfile(profile);

Create a separate profile for calibration:

auto profileCalib = builder->createOptimizationProfile();
    const int calibBatchSize{256};

    profileCalib->setDimensions(input->getName(),OptProfileSelector::kMIN,Dims4{calibBatchSize,1,28,28});
    profileCalib->setDimensions(input->getName(),OptProfileSelector::kOPT,Dims4{calibBatchSize,1,28,28});
    profileCalib->setDimensions(input->getName(),OptProfileSelector::kMAX,Dims4{calibBatchSize,1,28,28});
    preprocessorConfig->setCalibrationProfile(profileCalib);

If running in INT8 mode, create and configure the INT8 calibrator:

 std::unique_ptr<IInt8Calibrator> calibrator;
    if(mParams.int8)
    {
        preprocessorConfig->setFlag(BuilderFlag::kINT8);
        const int nCalibBatches{10};
        MNISTBatchStream calibrationStream(
            calibBatchSize, nCalibBatches, "train-images-idx3-ubyte", "train-labels-idx1-ubyte", mParams.dataDirs);
        calibrator.reset(
            new Int8EntropyCalibrator2<MNISTBatchStream>(calibrationStream, 0, "MNISTPreprocessor", "input"));
        preprocessorConfig->setInt8Calibrator(calibrator.get());
    }

Build the preprocessor engine from the config:

 mPreprocessorEngine = makeUnique(builder->buildEngineWithConfig(*preprocessorNetwork,*preprocessorConfig));
    if(!mPreprocessorEngine)
    {
        sample::gLogError << "Preprocessor engine build failed."<

For the MNIST prediction engine, attach a Softmax layer to the end of the network, set the Softmax axis to 1 (the network output has shape [1, 10] in full-dims mode), and replace the existing network output with the Softmax output:

auto softmax = network->addSoftMax(*network->getOutput(0));
    softmax->setAxes(1<<1);
    network->unmarkOutput(*network->getOutput(0));
    network->markOutput(*softmax->getOutput(0));

Set up the calibration profile and the INT8 calibrator for the prediction engine in the same way as above (a sketch is shown below).

const int calibBatchSize{1}; // For the prediction engine, calibBatchSize is set to 1 because the ONNX model uses an explicit batch dimension.
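A minimal sketch of that setup, mirroring the full listing in 7.6.2 below (the input tensor name is taken from mParams.inputTensorNames[0]):

auto profileCalib = builder->createOptimizationProfile();
const auto inputName = mParams.inputTensorNames[0].c_str();
profileCalib->setDimensions(inputName, OptProfileSelector::kMIN, Dims4{calibBatchSize, 1, 28, 28});
profileCalib->setDimensions(inputName, OptProfileSelector::kOPT, Dims4{calibBatchSize, 1, 28, 28});
profileCalib->setDimensions(inputName, OptProfileSelector::kMAX, Dims4{calibBatchSize, 1, 28, 28});
config->setCalibrationProfile(profileCalib);

std::unique_ptr<IInt8Calibrator> calibrator;
if (mParams.int8)
{
    config->setFlag(BuilderFlag::kINT8);
    MNISTBatchStream calibrationStream(
        calibBatchSize, 10, "train-images-idx3-ubyte", "train-labels-idx1-ubyte", mParams.dataDirs);
    calibrator.reset(new Int8EntropyCalibrator2<MNISTBatchStream>(calibrationStream, 0, "MNISTPrediction", inputName));
    config->setInt8Calibrator(calibrator.get());
}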

Build the prediction engine from the config:

 mPredictionEngine = makeUnique(builder->buildEngineWithConfig(*network, *config));
    if (!mPredictionEngine)
    {
        sample::gLogError << "Prediction engine build failed." << std::endl;
        return false;
    }
  4. Inference

CHECK(cudaMemcpy(mInput.deviceBuffer.data(), mInput.hostBuffer.data(), mInput.hostBuffer.nbBytes(), cudaMemcpyHostToDevice)); // copy the input buffer to the device
CHECK_RETURN_W_MSG(mPreprocessorContext->setBindingDimensions(0, inputDims), false, "Invalid binding dimensions."); // bind the concrete input shape to the execution context
// Inference can only run once all dynamic input dimensions have been specified.
if (!mPreprocessorContext->allInputDimensionsSpecified())
    {
        return false;
    }
    // Run the preprocessor network to resize the input data to the shape the prediction model expects
    std::vector<void*> preprocessorBindings = {mInput.deviceBuffer.data(), mPredictionInput.data()};
    // For engines using full dims, we can use executeV2, which does not include a separate batch size parameter.
    bool status = mPreprocessorContext->executeV2(preprocessorBindings.data());
    if (!status)
    {
        return  false;
    } 
 std::vector<void*> predictionBindings = {mPredictionInput.data(), mOutput.deviceBuffer.data()};
    status = mPredictionContext->executeV2(predictionBindings.data());
    if(!status)
    {
        return false;
    }
    CHECK(cudaMemcpy(mOutput.hostBuffer.data(),mOutput.deviceBuffer.data(),mOutput.deviceBuffer.nbBytes(),cudaMemcpyDeviceToHost));
    return validateOutput(digit);
    

7.6.2 code

#include "common/BatchStream.h"
#include "common/EntropyCalibrator.h"
#include "common/argsParser.h"
#include "common/buffers.h"
#include "common/common.h"
#include "common/logger.h"
#include "common/logging.h"
#include "common/parserOnnxConfig.h"

#include "NvInfer.h"
#include <cuda_runtime_api.h>

#include <algorithm>
#include <cmath>
#include <iomanip>
#include <random>

const std::string gSampleName = "TensorRT.sample_dynamic_reshape";

class SamplesDynamicReshape
{
    template <typename T>
    using SampleUniquePtr = std::unique_ptr<T, samplesCommon::InferDeleter>;
public:
    SamplesDynamicReshape(const samplesCommon::OnnxSampleParams& params)
        :mParams(params)
    {

    }
    bool build();
    bool prepare();
    bool infer();
private:
    bool buildPreprocessorEngine(const SampleUniquePtr<nvinfer1::IBuilder>& builder);
    bool buildPredictionEngine(const SampleUniquePtr<nvinfer1::IBuilder>& builder);

    Dims loadPGMFile(const std::string& fileName);
    bool validateOutput(int digit);

    samplesCommon::OnnxSampleParams mParams;

    nvinfer1::Dims mPredictionInputDims;
    nvinfer1::Dims mPredictionOutputDims;

    SampleUniquePtr<nvinfer1::ICudaEngine> mPreprocessorEngine{nullptr}, mPredictionEngine{nullptr};

    SampleUniquePtr<nvinfer1::IExecutionContext> mPreprocessorContext{nullptr}, mPredictionContext{nullptr};

    samplesCommon::ManagedBuffer mInput{};
    samplesCommon::DeviceBuffer mPredictionInput{};

    samplesCommon::ManagedBuffer mOutput{};

    template <typename T>
    SampleUniquePtr makeUnique(T* t)
    {
        return SampleUniquePtr{t};
    }

};
bool SamplesDynamicReshape::build()
{
    auto builder = makeUnique(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
    if(!builder)
    {
        sample::gLogError <<"Create inference builder failed."< &builder)
{
    auto preprocessorNetwork = makeUnique(
                builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    if(!preprocessorNetwork)
    {
        sample::gLogError <<"Create network failed. "<addInput("input",nvinfer1::DataType::kFLOAT,Dims4{-1,1,-1,-1});
    auto resizeLayer = preprocessorNetwork->addResize(*input);
    resizeLayer->setOutputDimensions(mPredictionInputDims);
    preprocessorNetwork->markOutput(*resizeLayer->getOutput(0));

    auto preprocessorConfig = makeUnique(builder->createBuilderConfig());
    if(!preprocessorConfig)
    {
        sample::gLogError<<"Create builder config failed."<createOptimizationProfile();
    profile->setDimensions(input->getName(),OptProfileSelector::kMIN,Dims4{1,1,1,1});
    profile->setDimensions(input->getName(),OptProfileSelector::kOPT,Dims4{1,1,28,28});
    profile->setDimensions(input->getName(),OptProfileSelector::kMAX,Dims4{1,1,56,56});
    preprocessorConfig->addOptimizationProfile(profile);

    auto profileCalib = builder->createOptimizationProfile();
    const int calibBatchSize{256};

    profileCalib->setDimensions(input->getName(),OptProfileSelector::kMIN,Dims4{calibBatchSize,1,28,28});
    profileCalib->setDimensions(input->getName(),OptProfileSelector::kOPT,Dims4{calibBatchSize,1,28,28});
    profileCalib->setDimensions(input->getName(),OptProfileSelector::kMAX,Dims4{calibBatchSize,1,28,28});
    preprocessorConfig->setCalibrationProfile(profileCalib);

    std::unique_ptr<IInt8Calibrator> calibrator;
    if(mParams.int8)
    {
        preprocessorConfig->setFlag(BuilderFlag::kINT8);
        const int nCalibBatches{10};
        MNISTBatchStream calibrationStream(
            calibBatchSize, nCalibBatches, "train-images-idx3-ubyte", "train-labels-idx1-ubyte", mParams.dataDirs);
        calibrator.reset(
            new Int8EntropyCalibrator2<MNISTBatchStream>(calibrationStream, 0, "MNISTPreprocessor", "input"));
        preprocessorConfig->setInt8Calibrator(calibrator.get());
    }
    mPreprocessorEngine = makeUnique(builder->buildEngineWithConfig(*preprocessorNetwork,*preprocessorConfig));
    if(!mPreprocessorEngine)
    {
        sample::gLogError << "Preprocessor engine build failed."<parseFromFile(locateFile(mParams.onnxFileName,mParams.dataDirs).c_str(),
                                                static_cast(sample::gLogger.getReportableSeverity()));
    if(!parsingSuccess)
    {
        sample::gLogError<<"Failed to parse model."<addSoftMax(*network->getOutput(0));
    softmax->setAxes(1<<1);
    network->unmarkOutput(*network->getOutput(0));
    network->markOutput(*softmax->getOutput(0));

    mPredictionInputDims = network->getInput(0)->getDimensions();
    mPredictionOutputDims = network->getOutput(0)->getDimensions();

    auto config = makeUnique(builder->createBuilderConfig());
    if(!config)
    {
        sample::gLogError<<"Create builder config failed."<setMaxWorkspaceSize(16_MiB);
    if(mParams.fp16)
    {
        config->setFlag(BuilderFlag::kFP16);
    }
    auto profileCalib = builder->createOptimizationProfile();
    const auto inputName = mParams.inputTensorNames[0].c_str();
    const int calibBatchSize{1};
    profileCalib->setDimensions(inputName, OptProfileSelector::kMIN, Dims4{calibBatchSize, 1, 28, 28});
    profileCalib->setDimensions(inputName, OptProfileSelector::kOPT, Dims4{calibBatchSize, 1, 28, 28});
    profileCalib->setDimensions(inputName, OptProfileSelector::kMAX, Dims4{calibBatchSize, 1, 28, 28});
    config->setCalibrationProfile(profileCalib);

    std::unique_ptr<IInt8Calibrator> calibrator;
    if (mParams.int8)
    {
        config->setFlag(BuilderFlag::kINT8);
        int nCalibBatches{10};
        MNISTBatchStream calibrationStream(
            calibBatchSize, nCalibBatches, "train-images-idx3-ubyte", "train-labels-idx1-ubyte", mParams.dataDirs);
        calibrator.reset(
            new Int8EntropyCalibrator2<MNISTBatchStream>(calibrationStream, 0, "MNISTPrediction", inputName));
        config->setInt8Calibrator(calibrator.get());
    }

    mPredictionEngine = makeUnique(builder->buildEngineWithConfig(*network, *config));
    if (!mPredictionEngine)
    {
        sample::gLogError << "Prediction engine build failed." << std::endl;
        return false;
    }
    return true;

}
bool SamplesDynamicReshape::prepare()
{
    mPreprocessorContext = makeUnique(mPreprocessorEngine->createExecutionContext());
    if(!mPreprocessorContext)
    {
        sample::gLogError<<"Preprocessor context build failed."<createExecutionContext());
    if(!mPredictionContext)
    {
        sample::gLogError<<"Prediction contect build failed."< digitDistribution{0,9};
    int digit = digitDistribution(generator);

    Dims inputDims = loadPGMFile(locateFile(std::to_string(digit)+".pgm",mParams.dataDirs));
    mInput.deviceBuffer.resize(inputDims);
    CHECK(cudaMemcpy(
              mInput.deviceBuffer.data(),mInput.hostBuffer.data(),mInput.hostBuffer.nbBytes(),cudaMemcpyHostToDevice));
    CHECK_RETURN_W_MSG(mPreprocessorContext->setBindingDimensions(0,inputDims),false,"Invalid binding dimensions.");

    // We can only run inference once all dynamic input shapes have been specified.
    if (!mPreprocessorContext->allInputDimensionsSpecified())
    {
        return false;
    }

    // Run the preprocessor to resize the input to the correct shape
    std::vector preprocessorBindings = {mInput.deviceBuffer.data(), mPredictionInput.data()};
    // For engines using full dims, we can use executeV2, which does not include a separate batch size parameter.
    bool status = mPreprocessorContext->executeV2(preprocessorBindings.data());
    if (!status)
    {
        return  false;
    }
    std::vector predictionBindings = {mPredictionInput.data(),mOutput.deviceBuffer.data()};
    status = mPredictionContext->executeV2(predictionBindings.data());
    if(!status)
    {
        return false;
    }
    CHECK(cudaMemcpy(mOutput.hostBuffer.data(),mOutput.deviceBuffer.data(),mOutput.deviceBuffer.nbBytes(),cudaMemcpyDeviceToHost));
    return validateOutput(digit);

}
Dims SamplesDynamicReshape::loadPGMFile(const std::string& fileName)
{
    std::ifstream infile(fileName,std::ifstream::binary);
    assert(infile.is_open() && " Attempting to read from a file that is not open.");

    std::string magic;
    int h,w,max;
    infile>>magic>>h>>w>>max;
    infile.seekg(1,infile.cur);
    Dims4 inputDims{1,1,h,w};
    size_t vol = samplesCommon::volume(inputDims);
    std::vector<uint8_t> fileData(vol);
    infile.read(reinterpret_cast<char*>(fileData.data()), vol);

    sample::gLogInfo <<"Input: \n";
    for(size_t i=0;i(mInput.hostBuffer.data());
    std::transform(fileData.begin(),fileData.end(),hostDataBuffer,
                   [](uint8_t x){return 1.0-static_cast(x/255.0);});
    return inputDims;
}
bool SamplesDynamicReshape::validateOutput(int digit)
{
    const float* buffRaw = static_cast<const float*>(mOutput.hostBuffer.data());
    std::vector<float> prob(buffRaw, buffRaw + mOutput.hostBuffer.size());

    int curIndex{0};
    for(const auto&elem : prob)
    {
        sample::gLogInfo <<"Prob "<]"
              << std::endl;
    std::cout << "--help, -h      Display help information" << std::endl;
    std::cout << "--datadir       Specify path to a data directory, overriding the default. This option can be used "
                 "multiple times to add multiple directories. If no data directories are given, the default is to use "
                 "(data/samples/mnist/, data/mnist/)"
              << std::endl;
    std::cout << "--int8          Run in Int8 mode." << std::endl;
    std::cout << "--fp16          Run in FP16 mode." << std::endl;
}
int main(int argc,char** argv)
{
    samplesCommon::Args args;
    bool argsOK = samplesCommon::parseArgs(args, argc, argv);
    if (!argsOK)
    {
        sample::gLogError << "Invalid arguments" << std::endl;
        printHelpInfo();
        return EXIT_FAILURE;
    }
    if (args.help)
    {
        printHelpInfo();
        return EXIT_SUCCESS;
    }
    auto sampleTest = sample::gLogger.defineTest(gSampleName, argc, argv);

    sample::gLogger.reportTestStart(sampleTest);

    SamplesDynamicReshape sample{initializeSampleParams(args)};

    if (!sample.build())
    {
        return sample::gLogger.reportFail(sampleTest);
    }
    if (!sample.prepare())
    {
        return sample::gLogger.reportFail(sampleTest);
    }
    if (!sample.infer())
    {
        return sample::gLogger.reportFail(sampleTest);
    }

    return sample::gLogger.reportPass(sampleTest);

}

7.7 Specifying reformat-free I/O with the API

This demo shows how to use the API to explicitly request reformat-free I/O for Float16 and INT8 precision, using the formats

  • TensorFormat::kLINEAR
  • TensorFormat::kCHW2
  • TensorFormat::kHWC8
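As a minimal sketch (assuming an already parsed INetworkDefinition* network and an IBuilderConfig* config; this is not the full sample), reformat-free I/O is requested by fixing each I/O tensor's data type and allowed format and then building with strict types:

// Sketch only: request FP16 reformat-free I/O on the network's first input and output.
network->getInput(0)->setType(DataType::kHALF);
network->getInput(0)->setAllowedFormats(1U << static_cast<int>(TensorFormat::kHWC8));
network->getOutput(0)->setType(DataType::kHALF);
network->getOutput(0)->setAllowedFormats(1U << static_cast<int>(TensorFormat::kLINEAR));
// kSTRICT_TYPES keeps TensorRT from silently falling back to other precisions/formats.
config->setFlag(BuilderFlag::kFP16);
config->setFlag(BuilderFlag::kSTRICT_TYPES);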
