NX_README
developer-guide:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html
Kit contents:
NVIDIA Jetson Xavier NX module and carrier board
19 V power adapter
802.11 wireless NIC and Bluetooth module (mounted under the carrier board)
Quick start guide
You also need to prepare:
microSD card (larger than 16 GB; a high-speed card of 64 GB or more is recommended)
A display with a DP or HDMI input
USB keyboard and mouse
①: Download the SD card image
https://developer.nvidia.com/jetson-nx-developer-kit-sd-card-image
②: Download the flashing tool
http://file.ncnynl.com/rpi/Win32DiskImager-0.9.5-install.exe
and use it to flash the image to the SD card.
③: Insert the flashed SD card into the NX card slot, then power on and test.
My NX developer kit is flashed with JetPack 4.4.0.
1. Driver version: head -n 1 /etc/nv_tegra_release
# R32 (release), REVISION: 4.2, GCID: 20074772, BOARD: t186ref, EABI: aarch64, DATE: Thu Apr 9 01:26:40 UTC 2020
2. Kernel version: uname -r
4.9.140-tegra
3. Operating system: lsb_release -i -r
Distributor ID: Ubuntu
Release: 18.04
4. CUDA version: nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_21:14:42_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
5. cuDNN version: dpkg -l libcudnn8
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==================-==============-==============-=========================================
ii libcudnn8 8.0.0.145-1+cu arm64 cuDNN runtime libraries
6. OpenCV version: dpkg -l libopencv
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==================-==============-==============-=========================================
ii libopencv 4.1.1-2-gd5a58 arm64 Open Computer Vision Library
7. TensorRT version: dpkg -l tensorrt
dpkg -l | grep TensorRT
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==================-==============-==============-=========================================
ii tensorrt 7.1.0.16-1+cud arm64 Meta package of TensorRT
Configure the environment (append the following to ~/.bashrc):
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda
source ~/.bashrc
Check nvcc:
nvcc -V
The NX developer kit ships with Python 2.7 preinstalled. To install Python 3, run in a terminal:
sudo apt-get install python3-pip python3-dev
# Then upgrade pip to the latest version
python3 -m pip install --upgrade pip    # upgrade pip
Install jtop to monitor memory/CPU/GPU usage:
sudo pip install jetson-stats
sudo systemctl restart jetson_stats.service
sudo jtop
The Xavier NX fan is governed by an automatic temperature/speed control algorithm in the kernel: the fan turns on at roughly 40 °C and turns off again once the core temperature drops below about 39 °C.
To set the fan speed manually:
sudo sh -c 'echo 140 > /sys/devices/pwm-fan/target_pwm'
The number 140 in the command is the fan's PWM duty cycle. The range is 0~255: 0 means the fan is completely stopped and 255 means full speed. In everyday use I prefer a duty cycle of 100~150, i.e. roughly 40%~60%: any lower and the fan cannot cool effectively, any higher and the noise approaches a desktop PC and becomes annoying. Unless you are doing heavy compilation or running a large network that saturates the board, a duty cycle of 255 is rarely needed.
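Before changing the duty cycle you can check the current target and the core temperatures (sysfs paths as found on the stock JetPack 4.4 image; verify them on your own board):
cat /sys/devices/pwm-fan/target_pwm          # current PWM target (0~255)
cat /sys/class/thermal/thermal_zone*/temp    # zone temperatures in millidegrees Celsius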
Switching APT sources:
Note that this is an aarch64 Ubuntu 18.04 LTS system, so you must use a mirror that matches this architecture (ubuntu-ports).
To switch to the Tsinghua mirror, first back up the original sources.list file:
sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak   # back up sources.list first so a mistake can be undone
sudo gedit /etc/apt/sources.list
Then delete everything in the file, paste the following into sources.list, and save:
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic main multiverse restricted universe
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-security main multiverse restricted universe
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-updates main multiverse restricted universe
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-backports main multiverse restricted universe
deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic main multiverse restricted universe
deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-security main multiverse restricted universe
deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-updates main multiverse restricted universe
deb-src http://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ bionic-backports main multiverse restricted universe
Finally, run:
sudo apt-get update
Before installing TensorFlow, be sure to find the TensorFlow build that matches the JetPack and Python versions on your board.
It is best to check this on the NVIDIA developer forums first, because with a mismatched version the install may succeed but TensorFlow will not work.
Official answer for Python 3.6 + JetPack 4.4:
https://forums.developer.nvidia.com/t/official-tensorflow-for-jetson-agx-xavier/65523
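For reference, the thread above installs the NVIDIA-built wheel roughly as follows (v44 in the index URL corresponds to JetPack 4.4; double-check the exact command and version pin in the thread before running it):
sudo pip3 install --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v44 tensorflow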
Most USB cameras do not ship with native Linux support, but Ubuntu's built-in UVC camera driver can be activated by installing cheese with one simple command:
sudo apt-get install cheese
After installation, run
cheese
to open the USB camera. At this point the camera is plug-and-play.
To check the device number of the currently connected camera, run
ls /dev/video*
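Once the camera shows up as /dev/video0, a minimal OpenCV C++ check is enough to confirm it works (a sketch assuming the preinstalled OpenCV 4.1.1 and a camera at index 0):
// capture_test.cpp: grab frames from /dev/video0 and display them
#include <iostream>
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap(0);              // index 0 == /dev/video0
    if (!cap.isOpened()) {
        std::cerr << "failed to open camera" << std::endl;
        return -1;
    }
    cv::Mat frame;
    while (cap.read(frame)) {
        cv::imshow("usb camera", frame);
        if (cv::waitKey(1) == 27)         // Esc quits
            break;
    }
    return 0;
}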
Reference:
https://blog.csdn.net/zbb297918657/article/details/106432773
Official:
GitHub repository for the jetson-inference project:
jetson-inference download: https://github.com/dusty-nv/jetson-inference
JetPack 4.2 notes: using Python 3 on a Jetson TX2 to run a TensorFlow model through TensorRT:
https://blog.csdn.net/weixin_43842032/article/details/88753724
TensorRT is installed at:
/usr/src/tensorrt
If you need plugins, you have to build from source and add the plugin when converting the model to an engine:
https://blog.csdn.net/u012614287/article/details/81537743
For an ordinary model, converting to TRT does not require building TensorRT from source on the NX; see the trtexec example below.
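If the model has no custom plugins, the trtexec tool bundled with JetPack can build the engine directly on the NX, for example (model.onnx / model.trt are placeholder file names):
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model.trt --fp16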
If you hit a linker error like:
CMakeFiles/traffic_det_reg_caffe_trt.dir/src/TrafficDetection.cpp.o:
In function `onnxToTRTModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, nvinfer1::ICudaEngine*&, int const&)':
TrafficDetection.cpp:(.text+0x1357): undefined reference to `createNvOnnxParser_INTERNAL'
add the ONNX parser to the link libraries in CMakeLists.txt:
#/home/name/TensorRT-7.1.3.4/lib/libnvinfer.so
#/home/name/TensorRT-7.1.3.4/lib/libnvinfer_plugin.so
#/home/name/TensorRT-7.1.3.4/lib/libnvparsers.so
-lnvonnxparser
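A minimal CMake sketch of that link step (the target name is taken from the error above; on the NX the stock libraries live under /usr/lib/aarch64-linux-gnu, so the explicit TensorRT-7.1.3.4 paths are only needed for a separately downloaded TensorRT):
# link TensorRT core, plugin, parser and ONNX parser libraries
target_link_libraries(traffic_det_reg_caffe_trt
    nvinfer
    nvinfer_plugin
    nvparsers
    nvonnxparser
    cudart)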
The core strength of the Jetson Xavier NX is its machine inference performance.
Besides the CPU and GPU, the Jetson Xavier NX also contains a DLA (Deep Learning Accelerator) and a PVA (Programmable Vision Accelerator). The combination of the Volta GPU and the DLA cores gives this low-power platform considerable processing capability.
To showcase the system's machine learning inference capability, NVIDIA provides a large set of SDKs and hand-tuned frameworks for the Jetson platform, doing much of the heavy preparation up front so that developers can make full use of the DLA units alongside the GPU.
Layers that the DLA does not support fall back to the GPU. Overall this saves GPU resources, but running a model on the DLA is roughly twice as slow.
// Sample: enableDLA() from /usr/src/tensorrt/samples/common/common.h
inline void enableDLA(IBuilder* builder, IBuilderConfig* config, int useDLACore, bool allowGPUFallback = true)
{
if (useDLACore >= 0)
{
if (builder->getNbDLACores() == 0)
{
std::cerr << "Trying to use DLA core " << useDLACore << " on a platform that doesn't have any DLA cores"
<< std::endl;
assert("Error: use DLA core on a platform that doesn't have any DLA cores" && false);
}
if (allowGPUFallback)
{
config->setFlag(BuilderFlag::kGPU_FALLBACK);
}
if (!builder->getInt8Mode() && !config->getFlag(BuilderFlag::kINT8))
{
// User has not requested INT8 Mode.
// By default run in FP16 mode. FP32 mode is not permitted.
builder->setFp16Mode(true);
config->setFlag(BuilderFlag::kFP16);
}
config->setDefaultDeviceType(DeviceType::kDLA);
config->setDLACore(useDLACore);
config->setFlag(BuilderFlag::kSTRICT_TYPES);
}
}
// When converting/building the model:
// Build the engine
builder->setMaxBatchSize(BATCH_SIZE);
//config->setMaxWorkspaceSize(1_GiB);
config->setMaxWorkspaceSize(1 * (1 << 20)); // 1 MiB workspace
config->setFlag(nvinfer1::BuilderFlag::kFP16);
std::cout << "**********************************DLA***********************" << std::endl;
// nx /usr/src/tensorrt/samples/common/common.h
std::cout << "start dla." << std::endl;
config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
config->setDLACore(0); // setDLACore takes the DLA core index (0 or 1 on Xavier NX), not a bool
config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);
std::cout << "start building engine" << std::endl;
engine = builder->buildEngineWithConfig(*network, *config);
std::cout << "build engine done" << std::endl;
assert(engine);
parser->destroy();
nvinfer1::IHostMemory *data = engine->serialize();
std::ofstream file;
file.open(filename, std::ios::binary | std::ios::out);
std::cout << "writing engine file..." << std::endl;
file.write((const char *)data->data(), data->size());
std::cout << "save engine file done" << std::endl;
file.close();
network->destroy();
builder->destroy();
// When loading and running the engine:
// deserialize the engine
IRuntime* runtime = createInferRuntime(gLogger);
assert(runtime != nullptr);
if (gArgs.useDLACore >= 0)
{
runtime->setDLACore(gArgs.useDLACore);
}
ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream->data(), trtModelStream->size(), nullptr);
assert(engine != nullptr);
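After deserialization, running the engine follows the usual TensorRT 7 pattern; a minimal sketch (BATCH_SIZE and the device-side buffers[] array are assumed to be prepared elsewhere with cudaMalloc, matching the engine's bindings):
// create an execution context and run one asynchronous inference
IExecutionContext* context = engine->createExecutionContext();
assert(context != nullptr);
cudaStream_t stream;
cudaStreamCreate(&stream);
context->enqueue(BATCH_SIZE, buffers, stream, nullptr);
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
context->destroy();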
This can be combined with the scene and applied to event detection:
people and vehicles, person analysis, etc.
After pedestrian detection, keypoint localization is run on each person; the per-person timing for 17-keypoint localization:
NX:
Preprocessing: 1.1 ms
Inference: 31 ms
Postprocessing: 0.6 ms
Per-frame total: 32.7 ms
Keypoint indices: nose-0, neck-1, right shoulder-2, right elbow-3, right wrist-4, left shoulder-5, left elbow-6, left wrist-7, right hip-8, right knee-9, right ankle-10, left hip-11, left knee-12, left ankle-13, right eye-14, left eye-15, right ear-16, left ear-17, background-18.
How to map the results back to the original image for visualization:
// prepareImage
std::vector<float> prepareImage(std::vector<cv::Mat> &vec_img) {
std::vector<float> result(BATCH_SIZE * IMAGE_WIDTH * IMAGE_HEIGHT * INPUT_CHANNEL);
float *data = result.data();
for (const cv::Mat &src_img : vec_img)
{
if (!src_img.data)
continue;
float ratio = std::min(float(IMAGE_WIDTH) / float(src_img.cols), float(IMAGE_HEIGHT) / float(src_img.rows));
cv::Mat flt_img = cv::Mat::zeros(cv::Size(IMAGE_WIDTH, IMAGE_HEIGHT), CV_8UC3);
cv::Mat rsz_img;
cv::resize(src_img, rsz_img, cv::Size(), ratio, ratio);
rsz_img.copyTo(flt_img(cv::Rect(0, 0, rsz_img.cols, rsz_img.rows)));
flt_img.convertTo(flt_img, CV_32FC3, 1.0 / 255);
//HWC TO CHW
std::vector<cv::Mat> split_img(INPUT_CHANNEL);
cv::split(flt_img, split_img);
int channelLength = IMAGE_WIDTH * IMAGE_HEIGHT;
for (int i = 0; i < INPUT_CHANNEL; ++i)
{
split_img[i] = (split_img[i] - img_mean[i]) / img_std[i];
memcpy(data, split_img[i].data, channelLength * sizeof(float));
data += channelLength;
}
}
return result;
}
// postProcess
std::vector<std::vector<KeyPoint>> postProcess(const std::vector<cv::Mat> &vec_Mat, float *output, const int &outSize) {
std::vector<std::vector<KeyPoint>> vec_key_points;
int feature_size = IMAGE_WIDTH * IMAGE_HEIGHT / 16;
int index = 0;
for (const cv::Mat &src_img : vec_Mat) {
std::vector<KeyPoint> key_points = std::vector<KeyPoint>(num_key_points);
float ratio = std::max(float(src_img.cols) / float(IMAGE_WIDTH), float(src_img.rows) / float(IMAGE_HEIGHT));
float *current_person = output + index * outSize;
for (int number = 0; number < num_key_points; number++) {
float *current_point = current_person + feature_size * number;
auto max_pos = std::max_element(current_point, current_point + feature_size);
key_points[number].prob = *max_pos;
float x = (max_pos - current_point) % (IMAGE_WIDTH / 4) + (*(max_pos + 1) > *(max_pos - 1) ? 0.25 : -0.25);
float y = (max_pos - current_point) / (IMAGE_WIDTH / 4) + (*(max_pos + IMAGE_WIDTH / 4) > *(max_pos - IMAGE_WIDTH / 4) ? 0.25 : -0.25);
key_points[number].x = int(x * ratio * 4);
key_points[number].y = int(y * ratio * 4);
key_points[number].number = number;
}
vec_key_points.push_back(key_points);
index++;
}
return vec_key_points;
}
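Since postProcess already rescales x and y back to original-image coordinates (via ratio * 4), drawing the result only takes a few OpenCV calls; a small sketch using the KeyPoint fields from the code above (prob_thresh is a placeholder threshold):
// draw detected keypoints on the original frame
void drawKeyPoints(cv::Mat &img, const std::vector<KeyPoint> &pts, float prob_thresh = 0.3f) {
    for (const KeyPoint &kp : pts) {
        if (kp.prob < prob_thresh)
            continue;                                  // skip low-confidence points
        cv::circle(img, cv::Point(kp.x, kp.y), 3, cv::Scalar(0, 255, 0), -1);
        cv::putText(img, std::to_string(kp.number), cv::Point(kp.x + 4, kp.y),
                    cv::FONT_HERSHEY_SIMPLEX, 0.4, cv::Scalar(0, 0, 255), 1);
    }
}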
Example 1:
A segmentation network for drivable-area detection.
Input 512×512×3, output 512×512×2.
Kernel launch configuration:
dim3 dimBlock(64);
dim3 dimGrid((512 + 63) / 64,512,batch_size);
getmax<<<dimGrid, dimBlock>>>(tensor, outputDevice,512);
Running in parallel, the per-pixel maximum gives the drivable-area points:
__global__ void getmax(const float *input, uchar *result, int feature_w) {
    int w = threadIdx.x + blockDim.x * blockIdx.x; // column within the row
    int row = blockIdx.y;                          // row index
    int batch = blockIdx.z;                        // image index within the batch
    if (w >= feature_w)
        return;
    // linear pixel index across the whole batch
    int n = batch * gridDim.y * feature_w + row * feature_w + w;
    // the two class scores of one pixel are adjacent: input[2n] and input[2n+1]
    result[n] = (input[n * 2] > input[n * 2 + 1]) ? 0 : 255;
}
The two class scores for the same pixel are adjacent in memory, e.g.
0 1
2 3
4 5
6 7
. .
2n 2n+1
so pixel n reads input[n * 2] and input[n * 2 + 1].
The total number of threads needed is h * w * batch_size.
To speed things up, each block runs 64 threads; a host-side sketch follows.
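On the host side the mask buffer has to be allocated on the device and copied back after the kernel; a sketch under the same 512×512 assumptions (tensor is the network's device-side output, batch_size = 1 here):
// allocate the mask, launch getmax and wrap the result in a cv::Mat
int batch_size = 1;
uchar *outputDevice = nullptr;
cudaMalloc((void **)&outputDevice, batch_size * 512 * 512 * sizeof(uchar));
dim3 dimBlock(64);
dim3 dimGrid((512 + 63) / 64, 512, batch_size);
getmax<<<dimGrid, dimBlock>>>(tensor, outputDevice, 512);
cv::Mat mask(512, 512, CV_8UC1);
cudaMemcpy(mask.data, outputDevice, 512 * 512 * sizeof(uchar), cudaMemcpyDeviceToHost);
cudaFree(outputDevice);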
Example 2:
Using CUDA to implement the YOLOv5 pre- and post-processing.
The model is trained on the COCO dataset; the decoded candidates are gathered into output_xywh_pro_index:
{
int ntypes = 25200 * 6 * sizeof(float);
cudaMalloc((void **)&output_xywh_pro_index, 1 * 25200 * 6 * sizeof(float));
CHECK(cudaGetLastError());
Detforward_gpu_box(static_cast<float *>(buffers[1]), output_xywh_pro_index);
cudaMemcpyAsync(out_result, output_xywh_pro_index, ntypes, cudaMemcpyDeviceToHost);
std::cout << "Memcpy ok." << std::endl;
boxes = postProcess_gpu(src, out_result, outSize);
}
void Detforward_gpu_box(const float *input,
float *output_xywh_pro_index)
{
std::cout << "Into Detforward_gpu_box." << std::endl;
dim3 dimBlock(1);    // one thread per block
dim3 dimGrid(25200); // one block per candidate box
YoloProposal_box<<<dimGrid, dimBlock>>>(input, 25200, output_xywh_pro_index);
cudaError_t cudaStatus;
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess)
{
fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
}
std::cout << "set kernel ok." << std::endl;
}
__global__ void YoloProposal_box(const float *tensor, const int FEATURE_SIZE_NUM, float *output)
{
    // one block per candidate box: blockDim is 1, so blockIdx.x is the box index
    int idx = blockIdx.x;
    if (idx >= FEATURE_SIZE_NUM)
    {
        return;
    }
    // each candidate holds 85 values: cx, cy, w, h, objectness, 80 class probabilities
    float cx = tensor[idx * 85 + 0];
    float cy = tensor[idx * 85 + 1];
    float w = tensor[idx * 85 + 2];
    float h = tensor[idx * 85 + 3];
    float score = tensor[idx * 85 + 4];

    // pick the class with the highest probability among the 80 COCO classes
    float objProb = 0.f;
    int index = 0;
    for (int k = 0; k < 80; k++)
    {
        float prob_class = tensor[idx * 85 + 5 + k];
        if (prob_class > objProb)
        {
            index = k;
            objProb = prob_class;
        }
    }
    // write out cx, cy, w, h, combined confidence and class index
    output[idx * 6 + 0] = cx;
    output[idx * 6 + 1] = cy;
    output[idx * 6 + 2] = w;
    output[idx * 6 + 3] = h;
    output[idx * 6 + 4] = objProb * score;
    output[idx * 6 + 5] = index;
}
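postProcess_gpu itself is not shown here; after copying output_xywh_pro_index back to the host, the usual CPU-side step is to filter by confidence and run NMS over the (cx, cy, w, h, conf, class) records. A sketch under assumed names (Box, conf_thresh and nms_thresh are hypothetical, not part of the original code):
#include <algorithm>
#include <vector>

struct Box { float x, y, w, h, score; int label; };

// IoU of two center-format boxes
static float iou(const Box &a, const Box &b) {
    float x1 = std::max(a.x - a.w / 2, b.x - b.w / 2);
    float y1 = std::max(a.y - a.h / 2, b.y - b.h / 2);
    float x2 = std::min(a.x + a.w / 2, b.x + b.w / 2);
    float y2 = std::min(a.y + a.h / 2, b.y + b.h / 2);
    float inter = std::max(0.f, x2 - x1) * std::max(0.f, y2 - y1);
    return inter / (a.w * a.h + b.w * b.h - inter);
}

std::vector<Box> cpuNms(const float *out, int num, float conf_thresh, float nms_thresh) {
    std::vector<Box> cand;
    for (int i = 0; i < num; ++i) {
        const float *p = out + i * 6;                  // cx, cy, w, h, conf, class
        if (p[4] < conf_thresh)
            continue;
        cand.push_back({p[0], p[1], p[2], p[3], p[4], (int)p[5]});
    }
    std::sort(cand.begin(), cand.end(),
              [](const Box &a, const Box &b) { return a.score > b.score; });
    std::vector<Box> keep;
    for (const Box &c : cand) {
        bool ok = true;
        for (const Box &k : keep)
            if (k.label == c.label && iou(k, c) > nms_thresh) { ok = false; break; }
        if (ok)
            keep.push_back(c);
    }
    return keep;
}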
Paddle Inference demos and documentation:
https://github.com/PaddlePaddle/Paddle-Inference-Demo
https://paddle-inference.readthedocs.io/en/latest
PaddlePaddle open-source framework repositories:
GitHub:
https://github.com/PaddlePaddle/Paddle
Gitee:
https://gitee.com/paddlepaddle/Paddle
After a model is loaded, it is represented as a topological graph of operator nodes. If TRT subgraph mode is enabled before running, then during the graph analysis phase Paddle Inference finds the operator nodes that TRT can execute, fuses these interconnected OPs into a subgraph, and replaces the subgraph with a single TRT operator. Whenever this TRT operator is encountered at runtime, the TRT engine is invoked to execute it.
Paddle 1.8 added TRT subgraph support for the Ernie model, including dynamic-shape input.
During prediction, the operators executed by the TRT engine run all candidate kernels at initialization time and pick the best one for the given input size, which guarantees the best inference performance for the model.
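To enable the TRT subgraph mode described above from C++, Paddle Inference (the 1.8-era AnalysisConfig API) exposes EnableTensorRtEngine; a sketch with placeholder model paths (check the exact parameter list against the paddle-inference docs linked above):
#include "paddle_inference_api.h"

paddle::AnalysisConfig config;
config.SetModel("model/__model__", "model/__params__");   // placeholder model files
config.EnableUseGpu(100, 0);                              // initial GPU memory in MB, device id
config.EnableTensorRtEngine(1 << 20,                      // TRT workspace size
                            1,                            // max batch size
                            3,                            // min subgraph size
                            paddle::AnalysisConfig::Precision::kFloat32,
                            false,                        // use_static
                            false);                       // use_calib_mode
auto predictor = paddle::CreatePaddlePredictor(config);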
If the build fails at link time with
collect2: error: ld returned 1 exit status
because libcudart cannot be found, add a symlink:
sudo ln -s /usr/local/cuda/lib64/libcudart.so /usr/lib/libcudart.so