nvidia_tlt2 notes

1. Set up the environment on the local PC:

Configure the Docker repository

# Update the package index

$ sudo apt update

# Enable HTTPS transport for apt

$ sudo apt install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

Add the GPG key

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

# Add the stable repository

$ sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

Install Docker CE

# Update the package index

$ sudo apt update

# Install Docker CE

$ sudo apt install -y docker-ce

Verify Docker CE

If the output below appears, the installation succeeded.

$ sudo docker run hello-world
...
Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. ...

Configure the nvidia-docker repository

# Add the repository

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
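
The `distribution` variable above is simply the `ID` and `VERSION_ID` fields of /etc/os-release joined together. A minimal sketch for previewing which list URL the last command will request before running it; the function name `nvidia_docker_list_url` and its optional file argument are illustrative, not part of any NVIDIA tooling:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the nvidia-docker list URL for this system.
# The optional argument (path to an os-release style file) is only for illustration.
nvidia_docker_list_url() {
    local os_release="${1:-/etc/os-release}"
    local distribution
    # Source the file inside the command substitution so ID / VERSION_ID
    # do not leak into the calling shell
    distribution=$(. "$os_release"; echo "$ID$VERSION_ID")
    echo "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list"
}

# Print the URL for the current machine, if the file is readable
if [ -r /etc/os-release ]; then
    nvidia_docker_list_url
fi
```

If the printed URL does not contain a distribution string such as ubuntu18.04, the `curl` step above is unlikely to fetch a valid list file.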

# Install the toolkit and restart Docker

$ sudo apt update && sudo apt install -y nvidia-container-toolkit
$ sudo systemctl restart docker

Test nvidia-smi on the official CUDA image

$ sudo docker run --gpus all nvidia/cuda:9.0-base nvidia-smi

# Start a container with two GPUs

$ sudo docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi

# Run a container on one specific GPU (device index 0, i.e. the first GPU)

$ sudo docker run --gpus device=0 nvidia/cuda:9.0-base nvidia-smi

If the GPU information is displayed, the setup works. This image is based on Ubuntu 16.04.
Reference: https://blog.csdn.net/weixin_30764771/article/details/99831184?spm=1001.2101.3001.6650.4&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-4.no_search_link&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-4.no_search_link
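
The `--gpus` examples above cover all GPUs, a GPU count, and a single device. Selecting several specific devices needs an extra pair of quotes around the value, because Docker parses it as a comma-separated list. A small sketch of a helper that builds that value; `gpus_arg` is an illustrative name, not a Docker command:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the value for docker's --gpus option.
# With no arguments it prints "all"; with device indices it prints
# "device=0,1" including the literal double quotes Docker expects.
gpus_arg() {
    if [ "$#" -eq 0 ]; then
        echo "all"
        return
    fi
    local IFS=,
    echo "\"device=$*\""
}

# Print the values instead of starting a container
gpus_arg
gpus_arg 0 1
```

It could then be used as `sudo docker run --gpus "$(gpus_arg 0 1)" nvidia/cuda:9.0-base nvidia-smi`, which is equivalent to typing `--gpus '"device=0,1"'` by hand.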

2. Docker is now configured. Next, pull the image:

sudo docker pull nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

Start the container:

sudo docker run --gpus 2 -itd --name tlt -v /home/tao/contain/:/workspace/tlt-experiments -p 8888:8888 nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 /bin/bash

In -p 8888:8888, the first 8888 is the port on the host and the second 8888 is the port inside the container.
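
To make the host/container split concrete, here is a sketch that assembles the same `docker run` line with the ports and the mounted directory as parameters. The function name `tlt_run_cmd` and its argument order are illustrative; it only prints the command so it can be reviewed before being run:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the docker run command from above with
# the host port, container port and host directory as parameters.
tlt_run_cmd() {
    local host_port="${1:-8888}"
    local container_port="${2:-8888}"
    local host_dir="${3:-/home/tao/contain/}"
    echo "sudo docker run --gpus 2 -itd --name tlt" \
         "-v ${host_dir}:/workspace/tlt-experiments" \
         "-p ${host_port}:${container_port}" \
         "nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 /bin/bash"
}

# Example: expose container port 8888 on host port 9999
tlt_run_cmd 9999 8888
```

With this mapping, a service listening on 8888 inside the container would be reached on port 9999 of the host.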

Enter the container

docker exec -it tlt /bin/bash

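Later I ran into these errors: docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]. and docker: Error response from daemon: Unknown runtime specified nvidia. See 'docker run --help'.
After trying many fixes, the cause turned out to be that the wrong image had been pulled (v3.0-dp-py3 is not the right image for this setup).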
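
Independently of the image tag, this error is commonly reported when the NVIDIA container tooling is missing on the host or Docker was not restarted after installing it. That is a general observation, not the cause found here. A minimal sketch of a prerequisite check; `have_cmd` and `check_gpu_prereqs` are illustrative names:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a quick host-side sanity check.
have_cmd() {
    # Succeeds if the named command can be found on PATH
    command -v "$1" >/dev/null 2>&1
}

check_gpu_prereqs() {
    local missing=0
    local cmd
    # docker: the engine; nvidia-smi: host driver tools;
    # nvidia-container-cli: installed alongside nvidia-container-toolkit
    for cmd in docker nvidia-smi nvidia-container-cli; do
        if have_cmd "$cmd"; then
            echo "ok:      $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

check_gpu_prereqs || echo "At least one prerequisite is missing"
```

If everything is reported as ok and the error persists, restarting Docker with `sudo systemctl restart docker` and re-checking the image tag are reasonable next steps.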

3. Train on your own dataset: https://blog.csdn.net/weixin_38106878/article/details/108979178
https://zongxp.blog.csdn.net/article/details/107386744

Command log:

sudo docker run --gpus 2 -itd --name tlt -v /home/zhanglu/tao/contain/:/workspace/tlt-experiments -p 8888:8888 nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 /bin/bash

sudo docker cp ./resize_kitti_labels tlt:/workspace/tlt-experiments/
docker exec -it tlt /bin/bash
sudo docker exec -it tlt /bin/bash

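Train: tlt-train detectnet_v2 -e specs/detectnet_v2_train_resnet18_car.txt -r /workspace/examples/detectnet_v2/backup/train_model/ -k <your key> -n car --gpus 2

Evaluate: tlt-evaluate detectnet_v2 -e specs/detectnet_v2_train_resnet18_car.txt -m backup-1101/train_model/model.step-529470.tlt -k <your key>

Prune: tlt-prune -m backup-1101/train_model/model.step-529470.tlt -o backup-1101/train_model/model.step-529470-pruned.tlt -eq union -pth 0.3 -k <your key>

The -pth threshold trades model speed against accuracy. The larger the threshold, the more aggressively the model is pruned: the result is smaller and inference is faster, but the accuracy loss is more severe. Conversely, a smaller threshold leaves a larger model that keeps its accuracy better, but the speedup is less noticeable. Adjust the value to your own situation, for example by trying 0.3, 0.5 and 0.7 in turn.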
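
Trying several thresholds in turn can be scripted. The sketch below only prints one tlt-prune command per threshold, reusing the flags from the prune command above. The per-threshold output file names, the `KEY` variable and the `prune_cmd` function are illustrative assumptions:

```shell
#!/usr/bin/env bash
# Sketch: generate one tlt-prune command per threshold for comparison.
# MODEL is the checkpoint from the notes; KEY is a placeholder for the model key.
MODEL="backup-1101/train_model/model.step-529470.tlt"
KEY="${KEY:-<your key>}"

prune_cmd() {
    local pth="$1"
    # Hypothetical naming scheme: append the threshold to the output file name
    local out="${MODEL%.tlt}-pruned-${pth}.tlt"
    echo "tlt-prune -m ${MODEL} -o ${out} -eq union -pth ${pth} -k ${KEY}"
}

# Dry run: print the commands instead of executing them
for pth in 0.3 0.5 0.7; do
    prune_cmd "$pth"
done
```

Each pruned model would then be retrained and evaluated to see which threshold gives an acceptable balance of size and accuracy.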

Retrain: tlt-train detectnet_v2 -e specs/detectnet_v2_train_resnet18_car.txt -r /workspace/examples/detectnet_v2/backup/train_model/ -k <your key> -n car --gpus 2

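Training license plate recognition with LPR

TLT 2 turned out not to include LPR, and the TLT 3 image kept failing here with: Server Error: Internal Server Error ("could not select device driver "" with capabilities: [[gpu]]")
So I moved on to setting up the TAO environment. Reference: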
