在编译CUDA代码时,应该选择只编译一个与常用GPU显卡匹配的’-arch’值,这将使得运行时更快,因为代码生成将在编译期间进行。如果只写了’-gencode’值而忽略了’-arch’,那么GPU代码将由CUDA驱动程序在JIT编译器上生成。
产生问题的原因是:CmakeList中 CUDA arch和CUDA gencode对应算力关系不匹配
将CmakeList.txt中 设置 GPU arch 和 code generation 参数.,使之和你的GPU显卡版本匹配。
1. Fermi(CUDA 3.2 until CUDA 8) (deprecated from CUDA 9):
SM20 or SM_20, compute_30 - Older cards such as GeForce 400,500,600, GT-630 (Completely dropped from CUDA 10 onwards)
2. Kepler(CUDA 5 and later):
SM30 or SM_30, compute_30 - Kepler architecture(generic - Tesla K40/K80, GeForce 700, GT-730)
Adds support for unified memory programming. (Completely dropped from CUDA 11 onwards.)
SM35 or SM_35, compute_35 – More specific Tesla K40
Adds support for dynamic parallelism. (Deprecated from CUDA 11, will be dropped in future versions.)
SM37 or SM_37, compute_37 – More specific Tesla K80
Adds a few more registers. (Deprecated from CUDA 11, will be dropped in future versions.)
3. Maxwell(CUDA 6 until CUDA 11):
SM50 or SM_50, compute_50 – Tesla/Quadro M series (Deprecated from CUDA 11, will be dropped in future versions)
SM52 or SM_52, compute_52 – Quadro M6000 , GeForce 900, GTX-970, GTX-980, GTX Titan X
SM53 or SM_53, compute_53 – Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano.
4. Pascal(CUDA 8 and later):
SM60 or SM_60, compute_60 – GP100/Tesla P100 – DGX-1 (Generic Pascal)
SM61 or SM_61, compute_61 – GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030, Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
SM62 or SM_62, compute_62 – Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2
5. Volta(CUDA 9 and later):
SM70 or SM_70, compute_70 – DGX-1 with Volta, Tesla V100, GTX 1180 (GV104), Titan V, Quadro GV100
SM72 or SM_72, compute_72 – Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX
6. Turing(CUDA 10 and later):
SM75 or SM_75, compute_75 – GTX/RTX Turing – GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4
7. Ampere(CUDA 11 and later):
SM80 or SM_80, compute_80 – NVIDIA A100 (the name “Tesla” has been dropped – GA100), NVIDIA DGX-A100
SM86 or SM_86, compute_86 – (from CUDA 11.1 onwards) Tesla GA10x cards, RTX Ampere – RTX 3080, GA102 – RTX 3090, RTX A6000, RTX A40, GA106 – RTX 3060, GA104 – RTX 3070, GA107 – RTX 3050
8. Hopper(CUDA 12[planned] and later):
SM90 or SM_90, compute_90 - NVIDIA H100(GH100)
根据Nvidia官方介绍,在GCC编译时设置gencode和arch的基本规则如下:
Thearch=clause of the-gencode=command-line option tonvccspecifies the front-end compilation target and must always be a PTX version.
Thecode=clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by thecode=clause will be retained in the resulting binary; at least one must be PTX to provide Ampere compatibility.
-arch和-gencode flags在CUDA 10.1上的例子:
-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75
比如我的CUDA Version: 11.6 属于 Ampere,算力flag=sm_80或sm_86,