CUDA_Application (I): Lattice-Boltzmann Method Using GPGPU CUDA Platform

Source: https://github.com/nyxcalamity/lbm-gpu

I have just started working in this area, so this post reproduces the ideas and the code of that project.

Background reading:

GPGPU: https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units

Lattice-Boltzmann Method: https://en.wikipedia.org/wiki/Lattice_Boltzmann_methods

Computational fluid dynamics: https://en.wikipedia.org/wiki/Computational_fluid_dynamics

Lid-driven cavity problem: https://www.cfd-online.com/Wiki/Lid-driven_cavity_problem

Contents

1.README.md

Description:

Technical details:

Goals:

Prerequisites:

Implementation:

Compatibility:

Building and running:

Known issues:

2.Reproduction

Fixing the initial make errors:

Running:

Comparing the results:


1.README.md

Description:

This project is an open-source GPGPU implementation of the Lattice-Boltzmann Method (LBM), a computational fluid dynamics (CFD) method for fluid simulation. It solves the [lid-driven cavity problem] on a D3Q19 lattice grid.

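Technical details:

Because the computations performed during a lattice grid update are local in nature, the method is highly parallelizable and can scale to nearly as many compute units as there are cells in the domain. Modern GPUs have thousands of execution cores, and core counts keep rising, so GPUs are an ideal target for parallelizing LBM code. The project uses the CUDA platform because, according to the README, CUDA offers broader functionality and a higher data transfer rate than its competitor OpenCL, at virtually the same computational throughput.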

Goals:

The following factors were considered during implementation:

High performance and efficiency of the LBM solver

High scalability of the code across NVIDIA GPU architectures

Maintainability and clarity of the code

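Prerequisites:

Readers who are unfamiliar with the GPGPU programming model or the inner workings of GPU hardware are strongly advised to read through the [NVIDIA programming guide] and [NVIDIA GPU architectures]. A general understanding of LBM solver principles is also recommended.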

Implementation:

The project is implemented in [C] with the [CUDA 5.5 Toolkit] and contains two LBM solver versions: CPU and GPU. The CPU and GPU code are decoupled, and the GPU code lives in files ending in _gpu.cu or _gpu.cuh. The general project structure is as follows:

main.c - the main function, which triggers the simulation routines

lbm_model.h - problem- and LBM-specific constants and validation methods

GPU:

initialization_gpu.h - GPU memory initialization and freeing

lbm_solver_gpu.h - the LBM solver, covering streaming, collision and boundary treatment

cell_computation_gpu.cuh - decoupled local cell computations

lbm_model_gpu.cuh - GPU-specific problem/LBM definitions

CPU:

initialization.h - CLI and config file parsing

streaming.h - streaming computations

collision.h - collision computations

boundary.h - boundary treatment

cell_computation.h - local cell computations

visualization.h - visualization of the fields

Compatibility:

The code is compatible with GPUs of compute capability 2.0 and higher and with NVIDIA CUDA Toolkit versions 4.0 and higher.

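Building and running:

These instructions are for Linux users who have a CUDA-capable GPU of compute capability 2.0 or higher and have already installed and enabled the GPU device driver.

Other dependencies:

[gcc] version 4.8.2+

[GNU Make] version 3.81+

[git] version 1.9.1+

1.Clone the project from the GitHub repository:

git clone https://github.com/nyxcalamity/lbm-gpu.git

2.Navigate to the directory and run:

make

3.Adjust the grid size or the physical properties of the problem in the configuration file ./data/lbm.dat

4.Run the project with the following command:

./build/lbm-sim -help

5.Read the help message and run the actual simulation:

./build/lbm-sim ./data/lbm.dat -gpu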

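Known issues:

The project has a few known issues, none of which affect its performance or the resulting simulation:

Because of the boundary treatment optimization, the number of checking branches was reduced from 57 to 22, at the cost of exchanging probability distribution functions between boundary cells at the edges

An unknown rounding error occurs during visualization, which may change a few values by no more than 0.0000001

2.Reproduction

Fixing the initial make errors: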

wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master$ make
mkdir -p build img
rm -f build/initialization.o build/visualization.o build/boundary.o build/collision.o build/streaming.o build/cell_computation.o build/utils.o build/main.o build/lbm_model.o build/lbm_solver_gpu.o build/initialization_gpu.o build/lbm-sim
rm -f img/*.vtk
g++ -g -Wall -ofast -funroll-loops  -c src/initialization.c -o build/initialization.o -lm
g++ -g -Wall -ofast -funroll-loops  -c src/visualization.c -o build/visualization.o -lm
g++ -g -Wall -ofast -funroll-loops  -c src/boundary.c -o build/boundary.o -lm
g++ -g -Wall -ofast -funroll-loops  -c src/collision.c -o build/collision.o -lm
g++ -g -Wall -ofast -funroll-loops  -c src/streaming.c -o build/streaming.o -lm
g++ -g -Wall -ofast -funroll-loops  -c src/cell_computation.c -o build/cell_computation.o -lm
g++ -g -Wall -ofast -funroll-loops  -c src/utils.c -o build/utils.o -lm
g++ -g -Wall -ofast -funroll-loops  -c src/main.c -o build/main.o -lm
g++ -g -Wall -ofast -funroll-loops  -c src/lbm_model.c -o build/lbm_model.o -lm
nvcc -g --ptxas-options=-v -arch=sm_20 -c src/lbm_solver_gpu.cu -o build/lbm_solver_gpu.o -lcudart 
nvcc fatal   : Value 'sm_20' is not defined for option 'gpu-architecture'
Makefile:50: recipe for target 'build/lbm_solver_gpu.o' failed
make: *** [build/lbm_solver_gpu.o] Error 1

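Fix 1: COMPUTE_CAPABILITY=20 => COMPUTE_CAPABILITY=52

The installed nvcc no longer accepts sm_20, so set this value in the Makefile to the compute capability of your own GPU. The list is at https://developer.nvidia.com/cuda-gpus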

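After this change the build still fails, as shown below. The cudaThreadSynchronize messages are only deprecation warnings (the current replacement is cudaDeviceSynchronize); the actual failure is the linker error at the end: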

/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime_api.h:955:46: note: declared here
 extern __CUDA_DEPRECATED __host__ cudaError_t CUDARTAPI cudaThreadSynchronize(void);
                                              ^
src/lbm_solver_gpu.cu:383:36: warning: ‘cudaError_t cudaThreadSynchronize()’ is deprecated [-Wdeprecated-declarations]
  cudaErrorCheck(cudaThreadSynchronize());
                                    ^
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime_api.h:955:46: note: declared here
 extern __CUDA_DEPRECATED __host__ cudaError_t CUDARTAPI cudaThreadSynchronize(void);
                                              ^
nvcc -g --ptxas-options=-v -arch=sm_52 -c src/initialization_gpu.cu -o build/initialization_gpu.o -lcudart 

(and so on)
g++ build/lbm_solver_gpu.o build/initialization_gpu.o build/initialization.o build/visualization.o build/boundary.o build/collision.o build/streaming.o build/cell_computation.o build/utils.o build/main.o build/lbm_model.o -o build/lbm-sim -lm -lcudart 
/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status
Makefile:39: recipe for target 'build/lbm-sim' failed
make: *** [build/lbm-sim] Error 1

Checking the environment:

wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master$ which nvcc
/usr/local/cuda/bin/nvcc

wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master$ echo $LD_LIBRARY_PATH
/usr/local/cuda/lib64:/usr/local/cuda/lib64

wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master$ sudo ldconfig
[sudo] password for wlsh: 
Sorry, try again.
[sudo] password for wlsh: 
/sbin/ldconfig.real: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7 is not a symbolic link

wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master$ ls /usr/local/cuda/lib64/
libaccinj64.so                libcurand.so.10.1.105    libnppicom_static.a     libnppitc.so.10.1.105
libaccinj64.so.10.1           libcurand_static.a       libnppidei.so           libnppitc_static.a
libaccinj64.so.10.1.105       libcusolver.so           libnppidei.so.10        libnpps.so
libcudadevrt.a                libcusolver.so.10        libnppidei.so.10.1.105  libnpps.so.10
libcudart.so                  libcusolver.so.10.1.105  libnppidei_static.a     libnpps.so.10.1.105
libcudart.so.10.1             libcusolver_static.a     libnppif.so             libnpps_static.a
libcudart.so.10.1.105         libcusparse.so           libnppif.so.10          libnvgraph.so
libcudart_static.a            libcusparse.so.10        libnppif.so.10.1.105    libnvgraph.so.10
libcudnn.so                   libcusparse.so.10.1.105  libnppif_static.a       libnvgraph.so.10.1.105
libcudnn.so.7                 libcusparse_static.a     libnppig.so             libnvgraph_static.a
libcudnn.so.7.5.0             liblapack_static.a       libnppig.so.10          libnvjpeg.so
libcudnn_static.a             libmetis_static.a        libnppig.so.10.1.105    libnvjpeg.so.10
libcufft.so                   libnppc.so               libnppig_static.a       libnvjpeg.so.10.1.105
libcufft.so.10                libnppc.so.10            libnppim.so             libnvjpeg_static.a
libcufft.so.10.1.105          libnppc.so.10.1.105      libnppim.so.10          libnvrtc-builtins.so
libcufft_static.a             libnppc_static.a         libnppim.so.10.1.105    libnvrtc-builtins.so.10.1
libcufft_static_nocallback.a  libnppial.so             libnppim_static.a       libnvrtc-builtins.so.10.1.105
libcufftw.so                  libnppial.so.10          libnppist.so            libnvrtc.so
libcufftw.so.10               libnppial.so.10.1.105    libnppist.so.10         libnvrtc.so.10.1
libcufftw.so.10.1.105         libnppial_static.a       libnppist.so.10.1.105   libnvrtc.so.10.1.105
libcufftw_static.a            libnppicc.so             libnppist_static.a      libnvToolsExt.so
libcuinj64.so                 libnppicc.so.10          libnppisu.so            libnvToolsExt.so.1
libcuinj64.so.10.1            libnppicc.so.10.1.105    libnppisu.so.10         libnvToolsExt.so.1.0.0
libcuinj64.so.10.1.105        libnppicc_static.a       libnppisu.so.10.1.105   libOpenCL.so
libculibos.a                  libnppicom.so            libnppisu_static.a      libOpenCL.so.1
libcurand.so                  libnppicom.so.10         libnppitc.so            libOpenCL.so.1.1
libcurand.so.10               libnppicom.so.10.1.105   libnppitc.so.10         stubs

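Fix 2: LDFLAGS_CU=-lcudart #-lcuda => LDFLAGS_CU=-L/usr/local/cuda/lib64/ -lcudart #-lcuda

The CUDA runtime API functions depend on the cudart library (libcudart.so on Linux). The linker could not find it during make, so I added my own library path with -L. Note that LD_LIBRARY_PATH is used by the loader at run time; it does not tell the linker where to search at build time.

The build now succeeds, but running the program still fails: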

wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ls
boundary.o          collision.o           initialization.o  lbm-sim           main.o       utils.o
cell_computation.o  initialization_gpu.o  lbm_model.o       lbm_solver_gpu.o  streaming.o  visualization.o

wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ./lbm-sim 
List of control flags:
	 -gpu             all computations are to be performed on gpu
	 -cpu             all computations are to be performed on cpu
	 -help            prints this help message
NOTE: Control flags are mutually exclusive and only one flag at a time is allowed
Example program usage:
	./lbm-sim ./data/lbm.dat -gpu


wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ./lbm-sim ../data/lbm.dat -gpu
File: ../data/lbm.dat		*tau           = 0.800000
File: ../data/lbm.dat		*velocity_wall_1= 0.010000
File: ../data/lbm.dat		*velocity_wall_2= 0.000000
File: ../data/lbm.dat		*velocity_wall_3= 0.000000
File: ../data/lbm.dat		*xlength       = 94
File: ../data/lbm.dat		*timesteps     = 100
File: ../data/lbm.dat		*timesteps_per_plotting= 2
Computed Mach number: 0.017321
Computed Reynolds number: 9.400000
Time step: #0
src/visualization.c:51 Error : Failed to open img/lbm-img.0.vtk
C-Lib   errno    = 2
C-Lib   strerror = No such file or directory

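Fix 3: the img directory is missing. The program writes img/lbm-img.0.vtk relative to the current working directory, and because I run it from build/, the directory has to exist there: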

wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ls ../img
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ mkdir img

Running:

CPU version:

wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ./lbm-sim ../data/lbm.dat -cpu
File: ../data/lbm.dat		*tau           = 0.800000
File: ../data/lbm.dat		*velocity_wall_1= 0.010000
File: ../data/lbm.dat		*velocity_wall_2= 0.000000
File: ../data/lbm.dat		*velocity_wall_3= 0.000000
File: ../data/lbm.dat		*xlength       = 94
File: ../data/lbm.dat		*timesteps     = 100
File: ../data/lbm.dat		*timesteps_per_plotting= 2
Computed Mach number: 0.017321
Computed Reynolds number: 9.400000
Time step: #0
Time step: #1
Time step: #2
Time step: #3
Time step: #4
Time step: #5
Time step: #6
Time step: #7
Time step: #8
Time step: #9
Time step: #10
Time step: #11
Time step: #12
Time step: #13
Time step: #14
Time step: #15
Time step: #16
Time step: #17
Time step: #18
Time step: #19
Time step: #20
Time step: #21
Time step: #22
Time step: #23
Time step: #24
Time step: #25
Time step: #26
Time step: #27
Time step: #28
Time step: #29
Time step: #30
Time step: #31
Time step: #32
Time step: #33
Time step: #34
Time step: #35
Time step: #36
Time step: #37
Time step: #38
Time step: #39
Time step: #40
Time step: #41
Time step: #42
Time step: #43
Time step: #44
Time step: #45
Time step: #46
Time step: #47
Time step: #48
Time step: #49
Time step: #50
Time step: #51
Time step: #52
Time step: #53
Time step: #54
Time step: #55
Time step: #56
Time step: #57
Time step: #58
Time step: #59
Time step: #60
Time step: #61
Time step: #62
Time step: #63
Time step: #64
Time step: #65
Time step: #66
Time step: #67
Time step: #68
Time step: #69
Time step: #70
Time step: #71
Time step: #72
Time step: #73
Time step: #74
Time step: #75
Time step: #76
Time step: #77
Time step: #78
Time step: #79
Time step: #80
Time step: #81
Time step: #82
Time step: #83
Time step: #84
Time step: #85
Time step: #86
Time step: #87
Time step: #88
Time step: #89
Time step: #90
Time step: #91
Time step: #92
Time step: #93
Time step: #94
Time step: #95
Time step: #96
Time step: #97
Time step: #98
Time step: #99
Average MLUPS: 0.978035
Simulation complete.

GPU version:

wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ./lbm-sim ../data/lbm.dat -gpu
File: ../data/lbm.dat		*tau           = 0.800000
File: ../data/lbm.dat		*velocity_wall_1= 0.010000
File: ../data/lbm.dat		*velocity_wall_2= 0.000000
File: ../data/lbm.dat		*velocity_wall_3= 0.000000
File: ../data/lbm.dat		*xlength       = 94
File: ../data/lbm.dat		*timesteps     = 100
File: ../data/lbm.dat		*timesteps_per_plotting= 2
Computed Mach number: 0.017321
Computed Reynolds number: 9.400000
Time step: #0
Time step: #1
Time step: #2
Time step: #3
Time step: #4
Time step: #5
Time step: #6
Time step: #7
Time step: #8
Time step: #9
Time step: #10
Time step: #11
Time step: #12
Time step: #13
Time step: #14
Time step: #15
Time step: #16
Time step: #17
Time step: #18
Time step: #19
Time step: #20
Time step: #21
Time step: #22
Time step: #23
Time step: #24
Time step: #25
Time step: #26
Time step: #27
Time step: #28
Time step: #29
Time step: #30
Time step: #31
Time step: #32
Time step: #33
Time step: #34
Time step: #35
Time step: #36
Time step: #37
Time step: #38
Time step: #39
Time step: #40
Time step: #41
Time step: #42
Time step: #43
Time step: #44
Time step: #45
Time step: #46
Time step: #47
Time step: #48
Time step: #49
Time step: #50
Time step: #51
Time step: #52
Time step: #53
Time step: #54
Time step: #55
Time step: #56
Time step: #57
Time step: #58
Time step: #59
Time step: #60
Time step: #61
Time step: #62
Time step: #63
Time step: #64
Time step: #65
Time step: #66
Time step: #67
Time step: #68
Time step: #69
Time step: #70
Time step: #71
Time step: #72
Time step: #73
Time step: #74
Time step: #75
Time step: #76
Time step: #77
Time step: #78
Time step: #79
Time step: #80
Time step: #81
Time step: #82
Time step: #83
Time step: #84
Time step: #85
Time step: #86
Time step: #87
Time step: #88
Time step: #89
Time step: #90
Time step: #91
Time step: #92
Time step: #93
Time step: #94
Time step: #95
Time step: #96
Time step: #97
Time step: #98
Time step: #99
Average MLUPS: 81.179367
Simulation complete.

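Comparing the results:

MLUPS: million lattice updates per second, i.e. how many million lattice cells are updated each second

Average MLUPS:

CPU: 0.978035

GPU: 81.179367

That is a speedup of roughly 83x (81.179367 / 0.978035), and this is still not the most optimized approach!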

To be continued.
