Source: https://github.com/nyxcalamity/lbm-gpu
I have just started working in this area, so this post walks through the ideas behind that project and reproduces its code.
Background reading:
GPGPU: https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units
Lattice-Boltzmann Method: https://en.wikipedia.org/wiki/Lattice_Boltzmann_methods
Computational fluid dynamics: https://en.wikipedia.org/wiki/Computational_fluid_dynamics
Lid-driven cavity problem: https://www.cfd-online.com/Wiki/Lid-driven_cavity_problem
Contents
1. README.md
Description:
Technical details:
Goals:
Prerequisites:
Implementation:
Compatibility:
Building and running:
Known issues:
2. Reproduction
Fixing the initial make errors:
Running:
Comparing results:
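This project is an open-source GPGPU implementation of the Lattice-Boltzmann Method (LBM), a computational fluid dynamics (CFD) method for fluid simulation. It solves the lid-driven cavity problem on a D3Q19 lattice grid.
Because the computations performed during a lattice grid update are local in nature, the method is highly parallelizable and can scale to almost as many compute units as there are cells in the domain. Modern GPUs have thousands of execution cores, and core counts keep rising, so GPUs are an ideal target for parallelizing LBM code. The project uses the CUDA platform because, according to the README, CUDA offers broader functionality and higher data transfer rates than its competitor OpenCL at virtually the same computational throughput.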
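To make the locality argument concrete, here is a minimal sketch in C. It is not code from the repository: it assumes a cubic domain of n cells per side stored as one flat array with x varying fastest, and the names cell_index and neighbor_index are hypothetical. The point is that a cell update only touches indices a fixed offset away, which is why every cell could in principle be handed to its own GPU thread.

```c
#include <stddef.h>

/* Flat index of cell (x, y, z) in an n*n*n domain, x varying fastest.
 * This memory layout is an assumption made for the sketch. */
size_t cell_index(size_t x, size_t y, size_t z, size_t n)
{
    return x + n * (y + n * z);
}

/* Flat index of the neighbour reached by the lattice offset (dx, dy, dz),
 * each component in {-1, 0, 1}. The caller must only pass interior cells,
 * so that the neighbour still lies inside the domain. */
size_t neighbor_index(size_t x, size_t y, size_t z,
                      int dx, int dy, int dz, size_t n)
{
    return cell_index((size_t)((ptrdiff_t)x + dx),
                      (size_t)((ptrdiff_t)y + dy),
                      (size_t)((ptrdiff_t)z + dz), n);
}
```

In a CUDA kernel the (x, y, z) triple would typically come from the block and thread indices, so no two threads write the same cell.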
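The following goals were kept in mind during implementation:
High performance and efficiency of the LBM solver
High scalability of the code across various NVIDIA GPU architectures
Maintainability and clarity of the code
Readers who are not familiar with the GPGPU programming model or the inner workings of GPU hardware are strongly encouraged to look through the NVIDIA programming guide and the NVIDIA GPU architecture documentation. A general understanding of LBM solver principles is also recommended.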
The project is implemented in C with the CUDA 5.5 Toolkit and contains two versions of the LBM solver: CPU and GPU. The GPU code is decoupled from the CPU code and lives in files whose names end in _gpu.cu or _gpu.cuh. The general project structure is as follows:
main.c - main function that triggers the simulation routines
lbm_model.h - problem- and LBM-specific constants and validation methods
GPU:
initialization_gpu.h - GPU memory initialization and freeing
lbm_solver_gpu.h - LBM solver, including streaming, collision and boundary treatment
cell_computation_gpu.cuh - decoupled local cell computations
lbm_model_gpu.cuh - GPU-specific problem and LBM definitions
CPU:
initialization.h - CLI and configuration file parsing
streaming.h - streaming computations
collision.h - collision computations
boundary.h - boundary treatment
cell_computation.h - local cell computations
visualization.h - visualization of the fields
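To give a feel for what the collision and local cell computation files have to do, here is a minimal sketch of a BGK collision step for a single D3Q19 cell. It is an illustration under stated assumptions, not the repository's code: the ordering of the velocity set, the function names and the use of double precision are all my own choices, and the relaxation is the textbook single-relaxation-time (BGK) form with lattice speed of sound squared equal to 1/3.

```c
#define Q 19  /* number of discrete velocities in D3Q19 */

/* Velocity set: 1 rest vector, 6 axis-aligned, 12 planar diagonals.
 * The ordering is illustrative and need not match the repository. */
static const int C[Q][3] = {
    { 0, 0, 0},
    { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1},
    { 1, 1, 0}, {-1,-1, 0}, { 1,-1, 0}, {-1, 1, 0},
    { 1, 0, 1}, {-1, 0,-1}, { 1, 0,-1}, {-1, 0, 1},
    { 0, 1, 1}, { 0,-1,-1}, { 0, 1,-1}, { 0,-1, 1}
};

#define W_REST (1.0 / 3.0)
#define W_AXIS (1.0 / 18.0)
#define W_DIAG (1.0 / 36.0)

/* Lattice weights matching the velocity set above; they sum to 1. */
static const double W[Q] = {
    W_REST,
    W_AXIS, W_AXIS, W_AXIS, W_AXIS, W_AXIS, W_AXIS,
    W_DIAG, W_DIAG, W_DIAG, W_DIAG, W_DIAG, W_DIAG,
    W_DIAG, W_DIAG, W_DIAG, W_DIAG, W_DIAG, W_DIAG
};

/* Density is the zeroth moment of the distribution functions. */
double density(const double f[Q])
{
    double rho = 0.0;
    for (int i = 0; i < Q; ++i)
        rho += f[i];
    return rho;
}

/* Velocity is the first moment divided by the density. */
void velocity(const double f[Q], double rho, double u[3])
{
    u[0] = u[1] = u[2] = 0.0;
    for (int i = 0; i < Q; ++i)
        for (int d = 0; d < 3; ++d)
            u[d] += f[i] * C[i][d];
    for (int d = 0; d < 3; ++d)
        u[d] /= rho;
}

/* Second-order equilibrium distribution with c_s^2 = 1/3. */
void equilibrium(double rho, const double u[3], double feq[Q])
{
    const double usq = u[0] * u[0] + u[1] * u[1] + u[2] * u[2];
    for (int i = 0; i < Q; ++i) {
        const double cu = C[i][0] * u[0] + C[i][1] * u[1] + C[i][2] * u[2];
        feq[i] = W[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
    }
}

/* BGK collision: relax every f[i] towards its equilibrium at rate 1/tau. */
void collide(double f[Q], double tau)
{
    double u[3], feq[Q];
    const double rho = density(f);
    velocity(f, rho, u);
    equilibrium(rho, u, feq);
    for (int i = 0; i < Q; ++i)
        f[i] -= (f[i] - feq[i]) / tau;
}
```

Everything collide needs is stored in the cell itself, so the collision step needs no communication between cells. Only streaming and boundary treatment look at neighbours.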
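The code is compatible with GPUs of compute capability 2.0 and higher and with NVIDIA CUDA Toolkit version 4.0 and higher.
These instructions are aimed at Linux users who have a CUDA-capable GPU of compute capability 2.0 and an installed, enabled GPU device driver.
Other dependencies:
gcc version 4.8.2+
GNU Make version 3.81+
git version 1.9.1+
1. Clone the project from the GitHub repository:
git clone https://github.com/nyxcalamity/lbm-gpu.git
2. Navigate to
make
3. In
4. Run the project with the following command
5. Read the help message and run the actual simulation:
The project has a few known issues, none of which affect its performance or the resulting simulation:
Because of the boundary treatment optimization, the 57 checking branches were reduced to 22, at the cost of swapping probability distribution functions between boundary cells at the edges.
An unknown rounding error occurs during visualization, which may change a few values by no more than 0.0000001.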
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master$ make
mkdir -p build img
rm -f build/initialization.o build/visualization.o build/boundary.o build/collision.o build/streaming.o build/cell_computation.o build/utils.o build/main.o build/lbm_model.o build/lbm_solver_gpu.o build/initialization_gpu.o build/lbm-sim
rm -f img/*.vtk
g++ -g -Wall -ofast -funroll-loops -c src/initialization.c -o build/initialization.o -lm
g++ -g -Wall -ofast -funroll-loops -c src/visualization.c -o build/visualization.o -lm
g++ -g -Wall -ofast -funroll-loops -c src/boundary.c -o build/boundary.o -lm
g++ -g -Wall -ofast -funroll-loops -c src/collision.c -o build/collision.o -lm
g++ -g -Wall -ofast -funroll-loops -c src/streaming.c -o build/streaming.o -lm
g++ -g -Wall -ofast -funroll-loops -c src/cell_computation.c -o build/cell_computation.o -lm
g++ -g -Wall -ofast -funroll-loops -c src/utils.c -o build/utils.o -lm
g++ -g -Wall -ofast -funroll-loops -c src/main.c -o build/main.o -lm
g++ -g -Wall -ofast -funroll-loops -c src/lbm_model.c -o build/lbm_model.o -lm
nvcc -g --ptxas-options=-v -arch=sm_20 -c src/lbm_solver_gpu.cu -o build/lbm_solver_gpu.o -lcudart
nvcc fatal : Value 'sm_20' is not defined for option 'gpu-architecture'
Makefile:50: recipe for target 'build/lbm_solver_gpu.o' failed
make: *** [build/lbm_solver_gpu.o] Error 1
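Fix 1: in the Makefile, change COMPUTE_CAPABILITY=20 to COMPUTE_CAPABILITY=52
Set this to the compute capability of your own GPU; see https://developer.nvidia.com/cuda-gpus. nvcc rejects sm_20 because recent CUDA toolkits no longer support that architecture.
After this change the build gets further but still fails. nvcc now only prints deprecation warnings for cudaThreadSynchronize (newer CUDA releases provide cudaDeviceSynchronize as its replacement); the actual failure is the linker error at the end: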
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime_api.h:955:46: note: declared here
extern __CUDA_DEPRECATED __host__ cudaError_t CUDARTAPI cudaThreadSynchronize(void);
^
src/lbm_solver_gpu.cu:383:36: warning: ‘cudaError_t cudaThreadSynchronize()’ is deprecated [-Wdeprecated-declarations]
cudaErrorCheck(cudaThreadSynchronize());
^
/usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_runtime_api.h:955:46: note: declared here
extern __CUDA_DEPRECATED __host__ cudaError_t CUDARTAPI cudaThreadSynchronize(void);
^
nvcc -g --ptxas-options=-v -arch=sm_52 -c src/initialization_gpu.cu -o build/initialization_gpu.o -lcudart
(similar output omitted)
g++ build/lbm_solver_gpu.o build/initialization_gpu.o build/initialization.o build/visualization.o build/boundary.o build/collision.o build/streaming.o build/cell_computation.o build/utils.o build/main.o build/lbm_model.o -o build/lbm-sim -lm -lcudart
/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status
Makefile:39: recipe for target 'build/lbm-sim' failed
make: *** [build/lbm-sim] Error 1
Checking the environment:
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master$ which nvcc
/usr/local/cuda/bin/nvcc
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master$ echo $LD_LIBRARY_PATH
/usr/local/cuda/lib64:/usr/local/cuda/lib64
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master$ sudo ldconfig
[sudo] password for wlsh:
Sorry, try again.
[sudo] password for wlsh:
/sbin/ldconfig.real: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7 is not a symbolic link
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master$ ls /usr/local/cuda/lib64/
libaccinj64.so libcurand.so.10.1.105 libnppicom_static.a libnppitc.so.10.1.105
libaccinj64.so.10.1 libcurand_static.a libnppidei.so libnppitc_static.a
libaccinj64.so.10.1.105 libcusolver.so libnppidei.so.10 libnpps.so
libcudadevrt.a libcusolver.so.10 libnppidei.so.10.1.105 libnpps.so.10
libcudart.so libcusolver.so.10.1.105 libnppidei_static.a libnpps.so.10.1.105
libcudart.so.10.1 libcusolver_static.a libnppif.so libnpps_static.a
libcudart.so.10.1.105 libcusparse.so libnppif.so.10 libnvgraph.so
libcudart_static.a libcusparse.so.10 libnppif.so.10.1.105 libnvgraph.so.10
libcudnn.so libcusparse.so.10.1.105 libnppif_static.a libnvgraph.so.10.1.105
libcudnn.so.7 libcusparse_static.a libnppig.so libnvgraph_static.a
libcudnn.so.7.5.0 liblapack_static.a libnppig.so.10 libnvjpeg.so
libcudnn_static.a libmetis_static.a libnppig.so.10.1.105 libnvjpeg.so.10
libcufft.so libnppc.so libnppig_static.a libnvjpeg.so.10.1.105
libcufft.so.10 libnppc.so.10 libnppim.so libnvjpeg_static.a
libcufft.so.10.1.105 libnppc.so.10.1.105 libnppim.so.10 libnvrtc-builtins.so
libcufft_static.a libnppc_static.a libnppim.so.10.1.105 libnvrtc-builtins.so.10.1
libcufft_static_nocallback.a libnppial.so libnppim_static.a libnvrtc-builtins.so.10.1.105
libcufftw.so libnppial.so.10 libnppist.so libnvrtc.so
libcufftw.so.10 libnppial.so.10.1.105 libnppist.so.10 libnvrtc.so.10.1
libcufftw.so.10.1.105 libnppial_static.a libnppist.so.10.1.105 libnvrtc.so.10.1.105
libcufftw_static.a libnppicc.so libnppist_static.a libnvToolsExt.so
libcuinj64.so libnppicc.so.10 libnppisu.so libnvToolsExt.so.1
libcuinj64.so.10.1 libnppicc.so.10.1.105 libnppisu.so.10 libnvToolsExt.so.1.0.0
libcuinj64.so.10.1.105 libnppicc_static.a libnppisu.so.10.1.105 libOpenCL.so
libculibos.a libnppicom.so libnppisu_static.a libOpenCL.so.1
libcurand.so libnppicom.so.10 libnppitc.so libOpenCL.so.1.1
libcurand.so.10 libnppicom.so.10.1.105 libnppitc.so.10 stubs
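Fix 2: in the Makefile, change LDFLAGS_CU=-lcudart #-lcuda to LDFLAGS_CU=-L/usr/local/cuda/lib64/ -lcudart #-lcuda
The CUDA runtime API functions depend on the cudart library (libcudart.so on Linux). LD_LIBRARY_PATH is consulted when a program is loaded, not when the linker resolves -l options, so make could not find the library; I therefore passed my own CUDA library directory with -L.
The build now succeeds, but running the simulation still fails: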
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ls
boundary.o collision.o initialization.o lbm-sim main.o utils.o
cell_computation.o initialization_gpu.o lbm_model.o lbm_solver_gpu.o streaming.o visualization.o
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ./lbm-sim
List of control flags:
-gpu all computations are to be performed on gpu
-cpu all computations are to be performed on cpu
-help prints this help message
NOTE: Control flags are mutually exclusive and only one flag at a time is allowed
Example program usage:
./lbm-sim ./data/lbm.dat -gpu
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ./lbm-sim ../data/lbm.dat -gpu
File: ../data/lbm.dat *tau = 0.800000
File: ../data/lbm.dat *velocity_wall_1= 0.010000
File: ../data/lbm.dat *velocity_wall_2= 0.000000
File: ../data/lbm.dat *velocity_wall_3= 0.000000
File: ../data/lbm.dat *xlength = 94
File: ../data/lbm.dat *timesteps = 100
File: ../data/lbm.dat *timesteps_per_plotting= 2
Computed Mach number: 0.017321
Computed Reynolds number: 9.400000
Time step: #0
src/visualization.c:51 Error : Failed to open img/lbm-img.0.vtk
C-Lib errno = 2
C-Lib strerror = No such file or directory
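Fix 3: the output directory turned out to be missing. The program writes img/lbm-img.0.vtk relative to the current directory, and make created img next to build rather than inside it, so I created one inside build: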
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ls ../img
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ mkdir img
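Instead of creating the directory by hand, a program can make sure its output directory exists before opening a file in it. The following is a minimal POSIX sketch of that idea, not the code in src/visualization.c: ensure_dir and open_vtk are hypothetical helper names, and only the file name pattern lbm-img.<step>.vtk is taken from the error message above.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create the directory `path` if it does not exist yet (one level only,
 * parents are not created). Returns 0 if the directory exists afterwards,
 * -1 otherwise. */
int ensure_dir(const char *path)
{
    struct stat st;
    if (mkdir(path, 0755) == 0)
        return 0;
    if (errno == EEXIST && stat(path, &st) == 0 && S_ISDIR(st.st_mode))
        return 0;
    return -1;
}

/* Open <dir>/lbm-img.<step>.vtk for writing, creating <dir> first.
 * Returns NULL on failure, with errno set by the failing call. */
FILE *open_vtk(const char *dir, int step)
{
    char name[512];
    if (ensure_dir(dir) != 0)
        return NULL;
    int len = snprintf(name, sizeof name, "%s/lbm-img.%d.vtk", dir, step);
    if (len < 0 || (size_t)len >= sizeof name)
        return NULL;  /* path did not fit into the buffer */
    return fopen(name, "w");
}
```

With a helper like this, open_vtk("img", 0) would work from any working directory that is writable, including build.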
CPU version:
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ./lbm-sim ../data/lbm.dat -cpu
File: ../data/lbm.dat *tau = 0.800000
File: ../data/lbm.dat *velocity_wall_1= 0.010000
File: ../data/lbm.dat *velocity_wall_2= 0.000000
File: ../data/lbm.dat *velocity_wall_3= 0.000000
File: ../data/lbm.dat *xlength = 94
File: ../data/lbm.dat *timesteps = 100
File: ../data/lbm.dat *timesteps_per_plotting= 2
Computed Mach number: 0.017321
Computed Reynolds number: 9.400000
Time step: #0
Time step: #1
Time step: #2
Time step: #3
Time step: #4
Time step: #5
Time step: #6
Time step: #7
Time step: #8
Time step: #9
Time step: #10
Time step: #11
Time step: #12
Time step: #13
Time step: #14
Time step: #15
Time step: #16
Time step: #17
Time step: #18
Time step: #19
Time step: #20
Time step: #21
Time step: #22
Time step: #23
Time step: #24
Time step: #25
Time step: #26
Time step: #27
Time step: #28
Time step: #29
Time step: #30
Time step: #31
Time step: #32
Time step: #33
Time step: #34
Time step: #35
Time step: #36
Time step: #37
Time step: #38
Time step: #39
Time step: #40
Time step: #41
Time step: #42
Time step: #43
Time step: #44
Time step: #45
Time step: #46
Time step: #47
Time step: #48
Time step: #49
Time step: #50
Time step: #51
Time step: #52
Time step: #53
Time step: #54
Time step: #55
Time step: #56
Time step: #57
Time step: #58
Time step: #59
Time step: #60
Time step: #61
Time step: #62
Time step: #63
Time step: #64
Time step: #65
Time step: #66
Time step: #67
Time step: #68
Time step: #69
Time step: #70
Time step: #71
Time step: #72
Time step: #73
Time step: #74
Time step: #75
Time step: #76
Time step: #77
Time step: #78
Time step: #79
Time step: #80
Time step: #81
Time step: #82
Time step: #83
Time step: #84
Time step: #85
Time step: #86
Time step: #87
Time step: #88
Time step: #89
Time step: #90
Time step: #91
Time step: #92
Time step: #93
Time step: #94
Time step: #95
Time step: #96
Time step: #97
Time step: #98
Time step: #99
Average MLUPS: 0.978035
Simulation complete.
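The Mach and Reynolds numbers printed at the start of the run can be reproduced from the parameters in lbm.dat. The sketch below assumes the usual lattice-unit relations (speed of sound 1/sqrt(3), BGK viscosity (tau - 0.5)/3) and takes xlength as the characteristic length; the function names are mine, not the repository's. With velocity_wall_1 = 0.01, tau = 0.8 and xlength = 94 it gives a Mach number of about 0.017321 and a Reynolds number of 9.4, in line with the log.

```c
/* sqrt(3), written out so the sketch does not need libm. */
static const double SQRT3 = 1.7320508075688772;

/* Mach number Ma = u / c_s with lattice speed of sound c_s = 1/sqrt(3). */
double mach_number(double u_wall)
{
    return u_wall * SQRT3;
}

/* BGK kinematic viscosity in lattice units: nu = c_s^2 * (tau - 1/2). */
double lattice_viscosity(double tau)
{
    return (tau - 0.5) / 3.0;
}

/* Reynolds number Re = u * L / nu, with the cavity edge length
 * (xlength in lbm.dat) assumed to be the characteristic length L. */
double reynolds_number(double u_wall, double xlength, double tau)
{
    return u_wall * xlength / lattice_viscosity(tau);
}
```

The low Mach number matters because the LBM equilibrium used here is only accurate for nearly incompressible flow.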
GPU version:
wlsh@wlsh-ThinkStation:~/Desktop/lbm-gpu-master/build$ ./lbm-sim ../data/lbm.dat -gpu
File: ../data/lbm.dat *tau = 0.800000
File: ../data/lbm.dat *velocity_wall_1= 0.010000
File: ../data/lbm.dat *velocity_wall_2= 0.000000
File: ../data/lbm.dat *velocity_wall_3= 0.000000
File: ../data/lbm.dat *xlength = 94
File: ../data/lbm.dat *timesteps = 100
File: ../data/lbm.dat *timesteps_per_plotting= 2
Computed Mach number: 0.017321
Computed Reynolds number: 9.400000
Time step: #0
Time step: #1
Time step: #2
Time step: #3
Time step: #4
Time step: #5
Time step: #6
Time step: #7
Time step: #8
Time step: #9
Time step: #10
Time step: #11
Time step: #12
Time step: #13
Time step: #14
Time step: #15
Time step: #16
Time step: #17
Time step: #18
Time step: #19
Time step: #20
Time step: #21
Time step: #22
Time step: #23
Time step: #24
Time step: #25
Time step: #26
Time step: #27
Time step: #28
Time step: #29
Time step: #30
Time step: #31
Time step: #32
Time step: #33
Time step: #34
Time step: #35
Time step: #36
Time step: #37
Time step: #38
Time step: #39
Time step: #40
Time step: #41
Time step: #42
Time step: #43
Time step: #44
Time step: #45
Time step: #46
Time step: #47
Time step: #48
Time step: #49
Time step: #50
Time step: #51
Time step: #52
Time step: #53
Time step: #54
Time step: #55
Time step: #56
Time step: #57
Time step: #58
Time step: #59
Time step: #60
Time step: #61
Time step: #62
Time step: #63
Time step: #64
Time step: #65
Time step: #66
Time step: #67
Time step: #68
Time step: #69
Time step: #70
Time step: #71
Time step: #72
Time step: #73
Time step: #74
Time step: #75
Time step: #76
Time step: #77
Time step: #78
Time step: #79
Time step: #80
Time step: #81
Time step: #82
Time step: #83
Time step: #84
Time step: #85
Time step: #86
Time step: #87
Time step: #88
Time step: #89
Time step: #90
Time step: #91
Time step: #92
Time step: #93
Time step: #94
Time step: #95
Time step: #96
Time step: #97
Time step: #98
Time step: #99
Average MLUPS: 81.179367
Simulation complete.
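MLUPS: Million Lattice Updates Per Second, i.e. how many million lattice cells are updated each second.
Average MLUPS:
CPU: 0.978035
GPU: 81.179367
The GPU version is about 83 times faster! And this is still not the most optimized approach.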
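For reference, the two figures can be related with a few lines of C. This is only a sketch of the definition: the function names are hypothetical, and the repository may compute its average differently (for example by averaging per-step rates).

```c
/* Million lattice updates per second: cells updated per time step,
 * times the number of time steps, divided by the elapsed wall time. */
double mlups(double num_cells, double timesteps, double seconds)
{
    return num_cells * timesteps / (seconds * 1.0e6);
}

/* Speedup of one run over another, as the ratio of their MLUPS values. */
double speedup(double fast_mlups, double slow_mlups)
{
    return fast_mlups / slow_mlups;
}
```

Calling speedup(81.179367, 0.978035) gives roughly 83.0, the ratio of the GPU and CPU averages above.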
To be continued~