Assess, Parallelize, Optimize, Deploy
Step 1: Assess. Profile the code to identify the hot spots.
Strong scaling (Amdahl's Law) measures how, for a fixed total problem size, the time to solution decreases as more processors are added to the system
Weak scaling (Gustafson's Law) measures how the time to solution changes as more processors are added while the problem size per processor stays fixed, so the total problem size grows with the processor count
Step 2: Parallelize the code
Use GPU-accelerated libraries (https://developer.nvidia.com/gpu-accelerated-libraries)
Use CUDA C/C++
Step 3: Optimize
High-level optimizations: algorithm choice and data movement (for example, overlapping data transfers with computation)
Low-level optimizations: explicitly caching data in shared memory, or fine-tuning floating-point operation sequences
Step 4: Deploy
Key points to watch when productizing GPU-accelerated code:
Make sure you check the return value from API calls
Consider how you will distribute the CUDA runtime and libraries
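The return-value check is usually wrapped in a macro so that every API call is covered. The sketch below shows the pattern only and builds without the CUDA toolkit: Status, status_string, fake_device_alloc and CHECK are hypothetical stand-ins, not CUDA APIs. In real CUDA code the corresponding roles are played by cudaError_t, cudaSuccess and cudaGetErrorString().

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Stand-in status codes so the sketch builds without the CUDA toolkit.
// In real CUDA code this role is played by cudaError_t and cudaSuccess.
enum class Status { Ok, OutOfMemory };

// Stand-in for cudaGetErrorString(): maps a status code to readable text.
inline const char* status_string(Status s) {
    return s == Status::Ok ? "no error" : "out of memory";
}

// Hypothetical API call that fails for requests above an arbitrary limit.
inline Status fake_device_alloc(std::size_t bytes) {
    const std::size_t limit = 1u << 20;  // 1 MiB, chosen only for the demo
    return bytes > limit ? Status::OutOfMemory : Status::Ok;
}

// Wrap every API call: on failure, report the call text, file and line.
#define CHECK(call)                                                        \
    do {                                                                   \
        const Status status_ = (call);                                     \
        if (status_ != Status::Ok) {                                       \
            throw std::runtime_error(std::string(#call) + " failed at " + \
                                     __FILE__ + ":" +                      \
                                     std::to_string(__LINE__) + ": " +     \
                                     status_string(status_));              \
        }                                                                  \
    } while (0)
```

Writing CHECK(fake_device_alloc(1024)); passes silently, while a request above the limit throws an exception whose message names the failing call and its location. Throwing is one design choice; logging and exiting is another common one for command-line tools.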
Reference:
https://developer.nvidia.com/content/assess-parallelize-optimize-deploy