GPU Optimization Ideas

1 Each SM supports at most 8 resident blocks

2 Each SM supports at most 1024 resident threads (figure unconfirmed; it depends on the architecture)

3 The SM splits each block into warps of 32 threads

4 Shared memory is capped at 16 KB

5 The register count is also capped (maximum still to be confirmed)

6 Ratio of memory I/O to computation

7 Shared memory bank conflicts

8 Parallel reduction

9 Memory coalescing: stage loads that would otherwise be serialized through shared memory

10 Issue long-latency instructions earlier?

11 Loop unrolling

12 Prefetching

13 Use texture and constant memory
