Let us first introduce a new term: occupancy [1].
Occupancy is the ratio of active warps to the maximum number of warps (32) that a multiprocessor can hold.
It depends on three parameters:
1) threads/block (set in the <<<>>> launch configuration)
2) registers/thread (visible in the ptx file, or reported after compiling with --ptxas-options=-v)
3) shared memory/block (also visible in the ptx file, or reported with --ptxas-options=-v). However, if the shared memory array is declared with 'extern', its size is not known at compile time; it is supplied at runtime as the third argument of <<<>>>.
We can use these charts to see how to improve the occupancy (changing one variable while keeping the other two fixed). [1]
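To make the dependence on the three parameters concrete, here is a minimal host-side C++ sketch of the calculation. It is a simplified model, not the exact rule the hardware uses: it ignores register and shared memory allocation granularity. The struct SmLimits and the functions occupancy() and sweep_threads_per_block() are hypothetical names, and every limit except the 32 warps mentioned above (blocks, registers, and shared memory per multiprocessor) is an assumed value for illustration; real values depend on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>

// Assumed per-multiprocessor limits, for illustration only.
// Only max_warps = 32 comes from the text; the rest depend on the GPU.
struct SmLimits {
    int warp_size = 32;
    int max_warps = 32;            // maximum number of warps used in the text
    int max_blocks = 8;            // assumed
    int registers = 16384;         // assumed
    int shared_mem_bytes = 16384;  // assumed
};

// Simplified occupancy estimate: active warps / maximum warps.
// Ignores allocation granularity, so real tools may report other values.
double occupancy(int threads_per_block, int regs_per_thread,
                 int smem_per_block, const SmLimits& sm = SmLimits{}) {
    assert(sm.warp_size > 0 && sm.max_warps > 0);
    if (threads_per_block <= 0) return 0.0;

    // A block always occupies a whole number of warps.
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;

    // How many blocks fit under each separate limit.
    int by_warps = sm.max_warps / warps_per_block;
    int by_regs = regs_per_thread > 0
        ? sm.registers / (regs_per_thread * warps_per_block * sm.warp_size)
        : sm.max_blocks;
    int by_smem = smem_per_block > 0
        ? sm.shared_mem_bytes / smem_per_block
        : sm.max_blocks;

    // The tightest limit decides how many blocks are resident.
    int blocks = std::min({sm.max_blocks, by_warps, by_regs, by_smem});
    return static_cast<double>(blocks * warps_per_block) / sm.max_warps;
}

// Vary threads/block while keeping the other two parameters fixed.
void sweep_threads_per_block(int regs_per_thread, int smem_per_block) {
    const int sizes[] = {64, 128, 256, 512};
    for (int threads : sizes) {
        std::printf("threads/block = %3d -> occupancy = %.2f\n", threads,
                    occupancy(threads, regs_per_thread, smem_per_block));
    }
}
```

Calling sweep_threads_per_block(16, 2048) from main() prints one line per block size, which mimics reading one of the charts: only threads/block changes, while registers/thread and shared memory/block stay fixed.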
Also, at the expense of accuracy, we can compile with -use_fast_math or replace some math functions in the code with CUDA math intrinsics (for example, __sinf() instead of sinf()). [1]
Reference:
[1] Jike Chong and Ian Lane, 18645 How to Write Fast Code, CMU.