This article is based on devices of compute capability 2.x.
Shared memory is small: 16 KB per multiprocessor on 1.x devices, and either 16 KB or 48 KB on 2.x devices, depending on how the 64 KB of on-chip memory is split with the L1 cache. On 2.x devices it is divided into 32 banks. A bank is the unit of parallel access: different banks can be read or written at the same time. Successive 32-bit words are assigned to successive banks, so 32 consecutive words cover all 32 banks once. The 32-bit word at byte offset 32 * 4 = 128 therefore belongs to the same bank as the word at offset 0, and these two words cannot be read or written at the same time. A bank conflict occurs when two threads of the same warp access different 32-bit words that lie in the same bank. The accesses must then be serialized, which can significantly reduce performance.
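The word-to-bank mapping described above can be written down as a small host-side model. This is only a sketch in plain C++ (not device code); the names bank_of, kNumBanks and kBankWidthBytes are made up for illustration and assume the 2.x layout of 32 banks, each one 32-bit word wide.

```cpp
#include <cstddef>

// Assumed 2.x layout: 32 banks, each 4 bytes (one 32-bit word) wide.
constexpr std::size_t kNumBanks = 32;
constexpr std::size_t kBankWidthBytes = 4;

// Bank that holds the byte at `byte_offset` from the start of shared memory.
constexpr std::size_t bank_of(std::size_t byte_offset) {
    return (byte_offset / kBankWidthBytes) % kNumBanks;
}
```

With this model, offsets 0 and 128 both map to bank 0, while offsets 0, 4, 8, ... up to 124 map to banks 0 through 31.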
Several ways to avoid bank conflicts:
1) Make the access stride an odd number
If the threads of a warp access different 32-bit words within one consecutive range of 32 words, there is no conflict. This is the pattern shown on the left of the figure in [1].
That pattern has stride = 1, which is what you typically get when the data is stored as a structure of arrays (SoA).
If the data is stored as an array of structures (AoS), the stride is larger than 1. The point to note here is that the stride should be an odd number.
The following is quoted from [1]:
A common access pattern is for each thread to access a 32-bit word from an array indexed by the thread ID tid and with some stride s:
__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];
In this case, threads tid and tid+n access the same bank whenever s*n is a multiple of the number of banks (i.e. 32) or, equivalently, whenever n is a multiple of 32/d where d is the greatest common divisor of 32 and s. As a consequence, there will be no bank conflict only if the warp size (i.e. 32) is less than or equal to 32/d, that is only if d is equal to 1, i.e. s is odd.
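The rule in the quote can be checked by brute force. The following is a host-side sketch in plain C++, not device code; conflict_degree is a hypothetical helper that assumes 32 banks, a warp of 32 threads, a non-negative base index and a stride s of at least 1 (so every thread touches a different word).

```cpp
#include <algorithm>
#include <array>

constexpr int kWarpSize = 32;
constexpr int kNumBanks = 32;

// Largest number of threads in one warp that hit the same bank when thread
// tid reads 32-bit word (base + s * tid). A result of 1 means conflict-free.
int conflict_degree(int base, int s) {
    std::array<int, kNumBanks> hits{};  // zero-initialized per-bank counters
    for (int tid = 0; tid < kWarpSize; ++tid) {
        ++hits[(base + s * tid) % kNumBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions the result equals the greatest common divisor of 32 and s: odd strides give 1, s = 2 gives a 2-way conflict, and s = 32 puts all 32 threads on a single bank.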
A common trick for making the stride odd is padding [3]:
For example, padding each row with one extra element changes the stride from 32 to 33 (R = stride = 33), which avoids the bank conflict.
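To see why the extra element helps, consider a warp reading one column of a row-major 32-row tile. The sketch below is a host-side model in plain C++, not device code; column_read_conflict_degree is a hypothetical name, and the model assumes 32 banks and that thread tid reads element [tid][col].

```cpp
#include <algorithm>
#include <array>

constexpr int kWarpSize = 32;
constexpr int kNumBanks = 32;

// Thread tid reads element [tid][col] of a row-major tile whose rows are
// row_stride 32-bit words apart. Returns the worst-case threads per bank.
int column_read_conflict_degree(int row_stride, int col) {
    std::array<int, kNumBanks> hits{};  // zero-initialized per-bank counters
    for (int tid = 0; tid < kWarpSize; ++tid) {
        ++hits[(tid * row_stride + col) % kNumBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With row_stride = 32 every thread lands on the same bank (a 32-way conflict); with row_stride = 33 the threads spread over all 32 banks. In device code the padded layout would typically be declared as something like __shared__ float tile[32][33], where the last column is never used.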
2) 32-bit Broadcast Access
If several threads access the same 32-bit word, the GPU automatically broadcasts that word to all of them.
For example:
int number = data[3];
This does not cause a bank conflict, because all threads read the same 32-bit word.
This explains the broadcast example illustrated in [1].
Note that the word is broadcast only when those threads read the same 32-bit word.
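One way to picture the broadcast rule is to count, for each bank, how many distinct 32-bit words a warp asks for; threads that want the same word share one access. The following is a host-side sketch in plain C++ under that assumption (32 banks, 2.x broadcast behaviour); serialized_passes is a hypothetical name, not a CUDA API.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kNumBanks = 32;

// Given the byte offset each thread of a warp accesses, return the largest
// number of *distinct* 32-bit words requested from any single bank.
// 1 means no conflict; k > 1 models a k-way conflict.
std::size_t serialized_passes(const std::vector<std::size_t>& byte_offsets) {
    std::vector<std::set<std::size_t>> words_per_bank(kNumBanks);
    for (std::size_t offset : byte_offsets) {
        std::size_t word = offset / 4;  // index of the 32-bit word
        words_per_bank[word % kNumBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        worst = std::max(worst, words.size());
    }
    return worst;
}
```

If all 32 threads read data[3] (byte offset 12), the model reports 1, whereas two threads reading byte offsets 0 and 128 request two different words from bank 0 and the model reports 2.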
Differences between 1.x and 2.x:
1) An access pattern that causes bank conflicts in 1.x but not in 2.x [1]:
In 1.x, 8-bit and 16-bit accesses typically generate bank conflicts. For example, there are bank conflicts if an array of char is accessed the following way:
__shared__ char shared[32];
char data = shared[BaseIndex + tid];
because shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank. There are no bank conflicts however, if the same array is accessed the following way:
char data = shared[BaseIndex + 4 * tid];
But in 2.x[2]:
A bank conflict only occurs if two or more threads access any bytes within different 32-bit words belonging to the same bank. If two or more threads access any bytes within the same 32-bit word, there is no bank conflict between these threads: For read accesses, the word is broadcast to the requesting threads (unlike for devices of compute capability 1.x, multiple words can be broadcast in a single transaction); for write accesses, each byte is written by only one of the threads (which thread performs the write is undefined).
This means, in particular, that unlike for devices of compute capability 1.x, there are no bank conflicts if an array of char is accessed as follows, for example:
__shared__ char shared[32];
char data = shared[BaseIndex + tid];
This access pattern has bank conflicts on 1.x devices but not on 2.x devices, because on 2.x there is no bank conflict as long as the threads access bytes within the same 32-bit word of a bank.
2) Whether a request is split per half-warp:
In 1.x[1]:
Shared memory has 16 banks that are organized such that successive 32-bit words are assigned to successive banks, i.e. interleaved. Each bank has a bandwidth of 32 bits per two clock cycles.
A shared memory request for a warp is split into two memory requests, one for each half-warp, that are issued independently. As a consequence, there can be no bank conflict between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.
Note: this should apply to devices of compute capability 1.x.
For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16.
A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.
In 2.x[2]:
For devices of compute capability 2.x, the warp size is 32 threads and the number of banks is also 32. A shared memory request for a warp is not split as with devices of compute capability 1.x, meaning that bank conflicts can occur between threads in the first half of a warp and threads in the second half of the same warp (see Section F.4.3 of the CUDA C Programming Guide).
Note: this should be the case where threads from different halves of the same warp access the same bank.
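The half-warp difference can be illustrated with a small host-side model in plain C++ (not device code). It is a sketch under simplifying assumptions: worst_conflict and split_pattern are hypothetical names, `group` must be greater than 0, every thread accesses one 32-bit word, and the more restricted broadcast behaviour of 1.x is ignored, which does not matter here because all threads in the example pattern read different words.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Worst-case number of distinct 32-bit words requested from one bank within
// any single request. `group` is how many threads share one request:
// 16 (half-warp) on 1.x, 32 (full warp) on 2.x.
std::size_t worst_conflict(const std::vector<std::size_t>& word_index,
                           std::size_t num_banks, std::size_t group) {
    std::size_t worst = 0;
    for (std::size_t first = 0; first < word_index.size(); first += group) {
        std::vector<std::set<std::size_t>> words(num_banks);
        std::size_t last = std::min(first + group, word_index.size());
        for (std::size_t t = first; t < last; ++t) {
            words[word_index[t] % num_banks].insert(word_index[t]);
        }
        for (const auto& w : words) {
            worst = std::max(worst, w.size());
        }
    }
    return worst;
}

// Hypothetical pattern: the first half-warp reads words 0..15 and the
// second half-warp reads words 32..47.
std::vector<std::size_t> split_pattern() {
    std::vector<std::size_t> idx(32);
    for (std::size_t tid = 0; tid < 32; ++tid) {
        idx[tid] = (tid < 16) ? tid : tid + 16;
    }
    return idx;
}
```

In this model, worst_conflict(split_pattern(), 16, 16) is 1 (the 1.x case: 16 banks, each half-warp served separately), while worst_conflict(split_pattern(), 32, 32) is 2 (the 2.x case: thread tid and thread tid + 16 hit different words of the same bank in one request).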
Reference:
[1] NVIDIA, CUDA C Programming Guide
[2] http://blog.csdn.net/zhanglei0107/article/details/7386431
[3] Jike Chong and Ian Lane, 18645: How to Write Fast Code