http://igoro.com/archive/gallery-of-processor-cache-effects/
This article says:
How much faster do you expect Loop 2 to run, compared to Loop 1?
int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.
The reason why the loops take the same amount of time has to do with memory. The running time of these loops is dominated by the memory accesses to the array, not by the integer multiplications. And, as I’ll explain on Example 2, the hardware will perform the same main memory accesses for the two loops.
Let’s explore this example deeper. We will try other step values, not just 1 and 16:
for (int i = 0; i < arr.Length; i += K) arr[i] *= 3;
Here are the running times of this loop for different step values (K):
Notice that while step is in the range from 1 to 16, the running time of the for-loop hardly changes. But from 16 onwards, the running time is halved each time we double the step.
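To reproduce this kind of sweep outside C#, a minimal C++ sketch of the timed loop could look like the following. The function name timeStridedPassMs and the use of std::chrono::steady_clock are my own choices for illustration, not something the article uses.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Multiplies every `step`-th element of `arr` by 3 and returns the elapsed
// wall-clock time in milliseconds. Hypothetical helper for illustration.
double timeStridedPassMs(std::vector<int>& arr, std::size_t step)
{
    assert(step >= 1); // a step of 0 would never terminate
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < arr.size(); i += step)
        arr[i] *= 3;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling it on a vector of 64 * 1024 * 1024 ints for K = 1, 2, 4, ..., 1024 gives one timing per step value. The actual numbers depend on the machine and on compiler optimization settings.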
The reason behind this is that today’s CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache. And, accessing other values from the same cache line is cheap!
Since 16 ints take up 64 bytes (one cache line), for-loops with a step between 1 and 16 have to touch the same number of cache lines: all of the cache lines in the array. But once the step is 32, we’ll only touch roughly every other cache line, and once it is 64, only every fourth.
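The counting argument above can be checked with a small sketch. It assumes a 64-byte cache line, an array that starts on a cache-line boundary, and elements that never straddle a line. The name linesTouched is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Counts the distinct cache lines touched when visiting every `step`-th
// element of an array of `count` elements, each `elemSize` bytes wide.
// Assumes the array starts on a line boundary; lineSize = 64 is an assumption.
std::size_t linesTouched(std::size_t count, std::size_t step,
                         std::size_t elemSize, std::size_t lineSize = 64)
{
    assert(step >= 1 && lineSize > 0);
    std::size_t touched = 0;
    std::size_t lastLine = static_cast<std::size_t>(-1); // sentinel: no line seen yet
    for (std::size_t i = 0; i < count; i += step) {
        std::size_t line = (i * elemSize) / lineSize; // index of the line holding element i
        if (line != lastLine) {
            ++touched;
            lastLine = line;
        }
    }
    return touched;
}
```

For 1024 four-byte ints (64 cache lines in total), steps 1 and 16 both touch all 64 lines, step 32 touches 32, and step 64 touches 16.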
Understanding cache lines can be important for certain types of program optimizations. For example, alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice as slow.
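As a rough illustration of the alignment point, the sketch below computes how many cache lines a single access covers, given its starting byte address and size. The 64-byte line size and the name linesSpanned are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns how many cache lines an access of `size` bytes starting at byte
// address `addr` overlaps. lineSize = 64 is an assumption.
std::size_t linesSpanned(std::uintptr_t addr, std::size_t size,
                         std::size_t lineSize = 64)
{
    assert(lineSize > 0);
    if (size == 0)
        return 0;
    std::uintptr_t first = addr / lineSize;              // line holding the first byte
    std::uintptr_t last = (addr + size - 1) / lineSize;  // line holding the last byte
    return static_cast<std::size_t>(last - first + 1);
}
```

An 8-byte value at offset 56 stays inside one line, while the same value at offset 60 crosses into the next line and touches two.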
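I ran a test of my own. At first the results were far from what the article describes; I later found that reusing the same memory across runs made the comparison unfair. After fixing that, the C++ test is as follows: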
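#include <cstdio>
#include <cstring>
#include <sys/time.h>
using namespace std;

// Parenthesized so the macro expands safely inside larger expressions.
#define MAX (64 * 1024 * 1024)

// Returns the elapsed time in milliseconds of one strided pass over a
// freshly allocated array of iDataLen elements, writing every iJump-th one.
// Note: the element type is long. On typical 64-bit Linux (LP64) a long is
// 8 bytes, so a step of 16 elements spans 128 bytes rather than 64.
inline int getLoopTimeMs(int iDataLen, int iJump)
{
    if (iJump <= 0)
        iJump = 1;
    // Allocate a new array for every measurement so runs do not reuse memory.
    long* pData = new long[iDataLen];
    memset(pData, 0, iDataLen * sizeof(long));
    struct timeval start_tv, end_tv;
    gettimeofday(&start_tv, NULL);
    for (int i = 0; i < iDataLen; i += iJump)
        pData[i] = 1;
    gettimeofday(&end_tv, NULL);
    // Take the difference in microseconds first, then convert to milliseconds,
    // so a negative tv_usec difference is not truncated separately.
    long long iUs = (end_tv.tv_sec - start_tv.tv_sec) * 1000000LL
                  + (end_tv.tv_usec - start_tv.tv_usec);
    delete[] pData;
    return (int)(iUs / 1000);
}

int main()
{
    // Loop 1: every element
    int iMs = getLoopTimeMs(MAX, 1);
    // Loop 2: every 16th element
    int iMs2 = getLoopTimeMs(MAX, 16);
    // Every 64th element
    int iMs64 = getLoopTimeMs(MAX, 64);
    printf("loop1,ms:%d,loop2,ms:%d,loop64:%d\n", iMs, iMs2, iMs64);
    return 0;
}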
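The loop1 and loop2 times are fairly close, which matches the 64-byte cache line the article describes.
However, the loop64 result is higher than I expected, and I don't quite understand why.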
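One thing that may be worth checking when comparing against the article is the stride in bytes, since the article uses 4-byte ints while this test uses long. The sketch below is only a helper for that check, not an explanation of the loop64 number. The names strideBytes and elementsPerLine are hypothetical, and the 64-byte line size is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte distance between consecutive accesses when stepping `step` elements
// of type T at a time.
template <typename T>
constexpr std::size_t strideBytes(std::size_t step)
{
    return step * sizeof(T);
}

// Number of elements of type T that fit in one cache line.
// lineSize = 64 is an assumption about the hardware.
template <typename T>
std::size_t elementsPerLine(std::size_t lineSize = 64)
{
    assert(lineSize % sizeof(T) == 0); // elements must tile the line exactly
    return lineSize / sizeof(T);
}
```

Printing strideBytes<long>(16) and elementsPerLine<long>() on the test machine shows which step in this program corresponds to one cache line for the element type it actually uses.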