矩阵90°旋转

下面谈谈我对这个问题的看法,如#3所说,这种横纵转换必然是打散了计算,后来我想利用memory cache来
提高速度,对于内存数据,硬盘数据是一个慢速设备,而对于当今的cpu来说,内存又是慢速的了。

我先上网查了一下我的cpu参数:E8400,L1 cache 64k,L2 cache 6M。
然后我设计了数据是1023*1024的矩阵,double类型,这样每一行数据就是1024 * sizeof(double)=8k

这样L1 cache能够装下8行的数据,但是不知道这个缓存是否能够全部利用,经试验装入6行效率最高。我猜
测缓存会装入读取数据的前面一小段和后面一大段(这个只是猜测)。

接下来说说我的算法。

将原始数据6行6行划分成条状,最后的余数不够6行特殊处理,每次读取条状第一个数据的时候,可预计这个
条带会整个被放入L1 cache,对于目标数据,这个条带是纵向的。一条一条的旋转,这样原始数据就不用频繁读取了。

另一种方法是对原始数据进行列划分,而对目标数据进行行划分,这样就相当于对写内存数据进行了缓存,经试验,划分为128时效率最好,我估计写内存的时候应该和L2 cache关系较大,这样算下来,8k*128=1M,也就是说6M 的L2 cache可以利用大约1M来优化写内存?或者是操作系统规定大约缓存1M的数据就开始写内存?
我也不知道是否如此。

以上说法都是个人观点,不一定准确。

我的环境:win7 64bit VS2008 debug/x64
下面是我的测试时间,可以看出读缓存可以快大约4倍,另外,先测试的函数往往要多耗时间:

编译器选项/Od
CacheRead:      1985498775
CacheWrite:    2279309904
Simple:        7816269690
Press any key to continue . . .

编译器选项/O2
CacheRead:      1334952851
CacheWrite:    1714044429
Simple:        4617906147
Press any key to continue . . .

C/C++ code
    
    
    
    
#include < stdio.h > #include < stdlib.h > #include < memory.h > #include < intrin.h > #pragma intrinsic(__rdtsc) #define M 2047 #define N 2048 __declspec( align( 128 ) ) double aMatrix[M * N]; __declspec( align( 128 ) ) double bMatrix[N * M]; void Rot90_Simple( double * pDst, double * pSrc, int width, int height ) { int dataSize = width * height; double * pRunner = pSrc + width - 1 ; double * pSrcEnd = pSrc + dataSize; double * pDstEnd = pDst + dataSize; ++ dataSize; while ( pDst < pDstEnd ) { * pDst ++ = * pRunner; pRunner += width; if ( pRunner > pSrcEnd ) pRunner -= dataSize; } } void Rot90_CacheWrite( double * pDst, double * pSrc, int width, int height ) { int blockSize = 128 ; int wblock = width / blockSize; int hblock = height / blockSize; int wlast = width - blockSize * wblock; int hlast = height - blockSize * hblock; int nextRow = width + blockSize; int nextCol = blockSize * height - 1 ; int dataSize = width * height; double preRead = * pSrc; double * pRunner = pSrc + width - 1 ; double * pSrcEnd = pSrc + dataSize; double * pDstEnd = pDst + dataSize; for ( int i = 0 ; i < wblock; ++ i ) { while ( pRunner < pSrcEnd ) { for ( int j = 0 ; j < blockSize; ++ j ) { * pDst = * pRunner -- ; pDst += height; } pDst -= nextCol; pRunner += nextRow; } pDst += ( nextCol - height + 1 ); // preRead = pRunner[nextCol]; pRunner -= ( dataSize + blockSize ); } nextCol = wlast * height - 1 ; nextRow = width + wlast; while ( pRunner < pSrcEnd ) { for ( int j = 0 ; j < wlast; ++ j ) { * pDst = * pRunner -- ; pDst += height; } pDst -= nextCol; pRunner += nextRow; } } void Rot90_CacheRead( double * pDst, double * pSrc, int width, int height ) { int blockSize = 6 ; int wblock = width / blockSize; int hblock = height / blockSize; int wlast = width - blockSize * wblock; int hlast = height - blockSize * hblock; int nextRow = height - blockSize; int nextCol = blockSize * width + 1 ; int dataSize = width * height; double preRead = * pSrc; double * pRunner = pSrc + width - 1 ; double * pSrcEnd = pSrc + dataSize; double * pDstEnd = pDst + dataSize; for ( int i = 0 ; i < hblock; ++ i ) { while ( pDst < pDstEnd ) { for ( int j = 0 ; j < blockSize; ++ j ) { * pDst ++ = * pRunner; pRunner += width; } pDst += nextRow; pRunner -= nextCol; } pDst -= ( dataSize - blockSize ); preRead = pRunner[nextCol]; pRunner += ( nextCol + width - 1 ); } nextRow = height - hlast; nextCol = hlast * width + 1 ; while ( pDst < pDstEnd ) { for ( int j = 0 ; j < hlast; ++ j ) { * pDst ++ = * pRunner; pRunner += width; } pDst += nextRow; pRunner -= nextCol; } } float main() { unsigned __int64 t1, t2, t; int times = 20 ; for ( int i = 0 ; i < N * M; i ++ ) aMatrix[i] = 1.0 * rand() / RAND_MAX; t1 = __rdtsc(); for ( int i = 0 ; i < times ; ++ i ) Rot90_CacheRead( bMatrix, aMatrix, N, M ); // memcpy( aMatrix, bMatrix, M*N*sizeof(double) ); t2 = __rdtsc(); t = t2 - t1; printf( " CacheRead:/t%llu/n " , t ); for ( int i = 0 ; i < N * M; i ++ ) aMatrix[i] = 1.0 * rand() / RAND_MAX; t1 = __rdtsc(); for ( int i = 0 ; i < times ; ++ i ) Rot90_CacheWrite( bMatrix, aMatrix, N, M ); // memcpy( aMatrix, bMatrix, M*N*sizeof(double) ); t2 = __rdtsc(); t = t2 - t1; printf( " CacheWrite:/t%llu/n " , t ); for ( int i = 0 ; i < N * M; i ++ ) aMatrix[i] = 1.0 * rand() / RAND_MAX; t1 = __rdtsc(); for ( int i = 0 ; i < times ; ++ i ) Rot90_Simple( bMatrix, aMatrix, N, M ); // memcpy( aMatrix, bMatrix, M*N*sizeof(double) ); t2 = __rdtsc(); t = t2 - t1; printf( " Simple:/t/t%llu/n " , t ); return 0 .f; }

你可能感兴趣的:(算法,cache,测试,float,编译器,64bit)