My Optimization Process in the Intel Multi-Core Platform Coding Optimization Contest
[email protected] 2007.01.20
tags: multi-core programming, sse, sse2, Newton iteration, code optimization, optimization contest, invsqrt, square root

Intel recently held a coding optimization contest for multi-core platforms, and I took part in it. The code-submission phase of the contest is now over, so I can publish my optimization process.
At the end of last year, shipments of dual-core processors began to exceed those of single-core processors, and Intel released a quad-core CPU; AMD will also ship quad-core parts this year and plans eight-core parts for the second half of the year, so multi-core is clearly the trend. With earlier CPU performance growth, software naturally picked up the corresponding speedups for free; with the explosive performance growth that multi-core brings, that free lunch is over, and developers have to parallelize their software by hand. Many programmers, and computer-science education in general, are not well prepared for parallel programming (myself included). Against this background, Intel's multi-core coding optimization contest is of real practical value.
This article describes how I optimized the code over the course of the contest.
CPU used for testing and optimization: AMD64 X2 3600+ (dual-core) (an AMD 3000+ was also used early in development)
Operating system: Windows XP 32-bit
Compiler: Visual Studio 2005
The original code published by the contest:
(The organizers later required the computation and output to be accurate to 7 decimal places, so the output code shown here has been adjusted accordingly.)
/* compute the potential energy of a collection of  */
/* particles interacting via pairwise potential     */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>
#include <time.h>

#define NPARTS 1000
#define NITER 201
#define DIMS 3

int  rand(void);
int  computePot(void);
void initPositions(void);
void updatePositions(void);

double r[DIMS][NPARTS];
double pot;
double distx, disty, distz, dist;

int main() {
    int i;
    clock_t start, stop;

    initPositions();
    updatePositions();

    start = clock();
    for (i = 0; i < NITER; i++) {
        pot = 0.0;
        computePot();
        if (i % 10 == 0) printf("%5d: Potential: %10.7f\n", i, pot);
        updatePositions();
    }
    stop = clock();
    printf("Seconds = %10.9f\n", (double)(stop - start) / CLOCKS_PER_SEC);
}

void initPositions() {
    int i, j;
    for (i = 0; i < DIMS; i++)
        for (j = 0; j < NPARTS; j++)
            r[i][j] = 0.5 + ((double)rand() / (double)RAND_MAX);
}

void updatePositions() {
    int i, j;
    for (i = 0; i < DIMS; i++)
        for (j = 0; j < NPARTS; j++)
            r[i][j] -= 0.5 + ((double)rand() / (double)RAND_MAX);
}

int computePot() {
    int i, j;
    for (i = 0; i < NPARTS; i++) {
        for (j = 0; j < i - 1; j++) {
            distx = pow((r[0][j] - r[0][i]), 2);
            disty = pow((r[1][j] - r[1][i]), 2);
            distz = pow((r[2][j] - r[2][i]), 2);
            dist  = sqrt(distx + disty + distz);
            pot  += 1.0 / dist;
        }
    }
    return 0;
}
Execution time: 3.97 s.
The program is quite small, and it is easy to see that computePot is an O(N*N) algorithm and the biggest consumer of running time, more than 95% of the total (I recommend using a profiling tool to locate a program's hot spots).
The most obvious optimization in computePot is the call to pow: pow(x,y) is an expensive operation (many libraries special-case an integer y), so pow(x,2) is replaced with x*x.
The modified computePot is shown below (the rest of the code is unchanged):
inline double pow2(const double x) { return x * x; }

int computePot() {
    int i, j;
    for (i = 0; i < NPARTS; i++) {
        for (j = 0; j < i - 1; j++) {
            distx = pow2((r[0][j] - r[0][i]));
            disty = pow2((r[1][j] - r[1][i]));
            distz = pow2((r[2][j] - r[2][i]));
            dist  = sqrt(distx + disty + distz);
            pot  += 1.0 / dist;
        }
    }
    return 0;
}
Execution time: 2.50 s. Speedup from this change: 158.8%; speedup over the original code: 158.8%.
Because computePot accesses the array r in an order that does not match the order in which r is defined, I tried changing r's data layout:
change the definition double r[DIMS][NPARTS] to double r[NPARTS][DIMS].
(This change later turned out to be a long detour; the original vector layout actually makes the SSE2 code slightly simpler.)
The other code that accesses r was changed accordingly:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>
#include <time.h>

#define NPARTS 1000
#define NITER 201
#define DIMS 3

int  rand(void);
int  computePot(void);
void initPositions(void);
void updatePositions(void);

double r[NPARTS][DIMS];
double pot;
double distx, disty, distz, dist;

int main() {
    int i;
    clock_t start, stop;

    initPositions();
    updatePositions();

    start = clock();
    for (i = 0; i < NITER; i++) {
        pot = 0.0;
        computePot();
        if (i % 10 == 0) printf("%5d: Potential: %10.7f\n", i, pot);
        updatePositions();
    }
    stop = clock();
    printf("Seconds = %10.9f\n", (double)(stop - start) / CLOCKS_PER_SEC);
    getchar();
}

void initPositions() {
    int i, j;
    for (i = 0; i < DIMS; i++)
        for (j = 0; j < NPARTS; j++)
            r[j][i] = 0.5 + ((double)rand() / (double)RAND_MAX);
}

void updatePositions() {
    int i, j;
    for (i = 0; i < DIMS; i++)
        for (j = 0; j < NPARTS; j++)
            r[j][i] -= 0.5 + ((double)rand() / (double)RAND_MAX);
}

inline double pow2(const double x) { return x * x; }

int computePot() {
    int i, j;
    for (i = 0; i < NPARTS; i++) {
        for (j = 0; j < i - 1; j++) {
            distx = pow2((r[j][0] - r[i][0]));
            disty = pow2((r[j][1] - r[i][1]));
            distz = pow2((r[j][2] - r[i][2]));
            dist  = sqrt(distx + disty + distz);
            pot  += 1.0 / dist;
        }
    }
    return 0;
}
Execution time: 2.39 s. Speedup from this change: 104.6%; speedup over the original code: 166.1%.
r can be viewed as an array of points in 3-D space, and computePot computes a kind of potential energy from the distances between those points :)
To align each point to a 16-byte boundary,
the definition double r[NPARTS][DIMS] is changed to __declspec(align(16)) double r[NPARTS][DIMS+1];
(the rest of the code is unchanged; the speed drops slightly, but this prepares for the SSE2 optimization).
Execution time: 2.42 s. Speedup from this change: 98.8%; speedup over the original code: 164.0%.
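To make the reason for the padding concrete: with DIMS+1 = 4 doubles per row (32 bytes) and the array itself aligned to 16 bytes, every point starts on a 16-byte boundary, so its (x,y) pair and its (z,padding) pair can each be fetched with one aligned 128-bit load, which is exactly what the SSE2 code below does through its __m128d casts. A minimal stand-alone sketch of that access pattern (my own illustration, not part of the contest code):

#include <emmintrin.h>

__declspec(align(16)) double r[1000][4];   // 3 coordinates + 1 padding double per point

void load_point(int i) {
    // each row is 32 bytes and starts 16-byte aligned, so both halves can use the
    // aligned load; without the padding, odd-numbered rows would start at an
    // 8-byte offset and an aligned load (or the cast used below) would fault
    __m128d xy = _mm_load_pd(&r[i][0]);    // x and y
    __m128d zp = _mm_load_pd(&r[i][2]);    // z and the padding slot
    (void)xy; (void)zp;
}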
Next, use SSE2 to optimize computePot. SSE2 is a SIMD (single instruction, multiple data) instruction set that includes instructions operating on two 64-bit floating-point values at once.
The code needs #include <emmintrin.h>, which lets the compiler generate the SSE2 instructions, so no hand-written assembly is required.
The modified computePot is shown below (the rest of the code is unchanged):
int computePot() {
    int i, j;
    __m128d lcpot;
    lcpot = _mm_setzero_pd();
    for (i = 0; i < NPARTS; i++) {
        __m128d posi_l = *(__m128d*)(&r[i][0]);
        __m128d posi_h = *(__m128d*)(&r[i][2]);
        j = 0;
        for (; j + 1 < i; ++j)
        {
            __m128d dm0_l = _mm_sub_pd(*(__m128d*)(&r[j][0]), posi_l);
            __m128d dm0_h = _mm_sub_pd(*(__m128d*)(&r[j][2]), posi_h);
            dm0_l = _mm_mul_pd(dm0_l, dm0_l);
            dm0_h = _mm_mul_pd(dm0_h, dm0_h);
            __m128d dm0 = _mm_add_pd(dm0_l, dm0_h);
            __m128d dm1;
            dm1 = _mm_unpackhi_pd(dm0, dm0);
            dm0 = _mm_add_sd(dm0, dm1);
            dm1 = _mm_sqrt_sd(dm0, dm0);
            dm0 = _mm_div_sd(dm1, dm0);
            lcpot = _mm_add_sd(lcpot, dm0);
        }
    }
    __m128d dm1;
    dm1 = _mm_unpackhi_pd(lcpot, lcpot);
    lcpot = _mm_add_sd(lcpot, dm1);
    pot += *(double*)(&lcpot);
    return 0;
}
Execution time: 2.31 s. Speedup from this change: 104.8%; speedup over the original code: 171.9%.
(The gain is small: most of the time is spent in the _mm_sqrt_sd and _mm_div_sd instructions, so the SSE2 optimization does not really get a chance to show its effect.)
SSE3 has an _mm_hadd_pd instruction (horizontal add) that improves the code a little;
the program needs #include <intrin.h>:
int computePot() {
    int i, j;
    __m128d lcpot;
    lcpot = _mm_setzero_pd();
    for (i = 0; i < NPARTS; i++) {
        __m128d posi_l = *(__m128d*)(&r[i][0]);
        __m128d posi_h = *(__m128d*)(&r[i][2]);
        j = 0;
        for (; j + 1 < i; ++j)
        {
            __m128d dm0_l = _mm_sub_pd(*(__m128d*)(&r[j][0]), posi_l);
            __m128d dm0_h = _mm_sub_pd(*(__m128d*)(&r[j][2]), posi_h);
            dm0_l = _mm_mul_pd(dm0_l, dm0_l);
            dm0_h = _mm_mul_pd(dm0_h, dm0_h);
            __m128d dm0 = _mm_add_pd(dm0_l, dm0_h);
            __m128d dm1;
            // dm1 = _mm_unpackhi_pd(dm0,dm0);
            // dm0 = _mm_add_sd(dm0,dm1);
            dm0 = _mm_hadd_pd(dm0, dm0); // SSE3
            dm1 = _mm_sqrt_sd(dm0, dm0);
            dm0 = _mm_div_sd(dm1, dm0);
            lcpot = _mm_add_sd(lcpot, dm0);
        }
    }
    __m128d dm1;
    dm1 = _mm_unpackhi_pd(lcpot, lcpot);
    lcpot = _mm_add_sd(lcpot, dm1);
    pot += *(double*)(&lcpot);
    return 0;
}
Execution time: 2.22 s. Speedup from this change: 104.1%; speedup over the original code: 178.8%.
Next, thread computePot so that both CPU cores are used in parallel:
rewrite the computation inside computePot in a form that can run concurrently, split it into several tasks of reasonable size, and hand them to worker threads.
The program implements a fairly generic worker thread pool to make use of the multi-core CPU.
The code needs #include <vector> and #include <process.h>,
and the project's runtime library must be set to Multi-threaded (/MT) in the compiler options.
The complete implementation is given below. Execution time: 1.16 s. Speedup from this change: 191.4%; speedup over the original code: 342.2%.
The parallel code gives the program a huge performance boost!!
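One detail worth pointing out before the full listing is how computePot splits the rows between the workers. Row i of the triangular loop costs roughly i inner iterations, so splitting the rows evenly would give very uneven work; instead the code puts the boundary of worker k at roughly NPARTS*sqrt((k+1)/threadCount), which hands each worker an equal slice of the triangle. A small stand-alone check of that boundary formula (the formula is the one used in computePot below; N, T and the printout are only my illustration):

#include <stdio.h>
#include <math.h>

int main() {
    const long N = 1000, T = 2;
    long begin = 0;
    for (long k = 0; k < T; ++k) {
        long end = (k == T - 1) ? N
                 : (long)((double)(N - 1) * sqrt((double)(k + 1) / T) + 1 + 0.5);
        // pair evaluations handled by this worker: end*(end-1)/2 - begin*(begin-1)/2
        double work = (double)end * (end - 1) / 2 - (double)begin * (begin - 1) / 2;
        printf("worker %ld: rows [%ld,%ld)  ~%.0f pair evaluations\n", k, begin, end, work);
        begin = end;
    }
    return 0;
}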
/* compute the potential energy of a collection of  */
/* particles interacting via pairwise potential     */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>
#include <time.h>
#include <emmintrin.h>  // SSE2
#include <intrin.h>     // SSE3
#include <vector>
#include <process.h>    // threads

#define NPARTS 1000
#define NITER 201
#define DIMS 3

const int DefthreadCount = 2;  // set to 1 for single-threaded execution; with 4 cores it can be set to 4 :)

int  rand(void);
int  computePot(void);
void initPositions(void);
void updatePositions(void);

__declspec(align(16)) double r[NPARTS][DIMS + 1];
double pot;
double distx, disty, distz, dist;

int main() {
    int i;
    clock_t start, stop;

    initPositions();
    updatePositions();

    start = clock();
    for (i = 0; i < NITER; i++) {
        pot = 0.0;
        computePot();
        if (i % 10 == 0) printf("%5d: Potential: %10.7f\n", i, pot);
        updatePositions();
    }
    stop = clock();
    printf("Seconds = %10.9f\n", (double)(stop - start) / CLOCKS_PER_SEC);
    getchar();
}

void initPositions() {
    int i, j;
    for (i = 0; i < DIMS; i++)
        for (j = 0; j < NPARTS; j++)
            r[j][i] = 0.5 + ((double)rand() / (double)RAND_MAX);
}

void updatePositions() {
    int i, j;
    for (i = 0; i < DIMS; i++)
        for (j = 0; j < NPARTS; j++)
            r[j][i] -= 0.5 + ((double)rand() / (double)RAND_MAX);
}

struct TWorkData
{
    long   iBegin;
    long   iEnd;
    double fResult;
};

void computePotPart(TWorkData* work_data) {
    int i, j;
    __m128d lcpot;
    lcpot = _mm_setzero_pd();
    for (i = work_data->iBegin; i < work_data->iEnd; ++i) {
        __m128d posi_l = *(__m128d*)(&r[i][0]);
        __m128d posi_h = *(__m128d*)(&r[i][2]);
        j = 0;
        for (; j + 1 < i; ++j)
        {
            __m128d dm0_l = _mm_sub_pd(*(__m128d*)(&r[j][0]), posi_l);
            __m128d dm0_h = _mm_sub_pd(*(__m128d*)(&r[j][2]), posi_h);
            dm0_l = _mm_mul_pd(dm0_l, dm0_l);
            dm0_h = _mm_mul_pd(dm0_h, dm0_h);
            __m128d dm0 = _mm_add_pd(dm0_l, dm0_h);
            __m128d dm1;
            // dm1 = _mm_unpackhi_pd(dm0,dm0);
            // dm0 = _mm_add_sd(dm0,dm1);
            dm0 = _mm_hadd_pd(dm0, dm0); // SSE3
            dm1 = _mm_sqrt_sd(dm0, dm0);
            dm0 = _mm_div_sd(dm1, dm0);
            lcpot = _mm_add_sd(lcpot, dm0);
        }
    }
    __m128d dm1;
    dm1 = _mm_unpackhi_pd(lcpot, lcpot);
    lcpot = _mm_add_sd(lcpot, dm1);
    work_data->fResult = *(double*)(&lcpot);
}

/////////////////////////////////////////////////////////////
// worker thread pool: TWorkThreadPool
// used to split one task into multiple thread tasks
// requires each sub-task to have a similar workload
// todo: switch to a work-claiming model
class TWorkThreadPool;

typedef void (*TThreadCallBack)(void* pData);
enum TThreadState { thrStartup = 0, thrReady, thrBusy, thrTerminate, thrDeath };

class TWorkThread
{
public:
    volatile HANDLE            thread_handle;
    volatile enum TThreadState state;
    volatile TThreadCallBack   func;
    volatile void*             pdata;   // work data
    volatile HANDLE            waitfor_event;
    TWorkThreadPool*           pool;
    volatile DWORD             thread_ThreadAffinityMask;

    TWorkThread() { memset(this, 0, sizeof(TWorkThread)); }
};

void do_work_end(TWorkThread* thread_data);

void __cdecl thread_dowork(TWorkThread* thread_data) // void __stdcall thread_dowork(TWorkThread* thread_data)
{
    volatile TThreadState& state = thread_data->state;
    SetThreadAffinityMask(GetCurrentThread(), thread_data->thread_ThreadAffinityMask);
    state = thrStartup;

    while (true)
    {
        WaitForSingleObject(thread_data->waitfor_event, -1);
        if (state == thrTerminate)
            break;

        state = thrBusy;
        volatile TThreadCallBack& func = thread_data->func;
        if (func != 0)
            func((void*)thread_data->pdata);
        do_work_end(thread_data);
    }
    state = thrDeath;
    _endthread(); // ExitThread(0);
}

class TWorkThreadPool
{
private:
    volatile HANDLE          thread_event;
    volatile HANDLE          new_thread_event;
    std::vector<TWorkThread> work_threads;

    inline int passel_count() const { return (int)work_threads.size() + 1; }

    void inti_threads(long work_count) {
        SYSTEM_INFO SystemInfo;
        GetSystemInfo(&SystemInfo);
        long cpu_count  = SystemInfo.dwNumberOfProcessors;
        long best_count = cpu_count;
        if (cpu_count > work_count) best_count = work_count;

        long newthrcount = best_count - 1;
        work_threads.resize(newthrcount);
        thread_event     = CreateSemaphore(0, 0, newthrcount, 0);
        new_thread_event = CreateSemaphore(0, 0, newthrcount, 0);
        for (long i = 0; i < newthrcount; ++i)
        {
            work_threads[i].waitfor_event = thread_event;
            work_threads[i].state = thrTerminate;
            work_threads[i].pool  = this;
            work_threads[i].thread_ThreadAffinityMask = 1 << (i + 1);
            // DWORD thr_id;
            work_threads[i].thread_handle = (HANDLE)_beginthread((void (__cdecl*)(void*))thread_dowork, 0, (void*)&work_threads[i]);
            // CreateThread(0, 0, (LPTHREAD_START_ROUTINE)thread_dowork,(void*) &work_threads[i], 0, &thr_id);
        }
        SetThreadAffinityMask(GetCurrentThread(), 0x01);
        for (long i = 0; i < newthrcount; ++i)
        {
            while (true) {
                if (work_threads[i].state == thrStartup) break;
                else Sleep(0);
            }
            work_threads[i].state = thrReady;
        }
    }

    void free_threads(void)
    {
        long thr_count = (long)work_threads.size();
        long i;
        for (i = 0; i < thr_count; ++i)
        {
            while (true) {
                if (work_threads[i].state == thrReady) break;
                else Sleep(0);
            }
            work_threads[i].state = thrTerminate;
        }
        if (thr_count > 0)
            ReleaseSemaphore(thread_event, thr_count, 0);
        for (i = 0; i < thr_count; ++i)
        {
            while (true) {
                if (work_threads[i].state == thrDeath) break;
                else Sleep(0);
            }
        }
        CloseHandle(thread_event);
        CloseHandle(new_thread_event);
        work_threads.clear();
    }

    void passel_work(TThreadCallBack work_proc, void** word_data_list, int work_count) {
        // assert(work_count>=1);
        // assert(work_count<=passel_count());
        if (work_count == 1)
        {
            work_proc(word_data_list[0]);
        }
        else
        {
            long i;
            long thr_count = (long)work_threads.size();
            for (i = 0; i < work_count - 1; ++i)
            {
                work_threads[i].func  = work_proc;
                work_threads[i].pdata = word_data_list[i];
                work_threads[i].state = thrBusy;
            }
            for (i = work_count - 1; i < thr_count; ++i)
            {
                work_threads[i].func  = 0;
                work_threads[i].pdata = 0;
                work_threads[i].state = thrBusy;
            }
            if (thr_count > 0)
                ReleaseSemaphore(thread_event, thr_count, 0);
            // the current thread does one work item as well
            work_proc(word_data_list[work_count - 1]);
            // wait for the work to finish
            for (i = 0; i < thr_count; ++i)
            {
                while (true) {
                    if (work_threads[i].state == thrReady) break;
                    else Sleep(0);
                }
            }
            std::swap(thread_event, new_thread_event);
        }
    }

public:
    explicit TWorkThreadPool(unsigned long work_count) : thread_event(0), work_threads() {
        // assert(work_count>=1);
        inti_threads(work_count);
    }
    ~TWorkThreadPool() { free_threads(); }

    long best_work_count() const { return passel_count(); }

    void work_execute(TThreadCallBack work_proc, void** word_data_list, int work_count) {
        while (work_count > 0)
        {
            long passel_work_count;
            if (work_count >= passel_count())
                passel_work_count = passel_count();
            else
                passel_work_count = work_count;

            passel_work(work_proc, word_data_list, passel_work_count);

            work_count     -= passel_work_count;
            word_data_list  = &word_data_list[passel_work_count];
        }
    }

    inline void DoWorkEnd(TWorkThread* thread_data) {
        thread_data->waitfor_event = new_thread_event;
        thread_data->func  = 0;
        thread_data->state = thrReady;
    }
};

void do_work_end(TWorkThread* thread_data)
{
    thread_data->pool->DoWorkEnd(thread_data);
}
// TWorkThreadPool end;
/////////////////////////////////////////////////////////////

int computePot() {
    static bool is_inti_work = false;
    static TWorkThreadPool g_work_thread_pool(DefthreadCount); // thread pool
    static TWorkData  work_list[DefthreadCount];
    static TWorkData* pwork_list[DefthreadCount];
    int i;

    if (!is_inti_work)
    {
        long fi = 0;
        for (int i = 0; i < DefthreadCount; ++i)
        {
            if (0 == i)
                work_list[i].iBegin = 0;
            else
                work_list[i].iBegin = work_list[i - 1].iEnd;
            if (i == DefthreadCount - 1)
                work_list[i].iEnd = (long)NPARTS;
            else
                work_list[i].iEnd = (long)((double)(NPARTS - 1) * sqrt((double)(i + 1) / DefthreadCount) + 1 + 0.5); // todo: a more accurate work-partition scheme
            pwork_list[i] = &work_list[i];
        }
        is_inti_work = true;
    }

    g_work_thread_pool.work_execute((TThreadCallBack)computePotPart, (void**)(&pwork_list[0]), DefthreadCount);

    for (i = 0; i < DefthreadCount; ++i)
        pot += work_list[i].fResult;
    return 0;
}
In computePot, the square root (sqrt) and the reciprocal (div) take up most of the computation time. In fact, the combined operation invSqrt(a) = 1/sqrt(a) can be computed very quickly.
The method is Newton's iteration; the principle, the derivation of the invSqrt iteration formula, and the convergence proof can be found in my blog post
"Code optimization: optimizing division": http://blog.csdn.net/housisong/archive/2006/08/25/1116423.aspx
The Newton iteration for invSqrt(a) = 1/sqrt(a) is: x_next = (3 - a*x*x)*x*0.5;
start from an approximate value of invSqrt(a), feed it into the iteration, and you get a more accurate value.
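As a plain scalar illustration of how quickly this converges (my own sketch, not the contest code; the initial estimate comes from the single-precision rsqrtss result, which is good to roughly 12 bits, so two iterations already get close to full double precision):

#include <stdio.h>
#include <math.h>
#include <xmmintrin.h>

// refine the rough rsqrtss estimate of 1/sqrt(a) with x_next = (3 - a*x*x)*x*0.5
double inv_sqrt(double a) {
    float est;
    _mm_store_ss(&est, _mm_rsqrt_ss(_mm_set_ss((float)a)));  // initial approximation
    double x = est;
    x = (3.0 - a * x * x) * x * 0.5;   // 1st iteration: ~24 bits of accuracy
    x = (3.0 - a * x * x) * x * 0.5;   // 2nd iteration: ~48 bits of accuracy
    return x;
}

int main() {
    double a = 3.0;
    printf("invSqrt(%g) = %.17f\n", a, inv_sqrt(a));
    printf("1/sqrt(%g)  = %.17f\n", a, 1.0 / sqrt(a));
    return 0;
}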
There are many ways to produce an initial approximation of invSqrt(a); SSE provides the _mm_rsqrt_ps instruction for exactly that, so its result can be used as the starting value
and refined with several iterations until the desired precision is reached. The implementation is below (the rest of the code is unchanged). Execution time: 1.05 s. Speedup from this change: 110.5%; speedup over the original code: 378.1%.
const __m128  xmms1_5  = { (1.5), (1.5), (1.5), (1.5) };
const __m128d xmmd1_5  = { (1.5), (1.5) };
const __m128  xmms_0_5 = { (-0.5), (-0.5), (-0.5), (-0.5) };
const __m128d xmmd_0_5 = { (-0.5), (-0.5) };

void computePotPart(TWorkData* work_data) {
    int i, j;
    __m128d lcpot;
    lcpot = _mm_setzero_pd();
    for (i = work_data->iBegin; i < work_data->iEnd; ++i) {
        __m128d posi_l = *(__m128d*)(&r[i][0]);
        __m128d posi_h = *(__m128d*)(&r[i][2]);
        j = 0;
        for (; j + 1 < i; ++j)
        {
            __m128d dm0_l = _mm_sub_pd(*(__m128d*)(&r[j][0]), posi_l);
            __m128d dm0_h = _mm_sub_pd(*(__m128d*)(&r[j][2]), posi_h);
            dm0_l = _mm_mul_pd(dm0_l, dm0_l);
            dm0_h = _mm_mul_pd(dm0_h, dm0_h);
            __m128d dm0 = _mm_add_pd(dm0_l, dm0_h);
            __m128d dm1;
            // dm1 = _mm_unpackhi_pd(dm0,dm0);
            // dm0 = _mm_add_sd(dm0,dm1);
            dm0 = _mm_hadd_pd(dm0, dm0); // SSE3
            // dm1 = _mm_sqrt_sd(dm0,dm0);
            // dm0 = _mm_div_sd(dm1,dm0);
            // use the SSE rsqrt to approximate 1/sqrt(a), then refine it with Newton iterations
            // Newton iteration for 1/sqrt(a): x_next = (3 - a*x*x)*x*0.5 = (1.5 + (a*(-0.5))*(x*x)) * x
            {
                __m128 sm0 = _mm_cvtpd_ps(dm0);
                sm0 = _mm_rsqrt_ss(sm0);                 // approximate 1/sqrt(a)
                __m128d dma = _mm_mul_sd(dm0, xmmd_0_5); // a*(-0.5)
                dm0 = _mm_cvtps_pd(sm0);

                // Newton iteration to improve the square-root precision
                dm1 = _mm_mul_sd(dm0, dm0);
                dm1 = _mm_mul_sd(dm1, dma);
                dm1 = _mm_add_sd(dm1, xmmd1_5);
                dm0 = _mm_mul_sd(dm0, dm1);

                // iterate again to double the precision; the result is then almost identical to the original expression
                dm1 = _mm_mul_sd(dm0, dm0);
                dm1 = _mm_mul_sd(dm1, dma);
                dm1 = _mm_add_sd(dm1, xmmd1_5);
                dm0 = _mm_mul_sd(dm0, dm1);
            }
            lcpot = _mm_add_sd(lcpot, dm0);
        }
    }
    __m128d dm1;
    dm1 = _mm_unpackhi_pd(lcpot, lcpot);
    lcpot = _mm_add_sd(lcpot, dm1);
    work_data->fResult = *(double*)(&lcpot);
}
Unroll the inner j loop of computePot four ways so that the full SSE computation width can be used; the code is below.
(I did not test deeper unrolling, such as 8-way; it should further help instruction issue and parallel execution.) Execution time: 0.70 s. Speedup from this change: 150.0%; speedup over the original code: 567.1%.
const __m128  xmms1_5  = { (1.5), (1.5), (1.5), (1.5) };
const __m128d xmmd1_5  = { (1.5), (1.5) };
const __m128  xmms_0_5 = { (-0.5), (-0.5), (-0.5), (-0.5) };
const __m128d xmmd_0_5 = { (-0.5), (-0.5) };

void computePotPart(TWorkData* work_data) {
    int i, j;
    __m128d lcpot;
    lcpot = _mm_setzero_pd();
    for (i = work_data->iBegin; i < work_data->iEnd; ++i) {
        __m128d posi_l = *(__m128d*)(&r[i][0]);
        __m128d posi_h = *(__m128d*)(&r[i][2]);
        j = 0;
        //* unroll this loop 4 times
        for (; j + 4 < i; j += 4)
        {
            __m128d dm0_l = _mm_sub_pd(*(__m128d*)(&r[j  ][0]), posi_l);
            __m128d dm0_h = _mm_sub_pd(*(__m128d*)(&r[j  ][2]), posi_h);
            __m128d dm1_l = _mm_sub_pd(*(__m128d*)(&r[j+1][0]), posi_l);
            __m128d dm1_h = _mm_sub_pd(*(__m128d*)(&r[j+1][2]), posi_h);
            dm0_l = _mm_mul_pd(dm0_l, dm0_l);
            dm0_h = _mm_mul_pd(dm0_h, dm0_h);
            dm1_l = _mm_mul_pd(dm1_l, dm1_l);
            dm1_h = _mm_mul_pd(dm1_h, dm1_h);
            __m128d dm0, dm1, dm2;
            dm1 = _mm_add_pd(dm0_l, dm0_h);
            dm2 = _mm_add_pd(dm1_l, dm1_h);
            dm0 = _mm_hadd_pd(dm1, dm2); // SSE3

            dm0_l = _mm_sub_pd(*(__m128d*)(&r[j+2][0]), posi_l);
            dm0_h = _mm_sub_pd(*(__m128d*)(&r[j+2][2]), posi_h);
            dm1_l = _mm_sub_pd(*(__m128d*)(&r[j+3][0]), posi_l);
            dm1_h = _mm_sub_pd(*(__m128d*)(&r[j+3][2]), posi_h);
            dm0_l = _mm_mul_pd(dm0_l, dm0_l);
            dm0_h = _mm_mul_pd(dm0_h, dm0_h);
            dm1_l = _mm_mul_pd(dm1_l, dm1_l);
            dm1_h = _mm_mul_pd(dm1_h, dm1_h);
            __m128d dm5;
            dm1 = _mm_add_pd(dm0_l, dm0_h);
            dm2 = _mm_add_pd(dm1_l, dm1_h);
            dm5 = _mm_hadd_pd(dm1, dm2); // SSE3

            // use the SSE rsqrt to approximate 1/sqrt(a), then refine it with Newton iterations
            {
                __m128 sm0 = _mm_cvtpd_ps(dm0);
                __m128 sm1 = _mm_cvtpd_ps(dm5);
                sm0 = _mm_movelh_ps(sm0, sm1);
                __m128 sma = _mm_mul_ps(sm0, xmms_0_5); // a*(-0.5)
                sm0 = _mm_rsqrt_ps(sm0);                // approximate 1/sqrt(a)

                // Newton iteration to improve the square-root precision
                sm1 = _mm_mul_ps(sm0, sm0);
                sm1 = _mm_mul_ps(sm1, sma);
                sm1 = _mm_add_ps(sm1, xmms1_5);
                sm0 = _mm_mul_ps(sm0, sm1);

                __m128d dma = _mm_mul_pd(dm0, xmmd_0_5); // a*(-0.5)
                __m128d dmb = _mm_mul_pd(dm5, xmmd_0_5);
                sm1 = _mm_movehl_ps(sm0, sm0);
                dm0 = _mm_cvtps_pd(sm0);
                dm5 = _mm_cvtps_pd(sm1);

                // iterate again to double the precision
                dm1 = _mm_mul_pd(dm0, dm0);
                dm1 = _mm_mul_pd(dm1, dma);
                dm1 = _mm_add_pd(dm1, xmmd1_5);
                dm0 = _mm_mul_pd(dm0, dm1);

                dm2 = _mm_mul_pd(dm5, dm5);
                dm2 = _mm_mul_pd(dm2, dmb);
                dm2 = _mm_add_pd(dm2, xmmd1_5);
                dm5 = _mm_mul_pd(dm5, dm2);
            }
            dm0 = _mm_add_pd(dm0, dm5);
            lcpot = _mm_add_pd(lcpot, dm0);
        }
        //*/
        for (; j + 1 < i; ++j)
        {
            __m128d dm0_l = _mm_sub_pd(*(__m128d*)(&r[j][0]), posi_l);
            __m128d dm0_h = _mm_sub_pd(*(__m128d*)(&r[j][2]), posi_h);
            dm0_l = _mm_mul_pd(dm0_l, dm0_l);
            dm0_h = _mm_mul_pd(dm0_h, dm0_h);
            __m128d dm0 = _mm_add_pd(dm0_l, dm0_h);
            __m128d dm1;
            // dm0 = _mm_hadd_pd(dm0,dm0); // SSE3
            dm1 = _mm_unpackhi_pd(dm0, dm0);
            dm0 = _mm_add_sd(dm0, dm1);
            // dm1 = _mm_sqrt_sd(dm0,dm0);
            // dm0 = _mm_div_sd(dm1,dm0);
            // use the SSE rsqrt to approximate 1/sqrt(a), then refine it with Newton iterations
            {
                __m128 sm0 = _mm_cvtpd_ps(dm0);
                sm0 = _mm_rsqrt_ss(sm0);                 // approximate 1/sqrt(a)
                __m128d dma = _mm_mul_sd(dm0, xmmd_0_5); // a*(-0.5)
                dm0 = _mm_cvtps_pd(sm0);

                // Newton iteration to improve the square-root precision
                dm1 = _mm_mul_sd(dm0, dm0);
                dm1 = _mm_mul_sd(dm1, dma);
                dm1 = _mm_add_sd(dm1, xmmd1_5);
                dm0 = _mm_mul_sd(dm0, dm1);

                // iterate again to double the precision; the result is then almost identical to the original expression
                dm1 = _mm_mul_sd(dm0, dm0);
                dm1 = _mm_mul_sd(dm1, dma);
                dm1 = _mm_add_sd(dm1, xmmd1_5);
                dm0 = _mm_mul_sd(dm0, dm1);
            }
            lcpot = _mm_add_sd(lcpot, dm0);
        }
    }
    __m128d dm1;
    dm1 = _mm_unpackhi_pd(lcpot, lcpot);
    lcpot = _mm_add_sd(lcpot, dm1);
    work_data->fResult = *(double*)(&lcpot);
}
Next, the computePotPart function above was rewritten in assembly. Execution time: 0.59 s. Speedup from this change: 118.6%; speedup over the original code: 672.9%.
#define _IS_CPU_SSE3_ACTIVE  // remove this flag to disable SSE3
#define asm __asm

const __m128  xmms1_5  = { (1.5), (1.5), (1.5), (1.5) };
const __m128d xmmd1_5  = { (1.5), (1.5) };
const __m128  xmms_0_5 = { (-0.5), (-0.5), (-0.5), (-0.5) };
const __m128d xmmd_0_5 = { (-0.5), (-0.5) };

void __declspec(naked) computePotPart(TWorkData* work_data) {
    // work_data: [ebp+8]
    asm
    {
        push    ebp
        mov     ebp, esp
        and     esp, -16              // align to 16 bytes
        push    edi
        sub     esp, 4+8+16           // for a temp __m128d
        mov     edi, [ebp+8]          // work_data
        mov     ecx, [edi]TWorkData.iBegin
        mov     edx, [edi]TWorkData.iEnd
        cmp     ecx, edx
        pxor    xmm7, xmm7            // fResult
        jge     $for_end
        mov     [esp+16], esi         // save esi
        mov     eax, ecx
        mov     [esp+20], ebx         // save ebx
        shl     eax, 5
$for_begin:
        movapd  xmm5, [r+eax]
        movapd  xmm6, [r+eax+16]
        xor     ebx, ebx
        cmp     ecx, 4
        jle     $border
        mov     esi, 4
        mov     edi, offset r
        align   4
$for4_begin:
        movapd  xmm0, [edi]
        movapd  xmm4, [edi+16]
        movapd  xmm2, [edi+32]
        movapd  xmm3, [edi+48]
        movapd  xmm1, [edi+96]
        subpd   xmm0, xmm5
        mulpd   xmm0, xmm0
        subpd   xmm4, xmm6
        mulpd   xmm4, xmm4
        addpd   xmm0, xmm4
        movapd  xmm4, [edi+64]
        subpd   xmm2, xmm5
        mulpd   xmm2, xmm2
        subpd   xmm3, xmm6
        mulpd   xmm3, xmm3
        addpd   xmm2, xmm3
        movapd  xmm3, [edi+80]
        subpd   xmm4, xmm5
        mulpd   xmm4, xmm4
#ifdef _IS_CPU_SSE3_ACTIVE // SSE3
        haddpd  xmm0, xmm2
#else
        movapd  [esp], xmm1           // temp __m128d
        movapd  xmm1, xmm0
        unpckhpd xmm0, xmm2
        unpcklpd xmm1, xmm2
        addpd   xmm0, xmm1
        movapd  xmm1, [esp]
#endif
        movapd  xmm2, [edi+112]
        subpd   xmm3, xmm6
        mulpd   xmm3, xmm3
        addpd   xmm4, xmm3
        subpd   xmm1, xmm5
        mulpd   xmm1, xmm1
        subpd   xmm2, xmm6
        mulpd   xmm2, xmm2
        addpd   xmm1, xmm2
        movaps  xmm2, xmms_0_5
#ifdef _IS_CPU_SSE3_ACTIVE // SSE3
        haddpd  xmm4, xmm1
#else
        movapd  xmm3, xmm4
        unpckhpd xmm4, xmm1
        unpcklpd xmm3, xmm1
        addpd   xmm4, xmm3
#endif
        cvtpd2ps xmm1, xmm0
        mov     ebx, esi
        cvtpd2ps xmm3, xmm4
        add     edi, 4*32
        movlhps xmm1, xmm3
        mulps   xmm2, xmm1
        rsqrtps xmm3, xmm1
        movaps  xmm1, xmm3
        mulps   xmm1, xmm3
        mulps   xmm1, xmm2
        movapd  xmm2, xmmd_0_5
        mulpd   xmm0, xmm2
        mulpd   xmm4, xmm2
        addps   xmm1, xmms1_5
        mulps   xmm3, xmm1
        cvtps2pd xmm1, xmm3
        add     esi, 4
        movaps  xmm2, xmm3
        movhlps xmm2, xmm3
        cvtps2pd xmm3, xmm2
        cmp     ecx, esi
        movapd  xmm2, xmm1
        mulpd   xmm2, xmm1
        mulpd   xmm2, xmm0
        movapd  xmm0, xmmd1_5
        addpd   xmm2, xmm0
        mulpd   xmm1, xmm2
        movapd  xmm2, xmm3
        mulpd   xmm2, xmm3
        mulpd   xmm2, xmm4
        addpd   xmm2, xmm0
        mulpd   xmm3, xmm2
        addpd   xmm1, xmm3
        addpd   xmm7, xmm1
        jg      $for4_begin
$border:
        lea     esi, [ebx+1]
        cmp     ecx, esi
        jle     $for_border_end
        movapd  xmm4, xmmd1_5
        movapd  xmm3, xmmd_0_5
$for_border_begin:
        shl     ebx, 5
        movapd  xmm2, [r+ebx]
        movapd  xmm1, [r+ebx+16]
        subpd   xmm2, xmm5
        mulpd   xmm2, xmm2
        subpd   xmm1, xmm6
        mulpd   xmm1, xmm1
        addpd   xmm2, xmm1
#ifdef _IS_CPU_SSE3_ACTIVE // SSE3
        haddpd  xmm2, xmm2
#else
        movapd  xmm0, xmm2
        unpckhpd xmm0, xmm0
        addsd   xmm2, xmm0
#endif
        cvtpd2ps xmm0, xmm2
        mov     ebx, esi
        add     esi, 1
        mulsd   xmm2, xmm3
        rsqrtss xmm0, xmm0
        cvtps2pd xmm0, xmm0
        cmp     ecx, esi
        movapd  xmm1, xmm0
        mulsd   xmm1, xmm0
        mulsd   xmm1, xmm2
        addsd   xmm1, xmm4
        mulsd   xmm0, xmm1
        movapd  xmm1, xmm0
        mulsd   xmm1, xmm0
        mulsd   xmm1, xmm2
        addsd   xmm1, xmm4
        mulsd   xmm0, xmm1
        addsd   xmm7, xmm0
        jg      $for_border_begin
$for_border_end:
        add     ecx, 1
        add     eax, 32
        cmp     ecx, edx
        jl      $for_begin
        mov     esi, [esp+16]         // restore esi
        mov     ebx, [esp+20]         // restore ebx
$for_end:
#ifdef _IS_CPU_SSE3_ACTIVE // SSE3
        haddpd  xmm7, xmm7
#else
        movapd  xmm0, xmm7
        unpckhpd xmm0, xmm0
        addpd   xmm7, xmm0
#endif
        mov     edi, [ebp+8]          // work_data
        movsd   qword ptr [edi]TWorkData.fResult, xmm7
        add     esp, 16+4+8
        pop     edi
        mov     esp, ebp
        pop     ebp
        ret
    }
}
Changing the layout of the r array helped memory access, but it also added a certain amount of code complexity.
Since the amount of data in this program is small to begin with, I tried keeping the original vectorized layout, which also keeps the code a bit simpler,
and dropped SSE3. This led to new code that keeps the definition double r[DIMS][NPARTS] and also
improves a few other places in the program that could be made faster.
Execution time: 0.54 s. Speedup from this change: 109.3%; speedup over the original code: 735.2%.
The complete code follows:
/* compute the potential energy of a collection of  */
/* particles interacting via pairwise potential     */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>
#include <xmmintrin.h>
#include <process.h>
#include <vector>
//#include <iostream>
#include <emmintrin.h> // SSE2

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// author: HouSisong  [email protected]
// version whose results are computed in double precision
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////

//#define _MY_TIME_CLOCK

#ifdef _MY_TIME_CLOCK
    #define clock_t __int64
    clock_t CLOCKS_PER_SEC = 0;
    inline clock_t clock() {
        __int64 result;
        if (CLOCKS_PER_SEC == 0)
        {
            QueryPerformanceFrequency((LARGE_INTEGER*)&result);
            CLOCKS_PER_SEC = (clock_t)result;
        }
        QueryPerformanceCounter((LARGE_INTEGER*)&result);
        return (clock_t)result;
    }
#else
    #include <time.h>
#endif

#define _IS_USES_MY_RAND // rand is called from a single thread only, so a custom rand is safe

const long DefthreadCount = 2; // 1,2,4,8,16,... split the computation into this many parallel tasks

inline double& m128d_value(__m128d& x, const long index) { return ((double*)(&x))[index]; }

#define NPARTS 1000
#define NITER 201
#define DIMS 3

#ifdef _IS_USES_MY_RAND
    class CMyRand
    {
    private:
        unsigned long _my_holdrand;
    public:
        CMyRand() : _my_holdrand(1) {}
        inline int _my_rand(void)
        {
            unsigned long result = _my_holdrand * 214013 + 2531011;
            _my_holdrand = result;
            return ((result >> 16) & RAND_MAX);
        }
    };
    CMyRand _MyRand;
    inline int _my_rand(void) { return _MyRand._my_rand(); }
#else
    #define _my_rand rand
#endif

int  rand(void);
int  computePot();
void initPositions(void);
void updatePositions(void);

__declspec(align(16)) double r[DIMS][(NPARTS + 1) / 2 * 2]; // 16-byte aligned
enum { vector_byte_size = sizeof(r) / DIMS };
double pot;

int main() {
    int i;
    clock_t start, stop;
    // char ctmp; std::cin>>ctmp;

    initPositions();
    updatePositions();

    start = clock();
    for (i = 0; i < NITER; i++) {
        pot = 0.0;
        computePot();
        if (i % 10 == 0) printf("%5d: Potential: %20.7f\n", i, pot);
        updatePositions();
    }
    pot = 0.0;
    stop = clock();
    printf("Seconds = %10.9f\n", (double)(stop - start) / CLOCKS_PER_SEC);
    return 0;
}

void initPositions() {
    int i, j;
    for (i = 0; i < DIMS; i++) {
        for (j = 0; j < NPARTS; ++j)
        {
            r[i][j] = (0.5 + _my_rand() * (1.0 / RAND_MAX));
        }
    }
}

void updatePositions() {
    int i, j;
    for (i = 0; i < DIMS; ++i)
    {
        for (j = 0; j < NPARTS; ++j)
        {
            r[i][j] -= (0.5 + _my_rand() * (1.0 / RAND_MAX));
        }
    }
}

struct TWorkData
{
    long   iBegin;
    long   iEnd;
    double fResult;
};

const __m128  xmms_0_5 = { (-0.5), (-0.5), (-0.5), (-0.5) };
const __m128d xmmd_0_5 = { (-0.5), (-0.5) };
const __m128  xmms1_5  = { (1.5), (1.5), (1.5), (1.5) };
const __m128d xmmd1_5  = { (1.5), (1.5) };

void computePotPart_forj(int i, __m128d* pResult)
{
    __m128d lcpot = _mm_setzero_pd();
    __m128d _mmi0 = _mm_set1_pd(-r[0][i]);
    __m128d _mmi1 = _mm_set1_pd(-r[1][i]);
    __m128d _mmi2 = _mm_set1_pd(-r[2][i]);
    int j = 0;
    //*
    for (; j + 4 < i; j += 4)
    {
        __m128d dm0 = _mm_add_pd(*(__m128d*)&r[0][j], _mmi0);
        dm0 = _mm_mul_pd(dm0, dm0);
        __m128d dm1 = _mm_add_pd(*(__m128d*)&r[1][j], _mmi1);
        dm1 = _mm_mul_pd(dm1, dm1);
        __m128d dm2 = _mm_add_pd(*(__m128d*)&r[2][j], _mmi2);
        dm2 = _mm_mul_pd(dm2, dm2);
        dm0 = _mm_add_pd(dm0, dm1);
        dm0 = _mm_add_pd(dm0, dm2);

        __m128d dm5 = _mm_add_pd(*(__m128d*)&r[0][j+2], _mmi0);
        dm5 = _mm_mul_pd(dm5, dm5);
        __m128d dm6 = _mm_add_pd(*(__m128d*)&r[1][j+2], _mmi1);
        dm6 = _mm_mul_pd(dm6, dm6);
        dm2 = _mm_add_pd(*(__m128d*)&r[2][j+2], _mmi2);
        dm2 = _mm_mul_pd(dm2, dm2);
        dm5 = _mm_add_pd(dm5, dm6);
        dm5 = _mm_add_pd(dm5, dm2);

        // use the SSE rsqrt to approximate 1/sqrt(a), then refine it with Newton iterations
        // Newton iteration for 1/sqrt(a): x_next = (3 - a*x*x)*x*0.5 = (1.5 + (a*(-0.5))*(x*x)) * x
        {
            __m128 sm0 = _mm_cvtpd_ps(dm0);
            __m128 sm1 = _mm_cvtpd_ps(dm5);
            sm0 = _mm_movelh_ps(sm0, sm1);
            __m128 sma = _mm_mul_ps(sm0, xmms_0_5); // a*(-0.5)
            sm0 = _mm_rsqrt_ps(sm0);                // approximate 1/sqrt(a)

            // Newton iteration to improve the square-root precision
            sma = _mm_mul_ps(sma, sm0);
            sma = _mm_mul_ps(sma, sm0);
            sma = _mm_add_ps(sma, xmms1_5);
            sm0 = _mm_mul_ps(sm0, sma);

            __m128d dma = _mm_mul_pd(dm0, xmmd_0_5); // a*(-0.5)
            __m128d dmb = _mm_mul_pd(dm5, xmmd_0_5);
            sm1 = _mm_movehl_ps(sm1, sm0);
            dm0 = _mm_cvtps_pd(sm0);
            dm5 = _mm_cvtps_pd(sm1);

            // iterate again to double the precision; the result is then virtually identical to the original expression
            dma = _mm_mul_pd(dma, dm0);
            dmb = _mm_mul_pd(dmb, dm5);
            dma = _mm_mul_pd(dma, dm0);
            dmb = _mm_mul_pd(dmb, dm5);
            dma = _mm_add_pd(dma, xmmd1_5);
            dmb = _mm_add_pd(dmb, xmmd1_5);
            dm0 = _mm_mul_pd(dm0, dma);
            dm5 = _mm_mul_pd(dm5, dmb);
        }
        lcpot = _mm_add_pd(lcpot, dm0);
        lcpot = _mm_add_pd(lcpot, dm5);
    }
    for (; j + 1 < i; ++j)
    {
        __m128d dm0 = _mm_set_sd(r[0][j]);
        dm0 = _mm_add_pd(dm0, _mmi0);
        dm0 = _mm_mul_pd(dm0, dm0);
        __m128d dm1 = _mm_set_sd(r[1][j]);
        dm1 = _mm_add_sd(dm1, _mmi1);
        dm1 = _mm_mul_sd(dm1, dm1);
        __m128d dm2 = _mm_set_sd(r[2][j]);
        dm2 = _mm_add_sd(dm2, _mmi2);
        dm2 = _mm_mul_sd(dm2, dm2);
        dm0 = _mm_add_sd(dm0, dm1);
        dm0 = _mm_add_sd(dm0, dm2);
        {
            __m128 sm0 = _mm_cvtpd_ps(dm0);
            __m128d dma = _mm_mul_sd(dm0, xmmd_0_5); // a*(-0.5)
            sm0 = _mm_rsqrt_ss(sm0);                 // approximate 1/sqrt(a)
            dm0 = _mm_cvtps_pd(sm0);

            // Newton iteration to improve the square-root precision
            dm1 = _mm_mul_sd(dm0, dm0);
            dm1 = _mm_mul_sd(dm1, dma);
            dm1 = _mm_add_sd(dm1, xmmd1_5);
            dm0 = _mm_mul_sd(dm0, dm1);

            // iterate again
            dma = _mm_mul_sd(dma, dm0);
            dma = _mm_mul_sd(dma, dm0);
            dma = _mm_add_sd(dma, xmmd1_5);
            dm0 = _mm_mul_sd(dm0, dma);
        }
        lcpot = _mm_add_sd(lcpot, dm0);
    }

    *pResult = _mm_add_pd(*pResult, lcpot);
}

void computePotPart(TWorkData* work_data) {
    int i;
    __m128d lcpot = _mm_setzero_pd();
    for (i = work_data->iBegin; i < work_data->iEnd; ++i) {
        computePotPart_forj(i, &lcpot);
    }
    __m128d dm1;
    dm1 = _mm_unpackhi_pd(lcpot, lcpot);
    lcpot = _mm_add_sd(lcpot, dm1);
    work_data->fResult = m128d_value(lcpot, 0);
}

/////////////////////////////////////////////////////////////
// worker thread pool: TWorkThreadPool
// used to split one task into multiple thread tasks
// requires each sub-task to have a similar workload
// todo: switch to a work-claiming model
class TWorkThreadPool;

typedef void (*TThreadCallBack)(void* pData);
enum TThreadState { thrStartup = 0, thrReady, thrBusy, thrTerminate, thrDeath };

class TWorkThread
{
public:
    volatile HANDLE            thread_handle;
    volatile enum TThreadState state;
    volatile TThreadCallBack   func;
    volatile void*             pdata;   // work data
    volatile HANDLE            waitfor_event;
    TWorkThreadPool*           pool;
    volatile DWORD             thread_ThreadAffinityMask;

    TWorkThread() { memset(this, 0, sizeof(TWorkThread)); }
};

void do_work_end(TWorkThread* thread_data);

void __cdecl thread_dowork(TWorkThread* thread_data) // void __stdcall thread_dowork(TWorkThread* thread_data)
{
    volatile TThreadState& state = thread_data->state;
    //SetThreadAffinityMask(GetCurrentThread(),thread_data->thread_ThreadAffinityMask);
    state = thrStartup;

    while (true)
    {
        WaitForSingleObject(thread_data->waitfor_event, -1);
        if (state == thrTerminate)
            break;

        state = thrBusy;
        volatile TThreadCallBack& func = thread_data->func;
        if (func != 0)
            func((void*)thread_data->pdata);
        do_work_end(thread_data);
    }
    state = thrDeath;
    _endthread(); // ExitThread(0);
}

class TWorkThreadPool
{
private:
    volatile HANDLE          thread_event;
    volatile HANDLE          new_thread_event;
    std::vector<TWorkThread> work_threads;

    inline int passel_count() const { return (int)work_threads.size() + 1; }

    void inti_threads(long work_count) {
        SYSTEM_INFO SystemInfo;
        GetSystemInfo(&SystemInfo);
        long cpu_count  = SystemInfo.dwNumberOfProcessors;
        long best_count = cpu_count;
        if (cpu_count > work_count) best_count = work_count;

        long newthrcount = best_count - 1;
        work_threads.resize(newthrcount);
        thread_event     = CreateSemaphore(0, 0, newthrcount, 0);
        new_thread_event = CreateSemaphore(0, 0, newthrcount, 0);
        for (long i = 0; i < newthrcount; ++i)
        {
            work_threads[i].waitfor_event = thread_event;
            work_threads[i].state = thrTerminate;
            work_threads[i].pool  = this;
            work_threads[i].thread_ThreadAffinityMask = 1 << (i + 1);
            // DWORD thr_id;
            work_threads[i].thread_handle = (HANDLE)_beginthread((void (__cdecl*)(void*))thread_dowork, 0, (void*)&work_threads[i]);
            // CreateThread(0, 0, (LPTHREAD_START_ROUTINE)thread_dowork,(void*) &work_threads[i], 0, &thr_id);
        }
        //SetThreadAffinityMask(GetCurrentThread(),0x01);
        for (long i = 0; i < newthrcount; ++i)
        {
            while (true) {
                if (work_threads[i].state == thrStartup) break;
                else Sleep(0);
            }
            work_threads[i].state = thrReady;
        }
    }

    void free_threads(void)
    {
        long thr_count = (long)work_threads.size();
        long i;
        for (i = 0; i < thr_count; ++i)
        {
            while (true) {
                if (work_threads[i].state == thrReady) break;
                else Sleep(0);
            }
            work_threads[i].state = thrTerminate;
        }
        if (thr_count > 0)
            ReleaseSemaphore(thread_event, thr_count, 0);
        for (i = 0; i < thr_count; ++i)
        {
            while (true) {
                if (work_threads[i].state == thrDeath) break;
                else Sleep(0);
            }
        }
        CloseHandle(thread_event);
        CloseHandle(new_thread_event);
        work_threads.clear();
    }

    void passel_work(TThreadCallBack work_proc, void** word_data_list, int work_count) {
        // assert(work_count>=1);
        // assert(work_count<=passel_count());
        if (work_count == 1)
        {
            work_proc(word_data_list[0]);
        }
        else
        {
            long i;
            long thr_count = (long)work_threads.size();
            for (i = 0; i < work_count - 1; ++i)
            {
                work_threads[i].func  = work_proc;
                work_threads[i].pdata = word_data_list[i];
                work_threads[i].state = thrBusy;
            }
            for (i = work_count - 1; i < thr_count; ++i)
            {
                work_threads[i].func  = 0;
                work_threads[i].pdata = 0;
                work_threads[i].state = thrBusy;
            }
            if (thr_count > 0)
                ReleaseSemaphore(thread_event, thr_count, 0);
            // the current thread does one work item as well
            work_proc(word_data_list[work_count - 1]);
            // wait for the work to finish
            for (i = 0; i < thr_count; ++i)
            {
                while (true) {
                    if (work_threads[i].state == thrReady) break;
                    else Sleep(0);
                }
            }
            std::swap(thread_event, new_thread_event);
        }
    }

public:
    explicit TWorkThreadPool(unsigned long work_count) : thread_event(0), work_threads() {
        // assert(work_count>=1);
        inti_threads(work_count);
    }
    ~TWorkThreadPool() { free_threads(); }

    long best_work_count() const { return passel_count(); }

    void work_execute(TThreadCallBack work_proc, void** word_data_list, int work_count) {
        while (work_count > 0)
        {
            long passel_work_count;
            if (work_count >= passel_count())
                passel_work_count = passel_count();
            else
                passel_work_count = work_count;

            passel_work(work_proc, word_data_list, passel_work_count);

            work_count     -= passel_work_count;
            word_data_list  = &word_data_list[passel_work_count];
        }
    }

    inline void DoWorkEnd(TWorkThread* thread_data) {
        thread_data->waitfor_event = new_thread_event;
        thread_data->func  = 0;
        thread_data->state = thrReady;
    }
};

void do_work_end(TWorkThread* thread_data)
{
    thread_data->pool->DoWorkEnd(thread_data);
}
// TWorkThreadPool end;
/////////////////////////////////////////////////////////////

static TWorkThreadPool g_work_thread_pool(DefthreadCount); // thread pool

int computePot() {
    static bool is_inti_work = false;
    static TWorkData  work_list[DefthreadCount];
    static TWorkData* pwork_list[DefthreadCount];

    if (!is_inti_work)
    {
        long fi = 0;
        for (int i = 0; i < DefthreadCount; ++i)
        {
            if (0 == i)
                work_list[i].iBegin = 0;
            else
                work_list[i].iBegin = work_list[i - 1].iEnd;
            if (i == DefthreadCount - 1)
                work_list[i].iEnd = (long)NPARTS;
            else
                work_list[i].iEnd = (long)((double)(NPARTS - 1) * sqrt((double)(i + 1) / DefthreadCount) + 1);
            pwork_list[i] = &work_list[i];
        }
        is_inti_work = true;
    }

    g_work_thread_pool.work_execute((TThreadCallBack)computePotPart, (void**)(&pwork_list[0]), DefthreadCount);

    for (long i = 0; i < DefthreadCount; ++i)
        pot += work_list[i].fResult;
    return 0;
}
Since this new version, as optimized by the compiler, was already faster than the previous hand-written assembly, my hunch was right.
I then implemented computePotPart_forj in assembly and fine-tuned the assembly code (I do not have much experience with this, so there is a real risk that a tweak that runs faster on the AMD runs slower on a Core 2 :). The code is below (the rest of the code is unchanged):
Execution time: 0.453 s. Speedup from this change: 119.2%; speedup over the original code: 876.4%.
const __m128  xmms_0_5 = { (-0.5), (-0.5), (-0.5), (-0.5) };
const __m128d xmmd_0_5 = { (-0.5), (-0.5) };
const __m128  xmms1_5  = { (1.5), (1.5), (1.5), (1.5) };
const __m128d xmmd1_5  = { (1.5), (1.5) };

#define asm __asm

void __declspec(naked) computePotPart_forj(int i, __m128d* pResult)
{
    // i       -> [ebp+ 8]
    // pResult -> [ebp+12]
    asm
    {
        push    ebp
        mov     ebp, esp
        and     esp, -16              // align to 16 bytes
        sub     esp, 16               // [esp]: one temporary __m128d
        mov     eax, [ebp+8]          // i
        movsd   xmm1, [r+eax*8+vector_byte_size*2]
        mov     edx, offset r
        pxor    xmm0, xmm0
        movsd   xmm2, [r+eax*8+vector_byte_size]
        mov     ecx, 4
        pxor    xmm5, xmm5
        movsd   xmm3, [r+eax*8]
        cmp     eax, ecx
        pxor    xmm6, xmm6
        subsd   xmm0, xmm1
        subsd   xmm5, xmm2
        subsd   xmm6, xmm3
        pxor    xmm7, xmm7            // result
        unpcklpd xmm0, xmm0
        unpcklpd xmm5, xmm5
        unpcklpd xmm6, xmm6
        movapd  [esp], xmm0
        jle     $for4_end
        align   4
$for4_begin:
        movapd  xmm0, [edx]
        movapd  xmm4, [edx+vector_byte_size]
        movapd  xmm1, [edx+vector_byte_size*2]
        movapd  xmm3, [esp]
        addpd   xmm0, xmm6
        movapd  xmm2, [edx+vector_byte_size*2+16]
        mulpd   xmm0, xmm0
        addpd   xmm4, xmm5
        mulpd   xmm4, xmm4
        addpd   xmm1, xmm3
        addpd   xmm0, xmm4
        mulpd   xmm1, xmm1
        movapd  xmm4, [edx+16]
        addpd   xmm0, xmm1
        addpd   xmm4, xmm6
        movapd  xmm1, [edx+vector_byte_size+16]
        mulpd   xmm4, xmm4
        addpd   xmm1, xmm5
        addpd   xmm2, xmm3
        mulpd   xmm1, xmm1
        mulpd   xmm2, xmm2
        addpd   xmm4, xmm1
        cvtpd2ps xmm3, xmm0
        addpd   xmm4, xmm2
        movapd  xmm2, xmmd_0_5
        cvtpd2ps xmm1, xmm4
        add     ecx, 4
        add     edx, 4*8
        mulpd   xmm0, xmm2
        mulpd   xmm4, xmm2
        movaps  xmm2, xmms_0_5
        movlhps xmm3, xmm1
        mulps   xmm2, xmm3            // -0.5*a
        rsqrtps xmm1, xmm3
        mulps   xmm2, xmm1
        movaps  xmm3, xmms1_5
        mulps   xmm2, xmm1
        addps   xmm2, xmm3            // +1.5
        mulps   xmm1, xmm2
        movhlps xmm2, xmm1
        cvtps2pd xmm3, xmm1
        cvtps2pd xmm1, xmm2
        mulpd   xmm0, xmm3
        mulpd   xmm4, xmm1
        movapd  xmm2, xmmd1_5
        cmp     eax, ecx
        mulpd   xmm0, xmm3
        mulpd   xmm4, xmm1
        addpd   xmm0, xmm2
        addpd   xmm4, xmm2
        mulpd   xmm0, xmm3
        mulpd   xmm4, xmm1
        addpd   xmm7, xmm0
        addpd   xmm7, xmm4
        jg      $for4_begin
$for4_end:
        sub     ecx, 3
        cmp     eax, ecx
        jle     $for1_end
        movsd   xmm3, [esp]
        movapd  xmm4, xmmd1_5
$for1_begin:
        movsd   xmm0, [edx]
        movsd   xmm2, [edx+vector_byte_size]
        addsd   xmm0, xmm6
        addsd   xmm2, xmm5
        mulsd   xmm0, xmm0
        mulsd   xmm2, xmm2
        addsd   xmm0, xmm2
        movsd   xmm2, [edx+vector_byte_size*2]
        inc     ecx
        addsd   xmm2, xmm3
        mulsd   xmm2, xmm2
        movsd   xmm1, xmmd_0_5
        addsd   xmm0, xmm2
        cvtpd2ps xmm2, xmm0
        add     edx, 8
        cmp     eax, ecx
        mulsd   xmm0, xmm1
        rsqrtss xmm2, xmm2
        movapd  xmm1, xmm0
        cvtps2pd xmm2, xmm2
        mulsd   xmm0, xmm2
        mulsd   xmm0, xmm2
        addsd   xmm0, xmm4
        mulsd   xmm0, xmm2
        mulsd   xmm1, xmm0
        mulsd   xmm1, xmm0
        addsd   xmm1, xmm4
        mulsd   xmm0, xmm1
        addsd   xmm7, xmm0
        jg      $for1_begin
$for1_end:
        mov     edx, [ebp+12]         // pResult
        addpd   xmm7, [edx]
        movapd  [edx], xmm7
        add     esp, 16
        mov     esp, ebp
        pop     ebp
        ret
    }
}
This is also the final version I submitted; speedup over the original code: 876.4%.
////////////////////////////////////////////////////
Summary of the results and the optimization steps:
Original code                                                                    3.97 s
Replace pow(x,2) with (x*x)                                                      2.50 s
Change double r[DIMS][NPARTS] to double r[NPARTS][DIMS]                          2.39 s
Then change it to __declspec(align(16)) double r[NPARTS][DIMS+1]                 2.42 s
    Note: these two steps improve sequential memory access (better cache hit rate) and align the data.
Optimize computePot with SSE2                                                    2.31 s
    Note: include <emmintrin.h>.
Optimize computePot with SSE3                                                    2.22 s
    Note: include <intrin.h> and use instructions such as _mm_hadd_pd to improve the SSE2 version of computePot.
Build a worker thread pool to run computePot in parallel                         1.16 s
    Note: the thread pool avoids the cost of repeatedly creating and destroying threads.
Replace SSE2 _mm_sqrt_pd/_mm_div_pd with _mm_rsqrt_ps plus two Newton iterations 1.05 s
Unroll computePot's inner j loop four ways                                       0.70 s
Bind threads to CPUs to reduce thread switching                                  0.71 s
    Note: no obvious effect.
Optimize computePot with inline assembly                                         0.59 s
Keep the original double r[DIMS][NPARTS] definition                              0.54 s
    Note: the data set is fairly small, so the earlier change for sequential memory access complicated the algorithm;
    keeping the original data layout and simplifying the algorithm works better.
Inline assembly again for the new layout                                         0.453 s
These steps are the main line of my optimization. I also implemented several other versions that traded computation precision for speed, plus some attempts that did not work out; see the supplement to this article (it provides a complete float version, the manual double-to-float conversion, a hand-derived rough starting value for the invSqrt iteration, and brief notes on a few other less successful implementations, which may still be of some reference value).
Supplement to this article: http://blog.csdn.net/housisong/archive/2007/01/20/1488566.aspx
(P.S. Thanks to Intel for holding this contest; I learned a lot from it :)
Postscript: my code ranked second in speed among the contestants. The first-place entry, by dwbclz, unrolls the core code 8 ways; on my AMD X2 3600+ it is more than 20% faster than mine, and on my new Core 2 4400 it is only slightly faster (generally under 1%), which is also closer to the test environment required by the organizers.
Source code of the first-place entry: http://blog.csdn.net/dwbclz/archive/2007/01/21/1489257.aspx
The End