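I recently ran into a computation on the order of 1500*1500*1500*1500. Running it in MATLAB was extremely slow, so the MATLAB code needed optimizing.
Original code (UAt, XA, XB, YA and YB are all 1500*1500 matrices):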
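% UBi_1 must be initialized beforehand, e.g. UBi_1=XB*0;
tAll = tic;                      % timer for the whole run
for m=1:size(UAt,1)
    tRow = tic;                  % separate timer for the current row of source points
    for n=1:size(UAt,2)
        % distance from source point (m,n) to every point on the target plane
        RAB=sqrt((XB-XA(1,n)).^2+(YB-YA(m,1)).^2+dAB^2);
        Um=UAt(m,n)*(1i*k-1./RAB).*exp(1i*k*RAB)*dAB./RAB.^2;
        UBi_1=Um+UBi_1;          % accumulate the contribution on the target plane
    end
    toc(tRow)
    m
end
toc(tAll)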
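A note on the timing calls: in MATLAB, a bare `tic` restarts the default stopwatch, so when timers are nested, a final bare `toc` reports the time since the most recent `tic` rather than since the first one. A minimal sketch of independent timers using tic handles follows. The loop bounds and the workload are placeholders, not the computation from this post.

```matlab
% Sketch: independent timers via tic handles (sizes and workload are illustrative).
tAll = tic;                        % handle for the whole run
rowTimes = zeros(1, 3);
for m = 1:3
    tRow = tic;                    % separate handle; does not disturb tAll
    x = sum(sqrt(1:1e5));          % placeholder workload
    rowTimes(m) = toc(tRow);       % elapsed time of this iteration only
end
totalTime = toc(tAll);             % elapsed time since tAll was created
fprintf('rows: %.4f s summed, whole run: %.4f s\n', sum(rowTimes), totalTime);
```

With handles, the per-row times and the total can be read independently of each other.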
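The baseline performance was measured on a Windows laptop (2080 Ti GPU, i7-10750H CPU).
Idea 1: the original code uses a loop nested inside a loop, which is a poor fit for a language like MATLAB whose strength is matrix operations. So I rewrote the loops as matrix operations: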
xa = ((-(size(UAt,2)-1)/2:(size(UAt,2)-1)/2)*dx0);   % source-plane x coordinates
ya = (-(-(size(UAt,1)-1)/2:(size(UAt,1)-1)/2)*dx0);  % source-plane y coordinates
xb = ((-(size(UBi,2)-1)/2:(size(UBi,2)-1)/2)*dx0);   % target-plane x coordinates
yb = (-(-(size(UBi,1)-1)/2:(size(UBi,1)-1)/2)*dx0);  % target-plane y coordinates
% squared coordinate differences via (p-q)^2 = p^2 + q^2 - 2*p*q
ind12 = (xb.^2)'+xa.^2-2*xb'*xa;   % [xb, xa]
ind34 = (yb.^2)'+ya.^2-2*yb'*ya;   % [yb, ya]
ind12 = reshape(ind12,[1,1,size(ind12)]);
ind34 = reshape(ind34,[size(ind34),1,1]);   % combined layout: [yb, ya(m), xb, xa(n)]
rab = sqrt(ind12+ind34+dAB^2);
% UAt(m,n) varies only along the ya and xa dimensions
uab = reshape(UAt,[1,size(UAt,1),1,size(UAt,2)]);
rab_dot_inv = 1./rab;
um = uab.*(1i*k-rab_dot_inv).*exp(1i*k*rab)*dAB.*rab_dot_inv.*rab_dot_inv;
ubi_1 = squeeze(sum(um,[2,4]));   % sum over all source points, leaving [yb, xb]
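At a small grid size, the four-dimensional formulation can be checked against the nested loops. The following is a sketch under stated assumptions: the values of `N`, `dx0`, `dAB` and `k` and the test field are hypothetical, the squared differences are formed directly by broadcasting, and the array layout is [yb, ya, xb, xa] as in the listing's comment.

```matlab
% Sketch: 4-D broadcast version checked against the double loop on a tiny grid.
N = 6; dx0 = 0.1; dAB = 5; k = 2*pi/0.5;        % hypothetical test values
c  = (-(N-1)/2:(N-1)/2) * dx0;
xa = c; ya = -c; xb = c; yb = -c;               % same axis convention as above
UAt = exp(1i * (ya(:) + xa));                   % arbitrary N-by-N test field

% Layout [yb, ya, xb, xa]: build the squared differences by broadcasting.
dy2 = (yb(:) - ya).^2;                          % [yb, ya]
dx2 = reshape((xb(:) - xa).^2, [1, 1, N, N]);   % [1, 1, xb, xa]
rab = sqrt(dy2 + dx2 + dAB^2);                  % N-by-N-by-N-by-N distances
u4  = reshape(UAt, [1, N, 1, N]);               % UAt(m,n) lives on dims ya and xa
um  = u4 .* (1i*k - 1./rab) .* exp(1i*k*rab) * dAB ./ rab.^2;
ubi = squeeze(sum(sum(um, 4), 2));              % sum over xa, then ya -> [yb, xb]

% Reference: the original nested loops.
[XB, YB] = meshgrid(xb, yb);
ref = zeros(N);
for m = 1:N
    for n = 1:N
        RAB = sqrt((XB - xa(n)).^2 + (YB - ya(m)).^2 + dAB^2);
        ref = ref + UAt(m,n) * (1i*k - 1./RAB) .* exp(1i*k*RAB) * dAB ./ RAB.^2;
    end
end
relErr = max(abs(ubi(:) - ref(:))) / max(abs(ref(:)));
fprintf('relative difference: %.2e\n', relErr);
```

The two results should agree to rounding error, which confirms the index layout before scaling up.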
At full size, however, this version needs far too much memory, and the system reports an error.
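A back-of-the-envelope estimate shows why. This sketch assumes complex double-precision storage (16 bytes per element), which is MATLAB's default for complex values. The real-valued distance array would need half as much.

```matlab
% Sketch: memory needed by one full 4-D complex array.
N = 1500;                          % grid size along each axis (from the text)
bytesPerElement = 16;              % complex double: 8 bytes real + 8 bytes imaginary
numElements = N^4;                 % one entry per (source point, target point) pair
totalBytes = numElements * bytesPerElement;
fprintf('%.3e elements, about %.0f TB per array\n', numElements, totalBytes / 1e12);
```

That is roughly 5e12 elements and about 81 TB for a single complex array, far beyond any laptop's RAM.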
So I made a trade-off between looping and matrix operations (that is, memory consumption):
xa = ((-(size(UAt,2)-1)/2:(size(UAt,2)-1)/2)*dx0);
ya = (-(-(size(UAt,1)-1)/2:(size(UAt,1)-1)/2)*dx0);
xb = ((-(size(UBi,2)-1)/2:(size(UBi,2)-1)/2)*dx0);
yb = (-(-(size(UBi,1)-1)/2:(size(UBi,1)-1)/2)*dx0);
clear UBi
% UBi_1 is assumed to be preallocated to the target-plane size
XB = reshape(XB,[1,size(XB)]);               % 1-by-Ny-by-Nx, so chunks broadcast along dim 1
for m=1:size(UAt,1)
    b = (YB-ya(m)).^2+dAB^2;                 % y-term shared by every column of row m
    b = reshape(b, [1,size(b)]);
    m
    step = 10;                               % number of source columns per chunk
    total = 1;                               % first column of the current chunk
    tic
    for s=1:ceil(size(UAt, 2)/step)
        rg = min(s*step,size(UAt, 2));       % last column of the current chunk
        a = xa(total:rg);
        a = reshape(a,[length(a),1,1]);
        u = UAt(m,total:rg);
        ind12=(a-XB).^2;
        ind123 = b + ind12;
        rab = sqrt(ind123);                  % chunk-by-Ny-by-Nx distances
        clear ind123;
        clear ind12;
        rab_inv = 1./rab;
        left = (1i*k-rab_inv).*rab_inv.^2.*exp(1i*k*rab)*dAB;
        clear rab_inv
        clear rab
        % weighted sum over the chunk: (1-by-chunk) times each chunk-by-Ny page
        UBi_1 = reshape(pagemtimes(u,left),size(UBi_1))+UBi_1;
        total = rg+1;
    end
    toc
end
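When a loop is split into chunks like this, every column index has to land in exactly one chunk, since an off-by-one in the bounds silently drops or duplicates a column. A minimal sketch of a coverage check follows. The value `N = 25` is illustrative and deliberately not a multiple of the chunk width.

```matlab
% Sketch: verify that chunk bounds cover columns 1..N exactly once.
N = 25;                                  % illustrative column count (the post uses 1500)
step = 10;                               % chunk width
covered = zeros(1, N);                   % how many chunks touched each column
for first = 1:step:N
    last = min(first + step - 1, N);     % clamp the final, possibly shorter, chunk
    covered(first:last) = covered(first:last) + 1;
end
fprintf('every column covered exactly once: %d\n', all(covered == 1));
```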
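The chunked version turned out to be even slower.
Does this mean MATLAB's CPU-side optimization for arrays with three or more dimensions is lacking?
Next, let's test performance on the GPU using gpuArray. First the original code, with the following changes: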
XA = gpuArray(XA);
XB = gpuArray(XB);
YA = gpuArray(YA);
YB = gpuArray(YB);
UAt = gpuArray(UAt);
dAB = gpuArray(dAB);
The speedup is striking!
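One caveat when timing gpuArray code with tic/toc: GPU operations can be queued asynchronously, so it is safer to synchronize with `wait(gpuDevice)` before reading the timer (or to use `gputimeit`). The sketch below times a single inner-loop kernel and falls back to the CPU when no GPU is usable. The grid size and the values of `k` and `dAB` are hypothetical.

```matlab
% Sketch: time one kernel evaluation, synchronizing the GPU before toc.
N = 256;                                     % illustrative grid size
k = 2*pi/0.5; dAB = 5;                       % hypothetical wavenumber and plane spacing
[XB, YB] = meshgrid(linspace(-1, 1, N));     % hypothetical target-plane coordinates
hasGPU = false;
try
    hasGPU = gpuDeviceCount > 0;             % needs Parallel Computing Toolbox
catch
    % no GPU support available; stay on the CPU
end
if hasGPU
    XB = gpuArray(XB); YB = gpuArray(YB);
end
t = tic;
RAB = sqrt(XB.^2 + YB.^2 + dAB^2);
Um  = (1i*k - 1./RAB) .* exp(1i*k*RAB) * dAB ./ RAB.^2;
if hasGPU
    wait(gpuDevice);                         % block until queued GPU work has finished
    Um = gather(Um);                         % copy the result back to host memory
end
elapsed = toc(t);
fprintf('kernel time: %.4f s (GPU used: %d)\n', elapsed, hasGPU);
```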
Now let's test the earlier trade-off version on the GPU:
xa = gpuArray(xa);
ya = gpuArray(ya);
xb = gpuArray(xb);
yb = gpuArray(yb);
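It is still slower than the initial version, which does not batch the work into larger array operations.
The final optimized version: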
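XA = gpuArray(XA);
XB = gpuArray(XB);
YA = gpuArray(YA);
YB = gpuArray(YB);
UAt = gpuArray(UAt);
dAB = gpuArray(dAB);
UBi_1 = XB*0;                        % accumulator on the GPU, same size as XB
tAll = tic;                          % timer for the whole run
for m=1:size(UAt,1)
    tRow = tic;                      % separate timer for the current row
    b = (YB-YA(m,1)).^2+dAB^2;       % y-term is constant within a row, so compute it once
    for n=1:size(UAt,2)
        RAB=sqrt((XB-XA(1,n)).^2+b);
        rab_inv = 1./RAB;            % reuse the reciprocal instead of dividing twice
        Um=UAt(m,n)*(1i*k-rab_inv).*exp(1i*k*RAB)*dAB.*rab_inv.^2;
        UBi_1=Um+UBi_1;
    end
    toc(tRow)
    m
end
toc(tAll)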
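Before trusting the faster loop, it is worth confirming on a small grid that hoisting the y-term and reusing the reciprocal leave the result unchanged. The check below is a CPU-only sketch. The grid size, `dx0`, `dAB`, `k` and the test field are hypothetical values chosen for illustration.

```matlab
% Sketch: compare the original loop body with the hoisted version on a tiny grid.
N = 8; dx0 = 0.1; dAB = 5; k = 2*pi/0.5;     % hypothetical test values
c = (-(N-1)/2:(N-1)/2) * dx0;
[XA, YA] = meshgrid(c, -c);                  % source-plane grid
XB = XA; YB = YA;                            % target plane on the same grid
UAt = exp(1i * (XA + YA));                   % arbitrary test field

ref = XB*0;                                  % original formulation
opt = XB*0;                                  % hoisted formulation
for m = 1:N
    b = (YB - YA(m,1)).^2 + dAB^2;           % computed once per row
    for n = 1:N
        RAB = sqrt((XB - XA(1,n)).^2 + (YB - YA(m,1)).^2 + dAB^2);
        ref = ref + UAt(m,n) * (1i*k - 1./RAB) .* exp(1i*k*RAB) * dAB ./ RAB.^2;

        RAB2 = sqrt((XB - XA(1,n)).^2 + b);
        rab_inv = 1 ./ RAB2;
        opt = opt + UAt(m,n) * (1i*k - rab_inv) .* exp(1i*k*RAB2) * dAB .* rab_inv.^2;
    end
end
relErr = max(abs(opt(:) - ref(:))) / max(abs(ref(:)));
fprintf('relative difference: %.2e\n', relErr);
```

Any difference should be at the level of floating-point rounding, since only the order of operations changes.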
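One more thing to note: during testing, the laptop GPU slowed down after running for a while. The main cause is heat. Once the GPU temperature climbed above 60 °C, performance dropped.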