Background: to deploy MTCNN on an FPGA, its code has to be written in C, and the multiply-accumulate operations in that C code currently rely on the OpenBLAS library. Moving to the zynqNet approach requires splitting the convolutions into 3*3 convolutions, so the GEMM form cannot be used.
Goal: remove the OpenBLAS dependency from the convolution and fully connected layers and implement the convolution as nested for loops, consistent with zynqNet, so that it can be parallelized.
Contents
1. gemm
1.1 Understanding GEMM for convolution
1.2 Replacing cblas_sgemm with gemm
2. cblas_sgemv for the fully connected layer
2.1 sgemv in MTCNN
2.2 sgemv in YOLO
2.3 Implementing sgemv directly with gemm
3. Removing the OpenCV dependency
3.1 How zynqNet reads images
3.2 The two image libraries referenced in MTCNN
4. Rewriting convolution as nested for loops
4.1 The im2col function in YOLO
4.2 The nested for loops in YOLO
4.3 Writing the convolution
4.4 Fixing a bug
4.5 Writing the fully connected layer
On the im2col process, see: https://blog.csdn.net/lanchunhui/article/details/74838635
In a convolution, it is common to unfold the feature map into a matrix and then multiply it by the weight matrix.
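For example (with made-up sizes): a 3-channel 5x5 input convolved with 3x3 kernels at stride 1 and no padding yields a 3x3 output plane, so the unfolded feature matrix is (3*3*3) x (3*3) = 27 x 9; with 16 output channels the weight matrix is 16 x 27 and their product is the 16 x 9 output feature map.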
Reference:
Parameters of cblas_sgemm: https://blog.csdn.net/u012235274/article/details/52769682
cblas_sgemm(order, transA, transB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC);
The first parameter selects the storage order, row-major or column-major (C code is row-major).
The second and third parameters select whether A and B are transposed.
After transA is applied, A has dimensions M x K.
After transB is applied, B has dimensions K x N.
C has dimensions M x N.
LDA and LDB are the leading dimensions of the corresponding matrices before the transpose is applied (for row-major storage, the number of columns).
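As a quick standalone illustration (a sketch assuming OpenBLAS is installed and the program is linked with -lopenblas; the matrix values are made up), multiplying a 2x3 matrix A by a 3x2 matrix B:

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    float A[6] = { 1, 2, 3,
                   4, 5, 6 };   // 2x3, row-major
    float B[6] = { 1, 0,
                   0, 1,
                   1, 1 };      // 3x2, row-major
    float C[4] = { 0 };         // 2x2 result
    // C = 1.0*A*B + 0.0*C ; M=2, N=2, K=3, LDA=3, LDB=2, LDC=2 (column counts)
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3, 1.0f, A, 3, B, 2, 0.0f, C, 2);
    printf("%.0f %.0f\n%.0f %.0f\n", C[0], C[1], C[2], C[3]);  // expect: 4 5 / 10 11
    return 0;
}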
// ------------- convolution in 2D matrix format -------------------
// input kernel matrix  *  input feature matrix(Trans)  =  output feature matrix
// height (outChannels)    height (3D_KernelSize)          height (outChannels)
// width (3D_KernelSize)   width (outFeatureSize)          width (outFeatureSize)
// C = alpha*A*B + beta*C : outpBox = weightIn * matrixIn(T)
gemm_cpu(0, 1,                                 // A not transposed, B transposed
         weightIn->out_ChannelNum,             // M: rows of A and C
         matrixIn->height,                     // N: cols of C (rows of B before transpose)
         matrixIn->width,                      // K: cols of A (cols of B before transpose)
         1,                                    // alpha
         weightIn->pdata, matrixIn->width,     // A, lda
         matrixIn->pdata, matrixIn->width,     // B, ldb
         0,                                    // beta
         outpBox->pdata, matrixIn->height);    // C, ldc
After this replacement, the program runs correctly.
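For reference, a minimal sketch of what the TA = 0, TB = 1 case used above computes (gemm_nt_sketch is an illustrative name; a full YOLO-style gemm_cpu also handles the other transpose combinations):

// C[M][N] = ALPHA * A[M][K] * B[N][K]^T + BETA * C[M][N]   (all row-major)
void gemm_nt_sketch(int M, int N, int K, float ALPHA,
                    float *A, int lda, float *B, int ldb,
                    float BETA, float *C, int ldc)
{
    int i, j, k;
    for (i = 0; i < M; ++i) {
        for (j = 0; j < N; ++j) {
            float sum = 0;
            for (k = 0; k < K; ++k) {
                // B is read by rows because it is logically transposed
                sum += A[i*lda + k] * B[j*ldb + k];
            }
            C[i*ldc + j] = ALPHA * sum + BETA * C[i*ldc + j];
        }
    }
}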
sgemv in OpenBLAS: https://blog.csdn.net/chenlanjie842179335/article/details/8043925
The computation is C = alpha*A*b + beta*C.
We take alpha = 1.0 and beta = 0.0, so it reduces to C = A*b.
cblas_sgemv(CblasRowMajor, CblasNoTrans, rows of A, cols of A, alpha, A, cols of A, b, 1, beta, C, 1)
// Y = alpha*A*X + beta*Y (beta must be 0 here); cblas_sgemv multiplies a matrix by a vector (single precision)
cblas_sgemv(CblasRowMajor, CblasNoTrans,
            weight->out_ChannelNum,                // A height (rows)
            weight->in_ChannelNum,                 // A width (cols)
            1,                                     // alpha
            weight->pdata, weight->in_ChannelNum,  // A, lda = A width
            Inpbox->pdata, 1,                      // X, incX
            0,                                     // beta
            outpBox->pdata, 1);                    // Y, incY
int m = l.batch;
int k = l.inputs;
int n = l.outputs;
float *a = net.input;//input
float *b = l.weights;//weight
float *c = l.output;//output
gemm(0,1,m,n,k,1,a,k,b,k,1,c,n);
However, in YOLO the input is on the left and the weights are on the right, whereas we need the weights on the left and the input on the right.
// C = alpha*A*B + beta*C : outpBox = weight * Inpbox
gemm_cpu(0, 0,                                 // neither A nor B transposed
         weight->out_ChannelNum,               // M: rows of A and C
         1,                                    // N: cols of B and C
         weight->in_ChannelNum,                // K: cols of A, rows of B
         1,                                    // alpha
         weight->pdata, weight->in_ChannelNum, // A, lda
         Inpbox->pdata, 1,                     // B, ldb
         0,                                    // beta
         outpBox->pdata, 1);                   // C, ldc
After verification, we can replace all of the OpenBLAS routines with our own code and remove the dependency on the OpenBLAS library entirely.
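One way to do that verification (a sketch linked against OpenBLAS and libm; the sizes, random initialization and the idea of checking the maximum difference are illustrative choices) is to compare the OpenBLAS call with a plain nested loop on random data:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cblas.h>

int main(void)
{
    const int M = 128, N = 256;              // rows (out channels), cols (in channels)
    float *A = malloc(sizeof(float) * M * N);
    float *x = malloc(sizeof(float) * N);
    float *y_blas = malloc(sizeof(float) * M);
    float *y_loop = malloc(sizeof(float) * M);
    for (int i = 0; i < M * N; ++i) A[i] = (float)rand() / RAND_MAX - 0.5f;
    for (int j = 0; j < N; ++j)     x[j] = (float)rand() / RAND_MAX - 0.5f;

    // OpenBLAS version
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, A, N, x, 1, 0.0f, y_blas, 1);

    // own nested-loop version
    for (int i = 0; i < M; ++i) {
        float sum = 0;
        for (int j = 0; j < N; ++j) sum += A[i * N + j] * x[j];
        y_loop[i] = sum;
    }

    float max_diff = 0;
    for (int i = 0; i < M; ++i) {
        float d = fabsf(y_blas[i] - y_loop[i]);
        if (d > max_diff) max_diff = d;
    }
    printf("max abs diff = %g\n", max_diff);  // should be on the order of float rounding error
    free(A); free(x); free(y_blas); free(y_loop);
    return 0;
}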
zynqNet simply converts the input image into a binary file, which makes it easy to read.
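A sketch of reading such a file in C (the float32 data type, the channels x height x width layout and the absence of a header are assumptions; the actual zynqNet file format should be checked before reuse):

#include <stdio.h>
#include <stdlib.h>

// Read a preprocessed image stored as raw float values
// (assumed layout: channels x height x width, float32, no header).
float *load_image_bin(const char *path, int channels, int height, int width)
{
    size_t count = (size_t)channels * height * width;
    float *data = malloc(count * sizeof(float));
    if (!data) return NULL;
    FILE *fp = fopen(path, "rb");
    if (!fp) { free(data); return NULL; }
    size_t got = fread(data, sizeof(float), count, fp);
    fclose(fp);
    if (got != count) { free(data); return NULL; }
    return data;
}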
The two image libraries referenced by MTCNN are both included in network.h:
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
using namespace cv;
The OpenCV library is still fairly important for debugging at this stage; we will come back to this topic later, in the FPGA deployment phase.
See MTCNN (9): removing the OpenCV dependency: https://blog.csdn.net/weixin_36474809/article/details/83343514
Before changing to the nested-loop form, we need to understand the program from four aspects.
//YOLO additionally.c
float im2col_get_pixel(float *im, int height, int width, int channels,
int row, int col, int channel, int pad){
row -= pad;
col -= pad;
if (row < 0 || col < 0 ||
row >= height || col >= width) return 0;
return im[col + width*(row + height*channel)];
}
Given height, width, channels and the requested row, col and channel, this function returns the corresponding pixel of the padded input; positions that fall into the padding return 0.
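For example, with pad = 1 a request for (row = 0, col = 0) falls on the zero border (row - pad = -1 < 0), so 0 is returned, while (row = 1, col = 1) maps back to the first pixel of the requested channel, im[width*height*channel].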
// im2col_CPU.c
//left matrix: weight, right matrix: data_col
//data_col: height (3D_kernelSize), width (Out_featureSize)
void im2col_cpu(float* data_im,
     int channels, int height, int width,
     int ksize, int stride, int pad, float* data_col)
{
    int c, h, w;
    int height_col = (height + 2*pad - ksize) / stride + 1;
    int width_col  = (width + 2*pad - ksize) / stride + 1;
    int channels_col = channels * ksize * ksize;//3D_kernelSize
    for (c = 0; c < channels_col; ++c) {
        int w_offset = c % ksize;
        int h_offset = (c / ksize) % ksize;
        int c_im = c / ksize / ksize;
        for (h = 0; h < height_col; ++h) {
            for (w = 0; w < width_col; ++w) {
                int im_row = h_offset + h * stride;
                int im_col = w_offset + w * stride;
                int col_index = (c * height_col + h) * width_col + w;
                data_col[col_index] = im2col_get_pixel(data_im, height, width, channels,
                        im_row, im_col, c_im, pad);
            }
        }
    }
}
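Putting this together with the gemm from section 1, the whole convolution becomes im2col followed by one matrix product, roughly as below (a sketch: the wrapper name conv_im2col_gemm is illustrative, data_col must be preallocated to k*n floats, beta = 0 overwrites the output, and bias/activation are omitted):

// Convolution of one image as im2col + gemm.
// weights: out_ch x (in_ch*ksize*ksize), row-major
// output : out_ch x (out_h*out_w), row-major, overwritten
void conv_im2col_gemm(float *input, float *weights, float *output, float *data_col,
                      int in_ch, int h, int w, int out_ch, int ksize, int stride, int pad)
{
    int out_h = (h + 2*pad - ksize) / stride + 1;
    int out_w = (w + 2*pad - ksize) / stride + 1;
    int m = out_ch;                  // rows of the weight matrix
    int k = in_ch * ksize * ksize;   // 3D kernel size
    int n = out_h * out_w;           // output feature size
    im2col_cpu(input, in_ch, h, w, ksize, stride, pad, data_col);
    gemm(0, 0, m, n, k, 1, weights, k, data_col, n, 0, output, n);
}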
for (fil = 0; fil < l.n; ++fil) {//channels out
int chan, y, x, f_y, f_x;
// channel index
for (chan = 0; chan < l.c; ++chan)//channels in
// input - y
for (y = 0; y < l.h; ++y)
// input - x
for (x = 0; x < l.w; ++x){
//for channels out,for channels in,for row,for col
int const output_index = fil*l.w*l.h + y*l.w + x;
int const weights_pre_index = fil*l.c*l.size*l.size + chan*l.size*l.size;
int const input_pre_index = chan*l.w*l.h;
float sum = 0;
// filter - y
for (f_y = 0; f_y < l.size; ++f_y)
{
int input_y = y + f_y - l.pad;
// filter - x
for (f_x = 0; f_x < l.size; ++f_x)
{
int input_x = x + f_x - l.pad;
if (input_y < 0 || input_x < 0 || input_y >= l.h || input_x >= l.w) continue;
int input_index = input_pre_index + input_y*l.w + input_x;
int weights_index = weights_pre_index + f_y*l.size + f_x;
sum += state.input[input_index] * l.weights[weights_index];
}
}
// l.output[filters][width][height] +=
// state.input[channels][width][height] *
// l.weights[filters][channels][filter_width][filter_height];
l.output[output_index] += sum;
}
}
Inside each iteration of the loops over channel_out, channel_in, out_height and out_width, the offset addresses are computed first,
and then the multiply-accumulate over the kernel window is carried out for the current output pixel.
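For example, with made-up sizes l.w = l.h = 12, l.c = 3, l.size = 3, the iteration fil = 2, chan = 1, y = 4, x = 5 gives output_index = 2*144 + 4*12 + 5 = 341, weights_pre_index = 2*3*9 + 1*9 = 63 and input_pre_index = 1*144 = 144; the two innermost loops then walk the 3x3 window around that position.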
//set the output value to 0
for(cur_col_out=0;cur_col_out<...
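A sketch of this nested-loop convolution using the pBox/Weight fields seen above (the channel, kernelSize, stride and pad fields and the loop bounds are assumptions about the MTCNN data structures, not the exact code):

// Assumed data layouts (sketch only -- check against the real MTCNN structs):
struct pBox   { float *pdata; int width, height, channel; };
struct Weight { float *pdata; int in_ChannelNum, out_ChannelNum, kernelSize, stride, pad; };

// Nested-loop convolution: outpBox = weightIn (*) inpBox, with zero padding.
void conv_nested(const struct Weight *weightIn, const struct pBox *inpBox, struct pBox *outpBox)
{
    int ks = weightIn->kernelSize, stride = weightIn->stride, pad = weightIn->pad;
    int out_size = outpBox->channel * outpBox->height * outpBox->width;
    //set the output value to 0 (the buffers are reused, see the bug discussed in 4.4)
    for (int i = 0; i < out_size; ++i) outpBox->pdata[i] = 0;
    for (int oc = 0; oc < weightIn->out_ChannelNum; ++oc)       // output channels
        for (int ic = 0; ic < weightIn->in_ChannelNum; ++ic)    // input channels
            for (int oy = 0; oy < outpBox->height; ++oy)        // output rows
                for (int ox = 0; ox < outpBox->width; ++ox) {   // output cols
                    float sum = 0;
                    for (int ky = 0; ky < ks; ++ky)             // kernel rows
                        for (int kx = 0; kx < ks; ++kx) {       // kernel cols
                            int iy = oy * stride + ky - pad;
                            int ix = ox * stride + kx - pad;
                            if (iy < 0 || ix < 0 || iy >= inpBox->height || ix >= inpBox->width)
                                continue;                       // zero padding
                            sum += inpBox->pdata[(ic * inpBox->height + iy) * inpBox->width + ix]
                                 * weightIn->pdata[((oc * weightIn->in_ChannelNum + ic) * ks + ky) * ks + kx];
                        }
                    outpBox->pdata[(oc * outpBox->height + oy) * outpBox->width + ox] += sum;
                }
}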
We converted the convolution into this nested for-loop form and the program passed verification. From here we can start, following the zynqNet pattern, to implement MTCNN on zynqNet step by step.
When the nested convolution was first written, its results did not match the gemm version. Looking into the gemm code, we found that it starts with a step that sets the output matrix to zero before the convolution:
void gemm_cpu(int TA, int TB, int M, int N, int K,
float *A, int lda,
float *B, int ldb,
float *C, int ldc)
{
int i,j;
for(i = 0; i < M; ++i){
for(j = 0; j < N; ++j){
C[i*ldc + j] = 0;
}
}
...
At first we assumed the buffers had already been zeroed with memset in the convolutionInit function. Further checking showed, however, that Pnet allocates many separate buffers (because of the image pyramid, the feature maps have varying sizes), while the memory layout of Rnet and Onet is fixed: their buffers are allocated once at network initialization and then reused across many runs. Therefore the output must be set to 0 before every convolution, otherwise the new values accumulate on top of the previous ones. We printed the output values before and after each convolution:
Start run Pnet
Pnet buffer init
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
Start Pnet generate Bbox
Done Pnet generate Bbox
Done run Pnet
Run nms
...
Rnet run
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
Rnet run
just memset 0 :0.518421
after =0 : 0.000000
just memset 0 :0.212925
after =0 : 0.000000
just memset 0 :0.009512
after =0 : 0.000000
Rnet run
All of Pnet's working buffers are freshly allocated, whereas Rnet and Onet run many times reusing the same memory. So the initial values are 0 on the first run, but on later runs they are the leftovers from the previous run. With this bug fixed, we have converted this structure into the corresponding nested-loop form.
//--------------------fc layer in nested loop format--------------
//loop variables
int cur_outChannel,cur_inChannel;
int out_ChannelNum=weight->out_ChannelNum, in_ChannelNum=weight->in_ChannelNum;
//location variables
int weight_loc_pre,weight_loc;
//accumulator
float sum;
for(cur_outChannel=0;cur_outChannel<out_ChannelNum;cur_outChannel++){
    sum=0;
    weight_loc_pre=cur_outChannel*in_ChannelNum;
    for(cur_inChannel=0;cur_inChannel<in_ChannelNum;cur_inChannel++){
        weight_loc=weight_loc_pre+cur_inChannel;
        sum+=weight->pdata[weight_loc]*Inpbox->pdata[cur_inChannel];
    }
    outpBox->pdata[cur_outChannel]=sum;
}
With this, we have removed the dependency on the OpenBLAS library and, using nested for loops, converted the program to the zynqNet style.