目录
一、IPcore代码概览
1.1 接口
1.2 功能
1.3 时间与空间资源
1.3.1 空间资源
1.3.2 时间资源
二、IPcore正确性及验证
2.1 IPcore在MTCNN之中的调用
2.2 IPcore的testBench与c-simulation
2.3 synthesis与RTL输出
2.4 系统搭建与烧录
三、SDK端的调用
3.1 初始化IPcore
3.2 判断IPcore是否完成
3.3 运行情况
3.4 IPcore问题总结
四、与zynqNet的对比
4.1 一次卷积HLS的时钟周期
4.2 MACC操作的次数
//----------------convolution in FPGA-----------------------------------
int convolution_3x3(int inHight,int inWidth,int inChanNum,int outHight,int outWidth,int OutChanNum,
int stride,
volatile float *weight_ptr,volatile float *input_ptr,volatile float *output_ptr){
#pragma HLS INTERFACE s_axilite port=inHight bundle=axilite
#pragma HLS INTERFACE s_axilite port=inWidth bundle=axilite
#pragma HLS INTERFACE s_axilite port=inChanNum bundle=axilite
#pragma HLS INTERFACE s_axilite port=outHight bundle=axilite
#pragma HLS INTERFACE s_axilite port=outWidth bundle=axilite
#pragma HLS INTERFACE s_axilite port=OutChanNum bundle=axilite
#pragma HLS INTERFACE s_axilite port=stride bundle=axilite
#pragma HLS INTERFACE s_axilite port=return bundle=axilite
#pragma HLS INTERFACE m_axi depth=DRAM_DEPTH port=weight_ptr offset=slave bundle=memorybus
#pragma HLS INTERFACE m_axi depth=DRAM_DEPTH port=input_ptr offset=slave bundle=memorybus
#pragma HLS INTERFACE m_axi depth=DRAM_DEPTH port=output_ptr offset=slave bundle=memorybus
接口描述:
软件端:通过函数接口传入参数,实现卷积运算。
硬件端:通过axi-lite协议将神经网络的参数通过ARM传入IPcore;
通过m-axi用IPcore从DRAM上获取相应的权重,特征以及写出卷积输出到DRAM上。
用FPGA加速实现MTCNN之中的3*3卷积。
下图为IPcore在7z035上占用的资源,在7z020上资源超出预期。
因卷积为变长卷积,所以我们以Pnet的第一层输入为准来确定相应的时钟周期(Pnet的输入与Rnet和Onet不同,Pnet输入没有resize来缩减尺寸,因此卷积尺寸较大且较为耗时,Pnet第一层在单片机端卷积的模拟为数秒左右)。
卷积尺寸:输入640*480*3,输出尺寸640*480*10,权重尺寸 3*3*3*10
时钟周期:
总体的latency与II并不大,表明并行是有效果的。
直接在mtcnn之中调用IPcore的c代码,每次3*3的卷积都对IPcore的c代码进行一次调用,例如:
convolution_3x3(this->rgb->height,this->rgb->width,this->rgb->channel,
this->conv1_out->height,this->conv1_out->width,this->conv1_out->channel,
this->conv1_wb->stride,
this->conv1_wb->pdata,this->rgb->pdata,this->conv1_out->pdata);
经过调用之后的mtcnn能正确运行且产生正确的结果。
编写testBench,将IPcore代码独立出来进行验证。且与卷积结果进行对比。无误,证明IPcore的代码可以独立出MTCNN的代码进行synthesis。
IPcore的c代码输出及RTL输出正常无报错。
系统搭建validate未见异常,比特流生成正常,比特流通过SDK软件烧入FPGA正常。
通过HLS驱动来初始化IPcore及开始IPcore
//set conv IPcore value
XConvolution_3x3_Set_inHight(&XConvolution_3x3_Core,featureIn.height);
XConvolution_3x3_Set_inWidth(&XConvolution_3x3_Core,featureIn.width);
XConvolution_3x3_Set_inChanNum(&XConvolution_3x3_Core,featureIn.channel);
XConvolution_3x3_Set_outHight(&XConvolution_3x3_Core,conv_PL_out.height);
XConvolution_3x3_Set_outWidth(&XConvolution_3x3_Core,conv_PL_out.width);
XConvolution_3x3_Set_OutChanNum(&XConvolution_3x3_Core,conv_PL_out.channel);
XConvolution_3x3_Set_stride(&XConvolution_3x3_Core,weightIn.stride);
XConvolution_3x3_Set_weight_ptr(&XConvolution_3x3_Core,(unsigned int)weightIn.pdata);
XConvolution_3x3_Set_input_ptr(&XConvolution_3x3_Core,(unsigned int)featureIn.pdata);
XConvolution_3x3_Set_output_ptr(&XConvolution_3x3_Core,(unsigned int)conv_PL_out.pdata);
printf("Set conv parameters SUCCESS!\n");
//conv in PL
XConvolution_3x3_Start(&XConvolution_3x3_Core);
printf("IPcore conv start SUCCESS!\n");
for(int cur_sleep_times=0;cur_sleep_times<20;cur_sleep_times++){
if(!XConvolution_3x3_IsDone(&XConvolution_3x3_Core)){
printf("IPcore Not Done! current sleep times is %d \n",cur_sleep_times);
}
else{
printf("IP core Done SUCCESS!");
break;
}
usleep(1000*7);//7 mlili second
}
isDone变为1表示IPcore正常运行结束。
小尺寸卷积正常运行
--------------program start-------------
init network parameters run time is 0.000474 mili second
Output variable init SUCCESS!
Write conv data to DRAM run time is 2.690571 mili second
Initialize XConvolution_3x3_Core IPcore SUCCESS!
---------print IP core value---------
IP core return is 0
IP core isDone is 0
IP core get inHight is 5
IP core get weight prt is 1000000
Set conv parameters SUCCESS!IPcore conv start SUCCESS!
---------print IP core value---------
IP core return is 0
IP core isDone is 1
IP core get inHight is 5
IP core get weight prt is 1000000
Strat again SUCCESS!
IP core Done SUCCESS!
Input_Pixels is 50 and hex memory size is 000000c8
weight_pixels is 36 and hex memory size is 00000090
Output_Pixels is 18 and hex memory size is 00000048
Input pointer value is 01000090
Weight pointer value is 01000000
Output PS pointer value is 010000d8
InputSize is 5, In_channels is 2, Input_Pixels is 50
OutputSize is 3, Out_channels is 2, Output_Pixels is 18
Stride is 1, weight_pixels is 36
------------Program End SUCCESS!-----------
卷积尺寸较大时导致单片机死机,例如下面这个尺寸:
InputSize is 23, In_channels is 32, Input_Pixels is 16928
OutputSize is 21, Out_channels is 64, Output_Pixels is 28224
Stride is 1, weight_pixels is 18432
Input_Pixels is 16928 and hex memory size is 00010880
weight_pixels is 18432 and hex memory size is 00012000
Output_Pixels is 28224 and hex memory size is 0001b900
Input pointer value is 01012000
Weight pointer value is 01000000
Output PS pointer value is 0102d900
IPcore对DDR的指针无法与单片机共享,因为小尺寸卷积输出写入DDR的值单片机无法读出。
IPcore实现尺寸较大卷积会导致单片机死机。
下面为zynqNet的时钟周期:
全局变量的增加:在fpgaAcc.cpp之中,加入extern int,然后在pikaqiu中程序之外定义全局变量int。
MTCNN选用一张:85,176,568 为10^7数量级
另一张:43,543,288为 10^7数量级
zynqNe的MACC次数为: 152,731,648 ,为10^8数量级