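Background: ZynqNet can run a deep-compression network on a Xilinx FPGA. The FPGA-side program takes the per-layer data passed in by the CPU, performs the computation, and stores the results in DRAM.
Goal: understand the FPGA-side code of ZynqNet.
Source code: https://github.com/dgschwend/zynqnet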
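Contents
The program consists of:
1. Reading the per-layer information
1.1 Setting the layer information for all blocks
1.2 Loading the weights and preloading the image
2. Computation
2.1 Image pixel to ICache
2.2 When stride = 2
2.3 for channels in
2.3.1 processInputChannel
2.3.2 setupPixelWriteback
2.4 for channels out
2.4.1 Post-processing
2.4.2 global_pool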
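The FPGA-side code is translated by high-level synthesis (HLS) into a hardware description and implemented on the FPGA. The relevant files are fpga_top.cpp and fpga_top.hpp.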
// fpga_top
void fpga_top(layer_t layer, data_t *SHARED_DRAM, unsigned int weights_offset,
weightaddr_t num_weights, unsigned int input_offset) {
#pragma HLS INTERFACE m_axi depth = DRAM_DEPTH port = SHARED_DRAM offset = \
slave bundle = memorybus register
#pragma HLS INTERFACE s_axilite port = layer bundle = axilite register
#pragma HLS INTERFACE s_axilite port = num_weights bundle = axilite register
#pragma HLS INTERFACE s_axilite port = weights_offset bundle = axilite register
#pragma HLS INTERFACE s_axilite port = input_offset bundle = axilite register
#pragma HLS INTERFACE s_axilite port = return bundle = axilite register
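The per-layer information sent by the CPU is read through the AXI4-Lite interface: the DRAM address (SHARED_DRAM itself is an m_axi master port whose base address is supplied via offset = slave), the layer information, the weights offset, the number of weights, and the input offset.
Here layer is a struct that holds all the information about one layer. It is defined in netconfig.hpp.
The variables other than layer are defined once in cpu_top.cpp. Only layer has to be transferred again for every layer computation.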
// fpga_top
//setup memory controller
MemoryController::setup(SHARED_DRAM, weights_offset, input_offset);
// Set Layer Configuration
P_layer_setup : {
P_setLayerConfigs : {
ImageCache::setLayerConfig(layer);
WeightsCache::setLayerConfig(layer, num_weights);
MemoryController::setLayerConfig(layer);
ProcessingElement::setLayerConfig(layer);
}
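The layer configuration is pushed to the modules shown above (ImageCache, WeightsCache, MemoryController and ProcessingElement). In the C++ source these modules are namespaces, and their state is held in global variables.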
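The "namespace with global state" pattern can be sketched as follows. This is a minimal illustration, not the ZynqNet source: the reduced `layer_t`, the `...Sketch` namespaces, their variables and the `setLayerConfigs` helper are all hypothetical, and the restriction of stride to 1 or 2 is an assumption based on the stride handling shown later.

```cpp
#include <cassert>

// Hypothetical, reduced layer descriptor: only fields that appear in this
// walkthrough. The real layer_t lives in netconfig.hpp.
struct layer_t {
  int channels_in;
  int channels_out;
  int stride;
  bool global_pool;
};

// A module written as a namespace: its variables are globals that keep
// their values between function calls, much like registers in hardware.
namespace ImageCacheSketch {
int ch_in = 0;  // cached copy of layer.channels_in

void setLayerConfig(const layer_t &layer) { ch_in = layer.channels_in; }
}  // namespace ImageCacheSketch

namespace ProcessingElementSketch {
int ch_out = 0;  // cached copy of layer.channels_out
int stride = 1;  // cached copy of layer.stride

void setLayerConfig(const layer_t &layer) {
  // Assumption: only stride 1 and stride 2 are supported.
  assert(layer.stride == 1 || layer.stride == 2);
  ch_out = layer.channels_out;
  stride = layer.stride;
}
}  // namespace ProcessingElementSketch

// Push one layer configuration to every module before processing starts.
void setLayerConfigs(const layer_t &layer) {
  ImageCacheSketch::setLayerConfig(layer);
  ProcessingElementSketch::setLayerConfig(layer);
}
```

After `setLayerConfigs(layer)` has run, every later call into a module can rely on its cached copy instead of receiving the full layer struct again.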
//fpga_top
// Load Weights from DRAM
WeightsCache::loadFromDRAM(SHARED_DRAM);
// Preload Row 0 + Pixel (1,0)
MemoryController::setPixelLoadRow(0);
ImageCache::preloadRowFromDRAM(SHARED_DRAM);
MemoryController::setPixelLoadRow(1);
ImageCache::preloadPixelFromDRAM(SHARED_DRAM);
Loop over the height, then over the width. For every pixel position:
// fpga_top for height for width
// Load Next Pixel (automatically checks #pixels left)
ImageCache::preloadPixelFromDRAM(SHARED_DRAM);
// per pixel
// Stride-2 Skipping
if (layer.stride == 2 & (x % 2 | y % 2)) {
LOG("stride-2, skipping pixel\n");
LOG_LEVEL_DECR;
continue;
}
When stride is 2, the pixels that do not need a convolution (odd x or odd y) are skipped. (continue ends only the current loop iteration.)
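The effect of the stride-2 skipping can be checked with a small stand-alone sketch. The helper names `skipPixel` and `countProcessedPixels` are hypothetical and not part of ZynqNet; the skip condition mirrors the one in the excerpt above, written with logical operators.

```cpp
#include <cassert>

// Returns true when the pixel at (y, x) is skipped: with stride 2 only the
// pixels with even x and even y are convolved.
bool skipPixel(int stride, int y, int x) {
  assert(stride == 1 || stride == 2);  // assumption: no other strides
  return stride == 2 && (x % 2 != 0 || y % 2 != 0);
}

// Walks the height/width loops and counts the pixels that are processed.
int countProcessedPixels(int stride, int height, int width) {
  int processed = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      if (skipPixel(stride, y, x)) continue;  // ends this iteration only
      ++processed;
    }
  }
  return processed;
}
```

For a 4x4 input, stride 1 processes all 16 pixels, while stride 2 processes only the 4 pixels at even coordinates, which matches a 2x2 output.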
// fpga top per pixel for channels in
ProcessingElement::processInputChannel(y, x, ci, layer.channels_out);
// processing_element.cpp
void ProcessingElement::processInputChannel(const coordinate_t y,
const coordinate_t x,
const channel_t ci_in,
const channel_t ch_out) {
#pragma HLS inline off
#pragma HLS FUNCTION_INSTANTIATE variable = ci_in
#pragma HLS dataflow
channel_t ci = ci_in;
weightaddr_t ci_offset;
data_t pixel_buffer[9];
#pragma HLS ARRAY_PARTITION variable = pixel_buffer complete dim = 0
// Preload Image Pixel Buffer (fetch pixels around (y,x,ci))
preloadPixelsAndPrecalcCIoffset(y, x, ci, ch_out, ci_offset, pixel_buffer);
// MACC All Output Channels
processAllCHout(ch_out, ci, ci_offset, pixel_buffer);
}
ProcessingElement::processInputChannel performs the MACC (multiply-accumulate) operations on the input pixels, and the results are accumulated in the OCache (output cache).
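What the MACC step computes for one input channel can be sketched in plain C++. This is a simplified software model under stated assumptions, not the HLS implementation: `data_t` is taken to be `float`, the weights are assumed to be laid out as nine values per output channel, and `maccAllOutputChannels` and `out_cache` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using data_t = float;  // assumption: floating-point data type

// Multiplies the 3x3 pixel patch of one input channel with the 3x3 filter
// of every output channel and adds each result to that channel's
// accumulator. weights holds 9 values per output channel (assumed layout).
void maccAllOutputChannels(const data_t pixel_buffer[9],
                           const std::vector<data_t> &weights,
                           std::vector<data_t> &out_cache) {
  assert(weights.size() == out_cache.size() * 9);
  for (std::size_t co = 0; co < out_cache.size(); ++co) {
    data_t sum = 0;
    for (int k = 0; k < 9; ++k) {
      sum += pixel_buffer[k] * weights[co * 9 + k];
    }
    out_cache[co] += sum;  // accumulate across input channels
  }
}
```

Calling this function once per input channel leaves the full convolution sum for each output channel in `out_cache`, which is why the accumulator is added to rather than overwritten.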
// Calculate Output Pixel Coordinates
dimension_t y_out = (layer.stride == 2) ? (int)y / 2 : (int)y;
dimension_t x_out = (layer.stride == 2) ? (int)x / 2 : (int)x;
MemoryController::setupPixelWriteback(y_out, x_out);
// Select bias coefficients
// WCache.setInputChannel(layer.channels_in, layer.channels_out);
weightaddr_t ci_offset =
WeightsCache::precalcInputOffset(layer.channels_in);
// per pixel for channels out
// Postprocess
data_t processed = ProcessingElement::postprocess(co, ci_offset);
// Writeback to DRAM
MemoryController::writeBackOutputChannel(SHARED_DRAM, co, processed);
The post-processing step adds the bias and applies the ReLU activation, and the result is then written back to DRAM.
// Accumulate for Global Pooling (if enabled)
if (layer.global_pool == true) {
if (x_out == 0 && y_out == 0)
GPoolCache::setChannel(co, processed);
else
GPoolCache::accumulateChannel(co, processed);
}
If global_pool is enabled for the layer, the result is also accumulated for global pooling.
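The post-processing and global-pooling steps can be modelled together in a few lines. This is a sketch under assumptions rather than the ZynqNet code: `data_t` is taken to be `float`, `postprocessSketch` receives the accumulated value and the bias directly, and `GPoolSketch` is a hypothetical stand-in for GPoolCache that keeps only a running sum per channel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using data_t = float;  // assumption: floating-point data type

// Bias addition followed by ReLU: negative results are clamped to zero.
data_t postprocessSketch(data_t accumulated, data_t bias) {
  return std::max(accumulated + bias, data_t(0));
}

// Running per-channel sum over all output pixels of a layer.
struct GPoolSketch {
  std::vector<data_t> sums;

  explicit GPoolSketch(std::size_t channels) : sums(channels, 0) {}

  // The first output pixel (0, 0) overwrites the stored value, which resets
  // the sum for a new layer; every later pixel is added to it.
  void update(int y_out, int x_out, std::size_t co, data_t value) {
    assert(co < sums.size());
    if (x_out == 0 && y_out == 0)
      sums[co] = value;
    else
      sums[co] += value;
  }
};
```

Overwriting on the first pixel avoids a separate reset pass over the cache, which mirrors the set/accumulate branch in the excerpt above.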