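I once ran into a problem in a design: how to add a large number of values (dozens, hundreds, or even more) in parallel in Verilog.
The most direct approach usually looks something like this: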
reg [7:0] data [N - 1 : 0];
wire[M:0] sum;
assign sum = data[0] + data[1] + ... + data[N-1];
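There is nothing wrong with this approach in itself, but as N grows the code grows with it and becomes easy to get wrong. In addition, addition is combinational logic, so to improve timing a register stage has to be inserted after every few additions, that is, the design has to be pipelined. Written in the style above, that makes the code even longer. I later tried generate-for statements, but the code was just as long. How could this be written concisely? For a long time I could not find an answer.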
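To make the cost of hand-written pipelining concrete, here is a minimal sketch for only N = 4 unsigned 8-bit inputs, with one register stage after each level of additions. The module and signal names (`sum4_pipelined`, `d0` to `d3`, `s01`, `s23`) are hypothetical and chosen only for illustration.

```verilog
// Hypothetical hand-written two-stage pipeline for N = 4 (illustration only).
module sum4_pipelined (
    input  wire       clk,
    input  wire [7:0] d0, d1, d2, d3,
    output reg  [9:0] sum              // 8 bits + 2 bits of growth for 4 addends
);
    reg [8:0] s01, s23;                // stage 1: registered partial sums

    always @(posedge clk) begin
        s01 <= d0 + d1;                // stage 1
        s23 <= d2 + d3;
        sum <= s01 + s23;              // stage 2: valid two clock cycles after the inputs
    end
endmodule
```

Every doubling of N adds another level of intermediate registers, each with its own width, all of which have to be declared and wired by hand in this style.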
Then one day I came across the code of an FPGA parallel adder tree on GitHub, and it suddenly clicked:
Signed parallel adder tree, no pipeline: https://github.com/raiker/tarsier/blob/master/src/AdderTree.sv
Signed parallel adder tree, pipelined: https://github.com/raiker/tarsier/blob/master/src/AdderTreePipelined.sv
Unsigned parallel adder tree, no pipeline: https://github.com/raiker/tarsier/blob/master/src/UnsignedAdderTree.sv
Unsigned parallel adder tree, pipelined: https://github.com/raiker/tarsier/blob/master/src/UnsignedAdderTreePipelined.sv
All of these are written in SystemVerilog. The key idea is recursive module instantiation: a single module is enough to add any number of values of any bit width in parallel.
Here is the code of UnsignedAdderTreePipelined.sv:
/* This Source Code Form is subject to the terms of the Mozilla Public
* License, v. 2.0. If a copy of the MPL was not distributed with this
* file, You can obtain one at http://mozilla.org/MPL/2.0/. */
module UnsignedAdderTreePipelined(clk, reset, in_addends, in_advance, out_sum);
    parameter DATA_WIDTH;
    parameter LENGTH;
    parameter DELAY_STAGES = $clog2(LENGTH);

    localparam OUT_WIDTH = DATA_WIDTH + $clog2(LENGTH);
    localparam LENGTH_A = LENGTH / 2;
    localparam LENGTH_B = LENGTH - LENGTH_A;
    localparam OUT_WIDTH_A = DATA_WIDTH + $clog2(LENGTH_A);
    localparam OUT_WIDTH_B = DATA_WIDTH + $clog2(LENGTH_B);

    input clk;
    input reset;
    input in_advance; //if 0, the registers do not load their values, so the tree does not advance
    input [DATA_WIDTH-1:0] in_addends [LENGTH];
    output logic [OUT_WIDTH-1:0] out_sum;

    generate
        if (LENGTH == 1) begin
            if (DELAY_STAGES > 0) begin
                //single element, but with some delays left
                logic [DATA_WIDTH-1:0] delay_fifo [DELAY_STAGES];

                assign out_sum = delay_fifo[0];

                always_ff @(posedge clk) begin
                    if (reset) begin
                        for (int i = 0; i < DELAY_STAGES - 1; i++) begin
                            delay_fifo[i] <= '0;
                        end
                        delay_fifo[DELAY_STAGES-1] <= in_advance ? in_addends[0] : '0;
                    end else if (in_advance) begin
                        delay_fifo[DELAY_STAGES-1] <= in_addends[0];
                        for (int i = 0; i < DELAY_STAGES - 1; i++) begin
                            delay_fifo[i] <= delay_fifo[i+1];
                        end
                    end
                end
            end else begin
                //single element, no delays
                assign out_sum = in_addends[0];
            end
        end else begin
            wire [OUT_WIDTH_A-1:0] sum_a;
            wire [OUT_WIDTH_B-1:0] sum_b;

            logic [DATA_WIDTH-1:0] addends_a [LENGTH_A];
            logic [DATA_WIDTH-1:0] addends_b [LENGTH_B];

            //split the inputs into a lower half and an upper half
            always_comb begin
                for (int i = 0; i < LENGTH_A; i++) begin
                    addends_a[i] = in_addends[i];
                end
                for (int i = 0; i < LENGTH_B; i++) begin
                    addends_b[i] = in_addends[i + LENGTH_A];
                end
            end

            if (DELAY_STAGES > 0) begin
                //divide set into two chunks, conquer
                UnsignedAdderTreePipelined #(
                    .DATA_WIDTH(DATA_WIDTH),
                    .LENGTH(LENGTH_A),
                    .DELAY_STAGES(DELAY_STAGES-1)
                ) subtree_a (
                    .clk(clk),
                    .reset(reset),
                    .in_advance(in_advance),
                    .in_addends(addends_a),
                    .out_sum(sum_a)
                );

                UnsignedAdderTreePipelined #(
                    .DATA_WIDTH(DATA_WIDTH),
                    .LENGTH(LENGTH_B),
                    .DELAY_STAGES(DELAY_STAGES-1)
                ) subtree_b (
                    .clk(clk),
                    .reset(reset),
                    .in_advance(in_advance),
                    .in_addends(addends_b),
                    .out_sum(sum_b)
                );

                always_ff @(posedge clk) begin
                    if (DELAY_STAGES == 1) begin
                        //in_advance is checked before reset in this branch
                        if (in_advance) begin
                            out_sum <= sum_a + sum_b;
                        end else if (reset) begin
                            out_sum <= '0;
                        end
                    end else begin
                        if (reset) begin
                            out_sum <= '0;
                        end else if (in_advance) begin
                            out_sum <= sum_a + sum_b;
                        end
                    end
                end
            end else begin
                //no delays left
                UnsignedAdderTreePipelined #(
                    .DATA_WIDTH(DATA_WIDTH),
                    .LENGTH(LENGTH_A),
                    .DELAY_STAGES(0)
                ) subtree_a (
                    .clk(clk),
                    .reset(reset),
                    .in_advance(in_advance),
                    .in_addends(addends_a),
                    .out_sum(sum_a)
                );

                UnsignedAdderTreePipelined #(
                    .DATA_WIDTH(DATA_WIDTH),
                    .LENGTH(LENGTH_B),
                    .DELAY_STAGES(0)
                ) subtree_b (
                    .clk(clk),
                    .reset(reset),
                    .in_advance(in_advance),
                    .in_addends(addends_b),
                    .out_sum(sum_b)
                );

                assign out_sum = sum_a + sum_b;
            end
        end
    endgenerate
endmodule
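The principle is as follows. The inputs are split into two groups by bisection, and each group is handed to a recursive instance of the same module, which splits its inputs into two groups again, and so on, until every group has been reduced to a single value (LENGTH == 1). The result has the structure of a binary tree whose lowest adders each sum two values. The computation then proceeds level by level from the bottom of the tree to the top, where the final sum is produced. With the default DELAY_STAGES = $clog2(LENGTH), one pipeline register stage is inserted at every node of every level, so each two-input addition is followed by one register stage. When LENGTH is not a power of two, the shorter branches are padded with a delay FIFO so that all inputs reach the output with the same latency. This structure can be called a parallel adder tree.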
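As a usage illustration, the sketch below instantiates the module quoted above to sum 16 unsigned 8-bit samples. It assumes UnsignedAdderTreePipelined.sv is compiled alongside it; the wrapper name `SumSixteen` and the signal names `samples` and `total` are hypothetical.

```systemverilog
// Hypothetical wrapper around the quoted module (illustration only).
module SumSixteen (
    input  logic        clk,
    input  logic        reset,
    input  logic [7:0]  samples [16],
    output logic [11:0] total          // DATA_WIDTH + $clog2(LENGTH) = 8 + 4 bits
);
    UnsignedAdderTreePipelined #(
        .DATA_WIDTH(8),
        .LENGTH(16)
        // DELAY_STAGES keeps its default of $clog2(16) = 4
    ) tree (
        .clk(clk),
        .reset(reset),
        .in_advance(1'b1),             // always advance the pipeline
        .in_addends(samples),
        .out_sum(total)
    );
endmodule
```

With the default DELAY_STAGES, each of the four tree levels is registered, so once reset is released `total` should follow `samples` by four clock cycles. Passing a smaller DELAY_STAGES leaves the lowest levels combinational, trading latency for a longer combinational path.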
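After converting the SystemVerilog-specific syntax in the code above to plain Verilog and simulating it, I replaced all of the parallel addition logic in the original design with it, and finally verified the result on FPGA hardware. The problem of designing parallel addition on an FPGA was finally solved!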
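The converted code itself is not shown here. The following is only a sketch of what such a port might look like, not the conversion actually used. It assumes a Verilog-2005 tool (for `$clog2` and recursive instantiation inside generate), replaces the unpacked array port with one flat bus because plain Verilog does not allow array ports, and omits `reset` and `in_advance` for brevity. All names (`adder_tree_flat`, `in_flat`, `STAGES`) are hypothetical.

```verilog
// Hypothetical Verilog-2005 sketch of a recursive pipelined adder tree.
// Addend i occupies in_flat[i*DATA_WIDTH +: DATA_WIDTH].
module adder_tree_flat #(
    parameter DATA_WIDTH = 8,
    parameter LENGTH     = 4,
    parameter STAGES     = $clog2(LENGTH)      // register stages from input to output
) (
    input  wire                                 clk,
    input  wire [LENGTH*DATA_WIDTH-1:0]         in_flat,
    output wire [DATA_WIDTH+$clog2(LENGTH)-1:0] out_sum
);
    localparam LENGTH_A = LENGTH / 2;
    localparam LENGTH_B = LENGTH - LENGTH_A;
    localparam NEXT     = (STAGES > 0) ? STAGES - 1 : 0;

    generate
        if (LENGTH == 1) begin : leaf
            if (STAGES == 0) begin : pass
                assign out_sum = in_flat;
            end else begin : delay
                // Shift register that keeps this short branch aligned with deeper ones.
                reg [DATA_WIDTH-1:0] pipe [0:STAGES-1];
                integer i;
                always @(posedge clk) begin
                    pipe[0] <= in_flat;
                    for (i = 1; i < STAGES; i = i + 1)
                        pipe[i] <= pipe[i-1];
                end
                assign out_sum = pipe[STAGES-1];
            end
        end else begin : node
            wire [DATA_WIDTH+$clog2(LENGTH_A)-1:0] sum_a;
            wire [DATA_WIDTH+$clog2(LENGTH_B)-1:0] sum_b;

            // Lower LENGTH_A addends go to one subtree, the rest to the other.
            adder_tree_flat #(.DATA_WIDTH(DATA_WIDTH), .LENGTH(LENGTH_A), .STAGES(NEXT))
                sub_a (.clk(clk), .in_flat(in_flat[LENGTH_A*DATA_WIDTH-1:0]), .out_sum(sum_a));
            adder_tree_flat #(.DATA_WIDTH(DATA_WIDTH), .LENGTH(LENGTH_B), .STAGES(NEXT))
                sub_b (.clk(clk), .in_flat(in_flat[LENGTH*DATA_WIDTH-1:LENGTH_A*DATA_WIDTH]), .out_sum(sum_b));

            if (STAGES > 0) begin : registered
                reg [DATA_WIDTH+$clog2(LENGTH)-1:0] sum_r;
                always @(posedge clk) sum_r <= sum_a + sum_b;
                assign out_sum = sum_r;
            end else begin : comb
                assign out_sum = sum_a + sum_b;
            end
        end
    endgenerate
endmodule
```

For a tool limited to Verilog-2001, `$clog2` would have to be replaced by a user-written constant function, which this sketch does not attempt.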