FPGA Parallel Adder Tree Design

        I once ran into a problem in a design: how to add a large number of values (dozens, hundreds, or even more) in parallel in Verilog.

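        The most direct approach that comes to mind usually looks like this (for N unsigned 8-bit values, M must be at least 7 + ceil(log2(N)) so that the sum cannot overflow):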

reg [7:0]  data [N - 1 : 0];

wire[M:0]  sum;

assign sum = data[0] + data[1] + ... + data[N-1];

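        There is nothing wrong with this approach in itself, but as N grows the code grows with it and becomes easy to get wrong. In addition, the adders are combinational logic, so to improve timing a register stage has to be inserted after every few additions to pipeline the design. Written in the style above, that takes even more code. I later tried generate-for statements, but the code was just as long. How to do this with concise code eluded me for a long time.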
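For the purely combinational case, a procedural loop already avoids writing out every term. The following is only a sketch: the module name `LoopSum` and the default `N` are made up for illustration, and the data width is fixed at 8 bits as in the snippet above.

```systemverilog
// Hypothetical module, for illustration only.
module LoopSum #(
	parameter int N = 16
) (
	input  logic [7:0]           data [N],
	output logic [7+$clog2(N):0] sum      // 8 + ceil(log2(N)) bits, so the sum cannot overflow
);
	always_comb begin
		sum = '0;
		// Accumulate all N inputs; this is still one combinational path.
		for (int i = 0; i < N; i++)
			sum = sum + data[i];
	end
endmodule
```

This keeps the source short, but it describes a chain of N-1 adders with no registers in between (a synthesis tool may or may not rebalance it), so it does nothing for the pipelining problem described above.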

        Then one day I came across an FPGA parallel adder tree design on GitHub, and it all clicked:

        Signed parallel adder tree, no pipeline: https://github.com/raiker/tarsier/blob/master/src/AdderTree.sv

        Signed parallel adder tree, pipelined: https://github.com/raiker/tarsier/blob/master/src/AdderTreePipelined.sv

        Unsigned parallel adder tree, no pipeline: https://github.com/raiker/tarsier/blob/master/src/UnsignedAdderTree.sv

        Unsigned parallel adder tree, pipelined: https://github.com/raiker/tarsier/blob/master/src/UnsignedAdderTreePipelined.sv

        All of these are written in SystemVerilog. The key idea is recursive module instantiation: a single module is enough to add any number of values of any bit width in parallel.

        Here is the code of UnsignedAdderTreePipelined.sv:

/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

module UnsignedAdderTreePipelined(clk, reset, in_addends, in_advance, out_sum);

parameter DATA_WIDTH;
parameter LENGTH;
parameter DELAY_STAGES = $clog2(LENGTH);

localparam OUT_WIDTH = DATA_WIDTH + $clog2(LENGTH);
localparam LENGTH_A = LENGTH / 2;
localparam LENGTH_B = LENGTH - LENGTH_A;
localparam OUT_WIDTH_A = DATA_WIDTH + $clog2(LENGTH_A);
localparam OUT_WIDTH_B = DATA_WIDTH + $clog2(LENGTH_B);

input clk;
input reset;
input in_advance; //if 0, the registers do not load their values, so the tree does not advance
input [DATA_WIDTH-1:0] in_addends [LENGTH];
output logic [OUT_WIDTH-1:0] out_sum;

generate
	if (LENGTH == 1) begin
		if (DELAY_STAGES > 0) begin
			//single element, but with some delays left
			logic [DATA_WIDTH-1:0] delay_fifo [DELAY_STAGES];
			assign out_sum = delay_fifo[0];
			
			always_ff @(posedge clk) begin
				if (reset) begin
					for (int i = 0; i < DELAY_STAGES - 1; i++) begin
						delay_fifo[i] <= '0;
					end
					delay_fifo[DELAY_STAGES-1] <= in_advance ? in_addends[0] : '0;
				end else if (in_advance) begin
					delay_fifo[DELAY_STAGES-1] <= in_addends[0];
					
					for (int i = 0; i < DELAY_STAGES - 1; i++) begin
						delay_fifo[i] <= delay_fifo[i+1];
					end
				end
			end
		end else begin
			//single element, no delays
			assign out_sum = in_addends[0];
		end
	end else begin
		wire [OUT_WIDTH_A-1:0] sum_a;
		wire [OUT_WIDTH_B-1:0] sum_b;
		
		logic [DATA_WIDTH-1:0] addends_a [LENGTH_A];
		logic [DATA_WIDTH-1:0] addends_b [LENGTH_B];
		
		always_comb begin
			for (int i = 0; i < LENGTH_A; i++) begin
				addends_a[i] = in_addends[i];
			end
			
			for (int i = 0; i < LENGTH_B; i++) begin
				addends_b[i] = in_addends[i + LENGTH_A];
			end
		end
		
		if (DELAY_STAGES > 0) begin
			//divide set into two chunks, conquer
			UnsignedAdderTreePipelined #(
				.DATA_WIDTH(DATA_WIDTH),
				.LENGTH(LENGTH_A),
				.DELAY_STAGES(DELAY_STAGES-1)
			) subtree_a (
				.clk(clk),
				.reset(reset),
				.in_advance(in_advance),
				.in_addends(addends_a),
				.out_sum(sum_a)
			);
			
			UnsignedAdderTreePipelined #(
				.DATA_WIDTH(DATA_WIDTH),
				.LENGTH(LENGTH_B),
				.DELAY_STAGES(DELAY_STAGES-1)
			) subtree_b (
				.clk(clk),
				.reset(reset),
				.in_advance(in_advance),
				.in_addends(addends_b),
				.out_sum(sum_b)
			);
			
			always_ff @(posedge clk) begin
				if (DELAY_STAGES == 1) begin
					if (in_advance) begin
						out_sum <= sum_a + sum_b;
					end else if (reset) begin
						out_sum <= '0;
					end
				end else begin
					if (reset) begin
						out_sum <= '0;
					end else if (in_advance) begin
						out_sum <= sum_a + sum_b;
					end
				end
			end
		end else begin
			//no delays left
			UnsignedAdderTreePipelined #(
				.DATA_WIDTH(DATA_WIDTH),
				.LENGTH(LENGTH_A),
				.DELAY_STAGES(0)
			) subtree_a (
				.clk(clk),
				.reset(reset),
				.in_advance(in_advance),
				.in_addends(addends_a),
				.out_sum(sum_a)
			);
			
			UnsignedAdderTreePipelined #(
				.DATA_WIDTH(DATA_WIDTH),
				.LENGTH(LENGTH_B),
				.DELAY_STAGES(0)
			) subtree_b (
				.clk(clk),
				.reset(reset),
				.in_advance(in_advance),
				.in_addends(addends_b),
				.out_sum(sum_b)
			);
			
			assign out_sum = sum_a + sum_b;
		end
	end
endgenerate

endmodule

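        The principle is as follows. The inputs are split into two groups, and each group is handled by a recursive instance of the same module, which splits its own inputs in two again. This repeats until everything has been broken down into additions of two numbers or into single numbers, which gives a binary tree. The computation then proceeds level by level from the bottom of the tree to the top, where the final result is produced. With the default DELAY_STAGES = $clog2(LENGTH), a pipeline register is inserted at every node of every level, that is, after every two-input adder, so the result appears $clog2(LENGTH) clock cycles after the inputs while in_advance is held high. A branch that has shrunk to a single number is delayed by the remaining number of stages so that it stays aligned with deeper branches. This structure can be called a parallel adder tree.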
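To see the tree and its latency in simulation, a small testbench along the following lines may help. It is a sketch under assumptions: the testbench name, the 10-time-unit clock, and the choice of four 8-bit inputs are all made up for illustration, and it assumes the simulator accepts UnsignedAdderTreePipelined.sv as listed. The module declares DATA_WIDTH and LENGTH without default values, so both are overridden explicitly.

```systemverilog
// Hypothetical testbench, for illustration only.
module tb_adder_tree;
	localparam int DATA_WIDTH = 8;
	localparam int LENGTH     = 4;
	localparam int LATENCY    = $clog2(LENGTH);              // default DELAY_STAGES
	localparam int OUT_WIDTH  = DATA_WIDTH + $clog2(LENGTH);
	localparam int EXPECTED   = LENGTH * (2**DATA_WIDTH - 1); // all inputs at their maximum

	logic clk = 1'b0;
	logic reset = 1'b1;
	logic in_advance = 1'b1;
	logic [DATA_WIDTH-1:0] addends [LENGTH];
	logic [OUT_WIDTH-1:0]  sum;

	UnsignedAdderTreePipelined #(
		.DATA_WIDTH(DATA_WIDTH),
		.LENGTH(LENGTH)
	) dut (
		.clk(clk),
		.reset(reset),
		.in_addends(addends),
		.in_advance(in_advance),
		.out_sum(sum)
	);

	always #5 clk = ~clk;

	initial begin
		// Drive every input with the largest value to exercise the full output width.
		for (int i = 0; i < LENGTH; i++)
			addends[i] = '1;

		repeat (2) @(posedge clk);
		reset <= 1'b0;

		// Wait one cycle more than the pipeline latency so the output is stable when sampled.
		repeat (LATENCY + 1) @(posedge clk);

		if (sum !== EXPECTED)
			$error("sum = %0d, expected %0d", sum, EXPECTED);
		else
			$display("sum matches expected value %0d", EXPECTED);
		$finish;
	end
endmodule
```

The inputs are held constant, so the check does not depend on exactly which clock edge the result first appears on. To measure the latency itself, change the inputs on a known edge and count cycles until the output changes.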

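        I rewrote the SystemVerilog-specific syntax in the code above as plain Verilog, simulated it, and used it to replace all the parallel addition logic in my original design. It finally passed verification on FPGA hardware. The FPGA parallel addition problem was solved at last!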
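The converted Verilog is not shown in this post. As a rough idea of what such a conversion can involve, here is a simplified sketch, not the code that was actually used. The module name `adder_tree_flat`, the parameter `DEPTH`, and the flattened input bus are assumptions made for illustration, and reset and in_advance are left out for brevity. It assumes Verilog-2005 (for `$clog2`), a tool that accepts recursive instantiation inside `generate`, LENGTH >= 1, and DEPTH >= $clog2(LENGTH).

```verilog
// Hypothetical plain-Verilog sketch, for illustration only.
module adder_tree_flat #(
	parameter DATA_WIDTH = 8,
	parameter LENGTH     = 4,
	parameter DEPTH      = $clog2(LENGTH) // register stages from input to output
) (
	input  wire                                 clk,
	// Unpacked array ports are not available, so addend i occupies
	// bits [i*DATA_WIDTH +: DATA_WIDTH] of one flat bus.
	input  wire [LENGTH*DATA_WIDTH-1:0]         in_addends,
	output wire [DATA_WIDTH+$clog2(LENGTH)-1:0] out_sum
);
	localparam LENGTH_A  = LENGTH / 2;
	localparam LENGTH_B  = LENGTH - LENGTH_A;
	localparam WIDTH_A   = DATA_WIDTH + $clog2(LENGTH_A);
	localparam WIDTH_B   = DATA_WIDTH + $clog2(LENGTH_B);
	localparam OUT_WIDTH = DATA_WIDTH + $clog2(LENGTH);

	generate
		if (LENGTH == 1 && DEPTH == 0) begin : g_wire
			// Single addend, no stages left: pass it through.
			assign out_sum = in_addends;
		end else if (LENGTH == 1) begin : g_delay
			// Single addend: delay it by DEPTH cycles to stay aligned with deeper branches.
			reg [DATA_WIDTH-1:0] delay [0:DEPTH-1];
			integer i;
			always @(posedge clk) begin
				delay[0] <= in_addends;
				for (i = 1; i < DEPTH; i = i + 1)
					delay[i] <= delay[i-1];
			end
			assign out_sum = delay[DEPTH-1];
		end else begin : g_node
			wire [WIDTH_A-1:0]   sum_a;
			wire [WIDTH_B-1:0]   sum_b;
			reg  [OUT_WIDTH-1:0] sum_r;

			// Lower LENGTH_A addends go to subtree_a, the rest to subtree_b.
			adder_tree_flat #(
				.DATA_WIDTH(DATA_WIDTH), .LENGTH(LENGTH_A), .DEPTH(DEPTH-1)
			) subtree_a (
				.clk(clk),
				.in_addends(in_addends[LENGTH_A*DATA_WIDTH-1:0]),
				.out_sum(sum_a)
			);

			adder_tree_flat #(
				.DATA_WIDTH(DATA_WIDTH), .LENGTH(LENGTH_B), .DEPTH(DEPTH-1)
			) subtree_b (
				.clk(clk),
				.in_addends(in_addends[LENGTH*DATA_WIDTH-1:LENGTH_A*DATA_WIDTH]),
				.out_sum(sum_b)
			);

			// One register after each two-input adder.
			always @(posedge clk)
				sum_r <= sum_a + sum_b;
			assign out_sum = sum_r;
		end
	endgenerate
endmodule
```

The main changes compared with the SystemVerilog version are the flattened input bus in place of the unpacked array port, `wire`/`reg` in place of `logic`, and `always @(posedge clk)` in place of `always_ff`. The recursion through `generate` stays the same.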

        

 
