FPGA中系统优化的一些基本方法

有时需要进行一些相对复杂些的计算，如乘除取模等，此时当输入数的位宽不是太大时，可以采用LUT(look up table)，这样将复杂运算直接转化成读取LUT，从资源与速度方面都有较大提高。在xilinxFPGA中，提供基本的LUT是六输入，其也可以变成两个五输入。

Quick FPGA Hacks: Population count

Presented at FPL2014 the paper “Pipelined Compressor Tree Optimization using Integer Linear Programming” by Martin Kumm and Peter Zipf, University of Kassel uses clever (Xilinx) slice-wide (four 6-LUTs + CARRY4) based compressors to reduce as many as 12 input bits to 5 sum bits for an efficiency of up to (12-5) / 4 = 1.75 bits/LUT. This reminded me of a nice application of compressors in FPGAs — population count a.k.a. bit count, e.g. determine the number of one bits set in a word — which uses simpler LUT-based (no carry chain) compressors to good effect. I first learned about this approach in some Altera documentation. Using modern FPGAs’ 6-LUTs it is easy to build a 6-bit wide population count from a bitvector [5:0] input in one LUT delay. This is a “6:3 compressor”:

//
// 6:3 compressor as a 64x3b ROM -- three 6-LUTs
//
module c63(
    input [5:0] i,
    output reg [2:0] o);    // # of 1's in i

    always @* begin
        case (i)
        6'h00: o = 0; 6'h01: o = 1; 6'h02: o = 1; 6'h03: o = 2;
        6'h04: o = 1; 6'h05: o = 2; 6'h06: o = 2; 6'h07: o = 3;
        6'h08: o = 1; 6'h09: o = 2; 6'h0A: o = 2; 6'h0B: o = 3;
        6'h0C: o = 2; 6'h0D: o = 3; 6'h0E: o = 3; 6'h0F: o = 4;
        6'h10: o = 1; 6'h11: o = 2; 6'h12: o = 2; 6'h13: o = 3;
        6'h14: o = 2; 6'h15: o = 3; 6'h16: o = 3; 6'h17: o = 4;
        6'h18: o = 2; 6'h19: o = 3; 6'h1A: o = 3; 6'h1B: o = 4;
        6'h1C: o = 3; 6'h1D: o = 4; 6'h1E: o = 4; 6'h1F: o = 5;
        6'h20: o = 1; 6'h21: o = 2; 6'h22: o = 2; 6'h23: o = 3;
        6'h24: o = 2; 6'h25: o = 3; 6'h26: o = 3; 6'h27: o = 4;
        6'h28: o = 2; 6'h29: o = 3; 6'h2A: o = 3; 6'h2B: o = 4;
        6'h2C: o = 3; 6'h2D: o = 4; 6'h2E: o = 4; 6'h2F: o = 5;
        6'h30: o = 2; 6'h31: o = 3; 6'h32: o = 3; 6'h33: o = 4;
        6'h34: o = 3; 6'h35: o = 4; 6'h36: o = 4; 6'h37: o = 5;
        6'h38: o = 3; 6'h39: o = 4; 6'h3A: o = 4; 6'h3B: o = 5;
        6'h3C: o = 4; 6'h3D: o = 5; 6'h3E: o = 5; 6'h3F: o = 6;
        endcase
    end
endmodule

A 36b popcount may be implemented using nine such 6:3 compressors and a small ternary adder. The first six compressors determine the no. of ones in each bundle of 6 input bits:

//
// 36-bit bitcount
//
module pop36(
    input [35:0] i,
    output [5:0] sum);  // # of 1's in i

    wire [2:0] c0500, c1106, c1712, c2318, c2924, c3530, c0, c1, c2;

    // add six bundles of six bits
    c63 c0500_(i[ 5: 0], c0500);
    c63 c1106_(i[11: 6], c1106);
    c63 c1712_(i[17:12], c1712);
    c63 c2318_(i[23:18], c2318);
    c63 c2924_(i[29:24], c2924);
    c63 c3530_(i[35:30], c3530);

These transform the problem of adding 36 bits to that of adding six 3b vectors. Surprisingly these can also be added with another round of compressors, by focusing on the [0], [1], and [2] bit positions separately.

    // sum the bits set in the [0], [1], and [2] bit positions separately
    c63 c0_({c0500[0],c1106[0],c1712[0],c2318[0],c2924[0],c3530[0]}, c0);
    c63 c1_({c0500[1],c1106[1],c1712[1],c2318[1],c2924[1],c3530[1]}, c1);
    c63 c2_({c0500[2],c1106[2],c1712[2],c2318[2],c2924[2],c3530[2]}, c2);

Now the total number of bits is (c0 * 2^0) + (c1 * 2^1) + (c2 * 2^2):

    assign sum = {2'b0,c0} + {1'b0,c1,1'b0} + {c2,2'b0};
endmodule

The ternary adder is sufficiently simple that synthesis maps it to two levels of pure LUTs (no carry chain). All totaled, this has a delay of four LUTs and runs at 370 MHz in a Kintex-7 -1 speed grade. By adding a pipeline register between the compressors and the ternary add, the circuit is reduced to two LUT delays, and runs at 550 MHz. By adding a pipeline register between each level of LUTs, the critical path is under 1.2 ns, which could run at 860 MHz. However, the minimum clock period for this device is 1.6 ns (625 MHz). Happy hacking!

FPGA中系统优化的一些基本方法

Quick FPGA Hacks: Population count

你可能感兴趣的:(Verilog)