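Building on the hardware design from the previous article, this article continues with the RTL design of the Bitcoin miner, written in Verilog.
The previous article analyzed the Bitcoin mining algorithm, in particular the SHA256 hash algorithm. A usable system, however, also needs peripheral modules that provide the clock, reset, data interfaces and so on. This article gives the concrete design, then synthesizes, implements and tests it. Before designing the system, let us list what a miner system needs:
Putting these requirements together gives the rough system architecture diagram below; the rest of the hardware design follows this diagram.
First, a recap of the hardware of the SHA256 hash core designed in the previous article. It consists roughly of a message-expansion loop and a state-machine loop; readers unfamiliar with it can go back to the previous article for the details:
Before starting on the hash core code, we need to introduce the DSP48E1, one of the basic resources in an FPGA. For an overview of these basic resources, see this blog post: FPGA中的基础逻辑单元–Xilinx
We use the DSP48E1 because we want to push resource usage and clock frequency to the limit and reach the highest possible efficiency. The DSP48E1 is the DSP slice in the XC7A100T FPGA on the NEXYS4 board (most newer, higher-end FPGAs now use the DSP48E2). It compensates for the FPGA fabric's weakness at arithmetic such as multiplication and addition, so this kind of logic can run at a higher clock frequency without consuming large numbers of lookup tables.
There are roughly three ways to use a DSP:
As a hardcore blogger, I naturally go for the most hardcore way: instantiating the primitive directly. Doing so requires a fairly deep understanding of the primitive; see the official document UG479 DSP48E1 for details. Instantiating the primitive directly makes the code very long, so as a compromise we write a DSP48E1_wrapper.v that reduces the number of parameters and ports and makes instantiation easier. The key is the configuration of the three mode inputs ALUMODE, INMODE and OPMODE, which control how data is routed through the DSP's pre-adder, multiplier and post-adder. The later instantiations of this wrapper show how they are used: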
module DSP48E1_wrapper #(
parameter A_INPUT = "DIRECT",
parameter B_INPUT = "DIRECT",
parameter USE_MULT = "MULTIPLY",
parameter USE_SIMD = "ONE48",
parameter ADREG = 1,
parameter ALUMODEREG = 1,
parameter ACASCREG = 1,
parameter AREG = 1,
parameter BCASCREG = 1,
parameter BREG = 1,
parameter CREG = 1,
parameter DREG = 1,
parameter INMODEREG = 1,
parameter MREG = 1,
parameter OPMODEREG = 1,
parameter PREG = 1
)
(
input clk,
input rst,
input CE,
input [3:0] ALUMODE,
input [4:0] INMODE,
input [6:0] OPMODE,
input [29:0] A,
input [17:0] B,
input [47:0] C,
input [24:0] D,
input [2:0] CARRYINSEL,
input CARRYIN,
output [3:0] CARRYOUT,
output [47:0] P,
input CARRYCASCIN,
output CARRYCASCOUT
);
DSP48E1 #(
// Feature Control Attributes: Data Path Selection
.A_INPUT(A_INPUT), // Selects A input source, "DIRECT" (A port) or "CASCADE" (ACIN port)
.B_INPUT(B_INPUT), // Selects B input source, "DIRECT" (B port) or "CASCADE" (BCIN port)
.USE_MULT(USE_MULT), // Select multiplier usage (DYNAMIC, MULTIPLY, NONE)
.USE_SIMD(USE_SIMD), // SIMD selection (FOUR12, ONE48, TWO24)
// Pattern Detector Attributes: Pattern Detection Configuration
.AUTORESET_PATDET("NO_RESET"), // NO_RESET, RESET_MATCH, RESET_NOT_MATCH
.MASK(48'h3fffffffffff), // 48-bit mask value for pattern detect (1=ignore)
.PATTERN(48'h000000000000), // 48-bit pattern match for pattern detect
.SEL_MASK("MASK"), // C, MASK, ROUNDING_MODE1, ROUNDING_MODE2
.SEL_PATTERN("PATTERN"), // Select pattern value (C, PATTERN)
.USE_PATTERN_DETECT("NO_PATDET"), // Enable pattern detect (NO_PATDET, PATDET)
// Programmable Inversion Attributes: Specifies built-in programmable inversion on specific pins
.IS_ALUMODE_INVERTED(4'b0000), // Optional inversion for ALUMODE
.IS_CARRYIN_INVERTED(1'b0), // Optional inversion for CARRYIN
.IS_CLK_INVERTED(1'b0), // Optional inversion for CLK
.IS_INMODE_INVERTED(5'b00000), // Optional inversion for INMODE
.IS_OPMODE_INVERTED(7'b0000000), // Optional inversion for OPMODE (7 bits on DSP48E1)
// Register Control Attributes: Pipeline Register Configuration
.ACASCREG(ACASCREG), // Number of pipeline stages between A/ACIN and ACOUT (0-2)
.ADREG(ADREG), // Pipeline stages for pre-adder (0-1)
.ALUMODEREG(ALUMODEREG), // Pipeline stages for ALUMODE (0-1)
.AREG(AREG), // Pipeline stages for A (0-2)
.BCASCREG(BCASCREG), // Number of pipeline stages between B/BCIN and BCOUT (0-2)
.BREG(BREG), // Pipeline stages for B (0-2)
.CARRYINREG(1), // Pipeline stages for CARRYIN (0-1)
.CARRYINSELREG(1), // Pipeline stages for CARRYINSEL (0-1)
.CREG(CREG), // Pipeline stages for C (0-1)
.DREG(DREG), // Pipeline stages for D (0-1)
.INMODEREG(INMODEREG), // Pipeline stages for INMODE (0-1)
.MREG(MREG), // Multiplier pipeline stages (0-1)
.OPMODEREG(OPMODEREG), // Pipeline stages for OPMODE (0-1)
.PREG(PREG) // Number of pipeline stages for P (0-1)
)
DSP48E1_inst (
// Cascade outputs: Cascade Ports
.ACOUT(), // 30-bit output: A port cascade
.BCOUT(), // 18-bit output: B cascade
.CARRYCASCOUT(CARRYCASCOUT), // 1-bit output: Cascade carry
.MULTSIGNOUT(), // 1-bit output: Multiplier sign cascade
.PCOUT(), // 48-bit output: Cascade output
// Control outputs: Control Inputs/Status Bits
.OVERFLOW(), // 1-bit output: Overflow in add/acc
.PATTERNBDETECT(), // 1-bit output: Pattern bar detect
.PATTERNDETECT(), // 1-bit output: Pattern detect
.UNDERFLOW(), // 1-bit output: Underflow in add/acc
// Data outputs: Data Ports
.CARRYOUT(CARRYOUT), // 4-bit output: Carry
.P(P), // 48-bit output: Primary data
// Cascade inputs: Cascade Ports
.ACIN(), // 30-bit input: A cascade data
.BCIN(), // 18-bit input: B cascade
.CARRYCASCIN(CARRYCASCIN), // 1-bit input: Cascade carry
.MULTSIGNIN(), // 1-bit input: Multiplier sign cascade
.PCIN(), // 48-bit input: P cascade
// Control inputs: Control Inputs/Status Bits
.ALUMODE(ALUMODE), // 4-bit input: ALU control
.CARRYINSEL(CARRYINSEL), // 3-bit input: Carry select
.CLK(clk), // 1-bit input: Clock
.INMODE(INMODE), // 5-bit input: INMODE control
.OPMODE(OPMODE), // 7-bit input: Operation mode
// Data inputs: Data Ports
.A(A), // 30-bit input: A data
.B(B), // 18-bit input: B data
.C(C), // 48-bit input: C data
.CARRYIN(CARRYIN), // 1-bit input: Carry-in
.D(D), // 25-bit input: D data
// Reset/Clock Enable inputs: Reset/Clock Enable Inputs
.CEA1(CE), // 1-bit input: Clock enable for 1st stage AREG
.CEA2(CE), // 1-bit input: Clock enable for 2nd stage AREG
.CEAD(CE), // 1-bit input: Clock enable for ADREG
.CEALUMODE(CE), // 1-bit input: Clock enable for ALUMODE
.CEB1(CE), // 1-bit input: Clock enable for 1st stage BREG
.CEB2(CE), // 1-bit input: Clock enable for 2nd stage BREG
.CEC(CE), // 1-bit input: Clock enable for CREG
.CECARRYIN(CE), // 1-bit input: Clock enable for CARRYINREG
.CECTRL(CE), // 1-bit input: Clock enable for OPMODEREG and CARRYINSELREG
.CED(CE), // 1-bit input: Clock enable for DREG
.CEINMODE(CE), // 1-bit input: Clock enable for INMODEREG
.CEM(CE), // 1-bit input: Clock enable for MREG
.CEP(CE), // 1-bit input: Clock enable for PREG
.RSTA(1'b0), // 1-bit input: Reset for AREG
.RSTALLCARRYIN(1'b0), // 1-bit input: Reset for CARRYINREG
.RSTALUMODE(1'b0), // 1-bit input: Reset for ALUMODEREG
.RSTB(1'b0), // 1-bit input: Reset for BREG
.RSTC(1'b0), // 1-bit input: Reset for CREG
.RSTCTRL(1'b0), // 1-bit input: Reset for OPMODEREG and CARRYINSELREG
.RSTD(1'b0), // 1-bit input: Reset for DREG and ADREG
.RSTINMODE(1'b0), // 1-bit input: Reset for INMODEREG
.RSTM(1'b0), // 1-bit input: Reset for MREG
.RSTP(rst) // 1-bit input: Reset for PREG
);
endmodule
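To make the mode settings concrete: the wrapper instances later in this article all use OPMODE = 7'b0001111 with ALUMODE = 4'b0000. According to UG479 this selects X = A:B, Y = C and Z = 0 and computes P = Z + X + Y + CIN, that is, a 48-bit addition of the concatenated A:B and C. Below is a minimal behavioral sketch of that configuration, assuming a carry-in of 0, all input registers disabled and only PREG enabled. The module name dsp_add32_model is hypothetical and not part of the original design.

```systemverilog
// Hypothetical behavioral stand-in for one DSP48E1 configured as a
// 32-bit adder (OPMODE = 7'b0001111, ALUMODE = 4'b0000, PREG = 1,
// input registers off). Not part of the original design.
module dsp_add32_model (
    input  logic        clk,
    input  logic        rst,
    input  logic [31:0] x,    // operand split across A (upper 14 bits) and B (lower 18 bits)
    input  logic [31:0] y,    // operand on the C port
    output logic [31:0] sum   // lower 32 bits of P
);
    logic [29:0] a;
    logic [17:0] b;
    logic [47:0] c;
    logic [47:0] p;

    // Same port mapping as the wrapper instances: zero-extend to the DSP port widths
    assign a = {16'h0, x[31:18]};
    assign b = x[17:0];
    assign c = {16'h0, y};

    // X = A:B, Y = C, Z = 0, carry-in = 0  ->  P = A:B + C, registered by PREG
    always_ff @(posedge clk)
        if (rst) p <= 48'h0;
        else     p <= {a, b} + c;

    // Truncating to 32 bits gives the modulo-2^32 addition that SHA256 needs
    assign sum = p[31:0];
endmodule
```

The instances in the state update logic use exactly this register setup (AREG, BREG and CREG set to 0, PREG set to 1). The message schedule instances keep the wrapper defaults of 1 for the input registers, which adds one more cycle of latency.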
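With the DSP preparation done, we also need the function modules used in the hash computation, namely Ch, Maj, Sig0, Sig1, Ep0 and Ep1. They go into common.sv so the core code can use them: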
module ch(
input logic [31:0] x,
input logic [31:0] y,
input logic [31:0] z,
output logic [31:0] result
);
always_comb result = (x & y) ^ ((~x) & z);
endmodule
module maj(
input logic [31:0] x,
input logic [31:0] y,
input logic [31:0] z,
output logic [31:0] result
);
always_comb result = (x & y) ^ (x & z) ^ (y & z);
endmodule
module sig0(
input logic [31:0] x,
output logic [31:0] result
);
always_comb result = {x[1:0], x[31:2]} ^ {x[12:0], x[31:13]} ^ {x[21:0], x[31:22]};
endmodule
module sig1(
input logic [31:0] x,
output logic [31:0] result
);
always_comb result = {x[5:0], x[31:6]} ^ {x[10:0], x[31:11]} ^ {x[24:0], x[31:25]};
endmodule
module ep0(
input logic [31:0] x,
output logic [31:0] result
);
always_comb result = {x[6:0], x[31:7]} ^ {x[17:0], x[31:18]} ^ {3'h0, x[31:3]};
endmodule
module ep1(
input logic [31:0] x,
output logic [31:0] result
);
always_comb result = {x[16:0], x[31:17]} ^ {x[18:0], x[31:19]} ^ {10'h0, x[31:10]};
endmodule
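Each rotation in these modules is written as a concatenation: {x[n-1:0], x[31:n]} is a right rotation of the 32-bit word by n bits, while {n'h0, x[31:n]} is a plain right shift. The sketch below expresses two of the functions with shifts instead, which can be handy for cross-checking in simulation. The package and function names are hypothetical and do not appear in the original common.sv.

```systemverilog
// Hypothetical helper package, not part of the original common.sv.
package sha256_rot_pkg;
    // Rotate a 32-bit word right by n bits (0 < n < 32)
    function automatic logic [31:0] rotr(input logic [31:0] x, input int unsigned n);
        return (x >> n) | (x << (32 - n));
    endfunction

    // Same function as the sig0 module: ROTR2 ^ ROTR13 ^ ROTR22
    function automatic logic [31:0] sig0_fn(input logic [31:0] x);
        return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22);
    endfunction

    // Same function as the ep0 module: ROTR7 ^ ROTR18 ^ SHR3
    function automatic logic [31:0] ep0_fn(input logic [31:0] x);
        return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3);
    endfunction
endpackage
```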
The detailed code of the hash core, sha256_core.sv, is walked through below:
module sha256_core(
input clk,
input rst,
input logic [511:0] din,
input logic [31:0] din_nounce,
// start without digest, restart the hash state
// start with digest, continue with current hash state
input logic din_digest,
input logic din_valid,
output logic [255:0] dout_hash,
output logic [31:0] dout_nounce,
output logic dout_valid
);
Port definitions:
// Constants used in SHA256
logic [0:7][31:0] state_const = {
32'h6a09e667,
32'hbb67ae85,
32'h3c6ef372,
32'ha54ff53a,
32'h510e527f,
32'h9b05688c,
32'h1f83d9ab,
32'h5be0cd19};
logic [0:63][31:0] K_const = {
32'h428a2f98,32'h71374491,32'hb5c0fbcf,32'he9b5dba5,32'h3956c25b,32'h59f111f1,32'h923f82a4,32'hab1c5ed5,
32'hd807aa98,32'h12835b01,32'h243185be,32'h550c7dc3,32'h72be5d74,32'h80deb1fe,32'h9bdc06a7,32'hc19bf174,
32'he49b69c1,32'hefbe4786,32'h0fc19dc6,32'h240ca1cc,32'h2de92c6f,32'h4a7484aa,32'h5cb0a9dc,32'h76f988da,
32'h983e5152,32'ha831c66d,32'hb00327c8,32'hbf597fc7,32'hc6e00bf3,32'hd5a79147,32'h06ca6351,32'h14292967,
32'h27b70a85,32'h2e1b2138,32'h4d2c6dfc,32'h53380d13,32'h650a7354,32'h766a0abb,32'h81c2c92e,32'h92722c85,
32'ha2bfe8a1,32'ha81a664b,32'hc24b8b70,32'hc76c51a3,32'hd192e819,32'hd6990624,32'hf40e3585,32'h106aa070,
32'h19a4c116,32'h1e376c08,32'h2748774c,32'h34b0bcb5,32'h391c0cb3,32'h4ed8aa4a,32'h5b9cca4f,32'h682e6ff3,
32'h748f82ee,32'h78a5636f,32'h84c87814,32'h8cc70208,32'h90befffa,32'ha4506ceb,32'hbef9a3f7,32'hc67178f2};
The required initial state constants and the K constants.
// Phase control logic
(* max_fanout = 32 *)logic [9:0] counter_control, counter_control_delay[8:0];
logic start;
logic din_valid_rising_edge, din_valid_d;
always_ff @(posedge clk) din_valid_d <= din_valid;
assign din_valid_rising_edge = ({din_valid_d, din_valid}==2'b01) ? 1'b1 : 1'b0;
always_ff @(posedge clk) begin
if(rst) counter_control <= 10'd256;
else if(din_valid_rising_edge) counter_control <= 10'd0;
else if(counter_control < 256) counter_control <= counter_control + 10'd1;
counter_control_delay[0] <= counter_control;
end
generate
for(genvar i=1; i<9; i++) begin : counter_delay
always_ff @(posedge clk)
counter_control_delay[i] <= counter_control_delay[i-1];
end
endgenerate
always_ff @(posedge clk) begin
if(rst) start <= 1'b0;
else if(din_valid_rising_edge) start <= 1'b1;
end
Some control signals:
// get input phase based on din valid
(* max_fanout = 64 *)logic [1:0] din_phase;
always_ff @(posedge clk)
if(rst)
din_phase <= 2'h0;
else if(din_valid)
din_phase <= din_phase + 2'h1;
The phase implements the multi-channel idea introduced in the previous article; it selects one of the 4 channels.
// input buffer for 4 512-bit input data
// shift left when the data is taken into message schedule
logic [511:0] strin_buf[3:0];
always_ff @(posedge clk) begin
if(din_valid && (din_phase == 2'h0))
strin_buf[0] <= din;
else if( (!counter_control[8]) && (counter_control[1:0]==2'h0) )
strin_buf[0] <= {strin_buf[0][479:0], 32'h0};
if(din_valid && (din_phase == 2'h1))
strin_buf[1] <= din;
else if( (!counter_control[8]) && (counter_control[1:0]==2'h1) )
strin_buf[1] <= {strin_buf[1][479:0], 32'h0};
if(din_valid && (din_phase == 2'h2))
strin_buf[2] <= din;
else if( (!counter_control[8]) && (counter_control[1:0]==2'h2) )
strin_buf[2] <= {strin_buf[2][479:0], 32'h0};
if(din_valid && (din_phase == 2'h3))
strin_buf[3] <= din;
else if( (!counter_control[8]) && (counter_control[1:0]==2'h3) )
strin_buf[3] <= {strin_buf[3][479:0], 32'h0};
end
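Since there are 4 channels, din_valid stays high for 4 clock cycles, one for each channel's input data, and the data must be buffered here. After that, each time a channel is referenced its buffer is shifted left, so that the next 32-bit word sits in the top 32 bits.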
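The buffering scheme for a single channel can be sketched as follows. The module name word_feeder and its load and take ports are hypothetical; the original code handles all four channels in one always_ff block and derives the equivalent conditions from din_valid, din_phase and counter_control.

```systemverilog
// Hypothetical single-channel version of the input buffer, for illustration only.
module word_feeder (
    input  logic         clk,
    input  logic         load,   // capture a new 512-bit block
    input  logic         take,   // the current top word has been consumed
    input  logic [511:0] din,
    output logic [31:0]  word    // always the top 32 bits of the buffer
);
    logic [511:0] block_buf;

    always_ff @(posedge clk)
        if (load)      block_buf <= din;
        else if (take) block_buf <= {block_buf[479:0], 32'h0};  // shift left by one word

    assign word = block_buf[511:480];
endmodule
```

After 16 take pulses the buffer holds only zeros, which matches the point where the message schedule switches from the input words to the expanded words.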
// 4-ch message schedule
logic [15:0][3:0][31:0] W_delay;
logic [31:0] W_j;
logic [31:0] K_j;
// intermediate logic for message schedule
logic [3:0][31:0] mess_temp_s0;
logic [1:0][31:0] mess_temp_s1;
logic [31:0] mess_temp_s2;
logic [31:0] mess_temp_s3;
// Logic functions used in message schedule
logic [31:0] ep0_15_result, ep1_2_result;
ep0 ep0_15(
.x(W_delay[13][3]),
.result(ep0_15_result)
);
ep1 ep1_2(
.x(W_delay[0][3]),
.result(ep1_2_result)
);
// stage 0 of message schedule
/*
always_ff @(posedge clk) begin
mess_temp_s0[0] <= W_delay[14][3];
mess_temp_s0[1] <= ep0_15_result;
mess_temp_s0[2] <= W_delay[5][3];
mess_temp_s0[3] <= ep1_2_result;
end
*/
// stage 1 of message schedule
/*
always_ff @(posedge clk) begin
mess_temp_s1[0] <= mess_temp_s0[0] + mess_temp_s0[1];
mess_temp_s1[1] <= mess_temp_s0[2] + mess_temp_s0[3];
end
*/
logic [47:0] P_mess_s1_0, P_mess_s1_1;
DSP48E1_wrapper #(
.USE_MULT("NONE"),
.MREG(0),
.PREG(1)
) dsp_mess_s1_0
(
.clk(clk),
.rst(rst),
.CE(1'b1),
.ALUMODE(4'b0000),
.INMODE(5'b00000),
.OPMODE(7'b0001111),
.A({16'h0, ep0_15_result[31:18]}),
.B(ep0_15_result[17:0]),
.C({16'h0, W_delay[14][3]}),
.D(25'h0),
.CARRYINSEL(),
.CARRYIN(),
.CARRYOUT(),
.P(P_mess_s1_0),
.CARRYCASCIN(),
.CARRYCASCOUT()
);
DSP48E1_wrapper #(
.USE_MULT("NONE"),
.MREG(0),
.PREG(1)
) dsp_mess_s1_1
(
.clk(clk),
.rst(rst),
.CE(1'b1),
.ALUMODE(4'b0000),
.INMODE(5'b00000),
.OPMODE(7'b0001111),
.A({16'h0, ep1_2_result[31:18]}),
.B(ep1_2_result[17:0]),
.C({16'h0, W_delay[5][3]}),
.D(25'h0),
.CARRYINSEL(),
.CARRYIN(),
.CARRYOUT(),
.P(P_mess_s1_1),
.CARRYCASCIN(),
.CARRYCASCOUT()
);
// stage 2 of message schedule
//always_ff @(posedge clk) mess_temp_s2 <= P_mess_s1_0[31:0] + P_mess_s1_1[31:0];
logic [47:0] P_mess_s3;
DSP48E1_wrapper #(
.USE_MULT("NONE"),
.MREG(0),
.PREG(1)
) dsp_mess_s2
(
.clk(clk),
.rst(rst),
.CE(1'b1),
.ALUMODE(4'b0000),
.INMODE(5'b00000),
.OPMODE(7'b0001111),
.A({16'h0, P_mess_s1_1[31:18]}),
.B(P_mess_s1_1[17:0]),
.C({16'h0, P_mess_s1_0[31:0]}),
.D(25'h0),
.CARRYINSEL(),
.CARRYIN(),
.CARRYOUT(),
.P(P_mess_s3),
.CARRYCASCIN(),
.CARRYCASCOUT()
);
// stage 3 of message schedule
// assign to input data when counter_control is less than 64
always_comb mess_temp_s3 = P_mess_s3[31:0];
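This is the computation part of the message schedule (the message block expansion). The commented-out code is the easier-to-read functional equivalent: if you remove all the DSP wrapper instantiations and keep only the commented-out logic, the simulation still passes. Because instantiating DSP primitives is fairly involved, it is usual to reach the simulation goal with plain logic first and then convert it to primitive instantiations, which is why the comments have not been deleted.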
// updating message schedule
generate
for(genvar i=0; i<16; i++) begin : W_j_minus_n
if(i==0)
always_ff @(posedge clk)
if(rst) W_delay[i][0] <= 32'h0;
else if(counter_control < 10'd64)
case(counter_control[1:0])
2'h0: W_delay[i][0] <= strin_buf[0][511:480];
2'h1: W_delay[i][0] <= strin_buf[1][511:480];
2'h2: W_delay[i][0] <= strin_buf[2][511:480];
2'h3: W_delay[i][0] <= strin_buf[3][511:480];
default: W_delay[i][0] <= 32'h0;
endcase
else if(!counter_control[8]) W_delay[i][0] <= mess_temp_s3;
else W_delay[i][0] <= 32'h0;
else
always_ff @(posedge clk) begin
if(rst) W_delay[i][0] <= 32'h0;
else if(!counter_control[8]) W_delay[i][0] <= W_delay[i-1][3];
end
for(genvar j=1; j<4; j++) begin : W_4ch
always_ff @(posedge clk)
if(!counter_control[8])
W_delay[i][j] <= W_delay[i][j-1];
end
end
endgenerate
The delay part of the message schedule.
// send out W_j
always_ff @(posedge clk)
if(counter_control < 10'd64)
case(counter_control[1:0])
2'h0: W_j <= strin_buf[0][511:480];
2'h1: W_j <= strin_buf[1][511:480];
2'h2: W_j <= strin_buf[2][511:480];
2'h3: W_j <= strin_buf[3][511:480];
default: W_j <= 32'h0;
endcase
else if(counter_control < 10'd256)
W_j <= mess_temp_s3;
else
W_j <= 32'h0;
// Select a proper K_j
always_ff @(posedge clk)
if(counter_control < 10'd256)
K_j <= K_const[counter_control[7:2]];
else
K_j <= 32'h0;
The output of the message schedule, W_j, and the matching K_j, both supplied to the state-machine loop logic.
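Because the four channels are interleaved cycle by cycle, the two low bits of counter_control select the channel and bits [7:2] give the round index, which is why K_const is indexed with counter_control[7:2] and why the counter stops at 256 (64 rounds times 4 channels). A sketch of that decoding, using a hypothetical module name:

```systemverilog
// Hypothetical decoder, shown only to illustrate how counter_control is interpreted.
module round_decode (
    input  logic [9:0] counter_control,
    output logic [1:0] channel,   // which of the 4 interleaved channels
    output logic [5:0] round,     // SHA256 round index 0..63
    output logic       active     // high while 0 <= counter_control < 256
);
    assign channel = counter_control[1:0];
    assign round   = counter_control[7:2];
    assign active  = (counter_control < 10'd256);
endmodule
```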
// 4-ch delay of 8 hash states
logic [3:0][31:0] comp_a;
logic [3:0][31:0] comp_b;
logic [3:0][31:0] comp_c;
logic [3:0][31:0] comp_d;
logic [3:0][31:0] comp_e;
logic [3:0][31:0] comp_f;
logic [3:0][31:0] comp_g;
logic [3:0][31:0] comp_h;
// Logic functions used in comp
logic [31:0] ch_efg_result, maj_abc_result, sig0_a_result, sig1_e_result;
logic [31:0] ch_efg_x, ch_efg_y, ch_efg_z;
logic [31:0] maj_abc_x, maj_abc_y, maj_abc_z;
logic [31:0] sig0_a_x, sig1_e_x;
ch ch_efg(
.x(ch_efg_x),
.y(ch_efg_y),
.z(ch_efg_z),
.result(ch_efg_result)
);
maj maj_abc(
.x(maj_abc_x),
.y(maj_abc_y),
.z(maj_abc_z),
.result(maj_abc_result)
);
sig0 sig0_a(
.x(sig0_a_x),
.result(sig0_a_result)
);
sig1 sig1_e(
.x(sig1_e_x),
.result(sig1_e_result)
);
// intermediate temporary logic in comp
logic [5:0][31:0] comp_temp_s0;
logic [2:0][31:0] comp_temp_s1;
logic [1:0][31:0] comp_temp_s2;
logic [1:0][31:0] comp_temp_s3;
assign ch_efg_x = counter_control_delay[4][8] ? comp_e[0] : comp_temp_s3[1];
assign ch_efg_y = counter_control_delay[4][8] ? comp_f[0] : comp_e[3];
assign ch_efg_z = counter_control_delay[4][8] ? comp_g[0] : comp_f[3];
assign maj_abc_x = counter_control_delay[4][8] ? comp_a[0] : comp_temp_s3[0];
assign maj_abc_y = counter_control_delay[4][8] ? comp_b[0] : comp_a[3];
assign maj_abc_z = counter_control_delay[4][8] ? comp_c[0] : comp_b[3];
assign sig0_a_x = maj_abc_x;
assign sig1_e_x = ch_efg_x;
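This is the computation part of the state update loop. The inputs of the Ch and Maj functions are a special case: before the computation has started or after it has finished, they take the same inputs as the state registers (such as e). This guarantees that the output on the first clock depends strictly on the first e only, and avoids errors caused by unknown propagation.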
// stage 0 of comp
always_ff @(posedge clk) begin
comp_temp_s0[0] <= sig0_a_result;
comp_temp_s0[1] <= maj_abc_result;
comp_temp_s0[2] <= sig1_e_result;
comp_temp_s0[3] <= ch_efg_result;
comp_temp_s0[4] <= counter_control_delay[4][8] ? comp_h[3] : comp_g[3];
//comp_temp_s0[5] <= W_j + K_j;
end
logic [47:0] P_comp_s0_5;
DSP48E1_wrapper #(
.USE_MULT("NONE"),
.MREG(0),
.ACASCREG(0),
.AREG(0),
.BCASCREG(0),
.BREG(0),
.CREG(0),
.PREG(1)
) dsp_comp_s0_5
(
.clk(clk),
.rst(rst),
.CE(1'b1),
.ALUMODE(4'b0000),
.INMODE(5'b00000),
.OPMODE(7'b0001111),
.A({16'h0, K_j[31:18]}),
.B(K_j[17:0]),
.C({16'h0, W_j}),
.D(25'h0),
.CARRYINSEL(),
.CARRYIN(),
.CARRYOUT(),
.P(P_comp_s0_5),
.CARRYCASCIN(),
.CARRYCASCOUT()
);
// stage 1 of comp
//always_ff @(posedge clk) begin
// comp_temp_s1[0] <= comp_temp_s0[0] + comp_temp_s0[1];
// comp_temp_s1[1] <= comp_temp_s0[2] + comp_temp_s0[3];
// comp_temp_s1[2] <= comp_temp_s0[4] + P_comp_s0_5[31:0];
//end
logic [47:0] P_comp_s1_0, P_comp_s1_1, P_comp_s1_2;
DSP48E1_wrapper #(
.USE_MULT("NONE"),
.MREG(0),
.ACASCREG(0),
.AREG(0),
.BCASCREG(0),
.BREG(0),
.CREG(0),
.PREG(1)
) dsp_comp_s1_0
(
.clk(clk),
.rst(rst),
.CE(1'b1),
.ALUMODE(4'b0000),
.INMODE(5'b00000),
.OPMODE(7'b0001111),
.A({16'h0, comp_temp_s0[1][31:18]}),
.B(comp_temp_s0[1][17:0]),
.C({16'h0, comp_temp_s0[0]}),
.D(25'h0),
.CARRYINSEL(),
.CARRYIN(),
.CARRYOUT(),
.P(P_comp_s1_0),
.CARRYCASCIN(),
.CARRYCASCOUT()
);
DSP48E1_wrapper #(
.USE_MULT("NONE"),
.MREG(0),
.ACASCREG(0),
.AREG(0),
.BCASCREG(0),
.BREG(0),
.CREG(0),
.PREG(1)
) dsp_comp_s1_1
(
.clk(clk),
.rst(rst),
.CE(1'b1),
.ALUMODE(4'b0000),
.INMODE(5'b00000),
.OPMODE(7'b0001111),
.A({16'h0, comp_temp_s0[3][31:18]}),
.B(comp_temp_s0[3][17:0]),
.C({16'h0, comp_temp_s0[2]}),
.D(25'h0),
.CARRYINSEL(),
.CARRYIN(),
.CARRYOUT(),
.P(P_comp_s1_1),
.CARRYCASCIN(),
.CARRYCASCOUT()
);
DSP48E1_wrapper #(
.USE_MULT("NONE"),
.MREG(0),
.ACASCREG(0),
.AREG(0),
.BCASCREG(0),
.BREG(0),
.CREG(0),
.PREG(1)
) dsp_comp_s1_2
(
.clk(clk),
.rst(rst),
.CE(1'b1),
.ALUMODE(4'b0000),
.INMODE(5'b00000),
.OPMODE(7'b0001111),
.A({16'h0, comp_temp_s0[4][31:18]}),
.B(comp_temp_s0[4][17:0]),
.C({16'h0, P_comp_s0_5[31:0]}),
.D(25'h0),
.CARRYINSEL(),
.CARRYIN(),
.CARRYOUT(),
.P(P_comp_s1_2),
.CARRYCASCIN(),
.CARRYCASCOUT()
);
// stage 2 of comp
always_ff @(posedge clk) begin
comp_temp_s2[0] <= P_comp_s1_0[31:0];
//comp_temp_s2[1] <= P_comp_s1_1[31:0] + P_comp_s1_2[31:0];
end
logic [47:0] P_comp_s2_1;
DSP48E1_wrapper #(
.USE_MULT("NONE"),
.MREG(0),
.ACASCREG(0),
.AREG(0),
.BCASCREG(0),
.BREG(0),
.CREG(0),
.PREG(1)
) dsp_comp_s2_1
(
.clk(clk),
.rst(rst),
.CE(1'b1),
.ALUMODE(4'b0000),
.INMODE(5'b00000),
.OPMODE(7'b0001111),
.A({16'h0, P_comp_s1_2[31:18]}),
.B(P_comp_s1_2[17:0]),
.C({16'h0, P_comp_s1_1[31:0]}),
.D(25'h0),
.CARRYINSEL(),
.CARRYIN(),
.CARRYOUT(),
.P(P_comp_s2_1),
.CARRYCASCIN(),
.CARRYCASCOUT()
);
// stage 3 of comp
logic [47:0] P_comp_s3_0, P_comp_s3_1;
always_comb begin
//comp_temp_s3[0] <= comp_temp_s2[0] + P_comp_s2_1[31:0];
comp_temp_s3[0] = P_comp_s3_0[31:0];
//comp_temp_s3[1] <= P_comp_s2_1[31:0] + comp_d[2];
comp_temp_s3[1] = P_comp_s3_1[31:0];
end
DSP48E1_wrapper #(
.USE_MULT("NONE"),
.MREG(0),
.ACASCREG(0),
.AREG(0),
.BCASCREG(0),
.BREG(0),
.CREG(0),
.PREG(1)
) dsp_comp_s3_0
(
.clk(clk),
.rst(rst),
.CE(1'b1),
.ALUMODE(4'b0000),
.INMODE(5'b00000),
.OPMODE(7'b0001111),
.A({16'h0, comp_temp_s2[0][31:18]}),
.B(comp_temp_s2[0][17:0]),
.C({16'h0, P_comp_s2_1[31:0]}),
.D(25'h0),
.CARRYINSEL(),
.CARRYIN(),
.CARRYOUT(),
.P(P_comp_s3_0),
.CARRYCASCIN(),
.CARRYCASCOUT()
);
DSP48E1_wrapper #(
.USE_MULT("NONE"),
.MREG(0),
.ACASCREG(0),
.AREG(0),
.BCASCREG(0),
.BREG(0),
.CREG(0),
.PREG(1)
) dsp_comp_s3_1
(
.clk(clk),
.rst(rst),
.CE(1'b1),
.ALUMODE(4'b0000),
.INMODE(5'b00000),
.OPMODE(7'b0001111),
.A({16'h0, comp_d[2][31:18]}),
.B(comp_d[2][17:0]),
.C({16'h0, P_comp_s2_1[31:0]}),
.D(25'h0),
.CARRYINSEL(),
.CARRYIN(),
.CARRYOUT(),
.P(P_comp_s3_1),
.CARRYCASCIN(),
.CARRYCASCOUT()
);
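Again, a large number of DSPs are instantiated here to implement the additions originally written in the comments. This code is long, which also shows the drawback of using primitives: it is hard to read. If it is hard to follow, you can delete the DSP_wrapper instantiations and use the commented-out code instead.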
// store the hash of first 512 bits, for the second hash
logic [255:0] first_hash;
// store the hash of the first 512 bits
// even though there are four kinds of input, the first 512 bits are the same
logic [31:0] comp_input[7:0];
always_ff @(posedge clk)
if(start && !din_digest && (counter_control_delay[4] == 10'd256) && (counter_control_delay[8] != 10'd256))
first_hash <= {comp_a[3] + state_const[0],
comp_b[3] + state_const[1],
comp_c[3] + state_const[2],
comp_d[3] + state_const[3],
comp_e[3] + state_const[4],
comp_f[3] + state_const[5],
comp_g[3] + state_const[6],
comp_h[3] + state_const[7]};
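As explained in the previous article, the mining algorithm works on a fixed 608-bit header prefix and a 32-bit variable (the nonce). This means the first hash has to process two 512-bit blocks, and the result of the first block is fixed. So once the first 512-bit block of the first hash has been computed, we store the result in first_hash. We then no longer need to compute both blocks every time, which greatly improves efficiency. The second hash takes the output of the first hash, 256 bits, as its input and needs only one block, so it needs no optimization. These operations are controlled by the state machine.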
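The block layout described above follows from the standard SHA256 padding of an 80-byte (640-bit) message: the first block is the top 512 bits of the prefix, and the remaining 96 prefix bits share the second block with the nonce and the padding. The sketch below shows how those blocks could be assembled, assuming the prefix and nonce are already in the byte order the core expects. The package and function names are hypothetical, and the state machine in the attachment may build these blocks differently.

```systemverilog
// Hypothetical helper package; an illustration of SHA256 padding, not the author's code.
package sha256_block_pkg;
    // Second 512-bit block of the first hash: the last 96 bits of the prefix,
    // the 32-bit nonce, the 0x80 padding byte, zeros, and the 64-bit message length (640).
    function automatic logic [511:0] first_hash_block2(
        input logic [607:0] prefix,
        input logic [31:0]  nonce
    );
        return {prefix[95:0], nonce, 8'h80, 312'h0, 64'd640};
    endfunction

    // Only block of the second hash: the 256-bit first hash, the 0x80 padding byte,
    // zeros, and the 64-bit message length (256).
    function automatic logic [511:0] second_hash_block(input logic [255:0] hash1);
        return {hash1, 8'h80, 184'h0, 64'd256};
    endfunction
endpackage
```

Only first_hash_block2 changes from nonce to nonce, which is why caching the result of the first block saves one full pass through the core per attempt.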
// updating the 8 hash state
always_ff @(posedge clk) begin
if(rst) begin
comp_a[0] <= state_const[0];
comp_b[0] <= state_const[1];
comp_c[0] <= state_const[2];
comp_d[0] <= state_const[3];
comp_e[0] <= state_const[4];
comp_f[0] <= state_const[5];
comp_g[0] <= state_const[6];
comp_h[0] <= state_const[7];
end
else if(!counter_control_delay[4][8]) begin
comp_a[0] <= comp_temp_s3[0];
comp_b[0] <= comp_a[3];
comp_c[0] <= comp_b[3];
comp_d[0] <= comp_c[3];
comp_e[0] <= comp_temp_s3[1];
comp_f[0] <= comp_e[3];
comp_g[0] <= comp_f[3];
comp_h[0] <= comp_g[3];
end
else begin
if(din_digest) begin
comp_a[0] <= first_hash[255:224];
comp_b[0] <= first_hash[223:192];
comp_c[0] <= first_hash[191:160];
comp_d[0] <= first_hash[159:128];
comp_e[0] <= first_hash[127: 96];
comp_f[0] <= first_hash[ 95: 64];
comp_g[0] <= first_hash[ 63: 32];
comp_h[0] <= first_hash[ 31: 0];
end
else begin
comp_a[0] <= state_const[0];
comp_b[0] <= state_const[1];
comp_c[0] <= state_const[2];
comp_d[0] <= state_const[3];
comp_e[0] <= state_const[4];
comp_f[0] <= state_const[5];
comp_g[0] <= state_const[6];
comp_h[0] <= state_const[7];
end
end
end
generate
for(genvar i=1; i<4; i++) begin : comp_delay_4ch
always_ff @(posedge clk) begin
comp_a[i] <= comp_a[i-1];
comp_b[i] <= comp_b[i-1];
comp_c[i] <= comp_c[i-1];
comp_d[i] <= comp_d[i-1];
comp_e[i] <= comp_e[i-1];
comp_f[i] <= comp_f[i-1];
comp_g[i] <= comp_g[i-1];
comp_h[i] <= comp_h[i-1];
end
end
endgenerate
The update logic of the eight hash state registers:
// Delay the nounce input, and send out if it matches the target
logic [3:0][31:0] din_nounce_d;
always_ff @(posedge clk)
if(din_valid) begin
din_nounce_d[0] <= din_nounce;
din_nounce_d[1] <= din_nounce_d[0];
din_nounce_d[2] <= din_nounce_d[1];
din_nounce_d[3] <= din_nounce_d[2];
end
The 32-bit variable (nonce) is delayed so that it is output together with the hash result.
// record the digest state on the input stage
logic din_digest_start;
always_ff @(posedge clk)
if(rst) din_digest_start <= 1'b0;
else if(din_valid_rising_edge) din_digest_start <= din_digest;
// Final result output
logic [1:0] hash_phase;
always_ff @(posedge clk) begin
if(rst) begin
dout_hash <= 256'h0;
dout_valid <= 1'b0;
hash_phase <= 2'h0;
end
// only send out hash for the later 512 bits
else if(start && din_digest_start && (counter_control_delay[4] == 10'd256) && (counter_control_delay[8] != 10'd256)) begin
dout_hash <= {comp_a[3] + first_hash[255:224],
comp_b[3] + first_hash[223:192],
comp_c[3] + first_hash[191:160],
comp_d[3] + first_hash[159:128],
comp_e[3] + first_hash[127: 96],
comp_f[3] + first_hash[ 95: 64],
comp_g[3] + first_hash[ 63: 32],
comp_h[3] + first_hash[ 31: 0]};
dout_valid <= 1'b1;
hash_phase <= hash_phase + 2'h1;
case(hash_phase)
2'h0: dout_nounce <= din_nounce_d[3];
2'h1: dout_nounce <= din_nounce_d[2];
2'h2: dout_nounce <= din_nounce_d[1];
2'h3: dout_nounce <= din_nounce_d[0];
default: dout_nounce <= din_nounce_d[0];
endcase
end
else if(start && !din_digest_start && (counter_control_delay[4] == 10'd256) && (counter_control_delay[8] != 10'd256)) begin
dout_hash <= {comp_a[3] + state_const[0],
comp_b[3] + state_const[1],
comp_c[3] + state_const[2],
comp_d[3] + state_const[3],
comp_e[3] + state_const[4],
comp_f[3] + state_const[5],
comp_g[3] + state_const[6],
comp_h[3] + state_const[7]};
dout_valid <= 1'b1;
hash_phase <= hash_phase + 2'h1;
case(hash_phase)
2'h0: dout_nounce <= din_nounce_d[3];
2'h1: dout_nounce <= din_nounce_d[2];
2'h2: dout_nounce <= din_nounce_d[1];
2'h3: dout_nounce <= din_nounce_d[0];
default: dout_nounce <= din_nounce_d[0];
endcase
end
else begin
dout_valid <= 1'b0;
hash_phase <= 2'h0;
end
end
endmodule
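This is the final result output. Depending on the state of din_digest, the hash state is updated differently.
As you can see, the core computation is quite complex, so this article describes it in the most detail. The code in the following sections is covered more briefly, but all of it can be downloaded from the attachment. Please read it patiently.
As in the diagram at the start of this article, we need logic that checks whether a result is valid. The code of sha256_detector.sv is as follows: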
module sha256_detector(
input clk,
input rst,
input [255:0] din_hash,
input [31:0] din_nounce,
input [7:0] din_target,
input din_vld,
output logic [31:0] dout_nounce,
output logic dout_vld
);
always_ff @(posedge clk) begin
if(rst) begin
dout_nounce <= 32'h0;
dout_vld <= 1'b0;
end
else if(din_vld) begin
dout_nounce <= din_nounce;
case(din_target)
// only for debug
8'd8 : dout_vld <= (din_hash[255:248] == 0) ? 1'b1 : 1'b0;
8'd16: dout_vld <= (din_hash[255:240] == 0) ? 1'b1 : 1'b0;
8'd20: dout_vld <= (din_hash[255:236] == 0) ? 1'b1 : 1'b0;
8'd24: dout_vld <= (din_hash[255:232] == 0) ? 1'b1 : 1'b0;
8'd28: dout_vld <= (din_hash[255:228] == 0) ? 1'b1 : 1'b0;
8'd32: dout_vld <= (din_hash[255:224] == 0) ? 1'b1 : 1'b0;
8'd36: dout_vld <= (din_hash[255:220] == 0) ? 1'b1 : 1'b0;
8'd40: dout_vld <= (din_hash[255:216] == 0) ? 1'b1 : 1'b0;
8'd44: dout_vld <= (din_hash[255:212] == 0) ? 1'b1 : 1'b0;
8'd45: dout_vld <= (din_hash[255:211] == 0) ? 1'b1 : 1'b0;
8'd46: dout_vld <= (din_hash[255:210] == 0) ? 1'b1 : 1'b0;
8'd47: dout_vld <= (din_hash[255:209] == 0) ? 1'b1 : 1'b0;
8'd48: dout_vld <= (din_hash[255:208] == 0) ? 1'b1 : 1'b0;
default: dout_vld <= (din_hash[255:212] == 0) ? 1'b1 : 1'b0;
endcase
end
end
endmodule
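This code is relatively simple. din_target configures the difficulty; the larger the number, the higher the difficulty. When the input hash result din_hash meets the condition, the corresponding variable din_nounce is output and dout_vld is raised to indicate that a matching result has been found. To limit the number of branches in the case statement, only difficulties from 8 to 48 bits are supported for now (above 48 bits a result is practically out of reach). If needed, you can add the difficulty you want to the case statement.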
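As an alternative sketch (not the implementation used in this design), the same leading-zero test can be written for any din_target from 0 to 255 with a variable shift instead of enumerated slices. The package and function names are hypothetical. Note that a variable 256-bit shifter is likely to cost more logic and timing margin than the fixed slices in the case statement, which is one reason to prefer the enumerated form on a small FPGA.

```systemverilog
// Hypothetical generic difficulty check, an alternative to the case statement.
package sha256_target_pkg;
    // True when the top `target` bits of hash are all zero (target in 0..255).
    function automatic logic meets_target(
        input logic [255:0] hash,
        input logic [7:0]   target
    );
        logic [8:0] shift;
        shift = 9'd256 - {1'b0, target};   // number of low bits to discard
        return (hash >> shift) == 256'h0;
    endfunction
endpackage
```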
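For the control state machine, we only show its diagram and port definitions here and do not walk through the code; see the attachment for the details.
What each state does is as follows: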
module sha256_state_machine #(
parameter CORE_NUM = 8
)
(
input clk,
input rst,
input [607:0] ms_header_prefix,
input [7:0] ms_diff,
input ms_valid,
output logic [31:0] ms_nounce,
output logic ms_nounce_valid,
output logic [CORE_NUM-1:0][511:0] sha256_din,
output logic [CORE_NUM-1:0][31:0] sha256_nounce,
output logic [7:0] sha256_target,
output logic sha256_init,
output logic sha256_digest,
output logic sha256_din_valid,
input [CORE_NUM-1:0][31:0] sha256_dout_nounce,
input [CORE_NUM-1:0] sha256_dout_valid,
// signs to show on LEDs
output logic result_found,
output logic [5:0] nounce_step
);
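The port definitions are as follows:
In earlier experiments with the NEXYS4 board, many interfaces were observed through the serial port. In this article the serial module receives a 608-bit header prefix, an 8-bit difficulty and a single Enter character as the standard input. When the input matches this format, all the data is passed to the state machine for mining. We do not go into detail here and only look at the ports: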
module sha256_uart_converter(
input clk,
input rst,
// interface to UART transmitter
input [7:0] din_from_pc,
input din_valid,
output logic [7:0] dout_to_pc,
output logic dout_ready,
// interface to state machine
output logic [607:0] ms_header_prefix,
output logic [7:0] ms_diff,
output logic ms_valid,
input [31:0] ms_nounce,
input ms_nounce_valid
);
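This module's interface works together with the UART_transmitter used in earlier articles, which can be found in the attachment.
At this point the code of all the related modules has been shown. The only one not yet written is the clock control module, the PLL, for which we will use a Vivado IP; it is covered in the synthesis and implementation section.
The top-level module integrates all the modules. The code is as follows: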
// Take serial input as the header prefix and target
// search for nounce from 0 to 0xFFFF_FFFF
module sha256_top #(
parameter CORE_NUM = 8
)
(
input clk,
input rst,
(* mark_debug = "true" *) output logic TXD,
(* mark_debug = "true" *) input RXD,
(* mark_debug = "true" *) output logic CTS,
(* mark_debug = "true" *) input RTS,
output logic [6:0] led
);
Because of the limited resources of the NEXYS4, we use a configuration with 8 subsystems. The pins are used as follows:
// boost input clock from 100MHz to 200MHz
logic clk_200, clk_locked;
clk_wiz clk_wiz
(
// Clock out ports
.clk_out1(clk_200), // output clk_out1
// Status and control signals
.reset(rst), // input reset
.locked(clk_locked), // output locked
// Clock in ports
.clk_in1(clk)); // input clk_in1
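This is the clock module (PLL) that will be generated later. Here we raise the 100 MHz clock to 200 MHz, and the clk_locked signal that results from rst can be used as the reset for the other modules. I tried higher clock frequencies, but they did not pass timing. A higher-end chip could run at a higher clock frequency and could host more parallel subsystems.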
// communicate with PC through serial port
logic [7:0] dout_to_pc, din_from_pc;
logic dout_ready, din_valid;
UART_transmitter #(
.COUNT_MAX(1735)
)
uart_transmitter(
.clk(clk_200),
.rst(!clk_locked),
// UART port
.TXD,
.RXD,
.CTS,
.RTS,
// Control port
.dout (dout_to_pc),
.dout_ready (dout_ready),
.din (din_from_pc),
.din_valid (din_valid)
);
// convert the data from uart 8-bits into hex data
logic [607:0] ms_header_prefix;
logic [7:0] ms_diff;
logic ms_valid;
logic [31:0] ms_nounce;
logic ms_nounce_valid;
sha256_uart_converter uart_converter(
.clk(clk_200),
.rst(!clk_locked),
// interface to UART transmitter
.din_from_pc,
.din_valid,
.dout_to_pc,
.dout_ready,
// interface to state machine
.ms_header_prefix,
.ms_diff,
.ms_valid,
.ms_nounce,
.ms_nounce_valid
);
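These two modules form the serial communication part. The first converts the 4-wire serial signals into 8-bit data with a 1-bit valid signal; the converter then turns that into the 608-bit header prefix and the 8-bit difficulty that the state machine needs.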
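The COUNT_MAX value of 1735 is consistent with a 115200-baud link on the 200 MHz clock, since 200000000 / 115200 - 1 = 1735 in integer arithmetic. Both the baud rate and the meaning of COUNT_MAX are assumptions here, because the UART_transmitter source is not shown in this article. Under those assumptions, a baud tick generator could be sketched as follows, with hypothetical module and parameter names:

```systemverilog
// Hypothetical baud tick generator; assumes COUNT_MAX means "clock cycles per bit minus one".
module baud_tick #(
    parameter int CLK_HZ = 200_000_000,
    parameter int BAUD   = 115_200
)(
    input  logic clk,
    input  logic rst,
    output logic tick      // one-cycle pulse once per bit period
);
    localparam int COUNT_MAX = CLK_HZ / BAUD - 1;   // 1735 for the defaults

    logic [$clog2(COUNT_MAX + 1) - 1:0] count;

    always_ff @(posedge clk)
        if (rst) begin
            count <= '0;
            tick  <= 1'b0;
        end
        else if (count == COUNT_MAX) begin
            count <= '0;
            tick  <= 1'b1;
        end
        else begin
            count <= count + 1'b1;
            tick  <= 1'b0;
        end
endmodule
```

The same formula with a 100 MHz clock gives 867, the bit period used by the repeat(867) loops in the testbench tasks.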
// Main controller to get header prefix, target and distribute hash input to all calculation cores
// and receive result from calculation cores if any of them matches the target
logic [CORE_NUM-1:0][511:0] sha256_din;
logic [CORE_NUM-1:0][31:0] sha256_nounce;
logic [7:0] sha256_target;
(* mark_debug = "true" *)logic sha256_init;
(* mark_debug = "true" *)logic sha256_digest;
(* mark_debug = "true" *)logic sha256_din_valid;
logic [CORE_NUM-1:0][31:0] sha256_dout_nounce;
logic [CORE_NUM-1:0] sha256_dout_valid;
logic result_found;
logic [5:0] nounce_step;
sha256_state_machine state_machine(
.clk(clk_200),
.rst(!clk_locked),
.ms_header_prefix,
.ms_diff,
.ms_valid,
.ms_nounce,
.ms_nounce_valid,
.sha256_din,
.sha256_nounce,
.sha256_target,
.sha256_init,
.sha256_digest,
.sha256_din_valid,
.sha256_dout_nounce,
.sha256_dout_valid,
.result_found,
.nounce_step
);
The state machine part.
logic [6:0] led_d;
always_ff @(posedge clk)
if(rst) led_d[0] <= 1'b0;
else if(result_found) led_d[0] <= 1'b1;
always_ff @(posedge clk)
if(rst) led_d[6:1] <= 6'h0;
else led_d[6:1] <= nounce_step;
always_ff @(posedge clk) led <= led_d;
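The LED part is for observation only and is unrelated to the internal logic. An extra register stage relieves the pressure it puts on timing.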
generate
for(genvar i=0; i
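Finally, based on the configured number of subsystems, 8, eight parallel subsystems are generated automatically. This completes the design.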
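The general pattern of replicating a subsystem CORE_NUM times with a generate loop can be sketched as follows. The stub module is a placeholder that only registers its input; it does not reproduce the sha256_core and sha256_detector connections of the actual top level, and all names in this sketch are hypothetical.

```systemverilog
// Hypothetical stand-in for one hashing subsystem; it only registers its nonce input.
module worker_stub (
    input  logic        clk,
    input  logic [31:0] nonce_in,
    output logic [31:0] nonce_out
);
    always_ff @(posedge clk) nonce_out <= nonce_in;
endmodule

// Replicates the stub CORE_NUM times, one slice of each packed array per instance.
module replicate_demo #(
    parameter int CORE_NUM = 8
)(
    input  logic                      clk,
    input  logic [CORE_NUM-1:0][31:0] nonce_in,
    output logic [CORE_NUM-1:0][31:0] nonce_out
);
    generate
        for (genvar i = 0; i < CORE_NUM; i++) begin : core
            worker_stub u_worker (
                .clk      (clk),
                .nonce_in (nonce_in[i]),
                .nonce_out(nonce_out[i])
            );
        end
    endgenerate
endmodule
```

Packed arrays indexed by the genvar, as in the sha256_din and sha256_nounce declarations above, let each instance pick up its own slice without any per-core wiring code.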
Before synthesis and implementation in Vivado, we first run a logic simulation of the system.
The testbench tb_sha256.v is as follows:
`timescale 1ns/1ns
module tb_sha256;
reg clk;
reg rst;
reg RXD, RTS;
wire TXD, CTS;
reg [607:0] prefix;
reg [7:0] target;
reg [615:0] data_buf;
reg [7:0] uart_data;
integer clock_index;
always @(posedge clk) clock_index <= clock_index + 1;
// Send value inside
initial begin
clk = 0;
rst = 0;
clock_index = 0;
#10 rst = 1;
#1000 rst = 0;
#1000;
@(posedge clk);
prefix = 608'h00000020_e30267c7_5d2d24b7_dfd8b645_9fd83744_79d3eb27_8c6a0b00_00000000_00000000_f65c3424_2c475007_67b534c9_49c0a38a_f49907b9_4a6e1363_21407cbc_19d27e62_87072e60_b9210d17;
//target = 8'd20;
target = 8'd8;
@(posedge clk);
force sha256_top.ms_header_prefix = prefix;
force sha256_top.ms_diff = target;
force sha256_top.ms_valid = 1'b1;
@(posedge clk);
force sha256_top.ms_valid = 1'b0;
//uart_write_77byte({prefix, target});
//repeat(9) uart_read_1byte(uart_data);
end
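This is the data initialization part. A 608-bit header prefix obtained from a mining pool is used as test data, and the difficulty is set to the lowest value, 8 bits (anything much higher would make the simulation run for a very long time). There is one issue: although the serial port can be simulated, it is slow in practice, and in simulation it also takes a long time to reach the point where the core logic starts. We therefore use force statements to drive the outputs of the serial part directly with the required data, which saves a great deal of simulation time. For a full-system simulation, comment out the six lines of the force section and use the last two lines, which are currently commented out.
Another issue is that this logic has no termination condition. If the uart_read_1byte task is used, however, it prints the result when it is received. Again because the serial port is slow, we skip this part and read the waveform directly.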
// Generate 100MHz clock signal
always #5 clk <= ~clk;
sha256_top #(
.CORE_NUM(8)
) sha256_top
(
.clk,
.rst,
.TXD,
.RXD,
.CTS,
.RTS
);
Generate the 100 MHz clock and instantiate sha256_top.
reg [9:0] RXD_buf;
task uart_write_4bit;
input [4:0] data;
begin
$display("Send %h through UART", data);
case(data)
5'h00: begin RXD_buf <= 10'b0000011001; end
5'h01: begin RXD_buf <= 10'b0100011001; end
5'h02: begin RXD_buf <= 10'b0010011001; end
5'h03: begin RXD_buf <= 10'b0110011001; end
5'h04: begin RXD_buf <= 10'b0001011001; end
5'h05: begin RXD_buf <= 10'b0101011001; end
5'h06: begin RXD_buf <= 10'b0011011001; end
5'h07: begin RXD_buf <= 10'b0111011001; end
5'h08: begin RXD_buf <= 10'b0000111001; end
5'h09: begin RXD_buf <= 10'b0100111001; end
5'h0A: begin RXD_buf <= 10'b0100000101; end
5'h0B: begin RXD_buf <= 10'b0010000101; end
5'h0C: begin RXD_buf <= 10'b0110000101; end
5'h0D: begin RXD_buf <= 10'b0001000101; end
5'h0E: begin RXD_buf <= 10'b0101000101; end
5'h0F: begin RXD_buf <= 10'b0011000101; end
5'h10: begin RXD_buf <= 10'b0101100001; end
endcase
RTS = 1'b0;
repeat(10) begin
repeat(867) @(posedge clk);
{RXD, RXD_buf} = {RXD_buf, 1'b1};
end
end
endtask
task uart_write_77byte;
input [615:0] data;
begin
data_buf = data;
repeat(154) begin
uart_write_4bit({1'b0, data_buf[615:612]});
data_buf = {data_buf[611:0], 4'h0};
end
uart_write_4bit(5'h10);
end
endtask
reg [9:0] TXD_buf;
task uart_read_1byte;
output reg [4:0] data;
begin
@(negedge CTS);
repeat(10) begin
repeat(867) @(posedge clk);
TXD_buf = {TXD_buf[8:0], TXD};
end
case(TXD_buf)
10'b0000011001: begin data = 5'h00; $display("Received 0 through UART"); end
10'b0100011001: begin data = 5'h01; $display("Received 1 through UART"); end
10'b0010011001: begin data = 5'h02; $display("Received 2 through UART"); end
10'b0110011001: begin data = 5'h03; $display("Received 3 through UART"); end
10'b0001011001: begin data = 5'h04; $display("Received 4 through UART"); end
10'b0101011001: begin data = 5'h05; $display("Received 5 through UART"); end
10'b0011011001: begin data = 5'h06; $display("Received 6 through UART"); end
10'b0111011001: begin data = 5'h07; $display("Received 7 through UART"); end
10'b0000111001: begin data = 5'h08; $display("Received 8 through UART"); end
10'b0100111001: begin data = 5'h09; $display("Received 9 through UART"); end
10'b0100000101: begin data = 5'h0A; $display("Received a through UART"); end
10'b0010000101: begin data = 5'h0B; $display("Received b through UART"); end
10'b0110000101: begin data = 5'h0C; $display("Received c through UART"); end
10'b0001000101: begin data = 5'h0D; $display("Received d through UART"); end
10'b0101000101: begin data = 5'h0E; $display("Received e through UART"); end
10'b0011000101: begin data = 5'h0F; $display("Received f through UART"); end
10'b0101100001: begin data = 5'h10; $display("Received ENTER through UART"); end
endcase
end
endtask
endmodule
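Last come the UART read and write tasks. They are not used when we drive the design with force statements, but we keep them so that the UART can still be simulated.
Next, write sim.do as the simulation script. Note that the clock module clk_wiz is an IP, so its simulation netlist must first be generated by Vivado.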
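The 10-bit constants in the two case statements are easier to check once you notice that each one is an ordinary 8N1 UART frame for an ASCII character: a start bit 0, eight data bits sent LSB first, and a stop bit 1. Here is a minimal Python sketch that rebuilds them. The helper names uart_frame and nibble_to_char are made up for this illustration and are not part of the design; the 100 MHz clock and the 867 clocks per bit are taken from the testbench.

```python
CLK_HZ = 100_000_000   # testbench clock: always #5 with a 1 ns timescale
CLKS_PER_BIT = 867     # repeat(867) @(posedge clk) in the UART tasks


def uart_frame(ch: str) -> str:
    """Return the 10-bit 8N1 frame of one ASCII character, in transmit order."""
    code = ord(ch)
    # UART sends the least significant data bit first.
    data_bits = "".join(str((code >> i) & 1) for i in range(8))
    return "0" + data_bits + "1"  # start bit, 8 data bits, stop bit


def nibble_to_char(nibble: int) -> str:
    """Map the testbench codes 0x00..0x0F to hex digits and 0x10 to ENTER (CR)."""
    return "\r" if nibble == 0x10 else "0123456789ABCDEF"[nibble]


for n in (0x0, 0x9, 0xA, 0xF, 0x10):
    print(f"5'h{n:02X} -> 10'b{uart_frame(nibble_to_char(n))}")

# Bit rate implied by the testbench timing, to compare against 115200 baud.
print(f"implied bit rate: {CLK_HZ / CLKS_PER_BIT:.0f} bit/s")
```

The frames come out in the same order the tasks shift them, so the printed strings can be compared directly with the constants in the testbench. The letters A to F are uppercase in that table, which is why the sketch uses uppercase hex digits.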
vlib work
vlog ../src/common.sv ../src/DSP48E1.v ../src/DSP48E1_wrapper.v ../src/sha256_core.sv ../src/UART_transmitter.v ../src/syn_fifo.v ../src/sha256_top.sv ../src/clk_wiz/clk_wiz_sim_netlist.v ../src/sha256_state_machine.sv ../src/sha256_uart_converter.sv ../src/sha256_detector.sv ../tb/tb_sha256.v
vsim work.tb_sha256 work.glbl -voptargs=+acc +notimingchecks
log -depth 7 /tb_sha256/*
do wave.do
run 15ms
Open ModelSim, change the source directory to the folder that contains sim.do, and run
do sim.do
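to start the simulation. Once it is running, pick the hierarchy level in the Instance pane, then select the signals you want in the Objects pane and add them to the wave window.
First we check whether the data is written into the state machine correctly.
The data enters the state machine as expected. Next we check that the state machine runs properly. Normally it spends most of its time waiting for results, so it sits in WAIT_RESULT (state 6), although the state does keep changing:
After a period of trying, the hash computation produces the result we want: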
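In the waveform we can see the 32-bit variable nounce increasing steadily to try every possibility, while first_hash stays constant once the first 512-bit block has been processed. nounce_count, which the state machine uses to decide whether all 32-bit values have been tried, increases after every computation. Finally dout_vld rises and the matching nounce is output: 0x55. We can verify that the double hash of {prefix, 0x00000055} has its top 8 bits equal to zero; the hash value is 0x006071b7_a237f3f4_df78fbfa_b5f76778_7b63abb1_82af61a5_c45a470e_e8f6632f
This data is the same header that the C program used for the software simulation in the previous article. To verify, change the last 32 bits of its input data to 0x00000055.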
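If you would rather not go back to the C program, the same check can be sketched in Python with the standard hashlib module. Treat this as a sketch under assumptions, not as the reference: I assume the 608-bit prefix and the nonce are hashed as raw bytes in exactly the order they are written in the testbench, and that the digest is read in its natural byte order. If your result does not match the value quoted above, the byte order is the first thing to look at. The function names are my own.

```python
import hashlib

# 608-bit header prefix copied from tb_sha256.v
PREFIX_HEX = (
    "00000020_e30267c7_5d2d24b7_dfd8b645_9fd83744_79d3eb27_8c6a0b00_"
    "00000000_00000000_f65c3424_2c475007_67b534c9_49c0a38a_f49907b9_"
    "4a6e1363_21407cbc_19d27e62_87072e60_b9210d17"
)


def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as used for Bitcoin block headers."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the most significant end of the digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


# Assumption: the nonce is appended as 4 bytes, most significant byte first,
# mirroring the concatenation {prefix, 0x00000055}.
header = bytes.fromhex(PREFIX_HEX.replace("_", "")) + (0x55).to_bytes(4, "big")
digest = double_sha256(header)

print("header bytes     :", len(header))
print("double hash      :", digest.hex())
print("leading zero bits:", leading_zero_bits(digest))
```

A difficulty of 8 is met when leading_zero_bits returns 8 or more. Applied to the hash quoted above, the helper returns 9, since 0x00 contributes eight zero bits and 0x60 one more.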
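Following the usual flow, it is time to move to Vivado for synthesis and implementation. First create a project for the NEXYS4 DDR. If the NEXYS4 DDR board files have not been added to Vivado, see the article FPGA基础blink开发板实现 (FPGA basics: implementing blink on the development board).
We have not yet covered the configuration of the clock IP clk_wiz. In the Flow Navigator open IP Catalog, search for clock, find Clocking Wizard, and double-click to open it:
On the page that opens, set Component Name to your own IP name. I chose clk_wiz, so that is the name referenced in the code. Then set IP Location to the folder where the IP files should be saved (the simulation script above expects the netlist under ../src/clk_wiz/).
This step generates the netlist file needed to simulate the IP, which fills in the piece that was missing earlier.
Now add all the files mentioned in this article under Sources, and we are ready for synthesis.
One last step remains before synthesis: pin assignment. As usual, we copy the standard NEXYS4 DDR pin constraints from the official website and adapt them into the sha256.xdc constraint file we need: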
## This file is a general .xdc for the Nexys4 DDR Rev. C
## To use it in a project:
## - uncomment the lines corresponding to used pins
## - rename the used ports (in each line, after get_ports) according to the top level signal names in the project
## Clock signal
set_property -dict {PACKAGE_PIN E3 IOSTANDARD LVCMOS33} [get_ports clk]
create_clock -period 10.000 -name sys_clk_pin -waveform {0.000 5.000} -add [get_ports clk]
##Switches
set_property -dict {PACKAGE_PIN J15 IOSTANDARD LVCMOS33} [get_ports rst]
## LEDs
set_property -dict {PACKAGE_PIN H17 IOSTANDARD LVCMOS33} [get_ports {led[0]}]
set_property -dict {PACKAGE_PIN K15 IOSTANDARD LVCMOS33} [get_ports {led[1]}]
set_property -dict {PACKAGE_PIN J13 IOSTANDARD LVCMOS33} [get_ports {led[2]}]
set_property -dict {PACKAGE_PIN N14 IOSTANDARD LVCMOS33} [get_ports {led[3]}]
set_property -dict {PACKAGE_PIN R18 IOSTANDARD LVCMOS33} [get_ports {led[4]}]
set_property -dict {PACKAGE_PIN V17 IOSTANDARD LVCMOS33} [get_ports {led[5]}]
set_property -dict {PACKAGE_PIN U17 IOSTANDARD LVCMOS33} [get_ports {led[6]}]
##USB-RS232 Interface
set_property -dict {PACKAGE_PIN C4 IOSTANDARD LVCMOS33} [get_ports RXD]
set_property -dict {PACKAGE_PIN D4 IOSTANDARD LVCMOS33} [get_ports TXD]
set_property -dict {PACKAGE_PIN D3 IOSTANDARD LVCMOS33} [get_ports CTS]
set_property -dict {PACKAGE_PIN E5 IOSTANDARD LVCMOS33} [get_ports RTS]
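Everything is ready. Click Run Synthesis; when it finishes, click Run Implementation and then Generate Bitstream to produce the file that can be programmed into the board. The final status is as follows:
We can see that resource utilization has been pushed fairly high (above 60%), but there are some timing violations, although the total negative slack (TNS) is small relative to the amount of logic. Looking into the detailed timing report, only two main paths fail, and both involve the LED outputs. The LED outputs have nothing to do with the internal logic and only serve for observation, so we can use the design as it is (I won't admit that I was just too lazy to rerun synthesis, hehe).
In Vivado's Flow Navigator, click Open Hardware Manager. With the board connected (remember to switch on its power), click Auto Connect, then click Program Device in the Flow Navigator to start the download.
First flip the rightmost switch, J15, high and then low to reset the design, then open PuTTY and connect to the board. If you are unsure how to connect, or run into problems along the way, see this article: FPGA基础串口通信 (FPGA basics: serial communication).
At this point we can talk to the FPGA board over the serial port. The UART_transmitter we use has an echo feature: typed characters are sent back and shown in the window. It looks like a trivial feature, but it does take some groundwork to implement.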
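Now we type in the 608-bit header prefix followed by the 8-bit difficulty. We start with a fairly low difficulty of 0x14 = 20. On real hardware the result appears in the serial window almost immediately, as in the first photo below. Because the result comes so quickly, the lowest LED lights up at once to show that a result has been found, and the upper 6 LEDs stay off. The window shows the result 0x40008BF2; if you are interested, you can try it in the C program. Note that the leading 4 comes from the way we partition the search by the high bits of the nonce: the system returns as soon as any one subsystem finds a result, and evidently the subsystem whose high bits were assigned 4 found it first.
In the second run we raise the difficulty to 0x20 = 32 and get the result 0xC3CF0B91 after about 10 seconds; again, feel free to try it yourself. This run is shown in the second photo below: the upper 6 LEDs light up, showing that the hash search is in progress.
At this stage we have a basic ability to mine by hashing. For now the header can only be typed into the board by hand, but once the next article covers the connection to a mining pool server, we will be able to mine automatically.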
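Typing 154 hex characters by hand is error-prone, so it can help to build the line on the PC first and paste it into the terminal. The sketch below follows the format used by uart_write_77byte in the testbench: 152 hex digits of prefix, 2 hex digits of difficulty, then ENTER. The function name build_command is hypothetical, and the use of uppercase digits is an assumption based on the character table in the testbench.

```python
def build_command(prefix_hex: str, difficulty_bits: int) -> str:
    """Return the text to send over UART: prefix, difficulty, carriage return."""
    digits = prefix_hex.replace("_", "").upper()
    if len(digits) != 152:  # 608 bits / 4 bits per hex digit
        raise ValueError(f"expected 152 hex digits, got {len(digits)}")
    int(digits, 16)  # raises ValueError if a non-hex character slipped in
    if not 0 <= difficulty_bits <= 0xFF:
        raise ValueError("difficulty must fit in 8 bits")
    return f"{digits}{difficulty_bits:02X}\r"


# Example with a dummy all-zero prefix and the difficulty 0x14 = 20.
line = build_command("0" * 152, 20)
print(len(line) - 1, "hex characters plus ENTER")
print("last characters:", repr(line[-3:]))
```

The length check catches the most common mistake, a dropped or doubled character, before anything is sent to the board.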
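More importantly, we can now discuss why this system cannot be used for real mining. From the state machine we can see that, apart from data transfer and the first_hash computation, most of the time is spent waiting for results, and each mining subsystem outputs about 4 hash results every 256 clocks. The subsystems run on a 200 MHz clock and there are 8 of them, so our hash rate is 200 MHz * (4/256) * 8 = 25 MH/s, that is, 25 million hashes per second, at a power consumption of 1.88 W. Even with further optimization and a higher-end FPGA, it would be hard to get past the order of 100 MH/s. Compare this with a miner on the market such as the Antminer S9: 13.5 TH/s at 1350 W.
A quick calculation shows that not only is the absolute hash rate several orders of magnitude lower, the hash rate per watt also falls far short, so forget about making money from actual mining. Back to reality: the next article covers the last step toward automatic mining.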
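The comparison can be worked through in a few lines of Python. All figures are the ones quoted in this article; the variable names are my own, and hashes per joule is simply the hash rate divided by the power.

```python
# FPGA design, figures from this article
CORE_CLOCK_HZ = 200_000_000   # clock of each mining subsystem
HASHES_PER_ROUND = 4          # hash results per subsystem per round
CLOCKS_PER_ROUND = 256        # clocks per round
CORE_NUM = 8                  # parallel subsystems
FPGA_POWER_W = 1.88

# Antminer S9, figures from this article
S9_HASH_RATE = 13.5e12        # hashes per second
S9_POWER_W = 1350.0

fpga_hash_rate = CORE_CLOCK_HZ * HASHES_PER_ROUND // CLOCKS_PER_ROUND * CORE_NUM
fpga_efficiency = fpga_hash_rate / FPGA_POWER_W   # hashes per joule
s9_efficiency = S9_HASH_RATE / S9_POWER_W         # hashes per joule

print(f"FPGA hash rate  : {fpga_hash_rate / 1e6:.1f} MH/s")
print(f"FPGA efficiency : {fpga_efficiency / 1e6:.1f} MH/J")
print(f"S9 efficiency   : {s9_efficiency / 1e9:.1f} GH/J")
print(f"speed ratio     : {S9_HASH_RATE / fpga_hash_rate:,.0f}x")
print(f"efficiency ratio: {s9_efficiency / fpga_efficiency:,.0f}x")
```

With these numbers the S9 is 540,000 times faster in absolute terms and roughly 750 times more efficient per watt.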