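The `montgomery_reduce()` function in https://github.com/dalek-cryptography/curve25519-dalek/blob/master/src/backend/serial/u64/scalar.rs differs in its details from the textbook Montgomery reduction algorithm, and further reduces the bit width needed for intermediate values. The textbook algorithm is: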
INPUT: integers $m = (m_{n-1} \cdots m_1 m_0)_b$ with $\gcd(m, b) = 1$, $R = b^n$, $m' = -m^{-1} \bmod b$, and $T = (t_{2n-1} \cdots t_1 t_0)_b < mR$.
OUTPUT: $TR^{-1} \bmod m$.
1. $A \leftarrow T$. (Notation: $A = (a_{2n-1} \cdots a_1 a_0)_b$)
2. For $i$ from $0$ to $n-1$ do the following:
2.1 $u_i \leftarrow a_i m' \bmod b$.
2.2 $A \leftarrow A + u_i m b^i$.
3. $A \leftarrow A / b^n$.
4. If $A \ge m$ then $A \leftarrow A - m$.
5. Return($A$).
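The steps above can be sketched in Rust as follows. This is a minimal illustration, not code from curve25519-dalek: the function name `montgomery_reduce_basic` and its parameters are hypothetical, and it assumes $\gcd(m, b) = 1$, $T < mR$, and that every intermediate value fits in a `u128`.

```rust
/// Digit-serial Montgomery reduction: returns T * R^{-1} mod m with R = b^n.
/// Hypothetical helper for illustration only; assumes gcd(m, b) = 1,
/// T < m * R, and that all intermediate values fit in a u128.
fn montgomery_reduce_basic(t: u128, m: u128, b: u128, n: u32, m_prime: u128) -> u128 {
    let mut a = t; // step 1
    let mut b_pow = 1u128; // holds b^i
    for _ in 0..n {
        let a_i = (a / b_pow) % b; // digit i of A
        let u_i = (a_i * m_prime) % b; // step 2.1
        a += u_i * m * b_pow; // step 2.2: digit i of A becomes zero
        b_pow *= b;
    }
    a /= b_pow; // step 3: exact division by R = b^n
    if a >= m {
        a -= m; // step 4
    }
    a
}
```

Note that `a` keeps growing inside the loop, because a multiple of the full modulus is added at every iteration.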
Note that $m'$ is the negated inverse of $m$ modulo $b$, not modulo $R$, where $R = b^n$.
Example:
Let $m = 72639$, $b = 10$, $R = 10^5$, and $T = 7118368$.
Then $n = 5$, $m' = -m^{-1} \bmod 10 = 1$, $T \bmod m = 72385$, and $TR^{-1} \bmod m = 39796$.
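The values in this example can be checked independently of the reduction algorithm. The sketch below uses two hypothetical helpers: `neg_inv_mod_base` finds $m'$ by brute force (reasonable only for a tiny base such as 10), and `mod_inverse` computes $R^{-1} \bmod m$ with the extended Euclidean algorithm.

```rust
/// Brute-force search for m' = -m^{-1} mod b, i.e. the x with m * x + 1 = 0 (mod b).
/// Hypothetical helper; only sensible for very small bases.
fn neg_inv_mod_base(m: u128, b: u128) -> Option<u128> {
    (0..b).find(|x| (m * x + 1) % b == 0)
}

/// Modular inverse via the extended Euclidean algorithm.
/// Returns None when gcd(a, m) != 1. Hypothetical helper.
fn mod_inverse(a: i128, m: i128) -> Option<i128> {
    // Invariant: s0 * a = r0 (mod m) and s1 * a = r1 (mod m).
    let (mut r0, mut r1) = (m, a.rem_euclid(m));
    let (mut s0, mut s1) = (0i128, 1i128);
    while r1 != 0 {
        let q = r0 / r1;
        (r0, r1) = (r1, r0 - q * r1);
        (s0, s1) = (s1, s0 - q * s1);
    }
    if r0 == 1 {
        Some(s0.rem_euclid(m))
    } else {
        None
    }
}
```

With these, `neg_inv_mod_base(72639, 10)` gives `Some(1)`, and multiplying $T$ by `mod_inverse(100_000, 72_639)` and reducing modulo $m$ gives 39796.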
The algorithm above then runs as follows:
i | $u_i = a_i m' \bmod b$ | $u_i m b^i$ | $A$ |
---|---|---|---|
- | - | - | 7118368 |
0 | 8 | 581112 | 7699480 |
1 | 8 | 5811120 | 13510600 |
2 | 6 | 43583400 | 57094000 |
3 | 4 | 290556000 | 347650000 |
4 | 5 | 3631950000 | 3979600000 |
$A / b^n = 39796$, which is already smaller than $m$, so no final subtraction is needed.
This matches the expected value.
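The rows of the table can be reproduced by recording $A$ after every iteration. The helper below, `accumulator_trace`, is a hypothetical name used only for this sketch; it makes the growth of $A$ visible.

```rust
/// Runs the loop of step 2 and returns A before the loop and after each iteration.
/// Hypothetical helper; assumes every value fits in a u128.
fn accumulator_trace(t: u128, m: u128, b: u128, n: u32, m_prime: u128) -> Vec<u128> {
    let mut a = t;
    let mut b_pow = 1u128; // holds b^i
    let mut trace = vec![a];
    for _ in 0..n {
        let u_i = ((a / b_pow) % b * m_prime) % b;
        a += u_i * m * b_pow;
        trace.push(a);
        b_pow *= b;
    }
    trace
}
```

For the example, the last entry is 3979600000, a 10-digit number, although $T$ has 7 digits and $m$ has 5. In general $A$ can approach $2mR$, since $A \le T + (b^n - 1)m < 2mR$.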
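In the algorithm above, the number of bits needed to store $A$ grows in step 2.2.
This can be avoided using the following facts:
$m = (m_{n-1} \cdots m_1 m_0)_b = m_{n-1} b^{n-1} + \cdots + m_1 b + m_0$
$T = (t_{2n-1} \cdots t_1 t_0)_b = t_{2n-1} b^{2n-1} + \cdots + t_1 b + t_0$
$A = (a_{2n-1} \cdots a_1 a_0)_b = a_{2n-1} b^{2n-1} + \cdots + a_1 b + a_0$
Since $y \cdot b^x \bmod b = 0$ for $x \ge 1$, the digit $a_i$ used in step 2.1 depends only on the columns up to $i$, so steps 2.1, 2.2 and 3 can all be optimized by working one column at a time.
Starting from the algorithm above, this gives the following improved version: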
INPUT: integers $m = (m_{n-1} \cdots m_1 m_0)_b$ with $\gcd(m, b) = 1$, $R = b^n$, $m' = -m^{-1} \bmod b$, and $T = (t_{2n-1} \cdots t_1 t_0)_b < mR$.
OUTPUT: $TR^{-1} \bmod m$.
1. $A \leftarrow T$. (Notation: $A = (a_{2n-1} \cdots a_1 a_0)_b$)
2. $carry \leftarrow 0$.
3. For $i$ from $0$ to $n-1$ do the following:
3.1 $a_i \leftarrow carry + t_i + \sum_{k=0}^{i-1} u_k m_p$, where $p = i - k$ and $0 \le p < n$.
3.2 $u_i \leftarrow a_i m' \bmod b$.
3.3 $carry \leftarrow (a_i + u_i m_0) / b$. (The division is exact by the choice of $u_i$.)
4. For $j$ from $0$ to $n-1$ do the following:
4.1 $r_j \leftarrow carry + t_{j+n} + \sum_{k=j+1}^{n-1} u_k m_q$, where $q = j + n - k$ and $0 \le q < n$.
4.2 $carry \leftarrow \lfloor r_j / b \rfloor$.
4.3 $w_j \leftarrow r_j \bmod b$.
5. $A = \sum_{j=0}^{n-1} w_j b^j$. (If the final $carry$ is non-zero, it is the next digit $w_n$.)
6. If $A \ge m$ then $A \leftarrow A - m$.
7. Return($A$).
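A minimal Rust sketch of steps 2 to 5 follows. The function name `montgomery_reduce_columns` and the little-endian digit-slice layout are assumptions made for this illustration, and the code assumes that $n b^2$ fits in a `u64`. It returns the digits $w_j$ and the final carry, leaving the conditional subtraction of step 6 to the caller.

```rust
/// Column-wise Montgomery reduction on little-endian digits in base `b`.
/// `t` holds 2n digits and `m` holds n digits. Returns the digits w_0..w_{n-1}
/// and the final carry, before the conditional subtraction of m.
/// Hypothetical helper; assumes n * b * b fits in a u64.
fn montgomery_reduce_columns(t: &[u64], m: &[u64], b: u64, m_prime: u64) -> (Vec<u64>, u64) {
    let n = m.len();
    assert_eq!(t.len(), 2 * n);
    let mut u = vec![0u64; n];
    let mut carry = 0u64;
    // Step 3: pick u_i so that column i becomes divisible by b.
    for i in 0..n {
        let mut a_i = carry + t[i];
        for k in 0..i {
            a_i += u[k] * m[i - k];
        }
        u[i] = (a_i * m_prime) % b;
        carry = (a_i + u[i] * m[0]) / b; // exact division
    }
    // Step 4: the upper n columns are the digits of A / b^n.
    let mut w = vec![0u64; n];
    for j in 0..n {
        let mut r_j = carry + t[j + n];
        for k in (j + 1)..n {
            r_j += u[k] * m[j + n - k];
        }
        w[j] = r_j % b;
        carry = r_j / b;
    }
    (w, carry)
}
```

Only one column sum is live at a time, so the working value never holds more than a few digits beyond $b^2$, regardless of how long $T$ is.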
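For the same example as used with the first algorithm:
Let $m = 72639$, $b = 10$, $R = 10^5$, and $T = 7118368$.
Then $n = 5$, $m' = -m^{-1} \bmod 10 = 1$, $T \bmod m = 72385$, and $TR^{-1} \bmod m = 39796$.
The initial state is $m_0 = 9$, $carry = 0$, $b = 10$, $n = 5$, and $T = 7118368 = (t_9 \cdots t_1 t_0)_{10}$, padded with leading zeros to $2n = 10$ digits.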
Step 3 of the algorithm then runs as follows:
i | $u_i = a_i m' \bmod b$ | $a_i = carry + t_i + \sum_{k=0}^{i-1} u_k m_{i-k}$ | $carry = (a_i + u_i m_0)/b$ |
---|---|---|---|
- | - | - | 0 |
0 | 8 | 0+8=8 | (8+8×9)/10 = 8 |
1 | 8 | 8+6+8×3=38 | (38+8×9)/10 = 11 |
2 | 6 | 11+3+8×6+8×3=86 | (86+6×9)/10 = 14 |
3 | 4 | 14+8+8×2+8×6+6×3=104 | (104+4×9)/10 = 14 |
4 | 5 | 14+1+8×7+8×2+6×6+4×3=135 | (135+5×9)/10 = 18 |
Step 4 of the algorithm then runs as follows:
j | $r_j = carry + t_{j+n} + \sum_{k=j+1}^{n-1} u_k m_{j+n-k}$ | $carry = \lfloor r_j / b \rfloor$ | $w_j = r_j \bmod b$ |
---|---|---|---|
- | - | 18 | - |
0 | 18+1+8×7+6×2+4×6+5×3=126 | 12 | 6 |
1 | 12+7+6×7+4×2+5×6=99 | 9 | 9 |
2 | 9+0+4×7+5×2=47 | 4 | 7 |
3 | 4+0+5×7=39 | 3 | 9 |
4 | 3+0=3 | 0 | 3 |
Therefore $A = \sum_{j=0}^{n-1} w_j b^j = 39796$, as expected.
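To see why the column-wise form needs so little working space, the following bound may help. It is a sketch derived from steps 3 and 4 of the improved algorithm, assuming $n \ge 2$, $b \ge 2$, and all digits in $[0, b-1]$; it is not a statement taken from the crate. Each column sum contains at most $n$ digit products (including $u_i m_0$), so:

$$
\begin{aligned}
a_i + u_i m_0 &\le carry + (b-1) + n(b-1)^2, \\
carry < nb \;&\Rightarrow\; a_i + u_i m_0 < nb + (b-1) + n(b-1)^2 = nb^2 - (n-1)(b-1) < nb^2, \\
&\Rightarrow\; \lfloor (a_i + u_i m_0)/b \rfloor < nb .
\end{aligned}
$$

The same estimate holds for $r_j$. Starting from $carry = 0$, induction gives that every column sum stays below $n b^2$. In the example this bound is 500, and the largest sum that occurs is $135 + 5 \times 9 = 180$. For $b = 2^{52}$ and $n = 5$ the bound is $5 \cdot 2^{104} < 2^{107}$, which fits in a `u128`, whereas the accumulator $A$ of the first algorithm can approach $2mR$ and would need more than 500 bits for the same parameters.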
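The `montgomery_reduce()` function in curve25519-dalek is implemented with the improved (second) algorithm above.
The names correspond as follows: $T$ is `limbs`, the digits $u_i$ are `n0` to `n4`, and the digits $w_j$ are `r0` to `r4`. In addition, $b = 2^{52}$, $n = 5$, $m$ is `l` (`constants::L`), and $m'$ is `constants::LFACTOR`.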

    /// Compute `limbs/R` (mod l), where R is the Montgomery modulus 2^260
    #[inline(always)]
    pub (crate) fn montgomery_reduce(limbs: &[u128; 9]) -> Scalar52 {

        #[inline(always)]
        fn part1(sum: u128) -> (u128, u64) {
            let p = (sum as u64).wrapping_mul(constants::LFACTOR) & ((1u64 << 52) - 1);
            ((sum + m(p,constants::L[0])) >> 52, p)
        }

        #[inline(always)]
        fn part2(sum: u128) -> (u128, u64) {
            let w = (sum as u64) & ((1u64 << 52) - 1);
            (sum >> 52, w)
        }

        // note: l3 is zero, so its multiplies can be skipped
        let l = &constants::L;

        // the first half computes the Montgomery adjustment factor n, and begins adding n*l to make limbs divisible by R
        let (carry, n0) = part1(        limbs[0]);
        let (carry, n1) = part1(carry + limbs[1] + m(n0,l[1]));
        let (carry, n2) = part1(carry + limbs[2] + m(n0,l[2]) + m(n1,l[1]));
        let (carry, n3) = part1(carry + limbs[3]              + m(n1,l[2]) + m(n2,l[1]));
        let (carry, n4) = part1(carry + limbs[4] + m(n0,l[4])              + m(n2,l[2]) + m(n3,l[1]));

        // limbs is divisible by R now, so we can divide by R by simply storing the upper half as the result
        let (carry, r0) = part2(carry + limbs[5]              + m(n1,l[4])              + m(n3,l[2]) + m(n4,l[1]));
        let (carry, r1) = part2(carry + limbs[6]                           + m(n2,l[4])              + m(n4,l[2]));
        let (carry, r2) = part2(carry + limbs[7]                                        + m(n3,l[4])             );
        let (carry, r3) = part2(carry + limbs[8]                                                     + m(n4,l[4]));
        let         r4 = carry as u64;

        // result may be >= l, so attempt to subtract l
        Scalar52::sub(&Scalar52([r0,r1,r2,r3,r4]), l)
    }
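The role of `LFACTOR` in `part1` can be illustrated in isolation. The sketch below is not the crate's code: `neg_inv_mod_2_52` is a hypothetical helper that derives $m' = -m_0^{-1} \bmod 2^{52}$ by Newton iteration, `L0` is the lowest 52-bit limb of the group order $l$ (a value assumed here from the crate's constants), and `m` and `part1` are local stand-ins with the same shape as the functions used above, except that the factor is passed as a parameter.

```rust
const LIMB_BITS: u32 = 52;
const MASK: u64 = (1u64 << LIMB_BITS) - 1;
/// Lowest 52-bit limb of the group order l (assumed from the crate's constants).
const L0: u64 = 0x0002_631a_5cf5_d3ed;

/// Widening 64x64 -> 128 bit multiply, standing in for the `m` helper.
fn m(x: u64, y: u64) -> u128 {
    (x as u128) * (y as u128)
}

/// Computes -m0^{-1} mod 2^52 for odd m0 by Newton iteration (hypothetical helper).
fn neg_inv_mod_2_52(m0: u64) -> u64 {
    // x = m0 is an inverse modulo 8, because m0 * m0 = 1 (mod 8) for odd m0.
    let mut x = m0;
    for _ in 0..5 {
        // Each step doubles the number of correct low bits: 3, 6, 12, 24, 48, 96.
        x = x.wrapping_mul(2u64.wrapping_sub(m0.wrapping_mul(x)));
    }
    x.wrapping_neg() & MASK
}

/// Same shape as `part1` above, with the factor passed in explicitly.
fn part1(sum: u128, lfactor: u64) -> (u128, u64) {
    let p = (sum as u64).wrapping_mul(lfactor) & MASK;
    // sum + p * L0 is divisible by 2^52, so the shift discards only zero bits.
    ((sum + m(p, L0)) >> LIMB_BITS, p)
}
```

Because the factor is only needed modulo $b = 2^{52}$ rather than modulo $R = 2^{260}$, a single `u64` multiplication per limb is enough to make each column divisible by $2^{52}$.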