Appendix: an extremely rough translation (2)

3. RSA scalability

Obviously a post-quantum RSA public key n will need to be quite large to resist the attacks described in Section 2. This section analyzes the scalability of the best algorithms available for RSA key generation, encryption, decryption, signature generation, and signature verification.

Small exponents. The fundamental RSA public-key operation is computing an e-th power modulo n. This modular exponentiation uses approximately lg e squarings modulo n, and, thanks to standard windowing techniques, o(lg e) extra multiplications modulo n.

In the original RSA paper [43], e was a random number with as many bits as n. Rabin in [42] suggested instead using a small constant e, and said that e = 2 is “several hundred times faster.” Rabin’s speedup factor grows as Θ(lg n), making it particularly important for the large sizes of n considered in this paper.

The slower but simpler choice e = 3 was deployed in a variety of real-world applications. The much slower alternative e = 65537 subsequently became popular as a means of compensating for poor choices of RSA message-randomization mechanisms, but with proper randomization no attacks against e = 3 are known that are faster than factorization.

For simplicity this paper also focuses on e = 3. Computing an e-th power modulo n then takes one squaring modulo n and one general multiplication modulo n. Each of these steps takes just (lg n)^(1+o(1)) bit operations using standard fast-multiplication techniques; see below for further discussion. Notice that (lg n)^(1+o(1)) is asymptotically far below the (lg n)^(2+o(1)) cost of Shor’s algorithm.
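As a minimal sketch of the e = 3 public-key operation described above, the following Python snippet cubes a value with exactly one squaring and one general multiplication, each followed by a reduction modulo n. The function name cube_mod and the toy modulus are illustrative assumptions; real post-quantum moduli are vastly larger.

```python
def cube_mod(x: int, n: int) -> int:
    """Compute x^3 mod n with one squaring and one general multiplication."""
    square = (x * x) % n      # one squaring modulo n
    return (square * x) % n   # one general multiplication modulo n


# Toy modulus built from primes congruent to 2 modulo 3 (illustrative only).
n = 5 * 11 * 17
x = 123
assert cube_mod(x, n) == pow(x, 3, n)
```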

Many primes. The fundamental RSA secret-key operation is computing an e-th root modulo n. For e = 3 one chooses n as a product of distinct primes congruent to 2 modulo 3; then the inverse of x → x^3 mod n is x → x^d mod n, where d = (1 + 2∏_{p|n}(p−1))/3. Unfortunately, d is not a small exponent; it has approximately lg n bits.
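The formula for d can be checked on a toy example. The sketch below assumes the primes 5, 11, and 17 (all congruent to 2 modulo 3) purely for illustration; the helper name secret_exponent is hypothetical. Each p − 1 is congruent to 1 modulo 3, so 1 + 2∏(p−1) is divisible by 3, which the code asserts before dividing.

```python
from math import prod


def secret_exponent(primes):
    """Return d = (1 + 2 * prod(p - 1)) / 3 for primes congruent to 2 mod 3."""
    assert all(p % 3 == 2 for p in primes)
    numerator = 1 + 2 * prod(p - 1 for p in primes)
    assert numerator % 3 == 0  # holds because every p - 1 is 1 mod 3
    return numerator // 3


primes = [5, 11, 17]          # toy primes, illustrative only
n = prod(primes)
d = secret_exponent(primes)

# x -> x^d mod n inverts x -> x^3 mod n for every residue x.
for x in range(n):
    assert pow(pow(x, 3, n), d, n) == x
```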

A classic speedup in the computation of x^d mod n is to compute x^d mod p and x^d mod q, where p and q are the prime divisors of n, and to combine them into x^d mod n by a suitably explicit form of the Chinese remainder theorem. Fermat’s identity x^p mod p = x mod p further implies that x^d mod p = x^(d mod (p−1)) mod p (since d mod (p−1) ≥ 1) and similarly x^d mod q = x^(d mod (q−1)) mod q. The exponents d mod (p−1) and d mod (q−1) have only half as many bits as n; the exponentiation x^d mod n is thus replaced by two exponentiations with half-size exponents and half-size moduli.

If n is a product of more primes, say k ≥ 3 primes, then the same speedup becomes even more effective, using k exponentiations with (1/k)-size exponents and (1/k)-size moduli. Prime generation also becomes much easier since the primes are smaller. Of course, if primes are too small then the attacker can find them using the ring algorithms discussed in the previous section: specifically, EECM before quantum computers, and GEECM after quantum computers.
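The multi-prime speedup can be sketched as follows: compute one small cube root modulo each prime p, using the exponent d mod (p−1), and recombine with the Chinese remainder theorem. This is a straightforward textbook recombination, not the fast interpolation used at scale; the function name and the toy primes are assumptions. It relies on Python 3.8+ for pow(a, -1, m), and on odd primes congruent to 2 modulo 3.

```python
from math import prod


def cube_root_crt(c, primes):
    """Cube root of c modulo n = prod(primes), one small exponentiation per prime."""
    n = prod(primes)
    result = 0
    for p in primes:
        d_p = pow(3, -1, p - 1)       # equals d mod (p - 1), since 3d = 1 mod (p - 1)
        r_p = pow(c % p, d_p, p)      # cube root of c modulo p
        m_p = n // p                  # product of the other primes
        result += r_p * m_p * pow(m_p, -1, p)  # CRT basis element times residue
    return result % n


primes = [5, 11, 17, 23]  # toy odd primes, each congruent to 2 modulo 3
n = prod(primes)
for x in range(n):
    assert cube_root_crt(pow(x, 3, n), primes) == x
```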

What matters for this paper is how multi-prime RSA scales to much larger moduli n. Before quantum computers the top threats are EECM and NFS, and balancing these threats implies that each prime p has (lg n)^(2/3+o(1)) bits (see above), i.e., that k ∈ (lg n)^(1/3+o(1)). After quantum computers the top threats are GEECM and Shor’s algorithm, and balancing these threats implies that each prime p has just (lg lg n)^(2+o(1)) bits, i.e., that k ∈ (lg n)/(lg lg n)^(2+o(1)). RSA key generation, decryption, and signature generation then take (lg n)^(1+o(1)) bit operations; see below for further discussion.

Key generation. To recap: A k-prime exponent-3 RSA public key n is a product of k distinct primes p congruent to 2 modulo 3. In particular, a post-quantum RSA public key n is a product of k distinct primes p congruent to 2 modulo 3, where each prime p has (lg lg n)^(2+o(1)) bits.

Standard prime-generation techniques use (lg p)^(3+o(1)) bit operations. See, e.g., [6, Section 3] and [38, Section 4.5]. The point is that one must try about log p random numbers before finding a prime, and checking primality has similar cost to a single exponentiation modulo p.

A standard speedup is to check whether p is divisible by any primes up through some limit, say y. The chance of a random integer surviving this divisibility test is approximately 1/log y, reducing the original pool of log p random numbers to (log p)/log y random numbers and saving an overall factor of log y if the trial division is not a bottleneck. The conventional view is that keeping the cost of trial division under control requires y to be chosen as a polynomial in lg p, saving a factor of only Θ(lg lg p) and thus still requiring (lg p)^(3+o(1)) bit operations.

A nonstandard speedup is to replace trial division (or sieving) by batch trial division [8] or batch smoothness detection [9]. The algorithm of [9] reads a finite sequence S of positive integers and a finite set P of primes, and finds “the largest P-smooth divisor of each integer in S” using just b(lg b)^(2+o(1)) bit operations, where b is the total number of bits in P and S. In particular, if P is the set of primes up through y, and S is a sequence of Θ(y/lg p) integers each having Θ(lg p) bits, then b is Θ(y) and this algorithm uses just y(lg y)^(2+o(1)) bit operations, i.e., (lg p)(lg y)^(2+o(1)) bit operations for each element of S. Larger sequences S can trivially be split into sequences of size Θ(y/lg p), producing the same performance per element of S.

To do even better, assume that the original size of S is at least 2^(2^α), and apply batch smoothness detection successively for y = 2^(2^0), y = 2^(2^1), y = 2^(2^2), and so on through y = 2^(2^α). Each step weeds out about half of the remaining elements of S as composites; the next step costs about four times as much per element but is applied to only half as many elements. The total cost is just (lg p)(2^α)^(1+o(1)) bit operations for each of the original elements of S. Each of the original elements has probability about 1/2^α of surviving this process and incurring an exponentiation, which costs (lg p)^(2+o(1)) bit operations. Choosing 2^α ∈ (lg p)^(0.5+o(1)) balances these costs as (lg p)^(1.5+o(1)) for each of the original elements of S, i.e., (lg p)^(2.5+o(1)) for each prime generated.

In the context of post-quantum RSA the assumption about the original size of S is satisfied: one has to generate (lg n)^(1+o(1)) primes, so the original size of S is (lg n)^(1+o(1)), which is at least 2^(2^α) for 2^α ∈ (1 + o(1)) lg lg n; this choice of α satisfies 2^α ∈ (lg p)^(0.5+o(1)) since lg p ∈ (lg lg n)^(2+o(1)). The primes are also balanced, in the sense that (lg n)/k ∈ (lg p)^(1+o(1)) for each p, so generating k primes in this way uses k(lg p)^(2.5+o(1)) = (lg n)(lg p)^(1.5+o(1)) = (lg n)(lg lg n)^(3+o(1)) bit operations.

Computing n by multiplying these primes uses only (lg n)(lg lg n)^(2+o(1)) bit operations using standard fast-arithmetic techniques; see, e.g., [10, Section 12]. At this level of detail it does not matter whether one uses the classic Schönhage–Strassen multiplication algorithm [46], Fürer’s multiplication algorithm [21], or the Harvey–van der Hoeven–Lecerf multiplication algorithm [27].

The total number of bit operations for key generation is essentially linear in lg n. For comparison, the usual picture is that prime generation is vastly more expensive than any of the other steps in RSA.

One can try to further accelerate key generation using Takagi’s idea [52] of choosing n as p^(k−1)q. We point out two reasons that this is worrisome. The first reason is lattice attacks [13]. The second reason is that any n-th power modulo n has small order, namely some divisor of (p − 1)(q − 1); Shor’s algorithm finds the order at relatively high speed once the n-th power is computed.

Encryption and decryption. There are many different RSA encryption mechanisms in the literature. The oldest mechanisms use RSA to directly encrypt a user’s message; this requires careful padding and scrambling of the message. Newer mechanisms generate a secret key (for example, an AES key), use the secret key to encrypt and authenticate the user’s message, and use RSA to encrypt the secret key; this allows simpler padding, since the secret key is already randomized. The newest mechanisms such as Shoup’s “RSA-KEM” [51] simply use RSA to encrypt lg n bits of random data, hash the random data to obtain a secret key, and use the secret key to encrypt and authenticate the user’s message; this does not require any padding. For simplicity this paper takes the last approach.

Generating large amounts of truly random data is expensive. Fortunately, truly random data can be simulated by pseudorandom data produced by a stream cipher from a much smaller key. (Even better, slight deficiencies in the randomness of the cipher key do not compromise security.) The literature contains several scalable ciphers that produce a Θ(b)-bit block of output from a Θ(b)-bit key, with a conjectured 2^b security level, using b^(2+o(1)) bit operations (and even fewer for some ciphers), i.e., b^(1+o(1)) bit operations for each output bit. In the context of post-quantum RSA one has b ∈ Θ(lg lg n), so generating lg n pseudorandom bits costs (lg n)(lg lg n)^(1+o(1)) bit operations. The same ciphers can also be converted into hash functions with only a constant-factor loss in efficiency, so hashing the bits also costs (lg n)(lg lg n)^(1+o(1)) bit operations.

Multiplication also takes (lg n)(lg lg n)^(1+o(1)) bit operations. Squaring, reduction modulo n, multiplication, and another reduction modulo n together take (lg n)(lg lg n)^(1+o(1)) bit operations. The overall cost of RSA encryption is therefore (lg n)(lg lg n)^(1+o(1)) bit operations plus the cost of encrypting and authenticating the user’s message under the resulting secret key.

Decryption is more complicated but not much slower; it works as follows. First reduce the ciphertext modulo all of the prime divisors of n. This takes (lg n)(lg lg n)^(2+o(1)) bit operations using a remainder tree or a scaled remainder tree; see, e.g., [10, Section 18]. Then compute a cube root modulo each prime. A cube root modulo p takes (lg p)^(2+o(1)) bit operations, so all of the cube roots together take (lg n)(lg lg n)^(2+o(1)) bit operations. Then reconstruct the cube root modulo n. This takes (lg n)(lg lg n)^(2+o(1)) bit operations using fast interpolation techniques; see, e.g., [10, Section 23]. Finally hash the cube root. The overall cost of RSA decryption is (lg n)(lg lg n)^(2+o(1)) bit operations, plus the cost of verifying and decrypting the user’s message under the resulting secret key.
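The first decryption step, reducing the ciphertext modulo every prime divisor of n, can be sketched with a plain (unscaled) remainder tree. The snippet below is a minimal illustration under assumed names (product_tree, remainder_tree) and toy primes; it relies on Python's built-in big-integer arithmetic and makes no attempt at the asymptotically fast multiplication that the cost estimate assumes.

```python
def product_tree(values):
    """Return the levels of a product tree: leaves first, root (full product) last."""
    tree = [list(values)]
    while len(tree[-1]) > 1:
        level = tree[-1]
        tree.append([
            level[i] * level[i + 1] if i + 1 < len(level) else level[i]
            for i in range(0, len(level), 2)
        ])
    return tree


def remainder_tree(c, tree):
    """Reduce c modulo every leaf of the tree by walking down from the root."""
    rems = [c % tree[-1][0]]
    for level in reversed(tree[:-1]):
        # Node i of this level has parent i // 2 in the level above.
        rems = [rems[i // 2] % level[i] for i in range(len(level))]
    return rems


primes = [5, 11, 17, 23, 29]   # toy primes, illustrative only
c = 123456
assert remainder_tree(c, product_tree(primes)) == [c % p for p in primes]
```

Each residue would then go through a cube root modulo its prime, and the results would be recombined modulo n.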

Shamir in [47] proposed decrypting modulo just one prime, and choosing plaintexts to be smaller than primes. However, this requires exponents to be much larger for security, and in the context of post-quantum RSA this slows down encryption by vastly more than it speeds up decryption. A more interesting variant, which we do not explore further, is to use a significant fraction of the primes to decrypt a plaintext having (lg n)/(lg lg n)^(0.5+o(1)) bits; this should reduce the total cost of encryption and decryption to (lg n)(lg lg n)^(1.5+o(1)) bit operations with a properly chosen exponent.

Signature generation and verification. Standard padding schemes for RSA signatures involve the same operations discussed above, such as hashing to a short string and using a stream cipher to expand the short string to a long string.

The final speeds are, unsurprisingly, (lg n)(lg lg n)^(2+o(1)) bit operations to generate a signature and (lg n)(lg lg n)^(1+o(1)) bit operations to verify a signature, plus the cost of hashing the user’s message.

4. Concrete parameters and initial implementation

Summarizing what we’ve learned so far: Shor’s algorithm takes (lg n)^(2+o(1)) qubit operations to factor n. If the prime divisors of n are too small then GEECM becomes a larger threat than Shor’s algorithm; protecting against GEECM requires each prime to have (lg lg n)^(2+o(1)) bits. Section 3 showed that, under this constraint, all of the RSA operations can be carried out using (lg n)(lg lg n)^(O(1)) bit operations; the O(1) is 3 + o(1) for key generation, 2 + o(1) for decryption and signature generation, and 1 + o(1) for encryption and signature verification.

These asymptotics do not imply anything about any particular size of n. This section looks at performance in more detail, and in particular reports successful generation of a 1-terabyte post-quantum RSA key built from 4096-bit primes.

Prime sizes and key sizes. Before looking at performance, we explain why these sizes (1-terabyte key, 4096-bit primes) provide ample security.

A 1-terabyte key n has 2^43 bits, so Shor’s algorithm uses 2^44 multiplications modulo n. We have not found literature analyzing the cost of circuits for optimized FFT-based multiplication at this scale, so we extrapolate as follows.

The recent speed records from Harvey–van der Hoeven–Lecerf [28] for multiplication of degree-2^21 polynomials over a particularly favorable finite field, F_(2^60), use 640 milliseconds on a 3.4GHz CPU core. More than half of the cycles are performing 128-bit vector xor, and more than 10% of the cycles are performing 64×64-bit polynomial multiplications, according to [28, Section 3.3], for a total of approximately 2^40 bit operations to multiply 2^27-bit inputs.

Imagine that the same 2^13 ratio scales directly from 2^27-bit inputs to 2^43-bit inputs; that integer multiplication uses as few bit operations as binary-polynomial multiplication; that reduction modulo n does not cost anything; and that there are no overheads for switching from bit operations to reversible qubit operations inside a realistic quantum-computer architecture. (For comparison, the ratio in [56] is more than 2^20 for 2^20-bit inputs.) Each multiplication modulo n inside Shor’s algorithm then uses 2^56 qubit operations, and overall Shor’s algorithm consumes an astonishing 2^100 qubit operations.
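The extrapolation can be retraced with exact integer arithmetic. The sketch below only re-derives the quoted figures; the variable names are illustrative, a terabyte is taken as 2^40 bytes, and the factor of two between key bits and multiplications is read off from the quoted 2^43 and 2^44 rather than derived.

```python
key_bits = 8 * 2**40                  # 1 terabyte = 2^40 bytes = 2^43 bits
assert key_bits == 2**43

shor_multiplications = 2 * key_bits   # quoted as 2^44 multiplications modulo n
assert shor_multiplications == 2**44

# About 2^40 bit operations to multiply 2^27-bit inputs: 2^13 operations per input bit.
ops_per_input_bit = 2**40 // 2**27
assert ops_per_input_bit == 2**13

# Scaling the same ratio to 2^43-bit inputs gives the per-multiplication cost.
ops_per_multiplication = ops_per_input_bit * key_bits
assert ops_per_multiplication == 2**56

total_qubit_operations = shor_multiplications * ops_per_multiplication
assert total_qubit_operations == 2**100
```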

We caution the reader that this is only a preliminary estimate. A thorough analysis would have to account for several overheads mentioned above; for the number of Shor iterations required; for known techniques to reduce the number of iterations; for techniques to use slightly fewer multiplications per iteration; and for the latest improvements in integer-multiplication algorithms.

As for prime sizes: Standard pre-quantum cost analyses conclude that 4096-bit RSA keys provide roughly 2^140 security against all available algorithms. ECM is well known to be inferior to NFS at such sizes; evidently it uses even more than 2^140 bit operations to find 2048-bit primes. ECM would be even slower against a much larger modulus, simply because arithmetic is slower. However, the speedup from ECM to GEECM reduces the post-quantum security level of 2048-bit primes. Rather than engaging in a detailed analysis of this loss, we move up to 4096-bit primes, obviously putting GEECM far out of reach.

Implementation. We now discuss our implementation of post-quantum RSA. Our main result is successful generation of a 1-terabyte exponent-3 RSA key consisting of 4096-bit primes. We also have preliminary results for encryption and decryption, although so far only for smaller sizes.

Our computations were performed on a heterogeneous cluster. We give a description of the machines in Appendix A. The memory-intensive portions of our computations were carried out on a single machine running Ubuntu with 24 cores at 3.40 GHz (4 Intel Xeon E7-8893 v2 processors), 3 terabytes of DRAM, and 4.9 terabytes of swap memory built from enterprise SSDs. We will refer to this machine as lattice0 below. We measured memory consumption and overall runtime for bignum multiplications using GNU’s Multiple Precision (GMP) Library [26]. We encountered a number of software limits and bugs, which we detail in Appendix A.

Prime generation. Generating a 1-terabyte exponent-3 RSA key requires 2^31 4096-bit primes that are congruent to 2 mod 3. To efficiently generate such a large number of primes, our implementation first applies the batched smoothness detection technique discussed in Section 3 to an input collection of random 4096-bit numbers. We then use the Fermat congruence primality test to produce our final set of primes. While we do not prove that each number in the final output is prime, this test is sufficient to guarantee with high confidence that all of the 4096-bit numbers in the final output are prime. See [31] for quantitative upper bounds on the error probability.
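The overall shape of this pipeline can be sketched in a few lines of Python. This is a simplified stand-in, not the implementation described above: a per-candidate gcd against the product of small primes replaces true batch smoothness detection, the bound 2^10 and all function names are assumptions, and the base-2 Fermat test gives a probable prime only. The sketch also assumes the requested bit length is well above 10, so that candidates exceed every small prime.

```python
import secrets
from math import gcd, prod

# Primes from 5 up to 2^10; 2 and 3 are already excluded by the 5 mod 6 filter.
SMALL_PRIMES = [p for p in range(5, 1 << 10)
                if all(p % q for q in range(2, int(p ** 0.5) + 1))]
SMALL_PRODUCT = prod(SMALL_PRIMES)


def random_candidate(bits):
    """Random integer with exactly `bits` bits that is congruent to 5 mod 6."""
    while True:
        x = secrets.randbits(bits) | (1 << (bits - 1))  # force the top bit
        if x % 6 == 5:            # odd and congruent to 2 mod 3
            return x


def fermat_probable_prime(x, base=2):
    """Fermat congruence test: necessary for primality, not a proof."""
    return pow(base, x - 1, x) == 1


def generate_prime(bits):
    """Return a probable prime congruent to 2 mod 3 (assumes bits well above 10)."""
    while True:
        x = random_candidate(bits)
        if gcd(x, SMALL_PRODUCT) != 1:   # shares a factor with a small prime
            continue
        if fermat_probable_prime(x):
            return x


p = generate_prime(256)
assert p % 3 == 2 and p.bit_length() == 256
```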

We found that first filtering for random numbers congruent to 5 mod 6, and then applying batch sieving with the successive bounds y = 2^10 and y = 2^20 worked well in practice. Our heterogeneous cluster was able to generate primes at a rate of 750–1585 primes per core-hour. Generating all 2^31 primes took approximately 1,975,000 core-hours. In calendar time, prime generation completed in four months running on spare compute capacity of a 1,400-core cluster.
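As a quick consistency check on these figures, the following arithmetic brackets the reported core-hours by the slowest and fastest quoted rates. The variable names are illustrative, and the last line is a derived figure that assumes the whole 1,400-core cluster were dedicated to the job, which the text says was not the case.

```python
primes_needed = 2**31
slowest_rate, fastest_rate = 750, 1585      # primes per core-hour, as quoted
reported_core_hours = 1_975_000

upper_bound = primes_needed / slowest_rate  # about 2.86 million core-hours
lower_bound = primes_needed / fastest_rate  # about 1.35 million core-hours
assert lower_bound < reported_core_hours < upper_bound

# Implied average rate across the heterogeneous cluster.
average_rate = primes_needed / reported_core_hours   # about 1087 primes per core-hour

# Hypothetical duration if all 1,400 cores had been dedicated full time.
full_cluster_days = reported_core_hours / 1400 / 24  # about 59 days
```

The roughly 59-day figure against four months of calendar time is in line with the statement that only spare capacity was used.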

Product tree. After we successfully generated 2^31 4096-bit primes, we used a product tree to compute the 1-terabyte public RSA key. We distributed individual multiplications across our heterogeneous cluster to reduce the wall-clock time. We first multiplied batches of 8 million primes and wrote their products out to disk. Each subsequent single-threaded multiplication job read two integers from disk and wrote their product back to disk. Running times varied due to different CPU types and non-pqRSA related jobs sharing cache space. Once the integers reached 256GB in size, we finished computing the product on lattice0. The aggregate wall-clock time used by individual multiply jobs was about 1,239,626 seconds, and the elapsed time for the terabyte key generation was about four days. The final multiplication of two 512 GB integers took 176,223 seconds in wall-clock time, using 3.166TB of RAM and 2.5 TB of swap storage.
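The two-phase structure described above (multiply fixed-size batches first, then combine the partial products pairwise) can be sketched in memory as follows. The function name, the tiny batch size, and the toy primes are assumptions; the actual computation wrote intermediate products to disk and distributed the jobs across machines, which this sketch does not attempt.

```python
from math import prod


def batched_product(values, batch_size):
    """Multiply a non-empty list: batches first, then pairwise up a product tree."""
    # Phase 1: one product per batch (8 million primes per batch in the text).
    level = [prod(values[i:i + batch_size])
             for i in range(0, len(values), batch_size)]
    # Phase 2: each job multiplies two neighbouring integers from the previous level.
    while len(level) > 1:
        next_level = [level[i] * level[i + 1]
                      for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an unpaired integer moves up unchanged
            next_level.append(level[-1])
        level = next_level
    return level[0]


primes = [5, 11, 17, 23, 29, 41, 47, 53, 59, 71, 83]  # toy primes, each 2 mod 3
assert batched_product(primes, batch_size=3) == prod(primes)
```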

Encryption. We implemented RSA encryption using RSA-KEM, as described in Section 3. With the exponent e = 3, we found that a simple square-and-reduce using GMP’s mpz_mul and mpz_mod was almost twice as fast as using the modular exponentiation function mpz_powm. Each operation was single-threaded. We were able to complete RSA encryption for modulus sizes up to 2 terabits, as shown in Table 4.1. For the 2Tb (256GB) encryption, the longest multiplication took 13 hours, modular reduction took 40 hours, and in total encryption took a little over 100 hours.
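A minimal sketch of RSA-KEM encapsulation with e = 3, written in Python rather than with GMP, might look as follows. The function name, the optional r parameter (present only so the example is reproducible), the use of the secrets module instead of a stream cipher, and SHA-256 as the hash are all assumptions made for illustration; the text does not specify these choices.

```python
import hashlib
import secrets


def rsa_kem_encapsulate(n, r=None):
    """Return (ciphertext, session_key) for modulus n with public exponent 3."""
    if r is None:
        r = secrets.randbelow(n)          # random data to be encrypted
    square = (r * r) % n                  # square, then reduce
    ciphertext = (square * r) % n         # multiply, then reduce
    r_bytes = r.to_bytes((n.bit_length() + 7) // 8, "big")
    session_key = hashlib.sha256(r_bytes).digest()   # hash the random data
    return ciphertext, session_key


n = 5 * 11 * 17                            # toy modulus, illustrative only
ciphertext, key = rsa_kem_encapsulate(n, r=7)
assert ciphertext == pow(7, 3, n)
assert len(key) == 32
```

The receiver would recover r as the cube root of the ciphertext modulo n and hash it the same way to obtain the same session key.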

Decryption. We implemented RSA decryption as described in Section 3. Table 4.1 gives wall-clock timings for the three computational steps in decryption, each parallelized across 48 threads. Precomputing the entire product and remainder tree for a terabyte-sized key and storing it to disk would have taken 32TB of disk space, so instead we recomputed portions of the trees on the fly. The reported timings for the remainder tree step in Table 4.1 include the time it takes to recompute both the product and remainder tree with a batch size of 8 million primes. Using a batch size of 8 million primes was roughly twice as fast as using a batch size of 2 million primes. We obtained experimental results for decryption of messages for key sizes of up to 16GB.

A Appendix: Implementation barriers and details

Extending GMP’s integer capacity. The GMP library uses hard-coded 32-bit integers to represent sizes in multiple locations in the library. Without any modifications, GMP supports 2^37-bit integers on 64-bit machines [25]. To represent large values, we extended GMP’s capacity from 32-bit integers to 64-bit integers by changing the data typing in GMP’s integer structure, mpz. Namely, we changed mpz_size and mpz_alloc from int types to int64_t types. To accommodate increased memory usage, we increased the bound for GMP’s memory allocation for the mpz struct in realloc.c to LLONG_MAX. The final modifications we made were to create binary-format I/O functions for 64-bit mpzs, namely in mpz_inp_out.c and mpz_out_raw.c.

Impact of swapping. We initially evaluated the performance of our product-tree implementation by generating a “dummy key”, a terabyte product of random 4096-bit integers. During this product computation, we counted instructions per CPU cycle (IPCs) with the command perf stat -e instructions,cycles -a sleep 1 to measure the lost performance caused by swapping. When no swapping occurred, the machine had about 2 instructions per cycle, but upon swapping, the instructions per cycle dropped as low as 0.37 and held around 0.5 to 1.2.

GMP memory consumption. GMP’s memory consumption is another concern. High RAM and swap usage at higher levels in the product tree are attributed to GMP’s FFT implementation. According to GMP’s developers, their FFT implementation consumes about 8n bytes of temporary memory space for an n·n product where n is the byte size of the factors [57]. This massive consumption of memory also triggered a known race condition in the Linux kernel [2]. The bug was found in the huge_memory.c code. There are numerous bug reports for variants of the same bug on various mainline Linux systems throughout the past six years. Disabling transparent huge pages avoided the transparent hugepage code in the kernel.

Measurements for 1-terabyte key product tree. In Table A.1, we show the wall-clock time for each level of computing a 1-terabyte product tree. Levels far down in the product tree are easily parallelized. We carried out the entire computation on lattice0 using 48 threads. The computation used a peak of 3.16TB of RAM and 2.22TB of swap memory, and completed in 356,709 seconds, or approximately 4 days, in wall-clock time.

Heterogeneous cluster description.

B Credits for multi-prime RSA

The idea of using RSA with more than two primes is most commonly credited to Collins, Hopkins, Langford, and Sabin, who received patent 5848159 in 1998 for “RSA with several primes”:

The invention, allowing 4 primes each about 150 digits long to obtain a 600 digit n, instead of two primes about 350 [sic] digits long, results in a marked improvement in computer performance. For, not only are primes that are 150 digits in size easier to find and verify than ones on the order of 350 digits, but by applying techniques the inventors derive from the Chinese Remainder Theorem (CRT), public key cryptography calculations for encryption and decryption are completed much faster, even if performed serially on a single processor system.

However, the same idea had already appeared in the original RSA patent in 1983:

In alternative embodiments, the present invention may use a modulus n which is a product of three or more primes (not necessarily distinct). Decoding may be performed modulo each of the prime factors of n and the results combined using “Chinese remaindering” or any equivalent method to obtain the result modulo n.

In any event, both of these patents have now expired, so they will not interfere with the deployment of post-quantum RSA.
