by Solar Designer
November 18, 2016
As Wikipedia explains, "Zcash is a decentralized and open-source cryptocurrency that offers privacy and selective transparency of transactions. Zcash payments are published on a public blockchain, but the sender, recipient, and amount of a transaction remain private."
This article presents results of analysis performed on request from and funded by the Zcash company.
Although Equihash itself is defined in its designers' paper, Zcash's use of it (most notably, the parameter values) and the Zcash community's optimizations of Equihash solvers and Zcash miners are a moving target. Thus, unlike other Openwall articles, this is a time-sensitive report, which might become obsolete soon (but will likely retain its historical value). I also chose to write it in first person (without the academic "we") and in a rather informal, discussion-like manner (without tedious separate sections with my somewhat educated guesstimates on Equihash efficiency on commodity vs. custom hardware, but instead giving the same information while "arguing" with the claims previously made, which I needed to do anyway).
As is usually the case for any non-trivial problem, Zcash had conflicting goals in choosing its proof-of-work (PoW) scheme, and that's fine.
One goal was to minimize the advantage that specialized hardware miners (such as ASICs) will have over commodity hardware ones (such as typical computers and smartphones). This is generally achieved through requiring a significant amount of memory per miner instance (with so-called memory-hard PoWs) and additionally holding this memory for a significant period of time (such as with so-called sequential memory-hard PoWs, formally introduced with Colin Percival's scrypt and now also including Equihash designers' Argon2, my yescrypt except for its ROM feature, and many others). Another goal was lightweight verification, in particular cheap enough to be performed in an Ethereum contract.
Achieving these two goals at once requires a PoW that needs a lot less memory and time for proof verification than it does for proof computation: a so-called asymmetric PoW, of which Equihash is one. It also requires a PoW that is memory-hard, and ideally sequential memory-hard. While Equihash is memory-hard, unfortunately it is not sequential memory-hard: it mostly can be parallelized, with only relatively few steps that have to be sequential. (Certain other asymmetric PoW schemes are claimed to be sequential or latency limited. They vary in other drawbacks. This is a topic of ongoing research.) This means loosening the assurance of how long the memory will have to be held for per computation of a proof. A massively parallel implementation may greatly reduce the overall cost (customarily expressed as ASIC area*time product) through reduction of time that the per-proof amount of memory has to be held for.
Additionally, the memory needs of symmetric sequential memory-hard PoWs tend to be simpler to reason about: there's a clear amount of memory (per instance) below which an increase in the required amount of computation (and usually time) occurs, and similarly it's clear that having more memory than this (also per instance) is of no benefit. For asymmetric PoWs (including Equihash), this tends to be more complicated: the memory requirement is algorithm-dependent, and the possibility of future optimizations is not easily ruled out.
Equihash is not the most ASIC-resistant PoW scheme out there (likely by far) when compared to certain symmetric sequential memory-hard schemes at similar or even lower proof computation times. Yet this comparison isn't fair, since asymmetry is a desirable property, and the maximum acceptable proof verification time may be limiting the memory settings of a symmetric scheme. Thus, the choice of Equihash may pose a reasonable tradeoff given the multiple goals of Zcash and the timing of its release.
Please note that when I write "scrypt", I mostly don't mean its misuse in Litecoin (and many other alt-coins). They went as low as 128 KiB, targeting L2 cache instead of RAM and in the process making their coins ASIC friendly. What I mean here is proper use of scrypt at 16+ MiB, which Colin Percival's paper suggests as a minimum. Unfortunately, with a symmetric PoW, this also means slower verification, yet some other alt-coins managed to go with 2 MiB (which translates to about 10,000 proof or verification computations per second on a current dual-socket server) and above (e.g., YACoin went as high as 64 MiB by now, and will automatically go even higher, but that's likely problematic).
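For reference, and simplifying a little (ignoring the p parameter and small per-thread buffers), scrypt's memory usage is about 128 * N * r bytes for its N and r cost parameters. Thus:

    128 * 1024 * 1 = 128 KiB (Litecoin's N=1024, r=1)
    N * r = 2^17 (e.g., N=2^14, r=8) gives 16 MiB
    N * r = 2^19 gives 64 MiB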
No and yes. On one hand, Zcash's blog post from April 2016 made claims that are not entirely correct. On the other hand, the claims are not entirely wrong, and Equihash appears to be a reasonable choice given Zcash's goals anyway.
Claim 1: "Equihash is a memory-oriented Proof-of-Work, which means how much mining you can do is mostly determined by how much RAM you have."
This claim is mostly wrong.
For Equihash with fixed parameters, as Zcash uses it, there's expected to be a certain minimum required amount of memory below which a major performance impact would be incurred. Using somewhat more memory than this reasonable minimum may sometimes help improve performance (such as through solving multiple instances of Equihash at a time, up to the number of supported hardware threads if on a CPU, and thereby avoiding the need to bounce data between the threads in what would be a shared instance). However, using even more (yet not unaffordably more) memory than this isn't expected to help all that much, if at all. Thus, there's no linear (nor close to linear) relationship between "how much RAM you have" and "how much mining you can do", unlike what the claim above seems to have implied.
Besides memory, another similarly important factor is time. With efficient Equihash solving algorithms including a lot of natural parallelism, the time factor can be greatly reduced. Thus, while "how much RAM you have" does to a very limited extent determine "how much mining you can do", how much parallelism your hardware can exploit may have a greater effect, thereby making the word "mostly" unjustified. But there's a catch, which Zcash was aware of and might have factored into their wording: as the Equihash paper claims, Equihash solving is memory bandwidth bound. If true, this could mean that the time factor is also a function of the number and speed of memory ports, and that's to some extent correlated with "how much RAM you have" (although it isn't the same, by far). I'll address this further in this article.
Claim 2: "We think it is unlikely that anyone will be able to build cost-effective custom hardware (ASICs) for mining in the foreseeable future."
This is not literally true for any PoW scheme, given sufficient demand and thus production volume for the economy of scale to kick in. There's always some portion of cost that can be shaved off through specialization, and with sufficient volume the cost shaved outweighs the savings from using commodity hardware. Thus, the only way this claim can prove right is through lack of sufficient demand in the foreseeable future.
For cryptocurrency PoWs in particular, additional cost savings come from greater tolerance for ASIC failures and intermittent errors, reduced expected useful lifetime of such ASICs (as compared to commodity CPUs, GPUs, and RAM), and perhaps opportunistic use of chip foundries' production lines (without pre-committed due dates, volume, yield).
For Equihash in particular, while commodity RAM prices might in fact be near optimal for the market (both per GB and per GB/s), there's nothing stopping a hypothetical Equihash compute ASIC (which might be more cost-effective than a CPU, GPU, or FPGA in terms of production and/or electricity costs) from interfacing to commodity RAM chips (and potentially doing so more efficiently for the task than a commodity CPU or GPU would, which are generally limited to fetching entire cache lines worth of data). There's also nothing preventing design and production of RAM interfaces and chips most optimal for the task and algorithm chosen (just the right burst length, layout, etc.). Then there can be dual-die designs combining these two, like we're already seeing in some commodity chips.
I am no ASIC expert, but I consulted with some for some of the following. There is one thing sort of (but not really) preventing production of combined single-die compute and memory ASICs for Equihash: compute dies (such as in CPUs, GPUs, and FPGAs) vs. DRAM ones are currently produced using different processes (for good reasons). There are three ways around this.
First, for Equihash parameters requiring little enough memory, SRAM may (eventually) be viable. Current software implementations (as submitted to Zcash's Challenge) require at least 144 MiB of RAM to efficiently find almost all solutions at Zcash's current Equihash parameters (n=200, k=9). Limited to 128 MiB, a modified implementation can find 97% of solutions with no other efficiency loss, which is a better tradeoff point when using the scarce SRAM. (These figures are peak memory usage by one Equihash instance. Slightly better amortized memory usage per instance might be achievable in a larger and more complex multi-instance circuit, but that's likely impractical.) Current largest commodity CPUs include over 64 MiB of SRAM. (For example, E7-8890 v4's caches are 67.5 MiB combined, and that's not including cache tags, TLBs, and other memories. Ditto for engineering samples of a cancelled 24-core flavor of E5-26xx v4, which happen to be available on eBay. These CPUs are very expensive, but their prices are based largely on what this niche market will bear and they do not necessarily reflect production costs.) With a specific goal to maximize the amount of SRAM per die, up to 512 MiB may be feasible in a current 16nm process, although the optimum will likely be way less than that as we also need a high number of ports, routing, and the Equihash solver logic. The yield of usable huge dies like this would be unacceptably low for most purposes, but cryptocurrency mining may tolerate a lot of errors, thereby improving yield of usable dies to an acceptable level.
In a similar spirit, a network of already designed many-core CPUs such as Epiphany V could be used. (Incidentally, Epiphany V has 64 MiB of SRAM per chip and interconnect to other such chips. As experience with smaller Epiphany chips shows, at this core count and SRAM size most dies are expected to have some faulty SRAM in a few of their cores. Luckily, as discussed above, this is acceptable for this application, and it may actually provide cost savings through use of otherwise defective dies.) These wouldn't literally be ASICs, but they would also not be commodity hardware that people acquire anyway and readily have. From the Equihash designers' perspective, this would be a win: Equihash ASICs would resemble, or even be, the largest CPUs.
Second, eDRAM could be used. CPUs with 128 MiB of eDRAM already exist, although so far they put this eDRAM on a separate die in the same package rather than on the same die. The largest on-die eDRAMs appear to be POWER8's 96 MiB L3 cache, and soon POWER9's 120 MiB. At this time, with only IBM getting close, this seems impractical for mere cryptocurrency ASICs, albeit technically possible in the foreseeable future. Similarly to SRAM, several times larger sizes should be feasible in a chip that doesn't also try to be a general-purpose CPU. On a POWER8 die, its 96 MiB of eDRAM along with related interconnect appears to occupy less than one third of the area, in a 22nm process.
Third, the way an Equihash solver algorithm has been expressed for efficient software implementation, it also looks hardware implementation friendly and simple enough to implement on a DRAM fabrication process, along with the DRAM itself. Compared to discrete compute ASICs interfacing to DRAM dies, there will likely be a significant die area increase for the logic and the (small) SRAMs, but the clock rate would not necessarily be lower (since compute ASICs would likely face other bottlenecks before unleashing their full higher clock rate potential: heat dissipation and memory bandwidth). Again I am no expert, but the prospect of Zcash mining on dies produced on a DRAM process looks promising to me now.
Overall, I expect that if/once there is sufficient perceived incentive, custom Zcash mining hardware will appear. It is currently unclear which one(s) of the possible forms it will take. My guesstimate for the current Equihash parameters and current technology is a factor of 10 to 100 improvement in energy efficiency and hardware cost over the most suitable commodity hardware. This might eventually put Zcash on par with Litecoin (but not Bitcoin) in terms of ASIC mining, although ASIC design complexity and required investment will likely be way higher than for Litecoin.
Claim 3: "We also think it is unlikely that there will be any major optimizations of Equihash which would give the miners who know the optimization an advantage. This is because the Generalized Birthday Problem has been widely studied by computer scientists and cryptographers, and Equihash is close to the Generalized Birthday Problem. That is: it looks like a successful optimization of Equihash would be likely also an optimization of the Generalized Birthday Problem."
Worded as it is, this claim is actually correct in that Equihash is merely "close" to the Generalized Birthday Problem (GBP) but isn't exactly it. There are multiple subtle yet important differences between the different problems involved.
As a historical curiosity, what David Wagner's paper that introduced the GBP in 2002 calls the "classic" birthday problem is actually already a generalization to 2 lists (or types). The GBP is a further generalization of it.
Equihash differs from the GBP in two important ways.
First, Equihash preserves the GBP's further generalization (to 2^k-XOR), yet it reverts the previous unmentioned generalization (back to one list).
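To make the difference concrete, here is my rough paraphrase of the two problems:

    GBP (Wagner): given k lists L_1, ..., L_k of n-bit strings, find x_1 from L_1, ...,
    x_k from L_k such that x_1 XOR x_2 XOR ... XOR x_k = 0.

    Equihash: given a single list (of hash outputs, identified by their indices), find
    2^k distinct indices i_1, ..., i_{2^k} such that the corresponding hash outputs
    XOR to zero, subject to the extra constraints discussed below.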
Strictly speaking, because of this important difference, Equihash's presumed lack of much further optimization potential does not follow solely from the years of studies on the GBP. The space complexities of the two problems are very different, and even similar optimizations may have (and in at least one known case do have) very different effects on them.
Second, Equihash explicitly introduces an extra constraint, which it calls "algorithm binding". For typical parameters, this makes most solutions to the GBP invalid for Equihash, leaving only a handful valid solutions: those that a straightforward revision of Wagner's algorithm produces. (Specifically, about 1.879 solutions on average are expected per Equihash proof invocation with Zcash's current parameters n=200, k=9. Curiously, optimal Zcash miners may choose to find very slightly fewer than all valid solutions, in order to maximize their solution throughput and/or keep their peak memory usage lower.) This extra constraint is required to prevent so-called amortization attacks, where a non-Wagner algorithm could produce a large number of solutions at a lower amortized cost (per solution). The paper specifically acknowledges that, if it were not for algorithm binding, Gaussian elimination could provide a lower amortized cost for parameters like Zcash's n=200, k=9.
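Roughly, denoting by X_i the n-bit hash output for index i, a valid solution i_1, ..., i_{2^k} must satisfy (this is my plain-text paraphrase of the conditions from the Equihash paper and the Zcash protocol specification, not a substitute for them):

    X_{i_1} XOR ... XOR X_{i_{2^k}} = 0

    and, for every level r = 1 .. k-1 and every aligned group of 2^r adjacent leaves,
    the XOR of that group's X values has its first r*n/(k+1) bits equal to zero, with
    the leaf indices arranged so that the first index of each left half is smaller than
    the first index of the corresponding right half.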
While much needed in Equihash, this extra constraint could invite optimizations that were not relevant and thus were not studied for the original GBP. "Can you produce all/any GBP solutions cheaper?" (and the answer is actually known to be yes for some parameters) vs. "Can you produce these specific expected 2 or so solutions faster?" (a new problem). Note that "cheaper" and "faster" are not the same: it's amortized cost vs. latency. Luckily, it is easy to see that finding the specific few solutions faster would also amount to solving the GBP faster, if (for the purpose of this argument) we disregard Equihash's first difference from the GBP mentioned above. Thus, this second difference is not an optimization potential concern. On the contrary, it's Equihash being hardened as compared to the GBP.
Finally, current optimized Equihash solvers do contain some optimizations that don't appear to have been publicly known for Equihash when Zcash's blog post was made. First, they store intermediate results in a manner much more compact (binary tree of index pairs) than what the Equihash paper implied (growing lists of index pairs). This reduces the big-O space complexity by a factor of 2^k/k. While an equivalent approach is mentioned in Wagner's paper on the GBP, on its own it does not appear to have changed the GBP's space complexity by more than a small constant factor. (It appears that to gain much or any advantage from this approach, it'd take a further (maybe novel?) optimization of trimming all of the previous lists from unreferenced pairs, as well as from those no longer indirectly referenced at the current algorithm step. And if something simple like this is in fact novel, it'd show that the GBP wasn't studied all that much, after all.) Second, they identify collisions with incomplete bucket sort that does not ever output the fully sorted lists, whereas Wagner's algorithm is normally expressed in terms of a complete list sort operation. This does not change the big-O's and appears to be similarly applicable to the GBP. Arguably, these optimizations do not invalidate the wording of Claim 3 yet: one is applicable to both Equihash and the GBP even though on its own it's only "major" for Equihash, and the other is applicable to both problems and is not "major". However, I'd say the first one invalidates the spirit of Claim 3. Besides, there's a difference in what would be regarded a major optimization in academia (an improvement in big-O complexities) and mining (also a significant change in a constant factor).
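To illustrate the first of these optimizations, here's a minimal sketch of my own (not code from any of the actual solvers, with all of their packing and trimming omitted): each entry produced at step r is recorded merely as a pair of references to entries of step r-1, and the 2^k leaf indices are only expanded for final candidate solutions.

    #include <cstdint>
    #include <vector>

    // One entry produced at step r: which two entries of step r-1 collided.
    struct Pair {
        uint32_t left, right;   // positions in the previous step's array
    };

    // rows[r] holds the entries of step r (r = 1 .. k); step 0 entries are
    // the initial hashes themselves, identified by their initial-list index.
    using Tree = std::vector<std::vector<Pair>>;

    // Recursively expand entry e of step r into its 2^r leaf indices.
    static void expand(const Tree &rows, unsigned r, uint32_t e,
                       std::vector<uint32_t> &out)
    {
        if (r == 0) {           // a leaf: an index into the initial list
            out.push_back(e);
            return;
        }
        const Pair &p = rows[r][e];
        expand(rows, r - 1, p.left, out);
        expand(rows, r - 1, p.right, out);
    }

Each entry costs two references regardless of the step number, whereas literally keeping a per-entry list of its leaf indices would double that list's size at each step.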
Yes and no. On one hand, memory bandwidth is a limit that an optimized Equihash solver should bump into. On the other hand, zcashd's default and the Equihash designers' reference implementations' performance figures suggest that these implementations are very far from bumping into the memory bandwidth. The optimized Equihash solvers that are now appearing show decent cache locality on CPUs at least for Zcash's current n=200, k=9, and it remains to be seen whether they also bump into the bandwidth to RAM or not. The far from perfect (yet very good) scaling when going from 1 CPU core in use to multiple or all cores in use suggests that a shared resource is in fact being exhausted, but significant further scaling when making use of SMT ("Hyperthreading") suggests it might not be RAM bandwidth yet (but possibly the concurrent shared cache or/and RAM accesses increasing each other's latency, which then needs to be hidden through use of the extra hardware threads - or with code changes). This needs more thorough analysis, such as calculating expected memory bandwidth usage (separately for caches and RAM) from knowledge of specific algorithms and their actual performance, as well as use of CPUs' hardware performance counters to confirm this understanding, before any conclusions can be drawn (and even then they would necessarily apply to the specific implementations only).
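As an example of the kind of estimate I mean, with made-up round numbers purely for illustration: if a solver run moved on the order of 1 GB of data to and from RAM (several passes over lists of tens to a hundred megabytes) and a core completed 2 runs per second, that would be only about 2 GB/s of RAM bandwidth per core, leaving even several cores well below a current CPU's total RAM bandwidth, and whatever bottleneck is observed in practice would then have to be elsewhere (cache throughput, access latencies, etc.)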
The Equihash paper also cites sorting performance on and theoretical peak memory bandwidth of a specific obsolete GPU (NVIDIA GTX 480), since that's what another research paper provides results on, and uses that to draw conclusions about Equihash performance on ASICs. This logic is flawed at least in that we don't actually know whether the sorting implementation bumped into that GPU card's theoretical peak memory bandwidth or (quite likely) into some other bottleneck first.
An actual limit may be access frequency, or bandwidth for an implementation's optimal size accesses, to distributed memory, which on CPUs could mean their cache hierarchy and RAM, and on more specialized devices (such as universal many-core HPC chips or Equihash-specific ASICs) it could mean the cores' local memories accessed via a network on chip (NoC) as well as through interconnects between chips in a network. Latency of such accesses would play a role, too, since the higher the latency, the more accesses need to be in-flight to hide its effect on throughput, and this increases each core's local memory needs (for the queue). Overall, it's a complicated combination of count and throughput of ports to local and differently distant memories, their different latencies, and sizes of local memories needed to accommodate the latencies.
One of the goals of Zcash was to let everyone mine on whatever modern commodity hardware people readily have, even on smartphones, possibly while also making other use of the hardware. While I was and still am skeptical about smartphone mining, it was nevertheless a goal and it limited Equihash solver's memory usage. Additionally, it happened that even the sufficiently low-memory parameters n=144, k=5 resulted in the solver taking too long to complete (about 50 seconds with the inefficient algorithm and code available at the time) given the target block interval (2.5 minutes). This prompted Zcash to switch to the current n=200, k=9 parameters, which reduced the (inefficient) zcashd solver's run time by about a factor of 1.5 (to about 35 seconds), and at the time this happened to make all the difference for the mining on testnet to start working correctly. The memory requirements of zcashd's solver have stayed roughly the same with this change. So far so good.
Yet per the computational complexity formula given in the Equihash paper, this parameters change should have provided a 9.6x speedup. Moreover, as it turned out, Equihash designers' own reference implementation, when patched slightly to support the n=200, k=9 parameters (which it didn't as-is), would now outperform zcashd's (previously tuned on n=144, k=5) by a factor of 3 or so. Luckily, with all the community interest and contributions, and with Zcash company sponsoring the Challenge with $30,000 in prizes, speeds went way up for both parameter sets, all the way into multiple solutions per second per CPU core for n=200, k=9. Thus, at this time both parameter sets are suitable with respect to the target block interval.
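Presumably the comparison is along these lines: taking the cost of a Wagner's algorithm run as roughly proportional to (k+1) * 2^(n/(k+1)) (the initial list size times the number of collision steps, up to a constant factor), we get

    n=144, k=5: 6 * 2^24
    n=200, k=9: 10 * 2^20
    (6 * 2^24) / (10 * 2^20) = 9.6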
Two of the Equihash solvers contributed by the community need as little as 178 MB and 144 MiB = ~151 MB, respectively, when still finding almost all of the valid solutions for n=200, k=9. This is understandable since while the initial list size for n=144, k=5 was 604 MB, for n=200, k=9 it is only 52 MB. It's the initial list size (times a small factor) that provides a (likely) lower bound on efficient solvers' memory requirements.
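These initial list sizes follow directly from the parameters: the list holds 2^(n/(k+1)+1) hash outputs of n bits each (not counting per-entry indices and other overhead):

    n=144, k=5: 2^25 entries * 18 bytes = 604 MB
    n=200, k=9: 2^21 entries * 25 bytes = 52 MB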
In September, I recommended to Zcash that they revert to the n=144, k=5 parameters (expecting that solver optimizations would eventually allow for that). Independently, the Equihash designers also made a similar recommendation, suggesting use of a lower k (such as 4 or 5). Possible reasons to prefer n=144, k=5 over n=200, k=9 include its much larger initial list size (604 MB vs. 52 MB, as given above) and thus the higher memory requirements of efficient solvers, which should translate into better ASIC resistance.
On the other hand, it has now been shown that n=200, k=9 allow for decent cache locality in CPU solvers. It remains to be seen how n=144, k=5 measures up in that respect. Giving up on cache locality adequate for current CPUs would be unfortunate since efficient algorithms rely on bucket sort's small lookups, and having those go to higher cache levels or to off-chip RAM would favor ASICs, which, unlike CPUs and GPUs, wouldn't have to fetch entire cache lines.
Zcash company has stated that a parameters change is possible post-launch (with a hard fork). Due to the recent Equihash solver optimizations, even much higher memory and computational complexity parameters are now technically within consideration (and might help discourage botnet mining for a while, which is a concern of many in the community). However, since a GPU mining community has already appeared around Zcash (whether it was initially desired or not), Zcash company might not want to make a parameter change that would alienate that community now. Rather, there exists an opinion that a parameter change may keep or bring CPUs and GPUs "in balance", whatever that means. It is possible that the former n=144, k=5 parameters would satisfy this aspect as well; at least they don't require more memory per instance than typical GPU cards currently have.
Since Zcash is not switching away from n=200, k=9 pre-launch anyway, I think it makes sense to take our time and see how efficient CPU and GPU implementations become for both of these parameter sets, although unfortunately there's far greater incentive to optimize for n=200, k=9, so the comparison isn't fair.
Finally, the threat of a parameters change might discourage investment in ASICs or/and reduce the amount of specialization and thus the efficiency gains in (early) Zcash ASICs. On the other hand, if too specialized ASICs are nevertheless produced and rendered unsuitable with a parameters change, that might encourage a third-party fork of Zcash.
As I have mentioned above, it is important to enforce Equihash's algorithm binding constraint. It is even more important to validate solutions in other aspects, including for lack of duplicate indices.
In the Zcash source tree, all of this appears to be enforced in zcash/src/crypto/equihash.cpp: IsValidSolution(), but I have not audited it thoroughly. With the abstraction layers in place, I wouldn't be confident of its correct rejection of subtly invalid solutions without having stress-tested it on many inputs falling into the different categories. There are a few test vectors for lack of algorithm binding and duplicate indices, but they cover some easy to generate special cases only and should be improved.
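For illustration of the kinds of checks involved (this is my own sketch, not the zcashd code, and it omits the intermediate-XOR leading-zero-bits part of algorithm binding), rejecting duplicate indices and enforcing the index-ordering part of the constraint on a solution's expanded index list could look like:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Reject solutions where any initial-list index appears more than once.
    static bool NoDuplicateIndices(std::vector<uint32_t> indices)
    {
        std::sort(indices.begin(), indices.end());
        return std::adjacent_find(indices.begin(), indices.end()) == indices.end();
    }

    // Index-ordering part of algorithm binding: at every level, the first leaf
    // index of the left half must be smaller than that of the right half.
    static bool IndicesOrdered(const std::vector<uint32_t> &indices)
    {
        for (size_t width = indices.size(); width > 1; width /= 2)
            for (size_t i = 0; i + width <= indices.size(); i += width)
                if (indices[i] >= indices[i + width / 2])
                    return false;
        return true;
    }

A stress test would feed IsValidSolution() many otherwise valid solutions failing exactly one of these conditions (and, separately, the leading-zero-bits conditions) and confirm that they are all rejected.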
Another concern is third-party Zcash software. Unfortunately, lack of or buggy verification of the algorithm binding constraint is likely to go unnoticed until someone attempts to submit such solutions to the network. When that happens, the network may go out of sync, effectively having a fork of Zcash that already includes some and allows future non-algorithm-bound, amortization friendly solutions. It's similar for potential lack of or bugs in the duplicate indices check, although the effect on properties of such an inadvertent Zcash fork would be different and, for complete lack of the check, so devastating that no one sane would want to continue with it.
Despite the drawbacks, concerns, and ongoing research identified above, Equihash may be a good and possibly the best choice for Zcash, depending on which properties of the PoW turn out to be actually important to users of Zcash (e.g., is embedding in Ethereum contracts important or not?).
Equihash parameters may optionally be tweaked to achieve multiple improvements at once.
I will be watching with great interest how the story unfolds, and I appreciate this opportunity to get involved.
I'd like to thank the following people for their direct or indirect help in researching this topic: aabc, Alex Biryukov, Andreas Olofsson, Ariel Gabizon, Bill Cox, Daira Hopwood, Dmitry Khovratovich, Hanshen Xiao, Jack Grigg, John Tromp, Ling Ren, Marc Bevand, xenoncat, Zooko Wilcox.
You may contact me via e-mail at