Don't trust, verify or why you should care about benchmarks

Recently, there has been quite a lot of debate between researchers and engineers on the best proof system. For example, Justin Thaler and Srinath Setty have been discussing whether FRI- or KZG-based SNARKs are better in computational terms, following some calculations by Eli Ben-Sasson during SBC. Before jumping into the details and our view on benchmarks, we want to say that we really like all the work developed by the authors, bringing new ideas and debates that can help expand and improve zero-knowledge schemes and their applications. We learned from all of them, but we think that we should have clearer criteria for what makes something more performant or useful in engineering terms. Besides, performance and suitability are sometimes application-dependent, as Zac Williamson pointed out in an exchange on X, indicating that SNARKs could be more advantageous in client-side proving.

Nowadays, on the performance side of things, three big strategies are being publicly discussed:

  • Folding schemes
  • Lookup singularity
  • STARKs with small fields

With time, some of these ideas might be combined. In the meantime, we need a way of analyzing their practical potential. We can use back-of-the-envelope calculations to analyze these different strategies and proving systems. But they are just estimations of the total number of operations, and as such should always be taken with a grain of salt. They may be useful to assess whether some system or algorithm could outperform another, but not as a final measure of performance. Something similar happens with asymptotic complexity: we know of algorithms that may be optimal from a complexity point of view but have no practical applications (the famous galactic algorithms). Besides, in engineering, problems are multidimensional and there is a lot of interaction between different parts.

There are constraints regarding memory, data communication, hardware acceleration, code maintainability, economics, etc. For example, memory access patterns can cause a program with fewer instructions to run slower if it isn't suited to caching, data prefetching and other memory optimizations. Complexity increases if we additionally have to consider the degree of parallelization of the algorithms and GPUs, and even more when we can distribute computation between many machines. An efficient algorithm that can only run on one machine may be worse in some scenarios than another one that is less efficient but can be distributed across multiple devices. This is, once again, really similar to what Zac has mentioned. There may be different criteria for selecting algorithms depending on the use case. Most of the time in software, multiple solutions to one problem are used depending on the scenario, and even mixed together when required. To think that we already have a grand solution for all problems, one that is optimal in all scenarios, may be underestimating the complexities of the applied world. There are claims about the number of operations that do not take into account the constraints imposed by hardware, or that count operations using special field families which are not applicable to the kind of elliptic curve chosen. For example, commonly used pairing-friendly elliptic curves are defined over primes that don't have the same type of efficient arithmetic as Mersenne primes or the "MiniGoldilocks" prime.
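To make the point about special field families concrete, here is a minimal sketch of why a Mersenne prime is cheap to work with. It uses the prime 2^31 - 1 purely as an illustration, and the function names are ours, not taken from any library. Reduction needs only shifts, masks and additions, whereas the primes behind pairing-friendly curves have no such structure and implementations typically fall back on general techniques such as Montgomery or Barrett reduction.

```python
# Illustrative sketch: modular reduction for the Mersenne prime 2^31 - 1.
# The names below are hypothetical helpers, not part of any library.
P_MERSENNE = (1 << 31) - 1


def reduce_mersenne31(x: int) -> int:
    """Reduce 0 <= x < 2^62 modulo 2^31 - 1 without a division.

    Because 2^31 is congruent to 1 modulo p, writing x = hi * 2^31 + lo
    gives x congruent to hi + lo. Two folds bring any product of two
    field elements down to the range [0, p].
    """
    x = (x & P_MERSENNE) + (x >> 31)  # first fold: result < 2^32
    x = (x & P_MERSENNE) + (x >> 31)  # second fold: result <= p
    return 0 if x == P_MERSENNE else x


def mul_mersenne31(a: int, b: int) -> int:
    """Multiply two field elements (each in [0, p)) and reduce."""
    return reduce_mersenne31(a * b)


def mul_generic(a: int, b: int, p: int) -> int:
    """Baseline for an arbitrary prime: a full modular reduction."""
    return (a * b) % p
```

An operation count that assumes the first kind of reduction does not carry over to a system whose field is forced by the choice of elliptic curve.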

Another example of the complexity of real engineering systems can be seen, from our point of view, in this tweet by Thaler. He asked why Starkware continues to use a rather large finite field despite it not offering any advantages over smaller ones. The reason is quite simple: SHARP was developed before many of these improvements and has been in production for many years. Moreover, production-ready software needs more than a prover. We need languages, compilers, VMs, tools for developers and sequencers for blockchains. There is a lot of work, and rushing to improve the prover with each possible upgrade, on a system that's in production and secures a lot of value, may be reckless. From a brilliant idea on paper to a production-ready system there is a lot of engineering work, and we always find many more difficulties along the way that were not originally considered or would have been difficult to foresee.

Critical analysis, with measurements and a good understanding of the possible solutions, is key. We have seen claims such as that STARKs use over 100 GB of RAM for small programs. It's not clear what the criteria of comparison are or how many GB the alternatives would use. It is important to take advantage of open source software and play with the tools developed by others, to check whether they work as stated and to corroborate the numbers.

We think that Nova and Lasso bring interesting ideas, which can spark new solutions for other proof systems. We wrote a post on Nova and we plan to have one on Jolt and Lasso. We even had discussions on whether we could adapt some of the ideas behind them to a STARK prover. Folding schemes such as Nova can help solve many problems related to SNARKs based on Plonkish or R1CS arithmetization. In the case of the Cairo prover, there is a strategy that zips the constraints. The Cairo AIR contains the constraints for all the instructions of a Turing-complete virtual machine. The number of constraints does not change with the computation size, as opposed to the execution trace, which grows linearly with the size of the program. The trace is then interpolated and the constraints are enforced via quotients. So, the relevant measure here is the number of steps of the program and not the number of constraints. Fair measurements should be conducted over some commonly used calculations or transactions, for example, an ERC-20 contract. We should also be careful not to see speed in a single task as the only thing that matters. Clean codebases that are easy to maintain and update, robustness, security, memory use, and auditability are also factors to take into account.
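The distinction between a fixed set of constraints and a growing trace can be sketched with a toy example. This is not the Cairo AIR: it is a hypothetical one-column Fibonacci computation over a small prime field chosen only for illustration, with names of our own. It does show the same shape, though: the constraint set is written once, and only the trace grows with the number of steps.

```python
# Toy AIR-style sketch (not the Cairo AIR): a one-column Fibonacci trace.
P = (1 << 31) - 1  # small prime field, chosen only for illustration

# The constraint set is fixed: two boundary values and one transition rule.
BOUNDARY_CONSTRAINTS = {0: 1, 1: 1}  # row index -> required value
TRANSITION_CONSTRAINTS = [
    # Must evaluate to zero on every window of three consecutive rows.
    lambda cur, nxt, nxt2: (nxt2 - nxt - cur) % P,
]
NUM_CONSTRAINTS = len(BOUNDARY_CONSTRAINTS) + len(TRANSITION_CONSTRAINTS)


def fibonacci_trace(num_steps: int) -> list[int]:
    """Build an execution trace with one row per step."""
    if num_steps < 2:
        raise ValueError("the toy trace needs at least two rows")
    trace = [1, 1]
    while len(trace) < num_steps:
        trace.append((trace[-1] + trace[-2]) % P)
    return trace


def satisfies_air(trace: list[int]) -> bool:
    """Check the same fixed constraints against a trace of any length."""
    boundary_ok = all(trace[row] == value
                      for row, value in BOUNDARY_CONSTRAINTS.items())
    transition_ok = all(
        constraint(trace[i], trace[i + 1], trace[i + 2]) == 0
        for i in range(len(trace) - 2)
        for constraint in TRANSITION_CONSTRAINTS
    )
    return boundary_ok and transition_ok
```

Calling `fibonacci_trace(8)` and `fibonacci_trace(1024)` gives traces of very different sizes that are checked by the same three constraints, which is why the number of steps is the quantity worth measuring in this setting.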

We like the work done in the benchmarks by Celer Network, which tries to give a fair comparison between different proof systems using SHA-256 as an example circuit. That said, we always have to keep in mind that it can become tempting for a project or a particular team to over-optimize its codebase for a particular benchmark. It's good to see that the Celer benchmark points out, however, that it is quite difficult to establish a comparison for Nova. As they mention: "It’s important to recognize that Nova cannot be directly compared with other frameworks in terms of time and computation. This uniqueness stems from the incremental computing capabilities enabled by Nova. To put it simply, breaking down the entire computation into more detailed steps naturally leads to a decrease in memory consumption, even though it may cause an increase in computation time." We point out that some of the proof systems are not fully optimized, and that could change the trends. The memory vs. speed trade-off may be convenient for some use cases, but not for others.
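As a small illustration of why a benchmark should report more than one number, the sketch below measures both wall-clock time and peak memory for two ways of hashing the same data with SHA-256. It is not a proof system and says nothing about Nova itself; the workload and function names are our own assumptions. It only shows that a computation processed step by step can have a much lower memory peak than one that materializes everything up front, so a timing figure alone hides part of the picture.

```python
# Illustrative measurement harness using only the Python standard library.
import hashlib
import time
import tracemalloc


def hash_all_at_once(num_blocks: int, block_size: int = 1024) -> str:
    """Materialize every input block first, then hash them."""
    blocks = [bytes([i % 256]) * block_size for i in range(num_blocks)]
    hasher = hashlib.sha256()
    for block in blocks:
        hasher.update(block)
    return hasher.hexdigest()


def hash_incrementally(num_blocks: int, block_size: int = 1024) -> str:
    """Generate and hash one block at a time, keeping little in memory."""
    hasher = hashlib.sha256()
    for i in range(num_blocks):
        hasher.update(bytes([i % 256]) * block_size)
    return hasher.hexdigest()


def measure(fn, *args):
    """Return (result, elapsed seconds, peak traced bytes) for fn(*args)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    for fn in (hash_all_at_once, hash_incrementally):
        digest, seconds, peak = measure(fn, 2000)
        print(f"{fn.__name__}: {seconds:.4f} s, peak {peak} bytes, {digest[:16]}")
```

Both functions produce the same digest, so the only differences a benchmark can report are in time and memory, and which of the two matters more depends on the use case.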

Another point worth noting is that some people tend to add constraints that do not exist in practice, or tend to generalize the strategies that one company uses to all other possible implementations. For example, if A uses Poseidon as its hash function, they assume that B, C and D should also use Poseidon, even though that may not fit their particular application. In a recursive environment, we can prove with a SNARK that we verified a STARK proof, which has a lot of use cases. Of course, if we have a tree of recursive proofs of verifications, there is no problem in using a faster hash function, such as Blake2, for the leaves, and then proving in the second layer, with Poseidon or another hash, that we verified proofs that used Blake2.

We think that we should have clear benchmarks, with code used in production. There are, of course, new technologies or ideas that may be promising and that we should explore, but we should never be too hasty to jump into the next boat, especially when users' assets or privacy are at stake. We will be implementing the different proving systems in the Lambdaworks library, so that anyone can run the benches easily and check which one suits them best. Moreover, if there are optimizations for any of the systems, anyone can submit a PR to improve them. We are not maximalists on any proof system; what we want is for this technology to succeed and to develop applications on top of it. If a particular system works better, we will learn it and work with it.

We think that debate and having different points of view are important to bring new ideas and improvements to the table, from which we can all benefit. Having open source code, and not only papers, available to tweak, analyze, and play with is crucial to be able to compare proving systems. Starkware just open sourced its battle-tested Stone prover, and this will help a lot in making improvements and comparing strategies. We also really like initiatives such as ZPrize, where teams propose open source optimizations to common problems in zero-knowledge proofs. This can give us the opportunity to explore different strategies and arrive at the algorithms that work best in practice.