Responsible disclosure of an exploit in Succinct's SP1 zkVM, found in partnership with 3MI Labs and Aligned, which arises from the interaction of two distinct security vulnerabilities.
TL;DR: We found two security bugs that can be combined to perform an exploit in Succinct's SP1 zkVM, which allows you to generate false proofs. In severe cases, this could lead to loss of funds. This was found thanks to a collaboration between the top notch research 3MI Labs, Aligned, and LambdaClass. This is a different security bug from the one we informed previously in our blog and it highlights the importance of having multiple teams paying attention to security and working towards having simpler codebases. For a PoC of the exploit, see the following repo
LambdaClass and Fuzzing Labs will invest in further investigating critical security bugs in zkVMs. We believe that codebases have become too complex and over-engineered and this gives rise to lots of bugs. We think that the industry is at risk if we do not invest, add more eyes and simplify codebases. The industry has become complacent when it comes to security and is being pushed by business decisions to rush into production use, leaving aside these security issues, which could lead to very serious consequences. In this post, we analyze the case of SP1, but we think that all zkVM's codebases need to be simplified and follow the standards, lowering the attack surface. As mentioned, we will conduct a more thourough research on different zkVMs.
Introduction
We have seen in several engineering projects the development of long and complex codebases, with too many fearures and poor documentation and testing. Some people believe that having such codebases shows that you are smart, have excellent coding skills and given a lot of thought on everything. We think it otherwise: the proof of mastery lies in simplicity. Bugs will always happen in any project, but the chance of having critical bugs increases with codebase complexity and length in a nonlinear way: the longer and more complex, the more bugs and hard to predict behaviors you can have. During our analysis of zk virtual machines and proof systems, we found two security bugs that can be combined to produce an exploit allowing a malicious party to prove false statements. Basically, you could generate false proofs for programs and modify some public inputs, which, in a world were computation is to be verified on chain using this technology, could lead to several exploits and loss of funds.
From our point of view, these exploits arise from the complexity of the codebase, having many files with different constraints and the addition of several features and optimizations that bloat the codebase. We also believe that business decisions are trying to rush these systems into production, when we should still focus on improving their security and auditability. We think that more care needs to be taken when designing, developing and testing zk virtual machines that could be used in real world applications, especially when funds are at risk.
Description of the exploits
Two bugs were identified in the sp1-sdk
crate at version 3.4.0, the most recently published version at the time we started our analysis. We were able to exploit them to generate valid SP1 proofs of incorrect execution of an arbitrary program, which results in universal forgeries of proofs for arbitrary statements, even incorrect ones.
Bug Descriptions
This report describes two bugs:
- SP1-2.1: Unconstrained
committed_value_digest
withoutCOMMIT
syscalls - SP1-2.2: Unchecked
next_pc
Condition in First Layer of Recursion Tree
SP1-2.1: Unconstrained committed_value_digest
without COMMIT
syscalls
When an SP1 executor executes a guest program, the COMMIT
syscalls(0x00_00_00_10
) resulting from calls to sp1_zkvm::io::commit()
are delayed to the end of the execution and emitted only when the main
function returns.
Since these syscalls are the only events that generate constraints on the committed_value_digest
of the ExecutionRecord
, this implies that if the execution of the program is halted before the return point of main()
, then the COMMIT
syscalls are never issued. As a result, the committed_value_digest
remains initialised to the all-zero value and is not constrained during proof generation.
This absence of constraints on the digest of committed values during the execution of the program raises further questions on the compatibility of the verifier code of sp1-sdk
with proofs of "arbitrary" program, but exploring this falls outside of the scope of this report.
SP1-2.2: Unchecked next_pc
Condition in First Layer of Recursion Tree
In the verifying code for the first layer of the recursion tree, the condition next_pc == 0
is not checked if a shard is indicated as containing a "complete" execution. However, this check is present in other recursion constraints (e.g., two-to-one proof compression, field-switch wrapping, proof-system switching).
Other checks exist that are performed on other code paths when is_complete
is asserted but are missing from the first layer of the recursion tree; these may be critical to proof soundness, but were not used in our exploit. These can be found in sp1_recursion_circuit::machine::complete::assert_complete
(for the recursive constraints) and sp1_prover::verify::verify
(for the plain, uncompressed verification). These checks include whether:
- at least one execution shard is present
- (execution) shard numbering is consecutive, and starts at one,
- the leaf challenger and the final reconstruct challenger match
- the deferred proof digests are consistent, and
- the cumulative sum is consistent.
Exploit Description
Initially, an attempt to exploit SP1-2.1 was made by making an explicit HALT
(0x00_00_00_00
) syscall within main()
.
While this is not the methodology that was ultimately used, we note that making such explicit HALT
syscalls within main might reasonably be passed off as an optimization within the guest program, e.g., by arguing that an early halt within main is a way to shorten a program's execution trace and therefore reduce its proof computation time. This would seem innocent enough, but SP1-2.1 would then imply that the digest of public values produced before the syscall would remain unconstrained.
However, it suffices to instead create a malicious SP1 executor which stops executing the guest program at an arbitrary pc
value. As long as the chosen pc
value happens before the return point of main()
, no COMMIT
syscall will have been produced by the virtual machine. In the exploit presented here, the proof forgery was produced by stopping execution as soon as the program reached the start of the main
function.
Before returning the resulting ExecutionRecord
, the malicious executor is then free to replace the committed_value_digest
of the public_values: PublicValues
field with the digest of an arbitrary value. This makes it seem as if the guest program had committed to it during its execution.
Then, an honest CoreProver
(sp1_prover::components::SP1ProverComponents::CoreProver
) is run to generate an SP1CoreProof
(sp1_prover::types::SP1CoreProof
) with the maliciously crafted ExecutionRecord
. Since there are no COMMIT
syscalls contained within the record, the altered committed_value_digest
, which does not match the digest of the values committed to by the program, does not cause the proof generation to fail. The CoreProver
therefore successfully creates an SP1CoreProof
containing two shards s1
and s2
.
Finally, a malicious SP1Prover
(sp1_prover::SP1Prover
) is created with the following two modifications:
- The second shard
s2
of theSP1CoreProof
is discarded. - The
is_complete
flag is set totrue
in theSP1RecursionWitnessValues
created from the remaining first shards1
. This recursion witness is then used when generating the recursion program for compressing theSP1CoreProof
with aCompressProver
.
Because the malicious executor stopped the guest program with a next_pc
value pointing to the start of main, the first shard s1
has a non-zero next_pc
value. However, since the is_complete
flag is true
, SP1-2.2 implies that the CompressProver
does not constrain the equality next_pc ==0
, resulting in an SP1ReduceProof
proof generated without errors. When de-serialized by an honest prover and submitted for verification, this SP1ReduceProof
then passes honest verification for the guest program.
Exploit demonstration
We accompanied the bug report with two artefacts that demonstrate how the two bugs can be exploited to provide valid proofs of invalid execution of arbitrary programs. See the repo for the code.
The is-prime directory contains the source files and compiled ELF version of a program(./is-prime/program
) which checks primality of a number read from sp1_zkvm::io
, together with a script(./is-prime/script
) which demonstrates that 42-is-prime.proof
(./is-prime/script/42-is-prime.proof
) is a valid proof that executing the program results in 42 being verified as a prime number.
The i-am-satoshi(./i-am-satoshi/
) directory contains a example of the transferability of this technique. Here, the guest program uses the independent bitcoin
crate to compute the Bitcoin address corresponding to the secret key given as input, and then commits to the resulting address with sp1_zkvm::io::commit()
. Running the corresponding verifier program
(./i-am-satoshi/script/src/main.rs
) reveals that the proof demonstrates knowledge of the secret key behind the "genesis Satoshi address" 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa
.
This proof of concept illustrates arbitrary statement proving for the SP1 verifier, by verifying knowledge of the secret key to the reward address in the bitcoin genesis block. To perform these exploits, the prover client needs to be modified locally.
Consequences and Limitations
While the first program presented in this exploit is innocent enough (42 is obviously not prime), the second one exemplifies more serious consequences.
Proof of ownership of any address on any blockchain can be forged and would be accepted by a naïve verifier. We draw the attention to the fact that the program itself was not modified. All of the modifications to create the proof forgery were performed locally to the proving client which implies that this forgery methodology is generalizable to arbitrary guest programs.
We note that SP1-2.2 is a bug that is restricted to the first layer of the recursion tree, and that subsequent recursive proving of the proof would fail, because the ShrinkProver
properly constrains the next_pc == 0
check when is_complete == true
.
Nonetheless, the honest verifier code(./is-prime/script/src/bin/verifier.rs
) is agnostic to the kind of proof that it deserializes which implies that the malicious prover is not forced to further reduce or wrap the forged proof before submitting it for verification to a system that runs the Rust verifier from the sp1-sdk
crate.
Possible Mitigations
Mitigating SP1-2.1
Mitigating SP1-2.1 requires making sure that the committed_value_digest
is constrained within the proof system, even if no COMMIT
syscalls are made.
Any implementation of this mitigation would conflict with the current implementation which assumes that committed_value_digest
is written to only once. A concrete proposal requires further exploration of the possibilities which is outside of the scope of this report.
Mitigating SP1-2.2
Ultimately, the code for the recursion program should be patched so that the next_pc == 0
constraint (and the other related constraints for complete programs) is applied by the CompressProver
in the first layer of the recursion tree, just like it is in the other recursion programs.
As a hot fix which would not be a breaking change to currently accepted verification keys, since the next_pc
value is accessible as part of a proof's
public values, the Rust code of the verifier should check that it is 0
, even if this isn't constrained within the proof system. This hot fix would further enable existing verifiers to check whether this bug was triggered by proofs that are still in storage.
Summary
Together with 3MI Labs and Aligned, we found two security bugs in Succinct's SP1 zkVM, and showed how to use them to perform an exploit to generate false proofs that an honest verifier would accept. Since these exploits do not require to change the code of the program and are done locally, they could be used without naïve verifiers even suspecting it is a malicious proof. We think that the complexity and length of the codebase, as well as unclear documentation contribute to the proliferation of bugs, and that we should be working harder on simplifying the codebases and on security, instead of rushing into production due to business concerns, especially when funds could be at risk.