Responsible disclosure of an exploit in Succinct's SP1 zkVM, found in partnership with 3MI Labs and Aligned, which arises from the interaction of two distinct security vulnerabilities.

TL;DR: We found two security bugs that can be combined to perform an exploit in Succinct's SP1 zkVM, which allows you to generate false proofs. In severe cases, this could lead to loss of funds. This was found thanks to a collaboration between the top notch research 3MI Labs, Aligned, and LambdaClass. This is a different security bug from the one we informed previously in our blog and it highlights the importance of having multiple teams paying attention to security and working towards having simpler codebases. For a PoC of the exploit, see the following repo

LambdaClass and Fuzzing Labs will invest in further investigating critical security bugs in zkVMs. We believe that codebases have become too complex and over-engineered and this gives rise to lots of bugs. We think that the industry is at risk if we do not invest, add more eyes and simplify codebases. The industry has become complacent when it comes to security and is being pushed by business decisions to rush into production use, leaving aside these security issues, which could lead to very serious consequences. In this post, we analyze the case of SP1, but we think that all zkVM's codebases need to be simplified and follow the standards, lowering the attack surface. As mentioned, we will conduct a more thourough research on different zkVMs.

Introduction

We have seen in several engineering projects the development of long and complex codebases, with too many fearures and poor documentation and testing. Some people believe that having such codebases shows that you are smart, have excellent coding skills and given a lot of thought on everything. We think it otherwise: the proof of mastery lies in simplicity. Bugs will always happen in any project, but the chance of having critical bugs increases with codebase complexity and length in a nonlinear way: the longer and more complex, the more bugs and hard to predict behaviors you can have. During our analysis of zk virtual machines and proof systems, we found two security bugs that can be combined to produce an exploit allowing a malicious party to prove false statements. Basically, you could generate false proofs for programs and modify some public inputs, which, in a world were computation is to be verified on chain using this technology, could lead to several exploits and loss of funds.

From our point of view, these exploits arise from the complexity of the codebase, having many files with different constraints and the addition of several features and optimizations that bloat the codebase. We also believe that business decisions are trying to rush these systems into production, when we should still focus on improving their security and auditability. We think that more care needs to be taken when designing, developing and testing zk virtual machines that could be used in real world applications, especially when funds are at risk.

Description of the exploits

Two bugs were identified in the sp1-sdk crate at version 3.4.0, the most recently published version at the time we started our analysis. We were able to exploit them to generate valid SP1 proofs of incorrect execution of an arbitrary program, which results in universal forgeries of proofs for arbitrary statements, even incorrect ones.

Bug Descriptions

This report describes two bugs:

SP1-2.1: Unconstrained committed_value_digest without COMMIT syscalls
SP1-2.2: Unchecked next_pc Condition in First Layer of Recursion Tree

SP1-2.1: Unconstrained `committed_value_digest` without `COMMIT` syscalls

When an SP1 executor executes a guest program, the COMMIT syscalls(0x00_00_00_10) resulting from calls to sp1_zkvm::io::commit() are delayed to the end of the execution and emitted only when the main function returns.
Since these syscalls are the only events that generate constraints on the committed_value_digest of the ExecutionRecord, this implies that if the execution of the program is halted before the return point of main(), then the COMMIT syscalls are never issued. As a result, the committed_value_digest remains initialised to the all-zero value and is not constrained during proof generation.

This absence of constraints on the digest of committed values during the execution of the program raises further questions on the compatibility of the verifier code of sp1-sdk with proofs of "arbitrary" program, but exploring this falls outside of the scope of this report.

SP1-2.2: Unchecked `next_pc` Condition in First Layer of Recursion Tree

In the verifying code for the first layer of the recursion tree, the condition next_pc == 0 is not checked if a shard is indicated as containing a "complete" execution. However, this check is present in other recursion constraints (e.g., two-to-one proof compression, field-switch wrapping, proof-system switching).

Other checks exist that are performed on other code paths when is_complete is asserted but are missing from the first layer of the recursion tree; these may be critical to proof soundness, but were not used in our exploit. These can be found in sp1_recursion_circuit::machine::complete::assert_complete (for the recursive constraints) and sp1_prover::verify::verify (for the plain, uncompressed verification). These checks include whether:

at least one execution shard is present
(execution) shard numbering is consecutive, and starts at one,
the leaf challenger and the final reconstruct challenger match
the deferred proof digests are consistent, and
the cumulative sum is consistent.

Exploit Description

Initially, an attempt to exploit SP1-2.1 was made by making an explicit HALT (0x00_00_00_00) syscall within main().
While this is not the methodology that was ultimately used, we note that making such explicit HALT syscalls within main might reasonably be passed off as an optimization within the guest program, e.g., by arguing that an early halt within main is a way to shorten a program's execution trace and therefore reduce its proof computation time. This would seem innocent enough, but SP1-2.1 would then imply that the digest of public values produced before the syscall would remain unconstrained.

However, it suffices to instead create a malicious SP1 executor which stops executing the guest program at an arbitrary pc value. As long as the chosen pc value happens before the return point of main(), no COMMIT syscall will have been produced by the virtual machine. In the exploit presented here, the proof forgery was produced by stopping execution as soon as the program reached the start of the main function.

Before returning the resulting ExecutionRecord, the malicious executor is then free to replace the committed_value_digest of the public_values: PublicValues field with the digest of an arbitrary value. This makes it seem as if the guest program had committed to it during its execution.

Then, an honest CoreProver (sp1_prover::components::SP1ProverComponents::CoreProver) is run to generate an SP1CoreProof (sp1_prover::types::SP1CoreProof) with the maliciously crafted ExecutionRecord. Since there are no COMMIT syscalls contained within the record, the altered committed_value_digest, which does not match the digest of the values committed to by the program, does not cause the proof generation to fail. The CoreProver therefore successfully creates an SP1CoreProof containing two shards s1 and s2.

Finally, a malicious SP1Prover (sp1_prover::SP1Prover) is created with the following two modifications:

The second shard s2 of the SP1CoreProof is discarded.
The is_complete flag is set to true in the SP1RecursionWitnessValues created from the remaining first shard s1. This recursion witness is then used when generating the recursion program for compressing the SP1CoreProof with a CompressProver.

Because the malicious executor stopped the guest program with a next_pc value pointing to the start of main, the first shard s1 has a non-zero next_pc value. However, since the is_complete flag is true, SP1-2.2 implies that the CompressProver does not constrain the equality next_pc ==0, resulting in an SP1ReduceProof proof generated without errors. When de-serialized by an honest prover and submitted for verification, this SP1ReduceProof then passes honest verification for the guest program.

Exploit demonstration

We accompanied the bug report with two artefacts that demonstrate how the two bugs can be exploited to provide valid proofs of invalid execution of arbitrary programs. See the repo for the code.

The is-prime directory contains the source files and compiled ELF version of a program(./is-prime/program) which checks primality of a number read from sp1_zkvm::io, together with a script(./is-prime/script) which demonstrates that 42-is-prime.proof(./is-prime/script/42-is-prime.proof) is a valid proof that executing the program results in 42 being verified as a prime number.

The i-am-satoshi(./i-am-satoshi/) directory contains a example of the transferability of this technique. Here, the guest program uses the independent bitcoin crate to compute the Bitcoin address corresponding to the secret key given as input, and then commits to the resulting address with sp1_zkvm::io::commit(). Running the corresponding verifier program(./i-am-satoshi/script/src/main.rs) reveals that the proof demonstrates knowledge of the secret key behind the "genesis Satoshi address" 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa.

This proof of concept illustrates arbitrary statement proving for the SP1 verifier, by verifying knowledge of the secret key to the reward address in the bitcoin genesis block. To perform these exploits, the prover client needs to be modified locally.

Consequences and Limitations

While the first program presented in this exploit is innocent enough (42 is obviously not prime), the second one exemplifies more serious consequences.
Proof of ownership of any address on any blockchain can be forged and would be accepted by a naïve verifier. We draw the attention to the fact that the program itself was not modified. All of the modifications to create the proof forgery were performed locally to the proving client which implies that this forgery methodology is generalizable to arbitrary guest programs.

We note that SP1-2.2 is a bug that is restricted to the first layer of the recursion tree, and that subsequent recursive proving of the proof would fail, because the ShrinkProver properly constrains the next_pc == 0 check when is_complete == true.
Nonetheless, the honest verifier code(./is-prime/script/src/bin/verifier.rs) is agnostic to the kind of proof that it deserializes which implies that the malicious prover is not forced to further reduce or wrap the forged proof before submitting it for verification to a system that runs the Rust verifier from the sp1-sdk crate.

Possible Mitigations

Mitigating SP1-2.1

Mitigating SP1-2.1 requires making sure that the committed_value_digest is constrained within the proof system, even if no COMMIT syscalls are made.

Any implementation of this mitigation would conflict with the current implementation which assumes that committed_value_digest is written to only once. A concrete proposal requires further exploration of the possibilities which is outside of the scope of this report.

Mitigating SP1-2.2

Ultimately, the code for the recursion program should be patched so that the next_pc == 0 constraint (and the other related constraints for complete programs) is applied by the CompressProver in the first layer of the recursion tree, just like it is in the other recursion programs.

As a hot fix which would not be a breaking change to currently accepted verification keys, since the next_pc value is accessible as part of a proof's
public values, the Rust code of the verifier should check that it is 0, even if this isn't constrained within the proof system. This hot fix would further enable existing verifiers to check whether this bug was triggered by proofs that are still in storage.

Summary

Together with 3MI Labs and Aligned, we found two security bugs in Succinct's SP1 zkVM, and showed how to use them to perform an exploit to generate false proofs that an honest verifier would accept. Since these exploits do not require to change the code of the program and are done locally, they could be used without naïve verifiers even suspecting it is a malicious proof. We think that the complexity and length of the codebase, as well as unclear documentation contribute to the proliferation of bugs, and that we should be working harder on simplifying the codebases and on security, instead of rushing into production due to business concerns, especially when funds could be at risk.