RVV (RISC-V V Extension): Key to Efficient On-Chain Cryptography

All cryptographic algorithms on CKB (Common Knowledge Base) are executed in its virtual machine, CKB-VM. This gives CKB the power and flexibility to offer developers more possibilities. But some complex cryptographic algorithms, such as zero-knowledge proofs, are currently too expensive to fit within the cycles limit and/or the block interval, even though they can run natively in CKB-VM. Should we simply increase the cycles limit? Or are there better solutions?

The subject of this article is vectorization in CKB-VM, implemented through the RISC-V V (Vector) Extension, or RVV for short. Vectorization is a parallel computing technique whose core idea is to process multiple data elements with a single instruction. It is increasingly used for high performance in 3D games, neural networks, and many other specialized programs.

Cryptape has been building a high-performance RVV implementation for CKB-VM. The workload of implementing the RVV instructions has been unprecedented, surpassing that of all previous CKB-VM instructions. Despite all the challenges, we’ve made significant progress. We are excited to share our thoughts with you and provide a brief overview of our work.

Introduction to RVV in CKB-VM

RISC-V consists of two parts:

  • Core ISA
  • Standard extensions

The core instruction set performs the basic arithmetic and logic operations necessary for general-purpose computation, and its commonly used combination is abbreviated as “G”. The standard extensions are optional, depending on design goals.

RVV contains the following elements:

  • Vector Register: the RISC-V vector instruction set specifies 32 vector registers.
  • Vector Register length (VLEN): CKB-VM specifies a VLEN of 2048 bits, meaning that a vector register can store up to 256 uint8, 128 uint16, 64 uint32, 32 uint64, 16 uint128, 8 uint256, 4 uint512 or 2 uint1024.
  • Vector Length (VL): the number of elements involved in an operation (see the worked example after this list).
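
To see how VL relates to these parameters, here is a quick worked sketch following the formula in the RVV spec (assuming LMUL = 1): vsetvli caps VL at VLMAX = VLEN / SEW for the chosen element width.

// VLMAX = (VLEN / SEW) * LMUL; CKB-VM specifies VLEN = 2048 bits, and we assume LMUL = 1 here.
const VLEN: usize = 2048;

fn vlmax(sew: usize) -> usize {
    VLEN / sew
}

// vlmax(256) == 8  -> a register holds up to eight uint256 elements
// vlmax(64) == 32  -> a register holds up to thirty-two uint64 elements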

Here we use a simple example of uint256 addition vectorization to illustrate the above parameters. With RVV we can utilize vector registers v0 and v1. Each register stores two uint256 elements to be added, so the VL is 2.

The operation we’re going to do is: v2 = v0 + v1, i.e. each uint256 element in v2 is the sum of the corresponding uint256 elements in v0 and v1.

The uint256 addition example is illustrated below:

    |<----------------------------- 2048 bit ------------------------------>|
    |<------ 256 bit ------>|<------ 256 bit ------>|<------ 1536 bit ----->|
V0: |       uint256.0       |       uint256.1       | ... ... ... ... ... . | VL = 2
V1: |       uint256.2       |       uint256.3       | ... ... ... ... ... . | VL = 2
V2: | uint256.0 + uint256.2 | uint256.1 + uint256.3 | ... ... ... ... ... . | VL = 2

From Register to Code

Imagine that one day your boss asks you to write a contract that adds uint256 arrays. He promises a bonus if you can reduce the computation cycles spent in CKB-VM (because that would save him a lot in transaction fees!). You have three uint256 arrays of equal length in CKB-VM memory, A, B, and C, and you need to compute C[i] = A[i] + B[i] for each element.

That’s the task, and you know RVV. So, how can you solve the problem and get the bonus?

This problem differs from the first simple example above in two aspects:

  • Uint256 arrays are stored in memory.
  • The length of the arrays is not known in advance.

Because a vector register can store at most eight uint256 values, we load up to eight uint256 values at a time from the memory of arrays A and B into vector registers, add the two registers, and write the result back to the memory of array C.

This programming paradigm has an interesting name: strip-mining. In this metaphor, the data you are processing in memory is the mine, and you use a truck to haul one carload of minerals from the mine at a time. You go back and forth until the mine is exhausted; note that the last carload may not be full.
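
To make the loop structure concrete before switching to RVV, here is a minimal scalar sketch in Rust (a hypothetical helper, not the contract code below): each trip of the loop processes a chunk of at most eight uint256 values, and only the last chunk may be shorter.

// Scalar sketch of the strip-mining loop shape; with RVV, each chunk below
// becomes one vle256.v / vadd.vv / vse256.v sequence instead of a scalar loop.
const LANES: usize = 8; // a 2048-bit register holds at most eight uint256 values

fn strip_mine_add(a: &[[u64; 4]], b: &[[u64; 4]], c: &mut [[u64; 4]]) {
    let mut i = 0;
    while i < a.len() {
        // "One carload": at most LANES elements, fewer on the last trip.
        let vl = LANES.min(a.len() - i);
        for j in i..i + vl {
            c[j] = add_u256(&a[j], &b[j]);
        }
        i += vl;
    }
}

// Schoolbook 256-bit addition over four u64 limbs, propagating the carry
// from the least significant limb upward.
fn add_u256(x: &[u64; 4], y: &[u64; 4]) -> [u64; 4] {
    let mut out = [0u64; 4];
    let mut carry = 0u64;
    for k in 0..4 {
        let (s1, c1) = x[k].overflowing_add(y[k]);
        let (s2, c2) = s1.overflowing_add(carry);
        out[k] = s2;
        carry = (c1 as u64) + (c2 as u64);
    }
    out
}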

We use four consecutive uint64 values to represent a uint256. In the following Rust code, we import an external function, vadd_vv, that adds the uint256 elements one by one and writes back the results. The hand-written assembly implementation of vadd_vv, given after the Rust code, is a typical strip-mining program.

#![no_std]
#![no_main]
#![feature(asm)]
#![feature(lang_items)]

extern "C" {
    // n: Length of the array
    // a: Array A
    // b: Array B
    // c: Array C, Store the results of A + B
    fn vadd_vv(n: usize, a: *const u64, b: *const u64, c: *mut u64);
}

fn exit(code: i8) -> ! {
    unsafe {
        asm!("mv a0, {0}",
             "li a7, 93",
             "ecall",
             in(reg) code,
        )
    }
    loop {}
}

#[panic_handler]
fn panic_handler(_: &core::panic::PanicInfo) -> ! {
    exit(-128);
}

#[lang = "eh_personality"]
extern "C" fn eh_personality() {}

#[no_mangle]
fn abort() -> ! {
    panic!("abort!")
}

#[no_mangle]
fn _start() -> ! {
    let a: [[u64; 4]; 2] = [
        [0x0000000000000001, 0x0000000000000002, 0x0000000000000003, 0x0000000000000004],
        [0x0000000000000005, 0x0000000000000006, 0x0000000000000007, 0x0000000000000008],
    ];
    let b: [[u64; 4]; 2] = [
        [0x0000000000000001, 0x0000000000000002, 0x0000000000000003, 0x0000000000000004],
        [0x0000000000000005, 0x0000000000000006, 0x0000000000000007, 0x0000000000000008],
    ];
    let mut c: [[u64; 4]; 2] = [[0; 4]; 2];

    unsafe {
        vadd_vv(2, &a as *const [u64; 4] as *const u64, &b as *const [u64; 4] as *const u64, &mut c as *mut [u64; 4] as *mut u64);
    }

    exit(0);
}

vadd_vv is implemented in assembly as follows:

    .text
    .balign 4
    .global vadd_vv
vadd_vv:
    vsetvli t0, a0, e256, m1, ta, ma # Set vector length
    vle256.v v0, (a1)                # Get VL elements from a1
      sub a0, a0, t0                 # Decrement number done
      slli t0, t0, 5                 # Multiply number done by 32 bytes
      add a1, a1, t0                 # Bump A pointer
    vle256.v v1, (a2)                # Get VL elements from a2
      add a2, a2, t0                 # Bump B pointer
    vadd.vv v2, v0, v1               # Sum vectors
    vse256.v v2, (a3)                # Store result to C
      add a3, a3, t0                 # Bump C pointer
      bnez a0, vadd_vv               # Loop back
      ret                            # Finished?

Three Stages of RVV Implementation on CKB-VM

Given the large workload, the RVV implementation proceeds in three stages.

Stage 1: Large Number Library for CKB-VM

Since the RISC-V vector instruction set involves numbers larger than uint64, we need a large number library. For CKB-VM, a new library is needed because of the nuances in number processing: for example, dividing a number by 0, or dividing the minimum signed integer by -1, gives different results on different platforms. Unlike existing large number libraries, the one for CKB-VM is fully compliant with the RISC-V standard.
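
As an illustration of those nuances, here is a sketch of 64-bit division following the RISC-V rules (per the spec, division never traps: dividing by zero yields an all-ones quotient and the dividend as remainder, and the signed overflow case MIN / -1 yields the dividend with remainder 0); the library's wider types have to follow the same rules.

// Division semantics required by the RISC-V spec, sketched for 64-bit values;
// U256, U512, etc. in the large number library must behave the same way.
fn divu(dividend: u64, divisor: u64) -> (u64, u64) {
    if divisor == 0 {
        (u64::MAX, dividend) // quotient = all ones, remainder = dividend
    } else {
        (dividend / divisor, dividend % divisor)
    }
}

fn div(dividend: i64, divisor: i64) -> (i64, i64) {
    if divisor == 0 {
        (-1, dividend) // quotient = -1, remainder = dividend
    } else if dividend == i64::MIN && divisor == -1 {
        (i64::MIN, 0) // overflow: quotient = dividend, remainder = 0
    } else {
        (dividend / divisor, dividend % divisor)
    }
}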

Here is a preliminary implementation of a large number library for CKB-VM. We use two smaller numbers to synthesize a larger one: for example, U256 is defined as the combination of two U128s, and U512 as the combination of two U256s. In addition to using the "right" data structure, we use the "right" algorithms, an approach greatly inspired by Hacker's Delight.
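
A rough sketch of that layering (using Rust's built-in u128 for the halves here; the field names and method are illustrative, not the library's actual API): a U256 is a pair of 128-bit halves, and addition carries from the low half into the high half.

// Illustrative only: compose a 256-bit integer from two 128-bit halves and
// propagate the carry on addition. The real library builds U256 from its own
// U128 type and U512 from U256 in the same spirit.
#[derive(Clone, Copy, PartialEq, Eq)]
struct U256 {
    lo: u128,
    hi: u128,
}

impl U256 {
    fn wrapping_add(self, other: U256) -> U256 {
        let (lo, carry) = self.lo.overflowing_add(other.lo);
        let hi = self.hi.wrapping_add(other.hi).wrapping_add(carry as u128);
        U256 { lo, hi }
    }
}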

Stage 2: Generic RVV Implementation

We classify the RVV instructions into three categories by priority:

  • The first category includes memory read/write operations, basic arithmetic and logic operations, shift operations, etc. This part of the instruction set can already boost most cryptographic algorithms on the blockchain.
  • The second category includes 137 instructions. It contains some special memory read and write operations, such as strided reads and writes, along with other less urgent instructions.
  • The third category includes 145 instructions, all of them atomic and floating-point instructions. CKB-VM is a single-threaded virtual machine, and using floating-point numbers in smart contracts is in most cases not a good idea, so we won’t add them to CKB-VM.

After the second category is completed, we will build a benchmarking platform to evaluate the performance impact of the V extension instruction set on CKB scripts. Considering the general advantages of vectorization (parallelism, higher instruction density, and a higher decoder cache hit rate), we are optimistic that RVV can bring a performance improvement of more than 100% for some algorithms.

Stage 3: AVX2 Implementation

This is probably the hardest part. After we have the RVV interpreter, we will explore more of the capabilities of the x86 platform: taking advantage of the AVX2 instruction set to further speed up the execution of the RVV interpreter.

Part of the calculation will bypass our large number library and be completed directly by the x86 platform. Why only part? Because RVV supports element widths up to U1024, which exceeds what AVX2 instructions can handle, and RVV instructions do not map to AVX2 one-to-one.
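
To give a flavor of the mapping (a sketch of the general idea, not code from the interpreter; the helper name is hypothetical): an RVV vadd.vv over 64-bit elements is a lane-wise addition, so four lanes can be handled by a single AVX2 _mm256_add_epi64, while wider element widths such as U1024 still fall back to the large number library.

// Sketch: lane-wise addition of four 64-bit elements with one AVX2 instruction,
// mirroring what an e64 vadd.vv does for four lanes. There is no carry between
// lanes, so this is not a U256 addition. Callers must verify AVX2 support first.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_e64x4(a: &[u64; 4], b: &[u64; 4], c: &mut [u64; 4]) {
    use core::arch::x86_64::{__m256i, _mm256_add_epi64, _mm256_loadu_si256, _mm256_storeu_si256};
    let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
    _mm256_storeu_si256(c.as_mut_ptr() as *mut __m256i, _mm256_add_epi64(va, vb));
}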

What We’ve Learned

If a piece of code occurs frequently, abstract it.

Consider this question: if we want to implement the vector addition and subtraction instructions vadd.vv and vsub.vv, how can we minimize the code needed to do so?

The two instructions are identical in instruction fetch, decode, register read, and data return; they differ only in the execution stage in between, where vadd.vv performs addition and vsub.vv performs subtraction.

To address this, we use Rust generics and macros to reduce repetition, passing only the operation used in the execution stage as an argument:

insts::OP_VADD_VV => {
    vv_loop_s!(inst, machine, { Element::wrapping_add });
}
insts::OP_VSUB_VV => {
    vv_loop_s!(inst, machine, { Element::wrapping_sub });
}
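
The real macro also handles masking, element widths, and register access, but the core idea can be sketched as a single generic loop parameterized by the element-wise operation (the names below are illustrative, not the CKB-VM internals):

// Simplified sketch of the idea behind vv_loop_s!: one loop over the active
// elements, with the per-element operation passed in as a function.
fn vv_loop<E: Copy>(vs2: &[E], vs1: &[E], vd: &mut [E], vl: usize, op: fn(E, E) -> E) {
    for i in 0..vl {
        vd[i] = op(vs2[i], vs1[i]);
    }
}

// vadd.vv and vsub.vv then differ only in the operation they pass:
//   vv_loop(&v0, &v1, &mut v2, vl, u64::wrapping_add);
//   vv_loop(&v0, &v1, &mut v2, vl, u64::wrapping_sub);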

Meanwhile, to deal with the large number of instructions to be decoded (447 in total), we created an automatic code generation tool. It parses the instruction descriptions in the official documentation and auto-generates Rust code for the CKB-VM decoder. This code generation approach saves us a lot of time and helps guarantee correctness.

RVV Rust Library

Now we come to the most exciting part. The RVV Rust library includes two macros: rvv_vector and rvv_asm. The proc-macro rvv_vector helps CKB contract developers write efficient code easily, without even knowing or understanding RVV! With this macro, developers can use RVV instructions to optimize existing code that involves large-number operations. The macro acts on a function and translates all large-number operations within it into the corresponding rvv_asm instructions (in the form of .byte { },{ },{ },{ }). Plain Rust code can also appear in between, and it is merged into the final generated code.

Take the following function as an example:

#[rvv_vector]
fn op_add(a: U256, b: U256) -> U256 {
    a + b
}

The code above is translated as follows:

fn op_add(a: U256, b: U256) -> U256 {
    unsafe {asm!("li t0, 1", ".byte {0}, {1}, {2}, {3}", const 87u8, const 240u8, const 130u8, const 14u8)}
    unsafe {asm!("mv t0, {0}", ".byte {1}, {2}, {3}, {4}", in (reg) a . as_ref () . as_ptr (), const 135u8, const 208u8, const 2u8, const 16u8)}
    unsafe {asm!("mv t0, {0}", ".byte {1}, {2}, {3}, {4}", in (reg) b . as_ref () . as_ptr (), const 7u8, const 209u8, const 2u8, const 16u8)}
    {
        unsafe {asm!(".byte {0}, {1}, {2}, {3}", const 215u8, const 1u8, const 17u8, const 0u8)}
        let mut tmp_rvv_vector_buf = [0u8; 32usize];
        unsafe {asm!("mv t0, {0}", ".byte {1}, {2}, {3}, {4}", in (reg) tmp_rvv_vector_buf . as_mut_ptr (), const 167u8, const 209u8, const 2u8, const 16u8)}
        unsafe { core::mem::transmute::<[u8; 32usize], U256>(tmp_rvv_vector_buf) }
    }
}

For example, Montgomery Reduction is a commonly used algorithm in cryptography. It’s easy to write a U256 version using rvv_vector:

// implemented by rvv_vector
#[rvv_vector]
pub fn mont_reduce(np1: U256, n: U256, t: U512, bits: usize) -> U256 {
    let t0: U512 = U256::from(t).into(); // low part of `t`, same as `% self.r`, avoid overflow
    let np1_512: U512 = U512::from(np1);
    let m_512: U512 = U256::from(t0 * np1_512).into(); // `% self.r`
    let n_512: U512 = U512::from(n);
    let bits_512: U512 = U512::from(bits);
    let u: U512 = (t + m_512 * n_512) >> bits_512; // `/ self.r`
    if u >= n_512 {
        U256::from(u - n_512)
    } else {
        U256::from(u)
    }
}

As you can see, rvv_vector makes implementing cryptographic algorithms much easier and more efficient.

Besides, since the current version of Rust lacks support for RVV instructions, we provide the rvv_asm! macro, which is syntactically consistent with Rust's built-in asm! macro (Rust's way of writing inline assembly).

rvv_asm is a function-like procedural macro that helps us write inline assembly with RVV instruction support. It parses its input and replaces the RVV instructions (such as vle256.v in the code sample below) with instructions in the form of .byte { },{ },{ },{ }.

This code sample indicates how to use rvv_asm:

fn main() {
    let a: u64 = 3;
    let hi: u64;
    unsafe {
        rvv_asm!(
            "vsetvl x5, s3, t6",
            "li {a}, 3",
            "vle256.v v3, (a0)",
            "1:",
            "li {hi}, 4",
            "vse64.v v3, (a0)",
            a = in(reg) a,
            hi = out(reg) hi,
        );
    }
}

The open-source RVV toolchains (LLVM and GCC) are slow to advance. But no worries: with the RVV Rust library described above, we can develop independently without relying on any third-party toolchain.

✍🏻 This blog post is written by: Wanbiao Ye, Linfeng Qian, Jiandong Xu

