I recently took on an exciting challenge from Succinct Labs to optimize their RISC-V emulator. What started as a simple performance improvement task turned into a deep dive into systems optimization, profiling, and understanding the subtle art of making code run fast.
The Challenge #
Succinct Labs created a challenge to improve their RISC-V emulator performance. Their emulator processes about 2 million instructions per second, and they wanted it faster. Much faster.
The setup is straightforward:
[Input Buffer] → [RSP Program] → [RISC-V Emulator] → [Output Hash]
Think of it like this:
- RSP Program: A recipe (program written for RISC-V)
- Input Buffer: The ingredients (data the program needs)
- RISC-V Emulator: A kitchen that follows RISC-V recipes
The benchmark runs the emulator 5 times, measures performance in MHz (millions of operations per second), and verifies correctness by checking output hashes.
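Concretely, the MHz figure is just instructions executed divided by wall-clock seconds, scaled to millions. A minimal sketch of how such a harness might compute it (hypothetical; the real benchmark also averages over the 5 runs and checks the output hash):
use std::time::Instant;

// Hypothetical harness (not the actual benchmark code): run the emulator
// once and report MHz as millions of instructions per wall-clock second.
fn bench_once(run_emulator: impl Fn() -> u64) -> f64 {
    let start = Instant::now();
    let instructions_executed = run_emulator(); // closure returns the instruction count
    let elapsed = start.elapsed().as_secs_f64();
    instructions_executed as f64 / elapsed / 1_000_000.0
}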
My Starting Point #
On my modest setup, I was getting pretty poor performance:
===== BENCHMARK RESULTS =====
Runs: 5
Average elapsed: 25.2662 seconds
Average MHz: 2.22
My setup:
- OS: Pop!_OS 22.04 LTS
- CPU: Intel i5-3210M (4 cores)
- Memory: ~5.8 GB RAM
For comparison, Succinct's AWS m7i.8xlarge instance achieved 9.35 MHz, which became my target to beat.
Finding the Bottlenecks #
Instead of blindly optimizing, I profiled the code using cargo instruments to understand where the time was actually going.

The profiler revealed the real culprits:
| Time   | % of Total | Function                           |
|--------|------------|------------------------------------|
| 2.42 s | 21.2%      | Memory write operations (mw)       |
| 1.65 s | 14.4%      | Load register-register (load_rr)   |
| 1.39 s | 12.2%      | ALU register-register (alu_rr)     |
| 1.34 s | 11.8%      | Store register-register (store_rr) |
Nearly 60% of execution time was spent on these four operations, with memory writes being the biggest bottleneck. Now I had clear targets for optimization.
Optimization #1: Faster Address Translation #
The original address translation function was checking an assertion on every memory access:
fn translate_addr(addr: u32) -> u32 {
    assert!(addr >= 0x10000); // This runs every single time!
    (addr - 0x10000) >> 2
}
In a tight loop processing millions of instructions, this assertion was expensive. Here's my optimized version:
#[inline(always)]
fn translate_addr(addr: u32) -> u32 {
    // Fast path for registers
    if addr < 32 {
        return addr;
    }
    // Only check the assertion in debug builds
    #[cfg(debug_assertions)]
    assert!(addr >= 0x10000);
    (addr - 0x10000) >> 2
}
Key improvements:
- Added fast path for register addresses
- Moved assertion behind debug flag
- Inlined the function to eliminate call overhead
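To make the mapping concrete, here are a few spot checks that follow directly from the code above (register addresses pass straight through; everything at or above 0x10000 becomes a word index). This is an illustrative test, not one from the repository:
#[test]
fn translate_addr_spot_checks() {
    assert_eq!(translate_addr(5), 5);          // register fast path, returned as-is
    assert_eq!(translate_addr(0x10000), 0);    // first word of main memory
    assert_eq!(translate_addr(0x10004), 1);    // next 4-byte word
    assert_eq!(translate_addr(0x10100), 0x40); // (0x10100 - 0x10000) >> 2
}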
Optimization #2: Smarter Register Access #
The register access code had redundant array indexing:
// Before: accessing the same array index multiple times
if self.registers[addr as usize].is_some() {
    return MemEntry::Occupied(MemOccupied::Register(
        self.registers[addr as usize].as_mut().unwrap(),
    ));
}
return MemEntry::Vacant(MemVacant::Register(&mut self.registers[addr as usize]));
I simplified this to cache the reference:
// After: get the reference once, use it multiple times
if addr < 32 {
    let reg = &mut self.registers[addr as usize];
    return if reg.is_some() {
        MemEntry::Occupied(MemOccupied::Register(reg.as_mut().unwrap()))
    } else {
        MemEntry::Vacant(MemVacant::Register(reg))
    };
}
This reduces redundant array accesses and is more branch-predictor friendly.
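For context, a caller of this entry-style API might look roughly like the following. Only the MemEntry, MemOccupied, and MemVacant variant names come from the snippet above; the entry method and the surrounding write path are assumptions for illustration:
// Hypothetical caller: write a value into whichever slot comes back.
match memory_map.entry(addr) {
    MemEntry::Occupied(MemOccupied::Register(slot)) => *slot = value,
    MemEntry::Vacant(MemVacant::Register(slot)) => *slot = Some(value),
    _ => { /* main-memory variants would be handled here */ }
}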
Optimization #3: The HashMap Problem #
With memory writes taking 21.2% of execution time, I suspected the HashMap was the bottleneck. I tried several approaches:
Custom Hasher (Didn't Use) #
I wrote a custom MurmurHash-based hasher to replace the default SipHash:
struct MemoryHasher { state: u64 }

impl std::hash::Hasher for MemoryHasher {
    fn finish(&self) -> u64 { self.state }

    fn write(&mut self, bytes: &[u8]) {
        let key = u32::from_ne_bytes(bytes.try_into().expect("u32 keys only"));
        let mut h = key as u64;
        // MurmurHash3 64-bit finalizer constants; wrapping_mul avoids
        // overflow panics in debug builds.
        h ^= h >> 33;
        h = h.wrapping_mul(0xff51afd7ed558ccd);
        h ^= h >> 33;
        h = h.wrapping_mul(0xc4ceb9fe1a85ec53);
        h ^= h >> 33;
        self.state = h;
    }
}
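To actually use it, the hasher has to be wrapped in a BuildHasher. A minimal sketch, assuming MemoryHasher also derives Default (the BuildMemoryHasher name matches the one that appears in the MemoryMap struct below):
// Sketch: wire the custom hasher into a HashMap via BuildHasherDefault.
type BuildMemoryHasher = std::hash::BuildHasherDefault<MemoryHasher>;
type Memory<V> = std::collections::HashMap<u32, V, BuildMemoryHasher>;

fn new_memory<V>() -> Memory<V> {
    Memory::default()
}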
While faster than SipHash (10-15 cycles vs 50-100), it was still slower than no hashing at all. This led me to a better idea.
No-Hash Hasher (Epic Fail) #
I tried BuildNoHashHasher, thinking sequential addresses would work well:
type MemoryHasher = BuildNoHashHasher<u32>;
Results: 13x slower! Hash collisions were destroying performance.
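For reference, the swap is a one-line change to the map's hasher parameter. A sketch, assuming the nohash-hasher crate (which is where BuildNoHashHasher comes from):
use nohash_hasher::BuildNoHashHasher;
use std::collections::HashMap;

// The key itself is used as the hash: zero mixing cost per lookup,
// but also nothing to break up unlucky bucket patterns.
type MemoryHasher = BuildNoHashHasher<u32>;
type Memory<V> = HashMap<u32, V, MemoryHasher>;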
Optimization #4: Vec-Based Memory (The Winner!) #
Instead of optimizing the HashMap, I replaced it entirely with a Vec:
// Before: HashMap with hashing overhead
pub struct MemoryMap<V> {
    pub registers: [Option<V>; 32],
    memory: HashMap<u32, V, BuildMemoryHasher>,
}

// After: direct indexing with Vec
pub struct MemoryMap<V: Clone> {
    pub registers: [Option<V>; 32],
    memory: Vec<Option<V>>,
}
Why this works for RISC-V:
- Memory addresses translate to perfect sequential indices
- Pre-allocate 4MiB (matches typical program needs)
- Zero hashing overhead
- Better cache locality
- Direct O(1) array access
pub fn new() -> Self {
    Self::with_capacity(1 << 20) // 1,048,576 word-sized entries = 4 MiB of emulated memory
}
The Vec automatically resizes for larger programs while maintaining performance for typical cases.
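The resizing itself is straightforward. A minimal sketch of what a write into the Vec-backed map might look like (the method name mw_vec and the exact growth policy are assumptions, not the actual implementation):
impl<V: Clone> MemoryMap<V> {
    // Hypothetical write path: grow the Vec on demand, then index directly.
    // Assumes addr refers to main memory; registers live in the separate array.
    fn mw_vec(&mut self, addr: u32, value: V) {
        let idx = translate_addr(addr) as usize;
        if idx >= self.memory.len() {
            // Grow to cover the new index; Vec::resize needs V: Clone,
            // which is why the struct gained that bound.
            self.memory.resize(idx + 1, None);
        }
        self.memory[idx] = Some(value);
    }
}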
Optimization #5: Register Caching #
Register access was still taking 14.4% of execution time. I added a register cache:
pub struct Executor<'a> {
    // ... other fields ...
    register_cache: [Option<u32>; 32],
    register_dirty: [bool; 32],
}
Fast Register Reads #
#[inline(always)]
pub fn rr(&mut self, register: Register, position: MemoryAccessPosition) -> u32 {
    let reg_idx = register as usize;
    // Fast path: check the cache first
    if let Some(value) = self.register_cache[reg_idx] {
        return value;
    }
    // Slow path: load from the memory map
    // ... fallback code ...
}
Smart Register Writes #
#[inline(always)]
pub fn rw(&mut self, register: Register, value: u32) {
    // Fast path for x0 (always zero in RISC-V)
    if register == Register::X0 {
        return;
    }
    let reg_idx = register as usize;
    self.register_cache[reg_idx] = Some(value);
    self.register_dirty[reg_idx] = true;
    // Defer memory map updates until checkpoint
    if self.executor_mode == ExecutorMode::Checkpoint {
        self.flush_register(register);
    } else {
        self.mw_cpu(register as u32, value, MemoryAccessPosition::A);
    }
}
Benefits:
- Most register reads become simple array access
- Deferred writes reduce memory map operations
- Special handling for x0 register
- Dirty tracking for selective flushing (see the sketch below)
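The rw path above calls flush_register, which isn't shown here. A minimal sketch of what selective flushing could look like, based purely on the dirty-tracking fields above (the write-back call is a placeholder, not the real method):
impl<'a> Executor<'a> {
    // Hypothetical flush: write a cached register back to the memory map
    // only if it has been modified since the last flush.
    fn flush_register(&mut self, register: Register) {
        let reg_idx = register as usize;
        if self.register_dirty[reg_idx] {
            if let Some(value) = self.register_cache[reg_idx] {
                // Placeholder write-back into the backing memory map.
                self.mw_cpu(register as u32, value, MemoryAccessPosition::A);
            }
            self.register_dirty[reg_idx] = false;
        }
    }
}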
The Results #
These optimizations transformed the emulator's performance by targeting the actual bottlenecks rather than guessing. The key changes were:
- Eliminating hashing overhead (Vec instead of HashMap)
- Caching frequent operations (register cache)
- Reducing function call overhead (inlining)
- Optimizing hot paths (fast register access)
Together, they produced a significantly faster emulator while maintaining correctness.
Future Ideas #
For even more performance, I considered:
Inter-Shard Caching: Cache hot instruction metadata across the 2M-cycle shards that make up a full program execution.
Global Caching: Store compiled instruction blocks across multiple benchmark runs, essentially JIT compilation for repeated programs.
These optimizations add overhead that probably wouldn't pay off for a single run, but they could provide 20-30% improvements when the same programs are executed repeatedly.
Key Lessons #
- Profile first: Don't guess where the bottlenecks are
- Question assumptions: Sometimes the best optimization is avoiding work entirely (Vec vs HashMap)
- Understand your data: RISC-V's sequential memory access patterns made Vec perfect
- Cache hot paths: Register access is frequent enough to justify caching
- Measure everything: Even "obvious" optimizations can backfire
The most important lesson? When every cycle counts, sometimes the best hash function is no hash function at all.