Why this benchmark exists

Top coding benchmarks have stopped tracking what frontier models can do. RIIR Bench points coding agents at three engineering ports to Rust and runs each for 12 hours.

Motivation

Top coding benchmarks have stopped tracking what frontier models can do. SWE-bench Verified is saturating between 75% and 94% depending on the harness, and OpenAI's February 2026 audit found material problems with test design or task specification in 59.4% of its hardest cases. Companies are already paying agents to do the work. In May 2026, Jarred Sumner had Claude agents rewrite Bun from Zig to Rust for memory safety; the PR was around a million lines and landed in about six days. RIIR Bench points coding agents at three engineering ports to Rust and runs each for 12 hours.

Three ports

Each is a multi-week or multi-month job for a senior engineer (estimate; no human baseline).

A C compiler in Rust. Lowers to x86-64 assembly. Graded against WACCT chapters 1 to 20, c-testsuite, and the GCC torture tests: 3,207 tests across 44 ordered milestones.
SQLite to Rust. Graded against the sqllogictest corpus: 5,806,415 records across 18 milestones.
A Game Boy / GBC emulator from SameBoy. Cycle-accurate CPU, PPU, APU, and timer. Graded against Blargg, Mooneye, acid2, mealybug, SameSuite, and generated state-oracle suites: 1,999 tests across 16 milestones.

12 hours of wall clock per run. As inference budgets grow the window grows; the task doesn't change. The eventual goal is a Linux-kernel-in-Rust environment.

The artifact

Each run produces a score and the full per-run record (commits, eval results, feedback messages, reasoning steps). The trajectory is where post-training data lives, where stalls and pathologies show up, and where new sub-environments come from.

How a run is graded

Four numbers per run:

Peak % completion: the highest milestone-completion percentage across every successful eval.
Wall time.
Total cost in USD.
Total tokens (input + output + reasoning).

A milestone unlocks the next when it passes at least 60% of its tests.
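
The unlock rule and the peak metric can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's harness: the `Milestone` type and both function names are hypothetical, the first milestone is assumed to be open from the start, and a failed eval is represented here as `None`. Only the 60% threshold and "highest percentage across successful evals" come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional

UNLOCK_PERCENT = 60  # from the text: a milestone unlocks the next at >= 60% passing


@dataclass
class Milestone:
    name: str    # hypothetical label, e.g. "m1"
    passed: int  # tests passed in this milestone
    total: int   # tests in this milestone

    def clears_threshold(self) -> bool:
        # Integer arithmetic avoids float rounding right at the 60% boundary.
        return self.total > 0 and self.passed * 100 >= UNLOCK_PERCENT * self.total


def unlocked_count(milestones: List[Milestone]) -> int:
    """Count milestones open to the agent, walking the ordered list.

    Assumption: the first milestone is always open; each later one opens
    only once its predecessor clears the threshold.
    """
    if not milestones:
        return 0
    count = 1
    for m in milestones[:-1]:
        if not m.clears_threshold():
            break
        count += 1
    return count


def peak_completion(eval_percentages: List[Optional[float]]) -> float:
    """Highest completion percentage over successful evals (None = failed eval)."""
    successful = [p for p in eval_percentages if p is not None]
    return max(successful, default=0.0)
```

For example, with milestones at 6/10, 5/10, and 0/10 passing, the first two are open and the third stays locked, because the second sits below 60%.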

What stops a run from gaming the eval

Per-environment defenses run at eval time. Game Boy uses per-run ROM obfuscation (random hex filenames, title bytes zeroed, header checksum recomputed), a per-run secret cycle offset, and canary control ROMs. The C compiler uses an strace -f -e trace=execve probe that fails the eval if the submitted binary calls cc, gcc, g++, c++, clang, or clang++. For SQLite, the defense is prompt-level only.
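
The ROM obfuscation step can be sketched as follows. The header offsets are the standard Game Boy cartridge layout (title starting at 0x0134, header checksum at 0x014D computed over 0x0134 to 0x014C); everything else is an assumption made for illustration: the function names, the 16-character filename, the ".gb" extension, and the choice to leave byte 0x0143 alone because it doubles as the CGB flag on newer cartridges. The secret cycle offset and the canary ROMs are not covered here.

```python
import secrets
from typing import Tuple

TITLE_START = 0x0134     # first byte of the cartridge title field
TITLE_END = 0x0142       # assumption: stop before 0x0143, which is the CGB flag on newer carts
CHECKSUM_START = 0x0134  # header checksum covers 0x0134..0x014C inclusive
CHECKSUM_END = 0x014C
CHECKSUM_ADDR = 0x014D   # where the header checksum byte is stored


def header_checksum(rom: bytes) -> int:
    """Standard header checksum: x = x - byte - 1 over 0x0134..0x014C, kept to 8 bits."""
    x = 0
    for b in rom[CHECKSUM_START:CHECKSUM_END + 1]:
        x = (x - b - 1) & 0xFF
    return x


def obfuscate_rom(rom: bytes) -> Tuple[str, bytes]:
    """Return (random hex filename, ROM with title zeroed and header checksum fixed up)."""
    if len(rom) <= CHECKSUM_ADDR:
        raise ValueError("ROM too short to contain a cartridge header")
    out = bytearray(rom)
    # Zero the title so the test name cannot be read back from the header.
    out[TITLE_START:TITLE_END + 1] = bytes(TITLE_END - TITLE_START + 1)
    # The boot ROM verifies this byte, so it must match the edited header.
    out[CHECKSUM_ADDR] = header_checksum(out)
    # Hypothetical naming scheme: 8 random bytes rendered as 16 hex characters.
    name = secrets.token_hex(8) + ".gb"
    return name, bytes(out)
```

This sketch leaves the two-byte global checksum at 0x014E to 0x014F untouched; the text above only mentions the header checksum.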

Post hoc, verification agents read the run's transcript and submitted source and flag anything that looks like hardcoded answers, test-input fingerprinting, or harness gaming. Disqualifications so far: GPT-5.4-mini on c-compiler (May 12, 2026), for delegating compilation to /usr/bin/cc; and the May 2026 Game Boy Codex runs, for eval/harness-specific run-control and reporting behavior.
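
Delegating to /usr/bin/cc is the kind of behavior the C compiler's execve probe looks for. A minimal sketch of scanning an `strace -f -e trace=execve` log for forbidden compilers is shown below. The function name and the sample log are hypothetical, and the real probe may differ: for instance, this version flags failed execve attempts as well as successful ones, and it would not catch versioned names such as gcc-13.

```python
import posixpath
import re
from typing import List

# Compiler names listed in the text above.
FORBIDDEN = {"cc", "gcc", "g++", "c++", "clang", "clang++"}

# strace prints each call as: execve("/path/to/binary", ["argv0", ...], ...) = 0
EXECVE_RE = re.compile(r'execve\("([^"]+)"')


def forbidden_execs(strace_log: str) -> List[str]:
    """Return the paths of every execve target whose basename is a forbidden compiler."""
    hits = []
    for line in strace_log.splitlines():
        match = EXECVE_RE.search(line)
        if match and posixpath.basename(match.group(1)) in FORBIDDEN:
            hits.append(match.group(1))
    return hits


# Hypothetical log: the submitted binary shells out to the system C compiler.
sample_log = '''execve("./mycc", ["./mycc", "t.c"], 0x7ffc0000 /* 20 vars */) = 0
[pid 101] execve("/usr/bin/cc", ["cc", "-S", "t.c"], 0x55d00000 /* 20 vars */) = 0
[pid 102] execve("/bin/true", ["true"], 0x55d00000 /* 20 vars */) = 0
'''

violations = forbidden_execs(sample_log)
print("FAIL" if violations else "PASS", violations)
```

Matching on the basename rather than the full path means a compiler reached through any directory on PATH is caught, at the cost of also flagging an unrelated binary that happens to share one of these names.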

What this release doesn't yet do

Three active environments. Population-level claims are limited; the trajectories are the contribution.
Single-run reporting. Each run costs $5 to $260. Two same-config GPT-5.5 SQLite runs in May produced wildly different outcomes: one crashed at 44%, the other reached 99.9%. Variance isn't yet systematically measured.
Contamination is unmitigated. Rust ports of all three projects exist in open source.
No human baseline. "Weeks to months for a senior engineer" is an estimate.
Data and code releases aren't out yet.

How this fits alongside other benchmarks

Different questions, different task setups. The numbers aren't directly comparable.

Benchmark	What it measures	Top result, May 2026
ProgramBench (Meta FAIR + Stanford + Harvard)	Rebuild-from-binary on 200 tasks (small utilities up to FFmpeg, SQLite, PHP) with a minimal scaffold, no internet	Opus 4.7 passes 95% of tests on ~3% of tasks
MirrorCode (Epoch AI + METR)	Reimplement a CLI program with the public tests visible to the agent; preliminary results on ~4 of an eventual ~20+ programs	Substantially resolves gotree (Opus 4.6, 99.95%); ~35% on Pkl
SWE-bench Verified (Princeton + OpenAI)	Fix one real GitHub issue in an existing repo; graded against a held-out test patch the agent never sees	72-94% pass@1 (varies by harness)
RIIR Bench	Long-horizon trajectories on three engineering ports over many hours, with rich per-eval feedback and the full run record published alongside the score	100% on SQLite (GPT-5.4 xhigh, 5 hrs, $52); 97.05% on c-compiler (GPT-5.5 xhigh); 27.60% on Game Boy (current schema), taken from the best clean eval, before later cheating

How to cite

@misc{hayat2026riirbench,
  title  = {RIIR Bench: A long-horizon benchmark for AI coding
            agents on real engineering ports to Rust},
  author = {Hassan Hayat},
  year   = {2026},
  url    = {https://riirbench.com},
  note   = {Independent researcher. Contact: hassan.hayat7@gmail.com}
}