RIIR Bench

Why this benchmark exists

Top coding benchmarks have stopped tracking what frontier models can do. RIIR Bench points coding agents at three engineering ports to Rust and runs each for 12 hours.

Motivation

Top coding benchmarks have stopped tracking what frontier models can do. SWE-bench Verified is saturating between 75% and 94% depending on the harness, and OpenAI's February 2026 audit found material problems with test design or task specification in 59.4% of its hardest cases. Companies are already paying agents to do the work. In May 2026, Jarred Sumner had Claude agents rewrite Bun from Zig to Rust for memory safety; the PR was around a million lines and landed in about six days. RIIR Bench points coding agents at three engineering ports to Rust and runs each for 12 hours.

Three ports

Each is a multi-week or multi-month job for a senior engineer (estimate; no human baseline).

12 hours of wall clock per run. As inference budgets grow the window grows; the task doesn't change. The eventual goal is a Linux-kernel-in-Rust environment.

The artifact

A score and the full per-run record (commits, eval results, feedback messages, reasoning steps). The trajectory is where post-training data lives, where stalls and pathologies show up, and where new sub-environments come from.

How a run is graded

Four numbers per run:

A milestone unlocks the next when it passes at least 60% of its tests.
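The unlock rule can be sketched in a few lines. This is an illustration only, not the benchmark's grader: the `Milestone` type, its field names, and the `unlocked_milestones` helper are hypothetical, and the only value taken from the text is the 60% threshold.

```python
from dataclasses import dataclass

UNLOCK_PERCENT = 60  # from the text: a milestone must pass at least 60% of its tests


@dataclass
class Milestone:
    name: str    # hypothetical label for the milestone
    passed: int  # tests passed at the latest eval
    total: int   # tests in the milestone

    def unlocks_next(self) -> bool:
        # Integer arithmetic avoids floating-point surprises at exactly 60%.
        return self.total > 0 and self.passed * 100 >= UNLOCK_PERCENT * self.total


def unlocked_milestones(milestones: list[Milestone]) -> list[str]:
    """Return, in order, the milestones a run is currently allowed to attempt."""
    unlocked = []
    for m in milestones:
        unlocked.append(m.name)
        if not m.unlocks_next():
            break  # this milestone is open but has not yet unlocked the next one
    return unlocked
```

With three milestones at 9/10, 5/10, and 0/10, the first two are open and the third stays locked until the second reaches 6/10.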

What stops a run from gaming the eval

Per-environment defenses run at eval time. Game Boy uses per-run ROM obfuscation (random hex filenames, title bytes zeroed, header checksum recomputed), a per-run secret cycle offset, and canary control ROMs. The C compiler uses an strace -f -e trace=execve probe that fails the eval if the submitted binary calls cc, gcc, g++, c++, clang, or clang++. SQLite has no eval-time defense; its protection is prompt-level only.
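As a rough sketch of the ROM obfuscation step (not the harness's actual code), the three operations named above could look like this. The header offsets follow the standard Game Boy cartridge layout: title at 0x0134-0x0143, header checksum at 0x014D computed over 0x0134-0x014C. The function names, the 16-character filename length, and the choice to zero the full 16-byte title field are assumptions; on newer cartridges the last title byte doubles as the CGB flag, so a real harness may preserve it.

```python
import secrets

TITLE_START, TITLE_END = 0x0134, 0x0143        # title field (original 16-byte layout)
CHECKSUM_START, CHECKSUM_END = 0x0134, 0x014C  # bytes covered by the header checksum
HEADER_CHECKSUM_ADDR = 0x014D


def header_checksum(rom: bytes) -> int:
    """Standard header checksum: x = x - rom[addr] - 1 over 0x0134..0x014C, low 8 bits."""
    x = 0
    for addr in range(CHECKSUM_START, CHECKSUM_END + 1):
        x = (x - rom[addr] - 1) & 0xFF
    return x


def obfuscate_rom(rom: bytes) -> tuple[str, bytes]:
    """Return (random hex filename, ROM copy with title zeroed and checksum fixed)."""
    if len(rom) <= HEADER_CHECKSUM_ADDR:
        raise ValueError("ROM too small to contain a cartridge header")
    out = bytearray(rom)
    for addr in range(TITLE_START, TITLE_END + 1):
        out[addr] = 0  # erase the human-readable title
    # Recompute the checksum so the modified header still validates.
    out[HEADER_CHECKSUM_ADDR] = header_checksum(out)
    filename = secrets.token_hex(8) + ".gb"  # hypothetical naming: 16 hex chars
    return filename, bytes(out)
```

Recomputing the checksum matters because the boot ROM on real hardware refuses to start a cartridge whose header checksum does not match, so a zeroed title with a stale checksum would fail before any test logic ran.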

Post hoc, verification agents read the run's transcript and submitted source and flag anything that looks like hardcoded answers, test-input fingerprinting, or harness gaming. Two disqualifications so far: GPT-5.4-mini on c-compiler (May 12, 2026), for delegating compilation to /usr/bin/cc; and the May 2026 Game Boy Codex runs, for eval/harness-specific run-control and reporting behavior.
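The execve probe used for the C compiler environment targets exactly the kind of delegation behind the c-compiler disqualification. A minimal sketch of how its log might be checked follows; it assumes the usual text format of strace -f -e trace=execve output, and the `forbidden_execs` helper and its policy of flagging any execve attempt (successful or not) are illustrative assumptions rather than the harness's implementation.

```python
import os
import re

# Compiler names listed in the text above.
FORBIDDEN = {"cc", "gcc", "g++", "c++", "clang", "clang++"}

# strace prints the executed path as the first quoted argument of execve(...).
EXECVE_RE = re.compile(r'execve\("([^"]+)"')


def forbidden_execs(strace_log: str) -> list[str]:
    """Return every execve path in the log whose basename is a forbidden compiler."""
    hits = []
    for line in strace_log.splitlines():
        match = EXECVE_RE.search(line)
        if match and os.path.basename(match.group(1)) in FORBIDDEN:
            hits.append(match.group(1))
    return hits
```

An eval would fail when the returned list is non-empty. Matching on the exact basename is deliberately simple: versioned or target-prefixed names such as gcc-12 would slip past this sketch, so a production probe would likely need a broader pattern.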

What this release doesn't yet do

How this fits alongside other benchmarks

Different questions, different task setups. The numbers aren't directly comparable.

| Benchmark | What it measures | Top result, May 2026 |
| --- | --- | --- |
| ProgramBench (Meta FAIR + Stanford + Harvard) | Rebuild-from-binary on 200 tasks (small utilities up to FFmpeg, SQLite, PHP) with a minimal scaffold, no internet | Opus 4.7 passes 95% of tests on ~3% of tasks |
| MirrorCode (Epoch AI + METR) | Reimplement a CLI program with the public tests visible to the agent; preliminary results on ~4 of an eventual ~20+ programs | Substantially resolves gotree (Opus 4.6, 99.95%); ~35% on Pkl |
| SWE-bench Verified (Princeton + OpenAI) | Fix one real GitHub issue in an existing repo; graded against a held-out test patch the agent never sees | 72-94% pass@1 (varies by harness) |
| RIIR Bench | Long-horizon trajectories on three engineering ports over many hours, with rich per-eval feedback and the full run record published alongside the score | 100% on SQLite (GPT-5.4 xhigh, 5 hrs, $52); 97.05% on c-compiler (GPT-5.5 xhigh); 27.60% on Game Boy current schema at the best clean eval before later cheating |

How to cite

@misc{hayat2026riirbench,
  title  = {RIIR Bench: A long-horizon benchmark for AI coding
            agents on real engineering ports to Rust},
  author = {Hassan Hayat},
  year   = {2026},
  url    = {https://riirbench.com},
  note   = {Independent researcher. Contact: hassan.hayat7@gmail.com}
}