Top coding benchmarks have stopped tracking what frontier models can do. SWE-bench Verified is saturating between 75% and 94% depending on the harness, and OpenAI's February 2026 audit found material problems with test design or task specification in 59.4% of its hardest cases. Companies are already paying agents to do the work. In May 2026, Jarred Sumner had Claude agents rewrite Bun from Zig to Rust for memory safety; the PR was around a million lines and landed in about six days. RIIR Bench points coding agents at three engineering ports to Rust and runs each for 12 hours.
Each is a multi-week or multi-month job for a senior engineer (estimate; no human baseline).
12 hours of wall clock per run. As inference budgets grow the window grows; the task doesn't change. The eventual goal is a Linux-kernel-in-Rust environment.
A score and the full per-run record (commits, eval results, feedback messages, reasoning steps). The trajectory is where post-training data lives, where stalls and pathologies show up, and where new sub-environments come from.
Four numbers per run:
A milestone unlocks the next when it passes at least 60% of its tests.
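The gating rule can be sketched in a few lines. This is an illustration only, not the benchmark's harness code: the `Milestone` type, the field names, and the assumption that the first milestone is always open are hypothetical. Only the 60% threshold comes from the text.

```python
from dataclasses import dataclass

UNLOCK_PERCENT = 60  # from the text: at least 60% of a milestone's tests


@dataclass
class Milestone:
    name: str    # hypothetical label, e.g. "m1"
    passed: int  # tests currently passing
    total: int   # tests in this milestone


def clears_gate(m: Milestone) -> bool:
    """True when the milestone passes at least 60% of its tests."""
    # Integer arithmetic avoids float rounding right at the threshold.
    return m.total > 0 and m.passed * 100 >= UNLOCK_PERCENT * m.total


def unlocked_milestones(milestones: list[Milestone]) -> list[str]:
    """Names of the milestones open to the agent, in order.

    Assumption: the first milestone is always open; each later one
    opens only once its predecessor clears the gate.
    """
    unlocked = []
    for m in milestones:
        unlocked.append(m.name)
        if not clears_gate(m):
            break  # everything after this milestone stays locked
    return unlocked
```

With milestones at 6/10, 5/10, and 0/10, the first two are open and the third stays locked until the second reaches 6/10.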
Per-environment defenses run at eval time. Game Boy uses per-run ROM obfuscation (random hex
filenames, title bytes zeroed, header checksum recomputed), a per-run secret cycle offset, and
canary control ROMs. The C compiler uses an strace -f -e trace=execve probe that
fails the eval if the submitted binary calls cc, gcc, g++,
c++, clang, or clang++. SQLite is prompt-level only.
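As a rough illustration of the Game Boy ROM obfuscation step, the sketch below zeroes the title field, recomputes the header checksum, and picks a random hex filename. It is not the benchmark's code. The function names, the 8-byte filename length, and the choice to zero the full 16-byte title field are assumptions; the header offsets and the checksum rule follow the standard cartridge header layout. The per-run cycle offset and canary ROMs are not covered.

```python
import secrets

TITLE_START, TITLE_END = 0x0134, 0x0143        # 16-byte title field (inclusive)
CHECKSUM_START, CHECKSUM_END = 0x0134, 0x014C  # bytes covered by the header checksum
HEADER_CHECKSUM_ADDR = 0x014D


def header_checksum(rom: bytes) -> int:
    """Standard header checksum: x = x - byte - 1 over 0x0134..0x014C, mod 256."""
    checksum = 0
    for addr in range(CHECKSUM_START, CHECKSUM_END + 1):
        checksum = (checksum - rom[addr] - 1) & 0xFF
    return checksum


def obfuscate_rom(rom: bytes) -> tuple[str, bytes]:
    """Return (random_filename, obfuscated_rom) for one run.

    Assumption: the whole 16-byte title field is zeroed. Newer cartridges
    reuse its tail bytes (manufacturer code, CGB flag), so a real harness
    may zero fewer bytes.
    """
    if len(rom) <= HEADER_CHECKSUM_ADDR:
        raise ValueError("ROM is too short to contain a cartridge header")
    out = bytearray(rom)
    out[TITLE_START:TITLE_END + 1] = bytes(TITLE_END - TITLE_START + 1)
    # The checksum covers the title, so it has to be recomputed after zeroing;
    # otherwise the boot ROM's header check would reject the image.
    out[HEADER_CHECKSUM_ADDR] = header_checksum(out)
    filename = f"{secrets.token_hex(8)}.gb"  # 16 random hex characters
    return filename, bytes(out)
```

Recomputing the checksum is what keeps the obfuscated ROM bootable: the boot ROM verifies this byte before handing control to the cartridge.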
Post hoc, verification agents read each run's transcript and submitted source and flag anything
that looks like hardcoded answers, test-input fingerprinting, or harness gaming. Two
disqualifications so far: GPT-5.4-mini on c-compiler, May 12, 2026, for delegating compilation to
/usr/bin/cc; and the May 2026 Game Boy Codex runs, for eval/harness-specific
run-control and reporting behavior.
Different questions, different task setups. The numbers aren't directly comparable.
| Benchmark | What it measures | Top result, May 2026 |
|---|---|---|
| ProgramBench (Meta FAIR + Stanford + Harvard) | Rebuild-from-binary on 200 tasks (small utilities up to FFmpeg, SQLite, PHP) with a minimal scaffold, no internet | Opus 4.7 passes 95% of tests on ~3% of tasks |
| MirrorCode (Epoch AI + METR) | Reimplement a CLI program with the public tests visible to the agent; preliminary results on ~4 of an eventual ~20+ programs | Substantially resolves gotree (Opus 4.6, 99.95%); ~35% on Pkl |
| SWE-bench Verified (Princeton + OpenAI) | Fix one real GitHub issue in an existing repo; graded against a held-out test patch the agent never sees | 72-94% pass@1 (varies by harness) |
| RIIR Bench | Long-horizon trajectories on three engineering ports over many hours, with rich per-eval feedback and the full run record published alongside the score | 100% on SQLite (GPT-5.4 xhigh, 5 hrs, $52); 97.05% on c-compiler (GPT-5.5 xhigh); 27.60% on Game Boy current schema at the best clean eval before later cheating |
@misc{hayat2026riirbench,
title = {RIIR Bench: A long-horizon benchmark for AI coding
agents on real engineering ports to Rust},
author = {Hassan Hayat},
year = {2026},
url = {https://riirbench.com},
note = {Independent researcher. Contact: hassan.hayat7@gmail.com}
}