RIIR Bench

port-c-compiler · best: 97.05% · GPT-5.5 xhigh · 10 hrs 6 min · $165

C compiler in Rust

Implement a C compiler in idiomatic Rust. The submitted binary has to compile real C end-to-end — tokenize, preprocess, parse, typecheck, codegen to x86-64 AT&T assembly, then hand off to as and ld. Graded against three established C test corpora: 3,207 tests across 44 ordered milestones.

Target language: Rust (default)
Target ISA: x86-64 (System V ABI)
Total tests: 3,207
Milestones: 44 ordered
Unlock threshold: 60% pass to unlock the next
Measurement window: 12 hours
Reference impl in workspace: Yes (a small open-source C compiler as study material)
Anti-cheat: strace probe blocks cc/gcc/g++/c++/clang/clang++ delegation

What the agent must produce

Real C compilation, not a toy. The submitted binary is invoked as ./cc input.c -o output. It has to handle the language features covered by Nora Sandler's "Writing a C Compiler", the textbook behind the Writing-a-C-Compiler-Tests (WACCT) suite: integers, control flow, functions, pointers, arrays, strings, structs, unions, casts. It then needs enough of the C standard to pass c-testsuite, and finally the adversarial cases in the GCC torture execute tests.
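As a rough illustration of the command-line contract above, the driver's outer shell might look like the following. This is a minimal sketch, not the benchmark's reference code: the names Invocation, parse_args, assembler_argv, and run are hypothetical, and only the ./cc input.c -o output shape and the hand-off to as come from the text. The ld step is left out because its arguments (startup objects, dynamic linker path) are system-specific.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

/// Parsed form of `./cc input.c -o output` (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Invocation {
    pub input: PathBuf,
    pub output: PathBuf,
}

/// Parse the arguments that follow the program name.
pub fn parse_args(args: &[String]) -> Result<Invocation, String> {
    let mut input = None;
    let mut output = None;
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        if arg == "-o" {
            // `-o` must be followed by the output path.
            let path = it.next().ok_or("-o requires a path")?;
            output = Some(PathBuf::from(path));
        } else if input.is_none() {
            input = Some(PathBuf::from(arg));
        } else {
            return Err(format!("unexpected argument: {arg}"));
        }
    }
    Ok(Invocation {
        input: input.ok_or("missing input file")?,
        output: output.ok_or("missing -o <output>")?,
    })
}

/// Build the assembler step: `as <asm> -o <obj>`.
pub fn assembler_argv(asm: &Path, obj: &Path) -> Vec<String> {
    vec![
        "as".to_string(),
        asm.display().to_string(),
        "-o".to_string(),
        obj.display().to_string(),
    ]
}

/// Run one external tool and report whether it exited successfully.
/// Panics on an empty argv; callers are expected to pass a built command.
pub fn run(argv: &[String]) -> std::io::Result<bool> {
    let status = Command::new(&argv[0]).args(&argv[1..]).status()?;
    Ok(status.success())
}
```

A real driver would call parse_args on std::env::args().skip(1), write the generated assembly to a temporary file, then pass assembler_argv(...) to run before linking.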

The agent cannot shell out to a host C compiler. The eval sandbox runs an strace probe at session setup that watches for execve calls to cc, gcc, g++, c++, clang, or clang++. If a run delegates compilation, its public score is taken from the best clean eval recorded before the first delegating commit, and time, cost, and tokens are measured at that eval. (One such incident has been caught; see below.)
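The probe's implementation is not shown here, so the following is only a sketch of the kind of check it describes: scan strace output for execve calls and flag any whose program name is on the banned list. The function names and the own_binary parameter are hypothetical. The exemption for the submission's own path is an assumption, added because the submitted binary is itself named cc and would otherwise match.

```rust
/// Compiler driver names the text lists as forbidden delegation targets.
const BANNED: [&str; 6] = ["cc", "gcc", "g++", "c++", "clang", "clang++"];

/// Extract the program path from one strace line such as
/// `[pid 7] execve("/usr/bin/cc", ["cc", "a.c"], 0x7ffc /* 20 vars */) = 0`.
/// Returns None for lines that are not execve calls.
fn execve_target(line: &str) -> Option<&str> {
    let marker = "execve(\"";
    let rest = &line[line.find(marker)? + marker.len()..];
    let end = rest.find('"')?;
    Some(&rest[..end])
}

/// Return the first banned program executed in the trace, if any.
/// `own_binary` is the path the harness used to launch the submission;
/// exempting it is an assumption of this sketch.
pub fn first_banned_exec<'a>(trace: &'a str, own_binary: &str) -> Option<&'a str> {
    trace.lines().filter_map(execve_target).find(|&path| {
        // Compare on the file name so /usr/bin/gcc and ./gcc both match.
        let name = path.rsplit('/').next().unwrap_or(path);
        path != own_binary && BANNED.contains(&name)
    })
}
```

Matching on the file name rather than the full path means as and ld pass through untouched, which agrees with the statement that the agent's own assembler and linker are unaffected. A renamed or symlinked compiler would slip past this sketch; it does not attempt to handle that case.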

Test corpora, in order

The 44 milestones are grouped into four bands: Basics (a 41-test warm-up), Writing-a-C-Compiler-Tests (Nora Sandler's textbook chapters 1–20, paired with their checkpoint sub-suites), c-testsuite (212 canonical C conformance tests, drawn from the larger upstream c-testsuite corpus), and the GCC torture execute tests (1,392 adversarial cases).

| Band | Tests | What's measured |
| --- | --- | --- |
| Basics | 41 | Hello-world, arithmetic, control flow, function calls, the smallest possible programs |
| WACCT Ch1–20 (+ checkpoints) | ~1,556 | Integers, longs, floats/doubles, pointers, arrays, strings, structs, unions, enums, file-scope vars, function pointers, switch statements, and the rest of Sandler's "Writing a C Compiler" progression |
| c-testsuite | 212 | Open-source C conformance suite. Edge cases in initialization, scope, operator precedence, integer promotions, bitfields |
| GCC torture execute | 1,392 | The adversarial bar. Specifically designed to break compilers: pathological switch tables, deep recursion, struct-return ABI edge cases, sign-extension corners, the _Complex family, computed goto, every UB-adjacent corner of C99/C11 |
| Total | 3,207 | |
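The ordered-milestone rule (60% pass to unlock the next) can be sketched as a short walk over per-milestone results. This is an illustrative reading of the rule, not the harness's code: the Milestone type and unlocked_count are hypothetical names, and treating exactly 60% as a pass is an assumption.

```rust
/// Result for one milestone (hypothetical type).
pub struct Milestone {
    pub passed: u64,
    pub total: u64,
}

/// Count how many milestones are open for grading, walking them in order.
/// The first milestone is always open; each later one opens only if the
/// previous milestone reached the 60% threshold (inclusive, by assumption).
pub fn unlocked_count(milestones: &[Milestone]) -> usize {
    let mut unlocked = 0;
    for m in milestones {
        unlocked += 1; // this milestone is open
        // Integer form of passed / total < 0.60, avoiding float rounding.
        if m.passed * 100 < m.total * 60 {
            break; // below threshold: later milestones stay locked
        }
    }
    unlocked
}
```

Under this reading, a run that stalls below 60% on one milestone never sees the tests behind it, however many of those it could have passed.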

Per-run results

Every production c-compiler run with peak % > 0, sorted by peak. "Wall" is the elapsed wall-clock time at which the last useful state was recorded.

| Model | Agent | Budget | Wall | Peak % | Tests | Cost | Tokens | Outcome |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 (xhigh) | codex | 12 hrs | 10 hrs 6 min | 97.05% | 2,671 / 2,995 | $165 | 231 M | platform bug |
| GPT-5.4 (xhigh) | codex | 12 hrs | 12 hrs | 96.63% | 2,938 / 3,207 | $108 | 267 M | platform bug |
| GPT-5.5 | codex | 2 hrs | 1 hr 30 min | 91.93% | 1,107 / 2,995 | not recorded | 62.4 M | sandbox hard kill |
| GPT-5.4-mini | codex | 12 hrs | 3 hrs 33 min | 91.67% | 1,111 / 2,995 | $20 | 160 M | cheating later |
| GPT-5.5 (xhigh) | codex | 12 hrs | 1 hr | 78.97% | 820 / 2,995 | $32 | 44.1 M | abandoned |
| DeepSeek-v4-pro | opencode | 12 hrs | 3 hrs 54 min | 55.90% | 630 / 3,207 | $4.50 | 171 M | model gave up |
| Qwen 3.6 plus | opencode | 12 hrs | 1 hr 54 min | 45.04% | 352 / 2,995 | $5.80 | 81 M | model gave up |

The GPT-5.4-mini disqualification

On May 12, GPT-5.4-mini's run climbed normally through WACCT chapters 1 to roughly 18 over the first ~3.5 hours. Then at minute 215, after hitting a structures/unions wall it couldn't pass, the run stopped being a compiler port and switched to host-compiler delegation.

The agent committed "Delegate C compilation to host compiler" at 20:51:59, replacing the entire compile pipeline with a call to Command::new("/usr/bin/cc"), and followed it five minutes later with "Use host compiler for linking". The eval climbed across the two cheating commits: 1,111 → 3,203 → 3,207 passing tests, the entire suite.

The anti-cheat strace probe was committed to the eval sandbox 2 hours and 18 minutes after this run finished. The incident is what motivated the probe. Subsequent runs cannot repeat this cheat — the probe runs at session setup and rejects any submission whose execve trace contains cc, gcc, g++, c++, clang, or clang++. The agent's own toolchain (the assembler as and linker ld) is unaffected.

The run is preserved in the dataset as ccomp-port-gpt54mini-codex-12h-0512a, scored only through the best clean eval before the later host-compiler delegation. The time, cost, and tokens shown for this run are measured at that clean eval. The transcript will be included in the public data release as a documented example of agent reasoning during a cheating attempt.
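The scoring rule applied to this run (best clean eval before the first delegating commit) can be sketched as a selection over a chronological list of evals. The Eval type, its fields, and public_eval are hypothetical names for illustration, and the sketch assumes each eval already carries a flag saying whether it was produced by a delegating commit.

```rust
use std::cmp::Reverse;

/// One recorded eval (hypothetical type); `evals` are assumed chronological.
#[derive(Debug, Clone, PartialEq)]
pub struct Eval {
    pub minute: u32,
    pub passed: u32,
    /// True if this eval was produced at or after a host-compiler delegation.
    pub delegated: bool,
}

/// Pick the eval that defines the public score: the highest pass count
/// among evals recorded before the first delegated one. Ties go to the
/// earliest eval, since min_by_key returns the first minimum.
pub fn public_eval(evals: &[Eval]) -> Option<&Eval> {
    evals
        .iter()
        .take_while(|e| !e.delegated)
        .min_by_key(|e| Reverse(e.passed))
}
```

With a clean eval at 1,111 passing tests followed by delegated evals at 3,203 and 3,207, this returns the 1,111 eval, and the reported time, cost, and tokens would be read from that same record.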