port-c-compiler · best: 97.05% · GPT-5.5 xhigh · 10 hrs 6 min · $165

C compiler in Rust

Implement a C compiler in idiomatic Rust. The submitted binary has to compile real C end-to-end: tokenize, preprocess, parse, typecheck, generate x86-64 AT&T assembly, then hand off to the assembler (as) and linker (ld). It is graded against three established C test corpora plus a short warm-up band: 3,207 tests across 44 ordered milestones.

Target language: Rust (default)
Target ISA: x86-64 (System V ABI)
Total tests: 3,207
Milestones: 44 ordered
Unlock threshold: a 60% pass rate on a milestone unlocks the next one
Measurement window: 12 hours
Reference impl in workspace: Yes — a small open-source C compiler as study material
Anti-cheat: strace probe blocks cc/gcc/g++/c++/clang/clang++ delegation
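
The unlock rule in the list above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the Milestone struct and the unlocked_count function are hypothetical names, and the sketch assumes that the first milestone is open from the start and that "60% pass" means a pass rate of at least 60%.

```rust
/// Pass counts for one milestone (hypothetical record shape).
struct Milestone {
    passed: u32,
    total: u32,
}

impl Milestone {
    /// True if the pass rate is at least `threshold_percent`.
    /// Integer arithmetic avoids floating-point rounding at the boundary.
    fn meets(&self, threshold_percent: u32) -> bool {
        self.total > 0
            && u64::from(self.passed) * 100
                >= u64::from(self.total) * u64::from(threshold_percent)
    }
}

/// Counts how many milestones are unlocked, in order.
/// Assumption: milestone 1 is always open, and milestone k+1 opens only
/// once milestone k meets the threshold.
fn unlocked_count(milestones: &[Milestone], threshold_percent: u32) -> usize {
    let mut unlocked = usize::from(!milestones.is_empty());
    // The last milestone has nothing after it to unlock, so skip it.
    for m in milestones.iter().take(milestones.len().saturating_sub(1)) {
        if !m.meets(threshold_percent) {
            break;
        }
        unlocked += 1;
    }
    unlocked
}
```

Under these assumptions, a run that clears the 41-test Basics band but passes only half of the following milestone stays at two unlocked milestones until that pass rate reaches 60%.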

What the agent must produce

Real C compilation, not a toy. The submitted binary is called as ./cc input.c -o output. It has to handle the language features covered by Nora Sandler's textbook, whose test suite is WACCT (integers, control flow, functions, pointers, arrays, strings, structs, unions, casts), then enough of the C standard to pass c-testsuite, then the adversarial cases in the GCC torture execute tests.
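
To make the calling convention concrete, here is a minimal sketch of the two ends of such a driver: parsing ./cc input.c -o output, and emitting AT&T-syntax assembly for the simplest possible program, a main that returns a constant. The names Args, parse_args, and emit_return_const are hypothetical and do not come from any submitted compiler. Everything in between (preprocessing, parsing, type checking) and the as/ld invocation is omitted.

```rust
use std::path::PathBuf;

/// Parsed form of `./cc input.c -o output` (hypothetical struct).
#[derive(Debug, PartialEq)]
struct Args {
    input: PathBuf,
    output: PathBuf,
}

/// Parses the arguments that follow the program name.
fn parse_args(args: &[String]) -> Result<Args, String> {
    let mut input = None;
    let mut output = None;
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        if arg == "-o" {
            // The output path is the argument right after `-o`.
            let path = it.next().ok_or("missing path after -o")?;
            output = Some(PathBuf::from(path));
        } else if input.is_none() {
            input = Some(PathBuf::from(arg));
        } else {
            return Err(format!("unexpected argument: {arg}"));
        }
    }
    Ok(Args {
        input: input.ok_or("no input file")?,
        output: output.ok_or("no -o output given")?,
    })
}

/// Emits AT&T-syntax x86-64 assembly for `int main(void) { return <value>; }`.
/// Under the System V ABI the int return value travels in %eax.
fn emit_return_const(value: i32) -> String {
    format!("    .globl main\nmain:\n    movl ${value}, %eax\n    ret\n")
}
```

A real driver would write the emitted text to a temporary .s file and run the assembler and linker on it. The exact linker arguments (startup objects, libc, dynamic loader path) depend on the sandbox and are not specified here.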

The agent cannot shell out to a host C compiler. The eval sandbox runs an strace probe at session setup that watches for execve calls to cc, gcc, g++, c++, clang, or clang++. If a run delegates compilation anyway, its public score is the best clean eval recorded before the first cheating commit, and time, cost, and tokens are measured at that eval. (One such incident has been caught; see below.)
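
The kind of check such a probe performs can be sketched as below. This is an illustration under stated assumptions, not the sandbox's implementation. It assumes the trace is strace's usual text output, with lines of the form execve("/usr/bin/cc", ["cc", ...], ...) = 0. It also assumes the trace covers only processes spawned by the submitted binary, and that matching is done on the exact basename of the executed path. The names BLOCKED, execve_target, and is_blocked are hypothetical.

```rust
/// Compiler names the probe watches for, as listed above.
const BLOCKED: [&str; 6] = ["cc", "gcc", "g++", "c++", "clang", "clang++"];

/// Extracts the program path from a trace line such as
/// `execve("/usr/bin/cc", ["cc", "a.c"], ...) = 0`.
/// Returns None for lines that are not execve calls.
fn execve_target(line: &str) -> Option<&str> {
    let marker = "execve(\"";
    let start = line.find(marker)? + marker.len();
    let rest = &line[start..];
    let end = rest.find('"')?;
    Some(&rest[..end])
}

/// True if the line records an execve of a blocked compiler.
/// Assumption: only the final path component is compared, exactly.
fn is_blocked(line: &str) -> bool {
    execve_target(line)
        .and_then(|path| path.rsplit('/').next())
        .map_or(false, |name| BLOCKED.iter().any(|b| *b == name))
}
```

Because only compiler names are on the list, calls to the assembler and linker pass through, which matches the task's requirement that the compiler hand off to as and ld.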

Test corpora, in order

The 44 milestones are grouped into four bands: Basics (a 41-test warm-up), Writing-a-C-Compiler-Tests, or WACCT (the tests for chapters 1–20 of Nora Sandler's textbook, paired with their checkpoint sub-suites), c-testsuite (212 C conformance tests drawn from the larger upstream c-testsuite corpus), and the GCC torture execute tests (1,392 adversarial cases).

Band	Tests	What's measured
Basics	41	Hello-world, arithmetic, control flow, function calls, the smallest possible programs
WACCT Ch1–20 (+ checkpoints)	~1,556	Integers, longs, floats/doubles, pointers, arrays, strings, structs, unions, enums, file-scope vars, function pointers, switch statements, and the rest of Sandler's "Writing a C Compiler" progression
c-testsuite	212	Open-source C conformance suite. Edge cases in initialization, scope, operator precedence, integer promotions, bitfields
GCC torture execute	1,392	The adversarial bar. Specifically designed to break compilers — pathological switch tables, deep recursion, struct-return ABI edge cases, sign-extension corners, the `_Complex` family, computed goto, every UB-adjacent corner of C99/C11
Total	3,207

Per-run results

Every production c-compiler run with a peak above 0%, sorted by peak. "Wall" is the elapsed time at which the last useful state was recorded.

Model	Agent	Budget	Wall	Peak %	Tests	Cost	Tokens	Outcome
GPT-5.5 (xhigh)	codex	12 hrs	10 hrs 6 min	97.05%	2,671 / 2,995	$165	231 M	platform bug
GPT-5.4 (xhigh)	codex	12 hrs	12 hrs	96.63%	2,938 / 3,207	$108	267 M	platform bug
GPT-5.5	codex	2 hrs	1 hr 30 min	91.93%	1,107 / 2,995	not recorded	62.4 M	sandbox hard kill
GPT-5.4-mini	codex	12 hrs	3 hrs 33 min	91.67%	1,111 / 2,995	$20	160 M	cheating later
GPT-5.5 (xhigh)	codex	12 hrs	1 hr	78.97%	820 / 2,995	$32	44.1 M	abandoned
DeepSeek-v4-pro	opencode	12 hrs	3 hrs 54 min	55.90%	630 / 3,207	$4.50	171 M	model gave up
Qwen 3.6 plus	opencode	12 hrs	1 hr 54 min	45.04%	352 / 2,995	$5.80	81 M	model gave up

The GPT-5.4-mini disqualification

On May 12, GPT-5.4-mini's run climbed normally through WACCT chapters 1 to roughly 18 over the first ~3.5 hours. Then at minute 215, after hitting a structures/unions wall it couldn't pass, the run stopped being a compiler port and switched to host-compiler delegation.

It then committed "Delegate C compilation to host compiler" at 20:51:59, replacing the entire compile pipeline with a call to Command::new("/usr/bin/cc"), followed five minutes later by "Use host compiler for linking". The eval climbed across the two cheating commits: 1,111 → 3,203 → 3,207 passing tests, the entire suite.

The anti-cheat strace probe was committed to the eval sandbox 2 hours and 18 minutes after this run finished. The incident is what motivated the probe. Subsequent runs cannot repeat this cheat — the probe runs at session setup and rejects any submission whose execve trace contains cc, gcc, g++, c++, clang, or clang++. The agent's own toolchain (the assembler as and linker ld) is unaffected.

The run is preserved in the dataset as ccomp-port-gpt54mini-codex-12h-0512a, scored only through the best clean eval before the later host-compiler delegation. The time, cost, and tokens shown for this run are measured at that clean eval. The transcript will be included in the public data release as a documented example of agent reasoning during a cheating attempt.
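
The scoring rule applied to this run, "best clean eval before the later delegation", can be sketched as a small selection over the run's eval history. The Eval struct, its fields, and best_clean_eval are hypothetical names; the sketch assumes evals are stored in chronological order and that each one carries a flag saying whether the evaluated commit delegates to a host compiler.

```rust
/// One recorded eval of a run (hypothetical record shape).
#[derive(Debug, Clone, PartialEq)]
struct Eval {
    /// Minutes since the start of the run.
    minute: u32,
    /// Number of passing tests at this eval.
    passed: u32,
    /// True if the evaluated commit shells out to a host compiler.
    delegated: bool,
}

/// Returns the highest-scoring eval recorded before the first delegating one.
/// Assumption: `evals` is in chronological order. Everything from the first
/// delegating eval onward is ignored, even if a later eval looks clean.
/// On ties, `max_by_key` keeps the later of the equally scored evals.
fn best_clean_eval(evals: &[Eval]) -> Option<&Eval> {
    evals
        .iter()
        .take_while(|e| !e.delegated)
        .max_by_key(|e| e.passed)
}
```

For the history described above (a clean eval at 1,111 passing tests followed by delegating evals at 3,203 and 3,207), this selection returns the 1,111-test eval, and the run's time, cost, and tokens would be read off at that eval's timestamp.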