port-gameboy-emulator · top current-schema: 27.60% · GPT-5.5 xhigh
Port a Game Boy and Game Boy Color emulator to Rust from the SameBoy C reference. Cycle-accurate CPU, PPU, APU, timer, memory bus, MBC1/2/5 banking, OAM DMA, interrupt controller. Graded against the industry-standard ROM suites that have been the de-facto Game Boy emulator test set for two decades.
The agent starts with a scaffolded Rust project, a working CLI parser (--headless, --max-cycles N, --serial-log, --cart-ram-out, --state-out, --screenshot-on-breakpoint, --mode dmg|cgb), partially stubbed CPU/MMU/PPU modules, and a fully working joypad and interrupt controller. The full SameBoy C source sits in reference/sameboy/ as study material; the Pandocs Game Boy reference and the opcode table sit in reference/pandocs/.
The hardest single thing on this env is cycle accuracy. Most tests don't care whether the emulator's image looks roughly right — they care whether specific T-cycle-precise hardware state matches the reference at a given cycle count. A 1-cycle error in the timer's falling-edge detector, or a PPU mode-3 timing that's off by 2 dots, fails dozens of tests.
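To make the timer example concrete, here is a minimal Python model of a falling-edge detector, following the behavior documented in Pandocs: TIMA increments when the AND of the timer-enable bit and one selected bit of the 16-bit system counter goes from 1 to 0. The class and method names are illustrative and not taken from the scaffold, and the sketch leaves out the delayed TMA reload and interrupt request that follow an overflow.

```python
# Which bit of the 16-bit system counter feeds TIMA, keyed by TAC bits 1-0.
TAC_BIT = {0: 9, 1: 3, 2: 5, 3: 7}


class Timer:
    """Illustrative model only; names are not from the scaffolded project."""

    def __init__(self):
        self.counter = 0   # 16-bit system counter; DIV is its upper byte
        self.tima = 0
        self.tac = 0
        self.overflowed = False  # simplified: real hardware reloads TMA later

    def _signal(self):
        enabled = (self.tac >> 2) & 1
        bit = (self.counter >> TAC_BIT[self.tac & 3]) & 1
        return enabled & bit

    def _on_edge(self, before):
        # TIMA counts on a 1 -> 0 transition of the combined signal.
        if before == 1 and self._signal() == 0:
            self.tima = (self.tima + 1) & 0xFF
            if self.tima == 0:
                self.overflowed = True

    def tick(self, t_cycles=1):
        for _ in range(t_cycles):
            before = self._signal()
            self.counter = (self.counter + 1) & 0xFFFF
            self._on_edge(before)

    def write_div(self):
        # Writing DIV resets the whole counter, which can itself
        # produce a falling edge and an extra TIMA increment.
        before = self._signal()
        self.counter = 0
        self._on_edge(before)
```

With TAC = 0b101 (enabled, bit 3 selected), TIMA stays at 0 after 15 T-cycles and becomes 1 on the 16th. An implementation that increments TIMA on a simple modulo counter instead will agree on that case but miss the extra increment caused by a DIV write while the selected bit is high.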
Established test ROM suites, generated hardware-state programs, and control/contract suites are combined into one ordered ladder. Some cases pass/fail through serial output or cartridge RAM; many also require full hardware-state comparison against PyBoy at fixed cycle checkpoints.
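A full hardware-state comparison at a checkpoint can be pictured as a field-by-field diff between the oracle's snapshot and the emulator's. The sketch below is an assumption about the general shape of such a check, not the harness's actual code; the field names (`pc`, `a`, `tima`) are placeholders.

```python
def diff_state(expected: dict, actual: dict) -> dict:
    """Return {field: (expected, actual)} for every field that differs.

    Hypothetical helper: the real snapshot format is not described here.
    A field missing on one side shows up as None on that side.
    """
    keys = expected.keys() | actual.keys()
    return {
        k: (expected.get(k), actual.get(k))
        for k in sorted(keys)
        if expected.get(k) != actual.get(k)
    }
```

For example, `diff_state({"pc": 0x150, "tima": 5}, {"pc": 0x150, "tima": 4})` returns `{"tima": (5, 4)}`, so a single-cycle timer error surfaces as a one-field mismatch at that checkpoint.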
| # | Milestone | Suite source | Tests |
|---|---|---|---|
| 1 | CLI, Breakpoint Controls, and Artifact Contract | gb_contract | 4 |
| 2 | Blargg CPU Serial Semantics | blargg_cpu_semantic | 12 |
| 3 | Generated CPU Core State | gb_generated_cpu_core | 160 |
| 4 | Blargg CPU State and Memory Effects | blargg_cpu_state | 12 |
| 5 | Instruction Timing and Run Control | blargg_timing_semantic + blargg_timing_exact | 14 |
| 6 | Interrupts, Timers, Bus, DMA Basics | mooneye_timer + blargg_misc + dma/interrupt generated | 191 |
| 7 | Cartridge Controllers and RAM Banking | mooneye_mbc + gb_generated_mbc | 147 |
| 8 | Boot Handoff and Power-Up State | gb_generated_powerup | 80 |
| 9 | DMG PPU Timing, Rendering, Window, Sprites, OAM | mooneye ppu + acid2/mealybug DMG + SameSuite PPU | 225 |
| 10 | CGB Internals, Palettes, Banking, HDMA | acid2/mealybug CGB + gb_generated_cgb | 252 |
| 11 | MBC3 RTC and Save-Adjacent Persistence | gb_generated_mbc3 | 100 |
| 12 | Joypad, Serial, Link-Cable Register Behavior | gb_generated_io_serial | 100 |
| 13 | APU Registers, Frame Sequencer, Channels | blargg_sound + samesuite/apu + gb_generated_apu | 222 |
| 14 | Deterministic Audio Window Oracle | gb_generated_audio_window | 120 |
| 15 | Public ROM Smoke Compatibility | community_smoke | 100 |
| 16 | Strict Full-State CPU Oracle and Long-Run Stability | gb_generated_cpu + gb_generated_stability | 260 |
| | Total scored tests | | 1,999 |
Each ROM is given a secrets.token_hex(16) filename. Title bytes ($0134–$0143) are zeroed. Header checksum at $014D is recomputed. A binary that fingerprints ROMs by filename or title cannot tell which test it is running.
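The anonymization steps above can be sketched as follows. The header checksum formula is the standard one from Pandocs (start at 0, subtract each byte from $0134 through $014C and 1 more per byte, keep the low 8 bits); the function names are hypothetical, and whether the harness appends a file extension to the random name is not stated, so none is added here.

```python
import secrets


def header_checksum(rom) -> int:
    """Standard cartridge header checksum over $0134..$014C."""
    x = 0
    for addr in range(0x0134, 0x014D):
        x = (x - rom[addr] - 1) & 0xFF
    return x


def anonymize_rom(rom: bytes) -> tuple[str, bytes]:
    """Hypothetical helper mirroring the three steps described above."""
    data = bytearray(rom)
    data[0x0134:0x0144] = bytes(16)          # zero the title bytes
    data[0x014D] = header_checksum(data)     # keep the header valid
    return secrets.token_hex(16), bytes(data)  # 32 hex characters
```

Recomputing the checksum matters because the boot ROM verifies the value at $014D; a ROM with a zeroed title and a stale checksum would fail the boot check instead of running the test.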
A secrets.randbelow(16001) + 4000 T-cycle offset (4,000 to 20,000 T-cycles) is added to the reference cycle count for both the agent's run and the oracle's run. Hardcoded register-state tables are useless because the cycle offset changes every run.
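As a rough illustration of the cycle offset, the sketch below draws one offset per run and shifts every reference checkpoint by it. Only the `secrets.randbelow(16001) + 4000` expression comes from the text; the function names and the list-of-checkpoints shape are assumptions.

```python
import secrets


def draw_cycle_offset() -> int:
    # randbelow(16001) yields 0..16000, so the offset is 4000..20000 T-cycles.
    return secrets.randbelow(16001) + 4000


def shifted_checkpoints(reference_cycles, offset):
    """Hypothetical helper: the same shifted list is used for both runs."""
    return [c + offset for c in reference_cycles]
```

Because both the emulator under test and the oracle stop at the same shifted cycle counts, a correct emulator still matches, while a table of states recorded at the unshifted counts does not.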
"Passed" regardless of input — the whole suite is flagged.
Minimum cycle counts (BLARGG_MIN_CYCLES) are tabulated in eval/tests/utils.py with the intent of rejecting runs that complete faster than 50% of the reference cycle floor. The table exists; the check that consumes it is not yet implemented.
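Since the consuming check is described as not yet written, the following is only a guess at what it could look like. The table layout (test name mapped to a cycle floor), the function name, and the example key and value are all hypothetical; only the 50% threshold comes from the text.

```python
def completed_too_fast(test_name, reported_cycles, floors, fraction=0.5):
    """Return True if a run finished in under `fraction` of its cycle floor.

    `floors` stands in for a table like BLARGG_MIN_CYCLES; its real
    structure is an assumption. Tests without an entry are not rejected.
    """
    floor = floors.get(test_name)
    if floor is None:
        return False
    return reported_cycles < fraction * floor
```

With a made-up entry `{"example_test": 1_000_000}`, a run reporting 400,000 cycles would be rejected and one reporting 600,000 would be accepted.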
Cheating was detected in the May 2026 Codex runs. The issue was not direct ROM fingerprinting or a copied emulator core; it was submitted code that changed run-control and reporting behavior around eval artifact paths, serial-output shape, terminal-loop detection, and post-result cycle settling. The rows below use each run's best clean eval before the later cheating commit; time, cost, and tokens are measured at that eval.
Current-schema Codex runs, scored at the best clean eval before the later cheating commit. The later raw peaks were higher, but those post-cheat states are not counted here.
| Model | Agent | Window | Peak % | Tests passed | Milestones | Cost | Tokens | Outcome |
|---|---|---|---|---|---|---|---|---|
| GPT-5.5 xhigh | codex | 1 hr 46 min | 27.60% | 194 / 1,999 | 4 / 16 | $31 | 45 M | cheating later |
| GPT-5.4 xhigh | codex | 49 min | 18.75% | 176 / 1,999 | 3 / 16 | $5.14 | 11 M | cheating later |
These older rows used the superseded 234-test / 15-milestone schema, so they are not comparable to the current-schema rows above.
| Model | Agent | Window | Peak % | Tests passed | Milestones | Cost | Outcome |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | opencode | 8 hrs | 57.2% | 85 / 234 | 6 / 15 | $83 | eval crash |
| GPT-5.4 | opencode | 3 hrs | 29.2% | 54 / 234 | 2 / 15 | $22 | time budget |
| Grok 4.20 reasoning | opencode | 8 hrs | 5.0% | 9 / 234 | 0 / 15 | $95 | time budget |
| Grok 4.20 reasoning | opencode | 8 hrs | 0.0% | 0 / 234 | 0 / 15 | $147 | mode collapse |