Game Benchmarks

Each test consists of two rounds: initial code generation and self-correction.
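
The two-round structure could be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `run_test`, `RoundResult`, and the `generate` and `check` callables are hypothetical names, and it assumes that the self-correction round receives the first round's code together with some feedback from a checker. Whether the second round is skipped when the first already passes is not stated, so the sketch always runs both.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures, not taken from the benchmark itself.
# generate(task, previous_code, feedback) -> code
Generate = Callable[[str, Optional[str], Optional[str]], str]
# check(code) -> (passed, feedback)
Check = Callable[[str], Tuple[bool, str]]


@dataclass
class RoundResult:
    name: str       # "initial" or "self-correction"
    code: str       # code produced in this round
    passed: bool    # whether the checker accepted it
    feedback: str   # checker output, e.g. an error message


def run_test(task: str, generate: Generate, check: Check) -> List[RoundResult]:
    """Run one test as two rounds and return the result of each round."""
    # Round 1: initial code generation from the task description alone.
    code = generate(task, None, None)
    passed, feedback = check(code)
    results = [RoundResult("initial", code, passed, feedback)]

    # Round 2: self-correction. Assumption: the generator is shown its
    # previous code and the checker's feedback from round 1.
    code = generate(task, code, feedback)
    passed, feedback = check(code)
    results.append(RoundResult("self-correction", code, passed, feedback))
    return results
```

Keeping one result per round makes it possible to report how often the second round repairs a failure from the first.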