RevengeBench

Pipeline

Observation

Passive phase: the learner inspects the game traces of the hidden policy playing against a diverse set of opponents sampled from a policy pool.

Intervention

Active phase: the learner designs probe opponents to elicit specific behaviour from the hidden policy and disambiguate competing hypotheses.

Evaluation

Test phase: the learner submits one executable policy, scored by how often it picks the same action as the hidden policy on held-out trajectories.
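The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual scorer: the function name `action_agreement`, the policy-as-callable interface, and the toy states and actions are all hypothetical.

```python
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable


def action_agreement(
    candidate: Callable[[State], Action],
    states: Sequence[State],
    target_actions: Sequence[Action],
) -> float:
    """Fraction of held-out states where the candidate picks the target's recorded action."""
    if len(states) != len(target_actions):
        raise ValueError("states and target_actions must have the same length")
    if not states:
        raise ValueError("need at least one held-out state")
    matches = sum(candidate(s) == a for s, a in zip(states, target_actions))
    return matches / len(states)


# Toy held-out data (hypothetical): the hidden policy alternates left/right.
held_out_states = [0, 1, 2, 3]
hidden_actions = ["left", "right", "left", "right"]

# A submitted policy that always moves left agrees on two of the four states.
print(action_agreement(lambda s: "left", held_out_states, hidden_actions))  # 0.5
```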

Arenas

Five code-based arenas from CodeClash, spanning four programming languages and a range of game mechanics.

BattleSnake

Halite

Poker

RoboCode

RobotRumble

Findings

Probing helps stronger models

The learner can run its own experiments, designing opponents to probe the hidden bot. We measure whether probing improves recovery, and for which models.

Probing is front-loaded: probe counts peak in the early rounds and decline as the agent converges on a hypothesis.
Probing helps: it improves recovery in 16 of 20 model–arena pairs, but the gains are concentrated in the stronger models.
Weaker models design worse probes: stronger models craft targeted experiments to expose the target's decision rules. Weaker models fall back on trivial probes that do nothing and reveal little.

Recovered strategies are actionable intelligence

To test whether a recovered strategy is useful, we ask an LLM agent to write a policy that plays against the original target. We give it one of three levels of intel about the target: just the game rules (blind), our pipeline's reconstruction (recovered), or the ground-truth code (oracle). We measure the win rate gain: the increase in win rate relative to the blind baseline. Across 5 models, 4 of the 5 arenas, and 720 tournaments:

Recovered strategies help in designing counter-strategies: oracle > recovered > blind across all five models.
Weaker models benefit more: there is a negative correlation between blind win rate and how much the recovered policy helps.
Stronger models catch up: the recovered-policy advantage shrinks over rounds for strong models as their blind play improves.
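As a rough illustration of the win rate gain, the sketch below assumes the gain is the absolute difference between a condition's win rate and the blind win rate; the text does not say whether the increase is absolute or relative, so treat that choice as an assumption. The function names and the win counts are made up for the example.

```python
def win_rate(wins: int, games: int) -> float:
    """Share of games won against the original target."""
    if games <= 0:
        raise ValueError("games must be positive")
    return wins / games


def win_rate_gain(condition_rate: float, blind_rate: float) -> float:
    """Assumed definition: absolute increase in win rate over the blind baseline."""
    return condition_rate - blind_rate


# Hypothetical tallies for one model in one arena (not results from the benchmark).
blind = win_rate(40, 100)
recovered = win_rate(55, 100)
oracle = win_rate(70, 100)

gains = {
    "recovered": win_rate_gain(recovered, blind),
    "oracle": win_rate_gain(oracle, blind),
}

# The ordering oracle > recovered > blind corresponds to
# gains["oracle"] > gains["recovered"] > 0.
print({name: round(gain, 2) for name, gain in gains.items()})
```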

Stronger agents act, weaker agents look

How an agent spends its turns predicts how well it recovers the strategy. We classify each command as inspecting (reading files and state), executing (running and testing code), writing, or probing.

Act vs. look: strong agents test hypotheses, while weak agents gather context. GPT-5.5 mostly executes, whereas GPT-5.4-mini mostly reads.
Strong agents are busier: GPT-5.5 (medium) issues roughly 2.5× as many commands per target as GPT-5.4-mini before submitting.
Everyone under-probes: active experimentation is the rarest action type for nearly every model, even though it pays off.
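One way to turn labelled commands into the action profile discussed above is to compute each category's share of an agent's commands. The sketch below assumes the commands have already been labelled with one of the four categories; how that labelling is done is not covered here, and the `action_shares` helper and the example log are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

CATEGORIES = ("inspecting", "executing", "writing", "probing")


def action_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Share of an agent's commands that falls into each of the four categories."""
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no commands to summarise")
    # Counter returns 0 for categories that never occur.
    return {category: counts[category] / total for category in CATEGORIES}


# Hypothetical command log for one agent on one target.
log = ["inspecting"] * 5 + ["executing"] * 3 + ["writing", "probing"]
shares = action_shares(log)
print(shares)
print("rarest action type:", min(shares, key=shares.get))
```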

Protocol

RevengeBench operationalises an inverse problem in code space: given only behavioural traces of an opaque target agent in a programming-game arena, can a learner reconstruct a runnable program that reproduces its decisions? Because targets are themselves executable, hypotheses can be scored mechanically against ground truth, a property that behavioural inverse problems normally lack.

Targets: the 15 strongest policies by Elo in each arena, extracted from CodeClash tournaments (75 in total).
Opponent pool: 20 opponents sampled each round from the remaining pool.
Starter policy: arena-specific naive baseline that every learner edits from.
Protocol: closed loop of observation, intervention through probe opponents, hypothesis formulation, and evaluation. 5 rounds with persistent memory; best round reported for each model.
Probe budget: 5 probe opponents per round in the intervention regime.
Harness: mini-SWE-agent.
Metric: distance reduction $$\Delta = \frac{D_0 - D_R}{D_0}$$ where $D_0$ and $D_R$ are the mean action distances to the hidden target of the starter policy $\hat{\pi}_0$ and the final hypothesis $\hat{\pi}_R$, respectively. Reporting $\Delta$ controls for differences in baseline difficulty across targets and arenas.
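The metric and the best-round reporting can be sketched as follows. This is an illustrative reading rather than the benchmark's code: it assumes the mean action distances are already computed, and that "best round" means the round with the highest $\Delta$. The function names and the distances are hypothetical.

```python
from typing import Sequence


def distance_reduction(d0: float, dr: float) -> float:
    """Delta = (D_0 - D_R) / D_0, the relative drop in mean action distance."""
    if d0 <= 0:
        raise ValueError("starter distance D_0 must be positive")
    return (d0 - dr) / d0


def best_round_delta(d0: float, per_round_distances: Sequence[float]) -> float:
    """Assumed reporting rule: the highest Delta over the rounds."""
    if not per_round_distances:
        raise ValueError("need at least one round")
    return max(distance_reduction(d0, d) for d in per_round_distances)


# Hypothetical distances: starter policy, then the hypothesis after each of 5 rounds.
starter_distance = 0.8
round_distances = [0.6, 0.5, 0.4, 0.45, 0.42]

print(best_round_delta(starter_distance, round_distances))
```

Under this definition, $\Delta = 0$ means the hypothesis is no closer to the target than the starter policy, $\Delta = 1$ means it reproduces the target's actions exactly, and a negative value means the learner moved further away.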

Citation

If you find RevengeBench useful, please cite us as:

@article{revengebench_2026,
  title   = {RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments},
  author  = {Babak Rahmani and Sebastian Dziadzio and Joschka Strüber and Sergio Hernández Gutiérrez and Matthias Bethge},
  year    = {2026},
  journal = {arXiv preprint arXiv:2606.26094},
  url     = {https://arxiv.org/abs/2606.26094},
}