Benchmark adoption
Adopt an upstream benchmark into a BenchFlow benchmark with bench eval adopt.
What the router is
bench eval adopt is the benchmark-adoption router. It routes an external
benchmark into benchmarks/<name>/ — scaffold, codex-driven conversion, and a
parity gate — so the result is a first-class BenchFlow benchmark. It sits
upstream of evaluation: the router adopts, while bench eval create runs
the resulting tasks. Once bench eval adopt verify <name> reports
parity-confirmed, you point bench eval create at the converted tasks and run
them like any other benchmark.
(These commands were bench agent create|run|verify before 0.6; the old names
still work as deprecated aliases through 0.6 and are removed in 0.7.)
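Three subcommands form the adopt → verify loop:
- bench eval adopt init <name> writes the scaffold.
- bench eval adopt convert <source> [--name] drives the codex conversion.
- bench eval adopt verify <name> scores parity and emits the verdict.
The worked reference example is benchmarks/programbench/; the conversion
contract is benchmarks/CONVERT.md. The router
embeds both into the conversion workflow for you.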
bench eval adopt init <name>
init writes a deterministic scaffold under benchmarks/<name>/, matching
the reference layout and the CONVERT.md contract. Use --benchmarks-dir to
target a directory other than the repo’s benchmarks/:
The scaffold contains:
- benchflow.py: the converter. Its documented convert()/convert_all() entry points are NotImplementedError stubs that point at CONVERT.md step 2; you fill them in to map each source instance to a Harbor-format task directory (task.toml, instruction.md, environment/Dockerfile, tests/test.sh).
- parity_test.py: the parity harness, with --mode full | eval-parity | side-by-side (CONVERT.md steps 3–5). Side-by-side parity records the per-criterion original_verdict/adapted_verdict pairs that verify scores.
- parity_experiment.json: the recorded parity results that verify reads. The scaffold writes a status: "template" placeholder with empty conversion_parity.tasks and reward_distribution_parity.samples; you populate it from a real parity run.
- benchmark.yaml: the standard descriptor (name, conversion method, verification method, parity tallies). Fields start as TODO/0.
main.py, run_webarena_lite.py, and webarena-lite.yaml are the converter
CLI delegator, the convert-then-evaluate runner, and the BenchFlow job config
respectively.
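As a rough illustration of what filling in the converter stubs might involve, the sketch below writes the four Harbor-format files for one source instance. The SourceInstance type, the function signatures, and the contents of every generated file are assumptions made for illustration only; CONVERT.md step 2 defines the real contract.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SourceInstance:
    """Hypothetical shape of one upstream benchmark instance."""
    instance_id: str
    prompt: str
    check_command: str


def convert(instance: SourceInstance, out_root: Path) -> Path:
    """Write one Harbor-format task directory and return its path."""
    task_dir = out_root / instance.instance_id
    (task_dir / "environment").mkdir(parents=True, exist_ok=True)
    (task_dir / "tests").mkdir(parents=True, exist_ok=True)

    # Placeholder contents: the real file schemas come from CONVERT.md.
    (task_dir / "task.toml").write_text(f'name = "{instance.instance_id}"\n')
    (task_dir / "instruction.md").write_text(instance.prompt + "\n")
    (task_dir / "environment" / "Dockerfile").write_text("FROM ubuntu:24.04\n")
    (task_dir / "tests" / "test.sh").write_text(
        "#!/bin/sh\n" + instance.check_command + "\n"
    )
    return task_dir


def convert_all(instances: list[SourceInstance], out_root: Path) -> list[Path]:
    """Convert every instance in order and return the task directories."""
    return [convert(instance, out_root) for instance in instances]
```

Keeping convert() a pure function of one instance makes the mapping easy to replay during a parity run, since the same input always yields the same task tree.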
Fail-closed behavior
init refuses to overwrite an existing benchmark — re-running it is an error,
not a silent clobber:
Name validation keeps init/verify from being steered
outside benchmarks/. An uppercase or underscored name is rejected:
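The two fail-closed checks can be pictured with a minimal sketch. The slug pattern, the function names, and the exception types below are assumptions, not the router's actual implementation; the sketch only shows one way to reject non-slug names and refuse to overwrite an existing directory.

```python
import re
from pathlib import Path

# Assumed slug rule: lowercase letters and digits, separated by single hyphens.
_SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def resolve_benchmark_dir(name: str, benchmarks_dir: Path) -> Path:
    """Return benchmarks_dir/<name>, rejecting names that are not plain slugs."""
    # fullmatch rejects uppercase, underscores, slashes, dots, and empty names,
    # so the result can never point outside benchmarks_dir.
    if not _SLUG_RE.fullmatch(name):
        raise ValueError(f"invalid benchmark name: {name!r}")
    return benchmarks_dir / name


def init_benchmark(name: str, benchmarks_dir: Path) -> Path:
    """Create the benchmark directory, failing if it already exists."""
    target = resolve_benchmark_dir(name, benchmarks_dir)
    # exist_ok=False raises FileExistsError on a second run instead of clobbering.
    target.mkdir(parents=True, exist_ok=False)
    return target
```

Validating the name before building any path means traversal attempts such as ../escape fail before the filesystem is touched.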
bench eval adopt convert <source> [--name]
convert drives the conversion. It assembles an adoption prompt — the source, the
target benchmarks/<name>/ path, the adoption skills (CONVERT.md, the
programbench worked example, the parity harness), and the full embedded
CONVERT.md guide — then launches the host codex CLI to do the conversion
toward a pull request. If you omit --name, the slug is derived from the source
basename (so .../webarena becomes webarena).
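The basename rule can be sketched in a few lines. The helper name derive_name is hypothetical, and the router may normalize the result further; this only illustrates taking the last path segment of the source.

```python
from pathlib import PurePosixPath


def derive_name(source: str) -> str:
    """Derive a benchmark slug from the last path segment of `source`."""
    # Strip trailing slashes first so "dir/" still yields "dir".
    return PurePosixPath(source.rstrip("/")).name
```

With this sketch, both a URL and a local path ending in webarena produce the slug webarena.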
Use --dry-run to print the exact command the router would launch without
running it:
The codex exec
argv is constructed deterministically: it runs in the repo root
(--cd <repo>), with --skip-git-repo-check and
--sandbox workspace-write. Pass --model to set the codex driver model and
--codex-bin to point at a different codex binary.
A live run (drop --dry-run) requires codex credentials and fails closed
without them — set OPENAI_API_KEY (or CODEX_API_KEY), or run codex login
to create ~/.codex/auth.json. Without credentials convert errors before
assembling any context:
bench eval adopt verify confirms parity.
bench eval adopt verify <name>
verify is the gate that closes the loop. It reads the adopted benchmark’s
parity_experiment.json and emits a confidence verdict. The gate is parity
only: a faithful conversion must reproduce the original’s behavior on identical
inputs — including any reward-hackability the original has. It never “improves”
or sanitizes the source.
It scores two layers:
- Conversion parity (deterministic floor): every compared criterion’s converted verdict must match the original’s verdict on identical inputs.
- Reward-distribution parity (statistical layer): every legacy-vs-converted reward delta must sit within --tolerance (default 0.02).
| Verdict | Meaning |
|---|---|
| parity-confirmed | Every recorded layer agrees; high confidence that the conversion is faithful. |
| parity-divergent | A criterion disagrees or a reward delta exceeds tolerance. |
| insufficient-evidence | No recorded comparisons at all; run parity_test.py and record results first. |
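The two layers and three verdicts can be summarized in a short sketch. The function name and the nested keys criteria, legacy_reward, and converted_reward are assumptions about how the recorded JSON is laid out; only conversion_parity.tasks, reward_distribution_parity.samples, original_verdict, adapted_verdict, and the 0.02 default tolerance come from this page.

```python
def score_parity(experiment: dict, tolerance: float = 0.02) -> str:
    """Return a verdict string for a recorded parity experiment."""
    tasks = experiment.get("conversion_parity", {}).get("tasks", [])
    samples = experiment.get("reward_distribution_parity", {}).get("samples", [])

    # Flatten the per-criterion pairs; "criteria" is an assumed key.
    criteria = [c for task in tasks for c in task.get("criteria", [])]

    # Nothing recorded in either layer: there is no evidence to judge.
    if not criteria and not samples:
        return "insufficient-evidence"

    # Deterministic floor: every converted verdict matches the original.
    floor_ok = all(c["original_verdict"] == c["adapted_verdict"] for c in criteria)

    # Statistical layer: every reward delta stays within tolerance.
    # "legacy_reward" and "converted_reward" are assumed field names.
    deltas_ok = all(
        abs(s["converted_reward"] - s["legacy_reward"]) <= tolerance
        for s in samples
    )

    return "parity-confirmed" if floor_ok and deltas_ok else "parity-divergent"
```

Note that a matching fail/fail pair counts as agreement: the gate compares verdicts to each other, not to a notion of correctness.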
A freshly scaffolded benchmark has no recorded comparisons, so verify reports
insufficient-evidence and exits non-zero:
A parity-confirmed run
Populate parity_experiment.json from a parity run. verify reads
per-criterion verdicts under conversion_parity.tasks and reward samples under
reward_distribution_parity.samples:
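One way to record results into that file is sketched below. The helper name record_parity is hypothetical, and the shape of each task and sample entry is left to the caller because it is not specified here; the sketch only places the entries under the two keys that verify reads.

```python
import json
from pathlib import Path


def record_parity(path: Path, tasks: list, samples: list) -> dict:
    """Store parity results under the two keys read by verify."""
    # Start from the existing file (for example the scaffold template) if present.
    experiment = json.loads(path.read_text()) if path.exists() else {}

    experiment.setdefault("conversion_parity", {})["tasks"] = tasks
    experiment.setdefault("reward_distribution_parity", {})["samples"] = samples

    # Other fields, such as status, are left untouched in this sketch;
    # what a real parity run should set them to is not covered here.
    path.write_text(json.dumps(experiment, indent=2) + "\n")
    return experiment
```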
With every recorded layer in agreement, the verdict is
parity-confirmed and verify exits zero:
A parity-divergent run
Flip one criterion so the converted verdict no longer matches the original (here C-002’s adapted_verdict goes from fail to pass). The deterministic
floor trips, the verdict becomes parity-divergent, and verify prints a draft
GitHub issue body for the support path:
Pass --issue-out PATH to write it to a
file instead of stdout:
The --roundtrip-task structural hook
By default verify scores the recorded parity_experiment.json at the
benchmark level. Pass --roundtrip-task <task-dir> to also run the structural
round-trip conformance check on one concrete task tree (it reuses the existing
Harbor round-trip parity utility). It is opt-in because that harness needs a
concrete task directory, which the benchmark-level verdict does not require.
verify exits non-zero for parity-divergent and insufficient-evidence, and
errors if the benchmark was never adopted:
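Because the exit status carries the verdict, a script can gate on it directly. The sketch below is a hypothetical wrapper, not part of BenchFlow; it assumes the bench executable is on PATH and accepts an injectable runner so the logic can be exercised without it.

```python
import subprocess
from collections.abc import Callable


def is_parity_confirmed(name: str, run: Callable = subprocess.run) -> bool:
    """Run `bench eval adopt verify <name>` and report whether it exited zero."""
    result = run(["bench", "eval", "adopt", "verify", name])
    # Any non-zero status (divergent, insufficient evidence, or an error
    # such as a benchmark that was never adopted) is treated as "not confirmed".
    return result.returncode == 0
```

A CI job could call is_parity_confirmed("webarena-lite") and skip the evaluation step whenever it returns False.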
From adoption to evaluation
Once verify reports parity-confirmed, the benchmark is a normal BenchFlow
benchmark: run its tasks with bench eval create (see
Running benchmarks), using the job config the
scaffold generated. The router’s job ends at parity-confirmed; evaluation
takes it from there.