CLI reference
BenchFlow uses a resource-verb pattern: bench <resource> <verb>.
bench agent
bench agent is agent management only. bench agent list and bench agent show operate on registered AI agents (Claude Code, Gemini CLI, Codex, OpenHands, …), the programs that solve tasks. Onboarding a third-party benchmark (scaffold → drive → parity-gate a benchmarks/<name>/ adoption) is a separate workflow under bench eval adopt (init → convert → verify). The legacy bench agent create|run|verify commands still work as hidden deprecated aliases through 0.6, printing a one-line notice; they are removed in 0.7.
bench agent list
List all registered agents with their protocol and native/default auth requirements. Provider-prefixed models may use provider-specific credentials; Azure Foundry models use AZURE_API_KEY plus AZURE_API_ENDPOINT.
bench agent show
Show details for a specific agent, including native/default auth and a note about provider-specific credentials.
bench eval adopt
Bring a third-party benchmark into the environment framework: scaffold a benchmarks/<name>/ package, drive the codex CONVERT.md conversion, then
parity-gate it (init → convert → verify). These commands were previously
bench agent create|run|verify, which still work as hidden deprecated aliases
through 0.6 (they print a one-line notice and are removed in 0.7). See
Benchmark adoption for the full walkthrough.
bench eval adopt init
Scaffold benchmarks/<name>/ for a new benchmark adoption. The layout mirrors
the reference benchmark benchmarks/programbench/ and the contract in
benchmarks/CONVERT.md: it writes
benchflow.py (converter), main.py, parity_test.py, run_<name>.py,
<name>.yaml, benchmark.yaml, parity_experiment.json (status template),
README.md, and __init__.py. It is fail-closed: the slug is validated
(lowercase, leading letter, single internal hyphens, max 64 chars) and the
command refuses to overwrite an existing benchmark directory.
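The slug rules above can be sketched as a small validator. This is a minimal illustration, not BenchFlow's actual check: the function name is_valid_slug is hypothetical, and allowing digits after the leading letter is an assumption the reference does not confirm.

```python
import re

# Assumed pattern: a lowercase letter first, then lowercase letters or digits,
# with single hyphens allowed only between alphanumeric runs.
# Whether digits are accepted is an assumption, not stated in the reference.
_SLUG_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")
MAX_SLUG_LEN = 64


def is_valid_slug(slug: str) -> bool:
    """Return True when slug is lowercase, starts with a letter, uses only
    single internal hyphens, and is at most 64 characters long."""
    return len(slug) <= MAX_SLUG_LEN and _SLUG_RE.fullmatch(slug) is not None
```

Under these assumptions, programbench and my-bench pass, while my--bench, -bench, bench-, and MyBench are rejected.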
| Flag | Default | Description |
|---|---|---|
| --benchmarks-dir | repo benchmarks/ | Target benchmarks/ directory |
bench eval adopt convert
Drive the CONVERT.md adoption workflow by launching the host codex CLI.
The command assembles the adoption context (the source, the target
benchmarks/<name>/ path, the adoption skills, and the embedded
benchmarks/CONVERT.md guide) and runs codex exec against the repo root to
drive the conversion toward a benchmarks/<name>/ pull request. It is
fail-closed on credentials: codex needs OPENAI_API_KEY (or CODEX_API_KEY)
in the environment, or a ~/.codex/auth.json from codex login, otherwise the
command exits before assembling any context. Use --dry-run to print the exact
launch command without running it (no credentials required). When --name is
omitted the slug is derived from the source basename.
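The credential precondition described above could be approximated as follows. This is a sketch under stated assumptions: has_codex_credentials is a hypothetical helper, and treating an empty environment variable as missing is a guess about behavior the reference does not specify.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def has_codex_credentials(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> bool:
    """Return True when any credential source named in the reference exists."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    # Either API key variable is enough (empty values count as missing here,
    # which is an assumption).
    if env.get("OPENAI_API_KEY") or env.get("CODEX_API_KEY"):
        return True
    # Otherwise fall back to the file written by codex login.
    return (home / ".codex" / "auth.json").is_file()
```

A caller that wants the same fail-closed behavior would check this before doing any other work and exit early when it returns False.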
| Flag | Default | Description |
|---|---|---|
| --name | derived from source | Benchmark slug (default: from source basename) |
| --model | codex default | Model for the codex driver |
| --dry-run | false | Print the launch command, do not run |
| --codex-bin | codex | Host codex binary |
| -c, --codex-config | — | Codex config override as key=value, passed through to codex as -c key=value; repeatable. Use it to work around host ~/.codex/config.toml drift without editing the file, e.g. -c service_tier=flex when an installed codex version rejects a stale value. |
bench eval adopt verify
Run the parity gate for an adopted benchmark and emit a confidence verdict. It reads benchmarks/<name>/parity_experiment.json and scores two layers: a
deterministic conversion-faithfulness floor (every compared criterion’s
converted verdict must match the original’s verdict on identical inputs) and a
statistical reward-distribution layer (every legacy-vs-converted reward delta
must sit within --tolerance). The gate is parity-only — a faithful conversion
reproduces the original’s behavior, including any reward-hackability the source
has; it never “improves” or sanitizes the source. The verdict is one of
parity-confirmed, parity-divergent, or insufficient-evidence (no recorded
comparisons). On any non-confirmed verdict the command exits non-zero and emits
a draft GitHub issue body for human support — printed to stdout, or written to
--issue-out. The draft is never filed automatically. Pass --roundtrip-task
to also run the structural round-trip conformance check on a concrete task
directory.
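The two-layer scoring and the three verdicts can be illustrated with a short sketch. The input shapes below (pairs of verdicts and pairs of rewards) are hypothetical stand-ins for whatever parity_experiment.json actually records, and treating a delta exactly equal to the tolerance as passing is an assumption.

```python
from typing import Iterable, Tuple

# Hypothetical record shapes; the real parity_experiment.json schema may differ.
CriterionPair = Tuple[bool, bool]   # (original verdict, converted verdict)
RewardPair = Tuple[float, float]    # (legacy reward, converted reward)


def parity_verdict(
    criteria: Iterable[CriterionPair],
    rewards: Iterable[RewardPair],
    tolerance: float = 0.02,
) -> str:
    """Combine the deterministic floor and the statistical layer."""
    criteria = list(criteria)
    rewards = list(rewards)
    if not criteria and not rewards:
        # No recorded comparisons at all.
        return "insufficient-evidence"
    # Floor: every converted verdict must equal the original verdict.
    floor_ok = all(original == converted for original, converted in criteria)
    # Statistical layer: every reward delta must sit within the tolerance
    # (inclusive comparison is an assumption).
    stats_ok = all(abs(legacy - converted) <= tolerance for legacy, converted in rewards)
    return "parity-confirmed" if floor_ok and stats_ok else "parity-divergent"
```

In this sketch a single mismatched criterion or a single out-of-tolerance reward delta is enough to produce parity-divergent, which matches the "every" wording above.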
By default the gate scores the recorded parity_experiment.json — fast, but
it trusts an artifact the conversion produced about itself. Pass --rerun to
independently re-execute parity_test.py --mode side-by-side and score its
fresh output instead. --rerun is fail-closed: a missing/failing parity_test.py,
a timeout, or output that is not in the scoreable parity_experiment.json shape
all exit non-zero (rather than silently reporting insufficient-evidence).
| Flag | Default | Description |
|---|---|---|
| --benchmarks-dir | repo benchmarks/ | Target benchmarks/ directory |
| --tolerance | 0.02 | Max abs reward delta (statistical layer) |
| --issue-out | — | Write the divergence issue draft to this path instead of stdout |
| --roundtrip-task | — | Also run the structural round-trip check on this task dir |
| --rerun | false | Re-execute parity_test.py --mode side-by-side and score its fresh output instead of the recorded parity_experiment.json |
bench eval
bench eval create
Create and run an evaluation. Use it for YAML configs and batch runs; it also accepts a single task directory.
| Flag | Default | Description |
|---|---|---|
| --config | — | YAML config file |
| --tasks-dir | — | Local task dir (single task with task.toml, or parent of many) |
| -d, --dataset | — | Registry dataset to run as <name>@<version> (e.g. skillsbench@1.1). Resolves the pinned snapshot from the registry, clones tasks at their pinned commit, verifies each task’s sha256 content digest, and checks the dataset’s bench_version range against the installed benchflow. Each result.json/config.json is stamped with dataset_name, dataset_version, and the task’s task_digest. |
| --registry | skillsbench registry | Dataset registry JSON URL or local file. Only valid with --dataset. |
| --source-repo | — | Remote repo as org/repo (e.g. benchflow-ai/skillsbench) |
| --source-path | — | Subpath within the repo (e.g. tasks) |
| --source-ref | — | Branch or tag to clone (e.g. main) |
| --source-env | — | Hosted environment source (e.g. primeintellect/general-agent) |
| --source-env-version | — | Hosted environment version |
| --source-env-arg | — | Hosted environment argument as KEY=VALUE; repeatable |
| --source-env-num-examples | 1 | Number of hosted environment examples |
| --source-env-rollouts-per-example | 1 | Rollouts per hosted environment example |
| --source-env-max-tokens | 1024 | Max tokens for hosted environment model calls |
| --source-env-temperature | 0.0 | Temperature for hosted environment model calls |
| --source-env-sampling-arg | — | Verifiers sampling argument as KEY=VALUE; repeatable (for example reasoning_effort=minimal) |
| --agent | claude-agent-acp | Agent name |
| --model | Agent default | Model ID |
| --reasoning-effort | — | Agent reasoning/thinking effort when the agent exposes one (e.g. max) |
| --sandbox | docker | Sandbox: docker, daytona, or modal |
| --usage-tracking | auto | Token usage telemetry policy: auto, required, or off |
| --environment-manifest | — | Path to an Environment-plane manifest (environment.toml); applied to every rollout in the batch |
| --prompt | instruction.md | Prompt to send to the agent; repeatable for multi-prompt runs |
| --concurrency | 4 | Max concurrent tasks (batch mode only) |
| --build-concurrency | --concurrency | Max concurrent docker image builds; set lower (e.g. 8) when --concurrency is high to avoid overwhelming the docker daemon |
| --worker-concurrency | — | Run batch eval through isolated worker subprocesses, each with at most this many concurrent tasks; --concurrency remains the aggregate target |
| --worker-retries | 1 | Retry a crashed worker shard this many times, resuming its jobs dir |
| --worker-start-stagger-sec | 1.0 | Seconds to stagger worker starts to avoid Daytona connection storms |
| --agent-idle-timeout | (built-in default) | Abort ACP prompts after this many idle seconds; 0 disables idle detection |
| --jobs-dir | jobs | Output directory |
| --sandbox-user | agent | Sandbox user (null for root) |
| --sandbox-setup-timeout | 120 | Timeout in seconds for sandbox user setup |
| --skills-dir | — | Advanced custom skills directory; valid only with --skill-mode with-skill. Omit it to use each task’s environment/skills. |
| --skill-mode | no-skill | Skill mode: no-skill, with-skill, or self-gen |
| --skill-creator-dir | — | Path to a skill-creator directory (or a skills root containing it); used when --skill-mode self-gen |
| --self-gen-no-internet | false | Disable web tools for the self-generated skill run |
| --agent-env | — | Agent environment variable as KEY=VALUE; repeatable |
| --include | — | Only run these task names; repeatable (e.g. --include jax-computing-basics --include data-to-d3) |
| --exclude | — | Skip these task names; repeatable (e.g. --exclude quantum-numerical-simulation) |
| --loop-strategy | — | Wrap each rollout in a loop, e.g. verify-retry:k=3,feedback=names or self-review:k=3 (omit for single-shot) |
| --ignore-bench-version | false | With --dataset, skip the dataset’s bench_version compatibility gate |
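The <name>@<version> form accepted by --dataset can be split with a few lines of Python. This is only a sketch of the string format: DatasetRef and parse_dataset_ref are hypothetical names, and the registry lookup, clone, and digest verification described in the table are not modeled.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    """A dataset reference as written on the command line."""
    name: str
    version: str


def parse_dataset_ref(spec: str) -> DatasetRef:
    """Split a <name>@<version> string, rejecting missing parts."""
    name, sep, version = spec.partition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected <name>@<version>, got {spec!r}")
    return DatasetRef(name, version)
```

For the example in the table, parse_dataset_ref("skillsbench@1.1") yields the name skillsbench and the version 1.1.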
--agent-env BENCHFLOW_SKILL_NUDGE=name. See
Architecture: skill loading for how
with-skill mode is registered with each agent and how the nudge modes differ.
Daytona batch runs collect provider token/cost telemetry by default with a
sandbox-local LiteLLM gateway. Use --usage-tracking required when missing telemetry
should fail the rollout, or --usage-tracking off for recovery runs that should
leave provider traffic untouched.
--source-env is for external hosted environment hubs. The first supported
runner is PrimeIntellect / Verifiers: BenchFlow preserves the hosted identity
(env_uid, hub_url), installs the versioned package into an isolated local
virtual environment, and runs vf-eval. --sandbox remains the BenchFlow task
sandbox selector for local/repo task sources; Verifiers source environments own
their own harness and sandbox behavior. --model is passed to the Verifiers
model endpoint; use a model id available to that provider. Provider-specific
sampling options are not inferred; pass them explicitly with
--source-env-sampling-arg.
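Several flags in this section (--source-env-arg, --source-env-sampling-arg, --agent-env) take repeatable KEY=VALUE pairs. A minimal sketch of collecting such pairs is shown here; parse_key_values is a hypothetical helper, and both the last-one-wins rule for duplicate keys and keeping values as plain strings are assumptions.

```python
from typing import Dict, Iterable


def parse_key_values(pairs: Iterable[str]) -> Dict[str, str]:
    """Turn repeated KEY=VALUE strings into a dictionary."""
    parsed: Dict[str, str] = {}
    for pair in pairs:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        parsed[key] = value  # assumption: a later duplicate key overrides
    return parsed
```

For instance, passing reasoning_effort=minimal, the example from the table, produces a one-entry mapping from reasoning_effort to minimal.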
bench eval list
List completed evaluations from a jobs directory.
bench eval metrics
Collect and display metrics (pass/fail/score, memory score, tool calls, duration) from a jobs directory. Use --json for machine-readable output.
bench eval view
Serve a trial trajectory viewer in the browser for a rollout or job directory.
bench skills
bench skills list
List skills discovered under the default skills roots (or --dir).
bench skills eval
Evaluate a skill against its evals.json test cases.
bench tasks
bench tasks init
Scaffold a new benchmark task.
| Flag | Default | Description |
|---|---|---|
| --format | task-md | Task format: task-md (native single-document) or legacy (split task.toml + instruction.md layout) |
bench tasks check
Validate a task directory (task.md or legacy task.toml + instruction.md, environment/Dockerfile, verifier/ or legacy tests/).
With --level, validation runs at a chosen depth: schema, structural,
runtime-capability, publication-grade, acceptance, or acceptance-live.
Acceptance-level errors such as
acceptance validation requires benchflow.evidence mapping refer to the
benchflow.evidence schema documented in the “Assets, Provenance, And
Evidence” section of docs/task-standard.md.
bench tasks migrate
Convert a legacy task.toml + instruction.md task into the unified
task.md format. By default the legacy files are kept alongside the new
task.md.
| Flag | Default | Description |
|---|---|---|
| --overwrite | false | Replace an existing task.md |
| --remove-legacy | false | Delete split files and promote tests/solution aliases after task.md is verified |
bench tasks normalize
Expand minimal task.md authoring profiles into the canonical task.md
form. Prints the normalized document to stdout unless told otherwise.
| Flag | Default | Description |
|---|---|---|
| --output, -o | — | Write normalized task.md to this path instead of stdout |
| --write | false | Replace task.md in place with the normalized canonical form |
bench tasks export
Export a task.md task to a Harbor/Pier-compatible split layout, with a
compatibility loss report written to compatibility/export-report.json in
the export directory.
Arguments: TASK_DIR (task directory to export) and optional OUTPUT_DIR
(destination split-layout directory; may be omitted with --report-only).
| Flag | Default | Description |
|---|---|---|
| --target | harbor | Compatibility target: harbor or pier |
| --overwrite | false | Replace an existing export directory |
| --report-only | false | Print the compatibility loss report without writing files |
bench tasks digest
Compute the content digest that pins a task’s files, independent of git: the sha256 the dataset registry keys on (matches the digests bench eval create -d
verifies and the task_digest stamped into every result.json). Recognizes
both legacy task.toml tasks and native task.md tasks. Given a single task
directory it prints the digest; given a directory of tasks it prints one
<name> <digest> line per task. Output goes to stdout via echo (not Rich), so
it is safe to pipe into machine-readable tooling.
Argument: PATH (a task directory, or a directory of task directories).
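To show what a git-independent content digest can look like, here is a generic sketch that hashes relative paths and file bytes in a stable order. It is not BenchFlow's algorithm and will not reproduce the digests that bench tasks digest prints; content_digest and its framing of each file (path, separator, length, bytes) are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def content_digest(task_dir: Path) -> str:
    """Return a sha256 hex digest over every file under task_dir.

    Illustrative scheme only: files are visited in sorted relative-path
    order, and each contributes its path, a NUL separator, its length,
    and its bytes, so renames and edits both change the result.
    """
    hasher = hashlib.sha256()
    files = [p for p in task_dir.rglob("*") if p.is_file()]
    for path in sorted(files, key=lambda p: p.relative_to(task_dir).as_posix()):
        rel = path.relative_to(task_dir).as_posix()
        data = path.read_bytes()
        hasher.update(rel.encode("utf-8") + b"\0")
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return hasher.hexdigest()
```

Because only paths and bytes are hashed, two copies of the same task produce the same value regardless of where they live or which commit they came from.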
bench tasks generate
Generate benchmark task directories from real agent traces.
| Flag | Default | Description |
|---|---|---|
| --from-local | — | Generate from local Claude Code sessions |
| --from-file | — | Generate from a JSONL trace file |
| --from-hf | — | Generate from a HuggingFace dataset ID or alias |
| --output | tasks | Output directory for generated tasks |
| --projects-dir | ~/.claude/projects/ | Claude Code projects directory |
| --project | — | Filter local sessions by project path substring |
| --format | auto | Trace format override |
| --split | train | HuggingFace dataset split |
| --max-rows | 100 | Max rows to download from HuggingFace |
| --limit | 20 | Max traces to process |
| --min-steps | 2 | Minimum steps per trace |
| --outcome | — | Filter by outcome: success, failure, unknown |
| --author | benchflow-traces | Author name for generated task metadata |
| --task-format | task-md | Generated task package format: task-md or legacy |
| --dry-run | false | Preview traces without generating tasks |
bench tasks list-sources
List known HuggingFace trace datasets. The aliases listed here can be passed to bench tasks generate --from-hf.
bench sandbox
Local sandbox lifecycle: provision a task on a docker/daytona/modal backend, list active sandboxes, and reap stale ones.
bench sandbox create
Create an environment object from a task directory. This validates environment construction but does not start the sandbox.
bench sandbox list
List active local (Daytona) sandboxes.
bench sandbox cleanup
Clean up orphaned Daytona sandboxes. By default this deletes sandboxes older than 24 hours; use --dry-run to preview what would be deleted.
BUILD_FAILED are reaped sooner than healthy ones, and
an idle-activity guard means concurrent live runs are never reaped). Set
BENCHFLOW_DAYTONA_AUTO_REAP to any of 0/false/no/off (case-insensitive)
to disable that automatic pass and rely on the manual command above.
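The opt-out values listed above can be interpreted with a small helper. This is a sketch rather than BenchFlow's code: auto_reap_enabled is a hypothetical name, and treating an unset variable as enabled and stripping surrounding whitespace are assumptions.

```python
import os
from typing import Mapping, Optional

# The four disabling values named in the reference, compared case-insensitively.
_DISABLE_VALUES = {"0", "false", "no", "off"}


def auto_reap_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False only when BENCHFLOW_DAYTONA_AUTO_REAP holds a disabling value."""
    env = os.environ if env is None else env
    value = env.get("BENCHFLOW_DAYTONA_AUTO_REAP")
    if value is None:
        return True  # assumption: unset leaves the automatic pass on
    return value.strip().lower() not in _DISABLE_VALUES
```

With this reading, OFF, No, and 0 all disable the automatic pass, while any other value leaves it enabled.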
bench environment (deprecated)
bench environment is a hidden deprecated alias group, removed in 0.7. The
local lifecycle moved to bench sandbox (create/list/cleanup)
and hosted-provider browsing to bench hub env. The old
bench environment create|list|cleanup and show|inspect (plus list --provider/--hub) still work, each printing a one-line stderr notice.
bench hub
External environment hubs: compatibility checks (check) and browsing a hosted
provider’s environments (env).
bench hub env
Read-only browsing of a hosted provider’s environments (PrimeIntellect “Environments”). To run one, use bench eval create --source-env.
bench hub check
Inventory or structurally check representative tasks from an environment hub’s registry. Defaults to an inventory pass against the public Harbor registry JSON.
| Flag | Default | Description |
|---|---|---|
| --registry | Harbor public registry URL | Harbor registry JSON URL or local file |
| --tasks-per-dataset | 2 | Representative tasks selected per dataset |
| --level | inventory | Compatibility level: inventory or check |
| --out | — | Optional JSONL output path |
| --cache-dir | .cache/hub/harbor | Cache directory for sparse clones |
| --limit | — | Optional cap on selected task refs |
YAML Config Format
Batch config with skills and skill nudge
Multi-scene (BYOS skill generation)
Use the Python API for multi-scene experiments. bench eval create --config is for
batch job configs; scene configs are loaded with benchflow._utils.yaml_loader or built
directly in Python.
bench continue
Resume a previous, unfinished (timed-out) openhands run to completion via
record-replay. Standalone — it does not touch the normal run path. See
Continuing timed-out runs for the full guide.
Flags: --model (override the live-continuation model; defaults to the
original run’s model), --timeout, --output, --require-timeout,
--strict-divergence, --replay-only (rebuild via replay and stop at the
cut-point — no live model or API key needed), and --proxy-mode (replay
proxy placement: auto, host, or sandbox; default auto uses
sandbox-local replay for Daytona/Modal and host replay for Docker).
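The auto rule for --proxy-mode stated above maps onto a short lookup. The function name resolve_proxy_mode is hypothetical, and raising on unknown values is an assumption about error handling.

```python
def resolve_proxy_mode(proxy_mode: str, sandbox: str) -> str:
    """Resolve auto to a concrete replay proxy placement."""
    if proxy_mode in ("host", "sandbox"):
        # An explicit choice is used as given.
        return proxy_mode
    if proxy_mode != "auto":
        raise ValueError(f"unknown proxy mode: {proxy_mode!r}")
    if sandbox in ("daytona", "modal"):
        return "sandbox"  # sandbox-local replay for Daytona/Modal
    if sandbox == "docker":
        return "host"     # host replay for Docker
    raise ValueError(f"unknown sandbox: {sandbox!r}")
```

So auto resolves to sandbox for a Daytona or Modal run and to host for a Docker run, while an explicit host or sandbox value is passed through unchanged.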
bench continue-batch
Continue all timed-out OpenHands runs found under a directory tree. Discovers run folders (config.json + trajectory/llm_trajectory.jsonl) recursively,
continues each, and prints a JSON batch summary (exits 1 if any continuation
failed).
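The recursive discovery step can be pictured as a directory walk that keeps folders holding both marker files. This sketch covers only the file-layout test named above; discover_run_folders is a hypothetical helper, and it does not check whether a run actually timed out.

```python
from pathlib import Path
from typing import List


def discover_run_folders(root: Path) -> List[Path]:
    """Find folders under root that contain config.json and
    trajectory/llm_trajectory.jsonl."""
    runs = []
    for config in root.rglob("config.json"):
        run_dir = config.parent
        if (run_dir / "trajectory" / "llm_trajectory.jsonl").is_file():
            runs.append(run_dir)
    # Sort for a stable, reproducible order.
    return sorted(runs)
```

A folder with config.json but no trajectory file is skipped, so partially written job directories do not show up as candidates.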
| Flag | Default | Description |
|---|---|---|
| --tasks-dir | — | Directory holding task sources; required unless the recorded task path exists |
| --model | original run’s model | Override the live-continuation model |
| --timeout | — | Wall-clock budget per continuation |
| --output | — | Output jobs dir for continued runs |
| --concurrency | 100 | Maximum number of continuation runs in flight |
| --limit | — | Limit discovered timeout folders |
| --strict-divergence | false | Abort a run if replay leaves the original rails |
| --proxy-mode | auto | Replay proxy placement: auto, host, or sandbox |