CLI reference

BenchFlow uses a resource-verb pattern: bench <resource> <verb>.

bench --version

bench agent

bench agent covers agent management only. bench agent list and bench agent show operate on registered AI agents (Claude Code, Gemini CLI, Codex, OpenHands, …), the programs that solve tasks. Onboarding a third-party benchmark (scaffolding, driving, and parity-gating a benchmarks/<name>/ adoption) is a separate workflow under bench eval adopt (init → convert → verify). The legacy bench agent create|run|verify commands still work as hidden deprecated aliases through 0.6, each printing a one-line notice; they are removed in 0.7.

bench agent list

List all registered agents with their protocol and native/default auth requirements. Provider-prefixed models may use provider-specific credentials; Azure Foundry models use AZURE_API_KEY plus AZURE_API_ENDPOINT.

bench agent list

bench agent show

Show details for a specific agent, including native/default auth and a note about provider-specific credentials.

bench agent show gemini

bench eval adopt

Bring a third-party benchmark into the environment framework: scaffold a benchmarks/<name>/ package, drive the codex CONVERT.md conversion, then parity-gate it (init → convert → verify). These commands were previously bench agent create|run|verify, which still work as hidden deprecated aliases through 0.6 (they print a one-line notice and are removed in 0.7). See Benchmark adoption for the full walkthrough.

bench eval adopt init

Scaffold benchmarks/<name>/ for a new benchmark adoption. The layout mirrors the reference benchmark benchmarks/programbench/ and the contract in benchmarks/CONVERT.md: it writes benchflow.py (converter), main.py, parity_test.py, run_<name>.py, <name>.yaml, benchmark.yaml, parity_experiment.json (status template), README.md, and __init__.py. It is fail-closed: the slug is validated (lowercase, leading letter, single internal hyphens, max 64 chars) and the command refuses to overwrite an existing benchmark directory.

bench eval adopt init my-bench
bench eval adopt init my-bench --benchmarks-dir ./benchmarks

Flag	Default	Description
`--benchmarks-dir`	repo `benchmarks/`	Target benchmarks/ directory
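
The slug rules above (lowercase, leading letter, single internal hyphens, max 64 chars) can be pre-checked before calling the command. The following is a minimal sketch, not BenchFlow's own validator: the regex, the helper name is_valid_slug, and the choice to allow digits after the first character are all assumptions.

```python
import re

# Assumed pattern: the reference states the rules but not the exact regex.
# Allowing digits after the leading letter is an assumption of this sketch.
_SLUG_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")
MAX_SLUG_LEN = 64


def is_valid_slug(slug: str) -> bool:
    """Hypothetical helper: True if slug meets the documented slug rules."""
    # fullmatch rejects leading, trailing, and doubled hyphens as well as uppercase.
    return len(slug) <= MAX_SLUG_LEN and _SLUG_RE.fullmatch(slug) is not None
```

For example, is_valid_slug("my-bench") passes, while "My-bench", "my--bench", and "bench-" are rejected.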

bench eval adopt convert

Drive the CONVERT.md adoption workflow by launching the host codex CLI. The command assembles the adoption context (the source, the target benchmarks/<name>/ path, the adoption skills, and the embedded benchmarks/CONVERT.md guide) and runs codex exec against the repo root to drive the conversion toward a benchmarks/<name>/ pull request. It is fail-closed on credentials: codex needs OPENAI_API_KEY (or CODEX_API_KEY) in the environment, or a ~/.codex/auth.json from codex login, otherwise the command exits before assembling any context. Use --dry-run to print the exact launch command without running it (no credentials required). When --name is omitted the slug is derived from the source basename.

# Print the codex launch command without running it
bench eval adopt convert https://github.com/org/some-benchmark --dry-run

# Launch the host codex driver against a local source
bench eval adopt convert ./vendor/some-benchmark --name my-bench --model o3

Flag	Default	Description
`--name`	derived from source	Benchmark slug (default: from source basename)
`--model`	codex default	Model for the codex driver
`--dry-run`	`false`	Print the launch command, do not run
`--codex-bin`	`codex`	Host codex binary
`-c`, `--codex-config`	—	Codex config override as `key=value`, passed through to codex as `-c key=value`; repeatable. Use it to work around host `~/.codex/config.toml` drift without editing the file — e.g. `-c service_tier=flex` when an installed codex version rejects a stale value.
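
Because bench eval adopt convert is fail-closed on credentials, a wrapper script may want to run the same check up front. The sketch below mirrors the three documented sources (OPENAI_API_KEY, CODEX_API_KEY, or ~/.codex/auth.json); the function name and its parameters are hypothetical, and BenchFlow's real check may differ in detail.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def codex_credentials_available(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> bool:
    """Hypothetical pre-flight check for the documented codex credential sources."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    # Either API key variable is enough; empty strings count as missing here.
    if env.get("OPENAI_API_KEY") or env.get("CODEX_API_KEY"):
        return True
    # Otherwise fall back to the auth file that `codex login` writes.
    return (home / ".codex" / "auth.json").is_file()
```

Passing env and home explicitly keeps the check easy to test; called with no arguments it inspects the real environment and home directory.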

bench eval adopt verify

Run the parity gate for an adopted benchmark and emit a confidence verdict. It reads benchmarks/<name>/parity_experiment.json and scores two layers: a deterministic conversion-faithfulness floor (every compared criterion’s converted verdict must match the original’s verdict on identical inputs) and a statistical reward-distribution layer (every legacy-vs-converted reward delta must sit within --tolerance). The gate is parity-only: a faithful conversion reproduces the original’s behavior, including any reward-hackability the source has; it never “improves” or sanitizes the source.

The verdict is one of parity-confirmed, parity-divergent, or insufficient-evidence (no recorded comparisons). On any non-confirmed verdict the command exits non-zero and emits a draft GitHub issue body for human support, printed to stdout or written to --issue-out. The draft is never filed automatically. Pass --roundtrip-task to also run the structural round-trip conformance check on a concrete task directory.

By default the gate scores the recorded parity_experiment.json, which is fast but trusts an artifact the conversion produced about itself. Pass --rerun to independently re-execute parity_test.py --mode side-by-side and score its fresh output instead. --rerun is fail-closed: a missing or failing parity_test.py, a timeout, or output that is not in the scoreable parity_experiment.json shape all exit non-zero (rather than silently reporting insufficient-evidence).

bench eval adopt verify my-bench
bench eval adopt verify my-bench --tolerance 0.05 --issue-out divergence.md
bench eval adopt verify my-bench --roundtrip-task benchmarks/my-bench/tasks/example
bench eval adopt verify my-bench --rerun   # re-run parity_test.py, score fresh output

Flag	Default	Description
`--benchmarks-dir`	repo `benchmarks/`	Target benchmarks/ directory
`--tolerance`	`0.02`	Max abs reward delta (statistical layer)
`--issue-out`	—	Write the divergence issue draft to this path instead of stdout
`--roundtrip-task`	—	Also run the structural round-trip check on this task dir
`--rerun`	`false`	Re-execute `parity_test.py --mode side-by-side` and score its fresh output instead of the recorded `parity_experiment.json`
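
The two-layer scoring described for bench eval adopt verify can be sketched as a small function. This is an illustration under stated assumptions, not the gate's implementation: the input shapes (pairs of booleans and pairs of floats) and the function name are hypothetical, and the real parity_experiment.json schema is not reproduced here.

```python
from typing import Iterable, Tuple


def parity_verdict(
    criterion_pairs: Iterable[Tuple[bool, bool]],
    reward_pairs: Iterable[Tuple[float, float]],
    tolerance: float = 0.02,
) -> str:
    """Hypothetical sketch of the verdict logic.

    criterion_pairs: (original verdict, converted verdict) per compared criterion.
    reward_pairs: (legacy reward, converted reward) per compared run.
    """
    criterion_pairs = list(criterion_pairs)
    reward_pairs = list(reward_pairs)
    # Assumption: "no recorded comparisons" means both layers are empty.
    if not criterion_pairs and not reward_pairs:
        return "insufficient-evidence"
    # Deterministic floor: every converted verdict must equal the original.
    faithful = all(orig == conv for orig, conv in criterion_pairs)
    # Statistical layer: every absolute reward delta must sit within tolerance.
    within = all(abs(legacy - conv) <= tolerance for legacy, conv in reward_pairs)
    return "parity-confirmed" if faithful and within else "parity-divergent"
```

A single mismatched criterion or a single out-of-tolerance reward delta is enough to produce parity-divergent, which matches the fail-closed posture of the command.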

bench eval

bench eval create

Create and run an evaluation. Use it for YAML configs and batch runs; it also accepts a single task directory.

# From YAML config
bench eval create --config benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml

# From remote repo (fast Daytona batch; token usage may be unavailable)
bench eval create \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --concurrency 64 \
  --sandbox-setup-timeout 300

# From remote repo with required token usage telemetry
bench eval create \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --usage-tracking required \
  --concurrency 16 \
  --sandbox-setup-timeout 300

# From local directory
bench eval create --tasks-dir ./tasks --agent gemini --model gemini-3.1-flash-lite-preview

# From a hosted PrimeIntellect / Verifiers environment
bench eval create \
  --source-env primeintellect/general-agent \
  --source-env-version 0.1.1 \
  --source-env-arg task=calendar_scheduling_t0 \
  --agent gemini \
  --model google/gemini-2.5-flash-lite

# Single task with mounted skills and the recommended skill nudge
bench eval create \
  --tasks-dir tasks/pdf-fix \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --skill-mode with-skill \
  --agent-env BENCHFLOW_SKILL_NUDGE=name

# Pinned registry dataset: resolves skillsbench@1.1, verifies task digests,
# and stamps dataset identity into every result.json/config.json
bench eval create -d skillsbench@1.1 --agent gemini --model gemini-3.1-flash-lite-preview
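
The -d example above pins a dataset as <name>@<version>. A script that builds such references may want to validate them first; the sketch below splits the string on its last "@". The DatasetRef type and parse_dataset_ref name are hypothetical, and BenchFlow may apply stricter rules than this.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    """Hypothetical container for a <name>@<version> dataset reference."""

    name: str
    version: str


def parse_dataset_ref(spec: str) -> DatasetRef:
    """Split 'skillsbench@1.1' into its name and version parts."""
    # rpartition splits on the last '@'; a missing '@' leaves sep empty.
    name, sep, version = spec.rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected <name>@<version>, got {spec!r}")
    return DatasetRef(name, version)
```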

Flag	Default	Description
`--config`	—	YAML config file
`--tasks-dir`	—	Local task dir (single task with task.toml, or parent of many)
`-d`, `--dataset`	—	Registry dataset to run as `<name>@<version>` (e.g. `skillsbench@1.1`). Resolves the pinned snapshot from the registry, clones tasks at their pinned commit, verifies each task’s sha256 content digest, and checks the dataset’s `bench_version` range against the installed benchflow. Each `result.json`/`config.json` is stamped with `dataset_name`, `dataset_version`, and the task’s `task_digest`.
`--registry`	skillsbench registry	Dataset registry JSON URL or local file. Only valid with `--dataset`.
`--source-repo`	—	Remote repo as `org/repo` (e.g. `benchflow-ai/skillsbench`)
`--source-path`	—	Subpath within the repo (e.g. `tasks`)
`--source-ref`	—	Branch or tag to clone (e.g. `main`)
`--source-env`	—	Hosted environment source (e.g. `primeintellect/general-agent`)
`--source-env-version`	—	Hosted environment version
`--source-env-arg`	—	Hosted environment argument as `KEY=VALUE`; repeatable
`--source-env-num-examples`	`1`	Number of hosted environment examples
`--source-env-rollouts-per-example`	`1`	Rollouts per hosted environment example
`--source-env-max-tokens`	`1024`	Max tokens for hosted environment model calls
`--source-env-temperature`	`0.0`	Temperature for hosted environment model calls
`--source-env-sampling-arg`	—	Verifiers sampling argument as `KEY=VALUE`; repeatable (for example `reasoning_effort=minimal`)
`--agent`	`claude-agent-acp`	Agent name
`--model`	Agent default	Model ID
`--reasoning-effort`	—	Agent reasoning/thinking effort when the agent exposes one (e.g. `max`)
`--sandbox`	`docker`	Sandbox: docker, daytona, or modal
`--usage-tracking`	`auto`	Token usage telemetry policy: `auto`, `required`, or `off`
`--environment-manifest`	—	Path to an Environment-plane manifest (`environment.toml`); applied to every rollout in the batch
`--prompt`	`instruction.md`	Prompt to send to the agent; repeatable for multi-prompt runs
`--concurrency`	`4`	Max concurrent tasks (batch mode only)
`--build-concurrency`	`--concurrency`	Max concurrent docker image builds; set lower (e.g. `8`) when `--concurrency` is high to avoid overwhelming the docker daemon
`--worker-concurrency`	—	Run batch eval through isolated worker subprocesses, each with at most this many concurrent tasks; `--concurrency` remains the aggregate target
`--worker-retries`	`1`	Retry a crashed worker shard this many times, resuming its jobs dir
`--worker-start-stagger-sec`	`1.0`	Seconds to stagger worker starts to avoid Daytona connection storms
`--agent-idle-timeout`	(built-in default)	Abort ACP prompts after this many idle seconds; `0` disables idle detection
`--jobs-dir`	`jobs`	Output directory
`--sandbox-user`	`agent`	Sandbox user (null for root)
`--sandbox-setup-timeout`	`120`	Timeout in seconds for sandbox user setup
`--skills-dir`	—	Advanced custom skills directory; valid only with `--skill-mode with-skill`. Omit it to use each task’s `environment/skills`.
`--skill-mode`	`no-skill`	Skill mode: `no-skill`, `with-skill`, or `self-gen`
`--skill-creator-dir`	—	Path to a `skill-creator` directory (or a skills root containing it); used when `--skill-mode self-gen`
`--self-gen-no-internet`	`false`	Disable web tools for the self-generated skill run
`--agent-env`	—	Agent environment variable as `KEY=VALUE`; repeatable
`--include`	—	Only run these task names; repeatable (e.g. `--include jax-computing-basics --include data-to-d3`)
`--exclude`	—	Skip these task names; repeatable (e.g. `--exclude quantum-numerical-simulation`)
`--loop-strategy`	—	Wrap each rollout in a loop, e.g. `verify-retry:k=3,feedback=names` or `self-review:k=3` (omit for single-shot)
`--ignore-bench-version`	`false`	With `--dataset`, skip the dataset’s `bench_version` compatibility gate
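
The --loop-strategy values shown in the table follow a name:key=value,key=value shape. As a rough illustration of that shape, the sketch below parses such a string into a strategy name and a parameter dictionary. The parser is hypothetical: it keeps every value as a string and does not know which strategies or keys BenchFlow actually accepts.

```python
from typing import Dict, Tuple


def parse_loop_strategy(spec: str) -> Tuple[str, Dict[str, str]]:
    """Hypothetical parser for strings such as 'verify-retry:k=3,feedback=names'."""
    name, _, raw_params = spec.partition(":")
    if not name:
        raise ValueError(f"missing strategy name in {spec!r}")
    params: Dict[str, str] = {}
    # An empty parameter section (no ':' at all) yields no items.
    for item in filter(None, raw_params.split(",")):
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        params[key] = value
    return name, params
```

With this sketch, "verify-retry:k=3,feedback=names" becomes ("verify-retry", {"k": "3", "feedback": "names"}) and "self-review:k=3" becomes ("self-review", {"k": "3"}).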

When mounting skills, the recommended docs default is --agent-env BENCHFLOW_SKILL_NUDGE=name. See Architecture: skill loading for how with-skill mode is registered with each agent and how the nudge modes differ.

Daytona batch runs collect provider token/cost telemetry by default with a sandbox-local LiteLLM gateway. Use --usage-tracking required when missing telemetry should fail the rollout, or --usage-tracking off for recovery runs that should leave provider traffic untouched.

--source-env is for external hosted environment hubs. The first supported runner is PrimeIntellect / Verifiers: BenchFlow preserves the hosted identity (env_uid, hub_url), installs the versioned package into an isolated local virtual environment, and runs vf-eval. --sandbox remains the BenchFlow task sandbox selector for local/repo task sources; Verifiers source environments own their own harness and sandbox behavior. --model is passed to the Verifiers model endpoint; use a model id available to that provider. Provider-specific sampling options are not inferred; pass them explicitly with --source-env-sampling-arg.

bench eval list

List completed evaluations from a jobs directory.

bench eval list jobs/

bench eval metrics

Collect and display metrics (pass/fail/score, memory score, tool calls, duration) from a jobs directory. Use --json for machine-readable output.

bench eval metrics jobs/
bench eval metrics jobs/ --json

bench eval view

Serve a trial trajectory viewer in the browser for a rollout or job directory.

bench eval view jobs/run/task__abc123
bench eval view jobs/ --port 9000

bench skills

bench skills list

List skills discovered under the default skills roots (or --dir).

bench skills list
bench skills list --dir ./skills

bench skills eval

Evaluate a skill against its evals.json test cases.

bench skills eval skills/my-skill/ \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona

bench tasks

bench tasks init

Scaffold a new benchmark task.

bench tasks init my-new-task
bench tasks init my-new-task --dir tasks/
bench tasks init my-new-task --format legacy

Flag	Default	Description
`--format`	`task-md`	Task format: `task-md` (native single-document) or `legacy` (split `task.toml` + `instruction.md` layout)

bench tasks check

Validate a task directory (task.md or legacy task.toml + instruction.md, environment/Dockerfile, verifier/ or legacy tests/).

bench tasks check tasks/my-task

With --level, validation runs at a chosen depth: schema, structural, runtime-capability, publication-grade, acceptance, or acceptance-live. Acceptance-level errors such as “acceptance validation requires benchflow.evidence mapping” refer to the benchflow.evidence schema documented in the “Assets, Provenance, And Evidence” section of docs/task-standard.md.

bench tasks migrate

Convert a legacy task.toml + instruction.md task into the unified task.md format. By default the legacy files are kept alongside the new task.md.

bench tasks migrate tasks/my-task
bench tasks migrate tasks/my-task --overwrite --remove-legacy

Flag	Default	Description
`--overwrite`	`false`	Replace an existing task.md
`--remove-legacy`	`false`	Delete split files and promote tests/solution aliases after task.md is verified

bench tasks normalize

Expand minimal task.md authoring profiles into the canonical task.md form. Prints the normalized document to stdout unless told otherwise.

bench tasks normalize tasks/my-task
bench tasks normalize tasks/my-task --write
bench tasks normalize tasks/my-task -o normalized-task.md

Flag	Default	Description
`--output`, `-o`	—	Write normalized task.md to this path instead of stdout
`--write`	`false`	Replace task.md in place with the normalized canonical form

bench tasks export

Export a task.md task to a Harbor/Pier-compatible split layout, with a compatibility loss report written to compatibility/export-report.json in the export directory.

bench tasks export tasks/my-task out/my-task-split
bench tasks export tasks/my-task --report-only
bench tasks export tasks/my-task out/my-task-split --target pier --overwrite

Arguments: TASK_DIR (task directory to export) and optional OUTPUT_DIR (destination split-layout directory; may be omitted with --report-only).

Flag	Default	Description
`--target`	`harbor`	Compatibility target: `harbor` or `pier`
`--overwrite`	`false`	Replace an existing export directory
`--report-only`	`false`	Print the compatibility loss report without writing files

bench tasks digest

Compute the content digest that pins a task’s files, independent of git — the sha256 the dataset registry keys on (matches the digests bench eval create -d verifies and the task_digest stamped into every result.json). Recognizes both legacy task.toml tasks and native task.md tasks. Given a single task directory it prints the digest; given a directory of tasks it prints one <name> <digest> line per task. Output goes to stdout via echo (not Rich), so it is safe to pipe into machine-readable tooling.

bench tasks digest tasks/my-task          # -> sha256:<hex>
bench tasks digest tasks/                  # one "<name> sha256:<hex>" line per task

Arguments: PATH (a task directory, or a directory of task directories).
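
To make the idea of a git-independent content digest concrete, here is a minimal sketch that hashes a directory's files in a stable order and prints the result in the sha256:<hex> form shown above. It is not the algorithm bench tasks digest uses, so its output will not match registry digests; the function name and the exact byte layout fed to the hash are assumptions of this example.

```python
import hashlib
from pathlib import Path


def sketch_task_digest(task_dir: Path) -> str:
    """Illustrative content digest; NOT BenchFlow's real digest algorithm."""
    hasher = hashlib.sha256()
    # Sort by POSIX-style relative path so the result is independent of
    # filesystem enumeration order and of where the directory lives.
    files = sorted(
        (p.relative_to(task_dir).as_posix(), p)
        for p in task_dir.rglob("*")
        if p.is_file()
    )
    for rel, path in files:
        data = path.read_bytes()
        # Feed the path, a NUL separator, the length, then the bytes, so that
        # file boundaries cannot be confused with file contents.
        hasher.update(rel.encode("utf-8") + b"\0")
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return f"sha256:{hasher.hexdigest()}"
```

Two directories with identical relative paths and file contents produce the same value regardless of location or git history, which is the property the registry relies on.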

bench tasks generate

Generate benchmark task directories from real agent traces.

bench tasks generate --from-local --project my-repo --limit 5
bench tasks generate --from-file session.jsonl --dry-run
bench tasks generate --from-hf opentraces-test --limit 50

Flag	Default	Description
`--from-local`	—	Generate from local Claude Code sessions
`--from-file`	—	Generate from a JSONL trace file
`--from-hf`	—	Generate from a HuggingFace dataset ID or alias
`--output`	`tasks`	Output directory for generated tasks
`--projects-dir`	`~/.claude/projects/`	Claude Code projects directory
`--project`	—	Filter local sessions by project path substring
`--format`	`auto`	Trace format override
`--split`	`train`	HuggingFace dataset split
`--max-rows`	`100`	Max rows to download from HuggingFace
`--limit`	`20`	Max traces to process
`--min-steps`	`2`	Minimum steps per trace
`--outcome`	—	Filter by outcome: success, failure, unknown
`--author`	`benchflow-traces`	Author name for generated task metadata
`--task-format`	`task-md`	Generated task package format: `task-md` or `legacy`
`--dry-run`	`false`	Preview traces without generating tasks

bench tasks list-sources

List known HuggingFace trace datasets. The aliases listed here can be passed to bench tasks generate --from-hf.

bench tasks list-sources

bench sandbox

Local sandbox lifecycle: provision a task on a docker/daytona/modal backend, list active sandboxes, and reap stale ones.

bench sandbox create

Create an environment object from a task directory. This validates environment construction but does not start the sandbox.

bench sandbox create tasks/my-task --sandbox daytona

bench sandbox list

List active local (Daytona) sandboxes.

bench sandbox list

bench sandbox cleanup

Clean up orphaned Daytona sandboxes. By default this deletes sandboxes older than 24 hours; use --dry-run to preview what would be deleted.

bench sandbox cleanup --dry-run --max-age 1440

Daytona-backed evals also reap orphaned sandboxes automatically at run start (failure states such as BUILD_FAILED are reaped sooner than healthy ones, and an idle-activity guard means concurrent live runs are never reaped). Set BENCHFLOW_DAYTONA_AUTO_REAP to any of 0/false/no/off (case-insensitive) to disable that automatic pass and rely on the manual command above.
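
The BENCHFLOW_DAYTONA_AUTO_REAP rule above can be expressed as a small predicate. This is a sketch of the documented behavior rather than BenchFlow's code: the function name is hypothetical, and trimming surrounding whitespace and treating any other value as enabled are assumptions.

```python
import os
from typing import Mapping, Optional

# The four documented "disable" spellings, compared case-insensitively.
_DISABLE_VALUES = {"0", "false", "no", "off"}


def auto_reap_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: False only when the variable holds a disable value."""
    env = os.environ if env is None else env
    raw = env.get("BENCHFLOW_DAYTONA_AUTO_REAP")
    if raw is None:
        # Unset means the automatic pass runs, per the default described above.
        return True
    # Assumption: surrounding whitespace is ignored.
    return raw.strip().lower() not in _DISABLE_VALUES
```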

bench environment (deprecated)

bench environment is a hidden deprecated alias group, removed in 0.7. The local lifecycle moved to bench sandbox (create/list/cleanup) and hosted-provider browsing to bench hub env. The old bench environment create|list|cleanup and show|inspect (plus list --provider/--hub) still work, each printing a one-line stderr notice.

bench hub

External environment hubs: compatibility checks (check) and browsing a hosted provider’s environments (env).

bench hub env

Read-only browsing of a hosted provider’s environments (PrimeIntellect “Environments”). To run one, use bench eval create --source-env.

bench hub env list --provider primeintellect --owner primeintellect --search general-agent --limit 5
bench hub env show primeintellect/general-agent --version 0.1.1
bench hub env inspect primeintellect/general-agent --version 0.1.1 --path README.md

bench hub check

Inventory or structurally check representative tasks from an environment hub’s registry. Defaults to an inventory pass against the public Harbor registry JSON.

# Inventory the public Harbor hub registry
bench hub check

# Structural check, two tasks per dataset, JSONL output
bench hub check --level check --tasks-per-dataset 2 --out hub.jsonl

Flag	Default	Description
`--registry`	Harbor public registry URL	Harbor registry JSON URL or local file
`--tasks-per-dataset`	`2`	Representative tasks selected per dataset
`--level`	`inventory`	Compatibility level: `inventory` or `check`
`--out`	—	Optional JSONL output path
`--cache-dir`	`.cache/hub/harbor`	Cache directory for sparse clones
`--limit`	—	Optional cap on selected task refs

YAML Config Format

Batch config with skills and skill nudge

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300
agent: gemini
model: gemini-3.1-flash-lite-preview
skill_mode: with-skill
skills_dir: shared-skills/
agent_env:
  BENCHFLOW_SKILL_NUDGE: name
max_retries: 2

Multi-scene (BYOS skill generation)

Use the Python API for multi-scene experiments. bench eval create --config is for batch job configs; scene configs are loaded with benchflow._utils.yaml_loader or built directly in Python.

task_dir: tasks/my-task
environment: daytona
sandbox_setup_timeout: 300

scenes:
  - name: skill-gen
    roles:
      - name: creator
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: creator
        prompt: "Analyze the task and write a skill document to /app/generated-skill.md"

  - name: solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver

bench continue

Resume a previous, unfinished (timed-out) OpenHands run to completion via record-replay. The command is standalone: it does not touch the normal run path. See Continuing timed-out runs for the full guide.

bench continue path/to/original/run-folder --tasks-dir path/to/tasks

Key options: --model (override the live-continuation model; defaults to the original run’s model), --timeout, --output, --require-timeout, --strict-divergence, --replay-only (rebuild via replay and stop at the cut-point — no live model or API key needed), and --proxy-mode (replay proxy placement: auto, host, or sandbox; default auto uses sandbox-local replay for Daytona/Modal and host replay for Docker).
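
The --proxy-mode default described above resolves differently per sandbox backend. The following sketch restates that rule as code; the function name is hypothetical and the error handling for unknown values is an assumption.

```python
def resolve_proxy_mode(mode: str, sandbox: str) -> str:
    """Hypothetical helper: map --proxy-mode plus sandbox to 'host' or 'sandbox'."""
    if mode in ("host", "sandbox"):
        # Explicit choices are taken as given.
        return mode
    if mode != "auto":
        raise ValueError(f"unknown proxy mode: {mode!r}")
    # auto: sandbox-local replay for Daytona/Modal, host replay for Docker.
    if sandbox in ("daytona", "modal"):
        return "sandbox"
    if sandbox == "docker":
        return "host"
    raise ValueError(f"unknown sandbox: {sandbox!r}")
```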

bench continue-batch

Continue all timed-out OpenHands runs found under a directory tree. Discovers run folders (config.json + trajectory/llm_trajectory.jsonl) recursively, continues each, and prints a JSON batch summary (exits 1 if any continuation failed).

bench continue-batch path/to/jobs-root --tasks-dir path/to/tasks

Flag	Default	Description
`--tasks-dir`	—	Directory holding task sources; required unless the recorded task path exists
`--model`	original run’s model	Override the live-continuation model
`--timeout`	—	Wall-clock budget per continuation
`--output`	—	Output jobs dir for continued runs
`--concurrency`	`100`	Maximum number of continuation runs in flight
`--limit`	—	Limit discovered timeout folders
`--strict-divergence`	`false`	Abort a run if replay leaves the original rails
`--proxy-mode`	`auto`	Replay proxy placement: `auto`, `host`, or `sandbox`
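
The discovery step of bench continue-batch looks for folders that contain both config.json and trajectory/llm_trajectory.jsonl. A rough way to preview which folders match that layout is sketched below. The function name is hypothetical, and the sketch only checks the file layout: it does not decide whether a run actually timed out.

```python
from pathlib import Path
from typing import List


def discover_run_folders(root: Path) -> List[Path]:
    """Hypothetical helper: list folders under root with the documented run layout."""
    runs = []
    # rglob walks the tree recursively, mirroring the recursive discovery.
    for config in root.rglob("config.json"):
        run_dir = config.parent
        if (run_dir / "trajectory" / "llm_trajectory.jsonl").is_file():
            runs.append(run_dir)
    # Sort for a stable, reproducible listing.
    return sorted(runs)
```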
