Skip to main content

CLI reference

BenchFlow uses a resource-verb pattern: bench <resource> <verb>.
bench --version

bench agent

bench agent is agent management only. bench agent list and bench agent show operate on registered AI agents (Claude Code, Gemini CLI, Codex, OpenHands, …) — the programs that solve tasks. Onboarding a third-party benchmark (scaffold → drive → parity-gate a benchmarks/<name>/ adoption) is a separate workflow under bench eval adopt (initconvertverify). The legacy bench agent create|run|verify still work as hidden deprecated aliases through 0.6, printing a one-line notice; they are removed in 0.7.

bench agent list

List all registered agents with their protocol and native/default auth requirements. Provider-prefixed models may use provider-specific credentials; Azure Foundry models use AZURE_API_KEY plus AZURE_API_ENDPOINT.
bench agent list

bench agent show

Show details for a specific agent, including native/default auth and a note about provider-specific credentials.
bench agent show gemini

bench eval adopt

Bring a third-party benchmark into the environment framework: scaffold a benchmarks/<name>/ package, drive the codex CONVERT.md conversion, then parity-gate it (initconvertverify). These commands were previously bench agent create|run|verify, which still work as hidden deprecated aliases through 0.6 (they print a one-line notice and are removed in 0.7). See Benchmark adoption for the full walkthrough.

bench eval adopt init

Scaffold benchmarks/<name>/ for a new benchmark adoption. The layout mirrors the reference benchmark benchmarks/programbench/ and the contract in benchmarks/CONVERT.md: it writes benchflow.py (converter), main.py, parity_test.py, run_<name>.py, <name>.yaml, benchmark.yaml, parity_experiment.json (status template), README.md, and __init__.py. It is fail-closed: the slug is validated (lowercase, leading letter, single internal hyphens, max 64 chars) and the command refuses to overwrite an existing benchmark directory.
bench eval adopt init my-bench
bench eval adopt init my-bench --benchmarks-dir ./benchmarks
FlagDefaultDescription
--benchmarks-dirrepo benchmarks/Target benchmarks/ directory

bench eval adopt convert

Drive the CONVERT.md adoption workflow by launching the host codex CLI. The command assembles the adoption context (the source, the target benchmarks/<name>/ path, the adoption skills, and the embedded benchmarks/CONVERT.md guide) and runs codex exec against the repo root to drive the conversion toward a benchmarks/<name>/ pull request. It is fail-closed on credentials: codex needs OPENAI_API_KEY (or CODEX_API_KEY) in the environment, or a ~/.codex/auth.json from codex login, otherwise the command exits before assembling any context. Use --dry-run to print the exact launch command without running it (no credentials required). When --name is omitted the slug is derived from the source basename.
# Print the codex launch command without running it
bench eval adopt convert https://github.com/org/some-benchmark --dry-run

# Launch the host codex driver against a local source
bench eval adopt convert ./vendor/some-benchmark --name my-bench --model o3
FlagDefaultDescription
--namederived from sourceBenchmark slug (default: from source basename)
--modelcodex defaultModel for the codex driver
--dry-runfalsePrint the launch command, do not run
--codex-bincodexHost codex binary
-c, --codex-configCodex config override as key=value, passed through to codex as -c key=value; repeatable. Use it to work around host ~/.codex/config.toml drift without editing the file — e.g. -c service_tier=flex when an installed codex version rejects a stale value.

bench eval adopt verify

Run the parity gate for an adopted benchmark and emit a confidence verdict. It reads benchmarks/<name>/parity_experiment.json and scores two layers: a deterministic conversion-faithfulness floor (every compared criterion’s converted verdict must match the original’s verdict on identical inputs) and a statistical reward-distribution layer (every legacy-vs-converted reward delta must sit within --tolerance). The gate is parity-only — a faithful conversion reproduces the original’s behavior, including any reward-hackability the source has; it never “improves” or sanitizes the source. The verdict is one of parity-confirmed, parity-divergent, or insufficient-evidence (no recorded comparisons). On any non-confirmed verdict the command exits non-zero and emits a draft GitHub issue body for human support — printed to stdout, or written to --issue-out. The draft is never filed automatically. Pass --roundtrip-task to also run the structural round-trip conformance check on a concrete task directory. By default the gate scores the recorded parity_experiment.json — fast, but it trusts an artifact the conversion produced about itself. Pass --rerun to independently re-execute parity_test.py --mode side-by-side and score its fresh output instead. --rerun is fail-closed: a missing/failing parity_test.py, a timeout, or output that is not in the scoreable parity_experiment.json shape all exit non-zero (rather than silently reporting insufficient-evidence).
bench eval adopt verify my-bench
bench eval adopt verify my-bench --tolerance 0.05 --issue-out divergence.md
bench eval adopt verify my-bench --roundtrip-task benchmarks/my-bench/tasks/example
bench eval adopt verify my-bench --rerun   # re-run parity_test.py, score fresh output
FlagDefaultDescription
--benchmarks-dirrepo benchmarks/Target benchmarks/ directory
--tolerance0.02Max abs reward delta (statistical layer)
--issue-outWrite the divergence issue draft to this path instead of stdout
--roundtrip-taskAlso run the structural round-trip check on this task dir
--rerunfalseRe-execute parity_test.py --mode side-by-side and score its fresh output instead of the recorded parity_experiment.json

bench eval

bench eval create

Create and run an evaluation. Use it for YAML configs and batch runs; it also accepts a single task directory.
# From YAML config
bench eval create --config benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml

# From remote repo (fast Daytona batch; token usage may be unavailable)
bench eval create \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --concurrency 64 \
  --sandbox-setup-timeout 300

# From remote repo with required token usage telemetry
bench eval create \
  --source-repo benchflow-ai/skillsbench \
  --source-path tasks \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --usage-tracking required \
  --concurrency 16 \
  --sandbox-setup-timeout 300

# From local directory
bench eval create --tasks-dir ./tasks --agent gemini --model gemini-3.1-flash-lite-preview

# From a hosted PrimeIntellect / Verifiers environment
bench eval create \
  --source-env primeintellect/general-agent \
  --source-env-version 0.1.1 \
  --source-env-arg task=calendar_scheduling_t0 \
  --agent gemini \
  --model google/gemini-2.5-flash-lite

# Single task with mounted skills and the recommended skill nudge
bench eval create \
  --tasks-dir tasks/pdf-fix \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona \
  --skill-mode with-skill \
  --agent-env BENCHFLOW_SKILL_NUDGE=name

# Pinned registry dataset: resolves skillsbench@1.1, verifies task digests,
# and stamps dataset identity into every result.json/config.json
bench eval create -d skillsbench@1.1 --agent gemini --model gemini-3.1-flash-lite-preview
FlagDefaultDescription
--configYAML config file
--tasks-dirLocal task dir (single task with task.toml, or parent of many)
-d, --datasetRegistry dataset to run as <name>@<version> (e.g. skillsbench@1.1). Resolves the pinned snapshot from the registry, clones tasks at their pinned commit, verifies each task’s sha256 content digest, and checks the dataset’s bench_version range against the installed benchflow. Each result.json/config.json is stamped with dataset_name, dataset_version, and the task’s task_digest.
--registryskillsbench registryDataset registry JSON URL or local file. Only valid with --dataset.
--source-repoRemote repo as org/repo (e.g. benchflow-ai/skillsbench)
--source-pathSubpath within the repo (e.g. tasks)
--source-refBranch or tag to clone (e.g. main)
--source-envHosted environment source (e.g. primeintellect/general-agent)
--source-env-versionHosted environment version
--source-env-argHosted environment argument as KEY=VALUE; repeatable
--source-env-num-examples1Number of hosted environment examples
--source-env-rollouts-per-example1Rollouts per hosted environment example
--source-env-max-tokens1024Max tokens for hosted environment model calls
--source-env-temperature0.0Temperature for hosted environment model calls
--source-env-sampling-argVerifiers sampling argument as KEY=VALUE; repeatable (for example reasoning_effort=minimal)
--agentclaude-agent-acpAgent name
--modelAgent defaultModel ID
--reasoning-effortAgent reasoning/thinking effort when the agent exposes one (e.g. max)
--sandboxdockerSandbox: docker, daytona, or modal
--usage-trackingautoToken usage telemetry policy: auto, required, or off
--environment-manifestPath to an Environment-plane manifest (environment.toml); applied to every rollout in the batch
--promptinstruction.mdPrompt to send to the agent; repeatable for multi-prompt runs
--concurrency4Max concurrent tasks (batch mode only)
--build-concurrency--concurrencyMax concurrent docker image builds; set lower (e.g. 8) when --concurrency is high to avoid overwhelming the docker daemon
--worker-concurrencyRun batch eval through isolated worker subprocesses, each with at most this many concurrent tasks; --concurrency remains the aggregate target
--worker-retries1Retry a crashed worker shard this many times, resuming its jobs dir
--worker-start-stagger-sec1.0Seconds to stagger worker starts to avoid Daytona connection storms
--agent-idle-timeout(built-in default)Abort ACP prompts after this many idle seconds; 0 disables idle detection
--jobs-dirjobsOutput directory
--sandbox-useragentSandbox user (null for root)
--sandbox-setup-timeout120Timeout in seconds for sandbox user setup
--skills-dirAdvanced custom skills directory; valid only with --skill-mode with-skill. Omit it to use each task’s environment/skills.
--skill-modeno-skillSkill mode: no-skill, with-skill, or self-gen
--skill-creator-dirPath to a skill-creator directory (or a skills root containing it); used when --skill-mode self-gen
--self-gen-no-internetfalseDisable web tools for the self-generated skill run
--agent-envAgent environment variable as KEY=VALUE; repeatable
--includeOnly run these task names; repeatable (e.g. --include jax-computing-basics --include data-to-d3)
--excludeSkip these task names; repeatable (e.g. --exclude quantum-numerical-simulation)
--loop-strategyWrap each rollout in a loop, e.g. verify-retry:k=3,feedback=names or self-review:k=3 (omit for single-shot)
--ignore-bench-versionfalseWith --dataset, skip the dataset’s bench_version compatibility gate
When mounting skills, the recommended docs default is --agent-env BENCHFLOW_SKILL_NUDGE=name. See Architecture: skill loading for how with-skill mode is registered with each agent and how the nudge modes differ. Daytona batch runs collect provider token/cost telemetry by default with a sandbox-local LiteLLM gateway. Use --usage-tracking required when missing telemetry should fail the rollout, or --usage-tracking off for recovery runs that should leave provider traffic untouched. --source-env is for external hosted environment hubs. The first supported runner is PrimeIntellect / Verifiers: BenchFlow preserves the hosted identity (env_uid, hub_url), installs the versioned package into an isolated local virtual environment, and runs vf-eval. --sandbox remains the BenchFlow task sandbox selector for local/repo task sources; Verifiers source environments own their own harness and sandbox behavior. --model is passed to the Verifiers model endpoint; use a model id available to that provider. Provider-specific sampling options are not inferred; pass them explicitly with --source-env-sampling-arg.

bench eval list

List completed evaluations from a jobs directory.
bench eval list jobs/

bench eval metrics

Collect and display metrics (pass/fail/score, memory score, tool calls, duration) from a jobs directory. Use --json for machine-readable output.
bench eval metrics jobs/
bench eval metrics jobs/ --json

bench eval view

Serve a trial trajectory viewer in the browser for a rollout or job directory.
bench eval view jobs/run/task__abc123
bench eval view jobs/ --port 9000

bench skills

bench skills list

List skills discovered under the default skills roots (or --dir).
bench skills list
bench skills list --dir ./skills

bench skills eval

Evaluate a skill against its evals.json test cases.
bench skills eval skills/my-skill/ \
  --agent gemini \
  --model gemini-3.1-flash-lite-preview \
  --sandbox daytona

bench tasks

bench tasks init

Scaffold a new benchmark task.
bench tasks init my-new-task
bench tasks init my-new-task --dir tasks/
bench tasks init my-new-task --format legacy
FlagDefaultDescription
--formattask-mdTask format: task-md (native single-document) or legacy (split task.toml + instruction.md layout)

bench tasks check

Validate a task directory (task.md or legacy task.toml + instruction.md, environment/Dockerfile, verifier/ or legacy tests/).
bench tasks check tasks/my-task
With --level, validation runs at a chosen depth: schema, structural, runtime-capability, publication-grade, acceptance, or acceptance-live. Acceptance-level errors such as acceptance validation requires benchflow.evidence mapping refer to the benchflow.evidence schema documented in the “Assets, Provenance, And Evidence” section of docs/task-standard.md.

bench tasks migrate

Convert a legacy task.toml + instruction.md task into the unified task.md format. By default the legacy files are kept alongside the new task.md.
bench tasks migrate tasks/my-task
bench tasks migrate tasks/my-task --overwrite --remove-legacy
FlagDefaultDescription
--overwritefalseReplace an existing task.md
--remove-legacyfalseDelete split files and promote tests/solution aliases after task.md is verified

bench tasks normalize

Expand minimal task.md authoring profiles into the canonical task.md form. Prints the normalized document to stdout unless told otherwise.
bench tasks normalize tasks/my-task
bench tasks normalize tasks/my-task --write
bench tasks normalize tasks/my-task -o normalized-task.md
FlagDefaultDescription
--output, -oWrite normalized task.md to this path instead of stdout
--writefalseReplace task.md in place with the normalized canonical form

bench tasks export

Export a task.md task to a Harbor/Pier-compatible split layout, with a compatibility loss report written to compatibility/export-report.json in the export directory.
bench tasks export tasks/my-task out/my-task-split
bench tasks export tasks/my-task --report-only
bench tasks export tasks/my-task out/my-task-split --target pier --overwrite
Arguments: TASK_DIR (task directory to export) and optional OUTPUT_DIR (destination split-layout directory; may be omitted with --report-only).
FlagDefaultDescription
--targetharborCompatibility target: harbor or pier
--overwritefalseReplace an existing export directory
--report-onlyfalsePrint the compatibility loss report without writing files

bench tasks digest

Compute the content digest that pins a task’s files, independent of git — the sha256 the dataset registry keys on (matches the digests bench eval create -d verifies and the task_digest stamped into every result.json). Recognizes both legacy task.toml tasks and native task.md tasks. Given a single task directory it prints the digest; given a directory of tasks it prints one <name> <digest> line per task. Output goes to stdout via echo (not Rich), so it is safe to pipe into machine-readable tooling.
bench tasks digest tasks/my-task          # -> sha256:<hex>
bench tasks digest tasks/                  # one "<name> sha256:<hex>" line per task
Arguments: PATH (a task directory, or a directory of task directories).

bench tasks generate

Generate benchmark task directories from real agent traces.
bench tasks generate --from-local --project my-repo --limit 5
bench tasks generate --from-file session.jsonl --dry-run
bench tasks generate --from-hf opentraces-test --limit 50
FlagDefaultDescription
--from-localGenerate from local Claude Code sessions
--from-fileGenerate from a JSONL trace file
--from-hfGenerate from a HuggingFace dataset ID or alias
--outputtasksOutput directory for generated tasks
--projects-dir~/.claude/projects/Claude Code projects directory
--projectFilter local sessions by project path substring
--formatautoTrace format override
--splittrainHuggingFace dataset split
--max-rows100Max rows to download from HuggingFace
--limit20Max traces to process
--min-steps2Minimum steps per trace
--outcomeFilter by outcome: success, failure, unknown
--authorbenchflow-tracesAuthor name for generated task metadata
--task-formattask-mdGenerated task package format: task-md or legacy
--dry-runfalsePreview traces without generating tasks

bench tasks list-sources

List known HuggingFace trace datasets. The aliases listed here can be passed to bench tasks generate --from-hf.
bench tasks list-sources

bench sandbox

Local sandbox lifecycle: provision a task on a docker/daytona/modal backend, list active sandboxes, and reap stale ones.

bench sandbox create

Create an environment object from a task directory. This validates environment construction but does not start the sandbox.
bench sandbox create tasks/my-task --sandbox daytona

bench sandbox list

List active local (Daytona) sandboxes.
bench sandbox list

bench sandbox cleanup

Clean up orphaned Daytona sandboxes. By default this deletes sandboxes older than 24 hours; use --dry-run to preview what would be deleted.
bench sandbox cleanup --dry-run --max-age 1440
Daytona-backed evals also reap orphaned sandboxes automatically at run start (failure states such as BUILD_FAILED are reaped sooner than healthy ones, and an idle-activity guard means concurrent live runs are never reaped). Set BENCHFLOW_DAYTONA_AUTO_REAP to any of 0/false/no/off (case-insensitive) to disable that automatic pass and rely on the manual command above.

bench environment (deprecated)

bench environment is a hidden deprecated alias group, removed in 0.7. The local lifecycle moved to bench sandbox (create/list/cleanup) and hosted-provider browsing to bench hub env. The old bench environment create|list|cleanup and show|inspect (plus list --provider/--hub) still work, each printing a one-line stderr notice.

bench hub

External environment hubs: compatibility checks (check) and browsing a hosted provider’s environments (env).

bench hub env

Read-only browsing of a hosted provider’s environments (PrimeIntellect “Environments”). To run one, use bench eval create --source-env.
bench hub env list --provider primeintellect --owner primeintellect --search general-agent --limit 5
bench hub env show primeintellect/general-agent --version 0.1.1
bench hub env inspect primeintellect/general-agent --version 0.1.1 --path README.md

bench hub check

Inventory or structurally check representative tasks from an environment hub’s registry. Defaults to an inventory pass against the public Harbor registry JSON.
# Inventory the public Harbor hub registry
bench hub check

# Structural check, two tasks per dataset, JSONL output
bench hub check --level check --tasks-per-dataset 2 --out hub.jsonl
FlagDefaultDescription
--registryHarbor public registry URLHarbor registry JSON URL or local file
--tasks-per-dataset2Representative tasks selected per dataset
--levelinventoryCompatibility level: inventory or check
--outOptional JSONL output path
--cache-dir.cache/hub/harborCache directory for sparse clones
--limitOptional cap on selected task refs

YAML Config Format

Batch config with skills and skill nudge

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300
agent: gemini
model: gemini-3.1-flash-lite-preview
skill_mode: with-skill
skills_dir: shared-skills/
agent_env:
  BENCHFLOW_SKILL_NUDGE: name
max_retries: 2

Multi-scene (BYOS skill generation)

Use the Python API for multi-scene experiments. bench eval create --config is for batch job configs; scene configs are loaded with benchflow._utils.yaml_loader or built directly in Python.
task_dir: tasks/my-task
environment: daytona
sandbox_setup_timeout: 300

scenes:
  - name: skill-gen
    roles:
      - name: creator
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: creator
        prompt: "Analyze the task and write a skill document to /app/generated-skill.md"

  - name: solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver

bench continue

Resume a previous, unfinished (timed-out) openhands run to completion via record-replay. Standalone — it does not touch the normal run path. See Continuing timed-out runs for the full guide.
bench continue path/to/original/run-folder --tasks-dir path/to/tasks
Key options: --model (override the live-continuation model; defaults to the original run’s model), --timeout, --output, --require-timeout, --strict-divergence, --replay-only (rebuild via replay and stop at the cut-point — no live model or API key needed), and --proxy-mode (replay proxy placement: auto, host, or sandbox; default auto uses sandbox-local replay for Daytona/Modal and host replay for Docker).

bench continue-batch

Continue all timed-out OpenHands runs found under a directory tree. Discovers run folders (config.json + trajectory/llm_trajectory.jsonl) recursively, continues each, and prints a JSON batch summary (exits 1 if any continuation failed).
bench continue-batch path/to/jobs-root --tasks-dir path/to/tasks
FlagDefaultDescription
--tasks-dirDirectory holding task sources; required unless the recorded task path exists
--modeloriginal run’s modelOverride the live-continuation model
--timeoutWall-clock budget per continuation
--outputOutput jobs dir for continued runs
--concurrency100Maximum number of continuation runs in flight
--limitLimit discovered timeout folders
--strict-divergencefalseAbort a run if replay leaves the original rails
--proxy-modeautoReplay proxy placement: auto, host, or sandbox