Getting started
A 5-minute path from install to first eval.
Prerequisites
- Python 3.12+
- uv
- Docker for local sandboxes, pip install benchflow[sandbox-daytona] + DAYTONA_API_KEY for Daytona cloud runs, or pip install benchflow[sandbox-modal] for Modal-backed runs
- An API key or subscription/OAuth auth for at least one agent (see below)
Install
BenchFlow 0.6.0 is on PyPI. Install (or upgrade) with uv or pip, passing the --prerelease allow (uv) or --pre (pip) flag. The flag is required for BenchFlow’s
pinned LiteLLM release-candidate dependency, not for benchflow itself (0.6.0
is a final release). If uv reports Executables already exist: bench, benchflow, rerun with --force to replace older non-uv entrypoints. Confirm
the install with bench --version. See Release channels for the full
command matrix.
This gives you the benchflow (alias bench) CLI plus the Python SDK. To install for editable development:
Auth: OAuth, long-lived token, or API key
You don’t need an API key if you’re a Claude / Codex / Gemini subscriber. There are three options; pick one per agent:
Option 1 — Subscription OAuth from host CLI login
If you’ve logged into the agent’s CLI on your host (claude login, codex --login, gemini interactive flow), benchflow picks up the credential file and copies it into the sandbox. No API key billing.
| Agent | How to log in on the host | What benchflow detects | Replaces env var |
|---|---|---|---|
| claude-agent-acp | claude login (Claude Code CLI) | ~/.claude/.credentials.json | ANTHROPIC_API_KEY |
| codex-acp | codex --login (Codex CLI) | ~/.codex/auth.json | OPENAI_API_KEY |
| gemini | gemini (interactive login) | ~/.gemini/oauth_creds.json | GEMINI_API_KEY |
Option 2 — Long-lived OAuth token (CI / headless)
For CI pipelines, scripts, or anywhere the host can’t run an interactive browser login, generate a 1-year OAuth token with claude setup-token and export it as CLAUDE_CODE_OAUTH_TOKEN.
benchflow passes CLAUDE_CODE_OAUTH_TOKEN from your shell into the sandbox; the Claude CLI inside reads it directly. Auth precedence is the same as for plain claude (see the Anthropic docs): API keys override OAuth tokens, so unset ANTHROPIC_API_KEY if you want the token to win.
claude setup-token only authenticates Claude. Codex can also use a provided subscription access token, such as CODEX_ACCESS_TOKEN from a host/orchestrator integration; benchflow passes it through to Codex without copying ~/.codex/auth.json. Gemini does not have an equivalent today; use Option 1 (host login) or Option 3 (API key).
Option 3 — API key
Set the API-key env var directly. This works with every agent.
Azure Foundry models are selected with a model prefix such as azure-foundry-openai/gpt-5.5 or
azure-foundry-anthropic/claude-opus-4-5; benchflow derives the Azure resource
from AZURE_API_ENDPOINT and routes the selected agent through a generated
LiteLLM gateway config.
Several providers with user-supplied endpoints (deepseek, glm, kimi,
minimax, hunyuan, and others) follow the <PROVIDER>_API_KEY +
<PROVIDER>_BASE_URL convention; providers with fixed endpoints (such as
zai or openai) need only the API key. For example, deepseek/<model>
reads DEEPSEEK_API_KEY and DEEPSEEK_BASE_URL. If the base URL is missing, the run fails with:
Provider 'deepseek' for model 'deepseek/<model>' requires DEEPSEEK_BASE_URL to build the provider base URL.
These variables must be exported to reach the benchflow runtime — a plain
shell assignment or a source .env without export stays local to your shell
and never reaches the bench process. The portable pattern for a .env file:
(That pattern assumes a .env file in the
current directory; exporting works from any directory.)
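Because only exported variables reach child processes, a quick check run from the same shell can confirm what bench would see. The sketch below applies the <PROVIDER>_API_KEY + <PROVIDER>_BASE_URL convention described above. The provider set is taken from the text, but the helper names are hypothetical, and the assumption that fixed-endpoint providers use the same <PROVIDER>_API_KEY naming is mine, not something benchflow guarantees.

```python
import os

# Providers the text lists as needing a user-supplied endpoint. This set is
# illustrative only; it is not benchflow's internal provider registry.
USER_ENDPOINT_PROVIDERS = {"deepseek", "glm", "kimi", "minimax", "hunyuan"}


def required_env_vars(model: str) -> list[str]:
    """Return the env vars the <PROVIDER>_* convention implies for a model."""
    provider = model.split("/", 1)[0]
    prefix = provider.upper()
    # Assumption: fixed-endpoint providers also use <PROVIDER>_API_KEY.
    names = [f"{prefix}_API_KEY"]
    if provider in USER_ENDPOINT_PROVIDERS:
        names.append(f"{prefix}_BASE_URL")
    return names


def missing_env_vars(model: str, env=None) -> list[str]:
    """List required variables that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required_env_vars(model) if not env.get(name)]


if __name__ == "__main__":
    # A child Python process only sees exported variables, so an empty list
    # here means the values would also reach another child process.
    print(missing_env_vars("deepseek/<model>"))
```

If a variable shows up as missing here even though it is set in your shell, it was most likely assigned without export.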
Precedence
If multiple credentials are set, benchflow / the agent CLI uses provider-specific credentials selected by the model prefix first, then the agent’s native auth precedence. For Claude, native auth is (high to low): cloud provider creds → ANTHROPIC_AUTH_TOKEN → ANTHROPIC_API_KEY → apiKeyHelper →
CLAUDE_CODE_OAUTH_TOKEN → host subscription OAuth. To force a lower-priority
option, unset the higher one in your shell before running.
Run your first eval
bench eval create is the primary command for running evaluations — it works for
single tasks, batch runs, and remote repos. Use --tasks-dir <dir> for a local
directory or --config <config.yaml> for a YAML config.
You can also fetch tasks straight from a remote repo with
--source-repo <org/repo> --source-path <subpath>, but note that this clones
the full repository (git clone --depth 1 into .cache/datasets/<org>/<repo>/
under the enclosing git repo root, or the current directory when you run
outside one) — large for big task repos. To download only the task you need,
use a sparse checkout and point --tasks-dir at it:
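One way to script such a sparse checkout is sketched below. It assumes the task repo is hosted on GitHub and that your git is recent enough to support clone --sparse and sparse-checkout (2.25 or later); the function names and the example org/repo are hypothetical, and only standard git options are used.

```python
import subprocess
from pathlib import Path


def sparse_checkout_commands(org_repo: str, subpath: str, dest: Path) -> list[list[str]]:
    """Build git commands that fetch only `subpath` from a repository."""
    # Assumption: the repository lives on GitHub under <org>/<repo>.
    url = f"https://github.com/{org_repo}.git"
    return [
        # Shallow, blob-less clone with an initially minimal working tree.
        ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse", url, str(dest)],
        # Materialize only the requested directory.
        ["git", "-C", str(dest), "sparse-checkout", "set", subpath],
    ]


def fetch_task(org_repo: str, subpath: str, dest: Path) -> Path:
    """Run the commands and return the path of the checked-out subdirectory."""
    for cmd in sparse_checkout_commands(org_repo, subpath, dest):
        subprocess.run(cmd, check=True)
    return dest / subpath


if __name__ == "__main__":
    # Hypothetical repository and path, shown for illustration only.
    for cmd in sparse_checkout_commands("example-org/example-repo", "tasks/demo", Path("demo-tasks")):
        print(" ".join(cmd))
```

After `fetch_task` returns, the checked-out directory can be passed to --tasks-dir.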
These docs use BENCHFLOW_SKILL_NUDGE=name as the default
option. See Architecture: skill loading for
how mounted skills reach the agent and how name, description, and full
differ.
Where results land
Each run writes under --jobs-dir (default jobs/):
Reading results
Exit code 0 means the pipeline completed; it is not a pass/fail signal. A rollout whose reward is below the pass threshold still exits 0 and prints [FAIL] with Score: 0/1. Score is a pass-threshold aggregation (a task
counts as passed only at reward 1.0), while reward, recorded in result.json and
verifier/reward.txt, is the raw verifier value. Config errors (unknown
agents, missing credentials) exit 1, and so do runs with agent or verifier
errors. CLI usage errors (bad flags) exit 2.
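The distinction between exit code and reward can be captured in a small post-run check. The sketch below assumes verifier/reward.txt holds a single number and that `rollout_dir` is the directory containing the verifier/ folder; both are assumptions, and `summarize` is a hypothetical helper, not a benchflow API.

```python
from pathlib import Path

# Per the text above, a task counts as passed only at reward 1.0.
PASS_THRESHOLD = 1.0

# Exit-code meanings as described above.
EXIT_MEANINGS = {
    0: "pipeline completed (not a pass/fail signal)",
    1: "config error, or agent/verifier error",
    2: "CLI usage error",
}


def read_reward(rollout_dir: Path) -> float:
    """Read the raw verifier value (assumes the file holds one number)."""
    return float((rollout_dir / "verifier" / "reward.txt").read_text().strip())


def summarize(exit_code: int, rollout_dir: Path) -> str:
    """Combine exit code and reward into a one-line verdict."""
    if exit_code != 0:
        meaning = EXIT_MEANINGS.get(exit_code, "unknown exit code")
        return f"exit {exit_code}: {meaning}"
    # Exit 0 only says the pipeline finished; the reward decides pass/fail.
    reward = read_reward(rollout_dir)
    verdict = "PASS" if reward >= PASS_THRESHOLD else "FAIL"
    return f"exit 0: {verdict} (reward={reward})"
```

A CI job that should fail on a failed task therefore needs to inspect the reward (or the printed Score) rather than rely on the exit code alone.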
The Docker sandbox needs the Docker daemon running. There is no up-front
check — if the daemon is down the run fails partway through rather than at
startup, so start Docker before bench eval create --sandbox docker.
Run from Python
The CLI is a thin shim over the Python API. For programmatic use:
Rollout is decomposable: invoke each lifecycle phase individually for custom flows. See Concepts: rollout lifecycle.
What to read next
| If you want to… | Read |
|---|---|
| Understand how BenchFlow runs any benchmark (the three-layer model) | Run any benchmark |
| Understand the model — Rollout, Scene, Role, Verifier | Concepts |
| Author a task | Task authoring |
| Run multi-agent patterns (coder/reviewer, simulated user, BYOS) | Use cases |
| Run multi-round single-agent (progressive disclosure) | Progressive disclosure |
| Evaluate skills, not tasks | Skill eval |
| Understand the security model | Sandbox hardening |
| CLI flags + commands | CLI reference |
| Python API surface | Python API reference |