Running Adapted Benchmarks
How to run benchmarks that have been converted into Harbor-format tasks for BenchFlow.
BenchFlow ships with adapted benchmarks under benchmarks/<name>/. Each benchmark
includes a converter, parity tests, metadata, and one or more YAML job configs.
This guide covers how to run them — from a single task to a full evaluation sweep.
[!NOTE] BenchFlow is providing first-party support for PrimeIntellect Verifiers and OpenReward Standard.
Working inside the benchflow repo? Use uv run bench instead of bench to run the CLI from your local editable install.
Available benchmarks
| Benchmark | Tasks | Verification | Config |
|---|---|---|---|
| Harvey LAB | 1,251 | LLM-as-judge (per-criterion) | benchmarks/harvey-lab/ |
| ProgramBench | 201 | Deterministic unit tests | benchmarks/programbench/ |
| SkillsBench | 94+ | Unit tests | --source-repo benchflow-ai/skillsbench --source-path tasks |
Each benchmark directory contains:
- benchflow.py: converter for the raw benchmark source
- benchmark.yaml: metadata descriptor (task count, categories, verification method, parity results)
- <name>-*.yaml: job configs for different agents/models
- parity_test.py: parity validation suite
- parity_experiment.json: recorded parity results
Environment-plane benchmarks
Stateful, multi-service benchmarks integrate differently: instead of a converter they ship an environment.toml manifest and run on the
Environment plane. Two are onboarded:
| Benchmark | Topology | Manifest |
|---|---|---|
| ClawsBench | mock Gmail/Slack/Calendar/Docs/Drive, framework-started | benchmarks/clawsbench/environment.toml |
| chi-bench | ~25k-LOC healthcare simulator, image-owned lifecycle | benchmarks/chi-bench/environment.toml |
Quick start
Option 1: YAML config (bench eval create --config)
The simplest path. Point at a YAML config that specifies the benchmark source,
agent, and model:
Option 2: CLI flags
Use CLI flags for ad-hoc runs without a config file:
Note: Harvey LAB task names in benchflow-ai/benchmarks are flattened with hyphens (e.g. corporate-ma-analyze-cim-deal-teaser-scenario-01), not nested paths like the original repo (corporate-ma/analyze-cim-deal-teaser/scenario-01).
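The flattening in the Harvey LAB note can be sketched in a few lines of Python. This is an illustration only: flatten_task_name is a hypothetical helper, not part of BenchFlow, and it assumes the mapping is a plain "/" to "-" substitution, which is all the single example in this guide confirms.

```python
def flatten_task_name(nested: str) -> str:
    """Map a nested Harvey LAB path to a hyphen-flattened task name.

    Assumption: flattening is a plain "/" -> "-" substitution. That matches
    the one example in this guide; the real converter may do more.
    """
    return nested.strip("/").replace("/", "-")


if __name__ == "__main__":
    print(flatten_task_name("corporate-ma/analyze-cim-deal-teaser/scenario-01"))
```

A helper like this is handy when mapping task IDs from the original repo onto the names used in benchflow-ai/benchmarks.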
Option 3: Python API
For programmatic use, custom pipelines, or integration with other tools:
Versioned dataset runs (--dataset)
For runs whose results should be attributable to a published, immutable
dataset version (leaderboards, papers, release evidence), resolve the task
set from a dataset registry instead of pointing at a directory or branch:
The registry (registry.json at a dataset repo’s root; see skillsbench’s
docs/dataset-versioning.md)
pins each dataset version to an exact git_commit_id and per-task sha256
content digests. Resolution clones the pinned commit into
.cache/datasets, materializes an immutable per-commit snapshot,
recomputes every task’s digest, and fails before running anything on
any mismatch. Snapshot directories that are not part of the registry entry
are excluded from the run. The entry’s bench_version range is a hard
gate: running on a benchflow outside the range the dataset was validated
against fails before anything runs, because the results may not be
comparable with published runs. --ignore-bench-version overrides the
gate for local experimentation — the run then proceeds with a visible
warning on the record.
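The verify-before-run behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions: task_digest and verify_snapshot are hypothetical names, and the digest recipe (SHA-256 over sorted relative paths and file bytes) is a stand-in, since this guide does not specify how BenchFlow computes its digests.

```python
import hashlib
from pathlib import Path


def task_digest(task_dir: Path) -> str:
    """Hash a task directory's content (assumed recipe, not BenchFlow's)."""
    h = hashlib.sha256()
    # Sort files so the digest does not depend on filesystem ordering.
    for path in sorted(p for p in task_dir.rglob("*") if p.is_file()):
        h.update(path.relative_to(task_dir).as_posix().encode("utf-8") + b"\0")
        h.update(path.read_bytes() + b"\0")
    return h.hexdigest()


def verify_snapshot(snapshot: Path, expected: dict[str, str]) -> list[Path]:
    """Return the task directories to run, or raise before anything runs.

    `expected` maps task name -> pinned digest (hypothetical registry shape).
    Directories in the snapshot that are not listed in `expected` are
    simply not returned, so they are excluded from the run.
    """
    mismatched = [
        name
        for name, digest in sorted(expected.items())
        if not (snapshot / name).is_dir() or task_digest(snapshot / name) != digest
    ]
    if mismatched:
        raise ValueError(f"digest mismatch for tasks: {mismatched}")
    return [snapshot / name for name in sorted(expected)]
```

The point of the sketch is the ordering: every digest is checked first, and a single mismatch aborts the whole resolution rather than skipping one task.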
The registry is fetched from the skillsbench repo by default; point
--registry at another URL or a local registry.json to override.
Every result.json/config.json is stamped with dataset_name,
dataset_version, and the per-task task_digest (summary.json carries
the name/version), so downstream tooling can group results by
dataset@version. --tasks-dir stays as the visibly distinct dev mode:
its artifacts carry no dataset fields — but they still stamp a
live-computed task_digest, so even dev trajectories remain attributable
to exact task content. bench tasks digest <task-dir> prints the same
digest for any task directory.
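Downstream grouping by dataset@version might look like the sketch below. It assumes dataset_name and dataset_version sit at the top level of each parsed result.json, which this guide does not state explicitly; group_by_dataset and the "dev" bucket label are hypothetical.

```python
from collections import defaultdict
from typing import Any


def group_by_dataset(results: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Bucket parsed result.json dicts by "name@version".

    Assumption: the dataset fields are top-level keys. Results without them
    (for example --tasks-dir dev runs) land in a separate "dev" bucket.
    """
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for result in results:
        name = result.get("dataset_name")
        version = result.get("dataset_version")
        key = f"{name}@{version}" if name and version else "dev"
        groups[key].append(result)
    return dict(groups)
```

Keeping dev-mode results in their own bucket mirrors the distinction drawn above: they still carry a task_digest, but they are not attributable to a published dataset version.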
Running a subset of tasks
Single task
Batch with a tasks directory
Point bench eval create --tasks-dir at a directory containing only the tasks you want:
Using --source-path for remote subsets
Running ProgramBench
201 program-reconstruction tasks across 7 languages (C, Rust, Go, C++, Java, Haskell, Bash). Tasks are generated at runtime from the ProgramBench repo’s metadata: benchmarks/programbench/tasks/ is not checked into this repo and must be
produced first.
Prerequisites
- Docker (images are linux/amd64 only — use a Linux x86_64 machine)
- ~20GB disk for Docker images
- Internet access for HuggingFace test blob downloads during verification
- A local clone of programbench (passed via --programbench-dir to the generator)
Generate the tasks
Run all tasks
Run a single task (after generation)
Oracle verification
Verify a task is solvable using the gold solution (original source at commit):
Validate a task directory
Choosing an agent
Any registered BenchFlow agent works with adapted benchmarks. List them:

| Agent | Key | Auth |
|---|---|---|
| Gemini | gemini | GEMINI_API_KEY or host login |
| Claude Code | claude-agent-acp (alias: claude) | ANTHROPIC_API_KEY or host login |
| Codex | codex-acp (alias: codex) | OPENAI_API_KEY, CODEX_API_KEY, CODEX_ACCESS_TOKEN, or host login |
| OpenHands | openhands (alias: oh) | LLM_API_KEY |
| Harvey LAB harness | harvey-lab-harness (alias: harvey-lab) | Provider key matching model |
Azure Foundry models need AZURE_API_KEY plus AZURE_API_ENDPOINT, with model
prefixes such as azure-foundry-openai/gpt-5.5 or
azure-foundry-anthropic/claude-opus-4-5.
Any agent can also be run via ACPX by prefixing with acpx/:
Choosing a sandbox
| Sandbox | Flag | Best for |
|---|---|---|
| Docker | --sandbox docker | Local development, small runs (≤10 tasks) |
| Daytona | --sandbox daytona | Cloud runs with concurrency (needs DAYTONA_API_KEY) |
| Modal | --sandbox modal | Serverless, high concurrency (needs Modal auth) |
Daytona has a 10 GB-per-sandbox hard cap. Tasks with heavy images (large HuggingFace model snapshots, Playwright, LaTeX/marker; for example latex-formula-extraction) overflow during bootstrap (No space left on device) or hang at “Sandbox user agent ready” with no trajectory. Run those on --sandbox docker (host disk, no cap); keep Daytona for lighter tasks.
SkillsBench skill-toggle matrix (Opus-4.8 + Gemini) on Daytona
A self-contained recipe for the four-cell matrix of {Opus-4.8 via Bedrock, Gemini-3.5-flash} × {with-skills, without-skills},
agent openhands, sandbox daytona. Each cell produces a complete trajectory
(trajectory/{acp,llm}_trajectory.jsonl) plus a verifier reward — but treat a
cell as done only after the audit in Verifying the batch
passes.
Setup (once per shell)
Pick a light task: Daytona caps each sandbox at 10 GB (see the note above). citation-check is a good default; heavy tasks need --sandbox docker. Note that MAX effort makes each Opus turn much slower because of deep server-side reasoning: a citation-check cell took ~10–15 min at max vs ~3 min at the default effort.
Run the four cells
BENCHFLOW_BEDROCK_THINKING_EFFORT=max is what makes the two Opus cells actually
run at MAX. LiteLLM writes the provider call metadata to
trajectory/llm_trajectory.jsonl; confirm the adaptive thinking effort there.
| Model (--model) | Skills | Cell-specific flags |
|---|---|---|
| aws-bedrock/us.anthropic.claude-opus-4-8 | with | --skill-mode with-skill |
| aws-bedrock/us.anthropic.claude-opus-4-8 | without | --skill-mode no-skill |
| gemini-3.5-flash | with | --agent-env LLM_CACHING_PROMPT=false --skill-mode with-skill |
| gemini-3.5-flash | without | --agent-env LLM_CACHING_PROMPT=false --skill-mode no-skill |
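The table can also be generated rather than typed by hand. The sketch below enumerates the four cells as lists of cell-specific arguments, using only the flags shown in the table; matrix_cells is a hypothetical helper, and the shared flags (task, agent, sandbox, and so on) are deliberately left out because they are common to every cell.

```python
from itertools import product

# Per-model extra flags, copied from the table above.
MODELS = {
    "aws-bedrock/us.anthropic.claude-opus-4-8": [],
    "gemini-3.5-flash": ["--agent-env", "LLM_CACHING_PROMPT=false"],
}
SKILL_MODES = ["with-skill", "no-skill"]


def matrix_cells() -> list[list[str]]:
    """Return the cell-specific argument list for each of the four cells."""
    cells = []
    for (model, extra), mode in product(MODELS.items(), SKILL_MODES):
        cells.append(["--model", model, *extra, "--skill-mode", mode])
    return cells


if __name__ == "__main__":
    for cell in matrix_cells():
        print(" ".join(cell))
```

Appending each list to the shared part of the command keeps the four invocations identical except for the dimensions the matrix is meant to vary.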
Verifying the batch
A finished command is not a healthy trial. After each batch, audit the trajectories with the benchflow-experiment-review skill (repo copy at
.claude/skills/benchflow-experiment-review; see the Experiment-guidance notes in
AGENTS.md). A trial counts as healthy only when every check passes: complete
trajectory + metadata (timing, token usage, tool usage), correct
pass/fail/timeout status, verifier isolation (verifier starts after the agent
exits), no reward hacking, and the right skill posture — with-skills cells must
show the task skill loaded (task_skills_loading: 1), without-skills cells must
not (task_skills_loading: 0, and the task skill absent from the trajectory;
generic openhands built-ins such as .agents/skills / invoke_skill appear in
every run and are not leakage).
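The skill-posture rule above reduces to a small predicate. This is a sketch only: skill_posture_ok is a hypothetical helper, and it takes task_skills_loading and a "task skill seen in trajectory" flag as plain arguments because this guide does not say where those values are stored.

```python
def skill_posture_ok(skill_mode: str, task_skills_loading: int,
                     task_skill_in_trajectory: bool) -> bool:
    """Check one cell against the posture rule described above.

    `task_skill_in_trajectory` refers to the task's own skill only; generic
    openhands built-ins are expected in every run and must not be counted.
    """
    if skill_mode == "with-skill":
        return task_skills_loading == 1
    if skill_mode == "no-skill":
        return task_skills_loading == 0 and not task_skill_in_trajectory
    raise ValueError(f"unknown skill mode: {skill_mode!r}")
```

This covers only one of the health checks; trajectory completeness, verifier isolation, and reward hacking still need the full audit.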
Quick smoke checks before the full audit (per jobs-dir):
- --usage-tracking required records provider-reported token usage into each trajectory.
- --agent-idle-timeout none disables the idle watchdog (the task wall-clock still applies).
- Opus-4.8 on Bedrock needs the adaptive-thinking patch, which LiteLLM loads into its proxy process (src/benchflow/providers/litellm_bedrock_patch.py); see AGENTS.md.
- For heavy tasks, replace --sandbox daytona with --sandbox docker; the other flags stay the same.
Running a benchmark with an Environment manifest
A stateful benchmark (one with mock services, databases, or accounts the agent acts on) declares its world in an environment.toml manifest and runs
on the Environment plane. Use
bench eval create --tasks-dir ... for both single-task and batch
manifest-backed evaluations; --environment-manifest applies the manifest to
every rollout in the Job pipeline.
In a YAML job config, set environment_manifest: <path> at the top level so the batch run is reproducible from disk.
--environment-manifest is distinct from --sandbox: the sandbox is where
the rollout runs; the environment manifest is the world the agent acts in.
BenchFlow provisions the environment, gates on its readiness before the agent
runs, and tears it down afterward. See the Environment plane
for the full manifest schema, both onboarded benchmarks, and the
snapshot/restore roll-back contract.
Running foreign benchmarks (inbound adapters)
BenchFlow runs benchmarks authored in other formats without converting them first. An inbound adapter translates a foreign task directory into BenchFlow-native shape; the rollout then runs natively:

| Source format | Signature file | Adapter |
|---|---|---|
| Harbor | task.toml | HarborAdapter |
benchflow.adapters.inbound.detect_adapter() sniffs a task directory and
picks the adapter whose format it matches. The adapter is a pure
Path -> InboundTask translation: it reads a directory and returns an
in-memory native task, building no sandboxes and running nothing.
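Signature-file sniffing of this kind can be sketched as below. This is not the implementation of benchflow.adapters.inbound.detect_adapter(): sniff_format and the SIGNATURES mapping are hypothetical, and the sketch returns an adapter name as a string rather than an adapter object.

```python
from pathlib import Path
from typing import Optional

# Signature file -> adapter name, taken from the table above.
SIGNATURES = {"task.toml": "HarborAdapter"}


def sniff_format(task_dir: Path) -> Optional[str]:
    """Return the name of the adapter whose signature file is present.

    Pure inspection: reads the directory listing, builds nothing, runs nothing.
    """
    for signature, adapter_name in SIGNATURES.items():
        if (task_dir / signature).is_file():
            return adapter_name
    return None
```

Keeping detection free of side effects is what makes it safe to call on arbitrary directories before any sandbox exists.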
Continual learning (sequential-shared job mode)
By default a job runs its rollouts concurrently and isolated
(parallel-independent). A continual-learning job instead runs them
strictly in order over one persistent, versioned store of memory + skills —
set job_mode: sequential-shared in the YAML config:
Each rollout starts from the current LearnerStore state and, after
it scores, offers its reward as a learning-curve metric: an improvement
stamps a new generation, a regression is reverted to the best generation so
far. Concurrency is ignored — a shared mutable store cannot be written by
overlapping rollouts. See the architecture doc,
capability 5, for the full design.
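The stamp-or-revert rule can be illustrated with a toy model. GenerationLog below is hypothetical and is not the LearnerStore API; it tracks only rewards and generation numbers, and it assumes a tie with the best reward leaves the store unchanged, which this guide does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationLog:
    """Toy model of the learning-curve bookkeeping (not the real store)."""

    best_reward: float = float("-inf")
    best_generation: int = 0
    history: list = field(default_factory=list)

    def offer(self, reward: float) -> str:
        """Record one rollout's reward and report what happened to the store."""
        if reward > self.best_reward:
            # Improvement: stamp a new generation and make it the best so far.
            self.best_generation += 1
            self.best_reward = reward
            outcome = "stamped"
        elif reward < self.best_reward:
            # Regression: fall back to the best generation so far.
            outcome = "reverted"
        else:
            # Assumption: a tie neither stamps nor reverts.
            outcome = "unchanged"
        self.history.append((reward, outcome, self.best_generation))
        return outcome
```

Because every call to offer reads and then mutates shared state, the order of rollouts matters, which is why such a job cannot run its rollouts concurrently.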
Reading results
Results land under jobs/<job-name>/<rollout-name>/:
result.json contains (abridged):
| You want | Read | Notes |
|---|---|---|
| reward | rewards.reward | scalar 0.0–1.0, or null if unscored |
| token total | agent_result.total_tokens | null when no provider usage was captured (e.g. hosted runs) |
| outcome / status | derived from rewards.reward + error/verifier_error | not stored as a field; see below |
There is no top-level reward, total_tokens, or status
key: those names are absent, not null. For a naive consumer,
result["reward"] raises KeyError and result.get("total_tokens") returns None
because the key does not exist, never because the value is null. Pass/fail is a derived
classification (only reward == 1.0 passes); BenchFlow computes it from
rewards.reward plus the error channels rather than persisting a redundant
status. The same nested shape is produced for both native rollouts and
hosted-env runs, so one reader handles both.
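A reader for the nested shape might look like the sketch below. The paths rewards.reward and agent_result.total_tokens come from the table above; everything else is an assumption: read_result is a hypothetical helper, error and verifier_error are assumed to be top-level keys, and the status labels are illustrative rather than BenchFlow's own classification.

```python
from typing import Any


def read_result(result: dict[str, Any]) -> dict[str, Any]:
    """Extract reward, token total, and a derived status from result.json."""
    rewards = result.get("rewards") or {}
    agent_result = result.get("agent_result") or {}
    reward = rewards.get("reward")  # scalar 0.0-1.0, or None if unscored

    # Derived status. Labels and precedence are assumptions for illustration.
    if result.get("error"):
        status = "error"
    elif result.get("verifier_error"):
        status = "verifier_error"
    elif reward is None:
        status = "unscored"
    elif reward == 1.0:
        status = "pass"  # only a reward of exactly 1.0 passes
    else:
        status = "fail"

    return {
        "reward": reward,
        "total_tokens": agent_result.get("total_tokens"),
        "status": status,
    }
```

Load the file with json.loads(path.read_text()) and pass the resulting dict in; the same function then works for both native rollouts and hosted-env runs.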
n_skill_invocations is derived from structured ACP trajectory events: BenchFlow
counts only tool_call events whose kind is skill. Job summary.json
also includes total_skill_invocations and avg_skill_invocations across the
rollouts in the run.
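The counting rule can be reproduced offline from a trajectory file. The sketch below assumes each JSONL line is an object with a "type" field naming the event and a "kind" field on tool calls; those field names are guesses for illustration and should be checked against a real trajectory. count_skill_invocations is a hypothetical helper.

```python
import json
from typing import Iterable


def count_skill_invocations(lines: Iterable[str]) -> int:
    """Count tool_call events whose kind is "skill" in JSONL trajectory lines.

    Assumption: events look like {"type": "tool_call", "kind": "skill", ...}.
    """
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event.get("type") == "tool_call" and event.get("kind") == "skill":
            count += 1
    return count
```

Passing an open file handle works directly, since iterating a text file yields its lines.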
List evaluations:
Running parity validation
Parity validation is a developer/maintainer workflow for verifying that an adapter preserves benchmark semantics. These scripts live under each benchmark’s directory:
Parity results are recorded in parity_experiment.json and benchmark.yaml.
YAML config reference
Job configs use the two-field source pattern to reference remote benchmark repos:
source pattern, pointing at the
benchmarks dataset repo:
tasks_dir: for local paths:
source, tasks_dir, agent, model, environment, concurrency,
sandbox_setup_timeout, skills_dir, agent_env, max_retries.
Adding a new benchmark
See the Benchmark Conversion Guide for the 9-step process to convert a new benchmark into Harbor-format tasks for BenchFlow. Harvey LAB (benchmarks/harvey-lab/) and ProgramBench (benchmarks/programbench/) are
reference implementations.