
Concepts

The mental model for benchflow. Read once, then refer back from the how-tos.

The five primitives

| Primitive | What it is |
| --- | --- |
| Task | A directory on disk: a task.md document (YAML frontmatter + prompt body) plus environment/Dockerfile for the sandbox, verifier/ checks, and optional oracle/; or the legacy split layout (task.toml + instruction.md + tests/ + solution/). Authored once, evaluated many times. |
| Agent | A registered ACP-speaking program (Claude Code, Gemini CLI, OpenCode, etc.). Identified by name ("gemini", "opencode") plus an optional model ID. Use the acpx/ prefix (e.g. acpx/gemini) to route through ACPX, a headless ACP client with persistent sessions and crash recovery. |
| Environment | The sandbox where the agent runs and the verifier checks the result. Docker locally, Daytona for cloud, Modal for serverless/GPU. Abstracted behind the Sandbox protocol, so you can bring your own sandbox backend. |
| Verifier | The test runner that scores the rollout. Its entry point is a test.sh script (native verifier/test.sh, legacy tests/test.sh), which typically runs pytest against the workspace the agent left behind. For subjective tasks, use an LLM-as-judge verifier scored against a rubric. Outputs rewards: {reward: float}. See the verifier file map for which file lives where in native vs legacy packages. |
| Rollout | One agent run on one task. Holds the lifecycle (setup → start → install → execute → verify → cleanup). All higher-level primitives below are built on Rollouts. |

Rollout lifecycle

A Rollout is decomposable: each phase is a callable method, so you can either call the phases in sequence yourself or invoke Rollout.run() to execute all six in order. Multi-agent flows reuse phases (e.g. connect + execute + disconnect repeats per role).
┌──────────────────────────────────────────────────────────────┐
│                    Rollout.run()                             │
│                                                              │
│  setup()         resolve config, create sandbox env handle   │
│    ↓                                                         │
│  start()         start container, upload task files          │
│    ↓                                                         │
│  install_agent() install agent binary, write credentials,    │
│                  set up sandbox user                         │
│    ↓                                                         │
│  ┌─ connect_as(role)  ◄─── multi-agent loops here            │
│  │  execute(prompts)        each role's turn                 │
│  └─ disconnect()                                             │
│    ↓                                                         │
│  verify()        harden sandbox, run pytest, score           │
│    ↓                                                         │
│  cleanup()       kill agent procs, stop container            │
└──────────────────────────────────────────────────────────────┘
Each phase has a name, a clear contract, and is independently testable. Rollout.run() is the convenience that calls them in order.
import asyncio
from pathlib import Path

import benchflow as bf
from benchflow import RolloutConfig, Scene


async def main() -> None:
    config = RolloutConfig(
        task_path=Path("tasks/edit-pdf"),
        scenes=[Scene.single(agent="gemini", model="gemini-3.1-pro-preview")],
        environment="daytona",
    )
    result = await bf.run(config)   # full lifecycle: setup through cleanup
    print(result.rewards)           # e.g. {'reward': 1.0}


asyncio.run(main())  # bf.run is awaited, so it needs a running event loop
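To make the decomposition concrete, here is a toy stand-in rather than benchflow's own Rollout class. ToyRollout and its method bodies are hypothetical; only the phase names and their order are taken from the diagram above. It sketches how run() can be nothing more than the phases called in sequence, with the connect/execute/disconnect trio repeated per role.

```python
class ToyRollout:
    """Hypothetical stand-in for a rollout: each phase only records that it ran."""

    def __init__(self, roles):
        self.roles = list(roles)
        self.log = []

    # One method per phase, so each can be called (and tested) on its own.
    def setup(self):
        self.log.append("setup")

    def start(self):
        self.log.append("start")

    def install_agent(self):
        self.log.append("install_agent")

    def connect_as(self, role):
        self.log.append(f"connect_as({role})")

    def execute(self, prompts):
        self.log.append(f"execute({len(prompts)})")

    def disconnect(self):
        self.log.append("disconnect")

    def verify(self):
        self.log.append("verify")

    def cleanup(self):
        self.log.append("cleanup")

    def run(self, prompts_by_role):
        """Convenience wrapper: call every phase in lifecycle order."""
        self.setup()
        self.start()
        self.install_agent()
        for role in self.roles:  # multi-agent flows loop here
            self.connect_as(role)
            self.execute(prompts_by_role[role])
            self.disconnect()
        self.verify()
        self.cleanup()
        return self.log


rollout = ToyRollout(roles=["coder", "reviewer"])
log = rollout.run({
    "coder": ["Fix the bug."],
    "reviewer": ["Review the current workspace."],
})
print(" -> ".join(log))
```

Because each phase is an ordinary method, a test can call verify() on a prepared sandbox without going through setup or execute first; that is the property the real lifecycle is described as having.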

Scenes, Roles, Turns

A Scene is authoring sugar for Step metadata. Inside a Scene:
  • Roles are the agents that participate (one or more).
  • Turns are the prompt sequence — which Role acts when, and what they’re told.
  • All Roles share the same sandbox filesystem.
Before rollout execution, BenchFlow desugars Scenes into explicit rollout Steps carrying role, prompt, and skill attribution. Scene has no runtime object, scheduler, message router, or lifecycle.
Scene(
    name="review-loop",
    roles=[
        Role(name="coder",    agent="opencode", model="anthropic/claude-sonnet-4-6"),
        Role(name="reviewer", agent="gemini",   model="gemini-3.1-pro-preview"),
    ],
    turns=[
        Turn(role="coder"),
        Turn(role="reviewer", prompt="Review the current workspace."),
        Turn(role="coder",    prompt="Read the reviewer's feedback and revise."),
    ],
)
A Rollout may have multiple Scenes — used for staged flows like “skill generation → solve” (BYOS / Bring Your Own Skill). Same sandbox, sequential Scenes.
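The desugaring step can be pictured with a small self-contained sketch. Everything below is hypothetical: the Role, Turn, Scene, and Step dataclasses are local stand-ins that mirror the names used in this section, not benchflow's real classes. Skill attribution is left out, and falling back to the task instruction when a Turn has no prompt is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Role:
    name: str
    agent: str
    model: Optional[str] = None


@dataclass(frozen=True)
class Turn:
    role: str
    prompt: Optional[str] = None  # None: use the default prompt (assumption)


@dataclass(frozen=True)
class Scene:
    name: str
    roles: list
    turns: list


@dataclass(frozen=True)
class Step:
    scene: str
    role: str
    agent: str
    model: Optional[str]
    prompt: str


def desugar(scenes, default_prompt):
    """Flatten Scenes into an ordered list of explicit Steps."""
    steps = []
    for scene in scenes:  # Scenes run sequentially
        roles = {role.name: role for role in scene.roles}
        for turn in scene.turns:
            role = roles[turn.role]  # KeyError if a Turn names an undeclared Role
            steps.append(Step(
                scene=scene.name,
                role=role.name,
                agent=role.agent,
                model=role.model,
                prompt=turn.prompt or default_prompt,
            ))
    return steps


scene = Scene(
    name="review-loop",
    roles=[
        Role(name="coder", agent="opencode", model="anthropic/claude-sonnet-4-6"),
        Role(name="reviewer", agent="gemini", model="gemini-3.1-pro-preview"),
    ],
    turns=[
        Turn(role="coder"),
        Turn(role="reviewer", prompt="Review the current workspace."),
        Turn(role="coder", prompt="Read the reviewer's feedback and revise."),
    ],
)
steps = desugar([scene], default_prompt="<task instruction>")
for step in steps:
    print(step.scene, step.role, step.agent, "|", step.prompt)
```

After this pass nothing Scene-shaped remains, only a flat list of Steps, which matches the statement that a Scene has no runtime object of its own.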

The User abstraction (multi-round, single-agent)

Sometimes you want the agent to take multiple turns guided not by another LLM but by a Python callback that watches what happened and decides what to say next. That’s a User.
A User is a BaseUser subclass (or FunctionUser wrapping a function) with two methods:
  • setup(instruction, solution) — once, before round 0
  • run(round, instruction, round_result) → str | None — per round; return None to stop the loop
Between rounds, BenchFlow executes soft_verify() (the verifier run without the destructive parts of full hardening), gives the user the round’s RoundResult (trajectory, rewards, verifier output, tool count), and lets the user decide round N+1’s prompt.
Use BaseUser when the loop logic is rule-based (compress instruction → show test failures as hints → stop on pass). See progressive-disclosure.md for the full guide.
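The round loop can be sketched without importing benchflow at all. In the code below, HintingUser, the RoundResult stand-in, and the drive() loop are all hypothetical; only the two method signatures, setup(instruction, solution) and run(round, instruction, round_result), come from the description above. The scripted results play the part of what soft_verify() would report after each round.

```python
from dataclasses import dataclass, field


@dataclass
class RoundResult:
    """Hypothetical stand-in: only the fields this sketch reads."""
    rewards: dict = field(default_factory=dict)
    verifier_output: str = ""


class HintingUser:
    """Rule-based user: terse prompt first, then failures as hints, stop on pass."""

    def __init__(self, max_rounds=3):
        self.max_rounds = max_rounds
        self.terse = ""

    def setup(self, instruction, solution=None):
        # Called once before round 0: compress the instruction to its first line.
        self.terse = instruction.strip().splitlines()[0]

    def run(self, round, instruction, round_result=None):
        # `round` shadows the builtin; the name follows the signature given above.
        if round == 0 or round_result is None:
            return self.terse
        passed = round_result.rewards.get("reward", 0.0) >= 1.0
        if passed or round >= self.max_rounds:
            return None  # returning None stops the loop
        return f"{instruction}\n\nThe verifier reported:\n{round_result.verifier_output}"


def drive(user, instruction, scripted_results):
    """Toy loop: one scripted result per round replaces a real agent run."""
    user.setup(instruction, None)
    prompts, last = [], None
    for round_no, result in enumerate(scripted_results):
        prompt = user.run(round_no, instruction, last)
        if prompt is None:
            break
        prompts.append(prompt)
        last = result  # pretend the agent ran and the soft verifier scored it
    return prompts


instruction = "Fix the failing date parser.\nDetails: handle ISO 8601 week dates."
prompts = drive(
    HintingUser(),
    instruction,
    [
        RoundResult({"reward": 0.0}, "1 failed, 4 passed"),
        RoundResult({"reward": 1.0}, "5 passed"),
        RoundResult({"reward": 1.0}, "5 passed"),
    ],
)
print(len(prompts), "prompts sent")
```

Round 0 sends the terse first line, round 1 sends the full instruction plus the verifier output, and the passing result from round 1 makes run() return None at round 2.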

Verifier, sandbox, hardening

Once the agent stops, the verifier runs. Its entry point is the task’s test.sh script, uploaded to /verifier for native packages (/tests for legacy ones) and executed against the workspace the agent left behind. benchflow runs test.sh as a script (it chmod +x’s the file and executes it directly; a native script strategy runs cd /verifier && <command>). It never hands test.sh to pytest, because pytest cannot collect a shell script as a test target.
Most test.sh scripts invoke pytest internally. For those invocations, benchflow applies hardening through PYTEST_ADDOPTS in the verifier environment, so every pytest run inside test.sh inherits roughly:
PYTEST_ADDOPTS="-c /dev/null --confcutdir=<verifier-dir> --rootdir=<workspace> -p no:cacheprovider"
where <verifier-dir> is /verifier for native packages (/tests for legacy), and <workspace> is the agent workspace (/app for Harbor/SWE-bench conventions, /root for SkillsBench; injected dynamically). -c /dev/null blocks pyproject.toml/pytest.ini discovery and --confcutdir blocks conftest.py walk-up beyond the verifier dir.
Tasks that do not use pytest (e.g. a test.sh that diffs files and writes reward.txt directly) are scored the same way: pytest is just the most common tool, not a requirement.
Between the agent run and the verifier run, benchflow hardens the sandbox to prevent the agent from gaming the score:
  • Kill any lingering agent processes
  • Restore build-config files (setup.py, pyproject.toml, …) to their pre-agent snapshots
  • Delete agent-injected conftest.py, sitecustomize.py, .pth files
  • Lock the workspace to root, set restrictive PYTHONPATH/PATH for the verifier process
  • Run pytest with plugin auto-discovery off, only allowing plugins declared in the task config ([verifier] pytest_plugins in task.toml, or auto-discovered root-owned plugins)
This catches the BenchJack and Meerkat exploit families documented in the historical (0.2.x-era) labs docs/labs/benchjack-sandbox-hardening/ and docs/labs/reward-hack-matrix/.
When a task ships a legitimate conftest.py (e.g. qutebrowser uses one to break a real circular import), the task opts out via task.toml:
[verifier.hardening]
cleanup_conftests = false
See progressive-disclosure.md for the full opt-out list.
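As an illustration of how the PYTEST_ADDOPTS value shown earlier in this section could be assembled, here is a minimal sketch. The function name build_pytest_addopts is hypothetical and this is not benchflow's implementation; the four options are the ones quoted in this section. Quoting with shlex is a choice of the sketch, made because pytest splits the variable with shell-like rules.

```python
import shlex


def build_pytest_addopts(verifier_dir: str, workspace: str) -> str:
    """Assemble a PYTEST_ADDOPTS string from the options quoted in this section."""
    parts = [
        "-c", "/dev/null",                # ignore pyproject.toml / pytest.ini in the workspace
        f"--confcutdir={verifier_dir}",   # stop conftest.py walk-up above the verifier dir
        f"--rootdir={workspace}",         # pin rootdir to the agent workspace
        "-p", "no:cacheprovider",         # disable the cache plugin
    ]
    # Quote each token so paths with unusual characters survive splitting.
    return " ".join(shlex.quote(part) for part in parts)


native = build_pytest_addopts("/verifier", "/app")
legacy = build_pytest_addopts("/tests", "/root")
print(native)
print(legacy)
```

Passing the result as an environment variable to the process that runs test.sh means any pytest invocation inside the script picks the options up without the script having to cooperate.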

Verifier file map

Native task.md packages and the legacy split layout name their verifier files differently. The runtime resolves native files first and falls back to the legacy names, so for each row a task ships either the native file or the legacy one, not both:
| What it is | Native (task.md) package | Legacy split layout | Sandbox path |
| --- | --- | --- | --- |
| Verifier directory | verifier/ | tests/ | /verifier (native), /tests (legacy) |
| Script entry point | verifier/test.sh | tests/test.sh | executed as a script (chmod +x then run) inside the verifier dir |
| Strategy declaration (how it’s scored) | verifier/verifier.md | none (legacy uses [verifier] in task.toml) | not uploaded as a runtime target; selects the strategy |
| LLM-judge rubric | verifier/rubrics/verifier.md + verifier/rubrics/verifier.toml | tests/rubric.toml (also rubric.json, Harvey-LAB style) | downloaded for the judge |
A plain test.sh is a complete verifier on its own: with no verifier.md strategy declared, the runtime just executes it. verifier/verifier.md declares how a task is scored (script / llm-judge / reward-kit / agent-judge / ors-episode) and is the native equivalent of the legacy [verifier] section in task.toml. The native LLM-judge rubric lives under verifier/rubrics/ (both a human-readable verifier.md and a machine-readable verifier.toml), not in a single top-level rubric.toml. For the native verifier document and its strategy table see Native task.md authoring; for the legacy [verifier.judge] rubric path see LLM-as-judge.
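The native-first, legacy-fallback rule for the script entry point can be sketched as a small lookup. The function resolve_test_script is hypothetical and only covers the test.sh row of the table; the relative paths and sandbox directories are the ones listed there.

```python
import tempfile
from pathlib import Path

# (layout, path inside the task directory, sandbox directory), in lookup order.
CANDIDATES = [
    ("native", "verifier/test.sh", "/verifier"),
    ("legacy", "tests/test.sh", "/tests"),
]


def resolve_test_script(task_dir: Path):
    """Return (layout, sandbox_dir) for the first test.sh found, or None."""
    for layout, relative, sandbox_dir in CANDIDATES:
        if (task_dir / relative).is_file():
            return layout, sandbox_dir
    return None


with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp)
    missing = resolve_test_script(task)  # no verifier files at all

    (task / "tests").mkdir()
    (task / "tests" / "test.sh").write_text("#!/bin/sh\nexit 0\n")
    legacy_hit = resolve_test_script(task)  # only the legacy file exists

    # A real task ships one layout; both are created here only to show precedence.
    (task / "verifier").mkdir()
    (task / "verifier" / "test.sh").write_text("#!/bin/sh\nexit 0\n")
    native_hit = resolve_test_script(task)

print(missing, legacy_hit, native_hit)
```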

Multi-turn vs multi-round vs multi-scene

Three different axes — easy to confuse, worth pinning down:
| Axis | What changes | Example |
| --- | --- | --- |
| Multi-turn | Same Role, multiple prompts within one Scene. The ACP session persists; the agent has continuous memory. | One coder gets prompted twice: “fix the bug”, then “now write a test”. |
| Multi-round | Same Role, multiple connect → execute → disconnect cycles. New ACP session each round; sandbox state persists; a Python User callback decides each round’s prompt. | Progressive disclosure on SWE-bench Pro: round 0 terse spec, round 1 hints with failing tests, round 2 full spec. |
| Multi-scene | Multiple Scenes in one Rollout. Sandbox state persists; agent process and ACP session restart between Scenes. | BYOS: Scene 1 generates a skill, Scene 2 solves the task using it. |
Simple single-agent runs use none of these. Pick the axis based on what state needs to persist (memory? sandbox? both?).

Trajectories and rewards

Every agent action is captured as an event in the trajectory: tool calls, agent messages, agent thoughts. A RolloutResult (aliased as RunResult) holds the full trajectory and tool count, the rewards from the verifier, and any error.
rewards is a dict produced by the task’s verifier. Convention: {"reward": float} where 1.0 = pass, 0.0 = fail. Tasks may add additional metrics (e.g. exact_match, partial_credit).
Trajectories are written to <jobs_dir>/<job_name>/<rollout_name>/trajectory/acp_trajectory.jsonl, where <jobs_dir> is the --jobs-dir directory (default jobs/). Use them for replay, debugging, or training data.
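For offline inspection, a trajectory file can be read with nothing but the standard library. The sketch below assumes one JSON object per line, as the .jsonl extension suggests. The helper names are hypothetical, and so is the "type" key used in the demo data, since the event schema is not described here. The path layout and the reward convention are the ones stated above.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def trajectory_path(jobs_dir, job_name, rollout_name):
    """Build the trajectory path from the layout described above."""
    return Path(jobs_dir) / job_name / rollout_name / "trajectory" / "acp_trajectory.jsonl"


def load_events(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    events = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                events.append(json.loads(line))
    return events


def passed(rewards):
    """Apply the convention: reward 1.0 means pass, anything lower does not."""
    return rewards.get("reward", 0.0) >= 1.0


with tempfile.TemporaryDirectory() as jobs_dir:
    path = trajectory_path(jobs_dir, "demo-job", "demo-rollout")
    path.parent.mkdir(parents=True)
    # Made-up events; the "type" key is a placeholder, not the real schema.
    fake = [{"type": "tool_call"}, {"type": "agent_message"}, {"type": "tool_call"}]
    path.write_text("\n".join(json.dumps(event) for event in fake) + "\n", encoding="utf-8")

    events = load_events(path)
    counts = Counter(event.get("type", "unknown") for event in events)

print(len(events), dict(counts))
print(passed({"reward": 1.0}), passed({"reward": 0.0, "partial_credit": 0.5}))
```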

Where to go next