Use cases
BenchFlow’s Scene-based lifecycle enables evaluation patterns that go far beyond single-turn “prompt and score.” This document covers the key use cases for multi-turn, multi-agent, and stateful environment evaluation. The patterns below are all variants of one primitive: Scenes with Roles and Turns, all running in a single shared sandbox via ACP. There are no sidecar containers and no Docker Compose networking: every role lives in the same workspace and talks through ACP.

Sandbox paths used in the prompts below. The runtime stages the task instruction at `/instruction.md` (sandbox root), and the oracle at `/oracle` for native `task.md` tasks (legacy split-layout tasks use `/solution` as an alias). The agent workspace is `/app`. For a turn with no explicit prompt (`Turn("role")` / a bare `- role:` entry), the runtime passes the task goal inline: for native `task.md` tasks it reads the prompt body from `task.md` and sends it directly, so the agent doesn’t have to read `/instruction.md` to know the task. `/instruction.md` is still staged for every task, so a role with an explicit prompt can read or quote it. Use `/oracle` first and fall back to `/solution` if you support both layouts, e.g. `cat /oracle/solve.sh 2>/dev/null || cat /solution/solve.sh`.
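The oracle fallback described above can also be written in Python, for instance inside a helper script that a role runs in the sandbox. This is a minimal sketch: the function name `read_oracle_script` and its `sandbox_root` parameter are illustrative assumptions, not part of BenchFlow.

```python
from pathlib import Path
from typing import Optional

# Candidate oracle directories in priority order: native layout first,
# then the legacy split-layout alias (both paths come from the text above).
ORACLE_DIRS = ("oracle", "solution")


def read_oracle_script(sandbox_root: Path = Path("/"), name: str = "solve.sh") -> Optional[str]:
    """Return the first oracle script found, or None if neither layout exists.

    Hypothetical helper, not a BenchFlow API.
    """
    for directory in ORACLE_DIRS:
        candidate = sandbox_root / directory / name
        if candidate.is_file():
            return candidate.read_text()
    return None
```

Passing a different `sandbox_root` makes the helper easy to exercise outside a real sandbox.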
1. Interactive User Simulation
A “user” role provides instructions iteratively; the agent responds. The user has oracle access to the solution and reveals information gradually, simulating realistic human-agent interaction. In BenchFlow, this is a two-role Scene where the “user” role is just another agent with a different prompt and (optionally) a different model. Both roles share one sandbox and one ACP session: no sidecar container, no Docker Compose networking.
YAML
Python
Why this design
- One sandbox, one ACP session — no sidecar container, no Docker Compose networking, no extra server to maintain.
- Roles share the sandbox filesystem; any handoff is explicit task state, such as a file named in the next prompt. BenchFlow does not inject messages between turns.
- The user agent is a real LLM with full tool access — it can read files, check outputs, and give nuanced feedback, not just templated responses.
- Same task folder works for single-turn (baseline) and interactive (with user) via different YAML configs.
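To make the file-based handoff concrete, the sketch below builds an alternating schedule of (role, prompt) pairs in plain Python. It does not use the BenchFlow API: the tuple layout, the function name `interactive_schedule`, and the handoff file `/app/user-feedback.md` are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical handoff file: roles exchange state only through the shared
# sandbox filesystem, so each prompt names the file explicitly.
HANDOFF_FILE = "/app/user-feedback.md"


def interactive_schedule(rounds: int) -> List[Tuple[str, str]]:
    """Build an alternating (role, prompt) schedule for a user/agent loop."""
    schedule: List[Tuple[str, str]] = []
    for round_index in range(rounds):
        # The simulated user goes first and writes its instruction to disk.
        schedule.append((
            "user",
            f"Round {round_index + 1}: inspect /app, then write the next "
            f"instruction to {HANDOFF_FILE}.",
        ))
        # The agent is told where to find that instruction.
        schedule.append((
            "agent",
            f"Read {HANDOFF_FILE} and carry out the instruction in /app.",
        ))
    return schedule
```

Each pair would correspond to one turn; how the pairs map onto real role and turn definitions depends on the rollout configuration.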
Lighter-weight alternative: BaseUser callback
When you don’t need a second LLM and your “user” logic is rule-based or oracle-guided (e.g. compress instruction → show test failures as hints → stop on pass), use a BaseUser Python callback instead of a multi-role Scene. See progressive-disclosure.md. Built for the SWE-bench Pro progressive-disclosure use case.
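The rule-based flow in the parenthetical above (compress the instruction, show test failures as hints, stop on pass) can be sketched as a plain function. This is not the `BaseUser` interface, whose actual signature is documented in progressive-disclosure.md; the function name, its parameters, and the "first sentence" compression rule are assumptions.

```python
from typing import List, Optional


def next_user_message(
    instruction: str,
    round_index: int,
    failed_tests: List[str],
    max_hints: int = 3,
) -> Optional[str]:
    """Return the next simulated-user message, or None to stop the loop.

    Hypothetical rule-based user logic, not the BaseUser API.
    """
    if round_index == 0:
        # Round 0: reveal only a compressed instruction (here, its first sentence).
        return instruction.split(". ")[0].rstrip(".") + "."
    if not failed_tests:
        # Every test passes: stop prompting.
        return None
    # Later rounds: surface a few failing tests as hints.
    hints = ", ".join(failed_tests[:max_hints])
    return f"These tests still fail: {hints}. Please fix them."
```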
2. Code Review Loop (followup-bench)
A coder agent solves the task, then an independent reviewer agent critiques the solution. The coder revises based on the feedback. The reviewer never has write access to `/app/`; it can only read and provide feedback.
YAML
Python (with MCP reviewer sidecar)
For stronger isolation, use the MCP reviewer server pattern. The reviewer runs as a sidecar service with no filesystem write access at all, and the coder calls the reviewer via a tool call. The MCP reviewer server (`benchflow.experimental.mcp.reviewer_server`) runs as a background process in the sandbox. It exposes `review_code` and `get_review_status` tools via streamable-http. The reviewer LLM reads the code but has no ability to write files; all it can do is return feedback text.
Results
Compare reviewer variants on your task set across three conditions:

| Condition | Description |
|---|---|
| baseline | Single-agent, single-turn |
| reviewer | Coder + plain reviewer + coder revision |
| reviewer+spec | Coder + reviewer that re-reads instruction + coder revision |
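One way to compare the conditions is to report each one's lift over the baseline in percentage points. The helper below is a sketch under the assumption that you already have a pass rate in the range 0 to 1 for each condition; the function name and the dictionary layout are illustrative, and no real results are implied.

```python
from typing import Dict


def lift_over_baseline(
    pass_rates: Dict[str, float], baseline: str = "baseline"
) -> Dict[str, float]:
    """Return each non-baseline condition's lift in percentage points.

    `pass_rates` maps a condition name to a pass rate between 0 and 1.
    """
    base = pass_rates[baseline]
    return {
        name: round((rate - base) * 100, 1)
        for name, rate in pass_rates.items()
        if name != baseline
    }
```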
Why this design
- No Docker Compose, no sidecar container, no FastMCP server to maintain.
- The MCP hook pattern gives the reviewer tool-level isolation: it cannot write to the workspace, preventing reward hacking via reviewer collusion.
- Same task, same verifier: define roles and turns in `RolloutConfig` or rollout YAML.
3. Skill Generation (BYOS — Bring Your Own Skill)
An agent generates a task-specific skill before solving. This is a two-scene rollout: prep (unscored) and solve (scored). Both scenes share the sandbox, so the generated skill persists.
YAML
Python
How scenes work here
- Scene 1 (`skill-gen`): The `gen` agent reads the task instruction, analyzes it, and writes a skill file. This scene is unscored; its output is an artifact that persists in the sandbox filesystem.
- Scene 2 (`solve`): A fresh agent session starts (no context from scene 1). The `solver` agent gets the standard task goal as its prompt (passed inline for native `task.md` tasks; legacy tasks read it from `/instruction.md`) and also sees `/app/generated-skill.md` on disk. The verifier scores only the final `/app/` state.
disconnect() between scenes kills the agent process, so there is no context bleed. The only communication is through the shared filesystem.
Research findings
From the SkillsBench paper: self-generated skills with generic prompts yield approximately 0 percentage points of lift over baseline. The BYOS pattern only helps when the skill-generation prompt is task-type-specific (e.g., “write a skill for compiler tasks” vs. “write a skill for this task”). This result informed the GEPA (Guided Evolution of Prompts and Agents) skill improvement pipeline.
4. Multi-turn Conversation
The same agent receives multiple prompts in sequence, maintaining full conversation context between turns. This is the simplest multi-turn pattern: no role switching, just sequential prompts to a persistent ACP session.
YAML
Python
How it works
ACP sessions are persistent: the agent process stays alive across all turns within a scene. The agent retains full conversation history (tool calls, outputs, reasoning) between prompts. Each `Turn` issues a new `prompt()` call on the existing session.
No simulated user is required — the “user” in this pattern is the benchmark framework itself, issuing predetermined follow-up prompts.
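The difference between turns (shared context) and scenes (fresh context) can be illustrated with a toy model. `SketchSession` below is a stand-in written for this example, not the ACP client or any BenchFlow class; it only counts how many prompts are visible to the agent at each turn.

```python
from typing import List


class SketchSession:
    """Toy stand-in for a persistent agent session (not the ACP client)."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def prompt(self, text: str) -> int:
        """Append a prompt and return how many prompts the agent has seen."""
        self.history.append(text)
        return len(self.history)


def run_scene(prompts: List[str]) -> List[int]:
    """Run every turn of one scene on a single session, so context accumulates."""
    session = SketchSession()
    return [session.prompt(p) for p in prompts]


def run_rollout(scenes: List[List[str]]) -> List[List[int]]:
    """Start a fresh session per scene, so no context carries across scenes."""
    return [run_scene(prompts) for prompts in scenes]
```

Within a scene the visible history grows with every turn; at a scene boundary it resets, which mirrors the "no context bleed" behaviour described for the two-scene skill-generation rollout.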
Why this is useful
- Self-review: The second prompt asks the agent to check its own work, catching obvious errors.
- Iterative refinement: Tasks that require build-test-fix cycles benefit from explicit prompts to test and iterate.
- Decomposition: Complex tasks can be broken into phases (“first set up the environment”, “now implement the feature”, “now write tests”).
5. Cross-model Review
Different models fill different roles in the same scene. A cheap model codes, an expensive model reviews. Role-level model configuration makes this trivial.
YAML
Python
Cost-performance tradeoff
The cross-model pattern lets you sweep the reviewer axis independently:

| Variant | Coder | Reviewer | Question |
|---|---|---|---|
| Self-review | gemini-flash | gemini-flash | Does same-model review help? |
| Cross-model | gemini-flash | claude-sonnet | Does a different model catch different bugs? |
| Strong reviewer | gemini-flash | claude-opus | Does a stronger reviewer help a weaker coder? |
| Weak reviewer | claude-opus | gemini-flash | Does a weaker reviewer hurt a stronger coder? |
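The sweep in the table can be kept as data and expanded into per-role model assignments. The sketch below is plain Python with illustrative names (`VARIANTS`, `role_models`, `variants_with_coder`); the model strings are the short labels from the table, not necessarily the identifiers a real configuration expects.

```python
from typing import Dict, List, Tuple

# Variant name -> (coder model, reviewer model), copied from the table above.
VARIANTS: Dict[str, Tuple[str, str]] = {
    "self-review": ("gemini-flash", "gemini-flash"),
    "cross-model": ("gemini-flash", "claude-sonnet"),
    "strong-reviewer": ("gemini-flash", "claude-opus"),
    "weak-reviewer": ("claude-opus", "gemini-flash"),
}


def role_models(variant: str) -> Dict[str, str]:
    """Return the role-to-model assignment for one sweep variant."""
    coder, reviewer = VARIANTS[variant]
    return {"coder": coder, "reviewer": reviewer}


def variants_with_coder(coder: str) -> List[str]:
    """List the variants that hold the coder fixed, isolating the reviewer axis."""
    return [name for name, (c, _) in VARIANTS.items() if c == coder]
```

Holding the coder fixed, as `variants_with_coder("gemini-flash")` does, yields the three rows in which only the reviewer changes.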
6. Stateful Service Tasks
Tasks that require agents to interact with live services: Gmail, Calendar, Docs, Drive, Slack. Services run as sidecar processes in the sandbox, exposing REST APIs on localhost. The agent interacts with real HTTP endpoints, not mocked tool calls.
Python
`RolloutConfig.services` is reserved metadata; it does not start services unless you translate it into `pre_agent_hooks`.
Service registry
BenchFlow ships with 5 built-in services (from the SmolClaws project):

| Service | CLI binary | Port | Description |
|---|---|---|---|
| gmail | claw-gmail | 9001 | Mock Gmail REST API (FastAPI + SQLite) |
| slack | claw-slack | 9002 | Mock Slack API |
| gcal | claw-gcal | 9003 | Mock Google Calendar API |
| gdoc | claw-gdoc | 9004 | Mock Google Docs API |
| gdrive | claw-gdrive | 9005 | Mock Google Drive API |

Each service:
- Runs as a background process in the same container.
- Exposes a health endpoint (`/health`) for startup detection.
- Uses SQLite for state, pre-seeded from the task’s `environment/` directory.
- Is indistinguishable from the real API from the agent’s perspective.
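Startup detection against the `/health` endpoint can be sketched with the standard library alone. The ports come from the registry table; the helper names, the retry parameters, and the assumption that a healthy service answers with HTTP 200 are illustrative and not taken from BenchFlow.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Service name -> port, copied from the registry table above.
SERVICE_PORTS = {"gmail": 9001, "slack": 9002, "gcal": 9003, "gdoc": 9004, "gdrive": 9005}


def health_url(service: str) -> str:
    """Build the localhost health URL for a built-in service."""
    return f"http://localhost:{SERVICE_PORTS[service]}/health"


def http_ok(url: str) -> bool:
    """Return True if the URL answers with HTTP 200 (assumed healthy status)."""
    try:
        with urllib.request.urlopen(url, timeout=1) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_healthy(
    service: str,
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 30,
    delay: float = 0.5,
) -> bool:
    """Poll the health endpoint until it responds or attempts run out."""
    url = health_url(service)
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

The `probe` parameter is injectable so the polling loop can be tested without a running service.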