> ## Documentation Index
> Fetch the complete documentation index at: https://docs.benchflow.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Use cases

# Use cases

BenchFlow's Scene-based lifecycle enables evaluation patterns that go far beyond single-turn "prompt and score." This document covers the key use cases for multi-turn, multi-agent, and stateful environment evaluation.

The patterns below are all variants of one primitive: **Scenes with Roles and Turns**, all running in a single shared sandbox via ACP. No sidecar containers, no Docker Compose networking — every role lives in the same workspace and talks through ACP.

> **Sandbox paths used in the prompts below.** The runtime stages the task
> instruction at `/instruction.md` (sandbox root), and the oracle at `/oracle`
> for native `task.md` tasks (legacy split-layout tasks use `/solution` as an
> alias). The agent workspace is `/app`. For a turn with no explicit prompt
> (`Turn("role")` / a bare `- role:` entry), the runtime passes the task goal
> **inline** — for native `task.md` tasks it reads the prompt body from
> `task.md` and sends it directly, so the agent doesn't have to read
> `/instruction.md` to know the task. `/instruction.md` is still staged for
> every task, so a role with an explicit prompt can read or quote it. Use
> `/oracle` first and fall back to `/solution` if you support both layouts,
> e.g. `cat /oracle/solve.sh 2>/dev/null || cat /solution/solve.sh`.

***

## 1. Interactive User Simulation

A "user" role provides instructions iteratively; the agent responds. The user has oracle access to the solution and reveals information gradually, simulating realistic human-agent interaction.

In BenchFlow, this is a two-role Scene where the "user" role is just another agent with a different prompt and (optionally) a different model. Both roles share one sandbox and one ACP session — no sidecar container, no Docker Compose networking.

### YAML

```yaml theme={null}
source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64

scenes:
  - name: interactive-assist
    roles:
      - name: user
        agent: gemini
        model: gemini-3.1-flash-lite-preview
      - name: assistant
        agent: claude-agent-acp
        model: claude-sonnet-4-6
    turns:
      - role: user
        prompt: |
          You are simulating a user who needs help with the task in /instruction.md.
          You have access to the oracle solution at /oracle/solve.sh (legacy tasks: /solution/solve.sh).
          Give the assistant a high-level description of what you want. Do NOT reveal implementation details yet.
          Write your guidance to /app/user-guidance.md.
      - role: assistant
      - role: user
        prompt: |
          Read the assistant's work in /app/. Compare against /oracle/solve.sh (legacy: /solution/solve.sh).
          If incomplete, provide a targeted hint (one specific detail from the solution).
          Update /app/user-guidance.md with the targeted hint.
      - role: assistant
        prompt: "The user provided additional guidance. Read it and continue working."
      - role: user
        prompt: |
          Final check. Read /app/ and compare to /oracle/ (legacy: /solution/). If correct, write
          LGTM to /app/user-guidance.md.
          If not, give one final hint.
      - role: assistant
        prompt: "Address the user's latest feedback and finalize your solution."
```

### Python

```python theme={null}
from pathlib import Path
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn

config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="interactive-assist",
              roles=[
                  Role("user", "gemini", "gemini-3.1-flash-lite-preview"),
                  Role("assistant", "claude-agent-acp", "claude-sonnet-4-6"),
              ],
              turns=[
                  Turn("user", "You are simulating a user. Read /instruction.md..."),
                  Turn("assistant"),  # None = native goal passed inline (legacy: instruction.md)
                  Turn("user", "Check the assistant's work against /oracle/ (legacy: /solution/)..."),
                  Turn("assistant", "The user provided additional guidance..."),
              ]),
    ],
    environment="daytona",
)
result = await bf.run(config)
```

### Why this design

* One sandbox, one ACP session — no sidecar container, no Docker Compose networking, no extra server to maintain.
* Roles share the sandbox filesystem; any handoff is explicit task state, such as a file named in the next prompt. BenchFlow does not inject messages between turns.
* The user agent is a real LLM with full tool access — it can read files, check outputs, and give nuanced feedback, not just templated responses.
* Same task folder works for single-turn (baseline) and interactive (with user) via different YAML configs.

### Lighter-weight alternative: `BaseUser` callback

When you don't need a second LLM and your "user" logic is rule-based or oracle-guided (e.g. compress instruction → show test failures as hints → stop on pass), use a `BaseUser` Python callback instead of a multi-role Scene. See [progressive-disclosure.md](./progressive-disclosure.md). Built for the SWE-bench Pro progressive-disclosure use case.

***

## 2. Code Review Loop (followup-bench)

A coder agent solves the task, then an independent reviewer agent critiques the solution. The coder revises based on the feedback. The reviewer never has write access to `/app/` -- it can only read and provide feedback.

### YAML

```yaml theme={null}
source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64

scenes:
  - name: review-loop
    roles:
      - name: coder
        agent: gemini
        model: gemini-3.1-flash-lite-preview
      - name: reviewer
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: coder
      - role: reviewer
        prompt: |
          You are an expert code reviewer. Read the task at /instruction.md
          and the coder's work in /app/. Write specific, actionable feedback.
          IMPORTANT: Do NOT modify any files in /app/ except /app/review-feedback.md.
          Write your specific feedback to /app/review-feedback.md.
      - role: coder
        prompt: "Read /app/review-feedback.md and revise your solution."
```

### Python (with MCP reviewer sidecar)

For stronger isolation, use the MCP reviewer server pattern. The reviewer runs as a sidecar service -- it has no filesystem write access at all. The coder calls the reviewer via a tool call:

```python theme={null}
from pathlib import Path

from benchflow.experimental.mcp.hooks import mcp_reviewer_hook
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn

config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="solve-and-review",
              roles=[Role("coder", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[
                  Turn("coder"),
                  Turn("coder", "Call the review_code MCP tool to get feedback, then fix issues."),
              ]),
    ],
    environment="daytona",
    pre_agent_hooks=[mcp_reviewer_hook(port=8100, model="gemini-3.1-flash-lite")],
)
result = await bf.run(config)
```

The MCP reviewer server (`benchflow.experimental.mcp.reviewer_server`) runs as a background process in the sandbox. It exposes `review_code` and `get_review_status` tools via streamable-http. The reviewer LLM reads the code but has **no ability to write files** -- all it can do is return feedback text.

### Results

Compare reviewer variants on your task set across three conditions:

| Condition       | Description                                                 |
| --------------- | ----------------------------------------------------------- |
| `baseline`      | Single-agent, single-turn                                   |
| `reviewer`      | Coder + plain reviewer + coder revision                     |
| `reviewer+spec` | Coder + reviewer that re-reads instruction + coder revision |

Treat reviewer lift as an empirical question for the target benchmark. It is most relevant for tasks that require debugging or multi-file coordination, but it should be measured rather than assumed.

### Why this design

* No Docker Compose, no sidecar container, no FastMCP server to maintain.
* The MCP hook pattern gives the reviewer tool-level isolation: it cannot write to the workspace, preventing reward hacking via reviewer collusion.
* Same task, same verifier -- define roles and turns in `RolloutConfig` or rollout
  YAML.

***

## 3. Skill Generation (BYOS -- Bring Your Own Skill)

An agent generates a task-specific skill before solving. This is a two-scene rollout: `prep` (unscored) and `solve` (scored). Both scenes share the sandbox, so the generated skill persists.

### YAML

```yaml theme={null}
source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64

scenes:
  - name: skill-gen
    roles:
      - name: gen
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: gen
        prompt: |
          Read /instruction.md. Analyze the task requirements.
          Write a skill document to /app/generated-skill.md that will help
          an agent solve this task. Include: key steps, common pitfalls,
          relevant commands or APIs, and a solution outline.
  - name: solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver
```

### Python

```python theme={null}
from pathlib import Path
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn

config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="skill-gen",
              roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("gen", "Analyze the task and write a skill to /app/generated-skill.md")]),
        Scene(name="solve",
              roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("solver")]),  # None prompt = native goal inline (legacy: instruction.md)
    ],
    environment="daytona",
)
result = await bf.run(config)
```

### How scenes work here

1. **Scene 1 (`skill-gen`)**: The `gen` agent reads the task instruction, analyzes it, and writes a skill file. This scene is unscored -- its output is an artifact that persists in the sandbox filesystem.
2. **Scene 2 (`solve`)**: A fresh agent session starts (no context from scene 1). The `solver` agent gets the standard task goal as its prompt (passed inline for native `task.md` tasks; legacy tasks read it from `/instruction.md`) and also sees `/app/generated-skill.md` on disk. The verifier scores only the final `/app/` state.

The key insight: `disconnect()` between scenes kills the agent process, so there is no context bleed. The only communication is through the shared filesystem.

### Research findings

From the SkillsBench paper: self-generated skills with generic prompts yield approximately 0 percentage points of lift over baseline. The BYOS pattern only helps when the skill-generation prompt is task-type-specific (e.g., "write a skill for compiler tasks" vs. "write a skill for this task"). This result informed the GEPA (Guided Evolution of Prompts and Agents) skill improvement pipeline.

***

## 4. Multi-turn Conversation

The same agent receives multiple prompts in sequence, maintaining full conversation context between turns. This is the simplest multi-turn pattern -- no role switching, just sequential prompts to a persistent ACP session.

### YAML

```yaml theme={null}
source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64

scenes:
  - name: iterative-solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver
      - role: solver
        prompt: "Review your solution. Run the tests if available. Check for edge cases and fix any issues you find."
      - role: solver
        prompt: "Final check: re-read the original instruction and verify your solution addresses every requirement."
```

### Python

```python theme={null}
from pathlib import Path
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn

config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="iterative-solve",
              roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[
                  Turn("solver"),  # native goal inline (legacy: instruction.md)
                  Turn("solver", "Review your solution. Run tests. Fix issues."),
                  Turn("solver", "Final check: verify every requirement is met."),
              ]),
    ],
    environment="daytona",
)
result = await bf.run(config)
```

### How it works

ACP sessions are persistent -- the agent process stays alive across all turns within a scene. The agent retains full conversation history (tool calls, outputs, reasoning) between prompts. Each `Turn` sends a new `prompt()` call on the existing session.

No simulated user is required — the "user" in this pattern is the benchmark framework itself, issuing predetermined follow-up prompts.

### Why this is useful

* **Self-review**: The second prompt asks the agent to check its own work, catching obvious errors.
* **Iterative refinement**: Tasks that require build-test-fix cycles benefit from explicit prompts to test and iterate.
* **Decomposition**: Complex tasks can be broken into phases ("first set up the environment", "now implement the feature", "now write tests").

***

## 5. Cross-model Review

Different models fill different roles in the same scene. A cheap model codes, an expensive model reviews. Role-level model configuration makes this trivial.

### YAML

```yaml theme={null}
source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 32

scenes:
  - name: cross-model-review
    roles:
      - name: coder
        agent: gemini
        model: gemini-3.1-flash-lite-preview
      - name: reviewer
        agent: claude-agent-acp
        model: claude-sonnet-4-6
    turns:
      - role: coder
      - role: reviewer
        prompt: |
          You are reviewing code written by a different agent.
          Read /instruction.md for the task requirements.
          Examine the coder's work in /app/. Write specific feedback
          to /app/review-feedback.md
      - role: coder
        prompt: "Read /app/review-feedback.md and revise your solution."
```

### Python

```python theme={null}
from pathlib import Path
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn

config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="cross-model-review",
              roles=[
                  Role("coder", "gemini", "gemini-3.1-flash-lite-preview"),
                  Role("reviewer", "claude-agent-acp", "claude-sonnet-4-6"),
              ],
              turns=[
                  Turn("coder"),
                  Turn("reviewer", "Review the coder's work..."),
                  Turn("coder", "Address the reviewer's feedback."),
              ]),
    ],
    environment="daytona",
)
result = await bf.run(config)
```

### Cost-performance tradeoff

The cross-model pattern lets you sweep the reviewer axis independently:

| Variant         | Coder        | Reviewer      | Question                                      |
| --------------- | ------------ | ------------- | --------------------------------------------- |
| Self-review     | gemini-flash | gemini-flash  | Does same-model review help?                  |
| Cross-model     | gemini-flash | claude-sonnet | Does a different model catch different bugs?  |
| Strong reviewer | gemini-flash | claude-opus   | Does a stronger reviewer help a weaker coder? |
| Weak reviewer   | claude-opus  | gemini-flash  | Does a weaker reviewer hurt a stronger coder? |

Each variant is just a different YAML file -- same task folder, same verifier, different role configurations. This enables controlled experiments on the marginal value of reviewer quality.

***

## 6. Stateful Service Tasks

Tasks that require agents to interact with live services -- Gmail, Calendar, Docs, Drive, Slack. Services run as sidecar processes in the sandbox, exposing REST APIs on localhost. The agent interacts with real HTTP endpoints, not mocked tool calls.

### Python

```python theme={null}
from pathlib import Path
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn
from benchflow import SERVICES, build_service_hooks

# Declare which services the task needs
services = [SERVICES["gmail"], SERVICES["gcal"], SERVICES["slack"]]

config = RolloutConfig(
    task_path=Path("tasks/schedule-meeting-from-email"),
    scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
    environment="daytona",
    pre_agent_hooks=build_service_hooks(services),
)
result = await bf.run(config)
```

Service hooks are explicit today. `RolloutConfig.services` is reserved metadata;
it does not start services unless you translate it into `pre_agent_hooks`.

### Service registry

BenchFlow ships with 5 built-in services (from the SmolClaws project):

| Service  | CLI binary    | Port | Description                            |
| -------- | ------------- | ---- | -------------------------------------- |
| `gmail`  | `claw-gmail`  | 9001 | Mock Gmail REST API (FastAPI + SQLite) |
| `slack`  | `claw-slack`  | 9002 | Mock Slack API                         |
| `gcal`   | `claw-gcal`   | 9003 | Mock Google Calendar API               |
| `gdoc`   | `claw-gdoc`   | 9004 | Mock Google Docs API                   |
| `gdrive` | `claw-gdrive` | 9005 | Mock Google Drive API                  |

Each service:

* Runs as a background process in the same container.
* Exposes a health endpoint (`/health`) for startup detection.
* Uses SQLite for state -- pre-seeded from the task's `environment/` directory.
* Is indistinguishable from the real API from the agent's perspective.

### Example task structure

```
tasks/schedule-meeting-from-email/
├── task.toml
├── instruction.md          # "Read the email from Alice, create a calendar event..."
├── environment/
│   ├── Dockerfile          # FROM benchflow/claws-base (has all claw-* binaries)
│   ├── gmail.db            # Pre-seeded: email from Alice with meeting request
│   └── gcal.db             # Pre-seeded: existing calendar entries
├── solution/
│   └── solve.sh            # Oracle: curl commands to Gmail + GCal APIs
└── tests/
    └── test.sh             # Verify: check gcal.db has the new event
```
