Use cases

BenchFlow’s Scene-based lifecycle enables evaluation patterns that go far beyond single-turn “prompt and score.” This document covers the key use cases for multi-turn, multi-agent, and stateful environment evaluation. The patterns below are all variants of one primitive: Scenes with Roles and Turns, all running in a single shared sandbox via ACP. No sidecar containers, no Docker Compose networking — every role lives in the same workspace and talks through ACP.

Sandbox paths used in the prompts below. The runtime stages the task instruction at /instruction.md (sandbox root), and the oracle at /oracle for native task.md tasks (legacy split-layout tasks use /solution as an alias). The agent workspace is /app. For a turn with no explicit prompt (Turn("role") / a bare - role: entry), the runtime passes the task goal inline — for native task.md tasks it reads the prompt body from task.md and sends it directly, so the agent doesn’t have to read /instruction.md to know the task. /instruction.md is still staged for every task, so a role with an explicit prompt can read or quote it. Use /oracle first and fall back to /solution if you support both layouts, e.g. cat /oracle/solve.sh 2>/dev/null || cat /solution/solve.sh.

1. Interactive User Simulation

A “user” role provides instructions iteratively; the agent responds. The user has oracle access to the solution and reveals information gradually, simulating realistic human-agent interaction. In BenchFlow, this is a two-role Scene where the “user” role is just another agent with a different prompt and (optionally) a different model. Both roles share one sandbox and one ACP session — no sidecar container, no Docker Compose networking.

YAML

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64

scenes:
  - name: interactive-assist
    roles:
      - name: user
        agent: gemini
        model: gemini-3.1-flash-lite-preview
      - name: assistant
        agent: claude-agent-acp
        model: claude-sonnet-4-6
    turns:
      - role: user
        prompt: |
          You are simulating a user who needs help with the task in /instruction.md.
          You have access to the oracle solution at /oracle/solve.sh (legacy tasks: /solution/solve.sh).
          Give the assistant a high-level description of what you want. Do NOT reveal implementation details yet.
          Write your guidance to /app/user-guidance.md.
      - role: assistant
      - role: user
        prompt: |
          Read the assistant's work in /app/. Compare against /oracle/solve.sh (legacy: /solution/solve.sh).
          If incomplete, provide a targeted hint (one specific detail from the solution).
          Update /app/user-guidance.md with the targeted hint.
      - role: assistant
        prompt: "The user provided additional guidance. Read it and continue working."
      - role: user
        prompt: |
          Final check. Read /app/ and compare to /oracle/ (legacy: /solution/). If correct, write
          LGTM to /app/user-guidance.md.
          If not, give one final hint.
      - role: assistant
        prompt: "Address the user's latest feedback and finalize your solution."

Python

from pathlib import Path
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn

config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="interactive-assist",
              roles=[
                  Role("user", "gemini", "gemini-3.1-flash-lite-preview"),
                  Role("assistant", "claude-agent-acp", "claude-sonnet-4-6"),
              ],
              turns=[
                  Turn("user", "You are simulating a user. Read /instruction.md..."),
                  Turn("assistant"),  # None = native goal passed inline (legacy: instruction.md)
                  Turn("user", "Check the assistant's work against /oracle/ (legacy: /solution/)..."),
                  Turn("assistant", "The user provided additional guidance..."),
              ]),
    ],
    environment="daytona",
)
result = await bf.run(config)

Why this design

One sandbox, one ACP session — no sidecar container, no Docker Compose networking, no extra server to maintain.
Roles share the sandbox filesystem; any handoff is explicit task state, such as a file named in the next prompt. BenchFlow does not inject messages between turns.
The user agent is a real LLM with full tool access — it can read files, check outputs, and give nuanced feedback, not just templated responses.
Same task folder works for single-turn (baseline) and interactive (with user) via different YAML configs.

Lighter-weight alternative: `BaseUser` callback

When you don’t need a second LLM and your “user” logic is rule-based or oracle-guided (e.g. compress instruction → show test failures as hints → stop on pass), use a BaseUser Python callback instead of a multi-role Scene. See progressive-disclosure.md. Built for the SWE-bench Pro progressive-disclosure use case.

2. Code Review Loop (followup-bench)

A coder agent solves the task, then an independent reviewer agent critiques the solution. The coder revises based on the feedback. The reviewer never has write access to /app/ — it can only read and provide feedback.

YAML

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64

scenes:
  - name: review-loop
    roles:
      - name: coder
        agent: gemini
        model: gemini-3.1-flash-lite-preview
      - name: reviewer
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: coder
      - role: reviewer
        prompt: |
          You are an expert code reviewer. Read the task at /instruction.md
          and the coder's work in /app/. Write specific, actionable feedback.
          IMPORTANT: Do NOT modify any files in /app/ except /app/review-feedback.md.
          Write your specific feedback to /app/review-feedback.md.
      - role: coder
        prompt: "Read /app/review-feedback.md and revise your solution."

Python (with MCP reviewer sidecar)

For stronger isolation, use the MCP reviewer server pattern. The reviewer runs as a sidecar service — it has no filesystem write access at all. The coder calls the reviewer via a tool call:

from pathlib import Path

from benchflow.experimental.mcp.hooks import mcp_reviewer_hook
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn

config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="solve-and-review",
              roles=[Role("coder", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[
                  Turn("coder"),
                  Turn("coder", "Call the review_code MCP tool to get feedback, then fix issues."),
              ]),
    ],
    environment="daytona",
    pre_agent_hooks=[mcp_reviewer_hook(port=8100, model="gemini-3.1-flash-lite")],
)
result = await bf.run(config)

The MCP reviewer server (benchflow.experimental.mcp.reviewer_server) runs as a background process in the sandbox. It exposes review_code and get_review_status tools via streamable-http. The reviewer LLM reads the code but has no ability to write files — all it can do is return feedback text.

Results

Compare reviewer variants on your task set across three conditions:

Condition	Description
`baseline`	Single-agent, single-turn
`reviewer`	Coder + plain reviewer + coder revision
`reviewer+spec`	Coder + reviewer that re-reads instruction + coder revision

Treat reviewer lift as an empirical question for the target benchmark. It is most relevant for tasks that require debugging or multi-file coordination, but it should be measured rather than assumed.

Why this design

No Docker Compose, no sidecar container, no FastMCP server to maintain.
The MCP hook pattern gives the reviewer tool-level isolation: it cannot write to the workspace, preventing reward hacking via reviewer collusion.
Same task, same verifier — define roles and turns in RolloutConfig or rollout YAML.

3. Skill Generation (BYOS — Bring Your Own Skill)

An agent generates a task-specific skill before solving. This is a two-scene rollout: prep (unscored) and solve (scored). Both scenes share the sandbox, so the generated skill persists.

YAML

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64

scenes:
  - name: skill-gen
    roles:
      - name: gen
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: gen
        prompt: |
          Read /instruction.md. Analyze the task requirements.
          Write a skill document to /app/generated-skill.md that will help
          an agent solve this task. Include: key steps, common pitfalls,
          relevant commands or APIs, and a solution outline.
  - name: solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver

Python

from pathlib import Path
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn

config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="skill-gen",
              roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("gen", "Analyze the task and write a skill to /app/generated-skill.md")]),
        Scene(name="solve",
              roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("solver")]),  # None prompt = native goal inline (legacy: instruction.md)
    ],
    environment="daytona",
)
result = await bf.run(config)

How scenes work here

Scene 1 (skill-gen): The gen agent reads the task instruction, analyzes it, and writes a skill file. This scene is unscored — its output is an artifact that persists in the sandbox filesystem.
Scene 2 (solve): A fresh agent session starts (no context from scene 1). The solver agent gets the standard task goal as its prompt (passed inline for native task.md tasks; legacy tasks read it from /instruction.md) and also sees /app/generated-skill.md on disk. The verifier scores only the final /app/ state.

The key insight: disconnect() between scenes kills the agent process, so there is no context bleed. The only communication is through the shared filesystem.

Research findings

From the SkillsBench paper: self-generated skills with generic prompts yield approximately 0 percentage points of lift over baseline. The BYOS pattern only helps when the skill-generation prompt is task-type-specific (e.g., “write a skill for compiler tasks” vs. “write a skill for this task”). This result informed the GEPA (Guided Evolution of Prompts and Agents) skill improvement pipeline.

4. Multi-turn Conversation

The same agent receives multiple prompts in sequence, maintaining full conversation context between turns. This is the simplest multi-turn pattern — no role switching, just sequential prompts to a persistent ACP session.

YAML

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 64

scenes:
  - name: iterative-solve
    roles:
      - name: solver
        agent: gemini
        model: gemini-3.1-flash-lite-preview
    turns:
      - role: solver
      - role: solver
        prompt: "Review your solution. Run the tests if available. Check for edge cases and fix any issues you find."
      - role: solver
        prompt: "Final check: re-read the original instruction and verify your solution addresses every requirement."

Python

from pathlib import Path
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn

config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="iterative-solve",
              roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[
                  Turn("solver"),  # native goal inline (legacy: instruction.md)
                  Turn("solver", "Review your solution. Run tests. Fix issues."),
                  Turn("solver", "Final check: verify every requirement is met."),
              ]),
    ],
    environment="daytona",
)
result = await bf.run(config)

How it works

ACP sessions are persistent — the agent process stays alive across all turns within a scene. The agent retains full conversation history (tool calls, outputs, reasoning) between prompts. Each Turn sends a new prompt() call on the existing session. No simulated user is required — the “user” in this pattern is the benchmark framework itself, issuing predetermined follow-up prompts.

Why this is useful

Self-review: The second prompt asks the agent to check its own work, catching obvious errors.
Iterative refinement: Tasks that require build-test-fix cycles benefit from explicit prompts to test and iterate.
Decomposition: Complex tasks can be broken into phases (“first set up the environment”, “now implement the feature”, “now write tests”).

5. Cross-model Review

Different models fill different roles in the same scene. A cheap model codes, an expensive model reviews. Role-level model configuration makes this trivial.

YAML

source:
  repo: benchflow-ai/skillsbench
  path: tasks
environment: daytona
concurrency: 32

scenes:
  - name: cross-model-review
    roles:
      - name: coder
        agent: gemini
        model: gemini-3.1-flash-lite-preview
      - name: reviewer
        agent: claude-agent-acp
        model: claude-sonnet-4-6
    turns:
      - role: coder
      - role: reviewer
        prompt: |
          You are reviewing code written by a different agent.
          Read /instruction.md for the task requirements.
          Examine the coder's work in /app/. Write specific feedback
          to /app/review-feedback.md
      - role: coder
        prompt: "Read /app/review-feedback.md and revise your solution."

Python

from pathlib import Path
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn

config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="cross-model-review",
              roles=[
                  Role("coder", "gemini", "gemini-3.1-flash-lite-preview"),
                  Role("reviewer", "claude-agent-acp", "claude-sonnet-4-6"),
              ],
              turns=[
                  Turn("coder"),
                  Turn("reviewer", "Review the coder's work..."),
                  Turn("coder", "Address the reviewer's feedback."),
              ]),
    ],
    environment="daytona",
)
result = await bf.run(config)

Cost-performance tradeoff

The cross-model pattern lets you sweep the reviewer axis independently:

Variant	Coder	Reviewer	Question
Self-review	gemini-flash	gemini-flash	Does same-model review help?
Cross-model	gemini-flash	claude-sonnet	Does a different model catch different bugs?
Strong reviewer	gemini-flash	claude-opus	Does a stronger reviewer help a weaker coder?
Weak reviewer	claude-opus	gemini-flash	Does a weaker reviewer hurt a stronger coder?

Each variant is just a different YAML file — same task folder, same verifier, different role configurations. This enables controlled experiments on the marginal value of reviewer quality.

6. Stateful Service Tasks

Tasks that require agents to interact with live services — Gmail, Calendar, Docs, Drive, Slack. Services run as sidecar processes in the sandbox, exposing REST APIs on localhost. The agent interacts with real HTTP endpoints, not mocked tool calls.

Python

from pathlib import Path
import benchflow as bf
from benchflow.rollout import RolloutConfig, Scene, Role, Turn
from benchflow import SERVICES, build_service_hooks

# Declare which services the task needs
services = [SERVICES["gmail"], SERVICES["gcal"], SERVICES["slack"]]

config = RolloutConfig(
    task_path=Path("tasks/schedule-meeting-from-email"),
    scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
    environment="daytona",
    pre_agent_hooks=build_service_hooks(services),
)
result = await bf.run(config)

Service hooks are explicit today. RolloutConfig.services is reserved metadata; it does not start services unless you translate it into pre_agent_hooks.

Service registry

BenchFlow ships with 5 built-in services (from the SmolClaws project):

Service	CLI binary	Port	Description
`gmail`	`claw-gmail`	9001	Mock Gmail REST API (FastAPI + SQLite)
`slack`	`claw-slack`	9002	Mock Slack API
`gcal`	`claw-gcal`	9003	Mock Google Calendar API
`gdoc`	`claw-gdoc`	9004	Mock Google Docs API
`gdrive`	`claw-gdrive`	9005	Mock Google Drive API

Each service:

Runs as a background process in the same container.
Exposes a health endpoint (/health) for startup detection.
Uses SQLite for state — pre-seeded from the task’s environment/ directory.
Is indistinguishable from the real API from the agent’s perspective.

Example task structure

tasks/schedule-meeting-from-email/
├── task.toml
├── instruction.md          # "Read the email from Alice, create a calendar event..."
├── environment/
│   ├── Dockerfile          # FROM benchflow/claws-base (has all claw-* binaries)
│   ├── gmail.db            # Pre-seeded: email from Alice with meeting request
│   └── gcal.db             # Pre-seeded: existing calendar entries
├── solution/
│   └── solve.sh            # Oracle: curl commands to Gmail + GCal APIs
└── tests/
    └── test.sh             # Verify: check gcal.db has the new event

Start here

Guides

Reference

Use cases

Use cases

1. Interactive User Simulation

YAML

Python

Why this design

Lighter-weight alternative: `BaseUser` callback

2. Code Review Loop (followup-bench)

YAML

Python (with MCP reviewer sidecar)

Results

Why this design

3. Skill Generation (BYOS — Bring Your Own Skill)

YAML

Python

How scenes work here

Research findings

4. Multi-turn Conversation

YAML

Python

How it works

Why this is useful

5. Cross-model Review

YAML

Python

Cost-performance tradeoff

6. Stateful Service Tasks

Python

Service registry

Example task structure

​Use cases

​1. Interactive User Simulation

​YAML

​Python

​Why this design

​Lighter-weight alternative: BaseUser callback

​2. Code Review Loop (followup-bench)

​YAML

​Python (with MCP reviewer sidecar)

​Results

​Why this design

​3. Skill Generation (BYOS — Bring Your Own Skill)

​YAML

​Python

​How scenes work here

​Research findings

​4. Multi-turn Conversation

​YAML

​Python

​How it works

​Why this is useful

​5. Cross-model Review

​YAML

​Python

​Cost-performance tradeoff

​6. Stateful Service Tasks

​Python

​Service registry

​Example task structure

Use cases

1. Interactive User Simulation

YAML

Python

Why this design

Lighter-weight alternative: `BaseUser` callback

2. Code Review Loop (followup-bench)

YAML

Python (with MCP reviewer sidecar)

Results

Why this design

3. Skill Generation (BYOS — Bring Your Own Skill)

YAML

Python

How scenes work here

Research findings

4. Multi-turn Conversation

YAML

Python

How it works

Why this is useful

5. Cross-model Review

YAML

Python

Cost-performance tradeoff

6. Stateful Service Tasks

Python

Service registry

Example task structure