Python API
The Rollout/Scene API is the primary way to run agent benchmarks programmatically.
Install
Quick Start
Core Types
RolloutConfig
Declarative configuration for a rollout: a sequence of Scenes in a shared sandbox.
Set sandbox_setup_timeout when sandbox user setup needs more than the default 120 seconds.
The same field is also available on JobConfig and RuntimeConfig.
Scene
Authoring sugar for role, prompt, and skill attribution. Scenes compile to explicit rollout Steps before execution; there is no runtime Scene object or message scheduler.
Rollout
The execution engine, decomposed into independently callable phases.
RuntimeConfig
Runtime-level configuration for the Agent + Environment execution path.
bf.run()
Convenience function with multiple calling conventions:
Rollout Lifecycle
Multi-Turn vs Multi-Round
| Pattern | Roles | Turns | Communication | Example |
|---|---|---|---|---|
| Single-turn | 1 | 1 | — | Baseline benchmark |
| Multi-turn | 1 | 2+ | Same session, sequential prompts | Self-review |
| Multi-role | 2+ | 2+ | Explicit prompt sequence | Coder + Reviewer |
Each pattern uses a RolloutConfig with different Scene configurations.
Multi-Agent Patterns
Coder + Reviewer (followup-bench)
Skill Generation + Solve (BYOS)
User-Driven Loops
Use BaseUser or FunctionUser when one agent should run multiple rounds and
Python should decide the next prompt from verifier feedback. This is the
progressive-disclosure path: the user callback can stop early, read
RoundResult after each soft_verify(), and optionally receive the oracle
solution during setup() when oracle_access=True.
Use BaseUser when the loop is deterministic or verifier-driven. See
progressive-disclosure.md and
docs/examples/scene-patterns.ipynb.
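The shape of such a user-driven loop can be sketched in plain Python. This is an illustration only, not BenchFlow's implementation: the RoundResult below is a stand-in dataclass with assumed fields, and drive_rounds, run_round, and next_prompt are hypothetical names. The real BaseUser and FunctionUser signatures are not shown in this section.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RoundResult:
    """Stand-in for per-round verifier feedback (fields are assumed)."""
    round_index: int
    passed: bool
    feedback: str


def drive_rounds(
    first_prompt: str,
    run_round: Callable[[int, str], RoundResult],
    next_prompt: Callable[[RoundResult], Optional[str]],
    max_rounds: int = 5,
) -> list[RoundResult]:
    """Run rounds until the callback returns None or max_rounds is reached."""
    results: list[RoundResult] = []
    prompt: Optional[str] = first_prompt
    for index in range(max_rounds):
        if prompt is None:
            break  # the user callback asked to stop early
        result = run_round(index, prompt)
        results.append(result)
        # Python, not the agent, decides what to send next.
        prompt = next_prompt(result)
    return results


def fake_round(index: int, prompt: str) -> RoundResult:
    """Hypothetical agent + soft verifier: passes on the third round."""
    passed = index >= 2
    status = "pass" if passed else "fail"
    return RoundResult(index, passed, f"round {index}: tests {status}")


def disclose_more(result: RoundResult) -> Optional[str]:
    """Stop once the verifier passes; otherwise feed its output back."""
    if result.passed:
        return None
    return f"Verifier said: {result.feedback}. Try again."


history = drive_rounds("Fix the failing test.", fake_round, disclose_more)
```

Keeping the stop decision in the callback (returning None) is one way to let the loop end early without the agent knowing how many rounds remain.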
YAML Rollout Configs
Registered Agents
| Agent | Protocol | Auth | Aliases |
|---|---|---|---|
| gemini | ACP | GEMINI_API_KEY | — |
| claude-agent-acp | ACP | ANTHROPIC_API_KEY | claude |
| codex-acp | ACP | OPENAI_API_KEY, CODEX_API_KEY, CODEX_ACCESS_TOKEN, or host login | codex |
| opencode | ACP | inferred from model/provider | — |
| openhands | ACP | LLM_API_KEY | oh |
| pi-acp | ACP | ANTHROPIC_API_KEY | pi |
| openclaw | ACP | inferred from model | — |
Azure Foundry providers use AZURE_API_KEY plus AZURE_API_ENDPOINT with model prefixes such
as azure-foundry-openai/gpt-5.5 or
azure-foundry-anthropic/claude-opus-4-5. BenchFlow routes these providers
through LiteLLM on both Docker and Daytona.
Any agent can be prefixed with acpx/ to run via ACPX (e.g. acpx/gemini, acpx/claude). ACPX is a headless ACP client with persistent sessions and crash recovery. The underlying agent’s install, env, credentials, and skill paths are preserved.
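As a rough illustration of how an agent string with an optional acpx/ prefix and an alias could be interpreted, here is a minimal sketch. The agent names and aliases are copied from the table above; resolve_agent is a hypothetical helper, and whether BenchFlow resolves aliases in this order is an assumption.

```python
# Aliases and canonical names taken from the Registered Agents table.
ALIASES = {
    "claude": "claude-agent-acp",
    "codex": "codex-acp",
    "oh": "openhands",
    "pi": "pi-acp",
}
REGISTERED = {
    "gemini",
    "claude-agent-acp",
    "codex-acp",
    "opencode",
    "openhands",
    "pi-acp",
    "openclaw",
}
ACPX_PREFIX = "acpx/"


def resolve_agent(spec: str) -> tuple[bool, str]:
    """Return (run_via_acpx, canonical_agent_name) for an agent string.

    Hypothetical helper: strips an optional "acpx/" prefix, then maps
    an alias to its canonical name.
    """
    via_acpx = spec.startswith(ACPX_PREFIX)
    name = spec[len(ACPX_PREFIX):] if via_acpx else spec
    name = ALIASES.get(name, name)
    if name not in REGISTERED:
        raise ValueError(f"unknown agent: {spec!r}")
    return via_acpx, name


print(resolve_agent("acpx/claude"))  # (True, 'claude-agent-acp')
print(resolve_agent("gemini"))       # (False, 'gemini')
```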
Retry and Error Handling
Rollout.run() catches common errors:
- TimeoutError: agent exceeded timeout
- ConnectionError: SSH/ACP pipe closed (retried 3x with exponential backoff)
- ACPError: agent protocol error
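The ConnectionError behaviour above follows the general retry-with-exponential-backoff pattern, sketched below. This is not BenchFlow's code: retry_on_connection_error, base_delay, and the injectable sleep parameter are hypothetical, and reading "3x" as three retries after the first failed attempt is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_connection_error(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ConnectionError with doubling delays.

    With the defaults, the waits are 1s, 2s, and 4s; after the last
    retry fails, the ConnectionError propagates to the caller.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except ConnectionError:
            if attempt >= retries:
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1


# Demo: a call that fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("pipe closed")
    return "ok"


delays: list[float] = []
# Record the delays instead of sleeping so the demo runs instantly.
result = retry_on_connection_error(flaky, sleep=delays.append)
```

Only ConnectionError is retried here; a TimeoutError or a protocol error would propagate immediately, which matches the list above, where only the connection case mentions retries.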
RetryConfig:
Sandbox and Reward Types
Sandbox Protocol
The Sandbox protocol defines the interface any sandbox backend must implement.
Docker and Daytona are built-in; you can bring your own (Modal, Firecracker, E2B, etc.).
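To show what "bring your own" could look like structurally, here is a sketch using typing.Protocol. The method names (start, run_command, stop) are hypothetical, since the real Sandbox protocol's methods are not listed in this section, and the toy backend runs commands on the host with no isolation, so it is a shape illustration rather than a usable sandbox.

```python
import subprocess
import sys
from typing import Protocol, runtime_checkable


@runtime_checkable
class SandboxLike(Protocol):
    """Hypothetical interface; the real Sandbox protocol may differ."""

    def start(self) -> None: ...

    def run_command(
        self, command: list[str], timeout: float = 120.0
    ) -> tuple[int, str]: ...

    def stop(self) -> None: ...


class LocalSubprocessSandbox:
    """Toy backend: runs commands as host subprocesses (no isolation)."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def run_command(
        self, command: list[str], timeout: float = 120.0
    ) -> tuple[int, str]:
        if not self.running:
            raise RuntimeError("sandbox is not started")
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode, proc.stdout

    def stop(self) -> None:
        self.running = False


sandbox = LocalSubprocessSandbox()
sandbox.start()
code, out = sandbox.run_command([sys.executable, "-c", "print('hi')"])
sandbox.stop()
```

Because Protocol uses structural typing, LocalSubprocessSandbox satisfies SandboxLike without inheriting from it; a backend for Modal, Firecracker, or E2B would follow the same shape under these assumptions.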