Progressive disclosure
TL;DR
BaseUser is a Python callback that drives a benchflow rollout across multiple rounds. Each round: the callback sees the previous verifier result and decides what to tell the agent next, or stops the loop. No second LLM, no outbox protocol — just a function that knows how to grade and hint.
It was built for the SWE-bench Pro progressive-disclosure use case: the dataset’s instructions are long structured specs that overwhelm agents in a single turn. A BaseUser lets you compress the spec for round 0, watch which tests fail, then disclose hints from the spec on subsequent rounds — all driven by deterministic Python, not by another LLM acting as a “user.”
Other agent-eval frameworks model this with a “simulated user” — a second LLM running in a sidecar container that talks to the agent over a side channel. benchflow’s BaseUser is just in-process Python: no second LLM, no sidecar, no outbox protocol.
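To make the idea concrete, here is a minimal, self-contained sketch of a deterministic hint schedule. It does not import benchflow: `RoundResultSketch`, `HintUser`, the `run(round_idx, last)` signature, the `"reward"` key, and the convention that returning `None` stops the loop are all assumptions made for illustration, not the library's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RoundResultSketch:
    """Hypothetical stand-in for what the callback sees after a round."""
    rewards: dict = field(default_factory=dict)


class HintUser:
    """Deterministic 'user': compressed spec first, then one hint per round."""

    def __init__(self, summary: str, hints: list, max_rounds: int = 3):
        self.summary = summary
        self.hints = hints
        self.max_rounds = max_rounds

    def run(self, round_idx: int, last: Optional[RoundResultSketch]) -> Optional[str]:
        if round_idx == 0:
            return self.summary                  # round 0: short version of the spec
        if last is not None and last.rewards.get("reward", 0.0) >= 1.0:
            return None                          # verifier passed: stop the loop
        if round_idx >= self.max_rounds or round_idx > len(self.hints):
            return None                          # out of rounds or hints: stop
        return f"Tests still failing. Hint: {self.hints[round_idx - 1]}"


user = HintUser("Fix the parser.", ["See section 2 of the spec.", "Check the fixture names."])
print(user.run(0, None))
print(user.run(1, RoundResultSketch({"reward": 0.0})))
```

The whole schedule is ordinary Python state, so the same inputs always produce the same sequence of prompts.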
Case study: SWE-bench Pro
SWE-bench Pro tasks ship long, structured instruction.md specs (typically 2-5 KB) describing API requirements, test fixtures, and expected behaviors. Single-shot agents either drown in the spec or under-engineer because they bail before reading to the bottom.
The SWE-bench Pro eval that motivated this feature wanted exactly this loop:
Validation (2026-04-25, 5 SWE-bench Pro tasks, Daytona, Gemini 3.1 Pro Preview)
| Task | Oracle | Single-round baseline | 3-round progressive (final) | Per-round soft-verify |
|---|---|---|---|---|
| ansible | ✅ 1.0 | ✅ 1.0 (23 tools, 207s) | ✅ 1.0 (126 tools, 3 rounds) | 0.0 / 0.0 / 0.0 |
| flipt | ✅ 1.0 | ❌ 0.0 (61 tools, 1444s) | ❌ 0.0 (195 tools, 3 rounds) | 0.0 / 0.0 / 0.0 |
| openlibrary | ✅ 1.0 | ✅ 1.0 (32 tools, 340s) | ✅ 1.0 (82 tools, 3 rounds) | 0.0 / 0.0 / 0.0 |
| navidrome | ✅ 1.0 | (not tested) | ❌ 0.0 (145 tools, 3 rounds) | 0.0 / 0.0 / 0.0 |
| qutebrowser | ✅ 1.0 (with cleanup_conftests=false) | ❌ 0.0 (verifier broken pre-fix) | ✅ 1.0 (183 tools, 3 rounds) | 0.0 / 0.0 / 0.0 |
- The infrastructure works on real SWE-bench Pro tasks. All 5 tasks completed 3 rounds end-to-end (after one retry on ansible/qutebrowser to clear intermittent flake). Round trajectories captured, soft_verify runs between rounds, BaseUser callback drives the loop.
- 3/5 hit the canonical reward (ansible, openlibrary, qutebrowser). flipt and navidrome stayed at 0.0 across all three rounds — Gemini 3.1 Pro doesn’t crack them with this hint schedule, and progressive disclosure didn’t help.
- Per-round soft-verify scored 0.0 even on tasks where the final hardened verify scored 1.0. Soft-verify runs between rounds without the full hardening sequence (no workspace restore, no process kill so the sandbox stays alive), so its scoring can diverge from the final verifier. The user’s hint schedule reacts to soft-verify, not the canonical reward — something to keep in mind when designing the loop.
- First-run flake. ansible’s first run hit a transport EOF after 17min and qutebrowser timed out at 50min. Both succeeded on retry. v0.3.3 adds agent_idle_timeout (default 600s) and clearer EOF diagnostics so the next time a hang happens the failure is fast and actionable rather than silent.
examples/swebench_pro_progressive_disclosure.ipynb has the executable cells.
Where it lives in the rollout lifecycle
BaseUser plugs into the existing Rollout lifecycle (concepts) without changing any of the existing phases. When RolloutConfig.user is set, Rollout._run_user_loop() replaces the single-pass connect → execute → disconnect block with a per-round version:
A User cannot be combined with a multi-role Scene: the loop assumes one Scene with one Role. Setting both raises ValueError.
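The per-round shape of the loop can be sketched as follows. This is an illustration only, not the body of Rollout._run_user_loop(): the function name, its callback parameters, and the "return None to stop" convention are hypothetical, and `execute_round` stands in for the whole connect → execute → disconnect block.

```python
from typing import Callable, Optional


def run_user_loop_sketch(
    instruction: str,
    user_run: Callable[[int, str, Optional[dict]], Optional[str]],
    execute_round: Callable[[str], None],
    soft_verify: Callable[[], dict],
    max_rounds: int = 3,
) -> list:
    """Run up to max_rounds rounds, scoring non-destructively after each one."""
    results = []
    last = None
    for round_idx in range(max_rounds):
        prompt = user_run(round_idx, instruction, last)
        if prompt is None:        # the user callback asked to stop
            break
        execute_round(prompt)     # stands in for connect -> execute -> disconnect
        last = soft_verify()      # between-round score that the user reacts to
        results.append(last)
    return results
```

The final, destructive verifier pass is deliberately left out of the sketch; it would run once after the loop returns.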
Soft-verify and full-verify: two different verifiers
Between rounds, BenchFlow needs to score the agent’s progress so the user can react. But the final, end-of-rollout verifier does destructive things (kills the agent, restores the workspace, chowns to root) that would prevent the next round from running. So BenchFlow executes two verifier passes:

| | Soft-verify (between rounds) | Full-verify (end of rollout) |
|---|---|---|
| Kills agent processes | ❌ no | ✅ yes |
| Restores workspace from snapshot | ❌ no | ✅ optional, task-driven |
| Purges agent-injected conftest.py, sitecustomize.py, .pth | ✅ yes | ✅ yes |
| Locks down PATH/PYTHONPATH | ✅ yes | ✅ yes |
| chmod 777 /logs/verifier | ✅ yes (so non-root verifier can write) | n/a (root) |
| Runs verifier | ✅ yes | ✅ yes |
| Result | feeds RoundResult.rewards | the rollout’s final score |
Soft-verify still runs the pre-verify cleanup (CLEANUP_CMD), so an agent can’t plant a conftest.py that flips the round score.
API
BaseUser
RoundResult
Dataclass passed to run() from round 1 onward.
PassthroughUser
Sends the instruction unchanged on round 0, stops on round 1. Use it as the explicit single-round-equivalent.
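The behavior described above amounts to a two-line decision. The sketch below is a stand-in written for illustration: the class name and the `run(round_idx, instruction, last_result)` signature are assumptions, not benchflow's actual definition.

```python
from typing import Optional


class PassthroughUserSketch:
    """Round 0: forward the instruction unchanged. Any later round: stop."""

    def run(self, round_idx: int, instruction: str, last_result=None) -> Optional[str]:
        # Returning None is this sketch's (assumed) way of ending the loop.
        return instruction if round_idx == 0 else None
```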
FunctionUser
Wraps a plain function as a BaseUser. Sync or async — uses inspect.isawaitable to detect.
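A minimal sketch of the sync-or-async detection, assuming the wrapper simply calls the function and awaits the result when needed. `FunctionUserSketch` and the example function signatures are hypothetical; only `inspect.isawaitable` is taken from the description above.

```python
import asyncio
import inspect


class FunctionUserSketch:
    """Wrap a plain callable so sync and async functions look the same to the loop."""

    def __init__(self, fn):
        self._fn = fn

    async def run(self, *args, **kwargs):
        result = self._fn(*args, **kwargs)
        if inspect.isawaitable(result):   # calling an async def returns a coroutine
            result = await result
        return result


def sync_user(round_idx, instruction):
    return instruction if round_idx == 0 else None


async def async_user(round_idx, instruction):
    return f"round {round_idx}: {instruction}"


print(asyncio.run(FunctionUserSketch(sync_user).run(0, "fix the bug")))
print(asyncio.run(FunctionUserSketch(async_user).run(1, "fix the bug")))
```

Checking the return value rather than the function object means the wrapper also handles sync functions that return an awaitable.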
RolloutConfig fields
Oracle access
When oracle_access=True:
- Before round 0, the rollout reads /solution/solve.sh and passes its contents to user.setup(instruction, solution=...).
- The rollout moves /solution → /solution_oracle_backup so the agent can’t read it during its rounds.
- Between rounds, soft-verify temporarily restores /solution (some verifiers consult it), then re-hides it.
- Before the final verify(), the rollout permanently restores /solution.
The final restore is protected by try/finally around the user loop: if a round throws, the restore still runs.
⚠️ Setting oracle_access=True without a User is a misconfiguration: the solution stays exposed to the agent for the entire rollout. benchflow logs a WARNING at setup time when this happens.
Use cases for oracle access:
- Dataset generation — the user has the answer, generates an optimal prompt for the agent
- Curriculum learning — progressively reveal pieces of the solution
- Research — measure how much oracle information is required for an agent to succeed
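The hide-and-restore sequence described in this section can be sketched with ordinary filesystem moves. The function name and the `run_rounds` callback are hypothetical, and the sketch operates on a caller-supplied root directory rather than the real /solution path inside the sandbox; it omits the temporary restore during soft-verify.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_hidden_solution(root: Path, run_rounds) -> str:
    """Read the oracle, hide it for the duration of the rounds, always restore it."""
    solution = root / "solution"
    backup = root / "solution_oracle_backup"
    oracle = (solution / "solve.sh").read_text()   # read before hiding
    shutil.move(str(solution), str(backup))        # agent can no longer read it
    try:
        run_rounds(oracle)                         # user loop; may raise
    finally:
        shutil.move(str(backup), str(solution))    # restore even if a round throws
    return oracle


# Usage against a throwaway directory.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "solution").mkdir()
(demo_root / "solution" / "solve.sh").write_text("echo ok\n")
run_with_hidden_solution(demo_root, lambda oracle: None)
```

Putting the restore in `finally` is what keeps a crashed round from leaving the final verifier without its solution directory.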
Per-task hardening opt-outs
The verifier’s pre-run cleanup deletes conftest.py outside /tests/ to prevent reward-hacking. Some tasks (qutebrowser) ship legitimate conftest.py files that fix real circular imports; deleting them breaks pytest collection.
Tasks opt out in task.toml:
| Flag | Default | Effect when false |
|---|---|---|
| cleanup_conftests | true | Don’t delete conftest.py outside /tests/ before verify |
sitecustomize.py, .pth files, and *.py in /tmp always get cleaned: they have no legitimate use in a test artifact, and making their cleanup optional would broaden the attack surface beyond what real-world tasks need.
Unknown keys in [verifier.hardening] produce a warning and are ignored. String values for boolean flags are rejected.
Failure modes
The user loop catches exceptions from user.run() and stops, with the exception message stored in Rollout._error:
soft_verify() between rounds catches its own timeouts and crashes — they surface as RoundResult.verifier_error, not as a rollout-level failure. The next round still runs and the user can decide what to do.
Trajectory and tool counts are sliced per round from Rollout._trajectory. The session counters reset on disconnect(), so each round’s RoundResult.trajectory and n_tool_calls reflect only that round’s events, not cumulative.
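One way to picture the per-round slicing is as cutting a cumulative event list at the points where each round ended. The sketch below assumes events are dicts with a `"type"` key and that the loop records the trajectory length at each round boundary; both are illustrative assumptions, as is the `slice_rounds` helper itself.

```python
def slice_rounds(trajectory: list, round_ends: list) -> list:
    """Split a cumulative event list into per-round slices.

    round_ends[i] is the length of the trajectory when round i finished.
    """
    slices, start = [], 0
    for end in round_ends:
        events = trajectory[start:end]
        slices.append({
            "trajectory": events,
            # Count only this round's tool calls, not the running total.
            "n_tool_calls": sum(1 for e in events if e.get("type") == "tool_call"),
        })
        start = end
    return slices


events = [
    {"type": "tool_call"}, {"type": "message"},       # round 0
    {"type": "tool_call"}, {"type": "tool_call"},     # round 1
]
per_round = slice_rounds(events, [2, 4])
print([r["n_tool_calls"] for r in per_round])
```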
Comparison with multi-agent simulated user
benchflow has two patterns for multi-round agent runs. Neither requires a sidecar container.

| Pattern | What “user” is | When to use |
|---|---|---|
| BaseUser callback (this doc) | Python function in the scheduler process | Programmatic, deterministic, rule-based. No second LLM. Cheap. Best for progressive disclosure, curriculum, scripted hints. |
| Multi-role Scene with simulated-user role (use-cases §1) | Another LLM with full tool access | Open-ended, conversational. The “user” can read files, check outputs, give nuanced feedback. Best when the user’s behavior must itself be adaptive or LLM-quality. |
For the SWE-bench Pro use case, the disclosure schedule is fixed, the grading is the verifier, and there’s nothing for a second LLM to add, so BaseUser wins on cost and determinism.
Worked examples
- examples/swebench_pro_progressive_disclosure.ipynb: the SWE-bench Pro case study, executable end-to-end with the latest oracle/baseline data.
- examples/swebench_pro_user_dogfood.py: runnable script for any of the 5 SWE-bench Pro tasks. Example flags: --task flipt --max-rounds 3.
- examples/user_dogfood.py: minimal edit-pdf task with FunctionUser, useful as a starting template.