Continuing timed-out runs (benchflow continue)
benchflow continue resumes a previous, unfinished (timed-out) agent run to
completion. It is a standalone tool — it does not modify benchflow’s normal
eval/run path — and currently targets the openhands agent.
The goal is a transparent resume: the continued run behaves as if the original
timeout had simply been larger. The agent keeps its exact context and
environment and continues its own loop with no injected prompt.
The problem it solves
A finished run keeps nothing of the container: cleanup tears the sandbox down.
What survives on disk is the run folder: config.json, result.json,
prompts.json, and trajectory/llm_trajectory.jsonl. So a historical timeout
has only its trajectory + the task; there is no saved container to restore.
benchflow continue reconstructs the missing state from the trajectory.
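
As a rough illustration of what the run folder still offers, the sketch below reads the recorded exchanges back in order. Only the file location comes from the text above. The helper name load_exchanges is hypothetical, the one-JSON-object-per-line layout is inferred from the .jsonl extension, and each record is treated as an opaque dict because its schema is not described here.

```python
import json
from pathlib import Path


def load_exchanges(run_dir):
    """Return the recorded LLM exchanges from a run folder, in file order.

    Hypothetical helper: nothing is assumed about the fields inside
    each record, so every line is simply parsed as JSON.
    """
    path = Path(run_dir) / "trajectory" / "llm_trajectory.jsonl"
    exchanges = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                exchanges.append(json.loads(line))
    return exchanges
```

The length of the returned list is the number of exchanges available for replay.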
How it works — record-replay
The recorded llm_trajectory.jsonl is the exact sequence of LLM
request/response pairs from the original run. benchflow continue:
- Loads the original run folder and the recorded exchanges.
- Boots a fresh, pristine sandbox from the same base image.
- Stands up a replay proxy that OpenHands talks to via LLM_BASE_URL. For the
first N requests (N being the number of recorded exchanges) it returns the
recorded responses in order, so the agent re-executes its own past decisions
for real, rebuilding the byte-exact workspace and its exact internal
conversation/event state.
- When the recorded responses run out (the timeout cut-point), the proxy flips
to the live model and the agent continues, with no new prompt.
- Re-verifies with the task verifier and writes a new HF-compatible folder with
a stitched llm_trajectory.jsonl (recorded prefix + live suffix) and
continued_from provenance, making it a drop-in replacement for the timed-out entry.
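
The replay-then-live switch in the steps above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not benchflow's actual proxy: the class name ReplayThenLive, the handle method, and the live_call callable are all hypothetical, and the HTTP layer that OpenHands would reach through LLM_BASE_URL is left out.

```python
class ReplayThenLive:
    """Serve recorded responses in order, then defer to a live backend.

    Illustrative only: `recorded` is a list of response payloads and
    `live_call` is any callable that maps a request to a response.
    """

    def __init__(self, recorded, live_call):
        self._recorded = list(recorded)
        self._live_call = live_call
        self._served = 0        # number of requests answered so far
        self.live_suffix = []   # (request, response) pairs from the live leg

    @property
    def replaying(self):
        # True while recorded responses remain (before the cut-point).
        return self._served < len(self._recorded)

    def handle(self, request):
        if self.replaying:
            response = self._recorded[self._served]
        else:
            response = self._live_call(request)
            self.live_suffix.append((request, response))
        self._served += 1
        return response


proxy = ReplayThenLive(["r1", "r2"], live_call=lambda req: f"live:{req}")
print([proxy.handle(q) for q in ("a", "b", "c")])  # ['r1', 'r2', 'live:c']
```

Keeping the live pairs in a separate list makes the stitched trajectory a plain concatenation of the recorded prefix and the live suffix.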
Usage
Point --tasks-dir at the directory containing the task (matched by name). If the
task_path recorded in config.json still exists on disk, --tasks-dir is
optional.
Options
| Flag | Default | Meaning |
|---|---|---|
| --tasks-dir DIR | recorded task_path | Task source (instruction + verifier). |
| --model MODEL | original run’s model | Override the live-continuation model. |
| --timeout SEC | original run’s timeout | Wall-clock budget for the continuation. |
| --output DIR | <orig-parent>/continued | Output jobs dir for the new run. |
| --require-timeout | off | Refuse runs whose recorded status isn’t a timeout. |
| --strict-divergence | off | Abort if replay leaves the original rails. |
| --replay-only | off | Rebuild via replay and stop at the cut-point (no live model needed). |
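
The fallback behaviour in the Default column could look roughly like the sketch below. It is an assumption-laden illustration: resolve_options is a hypothetical name, task_path is the only config.json key named in this document, and the "model" and "timeout" keys are guesses at how the original run's values might be stored. Reading <orig-parent>/continued as a sibling of the original run folder is likewise an interpretation.

```python
from pathlib import Path


def resolve_options(run_dir, config, tasks_dir=None, model=None,
                    timeout=None, output=None):
    """Fill unset options from the original run, mirroring the table above.

    `config` is the parsed config.json; the "model" and "timeout" keys
    are hypothetical and may be named differently in practice.
    """
    run_dir = Path(run_dir)
    if output is not None:
        output_dir = Path(output)
    else:
        # Assumed meaning of <orig-parent>/continued.
        output_dir = run_dir.parent / "continued"
    return {
        "tasks_dir": tasks_dir if tasks_dir is not None else config["task_path"],
        "model": model if model is not None else config["model"],
        "timeout": timeout if timeout is not None else config["timeout"],
        "output": output_dir,
    }
```

Explicit `is not None` checks are used so that an explicitly passed value always wins over the recorded one.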
Models and credentials
- The live-continuation model defaults to the original run’s model, so the
run continues with the same model that produced the recording. Tests use
--model gemini-3.1-flash-lite-preview for a cheap path.
- The replay phase needs no API key, because responses are served from the
recording. Only the live continuation calls the real provider, so the
host needs that provider’s credentials (e.g. GEMINI_API_KEY) in its
environment. --replay-only skips the live leg entirely.
Limitations and caveats
- openhands only for now (the proxy seam relies on LLM_BASE_URL).
- Replay fidelity is best-effort. Replay re-runs the original shell
commands for real; if a command’s output diverges from the original
(network, timestamps, nondeterminism), the agent may see a different
observation than recorded. A message-count check warns on divergence
(--strict-divergence aborts instead).
- “Identical output” means a faithful continuation, not a bit-identical result. The model samples, and no “original full run” exists past the timeout. The bar is that the stitched trajectory reads as one continuous run, as if the timeout had been larger.
- Re-running the episode’s commands costs wall-clock time (model latency is skipped, since recorded responses are served instantly).
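
One plausible shape for the message-count check mentioned above is sketched here. It assumes chat-style requests that carry a "messages" list, which this document does not confirm; the names check_message_count and ReplayDivergence are hypothetical, and the strict flag merely stands in for the behaviour described for --strict-divergence.

```python
import warnings


class ReplayDivergence(RuntimeError):
    """Raised in strict mode when replay leaves the recorded path."""


def check_message_count(step, recorded_request, incoming_request, strict=False):
    """Compare message counts of a recorded and a replayed request.

    Returns True when the counts match. On a mismatch it warns and
    returns False, or raises ReplayDivergence when `strict` is set.
    """
    expected = len(recorded_request.get("messages", []))
    actual = len(incoming_request.get("messages", []))
    if expected == actual:
        return True
    detail = f"step {step}: expected {expected} messages, got {actual}"
    if strict:
        raise ReplayDivergence(detail)
    warnings.warn(detail)
    return False
```

A count comparison like this is cheap but coarse: it would not notice an observation whose text changed while the number of messages stayed the same.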