Authoring tasks
A BenchFlow task packages an instruction, a sandboxed environment, and a verifier into a directory that BenchFlow runs and scores automatically. This page covers the Harbor-compatible split layout (task.toml + instruction.md). For the native single-document format, see Authoring native task.md tasks.
Directory layout
[!NOTE] BenchFlow will provide first-party support for hosted competition platforms, Verifiers, and OpenReward Standard.

You can create Harbor-format tasks in BenchFlow with a task.toml config file, a separate instruction.md, sandbox assets under environment/, verifier files under tests/, and an optional solution/ oracle.
tests/ may also include test_outputs.py (pytest module called by test.sh).
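The layout described above can be sketched as a small scaffolding script. This is a minimal sketch and not a BenchFlow tool: scaffold_task is a hypothetical helper, the file contents are placeholders, and the ubuntu:24.04 base image is an assumption chosen only for illustration.

```python
from pathlib import Path

# Hypothetical skeleton for the split layout; every file body is a placeholder.
LAYOUT = {
    "task.toml": "# task config goes here\n",
    "instruction.md": "Describe the goal for the agent here.\n",
    # Base image is an assumption for illustration only.
    "environment/Dockerfile": "FROM ubuntu:24.04\nWORKDIR /app\n",
    "tests/test.sh": "#!/bin/bash\n# verifier entrypoint\n",
    "solution/solve.sh": "#!/bin/bash\n# optional oracle\n",
}


def scaffold_task(root):
    """Create any missing skeleton files under root and return their relative paths."""
    root = Path(root)
    created = []
    for rel, content in LAYOUT.items():
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.write_text(content)
            created.append(rel)
        if path.suffix == ".sh":
            # Shell entrypoints need the executable bit.
            path.chmod(0o755)
    return created
```

Calling scaffold_task("my-task") a second time returns an empty list, because existing files are left untouched.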
task.toml
pre_agent_hooks=build_service_hooks([...]); for CLI-only task authoring, keep services inside the task’s own Dockerfile/startup scripts until a dedicated service declaration is wired through the CLI.
Install tooling to shared prefixes, not /root — when a task image ships Node.js, Python tools, or agent binaries that the sandbox user must execute, install them to /usr/local/bin, /usr/local/lib, or /opt, not /root/.nvm or /root/.local/bin. setup_sandbox_user() creates the non-root user, prepares small config/auth dirs, and chowns the workspace — it does not clone /root into the sandbox home. Legacy images that already install tools under /root still work via a narrow symlink fallback, but shared prefixes are the supported path. Pre-creating the sandbox user in the Dockerfile is an optional speedup, not a requirement.
Multi-container tasks
A task may ship an environment/docker-compose.yaml alongside the
Dockerfile. The agent always runs in the main service; any additional
services you declare become sibling containers on the same Docker network.
This supports vulhub-style CVE tasks where the agent attacks a separate target
container over the network.
environment/Dockerfile is always required: bench tasks check rejects a task that ships only a docker-compose.yaml. If your main service uses a prebuilt image: and needs no build context, still include a minimal Dockerfile (e.g. FROM <same-image>) so structural validation and other tooling agree on the task package shape.
main reaches target by service name (http://target:8080). The verifier
can inspect target-side state — not just the agent’s workspace — by passing
a service argument when running commands:
exec(..., service=...) works on the Docker sandbox and the Daytona DinD
(compose) sandbox. Single-container backends (Modal, direct Daytona) raise a
clear error for any non-main service. This lets a verifier check
write-based oracles (/tmp/exploit.txt in the target), database modifications,
or RCE markers without trusting the agent container.
Target-side test.sh verification
For tasks whose success oracle lives in a target container — an RCE marker
file, a modified database row — point the test.sh verifier at that service
with [verifier].service:
BenchFlow copies the tests/ directory into the
target container, runs test.sh there, and copies the resulting
reward.txt / reward.json back to the host. service defaults to "main"
(the agent container), so existing single-container tasks are unaffected.
[verifier].service is the declarative, task-schema way to do cross-container
verification; the env.exec_in_service(...) Python API above is the
imperative equivalent for hook-driven runs.
Use the same service name you declared in docker-compose.yaml. A test.sh running in the target reaches main (and vice versa) by service name over the Docker network, just like the agent does.
Hardening policy for multi-container tasks
BenchFlow’s pre-verification hardening (killing the sandbox user’s processes, scrubbing PATH/PYTHONPATH, restoring build-config files)
applies only to the main (agent) container. Target containers are
deliberately left unhardened: a vulhub-style target is meant to be
vulnerable, the agent never has a shell inside it, and hardening it would
risk breaking the very vulnerability the task exercises. [verifier].service
selects where test.sh runs; it does not move hardening off main.
instruction.md
The first prompt sent to the agent. Write it as you would for a skilled developer:

- State the precise goal in the first sentence.
- Name exact files or paths the agent must create or modify.
- Specify constraints (no external libraries, must pass existing tests, etc.).
- Don’t mention the verifier or reward.txt; those are internal.
None prompt means “use instruction.md”:
Verifier contract (tests/test.sh)
After the agent finishes, the BenchFlow runtime copies tests/ to /tests/ and runs /tests/test.sh. The working directory is the Dockerfile’s WORKDIR (typically /app/ in the example Dockerfile below).
Your script must write a single float (0.0–1.0) to /logs/verifier/reward.txt. After writing the reward, exit 0; a nonzero test.sh exit is treated as verifier infrastructure failure, not a scored task failure.
| Path | Contents |
|---|---|
| /app/ | Agent’s working directory |
| /tests/ | Your tests/ directory |
| /solution/ | solution/ (oracle runs only) |
| /logs/verifier/ | Write reward.txt (and optionally ctrf.json) here |
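A minimal sketch of the contract above, written as a Python helper that test.sh could call. write_reward is a hypothetical name, the reward_dir parameter exists only so the function can be tried outside the sandbox, and clamping out-of-range values is a choice made in this sketch rather than documented BenchFlow behavior.

```python
from pathlib import Path


def write_reward(score: float, reward_dir: str = "/logs/verifier") -> Path:
    """Write a single float in [0.0, 1.0] to reward.txt and return its path."""
    # Clamp so a buggy check can never emit a value outside the accepted range.
    clamped = max(0.0, min(1.0, float(score)))
    out = Path(reward_dir)
    out.mkdir(parents=True, exist_ok=True)
    reward_file = out / "reward.txt"
    reward_file.write_text(f"{clamped}\n")
    return reward_file
```

test.sh would call this after running its checks and then exit 0, so a failed task is recorded as a score of 0.0 rather than as a verifier infrastructure failure.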
Pure bash verifier
pytest verifier
Partial credit
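One way to produce a fractional reward is to score the share of independent checks that pass. The sketch below assumes equal weighting, and the check names are hypothetical; neither is prescribed by BenchFlow.

```python
def partial_credit(checks: dict[str, bool]) -> float:
    """Fraction of checks that passed, in [0.0, 1.0]; 0.0 when there are no checks."""
    if not checks:
        return 0.0
    return sum(checks.values()) / len(checks)


# Hypothetical check results collected by the verifier.
checks = {
    "output_exists": True,
    "header_correct": True,
    "all_rows_valid": False,
    "no_extra_files": True,
}
score = partial_credit(checks)
print(f"{score:.2f}")  # 0.75
```

The resulting value is what the verifier would write to /logs/verifier/reward.txt before exiting 0.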
/logs/verifier/reward.txt or modify /tests/test.sh. For tasks running arbitrary code, use allow_internet = false and verify output files only. For LLM agent runs, BenchFlow preserves the network path needed for model APIs and agent startup, then disables supported agent web browsing/fetch tools through agent config or launch controls. Oracle runs still use the environment’s network policy directly.
solution/ (optional)
Include when you want to verify the task is solvable or provide a reference implementation. When BenchFlow runs with --agent oracle, it copies solution/ to /solution/ and runs solution/solve.sh instead of an ACP agent.
solve.sh has the same filesystem access as the agent — write only to /app/, not to /logs/verifier/.
CLI
bench tasks generate converts agent traces (Claude Code sessions, opentraces records, or HuggingFace datasets) into task directories with task.toml, instruction.md, and a file-existence test.sh. Use --dry-run to preview traces before generating. See CLI reference for all flags.
bench tasks check validates task definition presence (task.md or legacy task.toml + instruction.md), a non-empty instruction, environment/Dockerfile, and a runnable verifier entrypoint (verifier/ or legacy tests/). It surfaces task.toml parse errors but does not require [agent].timeout_sec (unset means no wall-clock cap). Exits with code 1 on failure (CI-friendly).
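The structural checks listed above can be approximated in a few lines. This is an illustrative sketch, not the implementation of bench tasks check: check_task is a hypothetical function, it assumes the verifier entrypoint is named test.sh in both verifier/ and tests/, and it does not parse task.toml or test whether the entrypoint is actually runnable.

```python
from pathlib import Path


def check_task(task_dir) -> list[str]:
    """Return a list of structural problems; an empty list means the checks passed."""
    root = Path(task_dir)
    problems = []

    # Task definition: native task.md, or legacy task.toml + instruction.md.
    native = root / "task.md"
    legacy = (root / "task.toml", root / "instruction.md")
    if native.is_file():
        instruction = native
    elif all(p.is_file() for p in legacy):
        instruction = legacy[1]
    else:
        instruction = None
        problems.append("missing task.md or task.toml + instruction.md")

    if instruction is not None and not instruction.read_text().strip():
        problems.append(f"{instruction.name} is empty")

    if not (root / "environment" / "Dockerfile").is_file():
        problems.append("missing environment/Dockerfile")

    # Assumption: the entrypoint is test.sh under verifier/ or legacy tests/.
    if not any((root / d / "test.sh").is_file() for d in ("verifier", "tests")):
        problems.append("missing verifier entrypoint")

    return problems
```

A CI wrapper around this sketch could mirror the documented behavior by exiting with code 1 whenever the returned list is non-empty.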