Skip to main content

Task standard — validation evidence

Date: 2026-06-09 This page records how the BenchFlow task standard (docs/task-standard.md) was validated: that a single task.md package can represent, run, and score real benchmarks end-to-end. SkillsBench is the worked example because it exercises the full surface — environment build, skill injection, oracle, and verifier. The bar throughout is “runs end-to-end with a real rollout and a produced verdict,” not “the agent passes.” Pass rate is a property of the model under test, not of the standard.

1. Representation — conversion parity 88 / 88 (deterministic)

For all 88 SkillsBench main tasks, a roundtrip conformance check migrates each legacy split package → native task.md → exports back to split, and compares the canonical task config, the normalized prompt, and SHA-256 file maps for the environment/, solution/, and tests/ trees.
=== SkillsBench conversion-parity: 88 tasks | PASS=88  MISMATCH=0  ERROR=0 ===
Every task is representable as native task.md with zero semantic loss on the compatibility surface.

2. Scoring — oracle E2E parity 6 / 6 (Docker)

Each sampled task runs twice with the deterministic oracle agent (--agent oracle, no LLM — it executes the reference solution), once on the legacy layout and once on the pure-native package (task.md + oracle/ + verifier/). Reward must match.
Tasklegacytask.mdparity
tictoc-unnecessary-abort-detection1.01.0
llm-prefix-cache-replay1.01.0
parallel-tfidf-search1.01.0
grid-dispatch-operator1.01.0
travel-planning0.00.0✅ (oracle fails identically)
3d-scan-calc0.00.0✅ (oracle fails identically)
The native package builds the environment, runs the oracle, and runs the verifier end-to-end, scoring identically to the legacy layout. The two 0.0 rows are the upstream oracle itself failing under the hardened sandbox — independent of the standard; parity is preserved.

3. Live agent run — three interaction modes

To show the packages execute under a real agent (not just the deterministic oracle), five converted tasks were run with a live agent across all three SkillsBench interaction modes — no-skill, with-skill, and self-gen (the agent authors a skill, then solves with it). A rollout is counted only when it produced real work (tokens spent, tool calls made, and a verdict returned). 12 of 15 mode-cells ran clean (4 of 5 tasks across 3 skill modes; the fifth fails for non-task.md infrastructure reasons), each a real rollout. The skill machinery genuinely engages: skills are injected and invoked, not merely declared. All clean rewards were 0.0 — expected for the small model used here; these tasks sit below its capability floor, so the result to read is the packages run and the skill path fires in every mode, not the absolute score. The failing task’s three mode-cells failed for environment/verifier-portability reasons unrelated to the conversion.

What this shows / does not show

  • Shows: task.md losslessly represents 100% of SkillsBench, scores with exact oracle parity to the legacy layout, and runs end-to-end under a live agent in all three interaction modes on 4 of the 5 sampled tasks (the fifth fails for non-task.md infrastructure reasons).
  • Does not show: a measurable skill uplift — that needs a base model strong enough to sometimes pass; the runs here establish the plumbing, not the lift.