Skill evals
Test whether your agent skill actually helps agents perform better.
Install
Overview
bench skills eval takes a skill directory with an evals/evals.json
file, generates benchmark tasks from it, runs them with and without the
skill installed, and reports the “lift” — how much the skill improves
agent performance.
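As a rough illustration of what "lift" means, the sketch below compares mean reward with and without the skill. The exact formula BenchFlow reports is not stated here, so the difference of means is an assumption, and the reward lists are made-up values.

```python
from statistics import mean


def lift(with_skill: list[float], baseline: list[float]) -> float:
    """Mean reward with the skill minus mean reward without it (assumed definition)."""
    return mean(with_skill) - mean(baseline)


# Hypothetical per-case rewards in [0.0, 1.0]; not real results.
baseline_rewards = [0.0, 0.5, 0.0]
with_skill_rewards = [1.0, 0.5, 1.0]

print(f"lift = {lift(with_skill_rewards, baseline_rewards):+.2f}")
```

A lift near zero suggests the skill is not changing outcomes for these cases, whichever direction the absolute scores point.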
0.6 task-standard validation is in
docs/reports/2026-06-09-task-standard-validation.md.
Quick start
1. Add evals to your skill
2. Write test cases
3. Run the eval
evals.json reference
Top-level fields
| Field | Type | Required | Description |
|---|---|---|---|
| version | string | No | Schema version (default: “1”) |
| skill_name | string | No | Skill name (auto-detected from SKILL.md) |
| defaults.timeout_sec | int | No | Per-task timeout in seconds (default: 300) |
| defaults.judge_model | string | No | Model for LLM judge (default: gemini-3.1-flash-lite) |
| defaults.skill_mount_dir | string | No | Neutral sandbox path where the generated task exposes the skill before BenchFlow links it into agent-specific discovery paths (default: /skills) |
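To make the defaults concrete, here is a minimal sketch of how the top-level fields might resolve when some are omitted. The default values come from the table above; the `resolve_top_level` helper and its merge order are hypothetical and are not BenchFlow's actual loader.

```python
import json

# Default values as listed in the table; the merge logic is illustrative only.
DOCUMENTED_DEFAULTS = {
    "timeout_sec": 300,
    "judge_model": "gemini-3.1-flash-lite",
    "skill_mount_dir": "/skills",
}


def resolve_top_level(raw: dict) -> dict:
    """Return top-level settings with the documented defaults filled in."""
    defaults = {**DOCUMENTED_DEFAULTS, **raw.get("defaults", {})}
    return {
        "version": raw.get("version", "1"),
        # None stands for "auto-detect from SKILL.md" in this sketch.
        "skill_name": raw.get("skill_name"),
        "defaults": defaults,
    }


config = resolve_top_level(json.loads('{"defaults": {"timeout_sec": 600}}'))
print(config)
```

Only `timeout_sec` is overridden in this example; the judge model and mount directory fall back to the documented defaults.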
Case fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | No | Unique case ID (auto-generated if missing) |
| question | string | Yes | The task instruction sent to the agent |
| ground_truth | string | No | Expected final answer (used for exact match fallback) |
| expected_behavior | string[] | No | Behavioral rubric for LLM judge |
| expected_skill | string | No | Which skill should be invoked |
| expected_script | string | No | Which script should be called |
| environment | object | No | Per-case env var overrides |
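The field types above can be checked before a run with a small pre-flight validator. This is a sketch written against the table only: `validate_case` is a hypothetical helper, not part of the bench CLI, and the example case content is invented.

```python
def validate_case(case: dict) -> list[str]:
    """Return a list of problems found in one eval case (empty if none)."""
    problems = []
    question = case.get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("question is required and must be a non-empty string")
    for key in ("id", "ground_truth", "expected_skill", "expected_script"):
        if key in case and not isinstance(case[key], str):
            problems.append(f"{key} must be a string")
    rubric = case.get("expected_behavior")
    if rubric is not None and not (
        isinstance(rubric, list) and all(isinstance(item, str) for item in rubric)
    ):
        problems.append("expected_behavior must be a list of strings")
    if "environment" in case and not isinstance(case["environment"], dict):
        problems.append("environment must be an object")
    return problems


# Invented example case using only fields from the table.
example_case = {
    "id": "add-two-numbers",
    "question": "Use the calculator skill to compute 17 + 25.",
    "ground_truth": "42",
    "expected_behavior": ["Invokes the calculator skill", "Reports 42 as the answer"],
}

print(validate_case(example_case))
print(validate_case({"ground_truth": "42"}))
```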
Grading logic
- If expected_behavior is provided → LLM judge scores the agent’s trajectory against the rubric (0.0-1.0)
- If only ground_truth is provided → exact match checks if the answer appears in agent output (0.0 or 1.0)
- If neither → reward is 0.0
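The three rules above amount to a simple precedence order, sketched below. The `grade` function and the `judge` callable are hypothetical stand-ins: the real judge is an LLM call, and how BenchFlow normalizes text before the exact-match check is not specified here, so a plain substring test is assumed.

```python
from typing import Callable, Optional


def grade(
    case: dict,
    trajectory: str,
    judge: Optional[Callable[[list, str], float]] = None,
) -> float:
    """Pick a grading path in the documented order: rubric, exact match, then 0.0."""
    rubric = case.get("expected_behavior")
    if rubric:
        if judge is None:
            raise ValueError("expected_behavior requires an LLM judge")
        # Clamp to the documented 0.0-1.0 range.
        return max(0.0, min(1.0, judge(rubric, trajectory)))
    truth = case.get("ground_truth")
    if truth:
        # Assumed: the answer only has to appear somewhere in the output.
        return 1.0 if truth in trajectory else 0.0
    return 0.0


print(grade({"ground_truth": "42"}, "The answer is 42."))
print(grade({}, "anything"))
```

Note that a case with both fields goes to the judge; `ground_truth` only decides the reward when no rubric is present.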
Agent and judge credentials
bench skills eval runs real agents. The selected agent must have whatever
provider credentials or subscription auth it normally needs, and LLM-judge
cases also need a supported judge key available in the environment. Exact-match
cases can avoid the judge model, but they still need a working agent.
For Codex agents, that auth can be OPENAI_API_KEY, CODEX_API_KEY,
CODEX_ACCESS_TOKEN, or a host ~/.codex/auth.json login.
Provider-prefixed models can use provider-specific credentials instead; Azure
Foundry models use AZURE_API_KEY plus AZURE_API_ENDPOINT.
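Before starting a run with a Codex agent, it can help to check which of the listed auth sources is present on the host. The sketch below only inspects the environment variables and file path named above. `codex_auth_source` is a hypothetical helper, and the lookup order is an assumption rather than the agent's real resolution order.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

CODEX_ENV_KEYS = ("OPENAI_API_KEY", "CODEX_API_KEY", "CODEX_ACCESS_TOKEN")


def codex_auth_source(env: Mapping[str, str], home: Path) -> Optional[str]:
    """Return the name of the first available Codex auth source, or None."""
    for name in CODEX_ENV_KEYS:
        if env.get(name):
            return name
    auth_file = home / ".codex" / "auth.json"
    if auth_file.is_file():
        return str(auth_file)
    return None


# Prints only the source name or path, never the secret value.
print(codex_auth_source(os.environ, Path.home()))
```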
When a supported judge key is present on the host (GOOGLE_API_KEY,
GEMINI_API_KEY, ANTHROPIC_API_KEY, or OPENAI_API_KEY), generated tasks
reference it through [verifier.env] template syntax such as
${GEMINI_API_KEY}. Secret values are resolved at verifier runtime and are not
written into generated task files.
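The key-by-reference behavior described above can be sketched as follows: detect a supported judge key on the host and emit only a `${NAME}` template for it, so the secret itself never lands in a generated file. `judge_env_reference` is a hypothetical helper, and checking the keys in the listed order is an assumption.

```python
import os
from typing import Mapping, Optional

JUDGE_KEY_NAMES = (
    "GOOGLE_API_KEY",
    "GEMINI_API_KEY",
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
)


def judge_env_reference(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return a {NAME: "${NAME}"} template for the first judge key found."""
    for name in JUDGE_KEY_NAMES:
        if env.get(name):
            # Only the variable name is emitted; the value stays on the host.
            return {name: "${" + name + "}"}
    return None


print(judge_env_reference({"GEMINI_API_KEY": "not-a-real-key"}))
```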
The oracle agent is useful for generic task and sandbox smoke tests, but it is
not a replacement for skill evaluation. Skill-eval tasks are generated from
questions and rubrics and do not include solution/solve.sh, so oracle runs
will error instead of measuring skill lift.
Existing task-embedded skills
Skills embedded under a benchmark task, such as tasks/<task>/environment/skills/<skill>/SKILL.md, are task-local skill packs.
They are not exposed to ordinary no-skills runs by default. To evaluate one
directly with bench skills eval, add a sibling evals/evals.json inside that
skill directory or copy the skill into a standalone skill directory with the
same evals/ contract.
The repo includes a real standalone example at
skills/citation-management/, adapted
from the SkillsBench citation-check task:
Multi-agent comparison
Test your skill across multiple agents:
Custom environments
For skills that need specific dependencies, add a Dockerfile:
python:3.12-slim base.
For with-skill runs, BenchFlow appends a COPY skills/ <skill_mount_dir>/
step so the generated task exposes the skill at the neutral path declared in
task.toml. During rollout setup, BenchFlow links that neutral path into the
selected agent’s configured discovery paths.
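The append step for with-skill runs can be pictured as a small text transformation on the Dockerfile. This sketch assumes the COPY line is simply added at the end. `with_skill_dockerfile` is a hypothetical helper, and the `requests` dependency in the example is invented for illustration.

```python
def with_skill_dockerfile(dockerfile: str, skill_mount_dir: str = "/skills") -> str:
    """Append the COPY step that exposes the skill at the neutral mount path."""
    mount = skill_mount_dir.rstrip("/") + "/"
    return dockerfile.rstrip("\n") + f"\nCOPY skills/ {mount}\n"


# Example custom environment; the installed package is illustrative only.
base = "FROM python:3.12-slim\nRUN pip install --no-cache-dir requests\n"
print(with_skill_dockerfile(base))
```

Baseline (no-skill) runs would use the Dockerfile without this extra line, which is what keeps the two conditions comparable.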
GEPA integration
Export traces for GEPA skill evolution:
jobs/skill-eval/<skill>/gepa/:
End-to-End Walkthrough
Here’s a complete example evaluating a real skill from scratch.
Step 1: Create the skill
gws-skill/SKILL.md:
gws-skill/scripts/draft_email.py:
Step 2: Write eval cases
Write gws-skill/evals/evals.json:
Step 3: Run the eval
Step 4: Inspect results
Results are saved to jobs/skill-eval/<skill-name>/:
Step 5: Improve with GEPA (optional)
Architecture
For Skill Developers (Jon Snow Adapter Pattern)
If you maintain skills and want CI-integrated eval:
Tips for writing good eval cases
- Be specific in questions — “Use the calculator skill to compute X” is better than “Compute X”
- Write 3-5 rubric items per case — Each should be independently verifiable from the trajectory
- Include edge cases — Test error handling, unusual inputs, multi-step workflows
- Keep ground_truth simple — Exact match works best for numeric or short-string answers
- Use 2-4 cases minimum — Enough to show a pattern, not so many that runs get expensive
- Test the lift, not just correctness — The goal is to show the skill improves performance vs baseline. If baseline already scores high, the skill isn’t adding value
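Some of these tips can be checked mechanically before spending money on a run. The linter below is a sketch of that idea: `lint_cases` is a hypothetical helper, and its thresholds (at least 2 cases, 3-5 rubric items) are taken from the tips above rather than enforced by the tool.

```python
def lint_cases(cases: list[dict]) -> list[str]:
    """Return advisory warnings for a list of eval cases."""
    warnings = []
    if len(cases) < 2:
        warnings.append("fewer than 2 cases: too few to show a pattern")
    for index, case in enumerate(cases):
        label = case.get("id", f"case {index}")
        rubric = case.get("expected_behavior") or []
        if rubric and not 3 <= len(rubric) <= 5:
            warnings.append(f"{label}: {len(rubric)} rubric items (aim for 3-5)")
        if not rubric and not case.get("ground_truth"):
            warnings.append(f"{label}: no rubric or ground_truth, reward will be 0.0")
    return warnings


for warning in lint_cases([{"id": "draft", "question": "Draft an email."}]):
    print(warning)
```

The last tip, testing lift rather than correctness, cannot be linted statically; it only shows up once baseline and with-skill scores are compared.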