LLM-as-Judge Verifier
Use an LLM to evaluate agent outputs against a rubric instead of deterministic tests.
When to use LLM-as-judge
Use LLM-as-judge when the task output is subjective, open-ended, or hard to verify with unit tests: legal analysis, code review quality, document drafting, research summaries. For tasks with a clear right answer (e.g. “write fizzbuzz”), stick with deterministic test.sh verifiers.
BenchFlow’s LLM judge supports:
- First-class [verifier] type: type = "llm-judge" in task.toml, no test.sh needed
- Multi-criterion rubrics with binary, likert, and numeric scoring
- Per-criterion weights for non-uniform importance
- Dense reward events emitted per criterion during evaluation
- Multi-provider routing across Anthropic, OpenAI, and Google models
- Configurable aggregation (weighted mean, all-pass, any-pass, threshold)
The judge verifier takes the place of a test.sh verifier. A task selects it with one line of config, and the framework
handles deliverable collection, prompting, provider routing, retries, and
reward aggregation.
Quick start
0. Install the judge provider SDKs
The judge calls the Anthropic, OpenAI, and Google SDKs, which are not installed by default. Install the judge extra (uv sync --extra judge). You need only the SDK of the provider whose model you use, but the extra ships all three.
If no provider SDK is installed, the judge raises a verifier error instead of recording a reward of 0.0: a missing dependency is an environment failure,
not a score.
1. Select the judge verifier in task.toml
There is no tests/test.sh to write: the
Verifier downloads the agent’s deliverables from input_dir, scores them
against the rubric, and writes reward.json itself.
2. Write a rubric.toml
Place it where rubric_path points (by convention tests/rubric.toml).
A rubric.json works too: set rubric_path = "tests/rubric.json".
The final reward always falls in [0, 1].
[verifier] reference
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | "test-script" | "test-script" (run tests/test.sh) or "llm-judge" |
| timeout_sec | float | 600 | Overall verifier timeout |
| env | table | {} | Env vars for the verifier; judge API keys go here |
[verifier.judge] (used when type = "llm-judge")
| Field | Type | Default | Description |
|---|---|---|---|
| model | string | "claude-sonnet-4-6" | Judge model; provider routed from prefix |
| rubric_path | string | "tests/rubric.toml" | Rubric file relative to the task dir (.toml or .json) |
| input_dir | string | "/app" | Sandbox dir whose contents are graded |
| input_type | string | "deliverables" | Only "deliverables" is supported; trajectory judging is not available at verify time |
| context | string | "" | Extra judge context (defaults to the task instruction) |
Library use — LLMJudgeRewardFunc
The judge is also a composable RewardFunc, usable directly or from a custom
test.sh verifier.
If a rubric.toml/rubric.json is in the rollout directory or
its parent, it’s found automatically.
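The automatic lookup described above could work roughly as follows. This is a sketch only: find_rubric is a hypothetical helper, not part of BenchFlow, and the order in which rubric.toml and rubric.json are tried is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Assumed lookup order; the real precedence between the two names is not documented here.
RUBRIC_NAMES = ("rubric.toml", "rubric.json")


def find_rubric(rollout_dir: Path) -> Optional[Path]:
    """Return the first rubric file found in rollout_dir or its parent."""
    for directory in (rollout_dir, rollout_dir.parent):
        for name in RUBRIC_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None


# Demo: the rubric sits in the parent of the rollout directory.
with tempfile.TemporaryDirectory() as tmp:
    rollout = Path(tmp) / "rollout"
    rollout.mkdir()
    (Path(tmp) / "rubric.toml").write_text("[scoring]\n", encoding="utf-8")
    found = find_rubric(rollout)
    found_name = found.name if found else None
    found_in_parent = found is not None and found.parent == Path(tmp)

print(found_name, found_in_parent)
```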
rubric.toml reference
[judge] section
| Field | Type | Default | Description |
|---|---|---|---|
| model | string | "claude-sonnet-4-6" | LLM model for judging. Prefix with anthropic/, openai/, or google/ to force a provider |
| mode | string | "individual" | "individual" scores each criterion separately; "batched" is reserved for future use |
| files | string[] | [] | Default files to evaluate (fallback when a criterion doesn’t specify its own) |
| timeout | int | 120 | Timeout in seconds per judge call |
[[criterion]] entries
| Field | Type | Default | Description |
|---|---|---|---|
| name | string | (none) | Criterion identifier (falls back to first 40 chars of description) |
| description | string | required | What the judge should evaluate |
| type | string | "binary" | "binary", "likert", or "numeric" |
| weight | float | 1.0 | Relative importance in aggregation |
| points | int | 5 | Scale for likert type (1 to N) |
| min | float | 0.0 | Minimum for numeric type |
| max | float | 100.0 | Maximum for numeric type |
| files | string[] | [] | Specific files this criterion should evaluate |
[scoring] section
| Field | Type | Default | Description |
|---|---|---|---|
| aggregation | string | "weighted_mean" | How to combine criterion scores |
| threshold | float | 0.7 | Pass threshold (only used with "threshold" aggregation) |
Score normalization
Each criterion type normalizes its raw score to [0, 1]:
| Type | Raw | Normalized |
|---|---|---|
| binary | pass/fail | 1.0 or 0.0 |
| likert | 1–N integer | (raw - 1) / (points - 1) |
| numeric | min–max float | (raw - min) / (max - min), clamped to [0, 1] |
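The normalization rules in the table can be written out as a small function. This is a sketch under stated assumptions: normalize and its parameter names are hypothetical, not BenchFlow's API, and a likert scale is assumed to have at least two points.

```python
def normalize(kind, raw, *, points=5, lo=0.0, hi=100.0):
    """Map a raw judge score to [0, 1] following the table above."""
    if kind == "binary":
        # "pass" maps to 1.0, anything else to 0.0.
        return 1.0 if raw == "pass" else 0.0
    if kind == "likert":
        # A 1..points integer rescaled so 1 -> 0.0 and points -> 1.0.
        return (raw - 1) / (points - 1)
    if kind == "numeric":
        # Linear rescale from [lo, hi], then clamp to [0, 1].
        return min(1.0, max(0.0, (raw - lo) / (hi - lo)))
    raise ValueError(f"unknown criterion type: {kind}")


print(normalize("binary", "pass"))        # 1.0
print(normalize("likert", 3, points=5))   # 0.5
print(normalize("numeric", 75.0))         # 0.75
print(normalize("numeric", 120.0))        # 1.0 (clamped)
```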
Aggregation strategies
| Strategy | Behavior |
|---|---|
| weighted_mean | sum(score × weight) / sum(weight); yields a continuous reward |
| all_pass | 1.0 if every criterion scores ≥ 0.5, else 0.0 |
| any_pass | 1.0 if any criterion scores ≥ 0.5, else 0.0 |
| threshold | 1.0 if weighted mean ≥ threshold, else 0.0 |
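To make the four strategies concrete, here is a sketch that applies each of them to the same set of normalized scores. The aggregate function is a hypothetical illustration of the table, not BenchFlow's implementation.

```python
def aggregate(scores, weights, strategy="weighted_mean", threshold=0.7):
    """Combine normalized criterion scores using one of the documented strategies."""
    weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    if strategy == "weighted_mean":
        return weighted_mean
    if strategy == "all_pass":
        return 1.0 if all(s >= 0.5 for s in scores) else 0.0
    if strategy == "any_pass":
        return 1.0 if any(s >= 0.5 for s in scores) else 0.0
    if strategy == "threshold":
        return 1.0 if weighted_mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation strategy: {strategy}")


# Three criteria; the first one counts double.
scores = [1.0, 0.5, 0.25]
weights = [2.0, 1.0, 1.0]

print(aggregate(scores, weights, "weighted_mean"))  # 0.6875
print(aggregate(scores, weights, "all_pass"))       # 0.0 (0.25 is below 0.5)
print(aggregate(scores, weights, "any_pass"))       # 1.0
print(aggregate(scores, weights, "threshold"))      # 0.0 (0.6875 is below 0.7)
```

With these inputs the weighted mean is (2.0 + 0.5 + 0.25) / 4 = 0.6875, which is why the default threshold of 0.7 yields a failing reward even though two of the three criteria pass.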
Criterion types
Binary (pass/fail)
The judge decides whether the criterion is satisfied. The LLM returns {"verdict": "pass", "reasoning": "..."}.
Likert (scaled)
The judge rates on a 1-to-N scale. The LLM returns {"score": 4, "reasoning": "..."}.
With points = 5, for example, a raw score of 3 normalizes to (3-1)/(5-1) = 0.5.
Numeric (range)
The judge assigns a value within a continuous range. The LLM returns {"score": 75.0, "reasoning": "..."}.
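The three reply shapes above differ only in which key carries the raw value. A minimal sketch of reading them, assuming the reply is a plain JSON string (parse_reply is a hypothetical helper, not a BenchFlow function):

```python
import json


def parse_reply(kind, reply):
    """Return (raw_value, reasoning) from a judge reply for the given criterion type."""
    data = json.loads(reply)
    # Binary criteria carry a "verdict"; likert and numeric carry a "score".
    raw = data["verdict"] if kind == "binary" else data["score"]
    return raw, data.get("reasoning", "")


binary_raw, _ = parse_reply("binary", '{"verdict": "pass", "reasoning": "..."}')
likert_raw, _ = parse_reply("likert", '{"score": 4, "reasoning": "..."}')
numeric_raw, _ = parse_reply("numeric", '{"score": 75.0, "reasoning": "..."}')

print(binary_raw, likert_raw, numeric_raw)
```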
Inline criteria (no TOML file)
For programmatic use or Harvey LAB-style criteria, pass criteria directly. The match_criteria keys are also supported.
Dense reward events
Each criterion emits a RewardEvent during evaluation, enabling per-criterion observability and training signal.
Each event is tagged "dense" and carries a reward in [0, 1], a source of "criterion:{name}", and a step index. Events are cleared between score() calls.
Multi-provider routing
The judge model string determines which provider SDK is used:
| Prefix | Provider | Auth env var |
|---|---|---|
| claude-*, anthropic/* | Anthropic | ANTHROPIC_API_KEY |
| gpt-*, o1*, o3*, o4*, openai/* | OpenAI | OPENAI_API_KEY |
| gemini*, google/* | Google | GOOGLE_API_KEY or GEMINI_API_KEY |
The provider SDKs come from the judge extra (uv sync --extra judge). If
none are installed, the judge raises a verifier error instead of recording
a reward; see step 0.
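The prefix table can be read as a simple lookup. The sketch below is an illustration of that table only: route_provider and the two dictionaries are hypothetical names, checking the explicit provider/ prefix before the model-name pattern is an assumption, and the model names other than claude-sonnet-4-6 are placeholders.

```python
# Explicit "provider/" prefixes, assumed to take precedence over name patterns.
EXPLICIT_PREFIXES = {"anthropic/": "anthropic", "openai/": "openai", "google/": "google"}

# Model-name prefixes from the routing table.
NAME_PREFIXES = {
    "anthropic": ("claude-",),
    "openai": ("gpt-", "o1", "o3", "o4"),
    "google": ("gemini",),
}

# Environment variables each provider reads its API key from.
AUTH_ENV = {
    "anthropic": ("ANTHROPIC_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
    "google": ("GOOGLE_API_KEY", "GEMINI_API_KEY"),
}


def route_provider(model):
    """Pick a provider for a judge model string, following the table above."""
    for prefix, provider in EXPLICIT_PREFIXES.items():
        if model.startswith(prefix):
            return provider
    for provider, prefixes in NAME_PREFIXES.items():
        if model.startswith(prefixes):
            return provider
    raise ValueError(f"cannot route model: {model}")


for name in ("claude-sonnet-4-6", "openai/example-model", "o3-example", "gemini-example"):
    provider = route_provider(name)
    print(name, "->", provider, AUTH_ENV[provider])
```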
Evaluation output
After scoring, an evaluation_details.json is written to the rollout directory.
Its score field is the actual aggregated score from the configured strategy, not n_passed / n_total.
File discovery
The judge automatically discovers deliverable files in the rollout directory. Supported formats:
| Extension | Reader | Dependency |
|---|---|---|
| .txt, .md, .json, .csv | Built-in | None |
| .docx | pandoc or python-docx | pandoc (preferred) or pip install python-docx |
| .xlsx | openpyxl | pip install openpyxl |
| .pptx | markitdown | pip install markitdown |
| .pdf | pdfplumber | pip install pdfplumber |
Hidden files (names starting with .) and internal metadata files (rubric.json) are excluded. File content is truncated at 15,000 characters per file when sent to the judge.
To scope a criterion to specific files, list them in that criterion’s files field.
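The discovery rules above (skip hidden files, skip rubric.json, truncate at 15,000 characters) can be sketched for the built-in text formats as follows. collect_text_files is a hypothetical helper, and scanning only the top level of the directory is an assumption; the readers for .docx, .xlsx, .pptx, and .pdf are left out.

```python
import tempfile
from pathlib import Path

MAX_CHARS = 15_000                                 # per-file truncation limit
TEXT_EXTENSIONS = {".txt", ".md", ".json", ".csv"}  # formats with a built-in reader
EXCLUDED_NAMES = {"rubric.json"}                    # internal metadata


def collect_text_files(root):
    """Return {file name: truncated content} for plain-text deliverables in root."""
    collected = {}
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        if path.name.startswith(".") or path.name in EXCLUDED_NAMES:
            continue
        if path.suffix.lower() in TEXT_EXTENSIONS:
            collected[path.name] = path.read_text(encoding="utf-8")[:MAX_CHARS]
    return collected


# Demo: one long deliverable, one hidden file, one metadata file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "report.md").write_text("x" * 20_000, encoding="utf-8")
    (root / ".hidden.txt").write_text("ignored", encoding="utf-8")
    (root / "rubric.json").write_text("{}", encoding="utf-8")
    files = collect_text_files(root)

print(list(files), len(files["report.md"]))
```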
Python API
All rubric config types are importable from the top level.
Worked example: Harvey LAB legal task
A legal document analysis task scored entirely by config, with no test.sh.
The Verifier collects the deliverables from /app, grades each
criterion, aggregates, and writes reward.json, with no scripting required.
Where to go next
- Concepts: the five primitives, including Verifier
- Task authoring: task.toml, tests/, verifier contract
- Running benchmarks: Harvey LAB uses LLM-as-judge
- Python API reference: LLMJudgeRewardFunc and friends