
LLM-as-Judge Verifier

Use an LLM to evaluate agent outputs against a rubric instead of deterministic tests.

When to use LLM-as-judge

Use LLM-as-judge when the task output is subjective, open-ended, or hard to verify with unit tests: legal analysis, code review quality, document drafting, research summaries. For tasks with a clear right answer (e.g. “write fizzbuzz”), stick with deterministic test.sh verifiers.

BenchFlow’s LLM judge supports:
  • First-class [verifier] type: type = "llm-judge" in task.toml, no test.sh needed
  • Multi-criterion rubrics with binary, likert, and numeric scoring
  • Per-criterion weights for non-uniform importance
  • Dense reward events emitted per criterion during evaluation
  • Multi-provider routing across Anthropic, OpenAI, and Google models
  • Configurable aggregation (weighted mean, all-pass, any-pass, threshold)
The judge is a first-class verification method alongside the deterministic test.sh verifier. A task selects it with one line of config — the framework handles deliverable collection, prompting, provider routing, retries, and reward aggregation.

Quick start

0. Install the judge provider SDKs

The judge calls the Anthropic, OpenAI, and Google SDKs, which are not installed by default. Install the judge extra (you only need the SDK for the provider of the model you use, but the extra ships all three):
# in a checkout
uv sync --extra judge

# or as an installed tool
uv tool install --prerelease allow 'benchflow[judge]==0.6.0'

# or with pip
pip install 'benchflow[judge]'
If no provider SDK is installed, the judge cannot run: the verifier raises a verifier error (the rollout is marked errored) rather than silently recording a reward of 0.0 — a missing dependency is an environment failure, not a score.

1. Select the judge verifier in task.toml

[verifier]
type = "llm-judge"
timeout_sec = 600

[verifier.judge]
model = "claude-sonnet-4-6"          # judge model (provider routed from prefix)
rubric_path = "tests/rubric.toml"    # rubric file, relative to the task dir
input_dir = "/app"                   # sandbox dir holding agent deliverables

# API keys for the judge — resolved from the host environment / .env
[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
That’s the entire verifier. There is no tests/test.sh to write: the verifier downloads the agent’s deliverables from input_dir, scores them against the rubric, and writes reward.json itself.

2. Write a rubric.toml

Place it where rubric_path points (by convention tests/rubric.toml):
[[criterion]]
name = "accuracy"
description = "The response accurately addresses the question with correct facts"
type = "binary"
weight = 3.0

[[criterion]]
name = "clarity"
description = "The response is well-organized and easy to understand"
type = "likert"
points = 5
weight = 1.0

[scoring]
aggregation = "weighted_mean"
A Harvey LAB style rubric.json works too — set rubric_path = "tests/rubric.json":
{
  "title": "Task Title",
  "criteria": [
    {"id": "criterion-1", "title": "...", "match_criteria": "What constitutes a pass"}
  ]
}
That’s it. Run the task as usual. The reward is the output of the configured aggregation, a float in [0, 1]; with equally weighted binary criteria and the default weighted_mean, that is the proportion of criteria passed.

[verifier] reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| type | string | "test-script" | "test-script" (run tests/test.sh) or "llm-judge" |
| timeout_sec | float | 600 | Overall verifier timeout (seconds) |
| env | table | {} | Env vars for the verifier; judge API keys go here |

[verifier.judge] (used when type = "llm-judge")

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | "claude-sonnet-4-6" | Judge model; provider routed from prefix |
| rubric_path | string | "tests/rubric.toml" | Rubric file relative to the task dir (.toml or .json) |
| input_dir | string | "/app" | Sandbox dir whose contents are graded |
| input_type | string | "deliverables" | Only "deliverables" is supported; trajectory judging is not available at verify time |
| context | string | "" | Extra judge context (defaults to the task instruction) |

Library use — LLMJudgeRewardFunc

The judge is also a composable RewardFunc, usable directly or from a custom test.sh verifier:
import asyncio
from pathlib import Path
from benchflow.rewards import LLMJudgeRewardFunc

func = LLMJudgeRewardFunc(rubric_path=Path("rubric.toml"))
score = asyncio.run(func.score(Path("/app")))
print(f"Score: {score:.2f}")
Auto-discovery — if rubric.toml/rubric.json is in the rollout directory or its parent, it’s found automatically:
func = LLMJudgeRewardFunc()
score = asyncio.run(func.score(Path("/app")))

rubric.toml reference

[judge] section

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | "claude-sonnet-4-6" | LLM model for judging. Prefix with anthropic/, openai/, or google/ to force a provider |
| mode | string | "individual" | "individual" scores each criterion separately; "batched" is reserved for future use |
| files | string[] | [] | Default files to evaluate (fallback when a criterion doesn’t specify its own) |
| timeout | int | 120 | Timeout in seconds per judge call |

[[criterion]] entries

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | | Criterion identifier (falls back to first 40 chars of description) |
| description | string | required | What the judge should evaluate |
| type | string | "binary" | "binary", "likert", or "numeric" |
| weight | float | 1.0 | Relative importance in aggregation |
| points | int | 5 | Scale for likert type (1 to N) |
| min | float | 0.0 | Minimum for numeric type |
| max | float | 100.0 | Maximum for numeric type |
| files | string[] | [] | Specific files this criterion should evaluate |

[scoring] section

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| aggregation | string | "weighted_mean" | How to combine criterion scores |
| threshold | float | 0.7 | Pass threshold (only used with "threshold" aggregation) |

Score normalization

Each criterion type normalizes its raw score to [0, 1]:
| Type | Raw | Normalized |
| --- | --- | --- |
| binary | pass/fail | 1.0 or 0.0 |
| likert | 1–N integer | (raw - 1) / (points - 1) |
| numeric | min–max float | (raw - min) / (max - min), clamped to [0, 1] |
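The normalization rules above can be sketched in a few lines of Python. This is an illustrative re-implementation of the formulas in the table, not BenchFlow's own code; the function name normalize and its parameters are made up for this example.

```python
def normalize(kind: str, raw, *, points: int = 5, lo: float = 0.0, hi: float = 100.0) -> float:
    """Map a raw judge result onto [0, 1] following the normalization table."""
    if kind == "binary":
        # raw is the verdict string returned by the judge
        return 1.0 if raw == "pass" else 0.0
    if kind == "likert":
        # 1..points integer scale: 1 maps to 0.0 and `points` maps to 1.0
        return (raw - 1) / (points - 1)
    if kind == "numeric":
        # linear rescale, then clamp out-of-range values into [0, 1]
        scaled = (raw - lo) / (hi - lo)
        return min(1.0, max(0.0, scaled))
    raise ValueError(f"unknown criterion type: {kind!r}")


print(normalize("binary", "pass"))       # 1.0
print(normalize("likert", 3, points=5))  # 0.5
print(normalize("numeric", 75.0))        # 0.75
print(normalize("numeric", 120.0))       # 1.0 (clamped)
```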

Aggregation strategies

| Strategy | Behavior |
| --- | --- |
| weighted_mean | sum(score × weight) / sum(weight); continuous reward |
| all_pass | 1.0 if every criterion scores ≥ 0.5, else 0.0 |
| any_pass | 1.0 if any criterion scores ≥ 0.5, else 0.0 |
| threshold | 1.0 if weighted mean ≥ threshold, else 0.0 |
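To make the four strategies concrete, the sketch below re-implements them as described in the table. It is a stand-alone illustration, not the library's implementation; the function name aggregate and its signature are assumptions for this example.

```python
def aggregate(scores, weights, strategy="weighted_mean", threshold=0.7):
    """Combine normalized criterion scores (each in [0, 1]) into one reward."""
    weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    if strategy == "weighted_mean":
        return weighted_mean
    if strategy == "all_pass":
        return 1.0 if all(s >= 0.5 for s in scores) else 0.0
    if strategy == "any_pass":
        return 1.0 if any(s >= 0.5 for s in scores) else 0.0
    if strategy == "threshold":
        return 1.0 if weighted_mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation strategy: {strategy!r}")


# accuracy passed (weight 3.0), clarity scored 3 of 5 (weight 1.0)
scores, weights = [1.0, 0.5], [3.0, 1.0]
print(aggregate(scores, weights))                              # 0.875
print(aggregate(scores, weights, "all_pass"))                  # 1.0
print(aggregate(scores, weights, "threshold", threshold=0.9))  # 0.0
```

Note that all_pass and any_pass ignore weights entirely, while threshold applies the cutoff to the weighted mean rather than to individual criteria.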

Criterion types

Binary (pass/fail)

The judge decides whether the criterion is satisfied. The LLM returns {"verdict": "pass", "reasoning": "..."}.
[[criterion]]
name = "has-executive-summary"
description = "The document includes an executive summary in the first section"
type = "binary"

Likert (scaled)

The judge rates on a 1-to-N scale. The LLM returns {"score": 4, "reasoning": "..."}.
[[criterion]]
name = "writing-quality"
description = "Overall quality of prose — grammar, flow, and precision"
type = "likert"
points = 5
A score of 3 on a 5-point scale normalizes to (3-1)/(5-1) = 0.5.

Numeric (range)

The judge assigns a value within a continuous range. The LLM returns {"score": 75.0, "reasoning": "..."}.
[[criterion]]
name = "coverage-pct"
description = "Percentage of key topics from the source material covered in the summary"
type = "numeric"
min = 0.0
max = 100.0

Inline criteria (no TOML file)

For programmatic use or Harvey LAB-style criteria, pass criteria directly:
func = LLMJudgeRewardFunc(
    criteria=[
        {
            "description": "The response is factually accurate",
            "type": "binary",
            "weight": 2.0,
        },
        {
            "description": "The response addresses all parts of the question",
            "type": "binary",
            "weight": 1.0,
        },
    ],
    judge_model="claude-sonnet-4-6",
)
Harvey LAB match_criteria keys are also supported:
func = LLMJudgeRewardFunc(
    criteria=[
        {"match_criteria": "Identifies the key risk factors", "type": "binary"},
        {"match_criteria": "Provides supporting evidence", "type": "binary"},
    ],
)

Dense reward events

Each criterion emits a RewardEvent during evaluation, enabling per-criterion observability and training signal:
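import asyncio
from pathlib import Path
from benchflow.rewards import LLMJudgeRewardFunc

func = LLMJudgeRewardFunc(rubric_path=Path("rubric.toml"))
# score() is a coroutine, so run it with asyncio.run() outside an event loop
score = asyncio.run(func.score(Path("/app")))

for event in func.events:
    print(f"  {event.source}: {event.reward:.2f} (step {event.step})")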
Output:
  criterion:accuracy: 1.00 (step 0)
  criterion:clarity: 0.50 (step 1)
  criterion:completeness: 0.75 (step 2)
Events have type "dense", a reward in [0, 1], a source of "criterion:{name}", and a step index. Events are cleared between score() calls.

Multi-provider routing

The judge model string determines which provider SDK is used:
| Prefix | Provider | Auth env var |
| --- | --- | --- |
| claude-*, anthropic/* | Anthropic | ANTHROPIC_API_KEY |
| gpt-*, o1*, o3*, o4*, openai/* | OpenAI | OPENAI_API_KEY |
| gemini*, google/* | Google | GOOGLE_API_KEY or GEMINI_API_KEY |
If the primary provider fails, the judge falls back through the other providers with retries and exponential backoff. The provider SDKs ship in the judge extra (uv sync --extra judge). If none are installed, the judge raises a verifier error instead of recording a reward — see step 0.
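The prefix rules in the routing table can be approximated with a simple lookup. The sketch below is illustrative only: route_provider and the two dictionaries are hypothetical names, the model strings other than claude-sonnet-4-6 are placeholders, and raising ValueError for an unrecognized prefix is an assumption, since the behavior in that case is not described here.

```python
# Prefixes and env var names copied from the routing table above.
PROVIDER_PREFIXES = {
    "anthropic": ("claude-", "anthropic/"),
    "openai": ("gpt-", "o1", "o3", "o4", "openai/"),
    "google": ("gemini", "google/"),
}
AUTH_ENV_VARS = {
    "anthropic": ("ANTHROPIC_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
    "google": ("GOOGLE_API_KEY", "GEMINI_API_KEY"),
}


def route_provider(model: str) -> str:
    """Hypothetical helper: pick the provider whose prefix matches the model string."""
    for provider, prefixes in PROVIDER_PREFIXES.items():
        if model.startswith(prefixes):  # str.startswith accepts a tuple of prefixes
            return provider
    # Assumption: unknown prefixes are treated as an error in this sketch.
    raise ValueError(f"no provider matches model {model!r}")


print(route_provider("claude-sonnet-4-6"))   # anthropic
print(route_provider("openai/placeholder"))  # openai
print(AUTH_ENV_VARS[route_provider("gemini-placeholder")])
```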

Evaluation output

After scoring, an evaluation_details.json is written to the rollout directory:
{
  "score": 0.75,
  "n_passed": 2,
  "n_total": 3,
  "results": [
    {
      "id": "accuracy",
      "description": "The response accurately addresses the question",
      "score": 1.0,
      "weight": 3.0,
      "verdict": {"verdict": "pass", "reasoning": "..."}
    },
    {
      "id": "clarity",
      "description": "The response is well-organized",
      "score": 0.5,
      "weight": 1.0,
      "verdict": {"score": 3, "reasoning": "..."}
    }
  ]
}
The score field is the actual aggregated score from the configured strategy, not n_passed / n_total.
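One way to use this file after a run is to pull out the criteria that lost points. The snippet below is a minimal sketch that relies only on the id and score keys shown in the sample above; the helper name weakest_criteria and the inline sample data are made up for illustration.

```python
import json


def weakest_criteria(details: dict, cutoff: float = 1.0) -> list[tuple[str, float]]:
    """Return (id, score) pairs scoring below `cutoff`, lowest score first."""
    low = [(r["id"], r["score"]) for r in details["results"] if r["score"] < cutoff]
    return sorted(low, key=lambda pair: pair[1])


# Abbreviated stand-in for the file contents. In practice, load the real file:
#   details = json.loads(Path("evaluation_details.json").read_text())
details = json.loads("""
{
  "results": [
    {"id": "accuracy", "score": 1.0, "weight": 3.0},
    {"id": "clarity", "score": 0.5, "weight": 1.0}
  ]
}
""")

print(weakest_criteria(details))  # [('clarity', 0.5)]
```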

File discovery

The judge automatically discovers deliverable files in the rollout directory. Supported formats:
| Extension | Reader | Dependency |
| --- | --- | --- |
| .txt, .md, .json, .csv | Built-in | None |
| .docx | pandoc or python-docx | pandoc (preferred) or pip install python-docx |
| .xlsx | openpyxl | pip install openpyxl |
| .pptx | markitdown | pip install markitdown |
| .pdf | pdfplumber | pip install pdfplumber |
Files larger than 50 MB are skipped. Hidden files (starting with .) and internal metadata files (rubric.json) are excluded. File content is truncated at 15,000 characters per file when sent to the judge.

To scope a criterion to specific files:
[[criterion]]
name = "memo-quality"
description = "The legal memo follows IRAC structure"
files = ["memo.docx", "analysis.md"]
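The size, hidden-file, and truncation rules described above can be sketched as follows for the built-in text formats only. This is not BenchFlow's discovery code: the function name, the recursive walk, and reading "50 MB" as 50 × 1024 × 1024 bytes are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

MAX_FILE_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" interpreted as 50 MiB
MAX_CHARS = 15_000                 # per-file truncation limit from the text above
TEXT_SUFFIXES = {".txt", ".md", ".json", ".csv"}
EXCLUDED_NAMES = {"rubric.json"}   # internal metadata file


def collect_text_deliverables(root: Path) -> dict[str, str]:
    """Hypothetical helper: gather plain-text deliverables, truncated for the judge."""
    found = {}
    for path in sorted(root.rglob("*")):  # assumption: discovery is recursive
        if not path.is_file():
            continue
        if path.name.startswith(".") or path.name in EXCLUDED_NAMES:
            continue
        if path.suffix.lower() not in TEXT_SUFFIXES:
            continue
        if path.stat().st_size > MAX_FILE_BYTES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        found[str(path.relative_to(root))] = text[:MAX_CHARS]
    return found


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "analysis.md").write_text("x" * 20_000, encoding="utf-8")
    (demo / ".cache.txt").write_text("hidden", encoding="utf-8")
    (demo / "rubric.json").write_text("{}", encoding="utf-8")
    files = collect_text_deliverables(demo)

print(list(files), len(files["analysis.md"]))  # ['analysis.md'] 15000
```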

Python API

All rubric config types are importable from the top level:
from pathlib import Path

from benchflow import (
    Criterion,
    JudgeConfig,
    LLMJudgeRewardFunc,
    RubricConfig,
    ScoringConfig,
    load_rubric,        # dispatches on extension (.toml / .json)
    load_rubric_json,
    load_rubric_toml,
)

# Load and inspect a rubric (TOML or Harvey LAB style JSON)
rubric = load_rubric(Path("rubric.json"))
print(f"Model: {rubric.judge.model}")
print(f"Criteria: {len(rubric.criteria)}")
for c in rubric.criteria:
    print(f"  {c.id} ({c.type}, weight={c.weight})")

A legal document analysis task scored entirely by config — no test.sh:
# task.toml
[verifier]
type = "llm-judge"
timeout_sec = 600

[verifier.judge]
model = "claude-sonnet-4-6"
rubric_path = "tests/rubric.toml"
input_dir = "/app"

[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
# tests/rubric.toml
[judge]
files = ["analysis.md"]

[[criterion]]
name = "key-terms-identified"
description = "All material terms from the contract are identified and listed"
type = "binary"
weight = 2.0

[[criterion]]
name = "risk-assessment"
description = "Each identified risk includes severity rating and mitigation suggestion"
type = "likert"
points = 5
weight = 3.0

[[criterion]]
name = "completeness"
description = "Percentage of contract sections addressed in the analysis"
type = "numeric"
min = 0
max = 100
weight = 1.0

[scoring]
aggregation = "weighted_mean"
The framework downloads the agent’s deliverables from /app, grades each criterion, aggregates, and writes reward.json — no scripting required.
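To see how the weights in this rubric shape the final reward, here is a small worked calculation. The judge results (a pass, a 4 out of 5, and 80 out of 100) are invented for illustration; the weights, scales, and formulas come from the rubric and reference sections above.

```python
# (criterion name, weight, normalized score) with made-up judge results
graded = [
    ("key-terms-identified", 2.0, 1.0),            # binary: pass -> 1.0
    ("risk-assessment", 3.0, (4 - 1) / (5 - 1)),   # likert 4 of 5 -> 0.75
    ("completeness", 1.0, (80 - 0) / (100 - 0)),   # numeric 80 in [0, 100] -> 0.8
]

total_weight = sum(weight for _, weight, _ in graded)
reward = sum(weight * score for _, weight, score in graded) / total_weight

print(f"{reward:.3f}")  # 0.842
```

Because risk-assessment carries half of the total weight (3.0 of 6.0), a one-point change on its Likert scale moves the reward by 0.125, more than any other single criterion in this rubric.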

Where to go next