Evals
Evaluate LLM outputs with scored metrics and historical tracking.
Contents
- What is an Eval?
- Quick Start
- How It Works
- EvalSuite
- EvalCase
- Evaluators
- Fixtures
- ModelLabel
- Judge
- TaskResult (SUT Usage Tracking)
- Usage Display
- Evaluator Errors
- Score Namespacing
- Multi-Model Sessions
- CLI
- Output
- History
- Progress Output
What is an Eval?
A test produces pass/fail. An eval produces scores - numeric values (0.0–1.0) that measure output quality. Scores are aggregated across cases, tracked over time, and compared between runs.
ProTest evals use the same infrastructure as tests: fixtures, DI, parallelism, tags. An eval is a test that returns a value, scored by evaluators.
First-run expectations: don't expect 100% green
Unlike tests, evals are expected to have failing cases - that's
the signal you're measuring. protest eval still exits 1 when any
case fails a Verdict (so CI surfaces regressions), but the
failures are not bugs, they're data points. The aggregate-stats
table is designed for this - you watch the metrics drift over time.
Every run is recorded to .protest/history.jsonl so the trend
accumulates from day one (browsing and run-comparison tooling lands
in a future release).
Quick Start
# evals/session.py
from typing import Annotated
from protest import ForEach, From, ProTestSession
from protest.evals import EvalCase, ModelLabel, evaluator
from protest.evals.evaluators import contains_keywords
from protest.evals import EvalSuite
cases = ForEach([
EvalCase(inputs="Who is Marie?", expected="Marie, Resistance", name="lookup"),
EvalCase(inputs="What is 2+2?", expected="4", name="math"),
])
session = ProTestSession()
chatbot_suite = EvalSuite("chatbot", model=ModelLabel(name="gpt-4o-mini"))
session.add_suite(chatbot_suite)
@chatbot_suite.eval(evaluators=[contains_keywords(keywords=["Marie"])])
async def chatbot(case: Annotated[EvalCase, From(cases)]) -> str:
return await my_agent(case.inputs)
How It Works
@suite.eval() wraps a function to run evaluators on its return value:
- Your function receives case data via
ForEach/From(same as parameterized tests) - It returns the output (string, object, anything)
- ProTest passes the output to evaluators → scores
- Bool verdicts determine pass/fail
- Aggregated stats appear in the terminal
The rest of the pipeline - fixtures, DI, parallelism, reporters - works identically to tests.
EvalSuite
EvalSuite groups eval cases. It's the eval equivalent of ProTestSuite - it forces kind=EVAL and carries model/judge configuration. Model and judge are suite-level config: each suite declares which model produced its results and which judge scores them.
from protest.evals import EvalSuite
from protest.evals import ModelLabel
chatbot_suite = EvalSuite("chatbot", model=ModelLabel(name="gpt-4o-mini"))
session.add_suite(chatbot_suite)
@chatbot_suite.eval(evaluators=[my_scorer])
async def chatbot(case: Annotated[EvalCase, From(cases)]) -> str:
return await my_agent(case.inputs)
EvalCase
Typed dataclass for eval case data. All eval cases must use EvalCase - plain dicts are not supported.
from protest.evals import EvalCase
cases = ForEach([
EvalCase(inputs="What is 2+2?", expected="4", name="math"),
EvalCase(inputs="Who is Napoleon?", expected="emperor, France", name="history"),
])
| Field | Type | Description |
|---|---|---|
inputs |
Any |
Input to your task function |
expected |
Any |
Expected output (passed to evaluators as ctx.expected_output) |
name |
str |
Case identifier (used in test IDs and history) |
evaluators |
list |
Per-case evaluators (added to suite-level ones) |
tags |
list[str] |
First-class tags - flow to protest eval --tag … (see below) |
metadata |
dict |
Arbitrary metadata, opaque to the framework |
Why EvalCase and not a dict?
The runtime reads case data via attribute access (case.expected, case.metadata, case.evaluators), not by string key. A plain dict would compile fine but blow up at runtime, and you'd lose the IDE refactor/Ctrl+Click affordances. Making EvalCase a typed dataclass surfaces typos at import time and keeps the contract one obvious place - same trade-off as Annotated[T, Use(fn)] over pytest's name-based fixture lookup.
Per-case tags
EvalCase.tags is a first-class field. Tags flow through the test collector and become first-class on the resulting TestItem, so protest eval --tag slow works out of the box. Use metadata for any other free-form annotation the framework should ignore.
EvalCase(
inputs="Long doc to summarize…",
expected="…",
name="long_doc_case",
tags=["slow", "summarization"],
metadata={"source_dataset": "v3"}, # opaque to the framework
)
Evaluators
An evaluator is a function decorated with @evaluator that receives an EvalContext and returns a verdict. The decorator is mandatory: passing a plain function in evaluators=[...] raises TypeError at registration. The wrapping is what gives the evaluator its identity (used for hashing, history, reporting) and a typed run(ctx) method - there's no implicit conversion.
Every eval case must have at least one evaluator at runtime - suite-level
(@suite.eval(evaluators=[...])) or per-case (EvalCase(evaluators=[...])).
Zero evaluators raises NoEvaluatorsError: passed is all(verdicts) and
all([]) is True, so a mis-wired eval would otherwise pass silently forever.
If your eval task returns a non-string output
The built-in evaluators (contains_keywords, not_empty, max_length,
matches_regex, json_valid, word_overlap) assume ctx.output is a
string and call methods like .lower() on it. They drop in cleanly for
summarization, chatbot replies, single-string completions, etc.
For a structured output (dict, dataclass, pydantic.BaseModel, list
of objects, …), the path is to write custom evaluators that
pick the field they care about. A typical pattern:
@evaluator
def category_matches_expected(ctx: EvalContext) -> CategoryMatch:
expected = (ctx.expected_output or {}).get("category")
actual = ctx.output.get("category")
return CategoryMatch(category_matches=(expected == actual), ...)
See Structured Evaluator below and EvalContext for the data
you can read off ctx.
Return Types
Evaluators return bool (simple verdict) or a dataclass (structured result). In dataclasses, annotate fields to tell the framework what each one is:
| Annotation | Role |
|---|---|
Annotated[bool, Verdict] |
Verdict - pass/fail (all(verdicts)) |
Annotated[float, Metric] |
Metric - aggregated in stats (mean/p50/p95) |
Annotated[int, Metric] |
Metric - converted to float |
Annotated[str, Reason] |
Reason - displayed on failure, stored in history |
Unannotated fields are ignored by the runner - free metadata.
The return annotation is required and must be bool or a dataclass -
@evaluator raises TypeError at decoration time otherwise (missing
annotation, -> float, -> X | None, …). The annotation is the score
contract: it determines the score names recorded in history and the keys
of skipped placeholders, so it must also resolve at runtime - import the
return type for real (not only under TYPE_CHECKING) and define it at
module level.
Tracking-Only Evaluators
A dataclass with Metric fields but no Verdict is tracking-only. The case always passes for this evaluator - it measures without gating.
@dataclass
class OverlapMetrics:
overlap: Annotated[float, Metric]
@evaluator
def word_overlap(ctx: EvalContext) -> OverlapMetrics:
...
In the terminal, tracking evaluators show with · instead of ✓/✗:
✓ chatbot[lookup] (1.2s) keyword_check.recall=0.95 keyword_check.all_present=✓
· chatbot[lookup] word_overlap.overlap=0.80
Simple Evaluator
Structured Evaluator
from dataclasses import dataclass
from typing import Annotated
from protest.evals import Metric, Verdict, Reason
@dataclass
class KeywordScores:
recall: Annotated[float, Metric]
all_present: Annotated[bool, Verdict]
detail: Annotated[str, Reason] = ""
@evaluator
def keyword_check(ctx: EvalContext, keywords: list[str], min_recall: float = 0.5) -> KeywordScores:
found = [k for k in keywords if k.lower() in ctx.output.lower()]
recall = len(found) / len(keywords)
return KeywordScores(
recall=recall,
all_present=recall >= min_recall,
detail=f"found {len(found)}/{len(keywords)}",
)
Scores surface namespaced by the evaluator's name (keyword_check.recall,
keyword_check.all_present), so field names only need to be unique within
their own dataclass.
The threshold (min_recall) is a parameter of the evaluator, not a framework concept. The evaluator decides the verdict.
Async (LLM Judge)
Use ctx.judge() for structured LLM evaluation (requires judge= on EvalSuite):
@dataclass
class JudgeResult:
accuracy: Annotated[float, Metric]
accurate_enough: Annotated[bool, Verdict]
reason: Annotated[str, Reason] = ""
@evaluator
async def llm_judge(ctx: EvalContext, rubric: str = "", min_score: float = 0.7) -> JudgeResult:
return await ctx.judge(
f"Evaluate this response on a 0-1 scale.\n\n"
f"Response: {ctx.output}\nCriteria: {rubric}",
JudgeResult,
)
The judge handles structured output - no text parsing needed. See Judge for setup.
Per-Case Thresholds
Different thresholds per case = different evaluator bindings:
EvalCase(name="easy_lookup", inputs="easy lookup", evaluators=[keyword_check(keywords=["paris"], min_recall=0.9)]),
EvalCase(name="hard_causal", inputs="hard causal", evaluators=[keyword_check(keywords=["paris"], min_recall=0.3)]),
ShortCircuit
Skip expensive evaluators (LLM judges) when cheap ones already fail:
from protest.evals import ShortCircuit
evaluators=[
not_empty, # always runs
ShortCircuit([
contains_keywords(keywords=["paris"], min_recall=0.5), # 0ms - if fail → stop
llm_judge(rubric="factual accuracy"), # 3s - skipped if above fails
]),
]
ShortCircuit is a group of ordered evaluators. The first Verdict=False stops the group. Evaluators outside the ShortCircuit always run.
Execution order - evaluators=[a, ShortCircuit([b, c]), d]:
a ← always runs
├─ pass → continue
└─ fail → continue (a is outside the group, doesn't gate b/c)
[ShortCircuit group ──────────────────────────────────┐
b ← always runs (first in group) │
├─ pass → c │
└─ fail → c skipped (Verdict=False stops group) │
c ← runs only if b passed │
└─────────────────────────────────────────────────────┘
d ← always runs (outside the group)
The list evaluators=[…] is sequential at the top level; a ShortCircuit is just a sub-group that may stop early. Use it to gate expensive evaluators (LLM judges) behind cheap ones (keyword/regex checks).
Using Evaluators
# No params → use directly
evaluators=[not_empty]
# With params → call to bind
evaluators=[contains_keywords(keywords=["python", "async"], min_recall=0.75)]
# Per-case evaluators (added to suite-level)
EvalCase(name="factual_accuracy_case", inputs="...", evaluators=[llm_judge(rubric="Check factual accuracy")])
EvalContext
| Field / Method | Type | Description |
|---|---|---|
name |
str |
Case name |
inputs |
I |
Case inputs |
output |
O |
Task return value |
expected_output |
O \| None |
From EvalCase.expected |
metadata |
Any |
From EvalCase.metadata |
duration |
float |
Task execution time (seconds) |
judge(prompt, type) |
async |
Call the configured LLM judge (see Judge) |
judge_call_count |
int |
Number of judge calls made |
Built-in Evaluators
| Evaluator | Params | Returns |
|---|---|---|
contains_keywords |
keywords, min_recall=1.0 |
recall: float, all_present: bool |
contains_expected |
case_sensitive=False |
bool |
does_not_contain |
forbidden |
ok: bool |
not_empty |
- | bool |
max_length |
max_chars=500 |
conciseness: float, within_limit: bool |
min_length |
min_chars=1 |
bool |
matches_regex |
pattern |
bool |
json_valid |
required_keys=[] |
valid: bool, has_required_keys: bool |
word_overlap |
- | overlap: float (tracking-only) |
Dataclass fields surface as <evaluator>.<field> in scores - e.g.
contains_keywords.recall, json_valid.valid (see
Score Namespacing).
contains_expected and word_overlap require expected on the case and
raise when it is None: with nothing to compare against, a vacuous pass
(or a fake overlap=1.0) would make a case-wiring mistake look like a
healthy run. Attach them only to cases that carry expected, per-case if
the case set is mixed.
Fixtures
Evals use the same fixture system as tests. Expensive setup (database, pipeline, graph) runs once and is shared across all cases.
@fixture()
async def pipeline():
driver = await build_pipeline() # 3 minutes, once
yield driver
await driver.close()
session.bind(pipeline)
pipeline_suite = EvalSuite("pipeline")
session.add_suite(pipeline_suite)
@pipeline_suite.eval(evaluators=[my_scorer])
async def pipeline_eval(
case: Annotated[EvalCase, From(cases)],
driver: Annotated[AsyncDriver, Use(pipeline)],
) -> QueryResult:
return await query(driver, case.inputs)
ModelLabel
ModelLabel is a passive label that ProTest stores in the history alongside each run, so you can attribute results to a specific model and compare runs side-by-side. It does not route requests, set a temperature, pick a provider, or otherwise touch any LLM - the actual model wiring happens inside your task function (or the agent / SDK it calls).
Judge
A Judge is a protocol for LLM-as-judge evaluators. ProTest owns the interface - you plug in your LLM library.
The Protocol
class Judge(Protocol):
async def judge(self, prompt: str, output_type: type[T]) -> JudgeResponse[T]: ...
Minimal contract: takes a prompt and a return type, returns a JudgeResponse wrapping the typed result with optional usage stats. All configuration (model, temperature, system prompt, max_tokens) lives in your implementation's constructor, not in the protocol.
Writing a Judge
The judge() method returns a JudgeResponse[T] that wraps the output with optional usage stats:
from pydantic_ai import Agent
from protest.evals import JudgeResponse
class PydanticAIJudge:
name = "gpt-4o-mini" # used in history
provider = "openai" # optional, used in history
def __init__(self, model: str = "gpt-4o-mini", temperature: float = 0):
self.model = model
self.temperature = temperature
async def judge(self, prompt: str, output_type: type[T]) -> JudgeResponse[T]:
agent = Agent(self.model, output_type=output_type)
result = await agent.run(prompt)
usage = result.usage()
return JudgeResponse(
output=result.output,
input_tokens=usage.request_tokens,
output_tokens=usage.response_tokens,
cost=usage.request_tokens * 0.15/1e6 + usage.response_tokens * 0.60/1e6,
)
Tokens and cost are optional - omit them if your provider doesn't expose usage data:
Configuring the Judge
suite = EvalSuite(
"pipeline",
model=ModelLabel(name="qwen-2.5"),
judge=PydanticAIJudge(model="gpt-4o-mini", temperature=0),
)
JudgeInfo (name, provider) is derived automatically from the instance for history tracking.
Using the Judge in Evaluators
Evaluators access the judge via ctx.judge():
@dataclass
class JudgeResult:
accurate: Annotated[bool, Verdict]
reason: Annotated[str, Reason] = ""
@evaluator
async def llm_rubric(ctx: EvalContext, rubric: str = "") -> JudgeResult:
return await ctx.judge(
f"Evaluate this response.\n\nResponse: {ctx.output}\nCriteria: {rubric}",
JudgeResult, # structured output - no text parsing
)
For simple verdicts, use bool or str as output_type:
@evaluator
async def simple_judge(ctx: EvalContext) -> bool:
return await ctx.judge(f"Is this a valid answer? {ctx.output}", bool)
No Judge Configured
If an evaluator calls ctx.judge() and no judge was passed to EvalSuite, a RuntimeError is raised. This is treated as an infrastructure error (not a test failure), same as a fixture crash.
Usage Tracking
Each call to ctx.judge() is counted. Tokens and cost from JudgeResponse are accumulated per case and flow to EvalPayload:
| Field | Description |
|---|---|
judge_call_count |
Number of judge calls |
judge_input_tokens |
Total input tokens |
judge_output_tokens |
Total output tokens |
judge_cost |
Total cost (user-computed) |
These are available in history, letting you track LLM usage across runs.
TaskResult (SUT Usage Tracking)
If your eval task calls an LLM, you can report usage by returning TaskResult instead of a plain value:
from protest.evals import TaskResult
@chatbot_suite.eval(evaluators=[my_scorer])
async def chatbot(case: Annotated[EvalCase, From(cases)]) -> TaskResult[str]:
result = await agent.run(case.inputs)
usage = result.usage()
return TaskResult(
output=result.output,
input_tokens=usage.request_tokens,
output_tokens=usage.response_tokens,
cost=usage.request_tokens * 0.10/1e6 + usage.response_tokens * 0.30/1e6,
)
This is opt-in - returning a plain str still works. ProTest unwraps TaskResult transparently: evaluators see the plain output, usage stats flow to the reporter and history.
Usage Display
When task or judge usage data is available, ProTest shows a summary after the eval stats:
Lines only appear when there is data. No TaskResult = no Task line. No judge configured = no Judge line.
Evaluator Errors
If an evaluator raises an exception (e.g. LLM judge timeout), the case is marked as error (not fail). The stack trace appears in the output.
Tip: For non-deterministic evaluators (LLM judges), catch exceptions in the evaluator and return a verdict indicating failure rather than letting them propagate.
Score Namespacing
Each Verdict / Metric / Reason field from a dataclass evaluator becomes
a score named <evaluator>.<field> - e.g. keyword_check.recall.
A bool evaluator's single verdict keeps the evaluator's bare name
(not_empty). These names are the keys in the per-case score dict, the
history file, and the markdown artifacts.
Because names are namespaced per evaluator, two evaluators on the same case
can both declare plain ok / detail fields without clashing:
@dataclass
class SummaryShape:
well_formed: Annotated[bool, Verdict]
detail: Annotated[str, Reason] = "" # → summary_shape.detail
@dataclass
class CategoryMatch:
matches: Annotated[bool, Verdict]
detail: Annotated[str, Reason] = "" # → category_match.detail
Namespacing is unconditional: a score's full name depends only on its own evaluator, never on which other evaluators are wired in alongside it - so the identifiers you grep for in history and scripts stay stable when evaluators are added or removed.
A collision is still possible in one case: the same evaluator name attached
twice to one case (the same @evaluator function rebound with different
kwargs, or two functions sharing a name). ProTest raises
ScoreNameCollisionError at runtime so the duplicate is loud instead of
silently overwriting the other's scores. Wrap each binding in its own named
@evaluator function to give them distinct names.
Multi-Model Sessions
Track which model produced each eval suite's results. Each EvalSuite can have its own model:
session = ProTestSession()
pipeline_suite = EvalSuite("pipeline", model=ModelLabel(name="qwen-2.5"))
chatbot_suite = EvalSuite("chatbot", model=ModelLabel(name="mistral-7b"))
session.add_suite(pipeline_suite)
session.add_suite(chatbot_suite)
@pipeline_suite.eval(evaluators=[...])
async def pipeline_eval(case, driver) -> str: ...
@chatbot_suite.eval(evaluators=[...])
async def chatbot_eval(case, deps) -> str: ...
Each run records the model per suite in .protest/history.jsonl, so a
mixed-model session (e.g. pipeline on qwen-2.5, chatbot on
mistral-7b) keeps each suite's model alongside its scores.
CLI
# Run evals
protest eval evals.session:session
# Parallelism
protest eval evals.session:session -n 4
# Filter by tag
protest eval evals.session:session --tag chatbot
# Filter by name
protest eval evals.session:session -k "lookup"
# Re-run failures only
protest eval evals.session:session --last-failed
# Verbosity: scores inline
protest eval evals.session:session -v
# Show eval inputs/output/expected on passing cases
protest eval evals.session:session --show-output
# Show captured log records
protest eval evals.session:session --show-logs
protest eval evals.session:session --show-logs=DEBUG
Flags are independent and combinable: -v --show-output --show-logs.
Note: Failed eval cases always show inputs/output/expected - no flag needed.
Output
Default
✓ chatbot[lookup] (1.2s) keyword_check.recall=1.00 keyword_check.all_present=✓
✗ chatbot[math]: keyword_check.all_present=False
│ inputs: What is 2+2?
│ output: The answer is 4.
│ expected: 4
│ keyword_check.detail: found 0/1
Eval: chatbot (2 cases)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┓
┃ Score ┃ mean ┃ p50 ┃ p5 ┃ p95 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━┩
│ keyword_check.recall │ 0.50 │ 0.50 │ 0.00 │ 1.00 │
└──────────────────────────────┴──────┴──────┴──────┴──────┘
Passed: 1/2 (50.0%)
Results: .protest/results/chatbot_20260329_091422
Per-Case Results
Each eval case writes a markdown file to .protest/results/<suite>_<timestamp>/:
History
Every eval run is persisted as one JSONL entry in .protest/history.jsonl,
recorded from the first run so trends accumulate over time. Each entry holds
per-suite pass rates, per-case verdicts and scores, the model per suite, and
git metadata. Tooling to browse, trend and compare runs lands in a future
release; the file is a stable, schema-versioned format you can also read
yourself in the meantime.
Integrity Hashes
Each case in history carries two hashes:
case_hash- hash of inputs + expected output. Changes when the test data changes.eval_hash- hash of evaluators. Changes when the scoring criteria change.
These hashes are what lets a later comparison distinguish a real regression
from a definition change: when a case's eval_hash differs between two runs,
the score moved because the scoring criteria changed ("scoring modified"), not
because the system under test regressed.
Progress Output
For long-running fixtures, use console.print to show progress without polluting test capture:
from protest import console
@fixture()
async def pipeline():
for i, scene in enumerate(scenes):
console.print(f"[cyan]pipeline:[/] importing {scene.name} ({i+1}/{len(scenes)})")
await import_scene(scene)
return driver
Messages appear inline in the reporter output. Rich markup is supported (stripped for ASCII).