Eval Framework Blueprint: All Planes, All Methods
This is the implementation guide for the Eval Engineering executive insight — Eval Engineering: The Control System for Trustworthy AI (not yet published). That piece explains why evals are the control system. This series explains how to build one that covers every plane, every failure mode, every data source, and every execution mode — without pretending a single end-of-pipeline score is enough.
A robust eval framework scores every plane on static golden sets, synthetic cases, production replay, and user feedback — using automated checks, calibrated LLM-as-judge, and human review — in offline CI gates and online sampling — with incidents feeding the dataset continuously.
What you are building
A production eval framework is six connected capabilities:
- Plane-aware harness — replay production paths and score each stage, not just the final answer
- Data sources — golden, synthetic, replay, and feedback loops (see below)
- Scoring stack — automated checks, LLM-as-judge, human review on shared rubrics
- Offline gates — CI/CD blocks promotion on regression
- Online eval — sample, shadow, canary, drift detection after ship
- Improvement loop — failures promote to datasets within days
Eval modes & data sources
Scorers answer how you grade. Modes and sources answer when and from what.
| Offline (CI) | Online (production) | |
|---|---|---|
| Golden (static) | Primary gate dataset | Compare drift vs baseline |
| Synthetic | Edge + adversarial expansion | Usually offline only |
| Production replay | Nightly + pre-release | Shadow path on live copies |
| User feedback | Promoted to golden | Real-time triage queue |
| Decision | Ship or reject | Drift alert · canary rollback |
Deep dives: Golden Datasets · Synthetic Generation · Online & Dynamic Eval
Static vs replay (do not conflate)
| Static golden | Production replay | |
|---|---|---|
| Origin | Expert + sampled + synthetic | Full trace from prod |
| Strength | Stable regression baseline | Predicts real path behavior |
| Weakness | Ages without refresh | Needs trace infra + redaction |
| Gate use | Every PR | Model/index/tool changes + nightly |
The three scoring methods (use all three)
| Method | Best for | Never use it alone for |
|---|---|---|
| Automated checks | Schema, policy, latency, recall@k, tool args | Nuance, tone, partial correctness |
| LLM-as-judge | Grounding, completeness, reasoning at scale | Compliance sign-off, novel failures |
| Human review | Calibration, high-risk, audit samples | Every PR at enterprise scale |
Calibration rule: Humans anchor ground truth on a fixed sample. Judge tuned to κ ≥ 0.7. Automation encodes non-negotiables.
See Human Review and LLM-as-Judge.
Inside “automated checks” (specialized scorers)
“Automated” is not only if json.valid. Name these explicitly in your harness:
| Specialized scorer | Plane | vs LLM-as-judge |
|---|---|---|
| Policy / PDP replay | Action | Deterministic verdict match |
| Schema & type validation | Tool, Input | Hard fail |
| Retrieval metrics | Context | recall@k, scope violations — math on IDs |
| NLI / entailment | Reasoning | Claim ↔ chunk support (optional model) |
| Safety / injection classifiers | Input | Trained classifier, not rubric |
| Property / invariant tests | All | “Never call tool X without ALLOW” |
| Latency / cost budgets | System | SLO assertions |
Compliance and money movement never rely on judge alone — policy replay + human.
Comparative eval (pairwise)
| Mode | When |
|---|---|
| Pointwise | Default CI gate — absolute rubric thresholds |
| Pairwise | Pick better prompt/model when both pass pointwise |
| Shadow pairwise | New stack vs prod on same live inputs |
Documented in LLM-as-Judge.
Eight planes, eight eval surfaces
Each plane gets its own dataset slice, rubric dimensions, and gate.
| Plane | Eval focus | Deep dive |
|---|---|---|
| ① Input | Parsing, injection, intent, PII | Input |
| ② Data | Freshness, lineage, access | Data |
| ③ Context | Retrieval, scope, abstention | Context |
| ④ Reasoning | Faithfulness, conclusions, tools | Reasoning |
| ⑤ Tool | Selection, args, errors | Tool |
| ⑥ Memory | Isolation, TTL, leakage | Memory |
| ⑦ Action | Policy, authorization, audit | Action |
| ⑧ Outcome | Task success, clarity, trust | Outcome |
Golden case schema (every plane)
{
"id": "eval-2026-07-001",
"plane": "context",
"scenario": "representative | edge | adversarial | incident_replay",
"status": "draft | active",
"input": { "user_message": "...", "principal": "...", "session": "..." },
"expected": {
"must_retrieve": ["doc-id-1"],
"must_not_retrieve": ["doc-id-9"],
"abstain": false
},
"rubric": ["grounding", "scope", "ranking"],
"failure_class": null,
"source": "production_replay | synthetic | manual | user_feedback",
"risk_tier": "low | medium | high"
}
Only status: active cases gate releases. High-risk → human regardless of judge score.
Per-plane eval recipe
- Define failure taxonomy for the plane
- Instrument traces on every production request
- Build dataset slice (representative + edge + adversarial per use case)
- Automated checks + specialized scorers where applicable
- Judge rubric (3–5 dimensions, anchored 1–5)
- Calibrate judge vs human (κ ≥ 0.7)
- Set offline gate thresholds per plane
- Online sample + drift alerts for same plane metrics
- Incident → case within one week
Release gate matrix
| Change type | Planes to re-run | Offline gate | Online follow-up |
|---|---|---|---|
| Model swap | Context, Reasoning, Outcome | Golden + replay; no judge drift | Shadow 24h before full cutover |
| Prompt change | Reasoning, Outcome | Rubric ≥ baseline | Sample judge scores 48h |
| Retrieval / index | Data, Context | recall@k; scope = 0 on adversarial | Retrieval metric dashboard |
| Tool / ACL | Tool, Action | Schema 100%; PDP replay 100% | Policy violation alert |
| Memory store | Memory, Reasoning | Leakage = 0 | Session isolation monitor |
Ownership
| Role | Owns |
|---|---|
| Product / domain | Rubrics, representative cases, business thresholds |
| AI platform | Harness, judge pipelines, score store, CI + online pipelines |
| Governance | Policy cases, audit sampling, high-risk human queue |
| SRE / reliability | Replay infra, drift alerts, incident-to-case SLA |
Further reading (external)
Third-party articles, guides, and tool docs — curated by topic and mapped to each page in this series. By other practitioners, not this site.
Series index
Foundations
- Golden Datasets — curated case libraries
- Synthetic Generation — scale edge & adversarial coverage
- Online & Dynamic Eval — post-ship sampling, shadow, canary, drift
- Human Review — manual eval & calibration
- LLM-as-Judge — scaled scoring & pairwise
Plane playbooks
Reference
- Eval Engineering (executive insight, coming soon) · G.A.I.N Evaluation