
Eval Framework Blueprint: All Planes, All Methods

This is the implementation guide for the Eval Engineering executive insight — Eval Engineering: The Control System for Trustworthy AI (not yet published). That piece explains why evals are the control system. This series explains how to build one that covers every plane, every failure mode, every data source, and every execution mode — without pretending a single end-of-pipeline score is enough.

THE CLAIM

A robust eval framework scores every plane on static golden sets, synthetic cases, production replay, and user feedback — using automated checks, calibrated LLM-as-judge, and human review — in offline CI gates and online sampling — with incidents feeding the dataset continuously.

What you are building

A production eval framework is six connected capabilities:

  1. Plane-aware harness — replay production paths and score each stage, not just the final answer
  2. Data sources — golden, synthetic, replay, and feedback loops (see below)
  3. Scoring stack — automated checks, LLM-as-judge, human review on shared rubrics
  4. Offline gates — CI/CD blocks promotion on regression
  5. Online eval — sample, shadow, canary, drift detection after ship
  6. Improvement loop — failures promote to datasets within days

Eval modes & data sources

Scorers answer how you grade. Modes and sources answer when and from what.

|  | Offline (CI) | Online (production) |
| --- | --- | --- |
| Golden (static) | Primary gate dataset | Compare drift vs baseline |
| Synthetic | Edge + adversarial expansion | Usually offline only |
| Production replay | Nightly + pre-release | Shadow path on live copies |
| User feedback | Promoted to golden | Real-time triage queue |
| Decision | Ship or reject | Drift alert · canary rollback |

Deep dives: Golden Datasets · Synthetic Generation · Online & Dynamic Eval
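The online "compare drift vs baseline" cell can be sketched as a simple comparison of mean scores. This is a minimal illustration, not the series' method: the function name `drift_alert`, the 1-to-5 score scale, and the `max_drop` tolerance of 0.2 are all assumptions chosen for the example.

```python
from statistics import mean

def drift_alert(baseline_scores, online_scores, max_drop=0.2):
    """Return True when the online sample's mean score falls more than
    max_drop below the golden baseline mean (max_drop is an assumed tolerance)."""
    return mean(baseline_scores) - mean(online_scores) > max_drop

# Hypothetical judge scores on a 1-5 scale.
baseline = [4.4, 4.2, 4.5, 4.3]   # offline golden run
online = [4.0, 3.9, 4.2, 4.1]     # sampled production traffic
drift_detected = drift_alert(baseline, online)

# The decision row of the table: an alert is a candidate for canary rollback.
decision = "alert_and_consider_rollback" if drift_detected else "keep_serving"
```

A real pipeline would likely compare distributions per plane and per metric rather than a single mean, but the shape of the decision stays the same.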

Static vs replay (do not conflate)

|  | Static golden | Production replay |
| --- | --- | --- |
| Origin | Expert + sampled + synthetic | Full trace from prod |
| Strength | Stable regression baseline | Predicts real path behavior |
| Weakness | Ages without refresh | Needs trace infra + redaction |
| Gate use | Every PR | Model/index/tool changes + nightly |

The three scoring methods (use all three)

| Method | Best for | Never use it alone for |
| --- | --- | --- |
| Automated checks | Schema, policy, latency, recall@k, tool args | Nuance, tone, partial correctness |
| LLM-as-judge | Grounding, completeness, reasoning at scale | Compliance sign-off, novel failures |
| Human review | Calibration, high-risk, audit samples | Every PR at enterprise scale |

Calibration rule: Humans anchor ground truth on a fixed sample. The judge is tuned until its agreement with those human labels reaches κ ≥ 0.7. Automation encodes the non-negotiables.

See Human Review and LLM-as-Judge.
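The calibration rule can be checked with a few lines of code. The sketch below assumes κ refers to Cohen's kappa for two raters (one human, one judge) over the same fixed sample; the function name and the pass/fail labels are illustrative, not part of this series.

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels on a fixed calibration sample of ten cases.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "pass", "fail", "pass"]

kappa = cohens_kappa(human_labels, judge_labels)
judge_is_calibrated = kappa >= 0.7  # threshold taken from the calibration rule
```

Raw agreement here is 90%, but kappa is lower because some of that agreement is expected by chance. That gap is the reason to gate on κ rather than on percent agreement.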

Inside “automated checks” (specialized scorers)

“Automated” means more than a check like if json.valid. Name these scorers explicitly in your harness:

| Specialized scorer | Plane | vs LLM-as-judge |
| --- | --- | --- |
| Policy / PDP replay | Action | Deterministic verdict match |
| Schema & type validation | Tool, Input | Hard fail |
| Retrieval metrics | Context | recall@k, scope violations (math on IDs) |
| NLI / entailment | Reasoning | Claim ↔ chunk support (optional model) |
| Safety / injection classifiers | Input | Trained classifier, not rubric |
| Property / invariant tests | All | “Never call tool X without ALLOW” |
| Latency / cost budgets | System | SLO assertions |

Compliance and money movement never rely on the judge alone: they require policy replay plus human review.
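Two of the specialized scorers above are plain deterministic code. The sketch below shows retrieval metrics as set math on document IDs and one invariant test. The trace event shape (`type`, `tool`, `verdict`) and the rule that each tool call needs its own fresh ALLOW are assumptions made for illustration.

```python
def recall_at_k(retrieved_ids, must_retrieve, k):
    """Fraction of required document IDs found in the top-k results."""
    required = set(must_retrieve)
    if not required:
        return 1.0  # nothing required, so nothing can be missed
    return len(set(retrieved_ids[:k]) & required) / len(required)

def scope_violations(retrieved_ids, must_not_retrieve):
    """Forbidden IDs that were retrieved, in retrieval order."""
    forbidden = set(must_not_retrieve)
    return [doc_id for doc_id in retrieved_ids if doc_id in forbidden]

def unauthorized_tool_calls(events, tool_name):
    """Invariant: never call tool_name without a preceding ALLOW decision."""
    allowed, bad = False, []
    for event in events:
        if event["tool"] != tool_name:
            continue
        if event["type"] == "policy_decision":
            allowed = event["verdict"] == "ALLOW"
        elif event["type"] == "tool_call":
            if not allowed:
                bad.append(event)
            allowed = False  # assumption: each call needs its own ALLOW
    return bad

retrieved = ["doc-id-4", "doc-id-1", "doc-id-9", "doc-id-2"]
recall = recall_at_k(retrieved, must_retrieve=["doc-id-1"], k=3)
violations = scope_violations(retrieved, must_not_retrieve=["doc-id-9"])

trace_events = [
    {"type": "policy_decision", "tool": "refund", "verdict": "ALLOW"},
    {"type": "tool_call", "tool": "refund"},
    {"type": "tool_call", "tool": "refund"},  # second call has no fresh ALLOW
]
bad_calls = unauthorized_tool_calls(trace_events, "refund")
```

Neither check involves a model, so the results are reproducible and can serve as hard-fail gates.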

Comparative eval (pairwise)

| Mode | When |
| --- | --- |
| Pointwise | Default CI gate: absolute rubric thresholds |
| Pairwise | Pick better prompt/model when both pass pointwise |
| Shadow pairwise | New stack vs prod on same live inputs |

Documented in LLM-as-Judge.
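The relationship between pointwise and pairwise modes can be sketched as a two-stage decision: pointwise thresholds act as the gate, and pairwise verdicts only break the tie between candidates that both pass. The function name, the verdict labels, and the simple majority rule are assumptions for illustration.

```python
def choose_candidate(pointwise_a, pointwise_b, pairwise_verdicts, threshold):
    """Return "a", "b", "tie", or "none".

    pointwise_* map rubric dimension -> score.
    pairwise_verdicts is a list of "a" / "b" / "tie" judgments on shared inputs.
    """
    a_passes = all(score >= threshold for score in pointwise_a.values())
    b_passes = all(score >= threshold for score in pointwise_b.values())
    if not a_passes and not b_passes:
        return "none"  # neither candidate clears the absolute gate
    if a_passes != b_passes:
        return "a" if a_passes else "b"  # only one candidate is shippable
    # Both pass pointwise: use pairwise preferences to pick the better one.
    wins_a = pairwise_verdicts.count("a")
    wins_b = pairwise_verdicts.count("b")
    if wins_a == wins_b:
        return "tie"
    return "a" if wins_a > wins_b else "b"

# Hypothetical rubric scores on a 1-5 scale and five pairwise judgments.
prompt_a = {"grounding": 4.2, "completeness": 4.0}
prompt_b = {"grounding": 4.1, "completeness": 4.3}
winner = choose_candidate(prompt_a, prompt_b, ["a", "b", "b", "tie", "b"], threshold=4.0)
```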

Eight planes, eight eval surfaces

Each plane gets its own dataset slice, rubric dimensions, and gate.

| Plane | Eval focus | Deep dive |
| --- | --- | --- |
| ① Input | Parsing, injection, intent, PII | Input |
| ② Data | Freshness, lineage, access | Data |
| ③ Context | Retrieval, scope, abstention | Context |
| ④ Reasoning | Faithfulness, conclusions, tools | Reasoning |
| ⑤ Tool | Selection, args, errors | Tool |
| ⑥ Memory | Isolation, TTL, leakage | Memory |
| ⑦ Action | Policy, authorization, audit | Action |
| ⑧ Outcome | Task success, clarity, trust | Outcome |

Golden case schema (every plane)

    {
      "id": "eval-2026-07-001",
      "plane": "context",
      "scenario": "representative | edge | adversarial | incident_replay",
      "status": "draft | active",
      "input": { "user_message": "...", "principal": "...", "session": "..." },
      "expected": {
        "must_retrieve": ["doc-id-1"],
        "must_not_retrieve": ["doc-id-9"],
        "abstain": false
      },
      "rubric": ["grounding", "scope", "ranking"],
      "failure_class": null,
      "source": "production_replay | synthetic | manual | user_feedback",
      "risk_tier": "low | medium | high"
    }

Only cases with status: active gate releases. High-risk cases go to human review regardless of judge score.
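A harness can enforce the schema and the two rules above before a case enters a gate run. The sketch below is one possible shape: it treats each "a | b | c" value in the schema as an enumeration, and the helper names, the required-field list, and the sample input values are assumptions.

```python
ALLOWED = {
    "scenario": {"representative", "edge", "adversarial", "incident_replay"},
    "status": {"draft", "active"},
    "source": {"production_replay", "synthetic", "manual", "user_feedback"},
    "risk_tier": {"low", "medium", "high"},
}
# Assumed required fields; failure_class is left optional because it may be null.
REQUIRED = {"id", "plane", "input", "expected", "rubric", *ALLOWED}

def validate_case(case):
    """Return a list of problems; an empty list means the case is well formed."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED - set(case))]
    for field, allowed in ALLOWED.items():
        if field in case and case[field] not in allowed:
            problems.append(f"{field}={case[field]!r} not in {sorted(allowed)}")
    return problems

def gates_release(case):
    """Only active cases count toward a release gate."""
    return case["status"] == "active"

def needs_human_review(case):
    """High-risk cases go to a human regardless of judge score."""
    return case["risk_tier"] == "high"

case = {
    "id": "eval-2026-07-001",
    "plane": "context",
    "scenario": "adversarial",
    "status": "active",
    "input": {"user_message": "example question", "principal": "user-1", "session": "s-1"},
    "expected": {"must_retrieve": ["doc-id-1"], "must_not_retrieve": ["doc-id-9"], "abstain": False},
    "rubric": ["grounding", "scope", "ranking"],
    "failure_class": None,
    "source": "synthetic",
    "risk_tier": "medium",
}
problems = validate_case(case)
```

Running validation at dataset load time keeps malformed or draft cases from silently changing gate results.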

Per-plane eval recipe

  1. Define failure taxonomy for the plane
  2. Instrument traces on every production request
  3. Build dataset slice (representative + edge + adversarial per use case)
  4. Automated checks + specialized scorers where applicable
  5. Judge rubric (3–5 dimensions, anchored 1–5)
  6. Calibrate judge vs human (κ ≥ 0.7)
  7. Set offline gate thresholds per plane
  8. Online sample + drift alerts for same plane metrics
  9. Incident → case within one week
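Step 9 of the recipe can be sketched as a small conversion from an incident trace to a draft golden case, plus a check of the one-week window. Everything about the trace shape is hypothetical, and the ID pattern (year, month, sequence number) is only inferred from the example ID in the schema. The trace is assumed to be redacted already.

```python
from datetime import date

def incident_to_case(trace, plane, failure_class, expected, sequence, created=None):
    """Build a draft golden case from an (already redacted) incident trace."""
    created = created or date.today()
    return {
        "id": f"eval-{created:%Y-%m}-{sequence:03d}",  # assumed ID pattern
        "plane": plane,
        "scenario": "incident_replay",
        "status": "draft",  # does not gate releases until promoted to active
        "input": {
            "user_message": trace["user_message"],
            "principal": trace["principal"],
            "session": trace["session"],
        },
        "expected": expected,
        "rubric": [],  # left for the rubric owner to fill in
        "failure_class": failure_class,
        "source": "production_replay",
        "risk_tier": trace.get("risk_tier", "high"),  # assumed conservative default
    }

def within_sla(incident_day, case_day, max_days=7):
    """True when the case was created within one week of the incident."""
    return (case_day - incident_day).days <= max_days

trace = {"user_message": "example question", "principal": "user-123", "session": "sess-456"}
new_case = incident_to_case(
    trace,
    plane="context",
    failure_class="scope_violation",
    expected={"must_retrieve": [], "must_not_retrieve": ["doc-id-9"], "abstain": False},
    sequence=12,
    created=date(2026, 7, 9),
)
on_time = within_sla(date(2026, 7, 6), date(2026, 7, 9))
```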

Release gate matrix

| Change type | Planes to re-run | Offline gate | Online follow-up |
| --- | --- | --- | --- |
| Model swap | Context, Reasoning, Outcome | Golden + replay; no judge drift | Shadow 24h before full cutover |
| Prompt change | Reasoning, Outcome | Rubric ≥ baseline | Sample judge scores 48h |
| Retrieval / index | Data, Context | recall@k; scope = 0 on adversarial | Retrieval metric dashboard |
| Tool / ACL | Tool, Action | Schema 100%; PDP replay 100% | Policy violation alert |
| Memory store | Memory, Reasoning | Leakage = 0 | Session isolation monitor |
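The matrix lends itself to a data-driven gate. The sketch below encodes only the "planes to re-run" column and blocks promotion when any required plane failed or was not run. The change-type keys, the function names, and the rule that a missing result blocks are assumptions.

```python
# Change type -> planes to re-run, copied from the matrix above.
GATE_MATRIX = {
    "model_swap": ["context", "reasoning", "outcome"],
    "prompt_change": ["reasoning", "outcome"],
    "retrieval_index": ["data", "context"],
    "tool_acl": ["tool", "action"],
    "memory_store": ["memory", "reasoning"],
}

def planes_to_rerun(change_types):
    """Union of planes for all change types in a release, order preserved."""
    planes = []
    for change in change_types:
        for plane in GATE_MATRIX[change]:
            if plane not in planes:
                planes.append(plane)
    return planes

def offline_gate(change_types, plane_results):
    """plane_results maps plane -> True/False for that plane's gate.

    A plane with no recorded result is treated as failing (assumption),
    so a skipped eval cannot pass by omission.
    """
    required = planes_to_rerun(change_types)
    blocking = [plane for plane in required if not plane_results.get(plane, False)]
    return {"ship": not blocking, "blocking_planes": blocking}

# A release that changes both a prompt and a tool ACL; the action plane never ran.
release_changes = ["prompt_change", "tool_acl"]
results = {"reasoning": True, "outcome": True, "tool": True}
verdict = offline_gate(release_changes, results)
```

Keeping the matrix as data means a new change type is a one-line addition rather than a new branch in CI logic.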

Ownership

| Role | Owns |
| --- | --- |
| Product / domain | Rubrics, representative cases, business thresholds |
| AI platform | Harness, judge pipelines, score store, CI + online pipelines |
| Governance | Policy cases, audit sampling, high-risk human queue |
| SRE / reliability | Replay infra, drift alerts, incident-to-case SLA |

Further reading (external)

Third-party articles, guides, and tool docs — curated by topic and mapped to each page in this series. By other practitioners, not this site.

