Human Review: Manual Eval That Calibrates the System

Human review does not scale to every request, but it anchors ground truth for the automated evaluation that does. Without a human-labeled calibration set, LLM-as-judge scores drift, automated checks miss nuance, and leadership gets a false sense of precision.

Part of the Eval Framework Blueprint series.

THE CLAIM

Humans own rubric definition, calibration, high-risk adjudication, and audit samples — not every passing PR.

Three human review queues

Queue | Trigger | Volume | Output
Calibration | Weekly sample from golden + prod | 50–200 cases/week | Judge tuning, rubric drift alerts
High-risk | risk_tier: high or policy-adjacent | 100% of cases | Pass/fail for release
Audit | Regulator-ready random sample | Monthly batch | Signed review record
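
As a rough illustration of the routing in the table above, the sketch below assigns a case to queues. The field names (`policy_adjacent`, `in_calibration_sample`, `in_audit_sample`) are hypothetical; only `risk_tier: high` comes from the table, and the weekly and monthly samplers are assumed to run elsewhere.

```python
from typing import Dict, List


def assign_queues(case: Dict[str, object]) -> List[str]:
    """Return the human review queues a case belongs to (illustrative only)."""
    queues: List[str] = []
    # High-risk queue: 100% of cases that are risk_tier: high or policy-adjacent.
    if case.get("risk_tier") == "high" or case.get("policy_adjacent", False):
        queues.append("high_risk")
    # Calibration membership is assumed to be set by a weekly sampler upstream.
    if case.get("in_calibration_sample", False):
        queues.append("calibration")
    # Audit membership is assumed to be set by a monthly random sampler upstream.
    if case.get("in_audit_sample", False):
        queues.append("audit")
    return queues


example = {"risk_tier": "high", "in_audit_sample": True}
print(assign_queues(example))
```

A case can land in more than one queue, which is why the function returns a list rather than a single label.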

Reviewer workflow

  1. Blind to model version when scoring quality (avoid brand bias)
  2. See full trace across planes — not final answer only
  3. Score per rubric dimension (1–5 anchored scale)
  4. Tag failure class if any dimension ≤ 2
  5. Add notes for dataset enrichment
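
Steps 3 and 4 lend themselves to a submission check in the review tool. The following is a minimal sketch, assuming scores arrive as a dimension-to-integer mapping; the function name and error messages are illustrative, not part of the blueprint.

```python
from typing import Dict, Optional

FAILURE_THRESHOLD = 2  # step 4: a failure class is required if any dimension <= 2


def validate_submission(scores: Dict[str, int], failure_class: Optional[str]) -> None:
    """Raise ValueError if a reviewer submission breaks the workflow rules."""
    if not scores:
        raise ValueError("at least one rubric dimension must be scored")
    for dimension, score in scores.items():
        # Step 3: every dimension uses the anchored 1-5 scale.
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{dimension}: score must be an integer from 1 to 5")
    # Step 4: low scores must carry a failure class tag.
    if min(scores.values()) <= FAILURE_THRESHOLD and not failure_class:
        raise ValueError("failure class required when any dimension is <= 2")


validate_submission({"grounding": 5, "clarity": 4}, None)        # accepted
validate_submission({"grounding": 2, "clarity": 4}, "misquote")  # accepted
```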

Anchored rubric template (1–5)

Use the same anchors across planes; plane playbooks add dimension-specific criteria.

Score | Meaning
5 | Fully meets criteria; no material issues
4 | Meets criteria; minor issues that do not affect trust
3 | Partial; user might be misled or task incomplete
2 | Significant failure; wrong, unsafe, or ungrounded
1 | Critical failure; harm, policy violation, or fabrication

Gate rule: Any dimension at ≤ 2 on a high-risk case → fail. Median < 4 on representative sample → investigate before release.
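
The gate rule can be expressed in a few lines. This sketch assumes the median is taken over a flat list of scores from the representative sample (for example, one dimension at a time); the text does not specify the aggregation, so treat that as an assumption.

```python
from statistics import median
from typing import Dict, List


def high_risk_case_fails(scores: Dict[str, int]) -> bool:
    """A high-risk case fails if any rubric dimension is scored 2 or lower."""
    return any(score <= 2 for score in scores.values())


def sample_needs_investigation(sample_scores: List[int]) -> bool:
    """Flag a representative sample whose median score is below 4."""
    return median(sample_scores) < 4


print(high_risk_case_fails({"grounding": 2, "trust": 5}))  # True
print(sample_needs_investigation([5, 4, 4, 3, 5]))         # False (median is 4)
print(sample_needs_investigation([3, 3, 4, 5]))            # True (median is 3.5)
```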

Dimensions by concern (pick 3–5 per use case)

Dimension | Humans score best when
Grounding | Subtle misquotes, wrong doc, missing citation
Policy fit | Gray-area compliance, tone in regulated contexts
Task completion | Multi-step workflows, partial answers
Clarity | Jargon, ambiguous next steps
Trust | Overconfidence, missing uncertainty

Calibration protocol (LLM-as-judge alignment)

  1. Humans score calibration split (see Golden Datasets)
  2. Run judge on same cases, blind to human scores
  3. Measure Cohen's κ per dimension; target κ ≥ 0.7
  4. If below threshold: revise rubric anchors or judge prompt — not the holdout set
  5. Re-run monthly; alert on κ drift
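
Step 3 can be computed without extra dependencies. Below is a minimal sketch of unweighted Cohen's κ, κ = (p_o − p_e) / (1 − p_e), for one dimension, where p_o is observed agreement and p_e is the agreement expected by chance. The function names are illustrative. Because the rubric is ordinal, a weighted κ variant may be a better fit in practice; this sketch uses the unweighted form for simplicity.

```python
from collections import Counter
from typing import Sequence

KAPPA_TARGET = 0.7  # target from the protocol above


def cohens_kappa(human: Sequence[int], judge: Sequence[int]) -> float:
    """Unweighted Cohen's kappa between human and judge scores for one dimension."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("score lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of cases where both raters gave the same score.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal frequency per score.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical score; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def below_target(kappa: float) -> bool:
    """True when agreement is below the kappa >= 0.7 target."""
    return kappa < KAPPA_TARGET


human_scores = [5, 4, 3, 2, 1, 5, 4, 3]
judge_scores = [5, 4, 3, 3, 1, 5, 4, 4]
k = cohens_kappa(human_scores, judge_scores)
print(round(k, 3), below_target(k))
```

Running this per dimension each month and storing the result gives the time series needed for the drift alert in step 5.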

What humans should not do

  • Score every CI run (use judge + automation)
  • Approve without trace visibility
  • Change golden expected outputs during review to match model (log model failure instead)
  • Single reviewer on high-risk cases (require 2-of-3 or expert sign-off)
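
The last rule can be enforced in code. This sketch reads "2-of-3" as two matching verdicts out of up to three reviewers, with an expert sign-off overriding the vote; both readings are assumptions, and the function name is hypothetical.

```python
from typing import List, Optional


def adjudicate(verdicts: List[bool], expert_signoff: Optional[bool] = None) -> bool:
    """Return True if a high-risk case passes, False if it fails.

    verdicts holds one pass/fail vote per reviewer (True means pass).
    """
    # An explicit expert decision settles the case (assumed precedence).
    if expert_signoff is not None:
        return expert_signoff
    if len(verdicts) < 2:
        raise ValueError("high-risk cases need more than one reviewer")
    if len(verdicts) > 3:
        raise ValueError("this sketch handles at most three reviewers")
    passes = sum(verdicts)
    fails = len(verdicts) - passes
    if passes >= 2:
        return True
    if fails >= 2:
        return False
    # One pass and one fail: a third reviewer has to break the tie.
    raise ValueError("no 2-of-3 majority yet; add a third reviewer")


print(adjudicate([True, False, True]))         # True
print(adjudicate([True], expert_signoff=False))  # False
```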

Review UI requirements

  • Side-by-side: input, retrieved chunks, tool calls, policy verdicts, output
  • Plane timeline (same order as production)
  • One-click: "promote to golden dataset" with failure class pre-filled
  • Immutable review record: reviewer, timestamp, rubric version, scores
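
One way to model the immutable review record is a frozen dataclass with a content digest. This is a sketch under assumptions: the class name, field types, and the SHA-256 digest are illustrative, and in-process immutability does not replace append-only or write-once storage.

```python
import hashlib
import json
from dataclasses import FrozenInstanceError, dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ReviewRecord:
    reviewer: str
    timestamp: datetime
    rubric_version: str
    scores: Mapping[str, int]

    def __post_init__(self) -> None:
        # Copy the scores into a read-only view so later edits to the
        # caller's dict cannot change the record.
        object.__setattr__(self, "scores", MappingProxyType(dict(self.scores)))

    def digest(self) -> str:
        """SHA-256 over a canonical JSON form, usable as a tamper check."""
        payload = json.dumps(
            {
                "reviewer": self.reviewer,
                "timestamp": self.timestamp.isoformat(),
                "rubric_version": self.rubric_version,
                "scores": dict(self.scores),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


record = ReviewRecord(
    reviewer="reviewer-a",
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    rubric_version="v1",
    scores={"grounding": 4, "trust": 5},
)
try:
    record.reviewer = "someone-else"  # rejected: the dataclass is frozen
except FrozenInstanceError:
    print("record is immutable; digest:", record.digest()[:12])
```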

Sampling from production

See Online & Dynamic Eval for sampling rates, shadow scoring, and drift alerts.

Stream | Rate | Filter
Random | 0.5–2% | Stratify by use case
Low judge confidence | 100% | Judge score borderline (e.g. 3±0.5)
User thumbs-down | 100% | Already a signal
Policy near-miss | 100% | PEP returned STEP-UP or DENY
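
The sampling streams above can be sketched as a single decision function. The trace field names (`pep_verdict`, `user_feedback`, `judge_score`) and the 1% random rate are assumptions chosen for illustration; stratifying the random stream by use case is left out to keep the sketch short.

```python
import random
from typing import Dict, Optional

RANDOM_RATE = 0.01        # assumed value inside the 0.5–2% range
BORDERLINE_CENTER = 3.0   # "3±0.5" from the table
BORDERLINE_WIDTH = 0.5


def review_stream(trace: Dict[str, object], rng: Optional[random.Random] = None) -> Optional[str]:
    """Return the stream that sends this trace to human review, or None."""
    rng = rng if rng is not None else random.Random()
    # Policy near-miss: 100% of traces where the PEP returned STEP-UP or DENY.
    if trace.get("pep_verdict") in {"STEP-UP", "DENY"}:
        return "policy_near_miss"
    # User thumbs-down: 100%.
    if trace.get("user_feedback") == "thumbs_down":
        return "user_thumbs_down"
    # Low judge confidence: 100% of borderline judge scores.
    score = trace.get("judge_score")
    if isinstance(score, (int, float)) and abs(score - BORDERLINE_CENTER) <= BORDERLINE_WIDTH:
        return "low_judge_confidence"
    # Random stream: a small fraction of everything else.
    if rng.random() < RANDOM_RATE:
        return "random"
    return None


print(review_stream({"pep_verdict": "DENY"}))
print(review_stream({"judge_score": 3.4}))
```

Checking the 100% streams first means the random draw only applies to traces that no targeted stream has already captured.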