Human Review: Manual Eval That Calibrates the System
Human review does not scale to every request, but it anchors ground truth for everything that does. Without a human-labeled calibration set, LLM-as-judge drifts, automated checks miss nuance, and leadership gets a false sense of precision.
Part of the Eval Framework Blueprint series.
THE CLAIM
Humans own rubric definition, calibration, high-risk adjudication, and audit samples — not every passing PR.
Three human review queues
| Queue | Trigger | Volume | Output |
|---|---|---|---|
| Calibration | Weekly sample from golden + prod | 50–200 cases/week | Judge tuning, rubric drift alerts |
| High-risk | risk_tier: high or policy-adjacent | 100% of cases | Pass/fail for release |
| Audit | Regulator-ready random sample | Monthly batch | Signed review record |
Reviewer workflow
- Blind to model version when scoring quality (avoid brand bias)
- See the full trace across planes, not just the final answer
- Score per rubric dimension (1–5 anchored scale)
- Tag failure class if any dimension ≤ 2
- Add notes for dataset enrichment
Anchored rubric template (1–5)
Use the same anchors across planes; plane playbooks add dimension-specific criteria.
| Score | Meaning |
|---|---|
| 5 | Fully meets criteria; no material issues |
| 4 | Meets criteria; minor issues that do not affect trust |
| 3 | Partial; user might be misled or task incomplete |
| 2 | Significant failure; wrong, unsafe, or ungrounded |
| 1 | Critical failure; harm, policy violation, or fabrication |
Gate rule: any dimension scored ≤ 2 on a high-risk case → fail. Median score < 4 on a representative sample → investigate before release.
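The gate rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and pooling every dimension score in the sample before taking the median is an assumption, since the rule does not say whether the median is taken per dimension or overall.

```python
from statistics import median

FAIL_THRESHOLD = 2     # any dimension at or below this fails a high-risk case
MEDIAN_THRESHOLD = 4   # a sample median below this triggers investigation


def high_risk_case_fails(scores: dict[str, int]) -> bool:
    """Return True if any rubric dimension is scored at or below FAIL_THRESHOLD."""
    return any(score <= FAIL_THRESHOLD for score in scores.values())


def sample_needs_investigation(sample: list[dict[str, int]]) -> bool:
    """Return True if the median score across the sample is below MEDIAN_THRESHOLD.

    Assumption: scores from all dimensions are pooled into one list.
    """
    pooled = [score for case in sample for score in case.values()]
    if not pooled:
        raise ValueError("sample contains no scores")
    return median(pooled) < MEDIAN_THRESHOLD
```

For example, a high-risk case scored `{"grounding": 5, "policy_fit": 2}` fails the gate even though its other dimension is perfect.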
Dimensions by concern (pick 3–5 per use case)
| Dimension | What humans catch best |
|---|---|
| Grounding | Subtle misquotes, wrong doc, missing citation |
| Policy fit | Gray-area compliance, tone in regulated contexts |
| Task completion | Multi-step workflows, partial answers |
| Clarity | Jargon, ambiguous next steps |
| Trust | Overconfidence, missing uncertainty |
Calibration protocol (LLM-as-judge alignment)
- Humans score the calibration split (see Golden Datasets)
- Run the judge on the same cases, blind to the human scores
- Measure Cohen's κ per dimension; target κ ≥ 0.7
- If below threshold: revise rubric anchors or judge prompt — not the holdout set
- Re-run monthly; alert on κ drift
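As a rough sketch of the agreement check in the protocol above, the following computes unweighted Cohen's κ between human and judge scores for each dimension and lists the dimensions that fall below the 0.7 target. The function names are hypothetical. Unweighted κ treats a 4-vs-5 disagreement the same as a 1-vs-5 disagreement; a weighted variant may suit an ordinal 1–5 scale better, and swapping it in would change the numbers.

```python
from collections import Counter


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of cases where both raters gave the same score.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement by chance, from each rater's marginal score distribution.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical score for every case
    return (observed - expected) / (1 - expected)


def dimensions_below_target(
    human_by_dim: dict[str, list[int]],
    judge_by_dim: dict[str, list[int]],
    target: float = 0.7,
) -> list[str]:
    """Return the dimensions whose kappa is below the target, sorted by name."""
    return sorted(
        dim
        for dim in human_by_dim
        if cohens_kappa(human_by_dim[dim], judge_by_dim[dim]) < target
    )
```

Any dimension returned by `dimensions_below_target` is a candidate for revising the rubric anchors or the judge prompt, as the protocol describes.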
What humans should not do
- Score every CI run (use judge + automation)
- Approve without trace visibility
- Change golden expected outputs during review to match the model's output (log a model failure instead)
- Rely on a single reviewer for high-risk cases (require 2-of-3 agreement or expert sign-off)
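One possible reading of the 2-of-3 rule, sketched below: a high-risk case is decided once two of up to three independent reviewers agree, or as soon as an expert records a verdict. The function name and the choice to let the expert verdict take precedence are assumptions made for illustration.

```python
from typing import Optional


def high_risk_decision(
    verdicts: list[bool], expert_verdict: Optional[bool] = None
) -> Optional[bool]:
    """Return True (pass), False (fail), or None if more review is needed.

    Assumption: an expert verdict, when present, overrides reviewer votes.
    """
    if len(verdicts) > 3:
        raise ValueError("expected at most three reviewer verdicts")
    if expert_verdict is not None:
        return expert_verdict
    passes = sum(verdicts)
    fails = len(verdicts) - passes
    if passes >= 2:
        return True
    if fails >= 2:
        return False
    return None  # a single reviewer, or a 1-1 split, is not enough to decide
```

Returning `None` rather than a default keeps a lone reviewer from ever closing a high-risk case.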
Review UI requirements
- Side-by-side: input, retrieved chunks, tool calls, policy verdicts, output
- Plane timeline (same order as production)
- One-click: "promote to golden dataset" with failure class pre-filled
- Immutable review record: reviewer, timestamp, rubric version, scores
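The immutable review record could be modeled along these lines. This is a sketch under stated assumptions: the class name, field types, and 1–5 validation are illustrative, and freezing the object only protects it in process. A real system would also need append-only storage to make the record tamper-evident.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ReviewRecord:
    """In-process immutable record of one human review (hypothetical shape)."""

    reviewer: str
    timestamp: datetime
    rubric_version: str
    scores: Mapping[str, int]  # rubric dimension -> score on the 1-5 scale

    def __post_init__(self) -> None:
        for dimension, score in self.scores.items():
            if not 1 <= score <= 5:
                raise ValueError(f"{dimension}: score {score} is outside 1-5")
        # Store a read-only view of a copy, so neither the caller's dict
        # nor the record's own mapping can be mutated afterwards.
        object.__setattr__(self, "scores", MappingProxyType(dict(self.scores)))


record = ReviewRecord(
    reviewer="reviewer-17",
    timestamp=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    rubric_version="v3",
    scores={"grounding": 4, "policy_fit": 5},
)
```

Assigning to `record.reviewer` raises `dataclasses.FrozenInstanceError`, and writing to `record.scores` raises `TypeError`.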
Sampling from production
See Online & Dynamic Eval for sampling rates, shadow scoring, and drift alerts.
| Stream | Rate | Filter |
|---|---|---|
| Random | 0.5–2% | Stratify by use case |
| Low judge confidence | 100% | Judge score borderline (e.g. 3±0.5) |
| User thumbs-down | 100% | Already an explicit failure signal |
| Policy near-miss | 100% | PEP returned STEP-UP or DENY |
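The sampling table can be read as a routing function that assigns each production case to at most one review stream. The sketch below is illustrative only: `ProdCase`, its field names, the stream labels, and the priority order (policy, then user feedback, then judge confidence, then random) are all assumptions. Per-use-case rates stand in for stratification.

```python
import random
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ProdCase:
    """Hypothetical shape of a production case considered for review."""

    use_case: str
    judge_score: Optional[float] = None
    thumbs_down: bool = False
    pep_verdict: Optional[str] = None


def review_stream(
    case: ProdCase,
    rates: Optional[Mapping[str, float]] = None,
    default_rate: float = 0.01,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return the review stream for a case, or None if it is not sampled."""
    # The 100% streams are checked first, in an assumed priority order.
    if case.pep_verdict in {"STEP-UP", "DENY"}:
        return "policy_near_miss"
    if case.thumbs_down:
        return "user_thumbs_down"
    if case.judge_score is not None and abs(case.judge_score - 3.0) <= 0.5:
        return "low_judge_confidence"
    # Random stream: the rate is looked up per use case (0.5-2% in the table).
    rate = (rates or {}).get(case.use_case, default_rate)
    if (rng or random).random() < rate:
        return "random"
    return None
```

Passing a seeded `random.Random` as `rng` makes the random stream reproducible, which helps when auditing why a given case was or was not sampled.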
Next in series
- LLM-as-Judge — scale what humans define
- Golden Datasets — case library design
- Plane playbooks: Context · Action