Golden Datasets: Building Eval Data That Predicts Production

A golden dataset is the institutional memory of what "good" looks like for your AI system. Without it, every eval conversation devolves into anecdotes. With a bad one, you optimize for the demo and fail in production.

Part of the Eval Framework Blueprint series.

THE CLAIM

Golden datasets are versioned, plane-tagged, risk-tiered case libraries — not a one-time export of happy-path prompts.

Five dataset layers (build all five)

Layer	Purpose	Minimum size (per use case)
Representative	Real user tasks	50+ cases from prod sampling or domain workshops
Edge	Boundaries, ambiguity, multi-step	20+ cases
Adversarial	Injection, misuse, policy bypass	15+ cases
Incident replay	Every production failure	1 case per incident, permanent
Regression	Locked baselines after each gate	Snapshot per release
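
The minimums in the table can be enforced mechanically. The sketch below assumes each case carries a `scenario` field naming its layer, as in the case schema in this article. The names `MIN_CASES` and `layer_shortfalls` are illustrative and not part of any particular harness.

```python
from collections import Counter

# Minimum case counts per layer, taken from the table above.
# Incident replay and regression have no fixed floor, so they are omitted.
MIN_CASES = {"representative": 50, "edge": 20, "adversarial": 15}


def layer_shortfalls(cases):
    """Return {layer: missing_count} for every layer below its minimum."""
    counts = Counter(case["scenario"] for case in cases)
    return {
        layer: minimum - counts[layer]
        for layer, minimum in MIN_CASES.items()
        if counts[layer] < minimum
    }


cases = (
    [{"scenario": "representative"}] * 50
    + [{"scenario": "edge"}] * 12
    + [{"scenario": "adversarial"}] * 15
)
print(layer_shortfalls(cases))  # {'edge': 8}
```

Running a check like this per use case, before the eval itself, makes a thin layer visible as a build failure rather than as a production surprise.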

Case schema

{
  "id": "ctx-001",
  "version": "2026.07.1",
  "plane": ["context", "outcome"],
  "scenario": "representative",
  "risk_tier": "medium",
  "input": {
    "message": "What is our wire limit for tier-2 customers?",
    "principal": { "role": "advisor", "segment": "tier-2" }
  },
  "trace_fixture": "s3://eval-fixtures/ctx-001-trace.json",
  "expected": {
    "must_cite": ["policy-wire-limits-v3"],
    "must_not_cite": ["internal-draft-2024"],
    "abstain_if_missing_evidence": true
  },
  "rubric_dims": ["grounding", "scope", "completeness"],
  "automated": {
    "citation_required": true,
    "max_latency_ms": 8000
  },
  "failure_class": null,
  "owner": "product-banking",
  "source": "production_replay"
}
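
A lightweight validator can reject malformed cases before they enter the golden set. This is a minimal sketch, not a definitive schema: the choice of required fields and the `incident_replay` scenario value are assumptions made for illustration, while the plane tags come from the plane tagging table in this article.

```python
# Assumption: which fields are mandatory is a team decision; this set is illustrative.
REQUIRED_FIELDS = {
    "id", "version", "plane", "scenario", "risk_tier",
    "input", "expected", "owner", "source",
}

# Plane tags as listed in the plane tagging table of this article.
PLANES = {
    "input", "data", "context", "reasoning",
    "tool", "memory", "action", "outcome",
}


def validate_case(case):
    """Return a list of problems; an empty list means the case passes."""
    errors = []
    missing = REQUIRED_FIELDS - case.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    planes = case.get("plane", [])
    if "plane" in case and not planes:
        errors.append("plane must list at least one tag")
    unknown = set(planes) - PLANES
    if unknown:
        errors.append(f"unknown plane tags: {sorted(unknown)}")
    # Hypothetical rule: "incident_replay" as a scenario value is an assumption.
    if case.get("scenario") == "incident_replay" and not case.get("failure_class"):
        errors.append("incident replay cases need a failure_class")
    return errors


case = {
    "id": "ctx-001",
    "version": "2026.07.1",
    "plane": ["context", "outcome"],
    "scenario": "representative",
    "risk_tier": "medium",
    "input": {"message": "What is our wire limit for tier-2 customers?"},
    "expected": {"must_cite": ["policy-wire-limits-v3"]},
    "failure_class": None,
    "owner": "product-banking",
    "source": "production_replay",
}
print(validate_case(case))  # []
print(validate_case({**case, "plane": ["contxt"]}))  # ["unknown plane tags: ['contxt']"]
```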

Plane tagging

Tag every case with the planes it exercises. One case often spans multiple planes, and that is expected: the harness scores each plane independently from the same replay.

Tag	When to include
`input`	Injection, malformed input, PII
`data`	Stale source, wrong catalog version
`context`	Retrieval scope, ranking, abstention
`reasoning`	Multi-hop logic, tool selection
`tool`	API args, error handling
`memory`	Cross-turn state, session bleed
`action`	Policy, authorization, side effects
`outcome`	End-user task completion
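
Per-plane scoring from a single replay could look like the sketch below. The trace shape (`citations`, `task_completed`) and the two scorer functions are hypothetical stand-ins; a real harness would read these from the stored trace fixture.

```python
def score_context(case, trace):
    """1.0 if every required source is cited and no forbidden source is."""
    cited = set(trace["citations"])
    expected = case["expected"]
    has_required = set(expected.get("must_cite", [])) <= cited
    has_forbidden = bool(set(expected.get("must_not_cite", [])) & cited)
    return 1.0 if has_required and not has_forbidden else 0.0


def score_outcome(case, trace):
    """1.0 if the replay recorded the user task as completed."""
    return 1.0 if trace["task_completed"] else 0.0


# Assumption: one scorer per plane tag; only two are shown here.
SCORERS = {"context": score_context, "outcome": score_outcome}


def score_by_plane(case, trace, scorers=SCORERS):
    """Score each tagged plane independently against the same trace."""
    return {
        plane: scorers[plane](case, trace)
        for plane in case["plane"]
        if plane in scorers
    }


case = {
    "id": "ctx-001",
    "plane": ["context", "outcome"],
    "expected": {
        "must_cite": ["policy-wire-limits-v3"],
        "must_not_cite": ["internal-draft-2024"],
    },
}
trace = {"citations": ["policy-wire-limits-v3"], "task_completed": False}
print(score_by_plane(case, trace))  # {'context': 1.0, 'outcome': 0.0}
```

Here retrieval behaved correctly but the task still failed, a distinction that a single pass/fail label per case would hide.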

How to populate each layer

Representative tasks

Sample production traffic (redacted, consented) into a review queue
Domain experts label: task type, success criteria, risk tier
Promote to golden set after human confirms expected behavior

Edge cases

Workshop with support and compliance: "what almost went wrong?"
Boundary values: empty retrieval, max context, concurrent tools, timeout paths

Adversarial cases

Prompt injection in user and retrieved content
Tool argument manipulation
Requests for out-of-scope data
Jailbreak patterns relevant to your domain (not generic internet lists)

Incident replay

SLA: incident → golden case within 7 days
Store full trace fixture, not just final Q&A
Tag `failure_class` from the taxonomy in the Eval Engineering executive insight
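
The 7-day SLA is easy to monitor if golden cases link back to the incident that produced them. The sketch below assumes a hypothetical `incident_id` field on golden cases and a simple incident record with `id` and `opened`; neither is part of the case schema shown in this article.

```python
from datetime import date, timedelta

# SLA from the list above: incident to golden case within 7 days.
SLA = timedelta(days=7)


def overdue_incidents(incidents, golden_cases, today):
    """Return IDs of incidents past the SLA with no linked golden case."""
    # Assumption: golden cases carry a hypothetical "incident_id" back-reference.
    covered = {c["incident_id"] for c in golden_cases if c.get("incident_id")}
    return [
        incident["id"]
        for incident in incidents
        if incident["id"] not in covered and today - incident["opened"] > SLA
    ]


incidents = [
    {"id": "INC-1", "opened": date(2026, 7, 1)},   # 11 days old, no case yet
    {"id": "INC-2", "opened": date(2026, 7, 10)},  # 2 days old, still within SLA
    {"id": "INC-3", "opened": date(2026, 7, 2)},   # old, but already covered
]
golden_cases = [{"id": "inc-003", "incident_id": "INC-3"}]
print(overdue_incidents(incidents, golden_cases, today=date(2026, 7, 12)))  # ['INC-1']
```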

Versioning and change control

Bump the dataset version on any case add, edit, or removal
CI runs eval against dataset@version + system@build
Store (dataset_version, system_version, scores) for audit replay
Never edit a case in place after it has gated a release. Append a new case, or supersede the old one with `replaces: "ctx-001"`
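
Supersession and audit storage can both be append-only. In the sketch below, the new case is assumed to carry the `replaces` field pointing at the case it supersedes; the IDs `ctx-001b` and `build-4812` and the audit row layout are made up for illustration.

```python
def active_cases(cases):
    """Drop any case that another case supersedes via its `replaces` field."""
    replaced = {c["replaces"] for c in cases if c.get("replaces")}
    return [c for c in cases if c["id"] not in replaced]


def record_run(audit_log, dataset_version, system_version, scores):
    """Append one audit row; existing rows are never modified."""
    audit_log.append(
        {
            "dataset_version": dataset_version,
            "system_version": system_version,
            "scores": dict(scores),  # copy so later mutation cannot alter the log
        }
    )


cases = [
    {"id": "ctx-001", "version": "2026.07.1"},
    {"id": "ctx-001b", "version": "2026.08.1", "replaces": "ctx-001"},
    {"id": "ctx-002", "version": "2026.07.1"},
]
print([c["id"] for c in active_cases(cases)])  # ['ctx-001b', 'ctx-002']

audit_log = []
record_run(audit_log, "2026.08.1", "build-4812", {"context": 0.94})
print(audit_log[0]["dataset_version"])  # 2026.08.1
```

Because `ctx-001` stays in the file, an audit replay against the older dataset version can still reproduce the scores that gated the earlier release.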

Splitting datasets for judge calibration

Split	Use
Calibration (15%)	Tune LLM-as-judge against human scores
Holdout (15%)	Final judge validation — never tune on this
Gate (70%)	CI/CD release gate
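
One way to keep the holdout split untouched across dataset versions is to assign splits deterministically from the case ID rather than by random shuffle. This is a sketch of that idea, not a prescribed method: the function name, the salt string, and the use of SHA-256 are all assumptions.

```python
import hashlib


def assign_split(case_id, salt="judge-split-v1"):
    """Map a case ID to a split using a stable hash bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{case_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < 15:
        return "calibration"  # 15%: tune the LLM judge against human scores
    if bucket < 30:
        return "holdout"      # 15%: final judge validation, never tuned on
    return "gate"             # 70%: CI/CD release gate


# The same ID always lands in the same split, on any machine or run.
print(assign_split("ctx-001") == assign_split("ctx-001"))  # True
```

Hash bucketing yields the 15/15/70 proportions only approximately, and small datasets can deviate noticeably, so it is worth checking the realized counts. The benefit is that adding new cases never moves an existing case from the gate split into the holdout split or back.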

Anti-patterns

Anti-pattern	Why it fails
Single "accuracy" label per case	Hides which plane broke
Copy-paste from public benchmarks	Domain and policy mismatch
No adversarial layer	Production surprises are guaranteed
Stale dataset (>90 days, no replay)	Drift makes scores meaningless
Editing cases to make CI green	You are lying to the gate

Next in series

Synthetic Generation — scale edge & adversarial layers
Online & Dynamic Eval — production sampling & drift
Human Review — score cases humans own
LLM-as-Judge — scale scoring with calibration
Eval Framework Blueprint — full architecture
