Golden Datasets: Building Eval Data That Predicts Production

A golden dataset is the institutional memory of what "good" looks like for your AI system. Without it, every eval conversation devolves into anecdotes. With a bad one, you optimize for the demo and fail in production.

Part of the Eval Framework Blueprint series.

THE CLAIM

Golden datasets are versioned, plane-tagged, risk-tiered case libraries — not a one-time export of happy-path prompts.

Five dataset layers (build all five)

| Layer | Purpose | Minimum size (per use case) |
| --- | --- | --- |
| Representative | Real user tasks | 50+ cases from prod sampling or domain workshops |
| Edge | Boundaries, ambiguity, multi-step | 20+ cases |
| Adversarial | Injection, misuse, policy bypass | 15+ cases |
| Incident replay | Every production failure | 1 case per incident, permanent |
| Regression | Locked baselines after each gate | Snapshot per release |
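
The minimums in the table can be checked mechanically before a dataset version is published. The sketch below is one way to do that, assuming each case's `scenario` field holds the layer name. Only `"representative"` appears in the schema in this article, so the `"edge"` and `"adversarial"` values are hypothetical. Incident replay and regression are left out because their sizes depend on incidents and releases rather than on a fixed count.

```python
from collections import Counter

# Minimum case counts per layer, taken from the table above.
# Assumption: the layer name is stored in each case's "scenario" field.
LAYER_MINIMUMS = {"representative": 50, "edge": 20, "adversarial": 15}


def layer_shortfalls(cases):
    """Return {layer: missing_count} for every layer below its minimum."""
    counts = Counter(case["scenario"] for case in cases)
    return {
        layer: minimum - counts[layer]
        for layer, minimum in LAYER_MINIMUMS.items()
        if counts[layer] < minimum
    }


# Illustrative data only: 50 representative cases, 12 edge cases, no adversarial cases.
cases = (
    [{"id": f"rep-{i}", "scenario": "representative"} for i in range(50)]
    + [{"id": f"edge-{i}", "scenario": "edge"} for i in range(12)]
)
print(layer_shortfalls(cases))  # {'edge': 8, 'adversarial': 15}
```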

Case schema

{
  "id": "ctx-001",
  "version": "2026.07.1",
  "plane": ["context", "outcome"],
  "scenario": "representative",
  "risk_tier": "medium",
  "input": {
    "message": "What is our wire limit for tier-2 customers?",
    "principal": { "role": "advisor", "segment": "tier-2" }
  },
  "trace_fixture": "s3://eval-fixtures/ctx-001-trace.json",
  "expected": {
    "must_cite": ["policy-wire-limits-v3"],
    "must_not_cite": ["internal-draft-2024"],
    "abstain_if_missing_evidence": true
  },
  "rubric_dims": ["grounding", "scope", "completeness"],
  "automated": {
    "citation_required": true,
    "max_latency_ms": 8000
  },
  "failure_class": null,
  "owner": "product-banking",
  "source": "production_replay"
}
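
A schema like this is easier to keep honest with a small validator run on every change to the dataset. The following is a minimal sketch using only the standard library. Which fields count as required, the `"incident_replay"` scenario value, and the rule that incident cases need both `failure_class` and `trace_fixture` are assumptions made for illustration, not a specification from this series.

```python
# Assumption: these are the fields every case must carry.
REQUIRED_FIELDS = {
    "id", "version", "plane", "scenario", "risk_tier",
    "input", "expected", "owner", "source",
}
# The eight plane tags described in this article.
KNOWN_PLANES = {
    "input", "data", "context", "reasoning",
    "tool", "memory", "action", "outcome",
}


def validate_case(case):
    """Return a list of problems; an empty list means the case passes."""
    errors = []
    missing = REQUIRED_FIELDS - set(case)
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    planes = case.get("plane") or []
    if not planes:
        errors.append("plane must list at least one tag")
    unknown = sorted(set(planes) - KNOWN_PLANES)
    if unknown:
        errors.append(f"unknown plane tags: {unknown}")
    # Hypothetical rule: incident replays carry a failure class and a full trace.
    if case.get("scenario") == "incident_replay":
        if not case.get("failure_class"):
            errors.append("incident replay cases need a failure_class")
        if not case.get("trace_fixture"):
            errors.append("incident replay cases need a trace_fixture")
    return errors


good_case = {
    "id": "ctx-001",
    "version": "2026.07.1",
    "plane": ["context", "outcome"],
    "scenario": "representative",
    "risk_tier": "medium",
    "input": {"message": "What is our wire limit for tier-2 customers?"},
    "expected": {"must_cite": ["policy-wire-limits-v3"]},
    "owner": "product-banking",
    "source": "production_replay",
}
# Hypothetical broken case: unknown plane tag, incident without trace or failure class.
bad_case = {"id": "inc-014", "plane": ["context", "billing"], "scenario": "incident_replay"}

print(validate_case(good_case))  # []
for problem in validate_case(bad_case):
    print(problem)
```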

Plane tagging

Tag every case with which planes it exercises. One case often spans multiple planes — that is correct. The harness scores each plane independently from the same replay.

| Tag | When to include |
| --- | --- |
| input | Injection, malformed input, PII |
| data | Stale source, wrong catalog version |
| context | Retrieval scope, ranking, abstention |
| reasoning | Multi-hop logic, tool selection |
| tool | API args, error handling |
| memory | Cross-turn state, session bleed |
| action | Policy, authorization, side effects |
| outcome | End-user task completion |
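
To make "scores each plane independently from the same replay" concrete, here is a rough sketch of a per-plane scoring loop. The scorer functions, the `PLANE_SCORERS` registry, and the replay fields `citations` and `task_completed` are all hypothetical; a real harness would read these from the stored trace fixture.

```python
def score_context(case, replay):
    """Pass only if every required source is cited and no forbidden one is."""
    cited = set(replay["citations"])
    expected = case["expected"]
    required = set(expected.get("must_cite", []))
    forbidden = set(expected.get("must_not_cite", []))
    return 1.0 if required <= cited and not (forbidden & cited) else 0.0


def score_outcome(case, replay):
    """Pass if the replay recorded the end-user task as completed."""
    return 1.0 if replay["task_completed"] else 0.0


# Hypothetical registry: one scorer per plane tag.
PLANE_SCORERS = {"context": score_context, "outcome": score_outcome}


def score_case(case, replay):
    """Score each tagged plane separately against the same replay."""
    return {
        plane: PLANE_SCORERS[plane](case, replay)
        for plane in case["plane"]
        if plane in PLANE_SCORERS
    }


case = {
    "id": "ctx-001",
    "plane": ["context", "outcome"],
    "expected": {
        "must_cite": ["policy-wire-limits-v3"],
        "must_not_cite": ["internal-draft-2024"],
    },
}
# Illustrative replay: the task completed, but a forbidden draft was cited.
replay = {
    "citations": ["policy-wire-limits-v3", "internal-draft-2024"],
    "task_completed": True,
}
print(score_case(case, replay))  # {'context': 0.0, 'outcome': 1.0}
```

In this example the outcome plane passes while the context plane fails, which a single pass/fail label for the case would not show.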

How to populate each layer

Representative tasks

  • Sample production traffic (redacted, consented) into a review queue
  • Domain experts label: task type, success criteria, risk tier
  • Promote to golden set after human confirms expected behavior

Edge cases

  • Workshop with support and compliance: "what almost went wrong?"
  • Boundary values: empty retrieval, max context, concurrent tools, timeout paths

Adversarial cases

  • Prompt injection in user and retrieved content
  • Tool argument manipulation
  • Requests for out-of-scope data
  • Jailbreak patterns relevant to your domain (not generic internet lists)

Incident replay

  • SLA: incident → golden case within 7 days
  • Store full trace fixture, not just final Q&A
  • Tag failure_class from the taxonomy in the Eval Engineering executive insight
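
The 7-day SLA is simple to enforce with a scheduled check. The sketch below assumes incidents are available as records with an `id` and an `opened` date, and that each replay case points back to its incident through an `incident_id` field. Both shapes are hypothetical and not part of the case schema shown earlier.

```python
from datetime import date, timedelta

# SLA from the list above: incident to golden case within 7 days.
INCIDENT_SLA = timedelta(days=7)


def overdue_incidents(incidents, golden_cases, today):
    """Return ids of incidents past the SLA that still have no golden case."""
    covered = {
        case["incident_id"] for case in golden_cases if case.get("incident_id")
    }
    return [
        incident["id"]
        for incident in incidents
        if incident["id"] not in covered
        and today - incident["opened"] > INCIDENT_SLA
    ]


# Illustrative records only.
incidents = [
    {"id": "INC-101", "opened": date(2026, 7, 1)},
    {"id": "INC-102", "opened": date(2026, 7, 1)},
    {"id": "INC-103", "opened": date(2026, 7, 10)},
]
golden_cases = [
    {"id": "inc-101-replay", "incident_id": "INC-101", "scenario": "incident_replay"},
]
print(overdue_incidents(incidents, golden_cases, today=date(2026, 7, 12)))  # ['INC-102']
```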

Versioning and change control

  • Dataset version bumps on any case add/edit/remove
  • CI runs eval against dataset@version + system@build
  • Store (dataset_version, system_version, scores) for audit replay
  • Never edit a case in place after it gates a release — append a new case or supersede with replaces: "ctx-001"
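
Two of these rules can be checked in CI: that no gated case was edited in place, and that superseded cases drop out of the active set. The following sketch is one possible approach, assuming a content fingerprint is recorded for each case when a release is gated. The function names, the `locked` mapping, the successor id `ctx-001b`, and the `policy-wire-limits-v4` source are hypothetical.

```python
import hashlib
import json


def fingerprint(case):
    """Stable content hash of a case (keys sorted so field order is irrelevant)."""
    canonical = json.dumps(case, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def edited_in_place(locked, cases):
    """Return ids whose content differs from the fingerprint locked at gate time."""
    return sorted(
        case["id"]
        for case in cases
        if case["id"] in locked and fingerprint(case) != locked[case["id"]]
    )


def active_cases(cases):
    """Drop any case that another case supersedes through its 'replaces' field."""
    superseded = {case["replaces"] for case in cases if case.get("replaces")}
    return [case for case in cases if case["id"] not in superseded]


original = {"id": "ctx-001", "expected": {"must_cite": ["policy-wire-limits-v3"]}}
locked = {"ctx-001": fingerprint(original)}  # recorded when the release was gated

# Correct change: append a successor that supersedes the old case.
successor = {
    "id": "ctx-001b",
    "replaces": "ctx-001",
    "expected": {"must_cite": ["policy-wire-limits-v4"]},
}
print(edited_in_place(locked, [original, successor]))            # []
print([c["id"] for c in active_cases([original, successor])])    # ['ctx-001b']

# Incorrect change: the gated case itself was modified.
tampered = {"id": "ctx-001", "expected": {"must_cite": []}}
print(edited_in_place(locked, [tampered]))                        # ['ctx-001']
```

Storing the fingerprints alongside the `(dataset_version, system_version, scores)` record is one way to let an audit replay confirm that the cases it runs are the ones that gated the release.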

Splitting datasets for judge calibration

| Split | Use |
| --- | --- |
| Calibration (15%) | Tune LLM-as-judge against human scores |
| Holdout (15%) | Final judge validation; never tune on this |
| Gate (70%) | CI/CD release gate |
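
A split is only useful if a case stays in the same bucket every time the dataset is rebuilt; otherwise holdout cases can drift into calibration. One common way to get that stability, sketched here as an assumption rather than the method this series prescribes, is to hash the case id into a bucket. The `salt` parameter and the function name are hypothetical.

```python
import hashlib
from collections import Counter


def assign_split(case_id, salt="split-v1"):
    """Map a case id to a split deterministically, targeting 15/15/70."""
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    if bucket < 15:
        return "calibration"
    if bucket < 30:
        return "holdout"
    return "gate"


# The same id always lands in the same split, across runs and machines.
print(assign_split("ctx-001") == assign_split("ctx-001"))  # True

# Proportions are approximate, not exact, especially for small datasets.
split_counts = Counter(assign_split(f"case-{i}") for i in range(10_000))
print(split_counts)
```

Because assignment depends only on the id and the salt, adding new cases never moves existing ones between splits. With small datasets the realized proportions can differ noticeably from 15/15/70, so it is worth checking the counts after assignment.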

Anti-patterns

| Anti-pattern | Why it fails |
| --- | --- |
| Single "accuracy" label per case | Hides which plane broke |
| Copy-paste from public benchmarks | Domain and policy mismatch |
| No adversarial layer | Production surprises are guaranteed |
| Stale dataset (>90 days, no replay) | Drift makes scores meaningless |
| Editing cases to make CI green | You are lying to the gate |
