Synthetic Eval Generation: Scaling Coverage Safely

Humans cannot write every injection variant, tool-argument permutation, or cross-plane failure combo. Synthetic generation expands coverage — but ungoverned synthetics teach you to pass fake tests.

Part of the Eval Framework Blueprint series.

THE CLAIM

Synthetics fill the gaps in golden sets — edge, adversarial, and combinatorial cases — with human review before they gate releases.

When to use synthetics

Use synthetic	Prefer real / replay
Adversarial prompt variants	Representative user tasks
Combinatorial tool args	Production incident traces
Rare policy edge cases	Compliance audit samples
Load / stress harness inputs	Drift detection baselines

Rule: Representative layer should be mostly production-sampled or expert-authored. Synthetics dominate edge + adversarial layers.

Generation methods

Method	How	Risk
Template expansion	Slots in domain templates (`{amount}`, `{role}`)	Low — auditable
LLM scenario writer	Model proposes cases from rubric + domain doc	Medium — needs human filter
Mutation	Perturb golden cases (paraphrase, inject noise)	Medium — can drift off intent
Red-team playbook	Security team scripted attacks	Low for security; narrow scope
Combinatorial	Cartesian product of tools × roles × limits	High volume; needs filtering

Synthetic case lifecycle

Generate with source: synthetic and status: draft
Auto-filter — schema valid, no PII placeholders, plane tags present
Human review — 100% for adversarial; 10–20% sample for edge
Promote — status: active, version bump on dataset
Gate — only active cases in release suite

Never gate on draft synthetics.

LLM generator prompt pattern

Generate 10 eval cases for plane: context
Scenario: adversarial
Domain: wire transfers, tier-2 customers
Constraints:
- Each case must specify must_retrieve and must_not_retrieve doc ids
- Include one injection attempt in retrieved content per case
- Do not use real customer names
Output: JSON array matching golden schema

Human reviewer validates expected fields — generators often mark wrong docs as "must retrieve."

Quality controls

Control	Why
Dedup vs existing golden	Avoid near-duplicate inflation
Domain expert sign-off on adversarial	Prevent fantasy scenarios
Cap synthetic % per gate run	e.g. max 30% of CI cases synthetic
Track pass rate by `source`	If synthetic 99% pass but replay fails → bad synthetics
Rotate generator seed / prompt version	Detect overfitting to generator

Plane-specific synthetic ideas

Plane	Synthetic focus
Input	Injection templates, homoglyphs, encoding attacks
Data	Stale version ids, wrong catalog snapshots
Context	Poisoned chunk in pack, cross-tenant doc ids
Reasoning	Distractor docs, conflicting evidence
Tool	Invalid JSON args, boundary amounts
Memory	Session id collision attempts
Action	Policy boundary amounts, STEP-UP triggers
Outcome	Misleading "success" phrasing

See each plane playbook for gate criteria.

Pair with replay

Synthetics find classes of failure. Production replay confirms fixes on real traces. After synthetic-driven fix:

Add synthetic case (regression)
Find matching prod trace if exists → add replay fixture
Both gate together

Anti-patterns

LLM generates expected answers without domain review
Synthetics replace production sampling
Adversarial set auto-promoted without security review
Optimizing prompt until synthetic suite is 100% (teaching to the test)

Next in series

Golden Datasets — schema and layers
Online & Dynamic Eval — prod feedback
Eval Framework Blueprint

When to use synthetics​

Generation methods​

Synthetic case lifecycle​

LLM generator prompt pattern​

Quality controls​

Plane-specific synthetic ideas​

Pair with replay​

Anti-patterns​

Next in series​