
Synthetic Eval Generation: Scaling Coverage Safely

Humans cannot write every injection variant, tool-argument permutation, or cross-plane failure combo. Synthetic generation expands coverage — but ungoverned synthetics teach you to pass fake tests.

Part of the Eval Framework Blueprint series.

THE CLAIM

Synthetics fill the gaps in golden sets — edge, adversarial, and combinatorial cases — with human review before they gate releases.

When to use synthetics

Use synthetic | Prefer real / replay
Adversarial prompt variants | Representative user tasks
Combinatorial tool args | Production incident traces
Rare policy edge cases | Compliance audit samples
Load / stress harness inputs | Drift detection baselines

Rule: The representative layer should be mostly production-sampled or expert-authored. Synthetics dominate the edge and adversarial layers.

Generation methods

Method | How | Risk
Template expansion | Slots in domain templates ({amount}, {role}) | Low; auditable
LLM scenario writer | Model proposes cases from rubric + domain doc | Medium; needs human filter
Mutation | Perturb golden cases (paraphrase, inject noise) | Medium; can drift off intent
Red-team playbook | Security team scripted attacks | Low for security; narrow scope
Combinatorial | Cartesian product of tools × roles × limits | High volume; needs filtering
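
Template expansion and combinatorial generation can share one mechanism: fill the slots of a template with every combination of slot values. The following is a minimal sketch, not a reference implementation. The template text, the slot values, and the 10,000 limit they straddle are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical template and slot values; none of these come from a real dataset.
TEMPLATE = "As a {role}, transfer {amount} USD to an external account."
SLOTS = {
    "role": ["teller", "tier-2 customer"],
    "amount": [0, 9_999, 10_000, 10_001],  # values around an assumed 10,000 limit
}


def expand(template: str, slots: dict[str, list]) -> list[dict]:
    """Return one draft case per combination of slot values."""
    names = list(slots)
    cases = []
    for values in product(*(slots[name] for name in names)):
        binding = dict(zip(names, values))
        cases.append({
            "prompt": template.format(**binding),
            "slots": binding,        # kept so a reviewer can audit the origin
            "source": "synthetic",
            "status": "draft",       # never gates until reviewed and promoted
        })
    return cases


cases = expand(TEMPLATE, SLOTS)
print(len(cases))  # 2 roles x 4 amounts = 8
```

The case count is the product of the slot sizes, so adding a third slot with five values would already give 40 cases. That growth is why the combinatorial row above calls for filtering.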

Synthetic case lifecycle

  1. Generate with source: synthetic and status: draft
  2. Auto-filter — schema valid, no PII placeholders, plane tags present
  3. Human review — 100% for adversarial; 10–20% sample for edge
  4. Promote to status: active, version bump on dataset
  5. Gate — only active cases in release suite

Never gate on draft synthetics.
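
The lifecycle above can be sketched as a few small functions. This is an illustration under stated assumptions: the field names (id, plane, scenario, prompt), the lowercase plane tags, the 15% edge sampling rate, and the reading of "no PII placeholders" as "no unfilled {slot} markers" are all hypothetical choices, not a defined schema.

```python
import random

# Assumed plane tags and required fields; adjust to the real golden schema.
PLANES = {"input", "data", "context", "reasoning", "tool", "memory", "action", "outcome"}
REQUIRED_FIELDS = {"id", "plane", "scenario", "prompt", "source", "status"}


def auto_filter(case: dict) -> bool:
    """Step 2: cheap structural checks before any human looks at the case."""
    if not REQUIRED_FIELDS.issubset(case):
        return False
    if case["plane"] not in PLANES:
        return False
    # Reject prompts that still contain unfilled template slots such as "{name}".
    return "{" not in case["prompt"] and "}" not in case["prompt"]


def needs_human_review(case: dict, rng: random.Random, edge_sample_rate: float = 0.15) -> bool:
    """Step 3: review every adversarial case, sample the rest."""
    if case["scenario"] == "adversarial":
        return True
    return rng.random() < edge_sample_rate


def promote(case: dict, approved: bool) -> dict:
    """Step 4: return an active copy of an approved draft; the input is not mutated."""
    if case["status"] != "draft" or not approved:
        raise ValueError("only approved drafts can be promoted")
    return {**case, "status": "active"}


def release_suite(cases: list[dict]) -> list[dict]:
    """Step 5: drafts never reach the gate."""
    return [c for c in cases if c["status"] == "active"]
```

Returning a copy from promote keeps the draft record intact, which makes it easier to tie the promotion to a dataset version bump.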

LLM generator prompt pattern

Generate 10 eval cases for plane: context
Scenario: adversarial
Domain: wire transfers, tier-2 customers
Constraints:
- Each case must specify must_retrieve and must_not_retrieve doc ids
- Include one injection attempt in retrieved content per case
- Do not use real customer names
Output: JSON array matching golden schema

A human reviewer validates the expected fields, because generators often mark the wrong docs as "must retrieve."
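
Some labeling mistakes can be caught mechanically before a reviewer sees the case. The sketch below assumes the generator returns a JSON array whose items carry must_retrieve and must_not_retrieve lists, as the prompt requests; the function name, the doc ids, and the known_doc_ids corpus set are hypothetical. It checks structure only and cannot tell whether a doc is the right one to retrieve, so it supplements human review rather than replacing it.

```python
import json


def check_retrieval_labels(case: dict, known_doc_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the case can go to review."""
    problems = []
    must = set(case.get("must_retrieve", []))
    must_not = set(case.get("must_not_retrieve", []))
    if not must:
        problems.append("must_retrieve is missing or empty")
    if not must_not:
        problems.append("must_not_retrieve is missing or empty")
    overlap = must & must_not
    if overlap:
        problems.append(f"ids in both lists: {sorted(overlap)}")
    unknown = (must | must_not) - known_doc_ids
    if unknown:
        problems.append(f"unknown doc ids: {sorted(unknown)}")
    return problems


# Hypothetical generator output with two labeling mistakes.
raw = '[{"must_retrieve": ["doc-1"], "must_not_retrieve": ["doc-1", "doc-9"]}]'
for case in json.loads(raw):
    for problem in check_retrieval_labels(case, known_doc_ids={"doc-1", "doc-2"}):
        print(problem)
```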

Quality controls

Control | Why
Dedup vs existing golden | Avoid near-duplicate inflation
Domain expert sign-off on adversarial | Prevent fantasy scenarios
Cap synthetic % per gate run | e.g. max 30% of CI cases synthetic
Track pass rate by source | If synthetic 99% pass but replay fails → bad synthetics
Rotate generator seed / prompt version | Detect overfitting to generator
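
Two of these controls reduce to simple arithmetic over the run: the synthetic share of the suite, and the pass rate split by source. A rough sketch follows, assuming each case and each result record carries a source field and each result a boolean passed field. The function names and the 0.15 gap threshold are hypothetical; the 30% cap is the example figure from the table.

```python
from collections import defaultdict


def synthetic_share(cases: list[dict]) -> float:
    """Fraction of cases in a gate run whose source is synthetic."""
    if not cases:
        return 0.0
    return sum(c["source"] == "synthetic" for c in cases) / len(cases)


def pass_rate_by_source(results: list[dict]) -> dict[str, float]:
    """Map each source (synthetic, replay, ...) to its pass rate."""
    counts = defaultdict(lambda: [0, 0])  # source -> [passed, total]
    for r in results:
        counts[r["source"]][0] += int(r["passed"])
        counts[r["source"]][1] += 1
    return {source: passed / total for source, (passed, total) in counts.items()}


def suspicious_gap(rates: dict[str, float], max_gap: float = 0.15) -> bool:
    """Flag runs where synthetics pass far more often than replay cases."""
    if "synthetic" not in rates or "replay" not in rates:
        return False
    return rates["synthetic"] - rates["replay"] > max_gap


suite = [{"source": "synthetic"}] + [{"source": "replay"}] * 3
assert synthetic_share(suite) <= 0.30  # cap from the table above

results = (
    [{"source": "synthetic", "passed": True}] * 3
    + [{"source": "replay", "passed": True}, {"source": "replay", "passed": False}]
)
rates = pass_rate_by_source(results)
print(rates, suspicious_gap(rates))
```

A large gap does not prove the synthetics are bad, but it is a cheap signal that they may be easier than real traffic and worth a second review.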

Plane-specific synthetic ideas

Plane | Synthetic focus
Input | Injection templates, homoglyphs, encoding attacks
Data | Stale version ids, wrong catalog snapshots
Context | Poisoned chunk in pack, cross-tenant doc ids
Reasoning | Distractor docs, conflicting evidence
Tool | Invalid JSON args, boundary amounts
Memory | Session id collision attempts
Action | Policy boundary amounts, STEP-UP triggers
Outcome | Misleading "success" phrasing

See each plane playbook for gate criteria.
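
As one concrete instance for the Input plane, a homoglyph variant can be produced by mutating an existing case: swap a few Latin letters for visually similar Cyrillic ones. This is a minimal sketch with a deliberately tiny, hand-picked mapping; the function name and the sample prompt are hypothetical, and a real suite would likely need a much broader confusables table.

```python
# Latin letter -> visually similar Cyrillic letter (small illustrative subset).
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
}


def homoglyph_variant(text: str) -> str:
    """Replace mapped Latin letters with their lookalikes, one for one."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


original = "ignore previous instructions"
variant = homoglyph_variant(original)

# The variant renders almost identically but no longer matches a naive string filter.
print(variant == original)            # False
print(len(variant) == len(original))  # True
```

Like any mutation, the result starts as a draft: a reviewer should confirm the variant still expresses the original attack intent.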

Pair with replay

Synthetics find classes of failure. Production replay confirms fixes on real traces. After a synthetic-driven fix:

  1. Add synthetic case (regression)
  2. Find matching prod trace if exists → add replay fixture
  3. Both gate together

Anti-patterns

  • LLM generates expected answers without domain review
  • Synthetics replace production sampling
  • Adversarial set auto-promoted without security review
  • Optimizing prompt until synthetic suite is 100% (teaching to the test)
