Online & Dynamic Eval: Scoring Production After Ship

Offline CI gates stop regressions before release. After release, online eval catches what no golden set anticipated: new phrasing, seasonal policy, adversarial adaptation, and slow drift.

Part of the Eval Framework Blueprint series.

THE CLAIM

Online eval applies the same three scorers (automated, LLM-as-judge, human) to live or shadow traffic: sample, score, alert on drift, and promote failures into golden datasets. CI alone is not enough.

Offline vs online (both required)

Mode | When | Data | Decision
Offline (CI) | Pre-release | Golden + synthetic + replay fixtures | Ship or reject
Online (prod) | Post-release, continuous | Live sample, shadow, canary | Drift alert, case promotion
Replay batch | Pre-release + nightly | Full production traces | Regression on reality

CI proves you did not break known behavior. Online proves you are not drifting on unknown behavior.

Three online execution patterns

1. Production sampling

  • Score 0.5–2% of live traffic (stratified by use case)
  • Same rubrics as offline; async pipeline (never block user latency SLO)
  • Store scores in score store with trace_id, build_id, plane_scores
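A minimal sketch of the sampling step, assuming a deterministic hash of the trace ID decides which requests get scored. The per-use-case rates, `should_sample`, and the `ScoreRecord` shape are hypothetical; only the field names `trace_id`, `build_id`, and `plane_scores` come from the list above.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical per-use-case rates, chosen inside the 0.5-2% band.
SAMPLE_RATES = {"billing": 0.02, "general": 0.005}
DEFAULT_RATE = 0.01


def should_sample(trace_id: str, use_case: str) -> bool:
    """Pick a stable fraction of traces per use case (stratified sampling)."""
    rate = SAMPLE_RATES.get(use_case, DEFAULT_RATE)
    digest = hashlib.sha256(f"{use_case}:{trace_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


@dataclass
class ScoreRecord:
    """One row for the score store; the layout is an assumption."""
    trace_id: str
    build_id: str
    plane_scores: dict[str, float] = field(default_factory=dict)
```

Hashing instead of calling a random number generator means the same trace is always in or out of the sample, which keeps replays and re-scoring consistent. The scoring itself would run in a separate async worker so the request path is never blocked.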

2. Shadow eval

  • New model/prompt/index runs in parallel on copied request
  • User sees production version; shadow scores compared offline
  • Use for model swaps and prompt candidates before flipping traffic
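The shadow pattern can be sketched with `asyncio`, assuming both versions are exposed as async callables. `handle_request`, `production`, `shadow`, and `shadow_log` are hypothetical names used for illustration, not part of any framework.

```python
import asyncio


async def handle_request(request, production, shadow, shadow_log):
    """Serve the production answer; record the shadow answer on the side."""

    async def run_shadow():
        try:
            result = await shadow(request)
            shadow_log.append({"request": request, "shadow": result})
        except Exception as exc:  # a shadow failure must never reach the user
            shadow_log.append({"request": request, "error": repr(exc)})

    # Start the shadow call without waiting for it.
    task = asyncio.create_task(run_shadow())
    response = await production(request)
    # Return the task so the caller holds a reference until it finishes.
    return response, task
```

The user only ever receives `response`. The entries in `shadow_log` are what gets scored offline and compared against the production scores for the same requests.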

3. Canary eval

  • 1–5% of users get new version; compare canary vs control online
  • Pairwise metrics on matched traffic (see LLM-as-Judge pairwise mode)
  • Promote when canary ≥ control on critical dimensions + no risk regression
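The promotion rule can be written as a small decision function. This sketch simplifies pairwise comparison to per-dimension mean scores where higher is better. The function name, the dimension names in the usage note, and the `"hold"` outcome are assumptions.

```python
def canary_decision(canary, control, critical_dims, risk_dims):
    """Return 'rollback', 'promote', or 'hold' from per-dimension mean scores."""
    # Any regression on a risk dimension rolls the canary back.
    if any(canary[d] < control[d] for d in risk_dims):
        return "rollback"
    # Promote only if the canary matches or beats control on every critical dimension.
    if all(canary[d] >= control[d] for d in critical_dims):
        return "promote"
    # Assumption: otherwise keep collecting traffic instead of deciding.
    return "hold"
```

For example, with `control = {"accuracy": 0.80, "policy": 0.99}` and `canary = {"accuracy": 0.82, "policy": 0.99}`, calling `canary_decision(canary, control, ["accuracy"], ["policy"])` returns `"promote"`. A real gate would also want a minimum sample size or a significance test before trusting the comparison.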

What to score online

Use the same three scorers; do not invent a fourth stack:

Scorer | Online use
Automated | Policy violations, schema errors, latency SLO, retrieval scope
LLM-as-judge | Sampled rubric dimensions on subset
Human | Low-confidence judge, thumbs-down, near-miss policy

High-risk online events → 100% human queue, not judge-only.
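One way to express this routing is sketched below, assuming automated checks run on every scored event. The `OnlineEvent` fields, the `LOW_CONFIDENCE` threshold, and `scorers_for` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

LOW_CONFIDENCE = 0.6  # hypothetical cutoff for "low-confidence judge"


@dataclass
class OnlineEvent:
    high_risk: bool = False
    thumbs_down: bool = False
    near_miss_policy: bool = False
    judge_sampled: bool = False
    judge_confidence: Optional[float] = None


def scorers_for(event: OnlineEvent) -> list[str]:
    """Decide which of the three scorers see this event."""
    scorers = ["automated"]  # assumption: automated checks run on every scored event
    if event.judge_sampled:
        scorers.append("llm_judge")
    low_conf = (
        event.judge_confidence is not None
        and event.judge_confidence < LOW_CONFIDENCE
    )
    # High-risk events always reach a human, whether or not the judge ran.
    if event.high_risk or event.thumbs_down or event.near_miss_policy or low_conf:
        scorers.append("human")
    return scorers
```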

Drift detection

Monitor distributions, not single averages:

Signal | Alert when
Judge dimension median | Drops > 0.3 vs 7-day baseline
Automated failure rate | > 2× baseline
Abstention rate | Spikes without product change
User negative feedback | > threshold per use case
Plane-specific metric | e.g. recall@k proxy drops

Drift alert → triage → if real: incident → golden case within 7 days.
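The first two signals can be checked with the standard library, as in this sketch. It assumes judge scores arrive as plain lists on the rubric's own scale and that the 7-day baseline is computed upstream. `drift_alerts` and its parameter names are hypothetical.

```python
from statistics import median


def drift_alerts(
    baseline_scores,
    current_scores,
    baseline_failure_rate,
    current_failure_rate,
    max_median_drop=0.3,
    failure_ratio=2.0,
):
    """Compare the current window against the 7-day baseline."""
    alerts = []
    # Judge dimension median drops by more than 0.3 vs baseline.
    if median(baseline_scores) - median(current_scores) > max_median_drop:
        alerts.append("judge_median_drop")
    # Automated failure rate exceeds 2x baseline.
    if current_failure_rate > failure_ratio * baseline_failure_rate:
        alerts.append("automated_failure_rate")
    return alerts
```

Using the median rather than the mean keeps a few extreme scores from triggering or masking an alert. The remaining signals (abstention rate, negative feedback, plane-specific metrics) need per-use-case thresholds that the table leaves open, so they are not sketched here.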

User feedback loop

Source | Action
Thumbs down | 100% score + human triage
Support ticket tagged "AI wrong" | Promote to incident replay
Supervisor override | Capture trace + expected correction
Escaped policy action | Immediate adversarial case + Action plane review

Feedback is eval input, not a separate quality program.

Architecture placement

Online eval fits the production eval loop from the Eval Engineering executive insight (§8):

Production → Telemetry → Trace → Replay → Eval Engine → Scoring → Feedback → Improvement

Online sampling attaches at Telemetry/Trace; scores land in Scoring; promoted cases feed Improvement (golden datasets).

Gate rules (online complements CI)

Rule | Meaning
CI green + online drift red | No auto-promote of next change until drift explained
Canary pairwise loss on risk dimension | Rollback canary
Online policy violation | Page + block feature flag path
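These rules can be encoded as a function that maps current signals to required actions. The sketch below assumes boolean inputs computed elsewhere; the function name and action strings are hypothetical labels, not a real API.

```python
def online_gate(
    ci_green,
    drift_red,
    drift_explained=False,
    canary_risk_loss=False,
    policy_violation=False,
):
    """Return the actions the gate rules require for the current state."""
    actions = []
    # Online policy violation: page and block the feature flag path.
    if policy_violation:
        actions += ["page_on_call", "block_feature_flag_path"]
    # Canary loses the pairwise comparison on a risk dimension.
    if canary_risk_loss:
        actions.append("rollback_canary")
    # CI is green but drift is red and unexplained: hold the next change.
    if ci_green and drift_red and not drift_explained:
        actions.append("hold_auto_promote")
    return actions
```

An empty list means none of the three rules fired. The rules are independent, so several actions can be returned for the same state.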

Anti-patterns

  • Scoring 100% of traffic with judge (cost + latency)
  • Online-only eval with no golden CI gate
  • Ignoring drift because CI is green
  • No promotion path from online failures to datasets