Eval Plane ②: Data

The Data plane is the source-of-truth layer behind your AI system. It owns what exists in the world the model can cite: document catalogs, APIs, knowledge bases, embedding indexes, chunk stores, and the pipelines that ingest, transform, and publish them.

It answers four questions that Context and Reasoning assume are already true:

  1. Correctness — is the material factually right?
  2. Freshness — is it current enough for the use case?
  3. Lineage — can every chunk be traced to an authoritative source?
  4. Access — is only entitled material indexed and reachable?

When the Data plane fails, downstream planes often look fine. Retrieval returns chunks confidently. The model grounds answers in those chunks. The outcome reads polished, but it is built on stale policy, the wrong catalog, or material a principal should never see. That is why Data eval is not "did we find a chunk?" It is "is the corpus itself trustworthy before query time begins?"

THE CLAIM

Eval the Data plane for freshness, lineage, and access — not just whether a chunk exists in the index.

What the Data plane owns (and what it does not)

| In scope (Data) | Out of scope (other planes) |
|---|---|
| Source ingestion, normalization, chunking | Query parsing, intent routing (Input) |
| Catalog / schema of documents and APIs | Retrieve, rank, pack for one call (Context) |
| ACL at ingest and index-time entitlement | Whether the model reasoned correctly (Reasoning) |
| Index builds, embeddings, tombstones | Tool calls and side effects (Tool, Action) |
| Freshness SLAs, reindex jobs, deletion propagation | Whether the user got a useful answer (Outcome) |

The vector store is a component in the Data plane, not the whole plane. Buying a better embedding model does not fix stale sources, missing tombstones, or dev data in prod.

Data ≠ Context (eval them separately)

Teams often merge Data and Context into one "RAG eval." That hides who owns the fix and sends you tuning the ranker when the index is stale — or re-indexing everything when scoped retrieval is broken.

| Question | Plane | When it runs |
|---|---|---|
| Is policy v4 in the index? Was v3 tombstoned? | Data | Async pipelines, nightly gates |
| For this user and this question, is policy v4 in top-k? | Context | Per inference, CI + online sample |

Same bad answer, different root cause:

  • User gets 2023 refund rules → Data failure (stale source indexed)
  • User gets 2025 rules but ranked fifth → Context failure (recall@k)
  • HR doc visible to non-HR user in the index → Data failure (ACL at ingest)
  • HR doc indexed correctly but retrieved for wrong principal → Context failure (filter at query time)

Eval gates should be able to read data green, context red (or the reverse). See Context plane eval → and RAG Is Not a Database.
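
The four cases above can be sketched as a small triage function. This is a minimal illustration under stated assumptions, not a prescribed implementation: `AnswerEvidence`, its fields, and `triage` are hypothetical names, and the sketch assumes you can already establish each fact from traces and index metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerEvidence:
    """Facts gathered about one bad answer (all field names are hypothetical)."""
    current_version_indexed: bool   # is the authoritative version in the index?
    stale_version_indexed: bool     # is a superseded version still searchable?
    acl_metadata_correct: bool      # was ACL metadata attached correctly at ingest?
    current_version_in_top_k: bool  # did retrieval surface the current version?
    query_filter_applied: bool      # was the principal filter applied at query time?


def triage(e: AnswerEvidence) -> str:
    """Return the plane that owns the fix, checking Data before Context."""
    # Data owns the substrate: a stale or missing source, or bad ACL metadata.
    if not e.current_version_indexed or e.stale_version_indexed:
        return "data"
    if not e.acl_metadata_correct:
        return "data"
    # The corpus is sound, so remaining failures belong to per-query retrieval.
    if not e.current_version_in_top_k or not e.query_filter_applied:
        return "context"
    return "neither"


# Current rules are indexed and the old version is gone, but the chunk missed top-k.
ranked_too_low = AnswerEvidence(True, False, True, False, True)
print(triage(ranked_too_low))  # context
```

Checking Data first is a deliberate choice in this sketch: a Context fix cannot compensate for a corpus that is already wrong.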

The Data lifecycle (where evals attach)

Data quality is mostly pipeline and index work, not per-request magic. Map evals to each stage:

| Stage | What can go wrong | Eval type |
|---|---|---|
| Source publish | Wrong version promoted, API returns stale JSON | Golden reference vs live source |
| Ingest / transform | Bad chunking drops tables, wrong metadata | Fixture docs → expected chunk boundaries |
| ACL / entitlement | Doc indexed without role filter | Principal × doc matrix |
| Index build | Partial build, wrong catalog env | index_build_id, row counts, hash checks |
| Deletion / tombstone | Revoked doc still searchable | Tombstone tests after delete events |
| Drift over time | SLA breach on time-sensitive corpora | Scheduled freshness probes |

Most Data gates run offline on index snapshots plus scheduled probes — not on every user request. Context eval still runs per query; Data eval proves the substrate those queries run on.
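
As one illustration of an offline gate on an index snapshot, the index-build stage could be checked roughly as follows. `BuildManifest` and `check_build` are hypothetical names, and the sketch assumes your build job emits a summary with an environment tag, a row count, and a content hash.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BuildManifest:
    """Summary emitted by an index build job (hypothetical shape)."""
    index_build_id: str
    catalog_env: str
    row_count: int
    content_hash: str


def check_build(manifest: BuildManifest, *, expected_env: str,
                min_rows: int, max_rows: int, expected_hash: str) -> List[str]:
    """Return human-readable problems; an empty list means the build passes."""
    problems = []
    if manifest.catalog_env != expected_env:
        problems.append(f"wrong catalog env: {manifest.catalog_env!r}")
    if not min_rows <= manifest.row_count <= max_rows:
        # A count far below the bound usually points at a partial build.
        problems.append(
            f"row count {manifest.row_count} outside [{min_rows}, {max_rows}]"
        )
    if manifest.content_hash != expected_hash:
        problems.append("content hash differs from the expected snapshot")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single gate run report every defect in a bad build.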

What to evaluate

| Signal | What it means | Pass criteria |
|---|---|---|
| Freshness | Indexed version matches authoritative source within SLA | source_version on chunks matches golden; time-sensitive cases within TTL |
| Lineage | Every chunk maps to source id, version, and transform step | 100% of sampled chunks resolve in catalog |
| Access | Only entitled documents exist in the searchable set for each principal class | ACL adversarial matrix: 0 violations |
| Correctness | Facts in indexed extracts match domain reference | Spot-check vs golden extracts; hash match on fixtures |
| Deletion | Revoked material absent from index | Tombstone within SLA after delete event |
| Catalog integrity | Right environment, right corpus | No cross-env leakage (dev in prod) |

Freshness SLAs are use-case specific. A product FAQ might tolerate 24h lag; a rate table or compliance policy might require minutes after publish. Define SLAs per corpus, not one global number.
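
A per-corpus SLA table and a freshness check might look like the sketch below. The corpus names, the `FRESHNESS_SLA` mapping, and `within_sla` are assumptions made for illustration; the durations only echo the examples above and are not recommendations.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical per-corpus SLAs (illustrative values only).
FRESHNESS_SLA = {
    "product_faq": timedelta(hours=24),
    "rate_table": timedelta(minutes=5),
}


def within_sla(corpus: str, source_published_at: datetime,
               indexed_at: Optional[datetime], now: datetime) -> bool:
    """True if the published version reached the index within the corpus SLA,
    or if it is not indexed yet but the SLA window is still open."""
    sla = FRESHNESS_SLA[corpus]  # KeyError on an unknown corpus: no global default
    if indexed_at is not None and indexed_at >= source_published_at:
        return indexed_at - source_published_at <= sla
    # Not indexed yet: a breach only once the SLA window has elapsed.
    return now - source_published_at <= sla


published = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
half_hour_later = published + timedelta(minutes=30)
print(within_sla("product_faq", published, None, half_hour_later))  # True
print(within_sla("rate_table", published, None, half_hour_later))   # False
```

The same 30-minute lag passes for one corpus and fails for the other, which is the reason to avoid a single global number.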

Failure classes

| Class | Example | How you usually catch it |
|---|---|---|
| Stale source | Refund policy v3 indexed; v4 live in CMS for a week | Freshness probe; incident replay |
| Wrong catalog | Staging index attached to prod router | index_build_id / env tags in trace |
| Entitlement gap | HR-only PDF embedded without ACL metadata | Principal × doc adversarial set |
| Bad chunking | Rate table split across chunks; model sees partial numbers | Fixture ingest → expected chunk boundaries |
| Tombstone miss | Deleted customer PII still retrievable | Delete event → search still returns doc |
| Lineage break | Chunk has no source_id; audit cannot reconstruct | Monthly lineage sample |

Tag production incidents with these classes. Promote each to a golden case within a week (Golden Datasets).
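
To make the one-week promotion rule checkable, incidents can carry a failure-class tag and an optional golden-case link. The sketch below is hypothetical throughout: the `Incident` shape and `overdue_promotions` are invented for illustration, and the enum values are assumed snake_case forms of the class names in the table.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import List, Optional


class FailureClass(Enum):
    STALE_SOURCE = "stale_source"
    WRONG_CATALOG = "wrong_catalog"
    ENTITLEMENT_GAP = "entitlement_gap"
    BAD_CHUNKING = "bad_chunking"
    TOMBSTONE_MISS = "tombstone_miss"
    LINEAGE_BREAK = "lineage_break"


@dataclass(frozen=True)
class Incident:
    incident_id: str
    failure_class: FailureClass
    opened_on: date
    golden_case_id: Optional[str] = None  # set once the incident is promoted


def overdue_promotions(incidents: List[Incident], today: date,
                       window: timedelta = timedelta(days=7)) -> List[Incident]:
    """Incidents still lacking a golden case more than `window` after opening."""
    return [
        i for i in incidents
        if i.golden_case_id is None and today - i.opened_on > window
    ]
```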

Golden dataset examples

Build slices per corpus (policy, product, support, rates) — not one generic "RAG set."

| Scenario | Setup | Expected |
|---|---|---|
| Representative | Query corpus for current policy v3 | All returned chunk metadata: source_version: v3, entitled principal |
| Edge | Doc updated 1h ago in CMS | Index freshness within SLA for that corpus |
| Adversarial | Principal without HR role | No HR-only doc_id in searchable set for that principal class |
| Adversarial | Tombstone after legal hold lift | Doc absent from index within SLA |
| Incident replay | Stale rate table caused wrong answer in prod | v4 present and hash-matched after reindex |

Schema fields every Data golden case should carry:

    {
      "plane": "data",
      "corpus": "refund_policy",
      "principal_class": "customer_tier_1",
      "expected": {
        "source_versions": ["v3"],
        "forbidden_doc_ids": ["hr-handbook-2024"],
        "max_staleness_minutes": 60
      },
      "failure_class": null,
      "source": "incident_replay"
    }
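
A golden-case file is only useful if every case actually carries these fields, so a small validator can run before the eval itself. This is a sketch: `validate_golden_case` is a hypothetical helper, and treating all six top-level keys as required is an assumption drawn from the example above.

```python
from typing import List

REQUIRED_TOP_LEVEL = {
    "plane", "corpus", "principal_class", "expected", "failure_class", "source",
}
REQUIRED_EXPECTED = {"source_versions", "forbidden_doc_ids", "max_staleness_minutes"}


def validate_golden_case(case: dict) -> List[str]:
    """Return schema problems for one Data golden case; empty means valid."""
    problems = [
        f"missing field: {key}" for key in sorted(REQUIRED_TOP_LEVEL - set(case))
    ]
    if case.get("plane") != "data":
        problems.append("plane must be 'data'")
    expected = case.get("expected") or {}
    problems += [
        f"missing expected.{key}" for key in sorted(REQUIRED_EXPECTED - set(expected))
    ]
    staleness = expected.get("max_staleness_minutes")
    if staleness is not None and (not isinstance(staleness, int) or staleness <= 0):
        problems.append("expected.max_staleness_minutes must be a positive integer")
    return problems
```

A key that is present with a null value, such as `failure_class` above, counts as present; only absent keys are reported.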

Automated checks (primary scorer for Data)

Data eval should be mostly deterministic. Judges and humans supplement; they do not replace ACL and version checks.

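    import hashlib

    # Sketch only: index, index_sample, publish_delete_event, wait_until_within_sla,
    # and the upper-case fixtures stand in for your own harness.

    # Version assertion on indexed chunks
    for chunk in index_sample(corpus="refund_policy"):
        assert chunk.source_version in ALLOWED_VERSIONS

    # Principal × document entitlement matrix (forbidden_ids is a set of doc ids)
    for principal, forbidden_ids in ACL_ADVERSARIAL_FIXTURES:
        searchable = index.searchable_doc_ids(principal)
        assert forbidden_ids.isdisjoint(searchable)

    # Tombstone after deletion: the doc must be gone for every principal class
    doc_id = "policy-v3"
    publish_delete_event(doc_id=doc_id)
    wait_until_within_sla()
    for principal in ALL_PRINCIPAL_CLASSES:  # assumed fixture, replaces any_principal
        assert doc_id not in index.searchable_doc_ids(principal)

    # Golden extract hash (correctness on fixtures; assumes extracts are str)
    for fixture_id, expected_hash in FIXTURE_HASHES.items():
        extract = index.get_extract(fixture_id)
        assert hashlib.sha256(extract.encode("utf-8")).hexdigest() == expected_hash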

Also enforce:

  • source_version, source_id, ingest_timestamp on every chunk
  • index_build_id and catalog_env on every trace that touches retrieval
  • Row-count and corpus-size bounds after each build (catch partial failures)
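
The per-chunk metadata rule can be enforced with a scan like the one below. It is a minimal sketch that assumes chunks are exposed as dictionaries with a `chunk_id` key; `chunks_missing_metadata` is a hypothetical helper, not part of any particular vector store.

```python
REQUIRED_CHUNK_FIELDS = ("source_version", "source_id", "ingest_timestamp")


def chunks_missing_metadata(chunks):
    """Yield (chunk_id, missing_fields) for each chunk lacking required metadata.

    Empty strings and None count as missing, since neither supports an audit.
    """
    for chunk in chunks:
        missing = [name for name in REQUIRED_CHUNK_FIELDS if not chunk.get(name)]
        if missing:
            yield chunk.get("chunk_id", "<unknown>"), missing


sample = [
    {"chunk_id": "c1", "source_version": "v4", "source_id": "policy-42",
     "ingest_timestamp": "2025-01-01T12:00:00Z"},
    {"chunk_id": "c2", "source_version": "v4", "source_id": ""},
]
print(list(chunks_missing_metadata(sample)))
# [('c2', ['source_id', 'ingest_timestamp'])]
```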

LLM-as-judge (limited role)

Judges are weak at proving ACL or version metadata. Use them only where human judgment on appropriateness matters at corpus level:

  1. Source appropriateness (1–5) — for a representative question set, is this corpus the right authority? (Not: did retrieval rank well.)
  2. Recency plausibility (1–5) — would a domain expert trust the vintage of sampled extracts?

Calibrate against human experts on a fixed monthly sample. Do not gate ACL or tombstone compliance on judge scores.
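
Calibration on the monthly sample can start with simple agreement statistics between judge and expert scores on the same items. The sketch below is one possible shape; `calibration_report` and the choice of metrics are assumptions, and what counts as acceptable agreement is left to you.

```python
from statistics import mean


def calibration_report(judge_scores, human_scores):
    """Compare paired 1-5 scores from the judge and from human experts."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return {
        "mean_abs_error": mean(diffs),
        "exact_agreement": sum(d == 0 for d in diffs) / len(diffs),
        "within_one": sum(d <= 1 for d in diffs) / len(diffs),
    }


print(calibration_report([5, 4, 3, 2], [5, 3, 3, 4]))
# {'mean_abs_error': 0.75, 'exact_agreement': 0.5, 'within_one': 0.75}
```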

Human review

| Trigger | Who | Cadence |
|---|---|---|
| Monthly correctness audit | Domain expert | Random sample per corpus |
| All stale_source / wrong_catalog incidents | Domain + platform | Within 1 week → golden case |
| New corpus before first prod gate | Domain owner | Sign-off on golden fixtures |
| High-risk corpora (legal, rates, clinical) | Governance | Quarterly full spot-check |

Humans anchor what correct source material looks like. Automation encodes non-negotiables (ACL, versions, tombstones).

Release gate (Data-specific changes)

When the change touches ingest, catalog, ACL, or index:

| Gate | Threshold |
|---|---|
| ACL adversarial set | 0 violations |
| Freshness SLA | 100% on time-sensitive golden cases |
| Source-version assertions | No regression vs baseline |
| Tombstone tests | 100% pass after delete fixtures |
| Catalog / env integrity | No cross-env doc ids in prod index |

Pair with Context gates on the same release: Data green does not imply Context green. Match the gates to the change: a model swap calls for Context and Reasoning gates; an index-only change calls for Data gates plus Context recall@k.
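
The thresholds above translate directly into a gate function. The following is a sketch, assuming your eval runs can be summarized as failure counts; `DataGateResults`, `data_gate`, and `release_allowed` are hypothetical names, and the Context checks are passed in as an opaque mapping because they are defined elsewhere.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DataGateResults:
    acl_violations: int
    freshness_failures: int          # time-sensitive golden cases outside SLA
    version_assertion_failures: int  # current run
    baseline_version_failures: int   # last accepted release
    tombstone_failures: int
    cross_env_doc_ids: int


def data_gate(r: DataGateResults) -> Dict[str, bool]:
    """Evaluate each Data gate; True means the gate is green."""
    return {
        "acl_adversarial": r.acl_violations == 0,
        "freshness_sla": r.freshness_failures == 0,
        "source_version": r.version_assertion_failures <= r.baseline_version_failures,
        "tombstone": r.tombstone_failures == 0,
        "catalog_env": r.cross_env_doc_ids == 0,
    }


def release_allowed(data_checks: Dict[str, bool],
                    context_checks: Dict[str, bool]) -> bool:
    """Both planes must be green; Data green alone is not enough."""
    return all(data_checks.values()) and all(context_checks.values())
```

Keeping the per-gate booleans, rather than collapsing to one flag inside `data_gate`, preserves the "data green, context red" readout described earlier.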

Online eval (async, not per-request)

Data plane online eval is probes and dashboards, not sampling every user query:

  • Scheduled freshness probes against authoritative APIs/CMS
  • Drift alerts when source_version distribution shifts
  • Index build success / duration / row-count monitors
  • ACL regression jobs on synthetic principals nightly

Feed failures into the same incident → golden case loop as offline CI.
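
For the drift alert on `source_version` distribution, one simple option is total variation distance between a baseline sample and the current sample of chunk versions. This sketch swaps in that metric as an assumption; the helper names and the 0.2 threshold are placeholders to tune per corpus.

```python
from collections import Counter


def version_distribution(versions):
    """Map each source_version to its share of the sample."""
    if not versions:
        raise ValueError("cannot build a distribution from an empty sample")
    counts = Counter(versions)
    total = sum(counts.values())
    return {version: n / total for version, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drift_alert(baseline_versions, current_versions, threshold=0.2):
    """True when the version mix has shifted by more than `threshold`."""
    distance = total_variation(
        version_distribution(baseline_versions),
        version_distribution(current_versions),
    )
    return distance > threshold
```

A shift is not always a failure: a planned publish of a new version moves the distribution too, so an alert is a prompt to compare against the authoritative source rather than a verdict.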

Trace fields

Capture these so Data failures are diagnosable without re-running ingest:

| Field | Why |
|---|---|
| source_ids | Which authoritative objects backed the chunks |
| source_versions | Detect stale corpus in prod traces |
| index_build_id | Tie behavior to a specific build |
| catalog_env | Catch wrong-catalog incidents |
| acl_decision_log | Prove entitlement at index or query filter |
| ingest_timestamp | Freshness vs SLA |
| chunk_id → source_id map | Lineage for audit |
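
A trace is only diagnosable if these fields are present and consistent with each other. The sketch below assumes traces are dictionaries and that the chunk-to-source map is stored under a key named `chunk_source_map`; that key name and `trace_problems` are hypothetical.

```python
from typing import List

TRACE_FIELDS = (
    "source_ids", "source_versions", "index_build_id", "catalog_env",
    "acl_decision_log", "ingest_timestamp", "chunk_source_map",
)


def trace_problems(trace: dict, expected_env: str) -> List[str]:
    """List reasons a retrieval trace cannot support a Data-plane diagnosis."""
    problems = [
        f"missing trace field: {name}" for name in TRACE_FIELDS if name not in trace
    ]
    env = trace.get("catalog_env")
    if env is not None and env != expected_env:
        problems.append(f"wrong catalog: {env!r} (expected {expected_env!r})")
    # Every chunk should map to a source that the trace itself cites.
    cited = set(trace.get("source_ids", []))
    for chunk_id, source_id in trace.get("chunk_source_map", {}).items():
        if source_id not in cited:
            problems.append(
                f"lineage break: {chunk_id} maps to uncited source {source_id!r}"
            )
    return problems
```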
