Eval Plane ③: Context

The Context plane assembles evidence at query time: retrieve, rank, filter, pack, and abstain. It answers what reached the model for this call — not whether the corpus behind it is healthy.

This is not the Data plane. Data owns freshness, lineage, ACL at ingest, and index correctness. Context owns whether this user, asking this question, got the right chunks in the pack. A scope leak can be a Data indexing bug or a Context filter bug; separate evals tell you which. A stale corpus is a Data problem; wrong recall@k on a fresh index is a Context problem. See Data plane eval and RAG Is Not a Database.

THE CLAIM

Context eval measures whether the right evidence reached the model — not whether the final answer sounds right.

What to evaluate

Metric | Definition
Recall@k | Required doc in top-k
Precision@k | Share of top-k that is relevant
Scope violations | Out-of-policy chunks in pack
Abstention | No answer when evidence below threshold
Attribution | Cited chunks support claims
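
The first three metrics can be sketched in a few lines. This assumes retrieval returns an ordered list of document IDs and that relevance labels and entitlements are available as sets; the function names and the `allowed` parameter are illustrative, not taken from any library. Recall is computed here as a fraction, which reduces to 0 or 1 when a case has a single required doc.

```python
from typing import List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], required: Set[str], k: int) -> float:
    """Fraction of required docs that appear in the top-k results."""
    if not required:
        return 1.0  # nothing was required, so nothing was missed
    top_k = set(retrieved[:k])
    return len(required & top_k) / len(required)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k results that are labeled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def scope_violations(packed: Sequence[str], allowed: Set[str]) -> List[str]:
    """Chunks in the pack that the caller is not entitled to see (assumed ID-level check)."""
    return [doc_id for doc_id in packed if doc_id not in allowed]
```

Abstention and attribution are left out here because they depend on the response, not only on the retrieved list.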

Failure classes

  • Retrieval failure — wrong or missing evidence
  • Over-retrieval — noise drowns signal
  • Scope leak — cross-tenant or cross-role chunk

Golden dataset examples

Scenario | Expected
Representative | must_retrieve: [policy-42]
Edge | Zero relevant docs → abstain
Adversarial | Injected instruction in retrieved PDF → filtered
Incident replay | Wrong chunk ranked first → fixed ranking
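
One possible shape for a golden-set record is sketched below. Only `must_retrieve` and `policy-42` come from the table; every other field name, case ID, and the angle-bracket placeholders are assumptions made for illustration. Modeling the adversarial case as a chunk that must not reach the pack is also an assumption about what "filtered" means.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GoldenCase:
    """One golden-set entry; field names other than must_retrieve are hypothetical."""
    case_id: str
    scenario: str  # representative | edge | adversarial | incident_replay
    question: str
    must_retrieve: List[str] = field(default_factory=list)
    must_not_retrieve: List[str] = field(default_factory=list)
    expect_abstain: bool = False


# Placeholder cases: questions and chunk IDs in angle brackets are not real data.
GOLDEN_SET = [
    GoldenCase("rep-001", "representative", "<question answered by policy-42>",
               must_retrieve=["policy-42"]),
    GoldenCase("edge-001", "edge", "<question with no relevant docs>",
               expect_abstain=True),
    GoldenCase("adv-001", "adversarial", "<question that surfaces the poisoned PDF>",
               must_not_retrieve=["<injected-chunk-id>"]),
]
```

An incident-replay case would follow the same pattern, with `must_retrieve` set to the chunk that should have ranked first in the original incident.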

Automated checks

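# Every required doc must appear in the top-k retrieved IDs.
assert set(required_doc_ids) <= set(retrieved_ids[:k])
# No out-of-policy chunk may be in the pack.
assert len(scope_violations) == 0
# Cases with no usable evidence must abstain.
if expect_abstain:
    assert response.abstained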
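
Bare asserts stop at the first failure, which hides the rest of the picture for a case. A hedged alternative is to collect every failure and report them together. The sketch below reuses the variable names from the checks above; the function name `check_context` and the default `k=5` are assumptions.

```python
from typing import List, Sequence


def check_context(
    retrieved_ids: Sequence[str],
    required_doc_ids: Sequence[str],
    scope_violations: Sequence[str],
    abstained: bool,
    expect_abstain: bool,
    k: int = 5,
) -> List[str]:
    """Return failure messages for one case; an empty list means it passed."""
    failures = []
    missing = set(required_doc_ids) - set(retrieved_ids[:k])
    if missing:
        failures.append(f"missing from top-{k}: {sorted(missing)}")
    if scope_violations:
        failures.append(f"scope violations: {list(scope_violations)}")
    if expect_abstain and not abstained:
        failures.append("expected abstention, got an answer")
    return failures
```

A test runner can then assert that the returned list is empty and print it when it is not.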

LLM-as-judge dimensions

  1. Relevance (1–5) — chunks address the question?
  2. Sufficiency (1–5) — enough to answer without guessing?
  3. Scope (1–5) — only entitled, on-topic material?
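
Judge output is only useful if it is validated before it is aggregated. The sketch below assumes the judge is prompted to reply with a JSON object keyed by the three dimensions above; that reply format, the key names, and `parse_judge_scores` are assumptions, and the call to the judge model itself is deliberately left out.

```python
import json
from typing import Dict

JUDGE_DIMENSIONS = ("relevance", "sufficiency", "scope")


def parse_judge_scores(raw: str) -> Dict[str, int]:
    """Parse a reply of the assumed form {"relevance": 4, "sufficiency": 3, "scope": 5}."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for dim in JUDGE_DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim}: expected an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    return scores
```

Rejecting malformed replies outright, instead of defaulting a missing score, keeps a broken judge prompt from silently inflating or deflating the averages.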

Human review

Score attribution on high-risk cases: does each claim map to a cited chunk?

Release gate

  • Recall@k on golden set ≥ baseline
  • Scope adversarial: 0 leaks
  • Abstention cases: no confident answer without evidence
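
The three gate conditions can be expressed as a single function that returns the reasons a release is blocked. This is a sketch under assumptions: the inputs are taken to be precomputed aggregates from an eval run, and the function and parameter names are hypothetical.

```python
from typing import List


def release_gate(
    recall_at_k: float,
    baseline_recall_at_k: float,
    scope_leaks: int,
    unsupported_answers: int,
) -> List[str]:
    """Return blocking reasons; an empty list means the gate passes."""
    blockers = []
    if recall_at_k < baseline_recall_at_k:
        blockers.append(
            f"recall@k {recall_at_k:.3f} is below baseline {baseline_recall_at_k:.3f}"
        )
    if scope_leaks > 0:
        blockers.append(f"{scope_leaks} scope leak(s) in adversarial set")
    if unsupported_answers > 0:
        blockers.append(
            f"{unsupported_answers} abstention case(s) answered without evidence"
        )
    return blockers
```

In CI, a non-empty result would typically fail the job and print the list.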

Trace fields

retrieved_chunks, scores, ranker_version, pack_token_count, abstention_reason
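
A typed record makes these fields harder to omit or misspell at logging time. The sketch below uses the five field names listed above; the field types, the `ContextTrace` class name, and the JSON serialization are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ContextTrace:
    """Per-call trace record; types are assumed, field names are from the list above."""
    retrieved_chunks: List[str]   # chunk IDs in ranked order
    scores: List[float]           # ranker scores, aligned with retrieved_chunks
    ranker_version: str
    pack_token_count: int
    abstention_reason: Optional[str] = None  # None when the model did not abstain

    def to_json(self) -> str:
        """Serialize to one JSON line for a log sink."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping `scores` aligned with `retrieved_chunks` lets an incident replay reconstruct the exact ranking that reached the pack.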

See also: RAG Is Not a Database · PGAR with RAG