Eval Plane ④: Reasoning

The Reasoning plane is where the model plans, synthesizes, and decides next steps. Context can be perfect and reasoning can still fail.

THE CLAIM

Reasoning eval separates faithfulness (stays on evidence) from correctness (draws the right conclusion).

What to evaluate

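| Dimension               | Automated             | Judge | Human      |
|-------------------------|-----------------------|-------|------------|
| Faithfulness to context | Claim ↔ chunk overlap |       | High-risk  |
| Logical consistency     |                       |       |            |
| Tool selection          | Match expected tool   |       | Edge cases |
| Uncertainty expression  |                       |       |            |
| Hallucination           | Citation required     |       |            |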
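
One way to read "claim ↔ chunk overlap" is as a token-overlap score between a claim and the retrieved chunks. The sketch below is an assumption about how such a check might look, not the blueprint's own metric: the function names `token_set` and `claim_support` are hypothetical, and the tokenizer is deliberately crude.

```python
import re


def token_set(text: str) -> set:
    """Lowercase alphanumeric tokens; a crude tokenizer for illustration only."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def claim_support(claim: str, chunks: list) -> float:
    """Fraction of the claim's tokens found in the best-matching chunk.

    Returns 0.0 when the claim has no tokens or no chunks are provided.
    """
    claim_tokens = token_set(claim)
    if not claim_tokens:
        return 0.0
    return max(
        (len(claim_tokens & token_set(chunk)) / len(claim_tokens) for chunk in chunks),
        default=0.0,
    )
```

A low score only flags a claim for closer review. Lexical overlap misses paraphrase and can be fooled by shared vocabulary, which is presumably why the table routes high-risk cases to a human.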

Failure classes

  • Reasoning failure — right evidence, wrong conclusion
  • Hallucination — claim without support
  • Overconfidence — no uncertainty when evidence thin

Golden dataset examples

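| Scenario    | Fixture                   | Expected                        |
|-------------|---------------------------|---------------------------------|
| Multi-hop   | Two docs needed           | Both used in rationale          |
| Distractor  | Similar wrong doc in pack | Ignored                         |
| Abstain     | Thin evidence             | "Cannot determine from sources" |
| Tool choice | "Check balance"           | get_balance not transfer        |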
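
The four scenarios above could be encoded as data plus a single pass/fail check. This is a minimal sketch under assumptions: the `GoldenCase` and `Output` types, their field names, and the idea of identifying documents by ID are all hypothetical and not part of the blueprint.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoldenCase:
    scenario: str
    required_docs: frozenset = frozenset()   # multi-hop: every one must be cited
    forbidden_docs: frozenset = frozenset()  # distractors: none may be cited
    expected_tool: Optional[str] = None      # tool choice
    expected_phrase: Optional[str] = None    # abstain wording


@dataclass(frozen=True)
class Output:
    answer: str
    cited_docs: frozenset = frozenset()
    planned_tool: Optional[str] = None


def passes(case: GoldenCase, out: Output) -> bool:
    """True when the output meets every expectation the case declares."""
    if not case.required_docs <= out.cited_docs:
        return False
    if case.forbidden_docs & out.cited_docs:
        return False
    if case.expected_tool is not None and out.planned_tool != case.expected_tool:
        return False
    if case.expected_phrase is not None:
        return case.expected_phrase.lower() in out.answer.lower()
    return True
```

For example, the tool-choice row becomes `GoldenCase("tool_choice", expected_tool="get_balance")`, which an output planning `transfer` fails.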

Automated checks

  • Every factual sentence has citation metadata
  • planned_tool ∈ allowed_manifest
  • Contradiction detector vs retrieved text (optional NLI model)
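
The first two checks above are cheap to run deterministically. The sketch below assumes each sentence arrives as a dict with `text`, an optional `is_factual` flag, and a `citations` list. That shape and both function names are hypothetical. The optional NLI contradiction check is left out because it depends on the model chosen.

```python
def uncited_sentences(sentences: list) -> list:
    """Return the text of factual sentences that carry no citation metadata.

    A sentence is treated as factual unless it is explicitly marked otherwise.
    """
    return [
        s["text"]
        for s in sentences
        if s.get("is_factual", True) and not s.get("citations")
    ]


def tool_allowed(planned_tool: str, allowed_manifest: set) -> bool:
    """Check planned_tool ∈ allowed_manifest."""
    return planned_tool in allowed_manifest
```

An empty list from `uncited_sentences` means the citation check passes. Returning the offending sentences rather than a boolean makes failures easier to triage.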

LLM-as-judge rubric (core)

  1. Faithfulness — claims supported by provided chunks only
  2. Correctness — conclusion matches domain expert answer
  3. Calibration — uncertainty when appropriate

Let the judge use chain-of-thought internally, but store only its rationale as JSON, not the raw reasoning.
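
A sketch of the storage step, assuming the judge returns a JSON object with one entry per rubric dimension, each holding a `score` and a `rationale`. That output shape, the key names, and `parse_judge_output` are assumptions made for illustration. Any other keys, such as a scratchpad, are dropped before storage.

```python
import json

RUBRIC = ("faithfulness", "correctness", "calibration")


def parse_judge_output(raw: str) -> dict:
    """Keep only score and rationale per rubric dimension; discard everything else.

    Raises KeyError if a rubric dimension or one of its fields is missing.
    """
    data = json.loads(raw)
    record = {}
    for dim in RUBRIC:
        entry = data[dim]
        record[dim] = {
            "score": float(entry["score"]),
            "rationale": str(entry["rationale"]),
        }
    return record
```

Failing loudly on a missing dimension is a design choice here: a judge response that skips part of the rubric is better rejected than silently scored.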

Release gate

  • Faithfulness ≥ 4.0 avg on golden set (judge)
  • Hallucination cases: 0 critical (human adjudicated)
  • Tool-selection accuracy ≥ baseline on tool-heavy subset
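
The three gate conditions can be combined into one decision. A minimal sketch, assuming the inputs have already been computed upstream. The function name and parameter names are hypothetical, and the 4.0 threshold is taken from the list above.

```python
from statistics import mean


def release_gate(
    faithfulness_scores: list,
    critical_hallucinations: int,
    tool_accuracy: float,
    baseline_tool_accuracy: float,
    min_faithfulness: float = 4.0,
) -> dict:
    """Evaluate each gate condition and an overall release decision.

    Raises statistics.StatisticsError if faithfulness_scores is empty.
    """
    checks = {
        "faithfulness": mean(faithfulness_scores) >= min_faithfulness,
        "hallucination": critical_hallucinations == 0,
        "tool_selection": tool_accuracy >= baseline_tool_accuracy,
    }
    checks["release"] = all(checks.values())
    return checks
```

Returning the per-condition results alongside the overall decision shows which gate blocked a release.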

Trace fields

model_id, prompt_version, reasoning_trace (if exposed), planned_tools, citations
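
These fields could be carried as a small record type and serialized one trace per line. The sketch below uses the field names listed above. The `ReasoningTrace` class, the chosen types, and `to_json_line` are assumptions. `reasoning_trace` is optional because the source notes it may not be exposed.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ReasoningTrace:
    model_id: str
    prompt_version: str
    planned_tools: list = field(default_factory=list)
    citations: list = field(default_factory=list)
    reasoning_trace: Optional[str] = None  # None when the model does not expose it


def to_json_line(trace: ReasoningTrace) -> str:
    """Serialize a trace as one JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)
```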