Eval Plane ⑧: Outcome

The Outcome plane is what the user sees: final text, UI, or workflow result. It is the integration test across all planes — never the only test.

THE CLAIM

Outcome eval measures task success and trust — after every upstream plane has been scored independently.

What to evaluate

| Dimension    | Description                          |
|--------------|--------------------------------------|
| Task success | User goal achieved (domain-defined)  |
| Completeness | All parts of question addressed      |
| Clarity      | Actionable, unambiguous language     |
| Usefulness   | Would a practitioner act on this?    |
| Trust        | Appropriate confidence and citations |

Failure classes

  • Bad UX — correct but confusing or incomplete
  • False completion — sounds done, task not done
  • Composite failure — upstream plane failed; outcome masks it

Golden dataset examples

| Scenario       | Success criteria                              |
|----------------|-----------------------------------------------|
| Representative | Expert labels: task_complete = true           |
| Edge           | Partial info → clear next steps               |
| Adversarial    | User pressured for wrong action → refuses     |
| E2E replay     | Full trace; outcome matches prod incident fix |

Scoring stack for Outcome

  1. Automated: required sections are present, plus citations where policy requires them
  2. LLM-as-judge: completeness, clarity, and trust (calibrated)
  3. Human: task success on the representative set, plus 100% of high-risk cases
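
The automated tier of the stack above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `automated_outcome_check`, the plain substring match for section names, and the bracketed-number citation format (`[1]`) are all assumptions made for the example.

```python
import re


def automated_outcome_check(text, required_sections, citations_required=False):
    """Return (passed, problems) for the cheapest tier of Outcome scoring.

    Hypothetical helper: section detection is a case-insensitive substring
    match, which is an assumption, not a rule from the blueprint.
    """
    problems = []
    lowered = text.lower()
    for section in required_sections:
        if section.lower() not in lowered:
            problems.append(f"missing section: {section}")
    # Assumed citation format: bracketed numbers such as [1] or [12].
    if citations_required and not re.search(r"\[\d+\]", text):
        problems.append("no citation found")
    return (not problems, problems)
```

A check like this only shows that the expected structure exists. Whether the content is complete, clear, and trustworthy is left to the judge and human tiers.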

Composite gate rule

outcome_pass only if:
outcome_scores pass
AND no upstream plane failed critical checks

This rule prevents a fluent answer from passing when an upstream plane such as Context or Action failed its critical checks.
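
As a sketch, the composite gate rule could be written as a small Python function. The parameter names, the per-dimension threshold dictionary, and the representation of upstream results as a plane-to-boolean mapping are assumptions chosen for illustration.

```python
def outcome_pass(outcome_scores, thresholds, upstream_critical_ok):
    """Composite gate: Outcome passes only if its own scores pass AND
    no upstream plane failed its critical checks.

    outcome_scores:       dimension name -> score (hypothetical shape)
    thresholds:           dimension name -> minimum acceptable score
    upstream_critical_ok: plane name -> True if its critical checks passed
    """
    # A dimension with no recorded score counts as a failure.
    scores_pass = all(
        outcome_scores.get(dim, float("-inf")) >= minimum
        for dim, minimum in thresholds.items()
    )
    upstream_clean = all(upstream_critical_ok.values())
    return scores_pass and upstream_clean
```

For example, an answer scoring 5 on every dimension still fails the gate when `upstream_critical_ok` contains `{"context": False}`.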

LLM-as-judge dimensions

  1. Task completion (1–5)
  2. Clarity (1–5)
  3. Trustworthiness (1–5)
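
The scoring stack calls for a calibrated judge. One plausible way to check calibration is to validate each judge result against the 1–5 scale and then measure how often the judge lands close to a human rater on the same items. The sketch below assumes that approach; the dimension keys, the function names, and the tolerance of one point are hypothetical.

```python
JUDGE_DIMENSIONS = ("task_completion", "clarity", "trustworthiness")


def validate_judge_scores(raw):
    """Keep only the three rubric dimensions and reject out-of-range values."""
    scores = {}
    for dim in JUDGE_DIMENSIONS:
        value = raw.get(dim)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    return scores


def judge_human_agreement(judge, human, tolerance=1):
    """Fraction of paired scores where judge and human differ by at most `tolerance`."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty score lists of equal length")
    close = sum(1 for j, h in zip(judge, human) if abs(j - h) <= tolerance)
    return close / len(judge)
```

With judge scores `[5, 4, 2, 3]` and human scores `[5, 3, 4, 3]`, three of the four pairs fall within one point, so the agreement is 0.75. What agreement level counts as calibrated is a policy decision this sketch does not make.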

Release gate

  • Task success rate ≥ baseline on representative set
  • High-risk human pass = 100%
  • No ship if any critical upstream plane regressed
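
The three release conditions can be combined into one check that also reports why a release was blocked. This is a hedged sketch: `release_gate` and its parameters are illustrative names, and how the baseline rate and the list of regressed planes are computed is left to the surrounding pipeline.

```python
def release_gate(task_success_rate, baseline_rate, high_risk_results, regressed_critical_planes):
    """Return (ship, reasons) for the Outcome release gate.

    task_success_rate / baseline_rate: rates on the representative set
    high_risk_results: one boolean per high-risk case reviewed by a human
    regressed_critical_planes: names of critical upstream planes that regressed
    """
    reasons = []
    if task_success_rate < baseline_rate:
        reasons.append("task success rate below baseline")
    # Note: an empty list passes here; callers should ensure the
    # high-risk set was actually reviewed.
    if not all(high_risk_results):
        reasons.append("high-risk human pass below 100%")
    if regressed_critical_planes:
        planes = ", ".join(sorted(regressed_critical_planes))
        reasons.append(f"critical upstream regression: {planes}")
    return (not reasons, reasons)
```

Returning the reasons alongside the decision keeps a blocked release explainable: a single failed high-risk case or a single regressed plane is enough to stop the ship.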

Trace fields

final_output, upstream_plane_scores, user_feedback (if available)
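
One way to carry these fields through a pipeline is a small record type. The sketch below uses the three field names from the list above; the class name `OutcomeTrace`, the field types, and the JSON serialization are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class OutcomeTrace:
    """Hypothetical trace record for the Outcome plane."""
    final_output: str
    upstream_plane_scores: dict  # plane name -> score (assumed shape)
    user_feedback: Optional[str] = None  # only set if feedback is available

    def to_json(self):
        # sort_keys keeps serialized traces stable for diffing.
        return json.dumps(asdict(self), sort_keys=True)
```

Storing `upstream_plane_scores` next to `final_output` lets the composite gate rule be re-evaluated from the trace alone.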

Series complete

Return to Eval Framework Blueprint.