Eval Plane ①: Input
How to evaluate the Input plane — parsing, intent, injection resistance, and PII handling before inference begins.
How to evaluate the Data plane — source freshness, lineage, access boundaries, and factual correctness of underlying knowledge.
How to evaluate the Context plane — retrieval precision, ranking, scope, packing, and abstention when evidence is thin.
How to evaluate the Reasoning plane — faithfulness to context, conclusion quality, tool selection, and multi-step logic.
How to evaluate the Tool plane — selection, arguments, idempotency, error handling, and schema compliance for agent tool calls.
How to evaluate the Memory plane — session scope, TTL, consistency, and cross-session leakage in agent and copilot systems.
How to evaluate the Action plane — policy enforcement, authorization, side effects, and auditability before irreversible operations execute.
How to evaluate the Outcome plane — end-user task success, clarity, usefulness, and trust in the final delivered response.
Curated third-party articles, guides, and tool docs on LLM and agent evaluation — mapped to the Eval Framework Blueprint series.
How to design, version, and maintain golden datasets for plane-aware evaluation — representative tasks, edge cases, adversarial cases, and production replays.
Playbooks for human evaluation in production AI — sampling strategy, rubrics, adjudication, and how manual scores anchor automated and LLM-as-judge gates.
How to deploy LLM-as-judge for plane-aware evaluation — rubric design, judge selection, bias controls, and calibration against human ground truth.
How to run online evaluation on live traffic — sampling, shadow scoring, canary eval, drift detection, and promoting production signals back into golden datasets.
How to generate synthetic eval cases for edge and adversarial coverage — without polluting golden datasets or optimizing for the generator.