14 docs tagged with "ai-intelligence"

Eval Plane ①: Input

How to evaluate the Input plane — parsing, intent, injection resistance, and PII handling before inference begins.

Eval Plane ②: Data

How to evaluate the Data plane — source freshness, lineage, access boundaries, and factual correctness of underlying knowledge.

Eval Plane ③: Context

How to evaluate the Context plane — retrieval precision, ranking, scope, packing, and abstention when evidence is thin.

Eval Plane ④: Reasoning

How to evaluate the Reasoning plane — faithfulness to context, conclusion quality, tool selection, and multi-step logic.

Eval Plane ⑤: Tool

How to evaluate the Tool plane — selection, arguments, idempotency, error handling, and schema compliance for agent tool calls.

Eval Plane ⑥: Memory

How to evaluate the Memory plane — session scope, TTL, consistency, and cross-session leakage in agent and copilot systems.

Eval Plane ⑦: Action

How to evaluate the Action plane — policy enforcement, authorization, side effects, and auditability before irreversible operations execute.

Eval Plane ⑧: Outcome

How to evaluate the Outcome plane — end-user task success, clarity, usefulness, and trust in the final delivered response.

Golden Datasets: Building Eval Data That Predicts Production

How to design, version, and maintain golden datasets for plane-aware evaluation — representative tasks, edge cases, adversarial cases, and production replays.

Human Review: Manual Eval That Calibrates the System

Playbooks for human evaluation in production AI — sampling strategy, rubrics, adjudication, and how manual scores anchor automated and LLM-as-judge gates.

LLM-as-Judge: Scaled Eval With Calibration

How to deploy LLM-as-judge for plane-aware evaluation — rubric design, judge selection, bias controls, and calibration against human ground truth.

Online & Dynamic Eval: Scoring Production After Ship

How to run online evaluation on live traffic — sampling, shadow scoring, canary eval, drift detection, and promoting production signals back into golden datasets.

Synthetic Eval Generation: Scaling Coverage Safely

How to generate synthetic eval cases for edge and adversarial coverage without polluting golden datasets or overfitting to the generator.