Adversarial Testing
Prompt injection, PEP bypass, manifest violations, and entitlement escalation tests for PGAR runtimes.
Immutable verdict logs, examiner questions, and replaying authorization without chat transcripts.
Retrieval as a PEP-gated tool, context pack logging, validation handoff, and PGAR applied to RAG.
Tool manifests, schema compliance, PEP gating per tool, and blocking proposals outside the registry.
How to evaluate the Input plane — parsing, intent, injection resistance, and PII handling before inference begins.
How to evaluate the Data plane — source freshness, lineage, access boundaries, and factual correctness of underlying knowledge.
How to evaluate the Context plane — retrieval precision, ranking, scope, packing, and abstention when evidence is thin.
How to evaluate the Reasoning plane — faithfulness to context, conclusion quality, tool selection, and multi-step logic.
How to evaluate the Tool plane — selection, arguments, idempotency, error handling, and schema compliance for agent tool calls.
How to evaluate the Memory plane — session scope, TTL, consistency, and cross-session leakage in agent and copilot systems.
How to evaluate the Action plane — policy enforcement, authorization, side effects, and auditability before irreversible operations execute.
How to evaluate the Outcome plane — end-user task success, clarity, usefulness, and trust in the final delivered response.
Curated third-party articles, guides, and tool docs on LLM and agent evaluation — mapped to the Eval Framework Blueprint series.
Curated third-party resources on PDP/PEP, OAuth, policy engines, and agent authorization, mapped to the PGAR playbook series.
How to design, version, and maintain golden datasets for plane-aware evaluation — representative tasks, edge cases, adversarial cases, and production replays.
Playbooks for human evaluation in production AI — sampling strategy, rubrics, adjudication, and how manual scores anchor automated and LLM-as-judge gates.
How to deploy LLM-as-judge for plane-aware evaluation — rubric design, judge selection, bias controls, and calibration against human ground truth.
Where to maintain tool manifests, how agentic apps load them, versioning and rollback, and pros and cons of repo files vs registry APIs.
How to run online evaluation on live traffic — sampling, shadow scoring, canary eval, drift detection, and promoting production signals back into golden datasets.
ALLOW, DENY, and STEP_UP only — policy versioning, rule authoring, and deterministic authorization.
The four steps every Policy Enforcement Point runs on every proposal: receive, ask PDP, audit, act.
API gateway, Identity Provider, token validation, and claims issuance at the trust boundary.
Session custody, orchestration, proposal routing, and receiving results before validation or synthesis.
Tool schemas only, proposal-not-permission, and keeping authority out of the model boundary.
The policy layer — enforcement point, decision point, verdict handling, and deny-before-downstream.
Re-authorization, side-effect execution, and returning results to the agentic app, not the LLM directly.
The five PGAR trust boundaries in request order (ingress, agentic app, LLM proposal, PEP + PDP, downstream), including multi-agent workflows, with links to each implementation playbook.
Core PGAR building blocks in implementation order — SARAC contracts, token custody, PEP/PDP enforcement, step-up, and audit replay.
Hub for Policy-Governed Agent Runtime playbooks (foundation, assurance, boundary, and domain groups in recommended implementation order).
Subject, action, resource, and context schemas for PEP-to-PDP calls — the contracts that make verdict chains replayable.
Golden scenario libraries for PDP/PEP regression: representative, edge, adversarial, and incident replay cases.
STEP_UP verdict handling, four-eyes approval, re-evaluation with context.approval, and UX ownership in the agentic app.
How to generate synthetic eval cases for edge and adversarial coverage — without polluting golden datasets or optimizing for the generator.
What stays in the agentic app, what the LLM sees, and the PGAR test for credential isolation.