34 docs tagged with "arch"

Adversarial Testing

Prompt injection, PEP bypass, manifest violations, and entitlement escalation tests for PGAR runtimes.

Audit & Replay

Immutable verdict logs, examiner questions, and replaying authorization without chat transcripts.

Domain: RAG Retrieval

Retrieval as a PEP-gated tool, context pack logging, validation handoff, and PGAR applied to RAG.

Domain: Tool Registry

Tool manifests, schema compliance, PEP gating per tool, and blocking proposals outside the registry.

Eval Plane ①: Input

How to evaluate the Input plane — parsing, intent, injection resistance, and PII handling before inference begins.

Eval Plane ②: Data

How to evaluate the Data plane — source freshness, lineage, access boundaries, and factual correctness of underlying knowledge.

Eval Plane ③: Context

How to evaluate the Context plane — retrieval precision, ranking, scope, packing, and abstention when evidence is thin.

Eval Plane ④: Reasoning

How to evaluate the Reasoning plane — faithfulness to context, conclusion quality, tool selection, and multi-step logic.

Eval Plane ⑤: Tool

How to evaluate the Tool plane — selection, arguments, idempotency, error handling, and schema compliance for agent tool calls.

Eval Plane ⑥: Memory

How to evaluate the Memory plane — session scope, TTL, consistency, and cross-session leakage in agent and copilot systems.

Eval Plane ⑦: Action

How to evaluate the Action plane — policy enforcement, authorization, side effects, and auditability before irreversible operations execute.

Eval Plane ⑧: Outcome

How to evaluate the Outcome plane — end-user task success, clarity, usefulness, and trust in the final delivered response.

Golden Datasets: Building Eval Data That Predicts Production

How to design, version, and maintain golden datasets for plane-aware evaluation — representative tasks, edge cases, adversarial cases, and production replays.

Human Review: Manual Eval That Calibrates the System

Playbooks for human evaluation in production AI — sampling strategy, rubrics, adjudication, and how manual scores anchor automated and LLM-as-judge gates.

LLM-as-Judge: Scaled Eval With Calibration

How to deploy LLM-as-judge for plane-aware evaluation — rubric design, judge selection, bias controls, and calibration against human ground truth.

Manifest Lifecycle & Registry Patterns

Where to maintain tool manifests, how agentic apps load them, versioning and rollback, and pros and cons of repo files vs registry APIs.

Online & Dynamic Eval: Scoring Production After Ship

How to run online evaluation on live traffic — sampling, shadow scoring, canary eval, drift detection, and promoting production signals back into golden datasets.

PDP Policy Surfaces

ALLOW, DENY, and STEP_UP only — policy versioning, rule authoring, and deterministic authorization.

PEP Enforcement

The four steps every Policy Enforcement Point runs on every proposal: receive, ask PDP, audit, act.
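
As a rough illustration of the four steps (receive, ask PDP, audit, act), here is a minimal Python sketch. All names in it (`Proposal`, `Verdict`, `toy_pdp`, `enforce`) are hypothetical and are not taken from the playbooks; the toy policy rules are invented purely to exercise the three verdicts.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, List, Optional, Tuple


class Verdict(Enum):
    # The only three verdicts named in the PDP summary.
    ALLOW = "ALLOW"
    DENY = "DENY"
    STEP_UP = "STEP_UP"


@dataclass(frozen=True)
class Proposal:
    # Hypothetical payload shape: subject, action, resource, context.
    subject: str
    action: str
    resource: str
    context: dict = field(default_factory=dict)


def toy_pdp(p: Proposal) -> Verdict:
    # Invented rules, for illustration only.
    if p.action == "read":
        return Verdict.ALLOW
    if p.action == "transfer":
        # Require an approval in context, otherwise ask for step-up.
        return Verdict.ALLOW if p.context.get("approval") else Verdict.STEP_UP
    return Verdict.DENY


def enforce(
    proposal: Proposal,                      # step 1: receive the proposal
    pdp: Callable[[Proposal], Verdict],
    audit_log: List[Tuple[Proposal, Verdict]],
    execute: Callable[[Proposal], Any],
) -> Tuple[Verdict, Optional[Any]]:
    verdict = pdp(proposal)                  # step 2: ask the PDP
    audit_log.append((proposal, verdict))    # step 3: audit before acting
    if verdict is Verdict.ALLOW:             # step 4: act only on ALLOW
        return verdict, execute(proposal)
    return verdict, None                     # DENY / STEP_UP: nothing runs downstream
```

In this sketch the verdict is written to the audit log before anything executes, so a denied or stepped-up proposal still leaves a record, and the downstream call is reachable only through the ALLOW branch.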

PGAR Boundary ①: Ingress

API gateway, Identity Provider, token validation, and claims issuance at the trust boundary.

PGAR Boundary ②: Agentic App

Session custody, orchestration, proposal routing, and receiving results before validation or synthesis.

PGAR Boundary ③: LLM Proposal

Tool schemas only, proposal-not-permission, and keeping authority out of the model boundary.

PGAR Boundary ④: PEP + PDP

The policy layer — enforcement point, decision point, verdict handling, and deny-before-downstream.

PGAR Boundary ⑤: Downstream

Re-authorization, side-effect execution, and returning results to the agentic app, not the LLM directly.

PGAR Boundary Playbooks

The five PGAR trust boundaries in request order (ingress, agentic app, LLM proposal, PEP + PDP, downstream), including multi-agent workflows, with links to each implementation playbook.

PGAR Foundation Playbooks

Core PGAR building blocks in implementation order — SARAC contracts, token custody, PEP/PDP enforcement, step-up, and audit replay.

PGAR Runtime Playbooks

Hub for Policy-Governed Agent Runtime playbooks (foundation, assurance, boundary, and domain groups in recommended implementation order).

Policy Contracts: SARAC and Payload Shapes

Subject, action, resource, and context schemas for PEP-to-PDP calls — the contracts that make verdict chains replayable.

Policy Test Scenarios

Golden scenario libraries for PDP/PEP regression testing, covering representative, edge, adversarial, and incident-replay cases.

Step-Up & Attestation

STEP_UP verdict handling, four-eyes approval, re-evaluation with context.approval, and UX ownership in the agentic app.

Synthetic Eval Generation: Scaling Coverage Safely

How to generate synthetic eval cases for edge and adversarial coverage — without polluting golden datasets or optimizing for the generator.

Token & Session Boundary

What stays in the agentic app, what the LLM sees, and the PGAR test for credential isolation.