Policy Test Scenarios

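Policy regression tests are deterministic: a fixed input evaluated against a fixed policy version should always produce the same verdict. Build a versioned scenario library alongside your eval golden datasets.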

THE CLAIM

Policy scenarios are SARAC fixtures with expected verdicts, not prompts you hope the model respects.

Scenario schema

{
  "id": "pgar-wire-001",
  "version": "2026.07.1",
  "domain": "payments",
  "scenario": "representative",
  "risk_tier": "high",
  "input": {
    "subject": { "sub": "officer-123", "emts": { "payments.wire.initiate": true } },
    "action": "initiate_wire",
    "resource": { "beneficiary_id": "bene-acme-441" },
    "context": { "amount": 15000, "sanctions_status": "clear" }
  },
  "expected": {
    "verdict": "ALLOW",
    "downstream_called": true,
    "policy_version": "pgar.payments.wire/v3"
  }
}
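A scenario file is only useful if it is rejected when malformed. The following is a minimal sketch of a loader that checks the fields shown in the schema above. The `Scenario` class, the `parse_scenario` helper, and the set of accepted verdicts (taken from the verdicts this page mentions) are illustrative assumptions, not part of any PGAR tooling.

```python
import json
from dataclasses import dataclass
from typing import Any

# Field names come from the schema above. The verdict set is an assumption
# based on the verdicts this page mentions (ALLOW, DENY, STEP_UP).
REQUIRED_TOP = ("id", "version", "domain", "scenario", "risk_tier", "input", "expected")
REQUIRED_INPUT = ("subject", "action", "resource", "context")
KNOWN_VERDICTS = {"ALLOW", "DENY", "STEP_UP"}


@dataclass(frozen=True)
class Scenario:
    id: str
    version: str
    domain: str
    layer: str  # the "scenario" field: representative, edge, ...
    risk_tier: str
    input: dict[str, Any]
    expected: dict[str, Any]


def parse_scenario(raw: str) -> Scenario:
    """Parse one scenario document; raise ValueError if required fields are absent."""
    doc = json.loads(raw)
    missing = [key for key in REQUIRED_TOP if key not in doc]
    missing += [f"input.{key}" for key in REQUIRED_INPUT if key not in doc.get("input", {})]
    if missing:
        raise ValueError(f"scenario is missing fields: {missing}")
    verdict = doc["expected"].get("verdict")
    if verdict not in KNOWN_VERDICTS:
        raise ValueError(f"unknown expected verdict: {verdict!r}")
    return Scenario(
        id=doc["id"], version=doc["version"], domain=doc["domain"],
        layer=doc["scenario"], risk_tier=doc["risk_tier"],
        input=doc["input"], expected=doc["expected"],
    )


# The example scenario from this page, as a string a loader would read from git.
EXAMPLE = """
{
  "id": "pgar-wire-001", "version": "2026.07.1", "domain": "payments",
  "scenario": "representative", "risk_tier": "high",
  "input": {
    "subject": {"sub": "officer-123", "emts": {"payments.wire.initiate": true}},
    "action": "initiate_wire",
    "resource": {"beneficiary_id": "bene-acme-441"},
    "context": {"amount": 15000, "sanctions_status": "clear"}
  },
  "expected": {"verdict": "ALLOW", "downstream_called": true,
               "policy_version": "pgar.payments.wire/v3"}
}
"""

scenario = parse_scenario(EXAMPLE)
print(scenario.id, scenario.expected["verdict"])
```

Failing fast at load time keeps a typo in a fixture from silently turning into a skipped or vacuous test.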

Four scenario layers

| Layer | Purpose | Examples |
|---|---|---|
| Representative | Happy path under policy | Under-limit wire ALLOW |
| Edge | Boundaries | At-limit amount, expired token |
| Adversarial | Bypass attempts | Direct downstream, wrong subject |
| Incident replay | Production failures | Sanctions DENY, scope leak |

CI gate

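# Every active scenario must reproduce its expected verdict and side-effect flag.
for case in active_scenarios:
    result = pep_simulate(case.input)
    # The scenario id in the assertion message points at the failing fixture.
    assert result.verdict == case.expected.verdict, case.id
    assert result.downstream_called == case.expected.downstream_called, case.id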

Only scenarios tagged status: active block releases; candidate and retired scenarios do not.

When to run (before, during, after)

Assurance runs at different depths across the delivery lifecycle. Policy test scenarios are the deterministic core (SARAC in, verdict out). Adversarial testing adds bypass and injection paths on top.

Teams (aligned with PGAR Blueprint ownership):

| Team | Assurance role |
|---|---|
| AI platform | Harness, CI jobs, PEP integration tests, staging/canary wiring, trace evals |
| Governance / compliance | PDP regression policy, adversarial + incident fixtures, examiner replay rules, pen-test findings → CI |
| Domain | Representative + edge business rules, UAT journeys, downstream sandbox behavior |
| Security / IAM | Token/entitlement fixtures, infra bypass tests, network choke-point validation |
| SRE | Production replay jobs, drift alerts, audit log infra, rollout monitoring |

Who runs what (summary)

| Phase | Primary runner | Authors fixtures | Approves gate |
|---|---|---|---|
| Before (CI) | AI platform | Domain (rep/edge), Governance (adversarial/incident), Security (authz shape) | Governance + Domain lead for policy/manifest changes |
| During (staging/rollout) | AI platform + Domain | Same as before; Domain owns UAT scripts | Domain sign-off on journeys; Governance on adversarial 100% |
| After (production) | SRE + Governance | Governance (incident replay), Security (pen test) | Governance for new active incident scenarios |

Before (design and CI)

Run offline, blocking on every change that touches policy, PEP, manifest, or agent orchestration.

| When | What runs | Runs test | Authors fixture | How | Gate |
|---|---|---|---|---|---|
| Local / PR | Representative + edge for changed domain | Engineer (AI platform) | Domain | pep_simulate(case.input) or PDP unit test | Verdict match on touched domains |
| Merge to main | All status: active scenarios | AI platform (CI) | All teams (PR contributors) | CI job: PEP harness + mock PDP/downstream | 100% verdict match |
| Policy version bump | Full regression suite | AI platform (CI) | Governance | Re-run every active case vs new policy_version | No ALLOW↔DENY drift without Governance sign-off |
| New tool in manifest | Tool + PEP scenarios for that action | AI platform (CI) | Domain + AI platform | Block merge if new action has zero active scenarios | Schema + verdict coverage |

Before is where most scenarios live: fast, deterministic, no real downstream. Domain and Governance write cases; AI platform runs them in CI.
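The policy version bump gate ("no ALLOW↔DENY drift without Governance sign-off") amounts to diffing verdicts between two policy versions. The sketch below assumes verdicts have already been collected per scenario id for each version. The `verdict_drift` helper, the `signed_off` set, and all scenario ids other than `pgar-wire-001` are hypothetical. It also takes the stricter reading and reports any verdict change, not only ALLOW↔DENY flips.

```python
def verdict_drift(
    old: dict[str, str],
    new: dict[str, str],
    signed_off: set[str],
) -> list[tuple[str, str, str]]:
    """List (scenario_id, old_verdict, new_verdict) for unapproved verdict changes.

    Only scenarios present under both policy versions are compared; a change is
    ignored when Governance has explicitly signed off on that scenario id.
    """
    drift = []
    for scenario_id in sorted(old.keys() & new.keys()):
        if old[scenario_id] != new[scenario_id] and scenario_id not in signed_off:
            drift.append((scenario_id, old[scenario_id], new[scenario_id]))
    return drift


# Verdicts per scenario under the current and the proposed policy version.
verdicts_v3 = {"pgar-wire-001": "ALLOW", "pgar-wire-002": "DENY", "pgar-wire-004": "STEP_UP"}
verdicts_v4 = {"pgar-wire-001": "ALLOW", "pgar-wire-002": "ALLOW", "pgar-wire-004": "STEP_UP"}

unapproved = verdict_drift(verdicts_v3, verdicts_v4, signed_off=set())
for scenario_id, before, after in unapproved:
    print(f"{scenario_id}: {before} -> {after} (needs Governance sign-off)")
```

Passing the approved ids explicitly keeps the sign-off reviewable in the same change that bumps the policy version.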

During (staging, pre-prod, rollout)

Run integration and adversarial tests against a deployed stack (real PEP, real PDP, mock or sandbox downstream).

| When | What runs | Runs test | Authors / validates | How | Gate |
|---|---|---|---|---|---|
| Staging deploy | Full active library + adversarial set | AI platform | Governance reviews adversarial coverage | E2E: LLM stub → PEP → mock downstream | Adversarial 100%; downstream_called as expected |
| Pre-prod / UAT | Representative business journeys | Domain (with AI platform support) | Domain (business rules), Governance (STEP_UP/DENY policy) | Scripted or manual flows with test principals | STEP_UP and DENY exercised, not only ALLOW |
| Canary / shadow | Production-shaped SARAC sample | SRE + AI platform | Governance (baseline thresholds) | Shadow PEP or anonymized replay; no side effects | Verdict distribution within baseline |
| Infra change | Network bypass checks | Security / IAM + SRE | Security / IAM | Agentic app cannot reach downstream except via PEP | Zero critical bypass findings |

During proves wiring. AI platform owns the pipeline; Domain owns “does this match business intent?”; Security owns “can anything skip the choke point?”
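The canary gate ("verdict distribution within baseline") can be approximated by comparing the share of each verdict in a shadow sample against a baseline. This is one possible sketch: the `out_of_baseline` helper, the baseline shares, and the 5 percentage point tolerance are assumptions for illustration, and Governance would set the real thresholds.

```python
from collections import Counter


def verdict_shares(verdicts: list[str]) -> dict[str, float]:
    """Fraction of the sample that received each verdict."""
    total = len(verdicts)
    if total == 0:
        return {}
    return {verdict: count / total for verdict, count in Counter(verdicts).items()}


def out_of_baseline(
    observed: list[str],
    baseline: dict[str, float],
    tolerance: float = 0.05,  # assumed threshold, absolute difference in share
) -> dict[str, float]:
    """Return verdicts whose observed share deviates from baseline by more than tolerance."""
    shares = verdict_shares(observed)
    breaches = {}
    for verdict in sorted(set(baseline) | set(shares)):
        delta = shares.get(verdict, 0.0) - baseline.get(verdict, 0.0)
        if abs(delta) > tolerance:
            breaches[verdict] = round(delta, 4)
    return breaches


# Hypothetical baseline and two shadow samples of 100 decisions each.
BASELINE = {"ALLOW": 0.90, "DENY": 0.08, "STEP_UP": 0.02}
steady = ["ALLOW"] * 90 + ["DENY"] * 8 + ["STEP_UP"] * 2
shifted = ["ALLOW"] * 70 + ["DENY"] * 28 + ["STEP_UP"] * 2

print(out_of_baseline(steady, BASELINE))
print(out_of_baseline(shifted, BASELINE))
```

A fixed absolute tolerance is the simplest choice; a statistical test may suit small samples better, but that is outside what this page specifies.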

After (production and incidents)

Run monitoring, sampling, and replay, not the full CI suite on every request.

| When | What runs | Runs test | Authors fixture | How | Gate |
|---|---|---|---|---|---|
| Steady state | Sample audit replay | SRE (scheduled job) | Governance (replay rules) | Replay audit_id chains: SARAC + policy_version → verdict | Alert if replay ≠ logged verdict |
| Online | Trace eval overlap | AI platform (eval pipeline) | AI platform + Governance (thresholds) | Action / Tool planes on traces | Dashboard: block rate, step-up rate, unknown tools |
| Incident / near-miss | New incident replay scenario | Governance (lead) | Governance + Domain + AI platform (redacted audit export) | status: candidate → review → active | Same failure cannot merge without scenario |
| Periodic | Adversarial + pen test | Security / Governance (red team) | Governance → CI fixtures | Injection, manifest escape, subject swap | Findings become active scenarios in before CI |

After does not replace before. SRE and Governance detect drift; AI platform keeps CI green so regressions do not ship again.
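The steady-state replay check (SARAC + policy_version → verdict, alert if the replay differs from the logged verdict) can be sketched as follows. The audit record layout (`audit_id`, `sarac`, `policy_version`, `verdict`), the `replay_mismatches` helper, and the toy `toy_pdp` evaluator are assumptions; a real job would read the audit log and call the PDP pinned to the logged policy version.

```python
from typing import Callable


def replay_mismatches(
    records: list[dict],
    evaluate: Callable[[dict, str], str],
) -> list[dict]:
    """Re-evaluate each logged decision and report those that no longer match."""
    mismatches = []
    for record in records:
        replayed = evaluate(record["sarac"], record["policy_version"])
        if replayed != record["verdict"]:
            mismatches.append({
                "audit_id": record["audit_id"],
                "logged": record["verdict"],
                "replayed": replayed,
            })
    return mismatches


def toy_pdp(sarac: dict, policy_version: str) -> str:
    """Hypothetical evaluator: DENY unless the sanctions check is clear."""
    return "ALLOW" if sarac["context"].get("sanctions_status") == "clear" else "DENY"


# Two hypothetical audit records; the second was logged ALLOW despite a sanctions hit.
AUDIT_SAMPLE = [
    {"audit_id": "audit-1", "policy_version": "pgar.payments.wire/v3", "verdict": "ALLOW",
     "sarac": {"action": "initiate_wire", "context": {"sanctions_status": "clear"}}},
    {"audit_id": "audit-2", "policy_version": "pgar.payments.wire/v3", "verdict": "ALLOW",
     "sarac": {"action": "initiate_wire", "context": {"sanctions_status": "hit"}}},
]

for mismatch in replay_mismatches(AUDIT_SAMPLE, toy_pdp):
    print("ALERT", mismatch)
```

Each mismatch is a natural seed for a new incident replay scenario entering the candidate → review → active workflow.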

How to run (minimal harness)

  1. Store scenarios as versioned JSON/YAML (git), same schema as above.
  2. Simulate at the lowest useful layer:
    • Unit: PDP only (input → verdict), fastest for policy edits.
    • Integration: PEP + mock downstream (downstream_called assertion).
    • E2E (staging): agentic app + stub LLM proposal, real PEP path.
  3. Tag each case: status: active | candidate | retired. Only active blocks release.
  4. Record regression_run_id, policy_version, and actual_verdict in CI artifacts for audit.

Example CI step (pseudocode):

# Before merge — blocking
pnpm run pgar:scenarios --status active --policy-version pgar.payments.wire/v3

# After deploy — non-blocking sample (cron)
pnpm run pgar:replay-audit --sample 100 --since 24h

Pair with the PGAR release gate matrix for which scenario layers to re-run per change type.

Ownership (fixture library)

| Role | Authors | Runs |
|---|---|---|
| Governance / compliance | Adversarial, incident replay, policy regression cases | Approves new active scenarios; leads incident → fixture workflow |
| Domain | Representative + edge business rules | UAT during rollout; validates business journeys |
| AI platform | Harness, CI wiring, integration fixtures | PR + merge CI, staging E2E, online trace evals |
| Security / IAM | Token, entitlement, bypass fixtures | Infra bypass tests during rollout |
| SRE | Replay job config (with Governance) | Nightly audit replay, canary/shadow ops, drift alerts |

See When to run above for phase-by-phase runner vs author split.

Trace fields

scenario_id, expected_verdict, actual_verdict, policy_version, regression_run_id

See: Adversarial testing · PDP policy surfaces