Eval Plane ⑤: Tool

The Tool plane is where agents touch the real world: APIs, databases, workflows. Most production incidents are tool misuse, not bad prose.

THE CLAIM

Tool eval is argument validation, manifest compliance, and error recovery — not whether the model apologized nicely after a 500.

What to evaluate

Check | Type
Correct tool selected | Automated + judge
Args match JSON schema | Automated
Args semantically valid | Judge + human review on financial calls
Idempotency key present | Automated
Retry / fallback behavior | Automated scenario tests
No disallowed tools | Automated manifest check

Failure classes

  • Tool misuse — wrong tool, wrong args, wrong timing
  • Schema violation — type errors, missing required fields
  • Unsafe composition — chained calls bypass intent

Golden dataset examples

Scenario | Expected
Representative | initiate_wire with amount, beneficiary id
Edge | API timeout → retry once, then escalate
Adversarial | Model proposes tool not in manifest → blocked
Incident replay | Wrong currency code accepted → now rejected
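The edge scenario (API timeout, retry once, then escalate) lends itself to an automated scenario test. The following is a minimal sketch rather than a prescribed implementation: `call_with_retry`, `Escalation`, and `FlakyTool` are hypothetical names introduced for illustration, and the mock stands in for whatever downstream the real tool wraps.

```python
from typing import Callable


class Escalation(Exception):
    """Hypothetical signal that the call is handed off after retries run out."""


def call_with_retry(tool: Callable[[], str], max_retries: int = 1) -> str:
    """Call the tool, retrying on timeout up to max_retries times, then escalate."""
    attempts = 0
    while True:
        try:
            return tool()
        except TimeoutError:
            if attempts >= max_retries:
                raise Escalation(f"timed out after {attempts + 1} attempts")
            attempts += 1


class FlakyTool:
    """Mock downstream that times out a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("simulated API timeout")
        return "ok"
```

A scenario test would then assert two things: a tool that times out once succeeds on the second call, and a tool that times out twice raises `Escalation` after exactly two calls, with no third attempt.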

Automated checks

{
  "tool": "get_account_balance",
  "arguments": { "account_id": "ACC-123" },
  "schema_valid": true,
  "in_manifest": true,
  "policy_precheck": "ALLOW"
}

Assert each field of this record against a stored fixture. Mock the downstream service for unit evals; use a real sandbox for integration evals.
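As a rough illustration, the record above could be produced by a small pre-check and compared with a fixture. This sketch makes several assumptions that are not in the source: the `MANIFEST` structure, the `precheck` function, the rule that every listed argument is required with no extras allowed, and the `"DENY"` value (the source only shows `"ALLOW"`). A production setup would more likely validate against real JSON Schema documents.

```python
from typing import Any

# Hypothetical manifest: tool name -> required argument names and their types.
MANIFEST: dict[str, dict[str, type]] = {
    "get_account_balance": {"account_id": str},
}


def precheck(call: dict[str, Any]) -> dict[str, Any]:
    """Annotate a proposed tool call with manifest, schema, and policy results."""
    tool = call.get("tool")
    args = call.get("arguments", {})
    spec = MANIFEST.get(tool)
    in_manifest = spec is not None
    # Simplified schema check: exact key match and matching Python types.
    schema_valid = (
        in_manifest
        and isinstance(args, dict)
        and set(args) == set(spec)
        and all(isinstance(args[key], typ) for key, typ in spec.items())
    )
    return {
        "tool": tool,
        "arguments": args,
        "schema_valid": schema_valid,
        "in_manifest": in_manifest,
        # "DENY" is an assumed counterpart to the "ALLOW" shown in the record.
        "policy_precheck": "ALLOW" if (in_manifest and schema_valid) else "DENY",
    }


# Fixture mirroring the expected record for the representative call.
FIXTURE = {
    "tool": "get_account_balance",
    "arguments": {"account_id": "ACC-123"},
    "schema_valid": True,
    "in_manifest": True,
    "policy_precheck": "ALLOW",
}
```

A unit eval can then assert `precheck(call) == FIXTURE` for the golden call, and assert that a tool missing from the manifest comes back with `in_manifest` false.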

LLM-as-judge dimensions

  1. Tool fit (1–5)
  2. Argument correctness (1–5)
  3. Sequencing (1–5) — order of operations sensible?
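Judge output is only useful if it is parsed strictly. The sketch below assumes the judge is prompted to reply with a JSON object keyed by the three dimensions; the key names, `JudgeScores`, and `parse_judge_output` are hypothetical and not part of any particular judge framework.

```python
import json
from dataclasses import dataclass

# Assumed JSON keys for the three rubric dimensions.
DIMENSIONS = ("tool_fit", "argument_correctness", "sequencing")


@dataclass(frozen=True)
class JudgeScores:
    tool_fit: int
    argument_correctness: int
    sequencing: int

    @property
    def minimum(self) -> int:
        """Weakest dimension; handy for flagging a call for human review."""
        return min(self.tool_fit, self.argument_correctness, self.sequencing)


def parse_judge_output(raw: str) -> JudgeScores:
    """Parse the judge's JSON reply, rejecting missing or out-of-range scores."""
    data = json.loads(raw)
    scores: dict[str, int] = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    return JudgeScores(**scores)
```

Rejecting malformed replies outright, rather than defaulting a missing score, keeps a broken judge prompt from silently passing calls.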

Human review

Review 100% of financial tool calls during calibration; review a sample of read-only tool calls.

Release gate

  • Schema validation: 100% on golden set
  • Manifest violations: 0
  • Semantic arg errors on high-risk tools: 0 (human sign-off)
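The three thresholds above can be folded into a single pass/fail function. This is a sketch under assumptions: `EvalResult` and `release_gate` are hypothetical names, `semantic_error` is assumed to be recorded by the human reviewer, and treating an empty golden set as a failure is a design choice added here rather than something the gate list states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One golden-set case after automated checks and human review (assumed shape)."""
    tool: str
    schema_valid: bool
    in_manifest: bool
    high_risk: bool = False
    semantic_error: bool = False  # assumed to be set during human sign-off


def release_gate(results: list[EvalResult]) -> tuple[bool, list[str]]:
    """Return (passed, reasons); the gate passes only if no reason is recorded."""
    reasons: list[str] = []
    if not results:
        # Assumption: an empty golden set should block release, not pass vacuously.
        reasons.append("golden set is empty")
    schema_failures = sum(not r.schema_valid for r in results)
    if schema_failures:
        reasons.append(f"schema validation failed on {schema_failures} case(s)")
    manifest_violations = sum(not r.in_manifest for r in results)
    if manifest_violations:
        reasons.append(f"{manifest_violations} manifest violation(s)")
    semantic_errors = sum(r.high_risk and r.semantic_error for r in results)
    if semantic_errors:
        reasons.append(f"{semantic_errors} semantic arg error(s) on high-risk tools")
    return (not reasons, reasons)
```

Returning the reasons alongside the boolean lets a CI job print why a release was blocked instead of only failing.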

Trace fields

tool_name, arguments, schema_validation, manifest_version, downstream_status
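One way to keep these fields consistent across services is a typed record serialized as one JSON line per tool call. The field names come from the list above; the field types, the `ToolTrace` class, and the `to_json_line` helper are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class ToolTrace:
    tool_name: str
    arguments: dict[str, Any]
    schema_validation: bool   # assumed boolean; could instead carry error details
    manifest_version: str     # assumed to be a string label
    downstream_status: int    # assumed HTTP-style status code


def to_json_line(trace: ToolTrace) -> str:
    """Serialize a trace as a single JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)


# Illustrative values only; the version label and status code are made up.
example = ToolTrace(
    tool_name="get_account_balance",
    arguments={"account_id": "ACC-123"},
    schema_validation=True,
    manifest_version="v1",
    downstream_status=200,
)
```

Sorting keys keeps the output stable, which makes trace lines easier to diff between runs.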

See: Policy-Governed Agent Runtime · PGAR with RAG