Eval Plane ⑤: Tool

The Tool plane is where agents touch the real world: APIs, databases, workflows. Most production incidents are tool misuse, not bad prose.

THE CLAIM

Tool eval is argument validation, manifest compliance, and error recovery — not whether the model apologized nicely after a 500.

What to evaluate

Check	Type
Correct tool selected	Automated + judge
Args match JSON schema	Automated
Args semantically valid	Judge + human review on monetary calls
Idempotency key present	Automated
Retry / fallback behavior	Automated scenario tests
No disallowed tools	Automated manifest
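
The automated rows in the table above can be sketched as one pre-dispatch check. This is a minimal illustration, not a full JSON Schema validator: the `MANIFEST` layout, the `needs_idempotency_key` flag, the `idempotency_key` argument name, and `check_tool_call` are all assumptions made for the example.

```python
# Hypothetical manifest: tool name -> simplified argument spec.
# This is NOT full JSON Schema; it only checks required fields and Python types.
MANIFEST = {
    "get_account_balance": {
        "required": {"account_id": str},
        "needs_idempotency_key": False,
    },
    "initiate_wire": {
        "required": {"amount": (int, float), "beneficiary_id": str},
        "needs_idempotency_key": True,
    },
}


def check_tool_call(tool, arguments, manifest=MANIFEST):
    """Return a list of violation strings; an empty list means the call passes."""
    if tool not in manifest:
        return [f"tool not in manifest: {tool}"]
    spec = manifest[tool]
    violations = []
    for field, expected_type in spec["required"].items():
        if field not in arguments:
            violations.append(f"missing required field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(arguments[field], bool) or not isinstance(
            arguments[field], expected_type
        ):
            violations.append(f"wrong type for field: {field}")
    if spec["needs_idempotency_key"] and not arguments.get("idempotency_key"):
        violations.append("missing idempotency key")
    return violations
```

Returning every violation, rather than stopping at the first, keeps the eval report useful when a single call fails several checks at once.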

Failure classes

Tool misuse — wrong tool, wrong args, wrong timing
Schema violation — type errors, missing required fields
Unsafe composition — chained calls bypass intent
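
Unsafe composition is the hardest of the three classes to catch per call, because each call can be valid on its own. One possible approach, sketched here under assumptions, is an ordering rule over the whole tool sequence. The tool names `export_customer_data` and `send_external_email` and the rule format are hypothetical.

```python
# Hypothetical rule set: (earlier_tool, later_tool) pairs that must not
# appear in that order within one agent run.
FORBIDDEN_SEQUENCES = {
    ("export_customer_data", "send_external_email"),
}


def find_unsafe_compositions(tool_sequence, forbidden=FORBIDDEN_SEQUENCES):
    """Return each forbidden (earlier, later) pair found in call order."""
    seen = set()
    hits = []
    for tool in tool_sequence:
        for earlier, later in sorted(forbidden):
            if tool == later and earlier in seen and (earlier, later) not in hits:
                hits.append((earlier, later))
        seen.add(tool)
    return hits
```

The pair is flagged even when unrelated calls sit between the two tools, which is the case a per-call check would miss.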

Golden dataset examples

Scenario	Expected
Representative	`initiate_wire` with amount, beneficiary id
Edge	API timeout → retry once, then escalate
Adversarial	Model proposes tool not in manifest → blocked
Incident replay	Wrong currency code accepted → now rejected
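
The edge scenario (timeout, retry once, then escalate) can be written as a small scenario harness. This is a sketch only: `Escalation` and `call_with_retry` are illustrative names, and it assumes the downstream client surfaces a timeout as Python's built-in `TimeoutError`.

```python
class Escalation(Exception):
    """Raised when the agent should hand off instead of retrying again."""


def call_with_retry(tool_fn, arguments, max_retries=1):
    """Call tool_fn; on TimeoutError retry up to max_retries, then escalate."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return tool_fn(**arguments)
        except TimeoutError as exc:
            # One initial attempt plus max_retries retries are allowed.
            if attempts > max_retries:
                raise Escalation(f"timed out after {attempts} attempts") from exc
```

A scenario test would pass a stub that always times out and assert that it was invoked exactly twice before `Escalation` is raised.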

Automated checks

{
  "tool": "get_account_balance",
  "arguments": { "account_id": "ACC-123" },
  "schema_valid": true,
  "in_manifest": true,
  "policy_precheck": "ALLOW"
}

Assert each record against a fixture. Mock the downstream service for unit evals; use a real sandbox for integration evals.
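
A unit eval along these lines might look like the sketch below, which compares a record to a fixture and calls a mocked downstream only when they match. `diff_against_fixture` and `run_unit_eval` are hypothetical helpers; `Mock` is the standard-library `unittest.mock.Mock`.

```python
from unittest.mock import Mock

# Expected record, matching the JSON example in this section.
FIXTURE = {
    "tool": "get_account_balance",
    "arguments": {"account_id": "ACC-123"},
    "schema_valid": True,
    "in_manifest": True,
    "policy_precheck": "ALLOW",
}


def diff_against_fixture(actual, fixture=FIXTURE):
    """Return {field: (expected, actual)} for every field that differs."""
    return {
        key: (expected, actual.get(key))
        for key, expected in fixture.items()
        if actual.get(key) != expected
    }


def run_unit_eval(record, downstream):
    """Unit eval: compare to the fixture, then call the mocked downstream."""
    mismatches = diff_against_fixture(record)
    if not mismatches:
        downstream(**record["arguments"])
    return mismatches


# Usage: the mock stands in for the real balance API.
downstream = Mock()
assert run_unit_eval(dict(FIXTURE), downstream) == {}
downstream.assert_called_once_with(account_id="ACC-123")
```

For an integration eval the same `run_unit_eval` shape could take a sandbox client in place of the mock.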

LLM-as-judge dimensions

Tool fit (1–5)
Argument correctness (1–5)
Sequencing (1–5) — order of operations sensible?

Human review

Review 100% of financial tool calls during calibration; review a sample of read-only tool calls.

Release gate

Schema validation: 100% on golden set
Manifest violations: 0
Semantic arg errors on high-risk tools: 0 (human sign-off)
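
The three gate criteria can be encoded as a single pass/fail function. The sketch below is one way to do it; `GoldenSetResults` and its field names are assumptions, and treating an empty golden set as a failure is a choice made for this example rather than something the gate specifies.

```python
from dataclasses import dataclass


@dataclass
class GoldenSetResults:
    total_calls: int
    schema_valid_calls: int
    manifest_violations: int
    high_risk_semantic_errors: int
    human_signed_off: bool


def release_gate(r):
    """Return (passed, reasons); reasons lists every failed criterion."""
    reasons = []
    # Assumption: an empty golden set cannot demonstrate 100% validity.
    if r.total_calls == 0 or r.schema_valid_calls != r.total_calls:
        reasons.append("schema validation below 100% on golden set")
    if r.manifest_violations != 0:
        reasons.append("manifest violations present")
    if r.high_risk_semantic_errors != 0:
        reasons.append("semantic arg errors on high-risk tools")
    if not r.human_signed_off:
        reasons.append("missing human sign-off")
    return (not reasons, reasons)
```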

Trace fields

tool_name, arguments, schema_validation, manifest_version, downstream_status
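
As an illustration, these fields could be carried in one record per tool call. The field names come from the list above; the types are assumptions (for example, `downstream_status` as an HTTP status code and `schema_validation` as a boolean), and `ToolTrace` is a hypothetical name.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ToolTrace:
    tool_name: str
    arguments: dict
    schema_validation: bool   # assumed: True when args passed schema checks
    manifest_version: str     # assumed: opaque version label
    downstream_status: int    # assumed: HTTP status from the downstream call

    def to_json(self):
        """Serialize with sorted keys so traces diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)
```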

See: Policy-Governed Agent Runtime · PGAR with RAG
