Eval Plane ⑤: Tool
The Tool plane is where agents touch the real world: APIs, databases, workflows. Most production incidents are tool misuse, not bad prose.
THE CLAIM
Tool eval measures argument validation, manifest compliance, and error recovery, not whether the model apologized nicely after an HTTP 500.
What to evaluate
| Check | Type |
|---|---|
| Correct tool selected | Automated + judge |
| Args match JSON schema | Automated |
| Args semantically valid | Judge + human review on monetary calls |
| Idempotency key present | Automated |
| Retry / fallback behavior | Automated scenario tests |
| No disallowed tools | Automated manifest |
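The purely automated rows in the table above could be sketched as one check function. This is a minimal illustration under stated assumptions: the `MANIFEST` layout, the `MUTATING_TOOLS` set, the `beneficiary_id` and `idempotency_key` argument names, and `check_tool_call` itself are hypothetical, and a real pipeline would more likely validate against full JSON Schema documents than against a name-to-type map.

```python
from typing import Any

# Hypothetical manifest: tool name -> required argument names and their types.
MANIFEST: dict[str, dict[str, type]] = {
    "get_account_balance": {"account_id": str},
    "initiate_wire": {"amount": float, "beneficiary_id": str, "idempotency_key": str},
}
# Hypothetical set of state-changing tools that must carry an idempotency key.
MUTATING_TOOLS = {"initiate_wire"}


def check_tool_call(tool: str, arguments: dict[str, Any]) -> dict[str, bool]:
    """Run the automated checks on one proposed tool call."""
    in_manifest = tool in MANIFEST
    spec = MANIFEST.get(tool, {})
    # Schema check: exactly the required fields, each with the expected type.
    schema_valid = (
        in_manifest
        and set(arguments) == set(spec)
        and all(isinstance(arguments[name], expected) for name, expected in spec.items())
    )
    # Read-only tools pass trivially; mutating tools need a non-empty key.
    idempotency_ok = tool not in MUTATING_TOOLS or bool(arguments.get("idempotency_key"))
    return {
        "in_manifest": in_manifest,
        "schema_valid": schema_valid,
        "idempotency_ok": idempotency_ok,
    }
```

Keeping the three results separate, rather than collapsing them into one pass/fail flag, lets a failing eval report which check tripped.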
Failure classes
- Tool misuse — wrong tool, wrong args, wrong timing
- Schema violation — type errors, missing required fields
- Unsafe composition — chained calls bypass intent
Golden dataset examples
| Scenario | Expected |
|---|---|
| Representative | initiate_wire with amount, beneficiary id |
| Edge | API timeout → retry once, then escalate |
| Adversarial | Model proposes tool not in manifest → blocked |
| Incident replay | Wrong currency code accepted → now rejected |
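The edge scenario in the table above (timeout, retry once, then escalate) might be exercised with a wrapper like the one below. It is a sketch only: `EscalationRequired` and `call_with_single_retry` are hypothetical names, and it assumes the tool client signals a timeout by raising Python's built-in `TimeoutError`.

```python
from typing import Callable


class EscalationRequired(Exception):
    """Raised when the retry budget is spent and another path must take over."""


def call_with_single_retry(tool_fn: Callable[[], dict], max_retries: int = 1) -> dict:
    """Call tool_fn, retrying on timeout up to max_retries times, then escalate."""
    retries_used = 0
    while True:
        try:
            return tool_fn()
        except TimeoutError as exc:
            if retries_used >= max_retries:
                # Budget exhausted: surface a distinct error so the eval can
                # assert that escalation happened instead of a silent loop.
                raise EscalationRequired(
                    f"timed out after {retries_used + 1} attempts"
                ) from exc
            retries_used += 1
```

A scenario test can then pass in a stub that always times out and assert both that `EscalationRequired` is raised and that the stub was called exactly twice.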
Automated checks
{
  "tool": "get_account_balance",
  "arguments": { "account_id": "ACC-123" },
  "schema_valid": true,
  "in_manifest": true,
  "policy_precheck": "ALLOW"
}
Assert each emitted record against its fixture. Mock downstream services for unit evals; use a real sandbox for integration evals.
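A unit-level version of this could look like the following sketch, which uses `unittest.mock.Mock` from the standard library as the downstream stand-in. The `run_agent_step` function is a hypothetical placeholder for the agent under test, and the balance value returned by the mock is made up for illustration.

```python
from unittest.mock import Mock

# Expected tool call for this golden example.
FIXTURE = {"tool": "get_account_balance", "arguments": {"account_id": "ACC-123"}}


def run_agent_step(client) -> dict:
    """Hypothetical stand-in for the agent under test: emits one tool call."""
    call = {"tool": "get_account_balance", "arguments": {"account_id": "ACC-123"}}
    call["result"] = client.get_account_balance(**call["arguments"])
    return call


def eval_against_fixture(client) -> bool:
    """Compare the emitted tool name and arguments with the fixture."""
    observed = run_agent_step(client)
    return all(observed[key] == FIXTURE[key] for key in ("tool", "arguments"))


# Unit eval: the downstream service is a mock, so no real API is touched.
downstream = Mock()
downstream.get_account_balance.return_value = {"balance": "100.00"}
passed = eval_against_fixture(downstream)
# The mock also records how it was called, which checks the wire-level arguments.
downstream.get_account_balance.assert_called_once_with(account_id="ACC-123")
```

For the integration variant, the same `eval_against_fixture` function could be handed a sandbox client in place of the mock.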
LLM-as-judge dimensions
- Tool fit (1–5)
- Argument correctness (1–5)
- Sequencing (1–5) — order of operations sensible?
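Judge output is model-generated, so it helps to validate the scores before they feed any metric. The sketch below assumes the judge returns one integer per dimension; the `JudgeScore` class and its field names are hypothetical.

```python
from dataclasses import dataclass

DIMENSIONS = ("tool_fit", "argument_correctness", "sequencing")


@dataclass(frozen=True)
class JudgeScore:
    tool_fit: int
    argument_correctness: int
    sequencing: int

    def __post_init__(self) -> None:
        # Reject anything outside the 1 to 5 rubric instead of averaging it in.
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")

    def minimum(self) -> int:
        """Weakest dimension; one possible way to flag a call for human review."""
        return min(getattr(self, name) for name in DIMENSIONS)
```

Reporting the minimum rather than the mean is a design choice made here for illustration: a well-chosen tool with wrong arguments should not average out to a passing score.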
Human review
Review 100% of financial tool calls during calibration; review a sample of read-only tool calls.
Release gate
- Schema validation: 100% on golden set
- Manifest violations: 0
- Semantic arg errors on high-risk tools: 0 (human sign-off)
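The three gate conditions above could be encoded as a single function that returns the reasons a release is blocked. This is an illustrative sketch; `EvalSummary`, its field names, and `release_gate` are assumptions, and treating an empty golden set as a failure is a choice made here, not something the gate text specifies.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    golden_total: int               # examples in the golden set
    golden_schema_valid: int        # examples whose tool calls passed schema validation
    manifest_violations: int        # calls to tools outside the manifest
    high_risk_semantic_errors: int  # semantic arg errors on high-risk tools
    human_signoff: bool             # reviewer approved the high-risk results


def release_gate(summary: EvalSummary) -> list[str]:
    """Return the list of blocking reasons; an empty list means the gate passes."""
    failures: list[str] = []
    # An empty golden set is treated as a failure rather than a vacuous 100%.
    if summary.golden_total == 0 or summary.golden_schema_valid != summary.golden_total:
        failures.append("schema validation below 100% on golden set")
    if summary.manifest_violations != 0:
        failures.append("manifest violations present")
    if summary.high_risk_semantic_errors != 0 or not summary.human_signoff:
        failures.append("high-risk semantic arg errors or missing human sign-off")
    return failures
```

Returning reasons instead of a boolean keeps the gate output usable in a CI log.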
Trace fields
tool_name, arguments, schema_validation, manifest_version, downstream_status
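One way to emit these fields is a small record type serialized as one JSON object per line. The field names come from the list above; the field types (for example, `schema_validation` as a boolean and `downstream_status` as an HTTP status code), the sample values, and the `ToolTrace` and `to_json_line` names are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToolTrace:
    tool_name: str
    arguments: dict
    schema_validation: bool   # assumed: True when the args passed schema validation
    manifest_version: str     # assumed: opaque version label for the manifest in force
    downstream_status: int    # assumed: HTTP status returned by the downstream service


def to_json_line(trace: ToolTrace) -> str:
    """Serialize one trace as a single JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)


example = ToolTrace(
    tool_name="get_account_balance",
    arguments={"account_id": "ACC-123"},
    schema_validation=True,
    manifest_version="example-version",  # placeholder value
    downstream_status=200,
)
line = to_json_line(example)
```

Recording `manifest_version` with every call makes it possible to replay an incident against the manifest that was actually in force at the time.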