In this article, you will learn how tool design — not model capability — is the root cause of most AI agent failures, and what concrete design patterns you can apply to fix it.
Topics we will cover include:
Tool design practices that improve agent reliability, including single-responsibility tools, tight schemas, and structured error returns.
Common failure modes such as unfiltered API exposure, silent partial success, and overlapping tool names that break real-world workloads.
Schema and error handling patterns that reduce hallucination and unreliable behavior at the tool boundary.
Let’s get into it.
AI Agent Tool Design: What Works and What Doesn’t
Introduction
Most AI agent failures look like model mistakes: choosing the wrong tool, passing bad arguments, or mishandling errors. But in practice, the model is usually working with the interface it was given. The underlying issue is often the tool design itself.
A model can only reason from the information exposed through the tool interface: the tool name, its description, the parameter schema, and the parameter descriptions. Those details shape how the model interprets intent, plans actions, and executes tasks. When the tool design is unclear, incomplete, or loosely structured, failures become predictable rather than accidental.
Problems like vague naming, ambiguous instructions, inconsistent schemas, weak parameter definitions, and poor error handling all increase the likelihood of failures. Stronger models can reduce some mistakes, but they cannot reliably compensate for a flawed interface. This article covers:
Tool design practices that improve reliability
Failure modes that look fine in demos but break under real workloads
Schema and error design that reduces hallucination at the tool boundary
Each pattern is paired with its failure counterpart, because understanding why a design fails is as important as knowing what to replace it with.
What Works in AI Agent Tool Design
1. One Tool, One Responsibility
In most agent systems, a tool should represent a single, clear operation. When one tool handles multiple behaviors through an action parameter, the model must first figure out which mode to invoke before it can solve the actual task.
The difference becomes clearer when comparing a multi-action tool against dedicated single-purpose tools:
# Avoid: action-based multi-behavior tool
@tool
def manage_customer(
    action: str,
    customer_id: str | None = None,
    data: dict | None = None,
):
    """
    action: create | get | update | delete | suspend
    """
    ...

# Prefer: single-responsibility tools
@tool
def create_customer(data: CustomerInput) -> Customer:
    """Create a new customer record."""
    ...

@tool
def get_customer(customer_id: str) -> Customer:
    """Retrieve a customer by ID."""
    ...

@tool
def suspend_customer(customer_id: str, reason: str) -> SuspensionResult:
    """Suspend a customer account."""
    ...
Single-responsibility tools give the model an unambiguous function and give you cleaner error handling and easier observability.
⚠️ Note: This is a useful default rather than a universal rule. Some domains — such as shell, filesystem, browser, or calendar tools — may benefit from a constrained multi-action interface because the action space itself is part of the underlying abstraction.
2. Schemas That Make Invalid States Impossible
In tool-calling agents, the model constructs tool call arguments by reasoning from your schema.
A loose schema means the model guesses at constraints.
A tight schema encodes those constraints so no guessing is needed.
Here’s an example:
from pydantic import BaseModel, Field
from enum import Enum

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

class CreateTaskInput(BaseModel):
    title: str = Field(
        description="Short, actionable task title. Use imperative form: 'Review PR', not 'PR Review'.",
        min_length=5,
        max_length=100,
    )
    priority: Priority = Field(
        description="Task priority. Use HIGH only for blockers affecting other work.",
        default=Priority.MEDIUM,
    )
    due_date: str = Field(
        description="Due date in ISO 8601 format: YYYY-MM-DD. Must be a future date.",
        pattern=r"^\d{4}-\d{2}-\d{2}$",
    )
Enums are particularly useful for fields with a small set of valid values because they eliminate a class of plausible-but-invalid outputs. Validation failures surface at the tool boundary rather than as cryptic downstream errors.
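One caveat about the example above: the regex on due_date only checks the shape of the string, so a value such as 2025-13-45 matches it, and the pattern cannot express "must be a future date". A minimal standard-library sketch of that extra check at the tool boundary might look like the following. The helper name parse_due_date and its today parameter are assumptions made for illustration, not part of any framework.

```python
import re
from datetime import date

_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_due_date(value: str, today: date) -> date:
    """Parse a YYYY-MM-DD string and require it to fall after `today`.

    Hypothetical helper: the name and the `today` parameter are illustrative.
    """
    if not _DATE_SHAPE.fullmatch(value):
        raise ValueError(f"due_date must use the YYYY-MM-DD format, got {value!r}")
    try:
        parsed = date.fromisoformat(value)  # rejects impossible dates like 2025-13-45
    except ValueError as exc:
        raise ValueError(f"due_date is not a real calendar date: {value!r}") from exc
    if parsed <= today:
        raise ValueError(f"due_date must be after {today.isoformat()}, got {value!r}")
    return parsed
```

Passing today in explicitly, rather than calling date.today() inside the helper, keeps the check deterministic and easy to test.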
3. Descriptions That Define Scope, Not Just Purpose
Tool descriptions are model-facing documentation. They need to do two things: explain when to use the tool, and explain when not to. Most descriptions only do the first.
# Weak: explains what it does, not when not to use it
"""Search for documents in the knowledge base."""

# Strong: defines purpose, scope, and boundaries
"""
Search the internal knowledge base for documents, policies, and reference material.
Use this when the user asks about company procedures, product specs, or documented workflows.
Do NOT use this for real-time data (prices, availability, current status); use get_live_data() instead.
Returns up to 5 results ranked by relevance. If no results are returned, the information is not in the knowledge base.
"""
Without the disambiguation, the model infers scope from the tool name alone, which is a common source of selection errors at scale. A good tool definition includes clear boundaries from other tools, not just usage instructions.
4. Structured, Actionable Error Returns
When a tool fails, the model reads the error and decides what to do next. An unhandled exception or raw stack trace gives it little to act on, so the follow-up behavior tends to be erratic. A structured error gives the model something to branch on.
Structured errors should not only report what failed but also help the agent decide what to do next. A good error format makes retry behavior explicit and gives the model a clear recovery path:
from pydantic import BaseModel

class ToolError(BaseModel):
    error_code: str        # machine-readable, for the model to branch on
    message: str           # human-readable description
    recoverable: bool      # can the agent retry?
    suggested_action: str  # what the agent should do next

# Inside a tool body. Record not found: retryable
return ToolError(
    error_code="RECORD_NOT_FOUND",
    message="No user record found with ID 'usr_123'.",
    recoverable=True,
    suggested_action="Use list_users() to get valid user IDs before calling get_user().",
)

# Inside a tool body. Quota exceeded: not retryable
return ToolError(
    error_code="QUOTA_EXCEEDED",
    message="API quota for this tool has been reached for today.",
    recoverable=False,
    suggested_action="Notify the user and stop. Do not retry this tool today.",
)
The recoverable flag and suggested_action field are what change agent behavior. Without them, models retry non-retryable errors or abandon recoverable ones.
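To make the branching concrete, here is a small sketch of how an orchestration loop might act on those two fields. It assumes the loop, rather than the model alone, enforces the retry policy. The next_step function and its max_attempts limit are hypothetical, and a standard-library dataclass stands in for the Pydantic model so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolError:
    """Stand-in with the same four fields as the ToolError model above."""
    error_code: str
    message: str
    recoverable: bool
    suggested_action: str


def next_step(error: ToolError, attempts: int, max_attempts: int = 3) -> str:
    """Return "retry" or "stop" for a failed tool call (illustrative policy)."""
    if not error.recoverable:
        return "stop"   # never retry a non-recoverable error
    if attempts >= max_attempts:
        return "stop"   # recoverable, but the retry budget is spent
    return "retry"      # the model should follow error.suggested_action


quota = ToolError("QUOTA_EXCEEDED", "Quota reached.", False, "Notify the user and stop.")
missing = ToolError("RECORD_NOT_FOUND", "No such user.", True, "Call list_users() first.")
```

With this policy, next_step(quota, attempts=1) stops immediately, while next_step(missing, attempts=1) allows another attempt until the limit is reached.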
5. Idempotent State-Changing Operations
Every tool that mutates state — creates a record, sends a message, transfers funds — must be safe to call twice. In practice, agents retry, networks fail, and the LLM loop may issue a second call because confirmation of the first never arrived.
A simple way to prevent duplicate side effects is to require an idempotency key for every write operation:
from pydantic import Field

@tool
def send_email(
    to: str,
    subject: str,
    body: str,
    idempotency_key: str = Field(
        description="Unique key for this send operation. Use a hash of recipient + subject + "
                    "the original request timestamp, and reuse the same key on every retry. "
                    "Same key on retry returns the original result without re-sending."
    ),
) -> dict:
    """Send an email. Idempotent: the same idempotency_key will not trigger a second send."""
    # Return the stored result if this key has already been processed
    existing = idempotency_store.get(idempotency_key)
    if existing is not None:
        return existing
    result = email_service.send(to=to, subject=subject, body=body)
    idempotency_store.set(idempotency_key, result, ttl=86400)  # keep for 24 hours
    return result
Without idempotency guarantees, transient failures can easily turn into duplicate actions.
What Doesn’t Work in AI Agent Tool Design
1. Thin Wrappers Around Unfiltered APIs
Pointing an agent at a REST API and surfacing it as a tool is the most common shortcut and the most common source of production failures. APIs built for developers often expose far more detail than agents actually need. Responses come packed with hundreds of fields, even when only a handful are relevant. They rely on pagination, use opaque internal IDs with little contextual meaning, and return error codes that require deep domain knowledge to interpret.
A purpose-built wrapper handles pagination internally, projects only the fields the agent needs, and maps API errors to the structured ToolError format discussed above. The agent never constructs API paths or manages pages; it receives typed objects it can reason about.
That said, over-wrapping can also be harmful. If every endpoint becomes a separate, narrowly defined tool with no shared structure, the tool surface can become fragmented and harder for the model to navigate. The goal is not maximal abstraction, but a consistent, agent-friendly abstraction layer.
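As a rough illustration of a purpose-built wrapper, the sketch below follows pagination internally and projects each record down to two fields. Everything here is hypothetical: raw_list_orders and its page data simulate a verbose API, and OrderSummary and list_orders are names invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Simulated raw API pages: verbose records plus a cursor to the next page.
_FAKE_PAGES = {
    None: {"items": [{"id": "o1", "status": "open", "internal_ref": "x9", "etag": "a"}],
           "next": "p2"},
    "p2": {"items": [{"id": "o2", "status": "closed", "internal_ref": "x7", "etag": "b"}],
           "next": None},
}


def raw_list_orders(cursor: Optional[str] = None) -> dict:
    """Stand-in for a developer-facing endpoint that returns one page."""
    return _FAKE_PAGES[cursor]


@dataclass(frozen=True)
class OrderSummary:
    order_id: str
    status: str


def list_orders(max_items: int = 50) -> list[OrderSummary]:
    """Agent-facing tool: follows pagination and keeps only the needed fields."""
    results: list[OrderSummary] = []
    cursor: Optional[str] = None
    while len(results) < max_items:
        page = raw_list_orders(cursor)
        for item in page["items"]:
            results.append(OrderSummary(order_id=item["id"], status=item["status"]))
            if len(results) == max_items:
                return results
        cursor = page["next"]
        if cursor is None:  # no more pages
            break
    return results
```

The agent sees a flat list of small typed objects and never handles cursors, internal references, or ETags. The max_items cap also bounds how much of the context window a single call can consume.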
2. Loading All Tools Into Every Context
Accuracy degrades as the tool catalog grows. LongFuncEval, a 2025 study on tool-calling performance across long contexts, found that performance dropped substantially as the tool catalog size increased, even in models with 128K context windows. Loading every tool into every system prompt compounds this by consuming token budget before any task content is processed.
Dynamic tool loading addresses both problems. Determine which tools are relevant to the current step and include only those:
STEP_TOOL_MAP = {
    "research": ["search_documents", "search_web", "get_url_content"],
    "write": ["create_document", "update_document", "format_text"],
    "send": ["send_email", "post_to_slack", "create_calendar_event"],
}

def get_tools_for_step(step_type: str, available_tools: list) -> list:
    # Unknown step types map to an empty list, so no tools are exposed
    relevant_names = STEP_TOOL_MAP.get(step_type, [])
    return [t for t in available_tools if t.name in relevant_names]
Exposing only a small, relevant subset of tools at each step — rather than the full toolset — generally improves selection accuracy and reduces per-call token cost.
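A short, self-contained usage sketch of this filtering is shown below. The Tool dataclass and the three-item catalog are assumptions made for illustration, since real frameworks have their own tool objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Minimal stand-in for a framework tool object; only the name matters here."""
    name: str


STEP_TOOL_MAP = {
    "research": ["search_documents", "search_web", "get_url_content"],
    "send": ["send_email", "post_to_slack", "create_calendar_event"],
}


def get_tools_for_step(step_type: str, available_tools: list[Tool]) -> list[Tool]:
    """Keep only the tools mapped to this step, preserving catalog order."""
    relevant = set(STEP_TOOL_MAP.get(step_type, []))
    return [t for t in available_tools if t.name in relevant]


catalog = [Tool("search_web"), Tool("send_email"), Tool("search_documents")]
research_tools = get_tools_for_step("research", catalog)  # search_web, search_documents
unknown_tools = get_tools_for_step("billing", catalog)    # empty list
```

Returning an empty list for an unmapped step is a deliberate choice in this sketch. Depending on the system, raising an error for an unknown step type may be safer than silently giving the model no tools.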
3. Silent Partial Success
Partial success becomes a problem when a tool completes only part of the requested work but returns a response that looks fully successful. The agent continues execution with an incomplete or misleading view of the system state.
This usually happens when tools suppress internal failures and return only the successful portion of the result:
# This version silently misleads the agent
@tool
def bulk_create_tasks(tasks: list) -> dict:
    created = []
    for task in tasks:
        try:
            result = task_api.create(task)
            created.append(result.id)
        except Exception:
            pass  # silent failure: this is the bug
    return {"created": created}

# This version makes partial success explicit
@tool
def bulk_create_tasks(tasks: list) -> BulkCreateResult:
    created, failed = [], []
    for task in tasks:
        try:
            created.append(task_api.create(task).id)
        except TaskCreationError as e:
            # Record which input failed and why, instead of dropping it
            failed.append({"input": task.title, "reason": str(e)})
    return BulkCreateResult(
        created_ids=created,
        failed_items=failed,
        success=len(failed) == 0,
        partial_success=len(created) > 0 and len(failed) > 0,
    )
The partial_success flag gives the model something to branch on: retry the failed items, surface the partial result to the user, or halt the workflow.
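One way to picture that branching is the sketch below. It is an assumption-laden illustration: BulkCreateResult is redefined here as a standard-library dataclass with derived flags, and plan_follow_up with its three outcome labels is a hypothetical policy rather than part of any framework.

```python
from dataclasses import dataclass, field


@dataclass
class BulkCreateResult:
    """Stand-in result type; the two flags are derived from the lists."""
    created_ids: list[str] = field(default_factory=list)
    failed_items: list[dict] = field(default_factory=list)

    @property
    def success(self) -> bool:
        return not self.failed_items

    @property
    def partial_success(self) -> bool:
        return bool(self.created_ids) and bool(self.failed_items)


def plan_follow_up(result: BulkCreateResult) -> str:
    """Map a bulk result to the next action (illustrative labels)."""
    if result.success:
        return "continue"       # everything was created
    if result.partial_success:
        return "retry_failed"   # some created, some failed: retry only the failures
    return "halt"               # nothing was created: stop and report
```

Deriving success and partial_success from the two lists, instead of passing them in separately, removes the chance of the flags disagreeing with the data they summarize.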
4. Overlapping Tool Names and Descriptions
When two tools do similar things, the model reasons about which to use on every call. That reasoning costs tokens and introduces errors. Some common examples include:
search_documents and find_documents with identical purpose
get_user and fetch_user_profile with unclear differences
create_task, add_task, and new_task as three tools for one operation
In such cases, renaming alone isn’t the fix. Every tool needs a purpose that can be described without reference to other tools in the set. If a description requires “unlike X, this one…” to make sense, that’s a design problem. Tool sprawl — too many tools with overlapping scope — is a source of unreliable agent behavior in enterprise deployments.
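Overlap of this kind can be caught with a rough pre-deployment check. The sketch below flags pairs of tools whose descriptions share most of their words, using Jaccard similarity over word sets. The find_overlaps function, the 0.6 threshold, and the sample descriptions are all assumptions, and word overlap is only a crude proxy for semantic overlap.

```python
from itertools import combinations


def _tokens(text: str) -> set[str]:
    """Lowercase word set, ignoring very short words such as "in" or "a"."""
    return {w for w in text.lower().replace("_", " ").split() if len(w) > 2}


def find_overlaps(tools: dict[str, str], threshold: float = 0.6) -> list[tuple[str, str]]:
    """Return pairs of tool names whose descriptions are suspiciously similar."""
    flagged = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(tools.items(), 2):
        a, b = _tokens(desc_a), _tokens(desc_b)
        if not a or not b:
            continue
        similarity = len(a & b) / len(a | b)  # Jaccard similarity
        if similarity >= threshold:
            flagged.append((name_a, name_b))
    return flagged


tools = {
    "search_documents": "Search documents in the knowledge base",
    "find_documents": "Find documents in the knowledge base",
    "send_email": "Send an email to a recipient",
}
overlaps = find_overlaps(tools)
```

Here the two document tools share four of six distinct words, so they are flagged as a pair, while send_email shares none with either. A flagged pair is a prompt for human review, not proof that the tools are redundant.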
5. Destructive Actions Without a Confirmation Gate
Any tool that takes an irreversible action — deleting records, messaging real users, executing financial transactions — needs a structural two-step confirmation, not an in-prompt “are you sure?” A staged approach introduces an explicit confirmation boundary that reduces the risk of accidental or unauthorized execution.
The safest pattern is to separate staging from execution and require a short-lived confirmation token between the two steps:
@tool
def stage_deletion(record_ids: list[str], reason: str) -> StagedDeletion:
    """Stage records for deletion. Does NOT delete anything.
    Returns a confirmation token that expires in 60 seconds.
    Call confirm_deletion() with this token to proceed."""
    token = generate_deletion_token(record_ids)
    staged_deletions[token] = {"ids": record_ids, "expires": now() + 60}
    return StagedDeletion(token=token, records_to_delete=len(record_ids), expires_in_seconds=60)

@tool
def confirm_deletion(token: str) -> DeletionResult:
    """Execute a staged deletion. IRREVERSIBLE. Confirm only after explicit user approval."""
    staged = staged_deletions.get(token)
    if not staged or staged["expires"] < now():
        raise ValueError("Token invalid or expired. Stage the deletion again.")
    # proceed
Two distinct tool calls mean the model cannot complete a destructive operation in a single reasoning step, which is the point.
⚠️ Note: A two-step flow is often not sufficient on its own. Even with staging and confirmation in place, additional safeguards such as short-lived, single-use tokens, strict session binding, and replay protection are needed to prevent token reuse, leakage, or cross-session execution that would bypass the intended safety boundary.
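A minimal sketch of those extra safeguards, under stated assumptions, is shown below. It uses an in-memory dictionary in place of a server-side store, and the session_id parameter is a hypothetical addition to the staging and confirmation functions. The token is consumed on the first confirmation attempt, bound to the session that staged it, and rejected after its expiry.

```python
import secrets
import time

_staged: dict[str, dict] = {}  # in-memory stand-in for a server-side store


def stage_deletion(record_ids: list[str], session_id: str, ttl_seconds: float = 60.0) -> str:
    """Stage a deletion and return an unguessable, short-lived token."""
    token = secrets.token_urlsafe(16)
    _staged[token] = {
        "ids": list(record_ids),
        "session_id": session_id,
        "expires": time.monotonic() + ttl_seconds,
    }
    return token


def confirm_deletion(token: str, session_id: str) -> list[str]:
    """Validate the token and return the record IDs that may now be deleted."""
    staged = _staged.pop(token, None)  # pop makes the token single-use
    if staged is None:
        raise ValueError("Token invalid or already used. Stage the deletion again.")
    if staged["session_id"] != session_id:
        raise ValueError("Token was staged in a different session.")
    if staged["expires"] < time.monotonic():
        raise ValueError("Token expired. Stage the deletion again.")
    return staged["ids"]  # the caller would perform the real deletion here
```

Because the token is popped before the other checks run, any failed attempt also burns it. That is a deliberate trade-off in this sketch: a legitimate caller has to stage again after a mistake, but a leaked token cannot be probed repeatedly.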
AI Agent Tool Design Decisions at a Glance
Every row represents a key decision in AI agent tool design:
| Design Area | Works | Doesn’t Work |
| --- | --- | --- |
| Tool Scope | Single responsibility per tool | Action-parameter tools like manage_database(action="create") |
| Schema | Tight: enums, validators, typed fields | Loose: free strings, untyped dicts |
| Descriptions | Include scope boundaries and when not to use | Happy path only |
| Write Operations | Idempotent with idempotency keys | Fire-and-forget, no retry safety |
| Error Returns | Structured: error_code, recoverable, suggested_action | Unhandled exceptions or untyped strings |
| Tool Count | Dynamic loading per step | All tools in every context |
| API Wrapping | Purpose-built wrapper with agent-facing schema | Unfiltered API exposure |
| Partial Success | Explicit partial_success field in return | Silent exception swallowing |
| Destructive Actions | Two-step staging + confirmation | Single-call delete/send/execute |
| Tool Overlap | Semantically distinct, audited before deploy | Similar names and descriptions competing |
Writing effective tools for AI agents — using AI agents from Anthropic is a useful reference on tool design.


