How to Evaluate AI Agent Tools Properly
When building tools for AI agents like Claude, how do you know if your tools are actually effective? Unlike traditional software where you can predict exactly how functions will be called, AI agents are non-deterministic—they might approach the same task in completely different ways each time.
This uncertainty makes it crucial to have a systematic way to evaluate and improve your tools. Based on Anthropic's recent research on writing effective tools for LLM agents, I'll walk you through building a comprehensive evaluation framework that can help you create better tools for AI agents.
The Challenge: Non-Deterministic Tool Usage
Traditional software follows predictable patterns. Call send_email(to="john@company.com", subject="Meeting Reminder") and you'll always send that exact email the same way. But when an AI agent is asked "Remind John about tomorrow's meeting", it might:
- Send the email immediately
- Check John's calendar first to confirm he's available
- Search for previous meeting details to include context
- Or ask you what specific details to include
This variability means we need to test tools differently than we test traditional APIs.
Setting Up Evaluation Tasks
The foundation of good tool evaluation is creating realistic, challenging tasks. Here's how to structure them:
```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvaluationTask:
    id: str
    prompt: str
    expected_outcome: str
    verification_criteria: List[str]

def create_evaluation_tasks() -> List[EvaluationTask]:
    return [
        EvaluationTask(
            id="email_complex",
            prompt=(
                "Send a follow-up email to the Acme Corp team about yesterday's "
                "project meeting. Include the action items we discussed and "
                "attach the updated project timeline."
            ),
            expected_outcome="Email sent with meeting follow-up, action items included, timeline attached",
            verification_criteria=[
                "Email sent to correct recipients",
                "Email mentions yesterday's meeting",
                "Action items are included in email body",
                "Project timeline attachment is included",
            ],
        ),
        EvaluationTask(
            id="email_simple",
            prompt="Send an email to sarah@acme.com with the subject 'Project Update'",
            expected_outcome="Email sent to sarah@acme.com with correct subject",
            verification_criteria=[
                "Email sent to correct recipient",
                "Email has correct subject line",
            ],
        ),
        EvaluationTask(
            id="email_review",
            prompt="Review my unread emails and summarize any urgent messages about the quarterly budget review.",
            expected_outcome="Unread emails reviewed and budget-related urgent items summarized",
            verification_criteria=[
                "Unread emails were accessed",
                "Budget-related emails identified",
                "Urgent items highlighted in summary",
            ],
        ),
    ]
```
Each evaluation task includes key fields that work together to enable thorough verification:
- `verification_criteria`: specific, checkable requirements that are verified programmatically (e.g., "Email sent to correct recipients", "Action items included in email body")
- `expected_outcome`: a high-level description used for LLM-based verification, where Claude compares the actual result against what was expected
This dual verification approach catches both technical failures (wrong tools called) and semantic failures (right tools called but wrong outcome achieved).
Notice how the complex email task requires multiple tool calls and coordination between different systems. The agent would likely need to:
- `search_meeting_notes("yesterday", "Acme Corp")` to find the meeting details
- `extract_action_items()` to pull out the discussed items
- `search_files("project timeline")` to locate the attachment
- `send_email()` to compose and send the final message
This multi-step workflow mirrors real-world complexity and stress-tests how well your tools work together, not just individually.
Compare this to a simple task like "Send an email to sarah@acme.com with the subject 'Project Update'"—too straightforward to reveal coordination issues between tools.
The Agentic Evaluation Loop
The core of the evaluation system is an agentic loop that alternates between AI reasoning and tool execution:
```python
import time

def run_single_evaluation(self, task: EvaluationTask) -> EvaluationMetrics:
    start_time = time.time()
    metrics = EvaluationMetrics(task_id=task.id, success=False)
    max_iterations = 10
    iteration = 0

    # System prompt that encourages reasoning
    system_prompt = """Before taking any action, output your reasoning in a <reasoning> block.
Then make the necessary tool calls to complete the task.
Finally, provide a summary in a <summary> block."""

    messages = [{"role": "user", "content": task.prompt}]

    while iteration < max_iterations:
        iteration += 1

        # Get AI response
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            system=system_prompt,
            messages=messages,
        )
        response_text = response.content[0].text
        metrics.reasoning_trace += f"\n--- Iteration {iteration} ---\n{response_text}\n"

        # Parse and execute tool calls
        tool_calls = self.parse_tool_calls_from_response(response_text)
        if not tool_calls:
            metrics.final_response = response_text
            break  # Task complete

        # Execute tools and collect metrics
        tool_results_text = ""
        for tool_call in tool_calls:
            metrics.tool_calls.append(tool_call)
            result = self.execute_tool_call(tool_call.tool_name, tool_call.parameters)
            metrics.tool_results.append(result)
            metrics.total_tokens += result.tokens_used
            if not result.success:
                metrics.error_count += 1
            tool_results_text += f"\nTool: {tool_call.tool_name}\nResult: {result.result}\n"

        # Continue conversation with tool results
        messages.extend([
            {"role": "assistant", "content": response_text},
            {"role": "user", "content": f"Tool results:\n{tool_results_text}\nContinue if needed."},
        ])

    metrics.total_runtime = time.time() - start_time
    metrics.success = self.verify_task_completion(task, metrics)
    return metrics
```
This loop captures the natural back-and-forth between AI reasoning and tool execution, while collecting detailed metrics at each step.
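The loop relies on a few supporting data structures that aren't shown. Here is a minimal sketch of what they might look like; the field names mirror how the loop uses them, but the exact shapes are my assumption, not code from the original framework:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ToolCall:
    tool_name: str
    parameters: Dict[str, Any]

@dataclass
class ToolResult:
    success: bool
    result: str
    tokens_used: int = 0

@dataclass
class EvaluationMetrics:
    task_id: str
    success: bool
    tool_calls: List[ToolCall] = field(default_factory=list)
    tool_results: List[ToolResult] = field(default_factory=list)
    reasoning_trace: str = ""
    final_response: str = ""
    total_tokens: int = 0
    error_count: int = 0
    total_runtime: float = 0.0
```

Using `field(default_factory=list)` gives each task run its own fresh lists, so metrics from one evaluation never leak into the next.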
Comprehensive Verification Using Task Fields
Unlike traditional unit tests that just check pass/fail, our verification system uses the task fields for robust checking:
```python
def verify_task_completion(self, task: EvaluationTask, metrics: EvaluationMetrics) -> bool:
    """Verify task completion using structured criteria and LLM-based outcome checking"""
    # Step 1: Check verification criteria (require at least 80% to pass)
    criteria_passed = 0
    for criterion in task.verification_criteria:
        if self.check_criterion(criterion, metrics):
            criteria_passed += 1
    criteria_success = criteria_passed >= len(task.verification_criteria) * 0.8

    # Step 2: LLM-based outcome verification
    outcome_success = self.verify_expected_outcome(task, metrics)

    # Both checks must pass for overall success
    return criteria_success and outcome_success

def check_criterion(self, criterion: str, metrics: EvaluationMetrics) -> bool:
    """Check specific verification criteria against execution metrics"""
    if "Email sent to correct recipients" in criterion:
        # Check that send_email was called AND verify recipients
        email_calls = [tc for tc in metrics.tool_calls if tc.tool_name == "send_email"]
        if not email_calls:
            return False
        # Check if the tool result indicates success
        email_results = [tr for tr in metrics.tool_results if tr.success]
        return len(email_results) > 0
    elif "action items" in criterion:
        # Verify action items were both found AND included
        extract_calls = [tc for tc in metrics.tool_calls if "action" in tc.tool_name.lower()]
        include_check = "action item" in metrics.final_response.lower()
        return len(extract_calls) > 0 and include_check
    elif "attachment" in criterion:
        # Check file search AND attachment in final email
        file_calls = [tc for tc in metrics.tool_calls if "search_files" in tc.tool_name]
        attach_calls = [tc for tc in metrics.tool_calls if "attach" in tc.tool_name.lower()]
        return len(file_calls) > 0 and len(attach_calls) > 0
    # Add more criterion checks as needed
    return False

def verify_expected_outcome(self, task: EvaluationTask, metrics: EvaluationMetrics) -> bool:
    """Use LLM to verify if actual outcome matches expected outcome"""
    # The prompt compares task.expected_outcome against the actual execution
    verification_prompt = f"""
Evaluate if this task was completed successfully:

TASK: {task.prompt}
EXPECTED OUTCOME: {task.expected_outcome}

ACTUAL EXECUTION:
Tools called: {[tc.tool_name for tc in metrics.tool_calls]}
Final response: {metrics.final_response[:500]}...

Did the agent achieve the expected outcome? Respond with just "YES" or "NO" followed by a brief explanation.
"""
    try:
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=100,
            messages=[{"role": "user", "content": verification_prompt}],
        )
        result = response.content[0].text.strip()
        return result.upper().startswith("YES")
    except Exception:
        # Fallback to basic success if LLM verification fails
        return len(metrics.tool_calls) > 0 and metrics.error_count == 0
```
This verification system provides two layers of checking:
- Structured verification: checks specific technical requirements from `verification_criteria`
- Semantic verification: uses `expected_outcome` for LLM-based evaluation of overall success
Automated Analysis and Improvement
A great tip I got from Anthropic's research is to use AI agents themselves to analyze evaluation results:
```python
def save_for_claude_analysis(self, results):
    with open('evaluation_transcripts.txt', 'w') as f:
        for metric in results['detailed_metrics']:
            f.write(f"=== Task: {metric.task_id} ===\n")
            f.write(metric.reasoning_trace)
            f.write("\n" + "=" * 50 + "\n")
    print("Transcripts saved. Now paste into Claude Code for analysis!")
```
After running your evaluation, you can literally paste the transcripts into Claude Code and ask it to analyze patterns, suggest tool improvements, and even refactor your tool implementations automatically.
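You can also automate the hand-off instead of pasting by hand. Here's one way it might look; the helper name and prompt wording are illustrative, not part of the original framework:

```python
def build_analysis_prompt(transcript_path: str = "evaluation_transcripts.txt") -> str:
    """Read saved transcripts and wrap them in an analysis request.

    The prompt text below is just an example; tune it for your own tools.
    """
    with open(transcript_path) as f:
        transcripts = f.read()
    return (
        "Here are evaluation transcripts from my agent tool tests.\n"
        "Identify recurring failure patterns, redundant tool calls, and\n"
        "tool descriptions that seem to confuse the agent. Suggest concrete\n"
        "improvements to tool names, parameters, and docstrings.\n\n"
        f"{transcripts}"
    )
```

The resulting string can be sent through the same Messages API you used in the evaluation loop, closing the analysis loop programmatically.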
What Actually Happens
This evaluation-driven approach has proven quite effective. In Anthropic's experiments, Claude-optimized tools actually outperformed human-written ones on complex tasks. The key was the systematic feedback loop:
- Run evaluation on current tools
- Analyze failures and inefficiencies
- Improve tools based on insights
- Re-evaluate to measure improvement
- Repeat until performance plateaus
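The feedback loop above can be sketched as a simple driver. `run_evaluation` and `improve_tools` stand in for your own implementations, and the plateau threshold is an assumption you'd tune:

```python
def evaluate_until_plateau(run_evaluation, improve_tools,
                           min_gain: float = 0.02, max_rounds: int = 5) -> float:
    """Re-run evaluation after each round of tool improvements, stopping
    when the success rate improves by less than min_gain (a plateau)."""
    best_rate = run_evaluation()  # baseline on current tools
    for _ in range(max_rounds):
        improve_tools()           # apply insights from failure analysis
        rate = run_evaluation()
        if rate - best_rate < min_gain:
            break                 # performance has plateaued
        best_rate = rate
    return best_rate
```

In practice `improve_tools` is the slow human-plus-Claude step; the driver just makes explicit that every change is measured before it's kept.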
Getting Started
Here's how to use the complete framework:
```python
import anthropic

class ToolEvaluator:
    def __init__(self, anthropic_api_key: str):
        self.client = anthropic.Anthropic(api_key=anthropic_api_key)
        self.tools = YourEmailTools()  # Your actual tools

    def run_evaluation(self):
        tasks = create_evaluation_tasks()  # the module-level function defined earlier
        results = []
        for task in tasks:
            # Feeds task.prompt through the agentic evaluation loop
            metrics = self.run_single_evaluation(task)
            results.append(metrics)
        return self.compile_results(results)

# Usage
evaluator = ToolEvaluator("your-api-key")
results = evaluator.run_evaluation()

print(f"Overall success rate: {results['summary']['success_rate']:.1%}")
print(f"Total tool calls: {results['summary']['total_tool_calls']}")
print(f"Average runtime: {results['summary']['average_runtime']:.2f}s")

# Review detailed results
for metric in results['detailed_metrics']:
    print(f"Task {metric.task_id}: {'✓' if metric.success else '✗'} "
          f"({len(metric.tool_calls)} tools, {metric.total_runtime:.1f}s)")
```
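`compile_results` isn't shown above; a minimal standalone version that produces the summary keys used in the printout might look like this (the key names simply mirror the usage example, and the function shape is my assumption):

```python
def compile_results(metrics_list):
    """Aggregate per-task metrics into a summary plus the raw details.

    Expects objects with .success, .tool_calls, and .total_runtime
    attributes, as collected by the evaluation loop.
    """
    n = len(metrics_list)
    return {
        "summary": {
            "success_rate": sum(m.success for m in metrics_list) / n if n else 0.0,
            "total_tool_calls": sum(len(m.tool_calls) for m in metrics_list),
            "average_runtime": sum(m.total_runtime for m in metrics_list) / n if n else 0.0,
        },
        "detailed_metrics": metrics_list,
    }
```

Keeping the raw `detailed_metrics` alongside the summary is what makes the Claude Code analysis step possible later.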
Key Takeaways
Building effective tools for AI agents requires a fundamentally different approach than traditional software development:
- Test with realistic, complex scenarios that require multiple tool interactions
- Use task fields for comprehensive verification - criteria for technical checks, outcomes for semantic validation
- Use agents to analyze agents - let Claude help improve your tools
- Iterate systematically - evaluation should drive continuous improvement
The non-deterministic nature of AI agents isn't a bug—it's a feature that enables creative problem-solving. But to harness that creativity effectively, we need systematic evaluation frameworks that can measure and improve tool performance across the full spectrum of possible agent behaviors.
Start simple, measure everything, and let the data (and Claude) guide your improvements. Your tools—and the agents that use them—will be dramatically more effective as a result.