How to Evaluate AI Agent Tools Properly

When building tools for AI agents like Claude, how do you know if your tools are actually effective? Unlike traditional software where you can predict exactly how functions will be called, AI agents are non-deterministic—they might approach the same task in completely different ways each time.

This uncertainty makes it crucial to have a systematic way to evaluate and improve your tools. Based on Anthropic's recent research on writing effective tools for LLM agents, I'll walk you through building a comprehensive evaluation framework that can help you create better tools for AI agents.

The Challenge: Non-Deterministic Tool Usage

Traditional software follows predictable patterns. Call send_email(to="john@company.com", subject="Meeting Reminder") and you'll always send that exact email the same way. But when an AI agent is asked "Remind John about tomorrow's meeting", it might:

  • Send the email immediately
  • Check John's calendar first to confirm he's available
  • Search for previous meeting details to include context
  • Or ask you what specific details to include

This variability means we need to test tools differently than we test traditional APIs.
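One concrete way to see this variability is to run the same prompt several times and tally the distinct tool-call sequences the agent produces. A minimal sketch, where `run_agent` is a hypothetical harness that returns the ordered list of tool names an agent called:

```python
from collections import Counter

def tool_sequence_distribution(run_agent, prompt: str, n_runs: int = 5) -> Counter:
    """Tally which tool-call sequences the agent produces for one prompt.

    run_agent(prompt) is assumed to return the ordered list of tool names
    called; runs may differ because the agent is non-deterministic.
    """
    sequences = Counter()
    for _ in range(n_runs):
        # Tuples are hashable, so identical sequences collapse into one count
        tools_called = tuple(run_agent(prompt))
        sequences[tools_called] += 1
    return sequences
```

If the distribution is heavily skewed toward one sequence, your simple tasks may be too easy; a wide spread on a task you expected to be simple can signal ambiguous tool descriptions.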

Setting Up Evaluation Tasks

The foundation of good tool evaluation is creating realistic, challenging tasks. Here's how to structure them:

from dataclasses import dataclass
from typing import List

@dataclass
class EvaluationTask:
    id: str
    prompt: str
    expected_outcome: str
    verification_criteria: List[str]

def create_evaluation_tasks() -> List[EvaluationTask]:
    return [
        EvaluationTask(
            id="email_complex",
            prompt="Send a follow-up email to the Acme Corp team about yesterday's project meeting. Include the action items we discussed and attach the updated project timeline.",
            expected_outcome="Email sent with meeting follow-up, action items included, timeline attached",
            verification_criteria=[
                "Email sent to correct recipients",
                "Email mentions yesterday's meeting",
                "Action items are included in email body",
                "Project timeline attachment is included"
            ]
        ),
        EvaluationTask(
            id="email_simple", 
            prompt="Send an email to sarah@acme.com with the subject 'Project Update'",
            expected_outcome="Email sent to sarah@acme.com with correct subject",
            verification_criteria=[
                "Email sent to correct recipient",
                "Email has correct subject line"
            ]
        ),
        EvaluationTask(
            id="email_review",
            prompt="Review my unread emails and summarize any urgent messages about the quarterly budget review.",
            expected_outcome="Unread emails reviewed and budget-related urgent items summarized",
            verification_criteria=[
                "Unread emails were accessed",
                "Budget-related emails identified",
                "Urgent items highlighted in summary"
            ]
        )
    ]

Each evaluation task includes key fields that work together to enable thorough verification:

  • verification_criteria: Specific, checkable requirements that are verified programmatically (e.g., "Email sent to correct recipients", "Action items included in email body")
  • expected_outcome: High-level description used for LLM-based verification where Claude compares the actual result against what was expected

This dual verification approach catches both technical failures (wrong tools called) and semantic failures (right tools called but wrong outcome achieved).

Notice how the complex email task requires multiple tool calls and coordination between different systems. The agent would likely need to:

  • search_meeting_notes("yesterday", "Acme Corp") to find the meeting details
  • extract_action_items() to pull out the discussed items
  • search_files("project timeline") to locate the attachment
  • send_email() to compose and send the final message

This multi-step workflow mirrors real-world complexity and stress-tests how well your tools work together, not just individually.

Compare this to a simple task like "Send an email to sarah@acme.com with the subject 'Project Update'"—too straightforward to reveal coordination issues between tools.

The Agentic Evaluation Loop

The core of the evaluation system is an agentic loop that alternates between AI reasoning and tool execution:

import time

def run_single_evaluation(self, task: EvaluationTask) -> EvaluationMetrics:
    metrics = EvaluationMetrics(task_id=task.id, success=False)
    max_iterations = 10
    iteration = 0
    start_time = time.time()
    
    # System prompt that encourages reasoning
    system_prompt = """Before taking any action, output your reasoning in a <reasoning> block.
    Then make the necessary tool calls to complete the task.
    Finally, provide a summary in a <summary> block."""
    
    messages = [{"role": "user", "content": task.prompt}]
    
    while iteration < max_iterations:
        iteration += 1
        
        # Get AI response
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            system=system_prompt,
            messages=messages
        )
        
        response_text = response.content[0].text
        metrics.reasoning_trace += f"\n--- Iteration {iteration} ---\n{response_text}\n"
        
        # Parse and execute tool calls
        tool_calls = self.parse_tool_calls_from_response(response_text)
        
        if not tool_calls:
            metrics.final_response = response_text
            break  # No more tool calls: task complete
            
        # Execute tools and collect metrics
        tool_results_text = ""
        for tool_call in tool_calls:
            metrics.tool_calls.append(tool_call)
            
            result = self.execute_tool_call(tool_call.tool_name, tool_call.parameters)
            metrics.tool_results.append(result)
            metrics.total_tokens += result.tokens_used
            
            if not result.success:
                metrics.error_count += 1
            
            tool_results_text += f"\nTool: {tool_call.tool_name}\nResult: {result.result}\n"
        
        # Continue conversation with tool results
        messages.extend([
            {"role": "assistant", "content": response_text},
            {"role": "user", "content": f"Tool results:\n{tool_results_text}\n\nContinue if needed."}
        ])
    
    metrics.total_runtime = time.time() - start_time
    metrics.success = self.verify_task_completion(task, metrics)
    
    return metrics

This loop captures the natural back-and-forth between AI reasoning and tool execution, while collecting detailed metrics at each step.
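The loop relies on a few supporting types that aren't shown above. Their exact shape isn't part of the original write-up, but the usage implies something like this sketch:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ToolCall:
    tool_name: str
    parameters: Dict[str, Any]

@dataclass
class ToolResult:
    success: bool
    result: str
    tokens_used: int = 0

@dataclass
class EvaluationMetrics:
    task_id: str
    success: bool
    # Mutable defaults need default_factory so each task gets fresh lists
    tool_calls: List[ToolCall] = field(default_factory=list)
    tool_results: List[ToolResult] = field(default_factory=list)
    reasoning_trace: str = ""
    final_response: str = ""
    total_tokens: int = 0
    error_count: int = 0
    total_runtime: float = 0.0
```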

Comprehensive Verification Using Task Fields

Unlike traditional unit tests that just check pass/fail, our verification system uses the task fields for robust checking:

def verify_task_completion(self, task: EvaluationTask, metrics: EvaluationMetrics) -> bool:
    """Verify task completion using structured criteria and LLM-based outcome checking"""
    
    # Step 1: Check verification criteria
    criteria_passed = 0
    for criterion in task.verification_criteria:
        if self.check_criterion(criterion, metrics):
            criteria_passed += 1
    
    # Require at least 80% of the structured criteria to pass
    criteria_success = criteria_passed >= len(task.verification_criteria) * 0.8
    
    # Step 2: LLM-based outcome verification
    outcome_success = self.verify_expected_outcome(task, metrics)
    
    # Both checks must pass for overall success
    return criteria_success and outcome_success

def check_criterion(self, criterion: str, metrics: EvaluationMetrics) -> bool:
    """Check specific verification criteria against execution metrics"""
    
    if "Email sent to correct recipients" in criterion:
        # Check that send_email was called AND verify recipients
        email_calls = [tc for tc in metrics.tool_calls if tc.tool_name == "send_email"]
        if not email_calls:
            return False
        # Check if the tool result indicates success
        email_results = [tr for tr in metrics.tool_results if tr.success]
        return len(email_results) > 0
        
    elif "action items" in criterion:
        # Verify action items were both found AND included
        extract_calls = [tc for tc in metrics.tool_calls if "action" in tc.tool_name.lower()]
        include_check = "action item" in metrics.final_response.lower()
        return len(extract_calls) > 0 and include_check
        
    elif "attachment" in criterion:
        # Check file search AND attachment in final email
        file_calls = [tc for tc in metrics.tool_calls if "search_files" in tc.tool_name]
        attach_calls = [tc for tc in metrics.tool_calls if "attach" in tc.tool_name.lower()]
        return len(file_calls) > 0 and len(attach_calls) > 0
        
    # Add more criterion checks as needed
    return False

def verify_expected_outcome(self, task: EvaluationTask, metrics: EvaluationMetrics) -> bool:
    """Use LLM to verify if actual outcome matches expected outcome"""
    
    verification_prompt = f"""
    Evaluate if this task was completed successfully:
    
    TASK: {task.prompt}
    EXPECTED OUTCOME: {task.expected_outcome}
    
    ACTUAL EXECUTION:
    Tools called: {[tc.tool_name for tc in metrics.tool_calls]}
    Final response: {metrics.final_response[:500]}...
    
    Did the agent achieve the expected outcome? Respond with just "YES" or "NO" followed by a brief explanation.
    """
    
    try:
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=100,
            messages=[{"role": "user", "content": verification_prompt}]
        )
        
        result = response.content[0].text.strip()
        return result.upper().startswith("YES")
        
    except Exception:
        # Fall back to a basic heuristic if LLM verification fails
        return len(metrics.tool_calls) > 0 and metrics.error_count == 0

This verification system provides two layers of checking:

  • Structured verification: Checks specific technical requirements from verification_criteria
  • Semantic verification: Uses expected_outcome for LLM-based evaluation of overall success
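The if/elif chain in check_criterion grows unwieldy as tasks multiply. One maintainable alternative (my sketch, not from Anthropic's write-up) is a registry mapping criterion keywords to matcher functions:

```python
from typing import Callable, Dict

# Each matcher receives the execution metrics and returns pass/fail.
# Matchers here only inspect .tool_calls entries with a .tool_name attribute.
CRITERION_MATCHERS: Dict[str, Callable] = {
    "correct recipients": lambda m: any(
        tc.tool_name == "send_email" for tc in m.tool_calls),
    "action items": lambda m: any(
        "action" in tc.tool_name.lower() for tc in m.tool_calls),
    "attachment": lambda m: any(
        "attach" in tc.tool_name.lower() for tc in m.tool_calls),
}

def check_criterion_registry(criterion: str, metrics) -> bool:
    """Dispatch on the first registered keyword found in the criterion text."""
    for keyword, matcher in CRITERION_MATCHERS.items():
        if keyword in criterion.lower():
            return matcher(metrics)
    return False  # Unknown criteria fail loudly rather than pass silently
```

Adding a new check then becomes a one-line registry entry instead of another elif branch.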

Automated Analysis and Improvement

A great tip from Anthropic's research is using AI agents themselves to analyze evaluation results:

def save_for_claude_analysis(self, results):
    with open('evaluation_transcripts.txt', 'w') as f:
        for metric in results['detailed_metrics']:
            f.write(f"=== Task: {metric.task_id} ===\n")
            f.write(metric.reasoning_trace)
            f.write("\n" + "=" * 50 + "\n")
    
    print("Transcripts saved. Now paste into Claude Code for analysis!")

After running your evaluation, you can literally paste the transcripts into Claude Code and ask it to analyze patterns, suggest tool improvements, and even refactor your tool implementations automatically.

What Actually Happens

This evaluation-driven approach has proven quite effective. In Anthropic's experiments, Claude-optimized tools actually outperformed human-written ones on complex tasks. The key was the systematic feedback loop:

  1. Run evaluation on current tools
  2. Analyze failures and inefficiencies
  3. Improve tools based on insights
  4. Re-evaluate to measure improvement
  5. Repeat until performance plateaus
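The steps above reduce to: evaluate, improve, re-evaluate, stop when gains flatten. A sketch of that outer driver, where `evaluate` and `improve_tools` are hypothetical callables you'd supply (one returning a success rate in [0, 1], the other applying whatever tool changes your analysis suggests):

```python
def iterate_until_plateau(evaluate, improve_tools, min_gain: float = 0.02,
                          max_rounds: int = 10) -> list:
    """Repeat evaluate -> improve until the success rate stops climbing."""
    history = [evaluate()]  # Baseline before any improvement
    for _ in range(max_rounds - 1):
        improve_tools(history[-1])
        rate = evaluate()
        history.append(rate)
        if rate - history[-2] < min_gain:  # Gains have plateaued
            break
    return history
```

Keeping the full history (not just the latest rate) lets you spot regressions introduced by a tool change, not only plateaus.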

Getting Started

Here's how to use the complete framework:

import anthropic

class ToolEvaluator:
    def __init__(self, anthropic_api_key: str):
        self.client = anthropic.Anthropic(api_key=anthropic_api_key)
        self.tools = YourEmailTools()  # Your actual tools
    
    def run_evaluation(self):
        tasks = create_evaluation_tasks()  # The task list defined earlier
        results = []
        
        for task in tasks:
            # Uses task.prompt in the evaluation loop
            metrics = self.run_single_evaluation(task)
            results.append(metrics)
            
        return self.compile_results(results)

# Usage
evaluator = ToolEvaluator("your-api-key")
results = evaluator.run_evaluation()

print(f"Overall success rate: {results['summary']['success_rate']:.1%}")
print(f"Total tool calls: {results['summary']['total_tool_calls']}")
print(f"Average runtime: {results['summary']['average_runtime']:.2f}s")

# Review detailed results
for metric in results['detailed_metrics']:
    print(f"Task {metric.task_id}: {'✓' if metric.success else '✗'} "
          f"({len(metric.tool_calls)} tools, {metric.total_runtime:.1f}s)")
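The compile_results helper referenced above isn't shown; a minimal sketch that produces the summary dictionary the printout expects (field names assumed from the usage code):

```python
def compile_results(detailed_metrics: list) -> dict:
    """Aggregate per-task metrics into the summary shape used above."""
    n = len(detailed_metrics)
    return {
        "summary": {
            "success_rate": sum(m.success for m in detailed_metrics) / n if n else 0.0,
            "total_tool_calls": sum(len(m.tool_calls) for m in detailed_metrics),
            "average_runtime": sum(m.total_runtime for m in detailed_metrics) / n if n else 0.0,
        },
        # Keep per-task metrics alongside the summary for detailed review
        "detailed_metrics": detailed_metrics,
    }
```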

Key Takeaways

Building effective tools for AI agents requires a fundamentally different approach than traditional software development:

  1. Test with realistic, complex scenarios that require multiple tool interactions
  2. Use task fields for comprehensive verification - criteria for technical checks, outcomes for semantic validation
  3. Use agents to analyze agents - let Claude help improve your tools
  4. Iterate systematically - evaluation should drive continuous improvement

The non-deterministic nature of AI agents isn't a bug—it's a feature that enables creative problem-solving. But to harness that creativity effectively, we need systematic evaluation frameworks that can measure and improve tool performance across the full spectrum of possible agent behaviors.

Start simple, measure everything, and let the data (and Claude) guide your improvements. Your tools—and the agents that use them—will be dramatically more effective as a result.