How Anthropic Built Their Multi-Agent Research System: Architecture, Engineering Challenges, and Lessons Learned
Published: August 23, 2025
Anthropic recently unveiled the architecture behind Claude’s Research capabilities—a sophisticated multi-agent system that can search across the web, Google Workspace, and various integrations to accomplish complex research tasks. This post provides a comprehensive analysis of their system design, engineering decisions, and hard-won lessons from building production-ready multi-agent architectures.
Table of Contents
- Introduction: Why Multi-Agent Systems Matter
- System Architecture Overview
- The Orchestrator-Worker Pattern
- Prompt Engineering for Agent Coordination
- Evaluation Strategies for Multi-Agent Systems
- Production Engineering Challenges
- Performance Metrics and Token Economics
- Lessons Learned and Best Practices
- Future Directions
- Conclusion
Introduction: Why Multi-Agent Systems Matter {#introduction}
The evolution from single-agent to multi-agent AI systems represents a fundamental shift in how we approach complex, open-ended problems. Just as human societies scaled their capabilities far beyond any individual through collective intelligence and coordination, AI systems can transcend the limits of a single agent through orchestrated collaboration.
Research tasks exemplify why multi-agent architectures are essential. Unlike deterministic workflows with predictable steps, research demands:
- Dynamic exploration: The ability to pivot based on discoveries
- Parallel investigation: Exploring multiple aspects simultaneously
- Adaptive depth: Adjusting effort based on query complexity
- Context management: Handling information that exceeds single context windows
Anthropic’s internal evaluations reveal striking performance gains: their multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on research tasks. This isn’t just incremental improvement—it’s a paradigm shift in capability.
System Architecture Overview {#architecture-overview}
The Research system employs an orchestrator-worker pattern where a lead agent coordinates the research process while delegating to specialized subagents operating in parallel. This architecture balances centralized coordination with distributed execution.
High-Level Architecture
```mermaid
graph TB
    subgraph "User Interface"
        U[User Query]
        R[Research Results]
    end

    subgraph "Lead Research Agent"
        LA[Lead Agent<br/>Claude Opus 4]
        M[Memory Store]
        P[Research Plan]
    end

    subgraph "Subagent Pool"
        SA1[Subagent 1<br/>Claude Sonnet 4]
        SA2[Subagent 2<br/>Claude Sonnet 4]
        SA3[Subagent 3<br/>Claude Sonnet 4]
        SAN[Subagent N<br/>Claude Sonnet 4]
    end

    subgraph "Tool Ecosystem"
        WS[Web Search]
        GW[Google Workspace]
        MCP[MCP Servers]
        CI[Custom Integrations]
    end

    subgraph "Citation System"
        CA[Citation Agent]
        CS[Citation Store]
    end

    U --> LA
    LA <--> M
    LA --> P
    P --> SA1
    P --> SA2
    P --> SA3
    P --> SAN

    SA1 <--> WS
    SA2 <--> GW
    SA3 <--> MCP
    SAN <--> CI

    SA1 --> LA
    SA2 --> LA
    SA3 --> LA
    SAN --> LA

    LA --> CA
    CA --> CS
    CA --> R
    R --> U

    style LA fill:#e1f5fe
    style SA1 fill:#fff3e0
    style SA2 fill:#fff3e0
    style SA3 fill:#fff3e0
    style SAN fill:#fff3e0
    style CA fill:#f3e5f5
```
Key Architectural Components
- Lead Research Agent: The orchestrator that decomposes queries, creates execution plans, and synthesizes results
- Subagent Pool: Specialized workers that execute parallel searches with independent context windows
- Memory Store: Persistent context management for long-running research sessions
- Tool Ecosystem: Diverse information sources accessed through standardized interfaces
- Citation System: Ensures all claims are properly attributed to sources
The Orchestrator-Worker Pattern {#orchestrator-worker-pattern}
The orchestrator-worker pattern enables sophisticated coordination while maintaining separation of concerns. Here’s how the research workflow unfolds:
```mermaid
sequenceDiagram
    participant User
    participant LeadAgent
    participant Memory
    participant Subagent1
    participant Subagent2
    participant SubagentN
    participant Tools
    participant CitationAgent

    User->>LeadAgent: Submit research query
    LeadAgent->>LeadAgent: Analyze query complexity
    LeadAgent->>Memory: Save research plan
    LeadAgent->>LeadAgent: Decompose into subtasks

    par Parallel Execution
        LeadAgent->>Subagent1: Create with task 1
        LeadAgent->>Subagent2: Create with task 2
        LeadAgent->>SubagentN: Create with task N
    end

    par Parallel Research
        Subagent1->>Tools: Search iterations
        Tools-->>Subagent1: Results
        Subagent1->>Subagent1: Evaluate & refine
        Subagent2->>Tools: Search iterations
        Tools-->>Subagent2: Results
        Subagent2->>Subagent2: Evaluate & refine
        SubagentN->>Tools: Search iterations
        Tools-->>SubagentN: Results
        SubagentN->>SubagentN: Evaluate & refine
    end

    Subagent1-->>LeadAgent: Condensed findings
    Subagent2-->>LeadAgent: Condensed findings
    SubagentN-->>LeadAgent: Condensed findings

    LeadAgent->>LeadAgent: Synthesize results

    alt More research needed
        LeadAgent->>LeadAgent: Refine strategy
        LeadAgent->>Subagent1: New tasks
    else Sufficient information
        LeadAgent->>CitationAgent: Process citations
        CitationAgent->>CitationAgent: Match claims to sources
        CitationAgent-->>User: Final research report
    end
```
Workflow Stages
- Query Analysis: The lead agent evaluates query complexity and required effort
- Task Decomposition: Complex queries are broken into parallelizable subtasks
- Subagent Creation: Specialized agents are spawned with specific objectives
- Parallel Execution: Subagents independently explore their assigned aspects
- Result Synthesis: The lead agent combines findings into coherent insights
- Citation Processing: A dedicated agent ensures proper source attribution
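To make these stages concrete, here is a minimal Python sketch of the orchestration loop. The `SubTask` dataclass, the `run_subagent` stub, and the fixed three-facet decomposition are illustrative assumptions; Anthropic's actual implementation is not public.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class SubTask:
    objective: str          # what the subagent should find out
    output_format: str      # how it should report back
    tool_hints: list[str]   # which tools to prefer

async def run_subagent(task: SubTask) -> str:
    """Hypothetical worker: iterates search -> evaluate -> refine,
    then returns condensed findings (stubbed here)."""
    await asyncio.sleep(0)  # placeholder for real tool and model calls
    return f"Findings for: {task.objective}"

async def research(query: str, max_rounds: int = 3) -> str:
    findings: list[str] = []
    for _ in range(max_rounds):
        # Stages 1-2: query analysis and decomposition (stubbed as one task per facet)
        subtasks = [SubTask(f"{query} -- facet {i}", "bullet list", ["web_search"])
                    for i in range(3)]
        # Stages 3-4: spawn subagents and let them run in parallel
        findings += await asyncio.gather(*(run_subagent(t) for t in subtasks))
        # Stage 5: synthesis; a real lead agent would ask the model whether coverage is sufficient
        if len(findings) >= 3:  # stand-in for the "sufficient information" check
            break
    # Stage 6: citation processing would attribute each claim before returning
    return "\n".join(findings)

if __name__ == "__main__":
    print(asyncio.run(research("state of multi-agent systems in 2025")))
```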
Prompt Engineering for Agent Coordination {#prompt-engineering}
Prompt engineering for multi-agent systems differs fundamentally from single-agent prompting. The challenge isn’t just instructing individual agents—it’s orchestrating collaborative behavior across a distributed system.
Key Prompting Principles
1. Teaching Effective Delegation
The lead agent must provide subagents with:
- Clear objectives: Specific, measurable goals
- Output formats: Structured response requirements
- Tool guidance: Which tools to prioritize
- Task boundaries: What’s in and out of scope
Example delegation pattern:
```
Subagent Task:
- Objective: Identify the top 5 AI companies by market cap in 2025
- Output: JSON list with {company, market_cap, source, date}
- Tools: Use web search, prioritize financial sites
- Boundaries: Focus only on pure-play AI companies, exclude conglomerates
```
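One way to keep those four fields consistent across every delegation is to treat them as a small schema rather than free-form text. The sketch below is illustrative only; the `Delegation` class and its prompt rendering are assumptions, not Anthropic's actual format.

```python
from dataclasses import dataclass

@dataclass
class Delegation:
    objective: str
    output_format: str
    tools: str
    boundaries: str

    def to_prompt(self) -> str:
        # Render the four delegation fields into a subagent task prompt
        return (
            f"Objective: {self.objective}\n"
            f"Output: {self.output_format}\n"
            f"Tools: {self.tools}\n"
            f"Boundaries: {self.boundaries}"
        )

task = Delegation(
    objective="Identify the top 5 AI companies by market cap in 2025",
    output_format="JSON list with {company, market_cap, source, date}",
    tools="Use web search, prioritize financial sites",
    boundaries="Focus only on pure-play AI companies, exclude conglomerates",
)
print(task.to_prompt())
```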
2. Scaling Effort to Complexity
Anthropic embedded explicit scaling rules in their prompts:
```mermaid
graph LR
    subgraph "Query Complexity Assessment"
        S[Simple Fact]
        C[Comparison]
        M[Multi-faceted]
        D[Deep Research]
    end

    subgraph "Resource Allocation"
        S --> A1[1 agent<br/>3-10 calls]
        C --> A2[2-4 agents<br/>10-15 calls each]
        M --> A3[5-8 agents<br/>15-20 calls each]
        D --> A4[10+ agents<br/>20+ calls each]
    end

    style S fill:#e8f5e9
    style C fill:#fff9c4
    style M fill:#ffe0b2
    style D fill:#ffcdd2
```
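Encoded in code rather than prose, the scaling rule amounts to a lookup table. The tier names and the upper bound on deep-research calls below are assumptions for illustration; the agent counts and call budgets come from the diagram above.

```python
# Rough resource budgets per complexity tier, mirroring the diagram above.
SCALING_RULES = {
    "simple_fact":   {"agents": 1,  "calls_per_agent": (3, 10)},
    "comparison":    {"agents": 3,  "calls_per_agent": (10, 15)},  # 2-4 agents
    "multi_faceted": {"agents": 6,  "calls_per_agent": (15, 20)},  # 5-8 agents
    "deep_research": {"agents": 10, "calls_per_agent": (20, 30)},  # 10+ agents, 20+ calls
}

def budget_for(complexity: str) -> dict:
    """Return an agent/tool-call budget for a classified query complexity."""
    return SCALING_RULES.get(complexity, SCALING_RULES["simple_fact"])

print(budget_for("comparison"))  # {'agents': 3, 'calls_per_agent': (10, 15)}
```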
3. Extended Thinking as a Controllable Scratchpad
The system leverages Claude’s extended thinking mode for planning and evaluation:
```mermaid
flowchart TD
    subgraph "Lead Agent Thinking"
        T1[Assess query complexity]
        T2[Identify required tools]
        T3[Determine agent count]
        T4[Define subtask boundaries]
        T5[Plan synthesis approach]
    end

    subgraph "Subagent Thinking"
        S1[Evaluate search results]
        S2[Identify information gaps]
        S3[Refine query strategy]
        S4[Judge source quality]
        S5[Decide completion]
    end

    T1 --> T2 --> T3 --> T4 --> T5
    S1 --> S2 --> S3 --> S4 --> S5
```
4. Tool Selection Heuristics
Agents receive explicit guidance for tool selection:
- Examine all available tools first
- Match tools to user intent
- Prefer specialized tools over generic ones
- Use web search for broad exploration
- Validate tool descriptions match capabilities
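These heuristics could be approximated by scoring tool descriptions against the user's intent. The keyword-overlap scoring below is a crude, illustrative stand-in for the model's own judgment, not how Claude actually selects tools.

```python
def pick_tools(intent: str, tools: list[dict], limit: int = 3) -> list[str]:
    """Rank tools by rough relevance to the user's intent, preferring
    specialized tools over generic ones."""
    def score(tool: dict) -> float:
        overlap = len(set(intent.lower().split()) &
                      set(tool["description"].lower().split()))
        specificity = 0.0 if tool.get("generic") else 0.5
        return overlap + specificity
    ranked = sorted(tools, key=score, reverse=True)
    return [t["name"] for t in ranked[:limit]]

tools = [
    {"name": "web_search", "description": "broad web search", "generic": True},
    {"name": "gdrive_search", "description": "search files in Google Drive"},
    {"name": "calendar_lookup", "description": "read Google Calendar events"},
]
print(pick_tools("find the planning doc in Google Drive", tools, limit=2))
```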
Parallel Execution Patterns
The system achieves dramatic speed improvements through two levels of parallelization:
- Lead agent parallelization: Spawning 3-5 subagents simultaneously
- Subagent parallelization: Each subagent using 3+ tools in parallel
This reduced research time by up to 90% for complex queries.
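The second level is easy to picture: instead of awaiting each tool call in sequence, a subagent fires its independent calls concurrently, so total latency approaches that of the slowest call rather than the sum of all of them. A minimal sketch, assuming a hypothetical async `call_tool` helper:

```python
import asyncio

async def call_tool(name: str, query: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for network latency of a real tool call
    return f"{name} results for {query!r}"

async def subagent_step(query: str) -> list[str]:
    # Issue three independent tool calls at once instead of one after another.
    return await asyncio.gather(
        call_tool("web_search", query),
        call_tool("news_search", query),
        call_tool("gdrive_search", query),
    )

print(asyncio.run(subagent_step("AI market share 2025")))
```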
Evaluation Strategies for Multi-Agent Systems {#evaluation-strategies}
Evaluating multi-agent systems presents unique challenges due to their emergent behaviors and non-deterministic execution paths. Anthropic developed a multi-layered evaluation approach:
Evaluation Framework
```mermaid
graph TD
    subgraph "Evaluation Layers"
        I[Initial Development<br/>20 test cases]
        S[Scaled Testing<br/>100s of cases]
        H[Human Evaluation<br/>Edge cases]
        P[Production Monitoring<br/>Continuous]
    end

    subgraph "Evaluation Criteria"
        F[Factual Accuracy]
        C[Citation Accuracy]
        CP[Completeness]
        Q[Source Quality]
        E[Tool Efficiency]
    end

    subgraph "Methods"
        L[LLM-as-Judge]
        M[Manual Review]
        A[Automated Metrics]
        T[Tracing Analysis]
    end

    I --> L
    S --> L
    S --> A
    H --> M
    P --> T

    L --> F
    L --> C
    L --> CP
    M --> Q
    A --> E
```
Key Evaluation Insights
- Start Small, Iterate Fast: With effect sizes often exceeding 50%, even 20 test cases can reveal significant improvements
- LLM-as-Judge Scaling: A single LLM judge with comprehensive rubrics outperformed multiple specialized judges
- Human Evaluation Remains Critical: Humans caught biases like preference for SEO-optimized content over authoritative sources
- End-State vs. Process Evaluation: Focus on outcomes rather than prescribing specific paths
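A minimal LLM-as-judge harness reflecting the second insight might look like the sketch below; the rubric wording and the `judge_model` callable are placeholders rather than Anthropic's internal tooling.

```python
import json

RUBRIC = """Score the research report from 0.0 to 1.0 on each criterion and
return JSON: {"factual_accuracy": ..., "citation_accuracy": ...,
"completeness": ..., "source_quality": ..., "tool_efficiency": ...}"""

def judge(report: str, query: str, judge_model) -> dict:
    """Single LLM-as-judge pass over one rubric, rather than one judge per criterion."""
    prompt = f"{RUBRIC}\n\nQuery: {query}\n\nReport:\n{report}"
    # judge_model is any text-in/text-out callable (an API client in practice)
    return json.loads(judge_model(prompt))

# Usage with a stubbed judge model:
def fake_judge(_prompt: str) -> str:
    return ('{"factual_accuracy": 0.9, "citation_accuracy": 0.8, '
            '"completeness": 0.7, "source_quality": 0.9, "tool_efficiency": 0.6}')

print(judge("...report text...", "top AI companies by market cap", fake_judge))
```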
Production Engineering Challenges {#production-challenges}
Moving from prototype to production revealed several critical engineering challenges unique to multi-agent systems:
State Management and Error Handling
```mermaid
stateDiagram-v2
    [*] --> Initialized
    Initialized --> Planning: User Query
    Planning --> CreatingAgents: Plan Complete
    CreatingAgents --> Executing: Agents Created

    Executing --> Executing: Tool Calls
    Executing --> Error: Tool Failure
    Error --> Recovering: Retry Logic
    Recovering --> Executing: Resume

    Executing --> Synthesizing: Results Ready
    Synthesizing --> Citations: Add Sources
    Citations --> Complete: Report Ready
    Complete --> [*]

    state Error {
        [*] --> DetectFailure
        DetectFailure --> LogError
        LogError --> DetermineRecovery
        DetermineRecovery --> [*]
    }

    state Recovering {
        [*] --> LoadCheckpoint
        LoadCheckpoint --> AdaptStrategy
        AdaptStrategy --> ResumeExecution
        ResumeExecution --> [*]
    }
```
Key Production Challenges
- Error Compounding: Minor failures cascade into major behavioral changes
  - Solution: Checkpoint systems and intelligent error recovery
- Debugging Complexity: Non-deterministic execution makes reproduction difficult
  - Solution: Comprehensive tracing and decision pattern monitoring
- Deployment Coordination: Stateful agents can't be updated mid-execution
  - Solution: Rainbow deployments with gradual traffic shifting
- Context Window Management: Extended conversations exceed limits
  - Solution: Intelligent compression and memory mechanisms
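The first two solutions combine naturally into a checkpoint-and-retry wrapper. The sketch below shows the pattern with a local JSON checkpoint file; it is illustrative, not Anthropic's recovery logic.

```python
import json
import pathlib
import time

CHECKPOINT = pathlib.Path("research_checkpoint.json")

def run_with_recovery(steps, max_retries: int = 2):
    """Resume from the last completed step instead of restarting the whole
    research session when a tool call fails."""
    state = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {"done": []}
    for name, step in steps:
        if name in state["done"]:
            continue                      # already completed before a crash or restart
        for attempt in range(max_retries + 1):
            try:
                step()
                state["done"].append(name)
                CHECKPOINT.write_text(json.dumps(state))  # checkpoint after each step
                break
            except Exception:
                if attempt == max_retries:
                    raise                 # surface the error so the agent can adapt its plan
                time.sleep(2 ** attempt)  # simple backoff before retrying

run_with_recovery([("search", lambda: None), ("synthesize", lambda: None)])
```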
Observability Architecture
```mermaid
graph LR
    subgraph "Agent System"
        A[Agents]
        T[Tools]
        M[Memory]
    end

    subgraph "Observability Layer"
        TR[Tracing]
        ME[Metrics]
        LO[Logs]
        AL[Alerts]
    end

    subgraph "Analysis"
        DP[Decision Patterns]
        PE[Performance]
        ER[Error Analysis]
        CO[Cost Tracking]
    end

    A --> TR
    T --> ME
    M --> LO

    TR --> DP
    ME --> PE
    LO --> ER
    ME --> CO

    DP --> AL
    PE --> AL
    ER --> AL
    CO --> AL
```
Performance Metrics and Token Economics {#performance-metrics}
Understanding the economics of multi-agent systems is crucial for production deployment:
Token Usage Analysis
pie title "Token Usage by Interaction Type" "Traditional Chat" : 1 "Single Agent" : 4 "Multi-Agent System" : 15
Key findings from Anthropic’s analysis:
- Token usage explains 80% of performance variance
- Number of tool calls adds 10% variance
- Model choice contributes 5% variance
- Other factors account for 5%
Performance Scaling
```mermaid
graph TB
    subgraph "Performance Factors"
        TU[Token Usage<br/>80% impact]
        TC[Tool Calls<br/>10% impact]
        MC[Model Choice<br/>5% impact]
        OF[Other Factors<br/>5% impact]
    end

    subgraph "Optimization Strategies"
        TU --> OS1[Parallel execution]
        TC --> OS2[Tool selection]
        MC --> OS3[Model routing]
        OF --> OS4[System tuning]
    end

    style TU fill:#d32f2f,color:#fff
    style TC fill:#f57c00,color:#fff
    style MC fill:#fbc02d
    style OF fill:#689f38,color:#fff
```
Cost-Benefit Analysis
Multi-agent systems are economically viable when:
- Task value exceeds 15× chat interaction cost
- Heavy parallelization is possible
- Information exceeds single context windows
- Complex tool orchestration is required
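Folding those criteria into a rough go/no-go check is straightforward; the function below simply encodes the list above, using the 15x figure as the cost multiplier.

```python
def multi_agent_is_viable(task_value_usd: float,
                          chat_cost_usd: float,
                          parallelizable: bool,
                          fits_one_context: bool,
                          token_multiplier: float = 15.0) -> bool:
    """Rough go/no-go check: the task must be worth more than the ~15x token
    bill and actually benefit from multiple agents."""
    expected_cost = chat_cost_usd * token_multiplier
    return task_value_usd > expected_cost and parallelizable and not fits_one_context

print(multi_agent_is_viable(task_value_usd=50.0, chat_cost_usd=0.40,
                            parallelizable=True, fits_one_context=False))  # True
```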
Lessons Learned and Best Practices {#lessons-learned}
Architecture Patterns That Work
- Separation of Concerns: Each agent should have a distinct, well-defined role
- Hierarchical Coordination: Clear delegation chains prevent coordination chaos
- Parallel by Default: Design for concurrent execution from the start
- Graceful Degradation: Systems should handle partial failures intelligently
Tool Design Principles
```mermaid
mindmap
  root((Tool Design))
    Clear Purpose
      Single responsibility
      Distinct from others
      Well-defined scope
    Excellent Descriptions
      Precise capabilities
      Usage examples
      Error conditions
    Error Handling
      Graceful failures
      Informative messages
      Recovery guidance
    Performance
      Optimized for agents
      Parallel-friendly
      Predictable latency
```
Prompt Engineering Best Practices
- Think Like Your Agents: Use simulations to understand prompt effects
- Encode Human Heuristics: Study expert approaches and encode them
- Start Broad, Then Narrow: Mirror human research patterns
- Let Agents Improve Themselves: Use Claude to optimize prompts
Evaluation Strategy
- Start Immediately: Even 20 test cases provide valuable signal
- Combine Methods: LLM judges + human review + automated metrics
- Focus on Outcomes: Evaluate end states, not prescribed paths
- Monitor Emergent Behaviors: Watch for unexpected interaction patterns
Future Directions {#future-directions}
Asynchronous Execution
The current synchronous execution model creates bottlenecks. Future iterations will likely implement:
```mermaid
graph TD
    subgraph "Current: Synchronous"
        CS1[Lead Agent]
        CS2[Wait for Subagents]
        CS3[Process Results]
        CS4[Create New Agents]
        CS1 --> CS2 --> CS3 --> CS4
    end

    subgraph "Future: Asynchronous"
        FA1[Lead Agent]
        FA2[Continuous Monitoring]
        FA3[Dynamic Agent Creation]
        FA4[Real-time Steering]
        FA1 --> FA2
        FA2 --> FA3
        FA3 --> FA4
        FA4 --> FA2
    end

    style CS2 fill:#ffcdd2
    style FA2 fill:#c8e6c9
    style FA3 fill:#c8e6c9
    style FA4 fill:#c8e6c9
```
Enhanced Coordination Mechanisms
Future systems may implement:
- Inter-agent communication: Direct subagent coordination
- Dynamic task redistribution: Load balancing across agents
- Adaptive resource allocation: Scaling based on task complexity
- Learned coordination patterns: ML-optimized delegation strategies
Domain Specialization
While research tasks benefit enormously from multi-agent systems, other domains present opportunities:
```mermaid
graph LR
    subgraph "High Potential Domains"
        R[Research<br/>✓ Implemented]
        CA[Code Analysis<br/>🔄 In Progress]
        DS[Data Science<br/>📊 Planned]
        CR[Creative Work<br/>🎨 Experimental]
    end

    subgraph "Key Requirements"
        P[Parallelizable tasks]
        L[Large information space]
        C[Complex tool usage]
        V[High value tasks]
    end

    R --> P
    R --> L
    R --> C
    R --> V

    style R fill:#4caf50,color:#fff
    style CA fill:#2196f3,color:#fff
    style DS fill:#ff9800,color:#fff
    style CR fill:#9c27b0,color:#fff
```
Common Use Cases in Production
Based on Anthropic’s analysis, the top use cases for their Research feature include:
pie title "Research Feature Usage Distribution" "Software Development (10%)" : 10 "Content Optimization (8%)" : 8 "Business Strategy (8%)" : 8 "Academic Research (7%)" : 7 "Information Verification (5%)" : 5 "Other (62%)" : 62
Users report saving days of work by:
- Finding business opportunities they hadn’t considered
- Navigating complex healthcare options
- Resolving technical bugs through comprehensive searches
- Uncovering research connections across disciplines
Implementation Considerations
When to Use Multi-Agent Systems
✅ Good Fit:
- Open-ended research tasks
- Problems requiring broad exploration
- Tasks exceeding single context windows
- Workflows with parallelizable subtasks
- High-value outputs justifying token costs
❌ Poor Fit:
- Simple, deterministic workflows
- Tasks requiring shared context
- Real-time coordination needs
- Low-value or high-frequency operations
Getting Started
For teams considering multi-agent architectures:
- Start with a clear use case: Identify tasks with natural parallelization
- Build incrementally: Begin with 2-3 agents before scaling
- Invest in observability early: You’ll need it for debugging
- Create small evaluation sets: 20 good test cases beat 200 mediocre ones
- Expect iteration: The gap between prototype and production is wide
Conclusion {#conclusion}
Anthropic’s multi-agent research system demonstrates that the future of AI isn’t just about more capable individual models—it’s about orchestrating multiple agents to achieve collective intelligence. The 90.2% performance improvement over single-agent systems isn’t just a benchmark; it represents a fundamental shift in how we approach complex, open-ended problems.
The journey from prototype to production revealed critical insights:
- Architecture matters: The orchestrator-worker pattern provides the right balance of coordination and autonomy
- Prompt engineering is system design: In multi-agent systems, prompts define collaborative behavior
- Evaluation requires new approaches: Traditional methods don’t capture emergent behaviors
- Production readiness is hard: Error compounding and state management require robust engineering
As we move forward, multi-agent systems will likely become the standard for tackling complex tasks that require:
- Exploring vast information spaces
- Coordinating diverse tools and APIs
- Maintaining state across extended interactions
- Scaling beyond single context windows
The lessons from Anthropic’s Research feature provide a roadmap for teams building the next generation of AI systems. While challenges remain—particularly around asynchronous execution and inter-agent coordination—the potential is clear: multi-agent systems can transform how we solve complex problems, from scientific research to business strategy to creative endeavors.
The key takeaway? When individual intelligence reaches a threshold, collective intelligence becomes the multiplier. Just as human civilization advanced through cooperation and specialization, AI systems are beginning their own journey toward collaborative problem-solving. The future isn’t just smarter agents—it’s smarter systems of agents working together.
Appendix: Additional Implementation Tips
Long-Horizon Conversation Management
- Implement checkpoint systems for conversations spanning hundreds of turns
- Use external memory stores to persist critical information
- Spawn fresh subagents with clean contexts when approaching limits
- Design handoff protocols for context continuity
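A minimal external-memory sketch of the handoff pattern described above (the class and field names are invented for illustration, not Anthropic's memory format):

```python
from dataclasses import dataclass, field

@dataclass
class ExternalMemory:
    """Minimal persistent store a lead agent can write to before its context fills up."""
    plan: str = ""
    findings: list[str] = field(default_factory=list)

    def handoff_summary(self, max_items: int = 10) -> str:
        # What a fresh subagent receives instead of the full conversation history
        recent = "\n".join(self.findings[-max_items:])
        return f"Plan:\n{self.plan}\n\nKey findings so far:\n{recent}"

memory = ExternalMemory(plan="Compare vector database offerings")
memory.findings.append("Pinecone: managed, usage-based pricing")
memory.findings.append("pgvector: open source, runs inside Postgres")
print(memory.handoff_summary())
```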
Subagent Output Optimization
- Allow direct filesystem writes to minimize information loss
- Use artifact systems for structured outputs (code, reports, visualizations)
- Pass lightweight references instead of copying large outputs
- Implement specialized output formats for different agent types
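Passing lightweight references can be as simple as an artifact store keyed by opaque handles. In the sketch below an in-memory dict stands in for a filesystem or object store; only the handle travels through the lead agent's context.

```python
import uuid

ARTIFACTS: dict[str, str] = {}   # stand-in for a filesystem or object store

def store_artifact(content: str) -> str:
    """Subagent writes its full output once and returns only a lightweight handle."""
    ref = f"artifact://{uuid.uuid4().hex[:8]}"
    ARTIFACTS[ref] = content
    return ref

def load_artifact(ref: str) -> str:
    return ARTIFACTS[ref]

# Subagent produces a large report, but only the reference flows back to the lead agent.
ref = store_artifact("...long analysis of chip supply chains...")
print(ref, "->", len(load_artifact(ref)), "chars stored out of band")
```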
Memory Architecture Patterns
```mermaid
graph TD
    subgraph "Memory Hierarchy"
        WM[Working Memory<br/>Active Context]
        STM[Short-term Memory<br/>Session Cache]
        LTM[Long-term Memory<br/>Persistent Store]
    end

    subgraph "Access Patterns"
        WM --> F[Fast Access<br/>< 1ms]
        STM --> M[Medium Access<br/>< 100ms]
        LTM --> S[Slow Access<br/>< 1s]
    end

    WM <--> STM
    STM <--> LTM

    style WM fill:#e3f2fd
    style STM fill:#fff3e0
    style LTM fill:#f3e5f5
```
State Management Strategies
- Focus on end-state evaluation for state-mutating agents
- Break complex workflows into discrete checkpoints
- Implement rollback mechanisms for critical operations
- Design idempotent operations where possible
The multi-agent revolution is just beginning. As these systems mature and new patterns emerge, we’ll continue discovering better ways to orchestrate collective AI intelligence. The journey from individual to collective intelligence—whether human or artificial—remains one of the most fascinating challenges in computer science.
This analysis is based on Anthropic’s engineering blog post about their multi-agent research system, with additional architectural insights and implementation patterns drawn from production experience with multi-agent systems.