Building a Multi-Agent LLM Debate Engine with Python
Architecture decisions, four LLM provider integrations, and lessons learned building Synapse: a 45-file, 12,000-line multi-agent debate system with a 5-phase pipeline and 306 tests.
Synapse started as an experiment: what happens if you make multiple LLMs argue with each other until they reach a reasoned consensus? After two months of evenings and weekends, it became a 12,000+ line Python system with 45 files, 4 LLM provider integrations, a 5-phase debate pipeline, and 306 tests.
This is a post about the architecture decisions, what worked, what didn’t, and why building it made me significantly better at prompting LLMs.
The Core Idea
The problem with single-LLM responses is confirmation bias baked into the model. Ask GPT-4 a complex question and it gives you a confident, coherent answer that may completely miss an important counter-argument or alternative framing.
The debate engine addresses this by spinning up multiple LLM agents with different system prompts and structured reasoning roles, running them through an adversarial pipeline, and synthesising the output into a final response that has actually survived scrutiny.
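For orientation, the five phases map roughly onto an enum like the one below. Phases 1-3 are named in this post (decomposition, initial stances, cross-critique); the last two names are my shorthand for the rebuttal and synthesis steps described later, not the exact identifiers from the codebase.

from enum import Enum

class DebatePhase(Enum):
    DECOMPOSITION = 1     # Phase 1: break the query into sub-questions
    INITIAL_STANCES = 2   # Phase 2: each agent states an opening position
    CROSS_CRITIQUE = 3    # Phase 3: agents critique each other's stances
    REBUTTAL = 4          # agents answer the critiques against them
    SYNTHESIS = 5         # the synthesizer reconciles what survived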
Architecture Decisions
Why FastAPI + Async?
The debate pipeline is inherently concurrent. During Phase 2 (initial stances), all agents can run simultaneously. During Phase 3 (cross-critique), critiques can be generated in parallel before any rebuttal happens.
Using FastAPI with asyncio.gather() reduced the total pipeline latency by ~60% compared to a naive sequential implementation.
import asyncio

# Phase 2: all agents generate initial stances simultaneously
async def run_initial_stances(query: str, agents: list[Agent]) -> list[Stance]:
    tasks = [agent.generate_stance(query) for agent in agents]
    stances = await asyncio.gather(*tasks)
    return list(stances)
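Phase 3 fans out the same way: every critique of every stance can run concurrently, and only the rebuttal step needs the full set. In sketch form (generate_critique, and the author/name fields used for self-exclusion, are illustrative names):

# Phase 3: every agent critiques every other agent's stance in parallel
async def run_cross_critiques(agents: list[Agent], stances: list[Stance]) -> list[Critique]:
    tasks = [
        agent.generate_critique(stance)
        for agent in agents
        for stance in stances
        if stance.author != agent.name  # agents don't critique their own stance
    ]
    return list(await asyncio.gather(*tasks))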
Provider Abstraction Layer
Supporting four LLM providers (OpenAI, Anthropic, Google Gemini, Groq) without the rest of the code caring about the differences required a clean abstraction:
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass

import anthropic

@dataclass
class LLMResponse:
    content: str
    model: str
    tokens_used: int
    latency_ms: float

class BaseLLMProvider(ABC):
    @abstractmethod
    async def complete(
        self,
        messages: list[dict],
        temperature: float = 0.7,
        max_tokens: int = 2000,
    ) -> LLMResponse:
        ...

class AnthropicProvider(BaseLLMProvider):
    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = anthropic.AsyncAnthropic(api_key=api_key)
        self.model = model

    async def complete(self, messages, temperature=0.7, max_tokens=2000) -> LLMResponse:
        start = time.time()
        # Convert from OpenAI message format to Anthropic format
        # (the system prompt is a separate parameter in the Anthropic API)
        system, human_messages = self._convert_messages(messages)
        response = await self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=temperature,
            system=system,
            messages=human_messages,
        )
        return LLMResponse(
            content=response.content[0].text,
            model=self.model,
            tokens_used=response.usage.input_tokens + response.usage.output_tokens,
            latency_ms=(time.time() - start) * 1000,
        )
The pattern is simple: every provider implements complete() and returns an LLMResponse. The pipeline never touches provider-specific code.
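Selecting a provider at runtime is then a dictionary lookup. A sketch of the factory (only AnthropicProvider appears above; the other class names here are illustrative):

PROVIDERS: dict[str, type[BaseLLMProvider]] = {
    "openai": OpenAIProvider,
    "anthropic": AnthropicProvider,
    "gemini": GeminiProvider,
    "groq": GroqProvider,
}

def make_provider(name: str, api_key: str, **kwargs) -> BaseLLMProvider:
    # The pipeline only ever sees the BaseLLMProvider interface
    try:
        return PROVIDERS[name](api_key=api_key, **kwargs)
    except KeyError:
        raise ValueError(f"Unknown LLM provider: {name!r}") from None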
Agent Personas and Role Prompts
The quality of debate output is almost entirely determined by the system prompts. I spent more time on persona design than on the Python code.
Each agent gets:
- A role (Advocate, Skeptic, Devil’s Advocate, Synthesizer)
- A domain bias (varies per use case)
- A debate style directive (Socratic, empirical, consequentialist)
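In sketch form, those three dials compose into a single system prompt (this AgentConfig is illustrative; only the roles and styles listed above come from the real system):

from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    role: str          # e.g. "Skeptic"
    domain_bias: str   # varies per use case
    debate_style: str  # e.g. "empirical"

    def system_prompt(self, role_prompt: str) -> str:
        # The role prompt carries most of the weight; bias and style
        # are appended as short directives rather than rewritten prose.
        return (
            f"{role_prompt}\n\n"
            f"Domain emphasis: {self.domain_bias}.\n"
            f"Debate style: argue in a {self.debate_style} manner."
        )

The Skeptic's role prompt, for example: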
SKEPTIC_SYSTEM_PROMPT = """You are a rigorous skeptic and critical thinker. Your role in this debate is to:
1. Challenge assumptions in the initial framing
2. Identify gaps in evidence or logical leaps
3. Propose alternative explanations
4. Flag risks and unintended consequences
You must cite specific weaknesses in other agents' arguments, not vague disagreement.
Format: Lead each critique with [ASSUMPTION] or [EVIDENCE] or [LOGIC] tags."""
The tag-based formatting was a key insight — structured tags made parsing and displaying critiques much more reliable than asking the LLM to “clearly label” things.
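In simplified form, extracting the tagged critiques is a single regex pass (the production parser would also need to cope with multi-line critiques and malformed output):

import re

TAG_PATTERN = re.compile(r"^\[(ASSUMPTION|EVIDENCE|LOGIC)\]\s*(.+)$", re.MULTILINE)

def parse_critiques(text: str) -> list[tuple[str, str]]:
    # Returns (tag, critique_body) pairs in document order
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]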
The Vedic Astrology Vertical
One of the more unexpected pivots: a friend asked if the debate engine could work for Jyotish (Vedic astrology) chart interpretation, where different schools of thought genuinely disagree on interpretation methods.
This turned out to be a rich test case. The debates between a Parashari school agent and a Jaimini school agent produced remarkably nuanced output that couldn’t come from either school alone. It validated the core thesis: adversarial multi-agent debate produces better coverage of a problem space than any single expert.
What the 306 Tests Cover
Testing LLM applications is legitimately hard because the outputs are non-deterministic. The test suite handles this with three strategies:
1. Deterministic unit tests for all non-LLM code: parsing, state management, API routing, provider configuration.
2. Contract tests for LLM outputs: assert structural properties (correct JSON schema, required fields present, length within range) rather than exact content.
def test_stance_response_schema(mock_llm_response):
    stance = parse_stance_response(mock_llm_response)
    assert isinstance(stance.position, str)
    assert len(stance.position) > 50          # non-trivial content
    assert stance.confidence in range(1, 11)  # 1-10 scale
    assert len(stance.supporting_points) >= 2
    assert len(stance.key_uncertainties) >= 1
3. Behavioral tests using a small, cheap model (GPT-3.5-turbo or Gemini Flash) as an evaluator:
async def test_debate_advances_through_phases(debate_engine):
    result = await debate_engine.run(
        query="Should all Kubernetes clusters use GitOps?",
        agents=["advocate", "skeptic"],
        max_rounds=2,
    )
    # Use an LLM to evaluate structural properties of the output
    evaluation = await evaluator.assess(
        result=result,
        criteria=[
            "Does the synthesis reference arguments from both agents?",
            "Does the skeptic's critique address specific claims, not generalities?",
        ],
    )
    assert evaluation.criteria_met_count >= 1
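The evaluator itself is a thin wrapper around a cheap model with a yes/no rubric. A sketch, assuming the debate result has already been rendered to text:

from dataclasses import dataclass

@dataclass
class Evaluation:
    criteria_met_count: int

class Evaluator:
    def __init__(self, provider: BaseLLMProvider):
        self.provider = provider  # e.g. wrapping Gemini Flash

    async def assess(self, result: str, criteria: list[str]) -> Evaluation:
        met = 0
        for criterion in criteria:
            # One yes/no question per criterion keeps grading cheap and parseable
            response = await self.provider.complete(
                messages=[{
                    "role": "user",
                    "content": (
                        "Answer YES or NO only.\n\n"
                        f"Criterion: {criterion}\n\n"
                        f"Debate output:\n{result}"
                    ),
                }],
                temperature=0.0,
            )
            met += response.content.strip().upper().startswith("YES")
        return Evaluation(criteria_met_count=met)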
Lessons Learned
Prompt iteration is the highest-leverage activity. The code architecture matters, but 80% of the quality improvements came from refining system prompts. Treat prompts as first-class code — version-control them, test them, review changes to them carefully.
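"Test them" can be as mundane as a structural regression test over the prompt constants, so a prompt edit can't silently break downstream parsing:

def test_skeptic_prompt_keeps_required_tags():
    # The critique parser depends on these tags; a prompt refactor
    # that drops them should fail CI, not production debates.
    for tag in ("[ASSUMPTION]", "[EVIDENCE]", "[LOGIC]"):
        assert tag in SKEPTIC_SYSTEM_PROMPT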
Temperature management matters. Advocate agents work best at temperature 0.8 (creative, assertive). Skeptic agents work best at 0.4 (precise, conservative). Synthesizer agents at 0.3 (focused, coherent). These weren’t obvious upfront — I found them through systematic testing.
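In config form (the dict itself is a sketch; the values are the ones that testing settled on):

# Per-role sampling temperatures, found through systematic testing
ROLE_TEMPERATURES: dict[str, float] = {
    "advocate": 0.8,     # creative, assertive
    "skeptic": 0.4,      # precise, conservative
    "synthesizer": 0.3,  # focused, coherent
}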
Async is not optional at scale. A synchronous debate with 3 agents over 5 phases at ~3s per LLM call would take ~45 seconds (15 sequential calls). The async implementation runs it in ~12 seconds by parallelising everything that can be parallelised.
Token costs add up fast. A full 3-agent, 5-phase debate uses approximately 15,000–25,000 tokens depending on complexity. At scale, caching Phase 1 decompositions for similar queries reduces cost significantly.
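A sketch of that cache, keyed on a normalised query (the normalisation and in-memory storage here are simplifications; run_decomposition stands in for the Phase 1 call):

import hashlib

_decomposition_cache: dict[str, list[str]] = {}

def cache_key(query: str) -> str:
    # Cheap normalisation: lowercase, collapse whitespace, then hash
    normalised = " ".join(query.lower().split())
    return hashlib.sha256(normalised.encode()).hexdigest()

async def decompose_with_cache(query: str) -> list[str]:
    key = cache_key(query)
    if key not in _decomposition_cache:
        _decomposition_cache[key] = await run_decomposition(query)  # Phase 1 LLM call
    return _decomposition_cache[key]

An exact-match hash only catches identical queries; matching genuinely similar ones would need an embedding-similarity lookup instead.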
The synthesizer is the hardest role. Getting an LLM to genuinely reconcile conflicting views rather than just summarise them requires careful prompting. My current approach: give the synthesizer explicit instructions to identify the point of genuine disagreement before attempting synthesis, and require it to acknowledge where consensus was not reached.
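Paraphrased, the synthesizer's instructions boil down to three ordered constraints:

SYNTHESIZER_SYSTEM_PROMPT = """You are the synthesizer. Before writing any synthesis:
1. State, in one sentence, the single point of genuine disagreement between the agents.
2. Reconcile the arguments where the evidence allows; do not merely summarise each side.
3. Where consensus was not reached, say so explicitly and name what evidence would settle it."""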
What’s Next for Synapse
- Streaming UI — currently outputs everything at the end; want real-time debate display
- Debate history — store debates, allow branching from a specific phase to explore different paths
- Evaluation scoring — quantitative quality score for debate output, not just subjective assessment
- More verticals — the abstraction is clean enough that adding new domain-specific debate configurations is straightforward
The code is on my GitHub as a proof of concept. If you are building anything with multi-agent LLMs, I hope the architecture patterns here are useful.
Divyansh Srivastav
DevOps Architect · Kubernetes Platform Engineering