Our Multi-Agent Orchestrator vs. Twelve LLM-Agent Research Papers

We built a multi-agent orchestrator that autonomously plans, implements, and reviews changes to a 100,000-line Rust game engine. It coordinates seven specialist AI agents through structured planning rounds, executes work in isolated git worktrees, and maintains a hybrid-search knowledge base of over 160 context documents. The system has been running in production for months, shipping features and fixes with no human intervention between the topic description and the merged commit.

Over the past few weeks we ingested twelve LLM-agent research papers — surveys of multi-agent architectures, tool-use paradigms, evaluation frameworks, and communication topologies. We mapped our production system against the patterns, failure modes, and open problems those papers describe. This post is the result: a concrete comparison between what the literature recommends and what we actually run.

The central finding is that domain constraints do much of the work that the literature's dynamic methods are designed to do. Our codebase enforces P2P determinism rules: no HashMap in simulation code (BTreeMap instead, for a stable iteration order), no f32::sqrt() (rapier3d::na::ComplexField equivalents instead), and seeded RNG only. It enforces codegen atomicity: a partial codegen change leaves the generated Rust out of sync with its source config, so every codegen change requires a paired config-plus-trigger todo. It enforces sim-path purity: no system time, no platform-dependent float operations, no unseeded randomness. These constraints are encoded statically in project rules and enforced by permission hooks, not discovered at runtime by a planner. The gaps that remain (adversarial debate between specialists, cross-task memory, agent-level evaluation metrics) are precisely the ones where domain constraints cannot substitute for missing capability.
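
To make two of those determinism rules concrete, here is a minimal sketch of sim-path code that iterates in a stable order and draws randomness only from an explicit seed. The `SeededRng` type (a SplitMix64 step) and the `tick` function are illustrative assumptions, not the engine's actual code.

```rust
use std::collections::BTreeMap;

/// Hypothetical seeded generator (SplitMix64).
/// The same seed yields the same sequence on every peer.
pub struct SeededRng {
    state: u64,
}

impl SeededRng {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical sim step: subtracts a roll in 0..=5 from each entity's health.
/// BTreeMap iterates in sorted key order, so every peer pairs the same
/// entity with the same roll regardless of insertion order.
pub fn tick(health: &BTreeMap<u32, i64>, rng: &mut SeededRng) -> BTreeMap<u32, i64> {
    health
        .iter()
        .map(|(&id, &hp)| (id, hp - (rng.next_u64() % 6) as i64))
        .collect()
}
```

With a HashMap the visit order could differ between peers, so the same seed would assign different rolls to different entities and the simulations would diverge.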

The System in Four Paragraphs

The chairman planner coordinates specialist agents through structured planning rounds. Seven roles (GameDesigner, P2pExpert, EngineExpert, UiExpert, SimulationExpert, DevToolsExpert, ContentWriter) are mapped to domain clusters in the codebase knowledge base. The chairman queries the context engine to determine which clusters a topic touches, selects the relevant roles, and injects per-role context sections into each specialist's prompt. Specialists never query the knowledge base directly. They return structured JSON, an AgentOutput containing role, sections, constraints, and concerns, and the chairman detects cross-role conflicts via a dedicated synthesis call. A topic gets at most three rounds; most converge in one or two.

AgentOutput — specialist response schema
{
  "role": "SimulationExpert",
  "sections": [{ "title": "...", "content": "..." }],
  "constraints": ["no f32::sqrt in sim-path code", "..."],
  "concerns": ["codegen pair missing for new config field"]
}
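
The schema and the role-selection step could be typed roughly as follows. This is a sketch under stated assumptions: the field names come from the schema above, but the cluster names in `role_for_cluster` are invented for illustration, and the real mapping lives in the knowledge base.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Role {
    GameDesigner,
    P2pExpert,
    EngineExpert,
    UiExpert,
    SimulationExpert,
    DevToolsExpert,
    ContentWriter,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Section {
    pub title: String,
    pub content: String,
}

/// Mirrors the AgentOutput JSON: role, sections, constraints, concerns.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AgentOutput {
    pub role: Role,
    pub sections: Vec<Section>,
    pub constraints: Vec<String>,
    pub concerns: Vec<String>,
}

/// Hypothetical cluster-to-role table; the cluster names are assumptions.
pub fn role_for_cluster(cluster: &str) -> Option<Role> {
    match cluster {
        "netcode" => Some(Role::P2pExpert),
        "simulation" => Some(Role::SimulationExpert),
        "rendering" => Some(Role::EngineExpert),
        "ui" => Some(Role::UiExpert),
        _ => None,
    }
}

/// Picks each relevant role once, however many clusters map to it.
/// Unknown clusters are ignored.
pub fn select_roles(clusters: &[&str]) -> BTreeSet<Role> {
    clusters.iter().filter_map(|&c| role_for_cluster(c)).collect()
}
```

Returning a set keeps a specialist from being prompted twice when a topic touches several clusters in its domain.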

The star-graph executor translates converged plans into code. A planner agent decomposes specs into todos, up to three todo agents work concurrently with work-stealing, and a synthesizer evaluates results with a verdict of continue, replan, done, or failed. Each agent gets fresh context — the persisted root (project rules plus spec) combined with a transient delta from the current run. This fresh-context-per-agent design mitigates the 60-plus-message attention degradation that the fundamentals survey (fundamentals-autonomous-llm-agents-2025) identifies as a primary failure mode in long-running agent sessions. An anti-spiral monitor watches for pathological loops: warn at 8 consecutive grep or bash calls without an edit, kill at 12. Same-reason consecutive replans trigger immediate halt. Codegen changes require mandatory two-todo pairs — one for the config file, one for the build trigger — because partial codegen modifications leave generated Rust disagreeing with its source config.

Anti-spiral monitor thresholds
SPIRAL_WARN  = 8   // consecutive grep/bash without edit → warning
SPIRAL_KILL  = 12  // consecutive grep/bash without edit → terminate agent
// same-reason consecutive replan → immediate halt
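
A monitor with these thresholds can be sketched as a small counter. The threshold values are from the listing above; the type names, the choice to fire exactly on the 8th and 12th call, and exact string matching for replan reasons are assumptions.

```rust
pub const SPIRAL_WARN: u32 = 8;
pub const SPIRAL_KILL: u32 = 12;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolCall {
    Grep,
    Bash,
    Edit,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpiralStatus {
    Ok,
    Warn,
    Kill,
}

/// Counts consecutive grep/bash calls; any edit resets the streak.
#[derive(Debug, Default)]
pub struct SpiralMonitor {
    streak: u32,
}

impl SpiralMonitor {
    pub fn observe(&mut self, call: ToolCall) -> SpiralStatus {
        match call {
            ToolCall::Edit => self.streak = 0,
            ToolCall::Grep | ToolCall::Bash => self.streak += 1,
        }
        if self.streak >= SPIRAL_KILL {
            SpiralStatus::Kill
        } else if self.streak >= SPIRAL_WARN {
            SpiralStatus::Warn
        } else {
            SpiralStatus::Ok
        }
    }
}

/// True when the two most recent replans cite the same reason
/// (compared as exact strings, which is an assumption).
pub fn should_halt(replan_reasons: &[&str]) -> bool {
    match replan_reasons {
        [.., prev, last] => prev == last,
        _ => false,
    }
}
```

The streak counts calls rather than elapsed time, so a slow but productive agent is never penalized; only the absence of an edit matters.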

The orchestrator watch daemon runs the outer loop: git fetch, merge, planning, fixes, roadmap execution, push, sleep thirty seconds. Each agent runs in an isolated git worktree with permissions governed by .claude/settings.agents.json — a hardcoded manifest checked on every tool invocation via PreToolUse hooks. Context documents are the inter-agent communication protocol; agents never communicate directly. The queue is deliberately sequential because code and context doc conflicts make parallelism unsafe at the orchestration layer.
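
The per-invocation permission check can be pictured as a pure function over the manifest. The manifest shape below (allowed tools plus writable path prefixes per role) is a hypothetical stand-in; the post does not show the actual layout of .claude/settings.agents.json, and the role and tool names in the usage are illustrative.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-role entry; the real manifest schema is not shown here.
#[derive(Debug, Clone)]
pub struct RolePermissions {
    pub allowed_tools: Vec<String>,
    pub writable_prefixes: Vec<String>,
}

pub type Manifest = BTreeMap<String, RolePermissions>;

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny(String),
}

/// Decides one tool invocation. Unknown roles and unlisted tools are denied.
/// Path matching is a plain string-prefix test with no normalization,
/// which a real hook would need to harden.
pub fn check_tool_call(
    manifest: &Manifest,
    role: &str,
    tool: &str,
    target_path: Option<&str>,
) -> Decision {
    let Some(perms) = manifest.get(role) else {
        return Decision::Deny(format!("unknown role: {role}"));
    };
    if !perms.allowed_tools.iter().any(|t| t == tool) {
        return Decision::Deny(format!("{role} may not call {tool}"));
    }
    if let Some(path) = target_path {
        let permitted = perms
            .writable_prefixes
            .iter()
            .any(|p| path.starts_with(p.as_str()));
        if !permitted {
            return Decision::Deny(format!("{role} may not touch {path}"));
        }
    }
    Decision::Allow
}
```

Denying by default means a role that is missing from the manifest cannot act at all, which matches the intent of a hardcoded allowlist.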

The context engine provides the shared knowledge substrate. Built on LanceDB with BM25 full-text search, AllMiniLmL6V2 vector embeddings, and Reciprocal Rank Fusion reranking, it indexes over 160 structured JSON context documents. An MCP server exposes browse, fetch, and rebuild as native tool calls. Each task produces up to three context documents: a spec-context prefetched by the chairman, a feature-context written during execution, and a review-context capturing the reviewer's findings. Knowledge persists across tasks — research micro-agents that fill gaps during planning write permanent documents that benefit every future run.
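
Reciprocal Rank Fusion itself is a short formula: each document scores the sum of 1 / (k + rank) over the ranked lists it appears in. The sketch below is a standalone illustration of that formula, not the LanceDB API; the document ids are made up, and k = 60 is the constant commonly used in the RRF literature rather than a value stated in this post.

```rust
use std::collections::BTreeMap;

/// Fuses several ranked lists (best first) into one list ordered by RRF score.
/// Ranks start at 1, so the top hit of a list contributes 1 / (k + 1).
pub fn reciprocal_rank_fusion(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: BTreeMap<&str, f64> = BTreeMap::new();
    for ranking in rankings {
        for (idx, doc) in ranking.iter().enumerate() {
            let rank = idx as f64 + 1.0;
            *scores.entry(*doc).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(doc, score)| (doc.to_string(), score))
        .collect();
    // Stable sort by descending score; ties keep the BTreeMap's id order.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1));
    fused
}
```

Because only ranks enter the formula, BM25 scores and cosine similarities never need to be normalized onto a common scale before fusion.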

The thesis restated: static domain constraints — P2P determinism rules, codegen atomicity, sim-path purity — already close gaps that dynamic planning methods in the literature are designed to address. The interesting comparison is not whether we match every recommended pattern, but which patterns become unnecessary when your domain encodes its own constraints, and which remain necessary regardless.

Where We Land on the Map

Mapping the system against the literature reveals six strong alignment points — places where we independently converged on patterns the papers describe, sometimes with different mechanisms serving the same function.

  • Chairman as SOP pattern (metagpt-chatdev-agentverse-2023) — MetaGPT's contribution is role-specialized agents communicating through structured artifacts, preventing the cascading-hallucination problem where one agent's error propagates unchecked. Our chairman implements this via seven domain experts producing AgentOutput JSON, with a shared message pool realized as context documents. The structured artifact format — sections, constraints, concerns — forces agents to separate observations from recommendations, which is the same decomposition MetaGPT achieves through its standardized operating procedures.
  • Information gatekeeper as LMPR split (review-llm-agent-paradigms-2024) — The paradigms review decomposes agent architectures into LLM-profiled roles (LMPRs): policy models, which act as planners or actors; evaluators; and dynamic models. Our chairman maps to the planner, the synthesizer to the evaluator, and specialists to pure actors with no tool autonomy. The gatekeeper pattern, in which specialists see only injected context and never query the knowledge base themselves, addresses the failure mode the review identifies when retrieval access is uncontrolled: agents spend tokens exploring irrelevant context and produce diluted analysis.
  • Executor loop as ReAct plus Reflexion (react-reasoning-acting-2023) — Todo agents follow the Thought-Action-Observation loop that ReAct formalizes. The synthesizer's verdict mechanism — evaluating execution results and deciding whether to continue, replan, or halt — is Reflexion-style self-evaluation with bounded retries. The connection to generative-agents-2023 is indirect: where Generative Agents use memory streams for behavioral continuity, we use fresh context per agent precisely to avoid the accumulation of stale reasoning that long-lived agents suffer from.
  • Anti-spiral monitor as answer to loop pathology (fundamentals-autonomous-llm-agents-2025) — The fundamentals survey documents repetitive-action loops as a primary failure mode in autonomous agents, citing OSWorld experiments where agents cycle through the same actions without progress. Our anti-spiral monitor directly addresses this: SPIRAL_WARN at 8 consecutive grep or bash calls without an edit, SPIRAL_KILL at 12, plus immediate halt on same-reason consecutive replans. This is not a general heuristic: it encodes the specific observation that an agent grepping repeatedly without editing has lost its thread.
  • Knowledge-gap micro-agents as Voyager skill library (survey-llm-agent-methodology-2025) — Voyager builds a growing skill library; ExpeL accumulates experiential insights. Our knowledge-gap detection spawns research micro-agents that write permanent context documents — the equivalent of Voyager's skill store, but for codebase understanding rather than executable skills. The key parallel from the methodology survey is persistence: these documents survive topic abandonment, so every planning session enriches the knowledge base for all future sessions.
  • Star-plus-layered communication topology (survey-llm-multi-agents-2024) — The multi-agent survey catalogs communication topologies: flat, star, layered, hierarchical. Our system is a hybrid — star-graph executor (synthesizer as hub coordinating todo agents) nested inside a layered chairman-to-executor pipeline. The survey identifies this combination as the scalable choice, and our experience confirms it: the star layer handles concurrent execution, while the layered pipeline handles sequential planning-to-execution handoff. HuggingGPT (hugginggpt-mrkl-chameleon-2023) uses a similar LLM-as-orchestrator pipeline, though with dynamic model selection rather than fixed specialist roles.

Gaps the Papers Flag — That Our Constraints Already Mitigate

Three gaps from the literature map to capabilities we lack in the abstract but address through domain-specific mechanisms. In each case, static constraints do work that the recommended dynamic method would otherwise need to perform at runtime.

  • No Dynamic-Model LMPR (review-llm-agent-paradigms-2024) — We have the Policy and Evaluator roles from LMPR but no world-model for look-ahead planning. For simulation-path changes, determinism rules already constrain valid modifications so tightly that lookahead would mostly rediscover what we encode statically. The f32::sqrt ban is illustrative: an agent cannot use standard library float operations in sim-path code — it must use rapier3d::na::ComplexField equivalents. A dynamic world-model would need to discover this constraint; our project rules state it explicitly, and permission hooks enforce it. Domain constraints mitigate this gap.
  • Static profiling instead of dynamic (autogen-camel-2023, generative-agents-2023) — The methodology survey (survey-llm-agent-methodology-2025) recommends LLM-generated agent profiles that adapt to task requirements. Our seven roles are hardcoded — each maps to real architectural boundaries at the crate level and to determinism domains in the simulation. Dynamic profiles add value when cluster coverage shifts unpredictably. Ours are structurally stable because the codebase architecture is stable. If we added a new crate or domain, we would add a role — but that is a quarterly event, not a per-task adaptation. Domain constraints mitigate this gap.
  • Specialist tool access restricted (toolformer-2023) — Toolformer demonstrates that LLMs can learn to use tools selectively. Our specialists have no direct tool access — the chairman injects all context, and specialists reason over pre-digested information only. This is a deliberate legibility-over-capability tradeoff: when a specialist produces a bad recommendation, we can trace exactly which context it saw. The information-gatekeeper pattern trades theoretical capability gain for predictability and debuggability. Domain constraints mitigate this gap — though it remains a genuine capability ceiling if tasks require specialists to explore beyond their injected context.

Gaps That Survive Our Constraints

Three gaps from the literature identify genuine missing capabilities that domain constraints do not address. These are the highest-leverage areas for improvement.

  • Debate paradigm underused (survey-rise-potential-llm-agents-2023) — A Conflict schema exists — role_a, role_b, description — but conflicts are resolved unilaterally by the synthesis agent, not through adversarial debate between the conflicting specialists. For cross-system changes where simulation constraints conflict with P2P requirements or UI behavior, structured debate between SimulationExpert and P2pExpert could catch integration issues that unilateral synthesis misses. The rise-and-potential survey identifies debate as a distinct cooperation paradigm; we have the data structure but not the mechanism. Domain constraints do not mitigate this gap.
  • Cross-task memory shallow (survey-llm-agent-methodology-2025) — The three-document structure — spec-context, feature-context, review-context — provides strong task-scoped memory, but there is no cross-task episodic store equivalent to Reflexion's trial memory or MemGPT's virtual memory hierarchy. Simulation bugs are pattern-recurrent: the same class of determinism violation appears across different features. A run-level lessons store accumulating replan reasons would let future runs front-load constraint checks instead of rediscovering the same failure patterns. Domain constraints do not mitigate this gap.
  • Evaluation blind spot (survey-evaluation-llm-agents-2025) — We measure task completion but not cost-efficiency, safety, or robustness. The evaluation survey describes HAL-style holistic scoring across multiple dimensions. Anti-spiral kills are a robustness proxy, but they are not logged as structured metrics. TAG_DESYNC_HASH, our production correctness oracle, checks for determinism violations every 60 simulation frames and is precisely the kind of domain-specific evaluation metric the survey says most agent systems lack. We have it for the simulation, just not for the agents themselves. Domain constraints do not mitigate this gap.
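
As a rough picture of the run-level lessons store described in the cross-task memory item, the sketch below counts replan reasons across runs and surfaces the recurring ones. Nothing like this is shipped; the `LessonsStore` type, its method names, and the use of raw reason strings as keys are all assumptions.

```rust
use std::collections::BTreeMap;

/// Hypothetical cross-task store: replan reason -> times seen across runs.
#[derive(Debug, Default)]
pub struct LessonsStore {
    counts: BTreeMap<String, u32>,
}

impl LessonsStore {
    /// Records one replan. Reasons are keyed by exact string.
    pub fn record_replan(&mut self, reason: &str) {
        *self.counts.entry(reason.to_string()).or_insert(0) += 1;
    }

    /// Reasons seen at least `min_count` times, most frequent first.
    /// Ties keep alphabetical order because the sort is stable.
    pub fn recurring(&self, min_count: u32) -> Vec<(&str, u32)> {
        let mut out: Vec<(&str, u32)> = self
            .counts
            .iter()
            .filter(|(_, n)| **n >= min_count)
            .map(|(reason, n)| (reason.as_str(), *n))
            .collect();
        out.sort_by(|a, b| b.1.cmp(&a.1));
        out
    }
}
```

A planner could prepend the output of `recurring` to its prompt as explicit pre-flight checks, which is the front-loading the item has in mind. Free-text reasons would need normalizing before exact-match counting becomes useful.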

Why Domain Constraints Change the Calculus

The twelve papers frame agent systems generically — they must work across arbitrary domains, arbitrary codebases, arbitrary team structures. Our system is not generic. It operates on a codebase with strong structural priors: P2P determinism rules that ban HashMap and platform-dependent float operations in simulation code, codegen atomicity that requires paired config-and-trigger modifications, sim-path purity that forbids system time and unseeded randomness, and a permission boundary via .claude/settings.agents.json that is checked on every tool invocation.

Those constraints already perform work that the literature's dynamic methods — Dynamic-Model LMPR, planner lookahead, dynamic profiling — would otherwise need to discover at runtime. The collect-then-act simulation pipeline illustrates this directly: System 1 computes shared state (PlayerInfo per player, InfluenceEdge data per edge), System 2 consumers read exclusively from precomputed values. The chairman's information-gatekeeper pattern mirrors this architecture — a single authority computes shared context, downstream specialists operate on pre-digested views. The structural parallel is not coincidental; both solve the same problem of preventing downstream consumers from reasoning over inconsistent or stale data.
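
The collect-then-act shape can be sketched in a few lines. Only the `PlayerInfo` name comes from the text above; its fields, the `Unit` input type, and both function names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical raw simulation input.
pub struct Unit {
    pub owner: u32,
    pub strength: i64,
}

/// Hypothetical per-player aggregate computed once per tick.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PlayerInfo {
    pub unit_count: u32,
    pub total_strength: i64,
}

/// System 1 (collect): the single place that derives shared state.
pub fn collect_player_info(units: &[Unit]) -> BTreeMap<u32, PlayerInfo> {
    let mut info: BTreeMap<u32, PlayerInfo> = BTreeMap::new();
    for unit in units {
        let entry = info.entry(unit.owner).or_insert(PlayerInfo {
            unit_count: 0,
            total_strength: 0,
        });
        entry.unit_count += 1;
        entry.total_strength += unit.strength;
    }
    info
}

/// System 2 (act): reads only the precomputed snapshot, never the raw units.
/// Ties on strength go to the lowest player id so the result is deterministic.
pub fn strongest_player(info: &BTreeMap<u32, PlayerInfo>) -> Option<u32> {
    info.iter()
        .max_by_key(|(id, p)| (p.total_strength, std::cmp::Reverse(**id)))
        .map(|(id, _)| *id)
}
```

Every consumer sees the same snapshot, so two consumers cannot disagree about a player's state within a tick. That is the same guarantee the chairman gives specialists by injecting context rather than letting each one retrieve its own.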

The anti-spiral monitor is not just preventing wasted tokens. It is preventing partial-mutation errors that deterministic systems cannot tolerate. An agent that edits a TOML config but fails to trigger codegen leaves generated Rust disagreeing with its config source — a state that compiles but produces wrong behavior, exactly the kind of silent failure that the fundamentals survey (fundamentals-autonomous-llm-agents-2025) identifies as the hardest to detect. The two-todo codegen pair requirement exists because we learned this the hard way.
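
The pairing rule lends itself to a mechanical check on a plan before execution starts. This is a sketch, assuming todos carry a kind that names the config they touch; the `Todo` and `TodoKind` types and the file names in the usage are hypothetical.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TodoKind {
    /// Edits a codegen source config.
    ConfigEdit { config: String },
    /// Triggers the build step that regenerates Rust from that config.
    CodegenTrigger { config: String },
    Other,
}

#[derive(Debug, Clone)]
pub struct Todo {
    pub id: u32,
    pub kind: TodoKind,
}

/// Returns, in sorted order, every config that has an edit todo without a
/// trigger todo, or a trigger todo without an edit todo.
pub fn unpaired_configs(todos: &[Todo]) -> Vec<String> {
    let mut edits = BTreeSet::new();
    let mut triggers = BTreeSet::new();
    for todo in todos {
        match &todo.kind {
            TodoKind::ConfigEdit { config } => {
                edits.insert(config.clone());
            }
            TodoKind::CodegenTrigger { config } => {
                triggers.insert(config.clone());
            }
            TodoKind::Other => {}
        }
    }
    edits.symmetric_difference(&triggers).cloned().collect()
}
```

A plan that returns a non-empty list here could be rejected before any agent touches the worktree, which is cheaper than detecting stale generated code after the fact.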

For production systems with strong structural priors, the highest-leverage investments are: observability — so you can prove your constraints are doing their job; conflict-handling fidelity — where your priors cannot help; and long-term memory — which transcends any single task's structure. Dynamic planning machinery is lower priority precisely because you already know the shape of the problem.

Structured adversarial debate between specialists and a metrics pipeline measuring cost-efficiency, robustness, and safety are planned but not yet shipped — the natural next steps that this analysis identified. The constraints handle what they can. The work that remains is building the capabilities they cannot substitute for.

Papers Referenced in the Analysis