The Chairman Pattern: Multi-Agent Planning with Controlled Information Flow

Most multi-agent LLM systems follow a simple pattern: fan out prompts to several agents, collect their responses, merge the results. It works, but it breaks down when agents have overlapping domains and contradictory recommendations. Who wins? How do you converge?

We built a different pattern for our game engine's autonomous development pipeline. We call it the chairman — a coordinating agent that treats specialist LLM agents like a board of domain experts debating a proposal. The chairman controls what each agent sees, detects when they disagree, and drives structured rounds of conflict resolution until the board converges on a plan.

This post walks through the concrete Rust implementation: the data structures, the round protocol, the information gating, and the tricks we learned making LLM agents produce parseable output reliably.

The Pipeline

The chairman is one stage in a larger autonomous orchestrator that watches a git branch, picks up work, and executes it:

Orchestrator polling loop
git fetch → merge → planning (chairman) → fixes → roadmap → git push → sleep 30s

The orchestrator runs a polling loop. Planning gets priority 1 — its output (specs and fix documents) feeds directly into the fix and roadmap executor stages in the same cycle. The chairman is the planning brain.

Input: A Topic File

Everything starts with a human writing a natural language feature description to docs/planning/{topic}/topic.md. No structured format, no templates — just describe what you want built:

"Add a Fischer increment to the chess clock system. After each realtime phase ends, add N seconds to the player's clock based on the match rules preset."

The chairman takes it from there.

Step 1: Domain Detection via Context Engine

The first thing the chairman does is figure out which parts of the codebase this topic touches. It queries a local context engine — a LanceDB instance with BM25 full-text search, AllMiniLmL6V2 vector embeddings (384d), and RRF reranking over 165 structured context documents.

planning.rs — domain detection via browse
let browse_results = browse_context_server(&topic_text);
let relevant_clusters: Vec<String> = browse_results
    .unwrap_or_default()
    .iter()
    .map(|r| r.cluster.clone())
    // BTreeSet dedupes and sorts, so cluster order is deterministic
    .collect::<BTreeSet<_>>()
    .into_iter()
    .collect();

The context engine returns results tagged with domain clusters like "Match Balance & Timing", "UI Animation & Transitions", "P2P Multiplayer". The chairman collects the unique clusters from browse results, then maps them to specialist roles.
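The RRF reranking step mentioned above is simple enough to sketch inline. This is a minimal reciprocal rank fusion over two ranked result lists — `rrf_merge` and the document ids are illustrative, and k = 60 is the conventional constant, not necessarily what the context engine uses:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)
/// to a document's fused score; higher totals rank first.
fn rrf_merge(rankings: &[Vec<&str>], k: f64) -> Vec<String> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (rank, doc) in ranking.iter().enumerate() {
            // Ranks are 1-based in the standard RRF formula.
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused.into_iter().map(|(doc, _)| doc).collect()
}
```

A document that ranks well in both the BM25 list and the vector list beats one that ranks highly in only one — which is why fusion helps when the two retrievers disagree.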

Step 2: Role Selection

Seven specialist roles are defined as a Rust enum, each mapped to specific domain clusters:

planning.rs — specialist roles
enum AgentRole {
    GameDesigner,       // Match Simulation Core, Match Balance & Timing, ...
    P2pExpert,          // P2P Multiplayer, Engine Services
    EngineExpert,       // Engine Core & Lifecycle, Shader & Render Pipeline, ...
    UiExpert,           // UI Layout & Menus, UI Animation & Transitions, ...
    SimulationExpert,   // Match Simulation Core, Formations & Pitch & Zones, ...
    DevToolsExpert,     // Dev & Build Tools, Data Analytics
    ContentWriter,      // (no clusters — general purpose)
}

select_roles_for_clusters() iterates all roles and checks if any of their clusters appear in the browse results. If a topic only touches UI and match timing, you get UiExpert and GameDesigner — you don't waste tokens on P2pExpert debating determinism for a purely visual feature. BTreeSet ensures deterministic ordering across runs.
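Abridged to three roles, the selection might look like the sketch below — the cluster lists are shortened from the table above, and the `clusters()` helper is illustrative:

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum AgentRole {
    GameDesigner,
    P2pExpert,
    UiExpert,
    // ... four more in the real enum
}

impl AgentRole {
    // Cluster names abridged; the real mapping covers seven roles.
    fn clusters(&self) -> &'static [&'static str] {
        match self {
            AgentRole::GameDesigner => &["Match Simulation Core", "Match Balance & Timing"],
            AgentRole::P2pExpert => &["P2P Multiplayer", "Engine Services"],
            AgentRole::UiExpert => &["UI Layout & Menus", "UI Animation & Transitions"],
        }
    }
}

// A role is selected if any of its clusters appeared in browse results.
fn select_roles_for_clusters(clusters: &BTreeSet<String>) -> Vec<AgentRole> {
    [AgentRole::GameDesigner, AgentRole::P2pExpert, AgentRole::UiExpert]
        .into_iter()
        .filter(|role| role.clusters().iter().any(|c| clusters.contains(*c)))
        .collect()
}
```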

Each role has a hardcoded system prompt that frames its priorities. The P2P expert is "paranoid about HashMap iteration order, floating point divergence, unseeded RNG." The game designer pushes back on "technical solutions that sacrifice game feel." These prompts create genuine tension — agents argue from different value systems, which is the whole point.

Step 3: Knowledge Gap Detection

Before the debate begins, the chairman checks whether the context engine actually has knowledge about every relevant cluster. detect_knowledge_gaps() finds cluster queries that returned zero browse results:

planning.rs — gap detection
fn detect_knowledge_gaps(
    queries: &[String],
    browse_results: &[(String, Vec<BrowseResult>)]
) -> Vec<String> {
    queries.iter()
        .filter(|query| !browse_results.iter()
            .any(|(q, results)| q == *query && !results.is_empty()))
        .cloned()
        .collect()
}

For each gap, the chairman spawns a research micro-agent — a separate Claude instance with a 5-minute timeout that greps the codebase, reads source files, and writes a structured context document to docs/context/. After completion, the chairman rebuilds the LanceDB index and reloads the context server so the new knowledge is immediately available.
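The timeout machinery needs nothing beyond std. A sketch of running a child process with a wall-clock deadline — `run_with_timeout` is illustrative, not the pipeline's actual helper:

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Poll try_wait() until the child exits or the deadline passes;
/// kill and reap on timeout. Returns None if the child was killed.
fn run_with_timeout(mut cmd: Command, timeout: Duration) -> std::io::Result<Option<String>> {
    let mut child = cmd.stdout(Stdio::piped()).spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        if child.try_wait()?.is_some() {
            let mut out = String::new();
            child.stdout.take().unwrap().read_to_string(&mut out)?;
            return Ok(Some(out));
        }
        if Instant::now() >= deadline {
            child.kill()?;   // timed out: kill and reap
            child.wait()?;
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(100));
    }
}
```

One caveat with this minimal version: stdout is only read after exit, so a child that writes more than the pipe buffer holds can stall; a production version drains stdout on a separate thread while waiting.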

The key insight: these context enrichments survive even if the topic is abandoned. The research agent is filling gaps in codebase knowledge, not producing topic-specific analysis. Every future topic benefits from the knowledge captured here.

Step 4: The Information Gatekeeper

This is the architectural choice that makes the chairman pattern different from simple fan-out.

Agents never query the context engine directly. The chairman is the sole interface to codebase knowledge. It fetches context per role via fetch_context_for_role(), which queries the context server for each of the role's clusters, then injects the results into the agent's prompt under a Domain Context section:

planning.rs — prompt construction
fn build_specialist_prompt(
    role: &AgentRole,
    topic: &str,
    context_sections: &[String],
    round: u32,
    previous_outputs: &[AgentOutput],
    conflicts: &[Conflict],
) -> String {
    // Role identity and system prompt
    // Topic text
    // Domain context (injected by chairman)
    // Previous round outputs (if round > 1)
    // Conflicts to resolve (if any)
    // Instructions for this round
}

Why gate information? Three reasons:

Focus. Each agent sees only its domain's context. The UI expert doesn't wade through physics internals. This keeps agent reasoning sharp and on-topic.

Control. The chairman decides what constitutes relevant context. Agents can't go spelunking through unrelated docs and get distracted or confused.

Reproducibility. Since the chairman controls all inputs, the same topic with the same context state produces deterministic role selection and prompt construction. The LLM responses vary, but the framing doesn't.
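The gatekeeper side of fetch_context_for_role() can be sketched like this — here `query_context` stands in for the HTTP call to the local context server, which is an assumption about the real signature:

```rust
// Build the "Domain Context" section for one role: one subsection per
// cluster the role owns. The agent never sees results outside this list.
fn fetch_context_for_role(
    role_clusters: &[&str],
    query_context: impl Fn(&str) -> Vec<String>,
) -> String {
    role_clusters
        .iter()
        .map(|cluster| {
            // `query_context` is a hypothetical stand-in for the
            // context-server request made per cluster.
            format!("## {}\n{}", cluster, query_context(cluster).join("\n"))
        })
        .collect::<Vec<_>>()
        .join("\n\n")
}
```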

Step 5: The Round Protocol

The structured debate runs for a maximum of 3 rounds:

Round 1: All selected agents run in parallel. Each gets the topic text, their domain context, and instructions to produce a structured analysis. Agents are spawned as separate processes via std::thread::spawn + agent::spawn_claude_raw(), each with a 10-minute timeout. Their stdout is captured and parsed as JSON.

After Round 1: The chairman detects conflicts by spawning another LLM call that analyzes all agent outputs and identifies semantic contradictions — cases where one agent's constraint or recommendation conflicts with another's. This replaced an earlier keyword-overlap heuristic with proper LLM synthesis.

Rounds 2-3: Only agents involved in conflicts run again. These agents receive the full context of all previous outputs plus the specific conflicts they need to resolve. Non-participating agents' outputs are preserved via merge_round_outputs().

Early exit: If no conflicts are detected after any round, the chairman breaks out of the loop immediately. Most topics converge in 1-2 rounds.

planning.rs — round loop
for round in 1..=status.max_rounds {
    let agents_to_run = if round == 1 {
        status.selected_roles.clone()
    } else if conflicts.is_empty() {
        break;  // No conflicts — converged
    } else {
        agents_involved_in_conflicts(&conflicts)
            .into_iter().collect()
    };

    // Spawn agents in parallel, collect outputs
    // Merge into all_outputs
    // detect_conflicts() via LLM synthesis
    // Rebuild context index between rounds
}

Between rounds, the chairman rebuilds the context index and reloads the context server so subsequent rounds can see updated context if research agents wrote new docs.
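merge_round_outputs() can be as simple as keying outputs by role — a sketch, assuming outputs live in a map from role name to output (the struct is abridged):

```rust
use std::collections::HashMap;

// Abridged; the full struct carries sections, constraints, and concerns.
#[derive(Clone)]
struct AgentOutput {
    role: String,
}

/// Later-round outputs overwrite the entries for the roles that re-ran;
/// every other role keeps its earlier output.
fn merge_round_outputs(
    all_outputs: &mut HashMap<String, AgentOutput>,
    new_outputs: Vec<AgentOutput>,
) {
    for output in new_outputs {
        all_outputs.insert(output.role.clone(), output);
    }
}
```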

The Agent Output Contract

Each agent returns a structured AgentOutput:

planning.rs — agent output struct
struct AgentOutput {
    role: String,
    sections: Vec<Section>,     // { title, content }
    constraints: Vec<String>,   // technical constraints
    concerns: Vec<String>,      // risks or issues foreseen
}

Conflicts are detected as:

planning.rs — conflict struct
struct Conflict {
    role_a: String,
    role_b: String,
    description: String,
}
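Extracting the re-run set from these conflicts is a small fold — a sketch, returning a BTreeSet for the same deterministic-ordering reason as role selection:

```rust
use std::collections::BTreeSet;

struct Conflict {
    role_a: String,
    role_b: String,
    description: String,
}

/// Collect the unique set of roles named on either side of any conflict;
/// only these agents re-run in rounds 2-3.
fn agents_involved_in_conflicts(conflicts: &[Conflict]) -> BTreeSet<String> {
    conflicts
        .iter()
        .flat_map(|c| [c.role_a.clone(), c.role_b.clone()])
        .collect()
}
```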

Parsing LLM Output Reliably

Getting LLMs to return valid JSON is a known pain point. parse_agent_output_from_text() uses 4 fallback strategies:

1. Direct parse — serde_json::from_str on the trimmed text.
2. Markdown JSON fences — extract content between ```json and ```.
3. Plain code fences — extract content between ``` and ```.
4. First-brace-to-last-brace — find the first { and last } in the text, try parsing that substring.

Strategy 4 is the catch-all — when agents prefix their JSON with commentary or append explanations, this strips the surrounding text. In practice, strategy 1 or 2 handles ~90% of outputs, but strategy 4 has saved enough runs to justify its existence.
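The fallback chain can be sketched without the serde_json dependency by returning the candidate substring each strategy would hand to serde_json::from_str — the real parser tries candidates until one actually parses:

```rust
/// Produce the JSON candidate the fallback chain would attempt to parse.
fn extract_json_candidate(text: &str) -> Option<String> {
    let trimmed = text.trim();
    // 1. Already looks like bare JSON — try it directly.
    if trimmed.starts_with('{') && trimmed.ends_with('}') {
        return Some(trimmed.to_string());
    }
    // 2./3. Fenced block: prefer a ```json fence, fall back to a plain fence.
    for fence in ["```json", "```"] {
        if let Some(start) = trimmed.find(fence) {
            let rest = &trimmed[start + fence.len()..];
            if let Some(end) = rest.find("```") {
                return Some(rest[..end].trim().to_string());
            }
        }
    }
    // 4. First `{` to last `}` — strips surrounding commentary.
    let first = trimmed.find('{')?;
    let last = trimmed.rfind('}')?;
    (first < last).then(|| trimmed[first..=last].to_string())
}
```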

Step 6: Synthesis

Once converged, the chairman writes all agent outputs to synthesis.json and spawns a synthesis agent with skill access. This agent invokes the /new-spec and /new-fix skills to produce structured output documents:

spec.md files in docs/roadmap/ — roadmap specs that feed into the TDD executor.
fix.json files in docs/fixes/ — bug fix documents that feed into the fix executor.

The synthesis agent gets a 15-minute timeout — it's doing real work, not just reformatting. It reads the converged domain analyses and translates cross-cutting concerns into actionable implementation plans.

The full state machine: Pending → Running → Converged → OutputCreated → Done (or Failed with retry logic — up to 2 retries per phase).
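A sketch of that state machine with the retry rule — the variant names follow the post, but the field names and transition helper are illustrative:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PlanningState {
    Pending,
    Running,
    Converged,
    OutputCreated,
    Done,
    Failed,
}

struct TopicStatus {
    state: PlanningState,
    retries: u8,
}

impl TopicStatus {
    const MAX_RETRIES: u8 = 2;

    /// On a phase failure: retry in place up to twice, then give up.
    fn on_phase_failure(&mut self) {
        if self.retries < Self::MAX_RETRIES {
            self.retries += 1; // stay in the current state and retry
        } else {
            self.state = PlanningState::Failed;
        }
    }
}
```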

What We Learned

Information gating matters more than prompt engineering. The biggest improvement in output quality came not from refining system prompts, but from controlling what context each agent sees. A focused agent with limited, relevant context consistently outperforms an agent drowning in the full codebase.

Conflict detection needs LLM reasoning. Our first version used keyword overlap heuristics to detect conflicts. It missed semantic contradictions (where agents disagree using entirely different vocabulary) and flagged false positives (where agents independently mentioned the same system without contradicting each other). Using a dedicated LLM call for conflict detection was the right trade — it costs one extra API call per round but dramatically reduces wasted rounds.

Persistent knowledge enrichment compounds. The research micro-agents were initially a hack to handle topics that touched undocumented systems. But because they write to the shared context engine, every run makes the next run smarter. After a few months, the knowledge gaps that trigger research agents have dropped significantly — the system is self-improving.

Parse defensively. LLMs will surprise you with output formatting. The 4-strategy JSON parser cost us 20 lines of code and eliminated an entire class of pipeline failures. If you're building agent pipelines, write your output parser before your prompt.

The Pattern, Abstracted

If you're building a multi-agent system that needs convergence (not just aggregation), the chairman pattern is:

1. Detect domains — figure out which specialists are relevant.
2. Gate information — each specialist sees only their domain context.
3. Structured rounds — parallel execution, conflict detection, targeted re-runs.
4. Early termination — stop as soon as agents agree.
5. Persistent enrichment — fill knowledge gaps in a way that outlives the current task.

The key constraint: the chairman is the only agent with access to the full picture. Specialists never see each other's raw context, only their outputs. This prevents the common failure mode where agents try to reason about domains outside their expertise and produce confidently wrong analysis.

The full implementation is ~1550 lines of Rust in planning.rs, with 36 tests. No frameworks, no agent libraries — just threads, HTTP calls to a local context server, and subprocess spawning for Claude instances. Sometimes the simplest orchestration is the most robust.