The Day the Agents Shipped a Feature

The previous posts in this series described how we built a codebase knowledge engine — hybrid search combining keyword matching with semantic similarity, a persistent server that keeps the ML model hot, a finalize phase that captures what agents learn, and an MCP adapter that turns search into a native tool call. Those posts covered the parts. This one is about the parts working together.

A human wrote a single document describing a feature: split the game's monolithic training data store into multiple named variants and add tiered CPU difficulty based on player rating. The document was forty lines. What happened next was observed, not directed.

Act I — The Brief

The chairman — the orchestration agent that coordinates planning — read the topic and queried the context engine. The results came back spanning six domain clusters: match simulation, P2P multiplayer, engine core, UI, dev tooling, and content. The feature touched wire formats in the shared crate, loaders in the game binary, recorder wiring across four game modes, and version-coupling rules that gate matchmaking compatibility.

Based on which clusters the search results touched, the chairman selected its panel: seven specialists, one per domain. The game designer. The P2P expert. The engine expert. The UI expert. The simulation expert. The dev-tools expert. The content writer. Seven is the maximum possible panel — every available role. The topic was cross-cutting enough to need all of them.

The chairman also ran twenty-three knowledge gap queries against the context engine, checking whether the existing knowledge base covered every system the feature would touch. All twenty-three came back filled. No research agents were needed. The prior five posts' worth of accumulated context — captured from previous tasks by previous agents — meant the panel could start debating immediately, with no fact-finding detours.

The chairman's panel selection is automatic. It is driven by which domain clusters the context engine's search results touch, not by a human deciding who should be in the room. The knowledge base determines who is relevant.
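The selection rule is simple enough to sketch. The following Rust sketch is hypothetical: the cluster names, role names, and the `panel_for` function are invented for illustration, not the engine's real taxonomy.

```rust
use std::collections::BTreeSet;

// Hypothetical sketch: the real cluster taxonomy and role roster live in the
// context engine; the names below are placeholders.
fn panel_for(clusters_hit: &[&str]) -> BTreeSet<&'static str> {
    let roster: &[(&'static str, &'static str)] = &[
        ("match_simulation", "simulation expert"),
        ("p2p_multiplayer", "p2p expert"),
        ("engine_core", "engine expert"),
        ("ui", "ui expert"),
        ("dev_tooling", "dev-tools expert"),
        ("content", "content writer"),
    ];
    // One specialist per domain cluster the search results touched.
    roster
        .iter()
        .filter(|(cluster, _)| clusters_hit.contains(cluster))
        .map(|&(_, role)| role)
        .collect()
}
```

The point of the shape is that there is no "who should attend" parameter anywhere: the panel is a pure function of what the search results touched.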

Act II — The Debate

Three rounds. The chairman ran all seven experts in parallel in the first round, asking each to analyze the topic through their domain lens and surface concerns, constraints, and design positions.

Round two dropped the simulation expert — their first-round concerns about deterministic float safety were already resolved and needed no further debate. Six experts continued.

Round three dropped the dev-tools expert, whose scope was fully settled, and re-engaged the simulation expert to weigh in on a conflict that had evolved since their departure. Six experts again, but not the same six.

The asymmetry is the interesting part. The chairman does not run every expert for every round. It tracks which domains still have open conflicts and only re-engages experts whose input is needed. Rounds are not a formality — they are targeted.
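The scheduling rule reduces to a small predicate. This is a hypothetical sketch, assuming the chairman tracks an unresolved-conflict count per domain (the function name and data shape are invented):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical sketch of targeted rounds: an expert is engaged in the next
// round only while their domain still has unresolved conflicts.
fn engaged_next_round<'a>(open_conflicts: &BTreeMap<&'a str, u32>) -> BTreeSet<&'a str> {
    open_conflicts
        .iter()
        .filter(|&(_, &count)| count > 0)
        .map(|(&domain, _)| domain)
        .collect()
}
```

Under this rule, a domain whose count drops to zero (the simulation expert after round one) falls out of the next round, and climbs back in if a later conflict touches it again.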

Five conflicts surfaced across the three rounds.

The first was about what happens when a CPU difficulty file fails to load. The simulation expert and dev-tools expert wanted the CPU to stand still — a loud, unmistakable failure. The game designer led a five-expert majority in the opposite direction: fall back to the player's own training data with a warning notification. A CPU that stands still, the game designer argued, is indistinguishable from a crash to the player. A degraded-but-playable experience is always better than a technically correct but unplayable one.
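The policy the majority landed on can be sketched in a few lines. Everything here is hypothetical, assuming the loader reports failure through a `Result` and the game queues warning notifications:

```rust
// Hypothetical sketch of the agreed failure policy: if the CPU difficulty
// file fails to load, fall back to the player's own training data and queue
// a warning, rather than fielding a CPU that stands still.
#[derive(Debug, PartialEq)]
enum TrainingSource {
    CpuTier(u8),
    PlayerFallback,
}

fn resolve_training_data(
    cpu_load: Result<u8, String>,
    warnings: &mut Vec<String>,
) -> TrainingSource {
    match cpu_load {
        Ok(tier) => TrainingSource::CpuTier(tier),
        Err(e) => {
            // Degraded but playable beats technically correct but unplayable.
            warnings.push(format!("CPU data unavailable, using player data: {e}"));
            TrainingSource::PlayerFallback
        }
    }
}
```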

The second conflict was about whether to tell the player that training data had been split. The engine expert and simulation expert wanted a one-time tooltip explaining the change. The game designer and P2P expert pushed back: the player never knew the data was shared in the first place. A tooltip explaining the split creates the confusion it claims to prevent — it answers a question nobody was asking. The game designer eventually conceded to a single tooltip on first entry into one specific mode where behavior would visibly change, but held the line on silence everywhere else.

The third — and the one that drew the sharpest exchange — was about difficulty thresholds. The simulation expert proposed one set of rating boundaries. The game designer proposed another, offset by a hundred points at the bottom end. The difference sounds small. It is not. The simulation expert's boundaries meant a player would need to lose fifteen to twenty consecutive matches before reaching the easiest difficulty tier. The game designer's boundaries cut that to seven or eight losses.

The game designer's argument carried the room unanimously: the hundred-point offset is the difference between a safety net that works and one that is too far below the tightrope.
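The arithmetic behind that argument is easy to make concrete. The post does not give the real boundaries or the rating penalty per loss, so every number below is hypothetical; the only claim is the shape of the relationship.

```rust
/// Consecutive losses needed to fall from `start` to `floor`, assuming a
/// fixed rating penalty per loss. All concrete numbers are hypothetical.
fn losses_to_reach(start: i32, floor: i32, per_loss: i32) -> i32 {
    ((start - floor) + per_loss - 1) / per_loss // ceiling division
}
```

With an illustrative starting rating of 1000 and roughly 13 points lost per match, a floor at 800 takes 16 consecutive losses to reach, while a floor at 900 takes 8. A hundred-point offset roughly halves the slide, which is the whole dispute in one subtraction.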

The remaining two conflicts — how to store recorder references internally, and whether to show the CPU difficulty tier after a match — resolved quickly. The panel chose two named fields over an indexed array because the key space is exactly two and will remain two. The post-match display was deferred to a follow-up task as purely additive UI work.

The conflicts are the point. A single agent implementing this feature would have made five silent design decisions. The panel made five explicit ones, with arguments, counterarguments, and recorded rationale. When a future agent queries the context engine for why the difficulty thresholds are set where they are, it will find the game designer's reasoning — not just the numbers.

Act III — The Build

The chairman synthesized three rounds of debate into a thirty-six-line specification and placed it in the roadmap. The executor — a separate agent with its own worktree — picked it up.

The executor's log is a hundred and fifty-eight lines. It decomposed the spec into work items, dispatching three parallel sub-agents to handle independent slices of the implementation. One worked on the shared crate data types. One worked on the game-side loaders and mode wiring. One worked on tests.

The output: a nine-variant enum for training set identity, a pure function mapping player rating to CPU difficulty tier, a bundled asset loader with format validation, split recorders for the two writable data stores, a file rename migration for existing player data, and thirteen new tests covering tier boundaries, loader round-trips, and error paths. All five hundred and fifteen existing tests continued to pass.
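The tier mapping is worth sketching because its purity is what makes the boundary tests cheap. This is not the shipped code: the real enum has nine variants and the real thresholds came out of the panel debate; the names and cutoffs below are placeholders.

```rust
// Hypothetical shapes standing in for the shipped nine-variant enum and the
// debated thresholds.
#[derive(Debug, PartialEq, Clone, Copy)]
enum CpuDifficulty {
    Easy,
    Medium,
    Hard,
}

/// Pure function from player rating to CPU difficulty tier. No I/O, no
/// state: each boundary test is a single assertion.
fn cpu_difficulty(rating: i32) -> CpuDifficulty {
    match rating {
        r if r < 900 => CpuDifficulty::Easy,
        r if r < 1200 => CpuDifficulty::Medium,
        _ => CpuDifficulty::Hard,
    }
}
```

A function of this shape is trivially testable at every boundary, which is presumably why thirteen of the new tests could target tier edges directly.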

Then the reviewer ran.

The reviewer is a separate agent that reads the spec, reads the diff, and traces user flows through the changed code. On its first pass, it found three issues. The onboarding and tutorial modes were not loading their dedicated data sets — the executor had wired the new loader for matches and training but missed two mode-entry paths. The fallback notification was using a raw string instead of the text storage system. And the CPU difficulty data was not being released from memory when the player exited a match.

Three real issues, not style nits. The executor fixed all three. The reviewer ran again, confirmed the fixes, and approved.

No retries. No human intervention between the topic document and the approved implementation. The planning phase ran to convergence in three rounds. The executor implemented in one pass. The reviewer caught real issues on the first iteration and confirmed fixes on the second. The total pipeline — plan, build, review — ran to completion autonomously.

What This Means

This is not a story about a feature. The feature — splitting training data and adding difficulty tiers — is ordinary game development work. Any competent developer could implement it in a day.

The story is about what happened around the feature. A knowledge engine that told the chairman which specialists to summon. A planning phase where seven domain experts debated trade-offs that a single implementer would have decided silently. A specification that carried the rationale for every design choice, not just the requirements. An executor that parallelized independent work streams. A reviewer that caught integration gaps the executor missed. And a finalize phase — not shown here, but running after every task — that captured the new knowledge back into the engine for the next feature.

Each of those pieces was described in a previous post. Hybrid search. Persistent inference. Reasoning capture. Native tooling. This post is the proof that they compose.

  • For engineering leaders: The interesting metric is not speed. It is the ratio of silent decisions to explicit ones. A single developer making five design choices leaves no trace of alternatives considered. A panel of specialists debating those same choices leaves a searchable record that compounds across every future task touching the same systems.
  • For technical decision-makers: The reviewer catching three real integration issues — not linting violations, not style preferences, but actual missing code paths — is the safety rail that makes autonomous execution viable. The pipeline is not unattended because it is careless. It is unattended because it checks its own work.
  • For anyone evaluating AI-assisted development: The context engine is the quiet multiplier. Every prior post in this series built one piece of it. This post showed all six pieces — search, inference, reasoning capture, native tooling, multi-agent planning, and autonomous execution — operating on a single feature from start to finish. The knowledge base that made this possible was itself built by previous agents doing previous work. The system feeds itself.

This is the sixth and final post in the context engine series. The first five described how to build a knowledge engine for AI agents. This one described what it looks like when the engine runs.