Why We Built S.A.M.I. Around Multi-Agent Consensus
Single-model coding tools make a silent assumption: one AI is enough. We disagree. This post explains the engineering reasoning that led us to build a consensus-first orchestration system and what we learned along the way.
The S.A.M.I. Team
Engineering
Most AI coding assistants available today, including Copilot, Cursor, and Codeium, route your prompt to a single model and return the first answer it produces. This is fast, cheap, and works well for the majority of autocomplete tasks. But it has a structural flaw that becomes visible as soon as you ask something non-trivial.
Single-model tools inherit all the blind spots of their underlying model. Different frontier models miss different edge cases: one may struggle with concurrent code, another over-sanitises outputs in security-sensitive contexts, a third has its own systematic tendencies. No two models are wrong in the same way, and that difference matters.
The Case for Disagreement
In software engineering, code review exists precisely because the author of a function is the worst person to find its bugs. They already know what they intended — their brain autocorrects mistakes before they notice them. A second pair of eyes, especially one that does not share the author's mental model, catches what the author missed.
Multi-agent consensus applies this principle to AI: we run the same task through multiple models and force them to critique and verify each other's output before we synthesise a final answer. The value is not blind agreement, but structured disagreement under a shared checkpoint protocol.
This idea is not new in academia. Ensemble methods in machine learning have existed for decades, and multi-agent debate has been studied as a technique to improve reasoning (see Du et al., "Improving Factuality and Reasoning in Language Models through Multiagent Debate", 2023). What we built is a practical implementation of that idea on top of real production LLM APIs.
How the Orchestrator Works
The S.A.M.I. orchestrator runs server-side, parallelising work across all council members. When a task arrives it goes through seven stages:
- 01 Connector. An interactive clarification loop — if the task is ambiguous, the council asks up to five targeted questions before proceeding. All context is locked before later stages begin.
- 02 Strategist. All council members pressure-test the plan and contracts and vote on weak spots; the coordinating member synthesises the path forward.
- 03 Architect. System design and interface contracts are fixed. All members vote; the coordinating member records the binding architecture.
- 04 Maverick. Only one council member writes the binding code. Every other member critiques direction, sketches alternatives in prose, and files change requests — they do not write code at this stage.
- 05 Optimizer. All council members challenge the draft, propose variants, and measure tradeoffs. The coordinating member chooses the revision path.
- 06 Guardian. A full red-team and security audit. All models check the output against sources, repo context, and prior stage artifacts.
- 07 Debugger. Final validation before the result is delivered. Only a blocking vote can stop this stage.
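The seven stages above can be sketched as an ordered pipeline in which each stage sees the artifacts of the stages before it. This is a minimal illustration, not the actual orchestrator: the names Stage, StageResult, BlockedError, and run_pipeline are hypothetical, and treating a blocking vote at any stage as fatal is an assumption (the list above only states it for the Debugger stage).

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    CONNECTOR = 1
    STRATEGIST = 2
    ARCHITECT = 3
    MAVERICK = 4
    OPTIMIZER = 5
    GUARDIAN = 6
    DEBUGGER = 7


@dataclass
class StageResult:
    stage: Stage
    artifact: str
    # Names of council members that cast a blocking vote (hypothetical field).
    blocking_votes: list[str] = field(default_factory=list)


class BlockedError(Exception):
    """Raised when a council member casts a blocking vote."""


def run_pipeline(task, handlers):
    """Run the stages in definition order.

    `handlers` maps each Stage to a callable taking (task, prior_results)
    and returning a StageResult. Assumption: any blocking vote halts the run.
    """
    results = []
    for stage in Stage:  # Enum iteration follows definition order
        result = handlers[stage](task, results)
        if result.blocking_votes:
            voters = ", ".join(result.blocking_votes)
            raise BlockedError(f"{stage.name} blocked by {voters}")
        results.append(result)
    return results
```

Passing the accumulated results into every handler mirrors the idea that later stages, such as Guardian, check the output against prior stage artifacts.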
What We Got Wrong First
Our initial design ran all agents sequentially. The task goes to Model A first, the result goes to Model B for review, then to Model C for security, and so on. This was conceptually clean but painfully slow: a moderately complex feature could take four to five minutes before the user saw any output. Users noticed and complained.
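To make the latency cost concrete, here is a small asyncio sketch comparing a sequential hand-off chain with a concurrent fan-out. It is illustrative only: call_model is a hypothetical stand-in that sleeps instead of calling a real LLM API, and the delays are invented.

```python
import asyncio
import time


async def call_model(name: str, prompt: str, delay: float) -> str:
    """Hypothetical stand-in for an LLM API call; sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return f"{name}({prompt})"


async def run_sequential(agents, task: str) -> str:
    # Each agent receives the previous agent's output, so latencies add up.
    output = task
    for name, delay in agents:
        output = await call_model(name, output, delay)
    return output


async def run_concurrent(agents, task: str) -> list:
    # Every agent receives the original task at the same time, so total
    # latency is roughly that of the slowest agent.
    calls = (call_model(name, task, delay) for name, delay in agents)
    return await asyncio.gather(*calls)


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```

With three agents that each take the same time, the sequential chain takes about three times as long as the fan-out, which is the gap users felt in the first design.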
We moved to full parallelism in the next iteration and immediately hit a different problem: models producing structurally incompatible outputs. If two models solve the same problem with different class hierarchies, merging their outputs is not trivial. We introduced a schema contract at the decomposition stage — each subtask now includes a structured output specification that all agents must conform to. Divergences from the schema are flagged before the review stage, not after.
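A schema contract of this kind can be sketched as a simple conformance check that runs on each agent's structured output before review. The OutputSpec shape and the find_divergences helper below are hypothetical; the post does not describe the real contract format, so treat this as one possible shape under that assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputSpec:
    """Hypothetical contract attached to a subtask."""
    required: dict  # field name -> expected Python type


def find_divergences(spec: OutputSpec, output: dict) -> list:
    """Return readable contract violations; an empty list means the output conforms."""
    problems = []
    for name, expected in spec.required.items():
        if name not in output:
            problems.append(f"missing field: {name}")
        elif not isinstance(output[name], expected):
            actual = type(output[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Fields the contract does not know about are also flagged (sorted for stable output).
    for name in sorted(output.keys() - spec.required.keys()):
        problems.append(f"unexpected field: {name}")
    return problems
```

Running a check like this on every agent's output before the review step means incompatible structures are caught while they are still cheap to reject, rather than during the merge.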
We also discovered that some models perform badly when told explicitly that other models will review their work — they become overly verbose and add defensive disclaimers rather than writing tighter code. We removed any mention of the multi-agent context from individual agent system prompts. Each agent believes it is the only one working on the task. The review happens after the fact, invisibly.
When Consensus Is Not the Right Tool
We are not dogmatic about consensus. For fast, high-volume tasks — tab completion, variable renaming, docstring generation — consensus adds latency and cost without meaningful quality gain. The orchestrator applies consensus only when the task meets certain criteria: the output will be committed to a repository, the output involves security boundaries, or the user explicitly requests high-confidence mode.
For tasks below that threshold, S.A.M.I. routes directly to the single most capable model for that task type. The routing decision is made by the orchestrator and is not user-visible — you get the right level of scrutiny without having to think about it.
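The routing rule described above can be sketched as a small predicate over task attributes. The Task fields, the needs_consensus and route functions, and the model names in the usage note are all hypothetical; only the three criteria come from the text.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str  # e.g. "tab_completion" or "feature" (illustrative labels)
    will_be_committed: bool = False
    touches_security_boundary: bool = False
    high_confidence_requested: bool = False


def needs_consensus(task: Task) -> bool:
    """True if the task meets any of the three criteria named in the post."""
    return (
        task.will_be_committed
        or task.touches_security_boundary
        or task.high_confidence_requested
    )


def route(task: Task, best_model_for: dict) -> str:
    """Return "council" for consensus runs, otherwise a single model name.

    `best_model_for` maps task kinds to model names and must contain a
    "default" entry (an assumption of this sketch).
    """
    if needs_consensus(task):
        return "council"
    return best_model_for.get(task.kind, best_model_for["default"])
```

For example, with a hypothetical table such as {"tab_completion": "fast-model", "default": "general-model"}, a plain tab completion goes to "fast-model", while the same task with will_be_committed set goes to the council.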
Where We Are Going
The current orchestrator is a server-side, highly concurrent runtime. It works well and is the production path. A lower-level systems rewrite for the performance-critical hot path is a future workstream — the current runtime is not going anywhere soon.
If you want to try the consensus system, download the S.A.M.I. desktop app and create an account. The orchestrator is the engine behind every agent run on paid plans.