Quorum Mode: What Happens When 3 AI Models Review Each Other's Code
Director @ Microsoft
What if three AI models reviewed each other's plans before any code was written?
That's the idea behind Quorum Mode in Plan Forge. Instead of dispatching a coding task to a single model, you dispatch it to three in parallel. Each model produces a detailed implementation plan — without executing code. A reviewer agent then synthesizes the best elements from all three into a unified execution plan. The final builder uses that consensus plan instead of the raw instructions.
Think of it as a design review by three senior engineers before any code is written.
But does it actually produce better code? We ran an A/B test to find out.
The Test Setup
We built the same feature twice using identical hardened plan files:
- Feature: Invoice Engine — rate tiers, volume discounts, tax calculation, banker's rounding
- Plan: Same scope contract, same slices, same validation gates
- Run A (Control): Standard single-model execution (Claude Sonnet)
- Run B (Quorum): Three models in parallel (Claude Opus, GPT-5.3-Codex, Claude Sonnet) → reviewer synthesis → Claude Sonnet builds from consensus
Both runs used identical guardrails, the same tech stack preset, and the same build/test gates. The only variable was whether the builder worked from raw slice instructions (Control) or a synthesized consensus plan (Quorum).
The Results
Both runs passed all gates. Every slice built, every test passed, and the independent reviewer signed off on both. But the quality of the code was measurably different.
| Metric | Control (Single) | Quorum (3-Model) |
|---|---|---|
| Tests written | 15 | 18 (+20%) |
| Helper extraction | Inline code | DRY helpers (IsWeekend(), CalculateVolumeDiscount()) |
| Test dates | Hardcoded literals | Relative offsets (don't break when dates pass) |
| Edge cases | Standard coverage | Voided invoice regeneration, invoice number sequencing |
| .NET patterns | Generic ValidationException | ArgumentException.ThrowIfNullOrWhiteSpace (modern) |
| Total cost | $0.62 | $0.84 (+35%) |
| Total time | 12 min | 32 min (2.7x) |
What the Numbers Mean
20% More Tests
The quorum run produced 18 tests vs 15. The extra tests weren't padding — they covered edge cases that the single model missed entirely: voided invoice regeneration, invoice number sequencing under concurrent access, and boundary conditions in volume discount tiers. These are exactly the tests that would have caught production bugs.
DRY Helpers vs. Inline Code
The single-model run inlined business logic — volume discount calculation appeared in three places with slight variations. The quorum run extracted IsWeekend(), CalculateVolumeDiscount(), and ApplyBankersRounding() as reusable helpers. Same behavior, but the quorum code is easier to maintain and debug.
This is the synthesis effect — when one model suggests inline code and another suggests extraction, the reviewer picks the cleaner approach.
Robust Test Dates
A subtle but telling difference. The single-model run used hardcoded date literals like new DateTime(2026, 3, 15) in tests. Those tests will break when the dates pass, because the business logic validates that invoice dates aren't in the future. The quorum run used relative offsets: DateTime.Now.AddDays(-7). Small thing, but it's the difference between tests that stay green forever and tests that start failing on April 16th.
Modern .NET Patterns
The control run used throw new ValidationException("...") — functional but generic. The quorum run used ArgumentException.ThrowIfNullOrWhiteSpace() — a .NET 7+ API that's more precise, generates better error messages, and is the current recommended pattern. Again, the synthesis effect: one model knew about the modern API, the reviewer picked it.
Cost vs. Quality
The quorum run cost $0.22 more ($0.84 vs $0.62) — about 35% extra in percentage terms, but still under a dollar total. For a feature that will be maintained for years, that's negligible. The time increase was more significant: 32 minutes vs 12 minutes. The extra 20 minutes is the parallel dry-run analysis (3 models thinking) plus the reviewer synthesis step. The actual build time was comparable.
For $0.22 more, you get 20% more tests, cleaner architecture, and modern patterns. That's the cheapest code review you'll ever buy.
When to Use It
Quorum Mode isn't for every slice. Running it on a simple CRUD endpoint that creates a database record is overkill. Running it on your auth flow, billing logic, or database migration is worth every token.
That's why we built --quorum=auto. It scores each slice's complexity (1-10) using 7 weighted signals:
- File scope count
- Cross-module dependencies
- Security keywords
- Database/migration keywords
- Gate count
- Task count
- Historical failure rate
Only slices scoring at or above the threshold (default: 6) get the 3-model treatment. Everything else runs normally. You get quality where it matters without burning tokens on trivial work.
How It Works Under the Hood
The architecture is straightforward:
- Dispatch: The orchestrator sends the slice instructions to 3 models in parallel via their respective APIs
- Dry-run analysis: Each model produces a detailed implementation plan — file-by-file, with specific patterns and edge cases — but does not write code
- Synthesis: A reviewer agent receives all 3 analyses and produces a unified plan, picking the best approach per file/component
- Build: The builder model executes from the consensus plan instead of the raw slice instructions
If one model is unavailable, the reviewer works with 2 analyses. If fewer than 2 respond, it falls back to normal single-model execution. Your pipeline never blocks on model availability.
The Takeaway
The quality difference between single-model and quorum isn't in correctness — both passed all gates. It's in craftsmanship. Code that's easier to maintain, debug, and extend. Tests that cover edge cases you wouldn't have thought to test. Patterns that match current best practices instead of what the training data averaged out to.
For critical business logic, the 3% cost overhead pays for itself the first time a production bug doesn't happen.
Quorum Mode is available in Plan Forge via pforge run-plan <plan> --quorum or --quorum=auto for selective application. Use --estimate --quorum to preview the cost before committing.