A/B Test April 7, 2026 · 9 min read

Quorum Mode: What Happens When 3 AI Models Review Each Other's Code

Director @ Microsoft

Robed reviewer examining three merging code scrolls from blue, green, and amber anvils in a dark forge workshop

What if three AI models reviewed each other's plans before any code was written?

That's the idea behind Quorum Mode in Plan Forge. Instead of dispatching a coding task to a single model, you dispatch it to three in parallel. Each model produces a detailed implementation plan — without executing code. A reviewer agent then synthesizes the best elements from all three into a unified execution plan. The final builder uses that consensus plan instead of the raw instructions.

Think of it as a design review by three senior engineers before any code is written.

But does it actually produce better code? We ran an A/B test to find out.

The Test Setup

We built the same feature twice using identical hardened plan files:

Feature: Invoice Engine — rate tiers, volume discounts, tax calculation, banker's rounding
Plan: Same scope contract, same slices, same validation gates
Run A (Control): Standard single-model execution (Claude Sonnet)
Run B (Quorum): Three models in parallel (Claude Opus, GPT-5.3-Codex, Claude Sonnet) → reviewer synthesis → Claude Sonnet builds from consensus

Both runs used identical guardrails, the same tech stack preset, and the same build/test gates. The only variable was whether the builder worked from raw slice instructions (Control) or a synthesized consensus plan (Quorum).

The Results

Both runs passed all gates. Every slice built, every test passed, and the independent reviewer signed off on both. But the quality of the code was measurably different.

Metric	Control (Single)	Quorum (3-Model)
Tests written	15	18 (+20%)
Helper extraction	Inline code	DRY helpers (IsWeekend(), CalculateVolumeDiscount())
Test dates	Hardcoded literals	Relative offsets (don't break when dates pass)
Edge cases	Standard coverage	Voided invoice regeneration, invoice number sequencing
.NET patterns	Generic ValidationException	ArgumentException.ThrowIfNullOrWhiteSpace (modern)
Total cost	$0.62	$0.84 (+35%)
Total time	12 min	32 min (2.7x)

What the Numbers Mean

20% More Tests

The quorum run produced 18 tests vs 15. The extra tests weren't padding — they covered edge cases that the single model missed entirely: voided invoice regeneration, invoice number sequencing under concurrent access, and boundary conditions in volume discount tiers. These are exactly the tests that would have caught production bugs.

DRY Helpers vs. Inline Code

The single-model run inlined business logic — volume discount calculation appeared in three places with slight variations. The quorum run extracted IsWeekend(), CalculateVolumeDiscount(), and ApplyBankersRounding() as reusable helpers. Same behavior, but the quorum code is easier to maintain and debug.

This is the synthesis effect — when one model suggests inline code and another suggests extraction, the reviewer picks the cleaner approach.

Robust Test Dates

A subtle but telling difference. The single-model run used hardcoded date literals like new DateTime(2026, 3, 15) in tests. Those tests will break when the dates pass, because the business logic validates that invoice dates aren't in the future. The quorum run used relative offsets: DateTime.Now.AddDays(-7). Small thing, but it's the difference between tests that stay green forever and tests that start failing on April 16th.

Modern .NET Patterns

The control run used throw new ValidationException("...") — functional but generic. The quorum run used ArgumentException.ThrowIfNullOrWhiteSpace() — a .NET 7+ API that's more precise, generates better error messages, and is the current recommended pattern. Again, the synthesis effect: one model knew about the modern API, the reviewer picked it.

Cost vs. Quality

The quorum run cost $0.22 more ($0.84 vs $0.62) — about 35% extra in percentage terms, but still under a dollar total. For a feature that will be maintained for years, that's negligible. The time increase was more significant: 32 minutes vs 12 minutes. The extra 20 minutes is the parallel dry-run analysis (3 models thinking) plus the reviewer synthesis step. The actual build time was comparable.

For $0.22 more, you get 20% more tests, cleaner architecture, and modern patterns. That's the cheapest code review you'll ever buy.

When to Use It

Quorum Mode isn't for every slice. Running it on a simple CRUD endpoint that creates a database record is overkill. Running it on your auth flow, billing logic, or database migration is worth every token.

That's why we built --quorum=auto. It scores each slice's complexity (1-10) using 7 weighted signals:

File scope count
Cross-module dependencies
Security keywords
Database/migration keywords
Gate count
Task count
Historical failure rate

Only slices scoring at or above the threshold (default: 6) get the 3-model treatment. Everything else runs normally. You get quality where it matters without burning tokens on trivial work.

How It Works Under the Hood

// Quorum workflow per slice

executeSlice(slice)

│

├─ scoreComplexity() → 1-10 score (7 weighted signals)

│

├─ score < threshold → normal execution

└─ score ≥ threshold → quorumDispatch()

│

├─ Claude Opus 4.6 → dry-run plan ─┐

├─ GPT-5.3-Codex → dry-run plan ─┼─ Promise.all()

└─ Claude Sonnet → dry-run plan ─┘

│

quorumReview() ← synthesize best approach per file

│

spawnWorker(enhancedPrompt) ← execute with consensus

│

gate ✓

The architecture is straightforward:

Dispatch: The orchestrator sends the slice instructions to 3 models in parallel via their respective APIs
Dry-run analysis: Each model produces a detailed implementation plan — file-by-file, with specific patterns and edge cases — but does not write code
Synthesis: A reviewer agent receives all 3 analyses and produces a unified plan, picking the best approach per file/component
Build: The builder model executes from the consensus plan instead of the raw slice instructions

If one model is unavailable, the reviewer works with 2 analyses. If fewer than 2 respond, it falls back to normal single-model execution. Your pipeline never blocks on model availability.

The Takeaway

The quality difference between single-model and quorum isn't in correctness — both passed all gates. It's in craftsmanship. Code that's easier to maintain, debug, and extend. Tests that cover edge cases you wouldn't have thought to test. Patterns that match current best practices instead of what the training data averaged out to.

For critical business logic, the 3% cost overhead pays for itself the first time a production bug doesn't happen.

Quorum Mode is available in Plan Forge via pforge run-plan <plan> --quorum or --quorum=auto for selective application. Use --estimate --quorum to preview the cost before committing.

← Previous: Guardrails Lessons Next: Unified System →