Appendix R

A Day in the Forge — Three Vignettes from Real Pipelines

Three short case studies from production runs, each drawn from a contemporary blog post and condensed to the parts that survive when the version numbers change. The vignettes run from the largest reframe (Vignette 1, the loop that never ends) through the most quantitative receipt (Vignette 2, the 99-vs-44 A/B test) to the most operational pattern (Vignette 3, the three-model quorum run).

Audience: Readers who want concrete worked examples before committing to the chapters. Especially useful for stakeholders evaluating Plan Forge for adoption.

How to use: Read in order, or skip to the vignette closest to your situation. Each one ends with a "Where to read more" pointer into the canonical chapter that owns the topic and a citation to the original blog post for the first-person account.

The three vignettes at a glance

Vignette	What it shows	Source post
1. The Loop That Never Ends	The full closed-loop audit of a real production Next.js site: a Node discovery crawler emitting structured JSON, a three-lane triage filter, the bug lane routed straight to the bug registry while the Crucible interviews the scope-ambiguous lane, Tempering re-auditing with the same harness that discovered, and the bug registry auto-smelting regressions back into the next pass, running unattended.	blog post
2. The .NET A/B Test — 99 vs 44	The same .NET 10 WebAPI built twice from an identical skeleton on the same machine, same afternoon, same Claude Opus 4.6 model. One run with Plan Forge guardrails, one with pure vibe coding. 99 vs 44 on structural quality (4.6× more tests, 6 vs 0 interfaces, 9 vs 0 DTOs), in less wall-clock time.	blog post
3. Quorum Mode in Practice	The same C# invoicing slice executed twice from one hardened plan: once with the default single-model worker, once with a three-model quorum. Both passed every gate and the independent reviewer. The quorum run cost $0.22 more, produced 20% more tests, extracted DRY helpers the single run inlined, used relative test dates that survive the calendar, and emitted modern .NET 8+ exception patterns.	blog post

All three vignettes preserve the pseudonyms used in the original blog posts. "TheProject" in Vignette 1 is a real production Next.js site the maintainer operates; the owner did not clear the real name for publication. Every metric is from the actual run.

Vignette 1 — The Loop That Never Ends

Source: "The Loop That Never Ends" · Subject: TheProject (production Next.js site) · What it demonstrates: the closed-loop architecture from Discovery to Tempering, running without a human in the loop after the first pass.

The setup

TheProject is a production Next.js site: marketing pages, a product catalog, a handful of interactive demos. Like most sites that grow organically, it had accumulated the usual rot: placeholder copy that never got replaced, stale /docs routes, console errors nobody noticed, href="#" links waiting to be wired up. The maintainer had two options: sit down with a checklist and grind through it, or wire the rot into Plan Forge's closed loop and let the loop close on itself.

The loop, drawn honestly

Plan Forge's seven-step pipeline reads as a straight line in the diagrams, but the production shape is circular. Four passes, with back-edges that matter as much as the forward ones:

Pass 1: Discovery harness. A reusable Node crawler (~200 lines, one file) walks the site and emits structured JSON: route, placeholders found, broken hrefs, console errors, redirect chains. Boring JSON. But it is structured boring JSON, the only kind the Crucible can turn into smelts.
Pass 2: the Crucible eats the JSON. A 30-line wrapper reads .forge/audits/dev-<ts>.json, groups findings by route and severity, and for each group calls forge_crucible_submit with a title, the raw evidence, and a priority derived from the severity bucket. The Crucible runs its usual interview; the hardener then emits a Phase-NN plan with a Scope Contract.
Pass 3: execute and Temper. Nothing exotic; the same pforge run-plan the project uses on itself. The interesting part is what happens after the last slice commits: Tempering re-runs the discovery harness against the newly-deployed preview URL. If the same JSON query that found the problem now returns empty, Tempering reports green. If not, the failures get written to the bug registry.
Pass 4: the loop closes. The bug registry auto-smelts. No human triage. The same pipeline that wrote the fix now catches the fix's own regressions. After Pass 1, the loop runs without the maintainer.
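The grouping step in Pass 2 can be sketched as follows. This is a minimal illustration in Python rather than the original Node wrapper. The record fields (`severity`, `kind`, `detail`) and the severity-to-priority mapping are assumptions made for the example, and the sketch stops at building the payloads instead of calling forge_crucible_submit, whose signature the source does not give.

```python
from collections import defaultdict

# Hypothetical mapping: the source only says priority is
# "derived from the severity bucket".
PRIORITY_BY_SEVERITY = {"high": 1, "medium": 2, "low": 3}


def group_findings(findings):
    """Group raw harness findings by (route, severity)."""
    groups = defaultdict(list)
    for finding in findings:
        groups[(finding["route"], finding["severity"])].append(finding)
    return groups


def build_submissions(findings):
    """Build one payload per group: a title, the raw evidence, a priority."""
    submissions = []
    for (route, severity), items in sorted(group_findings(findings).items()):
        submissions.append({
            "title": f"{route}: {len(items)} {severity} finding(s)",
            "evidence": items,
            "priority": PRIORITY_BY_SEVERITY[severity],
        })
    return submissions


# Illustrative records only; not data from the original audit.
sample = [
    {"route": "/pricing", "severity": "medium", "kind": "placeholder", "detail": "Coming soon"},
    {"route": "/pricing", "severity": "medium", "kind": "broken_href", "detail": "#"},
    {"route": "/checkout", "severity": "high", "kind": "console_error", "detail": "TypeError"},
]
submissions = build_submissions(sample)
```

Each payload carries the untouched evidence records, so whatever consumes it sees the same structured JSON the harness emitted.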

The key insight: the back-edges are the point. Discovery finds problems and funnels them into the Crucible. Tempering catches regressions and writes them to the bug registry, which auto-smelts them back into Discovery's next pass. The pipeline does not hit a "done" state; it hits a quiet state. The next deploy starts the loop again.

The mistake that almost sank the loop

The first version of the Crucible wrapper routed every finding through the Crucible. Console errors, 404s, auth redirects, placeholder regex hits, all of it became a proposed smelt for the Crucible to interview. The interview queue grew to 60+ items and half were noise the Crucible had no business thinking about.

The fix was a three-lane triage before the Crucible ever saw a finding:

Lane	What goes here	What happens
Bug lane	Findings with evidence and scope: broken links, console errors, missing assets.	Skip the Crucible entirely. These are not ideas, they are bugs. Route to the bug registry; let auto-smelt fix them in a single pass.
Crucible lane	Scope-ambiguous feature work the audit revealed: empty CTAs, "Coming soon" sections, half-built flows.	Submit to the Crucible. The Crucible interviews for scope, the hardener emits the plan, the Forge executes.
Noise lane	Auth-redirect 307s, 404s on test-data routes, false-positive regex hits.	Filter at the harness. Never reach the Crucible. Tune signal-to-noise at the source: a discovery harness that cries wolf on auth redirects teaches the Crucible to ignore it.

The bug lane runs first: fix the known defects, watch Tempering validate them, and prove the mechanics end-to-end. Only then does the feature (Crucible) lane open. If Round 1's bug lane fails, auto-smelt re-ingests and retries without the human. The loop eats its own mistakes before it ever touches the feature backlog. That ordering is what makes the feature lane safe to run unattended.
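The three-lane triage above might look roughly like this in code. It is a sketch under stated assumptions: the `kind`, `status`, `auth_redirect`, and `false_positive` fields and the `/test-data/` prefix are hypothetical, since the source names the lanes but not a schema. Unrecognised kinds are sent to the Crucible lane as scope-ambiguous, which is also an assumption.

```python
# Hypothetical finding kinds that count as bugs with evidence and scope.
BUG_KINDS = {"broken_link", "console_error", "missing_asset"}

# The bug lane runs first; the noise lane is dropped and never runs.
LANE_ORDER = ["bug", "crucible"]


def is_noise(finding):
    """Harness-level filter: auth redirects, test-data 404s, false positives."""
    status = finding.get("status")
    if status == 307 and finding.get("auth_redirect", False):
        return True
    if status == 404 and finding["route"].startswith("/test-data/"):
        return True
    return finding.get("false_positive", False)


def triage(finding):
    """Assign a finding to exactly one lane; noise is checked first."""
    if is_noise(finding):
        return "noise"
    if finding["kind"] in BUG_KINDS:
        return "bug"
    # Empty CTAs, "Coming soon" sections, and anything unrecognised.
    return "crucible"


def partition(findings):
    """Split findings into the three lanes."""
    lanes = {"bug": [], "crucible": [], "noise": []}
    for finding in findings:
        lanes[triage(finding)].append(finding)
    return lanes


# Illustrative records only.
sample = [
    {"route": "/docs/old", "kind": "broken_link", "status": 404},
    {"route": "/account", "kind": "redirect", "status": 307, "auth_redirect": True},
    {"route": "/pricing", "kind": "coming_soon"},
    {"route": "/test-data/42", "kind": "broken_link", "status": 404},
    {"route": "/checkout", "kind": "console_error"},
]
lanes = partition(sample)
```

Checking for noise before checking for bugs matters here: the 404 on the test-data route looks like a broken link, and only the ordering keeps it out of the bug registry.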

The outcome

Over two weeks, with no manual TODO list and no human in the loop after the initial wrapper, the system found 23 placeholders the maintainer did not know existed, 7 broken links from a migration the previous month, and a console error in the checkout flow that had been silently firing for weeks. The loop is still finding things, slower now, but steady.

What makes the loop work

Four conditions, each learned the hard way:

Structured evidence, not prose. The Crucible cannot smelt "the pricing page looks weird." It can smelt {"route": "/pricing", "placeholders": ["Coming soon", "TODO: price tiers"], "broken_hrefs": ["#"]}. The discovery harness exists to turn the first into the second.
Triage before the Crucible, not after. Three lanes (bug / Crucible / noise) at the wrapper, not inside the Crucible interview. This is the insight that took longest to learn.
Tempering must re-audit with the same tool that discovered. If discovery uses regex and Tempering uses eyeballs, the loop leaks. If both use the same harness, a fix is only done when the same JSON query that found it now returns empty.
Auto-smelt is opt-in but default-on. Turn it off per-project and the loop degrades into a pipeline, and pipelines end. The whole point is that this one does not.
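The third condition, that a fix is done only when the discovering query comes back empty, can be sketched as below. The function names and the shape of the audit records are assumptions for illustration; only the example record for /pricing comes from the text above.

```python
def query(audit, route, field):
    """The single query used both to discover a problem and to verify its fix."""
    for record in audit:
        if record["route"] == route:
            return record.get(field, [])
    return []


def temper(fixed, post_deploy_audit):
    """Re-run the discovering query for every claimed fix.

    Returns (green, regressions). Each regression keeps the remaining
    evidence so it can be written to a bug registry.
    """
    regressions = []
    for route, field in fixed:
        remaining = query(post_deploy_audit, route, field)
        if remaining:
            regressions.append({"route": route, "field": field, "evidence": remaining})
    return (not regressions, regressions)


# Hypothetical post-deploy audit: the placeholders on /pricing are gone,
# but the dead href is still there.
after = [{"route": "/pricing", "placeholders": [], "broken_hrefs": ["#"]}]
claimed_fixes = [("/pricing", "placeholders"), ("/pricing", "broken_hrefs")]
green, regressions = temper(claimed_fixes, after)
```

Because discovery and verification share `query`, there is no second definition of "fixed" that could drift away from the definition of "broken".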

Where to read more → Chapter 2 — How It Works for the seven-step pipeline in detail; Chapter 5 — Crucible (Idea Smelting) for the interview model; Chapter 22 — How the Shop Remembers for the bug registry and auto-smelt machinery.

Vignette 2 — The .NET A/B Test (99 vs 44)

Source: "The A/B Test: 99 vs 44 — Same App, Same Model, Same Time" · Subject: a .NET 10 WebAPI built twice · What it demonstrates: the structural-quality gap between Plan Forge and vibe coding when every other variable is held constant.

The setup

Both projects started from an identical .NET 10 WebAPI skeleton: the same git commit, the same empty solution. The requirements were identical: Clients CRUD → Projects CRUD → Invoice Engine with rate tiers, volume discounts, tax calculation, and banker's rounding. Both runs used Claude Opus 4.6. Same machine, same afternoon. The only variable was whether the AI had guardrails.

Run A: Plan Forge v2.22.1 installed, with guardrails, Temper Guards, instruction files, and the full pipeline.
Run B: a blank project and a prompt. Pure vibe coding.

The numbers

Figure R-1. The structural-quality gap visualised: same model, same time, different software shape. Head-to-head bar chart comparing Plan Forge against vibe coding across six structural quality metrics from the April 2026 .NET A/B test. Plan Forge: 60 tests, 6 interfaces, 9 DTOs, 4 typed exceptions, 79 CancellationToken references, 99 quality score. Vibe coding: 13 tests, 0 interfaces, 0 DTOs, 0 typed exceptions, 0 CancellationToken references, 44 quality score.

Metric	Plan Forge (A)	Vibe coding (B)	Delta
Duration	~7 min	~8 min	guardrails did not add overhead
Tests	60	13	4.6× more
Interfaces	6	0	vibe = 0
DTOs	9	0	vibe = 0
Typed exceptions	4	0	vibe = 0
Error middleware	ProblemDetails (RFC 7807)	none	vibe had no error contract
Banker's rounding	5 usages	0	requirement silently dropped by vibe
CancellationToken	79 refs	0	vibe = 0
.gitignore	present	missing	vibe committed `bin/` and `obj/`
Quality score (/100)	99	44	2.25× higher

What mattered — the software shape, not the line count

The Plan Forge run produced more code, and it produced the right code:

3-layer architecture (Controller → Service → Repository), not a flat 2-layer structure where controllers called EF Core directly.
Interfaces for every service and repository. Dependency injection works, mocking works, testing works.
DTOs at the API boundary. Mass-assignment protection, clean contracts, no entity leakage.
Four typed exceptions (NotFoundException, DuplicateException, ValidationException, BusinessRuleException) mapped via ProblemDetails (RFC 7807) to proper HTTP status codes.
Banker's rounding (MidpointRounding.ToEven) on every financial calculation. This is the requirement that was explicitly stated but silently dropped by the vibe run.
CancellationToken on every async method (79 references), enabling graceful shutdown and request cancellation.
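To see why the banker's rounding requirement matters, here is a small sketch of round-half-to-even, the behaviour .NET exposes as MidpointRounding.ToEven. It uses Python's `decimal` module purely for illustration; the amounts are made up and do not come from the A/B test.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_money(amount, mode=ROUND_HALF_EVEN):
    """Round to two decimal places; exact midpoints go to the even neighbour."""
    return amount.quantize(Decimal("0.01"), rounding=mode)


# Four amounts that sit exactly on a half-cent midpoint.
midpoints = [Decimal("2.125"), Decimal("2.135"), Decimal("2.145"), Decimal("2.155")]

even = [round_money(m) for m in midpoints]
up = [round_money(m, ROUND_HALF_UP) for m in midpoints]

print(even)     # 2.12, 2.14, 2.14, 2.16
print(up)       # 2.13, 2.14, 2.15, 2.16
print(sum(midpoints), sum(even), sum(up))  # 8.560 8.56 8.58
```

Half-up rounding pushes every midpoint the same way, so the total drifts upward by two cents over four line items. Half-to-even alternates direction, and in this example the rounded total matches the exact one.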

The vibe-coded version works. You can start it, call the endpoints, and get responses. It also has structural problems that block production deployment: 12 build errors on first attempt (the model removed the EF Core decimal precision configuration to make the build pass, silently violating the banker's rounding requirement), no interfaces (controllers cannot be unit-tested), entities exposed directly as API responses (change a column, break the API contract), and 111 build-output files included in the initial git commit because no .gitignore was generated.

The surprise — time was the same

The conventional wisdom is that structure slows you down. More rules, more process, more overhead. Skip the architecture, skip the tests, ship faster. The numbers tell a different story: Plan Forge produced 4.6× more tests and a 2.25× higher quality score in less wall-clock time (7 vs 8 minutes). The guardrails did not add overhead. They prevented the rework loop. The vibe run spent its extra minute fighting the EF Core build errors and applying a fix that sacrificed a requirement.

Guardrails do not slow you down. Rework slows you down. Guardrails prevent rework.

Where to read more → Chapter 1 — What Is Plan Forge? for the canonical evidence table; Chapter 4 — Writing Plans That Work for the guardrails that produced the structural quality; the original A/B-test blog post for the full per-metric narrative.

Vignette 3 — Quorum Mode in Practice

Source: "Quorum Mode: What Happens When 3 AI Models Review Each Other's Code" · Subject: the same C# invoicing slice, executed twice · What it demonstrates: the synthesis effect. Three models propose, the reviewer picks the cleanest approach, and quality compounds for cents on the dollar.

The setup

One feature, two executions, identical hardened plan:

Feature: Invoice Engine (rate tiers, volume discounts, tax calculation, banker's rounding).
Plan: same Scope Contract, same slices, same validation gates, same tech-stack preset.
Run A (Control): standard single-model execution on Claude Sonnet.
Run B (Quorum): three models in parallel (Claude Opus, GPT-5.3-Codex, Claude Sonnet) → reviewer synthesis → Claude Sonnet builds from the consensus plan.

Both runs passed every gate. Every slice built, every test passed, and the independent reviewer signed off on both. The interesting part is how they passed.

The numbers

Metric	Single (control)	Quorum (3-model)
Tests written	15	18 (+20%)
Helper extraction	Inline, repeated 3×	Reusable helpers, single source
Test dates	Hardcoded literals	Relative offsets
.NET exception pattern	Generic `ValidationException`	`ArgumentException.ThrowIfNullOrWhiteSpace` (.NET 8+)
Edge cases covered	Standard happy path	Voided-invoice regeneration, sequence races
Total cost	$0.62	$0.84 (+$0.22)
Total time	12 min	32 min (2.7×)

The four named patterns

The single-model and quorum runs do not differ in code volume; they differ in code shape. Four named patterns drive the gap:

DRY helper extraction. The single-model run inlined volume-discount math in three call sites with slight variations. The quorum run extracted IsWeekend(), CalculateVolumeDiscount(), and ApplyBankersRounding() as private static helpers because the synthesizer saw multiple proposals and picked the one that did not repeat itself.
Robust test dates. Single-model tests pinned dates to literal calendar days (new DateTime(2026, 3, 15)). Those tests fail when the dates pass and the business logic correctly refuses future invoices. Quorum tests used relative offsets (DateTime.Now.AddDays(-7)) that stay green forever.
Modern .NET patterns. Control run: throw new ValidationException("..."), functional but generic. Quorum run: ArgumentException.ThrowIfNullOrWhiteSpace(), a guard API available from .NET 8 (its sibling ThrowIfNullOrEmpty arrived in .NET 7). One model knew about it, and the reviewer picked it.
Edge-case coverage. The extra three tests in the quorum run were not padding; they covered voided invoice regeneration, invoice-number sequencing under concurrent access, and boundary conditions in volume-discount tiers. These are exactly the tests that would have caught production bugs.
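The robust-test-dates pattern is easier to see with the clock made explicit. The sketch below shows one plausible failure mode, and it is an assumption about the rule rather than the actual invoicing code: a test expects a pinned literal date to be rejected as future-dated, which holds only until the calendar catches up. The function names are hypothetical, and `today` is injected so that both outcomes can be checked deterministically.

```python
from datetime import date, timedelta


def is_future_dated(invoice_date, today):
    """Business rule under test: invoices dated after today are refused."""
    return invoice_date > today


# A literal calendar day, as in the single-model tests.
PINNED = date(2026, 3, 15)


def pinned_test_passes(today):
    """Test written with a literal: expects PINNED to be refused as future."""
    return is_future_dated(PINNED, today)


def relative_test_passes(today):
    """Test written with offsets: a week ahead is refused, a week back is not."""
    next_week = today + timedelta(days=7)
    last_week = today - timedelta(days=7)
    return is_future_dated(next_week, today) and not is_future_dated(last_week, today)


before = date(2026, 3, 1)   # the pinned date is still in the future
after = date(2026, 4, 1)    # the pinned date has passed

print(pinned_test_passes(before), pinned_test_passes(after))      # True False
print(relative_test_passes(before), relative_test_passes(after))  # True True
```

The business rule never changed between the two runs. Only the pinned test's assumption about "now" expired, which is why relative offsets keep the suite green.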

The economics

The quorum run cost $0.22 more than the control run ($0.84 vs $0.62), about 35% more, but still under a dollar total. For a feature that will be maintained for years, the differential is rounding error. The time delta was more significant: 32 minutes vs 12 minutes. The extra twenty minutes is the parallel dry-run analysis (three models thinking) plus the reviewer synthesis step. The actual build time was comparable.
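The figures in this section can be rechecked from the table with a few lines of arithmetic. The cost-per-extra-test number is a derived illustration, not a metric reported in the original post.

```python
control = {"cost": 0.62, "minutes": 12, "tests": 15}
quorum = {"cost": 0.84, "minutes": 32, "tests": 18}

cost_delta = round(quorum["cost"] - control["cost"], 2)         # 0.22 dollars
cost_pct = round(100 * cost_delta / control["cost"], 1)         # 35.5 percent
time_ratio = round(quorum["minutes"] / control["minutes"], 1)   # 2.7x
extra_minutes = quorum["minutes"] - control["minutes"]          # 20
extra_tests = quorum["tests"] - control["tests"]                # 3
test_pct = round(100 * extra_tests / control["tests"], 1)       # 20.0 percent

# Derived for illustration: what each additional test cost in dollars.
cost_per_extra_test = round(cost_delta / extra_tests, 3)        # 0.073
```

At roughly seven cents per additional edge-case test, the cost side of the trade is small; the 2.7× wall-clock time is the number that should drive the decision.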

For $0.22 more, you get 20% more tests, cleaner architecture, and modern patterns. That is the cheapest code review you will ever buy.

When to use it

Quorum mode is not for every slice. Running it on a simple CRUD endpoint that creates a database record is overkill. Running it on an auth flow, billing logic, or a database migration is worth every token. The default --quorum=auto mode scores each slice's complexity (1–10) using seven weighted signals (file scope count, cross-module dependencies, security keywords, database/migration keywords, gate count, task count, and historical failure rate), and only slices at or above the threshold (default 6) get the three-model treatment.
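A weighted score of this kind could be computed along the following lines. The weights, the normalisation of each signal to the range 0 to 1, and the mapping onto a 1–10 scale are all assumptions made for the sketch. The source names the seven signals and the default threshold of 6 but does not publish the formula.

```python
# Hypothetical weights (they sum to 1.0); the real values are not in the source.
WEIGHTS = {
    "file_scope_count": 0.15,
    "cross_module_dependencies": 0.15,
    "security_keywords": 0.20,
    "database_keywords": 0.15,
    "gate_count": 0.10,
    "task_count": 0.10,
    "historical_failure_rate": 0.15,
}
DEFAULT_THRESHOLD = 6


def complexity_score(signals):
    """Map seven signals, each normalised to 0..1, onto a 1..10 score."""
    weighted = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(signals.get(name, 0.0), 0.0), 1.0)  # clamp to 0..1
        weighted += weight * value
    return round(1 + 9 * weighted)


def uses_quorum(signals, threshold=DEFAULT_THRESHOLD):
    """Only slices at or above the threshold get the three-model treatment."""
    return complexity_score(signals) >= threshold


# Illustrative slices with made-up signal values.
simple_crud = {"file_scope_count": 0.1, "task_count": 0.2}
billing_migration = {
    "file_scope_count": 0.6,
    "cross_module_dependencies": 0.7,
    "security_keywords": 0.5,
    "database_keywords": 1.0,
    "gate_count": 0.6,
    "task_count": 0.5,
    "historical_failure_rate": 0.4,
}

print(complexity_score(simple_crud), uses_quorum(simple_crud))              # 1 False
print(complexity_score(billing_migration), uses_quorum(billing_migration))  # 7 True
```

Under these assumed weights, the simple CRUD slice stays on the single-model path while the billing migration crosses the threshold, which matches the guidance in the paragraph above.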

Where to read more → Advanced Execution — Quorum Quality Examples for the canonical side-by-side; the same chapter's Cost Optimization section for the auto-threshold details; the original quorum-mode blog post for the under-the-hood dispatch diagram.

What the three vignettes share

Read together, the three vignettes describe the same shape from three angles. Vignette 1 (the loop) is about making the pipeline survive its own output: Tempering re-auditing with the same tool that discovered, the bug registry auto-smelting regressions, the loop running unattended. Vignette 2 (99 vs 44) is about making the software survive its own future: interfaces, DTOs, typed exceptions, and cancellation, the structural quality that separates a prototype from production code. Vignette 3 (quorum) is about making the next slice survive the gap between what one model knows and what another does: the synthesis effect, paid for in cents, banked in code that does not need a second rewrite.

Three vignettes, three different surface areas, one underlying claim: a harness that survives its own output is the difference between a demo and a shop. The chapters this appendix cross-links explain the mechanisms; the blog posts behind the vignettes preserve the first-person account; the receipts above are the part that survives when the version numbers change.