Origin Story · April 11, 2026 · 12 min read

From Impossible to 7 Minutes: A Year of Building AI Coding Guardrails

Scott Nichols

Director @ Microsoft

Cover image: a lone blacksmith lighting a forge for the first time, dawn light streaming through the window, blueprint scroll on the workbench.

A year ago, getting enterprise-grade code from an AI agent was nearly impossible. Not because the models weren't capable — they were. GPT-4 could write a service layer. Claude could architect a repository pattern. But the gap between what they could produce and what they actually produced in practice was enormous.

Today, Plan Forge produces a 99/100 quality application in 7 minutes. Same model. Same machine. No manual intervention.

This is the story of that journey — from a frustrated developer staring at yet another AI-generated codebase with zero interfaces, to a system that reliably produces enterprise-quality software faster than most developers can scaffold a project manually.

The Beginning (April 2025)

In the spring of 2025, AI coding assistants were everywhere. Copilot had gone mainstream. Claude and GPT-4 were writing entire features from prompts. The hype was real — and so was the disillusionment.

The models could generate code fast. Impressively fast. But the code they generated was consistently wrong in ways that mattered. Not syntactically wrong — it compiled. Not functionally wrong — the endpoints returned data. Wrong in the ways that separate a prototype from production: no separation of concerns, no interfaces, no DTOs, no error handling beyond catch (Exception), no tests beyond the happy path, and absolutely no consideration for financial precision or cancellation tokens.

The "demo magic" was intoxicating. You could show a stakeholder an AI building an entire CRUD application in five minutes and the room would gasp. But try to deploy that application — try to add a second feature, try to write a test, try to onboard a new developer — and the magic evaporated. What you had wasn't software. It was a demo with a database.

This was the 80/20 wall before anyone had named it. AI could get you 80% of the way to a working application in 20% of the time. But the remaining 20% — the architecture, the error handling, the tests, the security, the operational concerns — took the other 80% of the effort to bolt on after the fact. And by then, the AI-generated foundation was fighting you every step of the way.

The First Guardrail File

The first version of what would become Plan Forge was a single file: copilot-instructions.md. It was 2,000 lines of rules, conventions, patterns, and warnings crammed into one monolithic document. It was terrible. But it was also the first time I saw an AI agent consistently produce an interface before a concrete class.

I've written about the lessons learned from guardrails in detail, but the short version is: that first file proved the hypothesis. If you gave the model context about what good looks like, it would produce good code. Not because the model got smarter. Because it got direction.

The problem was that a 2,000-line file is a terrible way to deliver context. The models would cherry-pick sections, skip entire blocks, or treat rules from line 1,800 as optional suggestions. Worse, the file tried to cover everything — from git conventions to database patterns to API design — which meant every token of context was diluted by irrelevant instructions for the current task.

From Rules to a System (Summer 2025)

The evolution happened in stages, driven by a simple observation: agents don't read long files — they cherry-pick.

That insight was the architectural unlock. Instead of one giant file, what if we had focused instruction files that loaded contextually? Edit a controller? Load the API patterns file. Touch the database layer? Load the database conventions. Writing tests? Load the testing strategy.

By mid-summer, the single file had fractured into 18 specialized instruction files, each with frontmatter metadata that controlled when it loaded:

  • architecture-principles.instructions.md — the universal baseline (loads on every edit)
  • api-patterns.instructions.md — REST conventions, pagination, error responses
  • database.instructions.md — ORM patterns, migrations, connection management
  • testing.instructions.md — TDD workflow, test isolation, coverage expectations
  • security.instructions.md — input validation, secret management, CORS
  • errorhandling.instructions.md — exception hierarchy, ProblemDetails (RFC 7807)

Each file was 80–200 lines. Focused. Actionable. The model wasn't drowning in 2,000 lines of everything — it was getting exactly the 150 lines it needed for the task at hand.
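By way of illustration, a scoped instruction file might look like the sketch below. The applyTo frontmatter key is how VS Code scopes Copilot instruction files to file globs; the specific globs and rules here are hypothetical, not Plan Forge's actual contents:

```markdown
---
applyTo: "**/*Controller.cs,**/Api/**"
description: "REST conventions, pagination, error responses"
---

# API Patterns

- Every endpoint returns DTOs, never entities.
- Errors use ProblemDetails (RFC 7807).
- List endpoints are paginated; no unbounded queries.
```

Because the file only loads when a matching path is edited, every token of context stays relevant to the task at hand.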

But instruction files alone weren't enough. The model still took shortcuts when no one was watching. It would generate great service code with interfaces and DTOs, then turn around and skip tests because nothing forced it to write them. Manual checking was unsustainable — you'd spend as long reviewing the AI's output as you would writing the code yourself.

That's when validation gates entered the picture. Automated checks at every stage boundary: Did the model create interfaces? Are there DTOs at the API boundary? Do the tests exist and pass? Is CancellationToken propagated through async chains? These gates turned guidelines into contracts.
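A minimal sketch of what two such gates might look like, assuming a .NET-style layout where interfaces live next to their implementations. The file-naming heuristics are hypothetical; Plan Forge's real gates are more involved:

```python
from pathlib import Path

def gate_interfaces_exist(src_dir: str) -> list[str]:
    """Fail the gate if a concrete service has no matching interface file."""
    failures = []
    for svc in Path(src_dir).rglob("*Service.cs"):
        if svc.name.startswith("I") and svc.name[1:2].isupper():
            continue  # already an interface file, e.g. IOrderService.cs
        iface = svc.with_name("I" + svc.name)  # OrderService.cs -> IOrderService.cs
        if not iface.exists():
            failures.append(f"{svc.name}: no interface {iface.name}")
    return failures

def gate_dtos_at_boundary(src_dir: str) -> list[str]:
    """Fail if no DTO types exist at all -- a crude proxy for entity leakage."""
    dtos = list(Path(src_dir).rglob("*Dto.cs"))
    return [] if dtos else ["no *Dto.cs files found at the API boundary"]
```

An empty list means the gate passes; anything else halts the pipeline at that boundary.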

The Pipeline Takes Shape (Fall 2025)

By fall, the pieces had coalesced into a pipeline. Not a CI/CD pipeline — a cognitive pipeline. A sequence of steps that transformed a feature description into production code, with validation at every boundary.

The steps emerged organically:

  • Step 0: Specify — Define what and why before planning anything
  • Step 1: Preflight — Verify prerequisites, check existing patterns
  • Step 2: Harden — Transform the plan into an execution contract with scope boundaries
  • Step 3: Execute — Build slice by slice, validating at every gate
  • Step 4: Sweep — Eliminate TODOs, stubs, and placeholder code
  • Step 5: Review — Independent audit for drift and quality
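The steps above amount to a gate-bounded loop: do the step's work, check the gate, halt on failure. The step names come from the pipeline; the plan structure and gate functions below are illustrative stand-ins:

```python
# Step names from the pipeline above; everything else is hypothetical.
STEPS = ["specify", "preflight", "harden", "execute", "sweep", "review"]

def run_pipeline(plan: dict, gates: dict) -> str:
    """Run each step in order, validating at every boundary."""
    for step in STEPS:
        result = plan.get(step, "ok")   # stand-in for doing the step's work
        gate = gates.get(step)
        if gate and not gate(result):   # validation gate at the boundary
            return f"halted at {step}"  # stop; never let drift propagate
    return "shipped"
```

The point is the shape, not the code: no step's output reaches the next step without passing a check.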

But the most important insight wasn't the steps themselves — it was the isolation principle: the builder should never review its own work.

This led to the 4-session architecture that Plan Forge uses today:

Session 1 — Specify & Plan

Define the feature. Harden the plan into an execution contract.

Session 2 — Execute

Build slice by slice. Validate at gates. No self-review.

Session 3 — Review

Fresh context. Independent audit. Drift detection.

Session 4 — Ship

Commit, changelog, deploy. Clean handoff.

Four sessions, not one. Each with fresh context. The builder never reviews its own work. The reviewer has no memory of the shortcuts that were considered and rejected. This separation of concerns — applied not to code layers but to cognitive tasks — was the breakthrough that made consistent quality possible.

Going Multi-Model (Winter 2025–2026)

By winter, single-model execution was showing its limits. Claude was exceptional at architecture and nuance. GPT excelled at breadth and speed. Grok brought a different analytical lens entirely. Each model had blind spots — and those blind spots were consistent.

The question became: what if we stopped relying on one model's judgment and started treating AI code analysis the way we treat human code review — as a consensus process?

That's how quorum mode was born. Three models analyzing the same code slice independently, then synthesizing their findings into a unified report. The results were immediate: quorum analysis produces 20% more test recommendations than any single model alone. Each model catches issues the others miss. The overlap validates confidence; the differences reveal blind spots.
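The synthesis step can be sketched in a few lines. The model names and findings below are invented examples; the real report carries far more structure than a string per finding:

```python
from collections import Counter

def synthesize(findings_by_model: dict[str, set[str]]) -> dict:
    """Merge independent findings from several models into one report.

    Findings seen by two or more models become consensus items (higher
    confidence); findings seen by exactly one reveal that model's unique
    coverage of the others' blind spots.  Illustrative sketch only.
    """
    counts = Counter(f for findings in findings_by_model.values() for f in findings)
    return {
        "consensus": sorted(f for f, n in counts.items() if n >= 2),
        "unique": sorted(f for f, n in counts.items() if n == 1),
    }
```

Overlap validates confidence; the "unique" bucket is where the extra 20% of recommendations tends to come from.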

The Forge Gets an Anvil (Early 2026)

Through all of 2025, Plan Forge was essentially a collection of files. Powerful files — instruction files, prompt templates, plan documents — but files nonetheless. You installed them into a project, and they shaped how AI agents worked within that project. The system lived in the project's .github/ directory.

In early 2026, Plan Forge stopped being "files you install" and became "a system that runs."

The MCP server gave Plan Forge a programmatic API. Seventeen tools exposed as native operations — forge_run_plan, forge_analyze, forge_diagnose, forge_cost_report — accessible from any AI agent that speaks the Model Context Protocol. The CLI (pforge) gave humans direct access to the same operations.

The dashboard brought visibility. Live progress tracking during plan execution, cost aggregation across runs, session replay for auditing what the AI did and why. The autonomous orchestrator — a DAG-based execution engine with CLI worker spawning — made it possible to execute an entire hardened plan without human intervention, stopping only when a validation gate failed.
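At its core, a DAG-based execution engine is a topological walk that halts at the first failed gate. A minimal sketch using Python's standard-library graphlib — the task names and the execute callback are hypothetical; the real orchestrator spawns CLI workers instead:

```python
from graphlib import TopologicalSorter

def run_dag(tasks: dict[str, set[str]], execute) -> list[str]:
    """Run plan tasks in dependency order; halt when a gate fails.

    tasks maps each task name to the set of tasks it depends on.
    execute(name) returns True on pass, False for a failed gate.
    """
    done = []
    for name in TopologicalSorter(tasks).static_order():
        if not execute(name):
            break  # stop at the first failed validation gate
        done.append(name)
    return done
```

Because execution stops at the failing node, everything already built is intact and auditable when a human steps in.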

The Forge Timeline

v1.0 Summer 2025

Template + instruction files. The guardrail foundation — 18 specialized files, prompt templates, and the 4-session pipeline.

v2.0 January 2026

Autonomous orchestrator. DAG-based execution, CLI worker spawning, validation gates, token tracking. Plans run themselves.

v2.5 February 2026

Quorum mode. Multi-model consensus analysis — 3 models, independent findings, synthesized report. 20% more test recommendations.

v2.10 March 2026

OpenClaw bridge. Notifications across platforms — Telegram, Slack, Discord — so the forge can reach you when a gate fails.

v2.14 March 2026

Copilot platform integration. Native VS Code experience — skills, agents, lifecycle hooks, instruction auto-loading.

v2.18 April 2026

Temper Guards. Learned patterns from agent-skills analysis — the specific shortcuts AI agents take that produce compiling but architecturally broken code.

v2.22 April 2026

Power/speed presets, 3-provider quorum, cost tracking, image generation. The forge is fully lit.

Seven Agents, Nine Presets, One System

One of the hardest problems in AI-assisted development isn't the AI — it's the tooling fragmentation. A team of five might use VS Code with Copilot, Claude Code in the terminal, Cursor as a standalone editor, and Codex for CI automation. Each tool has its own configuration format. Each requires separate setup. Without a shared baseline, the same project gets different guardrails depending on which tool opens it.

The multi-agent adapter system solved this. One setup command generates native configuration for every supported AI tool — Copilot, Claude Code, Cursor, Codex, Gemini CLI, Windsurf, and Cline. Same guardrails, same instruction files, same quality baseline. Different formats, same rules.

Nine language presets — TypeScript, Python, .NET, Go, Rust, Java, PHP, Swift, and Azure IaC — provide stack-specific conventions layered on top of the universal architecture principles. A Python project gets pytest patterns and type-hint requirements. A .NET project gets xUnit conventions and CancellationToken rules. A Go project gets table-driven tests and error wrapping patterns. The principles are universal; the implementation is native to each stack.
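The layering itself is simple: a stack preset overrides or extends a universal baseline. The keys and rules below are invented for illustration, not Plan Forge's actual schema:

```python
# Hypothetical guardrail layering: universal principles plus
# stack-specific conventions.  All keys and values are illustrative.
UNIVERSAL = {"interfaces_required": True, "dtos_at_boundary": True}

PRESETS = {
    "python": {"test_framework": "pytest", "type_hints": "required"},
    "dotnet": {"test_framework": "xunit", "cancellation_tokens": "required"},
    "go": {"test_style": "table-driven", "errors": "wrapped"},
}

def resolve_guardrails(stack: str) -> dict:
    """Stack preset is layered on top of the universal baseline."""
    return {**UNIVERSAL, **PRESETS.get(stack, {})}
```

The universal rules always apply; only the implementation details change per stack.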

The Unified Vision

Plan Forge was never meant to stand alone. It solves the blueprinting problem — how to reliably produce quality code from AI agents. But quality code is only one piece of a larger system.

The unified vision is three interlocking systems:

  • Plan Forge (blueprint) — guardrails, instruction files, hardened plans, autonomous execution
  • OpenBrain (memory) — persistent, searchable memory that bridges sessions and enables agents to learn across conversations
  • OpenClaw (reach) — notifications and integrations that let the system reach out when it needs human attention

Blueprint + Memory + Reach. The forge shapes the metal, the brain remembers the metallurgy, the claw delivers the finished piece.

Today: 99 vs 44 in 7 Minutes

And so we arrive at the present. Full circle.

A year ago, getting enterprise-grade code from an AI agent required massive manual intervention. You'd generate, review, rewrite, regenerate, review again — an exhausting loop of steering the model back toward quality patterns it should have followed from the start.

Today, Plan Forge v2.22 produces a 99/100 quality application in 7 minutes. The same model, without guardrails, produces 44/100 in 8 minutes. Same model. Same requirements. Same machine. Same afternoon.

  • Without Guardrails: 44/100 · 13 tests · 0 interfaces · 0 DTOs
  • With Plan Forge: 99/100 · 60 tests · 6 interfaces · 9 DTOs

The Plan Forge run produced 60 tests, 6 interfaces, 9 DTOs, ProblemDetails error handling (RFC 7807), banker's rounding on every financial calculation, CancellationToken on every async method, and a proper .gitignore. The vibe-coded version had 12 build errors on first attempt, no interfaces, entities exposed as API responses, generic exception catching, and committed bin/ folders to git.

The difference isn't the model. It never was. The difference is context. Direction. Guardrails that encode what production software actually looks like — and validation gates that enforce it.

What I'd Tell My Past Self

The biggest lesson from this year is deceptively simple: the quality of AI-generated code is not a function of model capability — it's a function of the context you provide.

Every model improvement in the past year has made Plan Forge more effective, not less. Better models don't eliminate the need for guardrails — they extract more value from them. A smarter model with the right context about architecture, testing, and error handling produces dramatically better output than the same model with no context at all.

The guardrails aren't training wheels. They're blueprints. And the forge is just getting started.

The forge is lit. The metal is hot. Build something that lasts.