Lessons · April 7, 2026 · 10 min read

I Built Guardrails for AI Coding Agents — Here's What I Learned

Scott Nichols

Director @ Microsoft

Cover image: Blacksmiths forging iron guardrail plates around a glowing AI brain sphere, each plate etched with security, testing, and architecture symbols

Plan Forge started as a personal frustration. I kept watching AI agents produce code that looked great in a demo and fell apart in production. Not because the models were bad — they're genuinely impressive — but because they had no structural constraints. No architectural standards. No scope discipline. No memory of what they decided yesterday.

So I built guardrails. 18 instruction files that auto-load based on context. 19 specialized reviewer agents. 9 tech stack presets. A 7-step pipeline that converts rough ideas into hardened execution contracts.

Here are the seven lessons that cost the most to learn.

Lesson 1: Agents Don't Drift Maliciously — They Drift Because Nobody Said Not To

The number one misconception is that AI agents "go rogue." They don't. They do exactly what you let them do. If you say "build me a login page," you'll get a login page — plus a password reset flow, an admin panel, a user profile system, and refactored database migrations. The agent isn't being creative. It's being thorough with zero scope constraints.

The fix was embarrassingly simple: define what shouldn't be built. Every hardened plan in Plan Forge has a Forbidden Actions section. "Do NOT add features not in this spec. Do NOT refactor existing code outside the touched files. Do NOT change the database schema beyond what's specified." Explicit prohibitions cut scope drift by an order of magnitude.
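For illustration, a Forbidden Actions section might read like this — hypothetical wording, since real generated plans vary per feature:

```markdown
## Forbidden Actions

- Do NOT add features not in this spec.
- Do NOT refactor existing code outside the touched files.
- Do NOT change the database schema beyond what's specified.
- Do NOT add new dependencies unless the plan lists them explicitly.
```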

The most powerful guardrail isn't "do this." It's "don't do that."

Lesson 2: Auto-Loading Beats Manual Attachment Every Time

Early versions required developers to manually attach instruction files to each chat session. Nobody did it. The guardrails existed but were never used — like safety goggles hanging on the wall.

The breakthrough was `applyTo` frontmatter. Each instruction file declares which file patterns it cares about:

  • `security.instructions.md` applies to `**/auth/**,**/middleware/**`
  • `database.instructions.md` applies to `**/repositories/**,**/migrations/**`
  • `testing.instructions.md` applies to `**/*.test.*,**/*.spec.*`

When you edit an auth file, security guardrails load automatically. No developer action required. Adoption went from ~20% ("whoever remembered") to 100% ("it just works").
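As a sketch, an instruction file with `applyTo` frontmatter looks roughly like this — the glob values and rules here are illustrative, not copied from a shipped file:

```markdown
---
applyTo: "**/auth/**,**/middleware/**"
---

# Security Guardrails

- Validate and sanitize all user input at the boundary.
- Never log credentials, tokens, or session identifiers.
- Parameterize every database query; no string concatenation.
```

Any edit to a file matching the glob pulls these rules into context automatically.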

Lesson 3: The Builder Must Never Review Its Own Work

This was the hardest lesson to internalize. In a single long chat session, the agent that wrote the code will always believe its code is correct. It has sunk-cost bias baked into its context window. It literally cannot see its own blind spots because those blind spots are in the same token sequence that produced the code.

Plan Forge mandates session isolation between execution and review. The builder works in Session 2. The reviewer works in Session 3 — fresh context, same guardrails, independent judgment. That isolation catches bugs, drift, and architectural violations at a rate that shocked me.

Think of it this way: would you let a developer merge their own PR without review? Then why would you let an AI agent do it?
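The structural point can be shown in a few lines of Python — a toy sketch, not Plan Forge's implementation, where a "session" is just a fresh message list seeded only with the shared guardrails:

```python
def new_session(guardrails: str) -> list[dict]:
    """Each session starts with ONLY the shared guardrails -- fresh context."""
    return [{"role": "system", "content": guardrails}]

guardrails = "Review against the hardened plan. Flag scope drift."

# Session 2: the builder writes the code.
builder = new_session(guardrails)
builder.append({"role": "user", "content": "Implement slice 1 of the plan."})
builder.append({"role": "assistant", "content": "<code for slice 1>"})

# Session 3: the reviewer receives the plan and the diff, but NONE of the
# builder's conversation -- its judgment is independent by construction.
reviewer = new_session(guardrails)
reviewer.append({"role": "user", "content": "Review this diff: <code for slice 1>"})

# The builder's reasoning never leaks into the reviewer's context.
assert all(m not in reviewer for m in builder[1:])
```

The reviewer can only judge what's on the page — the diff and the plan — which is exactly what a human reviewer sees on a PR.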

Lesson 4: Slice Boundaries Are the Only Real Validation Points

Testing "at the end" doesn't work. By the time you run tests after building 15 files, the failures cascade so badly that the agent spends more time debugging than it spent building. I've watched agents burn through entire context windows chasing regressions that compound across files.

The answer was slices. Every feature gets decomposed into 3-7 execution slices, each with its own build and test gate. The agent cannot proceed to slice N+1 until slice N passes. This means:

  • Failures are caught when they're small (1-3 files, not 15)
  • The agent fixes the problem with full context of what it just wrote
  • Green-to-green progression means you always have a safe rollback point
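The green-to-green gate can be sketched as a loop. This is a minimal illustration, not Plan Forge's code — `build`, `test`, and `fix` stand in for the agent's build, test, and repair steps:

```python
def execute_plan(slices, build, test, fix, max_retries=3):
    """Run slices in order; slice N+1 is unreachable until slice N goes green."""
    completed = []
    for spec in slices:
        files = build(spec)                 # agent writes the 1-3 files for this slice
        for attempt in range(max_retries):
            if test(files):                 # gate: tests must pass before moving on
                break
            files = fix(spec, files)        # repaired with full context of fresh code
        else:
            raise RuntimeError(f"slice {spec!r} never went green -- stop, don't cascade")
        completed.append(files)             # each green slice is a safe rollback point
    return completed
```

Because the loop raises instead of continuing past a red slice, failures stay local: the agent never debugs slice 7 while slices 3 through 6 are silently broken underneath it.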

Slice boundaries are non-negotiable. They're the reason Plan Forge features actually ship complete instead of "mostly done."

Lesson 5: One Guardrail File Per Concern — Not One Giant File

The first version of Plan Forge had a single `copilot-instructions.md` that was 2,000 lines long. It covered security, testing, architecture, database patterns, error handling, deployment — everything. And it was terrible.

Agents process long instruction files worse than short ones. Key rules get buried. Contradictions creep in. The agent cherry-picks what's convenient and ignores the rest.

Splitting into 18 focused files — each under 150 lines, each with a single concern — was transformational. The security file only talks about security. The testing file only talks about testing. Each one loads only when relevant. Compliance went up. Token usage went down. Everybody won.

Update (v2.18): We took this further with Temper Guards and Warning Signs in every instruction file. Temper Guards document the specific shortcuts agents take that still produce compiling code but erode quality — like "this is just a DTO, no logic to test" or "N+1 won't matter at our scale." Warning Signs list observable anti-patterns reviewers can grep for. Each file now teaches agents not just what to do, but why not to skip it.

Lesson 6: Tech Stack Presets Eliminate 80% of Customization

Every stack has different conventions. .NET uses PascalCase and xUnit. Python uses snake_case and pytest. TypeScript uses camelCase and Vitest. If your guardrails say "use PascalCase" to a Python developer, they'll immediately distrust the entire system.

Presets solved this. Run `setup.ps1 -Preset python` and you get Python-specific instruction files, agents, prompts, and skills — all pre-written with Python best practices. The developer never has to customize anything. It just works for their stack.

Nine presets ship today: .NET, TypeScript, Python, Java, Go, Swift, Rust, PHP, and Azure IaC. Plus a custom preset for anything else. Multi-preset support means a full-stack project can combine `-Preset typescript,azure-iac` and get the right guardrails for each layer.

Lesson 7: Enterprise Quality Is a Default, Not an Upgrade

The biggest mistake in AI-assisted development tooling is treating quality as optional. "Add tests later." "We'll refactor." "Security can wait." No. Every feature should ship with tests, proper error handling, input validation, and architectural compliance from the first commit.

Plan Forge makes this structural. The hardened plan includes test expectations per slice. The architecture guardrails load on every file change. The security guardrails load on every auth file. The testing guardrails load on every test file. There's no "opt in to quality" — quality is the default. You'd have to actively work around it to ship bad code.

And with Exit Proof (v2.19), every skill now ends with a verifiable checklist — not "it seems right" but "paste the test output, show the migration file, prove coverage didn't drop." Evidence over assumption.
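A hypothetical Exit Proof checklist might look like this — illustrative wording, not copied from a shipped skill:

```markdown
## Exit Proof

- [ ] Paste the full test runner output (all green, zero skipped).
- [ ] Show the migration file and the command that applied it.
- [ ] Paste the coverage summary; confirm it did not drop below the baseline.
- [ ] List every file touched and confirm each appears in the hardened plan.
```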

The best developer tools don't make quality easier. They make it unavoidable.

What's Next

These seven lessons shaped Plan Forge into what it is today — but the landscape keeps evolving. Quorum Mode (multi-model consensus) is already showing measurably better results than single-model execution. Auto-escalation routes failing slices to stronger models. The MCP server exposes 18 forge tools as native operations. And community extensions are adding domain-specific guardrails for industries from healthcare to finance.

The core insight hasn't changed: AI agents are as good as the constraints you give them. Give them none, and they'll improvise brilliantly — right up until they destroy everything they built. Give them structure, and they'll build things you're proud to ship.

Plan Forge is free and open source. MIT licensed. Try it on your next feature.