
About the Author

Scott Nichols

Director, Strategic Account Technology Strategist (Virtual CTO), Microsoft

Software & Digital Platforms · Boise, ID


Brand new here? Start with What Is Plan Forge for the 60-second overview, then come back.

About the Microsoft title. I mention it because being close to the source of these technologies (Copilot, Azure OpenAI, MCP) is part of why Plan Forge looks the way it does. I see how the primitives are built and where they break, and that shapes what gets built on top of them. Plan Forge itself is a personal project, not a Microsoft product. The code, the opinions, and any breakage are mine.

Why I Built This

Plan Forge came from frustration, my own.

I've spent my career as a software architect, first building enterprise systems, then helping teams at Microsoft build them on Azure. I know what good architecture looks like: clean layers, clear boundaries, every component with a purpose. Lasagna, not spaghetti.

When AI coding agents arrived, I was excited. Here was a tool that could generate code faster than any team I'd ever managed. But the excitement wore off fast. The agents were brilliant at greenfield work (scaffolding, boilerplate, CRUD endpoints), but they had no concept of architectural discipline. They made decisions I didn't ask for, expanded scope without warning, and produced code that compiled but couldn't be maintained.

Sound familiar? If you've hit the 80/20 wall (the point where AI-built code stops scaling), you know exactly what I mean.

I realized the problem wasn't the models. The models were capable. The problem was that nobody was giving them structure. No scope contracts. No validation gates. No separation between building and reviewing. We'd spent decades learning that human dev teams need guardrails, code reviews, and architectural governance, then handed AI agents a blank prompt and said "build me an app."

So I started writing those guardrails. First as instruction files I pasted into Copilot chats. Then as structured prompts. Then as a pipeline with sessions and validation gates. Then as a full framework with agents, skills, lifecycle hooks, an orchestrator, a dashboard, and cost tracking.

I built Plan Forge because I needed it. The same impulse that made me establish coding standards for human teams drove me to establish them for AI teams. The tools are different, but the principles are the same: clear scope, layered architecture, validation at boundaries, independent review, and no spaghetti code, ever.

If I'm being honest, Plan Forge also exists because of Spec Kit. That was the project that taught me the fix wasn't a better model. It was structure. Define what you want. Plan before you build. Stop letting the agent improvise. Plan Forge took that idea and pushed on it: scope contracts, auto-loading guardrails, isolated review sessions, multi-model consensus, then a whole runtime watch layer for what happens after the build leaves the shop. The two tools are still better together. See the Spec Kit interop chapter for the details.

Why a Forge?

People ask why the forge metaphor. Why not "Plan AI" or "Spec Pipeline" or some other clean-tech name? Because I'm a huge advocate of software as craft, and the trades got there first.

I'm a metal and wood worker in my own home workshop. There's something about working with your hands, hammering, planing, joining, fitting, that teaches discipline you cannot shortcut. The apprentice learns under a journeyman, the journeyman learns under a master, and at every step the work has to pass inspection before it leaves the shop. Skip a step and the piece fails. Lie about a measurement and someone gets hurt downstream. That's a culture of care that modern software development has mostly forgotten, and AI agents (left alone with a blank prompt) forget it even faster than humans do.

The forge is the oldest expression of that culture. Fire, hammer, anvil, water, repeat. Every great piece of metalwork came out of a disciplined process with explicit stages, and the smith had a name for every step. I wanted Plan Forge to feel the same way. Every concept in the framework has a real-world craft analog. A plan is a work order. A reviewer agent is a quality inspector. The Crucible is where raw material gets melted down and reshaped. LiveGuard is the warranty card after the piece leaves the shop. The four-station shop layout is literally the floor plan of a working forge, divided by function.

Naming things this way isn't decoration; it's a forcing function. If I can't find a real-world equivalent for an abstraction, that's usually a sign the abstraction isn't doing real work. The metaphor catches things that look like software but aren't actually building anything. Software is a young craft, but it doesn't have to be a careless one. Plan Forge is my attempt to bring the old traditions (apprentice to journeyman to master, hammer and anvil, fit and finish) into the place where AI agents and humans build things together.

Moments From the Forge

People sometimes ask which specific failure made me start writing guardrails. Honestly, it wasn't one. It was the same failure on loop. A few of them stuck hard enough to change the design.

The 2,000-line file

My first attempt at fixing any of this wasn't a framework. It was a single copilot-instructions.md that ballooned to about 2,000 lines: security, testing, architecture, deployment, everything I could think of crammed into one document. It was terrible. The agent cherry-picked, ignored half of it, and treated rules buried after line 1,500 as optional suggestions.

But it was also the first time I watched an AI consistently produce an interface before a concrete class. For one beautiful moment, somebody had told it what good looked like. The model didn't get smarter. It got direction. That was the hypothesis I've been refining ever since, and the reason today's instruction files are 80 to 200 lines each, auto-loaded by file pattern, one concern per file.
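To make "auto-loaded by file pattern, one concern per file" concrete, here is a minimal sketch of the idea. It is not Plan Forge's loader: the pattern table, the instruction file names, and the `instructions_for` helper are all hypothetical, chosen only to show how a path can select the small set of rules that apply to it.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping of glob patterns to instruction files.
# Each file covers a single concern; none of these names come from Plan Forge.
INSTRUCTION_FILES = {
    "*.sql": "database.instructions.md",
    "tests/*": "testing.instructions.md",
    "src/api/*": "api-security.instructions.md",
}


def instructions_for(path: str) -> list[str]:
    """Return the instruction files whose pattern matches the given path."""
    return sorted(
        name
        for pattern, name in INSTRUCTION_FILES.items()
        if fnmatchcase(path, pattern)
    )


if __name__ == "__main__":
    for example in ("src/api/users.py", "tests/schema.sql", "README.md"):
        print(example, "->", instructions_for(example))
```

The point of the design is that the agent only ever sees the handful of rules relevant to the file it is touching, rather than one 2,000-line document it has to prioritize on its own.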

The demo with a database

Through 2025 I kept watching the same pattern in client demos. An agent would build a CRUD app from a single prompt. Five minutes in, the room would gasp. Endpoints, a UI, real data flowing. Five days later, that same app couldn't survive a second feature being added. No interfaces. No DTOs. Errors swallowed by catch (Exception). Tests that only covered the happy path. No cancellation tokens. No consideration for financial precision in code that was literally adding up money.

What we kept calling "software" was a demo with a database glued underneath. That's the 80/20 wall before anyone had named it: AI gets you to 80% in 20% of the time, and then the remaining 20% (the architecture, the tests, the error handling, the security) takes the other 80% of the effort to bolt on, while the AI-generated foundation fights you every step of the way.
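The financial-precision gap in those demos is easy to reproduce. The sketch below is a generic Python illustration, not code from any of those projects: summing money as binary floats drifts, while summing it as `Decimal` values stays exact.

```python
from decimal import Decimal

# Two line items, stored the way a quick demo usually stores them: as floats.
prices_float = [0.10, 0.20]

# The same line items stored as exact decimal values.
prices_decimal = [Decimal("0.10"), Decimal("0.20")]

float_total = sum(prices_float)
decimal_total = sum(prices_decimal, Decimal("0"))

# Binary floating point cannot represent 0.10 or 0.20 exactly,
# so the float total is not equal to 0.3.
print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.30"))  # True
```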

The login page that grew an admin panel

I asked an agent for a login page once. I got a login page, a password reset flow, a user profile screen, a half-built admin panel, and a database migration that touched four tables I never mentioned. The agent wasn't being creative. It was being thorough with zero scope constraints.

That's the day "Forbidden Actions" became a required section in every Plan Forge plan. Explicit prohibitions like "do NOT add features outside this spec, do NOT refactor untouched files, do NOT change the schema beyond what's specified" cut scope drift by an order of magnitude. The most powerful guardrail isn't "do this." It's "don't do that."
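One way to picture how explicit prohibitions can be enforced is a check that compares the files an agent changed against the plan's scope. The sketch below is illustrative only: `ScopeContract`, its fields, and `scope_violations` are hypothetical names, and this is not how Plan Forge implements Forbidden Actions.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class ScopeContract:
    """Hypothetical scope contract: what a plan may touch and what it must not."""
    in_scope: list[str]                                   # glob patterns the plan may change
    forbidden: list[str] = field(default_factory=list)    # glob patterns that are always off limits


def scope_violations(contract: ScopeContract, changed_files: list[str]) -> list[tuple[str, str]]:
    """Return (path, reason) for every changed file that breaks the contract."""
    violations = []
    for path in changed_files:
        if any(fnmatchcase(path, pattern) for pattern in contract.forbidden):
            violations.append((path, "forbidden"))
        elif not any(fnmatchcase(path, pattern) for pattern in contract.in_scope):
            violations.append((path, "out of scope"))
    return violations


if __name__ == "__main__":
    # The login-page request from the story above, expressed as a contract.
    contract = ScopeContract(in_scope=["src/auth/login*"], forbidden=["migrations/*"])
    changed = [
        "src/auth/login_page.py",
        "src/admin/panel.py",
        "migrations/0042_add_tables.sql",
    ]
    for path, reason in scope_violations(contract, changed):
        print(f"{reason}: {path}")
```

Checking the forbidden list first means an explicit "don't" always wins, even if a broad in-scope pattern would otherwise allow the file.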

The reviewer with no memory

The hardest lesson to internalize was that an agent in a long session will always believe its own code is correct. It has sunk-cost bias baked into the context window. It literally cannot see its own blind spots because those blind spots are sitting in the same token sequence that produced the code.

The first time I ran the same review prompt in a fresh session (same guardrails, no memory of the shortcuts the builder had considered and rejected), it caught fifteen issues the original session swore weren't there. Session isolation between builder and reviewer stopped being a nice idea and became a non-negotiable. It's why Plan Forge runs four sessions instead of one, why Session 3 is always a fresh reviewer, and why I trust that reviewer's output more than I trust my own first read.

The customer-reported broken link

Late in 2025 a customer reported a broken link on a page I'd shipped weeks earlier with an AI agent. I opened the file. Sure enough, a <a href="#"> placeholder the agent had left as scaffolding and nobody had grepped for. Then I went looking for siblings. There were twenty-three of them across the site. Plus a "Coming soon" on the pricing page. Plus a TODO in the FAQ. The build had been "green" the entire time because nothing in our pipeline was actually looking for those things in the deployed artifact.

That's where LiveGuard came from. The forge can't just stop caring at git push. Drift scoring, secret scanning, dep watch, regression guards: the build leaves the shop, but the watching shouldn't. The on-call runbooks grew out of the same incident.
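The kind of check that was missing from that pipeline can be sketched in a few lines. This is an assumption-laden illustration, not LiveGuard's implementation: the pattern table and the `find_placeholders` and `scan_site` helpers are hypothetical, and the patterns cover only the three leftovers from the incident above.

```python
import re
from pathlib import Path

# Hypothetical patterns for the leftovers described above.
PLACEHOLDER_PATTERNS = {
    "empty link": re.compile(r'href\s*=\s*["\']#["\']', re.IGNORECASE),
    "coming soon": re.compile(r"coming soon", re.IGNORECASE),
    "todo": re.compile(r"\bTODO\b"),
}


def find_placeholders(text: str) -> list[tuple[int, str]]:
    """Return (line number, label) for every placeholder found in the text."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PLACEHOLDER_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, label))
    return findings


def scan_site(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every built HTML file under root and report pages with placeholders."""
    report = {}
    for page in sorted(Path(root).rglob("*.html")):
        findings = find_placeholders(page.read_text(encoding="utf-8", errors="replace"))
        if findings:
            report[str(page)] = findings
    return report
```

Running something like `scan_site` against the deployed output, rather than the source tree, is the important part: the build was green because nothing inspected the artifact customers actually saw.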

Background

My work at Microsoft focuses on Azure enterprise architecture, helping organizations design cloud systems that scale, stay secure, and remain maintainable over years. Before that, I built distributed systems, designed multi-tenant SaaS platforms, and ran engineering teams where architecture governance was a daily concern.

That background shapes Plan Forge in specific ways. Three threads from my day-job work show up directly in the framework:

Stamps pattern architecture: My open-source StampsPattern project brings Azure enterprise cell isolation to infrastructure-as-code. The same boundary-thinking shows up in every Plan Forge scope contract and in the enterprise reference architecture.
Multi-tenant isolation: Years of building SaaS platforms taught me that "works for one tenant" is not the same as "works for all tenants." Plan Forge's multi-tenancy reviewer agent comes from real production incidents.
Guardrails as culture: Every great team I've worked on had non-negotiable standards. The bad ones always thought they were "agile" enough to skip them. The guardrails aren't about distrust; they're about consistency. The instruction files exist so the same rules apply whether your team member is a junior dev, a senior engineer, or an AI model.

Where I Am With It Today

I use Plan Forge daily. It builds itself (the version of the manual you're reading was generated by the version of the pipeline before it), it builds my homelab tooling, and it's the way I onboard every new client project. When I find a rough edge, I file the bug into my own queue and the next phase fixes it. That feedback loop, me eating my own dog food, is the only reason the framework has survived past v1.

A Passion Project, Built in the Open

One thing I want to be straight about: Plan Forge is a passion project. It's something I build nights and weekends because the problem genuinely bugs me, not because anyone is paying for a feature roadmap, not because there's a release-quality QA team behind it, and not because every corner is polished. It isn't perfect. It probably never will be. There will be rough edges in the CLI, the dashboard will surprise you sometimes, and the docs will lag behind the code more often than I'd like.

What it does have is a tight feedback loop with the people actually using it. Every meaningful improvement in the last year came out of someone trying it on a real project, hitting a wall, and telling me what broke. That covers auto-loading instruction files, the Forbidden Actions section, quorum mode, LiveGuard, the four-station shop layout, and the Crucible interview. Plan Forge grows by your input. That's not a marketing line; it's literally how the roadmap gets built. See the project history and lessons learned chapters for the receipts.

So if you try it and it stumbles, please tell me. File an issue. Open a PR. Comment on a blog post. Build an extension for a niche your team cares about. The best version of Plan Forge is the one shaped by the people who actually have to ship software with it. If you're stuck before you even get there, the troubleshooting chapter and failure-mode catalog usually have a head start on the answer.

If you're looking for something specific to dig into, the highest-impact contributions right now are: language presets beyond .NET, Node, and Python (Rust, Go, and Java are mapped but under-tested), notification extensions for Slack / Teams / PagerDuty / email, new entries for the failure-mode catalog, and reviewer agents for domains I don't work in daily (ML pipelines, mobile, embedded). If you ship something in one of those areas, I will absolutely talk about it.

GitHub: srnichols/plan-forge. Issues, discussions, PRs all welcome.
Blog: planforge.software/blog. Long-form posts on what's working, what isn't, and why.
Contributing: See the contributing guide on GitHub.
Extensions: Build and publish your own domain guardrails. See Chapter 12.
Customizing the framework: The customization chapter walks through tweaking presets, skills, and hooks.
License: MIT. Fork it, embed it, rewrite half of it. Whatever helps your team ship better software.

Plan Forge builds Plan Forge. Every feature in this framework was developed using the same pipeline it ships to users: 55+ phases, 7,500+ self-tests, v1.0 through v3.11 with zero manual rollbacks. If the pipeline can build itself without drift, it can build your project too. And when it can't, that's where you come in.