Chapter 1 · Act I: Smelt

What Is Plan Forge?

The AI-Native SDLC Forge Shop. One workshop, four stations, every phase of the lifecycle.

Plan Forge is the orchestration harness that sits on top of GitHub Copilot (and other AI coding tools). It does not replace your model or your IDE; it adds the SDLC layer GitHub deliberately leaves to the ecosystem: planning, validation gates, memory, cost control, and reviewer separation.

It is MIT-licensed because your SDLC is yours. Your institutional memory lives in OpenBrain, a user-owned service, because your accumulated decisions should not be trapped inside any one AI vendor.

[Figure: The GitHub stack with Plan Forge layered on top. Below: GitHub (the substrate): repositories, Actions, Copilot model, Issues, PRs. Above: Plan Forge (the harness): Smelt (intake), Forge (execute), Guard (post-deploy defense), Learn (memory). The harness sits on the substrate; it does not replace it.]

Figure 1-1. Plan Forge is a harness, not a model. It sits *on top of* the GitHub stack (repositories, Actions, the Copilot model, Issues, and PRs) and adds the SDLC layer GitHub deliberately leaves to the ecosystem. See Appendix H · GitHub stack alignment and Appendix I · Plan Forge on the GitHub stack for the surface-by-surface map.

The One-Line Answer

Plan Forge is a complete AI-native SDLC workshop. Instead of giving your AI agent a single code-generation step, it gives the agent a whole shop: four specialized stations (Smelt, Forge, Guard, Learn) connected by gates, telemetry, and persistent memory.

"A blacksmith without a shop is just a hammer in a hand."

Plan Forge in 5 terms. You'll see these everywhere; bookmark them now.

Plan: a Markdown file in docs/plans/ describing one feature: what to build, what files it can touch, what tests must pass.
Slice: one numbered step inside a plan. Plans are broken into 3–7 slices so the AI works in checkpointed chunks instead of one giant edit.
Scope contract: the section of the plan that lists exactly which files are in-scope vs out-of-scope vs forbidden. The orchestrator enforces it: edits outside scope are blocked.
Validation gate: a concrete shell command (e.g., dotnet test) that must pass before the next slice runs. Gates are how Plan Forge knows the AI didn't break anything.
Hardened plan: a plan that has gone through Step 2 of the pipeline (Plan Hardener), which adds the scope contract, validation gates, forbidden actions, and rollback steps. Plans the AI can execute autonomously must be hardened.

All five terms have full entries in the Glossary.
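To make the vocabulary concrete, here is a minimal Python sketch of how a plan, its slices, and its scope contract might be modeled. It is not Plan Forge's actual data model: the class names, field names, the `allows` method, and the use of glob patterns are all assumptions made for illustration, and the example plan is invented.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class ScopeContract:
    # Hypothetical representation: three buckets of glob patterns,
    # mirroring "in-scope vs out-of-scope vs forbidden" in the text.
    in_scope: list[str]
    out_of_scope: list[str] = field(default_factory=list)
    forbidden: list[str] = field(default_factory=list)

    def allows(self, path: str) -> bool:
        """Return True only if the path is in scope and not excluded."""
        excluded = self.forbidden + self.out_of_scope
        if any(fnmatchcase(path, pattern) for pattern in excluded):
            return False
        return any(fnmatchcase(path, pattern) for pattern in self.in_scope)


@dataclass
class Slice:
    number: int
    description: str
    gate: str  # shell command that must pass, e.g. "dotnet test"


@dataclass
class Plan:
    title: str
    scope: ScopeContract
    slices: list[Slice]

    def is_checkpointed(self) -> bool:
        # The text describes plans as broken into 3-7 slices.
        return 3 <= len(self.slices) <= 7


# Invented example plan, for illustration only.
plan = Plan(
    title="Add order export",
    scope=ScopeContract(
        in_scope=["src/*", "tests/*"],
        forbidden=["src/Migrations/*"],
    ),
    slices=[
        Slice(1, "Add export DTOs", "dotnet test"),
        Slice(2, "Add export service", "dotnet test"),
        Slice(3, "Wire up the endpoint", "dotnet test"),
    ],
)
```

In this sketch an edit to `src/Api/Orders.cs` would be allowed, while an edit to `src/Migrations/001.sql` (forbidden) or `README.md` (never declared in scope) would be rejected. Checking exclusions first means a forbidden pattern always wins over a broader in-scope pattern.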

The Four Stations

Every station handles one phase of the software lifecycle. Every station is AI-run and product-owner-supervised — you own spec, direction, and final acceptance; the shop owns build, review, supervision, defense, and learning. See The Virtual Engineering Team below for the role-by-role map.

Station	Phase	What runs here	What comes out
🪨 Smelt	Intake → scope contract	Specifier agent, hardening runbook, `/specify`, `/harden-plan`, Project Principles	A Scope Contract the Forge can execute without follow-up questions
🔨 Forge	Scope contract → shipped code	`pforge run-plan`, slice gates, quorum mode, auto-escalation, cost ledger	Green tests, green CI, green cost ledger, or an honest stop with a fix proposal
🛡️ Guard	Post-deploy defense (LiveGuard)	Secret scan, env drift, regression guard, incident triage, fix proposals	Pre-deploy block on severity ≥ high, post-slice drift advisory, triaged incidents
🧠 Learn	Memory & retrospectives	OpenBrain, bug registry, testbed findings, Health DNA, Forge Intelligence	Tomorrow's plan is colder, faster, and less wrong

🔗 Want the deep-dive tour? Each station has its own full walkthrough on the Shop Tour page. This chapter gives you the overview; Chapter 2 covers how the stations connect.

The Virtual Engineering Team

Plan Forge isn't "AI plus a code-completion plugin." It's a full enterprise engineering shop where every traditional role is filled by a specialized agent or guardrail, governed by 40 years of software engineering practice encoded into 17+ auto-loading instruction files and 20 specialized reviewers.

Traditional engineering role	Plan Forge equivalent
Product Owner (spec, direction, acceptance)	You — non-negotiable, non-replaceable
IC engineers (architecture, security, performance, DB, deploy, API, accessibility, multi-tenancy, CI/CD, observability, dependency, compliance)	20 specialized agents + 17 auto-loading guardrail files
Tech lead / staff engineer	Quorum mode (multi-model consensus) + auto-escalation on slice failure
Engineering manager	Forge-Master Observer + Auditor — agents supervising agents, not metaphorically, literally
QA team	Tempering harness + testbed + regression guard + slice gates
SRE / on-call	LiveGuard — secret scan, drift report, dep watch, env diff, incident capture, runbooks
Continuous improvement / retros	Audit loop + bug registry auto-smelt + Reflexion lessons + auto-skill promotion (loops that run unattended for weeks and learn from every pass)
Architecture review board	Independent Session-3 review gate (fresh AI session, full guardrail load)
Institutional knowledge / wiki	OpenBrain L3 memory with Hallmark provenance
Release manager	Shipper agent + release-checklist + version.instructions.md

Your three jobs as the human

Make a few technology choices once. Pick a preset (dotnet, typescript, python, etc.), declare Project Principles, set forbidden patterns. One-time, then locked.
Own the spec and the direction. Product Owner duties. Crucible interviews you; you decide what to build and why. The shop won't deviate from the Scope Contract you sign off on.
Run manual acceptance testing. Agents don't read minds. They implement what the spec says, not what you intended. A green pipeline (all tests pass, no drift, no secrets, no regressions, Forge-Master independently audited) proves the code is correct. Only you can decide whether it's what you actually wanted. That's Product Owner UAT, same as it has been for 40 years.

Three closed loops, not one. The Forge builds, Forge-Master supervises, and the Learn station feeds every finding back into the next plan. The audit loop ran for two weeks unattended on a real production Next.js site and surfaced 30+ defects the maintainer didn't know existed (see Day in the Forge). For a skeptical CTO who has been burned by AI demos, this is the part that distinguishes Plan Forge from "an AI that writes code": autonomous execution, autonomous supervision, and autonomous learning — three loops, not one.

The Problem This Solves

AI coding agents are powerful but directionless.

They generate code fast. But fast isn't the same as good. Without a full shop around them (scope contracts, slice gates, post-deploy guards, and institutional memory), AI-generated code tends to be untestable, insecure, architecturally inconsistent, and impossible to maintain at scale. That's fine for prototypes; it's not fine for production systems.

The 80/20 Wall — The Problem Plan Forge Solves

You've probably lived this pattern:

You fire up an AI agent (Copilot, Cursor, Claude, whatever) and describe the app you want. The first 80% is magic. Files appear, components wire up, the database schema materializes. You're shipping faster than you ever thought possible.

Then complexity creeps in. Auth flows interact with database queries. Middleware chains get long. The agent still works, but you notice it's making assumptions without asking: it picked a caching strategy you wouldn't have chosen, and it refactored code from three sessions ago that was working fine.

Then the wall. Every change breaks something else. Fix the auth bug, break the dashboard. Fix the dashboard, break the API response format. The agent is confidently producing code that compiles but doesn't work. You're debugging AI-generated code you don't fully understand, in an architecture you didn't fully choose.

The pattern everyone hits: prompt → hope → fix → re-prompt → hope harder.

The four-phase trajectory

Plotted as completion vs. confidence, the failure mode is consistent across teams and tools:

0 → 50% (greenfield rush): Empty repo, clear scope, every prompt produces working code. Confidence is high; the codebase has no constraints to violate yet.
50 → 80% (complexity creeps): The agent starts making undiscussed architectural decisions. Caching strategy, error-handling pattern, and schema shape are all chosen mid-stream. Most of it still works, but the codebase now has invisible commitments the agent doesn't track.
80% → the wall (every change breaks something): Each fix introduces a regression somewhere else. The agent's previous decisions become constraints on its current decisions, but it doesn't remember them. You spend more time debugging than building.
100% (maybe just start over): The codebase is structurally tangled in ways the agent can't unwind. Many teams quietly restart from scratch because that is cheaper than fixing the architectural debt.

The mechanism: architectural memory loss. AI agents forget why code was written a certain way, so they "improve" it, and break every caller. Without persistent decision memory and forced session boundaries, every long session becomes a fresh inventor that doesn't know what the previous inventor committed to. Why session isolation works →

This isn't a model problem. It's a shop problem. One hammer is not a workshop. When agents work from loose intent rather than a hardened Scope Contract, and when nothing watches what ships or remembers what failed, they do fine on greenfield builds but start thrashing once the codebase gets complex enough that every change has downstream consequences.

The fix is the full shop: Smelt before the agent writes a line of code, Forge the scope so it can't drift, Guard what ships, and Learn with a memory that carries decisions forward.

Vibe coding gets you a prototype. Plan Forge gets you a product.

Longer narrative version with the failure stories: The 80/20 Wall: Why AI Agents Break What They Build.

💡 Cost Model
The core pipeline (prompts, instructions, agents) is free: it works with your existing Copilot subscription. Automated execution (pforge run-plan) and quorum mode use your IDE's AI model, consuming premium requests. Direct API providers (xAI Grok, OpenAI) require API keys and are billed per-token. The Dashboard's Cost tab tracks every dollar.
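As a rough illustration of what per-token billing with per-slice accounting involves, the sketch below keeps a tiny ledger in Python. It is not the Dashboard's implementation: the `CostLedger` class, the provider names, and the rates are placeholders invented for the example, and the rates are not real provider prices.

```python
from dataclasses import dataclass, field


@dataclass
class CostLedger:
    # Placeholder rates in USD per million tokens (hypothetical values).
    usd_per_million_tokens: dict[str, int]
    # Accumulated cost per slice, kept in micro-dollars so the
    # arithmetic stays in integers.
    micro_usd_by_slice: dict[int, int] = field(default_factory=dict)

    def record(self, slice_number: int, provider: str, tokens: int) -> None:
        """Add the cost of one API call to the given slice."""
        # tokens / 1e6 * (USD per 1e6 tokens) in USD
        # equals tokens * rate when expressed in micro-USD.
        cost = tokens * self.usd_per_million_tokens[provider]
        previous = self.micro_usd_by_slice.get(slice_number, 0)
        self.micro_usd_by_slice[slice_number] = previous + cost

    def total_usd(self) -> float:
        return sum(self.micro_usd_by_slice.values()) / 1_000_000


ledger = CostLedger({"provider-a": 3, "provider-b": 10})
ledger.record(1, "provider-a", 12_000)  # 36,000 micro-USD
ledger.record(1, "provider-b", 5_000)   # 50,000 micro-USD
ledger.record(2, "provider-a", 4_000)   # 12,000 micro-USD
```

With these placeholder numbers, slice 1 accumulates 86,000 micro-USD and slice 2 accumulates 12,000, for a total of $0.098. Subscription-based premium requests would need a different unit; this sketch only covers the per-token case.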

What Happens Without the Shop

Vibe Coding (no shop)

✗ Prompt → hope → fix → re-prompt
✗ Agent picks architecture mid-stream
✗ Every session starts from zero
✗ Agent reviews its own work
✗ "It compiles" = "it's done"
✗ Secrets + CVEs ship to prod unnoticed

The Forge Shop (Plan Forge)

✓ Smelt: Scope contract locked before coding
✓ Forge: Slice gates (build + test) at every boundary
✓ Forge: Fresh session audits independently
✓ Guard: Secrets + drift + regressions caught pre-deploy
✓ Learn: Memory carries decisions across sessions
✓ Learn: Bug registry + testbed + health DNA feed the next plan

If you've managed human dev teams, you know guardrails aren't about distrust; they're about consistency. The same principle applies when your team members are AI models.

Without the shop, AI coding agents:

Silently expand scope: "I'll also add..." (you didn't ask for that)
Make undiscussed decisions: pick a database pattern without telling you
Skip validation: ship code that doesn't build or pass tests
Lose context: forget requirements halfway through long sessions
Never self-audit: the executor grades its own exam
Have no post-deploy defense: secrets, drift, and regressions land in prod unseen

These problems get worse the less technical your team is: you may not even notice the drift until it's too late.

Without the shop	With Plan Forge
Agent writes code that passes once, breaks in production	Code follows your architecture from the first line (Smelt)
30–50% of AI-generated code needs rework after review	Independent review catches drift before merge (Forge)
Agent re-discovers solved problems every session	Persistent memory loads prior decisions in seconds (Learn)
Secrets and CVEs slip into deploys	LiveGuard blocks pre-deploy on severity ≥ high (Guard)
Context window wasted on exploration and backtracking	Hardened plan tells the agent exactly what to build
"It works on my machine" shipped to staging	Validation gates pass at every slice boundary
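The "blocks pre-deploy on severity ≥ high" rule in the Guard row can be sketched as a simple threshold check. This is an illustration rather than LiveGuard's code: the `Severity` levels and the `should_block_deploy` function are assumed names, and the source only states that the block triggers at high severity or above.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; only "high" is named in the text.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def should_block_deploy(
    findings: list[Severity],
    threshold: Severity = Severity.HIGH,
) -> bool:
    """Block the deploy if any finding is at or above the threshold."""
    return any(finding >= threshold for finding in findings)
```

Under this sketch, a scan that reports only low and medium findings lets the deploy proceed, while a single high or critical finding stops it. Using an ordered enum keeps the "≥ high" comparison explicit instead of relying on string matching.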

What Plan Forge Does

Plan Forge is an AI-native SDLC workshop (four stations connected by gates, telemetry, and memory) that converts your rough ideas into shipped, defended, remembered software. It installs guardrail files, MCP tools, reviewer agents, and a live dashboard into your project so every AI edit happens inside the shop, not next to it.

The Blacksmith Analogy, Extended

A blacksmith doesn't hand raw iron to a customer. They heat it, hammer it, temper it, and, in a real shop, the master smith watches it ship, remembers which blades broke, and sharpens the process for next time.

Plan Forge does the same for your development plans:

Shop Stage	Station	What Happens
🔥 Heat: raw ore	Smelt	You describe what you want; the Specifier agent extracts a Scope Contract
🔨 Hammer: shape it	Forge	Plan broken into slices with validation gates; AI builds slice-by-slice
💧 Quench: check the edge	Forge	Fresh-session review audits for drift, completeness, quality
🛡️ Guard: patrol the floor	Guard	LiveGuard scans secrets, drift, regressions, CVEs pre- and post-deploy
🧠 Remember: sharpen the process	Learn	Every incident, fix, and review feeds OpenBrain memory + bug registry + Health DNA
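The Hammer stage (slices with validation gates, built slice by slice) amounts to a loop that runs each gate command and stops at the first failure. The Python sketch below shows that control flow only. It is not how `pforge run-plan` is implemented: `run_slices` is a hypothetical helper, and the two stand-in commands replace real gates such as `dotnet test` so the example runs anywhere Python does.

```python
import subprocess
import sys


def run_slices(gates: list[str]) -> int:
    """Run each slice's gate command in order.

    Returns the number of slices whose gate passed before the first
    failure (or the total number of slices if every gate passed).
    """
    passed = 0
    for command in gates:
        result = subprocess.run(command, shell=True, capture_output=True)
        if result.returncode != 0:
            break  # an honest stop: later slices never run
        passed += 1
    return passed


# Stand-in gate commands (assumptions for the example): one that
# exits 0 and one that exits 1.
PASSING_GATE = f'"{sys.executable}" -c "raise SystemExit(0)"'
FAILING_GATE = f'"{sys.executable}" -c "raise SystemExit(1)"'
```

Calling `run_slices([PASSING_GATE, FAILING_GATE, PASSING_GATE])` stops after the second slice, so the third slice never builds on a broken base. A real runner would also capture the gate's output for a fix proposal.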

Who This Is For

Solo Developers

You're using Copilot or Claude to build features, but you've noticed the AI drifts when sessions get long. You spend time re-explaining your patterns. Plan Forge gives you a repeatable pipeline that remembers your standards, validates at every step, and catches the mistakes you'd normally catch in code review, except there's no reviewer. You are the team.

Development Teams

Your team uses AI tools but everyone gets different quality results. Junior devs get code that works but violates your architecture. Senior devs spend review cycles catching AI-generated antipatterns. Plan Forge makes the architecture the default: instruction files load automatically, validation gates enforce build+test, and the reviewer-gate agent catches drift before anyone opens a PR.

Enterprise & Regulated Environments

You need audit trails, consistent architecture, and code that meets compliance standards. Plan Forge gives you phase-level tracking (DEPLOYMENT-ROADMAP.md), per-slice cost accounting, OTLP telemetry, and 19 independent reviewer agents, including compliance, security, and multi-tenancy auditors that run automatically. Every execution has a trace.

Plan Forge Is / Plan Forge Is Not

Positioning matters more than features when an entire category is in motion. The shortest answer is paired: what Plan Forge claims to be, and the closest things it deliberately is not.

Plan Forge is	Plan Forge is not
The orchestration harness that sits on top of GitHub Copilot (and other AI coding tools).	An AI model. Plan Forge works with whatever AI you already use: Copilot, Claude, Cursor, Codex, Gemini, Windsurf, or any tool that accepts text prompts.
The SDLC layer GitHub deliberately leaves to the ecosystem: planning, validation gates, memory, cost control, and reviewer separation.	A code generator. Plan Forge doesn't write your code; it tells the AI how to write it, then verifies the result.
Opinionated about software shape (interfaces, DTOs, typed exceptions, tests); see the 99-vs-44 evidence below.	Opinionated about your stack. Nine presets cover .NET, TypeScript, Python, Java, Go, Swift, Rust, PHP, and Azure IaC. Each installs stack-appropriate guardrails.
MIT-licensed because your SDLC is yours.	A managed cloud service or a process you rent. Plan Forge runs entirely inside your existing IDE, CLI, and repo.
Tied to your repo's source of truth via GitHub Issues, PRs, and Actions; Plan Forge writes to the artifacts you already audit.	A CI/CD system. It doesn't deploy your app. It validates that what's built matches what was planned. Your CI pipeline is a separate concern.
Designed so your institutional memory lives in OpenBrain, a user-owned service, because your accumulated decisions should not be trapped inside any one AI vendor.	A project manager. It doesn't assign tasks to humans or track sprints. It structures work for AI agents: slices, gates, scope contracts.

Evidence — A/B Test Results

The shop story is testable. The April 2026 .NET A/B test built the same WebAPI twice from an identical .NET 10 skeleton (same git commit baseline) using the same model (Claude Opus 4.6) on the same machine. One run used Plan Forge guardrails; the other used pure vibe coding. Wall-clock time was comparable: 7 minutes for Plan Forge, 8 minutes for vibe coding (the extra minute went to fighting build errors).

[Figure: Head-to-head bar chart comparing Plan Forge against vibe coding across six structural quality metrics from the April 2026 .NET A/B test. Plan Forge: 60 tests, 6 interfaces, 9 DTOs, 4 typed exceptions, 79 CancellationToken references, 99 quality score. Vibe coding: 13 tests, 0 interfaces, 0 DTOs, 0 typed exceptions, 0 CancellationToken references, 44 quality score. Vibe coding scored zero on four of six structural metrics.]

Figure 1-2. A/B test results: Plan Forge vs vibe coding across six structural quality metrics (.NET, April 2026).

Metric	Vibe coding	Plan Forge	Delta
Tests	13	60	4.6× more
Interfaces	0	6	vibe = 0
DTOs	0	9	vibe = 0
Typed exceptions	0	4	vibe = 0
CancellationToken references	0	79	vibe = 0
Quality score (/100)	44	99	2.25× higher
Build time	8 min	7 min	guardrails didn't add overhead
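The Delta column can be reproduced from the raw counts. The short Python check below does that arithmetic, using only the numbers in the table; the `ratio` helper is a name introduced here for illustration.

```python
# (vibe coding, Plan Forge) counts, copied from the table above.
results = {
    "Tests": (13, 60),
    "Interfaces": (0, 6),
    "DTOs": (0, 9),
    "Typed exceptions": (0, 4),
    "CancellationToken references": (0, 79),
    "Quality score (/100)": (44, 99),
}


def ratio(vibe: int, forge: int):
    """Return forge / vibe, or None when the vibe count is zero
    (the table reports those rows as "vibe = 0" instead of a ratio)."""
    if vibe == 0:
        return None
    return forge / vibe


deltas = {metric: ratio(vibe, forge) for metric, (vibe, forge) in results.items()}
zero_metrics = [metric for metric, value in deltas.items() if value is None]
```

Here `deltas["Tests"]` is 60 / 13 ≈ 4.6 and `deltas["Quality score (/100)"]` is 99 / 44 = 2.25, matching the table, and `zero_metrics` lists the four metrics on which the vibe run scored zero.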

What this measures. The differences are in structural quality: the presence or absence of interfaces, DTOs, typed exceptions, and cancellation support. These are not subjective stylistic choices; they are the patterns that make production code maintainable, testable, and safe to extend. Vibe coding scoring zero on four of six metrics is not noise; it is a different software shape.

The vibe run spent its extra minute fighting build errors caused by an EF Core InMemory misconfiguration that the model had to diagnose, backtrack, and fix at the cost of sacrificing a requirement (banker's rounding). That rework cycle is invisible in a demo; at scale it is the dominant cost.

Full A/B test write-up with code samples, methodology, and links to both repositories: The A/B Test: 99 vs 44 — Same App, Same Model, Same Time.

How to Read This Manual

This manual follows the four stations of the shop:

Act I: Smelt (Chapters 1–5): What Plan Forge is, how the shop works, installation, writing plans, and the Crucible for community-contributed ideas. Start here if you're new.
Act II: Forge (Chapters 6–15): Hands-on building: your first plan, the dashboard, CLI, customization, instructions & agents, MCP tools, extensions, multi-agent, advanced execution, troubleshooting.
Act III: Guard (Chapters 16–20): LiveGuard mental model, all 14 tools, and the dashboard, plus the Watcher (read-only tail of another project's run) and the Remote Bridge (phone-friendly approvals via Telegram, Slack, Discord, OpenClaw).
Act IV: Learn (Chapters 21–24): The Bug Registry, the Testbed, Health DNA, and the three-tier memory architecture: how the shop remembers.

Already installed and want to build something? Skip to Chapter 6: Your First Plan.

📄 Full reference: README on GitHub