Chapter 15

Troubleshooting

"Something's wrong." Find the answer fast.

Every tool breaks eventually. The question is whether you have a diagnostic path or just a prayer. Start with pforge smith: it catches 80% of issues in 5 seconds.

Key terms: Glossary defines every Plan Forge term. If you see "scope contract," "validation gate," "slice," or "applyTo" and aren't sure what they mean, check there first.
Trying to do something, not fix something? This chapter answers "why is X broken?" If the question is "how do I X?", for example "how do I lower the cost of a run" or "how do I add a custom skill", jump to Appendix S — How Do I…? Task Index. It maps verbs to chapters.

Diagnostic Tools

Figure 15-1. Troubleshooting decision tree: start with pforge smith, then branch to execution, guardrails, dashboard, or setup issues.
| Tool | What It Checks | When to Use |
|---|---|---|
| pforge smith | Environment, VS Code config, setup health, version | First thing when anything seems off |
| pforge check | Setup file existence and validity | After setup or update |
| forge_diagnose({ file }) (MCP tool) | Multi-model bug investigation on a specific file | When a slice fails and you can't see why (invoke from Copilot Chat) |

What a healthy pforge smith looks like

If you've never run it, here's the shape of the output to compare against. Anything red or marked FAIL is a real problem; WARN usually means an optional extension or integration isn't installed.

Terminal, expected output
$ pforge smith

Plan Forge v3.12.0, forge diagnostic

Environment
  OS                Windows 10.0.22631  OK
  Shell             PowerShell 7.4.1    OK
  Node              v20.11.0            OK  (≥ 20 required)
  Git               2.42.0              OK  (≥ 2.30 required)

Forge layout
  .github/prompts            22 files   OK
  .github/instructions       22 files   OK
  .github/agents             14 files   OK
  .github/hooks               7 files   OK
  .github/skills             12 files   OK
  docs/plans                  5 files   OK
  .forge/config.json         present    OK

MCP server
  pforge-mcp/server.mjs      present    OK
  Port 3100                  free       OK
  Port 3101 (WS hub)         free       OK

Agent adapters
  copilot   .vscode/mcp.json  OK
  claude    .mcp.json         not installed   WARN (run setup with --agent claude)
  cursor    .cursor/mcp.json  not installed   WARN
  codex     .codex/mcp.json   not installed   WARN

Result: 15 OK, 3 WARN, 0 FAIL - forge is healthy
Read it from the bottom. The Result: line is the headline. If FAIL = 0 you're fine to keep working. WARNs are reminders, not blockers.
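
If you want a script to read the headline for you, the check can be sketched in a few lines of POSIX shell. This is only an illustration: it assumes the summary line keeps the "Result: N OK, N WARN, N FAIL" shape shown in the sample, and the function names smith_fail_count and smith_is_healthy are hypothetical helpers, not part of the pforge CLI.

```shell
# Sketch only: assumes the Result line format from the sample output above.

smith_fail_count() {
  # Reads smith output on stdin and prints the number that precedes "FAIL"
  # on the Result line. Prints nothing if no Result line is present.
  sed -n 's/^Result:.* \([0-9][0-9]*\) FAIL.*/\1/p'
}

smith_is_healthy() {
  # Succeeds only when a Result line was found and it reports zero failures.
  [ "$(smith_fail_count)" = "0" ]
}

# Usage (requires pforge on PATH):
#   pforge smith | smith_is_healthy || echo "fix the FAIL items before continuing"
```

WARN counts are deliberately ignored here, matching the advice that WARNs are reminders rather than blockers.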

Agent Isn't Following Guardrails

| Symptom | Cause | Fix |
|---|---|---|
| AI ignores coding standards | Instruction files not loading | Check applyTo pattern matches the file you're editing. Run pforge smith to verify file counts. |
| Wrong instructions loading | applyTo glob too broad | Narrow the pattern: use **/auth/** instead of ** |
| Guardrails load but AI ignores them | Context budget exceeded | Reduce copilot-instructions.md to <80 lines. Remove applyTo: '**' from non-essential files. |
| Project Principles not enforced | PROJECT-PRINCIPLES.md missing | Run the project-principles prompt. The instruction file activates only when this file exists. |
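
The two context-budget checks in the table can be approximated with standard tools. The sketch below is a rough aid, not a pforge feature: within_line_budget and catch_all_files are hypothetical helper names, the paths in the usage comments are assumptions about your layout, and the grep only matches the single-quoted form applyTo: '**' exactly as written in the table.

```shell
# Sketch only: helper names and example paths are illustrative assumptions.

within_line_budget() {
  # Succeeds when the file has fewer than max lines (default 80).
  file=$1
  max=${2:-80}
  lines=$(( $(wc -l < "$file") ))   # arithmetic expansion trims wc padding
  [ "$lines" -lt "$max" ]
}

catch_all_files() {
  # Prints the names of the given files that contain applyTo: '**'.
  grep -l "applyTo: '\*\*'" "$@" 2>/dev/null
}

# Usage:
#   within_line_budget .github/copilot-instructions.md || echo "trim the file"
#   catch_all_files .github/instructions/*
```

Note that wc -l counts newline characters, so a file whose last line has no trailing newline is reported one line short.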

Plan Execution Fails

| Symptom | Cause | Fix |
|---|---|---|
| Gate fails with build errors | Code doesn't compile | Fix the build error, then pforge run-plan --resume-from N |
| Gate fails, tests regress | New code broke existing tests | Fix the regression. Check if scope contract is too broad. |
| Slice times out | Context window exhausted or model overloaded | Split the slice into smaller chunks. Try a different --model. |
| Model returns error | API key invalid or rate limited | Check XAI_API_KEY / OPENAI_API_KEY env vars. Wait for rate limit reset. |
| Scope violation detected | AI touched forbidden files | The PreToolUse hook should catch this. If not, tighten the Scope Contract. |
| Escalation exhausted | All models in chain failed | Review the slice; it may be too complex. Break into sub-slices or simplify gates. |
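
For the "Model returns error" row, a quick way to rule out a missing key is to list which of the two variables named above are unset or empty. This is a minimal sketch: missing_api_keys is a hypothetical helper, and it says nothing about whether a key that is set is actually valid or rate limited.

```shell
# Sketch only: checks presence, not validity, of the keys named in the table.

missing_api_keys() {
  # Prints the name of each variable that is unset or empty, one per line.
  for name in XAI_API_KEY OPENAI_API_KEY; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Usage:
#   missing=$(missing_api_keys)
#   [ -z "$missing" ] || echo "not set: $missing"
```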

Dashboard Won't Load

| Symptom | Cause | Fix |
|---|---|---|
| Connection refused on :3100 | Server not running | node pforge-mcp/server.mjs |
| Port already in use | Another process on 3100 | node pforge-mcp/server.mjs --port 4100 or kill the conflicting process |
| Blank page loads | Missing node_modules | cd pforge-mcp && npm install |
| WebSocket disconnects | Firewall or proxy blocking :3101 | Allow port 3101, or set WS_PORT env var |
| No data in Runs/Cost tabs | No execution history yet | Run a plan first: pforge run-plan |

Setup Failed

| Symptom | Cause | Fix |
|---|---|---|
| "Preset not found" | Typo in preset name | Valid presets: dotnet, typescript, python, java, go, swift, rust, php, azure-iac |
| Permission denied | Read-only directory or no git access | Check file permissions. Run from a writable directory. |
| Existing files conflict | Previous setup exists | Use -Force flag to overwrite, or pforge update for selective updates |
| Wrong files installed | Incorrect preset for your stack | Re-run: .\setup.ps1 -Preset <correct-preset> -Force |
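
If you wrap setup in your own script, a typo in the preset name can be caught before setup runs. The sketch below hard-codes the preset list from the table, so it will drift if presets are added later; is_valid_preset is a hypothetical helper, not something the setup script provides.

```shell
# Sketch only: the list mirrors the "Valid presets" row above.

is_valid_preset() {
  # Succeeds when the argument is one of the documented preset names.
  case "$1" in
    dotnet|typescript|python|java|go|swift|rust|php|azure-iac) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   is_valid_preset "$preset" || echo "unknown preset: $preset"
```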

Costs Are Too High

| Strategy | Savings | How |
|---|---|---|
| Use cheaper execution model | 50–70% | Set modelRouting.execute to a smaller model |
| Reserve expensive model for review | 30–50% | modelRouting.review: "claude-opus-4.6" |
| Raise quorum threshold | 20–40% | --quorum-threshold 8 (fewer slices trigger consensus, see scoring rubric) |
| Reduce context per slice | 10–20% | Use targeted Context: lists (see Chapter 4) |
| Preview before running | N/A | pforge run-plan --estimate or forge_estimate_quorum (compares all four modes) |

Grok Image Generation Crashes Session

xAI Grok Aurora returns JPEG bytes regardless of requested format. If raw bytes with wrong MIME type enter the conversation history, the session becomes unrecoverable.

Current mitigations: The MCP tool returns text-only responses (file path + metadata, never raw base64). The generateImage() function detects actual format via magic bytes and converts using sharp. Sessions should be safe, but if you encounter the MIME mismatch error, start a fresh session.
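
To see what magic-byte detection means in practice, you can inspect a generated file yourself. The sketch below is an independent illustration of the technique using od, not the code inside generateImage(); detect_image_format is a hypothetical helper and it recognizes only the JPEG and PNG signatures.

```shell
# Sketch only: identifies a file by its leading bytes, ignoring the extension.

detect_image_format() {
  # Hex-dump the first four bytes, e.g. "ffd8ffe0" for a typical JPEG.
  magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
  case "$magic" in
    ffd8ff*)  echo jpeg ;;      # JPEG files start with FF D8 FF
    89504e47) echo png ;;       # PNG files start with 89 'P' 'N' 'G'
    *)        echo unknown ;;
  esac
}

# Usage:
#   detect_image_format art/cover.png   # prints "jpeg" if the bytes are JPEG
```

A file named .png that reports jpeg is the mismatch described above.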

Safe workflow: Use .jpg extensions (matches Grok's native output), generate art in dedicated sessions, or use the REST API: POST /api/image/generate.

Common Error Messages

Looking for the contract, not the fix? Every exit code, MCP error code, and REST status Plan Forge emits is documented in Appendix X — Errors & Exit Codes. This table maps symptom → fix; the appendix maps code → meaning.
| Error | Cause | Fix |
|---|---|---|
| No .forge.json found | Not in a Plan Forge project | Run pforge init or setup.ps1 |
| templateVersion mismatch | Framework files outdated | pforge update |
| No API key configured | Missing env var for image/analysis | Set XAI_API_KEY or OPENAI_API_KEY |
| Plan parsing failed | Malformed plan file | Check for missing ## Execution Slices section or broken markdown |
| Gate command failed (exit 1) | Build or test failure | Fix the code, then --resume-from N |
| DRIFT DETECTED | Forbidden file modified | Revert the forbidden change, re-run the slice |
| CRITICAL_FIELDS_MISSING (v2.82.1) | Crucible finalize blocked: missing build-command, test-command, scope, gates, forbidden-actions, or rollback | Call forge_crucible_preview for criticalGaps[], then continue the interview |
| PLAN_ALREADY_EXISTS (v2.82.1) | Crucible finalize refuses to overwrite hand-authored docs/plans/Phase-NN.md | Read both files (existing plan + .crucible-draft.md), then re-finalize with overwrite: true if you really mean it |
| ASK_QUESTION_MISMATCH (v2.82.1) | Client passed a stale questionId to forge_crucible_ask | Re-fetch state via forge_crucible_preview, retry with the current question id |
| QUORUM_ALL_FAILED (v2.78) | All quorum models timed out (60s each) or errored | Check API keys / network; retry. Consider --quorum=speed if flagship models are unavailable. See the Multi-agent quorum reference. |
| NO_REASONING_MODEL | Forge-Master has no model configured and no API key found | gh auth login for zero-key path, or set ANTHROPIC_API_KEY / OPENAI_API_KEY / XAI_API_KEY, or set forgeMaster.reasoningModel |
| Subprocess STATUS_CONTROL_C_EXIT (0xC000013A) (v2.81) | Worker process was killed by signal mid-slice | Slice is now correctly marked failed (not silently passed). Check statusReason, then --resume-from N |
| slice-orphan-warning event (v2.82.1) | Failed slice's worker deliverables were staged but not committed | See .forge/runs/<runId>/orphans-slice-<N>.json for copy-paste recovery commands |
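
For "Plan parsing failed", the first thing to rule out is the missing section heading. A minimal sketch, assuming the heading appears at the start of a line exactly as "## Execution Slices"; plan_has_slices is a hypothetical helper and it does not detect other kinds of broken markdown.

```shell
# Sketch only: checks for the section heading named in the table, nothing more.

plan_has_slices() {
  # Succeeds when the plan file contains a line starting with the heading.
  grep -q '^## Execution Slices' "$1"
}

# Usage (docs/plans is the plan directory shown in the smith output):
#   for plan in docs/plans/*.md; do
#     plan_has_slices "$plan" || echo "no Execution Slices section: $plan"
#   done
```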

Crucible Finalize Fails v2.82.1+

The Crucible critical-fields gate refuses to draft TBD-laden plans. If finalize keeps returning CRITICAL_FIELDS_MISSING, the recovery path is:

  1. Call forge_crucible_preview { id }. It returns criticalGaps: [{ field, reason, hint }, …].
  2. For each gap, the next call to forge_crucible_ask queues a question that targets that field.
  3. Build and test command questions auto-fill suggestions via inferRepoCommands, so usually you just confirm.
  4. Once all gaps are resolved, finalize succeeds.

If the gate is blocking on something you genuinely don't need (rare, the gate exists for good reason), the escape hatch is --manual-import on a hand-authored plan. See Chapter 5 — Enforcement Gate.

Forge-Master Misroutes Intent

Forge-Master classifies prompts into operational, troubleshoot, build, advisory, or offtopic. Misroutes happen most often when:

  • Stage 1 keyword scorer didn't match: check the via field in the response. If "keyword", try a more keyword-rich phrasing ("status of …", "why did … fail", "should we …").
  • Embedding cache is cold: new project, no prior classifications. Hit rate climbs after 10–20 turns. Check GET /api/forge-master/cache-stats.
  • Router model is too small: the default grok-3-mini is fine for most prompts, but quirky vocabulary may need grok-4 or gpt-4o-mini. Override via forgeMaster.routerModel in .forge.json.
  • Quorum advisory not firing on "auto": it requires lane=advisory + autoEscalated=true + fromTier=high + confidence≥medium. Use "always" to remove gating during testing.

See Forge-Master chapter — Troubleshooting for the full list.
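
The "auto" gating condition from the last bullet is easier to reason about when written out as a predicate. This sketch only restates the four conditions listed above; it assumes confidence levels are ordered low < medium < high, and the function names are hypothetical, not Forge-Master internals.

```shell
# Sketch only: restates the documented gate; the low/medium/high ordering
# of confidence values is an assumption.

confidence_rank() {
  case "$1" in
    low) echo 1 ;;
    medium) echo 2 ;;
    high) echo 3 ;;
    *) echo 0 ;;
  esac
}

advisory_auto_fires() {
  # Arguments: lane, autoEscalated, fromTier, confidence.
  # Succeeds only when all four documented conditions hold.
  lane=$1 auto_escalated=$2 from_tier=$3 confidence=$4
  [ "$lane" = "advisory" ] &&
  [ "$auto_escalated" = "true" ] &&
  [ "$from_tier" = "high" ] &&
  [ "$(confidence_rank "$confidence")" -ge 2 ]
}

# Usage:
#   advisory_auto_fires advisory true high medium && echo "quorum advisory fires"
```

If any one of the four values is off, the advisory stays silent on "auto", which is why "always" is the quicker setting while testing.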

Host-Aware Routing Confusion v2.82+

Host-aware routing detects which IDE / CLI host you're running Plan Forge from (VS Code + Copilot, Claude Code, Cursor, Windsurf, Zed, bare terminal) so you don't silently double-pay against your non-Copilot subscription when calling gpt-* models. If you're seeing surprising routing behavior:

| Symptom | What's happening | Override |
|---|---|---|
| "My gpt-* calls cost more on Claude Code than VS Code" | Default auto mode prefers direct OpenAI API on non-Copilot hosts (honors your subscription) | Set routing.hostPreference: "gh-copilot" in .forge.json to force Copilot subscription billing |
| "Quorum dropped gpt-* from the run" | You're on a non-Copilot host AND OPENAI_API_KEY is unset AND routing.hostPreference is "drop" | Set the API key, or change preference to "auto" / "gh-copilot" |
| "Quorum pre-run summary table shows different billing per model" | Working as intended: the new table shows host + per-model billing surface so you can see spend distribution before dispatch | None; this is a feature, not a bug |

Errors & Exit Codes

If a script needs to react to a Plan Forge failure programmatically, branch on the exit code (CLI / orchestrator) or the named error code (MCP tools / REST). These are stable across releases; new failure modes get new codes rather than reusing existing ones.

| Layer | Returns | Branch on |
|---|---|---|
| pforge CLI | POSIX exit code | 0 success · 1 generic failure · 2 environment refusal (not in git repo, update-check failed, audit had no scanners) |
| pforge run-plan | Exit code + statusReason in JSON | 0=completed / completed-with-warnings · 1=failed / aborted. statusReason narrows it: gate-failed, drift-detected, quorum-all-failed, etc. |
| MCP tools (forge_*) | { ok, code, error } envelope | ok: false with a named code, e.g. NO_API_KEY, CRITICAL_FIELDS_MISSING, QUORUM_ALL_FAILED, PLAN_NOT_FOUND |
| REST (POST /api/…) | HTTP status + JSON body | 400 bad body · 404 missing · 409 state conflict (ERR_UPDATE_DURING_RUN) · 429 rate limited (use retryAfterMs) · 500 internal |
| OS subprocess (worker, gate) | Native exit code, surfaced via statusReason | 0xC000013A Windows Ctrl+C · 130/137/143 POSIX signals. Mapped to worker-signaled. |
Full contract: every exit code, every named error code, every error event, plus copy-paste Bash and PowerShell CI recipes, see Appendix X — Errors & Exit Codes.
Subsystem catalog: Appendix Z — Failure-Mode Catalog complements this chapter. Where troubleshooting is symptom-driven (you see a red output and look up what it means), Appendix Z is subsystem-organised — browse by gate, quorum, watcher, OpenBrain, snapshot, model-pool, or hub to see every known failure mode with its symptom, cause, and fix triple.
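
As a rough idea of what branching on these codes looks like in a CI script, here is a small sketch. It is not one of the Appendix X recipes: describe_pforge_exit and retry_after_seconds are hypothetical helpers, the exit-code meanings are copied from the pforge CLI row above, and the rounding of retryAfterMs up to whole seconds is a choice made here so the value can be passed to sleep.

```shell
# Sketch only: maps the documented CLI exit codes to labels and converts a
# REST retryAfterMs value to whole seconds.

describe_pforge_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "generic failure" ;;
    2) echo "environment refusal" ;;
    *) echo "unknown exit code $1" ;;
  esac
}

retry_after_seconds() {
  # Rounds milliseconds up to the next whole second.
  echo $(( ($1 + 999) / 1000 ))
}

# Usage (requires pforge on PATH):
#   pforge smith; code=$?
#   echo "pforge: $(describe_pforge_exit "$code")"
#   [ "$code" -eq 2 ] && echo "check that you are inside a git repo"
```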

Getting Help

📄 Full reference: FAQ, Multi-Agent Setup — GitHub Copilot