Failure-Mode Catalog
Common Plan Forge failure modes organized by layer. For each: symptom, diagnosis path, recovery action, and prevention. This appendix is the operator's companion to Appendix X — Errors & Exit Codes: Appendix X lists what the system says; Appendix Z lists what to do.
The forge_diagnose tool and the /health-check skill cover most cases automatically; this catalog is for when you need to understand why the automation suggests what it does.
Worker failures
FM1 — Token limit hit
Symptom: worker response truncated mid-sentence or mid-tool-call; error like max_tokens reached or HTTP 200 with finish_reason: length.
Diagnosis: check forge_watch_live for the slice's input + output token counts; compare to the model's context window. Most often the prompt grew beyond budget after a few file reads.
Recovery: split the slice. The scope was too broad. Re-run with a tighter file list. If splitting isn't practical, switch the slice's model to one with a larger context (Opus 1M, GPT-5.5).
Prevention: target 1–4 files per slice; use scope contracts; let auto quorum route bigger slices to larger-context models.
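A rough pre-flight check can flag a slice whose file list is likely to blow the budget before the worker runs. The sketch below is illustrative only: the four-characters-per-token ratio is a common rule of thumb rather than any model's real tokenizer, and the function names and numbers are hypothetical, not part of Plan Forge.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; real tokenizers differ by model and content."""
    return int(len(text) / chars_per_token) + 1


def fits_budget(file_texts, context_window, reserved_output, chars_per_token=4.0):
    """Return (fits, estimated_input_tokens) for a candidate slice file list."""
    used = sum(estimate_tokens(t, chars_per_token) for t in file_texts)
    return used + reserved_output <= context_window, used


# Hypothetical numbers: two files against a 1,000-token window,
# with 500 tokens reserved for the model's output.
ok, used = fits_budget(["x" * 400, "y" * 800], context_window=1000, reserved_output=500)
print(ok, used)
```

If the estimate lands near the limit, splitting the slice is usually the safer choice, since file reads during the run add to the prompt.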
FM2 — Model timeout
Symptom: orchestrator waits past the configured provider.timeoutMs and aborts. Status reason: worker-signaled or provider-timeout.
Diagnosis: provider status page; forge_watch_live shows the last successful token timestamp. If the model was streaming and then stopped, the network broke. If it never streamed, the provider is overloaded.
Recovery: pforge run-plan --resume-from <slice>. The retry will use the same prompt; provider issues are usually transient. If repeated, switch provider via --model.
Prevention: keep the provider list in .forge.json#modelRouting.fallback populated so auto mode can fail over without manual intervention.
FM3 — Malformed tool call
Symptom: model returns a tool-call block with invalid JSON, wrong argument types, or a tool name that doesn't exist. Orchestrator surfaces tool-call-invalid.
Diagnosis: inspect .forge/runs/<runId>/trajectory.jsonl for the raw tool-call frame.
Recovery: the orchestrator retries with the parse error fed back to the model. If 3 retries fail, the slice errors. Manual fix: tighten the tool's inputSchema in the MCP server so the model gets a clearer contract on the next attempt.
Prevention: follow the forge_search ACI gold standard for new tools: bounded payloads, sparse fields, explicit schemas, and friendly empty-state messages.
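The three malformed shapes named in the symptom (unknown tool, invalid JSON, wrong argument types) can be told apart with a small validator. This is a minimal sketch, not the orchestrator's actual check: the tool name example_search, its arguments, and the error strings are all hypothetical.

```python
import json

# Hypothetical registry: tool name -> {argument name: expected Python type}.
TOOLS = {"example_search": {"query": str, "limit": int}}


def validate_tool_call(name, raw_args, tools=TOOLS):
    """Return a list of human-readable problems; an empty list means the call is valid."""
    if name not in tools:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    errors = []
    for key, expected in tools[name].items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif type(args[key]) is not expected:  # exact match, so True is not an int
            errors.append(f"wrong type for {key}: expected {expected.__name__}")
    return errors


print(validate_tool_call("example_search", '{"query": "retry", "limit": "5"}'))
```

Messages of this kind are what a retry can feed back to the model; the more specific the message, the better the odds that the next attempt is well-formed.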
FM4 — Edit blocked by scope / forbidden actions
Symptom: PreToolUse hook fires; worker's edit is rejected with scope-violation or forbidden-action. Slice fails or worker pivots to a different file.
Diagnosis: read the hook's output line; it names the file and the rule. Compare against the plan's Scope Contract and Forbidden Actions sections.
Recovery: two paths. (a) If the worker was wrong (genuine scope creep), let the block stand; the system is working as designed. (b) If the plan was too narrow (the legitimate fix requires touching a file the scope doesn't allow), edit the plan to widen scope, file a plan-defect meta-bug, then resume.
Prevention: write Scope Contracts that match the slice's true file set. Underscoped plans are the #1 source of FM4. See the AI Plan Hardening Runbook for scope-sizing guidance.
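To make the two rejection reasons concrete, here is a sketch of the kind of decision a scope hook makes. It assumes the Scope Contract and Forbidden Actions can be expressed as glob patterns over POSIX-style paths, which is an assumption for illustration; the real hook's rule format may differ.

```python
from fnmatch import fnmatchcase


def check_edit(path, scope, forbidden):
    """Classify an edit as 'forbidden-action', 'scope-violation', or 'allowed'."""
    # Forbidden rules win even when the path is also in scope.
    if any(fnmatchcase(path, pattern) for pattern in forbidden):
        return "forbidden-action"
    if not any(fnmatchcase(path, pattern) for pattern in scope):
        return "scope-violation"
    return "allowed"


# Hypothetical contract for a slice that should only touch the auth module.
scope = ["src/auth/*", "tests/auth/*"]
forbidden = [".env*", "src/auth/secrets.*"]

for candidate in ["src/auth/login.ts", "src/billing/invoice.ts", "src/auth/secrets.json"]:
    print(candidate, check_edit(candidate, scope, forbidden))
```

Running a plan's intended file list through a check like this before execution is one way to catch an underscoped contract early.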
FM5 — Worker loop detected
Symptom: the worker calls the same tool with the same arguments N times in a row, or alternates between two tool calls indefinitely. Orchestrator emits loop-detected and aborts the slice.
Diagnosis: trajectory.jsonl shows the repeating pattern. Common cause: the model is reading a file, "concluding," then reading it again because no progress was made.
Recovery: abort with forge_abort if not already aborted. Split the slice or give the worker a clearer next-step instruction in the plan. If the loop is between two specific tools, check whether one of them has an ambiguous empty-state message (see Appendix X — MCP tool errors).
Prevention: ACI hygiene. Tools must return friendly messages on empty results, not a bare { hits: [] }.
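Both loop shapes described in the symptom, the same call repeated and two calls alternating, can be detected from the tail of the call history. The sketch below is one possible detector, not the orchestrator's implementation; the threshold n and the (tool, args) record shape are assumptions.

```python
import json


def call_key(tool, args):
    """Normalize a call so identical arguments compare equal regardless of key order."""
    return (tool, json.dumps(args, sort_keys=True))


def detect_loop(calls, n=3):
    """calls: list of (tool, args dict). Returns 'repeat', 'alternate', or None."""
    keys = [call_key(tool, args) for tool, args in calls]
    # Shape 1: the last n calls are identical.
    if len(keys) >= n and len(set(keys[-n:])) == 1:
        return "repeat"
    # Shape 2: the last 2n calls alternate A, B, A, B, ...
    tail = keys[-2 * n:]
    if (len(tail) == 2 * n and tail[0] != tail[1]
            and all(tail[i] == tail[i % 2] for i in range(2 * n))):
        return "alternate"
    return None


history = [("read_file", {"path": "a.ts"}), ("search", {"q": "TODO"})] * 3
print(detect_loop(history))
```

When reading trajectory.jsonl by hand, the same idea applies: look for an identical tail rather than scanning the whole run.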
Gate failures
FM6 — Gate test failure (legitimate)
Symptom: gate command exits non-zero; test runner reports failed assertions.
Diagnosis: read the gate output. The orchestrator's retry loop will feed the failure back to the worker and let it try again (up to execution.maxRetries).
Recovery: let the retry happen. If it still fails after retries, treat the slice's gate as the source of truth and presume the implementation is wrong. Triage: is the test correct? Is the implementation incomplete? Is the test too strict?
Prevention: tight, fast gates that fail with clear error messages. Loose gates pass bad work; cryptic gates leave the worker spinning.
FM7 — Gate timeout
Symptom: gate runs past the configured timeout (default 120s); orchestrator kills it. Status reason: gate-timeout.
Diagnosis: was the test suite legitimately too big, or did a test hang? Try running the gate command manually; observe time-to-completion.
Recovery: if legitimate, raise the timeout for that slice in the plan's per-slice gateTimeoutMs. If a hang, fix the test (often a missing mock for an async call or an unbounded retry loop).
Prevention: gates should run in <30s ideally, <60s comfortably. Slice-level gates that need to run a 5-minute suite are usually a smell; consider running the small slice gate plus a separate periodic sweep.
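Running the gate manually under a timer, as the diagnosis step suggests, can be scripted. This is a minimal sketch using the standard library; the function name and the status strings it returns are illustrative, with the 120-second default mirroring the timeout mentioned above.

```python
import subprocess
import sys
import time


def run_gate(cmd, timeout_s=120.0):
    """Run a gate command; return (status, elapsed_seconds)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        status = "passed" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        status = "gate-timeout"
    return status, time.monotonic() - start


# Stand-in gate: a trivial Python command instead of a real test suite.
status, elapsed = run_gate([sys.executable, "-c", "pass"])
print(status, f"{elapsed:.2f}s")
```

A gate that finishes comfortably when run this way but times out under the orchestrator points to environment differences rather than suite size.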
FM8 — Non-portable gate command
Symptom: gate passes on the plan author's machine but fails on another platform (typically Windows). Common: bash pipe-to-brace-group like grep -c | { read n; [ "$n" -ge 1 ]; } where the inner variable is invisible through the cmd→bash shim.
Diagnosis: gate output shows the failure on the second machine; manual run of the gate command reproduces it.
Recovery: rewrite the gate to use simple, portable shell. Prefer grep -q PATTERN file and test -f path over complex pipe-fests. Avoid pipe-to-brace-group; use intermediate files if you need to capture counts.
Prevention: see AI Plan Hardening Runbook — portable gate commands.
FM9 — Documentation / index validator drift
Symptom: gate validator (e.g. node docs/manual/maintain.mjs) reports drift: orphan files, missing index entries, broken cross-refs.
Diagnosis: the validator output lists every drift item. Typical: a new file was created but not registered in the index SEARCH_SECTIONS array.
Recovery: run the validator twice. The first pass detects drift and auto-regenerates derived files (book-index, list-of-figures, glossary). The second pass confirms convergence. If the second pass still shows drift, fix manually (usually a missing manual.js registration).
Prevention: P12 (Documentation Phase) pattern in Appendix Y mandates the twice-validate gate.
Orchestrator failures
FM10 — Worker spawn failure
Symptom: orchestrator can't launch the worker subprocess; exits with worker-spawn-failed. On Windows: ENOENT from spawn.
Diagnosis: usually a missing CLI on PATH (e.g. claude, cursor-agent, codex). Run pforge smith; it lists which agent CLIs are present.
Recovery: install or reinstall the worker CLI; verify with where claude (Windows) / which claude (POSIX). On Windows, restart the IDE after PATH changes, because a child process inherits PATH at spawn time.
Prevention: pforge smith in your project's preflight; /health-check skill on session start.
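The where / which check can also be done portably from a script. The sketch below uses Python's shutil.which, which searches PATH the same way on Windows and POSIX; the CLI names are the examples from the diagnosis line, and the helper name is hypothetical.

```python
import shutil


def missing_clis(names):
    """Return the subset of CLI names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Example agent CLIs taken from the text above; adjust to the workers you use.
print(missing_clis(["claude", "cursor-agent", "codex"]))
```

Note that the result reflects the PATH of the process running the check, which is why a stale IDE session can still fail after an install.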
FM11 — Git stash conflict on rollback
Symptom: failed slice rolled back; git stash pop reports merge conflicts because foreign files were modified during the run.
Diagnosis: git status shows conflict markers in files the slice was not supposed to touch.
Recovery: resolve conflicts manually, then drop the stash with git stash drop. The v3.3.4 / v3.3.5 fixes addressed the most common shapes of this (snapshot-apply-then-drop ordering); if you hit it on a current Plan Forge version, file an orchestrator-defect meta-bug.
Prevention: don't make manual edits while a plan is running. The orchestrator's snapshot model assumes the working tree is stable during execution.
FM12 — Snapshot apply failure
Symptom: orchestrator can't apply the pre-slice snapshot to roll back a failed slice. Status reason: snapshot-apply-failed.
Diagnosis: .forge/runs/<runId>/snapshots/ contains the snapshot artifacts; inspect git output for the actual failure (usually a file-permission issue or a concurrent index lock).
Recovery: manually restore from the snapshot or from the prior git commit. git reflog shows the orchestrator's commits; git reset --hard <sha> to the pre-slice state if necessary.
Prevention: ensure no other git operations are running against the repo during plan execution; close other IDE windows that might be touching the index.
FM13 — Plan parse error
Symptom: pforge run-plan exits with code 2 (EX_USAGE) and a plan-parse error. Common: duplicate slice headers, missing required sections, malformed bash gate fences.
Diagnosis: error message names the line. pforge check <plan> validates standalone.
Recovery: fix the markdown. Common issues: two slices with the same heading text; gate code-fence not closed; ### Slice N heading without a following body.
Prevention: run pforge check before pforge run-plan; the Crucible's plan-hardening pass (Session 1) catches most parse errors before they reach execution.
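Two of the common parse issues, duplicate slice headings and an unclosed gate fence, are simple enough to spot with a few lines of script. This sketch is not a substitute for pforge check and does not reproduce its rules; it assumes slices use "### Slice" headings as described above.

```python
import re
from collections import Counter

SLICE_HEADING = re.compile(r"^###\s+Slice\b.*$", re.MULTILINE)
FENCE = "`" * 3  # a Markdown code-fence marker


def duplicate_slice_headings(plan_md):
    """Return slice headings that appear more than once, in first-seen order."""
    headings = [m.group(0).strip() for m in SLICE_HEADING.finditer(plan_md)]
    return [h for h, count in Counter(headings).items() if count > 1]


def has_unclosed_fence(plan_md):
    """True when the number of fence lines is odd, meaning one fence never closes."""
    fences = sum(1 for line in plan_md.splitlines() if line.lstrip().startswith(FENCE))
    return fences % 2 == 1


plan = "### Slice 1\nbody\n### Slice 1\nbody\n### Slice 2\nbody\n"
print(duplicate_slice_headings(plan), has_unclosed_fence(plan))
```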
Provider failures
FM14 — Rate limit (HTTP 429)
Symptom: provider returns 429; orchestrator surfaces provider-rate-limit.
Diagnosis: check provider's rate-limit headers (x-ratelimit-remaining-requests, x-ratelimit-reset-*). Are you over your tier's per-minute or per-day cap?
Recovery: the orchestrator backs off and retries automatically (configurable in .forge.json#execution.backoff). Manual: switch to a different provider via --model until the window resets, or upgrade your provider tier.
Prevention: spread load across providers via modelRouting.fallback; reserve power quorum for slices that actually need it (each panelist counts against the rate limit).
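For intuition about what an automatic back-off does, here is a generic capped exponential schedule with optional jitter. It is a textbook sketch, not the orchestrator's algorithm, and the parameter names are assumptions that do not correspond to documented .forge.json#execution.backoff fields.

```python
import random


def backoff_delays(max_retries, base_s=1.0, cap_s=60.0, jitter=0.0, rng=random.random):
    """Return the wait (in seconds) before each retry: base * 2^attempt, capped."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap_s, base_s * 2 ** attempt)
        # Optional jitter spreads retries out so parallel workers do not sync up.
        delay += delay * jitter * rng()
        delays.append(delay)
    return delays


print(backoff_delays(5))
print(backoff_delays(8, cap_s=60.0)[-1])
```

With several panelists hitting the same provider, some jitter helps avoid all of them retrying in the same instant.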
FM15 — Provider 5xx / outage
Symptom: 500/502/503 from provider; sustained failures over multiple retries.
Diagnosis: check the provider's status page. If a single provider is degraded, fail over.
Recovery: pforge run-plan --resume-from <slice> --model <different-provider>. Multi-provider routing in auto mode handles this automatically when configured.
Prevention: maintain keys for at least two providers (Anthropic + OpenAI is the common pairing). The marginal cost of having a fallback key configured is zero until you need it.
FM16 — Auth expired
Symptom: provider returns 401/403; or gh auth login token expired (relevant for Copilot routing).
Diagnosis: pforge smith reports auth status per provider. For GitHub Copilot: gh auth status.
Recovery: rotate the API key (env var or .forge/secrets.json); for OAuth: gh auth login again. Resume the plan.
Prevention: rotate keys before they expire; for OAuth, the LiveGuard preDeploy hook can be extended to call gh auth status as part of its checks.
Memory failures
FM17 — L2 jsonl corruption
Symptom: forge_memory_report errors with JSON parse exception; memory search returns empty.
Diagnosis: open .forge/memory/L2.jsonl; look for a truncated last line (write interrupted by crash).
Recovery: remove the corrupt line. Re-run forge_memory_report to verify. Because the file is append-only jsonl, recovery is just trimming the last line.
Prevention: don't kill the orchestrator mid-write. The flush-on-write design minimizes the window, but it's not zero.
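The trim-the-last-line recovery can be scripted so that it only touches the file when the final record really fails to parse. This is a minimal sketch under that assumption; the helper name and the .bak backup convention are illustrative, not Plan Forge behavior.

```python
import json
import shutil
import tempfile
from pathlib import Path


def trim_corrupt_tail(path):
    """Drop the last line of a JSONL file if it is not valid JSON. Returns True if trimmed."""
    p = Path(path)
    lines = [ln for ln in p.read_text(encoding="utf-8").splitlines() if ln.strip()]
    if not lines:
        return False
    try:
        json.loads(lines[-1])
        return False  # last record parses; nothing to do
    except json.JSONDecodeError:
        pass
    shutil.copy2(p, p.with_name(p.name + ".bak"))  # keep a backup before rewriting
    kept = lines[:-1]
    p.write_text("\n".join(kept) + ("\n" if kept else ""), encoding="utf-8")
    return True


# Demo on a throwaway file whose last record was cut off mid-write.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "L2.jsonl"
    demo.write_text('{"id": 1}\n{"id": 2}\n{"id": 3, "te', encoding="utf-8")
    print(trim_corrupt_tail(demo))
    print(demo.read_text(encoding="utf-8"))
```

If corruption appears anywhere other than the last line, the cause is probably not an interrupted write and deserves a closer look before trimming anything.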
FM18 — L3 endpoint unreachable
Symptom: memory_recall calls timing out; OpenBrain (or your configured L3) not responding.
Diagnosis: curl the configured memory.l3Endpoint; check network and auth token.
Recovery: L3 is opt-in and the orchestrator falls back to L2-only when L3 is down. No slice should fail because L3 is unreachable. If a slice does, the worker is over-relying on L3 hints; tighten the plan instruction set to make L3 advisory rather than required.
Prevention: treat L3 as a hint surface, not a contract. The plan should be runnable with L3 off.
Hook failures
FM19 — Hook blocks a legitimate edit (false positive)
Symptom: PreToolUse blocks an edit that the plan's scope actually allows; or LiveGuard preDeploy flags a "secret" that's a placeholder constant.
Diagnosis: hook output names the rule. Inspect the rule's pattern; compare against the actual content.
Recovery: tighten the pattern (forge_secret_scan ignore patterns are configurable). For scope hooks, widen the Scope Contract in the plan.
Prevention: tune secret-scan ignore patterns when you add codebase-specific constants that match common secret shapes (e.g. fixture IDs that look like API keys).
FM20 — Hook script error
Symptom: a hook script exits non-zero with an actual scripting error (not a policy denial).
Diagnosis: hook output includes the script's stderr. Most common: pwsh-vs-bash mismatch on the wrong platform.
Recovery: fix the script; run it manually to verify. Hook scripts live in .github/hooks/<Event>.md with code fences for each platform.
Prevention: keep both bash and pwsh blocks for every hook; /health-check exercises hooks during smoke testing.
Quorum failures
FM21 — Panel disagrees below threshold
Symptom: quorum panel returns; no answer reaches the configured threshold. Slice fails with quorum-no-consensus.
Diagnosis: forge_quorum_analyze on the run id shows each panelist's answer; look for fundamental disagreement (different APIs proposed, different architectural choices) vs near-misses on wording.
Recovery: split the slice into a P14 (Spike) plus a build slice. The disagreement signal is the panel telling you the question is ambiguous; resolve the ambiguity at the plan level, not by re-running the same quorum.
Prevention: clearer slice prompts; tighter Scope Contracts. Quorum disagreement is usually a plan-quality signal.
FM22 — Panelist timeout (panel partial)
Symptom: one or more panelists fail to respond before the per-panelist timeout. Quorum either proceeds with fewer voices (if remaining count ≥ threshold) or fails.
Diagnosis: trajectory.jsonl shows which panelist timed out and at what stage.
Recovery: if quorum failed due to insufficient responders, retry with --quorum=auto (smaller panel, less rate-limit risk) or after the timed-out provider recovers.
Prevention: configure .forge.json#quorum.panelistTimeoutMs to a value your slowest provider tolerates; for cost-sensitive workflows, prefer auto over power, since fewer panelists means fewer timeout opportunities.
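The two quorum outcomes in FM21 and FM22 can be illustrated with a small tally. The sketch assumes the threshold is a minimum count of agreeing panelists and that a timed-out panelist is recorded as None; both are assumptions, and "insufficient-responders" is a made-up label rather than a documented status reason.

```python
from collections import Counter


def tally(answers, threshold):
    """answers: panelist name -> answer, or None for a timeout. Returns (status, answer)."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    responded = [a for a in answers.values() if a is not None]
    # Partial panel: too few voices left to ever reach the threshold.
    if len(responded) < threshold:
        return "insufficient-responders", None
    answer, votes = Counter(responded).most_common(1)[0]
    if votes >= threshold:
        return "consensus", answer
    return "quorum-no-consensus", None


print(tally({"a": "use-rest", "b": "use-rest", "c": None}, threshold=2))
print(tally({"a": "use-rest", "b": "use-grpc", "c": "use-queue"}, threshold=2))
```

The second call shows the FM21 case: everyone answered, but no single answer gathered enough votes, which is the signal to revisit the plan rather than re-run the panel.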
System failures
FM23 — Port already in use
Symptom: hub or MCP server can't bind to 3100/3101/3102; exits with EADDRINUSE.
Diagnosis: a previous Plan Forge process didn't shut down cleanly, or another tool grabbed the port. On Windows: netstat -ano | findstr :3100; on POSIX: lsof -i :3100.
Recovery: kill the stale process by PID. pforge smith detects orphan processes and offers to clean them up.
Prevention: shut down cleanly (Ctrl+C, not kill -9). The orchestrator releases its ports on SIGTERM but not on SIGKILL.
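A quick cross-platform way to see whether the ports are free is to try binding them. The sketch below checks the three ports named in the symptom on the loopback interface; binding to 127.0.0.1 is an assumption, so adjust the host if your hub listens elsewhere.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something else is already bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
    return False


for port in (3100, 3101, 3102):
    print(port, "in use" if port_in_use(port) else "free")
```

This only tells you that a port is taken; use the netstat or lsof commands above to find the PID that holds it.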
FM24 — Disk full
Symptom: writes to .forge/runs/<runId>/trajectory.jsonl or .forge/cost-history.json fail; orchestrator errors with ENOSPC.
Diagnosis: df -h . (POSIX) / Get-PSDrive (Windows). Trajectory files can grow large for long runs.
Recovery: clear old runs; .forge/runs/ can be aggressively pruned, so keep only recent traces. Cost history is small (JSONL, one row per LLM call).
Prevention: configure .forge.json#execution.trajectoryRetentionDays (default 30) to a value your disk tolerates.
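Manual pruning can follow the same retention idea. The sketch below lists (and optionally deletes) run directories older than a cutoff; it assumes one directory per run and uses directory modification time as the age signal, which is an approximation. The helper is hypothetical and defaults to a dry run.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path


def prune_runs(runs_dir, retention_days, now=None, dry_run=True):
    """Return names of run directories older than retention_days; delete them unless dry_run."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    stale = sorted(
        d for d in Path(runs_dir).iterdir()
        if d.is_dir() and d.stat().st_mtime < cutoff
    )
    if not dry_run:
        for d in stale:
            shutil.rmtree(d)
    return [d.name for d in stale]


# Demo in a throwaway directory: one run backdated 40 days, one fresh.
with tempfile.TemporaryDirectory() as tmp:
    old_run, new_run = Path(tmp) / "run-old", Path(tmp) / "run-new"
    old_run.mkdir()
    new_run.mkdir()
    backdated = time.time() - 40 * 86400
    os.utime(old_run, (backdated, backdated))
    print(prune_runs(tmp, retention_days=30))
```

Reviewing the dry-run list first is worthwhile, since a trajectory you still need for a meta-bug report may be among the stale runs.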
FM25 — File locked (Windows)
Symptom: write fails with EBUSY or EPERM; common when an editor, antivirus, or sync client (OneDrive / Dropbox) is holding the file.
Diagnosis: Get-Process | Where { $_.Modules.FileName -contains $path } in pwsh; or use Process Explorer's "Find Handle" feature.
Recovery: close the editor / sync client; the orchestrator's retry loop usually picks up the file on the next attempt. For persistent locks, exclude .forge/ from sync-client scope and antivirus realtime scanning.
Prevention: put working repos outside synced folders when possible; add .forge/ to OneDrive / Dropbox exclusion lists.
General recovery techniques
When in doubt, the following are safe in any failure mode:
- pforge smith: environment diagnostic; reports installed CLIs, configured providers, port status, disk space.
- /health-check skill: runs forge_smith → forge_validate → forge_sweep in sequence.
- forge_diagnose: per-run diagnosis with structured remediation suggestions.
- pforge run-plan --resume-from <slice>: resumes a failed run at a specific slice, preserving prior committed slices.
- git reflog + git reset --hard: ultimate rollback to any prior orchestrator commit.
- forge_meta_bug_file: if you worked around a Plan Forge defect, file it so the fix lands upstream. See self-repair reporting.
See also
- Appendix X — Errors & Exit Codes, the symbolic names this catalog references (scope-violation, gate-timeout, etc.).
- Appendix Y — Plan Pattern Library, the patterns whose anti-patterns this catalog cross-references.
- Chapter 15 — Troubleshooting, the narrative-style intro to the failure surface.
- Chapter 30 — Incident response, the security incident path that overlays this catalog.
- AI Plan Hardening Runbook, how to write plans that produce fewer of these failures in the first place.
- Self-Repair Reporting, the meta-bug flow for Plan-Forge-itself defects.