
Appendix Z

Failure-Mode Catalog

Common Plan Forge failure modes organized by layer. For each: symptom, diagnosis path, recovery action, and prevention. This appendix is the operator's companion to Appendix X — Errors & Exit Codes: Appendix X lists what the system says; Appendix Z lists what to do.

How to use this appendix. Read the index, find the failure mode that matches the symptom you're seeing, jump to its section, follow the diagnosis path, and apply the recovery. The forge_diagnose tool and the /health-check skill cover most cases automatically; this catalog is for when you need to understand why the automation suggests what it does.

Figure Z-1. Troubleshooting decision tree showing the diagnostic starting point and branching paths by subsystem. Start here: the decision tree routes symptoms to the subsystem sections below.

Index

Layer	Failure modes
Worker	FM1 token limit · FM2 model timeout · FM3 malformed tool call · FM4 scope blocked · FM5 loop detected
Gate	FM6 test failure · FM7 gate timeout · FM8 non-portable gate · FM9 validator drift
Orchestrator	FM10 worker spawn failure · FM11 stash conflict · FM12 snapshot apply failure · FM13 plan parse error
Provider	FM14 rate limit · FM15 provider 5xx / outage · FM16 auth expired
Memory	FM17 L2 jsonl corruption · FM18 L3 endpoint unreachable
Hook	FM19 hook false positive · FM20 hook script error
Quorum	FM21 panel disagree below threshold · FM22 panelist timeout
System	FM23 port in use · FM24 disk full · FM25 file locked (Windows)

Worker failures

FM1 — Token limit hit

Symptom: worker response truncated mid-sentence or mid-tool-call; error like max_tokens reached or HTTP 200 with finish_reason: length.

Diagnosis: check forge_watch_live for the slice's input + output token counts; compare to the model's context window. Most often the prompt grew beyond budget after a few file reads.

Recovery: split the slice. The scope was too broad. Re-run with a tighter file list. If splitting isn't practical, switch the slice's model to one with a larger context (Opus 1M, GPT-5.5).

Prevention: target 1–4 files per slice; use scope contracts; let auto quorum route bigger slices to larger-context models.
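The budget comparison in the diagnosis step can be sketched in a few lines of Python. This is an illustrative helper, not part of Plan Forge: the function name, the 20% headroom, and the sample window size are assumptions made for the example, and the token counts would come from whatever forge_watch_live reports for the slice.

```python
def should_split(input_tokens, output_tokens, context_window, headroom=0.2):
    """Return True when a slice leaves less than `headroom` of the window free.

    `headroom` is a fraction of the context window. The name and the default
    are hypothetical, chosen only for this illustration.
    """
    used = input_tokens + output_tokens
    return used > context_window * (1 - headroom)


# Made-up numbers against a 200,000-token window.
print(should_split(180_000, 10_000, 200_000))  # usage is 95% of the window
print(should_split(50_000, 5_000, 200_000))    # usage is 27.5% of the window
```

A check like this is most useful before the run, while the slice can still be split cheaply.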

FM2 — Model timeout

Symptom: orchestrator waits past the configured provider.timeoutMs and aborts. Status reason: worker-signaled or provider-timeout.

Diagnosis: check the provider status page; forge_watch_live shows the last successful token timestamp. If the model was streaming and then stopped, the connection most likely dropped. If it never streamed, the provider is most likely overloaded.

Recovery: pforge run-plan --resume-from <slice>. The retry will use the same prompt; provider issues are usually transient. If repeated, switch provider via --model.

Prevention: keep the provider list in .forge.json#modelRouting.fallback populated so auto mode can fail over without manual intervention.

FM3 — Malformed tool call

Symptom: model returns a tool-call block with invalid JSON, wrong argument types, or a tool name that doesn't exist. Orchestrator surfaces tool-call-invalid.

Diagnosis: inspect .forge/runs/<runId>/trajectory.jsonl for the raw tool-call frame.

Recovery: the orchestrator retries with the parse error fed back to the model. If 3 retries fail, the slice errors. Manual fix: tighten the tool's inputSchema in the MCP server so the model gets a clearer contract on the next attempt.

Prevention: follow the forge_search ACI gold standard for new tools: bounded payloads, sparse fields, explicit schemas, and friendly empty-state messages.
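To make the three failure shapes from the symptom concrete (invalid JSON, wrong argument types, unknown tool name), here is a minimal validation sketch. It is not the orchestrator's code. The frame shape ({"name": ..., "arguments": ...}), the TOOLS registry, and the query and limit arguments for forge_search are all hypothetical, and a real MCP server would validate against the tool's full inputSchema rather than a type map.

```python
import json

# Hypothetical registry: tool name -> {argument name: expected Python type}.
TOOLS = {"forge_search": {"query": str, "limit": int}}


def validate_tool_call(raw, tools=TOOLS):
    """Return None when the call is valid, else an error string that could be
    fed back to the model on retry."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg} at position {exc.pos}"
    if not isinstance(call, dict):
        return "tool call must be a JSON object"
    name = call.get("name")
    if name not in tools:
        return f"unknown tool: {name!r}"
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return "arguments must be a JSON object"
    for key, value in args.items():
        expected = tools[name].get(key)
        if expected is None:
            return f"unexpected argument {key!r} for {name}"
        if type(value) is not expected:  # strict: True is not accepted as int
            return f"argument {key!r} must be {expected.__name__}"
    return None
```

Returning a short, specific message matters more than the check itself, because that message is what the model sees on the next attempt.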

FM4 — Edit blocked by scope / forbidden actions

Symptom: PreToolUse hook fires; worker's edit is rejected with scope-violation or forbidden-action. Slice fails or worker pivots to a different file.

Diagnosis: read the hook's output line; it names the file and the rule. Compare against the plan's Scope Contract and Forbidden Actions sections.

Recovery: two paths. (a) If the worker was wrong (genuine scope creep), let the block stand; the system is working as designed. (b) If the plan was too narrow (the legitimate fix requires touching a file the scope doesn't allow), edit the plan to widen scope, file a plan-defect meta-bug, then resume.

Prevention: write Scope Contracts that match the slice's true file set. Underscoped plans are the #1 source of FM4. See the AI Plan Hardening Runbook for scope-sizing guidance.
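When comparing a blocked edit against the plan by hand, it can help to think of the check as two pattern lists. The sketch below assumes the Scope Contract and Forbidden Actions reduce to glob patterns, which this appendix does not confirm; check_edit and the sample patterns are hypothetical, and only the two status strings come from the symptom above.

```python
from fnmatch import fnmatch


def check_edit(path, allowed, forbidden):
    """Return None if the edit is in scope, else a reason string.

    `allowed` and `forbidden` are lists of glob patterns (an assumption about
    how a plan's scope could be expressed). Forbidden patterns win over
    allowed ones.
    """
    for pattern in forbidden:
        if fnmatch(path, pattern):
            return f"forbidden-action: {path} matches {pattern}"
    if not any(fnmatch(path, pattern) for pattern in allowed):
        return f"scope-violation: {path} is outside the scope contract"
    return None


print(check_edit("src/auth/login.py", ["src/auth/*"], ["*.lock"]))
print(check_edit("src/billing/invoice.py", ["src/auth/*"], ["*.lock"]))
```

Note that fnmatch lets * match across path separators, so src/auth/* also covers nested directories.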

FM5 — Worker loop detected

Symptom: the worker calls the same tool with the same arguments N times in a row, or alternates between two tool calls indefinitely. Orchestrator emits loop-detected and aborts the slice.

Diagnosis: trajectory.jsonl shows the repeating pattern. Common cause: the model is reading a file, "concluding," then reading it again because no progress was made.

Recovery: abort with forge_abort if not already aborted. Split the slice or give the worker a clearer next-step instruction in the plan. If the loop is between two specific tools, check whether one of them has an ambiguous empty-state message (see Appendix X — MCP tool errors).

Prevention: ACI hygiene. Tools must return friendly messages on empty results, not bare { hits: [] }.
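The two loop shapes described in the symptom (the same call repeated, or two calls alternating) can be detected with a short scan over the call history. This is a sketch of the idea, not the orchestrator's detector: the threshold of 3 stands in for the unspecified N, and each trajectory frame is assumed to have been reduced to a (tool, serialized arguments) tuple beforehand.

```python
def detect_loop(calls, threshold=3):
    """Classify the tail of a call history.

    `calls` is an ordered list of (tool_name, args_json) tuples. Returns
    "repeat", "alternate", or None. `threshold` is a hypothetical stand-in
    for the orchestrator's N.
    """
    # Same call `threshold` times in a row.
    if len(calls) >= threshold and len(set(calls[-threshold:])) == 1:
        return "repeat"
    # Two distinct calls alternating A, B, A, B for `threshold` full cycles.
    window = calls[-2 * threshold:]
    if len(window) == 2 * threshold:
        a, b = window[0], window[1]
        if a != b and all(
            call == (a if i % 2 == 0 else b) for i, call in enumerate(window)
        ):
            return "alternate"
    return None
```

Running the same scan by hand over trajectory.jsonl is a quick way to confirm which two tools are involved before checking their empty-state messages.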

Gate failures

FM6 — Gate test failure (legitimate)

Symptom: gate command exits non-zero; test runner reports failed assertions.

Diagnosis: read the gate output. The orchestrator's retry loop will feed the failure back to the worker and let it try again (up to execution.maxRetries).

Recovery: let the retry happen. If it still fails after retries, the slice's gate is the truth and the implementation is presumed wrong. Triage: is the test correct? Is the implementation incomplete? Is the test too strict?

Prevention: tight, fast gates that fail with clear error messages. Loose gates pass bad work; cryptic gates leave the worker spinning.

FM7 — Gate timeout

Symptom: gate runs past the configured timeout (default 120s); orchestrator kills it. Status reason: gate-timeout.

Diagnosis: was the test suite legitimately too big, or did a test hang? Try running the gate command manually; observe time-to-completion.

Recovery: if legitimate, raise the timeout for that slice in the plan's per-slice gateTimeoutMs. If a hang, fix the test (often a missing mock for an async call or an unbounded retry loop).

Prevention: gates should run in <30s ideally, <60s comfortably. Slice-level gates that need to run a 5-minute suite are usually a smell; consider running the small slice gate plus a separate periodic sweep.
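For the manual run suggested in the diagnosis, a small timing wrapper makes the time-to-completion explicit and reproduces the kill-on-timeout behavior. This is a minimal sketch using Python's standard subprocess module; run_gate is a hypothetical helper, and the 120-second default simply mirrors the default quoted above.

```python
import subprocess
import time


def run_gate(command, timeout_s=120.0):
    """Run a gate command (a list of arguments) and time it.

    Returns (status, elapsed_seconds), where status is "passed", "failed",
    or "gate-timeout". On timeout, subprocess.run kills the child process.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
        status = "passed" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "gate-timeout"
    return status, time.monotonic() - start
```

Calling run_gate(["npm", "test"]) (or whatever the slice's gate command is) a few times shows whether the suite is consistently slow or hangs only occasionally, which is the distinction the recovery step depends on.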

FM8 — Non-portable gate command

Symptom: gate passes on the plan author's machine but fails on another platform (typically Windows). Common: bash pipe-to-brace-group like grep -c | { read n; [ "$n" -ge 1 ]; } where the inner variable is invisible through the cmd→bash shim.

Diagnosis: gate output shows the failure on the second machine; manual run of the gate command reproduces it.

Recovery: rewrite the gate to use simple, portable shell. Prefer grep -q PATTERN file and test -f path over complex pipe-fests. Avoid pipe-to-brace-group; use intermediate files if you need to capture counts.

Prevention: see AI Plan Hardening Runbook — portable gate commands.

FM9 — Documentation / index validator drift

Symptom: gate validator (e.g. node docs/manual/maintain.mjs) reports drift: orphan files, missing index entries, broken cross-refs.

Diagnosis: the validator output lists every drift item. Typical: a new file was created but not registered in the index SEARCH_SECTIONS array.

Recovery: run the validator twice. The first pass detects drift and auto-regenerates derived files (book-index, list-of-figures, glossary). The second pass confirms convergence. If the second pass still shows drift, fix manually (usually a missing manual.js registration).

Prevention: P12 (Documentation Phase) pattern in Appendix Y mandates the twice-validate gate.

Orchestrator failures

FM10 — Worker spawn failure

Symptom: orchestrator can't launch the worker subprocess; exits with worker-spawn-failed. On Windows: ENOENT from spawn.

Diagnosis: usually a missing CLI on PATH (e.g. claude, cursor-agent, codex). Run pforge smith; it lists which agent CLIs are present.

Recovery: install or reinstall the worker CLI; verify with where claude (Windows) / which claude (POSIX). On Windows, restart the IDE after PATH changes: child processes inherit the IDE's PATH at spawn time, and the IDE keeps the PATH it started with.

Prevention: pforge smith in your project's preflight; /health-check skill on session start.
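The where / which check from the recovery step has a portable equivalent in Python's standard library. The sketch below is not how pforge smith works internally (that is not documented here); missing_clis is a hypothetical helper built on shutil.which, which resolves a command against the current process's PATH on both Windows and POSIX.

```python
import shutil


def missing_clis(names):
    """Return the CLI names that cannot be resolved on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# The agent CLIs named in the diagnosis above.
print(missing_clis(["claude", "cursor-agent", "codex"]))
```

Because the lookup uses the PATH of the process that runs it, running this from inside the IDE's terminal shows what a spawned worker would actually see.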

FM11 — Git stash conflict on rollback

Symptom: failed slice rolled back; git stash pop reports merge conflicts because foreign files were modified during the run.

Diagnosis: git status shows conflict markers in files the slice was not supposed to touch.

Recovery: resolve conflicts manually, then drop the stash with git stash drop. The v3.3.4 / v3.3.5 fixes addressed the most common shapes of this (snapshot-apply-then-drop ordering); if you hit it on a current Plan Forge version, file an orchestrator-defect meta-bug.

Prevention: don't make manual edits while a plan is running. The orchestrator's snapshot model assumes the working tree is stable during execution.

FM12 — Snapshot apply failure

Symptom: orchestrator can't apply the pre-slice snapshot to roll back a failed slice. Status reason: snapshot-apply-failed.

Diagnosis: .forge/runs/<runId>/snapshots/ contains the snapshot artifacts; inspect git output for the actual failure (usually a file-permission issue or a concurrent index lock).

Recovery: manually restore from the snapshot or from the prior git commit. git reflog shows the orchestrator's commits; git reset --hard <sha> to the pre-slice state if necessary.

Prevention: ensure no other git operations are running against the repo during plan execution; close other IDE windows that might be touching the index.

FM13 — Plan parse error

Symptom: pforge run-plan exits with code 2 (EX_USAGE) and a plan-parse error. Common: duplicate slice headers, missing required sections, malformed bash gate fences.

Diagnosis: error message names the line. pforge check <plan> validates standalone.

Recovery: fix the markdown. Common issues: two slices with the same heading text; gate code-fence not closed; ### Slice N heading without a following body.

Prevention: run pforge check before pforge run-plan; the Crucible's plan-hardening pass (Session 1) catches most parse errors before they reach execution.
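The three common issues listed under recovery are simple enough to illustrate with a toy linter. This is a sketch of the checks, not a replacement for pforge check, whose actual rules are not described here. It assumes slice headings look like "### Slice N" as in the example above; lint_plan is a hypothetical name, and the fence marker is built from single backticks only so the example does not contain a literal fence.

```python
import re

FENCE = "`" * 3  # a Markdown code-fence marker
SLICE_HEADING = re.compile(r"^#{1,6}\s+Slice\b")


def lint_plan(markdown):
    """Return a list of problems: duplicate slice headings, slice headings
    with no body, and an unclosed code fence."""
    problems = []
    in_fence = False
    seen = {}       # heading text -> line number of first occurrence
    pending = None  # line number of a heading still waiting for body text
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            pending = None  # a fence counts as body
            continue
        if in_fence:
            continue
        if SLICE_HEADING.match(stripped):
            if pending is not None:
                problems.append(f"line {pending}: slice heading has no body")
            if stripped in seen:
                first = seen[stripped]
                problems.append(
                    f"line {number}: duplicate slice heading (first at line {first})"
                )
            else:
                seen[stripped] = number
            pending = number
        elif stripped:
            pending = None
    if pending is not None:
        problems.append(f"line {pending}: slice heading has no body")
    if in_fence:
        problems.append("unclosed code fence")
    return problems
```

An empty list means none of these three issues was found; it says nothing about missing required sections, which depend on the plan format itself.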

Provider failures

FM14 — Rate limit (HTTP 429)

Symptom: provider returns 429; orchestrator surfaces provider-rate-limit.

Diagnosis: check provider's rate-limit headers (x-ratelimit-remaining-requests, x-ratelimit-reset-*). Are you over your tier's per-minute or per-day cap?

Recovery: the orchestrator backs off and retries automatically (configurable in .forge.json#execution.backoff). Manual: switch to a different provider via --model until the window resets, or upgrade your provider tier.

Prevention: spread load across providers via modelRouting.fallback; reserve power quorum for slices that actually need it (each panelist counts against the rate limit).
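The automatic back-off in the recovery step is not specified in detail here, so the following is a generic illustration of capped exponential backoff with optional jitter, not the orchestrator's schedule. The parameter names (base_s, factor, cap_s, attempts, jitter) are hypothetical and are not claimed to match the keys under execution.backoff.

```python
import random


def backoff_delays(base_s=1.0, factor=2.0, cap_s=60.0, attempts=5,
                   jitter=0.0, rng=random):
    """Return the wait (in seconds) before each retry.

    Delay grows as base_s * factor**attempt, is capped at cap_s, and gets up
    to `jitter` (a fraction of the delay) of random extra wait so that
    several workers do not retry in lockstep.
    """
    delays = []
    for attempt in range(attempts):
        delay = min(cap_s, base_s * factor ** attempt)
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


print(backoff_delays())            # no jitter: 1, 2, 4, 8, 16 seconds
print(backoff_delays(jitter=0.25)) # each delay stretched by up to 25%
```

With a per-minute cap, the total of the first few delays should comfortably exceed the reset window reported in the rate-limit headers; otherwise the retries only burn attempts.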

FM15 — Provider 5xx / outage

Symptom: 500/502/503 from provider; sustained failures over multiple retries.

Diagnosis: check the provider's status page. If a single provider is degraded, fail over.

Recovery: pforge run-plan --resume-from <slice> --model <different-provider>. Multi-provider routing in auto mode handles this automatically when configured.

Prevention: maintain keys for at least two providers (Anthropic + OpenAI is the common pairing). The marginal cost of having a fallback key configured is zero until you need it.

FM16 — Auth expired

Symptom: provider returns 401/403, or the gh auth login token has expired (relevant for Copilot routing).

Diagnosis: pforge smith reports auth status per provider. For GitHub Copilot: gh auth status.

Recovery: rotate the API key (env var or .forge/secrets.json); for OAuth: gh auth login again. Resume the plan.

Prevention: rotate keys before they expire; for OAuth, the LiveGuard preDeploy hook can be extended to call gh auth status as part of its checks.

Memory failures

FM17 — L2 jsonl corruption

Symptom: forge_memory_report errors with a JSON parse exception; memory search returns empty.

Diagnosis: open .forge/memory/L2.jsonl; look for a truncated last line (write interrupted by crash).

Recovery: remove the corrupt line. Re-run forge_memory_report to verify. Because the file is append-only JSONL, recovery amounts to trimming the last line.

Prevention: don't kill the orchestrator mid-write. The flush-on-write design minimizes the window, but it's not zero.
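Trimming the last line by hand is easy to get wrong in a large file, so here is a minimal sketch of the repair. It assumes only what is stated above: the file is append-only JSONL and the damage is a truncated final line. repair_jsonl is a hypothetical helper, not a Plan Forge command; it keeps a .bak copy and refuses to touch the file if any earlier line is also unparseable.

```python
import json
import shutil
from pathlib import Path


def repair_jsonl(path):
    """Drop a truncated final line from an append-only JSONL file.

    Returns the number of lines removed (0 or 1). Raises ValueError if any
    line other than the last fails to parse, since that is not the
    interrupted-write case and needs a human look.
    """
    path = Path(path)
    lines = path.read_bytes().split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the file ended with a newline; nothing is dangling
    bad = []
    for index, line in enumerate(lines):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            json.loads(line)
        except ValueError:  # covers JSONDecodeError and invalid UTF-8
            bad.append(index)
    if not bad:
        return 0
    if bad != [len(lines) - 1]:
        raise ValueError(f"corruption beyond the last line: {bad}")
    shutil.copy2(path, path.with_name(path.name + ".bak"))  # keep a backup
    path.write_bytes(b"".join(line + b"\n" for line in lines[:-1]))
    return 1
```

For example, repair_jsonl(".forge/memory/L2.jsonl") followed by forge_memory_report covers both halves of the recovery step.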

FM18 — L3 endpoint unreachable

Symptom: memory_recall calls timing out; OpenBrain (or your configured L3) not responding.

Diagnosis: curl the configured memory.l3Endpoint; check network and auth token.

Recovery: L3 is opt-in and the orchestrator falls back to L2-only when L3 is down. No slice should fail because L3 is unreachable. If a slice does, the worker is over-relying on L3 hints; tighten the plan instruction set to make L3 advisory rather than required.

Prevention: treat L3 as a hint surface, not a contract. The plan should be runnable with L3 off.

Hook failures

FM19 — Hook blocks a legitimate edit (false positive)

Symptom: PreToolUse blocks an edit that the plan's scope actually allows; or LiveGuard preDeploy flags a "secret" that's a placeholder constant.

Diagnosis: hook output names the rule. Inspect the rule's pattern; compare against the actual content.

Recovery: tighten the rule's pattern or add an ignore pattern (forge_secret_scan ignore patterns are configurable). For scope hooks, widen the Scope Contract in the plan.

Prevention: tune secret-scan ignore patterns when you add codebase-specific constants that match common secret shapes (e.g. fixture IDs that look like API keys).

FM20 — Hook script error

Symptom: a hook script exits non-zero with an actual scripting error (not a policy denial).

Diagnosis: hook output includes the script's stderr. Most common: pwsh-vs-bash mismatch on the wrong platform.

Recovery: fix the script; run it manually to verify. Hook scripts live in .github/hooks/<Event>.md with code fences for each platform.

Prevention: keep both bash and pwsh blocks for every hook; /health-check exercises hooks during smoke testing.

Quorum failures

FM21 — Panel disagrees below threshold

Symptom: quorum panel returns; no answer reaches the configured threshold. Slice fails with quorum-no-consensus.

Diagnosis: forge_quorum_analyze on the run id shows each panelist's answer; look for fundamental disagreement (different APIs proposed, different architectural choices) vs near-misses on wording.

Recovery: split the slice into a P14 (Spike) plus a build slice. The disagreement signal is the panel telling you the question is ambiguous; resolve the ambiguity at the plan level, not by re-running the same quorum.

Prevention: clearer slice prompts; tighter Scope Contracts. Quorum disagreement is usually a plan-quality signal.

FM22 — Panelist timeout (panel partial)

Symptom: one or more panelists fail to respond before the per-panelist timeout. Quorum either proceeds with fewer voices (if remaining count ≥ threshold) or fails.

Diagnosis: trajectory.jsonl shows which panelist timed out and at what stage.

Recovery: if quorum failed due to insufficient responders, retry with --quorum=auto (smaller panel, less rate-limit risk) or after the timed-out provider recovers.

Prevention: configure .forge.json#quorum.panelistTimeoutMs to a value your slowest provider tolerates; for cost-sensitive workflows, prefer auto over power, since fewer panelists means fewer timeout opportunities.
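The two quorum outcomes in this section (FM21 and FM22) differ only in where the count falls short, which a small tally makes visible. This sketch assumes the threshold is a minimum number of agreeing panelists and that answers can be compared for equality; real panel answers would need normalizing first. quorum_outcome and the "insufficient-responders" label are hypothetical, while quorum-no-consensus is the status named above.

```python
from collections import Counter


def quorum_outcome(responses, threshold):
    """Classify a panel result.

    `responses` maps panelist name -> answer, or None if that panelist timed
    out. `threshold` is the minimum number of agreeing panelists (an
    assumption about how the threshold is expressed).
    """
    answered = [answer for answer in responses.values() if answer is not None]
    if len(answered) < threshold:
        # FM22: too few voices left to reach the threshold at all.
        return ("insufficient-responders", None)
    answer, votes = Counter(answered).most_common(1)[0]
    if votes >= threshold:
        return ("consensus", answer)
    # FM21: enough voices, but they disagree.
    return ("quorum-no-consensus", None)
```

The distinction drives the recovery: insufficient responders is worth a retry, while no consensus among a full panel points back at the plan.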

System failures

FM23 — Port already in use

Symptom: hub or MCP server can't bind to 3100/3101/3102; exits with EADDRINUSE.

Diagnosis: a previous Plan Forge process didn't shut down cleanly, or another tool grabbed the port. On Windows: netstat -ano | findstr :3100; on POSIX: lsof -i :3100.

Recovery: kill the stale process by PID. pforge smith detects orphan processes and offers to clean them up.

Prevention: shut down cleanly (Ctrl+C, not kill -9). The orchestrator releases its ports on SIGTERM but not on SIGKILL.
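Before reaching for netstat or lsof, a quick bind probe tells you whether a port is actually taken. The sketch below uses only Python's standard socket module; port_in_use is a hypothetical helper, and the loopback host is an assumption about where the hub and MCP server listen. It reports that a port is occupied, not which process holds it.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to (host, port) fails, which usually means
    another process already holds the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
    return False


# The ports named in the symptom above.
for port in (3100, 3101, 3102):
    print(port, "in use" if port_in_use(port) else "free")
```

Any port that shows as in use still needs the netstat or lsof step to find the PID to kill.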

FM24 — Disk full

Symptom: writes to .forge/runs/<runId>/trajectory.jsonl or .forge/cost-history.json fail; orchestrator errors with ENOSPC.

Diagnosis: df -h . (POSIX) / Get-PSDrive (Windows). Trajectory files can grow large for long runs.

Recovery: clear old runs. .forge/runs/ can be aggressively pruned; only keep recent traces. Cost history is small (JSONL, one row per LLM call).

Prevention: configure .forge.json#execution.trajectoryRetentionDays (default 30) to a value your disk tolerates.
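For a one-off cleanup, age-based pruning of run directories can be sketched as follows. This is a hypothetical helper, not the retention mechanism behind trajectoryRetentionDays: it judges age by each run directory's modification time, which is an assumption, and it defaults to a dry run so nothing is deleted until you have reviewed the list.

```python
import shutil
import time
from pathlib import Path


def prune_runs(runs_dir, retention_days=30, now=None, dry_run=True):
    """Select (and optionally delete) run directories older than the cutoff.

    Returns the names of the selected directories, sorted. Age is taken from
    the directory's mtime.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    stale = sorted(
        entry for entry in Path(runs_dir).iterdir()
        if entry.is_dir() and entry.stat().st_mtime < cutoff
    )
    if not dry_run:
        for entry in stale:
            shutil.rmtree(entry)
    return [entry.name for entry in stale]
```

Calling prune_runs(".forge/runs") first and prune_runs(".forge/runs", dry_run=False) only after checking the output keeps any trace you still need for an open diagnosis.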

FM25 — File locked (Windows)

Symptom: write fails with EBUSY or EPERM; common when an editor, antivirus, or sync client (OneDrive / Dropbox) is holding the file.

Diagnosis: Get-Process | Where { $_.Modules.FileName -contains $path } in pwsh; or use Process Explorer's "Find Handle" feature.

Recovery: close the editor / sync client; the orchestrator's retry loop usually picks up the file on the next attempt. For persistent locks, exclude .forge/ from sync-client scope and antivirus realtime scanning.

Prevention: put working repos outside synced folders when possible; add .forge/ to OneDrive / Dropbox exclusion lists.
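The retry behavior mentioned in the recovery step can be illustrated with a generic wrapper that retries only on lock-style errors. This is a sketch of the pattern, not the orchestrator's retry loop. retry_on_lock and its parameters are hypothetical, and EACCES is included as an assumption because Python typically surfaces Windows sharing violations as PermissionError with that code, alongside the EBUSY and EPERM codes named above.

```python
import errno
import time

# Error codes treated as "file is temporarily held by another process".
LOCK_ERRNOS = {errno.EBUSY, errno.EPERM, errno.EACCES}


def retry_on_lock(action, attempts=5, delay_s=0.2, sleep=time.sleep):
    """Call `action()` and retry when it fails with a lock-style OSError.

    Waits delay_s * attempt between tries. Any other error, or a lock error
    on the final attempt, is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as exc:
            if exc.errno not in LOCK_ERRNOS or attempt == attempts:
                raise
            sleep(delay_s * attempt)
```

A sync client usually releases a file within a second or two, so a handful of short retries is enough; a lock that outlasts them is the persistent case that calls for the exclusion lists above.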

General recovery techniques

When in doubt, the following are safe in any failure mode:

pforge smith: environment diagnostic; reports installed CLIs, configured providers, port status, disk space.
/health-check skill: runs forge_smith → forge_validate → forge_sweep in sequence.
forge_diagnose: per-run diagnosis with structured remediation suggestions.
pforge run-plan --resume-from <slice>: resumes a failed run at a specific slice, preserving prior committed slices.
git reflog + git reset --hard: ultimate rollback to any prior orchestrator commit (this discards uncommitted changes in the working tree).
forge_meta_bug_file: if you worked around a Plan Forge defect, file it so the fix lands upstream. See self-repair reporting.
