A long stone wall inside the Plan Forge shop engraved with a glowing amber timeline running left to right at chest height, marked at intervals by small bronze relief icons representing project milestones (first crucible lit, first anvil, first guard tower, first library shelf), iron torches in wall sconces, a single smith with a lantern walking the wall reading the engravings

Reference

Project History

Eleven inflection points from v1.0 (Summer 2025) to v3.6 (May 2026). Each one solved a specific problem the previous version exposed.

Plan Forge evolution timeline showing eight headline version milestones: v1.0 Summer 2025 (18 instruction files plus 4-session pipeline), v2.0 January 2026 (autonomous orchestrator plus 17 MCP tools), v2.5 February 2026 (Quorum Mode 3-model consensus), v2.10 March 2026 (OpenClaw bridge for cross-platform notifications), v2.14 March 2026 (Copilot platform integration), v2.18 April 2026 (Temper Guards plus Warning Signs plus Context Fuel), v2.83 May 2026 (host-aware routing plus quorum estimator plus complexity rubric), and v3.6 May 2026 highlighted in amber as current (OpenBrain promoted to L3 memory layer). The May 2026 v3.x sprint also shipped v2.95 Lattice code-graph indexing, v3.0 Copilot integration trilogy, and v3.2 through v3.4 Team Mode, covered in the prose sections below. — Plan Forge evolution timeline showing eight headline version milestones

When to read this chapter: understanding why a feature exists, evaluating whether a design constraint is foundational or contingent, or onboarding to the architecture's history.

Reference adaptation of the marketing essay From Impossible to 7 Minutes — A Year of Building AI Coding Guardrails (April 2026), extended through v3.6 from the CHANGELOG.

v1.0 — Foundation (Summer 2025)

What shipped: 18 specialized instruction files, prompt templates, and the 4-session pipeline (Specify → Plan → Execute → Review). Plan Forge at this point was "files you install", a guardrail collection that lived in the project's .github/ directory.

Inflection point: The breakthrough was not the file count. It was discovering that session isolation works, the builder cannot review its own work, but a separate session with fresh context catches blind spots reliably. This insight made consistent quality possible and became the foundation everything else built on. See How It Works → Why Session Isolation Works and Lessons Learned → Lesson 3.

What it solved: Single-session AI work had a quality ceiling, agents would believe their own bad code was correct because the bad code lived in the same context window as the proposed fix.

v2.0 — Autonomous Orchestrator (January 2026)

What shipped: DAG-based execution engine with CLI worker spawning, 17 MCP tools (forge_run_plan, forge_analyze, forge_diagnose, forge_cost_report, etc.), the pforge CLI, and the dashboard with live progress / cost aggregation / session replay.

Inflection point: Plan Forge stopped being "files you install" and became "a system that runs." The MCP server gave it a programmatic API; the dashboard gave it visibility; the orchestrator made full plan execution possible without human intervention between slices.

What it solved: Hardened plans existed in v1.0 but a human had to drive each slice. Long features required hours of supervised execution. The orchestrator removed the supervision tax for everything except gate failures.

v2.5 — Quorum Mode (February 2026)

What shipped: Multi-model consensus analysis. Three models analyze the same slice independently; a reviewer model synthesizes their findings into a unified report. See Advanced Execution → Quorum Mode for current mechanics.

Inflection point: Single-model execution was hitting its limits. Claude excelled at architecture; GPT at breadth; Grok brought a different analytical lens. Each model had blind spots, and those blind spots were consistent. Treating AI code analysis as a consensus process, the way human code review works, produced 20% more test recommendations than any single model alone (per quorum A/B test).

What it solved: Quality plateau on complex slices. One model's blind spot is another model's strength.

v2.10 — OpenClaw Bridge (March 2026)

What shipped: Cross-platform notification fan-out, Telegram, Slack, Discord, Microsoft Teams, PagerDuty, OpenClaw, with inline approval / reject flows for events that need a human. See Remote Bridge.

Inflection point: Plan Forge runs inside the IDE, but some decisions are not IDE-shaped. A reviewer flags drift at 2 AM. A quorum tie needs a human tiebreaker. An incident fires after the laptop closes. The bridge made the forge able to reach you instead of waiting for you to come back.

What it solved: The "I missed the notification" failure mode that blocked autonomous execution overnight or away from the desk.

v2.14 — Copilot Platform Integration (March 2026)

What shipped: Native VS Code experience, skills, agents, Plan Forge lifecycle hooks (PreDeploy, PreCommit, PreAgentHandoff, PostSlice, configured via .github/hooks/plan-forge.json), and instruction auto-loading via applyTo frontmatter. (These are not Claude Code's SessionStart / PreToolUse / PostToolUse / Stop hooks, the trigger semantics differ; see Installation for the mapping.) See Multi-Agent → Copilot.

Inflection point: Auto-loading turned guardrail adoption from optional ("whoever remembered") to default ("it just works"). The applyTo pattern moved compliance from roughly 20% to 100%. See Lessons Learned → Lesson 2.

What it solved: Manual instruction-file attachment was a dead pattern. Lifecycle hooks gave Plan Forge the ability to enforce rules at file-edit time rather than relying on the agent to remember to load them.

v2.18 — Temper Guards, Warning Signs, Context Fuel (April 2026)

What shipped: Each instruction file gained two new sections, Temper Guards documenting the specific shortcuts agents take that produce compiling but architecturally broken code, and Warning Signs listing observable anti-patterns reviewers can grep for. Context Fuel instruction file taught agents to manage their own context budgets.

Inflection point: Agent-skills analysis revealed a class of failure that previous guardrails missed, the model would write code that compiled, passed tests, and looked plausible while violating an architectural principle nobody had thought to forbid explicitly. Temper Guards captured these as named anti-patterns; Warning Signs gave reviewers a way to detect them.

What it solved: The "looks correct, is structurally wrong" failure mode. Compiling code is not architecturally compliant code.

v2.83 — Host-Aware Routing, Quorum Estimator, Complexity Rubric (May 2026)

What shipped: Host-aware model routing (subscription-vs-API billing surface awareness), forge_estimate_quorum tool for tool-backed cost projection across all four quorum modes, and the documented complexity scoring rubric (scoreSliceComplexity()) with seven weighted signals. See Host-Aware Routing and Estimating Quorum Cost.

Inflection point: Quorum cost was previously hand-computed by agents, and observed to overshoot reality by an order of magnitude. The estimator tool replaced chat math with measured projection. Host-aware routing fixed the silent-double-pay failure mode where gpt-* models on Claude Code or Cursor would bill the user's pay-per-token API instead of their existing subscription.

What it solved: Cost surprise. Both the quorum overhead surprise (estimator) and the host billing surprise (routing).

v2.95 — Lattice / Code-Graph Indexing (May 2026)

What shipped: Phase Lattice introduced tree-sitter-based code chunking, code-graph indexing, and the forge_lattice_* tool family (index, query, callers, blast, stat). Anvil caching for cost-effective re-indexing. Hallmark provenance tracking on every chunk. v3.5.1 added camelCase-aware relevance ranking via scoreChunk() / tokenizeForSearch().

Inflection point: Plan Forge could now reason about the user's actual codebase architecture, not just plans and instructions. Searching getUserById returns the function, its callers, and its blast radius across the repository. This made auto-generated plans architecture-aware: a slice that touches a hub function gets flagged as high-blast-radius before execution.

What it solved: Plans that looked safe in isolation but rippled unexpectedly. Pre-Lattice, the agent had to grep its way to architectural awareness slice by slice.

v3.0 — Copilot Integration Trilogy (May 2026)

What shipped: Three sync surfaces, completed in three consecutive releases. pforge sync-spaces (v2.98) generates Copilot Spaces from forge plans and principles. forge_sync_memories (v2.99) writes .github/copilot-memory-hints.md from cross-tool memory. forge_sync_instructions (v3.0) generates .github/copilot-instructions.md from project profile, project principles, extra instruction files, and .forge.json config.

Inflection point: Copilot became a first-class citizen of the Plan Forge ecosystem, not just one of several agent surfaces. Every Copilot conversation now opens with project-specific guidance auto-loaded by the platform, no manual setup, no forgotten attachments. This collapsed the onboarding gap for the largest installed base of any AI coding agent.

What it solved: Copilot users were getting generic guidance because copilot-instructions.md was hand-written or absent. The sync trilogy made the file always up to date and always reflective of the actual project's profile, principles, and configuration.

v3.2–3.4 — Team Mode (May 2026)

What shipped: Three releases focused on multi-developer awareness. v3.2 added .forge/team-activity.jsonl (shared run log), the forge_team_activity MCP tool, and pforge team activity. v3.3 added pforge github review delegate, when a slice produces a PR, an issue assigned to @copilot is filed with a structured review checklist, and the Copilot Coding Agent posts findings back on the PR. v3.4 added the Team tab in the dashboard with per-operator cards, success rates, costs, and a conflict-risk banner.

Inflection point: Plan Forge stopped being a solo tool. Teams running parallel plan executions against the same repository could now see who was working on what, get reviewer attention from the Copilot Coding Agent without a human handoff, and detect coordination risk before two developers stepped on each other's slices.

What it solved: The "two of us hit the same file" failure mode. And the "I shipped a PR but nobody reviewed it" failure mode.

v3.6 — OpenBrain Promotion / L3 Memory Made Loud (May 2026, current)

What shipped: OpenBrain, the optional cross-session semantic memory backend, was reframed from a row-5 "optional extension" to L3 memory layer with a clear on-ramp at every install touchpoint. pforge smith now always reports L3 status. setup.ps1 / setup.sh prompt for OpenBrain install at the end of the flow (auto-suppressed in CI). New pforge brain {status, hint, test, replay} subcommands. README gains a numbered Step 3 "Enable Persistent Memory" with four deploy options. The if (openBrainConfigured) gating did not change, Plan Forge still works perfectly without it. See Memory Architecture on GitHub.

Inflection point: OpenBrain hooks were already wired into 28 MCP tools, 4 search-before-acting prompts, Reflexion lessons, Auto-skills, and cross-project Federation, but every one was gated and silently no-op'd otherwise. Users who didn't know to install OpenBrain were getting Plan Forge's L1 (Hub events) plus L2 (.forge/*.jsonl durable files) memory but no persistent semantic memory across sessions. The inner loop that makes the agent improve over time was effectively dark. v3.6 made the L3 layer discoverable without changing any soft-fail behavior.

What it solved: The "Plan Forge isn't getting smarter over time" failure mode. Without L3, Reflexion lessons, Auto-skills, and postmortem learnings had nowhere durable to live across sessions.