Act IV, Learn · Chapter 24

The Testbed

A separate repo. A library of scenarios. End-to-end proof that the shop still works.

New here? Read this first. Unit tests check one function. Integration tests check one service. Neither tells you whether Plan Forge itself still works end-to-end on a real codebase, the way you'd actually use it. The Testbed solves that. It's a separate sandbox repo (a real .NET app called TimeTracker) that Plan Forge uses as a punching bag: replay a known scenario, see if the full pipeline produces a clean shippable outcome, record what broke.

Why a separate repo? So Plan Forge can break things, commit, revert, and try again, without ever touching your real project.
Why a library of scenarios? Each scenario is a JSON file describing a known regression (e.g. “agent dropped a test file last release; catch it”). Run them all and you know the forge still holds.
Who needs this? You don't, day-to-day. The Testbed is mainly for Plan Forge maintainers and platform teams who want regression coverage of the tool itself. Skip ahead unless that's you.

Tool: forge_testbed_run. Scenarios: docs/plans/testbed-scenarios/*.json. Findings: docs/plans/testbed-findings/*.json. Requires testbed.path in .forge.json.
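
The testbed.path requirement can be illustrated with a small lookup. This is a minimal sketch, not Plan Forge's own loader: it assumes the dotted name maps to a nested object ({"testbed": {"path": ...}}) and that relative paths resolve against the project root. The function and class names are hypothetical; only the file name, the setting name, and the ERR_TESTBED_NOT_FOUND code come from this chapter.

```python
import json
from pathlib import Path


class ForgeTestbedError(Exception):
    """Illustrative exception carrying one of the chapter's error codes."""

    def __init__(self, code: str, detail: str):
        super().__init__(f"{code}: {detail}")
        self.code = code


def resolve_testbed_path(project_root: Path) -> Path:
    """Return the testbed directory named by testbed.path in .forge.json.

    Assumption: the setting is stored as {"testbed": {"path": "..."}}.
    """
    config_file = project_root / ".forge.json"
    if not config_file.is_file():
        raise ForgeTestbedError("ERR_TESTBED_NOT_FOUND", f"{config_file} does not exist")

    config = json.loads(config_file.read_text(encoding="utf-8"))
    testbed = config.get("testbed")
    raw = testbed.get("path") if isinstance(testbed, dict) else None
    if not isinstance(raw, str) or not raw:
        raise ForgeTestbedError("ERR_TESTBED_NOT_FOUND", "testbed.path is not set")

    # Assumption: a relative path is interpreted relative to the project root.
    candidate = Path(raw)
    if not candidate.is_absolute():
        candidate = project_root / candidate
    if not candidate.is_dir():
        raise ForgeTestbedError("ERR_TESTBED_NOT_FOUND", f"{candidate} is not a directory")
    return candidate.resolve()
```

Calling resolve_testbed_path(Path.cwd()) from the root of the consuming project either returns the testbed directory or raises with the code that the Common Errors table lists for a missing or invalid path.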

Why a Separate Testbed?

Unit tests cover one module; integration tests cover one service. Neither tells you whether the full Plan Forge pipeline still produces a clean, shippable outcome on a real repo under a real scenario. The Testbed does. It is a second, dedicated repository that Plan Forge treats as a read-write fixture: it replays a scenario against that repository and records the resulting defect log.

Learn-by-Doing: The Reference Testbed

The canonical reference testbed lives at srnichols/plan-forge-testbed. It's a real .NET 10 application called TimeTracker: a billable-hours tracker with Clients, Projects, Time Entries, Billing, Invoices, and Dashboard surfaces. It serves as the worked example throughout this manual.

If you're learning Plan Forge by doing, work through it in this order:

1. Backend slices (docs/plans/Phase-1-CLIENTS-CRUD-PLAN.md): see how pforge run-plan drives a four-slice CRUD feature with [P] parallelism, [depends:], [scope:], and validation gates.
2. UI slices (docs/plans/Phase-2-WEB-UI-PLAN.md): Plan Forge builds a Blazor Server + Microsoft Fluent UI front end against the existing REST API. The plan demonstrates that pforge produces enterprise-grade UI: layered (page → service interface → repository, never DbContext in components), accessible (WCAG 2.1 AA), and tested (bUnit). This is the proof artifact for "pforge does not vibe-code."
3. Operational scenarios (docs/plans/testbed-scenarios/*.json): the synthetic regressions in the section below, replayed end-to-end via forge_testbed_run.

The .NET preset ships three artifacts that make Step 2 (the UI slices) work on any consuming project; they are not testbed-specific:

Artifact	Path	Purpose
Instruction file	`.github/instructions/blazor-fluent-ui.instructions.md`	Auto-loads on `*.razor` edits. Forbids `DbContext` in components, mandates code-behind split, lifecycle discipline, accessibility checklist.
Reviewer agent	`.github/agents/blazor-reviewer.agent.md`	Read-only audit of UI changes for layer violations, lifecycle bugs, and Fluent UI misuse.
Skill	`.github/skills/ui-scaffold/SKILL.md`	`/ui-scaffold <Entity> --crud` generates the page + DTO + service interface + bUnit test in one shot, enforcing the layering rules.

Why a UI demo? Backend slices are easy to make look impressive: they're terse and type-safe, and their gates are straightforward. UI is where vibe-coding usually wins on speed and loses on quality. The Phase-2 UI plan exists to demonstrate that Plan Forge produces UI you'd actually deploy: separation of concerns intact, no DbContext in .razor, every page accessible, every component tested.

Scenario Fixtures

Scenarios are JSON files under docs/plans/testbed-scenarios/. Each one describes:

Initial state: branch, commit, known-good baseline.
Instructions: the prompt or plan the agent will execute.
Expected artifacts: which files must change and which must not.
Gates: build, test, lint, drift thresholds.
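
To make the four parts concrete, here is a hypothetical scenario expressed as a Python dictionary, plus a check for the "must change / must not change" rule. The chapter does not give the real schema, so every key name (initialState, mustChange, maxDrift, and so on), the placeholder commit, and the helper functions are illustrative assumptions.

```python
from fnmatch import fnmatch

# Hypothetical scenario; the key names are assumptions, not the real schema.
EXAMPLE_SCENARIO = {
    "initialState": {"branch": "main", "commit": "<pinned-sha>"},
    "instructions": "docs/plans/Phase-1-CLIENTS-CRUD-PLAN.md",
    "expectedArtifacts": {
        "mustChange": ["src/*"],
        "mustNotChange": ["tests/*"],
    },
    "gates": {"build": True, "test": True, "lint": True, "maxDrift": 0.1},
}

REQUIRED_SECTIONS = ("initialState", "instructions", "expectedArtifacts", "gates")


def missing_sections(scenario: dict) -> list[str]:
    """List the top-level sections a scenario file is missing."""
    return [key for key in REQUIRED_SECTIONS if key not in scenario]


def check_artifacts(changed_files, must_change, must_not_change) -> list[str]:
    """Compare changed paths with the expectations; an empty list means pass."""
    violations = []
    for pattern in must_change:
        if not any(fnmatch(path, pattern) for path in changed_files):
            violations.append(f"expected a change matching {pattern!r}")
    for pattern in must_not_change:
        for path in changed_files:
            if fnmatch(path, pattern):
                violations.append(f"{path} changed but matches protected {pattern!r}")
    return violations
```

With these patterns, a run that deletes or edits anything under tests/ is reported as a violation, which is the shape of the "agent dropped a test file" regression. Note that fnmatch lets * match across directory separators, so src/* also covers nested paths.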

A scenario is idempotent: the Testbed resets the fixture repo to the pinned commit before every run.
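
How the Testbed performs the reset is not spelled out here. One plausible way to get the same effect with plain git is a hard reset to the pinned commit followed by removal of untracked files, sketched below. The function names are hypothetical, and the command runner is injectable so the sequence can be inspected without touching a real repository.

```python
import subprocess
from pathlib import Path


def reset_commands(commit: str) -> list[list[str]]:
    """Git commands that return a working tree to a pinned commit."""
    return [
        ["git", "reset", "--hard", commit],  # discard tracked changes, move HEAD
        ["git", "clean", "-fd"],             # delete untracked files and directories
    ]


def reset_fixture(repo: Path, commit: str, run=subprocess.run) -> None:
    """Run the reset inside the fixture repo; raises if a git command fails."""
    for argv in reset_commands(commit):
        run(argv, cwd=repo, check=True)
```

Both commands are destructive, which is one reason to keep the fixture in its own repository. Running the same scenario twice starts from the same tree each time, which is what makes the replay repeatable.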

Anatomy of a Run

forge_testbed_run performs these steps in order:

Acquires .forge/testbed.lock (one scenario at a time per testbed).
Verifies the testbed is clean (ERR_TESTBED_DIRTY if not).
Replays the scenario end-to-end in the testbed directory.
Captures artifacts, run metrics, and any defects.
Writes a finding JSON under docs/plans/testbed-findings/ and emits testbed-scenario-completed.
Releases the lock.
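
The sequence above can be sketched as a single function. This is an illustration of the control flow only, not the tool's source: the cleanliness check, the replay, and the event emitter are passed in as hypothetical callables, the lock location is supplied by the caller, and the finding's field names are assumptions. The error codes and the event name are the ones this chapter uses.

```python
import json
import os
import time
from pathlib import Path
from typing import Callable


class ForgeTestbedError(Exception):
    """Illustrative exception carrying one of the chapter's error codes."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def run_scenario(
    testbed: Path,
    lock_path: Path,
    scenario_id: str,
    findings_dir: Path,
    is_clean: Callable[[Path], bool],
    replay: Callable[[Path], dict],
    emit: Callable[[str, dict], None],
) -> Path:
    # Acquire the lock. O_EXCL makes creation atomic, so a second run
    # fails here instead of racing the first.
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise ForgeTestbedError("ERR_TESTBED_LOCKED") from None

    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))

        # Refuse to run over uncommitted work in the testbed.
        if not is_clean(testbed):
            raise ForgeTestbedError("ERR_TESTBED_DIRTY")

        # Replay the scenario and capture what it reports.
        started = time.monotonic()
        outcome = replay(testbed)
        finding = {
            "scenario": scenario_id,
            "durationSeconds": round(time.monotonic() - started, 3),
            **outcome,
        }

        # Write the finding, then announce completion.
        findings_dir.mkdir(parents=True, exist_ok=True)
        finding_file = findings_dir / f"{scenario_id}.json"
        finding_file.write_text(json.dumps(finding, indent=2), encoding="utf-8")
        emit("testbed-scenario-completed", finding)
        return finding_file
    finally:
        # Release the lock even when a step above raised.
        lock_path.unlink(missing_ok=True)
```

The lock is acquired outside the try/finally on purpose: a run that fails to acquire it must not delete a lock held by another run. A process that is killed outright never reaches the finally clause, which is how a stale lock file can be left behind.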

Common Errors

Code	Meaning	Recovery
`ERR_TESTBED_NOT_FOUND`	`testbed.path` missing or invalid	Set it in `.forge.json`
`ERR_TESTBED_DIRTY`	Uncommitted changes in the testbed	Commit or stash inside the testbed repo
`ERR_TESTBED_LOCKED`	Another scenario is running	Wait, or remove a stale `.forge/testbed.lock`

Feedback Into the Loop

Findings with defects feed two consumers:

Bug Registry: scanner-eligible defects auto-register via forge_bug_register.
Health DNA: run metrics (duration, gate failures, drift score) feed the daily Health DNA fingerprint.
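
The split between the two consumers can be pictured as a small routing function. The finding layout used here (a defects list with a scannerEligible flag, and a metrics object with duration, gateFailures, and driftScore) is an assumption made for illustration; the chapter does not define the finding schema.

```python
def route_finding(finding: dict) -> tuple[list[dict], dict]:
    """Split one finding into defects for the Bug Registry and metrics for Health DNA.

    All field names are hypothetical.
    """
    # Only defects flagged as scanner-eligible are candidates for registration.
    bugs = [d for d in finding.get("defects", []) if d.get("scannerEligible")]

    # Metrics are passed along whether or not the run produced defects.
    source = finding.get("metrics", {})
    metrics = {key: source.get(key) for key in ("duration", "gateFailures", "driftScore")}
    return bugs, metrics
```

In a real pipeline the first return value would be handed to forge_bug_register and the second to whatever aggregates the daily fingerprint; both hand-offs are left out here because their interfaces are not described in this chapter.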

Testbed ≠ CI. Your CI system runs against pull requests and controls the green/red light for merges. The Testbed runs against Plan Forge itself, under a library of synthetic scenarios, to ensure the pipeline still produces shippable code across upgrades.