A glowing brass balance scale on the workbench at the Plan Forge shop, one pan stacked with gold coins and a softly glowing amber ingot, the other pan balanced with a small parchment receipt and a tiny finished iron piece, a wooden abacus and an open leather ledger book to the side

Act II, Forge · Chapter 31

Cost & Economics

How Plan Forge prices LLM calls, where token costs come from, the three sources of truth, per-quorum-mode economics, cost-effective workflow patterns, and the anti-lock-in commitments that keep your provider bill yours, never marked up, never proxied, never withheld.

Never hand-compute quorum costs in chat. Hand-computed quorum estimates have been observed to overshoot reality by an order of magnitude. Always call forge_estimate_quorum for projections and forge_cost_report for actuals. If you're a UI building a quorum picker, populate it from forge_estimate_quorum, do not invent dollar amounts.

Orientation

Plan Forge has no Plan Forge bill. It has your provider bill, plus the orchestrator's bookkeeping to tell you what fraction of that bill belongs to which slice, which plan, and which model. Three things follow from that:

Bring-your-own-key. Plan Forge calls Anthropic, OpenAI, Google, and xAI directly using your keys. There is no relay, no proxy, no aggregator endpoint. (Azure OpenAI is also supported, same direct call, your AOAI deployment.)
No markup. The orchestrator records what the provider charged you. Plan Forge adds zero margin. The numbers in forge_cost_report are the numbers on your provider invoice.
Per-slice attribution. Every LLM call is tagged with the run id, slice number, role (worker / reviewer / quorum responder / forge-master / etc.), and provider. You can answer "which slice cost the most" from the audit trail, not from guessing.

Three sources of truth

Cost numbers in Plan Forge come from exactly three places. Knowing which is which prevents the common confusion between "what a slice will cost" and "what it did cost."

Source	Answers	How to read it
`MODEL_PRICING` table pforge-mcp/cost-service.mjs	"What does a given model charge per million input / output / cache tokens?"	Static table, updated when providers publish new prices. Each entry cites its `_source` URL with date. Cache, flex, priority, and AOAI deployment multipliers are encoded alongside the base rates.
`forge_estimate_quorum` · `forge_estimate_slice`	"What will this plan / slice cost before I run it?"	Token-aware projections. Walks each slice, projects worker tokens by file size + scope, projects quorum panel by mode. Returns four-mode breakdown (auto / power / speed / disabled) for plans.
`forge_cost_report`	"What did Plan Forge actually charge to my providers?"	Aggregates `.forge/cost-history.json`, one record per LLM call with run id, slice, role, model, tokens, ticks (xAI exact-cost), and dollar amount. Roll up by day / month / model / role.

Why three? The pricing table is the contract (rates per token), the estimate tools are the forecast (rates × projected tokens), and the cost report is the actual (rates × observed tokens with cache hits, retries, and provider rounding folded in). Estimates and actuals will differ, the cost report is always authoritative.

Cost drivers

The variables that move your bill, ranked roughly by how much leverage they have.

Driver	Range	How to manage it
Model tier	~50× spread between flagship and nano (`claude-opus-4.7` $5/$25 vs `gpt-5-nano` $0.05/$0.40 per 1M tokens)	Use cheaper models for code-search / classification / routing. Reserve flagships for hard reasoning slices. The `auto` quorum mode does this automatically.
Token volume per slice	1K (small CRUD) to 200K (large refactor with broad context)	Tighten scope contracts. A slice that touches 4 files costs ~10× less than one that touches 40, even with the same logic. Split fat slices.
Quorum panel size	1 model (disabled) to 5+ models (power mode)	Use `auto` by default; opt into `power` only for high-stakes or low-confidence decisions. See per-quorum-mode economics.
Cache reuse	1.0× (no cache) down to 0.10× (Anthropic / OpenAI cache read)	Plan Forge prompts the same system blocks across slices in a run, which providers cache. No action needed, just don't restart the run between slices unnecessarily.
Reasoning tokens (o-series, GPT-5 reasoning)	Often 5–20× visible output	Reasoning tokens are billed at the output rate and already counted in `output_tokens`, don't double-count when estimating. Use reasoning models only when the slice needs them.
Retries & escalation	1× (clean pass) to 3–5× (full escalation chain)	Tighten validation gates so first-pass success rate climbs. The Inner Loop's reviewer calibration is designed for this, see Chapter 14 deep dive — The Inner Loop.
AOAI deployment type	1.0× (global / provisioned) to 1.1× (data-zone / regional)	Use global Azure OpenAI deployments unless data-residency requires otherwise. The 10% uplift is encoded in `aoai_deployment_type_multiplier`.
Priority / flex tier (GPT-5.x)	0.5× (flex) to 2.0× input / 1.5× output (priority)	Flex is fine for batch / offline runs; priority is rarely worth it for plan execution. Default tier is standard.

Estimate vs actuals

Before running a plan, get a projection. After running, audit the actual.

Before the run: `forge_estimate_quorum`

// MCP
forge_estimate_quorum({ plan: "docs/plans/Phase-NN.md" })

// REST
POST /api/tool/forge_estimate_quorum
{ "plan": "docs/plans/Phase-NN.md" }

// CLI
pforge run-plan --estimate docs/plans/Phase-NN.md

Returns a per-slice token projection plus four-mode totals:

{
  "plan": "Phase-NN",
  "slices": [
    { "n": 1, "name": "Add user_profiles table", "projectedTokens": 8400, "modelTier": "mid", … },
    …
  ],
  "modes": {
    "auto":     { "totalUsd": 0.42, "breakdown": { /* per-model */ } },
    "power":    { "totalUsd": 1.85, "breakdown": { /* per-model */ } },
    "speed":    { "totalUsd": 0.18, "breakdown": { /* per-model */ } },
    "disabled": { "totalUsd": 0.09, "breakdown": { /* per-model */ } }
  }
}

The picker UI in the dashboard uses exactly this payload. If you're building your own UI, populate it the same way, the four-mode table is the single source of truth for "what does this cost?"

After the run: `forge_cost_report`

// MCP
forge_cost_report({ runId: "run-2026-05-18-091234" })   // one run
forge_cost_report({ scope: "month" })                    // current month
forge_cost_report({ scope: "month", groupBy: "model" })  // monthly by model

// REST
GET /api/cost/report?runId=…
GET /api/cost/report?scope=month&groupBy=model

Returns the actual provider-billable amounts pulled from .forge/cost-history.json. Group by model, role, day, or slice. For runs that included xAI calls, the dollar amounts use the provider's exact-cost ticks (1 tick = $1×10^-10) rather than multiplier math, what you see is what xAI billed.

Per-quorum-mode economics

Quorum mode is the biggest single cost lever after model tier. Plan Forge ships four modes:

Mode	Panel	Threshold	Cost shape	When
`auto` (default)	Dynamic: 2–3 models picked by intent class	Majority of responders	~3× single-model	Most plans. Cost-effective and adequate for most decisions.
`power`	4–5 flagship models (Opus, GPT-5.5, Gemini Pro, Grok 4.x)	5	~8–12× single-model	Architectural decisions, plan hardening (Session 1), high-stakes refactors.
`speed`	4–7 fast / cheap models (mini / nano tier)	7	~1.5–3× single-model	High-volume CI runs, batch classifications, when latency > depth.
`disabled` (`--no-quorum`)	1 model	n/a	1× (baseline)	Solo dev, trivial slices, dev-loop iteration.

Picker UIs MUST be tool-backed. If you're showing a quorum mode chooser with dollar amounts, those numbers come from forge_estimate_quorum, never from chat math. The ratios above are approximate and shift with model availability; the tool always returns current numbers.

Cost-effective workflows

Patterns that have been observed to reduce spend without hurting outcomes.

Right-size slices

A slice that costs $0.50 to succeed is dramatically cheaper than one that costs $3 to fail and $2 to retry. The smaller the slice, the higher the first-pass success rate, the lower the total cost. The Crucible's plan-hardening pass (Session 1) is designed to split slices that are too fat, trust it. Target: 1–4 files per slice, 1 conceptual change per slice.

Let auto-quorum route models

The auto mode classifies each slice into "search-like" / "transform-like" / "reason-like" and routes to a model tier accordingly. Hardcoding a flagship via --model often costs 10× more for no measurable quality gain on routine slices.

Tighten validation gates

Loose gates pass bad work; bad work triggers retries; retries cost money. Strict, fast-to-execute gates (the Inner Loop, reviewer calibration target ~90–95% precision) catch failures on the first attempt and avoid the retry tax. The Inner Loop's forge_validate and forge_sweep are designed for exactly this trade.

Don't fight the cache

Provider caches give 10× savings on cached input. Plan Forge structures prompts so the system block, scope contract, and slice instructions are stable across slices in a run, providers cache the prefix automatically. Restarting the orchestrator between slices throws this away. Run plans end-to-end when you can.

Quorum only when it matters

power mode at the wrong moment is the most common over-spend. Reserve it for: plan hardening (Session 1), architectural decisions, slices flagged with high blast radius. Routine execution, even of moderately complex slices, works fine on auto or disabled.

Anti-lock-in posture

Plan Forge's economic story is your bill stays yours. Concretely:

Commitment	What it means
BYOK across providers	Anthropic, OpenAI, Google, xAI, Azure OpenAI, same code path, your keys. Switch providers by changing env vars; no migration tool needed.
No proxy layer	The orchestrator calls the provider's public API directly. There is no Plan Forge endpoint in the data path. Outage isolation: Plan Forge can't take you down, only your provider can.
No usage telemetry	Plan Forge does not phone home with your token counts. The cost history lives in `.forge/cost-history.json` on your machine and stays there unless you explicitly export it.
Symmetric provider treatment	Adding a new provider takes ~30 lines in `pforge-mcp/cost-service.mjs` + a route adapter. No provider is privileged; the pricing table is open-source.
Open-source pricing table	`MODEL_PRICING` is in the repo with `_source` URLs. If you don't believe a rate, click the source. If a rate is wrong, file a PR.
Easy export	`forge_cost_report` exports JSON or CSV. Your run history is portable to any BI tool. No data lock-in.
Skill / plan files are portable	SKILL.md and plan markdown are vendor-neutral text. Moving to a different agent runtime (Claude Code, Cursor, raw API scripts) preserves your investment.

Forecasting at scale

For teams or CI use, forge_cost_report aggregates roll up cleanly. Group by the dimension you want to forecast against and feed the result into your spreadsheet, BI tool, or dashboard of choice:

# Monthly spend by model
GET /api/cost/report?scope=month&groupBy=model

# Per-run breakdown (granular: every LLM call)
GET /api/cost/report?runId=run-2026-05-18-091234

# Last-30-day rollup by role (worker vs reviewer vs quorum vs forge-master)
GET /api/cost/report?scope=month&groupBy=role

Records come straight out of .forge/cost-history.json, one row per LLM call, with run id, slice, role, model, token counts, and dollar amount (or xAI ticks). The file is plain JSONL; you can pipe it through jq, import to DuckDB, or load to a spreadsheet without going through the API. Plan Forge does not enforce budgets, send alerts, or phone provider invoices, the data is yours; the policy is yours.

Worked example: a real slice

From the recent v3.6.2 manual-completion phase, slice B5 ship REST API reference appendix:

Item	Value
Mode	`auto` quorum, 3 models on hardening, 1 model on execution
Files touched	10 (1 new, 9 modified)
Worker input tokens	~42,000 (system + scope + 9 referenced files at ~3K each)
Worker output tokens	~6,400 (mostly the new `rest-api-reference.html`)
Cache hit on system block	Yes (Anthropic, 0.10× on ~3,200 cached tokens)
Validation passes	2 (one failed on broken cross-refs, ~5K extra worker tokens to fix)
Total provider spend	~$0.78
Equivalent `power` mode estimate	~$6.20 (8× multiplier)
Equivalent `disabled` estimate	~$0.26 (single model, but expected reduction in reviewer-catch rate raised retry risk; auto was the right pick)

The lesson: auto mode with right-sized slices and tight gates kept a 600-line appendix delivery under a dollar. The estimator predicted $0.71; actual was $0.78, a 10% miss attributable to the second validation pass, which the estimator does not yet model.

Cost & Economics

Orientation

Three sources of truth

Cost drivers

Estimate vs actuals

Before the run: forge_estimate_quorum

After the run: forge_cost_report

Per-quorum-mode economics

Cost-effective workflows

Right-size slices

Let auto-quorum route models

Tighten validation gates

Don't fight the cache

Quorum only when it matters

Anti-lock-in posture

Forecasting at scale

Worked example: a real slice

See also

Before the run: `forge_estimate_quorum`

After the run: `forge_cost_report`