Cost & Economics
How Plan Forge prices LLM calls, where token costs come from, the three sources of truth, per-quorum-mode economics, cost-effective workflow patterns, and the anti-lock-in commitments that keep your provider bill yours, never marked up, never proxied, never withheld.
forge_estimate_quorum for projections and forge_cost_report for actuals. If you're a UI building a quorum picker, populate it from forge_estimate_quorum, do not invent dollar amounts.
Orientation
Plan Forge has no Plan Forge bill. It has your provider bill, plus the orchestrator's bookkeeping to tell you what fraction of that bill belongs to which slice, which plan, and which model. Three things follow from that:
- Bring-your-own-key. Plan Forge calls Anthropic, OpenAI, Google, and xAI directly using your keys. There is no relay, no proxy, no aggregator endpoint. (Azure OpenAI is also supported, same direct call, your AOAI deployment.)
- No markup. The orchestrator records what the provider charged you. Plan Forge adds zero margin. The numbers in
forge_cost_reportare the numbers on your provider invoice. - Per-slice attribution. Every LLM call is tagged with the run id, slice number, role (worker / reviewer / quorum responder / forge-master / etc.), and provider. You can answer "which slice cost the most" from the audit trail, not from guessing.
Three sources of truth
Cost numbers in Plan Forge come from exactly three places. Knowing which is which prevents the common confusion between "what a slice will cost" and "what it did cost."
| Source | Answers | How to read it |
|---|---|---|
MODEL_PRICING tablepforge-mcp/cost-service.mjs | "What does a given model charge per million input / output / cache tokens?" | Static table, updated when providers publish new prices. Each entry cites its _source URL with date. Cache, flex, priority, and AOAI deployment multipliers are encoded alongside the base rates. |
forge_estimate_quorum · forge_estimate_slice | "What will this plan / slice cost before I run it?" | Token-aware projections. Walks each slice, projects worker tokens by file size + scope, projects quorum panel by mode. Returns four-mode breakdown (auto / power / speed / disabled) for plans. |
forge_cost_report | "What did Plan Forge actually charge to my providers?" | Aggregates .forge/cost-history.json, one record per LLM call with run id, slice, role, model, tokens, ticks (xAI exact-cost), and dollar amount. Roll up by day / month / model / role. |
Cost drivers
The variables that move your bill, ranked roughly by how much leverage they have.
| Driver | Range | How to manage it |
|---|---|---|
| Model tier | ~50× spread between flagship and nano (claude-opus-4.7 $5/$25 vs gpt-5-nano $0.05/$0.40 per 1M tokens) | Use cheaper models for code-search / classification / routing. Reserve flagships for hard reasoning slices. The auto quorum mode does this automatically. |
| Token volume per slice | 1K (small CRUD) to 200K (large refactor with broad context) | Tighten scope contracts. A slice that touches 4 files costs ~10× less than one that touches 40, even with the same logic. Split fat slices. |
| Quorum panel size | 1 model (disabled) to 5+ models (power mode) | Use auto by default; opt into power only for high-stakes or low-confidence decisions. See per-quorum-mode economics. |
| Cache reuse | 1.0× (no cache) down to 0.10× (Anthropic / OpenAI cache read) | Plan Forge prompts the same system blocks across slices in a run, which providers cache. No action needed, just don't restart the run between slices unnecessarily. |
| Reasoning tokens (o-series, GPT-5 reasoning) | Often 5–20× visible output | Reasoning tokens are billed at the output rate and already counted in output_tokens, don't double-count when estimating. Use reasoning models only when the slice needs them. |
| Retries & escalation | 1× (clean pass) to 3–5× (full escalation chain) | Tighten validation gates so first-pass success rate climbs. The Inner Loop's reviewer calibration is designed for this, see Chapter 14 deep dive — The Inner Loop. |
| AOAI deployment type | 1.0× (global / provisioned) to 1.1× (data-zone / regional) | Use global Azure OpenAI deployments unless data-residency requires otherwise. The 10% uplift is encoded in aoai_deployment_type_multiplier. |
| Priority / flex tier (GPT-5.x) | 0.5× (flex) to 2.0× input / 1.5× output (priority) | Flex is fine for batch / offline runs; priority is rarely worth it for plan execution. Default tier is standard. |
Estimate vs actuals
Before running a plan, get a projection. After running, audit the actual.
Before the run: forge_estimate_quorum
// MCP
forge_estimate_quorum({ plan: "docs/plans/Phase-NN.md" })
// REST
POST /api/tool/forge_estimate_quorum
{ "plan": "docs/plans/Phase-NN.md" }
// CLI
pforge run-plan --estimate docs/plans/Phase-NN.md
Returns a per-slice token projection plus four-mode totals:
{
"plan": "Phase-NN",
"slices": [
{ "n": 1, "name": "Add user_profiles table", "projectedTokens": 8400, "modelTier": "mid", … },
…
],
"modes": {
"auto": { "totalUsd": 0.42, "breakdown": { /* per-model */ } },
"power": { "totalUsd": 1.85, "breakdown": { /* per-model */ } },
"speed": { "totalUsd": 0.18, "breakdown": { /* per-model */ } },
"disabled": { "totalUsd": 0.09, "breakdown": { /* per-model */ } }
}
}
The picker UI in the dashboard uses exactly this payload. If you're building your own UI, populate it the same way, the four-mode table is the single source of truth for "what does this cost?"
After the run: forge_cost_report
// MCP
forge_cost_report({ runId: "run-2026-05-18-091234" }) // one run
forge_cost_report({ scope: "month" }) // current month
forge_cost_report({ scope: "month", groupBy: "model" }) // monthly by model
// REST
GET /api/cost/report?runId=…
GET /api/cost/report?scope=month&groupBy=model
Returns the actual provider-billable amounts pulled from .forge/cost-history.json. Group by model, role, day, or slice. For runs that included xAI calls, the dollar amounts use the provider's exact-cost ticks (1 tick = $1×10-10) rather than multiplier math, what you see is what xAI billed.
Per-quorum-mode economics
Quorum mode is the biggest single cost lever after model tier. Plan Forge ships four modes:
| Mode | Panel | Threshold | Cost shape | When |
|---|---|---|---|---|
auto (default) | Dynamic: 2–3 models picked by intent class | Majority of responders | ~3× single-model | Most plans. Cost-effective and adequate for most decisions. |
power | 4–5 flagship models (Opus, GPT-5.5, Gemini Pro, Grok 4.x) | 5 | ~8–12× single-model | Architectural decisions, plan hardening (Session 1), high-stakes refactors. |
speed | 4–7 fast / cheap models (mini / nano tier) | 7 | ~1.5–3× single-model | High-volume CI runs, batch classifications, when latency > depth. |
disabled (--no-quorum) | 1 model | n/a | 1× (baseline) | Solo dev, trivial slices, dev-loop iteration. |
forge_estimate_quorum, never from chat math. The ratios above are approximate and shift with model availability; the tool always returns current numbers.
Cost-effective workflows
Patterns that have been observed to reduce spend without hurting outcomes.
Right-size slices
A slice that costs $0.50 to succeed is dramatically cheaper than one that costs $3 to fail and $2 to retry. The smaller the slice, the higher the first-pass success rate, the lower the total cost. The Crucible's plan-hardening pass (Session 1) is designed to split slices that are too fat, trust it. Target: 1–4 files per slice, 1 conceptual change per slice.
Let auto-quorum route models
The auto mode classifies each slice into "search-like" / "transform-like" / "reason-like" and routes to a model tier accordingly. Hardcoding a flagship via --model often costs 10× more for no measurable quality gain on routine slices.
Tighten validation gates
Loose gates pass bad work; bad work triggers retries; retries cost money. Strict, fast-to-execute gates (the Inner Loop, reviewer calibration target ~90–95% precision) catch failures on the first attempt and avoid the retry tax. The Inner Loop's forge_validate and forge_sweep are designed for exactly this trade.
Don't fight the cache
Provider caches give 10× savings on cached input. Plan Forge structures prompts so the system block, scope contract, and slice instructions are stable across slices in a run, providers cache the prefix automatically. Restarting the orchestrator between slices throws this away. Run plans end-to-end when you can.
Quorum only when it matters
power mode at the wrong moment is the most common over-spend. Reserve it for: plan hardening (Session 1), architectural decisions, slices flagged with high blast radius. Routine execution, even of moderately complex slices, works fine on auto or disabled.
Anti-lock-in posture
Plan Forge's economic story is your bill stays yours. Concretely:
| Commitment | What it means |
|---|---|
| BYOK across providers | Anthropic, OpenAI, Google, xAI, Azure OpenAI, same code path, your keys. Switch providers by changing env vars; no migration tool needed. |
| No proxy layer | The orchestrator calls the provider's public API directly. There is no Plan Forge endpoint in the data path. Outage isolation: Plan Forge can't take you down, only your provider can. |
| No usage telemetry | Plan Forge does not phone home with your token counts. The cost history lives in .forge/cost-history.json on your machine and stays there unless you explicitly export it. |
| Symmetric provider treatment | Adding a new provider takes ~30 lines in pforge-mcp/cost-service.mjs + a route adapter. No provider is privileged; the pricing table is open-source. |
| Open-source pricing table | MODEL_PRICING is in the repo with _source URLs. If you don't believe a rate, click the source. If a rate is wrong, file a PR. |
| Easy export | forge_cost_report exports JSON or CSV. Your run history is portable to any BI tool. No data lock-in. |
| Skill / plan files are portable | SKILL.md and plan markdown are vendor-neutral text. Moving to a different agent runtime (Claude Code, Cursor, raw API scripts) preserves your investment. |
Forecasting at scale
For teams or CI use, forge_cost_report aggregates roll up cleanly. Group by the dimension you want to forecast against and feed the result into your spreadsheet, BI tool, or dashboard of choice:
# Monthly spend by model
GET /api/cost/report?scope=month&groupBy=model
# Per-run breakdown (granular: every LLM call)
GET /api/cost/report?runId=run-2026-05-18-091234
# Last-30-day rollup by role (worker vs reviewer vs quorum vs forge-master)
GET /api/cost/report?scope=month&groupBy=role
Records come straight out of .forge/cost-history.json, one row per LLM call, with run id, slice, role, model, token counts, and dollar amount (or xAI ticks). The file is plain JSONL; you can pipe it through jq, import to DuckDB, or load to a spreadsheet without going through the API. Plan Forge does not enforce budgets, send alerts, or phone provider invoices, the data is yours; the policy is yours.
Worked example: a real slice
From the recent v3.6.2 manual-completion phase, slice B5 ship REST API reference appendix:
| Item | Value |
|---|---|
| Mode | auto quorum, 3 models on hardening, 1 model on execution |
| Files touched | 10 (1 new, 9 modified) |
| Worker input tokens | ~42,000 (system + scope + 9 referenced files at ~3K each) |
| Worker output tokens | ~6,400 (mostly the new rest-api-reference.html) |
| Cache hit on system block | Yes (Anthropic, 0.10× on ~3,200 cached tokens) |
| Validation passes | 2 (one failed on broken cross-refs, ~5K extra worker tokens to fix) |
| Total provider spend | ~$0.78 |
Equivalent power mode estimate | ~$6.20 (8× multiplier) |
Equivalent disabled estimate | ~$0.26 (single model, but expected reduction in reviewer-catch rate raised retry risk; auto was the right pick) |
The lesson: auto mode with right-sized slices and tight gates kept a 600-line appendix delivery under a dollar. The estimator predicted $0.71; actual was $0.78, a 10% miss attributable to the second validation pass, which the estimator does not yet model.
See also
- Appendix W — REST API: Cost, full endpoint surface for estimates and reports.
- Chapter 14 — Advanced Execution: Quorum & Multi-Agent, the runtime side of the mode picker.
- Inner Loop deep dive, reviewer calibration and gate tightening (the retry-tax mitigation).
- Forge-Master chapter, the read-only reasoning entrypoint that orchestrates cost-aware tool chains for open-ended queries.
- Appendix T —
.forge.jsonmodelRouting, default model selection that drives per-slice cost. - Appendix T —
.forge.jsonquorum, threshold and mode that govern panel size. - Appendix U — Provider API Keys, the env vars per provider.
- Appendix V — Event Catalog, the WebSocket event stream that carries cost-recorded notifications during runs.
- Appendix N — Data flow, what leaves the box and to which provider (the privacy companion to cost).