Frequently asked

Questions about cost per accepted change

Q: What counts as a change?

The default unit is a merged pull request to the production branch, size-normalized so one accepted change unit represents a comparable amount of substantive work. A merge of 1–500 lines counts as 1 unit; a larger merge of N lines counts as ⌈N / 500⌉ units. Lines changed is additions plus deletions, excluding vendored code, generated code, lockfiles, and bulk imports.

Q: What does "stayed there" mean?

A change has stayed in production if it survives its stayed-there window post-merge without being reverted or repaired. Reverts, rollbacks, defect-fixing rewrites, harm-neutralizing feature flags, and hotfixes invalidate acceptance. Building further on top of the change — extending it, refactoring, optimizing, adding tests or docs — does not.

Q: How long is the stayed-there window?

The recommended default is 30 days post-merge. Fast-moving teams may use 14 days; regulated industries and long-tail-failure systems may use 60–90 days. The window is evaluated per-change from each change’s own merge date, and must be applied consistently across all reporting periods.

Q: How is "acceptance" different from "correctness"?

Acceptance is observable; correctness is aspirational. CPAC measures whether a change survived contact with production for the window — a real, auditable signal. The defenses against counting incorrect-but-surviving changes are the stayed-there window (lengthen it to catch silent escapes) and the rework cost line in the numerator.

Q: Why not measure tokens or cost-per-token instead of dollars?

The numerator is denominated in dollars, not tokens, deliberately. Tokens are not comparable between models or vendors, shift within a vendor’s own model family (a tokenizer change can inflate token counts 12–34% at unchanged price), vary wildly for identical tasks, and price input and output differently. Dollars are the unit of business reality and absorb all of those upstream changes without breaking the metric.

Q: Why track each cost component separately?

The headline cost per accepted change is what you report; it is not what you investigate with. Track each of the five numerator components — model, infrastructure, engineering time, review, rework — as its own time series with absolute value, percent share, and delta vs prior window. A moving headline tells you something changed; only the components tell you what.

Q: Does CPAC work for non-code work?

Yes. The same shape applies wherever AI produces artifacts that must be reviewed before use — drafts, designs, recommendations, plans. The components rename naturally ("accepted recommendations," "accepted designs") and "stayed there" becomes "still in use after N days." This site focuses on code because that is where the literature and measurement infrastructure are best developed.

Q: How does CPAC relate to DORA metrics?

DORA metrics describe how a delivery system behaves — how often it ships, how fast, with what failure rate. CPAC describes what that behavior costs. They are complementary: change failure rate counts incidents, not the dollar cost of the rework those incidents demand. Two teams with an identical 8% change failure rate can have a roughly 2× difference in rework cost; CPAC sees it because rework is a numerator line.

Q: How does CPAC relate to FinOps?

CPAC is FinOps cost-to-serve moved one layer upstream. FinOps measures the unit cost of running software; CPAC measures the unit cost of producing trusted software. Organizations with mature FinOps practices already have most of the inputs — CPAC just rearranges them around the accepted-change denominator.

What counts as a change?

The default unit is a merged pull request to the production branch, normalized by size so that one accepted change unit represents a comparable amount of substantive accepted work. (Teams on non-PR workflows substitute the squash-merge commit or the merge commit that lands a logical change set; the unit must be applied consistently.)

The size-normalization rule:

A merged PR or commit of 1–500 lines changed counts as 1 unit.
A larger merge of N lines changed counts as ⌈ N / 500 ⌉ units.

Examples:

Lines changed	Accepted change units
120	1
500	1
800	2
1,800	4
5,000	10
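
A minimal Python sketch of the rule, reproducing the table above. The function name and the UNIT_SIZE constant are illustrative, not part of any published library; the zero-line case assumes that a merge with no countable lines (for example, binary-only) still counts as 1 unit.

```python
UNIT_SIZE = 500  # lines changed per accepted change unit


def accepted_change_units(lines_changed: int) -> int:
    """Return the size-normalized unit count for one merged change.

    `lines_changed` is additions + deletions after exclusions.
    Assumption: a merge with 0 countable lines still counts as 1 unit.
    """
    if lines_changed < 0:
        raise ValueError("lines_changed must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return max(1, (lines_changed + UNIT_SIZE - 1) // UNIT_SIZE)


for n in (120, 500, 800, 1800, 5000):
    print(n, accepted_change_units(n))
```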

"Lines changed" is the sum of additions and deletions, excluding vendored code, generated code, lockfiles, and bulk imports. Document any further exclusions when you publish or submit a number.

Why the 500-line threshold:

It tracks human verification capacity. The reviewer-comprehension cliff at ~200–400 lines is documented in the canonical code-review literature — Cisco's review of ~2,500 code reviews (Cohen, Best Kept Secrets of Peer Code Review, SmartBear, 2006) and subsequent SmartBear / Cisco analyses both find defect-detection efficiency dropping sharply past 200 LOC and effectively collapsing past 400. SmallTalk-style cultures push the threshold higher; mainstream engineering practice has converged on 200–400. 500 lines sits just past that cliff as a clean buffer that maps to a single "substantial change" most reviewers can hold in their head before splitting.
It catches the mega-merge. Without normalization, a team shipping one 5,000-line merge as a single "accepted change" gets the same denominator credit as a team shipping one 100-line bug fix — even though the former is an order of magnitude more accepted work.
It blocks the over-split gaming. Without a minimum unit size, teams could inflate the denominator by chopping trivial work into ever-smaller PRs. Below 500 lines, splitting buys no extra credit.

Teams may substitute another unit — accepted PR, deployed increment, shipped feature flag — as long as the size-normalization principle is preserved and applied consistently across the measurement window and across the baseline.

The unit is less important than the consistency. Whatever you pick, document it and stick with it. Switching units mid-window inflates or deflates the result.

How do I count lines changed from git?

Git's built-in tools handle this directly. Nothing third-party has displaced them for this specific job.

The aggregate, between two refs

git diff --shortstat A..B
# 47 files changed, 1832 insertions(+), 614 deletions(-)

git diff --numstat A..B
# <additions> <deletions> <path>     (machine-readable, one row per file)

Per-commit, across a date window

The $1 != "-" filter skips binary files (which git log --numstat outputs as - -). This is what you want for a LOC-based CPAC denominator — binary churn isn't reviewable in lines.

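# Explicit times avoid ambiguity about where a date-only bound falls within the day.
git log --since='2026-04-01 00:00:00' --until='2026-04-30 23:59:59' \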
  --pretty=tformat: --numstat \
  | awk '$1 != "-" {add+=$1; del+=$2} END {print add+del}'

Per-PR via the GitHub CLI (recommended for normalization)

The PR is the right unit for the 500-LOC normalization rule. The GitHub CLI returns additions and deletions for each PR; their sum is the one number per PR that the calculator library's normalizeChanges() helper expects:

gh pr list -R owner/repo --state merged \
  --search 'merged:2026-04-01..2026-04-30' \
  --json number,additions,deletions --limit 1000

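Two operational notes for these recipes: pass -R owner/repo when running outside the repo's working tree, and remember that GitHub search returns at most 1,000 results per query, so --limit 1000 is the effective ceiling when --search is used. For windows with more than 1,000 merged PRs, split the window into shorter date ranges (for example, week by week) and sum the results; paginating a single search query, whether through gh api graphql --paginate or the REST search API, stops at the same 1,000-result cap.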

One-liner that produces the normalized accepted-change unit count for the window. Each PR contributes max(1, ceil(LOC/500)) units — so a binary-only PR (additions + deletions = 0 from gh's perspective, since it counts line deltas only) still counts as 1 unit, matching the library's normalizeChanges(). The (add // 0) at the end guards against an empty list:

gh pr list -R owner/repo --state merged --search 'merged:2026-04-01..2026-04-30' \
  --json additions,deletions --limit 1000 \
  | jq '[.[] | (.additions + .deletions) | (. / 500) | ceil | if . < 1 then 1 else . end] | (add // 0)'

Excluding vendored, generated, lockfiles

Required for any defensible CPAC submission. Two approaches:

# Pathspec exclusions on the diff itself.
# Use `:(exclude,glob)` for path-spanning patterns — without the `glob`
# magic, `**` is not recognized as a cross-directory wildcard.
git diff --numstat A..B -- \
  ':(exclude)vendor' \
  ':(exclude,glob)**/*.lock' \
  ':(exclude)dist' \
  ':(exclude,glob)**/*.generated.*'

# Or mark in .gitattributes — GitHub's API honors these for PR stats
vendor/**         linguist-vendored
dist/**           linguist-generated
**/*.generated.*  linguist-generated

Other tools worth knowing about

cloc, scc, and tokei are excellent for total LOC snapshots by language but are not range-aware in the way diff calculations need. cloc does have a --diff mode for comparing two refs that excludes comments and whitespace, which can be useful if you want a stricter denominator than raw added+deleted. git-quick-stats and gitfame answer related but different questions (per-author dashboards, authorship attribution) and are not the right tools for computing CPAC inputs.

For repositories not hosted on GitHub, use git log --numstat with a per-commit aggregator and pipe the per-PR (or per-merge) line counts through the normalizeChanges() helper.
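
One possible shape for that per-commit aggregator, sketched in Python. It assumes a squash-merge workflow (one commit per logical change) and input produced by `git log --pretty=format:'commit %H' --numstat`; the function name and the exclusion patterns are hypothetical and should be adapted to the repository.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion patterns mirroring the pathspec examples; adjust per repo.
EXCLUDED = ("vendor/*", "dist/*", "*.lock", "*.generated.*")


def lines_changed_per_commit(numstat_log: str) -> dict[str, int]:
    """Sum additions + deletions per commit from `git log --numstat` text."""
    totals: dict[str, int] = {}
    current = None
    for line in numstat_log.splitlines():
        if line.startswith("commit "):
            current = line.split(" ", 1)[1].strip()
            totals[current] = 0
            continue
        parts = line.split("\t")
        if current is None or len(parts) != 3:
            continue  # blank lines or anything that is not a numstat row
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # binary file: git reports no line counts
        if any(fnmatchcase(path, pattern) for pattern in EXCLUDED):
            continue  # vendored, generated, or lockfile path
        totals[current] += int(added) + int(deleted)
    return totals


sample = (
    "commit aaa111\n"
    "120\t30\tsrc/app.py\n"
    "-\t-\tassets/logo.png\n"
    "900\t0\tvendor/lib/big.js\n"
    "\n"
    "commit bbb222\n"
    "400\t250\tsrc/core.py\n"
    "5\t5\tpoetry.lock\n"
)
print(lines_changed_per_commit(sample))
```

Each per-commit total can then go through the ⌈N / 500⌉ rule to produce accepted change units.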

What does "stayed there" mean?

A change has stayed in production if it survives its stayed-there window post-merge without being reverted or fixed. The principle: only follow-ups that effectively retract the original change count against it. Building further on top of the change is the signal that it succeeded, not failed.

These invalidate acceptance (subsequent change retracts or repairs the original):

Reverted via a follow-up commit, git revert, or operational rollback.
Substantively rewritten to address a defect, bug, or correctness issue the original introduced.
Disabled by feature flag to neutralize harm.
Hotfix patching the original to make it work as it should have on merge.

These do NOT invalidate acceptance (subsequent change builds on or extends the original):

Incremental improvement — extending the functionality, adding new features on top.
Refactoring for readability, structure, or maintainability (without changing observed behavior).
Performance optimization (unless fixing a regression the original introduced).
Test additions, documentation updates, comment edits.
Cosmetic follow-ups: formatting, renames, dependency bumps.
Iteration that takes the change further along its intended direction.

The judgment call: if a reasonable reviewer would describe the follow-up as "fixing what the original got wrong," it invalidates. If they would describe it as "building on what the original enabled," it does not. When the line is fuzzy, document the call in your methodology notes and apply it consistently across reporting windows.
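
The two lists can be encoded as a small lookup so the call is applied the same way every window. This is an illustrative sketch: the enum names and the function are hypothetical labels for the categories listed above, and the fuzzy cases still need a human decision before a follow-up is tagged.

```python
from enum import Enum


class FollowUp(Enum):
    REVERT = "revert"                  # follow-up commit, git revert, or rollback
    DEFECT_REWRITE = "defect_rewrite"  # substantive rewrite to fix a defect
    HARM_FLAG_OFF = "harm_flag_off"    # disabled by feature flag to neutralize harm
    HOTFIX = "hotfix"
    EXTENSION = "extension"            # incremental improvement or further iteration
    REFACTOR = "refactor"              # no change in observed behavior
    OPTIMIZATION = "optimization"
    TESTS_OR_DOCS = "tests_or_docs"
    COSMETIC = "cosmetic"              # formatting, renames, dependency bumps


INVALIDATING = frozenset(
    {FollowUp.REVERT, FollowUp.DEFECT_REWRITE, FollowUp.HARM_FLAG_OFF, FollowUp.HOTFIX}
)


def invalidates_acceptance(kind: FollowUp, fixes_regression_from_original: bool = False) -> bool:
    """Return True if a follow-up of this kind retracts or repairs the original."""
    if kind in INVALIDATING:
        return True
    # Performance work counts against the original only when it repairs
    # a regression the original introduced.
    return kind is FollowUp.OPTIMIZATION and fixes_regression_from_original


print(invalidates_acceptance(FollowUp.HOTFIX))
print(invalidates_acceptance(FollowUp.REFACTOR))
```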

How long is the stayed-there window?

The recommended default is 30 days post-merge. A change counts as accepted if, 30 days after it landed in the production branch, it is still in production in substantively its original form.

Acceptable alternatives:

Window	Best for	Tradeoff
14 days	Fast-moving teams, daily-or-better deploy cadence, low cost of incident response.	Catches acute reverts. Misses some second-sprint regressions.
30 days (recommended)	Most engineering organizations. Aligns with DORA change-failure-rate convention and monthly reporting cycles.	Catches the bulk of post-deploy issues. Adds ~1 month reporting lag.
60–90 days	Regulated industries, payment systems, infrastructure with long-tail failure modes.	High-confidence acceptance. Significant reporting lag; not suitable for monthly cadence.

The window is per-change: each change is evaluated from its own merge date, not from the reporting window's boundaries. This keeps the test fair — without it, a change merged at the start of the reporting window faces a harder survival test than one merged at the end.

Reporting lag is the cost of this discipline. With a 30-day survival window and a monthly cadence, the current month's cost per accepted change cannot be fully computed until 30 days after the month closes — the most recent merges still need to season. This is the same discipline as cohort retention measurement; it is a feature, not a bug. Teams that want closer-to-real-time signal use 14 days; teams that need higher-confidence acceptance use 60–90.

Whatever window you pick, apply it consistently across all reporting periods. Switching mid-stream invalidates the trend.
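
A sketch of the per-change evaluation and the resulting reporting lag. The function names and the three status labels are illustrative; treating an invalidation on the last day of the window as a rejection is an assumption to document in your methodology notes.

```python
from datetime import date, timedelta
from typing import Optional


def acceptance_status(
    merged_on: date,
    invalidated_on: Optional[date],
    as_of: date,
    window_days: int = 30,
) -> str:
    """Classify one change as 'accepted', 'rejected', or 'pending'.

    The window runs from the change's own merge date, not from the
    reporting period's boundaries.
    """
    window_end = merged_on + timedelta(days=window_days)
    if invalidated_on is not None and invalidated_on <= window_end:
        return "rejected"  # reverted or repaired inside its window
    if as_of < window_end:
        return "pending"  # still seasoning; cannot be counted yet
    return "accepted"


def earliest_final_date(period_end: date, window_days: int = 30) -> date:
    """First date on which every change merged in the period has a final status."""
    return period_end + timedelta(days=window_days)


print(acceptance_status(date(2026, 4, 1), None, as_of=date(2026, 4, 20)))
print(earliest_final_date(date(2026, 4, 30)))
```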

How is "acceptance" different from "correctness"?

Acceptance is observable. Correctness is aspirational. CPAC measures whether a change survived contact with production for the duration of the window — that is a real, auditable signal. A change can be incorrect in some absolute sense and still "stay there"; CPAC will count it.

The defense against this is the "stayed there" window and the rework cost line. Lengthening the window catches more silent escapes. Counting fix-up cost in the numerator catches the rest. CPAC is a defensible proxy for correctness, not a claim of perfect correctness.

What measurement window should I use?

Choose a window long enough that most rework would have surfaced. Two weeks is a minimum; four weeks is the practical default; one quarter is the most defensible.

Report CPAC for the same window length consistently. Quarter-over-quarter comparisons are sound; quarter-versus-sprint comparisons are not.

How do I track model cost per change?

The numerator's model-cost component is straightforward at the aggregate level — your LLM provider's billing dashboard gives you a monthly total. The harder part is attributing that spend to specific accepted changes, which is what lets you compare AI-assisted and non-AI work fairly.

One useful tool here is git-ai — a Git extension that automatically links every AI-written line to the agent, model, and transcripts that generated it. Combined with provider pricing, that lets you derive model spend per merged change rather than spreading total spend evenly across the team's output.

For teams without per-change attribution, a reasonable starting approach is to allocate the period's total model spend proportionally to AI-touched changes (as identified by your tooling or a survey of authors). Document the allocation method when you publish a number.
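
One way to implement that fallback, as a sketch. The text above does not specify the weighting; this version assumes spend is split in proportion to each AI-touched change's accepted change units, and the function name and PR identifiers are hypothetical.

```python
def allocate_model_spend(total_spend: float, ai_touched_units: dict[str, int]) -> dict[str, float]:
    """Split a period's model spend across AI-touched changes.

    Assumption: weights are accepted change units per change. An even
    split or a survey-based weighting would be equally valid if documented.
    """
    total_units = sum(ai_touched_units.values())
    if total_units == 0:
        return {}  # nothing to allocate against
    return {
        change_id: total_spend * units / total_units
        for change_id, units in ai_touched_units.items()
    }


# Hypothetical month: $1,200 of model spend, three AI-touched PRs.
print(allocate_model_spend(1200.0, {"pr-101": 1, "pr-102": 2, "pr-103": 3}))
```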

See the instrumentation guide for specific LLM proxy options (LiteLLM, Helicone, OpenLLMetry, Portkey), per-team API key attribution, and per-commit attribution via git-ai.

Why not measure tokens or cost-per-token instead of dollars?

The numerator's model-cost component is in dollars, not tokens. This is deliberate: token-based metrics are unreliable as a comparator across any timeframe longer than a single API call.

Tokens are not comparable between models or vendors

Each model family uses its own tokenization scheme: BPE variants, SentencePiece, vendor-specific encoders. The same English sentence requires dramatically different token counts on Claude vs GPT vs Gemini vs open-weight models. "Tokens per change" is meaningful within one model on one day; it is nearly meaningless in any other comparison.

Tokenization changes within a vendor's own model family

A model bump can shift token usage substantially even when the model is doing the same job. Claude Opus 4.7's new tokenizer is the worked example: Anthropic disclosed a 1.0–1.35× inflation range vs. 4.6 depending on content type, and independent measurements by OpenRouter and Simon Willison (April 2026) found ~12–27% more tokens on typical workloads and 32–34% on large (10K+ token) prompts — at unchanged per-token pricing. A team tracking "tokens per change" through that upgrade would have seen a phantom regression that had nothing to do with their delivery system. Vendors do not coordinate tokenization changes with measurement frameworks.

Identical tasks vary wildly in token count

The same logical change can consume very different token counts depending on:

Current context length — a 50,000-token codebase context vs a 5,000-token one for the same edit.
System prompts, which the vendor or your tooling may change without notice.
Attached MCP servers, tools, or function definitions that pad every request.
Whether prompt caching hit or missed (cache hits are roughly 10× cheaper for the same input).
Whether the model used reasoning / thinking tokens (often billed, often invisible to the user).
Multi-turn agent loops vs single-shot completions — an agent can make 50 model calls for what one completion does once.

Input and output tokens are priced very differently

Output tokens are typically 3–5× more expensive than input tokens. A "token count" that does not split the two is not even an internally consistent cost signal. Two operations with identical total token counts can cost wildly different amounts depending on the input/output split.
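
A short arithmetic sketch of that last point. The prices ($3 and $15 per million tokens, a 5× ratio) are hypothetical placeholders, not any vendor's actual rates.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 3.0,    # hypothetical $ per million input tokens
    output_price_per_mtok: float = 15.0,  # hypothetical $ per million output tokens
) -> float:
    """Dollar cost of one request, pricing input and output separately."""
    return (input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok) / 1_000_000


# Two operations with the same 10,000-token total but opposite splits.
read_heavy = request_cost(input_tokens=9_000, output_tokens=1_000)
write_heavy = request_cost(input_tokens=1_000, output_tokens=9_000)
print(f"read-heavy:  ${read_heavy:.3f}")
print(f"write-heavy: ${write_heavy:.3f}")
```

Under these assumed prices the write-heavy request costs more than three times as much, even though a combined token count would report the two as identical.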

Dollars are the unit of business reality

Your CFO does not negotiate with a token budget; they negotiate with a dollar one. By denominating in dollars, cost per accepted change tracks what you actually spend — and absorbs all the upstream changes (model version bumps, pricing updates, cache hit rates, multi-step strategies) without breaking the metric.

The token economy is an implementation detail vendors change unilaterally. Cost per accepted change measures the layer above it: the dollar cost of producing software your team actually kept.

Why track each cost component separately?

The headline cost per accepted change is what you report to the CFO. It is not what you investigate with. You also need to track each of the five numerator components — model, infrastructure, engineering time, review, rework — as its own time series, with absolute values, percent share of total, and Δ vs prior window.

The reason is diagnostic: a moving headline tells you something changed; it tells you nothing about what.

Worked example: the Claude 4.6 → 4.7 bump

Claude Opus 4.7 launched at the same dollar price per token as 4.6, but with a new tokenizer that consumes more tokens for the same content. Anthropic disclosed a 1.0–1.35× inflation range; independent measurements (OpenRouter, Simon Willison, April 2026) found 12–27% higher token usage on typical workloads, and 32–34% on large (10K+ token) prompts. For teams that adopted 4.7, the model-cost component in dollars rose by that range without any change in their own delivery system.

A team watching only the headline cost per accepted change would have seen one of two equally bad outcomes:

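The headline climbs, by the ~20% model-cost increase scaled by model cost's share of the numerator (about 6% if model cost is 30% of the total). The team faces an unanswerable question: did we get less productive, did the model get more expensive, or did something else degrade? Wasted investigation, missed root cause.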
The headline stays flat or drops. 4.7's per-task capability gains were sometimes enough to offset the extra tokens — fewer iterations per accepted change, so net spend stayed flat. The team congratulates itself on AI productivity, never knowing that vendor cost rose ~20% and only luck masked it. A signal procurement and leadership needed to hear, missed.

With component-level tracking, the same scenario reads instantly: model cost up ~20%, all other components flat. The conclusion is immediate — this is an upstream vendor change, not a delivery-system change. Talk to procurement; if you are getting offsetting capability gains, document them; if not, evaluate alternatives.
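
The size of the headline move in this scenario follows from simple arithmetic: when one component moves and everything else (including the accepted change unit count) stays flat, the headline moves by that component's share of the numerator times its change. A sketch, with hypothetical 30% and 70% model-cost shares:

```python
def headline_change(component_share: float, component_change: float) -> float:
    """Fractional headline change when one component moves and the rest stay flat.

    component_share:  the component's share of total cost before the move (0..1)
    component_change: the component's fractional change (0.20 means +20%)
    """
    return component_share * component_change


# Model cost up ~20%; the headline impact depends on how much of the numerator it is.
print(f"{headline_change(0.30, 0.20):.1%}")  # model cost is 30% of total
print(f"{headline_change(0.70, 0.20):.1%}")  # model cost is 70% of total
```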

What each component catches that the headline doesn't

Model cost moving — vendor pricing change, tokenization shift, model bump, cache hit/miss change, agent strategy change.
Infrastructure moving — observability investment, runner consolidation, CI/CD change.
Engineering time moving — hiring, attrition, role-mix shift, capacity change.
Review cost moving — review-process intervention, PR-sizing change, reviewer experience shift.
Rework moving — quality regression, test-coverage change, gating-policy change, AI scope expanding into riskier code.

A team watching only the headline can see something moved; it cannot tell which lever to pull. A team watching the components can.

What to record each window

The XLSX tracker template does this automatically. If you build your own dashboard, the minimum row per window is:

Headline cost per accepted change
Each of the five cost components, in absolute dollars
Each component as a percent of total cost
Δ vs prior window for the headline and each component
Accepted change unit count

The percent shares are particularly load-bearing: a team where rework has crept from 8% to 18% of total has a very different problem than one where it crept from 8% to 9%, even if the headline moved by the same amount in both cases.
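
For a home-built dashboard, the minimum row can be assembled as below. This is a sketch, not the tracker template's implementation: the function name, dictionary layout, and dollar figures are hypothetical, and it assumes the headline is the sum of the five components divided by accepted change units.

```python
from typing import Optional

COMPONENTS = ("model", "infrastructure", "engineering", "review", "rework")


def window_row(costs: dict, units: int, prior: Optional[dict] = None) -> dict:
    """Build one tracking row: headline, absolute costs, shares, and deltas."""
    total = sum(costs[name] for name in COMPONENTS)
    if units <= 0 or total <= 0:
        raise ValueError("units and total cost must be positive")
    row = {
        "units": units,
        "headline": total / units,
        "absolute": {name: costs[name] for name in COMPONENTS},
        "share": {name: costs[name] / total for name in COMPONENTS},
    }
    if prior is not None:
        # Fractional change vs the prior window (0.06 means +6%).
        row["delta_headline"] = row["headline"] / prior["headline"] - 1
        row["delta"] = {
            name: (costs[name] / prior["absolute"][name] - 1) if prior["absolute"][name] else None
            for name in COMPONENTS
        }
    return row


# Hypothetical figures: only the model component moves (+20%) between windows.
march = window_row(
    {"model": 3000, "infrastructure": 1000, "engineering": 4000, "review": 1200, "rework": 800},
    units=50,
)
april = window_row(
    {"model": 3600, "infrastructure": 1000, "engineering": 4000, "review": 1200, "rework": 800},
    units=50,
    prior=march,
)
print(april["headline"], april["delta_headline"], april["delta"]["model"])
```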

How do other AI metrics (tokens/sec, cost-per-request, latency) relate to cost per accepted change?

Most low-level AI metrics teams already track are diagnostic signals that feed into one of the five CPAC components. They are useful — just not as the bottom-line cost metric. The mapping:

Low-level metric	What it measures	CPAC component it surfaces in	When to watch it
Tokens per second (throughput)	Inference speed	Engineering time — slow inference is developers waiting	Developer-flow complaints; agent loops feel sluggish
Time to first token (TTFT)	Initial response latency	Engineering time; can drag per-suggestion acceptance rate	After model bumps, region changes, or provider switches
Cost per request	Per-API-call spend	Model cost	Capacity planning, vendor comparison (with caveat below)
Tokens per request	Input + output token usage	Model cost	After a model version bump, prompt change, or tool addition
Cache hit rate	Share of requests served from prompt cache	Model cost (cache hits are roughly 10× cheaper)	After tooling changes that affect prompt structure
Request volume	Total API calls per window	Model cost; infrastructure cost	Capacity planning; anomaly detection
Error / retry rate	Failed or retried requests	Model cost (failed requests still bill); engineering time (debug + retry)	After model bumps, agent-strategy changes, or new tools
Tool-call / agent-loop depth	Model calls per logical task	Model cost; sometimes engineering time	After agent architecture changes; when costs jump but task count doesn't
Per-suggestion acceptance rate	% of AI suggestions accepted by developers	Leading indicator for accepted change units — not a substitute	Always — pair with CPAC for trend explanation

The pattern: low-level metrics explain why a CPAC component moved. They are not substitutes for the metric itself, because no single low-level metric captures the fully-loaded production cost of trusted software.

Three operating modes

Day-to-day: monitor the low-level metrics that affect your dominant cost components. If model cost is 70% of the numerator, watch cache hit rate and tokens-per-request closely. If engineering time dominates, watch PR cycle time and TTFT.

Investigation: when cost per accepted change moves, decompose into components, then drill into the low-level metrics that feed the moving component. Headline → component → diagnostic.

Vendor comparison: cost-per-request and tokens-per-request are the right metrics for vendor benchmarking — but only alongside a measurement of how many requests or tokens each vendor uses to produce an accepted change. The tempting comparison ("Vendor A is $0.10/request, Vendor B is $0.15/request") is the one to resist; the meaningful comparison is "Vendor A produces an accepted change for $X, Vendor B for $Y." A 50%-cheaper-per-request vendor that takes 3× the iterations is more expensive in CPAC terms.
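
The comparison can be sketched as per-request price times requests per accepted change. The per-request prices are the ones quoted above; the request counts (30 and 10) are hypothetical, and the result covers only the model-cost component.

```python
def model_cost_per_accepted_change(cost_per_request: float, requests_per_accepted_change: float) -> float:
    """Model spend needed to produce one accepted change."""
    return cost_per_request * requests_per_accepted_change


# Vendor A is cheaper per request but needs 3x the iterations (hypothetical counts).
vendor_a = model_cost_per_accepted_change(0.10, requests_per_accepted_change=30)
vendor_b = model_cost_per_accepted_change(0.15, requests_per_accepted_change=10)
print(f"Vendor A: ${vendor_a:.2f} per accepted change")
print(f"Vendor B: ${vendor_b:.2f} per accepted change")
```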

The architecture: cost per accepted change at the top for executive conversations and trend tracking; cost components in the middle for engineering reviews; low-level metrics at the bottom for incident response, optimization, and capacity planning.

How do I value engineering and review time?

Use a fully-loaded hourly rate — salary, benefits, tooling allocation, overhead. Most organizations already have this number for capacity planning. If you do not, $150/hour is a reasonable starting estimate for senior engineers in high-cost markets; adjust to your context.

The absolute value matters less than the relative trend. If your assumptions are consistent across windows, CPAC will move in the right direction when your delivery system improves.

How do I split costs across teams or features?

Allocate by the most defensible signal available — time tracked, story points completed, branch ownership. Perfect attribution is not required; consistent attribution is.

Does CPAC work for non-code work?

The same shape applies wherever AI produces artifacts that must be reviewed before use — drafts, designs, recommendations, plans. The components rename naturally ("accepted recommendations," "accepted designs"). The "stayed there" clause becomes "still in use after N days." The metric generalizes; this site focuses on code because that is where the literature and the measurement infrastructure are best developed.

How does CPAC relate to DORA metrics?

DORA metrics describe how a delivery system behaves: how often it ships, how quickly, with what failure rate. CPAC describes what that behavior costs. The two are complementary; CPAC is the dollar layer DORA was never designed to be.

The cleanest illustration is the change-failure-rate gap. Change failure rate counts incidents — not the cost of the rework those incidents demand. Two teams can post an identical 8% change failure rate: one absorbs the rework with senior engineers at $200/hour over a 40-hour week, the other with junior engineers at $100/hour over four 10-hour weeks. Same DORA number, ~2× difference in rework cost, completely different return on the AI investment that produced the failures. CPAC sees the difference because rework is a numerator line; DORA sees only the incident count. This is exactly the AI-augmented failure mode — review and rework cost climbs while DORA's headline holds.
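
The arithmetic behind the ~2× figure, as a sketch using the rates and hours from the example above (the function name is illustrative):

```python
def rework_cost(hourly_rate: int, hours_per_week: int, weeks: int) -> int:
    """Dollar cost of rework at a fully-loaded hourly rate."""
    return hourly_rate * hours_per_week * weeks


senior_team = rework_cost(hourly_rate=200, hours_per_week=40, weeks=1)
junior_team = rework_cost(hourly_rate=100, hours_per_week=10, weeks=4)
print(senior_team, junior_team, senior_team / junior_team)
```

Both teams report the same change failure rate and spend the same 40 hours; only the rework line in the numerator shows the difference.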

Run DORA for delivery-system behavior. Run CPAC for delivery-system economics. Together, with DORA as the leading indicator and CPAC as the cost outcome, they form the right reporting posture for a finance-aware engineering organization.

How does CPAC relate to FinOps?

CPAC is FinOps cost-to-serve, moved one layer upstream. FinOps measures the unit cost of running software. CPAC measures the unit cost of producing trusted software. Organizations with mature FinOps practices already have most of the inputs; CPAC just rearranges them around the accepted-change denominator.

Can I use cost per accepted change alone?

No, for two reasons.

First, the headline is a summary; track the components. Each of the five cost components (model, infrastructure, engineering, review, rework) needs its own time series so you can diagnose what's actually moving when the headline shifts. See the entry "Why track each cost component separately?" above.

Second, pair the headline with at least one leading indicator: change failure rate, lead time to accepted change, machine catch rate, or DevEx pulse. The headline tells you the system's cost; the components tell you the source; the leading indicators tell you why.

Where does the metric come from?

Cost per accepted change was defined in The Delivery Gap (Brenn Hill, 2026) as the cost vertex of the Verification Triangle. See how to cite.

How do I propose a change to the definition?

Open an issue or a pull request at github.com/brennhill/cost-per-accepted-change. The goal is for the definition to remain stable; refinements, worked examples, and translations are welcome.