AI FinOps

Frequently asked

Questions about cost per accepted change

What counts as a change?

The default unit is a merged pull request to the production branch, normalized by size so that one accepted change unit represents a comparable amount of substantive accepted work. (Teams on non-PR workflows substitute the squash-merge commit or the merge commit that lands a logical change set; the unit must be applied consistently.)

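The size-normalization rule: each change contributes max(1, ceil(lines changed / 500)) accepted change units. Every merged change counts for at least one unit, and each further 500 lines, or part of 500, adds another.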

Examples:

| Lines changed | Accepted change units |
|---|---|
| 120 | 1 |
| 500 | 1 |
| 800 | 2 |
| 1,800 | 4 |
| 5,000 | 10 |
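The examples above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the max(1, ceil(lines / 500)) rule used by the gh recipe later in this FAQ; the function name accepted_change_units is illustrative and is not the calculator library's API.

```python
def accepted_change_units(lines_changed: int, unit_size: int = 500) -> int:
    """Accepted change units for one merged change.

    lines_changed is additions + deletions after exclusions.
    """
    if lines_changed < 0:
        raise ValueError("lines_changed must be non-negative")
    # Ceiling division in integer arithmetic, floored at one unit per change.
    return max(1, -(-lines_changed // unit_size))


# The five examples from the table above.
examples = {loc: accepted_change_units(loc) for loc in (120, 500, 800, 1800, 5000)}
print(examples)  # {120: 1, 500: 1, 800: 2, 1800: 4, 5000: 10}
```

Integer ceiling division avoids floating-point rounding at exact multiples of 500.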

"Lines changed" is the sum of additions and deletions, excluding vendored code, generated code, lockfiles, and bulk imports. Document any further exclusions when you publish or submit a number.

Why the 500-line threshold:

Teams may substitute another unit — accepted PR, deployed increment, shipped feature flag — as long as the size-normalization principle is preserved and applied consistently across the measurement window and across the baseline.

The unit is less important than the consistency. Whatever you pick, document it and stick with it. Switching units mid-window inflates or deflates the result.

How do I count lines changed from git?

Git's built-in tools handle this directly. Nothing third-party has displaced them for this specific job.

The aggregate, between two refs

git diff --shortstat A..B
# 47 files changed, 1832 insertions(+), 614 deletions(-)

git diff --numstat A..B
# <additions> <deletions> <path>     (machine-readable, one row per file)

Per-commit, across a date window

The $1 != "-" filter skips binary files (which git log --numstat outputs as - -). This is what you want for a LOC-based CPAC denominator — binary churn isn't reviewable in lines.

git log --since=2026-04-01 --until=2026-04-30 \
  --pretty=tformat: --numstat \
  | awk '$1 != "-" {add+=$1; del+=$2} END {print add+del}'

Per-PR via the GitHub CLI (recommended for normalization)

The PR is the right unit for the 500-LOC normalization rule. The GitHub CLI returns additions and deletions for each PR; their sum is the per-PR line count that the calculator library's normalizeChanges() helper expects:

gh pr list -R owner/repo --state merged \
  --search 'merged:2026-04-01..2026-04-30' \
  --json number,additions,deletions --limit 1000

Two operational notes for these recipes: pass -R owner/repo when running outside the repo's working tree, and treat --limit 1000 as the ceiling for a searched gh pr list, because GitHub's search endpoints return at most 1,000 results per query. For windows with more than 1,000 merged PRs, either split the window into narrower merged: date ranges and sum the results, or page through the repository's pullRequests connection with gh api graphql --paginate and filter on mergedAt yourself.

One-liner that produces the normalized accepted-change unit count for the window. Each PR contributes max(1, ceil(LOC/500)) units — so a binary-only PR (additions + deletions = 0 from gh's perspective, since it counts line deltas only) still counts as 1 unit, matching the library's normalizeChanges(). The (add // 0) at the end guards against an empty list:

gh pr list -R owner/repo --state merged --search 'merged:2026-04-01..2026-04-30' \
  --json additions,deletions --limit 1000 \
  | jq '[.[] | (.additions + .deletions) | (. / 500) | ceil | if . < 1 then 1 else . end] | (add // 0)'

Excluding vendored, generated, lockfiles

Required for any defensible CPAC submission. Two approaches:

# Pathspec exclusions on the diff itself.
# Use `:(exclude,glob)` for path-spanning patterns — without the `glob`
# magic, `**` is not recognized as a cross-directory wildcard.
git diff --numstat A..B -- \
  ':(exclude)vendor' \
  ':(exclude,glob)**/*.lock' \
  ':(exclude)dist' \
  ':(exclude,glob)**/*.generated.*'

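# Or mark in .gitattributes. GitHub uses these attributes to collapse diffs and
# for language statistics; verify that your PR line totals exclude them before relying on this.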
vendor/**         linguist-vendored
dist/**           linguist-generated
**/*.generated.*  linguist-generated

Other tools worth knowing about

cloc, scc, and tokei are excellent for total LOC snapshots by language but are not range-aware in the way diff calculations need. cloc does have a --diff mode for comparing two refs that excludes comments and whitespace, which can be useful if you want a stricter denominator than raw added+deleted. git-quick-stats and gitfame answer related but different questions (per-author dashboards, authorship attribution) and are not the right tools for computing CPAC inputs.

For repositories not hosted on GitHub, use git log --numstat with a per-commit aggregator and pipe the per-PR (or per-merge) line counts through the normalizeChanges() helper.
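A per-commit aggregator can be sketched as below. It assumes a squash-merge workflow (one commit per logical change) and a log produced with `git log --pretty=format:'@%H' --numstat`; the "@" marker, the function names, and the sample log are this sketch's own conventions, and the local units() function stands in for normalizeChanges() rather than reproducing it.

```python
from typing import Dict


def lines_changed_per_commit(log_text: str) -> Dict[str, int]:
    """Sum additions + deletions per commit from `git log --numstat` text.

    Expects one "@<sha>" header line per commit (an assumed convention),
    followed by tab-separated numstat rows: added, deleted, path.
    """
    totals: Dict[str, int] = {}
    current = None
    for line in log_text.splitlines():
        if line.startswith("@"):
            current = line[1:].strip()
            totals[current] = 0
        elif line.strip() and current is not None:
            added, deleted, _path = line.split("\t", 2)
            if added == "-":  # binary file: numstat prints "-" for both counts
                continue
            totals[current] += int(added) + int(deleted)
    return totals


def units(lines_changed: int, unit_size: int = 500) -> int:
    """max(1, ceil(lines / 500)), the normalization rule used in this FAQ."""
    return max(1, -(-lines_changed // unit_size))


# Hypothetical two-commit log for illustration.
sample_log = (
    "@abc123\n"
    "10\t2\tsrc/app.py\n"
    "-\t-\tassets/logo.png\n"
    "\n"
    "@def456\n"
    "600\t50\tsrc/big_refactor.py\n"
)
per_commit = lines_changed_per_commit(sample_log)
total_units = sum(units(n) for n in per_commit.values())
print(per_commit, total_units)  # {'abc123': 12, 'def456': 650} 3
```

Apply the same pathspec exclusions to the git log command so that vendored and generated files never reach the parser.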

What does "stayed there" mean?

A change has stayed in production if it survives its stayed-there window post-merge without being reverted or fixed. The principle: only follow-ups that effectively retract the original change count against it. Building further on top of the change is the signal that it succeeded, not failed.

These invalidate acceptance (subsequent change retracts or repairs the original):

These do NOT invalidate acceptance (subsequent change builds on or extends the original):

The judgment call: if a reasonable reviewer would describe the follow-up as "fixing what the original got wrong," it invalidates. If they would describe it as "building on what the original enabled," it does not. When the line is fuzzy, document the call in your methodology notes and apply it consistently across reporting windows.

How long is the stayed-there window?

The recommended default is 30 days post-merge. A change counts as accepted if, 30 days after it landed in the production branch, it is still in production in substantively its original form.

Acceptable alternatives:

| Window | Best for | Tradeoff |
|---|---|---|
| 14 days | Fast-moving teams, daily-or-better deploy cadence, low cost of incident response. | Catches acute reverts. Misses some second-sprint regressions. |
| 30 days (recommended) | Most engineering organizations. Aligns with DORA change-failure-rate convention and monthly reporting cycles. | Catches the bulk of post-deploy issues. Adds ~1 month reporting lag. |
| 60–90 days | Regulated industries, payment systems, infrastructure with long-tail failure modes. | High-confidence acceptance. Significant reporting lag; not suitable for monthly cadence. |

The window is per-change: each change is evaluated from its own merge date, not from the reporting window's boundaries. This keeps the test fair — without it, a change merged at the start of the reporting window faces a harder survival test than one merged at the end.

Reporting lag is the cost of this discipline. With a 30-day survival window and a monthly cadence, the current month's cost per accepted change cannot be fully computed until 30 days after the month closes — the most recent merges still need to season. This is the same discipline as cohort retention measurement; it is a feature, not a bug. Teams that want closer-to-real-time signal use 14 days; teams that need higher-confidence acceptance use 60–90.
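The per-change window and the resulting lag can be sketched as a small classifier. This is a minimal illustration under stated assumptions: the function name, the three status labels, and the choice to treat the final day of the window as inside it are not part of the definition, and deciding whether a follow-up actually invalidates a change remains the judgment call described earlier.

```python
from datetime import date, timedelta
from typing import Optional


def acceptance_status(
    merged_on: date,
    as_of: date,
    invalidated_on: Optional[date] = None,
    window_days: int = 30,
) -> str:
    """Classify one change as 'accepted', 'invalidated', or 'pending'.

    The window runs from the change's own merge date, not from the
    reporting period's boundaries.
    """
    window_end = merged_on + timedelta(days=window_days)
    if invalidated_on is not None and merged_on <= invalidated_on <= window_end:
        return "invalidated"  # reverted or fixed inside its window
    if as_of < window_end:
        return "pending"      # still seasoning; cannot be counted yet
    return "accepted"


# April merges evaluated on May 1: early merges are settled, late ones are not.
print(acceptance_status(date(2026, 4, 1), as_of=date(2026, 5, 1)))   # accepted
print(acceptance_status(date(2026, 4, 30), as_of=date(2026, 5, 1)))  # pending
```

A change merged on April 30 stays pending until May 30, which is why April's figure cannot be closed until about 30 days after the month ends.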

Whatever window you pick, apply it consistently across all reporting periods. Switching mid-stream invalidates the trend.

How is "acceptance" different from "correctness"?

Acceptance is observable. Correctness is aspirational. CPAC measures whether a change survived contact with production for the duration of the window — that is a real, auditable signal. A change can be incorrect in some absolute sense and still "stay there"; CPAC will count it.

The defenses against this are the "stayed there" window and the rework cost line. Lengthening the window catches more silent escapes. Counting fix-up cost in the numerator catches the rest. CPAC is a defensible proxy for correctness, not a claim of perfect correctness.

What measurement window should I use?

Choose a window long enough that most rework would have surfaced. Two weeks is a minimum; four weeks is the practical default; one quarter is the most defensible.

Report CPAC for the same window length consistently. Quarter-over-quarter comparisons are sound; quarter-versus-sprint comparisons are not.

How do I track model cost per change?

The numerator's model-cost component is straightforward at the aggregate level — your LLM provider's billing dashboard gives you a monthly total. The harder part is attributing that spend to specific accepted changes, which is what lets you compare AI-assisted and non-AI work fairly.

One useful tool here is git-ai — a Git extension that automatically links every AI-written line to the agent, model, and transcripts that generated it. Combined with provider pricing, that lets you derive model spend per merged change rather than spreading total spend evenly across the team's output.

For teams without per-change attribution, a reasonable starting approach is to allocate the period's total model spend proportionally to AI-touched changes (as identified by your tooling or a survey of authors). Document the allocation method when you publish a number.
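One way to read "proportionally" is to weight each AI-touched change by its accepted change units. The sketch below assumes that weighting; the function name, the input shape, and the example figures are hypothetical, and an equal split per AI-touched change would be an equally valid documented choice.

```python
from typing import Dict, Tuple


def allocate_model_spend(
    total_spend: float, changes: Dict[str, Tuple[int, bool]]
) -> Dict[str, float]:
    """Split total_spend across AI-touched changes, weighted by change units.

    changes maps a change id to (accepted_change_units, ai_touched).
    Changes not touched by AI receive 0.0.
    """
    ai_units = sum(units for units, ai_touched in changes.values() if ai_touched)
    allocation: Dict[str, float] = {}
    for change_id, (units, ai_touched) in changes.items():
        if ai_touched and ai_units > 0:
            allocation[change_id] = total_spend * units / ai_units
        else:
            allocation[change_id] = 0.0
    return allocation


# Hypothetical month: $900 of model spend, two AI-touched changes out of three.
changes = {"PR-101": (1, True), "PR-102": (2, True), "PR-103": (4, False)}
allocation = allocate_model_spend(900.0, changes)
print(allocation)
```

If no change in the period is AI-touched, the sketch allocates nothing; the spend then has to be reported as unattributed rather than silently dropped.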

See the instrumentation guide for specific LLM proxy options (LiteLLM, Helicone, OpenLLMetry, Portkey), per-team API key attribution, and per-commit attribution via git-ai.

Why not measure tokens or cost-per-token instead of dollars?

The numerator's model-cost component is in dollars, not tokens. This is deliberate: token-based metrics are unreliable as a comparator across any timeframe longer than a single API call.

Tokens are not comparable between models or vendors

Each model family uses its own tokenization scheme: BPE variants, SentencePiece, vendor-specific encoders. The same English sentence requires dramatically different token counts on Claude vs GPT vs Gemini vs open-weight models. "Tokens per change" is meaningful within one model on one day; it is nearly meaningless in any other comparison.

Tokenization changes within a vendor's own model family

A model bump can shift token usage substantially even when the model is doing the same job. Claude Opus 4.7's new tokenizer is the worked example: Anthropic disclosed a 1.0–1.35× inflation range vs. 4.6 depending on content type, and independent measurements by OpenRouter and Simon Willison (April 2026) found ~12–27% more tokens on typical workloads and 32–34% on large (10K+ token) prompts — at unchanged per-token pricing. A team tracking "tokens per change" through that upgrade would have seen a phantom regression that had nothing to do with their delivery system. Vendors do not coordinate tokenization changes with measurement frameworks.

Identical tasks vary wildly in token count

The same logical change can consume very different token counts depending on:

Input and output tokens are priced very differently

Output tokens are typically 3–5× more expensive than input tokens. A "token count" that does not split the two is not even an internally consistent cost signal. Two operations with identical total token counts can cost wildly different amounts depending on the input/output split.
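A short worked example makes the split concrete. The prices below are hypothetical (output at 5× input, within the range stated above) and the function name is illustrative; both requests use exactly 10,000 tokens in total.

```python
# Hypothetical USD prices per million tokens; not any vendor's actual rate card.
PRICE_PER_MILLION = {"input": 3.00, "output": 15.00}


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request given its input/output token split."""
    return (
        input_tokens * PRICE_PER_MILLION["input"]
        + output_tokens * PRICE_PER_MILLION["output"]
    ) / 1_000_000


read_heavy = request_cost(input_tokens=9_000, output_tokens=1_000)
write_heavy = request_cost(input_tokens=1_000, output_tokens=9_000)
print(f"read-heavy:  ${read_heavy:.3f}")   # read-heavy:  $0.042
print(f"write-heavy: ${write_heavy:.3f}")  # write-heavy: $0.138
```

Same token total, more than three times the cost, which is why an unsplit token count cannot stand in for spend.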

Dollars are the unit of business reality

Your CFO does not negotiate with a token budget; they negotiate with a dollar one. By denominating in dollars, cost per accepted change tracks what you actually spend — and absorbs all the upstream changes (model version bumps, pricing updates, cache hit rates, multi-step strategies) without breaking the metric.

The token economy is an implementation detail vendors change unilaterally. Cost per accepted change measures the layer above it: the dollar cost of producing software your team actually kept.

Why track each cost component separately?

The headline cost per accepted change is what you report to the CFO. It is not what you investigate with. You also need to track each of the five numerator components — model, infrastructure, engineering time, review, rework — as its own time series, with absolute values, percent share of total, and Δ vs prior window.

The reason is diagnostic: a moving headline tells you something changed; it tells you nothing about what.

Worked example: the Claude 4.6 → 4.7 bump

Claude Opus 4.7 launched at the same dollar price per token as 4.6, but with a new tokenizer that consumes more tokens for the same content. Anthropic disclosed a 1.0–1.35× inflation range; independent measurements (OpenRouter, Simon Willison, April 2026) found 12–27% higher token usage on typical workloads, and 32–34% on large (10K+ token) prompts. For teams that adopted 4.7, the model-cost component in dollars rose by that range without any change in their own delivery system.

A team watching only the headline cost per accepted change would have seen one of two equally bad outcomes:

With component-level tracking, the same scenario reads instantly: model cost up ~20%, all other components flat. The conclusion is immediate — this is an upstream vendor change, not a delivery-system change. Talk to procurement; if you are getting offsetting capability gains, document them; if not, evaluate alternatives.

What each component catches that the headline doesn't

A team watching only the headline can see something moved; it cannot tell which lever to pull. A team watching the components can.

What to record each window

The XLSX tracker template does this automatically. If you build your own dashboard, the minimum row per window is:

The percent shares are particularly load-bearing: a team where rework has crept from 8% to 18% of total has a very different problem than one where it crept from 8% to 9%, even if the headline moved by the same amount in both cases.
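For a home-built dashboard, the three views named earlier (absolute value, percent share of total, Δ vs prior window) can be derived per component as sketched below. The field names and dollar figures are hypothetical and this is not the XLSX template's schema; the example mirrors the scenario where model cost rises about 20% and everything else is flat.

```python
from typing import Dict, Optional

COMPONENTS = ("model", "infrastructure", "engineering", "review", "rework")


def component_rows(
    current: Dict[str, float], prior: Dict[str, float]
) -> Dict[str, Dict[str, Optional[float]]]:
    """Absolute cost, percent share of total, and percent change vs prior window."""
    total = sum(current[name] for name in COMPONENTS)
    rows: Dict[str, Dict[str, Optional[float]]] = {}
    for name in COMPONENTS:
        cost, previous = current[name], prior[name]
        rows[name] = {
            "cost": cost,
            "share_pct": cost * 100 / total if total else None,
            # None when the prior window had no spend to compare against.
            "delta_pct": (cost - previous) * 100 / previous if previous else None,
        }
    return rows


# Hypothetical windows: only the model line moves.
prior = {"model": 10_000, "infrastructure": 2_000, "engineering": 30_000,
         "review": 5_000, "rework": 3_000}
current = dict(prior, model=12_000)

rows = component_rows(current, prior)
for name, row in rows.items():
    print(f"{name:15s} {row['cost']:>7} {row['share_pct']:5.1f}% {row['delta_pct']:+6.1f}%")
```

Here the headline total rises 4% while the model line alone shows +20%, which is the signal the headline hides.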

How do other AI metrics (tokens/sec, cost-per-request, latency) relate to cost per accepted change?

Most low-level AI metrics teams already track are diagnostic signals that feed into one of the five CPAC components. They are useful — just not as the bottom-line cost metric. The mapping:

| Low-level metric | What it measures | CPAC component it surfaces in | When to watch it |
|---|---|---|---|
| Tokens per second (throughput) | Inference speed | Engineering time (slow inference is developers waiting) | Developer-flow complaints; agent loops feel sluggish |
| Time to first token (TTFT) | Initial response latency | Engineering time; can drag per-suggestion acceptance rate | After model bumps, region changes, or provider switches |
| Cost per request | Per-API-call spend | Model cost | Capacity planning, vendor comparison (with caveat below) |
| Tokens per request | Input + output token usage | Model cost | After a model version bump, prompt change, or tool addition |
| Cache hit rate | Share of requests served from prompt cache | Model cost (cache hits are roughly 10× cheaper) | After tooling changes that affect prompt structure |
| Request volume | Total API calls per window | Model cost; infrastructure cost | Capacity planning; anomaly detection |
| Error / retry rate | Failed or retried requests | Model cost (failed requests still bill); engineering time (debug + retry) | After model bumps, agent-strategy changes, or new tools |
| Tool-call / agent-loop depth | Model calls per logical task | Model cost; sometimes engineering time | After agent architecture changes; when costs jump but task count doesn't |
| Per-suggestion acceptance rate | % of AI suggestions accepted by developers | Leading indicator for accepted change units, not a substitute | Always; pair with CPAC for trend explanation |

The pattern: low-level metrics explain why a CPAC component moved. They are not substitutes for the metric itself, because no single low-level metric captures the fully-loaded production cost of trusted software.

Three operating modes

Day-to-day: monitor the low-level metrics that affect your dominant cost components. If model cost is 70% of the numerator, watch cache hit rate and tokens-per-request closely. If engineering time dominates, watch PR cycle time and TTFT.

Investigation: when cost per accepted change moves, decompose into components, then drill into the low-level metrics that feed the moving component. Headline → component → diagnostic.

Vendor comparison: cost-per-request and tokens-per-request are the right metrics for vendor benchmarking — but only alongside a measurement of how many requests or tokens each vendor uses to produce an accepted change. The tempting comparison ("Vendor A is $0.10/request, Vendor B is $0.15/request") is the one to resist; the meaningful comparison is "Vendor A produces an accepted change for $X, Vendor B for $Y." A 50%-cheaper-per-request vendor that takes 3× the iterations is more expensive in CPAC terms.
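The vendor comparison reduces to one multiplication. In the sketch below the per-request prices are the ones quoted above, while the requests-per-accepted-change figures are hypothetical measurements you would have to collect yourself; only the model-cost slice of CPAC is shown.

```python
def model_cost_per_accepted_change(
    cost_per_request: float, requests_per_accepted_change: float
) -> float:
    """Model spend needed to land one accepted change with a given vendor."""
    return cost_per_request * requests_per_accepted_change


# Vendor A is cheaper per request but (hypothetically) needs 3x the iterations.
vendor_a = model_cost_per_accepted_change(0.10, requests_per_accepted_change=30)
vendor_b = model_cost_per_accepted_change(0.15, requests_per_accepted_change=10)
print(f"Vendor A: ${vendor_a:.2f} per accepted change")  # Vendor A: $3.00 per accepted change
print(f"Vendor B: ${vendor_b:.2f} per accepted change")  # Vendor B: $1.50 per accepted change
```

The per-request ranking and the per-accepted-change ranking point in opposite directions, which is the comparison the paragraph above asks you to make.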

The architecture: cost per accepted change at the top for executive conversations and trend tracking; cost components in the middle for engineering reviews; low-level metrics at the bottom for incident response, optimization, and capacity planning.

How do I value engineering and review time?

Use a fully-loaded hourly rate — salary, benefits, tooling allocation, overhead. Most organizations already have this number for capacity planning. If you do not, $150/hour is a reasonable starting estimate for senior engineers in high-cost markets; adjust to your context.

The absolute value matters less than the relative trend. If your assumptions are consistent across windows, CPAC will move in the right direction when your delivery system improves.
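The point about trend versus level can be checked numerically. The sketch below assumes the headline is the sum of the five cost components divided by accepted change units, and that engineering, review, and rework are all valued as hours times one loaded rate; the function name and every figure are hypothetical.

```python
def cpac(
    model: float,
    infrastructure: float,
    engineering_hours: float,
    review_hours: float,
    rework_hours: float,
    hourly_rate: float,
    accepted_units: int,
) -> float:
    """Headline cost per accepted change under the assumptions above."""
    if accepted_units <= 0:
        raise ValueError("accepted_units must be positive")
    people_cost = (engineering_hours + review_hours + rework_hours) * hourly_rate
    return (model + infrastructure + people_cost) / accepted_units


# Two hypothetical windows: people-hours drop from 400 to 300, all else equal.
for rate in (150, 200):
    before = cpac(8_000, 2_000, 300, 80, 20, hourly_rate=rate, accepted_units=100)
    after = cpac(8_000, 2_000, 220, 60, 20, hourly_rate=rate, accepted_units=100)
    print(rate, before, after)
# 150 700.0 550.0
# 200 900.0 700.0
```

The level shifts with the assumed rate, but the improvement shows up under either rate as long as the rate is held constant across windows.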

How do I split costs across teams or features?

Allocate by the most defensible signal available — time tracked, story points completed, branch ownership. Perfect attribution is not required; consistent attribution is.

Does CPAC work for non-code work?

The same shape applies wherever AI produces artifacts that must be reviewed before use — drafts, designs, recommendations, plans. The components rename naturally ("accepted recommendations," "accepted designs"). The "stayed there" clause becomes "still in use after N days." The metric generalizes; this site focuses on code because that is where the literature and the measurement infrastructure are best developed.

How does CPAC relate to DORA metrics?

DORA metrics describe how a delivery system behaves: how often it ships, how quickly, with what failure rate. CPAC describes what that behavior costs. The two are complementary; CPAC is the dollar layer DORA was never designed to be.

The cleanest illustration is the change-failure-rate gap. Change failure rate counts incidents — not the cost of the rework those incidents demand. Two teams can post an identical 8% change failure rate: one absorbs the rework with senior engineers at $200/hour over a 40-hour week, the other with junior engineers at $100/hour over four 10-hour weeks. Same DORA number, ~2× difference in rework cost, completely different return on the AI investment that produced the failures. CPAC sees the difference because rework is a numerator line; DORA sees only the incident count. This is exactly the AI-augmented failure mode — review and rework cost climbs while DORA's headline holds.
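The arithmetic behind the two-team example, using only the rates and hours stated above (the function name is illustrative):

```python
def rework_cost(hourly_rate: int, hours: int) -> int:
    """Dollar cost of rework valued at a fully-loaded hourly rate."""
    return hourly_rate * hours


senior_team = rework_cost(hourly_rate=200, hours=40)      # one 40-hour week
junior_team = rework_cost(hourly_rate=100, hours=4 * 10)  # four 10-hour weeks
print(senior_team, junior_team, senior_team / junior_team)  # 8000 4000 2.0
```

Both teams log the same 40 hours and the same 8% change failure rate; only the dollar line distinguishes them.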

Run DORA for delivery-system behavior. Run CPAC for delivery-system economics. Running the two together, with DORA as the leading indicator and CPAC as the cost outcome, is the right reporting posture for a finance-aware engineering organization.

How does CPAC relate to FinOps?

CPAC is FinOps cost-to-serve, moved one layer upstream. FinOps measures the unit cost of running software. CPAC measures the unit cost of producing trusted software. Organizations with mature FinOps practices already have most of the inputs; CPAC just rearranges them around the accepted-change denominator.

Can I use cost per accepted change alone?

No, for two reasons.

First, the headline is a summary, so track the components. Each of the five cost components (model, infrastructure, engineering, review, rework) needs its own time series so you can diagnose what's actually moving when the headline shifts. See "Why track each cost component separately?" above.

Second, pair the headline with at least one leading indicator: change failure rate, lead time to accepted change, machine catch rate, or DevEx pulse. The headline tells you the system's cost; the components tell you the source; the leading indicators tell you why.

Where does the metric come from?

Cost per accepted change was defined in The Delivery Gap (Brenn Hill, 2026) as the cost vertex of the Verification Triangle. See how to cite.

How do I propose a change to the definition?

Open an issue or a pull request at github.com/brennhill/cost-per-accepted-change. The goal is for the definition to remain stable; refinements, worked examples, and translations are welcome.