A survey

How teams measure AI cost

Every team trying to measure AI in their delivery pipeline ends up choosing among a handful of approaches. Most are partial. This page walks the landscape: what each option captures, what each misses, and where cost per accepted change (CPAC) fits.

The eight approaches in current use

1. Token / API cost

The most direct cost signal: how much you spend on the LLM provider per period. Promoted by the FinOps Foundation as the entry-level AI cost metric, and surfaced by every major model vendor.

What it catches: raw model spend; runaway agents; per-feature unit economics.
What it misses: the labor cost of producing and reviewing the work the model generates; the cost of rework when it ships defects. Token cost is a real input to delivery cost, but never the whole picture.
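
As a minimal sketch of this signal, the snippet below rolls usage records up into model spend per feature for one period. The record fields, feature names, and per-million-token prices are illustrative assumptions, not any vendor's billing schema or actual rates.

```python
from collections import defaultdict

# Hypothetical prices in dollars per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

# Hypothetical usage records, e.g. exported from a provider's usage report.
usage = [
    {"feature": "code-review-bot", "input_tokens": 2_000_000, "output_tokens": 400_000},
    {"feature": "test-generator", "input_tokens": 500_000, "output_tokens": 1_000_000},
    {"feature": "code-review-bot", "input_tokens": 1_000_000, "output_tokens": 200_000},
]

def spend_by_feature(records):
    """Sum model spend in dollars per feature for one reporting period."""
    totals = defaultdict(float)
    for r in records:
        cost = (r["input_tokens"] * PRICE_PER_MTOK["input"]
                + r["output_tokens"] * PRICE_PER_MTOK["output"]) / 1_000_000
        totals[r["feature"]] += cost
    return dict(totals)

print(spend_by_feature(usage))
```

Everything this produces is model spend only; none of the labor or rework cost described above appears in it.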

2. Volume metrics: lines of code, PRs merged, commits

The default in many engineering analytics products. "AI code share" — the percentage of merged code attributed to AI suggestions — is a 2024–2026 variant.

What it catches: activity. Easy to compute, easy to chart.
What it misses: most of what you ultimately care about. PR count rises with AI even when delivered value doesn't; "AI code share" can climb while quality and stability slip. These are activity metrics — genuinely useful as signals, just easy to mistake for outcomes. They're also the ones that show up most in vendor marketing, which is understandable: they're the easiest to measure and the most cheering to report.

3. Acceptance rate

The fraction of AI suggestions that a developer accepts. Tracked by Copilot, Cursor, and most other coding assistants. Often reported alongside "characters inserted from AI."

What it catches: short-term developer agreement with the model. A useful product-engineering signal for vendors.
What it misses: whether the accepted suggestion survived review, whether it shipped, whether it caused a defect three days later. Acceptance is a leading indicator; it is not an outcome.
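
To illustrate the gap between acceptance and outcome, the sketch below computes both from the same log. The log format and its field names are hypothetical; real assistants expose acceptance telemetry in their own shapes, and linking a suggestion to a later revert is assumed to be possible here.

```python
# Hypothetical suggestion log; field names are assumptions for illustration.
suggestions = [
    {"accepted": True,  "merged": True,  "reverted": False},
    {"accepted": True,  "merged": True,  "reverted": True},
    {"accepted": True,  "merged": False, "reverted": False},
    {"accepted": False, "merged": False, "reverted": False},
]

def rate(items, predicate):
    """Fraction of items satisfying predicate; 0.0 for an empty list."""
    return sum(1 for s in items if predicate(s)) / len(items) if items else 0.0

# Leading indicator: the developer said yes in the editor.
acceptance_rate = rate(suggestions, lambda s: s["accepted"])
# Outcome: the suggestion merged and was not reverted afterwards.
survival_rate = rate(suggestions, lambda s: s["merged"] and not s["reverted"])

print(acceptance_rate, survival_rate)  # 0.75 0.25
```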

4. DORA Four Keys

Deployment frequency, lead time for changes, change failure rate, and time to restore. Defined by DORA; popularized by Accelerate (Forsgren, Humble, Kim, 2018) and the annual State of DevOps reports.

What it catches: how a delivery system behaves. Industry-standard, well-defended, mature instrumentation — the framework engineering leaders already use to talk about delivery health.
What it misses: what that behavior costs. Specifically, change failure rate counts incidents — not the dollar cost of the rework those incidents demand. A team where senior engineers absorb a 40-hour rework week posts the same change-failure-rate as a team where junior engineers absorb four 10-hour rework weeks; only CPAC sees the cost difference. A team can post excellent DORA numbers and a terrible cost picture if velocity is bought with disproportionate review and rework — exactly the AI-augmented failure mode. DORA and CPAC are complementary: DORA describes the delivery system; CPAC is the dollar layer DORA was never designed to be.
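
The seniority example can be made concrete with a few lines of arithmetic. The loaded hourly rates below are hypothetical placeholders chosen only to show the gap; substitute your own.

```python
# Hypothetical fully loaded hourly rates in dollars (assumptions, not benchmarks).
SENIOR_RATE = 150.0
JUNIOR_RATE = 75.0

def rework_cost(hours_per_week, weeks, hourly_rate):
    """Dollar cost of rework: hours absorbed times the loaded rate."""
    return hours_per_week * weeks * hourly_rate

senior_team = rework_cost(hours_per_week=40, weeks=1, hourly_rate=SENIOR_RATE)
junior_team = rework_cost(hours_per_week=10, weeks=4, hourly_rate=JUNIOR_RATE)

# Both teams absorb 40 hours of rework, but the dollar cost differs.
print(senior_team, junior_team)  # 6000.0 3000.0
```

Change failure rate looks identical for the two teams; only a dollar-denominated numerator separates them.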

5. SPACE framework

Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow. A multi-dimensional developer-productivity framework from Microsoft Research and GitHub (Forsgren, Storey, Maddila, Zimmermann, Houck, Butler, 2021).

What it catches: a balanced view of productivity that resists single-metric gaming — the framework everyone reaches for when they want to defeat the "lines of code" trap.
What it misses: a unit-cost number a CFO can act on. SPACE's five dimensions are diagnostic by design; none of them collapse to a single dollar value, and SPACE was never built to. CPAC is the dollar bottom line that SPACE's dimensions help explain when it moves.

6. DevEx scores

Developer Experience surveys — pulse scores on flow, feedback loops, and cognitive load. Promoted by DX, Faros, and others; aligned with the SPACE tradition.

What it catches: friction and toil. Strong leading indicator of attrition and burnout. Useful for diagnosing where AI tooling is helping or hurting team experience.
What it misses: dollars. DevEx data tells you where to invest; it does not tell you what your AI program costs per unit of delivered value.

7. Self-reported productivity surveys

"How much faster are you with AI?" — surveyed by BCG, McKinsey, GitHub Octoverse, Stack Overflow's annual developer survey, and others. Headline numbers in this category are commonly in the 20–55% range.

What it catches: sentiment, perception, what teams will say about AI.
What it misses: reality. In a randomized controlled trial of 16 experienced open-source developers across 246 real tasks, METR (July 2025) found that allowing AI tools increased completion time by 19%, while the same developers estimated afterward that AI had sped them up by 20%. The perception-reality gap was wide enough to invert the conclusion. Self-report is not a measurement; it is a hypothesis. (METR has since noted that experienced developers are increasingly unwilling to work without AI, which biases any new measurement of the gap.)

8. FinOps cost-to-serve

The unit cost of running software — cost per request, per active user, per transaction. The mature, board-defensible discipline of cloud cost management. See the FinOps Foundation.

What it catches: operational unit economics, well-instrumented and well-understood.
What it misses: the cost of producing the software. FinOps measures the right side of the deployment boundary; cost per accepted change measures the left.

The framework it sits inside: the Verification Triangle

Cost per accepted change is the cost vertex of a three-vertex framework defined in The Delivery Gap. The other two vertices describe what is being delivered and how well it is checked:

Intent clarity — is the team converging on the right thing, fast enough? Measured by first-pass acceptance, post-merge rework, and increment frequency. Not by scoring spec documents — specs are the receipt of iteration, not the recipe.
Verification quality — are defects caught early, and by machines? The discipline of eval-driven development (see Eval-Driven Development). Operationalized as six tiers of quality gates (four machine, one human, one peer), built cumulatively. Measured by machine catch rate and change failure rate.
Cost — what does each trusted change actually cost to deliver? This is the vertex cost per accepted change quantifies, paired with lead time to accepted change.

The three vertices are mutually informing: weak intent makes the gates verify the wrong thing; weak gates hide cost in rework; unmeasured cost means you cannot see whether the other two are improving. Cost per accepted change is necessary but not sufficient — it is the bottom-line dollar measurement that the other two vertices supply context for.

Where cost per accepted change fits

Cost per accepted change does not replace any of the metrics above. It sits one layer above them:

Token cost is a numerator input.
Volume metrics are the incomplete shadow CPAC is designed to round out.
Acceptance rate is a leading indicator that should correlate with CPAC over time.
DORA tells you whether your delivery system is fast and stable; CPAC tells you what that costs.
SPACE and DevEx are diagnostic; CPAC is the bottom line.
Productivity surveys are aspirational; CPAC is observable.
FinOps cost-to-serve is its downstream sibling. CPAC is the same discipline, applied one layer upstream.

The right operating posture is to report cost per accepted change as the headline number and pair it with two or three of these leading indicators for diagnosis. Without CPAC, the indicators do not roll up to anything a CFO can act on. Without the indicators, CPAC moves without explaining why.

Defending the choices behind the formula

Why "accepted" rather than "merged"

A merged pull request is not a unit of value if it gets reverted next week. "Merged" describes the moment the diff landed; "accepted" requires that the diff stayed in production through the measurement window. The denominator should reflect the work that was kept, not the work that was attempted.

Why "and stayed there"

Without this clause, cost per accepted change could be gamed by shipping recklessly and counting every merge. With it, the metric self-corrects: silent escapes that get quietly fixed are excluded from the denominator and their fix cost is added to the numerator. This is the rework defense, built into the metric instead of bolted on.
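
A minimal sketch of that self-correction, comparing a naive cost-per-merge figure with one that applies the clause. All figures are hypothetical, and the function names are illustrative rather than part of any published tooling.

```python
def cost_per_merge(total_cost, merged):
    """Naive metric: every merge counts, whether or not it survived."""
    return total_cost / merged

def cost_per_accepted_change(total_cost, fix_cost, merged, escaped):
    """Escapes leave the denominator; their fix cost joins the numerator."""
    accepted = merged - escaped
    if accepted <= 0:
        raise ValueError("no accepted changes in this period")
    return (total_cost + fix_cost) / accepted

# Hypothetical period: 20 merges for $40,000, of which 5 were later fixed for $10,000.
naive = cost_per_merge(40_000, 20)
corrected = cost_per_accepted_change(40_000, 10_000, 20, 5)
print(naive, round(corrected, 2))  # 2000.0 3333.33
```

Shipping recklessly lowers the naive number but raises the corrected one, which is the behavior the clause is meant to produce.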

The "stayed there" duration defaults to 30 days post-merge, applied per-change rather than per-reporting-window. This is the same discipline as cohort-based retention measurement: don't evaluate a cohort that hasn't had time to mature. Teams may use 14 days (faster cadence, less lag) or 60–90 days (high-confidence, regulated environments). See the FAQ for the full reasoning.
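
The per-change window can be sketched as a small classifier. This is an illustration under assumptions: a revert date stands in for "did not stay there", and how a team detects reverts or quiet fixes is outside the snippet. The function and label names are hypothetical.

```python
from datetime import date, timedelta

WINDOW_DAYS = 30  # default from the text; 14 or 60-90 are the stated alternatives

def classify(merged_on, reverted_on, as_of, window_days=WINDOW_DAYS):
    """Classify one change as 'accepted', 'rejected', or 'immature'.

    The window runs from this change's own merge date,
    not from the start or end of the reporting period.
    """
    window_end = merged_on + timedelta(days=window_days)
    if reverted_on is not None and reverted_on <= window_end:
        return "rejected"   # did not stay in production through its window
    if as_of < window_end:
        return "immature"   # cohort has not had time to mature; do not count yet
    return "accepted"

as_of = date(2026, 3, 1)
print(classify(date(2026, 1, 10), None, as_of))               # accepted
print(classify(date(2026, 1, 10), date(2026, 1, 20), as_of))  # rejected
print(classify(date(2026, 2, 20), None, as_of))               # immature
```

Immature changes are held back rather than counted as accepted, which is the cohort discipline described above.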

Why all five cost components, not just model cost

Model cost is a small fraction of total delivery cost in most organizations. Engineering and review time dominate the numerator. A "cost per change" metric that only counts model spend would understate the true cost by an order of magnitude — and would let teams optimize for the wrong thing.
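
To show the size of that understatement, the sketch below compares a model-only figure with a fuller one. The dollar amounts are invented for illustration. Only the components this page names (model spend, engineering time, review time, and fix cost) appear; the formula's remaining component is not itemized here, so it is left out rather than guessed.

```python
# Hypothetical monthly figures in dollars. Component names follow the costs
# mentioned on this page; the full formula defines five components.
components = {
    "model_spend": 4_000,
    "engineering_time": 60_000,
    "review_time": 25_000,
    "fix_cost": 11_000,
}
accepted_units = 200  # hypothetical count of accepted changes for the period

model_only = components["model_spend"] / accepted_units
fuller_cost = sum(components.values()) / accepted_units

print(model_only, fuller_cost, fuller_cost / model_only)  # 20.0 500.0 25.0
```

With these made-up inputs, the model-only number is 25 times too low; the real ratio depends entirely on an organization's own figures.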

Why "per change" and not "per developer-hour"

Developer-hours are an input. Accepted changes are an output. Cost-per-output is the right shape for a metric that asks "is this investment paying off?" Cost-per-input answers a different, more inward-facing question and is more easily gamed by changing how time is counted.

Why a 500-line normalization on "change"

Without size-normalization, the denominator could be gamed in either direction. A team shipping one massive 5,000-line merge as a single "accepted change" would be credited the same denominator as a team shipping a 100-line bug fix, even though the former contains 50 times as many lines and far more accepted work. Conversely, a team could inflate its denominator by splitting trivial changes into ever-smaller PRs.

The 500-line threshold cuts both knots and is grounded in what humans can actually verify: reviewer comprehension, defect-detection, and the willingness to leave substantive comments all drop sharply past about 400 lines in a single PR. 500 lines is a clean round number that sits just past that cliff, providing a sane buffer while remaining recognizably "one substantial change" in most engineering cultures.

A PR of 1–500 lines counts as 1 unit; a larger PR of N lines counts as ⌈ N / 500 ⌉ units. The rule applies uniformly to AI-assisted and non-AI work so comparisons remain meaningful. See the FAQ for examples.
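
The rule translates directly into code. In this sketch, how "lines" are counted (for example, added plus deleted) is an assumption the snippet leaves to the caller, and the function name is illustrative.

```python
import math

UNIT_LINES = 500  # normalization threshold from the text

def change_units(lines_changed, unit_lines=UNIT_LINES):
    """Units of accepted change for one PR: 1 for 1-500 lines, else ceil(N / 500)."""
    if lines_changed < 1:
        raise ValueError("a change must touch at least one line")
    return math.ceil(lines_changed / unit_lines)

for n in (100, 500, 501, 5000):
    print(n, change_units(n))
# 100 -> 1, 500 -> 1, 501 -> 2, 5000 -> 10
```

The same function applies to AI-assisted and non-AI PRs, so the denominator stays comparable across both.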

Why not a productivity index that combines several signals

Composite indices look defensible and rarely are: every weighting choice is contested, every component drifts, and the index is hard to explain. CPAC is one number with a known formula. It is easy to recompute, easy to compare, and easy to challenge — three properties an index does not have.

Common critiques, addressed

"Acceptance is not correctness."

True. CPAC's defense is the "stayed there" window: lengthening the window catches more silent escapes; counting fix-up cost in the numerator catches the rest. CPAC is a defensible proxy for correctness, not a claim of perfect correctness. See the FAQ for window selection guidance.

"This punishes ambitious teams."

It does the opposite. A team that ships ambitious work that survives in production gets a lower CPAC than a team that ships timid work that gets reverted. The metric rewards kept ambition.

"What about value, not just cost?"

Value is captured upstream of CPAC, by what teams choose to work on, and downstream of CPAC, by FinOps cost-to-serve and revenue analytics. CPAC is the production-cost layer in between. It does not claim to be the only number on the dashboard.

"This is just X renamed."

It is not. None of the eight approaches above carries all five cost components in the numerator, the "stayed there" clause in the denominator, and the unit-output shape that maps directly to FinOps. If it were already named, the cost-management conversation around AI would look very different than it does in 2026.

Disagree with anything on this page? Open an issue at github.com/brennhill/cost-per-accepted-change/issues. Refinements, alternative framings, and additional approaches to survey are all welcome.