Field note · Microsoft · GitHub Copilot
The Copilot productivity paradox, and the number that resolves it
GitHub measured a 55% speed-up. Independent researchers measured 41% more bugs, rising code churn, and experienced developers running 19% slower while feeling 20% faster. Here's the reassuring part: all of these results are real. They disagree because each one honestly measured a different thing — and the thing most of us actually want to know sits just out of frame in every one of them: what it costs to produce software the team gets to keep. That's a hard number to see, and nobody has been hiding it. This is a look at how one metric brings it into view.
The velocity story we all heard
The case for AI-assisted coding was made early, and it was made well, in the language of speed. In a controlled experiment GitHub ran in 2022, developers asked to implement an HTTP server in JavaScript finished 55.8% faster with Copilot than without — a solid result, with a 95% confidence interval running from 21% to 89%, and it understandably shaped how the whole industry came to talk about these tools.[1] It is a genuine finding. For a self-contained, greenfield task, the tool really is dramatically faster, and that's worth being excited about.
From there the story compounded, as good news does. GitHub reported code-completion acceptance rates around 30%, and later that Copilot authored roughly 46% of the code in files where it was enabled.[2] Microsoft's leadership described Copilot as a larger business than all of GitHub had been at acquisition. And the conviction was capitalized: Microsoft told investors it expected to spend on the order of $80 billion on AI-enabled data centers in fiscal 2025 alone.[3] These are big, encouraging numbers, and they were honestly come by.
It's worth gently noticing what they have in common — task completion time, acceptance rate, share of code authored, adoption counts. Every one measures activity: how fast code is produced, and how much of it the model wrote. What none of them can see, by design, is whether the code that got produced was kept. That's not a flaw in the metrics; it's just outside what they were built to watch.
What independent measurement found
As the tools moved from benchmark tasks into real, messy codebases, a second body of evidence built up — and it pulled in a different direction. None of this contradicts the excitement above; it just adds the part of the picture that's harder to photograph.
| Study | What it measured | Finding |
|---|---|---|
| METR, 2025[4] | Experienced open-source developers, 246 real tasks on large (~1M-line) repositories, randomized | 19% slower with AI — while the same developers believed they were ~20% faster |
| Uplevel, 2024[5] | ~800 developers, cycle time and PR throughput, before/after Copilot | No significant throughput gain; 41% more bugs in pull requests |
| GitClear, 2024–25[6] | 200M+ changed lines, 2020–2024, code-evolution patterns | Short-window churn up (5.5%→7.9%); copy-pasted code up (8.3%→12.3%); refactored "moved" code down (24.1%→9.5%) |
| DORA, 2024[7] | Industry survey, AI adoption vs delivery performance | +25% AI adoption associated with −7.2% delivery stability and −1.5% throughput |
These studies do not all carry the same weight, and it's only fair to say so. METR has a small sample (16 developers) but a rigorous randomized design on real work. Uplevel and GitClear are large but observational: they show association, not proof of cause. DORA is a correlational survey. None of them overturns the original GitHub experiment; the HTTP-server task really was finished faster. What they gently question is the leap from "faster on a fresh task" to "more value delivered across a whole engineering organization," a leap that's easy and natural to make, and that the data asks us to make more carefully.
The pattern: velocity was measured, kept value was not
Read the two bodies of evidence together and the paradox dissolves. The first measured how much code got written, and how fast. The second measured what happened to it next: more of it had to be revised within two weeks, more of it was duplicated rather than refactored, more bugs rode in on it, and delivery got less stable. Faster generation and lower retention are not contradictory. They are the two halves of a single, unmeasured trade.
The DORA report named the mechanism kindly and precisely: AI makes it easier to write more code, so batch sizes grow — and large batches are simply harder to deliver reliably. You can produce more, faster, and keep a little less of it, and a dashboard built on production volume will show you the first half warmly while having no way to see the second. Neither half is a lie. They're just two different questions.
This is the delivery gap in miniature. Cost per accepted change (CPAC) is the metric defined to close it: the fully-loaded cost of producing software that reached production and stayed there, divided by the number of changes that did. It is designed to move, visibly and early, in exactly the conditions these studies describe, so a team can catch the trade while it's still small and easy to steer.
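Written as a formula, that definition looks roughly like this. The symbols are shorthand introduced here for illustration, not notation taken from the metric's own definition: $C_k$ is the fully-loaded cost of the $k$-th cost component over the period, and $A$ is the number of changes that reached production and stayed there.

$$
\text{cost per accepted change} = \frac{\sum_{k} C_k}{A}
$$

Anything that adds cost raises the numerator, and anything that gets a change reverted lowers $A$, so either one pushes the result up.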
What cost per accepted change adds to the picture
Here's the encouraging part: lay each independent finding next to the metric and the contradiction dissolves into something you can actually work with. Each one becomes a specific line item rather than a mystery:
- The +41% bugs (Uplevel) land in two places. Bug-fix work that retracts a recent change pulls that change out of the denominator — it did not stay. And the cost of the fix lands in the numerator as rework. The same event hits CPAC twice; on a PR-count dashboard it is invisible.
- The rising churn (GitClear) is the denominator leaking. Code revised within two weeks is, by the metric's default 30-day survival window, a candidate for not being an accepted change. A team whose churn climbed from 5.5% to 7.9% has a denominator quietly shrinking relative to its raw merge count.
- The slowdown on real repos (METR) shows up as engineering and review cost. If senior developers spend more wall-clock time steering and verifying AI output on a mature codebase, that time is numerator cost — whether or not the velocity dashboard moved.
- The stability drop (DORA) is the accepted-change rate falling. Less stable delivery means a larger share of merges get reverted or hotfixed inside the survival window. That is the denominator, by definition.
And because CPAC is tracked component by component, the $80B-scale infrastructure bet would have been legible too: model cost is one of five numerator lines. A team could watch model cost rise while rework and review costs rose alongside it, and see immediately that cheaper, faster generation was being paid for downstream, rather than assume the speed-up was free.
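The "hits twice" behaviour in the bullets above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the metric's reference implementation: the `Merge` record and its field names are hypothetical, a retraction on exactly day 30 is assumed to fall inside the window, and the size normalization applied in the worked reading is left out entirely.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

SURVIVAL_WINDOW = timedelta(days=30)  # the default window named in the text


@dataclass
class Merge:
    """One merged change. Field names are hypothetical, for illustration only."""
    merged_on: date
    retracted_on: Optional[date] = None  # date of a revert or hotfix, if any
    rework_cost: float = 0.0             # cost of that fix, if any


def survived(merge: Merge, window: timedelta = SURVIVAL_WINDOW) -> bool:
    """A merge counts as accepted unless it was retracted inside the window."""
    if merge.retracted_on is None:
        return True
    # Assumption: a retraction on the last day of the window still counts
    # as "inside" it, so only strictly later retractions survive.
    return merge.retracted_on - merge.merged_on > window


def cpac(merges: List[Merge], base_cost: float) -> float:
    """Total cost (base plus rework) divided by the merges that survived."""
    accepted = sum(1 for m in merges if survived(m))
    if accepted == 0:
        raise ValueError("no accepted changes in this period")
    total_cost = base_cost + sum(m.rework_cost for m in merges)
    return total_cost / accepted
```

With ten merges and a $10,000 base cost, the result is $1,000 per accepted change. Retract one of them after ten days at a rework cost of $500 and the same call returns 10,500 / 9, roughly $1,167: the numerator grew and the denominator shrank from a single event.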
A worked reading (illustrative)
To make the mechanism concrete, here is a hypothetical team — not Microsoft, not any real team, just numbers chosen to show how the lens behaves. A 10-engineer group adopts an AI assistant. Their velocity dashboard lights up: pull requests rise 30%, cycle time ticks down, "AI authored 45% of our code." On those metrics it reads as a clear win — and, honestly, on those metrics it is one. The team has every reason to feel good.
Now read the same quarter through cost per accepted change, applying a 30-day survival window and 500-line size normalization to the denominator, and counting all five cost lines in the numerator:
| | Before AI | After AI |
|---|---|---|
| Merged PRs (velocity dashboard) | 100 | 130 |
| Reverted / hotfixed within 30 days | 8 | 26 |
| Accepted change units (after survival + normalization) | 100 | 108 |
| Engineering + steering cost | $150,000 | $150,000 |
| Review cost | $30,000 | $44,000 |
| Rework cost | $14,000 | $34,000 |
| Model + infrastructure cost | $2,000 | $16,000 |
| Total cost | $196,000 | $244,000 |
| Cost per accepted change | $1,960 | $2,259 |
Pull requests rose 30%; accepted change units rose 8%; and the cost of producing each kept change went up about 15%. Nothing here says "don't use AI" — the tool may well be worth it, and this team is one good quarter of tuning away from making it clearly worth it. What the reading does is hand them something to act on: the real movement is in the rework, review, and model lines, so the next step is encouraging and concrete — tighten PR size, strengthen the gates, and re-measure. Without this view it would have been easy to scale the spend on the strength of a 30% gain that hadn't quite been banked yet — not because anyone was careless, but because the dashboards in front of them genuinely couldn't show the other side of the ledger.
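For anyone who wants to check the arithmetic, the table can be reproduced with a short script. The figures are copied from the illustrative table above; the dictionary keys and function names are just labels chosen for this sketch, and the accepted-change units are taken as given rather than derived.

```python
# Figures copied from the illustrative table; key names are labels only.
BEFORE = {
    "merged_prs": 100,
    "accepted_units": 100,
    "costs": {
        "engineering_steering": 150_000,
        "review": 30_000,
        "rework": 14_000,
        "model_infra": 2_000,
    },
}
AFTER = {
    "merged_prs": 130,
    "accepted_units": 108,
    "costs": {
        "engineering_steering": 150_000,
        "review": 44_000,
        "rework": 34_000,
        "model_infra": 16_000,
    },
}


def total_cost(quarter):
    """Sum every cost line for the quarter."""
    return sum(quarter["costs"].values())


def cpac(quarter):
    """Total cost divided by accepted change units."""
    return total_cost(quarter) / quarter["accepted_units"]


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) * 100 / old


def cost_deltas(before, after):
    """Per-line cost movement, to show which components moved."""
    return {k: after["costs"][k] - before["costs"][k] for k in before["costs"]}


if __name__ == "__main__":
    print(f"Total cost: {total_cost(BEFORE):,} -> {total_cost(AFTER):,}")
    print(f"CPAC: {cpac(BEFORE):,.0f} -> {cpac(AFTER):,.0f}")
    print(f"Merged PRs: {pct_change(BEFORE['merged_prs'], AFTER['merged_prs']):+.1f}%")
    print(f"Accepted units: {pct_change(BEFORE['accepted_units'], AFTER['accepted_units']):+.1f}%")
    print(f"CPAC change: {pct_change(cpac(BEFORE), cpac(AFTER)):+.1f}%")
    print(f"Cost deltas: {cost_deltas(BEFORE, AFTER)}")
```

The per-line deltas make the reading concrete: engineering and steering cost is flat, while review, rework, and model cost account for the whole $48,000 increase.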
The takeaway
The Copilot record isn't a cautionary tale, and it certainly isn't a story about a bad tool — by the evidence it's a fast, genuinely useful one. It's a story about something every one of us does: we measure the half of the transaction that's easy to see — generation speed, code-authored share — and the other half — retention, rework, stability — stays quieter and harder to instrument, even though it's every bit as real. That's not a failing of any team. It's just how visibility works when the tools are new and moving fast.
So the gentle suggestion is to add one question to the mix. Alongside "how much faster are we?" and "what percent of our code is AI-written?", ask: what does a change we actually keep cost us now, versus a year ago — and which of the five cost components moved? It's answerable, it's finance-legible, and it's the question the usual dashboards simply weren't built to ask. None of us has this fully figured out yet; the point of the number is to make the figuring-out a little easier, and a lot more honest.
Sources
1. GitHub, "Research: quantifying GitHub Copilot’s impact on developer productivity and happiness" (2022), and Peng et al., "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot," arXiv:2302.06590. The controlled experiment in which Copilot users completed an HTTP-server task 55.8% faster. https://arxiv.org/abs/2302.06590
2. GitHub, "Quantifying GitHub Copilot’s impact" and subsequent company statements reporting an average code-completion acceptance rate near 30% and, later, that Copilot authors roughly 46% of code in files where it is enabled. These are vendor-reported figures, not independent studies. https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
3. CNBC, "Microsoft expects to spend $80 billion on AI-enabled data centers in fiscal 2025" (Jan 3, 2025). https://www.cnbc.com/2025/01/03/microsoft-expects-to-spend-80-billion-on-ai-data-centers-in-fy-2025.html
4. METR, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," arXiv:2507.09089. A randomized trial in which experienced developers were 19% slower with AI tools while believing they were ~20% faster. https://arxiv.org/abs/2507.09089
5. Uplevel, "AI for Developer Productivity" (Sept 2024). A study of ~800 developers finding no significant change in cycle time or PR throughput and a 41% increase in bugs introduced in pull requests. https://uplevelteam.com/blog/ai-for-developer-productivity
6. GitClear, "AI Copilot Code Quality" (2024 and 2025 editions). Analysis of 200M+ changed lines finding rising short-window code churn, a surge in duplicated/copy-pasted code, and a steep decline in refactored ("moved") lines. https://www.gitclear.com/ai_assistant_code_quality_2025_research
7. Google / DORA, "2024 Accelerate State of DevOps Report." Estimates that a 25% increase in AI adoption was associated with a 7.2% decrease in delivery stability and a 1.5% decrease in throughput. https://dora.dev/research/2024/dora-report/
This is a field note — a friendly, illustrative reading of the public record, not a commissioned case study and not a scorecard for anyone. We're all still learning how to get the most out of these tools, and the teams who shipped them moved fast and shared a lot in the open, which is exactly why there's enough public data to write this at all. Corrections and better public data are genuinely welcome via GitHub. The metric itself is defined on the home page; see how to cite.