For the engineer doing the setup

Instrumentation guide

How to actually wire your stack to produce cost per accepted change every window without rebuilding observability from scratch. Organized by numerator component, with a minimum-viable pipeline at the end.

1. Model cost

This is the highest-leverage instrumentation choice. Without per-team or per-commit attribution, you can only split aggregate model spend by head-count or by guesswork.

Minimum viable

One API key per team. Most providers (Anthropic, OpenAI, Google) let you create multiple keys per workspace, and usage can typically be broken down per key. This single change gets you per-team model cost with zero proxy infrastructure.
Pull billing monthly from the provider's billing dashboard or API. Anthropic exposes usage and cost via the Admin API; OpenAI via the Usage API; Google via Cloud Billing.
Allocate to teams by key. Done.
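As a rough sketch of that allocation step: assume you have exported per-key usage to a CSV with the hypothetical columns key_name,cost_usd, and that each key name starts with its team name (payments-ci, search-agent). Neither the layout nor the naming convention comes from any provider's export format; adjust both to what you actually have.

```shell
# Sum cost per team from a per-key usage export.
# Assumed (hypothetical) CSV layout: key_name,cost_usd with a header row.
# Assumed naming convention: the team is the part of key_name before the first "-".
team_model_cost() {
  awk -F',' 'NR > 1 {
      split($1, parts, "-")          # "payments-ci" -> team "payments"
      total[parts[1]] += $2
    }
    END { for (team in total) printf "%s %.2f\n", team, total[team] }' "$1" | sort
}

# Example with inline sample data.
usage_csv=$(mktemp)
printf '%s\n' 'key_name,cost_usd' 'payments-ci,120.50' 'payments-agent,79.50' 'search-agent,310.00' > "$usage_csv"
team_model_cost "$usage_csv"
```

In real use, replace the sample file with the monthly export from your provider.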

Production

Run an LLM proxy in front of every provider call. The proxy adds per-request logging, custom tags, retries, budgets, and cache analytics: the data you need to explain why the model-cost component moved.

LiteLLM — most popular open-source option. Routes to 100+ providers, supports per-key budgets, ships with a usage dashboard, runs as a Python/Docker service.
Helicone — managed proxy + observability. Generous free tier, clean UI for cost attribution and tagging. Lowest-effort option.
OpenLLMetry — OpenTelemetry instrumentation. Lower-level; pipes to your existing observability stack (Datadog, Honeycomb, Grafana) without a new dashboard.
Portkey — managed gateway with prompt management, caching, and usage tracking.

Whichever you pick, ensure each request carries tags for: team, project or repo, actor (user or agent), and purpose (e.g., code-gen, review, chat). Without tags, the proxy gives you nicer aggregate data but the same attribution problem.
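One way to check whether tagging is actually in place is to measure how much logged spend carries all four tags. The sketch below assumes a hypothetical tab-separated request log (team, repo, actor, purpose, cost_usd); real proxy exports will have a different shape, so treat the column order as a placeholder.

```shell
# Report what share of logged spend carries all four tags.
# Assumed (hypothetical) TSV layout per request: team, repo, actor, purpose, cost_usd.
# An empty field or "-" counts as a missing tag.
tag_coverage() {
  awk -F'\t' '{
      tagged = 1
      for (i = 1; i <= 4; i++) if ($i == "" || $i == "-") tagged = 0
      total += $5
      if (tagged) covered += $5
    }
    END {
      if (total == 0) { print "no spend logged"; exit }
      printf "%.1f%% of spend fully tagged\n", 100 * covered / total
    }' "$1"
}

# Three sample requests: one fully tagged, two with a missing tag.
log_tsv=$(mktemp)
printf 'payments\tapi\tagent\tcode-gen\t6.00\n' >  "$log_tsv"
printf 'search\t-\tuser\tchat\t2.00\n'          >> "$log_tsv"
printf '\tweb\tagent\treview\t2.00\n'           >> "$log_tsv"
tag_coverage "$log_tsv"
```

A low percentage means the proxy is giving you nicer aggregates but not yet solving attribution.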

Per-commit attribution

If you want model cost attributed to specific accepted changes, you need to know which commits an LLM touched. git-ai is the maturing tool here: a git extension that links AI-written lines to the agent, model, and transcripts that generated them. Combined with provider pricing, this lets you compute model cost per accepted change unit rather than spreading total spend evenly.

2. Infrastructure cost

The compute, storage, observability, CI/CD, and agent-runner overhead attributable to producing changes.

Minimum viable

Take your monthly cloud bill and identify the line items that exist to support the dev/CI/agent loop (CI minutes, agent runner VMs, code-search infrastructure, observability for AI-touched services).
Allocate to teams by head-count or by each team's share of CI minutes. Document the allocation method.
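The CI-minutes allocation can be sketched in a few lines of awk. The input format (one line per team with its CI minutes) and the function name are assumptions for illustration.

```shell
# Split a shared infrastructure bill across teams by share of CI minutes.
# Assumed (hypothetical) input: lines of "team ci_minutes"; the bill is the first argument.
allocate_infra() {
  local bill="$1" minutes_file="$2"
  awk -v bill="$bill" '
    { minutes[$1] = $2; total += $2 }
    END {
      if (total == 0) { print "no CI minutes recorded" > "/dev/stderr"; exit 1 }
      for (team in minutes) printf "%s %.2f\n", team, bill * minutes[team] / total
    }' "$minutes_file" | sort
}

# Example: an 8,000 bill split 3:1.
ci_minutes=$(mktemp)
printf '%s\n' 'payments 3000' 'search 1000' > "$ci_minutes"
allocate_infra 8000 "$ci_minutes"
```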

Production

Cloud cost tagging. AWS Cost Allocation Tags, GCP Billing Labels, Azure Cost Management tags. Tag every resource with team, project, purpose. Without tags, attribution is guesswork.
FinOps tooling. CloudHealth, Vantage, Apptio Cloudability: commercial cost-management platforms built for exactly this allocation problem. Free tiers and pricing vary by vendor, so check before committing.
CI-specific: GitHub Actions usage reports (per-repo minute usage), CircleCI insights, Buildkite analytics.

3. Engineering time

The time team members spend specifying, prompting, integrating, and steering AI work, converted to currency at a loaded hourly rate.

Minimum viable

Use a blended fully-loaded rate × planned capacity. Most organizations already have these numbers for capacity planning:

Loaded hourly rate from finance (salary + benefits + tooling + overhead, divided by working hours)
Planned capacity hours per window (e.g., 10 engineers × 4 weeks × 32 hours/week = 1,280 hours)
Multiply

This overestimates active delivery time slightly, which is the right direction — it absorbs meetings, interruptions, and the real cost of context-switching.
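The arithmetic above, as a small bash helper. The function name is hypothetical, and the rate of 150 per hour is only an illustrative value.

```shell
# Engineering cost = engineers x weeks x planned hours per week x loaded hourly rate.
# Integer inputs only: bash arithmetic cannot handle decimals.
eng_cost() {
  local engineers="$1" weeks="$2" hours_per_week="$3" hourly_rate="$4"
  echo $(( engineers * weeks * hours_per_week * hourly_rate ))
}

# The example from the text at an illustrative rate of 150 per hour:
# 10 x 4 x 32 = 1,280 hours, and 1,280 x 150 = 192,000.
eng_cost 10 4 32 150
```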

Production

Time tracking: Toggl, Harvest, Jira Tempo. High accuracy, high friction; most teams reject this. Use only if your culture already supports it.
Calendar-based estimation: Export team calendars, subtract meetings and PTO from capacity. Better signal than blended capacity; less invasive than time tracking.
Issue tracker hours: If your team estimates issues in hours and tracks completion, you can derive a delivery-hours total per window. Story-point estimates work too, but only after you agree on a points-to-hours conversion.

4. Review cost

Time spent reviewing and gating AI-generated work, converted to currency.

Minimum viable

Sample a representative week. Ask reviewers to track time spent on PR reviews for one week. Multiply by 4 (or however many weeks in your window). Adjust for known seasonality. Use the team's blended hourly rate.
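A minimal sketch of that extrapolation, assuming reviewers report their sampled hours as one number per line. The helper name and the rate are illustrative.

```shell
# Extrapolate review cost from one sampled week.
# Input on stdin: one line per reviewer with the hours they logged in the sampled week
# (decimals allowed, so awk does the arithmetic rather than bash).
review_cost() {
  local weeks_in_window="$1" hourly_rate="$2"
  awk -v weeks="$weeks_in_window" -v rate="$hourly_rate" '
    { hours += $1 }
    END { printf "%.2f\n", hours * weeks * rate }'
}

# Four reviewers logged 3.5, 2, 2.5 and 2 hours: 10 hours x 4 weeks x 150.
printf '%s\n' 3.5 2 2.5 2 | review_cost 4 150
```

Seasonality adjustments are left out here; apply them to the weekly total before multiplying.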

Production

GitHub Review timing: the GitHub Events API and GraphQL API expose PR review timestamps. With a small script, you can compute "time elapsed between PR opened and first review" and "time elapsed during active review." It's not perfect — reviewers don't spend the whole elapsed time actively reviewing — but it's a defensible proxy.
CodeScene, Swarmia, LinearB: engineering analytics platforms that compute review cycle time as a built-in metric. Useful if you already use one for DORA tracking.
Conventional commit type: if the type: prefix on your PR titles distinguishes categories with different reviewer load, you can weight them differently; security or migration PRs cost more reviewer time than a small fix.
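The elapsed-time proxy from the review-timing item reduces to subtracting two timestamps. This sketch assumes you have already pulled a PR's creation time and its first review's submission time (for example, the createdAt and submittedAt fields in GitHub's GraphQL API) and that GNU date is available; the helper name is hypothetical.

```shell
# Elapsed hours between two ISO-8601 timestamps (requires GNU date for -d).
hours_between() {
  local start_epoch end_epoch
  start_epoch=$(date -u -d "$1" +%s)
  end_epoch=$(date -u -d "$2" +%s)
  awk -v s="$start_epoch" -v e="$end_epoch" 'BEGIN { printf "%.1f\n", (e - s) / 3600 }'
}

# PR opened at 09:00 UTC, first review submitted at 15:30 UTC the same day.
hours_between "2026-04-02T09:00:00Z" "2026-04-02T15:30:00Z"
```

Remember that this measures wall-clock waiting time, not active reviewing, so it is an upper bound on reviewer effort per PR.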

5. Rework cost

The trickiest component to instrument, and the one most teams ignore — which is exactly why catching it matters.

Mining reverts from git

Three signals to capture, ordered from most reliable to most subjective:

(a) Explicit git revert commits. Built-in syntax; commit messages are prefixed Revert "...".

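# All revert commits in a window. Give explicit times: when a date has no
# time, git fills in the current time of day, which can clip the boundaries.
git log --grep='^Revert "' \
  --since='2026-04-01 00:00:00' --until='2026-04-30 23:59:59' \
  --pretty=format:'%h %s'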

# Or via gh: match GitHub's auto-generated revert PR title pattern.
# (Body-search for "reverts #" is unreliable: GitHub's search tokenizer
# strips the # and over-matches the word "reverts".)
# --limit raises gh's default cap of 30 results.
gh pr list -R owner/repo --state merged --limit 1000 \
  --search 'merged:2026-04-01..2026-04-30 in:title "Revert"' \
  --json number,title,body,additions,deletions

(b) PRs labeled as fixes or hotfixes. Requires team discipline on labels, but very cheap once in place:

# --limit raises gh's default cap of 30 results.
gh pr list --state merged --limit 1000 \
  --search 'merged:2026-04-01..2026-04-30 label:fix,hotfix,bug' \
  --json number,title,additions,deletions

(c) Conventional Commits. If your team uses Conventional Commits, you get free typing of every commit (fix:, revert:, feat:). Parse the commit subjects directly, and do not pass --merges: merge commits typically have non-conventional messages like Merge pull request #123 ..., and --merges would keep only those.

# Works for any workflow. For squash-merge workflows, add `--first-parent`
# to walk only mainline commits. Do not add `--merges`: a squash commit has
# a single parent, so that flag would filter it out.
# The optional `!` covers breaking-change subjects such as `fix!: ...`.
git log --since='2026-04-01 00:00:00' --until='2026-04-30 23:59:59' \
  --pretty=format:'%s' | grep -E '^(fix|revert)(\(.*\))?!?: '

For each identified revert or fix, capture the hours spent. The cheapest approach is a manual hour estimate per ticket, assigned when the window is reviewed (for example, at the quarterly review). The most rigorous is to attach a time-spent field to each fix ticket and roll up automatically.
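Before estimating hours, it helps to know how many items you are dealing with. The sketch below tallies fix and revert subjects from stdin, covering both the Conventional Commits prefixes from (c) and git's built-in Revert "..." prefix from (a). The function name is hypothetical.

```shell
# Count fix and revert subjects read from stdin.
# Matches "fix: ...", "fix(scope): ...", "revert: ..." and 'Revert "..."'.
tally_rework() {
  awk '
    /^fix(\([^)]*\))?!?: /    { fixes++ }
    /^revert(\([^)]*\))?!?: / { reverts++ }
    /^Revert "/               { reverts++ }
    END { printf "fixes=%d reverts=%d\n", fixes, reverts }'
}

# Sample subjects; in real use, pipe in `git log --pretty=format:'%s'` for the window.
printf '%s\n' 'feat: add export' 'fix(api): handle null cost' \
  'Revert "feat: add export"' 'fix: rounding' | tally_rework
```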

Mining fix tickets from your issue tracker

If your team uses Jira, Linear, or GitHub Issues, fix tickets are usually well-typed:

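Jira: filter by issuetype = Bug AND resolved >= "2026-04-01" AND resolved < "2026-05-01". A bare date means midnight, so resolved <= "2026-04-30" would drop issues resolved during April 30. Pull time-spent via the Tempo or Jira Time Tracking API.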
Linear: filter by label IN ('bug', 'fix') and completedAt in the window. Linear's GraphQL API returns time estimates if your team enters them.
GitHub Issues: filter by label:bug closed in window via gh issue list. Estimates are typically not native; require a custom field.

Tying fixes back to the original change

For rigorous attribution, link each fix back to the PR or commit that introduced the defect:

PR template field: require "Fixes:" or "Reverts:" reference in PR body.
Branch naming: fix/CPAC-123-eng-time-rounding where CPAC-123 is the ticket referencing the original change.
Conventional commits: fix(commit-sha-prefix): ....

Tied fixes let you compute the more rigorous "stayed there" check: each merged PR is examined N days after it merged; if a tied fix or revert was merged within those N days (the survival window), the original is excluded from the denominator and the fix's cost lands in the numerator.
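A minimal sketch of that check, assuming two hypothetical whitespace-separated files: one listing merged PRs with their merge dates, and one listing tied fixes with the PR they point back to. File layouts and the function name are placeholders, and GNU date is required.

```shell
# Print the PR numbers that survived: no tied fix merged within N days of the original merge.
# Assumed (hypothetical) inputs, whitespace-separated:
#   prs file:   <pr_number> <merged_date>
#   fixes file: <original_pr_number> <fix_merged_date>
# Requires GNU date for -d.
surviving_prs() {
  local prs_file="$1" fixes_file="$2" window_days="$3"
  local pr merged orig fix_date age survived
  while read -r pr merged; do
    survived=1
    while read -r orig fix_date; do
      [ "$orig" = "$pr" ] || continue
      age=$(( ( $(date -u -d "$fix_date" +%s) - $(date -u -d "$merged" +%s) ) / 86400 ))
      if [ "$age" -ge 0 ] && [ "$age" -le "$window_days" ]; then survived=0; fi
    done < "$fixes_file"
    [ "$survived" -eq 1 ] && echo "$pr"
  done < "$prs_file"
  return 0
}

# PR 102 was fixed 15 days after merge (excluded); PR 103 was fixed 56 days after (survives).
prs=$(mktemp); fixes=$(mktemp)
printf '%s\n' '101 2026-04-03' '102 2026-04-10' '103 2026-04-20' > "$prs"
printf '%s\n' '102 2026-04-25' '103 2026-06-15' > "$fixes"
surviving_prs "$prs" "$fixes" 30
```

The nested loop is quadratic, which is fine for the tens to hundreds of PRs in a typical window.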

The minimum-viable approach

If you have nothing today, start by pulling the revert commits and the bug-labeled PRs in the window, eyeballing each, and assigning a rough hour estimate per fix. A team of 10–30 engineers typically has 5–25 such items in a four-week window; an hour of triage produces a defensible rework-cost number.

6. The denominator — accepted change units

The other half of the metric. Pull merged PRs, apply the 500-LOC normalization, filter by the survival window.

The recipe

# Step 1: list merged PRs in the window
gh pr list --state merged \
  --search 'merged:2026-04-01..2026-04-30' \
  --json number,additions,deletions,mergedAt,title --limit 1000 \
  > merges.json

# Step 2: for each PR, check it hasn't been reverted or fix-followed
#   within the 30-day survival window. Filter merges.json to surviving PRs.

# Step 3: normalize via the 500-LOC rule. Floor of 1 unit per surviving PR
# so binary-only PRs (additions+deletions=0 from gh's perspective) still
# count, matching the library's normalizeChanges() helper.
jq '[.[] | (.additions + .deletions) | (. / 500) | ceil | if . < 1 then 1 else . end] | (add // 0)' merges.json

The full recipe lives in the FAQ; the calculator library exports normalizeChanges() for the LOC step.
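For reference, the same 500-LOC rule can be written in plain bash integer arithmetic. This is a sketch that mirrors the jq expression in Step 3; units_for_pr is a hypothetical name, not the library's normalizeChanges().

```shell
# Change units for one PR: ceil(changed_lines / 500), with a floor of 1.
units_for_pr() {
  local changed_lines=$(( $1 + $2 ))               # additions + deletions
  local units=$(( (changed_lines + 499) / 500 ))   # integer ceiling
  (( units < 1 )) && units=1                       # binary-only PRs still count once
  echo "$units"
}

units_for_pr 0 0       # binary-only PR
units_for_pr 120 30    # 150 changed lines
units_for_pr 400 100   # exactly 500 changed lines
units_for_pr 900 101   # 1,001 changed lines
```

Adding 499 before the integer division is the usual way to round up without floating point.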

Putting it together: the monthly pipeline

A minimum-viable monthly script. Cron it for the first of the month, fetching the prior month's data. Bash arithmetic ($((...))) is integer-only and billing APIs return decimals, so the final composition is done in jq, which handles floats. jq raises an error on division by zero, so the script checks for zero units explicitly before dividing.

#!/usr/bin/env bash
set -euo pipefail

# Run on the 1st of every month, computing the prior month's cost per accepted change.
# Requires: gh (authenticated), jq, and access to your provider billing
# + cloud billing. Adjust dates, repo, team key, and hourly rate.

WINDOW_START="2026-04-01"
WINDOW_END="2026-04-30"
OWNER="my-org"
REPO="my-repo"
HOURLY_RATE=150

# ------- REPLACE BEFORE USE -------
# The two URLs below are placeholders. Fill them in with your provider's
# real billing endpoints, or the script will report $0 for model + infra
# cost (and you'll know to come back here). See documentation:
#   Anthropic Admin API:  https://docs.anthropic.com/en/api/admin-api/usage-cost
#   OpenAI Usage API:     https://platform.openai.com/docs/api-reference/usage
#   AWS Cost Explorer:    aws ce get-cost-and-usage --time-period ...
# ----------------------------------

# 1. Model cost — pull from your provider's billing API. The `|| echo 0`
#    fallback lets the script complete even if the endpoint is unreachable;
#    you'll see $0 for model cost and know what to fix.
MODEL_COST=$(curl -fsS "https://REPLACE_ME/billing?start=$WINDOW_START&end=$WINDOW_END" 2>/dev/null \
  | jq '.cost_usd // 0' 2>/dev/null || echo 0)

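# 2. Infra cost: pull from cloud billing. Same fallback pattern.
#    Cost Explorer treats End as exclusive, so pass the day after WINDOW_END
#    (GNU date syntax; BSD/macOS date needs different flags).
#    Without a --filter this is the whole account's cost, not just the
#    dev/CI/agent line items; narrow it by tag or service for real use.
CE_END=$(date -d "$WINDOW_END +1 day" +%F)
INFRA_COST=$(aws ce get-cost-and-usage \
  --time-period "Start=$WINDOW_START,End=$CE_END" \
  --granularity MONTHLY --metrics AmortizedCost 2>/dev/null \
  | jq '(.ResultsByTime[0].Total.AmortizedCost.Amount | tonumber) // 0' 2>/dev/null \
  || echo 0)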

# 3. Engineering time — blended rate × planned capacity (integer math is fine).
ENG_HOURS=1280   # 10 engineers × 4 weeks × 32h
ENG_COST=$((ENG_HOURS * HOURLY_RATE))

# 4. Review cost — sampled week × 4.
REVIEW_HOURS=40
REVIEW_COST=$((REVIEW_HOURS * HOURLY_RATE))

# 5. Rework cost: count merged fix/hotfix/revert PRs in window, ~2h per fix average.
#    --state merged is required; gh pr list only returns open PRs by default.
REWORK_PR_COUNT=$(gh pr list -R "$OWNER/$REPO" --state merged \
  --search "merged:$WINDOW_START..$WINDOW_END label:fix,hotfix,revert" \
  --json number --limit 1000 | jq 'length')
REWORK_COST=$((REWORK_PR_COUNT * 2 * HOURLY_RATE))

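# 6. Accepted change units: gh + jq + ceil. `add // 0` guards empty windows.
#    Note: GitHub search returns at most 1,000 results per query, so
#    --limit 1000 is the ceiling here. The truncation check below warns if
#    you hit it; for busier repos, split the window into shorter date ranges
#    and sum the results. This minimum-viable version also skips the
#    survival filter from section 6, so it counts every merged PR.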
PRS_JSON=$(gh pr list -R "$OWNER/$REPO" --state merged \
  --search "merged:$WINDOW_START..$WINDOW_END" \
  --json additions,deletions --limit 1000)
if [[ "$(echo "$PRS_JSON" | jq 'length')" -eq 1000 ]]; then
  echo "WARN: gh pr list hit the --limit 1000 cap; UNITS may be truncated." >&2
fi
UNITS=$(echo "$PRS_JSON" \
  | jq '[.[] | (.additions + .deletions) | (. / 500) | ceil | if . < 1 then 1 else . end] | (add // 0)')

# 7. Compose and report. jq does the float arithmetic and the zero-units guard.
jq -nr \
  --argjson model "$MODEL_COST" \
  --argjson infra "$INFRA_COST" \
  --argjson eng "$ENG_COST" \
  --argjson review "$REVIEW_COST" \
  --argjson rework "$REWORK_COST" \
  --argjson units "$UNITS" \
  --arg start "$WINDOW_START" \
  --arg end "$WINDOW_END" '
  ($model + $infra + $eng + $review + $rework) as $total |
  (if $units == 0 then "N/A (no accepted changes in window)"
   else "$\(($total * 100 / $units | round) / 100)"
   end) as $cpac |
  "Window: \($start) to \($end)
Model:    $\($model)
Infra:    $\($infra)
Eng:      $\($eng)
Review:   $\($review)
Rework:   $\($rework)
Total:    $\($total)
Units:    \($units)
Cost per accepted change: \($cpac)"
'

Append the output as a row in the tracker spreadsheet and you have a defensible monthly time series. Refine each step over time as the metric proves its value.

Honest caveats

Survival-window logic adds reporting lag. With a 30-day window, you cannot fully compute April until at least May 31, because the last April merges need to season. Plan the pipeline accordingly.
Rework attribution is the weakest link. No automation will catch silent reworks done by branching off the original and quietly fixing without a referenced ticket. Pair instrumentation with a quarterly retrospective that asks the team "what didn't we capture?"
Hourly rates are political. Get finance to bless the rate on day one. Disputes about the rate kill more measurement programs than any technical limitation.
Cache-hit rates can swing model cost dramatically. If you're trending model cost over time, also track cache hit rate as a leading indicator — a quiet cache config change can move the cost component without any change in delivery system behavior.
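The reporting lag in the first caveat can be computed rather than remembered. A minimal sketch, assuming GNU date; final_report_date is a hypothetical helper name.

```shell
# First date on which a window can be reported as final:
# the day after the survival window closes for the last merge (requires GNU date).
final_report_date() {
  local window_end="$1" survival_days="$2"
  local end_epoch
  end_epoch=$(date -u -d "$window_end" +%s)
  date -u -d "@$(( end_epoch + (survival_days + 1) * 86400 ))" +%F
}

# April window with a 30-day survival period.
final_report_date 2026-04-30 30
```

A cron job on the first of the month can still produce a provisional number; schedule a second run on or after this date for the final one.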

For the broader operational guidance (who runs the measurement, how often, what to report), see the quick-start playbook. For where to push back on common critiques of the metric, see the measurement comparison page.

Found a better tool or a sharper script for any of these components? Open an issue at the repo. The most useful updates to this page come from teams sharing what they built.