
Cost per accepted outcome

By Brenn Hill·June 2026·~9 min read

Cost per accepted change measures the cost of producing trusted software. As teams move from AI that writes code to agents that do work in production, the same question reappears one layer over: not "what did we build and keep," but "what did the agent achieve, and did it stick?" Cost per accepted outcome is that metric — the runtime sibling, built on exactly the same bones.

The one-line version: Cost per accepted change prices a change that reached production and stayed. Cost per accepted outcome prices an outcome an agent produced (the result of one or more actions) that was accepted and stayed accepted. It is the same discipline (denominate by kept value, count the fully-loaded cost, report in dollars) moved from build-time to run-time. The outcome is the value unit you count; the action is the loggable primitive beneath it.

Credit: The term cost per accepted outcome (CAPO) was coined by Nikhil Mungel in "FinOps for agents: loop limits, tool-call caps and the new unit economics of agentic SaaS" (InfoWorld, March 2026), where he defines it as "the fully loaded cost to deliver one accepted outcome for a specific workflow," with acceptance set by a concrete quality gate (automated validation, a user "Apply" click, or a signal like "case not reopened in 7 days"). This page adopts CAPO and situates it in the AI FinOps family as the runtime sibling of cost per accepted change, adding the seven-line cost decomposition (its failure-impact line is close kin to Mungel's "Failure Cost Share"), the per-outcome survival window, and the LoopRails oversight bridge.

Why a new denominator

The instinctive way to watch an agent is by its bill: cost per token, cost per request, cost per agent-run. Those are the agent-world equivalent of "lines of code" — activity metrics that rise whether or not the work was any good. An agent that runs ten thousand times but whose work gets overridden, rolled back, or quietly re-done has produced ten thousand runs and far fewer kept outcomes. A per-token dashboard will call that cheap. It is not.

Cost per accepted outcome borrows the move that makes cost per accepted change honest: put kept value in the denominator. Count the outcomes that were accepted and stayed accepted, and let the cost of the ones that didn't fall into the numerator as remediation. The result is a single, finance-legible number that tracks the economics of trusted autonomous work, not the volume of it. (And "outcome," not "action," is deliberate: an outcome is a value-bearing result, so the metric can't be gamed by counting tool calls.)

The formula, expanded

The numerator is the fully-loaded cost of running the agent over a window — and, as with cost per accepted change, the lines that matter most are the ones a token bill never shows:

Component	What it captures
Inference cost	LLM tokens — input, output, cache, reasoning — including retries and multi-step loops.
Tool & API cost	External calls the agent makes: search, code execution, RAG / vector, paid third-party APIs.
Infrastructure cost	Orchestration runtime, sandboxes, memory / vector stores, observability, queues.
Oversight cost	The human-in-the-loop labor — approvals, reviews, the show-and-prove load. As autonomy scales, this becomes the dominant hidden line, exactly as review cost did for AI-assisted coding.
Remediation cost	The internal labor to clean up outcomes that did not stay: rollbacks, human redo, incident response.
Failed-run cost	Runs that produced nothing usable but still billed tokens and compute.
Failure impact	The downstream financial consequence of outcomes that failed — escalation to costlier channels, lost or delayed revenue, refunds and credits, SLA penalties, churn, compliance exposure. Distinct from remediation: that's what you pay to fix it; this is the value destroyed.

The denominator is the count of accepted outcome units: consequential agent outcomes that were accepted and stayed accepted during the window, complexity-normalized so a one-shot classification and a fifty-step autonomous workflow aren't counted as equals. (A natural normalizer is the risk grade of the actions that produced the outcome — more on that below.)
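Written out, the numerator and denominator described above might be expressed as follows. The symbols are this sketch's own shorthand, not notation from the original CAPO definition: each C is one of the seven cost lines over a window W, A_W is the set of outcomes that were accepted and stayed accepted in that window, and w_i is an assumed complexity weight (set every w_i to 1 for an unweighted count).

```latex
\[
\mathrm{CAPO}_W \;=\;
\frac{C_{\text{inference}} + C_{\text{tool}} + C_{\text{infra}} + C_{\text{oversight}}
      + C_{\text{remediation}} + C_{\text{failed-run}} + C_{\text{failure-impact}}}
     {\displaystyle\sum_{i \in A_W} w_i}
\]
```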

"Stayed accepted" — the rework defense, for agents

This is the clause that does the work, just like the "stayed there" clause in cost per accepted change. An outcome counts in the denominator only if, within a survival window, it was not:

reverted or rolled back — the outcome was undone;
overridden or corrected by a human reviewer;
re-run to get a result that finally stuck;
re-opened by the end user (the support "re-contact" signal); or
the cause of an incident, complaint, or compliance issue needing remediation in the window.

Outcomes the agent produced this window split two ways:

Accepted & stayed → counts in the denominator (accepted outcome units)
Did not stay → its cleanup and its downstream consequences hit the numerator (remediation + failure impact)

That double-hit is the point, and it's inherited straight from cost per accepted change: a shortfall shows up twice — once by shrinking the denominator, once by growing the numerator — so a metric built on it can't be fooled by an agent that ships fast and reliably wrong.

Two nuances worth baking in: Approval is not an override. In a human-in-the-loop system, a reviewer approving an outcome is the design working, not a failure; the test is whether it stayed without later correction. And a correct escalation is a win. An agent that recognizes it can't safely handle something and hands off to a human produced a good outcome; only failure handoffs (it tried, got it wrong, a human cleaned up) count against the denominator.
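One way to encode the survival test and the two nuances is a small classifier over an outcome's event log. This is a minimal sketch under assumptions: the event names, the `hoursAfterOutcome` field, and the one-week window are all hypothetical, chosen only to mirror the list above.

```typescript
// Hypothetical event kinds: the five disqualifiers from the list above,
// plus the two nuances (approval, and correct vs. failed handoff).
type OutcomeEventKind =
  | "approved"
  | "escalated_correctly"
  | "reverted"
  | "overridden"
  | "rerun"
  | "reopened"
  | "caused_incident"
  | "failure_handoff";

interface OutcomeEvent {
  kind: OutcomeEventKind;
  hoursAfterOutcome: number; // time since the agent produced the outcome
}

// true = this kind of event removes the outcome from the denominator.
const DISQUALIFIES: Record<OutcomeEventKind, boolean> = {
  approved: false,            // the design working, not a failure
  escalated_correctly: false, // a correct handoff is a good outcome
  reverted: true,
  overridden: true,
  rerun: true,
  reopened: true,
  caused_incident: true,
  failure_handoff: true,      // agent tried, got it wrong, a human cleaned up
};

// An outcome stayed accepted if no disqualifying event landed inside the window.
function stayedAccepted(events: OutcomeEvent[], survivalWindowHours: number): boolean {
  return !events.some(
    (e) => DISQUALIFIES[e.kind] && e.hoursAfterOutcome <= survivalWindowHours
  );
}

const WEEK_HOURS = 7 * 24; // assumed survival window
console.log(stayedAccepted([{ kind: "approved", hoursAfterOutcome: 1 }], WEEK_HOURS));  // true
console.log(stayedAccepted([{ kind: "reopened", hoursAfterOutcome: 30 }], WEEK_HOURS)); // false
```

Events that land after the window closes are ignored here; whether they should feed a later window's remediation line is a policy choice the sketch leaves open.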

A worked example

A fleet of support-and-ops agents over a one-week window. Loaded human-review time runs through the oversight line.

Line item	Value
Inference cost	$3,000
Tool & API cost	$1,200
Infrastructure cost	$800
Oversight cost (human-in-the-loop)	$9,000
Remediation cost (overridden / rolled-back outcomes)	$4,000
Failed-run cost	$1,000
Direct operating cost (subtotal of the six lines above)	$19,000
Failure impact (escalations, refunds, lost sales)	$12,000
Total fully-loaded cost	$31,000
Outcomes attempted	12,500
Accepted outcome units (accepted & stayed)	5,000
Cost per accepted outcome	$6.20

The same window tells three very different stories. A per-run dashboard divides the $31,000 by all 12,500 attempts and reports a cheerful $2.48 a run. Count only what you directly pay and divide by the 5,000 outcomes that stuck, and you get $3.80. But the honest number includes the consequences of the outcomes that failed — the escalations, refunds, and lost sales — which lifts it to $6.20, where the single largest line is neither inference nor oversight but failure impact, at 39% of the bill. Nothing here says "don't run the agent." It says the real cost lives in the human load, the cleanup, and above all the downstream consequences — so that's where the next improvement is. Raise the acceptance rate and all three move in your favor at once.
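The three stories can be reproduced with plain arithmetic. The sketch below uses only the figures from the worked example; the function and field names are illustrative and are not part of any library.

```typescript
interface WindowCosts {
  inference: number;
  tool: number;
  infra: number;
  oversight: number;
  remediation: number;
  failedRun: number;
  failureImpact: number;
}

function threeStories(costs: WindowCosts, attempted: number, accepted: number) {
  // Everything you directly pay for: the six lines other than failure impact.
  const direct =
    costs.inference + costs.tool + costs.infra +
    costs.oversight + costs.remediation + costs.failedRun;
  const total = direct + costs.failureImpact;
  return {
    perRun: total / attempted,                   // the per-run dashboard view
    directPerAccepted: direct / accepted,        // direct cost over kept outcomes
    costPerAcceptedOutcome: total / accepted,    // fully loaded over kept outcomes
    failureImpactShare: costs.failureImpact / total,
  };
}

const week = threeStories(
  { inference: 3000, tool: 1200, infra: 800, oversight: 9000,
    remediation: 4000, failedRun: 1000, failureImpact: 12000 },
  12500,
  5000
);

console.log(week.perRun);                 // 2.48
console.log(week.directPerAccepted);      // 3.8
console.log(week.costPerAcceptedOutcome); // 6.2
console.log(Math.round(week.failureImpactShare * 100)); // 39
```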

The line most agent dashboards omit: Failure impact is the hardest component to quantify and the easiest to leave at zero, which is exactly why omitting it is dangerous. A single wrong autonomous action can trigger a refund, an SLA penalty, or a lost deal worth orders of magnitude more than the tokens that produced it. Estimate it as failure rate × average consequence, scaled by the number of outcomes attempted, rather than pretend it's nothing; a rough number beats a missing one. This line is also the dollar form of LoopRails' consequence-severity axis (high-consequence actions are the ones to prevent, not merely review), so cost per accepted outcome is what makes that severity legible to finance.
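A rough estimate of the failure-impact line might look like the sketch below. The function name is hypothetical, and the inputs are back-derived from the worked example purely for illustration: 7,500 of 12,500 outcomes did not stay (a 60% failure rate), and $12,000 of impact spread over those 7,500 gives an average consequence of $1.60. Real estimates would come from refund, SLA, and churn data.

```typescript
// Expected failure impact for a window:
// outcomes attempted x failure rate x average dollar consequence per failure.
function estimateFailureImpact(
  outcomesAttempted: number,
  failureRate: number,          // fraction between 0 and 1
  avgConsequenceUsd: number     // average downstream cost of one failed outcome
): number {
  if (failureRate < 0 || failureRate > 1) {
    throw new RangeError("failureRate must be between 0 and 1");
  }
  return outcomesAttempted * failureRate * avgConsequenceUsd;
}

// Illustrative inputs derived from the worked example above.
const estimate = estimateFailureImpact(12500, 0.6, 1.6);
console.log(estimate.toFixed(2)); // "12000.00"
```

Even when the average consequence is a guess, carrying a labeled estimate in the numerator keeps the line visible, which is the point of the paragraph above.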

A FinOps operating model

What makes this a FinOps practice and not just a metric is the loop around it. It maps cleanly onto the FinOps Foundation's three phases:

Phase 1: Inform

Tag every action with agent, task-type, risk grade, tenant/team, and outcome (accepted / overridden / escalated / reverted). Attribute spend to accepted outcomes, not the total token bill. This is the visibility layer — showback by team, agent, and task.
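The tagging and showback step could be sketched as a record type plus a per-team rollup. Everything here is an assumption for illustration: the field names, the simplification that each record is one outcome with its actions' cost already rolled up, and the choice to count only "accepted" in the denominator (whether a correct escalation also counts is left out).

```typescript
type RiskGrade = "G0" | "G1" | "G2" | "G3";
type OutcomeStatus = "accepted" | "overridden" | "escalated" | "reverted";

// Hypothetical tag schema: one record per outcome.
interface OutcomeRecord {
  agent: string;
  taskType: string;
  riskGrade: RiskGrade;
  team: string;
  outcome: OutcomeStatus;
  costUsd: number;
}

interface Showback {
  spendUsd: number;
  acceptedOutcomes: number;
  costPerAccepted: number | null; // null when nothing was accepted
}

// Attribute all spend, including spend on outcomes that did not stay,
// to the accepted outcomes of the team that incurred it.
function showbackByTeam(records: OutcomeRecord[]): Record<string, Showback> {
  const byTeam: Record<string, Showback> = {};
  for (const r of records) {
    const row = byTeam[r.team] ?? { spendUsd: 0, acceptedOutcomes: 0, costPerAccepted: null };
    row.spendUsd += r.costUsd;
    if (r.outcome === "accepted") row.acceptedOutcomes += 1;
    row.costPerAccepted =
      row.acceptedOutcomes > 0 ? row.spendUsd / row.acceptedOutcomes : null;
    byTeam[r.team] = row;
  }
  return byTeam;
}
```

The same grouping works for agent or task type by swapping the key; the important design choice is that failed spend stays attached to the team rather than being dropped.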

Phase 2: Optimize

Drive cost per accepted outcome down: cache, right-size the model per risk grade, cap loop depth, and (the biggest lever) raise the acceptance rate. Fewer reverts and failure escalations beat cheaper tokens, because that lifts the denominator while cutting remediation and failure impact at the same time.

Phase 3: Operate

Govern continuously: budgets and anomaly alerts on cost per accepted outcome per agent and team, and — the real prize — tie any expansion of agent autonomy to the trend, not to adoption. A flat per-seat token cap is a blunt v0 of this; the trend is the steering wheel.
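As a sketch of what governing on the trend might mean in practice, the two checks below flag an anomalous window and gate autonomy expansion on a non-worsening run of windows. The 20% tolerance, the four-window minimum, and both function names are assumptions for illustration, not recommended thresholds.

```typescript
// Alert when the latest cost per accepted outcome exceeds the trailing mean
// by more than `tolerance` (0.2 = 20%, an assumed threshold).
function isAnomalous(history: number[], latest: number, tolerance: number = 0.2): boolean {
  if (history.length === 0) return false;
  const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
  return latest > mean * (1 + tolerance);
}

// Allow autonomy to expand only if the metric has not risen in any of the
// last `minWindows` windows. Adoption counts play no part in the decision.
function trendSupportsExpansion(history: number[], minWindows: number = 4): boolean {
  if (history.length < minWindows) return false;
  const recent = history.slice(-minWindows);
  return recent.every((value, i) => {
    const previous = i > 0 ? recent[i - 1] : undefined;
    return previous === undefined || value <= previous;
  });
}

// Illustrative weekly values, oldest first.
console.log(trendSupportsExpansion([7.1, 6.8, 6.5, 6.2])); // true
console.log(trendSupportsExpansion([6.2, 6.5, 6.1, 6.0])); // false
console.log(isAnomalous([6, 6, 6], 8));                    // true
```

A strict non-increasing rule is deliberately conservative; a smoothed slope would tolerate noise better but is harder to explain to finance.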

Pair it with leading indicators

As with cost per accepted change, the headline is a summary; don't use it alone. Report it alongside two or three of the diagnostics below, chosen to explain why it moved:

Acceptance rate — share of outcomes accepted and kept. The primary quality signal.
Override / reversal rate and failure-escalation rate — where the denominator is leaking.
Autonomy ratio — share of actions executed without a human gate.
Loop depth and tokens per accepted outcome — where inference cost is going.
Machine catch rate — share of bad actions caught by automated gates before they shipped, rather than downstream.
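The diagnostics above reduce to simple ratios over counts collected per window. The sketch below assumes a hypothetical `WindowCounts` shape; the field names are illustrative, and each rate returns null when its denominator is zero rather than dividing by it.

```typescript
// Hypothetical per-window counts; names are illustrative.
interface WindowCounts {
  outcomesAttempted: number;
  outcomesAcceptedAndKept: number;
  outcomesOverriddenOrReversed: number;
  failureEscalations: number;
  actionsTotal: number;
  actionsWithoutHumanGate: number;
  badActionsTotal: number;
  badActionsCaughtByGates: number;
  tokensTotal: number;
}

function safeRatio(numerator: number, denominator: number): number | null {
  return denominator > 0 ? numerator / denominator : null;
}

function diagnostics(c: WindowCounts) {
  return {
    acceptanceRate: safeRatio(c.outcomesAcceptedAndKept, c.outcomesAttempted),
    overrideRate: safeRatio(c.outcomesOverriddenOrReversed, c.outcomesAttempted),
    failureEscalationRate: safeRatio(c.failureEscalations, c.outcomesAttempted),
    autonomyRatio: safeRatio(c.actionsWithoutHumanGate, c.actionsTotal),
    tokensPerAcceptedOutcome: safeRatio(c.tokensTotal, c.outcomesAcceptedAndKept),
    machineCatchRate: safeRatio(c.badActionsCaughtByGates, c.badActionsTotal),
  };
}
```

Reporting the rates next to the headline number makes it possible to tell a cost move caused by cheaper tokens from one caused by a leaking denominator.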

The bridge: oversight you can measure

Cost per accepted outcome only works if you can observe which outcomes stayed — and that observability is precisely what a good oversight framework gives you. LoopRails is the natural companion: its RAIL properties make actions Reversible, Authorized, Interruptible, and Logged, and that "Logged" is the audit trail you need to tell an accepted outcome from a reverted one. Its risk grades (G0–G3) are a ready-made complexity normalizer for the denominator — weight accepted outcomes by the grade of the actions behind them instead of inventing a new scale.
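Weighting the denominator by risk grade could be sketched as below. The weights are invented for illustration; nothing in the source says LoopRails assigns numeric weights to G0 through G3, so treat the table as a placeholder to calibrate against your own workloads.

```typescript
type RiskGrade = "G0" | "G1" | "G2" | "G3";

// Assumed, illustrative weights: higher-grade outcomes count for more units.
const ILLUSTRATIVE_WEIGHTS: Record<RiskGrade, number> = { G0: 1, G1: 2, G2: 4, G3: 8 };

// Sum the weights of every outcome that was accepted and stayed accepted.
function acceptedOutcomeUnits(
  acceptedGrades: RiskGrade[],
  weights: Record<RiskGrade, number> = ILLUSTRATIVE_WEIGHTS
): number {
  return acceptedGrades.reduce((sum, grade) => sum + weights[grade], 0);
}

// Fully-loaded cost divided by weighted units; null when there are no units.
function weightedCostPerAcceptedOutcome(
  totalCostUsd: number,
  acceptedGrades: RiskGrade[],
  weights: Record<RiskGrade, number> = ILLUSTRATIVE_WEIGHTS
): number | null {
  const units = acceptedOutcomeUnits(acceptedGrades, weights);
  return units > 0 ? totalCostUsd / units : null;
}

console.log(acceptedOutcomeUnits(["G0", "G0", "G3"]));               // 10
console.log(weightedCostPerAcceptedOutcome(50, ["G0", "G0", "G3"])); // 5
```

Whatever weights are chosen, holding them fixed across windows matters more than their exact values, since the metric is read as a trend.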

The division of labor is clean: LoopRails decides which actions need a human and proves the oversight catches mistakes; cost per accepted outcome prices the results that got kept. Oversight and economics, measuring the same thing — trusted autonomous work — from two sides.

Two more companions each map onto one side of the ratio. BRACE supplies the security controls that stop a hijacked or misaligned agent from generating catastrophic failure impact in the first place, which protects the numerator. And eval-driven development, the verification-quality discipline, is how you raise the acceptance rate that lifts the whole denominator. Security, oversight, quality, cost: four lenses on the same goal, and cost per accepted outcome is the one that puts the others in dollars.

What it is, and is not

It is an aggregate, time-series, fleet-or-team-level steering metric — the dollar bottom line of running agents.
It is not cost-per-token or cost-per-run (those are inputs and diagnostics), a per-agent leaderboard, or an autonomy KPI to chase.
It is not a replacement for safety gates. It pairs with oversight; it never relaxes it.

Compute it

The calculator has a "For agents" tab: enter the seven cost lines and your accepted-outcome count and get the number, with a shareable link. The same logic ships in the reference library:

import { costPerAcceptedOutcome } from 'cost-per-accepted-change';

const capo = costPerAcceptedOutcome({
  inferenceCost: 3000,
  toolCost: 1200,
  infraCost: 800,
  oversightCost: 9000,
  remediationCost: 4000,
  failedRunCost: 1000,
  failureImpactCost: 12000,   // escalations, refunds, lost sales
  acceptedOutcomes: 5000,
});

console.log(capo.value);     // 6.2
console.log(capo.breakdown); // share-of-total per component


Cost per accepted outcome (CAPO) was introduced by Nikhil Mungel (InfoWorld, 2026); this page situates it alongside cost per accepted change, the build-side metric from The Delivery Gap (Brenn Hill, 2026). Both are free to use and adapt; refinements and worked examples are welcome via GitHub. See how to cite.