The Agent That Was Perfect in Testing

Deploying an AI agent is the easy part.

A few months ago we deployed one that had passed every test we ran. The logic was clean, the output was consistent, the edge cases were covered. We shipped it into production and moved on to the next build.

Three weeks later it was switched off.

Not because it crashed. Because nobody owned it.

The engineer who built it had moved to the next project. The manager who approved it had assumed someone else was watching the output. The team using it had noticed something felt slightly off but had no obvious person to raise it with. By the time we caught the problem, the agent had been producing subtly wrong output for two weeks — not broken enough to trigger an alert, but wrong enough to matter.

We switched it off and fixed the immediate bug within a week. That part was easy. The harder question — who actually owns this thing once it's back in production — we hadn't answered before we shipped it the first time, and fixing the code didn't answer it either.

That's the lesson I keep coming back to. Not the debugging. The ownership question we still hadn't solved.

Why agents fail differently from conventional software

Conventional software fails loudly. A bug throws an exception. A service returns a 500. The monitoring catches it, the alert fires, someone gets paged.

Agents fail quietly. They don't crash; they drift. The quality of output degrades gradually as the world changes and the agent doesn't. The context window fills with stale assumptions. The examples the prompt was written around no longer reflect how customers actually phrase things. A dependency changes upstream and the agent's behavior shifts in ways that are hard to detect without reading the output carefully.

A dashboard doesn't catch drift. A named person who reads the output does.

This is why monitoring isn't enough. Monitoring is designed to detect sudden failures in deterministic systems. Agents are probabilistic systems that degrade gradually. The failure mode requires a different response — not better instrumentation, but human accountability.
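
A toy calculation makes the point about gradual degradation concrete. The sketch below is illustrative only: the weekly quality scores are invented, and the alert rule (fire on a week-over-week drop larger than 0.05) is an assumed stand-in for a typical threshold alert, not something we ran.

```python
# Hypothetical weekly quality scores on a 0-to-1 scale.
# These numbers are made up for illustration, not measured.
weekly_quality = [0.95, 0.93, 0.91, 0.89, 0.87, 0.85]

# Assumed alert rule: fire only when quality falls by more than this in one week.
ALERT_DROP = 0.05


def sudden_drop_alerts(scores, max_drop):
    """Return the week indexes where the score fell by more than max_drop in one step."""
    return [
        i for i in range(1, len(scores))
        if scores[i - 1] - scores[i] > max_drop
    ]


def cumulative_drop(scores):
    """Total decline from the first score to the last."""
    return scores[0] - scores[-1]


alerts = sudden_drop_alerts(weekly_quality, ALERT_DROP)
total = cumulative_drop(weekly_quality)

print("alerts fired in weeks:", alerts)
print("total decline:", round(total, 2))
```

No single week crosses the alert threshold, so nothing fires, yet the total decline is twice the size of the drop the alert was built to catch. Someone reading the output in week six would notice; the alert rule, as written, would not.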

The data confirms what I lived through: 88% of enterprise AI agent pilots never reach production, and of those that do, 22% deliver negative ROI at twelve months. Forrester's 2026 research identifies named ownership before build-start as the strongest single structural predictor of production success. Not the model. Not the architecture. Who's accountable.

What every deployed agent needs

Here's what I've landed on since — not a solved playbook, a working conviction. Before any agent leaves testing, three things need to be true:

An owner who isn't the builder. The person who built the agent is the wrong person to own it in production. They're too close to the implementation, too likely to assume the output is correct, and too likely to have moved on to the next build. The owner should be someone who uses the output regularly — whose work depends on it — and who will notice when it starts producing results that don't feel right. Accountability follows dependence.

Explicit kill conditions. Not "switch it off if it feels wrong." Specific criteria: if error rate exceeds a threshold, if the owner flags more than two output quality issues in a week, if the agent hasn't been reviewed in thirty days. Kill conditions written before deployment are the only ones that hold. Written after a problem emerges, they're rationalizations.

A fixed review cadence. A recurring date on the calendar, weekly or monthly depending on the stakes, when someone asks three questions: Is this agent still doing what we deployed it to do? Have the assumptions it runs on changed? Is the output quality what we'd accept if we were seeing it for the first time today? The review doesn't need to be long. It needs to happen.
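
These three conditions can be written down as a small record that gets checked before launch and again at every review. What follows is a minimal sketch, not our actual tooling: `AgentRecord`, `readiness_gaps`, and `kill_reasons` are hypothetical names, and the 5% error-rate threshold in the usage lines is a placeholder. The two-flags-per-week and thirty-day limits come from the criteria above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class AgentRecord:
    """Hypothetical deployment record; every field name here is illustrative."""
    name: str
    builder: str
    owner: str
    max_error_rate: float          # placeholder threshold, chosen per agent before launch
    max_flags_per_week: int = 2    # owner may flag at most this many issues in a week
    max_days_unreviewed: int = 30  # switch off once this many days pass without a review


def readiness_gaps(agent: AgentRecord) -> List[str]:
    """Return the reasons this agent should not leave testing yet."""
    gaps = []
    if not agent.owner.strip():
        gaps.append("no named owner")
    elif agent.owner.strip().lower() == agent.builder.strip().lower():
        gaps.append("owner is the builder")
    return gaps


def kill_reasons(agent: AgentRecord, error_rate: float, flags_this_week: int,
                 last_review: date, today: date) -> List[str]:
    """Return every kill condition that is currently met; an empty list means keep running."""
    reasons = []
    if error_rate > agent.max_error_rate:
        reasons.append("error rate above threshold")
    if flags_this_week > agent.max_flags_per_week:
        reasons.append("too many quality flags this week")
    if (today - last_review).days >= agent.max_days_unreviewed:
        reasons.append("review overdue")
    return reasons


# Usage with made-up values: a named owner who is not the builder,
# and a week in which the owner flagged three quality issues.
agent = AgentRecord(name="invoice-triage", builder="alice", owner="bob",
                    max_error_rate=0.05)
print(readiness_gaps(agent))
print(kill_reasons(agent, error_rate=0.02, flags_this_week=3,
                   last_review=date(2025, 1, 1), today=date(2025, 1, 20)))
```

The code is not the point. What matters is that the thresholds are fixed in the record before deployment, so the decision to switch off is a lookup rather than a debate.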

The gap most teams miss

Most companies have thought carefully about how to build agents. Very few have thought carefully about how to keep them running.

The build gets the attention because it's visible — there's a demo, a launch, a milestone. The maintenance gets ignored because it's invisible. The agent runs in the background, producing output, and nobody's watching closely enough to notice the gradual degradation.

The Agent Orchestrator role exists partly to close this gap: someone whose job is not to build agents but to own the ones already deployed — to maintain the review cadence, enforce the kill conditions, and surface the drift before it becomes a problem.

Without that ownership, you end up where we did: switching off an agent that had passed every test, because nobody was watching it in production.

We haven't fully closed that gap yet. I know what needs to be true before the next agent ships — I don't yet have proof that we've made it stick.

Before your next agent goes to production: write down the name of the owner. If you can't, it's not ready to ship.

Worth Reading

88% of AI Agent Pilots Never Reach Production — The data behind the pattern I described above. The stat that stands out: 56% of enterprises now have a named "AI agent owner" or "agentic ops" lead, up from 11% two years ago. The ones that don't are most of the 88%.

The Playbook for Building an AI-Native Company — YC's framework for what AI-native actually means operationally. Short, dense, worth reading twice. The distinction they draw between AI-enabled and AI-native maps almost exactly to what I see in PE portfolios.

The AI-Native Services Playbook — Emergence Capital's take on how AI-native services companies are being built and valued differently. A useful commercial lens if you're thinking about how AI changes your product's exit multiple, not just your internal ops.