Ten Outages in Twelve Days. The Reliability Axis Your Agent Stack Isn't Measuring.
Between June 5 and June 16, Claude experienced ten significant service disruptions — a mean time between failures of roughly one day. Every enterprise team running agents on top of Anthropic's API learned something the benchmark reports don't cover: task performance and infrastructure reliability are different axes, and the agent evaluation industry has built around only one of them.
During those twelve days, Opus 4.8 and Haiku 4.5 hit walls of 500 and 529 errors. Fix attempts were deployed. The failures kept coming.
The mean time between significant outages: roughly one day.
At that frequency, even short disruptions would strain the downtime budget of a standard enterprise SLA. And yet thousands of organizations had already built production workflows on top of the API, because no standard agent evaluation methodology would have flagged this risk before deployment.
That's the story. Not the outages themselves. The fact that no one's evaluation framework was designed to catch what caused them.
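The one-day figure is simple arithmetic, and it can be set against a downtime budget. The sketch below assumes a 99.9% monthly availability target and a 30-day month purely for illustration; neither number comes from a specific contract, and the function names are hypothetical.

```python
# Figures from the article: ten disruptions between June 5 and June 16.
OUTAGES = 10
WINDOW_DAYS = 12  # June 5 through June 16, counted inclusively

mean_days_between_outages = WINDOW_DAYS / OUTAGES


def monthly_downtime_budget_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime a month can absorb before the availability target is missed."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def budget_per_outage_minutes(sla_percent: float, days_between_outages: float,
                              days_in_month: int = 30) -> float:
    """Average minutes each outage may last if outages keep arriving at this rate."""
    outages_per_month = days_in_month / days_between_outages
    return monthly_downtime_budget_minutes(sla_percent, days_in_month) / outages_per_month


if __name__ == "__main__":
    # ASSUMPTION: 99.9% is an illustrative target, not a quoted SLA.
    print(f"mean days between outages: {mean_days_between_outages:.1f}")
    print(f"monthly budget at 99.9%: {monthly_downtime_budget_minutes(99.9):.1f} min")
    print(f"budget per outage: "
          f"{budget_per_outage_minutes(99.9, mean_days_between_outages):.2f} min")
```

Under those assumptions the monthly budget is about 43 minutes, so at one disruption every 1.2 days each one would need to be resolved in under two minutes on average to stay inside it.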
What Broke, and Why It Mattered
Claude Code now accounts for roughly 4% of all public GitHub commits, and those are just the visible ones. Inside enterprise environments, the footprint is larger: CI/CD pipelines executing multi-step tasks autonomously, customer support triage bots handling first contact, data pipelines using LLM semantic analysis to route and classify at scale.
When the API went down, all of it stopped. Development velocity dropped as automated pair-programmers disappeared. Support queues backed up as triage bots fell silent. Data pipelines froze mid-workflow, some with no clean rollback state.
Anthropic attributed the strain to unprecedented user growth — partly enterprise adoption of Claude Code, partly a surge following Anthropic's public dispute with the Pentagon — and noted that expanded compute capacity through Amazon and Google partnerships is not yet available. That's a credible explanation. It's also beside the point for the organizations that discovered, mid-workflow, that a model they'd embedded as infrastructure had a one-day failure interval.
The Two-Axis Problem
Enterprise agent evaluation has a benchmark problem. Not the kind DeepSWE exposed — contaminated training data inflating scores — but a structural one. The entire evaluation ecosystem, from leaderboards to vendor-published performance claims, is built around a single axis: task quality. Can the agent answer correctly? Can it complete the workflow? Does it reason well under edge cases?
This axis matters. It's necessary. It is not sufficient.
Infrastructure reliability is a different axis entirely, and it doesn't appear in agent benchmarks because benchmarks are designed to measure what an agent does when it's running. They're not designed to measure whether it's running when you need it to be.
The enterprise cloud industry spent a decade learning to reason across both dimensions simultaneously. You don't evaluate a cloud database provider purely on query performance. You evaluate query performance and SLA guarantees and failover behavior and the provider's history of incident response. A database that's 20% faster but down twice as often is a worse choice for production, not a better one.
The enterprise AI industry hasn't made this shift. Agent selection is still overwhelmingly driven by benchmark performance, capability demonstrations, and vendor-reported metrics like containment rates. Infrastructure reliability — uptime history, latency percentiles at production scale, degraded-mode behavior when a model endpoint is unhealthy — isn't a first-class criterion in most enterprise agent evaluation processes.
The twelve days between June 5 and June 16 are what happens when you skip that axis.
Single-Vendor Dependency Isn't a Bug. It's an Architecture Choice Nobody Made Consciously.
Thoughtworks framed the outages as a reckoning with AI's increasing status as infrastructure. That framing is right, and it points directly at the underlying error.
When enterprises first adopted cloud services, the single-cloud mistake was understandable. Multi-cloud architecture is expensive, complex, and requires deliberate engineering investment. Most early adopters defaulted to a single provider and learned the hard way.
The expert consensus now is blunt: in 2026, hardcoding a single provider's API endpoint creates a single point of failure and a real threat to business continuity. The advice is correct. The problem is that most organizations didn't arrive at single-vendor AI dependency through a deliberate architecture decision. They arrived there because agent evaluation frameworks gave them no signal that reliability needed to be a criterion.
You can't optimize for something you're not measuring.
The 2012 AWS us-east-1 outages — which brought down Netflix, Reddit, and Pinterest — drove the multi-region architecture conversation precisely because those failures were visible and consequential. Ten Anthropic outages in twelve days are doing the same thing for AI infrastructure. The question is whether the industry extracts the right lesson.
The wrong lesson: use a different vendor. Provider monoculture is the failure, not provider choice.
The right lesson: infrastructure reliability needs to be a verified, tracked criterion in agent selection, held to the same rigor as task performance. That means evaluating failover behavior, setting SLA expectations before production, and understanding what your agent stack does when the primary model endpoint is unhealthy, rather than discovering it under load.
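One way to make "what your agent stack does when the primary endpoint is unhealthy" an explicit decision is a small failover wrapper. What follows is a minimal sketch, not any vendor's SDK: `Provider`, `ProviderError`, `AllProvidersFailed`, and `complete_with_failover` are hypothetical names, each provider is represented by a plain callable, and the retryable set contains only the 500 and 529 codes mentioned earlier.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

RETRYABLE_STATUS = {500, 529}  # the error codes named in the article


class ProviderError(Exception):
    """Hypothetical error type carrying the HTTP status a provider returned."""

    def __init__(self, status_code: int, message: str = ""):
        super().__init__(f"{status_code} {message}".strip())
        self.status_code = status_code


class AllProvidersFailed(Exception):
    """Raised when every configured provider was tried and none answered."""


@dataclass(frozen=True)
class Provider:
    name: str
    call: Callable[[str], str]  # ASSUMPTION: takes a prompt, returns the completion text


def complete_with_failover(
    prompt: str,
    providers: Sequence[Provider],
    attempts_per_provider: int = 2,
    backoff_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, completion)."""
    errors = []
    for provider in providers:
        for attempt in range(attempts_per_provider):
            try:
                return provider.name, provider.call(prompt)
            except ProviderError as exc:
                if exc.status_code not in RETRYABLE_STATUS:
                    raise  # the request itself is at fault, so fail loudly
                errors.append(f"{provider.name}: {exc}")
                if attempt < attempts_per_provider - 1:
                    sleep(backoff_seconds * 2 ** attempt)  # exponential backoff
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the completion is deliberate: callers can log when they are running in degraded mode instead of failing over silently.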
What Infrastructure-Grade Agent Evaluation Actually Requires
The current state of agent evaluation handles task performance reasonably well, if imperfectly. The benchmarks exist. The leaderboards exist. The independent evaluation infrastructure is emerging.
For reliability, nothing equivalent exists at the level enterprises need.
Infrastructure-grade agent evaluation requires measuring things that current benchmarks don't touch:
Uptime and SLA history for the underlying model infrastructure — tracked independently, not reported by the vendor.
Degraded-mode behavior — what does the agent do when the primary endpoint is slow or returning errors? Does it fail loudly, fail silently, or fall back gracefully?
Latency at production percentiles — p95 and p99 latency under realistic load, not average response time under favorable conditions.
Failover path performance — if you add a secondary provider, does the fallback agent maintain acceptable task quality on the same workload? Failover to a model with worse performance characteristics is an architecture decision with consequences.
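The uptime and latency items in this list can be tracked from your own probe traffic rather than a vendor status page. The sketch below assumes a hypothetical `Probe` record collected at regular intervals; it uses the standard library's `statistics.quantiles` to estimate p95 and p99, and treats the fraction of successful probes as a rough availability proxy.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Dict, Sequence


@dataclass(frozen=True)
class Probe:
    ok: bool            # did the probe request succeed?
    latency_ms: float   # wall-clock latency observed by the probe


def summarize(probes: Sequence[Probe]) -> Dict[str, float]:
    """Availability plus p95/p99 latency over the successful probes."""
    ok_latencies = [p.latency_ms for p in probes if p.ok]
    if len(ok_latencies) < 2:
        raise ValueError("need at least two successful probes to estimate percentiles")
    # n=100 yields 99 cut points; index 94 is the 95th percentile, 98 the 99th.
    cuts = quantiles(ok_latencies, n=100, method="inclusive")
    return {
        # ASSUMPTION: probes are evenly spaced, so the success ratio approximates uptime.
        "availability": len(ok_latencies) / len(probes),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Percentiles are computed over successful probes only, so a provider that fails fast does not appear to have good latency; the failures show up in the availability figure instead.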
None of this replaces task performance evaluation. Both axes need to be measured, tracked, and included in agent selection decisions before deployment.
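A selection gate that checks both axes might look like the following sketch. Every threshold value is a placeholder to be replaced with your own requirements, and `Candidate`, `Thresholds`, and `selection_blockers` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    task_pass_rate: float         # fraction of workload tasks passed on the primary path
    fallback_pass_rate: float     # same workload, run through the fallback path
    observed_availability: float  # from independent probes, not vendor reports
    p99_latency_ms: float


@dataclass(frozen=True)
class Thresholds:
    # ASSUMPTION: placeholder values, not recommendations.
    min_task_pass_rate: float = 0.90
    max_fallback_quality_drop: float = 0.05
    min_availability: float = 0.999
    max_p99_latency_ms: float = 5000.0


def selection_blockers(c: Candidate, t: Thresholds = Thresholds()) -> List[str]:
    """Return the reasons a candidate fails the gate; an empty list means it passes."""
    blockers = []
    if c.task_pass_rate < t.min_task_pass_rate:
        blockers.append("task quality below threshold")
    if c.task_pass_rate - c.fallback_pass_rate > t.max_fallback_quality_drop:
        blockers.append("fallback path loses too much task quality")
    if c.observed_availability < t.min_availability:
        blockers.append("observed availability below threshold")
    if c.p99_latency_ms > t.max_p99_latency_ms:
        blockers.append("p99 latency above threshold")
    return blockers
```

A candidate with strong benchmark scores but weak observed availability is blocked here just as one with good uptime and poor task quality would be.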
Anthropic will stabilize its infrastructure — they have the resources and the incentive, and the new compute capacity will come online. That part will resolve. What's less certain is whether the enterprise teams affected by twelve days of failures will update their agent evaluation frameworks to include reliability as a tracked dimension going forward, or treat this as a one-off incident and return to benchmark-only selection.
In complex systems, "one-off incident" is what a systematic gap gets called until the same failure happens again.
The benchmarks are measuring one axis. The gap is the other one.