March 20, 2026 · 10 min read

How Anthropic Uses Claude to Fix Claude: AI-Powered SRE

Tags: claude-ai, anthropic, site-reliability-engineering, ai-operations, sre, infrastructure

Introduction

What happens when one of the world's most advanced AI systems goes down — and the tool used to fix it is the AI itself? That is the reality Anthropic is building toward with its AI Reliability Engineering (AIRE) team, a group dedicated to keeping Claude online, responsive, and performant at a scale that now serves over a million new signups every single day.

At QCon London in March 2026, a member of Anthropic's reliability team revealed that the company has been actively using Claude — its own large language model — as part of its incident response workflow. The results are promising, but the challenges are unlike anything traditional site reliability engineering has ever faced. In this article, we will explore how Anthropic is rethinking infrastructure reliability by turning Claude into both the product and the operator, what works, what does not, and what it means for anyone running AI systems at scale.

Why AI Systems Need a New Kind of SRE

Traditional site reliability engineering was designed for deterministic software. A web server either returns a 200 response or it does not. A database query either completes within the expected latency window or it times out. The failure modes are well-understood, the monitoring is straightforward, and the runbooks are predictable.

Large language models break all of those assumptions. Claude does not fail in binary ways. It can return a response that is technically successful — a 200 status code, valid JSON, properly formatted text — but the content might be degraded, repetitive, off-topic, or subtly wrong. Traditional monitoring would see a healthy system. Users would see something broken.

This is the fundamental challenge that Anthropic's AIRE team is tackling. Reliability for AI systems is not just about uptime. It is about output quality, consistency, latency distributions across different prompt types, and the complex interplay between model behavior, infrastructure load, and user experience. A spike in GPU utilization might not cause errors in the traditional sense, but it could silently degrade the quality of Claude's responses for thousands of users simultaneously.

The team, led by Todd Underwood — formerly of Google where he spent nearly fifteen years and co-authored the O'Reilly book on reliable machine learning — understands this distinction deeply. Underwood built Google's Machine Learning SRE organization from the ground up, and he brought that expertise to Anthropic specifically because AI reliability requires a fundamentally different approach.

How Claude Is Used in Its Own Incident Response

The most fascinating aspect of Anthropic's approach is the recursive nature of the solution: they are using Claude to help diagnose and respond to issues affecting Claude. Alex Palcuie, an SRE on the AIRE team who previously worked on Google Cloud Platform reliability, has been integrating LLMs into actual incident response workflows since January 2026.

The concept is straightforward in theory but complex in execution. When an incident is detected — whether through automated monitoring, user reports, or anomalous metric patterns — Claude is brought into the response loop. It can analyze logs at a speed no human can match, correlate signals across dozens of monitoring dashboards, summarize the current state of an incident for responders joining late, and suggest potential root causes based on patterns it recognizes from previous incidents.
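The loop described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's actual tooling: `call_llm` is a hypothetical stand-in for whatever model client is in use, and the prompt wording and log budget are invented. The part worth showing is the assembly of logs and dashboard state into a single summarization request.

```python
# Sketch of bringing an LLM into an incident-response loop (illustrative only).
# `call_llm` is a hypothetical stand-in for a real model client.
from typing import Callable

MAX_LOG_CHARS = 8000  # conservative prompt budget for the log tail

def build_incident_prompt(log_lines: list[str], dashboards: dict[str, str]) -> str:
    """Assemble a summarization prompt from recent logs and metric snapshots."""
    logs = "\n".join(log_lines)[-MAX_LOG_CHARS:]  # keep only the most recent tail
    metrics = "\n".join(f"- {name}: {state}" for name, state in dashboards.items())
    return (
        "You are assisting an on-call engineer. Summarize the incident state, "
        "list correlated signals, and flag that correlations may not be causal.\n\n"
        f"Recent logs:\n{logs}\n\nDashboard states:\n{metrics}"
    )

def summarize_incident(log_lines, dashboards, call_llm: Callable[[str], str]) -> str:
    return call_llm(build_incident_prompt(log_lines, dashboards))
```

Note the explicit instruction to flag correlations as potentially non-causal — a small prompt-level guard against exactly the failure mode discussed below.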

But the QCon talk revealed an important limitation that anyone considering this approach should understand: Claude excels at finding issues but remains a poor substitute for a full site reliability engineer. The primary reason is that Claude consistently mistakes correlation for causation during incident analysis. When presented with a timeline of events leading up to a failure, Claude identifies patterns and correlations with impressive speed. However, it tends to latch onto temporal correlations — things that happened around the same time — and present them as causal chains, lacking the deep systems intuition that experienced SREs develop over years of hands-on work.

For example, if a deployment happened thirty minutes before an outage and a configuration change happened at the same time, Claude might confidently attribute the outage to the deployment when the actual root cause was an unrelated capacity issue in a downstream service. Human SREs learn to be skeptical of obvious correlations. Claude, at least in its current form, has not fully developed that skepticism.

The Scale Problem That Made This Necessary

To understand why Anthropic is investing so heavily in AI-powered reliability, you need to understand the scale problem they are facing. In late February 2026, a viral campaign encouraging users to switch from ChatGPT to Claude drove a sixty percent increase in free users and a doubling of paid subscribers in a matter of weeks. By early March, Anthropic reported that over one million new users were signing up for Claude every single day.

This explosive growth led directly to one of the most significant outages in Claude's history on March 3, 2026. Users worldwide reported that Claude was completely down, and Anthropic's status page confirmed elevated errors across nearly every user-facing platform — Claude.ai, Cowork, the API Platform, and even the Claude Code CLI. The incident affected every surface through which users interact with Claude.

Traditional scaling playbooks struggle with AI workloads because the resource requirements are fundamentally different. Scaling a web application means adding more servers and load balancers. Scaling an LLM means allocating more GPU clusters, managing model sharding across nodes, handling the complex memory requirements of long context windows — especially now that Claude supports up to one million tokens — and doing all of this while maintaining consistent response quality.

The March 3 outage was a wake-up call, but it also validated the investment in AIRE. The team's approach to using Claude in its own incident response pipeline helped them identify and resolve the cascading failures faster than traditional methods alone would have allowed. The incident retrospective, analyzed by Rootly and others in the industry, highlighted a new category of challenges that SRE teams will increasingly face as AI systems become more sophisticated and failure becomes harder to detect and define.

What Makes AI Reliability Different From Traditional SRE

Several factors make AI reliability engineering a distinct discipline, and understanding them is crucial for anyone operating or building with Claude or similar systems.

First, there is the problem of quality degradation without explicit failure. A traditional service either works or it does not. An AI service can work in a technical sense while delivering poor results. Monitoring for this requires evaluating output quality in real time, which itself requires AI — creating a recursive monitoring challenge.

Second, AI systems have non-deterministic behavior by design. The same prompt can produce different outputs on different requests. This makes it extremely difficult to establish baselines, detect regressions, and distinguish between normal variation and actual degradation. The AIRE team has had to develop entirely new metrics and monitoring approaches that account for expected variability while still catching meaningful quality drops.
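One way to account for expected variability while still catching real drops is to compare a recent window of quality scores against a long-run baseline, alerting only when the window mean falls several standard errors below it. The sketch below is an assumption-laden illustration of that idea — the window sizes and z-threshold are invented, and real quality scoring is itself a hard problem this code takes as given.

```python
# Minimal sketch of quality-drop detection under non-determinism: rather than
# alerting on any single low score, compare a recent window's mean against a
# historical baseline, allowing for expected variance. Thresholds are invented.
from collections import deque
from statistics import mean, stdev

class QualityMonitor:
    def __init__(self, baseline_size=200, window_size=20, z_threshold=3.0):
        self.baseline = deque(maxlen=baseline_size)  # long-run quality scores
        self.window = deque(maxlen=window_size)      # most recent scores
        self.z_threshold = z_threshold

    def observe(self, score: float) -> bool:
        """Record a quality score in [0, 1]; return True if degradation is likely."""
        self.window.append(score)
        alert = False
        if len(self.baseline) >= 30 and len(self.window) == self.window.maxlen:
            mu, sigma = mean(self.baseline), stdev(self.baseline)
            # Standard error of the window mean under the baseline distribution.
            se = sigma / (len(self.window) ** 0.5) or 1e-9
            alert = (mu - mean(self.window)) / se > self.z_threshold
        self.baseline.append(score)
        return alert
```

A production system would need far more care (baselines segmented by prompt type, guarding the baseline against contamination by degraded scores), but the shape — variance-aware comparison rather than point thresholds — is the key departure from traditional alerting.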

Third, the failure domains are entangled in ways that traditional systems are not. A single GPU node running slowly might not affect overall throughput but could cause specific types of long-context requests to time out. A minor model configuration change might improve average quality while degrading performance on edge cases that a subset of power users depends on. These subtle, partial failures are much harder to detect, diagnose, and resolve than the clean failure modes of traditional infrastructure.

Fourth, the resource economics are different. GPU compute is expensive and relatively inelastic. You cannot spin up new GPU clusters in seconds the way you can launch new virtual machines. Capacity planning for AI workloads requires much longer lead times and more sophisticated demand forecasting, especially when user growth can spike dramatically and unpredictably.

Lessons for Teams Running AI Workloads

Anthropic's experience offers several practical lessons for any organization running AI systems in production, whether they are using Claude through the API or operating their own models.

The first lesson is to invest in quality monitoring, not just availability monitoring. Traditional uptime metrics will miss the most impactful AI failures. Build evaluation pipelines that continuously test your AI system's output quality against known benchmarks. If you are using Claude through the API, this means running periodic test prompts that cover your critical use cases and comparing the outputs against expected quality baselines.
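A periodic evaluation pipeline of the kind described above can be sketched generically. Here `run_prompt` is a hypothetical stand-in for your model client (an API call in practice), and each case pairs a critical-path prompt with a cheap programmatic health check rather than an exact expected answer; everything else is assumed structure, not a prescribed design.

```python
# Hedged sketch of a periodic quality-evaluation pipeline. `run_prompt` stands
# in for a real model client; each test case pairs a prompt with a cheap
# programmatic check rather than an exact-match answer.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the output looks healthy

def run_eval_suite(cases: list[EvalCase], run_prompt: Callable[[str], str]) -> dict:
    """Run every case and report per-case results plus the overall pass rate."""
    results = {c.name: c.check(run_prompt(c.prompt)) for c in cases}
    pass_rate = sum(results.values()) / len(results)
    return {"results": results, "pass_rate": pass_rate}
```

Wired into a scheduler, a suite like this gives you a pass-rate time series: alert when it drifts below its own historical baseline, not on any individual failure, since single-run variation is expected.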

The second lesson is to treat your incident response process as something that AI can augment but not replace. Anthropic's own experience shows that Claude is valuable for log analysis, signal correlation, and incident summarization, but human judgment remains essential for determining root cause and making remediation decisions. The best approach is a hybrid one where AI handles the data-intensive grunt work and humans make the critical calls.

The third lesson is to plan for non-linear scaling challenges. AI workloads do not scale linearly with user growth. Doubling your users might triple your compute requirements depending on how usage patterns shift. Long context requests, which are now possible with Claude's one-million-token window, consume dramatically more resources than short interactions. Your capacity planning needs to account for not just user count but usage pattern distribution.
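A back-of-the-envelope model makes the non-linearity concrete. The cost ratios and mix shifts below are illustrative assumptions, not measured numbers — the point is only that weighting each request class by its relative cost shows how doubling users can far more than double compute once the traffic mix shifts toward expensive long-context work.

```python
# Illustrative capacity model: expected compute given a traffic mix over
# request classes. All numbers here are assumptions for the sake of example.

def expected_cost(requests_per_day: float, mix: dict[str, float],
                  unit_cost: dict[str, float]) -> float:
    """Expected compute units per day given a traffic mix over request classes."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9  # mix must be a distribution
    return requests_per_day * sum(mix[k] * unit_cost[k] for k in mix)

# Assumed relative costs: a long-context request is ~40x a short one.
COSTS = {"short": 1.0, "long_context": 40.0}

before = expected_cost(1_000_000, {"short": 0.95, "long_context": 0.05}, COSTS)
# Users double AND the mix shifts toward long-context work:
after = expected_cost(2_000_000, {"short": 0.85, "long_context": 0.15}, COSTS)
```

Under these made-up numbers, doubling users raises compute demand by roughly 4.6x — which is why capacity planning has to forecast the usage-pattern distribution, not just headcount.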

The fourth lesson is to build robust fallback mechanisms. When your AI system is degraded, what happens to your users? Anthropic has implemented multiple levels of graceful degradation across Claude's infrastructure, from routing requests to less loaded clusters to temporarily reducing maximum context lengths during peak load. Think about what graceful degradation looks like for your specific use case.
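Tiered degradation of this kind can be expressed as a simple policy function. This sketch is loosely modeled on the mechanisms the article mentions (load-based rerouting, temporarily reduced context limits); the load thresholds and the reduced-context cap are invented for illustration.

```python
# Sketch of tiered graceful degradation. Thresholds and the reduced-context
# cap are illustrative assumptions, not real operating parameters.

FULL_CONTEXT = 1_000_000   # tokens, per the article's stated maximum
REDUCED_CONTEXT = 200_000  # assumed degraded-mode cap

def degradation_policy(cluster_load: float) -> dict:
    """Map current cluster load (0.0-1.0) to serving parameters."""
    if cluster_load < 0.7:
        # Healthy: serve everything locally at full capability.
        return {"max_context": FULL_CONTEXT, "reroute": False}
    if cluster_load < 0.9:
        # First level: shed work to less loaded clusters, keep full context.
        return {"max_context": FULL_CONTEXT, "reroute": True}
    # Last resort: cap context length so every request stays cheap.
    return {"max_context": REDUCED_CONTEXT, "reroute": True}
```

The useful property of encoding degradation as an explicit policy is that each level is a deliberate product decision made calmly in advance, rather than an improvisation made mid-incident.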

The Future of AI-Powered Operations

Anthropic's work on using Claude for its own operations points toward a future where AI systems are not just the products being monitored but active participants in their own reliability. This is a profound shift that will accelerate as models become more capable.

The near-term trajectory is clear. AI will become standard tooling in incident response across the industry, not just at AI companies. The ability to analyze vast amounts of log data, correlate signals across complex distributed systems, and generate human-readable summaries of system state is valuable for any large-scale operation. The key is understanding the limitations — particularly around causal reasoning — and designing workflows that leverage AI's strengths while compensating for its weaknesses.

Longer term, we may see AI systems that can not only diagnose issues but implement fixes autonomously, with human oversight for critical decisions. Anthropic is clearly moving in this direction, and their willingness to be transparent about both the successes and the limitations of their approach is valuable for the entire industry.

What This Means for Claude Users

For the millions of people who use Claude daily, these behind-the-scenes investments in reliability engineering translate directly into a better experience. The March 3 outage was disruptive, but the speed of recovery and the subsequent stability improvements demonstrate that Anthropic is treating reliability as a first-class engineering priority.

The doubled usage limits during off-peak hours through March 27 are another sign that Anthropic is actively managing capacity while scaling to meet demand. Understanding these dynamics helps you get the most out of Claude — if you are a heavy user, scheduling intensive workloads during off-peak hours can give you significantly more throughput.

Conclusion

Anthropic's AI Reliability Engineering team represents a new frontier in operations engineering. The challenges they face — quality-aware monitoring, non-deterministic system behavior, entangled failure domains, and recursive self-monitoring — are challenges that every organization running AI at scale will eventually encounter.

The key takeaway is that AI reliability requires a fundamentally different mindset from traditional SRE. Uptime is necessary but not sufficient. Quality, consistency, and graceful degradation matter just as much. And the tools we use to maintain reliability — including AI itself — need to be deployed with a clear understanding of their strengths and limitations.

For Claude power users who want to stay on top of these developments and monitor how reliability impacts their own usage patterns, tools like SuperClaude can help track real-time model availability and usage across sessions.