March 23, 2026 · 10 min read

Claude AI Computer Use: How Anthropic Reached Near-Human Performance

claude-ai · anthropic · computer-use · ai-agents · vercept · tutorial

Introduction

For years, the promise of AI agents that can actually operate your computer felt like science fiction. The idea that an AI could navigate a spreadsheet, fill out a web form across multiple browser tabs, or interact with live software the way a human does seemed distant. That distance just collapsed.

Anthropic's Claude has made a dramatic leap in computer use capabilities. On OSWorld, the most widely referenced benchmark for evaluating how well AI models can operate real software, Claude Sonnet went from under 15% success rate when computer use first launched in late 2024 to a remarkable 72.5% today. That puts Claude Sonnet 4.6 within striking distance of human-level performance on complex desktop tasks.

This article breaks down what happened, why the Vercept acquisition was the catalyst, how computer use actually works under the hood, and what this means for anyone building with or relying on Claude in their daily workflow.

The Vercept Acquisition: Why It Matters

On February 25, 2026, Anthropic announced the acquisition of Vercept, a Seattle-based AI startup that had been quietly building some of the most advanced vision-based computer perception and automation technology in the industry. The nine-person team, led by co-founders Kiana Ehsani, Luca Weihs, and Ross Girshick, had spent years tackling a deceptively hard problem: teaching AI systems to see and interact with software interfaces the same way humans do.

This was not a talent acquisition dressed up as a product play. Vercept was built around a clear thesis that making AI genuinely useful for completing complex, multi-step tasks requires solving hard perception and interaction problems. Their research focused on how AI can interpret dynamic visual environments, understand UI elements in context, and execute sequences of actions that require real-time adaptation when things on screen change unexpectedly.

The Vercept team wound down their external product in the weeks following the acquisition and integrated directly into Anthropic's computer use research group. The results showed up almost immediately in Claude's benchmark performance, but the real impact goes much deeper than any single number.

Understanding OSWorld and What 72.5% Actually Means

OSWorld is the standard evaluation benchmark that the AI research community uses to measure how well models can perform real computer tasks. Unlike text-based benchmarks where a model generates answers in isolation, OSWorld requires the AI to actually operate within a live software environment. Tasks include navigating complex spreadsheets with multiple sheets and formulas, completing multi-step web forms that span several browser tabs, managing files and folders in a desktop operating system, interacting with applications that have menus, dropdowns, and modal dialogs, and handling edge cases where the expected interface changes or errors appear.

When Anthropic first released computer use capabilities with Claude in late 2024, the success rate on OSWorld was below 15%. That was understandable for a first release. The model could handle simple, predictable interactions but would fail when confronted with unexpected popups, non-standard UI layouts, or tasks that required remembering context across multiple steps.

Jumping to 72.5% is not just a quantitative improvement. It represents a qualitative shift. At this level, Claude can reliably handle the kind of messy, real-world computer tasks that previously required a human. The remaining gap to 100% largely consists of extremely unusual edge cases, highly specialized software with non-standard interfaces, and tasks that require domain expertise beyond general computer literacy.

To put it in perspective, most evaluations of human performance on OSWorld tasks land somewhere in the 85-90% range, accounting for the fact that even humans make mistakes, get confused by unfamiliar software, and occasionally misclick. Claude is now closer to that human baseline than it is to where it started.

How Claude Computer Use Actually Works

Claude's computer use capability works through a perception-action loop that runs in real time. The model receives a screenshot of the current state of the screen and analyzes it to identify UI elements such as buttons, text fields, menus, links, and other interactive components. Based on the task it has been asked to perform, Claude reasons about which action to take next, then executes that action, which could be clicking at specific coordinates, typing text, scrolling, pressing keyboard shortcuts, or dragging and dropping elements.

After each action, Claude receives a new screenshot reflecting the updated state of the screen. It then re-evaluates the situation and decides on the next step. This loop continues until the task is complete or the model determines it cannot proceed.
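The loop described above can be sketched in a few lines. Everything here is an illustrative stand-in, not Anthropic's actual API: the `Action` type, the `env.screenshot`/`env.apply` methods, and the `model.decide` call are hypothetical names chosen to make the structure of the perception-action cycle concrete.

```python
# A minimal sketch of the perception-action loop: perceive the screen,
# decide on one action, execute it, repeat until done. All names here
# are illustrative placeholders, not Anthropic's real interface.

from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str                                  # "click", "type", "scroll", "done", ...
    payload: dict = field(default_factory=dict)


def run_agent(env, model, max_steps=20):
    """Run the loop until the model reports "done" or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        screenshot = env.screenshot()              # 1. perceive current screen state
        action = model.decide(screenshot, history)  # 2. reason about the next step
        if action.kind == "done":                  # 3a. model judges the task complete
            return history
        env.apply(action)                          # 3b. execute click/type/scroll/etc.
        history.append(action)                     # 4. remember what has been done
    return history
```

Passing `history` back into `decide` is what lets the model keep track of its place in a multi-step task rather than reasoning about each screenshot in isolation.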

What makes this challenging is not any single step. Recognizing a button or typing into a text field is relatively straightforward. The difficulty lies in the orchestration: maintaining context about what has already been done, understanding how the current screen state relates to the overall goal, recovering gracefully when something unexpected happens (like a popup blocking the expected element), and knowing when a task is genuinely complete versus when there are remaining steps.

The Vercept acquisition directly improved Claude's performance on the perception side of this loop. Their research into vision-based understanding of dynamic interfaces gave Claude a much more robust ability to interpret what is on screen, even when the layout is unfamiliar or when elements are partially occluded, animated, or rendered in non-standard ways.

What Changed Since the Early Days

The progression from sub-15% to 72.5% did not happen overnight. It reflects a series of compounding improvements that Anthropic has made to Claude's computer use stack over the past 18 months.

First, there was a significant upgrade in visual understanding. Early versions of computer use struggled with dense UIs that had many small elements close together. Claude would sometimes click on the wrong button or misidentify which text field was currently active. The model now handles complex, information-rich screens with much higher accuracy.

Second, multi-step planning improved dramatically. The early system would often lose track of its place in a longer task sequence. If a task required ten steps and something changed at step four, the model might restart from scratch or get stuck. Current versions maintain a much more robust internal representation of task progress and can adapt their plan on the fly without losing context about earlier steps.

Third, error recovery became far more sophisticated. Real software is full of unexpected states: error dialogs, confirmation prompts, loading spinners, session timeouts, and elements that move or resize depending on window dimensions. Claude now handles these interruptions much more gracefully, treating them as temporary obstacles to work around rather than reasons to fail the entire task.
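One way to picture this "obstacle, not failure" behavior is a pre-step check that clears any blocking dialogs before the planned action runs. The sketch below is an assumption about the general pattern, not Anthropic's implementation; `detect_interruption` and the handler functions are hypothetical helpers.

```python
# A hedged sketch of treating interruptions as obstacles to clear rather
# than task failures: before each planned step, detect known blockers
# (popups, error dialogs, cookie banners) and dismiss them. The screen
# interface and interruption names are illustrative placeholders.

def clear_interruptions(screen, handlers, max_dialogs=5):
    """Dismiss stacked dialogs until the screen is workable again.

    `handlers` maps an interruption name (e.g. "cookie_banner") to a
    function that dismisses it. Returns the interruptions handled, in order.
    """
    handled = []
    for _ in range(max_dialogs):
        blocking = screen.detect_interruption()    # returns a name, or None
        if blocking is None:
            return handled                         # screen is clear; proceed
        if blocking not in handlers:
            # An unknown blocker is the one case that still fails the step.
            raise RuntimeError(f"no recovery strategy for {blocking!r}")
        handlers[blocking](screen)                 # click "Close", "OK", etc.
        handled.append(blocking)
    raise RuntimeError("too many stacked dialogs; giving up")
```

The bounded loop matters: dialogs can stack (a confirmation on top of an error on top of a banner), but an unbounded retry would hang on a dialog that keeps reappearing.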

Finally, the speed of the perception-action loop has improved. Faster screenshot processing and more efficient reasoning about next actions mean that Claude can complete tasks in a timeframe that feels more natural, rather than the slow, halting progression that characterized early demonstrations.

Practical Implications for Developers and Power Users

For developers building AI-powered automation, Claude's improved computer use opens up a range of possibilities that were previously unreliable. The most immediate application is automated testing and QA, where Claude can navigate through an application's UI the same way a user would, identifying bugs, broken layouts, or unexpected behaviors without requiring handwritten test scripts for every possible path.

Data entry and migration tasks that involve legacy software without APIs are another strong use case. Many organizations still rely on software that was built decades ago and can only be operated through a graphical interface. Claude can now interact with these systems reliably enough to handle repetitive tasks that previously required human operators.

For individual power users, the practical benefit is the ability to delegate multi-step computer tasks to Claude with much higher confidence that the task will be completed correctly. Setting up a development environment, configuring software settings across multiple application windows, or performing a sequence of operations that spans several tools are all now within Claude's reliable operating range.

The integration with Claude Cowork, which Anthropic launched as a research preview in January 2026, makes this particularly accessible. Cowork brings Claude's computer use capabilities to knowledge workers who are not necessarily developers, letting them automate repetitive desktop tasks through natural language instructions rather than writing code or configuring automation tools.

Where Computer Use Still Falls Short

Despite the impressive progress, there are clear limitations that anyone relying on Claude's computer use should understand. Tasks that require fine motor precision, like pixel-perfect design work or precise cursor positioning in drawing applications, remain difficult. The coordinate-based interaction model means that very small click targets or drag operations that require sub-pixel accuracy can still fail.

Highly specialized professional software with unusual UI paradigms, such as advanced video editing tools, CAD applications, or certain scientific instruments, can still trip up the model. Claude's visual understanding is strongest with standard web and desktop UI patterns. When software diverges significantly from those patterns, accuracy drops.

Tasks that require waiting for external processes to complete, such as a file download, a compilation step, or a server response, can also be tricky. Claude needs to correctly identify when a process is still running versus when it has completed or failed, and the visual cues for this are not always obvious.
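The underlying judgment, "is this still running, finished, or failed?", amounts to polling a status check with a timeout rather than assuming the next screen is ready. A minimal sketch, assuming a caller-supplied `check` that classifies the current screen into one of three illustrative states:

```python
# Poll an external process until it visibly completes, fails, or we time
# out. `check` is any callable returning "running", "done", or "failed";
# the state names are illustrative. `clock` and `sleep` are injectable
# so the behavior is easy to test without real waiting.

import time


def wait_until_done(check, timeout=30.0, interval=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout
    while clock() < deadline:
        status = check()
        if status == "done":
            return True
        if status == "failed":
            raise RuntimeError("external process failed")
        sleep(interval)            # still "running": wait, then re-check
    raise TimeoutError("gave up waiting for the external process")
```

The hard part in practice is not this loop but implementing `check` reliably: a spinner that disappears, a progress bar that reaches 100%, or a status label changing are the kinds of visual cues the model has to read correctly.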

Security-sensitive operations also remain appropriately restricted. Claude will not enter passwords, fill in credit card numbers, or perform actions that could compromise user security, even if it is technically capable of doing so. These are deliberate safety boundaries rather than capability limitations.

The Competitive Landscape

Claude's 72.5% on OSWorld puts it well ahead of other models in the computer use space. OpenAI's Operator and Google's Project Mariner have both made progress, but neither has published results in the same range. The gap is significant enough that Claude is currently the most capable option for anyone looking to build or use AI-powered computer automation.

This matters not just for benchmarks but for real-world reliability. The difference between 50% and 72.5% success rate is the difference between a tool that fails half the time and one that works reliably enough to trust with real tasks. As any developer knows, the last 20% of reliability improvement is often worth more than the first 80% because it is the difference between a demo and a product.
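The arithmetic behind this intuition is compounding: a multi-step task succeeds only if every step does, so if steps were roughly independent, end-to-end success would fall off as per-step reliability to the power of the step count. The per-step figures below are illustrative, not measured rates from OSWorld.

```python
# Back-of-the-envelope illustration of why reliability compounds across
# steps: an n-step task succeeds only if every step does, so under a
# rough independence assumption the end-to-end rate is p ** n.
# The per-step rates here are illustrative, not measured values.

def task_success(per_step: float, steps: int) -> float:
    """End-to-end success probability for `steps` independent steps."""
    return per_step ** steps


# For a 10-step task, a seemingly small per-step gap is dramatic:
low = task_success(0.93, 10)    # roughly 0.48 end-to-end
high = task_success(0.98, 10)   # roughly 0.82 end-to-end
```

This is why the last stretch of per-step reliability is worth so much: small gains per action translate into large gains on the long, messy tasks that matter in practice.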

Anthropic's strategy of acquiring specialized teams like Vercept rather than trying to solve every problem in-house seems to be paying off. The combination of Claude's strong language understanding with Vercept's deep expertise in visual perception and interaction has produced results faster than either capability alone would have achieved.

What Comes Next

The trajectory from 15% to 72.5% in roughly 18 months suggests that human-level computer use performance is not far off. Anthropic has been clear that computer use is a strategic priority, and the Vercept acquisition signals continued investment in this direction.

The next frontiers are likely to include better handling of long-running tasks that span minutes or hours rather than seconds, more robust interaction with specialized professional software, improved coordination between computer use and other Claude capabilities like web search and code execution, and the ability to learn and adapt to a specific user's software environment and preferences over time.

For the broader AI industry, Claude's computer use capabilities point toward a future where the distinction between "using AI" and "using a computer" starts to blur. Instead of switching between a chat interface and your actual work applications, AI becomes a layer that operates within your existing software environment, handling tasks on your behalf while you focus on the decisions that matter.

Conclusion

Claude's leap in computer use performance is one of the most practically significant advances in AI capabilities this year. The jump from under 15% to 72.5% on OSWorld, fueled by the Vercept acquisition and sustained engineering investment, transforms computer use from an interesting demo into a reliable tool. For developers, power users, and anyone who spends their day navigating software, this is worth paying attention to.

The gap between where Claude is now and human-level performance is closing fast. Whether you are building automation workflows, looking to delegate repetitive tasks, or simply curious about where AI agents are heading, Claude's computer use capabilities represent the current state of the art.

If you are a heavy Claude user tracking your consumption across models and features, tools like SuperClaude can help you monitor your usage limits and stay on top of your AI workflow in real time.