The Inconvenient Number

Start with the number that nobody in this industry wants to talk about.

Between February and June 2025, METR, a non-profit research institute, ran a randomized controlled trial with 16 experienced open-source developers. Each real task they took on was randomly assigned to either allow or forbid AI coding tools. On the tasks where AI was allowed, the developers were 19% slower.

Not 19% faster. Slower.

The kicker: those same developers believed they were 20% faster. The gap between what they felt and what actually happened was nearly 40 percentage points.

This single finding upends the entire premise on which a $36 billion industry has been built. Cursor, the AI-native code editor built by four MIT graduates, hit $1 billion in annualized revenue in under 24 months and commands a $29.3 billion valuation. GitHub Copilot, Microsoft’s AI coding assistant, has crossed 20 million cumulative users and sits inside 90% of Fortune 100 companies. Between them, they have convinced the software industry that AI makes programmers dramatically more productive.

And yet the most rigorous study conducted to date says the opposite.

That contradiction is the real story here. Not which tool has better autocomplete. Not whose agent mode runs more tasks in parallel. The question that matters is deeper: are these tools actually changing how well software gets built, or are they selling developers a feeling of productivity that doesn’t survive measurement?

To answer that, you have to understand the products, the companies behind them, the research that both supports and undermines them, and the developers who use them every day. You also have to reckon with the possibility that both sides of this debate are simultaneously right.

Four Kids Against Microsoft

The founding story of Cursor has been retold enough times that it risks becoming myth. But the specific details matter, because they explain why the product exists in its particular form.

In 2022, Michael Truell, Sualeh Asif, Arvid Lunnemark, and Aman Sanger were finishing their computer science and mathematics degrees at MIT, where they had conducted research together at CSAIL. Truell had won first place at the NYC Science and Engineering Fair in high school and interned at Google and Two Sigma. Asif, who grew up in Karachi and competed for Pakistan at the International Mathematical Olympiad, brought deep mathematical and machine learning expertise. Lunnemark, a Swede who had won gold at the IMO and competed at the International Olympiad in Informatics, served as the group’s systems architect. Sanger, who had done NLP research and interned at Bridgewater and Google, took the COO role but remained deeply technical.

They had their pick of jobs. Google, Meta, top-tier quant funds. They turned them all down to build a code editor.

The specific bet was not “AI will be big.” Everybody knew that. The bet was architectural: that the right way to add AI to software development was not to bolt it onto an existing editor as a plugin, but to rebuild the editor itself with AI as the central organizing principle. Microsoft had taken the plugin approach with Copilot, shipping it as an extension for VS Code. The Anysphere team believed that approach would always be limited, because an extension can only do what the host editor allows it to do.

So they forked VS Code. They took Microsoft’s own source code, kept the familiar interface, and rebuilt the internals around a different assumption: that the AI isn’t an add-on. It’s a co-author.

The decision looked insane at the time. VS Code was free and had 70%+ market share. Copilot had a two-year head start. Microsoft had, for practical purposes, unlimited money.

What the Anysphere team understood, and what took Microsoft longer to grasp, was that developer tools have a peculiar economic property. Developers will pay for marginal improvements in a way that most software users won’t. A senior engineer in San Francisco earning $350,000 a year generates roughly $170 of economic value per hour. A tool that saves ten minutes a day pays for itself many times over at $20 a month. The question was never whether developers would pay. It was whether the AI-native approach would deliver enough marginal improvement to justify switching editors.
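The back-of-the-envelope math is worth spelling out. A minimal sketch, assuming a $350,000 salary, 2,080 working hours a year, and ten minutes saved per working day; every input is illustrative, not a measured figure:

```typescript
// Back-of-the-envelope payback math for a $20-a-month developer tool.
// Every input here is an illustrative assumption, not a measured figure.
const annualSalary = 350_000;          // assumed senior-engineer salary
const workingHoursPerYear = 2_080;     // 52 weeks x 40 hours, assumed
const hourlyValue = annualSalary / workingHoursPerYear;   // ~$168/hour

const minutesSavedPerDay = 10;         // the hypothetical saving in question
const workingDaysPerMonth = 21;        // assumed
const monthlyValue =
  (minutesSavedPerDay / 60) * hourlyValue * workingDaysPerMonth; // ~$590

const subscription = 20;               // $/month
console.log(`Hourly value: ~$${hourlyValue.toFixed(0)}`);
console.log(`Ten minutes a day is worth ~$${monthlyValue.toFixed(0)}/month`);
console.log(`Payback multiple: ~${(monthlyValue / subscription).toFixed(0)}x`);
```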

Revenue doubling every two months was the answer.

The Plugin vs. the Machine

To understand why Cursor grew so fast, you have to understand what it does differently from Copilot at a mechanical level.

Open Copilot in VS Code, and you get an AI that watches what you type and suggests the next line. It’s good at this. The completions are fast, usually relevant, and often save keystrokes. Copilot now contributes 46% of all code written by its active users, up from 27% at launch in 2022. A joint study by GitHub and Accenture found developers completed tasks 55% faster with it.

But open a TypeScript project with 200 files and try to rename a core data type. In Copilot, you’re largely on your own. The AI can help with individual files, but coordinating changes across an entire project, understanding that renaming UserProfile to AccountProfile also means updating thirty import statements, twelve test files, four API routes, and a database migration, requires a kind of project-level awareness that a plugin architecture struggles to provide.

Cursor was built for exactly this. Its tab completion scans the entire project, not just the current file. When it suggests a symbol from another module, it auto-imports it. Its Composer feature, launched with Cursor 2.0 in October 2025, lets you describe a refactoring task in plain English and generates a coordinated change plan across every affected file. Developers report that rename-and-refactor operations that took hours of careful manual work now take single-digit minutes.
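To make “coordinated change plan” concrete, here is a hand-written sketch of the shape such a plan takes, expressed as plain data. This is illustrative only, not Cursor’s actual plan format, and the file paths are hypothetical:

```typescript
// A hand-written sketch of what a project-wide rename touches, expressed as
// plain data. Illustrative only: not Cursor's plan format, hypothetical paths.
interface FileChange {
  path: string;
  edits: string[];
}

const renamePlan: FileChange[] = [
  { path: "src/types/user.ts", edits: ["rename interface UserProfile -> AccountProfile"] },
  { path: "src/api/routes/profile.ts", edits: ["update import", "update type annotations"] },
  { path: "src/db/migrations/rename_profile_table.ts", edits: ["add rename migration"] },
  { path: "test/profile.test.ts", edits: ["update fixtures", "update type imports"] },
  // ...plus every other module that imports UserProfile
];

console.log(`${renamePlan.length}+ files change in one coordinated commit`);
```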

The difference is not that Cursor’s AI is smarter. Both tools can use the same underlying models. The difference is that Cursor’s architecture gives the AI access to more context. It can see the whole codebase, not just the file you have open.

This matters less for small scripts and throwaway projects. It matters enormously for the kind of codebases that professional developers actually work in: tens of thousands of files, complex dependency graphs, years of accumulated architectural decisions that a newcomer (human or AI) needs context to navigate.

Microsoft recognized the gap. Throughout 2025, Copilot’s agent mode gained the ability to analyze codebases, propose multi-file changes, run tests, and auto-correct failures in a loop. The December 2025 updates added custom agents, parallel execution, and a cloud-based coding agent that files pull requests autonomously. Copilot’s agent mode is now available across VS Code, JetBrains, Eclipse, and Xcode.

But architectural decisions made early tend to echo. The extension model means Copilot has to negotiate with the host editor for every capability. Cursor, controlling the entire stack, moves faster. It ships sandboxed terminals where agents run commands safely in isolation. It runs up to eight agents in parallel from a single prompt, each in its own copy of the codebase. BugBot, its debugging agent, plugs into GitHub to review pull requests and flag issues before humans even look at the code.

The trajectory is similar; the execution speed is not.

A Day in Two Editors

Abstract comparisons obscure what these tools actually feel like to use. Consider a typical morning for a backend engineer working on a Node.js microservices architecture.

In Copilot, you open VS Code, navigate to a file, and start typing. The AI watches and suggests. You write the first few characters of a function, and Copilot offers to complete it. Tab to accept. You write the next function, and it suggests something plausible but wrong; you delete and write it yourself. You need to add error handling to an API endpoint; Copilot suggests a try-catch block with a generic error message. Useful, but you’ll need to customize it. The rhythm is: type, accept, type, reject, type, accept. The AI is a fast, moderately helpful autocomplete that occasionally surprises you with exactly the right suggestion.

In Cursor, you open the same project and work differently. Instead of writing the error handling yourself, you select the API endpoint, open Composer, and type: “Add retry logic with exponential backoff, circuit breaker pattern, and structured error logging to this endpoint. Update the corresponding test file.” Cursor generates a plan that touches the route handler, the error utility module, and the test suite. You review the diff across all three files, accept or modify, and move on. The rhythm is: describe intent, review plan, approve. The AI is a junior colleague who needs supervision but can execute multi-step plans.
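For readers who don’t write Node.js services every day, the code that prompt asks for looks roughly like the sketch below: a minimal retry-with-exponential-backoff wrapper with structured log lines. The names are hypothetical, the circuit breaker is omitted, and this is not necessarily what Cursor would generate:

```typescript
// Minimal sketch of the retry pattern described in the prompt. Names are
// hypothetical; the circuit breaker is omitted and logging is simplified.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      const delayMs = baseDelayMs * 2 ** attempt; // 200ms, 400ms, 800ms, ...
      // Structured log line instead of a bare string.
      console.error(JSON.stringify({ level: "warn", event: "retry", attempt, delayMs }));
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Usage inside a hypothetical route handler:
// const user = await withRetry(() => upstream.fetchUser(req.params.id));
```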

The Copilot workflow is lower-risk and lower-effort per interaction, but requires more interactions. The Cursor workflow is higher-risk and higher-effort per interaction, but handles larger units of work. Which is “better” depends on whether your bottleneck is typing speed or thinking speed.

For a junior developer still learning the codebase, Copilot’s line-by-line model is safer. Each suggestion is small enough to evaluate without deep context. For a senior developer who knows exactly what they want but finds the mechanical act of implementing it tedious, Cursor’s plan-and-execute model removes the tedium while preserving the decision-making.

This seniority split is one of the least discussed dynamics in the market. The METR study tested experienced developers on their own projects and found them slower with AI, though it’s worth noting that 56% of participants had never used Cursor before, introducing a tool-learning overhead. The vendor studies that show productivity gains often use developers working on unfamiliar tasks in unfamiliar codebases, where AI context helps bridge a knowledge gap. The uncomfortable implication: AI coding tools may help you most when you know least, and help you least when you know most.

The Productivity Mirage

Return to the METR study, because it exposes something that the market data alone cannot.

The study was specific: 16 experienced developers working on their own open-source projects, codebases they knew intimately, tackling 246 real issues across repositories averaging over a million lines of code. The sample was small, and the confidence interval wide (the true slowdown could range from 2% to 40%). But the methodology was rigorous: METR manually reviewed 143 hours of screen recordings. The AI didn’t help these developers go faster. It introduced overhead. Time writing prompts. Time reviewing generated code for correctness. Time managing the cognitive switch between directing an AI and thinking about the actual problem. Time fixing AI-generated code that was almost right but not quite.

The “almost right” part is critical. An AI that generates obviously wrong code is easy to reject. An AI that generates code that looks plausible but hides a subtle bug (a missing null check, a race condition, an off-by-one error at a boundary) demands more careful review than writing the code from scratch would have required.
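A hypothetical example, not drawn from any tool’s actual output, shows how quietly this kind of code can fail:

```typescript
// Hypothetical example, not taken from any tool's real output. It compiles,
// reads cleanly, and is wrong at the edges.
interface Order {
  total: number | null; // legacy rows may carry a null total
}

// Intended behavior: sum the totals of the first `count` orders.
function sumRecentOrders(orders: Order[], count: number): number {
  let sum = 0;
  // Off-by-one: `<=` sums count + 1 orders, and walks past the end of the
  // array when count equals orders.length.
  for (let i = 0; i <= count; i++) {
    // Null totals silently become 0 here, masking bad data instead of
    // surfacing it.
    sum += orders[i]?.total ?? 0;
  }
  return sum;
}
```

Spotting either problem requires reading the loop as carefully as you would have written it.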

And yet the developers felt faster. They reported 20% productivity gains, nearly the exact inverse of the measured 19% loss.

The psychology isn’t hard to understand. AI tools reduce the most tedious parts of coding: writing boilerplate, remembering function signatures, generating test scaffolding. These are the parts of programming that feel like drudgery. Eliminating drudgery feels like progress, even when the total task completion time increases because you’ve replaced ten minutes of boring-but-familiar work with fifteen minutes of prompt-craft and output-review.

This perception gap explains a great deal about the market. GitHub can truthfully claim that Copilot contributes 46% of code for active users. That number is real. What it doesn’t tell you is whether the code that Copilot writes would have been written faster by the developer alone. The 46% measures output attribution, not time savings. A developer who spends five minutes prompting and reviewing an AI-generated function that would have taken three minutes to write by hand has increased AI code contribution while decreasing personal productivity.

The vendor-sponsored studies that show 55% speed improvements use different methodology. They typically measure time to complete isolated, well-defined tasks in controlled environments, not the messy reality of working in a large, familiar codebase with complex requirements. The METR study’s design, using developers’ own projects with real-world constraints, is harder to replicate but more ecologically valid.

None of this means AI coding tools are useless. The Stack Overflow 2025 Developer Survey found that 84% of developers use or plan to use them, and 51% use them daily. Frequent users report higher satisfaction than occasional users, suggesting a learning curve: developers who invest time in learning effective prompting patterns, in understanding where AI excels and where it fails, extract more value.

But the blanket claim that “AI makes developers X% faster” deserves far more scrutiny than it receives. The answer depends on the developer, the task, the codebase, and how you measure speed. For boilerplate generation, code completion, and simple bug fixes, the gains are real. For complex architecture decisions, subtle debugging, and deep refactoring in unfamiliar code, the evidence is at best mixed.

The $36 billion AI coding tools market is built, in part, on a feeling. That feeling is genuine, and feelings drive purchase decisions. But the gap between perceived and actual productivity is a crack in the foundation that the industry has not yet addressed.

Two Prices, One Question

Cursor charges $20 a month. Copilot charges $10. At the business tier, it’s $40 versus $19.

The fact that Cursor reached $1 billion in ARR while charging double is the single most revealing data point in this entire market. Enterprise revenue grew 100x in 2025 alone.

The economics work like this. A 100-person engineering team at a startup pays $48,000 a year for Cursor or $22,800 for Copilot. The difference is $25,200, roughly the cost of two to three weeks of one senior engineer’s fully-loaded compensation. If Cursor saves each engineer even one hour a month that Copilot doesn’t, the premium pays for itself several times over.

For enterprises, the calculation inverts. A 5,000-person engineering organization pays $2.4 million a year for Cursor’s business tier versus $1.14 million for Copilot. The $1.26 million difference is real money, and the marginal productivity gain from better multi-file context has to justify it against the procurement friction, security review overhead, and tooling management complexity of standardizing on a less established vendor.
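The arithmetic behind both figures, and the break-even it implies, is simple to check. A sketch using the list prices quoted above and the rough $170-an-hour value from earlier (the break-even framing is illustrative):

```typescript
// Annual cost comparison at the quoted business-tier list prices, plus an
// illustrative break-even in hours saved per engineer per year.
const cursorPerSeat = 40;   // $/seat/month
const copilotPerSeat = 19;  // $/seat/month
const hourlyValue = 170;    // assumed, from the earlier back-of-the-envelope

const annualPremium = (seats: number) =>
  seats * (cursorPerSeat - copilotPerSeat) * 12;

console.log(annualPremium(100));   // 25,200    -> the 100-person startup
console.log(annualPremium(5_000)); // 1,260,000 -> the 5,000-person enterprise

const breakEvenHoursPerEngineer = (seats: number) =>
  annualPremium(seats) / (seats * hourlyValue);

console.log(breakEvenHoursPerEngineer(100).toFixed(1)); // ~1.5 hours/year
```

At list prices, the break-even works out to well under two hours of saved time per engineer per year. The hard part is proving those hours are actually saved.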

This is why the market segments the way it does. Startups and small teams choose Cursor. Enterprises choose Copilot. The few enterprises that choose Cursor tend to be engineering-led organizations where developer experience drives procurement, rather than IT-led organizations where compliance and existing vendor relationships drive it.

The pricing gap also explains why Copilot has 42% market share and Cursor has 18%, while Cursor’s revenue per user is substantially higher. Copilot dominates the volume market. Cursor dominates the value market. Neither position is inherently more sustainable, but they create very different business dynamics.

The Terminal Insurgency

While the editor wars play out, a third approach has emerged that may render the entire debate obsolete.

Anthropic launched Claude Code as a research preview in February 2025. It’s not an editor. It’s not a plugin. It’s a command-line tool that lives in your terminal, reads your codebase, writes files, runs commands, and executes multi-step workflows through conversation. It became generally available in May 2025 alongside Claude 4. By November, six months after the public launch, it hit $1 billion in annualized run rate, faster than ChatGPT’s early trajectory.

Claude Code’s philosophical premise is different from both Cursor and Copilot. Copilot says: “Your editor is your workspace; let me help you inside it.” Cursor says: “Let me rebuild your workspace around AI.” Claude Code says: “Your workspace is wherever you work, and I’ll meet you there.”

The terminal-first approach turns out to have properties that neither the plugin model nor the integrated-editor model can easily replicate. Claude Code runs on remote servers where no GUI editor exists. It works in CI/CD pipelines. It operates over SSH. It pairs with any editor, because it doesn’t care which editor you use. Its configuration lives in markdown files (CLAUDE.md, AGENTS.md) inside the repository, meaning project-specific conventions and instructions travel with the code rather than with the developer’s editor settings.

The model underneath, currently Claude’s most capable reasoning model, can maintain focus on complex tasks for extended periods of continuous operation. This isn’t a completion engine that predicts the next line. It’s a reasoning engine that can hold an entire project’s architecture in context while executing a multi-step plan.

GitHub noticed. In February 2026, GitHub added Claude Code to its new Agent HQ multi-agent platform. The implication is stark: even Microsoft’s own platform now acknowledges that the editor-centric model isn’t the only viable architecture for AI-assisted development.

For developers who have used all three tools, the experience is different in kind, not just degree. Copilot feels like a fast autocomplete. Cursor feels like a smart collaborator inside your editor. Claude Code feels like handing a task to a colleague and getting it back done. Each model has its failure modes, but they fail differently, and the terminal-based approach’s failures tend to be more visible and therefore more correctable.

When the Tools Break

No serious analysis of these products can ignore their failure modes, because failure modes reveal architectural assumptions.

Copilot’s characteristic failure is the confident wrong answer. It generates code that compiles, passes linting, looks correct at a glance, and contains a subtle logical error. This happens because Copilot optimizes for plausibility, for code that looks like code that a human would write. The problem is that plausible-looking code and correct code aren’t the same thing, especially in edge cases involving concurrency, security, or complex business logic.

Cursor’s characteristic failure is the over-ambitious refactoring. Given a prompt to “refactor this module,” it sometimes generates sweeping changes that touch files the developer didn’t intend to modify. The project-wide context that is Cursor’s greatest strength becomes a liability when the AI interprets a narrow request broadly. Experienced Cursor users learn to constrain their prompts, but the learning curve is steeper than Copilot’s “accept or reject this line” interaction model.

Claude Code’s characteristic failure is scope creep in multi-step tasks. Given a complex instruction, it can pursue a chain of reasoning that diverges from the developer’s intent, making changes that are internally consistent but don’t match what was actually needed. The terminal-based model also means review happens in the console rather than as inline diffs in an editor, and in auto-accept modes you only see the result after the files have already changed, which demands more trust and more careful review.

These failure modes map directly to the METR finding. Each tool introduces a different kind of overhead: Copilot demands line-by-line review of plausible-but-potentially-wrong code. Cursor demands prompt engineering skill to avoid over-broad changes. Claude Code demands clear problem decomposition and post-hoc review. All three add cognitive work that pure manual coding doesn’t require.

The question isn’t whether AI coding tools have failure modes. Everything does. The question is whether the productivity gains on successful operations outweigh the costs on failed ones. For experienced developers who have learned their tool’s patterns, the answer tilts positive. For newcomers who haven’t developed the instinct for when to trust and when to doubt, the tools can be actively misleading, producing code that works in demos and breaks in production.

The Data You’re Giving Away

There’s a dimension of this market that almost no comparison article mentions, because it’s uncomfortable for every vendor involved.

Every keystroke you make in Copilot flows through Microsoft’s servers. Every prompt you write in Cursor goes to Anysphere. Every terminal session with Claude Code passes through Anthropic’s API. The telemetry from millions of developers writing code constitutes one of the most valuable training datasets in existence: a real-time record of how humans solve programming problems, what patterns they use, what mistakes they make, and how they respond to AI suggestions.

The vendors are aware of this value. GitHub states that Copilot Business and Enterprise tiers don’t use customer code for model training. Cursor’s privacy policy makes similar commitments for its business tier. But free and individual tiers carry fewer guarantees, and the definition of “training” versus “product improvement” versus “model evaluation” has enough ambiguity to make privacy-conscious organizations uneasy.

The deeper issue is strategic. The company that accumulates the best data on how developers write code can build the best models for helping developers write code. This creates a flywheel: better tools attract more users, more users generate more data, more data produces better models, better models improve the tools. Microsoft, with 20 million Copilot users, has a data advantage that is difficult for any competitor to overcome through engineering talent alone.

Cursor’s counter-strategy is to be model-agnostic. It supports OpenAI’s GPT models, Anthropic’s Claude, and its own in-house models. This means Cursor can improve as any underlying model improves, without being locked into a single provider’s trajectory. But it also means Cursor’s competitive moat is thinner than Copilot’s: if the tool layer gets commoditized, Cursor has no frontier-scale proprietary model to fall back on.

Claude Code’s position is different again. Anthropic both builds the tool and the model. The feedback loop between Claude Code usage patterns and Claude model improvements is closed within a single company. This vertical integration is similar to Apple’s hardware-software approach, and it may prove to be the most durable competitive position in the long run.

For developers, the practical question is not whether your code is being used, but whether you’re comfortable with the trade-off. You get a productivity tool. The vendor gets a window into your development process. The exchange has been normalized so quickly that most developers don’t think about it. Whether they should is an open question.

The Windsurf Niche and the Long Tail

Windsurf, formerly Codeium, occupies a specific niche worth understanding. It’s a VS Code fork, like Cursor, but optimized for a different use case: extremely large codebases. Its remote indexing approach claims to handle repositories with over one million lines of code, a capability that matters enormously for enterprise teams working in monorepos.

Windsurf’s Cascade agent provides deep, cross-file context awareness. Its pricing sits between Copilot and Cursor: $15 per month for Pro, $30 for teams, $60 for enterprise. The credit-based model means heavy users may pay substantially more.

Beyond the top four, the field includes Replit, which has evolved into an AI-native development platform (private equity firm Hg Capital reported up to 6x productivity gains among power users at its portfolio companies). Devin, built by Cognition, takes the most aggressive approach: an autonomous AI software engineer that Nubank used to refactor millions of lines of ETL code, claiming 12x efficiency gains. Amazon Q Developer and Google’s Gemini Code Assist leverage their respective cloud ecosystems.

The proliferation of tools is itself a signal. When a market fragments this rapidly, it means no single product has found the winning architecture. The competition is still over paradigms, not features.

The Developer Identity Crisis

The 2025 Stack Overflow Developer Survey captures something that market share numbers miss. Developer trust in AI is declining even as adoption rises.

In 2023 and 2024, more than 70% of developers expressed positive sentiment toward AI tools. By 2025, that number dropped to 60%. Only 33% trust AI-generated code for accuracy. 46% actively distrust it. 87% worry about accuracy. 81% worry about security and privacy.

These numbers describe a population engaged in something they don’t fully trust. 84% use the tools or plan to, while only a third say they trust the output. This is not the profile of a satisfied customer base. It’s the profile of a workforce that feels it has no choice.

The pressure is structural. When the YC Winter 2025 batch reports that 25% of its startups have codebases that are 95% AI-generated, and an estimated 41% of code committed globally in 2025 was initially generated or suggested by AI (according to a GitClear analysis of 211 million changed lines), a developer who refuses AI tools is competing against developers who use them. Even if the METR study is correct that AI makes experienced developers slower on their own projects, the developer who uses AI can produce more code in an interview exercise, prototype faster in a hackathon, and appear more productive in a sprint velocity metric. Perception, not measurement, drives hiring and promotion.

Andrej Karpathy named this dynamic in February 2025 when he coined “vibe coding”: programming by feeling, accepting AI suggestions without reading the diffs, copying error messages back to the AI until it works. By early 2026, Karpathy himself had moved past the concept, proposing “agentic engineering” as the next phase. But vibe coding persists as the default mode for a growing share of developers, particularly those early in their careers who have never known programming without AI.

The long-term consequences of a generation of developers who can’t read their own code are not yet visible. They will be.

The Sustainability Question

There’s a number that doesn’t appear in any of the marketing materials. Running large language models is expensive. Every completion, every chat response, every agent task consumes GPU compute. At scale, the cost per user per month for AI inference is substantial, and for heavy users running multi-agent workflows, the cost can exceed the subscription price.

Copilot’s $10 per month individual tier was widely reported to be unprofitable when it launched. Microsoft could absorb the loss because Copilot drives GitHub adoption, Azure compute revenue, and enterprise lock-in. The product doesn’t need to make money on its own; it needs to make the Microsoft ecosystem stickier.

Cursor doesn’t have that luxury. At $20 per month, with no adjacent revenue streams, the subscription needs to cover inference costs plus engineering, infrastructure, and growth. Cursor’s rapid move toward usage-based pricing for premium features (the Pro+ tier at $60 per month for background agents, and BugBot as a $40 per month add-on) suggests the flat subscription model may not scale. As agents become more capable and consume more compute, the gap between what users pay and what the compute costs widens.

Claude Code straddles both models. Individual developers can access it through Anthropic’s Pro ($20 per month) or Max subscriptions with a pooled usage allowance. Power users and teams pay per API token consumed, where average daily costs run around $6 per developer but can spike far higher for heavy agentic workflows. The hybrid approach is more economically transparent than a flat rate, but it also means costs are unpredictable, and intensive users may face bills that exceed their Cursor or Copilot subscriptions by an order of magnitude.
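A rough monthly comparison makes the trade-off concrete. The $6 average is the figure quoted above; the working-day count and the heavy-usage spike are assumptions for illustration:

```typescript
// Rough monthly cost sketch for metered Claude Code usage versus flat
// subscriptions. The $6 average is quoted above; the spike is an assumption.
const workingDaysPerMonth = 21;
const averageDailyApiCost = 6;    // $/developer/day, as reported
const heavyDailyApiCost = 50;     // $/developer/day, assumed agent-heavy spike

const averageMonthly = averageDailyApiCost * workingDaysPerMonth; // ~$126
const heavyMonthly = heavyDailyApiCost * workingDaysPerMonth;     // ~$1,050

console.log({ averageMonthly, heavyMonthly, cursorPro: 20, copilotIndividual: 10 });
// Average use already runs several times a flat subscription; agent-heavy use
// can exceed Copilot's $10 tier by two orders of magnitude.
```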

The pricing structures of 2026 are transitional. Someone will have to pay for the compute. Either the tools get more expensive, or the models get more efficient, or the vendors find adjacent revenue streams to subsidize the core product. The current pricing is a market-capture strategy, not a sustainable business model, and developers choosing tools based on today’s prices may find the economics shift underneath them.

The Enterprise Stalemate

Inside large organizations, the Cursor-Copilot decision has surprisingly little to do with the products themselves.

Copilot wins by default. It’s already in the procurement system. It integrates with GitHub, Azure DevOps, Microsoft 365. It has the security certifications. The compliance team has already reviewed it. Adding Copilot licenses to an existing Microsoft enterprise agreement takes a purchase order, not a vendor review.

Cursor wins by insurgency. A team lead hears about it at a conference, tries it, gets hooked, and fights through procurement to get licenses for the team. The 100x growth in Cursor’s enterprise revenue during 2025 happened despite the procurement friction, not because of it. Every one of those enterprise deployments involved someone putting their credibility on the line against the path of least resistance.

Claude Code wins by stealth. Anthropic reached 300,000 business customers by August 2025. According to Menlo Ventures’ survey of enterprise technical decision-makers, Anthropic’s share of enterprise LLM spend rose from 12% in 2023 to 40% by late 2025. Many of these deployments started with individual developers installing a command-line tool and expensing the API costs. By the time procurement noticed, Claude Code was embedded in team workflows.

The dual-tool pattern is emerging as the pragmatic resolution. The enterprise standardizes on Copilot for everyone. Specific teams get Cursor or Claude Code licenses where they can demonstrate measurable productivity gains. It’s messy, it creates management overhead, and it’s what actually happens.

Research from Faros AI, based on telemetry from over 10,000 developers across 1,255 teams, adds a sobering footnote. AI coding tools increase individual developer output, but the gains don’t translate into company-level productivity improvements. Faros found that AI adoption correlated with a 9% increase in bugs per developer and a 154% increase in average pull request size. The bottleneck moves downstream: code review, QA, deployment pipelines, and organizational decision-making absorb the additional code without processing it faster. A developer who writes twice as fast but waits two days for code review has doubled the length of the review queue, not the speed of the organization.

What Comes Next

The convergence is unmistakable. Every tool is moving toward agents. Cursor runs eight in parallel. Copilot’s agent mode auto-corrects in loops. Claude Code reasons through multi-step plans. The completion engine that Copilot launched with in 2021 is becoming a legacy feature.

But convergence on the what doesn’t mean convergence on the how. The plugin model, the integrated-editor model, and the terminal-first model represent genuinely different bets on how developers will want to interact with AI. It’s possible that the IDE itself, the paradigm both Cursor and Copilot operate within, gets disrupted by approaches that don’t assume an editor is the center of the workflow. Replit’s browser-based approach, Devin’s autonomous-engineer approach, and Claude Code’s meet-you-anywhere approach all point in this direction.

The model layer keeps improving underneath all of them, which means the feature gap between tools narrows on every cycle. What differentiated one product six months ago becomes table stakes today. This puts relentless pressure on tool makers to compete on experience, ecosystem, and workflow rather than raw capability.

For developers making choices today, three things matter more than which tool scores best in a feature comparison:

First, how much time you’re willing to invest in learning. Every tool has a learning curve. The METR study’s subjects had prior experience with AI tools, though more than half had never used Cursor, and they were still slower. The developers who extract real value from these tools are the ones who have internalized when to trust, when to question, and when to write the code themselves.

Second, what kind of work dominates your day. If you spend most of your time in a large, well-understood codebase doing incremental work, Copilot’s inline completions may be all you need. If you do frequent multi-file refactoring or greenfield development, Cursor’s project-wide context justifies the premium. If you work across diverse environments, including terminals, CI/CD, and remote servers, Claude Code goes where the others can’t.

Third, whether you can distinguish between feeling productive and being productive. The 40-percentage-point gap between perceived and actual productivity in the METR study is a warning. AI coding tools are, in part, a palliative. They make programming feel less tedious. That’s worth something. But mistaking reduced tedium for increased output leads to bad decisions about tooling, bad decisions about team sizing, and bad decisions about what a developer can accomplish in a sprint.

Start with the inconvenient number, and end with it.

The METR study found that experienced developers were 19% slower with AI and believed they were 20% faster. That 40-point perception gap hasn’t gone away. It has been papered over by adoption metrics, revenue growth, and the structural pressure of an industry that has decided AI coding tools are mandatory before proving they work.

The four MIT students who forked VS Code in 2022 were right about the transformation. AI is changing software development profoundly enough to justify rebuilding the tools from scratch. But being right about the transformation doesn’t mean the current tools deliver what the market claims they deliver. It means the tools are early, the research is unsettled, and the companies are selling the future while the present is still being debugged.

The $36 billion in combined valuations is a bet on that future. It’s a bet worth making. But it’s a bet, not a proof. And the developers writing code with these tools every day, the ones who feel faster even when they aren’t, deserve to know the difference.


Published on February 8, 2026.

About the Author

Gene Dai is the co-founder of OpenJobs AI, a next-generation recruitment technology platform. He writes about the intersection of artificial intelligence, developer tools, and the future of work.