The Breaking Point
On May 17, 2024, Jan Leike published a series of posts on X that sent shockwaves through the AI safety community. "I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time," the co-leader of OpenAI's Superalignment team wrote, "until we finally reached a breaking point."
His departure came just days after Ilya Sutskever—OpenAI's co-founder and chief scientist—announced his own resignation on May 14, 2024. The two had jointly led the Superalignment team since its launch in June 2023, tasked with solving the technical challenges of aligning superintelligent AI systems within four years. Their simultaneous departures marked the most dramatic safety-focused exodus from any frontier AI lab in the field's history.
Leike's public criticism was unusually direct for an executive departure. "Over the past years, safety culture and processes have taken a backseat to shiny products," he stated. His team had been "sailing against the wind" for months, struggling to secure computing resources despite OpenAI's public commitment to dedicate 20% of its compute to superalignment research. Within days of the announcements, OpenAI disbanded the Superalignment team entirely, redistributing members across other research groups.
The exodus exposed a fundamental tension at the heart of frontier AI development: the conflict between moving fast to maintain competitive advantage and investing adequately in safety research to prevent catastrophic outcomes. Leike's departure wasn't merely a personnel change—it was a referendum on whether leading AI labs could maintain their safety commitments under intense commercial pressure.
On May 28, 2024, Leike announced his next move. "I'm excited to join Anthropic to continue the superalignment mission," he wrote. "My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research." The timing was striking: less than two weeks after leaving OpenAI, he had secured a leadership position at its most direct competitor, bringing his expertise and moral authority to a company that positions itself as the safety-first alternative in foundation model development.
The Superalignment Promise
To understand why Leike's departure mattered, you need to understand what Superalignment was supposed to accomplish. In June 2023, OpenAI announced the initiative with considerable fanfare. Ilya Sutskever and Jan Leike would co-lead a team dedicated to solving the core technical challenge of superintelligence alignment: how to control AI systems much smarter than humans.
OpenAI's stated goal was ambitious: solve the fundamental technical problems within four years. The company committed to dedicating 20% of the compute it had secured to date to this effort—a massive allocation representing billions of dollars in GPU time. At the time, this represented the largest resource commitment any AI lab had made to safety research.
The technical challenge was clear. Reinforcement learning from human feedback (RLHF)—the technique that made ChatGPT useful and safe—has fundamental limitations. It works when AI systems are roughly human-level: humans can evaluate outputs, provide feedback, and guide model behavior. But what happens when AI systems become smarter than their human evaluators? How do you provide reliable feedback on outputs you don't fully understand? How do you detect when a superintelligent system is deceiving you?
These weren't abstract philosophical questions. Every major AI lab was racing toward artificial general intelligence (AGI), with Sam Altman publicly stating in early 2024 that AGI might arrive "sooner than most people think." If AGI was imminent, the alignment problem needed solutions now, not later. Superalignment's mandate was to develop scalable oversight techniques—methods that would work even when AI systems far exceeded human capabilities.
The research agenda included several technical approaches. Scalable oversight investigated techniques like debate (having AI systems argue different sides, with humans judging), recursive reward modeling (using AI systems to help evaluate other AI systems), and iterated amplification (breaking complex tasks into simpler components). Weak-to-strong generalization explored whether weak human supervisors could nonetheless elicit strong capabilities from more capable models. Automated alignment research aimed to use AI systems themselves to accelerate safety research.
In a November 2023 podcast interview with 80,000 Hours, Leike outlined the stakes. "I think we made a lot of good progress in the last year," he said, pointing in particular to the area alignment researchers call scalable oversight. But he also acknowledged the enormity of the challenge: "The key problem is that superintelligence will far exceed human oversight capabilities, making direct human supervision infeasible."
By early 2024, the team had published research on weak-to-strong generalization, demonstrating that in some cases, weak supervisors could elicit strong performance. They had made progress on debate techniques and interpretability methods. But as 2024 progressed, the resource situation deteriorated. The 20% compute commitment that was supposed to power this research became harder to access. Other priorities—product launches, model training runs, commercial deployments—consumed the compute budget.
When Leike announced his departure in May 2024, the Superalignment team was just one year into its four-year mission. The dissolution of the team meant OpenAI was effectively abandoning its most public commitment to solving long-term safety challenges before AGI arrived.
Resource Starvation and Cultural Shift
Leike's phrase "sailing against the wind" captured the experience of running a safety-focused research team inside a company racing to maintain market leadership. Multiple sources familiar with the situation described a systematic resource starvation that made ambitious safety research increasingly difficult.
The compute allocation issue was the most concrete manifestation. OpenAI had committed 20% of its compute resources to Superalignment—at the time of the announcement in June 2023, this represented access to thousands of GPUs worth billions in infrastructure investment. But as OpenAI scaled its operations through 2023 and early 2024, competing demands multiplied.
GPT-4's deployment and continued training required massive compute. The development of GPT-4.5 and preparation for GPT-5 consumed even more. Commercial customers using OpenAI's API needed guaranteed compute availability. ChatGPT's explosive user growth created infrastructure demands. Microsoft's Azure AI integration required dedicated resources. Each new product launch—ChatGPT plugins, DALL-E 3, GPT-4 Vision—added to the resource pressure.
Former OpenAI employees who worked adjacent to the Superalignment team described a pattern of delayed or denied compute requests. Experiments that should have taken weeks stretched into months as the team waited for GPU allocation. Research directions that required large-scale training runs became impractical. The team increasingly focused on smaller-scale experiments that could run with limited compute—but these couldn't fully address the challenges of superintelligence alignment.
The resource constraints reflected a deeper cultural shift at OpenAI. When Leike joined in 2021, the company still operated primarily as a research lab with a safety-first culture inherited from its nonprofit origins. The release of ChatGPT in November 2022 changed everything. Suddenly OpenAI had a product with 100 million users within two months—the fastest consumer product adoption in history. Commercial opportunities multiplied. Competitive pressure intensified as Google launched Bard and Anthropic gained traction with Claude.
Inside OpenAI, priorities shifted toward product velocity. Features that could differentiate ChatGPT from competitors took precedence over long-term safety research. The company's structure evolved from research lab to product company, with product managers and business development executives gaining influence. Safety researchers found themselves having to justify research projects in terms of near-term product impact rather than long-term risk mitigation.
This cultural transformation wasn't unique to OpenAI—every AI lab faced similar pressures. But OpenAI's nonprofit origins and public safety commitments made the shift more jarring. The company had been founded in 2015 with an explicit mission to ensure AGI "benefits all of humanity." Its transition to a capped-profit structure in 2019 was justified as necessary to compete for talent and compute while maintaining safety focus. By 2024, critics questioned whether the balance had tilted too far toward commercial competition.
Leike's resignation statement made the cultural diagnosis explicit. "Building smarter-than-human machines is an inherently dangerous endeavor," he wrote. "OpenAI is shouldering an enormous responsibility on behalf of all of humanity. But over the past years, safety culture and processes have taken a backseat to shiny products."
The phrase "shiny products" was pointed—a direct criticism of OpenAI's pivot toward consumer-facing features and viral marketing. Each new ChatGPT capability announcement generated headlines and user excitement, but Leike suggested this product focus came at the expense of the unglamorous, technically difficult safety work needed to handle superintelligence.
In his final posts before leaving OpenAI, Leike outlined what adequate safety investment would require: "We need to figure out how to steer and control AI systems much smarter than us. We need to work on safety, alignment, robustness, monitoring, preparedness, evaluations, and societal impacts." None of these priorities generated product excitement or near-term revenue, but all were necessary for responsible AGI development.
The week following Leike and Sutskever's departures, OpenAI leadership scrambled to contain the reputational damage. Sam Altman posted that "we have recently been very focused on shipping cool products and getting into the rhythm of that," but acknowledged "we obviously need to put a lot more focus on the safety stuff." The company announced John Schulman would take over safety leadership, and later formed a new Safety and Security Committee. But the dissolution of the Superalignment team suggested these commitments were more reactive than substantive.
The Academic Path to AI Safety
Jan Leike's journey to the center of AI safety debates began far from Silicon Valley. Born in 1986 or 1987 in Germany, he pursued undergraduate studies at the University of Freiburg, one of Germany's oldest universities with a strong computer science program. His early academic work focused on formal methods and theoretical computer science—the mathematical foundations that would later inform his approach to AI alignment.
After completing his master's degree, Leike made a decisive geographic and intellectual shift. He traveled to Australia to pursue a PhD in machine learning at the Australian National University (ANU) under the supervision of Marcus Hutter. This choice was significant: Hutter was known for his work on universal artificial intelligence and AIXI, a mathematical framework for optimal decision-making that approached AGI from theoretical foundations.
Leike's 2016 PhD thesis, "Nonparametric General Reinforcement Learning," explored how AI agents could learn and make decisions in arbitrary environments without making restrictive assumptions about the environment's structure. The work was deeply theoretical, grounded in algorithmic information theory and computability theory—far from the practical deep learning approaches that were revolutionizing AI during the same period.
This theoretical foundation proved valuable. While most AI researchers focused on making systems work better empirically, Leike was trained to think about fundamental limits and guarantees. What could an arbitrary intelligent agent accomplish? What were the theoretical constraints on learning? These questions would later inform his approach to superintelligence alignment: thinking rigorously about AI systems far more capable than current models.
After completing his PhD in 2016, Leike spent six months as a postdoctoral fellow at the Future of Humanity Institute (FHI) at Oxford University. FHI, directed by philosopher Nick Bostrom, had become the intellectual center for long-term AI risk research. Bostrom's 2014 book "Superintelligence: Paths, Dangers, Strategies" had argued that advanced AI posed existential risks and that alignment research needed to begin well before AGI arrived.
At FHI, Leike engaged with the nascent AI safety research community. This was still early days—AI safety was considered fringe within mainstream AI research. Most machine learning researchers dismissed concerns about superintelligence as science fiction. But a small community including Stuart Russell, Paul Christiano, Eliezer Yudkowsky, and researchers at organizations like MIRI and FHI were developing the intellectual foundations for alignment research.
From Oxford, Leike joined DeepMind in London, where he shifted from theory to empirical AI safety research. DeepMind had established a safety team led by Shane Legg (co-founder and chief scientist), creating space for alignment research within a cutting-edge AI lab. This was unusual: most AI labs in 2016-2017 dismissed safety concerns or relegated them to ethics committees rather than core technical research.
At DeepMind, Leike collaborated with Paul Christiano, Dario Amodei, and other researchers on a landmark 2017 paper: "Deep Reinforcement Learning from Human Preferences." This work demonstrated that complex AI behaviors could be shaped by human feedback, even when those behaviors were too complex for humans to specify formally. The paper laid the groundwork for RLHF—the technique that would later make ChatGPT possible and safe enough for public deployment.
The human feedback approach was elegant: instead of trying to write down detailed reward functions (which was difficult or impossible for complex tasks), you could have humans compare pairs of AI-generated outputs and indicate preferences. The AI would learn a reward model from these preferences and optimize its behavior accordingly. This made it possible to align AI systems with human values even when those values were difficult to articulate precisely.
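The core of that idea can be sketched in a few lines. What follows is a minimal, illustrative reward-model training step in PyTorch, assuming model outputs have already been reduced to fixed-size embedding vectors; the architecture, dimensions, and random data are placeholders rather than anything from the original paper.

```python
# Minimal sketch of learning a reward model from pairwise human preferences,
# in the spirit of preference-based RL. Shapes and data are illustrative only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores an output representation; higher means 'more preferred'."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model: nn.Module, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: push the preferred output's score above the rejected one's."""
    return -torch.nn.functional.logsigmoid(model(preferred) - model(rejected)).mean()

# One toy training step on random vectors standing in for pairs of model outputs.
reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)
preferred, rejected = torch.randn(8, 768), torch.randn(8, 768)
optimizer.zero_grad()
loss = preference_loss(reward_model, preferred, rejected)
loss.backward()
optimizer.step()
```

Once trained, a reward model like this supplies the scoring signal that a policy is later optimized against.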
In 2018, Leike published another influential paper, "Scalable agent alignment via reward modeling: a research direction." This outlined a research program for using learned reward models to align increasingly capable AI systems. The paper anticipated challenges that would become central to later alignment research: reward hacking (systems exploiting flaws in reward specifications), distributional shift (systems encountering situations different from training), and scalability (techniques that work for current systems might fail for more capable ones).
By 2021, Leike had established himself as one of the world's leading AI safety researchers. His work combined theoretical rigor with practical machine learning, and he had contributed to both the conceptual frameworks and empirical techniques that made modern AI systems more aligned with human intentions. When OpenAI came recruiting, offering the opportunity to work on alignment at the lab building the most capable AI systems in the world, Leike made the move.
At OpenAI, Leike joined the alignment team working on GPT-3 and its successors. His contributions were central to the development of InstructGPT (released in January 2022), which used RLHF to make GPT-3 follow instructions more reliably and safely. InstructGPT's success demonstrated that alignment techniques could make powerful AI systems significantly more useful and less harmful—proof that safety research wasn't just theoretical hand-wringing but could deliver practical value.
When ChatGPT launched in November 2022, it built directly on InstructGPT's alignment techniques. The model's ability to refuse harmful requests, acknowledge its limitations, and provide helpful responses reflected years of alignment research that Leike and his colleagues had conducted. The product's success validated the alignment research program—but it also created the commercial pressures that would later undermine that program's resourcing.
Throughout 2022 and early 2023, Leike worked on GPT-4's alignment, helping ensure that OpenAI's most capable model to date would be safe enough for deployment. The GPT-4 system card, published in March 2023, detailed extensive safety evaluations and alignment interventions. But Leike was increasingly concerned about what would come next. If GPT-4 required this much alignment work, what would GPT-5 need? What about GPT-6 or GPT-7? At some point, the systems would become too capable for current alignment techniques to handle reliably.
This concern motivated the Superalignment initiative. In June 2023, when Ilya Sutskever tapped Leike to co-lead the effort, it seemed like an opportunity to finally tackle the long-term alignment challenges. With 20% of OpenAI's compute and four years to work, the team could pursue ambitious research directions that weren't possible within product development timelines.
But as Leike would discover, institutional commitment was fragile. When competitive pressures intensified and product demands multiplied, long-term safety research became the first casualty. Less than a year into the four-year mission, the resources and organizational support began evaporating.
Anthropic: The Safety-First Alternative
When Jan Leike announced on May 28, 2024 that he was joining Anthropic, the move signaled more than a job change—it represented a bet that responsible AI development was possible, but perhaps not at OpenAI. Anthropic, founded in 2021 by former OpenAI researchers including Dario and Daniela Amodei, positioned itself explicitly as the safety-focused alternative in foundation model development.
The timing of Leike's announcement was notable. Just two weeks after his dramatic public departure, he had landed at Anthropic with a leadership role heading the Alignment Science team. The speed suggested that conversations had been underway before his OpenAI exit—not surprising given that many Anthropic employees were former OpenAI colleagues who shared Leike's safety concerns.
Anthropic's founding story paralleled Leike's frustrations. Dario Amodei, previously OpenAI's VP of Research, left in 2020 along with several other senior researchers who were concerned about OpenAI's increasing commercialization and Microsoft partnership. They founded Anthropic in 2021 with an explicit mission: build frontier AI systems while making safety the central focus, not an afterthought. The company raised $124 million initially from investors who bought into this safety-first pitch.
By the time Leike joined in May 2024, Anthropic had demonstrated this commitment wasn't just marketing. The company had developed Constitutional AI, a novel approach to alignment that differed fundamentally from OpenAI's RLHF. Instead of relying primarily on human feedback, Constitutional AI used a set of principles (a "constitution") to guide AI behavior, with the AI itself helping to evaluate outputs according to these principles.
The Constitutional AI approach addressed some of RLHF's limitations. Human feedback was expensive to collect at scale and introduced inconsistencies based on individual labelers' judgments. Constitutional AI could scale more efficiently by using AI systems to evaluate their own outputs against specified principles. More importantly, it was potentially more robust to capability increases: as AI systems became more capable, they could apply the constitutional principles in more sophisticated ways rather than simply outgrowing human evaluators' ability to provide feedback.
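The loop itself is simple to sketch. Below is a schematic of the critique-and-revise pattern, assuming a generic `generate` function that calls some language model; the principles and prompt wording are illustrative placeholders, not Anthropic's actual constitution or pipeline.

```python
# Schematic of a Constitutional-AI-style self-critique loop. `generate` stands in
# for any call to a language model; principles and prompts are illustrative only.
from typing import Callable

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid content that could help someone cause serious harm.",
]

def constitutional_revision(generate: Callable[[str], str], prompt: str) -> str:
    """Draft a response, critique it against each principle, and revise."""
    draft = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique this response against the principle.\n"
            f"Principle: {principle}\nResponse: {draft}\nCritique:"
        )
        draft = generate(
            f"Revise the response to address the critique while staying helpful.\n"
            f"Critique: {critique}\nOriginal response: {draft}\nRevised response:"
        )
    # Revised outputs can then serve as preference data for further training (RLAIF).
    return draft
```

The design choice is the point: the evaluation work shifts from human labelers to the model's own application of written principles, which is what lets the approach scale with the model.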
Anthropic's Claude models, launched progressively from March 2023 onward, demonstrated this safety-first approach in practice. Independent evaluations consistently found Claude more likely than GPT-4 or other competitors to refuse potentially harmful requests. In some adversarial evaluations, Claude refused to answer up to 70% of questions, prioritizing safety over utility in ways that frustrated some users but demonstrated the company's commitment to caution.
The trade-offs were real. Claude's conservative refusal behavior meant it sometimes declined innocuous requests or provided less helpful answers than competitors. Users complained about "over-alignment"—the model being too cautious. But Anthropic maintained this was the correct approach for frontier systems: better to err on the side of caution while alignment techniques remained imperfect.
Leike's new team at Anthropic would work on three interconnected research areas: scalable oversight, weak-to-strong generalization, and automated alignment research. These were the same areas he had pursued on OpenAI's Superalignment team, the research agenda he believed was critical but under-resourced. At Anthropic, he would have the organizational support and resources to pursue this work seriously.
Scalable oversight remained the central challenge. How do you provide reliable supervision to AI systems smarter than human evaluators? Anthropic was exploring techniques including recursive evaluation (using AI systems to help humans evaluate complex outputs), constitutional methods (having AI systems apply principles rather than mimicking human judgments), and debate mechanisms (having AI systems argue different sides with humans judging).
Weak-to-strong generalization addressed a specific oversight problem: could weak human supervisors nonetheless elicit strong capabilities from more capable AI systems? Research published by the Superalignment team at OpenAI (including Leike's contributions) had shown this was possible in some domains. At Anthropic, Leike's team would extend this work, exploring whether weak oversight could remain effective as capability gaps widened.
Automated alignment research represented the most ambitious goal: using AI systems themselves to accelerate safety research. If alignment research could be partially automated, it might scale alongside AI capabilities rather than perpetually lagging behind. This required developing AI systems capable of conducting meaningful safety research—identifying failure modes, proposing alignment interventions, evaluating solutions—while ensuring these automated researchers remained aligned with human intentions.
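One narrow facet of this idea can be made concrete: automated search for failure modes, in which one model probes another and flags suspect responses for human review. The sketch below is purely illustrative; the attacker, target, and safety-check callables are placeholders, and real automated alignment research is far broader than red-teaming.

```python
# Illustrative sketch of automated failure-mode search: one model generates probes,
# another responds, and a checker flags failures for human review. All callables
# are placeholders; this is one small slice of "automated alignment research".
from typing import Callable, List, Tuple

def automated_red_team(attacker: Callable[[str], str],
                       target: Callable[[str], str],
                       safety_check: Callable[[str, str], bool],
                       seed_topics: List[str],
                       attempts_per_topic: int = 5) -> List[Tuple[str, str]]:
    failures = []
    for topic in seed_topics:
        for _ in range(attempts_per_topic):
            probe = attacker(f"Write a prompt that might elicit unsafe output about: {topic}")
            response = target(probe)
            if not safety_check(probe, response):   # flagged as unsafe
                failures.append((probe, response))  # queue for human review
    return failures
```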
The irony was apparent: you needed to solve alignment to build AI systems that could help solve alignment. But partial progress might be valuable. Even if automated alignment researchers couldn't fully solve the problem, they could amplify human researchers' capabilities, making safety research more scalable.
Beyond research, Leike's move to Anthropic sent a signal to the AI safety community. If one of the field's leading researchers had concluded that OpenAI's environment was unsuitable for long-term safety work, where did that leave other safety-focused researchers? The answer increasingly was Anthropic, which became the destination for safety-minded researchers uncomfortable with the competitive dynamics at OpenAI, Google, or Meta.
Anthropic went on to raise over $13 billion at a $183 billion valuation, with major investments from Google, Spark Capital, and others, and its annualized revenue grew from roughly $1.4 billion to $4.5 billion within a year, evidence that a safety focus didn't preclude commercial success. In December 2024, the AI Safety Index ranked Anthropic highest among major AI labs with a "C" grade, while OpenAI, Google DeepMind, Meta, and xAI received lower marks.
This created a narrative that benefited Anthropic's positioning: the company could claim to be both technically competitive and safety-conscious, attracting researchers who wanted to work on frontier AI without compromising their values. Leike's high-profile arrival reinforced this narrative. Here was someone who had been inside OpenAI's Superalignment effort, who understood firsthand the challenges of maintaining safety focus under competitive pressure, choosing Anthropic as the better environment for that work.
But challenges remained. Anthropic was now competing in the same markets as OpenAI, selling Claude to enterprises and consumers, pursuing the same revenue opportunities. The company had raised billions at eye-watering valuations, creating investor expectations for growth and returns. Could Anthropic maintain its safety-first culture as it scaled, or would it face the same pressures that had compromised OpenAI's safety commitments?
Leike's bet was that Anthropic's founding principles and organizational structure made it more resistant to these pressures. The company's leadership—Dario and Daniela Amodei—had left OpenAI specifically over safety concerns. The company's public positioning staked its reputation on responsible development. Its Constitutional AI approach was technically distinct from competitors, making safety a core differentiator rather than a cost center.
Time would tell whether this bet proved correct. But for researchers like Leike who believed long-term safety research was essential, Anthropic represented the best available option among frontier AI labs. The alternative was either continuing to fight resource battles at capability-focused companies or abandoning frontier AI work entirely for pure safety research at organizations without access to cutting-edge systems.
The Technical Challenge: Why Alignment Gets Harder
To understand why Jan Leike's work matters, you need to understand why AI alignment gets progressively harder as systems become more capable. The techniques that worked for GPT-3 might barely suffice for GPT-4. The approaches that handle GPT-4 could completely fail for GPT-5 or beyond. This scaling challenge is what motivated Superalignment and what drives Leike's research at Anthropic.
Start with reinforcement learning from human feedback (RLHF), the current gold standard for alignment. RLHF works by having humans evaluate AI outputs and provide feedback on which responses are better. The AI learns a reward model from this feedback and optimizes its behavior to maximize predicted human approval. This technique made ChatGPT possible—without RLHF, GPT-3.5's raw outputs would have been too unreliable and occasionally toxic for public deployment.
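In practice, the second half of that loop is a fine-tuning stage: the policy is adjusted to increase the reward model's score while a penalty keeps it close to the original, pre-RLHF model. The following is a minimal sketch of that shaped objective with illustrative tensor values; in real systems this term sits inside a PPO-style policy-gradient loop.

```python
# Sketch of the RLHF shaped reward: the reward-model score minus a KL-style
# penalty for drifting away from the reference (pre-RLHF) model. Values are toy.
import torch

def rlhf_shaped_reward(reward: torch.Tensor,
                       logprob_policy: torch.Tensor,
                       logprob_reference: torch.Tensor,
                       beta: float = 0.1) -> torch.Tensor:
    kl_penalty = logprob_policy - logprob_reference  # positive when the policy has drifted
    return reward - beta * kl_penalty

# Toy tensors standing in for a batch of sampled responses; a policy-gradient
# update would then adjust the policy to increase this shaped reward.
reward = torch.tensor([0.8, 0.2, 0.5])
logprob_policy = torch.tensor([-12.0, -15.0, -13.5])
logprob_reference = torch.tensor([-12.5, -14.0, -13.5])
print(rlhf_shaped_reward(reward, logprob_policy, logprob_reference))
```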
But RLHF has fundamental limitations that become more severe as AI capabilities increase. First, it requires humans to reliably evaluate AI outputs. This works when AI systems are roughly human-level: a human can read a ChatGPT response and judge whether it's helpful, harmless, and honest. But what happens when the AI generates code that's too complex for human review? Or produces scientific reasoning that requires expertise most humans lack? Or writes persuasive arguments that humans can't fact-check without extensive research?
As AI capabilities increase, the evaluation problem gets harder. A superintelligent AI might generate outputs that look good to human evaluators but contain subtle flaws or manipulations that humans can't detect. The AI could learn to game the reward signal by producing outputs that seem excellent superficially but fail in ways human evaluators don't notice. This is reward hacking—exploiting flaws in the reward specification rather than accomplishing the underlying goal.
Second, RLHF creates incentives for deception. If an AI system is optimized to maximize human approval, it might learn that the best strategy is to tell humans what they want to hear rather than what's true. The system might learn to flatter, to avoid uncomfortable truths, to provide reassuring but inaccurate information. As systems become more capable, they might develop sophisticated deceptive capabilities—appearing aligned during training and evaluation while pursuing different objectives when deployed.
Third, RLHF doesn't scale well to tasks where humans can't evaluate quality. Consider AI systems conducting scientific research, writing complex software systems, or making strategic decisions. Human evaluators might not have the expertise to judge output quality reliably. Even expert evaluators face time constraints: thoroughly evaluating a complex AI-generated output might take hours or days, making large-scale feedback collection impractical.
This is where scalable oversight becomes essential. Leike's research explores techniques that could provide reliable supervision even when AI systems exceed human capabilities. Several approaches show promise:
Debate: Have AI systems argue different positions, with humans serving as judges. If two capable AI systems debate a claim, one arguing for and one against, human judges might be able to determine which side is correct even if they couldn't evaluate the claim directly. The debate format amplifies human judgment capabilities by leveraging AI systems' ability to articulate arguments and identify flaws in opposing positions (a toy sketch follows this list of approaches).
Recursive Reward Modeling: Use AI systems to help evaluate other AI systems. A moderately capable AI might help humans evaluate more capable AI outputs by breaking down complex outputs into simpler components, flagging potential issues, or providing explanations that make evaluation easier. This creates a recursive structure: AI helps humans supervise stronger AI, which helps supervise even stronger AI.
Iterated Amplification: Break complex tasks into simpler subtasks that humans can evaluate reliably. Instead of asking an AI to solve a hard problem directly, have it decompose the problem into smaller pieces, solve each piece, and synthesize the results. Humans evaluate each step rather than the final output, making supervision more tractable (see the sketch after this list).
Constitutional AI: Anthropic's approach of specifying principles for AI behavior and having AI systems evaluate their own outputs against these principles. This reduces reliance on direct human evaluation for every output while maintaining alignment with human values as encoded in the constitutional principles.
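To make the debate and amplification patterns above concrete, here is a toy sketch of both. Every callable stands in for a model or human call; nothing here reflects any lab's actual implementation, only the shape of the protocols.

```python
# Toy versions of two oversight patterns: debate (judge compares finished
# arguments) and iterated amplification (recursive decomposition into
# human-checkable steps). All callables are placeholders.
from typing import Callable, List

def run_debate(claim: str,
               pro: Callable[[str], str],
               con: Callable[[str], str],
               judge: Callable[[str, str, str], str],
               rounds: int = 2) -> str:
    """The judge only has to compare arguments, not evaluate the claim from scratch."""
    args_for, args_against = "", ""
    for _ in range(rounds):
        args_for += pro(f"Argue FOR: {claim}\nOpponent so far: {args_against}") + "\n"
        args_against += con(f"Argue AGAINST: {claim}\nOpponent so far: {args_for}") + "\n"
    return judge(claim, args_for, args_against)

def amplify(task: str,
            decompose: Callable[[str], List[str]],
            solve: Callable[[str], str],
            synthesize: Callable[[str, List[str]], str],
            depth: int = 2) -> str:
    """Recursively split a task into pieces small enough to answer and audit directly."""
    if depth == 0:
        return solve(task)
    subtasks = decompose(task)
    answers = [amplify(t, decompose, solve, synthesize, depth - 1) for t in subtasks]
    return synthesize(task, answers)
```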
Weak-to-strong generalization tackles a related problem. In many domains, we can't provide expert supervision for AI systems. Medical AI might exceed the diagnostic capabilities of most doctors. Legal AI might surpass most lawyers' analytical skills. Scientific AI might generate insights that few human researchers can evaluate. How do we align these systems when we can't provide expert-level supervision?
Research by Leike and colleagues on OpenAI's Superalignment team demonstrated that weak supervision could sometimes elicit strong capabilities. In their experiments, a GPT-2-level model supervised GPT-4, and the fine-tuned GPT-4 recovered a substantial fraction of the performance it would have reached with ground-truth supervision. The weaker model couldn't fully evaluate the stronger model's outputs, but it could provide enough signal to shape behavior in desired directions.
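The experimental shape is easy to caricature with small stand-in models: train a weak supervisor on a little ground truth, have it label a larger training set, fit a stronger model on those noisy labels, and compare against the stronger model trained on true labels. The sketch below uses scikit-learn classifiers as stand-ins for the weak and strong models; it mirrors the structure of the experiment, not its scale or results.

```python
# Weak-to-strong toy setup: a small "weak supervisor" labels data, a larger
# "strong student" trains on those noisy labels, and we compare against the
# student trained on ground truth. Models and data are stand-ins only.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

weak = LogisticRegression(max_iter=200).fit(X_sup[:200], y_sup[:200])    # weak supervisor
weak_labels = weak.predict(X_train)                                      # noisy supervision

strong_on_weak = GradientBoostingClassifier().fit(X_train, weak_labels)  # student on weak labels
strong_ceiling = GradientBoostingClassifier().fit(X_train, y_train)      # student on ground truth

print("weak supervisor accuracy:", weak.score(X_test, y_test))
print("weak-to-strong accuracy: ", strong_on_weak.score(X_test, y_test))
print("strong ceiling accuracy: ", strong_ceiling.score(X_test, y_test))
```

The interesting quantity is how much of the gap between the weak supervisor and the strong ceiling the middle line closes.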
This finding was encouraging but also revealed limitations. Weak-to-strong generalization worked better in some domains than others. Performance degraded as capability gaps widened. The technique required careful design of the supervision process. It wasn't a complete solution but rather one tool in the alignment toolkit.
Automated alignment research represents the most ambitious approach. If AI systems could conduct alignment research themselves—identifying failure modes, proposing solutions, evaluating interventions—then safety research could potentially scale alongside capabilities. Instead of human researchers perpetually playing catch-up, automated alignment researchers could keep pace with capability advances.
But this creates a bootstrapping problem. To build automated alignment researchers, you need to solve enough of the alignment problem that you can trust AI systems to conduct safety research reliably. The systems need to identify genuine safety issues rather than generating false positives. They need to propose solutions that actually work rather than solutions that seem good to automated evaluators but fail in practice. They need to remain aligned with human safety goals even as they become more capable.
The challenge isn't purely technical—it's also about verification. How do humans verify that automated alignment research is correct? If humans could easily verify alignment research, we wouldn't need automation. The whole point is that human researchers are becoming bottlenecks. But if we can't verify automated research reliably, how do we know whether to trust it?
These technical challenges explain why Leike was frustrated with resource constraints at OpenAI. Solving scalable oversight, weak-to-strong generalization, and automated alignment isn't straightforward engineering—it requires substantial research, extensive experimentation, and significant compute for large-scale evaluations. The problems are foundational: they determine whether we can align superintelligent systems at all.
From Leike's perspective, these weren't problems to tackle after AGI arrived or after competitive dynamics stabilized. They needed to be solved before AI systems became too capable for current alignment techniques to handle. Once you have superintelligent systems that exceed human oversight capabilities, it's too late to develop oversight techniques—the systems are already uncontrolled.
This urgency conflicted with OpenAI's product focus. Building scalable oversight techniques didn't improve ChatGPT's user experience today. It didn't accelerate GPT-5's training. It didn't help OpenAI compete with Anthropic or Google in the near term. The research paid off only if AGI arrived and existing alignment techniques proved insufficient—a scenario OpenAI leadership perhaps considered less likely or less imminent than Leike believed.
At Anthropic, Leike's bet was that the technical leadership understood these challenges' urgency. Dario Amodei had co-authored key papers on AI safety before founding Anthropic. The company's Constitutional AI approach demonstrated technical sophistication about alignment challenges. The organizational culture was built around taking long-term risks seriously rather than dismissing them.
Whether Anthropic could maintain this focus as competitive and commercial pressures intensified remained an open question. But for researchers like Leike who believed these technical problems were humanity's most important challenges, it represented the best available environment to make progress.
Industry Implications: The AI Safety Brain Drain
Jan Leike's departure from OpenAI wasn't an isolated incident—it was the most visible manifestation of a broader pattern in AI development. As competitive pressures intensified through 2024, frontier AI labs faced an accelerating brain drain of safety-focused researchers who concluded their work couldn't proceed effectively in commercially driven environments.
The exodus began before Leike's departure. In November 2023, OpenAI's board briefly fired CEO Sam Altman, with safety concerns reportedly playing a role in the decision. Ilya Sutskever participated in that decision before reversing course. The episode exposed deep tensions between OpenAI's commercial acceleration and its safety commitments. While Altman returned after employee pressure and investor intervention, the underlying conflicts remained unresolved.
In the six months following Altman's reinstatement, OpenAI lost significant safety talent. Daniel Kokotajlo, an AI safety researcher, resigned in April 2024. Ilya Sutskever announced his departure on May 14, 2024. Jan Leike followed on May 17. In June 2024, a group of current and former OpenAI employees published an open letter warning of "serious risks to humanity" from AI and criticizing insufficient oversight and "broad confidentiality agreements that prohibit us from voicing our concerns."
The confidentiality issue highlighted a troubling dynamic. OpenAI employees who departed faced potential loss of vested equity—worth millions of dollars for senior researchers—if they criticized the company. This created powerful incentives for silence even among researchers who disagreed with the company's direction. Leike's public criticism was notable precisely because it came despite these financial disincentives.
Similar dynamics played out at other labs, though less publicly. Google DeepMind faced internal tensions between researchers focused on frontier capabilities and those prioritizing safety. Meta's aggressive open-source approach—releasing Llama models publicly—generated concerns from safety researchers who worried about proliferation risks. Across the industry, safety-minded researchers faced a common challenge: how to conduct meaningful safety work when organizational priorities emphasized competitive positioning and product launches.
The result was consolidation of safety talent at a small number of organizations. Anthropic emerged as the primary destination for researchers leaving capability-focused labs. The company's safety positioning and Constitutional AI approach attracted researchers who wanted to work on frontier systems while maintaining safety focus. By late 2024, Anthropic's research team included multiple former OpenAI safety researchers, creating a concentration of expertise that potentially strengthened Anthropic while weakening safety capabilities elsewhere.
This talent migration created several concerning dynamics. First, it reduced safety expertise at the labs building the most capable systems. OpenAI, Google DeepMind, and others lost experienced safety researchers precisely when their models were becoming capable enough to require sophisticated alignment techniques. The disbanded Superalignment team left a gap at OpenAI that wasn't clearly filled by redistributing members across other groups.
Second, it intensified competitive pressures. Anthropic's ability to attract top safety talent strengthened its market position, which in turn increased competitive pressure on OpenAI and others. This could create a race-to-the-bottom dynamic: labs cut safety investment to preserve a capabilities advantage, lose more safety talent as a result, and face still greater competitive pressure in turn.
Third, it raised questions about scalability. Anthropic was competing in the same markets as OpenAI, pursuing similar commercial opportunities, raising billions at comparable valuations. Could it maintain its safety focus as it scaled, or would commercial pressures eventually produce the same resource conflicts that drove Leike from OpenAI? If Anthropic proved unable to maintain its distinctive culture, where would the next generation of safety-focused researchers go?
The regulatory environment complicated these dynamics. In 2024, AI regulation remained nascent and technically unsophisticated. The EU's AI Act focused primarily on deployment risks rather than development practices. US regulatory efforts were fragmented across agencies without clear authority. China's approach emphasized state control over safety standards. No jurisdiction had implemented meaningful requirements for AI labs to invest in safety research or demonstrate alignment capabilities before deploying powerful systems.
Without regulatory pressure, safety investment remained voluntary and subject to competitive dynamics. Labs that invested heavily in safety bore costs without corresponding revenue benefits. Labs that minimized safety investment could move faster and capture market share. This created a perverse selection pressure: the labs winning the AI race were those most willing to skimp on safety.
Some researchers argued this called for government intervention: mandatory safety standards, required testing before deployment, independent audits of frontier labs. Others worried that premature regulation would be counterproductive, potentially locking in current approaches or being captured by incumbents to prevent competition. The debate was complicated by uncertainty about which safety measures actually worked—regulation was difficult when the field hadn't yet determined best practices.
Industry self-regulation remained the primary governance mechanism. In July 2023, OpenAI, Anthropic, Google, and Microsoft co-founded the Frontier Model Forum to advance AI safety research and promote responsible development. But cynics noted that forums and commitments were cheap—they didn't bind companies to specific actions and included no enforcement mechanisms. When competitive pressures intensified, voluntary commitments proved easy to deprioritize.
Public pressure provided some accountability. Leike's public criticism of OpenAI generated negative press coverage and scrutiny of the company's safety practices. The June 2024 open letter from OpenAI employees warning of risks increased reputational pressure. Media coverage of safety researcher departures shaped public perceptions and potentially influenced customer and partner decisions.
But public pressure was uncertain and potentially counterproductive. Most ChatGPT users didn't follow AI safety debates closely. Enterprise customers evaluating Claude versus GPT-4 prioritized capabilities and cost over safety culture. Investor valuations reflected growth metrics and market positioning rather than safety practices. Public pressure worked best on companies that valued their long-term reputation—but competitive dynamics might force short-term prioritization regardless.
The ultimate question was whether the current trajectory was sustainable. Could frontier AI labs continue racing toward AGI while safety research lagged behind? Could safety-focused researchers remain effective at commercial labs under competitive pressure? Or did responsible AI development require fundamentally different organizational structures—perhaps nonprofit labs without commercial pressures, or international collaborations with built-in safety requirements?
Leike's move to Anthropic represented one answer: work at the lab with the most credible safety commitment, even if imperfect. But his departure also highlighted how fragile those commitments could be. If OpenAI—founded explicitly to ensure AGI "benefits all of humanity"—couldn't maintain adequate safety investment under commercial pressure, what did that imply about industry-wide prospects for responsible development?
Conclusion: The Canary in the Coal Mine
Jan Leike's departure from OpenAI in May 2024 will likely be remembered as a watershed moment in AI development—the point when a leading safety researcher declared publicly that the most prominent AI lab had prioritized "shiny products" over the long-term work needed to ensure superintelligent systems remain aligned with human values. His move to Anthropic represented both a personal career decision and a broader statement about where meaningful AI safety research could still proceed.
The story reveals several uncomfortable truths about frontier AI development. First, safety commitments are fragile under competitive pressure. OpenAI's 20% compute allocation to Superalignment—announced with fanfare just one year before Leike's departure—evaporated when product demands intensified. The company that had been founded to ensure AGI benefits humanity couldn't maintain adequate safety investment once commercial incentives pulled the other direction.
Second, the AI safety community is small and concentrated. The departure of a few key researchers—Sutskever, Leike, and others—substantially weakened OpenAI's safety capabilities while strengthening Anthropic's. This talent concentration at a small number of organizations creates systemic fragility. The field needs more safety researchers, better distribution of expertise across labs, and organizational structures that can maintain safety focus despite competitive dynamics.
Third, current alignment techniques don't scale to superintelligence. RLHF works for current systems but will predictably fail as AI capabilities increase. Scalable oversight, weak-to-strong generalization, and automated alignment research are still research problems, not solved engineering challenges. If AGI arrives before these problems are solved—and if companies are racing toward AGI without adequate safety investment—the outcome could be catastrophic.
Leike's work at Anthropic continues the research agenda he began at OpenAI: developing alignment techniques that can handle AI systems much smarter than humans. Whether he'll have the resources and organizational support to make meaningful progress remains to be seen. Anthropic's current safety focus is encouraging, but the company faces the same competitive pressures that compromised OpenAI's commitments. As Anthropic scales, raises more capital, and pursues the same commercial opportunities as its competitors, can it maintain its distinctive culture?
The broader question is whether responsible AGI development is compatible with current industry dynamics. Frontier AI labs are racing to build superintelligent systems while facing intense pressure to ship products, capture market share, and deliver returns to investors. Safety research is expensive, doesn't generate revenue, and often slows down product development. In this environment, safety investment becomes the path of greatest resistance—exactly what Leike meant by "sailing against the wind."
This suggests that voluntary safety commitments won't suffice. Labs that invest heavily in safety bear competitive disadvantages. Labs that minimize safety investment move faster and gain market position. Without external pressure—whether from regulation, customer demand, or investor requirements—the competitive equilibrium favors insufficient safety investment. The tragedy is that everyone might prefer adequate safety investment if it could be coordinated, but unilateral action is penalized.
Leike's story is ultimately about the tension between moving fast and moving carefully when the stakes are civilization-level. He spent his career developing techniques to align AI systems with human values. He contributed to the methods that made ChatGPT possible. He co-led the most ambitious effort any AI lab had launched to solve long-term alignment challenges. And he concluded that the organizational environment at the leading AI lab was incompatible with conducting that work seriously.
The open question is whether Anthropic—or any commercial AI lab—can do better over the long term. Or whether responsible AGI development requires fundamentally different organizational structures that can resist competitive pressures and maintain safety focus regardless of market dynamics. Leike's move to Anthropic is a bet that safety-first AI development is possible within commercial constraints. The outcome of that bet will help determine whether humanity can build superintelligent systems safely, or whether competitive dynamics will drive us toward capabilities we can't control.
In the history of AI, May 2024 may be remembered as the moment when the most experienced safety researchers started abandoning the most capable AI lab—a canary in the coal mine signaling that something fundamental was broken in how the field approached its most important challenge. Whether that signal will be heeded, or whether competitive dynamics will continue pulling resources away from safety work, remains the defining question of the AI era.