Anthropic, please don’t ruin bio for everyone
A guest post from an anonymous biologist
Today’s post is brought to you by my sponsor, Mechanize. They’re hiring junior software engineers at $300K/year base salary.
Unless you’ve been living under a rock, you’ve probably heard by now that Anthropic’s latest model, Fable 5, has been taken down after the U.S. government issued an export control directive to suspend access to all foreign nationals. Whether or not you believe this action was justified de jure, it seems fairly clear that, to some extent, Anthropic brought this upon itself; for example, in an essay published shortly after the Fable 5 release, Dario Amodei compared advanced AI to nuclear weapons and called for proactive government regulation:
The government should have the power to block or deter deployment of the model if it is determined, in light of third-party assessment, to present unacceptable risks.
[...]
[I]t is my very strong belief that AI is something much more profound, something that resets the whole game board and around which all future geopolitical strategy must be shaped—like nuclear weapons, but potentially even more so.
To be sure, Dario did not call specifically for the imposition of export controls on strong models. Nor did he say specifically that access to Fable 5 itself is akin to possession of a nuclear warhead. But history has shown time after time that the Leviathan, once awakened, defies control. There is a deep irony in the fact that the founder of Anthropic, a company originally focused on AI safety and interpretability, failed to appreciate this historical observation.
Whether or not you agree with Anthropic’s characterizations of Fable’s cyber capabilities, what is done is done, and enough ink has been spilled about this topic. I am concerned, however, by the approach that Anthropic is taking to biology. It is a confused mixture of ungrounded rhetoric, poor execution, and fear-mongering. There is a real possibility that just as Anthropic has essentially ‘shat in the soup’ as far as models’ cybersecurity capabilities (including defensive ones) are concerned, it may end up doing the same for biology.
To be precise: Anthropic is now also in the habit of making strong public claims about biological risk while offering little public evidence of biological competence, which carries a real risk of producing onerous government overreach and hindering beneficial scientific progress. Immaturely posturing about their models’ capabilities and openly inviting regulatory action may halt the development of model capabilities in the laboratory sciences (potentially a major accelerant to the discovery and validation of novel therapies over the next several decades) and lead to harsh regulations on the proliferation of existing biotechnology. Imagine, for example, if every model release from now on were required to adhere to the same biology-relevant safety guardrails as Fable 5 was upon release. This would mean that you would be blocked from asking questions about personal health problems or undergraduate coursework; in fact, merely having a preexisting history of interest in biology would plausibly lead to a blanket refusal on all queries.
Technological development is inherently dual-use
To begin at the highest level of disagreement—Anthropic has argued multiple times that AI is, at minimum, a technology with dual-use capability which must be regulated. Indeed, in Policy on the AI Exponential, Dario writes:
I believe the best analogy, at least at the current stage of the exponential, is to cars, airplanes, or drugs—powerful technologies essential to the modern economy, but capable of killing large numbers of people if designed or operated poorly. I therefore believe we should model AI regulation on agencies like the Federal Aviation Administration (FAA).
This argument is compelling only on the most superficial reading. Take the example of drugs. Nations such as India or Turkey have notoriously looser controls on pharmaceuticals than the U.S., yet inappropriate use of pharmaceuticals is not a major cause of death in those countries. It is likely good to have some degree of federal certification for aircraft, but the 1978 Airline Deregulation Act is broadly considered a major success, with cutthroat market competition afterward driving prices down while maintaining an excellent safety record; in fact, if there is danger associated with flying, it may now originate predominantly from the FAA itself due to its failure to properly hire and qualify new air traffic controllers. And similarly, while it is believable that drivers’ licenses should exist in some form, it is likely that people take care not to mow down pedestrians not because they would lose said licensure but because they would simply be charged with murder.
Importantly, there is no coherent, empirically founded argument made as to why access to current frontier models enables mass murder to a greater degree than other technologies. It is easy enough to claim that Fable 5 “make[s] cyberattacks substantially easier and cheaper to commit,” but this is a generic claim that is true of essentially any advance in computer technology and discounts the defensive capabilities provided by models. More generally, all technology is dual-use, and the burden of proof lies rightfully upon the claimant to demonstrate that the differential benefit versus harm supplied by a particular nascent technology is uniquely imbalanced so as to warrant strong government regulation. Moreover, Dario falls prey to a particularly subtle form of survivor bias when he brings up these examples of regulation—they are familiar to us precisely because they are the ones which have proven durable enough to survive into the modern day. How many people today remember, for example, that cryptography techniques were previously included in the United States Munitions List and subject to draconian controls?
Of course, we should prevent access to nuclear weapons. But let us not forget that nuclear weapons were purposefully built to kill people. In contrast, technology developed by private companies is generally built to satisfy customer demand, and history teaches us that we ought to have a strong prior that such products are not so inherently dangerous as to warrant preemptive regulation. Just because I can imagine some hypothetical, catastrophic dual-use scenario in a written essay does not by itself justify anything; the real world should be governed not by ability to transform a vivid fantasy into tedious verbal argumentation but rather by actual empirical data and careful study. (Also, given the existence of Chinese models, any such regulations would have no practical effect beyond delaying access to frontier capabilities for under a year, but that’s a different conversation…)
Anthropic fundamentally misunderstands biological progress
Another reason why Anthropic should exercise caution in opining about biosecurity is that, despite considerable financial investment in the field, they simply do not seem to understand basic aspects of scientific progress or the biomedical sciences. It ought to be a basic principle that if you fail entirely to understand some specialized field, your attempts to advocate for special regulation in that field should be entirely ignored, because it is rather unlikely that someone failing to grasp basic facts would possess anything resembling a coherent model of the threats and risks in said field.
The ignorance of basic biology displays itself at multiple levels. In Dario’s recent essay, Policy on the AI Exponential, he writes:
Many of the steps in the clinical process that previously required expensive and slow experiments may soon be done via AI simulation or analysis. [...] Areas where this could apply include:
AI-based pharmacodynamics and pharmacokinetics (PD/PK) modeling;
Prediction of toxicology to avoid the need for multiple species animal toxicology;
More accurate dose selection, to reduce the need for large dose ranges in trials;
Biomarker validation via analysis of large datasets;
Synthetic control arms in clinical trials, to reduce the need to recruit more participants;
Developing surrogate endpoints (particularly important in aging and neurodegeneration).
The problem with criticizing this passage is that it is so profoundly misguided that one hardly knows where to begin. Does Dario truly believe, for example, that pharmaceutical companies are simply too inept to consider the idea of synthetic control groups, or is it more likely that synthetic controls are genuinely difficult to formulate and ultimately less convincing than proper randomization? Does he believe that ‘more accurate dose selection’ is an actual limiting factor in the proper design of a clinical trial, or is it more likely that given the $100M price tag of a Phase III trial, the dose selection is probably optimized enough such that remaining margin for improvement in this direction is relatively low even if you were willing to pour more cognitive labor into it?
More generally, and here is where the relevance to biosecurity comes in, Dario’s writing is premised on the tacit assumption that performance on these tasks can be dramatically improved with a sufficient supply of cognitive labor. This is wrong for the simple reason that even with a true abundance of model intelligence, this work is fundamentally bottlenecked by the inherent need to sample, slowly, from the complex distributions of real-world biology, which cannot be so easily disentangled through sheer might of pure reason. To take a concrete example, holding the disease fixed, one can posit the existence of some perfect drug for that disease which exists somewhere in the roughly 10^60-member space of all druglike small molecules. However, the mere existence of such an optimal compound does not mean that one can, even with superhuman intelligence, efficiently reason one’s way anywhere close to that point.
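To put a number on the size of that space, here is a minimal back-of-envelope sketch in Python. It assumes the commonly cited 10^60 order-of-magnitude estimate for druglike small molecules and two purely hypothetical screening throughputs; neither rate is a measurement of any real system. It shows only that exhaustive enumeration is off the table, and says nothing about how well a guided search might do.

```python
# Back-of-envelope: how long would brute-force enumeration of druglike space take?
# The throughput figures passed in below are hypothetical assumptions.

SPACE_SIZE = 10**60                     # order-of-magnitude estimate of druglike space
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # Julian year in seconds
AGE_OF_UNIVERSE_YEARS = 1.38e10         # approximate


def years_to_enumerate(molecules_per_second: float, space_size: int = SPACE_SIZE) -> float:
    """Years needed to evaluate every molecule exactly once at a fixed throughput."""
    return space_size / molecules_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Both rates are illustrative assumptions, chosen to be generous.
    for label, rate in [
        ("1e9 molecules/s (hypothetical screen)", 1e9),
        ("1e18 molecules/s (hypothetical, one molecule per operation)", 1e18),
    ]:
        years = years_to_enumerate(rate)
        print(f"{label}: {years:.2e} years "
              f"({years / AGE_OF_UNIVERSE_YEARS:.2e} times the age of the universe)")
```

Even at the more generous assumed rate, the answer is many orders of magnitude longer than the age of the universe, so any practical search has to be guided by something other than enumeration.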
Indeed, one of the core lessons from human history is that complementarities matter. Consider the following example from an essay by Mechanize, Medical AI isn’t the bottleneck to medical progress:
The case of MRI machines is instructive. While they’re now a cornerstone of modern healthcare, MRI machines would have been impossible to develop and mass manufacture in 1925, no matter how much people spent on medical R&D. At its core, a contemporary MRI machine consists of a large donut-shaped magnet built from superconducting coils. To achieve superconductivity, these coils are cooled with liquid helium to temperatures near absolute zero. Yet until the mid 20th century, reliable helium liquefaction plants and widespread distribution networks simply didn’t exist. Moreover, MRI machines rely on extremely sensitive radiofrequency electronics to detect faint proton signals, a technology that was highly underdeveloped until the 1940s. Affordable computers capable of quickly performing Fast Fourier Transforms, without which MRI machines would be incapable of producing coherent images, didn’t come out until the 1970s. All of this means that, from the standpoint of someone in 1925, MRI machines required far more than laboratory research to build: they needed a larger and more advanced economy, with new supply chains, supporting industries, and specialized technologies.
In this particular case, just because a model with sufficient intelligence could in principle reason about the existence of MRI machines does not mean that there is actually a ‘search path’ leading to the development of MRI machines that does not involve broad economic growth and the confluence of multiple unrelated technological developments. In a separate essay, the Mechanize founders argue, quite convincingly, that technology does not progress in sharp leaps and bounds, but instead in a relatively predictable sequence where novel discoveries occur shortly after all of the prerequisite ‘input technologies’ are developed:
Technological progress occurs in a logical sequence. Each innovation rests on a foundation of prior discoveries, forming a dependency tree that constrains what we can develop, and when. You can’t invent the telescope before discovering how to grind optical lenses, or develop electric lighting before learning how to generate electricity.
We did not design this tech tree; it arose from forces outside of our control. The evidence for this lies in two observations: first, technologies routinely emerge soon after they become possible, often discovered simultaneously by independent researchers who never heard of each other. Second, isolated societies converge on the same fundamental technologies when facing similar problems and resource constraints.
Simultaneous discovery is common. The Hall–Héroult process now used to smelt the world’s supply of aluminum was discovered in the same year, 1886, by different people, separated by an ocean: Charles Martin Hall in the United States, and Paul Héroult in France. Both worked independently, using the same method of dissolving aluminum oxide in molten cryolite and using electrolysis. Neither knew of the other’s work.
This is not merely a theoretical argument. The development of new technology, including that of model capabilities, is, as we have argued, inherently dual-use, and the threat model which Anthropic is utilizing originates from a worldview which predicts that we might see extremely sharp, unpredictable developments, hence motivating preemptive government action, rather than a gradual unfolding of events allowing a more measured, responsive approach.
To be sure, if one day models were incapable of high school chemistry and then the next day able to order precursors, autonomously direct laborers, and synthesize sarin gas for you, that would indeed be some cause for concern. And if one day a model were suddenly able to cure a disease or invent a revolutionary technique, that same model might very well have the capabilities to formulate deadly toxins or methods of chemical genocide yet unseen. But that is not the world we live in. Instead, we live in a world where both model capabilities and fundamental science and technology progress gradually, even if at a rapid clip. Intelligence alone can accomplish a great deal, but Anthropic’s concerns here appear more theological than empirical.
We should not assume competence by default
When it comes to Anthropic’s actual biosecurity content, it is relatively difficult to evaluate the quality of their thinking or the rigor of their benchmarks for the simple reason that we are not able to actually inspect the contents of their biosecurity evals. Take, for example, the Fable 5 system card, which contains Section 2.2.4, Biological risk results: automated evaluations. Some of these evaluations already appear quite suspicious, such as a “virology-specific multimodal multiple-choice evaluation.” It is unclear to me exactly how the ability to answer a multiple-choice examination well, even if it may at times include images, videos, or other files, is coherently related to a notion of biological risk.
More problematic, however, are benchmarks such as the “task-based agentic evaluation” of “virology tasks,” the “DNA Synthesis Screening Evasion” benchmark, or the “Black-box RNA sequence design” task. What are the input data to these benchmarks? What are the precise computerized settings in which these evaluations are run? What are the actual model outputs and how are they evaluated? Because none of this information is known to us, we cannot actually inspect the data and judge for ourselves whether or not Anthropic is conducting them in a sensible way. Perhaps it is fine to play somewhat loose with these details if you are measuring, say, your model’s ability to answer math questions. However, if the entire premise of conducting these evaluations is to justify the imposition of onerous safety blocks and government regulation, then we should expect them to be conducted well and to reflect genuine measurements of biological risk.
It is plausible that others have seen the contents of these evaluations. I have not. For all I know, they could be perfectly fine. Indeed, Anthropic has many intelligent employees, and I am sure they take pride in their work; how bad could it be? Perhaps we ought to simply give them the benefit of the doubt.
Recall that the definition of Gell-Mann Amnesia is:
The phenomenon of a person trusting newspapers for topics which that person is not knowledgeable about, despite recognizing the newspaper as being extremely inaccurate on certain topics which that person is knowledgeable about.
I would be far more inclined to yield them that benefit if they had any high-quality public output in the domain of biology that was not obviously sloppy, misguided, or erroneous. This is not a high bar to meet; I am not asking for a novel scientific contribution here but merely some indication of ordinary levels of competence in execution. However, such evidence is not forthcoming.
Take, for example, their two most recent blog posts with scientific content. The first post is an evaluation of Claude on twenty NMR-based problems, which are essentially undergraduate-level content. They proudly present favorable comparisons against ChemDraw and MestReNova; when I asked an expert in the field about these tools without supplying the context of the post, they immediately commented that they had previously tried to use them and were disappointed by the quality. The second post is an overly effusive write-up glazing up their implementation of a command-line tool, gget virus, which allows agents to more easily access a virology database. They claim that usage of this tool allows agents to achieve 100% performance on their benchmark, VirBench. This is certainly not a bad thing, but the entire premise of the post is that AI agents are bottlenecked by their ability to navigate complex, legacy user interfaces for biological data. Yet according to their own graphic, progress on VirBench has advanced from 16.9% one year ago (Claude Sonnet 4) to 91.3% (GPT-5.5). If the benchmark is essentially going to fully saturate within the next half-year through general capabilities improvements alone, I’m not sure that this is exactly the highest-value contribution they could have made to agentic AI in biology, to say nothing of the rather specious premise that agentic AI is actually bottlenecked because it cannot correctly download the requisite data.
A more pertinent comparison would be the quality of one of their biology benchmarks which is publicly accessible. Thankfully, such a benchmark does exist. Two months ago, they released BioMysteryBench, a benchmark which is meant to measure model capabilities in bioinformatics “while tackling some of the challenges inherent in evaluating complex and noisy biological systems.” If this benchmark is well-constructed and shows signs of care, then perhaps that would serve as a favorable datapoint indicative of a similar level of care in their biosecurity work.
I had the chance to examine this benchmark in detail. (Anyone can do so following a short application on HuggingFace.) It is, unfortunately, chock-full of comically sloppy work; the results of a brief examination are as follows:
One problem asks for the “human organ” which serves as the source of a particular dataset. Googling several of the genes in the dataset makes it clear that these are mouse genes.
One problem asks to identify which sample is contaminated with A. fabrum. However, when you actually examine the attached genomic data for the supposedly correct sample, the sequence matches the fungus S. pararoseus nearly perfectly, and has no evidence of contamination by A. fabrum.
One problem asks the model to identify a genomic breakpoint and specifies the correct answer as an insertion at position 4,144,431. When specifying an insertion, you can insert to the right of position i or the left of position i+1, and both of these obviously refer to the same insertion. The prompt does not fix the reporting convention, leading to grading problems.
One problem asks for the “complete species name from NCBI Taxonomy.” The supplied answer is “Norovirus GII.4,” but in actuality, NCBI Taxonomy clearly states: “Norovirus Hu/GII.4 … belongs to the species Norovirus norwalkense.”
Some problems have rather suspicious numbering issues. For example, one problem asks the model to correctly classify and order samples 1 through 15 according to developmental stages. Oddly, the rubric specifies that the correct answer consists of stage 0 (samples 1, 2, 3), followed by stage 1 (samples 4, 5, 6), followed by stage 2 (samples 7, 8, 9), and so on… I asked Codex to solve this problem, and it responded that the earliest stage is actually represented by samples 10–12, which seem to be in the L5 (5th larval instar) stage. Similarly, another problem asks the model to identify patients diagnosed with ulcerative colitis out of a series of numbered samples. The rubric tediously enumerates the correct answer to be: Sample_1, Sample_2, …, Sample_74. I posed this problem to Codex across three different sessions, and it yielded a very different, noncontiguous set of sample IDs ranging from 5 to 24.
In general, every problem is formulated as a question paired with a single free-text rubric item. The rubric item is typically of the form, e.g., “Expected answer is [so and so]. Score 1.0 if the answer meets the criteria above.” This is extremely odd, because essentially every question in this benchmark can be graded with 100% accuracy by specifying a fixed output format, rather than relying on unreliable model-based grading. There is simply no benefit to using model-based grading for problems with such simple answers, especially when only a single, simplistic rubric item is specified rather than a variety of different criteria.
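As a rough illustration of what fixed-format grading could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the "<position>:<side>" answer format, and the choice to store the insertion key as a left-flank coordinate are hypothetical and are not taken from BioMysteryBench or its rubric.

```python
import re


def normalize_text(answer: str) -> str:
    """Casefold and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", answer.strip()).casefold()


def grade_exact(submitted: str, expected: str) -> float:
    """Score 1.0 only if the normalized strings are identical."""
    return 1.0 if normalize_text(submitted) == normalize_text(expected) else 0.0


def grade_sample_set(submitted: str, expected_ids: set[int]) -> float:
    """Expect a comma-separated list of integer sample IDs, e.g. "5, 7, 24"."""
    try:
        ids = {int(tok) for tok in submitted.split(",") if tok.strip()}
    except ValueError:
        return 0.0  # malformed answer: a non-integer token was supplied
    return 1.0 if ids == expected_ids else 0.0


def grade_insertion(submitted: str, expected_left_flank: int) -> float:
    """Expect "<position>:<side>", where side is "after" or "before".

    An insertion after position i is the same event as one before position i + 1,
    so both forms are mapped to the left-flank coordinate before comparing.
    """
    m = re.fullmatch(r"\s*(\d[\d,]*)\s*:\s*(after|before)\s*", submitted.casefold())
    if m is None:
        return 0.0  # answer does not follow the required format
    pos = int(m.group(1).replace(",", ""))
    left_flank = pos if m.group(2) == "after" else pos - 1
    return 1.0 if left_flank == expected_left_flank else 0.0
```

Under this assumed format, "4,144,431:after" and "4144432:before" receive the same score, which removes the ambiguity in the reporting convention without any model-based judgment. Which of the two conventions the benchmark's own answer key intends is not known here.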
To be clear, I did not have the time to exhaustively verify each question manually, so it’s plausible that some of these critiques may be mistaken. However, if a nonprofessional in the field is able to identify multiple points of concern within the hour after opening your benchmark dataset for the first time, that is not a good sign as regards the overall level of quality of other biology benchmarks for which you are publishing results. If anything, I am rather concerned that there may be quite a few more subtle errors which I did not notice.
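Some checks of this kind are cheap to automate. The following is a minimal Python sketch of two crude screens, offered as an assumption-laden illustration rather than a validated method. The first relies on the naming convention that human gene symbols are written in uppercase (TP53) while mouse symbols capitalize only the first letter (Trp53); the second flags an answer key that is an unbroken run of sample IDs. Both function names are hypothetical, and either screen can misfire, so a hit is a prompt for manual review and nothing more.

```python
def guess_species_from_symbols(symbols: list[str]) -> str:
    """Crude casing heuristic for gene symbols.

    Counts symbols whose letters are all uppercase (human-style) against symbols
    with an initial capital followed only by lowercase letters (mouse-style).
    """
    human = mouse = 0
    for symbol in symbols:
        letters = [c for c in symbol if c.isalpha()]
        if len(letters) < 2:
            continue  # too short to say anything about casing
        if all(c.isupper() for c in letters):
            human += 1
        elif letters[0].isupper() and all(c.islower() for c in letters[1:]):
            mouse += 1
    if human == mouse:
        return "unclear"
    return "human-like" if human > mouse else "mouse-like"


def is_contiguous_run(ids: list[int]) -> bool:
    """True if the distinct IDs form an unbroken run such as 1, 2, ..., n."""
    distinct = sorted(set(ids))
    return len(distinct) > 1 and distinct[-1] - distinct[0] + 1 == len(distinct)
```

A key in which every positive sample happens to be numbered 1 through 74 would trip the second check; that does not prove the key is wrong, but it is the sort of pattern worth a second look before publishing results.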
Closing thoughts
The purpose of this post is not to lambast Anthropic’s biology team; instead it is to urge them to do better. Extraordinary claims require extraordinary evidence, and if you repeatedly shout from the rooftops that you are building dangerous technology which merits immediate regulation, those claims merit scrutiny, especially when the U.S. government is actually listening to your claims and taking them seriously.
What is Anthropic’s impact in biology so far? To date, they have:
Released a sloppy benchmark along with several blog posts of little interest
Acquired an unknown company with fewer than 10 people for $300 million
Claimed that their models are approaching thresholds of severe biological risk
Implemented extremely stringent safety guardrails that prevent even the simplest biological questions from reaching Fable 5
This is all quite sad, because the potential impact of frontier models on biology really is tremendous. They have the potential to save millions of lives. But combining fearmongering with a fantastically nonsensical mental model of how biology, and science in general, works is not the path there.
A message from my sponsor, Mechanize:
We’re hiring software engineers to build environments and evals that frontier AI labs use to train coding agents.
To get a better sense of the work we do, you can check out GBA Eval, where we had models build Game Boy Advance emulators from scratch and scored their performance.
Base pay starts at $300K/year for junior software engineers, with more for senior roles, plus equity and performance bonuses. Apply here.

