10 Comments
Paull Young

This is a superb read. I appreciate the ‘scientist’ vs ‘simulator’ breakdown.

One Q: where do you feel we stand when it comes to open, accessible data underpinning AI approaches? In the piece you say “The accessibility of data in other domains is a largely solved problem” (outside of biology), but later you say “The big labs are focused on intelligence—reasoning, long context, tool use. Domain-specific simulation and data collection are massive undertakings that lie outside their core competencies and business models.”

Melissa

thanks for your thoughtful comment!

to clarify, in that first reference, i was referring solely to accessibility of data in *science* domains, excluding biology (like weather modeling and materials science). i'd say the field of ai as a whole is still significantly data-bottlenecked, across domains both within language (industry-specific knowledge, high-quality chain-of-thought) and outside of language (world models, video models)

the big labs have historically passed off large-scale data collection to other players, like scale and mercor. it's unclear who the equivalent players will be for scientific data, especially in biology, where the existing data is fragmented across institutions, generated with inconsistent protocols, often proprietary, etc.

in general, as ai has progressed, we've seen a shift from using public data to relying more on private data collection and curation efforts. the first llms were trained on commoncrawl, wikipedia, etc, and alphafold on the pdb (a public protein crystallography dataset). nowadays, all the big labs have massive efforts to collect proprietary data; isomorphic, for instance, generates its data through private partnerships with pharma. in fact, private data collection efforts can often become a company's moat (e.g. noetik). i'm optimistic that the scientific community will promote open datasets more than its corporate counterparts, but i think the incentive structures favoring private data are quite strong

Shashwat Goel

Training AI to help with science when simulators aren't available was exactly the motivation of https://arxiv.org/abs/2512.23707 :)

Gary Welz

Fantastic. I’ve just written about AI-Powered Knowledge Engines that help scientists use a range of AI tools in research. https://zenodo.org/records/18463304

Well stated!

Jacky Li

Given the ungodly amount of capex on AI right now, domain-specific models and tooling definitely deserve more attention. I guess one explanation could be that the labs believe building new domain-specific AI will be trivial once we have LLMs that rival the best researchers and terawatts of compute.

Melissa

very sympathetic to this argument and tried to address it somewhat in the article:

> GPT-7 may very well have the cognitive capabilities to design digital twins that simulate human biology, but it will have been enabled by many other players already advancing the algorithms behind effective simulators and building automated data infrastructure. In the same way that ML for world models, voice, and image generation were pushed forward by ElevenLabs, Midjourney, and WorldLabs, among others, we should expect ML for science to be pushed forward by a plurality of efforts.

"trivial" here still requires training data that takes years to generate, physical infrastructure that doesn't exist yet, etc (not necessarily just reasoning and terawatts of compute). domain-specific AI certainly won't be hard forever, but it has long lead times that don't compress with better reasoning, so treating it as a downstream consequence of agi adds years to the timeline.

Jasper

Thanks for sharing! How do you feel about Anthropic's recent collaborations with many academic institutes, and OpenAI's collaborations with various biotechs? While you mentioned big labs are more focused on intelligence itself, these recent trends seem to indicate their interest in the messy sciences as well?

Melissa

as far as i can tell, the efforts of the big labs have largely been oriented around incorporating *language models* into scientific workflows

it's incredibly important work! but a central piece of the argument here is that ai for science also requires training domain-specific foundation models, which the big labs aren't uniquely positioned for. the best protein folding models, weather prediction models, materials simulation models, etc all exist outside of openai and anthropic (though, as mentioned, many of them emerge from domain-targeted teams within deepmind). even if we consider the landscape outside of science, the races to train the best voice models, world models, etc are still ongoing (think elevenlabs, worldlabs).

that said, if the big labs become inspired to train protein folding models or dna language models or climate models, that's a win for the field

Timothy Kostolansky

> Through scale alone, the models have grown from naive stochastic parrots into entities we credit with agency and emotional depth.

this is not true! there was much more than mere scaling

Melissa

very true! "scale alone" is an oversimplification. data curation, post-training algorithmic improvements (ie rlhf), inference-time compute, tooling / scaffolding were all critical as well. was mostly trying to gesture at the broader arc of how quickly our perception of these systems shifted