Machine Learning: The Native Language of Biology

55 points · us-merul · 25 comments · 6/5/2025, 10:51:52 PM · decodingbiology.substack.com

Comments (25)

dmacfour · 1d ago

"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."

-Leo Breiman, like 24 years ago

Machine learning isn't the native language of biology; the author just realized that there's more than one approach to modeling. I'm a statistician working in an ML role, and most of the issues I run into (from a modeling perspective) are the reverse of what this article describes: people trying to use ML for the precise things inferential statistics and mechanistic models are designed for. Not that the distinction is that clear to begin with.

JHonaker · 5h ago

Agreed wholeheartedly. I have argued with the VP of our department about this paper quite a few times.

I feel like Breiman sets up a strawman that I've never encountered when working with colleagues who were trained in the statistics community. That doesn't mean it didn't exist 25 years ago when he wrote it. I concede that we are sometimes willing to make simplifying assumptions in order to state something particular, but it's almost like we've been culturally conditioned to steep everything we say in every caveat possible.

Whereas I am constantly having to point out the poor feedback we've had about some of the XGBoost models despite the fact that they're clearly the most "predictive" when evaluated naively.

Fomite · 18h ago

This is largely my feeling as well.

Perenti · 22h ago

In the third paragraph the authors state:

"For example, the Lotka-Volterra model accurately captures predator-prey dynamics using systems of differential equations."

This is incorrect. The validation of the L-V predator/prey model was considered to be the population dynamics of the snowshoe hare and Canada lynx as seen in Hudson's Bay Company records. The data actually track fashion cycles in Europe: prices and demand from Europe drove the efforts of the Company and the trappers. This has been in the standard texts since at least the mid 90s, AFAIK.
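For reference, here is a minimal sketch of the model being argued about: the classic two-equation Lotka-Volterra system, integrated with a hand-rolled fourth-order Runge-Kutta step. The function names and the parameter values (`alpha`, `beta`, `delta`, `gamma`, the starting populations) are illustrative assumptions, not values fitted to the hare/lynx records or taken from the article.

```python
import math


def lotka_volterra(prey, pred, alpha, beta, delta, gamma):
    """Right-hand side of the classic Lotka-Volterra equations.

    d(prey)/dt = alpha*prey - beta*prey*pred
    d(pred)/dt = delta*prey*pred - gamma*pred
    """
    d_prey = alpha * prey - beta * prey * pred
    d_pred = delta * prey * pred - gamma * pred
    return d_prey, d_pred


def rk4_step(prey, pred, dt, params):
    """Advance the system by one classical Runge-Kutta (RK4) step."""
    k1x, k1y = lotka_volterra(prey, pred, *params)
    k2x, k2y = lotka_volterra(prey + 0.5 * dt * k1x, pred + 0.5 * dt * k1y, *params)
    k3x, k3y = lotka_volterra(prey + 0.5 * dt * k2x, pred + 0.5 * dt * k2y, *params)
    k4x, k4y = lotka_volterra(prey + dt * k3x, pred + dt * k3y, *params)
    new_prey = prey + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
    new_pred = pred + dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
    return new_prey, new_pred


def simulate(prey0, pred0, params, dt=0.01, steps=2000):
    """Return the trajectory as a list of (prey, pred) pairs, initial state included."""
    trajectory = [(prey0, pred0)]
    prey, pred = prey0, pred0
    for _ in range(steps):
        prey, pred = rk4_step(prey, pred, dt, params)
        trajectory.append((prey, pred))
    return trajectory


def invariant(prey, pred, params):
    """Quantity the exact solution conserves; useful as an integration sanity check."""
    alpha, beta, delta, gamma = params
    return delta * prey - gamma * math.log(prey) + beta * pred - alpha * math.log(pred)


if __name__ == "__main__":
    # Illustrative parameters only (assumed, not fitted to any data set).
    params = (1.0, 0.1, 0.075, 1.5)
    traj = simulate(10.0, 5.0, params)
    # Print a coarse sample of the cycle: time, prey, predators.
    for i in range(0, len(traj), 250):
        print(f"t={i * 0.01:5.2f}  prey={traj[i][0]:8.3f}  pred={traj[i][1]:8.3f}")
```

The model always produces closed cycles around the fixed point (`gamma/delta`, `alpha/beta`), whatever the data look like. That is part of why a visually similar oscillation in trapping records is weak evidence that the mechanism is right.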

bglazer · 1d ago

The problem with this machine-learned “predictive biology” framework is that it doesn’t have any prescription for what to do when your predictions fail. Just collect more data! What kind of data? As the author notes, the configuration space of biology is effectively infinite so it matters a great deal what you measure and how you measure it. If you don’t think about this (or your model can’t help you think about it) you’re unlikely to observe the conditions where your predictions are incorrect. That’s why other modeling approaches care about tedious things like physics and causality. They let you constrain the model to conditions you’ve observed and hypothesize what missing, unobserved factors might be influencing your system.

It’s also a bit arrogant in presuming that no other approaches to modeling cells cared about “prediction”. Of course systems and mathematical biologists care about making accurate predictions; they just also care about other things like understanding molecular interactions *because that lets you make better predictions*.

Not to be cynical, but this seems like an attempt to export benchmark culture from ML into bio. I think that blindly maximizing test set accuracy is likely to lead down a lot of dead-end paths. I say this as someone actively doing ML for bio research.

j7ake · 22h ago

Also, predictions in biology take months or years to validate, so they lack the fast feedback loop of the vision and NLP world, where the feedback is almost instant.

Combine this with the fact that in vivo data in biology is extremely limited, and we see that copying the NLP and vision playbook into biology is challenging.

Fomite · 18h ago

This. Many of the predictions we're talking about are potentially years in the making, involve expensive data collection to validate, suffer from a lot of stochastic noise, etc.

j7ake · 9h ago

Honestly, even if a prediction comes from an experiment, and they know exactly how the experiment was done, it takes months to years to follow up and verify.

Generative AI is basically going to flood the field with more predictions, but with little explanation of how, and doing nothing to alleviate the downstream verification process.

Fomite · 4h ago

And when it's off in its prediction, without an explanation of how, you have no chance to revise your prediction, it's just all the way back to square one.

LeonardoTolstoy · 18h ago

This person seems to work in a field (exercise / athletics) with an abundance of data, low-stakes outcomes, reasonably well established biomarkers, etc. In other words, a field perfectly suited for a top-down, outcome-driven analysis.

IMO the post is merely stating "man, everyone should be doing this!" without realizing that (1) everyone is doing this, and (2) it doesn't seem like it because many (most?) fields in biology don't work in the top-down approach being suggested. Determining mechanism and function is vital in biology because in a lot of cases there just isn't the data to perform a fuzzy, outcome-driven analysis.

piombisallow · 1d ago

That's a lot of words, including a sentence in which the author almost compares himself with Galileo. The proof is in the pudding, no? What did you predict with it?

barbarr · 23h ago

The author claims that "machine learning methods better describe many biological systems than traditional mathematical formulations", but I see very little concrete evidence in the article to support it.

seydor · 21h ago

Biological systems can be described via differential equations, e.g. neural cells can be analyzed with Hodgkin-Huxley type models, and this can lead to bottom-up theories of biological neural networks. ML is used to approximate other, more complex processes, but that doesn't mean that it's impossible.

suddenlybananas · 19h ago

Science isn't about making predictions primarily, it's about explanations.

HappMacDonald · 18h ago

Explanations in turn are tools whose only purpose is to make predictions.

dtj1123 · 16h ago

This is an inaccurate statement. Geocentrism makes identical predictions to heliocentrism, but clearly the two models offer differing explanations of the dynamics of the solar system.

From an engineering perspective, yes, predictions are all that you care about. From a scientific perspective, the end goal is the simplest and most general set of explanations possible.

suddenlybananas · 15h ago

In fact, geocentric models made better predictions than early heliocentric ones because epicycles allowed a better fit to the data.

jltsiren · 18h ago

Explanations are also useful, because people often find them interesting.

Some things are valuable, because they keep us alive and healthy in the short term. Some things are valuable, because we find them interesting, enjoyable, or something like that. And some things are indirectly valuable, because they enable other things that are more directly valuable.

randcraw · 18h ago

IMHO, this article makes grand claims but doesn't substantiate them.

In what way is ML-based biology any different from the myriad statistics-based mechanistic models that systems or computational biology has employed for 50 years to model biological mechanisms and processes? Does the author claim that theory-less parameterless ML models like those in deep NNs are superior because theory-based explicitly parameterized models are doomed to fail? If so, then some specific examples / illustrations would go a long way toward making your case.

mfld · 17h ago

I generally enjoyed the article. Maybe it's because the classical functional categorization/cataloging approaches in molecular biology are rarely sufficient to explain experimental data unless you are an expert and know all the exceptions and special cases. So the Predictive Biology approach seems a promising path, particularly since a lot of data for ML training is available.

That said, the formulation "machine learning is the native language of biology" seems odd.

bigyabai · 1d ago

Look, we're all going to sit around cringing until someone says it; machine learning is explicitly the natural language of computers. In nature, neurons are not arranging themselves into neat unsigned 8-bit integers to quantize themselves for recollection. They're also networked by synapses and reactive biology, not feedforward algorithms scanning static, hereditary weights.

This whole thing feels like the author is familiar with one set of abstractions but not the other. It's very reminiscent of the (intensely fallible) Chomsky logic that leads to insane extrapolations about what biology is or isn't. Machine learning is a model, and all models are wrong.

suddenlybananas · 19h ago

What do you mean by Chomsky logic?

meepmorp · 15h ago

Nah, they mean UG and his theorizing about the in-born language faculties of the human brain.

suddenlybananas · 15h ago

But there's nothing intrinsically fallacious about positing UG, nor crazy extrapolations.

meepmorp · 12h ago

I agree with you, I'm just pointing out what (imo) OP was referring to.
