Machine Learning: The Native Language of Biology
55 us-merul 25 6/5/2025, 10:51:52 PM decodingbiology.substack.com ↗
-Leo Breiman, like 24 years ago
Machine learning isn't the native language of biology, the author just realized that there's more than one approach to modeling. I'm a statistician working in an ML role and most of the issues I run into (from a modeling perspective) are the reverse of what this article describes - people trying to use ML for the precise things inferential statistics and mechanistic models are designed for. Not that the distinction is that clear to begin with.
I feel like Breiman sets up a strawman that I've never encountered when working with colleagues who were trained in the statistics community. That doesn't mean it didn't exist 25 years ago when he wrote it. I concede that we are sometimes willing to make simplifying assumptions in order to state something particular, but it's almost like we've been culturally conditioned to steep everything we say in every caveat possible.
Whereas I am constantly having to point out the poor feedback we've had about some of the XGBoost models despite the fact that they're clearly the most "predictive" when evaluated naively.
"For example, the Lotka-Volterra model accurately captures predator-prey dynamics using systems of differential equations."
This is incorrect. The validation of the L-V predator/prey model was considered to be the population dynamics of the snowshoe hare and Canada lynx as seen in Hudson's Bay Company records. The data actually reflect fashion cycles in Europe: prices and demand from Europe drove the efforts of the Company and the trappers. This has been in the standard texts since at least the mid-90s, AFAIK.
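For anyone who hasn't seen it written out, the model being argued about is the pair of equations dx/dt = αx − βxy (prey) and dy/dt = δxy − γy (predators). Below is a minimal sketch that integrates it with a hand-rolled RK4 step. The parameter values and starting populations are illustrative assumptions, not fitted to the Hudson's Bay Company records or any other dataset, and the function names are made up for this example.

```python
import math

def lotka_volterra(x, y, alpha, beta, delta, gamma):
    """Right-hand side of the classic model: x is prey, y is predators."""
    dx = alpha * x - beta * x * y      # prey grow, and are eaten
    dy = delta * x * y - gamma * y     # predators grow by eating, and die off
    return dx, dy

def rk4_step(x, y, dt, params):
    """Advance (x, y) by one fourth-order Runge-Kutta step of size dt."""
    k1x, k1y = lotka_volterra(x, y, *params)
    k2x, k2y = lotka_volterra(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y, *params)
    k3x, k3y = lotka_volterra(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y, *params)
    k4x, k4y = lotka_volterra(x + dt * k3x, y + dt * k3y, *params)
    x_new = x + dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x)
    y_new = y + dt / 6.0 * (k1y + 2 * k2y + 2 * k3y + k4y)
    return x_new, y_new

def simulate(x0, y0, params, dt=0.01, steps=5000):
    """Return the list of (prey, predator) states, including the initial one."""
    trajectory = [(x0, y0)]
    x, y = x0, y0
    for _ in range(steps):
        x, y = rk4_step(x, y, dt, params)
        trajectory.append((x, y))
    return trajectory

def conserved_quantity(x, y, alpha, beta, delta, gamma):
    """Constant along exact solutions; useful as a sanity check on the integrator."""
    return delta * x - gamma * math.log(x) + beta * y - alpha * math.log(y)

# Illustrative values only (alpha, beta, delta, gamma); not estimated from data.
PARAMS = (1.0, 0.1, 0.075, 1.5)
trajectory = simulate(10.0, 5.0, PARAMS)
```

The solutions are closed cycles around the equilibrium (γ/δ, α/β), which is exactly why a cyclic-looking time series is weak evidence on its own: anything periodic, including demand-driven trapping effort, can be made to look like a fit.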
It’s also a bit arrogant in presuming that no other approaches to modeling cells cared about “prediction”. Of course, systems and mathematical biologists care about making accurate predictions, they just also care about other things like understanding molecular interactions *because that lets you make better predictions*.
Not to be cynical, but this seems like an attempt to export benchmark culture from ML into bio. I think that blindly maximizing test set accuracy is likely to lead down a lot of dead-end paths. I say this as someone actively doing ML for bio research.
Combine this with the fact that in vivo data in biology is extremely limited, and we see that copying the NLP and vision playbook into biology is challenging.
Generative AI is basically going to flood the field with more predictions, but with little explanation of how, and doing nothing to alleviate the downstream verification process.
IMO the post is merely stating "man, everyone should be doing this!" without realizing that (1) everyone is doing this, and (2) it doesn't seem like it because many (most?) fields in biology don't work in the top-down approach being suggested. Determining mechanism and function is vital in biology because in a lot of cases there just isn't the data to perform a fuzzy, outcome-driven analysis.
From an engineering perspective, yes, predictions are all that you care about. From a scientific perspective, the end goal is the simplest and most general set of explanations possible.
Some things are valuable because they keep us alive and healthy in the short term. Some things are valuable because we find them interesting, enjoyable, or something like that. And some things are indirectly valuable because they enable other things that are more directly valuable.
In what way is ML-based biology any different from the myriad statistics-based mechanistic models that systems or computational biology has employed for 50 years to model biological mechanisms and processes? Does the author claim that theory-less parameterless ML models like those in deep NNs are superior because theory-based explicitly parameterized models are doomed to fail? If so, then some specific examples / illustrations would go a long way toward making your case.
That said, the formulation "machine learning is the native language of biology" seems odd.
This whole thing feels like the author is familiar with one set of abstractions but not the other. It's very reminiscent of the (intensely fallible) Chomsky logic that leads to insane extrapolations about what biology is or isn't. Machine learning is a model, and all models are wrong.