I tried really hard to use topology as a way to understand neural networks, for example in these follow ups:
- https://colah.github.io/posts/2014-10-Visualizing-MNIST/
- https://colah.github.io/posts/2015-01-Visualizing-Representa...
There are places I've found the topological perspective useful, but after a decade of grappling with trying to understand what goes on inside neural networks, I just haven't gotten that much traction out of it.
I've had a lot more success with:
* The linear representation hypothesis - The idea that "concepts" (features) correspond to directions in neural networks.
* The idea of circuits - networks of such connected concepts.
Some selected related writing:
- https://distill.pub/2020/circuits/zoom-in/
- https://transformer-circuits.pub/2022/mech-interp-essay/inde...
- https://transformer-circuits.pub/2025/attribution-graphs/bio...
Related to ways of understanding neural networks, I've seen these views expressed a lot, which to me seem like misconceptions:
- LLMs are basically just slightly better `n-gram` models
- The idea of "just" predicting the next token, as if next-token-prediction implies a model must be dumb
(I wonder if this [1] popular answer to Karpathy's RNN [2] post is partly to blame for people equating language neural nets with n-gram models. The stochastic parrot paper [3] also somewhat equates LLMs and n-gram models, e.g. "although she primarily had n-gram models in mind, the conclusions remain apt and relevant". I guess there was a time when they were more equivalent, before the nets got really really good)
[1] https://nbviewer.org/gist/yoavg/d76121dfde2618422139
[2] https://karpathy.github.io/2015/05/21/rnn-effectiveness/
[3] https://dl.acm.org/doi/pdf/10.1145/3442188.3445922
hey chris, I found your posts quite inspiring back then, with very poetic ideas. cool to see you follow up here!
esafak · 3h ago
If it was topology we wouldn't bother to warp the manifold so we can do similarity search. No, it's geometry, with a metric. Just as in real life, we want to be able to compare things.
Topological transformation of the manifold happens during training too. That makes me wonder: how does the topology evolve during training? I imagine it violently changing at first before stabilizing, followed by geometric refinement. Here are some relevant papers:
* Topology and geometry of data manifold in deep learning (https://arxiv.org/abs/2204.08624)
* Topology of Deep Neural Networks (https://jmlr.org/papers/v21/20-345.html)
* Persistent Topological Features in Large Language Models (https://arxiv.org/abs/2410.11042)
* Deep learning as Ricci flow (https://www.nature.com/articles/s41598-024-74045-9)
> Topological transformation of the manifold happens during training too. That makes me wonder: how does the topology evolve during training?
If you've ever played with GANs or VAEs, you can actually answer this question! And the answer is more or less 'yes'. You can look at GANs at various checkpoints during training and see how different points in the high dimensional space move around (using tools like UMAP / TSNE).
> I imagine it violently changing at first before stabilizing, followed by geometric refinement
Also correct, though the violent changing at the beginning is also influenced by the learning rate and the choice of optimizer.
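A rough sketch of the kind of checkpoint visualization described above; `load_generator` and the checkpoint paths are placeholders for whatever framework you actually use, and t-SNE here stands in for UMAP.

```python
# Minimal sketch (not from the thread): project the same batch of generated
# samples from several training checkpoints into 2D with t-SNE and watch how
# the cloud of points rearranges over training.
import numpy as np
from sklearn.manifold import TSNE

def embed_checkpoints(checkpoint_paths, load_generator, n_samples=500, latent_dim=128):
    rng = np.random.default_rng(0)
    z = rng.standard_normal((n_samples, latent_dim))    # fixed latents, reused for every checkpoint
    clouds = []
    for path in checkpoint_paths:
        g = load_generator(path)                         # placeholder: returns a callable latent -> sample
        x = np.asarray([g(zi) for zi in z]).reshape(n_samples, -1)
        clouds.append(x)
    # embed all checkpoints together so the 2D coordinates are comparable across training time
    stacked = np.concatenate(clouds, axis=0)
    xy = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(stacked)
    return xy.reshape(len(checkpoint_paths), n_samples, 2)
```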
esafak · 1h ago
And crucially, the initialization algorithm.
profchemai · 2h ago
Agree, if anything it's Applied Linear Algebra...but that sounds less exotic.
lostmsu · 1h ago
Well, we know it is non-linear. More like differential equations.
srean · 2h ago
The title, as it stands, is trite and wrong. More about that a little later. The article on the other hand is a pleasant read.
Topology is whatever little structure remains in geometry after you throw away distances, angles, and orientations, and allow all sorts of non-tearing stretchings. It's the bare minimum that still remains valid after such violent deformations.
While the notion of topology is definitely useful in machine learning, scale, distance, angles, etc. usually provide lots of essential information about the data.
If you want to distinguish between a tabby cat and a tiger it would be an act of stupidity to ignore scale.
Topology is especially useful when you cannot trust lengths, distances, or angles, or when the data has been subjected to arbitrary deformations. That happens, but to claim deep learning is applied topology is absurd, almost stupid.
theahura · 1h ago
> Topology is useful especially when you cannot trust lengths, distances angles and arbitrary deformations
But...you can't. The input data lives on a manifold that you cannot 'trust'. It doesn't mean anything a priori that an image of a Coca-Cola can and an image of a stop sign live close to each other in pixel space. The neural network applies all of those violent transformations you are talking about.
srean · 57m ago
> But...you can't.
Only in a desperate sales pitch or a desperate research grant. There are of course some situations where certain measurements are untrustworthy, but to claim that to be the general case is very snake-oily.
When certain measurements become untrustworthy, it is rarely because of some smooth transformation alone (which is the case purely topological methods deal with). Random noise will also do that for you.
Not disputing the fact that sometimes metrics cannot be trusted entirely, but to go to a topological approach seems extreme. One should use as much of the information-bearing measurements as possible.
BTW I am completely on-board with the idea that data often looks as if it has been sampled from an unknown, potentially smooth, possibly non-Euclidean manifold and then corrupted by noise.
In such cases recovering that manifold from noisy data is a very worthy cause.
In fact that is what most of your blogpost is about. But that's differential geometry and manifolds; they have structure far richer than a topology. For example they may have tangent planes, a Riemann metric or a symplectic form etc. A topological method would throw all of that away and focus on topology.
kentuckyrobby · 40m ago
I don't think that was their point. I think their point was that neural networks 'create' their optimization space by using lengths, distances, and angles. You can't reframe it from a topological standpoint, otherwise optimization spaces of some similar neural networks on similar problems would be topologically comparable, which is not true.
throwawaymaths · 21m ago
Once you get into the nitty-gritty, a lot of things that wouldn't matter if it were pure topology do matter, from the number of layers all the way down to quantization/floating-point resolution.
soulofmischief · 2h ago
Thanks for sharing. I also tend to view learning in terms of manifolds. It's a powerful representation.
> I'm personally pretty convinced that, in a high enough dimensional space, this is indistinguishable from reasoning
I actually have journaled extensively about this and even written some on Hacker News about it with respect to what I've been calling probabilistic reasoning manifolds:
> This manifold is constructed via learning a decontextualized pattern space on a given set of inputs. Given the inherent probabilistic nature of sampling, true reasoning is expressed in terms of probabilities, not axioms. It may be possible to discover axioms by locating fixed points or attractors on the manifold, but ultimately you're looking at a probabilistic manifold constructed from your input set.
> But I don't think you can untie this "reasoning" from your input data. It's possible you will find "meta-reasoning", or similar structures found in any sufficiently advanced reasoning manifold, but these highly decontextualized structures might be entirely useless without proper recontextualization, necessitating that a reasoning manifold is trained on input whose patterns follow learnable underlying rules, if the manifold is to be useful for processing input of that kind.
> Decontextualization is learning, decomposing aspects of an input into context-agnostic relationships. But recontextualization is the other half of that, knowing how to take highly abstract, sometimes inexpressible, context-agnostic relationships and transform them into useful analysis in novel domains
Full comment: https://news.ycombinator.com/item?id=42871894
Are you talking about reasoning in general, reasoning qua that mental process which operates on (representations of) propositions?
In which case, I cannot understand " true reasoning is expressed in terms of probabilities, not axioms "
One of the features of reasoning is that it does not operate in this way. It's highly implausible animals would have been endowed with no ability to operate non-probabilistically on propositions represented by them, since this is essential for correct reasoning -- and a relatively trivial capability to provide.
Eg., "if the spider is in boxA, then it is not everywhere else" and so on
soulofmischief · 1h ago
Propositions are just predictions; they all come with some level of uncertainty even if we ignore that uncertainty for practical purposes.
Any validation of a theory is inherently statistical, as you must sample your environment with some level of precision across spacetime, and that level of precision correlates to the known accuracy of hypotheses. In other words, we can create axiomatic systems of logic, but ultimately any attempt to compare them to reality involves empirical sampling.
Unlike classical physics, our current understanding of quantum physics essentially allows for anything to be "possible" at large enough spacetime scales, even if it is never actually "probable". For example, quantum tunneling, where a quantum system might suddenly overcome an energy barrier despite lacking the required energy.
Every day when I walk outside my door and step onto the ground, I am operating on a belief that gravity will work the same way every time, that I won't suddenly pass through the Earth's crust or float into the sky. We often take such things for granted, as axiomatic, but ultimately all of our reasoning is based on statistical correlations. There is the ever-minute possibility that gravity suddenly stops working as expected.
> if the spider is in boxA, then it is not everywhere else
We can't even physically prove that. There's always some level of uncertainty which introduces probability into your reasoning. It's just convenient for us to say, "it's exceedingly unlikely in the entire age of the universe that a macroscopic spider will tunnel from Box A to Box B", and apply non-probabilistic heuristics.
It doesn't remove the probability, we just don't bother to consider it when making decisions because the energy required for accounting for such improbabilities outweighs the energy saved by not accounting for them.
As mentioned in my comment, there's also the possibility that universal axioms may be recoverable as fixed points in a reasoning manifold, or in some other transformation. If you view these probabilities as attractors on some surface, fixed points may represent "axioms" that are true or false under any contextual transformation.
jvanderbot · 2h ago
I suspect, as a layperson who watches people make decisions all the time, that somewhere in our mind is a "certainty checker".
We don't do logic itself, we just create logic from certainty as part of verbal reasoning. It's our messy internal inference of likelihoods that causes us to pause and think, or dash forward with confidence, and convincing others is the only place we need things like "theorems".
This is the only way I can square things like intuition, writing to formalize thoughts, verbal argument, etc, with the fact that people are just so mushy all the time.
ComplexSystems · 2h ago
I really liked this article, though I don't know why the author is calling the idea of finding a separating surface between two classes of points "topology." For instance, they write
"If you are trying to learn a translation task — say, English to Spanish, or Images to Text — your model will learn a topology where bread is close to pan, or where that picture of a cat is close to the word cat."
This is everything that topology is not about: a notion of points being "close" or "far." If we have some topological space in which two points are "close," we can stretch the space so as to get the same topological space, but with the two points now "far". That's the whole point of the joke that the coffee cup and the donut are the same thing.
Instead, the entire thing seems to be a real-world application of something like algebraic geometry. We want to look for something like an algebraic variety the points are near. It's all about geometry and all about metrics between points. That's what it seems like to me, anyway.
srean · 2h ago
> This is everything that topology is not about
100 percent true.
I can only hope that in an article that is about two things, i) topology and ii) deep learning, the evident confusions are contained within one of them -- topology, only.
theahura · 1h ago
fair, I was using 'topology' more colloquially in that sentence. Should have said 'surface'.
umutisik · 3h ago
Data doesn't actually live on a manifold. It's an approximation used for thinking about data. The near-total majority, if not 100%, of the useful things done in deep learning have come from not thinking about topology in any way. Deep learning is not applied anything; it's an empirical field advanced mostly by trial and error and, sure, a few intuitions coming from theory (that was not topology).
sota_pop · 2h ago
I disagree with this wholeheartedly. Sure, there is lots of trial and error, but it's more an amalgamation of theory from many areas of mathematics including but not limited to: topology, geometry, game theory, calculus, and statistics. The very foundation (i.e. back-propagation) is just the chain rule applied to the weights. The difference is that deep learning has become such an accessible (read: profitable) field that many practitioners have the luxury of learning the subject without having to learn the origins of the formalisms. Ultimately this allows them to utilize or "reinvent" theories and techniques, often without knowing they have been around in other fields for much longer.
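A tiny self-contained illustration of the "back-propagation is just the chain rule" point: a single sigmoid neuron with a squared loss, with the analytic gradient checked against a finite difference. The names are illustrative.

```python
# Back-propagation as the chain rule: one neuron, y = sigmoid(w * x), loss = (y - t)^2.
import math

def forward(w, x):
    return 1.0 / (1.0 + math.exp(-w * x))

def loss(w, x, t):
    y = forward(w, x)
    return (y - t) ** 2

def grad_chain_rule(w, x, t):
    y = forward(w, x)
    dL_dy = 2.0 * (y - t)     # d/dy of (y - t)^2
    dy_dz = y * (1.0 - y)     # derivative of the sigmoid at z = w * x
    dz_dw = x                 # d/dw of (w * x)
    return dL_dy * dy_dz * dz_dw

w, x, t, eps = 0.7, 1.5, 0.0, 1e-6
numeric = (loss(w + eps, x, t) - loss(w - eps, x, t)) / (2 * eps)
print(grad_chain_rule(w, x, t), numeric)   # the two estimates agree closely
```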
saberience · 1h ago
None of the major aspects of deep learning came from manifolds though.
It is primarily linear algebra, calculus, probability theory and statistics, secondarily you could add something like information theory for ideas like entropy, loss functions etc.
But really, if "manifolds" had never been invented/conceptualized, we would still have deep learning today; the concept made zero impact on the practical technology we are all using every day.
qbit42 · 1h ago
Loss landscapes can be viewed as manifolds. Adagrad/ADAM adjust SGD to better fit the local geometry and are widely used in practice.
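For concreteness, a minimal sketch of one Adam step in the standard Kingma-Ba formulation; the per-coordinate scaling by the second-moment estimate is the "fit the local geometry" part. Names are generic, not tied to any framework.

```python
# One Adam update step (standard formulation).
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate: local curvature proxy
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-coordinate step size
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_step(theta, grad=np.array([0.1, -2.0, 0.0]), m=m, v=v, t=1)
```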
kwertzzz · 2h ago
Can you give an example where theories and techniques from other fields are reinvented? I would be genuinely interested for concrete examples. Such "reinventions" happen quite often in science, so to some degree this would be expected.
srean · 2h ago
Bethe ansatz is one. It took a tour de force by Yedidia to recognize that loopy belief propagation is computing the stationary point of Bethe's approximation to the free energy.
Many statistical thermodynamics ideas were reinvented in ML.
Same is true for mirror descent. It was independently discovered by Warmuth and his students as Bregman divergence proximal minimization, or as a special case would have it, exponential gradient algorithms.
One can keep going.
ogogmad · 2h ago
The connections of deep learning to stat-mech and thermodynamics are really cool.
It's led me to wonder about the origin of the probability distributions in stat-mech. Physical randomness is mostly a fiction (outside maybe quantum mechanics) so probability theory must be a convenient fiction. But objectively speaking, where then do the probabilities in stat-mech come from? So far, I've noticed that the (generalised) Boltzmann distribution serves as the bridge between probability theory and thermodynamics: It lets us take non-probabilistic physics and invent probabilities in a useful way.
srean · 1h ago
In Boltzmann's formulation of stat-mech it comes from the assumption that when a system is in "equilibrium", then all the micro-states that are consistent with the macro-state are equally occupied. That's the basis of the theory. A prime mover is thermal agitation.
It can be circular if one defines equilibrium to be that situation when all the micro-states are equally occupied. One way out is to define equilibrium in temporal terms - when the macro-states are not changing with time.
mitthrowaway2 · 51m ago
The Bayesian reframing of that would be that when all you have measured is the macrostate, and you have no further information by which to assign a higher probability to any compatible microstate than any other, you follow the principle of indifference and assign a uniform distribution.
srean · 12m ago
Yes indeed, thanks for pointing this out. There are strong relationships between max-ent and Bayesian formulations.
For example one can use a non-uniform prior over the micro-states. If that prior happens to be in the Darmois-Koopman family that implicitly means that there are some non explicitly stated constraints that bind the micro-state statistics.
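A sketch of that relationship: maximizing entropy relative to a non-uniform prior q over micro-states, subject to a moment constraint, exponentially tilts the prior; a uniform q with f(i) = E_i recovers the familiar Boltzmann form.

```latex
\max_{p}\; -\sum_i p_i \ln\frac{p_i}{q_i}
\quad \text{s.t.} \quad \sum_i p_i = 1,\;\; \sum_i p_i\, f(i) = F
\qquad \Longrightarrow \qquad
p_i \;\propto\; q_i\, e^{-\lambda f(i)}
```

with the multiplier lambda chosen so the constraint holds; for q uniform and f(i) = E_i this gives p_i proportional to e^{-beta E_i}.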
whatever1 · 1h ago
I mean the entire domain of systems control is being reinvented by deep RL: system identification, stability, robustness, etc.
nickpsecurity · 2h ago
One might add 8-16-bit training and quantization. Also, computing semi-unreliable values with error correction. Such tricks have been used in embedded software development on MCUs for some time.
behnamoh · 2h ago
> a few intuitions coming from theory (that was not topology).
I think these 'intuitions' are an after-the-fact thing, meaning AFTER deep learning comes up with a method, researchers in other fields of science notice the similarities between the deep learning approach and their (possibly decades old) methods. Here's an example where the author discovers that GPT poses the same computational problems he has solved in physics before:
https://ondrejcertik.com/blog/2023/03/fastgpt-faster-than-py...
I beg to differ. It's complete hyperbole to suggest that the article said "it's the same problem as something in physics", given this statement:
It seems that the bottleneck algorithm in GPT-2 inference is matrix-matrix multiplication. For physicists like us, matrix-matrix multiplication is very familiar, *unlike other aspects of AI and ML* [emphasis mine]. Finding this familiar ground inspired us to approach GPT-2 like any other numerical computing problem.
Note: Matrix-matrix multiplication is basic mathematics, and not remotely interesting as physics. It's not physically interesting.
bee_rider · 1h ago
Agreed.
Although, to try to see it from the author's perspective, it is pulling tools out of the same (extremely well developed and studied in its own right) toolbox as computational physics does. It is a little funny although not too surprising that a computational physics guy would look at some linear algebra code and immediately see the similarity.
Edit: actually, thinking a little more, it is basically absurd to believe that somebody has had a career in computational physics without knowing they are relying heavily on the HPC/scientific computing/numerical linear algebra toolbox. So, I think they are just using that to help with the narrative for the blog post.
constantcrying · 50m ago
You are exactly right: after deep learning researchers had invented Adam for SGD, numerical analysts finally discovered gradient descent. And after the first neural net was discovered, the matrix was finally invented in the novel field of linear algebra.
theahura · 2h ago
I say this as someone who has been in deep learning for over a decade now: this is pretty wrong, both on the merits (data obviously lives on a manifold) and on its applications to deep learning (cf. Chris Olah's blog as an example from 2014, which is linked in my post -- https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/). Embedding spaces are called 'spaces' for a reason. GANs, VAEs, contrastive losses -- all of these are about constructing vector manifolds that you can 'walk' to produce different kinds of data.
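A minimal sketch of "walking" a learned latent space as described above; `decoder` is a stand-in for a trained VAE decoder or GAN generator from any framework.

```python
# Interpolate between two latent vectors and decode each intermediate point.
import numpy as np

def latent_walk(decoder, z_start, z_end, steps=8):
    samples = []
    for alpha in np.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z_start + alpha * z_end   # straight line in latent space
        samples.append(decoder(z))                  # nearby z's decode to semantically nearby data
    return samples
```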
almostgotcaught · 1h ago
You're citing a guy that never went to college (has no math or physics degree), has never published a paper, etc. I guess that actually tracks pretty well with how strong the whole "it's deep theory" claim is.
Chris Olah? One of the founders of Anthropic and the head of their interpretability team?
esafak · 2h ago
It does if you relax your definition to accommodate approximation error, cf. e.g., Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning (https://aclanthology.org/2021.acl-long.568.pdf)
niemandhier · 2h ago
It’s alchemy.
Deep learning in its current form relates to a hypothetical underlying theory as alchemy does to chemistry.
In a few hundred years the Inuktitut-speaking high schoolers of the civilisation that comes after us will learn that this strange word "deep learning" is a leftover from the lingua franca of yore.
adamnemecek · 2h ago
Not really, most of the current approaches are some approximations of the partition function.
fmap · 58m ago
The reason deep learning is alchemy is that none of these deep theories have predictive ability.
Essentially all practical models are discovered by trial and error and then "explained" after the fact. In many papers you read a few paragraphs of derivation followed by a simpler formulation that "works better in practice". E.g., diffusion models: here's how to invert the forward diffusion process, but actually we don't use this, because gradient descent on the inverse log likelihood works better. For bonus points the paper might come up with an impressive name for the simple thing.
In most other fields you would not get away with this. Your reviewers would point this out and you'd have to reformulate the paper as an experience report, perhaps with a section about "preliminary progress towards theoretical understanding". If your theory doesn't match what you do in practice - and indeed many random approaches will kind of work (!) - then it's not a good theory.
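For the diffusion example: the "simpler formulation that works better in practice" is, in the standard DDPM setup, the reweighted noise-prediction objective rather than the full variational bound it is derived from:

```latex
L_{\text{simple}}(\theta)
= \mathbb{E}_{t,\, x_0,\, \epsilon \sim \mathcal{N}(0, I)}
\left[ \left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\right) \right\rVert^2 \right]
```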
theahura · 52m ago
It's true that there is no directly predictive model of deep learning, and it's also true that there is some trial and error, but it is wrong to say that therefore there is no operating theory at all. I recommend reading Ilya's 30 papers (here's my review of that set: https://open.substack.com/pub/theahura/p/ilyas-30-papers-to-...) to see how shared intuitions and common threads have clearly developed over the last decade-plus.
Koshkin · 2h ago
> Data doesn't actually live on a manifold.
Often, they do (and then they are called "sheaves").
wenc · 1h ago
Many types of data don’t. Disconnected spaces like integer spaces don’t sit on a manifold (they are lattices). Spiky noisy fragmented data don’t sit on a (smooth) manifold.
In fact not all ML models treat data as manifolds. Nearest neighbors, decision trees don’t require the manifold assumption and actually work better without it.
qbit42 · 1h ago
Any reasonable statistical explanation of deep learning requires there to be some sort of low dimensional latent structure in the data. Otherwise, we would not have enough training data to learn good models, given how high the ambient dimensions are for most problems.
wenc · 29m ago
Deep learning specifically, yes. It needs a manifold assumption. But not data in general, which is what I was responding to.
theahura · 50m ago
It turns out a lot of disconnected spaces can be approximated by smooth ones that have really sharp boundaries, which more or less seems to be how neural networks approximate something like discrete tokens.
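A tiny illustration of a smooth function with an effectively sharp boundary: softmax at a low temperature is differentiable everywhere yet numerically indistinguishable from a hard argmax. Values here are just an example.

```python
# Smooth everywhere, but nearly one-hot at low temperature.
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 2.1, -1.0])
print(softmax(logits, temperature=1.0))    # soft: mass spread over the first two entries
print(softmax(logits, temperature=0.01))   # sharp: effectively a one-hot on index 1
```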
wenc · 24m ago
Can be approximated yes. Approximated well? No, but you can get away with it sometimes with saturation functions like softmax.
Some data is truly integral and if you try to approximate with tanh, sigmoid or any number of switching functions you lose fidelity.
baxtr · 1h ago
Just a side comment to your observation: the principle is called reductionism and has been tried on many fields.
Physics is just applied mathematics
Chemistry is just applied physics
Biology is just applied chemistry
It doesn’t work very well.
yubblegum · 1h ago
> Near total majority, if not 100%, of the useful things done in deep learning have come from not thinking about topology in any way.
Of course. Now, to actually deeply understand what is happening with these constructs, we will use topology. Topological insights will then, without doubt, inform the next generations of this technology.
solomatov · 1h ago
May I ask you to give examples of insights from topology which improved existing models, or at least improved our understanding of them? arxiv papers are preferred.
Regic · 1h ago
I feel like the fact that ML has no good explanation for why it works this well gives a lot of people room to invent their head-canon, usually from their field of expertise. I've seen this from exceptionally intelligent individuals too. If you only have a hammer...
nomel · 17m ago
I think it would be more unusual, and concerning, if an intelligent individual didn't attempt to apply their expertise for a head-canon of something unknown.
Coming up with an idea for how something works, by applying your expertise, is the fundamental foundation of intelligence and learning, and has been behind every single advancement of human understanding.
People thinking is always a good thing. Thinking about the unknown is better. Thinking with others is best, and sharing those thoughts isn't somehow bad, even if they're not complete.
thuuuomas · 2h ago
I cannot understand this prideful resentment of theory common among self-described practitioners.
Even if existing theory is inadequate, would an operating theory not be beneficial?
Or is the mystique combined with guess&check drudgery job security?
canjobear · 2h ago
If there were theory that led to directly useful results (like, telling you the right hyperparameters to use for your data in a simple way, or giving you a new kind of regularization that you can drop in to dramatically improve learning) then deep learning practitioners would love it. As it currently stands, such theories don't really exist.
theahura · 1h ago
This is way too rigorous a standard. You can absolutely have theories that lead to useful results even if they aren't as predictive as you describe. The theory of evolution is an obvious counterpoint.
fiddlerwoaroof · 1h ago
Useful theories only come to exist because someone started by saying they must exist and then spent years or lifetimes discovering them.
lumost · 1h ago
There are many reasons to believe a theory may not be forthcoming, or that if it is available may not be useful.
For instance, we do not have consensus on what a theory should accomplish - should it provide convergence bounds/capability bounds? Should it predict optimal parameter counts/shapes? Should it allow more efficient calculation of optimal weights? Does it need to do these tasks in linear time?
Even materials science in metals is still cycling through theoretical models after thousands of years of making steel and other alloys.
jebarker · 2h ago
There are strong incentives to leave theory as technical debt and keep charging forward. I don't think it's resentment of theory; everyone would love a theory if one were available, but very few are willing to forgo the near-term rewards to pursue theory. Also, it's really hard.
danielmarkbruce · 1h ago
Who is proud? What you are seeing in some cases is eye rolling. And it's fair eye rolling.
There is an enormous amount of theory used in the various parts of building models, there just isn't an overarching theory at the very most convenient level of abstraction.
It almost has to be this way. If there was some neat theory, people would use it and build even more complex things on top of it in an experimental way and then so on.
hiddencost · 2h ago
Maybe a little less with the ad hominems? The OP is providing an accurate description of an extremely immature field.
cnity · 2h ago
Many mathematicians are (rightly, IMO) allergic to assertions that certain branches are not useful (explicit in OP) and especially so if they are dismissive of attempts to understand complicated real world phenomema (implicit in OP, if you ask me).
csimon80 · 2h ago
"All models are wrong, but some are useful" -George Box
woopwoop · 1h ago
I don't agree with your first sentence, but I agree with the rest of this post.
motoboi · 2h ago
Your comment sits in the nice gradient between not seeing at all the obvious relationships between deep learning and topology and thinking that deep learning is applied topology.
See? Everything lives in the manifold.
Now for a great visualization about the Manifold Hypothesis I cannot recommend more this video: https://www.youtube.com/watch?v=pdNYw6qwuNc
That helps to visualize how the activation functions, biases and weights (linear transformations) serve to stretch the high-dimensional space so that the data gets pushed to extremes and becomes easy to place on a low-dimensional object (the manifold) embedded in a high-dimensional space, where it is trivial to classify or separate.
Gaining an intuition about this process will make some deep learning practices so much easier to understand.
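A minimal sketch of that stretching: one layer, a linear map plus ReLU, applied to a few 2D points. Anything that lands below the hinge gets folded onto it. The particular weights are arbitrary.

```python
# How a single layer stretches and folds the plane.
import numpy as np

W = np.array([[1.5, -0.5],
              [0.3,  2.0]])       # linear stretch / shear
b = np.array([0.2, -1.0])         # translation

def layer(x):
    return np.maximum(0.0, W @ x + b)   # ReLU clamps negative coordinates to the crease at zero

for x in [np.array([1.0, 1.0]), np.array([-1.0, 1.0]), np.array([0.0, -1.0])]:
    print(x, "->", layer(x))
```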
constantcrying · 53m ago
>it's an empirical field advanced mostly by trial and error and, sure, a few intuitions coming from theory (that was not topology).
Neural Networks consist almost exclusively of two parts, numerical linear algebra and numerical optimization.
Even if you reject the abstract topological description, numerical linear algebra and optimization couldn't be more directly applicable.
nis0s · 38m ago
> One way to think about neural networks, especially really large neural networks, is that they are topology generators. That is, they will take in a set of data and figure out a topology where the data has certain properties. Those properties are in turn defined by the loss function.
Latent spaces may or may not have useful topology, so this idea is inherently wrong, and builds the wrong type of intuition. Different neural nets will result in different feature space understanding of the same data, so I think it's incorrect to believe you're determining intrinsic geometric properties from a given neural net. I don't think people should throw around words carelessly because all that does is increase misunderstanding of concepts.
In general, manifolds can help discern useful characteristics about the feature space, and may have useful topological structures, but trying to impose an idea of "topology" on this is a stretch. Moreover, the kind of basic examples used in this blog post don't help prove the author's point. Maybe I am misunderstanding this author's description of what they mean, but this idea of manifold learning is nothing new.
adsharma · 53m ago
Isn't Deep Learning more like Graph Theory? I shared yesterday that Google published a paper called CRISP (https://arxiv.org/pdf/2505.11471) that carefully avoids any reference to the word "Graph".
So then the question becomes what's the difference between Graph Theory and Applied Topology? Graphs operate on discrete structures and topology is about a continuous space. Otherwise they're very closely related.
But the higher order bit is that AI/ML and Deep Learning in particular could do a better job of learning from and acknowledging prior art from related fields. Reusing older terminology instead of inventing new.
Graviscalar · 1h ago
I was one of the people that was super excited after reading the Chris Olah blog post from 2014, and over the past decade I've seen the insight go exactly nowhere. It's neat but it hasn't driven any interesting results, though Ayasdi did some interesting stuff with TDA and Gunnar Carlsson has been playing around with neural nets recently.
theahura · 45m ago
I think it's incorrect that the insight has gone nowhere. See, e.g., contrastive loss / CLIP, or VQGAN image generation. Arguably also diffusion models.
More generally, in my experience as an AI researcher, understanding the geometry of data leads directly to changes in model architecture. Though people disparage that as "trial and error", it is far more directed than people on the outside give credit for.
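A sketch of the contrastive/CLIP-style objective mentioned above, written in plain numpy; shapes and names are illustrative, not any library's actual API.

```python
# CLIP-style contrastive loss: matched image/text embeddings pulled together,
# mismatched pairs pushed apart on the unit sphere.
import numpy as np

def clip_loss(image_emb, text_emb, temperature=0.07):
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature      # pairwise cosine similarities
    labels = np.arange(len(logits))                    # i-th image matches i-th text

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(l)), labels].mean()

    # symmetric: image -> text and text -> image
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
loss = clip_loss(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
```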
Graviscalar · 36m ago
The geometric intuition is solid, but actually applying topology has been less fruitful in spite of a lot of people trying their best, as Chris Olah himself has said elsewhere in this thread.
rubitxxx8 · 1h ago
What would you have expected to happen?
Advances and insights sometimes lie dormant for decades or more before someone else picks them up and does something new.
Graviscalar · 54m ago
I would expect model/algorithm improvements from using topological concepts to analyze the manifolds in question or concrete results in model interpretability. Gunnar has studied some toy examples, but they were barely a step up from the ones Olah constructed for the sake of explanation and they haven't borne any further fruit.
You can say any advance or insight is just lying dormant, but it doesn't mean anything unless you can specifically articulate why it still has potential. I haven't made any claims on the future of the intersection of deep learning and topology; I was pointing out that it's been anything but dormant given the interest in it, yet it hasn't led anywhere.
profchemai · 2h ago
Once I read "This has been enough to get us to AGI.", credibility took a nose dive.
In general it's a nice idea, but the blogpost is very fluffy, especially once it connects it to reasoning; there is serious technical work in this area (e.g. https://arxiv.org/abs/1402.1869) that has expanded this idea and made it more concrete.
vayllon · 1h ago
Another type of topology you’ll encounter in deep neural networks (DNNs) is network topology. This refers to the structure of the network — how the nodes are connected and how data flows between them. We already have several well-known examples, such as auto-encoders, convolutional neural networks (CNNs), and generative adversarial networks (GANs), all of which are bio-inspired.
However, we still have much to learn about the topology of the brain and its functional connectivity. In the coming years, we are likely to discover new architectures — both internal within individual layers/nodes and in the ways specialized networks connect and interact with each other.
Additionally, the brain doesn’t rely on a single network, but rather on several ones — often referred to as the "Big 7" — that operate in parallel and are deeply interconnected. Some of these include the Default Mode Network (DMN), the Central Executive Network (CEN) or the Limbic Network, among others. In fact, a single neuron can be part of multiple networks, each serving different functions.
We have not yet been able to fully replicate this complexity in artificial systems, and there is still much to be learned from and inspired by in these "network topologies".
So, "Topology is all you need" :-)
_alternator_ · 1h ago
The question is not so much whether this is true—we can certainly represent any data as points on a manifold. Rather, it’s the extent to which this point of view is useful. In my experience, it’s not the most powerful perspective.
In short, direct manifold learning is not really tractable as an algorithmic approach. The most powerful set of tools and theoretical basis for AI has sprung from statistical optimization theory (SGD, information-theoretical loss minimization, etc.). The fact that data is on a manifold is a tautological footnote to this approach.
terabytest · 2h ago
I'm confused by the author's diagram claiming that AGI/ASI are points on the same manifold as next token prediction, chat models, and CoT models. While the latter three are provably part of the same manifold, what justifies placing AGI/ASI there too?
What if the models capable of CoT aren't and will never be, regardless of topological manipulation, capable of processes that could be considered AGI? For example, human intelligence (the closest thing we know to AGI) requires extremely complex sensory and internal feedback loops and continuous processing unlike autoregressive models' discrete processing.
As a layman, this matches my intuition that LLMs are not at all in the same family of systems as the ones capable of generating intelligence or consciousness.
theahura · 1h ago
Possible. AGI/ASI are poorly defined. I tend to think we're already at AGI, obviously many disagree.
> For example, human intelligence (the closest thing we know to AGI) requires extremely complex sensory and internal feedback loops and continuous processing unlike autoregressive models' discrete processing.
I've done a fair bit of connectomics research and I think that this framing elides the ways in which neural networks and biological networks are actually quite similar. For example, in mice olfactory systems there is something akin to a 'feature vector' that appears based on which neurons light up. Specific sets of neurons lighting up means 'chocolate' or 'lemon' or whatever. More generally, it seems like neuronal representations are somewhat similar to embedding representations, and you could imagine constructing an embedding space based on what neurons light up where. Everything on top of the embeddings is 'just' processing.
fusionadvocate · 15m ago
I believe we already have the technology required for AGI. It is perhaps analogous to a manned lunar station or a 2-mile-tall skyscraper. We have the technology required to build it, but we don't for various reasons.
deepburner · 35m ago
This whole article is just a nothingburger. Saying something is applied topology is only one step more advanced than saying something is maths - duh. These mathematical abstractions are incredibly general, and you can pretty much draw up anything in terms of anything; the challenging part is being able to turn around and use the model/abstraction to say things about the thing you're abstracting. I don't think scholars have been very successful in that regard, less so this article.
Yeah deep learning is applied topology, it's also applied geometry, and probably applied algebra and I wouldn't be surprised if it was also applied number theory.
crgi · 1h ago
Interesting read. Reminded me of the Trinity 3D manifold visualization tool which (among other things) lets you explore the hyperspace of neural networks: https://github.com/trinity-xai/Trinity
Ok, how do transformers fit into this understanding of deep learning?
theahura · 1h ago
Transformers learn embedding representations of tokens, which are easily mapped into a space. Similar tokens are mapped to similar places on the space. The fully connected layer at the end of each transformer block defines a transformation of a set of points in a space to another point in that space, not unlike the example of adding colors together to get a new color
Transformers don't feel differentiable (because of the attention mechanism), but they actually are (being back-propagation based forces them to be).
The attention mechanism is not a stretching of the manifold, but is trained to be able to measure distances in the manifold surface, which is stretched and deformed (or transformed?) in the feed-forward layers.
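A minimal single-head scaled dot-product attention in numpy, to make the "it's all differentiable" point concrete; names and shapes are generic.

```python
# Scaled dot-product attention (single head, no masking).
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # similarity between queries and keys
    scores = scores - scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax: smooth and differentiable
    return weights @ V                                         # weighted mixture of values

rng = np.random.default_rng(0)
out = attention(rng.standard_normal((5, 16)),
                rng.standard_normal((5, 16)),
                rng.standard_normal((5, 32)))
```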
sota_pop · 2h ago
I’ve always enjoyed this framing of the subject, the idea of mapping anything as hyperplanes existing in a solution space is one of the ideas that really blew my hair back during my academic studies. I would nitpick at your “dots in a circle example - with the stoner reference joke” I could be mistaken, but common practice isn’t to “move to a higher dimension”, but use a kernel (i.e. parameterize the points into the polar |r,theta> basis). All things considered, nice article.
theahura · 1h ago
I'm pulling directly from Chris Olah's blog post with that example. But I will say that in practice, it's always surprising how increasing the dimensionality of a neural network magically solves all sorts of problems. You could use a kernel if you don't have more computation available, but given more computation, adding a dimension is strictly more flexible (and is capable of separating a much wider range of datasets).
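For reference, a concentric-rings version of that example: the classes are not linearly separable in (x, y), but adding the radius as a third coordinate, which is essentially what the polar parameterization does, makes them separable by a single threshold.

```python
# Two concentric rings become linearly separable after lifting to a third dimension.
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])   # inner ring / outer ring
labels = np.concatenate([np.zeros(100), np.ones(100)])

x = radii * np.cos(angles)
y = radii * np.sin(angles)
r = np.sqrt(x ** 2 + y ** 2)            # the lifted third coordinate

predictions = (r > 2.0).astype(float)   # a single threshold on the new axis
print((predictions == labels).mean())   # 1.0: perfectly separated
```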
parpfish · 2h ago
This is also how I've often thought about deep learning -- focusing on the geometry of the data at each layer rather than the weights and biases is far more revealing.
I've always been hopeful that some algebraic topology master would dig into this question and it'd provide some better design principles for neural nets: which activation functions? How much to fan in/out? How many layers?
maxiepoo · 2h ago
Isn't it more differential geometry?
kookamamie · 43m ago
> This has been enough to get us to AGI.1
Hard disagree.
mirekrusin · 2h ago
Just because manifold looks a bit like burrito if you squint doesn't mean it is a burrito.
ComplexSystems · 2h ago
What if you don't have to squint very much?
fedeb95 · 2h ago
Interesting read. This seems hard to prove:
"Everything lives on a manifold"
khoangothe · 36m ago
Cool post! Thanks
mbowcut2 · 1h ago
To a topologist, everything is topology.
kookamamie · 40m ago
To a man with only a hammer, everything looks like a nail.