Edit: I wrote my comment a bit too early before finishing the whole article. I'll leave my comment below, but it's actually not very closely related to the topic at hand or the author's paper.
I agree with the gist of the article (which IMO is basically that universal computation is universal regardless of how you perform it), but there are two big issues that prevent this observation from helping us in a practical sense:
1. Not all models are equally efficient. We already have many methods to perform universal search (e.g., Levin's, Hutter's, and Schmidhuber's versions), but they are painfully slow despite being optimal in a narrow sense that doesn't extrapolate well to real world performance.
2. Solomonoff induction is only optimal for infinite data (i.e., it can be used to create a predictor that asymptotically dominates any other algorithmic predictor). As far as I can tell, the problem remains totally unsolved for finite data, due to the additive constant that results from the question: which universal model of computation should be applied to finite data? You can easily construct a Turing machine that is universal and perfectly reproduces the training data, yet nevertheless dramatically fails to generalize. No one has made a strong case for any specific natural prior over universal Turing machines (and if you try to define some measure to quantify the "size" of a Turing machine, you find that this approach starts to fail once the transition table becomes large enough to start exhibiting redundancy). A toy sketch below tries to make this finite-data sensitivity concrete.
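To be clear, the sketch below is not Solomonoff induction proper (there is no actual universal Turing machine in it), just a description-length-weighted mixture over a tiny made-up hypothesis class of periodic bit patterns. The `length_penalty` knob stands in for the choice of reference machine, i.e., what counts as a "short" program; with only a few observed bits, changing it visibly shifts the prediction.

```python
# Toy, not Solomonoff induction proper: a description-length-weighted mixture
# over periodic binary patterns. "length_penalty" stands in for the choice of
# reference machine (what counts as a short program); on finite data that
# choice shifts the prediction, even though it washes out asymptotically.
from itertools import product

def hypotheses(max_period=4):
    # Enumerate repeating binary patterns; use the period as the description length.
    for period in range(1, max_period + 1):
        for bits in product("01", repeat=period):
            yield "".join(bits), period

def predict_next(data, length_penalty=1.0):
    # P(next bit = 1) under a 2^(-penalty * length) prior over consistent hypotheses.
    weight_one = weight_total = 0.0
    for pattern, length in hypotheses():
        generated = (pattern * (len(data) // len(pattern) + 2))[: len(data) + 1]
        if generated[: len(data)] == data:            # hypothesis consistent with observations
            w = 2.0 ** (-length_penalty * length)
            weight_total += w
            weight_one += w if generated[len(data)] == "1" else 0.0
    return weight_one / weight_total if weight_total else 0.5

data = "010"
print(predict_next(data, length_penalty=1.0))  # ~0.62
print(predict_next(data, length_penalty=3.0))  # ~0.88: same data, different "machine", different belief
```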
im3w1l · 31m ago
Regarding your second point I think there are two cases here that should be kept separate. The first is that you are teleported into a parallel dimension where literally everything works differently from here. In that case I do agree that there are several reasonable choices of models of computation. You simply have to pick one and hope it wasn't too bad.
But the second case is that you encounter some phenomenon here in our ordinary world. And in that case I think you can do way better by reasoning about the phenomenon and trying to guess at plausible mechanics based on your preexisting knowledge of how the world works. In particular, I think guessing that "there is some short natural language description of how the phenomenon works, based on a language grounded in the corpus of human writing" is a very reasonable prior.
coffeecoders · 40m ago
I think “we might decode whale speech or ancient languages” is a huge stretch. Context is the most important part of what makes language useful.
There are billions of human-written texts, grounded in shared experience, and that's what makes our AI good at language. We don't have that for a whale.
somethingsome · 51m ago
Mmmh I'm deeply skeptical of some parts.
> One explanation for why this game works is that there is only one way in which things are related
There is not; this is a completely non-transitive relationship.
On another point: suppose you keep the same vocabulary but permute the meanings of the words. The neural network will still learn relationships, completely different ones, and its representation may converge toward a better compression for that set of words, but I'm dubious that this new compression scheme will resemble the previous one (?)
I would say that given an optimal encoding of the relationships we can achieve extreme compression, but not all encodings lead to the same compression in the end.
If I add 'bla' between every word in a text, that is easy to compress. But now, if I add an increasing sequence of words between each word, the meaning is still there, yet the compression will not be the same, as the network will try to generate the words in between.
(thinking out loud)
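A quick way to poke at that last point, with the obvious caveat that zlib is only a crude stand-in for whatever compression a trained network achieves, and the corpus and filler tokens here are invented for illustration: compare the compressed size of the raw text, the text with a constant 'bla' filler, and the text with a filler that grows at every position.

```python
# Crude stand-in experiment: zlib instead of a learned model. Compare how much a
# constant filler ('bla' between every word) vs. a growing filler inflates the
# compressed size of the same underlying text.
import zlib

def compressed_size(text: str) -> int:
    return len(zlib.compress(text.encode("utf-8"), 9))

words = ("the cat sat on the mat and watched the dog " * 10).split()

plain      = " ".join(words)
constant   = " bla ".join(words)                                    # same filler everywhere
increasing = " ".join(w + " " + " ".join(f"pad{j}" for j in range(i))
                      for i, w in enumerate(words))                 # filler grows with position

for name, text in [("plain", plain), ("constant 'bla'", constant), ("increasing filler", increasing)]:
    print(f"{name:18s} raw={len(text):6d}  compressed={compressed_size(text):5d}")
```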
streptomycin · 30m ago
> Is it closer to Mussolini or bread? Mussolini.
> Is it closer to Mussolini or David Beckham? Uhh, I guess Mussolini. (Ok, they’re definitely thinking of a person.)
That reasoning doesn't follow. Many things besides people would have the same answers, for instance any animal that seems more like Mussolini than Beckham.
pjio · 18m ago
I believe the joke is about David Beckham not really being (perceived as) human, even when compared to personified evil
jxmorris12 · 21m ago
Whoops. I hope you can overlook this minor logical error.
Fomite · 14m ago
Oswald Mosley
TheSaifurRahman · 1h ago
This only works when different sources share similar feature distributions and semantic relationships.
The M or B game breaks down when you play with someone who knows obscure people you've never heard of. Either you can't recognize their references, or your sense of "semantic distance" differs from theirs.
The solution is to match knowledge levels: experts play with experts, generalists with generalists.
The same applies to decoding ancient texts: if ancient civilizations focused on completely different concepts than we do today, our modern semantic models won't help us understand their writing.
npinsker · 1h ago
I've played this game with friends occasionally and -- when it's a person -- don't think I've ever completed a game.
dr_dshiv · 52m ago
What about the platonic bits? Any other articles that give more details there?
TheSaifurRahman · 1h ago
Has there been research on using this to make models smaller? If models converge on similar representations, we should be able to build more efficient architectures around those core features.
yorwba · 1h ago
It's more likely that such an architecture would be bigger rather than smaller. https://arxiv.org/abs/2412.20292 demonstrated that score-matching diffusion models approximate a process that combines patches from different training images. To build a model that makes use of this fact, all you need to do is look up the right patch in the training data. Of course a model the size of its training data would typically be rather unwieldy to use. If you want something smaller, we're back to approximations created by training the old-fashioned way.
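To make the "model that just looks up patches" point concrete, here is a rough caricature of that idea (not the paper's actual construction), with random arrays standing in for real image patches: its only "parameters" are the stored training patches themselves, so it can't be smaller than the data it memorises.

```python
# Toy nearest-neighbour "denoiser" whose parameters are literally the training
# patches, so its size grows with the dataset rather than shrinking.
import numpy as np

rng = np.random.default_rng(0)
train_patches = rng.random((10_000, 8 * 8))   # stand-in for 8x8 image patches

def lookup_denoise(noisy_patch: np.ndarray) -> np.ndarray:
    # Return the training patch closest to the noisy input (L2 distance).
    dists = np.linalg.norm(train_patches - noisy_patch, axis=1)
    return train_patches[np.argmin(dists)]

query = train_patches[123] + 0.1 * rng.standard_normal(64)
print(np.allclose(lookup_denoise(query), train_patches[123]))  # usually True for small noise
```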
giancarlostoro · 1h ago
I've been thinking about this a lot. I want to know how small a model can be before letting it browse search engines, or files you host locally, becomes a viable way for an LLM to give you more informed answers. Is it 2GB? 8GB? Would love to know.
empath75 · 1h ago
This is kind of fascinating because I just tried to play Mussolini or Bread with ChatGPT and it is absolutely _awful_ at it, even with reasoning models.
It just assumes that your answers are going to be reasonably bread-like or reasonably Mussolini-like, and doesn't think laterally at all.
It just kept asking me about varieties of baked goods.
edit: It did much better after I added some extra explanation -- that it could be anything, that it may be very unlike either choice, and not to try and narrow down too quickly.
fsmv · 1h ago
I think an LLM is a bit too high level for this game, or maybe it would just need a lengthy prompt to explain the game.
Word2vec used directly is exactly the right thing to play this game with. Those embeddings exist inside an LLM, but the LLM is trained to respond like text found online, not to play this game.
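For what it's worth, here is a minimal sketch of doing exactly that with pretrained word2vec vectors via gensim. The "word2vec-google-news-300" download and the specific tokens (multi-word names use underscores in that vocabulary) are assumptions, and out-of-vocabulary words will raise a KeyError.

```python
# Minimal sketch: answer one round of "Mussolini or Bread" with raw word2vec
# similarities. Assumes gensim plus its downloadable Google News vectors
# (a large download); tokens missing from that vocabulary will raise a KeyError.
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")  # pretrained KeyedVectors

def closer_to(secret: str, option_a: str, option_b: str) -> str:
    # Cosine similarity in embedding space decides the answer.
    return option_a if kv.similarity(secret, option_a) >= kv.similarity(secret, option_b) else option_b

secret = "croissant"
print(closer_to(secret, "Mussolini", "bread"))       # expected: bread
print(closer_to(secret, "bread", "David_Beckham"))   # Google News vectors join names with underscores
```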
tyronehed · 2h ago
Especially if they are all me-too copies of a Transformer.
When we arrive at AGI, you can be certain it will not contain a Transformer.
jxmorris12 · 1h ago
I don't think architecture matters. It seems to be more a function of the data somehow.
I once saw a LessWrong post claiming that the Platonic Representation Hypothesis doesn't hold when you only embed random noise, as opposed to natural images: http://lesswrong.com/posts/Su2pg7iwBM55yjQdt/exploring-the-p...
> I don't think architecture matters. It seems to be more a function of the data somehow.
of course it matters
if I supply the ants in my garden with instructions on how to build tanks and stealth bombers, they're still not going to be able to conquer my front room