I would really like to read a full research paper made out of this, which describes the method in more detail, gives some more examples, does more analysis on it, etc.
Btw, this uses LLMs purely at the text level? Why not images? Most of these patterns are easy to detect at the image level, but I assume when presented as text, it's much harder.
> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
I think this argument is a bit flawed. Yes, you can define AGI as being better than (average) humans in every possible task. But isn't this very arbitrary? Isn't it more reasonable to expect that different intelligent systems (including animals and humans) have different strengths, and that it is unreasonable to expect one system to really be better at everything? Maybe it's more reasonable to define ASI that way, but even for ASI, if a system is already better in a majority of tasks (though not necessarily in every task), I think that should already count as ASI. Maybe being better at every possible task is simply not possible: you could always design a task that is very specifically tailored to human intelligence.
Davidzheng · 1h ago
Actually really promising stuff. I think a lot of the recent advances in the last 6mo-1yr are in the outer loop (for ex. the Google Deep Think model which got IMO gold and the OAI IMO gold both use substantive outer-loop search strategies [though it's unclear what these are] to maybe parallelize some generation/verification process). So there's no reason why we can't have huge advances in this area even outside of the industry labs, in my view (I'm uninformed in general, so take this comment with a large grain of salt).
modeless · 1h ago
I've been testing LLMs on Sokoban-like puzzles (in the style of ARC-AGI-3) and they are completely awful at them. It really highlights how poor their memory is. They can't remember abstract concepts or rules between steps, even if they discover them themselves. They can only be presented with text describing such things, which they have to re-read and re-interpret at every step.
LLMs are completely helpless on agentic tasks without a ton of scaffolding. But the scaffolding is inflexible and brittle, unlike the models themselves. Whoever figures out how to reproduce the functions of this type of scaffolding within the models, with some kind of internal test-time-learned memory mechanism, is going to win.
M4v3R · 59m ago
I wonder if scaffolding synthesis is the way to go. Namely, the LLM itself first reasons about the problem and creates scaffolding for a second agent that will do the actual solving, all inside a feedback loop that adjusts the scaffolding based on results.
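Roughly something like the sketch below (just an illustrative sketch; scaffold_llm, solver_llm and score are stand-ins for whatever models and external checks you'd actually plug in):

    # Hypothetical sketch of a scaffolding-synthesis loop: one model writes the
    # scaffold, another solves under it, and the scaffold is revised from results.

    def synthesize_and_solve(task, scaffold_llm, solver_llm, score, max_rounds=5):
        scaffold = scaffold_llm(f"Design step-by-step scaffolding for this task:\n{task}")
        best = None
        for _ in range(max_rounds):
            attempt = solver_llm(f"Task:\n{task}\nFollow this scaffolding:\n{scaffold}")
            result = score(task, attempt)  # external check: unit tests, a verifier, etc.
            if best is None or result > best[0]:
                best = (result, attempt, scaffold)
            # feed the outcome back so the scaffold itself gets adjusted
            scaffold = scaffold_llm(
                f"Task:\n{task}\nScaffolding:\n{scaffold}\n"
                f"An attempt following it scored {result}. Revise the scaffolding to fix the weaknesses."
            )
        return best  # (score, attempt, scaffold) of the best round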
sixo · 23m ago
I toyed around with the idea of using an LLM to "compile" user instructions into a kind of AST of scaffolding, which can then be run by another LLM. It worked fairly well for the kind of semi-structured tasks LLMs choke on, like "for each of 100 things, do...", but I haven't taken it beyond a minimal impl.
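Very roughly, the idea looks something like this (a toy sketch, not the actual implementation; the JSON node format and the llm() helper are made up for illustration):

    # Toy sketch: "compile" an instruction into a tiny AST of steps, then have a
    # second pass walk the tree and call the model once per leaf.

    import json

    def compile_instructions(llm, instructions):
        # Ask the model for a structured plan instead of free-form text.
        # (Assumes the model returns valid JSON; a real version would validate/retry.)
        plan = llm(
            'Turn these instructions into JSON using nodes of the form\n'
            '{"op": "task", "prompt": ...} or {"op": "foreach", "items": [...], "body": ...}\n'
            f"Instructions: {instructions}"
        )
        return json.loads(plan)

    def run(llm, node, context=""):
        if node["op"] == "task":
            return [llm(node["prompt"] + "\n" + context)]
        if node["op"] == "foreach":
            results = []
            for item in node["items"]:
                results += run(llm, node["body"], context=f"Current item: {item}")
            return results
        raise ValueError(f"unknown node op: {node['op']}")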
harshitaneja · 2m ago
I am working on something similar but with an AST for legal documents. So far, it seems promising but still rudimentary.
plantain · 22m ago
If you've ever used Claude Code + Plan mode, you know that this is exactly true.
modeless · 55m ago
In general I think the more of the scaffolding that can be folded into the model, the better. The model should learn problem solving strategies like this and be able to manage them internally.
jokoon · 14m ago
Those are bold claims
pilooch · 2h ago
Congrats, this solution resembles AlphaEvolve. Text serves as the high-level search space, and genetic mixing (MAP-Elites in AE) merges attempts at lower levels.
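For anyone unfamiliar with that family of methods, the shape of the loop is roughly this (a toy MAP-Elites-style sketch, not AlphaEvolve's actual machinery; descriptor and fitness are placeholders for problem-specific functions):

    # Toy MAP-Elites-style loop over text candidates: the archive keeps the best
    # candidate per behavior "cell", and an LLM acts as the mixing/mutation operator.

    import random

    def map_elites_text(llm, seed, descriptor, fitness, iterations=200):
        # descriptor() maps a candidate to a hashable cell id; fitness() scores it.
        archive = {descriptor(seed): (fitness(seed), seed)}
        for _ in range(iterations):
            # pick two elites and ask the model to merge/vary them ("genetic mixing")
            (_, a), (_, b) = random.choices(list(archive.values()), k=2)
            child = llm(
                f"Combine the strengths of these two attempts into one improved attempt:\n"
                f"--- A ---\n{a}\n--- B ---\n{b}"
            )
            cell, score = descriptor(child), fitness(child)
            if cell not in archive or score > archive[cell][0]:
                archive[cell] = (score, child)  # elite replacement within the cell
        return archive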