There are no new ideas in AI, only new datasets
274 points by bilsbie on 6/30/2025, 2:43:46 PM | 144 comments | blog.jxmo.io
It’s apparently much easier to scare the masses with visions of ASI than to build a general intelligence that can pick up a new 2D video game faster than a human can.
A serious attempt at video/vision would involve some probabilistic latent space that can be noised in ways that make sense for games in general. I think veo3 proves that AI can generalize to 2D and even 3D games; generating a video under prompt constraints is basically playing a game. I think you could prompt veo3 to play any game for a few seconds and it would generally make sense, even though it is not fine-tuned.
[0] https://deepmind.google/discover/blog/genie-2-a-large-scale-...
Besides static puzzles (like a maze or jigsaw), I don't believe this analogy holds. A model working with prompt constraints that don't evolve or get added to over the course of "navigating" the generation of its output needs to process no new information that it didn't come up with itself. Playing a game is different from other generation because it's primarily about reacting to input whose precise timing and spatial details you didn't know in advance, but which you can learn arrives within a known set of higher-order rules. Obviously, the more finite/deterministic/predictably probabilistic the video game's solution space, the more it can be inferred from the initial state (i.e. it reduces to the same type of problem as generating a video from a prompt), which is why models are still able to play video games. But as GP pointed out, transfer is negative in such cases: the overarching rules are not predictable enough across disparate genres.
> I think you could prompt veo3 to play any game for a few seconds
I'm curious what your threshold for what constitutes "play any game" is in this claim? If I wrote a script that maps button combinations to average pixel color of a portion of the screen buffer, by what metric(s) would veo3 be "playing" the game more or better than that script "for a few seconds"?
edit: removing knee-jerk reaction language
I am just saying we have proof that it can understand complex worlds and sets of rules, and then abide by them. It doesn't know how to use a controller and it doesn't know how to explore the game physics on its own, but those steps are much easier to implement based on how coding agents are able to iterate and explore solutions.
It doesn't. And you said it yourself:
> generating a video under prompt constraints is basically playing a game.
No. It's neither generating a game (that people can play) nor is it playing a game (it's generating a video).
Since it's not a model of the world in any sense of the word, there are issues with even the most basic object permanence. E.g. here's veo3 generating a GTA-style video. Oh look, the car spins 360 degrees and ends up on a completely different street from the one it was driving down previously: https://www.youtube.com/watch?v=ja2PVllZcsI
John Carmack founded Keen Technologies in 2022 and has been working seriously on AI since 2019. From his experience in the video game industry, he knows a thing or two about linear algebra and GPUs, that is, the underlying maths and the underlying hardware.
So, for all intents and purposes, he is an "AI guy" now.
He has built an AI system that fails to do X.
That does not mean there isn't an AI system that can do X. Especially considering that a lot is happening in AI, as you say.
Anyway, Carmack knows a lot about optimizing computations on modern hardware. In practice, that happens to be also necessary for AI. However, it is not __sufficient__ for AI.
Not sure why justanotherjoe is a credible source on who is and isn’t an expert in some new dialectic and euphemism for machine state management. You’re that nobody to me :shrug:
Yann LeCun is an AI guy and has simplified it as “not much more than physical statistics.”
A whole lot of AI is decades-old information theory books applied to modern computers.
Either a mem value is or isn’t what’s expected. Either an entire matrix of values is or isn’t what’s expected. Store the results of some such rules. There’s your model.
The words are made up and arbitrary because human existence is arbitrary. You’re being sold a bridge to nowhere.
One phenomenon that laid this bare for me, in a substantive way, was noticing an increasing number of reverent comments about Geohot in odd places here, which are just as quickly answered by people with a sense of how he actually works, as opposed to the keywords he associates himself with. But that only happens here, AFAIK.
Yapping, or, inducing people to yap about me, unfortunately, is much more salient to my expected mindshare than the work I do.
It's getting claustrophobic intellectually, as a result.
Example from the last week is the phrase "context engineering" - Shopify CEO says he likes it better than prompt engineering, Karpathy QTs to affirm, SimonW writes it up as fait accompli. Now I have to rework my site to not use "prompt engineering" and have a Take™ on "context engineering". Because of a couple tweets + a blog reverberating over 2-3 days.
Nothing against Carmack, or anyone else named, at all. In the context engineering case, for example, they're just sharing their thoughts in real time. (I.e. I don't wanna get rolled up into a downvote brigade because it seems like I'm affirming the loose assertion that Carmack is "not an AI guy", or that I'm criticizing anyone's conduct at all.)
EDIT: The context engineering example was not in reference to another post at the time of writing; now one is at the top of the front page.
The difference here is that your example shows a trivial statement and a change period of 3 days, whereas what Carmack is doing is taking years.
It sounds like the "best" AI without constraint would just be something like a replay of a record speedrun rather than a smaller set of heuristics for getting through a game, though the latter is clearly much more important with unseen content.
[1] https://instadeep.com/2021/10/a-simple-introduction-to-meta-...
https://arxiv.org/pdf/1804.03720
And like I get it, it’s fun to complain about the obnoxious and irrational AGI people. But the discussion about how people are using these things in their everyday lives is way more interesting.
I'm wondering whether anyone has tested the same model in two situations:
1) Bring it to superhuman level in game A and then present game B, which is similar to A, to it.
2) Present B to it without presenting A.
If 1) is not significantly better than 2) then maybe it is not carrying much "knowledge", or maybe we simply did not program it correctly.
Given the long list of dead philosophers of mind, if you have a trivial proof, would you mind providing a link?
If we were taking a walk and you asked me for an explanation of a mathematical concept I have not actually studied, I am fully capable of hazarding a casual guess within seconds, based on the other topics I have studied. This is the default approach of an LLM, except with much greater breadth and recall of studied topics than I, as a human, have.
This would be very different than if we sat down at a library and I applied the various concepts and theorems I already knew to make inferences, built upon them, and then derived an understanding based on reasoning of the steps I took (often after backtracking from several reasoning dead ends) before providing the explanation.
If you ask an LLM to explain its reasoning, it's unclear whether it just guessed the explanation and reasoning too, or whether that was actually the set of steps it took to reach the first answer it gave you. This is why LLMs are able to correct themselves after claiming strawberry has 2 r's: when providing (guessing again) their explanations, they make more "relevant" guesses.
A simple nonsense programming task would suffice. For example "write a Python function to erase every character from a string unless either of its adjacent characters are also adjacent to it in the alphabet. The string only contains lowercase a-z"
That task isn't anywhere in its training set so they can't memorise the answer. But I bet ChatGPT and Claude can still do it.
Honestly this is sooooo obvious to anyone that has used these tools, it's really insane that people are still parroting (heh) the "it just memorises" line.
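For reference, here is one reading of that "nonsense" task as a minimal Python sketch. The function name `keep_alphabet_neighbours` is hypothetical, and the sketch assumes that neighbours are judged in the original string (deleting one character does not change whether the next one survives) and that "adjacent in the alphabet" means the two letters differ by exactly one, with no wraparound from z to a.

```python
def keep_alphabet_neighbours(s: str) -> str:
    """Keep s[i] only if s[i-1] or s[i+1] is next to it in the alphabet.

    Assumptions: input is lowercase a-z, neighbours are taken from the
    original string (no cascading deletions), and 'z'/'a' do not wrap.
    """
    kept = []
    for i, ch in enumerate(s):
        neighbours = []
        if i > 0:
            neighbours.append(s[i - 1])
        if i + 1 < len(s):
            neighbours.append(s[i + 1])
        # Alphabet-adjacent means the code points differ by exactly one.
        if any(abs(ord(ch) - ord(n)) == 1 for n in neighbours):
            kept.append(ch)
    return "".join(kept)
```

Judging against the original string keeps the result independent of scan order; a cascading interpretation would instead need repeated passes until nothing else is removed.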
They generate statistically plausible answers (to simplify the answer) based on the training set and weights they have.
Of course, this is because I have spent a lot of time TRAINING to play chess and basically none training to play go.
I am good on guitar because I started training young but can't play the flute or piano to save my life.
Most complicated skills have basically no transfer or carry over other than knowing how to train on a new skill.
I guess it's a totally different level of control: instead of immediately choosing a certain button to press, you need to set longer-term goals, e.g. "press whatever sequence I need to over this time to end up closer to this result".
There is some kind of nested multidimensional thing to train on here instead of immediate limited choices
We train the models on what are basically shadows, and they learn how to pattern match the shadows.
But the shadows are only depictions of the real world, and the LLMs never learn about that.
A lot of intelligence is just pattern matching and being quick about it.
Current AI only does one of those (pattern matching, not evolution), and the prospects of simulating evolution are kind of bleak, given that I don’t think we can simulate a full living cell from scratch yet. Building a world model requires life (or something that has gone through a similar evolutionary survivorship path), not something that mimics life.
AI has beat the best human players in Chess, Go, Mahjong, Texas hold'em, Dota, Starcraft, etc. It would be really, really surprising that some Atari game is the holy grail of human performance that AI cannot beat.
Less quality of life focused, I don’t believe that the models he uses for this research are capable of more. Is it really that revealing?
The original paper "Playing Atari with Deep Reinforcement Learning" (2013) from Deepmind describes how agents can play Atari games, but these agents would have to be specifically trained on every individual game using millions of frames. To accomplish this, simulators were run in parallel, and much faster than in real-time.
Also, additional trickery was added to extract a reward signal from the games, and there is some minor cheating on supplying inputs.
What Carmack (and others before him) is interested in, is trying to learn in a real-life setting, similar to how humans learn.
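The 2013 paper used a deep Q-network over raw frames; the sketch below drops the network and uses a plain lookup table on a made-up five-state corridor, only to show the Q-learning update and the kind of hand-defined reward signal mentioned above. The environment, the names, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are illustrative assumptions, not the paper's setup.

```python
import random

N_STATES = 5  # hypothetical corridor: states 0..4, state 4 is the goal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Move one cell; bumping the left wall keeps you at 0. Reward 1 only at the goal."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy behaviour; ties are broken randomly so early episodes explore.
            if rng.random() < epsilon or q[state][LEFT] == q[state][RIGHT]:
                action = rng.choice((LEFT, RIGHT))
            else:
                action = LEFT if q[state][LEFT] > q[state][RIGHT] else RIGHT
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus the discounted best estimate at the next state.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def greedy_policy(q):
    """Best action in each non-terminal state according to the learned table."""
    return [LEFT if q[s][LEFT] > q[s][RIGHT] else RIGHT for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

Even this toy needs hundreds of episodes of trial and error before the table settles on "always go right", which hints at why per-game training on real Atari frames ran into the millions.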
But as impressive as this is, it’s easy to lose sight of the bigger picture: we’ve only scratched the surface of what artificial intelligence could be — because we’ve only scaled two modalities: text and images.
That’s like saying we’ve modeled human intelligence by mastering reading and eyesight, while ignoring touch, taste, smell, motion, memory, emotion, and everything else that makes our cognition rich, embodied, and contextual.
Human intelligence is multimodal. We make sense of the world through:
- Touch (the texture of a surface, the feedback of pressure, the warmth of skin)
- Smell and taste (deeply tied to memory, danger, pleasure, and even creativity)
- Proprioception (the sense of where your body is in space: how you move and balance)
- Emotional and internal states (hunger, pain, comfort, fear, motivation)
None of these are captured by current LLMs or vision transformers. Not even close. And yet, our cognitive lives depend on them.
Language and vision are just the beginning — the parts we were able to digitize first - not necessarily the most central to intelligence.
The real frontier of AI lies in the messy, rich, sensory world where people live. We’ll need new hardware (sensors), new data representations (beyond tokens), and new ways to train models that grow understanding from experience, not just patterns.
I respectfully disagree. Touch gives pretty cool skills, but language, video and audio are all that are needed for all online interactions. We use touch for typing and pointing, but that is only because we don't have a more efficient and effective interface.
Now I'm not saying that all other senses are uninteresting. Integrating touch, extensive proprioception, and olfaction is going to unlock a lot of 'real world' behavior, but your comment was specifically about intelligence.
Compare humans to apes and other animals and the thing that sets us apart is definitely not in the 'remaining' senses, but firmly in the realm of audio, video and language.
I probably made a mistake when I asserted that; I should have thought it over. Vision is evolutionarily older and more “primitive”, while language is uniquely human [or maybe, more broadly, primate, cetacean, cephalopod, avian...], symbolic, and abstract: arguably a different order of cognition altogether. But I maintain that each and every sense is important as far as human cognition, and its replication, is concerned.
If all humans lacked vision, the human race would definitely not do just fine.
Human neural networks are dynamic: they change and rearrange, grow and sever. An LLM is fixed and relies on context; if you give it the right answer, it won't "learn" that this is the correct answer unless it is fed back into the system and trained over months. What if it's only the right answer for a limited period of time?
To be intelligent, a machine must be able to train itself in real time and remember.
Given the architectures we have, they may also be the end. There’s been a lot of news about LLMs in the past couple of years, but have there been any headline-making breakthroughs anywhere else in AI?
Yeah, lots of stuff tied to robotics, for instance; this overlaps with vision, but the advances go beyond vision.
Audio has seen quite a bit. And I imagine there is stuff happening in niche areas that just aren't as publicly interesting as language, vision/imagery, audio, and robotics.
As they said in Doctor Who: Daleks aren't brains in a machine, they are the machine!
Same is true for humans. We really are the whole body, we're not just driving it around.
The brain could. Of course it could. It's just a signals processing machine.
But would it be missing anything we consider core to the way humans think? Would it struggle with parts of cognition?
For example: experiments were done with cats growing up in environments with vertical lines only. They were then put in a normal room and had a hard time understanding flat surfaces.
https://computervisionblog.wordpress.com/2013/06/01/cats-and...
“We’ve barely scratched the surface with Rust, so far we’re only focused on code and haven’t even explored building mansions or ending world hunger”
And at the same time I have noticed that people don’t understand the difference between an S-curve and an exponential function. They can look almost identical at certain intervals.
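To make that concrete: for a logistic curve s(t) = L / (1 + e^(-r(t - t0))), the exponential L·e^(r(t - t0)) overshoots it by a relative amount of exactly e^(r(t - t0)), since e/s = e^x(1 + e^(-x)) = e^x + 1 with x = r(t - t0). So eight time units before the midpoint (with r = 1) the two differ by about 0.03%, and only near the midpoint does the gap become obvious. The minimal Python sketch below checks this numerically; the parameter names `cap`, `rate`, and `midpoint` are illustrative choices.

```python
import math


def logistic(t, cap=1.0, rate=1.0, midpoint=0.0):
    """S-curve: starts near 0, rises fastest at `midpoint`, saturates at `cap`."""
    return cap / (1.0 + math.exp(-rate * (t - midpoint)))


def early_exponential(t, cap=1.0, rate=1.0, midpoint=0.0):
    """The exponential the logistic approaches for t well below `midpoint`."""
    return cap * math.exp(rate * (t - midpoint))


def relative_gap(t, cap=1.0, rate=1.0, midpoint=0.0):
    """How far the exponential overshoots the S-curve, as a fraction of the S-curve."""
    s = logistic(t, cap, rate, midpoint)
    e = early_exponential(t, cap, rate, midpoint)
    return (e - s) / s


if __name__ == "__main__":
    for t in range(-8, 4, 2):
        print(f"t={t:+d}  logistic={logistic(t):.5f}  "
              f"exponential={early_exponential(t):.5f}  gap={relative_gap(t):.2%}")
```

If you only have data from the left tail, both curves fit it almost equally well; the data alone cannot tell you which one you are on.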
I kind of wonder if libraries like PyTorch have hurt experimental development. So many basic concepts no one thinks about anymore because they just use the out-of-the-box solutions. And maybe those solutions are great and those parts are "solved", but I am not sure. How many models are using someone else's tokenizer, or someone else's strapped-on vision model just to check a box in the model card?
When the foundation layer at a given moment doesn't yield a return on intellectual exploration (say, because you can overcompensate with VC-funded raw compute and make more progress elsewhere), fewer people will go there.
But inevitably, as other domains reach diminishing returns, bright minds will look around for where significant gains can be found for their effort.
And that is how the next generation of PyTorch, or of foundational technologies generally, will evolve.
What about simulation: models can make 3D objects so why not give them a physics simulator? We have amazing high fidelity (and low cost!) game engines that would be a great building block.
What about rumination: behind every Cursor rule for example, is a whole story of why a user added it. Why not take the rule, ask a reasoning model to hypothesize about why that rule was created, and add that rumination (along with the rule) to the training data. Providing opportunities to reflect on the choices made by their users might deepen any insights, squeezing more juice out of the data.
We let models write code and run it, which gives them a high chance of getting arithmetic right.
Solving the “crossing the river” problem by letting the model create and run a simulation would give a pretty high chance of getting it right.
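A minimal sketch of that idea, assuming the classic farmer/wolf/goat/cabbage variant (the comment doesn't say which river-crossing puzzle it means): breadth-first search over which bank everyone is on, where the "simulation" is just a safety check applied to each state. All names here are illustrative.

```python
from collections import deque

# State: (farmer, wolf, goat, cabbage), each 0 for the left bank or 1 for the right.
START = (0, 0, 0, 0)
GOAL = (1, 1, 1, 1)


def is_safe(state):
    """Nothing gets eaten: wolf/goat and goat/cabbage are never alone together."""
    farmer, wolf, goat, cabbage = state
    if wolf == goat != farmer:
        return False
    if goat == cabbage != farmer:
        return False
    return True


def solve():
    """Return the shortest list of states from START to GOAL, or None."""
    queue = deque([(START, [START])])
    seen = {START}
    while queue:
        state, path = queue.popleft()
        if state == GOAL:
            return path
        farmer = state[0]
        # The farmer crosses alone (None) or with one item from his own bank.
        for passenger in (None, 1, 2, 3):
            if passenger is not None and state[passenger] != farmer:
                continue
            nxt = list(state)
            nxt[0] = 1 - farmer
            if passenger is not None:
                nxt[passenger] = 1 - farmer
            nxt = tuple(nxt)
            if is_safe(nxt) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None
```

Because BFS explores states in order of the number of crossings, the first path that reaches the goal is a shortest one; here that is the familiar seven crossings.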
Each Cursor rule is a byproduct of tons of work and probably contains lots that can be unpacked. Any research on that?
Innovation happens in the cracks: recognizing holes, intersections, and tangents in old ideas. It has been said that innovation is done on the shoulders of giants.
So AI can be an express elevator up to an army of giants' shoulders? It all depends on how you use the tools.
As with most things, the truth lies somewhere in the middle. LLMs can be helpful as a way of accelerating certain kinds and certain aspects of research but not others.
I wonder if we can mine patent databases for old ideas that never worked out in the past, but now are more useful. Perhaps due to modern machining or newer materials or just new applications of the idea.
It reminds me of an AI talk a few decades ago, about how the cycle goes: more data -> more layers -> repeat...
Anyways, I'm not sure how your comment relates to these two avenues of improvement.
The insight into the structure of the benzene ring famously came in a dream, hadn't been seen before, but was imagined as a snake biting its own tail.
--- start quote ---
The empirical formula for benzene had been long known, but its highly unsaturated structure was a challenge to determine. Archibald Scott Couper in 1858 and Joseph Loschmidt in 1861 suggested possible structures that contained multiple double bonds or multiple rings, but the study of aromatic compounds was in its earliest years, and too little evidence was then available to help chemists decide on any particular structure.
More evidence was available by 1865, especially regarding the relationships of aromatic isomers.
[ Kekule claimed to have had the dream in 1865 ]
--- end quote ---
The dream claim came from Kekule himself, 25 years after his original proposal, a proposal he had to modify 10 years after making it.
Can you imagine if we applied the same gatekeeping logic to science?
Imagine you weren't allowed to use someone else's scientific work or any derivative of it.
We would make no progress.
The only legitimate defense I have ever seen here revolves around IP and copyright infringement, which I couldn't care less about.
Slight difference to those methods, wouldn't you agree?
It can probably remember more facts about a topic than a PhD in that topic, but the PhD will be better at thinking about that topic.
Why should the model need to memorize facts we already have written down somewhere?
"Thinking" is too broad a term to apply usefully but I would say its pretty clear we are not close to AGI.
So can a notebook.
As a simple analogy, read out the following sentence multiple times, stressing a different word each time.
"I never said she stole my money"
Note how the meaning changes and is often unique?
That is a lens into the frame problem and its inverse, the specification problem.
The above problem quickly becomes tower-complete, and recent studies suggest that RL is reinforcing or increasing the weight of existing patterns.
As the open domain frame problem and similar challenges are equivalent to HALT, finding new ways to extract useful information will be important for generalization IMHO.
Synthetic data is useful, but not a complete solution, especially for tower problems.
And as far as synthetic vs. real data goes, there are a lot of gaps in LLM knowledge; and vision models suffer from "limited tags", which used to have workarounds with textual embeddings and the like, but those fell by the wayside as LoRA, ControlNet, etc. appeared.
There are people who are fairly well known that LLMs have no idea about. There are things in books I own that the AI confidently tells me are either wrong or don't exist.
That one page about compressing 1 GB of Wikipedia as small as possible implicitly and explicitly states that AI is "basically compression", and if the data isn't there, it's not in the compressed set (the weights) either.
And I'll reply to another comment here, about "24/7 rolling / for-looped" AI: I thought of doing this when I first found out about LLMs, but context windows are the enemy here. I have a couple of ideas about how to build a continuous AI, but I don't have the capital to test them out.
The original idea of connectionism is that neural networks can represent any function, which is the fundamental mathematical fact. So we should be optimistic, neural nets will be able to do anything. Which neural nets? So far people have stumbled on a few productive architectures, but it appears to be more alchemy than science. There is no reason why we should think there won't be both new ideas and new data. Biology did it, humans will do it too.
> we’re engaged in a decentralized globalized exercise of Science, where findings are shared openly
Maybe the findings are shared, if they make the Company look good. But the methods are not anymore
The ability to collect gene expression data at a tissue-specific level was only invented and automated in the last 4-5 years (see 10X Genomics Xenium, MERFISH). We've only recently figured out how to collect this data at the scale of millions of cells. A breakthrough on this front may be the next big area of advancement.
- Moore's law petering out, steering hardware advancements towards parallelism
- Fast-enough internet creating shift to processing and storage in large server farms, enabling both high-cost training and remote storage of large models
- Social media + search both enlisting consumers as data producers, and necessitating the creation of armies of Mturkers for content moderation + evaluation, later becoming available for tagging and rlhf
- A long-term shift to a text-oriented society, beginning with print capitalism and continuing through the rise of "knowledge work" through to the migration of daily tasks (work, bill paying, shopping) online, that allows a program that only produces text to appear capable of doing many of the things a person does
We may have previously had the technical ideas in the 1990s but we certainly didn't have the ripened infrastructure to put them into practice. If we had the dataset to create an LLM in the 90s, it still would have been astronomically cost-prohibitive to train, both in CPU and human labor, and it wouldn't have as much of an effect on society because you wouldn't be able to hook it up to commerce or day-to-day activities (far fewer texts, emails, ecommerce).
Because new methods unlock access to new datasets.
Edit: Oh I see this was a rhetorical question answered in the next paragraph. D'oh
"There weren't really any advancements from around 2018. The majority of the 'advancements' were in the amount of parameters, training data, and its applications. What was the GPT-3 to ChatGPT transition? It involved fine-tuning, using specifically crafted training data. What changed from GPT-3 to GPT-4? It was the increase in the number of parameters, improved training data, and the addition of another modality. From GPT-4 to GPT-40? There was more optimization and the introduction of a new modality. The only thing left that could further improve models is to add one more modality, which could be video or other sensory inputs, along with some optimization and more parameters. We are approaching diminishing returns." [1]
10 months ago, around the o1 release:
"It's because there is nothing novel here from an architectural point of view. Again, the secret sauce is only in the training data. O1 seems like a variant of RLRF https://arxiv.org/abs/2403.14238
Soon you will see similar models from competitors." [2]
Winter is coming.
1. https://news.ycombinator.com/item?id=40624112
2. https://news.ycombinator.com/item?id=41526039
The reason we don't do it isn't because it's hard, it's because it yields worse results for increased cost.
> i used chatgpt for the first time today and have some lite rage if you wanna hear it. tldr it wasnt correct. i thought of one simple task that it should be good at and it couldnt do that.
> (The kangxi radicals are neatly in order in unicode so you can just ++ thru em. The cjks are not. I couldnt see any clear mapping so i asked gpt to do it. Big mess i had to untangle manually anyway it woulda been faster to look them up by hand (theres 214))
> The big kicker was like, it gave me 213. And i was like, "why is one missing?" Then i put it back in and said count how many numbers are here and it said 214, and there just werent. Like come on you SHOULD be able to count.
If you can make the language models actually interface with what we've been able to do with computers for decades, i imagine many paths open up.
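As a small illustration of that kind of interfacing, the Kangxi-to-CJK mapping from the quoted message can be computed deterministically instead of asked for. The sketch relies on Python's standard `unicodedata` module: each of the 214 characters in the Kangxi Radicals block (U+2F00 to U+2FD5) carries a compatibility decomposition, so NFKC normalization maps it to the corresponding CJK Unified Ideograph. The function and constant names are hypothetical.

```python
import unicodedata

KANGXI_FIRST = 0x2F00  # KANGXI RADICAL ONE
KANGXI_LAST = 0x2FD5   # last of the 214 Kangxi radicals


def kangxi_to_unified():
    """Map each Kangxi radical character to its CJK Unified Ideograph via NFKC."""
    mapping = {}
    for cp in range(KANGXI_FIRST, KANGXI_LAST + 1):
        radical = chr(cp)
        mapping[radical] = unicodedata.normalize("NFKC", radical)
    return mapping


if __name__ == "__main__":
    table = kangxi_to_unified()
    print(len(table))
    for radical, unified in list(table.items())[:5]:
        print(f"U+{ord(radical):04X} -> U+{ord(unified):04X} {unified}")
```

Counting is then `len(kangxi_to_unified())`, which is 214 by construction: exactly the kind of precise step that is better outsourced to ordinary code than guessed by a model.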
There’s an infinite repertoire of such tasks that combine AI capabilities with traditional computer algorithms, and I don’t think we have a generic way of having AI autonomously outsource whatever parts require precision in a reliable way.
This happens to be the basis of every aspect of our biology.
The iPhone is a perfect example. There were smartphones with cameras and web browsers before. But when the iPhone launched, it added a capacitive touch screen that was so responsive there was no need for a keyboard. The importance of that one technical innovation can't be overstated.
Then the "new new thing" is followed by a period of years where the innovation is refined, distributed, applied to different contexts, and incrementally improved.
The iPhone launched in 2007 is not really that different from the one you have in your pocket today. The years since have been about improvements. The web browser before that is also pretty much the same as the one you use today.
We've seen the same pattern happen with LLMs. The author of the article points out that many of AI's breakthroughs have been around since the 1990s. Sure! And the Internet was created in the 1970s and mobile phones were invented in the 1980s. That doesn't mean the web and smartphones weren't monumental technological events. And it doesn't mean LLMs and AI innovation are somehow not proceeding apace.
It's just how this stuff works.
Each crawl on the internet is actually a discrete chunk of a more abstractly defined, constant influx of information streams. Let's call them rivers (it's a big stream).
These rivers can dry up, present seasonal shifts, be poisoned, be barraged.
It will never "get there" and gather enough data to "be done".
--
Regarding "new ideas in AI", I think there could be. But this whole thing is not about AI anymore.