From Multi-Head to Latent Attention: The Evolution of Attention Mechanisms
131 points by mgninad on 8/30/2025, 5:45:24 AM | 35 comments | vinithavn.medium.com ↗
After the success of transfer learning for computer vision in the mid-2010s, it was obvious that NLP needed its own transfer learning approach and AlexNet moment.
Lots of research focus around that time was on recurrent models—because that was the conventional wisdom about how you model sequences. Markov chains had led to vanilla RNNs, LSTMs, GRU, etc., which all seemed tantalizingly promising. (MAMBA fans take note.) Attention mechanisms were even used in recurrent models…but so was everything else.
Then came transformers—mixing all the then-best-practice bits with the heretical idea of just not giving a shit about O(n^2) complexity. The vanilla transformer used an encoder-decoder structure like the best translation models had been doing; it used a stack of identical blocks to nudge the output along through the pipe like ResNet; it was pretrained on a multi-task objective using a large document corpus. But then it jettisoned all the other complexity and just let it all rest on the attention mechanism to capture long range dependencies.
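To make the "not giving a shit about O(n^2)" point concrete, here is a minimal NumPy sketch of scaled dot-product attention (the core operation the comment is describing; shapes and names are illustrative, not from any particular implementation). The pairwise score matrix is seq_len × seq_len, which is where the quadratic cost comes from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d) arrays. The (seq_len x seq_len) score
    # matrix below is the source of the O(n^2) cost in sequence length.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d) weighted values

# Toy example: 6 tokens, 4-dimensional representations.
n, d = 6, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

Every token attends to every other token in one matrix multiply, which is exactly what lets attention capture long-range dependencies without recurrence.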
It was immediately thrilling, but it was also completely impractical. (I think the largest model had a 500ish token context limit and required bigger-than-hobbyist GPUs.) So it mostly sat on a shelf while people used other “good enough” models for a few years, until the hardware got better and a couple of folks proved that it could actually work to run these things at massive scale.
And now here we are.
I think they knew what they were saying at the time, but I don’t think they knew that it would remain true for years.
I feel like there is a step missing here...
People were using RNN encoders/decoders for machine translation - the encoder was used to make a representation (fixed-size vector) of the source language sentence, the decoder generated the target language sentence from the source representation.
The issue that people were bumping into is that the fixed-sized vector bottlenecked the encoder/decoder architecture. Representing a variable-length source sentence as a fixed-size vector leads to a loss of information that increases with the source sentence length.
People started adding attention to the decoder as a way to work around this issue. Each decoder step could attend to every token (well, RNN hidden representation) of the source sentence. So, this led to the RNN + attention architecture.
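The decoder-attends-to-encoder-states step can be sketched in a few lines of NumPy. This is a rough additive (Bahdanau-style) attention sketch with made-up weights and shapes; in the real RNN setup the encoder states and decoder state would come from trained recurrent networks, and the projection matrices would be learned.

```python
import numpy as np

rng = np.random.default_rng(1)
src_len, hid = 5, 8

# Hypothetical per-token encoder hidden states and current decoder state
# (stand-ins for what the RNN encoder/decoder would actually produce).
enc_states = rng.standard_normal((src_len, hid))   # one state per source token
dec_state = rng.standard_normal(hid)               # current decoder state

# Additive attention: score each encoder state against the decoder
# state, softmax the scores, then take the weighted sum.
W_enc = rng.standard_normal((hid, hid))
W_dec = rng.standard_normal((hid, hid))
v = rng.standard_normal(hid)

scores = np.tanh(enc_states @ W_enc + dec_state @ W_dec) @ v  # (src_len,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                 # attention weights, sum to 1
context = alpha @ enc_states         # (hid,) context vector for this step
print(context.shape)  # (8,)
```

The context vector is recomputed at every decoder step, so the model is no longer squeezing the whole source sentence through one fixed-size vector — which is exactly the workaround the comment describes.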
The title 'Attention is all you need' comes from the realization that in this architecture the RNN is not needed, for either the encoder or the decoder. It's a message to a field that was using RNNs + attention (to avoid the bottleneck). Of course, the rest was born from that: encoder-only transformer models like BERT and decoder-only models like current LLMs.
I remain highly skeptical. I doubt that transformers are the best architecture possible, but they set a high bar. And it sure seems like people who keep making the suggestion that "transformers aren't the future" aren't good enough to actually clear that bar.
Being able to provide an immediate replacement is not a requirement to point out limitations in current technology.
If any midwit can say "X is deeply flawed" but no one can put together a Y that would beat X, then clearly, pointing out the flaws was never the bottleneck at all.
It's not a linear process so I'm not sure the "bottleneck" analogy holds here.
We're not limited to only talking about "the bottleneck". I think the argument is more that we're very close to optimal results for the current approach/architecture, so getting superior outcomes from AI will actually require meaningfully different approaches.
Though honestly I don’t think new neural network architectures are going to get us over this local maximum, I think the next steps forward involve something that’s
1. Non lossy
2. Readily interpretable
https://arcprize.org/blog/hrm-analysis#analyzing-hrms-contri...
Nothing about the human brain is "readily interpretable", and artificial neural networks — which, unlike brains, can be instrumented and experimented on easily — tend to resist interpretation nonetheless.
If there were an easy way to reduce ML to "readily interpretable" representations, someone would have done so already. If there were architectures that performed similarly but were orders of magnitude more interpretable, they would be used, because interpretability is desirable. Instead, we get what we get.
This is like saying we're going to weave carpets from already woven carpets. Attention wasn't the core or key to the problem, it was finding out a way to disentangle the embedded intents and discover what's really happening in the arbitrary.
The results might work as a magic trick for a bit, but they unravel eventually. The whole shebang is a magic act posing as inference, learning, intelligence, reasoning. None of these things are really happening, because they don't get to the source or address the actual behavior; it's a short-cut hack.
https://pubmed.ncbi.nlm.nih.gov/31489566/
I suspect closed models aren't doing anything too radically different from what's presented here.