"Why Are the Critical Value and Emergent Behavior of Large Language Models (LLMs) Fake?"
Hope this helps anyone else who thinks this is about the R Programming Language or the correlation coefficient symbolized by r
thomastjeffery · 9d ago
They should have dropped "the" instead.
knorker · 9d ago
Yet another reason why Title Case Is Bad For Everything. I must have read that title 4 times before clicking to see whether anyone else understood it.
kylebenzle · 9d ago
I agree! But it only took me a second read to realize they were obviously using R -> are to save space, not 4!
esafak · 9d ago
dang: Please restore the word "Are" in the title and replace "and" with &, or use the initialism LLMs if you ran out of space. R makes it sound like a critical exponent, in the context of emergence and phase transitions. https://en.wikipedia.org/wiki/Critical_exponent
The preprint is from 2022 and the article from 2024. There may well be more recent research that builds on this paper.
croemer · 9d ago
TFA thinks that logarithmic x-axes are misleading. That's just not true here. I looked at the preprint and it's the right way to display data over such a large range.
The whole point is that below a threshold, you can increase training by a factor of 10 or 100 and there's barely any change. But at the threshold, the same relative increase suddenly produces big improvements.
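A quick matplotlib sketch of that point (toy numbers and a made-up threshold, not data from the paper): a curve that's essentially flat from 10^18 to 10^22 FLOPs and then jumps is only visible if each factor of 10 gets equal space on the x-axis.

    import numpy as np
    import matplotlib.pyplot as plt

    # Toy "emergent" accuracy curve: flat over several orders of
    # magnitude of compute, then a sharp rise past a threshold.
    # Illustrative numbers only, not from the preprint.
    flops = np.logspace(18, 24, 200)  # training FLOPs, 10^18 .. 10^24
    accuracy = 1 / (1 + np.exp(-4 * (np.log10(flops) - 22)))  # sigmoid in log-FLOPs

    plt.plot(flops, accuracy)
    plt.xscale("log")  # equal spacing per factor of 10
    plt.xlabel("Training FLOPs (log scale)")
    plt.ylabel("Accuracy")
    plt.title("Toy emergence curve: flat, then a jump at a threshold")
    plt.show()

On a linear x-axis, everything below ~10^23 FLOPs would be squashed into the left edge of the plot.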
yorwba · 9d ago
> you can increase training by a factor of 10 or 100 and there's barely any change.
How do you know accuracy didn't increase by a factor of 100, from 0.001% to 0.1%? Another factor of 100 for both and you're at 10% with "only" 10⁴ times as many training FLOPs!
If you want to use a nonlinear transformation to show very small and very large inputs in the same graph, surely a similar effort should be made to make changes in accuracy near the limits of 0% and 100% more visible.
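To make that concrete, here's a sketch (again with invented numbers) where accuracy follows a smooth power law in compute: on a linear y-axis it looks like sudden emergence, while a logit-scaled y-axis, which stretches the region near 0% and 100%, shows steady progress the whole way.

    import numpy as np
    import matplotlib.pyplot as plt

    # Accuracy that grows by a constant factor per fixed factor of
    # compute: smooth on a logit y-axis, "emergent"-looking on a
    # linear one. Invented numbers, not from the paper.
    flops = np.logspace(18, 24, 7)           # 10^18 .. 10^24 FLOPs
    accuracy = 1e-5 * (flops / 1e18) ** 0.7  # smooth power law, tops out ~16%

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
    for ax, yscale in ((ax1, "linear"), (ax2, "logit")):
        ax.plot(flops, accuracy, marker="o")
        ax.set_xscale("log")
        ax.set_yscale(yscale)  # "logit" stretches values near 0 and 1
        ax.set_title(f"y-axis: {yscale}")
        ax.set_xlabel("Training FLOPs")
        ax.set_ylabel("Accuracy")
    plt.tight_layout()
    plt.show()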
croemer · 9d ago
You're right that double-log axes could be better, but the author doesn't ask for the y-axis to be log-scaled; they ask for the x-axis to be linear.
JSR_FDED · 12d ago
I’ve always struggled with logarithmic charts. This is a perfect example of when not to use them.
croemer · 9d ago
Why is it a perfect example? Logarithmic axes are very useful, and there's no way a plot with a linear axis would have made sense here.
Logarithmic axes are the norm for showing scaling, power-law effects, etc. Once you get used to them, there's no way back.
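For anyone not used to them: a power law y = c*x^k turns into a straight line on log-log axes, since log y = log c + k*log x, which is exactly why they're the default for scaling plots. A minimal example:

    import numpy as np
    import matplotlib.pyplot as plt

    # A power law is a straight line on log-log axes:
    # log y = log c + k * log x. Illustrative values only.
    x = np.logspace(0, 6, 50)
    y = 3.0 * x ** 0.5  # c = 3, k = 0.5

    plt.loglog(x, y)  # log-scale both axes
    plt.xlabel("x (log scale)")
    plt.ylabel("y (log scale)")
    plt.title("y = 3 * x^0.5 is linear on log-log axes")
    plt.show()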
jbentley1 · 9d ago
It's fairly easy to observe that LLMs do things that were out of reach for software only a short time ago, and that their abilities are improving fast. Yet some people make trivial arguments like this and try to say AI is somehow 'fake'.