"You don't think about memories when you design hardware"
I found that comment interesting because memory is mostly what I think about when I design hardware. Memory access is by far the slowest thing, so managing memory bandwidth is absolutely critical, and it uses a significant portion of the die area.
Also, a significant portion of the microarchitecture of a modern CPU is all about managing memory accesses.
If there were a hypothetical high-level CPU language that somehow encoded all the information the microarchitecture needs to measure to manage memory access, then it would likely match today's designs in performance (assuming the CPU team did a good job with those measurements), while no longer needing all the extra hardware that did the measuring. So I see that as a win.
The main problem is, I have absolutely no idea how to do that, and unfortunately, I haven't met anyone else who knows either. Hence, tons of bookkeeping logic in CPUs persists.
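The closest thing today's ISAs give you is explicit hints, where software tells the memory system about accesses it already knows about. A rough sketch (the gather loop is invented; __builtin_prefetch is the GCC/Clang intrinsic):

    // Toy sketch: explicitly telling the memory system about upcoming accesses,
    // which is about as close as current ISAs get to "encoding what the
    // microarchitecture would otherwise have to discover by measurement".
    // Assumes GCC or Clang for __builtin_prefetch; the loop itself is made up.
    #include <cstddef>

    double gather_sum(const double* table, const int* indices, std::size_t n) {
        constexpr std::size_t kAhead = 16;  // arbitrary prefetch distance
        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + kAhead < n) {
                // The future address is computable right now, so hand it to the
                // hardware instead of hoping a predictor spots the irregular pattern.
                __builtin_prefetch(&table[indices[i + kAhead]], /*rw=*/0, /*locality=*/1);
            }
            total += table[indices[i]];
        }
        return total;
    }

Of course that only works when the future address is computable early, which is exactly the information a hypothetical high-level encoding would somehow have to surface.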
gchadwick · 15m ago
This certainly rings true with my own experiences (worked on both GPUs and CPUs and now doing AI inference at UK startup Fractile).
> If there were a hypothetical high-level CPU language that somehow encoded all the information the microarchitecture needs to measure to manage memory access,
I think this is fundamentally impossible because of dynamic behaviours. It's very tempting to assume that if you're clever enough you can work this all out ahead of time, encode what you need as static information, and have the whole computation run like clockwork on the hardware, with little or no area spent on scheduling, stalling, buffering, etc. But I think history has shown over and over that this just doesn't work (for general computation at least; it's more feasible in restricted domains). There are always fiddly details in real workloads that surprise you, and if you've got an inflexible system you've got no give, so you end up 'stopping the world' (or some significant part of it) to deal with them, killing your performance.
Notably, running transformer models feels like one of those restricted domains where you could do this well, but dig in and there's plenty of dynamic behaviour in there, enough that you can't escape the problems it causes.
Legend2440 · 19m ago
Keep in mind this article is almost 20 years old.
Memory access used to matter less because CPUs were a lot slower relative to memory. When I was a kid you could get a speedup by using lookup tables for trig functions - you'd never do that today, it's faster to recalculate.
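For anyone who never saw the trick, a minimal sketch of the idea (table size and the interpolation-free lookup are arbitrary choices, not a recommendation):

    // Old-school trick: a coarse sine lookup table versus just calling std::sin.
    // On an 80s/90s CPU the table usually won; on modern hardware the table's
    // cache misses can easily cost more than recomputing the value.
    #include <array>
    #include <cmath>
    #include <cstddef>

    constexpr std::size_t kTableSize = 1024;   // arbitrary resolution
    constexpr double kTwoPi = 6.283185307179586;

    const std::array<double, kTableSize>& sine_table() {
        static const auto table = [] {
            std::array<double, kTableSize> t{};
            for (std::size_t i = 0; i < kTableSize; ++i)
                t[i] = std::sin(kTwoPi * static_cast<double>(i) / kTableSize);
            return t;
        }();
        return table;
    }

    // Nearest-entry lookup, no interpolation: one divide, one wrap, one load.
    double table_sin(double x) {
        double turns = x / kTwoPi;
        turns -= std::floor(turns);            // wrap into [0, 1)
        return sine_table()[static_cast<std::size_t>(turns * kTableSize) % kTableSize];
    }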
foota · 25m ago
This is almost the opposite of what you're referring to I think, but interestingly Itanium had instruction level parallelism hard-coded by the compiler as opposed to determined by the processor at runtime.
eigenform · 48m ago
Also weird because pipelining is very literally "insert memories to cut some bigger process into parts that can occur in parallel with one another"
uticus · 25m ago
"memories" could also refer to registers, or even a stack-based approach
jonathaneunice · 1h ago
"High-level CPUs" are a tarpit. Beautiful idea in theory, but history shows they're a Bad Idea.
Xerox, LMI, Symbolics, Intel iAPX 432, UCSD p-System, Jazelle and PicoJava—just dozens of fancy designs for Lisp, Smalltalk, Ada, Pascal, Java—yet none of them ever lasted more than a product iteration or three. They were routinely outstripped by MOT 68K, x86, and RISC CPUs. Whether your metric of choice is shipment volumes, price/performance, or even raw performance, they have routinely underwhelmed. A trail of tears for 40+ years.
theredleft · 1m ago
Throwing in "a trail of tears" at the end is pretty ridiculous. It's just hardware.
switchbak · 8m ago
I'm just a software guy, so my thoughts are probably painfully naive here.
My understanding is that we're mostly talking about economics - eg: that there's no way a Java/Lisp CPU could ever compete with a general purpose CPU. That's what I thought was the reason for the Symbolics CPU decline vs general purpose chips.
It seems like some hardware co-processing could go a long way for some purposes though? GC in particular seems like it would be amenable to some dedicated hardware.
These days we're seeing dedicated silicon for "AI" chips in newer processors. That's not a high level CPU as described in the article, but it does seem like we're moving away from purely general purpose CPUs into a more heterogeneous world of devoting more silicon for other important purposes.
irq-1 · 1h ago
That's true, but isn't it an issue of price and volume? Specialized network cards are OK (in the market.)
jonathaneunice · 57m ago
Probably! Everything runs on economics. Network and graphics accelerators did just fine—but then they were (or became over time) high-volume opportunities. Volume drives investment; investment drives progress.
do_not_redeem · 55m ago
Those specialized network cards have ASICs that parse network packets and don't need additional memory to do their job. You can easily build a fixed-size hardware register to hold an IPv4 packet header, for example.
But a "high level CPU" is all about operating on high-level objects in memory, so your question becomes, how do you store a linked list in memory more efficiently than a RISC CPU? How do you pointer chase to the next node faster than a RISC CPU? Nobody has figured it out yet, and I agree with the article, I don't see how it's possible. CPUs are already general-purpose and very efficient.
delta_p_delta_x · 1h ago
As James Mickens says in The Night Watch[1],
“Why would someone write code in a grotesque language that exposes raw memory addresses? Why not use a modern language with garbage collection and functional programming and free massages after lunch?” Here’s the answer: Pointers are real. They’re what the hardware understands. Somebody has to deal with them. You can’t just place a LISP book on top of an x86 chip and hope that the hardware learns about lambda calculus by osmosis.
But I feel there's a middle ground between LISP and raw assembly and mangling pointers. For instance, modern C++ and Rust (and all the C++ 'successors', like Carbon, Circle, Safe C++, etc) have a very novel and clear set of ownership, view, and span semantics that make more declarative, functional, side-effect-free programming easy to write and read, while still maintaining high performance—most notably without unnecessarily throwing copies around.
A few days ago a friend was asking me about an HPC problem: he'd written a bog-standard nested for loop with indices, was observing poor performance, and asked me if using C++23 std::ranges::views would help. It did, but I also fundamentally reworked his data flow. I saw something like a 5x performance improvement, even though I was writing at a higher level of abstraction than he was.
[1]: https://www.usenix.org/system/files/1311_05-08_mickens.pdf
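I can't share his code, so here's a purely invented illustration of the *shape* of that kind of rework (the real problem and the 5x number come from the anecdote above, not from this sketch): a version that materialises a temporary per outer iteration versus a lazy C++20/23 views pipeline that streams each row once.

    #include <cstddef>
    #include <ranges>
    #include <vector>

    struct Sample { double value; bool valid; };

    // Before: index-based nested loops, allocating a temporary per outer iteration.
    double total_naive(const std::vector<std::vector<Sample>>& rows) {
        double total = 0.0;
        for (std::size_t i = 0; i < rows.size(); ++i) {
            std::vector<double> squares;                        // needless allocation
            for (std::size_t j = 0; j < rows[i].size(); ++j)
                if (rows[i][j].valid)
                    squares.push_back(rows[i][j].value * rows[i][j].value);
            for (std::size_t j = 0; j < squares.size(); ++j)
                total += squares[j];
        }
        return total;
    }

    // After: the same computation as a lazy pipeline -- no temporaries, one pass
    // over each row, and the intent is easier to read.
    double total_views(const std::vector<std::vector<Sample>>& rows) {
        double total = 0.0;
        for (const auto& row : rows) {
            auto squares = row
                | std::views::filter([](const Sample& s) { return s.valid; })
                | std::views::transform([](const Sample& s) { return s.value * s.value; });
            for (double x : squares)
                total += x;
        }
        return total;
    }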
It's my strong opinion that Von Neumann's architecture is great for general purpose problem solving. However for computing LLMs and similar fully known execution plans, it would be far better to decompose them into a fully parallel and pipelined graph to be executed on a reconfigurable computing mesh.
My particular hobby horse is the BitGrid, a systolic array of 4x4 bit look up tables clocked in 2 phases to eliminate race conditions.
My current estimate is that it could save 95% of the energy for a given computation.
Getting rid of RAM, and only moving data between adjacent cells offers the ability to really jack up the clock rate because you're not driving signal across the die.
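For the curious, here's a toy software model of the kind of cell I mean (the bit ordering and phase assignment below are just my shorthand, not a spec):

    // Toy model of one BitGrid cell: 4 input bits, 4 output bits, each output
    // driven by its own 16-entry lookup table over the same 4 inputs.
    #include <array>
    #include <cstdint>

    struct BitGridCell {
        // One 16-bit truth table per output: bit i of luts[o] is output o's
        // value when the 4 input bits, read as a number, equal i.
        std::array<std::uint16_t, 4> luts{};

        std::uint8_t evaluate(std::uint8_t inputs) const {   // low 4 bits used
            std::uint8_t out = 0;
            for (int o = 0; o < 4; ++o)
                if ((luts[o] >> (inputs & 0xF)) & 1u)
                    out |= static_cast<std::uint8_t>(1u << o);
            return out;
        }
    };

    // Two-phase clocking, conceptually: cells on the "even" checkerboard colour
    // latch new outputs in phase A and "odd" cells in phase B, so no cell ever
    // reads a neighbour that is changing in the same phase -- hence no races.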
elchananHaas · 1h ago
For sure. The issue is that many AI workloads require terabytes per second of memory bandwidth and are on the cutting edge of memory technologies. As long as you can get away with little memory usage you can have massive savings, see Bitcoin ASICs.
The great thing about the von Neumann architecture is that it is flexible enough to have all sorts of operations added to it, including specialized matrix multiplication operations and async memory transfers. So I think it's here to stay.
librasteve · 1h ago
occam (MIMD) is an improvement over CUDA (SIMD)
with latest (eg TSMC) processes, someone could build a regular array of 32-bit FP transputers (T800 equivalent):
- 8000 CPUs in same die area as an Apple M2 (16 TIPS) (ie. 36x faster than an M2)
- 40000 CPUs in single reticle (80 TIPS)
- 4.5M CPUs per 300mm wafer (10 PIPS)
the transputer async link (and C001 switch) allows for decoupled clocking, CPU level redundancy and agricultural interconnect
heat would be the biggest issue ... but >>50% of each CPU is low power (local) memory
xphos · 44m ago
Nit pick here but ...
I think CUDA shouldn't be labelled as SIMD but SIMT. The difference in overhead between the two approaches is vast. A true vector machine is far more efficient, but with all of the massive headaches of actually programming it. CUDA and SIMT have a huge benefit in that an if statement actually executes different code for the active/inactive lanes, i.e. different instructions execute on the same data in some cases, which really helps. You could also view it as the same instructions operating on different data, but the fork-and-join nature behaves very differently (see the sketch below).
I enjoyed your other point about the comparison of machines, though.
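To make that concrete, here's a rough scalar sketch of what the implicit masking amounts to for a simple branch; the real win of SIMT is that the divergent sides can be arbitrary code (loops, calls), which is painful to spell out as hand-blended vector code.

    // "Lanes" stand in for the threads of a warp. The mask bookkeeping below is
    // what SIMT hardware does for you automatically; a classic SIMD programmer
    // has to write it out with compares, selects and blends.
    #include <array>
    #include <cstddef>

    constexpr std::size_t kLanes = 32;            // warp-sized, by convention
    using Lanes = std::array<float, kLanes>;
    using Mask  = std::array<bool, kLanes>;

    // Per-lane:  if (x < 0) y = -2*x;  else y = 2*x;
    Lanes branchy_kernel(const Lanes& x) {
        Lanes y{};
        Mask taken{};
        for (std::size_t i = 0; i < kLanes; ++i)  // evaluate the condition per lane
            taken[i] = (x[i] < 0.0f);
        for (std::size_t i = 0; i < kLanes; ++i)  // "then" side: only active lanes write
            if (taken[i]) y[i] = -2.0f * x[i];
        for (std::size_t i = 0; i < kLanes; ++i)  // "else" side: the complement
            if (!taken[i]) y[i] = 2.0f * x[i];
        return y;                                 // lanes reconverge here
    }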
Legend2440 · 22m ago
>in fact lots of standard big O complexity analysis assumes a von Neumann machine – O(1) random access.
In practice, this is a lie: real computers do not offer O(1) random access. Physics constrains you to O(sqrt N), because on a (roughly two-dimensional) chip or board the amount of memory within a given signal delay grows with the square of that delay, so the latency to reach a random one of N cells grows like sqrt(N).
https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
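It's easy to see on your own machine with a rough microbenchmark sketch like this (sizes and iteration counts are arbitrary): random pointer chasing over growing working sets, where ns/access climbs in steps as you fall out of L1, L2, L3 and finally into DRAM.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        std::mt19937 rng(42);
        for (std::size_t n : {std::size_t{1} << 12, std::size_t{1} << 16,
                              std::size_t{1} << 20, std::size_t{1} << 24}) {
            // Build one big random cycle (Sattolo's algorithm) so the chase visits
            // the whole working set instead of a short loop that fits in cache.
            std::vector<std::size_t> next(n);
            std::iota(next.begin(), next.end(), std::size_t{0});
            for (std::size_t i = n - 1; i > 0; --i) {
                std::uniform_int_distribution<std::size_t> pick(0, i - 1);
                std::swap(next[i], next[pick(rng)]);
            }

            const std::size_t steps = std::size_t{1} << 24;
            auto t0 = std::chrono::steady_clock::now();
            std::size_t p = 0;
            for (std::size_t i = 0; i < steps; ++i)
                p = next[p];                      // dependent loads, nothing to overlap
            auto t1 = std::chrono::steady_clock::now();

            double ns = std::chrono::duration<double, std::nano>(t1 - t0).count()
                        / static_cast<double>(steps);
            // Printing p keeps the compiler from optimising the chase away.
            std::printf("%10zu elements: %6.2f ns/access (ignore %zu)\n", n, ns, p);
        }
        return 0;
    }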
The classic example is the Lisp Machine. Hypothetically a purpose-built computer would have an advantage running Lisp, but Common Lisp was carefully designed so that it could attain high performance on an ordinary "32-bit" architecture like the 68k or x86 as well as SPARC, ARM and other RISC architectures. Java turned out the same way.
It's hard to beat a conventional CPU tuned up with superscalar, pipelines, caches, etc. -- particularly when these machines sell in such numbers that the designers can afford heroic engineering efforts that you can't.
gizmo686 · 2h ago
It's also hard to beat conventional CPUs when transistor density doubles every two years. By the time your small team has crafted its purpose-built CPU, the big players will have released a general purpose CPU on the next generation of manufacturing abilities.
I expect that once our manufacturing abilities flatline, we will start seeing more interest in specialized CPUs again, as it will be possible to spend a decade designing your Lisp machine
PaulHoule · 1h ago
I never thought I could make a custom CPU until I came across https://en.wikipedia.org/wiki/Transport_triggered_architectu...
But when I saw that I thought, yeah, I could implement something like that on an FPGA. It's not so much a language-specific CPU as an application-specific CPU. If you were building something that might be FPGA + CPU or FPGA with a soft core it might be your soft core, particularly if you had the right kind of tooling. (Wouldn't it be great to have a superoptimizing 'compiler' that can codesign the CPU together with the program?)
It has its disadvantages, particularly the whole thing will lock up if a fetch is late. For workloads where memory access is predictable I'd imagine you could have a custom memory controller but my picture of how that works is fuzzy. For unpredictable memory access though you can't beat the mainstream CPU -- me and my biz dev guy had a lot of talks with a silicon designer who had some patents for a new kind of 'vector' processor who schooled us on how many ASIC and FPGA ideas that sound great on paper can't really fly because of the memory wall.
yjftsjthsd-h · 2h ago
There's also a matter of scale. By the time you've made your first million custom CPUs, the next vendor over has made a billion generic CPUs, sold them, and then turned the money back into R&D to make even better ones.
wbl · 2h ago
John McCarthy had amazing prescience given that Lisp was created in 1958 and the RISC revolution wasn't until at least 1974.
There's certainly some dynamic language support in CPUs: the indirect branch predictors and target predictors wouldn't be as large if there wasn't so much JavaScript and implementations that make those circuits work well.
PaulHoule · 1h ago
Half of the Dragon Book is about parsing and the other half is about reconciling the lambda calculus (any programming language with recursive functions) with the von Neumann/Turing approach to computation. Lisp manages to skip the first.
uticus · 27m ago
> My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don't you actually tell us about it, instead of briefly mentioning it?
Surprised there's no mention of Itanium in the article.
gwbas1c · 1h ago
I get that today's CPUs are orders of magnitude faster than the CPUs of the past. I also think it's important to admit that today's computers aren't orders of magnitude faster than yesterday's computers.
There are many reasons for that, but mostly it's because it's not worth it to optimize most use cases. It's clear that, for the author's use case (image recognition), it's "worth it" to optimize to the level they discuss.
Otherwise, we shouldn't assume that doubling a CPU's speed will magically double an application's speed. The author alludes to bottlenecks inside the computer that appear "CPU bound" to many programmers. These bottlenecks are still there, even when tomorrow's CPU is "twice as fast" as today's CPUs. (If half of an application's wall-clock time is spent in stalls that don't shrink, a CPU that's twice as fast only buys you about a 1.3x speedup.)
nromiun · 2h ago
But what is the fundamental difference between C compiling to assembly ahead of time and LuaJIT doing the same as a JIT? Both are very fast. Both are high level compared to assembly. Yet one gets the low-level-language tag and the other does not.
I don't buy the argument that you absolutely need a low level language for performance.
IainIreland · 2h ago
This isn't about languages; it's about hardware. Should hardware be "higher-level" to support higher level languages? The author says no (and I am inclined to agree with him).
librasteve · 1h ago
this
pinewurst · 2h ago
(2008)
moomin · 1h ago
I mean, I don’t have an answer to this, but I’ll bet Simon Peyton-Jones has some ideas…
The “high-level CPU” challenge (2008) - https://news.ycombinator.com/item?id=13741269 - Feb 2017 (127 comments)
The “high-level CPU” challenge (2008) - https://news.ycombinator.com/item?id=10358153 - Oct 2015 (58 comments)
The "high-level CPU" challenge - https://news.ycombinator.com/item?id=107221 - Jan 2008 (4 comments)