"You don't think about memories when you design hardware"
I found that comment interesting because memory is mostly what I think about when I design hardware. Memory access is by far the slowest thing, so managing memory bandwidth is absolutely critical, and it uses a significant portion of the die area.
Also, a significant portion of the microarchitecture of a modern CPU is all about managing memory accesses.
If there were a hypothetical high-level CPU language that somehow encoded all the information the microarchitecture currently has to measure at runtime to manage memory access, it would likely tie in performance (assuming the CPU team did a good job with that measurement), but it wouldn't need all the extra hardware that does the measuring. So I see that as a win.
The main problem is, I have absolutely no idea how to do that, and unfortunately, I haven't met anyone else who knows either. Hence, tons of bookkeeping logic in CPUs persists.
Legend2440 · 17h ago
Keep in mind this article is almost 20 years old.
Memory access used to matter less because CPUs were a lot slower relative to memory. When I was a kid you could get a speedup by using lookup tables for trig functions - you'd never do that today, it's faster to recalculate.
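Roughly, the old trick looked like this (a sketch with made-up sizes, not a real benchmark):

    // Sketch of the lookup-table trick vs. just calling std::sin.
    // Assumes x in [0, 2*pi); table size and accuracy are arbitrary.
    #include <array>
    #include <cmath>
    #include <cstdio>

    constexpr int kTableSize = 1024;               // power of two so we can mask
    constexpr double kTwoPi = 6.283185307179586;

    static const std::array<double, kTableSize> kSinTable = [] {
        std::array<double, kTableSize> t{};
        for (int i = 0; i < kTableSize; ++i)
            t[i] = std::sin(kTwoPi * i / kTableSize);
        return t;
    }();

    // 1980s style: a multiply, a mask, and one (hopefully cached) load.
    double sin_lookup(double x) {
        int idx = static_cast<int>(x * (kTableSize / kTwoPi)) & (kTableSize - 1);
        return kSinTable[idx];
    }

    int main() {
        // Today std::sin is a few dozen ALU ops, while a table miss to DRAM
        // costs hundreds of cycles -- so the "slow" version usually wins.
        std::printf("%f vs %f\n", sin_lookup(1.0), std::sin(1.0));
    }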
PeterWhittaker · 15h ago
Wait, what?
Oh, yeah. Jeebus.
2008 was SEVENTEEN years ago.
I turn 60 soon. Anything 2xxx just seems like almost yesterday, 199x isn't that far away in my mind, and even 198x is pretty close.
I love math (B.Sc. Physics, so more math than anything else) but sometimes math hurts.
Damn you, beloved math.
gchadwick · 17h ago
This certainly rings true with my own experiences (worked on both GPUs and CPUs and now doing AI inference at UK startup Fractile).
> If there were a hypothetical high-level CPU language that somehow encoded all the information the microarchitecture needs to measure to manage memory access,
I think this is fundamentally impossible because of dynamic behaviours. It's very tempting to assume that if you're clever enough you can work this all out ahead of time, encode what you need in some static information, and the whole computation just runs like clockwork on the hardware, with little or no area spent on scheduling, stalling, buffering, etc. But I think history has shown over and over that this just doesn't work (for more general computation at least; it's more feasible in restricted domains). There are always fiddly details in real workloads that surprise you, and if you've got an inflexible system you've got no give, so you end up 'stopping the world' (or some significant part of it) to deal with them, killing your performance.
Notably, running transformer models feels like one of those restricted domains where you could do this well, but dig in and there's enough dynamic behaviour in there that you can't escape the problems it causes.
eigenform · 17h ago
Also weird because pipelining is very literally "insert memories to cut some bigger process into parts that can occur in parallel with one another"
foota · 17h ago
This is almost the opposite of what you're referring to I think, but interestingly Itanium had instruction level parallelism hard-coded by the compiler as opposed to determined by the processor at runtime.
uticus · 17h ago
"memories" could also refer to registers, or even a stack-based approach
blakepelton · 16h ago
Some folks use the term "state element" to describe the set of things which hold state between clock cycles. Memories (e.g., SRAM) and registers are both examples of state elements.
I took the "You don't think about memories when you design hardware" paragraph from Yosef to mean that SRAM is so highly tuned that in most cases your best bet is to just use it "off the shelf" rather than invent some novel type of state element. And if you use SRAM, then you are stuck sucking through a small straw (e.g., an SRAM can hold 1K words, but you can only read 1 word/cycle).
jonathaneunice · 18h ago
"High-level CPUs" are a tarpit. Beautiful idea in theory, but history shows they're a Bad Idea.
Xerox, LMI, Symbolics, Intel iAPX 432, UCSD p-System, Jazelle and PicoJava—just dozens of fancy designs for Lisp, Smalltalk, Ada, Pascal, Java—yet none of them ever lasted more than a product iteration or three. They were routinely outstripped by MOT 68K, x86, and RISC CPUs. Whether your metric of choice is shipment volumes, price/performance, or even raw performance, they have routinely underwhelmed. A trail of tears for 40+ years.
switchbak · 17h ago
I'm just a software guy, so my thoughts are probably painfully naive here.
My understanding is that we're mostly talking about economics - eg: that there's no way a Java/Lisp CPU could ever compete with a general purpose CPU. That's what I thought was the reason for the Symbolics CPU decline vs general purpose chips.
It seems like some hardware co-processing could go a long way for some purposes though? GC in particular seems like it would be amenable to some dedicated hardware.
These days we're seeing dedicated silicon for "AI" chips in newer processors. That's not a high level CPU as described in the article, but it does seem like we're moving away from purely general purpose CPUs into a more heterogeneous world of devoting more silicon for other important purposes.
bigfishrunning · 16h ago
Those "AI" chips generally just optimize around a vector-multiply, allowing more parallelism with less precision. This is almost a lower level processor then a general purpose CPU, and doesn't really map to the "string compare instruction" type of functionality that Yossi Kreinin describes in the essay.
wmf · 16h ago
If you want to build something that is used you can't ignore economics. But in this case I think high-level CPUs are worse even if you could put economics aside.
> GC in particular seems like it would be amenable to some dedicated hardware.
Azul went through that incredible journey.
irq-1 · 18h ago
That's true, but isn't it an issue of price and volume? Specialized network cards are OK (in the market.)
jonathaneunice · 17h ago
Probably! Everything runs on economics. Network and graphics accelerators did just fine—but then they were (or became over time) high-volume opportunities. Volume drives investment; investment drives progress.
do_not_redeem · 17h ago
Those specialized network cards have ASICs that parse network packets and don't need additional memory to do their job. You can easily build a fixed-size hardware register to hold an IPv4 packet header, for example.
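For concreteness, the fixed part of an IPv4 header really is just 20 bytes, which is exactly the kind of thing you can bake into wires. In C++ terms (field names mine):

    #include <cstdint>

    // Fixed 20-byte IPv4 header (options omitted). Because the layout is
    // small and static, a NIC can parse it with a fixed block of logic --
    // no pointer chasing, no cache, no DRAM round trip.
    struct Ipv4Header {
        std::uint8_t  version_ihl;      // 4-bit version, 4-bit header length
        std::uint8_t  dscp_ecn;
        std::uint16_t total_length;
        std::uint16_t identification;
        std::uint16_t flags_fragment;
        std::uint8_t  ttl;
        std::uint8_t  protocol;
        std::uint16_t header_checksum;
        std::uint32_t src_addr;
        std::uint32_t dst_addr;
    };
    static_assert(sizeof(Ipv4Header) == 20, "size known at design time");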
But a "high level CPU" is all about operating on high-level objects in memory, so your question becomes, how do you store a linked list in memory more efficiently than a RISC CPU? How do you pointer chase to the next node faster than a RISC CPU? Nobody has figured it out yet, and I agree with the article, I don't see how it's possible. CPUs are already general-purpose and very efficient.
theredleft · 16h ago
Throwing in "a trail of tears" at the end is pretty ridiculous. It's just hardware.
Dylan16807 · 16h ago
Did you consider maybe they weren't making a reference?
delta_p_delta_x · 18h ago
As James Mickens says in The Night Watch[1],
“Why would someone write code in a grotesque language that exposes raw memory addresses? Why not use a modern language with garbage collection and functional programming and free massages after lunch?” Here’s the answer: Pointers are real. They’re what the hardware understands. Somebody has to deal with them. You can’t just place a LISP book on top of an x86 chip and hope that the hardware learns about lambda calculus by osmosis.
[1]: https://www.usenix.org/system/files/1311_05-08_mickens.pdf
But I feel there's a middle ground between LISP and raw assembly and mangling pointers. For instance, modern C++ and Rust (and all the C++ 'successors', like Carbon, Circle, Safe C++, etc) have a very novel and clear set of ownership, view, and span semantics that make more declarative, functional, side-effect-free programming easy to write and read, while still maintaining high performance—most notably without unnecessarily throwing copies around.
A few days ago a friend was asking me for help with an HPC problem: he'd written a bog-standard nested for loop with indices, was observing poor performance, and asked me if using C++23 std::ranges::views would help. It did, but I also fundamentally reworked his data flow. I saw something like a 5x performance improvement, even though I was writing at a higher level of abstraction than he was.
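I don't have his actual code, so this is only a schematic of the kind of rework (all names invented). The win was mostly the flat, single-pass data layout; the C++23 views just let you express it at a higher level without copies:

    #include <cstddef>
    #include <ranges>
    #include <vector>

    // Before: row-of-rows layout, every samples[i] is its own heap allocation,
    // and the inner loop does awkward index arithmetic.
    double before(const std::vector<std::vector<double>>& samples,
                  const std::vector<double>& weights) {
        double acc = 0.0;
        for (std::size_t i = 0; i < samples.size(); ++i)
            for (std::size_t j = 0; j < samples[i].size(); ++j)
                if (samples[i][j] > 0.0)
                    acc += samples[i][j] * weights[j];
        return acc;
    }

    // After: one flat contiguous buffer, walked linearly a row at a time.
    double after(const std::vector<double>& flat, std::size_t cols,
                 const std::vector<double>& weights) {
        double acc = 0.0;
        for (auto row : flat | std::views::chunk(cols))          // C++23
            for (auto [x, w] : std::views::zip(row, weights))    // C++23
                if (x > 0.0) acc += x * w;
        return acc;
    }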
It's my strong opinion that Von Neumann's architecture is great for general purpose problem solving. However for computing LLMs and similar fully known execution plans, it would be far better to decompose them into a fully parallel and pipelined graph to be executed on a reconfigurable computing mesh.
My particular hobby horse is the BitGrid, a systolic array of 4x4 bit look up tables clocked in 2 phases to eliminate race conditions.
My current estimate is that it could save 95% of the energy for a given computation.
Getting rid of RAM and only moving data between adjacent cells offers the ability to really jack up the clock rate, because you're not driving signals across the die.
elchananHaas · 18h ago
For sure. The issue is that many AI workloads require terabytes per second of memory bandwidth and are on the cutting edge of memory technologies. As long as you can get away with little memory usage you can have massive savings, see Bitcoin ASICs.
The great thing about the Von Neumann architecture is that it is flexible enough to have all sorts of operations added to it, including specialized matrix multiplication operations and async memory transfer. So I think it's here to stay.
mikewarot · 16h ago
The reason they need terabytes per second is they're constantly loading weights and data into and then back out of multiply accumulator chips.
If you program hardware to just do the multiply/accumulate with the weights built in, you can then reduce the bandwidth required to just putting data into a "layer" and getting the results out, a much, MUCH lower amount of data in most cases.
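Back-of-envelope version of that, with made-up layer sizes:

    #include <cstdio>

    // Rough arithmetic only; the layer shape and fp16 are assumptions.
    int main() {
        const double in = 4096, out = 4096;          // one dense layer
        const double bytes_per_weight = 2;           // fp16
        const double bytes_per_activation = 2;

        double weight_bytes = in * out * bytes_per_weight;        // ~33.6 MB
        double act_bytes    = (in + out) * bytes_per_activation;  // ~16 KB

        // Weights streamed in from DRAM/HBM for every token (GPU-style):
        std::printf("streamed weights: %.1f MB per token\n",
                    (weight_bytes + act_bytes) / 1e6);
        // Weights resident next to the multiply-accumulators:
        std::printf("resident weights: %.3f MB per token\n", act_bytes / 1e6);
    }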
thrtythreeforty · 13h ago
Doesn't this presume that you can fit the model's weights into the SRAMs of a single chip, or of multiple chips that you plan to connect together? A big reason HBM is a thing is because it's much, much denser than SRAM.
Dylan16807 · 16h ago
That sounds like an FPGA, but aren't those notoriously slow for most uses? You can make a kickass signal pipeline or maybe an emulator but most code gets shunted to an attached arm core or even a soft-core implemented on top that wastes an order of magnitude of performance.
And no architecture is clock limited by driving signals across the die. Why do people keep assuming this? CPU designers are very smart and they break slow things into multiple steps.
bigfishrunning · 16h ago
> That sounds like an FPGA, but aren't those notoriously slow for most uses?
If you're trying to run software on them, yes. If you use them for their stated purpose (as an array of gates), then they can be orders of magnitude faster than the equivalent computer program. It's all about using the right tool for the right job.
librasteve · 18h ago
occam (MIMD) is an improvement over CUDA (SIMD)
with latest (eg TSMC) processes, someone could build a regular array of 32-bit FP transputers (T800 equivalent):
- 8000 CPUs in same die area as an Apple M2 (16 TIPS) (ie. 36x faster than an M2)
- 40000 CPUs in single reticle (80 TIPS)
- 4.5M CPUs per 300mm wafer (10 PIPS)
the transputer async link (and C001 switch) allows for decoupled clocking, CPU level redundancy and agricultural interconnect
heat would be the biggest issue ... but >>50% of each CPU is low power (local) memory
xphos · 17h ago
Nit pick here but ...
I think CUDA shouldn't be labeled as SIMD but SIMT. The difference in overhead between the two approaches is vast. A true vector machine is far more efficient, but with all the massive headaches of actually programming it. CUDA and SIMT have a huge benefit in that if statements can actually execute different code for active/inactive lanes, i.e. different instructions execute on the same data in some cases, which really helps. You might still view it as the same instructions operating on different data, but the fork-and-join nature behaves very differently.
I enjoyed your other point about the comparisons of machines, though.
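A rough scalar caricature of the control-flow difference (not real SIMD or SIMT code, just the shape of it):

    #include <vector>

    double cheap(double x)     { return x * 2.0; }
    double expensive(double x) { return x * x * x + 1.0; }

    // Classic vector/SIMD shape: compute both sides for every lane,
    // then select with a mask.
    void simd_style(std::vector<double>& v) {
        for (double& x : v) {
            double a = expensive(x);
            double b = cheap(x);
            x = (x > 0.0) ? a : b;            // blend after the fact
        }
    }

    // SIMT shape: each "thread" logically follows its own branch; the
    // hardware masks and serializes divergent groups under the hood,
    // but the programming model lets lanes run different code.
    void simt_style(std::vector<double>& v) {
        for (double& x : v) {
            if (x > 0.0) x = expensive(x);
            else         x = cheap(x);
        }
    }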
sifar · 7h ago
Really curious about why you think programming a vector machine is so painful? In terms of what? And what exactly do you mean by a "true Vector Machine"?
My experience with RVV (I am aware of vector architecture history, just using it as an example) so far indicates that while it is not the greatest thing, it is not that bad either. You play with what you have!!
Yes, compared to regular SIMD, it is a step up in complexity, but nothing a competent SIMD programmer cannot reorient to. Designing a performant hardware CPU is another matter though - a lot of (micro)architectural choices and tradeoffs that can impact performance significantly.
PaulHoule · 19h ago
The classic example is the Lisp Machine. Hypothetically a purpose-built computer would have an advantage running Lisp, but Common Lisp was carefully designed so that it could attain high performance on an ordinary "32-bit" architecture like the 68k or x86 as well as SPARC, ARM and other RISC architectures. Java turned out the same way.
It's hard to beat a conventional CPU tuned up with superscalar, pipelines, caches, etc. -- particularly when these machines sell in such numbers that the designers can afford heroic engineering efforts that you can't.
gizmo686 · 19h ago
It's also hard to beat conventional CPUs when transistor density doubles every two years. By the time your small team has crafted their purpose-built CPU, the big players will have released a general-purpose CPU on the next generation of manufacturing abilities.
I expect that once our manufacturing abilities flatline, we will start seeing more interest in specialized CPUs again, as it will be possible to spend a decade designing your Lisp machine.
PaulHoule · 18h ago
I never thought I could make a custom CPU until I came across https://en.wikipedia.org/wiki/Transport_triggered_architectu...
But when I saw that I thought, yeah, I could implement something like that on an FPGA. It's not so much a language-specific CPU as an application-specific CPU. If you were building something that might be FPGA + CPU or FPGA with a soft core it might be your soft core, particularly if you had the right kind of tooling. (Wouldn't it be great to have a superoptimizing 'compiler' that can codesign the CPU together with the program?)
It has its disadvantages, particularly the whole thing will lock up if a fetch is late. For workloads where memory access is predictable I'd imagine you could have a custom memory controller but my picture of how that works is fuzzy. For unpredictable memory access though you can't beat the mainstream CPU -- me and my biz dev guy had a lot of talks with a silicon designer who had some patents for a new kind of 'vector' processor who schooled us on how many ASIC and FPGA ideas that sound great on paper can't really fly because of the memory wall.
yjftsjthsd-h · 18h ago
There's also a matter of scale. By the time you've made your first million custom CPUs, the next vendor over has made a billion generic CPUs, sold them, and then turned the money back into R&D to make even better ones.
wbl · 19h ago
John McCarthy had amazing prescience given that Lisp was created in 1958 and the RISC revolution wasn't until at least 1974.
There's certainly some dynamic language support in CPUs: the indirect branch predictors and target predictors wouldn't be as large if there wasn't so much JavaScript and implementations that make those circuits work well.
PaulHoule · 18h ago
Half of the Dragon Book is about parsing and the other half is about reconciling the lambda calculus (any programming language with recursive functions) with the von Neumann/Turing approach to computation. Lisp manages to skip the first.
Legend2440 · 17h ago
>in fact lots of standard big O complexity analysis assumes a von Neumann machine – O(1) random access.
In practice, this is a lie - real computers do not offer O(1) random access. Physics constrains you to O(sqrt N).
https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
I think LISP is one idea, but what about other languages like BQN? Horrible to type, but it can represent some really high-level algorithms and has a totally different idea of what instructions can be. The idea of having inner and outer products formalized is cool. There are others like it, since it's more array-programming oriented, but most seem to just write an efficient C implementation of a stdlib; an instruction set for them would probably work, since it's not really about memory allocation but about operations on memories.
It's not suited (at least in my naive understanding) to general-purpose computing, but it carves out its own space.
https://mlochbaum.github.io/BQN/index.html
I get that today's CPUs are orders of magnitude faster than the CPUs of the past. I also think it's important to admit that today's computers aren't orders of magnitude faster than yesterday's computers.
There are many reasons for that, but mostly it's because it's not worth it to optimize most use cases. It's clear that, for the author's use case (image recognition), it's "worth it" to optimize to the level they discuss.
Otherwise, we shouldn't assume that doubling a CPU's speed will magically double an application's speed. The author alludes to bottlenecks inside the computer that appear "CPU bound" to many programmers. These bottlenecks are still there, even when tomorrow's CPU is "twice as fast" as today's CPUs.
pyrolistical · 16h ago
Clearly memory latency is the bottleneck. I see 2 paths we can take.
1. Memory orientated computers
2. Less memory abstractions
We already did the first, they are called GPUs.
What I imagine for the second is full control over the memory hierarchy. L1, L2, L3, etc. are fully controlled by the compiler. The compiler (which might be just-in-time) can choose latency vs throughput. It knows exactly how all the caches are connected to all the NUMA nodes and their latencies.
The compiler can choose to skip a cache level for latency-critical operations. The compiler knows exactly the block size and cache capacity. This way it can shuffle data around with pipelining to hide latency, instead of that only happening by accident, as it does today.
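A small slice of that exists today as software prefetch, though it's a hint rather than real control (GCC/Clang builtin; the lookahead distance is a per-machine guess you have to tune):

    #include <cstddef>
    #include <vector>

    // Sketch of software-controlled latency hiding with today's tools.
    // __builtin_prefetch is only a hint to start the cache fill early.
    double sum_indirect(const std::vector<double>& data,
                        const std::vector<std::size_t>& idx) {
        constexpr std::size_t kAhead = 16;     // tuning knob, machine-dependent
        double acc = 0.0;
        for (std::size_t i = 0; i < idx.size(); ++i) {
            if (i + kAhead < idx.size())
                __builtin_prefetch(&data[idx[i + kAhead]]);   // start the miss now
            acc += data[idx[i]];               // use the line ~kAhead iterations later
        }
        return acc;
    }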
nromiun · 19h ago
But what is the fundamental difference between C compiling to assembly and LuaJIT doing the same as a JIT? Both are very fast. Both are high level compared to assembly. Yet one gets the low-level language tag and the other one does not.
I don't buy the argument that you absolutely need a low level language for performance.
IainIreland · 18h ago
This isn't about languages; it's about hardware. Should hardware be "higher-level" to support higher level languages? The author says no (and I am inclined to agree with him).
librasteve · 18h ago
this
pinewurst · 19h ago
(2008)
uticus · 17h ago
> My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don't you actually tell us about it, instead of briefly mentioning it?
surprised no mention of Itanium in the article
nurettin · 16h ago
Don't RISC and SIMD address different dimensions of this problem?
moomin · 18h ago
I mean, I don’t have an answer to this, but I’ll bet Simon Peyton-Jones has some ideas…
wmf · 16h ago
He doesn't. (At least not good ones.) That's one point of the article: high-level CPUs are something that sounds like a good idea for someone else to work on but that's an illusion.
The “high-level CPU” challenge (2008) - https://news.ycombinator.com/item?id=13741269 - Feb 2017 (127 comments)
The “high-level CPU” challenge (2008) - https://news.ycombinator.com/item?id=10358153 - Oct 2015 (58 comments)
The "high-level CPU" challenge - https://news.ycombinator.com/item?id=107221 - Jan 2008 (4 comments)