"You don't think about memories when you design hardware"
I found that comment interesting because memory is mostly what I think about when I design hardware. Memory access is by far the slowest thing, so managing memory bandwidth is absolutely critical, and it uses a significant portion of the die area.
Also, a significant portion of the microarchitecture of a modern CPU is all about managing memory accesses.
If there were a hypothetical high-level CPU language that somehow encoded all the information the microarchitecture needs to measure to manage memory access, then it would likely match today's designs in performance (assuming the CPU team did a good job with those measurements), while no longer needing all the extra hardware that did the measuring. So I see that as a win.
The main problem is, I have absolutely no idea how to do that, and unfortunately, I haven't met anyone else who knows either. Hence, tons of bookkeeping logic in CPUs persists.
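The closest thing today's ISAs give you is explicit hints, where software tells the memory system about accesses it already knows about. A rough sketch (the gather loop is invented; __builtin_prefetch is the GCC/Clang intrinsic):

    // Toy sketch: explicitly telling the memory system about upcoming accesses,
    // which is about as close as current ISAs get to "encoding what the
    // microarchitecture would otherwise have to discover by measurement".
    // Assumes GCC or Clang for __builtin_prefetch; the loop itself is made up.
    #include <cstddef>

    double gather_sum(const double* table, const int* indices, std::size_t n) {
        constexpr std::size_t kAhead = 16;  // arbitrary prefetch distance
        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + kAhead < n) {
                // The future address is computable right now, so hand it to the
                // hardware instead of hoping a predictor spots the irregular pattern.
                __builtin_prefetch(&table[indices[i + kAhead]], /*rw=*/0, /*locality=*/1);
            }
            total += table[indices[i]];
        }
        return total;
    }

Of course that only works when the future address is computable early, which is exactly the information a hypothetical high-level encoding would somehow have to surface.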
gchadwick · 15m ago
This certainly rings true with my own experiences (worked on both GPUs and CPUs and now doing AI inference at UK startup Fractile).
> If there were a hypothetical high-level CPU language that somehow encoded all the information the microarchitecture needs to measure to manage memory access,
I think this is fundamentally impossible because of dynamic behaviours. It's very tempting to assume that if you're clever enough you can work this all out ahead of time, encode what you need as static information, and have the whole computation run like clockwork on the hardware, with little or no area spent on scheduling, stalling, buffering, etc. But I think history has shown over and over that this just doesn't work (for general computation at least; it's more feasible in restricted domains). There are always fiddly details in real workloads that surprise you, and if you've got an inflexible system you've got no give, so you end up 'stopping the world' (or some significant part of it) to deal with them, killing your performance.
Notably, running transformer models feels like one of those restricted domains where you could do this well, but dig in and there's plenty of dynamic behaviour in there, enough that you can't escape the problems it causes.
Legend2440 · 19m ago
Keep in mind this article is almost 20 years old.
Memory access used to matter less because CPUs were a lot slower relative to memory. When I was a kid you could get a speedup by using lookup tables for trig functions - you'd never do that today, it's faster to recalculate.
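For anyone who never saw the trick, a minimal sketch of the idea (table size and the interpolation-free lookup are arbitrary choices, not a recommendation):

    // Old-school trick: a coarse sine lookup table versus just calling std::sin.
    // On an 80s/90s CPU the table usually won; on modern hardware the table's
    // cache misses can easily cost more than recomputing the value.
    #include <array>
    #include <cmath>
    #include <cstddef>

    constexpr std::size_t kTableSize = 1024;   // arbitrary resolution
    constexpr double kTwoPi = 6.283185307179586;

    const std::array<double, kTableSize>& sine_table() {
        static const auto table = [] {
            std::array<double, kTableSize> t{};
            for (std::size_t i = 0; i < kTableSize; ++i)
                t[i] = std::sin(kTwoPi * static_cast<double>(i) / kTableSize);
            return t;
        }();
        return table;
    }

    // Nearest-entry lookup, no interpolation: one divide, one wrap, one load.
    double table_sin(double x) {
        double turns = x / kTwoPi;
        turns -= std::floor(turns);            // wrap into [0, 1)
        return sine_table()[static_cast<std::size_t>(turns * kTableSize) % kTableSize];
    }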
foota · 25m ago
This is almost the opposite of what you're referring to I think, but interestingly Itanium had instruction level parallelism hard-coded by the compiler as opposed to determined by the processor at runtime.
eigenform · 48m ago
Also weird because pipelining is very literally "insert memories to cut some bigger process into parts that can occur in parallel with one another"
uticus · 25m ago
"memories" could also refer to registers, or even a stack-based approach
jonathaneunice · 1h ago
"High-level CPUs" are a tarpit. Beautiful idea in theory, but history shows they're a Bad Idea.
Xerox, LMI, Symbolics, Intel iAPX 432, UCSD p-System, Jazelle and PicoJava—just dozens of fancy designs for Lisp, Smalltalk, Ada, Pascal, Java—yet none of them ever lasted more than a product iteration or three. They were routinely outstripped by MOT 68K, x86, and RISC CPUs. Whether your metric of choice is shipment volumes, price/performance, or even raw performance, they have routinely underwhelmed. A trail of tears for 40+ years.
theredleft · 1m ago
Throwing in "a trail of tears" at the end is pretty ridiculous. It's just hardware.
switchbak · 8m ago
I'm just a software guy, so my thoughts are probably painfully naive here.
My understanding is that we're mostly talking about economics - eg: that there's no way a Java/Lisp CPU could ever compete with a general purpose CPU. That's what I thought was the reason for the Symbolics CPU decline vs general purpose chips.
It seems like some hardware co-processing could go a long way for some purposes though? GC in particular seems like it would be amenable to some dedicated hardware.
These days we're seeing dedicated silicon for "AI" chips in newer processors. That's not a high level CPU as described in the article, but it does seem like we're moving away from purely general purpose CPUs into a more heterogeneous world of devoting more silicon for other important purposes.
irq-1 · 1h ago
That's true, but isn't it an issue of price and volume? Specialized network cards are OK (in the market.)
jonathaneunice · 57m ago
Probably! Everything runs on economics. Network and graphics accelerators did just fine—but then they were (or became over time) high-volume opportunities. Volume drives investment; investment drives progress.
do_not_redeem · 55m ago
Those specialized network cards have ASICs that parse network packets and don't need additional memory to do their job. You can easily build a fixed-size hardware register to hold an IPv4 packet header, for example.
But a "high level CPU" is all about operating on high-level objects in memory, so your question becomes, how do you store a linked list in memory more efficiently than a RISC CPU? How do you pointer chase to the next node faster than a RISC CPU? Nobody has figured it out yet, and I agree with the article, I don't see how it's possible. CPUs are already general-purpose and very efficient.
delta_p_delta_x · 1h ago
As James Mickens says in The Night Watch[1],
“Why would someone write code in a grotesque language that exposes raw memory addresses? Why not use a modern language with garbage collection and functional programming and free massages after lunch?” Here’s the answer: Pointers are real. They’re what the hardware understands. Somebody has to deal with them. You can’t just place a LISP book on top of an x86 chip and hope that the hardware learns about lambda calculus by osmosis.
But I feel there's a middle ground between LISP and raw assembly and mangling pointers. For instance, modern C++ and Rust (and all the C++ 'successors', like Carbon, Circle, Safe C++, etc) have a very novel and clear set of ownership, view, and span semantics that make more declarative, functional, side-effect-free programming easy to write and read, while still maintaining high performance—most notably without unnecessarily throwing copies around.
A few days ago a friend was asking me about an HPC problem: he'd written a bog-standard nested for loop with indices, was observing poor performance, and asked me if using C++23 std::ranges::views would help. It did, but I also fundamentally reworked his data flow. I saw something like a 5x performance improvement, even though I was writing at a higher level of abstraction than he was.
[1]: https://www.usenix.org/system/files/1311_05-08_mickens.pdf
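I can't share his code, so here's a purely invented illustration of the *shape* of that kind of rework (the real problem and the 5x number come from the anecdote above, not from this sketch): a version that materialises a temporary per outer iteration versus a lazy C++20/23 views pipeline that streams each row once.

    #include <cstddef>
    #include <ranges>
    #include <vector>

    struct Sample { double value; bool valid; };

    // Before: index-based nested loops, allocating a temporary per outer iteration.
    double total_naive(const std::vector<std::vector<Sample>>& rows) {
        double total = 0.0;
        for (std::size_t i = 0; i < rows.size(); ++i) {
            std::vector<double> squares;                        // needless allocation
            for (std::size_t j = 0; j < rows[i].size(); ++j)
                if (rows[i][j].valid)
                    squares.push_back(rows[i][j].value * rows[i][j].value);
            for (std::size_t j = 0; j < squares.size(); ++j)
                total += squares[j];
        }
        return total;
    }

    // After: the same computation as a lazy pipeline -- no temporaries, one pass
    // over each row, and the intent is easier to read.
    double total_views(const std::vector<std::vector<Sample>>& rows) {
        double total = 0.0;
        for (const auto& row : rows) {
            auto squares = row
                | std::views::filter([](const Sample& s) { return s.valid; })
                | std::views::transform([](const Sample& s) { return s.value * s.value; });
            for (double x : squares)
                total += x;
        }
        return total;
    }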
It's my strong opinion that Von Neumann's architecture is great for general purpose problem solving. However for computing LLMs and similar fully known execution plans, it would be far better to decompose them into a fully parallel and pipelined graph to be executed on a reconfigurable computing mesh.
My particular hobby horse is the BitGrid, a systolic array of 4x4 bit look up tables clocked in 2 phases to eliminate race conditions.
My current estimate is that it could save 95% of the energy for a given computation.
Getting rid of RAM, and only moving data between adjacent cells offers the ability to really jack up the clock rate because you're not driving signal across the die.
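For the curious, here's a toy software model of the kind of cell I mean (the bit ordering and phase assignment below are just my shorthand, not a spec):

    // Toy model of one BitGrid cell: 4 input bits, 4 output bits, each output
    // driven by its own 16-entry lookup table over the same 4 inputs.
    #include <array>
    #include <cstdint>

    struct BitGridCell {
        // One 16-bit truth table per output: bit i of luts[o] is output o's
        // value when the 4 input bits, read as a number, equal i.
        std::array<std::uint16_t, 4> luts{};

        std::uint8_t evaluate(std::uint8_t inputs) const {   // low 4 bits used
            std::uint8_t out = 0;
            for (int o = 0; o < 4; ++o)
                if ((luts[o] >> (inputs & 0xF)) & 1u)
                    out |= static_cast<std::uint8_t>(1u << o);
            return out;
        }
    };

    // Two-phase clocking, conceptually: cells on the "even" checkerboard colour
    // latch new outputs in phase A and "odd" cells in phase B, so no cell ever
    // reads a neighbour that is changing in the same phase -- hence no races.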
elchananHaas · 1h ago
For sure. The issue is that many AI workloads require terabytes per second of memory bandwidth and are on the cutting edge of memory technologies. As long as you can get away with little memory usage you can have massive savings, see Bitcoin ASICs.
The great thing about the von Neumann architecture is that it is flexible enough to have all sorts of operations added to it, including specialized matrix multiplication operations and async memory transfers. So I think it's here to stay.
librasteve · 1h ago
occam (MIMD) is an improvement over CUDA (SIMD)
with latest (eg TSMC) processes, someone could build a regular array of 32-bit FP transputers (T800 equivalent):
- 8000 CPUs in same die area as an Apple M2 (16 TIPS) (ie. 36x faster than an M2)
- 40000 CPUs in single reticle (80 TIPS)
- 4.5M CPUs per 300mm wafer (10 PIPS)
the transputer async link (and C001 switch) allows for decoupled clocking, CPU level redundancy and agricultural interconnect
heat would be the biggest issue ... but >>50% of each CPU is low power (local) memory
xphos · 44m ago
Nit pick here but ...
I think CUDA shouldn't be labelled as SIMD but SIMT. The difference in overhead between the two approaches is vast. A true vector machine is far more efficient, but with all of the massive headaches of actually programming it. CUDA and SIMT have a huge benefit in that an if statement actually executes different code for the active/inactive lanes, i.e. different instructions execute on the same data in some cases, which really helps. You could also view it as the same instructions operating on different data, but the fork-and-join nature behaves very differently (see the sketch below).
I enjoyed your other point about the comparison of machines, though.
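To make that concrete, here's a rough scalar sketch of what the implicit masking amounts to for a simple branch; the real win of SIMT is that the divergent sides can be arbitrary code (loops, calls), which is painful to spell out as hand-blended vector code.

    // "Lanes" stand in for the threads of a warp. The mask bookkeeping below is
    // what SIMT hardware does for you automatically; a classic SIMD programmer
    // has to write it out with compares, selects and blends.
    #include <array>
    #include <cstddef>

    constexpr std::size_t kLanes = 32;            // warp-sized, by convention
    using Lanes = std::array<float, kLanes>;
    using Mask  = std::array<bool, kLanes>;

    // Per-lane:  if (x < 0) y = -2*x;  else y = 2*x;
    Lanes branchy_kernel(const Lanes& x) {
        Lanes y{};
        Mask taken{};
        for (std::size_t i = 0; i < kLanes; ++i)  // evaluate the condition per lane
            taken[i] = (x[i] < 0.0f);
        for (std::size_t i = 0; i < kLanes; ++i)  // "then" side: only active lanes write
            if (taken[i]) y[i] = -2.0f * x[i];
        for (std::size_t i = 0; i < kLanes; ++i)  // "else" side: the complement
            if (!taken[i]) y[i] = 2.0f * x[i];
        return y;                                 // lanes reconverge here
    }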
Legend2440 · 22m ago
>in fact lots of standard big O complexity analysis assumes a von Neumann machine – O(1) random access.
In practice, this is a lie: real computers do not offer O(1) random access. Physics constrains you to O(sqrt N), because on a (roughly two-dimensional) chip or board the amount of memory within a given signal delay grows with the square of that delay, so the latency to reach a random one of N cells grows like sqrt(N).
https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
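It's easy to see on your own machine with a rough microbenchmark sketch like this (sizes and iteration counts are arbitrary): random pointer chasing over growing working sets, where ns/access climbs in steps as you fall out of L1, L2, L3 and finally into DRAM.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        std::mt19937 rng(42);
        for (std::size_t n : {std::size_t{1} << 12, std::size_t{1} << 16,
                              std::size_t{1} << 20, std::size_t{1} << 24}) {
            // Build one big random cycle (Sattolo's algorithm) so the chase visits
            // the whole working set instead of a short loop that fits in cache.
            std::vector<std::size_t> next(n);
            std::iota(next.begin(), next.end(), std::size_t{0});
            for (std::size_t i = n - 1; i > 0; --i) {
                std::uniform_int_distribution<std::size_t> pick(0, i - 1);
                std::swap(next[i], next[pick(rng)]);
            }

            const std::size_t steps = std::size_t{1} << 24;
            auto t0 = std::chrono::steady_clock::now();
            std::size_t p = 0;
            for (std::size_t i = 0; i < steps; ++i)
                p = next[p];                      // dependent loads, nothing to overlap
            auto t1 = std::chrono::steady_clock::now();

            double ns = std::chrono::duration<double, std::nano>(t1 - t0).count()
                        / static_cast<double>(steps);
            // Printing p keeps the compiler from optimising the chase away.
            std::printf("%10zu elements: %6.2f ns/access (ignore %zu)\n", n, ns, p);
        }
        return 0;
    }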
The classic example is the Lisp Machine. Hypothetically a purpose-built computer would have an advantage running Lisp, but Common Lisp was carefully designed so that it could attain high performance on an ordinary "32-bit" architecture like the 68k or x86 as well as SPARC, ARM and other RISC architectures. Java turned out the same way.
It's hard to beat a conventional CPU tuned up with superscalar, pipelines, caches, etc. -- particularly when these machines sell in such numbers that the designers can afford heroic engineering efforts that you can't.
gizmo686 · 2h ago
It's also hard to beat conventional CPUs when transistor density doubles every two years. By the time your small team has crafted its purpose-built CPU, the big players will have released a general purpose CPU on the next generation of manufacturing abilities.
I expect that once our manufacturing abilities flatline, we will start seeing more interest in specialized CPUs again, as it will be possible to spend a decade designing your Lisp machine
PaulHoule · 1h ago
I never thought I could make a custom CPU until I came across https://en.wikipedia.org/wiki/Transport_triggered_architectu...
But when I saw that I thought, yeah, I could implement something like that on an FPGA. It's not so much a language-specific CPU as an application-specific CPU. If you were building something that might be FPGA + CPU or FPGA with a soft core it might be your soft core, particularly if you had the right kind of tooling. (Wouldn't it be great to have a superoptimizing 'compiler' that can codesign the CPU together with the program?)
It has its disadvantages, particularly the whole thing will lock up if a fetch is late. For workloads where memory access is predictable I'd imagine you could have a custom memory controller but my picture of how that works is fuzzy. For unpredictable memory access though you can't beat the mainstream CPU -- me and my biz dev guy had a lot of talks with a silicon designer who had some patents for a new kind of 'vector' processor who schooled us on how many ASIC and FPGA ideas that sound great on paper can't really fly because of the memory wall.
yjftsjthsd-h · 2h ago
There's also a matter of scale. By the time you've made your first million custom CPUs, the next vendor over has made a billion generic CPUs, sold them, and then turned the money back into R&D to make even better ones.
wbl · 2h ago
John McCarthy had amazing prescience given that Lisp was created in 1958 and the RISC revolution wasn't until at least 1974.
There's certainly some dynamic language support in CPUs: the indirect branch predictors and target predictors wouldn't be as large if there wasn't so much JavaScript and implementations that make those circuits work well.
PaulHoule · 1h ago
Half of the Dragon Book is about parsing and the other half is about reconciling the lambda calculus (any programming language with recursive functions) with the von Neumann/Turing approach to computation. Lisp manages to skip the first.
uticus · 27m ago
> My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don't you actually tell us about it, instead of briefly mentioning it?
Surprised there's no mention of Itanium in the article.
gwbas1c · 1h ago
I get that today's CPUs are orders of magnitude faster than the CPUs of the past. I also think it's important to admit that today's computers aren't orders of magnitude faster than yesterday's computers.
There are many reasons for that, but mostly it's because it's not worth it to optimize most use cases. It's clear that, for the author's use case (image recognition), it's "worth it" to optimize to the level they discuss.
Otherwise, we shouldn't assume that doubling a CPU's speed will magically double an application's speed. The author alludes to bottlenecks inside the computer that appear "CPU bound" to many programmers. These bottlenecks are still there, even when tomorrow's CPU is "twice as fast" as today's CPUs. (If half of an application's wall-clock time is spent in stalls that don't shrink, a CPU that's twice as fast only buys you about a 1.3x speedup.)
nromiun · 2h ago
But what is the fundamental difference between C compiling to assembly ahead of time and LuaJIT doing the same as a JIT? Both are very fast. Both are high level compared to assembly. Yet one gets the low-level-language tag and the other does not.
I don't buy the argument that you absolutely need a low level language for performance.
IainIreland · 2h ago
This isn't about languages; it's about hardware. Should hardware be "higher-level" to support higher level languages? The author says no (and I am inclined to agree with him).
librasteve · 1h ago
this
pinewurst · 2h ago
(2008)
moomin · 1h ago
I mean, I don’t have an answer to this, but I’ll bet Simon Peyton-Jones has some ideas…
The “high-level CPU” challenge (2008) - https://news.ycombinator.com/item?id=13741269 - Feb 2017 (127 comments)
The “high-level CPU” challenge (2008) - https://news.ycombinator.com/item?id=10358153 - Oct 2015 (58 comments)
The "high-level CPU" challenge - https://news.ycombinator.com/item?id=107221 - Jan 2008 (4 comments)