AMD claims Arm ISA doesn't offer efficiency advantage over x86

213 ksec 408 9/8/2025, 2:36:00 PM techpowerup.com ↗

Comments (408)

exmadscientist · 1d ago
This is an entirely uncontroversial take among experts in the space. x86 is an old CISC-y hot mess. RISC-V is a new-school hyper-academic hot mess. Recent ARM is actually pretty good. And none of it matters, because the uncore and the fabrication details (in particular, whether things have been tuned to run full speed demon or full power sipper) completely dominate the ISA.

In the past x86 didn't dominate in low power because Intel had the resources to care but never did, and AMD never had the resources to try. Other companies stepped in to fill that niche, and had to use other ISAs. (If they could have used x86 legally, they might well have done so. Oops?) That may well be changing. Or perhaps AMD will let x86 fade away.

torginus · 1d ago
I remember reading this Jim Keller interview:

https://web.archive.org/web/20210622080634/https://www.anand...

Basically the gist of it is that the difference between ARM/x86 mostly boils down to instruction decode, and:

- Most instructions end up being simple load/store/conditional branch etc. on both architectures, where there's literally no difference in encoding efficiency

- Variable-length instruction decoding has pretty much been figured out on x86, to the point that it's no longer a bottleneck

Also my personal addendum is that today's Intel efficiency cores have more transistors and better perf than the big Intel cores of a decade ago

jorvi · 20h ago
Nice followup to your link: https://chipsandcheese.com/p/arm-or-x86-isa-doesnt-matter.

Personally I do not entirely buy it. Intel and AMD have had plenty of years to catch up to Apple's M-architecture and they still aren't able to touch it in efficiency. The PC Snapdragon chips AFAIK also offer better performance-per-watt than AMD or Intel, with laptops offering them often having 10-30% longer battery life at similar performance.

The same goes for GPUs, where Apple's M1 GPU completely smoked an RTX3090 in performance-per-watt, offering 320W of RTX 3090 performance in a 110W envelope: https://images.macrumors.com/t/xuN87vnxzdp_FJWcAwqFhl4IOXs=/...

ben-schaaf · 17h ago
> Personally I do not entirely buy it. Intel and AMD have had plenty of years to catch up to Apple's M-architecture and they still aren't able to touch it in efficiency. The PC Snapdragon chips AFAIK also offer better performance-per-watt than AMD or Intel, with laptops offering them often having 10-30% longer battery life at similar performance.

Do not conflate battery life with core efficiency. If you want to measure how efficient a CPU core is you do so under full load. The latest AMD under full load uses the same power as M1 and is faster, thus it has better performance per watt. Snapdragon Elite eats 50W under load, significantly worse than AMD. Yet both M1 and Snapdragon beat AMD on battery life tests, because battery life is mainly measured using activities where the CPU is idle the vast majority of the time. And of course the ISA is entirely irrelevant when the CPU isn't being used to begin with.

> The same goes for GPUs, where Apple's M1 GPU completely smoked an RTX3090 in performance-per-watt, offering 320W of RTX 3090 performance in a 110W envelope

That chart is Apple propaganda. In Geekbench 5 the RTX 3090 is 2.5x faster, in blender 3.1 it is 5x faster. See https://9to5mac.com/2022/03/31/m1-ultra-gpu-comparison-with-... and https://techjourneyman.com/blog/m1-ultra-vs-nvidia-rtx-3090/

devnullbrain · 11h ago
I want to measure the device how I use it. Race-to-sleep and power states are integral to CPU design.
ben-schaaf · 11h ago
Yes they are, but only one of those is at all affected by the choice of ISA. If modern AMD chips are better at race-to-sleep than an Apple M1 and still get worse battery life then the problem is clearly not x86-64.
simonh · 11h ago
Right, so as I understand it people see that x86-64 designs score poorly on a set of benchmarks and infer that it is because they are x86-64.

In fact it’s because that manufacturer has made architectural choices that are not inherent to the x86-64 ISA.

And that’s just hardware. MacOS gets roughly 30% better battery life on M series hardware than Asahi Linux. I’m not blaming the Asahi team, they do amazing work, they don’t even work on many of the Linux features relevant to power management, and Apple has had years of head start on preparing for and optimising for the M architecture. It’s just that software matters, a lot.

So if I’m reading this right, ISA can make a difference, but it’s incremental compared to the many architectural decisions and trade offs that go into a particular design.

bee_rider · 5h ago
> So if I’m reading this right, ISA can make a difference, but it’s incremental compared to the many architectural decisions and trade offs that go into a particular design.

This is true, but only in the sense that it is very rarely correct to say “Factor Y can’t possibly make a difference.”

brookst · 11h ago
Does anyone care about blaming / lauding an ISA without any connection to the actual devices that people use?

Performance and battery life are lived experiences. There’s probably some theoretical hyper optimization where 6502 ISA is just as good as ARM, but does it matter?

jijijijij · 9h ago
In this thread, it does. You are moving the goalpost by making this about "actual devices", when the topic is ISA efficiency.
fwipsy · 9h ago
> If you want to measure how efficient a CPU core is you do so under full load.

Higher wattage gives diminishing returns. Chips will run higher wattage under full load just to eke out a marginal improvement in performance. Therefore efficiency improves if the manufacturer chooses to limit the chip rather than pushing it harder.

Test efficiency using whatever task the chip will be used for. For most ultralight laptops, that will be web browsing etc. so the m1 MacBook/snapdragon results are valid for typical users. Maybe your workload hammers the CPU but that doesn't make it the one true benchmark.

ben-schaaf · 8h ago
No, and that's exactly the point I'm making. If you try to measure ISA efficiency using a workload where the CPU is idle the vast majority of the time, then your power usage will be dominated by things unrelated to the ISA.

To further hammer the point home, let me do a reductio ad absurdum: The chip is still "in use" when it's asleep. Sleeping your laptop is a typical use case. Therefore how much power is used while sleeping is a good measure of ISA efficiency. This is of course absurd because the CPU cores are entirely turned off when sleeping; they could draw 1kW with potato performance and nothing in this measurement would change.

petrichorko · 13h ago
To me it's not as simple as comparing the efficiency under full load. I imagine the efficiency on x86 as some kind of log curve, which translates to higher power consumption even on lighter loads. Apple's ARM implementation tends to eat a lot less power on tasks that happen most of the time, hence greatly improving the battery life.

I've tried a Ryzen 7 that had a similar efficiency to an M1 according to some tests, and that thing ran hot like crazy. It's just marketing bullshit to me now.

sgc · 12h ago
The OS matters, and I would guess you were using two different OSes? I have no doubt macOS running on an m1 is more optimized than whatever you were using on the ryzen.

I recently had to remove Windows completely from a few-years-old laptop with a 12th-gen CPU and an Intel Iris / GeForce RTX 3060 Mobile combo because it was running very hot (90C+) and the fans were constantly running. Running Linux, I have no issues. I just double checked since I had not for several months, and the temperature is 40C lower on my lap than it was propped up on a book for maximum airflow. Full disclaimer, I would have done this anyway, but the process was sped up because my wife was extremely annoyed with the noise my new-to-me computer was making, and it was cooking the components.

I have learned to start with the OS when things are tangibly off, and only eventually come back to point the finger at my hardware.

ashirviskas · 11h ago
OS does matter, with Linux my M1 macbook gets kinda hot and it cannot do more than 1.5-2h of google meetings with cameras on. IIRC google meetings on macos were at least a bit more efficient.

Though it has definitely been getting better in the last 1.5 years using Asahi Linux and in some areas it is a better experience than most laptops running Linux (sound, cameras, etc.). The developers even wrote a full fledged physical speaker simulator just so it could be efficiently driven over its "naive" limits.

sgc · 8h ago
That is more a "who is controlling access to hardware drivers matters" problem. I wouldn't be surprised if macOS was still a tiny bit more efficient with a level playing field, but we will never know.
petrichorko · 12h ago
Linux can definitely help with this, I had the same experience with it on Z1 (SteamOS). But even running Windows 11 in a VM on M1 does not make the machine run hot
ben-schaaf · 7h ago
If we're comparing entire systems as products you're absolutely right. That's not what this discussion is about. We're trying to compare the efficiency of the ISA. What do you think would happen if AMD replaced the x86-64 decoders with ARM64 ones, and changed nothing about how the CPU idles, how high it clocks or how fast it boosts?

My guess is ARM64 is a few percent more efficient, something AMD has claimed in the past. They're now saying it would be identical, which is probably not far from the truth.

The simple fact of the matter is that the ISA is only a small part of how long your battery lasts. If you're gaming or rendering or compiling it's going to matter a lot, and Apple battery life is pretty comparable in these scenarios. If you're reading, writing, browsing or watching then your cores are going to be mostly idle, so the only thing the ISA influences won't even have a current running through it.

boxed · 15h ago
He said "per watt", that's still true. You just talked about max throughput, which no one is discussing.
dwattttt · 13h ago
> The latest AMD under full load uses the same power as M1 and is faster, thus it has better performance per watt.

He also said per watt. An AMD CPU running at full power and then stopping will use less battery than an M1 with the same task; that's comparing power efficiency.

boxed · 11h ago
https://techjourneyman.com/blog/m1-ultra-vs-nvidia-rtx-3090/

Look at their updated graph which has less BS. It's never close in perf/watt.

The BS part about Apple's graph was that they cut the graph short for the Nvidia card (and bent the graph a bit at the end). The full graph still shows Apple being way better per watt.

ohdeargodno · 14h ago
It's not, because Apple purposefully lied on their marketing material. Letting a 3090 go on full blast brings it pretty much in line in perf/watt. Your 3090 will not massively thermal throttle after 30 minutes either, but the M1 Ultra will.

So, yes, if you want to look good on pointless benchmarks, an M1 Ultra run for 1 minute is more efficient than a downclocked 3090.

brookst · 10h ago
How do you think Nvidia got Samsung’s 8nm process to be just as power efficient as TSMC’s 5nm node?

exmadscientist · 18h ago
Yes, Intel/AMD cannot match Apple in efficiency.

But Apple cannot beat Intel/AMD in single-thread performance. (Apple marketing works very hard to convince people otherwise, but don't fall for it.) Apple gets very, very close, but they just don't get there. (As well, you might say they get close enough for practical matters; that might be true, but it's not the question here.)

That gap, however small it might be for the end user, is absolutely massive on the chip design level. x86 chips are tuned from the doping profiles of the silicon all the way through to their heatsinks to be single-thread fast. That last 1%? 2%? 5%? of performance is expensive, and is far far far past the point of diminishing returns in terms of efficiency cost paid. That last 20% of performance burns 80% of the power. Apple has chosen not to do things this way.

So x86 chips are not particularly well tuned to be efficient. They never have been; it's, on some level, a cultural problem. Could they be? Of course! But then the customers who want what x86 is right now would be sad. There are a lot of customers who like the current models, from hyperscalers to gamers. But they're increasingly bad fits for modern "personal computing", a use case which Apple owns. So why not have two models? When I said "doping profiles of the silicon" above, that wasn't hyperbole, that's literally true. It is a big deal to maintain a max-performance design and a max-efficiency design. They might have the same RTL but everything else will be different. Intel at their peak could have done it (but was too hubristic to try); no one else manufacturing x86 has had the resources. (You'll note that all non-Apple ARM vendor chips are pure efficiency designs, and don't even get close to Apple or Intel/AMD. This is not an accident. They don't have the resources to really optimize for either one of these goals. It is hard to do.)

Thus, the current situation: Apple has a max-efficiency design that's excellent for personal computing. Intel/AMD have aging max-performance designs that do beat Apple at absolute peak... which looks less and less like the right choice with every passing month. Will they continue on that path? Who knows! But many of their customers have historically liked this choice. And everyone else... isn't great at either.

mojuba · 15h ago
> Apple has a max-efficiency design that's excellent for personal computing. Intel/AMD have aging max-performance designs that do beat Apple at absolute peak...

Can you explain then, how come switching from Intel MBP to Apple Silicon MBP feels like literally everything is 3x faster, the laptop barely heats up at peak load, and you never hear the fans? Going back to my Intel MBP is like going back to stone age computing.

In other words if Intel is so good, why is it... so bad? I genuinely don't understand. Keep in mind though, I'm not comparing an Intel gaming computer to a laptop, let's compare oranges to oranges.

fxtentacle · 14h ago
If you take a peak-performance-optimized design (the Intel CPU) and throttle it down to low power levels, it will be slower than a design optimized for low power (the Apple CPU).

"let's compare oranges to oranges"

That's impossible because Apple has bought up most of TSMC's 3nm production capacity. You could try to approximate by comparing Apple M4 Max against NVIDIA B300 but that'll be a very one-sided win for NVIDIA.

wtallis · 8h ago
> That's impossible because Apple has bought up most of TSMC's 3nm production capacity. You could try to approximate by comparing Apple M4 Max against NVIDIA B300 but that'll be a very one-sided win for NVIDIA.

Have you not heard that Intel's Lunar Lake is made on the same TSMC 3nm process as Apple's M3? It's not at all "impossible" to make a fair and relevant comparison here.

VHRanger · 9h ago
> Can you explain then, how come switching from Intel MBP to Apple Silicon MBP feels like literally everything is 3x faster, the laptop barely heats up at peak load, and you never hear the fans? Going back to my Intel MBP is like going back to stone age computing.

My understanding of it is that Apple Silicon's very very long instruction pipeline plays well with how the software stack in MacOS is written and compiled first and foremost.

Similarly, the same applications often take less RAM on macOS than even on Linux, because at the OS level stuff like garbage collection is better integrated.

bee_rider · 5h ago
Is the Intel MacBook very old?

Is it possible that your workloads are bound by something other than single-threaded compute performance? Memory? Drive speed?

Is it possible that Apple did a better job tuning their OS for their hardware, than for Intel’s?

bpavuk · 14h ago
it all comes down to the thermal budget of something as thin as an MBP.
aurareturn · 11h ago

> But Apple cannot beat Intel/AMD in single-thread performance.

AMD, Intel, Qualcomm have all referenced Geekbench ST numbers. In Geekbench, Apple is significantly ahead of AMD and Intel in ST performance. So no need for Apple marketing to convince us. The industry has benchmarks to do so.
whizzter · 14h ago
It's the gaming/HPC focus: sure, you can achieve some stunning benchmark numbers with nice vectorized, straightforward code.

In the real world we have our computers running JIT'ed JS, Java or similar code taking up our cpu time, tons of small branches (mostly taken the same way and easily remembered by the branch predictor) and scattering reads/writes all over memory.

Transistors not spent on larger branch prediction caches or L1 caches are badly spent; it doesn't matter if the CPU can issue a few more instructions per clock to ace a benchmark if it's waiting for branch mispredictions or cache misses most of the time.

It's no coincidence that the Apple teams IIRC are partly the same people who built the Pentium M (which began the Core era by delivering very good perf on mobile chips when the P4 was supposed to be the flagship).

steveBK123 · 11h ago
> looks less and less like the right choice with every passing month

It does seem like for at least the last 3-5 years it's been pretty clear that Intel x86 was optimizing for the wrong target / a shrinking market.

HPC increasingly doesn't care about single core/thread performance and is increasingly GPU centric.

Anything that cares about efficiency/heat (basically all consumer now - mobile, tablet, laptop, even small desktop) has gone ARM/RISC.

Datacenter market is increasingly run by hyperscalers doing their own chip designs or using AMD for cost reasons.

bee_rider · 4h ago
It seems impossible that CPUs could ever catch up to GPUs, for the things that GPUs are really good at.

I dunno. I sort of like all the vector extensions we’ve gotten on the CPU side as they chase that dream. But I do wonder if Intel would have been better off just monomaniacally focusing on single-threaded performance, with the expectation that their chips should double down on their strength, rather than trying to attack where Nvidia is strong.

menaerus · 12h ago
> But Apple cannot beat Intel/AMD in single-thread performance

It's literally one of the main advantages of the Apple M chips over Intel/AMD. At the time the M1 came out, it was the only chip that managed to consume ~100GB/s of MBW with just a single thread.

https://web.archive.org/web/20240902200818/https://www.anand...

> From a single core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric to up to 102GB/s. This is extremely impressive and outperforms any other design in the industry by multiple factors, we had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core and that the bottleneck had been on the DRAM itself.

privatelypublic · 17h ago
Let's not forget: there shouldn't be anything preventing you from setting PL1 and PL2 power levels in Linux or Windows, AMD or Intel. Sometimes you can even set them in the BIOS.

Letting you limit just how much of that extra 20% power hogging perf you want.

torginus · 16h ago
There are just so many confounding factors that it's almost entirely impossible to pin down what's going on.

- M-series chips have closely integrated RAM right next to the CPU, while AMD makes do with standard DDR5 far away from the CPU, which leads to a huge latency increase

- I wouldn't be surprised if Apple CPUs (which have a mobile legacy) are much more efficient/faster at 'bursty' workloads - waking up, doing some work and going back to sleep

- M series chips are often designed for a lower clock frequency, and power consumption increases quadratically with voltage (due to capacitive charge/discharge losses on FETs; a quick calculation follows below). Here's a diagram that shows this on a GPU:

https://imgur.com/xcVJl1h
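
To make the scaling concrete: dynamic switching power goes roughly as P ≈ α·C·V²·f, and higher clocks generally need higher voltage, so power climbs much faster than frequency. A minimal sketch with made-up operating points (illustrative only, not measurements of any real chip):

    #include <stdio.h>

    /* Dynamic switching power scales roughly as P ~ alpha * C * V^2 * f.
     * With alpha and C held fixed, compare two hypothetical operating points. */
    static double relative_power(double volts, double hz, double volts0, double hz0) {
        return (volts * volts * hz) / (volts0 * volts0 * hz0);
    }

    int main(void) {
        /* Illustrative numbers only: 3.5 GHz at 0.85 V vs 5.0 GHz at 1.20 V. */
        double p = relative_power(0.85, 3.5e9, 1.20, 5.0e9);
        printf("~%.0f%% of the dynamic power for %.0f%% of the clock\n",
               p * 100.0, 100.0 * 3.5 / 5.0);
        return 0;
    }

With those made-up numbers the lower operating point lands around a third of the dynamic power for 70% of the clock, which is the general shape the diagram above shows.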

So while it's entirely possible that AArch64 is more efficient (the decode HW is simpler most likely, and encoding efficiency seems identical):

https://portal.mozz.us/gemini/arcanesciences.com/gemlog/22-0...?

It's hard to tell how much that contributes to the end result.

magicalhippo · 12h ago
Zen 5 also seems to have a bit of an underperforming memory subsystem, from what I can gather.

Hardware Unboxed just did an interesting video[1] comparing gaming performance of 7600X Zen 4 and 9700X Zen 5 processors, and also the 9800X3D for reference. In some games the 9700X Zen 5 had a decent lead over the Zen 4, but in others it had exactly the same performance. But the 9800X3D would then have a massive lead over the 9700X.

For example, in Horizon Zero Dawn benchmark, the 7600X had 182 FPS while the 9700X had 185 FPS, yet the 9800X3D had a massive 266 FPS.

[1]: https://www.youtube.com/watch?v=emB-eyFwbJg

VHRanger · 9h ago
I mean, a huge piece of software with a ton of quirks like a AAA video game is arguably not a good benchmark for understanding hardware.

They're still good benchmarks IMO because they represent a "real workload" but to understand why the 9800X3D performs this much better you'd want some metrics on CPU cache misses in the processors tested.

It's often similar to hyperthreading -- on very efficient software you actually want to turn SMT off sometimes because it causes too many cache evictions as two threads fight for the same L2 cache space that is already efficiently utilized.

So software having a huge speedup from an X3D model with a ton of cache might indicate the software has a bad data layout and needs the huge cache because it keeps doing RAM round trips. You'd presumably also see large speedups in this case from faster RAM on the same processor.

magicalhippo · 6h ago
> but to understand why the 9800X3D performs this much better you'd want some metrics on CPU cache misses in the processors tested.

But as far as I can tell the 9600X and the 9800X3D are the same except for the 3D cache and a higher TDP. However they have similar peak extended power (~140W), and I don't see how the different TDP numbers explain the differences between the 9600X and 7600X, where the former is sometimes ahead and other times identical, while the 9800X3D beats both massively regardless.

What other factors could it be besides fewer L3 cache misses that lead to 40+% better performance of the 9800X3D?

> You'd presumably also see large speedups in this case from faster RAM on the same processor.

That was precisely my point. The Zen 5 seems to have a relatively slow memory path. If the M-series has a much better memory path, then the Zen 5 is at a serious disadvantage for memory-bound workloads. Consider local CPU-run LLMs as a prime example. The M-series crushes AMD there.

I found the gaming benchmark interesting because it included workloads that just straddled the cache sizes, and thus showed how good the Zen 5 could be had it had a much better memory subsystem.

I'm happy to be corrected though.

formerly_proven · 16h ago
> M-series chips have closely integrated RAM right next to the CPU, while AMD makes do with standard DDR5 far away from the CPU, which leads to a huge latency increase

2/3rds the speed of light must be very slow over there

torginus · 16h ago
I mean at 2GHz, and 2/3c, the signal travels about 10cm in 1 clock cycle. So it's not negligible, but I suspect it has much more to do with signal integrity and the transmission line characteristics of the data bus.

I think that since on mobile CPUs the RAM sits right on top of the SoC, the CPUs are very likely designed with low RAM latency in mind.

christkv · 15h ago
I think the M chips have a much wider data bus, so bandwidth is much higher, along with lower latency?
VHRanger · 9h ago
huh, it seems like the M4 pro can hit >400GB/s of RAM bandwidth whereas even a 9950x hits only 100GB/s.

I'm curious how that is; in practice it "feels" like my 9950X is much more efficient than an M4 at "move tons of RAM" tasks like a DuckDB workload.

But then again a 9950x has other advantages going on like AVX512 I guess?

hnuser123456 · 8h ago
Yes, the M-series chips effectively use several "channels" of RAM (depending on the tier/size of chip) while most desktop parts, including the 9950x, are dual-channel. You get 51.2 GB/s of bandwidth per channel of DDR5-6400.
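
To put rough numbers on that (peak theoretical figures, not measurements), the arithmetic is just transfer rate × bus width × channels; a 64-bit DDR5 channel moves 8 bytes per transfer. A small sketch, where the wide-bus line is an illustrative soldered-LPDDR configuration rather than a quoted spec:

    #include <stdio.h>

    /* Peak theoretical DRAM bandwidth: transfers/s * bytes per transfer * channels. */
    static double peak_gb_per_s(double megatransfers, int bus_bits, int channels) {
        return megatransfers * 1e6 * (bus_bits / 8.0) * channels / 1e9;
    }

    int main(void) {
        printf("DDR5-6400, 2 x 64-bit channels: %.1f GB/s\n", peak_gb_per_s(6400, 64, 2));
        /* Wide soldered LPDDR bus; width and speed chosen for illustration only. */
        printf("LPDDR5X-8533, 256-bit bus     : %.1f GB/s\n", peak_gb_per_s(8533, 256, 1));
        return 0;
    }

The dual-channel figure is where the ~100 GB/s number for desktop parts comes from; widening the (soldered) bus is what pushes the unified-memory parts into the hundreds of GB/s.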

You can get 8-RAM-channel motherboards and CPUs and have 400 GB/s of DDR5 too, but you pay a price for the modularity and capacity over it all being integrated and soldered. DIMMs will also have worse latency than soldered chips and have a max clock speed penalty due to signal degradation at the copper contacts. A Threadripper Pro 9955WX is $1649, a WRX90 motherboard is around $1200, and 8x16GB sticks of DDR5 RDIMMS is around $1200, $2300 for 8x32GB, $3700 for 8x64GB sticks, $6000 for 8x96GB.

robotnikman · 4h ago
> Intel and AMD have had plenty of years to catch up to Apple's M-architecture and they still aren't able to touch it in efficiency

A big reason for this, at least for AMD, is because Apple buys all of TSMC's latest and greatest nodes for massive sums of money, so there is simply none left for others like AMD who are stuck a generation behind. And Intel is continually stuck trying to catch up. I would not say its due to x86 itself.

ChoGGi · 10h ago
> The same goes for GPUs, where Apple's M1 GPU completely smoked an RTX3090 in performance-per-watt, offering 320W of RTX 3090 performance in a 110W envelope: https://images.macrumors.com/t/xuN87vnxzdp_FJWcAwqFhl4IOXs=/...

I see it's measuring full system wattage with a 12900k which tended to use quite a bit of juice compared to AMD offerings.

https://gamersnexus.net/u/styles/large_responsive_no_waterma...

diddid · 19h ago
I mean the M1 is nice but pretending that it can do in 110w what the 3090 does with 320w is Apple marketing nonsense. Like if your use case is playing games like cp2077, the 3090 will do 100fps in ultra ray tracing and an M4 Max will only do 30fps. Not to mention it’s trivial to undervolt nvidia cards and get 100% performance at 80% power. So 1/3 the power for 1/3 the performance? How is that smoking anything?
whatevaa · 14h ago
Apple fans drinking apple juice, nothing new with fans, sadly.
pjmlp · 12h ago
Indeed, like talking as if Apple mattered at all in server space, or digital workstations (studio is not a replacement for people willing to buy Mac Pros, which still keep being built with Intel Xeons).
whatagreatboy · 14h ago
Even Jim Keller says that instruction decode is the difference, and that saves a lot of battery for ARM even if it doesn't change the core efficiency at full load.
michaelmrose · 17h ago
RTX3090 is a desktop part optimized for maximum performance with a high-end desktop power supply. It isn't meaningful to compare its performance per watt with a laptop part.

Saying it offers a certain wattage worth of the desktop part means even less because it measures essentially nothing.

You would probably want to compare it to a mobile 3050 or 4050 although this still risks being a description of the different nodes more so than the actual parts.

KingOfCoders · 17h ago
It's no comparison at all, the person who bought a 3090 in the 30xx days wanted max gaming performance, someone with an Apple laptop wants longer battery usage.

It's like comparing an F150 with a Ferrari, a decision that no buyer needs to make.

thechao · 11h ago
> It's like comparing an F150 with a Ferrari, a decision that no buyer needs to make.

... maybe a Prius? Bruh.

aredox · 15h ago
>Intel and AMD have had plenty of years to catch up to Apple's M-architecture and they still aren't able to touch it in efficiency.

Why would they spend billions to "catch up" to an ecological niche that is already occupied, when the best they could do - if the argument here is right that x86 and ARM are equivalent - is getting the same result?

They would only invest this much money and time if they had some expectation of being better, but "sharing the first place" is not good enough.

samus · 14h ago
The problem is that they are slowly losing the mobile markets, while their usual markets are not growing as they used to. AMD is less vulnerable to the issues that arise from that because they are fabless, and they could pivot entirely to GPU or non-x86 markets if they really wanted to. But Intel has fabs (very expensive in terms of R&D and capex) dedicated to products for desktop and server markets that must continue to generate revenue.
9rx · 10h ago
The question remains: Why would they spend billions only to "catch up"? That means, even after the investment is made and they have a product that is just as good, there is still no compelling technical reason for the customer to buy Intel/AMD over the other alternatives that are just as good, so their only possible avenue to attract customers is to drive the price into the ground which is a losing proposition.
42lux · 10h ago
The markets didn't really budge. Apple only grew 1% in the traditional PC market (desktop/notebook) over the last 5 years and that's despite a wave of new products. The snapdragons are below 1%...
KingOfCoders · 17h ago
Why would they? They are dominated by gaming benchmarks in a way Apple isn't. For decades it was not efficiency but raw performance; 50% more power usage for 10% more performance was OK.

"The same goes for GPUs, where Apple's M1 GPU completely smoked an RTX3090 in performance-per-watt"

Gamers are not interested in performance-per-watt but fps-per-$.

If some behavior looks strange to you, most probably you don't understand the underlying drivers.

goalieca · 12h ago
> Gamers are not interested in performance-per-watt but fps-per-$.

I game a decent amount on handheld mode for the switch. Like tens of millions of others.

pjmlp · 11h ago
While others run PlayStation and XBox.

The demographics aren't the same, nor the games.

KingOfCoders · 10h ago
I game a decent amount on table top games. Like tens of millions of others.
codedokode · 1d ago
x86 decoding must be a pain - I vaguely remember that they have trace caches (a cache of decoded micro-operations) to skip decoding in some cases. You probably don't make such caches when decoding is easy.

Also, more complicated decoding and extra caches mean a longer pipeline, which means a higher price to pay when a branch is mispredicted (binary search is a festival of branch misprediction, for example, and I got a 3x acceleration of linear search on small arrays when I switched to the branchless algorithm).
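
For anyone curious what "the branchless algorithm" looks like, here is a minimal branchless lower-bound over a sorted array (a sketch of the general technique, not the exact code referred to above); the point is that the search direction becomes a conditional move instead of a branch the predictor has to guess:

    #include <stddef.h>

    /* Branchless lower bound: halve the window each step and advance the base
     * with a conditional move instead of a data-dependent branch, so there is
     * nothing for the branch predictor to mispredict. Returns the index of the
     * first element >= key, or n if there is none. */
    size_t lower_bound(const int *a, size_t n, int key) {
        const int *base = a;
        while (n > 1) {
            size_t half = n / 2;
            /* Compilers typically emit cmov/csel here rather than a branch. */
            base += (base[half - 1] < key) ? half : 0;
            n -= half;
        }
        return (size_t)(base - a) + (n == 1 && base[0] < key);
    }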

Also I am not a CPU designer, but branch prediction with a wide decoder must also be a pain - imagine that while you are loading 16 or 32 bytes from the instruction cache, you need to predict the address of the next loaded chunk in the same cycle, before you even see what you got from cache.

As for encoding efficiency, I played with little algorithms (like binary search or slab allocator) on godbolt, and RISC-V with compressed instructions generates a similar amount of code to x86 - in rare cases, even slightly smaller. So x86 has complex decoding that doesn't give any noticeable advantages.

x86 also has flags, which add implicit dependencies between instructions, and must make designer's life harder.

wallopinski · 22h ago
I was an instruction fetch unit (IFU) architect on P6 from 1992-1995. And yes, it was a pain, and we had close to 100x the test vectors of all the other units, going back to the mid 1980's. Once we started going bonkers with the prefixes, we just left the pre-Pentium decoder alone and added new functional blocks to handle those. And it wasn't just branch prediction that sucked, like you called out! Filling the instruction cache was a nightmare, keeping track of head and tail markers, coalescing, rebuilding, ... lots of parallel decoding to deal with cache and branch-prediction improvements to meet timing as the P6 core evolved was the typical solution. We were the only block (well, minus IO) that had to deal with legacy compatibility. Fortunately I moved on after the launch of Pentium II and thankfully did not have to deal with Pentium4/Northwood.
nerpderp82 · 20h ago
https://en.wikipedia.org/wiki/P6_(microarchitecture)

The P6 is arguably the most important x86 microarch ever; it put Intel on top over the RISC workstations.

What was your favorite subsystem in the P6 arch?

Was it designed in Verilog? What languages and tools were used to design P6 and the PPro?

wallopinski · 18h ago
Well duh, the IFU. :) No, I was fond of the FPU because the math was just so bonkers. The way division was performed with complete disregard to the rules taught to gradeschoolers always fascinated me. Bob Colwell told us that P6 was the last architecture one person could understand completely.

Tooling & Languages: IHDL, a templating layer on top of HDL that had a preprocessor for intel-specific macros. DART test template generator for validation coverage vectors. The entire system was stitched together with PERL, TCL, and shellscripts, and it all ran on three OSes: AIX, HPUX and SunOS. (I had a B&W sparcstation and was jealous of the 8514/a 1024x768 monitors on AIX.) We didn't go full Linux until Itanic and by then we were using remote computing via Exceed and gave up our workstations for generic PCs. When I left in the mid 2000's, not much had changed in the glue/automation languages, except a little less Tcl. I'm blanking on the specific formal verification tool, I think it was something by Cadence. Synthesis and timing was ... design compiler and primetime? Man. Cobwebs. When I left we were 100% Cadence and Synopsys and Verilog (minus a few custom analog tools based on SPICE for creating our SSAs). That migration happened during Bonnell, but gahd it was painful. Especially migrating all the Pentium/486/386/286/8088 test vectors.

I have no idea what it is like ~20 years later (gasp), but I bet the test vectors live on, like Henrietta Lacks' cells. I'd be interested to hear from any Intelfolk reading this?

jcranmer · 1d ago
> x86 decoding must be a pain

So one of the projects I've been working on and off again is the World's Worst x86 Decoder, which takes a principled approach to x86 decoding by throwing out most of the manual and instead reverse-engineering semantics based on running the instructions themselves to figure out what they do. It's still far from finished, but I've gotten it to the point that I can spit out decoder rules.

As a result, I feel pretty confident in saying that x86 decoding isn't that insane. For example, here's the bitset for the first two opcode maps on whether or not opcodes have a ModR/M operand: ModRM=1111000011110000111100001111000011110000111100001111000011110000000000000000000000000000000000000011000001010000000000000000000011111111111111110000000000000000000000000000000000000000000000001100111100000000111100001111111100000000000000000000001100000011111100000000010011111111111111110000000011111111000000000000000011111111111111111111111111111111111111111111111111111110000011110000000000000000111111111111111100011100000111111111011110111111111111110000000011111111111111111111111111111111111111111111111

I haven't done a k-map on that, but... you can see that a boolean circuit isn't that complicated. Also, it turns out that this isn't dependent on presence or absence of any prefixes. While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle, which means the main limitation on the parallelism in the decoder is how wide you can build those muxes (which, to be fair, does have a cost).

That said, there is one instruction where I want to go back in time and beat up the x86 ISA designers. f6/0, f6/1, f7/0, and f7/1 [1] take in an extra immediate operand whereas f6/2 et al. do not. It's the sole case in the entire ISA where this happens.

[1] My notation for when x86 does its trick of using one of the register selector fields as extra bits for opcodes.

monocasa · 20h ago
> While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle

That's been my understanding as well. X86 style length decoding is about one pipeline stage if done dynamically.

The simpler riscv length decoding ends up being about a half pipeline stage on the wider decoders.

Dylan16807 · 23h ago
> While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle

That's some very faint praise there. Especially when you're trying to chop up several instructions every cycle. Meanwhile RISC-V is "count leading 1s. 0-1:16bit 2-4:32bit 5:48bit 6:64bit"
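
Spelled out as a software sketch of that rule (in hardware it is a handful of gates per 16-bit parcel; encodings longer than 64 bits are reserved and ignored here):

    #include <stdint.h>

    /* RISC-V length rule: count the contiguous 1s in the lowest bits of the
     * first 16-bit parcel. 0-1 ones -> 16-bit, 2-4 -> 32-bit, 5 -> 48-bit,
     * 6 -> 64-bit. */
    static int rv_insn_bytes(uint16_t first_parcel) {
        int ones = 0;
        while (ones < 7 && ((first_parcel >> ones) & 1))
            ones++;
        if (ones <= 1) return 2;  /* compressed (C extension) */
        if (ones <= 4) return 4;  /* standard 32-bit encoding */
        if (ones == 5) return 6;  /* 48-bit */
        if (ones == 6) return 8;  /* 64-bit */
        return 0;                 /* >= 80-bit / reserved encodings */
    }

On x86 the same question depends on prefixes, the opcode, ModRM/SIB and immediate sizes, which is why every byte position needs a much bigger chunk of logic.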

mohinder · 22h ago
The chopping up can happen the next cycle, in parallel across all the instructions in the cache line(s) that were fetched, and it can be pipelined so there's no loss in throughput. Since x86 instructions can be as small as one byte, in principle the throughput-per-cache-line can be higher on x86 than on RISC-V (e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8). And in any case, there are RISC-V extensions that allow variable-length instructions now, so they have to deal with the problem too.
codedokode · 16h ago
As for program size, I played with small algorithms (like binary search) on godbolt, and my x86 programs had a similar size to RISC-V with compressed instructions. I rarely saw 1-byte instructions; there was almost always at least one prefix.

> e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8

With compressed instructions the theoretical maximum is 16.

> so they have to deal with the problem too.

Luckily you can determine the length from the first bits of an instruction, and you can have either 2 bytes left over from the previous line, or 0.

Dylan16807 · 22h ago
> The chopping up can happen the next cycle

It still causes issues.

> Since x86 instructions can be as small as one byte, in principle the throughput-per-cache-line can be higher on x86 than on RISC-V (e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8).

RISC-V has better code density. The handful of one byte instructions don't make up for other longer instructions.

> And in any case, there are RISC-V extensions that allow variable-length instructions now, so they have to deal with the problem too.

Now? Have to deal with the problem too?

It feels like you didn't read my previous post. I was explaining how it's much much simpler to decode length. And the variable length has been there since the original version.

eigenform · 19h ago
Don't know this for certain, but I always assumed that x86 implementations get away with this by predecoding cachelines. If you're going to do prefetching in parallel and decoupled from everything else, might as well move part of the work there too? (obviously not without cost - plus, you can identify branches early!)
matja · 14h ago
Missing a 0 at the end
monocasa · 20h ago
> x86 decoding must be a pain - I vaguely remember that they have trace caches (a cache of decoded micro-operations) to skip decoding in some cases. You probably don't make such caches when decoding is easy.

To be fair, a lot of modern ARM cores also have uop caches. There's a lot to decode even without the variable length component, to the point that keeping a cache of uops and temporarily turning pieces of the IFU off can be a win.

camel-cdr · 18h ago
> trace caches

They don't anymore; they have uop caches. But trace caches are great, and Apple uses them [1].

They allow you to collapse taken branches into a single fetch.

Which is extremely important, because the average instructions/taken-branch is about 10-15 [2]. With a 10-wide frontend, every second fetch would only be half utilized or worse.

> extra caches

This is one thing I don't understand, why not replace the L1I with the uop-cache entirely?

I quite like what Ventana does with the Veyron V2/V3. [3,4] They replaced the L1I with a macro-op trace cache, which can collapse taken branches, do basic instruction fusion and more advanced fusion for hot code paths.

[1] https://www.realworldtech.com/forum/?threadid=223220

[2] https://lists.riscv.org/g/tech-profiles/attachment/353/0/RIS... (page 10)

[3] https://www.ventanamicro.com/technology/risc-v-cpu-ip/

[4] https://youtu.be/EWgOVIvsZt8

adgjlsfhk1 · 3h ago
you need both. Branches don't tell you "jump to this micro-op", they're "jump to this address" so you need the address numbering of a normal L1i.
jabl · 12h ago
> x86 also has flags, which add implicit dependencies between instructions, and must make designer's life harder.

Fortunately flags (or even individual flag bits) can be renamed just like other registers, removing that bottleneck. And some architectures that use flag registers, like aarch64, have additional arithmetic instructions which don't update the flag register.

Using flag registers brings benefits as well. E.g. conditional jump distances can be much larger (e.g. 1 MB in aarch64 vs. 4K in RISC-V).

adgjlsfhk1 · 3h ago
How big a difference is that? 1MB is still too small to jump to an arbitrary function, and 4K is big enough to almost always jump within a function.
jabl · 16h ago
> I vaguely remember that they have trace caches (a cache of decoded micro-operations) to skip decoding in some cases. You probably don't make such caches when decoding is easy.

The P4 microarch had trace caches, but I believe that approach has since been avoided. What practically all contemporary x86 processors do have, though, is u-op caches, which contain decoded micro-ops. Note this is not the same as a trace cache.

For that matter, many ARM cores also have u-op caches, so it's not something that is uniquely useful only on x86. The Apple M* cores AFAIU do not have u-op caches, FWIW.

phire · 1d ago
Intel’s E cores decode x86 without a trace cache (μop cache), and are very efficient. The latest (Skymont) can decode 9 x86 instructions per cycle, more than the P core (which can only decode 8)

AMD isn’t saying that decoding x86 is easy. They are just saying that decoding x86 doesn’t have a notable power impact.

varispeed · 1d ago
Does that really say anything about efficiency? Why can't they decode 100 instructions per cycle?
ajross · 1d ago
> Why can't they decode 100 instructions per cycle?

Well, obviously because there aren't 100 individual parallel execution units to which those instructions could be issued. And lower down the stack because a 3000 bit[1] wide cache would be extremely difficult to manage. An instruction fetch would be six (!) cache lines wide, causing clear latency and bottleneck problems (or conversely would demand your icache be 6x wider, causing locality/granularity problems as many leaf functions are smaller than that).

But also because real world code just isn't that parallel. Even assuming perfect branch prediction the number of instructions between unpredictable things like function pointer calls or computed jumps is much less than 100 in most performance-sensitive algorithms.

And even if you could, the circuit complexity of decoding variable length instructions is superlinear. In x86, every byte can be an instruction boundary, but most aren't, and your decoder needs to be able to handle that.

[1] I have in my head somewhere that "the average x86_64 instruction is 3.75 bytes long", but that may be off by a bit. Somewhere around that range, anyway.

GeekyBear · 22h ago
Wasn't the point of SMT that a single instruction decoder had difficulty keeping the core's existing execution units busy?
fulafel · 18h ago
No, it's about the same bottleneck that also explains the tapering off of single core performance. We can't extract more parallelism from the single flow-of-control of programs, because operations (and esp control flow transfers) are dependent on results of previous operations.

SMT is about addressing the underutilization of execution resources where your 6-wide superscalar processor gets 2.0 ILP.

See eg https://my.eng.utah.edu/~cs6810/pres/6810-09.pdf

BobbyTables2 · 21h ago
I vaguely thought it was to provide another source of potentially “ready” instructions when the main thread was blocked on I/O to main memory (such as when register renaming can’t proceed because of dependencies).

But I could be way off…

ajross · 20h ago
No, it's about instruction latency. Some instructions (cache misses that need to hit DRAM) will stall the pipeline and prevent execution of following instructions that depend on the result. So the idea is to keep two streams going at all times so that the other side can continue to fill the units. SMT can be (and was, on some Atom variants) a win even with an in-order architecture with only one pipeline.
imtringued · 15h ago
That's a gross misrepresentation of what SMT is to the point where nothing you said is correct.

First of all. In SMT there is only one instruction decoder. SMT merely adds a second set of registers, which is why it is considered a "free lunch". The cost is small in comparison to the theoretical benefit (up to 2x performance).

Secondly. The effectiveness of SMT is workload dependent, which is a property of the software and not the hardware.

If you have a properly optimized workload that makes use of the execution units, e.g. a video game or simulation, the benefit is not that big or even negative, because you are already keeping the execution units busy and two threads end up sharing limited resources. Meanwhile if you have a web server written in python, then SMT is basically doubling your performance.

So, it is in fact the opposite. For SMT to be effective, the instruction decoder has to be faster than your execution units, because there are a lot of instructions that don't even touch them.

eigenform · 19h ago
I think part of the argument is that doing a micro-op cache is not exactly cutting down on your power/area budget.

(But then again, do the AMD e-cores have uop caches?)

eigenform · 19h ago
> [...] imagine that while you are loading 16 or 32 bytes from instruction cache, you need to predict the address of next loaded chunk in the same cycle, before you even see what you got from cache.

Yeah, you [ideally] want to predict the existence of taken branches or jumps in a cache line! Otherwise you have cycles where you're inserting bubbles into the pipeline (if you aren't correctly predicting that the next-fetched line is just the next sequential one ..)

ahartmetz · 1d ago
Variable length decoding is more or less figured out, but it takes more design effort, transistors and energy. They cost, but not a lot, relatively, in a current state of the art super wide out-of-order CPU.
wallopinski · 22h ago
"Transistors are free."

That was pretty much the uArch/design mantra at intel.

nerpderp82 · 20h ago
Isn't that still true for high perf chips? We don't have ways to use all those transistors so we make larger and larger caches.
exmadscientist · 18h ago
Max-performance chips even introduce dead dummy transistors ("dark silicon") to provide a bit of heat sinking capability. Having transistors that are sometimes-but-rarely useful is no problem whatsoever for modern processes.
yvdriess · 15h ago
AFAIK the dark silicon term is specifically those transistors not always powered on. Doping the Si substrate to turn it into transistors is not going to change the heat profile, so I don't think dummy transistors are added on purpose for heat management. Happy to be proven wrong though.
drob518 · 19h ago
It has turned out to be a pretty good rule of thumb over the decades.
rasz · 1d ago
"Not a lot" is not how I would describe it. Take a 64-bit piece of fetched data. On ARM64 you will just push that into two decoder blocks and be done with it. On x86 you've got what, a 1 to 15 byte range per instruction? I don't even want to think about the possible permutations; it's on the order of 10^(some two-digit number).
mohinder · 22h ago
You don't need all the permutations. If there are 32 bytes in a cache line then each instruction can only start at one of 32 possible positions. Then if you want to decode N instructions per cycle you need N 32-to-1 muxes. You can reduce the number of inputs to the later muxes since instructions can't be zero size.
monocasa · 20h ago
It was even simpler until very recently, when the decode stage would only look at a max 16-byte floating window.
saagarjha · 22h ago
Yes, but you're not describing it from the right position. Is instruction decode hard? Yes, if you think about it in isolation (also, fwiw, it's not a permutation problem as you suggest). But the core has a bunch of other stuff it needs to do that is far harder. Even your lowliest Pentium from 2000 can do instruction decode.
ahartmetz · 1d ago
It's a lot for a decoder, but not for a whole core. Citation needed, but I remember that the decoder is about 10% of a Ryzen core's power budget, and of course that is with a few techniques better than complete brute force.
topspin · 17h ago
I've listened to Keller's views on CPU design and the biggest takeaway I found is that performance is overwhelmingly dominated by predictors. Good predictors mitigate memory latency and keep pipelines full. Bad predictors stall everything while cores spin on useless cache lines. The rest, including ISA minutiae, ranks well below predictors on the list of things that matter.

At one time, ISA had a significant impact on predictors: variable length instructions complicated predictor design. The consensus is that this is no longer the case: decoders have grown to overcome this and now the difference is negligible.

devnullbrain · 11h ago
> fixed-length instructions seem really nice when you're building little baby computers, but if you're building a really big computer, to predict or to figure out where all the instructions are, it isn't dominating the die. So it doesn't matter that much.

The notebooks of TFA aren't really big computers.

adgjlsfhk1 · 3h ago
They are. By "small" here, we're referring to the core size, not the machine size, i.e. an in-order, dual-issue CPU is small, but an M1 chip is massive.
IshKebab · 16h ago
Yeah I'm not sure I buy it either.

It doesn't matter if most instructions have simple encodings. You still need to design your front end to handle the crazy encodings.

I doubt it makes a big difference, so until recently he would have been correct - why change your ISA when you can just wait a couple of months to get the same performance improvement? But Moore's law is dead now, so small performance differences matter way more.

imtringued · 15h ago
The core argument in RISC vs CISC has never been that you can't add RISC style instructions to a CISC. If anything, the opposite is true, because CISC architectures just keep adding more and more instructions.

The argument has been that even if you have a CISC ISA that also happens to have a subset of instructions following the RISC philosophy, that the bloat and legacy instructions will hold CISC back. In other words, the weakness of CISC is that you can add, but never remove.

Jim Keller disagrees with this assessment and it is blatantly obvious.

You build a decoder that predicts that the instructions are going to have simple encodings and if they don't, then you have a slow fallback.

Now you might say that this makes the code slow under the assumption that you make heavy use of the complex instructions, but compilers have a strong incentive to only emit fast instructions.

If you can just add RISC style instructions to CISC ISAs, the entire argument collapses into irrelevance.

IshKebab · 12h ago
It's not just the complex encodings though, there's also the variable instruction length, and the instruction semantics that mean you need microcode.

Obviously they've done an amazing job of working around it, but that adds a ton of complexity. At the very least it's going to mean you spend engineering resources on something that ARM & RISC-V don't even have to worry about.

This seems a little like a Java programmer saying "garbage collection is solved". Like, yeah you've made an amazingly complicated concurrent compacting garbage collector that is really fast and certainly fast enough almost all of the time. But it's still not as fast as not doing garbage collection. If you didn't have the "we really want people to use x86 because my job depends on it" factor then why would you use CISC?

fanf2 · 1d ago
Apple’s ARM cores have wider decode than x86

M1 - 8 wide

M4 - 10 wide

Zen 4 - 4 wide

Zen 5 - 8 wide

ryuuchin · 21h ago
Is Zen 5 more like a 4x2 than a true 8 since it has dual decode clusters and one thread on a core can't use more than one?

https://chipsandcheese.com/i/149874010/frontend

adgjlsfhk1 · 1d ago
pure decoder width isn't enough to tell you everything. X86 has some commonly used ridiculously compact instructions (e.g. lea) that would turn into 2-3 instructions on most other architectures.
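
As a concrete illustration of that density, a single lea folds a scaled-index address computation into one instruction; the sketch below shows the equivalent arithmetic in C (illustrative only, not compiler output):

    #include <stdint.h>

    /* What one x86 instruction such as `lea rax, [rdi + rsi*4 + 8]` computes:
     * base + index*scale + displacement, with scale limited to 1, 2, 4 or 8.
     * On a fixed-length RISC ISA this is typically a shift plus one or two adds. */
    uint64_t lea_like(uint64_t base, uint64_t index) {
        return base + index * 4 + 8;
    }
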
ack_complete · 8h ago
Yes, but so does ARM. ld1 {v0.16b,v1.16b,v2.16b,v3.16b}, [x0], #64 loads 4 x 128-bit vector registers and post-increments a pointer register.
monocasa · 20h ago
Additionally, stuff like RMW instructions are really like at least three, maybe four or five RISC instructions.
ajross · 1d ago
The whole ModRM addressing encoding (to which LEA is basically a front end) is actually really compact, and compilers have gotten frighteningly good at exploiting it. Just look at the disassembly for some non-trivial code sometime and see what it's doing.
kimixa · 1d ago
Also the op cache - if it hits that the decoder is completely skipped.
wmf · 1d ago
Skymont - 9 wide
mort96 · 1d ago
Wow, I had no idea we were up to 8 wide decoders in amd64 CPUs.
AnotherGoodName · 1d ago
For variable vs fixed width, I have heard that fixed width is part of Apple Silicon's performance. There are literally gains to be had here for sure, IMHO.
astrange · 1d ago
It's easier but it's not that important. It's more important for security - you can reinterpret variable length instructions by jumping inside them.
mort96 · 1d ago
This matches my understanding as well, as someone who has a great deal of interest in the field but never worked in it professionally. CPUs all have a microarchitecture that doesn't look like the ISA at all, and they have an instruction decoder that translates one or more ISA instructions into zero or more microarchitectural instructions. There are some advantages to having a more regular ISA, such as the ability to more easily decode multiple instructions in parallel if they're all the same size, or having to spend fewer transistors on the instruction decoder, but for the big superscalar chips we all have in our desktops and laptops and phones, the drawbacks are tiny.

I imagine that the difference is much greater for the tiny in-order CPUs we find in MCUs though, just because an amd64 decoder would be a comparatively much larger fraction of the transistor budget

themafia · 1d ago
Then there's mainframes, where you want code compiled in 1960 to run unmodified today. There was quite an advantage originally as well, as IBM was able to implement the same ISA with three different types and costs of computers.
astrange · 1d ago
uOps are kind of oversold in the CPU design mythos. They are not that different from the original ISA, and some x86 instructions (like lea) are both complex and natural fits for hardware so don't get microcoded.
newpavlov · 1d ago
>RISC-V is a new-school hyper-academic hot mess.

Yeah... Previously I was a big fan of RISC-V, but after I had to dig slightly deeper into it as a software developer my enthusiasm for it has cooled down significantly.

It's still great that we got a mainstream open ISA, but now I view it as a Linux of the hardware world, i.e. a great achievement, with a big number of questionable choices baked in, which unfortunately stifles other open alternatives by the virtue of being "good enough".

chithanh · 17h ago
> which unfortunately stifles other open alternatives by the virtue of being "good enough".

In China at least, the hardware companies are smart enough to not put all eggs in the RISC-V basket, and are pursuing other open/non-encumbered ISAs.

They have LoongArch which is a post-MIPS architecture with elements from RISC-V. Also they have ShenWei/SunWay which is post-Alpha.

And they of course have Phytium (ARM), HeXin (OpenPower), and Zhaoxin (x86).

codedokode · 1d ago
What choices? The main thing that comes to mind is lack of exceptions on integer overflow but you are unlikely meaning this.
newpavlov · 1d ago
- Handling of misaligned loads/stores: RISC-V got itself into a weird middle ground, ops on misaligned pointers may work fine, may work "extremely slow", or cause fatal exceptions (yes, I know about Zicclsm, it's extremely new and only helps with the latter, also see https://github.com/llvm/llvm-project/issues/110454). Other platforms either guarantee "reasonable" performance for such operations, or forbid misaligned access with "aligned" loads/stores and provide separate misaligned instructions. Arguably, RISC-V should've done the latter (with misaligned instructions defined in a separate higher-end extension), since passing unaligned pointer into an aligned instruction signals correctness problems in software.

- The hardcoded page size. 4 KiB is a good default for RV32, but arguably a huge missed opportunity for RV64.

- The weird restriction in the forward progress guarantees for LR/SC sequences, which forces compilers to compile `compare_exchange` and `compare_exchange_weak` in the absolutely same way. See this issue for more information: https://github.com/riscv/riscv-isa-manual/issues/2047

- The `seed` CSR: it does not provide good quality entropy (i.e. after you have accumulated 256 bits of output, it may contain only 128 bits of randomness). You have to use a CSPRNG on top of it for any sensitive applications. Doing so may be inefficient and will bloat binary size (remember, the relaxed requirement was introduced for "low-powered" devices). Also, software developers may make mistakes in this area (not everyone is a security expert). Similar alternatives like RDRAND (x86) and RNDR (ARM) guarantee proper randomness and we can use their output directly for cryptographic keys with a very small code footprint.

- Extensions do not form hierarchies: it looks like the AVX-512 situation once again, but worse. Profiles help, but it's not a hierarchy, but a "packet". Also, there are annoyances like Zbkb not being a proper subset of Zbb.

- Detection of available extensions: we usually have to rely on the OS to query available extensions since the `misa` register is accessible only in machine mode. This makes detection quite annoying for "universal" libraries which intend to support various OSes and embedded targets. The CPUID instruction (x86) is ideal in this regard. I totally disagree with the virtualization argument against it: nothing prevents a VM from intercepting the read, and no one expects huge performance from such reads.

And this list is compiled after a pretty surface-level dive into the RISC-V spec. I heard about other issues (e.g. being unable to port tricky SIMD code to the V extension or underspecification around memory coherence important for writing drivers), but I can not confidently talk about those, so it's not part of my list.

P.S.: I would be interested to hear about other people gripes with RISC-V.

ack_complete · 21h ago
> Detection of available extensions: we usually have to rely on OS to query available extensions since the `misa` register is accessible only in machine mode.

Not a RISC-V programmer, but this drives me crazy on ARM. Dozens of optional features, but the FEAT_ bits are all readable only from EL1, and it's unspecified what API the OS exposes to query it and which feature bits are exposed. I don't care if it'd be slow, just give us the equivalent of a dedicated CPUID instruction, even if it just a reserved opcode that traps to kernel mode and is handled in software.

cesarb · 20h ago
> but the FEAT_ bits are all readable only from EL1, [...] I don't care if it'd be slow, just give us the equivalent of a dedicated CPUID instruction, even if it just a reserved opcode that traps to kernel mode and is handled in software.

I like the way the Linux kernel solves this: these FEAT_ bits are also readable from EL0, since trying to read them traps to kernel mode and the read is emulated by the kernel. See https://docs.kernel.org/arch/arm64/cpu-feature-registers.htm... for details. Unfortunately, it's a Linux-only feature, and didn't exist originally (so old enough Linux kernel versions won't have the emulation).
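For the curious, a minimal sketch of what that looks like from userspace (Linux/AArch64 only; the EL0 mrs traps and the kernel emulates the read, per the doc above):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t isar0;
        /* Reading this ID register from EL0 traps; Linux emulates the access. */
        __asm__("mrs %0, ID_AA64ISAR0_EL1" : "=r"(isar0));
        /* The AES field is bits [7:4]; nonzero means AES instructions exist. */
        printf("AES field: %u\n", (unsigned)((isar0 >> 4) & 0xf));
        return 0;
    }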

dzaima · 13h ago
Another bad choice (perhaps more accurately called a bug, but they chose to not do anything about it): vmv1r.v & co (aka whole-vector-register move instructions) depend on a valid vtype being set, despite not using any part of it (outside of the extreme edge-case of an interrupt happening in the middle of it, and the hardware wanting to chop the operation in half instead of finishing it (entirely pointless for application-class CPUs where VLEN isn't massive enough for that to in any way be useful; never mind moves being O(1) with register renaming))

So to move one vector register to another, you need to have a preceding vsetvl; worse, with the standard calling convention you may get an illegal vtype after a function call! Even worse, the behavior is actually left reserved for a move with illegal vtype, so hardware can (and some does) just allow it, thereby making it impossible to even test for on some hardware.

Oh, and that thing about being able to stop a vector instruction midway through? You might think that's to allow guaranteeing fast interrupts while keeping easy forwards progress; but no, vector reductions cannot be restarted.. And there's the extremely horrific vfredosum[1], which is an ordered float sum reduction, i.e. a linear chain of N float adds, i.e. a (fp add latency) * (element count in vector) -cycle op that must be started completely over again if interrupted.

[1]: https://dzaima.github.io/intrinsics-viewer/#0q1YqVbJSKsosTtY...

brandmeyer · 10h ago
Nothing major, just some oddball decisions here and there.

Fused compare-and-branch only extends to the base integer instructions. Anything else needs to generate a value that feeds into a compare-and-branch. Since all branches are compare-and-branch, they all need two register operands, which impairs their reach to a mere +/- 4 kB.

The reach for position-independent code instructions (AUIPC + any load or store) is not quite +/- 2 GB. There is a hole on either end of the reach that is a consequence of using a sign-extended 12-bit offset for loads and stores, and a sign-extended high 20-bit offset for AUIPC. ARM's adrp (address of page) + unsigned offsets is more uniform.

RV32 isn't a proper subset of RV64, which isn't a proper subset of RV128. If they were proper subsets, then RV64 programs could run unmodified on RV128 hardware. Not that it's ever going to happen, but if it did, the processor would have to mode-switch, not unlike the x86-64 transition of yore.

Floating point arithmetic spends three bits in the instruction encoding to support static rounding modes. I can count on zero hands the number of times I've needed that.

The integer ISA design goes to great lengths to avoid any instructions with three source operands, in order to simplify the datapaths on tiny machines. But... the floating point extension correctly includes fused multiply-add. So big chunks of any high-end processor will need three-operand datapaths anyway.

The base ISA is entirely too basic, and a classic failure of 90% design. Just because most code doesn't need all those other instructions doesn't mean that most systems don't. RISC-V is gathering extensions like a Katamari to fill in all those holes (B, Zfa, etc).

None of those things make it bad, I just don't think it's nearly as shiny as the hype. ARM64+SVE and x86-64+AVX512 are just better.

adgjlsfhk1 · 3h ago
> Floating point arithmetic spends three bits in the instruction encoding to support static rounding modes.

IMO this is way better than the alternative in x86 and ARM. The reason no one deals with rounding modes is because changing the mode is really slow and you always need to change it back or else everything breaks. Being able to do it in the instruction allows you to do operations with non-standard modes much more simply. For example, round-to-nearest-ties-to-odd can be incredibly useful to prevent double rounding.
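For contrast, this is the dance a dynamic-rounding-mode ISA pushes you into (a C sketch; FENV_ACCESS support varies by compiler):

    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON

    /* Add two doubles rounding toward +infinity, then restore the old mode.
       With a static per-instruction rounding field (as in RISC-V scalar FP),
       this is one instruction with no global state to juggle. */
    double add_round_up(double a, double b) {
        int old = fegetround();
        fesetround(FE_UPWARD);   /* global mode change: slow, affects everything */
        double r = a + b;
        fesetround(old);         /* forget this and later FP code silently breaks */
        return r;
    }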

adgjlsfhk1 · 3h ago
> The base ISA is entirely too basic

IMO this is very wrong. The base ISA is excellent for micro-controllers and teaching, but the ~90% of real implementations can add the extra 20 extensions to make a modern, fully featured CPU.

mixmastamyk · 1d ago
Sounds like a job for RISC-6, or VI.
adgjlsfhk1 · 23h ago
> - The hardcoded page size.

I'm pretty confident that this will get removed. It's an extension that made its way into RVA23, but once anyone has a design big enough for it to be a burden, it can be dropped.

monocasa · 20h ago
That's really hard to drop.

Fancier unix programs tend to make all kinds of assumptions about page size to do things like the double mapped ring buffer trick.

https://en.wikipedia.org/wiki/Circular_buffer#Optimization
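The trick maps the same physical pages twice, back to back, so indexing past the end wraps around for free; the buffer size must be a multiple of the page size, which is exactly where the assumption gets baked in. A minimal Linux-only sketch (memfd_create, error handling trimmed):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns `size` bytes (must be a multiple of the page size) mapped twice
       back to back, so accesses that run off the end land back at the start. */
    static void *double_mapped_buffer(size_t size) {
        int fd = memfd_create("ring", 0);
        if (fd < 0 || ftruncate(fd, size) != 0) return NULL;
        void *base = mmap(NULL, 2 * size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) return NULL;
        if (mmap(base, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
            mmap((char *)base + size, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) return NULL;
        close(fd);
        return base;
    }

    int main(void) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        char *buf = double_mapped_buffer(page);
        if (!buf) return 1;
        memcpy(buf + page - 3, "wrap!", 5);  /* straddles the "end" of the buffer */
        printf("%.2s\n", buf);               /* prints "p!" via the second mapping */
        return 0;
    }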

In fact, it looks like Apple Silicon maintains support for 4 KB pages just for running Rosetta. It's one of those things, like TSO, where working around the assumptions was enough of a pain that they just included hardware support for it, which isn't enabled when running regular ARM software.

camel-cdr · 17h ago
> Handling of misaligned loads/stores

Agreed, I think the problem is that RVI doesn't want to/can't mandate implementation details.

I hope that the first few RVA23 cores will have proper misaligned load/store support and we can tell toolchains RVA23 or Zicclsm means fast misaligned load/store and future hardware that is stupid enough to not implement it, will just have to suffer.

There is some silver lining, because you can transform N misaligned loads into N+1 aligned ones + a few instructions to stitch together the result. Currently this needs to be done manually, but hopefully it will be an optimization in future compiler versions: https://github.com/llvm/llvm-project/issues/150263 (Edit: oh, I should've recognised your username, xd)
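Roughly this kind of rewrite, in scalar form (a little-endian sketch; it assumes reading one extra aligned word past the end is safe, and a production version would use memcpy or vector intrinsics rather than the raw casts below):

    #include <stdint.h>

    /* "N misaligned loads -> N+1 aligned loads + stitch", here for N = 1. */
    static uint32_t load_u32_via_aligned(const uint8_t *p) {
        uintptr_t addr = (uintptr_t)p;
        unsigned off = addr & 3;                  /* misalignment in bytes */
        const uint32_t *base = (const uint32_t *)(addr - off);
        if (off == 0)
            return base[0];                       /* already aligned */
        uint32_t lo = base[0], hi = base[1];      /* two aligned loads */
        return (lo >> (8 * off)) | (hi << (8 * (4 - off)));   /* stitch */
    }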

> The hardcoded page size.

There is Svnapot, which is supposed to allow other page sizes, but I don't know enough about it to be sure it actually solves the problem properly.

> You have to use a CSPRNG on top of it for any sensitive applications

Shouldn't you have to do that regardless, and also mix in other kinds of state at the OS level?

> Extensions do not form hierarchies

The mandatory extensions in the RVA profiles are a hierarchy.

> Detection of available extensions

I think this is being worked on with unified discovery, which should also cover other microarchitectural details.

There also is a neat toolchain solution with: https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/s...

> being unable to port tricky SIMD code to the V extension

Any NEON code is trivially ported to RVV, as is AVX-512 code that doesn't use GFNI, which is pretty much the only extension that doesn't have an RVV equivalent yet (neither NEON nor SVE has one either).

Where the complaints come from is if you want to take full advantage of the native vector length in VLA code, which can sometimes be tricky, especially in existing projects which are sometimes built around the assumption of fixed vector lengths. But you can always fall back to using RVV as a fixed vector length ISA with a much faster way of querying the vector length than CPUID.

> P.S.: I would be interested to hear about other people gripes with RISC-V

I feel like encoding scalar fmacc with three sources and separate destinations and rounding modes was a huge waste of encoding space, I would trade that for a vpternlog equivalent, which also is an encoding hog, any day.

The vl=0 special case was a bad idea, now you have to know/predict vl!=0 to get rid of the vector destination as a read dependency, or have some mechanism to kill an instruction if vl=0.

There should've been restricted vrgather variants earlier, but I'm now (slowly) working on proposing them and a handful of other new vector instructions (mask add/sub, pext/pdep, bmatflip).

Overall though, I think RVV came out surprisingly well; everything works together very nicely.

zozbot234 · 15h ago
> I feel like encoding scalar fmacc with three sources and seperate destinations and rounding modes was a huge waste of encoding space

This might be easily solved by defining new lighter varieties of the F/D/Q extensions (under new "Z" names) that just don't include the fmacc insn blocks and reserve them for extension. (Of course, these new extensions would themselves be incompatible with the full F/D/Q extensions, effectively deprecating them for general use and relegating them to special-purpose uses where the FMACC encodings are genuinely useful.) Something to think about if the 32-bit insn encoding space becomes excessively scarce.

camdroidw · 5h ago
Reading this and child comments it would seem to me that we need a RISC Vim
whynotminot · 1d ago
An annoying thing people have done since Apple Silicon is claim that its advantages were due to Arm.

No, not really. The advantage is Apple prioritizing efficiency, something Intel never cared enough about.

choilive · 1d ago
In most cases, efficiency and performance are pretty synonymous for CPUs. The faster you can get work done (and turn off the silicon, which is admittedly a higher design priority for mobile CPUs) the more efficient you are.

The level of talent Apple has cannot be overstated, they have some true CPU design wizards. This level of efficiency cannot be achieved without making every aspect of the CPU as fast as possible; their implementation of the ARM ISA is incredible. Lots of companies make ARM chips, but none of them reach Apple-level performance.

As a gross simplification, where the energy/performance tradeoff actually happens is after the design is basically baked. You crank up the voltage and clock speed to get more perf at the cost of efficiency.

toast0 · 23h ago
> In most cases, efficiency and performance are pretty synonymous for CPUs. The faster you can get work done (and turn off the silicon, which is admittedly a higher design priority for mobile CPUs) the more efficient you are.

Somewhat yes, hurry up and wait can be more efficient than running slow the whole time. But at the top end of Intel/AMD performance, you pay a lot of watts to get a little performance. Apple doesn't offer that on their processors, and when they were using Intel processors, they didn't provide thermal support to run in that mode for very long either.

The M series bakes in a lower clockspeed cap than contemporary Intel/AMD chips; you can't run in the clock regime where you spend a lot of watts and get a little bit more performance.

MindSpunk · 21h ago
Apple also buys out basically all of TSMC's initial capacity for their leading edge node so generally every time a new Apple Silicon thing comes out it's a process node ahead of every other chip they're comparing to.
wvenable · 1d ago
By prioritizing efficiency, Apple also prioritizes integration. The PC ecosystem prefers less integration (separate RAM, GPU, OS, etc) even at the cost of efficiency.
AnthonyMouse · 23h ago
> By prioritizing efficiency, Apple also prioritizes integration. The PC ecosystem prefers less integration (separate RAM, GPU, OS, etc) even at the cost of efficiency.

People always say this but "integration" has almost nothing to do with it.

How do you lower the power consumption of your wireless radio? You have a network stack that queues non-latency sensitive transmissions to minimize radio wake-ups. But that's true for radios in general, not something that requires integration with any particular wireless chip.

How do you lower the power consumption of your CPU? Remediate poorly written code that unnecessarily keeps the CPU in a high power state. Again not something that depends on a specific CPU.

How much power is saved by soldering the memory or CPU instead of using a socket? A negligible amount if any; the socket itself has no significant power draw.

What Apple does well isn't integration, it's choosing (or designing) components that are each independently power efficient, so that then the entire device is. Which you can perfectly well do in a market of fungible components simply by choosing the ones with high efficiency.

In fact, a major problem in the Android and PC laptop market is that the devices are insufficiently fungible. You find a laptop you like where all the components are efficient except that it uses an Intel processor instead of the more efficient ones from AMD, but those components are all soldered to a system board that only takes Intel processors. Another model has the AMD APU but the OEM there chose poorly for the screen.

It's a mess not because the integration is poor but because the integration exists instead of allowing you to easily swap out the part you don't like for a better one.

adgjlsfhk1 · 23h ago
> How much power is saved by soldering the memory or CPU instead of using a socket? A negligible amount if any; the socket itself has no significant power draw.

This isn't quite true. When the whole chip is idling at 1-2W, 0.1W of socket power is 10%. Some of Apple's integration almost certainly saves power (e.g. putting storage controllers for the SSD on the SoC, having tightly integrated display controllers, etc).

AnthonyMouse · 22h ago
> When the whole chip is idling at 1-2W, 0.1W of socket power is 10%.

But how are you losing 10% of power to the socket at idle? Having a socket might require traces to be slightly longer but the losses to that are proportional to overall power consumption, not very large, and both CPU sockets and the new CAMM memory standard are specifically designed to avoid that anyway (primarily for latency rather than power reasons because the power difference is so trivial).

> Some of Apple's integration almost certainly save power (e.g. putting storage controllers for the SSD on the SOC, having tightly integrated display controllers, etc).

This isn't really integration and it's very nearly the opposite: The primary advantage here in terms of hardware is that the SoC is being fabbed on 3nm and then the storage controller would be too, which would be the same advantage if you would make an independent storage controller on the same process.

Which is the problem with PCs again: The SSDs are too integrated. Instead of giving the OS raw access to the flash chips, they attach a separate controller just to do error correction and block remapping, which could better be handled by the OS on the main CPU which is fabbed on a newer process or, in larger devices with a storage array, a RAID controller that performs the task for multiple drives at once.

And which would you rather have, a dual-core ARM thing integrated with your SSD, or the same silicon going to two more E-cores on the main CPU which can do the storage work when there is any but can also run general purpose code when there isn't?

happycube · 1d ago
There's a critical instruction for Objective-C handling (I forget exactly which one) that's faster on Apple's chips than on Intel's, even under Rosetta 2's x86 emulation.
wvenable · 1d ago
I believe it's the `lock xadd` instruction. It's faster when combined with x86 Total Store Ordering mode that the Rosetta emulation runs under.
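The hot path is essentially an atomic reference-count bump (the real objc_retain does more, e.g. handling counts packed into the isa pointer). A hedged C sketch of the pattern; compilers typically lower the fetch-add to lock xadd/lock add on x86-64 and to ldadd or an ll/sc loop on ARM:

    #include <stdatomic.h>

    typedef struct { _Atomic long refcount; } object_t;

    /* Retain: bump the count atomically. Under Rosetta 2's TSO mode the
       x86 lock-prefixed RMW maps cheaply onto Apple's memory system. */
    static void object_retain(object_t *obj) {
        atomic_fetch_add_explicit(&obj->refcount, 1, memory_order_relaxed);
    }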
saagarjha · 22h ago
Looking at objc_retain apparently it's a lock cmpxchg these days
Panzer04 · 1d ago
Eh, probably the biggest difference is in the OS. The amount of time Linux or Windows will spend using a processor while completely idle can be a bit offensive.
acdha · 22h ago
It’s all of the above. One thing Apple excels at is actually using their hardware and software together whereas the PC world has a long history of one of the companies like Intel, Microsoft, or the actual manufacturer trying to make things better but failing to get the others on-board. You can in 2025 find people who disable power management because they were burned (hopefully not literally) by some combination of vendors slacking on QA!

One good example of this is RAM. Apple Silicon got some huge wins from lower latency and massive bandwidth, but that came at the cost of making RAM fixed and more expensive. A lot of PC users scoffed at the default RAM sizes until they actually used one and realized it was great at ~8GB less than the equivalent PC. That’s not magic or because Apple has some super elite programmers, it’s because they all work at the same company and nobody wants to go into Tim Cook’s office and say they blew the RAM budget and the new Macs need to cost $100 more. The hardware has compression support and the OS and app teams worked together to actually use it well, whereas it’s very easy to imagine Intel adding the feature but skimping on speed / driver stability, or Microsoft trying to implement it but delaying release for a couple years, or not working with third-party developers to optimize usage, etc. – nobody acting in bad faith but just what inevitably happens when everyone has different incentives.

astrange · 5h ago
> Apple Silicon got some huge wins from lower latency and massive bandwidth, but that came at the cost of making RAM fixed and more expensive.

The memory latency actually isn't good, only bandwidth is good really. But there is a lot of cache to hide that. (The latency from fetching between CPU clusters is actually kind of bad too, so it's important not to contend on those cache lines.)

> A lot of PC users scoffed at the default RAM sizes until they actually used one and realized it was great at ~8GB less than the equivalent PC.

Less than that. Unified memory means that the SSD controller, display, etc subtract from that 8GB, whereas on a PC they have some of their own RAM on the side.

p_ing · 20h ago
Windows 10 introduced memory compression. Here's a discussion from 2015 [0]. And one on Linux by IBM from 2013 [1]. But the history goes way back [2].

I don't know why people say '8GiB is great!' -- no, no it isn't. Your memory usage just spills over to swap faster. It isn't more efficient (not with those 16KiB pages).

[0] https://learn.microsoft.com/en-us/shows/Seth-Juarez/Memory-C...

[1] https://events.static.linuxfound.org/sites/events/files/slid...

[2] https://en.wikipedia.org/wiki/Virtual_memory_compression#Ori...

acdha · 10h ago
Yes, I'm aware that Windows has memory compression, so let's think about why it's less successful and Windows systems need more memory than Macs.

The Apple version has a very high-performance hardware implementation versus Microsoft's software implementation (not a slam on Microsoft, they just have to support more hardware).

The Apple designers can assume a higher performance baseline memory subsystem because, again, they're working with hardware designers at the same company who are equally committed to making the product succeed.

The core Mac frameworks are optimized to reduce VM pressure and more Mac apps use the system frameworks, which means that you're paying the overhead tax less.

Many Mac users use Safari instead of Chrome so they're saving multiple GB for an app which most people have open constantly as well as all of the apps which embed a WebKit view.

Again, this is not magic, it's aligned incentives. Microsoft doesn't control Intel, AMD, and Qualcomm's product design, they can't force Google to make Chrome better, and they can't force every PC vendor not to skimp on hardware. They can and do work with those companies but it takes longer and in some cases the cost incentives are wrong – e.g. a PC vendor knows 99% of their buyers will blame Windows if they use slower RAM to save a few bucks, or they get paid to preload McAfee which keeps the memory subsystem busy constantly, so they take the deal which adds to their bottom line now.

p_ing · 8h ago
Neither macOS nor Windows use a hardware-based accelerator for memory compression. It's all done in software. Linux zram uses Intel QAT but that's only available on a limited number of processors.

You seem to be under the mistaken impression that Microsoft cannot gear Windows to act differently based on the installed hardware (or processor). That's quite untrue.

acdha · 5h ago
It was software on Intel but they presumably added instructions with the intention of using them:

https://asahilinux.org/docs/hw/cpu/apple-instructions/

> You seem to be under the mistaken impression that Microsoft cannot gear Windows to act differently based on the installed hardware (or processor).

Definitely not - my point is simply that all of these things are harder and take longer if they have to support multiple implementations and get other companies to ship quality implementations.

p_ing · 3h ago
> Definitely not - my point is simply that all of these things are harder and take longer if they have to support multiple implementations and get other companies to ship quality implementations.

What's your source that it is "harder" or "takes longer"? #ifdef is a quite well known preprocessor directive to developers and easy to use.

acdha · 1h ago
> What's your source for it is "harder" or "takes longer"?

Windows devices’ power management and battery life has been behind Apple since the previous century? If you think hardware support is a simple #ifdef, ask yourself how a compile-time flag can detect firmware updates, driver versions, or flakey hardware. It’s not that Apple’s hardware is perfect but that those are made by the same company so you don’t get Dell telling you to call Broadcom who are telling you to call Microsoft.

astrange · 5h ago
> Neither macOS nor Windows use a hardware-based accelerator for memory compression.

Not true.

achandlerwhite · 19h ago
He didn’t say 8 gigs is great but that you can get by with about 8 gigs less than equivalent on Windows.
tguvot · 17h ago
I think I had servers with this 20+ years ago: https://www.eetimes.com/serverworks-rolls-out-chip-set-based...
wvenable · 18h ago
I don't buy it. Software is not magically using less RAM because it was compiled for MacOS. The actual RAM use by the OS itself is relatively small for both operating systems.
Panzer04 · 19h ago
Is this meant to be contradicting what I said?

It's all in the OS. There's absolutely no reason RAM can't be managed similarly effectively on a non-integrated product.

Android is just Linux with full attention paid to power saving measures. These OSes can get very long battery life, but in my experience something or other typically keeps the processor active and halves your expected battery life.

acdha · 10h ago
My point is that it's not just the OS for Apple, because every part of the device is made by people with the same incentives. Android is slower and has worse battery efficiency than iOS not because Google are a bunch of idiots (quite the contrary) but because they have to support a wider range of hardware and work with the vendors who are going to use slower, less capable components to save $3 per device. Apple had a decade lead on basic things like storage security because when they decided to do that, the hardware team committed to putting high-quality encryption into the SoC and that meant that iOS could just assume that feature existed and was fast starting on the 3GS whereas Google had to spend years and years haranguing the actual phone manufacturers into implementing what was at the time seen as a costly optional feature.
mschuster91 · 15h ago
Apple can get away with less RAM because the flash storage is blazing fast and attached directly to the CPU, making swap much more painless than on most Windows machines that get bottom-of-the-barrel storage and controllers.
acdha · 10h ago
Yes, that's exactly what I'm talking about: Apple can do that because everyone involved works at the same place and owns the success or failure of the product. Google or Microsoft have to be a lot more conservative because they have limited ability to force the hardware vendors to do something and they'll probably get blamed more if it doesn't work well: people are primed to say “Windows <version + 1> sucks!” even if the real answer is that their PC was marginally-specced when they bought it 5 years ago.
cheema33 · 17h ago
> The advantage is Apple prioritizing efficiency, something Intel never cared enough about.

Intel cared plenty once they realized that they were completely missing out on the mobile phone business. They even made X86/Atom chips for the phone market. Asus for example had some phones (Zenfone) with Intel X86 chips in them in the mid-2010s. However, Intel Atom suffered from poor power efficiency and battery life and soon died.

steve1977 · 18h ago
Also Apple has a ton of cash that it can give to TSMC to essentially get exclusive access to the latest manufacturing process.
JumpCrisscross · 18h ago
> Apple has a ton of cash that it can give to TSMC to essentially get exclusive access to the latest manufacturing process

How have non-Apple chips on TSMC’s 5nm process compared with Apple’s M series?

Ianjit · 15h ago
This paper is great:

"Our methodical investigation demonstrates the role of ISA in modern microprocessors’ performance and energy efficiency. We find that ARM, MIPS, and x86 processors are simply engineering design points optimized for different levels of performance, and there is nothing fundamentally more energy efficient in one ISA class or the other. The ISA being RISC or CISC seems irrelevant."

https://dl.acm.org/doi/10.1145/2699682 https://abdullahyildiz.github.io/files/isa_wars.pdf

jabl · 12h ago
> Or perhaps AMD will let x86 fade away.

I agree with what you write otherwise, but not this. Why would AMD "let x86 fade away"? They are one of the two oligopolistic CPU providers of the x86 ecosystem which is worth zillions. Why should they throw that away in order to become yet another provider of ARM (or RISC-V or whatnot) CPUs? I think that as long as the x86 market remains healthy, and AMD is in a position to compete in that market, they will continue doing so.

tliltocatl · 1d ago
Nitpick: uncore and the fabrication details dominate the ISA on high end/superscalar architectures (because modern superscalar basically abstract the ISA away at the frontend). On smaller (i. e. MCU) cores x86 will never stand any chance.
bpye · 1d ago
Not that it stopped Intel trying - https://en.m.wikipedia.org/wiki/Intel_Quark
somanyphotons · 1d ago
I'd love to see what would happen if AMD put out a chip with the instruction decoders swapped out for risc-v instruction decoders
AnotherGoodName · 1d ago
Fwiw the https://en.wikipedia.org/wiki/AMD_Am29000 RISC CPU and the https://en.wikipedia.org/wiki/AMD_K5 are a good example of this. As in AMD took their existing RISC CPU to make the K5 x86 CPU.

Almost the same in die shots except the K5 had more transistors for the x86 decoding. The AM29000's instruction set is actually very close to RISC-V too!

Very hard to find benchmarks comparing the two directly though.

throwawaymaths · 22h ago
TIL the k5 was RISC. thank you
TheAmazingRace · 17h ago
It was a valiant effort by AMD when they made their first attempt at an in-house x86 chip design. Prior to this, AMD effectively cloned Intel designs under license.

Sadly, the K5 was pretty underwhelming upon release, and it wasn’t until AMD acquired NexGen and integrated their special sauce into their next version core, the K6, that things started getting interesting. I was quite fond of my old AMD K6 II+ based PC back in the day. It had good value for the money.

Then you see the K7/Athlon core, complete with inspiration from the DEC Alpha thanks to none other than Jim Keller, and AMD was on a serious tear for a good while after that, beating Intel to 1 GHz.

guerrilla · 1d ago
Indeed. We don't need it, but I want it for perfectionist aesthetic completion.
epolanski · 1d ago
I have a hard time believing this fully: more custom instructions, more custom hardware, more heat.

How can you avoid it?

tapoxi · 1d ago
Since the Pentium Pro the hardware hasn't implemented the ISA directly; instructions are converted into micro-ops.
epolanski · 1d ago
Come on, you know what I meant :)

If you want to support AVX e.g. you need 512bit (or 256) wide registers, you need dedicated ALUs, dedicated mask registers etc.

Ice Lake has implemented SHA-specific hardware units in 2019.

adgjlsfhk1 · 1d ago
Sure, but Arm has NEON/SVE which impose basically the same requirements for vector instructions, and most high performance Arm implementations have a wide suite of crypto instructions (e.g. Apple's M series chips have AES, SHA1 and SHA256 instructions).
camel-cdr · 17h ago
NEON has it worse, because it's harder to scale issue width than vector length.

Zen5 has four issue 512-bit ALUs, current Arm processors have been stuck at four issue 128-bit for years.

Issue width scales quadratically, while vector length mostly scales linearly.

Intel decided it is easier to rewrite all performance critical applications thrice than to go wider than four issue SIMD.

It will have to be seen if Arm is in a position to push software to adopt SVE, but currently it looks very bleak, with much of the little SVE code that's out there just assuming 128-bit SVE, because that's what all of the hardware is.

ryan-ca · 5h ago
I think AVX is actually power gated when unused.
toast0 · 23h ago
ARM has instructions for SHA, AES, vectors, etc too. Pretty much have to pay the cost if you want the perf.
CyberDildonics · 23h ago
The computation has to be done somehow, I don't know that it is a given that more available instructions means more heat.
wlesieutre · 1d ago
VIA used to make low power x86 processors
yndoendo · 1d ago
Fun fact. The idea of strong national security is the reason why there are three companies with access to the x86 ISA.

DoD originally required all products to be sourced by at least three companies to prevent supply chain issues. This required Intel to allow AMD and VIA to produce products based on ISA.

For me this is a good indicator of whether someone who talks about national security knows what they are talking about, or is just spewing bullshit and playing national security theatre.

rasz · 1d ago
Intel didn't "allow" VIA anything :). VIA acquired x86 tech from IDT (WinChip Centaur garbage) in a fire sale. IDT didn't ask anyone about any licenses; neither did Cyrix, NexGen, Transmeta, Rise nor NEC.

Afaik the DoD wasn't the reason behind the original AMD second-source license; it was IBM forcing Intel to second-source the chips that went into the first PC.

cheema33 · 17h ago
This.
pavlov · 1d ago
And Transmeta…
DeepYogurt · 1d ago
Transmeta wasn't x86 internally but decoded x86 instructions. Retrobytes did a history of transmeta not too long ago and the idea was essentially to be able to be compatible with any cpu uarch. Alas by the time it shipped only x86 was relevant. https://www.youtube.com/watch?v=U2aQTJDJwd8
tyfighter · 1d ago
Actually, the reason Transmeta CPUs were so slow was that they didn't have an x86 instruction hardware decoder. Every code cache (IIRC it was only 32 MB) miss resulted in a micro-architectural trap which translated x86 instructions to the underlying uops in software.
mananaysiempre · 1d ago
> x86 didn't dominate in low power because Intel had the resources to care but never did

Remember Atom tablets (and how they sucked)?

jonbiggums22 · 10h ago
IIRC Intel hobbled early Atom with an ancient process node for the chipsets which actually made up most of the idle power usage. It was pretty clear that both Microsoft and Intel wanted this product category to go away or at least be totally relegated to bottom tier lest it cannibalize their higher margin businesses. And then of course Apple and Android came along and did just that anyway.
wmf · 1d ago
That's the point. Early Atom wasn't designed with care but the newer E-cores are quite efficient because they put more effort in.
Voultapher · 11h ago
The PMICs were also bad, plus the whole Windows software stack - to this day - is nowhere nearly as well optimized for low background and sleep power usage as MacOS and iOS are.
Findecanor · 1d ago
You mean Atom tablets running Android ?

I have a ten-year old Lenovo Yoga Tab 2 8" Windows tablet, which I still use at least once every week. It is still useful. Who can say that they are still using a ten-year old Android tablet?

masfuerte · 1d ago
I still use my 2015 Kindle Fire (which runs Android) for ebooks and light web browsing.
peterfirefly · 1d ago
My iPad Mini 4 turns 10 in a month.
vt240 · 1d ago
Yeah, I got to say in our sound company inventory I still use a dozen 6-10 year old iPads with all the mixers. They run the apps at 30fps and still hold a charge all day.
mmis1000 · 1d ago
I have tried one before. And surprisingly, it did not suck as much as most people claimed. I could even do light gaming (Warframe) on it with a reasonable frame rate (this was around the 2015-2020 era). So it probably depends on the manufacturer (or use case) though.

(Also probably because it is a tablet, so it has reasonably fast storage instead of HDDs like the notebooks of that era.)

josefx · 18h ago
What year? I remember Linux gaining some traction in the low end mobile device market early on because Microsoft had just released Vista, and that wouldn't even run well on most "Vista Ready" desktop systems.
kccqzy · 1d ago
They sucked because Intel didn't care.
YetAnotherNick · 1d ago
> how they sucked

Care to elaborate? I had one of those 9" mini laptop kind of devices based on Atom and don't remember Atom being the issue.

mananaysiempre · 1d ago
I had an Atom-based netbook (in the early days when they were 32-bit-only and couldn’t run up-to-date Windows). It didn’t suck, as such, but it was definitely resource-starved.

However, what I meant is Atom-based Android tablets. At about the same time as the netbook craze (late 2000s to early 2010s) there was a non-negligible number of Android tablets, and a noticeable fraction of them was not ARM- but Atom-based. (The x86 target in the Android SDK wasn’t only there to support emulators, originally.) Yet that stopped pretty quickly, and my impression is that that happened because, while Intel would certainly have liked to hitch itself to the Android train, they just couldn’t get Atoms fast enough at equivalent power levels (either at all or quickly enough). Could have been something else, e.g. perhaps they didn’t have the expertise to build SoCs with radios?

Either way, it’s not that Intel didn’t want to get into consumer mobile devices, it’s that they tried and did not succeed.

toast0 · 1d ago
Android x86 devices suffer when developers include binary libraries and don't add x86. At the time of Intel's x86 for Android push, Google didn't have good apk thinning options, so app developers had to decide if they wanted to add x86 libraries for everyone so that a handful of tablets/phones would work properly... for the most part, many developers said no; even though many/most apps are tested on the android emulator that runs on x86 and probably have binary libraries available to work in that case.

IMHO, If Intel had done another year or two of trying, it probably would have worked, but they gave up. They also canceled x86 for phone like the day before the Windows Mobile Continuum demo, which would have been a potentially much more compelling product with x86, especially if Microsoft allowed running win32 apps (which they probably wouldn't, but the potential would be interesting)

Synaesthesia · 1d ago
It got a lot better. First few generations were dog-slow, although they did work.
cptskippy · 1d ago
Atom used an in-order execution model so its performance was always going to be lacking. Because it was in-order it had a much simpler decoder and much smaller die size, which meant you could cram the chipset and CPU onto a single die.

Atom wasn't about power efficiency or performance, it was about cost optimization.

saltcured · 1d ago
I had an Atom-based Android phone (Razr-i) and it was fine.
criticalfault · 1d ago
Were they running windows or android?
jojobas · 22h ago
The Eee PC was a hit. Its successors still make excellent cheap long-life laptops, if not as performant as Apple's.
pezezin · 22h ago
After playing around with some ARM hardware I have to say that I don't care whether ARM is more efficient or not as long as the boot process remains the clusterfuck that it is today.

IMHO the major win of the IBM PC platform is that it standardized the boot process from the very beginning, first with the BIOS and later with UEFI, so you can grab any random ISO for any random OS and it will work. Meanwhile in the ARM world it seems that every single CPU board requires its own drivers, device tree, and custom OS build. RISC-V seems to suffer from the same problem, and until this problem is solved, I will avoid them like toxic waste.

Teknoman117 · 22h ago
ARM systems that support UEFI are pretty fun to work with. Then there's everything else. Anytime I hear the phrase "vendor kernel" I know I'm in for an experience...
nicman23 · 16h ago
Nah, when I hear "vendor kernel" I just do not go for that experience. It is not 2006; get your shit mainline or at least packaged correctly.
mrheosuper · 15h ago
Fun Fact: Your Qualcomm based Phone may already use UEFI.
freedomben · 6h ago
I could not agree more. I wanted to love ARM, but after playing around with numerous different pieces of hardware, I won't touch it with a ten-foot pole anymore. The power savings is not worth the pain to me.

I hope like hell that RISC-V doesn't end up in the same boot-process toxic wasteland

markfeathers · 21h ago
Check out ARM SBSA / SBBR which seems aimed at solving most of these issues. https://en.wikipedia.org/wiki/Server_Base_System_Architectur... I'm hopeful RISCV comes up with something similar.
camel-cdr · 17h ago
pezezin · 15h ago
But those initiatives are focused on servers.

What about desktops and laptops? Cellphones? SBC like the Raspberry Pi? That is my concern.

camel-cdr · 15h ago
This is also supposed to be for desktops/laptops afaik, idk about phones.

Edit, this part of the server platforms spec certainly: https://github.com/riscv-non-isa/riscv-brs/blob/main/intro.a...

> BRS-I is expected to be used by general-purpose compute devices such as servers, desktops, laptops and other devices with industry expectations on silicon vendor, OS and software ecosystem interoperability.

pezezin · 15h ago
Thank you, I had missed that part.

It would be great indeed if there is a standardized boot process that every OS can use, I think it would greatly help with RISC-V adoption.

Joel_Mckay · 20h ago
In general, most modern ARMv8/v9 64-bit SoCs have purged a lot of the vestigial problems.

Yet most pre-compiled package builds still never enable the advanced ASIC features, for compatibility and safety concerns. AMD comparing against the nerfed ARM core features is pretty sleazy PR.

Tegra could be a budget Apple M3 Pro, but those folks chose imaginary "AI" money over awesomeness. =3

nicman23 · 16h ago
Nah. If they do not work on getting their extensions supported by general software, that is their problem.
txrx0000 · 21h ago
It's not the ISA. Modern Macbooks are power-efficient because they have:

- RAM on package

- PMIC power delivery

- Better power management by OS

Geekerwan investigated this a while ago, see:

https://www.youtube.com/watch?v=Z0tNtMwYrGA https://www.youtube.com/watch?v=b3FTtvPcc2s https://www.youtube.com/watch?v=ymoiWv9BF7Q

Intel and AMD have implemented these improvements with Lunar Lake and Strix Halo. You can buy an x86 laptop with Macbook-like efficiency right now if you know which SoCs to pick.

edit: Correction. I looked at the die image of Strix Halo and thought it looked like it had on-package RAM. It does not. It doesn't use PMIC either. Lunar Lake is the only Apple M-series competitor on x86 at the moment.

aurareturn · 15h ago

  Intel and AMD have implemented these improvements with Lunar Lake and Strix Halo. You can buy an x86 laptop with Macbook-like efficiency right now if you know which SoCs to pick.
M4 is about 3.6x more efficient than Strix Halo when under load.[0] On a daily basis, this difference can be more because Apple Silicon has true big.Little cores that send low priority tasks to the highly efficient small cores.

Compared to Lunar Lake, the base M4 is about 35% faster and 2x more efficient, and Lunar Lake actually has a bigger die than the M4.[1] Intel is discontinuing the Lunar Lake line because it isn't profitable for them.

I'm not sure how you can claim "Mac-like efficiency".

[0]https://imgur.com/a/yvpEpKF

[1]https://www.notebookcheck.net/Intel-Lunar-Lake-CPU-analysis-...

txrx0000 · 13h ago
Pardon my loose choice of words regarding the Mac-like efficiency. I was referring to the fact that the battery life is comparable to the M3 in day-to-day use, as demonstrated at around the 5:00 mark in the third video I linked.

In the same video, they also measure perf/watt under heavy load, and it's close to the M1, but not the latest gen M4. I think that's pretty good considering it's a first gen product.

Regarding the discontinuation, it's still on shelves right now, but I'm not sure if there will be future generations. It would be awfully silly of them to discontinue it as it's the best non-Apple laptop chip you can buy right now if you care about efficiency.

aurareturn · 12h ago

  In the same video, they also measure perf/watt under heavy load, and it's close to the M1, but not the latest gen M4. I think that's pretty good considering it's a first gen product.
Which video and timestamp? Are you aware that LNL throttles heavily when on battery life?

On battery life, M1 is a whopping 1.5x faster in single thread.[0] That makes M4 2.47x faster when compared to LNL on battery.

So no, LNL is very far behind even M1. That's why there are no fanless LNL laptops.

[0]https://b2c-contenthub.com/wp-content/uploads/2024/09/Intel-...

[0]https://www.pcworld.com/article/2463714/tested-intels-lunar-...

[0]https://browser.geekbench.com/macs/macbook-air-late-2020

txrx0000 · 11h ago
I suspect the throttling behavior has more to do with the power settings used during testing or OEM tuning on specific models.

https://www.youtube.com/watch?v=ymoiWv9BF7Q

In this video, they show the perf/watt curves at 8:30. And they show the on-battery vs on-wall performance at 18:35 across a wide variety of benchmarks, not just Geekbench. They used a Lenovo YOGA Air 15 on Window 11's "Balanced" power plan for their tests. The narrator specifically noted the Macbook-like on-battery performance.

aurareturn · 11h ago
Reviewers always use max performance setting for benchmarks and then max battery life for battery tests. That's how people get tricked. When they actually buy the laptop and use it for themselves, they complain that it's slow when on battery life or hot/loud when plugged in.
txrx0000 · 10h ago
They're not trying to trick you. In fact, when they were measuring perf/watt, the Lunar Lake chip was at a disadvantage against the Apple M-series because they had to run the SPEC 2017 tests on Ubuntu, which is less well tuned for it than Windows 11. You can see a footnote saying the compilation environment was Ubuntu 24.04 LTS in the bottom left corner of the frame where they show the perf/watt graphs.
aurareturn · 10h ago
They are trying to trick you. All reviewers are told by Intel to run benchmarks in max performance mode and battery either in balanced or max efficiency mode. These modes will throttle. So the performance you're seeing in reviews aren't achievable in battery mode unless you're ok with drastically lower battery life.

Meanwhile, PCWorld is one of the few that actually ran benchmarks while on battery life - which is what people will experience.

legacynl · 8h ago
Can somebody who knows about this stuff please elaborate on whether it's 'fair' in the first place to compare Apple chips with AMD/Intel chips?

AMD and Intel chips run on loads of different hardware. On the other hand Apple is fully in control of what hardware (and software) their chips are used with. I don't know, but I assume there's a whole lot of tuning and optimizations that you can do when you don't have to support anything besides what you produce yourself.

Let's say it were hypothetically possible to put an M4 in a regular PC. Wouldn't it lose performance just by doing that?

aurareturn · 7h ago

  Let's say it would hypothethically possible to put an M4 in a regular pc. Wouldn't it lose performance just by doing that?
Yes. But an M4 Max running macOS running Parallels running Windows on Arm is still the fastest Windows laptop in the world: https://browser.geekbench.com/v6/cpu/compare/13494385?baseli...
legacynl · 6h ago
Yeah but an AMD/Intel CPU supports many different types of configurations. Isn't it unfair to compare a chip that only supports one configuration with one that supports many?

It feels to me like we're kind of comparing speeds between a personal automobile and a long haul truck. Yes, one is faster than the other, but that's meaningless, because both have different design considerations. A long haul truck has to be able to carry load, and that makes the design different. Of course they'll still make it as fast as possible, but it's never going to be the same as a car.

Basically what I'm saying is that because it's impossible to strip away all the performance and efficiency improvements that come from Apple's total control of the software and hardware stack, is it really possible to conclude that Apple Silicon itself is as impressive as they make it out to be?

aurareturn · 2h ago
Yes, because it's still the fastest SoC running Windows. Furthermore, consumers don't care that AMD and Intel have to support more configurations. They care about what they're buying for the money.
a_wild_dandan · 6h ago
That's absolutely wild. I've been loving using the 96GB of (V)RAM in my MacBook + Apple's mlx framework to run quantized AI reasoning models like glm-4.5-air. Running models with hundreds of billions of parameters (at ~14 tok/s) on my damn laptop feels like magic.
jlei523 · 16h ago

  Intel and AMD have implemented these improvements with Lunar Lake and Strix Halo. You can buy an x86 laptop with Macbook-like efficiency right now if you know which SoCs to pick.
This just isn't true. Yes, Lunar Lake has great idle performance. But if you need to actually use the CPU, it's drastically slower than M4 while consuming more power.

Strix Halo battery life and efficiency is not even in the same ball park.

txrx0000 · 13h ago
If you look at the battery life benchmarks they did at around the 5:00 mark in the third video, you can see that it achieves similar battery life to an M3 MacBook in typical day-to-day use. This reflects the experience most users will have with the device.

It's true that the perf/watt is still a lot worse than the latest gen M4 under heavy load, but it's close enough to the M1 and significantly better than prior laptop chips on x86.

It is a first gen product like the M1. But it does show the ISA is not as big of a limiting factor as popularly believed.

aurareturn · 12h ago

  If you look the battery life benchmarks they did at around the 5:00 mark in the third video, you can see that it achieves similar battery life compared to the an M3 Macbook in typical day-to-day use. This reflects the experience most users will have with the device.
No. The typical day to day use of LNL is significantly slower than even M1. LNL throttles like crazy when on battery in order to achieve similar battery life.

https://b2c-contenthub.com/wp-content/uploads/2024/09/Intel-...

hakube · 12h ago
Windows laptops' performance while on battery is terrible, especially when you put them in power save mode. MacBooks, on the other hand, don't have that problem. It's just like you're using an iPad.
mrheosuper · 21h ago
By PMIC, did you mean VRM? If not, can you tell me the difference between them?
txrx0000 · 21h ago
I'm not an expert on the topic and don't really know the difference. But in the video they say it can finetune power delivery to individual parts of the SoC and reduce idle power.
mrheosuper · 21h ago
Well, every single CPU needs some kind of voltage regulation module to work.

About the "fine tune" part, this does not relate to the PMIC (or VRM) at all; it's more about CPU design: how many power domains does this CPU have?

audunw · 13h ago
I don’t know how it is for Apple-M, but for chips I’ve worked on, this can definitely relate to PMIC/VRMs. You can tune the voltage you feed to the various power domains based on the clock speed required for those domains at any given time. We do it with on-chip power regulators, but I suppose for Apple M it would perhaps be off-chip PMICs feeding power into the chip.
mrheosuper · 12h ago
It's the other way around. You design the PMIC for a given CPU, not the CPU for a given PMIC (but in Apple's case, the engineers can work closely together to come up with something balanced).
gimmeThaBeet · 8h ago
I always wondered how to gauge how effective Apple's Dialog carve-out was: was it just an instant PMIC department, or was it also effective at building a foundation for them? Given their long relationship, I would think it might be pretty seamless.

I assume Apple probably does that more than I know; it is just interesting that their vertical acquisition history feels both the most boring and the most interesting.

At least looking from the outside, it feels like relatively small pieces develop into pretty big differentiators, like P.A. Semi, Intrinsity and Passif paving the way to their SoCs.

KingOfCoders · 17h ago
"When RISC first came out, x86 was half microcode. So if you look at the die, half the chip is a ROM, or maybe a third or something. And the RISC guys could say that there is no ROM on a RISC chip, so we get more performance. But now the ROM is so small, you can’t find it. Actually, the adder is so small, you can hardly find it? What limits computer performance today is predictability, and the two big ones are instruction/branch predictability, and data locality."

Jim Keller

SOTGO · 1d ago
I'd be interested to hear someone with more experience talk about this or if there's more recent research, but in school I read this paper: <https://research.cs.wisc.edu/vertical/papers/2013/hpca13-isa...> that seems to agree that x86 and ARM as instruction sets do not differ greatly in power consumption. They also found that GCC picks RISC-like instructions when compiling for x86 which meant the number of micro-ops was similar between ARM and x86, and that the x86 chips were optimized well for those RISC-like instructions and so were similarly efficient to ARM chips. They have a quote that "The microarchitecture, not the ISA, is responsible for performance differences."
astrange · 4h ago
`lea` is a very common x86 instruction that isn't RISC-like. (Actually I don't think any x86 operations are RISC-like since they're variable length and overwrite their inputs.)

It's just that the most complicated of all x86 instructions are so specific that they're too irrelevant to use. Or were straight up removed in x86-64.

tester756 · 1d ago
Of course, because saying that X ISA is faster than Y ISA is like saying that Java syntax is faster than C# syntax

Everything is about the implementation: compiler, JIT, runtime/VM, stdlib, etc.

https://chipsandcheese.com/p/arm-or-x86-isa-doesnt-matter

rafaelmn · 1d ago
C# syntax is faster than Java because Java has no way to define custom value types/structs (last time I checked, I know there was some experimental work on this)
tester756 · 1d ago
and yet there's more Java in HFT than C#

And don't get me wrong, I'm C# fanboi that'd never touch Java, but JVM itself is impressive as hell,

so even despite not having (yet) value types/structs, Java is still very strong due to JVM (the implementation). Valhalla should push it even further.

mrsmrtss · 11h ago
It may also be related to the fact that .NET until now does not have any GC with super low latency pauses, but there are interesting developments going on recently regarding this - https://github.com/dotnet/runtime/discussions/115627.
cjbgkagh · 23h ago
HFT code is unusual in the way it is used. A lot of work goes into avoiding the Garbage Collection and other JVM overheads.
nottorp · 1d ago
Irrelevant.

There are two entities allowed to make x86_64 chips (and that only because AMD won the 64 bit ISA competition, otherwise there'd be only Intel). They get to choose.

The rest will use arm because that's all they have access to.

Oh, and x86_64 will be as power efficient as arm when one of the two entities will stop competing on having larger numbers and actually worry about power management. Maybe provide a ?linux? optimized for power consumption.

Dylan16807 · 1d ago
Unless you badly need SSE4 or AVX (and can't get around the somewhat questionable patent situation) anyone can make an x86_64 chip. And those patents are running out soon.
ac29 · 1d ago
> Oh, and x86_64 will be as power efficient as arm when one of the two entities will stop competing on having larger numbers and actually worry about power management.

Both Intel and AMD provide runtime power control so this is tunable. The last ~10% of performance requires far more than 10% of the power.
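
For what it's worth, here's a minimal sketch of what "tunable" looks like from Linux userspace, assuming the cpufreq sysfs interface is present (exact files vary by driver: amd-pstate/intel_pstate expose energy_performance_preference, older drivers only scaling_governor; needs root, and error handling is kept minimal):

  #include <stdio.h>

  /* Write one cpufreq policy value; returns 0 on success. */
  static int write_policy(const char *path, const char *value) {
      FILE *f = fopen(path, "w");
      if (!f) return -1;
      fprintf(f, "%s\n", value);
      return fclose(f);
  }

  int main(void) {
      /* Bias CPU 0 toward efficiency instead of peak clocks; in practice
       * you'd loop over every cpuN directory under /sys/devices/system/cpu/. */
      write_policy("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                   "powersave");
      write_policy("/sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference",
                   "power");
      return 0;
  }

The knob is already there; the shipping defaults just sit at the performance end of it.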

MBCook · 21h ago
I remember hearing on one of the podcasts I listened to about the difference between the Apple power cores and efficiency cores.

The power ones have more execution units I think, and are allowed to run faster.

When running the same task, the efficiency ones are just ridiculously more efficient. I wish I had some link to cite.

The extra speed the power cores are allowed to run at is enough to tip them way over the line into exponential power usage. I’ve always known each bump in megahertz comes with a big power cost, but it was eye-opening.

shmerl · 1d ago
AMD do handle power consumption well, at least if you run in eco mode instead of pushing the CPU to its limits. I always enable eco mode on modern Ryzens.
nottorp · 1d ago
Indeed, I have a Ryzen box that I temperature-limited to 65 C. That was about 100 W in my office, with just the graphics integrated into the Ryzen.

However, next to it there's a M2 mac mini that uses all of 37 W when I'm playing Cyberpunk 2077 so...

> Both Intel and AMD provide runtime power control so this is tunable. The last ~10% of performance requires far more than 10% of the power.

Yes but the defaults are insane.

dotnet00 · 1d ago
If you're measuring the draw at the wall, AFAIK desktop Ryzen keeps the chipset running at full power all the time and so even if the CPU is idle, it's hard to drop below, say, ~70W at the wall (including peripherals, fans, PSU efficiency etc).

Apparently desktop Intel is able to drop all the way down to under 10W on idle.

Teknoman117 · 22h ago
The last 20% of the performance takes like >75% of the power with Zen 4 systems XD.

A Ryzen 9 7945HX mini PC I have achieves, at 55W, about 80% of the all-core performance of my Ryzen 9 7950X desktop, which uses 225W for the CPU (admittedly, the defaults).

I think limiting the desktop CPU to 105W only dropped the performance by 10%. I haven't done that test in a while because I was having some stability problems I couldn't be bothered to diagnose.

bigyabai · 22h ago
Sounds like your M2 is hitting the TDP max and the Ryzen box isn't.

Keep in mind there are Nvidia-designed chips (eg. Switch 2) that use all of ten watts when playing Cyberpunk 2077. Manufactured on Samsung's 8nm node, no less. It's a bit of a pre-beaten horse, but people aren't joking when they say Apple's GPU and CPU designs leave a lot of efficiency on the table.

shmerl · 1d ago
Cyberpunk 2077 is very GPU bound, so it's not really about CPU there. I'm playing it using 7900 XTX on Linux :)

But yeah, defaults are set to look better in benchmarks and they are not worth it. Eco mode should be the default.

slimginz · 1d ago
IIRC There was a Jim Keller interview a few years ago where he said basically the same thing (I think it was from right around when he joined Tenstorrent?). The ISA itself doesn't matter, it's just instructions. The way the chip interprets those instructions is what makes the difference. ARM was designed from the beginning for low powered devices whereas x86 wasn't. If x86 is gonna compete with ARM (and RISC-V) then the chips are gonna need to also be optimized for low powered devices, but that can break decades of compatibility with older software.
sapiogram · 1d ago
It's probably from the Lex Fridman podcast he did. And to be fair, he didn't say "it doesn't matter", he said "it's not that important".
BirAdam · 11h ago
Jim Keller did say essentially that, and I think it is borne out by two facts.

First, x86 CPUs haven't directly executed x86 instructions in a very long time; they decode them into internal micro-ops.

Second, Rosetta 2.

ISA doesn't matter. Logic matters. Cache matters. Branch prediction and speculative execution matter. Buffers matter. Instruction reordering matters. Node size and packaging matter. SIMD matters for some workloads. Etc.

gigatexal · 1d ago
You’ll pry the ARM M series chips of my Mac from my cold dead hands. They’re a game changer in the space and one of the best reasons to use a Mac.

I am not a chip expert; it's just so night-and-day different using a Mac with an ARM chip compared to an Intel one, from thermals to performance and battery life and everything in between. Intel isn't even in the same ballpark imo.

But competition is good, so let's hope both Intel and AMD deliver, because the consumer wins.

mort96 · 1d ago
I have absolutely no doubt in my mind that if Apple's CPU engineers got half a decade and a mandate from the higher ups, they could make an amazing amd64 chip too.
KingOfCoders · 17h ago
What you have to understand is that these are all the same people.
cylemons · 13h ago
Don't high-profile designers have strong non-competes?
KingOfCoders · 10h ago
e.g. Jim Keller. CPU engineer

  1982 - 1998, DEC Alpha processors (loved them)
  1998 - 1999, AMD Athlon
  1999 - 2004, MIPS
  2008 - 2012, Apple A4/A5
  2012 - 2015, AMD Ryzen/Zen
  2016 - 2018, Tesla
  2018 - 2020, Intel
https://en.wikipedia.org/wiki/Jim_Keller_%28engineer%29
astrange · 3h ago
California doesn't allow noncompete clauses. It's why Silicon Valley exists in the first place.
kccqzy · 1d ago
That's not mostly because of a better ISA. If Intel and Apple had a chummier relationship you could imagine Apple licensing the Intel x86 ISA and the M series chips would be just as good but running x86. However I suspect no matter how chummy that relationship was, business is business and it is highly unlikely that Intel would give Apple such a license.
FlyingAvatar · 1d ago
It's pretty difficult to imagine.

Apple did a ton of work on the power efficiency of iOS on their own ARM chips for iPhone for a decade before introducing the M1.

Since iOS and macOS share the same code base (even when they were on different architectures) it makes much more sense to simplify to a single chip architecture that they already had major expertise with and total control over.

There would be little to no upside for cutting Intel in on it.

jopsen · 1d ago
Isn't it also easier to license ARM, because that's the whole point of the ARM corporation?

It's not like Intel or AMD are known for letting other customize their existing chip designs.

rahkiin · 1d ago
Apple was a very early investor in ARM and is one of the few with a perpetual license of ARM tech
nly · 1d ago
And an architectural license that lets them modify the ISA, I believe
mandevil · 1d ago
Intel and AMD both sell quite a lot of customized chips, at least in the server space. As one example, any EC2 R7i or R7a instance you have are not running on a Sapphire Rapids or EPYC processor that you could buy, but instead one customized for AWS. I would presume that other cloud providers have similar deals worked out.
x0x0 · 1d ago
> That's not mostly because of a better ISA

Genuinely asking -- what is it due to? Because like the person you're replying to, the m* processors are simply better: desktop-class perf on battery that hangs with chips with 250 watt TDP. I have to assume that amd and intel would like similar chips, so why don't they have them if not due to the instruction set? And AMD is using TSMC, so that can't be the difference.

toast0 · 1d ago
I think the fundamental difference between an Apple CPU and an Intel/AMD CPU is Apple does not play in the megahertz war. The Apple M1 chip, launched in 2020 clocks at 3.2GHz; Intel and AMD can't sell a flagship mobile processor that clocks that low. Zen+ mobile Ryzen 7s released Jan 2019 have a boost clock of 4 GHz (ex: 3750H, 3700U); mobile Zen2 from Mar 2020 clock even higher (ex: 4900H at 4.4, 4800H at 4.2). Intel Tiger Lake was hitting 4.7 Ghz in 2020 (ex: 1165G7).

If you don't care to clock that high, you can reduce space and power requirements at all clocks; AMD does that for the Zen4c and Zen5c cores, but they don't (currently) ship an all compact core mobile processor. Apple can sell a premium branded CPU where there's no option to burn a lot of power to get a little faster; but AMD and Intel just can't, people may say they want efficiency, but having higher clocks is what makes an x86 processor premium.

In addition to the basic efficiency improvements you get by having a clock limit, Apple also utilizes wider execution; they can run more things in parallel, this is enabled to some degree by the lower clock rates, but also by the commitment to higher memory bandwidth via on package memory; being able to count on higher bandwidth means you can expect to have more operations that are waiting on execution rather than waiting on memory, so wider execution has more benefits. IIRC, Intel released some chips with on package memory, but they can't easily just drop in a couple more integer units onto an existing core.

The weaker memory model of ARM does help as well. The M series chips have a much wider out of order window, because they don't need to spend as much effort on ordering constraints (except when running in the x86 support mode); this also helps justify wider execution, because they can keep those units busy.

I think these three things are listed in order of impact, but I'm just an armchair computer architecture philosopher.

fluoridation · 1d ago
Does anyone actually care at all about frequencies? I care if my task finishes quickly. If it can finish quickly at a low frequency, fine. If the clock runs fast but the task doesn't, how is that a benefit?

My understanding is that both Intel and AMD are pushing high clocks not because it's what consumers want, but because it's the only lever they have to pull to get more gains. If this year's CPU is 2% faster than your current CPU, why would you buy it? So after they have their design they cover the rest of the target performance gain by cranking the clock, and that's how you get 200 W desktop CPUs.

>the commitment to higher memory bandwidth via on package memory; being able to count on higher bandwidth means you can expect to have more operations that are waiting on execution rather than waiting on memory, so wider execution has more benefits.

I believe you could make a PC (compatible) with unified memory and a 256-bit memory bus, but then you'd have to make the whole thing. Soldered motherboard, CPU/GPU, and RAM. I think at the time the M1 came out there weren't any companies making hardware like that. Maybe now that x86 handhelds are starting to come out, we may see laptops like that.

Yizahi · 1d ago
It's only recently that consumer software has become truly multithreaded. Historically there were major issues with that until very recently. Remember the Bulldozer fiasco? AMD bet on parallel execution more than Intel did at the time, e.g. the same-price Intel chip was 4 cores while AMD had 8 cores (consumer market). Single-thread performance had been the deciding factor for decades. Even today AMD's outlier SKUs with a lot of cores and slightly lower frequencies (like 500 MHz lower or so) are not the topic of the day in any media or forum community. People talk about either the top-of-the-line SKU or something with a low core count but clocking high enough to be reasonable for lighter use. Releasing a low-frequency, high-core-count part for consumers would be greeted with questions like "what is this CPU for?".
fluoridation · 1d ago
Are we just going to pretend that frequency = single-thread performance? I'm fine with making that replacement mentally, I just want to confirm we're all on the same page here.

>Releasing low frequency high core count part for consumers would be greeted with questions, like "what for is this CPU?".

It's for homelab and SOHO servers. It won't get the same attention as the sexy parts... because it's not a sexy part. It's something you put in a box, stuff in a corner, and let chug away for ten years without looking at it again.

wmf · 1d ago
> low frequency high core count part for consumers

That's not really what we're talking about. Apple's cores are faster yet lower clocked. (Not just faster per clock but absolutely faster.) So some people are wondering if Intel/AMD targeting 6 GHz actually reduced performance.

gigatexal · 1d ago
But the OS has been able to take advantage of it since Mountain Lion with Grand Central Dispatch. I could be wrong about the code name. This makes doing parallel things very easy.

But most every OS can.

astrange · 23h ago
Parallelism is actually very difficult and libdispatch is not at all perfect for it. Swift concurrency is a newer design and gets better performance by being /less/ parallel.

(This is mostly because resolving priority inversions turns out to be very important on a phone, and almost no one designs for this properly because it's not important on servers.)

cosmic_cheese · 1d ago
> Apple can sell a premium branded CPU where there's no option to burn a lot of power to get a little faster; but AMD and Intel just can't, people may say they want efficiency, but having higher clocks is what makes an x86 processor premium.

I think this is very context dependent. Is this a big, heavy 15”+ desktop replacement notebook where battery life was never going to be a selling point in the first place? One of those with a power brick that could be used as a dumbbell? Sure, push those clocks.

In a machine that's more balanced or focused on portability, however, high clock speeds do nothing but increase the likelihood of my laptop sounding like a jet and chewing through battery. In that situation higher clocks make a laptop feel less premium, because it's worse at its core use case for practically no gain in exchange.

exmadscientist · 1d ago
> I have to assume that amd and intel would like similar chips

They historically haven't. They've wanted the higher single-core performance and frequency and they've pulled out all the stops to get it. Everything had been optimized for this. (Also, they underinvested in their uncores, the nastiest part of a modern processor. Part of the reason AMD is beating Intel right now despite being overall very similar is their more recent and more reliable uncore design.)

They are now realizing that this was, perhaps, a mistake.

AMD is only now in a position to afford to invest otherwise (they chose quite well among the options actually available to them, in my opinion), but Intel has no such excuse.

x0x0 · 1d ago
Not arguing, but I would think there is (and always has been) very wide demand for fastest single core perf. From all the usual suspects?

Thank you.

MBCook · 21h ago
Oh there certainly is. And there’s a reason Apple works hard for really fast single core performance. For a lot of tasks it still matters.

I suspect one of the issues is that pushing the clock is a really easy way to get an extra 2% so you can claim the crown of fastest or try to win benchmarks. It’s easy to fall into a trap of continuing to do that over and over.

But we know the long-term result. You end up blasting out a ton of heat and taking up a ton of power, even though you may only be 10% faster than a competitor who did things differently. Or worse you try to optimize for ever increasing clocks and get stuck like the Pentium 4.

As said up thread, no one really compares Apple CPU speeds with megahertz. That’s partially because Apple doesn’t talk about it or emphasize it which makes it more difficult, and partially because it’s not like you have a choice anyway.

It would never happen but it would be interesting to see how things would develop if it was possible to simply ban talking about clock speeds somehow. What would that do to the market?

exmadscientist · 18h ago
Only Intel and AMD actually attempt to deliver fastest single-thread performance. Apple has made the decision that almost-but-not-quite-the-fastest is good enough for them.

And that has made all the difference.

aurareturn · 9h ago
You’ve been saying that this whole thread but you’ve not provided any evidence.
bryanlarsen · 1d ago
What's it due to? At least this, probably more.

- more advanced silicon architecture. Apple spends billions to get access to the latest generation a couple of years before AMD.

- world class team, with ~25 years of experience building high speed low power chips. (Apple bought PA Semi to make these chips, which was originally the team that built the DEC StrongARM). And then paid & treated them properly, unlike Intel & AMD

- a die budget to spend transistors for performance: the M chips are generally quite large compared to the competition

- ARM's weak memory model also helps, but it's very minor IMO compared to the above 3.

aurareturn · 15h ago

  - a die budget to spend transistors for performance: the M chips are generally quite large compared to the competition
This is a myth. Apple chips are no bigger than the competition. For example, base M4 is smaller than Lunar Lake but is more efficient and 35% faster. M4 Pro is smaller than Strix Halo by a large margin but generally matches/exceeds the performance. Only the M4 Max is very large but it has no equivalent in the x86 world.
astrange · 23h ago
> And then paid & treated them properly, unlike Intel & AMD

Relatively properly. Nothing like the pampering software people get. I've heard Mr. Srouji is very strict about approving promotions personally etc.

(…by heard I mean I read Blind posts)

gigatexal · 1d ago
How many of those engineers remain? Didn't a lot go to Nuvia, which was then bought by Qualcomm?
bryanlarsen · 1d ago
Sure, but they were there long enough to train and instill culture into the others. And of course, since the acquisition in 2008 they've had access to the top new grads and experienced engineers. If you're coming out top of your class at an Ivy or similar you're going to choose Apple over Intel or AMD both because of rep and the fact that your offer salary is much better.

P.S. Hearsay and speculation, not direct experience. I haven't worked at Apple and anybody who has is pretty tight-lipped. You have to read between the lines.

P.P.S. It's sort of a circular argument. I say Apple has the best team because they have the best chip && they have the best chip because they have the best team.

But having worked (briefly) in the field, I'm very confident that their success is much more likely due to having the best team rather than anything else.

MBCook · 21h ago
And isn’t that the reason people think some of the most recent Qualcomm chips are so much better?
x0x0 · 1d ago
interesting, ty

re: apple getting exclusive access to the best fab stuff: https://appleinsider.com/articles/23/08/07/apple-has-sweethe... . Interesting.

MBCook · 21h ago
At the same time they have a guaranteed customer who will buy the chips. How many other companies would be willing to try a process with a 30% success rate?

I think Apple helps them with money (loan?) to get some of the equipment or build the new lines. In exchange they get first shot at buying capacity.

And of course Apple is certainly paying for the privilege of the best process. At least more than other companies are willing. And they must buy a pretty tremendous volume across a couple of sizes.

It benefits both companies, otherwise they wouldn’t do it.

ThrowawayR2 · 1d ago
Intel and AMD are after the very high profit margins of the enterprise server market. They have much less motivation to focus on power efficient mobile chips which are less profitable for them.

Apple's primary product is consumer smartphones and tablets so they are solely focused on power efficient mobile chips.

bsder · 1d ago
> Genuinely asking -- what is it due to?

Mostly memory/cache subsystem.

Apple was willing to spend a lot of transistors on cache because they were optimizing the chips purely for mobile and can bury the extra cost in their expensive end products.

You will note that after the initial wins from putting stonking amounts of cache and memory bandwidth in place, Apple has not had any significant performance jump beyond the technology node improvements.

astrange · 23h ago
They aren't aiming for performance in the first place. It's a coincidence that it has good performance. They're aiming for high performance/power ratios.
MBCook · 21h ago
Wasn’t the M3 a reasonable increase and the M4 much more significant than that?

The M2 certainly wasn't an amazing jump.

x0x0 · 21h ago
I still don't understand, though. Given their profit margins, the fact that they're shipping M chips in e.g. $1k computers means it's a $150 part.

There's tons of people that would pay $300+ for an equivalent perf + heat x86 competitor.

KingOfCoders · 17h ago
I think everything depends on circumstances.

I've used laptops for 15+ years (transitioned from a Mac Cube to a white MacBook, MacBook Pro, etc.) but migrated to a desktop some years ago (first an iMac Pro, now AMD), as I work at my desk, and when I'm not at my desk I'm not working.

Some years ago I got a 3900X and a 2080 Ti. They still work fine and I don't have performance problems, and although I thought of getting PCIe 5/NVMe with a 9950X3D/395+ (or a Threadripper), I just don't need it. I've upgraded the SSDs several times for speed and capacity (now at the PCIe 4/M.2 limit and don't want to go into RAID), and added solar panels and a battery pack for energy usage, but I'm fine otherwise.

Indeed I want to buy a new CPU and GPU, but I don't find enough reasons (though might get a Mac Studio for local AI).

But I understand your point if you need a laptop, I just decided I no longer need one, and get more power with faster compiling for less money.

pengaru · 1d ago
Your Intel Mac was stuck in the past while everyone paying attention on the PC side was already enjoying TSMC 7nm silicon in the form of AMD Zen processors.

Apple Silicon macs are far less impressive if you came from an 8c/16t Ryzen 7 laptop. Especially if you consider the Apple parts are consistently enjoying the next best TSMC node vs. AMD (e.g. 5nm (M1) vs. 7nm (Zen2))

What's _really_ impressive is how badly Intel fell behind and TSMC has been absolutely killing it.

jeswin · 16h ago
> Your Intel mac was stuck in the past while everyone paying attention on PCs were already enjoying TSMC 7nm silicon in the form of AMD Zen processors.

This is basically it. Coming from dated Intel CPUs, Mac users got a shockingly good upgrade when the M-series computers were released. That amplified Apple's claims of Macs being the fastest computers, even when some key metrics (such as disk performance) were significantly behind PC parts in reality.

Yes, they're still better in performance/watt - but the node difference largely explains it like you were saying.

gigatexal · 1d ago
That Ryzen laptop chip may perform, but the Apple chip will do it at a higher perf/watt... and on a laptop that's a key metric.
tracker1 · 1d ago
And 20% or so of that difference is purely the fab node difference, not anything to do with the chip design itself. Strix Halo is a much better comparison, though Apple's M4 models do very well against it often besting it at the most expensive end.

On the flip side, if you look at servers... Compare a 128+core AMD server CPU vs a large core ARM option and AMD perf/watt is much better.

gigatexal · 1d ago
Wait, are you saying the difference in perf per watt from Apple ARM to x86 is purely down to fab leading-edge-ness?
Jensson · 1d ago
Basically yeah; if you compare CPUs from the same fab then it's basically the same.

It's just that Apple buys next-gen fab capacity while AMD and Intel have to be on the last gen, so the M computers people compare are always one fab generation ahead. It has very little to do with CPU architecture.

They do have some cool stuff in their CPUs, but the thing most people laud them for has to do with fabs.

addaon · 1d ago
There's another difference -- willingness to actually pay for silicon. The M1 Max is a 432 mm^2 laptop chip built on a 5 nm process. Contrast that to AMD's "high end" Ryzen 7 8845HS at 178 mm^2 on a 4 nm process. Even the M1 Pro at 245 mm^2 is bigger than this. More area means not just more peak performance, but the ability to use wider paths at lower speeds to maintain performance at lower power. 432 mm^2 is friggin' huge for a laptop part, and it's really hard to compete with what that can do on any metric besides price.
MindSpunk · 20h ago
Comparing the M1 Max to a Ryzen 7 8845HS is not a fair comparison because the M1 chip also includes a _massive_ GPU tile, unlike the 8845HS which has a comparatively tiny iGPU because most vendors taking that part are pairing them with a separate dGPU package.

A better comparison is to take the total package area of the AI Max+ 395 that includes a 16 core CPU + a massive GPU tile and you get ~448mm^2 across all 3 chiplets.

tracker1 · 1d ago
Apple's SoC does a bit more than AMD's, such as including the SSD controller. I don't know if Apple is grafting different nodes together for chiplets, etc. compared to AMD on desktop.

The area has nothing to do with peak performance... given the node, it has to do with the number of components you can cram into a given space. The CRAY-1 CPU was massive compared to both of your examples, but doesn't come close to either in terms of performance.

Also, Ryzen AI Max+ 395 is top dog on the AMD mobile CPU front and is around 308mm^2 combined.

addaon · 1d ago
> The area has nothing to do with peak performance... based on the node, it has to do with the amount of components you can cram into a given space.

Of course it does. For single-threaded performance, the knobs I can turn are clockspeed (minimal area impact for higher speed standard cells, large power impact), core width (significant area impact for decoder, execution resources, etc, smaller power impact), and cache (huge area impact, smaller power impact). So if I want higher single-threaded performance on a power budget, area helps. And of course for multi-threaded performance the knobs I have are number of cores, number of memory controllers, and last-level cache size, all of which drive area. There's a reason Moore's law was so often interpreted as talking about performance and not transistor count -- transistor count gives you performance. If you're willing to build a 432 mm^2 chip instead of a 308 mm^2 chip iso-process, you're basically gaining a half-node of performance right there.

tracker1 · 23h ago
Transistor count does not equal performance. More transistors isn't necessarily going to speed up any random single-threaded bottleneck.

Again, the CRAY-1 CPU is around 42000 mm^2, so I'm guessing you'd rather run that today, right?

gigatexal · 19h ago
True, the M1 Pro and Max chips were capable of 200GB/s and 400GB/s of bandwidth between the chip and the integrated memory. No desktop chips had that at the time, I think.
aurareturn · 15h ago

  Basically yeah, if you compare CPU from same fab then its basically the same.
This isn't true. If you compare N5 Apple to N5 AMD chips, Apple chips still come out far ahead in efficiency.
gigatexal · 1d ago
Man, that either hella discounts all the amazing work Apple's CPU engineers are doing or hypes up what AMD's have done. Idk
Jensson · 1d ago
Isn't it you who is hyping up Apple here, when you don't even compare the two on a similar fab process? Compare a 5nm AMD low-power laptop CPU to the Apple M1 and the M1 no longer looks that much better at all.
gigatexal · 19h ago
Why are we talking about the M1, which came out eons (in computer time) ago? That the M1 is the benchmark is just sad when the M4 is running circles around competing x86 processors and the M5 is on the horizon, with who knows what in store.
tracker1 · 1d ago
I wouldn't discount what Apple has done... they've created and integrated some really good niche stuff in their CPUs to do more than typical ARM designs. The graphics cores are pretty good in their own right even. Not to mention the OS/Software integration including accelerated x86 and unified memory usage in practice.

AMD has done a LOT for parallelization and their server options are impressive... I mean, you're still talking 500W+ in total load, but that's across 128+ cores. Strix Halo scaling goes down impressively to the ~10-15W range under common usage, not as low as Apple does under similar loads but impressive in its own way.

variadix · 1d ago
Instruction decode for variable-length ISAs is inherently going to be more complex, and thus require more transistors = more power, than fixed-length instruction decode, especially parallel decode. AFAIK modern x86 cores have to speculatively decode instructions to achieve this, compared to RISC ISAs where you know where all the instruction boundaries are and decoding N in parallel is a matter of instantiating N decoders that work in parallel. How much this determines the x86 vs ARM power gap, I don't know; what's much more likely is that x86 designs have not been hyper-optimized for power as much as ARM designs have been over the last two decades. Memory order is another non-negligible factor, but again the difference is probably more attributable to the difference in goals between the two architectures for the vast majority of their lifespans, and the expertise and knowledge of the engineers working at each company.
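
To make the boundary problem concrete, here is a toy sketch (not real decoder logic: the length function is made up, and real x86 length determination has to look at prefixes, opcode bytes, ModRM/SIB and so on):

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Fixed 4-byte instructions: decoder i can look at offset 4*i right away,
   * so N decoders can all start in parallel. */
  static size_t fixed_start(size_t i) { return 4 * i; }

  /* Variable length: a stand-in length function (made-up numbers). */
  static size_t insn_len(const uint8_t *bytes) { return 1 + (bytes[0] & 7); }

  /* The start of instruction i depends on the lengths of instructions
   * 0..i-1, so without prediction/speculation this walk is serial. */
  static size_t variable_start(const uint8_t *code, size_t i) {
      size_t off = 0;
      for (size_t k = 0; k < i; k++)
          off += insn_len(code + off);
      return off;
  }

  int main(void) {
      uint8_t code[32] = {3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 1, 0};
      printf("fixed insn 2 starts at %zu\n", fixed_start(2));             /* 8 */
      printf("variable insn 2 starts at %zu\n", variable_start(code, 2)); /* 10 */
      return 0;
  }

Real x86 front ends get around the serial walk by predicting or pre-marking instruction boundaries and discarding wrong guesses, which is exactly the extra machinery described above.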
Avi-D-coder · 1d ago
From what I have heard it's not the RISCy ISA per se, it's largely arm's weaker memory model.

I'd be happy to be corrected, but the empirical core counts seem to agree.

hydroreadsstuff · 1d ago
Indeed, the memory model has a decent impact. Unfortunately it's difficult to isolate in measurement. Only Apple has support for weak memory order and TSO in the same hardware.
MBCook · 21h ago
Oh there’s an interesting idea. Given that Linux runs on the M1 and M2 Macs, would it be possible to do some kind of benchmark there where you could turn it on and off at will for your test program?
hereme888 · 1d ago
I was just window-shopping laptops this morning, and realized ARM-based ones don't necessarily hold battery-life advantages.
gtirloni · 1d ago
You mean Windows-based ARM laptops or Macbooks?
hereme888 · 10h ago
Windows. Never owned a Mac.
aurareturn · 15h ago
They actually do. Qualcomm matches Intel's best battery life but gives you more performance. Apple exceeds the best x86 performance and battery life by a large margin.
pacetherace · 1d ago
Plus they are not cheap either.
mrheosuper · 21h ago
I blame that on Qualcomm. Right now they are the only vendor with an ARM CPU on Windows. And using a QC CPU means you must buy their whole "solution".
zuhsetaqi · 14h ago
Don't claim it, just show/prove it by offering a chip for consumers that matches or, better, beats the metrics of Apple's offerings.
perryizgr8 · 9h ago
Apple offers no chip though. It builds laptops with the best battery life and near top end performance. No other laptop manufacturer has been able to match that.
w4rh4wk5 · 1d ago
Ok. When will we get the laptop with AMD CPU that is on par with a Macbook regarding battery life?
azornathogron · 1d ago
How much of the Mac's impressive battery life is due purely to CPU efficiency, and how much is due to great vertical integration and the OS being tuned for power efficiency?

It's a genuine question; I'm sure both factors make a difference but I don't know their relative importance.

SushiHippie · 1d ago
I just searched for the asahi linux (Linux for M Series Macs) battery life, and found this blog post [0].

> During active development with virtual machines running, a few calls, and an external keyboard and mouse attached, my laptop running Asahi Linux lasts about 5 hours before the battery drops to 10%. Under the same usage, macOS lasts a little more than 6.5 hours. Asahi Linux reports my battery health at 94%.

[0] https://blog.thecurlybraces.com/2024/10/running-fedora-asahi...

ux266478 · 1d ago
The overwhelming majority is due to the power management software, yes. Other ARM laptops do not get anywhere close to the same battery life. The MNT Reform with 8x 18650s (24000mAh, 3x what you get in an MBP) gets about 5h of battery life with light usage.
jogu · 21h ago
The MNT Reform does not use Li-ion batteries, resulting in much poorer energy density. It's going to depend on the cells being used, but this is what I could see from the Reform Next detail page: 8× LiFePO4 cells (16000 mAh total). Assuming 2200mAh cells, I think this nets you around 56 Wh in a 4S2P configuration; Li-ion cells would be closer to 65 Wh.
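
Spelling out the arithmetic, with those assumed 2200 mAh cells and nominal cell voltages of roughly 3.2 V (LiFePO4) vs 3.7 V (Li-ion):

  E_{LiFePO4} = 4 \times 3.2\,V \times (2 \times 2.2\,Ah) \approx 56\,Wh
  E_{Li\text{-}ion} = 4 \times 3.7\,V \times (2 \times 2.2\,Ah) \approx 65\,Wh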

According to Apple's website it seems a 14 inch macbook pro has a 70 Wh battery.

jml7c5 · 16h ago
Other ARM laptops also have much worse SoCs, so it's not really apples-to-apples.
ux266478 · 8h ago
As someone else posted elsewhere in this thread, Asahi nets about 6.5 hours on a Macbook. Apples to apples the story is the same. It's really not that shocking. The effect of power management (or rather, when it's lacking) is very apparent when you do bare metal programming. The first thing you do when you finally get something booting is kill a battery in less than an hour with a while(true).
John23832 · 1d ago
Tight integration matters.

Look at the difference in energy usage between safari and chrome on M4s.

pablok2 · 1d ago
How much instruction perf analysis do they do to save 1% (compounded) on the most common instructions?
adgjlsfhk1 · 1d ago
It's less that and more all the peripheral things: USB, Wi-Fi, Bluetooth, RAM, random capacitors in the VRM, etc.
mmis1000 · 1d ago
I think it would only be fair to compare when running a more resource-efficient system.

The Steam Deck with Windows 11 vs SteamOS is a whole different experience. When running SteamOS and doing web surfing, the fan doesn't really spin at all. But when running Windows 11 and doing the exact same thing, it spins all the time and the machine gets kinda hot.


pulkittt · 18h ago
This interview with AMD Zen chief architect Mike Clark validated that the efficiency gains are not due to ISA https://www.computerenhance.com/p/an-interview-with-zen-chie...
WithinReason · 1d ago
Since newer CPUs have heterogeneous cores (high performance + low power), I'm wondering if it makes sense to drop legacy instructions from the low power cores, since legacy code can still be run on the other cores. Then e.g. an OS compiled the right way can take advantage of extra efficiency without the CPU losing backwards compatibility
toast0 · 1d ago
Like o11c says, that's setting everyone up for a bad time. If the heterogenous cores are similar, but don't all support all the instructions, it's too hard to use. You can build legacy instructions in a space optimized way though, but there's no reason not to do that for the high performance cores too --- if they're legacy instructions, one expects them not to run often and perf doesn't matter that much.

Intel dropped their x86-S proposal; but I guess something like that could work for low power cores. If you provide a way for a 64-bit OS to start application processors directly in 64-bit mode, you could setup low power cores so that they could only run in 64-bit mode. I'd be surprised if the juice is worth the squeeze, but it'd be reasonable --- it's pretty rare to be outside 64-bit mode, and systems that do run outside 64-bit mode probably don't need all the cores on a modern processor. If you're running in a 64-bit OS, it knows which processes are running in 32-bit mode, and could avoid scheduling them on reduced functionality cores; If you're running a 32-bit OS, somehow or another the OS needs to not use those cores... either the ACPI tables are different and they don't show up for 32-bit, init fails and the OS moves on, or the there is a firmware flag to hide them that must be set before running a 32-bit OS.

jdsully · 1d ago
I don't really understand why the OS can't just trap the invalid instruction exception and migrate it to the P-core. E.g. AVX-512 and similar. For very old and rare instructions they can emulate them. We used to do that with FPU instructions on non-FPU enabled CPUs way back in the 80s and 90s.
saagarjha · 22h ago
It's slow and annoying. What would cpuid report? If it says "yes I do AVX-512" then any old code might try to use it and get stuck on the P-cores forever even if it was only using it sparingly. If you say no then the software might never use it, so what was the benefit?
toast0 · 1d ago
It's not impossible, but it'd be a pain in the butt. If you occasionally use some avx-512 infrequently, no big deal (but also not a big deal to just not use it). But if you use it a lot, all of a sudden your core count shrinks; you might rather run on all cores with avx2. You might even prefer to run avx-512 for cores that can and avx2 for those that can't ... but you need to be able to gather information on what cores support what, and pin your threads so they don't move. If you pull in a library, who knows what they do... lots of libraries assume they can call cpuid at load time and adjust... but now you need that per-thread.

That seems like a lot of change for OS, application, etc. If you run commercial applications, maybe they don't update unless you pay them for an upgrade, and that's a pain, etc.
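
As a concrete illustration of that load-time dispatch pattern, here's a minimal sketch using GCC/Clang's x86 builtins (the function names and the elided kernels are made up; the point is the one-shot check, which is exactly what breaks on a hypothetical machine where only some cores implement AVX-512):

  #include <stddef.h>

  /* Two implementations; bodies elided, this sketch is about the dispatch. */
  static void sum_avx512(const float *a, size_t n) { (void)a; (void)n; }
  static void sum_scalar(const float *a, size_t n) { (void)a; (void)n; }

  typedef void (*sum_fn)(const float *, size_t);
  static sum_fn sum_impl = sum_scalar;

  /* Runs once at load time. On today's x86 parts every core answers cpuid
   * the same way, so caching the answer is fine; with asymmetric cores,
   * this single check on "whatever core the loader ran on" is the problem. */
  __attribute__((constructor))
  static void pick_impl(void) {
      __builtin_cpu_init();
      if (__builtin_cpu_supports("avx512f"))
          sum_impl = sum_avx512;
  }

  int main(void) {
      float data[4] = {0};
      sum_impl(data, 4);   /* whichever path the load-time check picked */
      return 0;
  }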

ryukoposting · 23h ago
We still do that with some microcontrollers! https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html#index-mf...
MBCook · 21h ago
Haven’t there been bugs found in older(?) games because of this?

They would run on the power core and detect the CPU features and turn on some AVX path.

Then the OS would reschedule them onto one of the efficiency cores, and they would try to run the instructions that don't exist there and it would crash?

devnullbrain · 1d ago
Interesting but it would be pretty rough to implement. If you take a binary now and run it on a core without the correct instructions, it will SIGILL and probably crash. So you have these options:

Create a new compilation target

- You'll probably just end up running a lot of current x86 code exclusively on performance cores to a net loss. This is how RISC-V deals with optional extensions.

Emulate

- This already happens for some instructions but, like above, could quickly negate the benefits

Ask for permission

- This is what AVX code does now: the onus is on the programmer to check if the optional instructions can be used. But you can't have many dropped instructions and expect anybody to use it.

Ask for forgiveness

- Run the code anyway and catch illegal instruction exceptions/signals, then move to a performance core. This would take some deep kernel surgery for support. If this happens remotely often it will stall everything and make your system hate you.

The last one raises the question: which instructions are we considering 'legacy'? You won't get far in an x86 binary before running into an instruction operating on memory that, in a RISC ISA, would mean first a load instruction, then the operation, then a store. Surely we can't drop those.

kccqzy · 1d ago
The "ask for permission" approach doesn't work because programs don't expect the capability of a CPU to change. If a program checked a minute ago that AVX512 is available, it certainly expects AVX512 to be continually available for the lifetime of the process. That means chaos if the OS is moving processes between performance and efficiency cores.
wtallis · 1d ago
IIRC, there were several smartphone SoCs that dropped 32-bit ARM support from most but not all of their CPU cores. That was straightforward to handle because the OS knows which instruction set a binary wants to use. Doing anything more fine-grained would be a nightmare, as Intel found out with Alder Lake.
kccqzy · 1d ago
This is the flip side of Intel trying to drop AVX512 on their E cores in the 12th generation processors. It didn't work. It requires the OS to know which processes need AVX512 before they get run. And processes themselves use cpuid to determine the capability of processors and they don't expect it to change. So you basically must determine in advance which processes can be run on E cores and never migrate between cores.
kragen · 1d ago
What if the kernel handled unimplemented instruction faults by migrating the process to a core that does implement the instruction and restarting the faulting instruction?
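
A userspace approximation of that idea, just for flavor (hedged: the real proposal would live in the kernel scheduler, and the "cores 0-3 are the big cores" mask below is an invented assumption about a hypothetical machine):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <signal.h>
  #include <unistd.h>

  /* On SIGILL, re-pin the faulting thread to the assumed big cores and
   * return; returning without advancing the PC retries the instruction,
   * now on a core that (hopefully) implements it. */
  static void on_sigill(int sig, siginfo_t *info, void *ctx) {
      (void)sig; (void)info; (void)ctx;
      cpu_set_t big;
      CPU_ZERO(&big);
      for (int c = 0; c < 4; c++)       /* assumed big-core IDs 0-3 */
          CPU_SET(c, &big);
      if (sched_setaffinity(0, sizeof(big), &big) != 0)
          _exit(1);                     /* no capable core available: give up */
  }

  int main(void) {
      struct sigaction sa = {0};
      sa.sa_sigaction = on_sigill;
      sa.sa_flags = SA_SIGINFO;
      sigaction(SIGILL, &sa, NULL);
      /* ... run code that may use instructions only the big cores have ... */
      return 0;
  }

The obvious cost is the one raised in the replies below: once migrated, the process is stuck on the big cores unless something un-pins it later.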
MBCook · 21h ago
What if that core isn’t free? What if it’s not going to be free for a long time?

That could be a recipe for random long stalls for some processes.

kragen · 16h ago
I don't think avoiding such pathological cases would be that hard. See https://news.ycombinator.com/item?id=45178286
mrheosuper · 13h ago
> What if that core isn’t free

Just context switch it, like how you run 2 programs with a single-core CPU

kragen · 11h ago
It's correct to point out that you could end up in a situation where your "big" cores are all heavily loaded and your "small" cores with less instructions are all idle. That's unavoidable if your whole workload needs the AVX512 instructions or whatever, but it could be catastrophic if your OS just mistakenly thinks it does. But that doesn't seem unavoidable; see my comments further down the thread.
Rohansi · 1d ago
Sounds great for performance.
kragen · 1d ago
Would this be more or less costly than a page fault? It seems like it would be easy to arrange for it to happen very rarely unless none of your cores support all the instructions.
Rohansi · 20h ago
Most likely similar. What would the correct behavior be for the scheduler to avoid hitting it in the future? Flag the process as needing X instruction set extension so it only runs on the high-performance cores?
kragen · 16h ago
Yeah, although maybe the flag should decay after a while? You want to avoid either spending significant percentages of your time trying to run processes that make no progress because they need unavailable instructions or delaying processes significantly because they are waiting for resources they no longer need.

This sounds a little bit subtle, in the way most operating system policy problems are subtle, but far from intractable. Most of the time all your processes are making progress and either all your cores are busy or you don't have enough runnable processes to keep them busy. In the occasional case where this is not true, you can try optimistically deflagging processes that have made some progress since they were last flagged. Worst case, you context switch an idle core to a process that immediately faults. If your load average is 256 you could maybe do this 256 times in a row at most, at a cost of around a microsecond each? Maybe you have wasted a millisecond on a core that would have been idle?

And you probably want the flag lifetime to be on the order of a second normally, so you're not forced to make suboptimal scheduling decisions by outdated flags in order to avoid that microsecond of wasted context switching.

o11c · 1d ago
We've seen CPU-capability differences by accident a few times, and it's always a chaotic mess leading to SIGILL.

The kernel would need to have a scheduler that knows it can't use those cores for certain tasks. Think about how hard you would have to work to even identify such a task ...

mmis1000 · 1d ago
Current Windows or Linux executable formats don't even list the instructions used, though. And even if they were listed, what about dynamic libraries? The program may decide to load a library at any time it wishes, and the OS is not going to know what instructions may be used this time.
MBCook · 21h ago
You couldn’t even scan the executable if you wanted to. Because lots of code will check what the CPU is capable of doing and choose the most efficient path based on what instructions it’s allowed to use.

So until you’ve run it (halting problem) you may find instructions that you’d never even run.

Findecanor · 1d ago
I think it is not really the execution units for simple instructions that take up much chip area on application-class CPUs these days, but everything around them.

I think support in the OS/runtime environment* would be more interesting for chips where some cores have larger execution units such as those for vector and matmul units. Especially for embedded / low power systems.

Maybe x87/MMX could be dropped though.

*. BTW. If you want to find research papers on the topic, a good search term is "partial-ISA migration".

izacus · 1d ago
This was a terrible idea when we tried it on ARM and it'll remain terrible idea on AMD64 as well.

There's just too many footguns for the OS running on such a SoC to be worth it.

userbinator · 21h ago
Interesting to see this show up literally the day after I posted a comment with this 11-year-old article on the same topic:

https://www.extremetech.com/extreme/188396-the-final-isa-sho...

It is a little ironic to see AMD making the backwards-compatibility argument, when they were the ones who made the very inelegant AMD64, had some known differences to Intel's CPUs in some edge-cases, and then much later, https://www.os2museum.com/wp/vme-broken-on-amd-ryzen/

There are some very-low-power x86 SoCs which are largely found in embedded systems; the most famous of these may be https://en.wikipedia.org/wiki/Vortex86

pjdesno · 22h ago
Related to this, I remember seeing a research talk that fairly convincingly demonstrated that almost all performance differences between several ARM and x86 CPUs were explained by microarchitectural features (branch predictor type and size, etc.) rather than ISA. There was one benchmark affected by a deficiency in the ARM ISA, but that's probably fixed by now.
Symmetry · 10h ago
There are two separate issues here. Can an x86 be made nearly as efficient as an ARM chip with unbounded effort? Sure. But it's a lot easier to make a competent ARM design than a competent x86 design, because there's so much more to the latter and the front end has to be a lot more complicated to deal with the variable-length, unaligned instruction encoding.
josemanuel · 6h ago
Doesn't the Arm 'weak' memory model (versus the x86 'strong' memory model) have a significant impact on power consumption?
bob1029 · 14h ago
Theoretically x86 could be more efficient in terms of instruction caches and memory bandwidth, but it doesn't seem like this would be very substantial in most use cases. I'm not aware of many hot path instruction streams that are so complex that caches and interconnects are getting saturated.

Moving instructions to the decoder is way more expensive than actually decoding them. The factor is insane once you get to DRAM. But, it doesn't seem all that relevant in practice.

flembat · 1d ago
That is quite a confession from AMD. It's not x86 at all, just every implementation. It's not like the ARM processors in Macs are simple anymore, that's for sure.
ZuLuuuuuu · 1d ago
There are a lot of theoretical articles which claim similar things but on the other hand we have a lot of empirical evidence that ARM CPUs are significantly more power efficient.

I've used laptops with both Intel and AMD CPUs, and I read/watch a lot of reviews in the thin-and-light laptop space. Although AMD has become more power efficient compared to Intel in the last few years, the AMD alternative is only marginally more efficient (like 5-10%). And AMD is using TSMC fabs.

On the other hand, Qualcomm's recent Snapdragon X series CPUs are significantly more efficient than both Intel and AMD in most tests, while providing the same or sometimes even better performance.

Some people mention the efficiency gains on Intel Lunar Lake as evidence that x86 is just as efficient, but Lunar Lake was still slightly behind in battery life and performance, while using a newer TSMC process node compared to Snapdragon X series.

So, even though I see theoretical articles like this, the empirical evidence says otherwise. Qualcomm will release their second generation Snapdragon X series CPUs this month. My guess is that the performance/efficiency gap with Intel and AMD will get even bigger.

ryukoposting · 1d ago
I think both can be true.

A client CPU spends most of its life idling. Thus, the key to good battery life in client computing is, generally, idle power consumption. That means low core power draw at idle, but it also means shutting off peripherals that aren't in use, turning off clock sources for said peripherals, etc.

ARM was built for low-power embedded applications from the start, and thus low-power idle states are integrated into the architecture quite elegantly. x86, on the other hand, has the SMM, which was an afterthought.

AFAICT the case for x86 ~ ARM perf equivalence is based on the argument that instruction decode, while empirically less efficient on x86, is such a small portion of a modern, high-performance pipeline that it doesn't matter. This reasoning checks out IMO. But this effect would only be visible while the CPU is under load.

rickdeckard · 16h ago
I agree, but I also don't think that the ISA is the differentiation factor for real-life efficiency between those two architectures.

ARM's advantage is that every user-facing OS built for ARM was designed to be power-efficient, with frameworks governing applications.

x86, not so much...

viktorcode · 15h ago
That statement contradicts the significant growth of ARM in server space in recent years.
high_na_euv · 15h ago
It does not
waterTanuki · 15h ago
Then why does AWS charge less for the same compute workloads on ARM?
gt0 · 1d ago
I'm glad an authoritative source has stated this. It's been ongoing BS for years. I first got into ARM machines with the Acorn Archimedes and even back then, people were spouting some kind of intrinsic efficiency benefit to ARM that just didn't make any sense.
astrange · 23h ago
It is true for small cores, it's just not important for big cores compared to all the other things you have to solve.
MBCook · 21h ago
Makes sense. A 386 had 275,000 transistors. You can see how instruction decode stuff for a more complex x86 instruction could be a reasonable fraction of that.

ArrowLake is 17.8 billion. Even if you divide by the core count instruction decode obviously isn’t much of that.

pmarreck · 19h ago
> However, these are insignificant compared to the number of PCs running x86 today, as well as the volume of notebooks being shipped with an x86 CPU

Which is insignificant compared to the number of smartphones and pad devices running ARM quite powerfully (and without much heat) for hours and hours. Their point?

Everything I've read about x86 is that it is hideously complex, which is almost certainly impeding progress. Why would you want to prolong the life of an architecture like that?

sylware · 14h ago
RISC-V has no PI lock like ARM or x86 and x86_64.

RISC-V has to start seriously defending itself, because it is a death sentence for the ARM ISA and could start to cast shadows on x86_64 in some areas, slowly but surely. Some people will try to bring it down, hard.

If you stick to core rva22+ (the core RISC-V ISA), RISC-V is good enough to replace all of them, without PI lock and with a global standard ISA. Software may then have a chance to get out of the horrible mess it is currently in (a lot of critical software code paths may end up assembly-written... no compiler lock-in, extremely hard to do planned obsolescence, etc.).

RISC-V is basically ARM ISA without PI lock.

I have been writing RISC-V assembly running on x86_64 with an interpreter for much of my software projects. It is very pleasant to code with (basic assembler: no pseudo-instructions; I don't even use the compressed instructions).
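
For flavor, here is a toy sketch of what interpreting one RV32I instruction looks like (not the actual interpreter mentioned above; the register file and the sample encoding are just for illustration). ADDI is decoded from the fixed 32-bit encoding and executed:

  #include <stdint.h>
  #include <stdio.h>

  static int32_t regs[32];   /* x0..x31; x0 stays hardwired to zero */

  /* RV32I ADDI field layout: opcode[6:0]=0x13, rd[11:7], funct3[14:12]=0,
   * rs1[19:15], imm[31:20] sign-extended. */
  static void exec_one(uint32_t insn) {
      uint32_t opcode = insn & 0x7f;
      uint32_t rd     = (insn >> 7)  & 0x1f;
      uint32_t funct3 = (insn >> 12) & 0x07;
      uint32_t rs1    = (insn >> 15) & 0x1f;
      int32_t  imm    = (int32_t)insn >> 20;   /* arithmetic shift sign-extends */

      if (opcode == 0x13 && funct3 == 0 && rd != 0)
          regs[rd] = regs[rs1] + imm;
  }

  int main(void) {
      exec_one(0x00500093);           /* addi x1, x0, 5 */
      printf("x1 = %d\n", regs[1]);   /* prints 5 */
      return 0;
  }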

I hope to get my hands on performant RISC-V implementations on a near state-of-the-art silicon process some day (probably a mini-server, for all the self-hosted stuff).

The 'silicon market' is saturated, so it is amazing what the RISC-V supporters have been able to achieve. There will be mistakes (some probably big) before implementations stabilize in the various domains (desktop/server/embedded/mobile/etc), and expect the others to press hard on them.

The next step for RISC-V would be a GPU ISA, and for RVAX a standard hardware GPU programming interface... but it may still be too early for that, since we kind of still don't know if we've reached 'the sweet spot'.

thecosmicfrog · 10h ago
What is "PI lock"? A cursory web search didn't reveal much.
sylware · 3h ago
IP lock, my bad (in my native language it is the other way around).

Intellectual Property lock. ARM, x86 (Intel) and x86_64 (AMD) have extremely strong locks of that kind in countries where they are legally enforceable.

It is said they literally deny anybody else the use of their ISAs unless there is a super strong power struggle and/or a giga-enormous amount of $$$ in the bargain.

cptskippy · 1d ago
The ISA is the contract or boundary between software and hardware. While there is a hardware cost to decode instructions, the question is how much?

As all the fanbois in the thread have pointed out, Apple's M series is fast and efficient compared to x86 for desktop/server workloads. What no one seems to acknowledge is that Apple's A series is also fast and efficient compared to other ARM implementations in mobile workloads. Apple sees the need to maintain separate M and A series CPUs for different workloads, which indicates there's a benefit to both.

This tells me the ISA decode hardware isn't the bottleneck, or at least isn't the only one.

MBCook · 21h ago
Whatever the hardware cost is, it must pale in comparison to trying to implement the instruction set directly. Wasn't that the huge benefit of the Pentium Pro?

The main reason to have both A and M series chips, I think, is related to which custom processors are available as well as how many of each. There’s no point in putting the full GPU from an M4 into a cell phone, it’s just gonna waste a ton of power.

bfrog · 1d ago
And yet... the world keeps proving Intel and AMD wrong on this premise with highly efficient Arm parts. Sure, there are bound to be improvements to make on x86, but ultimately it's a variable-length opcode encoding with a complex decoder path. If nothing else, this is likely a significant issue in comparison to the nicely word-aligned opcode encoding ARM has, and surely, given apples-to-apples core designs, the opcode decoding would be a deciding factor.
ch_123 · 1d ago
> its a variable length opcode encoding with a complex decoder path

In practice, the performance impact of variable length encoding is largely kept in check using predictors. The extra complexity in terms of transistors is comparatively small in a large, high-performance design.

Related reading:

https://patents.google.com/patent/US6041405A/en

https://web.archive.org/web/20210709071934/https://www.anand...

devnullbrain · 1d ago
Jim Keller has a storied career in the x86 world, it isn't surprising he speaks fondly of it. Regardless:

>So fixed-length instructions seem really nice when you're building little baby computers, but if you're building a really big computer, to predict or to figure out where all the instructions are, it isn't dominating the die. So it doesn't matter that much.

Well, efficiency advantages are the domain of little baby computers. Better predictors give you deeper pipelines without stalls, which give you higher clock speeds, and higher wattages.

tinktank · 1d ago
PPA results comparing x86 to ARM say otherwise; take a look at Ryzen's embedded series and Intel's latest embedded cores.
bfrog · 1d ago
Have they reached Apple M levels of performance/watt, now that the Apple M parts have been out for half a decade? Does either AMD or Intel beat Apple in any metric in mobile?
tinktank · 1d ago
Yes and yes - again, go look at the published results.
bfrog · 1d ago
I have; the M4 is significantly ahead by any measure I've seen, with Snapdragon just behind.

https://www.xda-developers.com/tested-apple-m4-vs-intel-luna...

Jensson · 1d ago
The M4 is not on the same fab technology, so it's not comparable. If you want to discuss the validity of some CPU architecture it needs to be between comparable fab technologies; the M4 being a generation ahead there makes the comparison unfair.

If you compare like to like the difference almost completely disappears.

bfrog · 1d ago
You mean how they all use TSMC N3 and the two ARM parts still beat out Lunar Lake? It's not like it's a whole generational node behind here. I get they aren't on the exact same process node, but still.
bfrog · 23h ago
The X Elite uses an older node than Lunar Lake and does better
daneel_w · 1d ago
No they haven't.