In my opinion, NOP and MOV, which are recommended in TFA for slowing down, are the worst possible choices.
The authors have tested a rather obsolete CPU, with a 10-year-old Skylake microarchitecture, but more recent Intel/AMD CPUs have special optimizations for both NOP and MOV, executing them at the renaming stage, well before the normal execution units, so they may appear to have been executed in zero time.
For slowing down, one could use something really slow, like integer division. If that would interfere with the desired register usage, other reliable choices would be add with carry (ADC) or perhaps complement carry flag (CMC). If the flags must not be modified, one can use RORX, a rotate by an immediate bit count that leaves the flags untouched (part of BMI2, available since Haswell, but not in older Atom CPUs).
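A minimal sketch of the integer-division idea, in plain C rather than assembly. The name `slow_div_chain` and the constants are made up for illustration and are not from TFA; how much delay each iteration gives depends entirely on the CPU model.

```c
#include <stdint.h>

/* Hypothetical helper (not from TFA): burn time with a chain of dependent
 * 64-bit divisions. Each division needs the previous result, so the CPU
 * cannot overlap them. The volatile divisor keeps the compiler from
 * replacing the division with a multiply-and-shift sequence. */
static uint64_t slow_div_chain(uint64_t iterations)
{
    volatile uint64_t divisor = 3;
    uint64_t x = UINT64_MAX;

    for (uint64_t i = 0; i < iterations; i++) {
        /* OR the top bit back in so the dividend stays large; on some
         * CPUs division latency depends on the operand values. */
        x = (x / divisor) | (UINT64_C(1) << 63);
    }
    return x; /* returned so the loop is not dead code */
}
```

The caller should use or store the return value, otherwise an optimizing compiler is free to drop the whole loop.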
loeg · 3m ago
RDTSC(P) is pretty slow. I wonder if that would work.
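Something like this is what I have in mind, as an untested-on-your-hardware sketch. `rdtsc_delay` is a made-up name; `__rdtsc()` is the GCC/Clang intrinsic from `<x86intrin.h>`. Note that plain RDTSC is not a serializing instruction, so whether it actually stalls the surrounding code is an open question.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* GCC/Clang header that provides __rdtsc() */
#define HAVE_RDTSC 1
#else
#define HAVE_RDTSC 0
#endif

/* Hypothetical helper: issue `count` back-to-back RDTSC reads and return
 * the difference between the last and the first reading.
 * On non-x86 builds it does nothing and returns 0. */
static uint64_t rdtsc_delay(unsigned count)
{
#if HAVE_RDTSC
    uint64_t first = __rdtsc();
    uint64_t last = first;

    for (unsigned i = 0; i < count; i++)
        last = __rdtsc();
    return last - first;
#else
    (void)count;
    return 0;
#endif
}
```

The returned tick difference gives a rough idea of how much time the reads cost on a given machine, assuming the thread was not migrated or preempted in between.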
weinzierl · 1h ago
For the Commodore 64 there was a product called the C64 Snail which could slow it down.
Later, on the early PCs, we had a Turbo button, but since everyone left it in Turbo mode all the time, it was essentially a way to slow down the machine.
EDIT: Found an image of what I remember as the "C64 Snail". It is called "BREMSE 64" (which is German for brake) in the image.
I had the impression that the turbo button was created to slow down new PCs, so they could run old software that relied heavily on CPU speed.
weinzierl · 1h ago
Yes, originally it was added to slow down faster XTs to exactly the 4.77 MHz of the original IBM XT.
With the AT it usually slowed down to some arbitrary frequency and it was more like a gimmick.
gwd · 1h ago
Kind of weird that NOP actually slows down the pipeline, as I'd think that would be the easiest thing to optimize out of the pipeline, unless instruction fetch is one of the main limiting factors. Is it architecturally defined that NOP will slow down execution?
Someone · 1h ago
I think it would be easy, but still not worth the transistors. Think of it: what programs contain lots of NOPs? Who, desiring to write a fast program, sprinkles their code with NOPs?
It’s not worth optimizing for situations that do not occur in practice.
The transistors used to detect register clearing using XOR foo,foo, on the other hand, are worth it, as lots of code has that instruction, and removing the data dependency (the instruction technically uses the contents of the foo register, but its result is independent of its value) can speed up code a lot.
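For reference, the idiom looks like this when wrapped in GCC/Clang inline assembly. This is only an illustration: `zero_via_xor` is a made-up name, and a compiler will normally emit the same XOR on its own for `r = 0`.

```c
#include <stdint.h>

/* Hypothetical illustration of the XOR zeroing idiom.
 * "xorl %k0, %k0" XORs the 32-bit half of the register with itself;
 * on x86-64 a 32-bit write also clears the upper 32 bits.
 * Falls back to a plain 0 on other compilers or architectures. */
static uint64_t zero_via_xor(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t r;
    __asm__ volatile("xorl %k0, %k0" : "=r"(r) : : "cc");
    return r;
#else
    return 0;
#endif
}
```

The instruction names the register as a source operand, but the result is always zero, which is exactly the false dependency the hardware detection removes.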
adrian_b · 29m ago
On CPUs with variable instruction length, like the Intel/AMD CPUs, many programs have a lot of NOPs, which are inserted by the compiler for instruction alignment.
However, those NOPs are seldom executed, because most of them are outside loop bodies. Nevertheless, there are cases where NOPs end up inside big loops, in order to align some branch targets to cache line boundaries.
That is why many recent Intel/AMD CPUs have special hardware for accelerating NOP execution, which may eliminate the NOPs before reaching the execution units.
adrian_b · 40m ago
It depends on the CPU. On some CPUs a NOP might take the same time as an ADD and it might have the same throughput per clock cycle as ADD.
However, some Intel/AMD CPUs can execute up to a certain number of consecutive NOPs in zero time, i.e. the NOPs are removed from the instruction stream before reaching the execution units.
In general, no instruction set architecture specifies the time needed to execute an instruction. For every specific CPU model you must search its manual to find the latency and throughput for the instruction of interest, including for NOPs.
Some CPUs, like the Intel/AMD CPUs, have multiple encodings for NOP, with different lengths in order to facilitate instruction alignment. In that case the execution time may not be the same for all kinds of NOPs.
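To make the different lengths concrete, here is a sketch of padding a byte buffer the way an assembler might. The byte sequences are the 1 to 9 byte forms recommended in the Intel manual's NOP entry; `nop_table` and `fill_nops` are made-up names, and real assemblers may pick other sequences.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NOP_LEN 9

/* Row n-1 holds the recommended n-byte NOP encoding (unused bytes are 0). */
static const uint8_t nop_table[MAX_NOP_LEN][MAX_NOP_LEN] = {
    {0x90},
    {0x66, 0x90},
    {0x0F, 0x1F, 0x00},
    {0x0F, 0x1F, 0x40, 0x00},
    {0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},
    {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
};

/* Hypothetical helper: fill buf[0..len) with NOPs, longest encodings
 * first, and return how many NOP instructions were written. */
static size_t fill_nops(uint8_t *buf, size_t len)
{
    size_t count = 0, pos = 0;

    while (pos < len) {
        size_t n = len - pos;
        if (n > MAX_NOP_LEN)
            n = MAX_NOP_LEN;
        for (size_t i = 0; i < n; i++)
            buf[pos + i] = nop_table[n - 1][i];
        pos += n;
        count++;
    }
    return count;
}
```

Padding 20 bytes this way yields three instructions (9 + 9 + 2 bytes) instead of twenty single-byte NOPs, which matters on CPUs where each NOP still costs a decode or issue slot.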
IcePic · 1h ago
I think so, as in "make sure all other stuff has run before calling the NOP finished". Otherwise, it would just skip past it and it would have no effect if placed in a loop, so it would be eating memory for no use at all.
motorest · 55m ago
> I think so, as in "make sure all other stuff has run before calling the NOP finished".
Is this related to speculative execution? The high-level description sounds like NOPs work as sync points.
bob1029 · 1h ago
Eating memory alone may have the desired effect. The memory bandwidth of a CPU is not infinite.
pkhuong · 46m ago
Yeah, just decode. But that's nice because the effect is independent of the backend's state.
https://retroport.de/wp-content/uploads/2018/10/bremse64_rex...