Worth noting that division (integer, FP, and SIMD) has gotten much cheaper in the last decade. Division is partially pipelined on common microarchitectures now (capable of delivering a result every 2-4 cycles), and latency has dropped from ~30-80 cycles to ~10-20 cycles.
This improvement is sufficient to tip the balance toward favoring division in some algorithms where historically programmers went out of their way to avoid it.
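As a concrete illustration of the trade-off being tipped: the classic division-avoidance trick was to hoist a divide out of a loop and multiply by a precomputed reciprocal. A minimal sketch (function names are hypothetical, and note the reciprocal version is not bit-exact for arbitrary floats):

```c
#include <stddef.h>

/* Straightforward version: one divide per element. With old ~30-80
   cycle non-pipelined dividers this was the loop to avoid. */
void scale_div(float *out, const float *in, size_t n, float d) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] / d;
}

/* Traditional workaround: one divide total, then multiplies, which
   pipeline at one or more per cycle. With modern ~10-20 cycle,
   partially pipelined dividers, the rewrite buys much less. */
void scale_recip(float *out, const float *in, size_t n, float d) {
    float r = 1.0f / d;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * r;
}
```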
net01 · 6h ago
you can find a better table here with most operations and timings: https://uops.info/table.html - it supports most modern and old architectures
I think it is a totally different type of table. Yours is real data. Theirs is more like a ballpark. Maybe there could be some use for the latter? Just to help folks reason about performance.
Although, reasoning about performance can be hard anyway.
owlbite · 2h ago
Trying to reduce high end processor performance to "operation X takes Y cycles" likely confuses the uninitiated more than it helps once you get beyond "cache miss bad".
For the uninitiated, most high-performance CPUs of recent years:
- They are massively out-of-order: the CPU will run any operation whose inputs are all ready in the next available slot of the right type.
- They have multiple functional units. A recent Apple CPU can and will run 5+ different integer ops, 3+ loads/stores, and 3+ floating-point ops per cycle if it can feed them all. And it may well do zero-cost register renames on the fly for "free".
- Functional units are pipelined: you can throw one op into the front end of a pipe each cycle, but the result sometimes isn't available for consumption until 3-20 cycles later (latency depends on the type of the op and whether it can bypass into the next op executed).
- They speculate on branch outcomes, and when they guess wrong they have to flush the pipeline and redo the work down the right path.
- Assorted hazards may add or shave cycles relative to the timing you would see in a different situation.
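The latency-vs-throughput point above is why "op X takes Y cycles" is ambiguous: the same adds run at the adder's latency or its throughput depending on whether each op depends on the previous one. A hedged sketch, not tied to any particular CPU:

```c
#include <stddef.h>

/* One accumulator: every add waits on the previous add's result, so
   the loop runs at roughly one FP-add LATENCY per iteration. */
float sum1(const float *v, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: four independent dependency chains keep the
   pipelined FP adder busy, so the loop approaches THROUGHPUT. */
float sum4(const float *v, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both compute the same sum; only the shape of the dependency graph differs, and that shape, not the per-op cycle count, decides the runtime.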
Liftyee · 4h ago
I agree with this. As someone who's not an expert in assembly and CPU architecture, the "simplified" estimates in a condensed log-chart format were much more insightful. The exact data for specific architectures would be useful for more advanced users than me, but it doesn't offer the same quick "big picture" overview.
bee_rider · 4h ago
Did you get a chance to use it? I’ve only just come across this table now, so I haven’t had a chance to actually try and use it for anything, so I wouldn’t be able to evaluate the usefulness.
I have a sneaking suspicion that this table is satisfying for our brains as a vaguely technical and interesting thing, but I’m not sure how useful it really is. In general the compiler will be really creative in reordering instructions, and the CPU will also be creative about which ones it runs parallel (since it is good at discovering instruction level parallelism). So, I wonder if the level of study necessary to use this information also requires the level of data that is available in the detailed table.
I have not done much caring about instructions, it seems very hard. FWIW I have had some success caring about reducing the number of trips to memory and making sure the dependencies are obvious to the computer, so I’m not totally naive… but I think that caring about instruction timing is mostly for the real hardcore optimization badasses.
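The "reducing trips to memory" point can be made concrete. A sketch, with hypothetical function names: if a loop accumulates through a pointer, the compiler may have to assume the output aliases the input and re-read/re-write memory every iteration; a local accumulator stays in a register.

```c
#include <stddef.h>

/* May load and store *out on every pass, since the compiler can't
   always prove *out doesn't alias v[]. */
void total_mem(long *out, const long *v, size_t n) {
    *out = 0;
    for (size_t i = 0; i < n; i++)
        *out += v[i];
}

/* Local accumulator lives in a register; one store at the end. */
void total_reg(long *out, const long *v, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    *out = acc;
}
```

Both return the same result; the second makes the dependency structure obvious to the compiler, which is often worth more than reasoning about individual instruction timings.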
torium · 2h ago
The x-axis is in CPU cycles (10^0 means 1 cycle).
If your CPU runs at 1000 MHz, that's 10^9 cycles per second. On that CPU the right-hand side of the picture corresponds to 1 ms. You can do 1 million register-register operations in 1 ms, or 1 billion in 1 sec.
Computers are fast.