How to slow down a program and why it can be useful

102 todsacerdoti 42 8/27/2025, 11:38:35 AM stefan-marr.de ↗

Comments (42)

nvahalik · 4h ago

When doing web development I will occasionally connect my local code base to a remote SQL server via SSH.

This adds enough latency to be noticeable and I’ve found pages that were “OK” in prod that were unbearable in my local environment. Most of the time it was N+1 queries. Sometimes it was a cache that wasn’t working as intended. Sometimes it simply was a feature that “looked cool” but offered no value.

I’m not sure if there is a proxy that would do this locally but I’ve found it invaluable.

enlyth · 3h ago

I'm a big fan of Toxiproxy for these kinds of things:

https://github.com/Shopify/toxiproxy

calebh · 3h ago

I do networked game development on Windows and I've found the clumsy program to be very valuable to simulate adverse network conditions. You can set it up to simulate arbitrary network latency, packet loss and so forth.

https://jagt.github.io/clumsy/

DrillShopper · 1h ago

You can use the tc command from the netem package for those wondering how to achieve this on Linux. https://man7.org/linux/man-pages/man8/tc-netem.8.html

nvahalik · 2h ago

This looks interesting. I'll check it out, thank you!

pseudocomposer · 1h ago

I do this with a Makefile that calls “kubectl port-forward.”

nawgz · 1h ago

I’m not sure if you’re saying the latency was introduced in client <-> server hops or server <-> db hops, but chrome dev tools (I’m sure all browsers too) can simulate different network conditions with a few clicks! Useful for something similar to what you’ve said, but in the end I think you meant server <-> db latency is what you want to inject

adrian_b · 5h ago

In my opinion, NOP and MOV, which are recommended in TFA for slowing down, are the worst possible choices.

The authors have tested a rather obsolete CPU, with a 10-year-old Skylake microarchitecture, but more recent Intel/AMD CPUs have special optimizations for both NOP and MOV, executing them at the renaming stage, well before the normal execution units, so they may appear to have been executed in zero time.

For slowing down, one could use something really slow, like integer division. If that would interfere with the desired register usage, other reliable choices would be add with carry or perhaps complement carry flag. If it is not desired to modify the flags, one can use a RORX instruction for multiple bit rotation (available since Haswell, but not in older Atom CPUs), or one could execute BSWAP (available since 1989, therefore it exists in all 64-bit CPUs, including any Atom).

fanf2 · 4h ago

They care about exact slowdown factors, and they don’t like the instructions that use any of the CPU’s execution resources because the program’s real work interferes with the slowdown factor making it harder to control.

The NOPs in effect use up a small fraction of the instruction decode bandwidth, and if they insert enough NOPs they can reduce the number of real instructions that are issued per cycle and slow down the program with a fine degree of precision.

adrian_b · 3h ago

Many recent CPUs eliminate NOPs without executing them, so using NOP may work on a CPU but be ineffective on another.

A double BSWAP is equivalent with a NOP, but no existing CPU does any special optimization for it and it is very unlikely that any future CPU would do something special with it, as it is mostly obsolete (nowadays MOVBE is preferable instead of it).

NOP cannot ensure a certain slow-down factor, except on exactly the same model of CPU.

yorwba · 1h ago

> Many recent CPUs eliminate NOPs without executing them

The elimination will still take some amount of time, and the smaller this amount is, the better, because it allows dialing in the slowdown more precisely. Of course how many NOPs you need for a particular basic block needs to be measured on the CPU you're profiling on, but that's also the case if you use a more consistently slow instruction, since the other instructions won't necessarily maintain a consistent speed relative to it.

sweetjuly · 8m ago

That's not necessarily true. Many modern CPUs have uop caches, and it's reasonable to assume implementations will eliminate NOPs before placing lines in the uop cache. A hit on the uop cache loses any slowdown you hoped to achieve since now the NOPs are neither fetched nor decoded.

201984 · 1h ago

Even modern CPUs still have to read the NOPs to skip them, and that's what he was talking about. Decode bandwidth.

sidewndr46 · 2h ago

From what I understand if you really want to mess with some performance stuff just throw a random AVX or even better AVX512 workload in the critical path of all cores. Could be something really simple, like expanding out a normal cmov, jmp statement, or just regular integer addition to be needlessly AVX involved. Depending on the CPU this can apparently downclock the core, change the power delivery, and potentially result in pipeline stalls on other cores on the same die.

api · 4h ago

There's a whole niche in cryptography called verifiable delay functions (VDFs) if you want a big rabbit hole to go down.

The idea is that these are unavoidably slow and preferably non-parallelizable to compute one way, but fast or near instantaneous to verify the result. Examples include the Weslowski VDFs based on similar math to RSA, MIMC, and the use of zero knowledge proofs to provide proof of slow computations.

thfuran · 3h ago

For this purpose though, computation should be minimized to avoid spilling registers and having more complicated performance side effects than just the intended direct delay.

01HNNWZ0MV43FF · 2h ago

Those sound like the space-hard and time-hard hashes, like used for password checking

api · 1h ago

They're a relative of that, but have stronger linearity and asymmetrical verification time guarantees AFAIK.

loeg · 5h ago

RDTSC(P) is pretty slow. I wonder if that would work.

adrian_b · 5h ago

RDTSC or RDTSCP, like also CPUID, work certainly very well for slowing down a program.

However, like integer division, they may clobber registers that the program wants to use for other purposes.

For great slow-downs when the clobbered registers do not matter, I think that CPUID is the best, as it serializes the execution and it has a long execution time on all CPUs.

For small slow-downs I think that BSWAP is a good choice, as it modifies only 1 arbitrary register without affecting the flags, and it also is a less usual instruction so it is unlikely that it will ever receive special optimizations, like NOP and MOV.

However, multiple BSWAPs must be used, to occupy all available execution ports, otherwise if there is any execution port not occupied by the rest of the program the BSWAP may be executed concurrently, not requiring any extra time.

clausecker · 4h ago

Skylake has the exact same optimisations.

weinzierl · 7h ago

For the Commodore 64 there was a product called the C64 Snail which could slow it down.

Later on the early PCs we had a Turbo Button, but since everyone had it in Turbo mode all the time it essentially was a way to slow down the machine.

EDIT: Found an image of what I remember as the "C64 Snail". It is called "BREMSE 64" (which is German for brake) in the image.

https://retroport.de/wp-content/uploads/2018/10/bremse64_rex...

layer8 · 4h ago

The purpose of the Turbo button on PCs was indeed to slow it down, originally to the 4.77 MHz of the original 8088, for compatibility. For marketing reasons it was designated “turbo” instead of something indicating a slow-down. Turbo mode was always the default, since that was the standard speed of the respective CPU.

skrebbel · 3h ago

Wow thanks, this makes so much sense! I spent way too many brain cycles wondering about that button when I was a teenager.

k__ · 6h ago

Interesting.

I had the impression, the turbo button was created to slow down new PCs, so they could run old software that relied heavily on CPU speed.

weinzierl · 6h ago

Yes, originally it was added to slow down faster XTs to exactly the 4.77 MHz of the original IBM XT.

With the AT it usually slowed down to some arbitrary frequency and it was more like a gimmick.

foresto · 46m ago

After a recent upgrade, I've been pondering how to spot test optimizations on my workstation, which is now much faster than the average target system. (Keeping a slow computer around and continually moving my code to it would be effective, but inconvenient.)

My first idea is to taskset the code to a particular CPU core, and see if linux will let me put that core in a low frequency mode. Has anyone here done this on AMD hardware? If so, were the available frequencies slow enough to be useful for this purpose?

avidiax · 4h ago

> we can estimate whether an optimization is beneficial before implementing it. Coz simulates it by slowing down all other program parts. The part we think might be optimizable stays at the same speed it was before, but is now virtually sped up, which allows us to see whether it gives enough of a benefit to justify a perhaps lengthy optimization project.

Really interesting idea. I suppose the underlying assumption is that speeding up a function might reveal a new performance floor that is much higher than the speedup would suggest. Spending time to halve a function's time only to discover that overall processing time only decreased slightly because another function is now blocking would indeed be quite bad.

Not sure how this handles I/O, or if it is enough to simply delay the results of I/O functions to simulate decreased bandwidth or increased latency.

gwd · 6h ago

Kind of weird that NOP actually slows down the pipeline, as I'd think that would be the easiest thing to optimize out of the pipeline, unless instruction fetch is one of the main limiting factors. Is it architecturally defined that NOP will slow down execution?

Someone · 6h ago

I think it would be easy, but still not worth the transistors. Think of it: what programs contain lots of NOPs? Who, desiring to write a fast program, sprinkles their code with NOPs?

It’s not worth optimizing for situations that do not occur in practice.

The transistors used to detect register clearing using XOR foo,foo, on the other hand, are worth it, as lots of code has that instruction, and removing the data dependency (the instruction technically uses the contents of the foo register, but its result is independent of its value) can speed up code a lot.

adrian_b · 5h ago

On CPUs with variable instruction length, like the Intel/AMD CPUs, many programs have a lot of NOPs, which are inserted by the compiler for instruction alignment.

However those NOPs are seldom executed frequently, because most are outside of loop bodies. Nevertheless, there are cases when NOPs may be located inside big loops, in order to align some branch targets to cache line boundaries.

That is why many recent Intel/AMD CPUs have special hardware for accelerating NOP execution, which may eliminate the NOPs before reaching the execution units.

kevin_thibedeau · 1h ago

MIPS required NOP insertion for data hazards and the branch delay slot if a productive instruction couldn't be placed in the slot.

adrian_b · 5h ago

It depends on the CPU. On some CPUs a NOP might take the same time as an ADD and it might have the same throughput per clock cycle as ADD.

However, there are CPUs among the Intel/AMD CPUs that can execute up to a certain number of consecutive NOPs in zero time, i.e. they are removed from the instruction stream before reaching the execution units.

In general, no instruction set architecture specifies the time needed to execute an instruction. For every specific CPU model you must search its manual to find the latency and throughput for the instruction of interest, including for NOPs.

Some CPUs, like the Intel/AMD CPUs, have multiple encodings for NOP, with different lengths in order to facilitate instruction alignment. In that case the execution time may be not the same for all kinds of NOPs.

IcePic · 6h ago

I think so, as in "make sure all other stuff has run before calling the NOP finished". Otherwise, it would just skip past it and it would have no effect if placed in a loop, so it would be eating memory for no use at all.

motorest · 6h ago

> I think so, as in "make sure all other stuff has run before calling the NOP finished".

Is this related to speculative execution? The high level description sounds like NOP works as sync points.

bob1029 · 6h ago

Eating memory alone may have the desired effect. The memory bandwidth of a cpu is not infinite.

pkhuong · 6h ago

Yeah, just decode. But that's nice because the effect is independent of the backend's state.

smjburton · 2h ago

Interesting article OP. There was a similar post on Hacker News recently about using this approach on a Postgres database:

- Making Postgres slower (https://news.ycombinator.com/item?id=44704736)

Taking this approach seems to be effective at surfacing issues that otherwise wouldn't show up in regular testing. I could see this being useful if it was standardized to help identify issues before they hit production.

pchristensen · 2h ago

I fondly remember using a program called MoSlow to slow down older DOS games that were used to using every cycle they could get, and so were unplayable when I got a Pentium computer. I joked about how I wish I could find the corresponding "MoFast" program.

plandis · 5h ago

Causal profiling is something I infrequently use but I have been able to apply it outside of micro benchmarks in an orchestration system.

If you have a business process that is complex and owned by many different groups then causal profiling can be a good tool for dealing with the complexity. For large workflows in particular this can be powerful as the orchestration owner can experiment without much coordination (other than making sure the groups know/agree that some processing might be delayed).

1970-01-01 · 2h ago

Bring back the turbo button!

https://en.wikipedia.org/wiki/Turbo_button#Purpose

initramfs · 4h ago

slowing down processors also make them low power. https://github.com/hatonthecat/Solar-Kernel/blob/main/semi-a...

Ask HN: Did modern AI's coding abilities make you lose interest in programming?

Ask HN: Why hasn't x86 caught up with Apple M series?

iOS 26 Launches Sept 15 – Even GPT-5 Doesn't Know It Exists

Ask HN: How can I recover and run my old mobile game from the 2010s?

Ask HN: Does sentience put stress on the brain?

Units of Economics of LLMs. Reply to Ed Zitron's "AI Is a Money Trap"

Ask HN: What to Do with Old iPads?

Ask HN: Is there a temp phone number like temp email?

Out of curiosity: what kind of people use this "forum" (I mean Hacker News)?

Ask HN: Why are so many services rejecting Google Voice numbers for signups?

Ask HN: Why can't ChatGPT count to a million?

Ask HN: What should I use to run React Native tests on a device?

Ask HN: Any experienced devs who use AI extensively in their work?

Ask HN: Best codebases to study to learn software design?

Stop squashing your commits. You're squashing your AI too

Ask HN: How are you attributing your AI usage when developing software?

Patient Lisp Hacker Seeks Same for Long Walks Through IPL-V Code

Ask HN: Has anyone else used online communities that are archetypically "savvy"?

High rate of LLM (GPT5) hallucinations in dense stats domains (cricket)

Ask HN: What are you working on (August 2025)?

429 Too Many Requests from registry.npmjs.org

Ask HN: Are AI filters becoming stricter than society itself?

Ask HN: I just abandoned my PyCharm subscription, what should I use now?

How can a mutex in Wine be faster than a native one on Linux

Ask HN: What is wrong with modern software development

Ask HN: Recommandation for an Ergonomic Keyboard?

Ask HN: Someone has committed 20K+ LoC to a PR, exhausting my CI & AI workflows

Ask HN: How is Nvidia streaming 4k games while remote control sw is low q lag

Ask HN: How do you find early stage startups to join

Ask HN: Best Marketplaces for Used Servers?

Ask HN: Why do people hate on Sabine Hossenfelder so much?

Ask HN: What is the biggest problem LLMs solved in your life/work?

Ask HN: How can I trace what user queries make AI bots crawl my site?

ASK HN: AI in high school. Will teachers and schools have to compensate?

Problem with Payment Gateways

Ask HN: What is your source for answers?

Ask HN: Devices to allow children to listen to podcasts on my local network?

DSPy GEPA Example: Listwise Reranker

Ask HN: Does using public transportation make you more creative than driving?

Ask HN: No easy way for tvOS to display long documents (e.g., terms of service)?

Ask HN: Why aren't Android manufacturers interested in GrapheneOS?

How to slow down a program and why it can be useful

Comments (42)