Simpler Backoff

123 points | todsacerdoti | 45 comments | 5/31/2025, 4:43:14 AM | commaok.xyz ↗

Comments (45)

efitz · 1d ago
Why a lookup table? Why not just have the client remember the last backoff, and multiply it by a constant and apply a ceiling?

    def getNewBackoff(oldBackoff: int): int {
        newBackoff = oldBackoff * BACKOFF_CONSTANT
        if (newBackoff > BACKOFF_CEILING) {
            newBackoff = BACKOFF_CEILING
        }
        return newBackoff
    }

If you bit-align the ceiling value you can replace the conditional with a bitwise AND mask. If you use 2 as your multiplier you can replace the multiplication with a left shift.

It’s also a good idea to introduce jitter: don’t set the new backoff to exactly the old backoff times a constant, but add a random value so the result is randomized around the desired new value. This helps prevent pile-ons when a bunch of clients notice at about the same time that the server became unresponsive; if they all use the same backoff algorithm, they’re all going to hit the server again at about the same time. It’s good to spread that out to avoid overloading the server during restarts, cold starts, and so on.

The best results I ever got were when I used server side hinting to avoid pile-ons when the retries from the client were all clustered around the same time.

When my server got into an overloaded state, the auth path for clients short-circuited and delivered an overloaded message with a randomized suggested retry value. The server can’t make the clients honor it (we could in my case, because we wrote the clients), but if the clients do honor it, you can spread reconnects over a period instead of making them all point-in-time, and dampen or prevent oscillating overloads.
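A minimal sketch of that server-side hint in Go, using the standard Retry-After header to carry the randomized suggestion; the overload threshold and retry window are made-up values:

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"sync/atomic"
)

var inFlight atomic.Int64

const maxInFlight = 100 // illustrative overload threshold

// retryAfterHint returns a randomized suggested retry delay in
// seconds, so clients that honor it reconnect spread over a window
// instead of all at once. The 5-34s window is illustrative.
func retryAfterHint() int {
	return 5 + rand.Intn(30)
}

func handler(w http.ResponseWriter, r *http.Request) {
	if inFlight.Add(1) > maxInFlight {
		inFlight.Add(-1)
		// Overloaded: short-circuit and hint a randomized retry delay.
		w.Header().Set("Retry-After", fmt.Sprint(retryAfterHint()))
		http.Error(w, "overloaded", http.StatusServiceUnavailable)
		return
	}
	defer inFlight.Add(-1)
	fmt.Fprintln(w, "ok")
}

func main() {
	http.HandleFunc("/", handler)
	// http.ListenAndServe(":8080", nil) // start the server
}
```

Clients that ignore the header behave no worse than before; clients that honor it naturally desynchronize.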

akshayshah · 22h ago
Full agreement from me that jitter is essential. Here’s AWS on good retry policies: https://aws.amazon.com/builders-library/timeouts-retries-and...
mariusor · 1d ago
Similar to your suggestion, I implemented a wrapper around retries in Go before, and despite liking the lookup table version quite a lot, I think it's more important to be able to parametrize the initial duration of the back-off and the rate at which it grows, so that's what I focused on.

I'm not sure how my example fares on a simplicity scale, but it feels a little more readable to me:

    Retry(5, BackOff(time.Second, Linear(1.4), func(_ context.Context) { /* do stuff */ }))
Example cribbed from the documentation: https://pkg.go.dev/git.sr.ht/~mariusor/ssm#StrategyFn
0x696C6961 · 1d ago
Backoff APIs often take a retry count as the input. Your suggestion would require changing the API which isn't always practical.
gdiamos · 1d ago
explicit is better than implicit
efitz · 20h ago
A server cannot make a client behave a certain way. Explicit doesn’t matter in this case, and as many of the comments said, most of the time the best thing to do for the server is to spread retry load randomly over an interval. Explicit is bad for the server.

The only thing (IMO) explicit retries are good for is to make client state more deterministic on the client, but there are many ways of doing that other than hard coding retry intervals. For example, use asynchronous patterns for networking APIs and allow querying state of API-calls-in-progress.

unsnap_biceps · 1d ago
While I like the simplicity of the lookup table, in larger projects I find myself needing different backoffs for different upstream services. I'll have shorter ones for services that can degrade without a large impact on results and longer ones for services that are critical for results. I would prefer one flexible, algorithmically generated call over multiple individual functions with different lookup tables, or generating tables and passing them in each time.
kiitos · 19h ago
Backoff is a client-side behavior, so if you want to give different backoff behaviors to different classes of upstream services, you just define those backoff tables somewhere where the client can pick from them based on the target service it's trying to talk to.
harpiaharpyja · 23h ago
Generating tables is the way to go in that case. The flexibility you need, with the advantages of being data driven.
fmbb · 1d ago
You could have two lookup tables.
VWWHFSfQ · 1d ago
what if you have 100 lookup tables
mceachen · 1d ago
If you don’t have lookup tables for your lookup tables you aren’t doing it right
avinassh · 1d ago
you could use SQLite, store all of them
kstrauser · 23h ago
MongoDB is web scale.
barbazoo · 1d ago
Careful, that’s awfully close to N. /s
handsclean · 1d ago
Here’s a version that’s IMO much simpler still, about half as long as both (after fixing v2’s line count cheating), and fixes a bug: both article versions wait an extra 60 seconds between the last attempt and returning an error.

    func do(ctx context.Context) error {
        for attempt := 1; ; attempt++ {
            if request(ctx) == nil {
                return nil
            }

            if attempt == 10 {
                return fmt.Errorf("failed after %d attempts", attempt)
            }

            // Jittered exponential delay in seconds, capped at 60.
            delay := min(60, math.Pow(2, float64(attempt-1))*(0.75+rand.Float64()*0.5))

            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(time.Duration(delay * float64(time.Second))):
            }
        }
    }
I think the problem with the original isn’t using a formula, it’s over-parameterization. Declaring constants right next to where they’re used once isn’t giving you anything but the opportunity for naming issues and rigidity in how you combine them. If they aren’t passed in or repeated, then I wouldn’t pull them out; if they are, then I’d try to look for minimal and natural boundary points, like just min and max, or maybe an attempt → delay function.
Waterluvian · 1d ago
I sometimes agree, sometimes disagree. I don’t like magic numbers like retry(3, 1_000, 2_500)

(Though sometimes your text editor and docs are good enough to help, but I don’t like making those feel required).

I like how Python (and probably many other languages) allows for using the name of a positional arg. So you could call the exact same function but write it as retry(count=3, min_ms=1_000, max_ms=2_500) if you feel like it needs more clarity. Now you’ve got documented values that are tied to the function signature, their names have been decided for you, and they’re inline!

Mawr · 20h ago
> Declaring constants right next to where they’re used once isn’t giving you anything but the opportunity for naming issues and rigidity in how you combine them. If they aren’t passed in or repeated, then I wouldn’t pull them out;

Using magic numbers all over the place is the mark of a novice who doesn't understand that the meaning behind the numbers, which he knows in his mind, will not magically transfer to other people reading the code. Advising this approach is an interesting call.

Specifically here, we've got these at the beginning of the function:

    const (
        maxAttempts = 10
        baseDelay   = 1 * time.Second
        maxDelay    = 60 * time.Second
    )
If I went out of my way to look at this function, these are probably the values I'm interested in changing, and look, they're right here, at the top. Neat. It's obvious how to adjust the values to get what I want. I can stop reading the function right there.

Without these declarations, how would I know that "10" midway through the function is the number of attempts? I'd have to read and understand the entire logic first. Great. Now multiply this effort by the amount of programmers on the team and the amount of magic numbers.

blensor · 1d ago
Is this meant as a joke or a serious argument?

I personally find the original code just as readable as the static lookup table, if not more so, and I don't need to count the elements if I want to know how often it will retry.

But more importantly, changing the max retries is trivial in the original code and tedious in the static lookup table, especially for bigger changes.

Also, this is something you most likely want to make configurable to adapt to certain scenarios which is not possible with a static table

There are more reasons against it, but those are the main ones that jump out at me right away.

masklinn · 1d ago
Not only that, but now that Go has iterators if you need this in several locations the logic of the original can easily be encapsulated in an iterator with a few relevant tunable knobs, and then you just write something like:

  for range ExponentialBackoff(ctx) {
      err := request(ctx)
      if err == nil {
          return nil
      }
  }
and if one of the callsites needs to configure the backoff then you've got something like:

  for range ExponentialBackoff(ctx, MaxAttempts(20), BaseDelay(5)) {
      err := request(ctx)
      if err == nil {
          return nil
      }
  }
ctxc · 1d ago
If the author is here - please turn word wrap off for code. It's unreadable on mobile.
jitl · 1d ago
Code wraps for me in Safari on iOS 18.5
mukiblejlok · 8h ago
For Python, I would recommend the stamina package. It provides all the mentioned best practices about when and how to retry, in the form of an easy-to-use function decorator. Plus, there is a great video about the problem by the author: https://youtu.be/BxikFuvaT1Y
rixed · 1d ago
Am I the only one who wonders how such trivial posts end up on the HN front page?
Retr0id · 1d ago
Something doesn't have to be long or complex to be interesting.
qoez · 1d ago
It's so easy to make bot accounts that I wouldn't be surprised if a decent number of them are manipulated.
jpillora · 1d ago
ricardobeat · 1d ago
That’s the exact same logic, just written more verbosely and excluding the actual retry mechanism.
yawaramin · 22h ago
On a slightly different note, I am using Fibonacci backoff for a couple of things here and there and (subjectively) finding it a bit nicer than the standard 2x exponential backoff. I have it on a system that waits for a few days to raise an alert on missing data and I find the alerts on a cadence of 1, 2, 3, 5, 8... days are a bit tighter but still not overwhelming.
axeljohnsson · 1d ago
I recently listened to Marc Brooker's talk [1] on retries. My key takeaway: You don't want to put more load on your system during stress. At AWS, they do "adaptive retries", which I found interesting.

[1] https://www.youtube.com/watch?v=rvHd4Y76-fs

zoover2020 · 1d ago
I second this; retry mechanisms can cause retry storms in vast enough distributed systems. In Amazon code bases I found the same adaptive retry strategy, adopted after a Christmas where we played whack-a-mole trying to get a service back up as its clients kept retrying.
grg0 · 23h ago
Or a function backoff(attempt) that abstracts away the details and can be reused in places, perhaps a higher-order function that takes as input the function that does the work.

"This code, or something like it, probably looks really familiar:"

Better leave that company, to be honest.

jelder · 1d ago
This is really just an argument for using iterator generators. Does Go offer those yet?
jitl · 1d ago
Yes
ChrisMarshallNY · 1d ago
I do this, in a couple of my Watch apps.

Might be worth looking at. The current implementation is slightly awkward, but it also uses a random delay.

Watch/iPhone communication isn't very robust. It's much better, in more recent Watches.

charcircuit · 1d ago
Switching away from an implementation that everyone is familiar with is the opposite of making something more readable.
larodi · 1d ago
indeed, back in the day I loved to write Perl code such as

this() and that() or otherr() and fail();

so this all seems very reasonable and clear, and the people who read the code hated it. I've tried many times to do it in JS; sadly it does not work:

a && b || do { console.log(err) }

but all this does not make the code easier to read; it only makes it very likely that you outsmart yourself.

some code is algorithmically more beautiful than other code, more declarative perhaps. but going away from the common understanding of trivial stuff does not always benefit the group or the project.

spauldo · 1d ago
It's down to what's idiomatic for the language. I'd argue that the Perl example is idiomatic Perl (just as the Lisp equivalent would be idiomatic Lisp), but that's certainly not idiomatic JavaScript.
anonzzzies · 1d ago
But how many people need to read/write this code? There are tons of open source libraries, gateways, proxies etc etc.
ayhanfuat · 1d ago
Aren’t both implementations fairly common? Precomputed / hardcoded set of values vs adjusting it inside the loop?
harpiaharpyja · 1d ago
So, this is just about declarative programming style?

I thought based on the title it would have been about different approaches to backoff.

hsbauauvhabzb · 1d ago
Change for the sake of change?
OutOfHere · 1d ago
Using a lookup table doesn't allow parameter values to easily be changed. Also, some of us do not get paid by the line. Fwiw, I do often note the values in a comment, and this is generally sufficient.
kassner · 1d ago
Offtopic, but huge respect for this[1]:

> Lots of people don’t want to read about AI.

> I respect that.

> But I’m currently steeped in the world of AI, for better or for worse, and I want to blog about it. So I’ve split this blog in half.

> The normal blog, which you are reading, is now AI-free.

> There’s another, at /ai that contains only AI posts. It has its own RSS feed.

Thank you. It (non-AI) definitely goes to my RSS reader.

1: https://commaok.xyz/post/blog-schism/


bravesoul2 · 1d ago
Another approach is to make this a cross-cutting concern using a sidecar like Envoy.