Graceful Shutdown in Go: Practical Patterns

76 mkl95 18 5/4/2025, 9:09:16 PM victoriametrics.com ↗

Comments (18)

deathanatos · 21m ago
> After updating the readiness probe to indicate the pod is no longer ready, wait a few seconds to give the system time to stop sending new requests.

> The exact wait time depends on your readiness probe configuration

A terminating pod is not ready by definition. The service will also mark the endpoint as terminating (and as not ready). This occurs on the transition into Terminating; you don't have to fail a readiness check to cause it.

(I don't know about the ordering of the SIGTERM & the various updates to the objects such as Pod.status or the endpoint slice; there might be a small window after SIGTERM where you could still get a connection, but it isn't the large "until we fail a readiness check" TFA implies.)

(And as someone who manages clusters, honestly that infinitesimal window probably doesn't matter. Just stop accepting new connections, gracefully close existing ones, and terminate reasonably fast. But I feel like half of the apps I work with fall into either a bucket of "handle SIGTERM & take forever to terminate" or "fail to handle SIGTERM (and take forever to terminate)".)
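
For reference, "stop accepting new connections, gracefully close existing ones, and terminate reasonably fast" can be sketched in Go with only the standard library. This is a minimal sketch, not TFA's code: the `serve` and `drainDemo` helpers, port 8080, the placeholder handler, and the 10-second drain budget are all assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

// serve runs srv on ln until ctx is cancelled, then stops accepting new
// connections and gives in-flight requests up to drain to finish.
func serve(ctx context.Context, srv *http.Server, ln net.Listener, drain time.Duration) error {
	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	select {
	case err := <-errCh:
		return err // the server failed before any shutdown was requested
	case <-ctx.Done():
	}

	shutdownCtx, cancel := context.WithTimeout(context.Background(), drain)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		_ = srv.Close() // drain budget exhausted: drop the remaining connections
		return err
	}
	// After a clean Shutdown, Serve returns http.ErrServerClosed.
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// drainDemo is a hypothetical self-check: it serves on a random local port
// and cancels the context in place of a real SIGTERM.
func drainDemo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return serve(ctx, &http.Server{Handler: http.NotFoundHandler()}, ln, time.Second)
}

func main() {
	// SIGTERM is what the kubelet sends; SIGINT covers Ctrl-C in local runs.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	ln, err := net.Listen("tcp", ":8080") // assumed port
	if err != nil {
		log.Fatal(err)
	}
	srv := &http.Server{Handler: http.NotFoundHandler()} // placeholder handler
	if err := serve(ctx, srv, ln, 10*time.Second); err != nil {
		log.Fatalf("unclean shutdown: %v", err)
	}
}
```

Whatever drain budget you pick should stay below the pod's `terminationGracePeriodSeconds` (30 seconds by default), otherwise the kubelet sends SIGKILL before the drain finishes.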

gchamonlive · 1h ago
This is one of the things I think Elixir is really smart in handling. I'm not very experienced in it, but it seems to me that having your processes designed around tiny VM processes that are meant to panic, quit and get respawned eliminates the need to have to intentionally create graceful shutdown routines, because this is already embedded in the application architecture.
cle · 45m ago
How does that eliminate the need for the graceful shutdown the author discusses?
evil-olive · 2h ago
another factor to consider is that if you have a typical Prometheus `/metrics` endpoint that gets scraped every N seconds, there's a period in between the "final" scrape and the actual process exit where any recorded metrics won't get propagated. this may give you a false impression about whether there are any errors occurring during the shutdown sequence.

it's also possible, if you're not careful, to lose the last few seconds of logs from when your service is shutting down. for example, if you write to a log file that is watched by a sidecar process such as Promtail or Vector, and on startup the service truncates and starts writing to that same path, you've got a race condition that can cause you to lose logs from the shutdown.
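
one way to sidestep the truncation half of that race is to open the log for appending instead of recreating it on startup. a minimal sketch, assuming the service opens the file itself; `openLog`, `restartDemo`, and the `app.log` name are hypothetical, and rotation is left out entirely.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// openLog opens path for appending, creating it if it does not exist.
// Unlike os.Create, it never truncates, so lines written by the previous
// process stay in the file for a tailing sidecar to pick up.
func openLog(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
}

// restartDemo simulates two process lifetimes writing to the same path and
// returns the final file contents.
func restartDemo() (string, error) {
	dir, err := os.MkdirTemp("", "logdemo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "app.log")
	lines := []string{"old process: shutting down\n", "new process: starting\n"}
	for _, line := range lines {
		f, err := openLog(path)
		if err != nil {
			return "", err
		}
		_, werr := f.WriteString(line)
		cerr := f.Close()
		if werr != nil {
			return "", werr
		}
		if cerr != nil {
			return "", cerr
		}
	}

	b, err := os.ReadFile(path)
	return string(b), err
}

func main() {
	out, err := restartDemo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

with `os.Create` in place of `openLog`, the second "process" would wipe the shutdown line before the sidecar necessarily had a chance to read it.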

utrack · 1h ago
Jfyi, I'm doing exactly this (and more) in a platform library; it covers the issues I've encountered during the last 8+ years I've been working with Go highload apps. During this time developing/improving the platform and rolling was a hobby of mine in every company :)

It (will) cover the stuff like "sync the logs"/"wait for ingresses to catch up with the liveness handler"/etc.

https://github.com/utrack/caisson-go/blob/main/caiapp/caiapp...

https://github.com/utrack/caisson-go/tree/main/closer

The docs are sparse and some things aren't covered yet; however I'm planning to do the first release once I'm back from a holiday.

In the end, this will be a meta-platform (carefully crafted building blocks), and a reference platform library, covering a typical k8s/otel/grpc+http infrastructure.

RainyDayTmrw · 32m ago
I never understood why Prometheus and related use a "pull" model for data, when most things use a "push" model.
tmpz22 · 2h ago
Is it me or are observability stacks kind of ridiculous? Logs, metrics, and traces, each with their own databases, sidecars, visualization stacks. Language-specific integration libraries written by whoever felt like it. MASSIVE cloud bills.

Then after you go through all that effort, most of the data is utterly ignored, and rarely are the business insights much better than the trailer park version: ssh'ing into a box and grepping a log file to find the error output.

Like we put so much effort into this ecosystem but I don't think it has paid us back with any significant increase in uptime, performance, or ergonomics.

nkraft11 · 1h ago
I can say that going from a place that had all of that observability tooling set up to one that was at the "ssh'ing into a box and grepping a log" stage, you best believe I missed company A immensely. Even knowing which box to ssh into, which log file to grep, and which magic words to search for was nigh impossible if you weren't the dev that set up the machine and wrote the bug in the first place.
MortyWaves · 1h ago
I completely agree with you, but I also think that, like many aspects of "tech", certain segments of it have been monopolised and turned into profit generators for certain organisations. DevOps, Agile/Scrum, Observability, and Kubernetes are all examples of this.

This dilutes the good and helpful stuff with marketing bullshit.

Grafana seemingly inventing new time series databases and engines every few months is absolutely painful to try to keep up to date with in order to make informed decisions.

So much so I've started using rrdtool/smokeping again.

evil-olive · 32m ago
if you're working on a system simple enough that "SSH to the box and grep the log file" works, then by all means have at it.

but many systems are more complicated than that. the observability ecosystem exists for a reason, there is a real problem that it's solving.

for example, your app might outgrow running on a single box. now you need to SSH into N different hosts and grep the log file from all of them. or you invent your own version of log-shipping with a shell script that does SCP in a loop.

going a step further, you might put those boxes into an auto-scaling group so that they would scale up and down automatically based on demand. now you really want some form of automatic log-shipping, or every time a host in the ASG gets terminated, you're throwing away the logs of whatever traffic it served during its lifetime.

or, maybe you notice a performance regression and narrow it down to one particular API endpoint being slow. often it's helpful to be able to graph the response duration of that endpoint over time. has it been slowing down gradually, or did the response time increase suddenly? if it was a sudden increase, what else happened around the same time? maybe a code deployment, maybe a database configuration change, etc.

perhaps the service you operate isn't standalone, but instead interacts with services written by other teams at your company. when something goes wrong with the system as a whole, how do you go about root-causing the problem? how do you trace the lifecycle of a request or operation through all those different systems?

when something goes wrong, you SSH to the box and look at the log file...but how do you know something went wrong to begin with? do you rely solely on user complaints hitting your support@ email? or do you have monitoring rules that will proactively notify you if a "huh, that should never happen" thing is happening?

01HNNWZ0MV43FF · 1h ago
Programs are for people. That's why we got JSON, a bunch of debuggers, Python, and so on. Programming is only like 10 percent of programming
wbl · 4h ago
If a distributed system relies on clients gracefully exiting to work, the system will eventually break badly.
Rhapso · 28m ago
And I believe that so much that I don't even consider graceful shutdown in design. Components should be able to safely (and even frequently) hard-crash, and so long as a critical percentage of the system is WAI then it shouldn't meaningfully impact the overall system.

The only way to make sure a system can handle components hard crashing, is if hard crashing is a normal thing that happens all the time.

All glory to the chaos monkey!

smcleod · 3h ago
Way back when, in physical land - I used STONITH for that! https://smcleod.net/2015/07/delayed-serial-stonith/
Thaxll · 57m ago
No one said that.
ikiris · 3h ago
There's a big gap between graceful shutdown to be nice to clients / workflows, and clients relying on it to work.
XorNot · 3h ago
There are valid reasons to want the typical exit not to look like a catastrophic one, even if that's a recoverable situation.

Whether my application went down from SIGINT or from a kill makes a big difference.

Blue-Green migrations for example require a graceful exit behavior.

shoo · 3h ago
> Blue-Green migrations for example require a graceful exit behavior.

it may not always be necessary. e.g. if you are deploying a new version of a stateless backend service, and there is a load balancer forwarding traffic to current version and new version backends, the load balancer could be responsible for cutting over, allowing in flight requests to be processed by the current version backends while only forwarding new requests to the new backends. then the old backends could be ungracefully terminated once the LB says they are not processing any requests.
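
the "not processing any requests" check comes down to counting in-flight requests, wherever that bookkeeping lives. a minimal sketch of it as Go HTTP middleware, purely as an illustration: the `inflight` type, its `wrap` and `idle` methods, and the `demo` helper are hypothetical, and a real load balancer tracks this on its own side rather than in the backend.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// inflight counts requests that have started but not yet finished.
type inflight struct{ n atomic.Int64 }

// wrap returns a handler that bumps the counter for the duration of each
// request served by next.
func (c *inflight) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		c.n.Add(1)
		defer c.n.Add(-1)
		next.ServeHTTP(w, r)
	})
}

// idle reports whether no requests are currently being processed.
func (c *inflight) idle() bool { return c.n.Load() == 0 }

// demo serves one request through the middleware and returns the count seen
// from inside the handler plus the idle status once the request is done.
func demo() (during int64, idleAfter bool) {
	var c inflight
	h := c.wrap(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		during = c.n.Load()
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
	return during, c.idle()
}

func main() {
	during, idleAfter := demo()
	fmt.Println(during, idleAfter)
}
```

once the old backends stop receiving new requests and `idle` stays true, terminating them abruptly does not cut off any request.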