Virtually any time you put a fixed sleep in your program, it's going to be wrong. It is either too long or too short, and even if it is perfect, there is no guarantee your program will actually sleep for exactly the requested amount of time. A condition variable or something similar would be the right thing to use here.
Of course, if this code path really is only taken very infrequently, you can get away with it. But assumptions are not always right; it's better to make the code robust in all cases.
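For illustration, here is the condition-variable shape in C with pthreads. The names are invented for the sketch; the point is that the waiter blocks until the state actually changes instead of polling on a timer:

    #include <pthread.h>
    #include <stdbool.h>

    /* Hypothetical shared state, guarded by a mutex. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool ready = false;

    /* Waiter: blocks until 'ready' instead of polling with usleep(). */
    void wait_until_ready(void)
    {
        pthread_mutex_lock(&lock);
        while (!ready)                  /* loop guards against spurious wakeups */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
    }

    /* State-changer: flips the flag and wakes the waiter immediately. */
    void mark_ready(void)
    {
        pthread_mutex_lock(&lock);
        ready = true;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

No sleep interval to tune: the waiter wakes as soon as mark_ready() runs, and not before.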
anarazel · 1h ago
The usleep here isn't one that was intended to be hit frequently (and it's taken in a loop, so a too-short sleep is OK). When it was introduced, it was intended to address a very short race that would have been very expensive to avoid altogether. Most of the waiting is intended to be done by waiting on a "transaction lock".
Unfortunately somebody decided that transaction locks don't need to be maintained when running as a hot standby. Which turns that code into a loop around the usleep...
(I do agree that sleep loops suck and almost always are a bad idea)
delusional · 5h ago
> there is no guarantee your program will actually sleep for exactly the requested amount of time
This is technically true, but in the same way that under a preemptive operating system there's no guarantee you'll ever be scheduled. The sleep is a minimum time you will sleep for; beyond that, you already have no guarantee about when the next instruction executes.
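That's easy to demonstrate. A quick sketch that times a nominal 1 ms sleep; on a loaded box the overshoot can be large, but (absent signals) it will never come back early:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec req = {0, 1000000};    /* request 1 ms */
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        nanosleep(&req, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        long elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                        + (t1.tv_nsec - t0.tv_nsec);
        printf("asked for 1000000 ns, slept %ld ns\n", elapsed_ns);
        return 0;
    }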
dwattttt · 3h ago
Spurious wakeups enter the chat.
quietbritishjim · 2h ago
Any reasonable sleep API will handle spurious wakeups under the hood and not return control early.
Joker_vD · 2h ago
It will also ignore those pesky signals and restart the sleep :)
On a more serious note, any reasonable sleep API should either report all wake ups, no matter the reason, or none except for the timeout expiration itself. Any additional "reasonable" filtering is a loaded footgun because different API users have different ideas of what is "reasonable" for them.
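For the record, the restart-on-signal behaviour is easy to build by hand on top of nanosleep(), which takes the "report all wakeups" approach: it fails with EINTR and writes the unslept time to an out-parameter. A sketch:

    #include <errno.h>
    #include <time.h>

    /* Sleep for the full duration, transparently restarting after signals.
     * This is the "report none" flavour; dropping the loop and returning
     * on EINTR gives the "report all" flavour instead. */
    int sleep_full(struct timespec req)
    {
        struct timespec rem;

        while (nanosleep(&req, &rem) == -1) {
            if (errno != EINTR)
                return -1;      /* real error */
            req = rem;          /* resume with the time that was left */
        }
        return 0;
    }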
OkPin · 1h ago
Fascinating root cause: a missing CHECK_FOR_INTERRUPTS() left pg_create_logical_replication_slot basically unkillable on hot standbys. Simple fix, but huge impact.
Makes me wonder how many other Postgres processes might ignore SIGTERM under edge conditions. Do folks here test signal handling during failovers or replica maintenance? Seems like something worth adding to chaos tests.
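For context on the mechanism: Postgres signal handlers only set flags, and backends notice them at explicit CHECK_FOR_INTERRUPTS() calls, so every long-running loop has to include one. Schematically (a simplified sketch, not the actual patched code; condition_satisfied() is made up):

    /* Simplified shape of a killable Postgres wait loop. A loop that
     * omits the CHECK_FOR_INTERRUPTS() call never notices that SIGTERM
     * set the pending-die flag, so the backend appears unkillable. */
    for (;;)
    {
        CHECK_FOR_INTERRUPTS();     /* errors out if a die signal is pending */

        if (condition_satisfied())  /* hypothetical wait condition */
            break;

        pg_usleep(1000L);           /* 1 ms; pg_usleep is the real Postgres helper */
    }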
bhaak · 57m ago
> While the Postgres community has an older and sometimes daunting contribution process of submitting patches to a mailing list, [...]
The Linux kernel works the same way.
What are other big open source projects using if they are not using mailing lists?
I could imagine some using GitHub, but IME it's way less efficient than a mailing list. (If I were a regular contributor to a big project using GitHub, me being me, I would probably look for, or write myself, a GitHub-to-email gateway.)
eunos · 3h ago
With quirks like these, I'm honestly not even confident that PostgreSQL, or any software in general, can be used in mission-critical systems.
vladms · 2h ago
Have you ever read the contraindication list of a medicine? It is the same kind of trade-off. Someone can choose to use something in a "mission critical system" (whatever that is) because the alternatives are worse, not because the chosen solution is perfect (that does not mean I would advise using PostgreSQL - just accepting that it can happen).
On the other hand "quirks like these" are the reason you should not update too often mission critical systems.
cedws · 1h ago
When you look hard enough it’s a miracle anything works at all.
contravariant · 2h ago
Might be a bit late now to start telling people that maybe using software for everything wasn't a good idea.
Joker_vD · 5h ago
The tl;dr is that Postgres, like any long-running "server" process (and especially a DBMS server!), does not run with SIG_DFL as the handler for SIGTERM; it instead sets up a signal handler that merely records the fact that the signal happened, in the hope that whatever loops are going on will eventually pick it up. As usual, some loops don't, and it's very hard to notice.
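A minimal sketch of that flag-setting pattern in C (the flag and handler names are invented; only sigaction() and the sig_atomic_t rule are POSIX):

    #include <signal.h>

    /* The only thing a handler can safely do is record the event. */
    static volatile sig_atomic_t shutdown_requested = 0;

    static void handle_sigterm(int signo)
    {
        (void) signo;
        shutdown_requested = 1;     /* main loops must poll this flag */
    }

    int main(void)
    {
        struct sigaction sa = {0};
        sa.sa_handler = handle_sigterm;     /* instead of SIG_DFL */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGTERM, &sa, NULL);

        while (!shutdown_requested) {
            /* do work; any loop that never re-checks the flag is
             * exactly the kind of loop the article describes */
        }
        return 0;
    }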
namibj · 1h ago
Feels like they should come with watchdog timers that alert when a loop fails to check in on this status often enough?
Then the only thing that still needs manual checking to cover these more exotic states is that the watchdog-and-signal check-in code path does indeed lead to an early exit from the loop it's placed in (and all surrounding loops).
Sure, it's more like fuzzing then. But running in production on systems that are willing to send in reports for logged watchdog alerts would give nearly free monitoring of very diverse, realistic workloads, hopefully covering the relevant-in-practice code paths with many samples before a signal ever happens to hit one of those rarer but still relevant ones.
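A sketch of what such a check-in could look like (purely hypothetical; nothing like this exists in Postgres):

    #include <signal.h>
    #include <time.h>

    /* Every loop iteration calls loop_checkin(), which stamps a time the
     * watchdog can inspect and reports whether the loop should exit. */
    static volatile time_t       last_checkin;
    static volatile sig_atomic_t shutdown_requested;  /* set by a signal handler */

    static int loop_checkin(void)
    {
        last_checkin = time(NULL);
        return shutdown_requested;      /* caller must break out when nonzero */
    }

    /* Watchdog side: log an alert if a loop stopped checking in. */
    static int loop_is_stuck(int max_idle_seconds)
    {
        return time(NULL) - last_checkin > max_idle_seconds;
    }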
bbarnett · 2h ago
Indeed. I've seen DBMSes take close to 10 minutes to gracefully exit, even when idle.
The timeout for a graceful exit in sysvinit scripts and service files is typically 900 seconds.
Most modern DBMS daemons will recover after a SIGKILL, most of the time, especially if you're using transactions. But startup will then lag while the daemon churns through recovery and resolves in-flight transactions.
(Unless you've modified the code to shortcut things; hello, booking.com "we're smarter than you", in 2015 at least, if not today.)
sjsdaiuasgdia · 14m ago
Yeah, as I've said on many incident calls: we can pay for transaction recovery at shutdown or we can pay for it at startup, but it's got to happen somewhere.
The "SIGKILL or keep waiting?" decision comes down to whether you believe the DBMS is actually making progress towards shutdown. Proving that the shutdown is progressing can be difficult, depending on the situation.