Waiting for Postgres 18: Accelerating Disk Reads with Asynchronous I/O

491 points by lfittl | 5/7/2025, 2:57:03 PM | pganalyze.com

Comments (121)

drdrek · 4h ago
Postgres is such a cool project, I have so much respect for its maintainers and community! For me it's the second most impactful OSS project in the business tech world, behind Linux itself. A real public good to be cherished and praised.
freilanzer · 2h ago
I recently started a ML project using text data and the choice was between MySQL and Postgres. Having looked at the respective features and pros and cons, the choice was immediately obvious. Also, with pgvector and https://postgresml.com available, the choice for Postgres was even easier.
cpursley · 1h ago
powerbook5300CS · 32m ago
Why not MongoDB?
the8472 · 20h ago
On Linux there is also preadv2(..., RWF_NOWAIT), which can be used to do an optimistic non-blocking read from the page cache. That might be useful for io_method = worker, to shave off a bit of latency: try reading on the main thread with NOWAIT and only offload to a worker thread when that fails.
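
A sketch of what that could look like (offload_read_to_worker() here is a hypothetical stand-in for the worker-pool handoff, not a real Postgres function):

    /* Try a non-blocking page-cache read first; only hand the request
     * to an I/O worker when the data is not already resident. */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <errno.h>

    /* hypothetical offload hook standing in for the worker pool */
    void offload_read_to_worker(int fd, void *buf, size_t len, off_t off);

    ssize_t read_opportunistic(int fd, void *buf, size_t len, off_t off)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = len };

        /* RWF_NOWAIT: return EAGAIN instead of blocking on disk I/O */
        ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
        if (n >= 0)
            return n;                    /* page cache hit, served inline */
        if (errno == EAGAIN || errno == EOPNOTSUPP) {
            offload_read_to_worker(fd, buf, len, off);  /* cache miss */
            return -1;                   /* completes asynchronously */
        }
        return -1;                       /* genuine I/O error */
    }
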
anarazel · 18h ago
FWIW, I played with that - unfortunately it seems the overhead of doing the page cache lookup twice is a cure worse than the disease.

Note that we do not offload IO to workers when doing I/O that the caller will synchronously wait for, just when the caller actually can do IO asynchronously. That reduces the need to avoid the offload cost.

It turns out, as some of the results in Lukas' post show, that the offload to the worker is often actually beneficial particularly when the data is in the kernel page cache - it parallelizes the memory copy from kernel to userspace and postgres' checksum computation. Particularly on Intel server CPUs, which have had pretty mediocre per-core memory bandwidth in the last ~ decade, memory bandwidth turns out to be a bottleneck for page cache access and checksum computations.

Edit: Fix negation

the8472 · 18h ago
Ah yeah, getting good kernel<>userspace oneshot memcpy performance for large files is surprisingly hard. mmap has setup/teardown overhead that's significant for oneshot transfers, and regular read/write calls suffer from page cache/per-page overhead. Hopefully all the large-folio work in the kernel will help with that.
anarazel · 17h ago
From what I've seen a surprisingly large part of the overhead is due to SMAP when doing larger reads from the page cache - i.e. if I boot with clearcpuid=smap (not for prod use!), larger reads go significantly faster. On both Intel and AMD CPUs interestingly.

On Intel it's also not hard to simply reach the per-core memory bandwidth with modern storage HW. This matters most prominently for writes by the checkpointing process, which needs to compute data checksums given the current postgres implementation (if enabled). But even for reads it can be a bottleneck, e.g. when prewarming the buffer pool after a restart.

derefr · 11h ago
> if I boot with clearcpuid=smap (not for prod use!), larger reads go significantly faster. On both Intel and AMD CPUs interestingly.

Is there a page anywhere that collects these sorts of "turn the whole hardware security layer off" switches that can be flipped to get better throughput out of modern x86 CPUs, when your system has no real attack surface to speak of (e.g. air-gapped single-tenant HPC)?

the8472 · 2h ago
On the kernel side there's a boot parameter for all of them: mitigations=off. Software that was compiled with additional fences may have to be recompiled to remove them.

https://www.kernel.org/doc/html/latest/admin-guide/kernel-pa...

amluto · 11h ago
SMAP overhead should be roughly constant, and I’d be quite surprised if it’s noticeable for large reads. Small reads are a different story.
anarazel · 11h ago
It turns out to be the other way round, curiously. The bigger the reads (i.e. how much to read in one syscall) and the bigger the target area of the reads (how long before a target memory location is reused), the bigger the overhead of SMAP gets.

If interested, I can dig up the reproducer I had at some point.
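
In the meantime, a minimal sketch of that kind of microbenchmark (not the actual reproducer): time cache-hot pread() calls at a given request size, then boot with clearcpuid=smap and compare:

    /* Usage: ./readbench <file> [chunk_bytes]
     * Read the file once beforehand so it is resident in the page cache. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        size_t chunk = argc > 2 ? strtoul(argv[2], NULL, 0) : 1 << 20;
        int fd = open(argv[1], O_RDONLY);
        char *buf = malloc(chunk);
        long long bytes = 0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 4096; i++) {
            ssize_t n = pread(fd, buf, chunk, 0);  /* always cache-hot */
            if (n <= 0) return 1;
            bytes += n;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f MB/s at %zu-byte reads\n", bytes / secs / 1e6, chunk);
        return 0;
    }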

amluto · 2h ago
That is definitely interesting.
yxhuvud · 17h ago
the8472 · 16h ago
That doesn't speed up userspace<>kernel memcpy, it just reduces cache churn. Despite its name it still goes through the page cache; it just triggers writeback and drops the pages once that's done. For example, when copying to a tmpfs it makes zero difference, since that lives entirely in memory.
senderista · 16h ago
So you're less dependent on the page replacement algorithm being scan-resistant, since you can use this flag for scan/loop workloads, right?
gmokki · 7h ago
I would initially add it for WAL writes and reads. There should never be another read in normal operation.
gavinray · 18h ago
Do you think there's a possibility of Direct IO being adopted at some point in the future now that AIO is available?
anarazel · 18h ago
> Do you think there's a possibility of Direct IO being adopted at some point in the future now that AIO is available?

Explicitly a goal.

You can turn it on today, with a bunch of caveats (via debug_io_direct=data; a sketch of the settings follows the list below). If you have the right workload - e.g. read-only with lots of seqscans, bitmap index scans, etc. - you can see rather substantial perf gains. But it'll suck in many cases in 18.

We need at least:

- AIO writes in checkpointer, bgwriter and backend buffer replacement (think bulk loading data with COPY)

- readahead support in a few more places, most crucially index range scan (works out ok today if the heap is correlated with the index, sucks badly otherwise)
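
For the adventurous, roughly what turning it on looks like today (a hedged sketch; setting names as I understand the PG 18 docs, experimental and not for production):

    # postgresql.conf (PostgreSQL 18, experimental)
    debug_io_direct = data   # bypass the kernel page cache for relation data
    io_method = io_uring     # async reads matter even more without the page cache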

EDIT: Formatting

zX41ZdbW · 18h ago
The usage in ClickHouse database: https://github.com/ClickHouse/ClickHouse/blob/d2697c0ea112fd...

Aside from a few problems in specific Linux kernel versions, it works great.

nu11ptr · 20h ago
Is this new async I/O feature Linux-only?

I know Windows has IOCP and also now an IORing implementation of its own (Less familiar with macOS capabilities other than POSIX AIO).

https://learn.microsoft.com/en-us/windows/win32/api/ioringap...

Update: Most of the comments below seem to be missing the fact that Windows now also has an IORing implementation, as I mentioned above. Comparison article here:

https://windows-internals.com/ioring-vs-io_uring-a-compariso...

macdice · 8h ago
I am not a Windows guy but I (with help) managed to get IOCP working for this in a basic prototype. Will share publicly soon. I also sketched out an IoRing version (if you are interested in helping debug and flesh that out let me know!).

Main learnings: the IOCP version can't do asynchronous flush! Which we want. The IoRing version can! But it can't do scatter/gather AKA vector I/O yet! Which is an essential feature for buffer pool implementation. So actually I am basically waiting for IoRing to add support for that before taking it too seriously (I can see they are working on it because the ops are present in an enum, it's just that the build functions are missing).

So my guess is that in a year or so we should be able to run all PostgreSQL disk I/O through IoRing on Windows. Maybe?

Another complication is that it really wants to be multithreaded (consuming completions for IOs started in another process requires a lot of hoop jumping; I made it work, but...). This will resolve itself naturally with ongoing work to make PostgreSQL multithreaded.

The next problem is that IoRing doesn't support sockets! So in future work on async networking (early prototypes exist) we will likely also need IOCP for that part.

p_ing · 43m ago
Look into Registered I/O for sockets.
lfittl · 19h ago
It depends on the I/O method - as described in the article, "io_uring" is only available on Linux (and requires building with liburing, as well as io_uring being enabled in the kernel), but the default (as of beta1) is actually "worker", which works on any operating system.

The "worker" method uses a dedicated pool of I/O worker processes that run in the background, and whilst not as performant as io_uring in our benchmark, did clearly outperform the "sync" method (which is the same as what Postgres currently has in 17 and older).

nu11ptr · 19h ago
> "io_uring" is only available on Linux

Windows now also has IORing (see my comment above)

severino · 3h ago
Yes, but what's your point? It's not as if we can take this new Postgres version and just compile it on a Windows machine with IoRing support, is it? It requires effort and time (often from volunteers), and given that most Postgres deployments run on Linux, it's understandable that the developers focus on that platform first.
nu11ptr · 1h ago
Misunderstood - I thought you were saying this kind of I/O ring was a feature specific to Linux, but you meant Postgres's support for it is specific to Linux.
stingraycharles · 20h ago
Sounds like this feature is based on io_uring which is a Linux feature. I would be surprised if they implemented async io on Windows before they would on Linux given the user/deployment base being very Linux-heavy.
p_ing · 19h ago
Yeah, surprise: Linux had to play catch-up to a Windows 1994 release! Same with the scheduler; I'd argue Windows does OOM better than Linux today...

Windows even had the concept of io_uring before, but network only with Registered I/O back in the Windows 8 (8.1?) days.

Linux still lacks the "all I/O is async" NT has.

The underlying kernel and executive of Windows aren't primitive pieces of trash. They're quite advanced, ruined by PMs and the Ads division throwing crap on top.

And yes, Windows' I/O Ring is a near 1:1 copy of the Linux implementation, but the IOCP/OVERLAPPED I/O data structures preceded it, going back to NT's first release.

This isn't a pissing match, just we all hope that kernel devs learn from each other and take the best ideas. Sometimes we, IT, don't get to choose the OS we run -- it's dictated by the apps the business requires.

greenavocado · 19h ago
How difficult would it be to completely tear out the Windows desktop experience and just use the system and display drivers without the rest? Has anybody attempted such a feat?
p_ing · 18h ago
There is Windows Server Core which removes everything but a CLI, but you still have the normal login experience, you still have a "desktop" (no start menu, taskbar, etc), you can still launch normal Win32 apps... for the most part (task manager, notepad, and so on).

Win32 is also responsible for core Services, which means you can't de-Windows-ify Windows and strip it down to an NT API-only. All other personalities (OS/2, POSIX, SFU) have a dependency on Win32, as well.

You're still running the WindowServer of course; it's part of the Executive.

That said, with a bunch of modifications, NTDEV did get Windows 11 down to its bare minimum, and text-only to boot. So I guess it's technically possible, though not useful.

https://www.youtube.com/watch?v=SL6t_iuitxM

smileybarry · 18h ago
> There is Windows Server Core which removes everything but a CLI, but you still have the normal login experience, you still have a "desktop" (no start menu, taskbar, etc), you can still launch normal Win32 apps... for the most part (task manager, notepad, and so on).

Yep, they've replaced nearly every UI with text (the login window is a TUI), though there's still some shell DLLs and the whole thing still uses a window manager. That's honestly for the best, since it allows you to migrate full installations with some UI-based apps to a Core installation with them intact.

> That said, with a bunch of modifications, NTDEV did get Windows 11 down to it's bare minimum, and text only to boot. So I guess it's technically possible, though not useful.

Windows has had a "text mode" since at least Windows XP IIRC, but it's really not that useful, if at all. Even for rescue operations you're better off with Windows PE.

Ericson2314 · 18h ago
password4321 · 15h ago
Free-as-in-beer Hyper-V Server 2019 is in extended support (only security updates) until 2029.
dboreham · 18h ago
There are several such things. The Windows installer uses one. Xbox uses another.
cyberax · 18h ago
> Same with the scheduler

Windows does OOM far better than Linux because it doesn't really overcommit RAM.

But the CPU _scheduler_ in Linux is far, far, far better than in Windows. Linux can even do hard-realtime, after all.

nu11ptr · 19h ago
> this feature is based on io_uring which is a Linux feature

And now also a Windows feature, see my comment above for info

zokier · 1h ago
IoRing != io_uring. They are similar APIs but not the same thing.
PaulHoule · 20h ago
For a long time there have been APIs for asynchronous file I/O on the books for Linux, but they weren't worth using because they didn't really speed anything up.
anarazel · 19h ago
They sped things up for a long time - but only when using unbuffered IO. The new thing with io_uring is that it also accelerates buffered IO. In the initial version it was all done through kernel worker threads, but these days several filesystems have better paths for common cases.
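
For the curious, the minimal liburing shape of a single buffered read ("datafile" is a placeholder; link with -luring, error handling omitted):

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        char buf[8192];

        io_uring_queue_init(8, &ring, 0);        /* 8-entry submission ring */

        int fd = open("datafile", O_RDONLY);     /* buffered: no O_DIRECT */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);                  /* one syscall, any queue depth */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);          /* completion, not a blocking read() */
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }
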
immibis · 19h ago
IIRC they literally just did the sync I/O on a worker thread.
spwa4 · 19h ago
Yes, although Windows has had async I/O since Windows NT 3.1, its API is still not supported by Postgres.
anarazel · 19h ago
FWIW, there are prototype patches for an IOCP based io_method. We just couldn't get them into an acceptable state for PG 18. I barely survived getting in what we did...
spwa4 · 42m ago
I didn't mean that as a criticism on Postgres, certainly not on Postgres developers. I really look forward to evaluating this change. I'm just reminiscing a bit about admiring Windows NT ... and being incredibly disappointed by the market.
nu11ptr · 19h ago
Yes, that was IOCP, however, Windows now also has IORing (see my comment above)
mijoharas · 3h ago
Does anyone run Postgres on Windows?

Given that they still target it, there must be a user base. Does anyone know the statistics of usage by platform? Anyone here use it?

Genuinely curious; Windows backend dev is something I know very little about.

kev009 · 16h ago
A lot of work has gone into FreeBSD's aio(4) so it will be interesting to see how that works, because it doesn't have the drawbacks of Linux/glibc aio.
macdice · 9h ago
BTW I have patches for PostgreSQL AIO on FreeBSD, which I will propose for v19. It works pretty well! I was trying to keep out of Andres's way for the core architectural stuff and basic features, i.e. I didn't want to overload the pipes with confusing new topics for v18 :-)
tiffanyh · 15h ago
Would you mind expanding more on this topic?

Is FreeBSD doing anything significantly different and/or better?

kev009 · 13h ago
Sure, two separate things really.

Linux aio (the POSIX one, not the broader concept that now includes io_uring) has been fundamentally bad. Part of the problem is that some of the implementation is delegated to user space in glibc as worker threads. https://lwn.net/Articles/671649/ surveys the issues. I have not done a deep dive into this topic in a long time, but as far as I know the situation never greatly improved. io_uring does not suffer from those problems, although it is a new and non-standard API with associated pros and cons.

Thomas Munro has gone into some of the benefits and also gaps of FreeBSD's aio(4) vs io_uring here https://wiki.postgresql.org/wiki/FreeBSD/AIO. Notably, because the implementation is in kernel and has received continuous improvement it is gradually removing downsides and there are several users that need it to work well. This document undersells the problems of ZFS though: the ARC is a necromanced buffer cache which Sun ironically worked very hard to remove (http://mcvoy.com/lm/papers/SunOS.vm_arch.pdf) and nobody has fixed this in 20 years. But for UFS or raw block devs or vendor file systems that doesn't matter.

FreeBSD being a complete "src" tree yields some advantages. In concrete, there are some in tree consumers like ctld (a CAM/iSCSI server) that have been used as vehicles to provide end to end implementation of things like NIC offload of complex protocols that play well with backend block devices such that you can make an offloaded data path that is pumped via asynchronous completion events on both sides (network and block). A related whitepaper https://www.chelsio.com/wp-content/uploads/resources/T6-100G... but this concept can be extended to iSCSI, NVMeOF etc. It seems NVMeOF work sponsored by Chelsio is yielding a lot of the finishing touches https://papers.freebsd.org/2023/eurobsdcon/baldwin-implement.... I believe my colleagues are also improving aio to further optimize the Netflix case of KTLS NIC with data on disk but I am not the right person to extrapolate on that.

skeptrune · 19h ago
How close is this to the way MySQL does it with InnoDB? It appears to be about the same.
seunosewa · 19h ago
Yep. It's low-hanging fruit they should've picked years ago.

They will eventually figure out using b-trees for tables too.

tehlike · 13h ago
There's a lot more PG needs to do from a storage layout / access point of view, and they have been working on it for a while. OrioleDB has shown what might be possible, and they have been upstreaming it.

Having the ability to use something LSM-like as a storage engine would be great - and it could potentially allow better compression than what we currently get with TOAST, which is not a lot... PG doesn't even have out-of-the-box page compression...

ARandomerDude · 19h ago
Can you elaborate on this B-tree part of your comment? I know B-tree is the default index type in pg, but it sounds like there’s more to the story that I’m not familiar with.
greenavocado · 19h ago
PostgreSQL stores table data in heap files (unordered collections of pages/blocks), not B-trees. Indexes (including primary key indexes) use B-trees (specifically B+ trees). When you query a table via an index, the B-tree index points to locations in the heap file.

InnoDB uses a clustered index approach. The primary key index is a B-tree. The actual table data is stored in the leaf nodes of this B-tree. Secondary indexes point to the primary key.

One is not better than the other in general terms. InnoDB's clustered B-tree approach shines when:

- You frequently access data in primary key order

- Your workload has many range scans on the primary key

- You need predictable performance for primary key lookups

- Your data naturally has a meaningful ordering that matches your access patterns

PostgreSQL's heap approach excels when:

- You frequently update non-key columns (fewer page splits/reorganizations)

- You have many secondary indexes (they're smaller without primary keys)

- Your access patterns vary widely and don't follow one particular field

- You need faster table scans when indexes aren't applicable

I personally find PostgreSQL's approach more flexible for complex analytical workloads with unpredictable access patterns, while InnoDB's clustered approach feels more optimized for OLTP workloads with predictable key-based access patterns. The "better" system depends entirely on your specific workload, data characteristics, and access patterns.

Sesse__ · 16h ago
Indexes that point directly to the on-disk row location are also significantly faster to access; it is a persistent pain point for OLAP on InnoDB that all secondary indexes are indirect. You can work around it by adding additional columns to the index to make your lookups covering, but it's kludgy and imprecise and tends to bloat the index even further. (The flip side is that if you have tons of indexes and update some unrelated column, InnoDB doesn't need to update those indexes to point to the location of the new row. But I'm generally very rarely annoyed by that in comparison.)
foobahhhhh · 1h ago
Takes me back 20 years to using SQL Server! It let you choose clustered index or not IIRC.
farazbabar · 18h ago
Don't forget high-speed committed writes to append-only tables (the opposite of scans); Postgres's approach is better here as well.
saltcured · 18h ago
It's also deeply entwined with the MVCC concurrency control and the ability to do DDL in transactions, right?
jiggawatts · 13h ago
SQL Server supports every combination of heap storage, clustered storage, MVCC, and DDL in transactions.
cryptonector · 12h ago
SQLite3 only has b-trees, for both tables and indices.

PostgreSQL only has heaps for tables, and various other data structures for indices, with b-tree being the default.

There are many cases where having b-trees for tables would make performance better.

GP is basically poking fun at PG for still having nothing but heaps for tables.

shayonj · 19h ago
Very nicely written post! I'd love to start running these in production on NVMe, and I hope it's something major cloud providers start to offer ASAP. The performance gains are _extremely_ attractive.
p_ing · 19h ago
Is io_uring still plagued by security issues enabled by its use? Or have those largely been fixed? My understanding was that many Linux admins (or even distros by default?) were disabling io_uring.
hansvm · 19h ago
p_ing · 19h ago
Thanks. It looks like it is still going through growing pains.

https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring

https://www.theregister.com/2025/04/29/linux_io_uring_securi...

But most of the 'off by default' decisions are from ~2023 and not a current concern.

znpy · 15h ago
Disabling io_uring because “guy on the internet said so” or “$faang_company says so” is beyond dumb.

One should evaluate the risk according to their specific use case.

It can be a good idea to disable it if you run untrusted workloads (e.g. other people's containers sharing the same kernel), but if you have a kernel on a machine (virtual or real) dedicated to your own workload, you can pretty much keep using io_uring. There are other technologies to enforce security (e.g. SELinux and similar).
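
For those who do decide to turn it off, recent kernels (6.6+, if I remember right) expose a sysctl for exactly that:

    # /etc/sysctl.d/io_uring.conf
    # 0 = enabled, 1 = restricted to privileged processes, 2 = disabled
    kernel.io_uring_disabled = 2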

lambdaone · 1h ago
Most users don't know what io_uring is, nor would they have read any of those articles. They are not HN readers with a low-level understanding of how operating systems work; they don't even remotely have the knowledge of how any of these technologies function, nor do they have the time to get it. With very few exceptions, users have to rely entirely on software vendors and service providers to do this for them, and trust that they know what they're doing.
wtallis · 14h ago
I think at this point, those "other technologies to enforce security" are the main area of concern for io_uring users: if those other security layers don't know about io_uring they won't apply any restrictions to it.
p_ing · 11h ago
Yes, every security mechanism should be reviewed to validate it is applicable, i.e. I had a vulnerable version of node running but I wasn't using the particular aspect of node so it was a non-issue (and amazingly, IT sec agreed!).

But in the case of io_uring, it was outright bypassing other security layers. And while we all like to think we're running trusted services/code, we have to think about supply-chain attacks that may surprise us, or zero days, etc.

rdtsc · 5h ago
Oh sure, if you run your own server or sell your cloud as a service, you can enable it. But if you want to run on someone else’s setup, you have to play by their rules.

> Disabling io_uring because “guy on the internet said so” or “$faang_company says so” is beyond dumb.

I think it’s more like “$faang_company already disabled it in their hosts, so I am out of luck for my containers running on their cloud”

niux · 20h ago
I recently deployed Postgres on a dedicated Hetzner EX-44 server (20 cores, 64GB RAM, 2x 512GB NVMe SSDs in RAID 1) for €39/month. The price-to-performance ratio is exceptional, providing enterprise-level capacity at a fraction of typical cloud costs.

For security, I implemented Tailscale, which adds only ~5ms of latency while completely eliminating public network exposure, a worthwhile tradeoff for the significant security benefits.

My optimization approach includes:

- Workload-specific configuration generated via PGTune (https://pgtune.leopard.in.ua/)

- Real-time performance monitoring with PgHero for identifying bottlenecks

- Automated VACUUM ANALYZE operations scheduled via pg_cron targeting write-heavy tables, which prevents performance degradation and helps me sleep soundly

- A custom CLI utility I built for ZSTD-compressed backups that achieves impressive compression ratios while maintaining high throughput, with automatic S3 uploading: https://github.com/overflowy/pgbackup

This setup has been remarkably stable and performant, handling our workloads with substantial headroom for growth.

trollied · 20h ago
That’s great, but you need a solid HA/backup and recovery strategy if you even remotely care about your data.
deedubaya · 10h ago
He literally mentioned backups and not everyone needs HA?
natmaka · 7h ago
> Automated VACUUM ANALYZE operations scheduled via pg_cron targeting write-heavy tables

Why don't you set the autovacuum_analyze_scale_factor parameter (or autovacuum_analyze_threshold) per table, and then let autovacuum handle this?

codegeek · 19h ago
Would you be willing to open source the setup ? I would love to learn from it.
digdugdirk · 16h ago
Seconded. This corner of the programming world is a deep dark and scary place for people who haven't had solid industry experience. It'd be hugely helpful to have a barebones starting point to begin learning best practices.
dpacmittal · 12h ago
EX44 is 14 physical cores. Just wanted to point it out.
olavgg · 6h ago
And no ECC memory?
noir_lord · 1h ago
EX44s are Intel 13500s, so no, not ECC.

I have one; I use it for running an Arma 3 server. It's been bulletproof for reliability though; Hetzner have come a long way since the first time I used them years and years ago, when it was a mess.

I'm almost at the point where I'd use them for something that mattered (i.e. where money was involved).

Tostino · 19h ago
I would absolutely use another backup utility (additionally, if you want) if I were you (barman, pgbackrest, etc).

You are just wrapping pg_dump, which is not a full-featured backup solution. Great for a snapshot...

Use some of the existing tools and you get point-in-time recovery, easy restores to hot standbys for replication, a good failover story, backup rotations, etc.

arp242 · 19h ago
Snapshots are backups. Not sufficient backups for some cases, but perfectly fine for others.
niux · 19h ago
The reason I wrote my own tool is because I couldn't find anything for Pg17 at the time and pgbackrest seemed overkill for my needs. Also, the CLI handles backup rotations as well. Barman looks interesting though, I'll definitely have a look, thanks!
9dev · 14h ago
pgbackrest only looks scary because it’s so flexible, but the defaults work great in almost all cases. The most complex thing you’ll need to do is creating a storage bucket to write to, and configure the appropriate storage provider in pgbackrest's config file.

When it's set up properly, it's solid as a rock. I'd really recommend you check it out again; it likely solves everything you did more elegantly, and also covers a ton of things you didn't think of. Been there, done that :)

Tostino · 19h ago
pgbackrest was always easy to use in my experience. Not very hard to setup or configure, and low overhead. Supports spool directories for WAL shipping, compression, and block incremental backups (YAY!!!). I ran my last company on it for the last ~6 years I was there. Never any complaints, solid software (which is what you want for backups).

I have been using Barman indirectly through CloudNativePG at my latest company, but don't have the operational experience to speak on it yet.

lousken · 14h ago
Do you still need it? I haven't managed PG for a while, but shouldn't PG 17 have some solution for backups?
martinald · 19h ago
I sort of had to chuckle at the 20k IOPS AWS instance, given that even a $100-200 consumer NVMe gives ~1 million+ IOPS these days. I suspect that now we have PCIe 5.0 NVMes, this will go up further.

I always do wonder how much "arbitrary" cloud limits on things like this cause so many issues. I'm sure that async IO is very helpful anyway, but I bet on a 1million IOPS NVMe it is nowhere near as important.

We're effectively optimising critical infrastructure tech for ~2010 hardware, because that's when big cloud got going and there have been so few price reductions on these things since then vs the underlying hardware costs.

Obviously a consumer NVMe is not "enterprise" but my point is we are 3+ orders of magnitude off performance on cheap consumer hardware vs very expensive 'enterprise' AWS/big cloud costs.

lfittl · 19h ago
Yep, I find cloud storage performance to be quite frustrating, but it's the reality for many production database deployments I've seen.

It's worth noting that even on really fast local NVMe drives the new asynchronous I/O work delivers performance benefits, since it's so much more efficient at issuing I/Os and at reducing syscall overhead (for io_uring).

Andres Freund (one of the principal authors of the new functionality) did a lot of benchmarking on local NVMe drives during development. Here is one mailinglist thread I could find that shows a 2x and better benefit with the patch set at the time: https://www.postgresql.org/message-id/flat/uvrtrknj4kdytuboi...

gopalv · 19h ago
> had to chuckle at the 20k IOPS AWS instance, given even a consumer $100-200 NVMe gives ~1million+ IOPS these days

The IOPS figure usually hides the fact that it is not a single IOP that is really fast, but a collection of them.

More IOPS is generally best achieved by reducing the latency of a single operation, and average latency is what actually contributes to the "fast query" experience, because the next IO often branches off the result of the last one (like an index or filter lookup).

As more and more disk-to-CPU connectivity goes over the network, we can still deliver high IOPS even with very high latencies (by spreading the data across hundreds of SSDs and routing it fast), but with network storage we pay a huge latency cost for durability, simply because of location diversification.

Every foot is a nanosecond, approximately.

The tradeoff is worth it, because you don't need clusters to deal with a bad CPU or two - just stop and start the instance to fix memory/CPU errors.

The AWS model pushes the latency problem to the customer, and we see it in the IOPS measurements - but it is really latency x queue depth we're seeing, not the hardware capacity.
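
To make the latency x queue depth point concrete, a back-of-the-envelope with Little's law (illustrative numbers, not measurements):

    IOPS ≈ queue depth / per-I/O latency
    networked volume:  32 / 1.6 ms ≈    20,000 IOPS
    local NVMe:        32 / 20 µs  ≈ 1,600,000 IOPS

Same queue depth on both; the 80x gap is entirely the latency term.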

codegeek · 19h ago
You probably already know this but I will say it anyway. These cloud services like AWS are not succeeding in enterprise because they have outdated hardware. They succeed because in enterprise, CIOs and CTOs want something that is known, has a brand and everyone else uses it. It's like the old adage of "No one got fired for using IBM". Now it is "No one gets fired for hosting with AWS no matter how ridiculous the cost and corresponding feature is".
the8472 · 19h ago
> No one gets fired for hosting with AWS

But consider the counterfactual: customers never won because AWS Certified Solutions Architect(tm) software couldn't deliver the price/perf they would have needed.

At $work this is a very real problem, because a software system was built on API Gateway, Lambdas, SQS, and a whole bunch of other moving pieces (serverless! scalable! easy compliance!) that, combined, resulted in way too much latency to meet a client's goal.

9dev · 14h ago
> No one gets fired for hosting with AWS no matter how ridiculous the cost and corresponding feature is

Actually, AWS is so expensive that hosting everything we ran on Hetzner there would have simply depleted our funding, and the company would not exist anymore.

Hilift · 18h ago
Everything in the cloud is throttled. Network, IOPS, CPU. And probably implemented incorrectly. AWS makes billions whether the customer infrastructure is great or terrible. I found that anything smaller than an AWS EC2 m5.8xlarge had noticeably bad performance on loaded servers (Windows). The list price for that would be about $13k per year, but most organizations get lower than list prices.

This also applies to services, not only compute. With anything associated with Microsoft Office 365 Exchange, scripts may run 10x slower against the cloud using the MSOnline cmdlets. It's absolute insanity: I used to perform a dump of all mailbox statistics that would take about one hour; it could take almost 24 hours against Office 365. You have to be careful not to use the same app or service account in multiple places, because the throttle limits are per-account.

__s · 19h ago
Even worse on Azure, where we had to ask customers to scale up vCPUs to increase IOPS.

https://azure.microsoft.com/en-us/pricing/details/managed-di...

Increasing vCPUs also opened up more disk slots, to try to improve the situation with disk striping.

maherbeg · 19h ago
Instance store on AWS can give up to 3.3M IOPS https://aws.amazon.com/blogs/aws/now-available-i3-instances-... - the main problem is just the use of networked storage.
the8472 · 18h ago
The NVMe on other instance types is quite throttled. E.g. on a g5.4xlarge instance, EBS is limited to 593MB/s and 20,000 IOPS, while instance-attached NVMe is limited to 512MB/s (read) at 125,000 IOPS - a fraction of the IO a workstation or gaming PC with a similar GPU and RAM would have. And stopping the instance wipes it, which means you can't do instance warmup with those; everything must be populated at boot.
coder543 · 19h ago
That quoted IOPS number is only with an 8-disk stripe, which requires the full instance even if you don't need 488GB of RAM or a $3600/mo box, I believe.

The per-disk performance is still nothing to write home about, and 8 actually fast disks would blow this instance type out of the water.

perching_aix · 19h ago
Instance store is also immediately wiped when the instance is halted/restarted, which can theoretically happen at any time, for example through a mystery instance failure, or a patching tool that's helpfully restarting your boxes during off-hours.
slashdev · 19h ago
My understanding is this is not true - it's only wiped when the instance permanently fails and is moved.
the8472 · 18h ago
slashdev · 15h ago
Yeah, so a restart does not wipe it.

Which means you can count on it about as much as a server of your own that you couldn't repair.

I know a database company that uses instance storage as the primary storage. It’s common.

perching_aix · 4h ago
Wasn't aware, interesting. I did consider it in the past as well, but the reliability aspect made me consider this as a moonshot rather than anything practical. Kind of weirdly validating to know there are (supposedly) database providers using it.

That said, a good bit of our environments are scheduled, so it still wouldn't be an option there without hacks (e.g. doing a compressed blockwise dump before shutting down and then a blockwise flash on startup).

anonymars · 19h ago
> given even a consumer $100-200 NVMe gives ~1million+ IOPS these days

In the face of sustained writes? For how long?

merb · 19h ago
Sustained reads would not even give 1 million IOPS in that case. Maybe when you only read the same file that fits into the NVMe cache - which probably never happens in a production database...
jauntywundrkind · 16h ago
I think you'd be surprised. Sustained write performance has gotten pretty good. Decent but not fancy consumer drives will often do 1GB/s sustained for bulkier writes. That's much better than we used to expect: flash has gotten much better with so many layers! This mid-range PCIe 5.0 drive sustains a nice 1.5GB/s: https://www.techpowerup.com/review/team-group-ge-pro-2-tb/6....

I don't think sustained reads are a problem? Benchmarks like CrystalDiskMark do a full-disk random read test; they're designed to bust through cache afaik. 7.2GB/s of 4k reads would translate to 1.8M IOPS. Even if this is massively optimistic, you need to slash a lot of zeroes/orders of magnitude to get down to 20k IOPS, which you will also pay >$100/mo for.

p_ing · 19h ago
The Samsung 9910 has LPDDR4X cache at a ratio of 1GB per 1TB of flash. I won't pretend to understand the magic NVMe drives possess, but if you got a 4TB or 8TB 9910, could you not in theory pull all of the data you require into cache?

I would assume, and it might be a poor assumption, that NVMe controllers don't pull in files but rather blocks, so even if you had a database that exceeded the cache size, in theory if the active blocks of that database did not exceed the cache size, it could be "indefinitely" cached for a read-only pattern.

wtallis · 16h ago
The DRAM on a SSD like that isn't for caching user data, it's for caching the drive's metadata about which logical blocks (as seen by the OS) correspond to which physical locations in the flash memory.
yxhuvud · 16h ago
FWIW, using the same approach as in the article, i.e. io_uring, is one of the few ways to actually get anywhere close to that 1 million, so it is not as if they are competing concerns.
immibis · 17h ago
I noticed this with bandwidth. AWS price for bandwidth: $90.00/TB after 0.1TB/month. Price everywhere else (low cost VPSes): $1.50/TB after 1-5TB/month. Price some places (dedicated servers): $0.00/TB up to ~100TB/month, $1.50/TB after.

You pay 60 times the price for the privilege of being on AWS.

Bandwidth is just their most egregious price difference. The servers are more expensive too. The storage is more expensive (except for Glacier). The serverless platforms are mostly more expensive than using a cheap server.

There are only two AWS products that I understand to have good prices: S3 Glacier (and only if you never restore!), and serverless apps (Lambda / API Gateway) if your traffic is low enough to fit in the Always Free tier. For everything else, it appears you get ripped off by using AWS.

binary132 · 19h ago
Meanwhile people are running things on raspberry pi home clusters thinking they’re winning
brulard · 19h ago
Maybe they are. With an NVMe hat, you get decent IO performance.
adgjlsfhk1 · 14h ago
It's still moderately bad. The Raspberry Pi is limited to 2 Gen3 PCIe lanes, which is ~4-8x slower than the drive (and you will likely be further limited by CPU speed).
song · 18h ago
Are there good performance comparisons between Postgres, MariaDB, and Percona? I'm really curious at this point about which cases each of those databases shines in.
KronisLV · 15h ago
Probably depends on the particular workload, but there are at least some attempts at benchmarking vaguely typical workloads: https://datasystemreviews.com/postgresql-vs-mariadb-performa...
pseudopersonal · 19h ago
Does anyone know when the update allowing more concurrent connections is dropping, so we can stop using pgbouncer?
hoherd · 12m ago
I recently read a great article exploring what would change if they were to switch from processes to threads for each connection. Running a connection pooler didn't seem so bad to me after reading it. https://medium.com/@tusharmalhotra_81114/why-postgresql-choo...
__s · 19h ago
That'll likely need a conversion away from process-per-connection, so not any time soon.
cryptonector · 12h ago
I've been following the `AIO 2.5` thread on the PG mailing lists, and I've been salivating at the thought of this being released.
Tostino · 19h ago
Thank you for the effort that went into getting this committed. I remember seeing the first discussions about async I/O (and using io_uring) like 6 or 7 years ago. Amazing amount of work to get the design right.

Looking forward to the other places that async I/O can be used in future Postgres releases now that the groundwork is done.

dbbk · 14h ago
This looks promising! Wonder if it's coming to Neon
WhyNotHugo · 14h ago
It's pretty disappointing that simply using O_NONBLOCK doesn't work as expected on regular files. It would be such a simple and portable mechanism to do async I/O using the same interfaces that we already use for networking.
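
A minimal illustration of the surprise, assuming Linux semantics ("datafile" is a placeholder): the flag is accepted without error on a regular file, but the read still blocks on disk instead of returning EAGAIN the way it would for a socket or pipe.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_NONBLOCK is silently accepted on a regular file... */
        int fd = open("datafile", O_RDONLY | O_NONBLOCK);
        char buf[4096];

        /* ...but this still blocks on disk I/O if the data is not in
         * the page cache, rather than failing with EAGAIN. */
        ssize_t n = read(fd, buf, sizeof(buf));
        printf("read returned %zd\n", n);
        return 0;
    }
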
gitroom · 14h ago
Insane how long it took Postgres to get async I/O right - feels like all the big changes spark a million little tradeoffs, right? You think stuff like io_uring is finally gonna push Postgres to catch up with the clouds?