On Linux there is also preadv2(..., RWF_NOWAIT), which can be used to do an optimistic non-blocking read from the page cache. That might be useful for io_method = worker to shave off a bit of latency: try reading on the main thread with NOWAIT and only offload to a worker thread when that fails.
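A minimal C sketch of that optimistic path; offload_to_worker() is a hypothetical stand-in for whatever queueing mechanism hands the request to an I/O worker:

```c
/* Optimistic page-cache read: preadv2() with RWF_NOWAIT fails with
 * EAGAIN instead of blocking when the data isn't cached (Linux >= 4.14
 * for buffered reads). */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <errno.h>

/* Hypothetical: hand the read off to a worker thread/process. */
extern ssize_t offload_to_worker(int fd, void *buf, size_t len, off_t off);

ssize_t read_optimistic(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n >= 0)
        return n;                                    /* served from page cache */
    if (errno == EAGAIN || errno == EOPNOTSUPP)
        return offload_to_worker(fd, buf, len, off); /* not cached: offload */
    return -1;                                       /* real I/O error */
}
```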
anarazel · 13h ago
FWIW, I played with that - unfortunately it seems that the overhead of doing the page cache lookup twice is a cure worse than the disease.
Note that we do not offload IO to workers when doing I/O that the caller will synchronously wait for, just when the caller actually can do IO asynchronously. That reduces the need to avoid the offload cost.
It turns out, as some of the results in Lukas' post show, that the offload to the worker is often actually beneficial, particularly when the data is in the kernel page cache - it parallelizes the memory copy from kernel to userspace and postgres' checksum computation. Particularly on Intel server CPUs, which have had pretty mediocre per-core memory bandwidth over the last ~decade, memory bandwidth turns out to be a bottleneck for page cache access and checksum computation.
Edit: Fix negation
the8472 · 12h ago
Ah yeah, getting good kernel<>userspace oneshot memcpy performance for large files is surprisingly hard. mmap has setup/teardown overhead that's significant for oneshot transfers, and regular read/write calls suffer from page-cache/per-page overhead.
Hopefully all the large folio work in the kernel will help with that.
anarazel · 12h ago
From what I've seen a surprisingly large part of the overhead is due to SMAP when doing larger reads from the page cache - i.e. if I boot with clearcpuid=smap (not for prod use!), larger reads go significantly faster. On both Intel and AMD CPUs interestingly.
On Intel it's also not hard to simply reach the per-core memory bandwidth with modern storage HW. This matters most prominently for writes by the checkpointing process, which needs to compute data checksums given the current postgres implementation (if enabled). But even for reads it can be a bottleneck, e.g. when prewarming the buffer pool after a restart.
derefr · 6h ago
> if I boot with clearcpuid=smap (not for prod use!), larger reads go significantly faster. On both Intel and AMD CPUs interestingly.
Is there a page anywhere that collects these sorts of "turn the whole hardware security layer off" switches that can be flipped to get better throughput out of modern x86 CPUs, when your system has no real attack surface to speak of (e.g. air-gapped single-tenant HPC)?
amluto · 6h ago
SMAP overhead should be roughly constant, and I’d be quite surprised if it’s noticeable for large reads. Small reads are a different story.
anarazel · 5h ago
It turns out to be the other way round, curiously. The bigger the reads (i.e. how much to read in one syscall) and the bigger the target area of the reads (how long before a target memory location is reused), the bigger the overhead of SMAP gets.
If interested, I can dig up the reproducer I had at some point.
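Not the original reproducer, but a minimal sketch of that kind of test: time large buffered reads of an already-cached file, then compare boots with and without clearcpuid=smap (the file path and chunk size are just illustrative arguments):

```c
/* Times sequential read()s of a (page-cache-warm) file with a given
 * chunk size; run on boots with and without clearcpuid=smap to see
 * the effect discussed above. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file [chunk_bytes]\n", argv[0]);
        return 1;
    }
    size_t chunk = argc > 2 ? strtoul(argv[2], NULL, 0) : 1 << 20;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(chunk);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0)   /* kernel->user memcpy path */
        total += n;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lld bytes in %.3fs (%.1f MB/s), chunk=%zu\n",
           total, secs, total / secs / 1e6, chunk);
    return 0;
}
```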
That doesn't speed up userspace<>kernel memcopy, it just reduces cache churn. Despite its name it still goes through the page cache; it just triggers writeback and drops the pages once that's done. For example, when copying to a tmpfs it makes zero difference, since that lives entirely in memory.
senderista · 10h ago
So you're less dependent on the page replacement algorithm being scan-resistant, since you can use this flag for scan/loop workloads, right?
gmokki · 1h ago
I would initially add it for WAL writes and reads. There should never be another read in normal operation.
gavinray · 12h ago
Do you think there's a possibility of Direct IO being adopted at some point in the future now that AIO is available?
anarazel · 12h ago
> Do you think there's a possibility of Direct IO being adopted at some point in the future now that AIO is available?
Explicitly a goal.
You can turn it on today, with a bunch of caveats (via debug_io_direct=data). If you have the right workload - e.g. read-only with lots of seqscans, bitmap index scans, etc. - you can see rather substantial perf gains. But it'll suck in many cases in 18.
We need at least:
- AIO writes in checkpointer, bgwriter and backend buffer replacement (think bulk loading data with COPY)
- readahead support in a few more places, most crucially index range scans (works out ok today if the heap is correlated with the index, sucks badly otherwise)
EDIT: Formatting
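For reference, a hedged sketch of turning on the direct I/O option mentioned above (both GUCs require a server restart to take effect):

```sql
-- Developer option with the caveats described above; data files only.
ALTER SYSTEM SET debug_io_direct = 'data';
-- io_method can be sync/worker/io_uring; io_uring needs Linux + liburing.
ALTER SYSTEM SET io_method = 'io_uring';
```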
Aside from a few problems in specific Linux kernel versions, it works great.
I know Windows has IOCP and also now an IORing implementation of its own (less familiar with macOS capabilities other than POSIX AIO).
https://learn.microsoft.com/en-us/windows/win32/api/ioringap...
Update: Most of the comments below seem to be missing the fact that Windows now also has an IORing implementation, as I mentioned above. Comparison article here:
https://windows-internals.com/ioring-vs-io_uring-a-compariso...
I am not a Windows guy but I (with help) managed to get IOCP working for this in a basic prototype. Will share publicly soon. I also sketched out an IoRing version (if you are interested in helping debug and flesh that out let me know!).
Main learnings: the IOCP version can't do asynchronous flush! Which we want. The IoRing version can! But it can't do scatter/gather AKA vector I/O yet! Which is an essential feature for buffer pool implementation. So actually I am basically waiting for IoRing to add support for that before taking it too seriously (I can see they are working on it because the ops are present in an enum, it's just that the build functions are missing).
So my guess is that in a year or so we should be able to run all PostgreSQL disk I/O through IoRing on Windows. Maybe?
Another complication is that it really wants to be multithreaded (consuming completions for IOs started in another process requires a lot of hoop jumping; I made it work, but...). This will resolve itself naturally with ongoing work to make PostgreSQL multithreaded.
The next problem is that IoRing doesn't support sockets! So in future work on async networking (early prototypes exist) we will likely also need IOCP for that part.
lfittl · 14h ago
It depends on the I/O method - as described in the article, "io_uring" is only available on Linux (and requires building with liburing, as well as io_uring being enabled in the kernel), but the default (as of beta1) is actually "worker", which works on any operating system.
The "worker" method uses a dedicated pool of I/O worker processes that run in the background, and whilst not as performant as io_uring in our benchmark, did clearly outperform the "sync" method (which is the same as what Postgres currently has in 17 and older).
nu11ptr · 13h ago
> "io_uring" is only available on Linux
Windows now also has IORing (see my comment above)
stingraycharles · 14h ago
Sounds like this feature is based on io_uring which is a Linux feature. I would be surprised if they implemented async io on Windows before they would on Linux given the user/deployment base being very Linux-heavy.
p_ing · 13h ago
Yeah, surprise: Linux had to play catch-up to a Windows 1994 release! Same with the scheduler. I'd argue Windows does OOM better than Linux today...
Windows even had the concept of io_uring before, but network only with Registered I/O back in the Windows 8 (8.1?) days.
Linux still lacks the "all I/O is async" NT has.
The underlying kernel and executive of Windows aren't primitive pieces of trash. They're quite advanced, ruined by PMs and the Ads division throwing crap on top.
And yes, Windows' I/O Ring is a near 1:1 copy of the Linux implementation, but IOCP/OVERLAPPED I/O data structure preceded it since NT's first release.
This isn't a pissing match, just we all hope that kernel devs learn from each other and take the best ideas. Sometimes we, IT, don't get to choose the OS we run -- it's dictated by the apps the business requires.
greenavocado · 13h ago
How difficult would it be to completely tear out the Windows desktop experience and just use the system and display drivers without the rest? Has anybody attempted such a feat?
p_ing · 13h ago
There is Windows Server Core which removes everything but a CLI, but you still have the normal login experience, you still have a "desktop" (no start menu, taskbar, etc), you can still launch normal Win32 apps... for the most part (task manager, notepad, and so on).
Win32 is also responsible for core Services, which means you can't de-Windows-ify Windows and strip it down to an NT API-only. All other personalities (OS/2, POSIX, SFU) have a dependency on Win32, as well.
You're still running the WindowServer of course; it's part of the Executive.
That said, with a bunch of modifications, NTDEV did get Windows 11 down to its bare minimum, and text only to boot. So I guess it's technically possible, though not useful.
https://www.youtube.com/watch?v=SL6t_iuitxM
> There is Windows Server Core which removes everything but a CLI, but you still have the normal login experience, you still have a "desktop" (no start menu, taskbar, etc), you can still launch normal Win32 apps... for the most part (task manager, notepad, and so on).
Yep, they've replaced nearly every UI with text (the login window is a TUI), though there's still some shell DLLs and the whole thing still uses a window manager. That's honestly for the best, since it allows you to migrate full installations with some UI-based apps to a Core installation with them intact.
> That said, with a bunch of modifications, NTDEV did get Windows 11 down to it's bare minimum, and text only to boot. So I guess it's technically possible, though not useful.
Windows has had a "text mode" since at least Windows XP IIRC, but it's really not that useful, if at all. Even for rescue operations you're better off with Windows PE.
Free-as-in-beer Hyper-V Server 2019 is in extended support (only security updates) until 2029.
dboreham · 13h ago
There are several such things. The Windows installer uses one. Xbox uses another.
cyberax · 12h ago
> Same with the scheduler
Windows does OOM far better than Linux because it doesn't really overcommit RAM.
But the CPU _scheduler_ in Linux is far, far, far better than in Windows. Linux can even do hard-realtime, after all.
nu11ptr · 13h ago
> this feature is based on io_uring which is a Linux feature
And now also a Windows feature, see my comment above for info
PaulHoule · 14h ago
For a long time there have been APIs on the books for asynchronous file I/O on Linux, but they weren't worth using because they didn't really speed anything up.
anarazel · 13h ago
They sped things up for a long time - but only when using unbuffered IO. The new thing with io_uring is that it also accelerates buffered IO. In the initial version it was all through kernel worker threads, but these days several filesystems have better paths for common cases.
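To make the API concrete, a minimal liburing sketch of a single buffered read - the case the comment above says io_uring newly accelerates ("datafile" is a hypothetical path; build with -luring, error handling trimmed):

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);          /* ring with 8 entries */

    int fd = open("datafile", O_RDONLY);       /* hypothetical file */
    static char buf[8192];

    /* Queue one buffered read and submit it to the kernel. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    /* Block until the completion arrives; res is bytes read or -errno. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```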
immibis · 13h ago
IIRC they literally just did the sync I/O on a worker thread.
spwa4 · 13h ago
Yes, although Windows has had async I/O since Windows NT 3.1, their API is still not supported by Postgres.
anarazel · 13h ago
FWIW, there are prototype patches for an IOCP based io_method. We just couldn't get them into an acceptable state for PG 18. I barely survived getting in what we did...
nu11ptr · 13h ago
Yes, that was IOCP, however, Windows now also has IORing (see my comment above)
kev009 · 10h ago
A lot of work has gone into FreeBSD's aio(4) so it will be interesting to see how that works, because it doesn't have the drawbacks of Linux/glibc aio.
macdice · 3h ago
BTW I have patches for PostgreSQL AIO on FreeBSD, which I will propose for v19. It works pretty well! I was trying to keep out of Andres's way for the core architectural stuff and basic features ie didn't want to overload the pipes with confusing new topics for v18 :-)
tiffanyh · 9h ago
Would you mind expanding on this topic?
Is FreeBSD doing anything significantly different and/or better?
kev009 · 7h ago
Sure, two separate things really.
Linux aio (the POSIX one, not the broader concept that now includes io_uring) has been fundamentally bad. Part of it is some of the implementation is delegated to user space in glibc as worker threads. https://lwn.net/Articles/671649/ surveys the issues. I have not done a deep dive into this topic in a long time but as far as I know the situation never greatly improved. io_uring does not suffer from the problems, although it is a new and non-standard API with associated pros and cons.
Thomas Munro has gone into some of the benefits and also gaps of FreeBSD's aio(4) vs io_uring here https://wiki.postgresql.org/wiki/FreeBSD/AIO. Notably, because the implementation is in kernel and has received continuous improvement it is gradually removing downsides and there are several users that need it to work well. This document undersells the problems of ZFS though: the ARC is a necromanced buffer cache which Sun ironically worked very hard to remove (http://mcvoy.com/lm/papers/SunOS.vm_arch.pdf) and nobody has fixed this in 20 years. But for UFS or raw block devs or vendor file systems that doesn't matter.
FreeBSD being a complete "src" tree yields some advantages. Concretely, there are some in-tree consumers, like ctld (a CAM/iSCSI server), that have been used as vehicles to provide end-to-end implementations of things like NIC offload of complex protocols that play well with backend block devices, such that you can make an offloaded data path that is pumped via asynchronous completion events on both sides (network and block). A related whitepaper: https://www.chelsio.com/wp-content/uploads/resources/T6-100G... but this concept can be extended to iSCSI, NVMeOF etc. It seems NVMeOF work sponsored by Chelsio is yielding a lot of the finishing touches https://papers.freebsd.org/2023/eurobsdcon/baldwin-implement.... I believe my colleagues are also improving aio to further optimize the Netflix case of KTLS NIC with data on disk, but I am not the right person to elaborate on that.
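For contrast with io_uring, a minimal sketch of the POSIX aio interface criticized above - on glibc these calls are serviced by user-space worker threads ("datafile" is a hypothetical path; link with -lrt on glibc):

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    static char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = open("datafile", O_RDONLY);  /* hypothetical file */
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    aio_read(&cb);                      /* queue the read */
    while (aio_error(&cb) == EINPROGRESS)
        ;                               /* real code would do useful work here */
    printf("read %zd bytes\n", aio_return(&cb));
    return 0;
}
```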
skeptrune · 14h ago
How close is this to the way MySQL does it with InnoDB? It appears to be about the same.
seunosewa · 14h ago
Yep. It's a low hanging fruit they should've picked years ago.
They will eventually figure out using b-trees for tables too.
tehlike · 8h ago
There's a lot more PG needs to do from a storage layout / access standpoint, and they have been working on it for a while. OrioleDB has shown what might be possible, and they have been upstreaming it.
Having the ability to use something LSM-like as a storage engine would be great - and potentially allow better compression than what we currently get with TOAST, which is not a lot... PG doesn't even have out-of-the-box page compression...
ARandomerDude · 13h ago
Can you elaborate on this B-tree part of your comment? I know B-tree is the default index type in pg, but it sounds like there’s more to the story that I’m not familiar with.
greenavocado · 13h ago
PostgreSQL uses heap files for primary table storage, not B-trees: table data is stored in heap files (unordered collections of pages/blocks), while indexes (including primary key indexes) use B-trees (specifically B+ trees). When you query a table via an index, the B-tree index points to locations in the heap file.
InnoDB uses a clustered index approach. The primary key index is a B-tree. The actual table data is stored in the leaf nodes of this B-tree. Secondary indexes point to the primary key.
One is not better than the other in general terms. InnoDB's clustered B-tree approach shines when:
- You frequently access data in primary key order
- Your workload has many range scans on the primary key
- You need predictable performance for primary key lookups
- Your data naturally has a meaningful ordering that matches your access patterns
PostgreSQL's heap approach excels when:
- You frequently update non-key columns (fewer page splits/reorganizations)
- You have many secondary indexes (they're smaller without primary keys)
- Your access patterns vary widely and don't follow one particular field
- You need faster table scans when indexes aren't applicable
I personally find PostgreSQL's approach more flexible for complex analytical workloads with unpredictable access patterns, while InnoDB's clustered approach feels more optimized for OLTP workloads with predictable key-based access patterns. The "better" system depends entirely on your specific workload, data characteristics, and access patterns.
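Worth noting: PostgreSQL's closest approximation to a clustered table is the one-shot CLUSTER command (table and index names below are hypothetical):

```sql
-- Physically rewrite the heap in primary-key order. Unlike InnoDB's
-- clustered index, the ordering is NOT maintained by later writes.
CLUSTER orders USING orders_pkey;
```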
Sesse__ · 11h ago
Indexes that point directly to the disk row are also significantly faster to access; it is a persistent pain point for OLAP on InnoDB that all secondary indexes are indirect. You can work around it by adding additional columns to the index to make your lookups covering, but it's kludgy and imprecise and tends to bloat the index even further. (The flip side is that if you have tons of indexes and update some unrelated column, InnoDB doesn't need to update those indexes to point to the location of the new row. But I'm generally very rarely annoyed by that in comparison.)
farazbabar · 12h ago
Don't forget high-speed committed writes to append-only tables (the opposite of scans); Postgres's approach is better here as well.
saltcured · 12h ago
It's also deeply entwined with the MVCC concurrency control and the ability to do DDL in transactions, right?
jiggawatts · 8h ago
SQL Server supports every combination of heap storage, clustered storage, MVCC, and DDL in transactions.
cryptonector · 6h ago
SQLite3 only has b-trees, for both tables and indices.
PostgreSQL only has heaps for tables, and various other data structures for indices, with b-tree being the default for indices.
There are many cases where having b-trees for tables would make performance better.
GP is basically poking fun at PG for still having nothing but heaps for tables.
p_ing · 13h ago
Is io_uring still plagued by security issues enabled by its use? Or have those largely been fixed? My understanding was many Linux admins (or even distros by default?) were disabling io_uring.
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring
https://www.theregister.com/2025/04/29/linux_io_uring_securi...
But most of the 'off by default' are from ~2023 and not a current concern.
znpy · 9h ago
Disabling io_uring because “guy on the internet said so” or “$faang_company says so” is beyond dumb.
One should evaluate the risk according to their specific use case.
It can be a good idea to disable it if you run untrusted workloads (eg: other people's containers sharing the same kernel), but if you have a kernel on a machine (virtual or real) dedicated to your own workload you can pretty much keep using io_uring. There are other technologies to enforce security (eg: SELinux and similar).
wtallis · 9h ago
I think at this point, those "other technologies to enforce security" are the main area of concern for io_uring users: if those other security layers don't know about io_uring they won't apply any restrictions to it.
p_ing · 5h ago
Yes, every security mechanism should be reviewed to validate it is applicable, i.e. I had a vulnerable version of node running but I wasn't using the particular aspect of node so it was a non-issue (and amazingly, IT sec agreed!).
But in the case of io_uring, it was outright bypassing other security layers. And while we all like to think we're running trusted services/code, we have to think about supply-chain attacks that may surprise us, or zero days, etc.
shayonj · 14h ago
Very nicely written post! I'd love to start running these in production on NVMe and hope it's something major cloud providers start to offer ASAP. The performance gains are _extremely_ attractive
niux · 14h ago
I recently deployed Postgres on a dedicated Hetzner EX-44 server (20 cores, 64GB RAM, 2x 512GB NVMe SSDs in RAID 1) for €39/month. The price-to-performance ratio is exceptional, providing enterprise-level capacity at a fraction of typical cloud costs.
For security, I implemented Tailscale, which adds only ~5ms of latency while completely eliminating public network exposure - a worthwhile tradeoff for the significant security benefits.
My optimization approach includes:
- Workload-specific configuration generated via PGTune (https://pgtune.leopard.in.ua/)
- Real-time performance monitoring with PgHero for identifying bottlenecks
- Automated VACUUM ANALYZE operations scheduled via pg_cron targeting write-heavy tables, which prevents performance degradation and helps me sleep soundly
- A custom CLI utility I built for ZSTD-compressed backups that achieves impressive compression ratios while maintaining high throughput, with automatic S3 uploading: https://github.com/overflowy/pgbackup
This setup has been remarkably stable and performant, handling our workloads with substantial headroom for growth.
Why don't you set (per table) the autovacuum_analyze_scale_factor parameter (or autovacuum_analyze_threshold) then let AUTOVACUUM handle this?
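For reference, a hedged sketch of that per-table tuning (the table name and values are illustrative only):

```sql
-- Let autovacuum trigger ANALYZE more aggressively on a hot table,
-- instead of scheduling manual VACUUM ANALYZE runs via pg_cron.
ALTER TABLE busy_table
  SET (autovacuum_analyze_scale_factor = 0.02,
       autovacuum_analyze_threshold   = 1000);
```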
trollied · 14h ago
That’s great, but you need a solid HA/backup and recovery strategy if you even remotely care about your data.
deedubaya · 4h ago
He literally mentioned backups and not everyone needs HA?
codegeek · 13h ago
Would you be willing to open source the setup ? I would love to learn from it.
digdugdirk · 11h ago
Seconded. This corner of the programming world is a deep dark and scary place for people who haven't had solid industry experience. It'd be hugely helpful to have a barebones starting point to begin learning best practices.
dpacmittal · 7h ago
EX44 is 14 physical cores. Just wanted to point it out.
olavgg · 16m ago
And no ECC memory?
Tostino · 14h ago
I would absolutely use another backup utility (additionally if you want) if I were you (barman, pgbackrest, etc).
You are just wrapping pg_dump, which is not a full-featured backup solution. Great for a snapshot...
Use some of the existing tools and you get point-in-time recovery, easy restores to hot standbys for replication, a good failover story, backup rotations, etc.
arp242 · 13h ago
Snapshots are backups. Not sufficient backups for some cases, but perfectly fine for others.
niux · 14h ago
The reason I wrote my own tool is because I couldn't find anything for Pg17 at the time and pgbackrest seemed overkill for my needs. Also, the CLI handles backup rotations as well. Barman looks interesting though, I'll definitely have a look, thanks!
9dev · 9h ago
pgbackrest only looks scary because it’s so flexible, but the defaults work great in almost all cases. The most complex thing you’ll need to do is creating a storage bucket to write to, and configure the appropriate storage provider in pgbackrest's config file.
When it’s set up properly, it’s solid as rock. I’d really recommend you check it out again; it likely solves everything you did more elegantly, and also covers a ton of things you didn’t think of. Been there, done that :)
Tostino · 13h ago
pgbackrest was always easy to use in my experience. Not very hard to setup or configure, and low overhead. Supports spool directories for WAL shipping, compression, and block incremental backups (YAY!!!). I ran my last company on it for the last ~6 years I was there. Never any complaints, solid software (which is what you want for backups).
I have been using barman indirectly through CloudNativePG with my latest company, but don't have the operational experience to speak on it yet.
lousken · 8h ago
Do you still need it? I haven't managed PG for a while, but shouldn't PG 17 have some solution for backups?
martinald · 14h ago
I sort of had to chuckle at the 20k IOPS AWS instance, given even a consumer $100-200 NVMe gives ~1 million+ IOPS these days. I suspect now that we have PCIe 5.0 NVMes this will go up too.
I always do wonder how much "arbitrary" cloud limits on things like this cause so many issues. I'm sure that async IO is very helpful anyway, but I bet on a 1 million IOPS NVMe it is nowhere near as important.
We're effectively optimising critical infrastructure tech for ~2010 hardware, because that's when big cloud got going and there have been so few price reductions on things since then vs the underlying hardware costs.
Obviously a consumer NVMe is not "enterprise" but my point is we are 3+ orders of magnitude off performance on cheap consumer hardware vs very expensive 'enterprise' AWS/big cloud costs.
lfittl · 14h ago
Yep, I find cloud storage performance to be quite frustrating, but it's the reality for many production database deployments I've seen.
It's worth noting that even on really fast local NVMe drives the new asynchronous I/O work delivers performance benefits, since it's so much more efficient at issuing I/Os and reducing syscall overhead (for io_uring).
Andres Freund (one of the principal authors of the new functionality) did a lot of benchmarking on local NVMe drives during development. Here is one mailinglist thread I could find that shows a 2x and better benefit with the patch set at the time: https://www.postgresql.org/message-id/flat/uvrtrknj4kdytuboi...
gopalv · 13h ago
> had to chuckle at the 20k IOPS AWS instance, given even a consumer $100-200 NVMe gives ~1million+ IOPS these days
The IOPS figure usually hides the fact that it is not a single IOP that is really fast, but a collection of them.
Higher IOPS is generally achieved by reducing the latency of a single operation, and average latency is what actually contributes to the "fast query" experience, because the next IO often branches off the result of the last one (like an index or filter lookup).
As more and more disk-to-CPU connectivity goes over the network, we can still deliver large IOPS numbers even with very high latencies (by spreading the data across hundreds of SSDs and routing it fast), because with network storage we pay a huge latency cost for durability of the data, simply because of location diversification.
Every foot is a nanosecond, approximately.
The tradeoff is worth it because you don't need clusters to deal with a bad CPU or two - stop & start to fix memory/cpu errors.
The AWS model pushes the latency problem to the customer and we see it in the IOPS measurements, but it is really the latency x queue depth we're seeing not the hardware capacity.
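A rough way to see the "latency x queue depth" point via Little's law, with illustrative numbers (not from the post):

```latex
\text{IOPS} \approx \frac{\text{queue depth}}{\text{mean latency}},
\qquad
\frac{20}{1\,\text{ms}} = 20{,}000\ \text{IOPS},
\qquad
\frac{1}{20\,\mu\text{s}} = 50{,}000\ \text{IOPS}
```

At ~1 ms network-storage latency you need ~20 requests in flight just to reach 20k IOPS, while a local NVMe at ~20 µs gets to 50k at queue depth 1 - which is why keeping queues full with async I/O matters so much more on cloud storage.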
codegeek · 13h ago
You probably already know this but I will say it anyway. These cloud services like AWS are not succeeding in enterprise because they have outdated hardware. They succeed because in enterprise, CIOs and CTOs want something that is known, has a brand and everyone else uses it. It's like the old adage of "No one got fired for using IBM". Now it is "No one gets fired for hosting with AWS no matter how ridiculous the cost and corresponding feature is".
the8472 · 13h ago
> No one gets fired for hosting with AWS
But consider the counterfactual: Non-realized customers because AWS certified solutions architect(tm) software couldn't deliver the price/perf they would have needed.
At $work this is a very real problem because a software system was built on api gateway, lambdas, sqs and a whole bunch of other moving pieces (serverless! scalable! easy compliance!) that combined resulted in way too much latency to meet a client's goal.
9dev · 9h ago
> No one gets fired for hosting with AWS no matter how ridiculous the cost and corresponding feature is
Actually, AWS is so expensive, hosting everything we ran on Hetzner there would have simply depleted our funding, and the company would not exist anymore.
Hilift · 12h ago
Everything in the cloud is throttled. Network, IOPS, CPU. And probably implemented incorrectly. AWS makes billions if the customer infrastructure is great or terrible. I found that anything smaller than an AWS EC2 m5.8xlarge had noticeably bad performance on loaded servers (Windows). The list price for that would be about $13k per year, but most organizations get lower than list prices.
This also applies to services, not only compute. With anything associated with Microsoft Office 365 Exchange, scripts may run 10x slower against the cloud using the MSOnline cmdlets. It's absolute insanity: I used to perform a dump of all mailbox statistics that would take about one hour; it could take almost 24 hours against Office 365. You have to be careful to not use the same app or service account in multiple places, because the throttle limits are per-account.
__s · 13h ago
Even worse on Azure, where we had to ask customers to scale up vCPU to increase IOPS
https://azure.microsoft.com/en-us/pricing/details/managed-di...
Increasing vCPU also opened up more disk slots, to try to improve the situation with disk striping
The NVMe on other instance types is quite throttled. E.g. on a G5.4xlarge instance EBS is limited to 593MB/s and 20000 IOPS, while instance-attached NVMe is limited to 512MB/s (read) at 125000 IOPS - a fraction of the IO a workstation or gaming PC with similar GPU and RAM would have.
And stopping the instance wipes it, which means you can't do instance warmup with those, everything must be populated at boot.
coder543 · 13h ago
That quoted IOPS number is only with an 8-disk stripe (requiring the full instance, even if you don't need 488GB of RAM or a $3600/mo instance), I believe.
The per-disk performance is still nothing to write home about, and 8 actually fast disks would blow this instance type out of the water.
perching_aix · 13h ago
Instance store is also immediately wiped when the instance is halted / restarted, which can theoretically happen at any time, for example by a mystery instance failure, or a patching tool that's helpfully restarting your boxes during offhours.
slashdev · 13h ago
My understanding is this is not true - it's only wiped when the instance permanently fails and is moved.
Which means you can count on it about as much as a server of your own, if you could not repair the server.
I know a database company that uses instance storage as the primary storage. It’s common.
anonymars · 13h ago
> given even a consumer $100-200 NVMe gives ~1million+ IOPS these days
In the face of sustained writes? For how long?
merb · 13h ago
Sustained reads would not even give 1 million IOPS in that case - maybe when you only read the same file that fits into the NVMe cache, which probably never happens with a production database.
jauntywundrkind · 11h ago
I think you'd be surprised. Sustained write performance has gotten pretty good. Decent but not fancy consumer drives will often do 1GBps sustained, for bulkier writes. That's much better than we used to expect: flash has gotten much better with so many layers! This mid-range PCIe5 drive sustains a nice 1.5GBps: https://www.techpowerup.com/review/team-group-ge-pro-2-tb/6....
I don't think sustained reads are a problem? Benches like CrystalDiskMark do a full-disk random read test; they're designed to bust through cache afaik. 7.2GB/s of 4k reads would translate to 1.8M IOPS. Even if this is massively optimistic, you need to slash a lot of zeroes/orders of magnitude to get down to 20k IOPS, which you will also pay >$100/mo for.
p_ing · 13h ago
The Samsung 9910 has a 1:1 TB:GB cache ratio of LPDDR4X memory (1GB of DRAM per 1TB of flash). I won't pretend to understand the magic NVMe drives possess, but if you got a 4TB or 8TB 9910, could you not in theory pull in all of the data you require to cache?
I would assume, and it might be a poor assumption, that NVMe controllers don't pull in files, but rather blocks, so even if you had a database that exceeded cache size, in theory if the active blocks of that database did not exceed cache size, it could be "indefinitely" cached for a read-only pattern.
wtallis · 10h ago
The DRAM on a SSD like that isn't for caching user data, it's for caching the drive's metadata about which logical blocks (as seen by the OS) correspond to which physical locations in the flash memory.
yxhuvud · 11h ago
FWIW, using the same approach as in the article, ie io_uring, is one of the few ways to actually reach anywhere close to that 1 million, so it is not as if they are competing concerns.
immibis · 11h ago
I noticed this with bandwidth. AWS price for bandwidth: $90.00/TB after 0.1TB/month. Price everywhere else (low cost VPSes): $1.50/TB after 1-5TB/month. Price some places (dedicated servers): $0.00/TB up to ~100TB/month, $1.50/TB after.
You pay 60 times the price for the privilege of being on AWS.
Bandwidth is just their most egregious price difference. The servers are more expensive too. The storage is more expensive (except for Glacier). The serverless platforms are mostly more expensive than using a cheap server.
There are only two AWS products that I understand to have good prices: S3 Glacier (and only if you never restore!), and serverless apps (Lambda / API Gateway) if your traffic is low enough to fit in the Always Free tier. For everything else, it appears you get ripped off by using AWS.
binary132 · 14h ago
Meanwhile people are running things on raspberry pi home clusters thinking they’re winning
brulard · 13h ago
Maybe they are. With an NVMe HAT, you get decent IO performance
adgjlsfhk1 · 8h ago
It's still moderately bad. The Raspberry Pi is limited to 2 Gen3 PCIe lanes, which is ~4-8x slower than the drive (and you will likely be further limited by CPU speed)
song · 12h ago
Are there good performance comparisons between Postgres, MariaDB and Percona? I'm really curious at this point in which cases each of those databases shines.
Does anyone know when the update allowing more concurrent connections is dropping, so we can stop using pgbouncer?
__s · 13h ago
That'll likely need a conversion away from process-per-connection, so not any time soon
cryptonector · 7h ago
I've been following the `AIO 2.5` thread on the PG mailing lists, and I've been salivating at the thought of this being released.
Tostino · 14h ago
Thank you for the effort that went into getting this committed. I remember seeing the first discussions about async I/O (and using io_uring) like 6 or 7 years ago. Amazing amount of work to get the design right.
Looking forward to the other places that async I/O can be used in future Postgres releases now that the groundwork is done.
dbbk · 8h ago
This looks promising! Wonder if it's coming to Neon
WhyNotHugo · 9h ago
It's pretty disappointing that simply using O_NONBLOCK doesn't work as expected on regular files. It would be such a simple and portable mechanism to do async I/O using the same interfaces that we already use for networking.
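A small C sketch of the behavior being lamented - the flag is accepted on a regular file but doesn't make reads non-blocking the way it does for sockets ("datafile" is a hypothetical path):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* O_NONBLOCK is silently accepted here... */
    int fd = open("datafile", O_RDONLY | O_NONBLOCK);
    char buf[4096];

    /* ...but unlike a socket or pipe, this read() will not return
     * -1/EAGAIN when the data isn't ready; it blocks on the disk I/O. */
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("read returned %zd\n", n);
    close(fd);
    return 0;
}
```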
gitroom · 8h ago
insane how long it took postgres to get async i/o right - feels like all the big changes spark a million little tradeoffs, right? you think stuff like io_uring is finally gonna push postgres to catch up with the clouds