Waiting for Postgres 18: Accelerating Disk Reads with Asynchronous I/O

572 lfittl 154 5/7/2025, 2:57:03 PM pganalyze.com ↗

Comments (154)

drdrek · 132d ago

Postgres is such a cool project, I have so much respect for its maintainers and community! For me its the second most impactful OSS project in the business tech world behind Linux itself. A real public good to be cherished and praised.

belter · 132d ago

You can get no better recommendation, that a phrase I once heard at a presentation by Richard Hipp primary author of SQLite. As he was explaining how he analyzed new features and design decisions he says some of his guidance is: "What would Postgres do?"

freilanzer · 132d ago

I recently started a ML project using text data and the choice was between MySQL and Postgres. Having looked at the respective features and pros and cons, the choice was immediately obvious. Also, with pgvector and https://postgresml.com available, the choice for Postgres was even easier.

cpursley · 132d ago

Postgres is Enough: https://gist.github.com/cpursley/c8fb81fe8a7e5df038158bdfe0f...

powerbook5300CS · 132d ago

Why not MongoDB?

ricw · 132d ago

The question should be the other way around: why mongodb? It’s not ACID compliant so has major down sides…

okeuro49 · 132d ago

According to mongodb it is acid compliant.

https://www.mongodb.com/resources/products/capabilities/acid...

IgorPartola · 132d ago

I guess they didn’t need anything web scale.

belter · 132d ago

Because it's web scale...

the8472 · 133d ago

On linux there also is preadv2(..., RWF_NOWAIT) which can be used to do optimistic non-blocking read from the page cache. That might be useful for io_method = worker to shave off a bit of latency. Try reading on the main thread with NOWAIT and only offload to a worker thread when that fails.

anarazel · 133d ago

FWIW, I played with that - unfortunately it seems that the the overhead of doing twice the page cache lookups is a cure worse than the disease.

Note that we do not offload IO to workers when doing I/O that the caller will synchronously wait for, just when the caller actually can do IO asynchronously. That reduces the need to avoid the offload cost.

It turns out, as some of the results in Lukas' post show, that the offload to the worker is often actually beneficial particularly when the data is in the kernel page cache - it parallelizes the memory copy from kernel to userspace and postgres' checksum computation. Particularly on Intel server CPUs, which have had pretty mediocre per-core memory bandwidth in the last ~ decade, memory bandwidth turns out to be a bottleneck for page cache access and checksum computations.

Edit: Fix negation

the8472 · 133d ago

Ah yeah, getting good kernel<>userspace oneshot memcpy performance for large files is surprisingly hard. mmap has setup/teardown overhead that's significant for oneshot transfers, regular read/write calls suffer from page cache/per page overhead. Hopefully all the large folio work in the kernel will help with that.

anarazel · 133d ago

From what I've seen a surprisingly large part of the overhead is due to SMAP when doing larger reads from the page cache - i.e. if I boot with clearcpuid=smap (not for prod use!), larger reads go significantly faster. On both Intel and AMD CPUs interestingly.

On Intel it's also not hard to simply reach the per-core memory bandwidth with modern storage HW. This matters most prominently for writes by the checkpointing process, which needs to compute data checksums given the current postgres implementation (if enabled). But even for reads it can be a bottleneck, e.g. when prewarming the buffer pool after a restart.

derefr · 133d ago

> if I boot with clearcpuid=smap (not for prod use!), larger reads go significantly faster. On both Intel and AMD CPUs interestingly.

Is there a page anywhere that collects these sorts of "turn the whole hardware security layer off" switches that can be flipped to get better throughput out of modern x86 CPUs, when your system has no real attack surface to speak of (e.g. air-gapped single-tenant HPC)?

the8472 · 132d ago

On the kernel side there's a boot parameter for all of them: mitigations=off Software that was compiled with additional fences may have to be recompiled to remove them.

https://www.kernel.org/doc/html/latest/admin-guide/kernel-pa...

starspangled · 132d ago

mitigations=off disables workarounds for bugs or "mis-features" in the CPU that could be exploited to bypass OS security measures.

smap is an OS security measure, and so does not get disabled by mitigations=off. smap can be pretty draining for certain IO performance though. IMO it should be more well-known or covered by a more obvious option.

Linux kernel developers are really bad at defining and naming options like this.

amluto · 133d ago

SMAP overhead should be roughly constant, and I’d be quite surprised if it’s noticeable for large reads. Small reads are a different story.

anarazel · 133d ago

It turns out to be the other way round, curiously. The bigger the reads (i.e. how much to read in one syscall) and the bigger the target area of the reads (how long before a target memory location is reused), the bigger the overhead of SMAP gets.

If interesting I can dig up the reproducer I had at some point.

amluto · 132d ago

That is definitely interesting.

gregjm · 132d ago

TCMalloc never munmaps, instead it mmap(MAP_FIXED) within unpopulated PROT_NONE regions, and then madvise(MADV_FREE) at page granularity to reduce RSS. Perhaps a similar approach for file I/O could help to dodge the cost of munmap TLB shootdowns after a file has been read, but using MADV_DONTNEED instead of MADV_FREE. There will probably be a shootdown associated with the MADV_DONTNEED, but maybe it will be lower cost than munmap?

You might also just keep around the file mapping until memory/address space pressure requires, and at that point MAP_FIXED over it.

yxhuvud · 133d ago

Well, nowadays there is https://www.phoronix.com/news/Linux-RWF_UNCACHED-2024

the8472 · 133d ago

That doesn't speed up uerspace<>kernel memcopy, it just reduces cache churn. Despite its name it still goes through the page cache, it just triggers writeback and drops the pages once that's done. For example when copying to a tmpfs it makes zero difference since that lives entirely in memory.

senderista · 133d ago

So you're less dependent on the page replacement algorithm being scan-resistant, since you can use this flag for scan/loop workloads, right?

gmokki · 132d ago

I would initially add it for WAL writes and reads. There should never be another read in normal operation.

gavinray · 133d ago

Do you think there's a possibility of Direct IO being adopted at some point in the future now that AIO is available?

anarazel · 133d ago

> Do you think there's a possibility of Direct IO being adopted at some point in the future now that AIO is available?

Explicitly a goal.

You can turn it on today, with a bunch of caveats (via debug_io_direct=data). If you have the right workload - e.g. read only and lots of seqscans, bitmap index scans etc you can see rather substantial perf gains. But it'll suck in any cases in 18.

We need at least:

- AIO writes in checkpointer, bgwriter and backend buffer replacement (think bulk loading data with COPY)

- readahead support in a few more places, most crucially index range scan (works out ok today if the heap is correlated with the index, sucks badly otherwise)

EDIT: Formatting

zX41ZdbW · 133d ago

The usage in ClickHouse database: https://github.com/ClickHouse/ClickHouse/blob/d2697c0ea112fd...

Aside from a few problems in specific Linux kernel versions, it works great.

nu11ptr · 133d ago

Is this new async. I/O feature for Linux only?

I know Windows has IOCP and also now an IORing implementation of its own (Less familiar with macOS capabilities other than POSIX AIO).

https://learn.microsoft.com/en-us/windows/win32/api/ioringap...

Update: Most of the comments below seem to be missing the fact that Windows now also has an IORing implementation, as I mentioned above. Comparison article here:

https://windows-internals.com/ioring-vs-io_uring-a-compariso...

macdice · 133d ago

I am not a Windows guy but I (with help) managed to get IOCP working for this in a basic prototype. Will share publicly soon. I also sketched out an IoRing version (if you are interested in helping debug and flesh that out let me know!).

Main learnings: the IOCP version can't do asynchronous flush! Which we want. The IoRing version can! But it can't do scatter/gather AKA vector I/O yet! Which is an essential feature for buffer pool implementation. So actually I am basically waiting for IoRing to add support for that before taking it too seriously (I can see they are working on it because the ops are present in an enum, it's just that the build functions are missing).

So my guess is that in a year or so we should be able to run all PostgreSQL disk I/O through IoRing on Windows. Maybe?

Another complications is that it really wants to be multithreaded (consuming completions for IOs started in another process requires a lot of hoop jumping, I made it work but...) This will resolve itself naturally with ongoing work to make PostgreSQL multithreaded.

The next problem is that IoRing doesn't support sockets! So in future work on async networking (early prototypes exist) we will likely also need IOCP for that part.

p_ing · 132d ago

Look into Registered I/O for sockets.

lfittl · 133d ago

It depends on the I/O method - as described in the article, "io_uring" is only available on Linux (and requires building with liburing, as well as io_uring to be enabled in the Kernel), but the default (as of beta1) is actually "worker", which works on any operating system.

The "worker" method uses a dedicated pool of I/O worker processes that run in the background, and whilst not as performant as io_uring in our benchmark, did clearly outperform the "sync" method (which is the same as what Postgres currently has in 17 and older).

nu11ptr · 133d ago

> "io_uring" is only available on Linux

Windows now also has IORing (see my comment above)

severino · 132d ago

Yes, but what's your point? It's not that we can take this new Postgres version and just compile it in a Windows machine supporting IORing, can we? It requires some effort and time -many times by volunteers- and given that most Postgres deployments are running under Linux, it's understandable that the developers focus on that platform first.

nu11ptr · 132d ago

Misunderstood - thought you were saying ioring was a feature specific to linux, but you meant specific to linux on Postgres.

stingraycharles · 133d ago

Sounds like this feature is based on io_uring which is a Linux feature. I would be surprised if they implemented async io on Windows before they would on Linux given the user/deployment base being very Linux-heavy.

p_ing · 133d ago

Yeah, surprise Linux had to play catch up to a Windows 1994 release! Same with the scheduler, I'd argue Windows does OOM better than Linux today...

Windows even had the concept of io_uring before, but network only with Registered I/O back in the Windows 8 (8.1?) days.

Linux still lacks the "all I/O is async" NT has.

The underlying kernel and executive of Windows aren't primitive pieces of trash. They're quite advanced, ruined by PMs and the Ads division throwing crap on top.

And yes, Windows' I/O Ring is a near 1:1 copy of the Linux implementation, but IOCP/OVERLAPPED I/O data structure preceded it since NT's first release.

This isn't a pissing match, just we all hope that kernel devs learn from each other and take the best ideas. Sometimes we, IT, don't get to choose the OS we run -- it's dictated by the apps the business requires.

cyberax · 133d ago

> Same with the scheduler

Windows does OOM far better than Linux because it doesn't really overcommit RAM.

But the CPU _scheduler_ in Linux is far, far, far better than in Windows. Linux can even do hard-realtime, after all.

greenavocado · 133d ago

How difficult would it be to completely tear out the Windows desktop experience and just use the system and display drivers without the rest? Has anybody attempted such a feat?

p_ing · 133d ago

There is Windows Server Core which removes everything but a CLI, but you still have the normal login experience, you still have a "desktop" (no start menu, taskbar, etc), you can still launch normal Win32 apps... for the most part (task manager, notepad, and so on).

Win32 is also responsible for core Services, which means you can't de-Windows-ify Windows and strip it down to an NT API-only. All other personalities (OS/2, POSIX, SFU) have a dependency on Win32, as well.

You're still running the WindowServer of course; it's part of the Executive.

That said, with a bunch of modifications, NTDEV did get Windows 11 down to it's bare minimum, and text only to boot. So I guess it's technically possible, though not useful.

https://www.youtube.com/watch?v=SL6t_iuitxM

smileybarry · 133d ago

> There is Windows Server Core which removes everything but a CLI, but you still have the normal login experience, you still have a "desktop" (no start menu, taskbar, etc), you can still launch normal Win32 apps... for the most part (task manager, notepad, and so on).

Yep, they've replaced nearly every UI with text (the login window is a TUI), though there's still some shell DLLs and the whole thing still uses a window manager. That's honestly for the best, since it allows you to migrate full installations with some UI-based apps to a Core installation with them intact.

> That said, with a bunch of modifications, NTDEV did get Windows 11 down to it's bare minimum, and text only to boot. So I guess it's technically possible, though not useful.

Windows has had a "text mode" since at least Windows XP IIRC, but it's really not that useful, if at all. Even for rescue operations you're better off with Windows PE.

Ericson2314 · 133d ago

There is always https://en.wikipedia.org/wiki/ReactOS ! :)

password4321 · 133d ago

Free-as-in-beer Hyper-V Server 2019 is in extended support (only security updates) until 2029.

dboreham · 133d ago

There are several such things. The Windows installer uses one. X-Box uses another.

nu11ptr · 133d ago

> this feature is based on io_uring which is a Linux feature

And now also a Windows feature, see my comment above for info

zokier · 132d ago

IoRing != io_uring. They are similar APIs but not the same thing.

PaulHoule · 133d ago

For a long time ago there have been APIs to do asynchronous file I/O on the books for Linux but they weren't worth using because they didn't really speed anything up.

anarazel · 133d ago

They sped up things for a long time - but only when using unbuffered IO. The new thing with io_uring is that it also accelerates buffered IO. In the initial version it was all through kernel worker threads, but these days several filesystems have better paths for common cases.

immibis · 133d ago

IIRC they literally just did the sync I/O on a worker thread.

DmitryOlshansky · 132d ago

That’s the POSIX AIO. Linux has had AIO that was only supported for direct I/O and had certain limitations making it awkward to use.

spwa4 · 133d ago

Yes, although Windows has had async I/O since Windows NT 3.1, their API is still not supported by Postgres.

anarazel · 133d ago

FWIW, there are prototype patches for an IOCP based io_method. We just couldn't get them into an acceptable state for PG 18. I barely survived getting in what we did...

spwa4 · 132d ago

I didn't mean that as a criticism on Postgres, certainly not on Postgres developers. I really look forward to evaluating this change. I'm just reminiscing a bit about admiring Windows NT ... and being incredibly disappointed by the market.

nu11ptr · 133d ago

Yes, that was IOCP, however, Windows now also has IORing (see my comment above)

mijoharas · 132d ago

Does anyone run postgres on windows?

Given that they still target it there must be a user base. Does anyone know the statistics of usage by platform? Anyone here use it?

Genuinely curious, windows backend dev is something I know very little about.

kev009 · 133d ago

A lot of work has gone into FreeBSD's aio(4) so it will be interesting to see how that works, because it doesn't have the drawbacks of Linux/glibc aio.

macdice · 133d ago

BTW I have patches for PostgreSQL AIO on FreeBSD, which I will propose for v19. It works pretty well! I was trying to keep out of Andres's way for the core architectural stuff and basic features ie didn't want to overload the pipes with confusing new topics for v18 :-)

tiffanyh · 133d ago

Would you mind expanding more on this topic.

Is FreeBSD doing anything significantly different and/or better?

kev009 · 133d ago

Sure two separate things really.

Linux aio (the POSIX one, not the broader concept that now includes io_uring) has been fundamentally bad. Part of it is some of the implementation is delegated to user space in glibc as worker threads. https://lwn.net/Articles/671649/ surveys the issues. I have not done a deep dive into this topic in a long time but as far as I know the situation never greatly improved. io_uring does not suffer from the problems, although it is a new and non-standard API with associated pros and cons.

Thomas Munro has gone into some of the benefits and also gaps of FreeBSD's aio(4) vs io_uring here https://wiki.postgresql.org/wiki/FreeBSD/AIO. Notably, because the implementation is in kernel and has received continuous improvement it is gradually removing downsides and there are several users that need it to work well. This document undersells the problems of ZFS though: the ARC is a necromanced buffer cache which Sun ironically worked very hard to remove (http://mcvoy.com/lm/papers/SunOS.vm_arch.pdf) and nobody has fixed this in 20 years. But for UFS or raw block devs or vendor file systems that doesn't matter.

FreeBSD being a complete "src" tree yields some advantages. In concrete, there are some in tree consumers like ctld (a CAM/iSCSI server) that have been used as vehicles to provide end to end implementation of things like NIC offload of complex protocols that play well with backend block devices such that you can make an offloaded data path that is pumped via asynchronous completion events on both sides (network and block). A related whitepaper https://www.chelsio.com/wp-content/uploads/resources/T6-100G... but this concept can be extended to iSCSI, NVMeOF etc. It seems NVMeOF work sponsored by Chelsio is yielding a lot of the finishing touches https://papers.freebsd.org/2023/eurobsdcon/baldwin-implement.... I believe my colleagues are also improving aio to further optimize the Netflix case of KTLS NIC with data on disk but I am not the right person to extrapolate on that.

skeptrune · 133d ago

How close is this to the way MySQL does it with InnoDB? It appears to be about the same.

seunosewa · 133d ago

Yep. It's a low hanging fruit they should've picked years ago.

They will eventually figure out using b-trees for tables too.

tehlike · 133d ago

There's a lot more PG needs to do for storage layout / access pov, and they have been working on it for a while. Orioledb has shown what might be opssible, and they have been upstreaming it.

Having the ability to do something LSM as a storage engine would be great - and potentially allow better compression than what we currently get with TOAST - which is not a lot... PG doesn't even have oob page compression...

ARandomerDude · 133d ago

Can you elaborate on this B-tree part of your comment? I know B-tree is the default index type in pg, but it sounds like there’s more to the story that I’m not familiar with.

greenavocado · 133d ago

PostgreSQL uses heap files for the primary table storage, not B-trees. In PostgreSQL table data is primarily stored in heap files (unordered collections of pages/blocks). Indexes (including primary key indexes) use B-trees (specifically B+ trees). When you query a table via an index, the B-tree index points to locations in the heap file

InnoDB uses a clustered index approach. The primary key index is a B-tree. The actual table data is stored in the leaf nodes of this B-tree. Secondary indexes point to the primary key.

One is not better than the other in general terms. InnoDB's clustered B-tree approach shines when:

You frequently access data in primary key order

Your workload has many range scans on the primary key

You need predictable performance for primary key lookups

Your data naturally has a meaningful ordering that matches your access patterns

PostgreSQL's heap approach excels when:

You frequently update non-key columns (less page splits/reorganization)

You have many secondary indexes (they're smaller without primary keys)

Your access patterns vary widely and don't follow one particular field

You need faster table scans when indexes aren't applicable

I personally find PostgreSQL's approach more flexible for complex analytical workloads with unpredictable access patterns, while InnoDB's clustered approach feels more optimized for OLTP workloads with predictable key-based access patterns. The "better" system depends entirely on your specific workload, data characteristics, and access patterns.

Sesse__ · 133d ago

Indexes that point directly to the disk column are also significantly faster to access; it is a persistent pain point for OLAP on InnoDB that all secondary indexes are indirect. You can work around it by adding additional columns to the index to make your lookups covering, but it's kludgy and imprecise and tends to bloat the index even further. (The flip side is that if you have tons of indexes, and update some unrelated column, InnoDB doesn't need to update those indexes to point to the location of the new row. But I'm generally very rarely annoyed by that in comparison.)

sroussey · 132d ago

On commercial databases from Oracle and Microsoft, you can cluster the DB on any index. Really would love for Postresql to do the same.

cwbriscoe · 132d ago

You can but new rows will not be clustered until you re-cluster.

CLUSTER table_name USING index_name;

sroussey · 130d ago

I remember that for MySQL MyISAM (syntax was different) in the olden days. I thought Oracle let you do it though. My bad.

MySQL Innodb and other have real clustered primary key indexes with all the pluses and minuses that entails.

farazbabar · 133d ago

Don't forget high speed committed writes to append only tables (the opposite of scans), postgres approach is better here as well.

saltcured · 133d ago

It's also deeply entwined with the MVCC concurrency control and the ability to do DDL in transactions, right?

jiggawatts · 133d ago

SQL Server supports every combination of heap storage, clustered storage, MVCC, and DDL in transactions.

foobahhhhh · 132d ago

Takes me back 20 years to using SQL Server! It let you choose clustered index or not IIRC.

cryptonector · 133d ago

SQLite3 only has b-trees for files and indices.

PostgreSQL only has heaps for tables, and various other data structures for indices, with b-tree being the default for indices.

There are many cases where having b-trees for tables would make performance better.

GP is basically poking fun at PG for still having nothing but heaps for tables.

ksec · 130d ago

Despite having relatively fewer improvement on the MySQL side, especially the OSS version and not the enterprise version, there are still many areas Postgres have not exceeded MySQL, and doesn't look like will be doing so in next 3 - 4 years.

Tostino · 133d ago

Thank you for the effort that went into getting this committed. I remember seeing the first discussions about async I/O (and using io_uring) like 6 or 7 years ago. Amazing amount of work to get the design right.

Looking forward to the other places that async I/O can be used in future Postgres releases now that the groundwork is done.

shayonj · 133d ago

Very nicely written post! I'd love to start running these in production on NVMe and hope its something major cloud providers start to offer ASAP. The performance gains are _extremely_ attractive

CoolCold · 132d ago

My understanding is that local storage like NVMe is not much affected as latency is very low compared to network based block devices like ebs

niux · 133d ago

I recently deployed Postgres on a dedicated Hetzner EX-44 server (20 cores, 64GB RAM, 2x 512GB NVMe SSDs in RAID 1) for €39/month. The price-to-performance ratio is exceptional, providing enterprise-level capacity at a fraction of typical cloud costs.

For security, I implemented TailScale which adds only ~5ms of latency while completely eliminating public network exposure - a worthwhile tradeoff for the significant security benefits.

My optimization approach includes:

- Workload-specific configuration generated via PGTune (https://pgtune.leopard.in.ua/)

- Real-time performance monitoring with PgHero for identifying bottlenecks

- Automated VACUUM ANALYZE operations scheduled via pgcron targeting write-heavy tables, which prevents performance degradation and helps me sleep soundly

- A custom CLI utility I built for ZSTD-compressed backups that achieves impressive compression ratios while maintaining high throughput, with automatic S3 uploading: https://github.com/overflowy/pgbackup

This setup has been remarkably stable and performant, handling our workloads with substantial headroom for growth.

trollied · 133d ago

That’s great, but you need a solid HA/backup and recovery strategy if you even remotely care about your data.

deedubaya · 133d ago

He literally mentioned backups and not everyone needs HA?

Tostino · 133d ago

I would absolutely use another backup utility (additionally if you want) if I were you (barman, pgbackrest, etc).

You are just wrapping pgdump, which is not a full featured backup solution. Great for a snapshot...

Use some of the existing tools and you get point-in-time recovery, easy restores to hot standbys for replication, a good failover story, backup rotations, etc.

arp242 · 133d ago

Snapshots are backups. Not sufficient backups for some cases, but perfectly fine for others.

niux · 133d ago

The reason I wrote my own tool is because I couldn't find anything for Pg17 at the time and pgbackrest seemed overkill for my needs. Also, the CLI handles backup rotations as well. Barman looks interesting though, I'll definitely have a look, thanks!

Tostino · 133d ago

pgbackrest was always easy to use in my experience. Not very hard to setup or configure, and low overhead. Supports spool directories for WAL shipping, compression, and block incremental backups (YAY!!!). I ran my last company on it for the last ~6 years I was there. Never any complaints, solid software (which is what you want for backups).

I have been using barman indirectly through CloundNativePG with my latest company, but don't have the operational experience to speak on it yet.

9dev · 133d ago

pgbackrest only looks scary because it’s so flexible, but the defaults work great in almost all cases. The most complex thing you’ll need to do is creating a storage bucket to write to, and configure the appropriate storage provider in pgbackrest's config file.

When it’s set up properly, it’s solid as rock. I’d really recommend you check it out again; it likely solves everything you did more elegantly, and also covers a ton of things you didn’t think of. Been there, done that :)

lousken · 133d ago

do you still need it? haven't managed pg for a while, but shouldn't pg17 have some solution for backups?

Tostino · 132d ago

You still need it. There are tools included with Postgres that you can cobble together for a backup solution that is lacking in features and edge case testing. But i'd much rather just use the right tool for the job.

For example, pgBackRest solves real problems with features like block level incremental backups that drastically reduce storage and transfer times for many workloads, automated backup retention policies, multiple repository support for offsite backup redundancy, encryption support, point in time recovery for granular restoration capabilities, and tooling to build standby servers very quickly and efficiently. These features handle edge cases and reduce operational overhead compared to managing scripts around pg_basebackup and WAL archiving yourself. In many environments, those features are required (e.g. encryption).

codegeek · 133d ago

Would you be willing to open source the setup ? I would love to learn from it.

digdugdirk · 133d ago

Seconded. This corner of the programming world is a deep dark and scary place for people who haven't had solid industry experience. It'd be hugely helpful to have a barebones starting point to begin learning best practices.

natmaka · 132d ago

> Automated VACUUM ANALYZE operations scheduled via pgcron targeting write-heavy tables

Why don't you set (per table) the autovacuum_analyze_scale_factor parameter (or autovacuum_analyze_threshold) then let AUTOVACUUM handle this?

dpacmittal · 133d ago

EX44 is 14 physical cores. Just wanted to point it out.

olavgg · 132d ago

And no ECC memory?

noir_lord · 132d ago

EX44’s are Intel 13500’s so no, not ECC.

I have one, I use it for running an arma3 server, it’s been bulletproof for reliability though, hetzner have come a long way since the first time I used them years and years ago where it was a mess.

I’m almost at the point I’d use them for some thig that mattered (I.e. money was involved).

sroussey · 132d ago

BTW, bun now has a built-in S3 client.

martinald · 133d ago

I sort of had to chuckle at the 20k IOPS AWS instance, given even a consumer $100-200 NVMe gives ~1million+ IOPS these days. I suspect now we have PCIe 5.0 NVMes this will go up to

I always do wonder how much "arbitrary" cloud limits on things like this cause so many issues. I'm sure that async IO is very helpful anyway, but I bet on a 1million IOPS NVMe it is nowhere near as important.

We're effectively optimising critical infrastructure tech for ~2010 hardware because that's when big cloud got going and there has been so few price reductions on things since then vs the underlying hardware costs.

Obviously a consumer NVMe is not "enterprise" but my point is we are 3+ orders of magnitude off performance on cheap consumer hardware vs very expensive 'enterprise' AWS/big cloud costs.

gopalv · 133d ago

> had to chuckle at the 20k IOPS AWS instance, given even a consumer $100-200 NVMe gives ~1million+ IOPS these days

The IOPS figure usually hides the fact that it is not a single IOP that is really fast, but a collection of them.

More IOPS generally is done best by reducing latency of a single operation but the average latency is what actually contributes to the "fast query" experience. Because a lot of the next IO is branchy from the last one (like an index or filter lookup).

As more and more disks to CPU connectivity goes over the network, we can really deliver a large IOPS even when we have very high latencies (by spreading the data across hundreds of SSDs and routing it fast), because with the network storage we pay a huge latency cost for durability of the data simply because of location diversification.

Every foot is a nanosecond, approximately.

That the tradeoff is worth it, because you don't need clusters to deal with a bad CPU or two. Stop & start, to fix memory/cpu errors.

The AWS model pushes the latency problem to the customer and we see it in the IOPS measurements, but it is really the latency x queue depth we're seeing not the hardware capacity.

lfittl · 133d ago

Yep, I find cloud storage performance to be quite frustrating, but its the reality for many production database deployments I've seen.

Its worth noting that even on really fast local NVMe drives the new asynchronous I/O work delivers performance benefits, since its so much more efficient at issuing I/Os and reducing syscall overhead (for io_uring).

Andres Freund (one of the principal authors of the new functionality) did a lot of benchmarking on local NVMe drives during development. Here is one mailinglist thread I could find that shows a 2x and better benefit with the patch set at the time: https://www.postgresql.org/message-id/flat/uvrtrknj4kdytuboi...

codegeek · 133d ago

You probably already know this but I will say it anyway. These cloud services like AWS are not succeeding in enterprise because they have outdated hardware. They succeed because in enterprise, CIOs and CTOs want something that is known, has a brand and everyone else uses it. It's like the old adage of "No one got fired for using IBM". Now it is "No one gets fired for hosting with AWS no matter how ridiculous the cost and corresponding feature is".

the8472 · 133d ago

> No one gets fired for hosting with AWS

But consider the counterfactual: Non-realized customers because AWS certified solutions architect(tm) software couldn't deliver the price/perf they would have needed.

At $work this is a very real problem because a software system was built on api gateway, lambdas, sqs and a whole bunch of other moving pieces (serverless! scalable! easy compliance!) that combined resulted in way too much latency to meet a client's goal.

anonymars · 131d ago

Reminds me of https://idlewords.com/talks/website_obesity.htm

> Let me give you a concrete example. I recently heard from a competitor, let’s call them ACME Bookmarking Co., who are looking to leave the bookmarking game and sell their website.

While ACME has much more traffic than I do, I learned they only have half the daily active users. This was reassuring, because the hard part of scaling a bookmarking site is dealing with people saving stuff.

We both had the same number of employees. They have an intern working on the project part time, while I dither around and travel the world giving talks. Say half a full-time employee for each of us.

We have similar revenue per active user. I gross $12,000 a month, they gross $5,000.

But where the projects differ radically is cost. ACME hosts their service on AWS, and at one point they were paying $23,000 (!!) in monthly fees. Through titanic effort, they have been able to reduce that to $9,000 a month.

I pay just over a thousand dollars a month for hosting, using my own equipment. That figure includes the amortized cost of my hardware, and sodas from the vending machine at the colo.

So while I consider bookmarking a profitable business, to them it's a $4,000/month money pit. I'm living large off the same income stream that is driving them to sell their user data to marketers and get the hell out of the game.

The point is that assumptions about complexity will anchor your expectations, and limit what you're willing to try. If you think a 'real' website has to live in the cloud and run across a dozen machines, a whole range of otherwise viable projects will seem unprofitable.

Similarly, if you think you need a many-layered CMS and extensive custom javascript for an online publishing venture, the range of things you will try becomes very constricted.

Rather than trying to make your overbuilt projects look simple, ask yourself if they can't just be simple.

9dev · 133d ago

> No one gets fired for hosting with AWS no matter how ridiculous the cost and corresponding feature is

Actually, AWS is so expensive, hosting everything we ran on Hetzner there would have simply depleted our funding, and the company would not exist anymore.

Hilift · 133d ago

Everything in the cloud is throttled. Network, IOPS, CPU. And probably implemented incorrectly. AWS makes billions if the customer infrastructure is great or terrible. I found that anything smaller than an AWS EC2 m5.8xlarge had noticeably bad performance on loaded servers (Windows). The list price for that would be about $13k per year, but most organizations get lower than list prices.

This also applies to services, not only compute. Anything associated with Microsoft Office 365 Exchange, scripts may run 10x slower against the cloud using the MSOnline cmdlets. It's absolute insanity, I used to perform a dump of all mailbox statistics that would take about one hour, it could take almost 24 hours against Office 365. You have to be careful to not use the same app or service account in multiple places, because the throttle limits are per-account.

immibis · 133d ago

I noticed this with bandwidth. AWS price for bandwidth: $90.00/TB after 0.1TB/month. Price everywhere else (low cost VPSes): $1.50/TB after 1-5TB/month. Price some places (dedicated servers): $0.00/TB up to ~100TB/month, $1.50/TB after.

You pay 60 times the price for the privilege of being on AWS.

Bandwidth is just their most egregious price difference. The servers are more expensive too. The storage is more expensive (except for Glacier). The serverless platforms are mostly more expensive than using a cheap server.

There are only two AWS products that I understand to have good prices: S3 Glacier (and only if you never restore!), and serverless apps (Lambda / API Gateway) if your traffic is low enough to fit in the Always Free tier. For everything else, it appears you get ripped off by using AWS.

ttfkam · 132d ago

Which is why most folks use CloudFront like AWS recommends, which is typically free for services to route through. Prices for CloudFront egress are competitive with what you described for other vendors. I don't know anyone paying $90/TB out to internet on AWS.

https://aws.amazon.com/cloudfront/pricing/?nc=sn&loc=3

Were you trying to route S3 directly out to the internet?

immibis · 132d ago

The Cloudfront pricing page you cited is just as bad, except the free tier for Cloudfront traffic is 1TB/month rather than 0.1TB/month. The pricing is just as bad above that. It says "Free for origin fetches" but upon further research, seems like that just means you won't pay twice by also having to pay for your traffic to get from its origin to Cloudfront.

ttfkam · 132d ago

You still haven't cited where you got $90/TB from.

detaro · 131d ago

For example EC2 traffic (from US and EU locations, other get more expensive).

ttfkam · 131d ago

Okay, I stand corrected. Now my question to you is how are you generating all these terabytes over the internet? And would data transfer costs really be a significant portion of your bill at that scale regardless of cloud provider? Not video transcoding resources, AI capacity, GPU capacity, etc.?

You're talking about the equivalent of completely saturating a gigabit Internet connection for over two hours per terabyte. A high def video stream from Netflix is only about 5 megabits/sec. 4K is 15 megabits.

67+ simultaneous 4K streams for two hours. Or 200+ simultaneous high def streams. Or a metric fuck-ton of web resources.

And you think you're going to find a provider that will allow you to send those kinds of volumes for $10/TB or less AND have relatively few outages AND stick around because their business model is sound?

By all means, point out who these unicorns are. Sign me up!

immibis · 129d ago

On Hetzner I have no problem saturating my gigabit connection with static files if I want to.

hylaride · 132d ago

Not everybody is serving public http traffic, though.

__s · 133d ago

Even worse on Azure where we had to ask customers to scale up vcpu to increase iops

https://azure.microsoft.com/en-us/pricing/details/managed-di...

Increasing vcpu also opened up more disk slots to try improve situation with disk striping

maherbeg · 133d ago

instance store on aws can give up to 3.3mil iops https://aws.amazon.com/blogs/aws/now-available-i3-instances-... - the main problem is just using networked storage.

the8472 · 133d ago

The NVMe on other instance types is quite throttled. E.g. on a G5.4xlarge instance EBS is limited to 593MB/s and 20000IOPS while instance-attached NVMe is limited to 512MB/s (read) at 125000IOPS, a fraction of IO what a workstation or gaming PC with similar GPU and RAM would have. And stopping the instance wipes it, which means you can't do instance warmup with those, everything must be populated at boot.

coder543 · 133d ago

That quoted IOPS number is only with an 8-disk stripe (requiring the full instance), even if you don't need 488GB of RAM or a $3600/mo instance, I believe.

The per-disk performance is still nothing to write home about, and 8 actually fast disks would blow this instance type out of the water.

perching_aix · 133d ago

Instance store is also immediately wiped when the instance is halted / restarted, which can theoretically happen at any time, for example by a mystery instance failure, or a patching tool that's helpfully restarting your boxes during offhours.

slashdev · 133d ago

My understanding is this not true, only when the instance permanently fails and is moved.

the8472 · 133d ago

stop or hibernate kills it. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance...

slashdev · 133d ago

Yeah, so restart does not.

Which means you can count on it about as much as a server of your own, if you could not repair the server.

I know a database company that uses instance storage as the primary storage. It’s common.

perching_aix · 132d ago

Wasn't aware, interesting. I did consider it in the past as well, but the reliability aspect made me consider this as a moonshot rather than anything practical. Kind of weirdly validating to know there are (supposedly) database providers using it.

That said, a good bit of our environments are scheduled, so it still wouldn't be an option there without hacks (e.g. doing a compressed blockwise dump before shutting down and then a blockwise flash on startup).

slashdev · 131d ago

I would take a logical dump on shutdown and the restore that rather. You’d need a backup and restore process anyway, so that path has to exist either way. It’s kind of like a full vacuum, so has some benefits too.

The NDA prevents me from saying which database company, but it’s a major provider of cloud managed databases across clouds.

anonymars · 133d ago

> given even a consumer $100-200 NVMe gives ~1million+ IOPS these days

In the face of sustained writes? For how long?

merb · 133d ago

sustained reads would not even give 1 mio iops in that case. Maybe wen you only read the same file that fits into the nvme cache. Which probably never happens in a production database..

jauntywundrkind · 133d ago

I think you'd be surprised. Sustained write performance has gotten pretty good. Decent but not fancy consumer drives will often do 1GBps sustained, for bulkier writes. That's much better than we used to expect: flash has gotten much better with so many layers! This mid-range PCIe5 drive sustains a nice 1.5GBps: https://www.techpowerup.com/review/team-group-ge-pro-2-tb/6....

I don't think sustained reads are a problem? Benches like the CrystalDiskMark do a full disk random read test; they're designed to bust through cache afaik. 7.2GBp of 4k reads would translate to 1.8MIOps. Even if this is massively optimistic, you need to slash a lot of zeroes/orders of magnitude to get down to 20kIOps, which you will also pay >$100/mo for.

p_ing · 133d ago

Samsung 9910 has a 1:1 TB:GB cache size of LPDDR4X memory. I won't pretend to understand the magic NVMe drives possess, but if you got a 4TB or 8TB 9910, could you not in theory pull in all of the data you require to cache?

I would assume, and it might be a poor assumption, that NVMe controllers don't pull in files, but rather blocks, so even if you had a database that exceeded cache size, in theory if the active blocks of that database did not exceed cache size, it could be "indefinitely" cached for a read-only pattern.

wtallis · 133d ago

The DRAM on a SSD like that isn't for caching user data, it's for caching the drive's metadata about which logical blocks (as seen by the OS) correspond to which physical locations in the flash memory.

LtdJorge · 132d ago

It definitely uses it for caching data

yxhuvud · 133d ago

FWIW, using the same approach as in the article, ie io_uring, is one of the few ways to actually reach anywhere close to that 1 million, so it is not as if they are competing concerns.

binary132 · 133d ago

Meanwhile people are running things on raspberry pi home clusters thinking they’re winning

brulard · 133d ago

Maybe they are. With NVMe hat, you get decent IO performance

adgjlsfhk1 · 133d ago

It's still moderately bad. Raspberry pi is limited to 2 gen3 pcie lanes which is ~4-8x slower than the drive (and you will likely be further limited by cpu speed)

ttfkam · 132d ago

If the engineering demand is lower than the engineering supply, you're still winning. If your transactions per second only amounts to 60% of the Pi's capacity for a given use case, why complain?

p_ing · 133d ago

Is io_uring still plagued by security issues enabled by it's use? Or have those largely been fixed? My understanding was many Linux admins (or even distros by default?) were disabling io_uring.

hansvm · 133d ago

https://github.com/axboe/liburing/discussions/1047

p_ing · 133d ago

Thanks. It looks like it is still going through growing pains.

https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring

https://www.theregister.com/2025/04/29/linux_io_uring_securi...

But most of the 'off by default' are from ~2023 and not a current concern.

znpy · 133d ago

Disabling io_uring because “guy on the internet said so” or “$faang_company says so” is beyond dumb.

One should evaluate the risk according to their specific use case.

It can be a good idea to disable it of you run untrusted workloads (eg: other people’s containers, sharing the same kernel) but if you have a kernel on a machine (virtual or real) dedicated to your own workload you can pretty much keep using io_uring. There are other technologies to enforce security (eg: selinux emand similar).

wtallis · 133d ago

I think at this point, those "other technologies to enforce security" are the main area of concern for io_uring users: if those other security layers don't know about io_uring they won't apply any restrictions to it.

lambdaone · 132d ago

Most users don't know what io_uring is, nor would they have read any of thosee articles. They are not HN readers with low-level understanding of operating systems workingm and don't even remotely have the knowledge of how any of these technologies work, nor do they have the time to get it. With a very few exceptions, users have to rely entirely on software vendors and service providers to do this for them, and trust they know what they're doing.

p_ing · 133d ago

Yes, every security mechanism should be reviewed to validate it is applicable, i.e. I had a vulnerable version of node running but I wasn't using the particular aspect of node so it was a non-issue (and amazingly, IT sec agreed!).

But in the case of io_uring, it was outright bypassing other security layers. And while we all like to think we're running trusted services/code, we have to think about supply-chain attacks that may surprise us, or zero days, etc.

rdtsc · 132d ago

Oh sure, if you run your own server or sell your cloud as a service, you can enable it. But if you want to run on someone else’s setup, you have to play by their rules.

> Disabling io_uring because “guy on the internet said so” or “$faang_company says so” is beyond dumb.

I think it’s more like “$faang_company already disabled it in their hosts, so I am out of luck for my containers running on their cloud”

song · 133d ago

Are there good performance comparisons between postgres, mariadb and percona? I'm really curious at this point in which case each of those database shine.

KronisLV · 133d ago

Probably depends on the particular workload, but there are at least some attempts at benchmarking vaguely typical workloads: https://datasystemreviews.com/postgresql-vs-mariadb-performa...

ttfkam · 132d ago

Apples to apples, Postgres might lose, but that'd be tying both hands behind its back first.

Remember that Postgres's feature set is far larger than those alternatives. If you can use a range with an exclusion constraint, an unnest with an array, or the like, you'll be seeing Postgres leave the alternatives in the dust.

Imagine writing a benchmark comparing programming languages, but the benchmark only includes idioms that all tested languages shared in common. Wouldn't be a fair comparison, would it?

song · 132d ago

Yes, but I wonder how many apps using ORMs actually use all features from postgres. So, in the practical use case of typical Rails, Laravel, Django or expressJS, what would the performance look like?

ttfkam · 132d ago

That's an indictment of ORMs more than an evaluation of database options.

pseudopersonal · 133d ago

Does anyone know when the update allowing more concurrent connections is dropping, so we can stop using pgbouncer?

hoherd · 132d ago

I recently read a great article exploring what would change if they were to switch from processes to threads for each connection. Running a connection pooler didn't seem so bad to me after reading it. https://medium.com/@tusharmalhotra_81114/why-postgresql-choo...

__s · 133d ago

That'll likely need conversion from process per connection, so not any time soon

anarazel · 132d ago

FWIW, there actually are some ongoing efforts towards that - including several preparatory changes in PG 18. Still lots more work, but we are working towards it.

https://wiki.postgresql.org/wiki/Multithreading

CoolCold · 132d ago

Do you know, how well pgbouncer works with prepared statement nowadays? We slowly migrating our clusters to pg16/new bouncer and feel unsure on stability and reliability of prepared statements support

rks404 · 132d ago

I remember back in the day when mysql vs postgres was a real debate with mysql being the popular fave. Genuinely interesting to see that postgres eventually won.

cryptonector · 133d ago

I've been following the `AIO 2.5` thread on the PG mailing lists, and I've been salivating at the thought of this being released.

dbbk · 133d ago

This looks promising! Wonder if it's coming to Neon

clarkbw · 132d ago

It is!

And we plan to have 18 out as quickly as we did 17; on the day of release.

WhyNotHugo · 133d ago

It's pretty disappointing that simply using O_NONBLOCK doesn't work as expected on regular files. It would be such a simple and portable mechanism to do async I/O using the same interfaces that we already use for networking.

gitroom · 133d ago

insane how long it took postgres to get async i/o right - feels like all the big changes spark a million little tradeoffs, right? you think stuff like io_uring is finally gonna push postgres to catch up with the clouds

Event Horizon Labs (YC W24) Is Hiring (ycombinator.com)

Adam (YC W25) Is Hiring to Build the Future of CAD (ycombinator.com)

Piramidal (YC W24) Is Hiring Back End Engineer (ycombinator.com)

Mux (YC W16) Is Hiring Engineering ICs and Managers (mux.com)

Bild AI (YC W25) Is Hiring (ycombinator.com)

Infracost (YC W21) Is Hiring First Product Manager to Shift FinOps Left (ycombinator.com)

Crimson (YC X25) is hiring founding engineers in London (ycombinator.com)

Weave (YC W25) is hiring a founding AI engineer (ycombinator.com)

Nango (YC W23) Is Hiring a Staff Back End Engineer (Remote) (jobs.ashbyhq.com)

Gym Class VR (YC W22) Is Hiring – UX Design Engineer (ycombinator.com)

Relace (YC W23) Is Hiring for Code LLMs (SF)

Artie (YC S23) Is Hiring Engineers, AES, and Senior PMM (ycombinator.com)

Depot (YC W23) Is Hiring a Solutions Engineer (Remote US and Canada) (ycombinator.com)

Svix (webhooks as a service) is hiring for a founding marketing lead (svix.com)

Dynamo AI (YC W22) Is Hiring for AI Product Managers (ycombinator.com)

Kapa.ai (YC S23) is hiring research and software engineers (ycombinator.com)

Optery (YC W22) Is Hiring in Engineering, Legal, Sales, Marketing (U.S., Latam) (optery.com)

Telli (YC F24) is hiring engineers, designers, and interns (on-site in Berlin) (hi.telli.com)

Infisical (YC W23) Is Hiring Solutions Engineers to Scale the OSS Security Stack (ycombinator.com)

Channel3 (YC S25) Is Hiring a Founding Engineer, NYC (channel3.notion.site)

Thunder Compute (YC S24) Is Hiring (ycombinator.com)

Deepnote (YC S19) is hiring engineers to build a better Jupyter notebook (deepnote.com)

Prosper AI (YC S23) Is Hiring Founding Account Executives (NYC) (jobs.ashbyhq.com)

The Forecasting Company (YC S24) Is Hiring a Software Engineer (ycombinator.com)

Lago – Open-Source Usage Based Billing – Is Hiring in Sales, Eng, Ops (EU, US) (ycombinator.com)

Ember (YC F24) Is Hiring Full Stack Engineer (ycombinator.com)

LiteLLM (YC W23) is hiring a back end engineer (ycombinator.com)

SigNoz (YC W21, Open Source Datadog) Is Hiring Platform Engineers (Remote) (jobs.ashbyhq.com)

Motion (YC W20) Is Hiring Principal Software Engineers (jobs.ashbyhq.com)

Bild AI (YC W25) Is Hiring an Applied AI Engineer (workatastartup.com)

Text.ai (YC X25) Is Hiring Founding Full-Stack Engineer (ycombinator.com)

Waiting for Postgres 18: Accelerating Disk Reads with Asynchronous I/O

Comments (154)