Jepsen: Amazon RDS for PostgreSQL 17.4

488 points by aphyr | 4/29/2025, 2:30:11 PM | jepsen.io | 116 comments

hliyan · 4h ago
I wish more writing in the software world was done this way:

"Amazon RDS for PostgreSQL is an Amazon Web Services (AWS) service which provides managed instances of the PostgreSQL database. We show that Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation, the strongest consistency model supported across all endpoints. Healthy clusters occasionally allow..."

Direct, to-the-point, unembellished and analogous to how other STEM disciplines share findings. There was a time I liked reading cleverly written blog posts that use memes to explain things, but now I long for the plain and simple.

Twirrim · 7m ago
I'm so past wanting to read meme-laden blog posts. Especially when all too often it's just stretching a paragraph of content. Security vulnerability stuff is probably the worst at it these days.
augustl · 4h ago
Jepsen is awesome, on so many levels!
luhn · 10h ago
It's not mentioned in the headline and not made super clear in the article: This is specific to multi-AZ clusters, which are a relatively new feature of RDS and differ from the multi-AZ instances most will be familiar with. (Clear as mud.)

Multi-AZ instances are a long-standing feature of RDS where the primary DB is synchronously replicated to a secondary DB in another AZ. On failure of the primary, RDS fails over to the secondary.

Multi-AZ clusters have two secondaries, and transactions are synchronously replicated to at least one of them. This is more robust than multi-AZ instances if a secondary fails or is degraded. It also allows read-only access to the secondaries.

Multi-AZ clusters no doubt have more "magic" under the hood, as it's not a vanilla Postgres feature as far as I'm aware. I imagine this is why it's failing the Jepsen test.

ants_a · 1h ago
It's interesting to wonder why this magic would be needed. Vanilla Postgres does support quorum commit, which can do this. You can also set up the equivalent of a multi-AZ cluster with Patroni, and (modulo bugs) it does the necessary coordination to make sure primaries are promoted in a way that does not lose transactions or make visible a transaction that is not durable.

There is still a Postgres deficiency that makes something similar to this pattern possible: non-replicated transactions where the client goes away mid-commit become visible immediately. So in the example, if T1 happens on a partitioned leader and disconnects during commit, T2 also happens on a partitioned node, and T3 and T4 happen later on a new leader, you would also see the same result. However, this does not jibe with the statement that no fault injection was done in this test.

Edit: did not notice the post explaining that this pattern can be explained by inconsistent commit order between replica and primary. Kind of embarrassing given I've done a talk proposing how to fix that.

sontek · 27m ago
Link the talk video
ashu1461 · 10h ago
I have one question.

So if snapshot violations are happening inside multi-AZ clusters, can they happen with a single-region, multiple-read-replica kind of setup as well? And are they perhaps just more easily observable in multi-AZ setups because the lag is higher?

luhn · 8h ago
A synchronous replica via WAL shipping is a well-worn feature of Postgres. I’d expect RDS to be using that feature behind the scenes and would be extremely surprised if that has consistency bugs.
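
(For reference, that feature is a one-line setting on the primary; a minimal sketch, assuming a standby whose application_name is "replica1", a name invented here:)

  -- Minimal single-synchronous-standby sketch: commits wait until the named
  -- standby (matched by its application_name) confirms the WAL flush.
  ALTER SYSTEM SET synchronous_standby_names = 'replica1';
  SELECT pg_reload_conf();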

Two replicas in a "semi-synchronous" configuration, as AWS calls it, is to my knowledge not available in base Postgres. AWS must be using some bespoke replication strategy, which would have different bugs than synchronous replication and be less battle-tested.

But as nobody except AWS knows the implementation details of RDS, this is all idle speculation that doesn’t mean much.

wb14123 · 4h ago
This kind of replication can be configured in vanilla Postgres with something like ANY 3 (s1, s2, s3, s4) in synchronous_standby_names? Doc: https://www.postgresql.org/docs/current/runtime-config-repli...
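
Something like this, I believe, for a two-standby, wait-for-one setup like the one RDS describes (standby names invented):

  -- Quorum commit sketch: a commit returns once ANY 1 of the two named
  -- standbys has confirmed the WAL, i.e. wait-for-one semi-synchronous.
  ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby_az2, standby_az3)';
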
ctapobep · 2h ago
I don't think it's possible with an ANY setup. All you get is that some replicas are more outdated than others. But they won't return two conflicting states, where ReplicaA says tx1 wrote (but not tx2) while ReplicaB says tx2 wrote (but not tx1), which is what Long Fork and Parallel Snapshot Isolation are about.

So Amazon's multi-AZ cluster seems to replicate changes out of order?

mattashii · 1h ago
Kinda. I think it's "just" PostgreSQL behaviour that's to blame here: On replicas, transaction commit visibility order is determined by the order of WAL records; on the primary it's based on when the backend that wrote the transaction notices that its transaction is sufficiently persisted.

See also my comment https://news.ycombinator.com/item?id=43843790 elsewhere in this thread

havkom · 3h ago
Good investigation!

Software developers nowadays barely know about transactions, and definitely not about different transaction models (in my experience). I have even encountered "senior developers" (who are actually so-called "CRUD developers") who are clueless about database transactions. In reality, transactions and transaction models matter a lot for performance and error-free code (at least when you have volumes of traffic and your software solves something non-trivial).

For example: after a lot of analysis, I switched from SQL Server's standard Read Committed to Read Committed Snapshot Isolation in a large project. The users could not be happier: a lot of locking contention disappeared. No software engineer on that project had any clue about transaction models or locks before I taught them some basics (even though they had used transactions extensively in that project).
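
For anyone wanting to try the same, the switch itself is a single statement (database name invented; enabling it needs a moment of exclusive access, hence the ROLLBACK IMMEDIATE clause):

  -- Enable Read Committed Snapshot Isolation: readers see row versions
  -- instead of blocking on writers' locks.
  ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;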

ljm · 3h ago
I’ve noticed the lack of transaction awareness mostly in serverless/edge contexts where the backend architecture (if you can even call it that) is driven exclusively by the needs of the client. For instance, database queries are modelled as react hooks or sequential API calls.

I’ve seen this work out terribly at certain points in my career.

cswilliams · 12h ago
Interesting. At a previous company, when we changed the pg_dump command in a backup script to start using parallel workers (the -j flag), we started to see rare errors that suggested inconsistency when restoring the backups (duplicate key errors and FK constraint errors). At the time, I tried reporting the issue to both AWS and the Postgres mailing list but never got anywhere, since I could not easily reproduce it. We eventually gave up and went back to single-threaded dumps. I wonder if this issue is related to the behavior we were seeing.
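
(Context for why parallelism matters here: a parallel pg_dump is only consistent because the leader exports its snapshot and every worker attaches to it. Roughly, and hedged, each worker does something like the following, so if snapshots on a replica are themselves inconsistent, the whole scheme wobbles:)

  -- Sketch of what a parallel pg_dump worker effectively runs
  -- (snapshot id invented):
  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
  SET TRANSACTION SNAPSHOT '00000003-0000001B-1';  -- id exported by the leader
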
belter · 12h ago
Was it a single instance, one instance with a standby in another AZ, or a multi-AZ cluster as tested here?
cswilliams · 10h ago
We saw it when we ran pg_dump off a standby instance (or a "replica," to use RDS terminology). Our primary was a multi-AZ instance. So not exactly what they tested here, I guess, but it makes me wonder what changes, if any, they've made to Postgres under the hood.
baq · 5h ago
> This work was performed independently by Jepsen, without compensation

not what an RDBMS stakeholder wants to wake up to on the best of days. I'd imagine there were a couple of emails expressing concern internally.

hats off to aphyr as usual.

tasuki · 5h ago
What's a "RDBMS stakeholder" ?

(Hats off to aphyr for sure!)

bobnamob · 5h ago
The three layers of middle management between engineers and whichever director owns this particular incarnation of RDS
baq · 4h ago
a stakeholder is anyone who has any business at all with the system - customer, engineer, manager, etc.

RDBMS - https://en.wikipedia.org/wiki/Relational_database#RDBMS

ezekiel68 · 15h ago
In my reading of this, it looks like the practical implication could be that reads happening quickly after writes to the same row(s) might return stale data. The write transaction gets marked as complete before all of the distributed layers of a multi-AZ RDS instance have been fully updated, such that immediate reads of the same rows might return nothing (if the row does not exist yet) or older values (if the columns have not been fully updated).

Due to the way PostgreSQL does snapshotting, I don't believe this implies such a read might obtain a nonsense value due to only a portion of the bytes in a multi-byte column type having been updated yet.

It seems like a race condition that becomes eventually consistent. Or did anyone read this as if the later transaction(s) of a "long fork" might never complete under normal circumstances?

aphyr · 15h ago
This isn't just stale data, in the sense of "a point-in-time consistent snapshot which does not reflect some recent transactions". I think what's going on here is that a read-only transaction against a secondary can observe some transaction T, but also miss transactions which must have logically executed before T.
kevincox · 1h ago
To provide a simple (although contrived) example of the type of thing that can happen. Imagine that you have a table with three columns `gps_coordinate`, `postal_code` and `city`. The way these are set is that the new coordinate gets posted to the API and `gps_coordinate` is updated. This then kicks off a background task that uses the new coordinate to lookup and update `postal_code`. Then another background task uses the postal code to look up and set `city`.

Since these happen sequentially, for a single update of `gps_coordinate` you would only expect to be able to observe one of:

1. Nothing updated yet, all columns have the previous value.

2. `gps_coordinate` updated, with `postal_code` and `city` still having the previous values.

3. `gps_coordinate` and `postal_code` updated with `city` still having the previous value.

4. All fields updated.

But the ordering that aphyr proved is possible allows you to see "impossible" states such as

1. `postal_code` updated with `gps_coordinate` and `city` still having the previous values.

2. `city` updated with `gps_coordinate` and `postal_code` still having the previous values.

Basically since these transactions happen in order and depend on one another you would expect that you can only see the "left to right" progression. But actually you can see some subset of the transactions applied even if that isn't a possible logical state of the database.
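
In SQL terms, a hedged sketch (table and values invented), with each UPDATE committing as its own transaction on the primary, strictly in this order:

  UPDATE locations SET gps_coordinate = '59.33,18.07' WHERE id = 1;  -- T1
  UPDATE locations SET postal_code = '111 29' WHERE id = 1;          -- T2, after T1
  UPDATE locations SET city = 'Stockholm' WHERE id = 1;              -- T3, after T2
  -- A reader on a secondary may still observe, say, T2's write without T1's:
  SELECT gps_coordinate, postal_code, city FROM locations WHERE id = 1;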

mikesun · 13h ago
"I think what's going on here is that a read-only transaction against a secondary can observe some transaction T, but also miss transactions which must have logically executed before T."

i was intuitively wondering the same but i'm having trouble reasoning how the post's example with transactions 1, 2, 3, 4 exhibits this behavior. in the example, is transaction 2 the only read-only transaction and therefore the only transaction to read from the read replica? i.e. transactions 1, 3, 4 use the primary and transaction 2 uses the read replica?

aphyr · 13h ago
Yeah, that's right. It may be that the (apparent) order of transactions differs between primary and secondary.
franckpachot · 10m ago
The Write-Ahead Log (WAL) is single-threaded and maintains consistency at a specific point in time within each instance. However, there can be anomalies between two instances. This behavior is expected because the RDS Multi-AZ cluster does not wait for changes to be applied in the shared buffers; it only waits for the WAL to sync. This is similar to the behavior of PostgreSQL when synchronous_commit is set to on. Nothing unexpected.
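
(For applications that need replica reads to observe their commits, vanilla Postgres does have a stronger, slower setting; a hedged sketch:)

  -- Commit waits until synchronous standbys have *applied* the WAL, so a
  -- subsequent read on the replica can see the transaction. Higher latency.
  SET synchronous_commit = 'remote_apply';
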
mikesun · 12h ago
ah, so something like... if the primary ordered transaction 3 < transaction 1, but transaction 2 observes only transaction 1 on the read-only secondary potentially because the secondary orders transaction 1 < transaction 3?
nijave · 16h ago
It's not entirely clear, but is this an issue in multi-instance upstream Postgres clusters?

Am I correct in understanding that either AWS is doing something with the cluster configuration or has added some patches that introduce this behavior?

aphyr · 13h ago
This is a very good question! I do not understand AWS's replication architecture well enough to reimplement it with standard Postgres yet. This behavior doesn't happen in single-node Postgres, as far as I can tell, but it might happen in some replication setups!

I also understand there are lots of ways to do Postgres replication in general, with varying results. For instance, here's Bin Wang's report on Patroni: https://www.binwang.me/2024-12-02-PostgreSQL-High-Availabili...

mattashii · 1h ago
> It's not entirely clear, but is this an issue in multi-instance upstream Postgres clusters?

No, it isn't an issue with single-instance PostgreSQL clusters. Multi-instance PostgreSQL clusters (single primary, plus streaming/physical replicas) are affected.

What they, too, discovered is that PostgreSQL currently doesn't have consistent snapshot behaviour between the primary and replicas. Presumably, read-only transaction T2 was executed on a secondary (replica) node, while T1, T3, and T4 (all modifying transactions) were executed on the primary.

Some background:

Snapshots on secondary PostgreSQL nodes rely on transaction persistence order (location of commit record in WAL) to determine which transactions are visible, while the visibility order on the primary is determined by when the backend that authorized the transaction first got notice that the transaction was completely committed (and then got to marking the transaction as committed). On each of these (primary and secondary) the commit order is consistent across backends that connect to that system, but the commit order may be somewhat different between the primary and the secondary.

There is some work ongoing to improve this, but that's still very much WIP.

aphyr · 8m ago
Thank you mattashii--this would definitely explain it. I've also received another email suggesting this anomaly is due to the difference in commit/visibility order between primary and secondary. Is there by chance a writeup of this available anywhere that I can link to? It looks like https://postgrespro.com/list/thread-id/1827129 miiight be related, but I'm not certain. If so, I'd like to update the report.

My email is aphyr@jepsen.io, if you'd like to drop me a line. :-)

aeyes · 10h ago
What are multi-instance upstream Postgres clusters for you? PostgreSQL has no official support for failover of a master instance; the only mechanism is Postgres replication, which you can make synchronous. Then you can build your own tooling around this to build a Postgres cluster (Patroni is one such tool).

AWS patched Postgres to replicate to two instances and to call it good if one of the two acknowledges the change. When this ack happens is not public information.

My personal opinion is that filesystem-level replication (think DRBD) is the better approach for PostgreSQL. I believe this is what the old-school AWS Multi-AZ instances do. But you get lower throughput and you can't read from the secondary instance.

nijave · 2h ago
>My personal opinion is that filesystem level replication (think drbd) is the better approach for PostgreSQL

That's basically what their Aurora variant does. It uses clustered/shared storage, then uses traditional replication only for cache invalidation (so replicas know when data loaded into memory/cache has changed on the shared storage).

belter · 15h ago
Yes, it's different. This is a deeper overview of what they did: https://youtu.be/fLqJXTOhUg4

Specifically here: https://youtu.be/fLqJXTOhUg4?t=434

tibbar · 16h ago
The submitted title buries the lede: RDS for PostgreSQL 17.4 does not properly implement snapshot isolation.
aphyr · 14h ago
Folks on HN are often upset with the titles of Jepsen reports, so perhaps a little more context is in order. Jepsen reports are usually the product of a long collaboration with a client. Clients often have strong feelings about how the report is titled--is it too harsh on the system, or too favorable? Does it capture the most meaningful of the dozen-odd issues we found? Is it fair, in the sense that Jepsen aims to be an honest broker of database safety findings? How will it be interpreted in ten years when people link to it routinely, but the findings no longer apply to recent versions? The resulting discussions can be, ah, vigorous.

The way I've threaded this needle, after several frustrating attempts, is to have a policy of titling all reports "Jepsen: <system> <version>". HN is of course welcome to choose their own link text if they prefer a more descriptive, or colorful, phrase. :-)

dang · 14h ago
Given that author and submitter (and commenter!) are all the same person I think we can go with your choice :)

The fact that the thread is high on HN, plus the GP comment is high in the thread, plus that the audience knows how interesting Jepsen reports get, should be enough to convey the needful.

broost3r · 12h ago
long-time lurker here who registered on HN many years ago after reading Jepsen: Cassandra

the Jepsen writeups will surely stand the test of time. thank you!

altairprime · 14h ago
I emailed the mods and asked them to change it to this phrase copy-pasted from the linked article:

> Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation

altairprime · 11h ago
(The mods replied above; thank you!)
belter · 15h ago
And your comment also buries it: this is in Multi-AZ clusters.

Well, this is from Kyle Kingsbury, the Chuck Norris of transactional guarantees. AWS has to reply or clarify, even if it only seems to apply to Multi-AZ clusters. Those are one of the two possibilities for RDS with Postgres: Multi-AZ deployments can have one standby or two standby DB instances, and this finding is for the two-standby case. [1]

They make no such promises in their documentation. Their 5,494-page manual on RDS hardly mentions isolation or serializability, except in the documentation of parameters for the different engines.

Nothing on global read consistency for Multi-AZ clusters, because why should they... :-) They talk about semi-synchronous replication, so the writer waits for one standby to confirm the log record, but the two readers can be on different snapshots?

[1] - "New Amazon RDS for MySQL & PostgreSQL Multi-AZ Deployment Option: Improved Write Performance & Faster Failover" - https://aws.amazon.com/blogs/aws/amazon-rds-multi-az-db-clus...

[2] - "Amazon RDS Multi-AZ with two readable standbys: Under the hood" - https://aws.amazon.com/blogs/database/amazon-rds-multi-az-wi...

n2d4 · 14h ago
> They make no such promises in their documentation. Their 5494 pages manual on RDS hardly mentions isolation or serializable

Well, as a user, I wish they would mention it though. If I migrate to RDS with multi-AZ after coming from plain Postgres (which documents snapshot isolation as a feature), I would probably want to know how the two differ.

gymbeaux · 15h ago
Par for the course
badmonster · 10h ago
What safety or application-level bugs could arise if developers assume Snapshot Isolation but Amazon RDS for PostgreSQL is actually providing only Parallel Snapshot Isolation, especially in multi-AZ configurations using the read replica endpoint?
ctapobep · 1h ago
Consider this: you leave a comment under a post. The user who posts first deserves a "first commenter badge". Now:

- User1 comments

- User2 comments

- User1 checks (in a separate tx) that there's only 1 comment, so User1 gets the badge

- User2 checks the same (in a separate tx) and also sees only 1 comment (his), and also receives the badge.

With Snapshot Isolation this isn't possible: at least one of the checks made in a separate tx would see 2 comments.
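
A hedged sketch of the failure (schema invented), with both checks running after both comments have committed:

  -- User1's check, against one snapshot/replica:
  SELECT count(*) FROM comments WHERE post_id = 42;  -- returns 1 (only their own)
  -- User2's check, against another snapshot/replica:
  SELECT count(*) FROM comments WHERE post_id = 42;  -- also returns 1: a long fork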

The original article on Parallel Snapshot Isolation is a good read: https://scispace.com/pdf/transactional-storage-for-geo-repli...

Elucalidavah · 6h ago
Consider a "git push"-like flow: begin a transaction, read the current state, check that it matches the expected, write the new state, commit (with a new state hash). In some unfortunate situations, you'll have a commit hash that doesn't match any valid state.

And the mere fact that it's hard to reason about these things means that it's hard to avoid problems. Hence, the easiest solution is likely "it may be possible to recover Snapshot Isolation by only using the writer endpoint," for anything where a write is in any way conditional on a read.

Although I'm surprised the "only using the writer endpoint" method wasn't tested, especially in availability loss situations.

mushufasa · 15h ago
> These phenomena occurred in every version tested, from 13.15 to 17.4.

I was worried I had made the wrong move upgrading major versions, but it looks like this is not that. This is not a regression, just a longstanding bug or missing feature.

password4321 · 14h ago
It would be great to get all the Amazon RDS flavors Jepsen'd.
aphyr · 14h ago
I have actually been working on this (very slowly, in occasional nights and weekends!) Peter Alvaro and I reported on a safety issue in RDS for MySQL here too: https://jepsen.io/analyses/mysql-8.0.34#fractured-read-like-...
password4321 · 12h ago
There is a universe where cloud providers announce each new database offering by commissioning a Jepsen test and iterating on the results until every issue has been resolved or at least documented.

Unfortunately reliability is not that high on the priority list here. Keep up the good work!

film42 · 15h ago
I think AWS will need to update their documentation to communicate this. Will a snapshot isolation fix introduce a performance regression in latency or throughput? Or, maybe they stand by what they have as being strong enough. Either way, they'll need to say something.
kevincox · 14h ago
I think the ideal solution from AWS would be fixing the bug and actually providing the guarantees that the docs say that they do.
film42 · 14h ago
I agree, but I have a feeling this isn't a small fix. It sounds like someone picked a mechanism that seemed to be equivalent but is not. Swapping that out will require a lot of time and testing.
mdaniel · 12h ago
> Swapping that will require a lot of time and testing.

Lucky them, there is an automated suite[1] to verify the correct behavior :-D

1: https://github.com/jepsen-io/rds

slt2021 · 12h ago
there is no trivial fix for this without hurting performance. Roughly, there is no free lunch in distributed systems; AWS made a tradeoff to relax consistency guarantees for that specific setup, and didn't really advertise it.
zaphirplane · 14h ago
Yet below your comment is a quote that this has been the case since v13, and above it is a comment that there is no mention in the docs.

Is using the words "bug" and "guarantee" throwing casual readers off the mark?

belter · 12h ago
It looks like a bug, but the problem is that the documentation does not detail what guarantees are offered in this scenario. I would love it if somebody could point me to where it does...
oblio · 14h ago
I wonder how Aurora fares on this?
nijave · 2h ago
If I'm understanding the issue correctly, Aurora probably isn't affected.

From what I understand, multi-AZ has some setup with multiple semi-synchronous replicas where only one replica needs to acknowledge the transaction.

Aurora doesn't use semi-synchronous replication; it uses clustered/shared storage with a different replication setup for cache invalidation.

RachelF · 10h ago
I wondered how Microsoft SQL Server fares, but it's not tested in the long list of databases:

https://jepsen.io/analyses

__float · 9h ago
It may violate the SQL Server license? Microsoft apparently have not paid for a Jepsen analysis (or perhaps don't want it public :))
KronisLV · 7h ago
> Microsoft apparently have not paid for a Jepsen analysis (or perhaps don't want it public :))

If I were some database vendor that sometimes plays fast and loose (not saying Microsoft is, just an example) and my product were good for 99.95% of use cases with the remainder exceedingly hard to fix, I'd probably be more likely to pay for Jepsen not to do an analysis, because hiring them would make people more likely to leave an otherwise sufficient product once those faults were brought to light.

VeejayRampay · 6h ago
Aurora doesn't offer PostgreSQL 17 for now, I think
phonon · 3h ago
"They occurred in every PostgreSQL version we tested, from 13.15 (the oldest version which AWS supported) to 17.4 (the newest)."

So it's unlikely v17 will make a difference.

wb14123 · 12h ago
Surprised to see Amazon RDS doesn't pass such a simple test. Nicely done!
belter · 12h ago
This is not a single RDS instance or a single instance with a standby in another AZ. It's a more specific and advanced cluster setup.
wb14123 · 4h ago
Yeah, yet this is a happy path with no faults injected. Still surprised to see this kind of scenario was not tested inside AWS.
cr3ative · 16h ago
This is in such a thick academic style that it is difficult to follow what the problem actually might be and how it would impact someone. This style of writing serves mostly to remind me that I am not a part of the world that writes like this, which makes me a little sad.
glutamate · 16h ago
In the beginning, when you read papers like this, it can be hard work. You can either give up or put some effort in to try to understand it. Maybe look at some of the other Jepsen reports, some may be easier. Or perhaps an introductory CS textbook. With practice and patience it will become easier to read and eventually write like this.

You may not be part of that world now, but you can be some day.

EDIT: forgot to say, i had to read 6 or 7 books on Bayesian statistics before i understood the most basic concepts. A few years later i wrote a compiler for a statistical programming language.

cr3ative · 16h ago
I’ll look to do so, and appreciate your pointers. Thank you for being kind!
concerndc1tizen · 15h ago
The state of the art is always advancing, which greatly increases the burden of starting from first principles.

I somewhat feel that there was a generation that had it easier, because they were pioneers in a new field, allowing them to become experts quickly, while improving year-on-year, being paid well in the process, and having great network and exposure.

Of course, it can be done, but we should at least acknowledge that sometimes the industry is unforgiving and simply doesn't have on-ramps except for the privileged few.

_AzMoo · 15h ago
> I somewhat feel that there was a generation that had it easier

I don't think so. I've been doing this for nearly 35 years now, and there's always been a lot to learn. Each layer of abstraction developed makes it easier to quickly iterate towards a new outcome faster or with more confidence, but hides away complexity that you might eventually need to know. In a lot of ways it's easier these days, because there's so much information available at your fingertips when you need it, presented in a multitude of different formats. I learned my first programming language by reading a QBasic textbook trying to debug a text-based adventure game that crashed at a critical moment. I had no Internet, no BBS, nobody to help, except my Dad who was a solo RPG programmer who had learned on the job after being promoted from sweeping floors in a warehouse.

robertlagrant · 4h ago
> solo RPG programmer

The kids might not know this means "IBM mainframe" rather than "role playing game" :)

jorams · 15h ago
It uses a lot of very specific terminology, but the linked pages like the one on "G-nonadjacent" do a lot to clear up what it all means. It is a lot of reading.

Essentially: The configuration claims "Snapshot Isolation", which means every transaction looks like it operates on a consistent snapshot of the entire database at its starting timestamp. All transactions starting after a transaction commits will see the changes made by that transaction. Jepsen finds that the snapshot a transaction sees doesn't always contain everything that was committed before its starting timestamp. Transactions A and B can both commit their changes, then transactions C and D can start, with C only seeing the change made by A and D only seeing the change made by B.

deathanatos · 12h ago
I empathize with the feeling of this being dense and unapproachable; I remember when I was first approaching these posts, and feeling the same.

For this particular one, the graph under "Results" is the most approachable portion, I think. (Don't skip the top two sections, though … and they're so short.) In the graph, each line is a transaction, and read them left-to-right.

Hopefully I get this right, though if I do not, I'm sure someone will correct me. Our database is a set of ordered lists of integers. Something like,

  CREATE TABLE test (
    id int primary key,
    -- (but specifically, this next column holds a list of ints, e.g.,
    --  a value might be, '1,8,11'; the list of ints is a comma separated
    --  string.)
    list text not null
  );
The first transaction:

  a 89 9
This is shorthand; it means "(a)ppend to list #89 the integer 9" (in SQL, crudely, this is perhaps something like

  UPDATE test SET list = CONCAT(list, ',9') WHERE id = 89;
… though we'd need to handle the case where the list doesn't exist yet, turning it into an `INSERT … ON CONFLICT … DO UPDATE …`, so it would get gnarlier.[2]); the next:

  r 90 nil    # read list 90; the result is nil
  r 89 [4 9]  # read list 89; the result is [4, 9]
  r 90 nil    # read list 90; the result is (still) nil
I assume you can `SELECT` ;) That should provide sufficient syntax for one to understand the remainder.

The arrows indicate the dependencies; if you click "read-write dependencies"[1], that page explains it.

Our first transaction appends 9 to list 89. Our second transaction reads that same list, and sees that same 9; thus, it must start after the first transaction has committed. The remaining arrows form similar dependencies, and once you take them all into account, they form a cycle; this should feel problematic. The problem is precisely that they're in a cycle, which snapshot isolation does not permit, so we've observed a contradiction in the system: these cannot be obeying snapshot isolation. (This is what "To understand why this cycle is illegal…" gets at; it is fairly straightforward. T₁ is the first row in the graph, T₂ the second, and so forth. But it is only straightforward once you've understood the graph, I think.)

> This is in such a thick academic style that it is difficult to follow what the problem actually might be and how it would impact someone.

I think a lot of this is because it is written with precision, and that precision requires a lot of academic terminology.

Some of it is just syntax peculiar to Jepsen, which I think comes from Clojure, a language most of us (myself included) are just not familiar with. Hence why I used SQL and comma-separated lists in my commentary above; that is likely more widely read. It's a bit rough when you first approach it, but once you get the notation, the payoff is worth it, I guess.

More generally, I think once you grasp the graph syntax & simple operations used here, it becomes easier to read other posts, since they're mostly graphs of transactions that, taken together, make no logical sense at all. Yet they happened!

> This style of writing serves mostly to remind me that I am not a part of the world that writes like this, which makes me a little sad.

I think Jepsen posts, with a little effort, are approachable. This post is a good starter post; normally I'd say Jepsen posts tend to inject faults into the system, as we're testing if the guarantees of the system hold up under stress. This one has no fault injection, though, so it's a bit simpler.

Beware though, that if you learn to read these, that you'll never trust a database again.

[1]: https://jepsen.io/consistency/dependencies

[2]: I think this is it? https://github.com/jepsen-io/postgres/blob/225203dd64ad5e5e4... — but this is pushing the limits of my own understanding.

mdaniel · 12h ago
> Beware though, that if you learn to read these, that you'll never trust a database again.

I chuckled, but (while I don't have links to offer) I could have sworn that some of them actually passed, and a handful of others took the report to heart and fixed the bugs. I similarly recall a product showing up to their Show HN or Launch HN with a Jepsen report in hand, the maturity of which I was especially in awe of (assuming, of course, I'm not hallucinating such a thing)

renewiltord · 15h ago
It's maximal information communication. Use an LLM to distill it to your own knowledge level. That's trivial with modern LLMs. Very good output in general.
benatkin · 15h ago
It addresses the reader no matter how knowledgeable they are. It's a very good use of hypertext, making it so that a knowledgeable reader won't need to skip over much.
ZYbCRq22HbJ2y7 · 15h ago
> such a thick academic style

Why? Because it has variables and a graph?

What sort of educational background do you have?

vlovich123 · 14h ago
Have you tried using an LLM? I’ve found good results getting at the underlying concepts and building a mental model that works for me that way. It makes domain expertise - that often has unique terminology for concepts you already know or at least know without a specific name - more easily accessible after a little bit of a QA round.
vlovich123 · 11h ago
Lots of downvotes with no actual explanation of what the issue is with my suggestion.

I’ve repeatedly used ChatGPT and Claude to help me understand papers and to cut through the verbiage to the underlying concepts.

gitroom · 9h ago
honestly this made me side-eye aws docs hard, i always think snapshot isolation just means what it says. good catch
henning · 16h ago
I thought this kind of bullshit was only supposed to happen in MongoDB!
kabes · 16h ago
Then you haven't read enough Jepsen reports. Distributed-system guarantees generally can't be trusted
__alexs · 16h ago
Postgres isn't usually a distributed system in this configuration though, is it?
semiquaver · 16h ago
The result is for “Amazon RDS for PostgreSQL multi-AZ clusters” which are certainly a distributed system.

I’m not well versed in RDS but I believe that clustered is the only way to use it.

dragonwriter · 15h ago
An RDS cluster can have a single instance (but it can't be multi-AZ with a single instance.)
NewJazz · 16h ago
No, you can have single instances
reissbaker · 15h ago
This writeup tested multi-AZ RDS for Postgres — which is always distributed behind the scenes (otherwise, it couldn't exist in multiple AZs).
dragonwriter · 15h ago
A multi-AZ cluster is necessarily a distributed system.
bananapub · 16h ago
I think ZooKeeper is still the only distributed system that got through Jepsen without data-loss bugs, though at high cost: https://aphyr.com/posts/291-jepsen-zookeeper
robterrell · 16h ago
Didn't FoundationDB get a clean bill of health?
necubi · 15h ago
Aphyr didn't test FoundationDB himself, but the FoundationDB team did their own Jepsen testing, which they reported passing. All of this was a long time ago, before FoundationDB was bought by Apple and open-sourced.

Now members of the original Foundation team have started Antithesis (https://antithesis.com/) to make it easier for other systems to adopt this sort of testing.

MarkMarine · 15h ago
It wasn't tested because: "haven't tested foundation in part because their testing appears to be waaaay more rigorous than mine."

https://web.archive.org/web/20150312112552/http://blog.found...

bananapub · 15h ago
apparently it wasn't tested because Kyle thought the internal testing was better than Jepsen itself: https://abdullin.com/foundationdb-is-back/
colesantiago · 16h ago
Do people still use MongoDB in production?

I was quite surprised to read that Stripe used MongoDB in the early days and still does today, and I can't imagine the sheer nightmares they must have faced using it for all these years.

senderista · 16h ago
MongoDB has come a long way. They acquired a world-class storage engine (WiredTiger) and then they hired some world-class distsys people (e.g. Murat Demirbas). They might still be hamstrung by early design and API choices but from what I can tell (never used it in anger) the implementation is pretty solid.
computerfan494 · 15h ago
MongoDB is a very good database, and these days, at scale, I am significantly more confident in its correctness guarantees than in any of the half-baked Postgres horizontal scaling solutions. I have run both databases at seven-figure-a-month spend scale, and I would not choose off-the-shelf Postgres for this task again.
colechristensen · 16h ago
mongodb is a public company with a market cap of 14.2 billion dollars. so yes, people still use it in production
djfivyvusn · 15h ago
I've been looking for a job the last few weeks.

Literally the only job ad I've seen talking about MongoDB was a job ad for MongoDB itself.

Thaxll · 15h ago
Those memes are 10 years old. You know that some very big tech companies use MongoDB, right? We're talking billions a year.
xmodem · 14h ago
Billion-dollar companies lose their customers' data all the time.
djfivyvusn · 15h ago
What is your point?
Thaxll · 12h ago
MongoDB is reliable.
skywhopper · 15h ago
This is an unfortunate report in a lot of ways. First, the title is incomplete. Second, there’s no context as to the purpose of the test and very little about the parameters of the test. It makes no comparison to other PostgreSQL architectures except one reference at the end to a standalone system. Third, it characterizes the transaction isolation of this system as if it were a failure (see comments in this thread assuming this is a bug or a missing feature of Postgres). Finally, it never compares the promises made by the product vendors to the reality. Does AWS or Postgres promise perfect snapshot isolation?

I understand the mission of the Jepsen project but presenting results in this format is misleading and will only sow confusion.

Transaction isolation involves a ton of tradeoffs, and the tradeoffs chosen here may be fine for most use cases. The issues can be easily avoided by doing any critical transactional work against the primary read-write node only, which would be the only typical way in which transactional work would be done against a Postgres cluster of this sort.

Sesse__ · 15h ago
Postgres does indeed promise perfect snapshot isolation, and Amazon does not (to the best of my knowledge) document that their managed Postgres service weakens Postgres’ promises.
billiam · 15h ago
New headline: AWS RDS is not CockroachDB or Spanner. And it's not trying to be.
senderista · 11h ago
However, Aurora DSQL is trying to compete with both CDB and Spanner, and they explicitly promise snapshot isolation.
anentropic · 3h ago
But this test wasn't of Aurora DSQL
cnlwsu · 13h ago
Cockroach doesn't offer strict serializability; it has serializability with some limits depending on clock drift. Also, CockroachDB does not provide linearizability over the entire database.