Systems Correctness Practices at Amazon Web Services

345 tanelpoder 127 5/30/2025, 12:43:13 PM cacm.acm.org ↗

amazingamazing · 1d ago
> Deterministic simulation. Another lightweight method widely used at AWS is deterministic simulation testing, in which a distributed system is executed on a single-threaded simulator with control over all sources of randomness, such as thread scheduling, timing, and message delivery order. Tests are then written for particular failure or success scenarios, such as the failure of a participant at a particular stage in a distributed protocol. The nondeterminism in the system is controlled by the test framework, allowing developers to specify orderings they believe are interesting (such as ones that have caused bugs in the past). The scheduler in the testing framework can also be extended for fuzzing of orderings or exploring all possible orderings to be tested.
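A minimal sketch of the quoted idea (illustrative Python only, not any AWS tooling): run every actor on a single thread, and let one seeded RNG choose message delivery order. The seed is then the sole source of nondeterminism, so any failing schedule replays exactly from its seed.

```python
import random

class Simulator:
    """Single-threaded simulator: all message delivery order comes
    from one seeded RNG, so every run is fully reproducible."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.inflight = []  # (destination, message) pairs awaiting delivery

    def send(self, dest, msg):
        self.inflight.append((dest, msg))

    def run(self):
        # Deliver pending messages in an order chosen by the seeded RNG.
        while self.inflight:
            i = self.rng.randrange(len(self.inflight))
            dest, msg = self.inflight.pop(i)
            dest.on_message(self, msg)

class Node:
    """Trivial actor that just records the order messages arrived."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def on_message(self, sim, msg):
        self.log.append(msg)

def trial(seed):
    sim = Simulator(seed)
    n = Node("n1")
    for m in ["a", "b", "c"]:
        sim.send(n, m)
    sim.run()
    return n.log

# Same seed, same delivery order: a bug found at seed 7 can be
# replayed deterministically with seed 7.
assert trial(7) == trial(7)
```

A real framework additionally controls clocks and thread scheduling, and replaces the random scheduler with fuzzed or exhaustive ordering search, but the replay-from-seed property is the core of it.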

Any good open source libraries that do this that are language agnostic? Seems doable: spin up a container with some tools inside it. The tools would need some middleware hook to know when a test is about to run; during the run, they make certain things (networking, storage, etc.) "deterministic" in the context of that test.

This is more-or-less what Antithesis does, but I haven't seen anything open source yet.

You could, of course, write your tests well enough to stub out I/O, but that's work, and not everyone will write their tests well anyway (you should do this regardless, but it's nicer IMO if this determinism lives a layer above the application).

As a slight sidebar: I'm not really bullish on AI, but I think testing is one of the things where AI will hopefully shine, because the feedback loop during prompting can be driven by your actual application requirements, such that the test implementation (driven by AI), the requirements (driven by you as the prompt), and the "world" (driven by the actual code being tested) can hopefully drive all three toward some theoretical ideal. If AI gives us anything, I'm hoping it can make software a more rigorous discipline by making formal verification more doable.

wwilson · 23h ago
There have historically been two giant adoption challenges for DST.

(1) Previously, you had to build your entire system around one of the simulation frameworks (and then not take any dependencies).

(2) It’s way too easy to fool yourself with weak search/input generation, which makes all your tests look green when actually you aren’t testing anything nontrivial.

As you say, Antithesis is trying to solve both of these problems, but they are very challenging.

I don’t know of anybody else who has a reliable way of retrofitting determinism onto arbitrary software. Facebook’s Hermit project tried to do this with a deterministic Linux userspace, but is abandoned. (We actually tried the same thing before we wrote our hypervisor, but found it didn’t work well).

A deterministic computer is a generically useful technology primitive beyond just testing. I’m sure somebody else will create one someday, or we will open-source ours.

nine_k · 15h ago
I suspect you can relatively easily obtain a completely deterministic machine by running QEMU in 100% emulation mode in one thread. But what you are after is controlled deterministic execution, and that's far harder. Making your multiple processes follow a specific dance that triggers an interesting condition must be very involved when seen from a level as low as the CPU and OS scheduler. Hence a language-agnostic setup is hard to achieve, and especially hard to make do your bidding. It may drown you in irrelevant details.

I once built a much, much simpler thing that made it possible to run multiple JVM threads in a particular kind of lockstep, by stubbing and controlling I/O operations and the advance of the system time. With that, I could run several asynchronously connected components in particular interaction patterns, including not just different thread activation orders but also I/O failures, retries, etc. It was manageable, and it helped uncover a couple of nasty bugs before the code ever ran in prod.

But that was only possible because I went with drastic simplifications, controlling not the whole system but only particular synchronization points. It won't detect a generic data race where explicit synchronization was simply forgotten.
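A stripped-down sketch of that kind of setup, in Python with invented names: a virtual clock that only advances when the test says so, plus stubbed I/O with a scripted failure pattern, is enough to drive a retry path through an exact, reproducible failure sequence.

```python
import heapq

class VirtualClock:
    """Deterministic stand-in for wall time: scheduled callbacks
    fire only when the test explicitly advances the clock."""

    def __init__(self):
        self.now = 0.0
        self._timers = []  # (deadline, seq, callback) heap
        self._seq = 0      # tiebreaker keeps ordering deterministic

    def call_at(self, deadline, cb):
        heapq.heappush(self._timers, (deadline, self._seq, cb))
        self._seq += 1

    def advance(self, dt):
        self.now += dt
        while self._timers and self._timers[0][0] <= self.now:
            _, _, cb = heapq.heappop(self._timers)
            cb()

class FlakyIO:
    """Stubbed I/O that fails a scripted number of times, so the
    retry logic is exercised by an exact failure pattern."""

    def __init__(self, failures):
        self.failures = failures
        self.attempts = 0

    def write(self, data):
        self.attempts += 1
        if self.attempts <= self.failures:
            raise IOError("injected failure")
        return "ok"

def write_with_retry(clock, io, data, results, delay=1.0):
    def attempt():
        try:
            results.append(io.write(data))
        except IOError:
            clock.call_at(clock.now + delay, attempt)  # schedule a retry
    attempt()

clock, io, results = VirtualClock(), FlakyIO(failures=2), []
write_with_retry(clock, io, "x", results)
clock.advance(1.0)   # first retry fires, still fails
clock.advance(1.0)   # second retry fires and succeeds
```

The controlled parts here are only time and I/O; as the comment above notes, a forgotten lock elsewhere in the system would go undetected.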

ajb · 10h ago
I'm sure I heard that something like this existed for the JVM ages ago (like 15 years). I don't remember the details so it might not be quite the same, but a colleague was telling me about some tech which would test your concurrency by automatically selecting bad scheduling orders.
bit_razor · 22h ago
amazingamazing · 22h ago
Trust me, I love FDB, but that's not the same thing. The FDB team, IIRC, had to write their own programming language to do this. It's not an agnostic layer above the application.

The problem with coupled tooling is that no one will use it. That's what is cool about antithesis. If they're able to complete their goal, that's basically what will be achieved.

Lwerewolf · 19h ago
I guess you meant to say "only the people working on the software coupled to the tooling will use it". It's not just FDB & Amazon that are using something like this, and it is a ridiculously powerful type of tool for debugging distributed systems.
amazingamazing · 19h ago
ah, yes.
bit_razor · 21h ago
Fair point. I was thinking about Antithesis, but it's not open source (yet?). Turns out I also didn't read your comment well enough. Back to lurking I go.
crvdgc · 22h ago
https://rr-project.org/ for languages that can be debugged by gdb.
smj-edison · 9h ago
+1 for rr. Bonus feature is you can also time-travel debug! It's spoiled me forever...
john2x · 20h ago
There was a talk from Joe Armstrong about using property testing to test Dropbox.
simonw · 1d ago
S3 remains one of the most amazing pieces of software I've ever seen. That thing a few years ago where they just added strong read-after-write consistency to the whole system? Incredible software engineering. https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...
riknos314 · 16h ago
I had the distinct pleasure of working on S3 (Lifecycle) during the timeframe that the index team was working on the rearchitecture that enabled the read-after-write consistency.

I can confidently say that as impressive as S3 is from the outside, it's at least that impressive internally, both in implementation, and organizational structure.

positisop · 22h ago
Google Cloud Storage had it for eons before S3. GCS comes across as a much better thought-out and built product.
simonw · 21h ago
S3 is probably the largest object store in the world. The fact that they can upgrade a system like that to add a feature as complex as read-after-write with no downtime and working across 200+ exabytes of data is really impressive to me.
tossandthrow · 20h ago
I really do respect the engineering efforts.

But object stores are embarrassingly parallel, so if such a migration is possible anywhere without downtime, it is definitely in object stores.

gamegoblin · 19h ago
Where would you make the cut that takes advantage of object store parallelism?

That is, at what layer of the stack do you start migrating some stuff to the new strongly consistent system on the live service?

You can't really do it on a per-bucket basis, since existing buckets already have data in the old system.

You can't do it at the key-prefix level for the same reason.

Can't run both systems in parallel, trying the new one and falling back to the old one if the key isn't in it, because that opens up violations of the consistency rules you're trying to add.

Seems trickier than one might think.

nojvek · 15h ago
Obviously it depends on how they delivered read-after-write.

Likely they don't have to physically move the objects' data; instead, the layer that writes and reads coordinates based on some version-control guarantee. E.g., in database land, MVCC is a prominent paradigm. They'd need a distributed transactional KV store that tells every reader what the latest version of the object is and where to read from.

An object write is only acknowledged as finished once the data is written and the KV store is updated with the new version.

They could do this bucket by bucket in parallel since buckets are isolated from each other.
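That acknowledgement rule can be sketched in a few lines (purely speculative about S3's internals; all names here are invented): blob writes go to a versioned data plane first, and the write is acknowledged only after the metadata plane points readers at the new version.

```python
class VersionedObjectStore:
    """Speculative sketch of read-after-write via versioned metadata.
    A write is acknowledged only after BOTH the data blob and the
    version pointer are recorded, so a reader that consults the
    metadata plane never sees data older than an acknowledged write."""

    def __init__(self):
        self.blobs = {}    # (key, version) -> bytes  (data plane)
        self.latest = {}   # key -> latest version    (metadata plane)

    def put(self, key, data):
        version = self.latest.get(key, 0) + 1
        self.blobs[(key, version)] = data   # 1. write the blob
        self.latest[key] = version          # 2. publish the version
        return version                      # 3. only now ack the write

    def get(self, key):
        version = self.latest.get(key)      # readers go through metadata
        if version is None:
            return None
        return self.blobs[(key, version)]

store = VersionedObjectStore()
store.put("k", b"v1")
store.put("k", b"v2")
assert store.get("k") == b"v2"   # read-after-write: latest version wins
```

The hard part in a real system, of course, is making that metadata plane a distributed, transactional store that survives failures, which this toy dict does not attempt.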

rossjudson · 13h ago
GCS's metadata layer was originally implemented with Megastore (the precursor to Spanner). That was seamlessly migrated to Spanner (in roughly small-to-large "region" order), as Spanner's scaling ability improved over the years. GCS was responsible for finding (and helping to knock out) quite a few scaling plateaus in Spanner.
SteveNuts · 22h ago
Sure, but whose (compatible) API is GCS using again? Also keep in mind that S3 is creeping up on 20 years old, so retrofitting a change like that is incredible.
benoau · 22h ago
Not just 20 years old - an almost flawless 20 years at massive scale.
SteveNuts · 21h ago
It's funny that pinnacles of human engineering like this exist where the general public has no idea they exist, even though they (most likely) use them every single day.
ninetyninenine · 21h ago
I find Red Dead Redemption 2 more impressive. I don't know why. It sounds stupid, but S3 on the surface has the simplest API, and it's just not impressive to me when compared to something like that.

I’m curious which one is actually more impressive in general.

SteveNuts · 19h ago
Simple to use from an external interface yes, the backend is wildly impressive.

Some previous discussion https://news.ycombinator.com/item?id=36900147

koito17 · 14h ago
> S3 on the surface has the simplest api and it’s just not impressive [...]

Reminded of the following comment from not too long ago.

https://news.ycombinator.com/item?id=43363055

sriram_malhar · 9h ago
That's the strangest comparison I have seen. What axis are you really comparing here? Better graphics? Sound?
ninetyninenine · 8h ago
Complexity and sheer intelligence and capability required to build either.
sriram_malhar · 6h ago
And what is the basis for your claim? You are not impressed by AWS's complexity, intelligence, and capability in building and managing 1-2 zettabytes of storage near flawlessly?
ninetyninenine · 2h ago
I'm more impressed by Red Dead Redemption 2 or Baldur's Gate 3.

There is no "basis" other than my gut feeling. Unless you can get quantified metrics to compare, that's all we've got. For example, lines of code for both, or average IQ; either would lead toward the "basis", which neither you nor I have.

UltraSane · 13h ago
AWS has said that the largest S3 buckets are spread over 1 million hard drives. That is quite impressive.
ninetyninenine · 12h ago
Red dead redemption 2 is likely on over 74 million hard drives.
simonw · 2h ago
I think you misunderstood. They're not saying S3 uses a million hard drives, they're saying that there exist some large single buckets that use a million hard drives just for that one bucket/customer!
SteveNuts · 12h ago
There's likely over a trillion active SQLite databases in use right now.
theflyinghorse · 20h ago
Maybe. But Google does have a reputation that makes selecting them for infrastructure a risky endeavor.
throitallaway · 22h ago
From my POV Amazon designs its services from a "trust nothing, prepare for the worst case" perspective. Eventual consistency included. Sometimes that's useful and most of the time it's a PITA.
up2isomorphism · 10h ago
S3 is not a piece of software per se; it is a service.

Also, S3 is not better than GCS or Azure Blob.

simonw · 2h ago
Services are built on software.

S3 is likely an order of magnitude larger than those others; it's had a lot longer to grow.

osigurdson · 1h ago
It is interesting how the industry ended up with things like TDD when it doesn't work for something as simple as a function that adds two numbers together. While not completely useless in some edge cases, its complete lack of any kind of formal underpinning should have given us a clue. So many bad/unexamined ideas in the Uncle Bob era. Far closer to a religion than anything else (complete with process "rituals", even).
furkansahin · 1d ago
Amazing article! Using state machines is a must if you are building infrastructure control planes. Was P a must, though? I'm not sure. We have been building infrastructure control planes for over 13 years now, and every iteration we have built with Ruby. It has worked wonders for us: https://www.ubicloud.com/blog/building-infrastructure-contro...
severusdd · 1d ago
The 92% stat looks really interesting! It's rarely the spectacular crash that knocks a cluster over. Instead, the "harmless" retry leaks state until everything breaks at 2 a.m. on one fateful Friday. Evidently, we should budget more engineering hours for mediocre, silent failures than for outright disasters. That's where the bodies are buried.
smallnix · 22h ago
Or survivorship bias: the major issues that have been addressed do not cause problems because they were addressed, while some of the minor issues that were not addressed randomly do cause major issues.
chubot · 1d ago
One thing I wondered about the P language: It seems like in the early days, it was used at Microsoft to generate C code that’s actually used at runtime in the Windows USB stack?

But now it is no longer used to generate production code?

I asked that question here, which I think was the same question as in a talk: https://news.ycombinator.com/item?id=34284557

It seems like if the generated code is used in a kernel, it could also be used in a cloud, which is less resource-constrained

algorithmsRcool · 1d ago
It looks like Coyote[0], which is used in Azure, was an evolution of P#, which was an evolution of P.

[0]https://www.microsoft.com/en-us/research/wp-content/uploads/...

inaseer · 22h ago
+1.

We have used Coyote/P# not just for model checking an abstract design (which no doubt is very useful) but testing real implementations of production services at Microsoft.

k__ · 21h ago
How do Coyote and P differ?
chubot · 18h ago
OK, but then not for generating production code?

I thought I read that somewhere, but now I can't find the claim

ctkhn · 1d ago
This sounds interesting, but as someone who hasn't worked at AWS and isn't already familiar with TLA+ or P, I would have liked to see even a hello-world example of either of them. Without that, it sounds like a lot of extra pain for things that a good design and testing process should catch anyway. Seeing a basic example in the article itself would have given me better insight into what these actually do.
hwayne · 21h ago
This is a quick demo of TLA+ I like: https://gist.github.com/hwayne/39782de71f14dc9addb75f3bec515...

It models N threads non-atomically incrementing a shared counter, with the property "the counter eventually equals the number of threads in the model". When checked, TLA+ finds a race condition where one thread overwrites another's value. I've written implementations of the buggy design, and on my computer they race on less than 0.1% of executions, so testing for it directly would be very hard.

Most TLA+ specs are for significantly more complex systems than this, but this is a good demo because the error is relatively simple.
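The same lost-update race can also be shown by brute force outside TLA+. This Python sketch (an illustration of the same state space, not the linked spec) enumerates every interleaving of two non-atomic increments, where each thread reads the counter and then writes read+1, and confirms that both the correct outcome and the lost update are reachable.

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving of two threads, each performing a
    non-atomic increment: read the counter, then write read + 1."""
    counter = 0
    local = {}
    for tid, op in schedule:
        if op == "read":
            local[tid] = counter
        else:  # "write"
            counter = local[tid] + 1
    return counter

def interleavings():
    steps = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
    valid = set()
    for p in permutations(steps):
        # Keep only schedules that preserve each thread's program order.
        if p.index(("A", "read")) < p.index(("A", "write")) and \
           p.index(("B", "read")) < p.index(("B", "write")):
            valid.add(p)
    return valid

results = {run(s) for s in interleavings()}
# Both outcomes are reachable: 2 (correct) and 1 (lost update, e.g.
# when both threads read 0 before either writes).
assert results == {1, 2}
```

With only four steps there are six interleavings and the bad one is easy to hit; the value of a model checker is doing exactly this enumeration when the state space is far too large to hand-roll.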

teeray · 1d ago
The TLA Plus examples repository is very good: https://github.com/tlaplus/Examples . I would recommend starting with something simple like the DieHard problem.
dmd · 1d ago
The entire point of using formal methods is that testing will never, ever catch everything.
nickpsecurity · 22h ago
Whereas formal verification only catches the properties one correctly specifies, in the portions of the program one correctly specifies. In many projects, there's a gap between these and the correctness of the real-world code. Some projects have closed that gap, but most won't.
UltraSane · 13h ago
Can't you use the formal model to write/generate a lot of unit tests to verify the actual code behaves like the model does?
skrishnamurthi · 4h ago
There are two DIFFERENT gaps here. You're talking about the gap where what you have verified is a model, and the actual code and the model may diverge. But there is another, subtler gap: how good are you at coming up with properties? Because your verification is only as good as the properties you come up with, and the average programmer is not great at coming up with properties. (Before people attack me, I should make clear I've been working in formal methods for 3 decades now.)
rtpg · 12h ago
In most domains you're always going to have the possibility of a gap.

The real thing though, is that if you have a verified formal model and a buggy implementation, then _you know_ your problem is at that model <-> implementation level.

Could be the implementation is wrong. Could be that the "bug" is in fact not a bug according to your model. But your model isn't "wrong". If your model says that A happens in condition B, then that is what should happen!

You can avoid second guessing a lot of design patterns with models, and focus on this transitional layer.

If someone came up to you and said "I built a calculator and with this calculator 1/0 becomes 0" you don't say "oh... maybe my model of division is wrong". You think the calculator is wrong in one way or another.

Maybe the calculator's universe is consistent in some way but in that case its model is likely not the same model of division you're thinking of. This eliminates entire classes of doubt.

nickpsecurity · 58m ago
That's called specification-, model-, or property-based test generation. Google those terms. If older security or verification papers, always try adding "citeseerx" and "pdf" to the search to find research publications.

The papers will describe the strengths and weaknesses of those tools.

hamburglar · 22h ago
We should have formal verification of the formal verification specification. Standing on a turtle.
Twisol · 19h ago
I've heard that called "validation". In other words, you verify that your solution meets the problem specification, but you validate that your specification is actually what you need.
nickpsecurity · 18h ago
You're looking for foundational proof-carrying code with verified logics. I can't find the verified-logic paper right now, though. Examples:

https://www.cs.princeton.edu/~appel/papers/fpcc.pdf

https://hol-light.github.io/

I'll also add that mutation testing has found specification errors, too.

https://github.com/EngineeringSoftware/mcoq

skydhash · 1d ago
Tests do specific instances of a class of problem and proves that your implementation is correct for these instances. Formal verification proves the whole class.

You can have a function that returns an anagram, and testing will prove it correct for some pairs of words. But proving it for all words requires formal verification. And that's when you catch some tricky errors due to undefined behavior or library bugs, because you can't prove their semantics.
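A middle ground between example tests and proofs is property-based testing: check the defining property over many generated inputs. A pure-Python sketch (the `anagram` implementation here is hypothetical, standing in for the function under test):

```python
import random
import string

def anagram(word):
    """Hypothetical function under test: return some rearrangement
    of the input's letters."""
    letters = list(word)
    random.shuffle(letters)
    return "".join(letters)

def is_anagram_of(a, b):
    # The defining property: same multiset of letters.
    return sorted(a) == sorted(b)

# Property check over many random words, including the empty word.
# This covers far more cases than a handful of hand-picked pairs,
# but it is still not a proof over all possible strings.
rng = random.Random(0)
for _ in range(1000):
    word = "".join(rng.choice(string.ascii_lowercase)
                   for _ in range(rng.randrange(0, 12)))
    assert is_anagram_of(word, anagram(word))
```

Libraries like Hypothesis add shrinking of failing inputs on top of this basic loop, which a formal proof makes unnecessary only because it closes the gap for every input at once.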

sriram_malhar · 4h ago
egl2020 · 16h ago
I experimented with TLA+, and the graphical toolbox didn't seem to work or match the tutorial. Kinda disappointing: I wanted to use TLA+, and I'm otherwise a big fan of Lamport's work, from the utilitarian LaTeX to the intellectually satisfying paper on time, clocks, and distributed systems.
yathaid · 1d ago
>> a good design

good is doing a lot of heavy lifting here. The point of TLA+/PlusCal is to have a proof of the soundness of the design.

EGreg · 1d ago
Wow. I used to correspond with Leslie Lamport years ago (about his Buridan's Principle papers, etc.)

Today I went to his website and discovered a lot about TLA+ and PlusCal. He still maintains it: https://lamport.azurewebsites.net/tla/peterson.html?back-lin...

I must say ... it would make total sense for a guy like that, who brought mathematics to programming and was an OG of concurrent systems, to create a systems design language that's used at AWS and other industrial places that need to serve people.

I wish more people who build distributed systems would use what he made. Proving correctness is crucial in large systems.

lopatin · 1d ago
And just a tip for whoever may be interested: Claude Opus with Extended Thinking seems to be very good at converting existing code into TLA+ specs.

I've found multiple bugs for personal Rust projects like this (A Snake game that allowed a snake to do a 180 degree turn), and have verified some small core C++ components at work with it as well (a queue that has certain properties around locking and liveness).

I tried other models but kept getting issues with syntax and spec logic with anything else besides Opus.

dosnem · 1d ago
I’ve always envisioned tla and other formal methods as specific to distributed systems and never needed to understand it. How is it used for a snake game? Also how is the TLA+ spec determined from the code? Won’t it implicitly model incorrect bugs as correct behaviour since it’s an existing state in the system? Also when using TLA from the start, can it be applied to implementations? Or is it only for catching bugs during design? Therefore I’m assuming implementations still need to match the design exactly or else you would still get subtle bugs? Sorry for all the questions I’ve never actually learned formal methods but have always been interested.
lopatin · 1d ago
Here's how it caught my Snake bug: my snake representation is a vector of key points (head, turns, tail). A snake in a straight line, of length 3, facing right can look like this: [(0,0), (2,0)]. When a snake moves (a single function called "step_forward"), the representation is compressed by my code: if the last two points are the same, remove the last one. So if this snake changes direction to "left", the new representation would be [(1, 1), (1, 1)], compressed to [(1, 1)] before exiting step_forward.

Here's how the bug was caught: it should be impossible for the snake representation to have fewer than 2 points. So I told Opus to model the behavior of my snake, and also to write a TLA+ invariant that the snake length should never be under 2. TLA+ then basically simulates it and finds the exact sequence of "turn" steps that causes that invariant not to hold. In this case it was quite trivial: I never thought to prevent a snake from making turns that are not 90 degrees.

Jtsummers · 1d ago
It's targeted at distributed systems, but it can be used to model any system over time. I've used it for distributed systems, but also for an embedded system with a peculiar piece of hardware that seemed to be (and, we found, was) misbehaving. I modeled the hardware and its spec in TLA+, then made changes to the behavior description to see if it broke any expected invariants (it did, in precisely the way we saw with the real hardware). The TLA+ model also helped me develop better reproducible test cases for that hardware compared to what we were doing before.
skydhash · 23h ago
I'm not an expert, but my current understanding is that code execution is always a state transition to the next state. So what you do is fully specify each state and the relations between them. How the transition actually happens is your code, and it's not that important. What's important is that the relations do not conflict with each other. It's a supercharged type system.

> Also how is TLA+ spec determined from the code?

You start from the initial state, which is always known (or is fixed). Then you model the invariants for each line.

> Won’t it implicitly model incorrect bugs as correct behaviour since it’s an existing state in the system

Invariants will conflict with each other in this case.

> Also when using TLA from the start, can it be applied to implementations?

Yes, by fully following the spec and handling possible incorrect states that may happen in practice. If your initial state in the TLA+ spec says it only includes natural numbers between 1 and 5, you add assertions in your implementation (or throw exceptions) to check that, since the Int type in many type systems is not a full guarantee of that constraint. Even more work when using a dynamic language.
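That 1..5 example can be made concrete: since the host language's int type cannot express the spec's range, every state change re-checks it at runtime. A minimal sketch (hypothetical names, Python):

```python
class Counter:
    """Runtime enforcement of a spec-level constraint: suppose the
    TLA+ spec says the value stays in 1..5. The language's int type
    can't express that, so every transition re-checks the invariant."""

    LO, HI = 1, 5

    def __init__(self, value=1):
        self._set(value)

    def _set(self, value):
        # Mirror the spec's type invariant on every state change.
        assert self.LO <= value <= self.HI, f"spec violation: {value}"
        self.value = value

    def increment(self):
        self._set(self.value + 1)

c = Counter()
for _ in range(4):
    c.increment()      # 1 -> 5: every intermediate state is legal

try:
    c.increment()      # would reach 6, outside the spec's range
    violated = False
except AssertionError:
    violated = True    # the guard fires instead of silently diverging

assert violated and c.value == 5
```

In production code you'd raise a real exception rather than rely on `assert` (which `python -O` strips), but the idea is the same: the implementation fails loudly the moment it leaves the state space the spec was checked over.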

k__ · 21h ago
It seems like the new DeepSeek performs at a similar level to Opus 4, at least according to preliminary Aider benchmarks.
skydhash · 1d ago
> Proving correctness is crucial in large systems.

It could be good in smaller, but critical and widely used utilities like SSH and terminals.

oblio · 1d ago
Yeah, basically all the coreutils plus all the common extras (rsync, ssh, etc) could use stuff like this.
rthnbgrredf · 20h ago
It should be feasible to rewrite coreutils like ls, cd and cp in Lean 4 together with Cursor within days. Rsync and ssh are more complex, though.
oblio · 7h ago
Your first claim is actually a very solid test for AI. We should start seeing a lot more AI-powered OSS projects, or at least contributions, if AI truly is as good as they say. Heck, OSS should accelerate exponentially, since contributions should become very easy.
belter · 1d ago
>> Proving correctness is crucial in large systems.

You can't do that...

The model checker says the specification satisfies the properties you wrote within the finite state space you explored...

amw-zero · 15h ago
You can write proofs in TLA+ and many other formalisms. You don’t need to ever use a model checker. The proofs hold for an infinite number of infinite-length executions. We are definitely not limited to finite behaviors.
abeppu · 22h ago
> to more lightweight semi-formal approaches (such as property-based testing, fuzzing, and runtime monitoring)

Ok, I get how property-based testing and fuzzing have a relationship to formal methods (the thing being checked looks like part of a formal specification, and in some sense these are a subset of the checks that a model-checking confirms), but calling runtime monitoring a "semi-formal approach" seems like a real stretch.

mjb · 22h ago
Runtime monitoring with something like PObserve is a semi-formal approach. Not just regular alarming and metrics.
purpleidea · 13h ago
It's very interesting (I applaud this) that one of the main goals seems to be to make it more approachable compared to TLA+, but then they go and write it in C#, which I consider an incredibly unapproachable community and language.

I'm not trying to draw the ire of the Microsoft fanboys, and there are certainly smart people working on it, but it's just not going to happen for most people.

Had this been in Go, or maybe Java, I'm sure many more hands would be digging in! That said, I hope this helps bring correctness and validation more into the mainstream. I've been casually following the project for a while now.

My long-term goal is to integrate model validation into https://github.com/purpleidea/mgmt/ so if this is an area of interest to you, please let me know!

ahalbert4 · 1d ago
Just curious, has anyone used FIS in their own distributed services? I'm considering it but don't have any real-world experience running those kinds of experiments.
sylware · 6h ago
Hopefully they now have a reactive security team, because I've lost count of the hacking bots/IP scanners they are protecting on their "cloud".
Marazan · 1d ago
Would I be right in saying Promela and SPIN are at a higher level than what is being described in the article?
mjb · 1d ago
I (one of the authors) did some distributed systems work with Promela about a decade ago, but it never felt like the right fit in the domain. It's got some cool ideas, and may be worth revisiting at some point.
sebstefan · 1d ago
>Deterministic simulation. Another lightweight method widely used at AWS is deterministic simulation testing, in which a distributed system is executed on a single-threaded simulator with control over all sources of randomness, such as thread scheduling, timing, and message delivery order. Tests are then written for particular failure or success scenarios, such as the failure of a participant at a particular stage in a distributed protocol. The nondeterminism in the system is controlled by the test framework, allowing developers to specify orderings they believe are interesting (such as ones that have caused bugs in the past). The scheduler in the testing framework can also be extended for fuzzing of orderings or exploring all possible orderings to be tested.

This is fucking amazing

jeffreygoesto · 1d ago
Loom for Rust does this. We also adapted it for some C++ with good success (found actual bugs that had slipped past tests and reviews). It sits lower than, e.g., TLA+ and is not a proof, but it's super useful since you check the actual implementation.

https://github.com/tokio-rs/loom

wwilson · 23h ago
DST was i̶n̶v̶e̶n̶t̶e̶d̶ popularized at FoundationDB a little over a decade ago, and has been quietly gathering steam ever since. If you’re interested in the technique, the FDB paper has some good info in section 4 and section 6.2: https://www.foundationdb.org/files/fdb-paper.pdf

(Disclosure: I am an author.)

I also gave a talk overview of it at Strange Loop back in 2015, but don’t have the youtube link handy.

If you’re interested in trying DST, we have a new company that aims to make it a much easier lift, especially for existing projects that can’t be retrofitted onto one of the many simulation frameworks: https://antithesis.com

Happy to answer any questions about the approach, here or over email.

mjb · 22h ago
Hey Will. I'm a huge fan of the work you all are doing, and of FoundationDB, but I don't believe it's accurate that DST was invented at FoundationDB (or, maybe it was, but was also used in other places around the same time or before).

For example, the first implementations of AWS's internal lock service (Alf) used DST as a key part of the testing strategy, sometime around 2009. Al Vermeulen was influential in introducing it at AWS, and I believe it built on some things he'd worked on before.

Still, Antithesis is super cool, and I really admire how you all are changing the conversation around systems correctness. So this is a minor point.

we6251 · 19h ago
Also a huge proponent of Antithesis and their current work, but there definitely were some notable precedents at or around that time e.g. MODIST from 2009 (https://www.usenix.org/legacy/event/nsdi09/tech/full_papers/...), which similarly tried to build a "model checker for distributed systems".

As another interesting historical side note, I have wondered about the similarities between Antithesis and "Corensic", a startup spun out of UW around a similar time period (circa 2009). Apparently, they "built a hypervisor that could on-demand turn a guest operating system into deterministic mode" (see "Deterministic Multiprocessing" at https://homes.cs.washington.edu/~oskin/). My impression is that their product was not a significant commercial success, and the company was acquired by F5 Networks in 2012 (https://comotion.uw.edu/startups/corensic/).

Overall, I don't over-index on novelty, and think it is generally good for ideas to be recycled/revived/re-explored, with updated, modern perspectives. I believe that most rigorous systems designers/engineers likely converge to similar ideas (model checking, value of determinism, etc.) after dealing with these types of complex systems for long enough. But, it is nevertheless interesting to trace the historical developments.

wwilson · 19h ago
Corensic was impressive tech. I actually debriefed with one of their founders years ago. IIRC, their product was focused on finding single-process concurrency bugs.

Deterministic hypervisors are by no means new. Somebody once told me that VMWare used to support a deterministic emulation mode (mostly used for internal debugging). Apparently they lost the capability some time ago.

wwilson · 22h ago
Hi Marc, thank you for the correction! We started doing it around 2010, and were not aware of any prior art. But I am not surprised to hear that others had the idea before us. I will give Al credit in the future.
joshstrange · 23h ago
agentultra · 1d ago
The TigerBeetle team does this too and it's an interesting approach. One of their developers did a talk on it at HYTRADBOI this year [0].

[0] https://www.hytradboi.com/2025/c222d11a-6f4d-4211-a243-f5b7f...

slt2021 · 21h ago
they also made a browser game to showcase their DST, where you can inject faults and see how tigerbeetle recovers

https://sim.tigerbeetle.com/

slayerjain · 1d ago
similar to the integration testing component in keploy.io, which includes time freezing.

basically, certain system clocks are rolled back and incremented in a deterministic manner to emulate real issues recorded from production bugs. Had a lot of fun building this and seeing the effect of messing with different system clocks by intercepting system calls.

https://keploy.io/docs/keploy-cloud/time-freezing/

algorithmsRcool · 1d ago
Microsoft has a flavor of this for .NET called Coyote[0]

[0] https://microsoft.github.io/coyote/#overview/how/

supriyo-biswas · 1d ago
Also somewhat similar to what Antithesis is doing.
keploy · 1d ago
Keploy is doing the same thing, in the same space - https://github.com/keploy/keploy
jen20 · 1d ago
Which is largely an extended version of what FoundationDB does (from many of the same people).
aoli-al · 1d ago
Fray (https://github.com/cmu-pasta/fray) is a concurrency testing framework for Java. It also does deterministic simulation.

lucianbr · 1d ago
Guessing an ordering that will trigger a bug and exploring all orderings should both be basically impossible for non-trivial scenarios, no?
teeray · 1d ago
Perhaps, but there is still some value in "capturing the bug with a failing test." If some ordering presents a defect, you can recreate it and prove that you've fixed the bug.
spenczar5 · 1d ago
I want something like this, but I work almost entirely in Go.

Am I correct in thinking that this would require a fork of the main Go implementation, in order to make a deterministic (and controllable) goroutine scheduler and network stack?

Does one also need to run on a particularly predictable OS?

ideal_gas · 22h ago
For in-process go scheduling, some progress has been made here; see: https://go.dev/blog/synctest

But `synctest.Wait` won't work for non-durably blocked goroutines (such as in the middle of a TCP syscall) so requires memory-based implementations of e.g. net.Conn (I've plugged in https://pkg.go.dev/google.golang.org/grpc/test/bufconn with good success)

jerf · 21h ago
That's not enough for proof purposes. It allows you to build a test that deterministically tests one path, but it does not give you control over all possible tests.

In fact I seem to be a bit iconoclastic on this matter, but I'm not even a fan for testing purposes. Even non-proof-based testing needs the ability to test that goroutines may execute out of order. Nothing prevents a goroutine scheduled for 2ms from now from running all the way to completion before a goroutine scheduled to run 1ms from now even starts, but AFAIK this approach doesn't let you test that case. Such logic is not valid in a multithreaded system; it is at most the most likely progression, not the only possible progression.

But since we live in the world where the question is more will anyone write a concurrency test at all, moaning about not having a perfect one is missing the point, I suppose. I won't deny a deterministic test is better than a non-deterministic test in general.

jerf · 23h ago
"Am I correct in thinking that this would require a fork of the main Go implementation, in order to make a deterministic (and controllable) goroutine scheduler and network stack?"

Yes, because the Go runtime actually explicitly and deliberately leans in the opposite direction. Additional non-determinism is deliberately added to the system, most notably that a "select" with multiple ready channels chooses among them pseudorandomly, and iteration order on a map is somewhat scrambled on each iteration. It's not entirely random, and as such not suitable for pretty much any randomness requirement you may have, but enough to generally prevent accidental dependencies on deterministic iteration.

This has the effect of "spreading out" concurrency issues and hopefully reducing the sort of issues you get when you only discover a major issue when something else in the environment changes three years later. It's a nice default but nothing like a proof, of course.

You can also go the "predictable OS" route, since the entire Go runtime sits on top of it. Nothing can non-deterministically run on top of a fully deterministic OS image. But a fully deterministic OS would probably want some integration with the Go runtime to be able to more deeply control it through mechanisms other than varying inputs to the random function, which is a very indirect way to go down a code path.

slayerjain · 1d ago
I worked on adding golang time freezing in keploy by using the approach from the Go playground, which uses a fixed start time (2009-11-10 23:00:00 UTC) but also increments the clock seen by goroutines.
carbondating · 1d ago
This is what hermit does, although it's no longer actively developed.

https://github.com/facebookexperimental/hermit

jiggawatts · 18h ago
It ought to be the norm, and it is (slowly) getting there. For example, the .NET 8 standard library now has a TimeProvider type to allow test frameworks to provide substitute implementations of the system clock.
mlhpdx · 1d ago
> 92% of catastrophic failures in tested distributed systems were triggered by incorrect handling of nonfatal errors

This. If you take nothing else away from the article (which has a lot) take this: fail well, don’t fail poorly.

senthil_rajasek · 1d ago
It would also be nice to list some "best practices" on how to handle non-fatal errors. I would definitely be interested in any sources.
jerf · 23h ago
One of the nice things about "errors as values" is that it is generally easier to shim in an error rather than shim in an exception. Not that it's impossible to do the latter, but it's just generally easier because you can have that error serving as a value in your test code.

I have a lot of Go testing shims that look like:

    var ErrUserNotFound = errors.New("user not found")

    type UserGetter interface {
        GetUser(userID string) (User, error)
    }

    type TestUsers struct {
        Users map[string]User
        Error error
    }

    func (t TestUsers) GetUser(userID string) (User, error) {
        if t.Error != nil {
            return User{}, t.Error
        }
        user, have := t.Users[userID]
        if !have {
            return User{}, ErrUserNotFound
        }
        return user, nil
    }
This allows easily testing errors upon retrieving users and ensuring the correct thing happens. I'm not a 100% maniacal "get 100% test coverage in everything" kind of guy, but on the flip side, if your test coverage only lights up the "happy path", your testing is not good enough and as you scale up the probability that your system is going to do something very wrong when an error occurs very rapidly approaches 1.

It's more complicated when you have something like a byte stream where you want to simulate a failure at arbitrary points in the stream, but similar techniques can get you as close as you like, depending on how close that is.

From there, in terms of "how do you handle non-fatal errors", there really isn't a snap rule to give. Quite often you just propagate the error because there isn't anything else to do. Sometimes you retry some bounded number of times, maybe with backoff. Sometimes you log things and move on. Sometimes you have a fallback you may try. It just depends on your needs. I write a lot of network code, and I find that once my systems mature it's actually the case that rather a lot of the errors in the system get some sort of handling beyond "just propagate it up", but it's hard for me to ever guess in advance what they will be. It's a lot easier to mentally design all the happy paths than it is to figure out all the ways the perverse external universe will screw them up and how we can try to mitigate the issues.

skydhash · 1d ago
The same way you handle fatal errors, by specifying the exceptional circumstances and how to handle them (retry, alternative actions, or signaling to another handler up the call/request tree). Something’s correct output may not be our thing’s correct input.
harrall · 1d ago
I think the best practice is to handle them with equal attention as the happy path. Error handling is usually an afterthought, in my experience.

What is the system state when it does error?

What is the best possible recovery from each error state?

What can the user/caller expect for an error?

hiddencost · 1d ago
One I see a lot is not being careful to use the correct error type / status code.

E.g. if you're in Python and raise a ValueError when an API is rate limited, someone downstream from you is going to have a bad time.

dnw · 23h ago
This paper from 11 years ago had the exact same finding!!(Finding 10). https://www.usenix.org/system/files/conference/osdi14/osdi14...
ad_hockey · 9h ago
Same paper, they're just referencing it:

> In 2014, Yuan et al. found that 92% of catastrophic failures in tested distributed systems were triggered by incorrect handling of nonfatal errors.

trebligdivad · 23h ago
Yeh, I think a lot of security screwups are also in error handling paths. This article gave me two terms I'd not heard before: 'happy-case catastrophic failures' is a great one, and lower down there is 'goodput'.
bravesoul2 · 16h ago
I experienced this. The Go/Rust/Haskell way of avoiding exceptions in the language is better than the C#/Java/JavaScript way.

To the point that I've seen the exception approach cause real bugs in production.

The problem in Node is that you can either throw to produce a 4xx or return a 4xx, so downstream there are two things to check.

Supermancho · 1d ago
How much effort should be put into "failing well"? I'd rather see the program crash than output a liability. "Fail well" is too broad to be useful in my industry.
ad_hockey · 1d ago
It gets tricky in a distributed system, or I suppose any server process. When the program crashes it just starts up again, and sometimes picks up the same input that caused the crash.

Typical example would be processing an event that you can't handle from a message queue. You don't want to crashloop, so you'd probably have to throw it away on a dead letter queue and continue processing. But then, is your system still correct? What happens if you later receive another event relating to the same entity, which depends on the first event? Or sometimes you can't even tell which entity the malformed or bug-triggering event relates to, and then it's a real problem.

skydhash · 23h ago
In distributed systems, that means you either want the whole system to crash (consistency), or the node that crashes is not critical for any operation and can stay out until you clean up the state that made it crash (a partition). The issue is when you fall into neither of those two categories: some aspects are concurrent, but some are sequential.
Marazan · 1d ago
For me the most catastrophic situations happen when a fatal error is treated as a non-fatal error and suddenly, instead of the system crashing, the system starts propagating nulls everywhere and into storage.
thfuran · 21h ago
And then someone decides to "fix it" by adding a null check and things really go off the rails.
agentultra · 23h ago
What’s been disappointing to me is how easily formal methods are dismissed in industry.

TLA takes some time to learn to wield effectively but it pays off on spades.

hipgoat · 21h ago
It's a comment of assessment.