Cosmic Ray Bit Flips and the Hidden Risk at Scale

9 s-mon 1 8/8/2025, 11:53:29 PM cside.dev ↗

Comments (1)

addaon · 2h ago
> For example, NASA spacecraft run critical calculations in triplicate - multiple processors run the same operation, and if one disagrees due to a stray bit flip, the other two out-vote the process.

Triple modular redundancy is one way to achieve reliability in the face of arbitrary single errors, but it’s not the only way. In my experience dual-dual approaches are more common these days. A dual-channel system calculates everything twice and checks that the results match, and fails detectably (usually fails silent) if they don’t; that is, it converts arbitrary errors into detectable errors. Downstream systems then receive command streams or other data from each of two such fail-detectable systems, and can assume that if they receive commands from either system (and those commands are protected from corruption during transmission), they can trust them. While this requires 4x computation rather than 3x, it avoids the (relatively complicated) voting system, and is /massively/ aided by the wide availability of lockstep processors that provide the first level of dual-channel checking on a single piece of silicon.
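
To make the comparison concrete, here is a minimal Python sketch of the two schemes, not a real flight-software implementation. All names (`tmr_vote`, `fail_silent_pair`, `downstream_select`, `good`, `flipped`) are hypothetical, `None` is assumed to stand in for "fail silent", and plain function calls stand in for what would be separate processors or lockstep cores in hardware.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def tmr_vote(a: T, b: T, c: T) -> T:
    """Triple modular redundancy: return the value at least two replicas agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    # All three differ: more than a single error occurred, so no majority exists.
    raise RuntimeError("no majority among the three replicas")


def fail_silent_pair(primary: Callable[[], T], shadow: Callable[[], T]) -> Optional[T]:
    """One dual-channel unit: compute twice, emit a result only if both agree.

    Returning None models failing silent (assumes None is never a valid result).
    """
    first, second = primary(), shadow()
    return first if first == second else None


def downstream_select(unit_a: Optional[T], unit_b: Optional[T]) -> Optional[T]:
    """Consumer of two fail-silent units: accept output from whichever one speaks.

    No voting is needed, because each unit has already turned arbitrary errors
    into detectable silence.
    """
    return unit_a if unit_a is not None else unit_b


def good() -> int:
    """Hypothetical computation that returns the correct result."""
    return 42


def flipped() -> int:
    """Same computation with bit 3 of the result flipped, simulating an upset."""
    return 42 ^ (1 << 3)
```

With these definitions, `tmr_vote(good(), flipped(), good())` out-votes the corrupted replica, while `downstream_select(fail_silent_pair(good, flipped), fail_silent_pair(good, good))` gets the same answer by ignoring the unit that went silent. The dual-dual path runs the computation four times instead of three, but the downstream logic is a presence check rather than a voter.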