A Null Pointer Exception Brought Down Mighty Google;7 Hours of Downtime

9 pavan_panto 9 7/9/2025, 8:29:47 PM getpanto.ai ↗

Comments (9)

gnat · 17h ago
Ugh. Each of these points is a classic reliability precaution – yet all were missed simultaneously. As one analyst put it, Google had “written the book on Site Reliability Engineering” but still deployed code that could not handle null inputs. In hindsight, this outage looks like a string of simple errors aligning by unfortunate chance.

Yes, that's how major outages happen. By this stage of maturity any single failure generally doesn't break things dramatically. When things go this wrong, it's ALWAYS a combination of failures: failure of recovery system, omission in detection systems, gap in automated review, oversight in ...

The vacuous gotcha language is indicative of the low quality of the whole article. As Metalnem says in comments here, see the official incident report for a better writeup and more insight. https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...

jaymzcampbell · 16h ago
I really love the "Swiss Cheese model" for showing this in a very explicit way, it's easy to see how the most improbably thing could happen.

https://en.wikipedia.org/wiki/Swiss_cheese_model

Metalnem · 17h ago
This is a terrible article. It doesn’t cite its sources and, even worse, invents quotes. It's basically just an ad for some Panto AI tool.

Save yourself some time and just read the official incident report: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...

Or the previous Hacker News discussion: https://news.ycombinator.com/item?id=44274563

KevinMS · 16h ago
fitting for the company that invented the nil pointer exception
malux85 · 17h ago
Very surprising that it was not feature flagged, when I worked at Google we went to enormous lengths to support partial feature rollouts, supporting this for complex codebases often added weeks to development but was crucial because an outage of this magnitude was considered sacrilege