Behind the scenes: Redpanda Cloud's response to the GCP outage

61 points by eatonphil | 22 comments | 6/21/2025, 2:57:07 PM | redpanda.com

Comments (22)

RadiozRadioz · 2h ago
Hmm. Here's what I read from this article: RedPanda didn't happen to use any of the stuff in GCP that went down, so they were unaffected. They use a 3rd party for alerting and dashboarding, and that 3rd party went down, but RedPanda still had their own monitoring.

When I read "major outage for a large part of the internet was just another normal day for Redpanda Cloud customers", I expected a brave tale of RedPanda SREs valiantly fixing things, or some cool automatic failover tech. What I got instead was: Google told RedPanda there was an issue, RedPanda had a look and their service was unaffected, nothing needed failing over, then someone at RedPanda wrote an article bragging about their triple-nine uptime & fault tolerance.

I get it, an SRE is doing well if you don't notice them, but the only real preventative measure I saw here that directly helped with this issue is that they over-provision disk space, which I'd be alarmed if they didn't do.

literallyroy · 2h ago
Yeah I thought they were going to show something cool like multi-tenant architecture. Odd to write this article when it was clear they expected to be impacted as they were reaching out to customers.
dangoodmanUT · 1h ago
I think you're missing the point. What I took away was: "Because we design for zero dependencies for full operation, we didn't go down." Their extra features like tiered storage and monitoring going down didn't affect normal operations, whereas it seems it did for similar solutions with similar features.
rybosome · 2h ago
Must be hell inside GCP right now. That was a big outage, and they were tired of big outages years ago. It was already extremely difficult to move quickly and get things done due to the reliability red tape, and I have to imagine this will make it even harder.
siscia · 2h ago
In fairness, their design does not seem to be regional, with problems in one region bringing down another, apparently not unrelated, region.

With this kind of architecture, this sort of problem is just bound to happen.

During my time at AWS, region independence was a must. And some services were able to operate, at least for a while, without degrading even when some core dependencies were not available. Think losing S3.

And after that, the service would keep operating, but with a degraded experience.

I am stunned that this level of isolation is not common in GCP.

rybosome · 2h ago
Global dependencies were disallowed back in 2018 with a tiny handful of exceptions that were difficult or impossible to make fully regional. Chemist, the service that went down, was one of those.

Generally GCP wants regionality, but because it offers so many higher-level inter-region features, some kind of a global layer is basically inevitable.

dangoodmanUT · 1h ago
Does Route53 depend on services in us-east-1 though? Or maybe it's something else, but I recall us-east-1 downtime causing service downtime for global services.
cyberax · 19m ago
As far as I remember, Route53 is semi-regional. The master copy is kept in us-east-1, but individual regions have replicated data. So if us-east-1 goes down, the individual regions will keep working with the last known state.

Amazon calls this "static stability".
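
A minimal sketch of what that pattern can look like (names and structure are illustrative, not AWS's actual implementation): the regional serving path applies whatever the primary region replicates, and if replication stops it simply keeps answering from the last known state.

    import time

    class RegionalResolver:
        """Regional read path that keeps serving the last replicated state."""

        def __init__(self):
            self.records = {}      # last known records
            self.last_sync = None  # when we last heard from the primary region

        def apply_replication_batch(self, batch):
            # Normal path: the primary region (us-east-1 in this picture)
            # pushes record updates to every other region.
            self.records.update(batch)
            self.last_sync = time.time()

        def resolve(self, name):
            # Static stability: answering a query never requires the primary
            # to be reachable; we serve whatever state we replicated last.
            return self.records.get(name)

    resolver = RegionalResolver()
    resolver.apply_replication_batch({"api.example.com": "198.51.100.7"})
    # ...primary region becomes unreachable, no new batches arrive...
    print(resolver.resolve("api.example.com"))  # still answers 198.51.100.7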

valenterry · 2h ago
How does AWS do that though? Do they re-implement all the code in every region? Because even the slightest re-use of code could trigger a synchronous (possibly delayed) downtime of all regions.
crop_rotation · 1h ago
Reusing code doesn't trigger region dependencies.

> Do they re-implement all the code in every region?

Everyone does.

The difference is AWS very strongly ensures that regions are independent failure domains. The GCP architecture is global, with all the pros and cons that implies. E.g., GCP has a truly global load balancer, while AWS cannot, since everything is at its core regional.

nijave · 44m ago
They definitely roll out code (at least for some services) one region at a time. That doesn't prevent old bugs/issues from coming up but it definitely helps prevent new ones from becoming global outages.
cyberax · 23m ago
Regions (and even availability zones) in AWS are independent. The regions all have overlapping IPv4 addresses, so direct cross-region connectivity is impossible.

So it's actually really hard to accidentally make cross-region calls, if you're working inside the AWS infrastructure. The call has to happen over the public Internet, and you need a special approval for that.

Deployments also happen gradually, typically only a few regions at a time. There's an internal tool that allows things to be gradually rolled out and automatically rolled back if monitoring detects that something is off.
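
Roughly the shape of such a pipeline, as a sketch (wave ordering, helper names, and the bake/rollback logic are all made up for illustration; the internal tool isn't public):

    import time

    WAVES = [["us-west-2"], ["eu-west-1", "ap-southeast-1"], ["us-east-1"]]

    def deploy(region, version):
        print(f"deploying {version} to {region}")

    def roll_back(region, version):
        print(f"rolling back {version} in {region}")

    def alarms_firing(region):
        return False  # stand-in for querying the monitoring system

    def rollout(version, bake_seconds=3600):
        done = []
        for wave in WAVES:
            for region in wave:
                deploy(region, version)
                done.append(region)
            time.sleep(bake_seconds)  # let monitoring observe the new wave
            if any(alarms_firing(r) for r in done):
                for r in reversed(done):  # automatic rollback, newest first
                    roll_back(r, version)
                return False  # stop; later regions never get the bad build
        return True

Calling something like rollout("v1.2.3", bake_seconds=1) walks the waves in order and stops before later regions if any alarm fires during the bake period.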

buremba · 1h ago
I think making the identity piece regional hurts the UX a lot. I like GCP's approach, where you manage multiple regions with a single identity, but I'm not sure how they can make it resilient to regional failures.
nijave · 41m ago
Async replication? I think you could run semi-independent regions with an orchestrator that copies config to each one. You'd go into a degraded, read-only state, but it wouldn't be hard down.

Of course, bugs in the orchestrator could cause outages, but ideally that piece is a pretty simple "loop over regions and call each regional API update method with the same arguments".
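
Purely as an illustration of that loop (put_config stands in for whatever update method each regional API actually exposes; nothing here is a real API):

    import logging

    REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]

    def put_config(region, config):
        """Stand-in for the regional API's config-update method."""
        print(f"applied {config} in {region}")

    def propagate(config):
        # The orchestrator just loops over regions and calls each regional
        # API with the same arguments. A region it can't reach keeps serving
        # its last known config (degraded / read-only) instead of going down.
        failed = []
        for region in REGIONS:
            try:
                put_config(region, config)
            except Exception:
                logging.exception("update failed in %s; it keeps its old config", region)
                failed.append(region)
        return failed  # candidates for a later retry

    propagate({"max_connections": 500})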

delusional · 1h ago
> they were tired of big outages years ago

One could hope that they'd realize whatever red tape they've been putting up so far hasn't helped, and so more of it probably won't either.

If what you're doing isn't having an effect you need to do something different, not just more.

Peterpanzeri · 1h ago
“We got lucky, as the way we designed it happened not to use the part of the service that was degraded.” This is a stupid statement from them; I hope they will be prepared next time.
beefnugs · 4m ago
I learned a lesson: "use less cloud"
mankyd · 1h ago
Why is that stupid? They did get lucky. They are acknowledging that, had they used that, they would have had problems. And now they will work to be more prepared.

Acknowledging that one still has risks and that luck plays a factor is important.

raverbashing · 2h ago
Lol, I love how they call "not spreading your services needlessly across many different servers" an "Architectural Pattern" (cell-based arch)

They are right, of course, but the way things are, the obvious needs to be said sometimes.

bdavbdav · 2h ago
“We got lucky as the way we designed it happened not to use the part of the service that was degraded”
smoyer · 2h ago
And we're oblivious enough about that luck that we're patting ourselves on the back in public.
belter · 2h ago
And we are linking our blog to the AWS doc on cell architectures, while talking about multi-AZ clusters on GCP AZs that are nothing like that...