The author raised some great questions! But as he admits, they are unlikely to be answered in public, which is why I usually find public retrospectives a bit underwhelming.
Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load.
If nothing else, this section of the incident report reminded me of my favorite distributed systems paper: Metastable Failures in Distributed Systems. You should definitely check it out if you haven't already:
https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...
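For anyone who hasn't run into the "randomized exponential backoff" the report says was missing, here is a rough sketch of the idea in Python. To be clear, this is not Service Control's code. It is the common "full jitter" variant, and the names `backoff_delay` and `retry` and the default parameters are my own assumptions for illustration. The point is that each client waits a random amount of time up to an exponentially growing ceiling, so retries from many clients spread out instead of arriving in synchronized waves.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=30.0, rng=None):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    if rng is None:
        rng = random.Random()
    # The ceiling doubles with each attempt but never exceeds `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Picking uniformly below the ceiling is what de-synchronizes clients.
    return rng.uniform(0.0, ceiling)


def retry(op, attempts=6, base=0.1, cap=30.0, sleep=time.sleep, rng=None):
    """Call `op` up to `attempts` times, sleeping a jittered delay between tries."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

Without the random component, every client that failed at the same moment would retry at the same moment too, which is the kind of self-sustaining overload the paper describes.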