How We Took Vapi from 99.9% to 99.99% Reliability

2 jordandearsley 1 8/12/2025, 9:43:19 PM vapi.ai ↗

Comments (1)

jordandearsley · 18h ago
A while ago, Vapi.ai was at ~99.9% uptime, which works out to roughly 8.8 hours of downtime a year.

We set a goal of 99.99% (under 1 hour of downtime a year), and quickly learned that getting there meant 100 small changes, not one big one.
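For anyone who wants to check the downtime numbers, here is a quick sketch of the arithmetic. It assumes a 365-day year, and the function name is just for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


three_nines = downtime_hours_per_year(99.9)   # about 8.76 hours
four_nines = downtime_hours_per_year(99.99)   # about 0.876 hours

print(f"99.9%  allows {three_nines:.2f} hours/year")
print(f"99.99% allows {four_nines * 60:.1f} minutes/year")
```

Each extra nine cuts the allowed downtime by a factor of ten, so going from 99.9% to 99.99% leaves about 53 minutes a year instead of almost 9 hours.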

Some highlights from what we did to achieve this goal:

- When our primary database on Neon falters, traffic now shifts to Aurora in under five seconds, keeping calls alive.

- Every external dependency has a backup. For example, LLM calls fall back from OpenAI → Azure → Bedrock.

- Deployments are rolled out gradually across clusters by an automated canary manager, which starts at 5% of traffic and rolls back instantly if error rates rise.

- When traffic spikes, Lambda burst workers come online in milliseconds and tunnel into our Kubernetes cluster over QUIC, handling overflow without dropping calls.
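To make the provider fallback idea concrete, here is a minimal sketch of an ordered fallback chain. It is not Vapi's actual code: the function names, the stub providers, and the "try the next one on any exception" policy are all assumptions for illustration. A production version would also need timeouts, retries, and health tracking.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a (name, callable) pair; the callable takes a prompt and returns text.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # assumption: any error triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real SDK calls, so the sketch runs without network access.
def openai_stub(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def azure_stub(prompt: str) -> str:
    return f"azure reply to: {prompt}"


def bedrock_stub(prompt: str) -> str:
    return f"bedrock reply to: {prompt}"


chain = [("openai", openai_stub), ("azure", azure_stub), ("bedrock", bedrock_stub)]
used, reply = complete_with_fallback("hello", chain)
print(used, "->", reply)
```

In this run the simulated OpenAI outage is absorbed by the second provider in the chain, which is the sense in which a provider outage can stay invisible to the caller.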

In total, these changes cut dropped calls by 97% and made provider outages invisible to users.

Full deep dive with architecture diagrams, failure scenarios, and code-level decisions here: https://vapi.ai/blog/how-we-achieved-99-99-reliability-at-va...