Ask HN: How are you scaling AI agents reliably in production?

nivedit-jain · 8/15/2025, 5:51:54 AM
I’m looking to learn from people running agents beyond demos. If you have a production setup, would you share what works and what broke?

What I’m most curious about:

- Orchestrator choice and why: LangGraph, Temporal, Airflow, Prefect, custom queues.

- State and checkpointing: where do you persist steps, how do you replay, and how do you handle schema changes (rough sketch of what I mean after this list).

- Concurrency control: parallel tool calls, backpressure, timeouts, idempotency for retries (also sketched after this list).

- Autoscaling and cost: policies that kept latency and spend sane, spot vs on-demand, GPU sharing.

- Memory and retrieval: vector DB vs KV store, eviction policies, preventing stale context.

- Observability: tracing, metrics, evals that actually predicted incidents.

- Safety and isolation: sandboxing tools, rate limits, abuse filters, PII handling.

- A war story: the incident that taught you a lesson and the fix.
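
To make the checkpointing and idempotency items concrete, here is a stripped-down sketch of the pattern I mean: one document per step, keyed by (run_id, step_id), so a retry or replay reuses the stored output instead of re-running the side effect. Collection and field names are illustrative, not our real schema:

    # Illustrative only: checkpoint-per-step with replay via a unique key.
    # Assumes pymongo and a local MongoDB; all names are made up for the example.
    from datetime import datetime, timezone

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    steps = client["agents"]["agent_steps"]  # hypothetical collection
    # Unique index: a concurrent duplicate write fails loudly instead of double-running.
    steps.create_index([("run_id", 1), ("step_id", 1)], unique=True)

    def run_step(run_id: str, step_id: str, fn, *args):
        """Execute a step once per (run_id, step_id); replays return the saved output."""
        cached = steps.find_one({"run_id": run_id, "step_id": step_id})
        if cached:  # retry or replay path: skip the side effect, reuse the result
            return cached["output"]
        output = fn(*args)  # the actual LLM or tool call
        steps.insert_one({
            "run_id": run_id,
            "step_id": step_id,
            "output": output,
            "created_at": datetime.now(timezone.utc),
        })
        return output

The part I haven't solved cleanly is schema changes: checkpoints written under an old step format that new code has to replay.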

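For concurrency control, this is roughly the shape of "parallel tool calls with a cap and per-call timeouts" I'm asking about; the limit and timeout numbers are placeholders, not tuned values:

    # Illustrative only: bounded fan-out of tool calls with hard timeouts.
    import asyncio

    MAX_PARALLEL_TOOLS = 8       # backpressure: at most 8 tool calls in flight
    TOOL_TIMEOUT_SECONDS = 30    # give up (or retry) after 30s per call

    _tool_slots = asyncio.Semaphore(MAX_PARALLEL_TOOLS)

    async def call_tool(tool, payload):
        """Run one tool call under the concurrency cap with a hard timeout."""
        async with _tool_slots:
            return await asyncio.wait_for(tool(payload), timeout=TOOL_TIMEOUT_SECONDS)

    async def fan_out(tool, payloads):
        """Run many calls; a timeout or failure surfaces per task instead of killing the batch."""
        return await asyncio.gather(*(call_tool(tool, p) for p in payloads),
                                    return_exceptions=True)

The missing piece for me is pushing back upstream when that semaphore stays saturated, hence the backpressure question.
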
Context (so it’s not a drive-by): small team, Python, k8s, MongoDB for state, Redis for queues, everything custom, experimenting with LangGraph and Temporal. Happy to share configs and trade notes in the comments.

Answer any subset. Even a quick sketch of your stack and one gotcha would help others reading this. Thanks!
