Agent simulations = unit testing for AI?
Think: What if the upstream tool fails mid-execution? What if the user flips intent mid-dialogue? What if the agent’s assumptions were subtly wrong?
from self-driving cars to AI agents? The above aren’t one-off tests. They’re like AV simulations: controlled environments to explore failure boundaries. Autonomous vehicle teams learned long ago that real-world data isn't enough. The rarest events are the most important—and you need to generate and replay them systematically. That same long-tail distribution applies to LLM agents. We’ve started treating scenario testing as a core part of the dev loop—versioning simulations, running them in CI, and evolving them as our agent behavior changes. It’s not about perfect coverage,it’s about shifting from “test after” to “test through simulation” as part of iterative agent development. Curious if others here are doing something similar. How are you testing your agents beyond a few prompts and metrics? Would love to hear how the HN crowd is thinking about agent reliability and safety—not just in research, but in real-world deployments.
Your idea of simulating agent interactions is interesting, but I want to know how are you actually evaluating simulation runs?