Agent simulations = unit testing for AI?

In traditional software, we write unit tests to catch regressions before they reach users. In AI systems, especially agentic ones, that model breaks down. You can test inputs and outputs and run evals, but agents operate over time, across tools, MCP servers, APIs, and unpredictable user input. The failure modes are non-obvious and often emerge only in edge cases. I'm seeing an emerging practice: agent simulations, meaning structured, repeatable scenarios that test how an AI agent behaves in complex or long-tail situations.

Think: What if the upstream tool fails mid-execution? What if the user flips intent mid-dialogue? What if the agent’s assumptions were subtly wrong?
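To make those questions concrete, here is a minimal sketch of what one such scenario could look like in Python. Everything in it is hypothetical and for illustration only: `FlakyTool`, `Scenario`, `toy_agent`, and `run_scenario` are invented names, and the "agent" is a stand-in function rather than a real LLM call. The point is the shape: a scripted user (including an intent flip) plus a deterministic injected tool failure, so the same run can be replayed.

```python
from dataclasses import dataclass, field


class ToolError(Exception):
    """Raised by the simulated tool when a fault is injected."""


@dataclass
class FlakyTool:
    """Hypothetical upstream tool that fails on chosen call numbers."""
    fail_on_calls: set[int]
    calls: int = 0

    def __call__(self, query: str) -> str:
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ToolError(f"injected failure on call {self.calls}")
        return f"result for {query!r}"


@dataclass
class Scenario:
    """A repeatable script: scripted user turns plus injected tool faults."""
    name: str
    user_turns: list[str]
    fail_on_calls: set[int] = field(default_factory=set)


def toy_agent(turn: str, tool: FlakyTool) -> str:
    """Stand-in agent (assumption): one tool call per turn, one retry."""
    for _ in range(2):
        try:
            return tool(turn)
        except ToolError:
            continue
    return "sorry, the tool is unavailable"


def run_scenario(scenario: Scenario, agent) -> list[str]:
    """Replay the scenario against an agent and return its replies."""
    tool = FlakyTool(fail_on_calls=set(scenario.fail_on_calls))
    return [agent(turn, tool) for turn in scenario.user_turns]


if __name__ == "__main__":
    scenario = Scenario(
        name="tool fails after the user flips intent",
        user_turns=["book a flight to Paris", "actually, book a train instead"],
        fail_on_calls={2},  # the first tool call of the second turn fails
    )
    for reply in run_scenario(scenario, toy_agent):
        print(reply)
```

Because the failure is keyed to a call number rather than to randomness, the scenario behaves the same way on every run, which is what makes it usable as a regression check. A real setup would swap `toy_agent` for the actual agent loop and would need some way to judge the replies.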

From self-driving cars to AI agents?

The scenarios above aren't one-off tests. They're like AV simulations: controlled environments to explore failure boundaries. Autonomous vehicle teams learned long ago that real-world data isn't enough. The rarest events are the most important, and you need to generate and replay them systematically. That same long-tail distribution applies to LLM agents.

We've started treating scenario testing as a core part of the dev loop: versioning simulations, running them in CI, and evolving them as our agent behavior changes. It's not about perfect coverage; it's about shifting from "test after" to "test through simulation" as part of iterative agent development.

Curious if others here are doing something similar. How are you testing your agents beyond a few prompts and metrics? Would love to hear how the HN crowd is thinking about agent reliability and safety, not just in research but in real-world deployments.

Comments (1)

aszen · 1h ago

We are just starting to introduce AI, and for now we rely on simple evals as unit tests that devs run locally to fine-tune prompts and context.

Your idea of simulating agent interactions is interesting, but how are you actually evaluating simulation runs?
