We hit a wall testing AI agents, agents simulations works better

1 draismaa 0 6/26/2025, 4:24:20 PM

We've been working with teams building AI agents (agentic systems, with actual execution) But here's the thing: everyone says “agents are the future,” yet no one really knows how to test them. Some teams are manually walking through conversations, others are just shipping and "vibe checking" what comes back. Both break down at scale. The real problem? We’re testing agents like software, but agents don’t behave like software. They make decisions, adapt, escalate, reason across contexts. They're more like processes than functions. Rogerio, our CTO, wrote up a deeper dive on how we see the future of agent testing, and why agent simulations (not hardcoded flows) are becoming the new unit tests for AI systems. We built LangWatch scenario to let teams simulate real-world agent behavior and catch regressions early on. Would love feedback from folks who’ve been burned by this or hacked together their own simulation setups.

ClaudeBox: Claude Code Docker Development Environment (github.com)

Show HN: I built a Résumé tool that helps people land job interviews (trymockly.ai)

DDR4 Module Prices Overtake DDR5 (techpowerup.com)

Find Remote Jobs (remotepathglobal.com)

Show HN: Window Expander – Mostly maximize your windows (windowexpander.com)

Myths and mythconceptions: what does it mean to be a programming language?(2021) (dl.acm.org)

DeepMind Close to Solving the Navier-Stokes Millenium Prize Problem (english.elpais.com)

The Wheel (Direction)

Tesla head of manufacturing Omead Afshar fired by Elon Musk (cnbc.com)

VMware perpetual license holder receives audit letter from Broadcom (arstechnica.com)

`blaze-install` is a drop-in CLI that installs NPM packages (github.com)

Elastic's journey to build Elastic Cloud Serverless (elastic.co)

The Washington Post Will Ask Some Sources to Annotate Its Stories (nytimes.com)

Marge Simpson isn't dead yet, so everyone can calm down (cnn.com)

Apple's Swift coding language is working on Android support (9to5google.com)

Show HN: Chisel – Profile GPU Kernels Without a GPU (Nvidia and AMD) (github.com)

Salesforce CEO Says 30% of Internal Work Is Being Handled by AI (bloomberg.com)

A.I. Is Homogenizing Our Thoughts (newyorker.com)

Lalo Schifrin, Film Composer Who Wrote 'Mission: Impossible' Theme, Dies at 93 (variety.com)

Coloring.app – Custom AI Coloring Pages and Books (coloring.app)

Simulating a neural operating system with Gemini 2.5 Flash-Lite (developers.googleblog.com)

Can a Brain Be Preserved and Uploaded? Neuroscience Reveals 40% Chance It Could (iflscience.com)

I started writing the hono.js of Golang (github.com)

BIS: Stablecoins Fail Key Tests of Real Money (cointelegraph.com)

Show HN: Built a Food Scanner for Longevity (getbiohack.app)

Understanding the sport viewership experience using functional IR spectroscopy (nature.com)

Built something to help with panic attacks – what am I missing? (abler.health)

Britain shuns $34B Morocco-UK subsea power project (reuters.com)

Bluefishjs: Composing Diagrams in with Declarative Relations (dl.acm.org)

Stryker is a new generation mobile pentest application (github.com)

RFK Jr's new vaccine panel votes against preservative in flu shots in shock move (theguardian.com)

Carrot Cache: High-Performance, SSD-Friendly Caching Library for Java (medium.com)

Genomics coordinate systems (docs.rs)

Everything We Just Learned About the Ordnance Penetrator Strikes on Iran (twz.com)

Mammal evolution of upright posture was no cake walk (cosmosmagazine.com)

The Bitter Lesson (finbarr.ca)

Night train startup 'Nox' promises private cabins and reasonable prices (euronews.com)

Diabolus ex Machina – ChatGPT as literary critic (amandaguinzburg.substack.com)

Sesame – Natural voice companion preview (app.sesame.com)

Show HN: Gemini Bookmarks – Bookmark and tag responses in Gemini conversations (github.com)

Critical AI Summer Reading: Empires, Cons and Everlasting False Promises (neuralab.net)

The End of SaaS?

Programming skills that AIs cannot have and how you learn them (youtube.com)

CIA Insectothopter (cia.gov)

Kea 3.0, our first LTS version (isc.org)

John Carmack (Keen Technologies): Research Directions (youtube.com)

Vpype: A CLI for Plotter Art (github.com)

Out-of-Band, Part 1: The new gen of IP KVMs and how to find them (runzero.com)

Ask HN: Employers of HN – Would you hire a career changer without experience?

Robots that learn (openai.com)

We hit a wall testing AI agents, agents simulations works better

Comments (0)