Show HN: Simulation-Based Testing for Agents Using AG-UI Protocol

Comments (1)

rchaves · 5h ago

Hello HN!

tl;dr: We built Scenario, an open-source testing library for AI agents. It simulates real conversations with your agent, its code-driven, and lets you assert anything mid-dialogue. Repo: https://github.com/langwatch/scenario Docs: https://scenario.langwatch.ai/

I'm Rogerio, founder of LangWatch, I've been helping many customers building LLM applications in this past two years and worked with Alex on this.

Most of the efforts for LLM quality so far were about evaluations, single-turn, there was nothing actually good to test agents, it all felt forced, but we believe we cracked it now, we have built an agent testing library that test your agent by simulating a user and playing a conversation back and forth with it.

One of the key challenges there was that we had to make it compatible with all the 273+ AI frameworks (and counting) there are. Luckliy AG-UI protocol popped up recently, standardizing agents frameworks and UI interactions, this is perfect, because at the end of the day, we want our user simulator to "see" just the same that the user sees.

So we made Scenario in a way that is really easy to connect to any agent no matter the tech stack, from a simple string <-> string connection, to openai standard messages format, to AG-UI.

The other key challenge was to balance testing the open-endedness of agents vs having reliable cases you want to test, so we worked a lot on thinking through the autopilot simulation vs the fully scripted one, and here again, the goal was complete interoperability. At the end of the day, the design we achieved was simply having lambdas, that you can call at any point of the test, so it's just code, where you can connect any other evaluation or assertion tool you want, we are not restrictive.

Check out the repo and the docs, we would love to get some feedback in here!

Repo: https://github.com/langwatch/scenario Docs: https://scenario.langwatch.ai/

The Ant Mill: How theoretical high-energy physics descended into groupthink (jespergrimstrup.substack.com)

National Archives to restrict public access starting July 7 (archives.gov)

Mexico is now Chinas No. 1 car export market (mexiconewsdaily.com)

Python Tools Are Quickly Adopting the New pylock.toml Standard (socket.dev)

The Discovery Engine (automated system for scientific discovery) (zenodo.org)

Show HN: Vybetr – Hire AI app developers using tools like Lovable, Bolt and more (vybetr.com)

Using Lxcfs Together with Podman (die-welt.net)

Lessons from LangChain and Slack and MCP Integration (medium.com)

Use of ch unit considered inappropriate (in certain circumstances) (clagnut.com)

Brit Watchdog Cracks Down on Data Collection by Smart TVs, Speakers, Air Fryers (theguardian.com)

Thoughts on the AI 2027 Discourse (dynomight.substack.com)

Childhood and Education #10: Behaviors (thezvi.substack.com)

When Can I Stop Listening to My Enemy's Points? (substack.com)

Show HN: Letter Lockbox – A word game I built over the weekend with Claude Code (letterlockbox.com)

Programmers and Their Monospace Blogs (lambdaland.org)

Ask HN: What's your fastest conversion from cold outreach to prepaid client?

Namespaced Pundit Policies Without the Repetition Racket (alec-c4.com)

The Legacy of "The Gastronomical Me" (lithub.com)

Show HN: How Usage Works (usage.ai)

Why Your Car's Touchscreen Is More Dangerous Than Your Phone (carsandhorsepower.com)

Dr. Dobb's (drdobbs.com)

Joining CNCF as Executive Director: Let's Build What's Next (cncf.io)

Elisa: A Comprehensive Guide to Enzyme-Linked Immunosorbent Assay (clyte.tech)

Secure your Express application APIs in 5 minutes with Cedar (aws.amazon.com)

Why Paris's Centre Pompidou, not even 50 years old, must close for five years (lemonde.fr)

Curated realities: An AI film festival and the future of human expression (arstechnica.com)

Scientists can now target the cells at the center of ALS (alleninstitute.org)

Haflang: Hardware Acceleration of Functional Languages (haflang.github.io)

Waldo – Geoip Lookups (geoip.dpdns.org)

David Friedberg: it is important for America that Mamdani get elected (twitter.com)

Portable Network Graphics (PNG) Specification (Third Edition) (w3.org)

EU lawmakers vote to bar carry-on luggage fees on planes (france24.com)

I Designed UX for an AI Product Last Year. Are Those Lessons Still Valid? (uxdesign.cc)

The Sun is twisting Mercury's crust in unexpected ways (bgr.com)

How to (Almost) solve cybersecurity once and for all (adaptive.live)

I Love GitOps (newsletter.masterpoint.io)

What It's Like to Be 'Mind Blind' (time.com)

Embabel: Framework for Building AI Agents with Java (thenewstack.io)

Epic Games and Qualcomm Are Bringing Fortnite to Windows 11 on Arm (thurrott.com)

Marginalia mania: how 'annotating' books went from no-no to BookTok's next trend (theguardian.com)

The AI Revolution: Human like interfaces, not intelligence (jaimefh.com)

Snyk Acquires Invariant Labs (snyk.io)

The Secret Rules of the Terminal (wizardzines.com)

Scaling Pinterest ML Infrastructure with Ray: From Training to ML Pipelines (medium.com)

Show HN: I built an AI thumbnail generator for YouTubers who can't design (thumbo.io)

Amish company embraced robots–then made an even bolder bet (fortune.com)

AI doesn't have to reason to take your job (vox.com)

The Reenchanted World: On finding mystery in the digital age (harpers.org)

Adding to markwhen documents via SMS and email (docs.markwhen.com)

Show HN: Simulation-Based Testing for Agents Using AG-UI Protocol

Comments (1)