Show HN: prompttest – pytest for LLMs
Every time I tweaked a prompt, I had to rerun a bunch of test cases manually and eyeball the results.
It felt like writing code without unit tests.
Existing tools I found were either:
- Full frameworks where you write evaluators in Python.
- Big platforms expanding into monitoring/security.
I wanted something simpler: a fast CLI that just tests prompts.
So I built prompttest, a pytest-like workflow for LLMs:
- You define a prompt in a .txt file with `{variables}`.
- You write test cases in .yml, with plain-English criteria.
- You run prompttest to see clear pass/fail results in your terminal.
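To make this concrete, a minimal setup looks roughly like this (simplified example; the field names here are illustrative, not an exact schema):

    # prompts/welcome.txt
    Write a short welcome message for {name}, who just signed up for {product}.

    # tests/welcome.yml  (illustrative field names)
    prompt: welcome
    cases:
      - name: greets_by_name
        vars:
          name: Dana
          product: Acme Notes
        criteria: The response must be polite and address the user by name.

No Python evaluators to write, just the prompt file and the YAML next to it.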
The core idea is that your "assertion" is just English.
Example:
> The response must be polite and address the user by name.
Then a model grades the output for you.
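It's the LLM-as-a-judge pattern: the grader model gets the criterion plus the generated output and returns a verdict, roughly along these lines (paraphrased, not the exact grading prompt):

    Criterion: The response must be polite and address the user by name.
    Response: <the generated output>
    Does the response satisfy the criterion? Answer PASS or FAIL with a one-line reason.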
This gives you a safety net: you can refactor prompts and instantly see regressions.
The project is still early.
It runs on OpenRouter, so you can test against many models (including free ones) with one API key.
Would love feedback, ideas, or use cases you'd want supported.