Show HN: An Open-Source Eval Suite That Helps You Fix Postgres-Based Text-to-SQL

1 cevian 0 8/28/2025, 3:41:46 PM tigerdata.com ↗
We've been building text-to-SQL at TigerData and kept hitting the same problem: evaluation tools that tell you your accuracy score but nothing about how to improve it.

A 60% pass rate tells you little if you don't know whether the failures come from bad schema retrieval or poor SQL generation. That's the difference between actionable insight and benchmarketing.

So we built, and are now open-sourcing, text-to-sql-eval, based on one simple insight. Every query runs three different ways:

- Normal mode - let the system retrieve schema and generate SQL
- Full schema mode - provide all tables to test upper bound accuracy
- Golden tables mode - give it the right tables to isolate reasoning issues

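The performance delta between modes tells you what's broken: if accuracy jumps once the right tables are supplied, schema retrieval is the bottleneck; if queries still fail with the golden tables, the problem is in SQL generation.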
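To make the delta idea concrete, here is a minimal sketch of how per-mode pass rates could be turned into a diagnosis. It is not the text-to-sql-eval API: the function names, the mode keys, and the rule for picking the likely bottleneck are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mode keys; the real tool may name its modes differently.
MODES = ("normal", "full_schema", "golden_tables")


def pass_rates(outcomes):
    """Turn an iterable of (mode, passed) pairs into {mode: pass rate}."""
    totals, passes = Counter(), Counter()
    for mode, passed in outcomes:
        totals[mode] += 1
        passes[mode] += int(passed)
    return {m: passes[m] / totals[m] for m in MODES if totals[m]}


def diagnose(rates):
    """Compare modes to guess where failures come from (illustrative rule only)."""
    # Accuracy lost because the system did not pick the right tables itself.
    retrieval_gap = rates["golden_tables"] - rates["normal"]
    # Failures that remain even when the right tables are handed over.
    reasoning_gap = 1.0 - rates["golden_tables"]
    likely = "schema retrieval" if retrieval_gap > reasoning_gap else "SQL generation"
    return {
        "retrieval_gap": retrieval_gap,
        "reasoning_gap": reasoning_gap,
        "full_schema_rate": rates["full_schema"],
        "likely_bottleneck": likely,
    }


if __name__ == "__main__":
    # Made-up outcomes: 10 questions per mode.
    outcomes = (
        [("normal", True)] * 6 + [("normal", False)] * 4
        + [("full_schema", True)] * 7 + [("full_schema", False)] * 3
        + [("golden_tables", True)] * 9 + [("golden_tables", False)] * 1
    )
    print(diagnose(pass_rates(outcomes)))
```

In this sketch, a large gap between golden tables and normal mode points at retrieval, while a low golden tables score points at generation. The numbers in the example are invented, not results from the suite.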

PostgreSQL-specific because database quirks matter for correctness. Works with any LLM or text-to-SQL system. Includes an LLM-as-judge option because deterministic matching produces too many false negatives on complex queries.
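As an illustration of why strict matching can produce false negatives, here is a small sketch comparing result sets two ways. It assumes "deterministic matching" means comparing the rows returned by the generated query against the rows from a reference query, which the post does not spell out; both function names are hypothetical and not part of the suite.

```python
from collections import Counter


def strict_match(expected, actual):
    """Exact comparison: same rows, same row order, same column order."""
    return expected == actual


def relaxed_match(expected, actual):
    """Compare rows as a multiset, ignoring row order and column order."""
    def normalize(rows):
        # repr() lets mixed types (ints, strings, None) be sorted together.
        return Counter(tuple(sorted(map(repr, row))) for row in rows)
    return normalize(expected) == normalize(actual)


if __name__ == "__main__":
    expected = [(1, "a"), (2, "b")]
    # Same data, but rows and columns come back in a different order.
    actual = [("b", 2), ("a", 1)]
    print(strict_match(expected, actual), relaxed_match(expected, actual))
```

Even the relaxed check rejects a query that returns an extra but harmless column, or a differently rounded value. Cases like those are where a judge model may be more forgiving than any fixed rule.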

We've been using this internally to improve our (also open-sourced) text-to-sql system.

Open sourcing both the eval suite and a companion tool for generating test datasets from your production schema.

Built with uv for easy setup. TimescaleDB for tracking results over time. Simple Flask UI for exploring failures.

Try it, break it, tell us what's missing.
