Show HN: An Open-Source Eval Suite That Helps You Fix Postgres-Based Text-to-SQL

1 point by cevian, 8/28/2025, 3:41:46 PM, 0 comments (tigerdata.com)

We've been building text-to-SQL at TigerData and kept hitting the same problem: evaluation tools that tell you your accuracy score but nothing about how to improve it.

A 60% pass rate tells you little if you don't know whether the failures come from bad schema retrieval or poor SQL generation. That is the difference between actionable insight and benchmarketing.

So we built, and are now open-sourcing, text-to-sql-eval. The core idea is to run every query three different ways:

- Normal mode - let the system retrieve schema and generate SQL
- Full schema mode - provide all tables to test upper bound accuracy
- Golden tables mode - give it the right tables to isolate reasoning issues

The performance delta between modes tells you exactly what's broken.
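To make the delta idea concrete, here is a minimal sketch of how the three pass rates might be read side by side. It is not code from text-to-sql-eval: the `ModeAccuracy` type, the `diagnose` function, the 5-point threshold, and the interpretation of each gap are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModeAccuracy:
    """Pass rates in [0, 1] for the three modes (hypothetical container)."""
    normal: float         # system retrieves schema and generates SQL
    full_schema: float    # all tables provided
    golden_tables: float  # only the correct tables provided


def diagnose(acc: ModeAccuracy, threshold: float = 0.05) -> list[str]:
    """Return likely problem areas based on gaps between modes.

    The threshold and the reading of each gap are assumptions,
    not rules taken from the eval suite.
    """
    findings = []
    # Golden tables bypasses retrieval, so a gap to normal mode
    # suggests retrieval is picking the wrong tables.
    if acc.golden_tables - acc.normal > threshold:
        findings.append("schema retrieval")
    # If the full schema scores below golden tables, irrelevant
    # tables are probably distracting the model.
    if acc.golden_tables - acc.full_schema > threshold:
        findings.append("schema noise")
    # Failures that remain even with the right tables point at
    # SQL generation itself.
    if 1.0 - acc.golden_tables > threshold:
        findings.append("sql generation")
    return findings


if __name__ == "__main__":
    print(diagnose(ModeAccuracy(normal=0.60, full_schema=0.70, golden_tables=0.90)))
```

With made-up numbers like these, all three areas get flagged; a system scoring 0.50 / 0.95 / 0.97 would flag only retrieval.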

PostgreSQL-specific because database quirks matter for correctness. Works with any LLM or text-to-SQL system. Includes an LLM-as-judge option because deterministic matching produces too many false negatives on complex queries.
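As a rough illustration of the false-negative problem, consider two result sets that answer the same question but differ in row and column order. The sketch below is hypothetical and not how text-to-sql-eval compares results; `exact_match` and `normalized_match` are names invented here.

```python
from collections import Counter


def exact_match(expected: list[tuple], actual: list[tuple]) -> bool:
    """Strict comparison: row order and column order must both agree."""
    return expected == actual


def normalized_match(expected: list[tuple], actual: list[tuple]) -> bool:
    """Looser comparison that ignores row order and column order.

    Each row becomes a sorted tuple of value reprs, and rows are
    compared as a multiset so duplicates still count.
    """
    def normalize(rows: list[tuple]) -> Counter:
        return Counter(tuple(sorted(map(repr, row))) for row in rows)

    return normalize(expected) == normalize(actual)


if __name__ == "__main__":
    expected = [(1, "a"), (2, "b")]
    actual = [("b", 2), ("a", 1)]  # same data, different ordering
    print(exact_match(expected, actual))
    print(normalized_match(expected, actual))
```

Normalizing like this trades false negatives for possible false positives (for example, when column order actually matters), and it still cannot recognize differently shaped but equally valid answers, which is where a judge model can help.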

We've been using this internally to improve our (also open-sourced) text-to-sql system.

We're open-sourcing both the eval suite and a companion tool for generating test datasets from your production schema.

Built with uv for easy setup. TimescaleDB for tracking results over time. Simple Flask UI for exploring failures.

Try it, break it, tell us what's missing.
