Ask HN: How are you checking if your LLM is giving customers the right answer?

2 points by navaed01 | 5/28/2025, 11:08:00 AM
Something that’s been bothering me is observability for LLMs: how do you actually check that the model is giving customers the right answer?

There seem to be multiple failure points: hallucinations, partial responses (missing facts), claiming that information does not exist when it does, and response accuracy that depends on how and what is being asked.
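
To make those concrete, the kind of per-response check I have in mind compares each answer against a handful of reference facts for that query and flags the failure modes above. A rough pure-Python sketch; the refusal phrases and reference facts are made-up examples, not a tested implementation:

    # Rough per-response check for the failure modes above. The refusal
    # phrases and reference facts are illustrative assumptions.
    REFUSAL_PHRASES = ("i don't have that information", "does not exist", "cannot find")

    def classify_response(response: str, expected_facts: list[str]) -> dict:
        """Flag hallmarks of a bad answer for one (query, response) pair."""
        text = response.lower()
        missing = [f for f in expected_facts if f.lower() not in text]
        return {
            # Bot claims the info doesn't exist even though we have reference facts
            "false_refusal": bool(expected_facts) and any(p in text for p in REFUSAL_PHRASES),
            # Some, but not all, expected facts made it into the answer
            "partial_response": 0 < len(missing) < len(expected_facts),
            # None of the expected facts are present at all
            "fully_missing": bool(expected_facts) and len(missing) == len(expected_facts),
            "missing_facts": missing,
        }

    if __name__ == "__main__":
        print(classify_response(
            "Sorry, that plan does not exist in our catalog.",
            expected_facts=["Pro plan", "$49/month"],
        ))

Substring matching is obviously crude (it won’t catch paraphrases or hallucinated extra claims), but it illustrates the categories I care about.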

How are you measuring this in production today?

- Thumbs up/down seems like a weak signal.
- Running a sample of ‘known queries’ assumes you know what is being asked.
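
One idea I’ve been toying with but haven’t validated: grade a random sample of real logged turns with an LLM-as-judge instead of a fixed query set, so coverage follows what customers actually ask. A rough sketch using the OpenAI Python client; the judge model, rubric, and log schema are assumptions:

    # Grade a random sample of logged (query, answer, context) turns with a
    # judge model. Model name, rubric, and log schema are assumptions.
    import json
    import random
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    RUBRIC = (
        "You grade a customer-support answer against the retrieved context. "
        'Reply with JSON: {"grounded": bool, "complete": bool, "notes": str}. '
        "grounded = every claim is supported by the context; "
        "complete = nothing in the context that answers the question was left out."
    )

    def judge(query: str, answer: str, context: str) -> dict:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed judge model, swap for whatever you trust
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Question:\n{query}\n\nContext:\n{context}\n\nAnswer:\n{answer}"},
            ],
        )
        return json.loads(resp.choices[0].message.content)

    def sample_and_grade(logged_turns: list[dict], k: int = 50) -> list[dict]:
        """Judge a random slice of production traffic instead of a fixed golden set."""
        sample = random.sample(logged_turns, min(k, len(logged_turns)))
        return [{**turn, "grade": judge(turn["query"], turn["answer"], turn["context"])} for turn in sample]

The judge is only as trustworthy as its rubric and model, so I’d still hand-review a slice of its grades, but it scales further than thumbs up/down.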

What have you tried that works for you?
