Show HN: Deterministic evals API – an alternative to LLM-as-judge (free credits)

Hey HN,

We built Composo because AI apps fail unpredictably and teams have no idea if their changes helped.

LLM-as-judge doesn't work - it gives inconsistent scores from run to run, handles agents poorly, and doesn't tell you what to fix.

Our purpose-built evaluation models give you:

- Deterministic scores (same input = same score, always)
- Instant identification of where prompts, retrievals, agents & tool calls fail
- Exact failure analysis ("tool calls are looping due to a poorly specified schema")
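
To make "deterministic" concrete, here's a minimal sketch of what checking an agent's tool call against an evals API like this could look like. The endpoint URL, payload shape, and response fields (score, explanation) are illustrative assumptions for the sketch, not Composo's actual API:

  # Illustrative sketch only: the endpoint, payload shape, and
  # response fields are assumptions, not Composo's actual API.
  import os
  import requests

  API_URL = "https://api.composo.ai/v1/evals"  # hypothetical endpoint
  headers = {"Authorization": f"Bearer {os.environ['COMPOSO_API_KEY']}"}

  # A hypothetical agent trace with one tool call to evaluate.
  payload = {
      "criteria": "Reward tool calls whose arguments match the declared schema",
      "messages": [
          {"role": "user", "content": "What's the weather in Paris?"},
          {
              "role": "assistant",
              "tool_calls": [
                  {"name": "get_weather", "arguments": {"city": "Paris"}}
              ],
          },
      ],
  }

  # "Deterministic" means the same input always yields the same score,
  # so two identical requests should agree exactly.
  first = requests.post(API_URL, json=payload, headers=headers).json()
  second = requests.post(API_URL, json=payload, headers=headers).json()
  assert first["score"] == second["score"]

  print(first["score"])        # e.g. a float in [0, 1]
  print(first["explanation"])  # e.g. "tool calls are looping due to ..."

The point of the assert is the product claim itself: with an LLM-as-judge you'd expect the two scores to differ, so eval diffs between app versions would be noise; with a deterministic scorer a score change means your change caused it.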

We're 92% accurate vs 72% for SOTA LLM-as-judge.

We're giving 10 startups free access:

- 10k eval credits
- Our just-launched evals API for agents & tool calling
- 5-minute setup

Already helping teams at Palantir, Accenture, and Tesla ship reliable AI.

Apply: composo.short.gy/startups

Happy to answer questions about evaluation, reward models, or why LLMs are bad at judging themselves. Reach us at startups@composo.ai.
