LLM Evals Are Just Tests. Why Are We Making This So Complicated?

Comments (1)

8organicbits · 7m ago

So, did the tests allow you to build a system that never confused existing features with new features? That seems like the problem statement, but I think I'm only seeing probabilistic testing.
