Personally I think LLM benchmarks make agents worse. All these companies chase the benchmarks, overfit, and think being able to cheat at the math olympiad is gonna get us to AGI. Instead researchers should peer in and get me an agent that can reliably count the number of "i"'s in mississippi.
upperhalfplane · 1h ago
I don't quite think they cheat at math olympiads, but there are obviously blind spots on the unspectacular tasks. That being said, Mississippi is both a good and a bad question to ask. On the one hand it's "the bare minimum" to require; on the other hand, is it really a feat? Like, most models can write a piece of code that computes it. If you give me a task I'm not built to solve (like counting the number of i's in a text), the smart thing is to write a program to count them, which LLMs can do.
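For concreteness, a minimal sketch of the kind of program meant here, in plain Python (my own example, not taken from any particular model's output):

    # Count occurrences of "i" in a word -- trivial to write,
    # even for a model that can't reliably count letters "in its head".
    word = "mississippi"
    count = word.count("i")
    print(f'the word "{word}" contains {count} "i"s')  # -> 4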
The best way to measure intelligence is probably whether a model knows its own strengths and weaknesses and deals with them efficiently, and that ability is what an eval should really be measuring.
tianlong · 1h ago
What's the TL;DR on how to solve that benchmark problem?