The Problem with AI Benchmarks

Comments (1)

philecho · 1h ago

I wrote an essay on why common AI benchmarks are not terribly useful, arguing that we should mostly rely on normal user experience instead.

Key reasons: 1) Most questions do not have answers that are simply ‘right’ or ‘wrong’; 2) Most user problems are poorly defined; 3) Agents are getting popular, and they combine and compound both of these problems.

