Benchmark for Local LLMs with German "Who Wants to Be a Millionaire" Questions

Comments (1)

mynti · 1d ago

This is super cool! One thing I find counterintuitive is that GPT5 and o3 do not perform better. GPT5 gets about 800k on average per round, but I would have expected it to be nearly perfect, since these are not particularly hard questions, mostly trivia or simple lookup knowledge. There is little reasoning involved, so I expected the big models to do much better.
