I submitted questions to HLE, and I tend to agree that the review process was far from perfect. For example, some of my questions were simply misunderstood, and another was claimed to have the wrong answer when it didn't.
I think the situation is better for math and physics, where answers are more straightforward to verify; it's probably even worse in the humanities. I also believe that releasing the models' answers would help verify the questions, but that has never been done (possibly for fear of even more train-on-test?)
Edit: to clarify, when I contacted the orgs they helped me with these problems, but I suspect that wouldn't have happened if the problem had gone in the opposite direction.
falcor84 · 1d ago
It's funny, but I suppose it's fully in line with the replication crisis - that's probably close to the fraction of published "science" that is indeed wrong. Can we use this opportunity to connect each claim more directly with the evidence for it, and help resolve the crisis?