We run a variety of tests through Claude Code and the OpenAI API (using GPT-4.1 as a reference point for comparison). We also have a Vibe Check feature that lets users vote whenever they feel the quality of LLM answers has either improved or declined. Over the past few weeks of monitoring, we’ve noticed just how volatile Claude Code’s performance can be.

Chart is here: https://i.snipboard.io/RydmH7.jpg

1) Up until August 28, things were more or less stable.

2) On August 29, the failure rate doubled, then returned to normal by the end of the day.

3) The next day, August 30, the failure rate spiked again, this time to 70%. It later dropped to around 50% on average but remained highly volatile for nearly a week.

4) Starting September 4, the system settled into a more stable state again.
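For readers who want to run a similar check on their own test logs, here is a minimal sketch of how a "failure rate doubled" day could be flagged. It is not the code behind the chart: the function names, the (day, passed) record format, the choice of baseline days, and the 2x threshold are all assumptions made for illustration, and the numbers are synthetic.

```python
from statistics import mean


def daily_failure_rates(runs):
    """Map each day to failed / total, given (day, passed) pairs."""
    totals, failures = {}, {}
    for day, passed in runs:
        totals[day] = totals.get(day, 0) + 1
        if not passed:
            failures[day] = failures.get(day, 0) + 1
    return {day: failures.get(day, 0) / totals[day] for day in sorted(totals)}


def flag_spikes(rates, baseline_days, factor=2.0):
    """Return days whose failure rate is at least `factor` times the baseline mean."""
    baseline = mean(rates[d] for d in baseline_days)
    return [
        day
        for day, rate in rates.items()
        if day not in baseline_days and rate >= factor * baseline
    ]


# Synthetic illustration only, NOT the measurements behind the chart.
runs = (
    [("08-27", True)] * 8 + [("08-27", False)] * 2
    + [("08-28", True)] * 8 + [("08-28", False)] * 2
    + [("08-29", True)] * 5 + [("08-29", False)] * 5
    + [("09-04", True)] * 8 + [("09-04", False)] * 2
)

rates = daily_failure_rates(runs)
spikes = flag_spikes(rates, baseline_days=["08-27", "08-28"])
print(rates)
print(spikes)
```

A fixed baseline window keeps the sketch short. A rolling baseline would adapt better if the "normal" failure rate drifts over time.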

It’s no surprise that many users complain about LLM quality and get frustrated when, for example, an agent writes excellent code one day but struggles with a simple feature the next. This isn’t just anecdotal: our data shows that answer quality fluctuates over time.

By contrast, our GPT-4.1 tests show numbers that stay consistent from day to day.
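One way to put a number on "consistent from day to day" is the coefficient of variation of the daily failure rates. The sketch below is an assumption about how such a comparison could be done, not a description of our pipeline. The `volatility` helper is hypothetical, and both series are made-up values chosen only to show the contrast.

```python
from statistics import mean, pstdev


def volatility(rates):
    """Coefficient of variation: population std dev divided by the mean."""
    m = mean(rates)
    return pstdev(rates) / m if m else 0.0


# Hypothetical daily failure rates, for illustration only.
swinging_series = [0.25, 0.50, 0.70, 0.45, 0.55]
steady_series = [0.24, 0.25, 0.26, 0.25, 0.25]

print(f"swinging: {volatility(swinging_series):.3f}")
print(f"steady:   {volatility(steady_series):.3f}")
```

Dividing by the mean makes the two series comparable even when their average failure rates differ.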

And that’s without even accounting for possible bugs or inaccuracies in the agent CLIs themselves (for example, Claude Code), which are updated with new versions almost every day.

What’s next: we plan to add more benchmarks and more models for testing. Share your suggestions and requests — we’ll be glad to include them and answer your questions.

https://isitnerfed.org

Ask HN: Would you use a CAPTCHA that blocks browser agents?

Exploring Canton: a privacy-preserving distributed ledger for finance (quant.engineering)

Atlassian says its 'Don't F– the Customer' principle drove cloud-only decision (computerworld.com)

Microsoft Goes Back to Basic, Open-Sources Bill Gates' Code (gizmodo.com)

Parents could get alerts if children show acute distress while using ChatGPT (theguardian.com)

Identity Gaps That Put AI Agents at Risk (dock.io)

68000 – The CPU ahead of its time (youtube.com)

You may not be interested in climate change, but it is interested in you (defenseone.com)

I'm building a skill-based Match-3 game with Chess-style Elo ranking (Browser/Mobile) (guivo.io)

An Inline Cache Isn't Just a Cache (mgaudet.ca)

Fraudulent Publishing in the Mathematical Sciences (arxiv.org)

FastComments is Now Globally Distributed (and more rusty) (blog.fastcomments.com)

Dependabot Support for Vcpkg (devblogs.microsoft.com)

Has Google ended support for plain HTML search? (google.com)

Android 16 QPR1 source code is nowhere to be found but Google swears it's coming (androidauthority.com)

Vercel Updates Pro Pricing (vercel.com)

SourceForge Sunsets Developer Web Hosting (sourceforge.net)

Overview of the DiskANN Project (2018–present) (harsha-simhadri.org)

ChatGPT 5 marginalizing Gelman's measurement error model in Stan (statmodeling.stat.columbia.edu)

PgEdge Goes Open Source (pgedge.com)

Lessons from Hidden Satoshi Gold Book on Crypto and AI (satoshigoldbook.com)

HiTex: A spam factory for AI-generated books (laurent.le-brun.eu)

Is Apple's iPhone 17 launch a win for India? (restofworld.org)

Trial and Error Driven Development (stevenoxley.com)

Exploratorium Cookbook Set: Volumes I, II and III (exploratoriumstore.com)

NATO's Chemical, Biological, Radiological and Nuclear (CBRN) Defence Policy (nato.int)

Senator: FTC should investigate Microsoft for dangerous and insecure software (wyden.senate.gov)

'China Is the Engine' Driving Nations Away from Fossil Fuels, Report Says (nytimes.com)

Show HN: HumanAlarm – Real people knock on your door to wake you up (humanalarm.com)

The rules behind Rust functions (blog.cuongle.dev)

Launching Bottlenecks Institute (bottlenecksinstitute.com)

In 1979 one of the best guitar solos recorded was cut for radio time (seekhifi.com)

Lifetime Starlink Deal? Nope, It's Just a Scam Circulating on Facebook (pcmag.com)

Understanding Motion and Relativity with Spacetime Diagrams (steuard.github.io)

Coffee naps might be the weirdest–and smartest–way to recharge (nationalgeographic.com)

How do we decide if a tax is good or bad? (theguardian.com)

What's the real reason games are taking longer to make? (gamedeveloper.com)

Scaling Asyncio on Free-Threaded Python (labs.quansight.org)

Front-Loaded Vesting: Why Your Tech Offer Looks Different Now (levels.fyi)

The Great French Fry Mystery (torontolife.com)

The weird economics of semiconductors and GenAI (gauthierroussilhe.com)

Claude Code Analytics API (docs.anthropic.com)

New Prefill Specialised GPU – Nvidia Rubin CPX (semianalysis.com)

AI-Personalized Welcome Messages for Website Visitors (peteallport.substack.com)

Show HN: Pgdbtemplate – fast PostgreSQL test databases in Go using templates (github.com)

What's Got into Stephen King? (notoneoffbritishisms.com)

A GitHub Co-Founder's Next Commit (opensourcepledge.com)

Psychologist for Founders (marcosander.com)

From Org Charts to Work Charts: Building Clarity in the New Work World (clearwork.io)

Show HN: TimeCopilot, forecasting agent with LLMs and foundation models (github.com)

The AI Nerf Is Real
