The inanity of this issue’s text aside, the lack of task comprehensiveness in these models is obvious, and in my opinion isn’t something we should even expect from a nondeterministic system. I wouldn’t blindly trust anyone without some constraint checking: trust, but verify.
I have some interest in this area, and wonder if a “not-LLM-as-judge” that can extract/infer constraints from a task description (or get them from an operator) could be used to judge task completion. Conceptually similar to structured outputs. Maybe there’s a paper already…
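Something like this sketch (in Python) is what I have in mind; everything here is hypothetical, and in practice the constraints would be inferred from the task description or supplied by an operator rather than hardcoded:

    import re
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Constraint:
        description: str
        check: Callable[[str], bool]  # deterministic predicate over the model output

    # Hypothetical operator-supplied constraints for a task like
    # "summarize in at most 3 sentences and mention the deadline".
    constraints = [
        Constraint("at most 3 sentences",
                   lambda out: len([s for s in re.split(r"[.!?]+", out) if s.strip()]) <= 3),
        Constraint("mentions the deadline",
                   lambda out: "deadline" in out.lower()),
    ]

    def judge(output: str) -> list[str]:
        """Return descriptions of violated constraints; an empty list means pass."""
        return [c.description for c in constraints if not c.check(output)]

    model_output = "The report is nearly done. Final edits remain."
    print(judge(model_output))  # -> ['mentions the deadline']

The point being the judge itself is plain code, not another LLM, so it can’t hallucinate a pass.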
threecheese · 39m ago
I’ll admit though, I felt a bit of schadenfreude reading that thread, as a developer.