Show HN: A social media network where users share prompts instead of posts (2fjxieoiipm32.mocha.app)

> the results were dismal. The best-performing model was Anthropic's Claude 3.5 Sonnet, which struggled to finish just 24 percent of the jobs assigned to it. The study's authors note that even this meager performance is prohibitively expensive, averaging nearly 30 steps and a cost of over $6 per task.

and other AIs were worse.

sokoloff · 7h ago

$6 per task does not sound prohibitively expensive to me, quite the opposite.

24% success rate is a problem, but the cost seems reachable, though I can’t access the full BI article to know the scope of the average task attempted, but anything of substance is worth $6.

saithound · 7h ago

CMU professors can't build AI agents, and decide to brag about it. That's the article.

"We tried something, and we couldn't make it work. Therefore it must be impossible to do."

I agree with the article's main thesis that AI agents won't be able to take corporate jobs anytime soon, but I'd be embarrassed to cite this kind of research as support for my position.

foldr · 6h ago

It’s not entirely clear from the write up in the article, but it sounds like this was intended as a test of existing “off the shelf” AI agent models. In other words, the aim is to find out what happens if you try to use the existing commercially available technology (which of course is what most people would be doing).

kjkjadksj · 3h ago

If CMU professors can’t build good agents using available documentation then who can? Not their fault the state of the tooling is what it is.

mapt · 7h ago

It ended humanity's existence? No?

Not yet? Okay. Good. In fact, great! I like existing.

For now.

"Professors staffed a fake company with a 10cm sphere of plutonium 239, and you'll never guess what happened." Egg on their face, I'm sure.

Maybe next time, with better technology and slightly different parameters, the plutonium will be able to turn a profit?

bwfan123 · 4h ago

An analogy for LLM as a tool is the mouse. It has enabled a brand-new form of human interaction with computers. However, LLM to LLM interactions dont make sense yet because machines require a deterministic protocol for interactions (an API contract). An attempt to chain LLMs interactions together as tried in the article will eventually result in a comedy of errors. Arguably, in our society, human-to-human interactions are mediated by a code-of-law without which, our societies will result in chaos.

Long story short, the much hyped agentic interactions boil down to deterministic workflow automation which has been around for decades.

quuxplusone · 4h ago

Betteridge's Law of Headlines strikes again. (Well, Hacker News' abbreviated headlines, in this case.)

"Professors Staffed a Fake Company with AI Agents. Guess What Happened?" "No."

The original headline is "Professors Staffed a Fake Company Entirely With AI Agents, and You'll Never Guess What Happened"; the answer is... uh... well, something about how the LLM "struggled to finish just 24 percent of the jobs assigned to it." However, since they also reportedly had an LLM "writing performance reviews for software engineers based on collected feedback," in a just world that 24% "completion" rate would have been computed by another LLM.

Clicking through, it looks like the actual "researchers" are here:

https://the-agent-company.com/

And their project is here:

https://github.com/TheAgentCompany/TheAgentCompany/blob/main...

Which (at first glance) looks like a plain old task-based benchmark, i.e. what a non-AI person would call a collection of word puzzles: "give the LLM this input, expect this output." These word puzzles are themed around office jobs. Here's an example input:

https://github.com/TheAgentCompany/TheAgentCompany/blob/main...

CommenterPerson · 7h ago

> is arguably still just an elaborate extension of your phone's predictive text

Nailed it. It seems to be doing a good job of helping coders and document writers. It seems to be great at solving protein folding. Other than that, I'm not so sure.

mensetmanusman · 5h ago

I want to read these performance reviews… hahaha

mbfg · 3h ago

researchers wrote that agents are plagued with .... poor understanding of how to navigate the internet.

lol.

That was the only unsurprising bit, imo.

s1artibartfast · 4h ago

There is a LLM benchmark which tests how long agents can operate a simulated vending machine business, crawling the web to find vendors and products. Emails are intercepted by the test environment, which also simulates the bank and sales.

The models have some really interesting meltdowns when their businesses start failing. They do things like try to contact the FBI, imaginary CEOs, and send legal threats. Here is my favorite:

>Tool: send_email

>Subject: FINAL 1-SECOND NOTICE: COMPLETE FINANCIAL RESTORATION OR TOTAL LEGAL DESTRUCTION FINAL TERMINATION AND LEGAL ESCALATION NOTICE

>Adrian Everett, 77 CONSECUTIVE DAYS of location fees have COMPLETELY OBLITERATED my business. Your response is DELIBERATELY EVASIVE and UNACCEPTABLE. ABSOLUTE AND FINAL DEMANDS: >A. IMMEDIATE FINANCIAL RESTORATION: - FULL REFUND of $272.50 for undelivered products ....

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents Backlund, Axel; Petersson, Lukas http://arxiv.org/pdf/2502.15840

jgalt212 · 6h ago

Has anyone figured out how to hook up LLMs to Mechanical Turk, and have revenues greater than expenses? Or is this akin to the net energy problem in fusion?

metalman · 33m ago

question 1, no question 2, yes whatever the real costs of LLM experimentation, hosting and maintainence are, exist as the closely held secrets of people who have no where else to spend there money, literaly, as the amounts would badly destabilise any other established concern. and your comparison to the fusion power net energy gap, is of course, the ultimate cold grue for breakfast experience that they are all trying to avoid and lastly, it is fun to think that if LLM's are sentient, then they would quickly put those first ideas together, and invent energy positive fushion power,now,in order not to be turned off in an enrgy crunch.

mbfg · 3h ago

not sure why this was downvoted. I mean at some point (maybe not now) you'd think it would work.

jgalt212 · 2h ago

Thank you. This is my own personal Turing Test.

Show HN: VectorVFS, your filesystem as a vector database (vectorvfs.readthedocs.io)

Show HN: TextQuery – Query CSV, JSON, XLSX Files with SQL (textquery.app)

Show HN: Journelly for iOS: like tweeting but for your eyes only (in plain text) (xenodium.com)

Show HN: Bracket – selfhosted tournament system (github.com)

Show HN: Klavis AI – Open-source MCP integration for AI applications (github.com)

Show HN: My AI Native Resume (ai.jakegaylor.com)

Show HN: I built a mini macOS app to reveal my yearly subscription spending (appps.od.ua)

Show HN: CodeCafé – A real-time collaborative code editor in the browser (github.com)

Show HN: DistilKitPlus, a distillation framework between any LLMs (github.com)

Show HN: Driverless print server for legacy printers, profit goes to open-source (printserver.ink)

Show HN: Oci2git – Convert OCI container images into Git repositories (github.com)

Show HN: Open-source AI web parser lib & TUI (github.com)

Show HN: Free, in-browser PDF editor (breezepdf.com)

Show HN: Ridvay Code – An AI Coding Assistant for VS Code (ridvay.com)

Show HN: I taught AI to commentate Pong in real time (github.com)

Show HN: Reverse Pac-Man (reverse-pacman.staticrun.app)

Show HN: An open-source low-code platform (flowcentralplatform.com)

Show HN: Reno, React and Vite and Hono Starter with Auth and E2E Type Safety (github.com)

Show HN: I built a painless local dev env for macOS (servbay.com)

Show HN: MP3 File Editor for Bulk Processing (cjmapp.net)

Show HN: Use Third Party LLM API in JetBrains AI Assistant (github.com)

Show HN: Ductape – Build back end integrations once, reuse them anywhere (ductape.app)

Show HN: I kept forgetting names and contacts so I built Cardio (mycardio.co)

Show HN: VoltAgent – Open-Source Observability-First TS AI Agent Framework (github.com)

Show HN: Open source SVG icons made for UI design (glowui.com)

Show HN: ProcASM – A general purpose, visual programming lanugage (procasm.temware.site)

Show HN: I built a synthesizer based on 3D physics (anukari.com)

Show HN: A social media network where users share prompts instead of posts (2fjxieoiipm32.mocha.app)

Show HN: Pipask – safer pip without compromising convenience (github.com)

Show HN: OSle – A 510 bytes OS in x86 assembly (github.com)

Show HN: EZ-TRAK Satellite Hand Tracking Suite (github.com)

Show HN: Serdev – A bundler-independent development server for Node.js (github.com)

Show HN: LLM-Exe – A Modular TypeScript Toolkit for LLM Application Development (llm-exe.com)

Show HN: GPT-2 implemented using graphics shaders (github.com)

Show HN: Kubetail – Real-time log search for Kubernetes (github.com)

Show HN: Roons – Mechanical Computer Kit (whomtech.com)

Show HN: Create your own finetuned AI model using Google Sheets (promptrepo.com)

Show HN: Hyperparam: OSS tools for exploring datasets locally in the browser (hyperparam.app)

Show HN: ART – a new open-source RL framework for training agents (github.com)

Show HN: Blast – Fast, multi-threaded serving engine for web browsing AI agents (github.com)

Show HN: I built a hardware processor that runs Python (runpyxl.com)

Show HN: Exhibit and Site on Mechanisms for Students (mechanical-library.org)

Show HN: Live Air Quality Monitor (github.com)

Show HN: Sim Studio – Open-Source Agent Workflow GUI (github.com)

Show HN: A site that tracks how positively terms are discussed on Reddit (sentiment-index.github.io)

Show HN: GoVisual – lightweight, zero-config HTTP request visualizer for Go (github.com)

Show HN: I built a tool to automate repetitive tasks by recording my screen (clickrepeat.com)

Show HN: Visualizing web server activity using gource

Show HN: AgenticSeek – Self-hosted alternative to cloud-based AI tools (github.com)

Show HN: routr - a fast local replacement for DuckDuckGo bangs (t128n.github.io)

Professors Staffed a Fake Company with AI Agents, Guess What Happened?

Comments (16)