Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

Comments (1)

nobody9999 · 2d ago

>...In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making.

