Show HN: We let agents use APIs to find out if they can actually...do things?

8 points by adinagoerres on 7/22/2025, 3:17:35 PM (superglue.ai)

Comments (5)

kyleledbetter · 6h ago
Super cool that you all published these benchmarks. We've seen similar results: some APIs work REALLY well with agents, but convoluted ones just produce churn in the agent's tool calls. Curious to see how Supabase's APIs would perform with your benchmarks. We've seen their PostgREST-based API do really well with our agents (with targeted system prompts), and it's a fairly non-standard REST structure.
michael-fuest · 6h ago
Love hearing Sam Altman talk about feeling the AGI while million-dollar reasoning models can't execute simple API calls, despite having plenty of docs and the entire internet as baked-in knowledge.

There may be hope for humanity yet!

Jokes aside, I'm interested in eventually exploring how well the new OpenAI agent mode handles these kinds of tasks if the underlying foundation models struggle with this type of work.

sfaist · 6h ago
The reason we thought this would be interesting to share here is that LLM benchmarks seem increasingly disconnected from reality. I don't care if the LLM can solve a PhD-level math question or make scientific discoveries; I care if it can solve our problems, which in our case means automating API integrations. Turns out it mostly can't, which tracks well with our experience using Cursor.
nimar · 4h ago
v interesting benchmark, looking forward to seeing it evolve over time. actually surprisingly good results already.

maybe add a couple harder APIs (or more complex queries) as well where current models overwhelmingly fail?

that way, we can still measure models in a couple of years against the current ones.

also, adding o3, and for reference the model(s) used by superglue in this benchmark, would be interesting.

adinagoerres · 7h ago
Hi HN! Adina here from superglue. Today I’d like to share a new benchmark we’ve just open-sourced: an Agent-API Benchmark, in which we test how well LLMs handle APIs.

tl;dr: LLMs suck at writing code to use APIs.

We ran 630 integration tests across 21 common APIs (Stripe, Slack, GitHub, etc.) using 6 different LLMs. Here are our key findings:

- Best general LLM: 68% success rate. That's roughly 1 in 3 API calls failing. Would you ship that?
- Our integration layer scored a 91% success rate, showing us that just throwing bigger/better LLMs at the problem won't solve it.
- Only 6 out of 21 APIs worked 100% of the time; every other API had failures.
- Anthropic’s models are significantly better at building API integrations than other providers' models.
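
For anyone curious how a number like that gets computed: each test is scored pass/fail, and the rates are aggregated per model and per API. A rough sketch of that aggregation (illustrative only, not our actual harness code; see the repo below for the real thing):

```typescript
// Illustrative sketch of benchmark scoring (not the actual harness code):
// aggregate per-test pass/fail records into success rates per model or per API.

interface TestResult {
  model: string;   // which LLM generated the integration code
  api: string;     // which API was targeted, e.g. "stripe"
  passed: boolean; // did the generated call return the expected result?
}

function successRates(results: TestResult[], groupBy: "model" | "api") {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const key = r[groupBy];
    const t = totals.get(key) ?? { passed: 0, total: 0 };
    t.total += 1;
    if (r.passed) t.passed += 1;
    totals.set(key, t);
  }
  return [...totals.entries()]
    .map(([name, t]) => ({ name, successRate: (100 * t.passed) / t.total }))
    .sort((a, b) => b.successRate - a.successRate);
}

// successRates(allResults, "model") ranks models; successRates(allResults, "api") ranks APIs.
```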

What makes LLMs fail hard:

- Lack of context: LLMs are just not great at understanding which API endpoints exist and what they do, even when you give them the documentation (which we did).
- Multi-step workflows: chaining API calls together.
- Complex API design: APIs like Square, PostHog, and Asana (forced project selection, among other things, trips LLMs up).
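
To make the multi-step point concrete, here's the shape of a typical chained workflow, using Stripe as an example (an illustrative sketch, not code from the benchmark; the API key and price ID are placeholders). The second call can't be made until an ID comes back from the first, and models regularly lose track of that dependency:

```typescript
// Illustrative two-step Stripe workflow: create a customer, then create a
// subscription that references the returned customer ID.
// STRIPE_SECRET_KEY and priceId are placeholders, not values from the benchmark.

const STRIPE_KEY = process.env.STRIPE_SECRET_KEY ?? "";

async function stripePost(path: string, params: Record<string, string>) {
  const res = await fetch(`https://api.stripe.com/v1/${path}`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${STRIPE_KEY}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: new URLSearchParams(params),
  });
  const data = await res.json();
  if (!res.ok) throw new Error(data.error?.message ?? `HTTP ${res.status}`);
  return data;
}

async function subscribeNewCustomer(email: string, priceId: string) {
  // Step 1: create the customer; its ID is required by the next call.
  const customer = await stripePost("customers", { email });

  // Step 2: create the subscription, referencing the customer ID and an
  // existing price ID (form-encoded array syntax: items[0][price]).
  return stripePost("subscriptions", {
    customer: customer.id,
    "items[0][price]": priceId,
  });
}
```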

We've open-sourced the benchmark so you can test any API and see where it ranks: https://github.com/superglue-ai/superglue/tree/main/packages...

Check out the repo, consider giving it a star, or see the full ranking at https://superglue.ai/api-ranking/

If you're building agents that need reliable API access, we'd love to hear your approach - or you can try our integration layer at superglue.ai.

Next up: benchmarking MCP.