Show HN: LLMs suck at writing integration code… for now

13 points by sfaist | 10 comments | 7/24/2025, 3:28:07 PM | github.com
Hi HN! Stefan here from superglue. Today I'd like to share a new benchmark we've just open-sourced: an Agent-API Benchmark that tests how well LLMs handle APIs.

We gave LLMs API documentation and asked them to write code that makes actual API calls, things like "create a Stripe customer" or "send a Slack message". We're not testing whether they can use SDKs; we're testing whether they can write raw HTTP requests (with proper auth, headers, and body formatting) that actually work when executed against real API endpoints, and whether they can extract the relevant information from the response.
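To make that concrete, here's roughly what a passing answer for the "create a Stripe customer" task looks like. This is a hand-written sketch for illustration (with a placeholder API key in an env var), not an actual benchmark output:

    // Sketch: create a Stripe customer with a raw HTTP request, no SDK.
    // Stripe expects Bearer auth and a form-encoded body, not JSON.
    const res = await fetch("https://api.stripe.com/v1/customers", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.STRIPE_SECRET_KEY}`,
        "Content-Type": "application/x-www-form-urlencoded",
      },
      body: new URLSearchParams({ email: "jane@example.com", name: "Jane Doe" }),
    });
    if (!res.ok) throw new Error(`Stripe error ${res.status}: ${await res.text()}`);

    // The benchmark also checks extraction of the relevant field from the response.
    const customer = await res.json();
    console.log(customer.id); // "cus_..."

Each provider has its own quirks (Stripe wants form encoding rather than JSON, for example), which is part of what makes one-shotting this hard.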

tl;dr: LLMs suck at writing code to use APIs.

We ran 630 integration tests across 21 common APIs (Stripe, Slack, GitHub, etc.) using 6 different LLMs. Here are our key findings:

- Best general LLM: 68% success rate. That's nearly 1 in 3 API calls failing, which most would agree isn't viable in production.

- Our integration layer scored a 91% success rate on the same tests, showing that just throwing bigger/better LLMs at the problem won't solve it; the scaffolding around them matters.

- Only 6 out of 21 APIs worked 100% of the time; every other API had failures.

- Anthropic’s models are significantly better at building API integrations than other providers' models.

Here is the results chart: https://superglue.ai/files/performance.png

What made LLMs fail:

- Lack of context (LLMs are just not great at understanding which API endpoints exist and what they do, even when you give them the documentation, which we did)

- Multi-step workflows, i.e. chaining API calls (see the sketch after this list)

- Complex API design: APIs like Square, PostHog, and Asana trip LLMs up (forced project selection, among other things)
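To illustrate the chaining problem: posting a Slack message when you only know the channel name takes two dependent calls, and the model has to extract the right ID from the first response to build the second request. A sketch with a placeholder token, using the real Slack Web API endpoints:

    // Sketch: two chained Slack Web API calls; the second depends on an ID
    // extracted from the first. SLACK_BOT_TOKEN is a placeholder.
    const token = process.env.SLACK_BOT_TOKEN;

    // Step 1: resolve the channel name to a channel ID.
    const list = await (await fetch("https://slack.com/api/conversations.list", {
      headers: { Authorization: `Bearer ${token}` },
    })).json();
    const channel = list.channels.find((c: { name: string }) => c.name === "general");
    if (!channel) throw new Error("channel not found");

    // Step 2: post the message to the resolved channel ID.
    const post = await (await fetch("https://slack.com/api/chat.postMessage", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json; charset=utf-8",
      },
      body: JSON.stringify({ channel: channel.id, text: "hello" }),
    })).json();
    // Slack returns HTTP 200 even on failure; errors live in the body.
    if (!post.ok) throw new Error(`Slack error: ${post.error}`);

Every extra hop like this is another chance to extract the wrong field or lose state between calls.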

We've open-sourced the benchmark so you can test any API and see where it ranks: https://github.com/superglue-ai/superglue/tree/main/packages...

Check out the repo, consider giving it a star, or see the full ranking at https://superglue.ai/api-ranking/.

If you're building agents that need reliable API access, we'd love to hear your approach, or you can try our integration layer at superglue.ai.

Next up: benchmarking MCP.

Comments (10)

adinagoerres · 3h ago
Hey HN, I'm Adina, Stefan's co-founder at superglue. When we started working on LLM-powered integrations about a year ago, the models were barely good enough to handle simple mappings. We started benchmarking our performance as an internal evals project and thought it would be fun to open-source it to create more transparency around LLM performance. Our goal here is to understand how we can make agents production-ready and improve reliability across the board.
hoerzu · 3h ago
Love the benchmarks. Is it better to use a single LLM for performance, or would you always advise adding a self-reflection step?
adinagoerres · 2h ago
self-reflection is very important for both humans and LLMs, indeed
ThomasMin · 1h ago
Awesome work Stefan, this is super insightful! Really appreciate the transparency and open-sourcing the benchmark. The 68% success rate is a wake-up call for anyone building with LLMs. Your 91% integration layer result is impressive, shows tooling matters. Excited to see what you uncover next with MCP!
iamflimflam1 · 2h ago
I would expect most developers to fail at this challenge. Here’s the doc - you’ve got one chance to get the API to do this.

I can’t tell from the description if the LLMs are allowed to try and then correct based on any errors received.

Though it would be surprising if that helped. Most APIs don’t tell you what you’ve done wrong…

sfaist · 2h ago
We would've expected the LLMs to be much better at writing working code, since these aren't random APIs but established API patterns they should be able to one-shot (e.g. Stripe). Bad error messages are indeed a problem. We will release another benchmark run with retries very soon.
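For illustration, the retry loop in question would look roughly like this. Just a sketch of the pattern; `generate` is a hypothetical stand-in for the LLM call, not something from the benchmark:

    // Sketch of an error-feedback retry loop. `generate` is a hypothetical
    // function that asks an LLM for an HTTP request, optionally given the
    // previous attempt's error.
    type Attempt = { url: string; init: RequestInit };

    async function callWithRetries(
      generate: (task: string, lastError?: string) => Promise<Attempt>,
      task: string,
      maxAttempts = 3,
    ): Promise<Response> {
      let lastError: string | undefined;
      for (let i = 0; i < maxAttempts; i++) {
        const { url, init } = await generate(task, lastError);
        const res = await fetch(url, init);
        if (res.ok) return res;
        // Feed status and body back for the next attempt; this only helps
        // when the API actually says what went wrong.
        lastError = `HTTP ${res.status}: ${await res.text()}`;
      }
      throw new Error(`failed after ${maxAttempts} attempts: ${lastError}`);
    }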
ForzaAaRon · 2h ago
Fascinating read. Interesting how Opus performs worse compared to Sonnet.
sfaist · 2h ago
Quite interesting, actually. Not sure why; I assume it just overthinks. What surprised me even more is how badly o4-mini performed after taking up hours of evaluation time and more credits than all other LLMs combined. More thinking != better (integration) coding performance.
hoerzu · 3h ago
What's the hello world of super glue?
danmeier · 2h ago
very interesting! curious to see the benchmarks for MCP!