Last week we made it to the front page with our post about benchmarking how well coding agents interact with libraries and APIs. The response was positive overall, but many wanted to see the code.
For those just catching up: The problem is that existing benchmarks focus on self-contained codegen. StackBench tests how well AI coding agents (like Claude Code, and now Cursor) use your library by:
• Parsing your documentation automatically
• Extracting real usage examples
• Having agents regenerate those examples from scratch, given only a spec
• Logging every mistake and analyzing patterns
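Roughly, the loop looks something like the sketch below. To be clear, this is a toy illustration and not StackBench's actual code or API: the helper names, the markdown-docs assumption, and the similarity-based scoring are all placeholders.

    # Hypothetical sketch of the docs -> spec -> agent -> scoring loop.
    # None of these names come from StackBench itself.
    import difflib
    import json
    import re
    from pathlib import Path

    def extract_examples(doc_text: str) -> list[str]:
        """Pull fenced code blocks out of a markdown docs page (assumed format)."""
        return re.findall(r"```(?:python)?\n(.*?)```", doc_text, flags=re.DOTALL)

    def example_to_spec(example: str) -> str:
        """Turn a doc example into a natural-language spec for the agent.
        Here we just keep the comments as the 'spec' and drop the code."""
        comments = [line.strip("# ").strip() for line in example.splitlines()
                    if line.strip().startswith("#")]
        return " ".join(comments) or "Reproduce this usage example from the library docs."

    def run_agent(spec: str) -> str:
        """Placeholder for invoking a coding agent (Claude Code, Cursor, ...)."""
        raise NotImplementedError("Wire up your agent of choice here.")

    def score_attempt(reference: str, attempt: str) -> dict:
        """Log how far the agent's code drifts from the documented example."""
        ratio = difflib.SequenceMatcher(None, reference, attempt).ratio()
        return {"similarity": round(ratio, 3),
                "exact_match": reference.strip() == attempt.strip()}

    def benchmark(docs_dir: str, report_path: str = "report.json") -> None:
        results = []
        for doc in Path(docs_dir).glob("**/*.md"):
            for example in extract_examples(doc.read_text()):
                spec = example_to_spec(example)
                try:
                    attempt = run_agent(spec)
                except NotImplementedError:
                    continue  # no agent wired up yet
                results.append({"doc": str(doc), "spec": spec,
                                **score_attempt(example, attempt)})
        Path(report_path).write_text(json.dumps(results, indent=2))

The real pipeline does more than string similarity (it logs and categorizes individual mistakes), but this is the general shape.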
You can find more details on how it works and how to run it in the docs: https://docs.stackbench.ai/
Next up, we’re planning to add more:
• Coding agents
• Ways of providing docs as context (e.g. Mintlify vs Cursor doc search)
• Benchmark tasks (e.g. use of APIs via API docs)
• Metrics
We're also working on automating in-editor testing and maybe even using an MCP server.
Contributions and suggestions very welcome. What should we prioritize next? The issues tab is open.
danmaw · 12h ago
Super cool. Thanks for sharing this. Who is this mainly aimed at, the maintainers or the users of the libs?
richardblythman · 12h ago
Mainly at the maintainers. In the generated reports, we highlight issues and suggest improvements to the docs (working on improving the reports as we run across more libraries).
gerstep · 12h ago
Cool direction! Benchmarking agent library usage instead of pure codegen is what's missing.