I wrote 2000 LLM test cases so you don't have to: LLM feature compatibility grid

Comments (2)

Oras · 6h ago

I might have missed the point, but most of these features are just filters in OpenRouter.

- Reasoning.

- Structured Output.

- Logprobs.

What's the added value from your tests? To verify these features exist?

scosman · 6h ago

There's a section about OpenRouter/LiteLLM: https://getkiln.ai/blog/i_wrote_2000_llm_test_cases_so_you_d...

Those tools map API compatibility. These tests and config add:

1) check which features are available

2) check which parameters you need to use for best results. For example, there are about 6 different options for requesting JSON from OpenRouter, and different models work best with different options.

3) check that the features consistently work. API compatibility and functionality are not the same.

4) go much deeper: are the models good enough for synthetic data generation? Can they generate uncensored model inputs if you're building a toxicity eval? And so on.
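To make points 2 and 3 concrete, here is a minimal sketch of what such a check could look like. It is not the actual test suite from the post: the mode names in `MODES`, the `call_model` signature, and the stub `fake_call_model` are all hypothetical stand-ins, and a real version would call a provider API instead of the stub.

```python
import json
from collections.abc import Callable

# Hypothetical labels for different ways of requesting JSON output.
# These are illustrative only, not the real OpenRouter option names.
MODES = ["json_schema", "json_mode", "tool_call", "prompt_only"]


def is_valid_json(text: str) -> bool:
    """Return True if the model output parses as JSON."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def success_rates(
    call_model: Callable[[str, str], str],
    model: str,
    modes: list[str],
    trials: int = 5,
) -> dict[str, float]:
    """Run each mode several times and record the fraction of valid JSON replies.

    Repeating the call is what separates "the parameter is accepted"
    from "the feature works consistently".
    """
    rates: dict[str, float] = {}
    for mode in modes:
        ok = sum(is_valid_json(call_model(model, mode)) for _ in range(trials))
        rates[mode] = ok / trials
    return rates


def best_mode(rates: dict[str, float]) -> str:
    """Pick the mode with the highest success rate (first one wins ties)."""
    return max(rates, key=rates.get)


def fake_call_model(model: str, mode: str) -> str:
    """Stub standing in for a real API call (assumption: returns raw text)."""
    if mode == "prompt_only":
        # Simulates a model that wraps its JSON in chatty prose.
        return 'Sure! Here is your JSON: {"a": 1}'
    return '{"a": 1}'


rates = success_rates(fake_call_model, "demo-model", MODES, trials=3)
print(rates, best_mode(rates))
```

Swapping `fake_call_model` for a real client would produce a per-model table of which request style is most reliable, which is roughly the kind of result a compatibility grid records.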
