Adaptive LLM routing under budget constraints

152 points · tdchaitanya · 61 comments · 9/1/2025, 4:57:38 PM · arxiv.org

Comments (61)

pbd · 5h ago

GPT-4 at $24.7 per million tokens vs Mixtral at $0.24 - that's a 100x cost difference! Even if routing gets it wrong 20% of the time, the economics still work. But the real question is how you measure 'performance' - user satisfaction doesn't always correlate with technical metrics.

FINDarkside · 5h ago

It's trivial to get a better score than GPT-4 at 1% of the cost by using my proprietary routing algorithm that routes all requests to Gemini 2.5 Flash. It's called GASP (Gemini Always, Save Pennies)

nutjob2 · 3h ago

Does anyone working in an individual capacity actually end up paying for Gemini (Flash or Pro)? Or does Google boil you like a frog and you end up subscribing?

aspect8445 · 3h ago

I've used Gemini in a lot of personal projects. At this point I've probably made tens of thousands of requests, sometimes exceeding 1k per week. So far, I haven't had to pay a dime!

worm00111 · 44m ago

How come you don't need to pay? Do you get it for free somehow?

KETHERCORTEX · 18m ago

There's a free tier for the API.

dcre · 2h ago

I've paid a few dollars a month for my API usage for about 6 months.

simpaticoder · 4h ago

PPT (price-per-token) is insufficient to compute cost. You will also need to know an average tokens-per-interaction (TPI). They multiply to give you a cost estimate. A .01x PPT is wiped out by 100x TPI.

monsieurbanana · 2h ago

Are you saying that some models will take 100x more tokens than others (models in the same ballpark) for the same task? Is the 100 a real measured metric or just a random number to illustrate a point?

simpaticoder · 2h ago

With thinking models, yes 100x is not just possible, but probable. You get charged for the intermediate thinking tokens, even if you don't see them (which is the case for Grok, for example). And even if you do see them, they won't necessarily add value.
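The PPT × TPI arithmetic from this subthread can be sketched in a few lines. The prices and token counts below are hypothetical, chosen only to illustrate the point, and are not measurements of any real model. The 100x token count stands in for billed thinking tokens.

```python
def cost_per_interaction(price_per_million_tokens: float, tokens_per_interaction: int) -> float:
    """Estimated cost in dollars of one interaction: PPT times TPI."""
    return price_per_million_tokens * tokens_per_interaction / 1_000_000


# Hypothetical baseline: $10 per million tokens, 500 tokens per interaction.
baseline = cost_per_interaction(price_per_million_tokens=10.0, tokens_per_interaction=500)

# Hypothetical "cheap" model: 0.01x the price, but 100x the tokens
# (for example, because hidden thinking tokens are billed).
cheap_but_verbose = cost_per_interaction(price_per_million_tokens=0.10, tokens_per_interaction=50_000)

print(f"baseline: ${baseline:.4f}  cheap but verbose: ${cheap_but_verbose:.4f}")
```

Under these assumed numbers the two per-interaction costs come out the same, so the price-per-token advantage is gone.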

Keyframe · 5h ago

number of complaints / million tokens?

mkoubaa · 3h ago

> How you measure 'performance'

I heard the best way is through valuations

pqtyw · 5h ago

> GPT-4 at $24.7 per million tokens

While technically true why would you want to use it when OpenAI itself provides a bunch of many times cheaper and better models?

KTibow · 4h ago

RouterBench is from March 2024.

QuadmasterXLII · 5h ago

The framing in the headline is interesting. As far as I recall, spending 4x more compute on a model to improve performance by 7% is the move that has worked over and over again up to this point. 101% of GPT-4 performance (potentially at any cost) is what I would expect an improved routing algorithm to achieve.

dang · 4h ago

(The submitted title was "93% of GPT-4 performance at 1/4 cost: LLM routing with weak bandit feedback")

spoaceman7777 · 5h ago

Incredible that they are using contextual bandits, and named it: Preference-prior Informed Linucb fOr adaptive rouTing (PILOT)

Rather than the much more obvious: Preference-prior Informed Linucb For Adaptive Routing (PILFAR)

bhickey · 2h ago

That's pretty funny. I might need to pilfer it.
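For readers unfamiliar with the LinUCB in that acronym, here is a minimal sketch of plain (disjoint) LinUCB used as a two-model router. It is not the paper's PILOT method: there is no preference prior and no budget handling. The model names, the single "difficulty" feature, and the reward function are all hypothetical.

```python
import math
import random


class LinUCBArm:
    """One arm (model) of disjoint LinUCB with ridge-regression estimates."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength
        # Inverse of A = I + sum(x x^T), maintained directly.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim  # sum of reward * x

    def _A_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def ucb(self, x):
        theta = self._A_inv_times(self.b)  # ridge estimate A^-1 b
        A_inv_x = self._A_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(0.0, sum(xi * ai for xi, ai in zip(x, A_inv_x))))
        return mean + self.alpha * width

    def update(self, x, reward):
        # Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x)
        A_inv_x = self._A_inv_times(x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, A_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= A_inv_x[i] * A_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def true_reward(model, difficulty):
    # Hypothetical, noiseless reward: answer quality minus a cost penalty.
    if model == "cheap":
        return 1.0 - difficulty  # good on easy prompts, poor on hard ones
    return 0.6  # "strong": solid everywhere, but pays a cost penalty


def simulate(rounds=500, seed=0):
    rng = random.Random(seed)
    arms = {"cheap": LinUCBArm(dim=2, alpha=0.5), "strong": LinUCBArm(dim=2, alpha=0.5)}
    for _ in range(rounds):
        difficulty = rng.random()
        x = [1.0, difficulty]  # context: bias term plus one difficulty feature
        chosen = max(arms, key=lambda m: arms[m].ucb(x))
        # Bandit feedback: only the chosen model's reward is observed.
        arms[chosen].update(x, true_reward(chosen, difficulty))
    return arms
```

Calling `simulate()` and then comparing `arms[m].ucb([1.0, d])` for a small and a large `d` shows which model the learned policy prefers at each difficulty.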

fny · 5h ago

Is there a reason human preference data is even needed? Don't LLMs already have a strong enough notion of question complexity to build a dataset for routing?

delichon · 5h ago

> a strong enough notion of question complexity

Aka Wisdom. No, LLMs don't have that. Me neither, I usually have to step in the rabbit holes in order to detect them.

fny · 4h ago

"Do you think you need to do high/medium/low amount of thinking to answer X?" seems well within an LLM's wheelhouse if the goal is to build an optimized routing engine.
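A rough sketch of that idea, assuming some callable that asks a model for a high/medium/low effort label. The `rate_effort` callable, the stub rater, and the model names are all hypothetical placeholders. Whether the self-assessed label is actually reliable is the open question in this thread.

```python
from typing import Callable

# Hypothetical tier-to-model mapping; the names are placeholders, not real endpoints.
MODEL_FOR_EFFORT = {"low": "small-model", "medium": "mid-model", "high": "large-model"}


def route(prompt: str, rate_effort: Callable[[str], str]) -> str:
    """Pick a model from a self-assessed effort label.

    Unrecognized labels fall back to the largest tier, trading cost for safety.
    """
    label = rate_effort(prompt).strip().lower()
    return MODEL_FOR_EFFORT.get(label, MODEL_FOR_EFFORT["high"])


def stub_rater(prompt: str) -> str:
    # Stand-in for an LLM call that answers "high", "medium" or "low".
    # Here: a crude length heuristic, purely for demonstration.
    return "high" if len(prompt.split()) > 30 else "low"


print(route("What is 2 + 2?", stub_rater))
```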

nutjob2 · 3h ago

How do you think that an LLM could come by that information? Do you think that LLM vendors are logging performance and feeding that back into the model or some other mechanism?

adtac · 2h ago

Why not something dumb like this: https://chatgpt.com/share/68b60199-b6ac-8009-b50d-3e7cfff1d7... (gpt-4o)

carlhjerpe · 3h ago

Yes, that's why they keep getting better and why Anthropic is switching privacy policy defaults to eat my data please.

jibal · 5h ago

LLMs don't have notions ... they are pattern matchers against a vast database of human text.

mhh__ · 4h ago

Please do a SELECT * from this database

ashirviskas · 3h ago

What was the name of the rocket that brought the first humans into space?

CuriouslyC · 2h ago

These router papers are popping up hard now. I have a gradient boosted router I've been playing with that ties into retrieval to provide adaptive routing. The truth about these routers is that you have to tune them on your workloads to get the full benefit, otherwise they test way better than they work in production. That was why I added the retrieval aspect to mine, otherwise your top line slice and reality are very different.

axiom92 · 1h ago

From last neurips https://automix-llm.github.io/automix/

lewtun · 3h ago

> We instantiate this idea through Preference-prior Informed Linucb fOr adaptive rouTing (PILOT), a novel extension of LinUCB

Academics are pretty creative at naming their creations

CuriouslyC · 2h ago

I almost named my LoRA replacement BEMO, but that felt too cute, so it's just BEM (Bolt-on Expert Modules).

westurner · 2h ago

Would there be advantages to routing to models according to cost in conjunction with prompt rewriting?

andrewflnr · 5h ago

Is this really the frontier of LLM research? I guess we really aren't getting AGI any time soon, then. It makes me a little less worried about the future, honestly.

Edit: I never actually expected AGI from LLMs. That was snark. I just think it's notable that the fundamental gains in LLM performance seem to have dried up.

kenjackson · 5h ago

First, I don't think we will ever get to AGI. Not because we won't see huge advances still, but AGI is a moving ambiguous target that we won't get consensus on.

But why does this paper impact your thinking on it? It is about budget and recognizing that different LLMs have different cost structures. It's not really an attempt to improve LLM performance measured absolutely.

ACCount37 · 3h ago

I can totally see "it's not really AGI because it doesn't consistently outperform those three top 0.000001% outlier human experts yet if they work together".

It'll be a while until the ability to move the goalposts of "actual intelligence" is exhausted entirely.

9dev · 2h ago

Well right now, my niece of 7 years outperforms all LLM contenders in drawing a Pelican on a bicycle

kenjackson · 1h ago

I know this was a joke, but LLMs are quite good at this now. If your niece draws better then she’s a good artist.

_heimdall · 4h ago

So you don't expect AGI to be possible ever? Or is your concern mainly with the wildly different definitions people use for it and that we'll continue moving goal posts rather than agree we got there?

nutjob2 · 4h ago

There's no concrete evidence AGI is possible mostly because it has no concrete definition.

It's mostly hand waving, hype and credulity, and unproven claims of scalability right now.

You can't move the goal posts because they don't exist.

ashirviskas · 3h ago

Well, if a human is GI, we just need to make it Artificial. Easy.

ctoth · 4h ago

Is a random paper from Fujitsu Research claiming to be the frontier of anything?

andrewflnr · 4h ago

Not just this paper, but model routing shenanigans also seem to have been a big part of GPT-5, which certainly claims to be frontier work.

jibal · 5h ago

LLMs are not on the road to AGI, but there are plenty of dangers associated with them nonetheless.

andrewflnr · 4h ago

Agreed, broadly. I never really thought they were, but seeing people work on stuff like this instead of even trying to improve the architecture really makes it obvious.

nicce · 4h ago

Just 2 days ago Gemini 2.5 Pro tried to recommend tax evasion to me based on nonexistent laws and court decisions. The model was so charming and convincing that even after I brought up all the logic flaws and said that this is plain wrong, I started to doubt myself, because it is so good at pleasing, arguing and using words.

And most would have accepted the recommendation because the model sold it as a less common tactic, while sounding very logical.

nutjob2 · 4h ago

Or you could understand the tool you are using and be skeptical of any of its output.

So many people just want to believe, instead of the reality of LLMs being quite unreliable.

Personally it's usually fairly obvious to me when LLMs are bullshitting probably because I have lots of experience detecting it in humans.

nicce · 2h ago

An LLM is only useful if it gives a shortcut to information with reasonable accuracy. If I need to double-check everything, it is just an extra step.

In this case I just happened to be a domain expert and knew it was wrong. It would have required significant effort for a less experienced person to verify everything.

roywiggins · 4h ago

> even after I brought all the logic flaws and said that this is plain wrong

Once you've started to argue with an LLM you're already barking up the wrong tree. Maybe you're right, maybe not, but there's no point in arguing it out with an LLM.

nicce · 2h ago

There are cases when they are actually correct, instead of the human.

roywiggins · 2h ago

Yes, and there's a substantial chance they'll apologize to you anyway even when they were right. There's no reason to expect them to be more likely to apologize when they're actually right vs. actually wrong; their agreeableness is really orthogonal to their correctness.

nicce · 2h ago

Yes, they over-apologize. But my main reason for using LLMs is seeking out things that I missed myself or where my own argumentation was not good. Sometimes they are really good at bringing new perspectives. Whether they are correct or incorrect is not the point. The question is whether they are giving an argument or perspective that is worth inspecting further with my own brain.

srekhi · 5h ago

I'm not following this either. You'd think this would be frontier back in 2023

yahoozoo · 4h ago

That and LLMs are seemingly plateauing. Earlier this year, it seemed like the big companies were releasing noticeable improvements every other week. People would joke a few weeks is “an eternity” in AI…so what time span are we looking at now?

andrewflnr · 4h ago

That's just the thing. There don't seem to have been any breakthroughs in model performance or architecture, so it seems like we're back to picking up marginal reductions in cost to make any progress.

muldvarp · 3h ago

There have been very large improvements in code generation in the last 6 months. A few weeks without improvement are not necessarily a plateau.

ACCount37 · 3h ago

Wait until it ramps up so much that people will say "it's a plateau, for real this time" when they go 3 days without a +10% capability jump.

muldvarp · 3h ago

I mean, I wish there were a plateau; without one we're well on our way to techno-feudalism. I just don't see it.

ACCount37 · 2h ago

That's what it is: wishful thinking. A lot of people really, really want AI tech to fail - because they don't like the alternative.

yieldcrv · 4h ago

just because it’s on arxiv doesn’t mean anything

arxiv is essentially a blog under an academic format, popular amongst asian and south asian academic communities

currently you can launder reputation with it, just like “white papers” in the crypto world allowed for capital for some time

this ability will diminish as more people catch on

guluarte · 5h ago

I'm starting to think that there will not be an 'AGI' moment; we will simply slowly build smarter machines over time until we realize there is 'AGI'. It would be like video calls in the '90s: everybody wanted them, now everybody hates them, lmao.

nutjob2 · 4h ago

Or we'll realize that human intelligence and machine intelligence are apples and oranges.
