Show HN: Arch-Router – 1.5B model for LLM routing by preferences, not benchmarks
- Embedding-based routers use intent classifiers — label a prompt as “support,” “SQL,” or “math,” then route to a matching model. This works for simple tasks but breaks down in real conversations: users shift topics mid-conversation, task boundaries blur, and product changes require retraining classifiers (a minimal sketch of this baseline follows these bullets).
- Performance-based routers pick models based on benchmarks like MMLU or MT-Bench, or based on latency or cost curves. But benchmarks often miss what matters in production: domain-specific quality or subjective preferences like “Will legal accept this clause?”
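To make the contrast concrete, here is roughly what the embedding-based baseline looks like. This is an illustrative sketch, not code from our project; the encoder, route labels, and example prompt are all assumptions:

```python
# Sketch of the embedding/intent-classifier baseline described above.
# Route names, descriptions, and the encoder are placeholder assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

ROUTES = {
    "support": "customer support questions about the product",
    "sql": "writing or debugging SQL queries",
    "math": "math and quantitative reasoning problems",
}
route_names = list(ROUTES)
route_vecs = encoder.encode(list(ROUTES.values()), convert_to_tensor=True)

def classify(prompt: str) -> str:
    # Nearest route by cosine similarity. This works on clean single-intent
    # prompts, but a multi-turn chat that drifts from SQL help into schema
    # design has no single nearest label, and changing the route set means
    # re-tuning labels (or retraining, if the classifier is learned).
    vec = encoder.encode(prompt, convert_to_tensor=True)
    scores = util.cos_sim(vec, route_vecs)[0]
    return route_names[int(scores.argmax())]

print(classify("Why does my JOIN return duplicate rows?"))  # -> "sql"
```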
Arch-Router takes a different approach: route by preferences written in plain language. You write rules like “contract clauses → GPT-4o” or “quick travel tips → Gemini Flash.” The router maps the prompt (and conversation context) to those rules using a lightweight 1.5B autoregressive model. No retraining, no fragile if/else chains. We built this with input from teams at Twilio and Atlassian. It handles intent drift, supports multi-turn conversations, and lets you swap models in or out with a one-line change to the routing policy. Full details are in our paper (https://arxiv.org/abs/2506.16655), but here's a snapshot:
Specs:
- 1.5B params — runs on a single GPU (or CPU for testing)
- No retraining needed — point it at any mix of LLMs
- Cost and latency aware — route heavy tasks to expensive models, light tasks to faster/cheaper ones
- Outperforms larger closed models on our conversational routing benchmarks (details in the paper)
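To make the preference-routing idea concrete, here is a minimal sketch of driving the model directly through Hugging Face transformers. The prompt template, route names, and output parsing below are illustrative assumptions; the exact input/output format the model expects is documented on the model card:

```python
# Minimal sketch of preference-based routing with Arch-Router-1.5B.
# The prompt/output format here is an assumption for illustration;
# see the Hugging Face model card for the real template.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "katanemo/Arch-Router-1.5B"

# Plain-language routing policy: route name -> (description, target LLM).
POLICY = {
    "contract_analysis": ("analyzing or drafting contract clauses", "gpt-4o"),
    "travel_tips": ("quick, lightweight travel suggestions", "gemini-flash"),
    "other": ("anything that fits no other route", "gpt-4o-mini"),
}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def pick_model(conversation: str) -> str:
    # Describe the routes in plain language and ask the router to name
    # the best-matching one for this conversation.
    routes = "\n".join(f"- {name}: {desc}" for name, (desc, _) in POLICY.items())
    prompt = (
        f"Routes:\n{routes}\n\n"
        f"Conversation:\n{conversation}\n\n"
        "Best route:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10)
    answer = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
    return POLICY.get(answer, POLICY["other"])[1]

print(pick_model("User: Can you review the indemnification clause below?"))
```

In production, archgw drives this mapping for you; the point of the sketch is that the policy is plain text, so swapping “gpt-4o” for another model really is a one-line change.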
Links:
- Arch Proxy (open source): https://github.com/katanemo/archgw
- Model + code: https://huggingface.co/katanemo/Arch-Router-1.5B
- Paper: https://arxiv.org/abs/2506.16655
We use Envoy as the request handler, which forwards requests to a local service written in Rust. Envoy has proven to be high-performance, low-latency, and very efficient at request handling. If I had to put a number on it, the overhead is in the single-digit milliseconds per request. I'll have more detailed benchmarks in the coming days.
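In the meantime, one rough way to sanity-check the proxy hop yourself is to time requests through the gateway and compare against calling the upstream model directly. This is a sketch under assumptions: the port and path below are placeholders, and the gateway's actual OpenAI-compatible listener is documented in the archgw README.

```python
# Rough end-to-end latency probe. Subtract the latency of a direct call
# to the upstream model to approximate the gateway's per-request overhead.
# URL and payload shape are assumptions -- check the archgw README.
import statistics
import time

import requests

URL = "http://localhost:12000/v1/chat/completions"  # placeholder port/path
PAYLOAD = {
    "model": "none",  # placeholder; the gateway's routing policy picks the model
    "messages": [{"role": "user", "content": "ping"}],
}

samples = []
for _ in range(20):
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=30)
    samples.append((time.perf_counter() - start) * 1000)

print(f"median={statistics.median(samples):.1f} ms, max={max(samples):.1f} ms")
```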
Hope this helps.
Nonetheless, we're super curious to learn more and see what we can improve. Technically this is not a classifier model — it's a usage prediction model (it feels like a classifier, but the intended usage is different).