Detecting hallucinations in LLM function calling with entropy

2 points by honorable_coder · 8/17/2025, 12:46:34 PM · archgw.com

Comments (2)

honorable_coder · 7h ago
We use this technique heavily for function-calling scenarios in https://github.com/katanemo/archgw, which uses a 3B function-calling model to neatly map a user's ask to one of many tools. The model doesn't need to write an essay; it just needs to pick the right function immediately, and the response can be synthesized by one of many configured upstream LLMs.
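
To make the entropy check concrete, here's a minimal sketch. It assumes a local model served behind an OpenAI-compatible endpoint that returns token logprobs (e.g., vLLM or llama.cpp); the endpoint URL, model name, and threshold are placeholders, not archgw's actual configuration:

```python
import math
from openai import OpenAI

# Hypothetical setup: a local 3B function-calling model behind an
# OpenAI-compatible endpoint. URL and model name are illustrative.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

ENTROPY_THRESHOLD = 1.0  # nats; tune on a validation set


def mean_token_entropy(logprob_content) -> float:
    """Approximate per-token Shannon entropy from the top-k alternatives
    the server returns, averaged over all generated tokens. Since only the
    top k probabilities are visible, this is an approximation."""
    entropies = []
    for token_info in logprob_content:
        probs = [math.exp(alt.logprob) for alt in token_info.top_logprobs]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / max(len(entropies), 1)


resp = client.chat.completions.create(
    model="function-calling-3b",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    logprobs=True,
    top_logprobs=5,   # top-k alternatives per generated token
    max_tokens=50,    # a complete function call fits in under 50 tokens
)

choice = resp.choices[0]
entropy = mean_token_entropy(choice.logprobs.content)

if entropy > ENTROPY_THRESHOLD:
    # High entropy: the model was unsure which tool/arguments to emit.
    # Treat the call as a potential hallucination: ask the user to
    # clarify, or route to a larger model, instead of executing the tool.
    print(f"low confidence (H={entropy:.2f} nats), deferring")
else:
    print(f"confident call (H={entropy:.2f} nats): {choice.message.content}")
```

The intuition: when the model is confidently right, probability mass concentrates on one token at each step and entropy stays low; when it's guessing between tools or hallucinating arguments, the distribution flattens and mean entropy rises.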

Why we do this: latency. A 3B-parameter model, especially when quantized, can deliver sub-100ms time-to-first-token and generate a complete function call in under 50 tokens. That makes the LLM "disappear" as a bottleneck, so the only real waiting time is the external tool or API being called, plus the time it takes to synthesize a human-readable response.
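
The end-to-end flow, again as an illustrative sketch rather than archgw's actual implementation (the endpoint, model names, and weather tool are all made up):

```python
import json
import time
from openai import OpenAI

# Two-stage pipeline: a small local router picks the function, the
# external API does the real work, and an upstream LLM writes the answer.
router = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
upstream = OpenAI()  # stands in for "one of many configured upstream LLMs"

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

question = "What's the weather in Paris?"

t0 = time.perf_counter()
# Stage 1: the small router emits a complete function call in <50 tokens.
call = router.chat.completions.create(
    model="function-calling-3b",  # placeholder model name
    messages=[{"role": "user", "content": question}],
    tools=[WEATHER_TOOL],
    max_tokens=50,
)
tool_call = call.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(f"routing took {(time.perf_counter() - t0) * 1000:.0f} ms")

# Stage 2: the external tool/API is now the dominant latency.
weather = {"city": args["city"], "temp_c": 21}  # stand-in for a real API

# Stage 3: an upstream LLM synthesizes the human-readable response.
answer = upstream.chat.completions.create(
    model="gpt-4o-mini",  # whichever upstream model is configured
    messages=[
        {"role": "user", "content": question},
        {"role": "assistant", "content": None,
         "tool_calls": [tool_call.model_dump()]},
        {"role": "tool", "tool_call_id": tool_call.id,
         "content": json.dumps(weather)},
    ],
)
print(answer.choices[0].message.content)
```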

sunscream89 · 6h ago
Your approach is cool, though it's a bit cringe to call it entropy. You've mitigated some response latency in exchange for an opportunity to refine the decision support upstream. It's a nice strategy!