Show HN: AutoThink – Boosts local LLM performance with adaptive reasoning
The core idea: instead of giving every query the same "thinking time," classify queries as HIGH or LOW complexity and allocate thinking tokens accordingly. Complex reasoning gets 70-90% of tokens, simple queries get 20-40%.
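In pseudocode, the allocation step looks roughly like this (the function names, thresholds, and the toy classifier are illustrative, not the exact optillm implementation):

    # Sketch of adaptive token budgeting: classify the query, then scale the
    # thinking-token budget. Names and thresholds are illustrative placeholders.
    def allocate_thinking_budget(query, max_thinking_tokens, classify):
        label = classify(query)  # e.g. the adaptive classifier returning "HIGH" or "LOW"
        fraction = 0.8 if label == "HIGH" else 0.3   # ~70-90% vs ~20-40% of the budget
        return int(max_thinking_tokens * fraction)

    # Toy usage with a stand-in classifier:
    budget = allocate_thinking_budget(
        "Prove that the sum of two odd integers is even.",
        max_thinking_tokens=4096,
        classify=lambda q: "HIGH" if "prove" in q.lower() else "LOW",
    )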
I also implemented steering vectors derived from Pivotal Token Search (originally from Microsoft's Phi-4 paper) that guide the model's reasoning patterns during generation. These vectors encourage behaviors like numerical accuracy, self-correction, and thorough exploration.
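To give a feel for the steering mechanism, here's a rough sketch of adding a vector to one layer's hidden states with a PyTorch forward hook. The layer index, scale, and vector file are placeholders; the actual implementation is in the repo linked below.

    # Rough sketch: inject a steering vector into one decoder layer's output
    # via a forward hook. Layer index, scale, and the vector file are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
    model = AutoModelForCausalLM.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)

    steering_vector = torch.load("self_correction.pt")  # hypothetical PTS-derived vector
    scale = 4.0                                          # steering strength is a tunable assumption

    def add_steering(module, inputs, output):
        # Decoder layers return a tuple; hidden states are the first element.
        hidden = output[0] + scale * steering_vector.to(output[0].dtype).to(output[0].device)
        return (hidden,) + output[1:]

    # Hook a middle layer (I found ~15-20 works well, see the notes further down).
    handle = model.model.layers[17].register_forward_hook(add_steering)
    # ... generate as usual, then handle.remove() when done.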
Results on DeepSeek-R1-Distill-Qwen-1.5B:
- GPQA-Diamond: 31.06% vs 21.72% baseline (+43% relative improvement)
- MMLU-Pro: 26.38% vs 25.58% baseline
- Uses fewer tokens than baseline approaches
Works with any local reasoning model - DeepSeek, Qwen, custom fine-tuned models. No API dependencies.
The technique builds on two things I developed: an adaptive classification framework that can learn new complexity categories without retraining, and an open source implementation of Pivotal Token Search.
Technical paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5253327
Code and examples: https://github.com/codelion/optillm/tree/main/optillm/autoth...
PTS implementation: https://github.com/codelion/pts
I'm curious about your thoughts on adaptive resource allocation for AI reasoning. Have you tried similar approaches with your local models?
The breakthrough was combining two techniques I'd been working on separately: adaptive classification (which can learn new categories without retraining) and an open source implementation of Pivotal Token Search from Microsoft's Phi-4 paper. When I put them together with dynamic token budgeting, the performance gains were much better than expected.
What surprised me most was that the technique actually uses fewer tokens on average while improving performance. The adaptive allocation means simple queries finish faster, offsetting the extra computation on complex ones.
A few technical notes:
- The steering vectors are small (typically <1MB per pattern) and add minimal memory overhead
- Classification adds about 10ms latency, which is negligible
- Target layer selection matters - I found middle layers (15-20) work best for most models
I'd love feedback on:
- Have you tried similar adaptive approaches with your models?
- What other reasoning patterns would be useful to steer toward?
- Ideas for automatically detecting the optimal target layer?
Thanks for checking it out! Happy to answer any questions about the implementation or results.
Not anymore. Have you seen Gemini 2.5 Pro? Ask it simple questions and it almost doesn't "think". Ask it a coding question and it'll write a long reasoning article. I think the same goes for o3.
"how long break distance does a train need if going in 100 km/hour?"
Do you just need a quick reply and don't care so much (maybe a shower thought)? Or does life or death depend on the answer?
The same question can need different amounts of thinking.
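(Back-of-the-envelope, the calculation itself is tiny; the deceleration figure below is an assumption, and real values vary a lot by train and braking mode:)

    # Braking distance d = v^2 / (2a)
    v = 100 / 3.6          # 100 km/h in m/s, about 27.8 m/s
    a = 0.7                # assumed deceleration in m/s^2 (varies widely)
    d = v ** 2 / (2 * a)   # roughly 550 m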
In this situation I suspect you'd still want the answer quickly.
If you need the answer within a couple of hours, you can probably get it from an expert; if you need an actionable answer within minutes, based on some back-of-the-envelope calculations, then a SOTA LLM is a much safer bet than flagging down whoever seems the smartest in the room and asking them for help.
Definitely, in my experience. Elsewhere in the thread, OP says that open models/systems don't do this, in which case this seems like important work toward making open alternatives competitive.
You could even put a simpler AI in front to decide if it was effectively the same query.
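Something like a small embedding-based cache could do it; a rough sketch (the model choice and threshold are arbitrary assumptions):

    # Sketch of a semantic cache: a small embedding model decides whether a new
    # query is effectively the same as one already answered.
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    cache = []  # list of (embedding, answer) pairs

    def lookup_or_none(query, threshold=0.9):
        q = embedder.encode(query, convert_to_tensor=True)
        for emb, answer in cache:
            if util.cos_sim(q, emb).item() >= threshold:
                return answer   # close enough: reuse the cached answer
        return None             # cache miss: run the full (expensive) model

    def remember(query, answer):
        cache.append((embedder.encode(query, convert_to_tensor=True), answer))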
So far I've taken only a lazy approach to optimising local LLMs: sending small queries to my M4 Mac Mini running MLX models and larger queries to my Nvidia 4090. It's remarkable how efficient the M4 is compared to Nvidia, and I think Apple is heading in the right direction with MLX.
I'll read up on AutoThink and try to integrate it into my workflow.
Or another seemingly simple equation with positive integers x, y, z requires elliptic curve knowledge, and the solution is huge (the solution is discussed here: https://www.quora.com/How-do-you-find-the-positive-integer-s...).
You're trading correctness for speed.
That's AI in a nutshell.
If someone asked me to find solutions to these example equations, there are three complications that I would immediately notice:
1. We are looking for solutions over the integers.
2. There are three variables.
3. The degree of the equation is 3.
Having all three is a deadly combination. If we were looking for solutions over reals or complex numbers? Solvable. Less than three variables? Solvable. Degree less than 3? Solvable. With all three complications, it's still not necessarily hard, but now it might be. We might even be looking at an unsolved problem.
I haven't studied enough number theory to actually solve either of these problems, but I have studied enough to know where to look. And because I know where to look, it only takes me a few seconds to recognize the "this might be very difficult" vibe that both of these have. Maybe LLMs can learn to pick up on similar cues to classify problems as difficult or not so difficult without needing to solve them. (Or maybe they have already learned?)
https://github.com/NiloCK/autothink
https://www.paritybits.me/think-toggles-are-dumb/
My own version took a first pass with an LLM whose job was to assign a 0-100 complexity rating, and then there was more or less a linear scaling of the allocated thinking budget.
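The scaling part was more or less this (a simplified sketch; the prompt and token bounds are placeholders, not the exact code in the repo above):

    # Simplified sketch of "rate complexity 0-100, then scale the thinking budget linearly".
    def linear_thinking_budget(complexity, min_tokens=256, max_tokens=4096):
        complexity = max(0, min(100, complexity))   # clamp the 0-100 rating
        return int(min_tokens + (max_tokens - min_tokens) * complexity / 100)

    # First pass (cheap model): "Rate the reasoning complexity of this question
    # from 0 to 100. Reply with just the number." -> rating
    # budget = linear_thinking_budget(int(rating))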
The OP effort here is obviously higher grade, and I'm really tickled to see quantitative results. Well done.
Also, as small language models (SML) become more competent, it's amazing what they can do on-device!
that should be SLM, right?
Even though Gemma 3 27B QAT is not a reasoning model, it's so good at instruction following and being used in LLM chains/routes that it can be used for classifying/language optimization steps before instructing it how to reason about the prompt in the next step. You can even have it output intermediate answers interspersed between multiple think tags in the same response. In many ways for these models I just define thinking as any tokens that are helping the model arrive at the conclusion, but are not fully formed parts of the answer.
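As a rough sketch of that kind of two-step chain against a local OpenAI-compatible endpoint (the endpoint, model name, and prompts are just placeholders):

    # Sketch: a cheap classification pass, then a reasoning-style pass that puts
    # intermediate work inside <think> tags. Endpoint, model, prompts are assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
    MODEL = "gemma-3-27b-it-qat"

    def answer(question):
        # Step 1: classify the question.
        label = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content":
                       f"Classify this question as SIMPLE or COMPLEX, one word only:\n{question}"}],
        ).choices[0].message.content.strip()

        # Step 2: instruct the model how to reason, with <think> tags for tokens
        # that help reach the conclusion but aren't part of the final answer.
        prompt = (
            f"The question is {label}. Work through it inside <think>...</think> tags, "
            f"possibly several times, then give the final answer after the last tag.\n\n{question}"
        )
        return client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content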
Instructing it to preferentially use certain words (tokens) and types of phrasing is known to improve results in general, not just in LLMs, and I've seen gains from encouraging certain kinds of language. AutoThink using the highest-performing tokens from a dataset _could_ be a nice way to optimize toward that in a more general way.
It seems like there's a risk of using so many pivotal tokens that it almost overfits responses to benchmark questions, though. So while I've personally seen careful word/token selection improve result quality, and see it as a potential low-cost, high-return optimization, I'd still want to see how AutoThink generalizes.
However, for a local model, answering my own queries? That's the last thing I want. I already spent way too much money on that GPU, might as well get use out of it.
From what I understood, AutoThink helps the AI “think more wisely” by adjusting how much effort it spends based on how hard the question is. That makes a lot of intuitive sense — like how people don’t spend 10 minutes figuring out what 2+2 is, but do take time with tricky problems.
Even though I don’t know the technical parts (like token budgeting or steering vectors), it’s fascinating to see how these methods can make the AI both faster and smarter at the same time.
Thanks for sharing — I’m definitely going to follow this kind of work more closely from now on.
They are a computing method where we can choose to use more or less run time (and therefore processor time) to generate results.
To me, a fairly pragmatic way of characterizing these tools day to day is to anthropomorphize them. One benefit of this heuristic: they simulate conversation and it's much easier to use them with a conversational flow. Another one is to create an approximation of a character, which makes it easier to build a useful intuition for what they can and cannot do.
Obviously these kinds of heuristics do break down. But it's obvious enough when they do so one can switch into a more precise and analytical mode of thinking.
For context, we're samaritanscout.org, a search engine that is attempting to provide a comprehensive view of all local volunteering opportunities posted across a range of nonprofit websites.
FWIW, Gemini explicitly told me that it ranks question difficulty from 1 to 100 and, depending on the bin, allocates more or fewer resources to answering it.
Do you mean someone from the Gemini team? If you "asked" the LLM, then it's likely a "hallucinated" answer. They say all sorts of things about "themselves" only because they were trained to do so. They likely have zero knowledge about their true architecture.
Even in the future, the model that is directly responding to you will likely have to know some details of its own architecture to be useful.
MMLU-Pro has 12,000 instances. To keep the total runtime bounded, we set a 600-second timeout for each instance to run.
Now have it mark blocks of text on or off, so it can ignore irrelevant or, worse, erroneous material; there's no need to include it in the context window.