We recently ran similar experiments and saw that fine-tuning small models on automatically curated, high-quality outputs from a large model can beat the large model's performance while cutting inference costs by up to 30x and inference time by up to 4x.
Curated Behavior Cloning: Small LLMs Can Beat Large Ones at 5-30x Lower Cost: https://www.tensorzero.com/blog/curated-behavior-cloning-sma...
We benchmarked closed-source (OpenAI, Google) and open-source (Qwen) models on multi-turn maze navigation (BabyAI), agentic RAG (Multi-Hop), and agentic tool use (τ-bench).
We're still running a few experiments and plan to update the post with additional results in a few days.
Looking forward to trying out importance weighting soon!
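For anyone curious what the curation step looks like mechanically, here's a minimal sketch; the success flag, field names, and chat-style JSONL format are illustrative assumptions, not the exact pipeline from the post:

```python
import json

# Hypothetical episode records: each one is a full multi-turn rollout
# produced by the large "teacher" model, plus whether the task succeeded.
episodes = [
    {
        "messages": [
            {"role": "system", "content": "You are navigating a maze..."},
            {"role": "user", "content": "Observation: wall ahead, door to the left."},
            {"role": "assistant", "content": "turn left"},
        ],
        "success": True,
    },
    # ... more rollouts ...
]

def curate(episodes, min_turns=1):
    """Keep only successful rollouts: the 'automatic curation' step.

    In practice the filter could also use rewards, judge scores, or dedup,
    but a simple success flag is the minimal version.
    """
    return [
        ep for ep in episodes
        if ep["success"] and len(ep["messages"]) > min_turns
    ]

# Write the curated trajectories in a chat-style JSONL format, which is what
# most supervised fine-tuning (behavior cloning) setups for small models expect.
with open("curated_sft.jsonl", "w") as f:
    for ep in curate(episodes):
        f.write(json.dumps({"messages": ep["messages"]}) + "\n")
```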
How is this kind of analogy helpful? You can frame any optimization problem as RL if you try hard enough: RL is an optimization method that calls its objective "reward maximization", and you can craft the reward function any way you want.
The key point about RL is that it is a sequential decision-making process. If you don't have something (an agent) making multiple decisions over time while interacting with an environment, why bother calling it RL?
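To make that concrete, here's a toy sketch of the sequential loop that (in my view) is what earns the name RL; the environment and policy are throwaway placeholders:

```python
import random

class ToyEnv:
    """A trivial environment: reach position 3 within 10 steps."""
    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):          # action in {-1, +1}
        self.pos += action
        self.t += 1
        done = self.pos == 3 or self.t >= 10
        reward = 1.0 if self.pos == 3 else 0.0
        return self.pos, reward, done

def policy(obs):
    return random.choice([-1, 1])    # stand-in for a learned policy

# The sequential loop is what makes this an RL problem rather than a
# one-shot "maximize f(x)" optimization: each action changes the state
# that the next decision is made in.
env = ToyEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(policy(obs))
    total += reward
print("episode return:", total)
```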
mandevil · 4h ago
Interesting to see two independent researchers on this. Makes me curious what the backstory is. A side project?
jtspringenberg · 3h ago
Author here, just to clarify: we are both no longer working for DeepMind. This was purely an independent effort for the sake of research and understanding!
Happy to answer any questions.
babelfish · 4h ago
Especially interesting given they both work for Google DeepMind.
GabrielBianconi · 3h ago
Yeah, I hadn't noticed!
henriquegodoy · 3h ago
It's cool to see the perspective that many problems (some kinds of communication problems: think lawyers, compliance, etc.) can be solved by treating AI less as agents and more as modular components within a larger system. Once we build a working process, monitored through evals, we can reduce costs by distilling these modules. That means starting with superintelligent models and later distilling them down to just a few billion parameters, instead of needing hundreds of billions.
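A rough sketch of that idea, with hypothetical module names: keep the interface fixed, start with the big model behind it, and swap in the distilled model once the module's evals hold up.

```python
from typing import Protocol

class SummarizerModule(Protocol):
    """One module in a larger pipeline; the rest of the system only sees this interface."""
    def run(self, document: str) -> str: ...

class LargeModelSummarizer:
    def run(self, document: str) -> str:
        # would call a frontier model here (placeholder)
        return f"[large-model summary of {len(document)} chars]"

class DistilledSummarizer:
    def run(self, document: str) -> str:
        # would call a few-billion-parameter model fine-tuned on the large
        # model's curated outputs (placeholder)
        return f"[distilled-model summary of {len(document)} chars]"

def pipeline(document: str, summarizer: SummarizerModule) -> str:
    # other modules (retrieval, compliance checks, etc.) would sit around this
    return summarizer.run(document)

# Start with the large model; once evals on this module look good,
# swap in the distilled model without touching the rest of the system.
print(pipeline("some long contract text...", LargeModelSummarizer()))
print(pipeline("some long contract text...", DistilledSummarizer()))
```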