We recently ran similar experiments and saw that fine-tuning small models on automatically curated, high-quality outputs from a large model can beat the large model's performance while cutting inference costs by up to 30x and inference time by up to 4x.
Curated Behavior Cloning: Small LLMs Can Beat Large Ones at 5-30x Lower Cost: https://www.tensorzero.com/blog/curated-behavior-cloning-sma...
We benchmarked closed-source (OpenAI, Google) and open-source (Qwen) models on multi-turn maze navigation (BabyAI), agentic RAG (Multi-Hop), and agentic tool use (τ-bench).
We're still running a few experiments and plan to update the post with additional results in a few days.
Looking forward to trying out importance weighting soon!
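For anyone curious what the curation step looks like mechanically, here's a minimal sketch; the success flag, field names, and chat-style JSONL format are illustrative assumptions, not the exact pipeline from the post:

```python
import json

# Hypothetical episode records: each one is a full multi-turn rollout
# produced by the large "teacher" model, plus whether the task succeeded.
episodes = [
    {
        "messages": [
            {"role": "system", "content": "You are navigating a maze..."},
            {"role": "user", "content": "Observation: wall ahead, door to the left."},
            {"role": "assistant", "content": "turn left"},
        ],
        "success": True,
    },
    # ... more rollouts ...
]

def curate(episodes, min_turns=1):
    """Keep only successful rollouts: the 'automatic curation' step.

    In practice the filter could also use rewards, judge scores, or dedup,
    but a simple success flag is the minimal version.
    """
    return [
        ep for ep in episodes
        if ep["success"] and len(ep["messages"]) > min_turns
    ]

# Write the curated trajectories in a chat-style JSONL format, which is what
# most supervised fine-tuning (behavior cloning) setups for small models expect.
with open("curated_sft.jsonl", "w") as f:
    for ep in curate(episodes):
        f.write(json.dumps({"messages": ep["messages"]}) + "\n")
```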
How is this kind of analogy helpful? You can frame any optimization problem as RL if you try hard enough: RL is an optimization method that calls its objective "reward maximization", and you can craft the reward function any way you want.
The key point about RL is that it is a sequential decision-making process. If you don't have something (an agent) making multiple decisions over time while interacting with an environment, why bother calling it RL?
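To make that concrete, here's a toy sketch of the sequential loop that (in my view) is what earns the name RL; the environment and policy are throwaway placeholders:

```python
import random

class ToyEnv:
    """A trivial environment: reach position 3 within 10 steps."""
    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):          # action in {-1, +1}
        self.pos += action
        self.t += 1
        done = self.pos == 3 or self.t >= 10
        reward = 1.0 if self.pos == 3 else 0.0
        return self.pos, reward, done

def policy(obs):
    return random.choice([-1, 1])    # stand-in for a learned policy

# The sequential loop is what makes this an RL problem rather than a
# one-shot "maximize f(x)" optimization: each action changes the state
# that the next decision is made in.
env = ToyEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, reward, done = env.step(policy(obs))
    total += reward
print("episode return:", total)
```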
mandevil · 4h ago
Interesting to see two independent researchers on this. Makes me curious what the backstory is. A side project?
jtspringenberg · 3h ago
Author here, just to clarify: we are both no longer working for DeepMind. This was purely an independent effort for the sake of research and understanding!
Happy to answer any questions.
babelfish · 4h ago
Especially interesting given they both work for Google DeepMind.
GabrielBianconi · 3h ago
Yeah, I hadn't noticed!
henriquegodoy · 3h ago
It's cool to see the perspective that many problems (some kinds of communication problems: think lawyers, compliance, etc.) can be solved by treating AI less as agents and more as modular components within a larger system. Once we build a working process, monitored through evals, we can reduce costs by distilling these modules. That means starting with superintelligent models and later distilling them down to just a few billion parameters, instead of needing hundreds of billions.
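A rough sketch of that idea, with hypothetical module names: keep the interface fixed, start with the big model behind it, and swap in the distilled model once the module's evals hold up.

```python
from typing import Protocol

class SummarizerModule(Protocol):
    """One module in a larger pipeline; the rest of the system only sees this interface."""
    def run(self, document: str) -> str: ...

class LargeModelSummarizer:
    def run(self, document: str) -> str:
        # would call a frontier model here (placeholder)
        return f"[large-model summary of {len(document)} chars]"

class DistilledSummarizer:
    def run(self, document: str) -> str:
        # would call a few-billion-parameter model fine-tuned on the large
        # model's curated outputs (placeholder)
        return f"[distilled-model summary of {len(document)} chars]"

def pipeline(document: str, summarizer: SummarizerModule) -> str:
    # other modules (retrieval, compliance checks, etc.) would sit around this
    return summarizer.run(document)

# Start with the large model; once evals on this module look good,
# swap in the distilled model without touching the rest of the system.
print(pipeline("some long contract text...", LargeModelSummarizer()))
print(pipeline("some long contract text...", DistilledSummarizer()))
```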