Fine-tuned small LLMs can beat large ones with programmatic data curation

31 GabrielBianconi 8 8/4/2025, 3:55:19 PM tensorzero.com ↗

Comments (8)

k8si · 1h ago

Maybe this is a nitpick but CoNLL NER is not a "challenging task". Even pre-LLM systems were getting >90 F1 on that as far back as 2016.

Also, just in case people want to lit review further on this topic: they call their method "programmatic data curation" but I believe this approach is also called model distillation and/or student-teacher training.

GabrielBianconi · 1h ago

Thanks for the feedback!

We chose a set of tasks with different levels of complexity to see how this approach would scale. For LLMs, the "challenge" with NER is not the task itself but the arbitrariness of the labels in the dataset. I agree it's still much simpler than the other tasks we present (agentic RAG, agentic tool use, maze navigation).

There are definitely strong parallels to model distillation and student-teacher training, with the primary difference being that we don't simply take all the data from the larger model but rather filter the dataset based on metrics from the environment. In the "Does curation even matter?" section, we show that this generally improves the result by a good margin.

We link to Vicuna, which might be the closest reference as prior art: https://lmsys.org/blog/2023-03-30-vicuna/

Thanks!

mwigdahl · 1h ago

Is this just distillation but with a step to filter out low-quality responses first?

GabrielBianconi · 1h ago

AFAIK, distillation typically refers to tuning on the logits of the larger model, so you wouldn't be able to do that with fine-tuning APIs (OpenAI + Google in our blog post). We fine-tune on the outputs themselves.

But broadly speaking, yes, we generate data using a large model, curate the best samples using metrics from the environment, and fine-tune on that data. This isn't a novel technique from an academic perspective; our focus is on applying it to different use cases (e.g. agentic RAG, agentic tool use) and models (OpenAI, Google, Qwen).

Thanks!

mwigdahl · 50m ago

Thanks for the explanation and the clarification on terminology! I've used a similar approach myself and it sounded like you were doing something similar.

6510 · 1h ago

Noob question: Would it be possible to train a small model for a single prompt?

GabrielBianconi · 1h ago

With supervised fine-tuning (SFT), you'll often see good results with 100-1000+ datapoints (they can be variations of the same prompt template). If you have more limited data, reinforcement fine-tuning (RFT) can work well in the 10-100 range.

Good luck!

alchemist1e9 · 2h ago

I’ve been thinking about curating primary sources themselves and then using those for fine-tuning.

Anyone gone that route and know of projects with very high quality curated source materials? ideally categorized and labeled.