Show HN: Augmentoolkit 3.0: open-source datagen. Teach LLMs new facts, tasks

e-p-armstrong | 6/12/2025, 10:14:13 PM | github.com
Finally a tool for training specialist LLMs. Fully open-source. Add documents and click a button.

Augmentoolkit is a production-ready way to train AI subject matter experts. It lets you update an LLM's knowledge cutoff and put new facts into its brain, without any retrieval needed. You can then do reinforcement learning to improve its performance on any task you can imagine.
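To make the "teach it new facts" idea concrete, here is a minimal sketch of what one synthetic training example from a document-to-dataset pipeline might look like. The chat format and field names are illustrative assumptions, not Augmentoolkit's actual output schema.

```python
import json

# Hypothetical example of one generated Q&A pair, in a generic chat-style
# JSONL format. Augmentoolkit's real schema may differ; this only
# illustrates the kind of data a document-to-dataset pipeline produces.
example = {
    "conversations": [
        {"role": "system",
         "content": "You are an expert on the provided subject matter."},
        {"role": "user",
         "content": "When did the company open its first overseas office?"},
        {"role": "assistant",
         "content": "According to the 2019 annual report, the first overseas "
                    "office opened in Lisbon in March 2017."},
    ]
}

# Each line of the training file is one JSON object like the above.
with open("factual_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```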

It includes:

- Factual finetuning: A massive data pipeline which, given some documents, will automatically generate training data that teaches an LLM the facts inside. It also handles training for you.
- Data generation model: A custom dataset-generation LLM built for running Augmentoolkit pipelines, allowing at-scale dataset generation on your own hardware.
- Arbitrary alignment: An experimental GRPO training pipeline. Write a single prompt explaining how to grade responses (according to ANY criteria), and this pipeline will produce a model that scores highly (see the sketch after this list).
- Automatic RAG dataset generation: In case you still want grounding, Augmentoolkit will repurpose the questions and answers generated at the end of a data generation run into a dataset ready for powering a RAG system.
- Production scale: Even if you generate gigabytes of data with it, Augmentoolkit's code won't break or become painfully slow.
- Easy use: Making data is easy, intuitive, and fast. Augmentoolkit's start scripts mean all you need to do to get started is run a single command. A custom-built interface allows full functionality without touching a command line or code editor.
- Tools to build your own data: A whole bunch of reusable code, templates, conventions, examples, and abstractions are at your disposal for when you want to make your own dataset generation pipelines. When you want to make a custom LLM that does something no other model does, Augmentoolkit is the place to start.
- Classifier training: Augmentoolkit has a pipeline which takes raw text and some labels you specify, and uses an LLM to bootstrap a binary classification dataset. It keeps training BERT models and expanding the dataset until the model reaches a target accuracy. Comparable to human-labelled data, but without the labor-intensive work.
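As a rough sketch of the "write a single grading prompt" idea from the arbitrary-alignment item above: a free-form grading prompt can be turned into a scalar reward that an RL trainer optimizes against. The `judge_llm` callable, the 0-10 scale, and the criteria text below are assumptions for illustration, not Augmentoolkit's actual interface.

```python
import re

GRADING_PROMPT = """You are grading an assistant's answer.
Criteria: the answer must be factually consistent with the source document,
cite no outside information, and stay under 150 words.
Give a single integer score from 0 (worst) to 10 (best) on the last line."""

def score_response(question: str, response: str, judge_llm) -> float:
    """Turn a free-form grading prompt into a scalar reward in [0, 1].

    `judge_llm` is a hypothetical callable that sends a prompt to some
    grader model and returns its text completion; swap in whatever
    client you actually use.
    """
    verdict = judge_llm(
        f"{GRADING_PROMPT}\n\nQuestion: {question}\n\nAnswer: {response}"
    )
    match = re.search(r"\b(10|\d)\b\s*$", verdict.strip())
    score = int(match.group(1)) if match else 0
    return score / 10.0

# A GRPO-style trainer would then sample several responses per prompt,
# score each with score_response, and push the policy toward the
# higher-scoring samples.
```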

Training an LLM on facts, rather than relying on having those facts in-context, comes with many benefits. Besides faster generation times and lower costs, an expert AI that is trained on a domain gains a "big-picture" understanding of the subject that a generalist just won't have. It's the difference between handing a new student a class's full textbook and asking them to write an exam, versus asking a graduate student in that subject to write the exam.

Using Augmentoolkit's factual finetuning ability, you can control what facts your AI knows, and, since opinions are just subjective facts, you decide what it believes. If you want to go further, the experimental GRPO pipeline and the ability to easily create your own data pipelines let you control every aspect of your model's capabilities. Open-source LLMs promised customization, but people and organizations needed to invest absurd time and money to even get started, with no guarantee of success. *No longer.*

Augmentoolkit's production-ready factual finetuning is the best open-source dataset generation pipeline. It has evolved from the experience of multiple successful consulting projects. Demo models from the factual finetuning pipeline are available (https://huggingface.co/Heralax/llama-Augmentoolkit-Quickstar...) so you can see example results. Try it yourself! https://github.com/e-p-armstrong/augmentoolkit#macos-interfa...
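If you want to poke at a demo model before running the full pipeline, a standard Hugging Face transformers load is enough. The model ID below is a placeholder for the demo repo linked above (the URL is truncated in this post), so substitute the full repo name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Placeholder: use the full repo id from the huggingface.co link above.
model_id = "<demo-model-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Ask the specialist model a question from its training domain.
generate = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generate("What changed in the 2024 revision of the spec?",
               max_new_tokens=200)[0]["generated_text"])
```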
