Show HN: Augmentoolkit 3.0: open-source datagen. Teach LLMs new facts, tasks

e-p-armstrong | 6/12/2025, 10:14:13 PM | github.com
Finally a tool for training specialist LLMs. Fully open-source. Add documents and click a button.

Augmentoolkit is a production-ready way to train AI subject matter experts. It lets you update an LLM's knowledge cutoff and put new facts into its brain, without any retrieval needed. You can then do reinforcement learning to improve its performance on any task you can imagine.
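To make the "teach it new facts" idea concrete, here is a minimal sketch of what one synthetic training example from a document-to-dataset pipeline might look like. The chat format and field names are illustrative assumptions, not Augmentoolkit's actual output schema.

```python
import json

# Hypothetical example of one generated Q&A pair, in a generic chat-style
# JSONL format. Augmentoolkit's real schema may differ; this only
# illustrates the kind of data a document-to-dataset pipeline produces.
example = {
    "conversations": [
        {"role": "system",
         "content": "You are an expert on the provided subject matter."},
        {"role": "user",
         "content": "When did the company open its first overseas office?"},
        {"role": "assistant",
         "content": "According to the 2019 annual report, the first overseas "
                    "office opened in Lisbon in March 2017."},
    ]
}

# Each line of the training file is one JSON object like the above.
with open("factual_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```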

It includes:

- Factual finetuning: A massive data pipeline which, given some documents, will automatically generate training data that teaches an LLM the facts inside. It also handles training for you.
- Data generation model: A custom dataset-generation LLM built for running Augmentoolkit pipelines, allowing at-scale dataset generation on your own hardware.
- Arbitrary alignment: An experimental GRPO training pipeline. Write a single prompt explaining how to grade responses (according to ANY criteria), and this pipeline will produce a model that scores highly (see the sketch after this list).
- Automatic RAG dataset generation: In case you still want grounding, Augmentoolkit will repurpose the questions and answers generated at the end of a data generation run into a dataset ready for powering a RAG system.
- Production scale: Even if you generate gigabytes of data with it, Augmentoolkit's code won't break or become painfully slow.
- Easy use: Making data is easy, intuitive, and fast. Augmentoolkit's start scripts mean all you need to do to get started is run a single command. A custom-built interface allows full functionality without touching a command line or code editor.
- Tools to build your own data: A whole bunch of reusable code, templates, conventions, examples, and abstractions are at your disposal for when you want to make your own dataset generation pipelines. When you want to make a custom LLM that does something no other model does, Augmentoolkit is the place to start.
- Classifier training: Augmentoolkit has a pipeline which takes raw text and some labels you specify, and uses an LLM to bootstrap a binary classification dataset. It keeps training BERT models and expanding the dataset until the model reaches a target accuracy. Comparable to human-labelled data, but without the labor-intensive work.
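As a rough sketch of the "write a single grading prompt" idea from the arbitrary-alignment item above: a free-form grading prompt can be turned into a scalar reward that an RL trainer optimizes against. The `judge_llm` callable, the 0-10 scale, and the criteria text below are assumptions for illustration, not Augmentoolkit's actual interface.

```python
import re

GRADING_PROMPT = """You are grading an assistant's answer.
Criteria: the answer must be factually consistent with the source document,
cite no outside information, and stay under 150 words.
Give a single integer score from 0 (worst) to 10 (best) on the last line."""

def score_response(question: str, response: str, judge_llm) -> float:
    """Turn a free-form grading prompt into a scalar reward in [0, 1].

    `judge_llm` is a hypothetical callable that sends a prompt to some
    grader model and returns its text completion; swap in whatever
    client you actually use.
    """
    verdict = judge_llm(
        f"{GRADING_PROMPT}\n\nQuestion: {question}\n\nAnswer: {response}"
    )
    match = re.search(r"\b(10|\d)\b\s*$", verdict.strip())
    score = int(match.group(1)) if match else 0
    return score / 10.0

# A GRPO-style trainer would then sample several responses per prompt,
# score each with score_response, and push the policy toward the
# higher-scoring samples.
```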

Training an LLM on facts, rather than relying on having those facts in-context, comes with many benefits. Besides faster generation times and lower costs, an expert AI that is trained on a domain gains a "big-picture" understanding of the subject that a generalist just won't have. It's the difference between handing a new student a class's full textbook and asking them to write an exam, versus asking a graduate student in that subject to write the exam.

Using Augmentoolkit's factual finetuning ability, you can control what facts your AI knows, and, since opinions are just subjective facts, you decide what it believes. If you want to go further, the experimental GRPO pipeline and the ability to easily create your own data pipelines let you control every aspect of your model's capabilities. Open-source LLMs promised customization, but people and organizations needed to invest absurd time and money to even get started, with no guarantee of success. *No longer.*

Augmentoolkit's production-ready factual finetuning is the best open-source dataset generation pipeline. It has evolved from the experience of multiple successful consulting projects. Demo models from the factual finetuning pipeline are available (https://huggingface.co/Heralax/llama-Augmentoolkit-Quickstar...) so you can see example results. Try it yourself! https://github.com/e-p-armstrong/augmentoolkit#macos-interfa...
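If you want to poke at a demo model before running the full pipeline, a standard Hugging Face transformers load is enough. The model ID below is a placeholder for the demo repo linked above (the URL is truncated in this post), so substitute the full repo name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Placeholder: use the full repo id from the huggingface.co link above.
model_id = "<demo-model-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Ask the specialist model a question from its training domain.
generate = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generate("What changed in the 2024 revision of the spec?",
               max_new_tokens=200)[0]["generated_text"])
```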
