Show HN: I made an open-source synthetic text dataset generator

2 astropat 0 5/27/2025, 6:46:21 PM github.com ↗
Many LLM projects suffer from a lack of custom datasets:

- No labelled data at all

- Existing data lacks coverage and diversity

- Data collection and annotation are slow and boring

- Not enough examples to fine-tune or evaluate LLMs…

So I built datafast, an open-source library for synthetic text dataset generation.

Right now it supports 5 dataset types:

- Text Classification Dataset

- Raw Text Generation Dataset

- Instruction Dataset (Ultrachat-like)

- Multiple Choice Question (MCQ) Dataset

- Preference Dataset

And more to come.

Currently supported LLM providers for generation are:

- OpenAI

- Anthropic

- Google Gemini

- Ollama (local LLM server)
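To give a rough idea of the general approach, here is a minimal sketch of synthetic text classification generation with a pluggable LLM call. This is a generic illustration, not datafast's actual API: `generate_classification_rows`, `build_prompts`, `PROMPT_TEMPLATE`, and `stub_llm` are all made-up names, and the stub stands in for a real provider call (OpenAI, Anthropic, Gemini, or Ollama).

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt template; a real project would tune this carefully.
PROMPT_TEMPLATE = "Write a short {style} that expresses a {label} sentiment."


def build_prompts(labels: List[str], styles: List[str]) -> List[Tuple[str, str]]:
    """Cross every label with every style to get more diverse prompts."""
    return [
        (label, PROMPT_TEMPLATE.format(style=style, label=label))
        for label, style in product(labels, styles)
    ]


def generate_classification_rows(
    labels: List[str],
    styles: List[str],
    llm: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Call the LLM once per prompt and keep non-empty outputs as labelled rows."""
    rows = []
    for label, prompt in build_prompts(labels, styles):
        text = llm(prompt).strip()
        if text:  # drop empty generations
            rows.append({"text": text, "label": label})
    return rows


def stub_llm(prompt: str) -> str:
    """Placeholder for a real provider call (assumption: one prompt in, one text out)."""
    return f"[generated for: {prompt}]"


rows = generate_classification_rows(
    labels=["positive", "negative"],
    styles=["tweet", "product review"],
    llm=stub_llm,
)
for row in rows:
    print(row)
```

Swapping `stub_llm` for a function that calls your provider of choice is the only change needed to get real text. Crossing labels with styles is one simple way to push diversity.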

More is coming, but I am not in a rush for features. I care more about data quality, diversity, and reliability than about quantity. I don't measure success by shipping more features: I succeed if it works when you try it out, and if you actually use it.

Hope you like it!
