Show HN: DeepFabric – Structured synthetic datasets for model distillation

2 decodebytes 0 9/17/2025, 2:35:51 PM github.com ↗
I’ve been working on DeepFabric, an open-source CLI + SDK for generating synthetic datasets with LLMs, based on topic trees or topic graphs (DAGs).

The goal is to make it easier to create structured, diverse, domain-specific datasets — especially ones with chain-of-thought (CoT) reasoning — without hand-crafting hundreds of prompts.

What it does:

Generates datasets via topic graphs/trees to systematically cover a domain and reduce duplication.

Supports multiple CoT styles (free-text, structured, hybrid).

Works with different LLM providers (OpenAI, Anthropic, and local models via Ollama).

Configurable via YAML, usable as a library or CLI, and exports easily for Hugging Face training.
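
To illustrate the topic-tree idea, here is a minimal Python sketch. It is not DeepFabric's API: `fake_subtopics`, `build_topic_paths`, and `to_jsonl` are hypothetical names, and the canned subtopic table stands in for what would normally be an LLM call. It expands a root topic into root-to-leaf paths, skips leaves it has already seen, and emits one JSONL prompt record per path.

```python
import json
from typing import Callable, List

# Hypothetical stand-in for an LLM call that proposes subtopics for a path.
def fake_subtopics(path: List[str]) -> List[str]:
    canned = {
        "Linux security": ["File permissions", "SSH hardening"],
        "File permissions": ["chmod basics", "setuid risks"],
        "SSH hardening": ["Key-based auth", "setuid risks"],  # duplicate leaf
    }
    return canned.get(path[-1], [])

def build_topic_paths(
    root: str,
    expand: Callable[[List[str]], List[str]],
    max_depth: int,
) -> List[List[str]]:
    """Depth-first expansion; keeps only the first path to each leaf topic."""
    paths: List[List[str]] = []
    seen_leaves = set()

    def walk(path: List[str]) -> None:
        # Stop expanding once the path reaches max_depth nodes.
        children = expand(path) if len(path) < max_depth else []
        if not children:
            leaf = path[-1]
            if leaf not in seen_leaves:
                seen_leaves.add(leaf)
                paths.append(path)
            return
        for child in children:
            walk(path + [child])

    walk([root])
    return paths

def to_jsonl(paths: List[List[str]]) -> str:
    """One JSON object per line, a format Hugging Face tooling can load."""
    records = (
        {
            "topic_path": " > ".join(p),
            "prompt": f"Write a question with step-by-step reasoning about: {p[-1]}",
        }
        for p in paths
    )
    return "\n".join(json.dumps(r) for r in records)

paths = build_topic_paths("Linux security", fake_subtopics, max_depth=3)
print(to_jsonl(paths))
```

Because each prompt is derived from a distinct leaf, coverage follows the structure of the tree rather than depending on how many prompts someone wrote by hand.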

Why:

Synthetic data is increasingly important for fine-tuning, evaluation, and distillation, as seen with DeepSeek and, more recently, Phi-4.

Most existing approaches are ad-hoc; I wanted something systematic and reproducible.

Some example data here:

https://huggingface.co/datasets/lukehinds/medical_q_and_a
https://huggingface.co/datasets/lukehinds/programming-challe...
https://huggingface.co/datasets/lukehinds/linux_shell_attack...
