Show HN: DeepFabric – Structured synthetic datasets for model distillation
The goal is to make it easier to create structured, diverse, domain-specific datasets — especially ones with chain-of-thought (CoT) reasoning — without hand-crafting hundreds of prompts.
What it does:
Generates datasets via topic graphs/trees to systematically cover a domain and reduce duplication.
Supports multiple CoT styles (free-text, structured, hybrid).
Works with different LLM providers (OpenAI, Anthropic, and local models via Ollama).
Configurable via YAML, usable as a library or from the CLI, and exports easily for Hugging Face training.
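To make the topic-tree idea concrete, here is a minimal sketch in plain Python. It is not DeepFabric's actual API: `build_topic_paths`, `stub_expand`, and the `depth`/`degree` parameters are hypothetical names for illustration, and the stub stands in for the LLM call that would propose subtopics. It expands a root topic level by level, skips subtopics that have already appeared (compared case-insensitively), and returns each root-to-leaf path as a seed for one generation prompt.

```python
from typing import Callable

# Hypothetical signature: given the path so far and a branching factor,
# return candidate subtopic names. In practice this would be an LLM call.
Expander = Callable[[tuple[str, ...], int], list[str]]


def stub_expand(path: tuple[str, ...], degree: int) -> list[str]:
    """Deterministic stand-in for an LLM that proposes subtopics."""
    return [f"{path[-1]} / subtopic {i + 1}" for i in range(degree)]


def build_topic_paths(
    root: str, expand: Expander, depth: int, degree: int
) -> list[tuple[str, ...]]:
    """Expand `root` for `depth` levels and return all root-to-leaf paths.

    A subtopic whose normalized name was already seen anywhere in the
    tree is skipped, which is one simple way to reduce duplication.
    """
    seen: set[str] = set()
    paths: list[tuple[str, ...]] = []

    def walk(path: tuple[str, ...], remaining: int) -> None:
        if remaining == 0:
            paths.append(path)
            return
        for child in expand(path, degree):
            key = child.strip().lower()
            if key in seen:
                continue  # duplicate subtopic, do not expand it again
            seen.add(key)
            walk(path + (child,), remaining - 1)

    walk((root,), depth)
    return paths


paths = build_topic_paths("Linux shell security", stub_expand, depth=2, degree=3)
# Each path can seed one prompt, so coverage follows the tree's shape.
prompts = [" > ".join(p) for p in paths]
```

Because every prompt comes from a distinct leaf path, coverage is controlled by the tree's depth and branching factor rather than by how many prompts someone remembered to write by hand.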
Why:
Synthetic data is increasingly important for fine-tuning, evaluation, and distillation, as seen with DeepSeek and, more recently, Phi-4.
Most existing approaches are ad-hoc; I wanted something systematic and reproducible.
Some example data here:
https://huggingface.co/datasets/lukehinds/medical_q_and_a
https://huggingface.co/datasets/lukehinds/programming-challe...
https://huggingface.co/datasets/lukehinds/linux_shell_attack...