Show HN: I made an open-source synthetic text datasets generator

2 astropat 0 5/27/2025, 5:09:30 AM github.com ↗

Many LLM projects suffer from a lack of custom datasets:

- No labelled data at all
- Poor coverage and diversity in existing data
- Slow and boring data collection and annotation processes
- Not enough examples to fine-tune or evaluate LLMs…

So I built datafast, an open-source library for synthetic text dataset generation.

Right now it supports five dataset types:

- Text Classification Dataset
- Raw Text Generation Dataset
- Instruction Dataset (Ultrachat-like)
- Multiple Choice Question (MCQ) Dataset
- Preference Dataset

And more to come.

Currently supported LLM providers for generation are:

- OpenAI
- Anthropic
- Google Gemini
- Ollama (local LLM server)
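To make the idea concrete, here is a minimal sketch of how a text classification dataset could be generated with a provider-agnostic LLM callable. This is not datafast's actual API: the names `generate_classification_rows` and `fake_llm`, the prompt wording, and the row fields are all hypothetical, and the stub stands in for a real call to one of the providers listed above.

```python
from typing import Callable

# Hypothetical sketch, not datafast's API: any function that maps a prompt
# to generated text can play the role of the LLM provider.
LLM = Callable[[str], str]


def generate_classification_rows(
    llm: LLM, labels: list[str], topics: list[str]
) -> list[dict]:
    """Build one labelled example per (label, topic) pair to encourage diversity."""
    rows = []
    for label in labels:
        for topic in topics:
            prompt = f"Write a short {label} customer review about {topic}."
            rows.append(
                {"text": llm(prompt).strip(), "label": label, "topic": topic}
            )
    return rows


def fake_llm(prompt: str) -> str:
    """Stand-in for a real provider call; echoes the prompt so the sketch runs offline."""
    return f"[generated for: {prompt}]"


rows = generate_classification_rows(
    fake_llm, ["positive", "negative"], ["laptops", "hotels"]
)
```

Crossing labels with topics is one simple way to get coverage across classes and subjects; swapping `fake_llm` for a real client call would be the only change needed to produce actual text.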

There is more to come but I am not in a rush for features. I seek data quality, data diversity and reliability over quantity. I don't measure success by shipping more features: I succeed if it works when you try it out, and if you actually use it.

Hope you like it!

