Data manipulation using natural language prompts

4 points by brockmeier · 6/10/2025, 1:52:13 PM · 4 comments
I've got a dataset of ~100K input-output pairs that I want to use for fine-tuning Llama. Unfortunately it's not the cleanest dataset, so I'm having to spend some time tidying it up. For example, I only want records in English, and I also only want to include records where the input has foul language (as that's what I need for my use-case). There are loads more checks like these that I want to run, and in general I can't run them in a deterministic way because they require understanding natural language.

It's relatively straightforward to get GPT-4o to tell me (for a single record) whether or not it's in English, and whether or not it contains foul language. But if I want to run these checks over my entire dataset, I need to set up some async pipelines and it all becomes very tedious.
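For concreteness, each check ends up roughly this shape (a simplified sketch, not my actual code; the real prompts and field names differ), and then I have to wrap it in an async pipeline with rate limiting:

```python
import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set

async def check_record(text: str) -> dict:
    # Ask for a strict JSON verdict for a single record.
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": 'Reply with JSON like {"english": true, "foul": false} for this text:\n\n' + text,
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

async def check_all(records: list[dict]) -> list[dict]:
    sem = asyncio.Semaphore(20)  # cap concurrency to stay under rate limits

    async def bounded(rec):
        async with sem:
            return await check_record(rec["input"])

    return await asyncio.gather(*(bounded(r) for r in records))

# verdicts = asyncio.run(check_all(records))
```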

Collectively this cleaning process is actually taking me ages. I'm wondering, what do y'all use for this? Are there solutions out there that could help me be faster? I expected there to be some nice product where I can upload my dataset and interact with it via prompts, e.g. 'remove all records without foul language in them', but I can't really find anything. Am I missing something super obvious?

Comments (4)

jonahbenton · 1d ago
+1. I've used AI to write the code for various cleaning steps, but since iterating on data cleaning is usually a process of discovery, both in terms of the data and in terms of the requirements (especially when problems turn up), I haven't found a conversational workflow tool that operates at the right level of abstraction to be useful. Curious if any folks have.
kevinherron · 22h ago
Submit for batch processing using the OpenAI batch API?
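From memory the shape is roughly this (untested sketch; `records` stands in for your dataset and the prompt is just a placeholder): you write one JSONL line per record, upload the file, then poll the batch and download the results.

```python
import json

from openai import OpenAI

client = OpenAI()

# One request per record, written as JSONL.
with open("checks.jsonl", "w") as f:
    for i, rec in enumerate(records):
        f.write(json.dumps({
            "custom_id": f"rec-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [{
                    "role": "user",
                    "content": 'Reply with JSON {"english": bool, "foul": bool} for:\n' + rec["input"],
                }],
                "response_format": {"type": "json_object"},
            },
        }) + "\n")

batch_file = client.files.create(file=open("checks.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# Later: client.batches.retrieve(batch.id), then fetch the output file once it completes.
```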
sprobertson · 15h ago
This, in combination with making the tool call / JSON response itself also "batched", is a good pattern. Instead of returning a single `{english, foul}` object per record, pass in an array of records and have it return an array of `[{english, foul}, ...]`. Adjust the inner batch size depending on your record size and spread the rest over the batch API.
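Rough sketch of the inner batching (model and prompt are placeholders, adjust for your records); each call classifies a chunk and returns one array of verdicts:

```python
import json

from openai import OpenAI

client = OpenAI()

def classify_chunk(chunk: list[str]) -> list[dict]:
    # One call per chunk; the model returns one verdict per record, in order.
    prompt = (
        "For each record in the JSON array below, return "
        '{"results": [{"english": bool, "foul": bool}, ...]} in the same order.\n\n'
        + json.dumps(chunk)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["results"]

# e.g. 20 records per request, then spread these chunked requests over the batch API:
# verdicts = [v for i in range(0, len(texts), 20) for v in classify_chunk(texts[i:i + 20])]
```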
constantinum · 21h ago
There is https://www.visitran.com/, which is still in closed beta.