Data manipulation using natural language prompts

4 points by brockmeier · 6/10/2025, 1:52:13 PM · 4 comments
I've got a dataset of ~100K input-output pairs that I want to use for fine-tuning Llama. Unfortunately it's not the cleanest dataset, so I'm having to spend some time tidying it up. For example, I only want records in English, and I also only want to include records where the input has foul language (as that's what I need for my use-case). There are loads more checks like these that I want to run, and in general I can't run them in a deterministic way because they require understanding natural language.

It's relatively straightforward to get GPT-4o to tell me (for a single record) whether or not it's in English, and whether or not it contains foul language. But if I want to run these checks over my entire dataset, I need to set up some async pipelines and it all becomes very tedious.
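For concreteness, each check ends up roughly this shape (a simplified sketch, not my actual code; the real prompts and field names differ), and then I have to wrap it in an async pipeline with rate limiting:

```python
import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set

async def check_record(text: str) -> dict:
    # Ask for a strict JSON verdict for a single record.
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": 'Reply with JSON like {"english": true, "foul": false} for this text:\n\n' + text,
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

async def check_all(records: list[dict]) -> list[dict]:
    sem = asyncio.Semaphore(20)  # cap concurrency to stay under rate limits

    async def bounded(rec):
        async with sem:
            return await check_record(rec["input"])

    return await asyncio.gather(*(bounded(r) for r in records))

# verdicts = asyncio.run(check_all(records))
```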

Collectively this cleaning process is actually taking me ages. I'm wondering, what do y'all use for this? Are there solutions out there that could help me be faster? I expected there to be some nice product where I can upload my dataset and interact with it via prompts, e.g. 'remove all records without foul language in them', but I can't really find anything. Am I missing something super obvious?

Comments (4)

jonahbenton · 1d ago
+1. I've used AI to write the code for various cleaning steps, but since iterating on data cleaning is usually a process of discovery, both in terms of the data and in terms of the requirements (especially when problems turn up), I haven't found a conversational workflow tool that operates at the right level of abstraction to be useful. Curious if any folks have.
kevinherron · 22h ago
Submit for batch processing using the OpenAI batch API?
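From memory the shape is roughly this (untested sketch; `records` stands in for your dataset and the prompt is just a placeholder): you write one JSONL line per record, upload the file, then poll the batch and download the results.

```python
import json

from openai import OpenAI

client = OpenAI()

# One request per record, written as JSONL.
with open("checks.jsonl", "w") as f:
    for i, rec in enumerate(records):
        f.write(json.dumps({
            "custom_id": f"rec-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [{
                    "role": "user",
                    "content": 'Reply with JSON {"english": bool, "foul": bool} for:\n' + rec["input"],
                }],
                "response_format": {"type": "json_object"},
            },
        }) + "\n")

batch_file = client.files.create(file=open("checks.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# Later: client.batches.retrieve(batch.id), then fetch the output file once it completes.
```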
sprobertson · 15h ago
This, in combination with making the tool call / JSON response itself also "batched", is a good pattern. Instead of returning a single `{english, foul}` object per record, pass in an array of records and have it return an array of `[{english, foul}, ...]`. Adjust the inner batch size depending on your record size and spread the rest over the batch API.
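Rough sketch of the inner batching (model and prompt are placeholders, adjust for your records); each call classifies a chunk and returns one array of verdicts:

```python
import json

from openai import OpenAI

client = OpenAI()

def classify_chunk(chunk: list[str]) -> list[dict]:
    # One call per chunk; the model returns one verdict per record, in order.
    prompt = (
        "For each record in the JSON array below, return "
        '{"results": [{"english": bool, "foul": bool}, ...]} in the same order.\n\n'
        + json.dumps(chunk)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["results"]

# e.g. 20 records per request, then spread these chunked requests over the batch API:
# verdicts = [v for i in range(0, len(texts), 20) for v in classify_chunk(texts[i:i + 20])]
```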
constantinum · 21h ago
There is https://www.visitran.com/, which is still in closed beta.