Ask HN: How are you cleaning and transforming data before imports/uploads?
27 dataflowmapper 18 5/14/2025, 4:27:15 PM
Hi all,
I’m curious how folks handle the prep work for data imports/uploads into systems like Salesforce, Workday, NetSuite, or really any app that uses template based import for data loading, migration, or implementation.
Specifically: - How do you manage conversions/transformations like formatting dates, getting everything aligned with the templates, mapping old codes to new ones, etc.
- Are you primarily using Excel, custom scripts, Power Query or something else?
- What are the most tedious/painful parts of this process and what have you found that works?
Really appreciate any insights and am curious to learn from everyone's experience.
Assume at some point, the data will bork up.
If you ingest Excel (ugh), treat it like free range data. I have a typescript lambda that just shreds spreadsheets in a "ok scan for this string, then assume the thing to the right of it is this value we want" style -- it's goofy AF but it's one of my favorite tools in the toolbox, since I look magical when I use it. It allows me to express-pass janky spreadsheets into Athena in minutes, not days.
It is based on the convert-excel-to-json library and once you grok how it wants to work (excel -> giant freaky JSON object with keys that correspond to cell values, so object.A, object.B, object.C etc for columns. array index for row number), you can use it as a real blunt-force chainsaw approach to unstructured data LARPing as an excel doc :D
Not Bash, not Excel, not any special-purpose tool because the motto of those is "you can't get there from here". Maybe you can get 80% of the way there, which is really seductive, but that last 20% is like going to the moon. Specifically, real programming languages have the tools to format dates correctly with a few lines of code you can wrap into a function, fake programming languages don't. Mapping codes is straightforward, etc.
You can't really trust the time handling functions on any platform which is some of the reason why languages like Python, and Java, might have two or three libraries for working with dates and times in the standard library because people realized the old one was unfixable. Plenty of Javascript date handling libraries have told people "look, this is obsolete, it's time to move on" not just because that's the Javascript way, but because the libraries really were error prone.
In a real programming language it's straightforward to fix those problems, in a fake programming language it is difficult or impossible.
If you've got a strange index structure in the source or destination data, for instance, many tools fall down. For instance if you want to convert nested set trees
https://blog.uniauth.com/nested-set-model
to something more normal like an adjacency list (or vice versa) a lot of simple tools are going to fall down.
My impression is that people use Python to do most of the things that we used to do with Perl.
Circa 2001 I wrote a cookie-based authentication system for the web which had an "authentication module" of roughly 100 lines that I wound up rewriting in at least ten different languages such as PHP, Java, Cold Fusion, Tango, etc. The OO Perl version was the shortest and my favorite.
1. define a clean interface target - for me, that's an interface that I made for my startup to import call data.
2. explore the data a little to get a sense of transformation mappings.
3. create a PII-redacted version of the file, upload it to ChatGPT along with the shape of my interface, ask it to write a transformation script in Python
4. run it on a subset of the data locally to verify that it works.
5. run it in production against my customer's account.
I'm curious - that seems like a reasonably standard flow, and it involves a bit of manual work, but it seems like the right tradeoff between toil and automation. Do you struggle with that workflow or think it could be better somehow?
Date example:
var dateValue = DateTime.ParseExact(yyyymmdd, "yyyyMMdd", null); var dateString = dateValue.ToString("yyyy-MM-dd HH:mm:ss")
The idea was to replace all of the bespoke custom Excel workflows that various program management teams used to do it themselves, so we were self-funding based on the estimated cost savings of replacing manual work, meaning I had pretty much free reign to do it however I wanted as long as I could get people to use it. I did it entirely in Python, making heavy use of Pandas, creating a toolchain that ingested mapping definitions from the users and relied on fuzzy matching (i.e. stemming, de-capitalization, recognizing word roots) to categorize field names into a common definition set used by the corporate system.
It wasn't a 100% solution, more like 95%, but the reality of this kind of work is we had a real garbage-in problem anyway that no software could solve. Extracting data from an issue tracker relies upon developers using it correctly in the first place. If you have a bunch of workflow transitions but whoever works an issue just transitions them all at once when work is complete, there's no way to know how long everything actually took. You can only extract information that was actually recorded. Coercing dates into a single format standard is fairly simple to do, but if the dates were wrong in the original format, they'll still be wrong in the corrected format.
We have a template with accompanying documentation - partners with technical teams and well-structured data can basically self-serve with SQL. We have a series of meetings where we map columns and values, review errors, etc. More irritating than the data transformations is understanding the structure of the data and practical use, e.g., the same column for Partner A means something entirely different for Partner B. Transforming it might not be the problem - making sure everyone across our teams understands what it means and where it should go is a large pain point in mapping, but the coding of this logic is trivial.
For non-technical partners where we handle the prep, over time I wrote a small internal Python library just building off DataFrames that contains a few hundred tests to validate the data, plus a set of commonly used data cleaning functions that are flexible in their inputs. We connect to APIs to verify addresses, time zones, where we can. We now use LLMs more frequently to parse and structure fields to our standard, but it is still painful to review results from the LLM and ensure correctness. Each incoming file will result in an accompanying Jupyter notebook that runs all the relevant scripts.
Issue with working Excel (formulas, Office JS/PY Scripts, manual edits) has always been version tracking - difficult to replicate work if new versions of files come in while we are prepping the data. If we find an error post-import, it is great to be able to track down where and why we made the decision or coding error. I haven't tried Power Query though. I have tried OpenRefine, but I think sometimes it slows me down for easy functions, and API-based data transformations becoming a separate task.
When we have interns, coordinating cleaning / prep on a single file across users can be tricky.
We did try an internal POC of a UI based tool to allow users to map their own data, start cleaning it, but honestly, either a) the partner would have a data analyst or steward that didn't need it, or b) the partner wouldn't have enough of a skillset to feel comfortable cleaning it. Plus pretty often we'll need to join data from multiple spreadsheets, but conditionally use certain rows from one or the other to get the most up-to-date data, which can be difficult to do. Didn't feel as if this was worth the effort to continue with.
Fuzzy de-duplication validation is a pain, or anything where you actually want to check with the partner about its correctness - like if I notice that an email is spelled correctly because it almost matches the person's last name, but 1 letter different - becomes a long list of bullet points in an email. Something I would like is an easy way to flag and share errors and have a partner correct those in a friendly UI, then store those changes as code, without a long series of emails or meetings. We've a few times had an Excel file uploaded to Sharepoint, then with a script that automatically adds a comment to cells with errors -> but again, some people just aren't comfortable working with Excel, or get confused when they see a very structured version of their data with additional formatting, and there is no convenient way to track or replicate their changes.
There are always also one-off corrections that will need to be made as well - I've just sort of accepted this, and so we generate a list of warnings / recommendations for the partner to review or correct post-import based on the test results rather that us trying to do it. That has worked fine.
A few things particularly stood out to me:
It sounds like you've built a robust internal Python library to manage a lot of this, which is impressive. When you use Jupyter notebooks for each incoming file: And for your usage of LLM, is that mostly for mapping source field to target fields?Thanks again for sharing such a comprehensive view, it's incredibly helpful for understanding the friction points and the need for traceable, repeatable solutions.