Ask HN: Data engineers, what sucks when working on exploratory data-related tasks?

4 robz75 8 6/18/2025, 10:27:48 AM
Hey guys,

Founder here. I’m working on building my next project and I don’t want to waste time solving fake problems.

What's currently extremely painful and annoying to do in your job? (You can be brutally honest.)

More specifically, I'm interested in how you handle exploratory data-related tasks from your team.

Very curious to get your current workflows, issues and frustrations :)

Comments (8)

daemonologist · 17m ago
As clejack said, "Org silos, security, and permissions" - this is usually the largest single time sink on any project that needs production data.

Related to this is obtaining data in bulk. Teams (understandably) are usually not willing to hand out direct read access to their databases and would prefer you use their API, but they've usually built those APIs for accessing single records at a relatively slow rate. It often takes some convincing (DoSing their API) to get a more appropriate bulk solution.
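
To put rough numbers on the single-record bottleneck, here is a minimal Python sketch, not anyone's actual setup. `fetch_one`, `fake_fetch_one`, and the 5 requests/second limit are all hypothetical stand-ins; the point is only that a polite, throttled loop over a per-record API scales badly.

```python
import time
from typing import Callable, Iterable, Iterator


def fetch_in_bulk(
    record_ids: Iterable[int],
    fetch_one: Callable[[int], dict],
    max_per_second: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[dict]:
    """Yield records one by one, spacing calls to stay under max_per_second."""
    min_interval = 1.0 / max_per_second
    last_call = None
    for record_id in record_ids:
        if last_call is not None:
            # Wait out whatever remains of the minimum gap between calls.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        yield fetch_one(record_id)


def fake_fetch_one(record_id: int) -> dict:
    # Hypothetical stand-in for another team's single-record endpoint.
    return {"id": record_id, "value": record_id * 2}


def estimate_hours(n_records: int, max_per_second: float) -> float:
    """Best-case wall-clock hours to pull n_records at the given rate."""
    return n_records / max_per_second / 3600


if __name__ == "__main__":
    rows = list(fetch_in_bulk(range(10), fake_fetch_one, max_per_second=5.0))
    print(len(rows), "records fetched")
    print(round(estimate_hours(10_000_000, 5.0), 1), "hours for 10M records")
```

Under these assumed numbers, ten million records at 5 requests/second is over 500 hours of wall-clock time, which is usually the argument that gets a proper bulk export built.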

clejack · 3h ago
The main issues for problems like this fall into three categories:

- Things that prevent you from starting the job. Org silos, security, and permissions.

- Things that prevent you from doing the job. This is primarily data cleaning.

- Things that make the job more difficult. This involves poor tooling, and you'll struggle to break the stranglehold that SQL and Python/pandas have in this area. I'd also add plotting libraries to this; many of them suck in a seemingly unavoidable way.

On the second and third points, LLMs will most likely own these soon enough, though maybe there's room to build something small and local that's more efficient if the scope of the agent is reduced?

The first point is generally organizational, and it's very difficult to solve other than by integrating your system into an environment, which is the strategy pursued by companies like Snowflake and Databricks.

robz75 · 1h ago
What are the pain points you are facing with data cleaning? How do you handle it for now?

squircle · 4h ago
Conversations and interviews > Jupyter notebook

robz75 · 4h ago
Why? What's currently annoying about notebooks that you have to deal with compared to just directly going to users?

squircle · 4h ago
Ah, well, rereading your original post, I realize now this isn't necessarily painful for me. Perhaps, though, the annoying aspect is seeing others use proprietary Excel spreadsheets without a data lake. Conway's Law?

Does VS here mean Visual Studio? I would not call myself a data engineer, I just play one at work sometimes. Many hats, y'know?

robz75 · 3h ago
"the annoying aspect is seeing others use proprietary excel spreadsheets without a data lake" => what's painful about that?

VS = compared to, versus

squircle · 37m ago
Hah okay. I read VS differently from vs. The pain, in part, is hidden functions, rarely any inline documentation, files that are difficult to reuse or repurpose, a Windows-centric workflow, etc.