Ask HN: Data engineers, what sucks when working on exploratory data-related tasks?

5 points by robz75 · 11 comments · 6/18/2025, 10:27:48 AM

Hey guys,

Founder here. I’m working on building my next project and I don’t want to waste time solving fake problems.

What is extremely painful and annoying to do in your job right now? (You can be brutally honest.)

More specifically, I'm interested in how you handle exploratory data-related tasks from your team.

Very curious to get your current workflows, issues and frustrations :)

Comments (11)

dapperdrake · 3h ago

Have been working on this for a while with real stakes.

You have two issues that computers cannot help with (by their nature). And this incidental complexity dominates all the rest.

1. What people want to do with data

2. Bureaucracies are willfully oblivious to this problem domain

What people actually want to do with data: Answer questions that are interesting to them. It is all about the problem domain and its geometry.

Problem: You can only falsify hypotheses when asking reality questions. Everything else will bankrupt you. You can only work with the data that you have. Collecting data will always be hard. Computers are only involved because they happen to be good at crunching numbers.

Bureaucracies only care about process and never about outcomes. And LLMs can now produce random plausible PowerPoint material to satisfy this demand. Only plausibility ever mattered, because it is empirically sufficient as an excuse for CYA.

---------

Naval Ravikant (abridged): "Tell truth, don't waste word."

daemonologist · 4h ago

As clejack said, "Org silos, security, and permissions" - this is usually the largest single time sink on any project that needs production data.

Related to this is obtaining data in bulk - teams (understandably) are usually not willing to hand out direct read access to their databases and would prefer you use their API, and they've usually built APIs intended for accessing single records at a relatively slow rate. It often takes some convincing (DoSing their API) to get a more appropriate bulk solution.

ferguess_k · 3h ago

Mostly human problems, especially if you work with analytics teams. I need a PO for data. We usually don't have a dedicated PO for data products, so we have to do all the requirements gathering ourselves.

For exploratory data-related tasks, these are mostly related to checking data format or malformed data, so it is not a huge issue. But since you are building a product, I'll share my experience: what I need is a quick way to explore schema changes within a column of a database table (not the schema of the table). Imagine you have a table `user` with a column called `context` that holds a bunch of JSON payloads. I need a quick way to summarize that field and list all the "variations" of its schema.
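A minimal sketch of what such a summary could look like, assuming the `context` values have already been pulled out of the table as raw JSON strings. The helper names `shape_of` and `schema_variations` and the sample rows are hypothetical, made up for illustration. It reduces every payload to its structure (keys and value types, ignoring the values themselves) and counts how often each structure occurs:

```python
import json
from collections import Counter


def shape_of(value):
    """Return a hashable description of a JSON value's structure (keys and types, not values)."""
    if isinstance(value, dict):
        # Sort by key so that key order in the payload does not create a new variation.
        return tuple(sorted((key, shape_of(item)) for key, item in value.items()))
    if isinstance(value, list):
        # Collapse a list to the distinct element shapes it contains.
        return ("list", tuple(sorted({shape_of(item) for item in value}, key=repr)))
    if value is None:
        return "null"
    return type(value).__name__


def schema_variations(payloads):
    """Count how often each distinct structure occurs among raw JSON strings."""
    counts = Counter()
    for raw in payloads:
        try:
            counts[shape_of(json.loads(raw))] += 1
        except (TypeError, ValueError):
            # NULL columns and unparseable strings are grouped together.
            counts["<malformed>"] += 1
    return counts


# Hypothetical sample of values from the `context` column.
rows = [
    '{"plan": "free", "age": 31}',
    '{"age": 40, "plan": "pro"}',
    '{"plan": "pro", "tags": ["a", "b"]}',
    "not json",
]
counts = schema_variations(rows)
for shape, n in counts.most_common():
    print(n, shape)
```

The first two rows count as the same variation because only keys and types are compared. On a large table this would probably be run on a sample of rows rather than the full column.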

clejack · 7h ago

The main issues for problems like this fall into 3 categories:

- Things that prevent you from starting the job. Org silos, security, and permissions

- Things that prevent you from doing the job. This is primarily data cleaning.

- Things that make the job more difficult. This involves poor tooling, and you'll struggle to break the stranglehold that SQL and python-pandas have in this area. I'll also add plotting libraries to this. Many of them suck in a seemingly unavoidable way.

On the second and third points, LLMs will most likely own these soon enough, though maybe there's room to build something small and local that's more efficient if the scope of the agent is reduced?

The first point is generally organizational, and it's very difficult to solve outside of integrating your system into an environment, which is the strategy pursued by companies like Snowflake and Databricks.

robz75 · 6h ago

What are the pain points you are facing with data cleaning? How do you handle it for now?

dapperdrake · 3h ago

Data cleaning depends on the problem domain.

Compare eliminating outliers from the output of a spectrometer (or spectrograph) vs. eliminating outliers from an almost linear process. One will wreck your data and the other is the only correct thing to do.
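To make the "almost linear process" case concrete, here is a rough sketch, not a recommendation for any particular dataset. It fits a straight line and drops points whose residual is far from the median residual, measured in units of the median absolute deviation (MAD). The function names, the threshold of 5, and the sample numbers are all assumptions for illustration:

```python
from statistics import median


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def drop_residual_outliers(xs, ys, threshold=5.0):
    """Keep points whose residual lies within threshold * MAD of the median residual."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    if mad == 0:
        # No measurable spread in the residuals, so nothing is flagged.
        return list(zip(xs, ys))
    return [
        (x, y)
        for x, y, r in zip(xs, ys, residuals)
        if abs(r - center) <= threshold * mad
    ]


# Hypothetical readings from a process that follows roughly y = 2x + 1,
# with one corrupted value at x = 5.
xs = list(range(10))
ys = [1.1, 2.9, 5.0, 7.1, 8.9, 100.0, 13.1, 14.9, 17.0, 19.1]
kept = drop_residual_outliers(xs, ys)
print(len(kept), "of", len(xs), "points kept")
```

Running the same filter over a spectrum would flag the narrow peaks as "outliers", and those peaks are usually the signal, which is why the cleaning step has to come from the problem domain.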

squircle · 8h ago

Conversations and interviews > Jupyter notebook

robz75 · 8h ago

Why? What's currently annoying about notebooks that you have to deal with compared to just directly going to users?

squircle · 8h ago

Ah, well, rereading your original post I realize now this isn't necessarily painful for me. Perhaps, though, the annoying aspect is seeing others use proprietary Excel spreadsheets without a data lake. Conway's Law?

Does VS here mean Visual Studio? I would not call myself a data engineer, I just play one at work sometimes. Many hats, yknow?

robz75 · 7h ago

"the annoying aspect is seeing others use proprietary excel spreadsheets without a data lake" => what's painful about that?

VS = compared to, versus

squircle · 4h ago

Hah okay. I read VS differently from vs. The pain, in part, is hidden functions, rarely any inline documentation, difficult to reuse or repurpose, Windows-centric, etc.
