Canada needs 'bold ambition' to poach top US researchers (phys.org)

1 points by bikenaga 1m ago 0 comments

The Strategy Behind Dia's Design (browsercompany.substack.com)

1 points by jbegley 1m ago 0 comments

Ask HN: Paper pad "self-prompting" as rubber-duck-with-a-context-length?

1 points by xeonmc 2m ago 0 comments

Real-time translation in 60 languages (soniox.com)

1 points by lukax 2m ago 0 comments

Rusty Music Player Client (mierak.github.io)

1 points by cyrc 3m ago 0 comments

Most Students Say They're Not 'Math People' (the74million.org)

2 points by gmays 5m ago 0 comments

The Gift and the Curse of Staying Private with Bill Gurley [video] (youtube.com)

1 points by washedup 6m ago 0 comments

Getting Lustre Upstream (lwn.net)

2 points by Bogdanp 7m ago 0 comments

New Claude Models Default to Full Code Output, Stronger Prompt Required (eval.16x.engineer)

1 points by paradite 7m ago 0 comments

Neuralink competitor Paradromics completes first human implant (cnbc.com)

1 points by gmays 9m ago 0 comments

'Failed vision': S.F. citizen body slams city, police for lack on Vision Zero (sfchronicle.com)

1 points by iancmceachern 10m ago 2 comments

Cell Flow: A New Kind of Particle Simulation Algorithm (github.com)

2 points by playfultones 10m ago 0 comments

Intel Foundry layoffs could impact 'more than 10k' factory workers (tomshardware.com)

3 points by ksec 12m ago 1 comments

Sketches by da Vinci Suggest He Understood Gravity Decades Before Newton (msn.com)

1 points by squircle 13m ago 0 comments

The bias pushing women out of computer science (techxplore.com)

1 points by bikenaga 13m ago 0 comments

Chromedp: A faster, simpler way to drive browsers (github.com)

1 points by tosh 15m ago 0 comments

Superintelligence, from First Principles (blog.jxmo.io)

1 points by jxmorris12 16m ago 0 comments

Show HN: Portle – A Client-Side LLM Interface That Doesn't Store Your Data (portle.ai)

1 points by nhp_fermi 16m ago 0 comments

Strong link between Earth's magnetic field and atmospheric oxygen levels (phys.org)

1 points by bilsbie 17m ago 0 comments

Polar – European, open source fintech-team of 3 raises $10M seed led by Accel (twitter.com)

6 points by birk 17m ago 0 comments

S-Curves and the Bitter Lesson: How We Will Progress (diwank.space)

1 points by diwank 18m ago 0 comments

XAI Raising Money, XAI and Oracle, Xbox = Windows (stratechery.com)

1 points by feross 18m ago 0 comments

Ask HN: Why study anything if AGI is (supposedly) coming?

2 points by msvana 18m ago 1 comments

WikipeQA: An evaluation dataset for both web-browsing agents and RAG systems (huggingface.co)

1 points by teilom 19m ago 0 comments

Terraform Industries Is Hiring (terraformindustries.com)

2 points by waynenilsen 22m ago 1 comments

Eidophor: 1950's space age video projection technology. [video] (youtube.com)

1 points by fanf2 23m ago 0 comments

Notes from Ms. Morrison (slate.com)

1 points by petethomas 23m ago 0 comments

Show HN: Monotone v1.2.0 is out (cloud native key-value storage for seq data) (monotone.studio)

1 points by pmwkaa 23m ago 0 comments

Buying a laptop for College/general purpose

1 points by pkrzysiek 23m ago 0 comments

Nietzschean Reflections on Liberty (isonomiaquarterly.com)

2 points by brandonlc 26m ago 0 comments

Musk's X sues New York state over social media hate speech law (bbc.com)

1 points by nradov 26m ago 0 comments

High levels of antihistamine drugs can reduce fitness gains (medicalxpress.com)

7 points by bikenaga 27m ago 4 comments

Augmented Vertex Block Descent (AVBD) (graphics.cs.utah.edu)

1 points by Luc 27m ago 0 comments

Advisory Committee on Immunization Practices at a Crossroads (jamanetwork.com)

1 points by rntn 29m ago 0 comments

Speeder Speed Controller (chromewebstore.google.com)

1 points by doroved 30m ago 0 comments

What I did during the basketball game, or, browser screenshots in Sketch (sketch.dev)

1 points by tosh 31m ago 0 comments

I've almost completely switched from "Python" to "uv run" (actinium226.substack.com)

8 points by actinium226 33m ago 0 comments

A simple Go error handling pattern led to 54GB memory usage with 65535 errors (gist.github.com)

1 points by alingse 35m ago 1 comments

Landing a Model Rocket [video] (youtube.com)

1 points by pillars 35m ago 0 comments

Artavolo – Your $0 Airtable Alternative (artavolo.com)

2 points by boikom 38m ago 3 comments

An episodic burst of genomic rearrangements (nature.com)

1 points by darkwater 39m ago 0 comments

macOS Containers, Docker Desktop and Unikernels (nanovms.com)

3 points by transpute 39m ago 0 comments

GerriScary: Hacking the Supply Chain of Popular Google Products (tenable.com)

1 points by bearsyankees 41m ago 0 comments

The OpenAI Files (openaifiles.org)

11 points by shscs911 42m ago 0 comments

A dwarf galaxy just might upend the Milky Way's predicted demise (sciencenews.org)

1 points by gmays 42m ago 0 comments

Selecting a Model Based on Stripe Conversion (cookbook.openai.com)

2 points by tosh 44m ago 0 comments

Notes on Retries (justinblank.com)

1 points by todsacerdoti 47m ago 1 comments

Mathematical Optimization: Solving Problems Using SCIP and Python (scipbook.readthedocs.io)

1 points by marklit 48m ago 0 comments

Social media destroyed one of America's key advantages (noahpinion.blog)

4 points by PaulHoule 52m ago 4 comments

Pope Leo Takes on AI (wsj.com)

1 points by calstad 53m ago 0 comments

Ask HN: Data engineers, what sucks when working on exploratory data-related tasks?

4 points by robz75 · 8 comments · 6/18/2025, 10:27:48 AM

Hey guys,

Founder here. I’m working on building my next project and I don’t want to waste time solving fake problems.

Right now, what's extremely painful and annoying to do in your job? (You can be brutally honest.)

More specifically, I'm interested in how you handle exploratory data-related tasks from your team.

Very curious to hear about your current workflows, issues, and frustrations :)

Comments (8)

daemonologist · 17m ago

As clejack said, "Org silos, security, and permissions" - this is usually the largest single time sink on any project that needs production data.

Related to this is obtaining data in bulk - teams (understandably) are usually not willing to hand out direct read access to their databases and would prefer you use their API, and they've usually built APIs intended for accessing single records at a relatively slow rate. It often takes some convincing (DoSing their API) to get a more appropriate bulk solution.
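For illustration, pulling a bulk export through a single-record API without overwhelming it might look roughly like the sketch below. It is a minimal example under stated assumptions, not anyone's actual client: `fetch_throttled`, `fetch_one`, and `max_per_second` are hypothetical names, and the agreed rate limit is something you would have to get from the team that owns the API.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def fetch_throttled(
    ids: Iterable[int],
    fetch_one: Callable[[int], T],
    max_per_second: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield fetch_one(id) for each id, spacing call starts by 1/max_per_second.

    fetch_one is a hypothetical stand-in for whatever single-record
    call the owning team exposes; clock and sleep are injectable so
    the pacing can be tested without real waiting.
    """
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    min_interval = 1.0 / max_per_second
    next_allowed = clock()
    for record_id in ids:
        # Wait until the next call is allowed to start.
        wait = next_allowed - clock()
        if wait > 0:
            sleep(wait)
        # Schedule the following call relative to when this one starts.
        next_allowed = max(clock(), next_allowed) + min_interval
        yield fetch_one(record_id)


if __name__ == "__main__":
    # Hypothetical in-memory "API" standing in for a real endpoint.
    fake_api = {i: {"id": i, "value": i * i} for i in range(5)}
    for record in fetch_throttled(range(5), fake_api.__getitem__, max_per_second=10):
        print(record)
```

Even with pacing like this, a per-record API stays slow for large pulls, which is why a purpose-built bulk export is usually the better outcome.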

clejack · 3h ago

The main issues for problems like this fall into three categories:

- Things that prevent you from starting the job. Org silos, security, and permissions

- Things that prevent you from doing the job. This is primarily data cleaning.

- Things that make the job more difficult. This involves poor tooling, and you'll struggle to break the stranglehold that SQL and python-pandas have in this area. I'll also add plotting libraries to this. Many of them suck in a seemingly unavoidable way.

On the second and third points, LLMs will most likely own these soon enough, though maybe there's room to build something small and local that's more efficient if the scope of the agent is reduced?

The first point is generally organizational, and it's very difficult to solve short of integrating your system into an environment, which is the strategy pursued by companies like Snowflake and Databricks.

robz75 · 1h ago

What are the pain points you are facing with data cleaning? How do you handle it for now?

squircle · 4h ago

Conversations and interviews > Jupyter notebook

robz75 · 4h ago

Why? What's currently annoying about notebooks that you have to deal with compared to just directly going to users?

squircle · 4h ago

Ah, well, rereading your original post I realize now this isn't necessarily painful for me. Perhaps though, the annoying aspect is seeing others use proprietary excel spreadsheets without a data lake. Conway's Law?

Does VS here mean Visual Studio? I would not call myself a data engineer, I just play one at work sometimes. Many hats, yknow?

robz75 · 3h ago

"the annoying aspect is seeing others use proprietary excel spreadsheets without a data lake" => what's painful about that?

VS = compared to, versus

squircle · 37m ago

Hah okay. I read VS differently from vs. The pain, in part, is hidden functions, rarely any inline documentation, difficult to reuse or repurpose, Windows-centric, etc.