Data engineering and software engineering are converging

23 points by craneca0 | 10 comments | 8/29/2025, 6:43:42 PM | clickhouse.com

Comments (10)

CalRobert · 23m ago
Data engineering was software engineering from the very beginning. Then a bunch of business analysts who didn't know anything about writing software got jealous and said that if you knew SQL/dbt you were a data engineer. I've had to explain too many times that yes, indeed, I can set up a CI/CD pipeline, stand up Kafka, or deploy Dagster on ECS, to the point where I think I need to change my title just so it isn't cheapened.
sdairs · 4m ago
I think even before dbt turned DE into "just write SQL & YAML", there was an appreciable difference between DE and SE. There were definitely some DEs writing a lot of Java/Scala at Spark-heavy companies, but in my experience DEs were doing a lot more platform engineering (similar to what you suggest), SQL, and point-and-click work, simply because that was the nature of the tooling. I wasn't really seeing many DEs spending a lot of time in an IDE.

But I think what's interesting from the post is looking at SEs adopting data infra into their workflow, as opposed to DEs writing more software.

getnormality · 6m ago
It's not hard to do data engineering to the standards of software engineering, and many people do it already, provided that

1. You use a real programming language that supports all the abstractions software engineers rely on, not (just) SQL.

2. The data is not too big, so the feedback cycle is not too horrendously slow.

#2 can't ever be fully solved, but testing a data pipeline on randomly subsampled data can help a lot in my experience.
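
For example, a rough sketch of what I mean (table and column names are made up):

    import pandas as pd

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # Whatever the real pipeline does: dedupe, derive columns, aggregate.
        out = df.drop_duplicates(subset=["order_id"])
        out["revenue"] = out["quantity"] * out["unit_price"]
        return out.groupby("customer_id", as_index=False)["revenue"].sum()

    def test_transform_on_subsample():
        full = pd.read_parquet("s3://warehouse/orders/")  # the big source
        sample = full.sample(frac=0.01, random_state=42)  # ~1% keeps the loop fast
        result = transform(sample)

        # Cheap invariants that should hold on any sample of the data.
        assert not result["customer_id"].duplicated().any()
        assert (result["revenue"] >= 0).all()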

giantg2 · 29m ago
I've never really seen the distinction between data and software engineering. It's more like front end vs. back end. If you're a data engineer and it's all no-code tooling, then you're just an analyst or something.
zurfer · 40m ago
Maybe. On the one hand you have something like dbt or Moosestack. On the other hand, analytics and data pipelining still involve a lot of no-code tooling, and I doubt that will go away. That said, I would love to learn more about how other people use coding agents for DE tasks.
rawgabbit · 26m ago
In Snowflake, I am now writing Python stored procedures that make REST API calls to things like the Datadog REST API and dump the JSON into a Snowflake table. I then unpack the JSON and transform it into a normalized table. So far it works reasonably well. This is possible using Snowflake's external access feature. https://docs.snowflake.com/en/developer-guide/external-netwo...
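
The handler looks roughly like this (sketch only; the endpoint, field paths, and table names here are illustrative, not my exact setup):

    import json
    import requests

    def run(session):
        # Only works because the procedure is created with an external access
        # integration and a network rule allowing api.datadoghq.com.
        resp = requests.get(
            "https://api.datadoghq.com/api/v2/events",
            headers={"DD-API-KEY": "<pulled from a Snowflake secret>"},
            timeout=30,
        )
        resp.raise_for_status()

        # Dump the raw response as a JSON string into a staging table.
        session.create_dataframe(
            [[json.dumps(resp.json())]], schema=["raw_json"]
        ).write.mode("append").save_as_table("RAW_DATADOG_EVENTS")

        # Unpack and normalize with PARSE_JSON + LATERAL FLATTEN; the exact
        # field paths depend on which Datadog endpoint you hit.
        session.sql("""
            INSERT INTO DATADOG_EVENTS (event_id, event_title, raw_event)
            SELECT f.value:id::string,
                   f.value:attributes:title::string,
                   f.value
            FROM RAW_DATADOG_EVENTS,
                 LATERAL FLATTEN(input => PARSE_JSON(raw_json):data) f
        """).collect()
        return "ok"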
zamalek · 19m ago
One thing I've seen through my more recent exposure to experienced data engineers is a lack of repeatability rigor (CI/CD, IaC, etc.). There's a lot of doing things in notebooks and calling that production-ready. Databricks has git integration (GitHub only from what I can tell), but that's just checking out and committing directly to trunk; if it's in git then we have an SDLC, right? It's fucking nuts.

Anyone have workflows or tooling that are highly compatible with the entrenched notebook approach, and are easy to adopt? I want to prevent these people from learning well-trodden lessons the hard way.

RobinL · 15m ago
I think this may be a Databricks thing? From what I've seen there's a gap between data engineers forced to use Databricks and everyone else. At least as it's used in practice, Databricks seems to result in a mess of notebooks with poor dependency and version management.
zamalek · 3m ago
Interesting, Databricks has been my first exposure to DE at scale, and it does seem to solve many problems (even though it sounds like it's causing some). So what does everyone else do? Run Spark etc. themselves?
esafak · 17m ago
For CI, try Dagger. It's code-based and runs locally too, so you can write tests.
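
Something like this with the Python SDK (image, paths, and the test command are placeholders, and the SDK surface shifts between releases):

    import sys
    import anyio
    import dagger

    async def main():
        # The same pipeline definition runs on a laptop and in CI.
        async with dagger.Connection(dagger.Config(log_output=sys.stderr)) as client:
            src = client.host().directory(".")
            out = await (
                client.container()
                .from_("python:3.11-slim")
                .with_directory("/app", src)
                .with_workdir("/app")
                .with_exec(["pip", "install", "-r", "requirements.txt"])
                .with_exec(["pytest", "-q"])
                .stdout()
            )
            print(out)

    anyio.run(main)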