Data engineering and software engineering are converging
50 points by craneca0 on 8/29/2025, 6:43:42 PM | 25 comments | clickhouse.com ↗
My title is senior data engineer at GAMMA/FAANG/whatever we're calling them. I have a CS degree and am firmly in the engineering camp. My passion, though, is using software engineering and computer science principles to make very large-scale data processing as stupid fast as we can. To the extent I can ignore it, I don't personally care much about the tooling and frameworks (CI/CD, Airflow, Kafka, whatever). I care about how we're affinitizing our data, how we index it, whether and when we can use data sketches to achieve a good tradeoff between accuracy and compute/memory, and so on.
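To make the sketch tradeoff concrete, here's a minimal KMV (k-minimum-values) distinct-count sketch in plain Python — my own illustration of the idea, not anything from the comment: stream the data once, keep only the k smallest hash values, and trade a few percent of accuracy for O(k) memory.

```python
import hashlib
import heapq

def _hash01(x):
    # Map an item to a pseudo-uniform float in [0, 1) with a stable hash.
    d = hashlib.md5(repr(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def kmv_distinct(stream, k=256):
    """Estimate the number of distinct items using O(k) memory.

    Keeps only the k smallest hash values seen; for n distinct uniform
    hashes, the k-th smallest is about k / n, so n ~ (k - 1) / kth_value.
    """
    heap = []      # max-heap (values negated) of the k smallest hashes
    kept = set()   # hashes currently in the heap, to skip duplicates
    for item in stream:
        v = _hash01(item)
        if v in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            # Evict the largest kept hash and insert the smaller newcomer.
            kept.discard(-heapq.heapreplace(heap, -v))
            kept.add(v)
    if len(heap) < k:
        return len(heap)            # fewer than k distinct items: exact count
    return int((k - 1) / -heap[0])  # k-th smallest hash ~= k / n
```

With k = 256 the relative standard error is roughly 1/sqrt(k - 2), about 6%, and memory stays fixed no matter how many items stream through.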
While there are plenty of folks in this thread bashing analysts, one could also bash other "proper" engineers that can do the CI/CD but don't know shit about how to be efficient with petabyte-scale processing.
But I think what's interesting from the post is that it looks at SEs adopting data infra into their workflow, as opposed to DEs writing more software.
Ridiculous.
1. You use a real programming language that supports all the abstractions software engineers rely on, not (just) SQL.
2. The data is not too big, so the feedback cycle is not too horrendously slow.
#2 can't ever be fully solved, but testing a data pipeline on randomly subsampled data can help a lot in my experience.
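One way to do that subsampling deterministically (a sketch of the idea; `sample_rows` and the 1% rate are my own illustration, not from the thread) is to hash a stable key instead of calling a random generator, so every table samples the same entities and the pipeline's joins still line up:

```python
import hashlib

def sample_rows(rows, key_fn, rate=0.01):
    """Deterministically keep ~`rate` of rows by hashing a stable key.

    Hashing (instead of random.random) means every row with the same key
    is either in or out of the sample, so joins across tables sampled on
    the same key still match up, and reruns are reproducible.
    """
    cutoff = rate * 2**64
    for row in rows:
        d = hashlib.md5(repr(key_fn(row)).encode()).digest()
        if int.from_bytes(d[:8], "big") < cutoff:
            yield row
```

Point the pipeline at `sample_rows(events, lambda r: r["user_id"])` during development and the feedback cycle drops from hours to minutes; only the final run needs the full data.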
Another anecdatum: the data engineer role at Zillow is called "Software Development Engineer, Big Data".
Their organizations often insist they must use standard tools, and their idea of a job well done is that the task works fine in their personal environment. No automated testing, no automated deployment, no version control, and handcrafted environments. Then they get yelled at when things break and yelled at for taking too long. And most DEs want to quit the field after a few years.
The real question is not whether DE and software engineering are converging; it's why most DEs don't have the self-respect and confidence to engineer systems so that their lives don't suck.
My view is that it isn't so much a lack of "self-respect and confidence" as an acknowledgment that the path of least resistance is often the best one. Data teams are often tacked on as an afterthought, and the organizational environment is oriented towards buying off-the-shelf solutions rather than developing things in-house.
That said, version control and replicable environments are becoming standard in the profession, and as data professionals become first-class citizens in organizations, we may find that orgs orient themselves towards a more production-focused environment.
Anyone have workflows or tooling that are highly compatible with the entrenched notebook approach and easy to adopt? I want to prevent these people from learning well-trodden lessons the hard way.
There are plenty of us out here with many repos, dozens of contributors, and thousands of lines of terraform, python, custom GitHub actions, k8s deployments running airflow and internal full stack web apps that we're building, EMR spark clusters, etc. All living in our own Snowflake/AWS accounts that we manage ourselves.
The data scientists we service use notebooks extensively, but it's my team's job to clean their work up and make it testable and efficient. You can't develop real software in a notebook; it sounds like they need to upskill into a real orchestration platform like Airflow and run everything through it.
Unit test the utility functions and helpers, data quality test the data flowing in and out. Build diff reports for understanding big swings in the data to sign off changes.
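That three-layer workflow can be sketched in plain Python (the function names, the country-code helper, and the 10% threshold are my own hypothetical examples, not from the comment):

```python
def normalize_country(code):
    # Pipeline helper: a pure function, so trivially unit-testable.
    aliases = {"uk": "GB", "united kingdom": "GB", "usa": "US"}
    c = code.strip().lower()
    return aliases.get(c, c.upper())

def quality_checks(rows):
    # Data-quality test on the data flowing out: fail fast, not downstream.
    errors = []
    for i, row in enumerate(rows):
        if row.get("revenue", 0) < 0:
            errors.append(f"row {i}: negative revenue")
        if len(row.get("country", "")) != 2:
            errors.append(f"row {i}: bad country code")
    return errors

def diff_report(old_totals, new_totals, threshold=0.10):
    # Flag metrics that swung more than `threshold` between two runs,
    # for a human to sign off before the change ships.
    flagged = {}
    for key in old_totals.keys() & new_totals.keys():
        old, new = old_totals[key], new_totals[key]
        if old and abs(new - old) / abs(old) > threshold:
            flagged[key] = (old, new)
    return flagged
```

The unit tests pin down the helpers, the quality checks gate every run's output, and the diff report turns "did this change break the numbers?" into a reviewable artifact.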
My email is in my profile I'm happy to discuss further! :-)
For Spark, Glue works quite well. We use it as "Spark as a service", keeping our code as close to vanilla PySpark as possible. This leaves us free to write our code in normal Python files, write our own (tested) libraries which are used in our jobs, use GitHub for version control and CI, and so on.
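One common way to structure such a library (a sketch of the pattern; the function and field names here are my own illustration, not from the comment) is to keep the row-level logic free of Spark imports, so it can be unit tested with plain pytest — no cluster, no SparkSession:

```python
# Library module imported by the Glue/PySpark job. Only the thin
# wrapper at the bottom ever touches Spark APIs.

def clean_event(event: dict) -> dict:
    """Normalize one raw event record; pure Python, trivially testable."""
    return {
        "user_id": str(event["user_id"]),
        "amount_cents": round(float(event.get("amount", 0)) * 100),
        "source": event.get("source", "unknown").lower(),
    }

def clean_partition(events):
    """Spark-facing wrapper, used as rdd.mapPartitions(clean_partition)."""
    for e in events:
        yield clean_event(e)
```

The job file then becomes a few lines of glue code (reading, calling `mapPartitions`, writing), and everything interesting lives in a library the CI pipeline can test on a laptop.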
Personally, you couldn't pay me to run Spark myself these days (and I used to work for the biggest Hadoop vendor in the mid-2010s, doing a lot of Spark!)