> As recently shown, the median scan in Amazon Redshift and Snowflake reads a doable 100 MB of data, and the 99.9-percentile reads less than 300 GB. So the singularity might be closer than we think.
This isn't really saying much. It is a bit like saying the 1-in-1000-year storm levee is overbuilt for 99.9% of storms. They aren't the storms the levee was built for, y'know. It wasn't set up with them anywhere near the top of mind. The database might do 1,000 queries in a day.
The focus for design purposes is really on the queries that live out on the tail: can they be done on a smaller database? How much value do they add? What capabilities does the database need to handle them? Etc. That is what should justify a Redshift database. Or you can provision one to hold your 1 TB of data because red things go fast and we all know it :/
Mortiffer · 5h ago
The R community has been hard at work on small data. I still much prefer working on in-memory data in R; dplyr and data.table are elegant and fast.
The CRAN packages are all high quality: if a maintainer stops responding to emails for 2 months, the package is automatically removed. Most packages come from university professors who have been doing this their whole career.
wodenokoto · 4h ago
A really big part of an in-memory, dataframe-centric workflow is how easy it is to do one step at a time and inspect the result.
With a database it is difficult to run a query, look at the result and then run a query on the result. To me, that is what is missing in replacing pandas/dplyr/polars with DuckDB.
IanCal · 3h ago
I'm not sure I really follow. You can create new tables for any step if you want to do it entirely within the DB, but you can also just run DuckDB against your dataframes in memory.
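Something like this, as a rough Python sketch (only duckdb and pandas are assumed; the dataframe and column names are made up):

    import duckdb
    import pandas as pd

    # an ordinary in-memory dataframe (made-up data)
    orders = pd.DataFrame({
        "customer": ["a", "b", "a", "c"],
        "amount": [10.0, 20.0, 5.0, 7.5],
    })

    # DuckDB can see local dataframes by name, so no CREATE TABLE / INSERT step
    per_customer = duckdb.sql(
        "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
    ).df()

    # the result is itself a dataframe: inspect it, then query it again
    print(per_customer)
    top = duckdb.sql("SELECT * FROM per_customer ORDER BY total DESC LIMIT 1").df()
    print(top)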
wodenokoto · 9m ago
You can, but then every step starts with a drop table if exists; insert into …
jgalt212 · 49m ago
In R, data sources, intermediate results, and final results are all dataframes (slight simplification). With DuckDB, to have the same consistency you need every layer and step to be a database table, not a data frame, which is awkward for the standard R user and use case.
rr808 · 5h ago
Ugh, I have joined a big data team. 99% of the feeds are less than a few GB, yet we have to use Scala and Spark. It's so slow to develop and slow to run.
threeseed · 4h ago
a) Scala, being a JVM language, is one of the fastest around. Much faster than, say, Python.
b) How large are the 1% of feeds, and how large are the total joined datasets? Because ultimately that is what you build platforms for. Not the simple use cases.
Larrikin · 1h ago
But can you justify Scala existing at all in 2025? I think it pushed boundaries but ultimately failed as a language worth adopting anymore.
rr808 · 3h ago
1) Yes, Scala and the JVM are fast. If we could just use that to clean up a feed on a single box, that would be great. The problem is that calling the Spark API creates a lot of complexity for developers and for the runtime platform, which is super slow.
2) Yes, for the few feeds that are a TB we need Spark. The platform really just loads from Hadoop, transforms, then saves back again.
threeseed · 3h ago
a) You can easily run Spark jobs on a single box. Just set executors = 1 (rough sketch at the end of this comment).
b) The reason centralised clusters exist is because you can't have dozens/hundreds of data engineers/scientists all copying company data onto their laptop, causing support headaches because they can't install X library and making productionising impossible. There are bigger concerns than your personal productivity.
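Re a): a minimal PySpark sketch of the single-box setup (this assumes local mode rather than a real cluster, and the file names are placeholders):

    from pyspark.sql import SparkSession

    # everything runs in one JVM on this machine; local[*] uses all local cores,
    # so there is no cluster manager or executor fleet to worry about
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("single-box-feed-cleanup")
        .getOrCreate()
    )

    # hypothetical feed: read a few GB of CSV, clean it up, write it back out
    df = spark.read.csv("feed.csv", header=True, inferSchema=True)
    df.dropDuplicates().write.mode("overwrite").parquet("feed_clean")

    spark.stop()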
rr808 · 3h ago
> a) You can easily run Spark jobs on a single box. Just set executors = 1.
Sure, but why would you do this? Just using pandas or duckdb or even bash scripts makes your life much easier than having to deal with Spark.
cgio · 2h ago
For when you need more executors without rewriting your logic.
this_user · 1h ago
Using a Python solution like Dask might actually be better, because you can work with all of the Python data frameworks and tools, but you can also easily scale it if you need it without having to step into the Spark world.
rpier001 · 2h ago
Re: b. This is a place where remote standard dev environments are a boon. I'm not going to give each dev a terabyte of RAM, but a terabyte to share with a reservation mechanism understanding that contention for the full resource is low? Yes, please.
culebron21 · 2h ago
A tangential story. I remember, back in 2010, contemplating the idea of completely distributed DBs inspired by the then-popular torrent technology. In this one, a client would not be different from a server, except in the amount of data it holds. And it would probably receive the data in a torrent-like manner.
What puzzled me was that a client would want others to execute its queries, but would not want to load all the data and run queries for the others. And how would you prevent conflicting update queries sent to different seeds?
I also thought that Crockford's distributed web idea (where every page is hosted like on torrents) was a good one, even though I didn't think deeply about that one.
Until I saw the discussion on web3, where someone pointed out that uploading any data to one server would make a lot of hosts do the job of hosting a part of it, and every small movement would cause tremendous amounts of work for the entire web.
twic · 34m ago
This feels like a companion to the classic 2015 paper "Scalability! But at what COST?":
https://www.usenix.org/system/files/conference/hotos15/hotos...
I'm working on a big research project that uses DuckDB. I need a lot of compute resources to develop my idea, but I don't have a lot of money.
I'm throwing a bottle into the ocean: if anyone has spare compute with good specs that they could lend me for a non-commercial project, it would help me a lot.
My email is in my profile. Thank you.
willvarfar · 6h ago
I only retired my 2014 MBP ... last week! It started transiently not booting and then, after just a few weeks, switched to only transiently booting. Figured it was time. My new laptop is actually a very budget buy, and not a Mac, and in many things a bit slower than the old MBP.
Anyway, the old laptop is about par with the 'big' VMs that I use for work to analyse really big BQ datasets. My current flow is to do the kind of 0.001% queries that don't fit on a box on BigQuery and massage things with just enough prepping to make the intermediate result fit on a box. Then I extract that to parquet stored on the VM and do the analysis on the VM using DuckDB from python notebooks.
DuckDB has revolutionised not what I can do but how I can do it. All the ingredients were around before, but DuckDB brings it together and makes the ergonomics completely different. Life is so much easier with joins and things than trying to do the same in, say, pandas.
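The DuckDB end of that flow is roughly this shape (a sketch only; the glob and the query are placeholders, not the real analysis):

    import duckdb

    # in-process database, nothing to provision on the VM
    con = duckdb.connect()

    # query the extracted parquet files directly; DuckDB globs and scans them itself
    res = con.sql("""
        SELECT user_id, COUNT(*) AS events
        FROM 'extract/*.parquet'
        GROUP BY user_id
        ORDER BY events DESC
        LIMIT 20
    """).df()
    print(res)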
Cthulhu_ · 4h ago
I still have mine, but it's languishing. I don't know what to do with it or how to get rid of it; it doesn't feel like trash. The Apple stores do returns, but for this one you get nothing; they're just like "yeah, we'll take care of it".
The screen started to delaminate on the edges, and the screen of its follow-up (an MBP with the touch bar) is completely broken (probably just the connector cable).
I don't have a use for it, but it feels wasteful just to throw it away.
compiler-devel · 2h ago
I have the same machine and installed Fedora 41 on it. Everything works out of the box, including WiFi and sound.
HPsquared · 4h ago
eBay is pretty active for that kind of thing. Spares/repair.
zkmon · 6h ago
A database is not only about disk size and query performance. A database reflects the company's culture, processes, workflows, collaboration, etc. It has an entire ecosystem around it: master data, business processes, transactions, distributed applications, regulatory requirements, resiliency, ops, reports, tooling, etc.
The role of a database is not just to deliver query performance. It needs to fit into the ecosystem, serve its overall role on multiple facets, and deliver on a wide range of expectations, both tech and non-tech.
While the useful dataset itself may not outpace the hardware advancements, the ecosystem complexity will definitely outpace any hardware or AI advancements. Overall adaptation to the ecosystem will dictate the database choice, not query performance. Technologies will not operate in isolation.
willvarfar · 5h ago
And it's very much the tech culture at large that influences a company's tech choices. Those techies chasing shiny things and trying to shoehorn them into their job - perhaps cynically to pad their CVs, or perhaps generously thinking it will actually be the right thing to do - have an outsized say in how tech teams think about tech and what they imagine their job is.
Back in 2012 we were just recovering from the everything-is-XML craze and in the middle of the NoSQL craze, and everything was web-scale, distribute-first, micro-services, etc.
And now, after all that mess, we have learned to love what came before: namely, please please please just give me sql! :D
threeseed · 4h ago
Why don't you just quietly use SQL instead of condescendingly lecturing others about how compromised their tech choices are?
NoSQL (e.g. Cassandra, MongoDB) and microservices were invented to solve real-world problems, which is why they are still so heavily used today. And the criticism of them is exactly the same as what was levelled at SQL back in the day.
It's all just tools at the end of the day and there isn't one that works for all use cases.
kukkeliskuu · 3h ago
Around 20 years ago I was working for a database company. During that time, I attended SIGMOD, which is the top conference for databases.
The keynote speaker for the conference was Stonebraker, who started Postgres, among other things. He talked about the history of relational databases.
At that time, XML databases were all the rage -- now nobody remembers them. Stonebraker explained that there is nothing new in hierarchical databases. There was a significant battle in SIGMOD, I think somewhere in the 1980s (I forget the exact time frame), between network databases and relational databases.
The relational databases won that battle, as they have won against each competing hierarchical database technology since.
The reason is that relational databases are based on relational algebra. This has very practical consequences; for example, you can query the data more flexibly.
When you use a JSON store such as MongoDB, once you decide on your root entities you are stuck with that decision. I see very often in practice that new requirements come along that you did not foresee and then have to work around.
I don't care what other people use, however.
threeseed · 3h ago
MongoDB is a $2b/year revenue company growing at 20% y/y. JSON stores are not going anywhere, and they're an essential tool for dealing with data where you have no control over the schema or you want to handle it in the application layer.
And the only "battle" is one you've invented in your head. People who deal in data for a living just pick the right data store for the right data schema.
pragmatic · 1h ago
And SQL Server alone is like $5 billion/yr.
lazide · 3h ago
Sensitive much?
hobs · 38m ago
Every person I know who has ever used Cassandra in prod has cursed its name. Mongo lost data for close to a decade, and microservices mostly are NOT used to solve real-world problems but instead used as an organizational or technical hammer for which everything is a nail. Hell, there are entire books written on how you should cut people off from each other so they can "naturally" write microservices and hyperscale your company!!
zwnow · 5h ago
No, a database reflects what you make of it. Reports are just queries, after all. I don't know what all the other stuff you named has to do with the database directly. The only purpose of databases is to store and read data; that's what it comes down to. So query performance IS one of the most important metrics.
DonHopkins · 5h ago
You can always make your data bigger without increasing disk space or decreasing performance by making the font size larger!
mangecoeur · 57m ago
Did my PhD around that time and did a project "scaling" my work on a Spark cluster. Huge PITA and no better than my local setup, which was an MBP 15 with pandas and Postgres (actually I wrote and contributed a big chunk of pandas read_sql at that time to make it Postgres-compatible using SQLAlchemy).
I have worked for a half dozen companies, all swearing up and down they had big data. Meaningfully, one customer had 100 TB of logs and another 10 TB of stuff; everyone else, once the data was actually thought about properly and the utter trash was removed, was really under 10 TB.
Also - sqlite would have been totally fine for these queries a decade ago or more (just slower) - I messed with 10GB+ datasets with it more than 10 years ago.
querez · 6h ago
> The geometric mean of the timings improved from 218 to 12, a ca. 20× improvement.
Why do they use the geometric mean to average execution times?
ayhanfuat · 5h ago
It's a way of saying that twice as fast and twice as slow have equal effect on opposite sides. If your baseline is 10 seconds, one benchmark takes 5 seconds, and another one takes 20 seconds, then the geometric mean gives you 10 seconds as the result, because they cancel each other. The arithmetic mean would treat it differently, because in absolute terms a 10-second slowdown is bigger than a 5-second speedup. But that is not fair for speedups, because the absolute speedup you can reach is at most 10 seconds while a slowdown has no limit.
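In Python terms, with the same numbers (standard library only):

    from math import prod

    times = [5.0, 20.0]  # one 2x speedup and one 2x slowdown vs a 10 s baseline

    arithmetic = sum(times) / len(times)         # 12.5 -- the slowdown dominates
    geometric = prod(times) ** (1 / len(times))  # 10.0 -- the two ratios cancel

    print(arithmetic, geometric)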
tbillington · 2h ago
This is the best explain-like-im-5 I've heard for geo mean and helped it click in my head, thank you :)
zmgsabst · 58m ago
But reality doesn’t care:
If half your requests are 2x as long and half are 2x as fast, you don’t take the same wall time to run — you take longer.
Let’s say you have 20 requests, 10 of type A and 10 of type B. They originally both take 10 seconds, for 200 seconds total. You halve A and double B. Now it takes 50 + 200 = 250 seconds, or 12.5 on average.
This is a case where the geometric mean deceives you, because the two really are asymmetric and “twice as fast” is worth less than “twice as slow”.
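The same numbers spelled out in Python, for anyone who wants to check:

    a = [5.0] * 10    # type A, halved from 10 s each
    b = [20.0] * 10   # type B, doubled from 10 s each

    total = sum(a) + sum(b)              # 250 s of wall time, up from 200 s
    average = total / (len(a) + len(b))  # 12.5 s per request
    geometric = (5.0 * 20.0) ** 0.5      # 10.0 s -- "unchanged", which the wall clock disagrees with

    print(total, average, geometric)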
ayhanfuat · 19m ago
There is definitely no single magical number that can perfectly represent an entire set of numbers. There will always be some cases where it is not representative enough. In the request example you are mostly interested in the total processing time, so it makes sense to use a metric based on addition. But you could also frame a similar scenario where halving the processing time lets you handle twice as many items in the same duration. In that case a ratio-based or multiplicative view might be more appropriate.
willvarfar · 5h ago
Squaring is a really good way to make the common-but-small numbers have bigger representation than the outlying-but-large numbers.
I just did a quick Google and the first real result was this blog post with a good explanation and some good illustrations: https://jlmc.medium.com/understanding-three-simple-statistic...
It's the very first illustration at the top of that blog post that 'clicks' for me. Hope it helps!
The inverse is also good: mean-square-error is a good way of comparing how similar two datasets (e.g. two images) are.
yorwba · 3h ago
The geometric mean of n numbers is the n-th root of the product of all numbers. The mean square error is the sum of the squares of all numbers, divided by n. (I.e. the arithmetic mean of the squares.) They're not the same.
willvarfar · 3h ago
I'm not gonna edit what I wrote, but you are interpreting it way too literally. I was not describing the implementation of anything; I was just giving a link that explains why thinking about things in terms of area (geometry) is popular in stats. It's a bit like the epiphany that histograms don't need to be bars of equal width.
drewm1980 · 7h ago
I mean, not everyone spent their decade on distributed computing. Some devs with a retrogrouch inclination kept writing single-threaded code in native languages on a single node. Single-core clock speed stagnated, but it was still worth buying new CPUs with more cores because they also had more cache, and all the extra cores are useful for running ~other people's bloated code.
nyanpasu64 · 1h ago
I find that good multithreading can speed up parallelizable workloads by 5-10 times depending on CPU core count, if you don't have tight latency constraints (and even games with millisecond-level latency deadlines are multithreaded these days, though real-time code may look different than general code).
HPsquared · 2h ago
High-frequency trading, gaming, audio/DSP, embedded, etc.
There's a lot of room for that kind of developer.
mediumsmart · 6h ago
I am on the late 2015 version and I have an ebay body stashed for when the time comes to refurbish that small data machine.