I have a personal pet peeve about Parquet that is solved, incompatibly, by basically every "data lake / lakehouse" layer on top, and I'd love to see it become compatible: ranged partitioning.
I have an application which ought to be a near-perfect match for Parquet. I have a source of timestamped data (basically a time series, except that the intervals might not be evenly spaced -- think log files). A row is a timestamp and a bunch of other columns, and all the columns have data types that Parquet handles just fine [0]. The data accumulates, and it's written out in batches, and the batches all have civilized sizes. The data is naturally partitioned on some partition column, and there is only one writer for each value of the partition column. So far, so good -- the operation of writing a batch is a single file creation or create call to any object store. The partition column maps to the de-facto sort-of-standard Hive partitioning scheme.
Except that the data is (obviously) also partitioned on the timestamp -- each batch covers a non-overlapping range of timestamps. And Hive partitioning can't represent this. So none of the otherwise excellent query tools can naturally import the data unless I engage in a gross hack:
I could also partition on a silly column like "date". This involves aligning batches to date boundaries and also makes queries uglier.
I could just write the files and import "*.parquet". This kills performance and costs lots of money.
I could use Iceberg or Delta Lake or whatever for the sole benefit that their client tools can handle ranged partitions. Gee thanks. I don't actually need any of the other complexity.
It would IMO be really really nice if everyone could come up with a directory-name or filename scheme for ranged partitioning.
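Purely as illustration of the wish (no standard tool understands this today -- that's the complaint), a Hive-style layout extended with a per-batch range might look like the sketch below; the prefix, column names, and timestamp format are all made up:

```python
from datetime import datetime, timezone


def batch_key(source: str, ts_min: datetime, ts_max: datetime) -> str:
    """A made-up 'Hive partitioning plus a range' object key. Nothing parses
    the ts_min=/ts_max= directories today -- which is the peeve."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (f"logs/source={source}"
            f"/ts_min={ts_min.strftime(fmt)}/ts_max={ts_max.strftime(fmt)}"
            f"/batch.parquet")


print(batch_key("web-01",
                datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
                datetime(2024, 1, 1, 0, 5, tzinfo=timezone.utc)))
```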
[0] My other peeve is that a Parquet row and an Arrow row and a Thrift message and a protobuf message, etc., are almost but not quite the same thing. It would be awesome if there were a companion binary format for a single Parquet row or a stream of rows so that tools could cooperate more easily on producing the data that eventually gets written into Parquet files.
hendiatris · 23m ago
In the lower-level arrow/parquet libraries you can control the row groups, and even the data pages (although it’s a lot more work). I have used this heavily with the arrow-rs crate to drastically improve (like 10x) how quickly data can be queried from files. Some row groups will have just a few rows, others will have thousands, but being able to skip searching many row groups makes the skew irrelevant.
Just beware of one limit you can hit: the number of row groups per file (2^15).
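For anyone who wants the Python equivalent of that arrow-rs approach: with pyarrow you get similar control, since each write_table() call on a ParquetWriter writes its own row group(s). A minimal sketch, with the file name and schema made up:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("ts", pa.int64()), ("value", pa.float64())])

# Pretend these are the natural batches; their sizes can be wildly skewed.
chunks = [
    pa.table({"ts": [1, 2, 3], "value": [0.1, 0.2, 0.3]}, schema=schema),
    pa.table({"ts": [1000, 1001], "value": [1.0, 1.1]}, schema=schema),
]

with pq.ParquetWriter("skewed_row_groups.parquet", schema) as writer:
    for chunk in chunks:
        # Each call writes the chunk as its own row group(s), so the
        # row-group statistics line up with the batch boundaries.
        writer.write_table(chunk)
```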
bitbang · 1h ago
Why is the footer metadata not sufficient for this need? The metadata should contain the min and max timestamp values for the column of interest, so when executing a query the tool can read the metadata first and skip any Parquet file whose time range falls outside the query's range.
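A minimal sketch of that pruning in Python with pyarrow (file name, column name, and range are placeholders). The catch is that you still have to list the files and fetch every footer, which is exactly the "import *.parquet" cost complained about upthread:

```python
import pyarrow.parquet as pq


def file_might_overlap(path: str, ts_col: str, lo, hi) -> bool:
    """Decide from footer statistics alone whether [lo, hi] can overlap this file."""
    md = pq.ParquetFile(path).metadata
    for rg in range(md.num_row_groups):
        for ci in range(md.num_columns):
            col = md.row_group(rg).column(ci)
            if col.path_in_schema != ts_col:
                continue
            stats = col.statistics
            if stats is None or not stats.has_min_max:
                return True   # no stats -> can't prune, must read the file
            if stats.max >= lo and stats.min <= hi:
                return True   # this row group overlaps the query range
    return False


print(file_might_overlap("some_batch.parquet", "ts", 0, 10))
```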
dugmartin · 1h ago
This can also be done using row group metadata within the parquet file. The row group metadata can include the range values of ordinals so you can "partition" on timestamps without having to have a file per time range.
simlevesque · 14m ago
I wish we had more control of the row group metadata when writing Parquet files with DuckDB.
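For what it's worth, the knobs I know of today are on COPY: pick the row group size and pre-sort the data so each group's min/max stats stay tight. A rough sketch, with the paths and column name made up:

```python
import duckdb

con = duckdb.connect()
# Sorting before writing keeps each row group's min/max timestamp tight,
# which is what makes the footer stats useful for pruning later.
con.sql("""
    COPY (SELECT * FROM read_parquet('raw/*.parquet') ORDER BY ts)
    TO 'sorted.parquet'
    (FORMAT PARQUET, ROW_GROUP_SIZE 100000)
""")
```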
wodenokoto · 2h ago
I’m building a poor man’s data lake at work, basically putting parquet files in blob storage using deltalake-rs’ Python bindings and DuckDB for querying.
However, I constantly run into problems with concurrent writes. I have a cloud function triggered every x minutes to pull data from an API, and that’s fine.
But if I need to run a backfill, I risk that process running at the same time as the timer-triggered function, especially if I load my backfill queue with hundreds of runs that need to be pulled and they start saturating the workers in the cloud function.
isoprophlex · 52m ago
Add a randomly chosen suffix to your filenames?
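A minimal sketch of that suggestion (the prefix is made up): it keeps two writers from ever producing the same object key, though it doesn't make the commit itself any less contended.

```python
from datetime import datetime, timezone
from uuid import uuid4


def object_key(prefix: str) -> str:
    """Timestamp plus a random suffix: a backfill run and the timer-triggered
    run can never overwrite each other's files."""
    now = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}/{now}-{uuid4().hex}.parquet"


print(object_key("landing/api_pulls"))
```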
SystemOut · 15m ago
Strangely I can't get to this domain. We have ZScaler at work with DGA Blocking enabled and it prevents me from loading the page.
rickette · 7m ago
Most likely caused by the .select TLD.
dkdcio · 4h ago
This looks awesome. One of my biggest gripes personally with Iceberg (less so Delta Lake, but similar) is how difficult it is to just try out on a laptop. Delta Lake has vanilla Python implementations, but those are fragmented and buggy IME. Iceberg has just never worked locally; you need a JVM cluster and a ton of setup. I went down a similar road of trying to use sqlite/postgres+duckdb+parquet files in blob storage, but it was a lot of work.
It seems like this will just work out of the box, and just scale up to very reasonable data sizes. And the work from the DuckDB folks is typically excellent. It's clear they understand this space. Excited to try it out!
mritchie712 · 1h ago
Here's a step-by-step setup. It's using S3 and RDS, but it wouldn't be hard to swap in a local sqlite instead.
https://www.definite.app/blog/cloud-iceberg-duckdb-aws
Have you tried out PyIceberg yet? It's a pure Python implementation and it works pretty well. It supports a SQL Catalog as well as an In-Memory Catalog via a baked-in SQLite SQL Catalog.
https://py.iceberg.apache.org/
https://delta-io.github.io/delta-rs/
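For anyone curious, the SQLite-backed local setup looks roughly like this (paths and table name are placeholders, API as in the PyIceberg getting-started docs):

```python
import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog

# Catalog metadata in a local SQLite file, table data in a local warehouse dir.
catalog = SqlCatalog(
    "local",
    uri="sqlite:////tmp/pyiceberg_catalog.db",
    warehouse="file:///tmp/warehouse",
)

catalog.create_namespace("default")

df = pa.table({"ts": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
table = catalog.create_table("default.events", schema=df.schema)
table.append(df)

print(table.scan().to_arrow())
```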
https://quesma.com/blog-detail/apache-iceberg-practical-limi...
Even Snowflake was using FoundationDB for metadata, whereas Iceberg attempts to use blob storage even for the metadata layer.
They support syncing to Iceberg by writing the manifest and metadata files on demand, and they already have read support for Iceberg. They just fixed Iceberg's core issues, but it's not a direct competitor, as you can use DuckLake along with Iceberg in a very nice and bidirectional way.
prpl · 4h ago
Metadata bloat can be due to a few things, but it’s manageable:
* number of snapshots
* frequent large schema changes
* lots of small files/row level updates
* lots of stats
The last one IIRC used to be pretty bad especially with larger schemas.
Most engines have ways to help with this - compaction, snapshot expiration, etc. Though it can still be up to the user. S3 Tables is supposed to do some of this for you.
If metadata is below 1-5MB it’s really not an issue. Your commit rate is effectively limited by the size of your metadata and the number of writers you have.
I’ve written scripts to fix 1GB+ metadata files in production. Usually it was pruning snapshots without deleting files (relying on bucket policy to later clean things up) or removing old schema versions.
mehulashah · 4h ago
We've come full circle. If you want to build a database, then you need to build it like a database. Thank you DuckDB folks!
buremba · 4h ago
My understanding was that MotherDuck was focusing on providing the "multiplayer mode" for DuckDB. It's interesting to see DuckDB Labs supporting data lakes natively. I guess MotherDuck is potentially moving to the UI layer by providing the notebook interface for DuckDB.
peterboncz · 3h ago
Good point! Anticipating official announcements, I can confirm that MotherDuck indeed intends to both host DuckLake catalogs and facilitate querying DuckLakes using DuckDB via its cloud-based DuckDB service.
nehalem · 4h ago
I wonder how this relates to Mother Duck (https://motherduck.com/)? They do „DuckDB-powered data warehousing“ but predate this substantially.
jtigani · 1h ago
For what it's worth, MotherDuck and DuckLake will play together very nicely. You will be able to have your MotherDuck data stored in DuckLake, improving scalability, concurrency, and consistency while also giving access to the underlying data to third-party tools. We've been working on this for the last couple of months, and will share more soon.
nojvek · 3h ago
MotherDuck is hosting DuckDB in the cloud. DuckLake is a much more open system.
With DuckLake you can build a petabyte-scale warehouse with multiple reader and writer instances, all transactional, on your own S3 and your own EC2 instances.
MotherDuck has limitations like only one writer instance. Read replicas can be 1m behind (not transactional).
Having different instances concurrently writing to different tables is not possible.
DuckLake gives proper separation of compute and storage with a transactional metadata layer.
spenczar5 · 5h ago
There is a lot to like here, but once metadata is in the novel Ducklake format, it is hard to picture how you can get good query parallelism, which you need for large datasets. Iceberg already is well supported by lots of heavy-duty query engines and that support is important once you have lots and lots and lots of data.
buremba · 4h ago
You don't need to store the metadata in DuckDB; it can live in your own PostgreSQL/MySQL, similar to an Iceberg REST Catalog. They solve query parallelism by allowing you to perform computations on the edge, enabling horizontal scaling of the compute layer.
They don't focus on solving the scalability problem in the metadata layer; you might need to scale your PostgreSQL independently as you have many DuckDB compute nodes running on the edge.
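A sketch of what that deployment shape looks like from a single compute node, assuming the ATTACH syntax from the DuckLake announcement (connection string, bucket, and table name are placeholders):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")  # a Postgres catalog may also need DuckDB's postgres extension

# Catalog in your own Postgres, data files on S3; every DuckDB node (laptop,
# EC2 instance, cloud function) attaches the same pair and sees the same tables.
con.sql("""
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=pg.internal' AS lake
        (DATA_PATH 's3://my-bucket/lake/')
""")

con.sql("SELECT count(*) FROM lake.events").show()  # 'events' is a hypothetical table
```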
spenczar5 · 3h ago
Even though it's in your own SQL DB, there's still some sort of layout for the metadata. That's the thing that trino/bigquery/whatever won't understand (yet?).
> They solve query parallelism by allowing you to perform computations on the edge, enabling horizontal scaling of the compute layer.
Hmm, I don't understand this one. How do you horizontally scale a query that scans all data to do `select count(*), col from huge_table group by col`, for example? In a traditional map-reduce engine, that turns into parallel execution over chunks of data, which get later merged for the final result. In DuckDB, doesn't that necessarily get done by a single node which has to inspect every row all by itself?
nojvek · 3h ago
You're correct that duckdb doesn't do any multi-node map-reduce; however, duckdb utilizes all available cores on a node quite effectively to parallelize scanning. And node sizes nowadays get up to 192 vCPUs.
A single node can scan through several gigabytes of data per second. When the column data is compressed through various algorithms, this means billions of rows / sec.
formalreconfirm · 5h ago
Someone correct me if I'm wrong, but from my understanding DuckDB will always be the query engine, so I suppose you will have access to DuckDB's query parallelism (single node but multithreaded, with disk spilling etc.) plus statistics-based optimizations like file pruning and predicate pushdown offered by DuckLake. I think DuckLake is heavily coupled to DuckDB (which is good for our use case). Again, this is my understanding, correct me if wrong.
memhole · 5h ago
From my perspective the issue is analytics support. You’ll need a step that turns it into something supported by BI tools. Obviously if something like Trino picks up the format it’s not an issue
anentropic · 3h ago
It seems to me that by publishing the spec, other non-DuckDB implementations could be built?
It's currently only DuckDB-specific because the initial implementation that supports this new catalog is a DuckDB extension.
spenczar5 · 5h ago
I agree with everything you said. I just mean that a single node may be slow when processing those parquet files in a complex aggregation, bottlenecked on network IO or CPU or available memory.
If the thesis here is that most datasets are small, fair enough - but then why use a lake instead of big postgres, yknow?
formalreconfirm · 4h ago
That's the part I don't really get. In the Manifesto they are talking about scaling to hundreds of terabytes and thousands of compute nodes. But DuckDB compute nodes, even if they are very performant, are in the end single nodes, so even if your lakehouse contains TBs of data, you will be limited by your biggest client's capacity (I know DuckDB works well with data bigger than memory, but still, I suppose it can reach limits at some point). In the end I think DuckLake is aimed at lakehouses of "reasonable" size the same way DuckDB is intended for data of "reasonable" size.
dkdcio · 4h ago
Huge "it depends", but typically organizations are not querying all of their data at once. Usually, they're processing it in some time-based increments.
Even if it's in the TB-range, we're at the point where high-spec laptops can handle it (my own benchmarking: https://ibis-project.org/posts/1tbc/). When I tried to go up to 10TB TPC-H queries on large cloud VMs I did hit some malloc (or other memory) issues, but that was a while ago and I imagine DuckDB can fly past that these days too. Single-node definitely has limits, but it's hard to see how 99%+ of organizations really need distributed computing in 2025.
BewareTheYiga · 3h ago
I am a huge fan of what they are doing, particularly putting local compute front and center. However, for “BigCorp” it’s going to be an uphill battle. The incumbents are entrenched and many decision makers will make decisions based on non-technical reasons (i.e. did my sales exec get me to the F1 Grand Prix).
TheGuyWhoCodes · 1h ago
Is there any information about updates to existing rows?
The FAQ says "Similarly to other data lakehouse technologies, DuckLake does not support constraints, keys, or indexes."
However, in Iceberg there are Copy-On-Write and Merge-On-Read strategies for dealing with updates.
szarnyasg · 1h ago
Yes - updates on existing rows are supported.
(I work at DuckDB Labs.)
TheGuyWhoCodes · 50m ago
Thanks szarnyasg.
If I've got you here, can you use the ducklake extension commands to get the parquet files for a query without running said query?
That way you could use another query engine while still using duckdb to handle the data mutation.
data_ders · 4h ago
the manifesto [1] is the most interesting thing. I agree that DuckDB has the largest potential to disrupt the current order with Iceberg.
However, this mostly reads to me as a thought experiment:
> what if the backend service of an Iceberg catalog was just a SQL database?
The manifesto says that maintaining a data lake catalog is easier, which I agree with in theory. s3-files-as-information-schema presents real challenges!
But, what I most want to know is what's the end-user benefit?
What does someone get with this if they're already using Apache Polaris or Lakekeeper as their Iceberg REST catalog?
[1]: https://ducklake.select/manifesto/
it adds for users the following features to a data lake:
- multi-statement & multi-table transactions
- SQL views
- delta queries
- encryption
- low latency: no S3 metadata & inlining: store small inserts in-catalog
and more!
tishj · 41m ago
One thing to add to this:
Snapshots can be retained (though rewritten) even through compaction.
In a format like Iceberg, the compaction that deletes the build-up of many small add/delete files also loses you the ability to time travel to those earlier states.
With DuckLake's ability to refer to parts of Parquet files, we can preserve the ability to time travel even after deleting the old Parquet files.
anentropic · 3h ago
They say it's faster, for one thing: it can resolve all metadata in a single query instead of multiple HTTP requests.
zhousun · 3h ago
Using SQL as the catalog is not new (Iceberg has supported a JDBC catalog from the very beginning).
The main difference is storing metadata and stats directly in SQL databases as well, which makes perfect sense for smaller-scale data. In fact we were doing something similar in https://github.com/Mooncake-Labs/pg_mooncake: metadata is stored in pg tables and only periodically flushed to actual formats like Iceberg.
a26z · 3h ago
How do I integrate DuckLake with Apache Spark? Is it a format or a catalog?
Same question for presto, trino, dremio, snowflake, bigquery, etc.
nxm · 3h ago
"DuckLake is also able to improve the two biggest performance problems of data lakes: small changes and many concurrent changes."
These I'd argue are not the natural use cases for a data lake, especially a design which uses multiple writers to a given table.
formalreconfirm · 5h ago
It looks very promising, especially knowing the DuckDB team is behind it. However, I really don't understand how to insert data into it. Are we supposed to use the DuckDB INSERT statement, with some function to read external files or any other data? Looks very cool though.
szarnyasg · 5h ago
Yes, you can use standard SQL constructs such as INSERT statements and COPY to load data into DuckLake.
(disclaimer: I work at DuckDB Labs)
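To make that concrete, a minimal sketch from Python, assuming the ATTACH syntax shown in the DuckLake docs (catalog name, paths, and table are placeholders):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# A DuckDB-file catalog next to a directory of Parquet files (names/paths made up).
con.sql("ATTACH 'ducklake:my_lake.ducklake' AS my_lake (DATA_PATH 'lake_files/')")
con.sql("USE my_lake")

con.sql("CREATE TABLE IF NOT EXISTS events (ts TIMESTAMP, source VARCHAR, value DOUBLE)")

# Plain INSERT ...
con.sql("INSERT INTO events VALUES (now(), 'web-01', 0.1)")

# ... or COPY existing Parquet in through the catalog, so the metadata gets updated.
con.sql("COPY events FROM 'staging/batch-0001.parquet' (FORMAT PARQUET)")

print(con.sql("SELECT count(*) FROM events").fetchall())
```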
formalreconfirm · 5h ago
Thank you for your work! We use DuckDB with dbt-duckdb in production (because on-prem and because we don't need ten thousand nodes) and we love it! About the COPY statement, does it mean we can drop Parquet files ourselves in the blob storage? From my understanding DuckLake was responsible for managing the files on the storage layer.
szarnyasg · 5h ago
Great!
> About the COPY statement, does it mean we can drop Parquet files ourselves in the blob storage?
Dropping the Parquet files on the blob storage will not work – you have to COPY them through DuckLake so that the catalog database is updated with the required catalog and metadata information.
Not exactly sure what it's for? Is it to stream your data to Parquet files on (e.g.) S3 and keep the exact schema at each point in time somewhere? Or is it something else?
Would be nice to have some tutorials/use cases in the docs :)
adastra22 · 5h ago
What is a data lake?
szarnyasg · 5h ago
The YouTube video “Apache Iceberg: What It Is and Why Everyone’s Talking About It” by Tim Berglund explains data lakes really well in the opening minutes: https://www.youtube.com/watch?v=TsmhRZElPvM
adastra22 · 3h ago
Thanks but I don’t have the time to watch YouTube.
dsp_person · 1h ago
He explains:
~40y ago the data warehouse was invented: an overnight ETL process would collect data from smaller dbs into a central db (the data warehouse).
~15y ago the data lake (i.e. Hadoop) emerged to address scaling and other things. Same idea but ELT instead of ETL: less focus on schema, collect the data into S3 and transform it later.
adastra22 · 21m ago
Thank you!
simlevesque · 1h ago
It's your db but on s3.
iampims · 6h ago
Great idea, poor naming. If you’re aiming for a standard of sorts, tying it to a specific piece of software by reusing its name feels counterproductive.
“Ducklake DuckDB extension” really rolls off the tongue /s.
rtyu1120 · 5h ago
Quite a bummer, particularly because the main selling point is that it can be utilized with any SQL database (iiuc).
formalreconfirm · 5h ago
If I understand the Manifesto correctly, the metadata db can be any SQL database, but the client needs to be DuckDB + the DuckLake extension, no?
crudbug · 5h ago
Good point. I think any DuckLake implementation for any SQL-compliant database will work.
Of course, the performance will depend on the database.