The VACUUM command with an INTO clause is an alternative to the backup API for generating backup copies of a live database. The advantage of using VACUUM INTO is that the resulting backup database is minimal in size and hence the amount of filesystem I/O may be reduced.
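For reference, it's a one-liner from the shell (filename and target path are placeholders):
    sqlite3 my_remote_database.db "VACUUM INTO '/tmp/backup.db'"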
nine_k · 3m ago
It's cool but it does not address the issue of indexes, mentioned in the original post. Not carrying index data over the slow link was the key idea. The VACUUM INTO approach keeps indexes.
A text file may be inefficient as is, but it's perfectly compressible, even with primitive tools like gzip. I'm not sure the SQLite binary format compresses equally well, though it might.
gwbas1c · 10m ago
Does that preserve the indexes? As the TFA mentioned, the indexes are why the sqlite files are huge.
bambax · 1h ago
> If it takes a long time to copy a database and it gets updated midway through, rsync may give me an invalid database file. The first half of the file is pre-update, the second half file is post-update, and they don’t match. When I try to open the database locally, I get an error
Of course! You can't copy the file of a running, active db receiving updates, that can only result in corruption.
For replicating sqlite databases safely there is
https://github.com/benbjohnson/litestream
> You can't copy the file of a running, active db receiving updates, that can only result in corruption
To push back against "only" -- there is actually one scenario where this works. Copying a file or a subvolume on Btrfs or ZFS can be done atomically, so if it's an ACID database or an LSM tree, in the worst case it will just roll back. Of course, if it's multiple files you have to take care to wrap them in a subvolume so that all of them are copied in the same transaction; simply using `cp --reflink=always` won't do.
Possibly freezing the process with SIGSTOP would yield the same result, but I wouldn't count on that
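A rough sketch of the snapshot-then-copy idea on Btrfs (paths made up; assumes the db lives in its own subvolume):
    # atomic, read-only snapshot of the subvolume holding the db
    btrfs subvolume snapshot -r /srv/app /srv/app-snap
    # copy from the frozen snapshot at leisure
    rsync -a /srv/app-snap/ user@remote:/backup/app/
    # drop the snapshot when done
    btrfs subvolume delete /srv/app-snap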
lmz · 51m ago
It can't be done without fs specific snapshots - otherwise how would it distinguish between a cp/rsync needing consistent reads vs another sqlite client wanting the newest data?
o11c · 9m ago
Obligatory "LVM still exists and snapshots are easy enough to overprovision for"
wswope · 47m ago
The built-in .backup command is also intended as an official tool for making “snapshotted” versions of a live db that can be copied around.
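The .backup dot-command drives the online backup API, so the copy stays consistent even if writers are active; e.g. (filenames made up):
    sqlite3 live.db ".backup snapshot.db"
    rsync -avz snapshot.db user@remote:/backups/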
yellow_lead · 1h ago
Litestream looks interesting, but it's still in beta and doesn't seem to have had a release in over a year, although SQLite doesn't move that quickly.
Is Litestream still an active project?
pixl97 · 1h ago
>You can't copy the file of a running, active db receiving updates, that can only result in corruption
There is a slight 'well akshully' on this. A DB flush and FS snapshot where you copy the snapshotted file will allow this. MSSQL VSS snapshots would be an example of this.
tpmoney · 50m ago
Similarly you can rsync a Postgres data directory safely while the db is running, with the caveat that you likely lose any data written while the rsync is running. And if you want that data, you can get it with the WAL files.
It’s been years since I needed to do this, but if I remember right, you can clone an entire pg db live with a `pg_backup_start()`, rsync the data directory, pg_backup_stop() and rsync the WAL files written since backup start.
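If anyone wants the shape of it, roughly this (untested sketch; paths and version are made up, and on Postgres 15+ the start/stop calls must happen in the same session, hence driving rsync from inside psql -- older releases spell it pg_start_backup/pg_stop_backup):
    psql -d mydb <<'SQL'
    SELECT pg_backup_start('rsync-clone');
    \! rsync -a --exclude=pg_wal /var/lib/postgresql/16/main/ standby:/var/lib/postgresql/16/main/
    -- pg_backup_stop() also returns the backup_label contents you'd need to drop into the copy
    SELECT * FROM pg_backup_stop();
    SQL
    # then ship the WAL written since the backup started
    rsync -a /var/lib/postgresql/16/main/pg_wal/ standby:/var/lib/postgresql/16/main/pg_wal/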
jmull · 45m ago
If the corruption is detectable and infrequent enough for your purposes, then it does work, with a simple “retry until success” loop. (That’s how TCP works, for example.)
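In shell terms that's roughly (sketch; filenames made up, and "PRAGMA integrity_check" prints "ok" when the copy is sound):
    until rsync -avz user@server:my_remote_database.db my_local_database.db &&
          sqlite3 my_local_database.db "PRAGMA integrity_check;" | grep -qx ok
    do
        echo "got a torn copy, retrying..."
        sleep 5
    done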
quotemstr · 1h ago
> Of course! You can't copy the file of a running, active db receiving updates, that can only result in corruption
Do people really not understand how file storage works? I cannot rightly apprehend the confusion of ideas that would produce an attempt to copy a volatile database without synchronization and expect it to work.
kccqzy · 27m ago
The confusion of ideas here is understandable IMO: people assume everything is atomic. Databases of course famously have ACID guarantees. But it's easy for people to assume copying is also an atomic operation. Honestly if someone works too much on databases and not enough with filesystems it's a mistake easily made.
ahazred8ta · 23m ago
> I cannot rightly apprehend the confusion of ideas
I see you are a man of culture.
kccqzy · 16m ago
Charles Babbage is smart, but either he lacks empathy to understand other people or he's just saying that deliberately for comedic effect.
zeroq · 2h ago
How to copy databases between computers? Just send a circle and forget about the rest of the owl.
As others have mentioned an incremental rsync would be much faster, but what bothers me the most is that he claims that sending SQL statements is faster than sending the database while COMPLETELY omitting the fact that you have to execute these statements. And then run /optimize/. And then run /vacuum/.
Currently I have a scenario in which I have to "incrementally rebuild *" a database from CSV files. In my particular case recreating the database from scratch is the better option, but despite heavy optimization it still takes half an hour just to run the batch inserts on an empty in-memory database, create the indexes, etc.
It's a very good writeup on how to do fast inserts in sqlite3
zeroq · 46m ago
Yes! That was actually quite helpful.
For my use case (recreating in-memory from scratch) it basically boils down to three points: (1) journal_mode = off (2) wrapping all inserts in a single transaction (3) indexes after inserts.
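A minimal sqlite3-CLI sketch of those three points (table/file names made up):
    sqlite3 new.db <<'SQL'
    PRAGMA journal_mode = OFF;        -- (1) no journal during the rebuild
    CREATE TABLE rel(a INTEGER, b INTEGER);
    BEGIN;                            -- (2) one big transaction
    .mode csv
    .import rel.csv rel
    COMMIT;
    CREATE INDEX rel_a ON rel(a);     -- (3) indexes only after the bulk load
    SQL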
For whatever it's worth I'm getting 15M inserts per minute on average, topping out around 450k/s for a trivial relationship table, on a stock Ryzen 5900X using the built-in sqlite from NodeJS.
jgalt212 · 1h ago
yes, but they punt on this issue:
CREATE INDEX then INSERT vs. INSERT then CREATE INDEX
i.e. they only time INSERTs, not the CREATE INDEX after all the INSERTs.
JamesonNetworks · 2h ago
30 minutes seems long. Is there a lot of data? I've been working on bootstrapping sqlite dbs off of lots of json data, and by holding a list of values and inserting 10k rows at a time, I've found a good perf sweet spot where I can insert plenty of rows (millions) in minutes. I had to use some tricks with bloom filters and LRU caching, but can build a 6 gig db in like 20ish minutes now.
zeroq · 54m ago
It's roughly 10GB across several CSV files.
I create a new in-mem db, run the schema and then import every table in one single transaction (in my testing it didn't matter whether it's a single batch or multiple single inserts, as long as they are part of a single transaction).
I do a single string replacement per every CSV line to handle an edge case. This results in roughly 15 million inserts per minute (give or take, depending on table length and complexity). 450k inserts per second is a magic barrier I can't break.
I then run several queries to remove unwanted data, trim orphans, add indexes, and finally run optimize and vacuum.
Millions of rows in minutes doesn't sound OK, unless your tables have a large number of columns. A good rule of thumb is that SQLite's insertion performance should be at least 1% of your disk's sustained max write bandwidth; preferably 5% or more. On the last bulk table insert I did, I was seeing 20%+ sustained; that came to ~900k inserts/second for an 8-column INT table (small integers).
pessimizer · 2h ago
Saying that 30 minutes seems long is like saying that 5 miles seems far.
hundredwatt · 1h ago
The recently released sqlite3_rsync utility uses a version of the rsync algorithm optimized to work on the internal structure of a SQLite database. It compares the internal data pages efficiently, then only syncs changed or missing pages.
Nice tricks in the article, but you can more easily use the builtin utility now :)
I blogged about how it works in detail here: https://nochlin.com/blog/how-the-new-sqlite3_rsync-utility-w...
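Usage is a single command along the lines of (paths made up; the binary has to exist on both machines and it talks over ssh):
    sqlite3_rsync user@server:/data/my_database.db my_local_database.db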
sqlite3_rsync can only be used in WAL mode. A further constraint of WAL mode is that the database file must be stored on local disk. Clearly, you'd want that almost all the time anyway, but for the times it's not possible this utility won't work.
mromanuk · 3h ago
I was surprised that he didn't try the in-flight compression provided by rsync:
-z, --compress compress file data during the transfer
--compress-level=NUM explicitly set compression level
It's probably faster to compress with gzip first and transfer afterwards, but it's nice to have the option of improving the transfer with just a flag.
sqlite transaction- and WAL-aware rsync with inflight compression.
crazygringo · 2h ago
The main point is to skip the indices, which you have to do pre-compression.
When I do stuff like this, I stream the dump straight into gzip. (You can usually figure out a way to stream directly to the destination without an intermediate file at all.)
Plus this way it stays stored compressed at its destination. If your purpose is backup rather than a poor man's replication.
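e.g. something along these lines (host/paths made up), which never puts an uncompressed dump on disk on either side:
    ssh user@server "sqlite3 my_remote_database.db .dump | gzip -c" \
        > my_database-$(date +%F).sql.gz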
worldsavior · 2h ago
I believe compression is only worthwhile on slow networks.
PhilipRoman · 1h ago
It would have to be one really fast network... zstd compresses and decompresses at 5+ GB (bytes, not bits) per second.
worldsavior · 3m ago
Where are you getting this performance? On an average computer that's nowhere near the speed you'd see.
cogman10 · 1h ago
Is the network only doing an rsync? Then you are probably right.
For every other network, you should compress as you are likely dealing with multiple tenants that would all like a piece of your 40Gbps bandwidth.
worldsavior · 1m ago
By that logic you should not compress either, as multiple tenants would all like a piece of your CPU.
creatonez · 20m ago
What? Compression is absolutely essential throughout computing as a whole, especially as CPUs have gotten faster. If you have compressible data sent over the network (or even on disk / in RAM) there's a good chance you should be compressing it. Faster links have not undercut this reality in any significant way.
bityard · 11m ago
Whether or not to compress data before transfer is VERY situationally dependent. I have seen it go both ways, and the real-world results do not always match intuition. At the end of the day, if you care about performance, you still have to do proper testing.
(This is the same spiel I give whenever someone says swap on Linux is or is not always beneficial.)
berbec · 2h ago
Valve tends to take a different view...
rollcat · 2h ago
Depends. Run a benchmark on your own hardware/network. ZFS uses in-flight compression because CPUs are generally faster than disks. That may or may not be the case for your setup.
berbec · 2h ago
or used --remove-source-files so they didn't have to ssh back to rm
Jyaif · 1h ago
He absolutely should be doing this, because by using rsync on a compressed file he's throwing away the whole point of rsync, which is the rolling-checksum-based algorithm that lets it transfer only the diffs.
simlevesque · 3h ago
In DuckDB you can do the same but export to Parquet, this way the data is an order of magnitude smaller than using text-based SQL statements. It's faster to transfer and faster to load.
duckdb -c "attach 'sqlite-database.db' as db; copy db.table_name to 'table_name.parquet' (format parquet, compression zstd)"
in my test database this is about 20% smaller than the gzipped text SQL statements.
simlevesque · 2h ago
That's not it. This only exports the table's data, not the database. You lose the index, comments, schemas, partitioning, etc... The whole point of OP's article is how to export the indices in an efficient way.
Also I wonder how big your test database is and what its schema looks like. For large tables Parquet is way more efficient than a 20% reduction.
If there are UUIDs, they're 36 bytes each in text mode and 16 bytes as binary in Parquet. And then if they repeat you can use a dictionary in your Parquet to store the 16 bytes only once.
It's also worth trying brotli instead of zstd if small files are your goal.
RenThraysk · 2h ago
SQLite has a session extension, which will track changes to a set of tables and produce a changeset/patchset which can patch a previous version of an SQLite database.
https://www.sqlite.org/sessionintro.html
I have yet to see a single SQLite binding supporting this, so it’s quite useless unless you’re writing your application in C, or are open to patching the language binding.
In one of my projects I have implemented my own poor man’s session by writing all the statements and parameters into a separate database, then sync that and replay. Works well enough for a ~30GB database that changes by ~0.1% every day.
https://github.com/crawshaw/sqlite
https://github.com/eatonphil/gosqlite/
Ended up with the latter, but did have to add one function binding in C, to inspect changesets.
simonw · 1h ago
Have you used that? I've read the documentation but I don't think I've ever heard from anyone who uses the extension.
RenThraysk · 1h ago
I have, at least to confirm it does what it says on the tin.
Idea for an offline-first app, where each app install can pull a changeset and apply it to its local db.
gwbas1c · 7m ago
I wonder if there's a way to export to parquet files? They are designed to be extremely compact.
rarrrrrr · 2h ago
If you're regularly syncing from an older version to a newer version, you can likely optimize further using gzip with the "--rsyncable" option. It will reduce the compression by ~1% but make it so differences from one version to the next are localized instead of cascading through the full length of the compressed output.
Another alternative is to skip compression of the dump output, let rsync calculate the differences from a previous uncompressed dump to the current dump, then have rsync compress the change sets it sends over the network. (rsync -z)
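Concretely, the first variant would look something like (names made up):
    # on the server: dump with rsync-friendly block boundaries
    sqlite3 my_remote_database.db .dump | gzip --rsyncable > my_database.sql.gz
    # on the client: repeat runs only transfer the changed blocks
    rsync -avz user@server:my_database.sql.gz .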
isn't this rather obvious? doesn't everyone do this when it makes sense? obviously, it applies to other DBs, and you don't even need to store the file (just a single ssh from dumper to remote undumper).
if retaining the snapshot file is of value, great.
I'd be a tiny bit surprised if rsync could recognize diffs in the dump, but it's certainly possible, assuming the dumper is "stable" (it probably is, because it's walking the tables as trees). The amount of change detected by rsync might actually be a useful thing to monitor.
ozim · 44m ago
I guess for me it is obvious that you don't try to copy a running DB, only a backup.
So I see basic stuff needs to be repeated as people still miss those kinds of things.
But I learned that you can easily dump SQLite to a text file - neat!
Levitating · 2h ago
I am sure you can just pipe all this so you don't have to use an intermediate gzipped file.
Just ssh the machine, dump the SQL and load it back into SQLite locally.
rollcat · 2h ago
rsync will transmit only the delta between the source and destination.
wang_li · 33m ago
I've seen a suggestion several times to compress the data before sending. If remote means in the same data center, there's a good chance compressing the data is just slowing you down. Not many machines can gzip/bzip2/7zip at better than the 1 gigabyte per second you can get from 10 Gbps networks.
nottorp · 1h ago
Wait... why would you even think about rsyncing a database that can get changed while being copied?
Isn't this a case for proper database servers with replication?
Or if it's an infrequent process done for dev purposes just shut down the application doing writes on the other side?
bluefirebrand · 13m ago
All this obsession with making processes like this faster
When is a guy supposed to get a coffee and stretch his legs anymore?
actinium226 · 1h ago
I have recently discovered a tool called mscp which opens multiple scp threads to copy down large files. It works great for speeding up these sorts of downloads.
https://github.com/upa/mscp
ssh username@server "sqlite3 my_remote_database.db .dump | gzip -c" | gunzip -c | sqlite3 my_local_database.db
gzip/gunzip might also be redundant if using ssh compression with -oCompression=on or -C on the ssh call.
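i.e. drop the gzip/gunzip pair and let ssh compress the stream, roughly:
    ssh -C user@server "sqlite3 my_remote_database.db .dump" | sqlite3 my_local_database.db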
Cthulhu_ · 2h ago
I used to work at a company that had a management interface that used sqlite as its database; its multi-node / failover approach was also just... copying the file and rsyncing it. I did wonder about data integrity though: what if the file is edited while it's being copied over? But there's probably safeguards in place.
Anyway I don't think the database file size was really an issue, it was a relatively big schema but not many indices and performance wasn't a big consideration - hence why the backend would concatenate query results into an XML file, then pass it through an xml->json converter, causing 1-2 second response times on most requests. I worked on a rewrite using Go where requests were more like 10-15 milliseconds.
But, I still used sqlite because that was actually a pretty good solution for the problem at hand; relatively low concurrency (up to 10 active simultaneous users), no server-side dependencies or installation needed, etc.
rollcat · 2h ago
SQLite has a write-ahead log (WAL). You can use Litestream on top of that. You get single RW, multiple readers (you lose the C in CAP), and can promote a reader when the writer fails.
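Getting started is fairly minimal, something like this (bucket and paths made up):
    # continuously replicate the db's WAL frames to object storage
    litestream replicate /var/lib/app/app.db s3://my-backup-bucket/app.db
    # later, rebuild the database from the replica
    litestream restore -o /var/lib/app/app.db s3://my-backup-bucket/app.db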
MPSimmons · 2h ago
>I did wonder about data integrity though, what if the file is edited while it's being copied over? But there's probably safeguards in place.
You could do a filesystem snapshot and copy from that, but neither a cp nor an rsync is atomic.
formerly_proven · 2h ago
sqlite3 has a backup API for this, which you can invoke using the .backup command in the sqlite3 CLI.
And then there is also https://www.sqlite.org/rsync.html
That makes zero sense. Incremental backup via rsync/sqlite3_rsync should always be faster.
Retr0id · 1h ago
For incremental backups sure, but I think OP's solution would win for one-off snapshots.
npn · 1h ago
???
Why not just compress the whole database using `gzip` or `lz4` before rsyncing it instead? `zstd` works too, but it seems like it had a bug regarding compressing files with modified content.
Better yet, split your sqlite file into smaller pieces; it's not like it needs to contain all the app data in a single sqlite file.
404mm · 46m ago
zstd would be a better choice. It’s bonkers fast (especially when used with multithreading) and still compresses better than gzip. Alternatively, I’d recommend looking into bzip3, but I’m not sure if it would save time.
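e.g. (sketch; filenames made up, -T0 lets zstd use all cores for compression):
    zstd -T0 my_remote_database.db -o my_database.db.zst
    rsync -av my_database.db.zst user@backup-host:/backups/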
forinti · 2h ago
One of the coolest things you can do with Postgresql is pipe pg_dump straight into psql connected to another cluster on another host.
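i.e. no intermediate file at all (host and db names made up):
    pg_dump -h old-host mydb | psql -h new-host mydb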
actinium226 · 50m ago
I recently set up some scripts to do this and it wasn't quite as simple as I had hoped. I had to pass some extra flags to pg_restore for --no-owner --no-acl, and then it still had issues when the target db has data in it, even with --clean and --create. And sometimes it would leave me in a state where it dropped the database and had trouble restoring, and so I'd be totally empty.
What I ended up doing is creating a new database, pg_restore'ing into that one with --no-owner and --no-acl, forcibly dropping the old database, and then renaming the new to the old one's name. This has the benefit of not leaving me high and dry should there be an issue with restoring.
d34th0rl1f3 · 3h ago
You can save time by using `zcat` instead of `cat` and skip the `gunzip my_local_database.db.txt.gz` step.
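i.e.:
    zcat my_local_database.db.txt.gz | sqlite3 my_local_database.db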
Slasher1337 · 2h ago
Why would anyone use gzip instead of zstd in 2025? zstd is superior in every dimension.
gzip is a legacy algorithm that imo only gets used for compatibility with legacy software that understands nothing but gzip.
_joel · 2h ago
You could speed up by using pigz (parallel gzip) too.
masklinn · 2h ago
If you're going to use a less universal tool for compression you might as well go with zstd.
pvorb · 3h ago
How long does this procedure take in comparison to the network transfer?
My first try would've been to copy the db file first, gzip it and then transfer it but I can't tell whether compression will be that useful in binary format.
MPSimmons · 2h ago
The sqlite file format (https://www.sqlite.org/fileformat.html) does not talk about compression, so I would wager unless you are storing already compressed content (media maybe?) or random numbers (encrypted data), it should compress reasonably well.
chasil · 2h ago
Native compression in sqlite is offered as a closed and licensed extension.
https://sqlite.org/com/cerod.html
Pretty good point. I just wonder if databases in general can be perfectly reconstructed from a text dump. For instance, does the insertion order change in any of the operations between dumping and importing?
cwmma · 2h ago
How well does just the sqlite database gzip? The indexes are a lot of redundant data, so you're going to get some efficiencies there; probably less locality of data than the text file though, so maybe less?
jbverschoor · 2h ago
I'm curious how your indices are twice the data. Sounds like you just put indices on anything you see.
crazygringo · 2h ago
I definitely have databases like this.
It's not carelessness, it's performance.
Quite simply, I have a table with 4 columns -- A, B, C, D. Each column is just an 8-byte integer. It has hundreds of millions of rows. It has an index on B+C+D, an index on C+D, and one on D.
All of these are required because the user needs to be able to retrieve aggregate data based on range conditions around lots of combinations of the columns. Without all the indices, certain queries take a couple minutes. With them, each query takes milliseconds to a couple seconds.
I thought of every possible way to avoid having all three indices, but it just wasn't possible. It's just how performant data lookup works.
You shouldn't assume people are being careless with indices. Far too often I see the opposite.
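For concreteness, the layout described above is roughly (table/index names made up):
    CREATE TABLE samples (a INTEGER, b INTEGER, c INTEGER, d INTEGER);
    -- hundreds of millions of rows; each extra index adds a large fraction of the table's size on disk
    CREATE INDEX samples_bcd ON samples (b, c, d);
    CREATE INDEX samples_cd  ON samples (c, d);
    CREATE INDEX samples_d   ON samples (d);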
hobs · 2h ago
Hah they need to try harder then, I have seen more than 20x the data volume in systems where people are both paranoid and ignorant, a dangerous combo!
ukuina · 2h ago
Doesn't this just push the runtime into index recomputation on the destination database?
masklinn · 1h ago
Yes, however they seem to have a pretty slow internet connection
> Downloading a 250MB database from my web server takes about a minute over my home Internet connection
So for the original 3.4GB database that's nearly 15 minutes of waiting for the download.
yapyap · 43m ago
Very neat walkthrough, clear commands, and I appreciate the explanations as to why this may help in OP's case.
iambear · 2h ago
I usually use scp for this case; sometimes the rsync versions on the two machines aren't compatible.
dundundundun · 3h ago
This is basically the way every other database is moved around.