> The checksums in WAL are likely not meant to check for random page corruption in the middle; maybe they’re just to check if the last write of a frame was fsynced properly or not?
This is the correct explanation. The purpose is to detect partial writes, not to detect arbitrary data corruption. If detecting corruption was the goal, then checksumming the WAL without also checksumming the database itself would be fairly pointless.
In fact, it's not accurate to say "SQLite does not do checksums by default, but it has checksums in WAL mode." SQLite always uses checksums for its journal, regardless of whether that's a rollback journal or a write-ahead log. [1]
For the purpose of tolerating and recovering from crashes/power failures, writes to the database file itself are effectively idempotent. It doesn't matter if only a subset of the DB writes are persisted before a crash, and you don't need to know which ones succeeded, because you can just roll all of them forward or backward (depending on the mode). But for the journal itself, distinguishing partial journal entries from complete ones matters.
No matter what order the disk physically writes out pages, the instant when the checksum matches the data is the instant at which the transaction can be unambiguously said to commit.
[1]: https://www.sqlite.org/fileformat.html
Imagine the power goes out while sqlite is in the middle of writing a transaction to the WAL (before the write has been confirmed to the application). What do you want to happen when power comes back, and you reload the database?
If the transaction was fully written, then you'd probably like to keep it. But if it was not complete, you want to roll it back.
How does sqlite know if the transaction was complete? It needs to see two things:
1. The transaction ends with a commit frame, indicating the application did in fact perform a `COMMIT TRANSACTION`.
2. All the checksums are correct, indicating the data was fully synced to disk when it was committed.
If the checksums are wrong, the assumption is that the transaction wasn't fully written out. Therefore, it should be rolled back. That's exactly what sqlite does.
This is not "data loss", because the transaction was not ever fully committed. The power failure happened before the commit was confirmed to the application, so there's no way anyone should have expected that the transaction is durable.
The checksum is NOT intended to detect when the data was corrupted by some other means, like damage to the disk or a buggy app overwriting bytes. Myriad other mechanisms should be protecting against those already, and sqlite is assuming those other mechanisms are working, because if not, there's very little sqlite can do about it.
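To make the recovery rule above concrete, here is a minimal Python sketch (my own illustration, not SQLite's code): frames are modeled as (data, is_commit, stored_checksum) tuples, and zlib.crc32 stands in for SQLite's cumulative Fletcher-style checksum.

    import zlib

    def chain_checksum(prev, data):
        # Stand-in for SQLite's cumulative checksum: each frame's checksum is
        # seeded with the previous frame's value, so frames form a chain.
        return zlib.crc32(data, prev)

    def frames_to_replay(frames, salt):
        # frames: iterable of (data: bytes, is_commit: bool, stored: int)
        valid, pending, running = [], [], salt
        for data, is_commit, stored in frames:
            running = chain_checksum(running, data)
            if running != stored:
                break                    # mismatch: drop this frame and everything after it
            pending.append(data)
            if is_commit:
                valid.extend(pending)    # a commit frame completes a transaction
                pending = []
        return valid                     # trailing frames without a commit are ignored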
malone · 1d ago
Why is the commit frame not sufficient to determine whether the transaction was fully written or not? Is there a scenario where the commit frame is fsynced to disk but the preceding data isn't?
adambb · 1d ago
The disk controller may decide to write out blocks in a different order than the logical layout in the log file itself, and be interrupted before completing this work.
grumbelbart2 · 9h ago
Just wondering how SQLite would ever work if it had zero control over this. Surely there must be some "flush" operation that guarantees that everything so far is written to disk? Otherwise, any "old" block that contains data might have not been written. SQLite says:
> Local devices also have a characteristic which is critical for enabling database management software to be designed to ensure ACID behavior: When all process writes to the device have completed, (when POSIX fsync() or Windows FlushFileBuffers() calls return), the filesystem then either has stored the "written" data or will do so before storing any subsequently written data.
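For illustration only, this is the kind of write barrier that quoted guarantee enables (a generic journal-style sketch in Python, not SQLite's actual implementation):

    import os

    def append_with_commit(path, payload, commit_marker=b"\xC0"):
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, payload)
            os.fsync(fd)              # barrier: payload is durable before the commit record
            os.write(fd, commit_marker)
            os.fsync(fd)              # barrier: the commit record itself is durable
        finally:
            os.close(fd)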
mschuster91 · 9h ago
A "flush" command does indeed exist... but disk and controller vendors are like patients in Dr. House [1] - everybody lies. Especially if there are benchmarks to be "optimized". Other people here have written up that better than I ever could [2].
It’s worth noting this is also dependent on filesystem behavior; most that do copy-on-write will not suffer from this issue regardless of drive behavior, even if they don’t do their own checksumming.
hinkley · 1d ago
We still have the elevator algorithm on NVMe?
jrockway · 1d ago
NVMe drives do their own manipulation of the datastream. Wear leveling, GC, trying to avoid rewriting an entire block for your 1 bit change, etc. NVMe drives have CPUs and RAM for this purpose; they are full computers with a little bit of flash memory attached. And no, of course they're not open source even though they have full access to your system.
djfivyvusn · 10h ago
Skynet gotta start somewhere.
bob1029 · 1d ago
Anything that uses NAND storage technology is going to be optimized in some way like this. NVMe is just the messenger.
lxgr · 1d ago
SQLite runs on anything from servers to Internet-connected lightbulbs.
jrockway · 1d ago
Which lightbulbs include SQLite? I kind of want one.
natebc · 1d ago
these guys have a Cree logo on their homepage so maybe Cree?
https://imaginovation.net/case-study/cree/
At least what I could turn up with a quick web search.
hinkley · 1d ago
For instance, running on ZFS or one of its peers.
zaarn · 1d ago
ZFS isn’t viable for SQLite unless you turn off fsyncs in ZFS, because otherwise you will have the same experience I had for years: SQLite may randomly hang for up to a few minutes with no visible cause if there aren’t enough writes to fill up a txg in the background. If your app depends on SQLite, it’ll randomly die.
Btrfs is a better choice for sqlite, haven’t seen that issue there.
The latest comment seems to be a nice summary of the root cause, with earlier in the thread pointing to ftruncate instead of fsync being a trigger:
>amotin
>I see. So ZFS tries to drop some data from pagecache, but there seems to be some dirty pages, which are held by ZFS till them either written into ZIL, or to disk at the end of TXG. And if those dirty page writes were asynchronous, it seems there is nothing that would nudge ZFS to actually do something about it earlier than zfs_txg_timeout. Somewhat similar problem was recently spotted on FreeBSD after #17445, which is why newer version of the code in #17533 does not keep references on asynchronously written pages.
Might be worth testing zfs_txg_timeout=1 or 0
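If you want to experiment with that, a small sketch assuming Linux OpenZFS, where the tunable is exposed as a module parameter (value in seconds, root required):

    from pathlib import Path

    param = Path("/sys/module/zfs/parameters/zfs_txg_timeout")
    print("current:", param.read_text().strip())
    param.write_text("1")   # takes effect immediately; persist via /etc/modprobe.d if it helps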
jclulow · 12h ago
This isn't an inherent property of ZFS at all. I have made heavy use of SQLite for years (on illumos systems) without ever hitting this, and I would never counsel anybody to disable sync writes: it absolutely can lead to data loss under some conditions and is not safe to do unless you understand what it means.
What you're describing sounds like a bug specific to whichever OS you're using that has a port of ZFS.
zaarn · 10h ago
I wouldn't recommend SQLite on ZFS (or in general for other reasons), for the precise reason that it either lags or is unsafe.
I've encountered this bug both on illumos, specifically OpenIndiana, and Linux (Arch Linux).
throw0101b · 1d ago
> ZFS isn’t viable for SQLite unless you turn off fsync’s in ZFS
Which you can do on a per dataset ('directory') basis very easily:
* https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...
Meanwhile all the rest of your pools / datasets can keep the default POSIX behaviour.
zaarn · 13h ago
Disabling sync corrupts SQLite databases on powerloss, I've personally experienced this following disabling sync because it causes SQLite to hang.
You cannot have SQLite keep your data and run well on ZFS unless you make a zvol and format it as btrfs or ext4 so they solve the problem for you.
ezekiel68 · 1d ago
You know what's even easier than doing that? Neglecting to do it or meaning to do it then getting pulled in to some meeting (or other important distraction) and then imagining you did it.
throw0101b · 1d ago
> Neglecting to do it or meaning to do it then getting pulled in to some meeting (or other important distraction) and then imagining you did it.
If your job is to make sure your file system and your database—SQLite, Pg, My/MariaDB, etc—are tuned together, and you don't tune it, then you should be called into a meeting. Or at least the no-fault RCA should bring up remediation methods to make sure it's part of the SOP so that it won't happen again.
The alternative the GP suggests is using Btrfs, which I find even more irresponsible than your non-tuning situation. (Heck, if someone on my sysadmin team suggested we start using Btrfs for anything I would think they were going senile.)
johncolanduoni · 23h ago
Facebook is apparently using it at scale, which surprised me. Though that’s not necessarily an endorsement, and who knows what their kernel patchset looks like.
kentonv · 23h ago
Doesn't turning off sync mean you can lose confirmed writes in a power failure?
jandrewrogers · 1d ago
Apropos this use case, ZFS is usually not recommended for databases. Competent database storage engines have their own strong corruption detection mechanisms regardless. What filesystems in the wild typically provide for this is weaker than what is advisable for a database, so databases should bring their own implementation.
tetha · 1d ago
Hm.
On the other hand, I've heard people recommend running Postgres on ZFS so you can enable on the fly compression. This increases CPU utilization on the postgres server by quite a bit, read latency of uncached data a bit, but it decreases necessary write IOPS a lot. And as long as the compression is happening a lot in parallel (which it should, if your database has many parallel queries), it's much easier to throw more compute threads at it than to speed up the write-speed of a drive.
And after a certain size, you start to need atomic filesystem snapshots to be able to get a backup of a very large and busy database without everything exploding. We already have the more efficient backup strategies from replicas struggling on some systems and are at our wits' end how to create proper backups and archives without reducing the backup frequency to weeks. ZFS has mature mechanisms and zfs-send to move this data around with limited impact to the production dataflow.
supriyo-biswas · 1d ago
Is an incremental backup of the database not possible? Pgbackrest etc. can do this by creating a full backup followed by incremental backups from the WAL.
For Postgres specifically you may also want to look at using hot_standby_feedback, as described in this recent HN article: https://news.ycombinator.com/item?id=44633933
On the big product clusters, we have incremental pgbackrest backups running for 20 minutes. Full backups take something between 12 - 16 hours. All of this from a sync standby managed by patroni. Archiving all of that takes 8 - 12 hours. It's a couple of terabytes on noncompressible data that needs to move. It's fine though, because this is an append-log-style dataset and we can take our time backing this up.
We also have decently sized clusters with very active data on them, and rather spicy recovery targets. On some of them, a full backup from the sync standby takes 4 hours, we need to pull an incremental backup at most 2 hours afterwards, but the long-term archiving process needs 2-3 hours to move the full backup to the archive. This is the first point in which filesystem snapshots, admittedly, of the pgbackrest repo, become necessary to adhere to SLOs as well as system function.
We do all of the high-complexity, high-throughput things recommended by postgres, and it's barely enough on the big systems. These things are getting to the point of needing a lot more storage and network bandwidth.
hinkley · 23h ago
This was my understanding as well, color me also confused.
bad news is, most databases don't do checksums by default.
EDIT: It seems they're opt-in for PostgreSQL, too: https://www.postgresql.org/docs/current/checksums.html
lxgr · 1d ago
This is in fact good news.
Redundantly performing the same performance-intensive tasks on multiple layers makes latency less predictable and just generally wastes resources.
jandrewrogers · 1d ago
Actually bad news. Most popular filesystems and filesystem configurations have limited and/or weak checksums, certainly much worse than you'd want for a database. 16-bit and 32-bit CRCs are common in filesystems.
This is a major reason databases implement their own checksums. Unfortunately, many open source databases have weak or non-existent checksums too. It is sort of an indefensible oversight.
fc417fc802 · 15h ago
Assuming that you expect corruption to be exceedingly rare what's wrong with a 1 in 2^16 or 1 in 2^32 failure rate? That's 4 9s and 9 9s respectively for detecting an event that you hardly expect to happen in the first place.
At 32 bits you're well into the realm of tail risks which include things like massive solar flares or the data center itself being flattened in an explosion or natural disaster.
Edit: I just checked a local drive for concrete numbers. It's part of a btrfs array. Relevant statistics since it was added are 11k power on hours, 24 TiB written, 108 TiB read, and 32 corruption events at the fs level (all attributable to the same power failure, no corruption before or since). I can't be bothered to compute the exact number but at absolute minimum it will be multiple decades of operation before I would expect even a single corruption event to go unnoticed. I'm fairly certain that my house is far more likely to burn down in that time frame.
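A rough back-of-envelope with those numbers, assuming each detected corruption event would independently have a 1-in-2^n chance of slipping past an n-bit checksum:

    HOURS, EVENTS = 11_000, 32
    for bits in (16, 32):
        expected_misses = EVENTS / 2**bits          # undetected events expected so far
        years_per_miss = HOURS / expected_misses / 8_760
        print(f"{bits}-bit: one undetected event expected per ~{years_per_miss:,.0f} years")
    # roughly 2,600 years for a 16-bit check and ~170 million years for a 32-bit one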
lxgr · 1h ago
> Most popular filesystems and filesystem configurations have limited and/or weak checksums,
Because filesystems, too, mainly use them to detect inconsistencies introduced by partial or reordered writes, not random bit flips. That's also why most file systems only have them on metadata, not data.
hawk_ · 22h ago
So when checksums are enabled and the DB process restarts or the host reboots does the DB run the checksum over all the stored data? Sounds like it would take forever for the database to come online. But if it doesn't it may not detect bitrot in time...?
TheDong · 1d ago
> ZFS is usually not recommended for databases
Say more? I've heard people say that ZFS is somewhat slower than, say, ext4, but I've personally had zero issues running postgres on zfs, nor have I heard any well-reasoned reasons not to.
> What filesystems in the wild typically provide for this is weaker than what is advisable for a database, so databases should bring their own implementation.
Sorry, what? Just yesterday matrix.org had a post about how they (using ext4 + postgres) had disk corruption which led to postgres returning garbage data: https://matrix.org/blog/2025/07/postgres-corruption-postmort...
The corruption was likely present for months or years, and postgres didn't notice.
ZFS, on the other hand, would have noticed during a weekly scrub and complained loudly, letting you know a disk had an error, letting you attempt to repair it if you used RAID, etc.
It's stuff like in that post that are exactly why I run postgres on ZFS.
If you've got specifics about what you mean by "databases should bring their own implementation", I'd be happy to hear it, but I'm having trouble thinking of any sorta technically sound reason for "databases actually prefer it if filesystems can silently corrupt data lol" being true.
zaarn · 1d ago
SQLite on ZFS needs the Fsync behaviour to be off, otherwise SQLite will randomly hang the application as the fsync will wait for the txg to commit. This can take a minute or two, in my experience.
Btrfs is a better choice for SQLite.
supriyo-biswas · 1d ago
Btw this concern also applies to other databases, although probably it manifests in the worst way in SQLite. Essentially, you’re doing a WAL over the file systems’ own WAL-like recovery mechanism.
zaarn · 13h ago
I've not observed other databases locking up on ZFS, Postgres and MySQL both function just fine, without needing to modify any settings.
throw0101b · 1d ago
> SQLite on ZFS needs the Fsync behaviour to be off […]
As noted in a sibling comment, this causes corruption on power failure.
jandrewrogers · 1d ago
The point is that a database cannot rely on being deployed on a filesystem with proper checksums.
Ext4 uses 16-/32-bit CRCs, which is very weak for storage integrity in 2025. Many popular filesystems for databases are similarly weak. Even if they have a strong option, the strong option is not enabled by default. In real-world Linux environments, the assumption that the filesystem has weak checksums is usually true.
Postgres has (IIRC) 32-bit CRCs but they are not enabled by default. That is also much weaker than you would expect from a modern database. Open source databases do not have a good track record of providing robust corruption detection generally nor the filesystems they often run on. It is a systemic problem.
ZFS doesn't support features that high-performance database kernels use and is slow, particularly on high-performance storage. Postgres does not use any of those features, so it matters less if that is your database. XFS has traditionally been the preferred filesystem for databases on Linux and Ext4 will work. Increasingly, databases don't use external filesystems at all.
mardifoufs · 22h ago
I know MySQL has checksums by default, how does it compare? Is it useful or is it similarly weak?
jandrewrogers · 20h ago
I don't know but LLMs seem to think it uses a 32-bit CRC like e.g. Postgres.
In fairness, 32-bit CRCs were the standard 20+ years ago. That is why all the old software uses them and CPUs have hardware support for computing them. It is a legacy thing that just isn't a great choice in 2025.
lxgr · 1d ago
No, competent systems just need to have something that, taken together, prevents data corruption.
One possible instance of that is a database providing its own data checksumming, but another perfectly valid one is running one that does not on a lower layer with a sufficiently low data corruption rate.
johncolanduoni · 23h ago
ZFS is not great for databases that do updates in place. Log-structured merge databases (which most newer DB engines are) work fine with its copy-on-write semantics.
> [...] The checkpoint does not normally truncate the WAL file (unless the journal_size_limit pragma is set). Instead, it merely causes SQLite to start overwriting the WAL file from the beginning. This is done because it is normally faster to overwrite an existing file than to append.
Without the checksum, a new WAL entry might cleanly overwrite an existing longer one in a way that still looks valid (e.g. "A|B" -> "C|B" instead of "AB" -> "C|data corruption"), at least without doing an (expensive) scheme of overwriting B with invalid data, fsyncing, and then overwriting A with C and fsyncing again.
In other words, the checksum allows an optimized write path with fewer expensive fsync/truncate operations; it's not a sudden expression of mistrust of lower layers that doesn't exist in the non-WAL path.
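If you would rather have the WAL truncated than overwritten in place, SQLite already exposes that as a documented knob; a short Python sketch (file name is just an example):

    import sqlite3

    conn = sqlite3.connect("example.db")
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA journal_size_limit = 1048576")   # truncate the WAL after checkpoints beyond ~1 MiB
    # ... writes ...
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")       # checkpoint and shrink the WAL to zero bytes
    conn.close()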
lxgr · 1d ago
> This is a follow-up post to my PSA: SQLite does not do checksums and PSA: Most databases do not do checksums by default.
That's really all there is to it.
SQLite has very deliberate and well-documented assumptions (see for example [1], [2]) about the lower layers it supports. One of them is that data corruption is handled by these lower layers, except if stated otherwise.
Not relying on this assumption would require introducing checksums (or redundancy/using an ECC, really) on both the WAL/rollback journal and on the main database file. This would make SQLite significantly more complex.
I believe TFA is mistaken about how SQLite uses checksums. They primarily serve as a way to avoid some extra write barriers/fsync operations, and maybe to catch incomplete out-of-order writes, but never to detect actual data corruption: https://news.ycombinator.com/item?id=44671373
I might be missing something (We use sqlite for our embedded stores) - but I feel like "failing silently" is alarmist here.
1. If the WAL is incomplete, then "failing" silently is the correct thing to do here, and is the natural function of the WAL. The WAL had an incomplete write, nothing should have been communicated back to the application, and the application should assume the write never completed.
2. If the WAL is corrupt (due to the reasons he mentioned), then sqlite says that's your problem, not sqlite's. I think this is the default behavior for other databases as well. If a bit flips on disk, it's not guaranteed the database will catch it.
This article is framed almost like a CVE, but to me this is kind of like saying "PSA: If your hard drive dies you may lose data". If you care about data integrity (because your friend is sending you sqlite files) you should be handling that.
supriyo-biswas · 1d ago
Also, partially applying a WAL has obvious issues even though the author of this post would somehow prefer that. If we update 3 rows in a database and the WAL entry for one of the rows is corrupted, do they expect to ignore the corrupted entry and apply the rest? What happens to data consistency in this particular case?
lxgr · 1d ago
Even worse: SQLite, by default, does not immediately truncate WAL files, but rather overwrites the existing WAL from the beginning after a successful checkpoint.
Doing what the author suggests would actually introduce data corruption errors when "restoring a WAL with a broken checksum".
avinassh · 1d ago
However, SQLite, by default, always truncates the WAL files on last connection close
> When the last connection to a database closes, that connection does one last checkpoint and then deletes the WAL and its associated shared-memory file, to clean up the disk.
Yes, but you can't rely on an orderly database close when it comes to correctness. Non-clean shutdowns are much more common than actual low-level data corruption in many SQLite applications.
kburman · 1d ago
An employee of Turso, a commercial fork of SQLite, is presenting a standard, safety-first feature of SQLite's WAL as a dangerous flaw. As many have noted, this behavior prevents database corruption, it doesn't cause it.
chambers · 10m ago
Yeah, this tracks. If the OP consulted with Turso on this blogpost, then Turso probably believes the reported behavior is indeed a failure or a flaw, which they think a local db should be responsible for.
The confusion is that Limbo is not mentioned in the article which means that everyone has to start from a different rational premise.
supriyo-biswas · 1d ago
I wouldn't have jumped to a conspiracy angle immediately, but there are some signs which are difficult to overlook:
- Said person was apparently employed due to his good understanding of databases and distributed systems concepts (there's a HN thread about how he found an issue in the paper describing an algorithm); yet makes fundamental mistakes in understanding what the WAL does and how it's possible not to "partly" apply a WAL.
- Said person expects a SQL database to expose WAL level errors to the user breaking transactional semantics (if you want that level of control, consider simpler file-based key-value stores that expose such semantics?)
- Said person maligns SQLite as being impossible to contribute; whereas the actual project only mentions that they may rewrite the proposed patch to avoid copyright implications.
- Said person again maligns SQLite as "limping along" in the face of disk errors (while making the opposite claim a few paragraphs ago); while ignoring that the checksum VFS exists when on-disk data corruption is a concern.
jrockway · 1d ago
I think it's kind of possible to partially apply the WAL manually. Imagine your frames are:
1) Insert new subscription for "foobar @ 123 Fake St."
2) Insert new subscription for "�#�#�xD�{.��t��3Axu:!"
3) Insert new subscription for "barbaz @ 742 Evergreen Terrace"
A human could probably grab two subscriptions out of that data loss incident. I think that's what they're saying. If you're very lucky and want to do a lot of manual work, you could maybe restore some of the data. Obviously both of the "obviously correct" records could just be random bitflips that happen to look right to humans. There's no way of knowing.
cwillu · 1d ago
And if the “obviously correct” last entry is actually an old entry that just hadn't been overwritten yet? Or if it was only permitted because of something in the corrupted section?
The database should absolutely not be performing guesswork about the meaning of its contents during recovery. If you want mongodb, go use mongodb.
jrockway · 19h ago
I'm aware that nothing should automatically attempt this recovery, but asking the database to throw an error when it happens does not seem crazy to me. I mean, I get why it doesn't. I have worked on systems with "mount /dev/sda1 || mkfs.ext4 /dev/sda1" because there isn't a UI to recover the filesystem, so nothing is lost by erasing it. But I don't think it makes you bad at databases to want this condition to be optionally fatal. (I can also see why "welp it's corrupted <unlink>" is also annoying. If I'm the author of this software, I would be interested in checking out the corrupted file before deleting it. I can live with the slight hit to my free disk space if it means I can fix a bug!)
ulrikrasmussen · 1d ago
Yes, in this particular example you could. But in general the database cannot make assumptions that changes are independent of each other.
I think SQLite assumes that a failing checksum occurs due to a crash during a write which never finished. A corrupt WAL frame before a valid frame can only occur if the underlying storage is corrupt, but it makes no sense for SQLite to start handling that during replay as it has no way to recover. You could maybe argue that it should emit a warning
supriyo-biswas · 1d ago
This could work for a simple key-value store; but SQLite also does referential integrity which means we might just end up with extra entries with no entities on the other side of the table. IMO, best avoided in a transactional database.
avinassh · 1d ago
of all places, I did not expect to get personal attacks on HN :)
> yet makes fundamental mistakes in understanding what the WAL does and how it's possible not to "partly" apply a WAL.
Please provide citation on where I said that. You can't partly apply WAL always, but there are very valid cases where you can do that to recover. Recovery doesn't have to be automatic. It can be done by SQLite, or some recovery tool, or with manual intervention.
> - Said person maligns SQLite as being impossible to contribute; whereas the actual project only mentions that they may rewrite the proposed patch to avoid copyright implications.
Please provide citation on where I said that. Someone asked me to send a patch to SQLite, I linked them to the SQLite's page.
supriyo-biswas · 1d ago
> You can't partly apply WAL always, but there are very valid cases where you can do that to recover.
Without mentioning the exact set of cases where recovery is possible and it isn't, going "PSA: SQLite is unreliable!!1one" is highly irresponsible. I think there's quite a bit of criticism going around though, you could add them to your blog article :)
Please also consider the fact that SQLite being a transactional database, it is usually not possible to expose a WAL level error to the user. The correct way to address it is to probably come up with a list of cases where it is possible, and then send in a patch, or at least a proposal, of how to address it.
> Please provide citation on where I said that [SQLite is impossible to contribute].
I don't know if you have some personal vendetta against me, because you are citing things I did not say. I did not say SQLite is unreliable. I said SQLite stops at checksum errors found in WAL and stops recovery, which may lead to data loss. Which part of this is incorrect?
On SQLite contribution, I did not say it's "impossible." I said it's not open to contribution. This is the exact phrase from the linked page.
CJefferson · 1d ago
To me it isn't incorrect, but misleading. A checksum error means corruption, and corruption in the main database can just as easily cause arbitrary damage. Why treat a checksum error in the WAL specially?
tucnak · 23h ago
> of all places, I did not expect to get personal attacks on HN
You must be new to the site.
cwillu · 1d ago
The benefit is that you're left with a database state that actually existed; there's no guarantee from the database's perspective that dropping some committed transactions and not others that came after will result in a valid state.
HelloNurse · 1d ago
This is the main point that the OP misses: even if the newer portion of the WAL file isn't corrupted, its content cannot be used in any way because doing so would require the lost transactions from the corrupted block. The chained checksums are a feature, not gratuitous fragility.
AlotOfReading · 1d ago
Sqlite could attempt to recover the detected errors though and not lose the transactions.
daneel_w · 1d ago
The WAL was corrupted, the actual data is lost. There's no parity. You're suggesting that sqlite should somehow recreate the data from nothing.
AlotOfReading · 1d ago
I was assuming sqlite did the sane thing and used a CRC. CRCs have (limited) error correction capabilities that you can use to fix 1-2 bit errors in most circumstances, but apparently sqlite uses a Fletcher variant and gives up that ability (+ long message length error detection) for negligible performance gains on modern (even embedded) CPUs.
lxgr · 1d ago
The ability to... correct 1-2 bit errors? Is that even a realistic failure mode on common hardware?
CRCs as used in SQLite are not intended to detect data corruption due to bit rot, and are certainly not ECCs.
AlotOfReading · 1d ago
Yes, it's a pretty important error case with storage devices. It's so common that modern storage devices and filesystems include their own protections against it. Your system may or may not have these and bit flips may happen after that point, so WAL redundancy wouldn't be out of place.
Sure, the benefits to the incomplete write use case are limited, but there's basically no reason to ever use a fletcher these days.
It's also worth mentioning that the VFS checksums are explicitly documented as guarding against storage device bitrot and use the same fletcher algorithm.
lxgr · 1d ago
It would be absolutely out of place if your lower layers already provided it, and not enough if they didn't (since you'd then also need checksums on the actual database file, which SQLite does not provide – all writes happen again there!)
AlotOfReading · 1d ago
sqlite does actually provide database checksums, via the vfs extension I mentioned previously.
There's no harm to having redundant checksums and it's not truly redundant for small messages. It's pretty common for systems not to have lower level checksumming either. Lots of people are still running NTFS/EXT4 on hardware that doesn't do granular checksums or protect data in transit.
Of course this is all a moot point because sqlite does WAL checksums, it just does them with an obsolete algorithm.
lxgr · 1d ago
> There's no harm to having redundant checksums
There sure is: Redundant checksums need extra storage and extra processing. SQLite often runs on embedded systems, where both can come at a premium.
> Of course this is all a moot point because sqlite does WAL checksums, it just does them with an obsolete algorithm.
That's not nearly the only thing missing for SQLite to provide full resistance to lower-level data corruption. At a very minimum, you'd also need checksums of the actual database file.
AlotOfReading · 23h ago
There's no extra storage because it's already storing checksums. There's technically extra processing, but a good CRC is faster than reading memory on virtually all modern systems (even in embedded) so there's no reason to prefer a fletcher unless the few extra cycles of latency to get the CRC running are deeply concerning to you.
> That's not nearly the only thing missing for SQLite to provide full resistance to lower-level data corruption.
A CRC wouldn't provide full resistance anyway, so this clearly isn't what I'm arguing. What I'm saying is that a fletcher is strictly worse than a CRC here. Not an advanced algorithm, a bog-standard CRC. Fletcher actually takes more space in this case, has worse error detection, and doesn't allow recovery. A CRC would allow you to recover some of the time, from some kinds of errors, in some scenarios. It's simply better than nothing.
lxgr · 23h ago
> There's no extra storage because it's already storing checksums.
At the database level (i.e. not just the WAL)? Are you sure?
> What I'm saying is that a fletcher is strictly worse than a CRC here.
I can't speak to the performance differences, but the only thing SQLite really needs the checksum to do is to expose partial writes, both due to reordered sector writes and partial intra-sector writes. (The former could also be solved by just using an epoch counter, but the latter would require some tricky write formats, and a checksum nicely addresses both).
In both cases, there's really nothing to recover: CRC-based correction won't restore an entire missing sector, and almost never a partially written one (i.e. unless the failure somehow happens in the very last bytes of it, so that the ratio of "flipped" bits is low enough).
AlotOfReading · 23h ago
Just talking about the WAL. The database does have a page level checksum with the VFS extension, but it's optional.
avinassh · 1d ago
> The WAL was corrupted, the actual data is lost. There's no parity. You're suggesting that sqlite should somehow recreate the data from nothing.
Not all frames in the WAL are important. Sure, recovery may be impossible in some cases, but not all checksum failures are impossible to recover from.
daneel_w · 1d ago
My feeling is that any portion that no longer checksums correctly is practically forfeited, which makes any form of recovery an arbitrary result - a half-baked transaction. I don't see how that's a more desirable outcome than rolling back to the last known-good commit, which is never arbitrary.
avinassh · 1d ago
I posted in another comment: There is a smoll demo in the post. In it, I corrupt an old frame that is not needed by the database at all. Now, one approach would be to continue the recovery and then present both states: one where the WAL is dropped, and another showing whatever we have recovered. If I had such an option, I would almost always pick the latter.
nemothekid · 1d ago
>one where the WAL is dropped, and another showing whatever we have recovered. If I had such an option, I would almost always pick the latter
I can't imagine picking the latter unless you were treating sqlite like a filesystem of completely unrelated blobs.
If I run three transactions where:
1. John gives $100 to Sue.
2. Sue gives $100 to Mark.
3. Mark gives $100 to Paul.
If sqlite just erases transaction (2), then Mark materializes $100 from nowhere. The rest of your database is potentially completely corrupted. At that point your database is no longer consistent - I can't see how you would "almost always" prefer this.
If (2) is corrupt, then the restore stops at (1), and you are guaranteed consistency.
avinassh · 1d ago
Yes, this is a good example showing why we cannot always partly apply the WAL. Again, let me repeat it: we cannot partly apply the WAL all the time and expect it to work, but there are some valid cases where we can do that to recover. Your example is not one of them.
pests · 1d ago
> but not all checksum failures are impossible to recover from
Which failures are possible to recover from?
jandrewrogers · 1d ago
There are two examples I know of that require no additional data:
First, force a re-read of the corrupted page from disk. A significant fraction of data corruption occurs while it is being moved between storage and memory due to weak error detection in that part of the system. A clean read the second time would indicate this is what happened.
Second, do a brute-force search for single or double bit flips. This involves systematically flipping every bit in the corrupted page, recomputing the checksum, and seeing if corruption is detected.
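The second idea in Python, as a sketch; zlib.crc32 stands in for whatever checksum the storage layer actually uses, and a real implementation would exploit CRC linearity instead of recomputing from scratch:

    import zlib

    def try_single_bit_repair(page: bytes, expected_crc: int):
        for byte_idx in range(len(page)):
            for bit in range(8):
                candidate = bytearray(page)
                candidate[byte_idx] ^= 1 << bit            # flip one bit
                if zlib.crc32(bytes(candidate)) == expected_crc:
                    return bytes(candidate)                # plausible single-bit repair
        return None                                        # not a single-bit error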
daneel_w · 1d ago
> A significant fraction of data corruption occurs while it is being moved between storage and memory
Surely you mean on the memory bus specifically? SATA and PCIe both have some error correction methods for securing transfers between storage and host controller. I'm not sure about old parallel ATA. While I understand it can happen under conditions similar to non-ECC RAM being corrupted, I don't think I've ever heard or read about a case where a storage device randomly returned erroneous data, short of a legitimate hardware error.
jandrewrogers · 1d ago
The error correction in PCIe, SATA, etc is too weak to be reliable for modern high-performance storage. This is a known deficiency and why re-reading a corrupted page sometimes fixes things. PCIe v6 is introducing a much stronger error detection scheme to address this, which will mostly leave bit-rot on storage media as the major vector.
The bare minimum you want these days is a 64-bit CRC. A strong 128-bit hash would be ideal. Even if you just apply these at the I/O boundaries then you'll catch most corruption. The places it can realistically occur are shrinking but most software makes minimal effort to detect this corruption even though it is a fairly well-bounded problem.
daneel_w · 23h ago
Thanks for the tech details.
dec0dedab0de · 1d ago
what if the corruption only affected the stored checksum, but not the data itself?
kstrauser · 1d ago
And you would detect this how?
daneel_w · 1d ago
...then the checksum can't clear the associated operation. Same result.
hobs · 1d ago
And not get in an infinite loop, and not harm the startup time of the process inordinately, and...
This is just basically how a WAL works, if you have an inconsistent state the transaction is rolled back - at that point you need to redo your work.
teraflop · 1d ago
Yes, and it's not just about application-level integrity. The WAL operates at a page level, so dropping one WAL entry and then applying later ones would be likely to cause corruption at the B-tree level.
For instance, say you have a node A which has a child B:
* Transaction 1 wants to add a value to B, but it's already full, so B is split into new nodes C and D. Correspondingly, the pointer in A that points to B is removed, and replaced with pointers to C and D.
* Transaction 2 makes an unrelated change to A.
If you skip the updates from transaction 1, and apply the updates from transaction 2, then suddenly A's data is overwritten with a new version that points to nodes C and D, but those nodes haven't been written. The pointers just point to uninitialized garbage.
slashdev · 1d ago
How would this work differently? As soon as you encounter a checksum failure, you can't trust anything from that point on. If the checksum were just per-page and didn't build on the previous page's checksum, you can't just apply pages from the WAL that were valid, skipping the ones which were not. The database at the end of that process would be corrupt.
If you stop at the first failure, the database is restored to the last good state. That's the best outcome that can be achieved under the circumstances. Some data could be lost, but there wasn't anything sensible you could do with it anyway.
avinassh · 1d ago
> How would this work differently?
I would like it to raise an error and then provide an option to continue or stop. Since continuing is the default, we need a way to opt in to stopping on checksum failure.
Not all checksum errors are impossible to recover from. Also, as the post mentions, only some unimportant pages could be corrupt too.
My main complaint is that it doesn't give developers an option.
thadt · 1d ago
Aight, I'll bite: continue or stop... and do what? As others have pointed out, the only safe option to get back to a consistent state is to roll back to a safe point.
If what we're really interested in is the log part of a write ahead log - where we could safely recover data after a corruption, then a better tool might be just a log file, instead of SQLite.
avinassh · 1d ago
> Aight, I'll bite: continue or stop... and do what? As others have pointed out, the only safe option to get back to a consistent state is to roll back to a safe point.
Attempt to recover! Again, not all checksum errors are impossible to recover. I hold the view that even if there is a 1% chance of recovery, we should attempt it. This may be done by SQLite, an external tool, or even manually. Since WAL corruption issues are silent, we cannot do that now.
There is a smoll demo in the post. In it, I corrupt an old frame that is not needed by the database at all. Now, one approach would be to continue the recovery and then present both states: one where the WAL is dropped, and another showing whatever we have recovered. If I had such an option, I would almost always pick the latter.
nemothekid · 1d ago
>I corrupt an old frame that is not needed by the database at all
1. How does the database know that.
2. In your example Alice gets the money from nowhere. What if another user had sent the money to Alice and that frame got corrupted? Then you just created 10,000,000 from nowhere.
At the very least, rolling back to a good point gives you an exact moment of time where transactions can be applied from. Your example is very contrived and in a database where several transactions can be happening, doing a partial recovery will destroy the consistency of the database.
NortySpock · 1d ago
The demo is good, though I wish you had presented it as text rather than as a gif.
I do see your point of wanting an option to refuse to delete the wal so a developer can investigate the wal and manually recover... But the typical user probably wants the database to come back up with a consistent, valid state if power is lost. They do not want to have the database refuse to operate because it found uncommitted transactions in a scratchpad file...
As a SQL-first developer, I don't pick apart write-ahead logs trying to save a few bytes from the great hard drive in the sky, I just want the database to give me the current state of my data and never be in an invalid state.
avinassh · 1d ago
> As a SQL-first developer, I don't pick apart write-ahead logs trying to save a few bytes from the great hard drive in the sky, I just want the database to give me the current state of my data and never be in an invalid state.
Yes, that is a very valid choice. Hence, I want databases to give me an option, so that you can choose to ignore the checksum errors and I can choose to stop the app and try to recover.
lxgr · 1d ago
WAL checksum errors are not used to catch data corruption, they are only used to detect incomplete writes or leftover data from previous checkpoints.
If you attempt to do the former in a system that by design uses checksums only for the latter, you'll actually introduce corrupted data from some non-corrupted WAL files.
lxgr · 1d ago
What about all classes of durability errors that do not manifest in corrupted checksums (e.g. "cleanly" lost appends of entire transactions to the WAL)?
It seems like you're focusing on a very specific failure mode here.
Also, what if the data corruption error happens during the write to the actual database file (i.e. at WAL checkpointing time)? That's still 50% of all your writes, and there's no checksum there!
ncruces · 1d ago
> Since WAL corruption issues are silent, we cannot do that now.
You do have backups, right?
avinassh · 1d ago
Yes. Say if I am using something like litestream, then all subsequent generations will be affected. At some point, I'd go back and find out the broken generation.
Instead of that, I'd prefer for it to fail fast
lxgr · 1d ago
Giving developers that option would require SQLite to change the way it writes WALs, which would increase overhead. Checksum corruptions can happen without any lower-level errors; this is a performance optimization by SQLite.
> Giving developers that option would require SQLite to change the way it writes WALs, which would increase overhead.
Yes! But I am happy to accept that overhead with the corruption detection.
lxgr · 1d ago
But why? You'd only get partial corruption detection.
As I see it, either you have a lower layer you can trust, and then this would just be extra overhead, or you don't, in which case you'll also need error correction (not just detection!) for the database file itself.
slashdev · 22h ago
It is good that it doesn't give you an option. I don't want some app on my phone telling me its database is corrupt, I want it to always load back to the last good state and I'll handle any missing data myself.
The checksums are not going to fail unless there was disk corruption or a partial write.
In the former, thank your lucky stars it was in the WAL file and you just lose some data but have a functioning database still.
In the latter, you didn't fsync, so it couldn't have been that important. If you care about not losing data, you need to fsync on every transaction commit. If you don't care enough to do that, why do you care about checksums, it's missing the point.
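In SQLite terms, "fsync on every transaction commit" is the synchronous pragma; a minimal Python sketch (table and file names are illustrative):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=FULL")   # sync the WAL on every commit, not just at checkpoints
    conn.execute("CREATE TABLE IF NOT EXISTS events(payload TEXT)")
    with conn:                                # commits (and, with FULL, syncs) on exit
        conn.execute("INSERT INTO events(payload) VALUES (?)", ("hello",))
    conn.close()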
ryanjshaw · 1d ago
> What I want: throw an error when corruption is detected and let the code handle it.
I wonder what that code would look like. My sense is that it’ll look exactly like the code that would run as if the transactions never occurred to begin with, which is why the SQLite design makes sense.
For example, I have a database of todos that sync locally from the cloud. The WAL gets corrupted. The WAL gets truncated the next time the DB is opened. The app logic then checks the last update timestamp in the DB and syncs with the cloud.
I don’t see what the app would do differently if it were notified about the WAL corruption.
fer · 1d ago
Exactly. I'd read it as
> I want to correct errors that the DB wizard who implemented SQLite chose not to
When there's a design decision in such a high profile project that you disagree with, it's either
1. You don't understand why it was done like this.
2. You can (and probably will) submit a change that would solve it.
If you find yourself in the situation of understanding, yet not doing anything about it, you're the Schrodinger's developer: you're right and wrong until you collapse the mouth function by putting money on it.
It's very rarely an easy to fix mistake.
avinassh · 1d ago
> 2. You can (and probably will) submit a change that would solve it.
> 1. You don't understand why it was done like this.
sure, I would like to understand it. That's why the post!
rovr138 · 1d ago
> Contributed Code
> In order to keep SQLite completely free and unencumbered by copyright, the project does not accept patches. If you would like to suggest a change and you include a patch as a proof-of-concept, that would be great. However, please do not be offended if we rewrite your patch from scratch.
Propose it.
jmull · 1d ago
> What’s interesting is that when a frame is found to have a missing or invalid checksum, SQLite drops that frame and all the subsequent frames.
Skipping a frame but processing later ones would corrupt the database.
> SQLite doesn’t throw any error on detection of corruption
I don’t think it’s actually a corruption detection feature though. I think it’s to prevent a physical failure while writing (like power loss) from corrupting the database. A corruption detection feature would work differently. E.g., it would cover the whole database, not just the WAL. Throwing an error here doesn’t make sense.
asveikau · 1d ago
> You have SQLite .db and .db-wal files, but no accompanying .db-shm file. Maybe your friend shared it with you, or you downloaded some data off the internet.
Honestly this sounds out of scope for normal usage of sqlite and not realistic. I had a hard time reading past this. If I read that correctly, they're saying sqlite doesn't work if one of the database files disappears from under it.
I guess if you had filesystem corruption it's possible that .db-shm disappears without notice and that's a problem. But that isn't sqlite's fault.
CGamesPlay · 20h ago
This, exactly. Especially since these files are basically the "this database was not cleanly closed" markers for SQLite. From SQLite's docs:
> If the last client using the database shuts down cleanly by calling sqlite3_close(), then a checkpoint is run automatically in order to transfer all information from the wal file over into the main database, and both the shm file and the wal file are unlinked.
dev_l1x_be · 12h ago
I was wondering about this subject for some time, but the only real solution as I see it would be a transactional filesystem (re-designing how filesystems work).
- there is an official checksum VFS shim, but I've never used it and don't know how good it is. The difference between it and the WAL checksum is that it works on a per-page level, and you seem to need to run the checksum checks manually and then decide yourself what to do
- checksums (as used by the SQLite WAL) aren't meant for backup, redundancy or data recovery (there are error-correcting codes focused on allowing recovery of a limited number of bits, but they have way more overhead than the kind of checksum used here)
- I also believe SQLite should indicate such checksum errors (e.g. so that you might engage out of band data recovery, i.e. fetch a backup from somewhere), but I'm not fully sure how you would integrate it in a backward compatible way? Like return it as an error which otherwise acts like a SQLITE_BUSY??
Data in the WAL should be considered to be of "reduced durability".
adzm · 1d ago
sqlite has several callbacks / hooks / handlers that can be set. I think it is reasonable to expect there to be a way for this situation to be communicated to the application.
westurner · 1d ago
Do the sqlite replication systems depend upon WAL checksums?
Merkle hashes would probably be better.
google/trillian adds Merkle hashes to table rows.
sqlite-parquet-vtable would work around broken WAL checksums.
> There are many extensions of SQLite; rqlite (Raft in Go,), cr-sqlite (CRDT in C), postlite (Postgres wire protocol for SQLite), electricsql (Postgres), sqledge (Postgres), and also WASM: sqlite-wasm, sqlite-wasm-http, dqlite (Raft in Rust),
This is the correct explanation. The purpose is to detect partial writes, not to detect arbitrary data corruption. If detecting corruption was the goal, then checksumming the WAL without also checksumming the database itself would be fairly pointless.
In fact, it's not accurate to say "SQLite does not do checksums by default, but it has checksums in WAL mode." SQLite always uses checksums for its journal, regardless of whether that's a rollback journal or a write-ahead log. [1]
For the purpose of tolerating and recovering from crashes/power failures, writes to the database file itself are effectively idempotent. It doesn't matter if only a subset of the DB writes are persisted before a crash, and you don't need to know which ones succeeded, because you can just roll all of them forward or backward (depending on the mode). But for the journal itself, distinguishing partial journal entries from complete ones matters.
No matter what order the disk physically writes out pages, the instant when the checksum matches the data is the instant at which the transaction can be unambiguously said to commit.
[1]: https://www.sqlite.org/fileformat.html
Imagine the power goes out while sqlite is in the middle of writing a transaction to the WAL (before the write has been confirmed to the application). What do you want to happen when power comes back, and you reload the database?
If the transaction was fully written, then you'd probably like to keep it. But if it was not complete, you want to roll it back.
How does sqlite know if the transaction was complete? It needs to see two things:
1. The transaction ends with a commit frame, indicating the application did in fact perform a `COMMIT TRANSACTION`.
2. All the checksums are correct, indicating the data was fully synced to disk when it was committed.
If the checksums are wrong, the assumption is that the transaction wasn't fully written out. Therefore, it should be rolled back. That's exactly what sqlite does.
This is not "data loss", because the transaction was not ever fully committed. The power failure happened before the commit was confirmed to the application, so there's no way anyone should have expected that the transaction is durable.
The checksum is NOT intended to detect when the data was corrupted by some other means, like damage to the disk or a buggy app overwriting bytes. Myriad other mechanisms should be protecting against those already, and sqlite is assuming those other mechanisms are working, because if not, there's very little sqlite can do about it.
> Local devices also have a characteristic which is critical for enabling database management software to be designed to ensure ACID behavior: When all process writes to the device have completed, (when POSIX fsync() or Windows FlushFileBuffers() calls return), the filesystem then either has stored the "written" data or will do so before storing any subsequently written data.
[1] https://house.fandom.com/wiki/Everybody_lies
[2] https://news.ycombinator.com/item?id=30371403
https://imaginovation.net/case-study/cree/
At least what I could turn up with a quick web search.
Btrfs is a better choice for sqlite, haven’t seen that issue there.
The latest comment seems to be a nice summary of the root cause, with earlier in the thread pointing to ftruncate instead of fsync being a trigger:
>amotin
>I see. So ZFS tries to drop some data from pagecache, but there seems to be some dirty pages, which are held by ZFS till them either written into ZIL, or to disk at the end of TXG. And if those dirty page writes were asynchronous, it seems there is nothing that would nudge ZFS to actually do something about it earlier than zfs_txg_timeout. Somewhat similar problem was recently spotted on FreeBSD after #17445, which is why newer version of the code in #17533 does not keep references on asynchronously written pages.
Might be worth testing zfs_txg_timeout=1 or 0
What you're describing sounds like a bug specific to whichever OS you're using that has a port of ZFS.
I've encountered this bug both on illumos, specifically OpenIndiana, and Linux (Arch Linux).
Which you can do on a per dataset ('directory') basis very easily:
* https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...Meanwhile all the rest of your pools / datasets can keep the default POSIX behaviour.
You cannot have SQLite keep your data and run well on ZFS unless you make a zvol and format it as btrfs or ext4 so they solve the problem for you.
If your job is to make sure your file system and your database—SQLite, Pg, My/MariDB, etc—are tuned together, and you don't tune it, then you should be called into a meeting. Or at least the no-fault RCA should bring up remediation methods to make sure it's part of the SOP so that it won't happen again.
The alternative the GP suggests is using Btrfs, which I find even more irresponsible than your non-tuning situation. (Heck, if someone on my sysadmin team suggested we start using Btrfs for anything I would think they were going senile.)
On the other hand, I've heard people recommend running Postgres on ZFS so you can enable on the fly compression. This increases CPU utilization on the postgres server by quite a bit, read latency of uncached data a bit, but it decreases necessary write IOPS a lot. And as long as the compression is happening a lot in parallel (which it should, if your database has many parallel queries), it's much easier to throw more compute threads at it than to speed up the write-speed of a drive.
And after a certain size, you start to need atomic filesystem snapshots to be able to get a backup of a very large and busy database without everything exploding. We already have the more efficient backup strategies from replicas struggle on some systems and are at our wits end how to create proper backups and archives without reducing the backup freqency to weeks. ZFS has mature mechanisms and zfs-send to move this data around with limited impact ot the production dataflow.
For Postgres specifically you may also want to look at using hot_standby_feedback, as described in this recent HN article: https://news.ycombinator.com/item?id=44633933
We also have decently sized clusters with very active data on them, and rather spicy recovery targets. On some of them, a full backup from the sync standby takes 4 hours, we need to pull an incremental backup at most 2 hours afterwards, but the long-term archiving process needs 2-3 hours to move the full backup to the archive. This is the first point in which filesystem snapshots, admittedly, of the pgbackrest repo, become necessary to adhere to SLOs as well as system function.
We do all of the high-complexity, high-throughput things recommended by postgres, and it's barely enough on the big systems. These things are getting to the point of needing a lot more storage and network bandwidth.
EDIT: It seems they're opt-in for PostgreSQL, too: https://www.postgresql.org/docs/current/checksums.html
bad news is, most databases don't do checksums by default.
Redundantly performing the same performance-intensive tasks on multiple layers makes latency less predictable and just generally wastes resources.
This is a major reason databases implement their own checksums. Unfortunately, many open source databases have weak or non-existent checksums too. It is sort of an indefensible oversight.
At 32 bits you're well into the realm of tail risks which include things like massive solar flares or the data center itself being flattened in an explosion or natural disaster.
Edit: I just checked a local drive for concrete numbers. It's part of a btrfs array. Relevant statistics since it was added are 11k power on hours, 24 TiB written, 108 TiB read, and 32 corruption events at the fs level (all attributable to the same power failure, no corruption before or since). I can't be bothered to compute the exact number but at absolute minimum it will be multiple decades of operation before I would expect even a single corruption event to go unnoticed. I'm fairly certain that my house is far more likely to burn down in that time frame.
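A rough back-of-the-envelope using those figures (my arithmetic, not the parent's; it assumes each corruption event independently has a 2^-32 chance of slipping past a 32-bit checksum):

    # 32 corruption events over ~11k power-on hours, 32-bit checksum
    events = 32
    hours = 11_000
    p_miss = 2 ** -32                    # chance one corrupted block passes the check
    missed_per_hour = (events / hours) * p_miss
    hours_to_one_miss = 1 / missed_per_hour
    print(f"~{hours_to_one_miss / 8760:.0f} years per expected undetected event")
    # -> on the order of 1.7e8 years, i.e. far beyond "multiple decades"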
Because filesystems, too, mainly use them to detect inconsistencies introduced by partial or reordered writes, not random bit flips. That's also why most file systems only have them on metadata, not data.
Say more? I've heard people say that ZFS is somewhat slower than, say, ext4, but I've personally had zero issues running postgres on zfs, nor have I heard any well-reasoned reasons not to.
> What filesystems in the wild typically provide for this is weaker than what is advisable for a database, so databases should bring their own implementation.
Sorry, what? Just yesterday matrix.org had a post about how they (using ext4 + postgres) had disk corruption which led to postgres returning garbage data: https://matrix.org/blog/2025/07/postgres-corruption-postmort...
The corruption was likely present for months or years, and postgres didn't notice.
ZFS, on the other hand, would have noticed during a weekly scrub and complained loudly, letting you know a disk had an error, letting you attempt to repair it if you used RAID, etc.
It's stuff like that post that is exactly why I run postgres on ZFS.
If you've got specifics about what you mean by "databases should bring their own implementation", I'd be happy to hear it, but I'm having trouble thinking of any sorta technically sound reason for "databases actually prefer it if filesystems can silently corrupt data lol" being true.
Btrfs is a better choice for SQLite.
Ext4 uses 16-/32-bit CRCs, which are very weak for storage integrity in 2025. Many popular filesystems for databases are similarly weak. Even if they have a strong option, the strong option is not enabled by default. In real-world Linux environments, the assumption that the filesystem has weak checksums is usually true.
Postgres has (IIRC) 32-bit CRCs but they are not enabled by default. That is also much weaker than you would expect from a modern database. Open source databases do not have a good track record of providing robust corruption detection generally nor the filesystems they often run on. It is a systemic problem.
ZFS doesn't support features that high-performance database kernels use and is slow, particularly on high-performance storage. Postgres does not use any of those features, so it matters less if that is your database. XFS has traditionally been the preferred filesystem for databases on Linux and Ext4 will work. Increasingly, databases don't use external filesystems at all.
In fairness, 32-bit CRCs were the standard 20+ years ago. That is why all the old software uses them and CPUs have hardware support for computing them. It is a legacy thing that just isn't a great choice in 2025.
One possible instance of that is a database providing its own data checksumming, but another perfectly valid one is running one that does not on a lower layer with a sufficiently low data corruption rate.
> [...] The checkpoint does not normally truncate the WAL file (unless the journal_size_limit pragma is set). Instead, it merely causes SQLite to start overwriting the WAL file from the beginning. This is done because it is normally faster to overwrite an existing file than to append.
Without the checksum, a new WAL entry might cleanly overwrite an existing longer one in a way that still looks valid (e.g. "A|B" -> "C|B" instead of "AB" -> "C|data corruption"), at least without doing an (expensive) scheme of overwriting B with invalid data, fsyncing, and then overwriting A with C and fsyncing again.
In other words, the checksum allows an optimized write path with fewer expensive fsync/truncate operations; it's not a sudden expression of mistrust of lower layers that doesn't exist in the non-WAL path.
That's really all there is to it.
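As a toy illustration of that overwrite scenario (this is not SQLite's actual frame format; it just mimics the idea of a per-WAL salt plus a cumulative checksum chain):

    import zlib

    def make_frame(salt, prev_cksum, data):
        # cumulative checksum chains each frame to everything before it
        cksum = zlib.crc32(data, prev_cksum ^ salt)
        return {"salt": salt, "data": data, "cksum": cksum}

    def replay(frames, salt):
        good, cksum = [], 0
        for f in frames:
            if f["salt"] != salt or zlib.crc32(f["data"], cksum ^ salt) != f["cksum"]:
                break                    # stale or partial frame: stop replay here
            good.append(f["data"])
            cksum = f["cksum"]
        return good

    # Old log written with salt 1: frames "A" then "B".
    log = [make_frame(1, 0, b"A")]
    log.append(make_frame(1, log[0]["cksum"], b"B"))

    # Restart from the beginning with a new salt, overwriting only the first slot.
    log[0] = make_frame(2, 0, b"C")

    print(replay(log, 2))                # [b'C'] -- the stale b'B' tail is ignored

Without the salt/checksum chain, replay would happily read "C" followed by the stale "B" frame.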
SQLite has very deliberate and well-documented assumptions (see for example [1], [2]) about the lower layers it supports. One of them is that data corruption is handled by these lower layers, except if stated otherwise.
Not relying on this assumption would require introducing checksums (or redundancy/using an ECC, really) on both the WAL/rollback journal and on the main database file. This would make SQLite significantly more complex.
I believe TFA is mistaken about how SQLite uses checksums. They primarily serve as a way to avoid some extra write barriers/fsync operations, and maybe to catch incomplete out-of-order writes, but never to detect actual data corruption: https://news.ycombinator.com/item?id=44671373
[1] https://www.sqlite.org/psow.html
[2] https://www.sqlite.org/howtocorrupt.html
1. If the WAL is incomplete, then "failing" silently is the correct thing to do here, and is the natural function of the WAL. The WAL had an incomplete write, nothing should have been communicated back to the application, and the application should assume the write never completed.
2. If the WAL is corrupt (due to the reasons he mentioned), then sqlite says that's your problem, not sqlite's. I think this is the default behavior for other databases as well. If a bit flips on disk, it's not guaranteed the database will catch it.
This article is framed almost like a CVE, but to me this is kind of like saying "PSA: If your hard drive dies you may lose data". If you care about data integrity (because your friend is sending you sqlite files) you should be handling that.
Doing what the author suggests would actually introduce data corruption errors when "restoring a WAL with a broken checksum".
> When the last connection to a database closes, that connection does one last checkpoint and then deletes the WAL and its associated shared-memory file, to clean up the disk.
https://www.sqlite.org/wal.html
The confusion is that Limbo is not mentioned in the article which means that everyone has to start from a different rational premise.
- Said person was apparently employed due to his good understanding of databases and distributed systems concepts (there's a HN thread about how he found an issue in the paper describing an algorithm); yet makes fundamental mistakes in understanding what the WAL does and how it's possible not to "partly" apply a WAL.
- Said person expects a SQL database to expose WAL level errors to the user breaking transactional semantics (if you want that level of control, consider simpler file-based key-value stores that expose such semantics?)
- Said person maligns SQLite as being impossible to contribute to; whereas the actual project only mentions that they may rewrite the proposed patch to avoid copyright implications.
- Said person again maligns SQLite as "limping along" in the face of disk errors (while making the opposite claim a few paragraphs ago); while ignoring that the checksum VFS exists when on-disk data corruption is a concern.
1) Insert new subscription for "foobar @ 123 Fake St."
2) Insert new subscription for "�#�#�xD�{.��t��3Axu:!"
3) Insert new subscription for "barbaz @ 742 Evergreen Terrace"
A human could probably grab two subscriptions out of that data loss incident. I think that's what they're saying. If you're very lucky and want to do a lot of manual work, you could maybe restore some of the data. Obviously both of the "obviously correct" records could just be random bitflips that happen to look right to humans. There's no way of knowing.
The database should absolutely not be performing guesswork about the meaning of its contents during recovery. If you want mongodb, go use mongodb.
I think SQLite assumes that a failing checksum occurs due to a crash during a write which never finished. A corrupt WAL frame before a valid frame can only occur if the underlying storage is corrupt, but it makes no sense for SQLite to start handling that during replay as it has no way to recover. You could maybe argue that it should emit a warning.
> yet makes fundamental mistakes in understanding what the WAL does and how it's possible not to "partly" apply a WAL.
Please provide citation on where I said that. You can't always partly apply a WAL, but there are very valid cases where you can do that to recover. Recovery doesn't have to be automatic. It can be done by SQLite, by some recovery tool, or with manual intervention.
> - Said person maligns SQLite as being impossible to contribute; whereas the actual project only mentions that they may rewrite the proposed patch to avoid copyright implications.
Please provide citation on where I said that. Someone asked me to send a patch to SQLite, I linked them to the SQLite's page.
Without mentioning the exact set of cases where recovery is and isn't possible, going "PSA: SQLite is unreliable!!1one" is highly irresponsible. There's quite a bit of criticism going around, though; you could add it to your blog article :)
Please also consider the fact that SQLite being a transactional database, it is usually not possible to expose a WAL level error to the user. The correct way to address it is to probably come up with a list of cases where it is possible, and then send in a patch, or at least a proposal, of how to address it.
> Please provide citation on where I said that [SQLite is impossible to contribute].
https://news.ycombinator.com/item?id=44672563
On SQLite contribution, I did not say it's "impossible." I said it's not open to contribution. This is the exact phrase from the linked page.
You must be new to the site.
CRCs as used in SQLite are not intended to detect data corruption due to bit rot, and are certainly not ECCs.
Sure, the benefits to the incomplete write use case are limited, but there's basically no reason to ever use a fletcher these days.
It's also worth mentioning that the VFS checksums are explicitly documented as guarding against storage device bitrot and use the same fletcher algorithm.
There's no harm to having redundant checksums and it's not truly redundant for small messages. It's pretty common for systems not to have lower level checksumming either. Lots of people are still running NTFS/EXT4 on hardware that doesn't do granular checksums or protect data in transit.
Of course this is all a moot point because sqlite does WAL checksums, it just does them with an obsolete algorithm.
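For reference, the WAL checksum is a Fletcher-style construction over 32-bit words (two running sums, per the file-format documentation); a rough Python sketch, assuming little-endian words:

    import struct

    def wal_checksum(data, s0=0, s1=0):
        # Two 32-bit accumulators over pairs of 32-bit words, roughly as
        # described in the SQLite file-format docs for WAL frames.
        n = len(data) // 4
        words = struct.unpack("<%dI" % n, data[: n * 4])
        for i in range(0, n - 1, 2):
            s0 = (s0 + words[i] + s1) & 0xFFFFFFFF
            s1 = (s1 + words[i + 1] + s0) & 0xFFFFFFFF
        return s0, s1

    print(wal_checksum(b"example payload bytes goes here!"))   # 32 bytes -> 8 words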
There sure is: Redundant checksums need extra storage and extra processing. SQLite often runs on embedded systems, where both can come at a premium.
> Of course this is all a moot point because sqlite does WAL checksums, it just does them with an obsolete algorithm.
That's not nearly the only thing missing for SQLite to provide full resistance to lower-level data corruption. At a very minimum, you'd also need checksums of the actual database file.
At the database level (i.e. not just the WAL)? Are you sure?
> What I'm saying is that a fletcher is strictly worse than a CRC here.
I can't speak to the performance differences, but the only thing SQLite really needs the checksum to do is to expose partial writes, both due to reordered sector writes and partial intra-sector writes. (The former could also be solved by just using an epoch counter, but the latter would require some tricky write formats, and a checksum nicely addresses both).
In both cases, there's really nothing to recover: a CRC can't reconstruct an entire missing sector, and can recover almost no partially written sectors (i.e. unless the failure somehow happens in the very last bytes of it, so that the number of "flipped" bits is low enough).
Not all frames in the WAL are important. Sure, recovery may be impossible in some cases, but not all checksum failures are impossible to recover from.
I can't imagine picking the latter unless you were treating sqlite like a filesystem of completely unrelated blobs.
If I run three transactions where:
1. John gives $100 to Sue.
2. Sue gives $100 to Mark.
3. Mark gives $100 to Paul.
If sqlite just erases transaction (2), then Mark materializes $100 from nowhere. The rest of your database is potentially completely corrupted. At that point your database is no longer consistent - I can't see how you would "almost always" prefer this.
If (2) is corrupt, then the restore stops at (1), and you are guaranteed consistency.
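A toy sketch of the difference (hypothetical ledger replay, nothing to do with SQLite's internals): stopping at the first bad frame keeps a consistent prefix, while skipping it leaves a state that matches no real history.

    # frames in a toy WAL; pretend frame 1 fails its checksum
    txs = [("John", "Sue", 100), ("Sue", "Mark", 100), ("Mark", "Paul", 100)]
    corrupt = {1}

    def replay(skip_corrupt):
        balances = {"John": 100, "Sue": 0, "Mark": 0, "Paul": 0}
        for i, (src, dst, amt) in enumerate(txs):
            if i in corrupt:
                if skip_corrupt:
                    continue             # "recover" by skipping the bad frame
                break                    # stop at the first failure
            balances[src] -= amt
            balances[dst] += amt
        return balances

    print(replay(skip_corrupt=False))    # consistent prefix: only tx 1 applied
    print(replay(skip_corrupt=True))     # Mark pays out $100 he never received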
Which failures are possible to recover from?
First, force a re-read of the corrupted page from disk. A significant fraction of data corruption occurs while it is being moved between storage and memory due to weak error detection in that part of the system. A clean read the second time would indicate this is what happened.
Second, do a brute-force search for single or double bit flips. This involves systematically flipping every bit in the corrupted page, recomputing the checksum, and seeing if corruption is detected.
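A minimal sketch of that brute-force idea, using CRC-32 purely for illustration (the approach is the same for any checksum; anything past single or double flips quickly becomes impractical):

    import zlib

    def repair_single_bitflip(page, expected_crc):
        # Flip each bit once and see whether the checksum matches again.
        if zlib.crc32(page) == expected_crc:
            return page                  # a clean re-read already matches
        buf = bytearray(page)
        for i in range(len(buf) * 8):
            buf[i // 8] ^= 1 << (i % 8)
            if zlib.crc32(bytes(buf)) == expected_crc:
                return bytes(buf)        # plausible single-bit repair found
            buf[i // 8] ^= 1 << (i % 8)  # undo and keep searching
        return None                      # not a single-bit error

    good = b"some page contents" * 100
    damaged = bytearray(good)
    damaged[42] ^= 0x10                  # simulate a single flipped bit
    print(repair_single_bitflip(bytes(damaged), zlib.crc32(good)) == good)   # True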
Surely you mean on the memory bus specifically? SATA and PCIe both have some error correction methods for securing transfers between storage and host controller. I'm not sure about old parallel ATA. While I understand it can happen under conditions similar to non-ECC RAM being corrupted, I don't think I've ever heard or read about a case where a storage device randomly returned erroneous data, short of a legitimate hardware error.
The bare minimum you want these days is a 64-bit CRC. A strong 128-bit hash would be ideal. Even if you just apply these at the I/O boundaries then you'll catch most corruption. The places it can realistically occur are shrinking but most software makes minimal effort to detect this corruption even though it is a fairly well-bounded problem.
This is just basically how a WAL works, if you have an inconsistent state the transaction is rolled back - at that point you need to redo your work.
For instance, say you have a node A which has a child B:
* Transaction 1 wants to add a value to B, but it's already full, so B is split into new nodes C and D. Correspondingly, the pointer in A that points to B is removed, and replaced with pointers to C and D.
* Transaction 2 makes an unrelated change to A.
If you skip the updates from transaction 1, and apply the updates from transaction 2, then suddenly A's data is overwritten with a new version that points to nodes C and D, but those nodes haven't been written. The pointers just point to uninitialized garbage.
If you stop at the first failure, the database is restored to the last good state. That's the best outcome that can be achieved under the circumstances. Some data could be lost, but there wasn't anything sensible you could do with it anyway.
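A toy illustration of that dangling-pointer failure, with a dict standing in for pages on disk (hypothetical layout, not SQLite's page format):

    # On-disk pages before replay: A points at B.
    pages = {"A": {"children": ["B"]}, "B": {"keys": [1, 2, 3]}}

    # tx1 splits B into C and D and rewrites A; tx2 rewrites A again.
    tx1 = {"A": {"children": ["C", "D"]}, "C": {"keys": [1, 2]}, "D": {"keys": [3]}}
    tx2 = {"A": {"children": ["C", "D"], "note": "unrelated change"}}

    # Skip tx1 (pretend its frames failed the checksum) but apply tx2 anyway:
    pages.update(tx2)

    for child in pages["A"]["children"]:
        print(child, pages.get(child, "MISSING PAGE"))   # C and D were never written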
I would like it to raise an error and then provide an option to continue or stop. Since continuing is the default, we need a way to opt in to stopping on checksum failure.
Not all checksum errors are impossible to recover from. Also, as the post mentions, only some unimportant pages could be corrupt too.
My main complaint is that it doesn't give developers an option.
If what we're really interested in is the log part of a write ahead log - where we could safely recover data after a corruption, then a better tool might be just a log file, instead of SQLite.
Attempt to recover! Again, not all checksum errors are impossible to recover from. I hold the view that even if there is a 1% chance of recovery, we should attempt it. This may be done by SQLite, an external tool, or even manually. Since WAL corruption issues are silent, we cannot do that now.
There is a smoll demo in the post. In it, I corrupt an old frame that is not needed by the database at all. Now, one approach would be to continue the recovery and then present both states: one where the WAL is dropped, and another showing whatever we have recovered. If I had such an option, I would almost always pick the latter.
1. How does the database know that?
2. In your example, Alice gets the money from nowhere. What if another user had sent the money to Alice and that frame got corrupted? Then you just created 10,000,000 from nowhere.
At the very least, rolling back to a good point gives you an exact moment of time where transactions can be applied from. Your example is very contrived and in a database where several transactions can be happening, doing a partial recovery will destroy the consistency of the database.
I do see your point of wanting an option to refuse to delete the wal so a developer can investigate the wal and manually recover... But the typical user probably wants the database to come back up with a consistent, valid state if power is lost. They do not want to have the database refuse to operate because it found uncommitted transactions in a scratchpad file...
As a SQL-first developer, I don't pick apart write-ahead logs trying to save a few bytes from the great hard drive in the sky, I just want the database to give me the current state of my data and never be in an invalid state.
Yes, that is a very valid choice. Hence, I want databases to give me an option, so that you can choose to ignore the checksum errors and I can choose to stop the app and try to recover.
If you attempt to do the former in a system that by design uses checksums only for the latter, you'll actually introduce corrupted data from some non-corrupted WAL files.
It seems like you're focusing on a very specific failure mode here.
Also, what if the data corruption error happens during the write to the actual database file (i.e. at WAL checkpointing time)? That's still 50% of all your writes, and there's no checksum there!
You do have backups, right?
Instead of that, I'd prefer for it to fail fast.
I've written more about this here: https://news.ycombinator.com/item?id=44673991
Yes! But I am happy to accept that overhead with the corruption detection.
As I see it, either you have a lower layer you can trust, and then this would just be extra overhead, or you don't, in which case you'll also need error correction (not just detection!) for the database file itself.
The checksums are not going to fail unless there was disk corruption or a partial write.
In the former, thank your lucky stars it was in the WAL file and you just lose some data but have a functioning database still.
In the latter, you didn't fsync, so it couldn't have been that important. If you care about not losing data, you need to fsync on every transaction commit. If you don't care enough to do that, why do you care about checksums? That's missing the point.
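For reference, this is the knob being referred to (standard sqlite3 module; whether the compiled-in default is FULL or NORMAL depends on your build): in WAL mode, synchronous=FULL fsyncs the WAL on every commit, while NORMAL only syncs around checkpoints, so a power cut can roll back the most recent commits.

    import sqlite3

    con = sqlite3.connect("app.db")
    con.execute("PRAGMA journal_mode=WAL")
    con.execute("PRAGMA synchronous=FULL")   # fsync the WAL on every commit
    with con:
        con.execute("CREATE TABLE IF NOT EXISTS t(x)")
        con.execute("INSERT INTO t VALUES (1)")
    con.close()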
I wonder what that code would look like. My sense is that it’ll look exactly like the code that would run as if the transactions never occurred to begin with, which is why the SQLite design makes sense.
For example, I have a database of todos that sync locally from the cloud. The WAL gets corrupted. The WAL gets truncated the next time the DB is opened. The app logic then checks the last update timestamp in the DB and syncs with the cloud.
I don’t see what the app would do differently if it were notified about the WAL corruption.
> I want to correct errors that the DB wizard who implemented SQLite chose not to
When there's a design decision in such a high profile project that you disagree with, it's either
1. You don't understand why it was done like this.
2. You can (and probably will) submit a change that would solve it.
If you find yourself in the situation of understanding, yet not doing anything about it, you're the Schrodinger's developer: you're right and wrong until you collapse the wave function by putting your money where your mouth is.
It's very rarely an easy to fix mistake.
SQLite is not open to contribution - https://www.sqlite.org/copyright.html
> 1. You don't understand why it was done like this.
sure, I would like to understand it. That's why the post!
> In order to keep SQLite completely free and unencumbered by copyright, the project does not accept patches. If you would like to suggest a change and you include a patch as a proof-of-concept, that would be great. However, please do not be offended if we rewrite your patch from scratch.
Propose it.
Skipping a frame but processing later ones would corrupt the database.
> SQLite doesn’t throw any error on detection of corruption
I don’t think it’s actually a corruption detection feature though. I think it’s to prevent a physical failure while writing (like power loss) from corrupting the database. A corruption detection feature would work differently. E.g., it would cover the whole database, not just the WAL. Throwing an error here doesn’t make sense.
Honestly this sounds out of scope for normal usage of sqlite and not realistic. I had a hard time reading past this. If I read that correctly, they're saying sqlite doesn't work if one of the database files disappears from under it.
I guess if you had filesystem corruption it's possible that .db-shm disappears without notice and that's a problem. But that isn't sqlite's fault.
> If the last client using the database shuts down cleanly by calling sqlite3_close(), then a checkpoint is run automatically in order to transfer all information from the wal file over into the main database, and both the shm file and the wal file are unlinked.
Related: https://news.ycombinator.com/item?id=32200007
- there is an official checksum VFS shim, but I have never used it and don't know how good it is. The difference between it and the WAL checksum is that it works on a per-page level, and you seem to need to run the checksum checks manually and then decide for yourself what to do (see the sketch after this list)
- checksums (as used by the SQLite WAL) aren't meant for backup, redundancy, or data recovery (there are error-correcting codes focused on recovering a limited number of bits, but they have way more overhead than the kind of checksum used here)
- I also believe SQLite should indicate such checksum errors (e.g. so that you might engage out of band data recovery, i.e. fetch a backup from somewhere), but I'm not fully sure how you would integrate it in a backward compatible way? Like return it as an error which otherwise acts like a SQLITE_BUSY??
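Regarding the checksum VFS shim from the first bullet: a rough sketch of wiring it up from Python, assuming you've built cksumvfs.c as a loadable extension (the "./cksumvfs" path is a placeholder for your own build). Per its documentation, loading it registers a checksum VFS as the process default, databases created afterwards reserve 8 bytes per page for the checksum, and verification can be inspected with PRAGMA checksum_verification; reads that fail verification come back as I/O errors.

    import sqlite3

    # Register the checksum VFS first; it becomes the default once loaded.
    boot = sqlite3.connect(":memory:")
    boot.enable_load_extension(True)
    boot.load_extension("./cksumvfs")        # path to your own build (assumption)
    boot.close()

    # Databases created from here on carry per-page checksums.
    con = sqlite3.connect("checked.db")
    print(con.execute("PRAGMA checksum_verification").fetchone())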
Data in the WAL should be considered to be of "reduced durability".
Merkle hashes would probably be better.
google/trillian adds Merkle hashes to table rows.
sqlite-parquet-vtable would work around broken WAL checksums.
sqlite-wasm-http is almost a replication system.
Re: "Migration of the [sqlite] build system to autosetup" https://news.ycombinator.com/item?id=41921992 :
> There are many extensions of SQLite; rqlite (Raft in Go), cr-sqlite (CRDT in C), postlite (Postgres wire protocol for SQLite), electricsql (Postgres), sqledge (Postgres), and also WASM: sqlite-wasm, sqlite-wasm-http, dqlite (Raft in C),
> awesome-sqlite
From "Adding concurrent read/write to DuckDB with Arrow Flight" https://news.ycombinator.com/item?id=42871219 :
> cosmos/iavl is a Merkleized AVL tree. https://github.com/cosmos/iavl
/? Merkle hashes for sqlite: https://www.google.com/search?q=Merkle+hashes+for+SQlite
A git commit hash is basically a Merkle tree root, as it depends upon the previous hashes before it.
Merkle tree: https://en.wikipedia.org/wiki/Merkle_tree
(How) Should merkle hashes be added to sqlite for consistency? How would merkle hashes in sqlite differ from WAL checksums?
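Not an answer, but to make the comparison concrete: a Merkle root is a single hash that commits to every page (any change anywhere moves the root, and walking the tree localizes the bad page), whereas the WAL checksum is a running per-frame chain that only covers whatever happens to be in the log. A tiny sketch of the former:

    import hashlib

    def merkle_root(pages):
        # Hash each page, then pair hashes upward until one root remains.
        level = [hashlib.sha256(p).digest() for p in pages]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])          # duplicate last node on odd levels
            level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0].hex()

    pages = [b"page-%d" % i for i in range(8)]
    root = merkle_root(pages)
    pages[3] = b"corrupted"
    print(root == merkle_root(pages))            # False: one bad page changes the root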