What If OpenDocument Used SQLite?

178 whatisabcdefgh 77 9/4/2025, 9:36:50 PM sqlite.org ↗

Comments (77)

agwa · 9h ago
If you're going to use SQLite as an application file format, you should:

1. Enable the secure_delete pragma <https://antonz.org/sqlite-secure-delete/> so that when your user deletes something, the data is actually erased. Otherwise, when a user shares one of your application's files with someone else, the recipient could recover information that the sender thought they had deleted.

2. Enable the options described at <https://www.sqlite.org/security.html#untrusted_sqlite_databa...> under "Untrusted SQLite Database Files" to make it safer to open files from untrusted sources. No one wants to get pwned when they open an email attachment.

3. Be aware that when it comes to handling security vulnerabilities, the SQLite developers consider this use case to be niche ("few real-world applications" open SQLite database files from untrusted sources, they say) and they seem to get annoyed that people run fuzzers against SQLite, even though application file formats should definitely be fuzzed. https://www.sqlite.org/cves.html

They fail to mention any of this on their marketing pages about how you should use SQLite as an application file format.
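For illustration, here's roughly what those settings look like with Python's built-in `sqlite3` module. The pragma names are real SQLite pragmas from the pages linked above, but the specific limit value is an arbitrary example:

```python
import sqlite3

def open_app_file(path, trusted=False):
    """Open an application's SQLite file with defensive settings."""
    conn = sqlite3.connect(path)
    # (1) Actually erase deleted content instead of leaving it in free pages.
    conn.execute("PRAGMA secure_delete = ON")
    if not trusted:
        # (2) Don't run SQL functions referenced from the file's own schema.
        conn.execute("PRAGMA trusted_schema = OFF")
        # Catch certain kinds of file corruption as pages are read.
        conn.execute("PRAGMA cell_size_check = ON")
        # Cap resource usage while parsing a possibly hostile file.
        # (Connection.setlimit is available from Python 3.11.)
        if hasattr(conn, "setlimit"):
            conn.setlimit(sqlite3.SQLITE_LIMIT_LENGTH, 1_000_000)
    return conn
```

Note that `PRAGMA trusted_schema` needs SQLite 3.31+, and older builds silently ignore pragmas they don't recognize, so verify the settings took effect rather than assuming.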

charleslmunger · 5h ago
>and they seem to get annoyed that people run fuzzers against SQLite, even though application file formats should definitely be fuzzed.

I think that's an unfair reading. SQLite runs fuzzers itself and quickly addresses bugs found by external fuzzers. There's an entire section in their documentation about their own fuzzers, thanking third-party fuzzing efforts, including credit to individual engineers.

https://www.sqlite.org/testing.html

The tone of the CVE docs is because people freak out about CVEs flagged by automated tools when those CVEs cover issues that have no security impact for typical usage of SQLite, or have prerequisites that would already imply some form of compromise.

Seattle3503 · 7h ago
Hrm, using sqlite as an application format would be a good use case for Limbo.
chris_wot · 7h ago
"Most applications can use SQLite without having to worry about bugs in obscure SQL inputs." And then they recommend SQLite as a document interchange format.
wzdd · 4h ago
Although this is indeed a worrying statement, it seems true to me. Most users of sqlite control the SQL they use. The problem I would expect from using a database document interchange format is that a maliciously crafted database could result in a CVE. The page acknowledges this possibility, even while pointing out (in their CVE list) that it hasn't happened so far, or is rare (it's hard to parse some of their descriptions).
munch117 · 2h ago
I'm not that concerned with bugs in sqlite. sqlite is high quality software, and the application that uses it is a more likely source of vulnerabilities.

But I do see a problem if you really need to use a sqlite that's compiled with particular non-default options.

Say I design a file format and implement it, and my implementation uses an sqlite library that's compiled with all the right options. Then I evangelize my file format, telling everyone that it's really just an sqlite database and sooo easy to work with.

First thing that happens is that someone writes a neat little utility for working with the files, written in language X, which comes with a handy sqlite3 library. But that library is not compiled with the right options, and boom, you have a vulnerable utility.

ncruces · 1h ago
Most of the recommended [1] settings are available on a per-connection basis, through PRAGMAs, sqlite3_db_config, sqlite3_limit, etc.; some are global, like sqlite3_hard_heap_limit64.

A binding can expose those settings. It's not a given that a third-party utility will use them, but it can.

1: https://www.sqlite.org/security.html

ncruces · 1h ago
An untrusted database file is not the same as untrusted SQL input.

There are parts of the SQL engine that are exposed to malicious file manipulation (the schema is stored as SQL DDL text) but that's not arbitrary SQL input.

If you want to highlight an inconsistency, this is way more worrying:

> “All historical vulnerabilities reported against SQLite require at least one of these preconditions: (…) 2. The attacker can submit a maliciously crafted database file to the application that the application will then open and query. Few real-world applications meet either of these preconditions…”

However, most of the rest of the page is speaking of arbitrary SQL input, not purposely broken database files.

nloomans · 25m ago
Interesting read! I find the idea to use SQL queries to get only the relevant data quite convincing. I do wonder how this would work in practice though. Any changes the user makes would have to be inserted with SQL to allow for the new data to be included in SQL queries, but users also expect to be able to make changes and then not save them (or save them into a different file).

Should one make a massive transaction that is only committed when saving? Is it possible to commit such a transaction to a different file when using Save As?

Or maybe for editing one would need to copy the file to a separate temporary location, constantly commit to that file, and when saving move the temporary file over the original file (this way we aren't losing the resilience against corruption SQLite offers).

Or is there a better way to do this? I don't like storing pending changes into the original file since it kinda goes against how users expect files to work (and could cause them to accidentally leak data).
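To sketch what I mean by the temporary-copy option (all names here are illustrative, and note `os.replace` is only atomic when both paths are on the same filesystem):

```python
import os
import shutil
import sqlite3
import tempfile

class Document:
    """Edits commit continuously to a scratch copy; the user's file
    changes only on an explicit Save (or Save As)."""

    def __init__(self, path):
        self.path = path
        fd, self.scratch = tempfile.mkstemp(suffix=".sqlite")
        os.close(fd)
        if os.path.exists(path):
            shutil.copyfile(path, self.scratch)
        # SQLite's crash resilience still protects the scratch copy.
        self.conn = sqlite3.connect(self.scratch)

    def save(self, path=None):
        """Save (or Save As, if a new path is given) by atomically
        replacing the target file."""
        self.conn.commit()
        target = path or self.path
        tmp = target + ".tmp"
        # Snapshot a consistent database with the backup API, then
        # rename over the target.
        out = sqlite3.connect(tmp)
        self.conn.backup(out)
        out.close()
        os.replace(tmp, target)
        self.path = target
```

Discarding changes is then just deleting the scratch file without saving.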

liuliu · 10h ago
One thing I would call out, if you use SQLite as an application format:

BLOB type is limited to 2GiB in size (int32). Depending on your use cases, that might seem high, or not.

People would argue that if you store that much binary data in an SQLite database, it's not really appropriate. But an application format usually has this requirement of bundling large binary data into one nice file, rather than many files that you need to copy together to make things work.

Retr0id · 9h ago
You can split your data up across multiple blobs
johncolanduoni · 5h ago
Also you almost certainly want to do this anyway so you can stream the blobs into/out of the network/filesystem, well before you have GBs in a single blob.
Retr0id · 41m ago
Singular sqlite blobs are streamable too! But for streaming in you need to know the size in advance.
bob1029 · 3h ago
This is essential if you want to have encryption/compression + range access at the same time.

I've been using chunk sizes of 128 megabytes for my media archive. This seems to be a reasonable tradeoff between range retrieval delay and per object overhead (e.g. s3 put/get cost).
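Roughly, the chunking pattern looks like this (schema and chunk size are illustrative; 1 MiB here rather than 128 MiB so the example stays small):

```python
import sqlite3

CHUNK = 1 << 20  # 1 MiB; tune for your range-retrieval vs overhead tradeoff

def init(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS chunks(
        file_id INTEGER, seq INTEGER, data BLOB,
        PRIMARY KEY (file_id, seq))""")

def write_file(conn, file_id, payload):
    # Split one logical file across many rows to dodge the 2 GiB blob cap
    # and keep individual reads/writes small.
    for seq, off in enumerate(range(0, len(payload), CHUNK)):
        conn.execute("INSERT INTO chunks VALUES (?, ?, ?)",
                     (file_id, seq, payload[off:off + CHUNK]))

def read_range(conn, file_id, offset, length):
    # Fetch only the chunks that overlap [offset, offset + length).
    first, last = offset // CHUNK, (offset + length - 1) // CHUNK
    rows = conn.execute(
        "SELECT data FROM chunks WHERE file_id = ? AND seq BETWEEN ? AND ? "
        "ORDER BY seq", (file_id, first, last)).fetchall()
    blob = b"".join(r[0] for r in rows)
    start = offset - first * CHUNK
    return blob[start:start + length]
```

This is also the shape that lets you encrypt or compress each chunk independently while keeping range access.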

chmaynard · 11h ago
Dr. Hipp occasionally gets on a soapbox and extolls the virtue of sqlite databases for use as an application file format. He also preaches about the superiority of Fossil over Git. His arguments generally make sense. I tolerate his sermons because he is one of the truly great software developers of our time, and a personal hero of mine.
korkor55 · 5h ago
These are thought-experiments to help better understand how SQLite works. This is exactly how supporting documentation should be written, so that people actually read it.

He even went over the top with the disclaimers.

lifthrasiir · 8h ago
SQLite can't be reliably used on networked file systems because it relies heavily on locking being implemented correctly. I recently had to add a check for such file systems in my application [1] because I noticed a related corruption firsthand. Simpler file formats don't impose such requirements. SQLite is certainly good, but not for this use.

[1] https://github.com/lifthrasiir/angel/commit/50a15e703ef2c1af...

kvdveer · 4h ago
In the context of this article, that's largely irrelevant: ZIP cannot be used in a multi-user scenario at all, so even if sqlite isn't perfect, it's still miles better than the ZIP format it replaces in this thought experiment.
chungy · 5h ago
That's pretty broad and over-generalized. A network file system without good lock support is almost always a bad setup by an administrator. Both NFS and CIFS can work with network-wide locks just fine.

SQLite advises against using a network file system to avoid potential issues, but you can do it successfully.

lifthrasiir · 48m ago
As noted in my other comment, those "potential" issues are real and do happen from time to time. Unless SQLite gives some set of configurations to avoid such issues, I can't agree that it's over-generalized.
jdboyd · 4h ago
Are the typical Synology, QNAP, or TrueNAS devices with default Linux, macOS, and Windows clients going to be configured correctly out of the box? If the typical setups someone arrives at by following wizards in a home or small office are likely to leave locking broken for SQLite, then it is fair for them to warn against using it on a network file system.

As an application format, you don't generally expect people to be editing an ODF file at the same time though, so network locking doesn't really disqualify it for use as a document format.

mschuster91 · 3h ago
> As an application format, you don't generally expect people to be editing an ODF file at the same time though

Oh hell yes you do. Excel spreadsheets are notorious for people wanting to collaborate on them, and PowerPoint decks come in a close second. It used to be an absolute PITA, but at least Office 365 makes the pains bearable.

afiori · 5h ago
In that case the application would keep a temporary file and copy over when saving
cpach · 4h ago
Maybe, but how would the application know if /data/foo.bar is a local file or mounted via NFS/SMB/etc?
afiori · 1h ago
It would always use such a temporary file and update the "real" file only on explicit saves, with fast mv or cp operations.
greenavocado · 8h ago
Easy fix is an empty lock file adjacent to the real one.
lifthrasiir · 8h ago
Yeah, but only if SQLite did support that mode in some built-in VFS implementation...
hedora · 7h ago
Which network filesystems are still corrupting sqlite files?

Sqlite on NFSv3 has been rock solid for some NFS servers for a decade.

Maybe name and shame?

lifthrasiir · 5h ago
Specifically I had an issue over 9p used by WSL2. (I never thought it was networked before this incident.)
floating-io · 12h ago
An interesting skim, but it would have been more meaningful if it had tackled text documents or spreadsheets to show what additional functionality would be enabled with those beyond "versioning".

Maybe it's just me, but I see the presentation functionality as one of the less used aspects of the OpenOffice family.

jdboyd · 4h ago
What he listed as the first improvement, "Replace ZIP with SQLite" would certainly apply to the other ODF formats.

He advocates breaking the XML into smaller pieces in SQLite. I suppose making each slide a new XML record could make sense. Moving over to spreadsheets, I don't know how ODF does it now, but making each sheet a separate XML could make sense.

Thinking about Write documents, I wonder what a good smaller unit would be. I think one XML per page would be too fine a granularity. You could consider one record per chapter. I doubt one record per paragraph would make sense, but it could be fun to try different ideas.
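For instance, a per-slide layout could look something like this (table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per slide: editing slide 37 rewrites only that row,
# instead of rewriting one monolithic XML stream.
conn.execute("""CREATE TABLE slide(
    seq INTEGER PRIMARY KEY,   -- slide order
    xml TEXT NOT NULL          -- that slide's XML fragment
)""")
conn.executemany("INSERT INTO slide VALUES (?, ?)",
                 [(i, f"<slide n='{i}'>...</slide>") for i in range(1, 51)])

# Loading the editor view of a single slide touches one row:
row = conn.execute("SELECT xml FROM slide WHERE seq = ?", (37,)).fetchone()
```

The same shape carries over to one row per sheet, or per chapter.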

maweki · 1h ago
Splitting the presentation into multiple fragments makes it more difficult to generate/alter a presentation using xslt.
scott_w · 5h ago
While reading, I was musing that one way to handle text could be a linked-list format for storage. To make that work, you'd need the editor to operate on a block concept, and I don't think document editors work like that?

Spreadsheets might be a little easier because you can separate out by sheet or even down to a row/column level?

Part of me wants to try it now…

bob1029 · 3h ago
> it is still bothersome that changing a single character in a 50 megabyte presentation causes one to burn through 50 megabytes of the finite write life on the SSD.

I used to worry a lot about this but it has never once actually come up for me. 50 megabytes is a pretty extreme example, but even so if you edit this document fewer than several million times it won't matter.

Serializing the object graph all over again can be way faster than mapping into a tabular model. There are JSON serializers that can push multiple gigabytes per second per core. It might even be the case that, once you factor in the SSD controller quirks, the tabular updates could cause more blocks to be written than just dumping a big fat json stream all at once.

gwd · 1h ago
Anki's storage format is SQLite (or was a few years ago). That made it really lovely when I wanted to import the contents (including the view logs) of an Anki deck I'd been using for a decade into a custom system I was designing. Just pop up the `sqlite3` REPL, poke around and see what it looks like, then write standard SQL queries to get the data out.
conorbergin · 11h ago
I've been trying out SQLite for a side project of mine, a virtual whiteboard. I haven't quite got my head around it, but so far it seems to be much less of a bother than interacting with file system APIs. The problem I haven't really solved is how sync, and maybe collaboration, will interact with it. So far I have:

1. Plaintext format (JSON or similar) or SQLite dump files versioned by git

2. Some sort of modern local first CRDT thing (Turso, libsql, Electric SQL)

3. Server/Client architecture that can also be run locally

Has anyone had any success in this department?

rogerbinns · 10h ago
SQLite has a builtin session extension that can be used to record and replay groups of changes, with all the necessary handling. I don't necessarily recommend session as your solution, but it is at least a good idea to see how it compares to others.

https://sqlite.org/sessionintro.html

That provides a C-level API. If you know Python and want to do some prototyping and exploration, you may find my SQLite wrapper useful, as it supports the session extension. Here's an example that gives a feel for what it's like to use:

https://rogerbinns.github.io/apsw/example-session.html

hahn-kev · 9h ago
CRDTs are the way to go if you need something very robust for lots of offline work.
sakesun · 11h ago
If I remember correctly Mendix project file format is simply a sqlite db. I thought the designer was lazy but it turns out it's a reasonable decision.

Recently, the DuckDB team raised a similar question about data lake catalog formats: why not just use a SQL database for that? It's simpler and more efficient as well.

atonse · 10h ago
Didn’t Apple actually move to SQLite for their Pages/Numbers format? I remember reading years ago that it was rocky (the transition), but was maybe eventually smoothed out?
mdaniel · 8h ago
Given n=1 https://freeiworktemplates.com/2022/05/pages-concessions-sta... seems to imply the answer is "no, it's a zip" and that seems to hold even for the interior files

  $ file concessions-stand-menu-template.pages
  concessions-stand-menu-template.pages: Zip archive data, at least v2.0 to extract, compression method=store

  $ unzip -l concessions-stand-menu-template.pages
  Archive:  concessions-stand-menu-template.pages
    Length      Date    Time    Name
  ---------  ---------- -----   ----
      58727  05-09-2022 13:27   Data/Artboard 2-26.png
      26993  05-09-2022 13:27   Data/Artboard 2-small-27.png
      11550  05-10-2022 08:13   Index/Document.iwa
        720  05-10-2022 08:13   Index/ViewState.iwa
        536  05-09-2022 12:41   Index/CalculationEngine-1686619.iwa
         23  07-02-2021 17:48   Index/AnnotationAuthorStorage-1686618.iwa
      43891  05-09-2022 12:41   Index/DocumentStylesheet.iwa
        229  05-09-2022 13:28   Index/DocumentMetadata.iwa
      17895  05-10-2022 08:13   Index/Metadata.iwa
        379  05-10-2022 08:13   Metadata/Properties.plist
         36  05-09-2022 12:41   Metadata/DocumentIdentifier
        268  04-29-2022 22:18   Metadata/BuildVersionHistory.plist
     135503  05-10-2022 08:13   preview.jpg
       1666  05-10-2022 08:13   preview-micro.jpg
      11057  05-10-2022 08:13   preview-web.jpg
  ---------                     -------
     309473                     15 files

  $ unzip concessions-stand-menu-template.pages Index/Document.iwa
  extracting: Index/Document.iwa

  $ file Index/Document.iwa
  Index/Document.iwa: data

  $ xxd -l 128 Index/Document.iwa
  00000000: 001a 2d00 bcae 0170 6408 0112 6008 904e  ..-....pd...`..N
  00000010: 1203 0100 0518 c90c 2209 0a03 0a01 3010  ........".....0.
  00000020: 0118 0122 0701 0b08 2e18 0109 1400 2f05  ..."........../.
  00000030: 14f4 a801 0b0a 050a 030f 0111 1003 1800  ................
  00000040: 2a27 daf8 66db f866 dcf8 66e6 f768 ddf8  *'..f..f..f..h..
  00000050: 66df ef66 def8 66d1 f666 fdf5 66d5 f566  f..f..f..f..f..f
  00000060: 8ff8 66df f866 86f9 6612 0408 def8 661a  ..f..f..f.....f.
  00000070: 0408 fdf5 6622 0408 8ff8 6632 0408 dfef  ....f"....f2....
nashashmi · 45m ago
FWIW, AutoCAD uses a database format for its file data.
sgc · 11h ago
It seems like it would be relatively straightforward to make an SQLite-based file format and just have users add a plugin if for some reason they couldn't upgrade their older version of LibreOffice, etc. I agree with the other commenter who mentioned that the benefits for text and spreadsheet files need more explanation. But it seems like a good enough idea for a LibreOffice working group to perform a more in-depth study. If the significant memory reduction is real and would translate to fewer crashes, it would be a huge boost even if it had no other benefits, IMHO.
supportengineer · 11h ago
What if, instead of APIs for data sets, we simply placed an SQLite file on a web server as a static asset, so you could just periodically do a GET and have a local copy?
abtinf · 11h ago
A few years ago someone posted a site that showed how to query portions of a SQLite file without having to pull the whole thing down.
dbarlett · 11h ago
supportengineer · 10h ago
>> I implemented a virtual file system that fetches chunks of the database with HTTP Range requests

That's wild!
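It works because an SQLite file is an array of fixed-size pages, with the page size recorded in the 100-byte file header, so "give me page N" maps to exactly one byte range. A sketch (the header layout is from SQLite's file-format documentation):

```python
import struct

def page_range(header_bytes, pgno):
    """Map page number N to the (offset, length) byte range a VFS
    would fetch, given the first 100 bytes of the database file."""
    # The page size is stored at byte offset 16 of the header as a
    # big-endian 16-bit integer; the stored value 1 means 65536.
    raw = struct.unpack(">H", header_bytes[16:18])[0]
    page_size = 65536 if raw == 1 else raw
    # Pages are numbered from 1, so page N starts at (N-1) * page_size.
    return ((pgno - 1) * page_size, page_size)
```

An HTTP-range VFS does essentially this, issuing a `Range: bytes=offset-end` request per page (plus caching).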

yupyupyups · 11h ago
This works as long as the data is "small" and you have no ACL for it. Assuming you mean automatic downloads.

Devdocs does something similar, but there you request to download the payload manually, and the data is still browsable online without you having to download all of it. The data is also split in a convenient manner (by programming language/library). In other words, you can download individual parts. The UI also remains available offline, which is pretty cool.

https://devdocs.io/

abtinf · 11h ago
With an S3 object lambda, I suppose you could generate the sqlite file on the fly.
anon291 · 11h ago
You can do this today by using the WASM-compiled SQLite module with a custom Javascript VFS that implements the SQLite VFS api appropriately for your backend. I've used it extensively in the past to serve static data sets direct from S3 for low cost.

More industrious people have apparently wrapped this up on NPM: https://www.npmjs.com/package/sqlite-wasm-http

jll29 · 9h ago
I love SQLite.

As a document _exchange_/_interchange_ format, what I prefer for durability is a non-binary format (e.g. XML based).

For local use, I agree SQLite might be much faster than ZIP, and of course the ability to query based on SQL has its own flexibility merits.

thayne · 6h ago
XML isn't great for exchange/interchange either due to security problems and inconsistencies in implementations. A big part of the problem is that xml has a lot of complexity, which leads to a bigger attack surface when parsing and processing untrusted data. And then xml entities are just inherently insecure, unless you disable some of their capabilities (like using remote files, and unlimited recursion).

That said, creating a format that can convey rich untrusted data is a hard problem.

jdboyd · 4h ago
Part of the problem, though, with saying SQLite instead of XML is that a lot of applications would end up storing XML inside SQLite anyway.
Ekaros · 3h ago
Complex features are inherently complex. Say you want external resources or some scripts in a document. No matter what storage format you use, those are additional attack surfaces. The problem is not storage but what is done with the information. And very often that is a lot, poorly thought out, and even more poorly implemented.
thayne · 2h ago
But most applications don't need those features. And if they do, that should be part of the application logic, with appropriate controls. Having your parsing library make arbitrary http requests is a bad idea.
thayne · 3h ago
Oh, I'm not saying sqlite is better than xml for data exchange. As mentioned in other comments, sqlite's security posture towards an untrusted database is problematic. My point is that xml has problems too.
RainyDayTmrw · 11h ago
Juggling all the fragments inside the database, garbage collecting all the unused ones, and maintaining consistency are all quite challenging in this use case.
est · 5h ago
Has anyone actually used the `content BLOB` pattern at a larger scale? Suppose I have tens of thousands of small JPEGs — would they be better off in a .sqlite file?
shlomo_z · 5h ago
SQLite claims [1] this is a good use case.

1. https://www.sqlite.org/fasterthanfs.html
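The pattern in question is just a keyed blob table; a minimal sketch (names are invented):

```python
import sqlite3

def make_store(path=":memory:"):
    # One row per image; for ~10 KB files, reads skip the per-file
    # open/close overhead that the fasterthanfs page benchmarks against.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS images("
                 "name TEXT PRIMARY KEY, data BLOB NOT NULL)")
    return conn

def put(conn, name, data):
    conn.execute("INSERT OR REPLACE INTO images VALUES (?, ?)", (name, data))

def get(conn, name):
    row = conn.execute("SELECT data FROM images WHERE name = ?",
                       (name,)).fetchone()
    return row[0] if row else None
```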

tombert · 10h ago
I remember I played with some software called "The Illumination Software Creator" [1], and I remember the saved project files were just SQLite databases.

I actually thought it was kind of cool, because I was able to play with it easily with some SQLite explorer tool (I forget which one) and I could easily look at how the save files actually worked.

I haven't really used SQLite for anything serious [2], but always found the idea of it kind of charming. Maybe I should dust it off and try it again.

[1] https://en.wikipedia.org/wiki/Illumination_Software_Creator, by Bryan Lunduke, before I realized how much of a pseudo-intellectual dimwit he is.

[2] At least outside of the "included" database in a few web frameworks.

wakawaka28 · 5h ago
What is it that makes you think Lunduke is pseudo-intellectual? He certainly doesn't try to pose as a scholar. If you are like most of his haters, you just refuse to believe that smart people can be conservatives.
kstrauser · 5h ago
There’s no way to discuss Lunduke without getting into politics, so I’ll leave it that Lunduke is clearly a very intelligent person who IMO mistakes his knowledge in some areas for general expertise in other unrelated fields.

It’s a common trap to fall into. See also: Ben Carson. Both of them are obviously intelligent and highly skilled in their professional fields. And both have let that convince themselves that they know everything about everything.

wakawaka28 · 5h ago
I don't think Lunduke is a Ben Carson type. That would be ridiculous. He has opinions about things outside his area of expertise, like all of us, but he also has some unique experiences like having worked for Microsoft and OpenSUSE. His opinions on tech are pretty solid. I also agree with his politics for the most part.
kstrauser · 5h ago
I would hear what he has to say about his tech experiences. I would not be in a room where he was discussing his politics.
librasteve · 12h ago
wouldn’t an XML database be easier?
duskwuff · 11h ago
You can't* index into XML. You have to read through the whole document until you get to the part you want.

*: without adding an index of your own, at which point it isn't really XML anymore, it's some kind of homebrew XML-based archive format.

HelloNurse · 2h ago
You can store the content of an XML document in a database faithfully enough to reconstruct it exactly. Any system that can produce XML documents is an "XML database".
floating-io · 12h ago
Does an embeddable XML database engine exist at a similar level of reliability?
jsight · 8h ago
They could resurrect xindice!
supportengineer · 11h ago
No.
Zambyte · 8h ago
Why?
renecito · 11h ago
LOL!
ignoramous · 9h ago
> SQLite database has a lot of capability, which this essay has only begun to touch upon. But hopefully this quick glimpse has convinced some readers that using an SQL database as an application file format is worth a second look.

It really is. One of the experiments we've been doing recently to make bug reporting from Android devices easier (and, to an extent, reduce user frustration and fatigue) is to store unstructured app logs in an in-memory SQLite table. It lends itself very well to on-device LLMs (like Gemma 3n or Qwen2.5 0.5b), as users can Q&A to learn just what the app is doing and why it won't work the way they want it to. On-device LLMs are limited (context length and/or embeddings), and too many writes (in batches of 1000 rows) to the in-memory SQLite table (surprisingly) eat up battery like no tomorrow, so this "chat to know what the app is doing" isn't rolled out to everyone, yet.
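A rough sketch of the logging side of this (schema is invented; the batch size mirrors the 1000-row batches described above):

```python
import sqlite3
import time

# An in-memory log table with batched inserts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log(ts REAL, tag TEXT, msg TEXT)")

buffer = []

def log(tag, msg, batch=1000):
    buffer.append((time.time(), tag, msg))
    if len(buffer) >= batch:
        flush()

def flush():
    # One transaction per batch keeps per-row overhead (and, on a
    # phone, battery cost) down vs. autocommitting every row.
    with conn:
        conn.executemany("INSERT INTO log VALUES (?, ?, ?)", buffer)
    buffer.clear()
```

The table is then directly queryable, whether by SQL or by an on-device model fed query results.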

treyd · 5h ago
What kinds of queries are being done on the logs such that it makes sense to use sqlite instead of, like, just a ring buffer?
mschuster91 · 3h ago
The problem they're alluding to, I think, isn't the query side, it's the creation side. adb logcat and logging in Android in general is one hell of a clusterfuck, not being helped by logging in Java being a PITA.
mac-attack · 12h ago
I'm a fan of both as a Linux user. Interesting thought experiment.