I'm curious as to why you choose to break out specific headers in the schema.
For example, you have recipients, subject, and sender as JSON fields, when you could have just a headers field with all of them, and even add the rest of the headers in the message.
If it's performance related, you can still have headers as a single json blob and then use generated columns for the specific fields.
For example
CREATE TABLE IF NOT EXISTS "messages" (
"id" INTEGER NOT NULL PRIMARY KEY, -- internal id
"message_id" TEXT NOT NULL, -- Gmail message id
"thread_id" TEXT NOT NULL, -- Gmail thread id
"headers" JSON NOT NULL, -- JSON object of { "header": value },
"subject" TEXT GENERATED ALWAYS AS (json_extract("headers", '$.Subject')) VIRTUAL NOT NULL)
...
);
CREATE INDEX subjectidx on messages(subject);
I've found this model really powerful, as it allows users to just alter table to add indexed generated columns as they need for their specific queries. For example, if I wanted to query dkim status, it's as simple as
ALTER TABLE messages ADD dkim TEXT GENERATED ALWAYS AS (json_extract("headers", '$."Dkim-Signature"')) VIRTUAL NOT NULL);
CREATE INDEX dkimidx on messages(dkim);
SELECT dkim, COUNT(0) FROM messages GROUP BY dkim;
or whatever you want.
dotancohen · 35m ago
I see that you defined the `dkim` column as NOT NULL. So what happens when an email message does not contain the Dkim-Signature header?
xearl · 48m ago
TIL, thanks a lot!
terhechte · 1h ago
I build something to visualize huge amounts of email (such as from Gmail) some years ago:
Hey this is really neat! It's like those disk usage visualizers, except that it seems to focus on the total volume of the mail rather than the disk usage.
Is there a size option too? To see which senders are using most of my storage.
(Also your website's SSL certificate has expired.)
terhechte · 30m ago
No currently not. It would be easy to add though. I haven't updated the tool in a while (after using it to clean up my Gmail inbox). Thanks for pointing out the certificate!
TekMol · 1h ago
Shouldn't this be "imap to sqlite" or something? Why tie it to one specific email provider?
isaachinman · 24m ago
Because _it is_ specific to Gmail. It's using OAuth and presumable API access.
IMAP is much harder, and much slower, and is bound by Google's bandwidth limits.
yread · 1h ago
Would be nice to enable fulltext search as well
padjo · 55m ago
Yes! I find gmail’s full text search surprisingly bad given it’s run by a search company.
isaachinman · 23m ago
Agreed! One of the reasons we started working on Marco.
For example, you have recipients, subject, and sender as JSON fields, when you could have just a headers field with all of them, and even add the rest of the headers in the message.
If it's performance related, you can still have headers as a single json blob and then use generated columns for the specific fields.
For example
I've found this model really powerful, as it allows users to just alter table to add indexed generated columns as they need for their specific queries. For example, if I wanted to query dkim status, it's as simple as or whatever you want.https://github.com/terhechte/postsack
Is there a size option too? To see which senders are using most of my storage.
(Also your website's SSL certificate has expired.)
IMAP is much harder, and much slower, and is bound by Google's bandwidth limits.
https://marcoapp.io
I think it's a big eml file.