Looks like it's not a system being down but a buggy frontend code update.
udosan · 2d ago
iOS app is also affected
stn8188 · 2d ago
Came here to check if this was the case, I'm seeing issues on the Android application. Thank you for confirming that it's not just me!
brongondwana · 2d ago
There are a few different threads on this and now that things are in a stable place I'm going to cross post this to all of them!
The larger context is that we're making a major change to how we create IDs for email and mailboxes over the JMAP protocol. The old IDs are a UUID for mailboxId and the first 25 chars of the sha1 of the message for the emailId, prefixed by an 'M'. The new IDs are the createdmodseq for the mailbox prefixed by a 'P' (these are pretty short for most users) and a reverse counter of nanoseconds of the message internaldate (delivery time) for the emailId. This gives good storage density for offline and good data locality in databases for the email listings.
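To make the new ID shapes concrete, here is a minimal sketch in C. It is not the cyrus-imapd code. The function names, the subtraction from `UINT64_MAX` as the "reverse counter", and the fixed-width hex and decimal encodings are all assumptions for illustration; the real encoding may well differ. It only shows why a reversed timestamp sorts newest-first and why a short counter-based mailbox ID is compact.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical width: 16 hex digits hold any 64-bit value. */
#define EMAIL_ID_LEN 16

/* Assumption: "reverse counter" means subtracting the internaldate
 * (nanoseconds since the epoch) from the largest 64-bit value, so that
 * later deliveries get numerically smaller IDs. */
uint64_t reverse_nanos(uint64_t internaldate_ns)
{
    return UINT64_MAX - internaldate_ns;
}

/* Write a fixed-width, zero-padded lowercase hex ID into out.
 * Fixed width makes plain strcmp() order match numeric order.
 * The hex encoding is an assumption, not the real wire format. */
void make_email_id(uint64_t internaldate_ns, char *out, size_t outlen)
{
    assert(outlen >= EMAIL_ID_LEN + 1);
    snprintf(out, outlen, "%016" PRIx64, reverse_nanos(internaldate_ns));
}

/* Mailbox ID as described above: the createdmodseq prefixed by 'P'.
 * Decimal encoding is an assumption. */
void make_mailbox_id(uint64_t createdmodseq, char *out, size_t outlen)
{
    snprintf(out, outlen, "P%" PRIu64, createdmodseq);
}
```

With IDs built this way, sorting the email IDs as plain strings lists the most recently delivered message first, which is the data locality property mentioned above for email listings.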
This morning we rolled out a build which we'd tested extensively on our staging and staff servers, but missed that for older v19 mailboxes which hadn't been upgraded to v20, the code to check if messages belonged in a thread incorrectly marked them all as missing.
This made MOST emails appear missing for most customers, clearly a very bad situation.
We immediately rolled back, but in the hurry missed that an unrelated change to correct subject matching for some languages (Japanese users had reported the issue, but possibly others as well) had changed the thread version, so reads of new threads then failed (making some, though many fewer, messages appear blank in the UI). There were about 50 million attempts to read those values across 15,000 users, because our UI kept retrying: the previous request had told it there was a Thread to fetch data for, so it assumed this was just a temporary synchronisation issue. Ouch. https://github.com/cyrusimap/cyrus-imapd/pull/5527 contains those changes.
Anyway, since the only difference between the old and new records was normalisation of subjects, I wrote a tiny patch to let the old code read the newer records and just deployed that, which made all the emails re-appear for everyone again. This is the one bit of code from all this which isn't in a public repo, but it's two lines of: if (version == 2) version = 1;
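For context, here is a rough sketch of where a shim like that could sit in a version check. Only the `if (version == 2) version = 1;` line comes from the description above. The function name `check_thread_version`, the constant, and the return codes are hypothetical and are not taken from cyrus-imapd.

```c
#include <stdio.h>

/* Hypothetical: the rolled-back code only understands version 1 records. */
#define THREAD_VERSION_SUPPORTED 1

/* Returns 0 if a record with this version can be parsed, -1 otherwise.
 * Illustrative only; the real reader is structured differently. */
int check_thread_version(int version)
{
    /* The shim quoted above: version 2 differs only in how subjects are
     * normalised, so the old code can read it as if it were version 1. */
    if (version == 2)
        version = 1;

    if (version != THREAD_VERSION_SUPPORTED) {
        fprintf(stderr, "unsupported thread record version %d\n", version);
        return -1;
    }
    return 0;
}
```

The shim is only safe because the two record versions share a layout; any version the old code has never seen still fails the check.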
But we'll wait until Monday to upgrade again, when we have fresh eyes available to watch that it's OK.
...
P.S. this is almost entirely unrelated to the UI changes. The underlying reason we're doing these changes IS related to UI changes, it's there to make offline mode use storage more efficiently on your device because the IDs are smaller and provide better data locality, but the timing is purely coincidental. The Cyrus changes have been done almost exclusively by the team in the USA and the UI changes by the team in Australia, and our deploy timelines were not synchronised.
antongribok · 2d ago
Thank you for this detailed reply!
I only want to add a small suggestion. I get that large distributed production systems will occasionally go down, but it would be great if you could look into reducing the latency of your status page.
By my count there was at least a 35 minute delay between when things broke and when the status page (https://fastmailstatus.com) was updated.
Also, I think it would have been nice to have a bit more explanation on this event than simply "database issues" [1]. Knowing that this was related to an upgrade would have made me feel a bit better between the time the status page was updated and the time the issue was resolved.
Thank you for your hard work and an excellent email service!
Fingers crossed it will be fixed really soon.
https://news.ycombinator.com/item?id=44824034
https://fastmailstatus.com/
You can see all the code for that in a handful of merge requests in the public cyrus-imapd repository on github at https://github.com/cyrusimap/cyrus-imapd/
Over the past few weeks, I've been helping out with the last bits of code modification, largely the changes on https://github.com/cyrusimap/cyrus-imapd/pull/5539 if you're interested.
Meanwhile, the real bug is fixed in https://github.com/cyrusimap/cyrus-imapd/pull/5553 and a test has been written to prove it: https://github.com/cyrusimap/cyrus-imapd/pull/5554
-A long time customer.
[1]: https://fastmailstatus.com/cme1fq7ej002dh0iu6z8pey4f