2016/01/05 - Ongoing problems: IMAP mail


The IMAP mailbox servers have been suffering ongoing performance problems for several months: at random, unpredictable intervals, access to mailboxes becomes very slow, sometimes to the point of mail clients timing out. This affects all email clients, but is more noticeable in some than others. In particular, the webmail client is badly affected. It also appears to affect the backend serving (predominantly) staff mailboxes more than the one serving (predominantly) student mailboxes, in spite of the fact that there are more users hosted on the latter.

Unfortunately, when this happens people appear to start checking mail more frequently in an attempt to access their mail. This substantially compounds the issue -- the best response would be to close your mail client and try again in half an hour or so.

Over the last few weeks, including during shutdown, we've tried a number of things to identify and address the root cause of the problem. Whilst we do not have a conclusive explanation, it seems that the most likely cause is contention on one of the databases that controls delivery. As shutdown commenced we performed emergency maintenance during which we upgraded the software running on the IMAP servers. This upgrade promised significant performance improvements in the area that seems to be the most likely cause of our problems (it introduced a new database format).

Unfortunately the upgrade has introduced additional instability, which manifested a few times during the shutdown period and again yesterday. Moreover, whilst performance during the shutdown was very good, yesterday demonstrated that the change does not seem to have solved the underlying performance problem.

It is possible -- in fact highly likely -- that the additional instability is completely unrelated to the performance problems. It would appear that instead it is caused by older mail clients making use of outdated security protocols and handling rejection of these poorly. We've been unable to pinpoint exactly which client(s) are involved. However, people who are still making use of Pegasus Mail in particular should migrate to a more modern client.

Last night another change was made to try to address the performance problem. However, this revealed serious corruption in the database that records the location of individual mailboxes, which resulted in one of the backend servers being unavailable for several hours (meaning approximately half of users could not access mail).

Early this morning, further changes were made in an attempt to address this. Most notably, we've reverted the most critical of the databases (and the one that was corrupt last night) back to the original pre-shutdown format.

During the course of today, we'll establish whether this improves the situation. However, even if it does, it is likely that further emergency maintenance will be required.

As part of addressing the database corruption problem, we've found that there are mail folders on disk that are no longer available via IMAP. It is likely that these folders were previously marked for deletion but never actually removed from disk. We're running a reconstruct process that will make those folders appear via IMAP again.

If a folder you've previously deleted re-appears and you don't need it, you can simply delete it again.

We've discovered some minor filesystem corruption on one of the IMAP backend servers. It would seem data has been lost. However, if a mailbox that has not been accessed for a long time (since before 2014-10-28) is accessed, it sometimes causes the backend to crash and reboot.

This is a documented bug, and we're working on resolving it.