noticeboard.ru.ac.za

2012/08/06-11 - IMAP Server Instability/Emergency Maintenance
The University's mailbox server (imap.ru.ac.za) has been unstable of late, and in particular during the course of today.

There have been a number of factors that have contributed to this, several of which have been eliminated over the course of the last week or so. Each change has had the effect of moving the problem elsewhere in the system, without resolving it.

It now seems that the root cause of the problem is an unusually large volume of internal email being delivered, particularly to students. This includes a significant volume of mail to large mailing lists (such as studentnews@lists and oppidan@lists), as well as mail from the RUConnected learning management system. The latter probably relates to a large number of new course enrolments at this time of year, and is exacerbated by Moodle's inefficient use of mail systems.

This has resulted in significant backlogs (at one stage this evening there were over 10,000 messages pending delivery, some with up to a hundred recipients each). The high delivery rate has also had a substantial knock-on effect -- it has resulted in slow access to mailboxes, particularly at peak times of the day.
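To give a feel for why the queue length understates the work involved, here is a rough sketch of the arithmetic. It assumes each recipient costs one separate mailbox delivery (an assumption for illustration, not a measurement of our system), and the function name is hypothetical. Since only some messages had a hundred recipients, the true figure lies somewhere between the two bounds.

```python
def delivery_bounds(messages: int, max_recipients: int) -> tuple[int, int]:
    """Return (fewest, most) mailbox deliveries a queue could represent.

    Assumes one delivery per recipient: the lower bound is every message
    having a single recipient, the upper bound every message having
    `max_recipients` recipients.
    """
    return messages, messages * max_recipients


# Figures quoted above: over 10,000 queued messages, up to 100 recipients each.
fewest, most = delivery_bounds(10_000, 100)
print(f"between {fewest:,} and {most:,} mailbox deliveries")
```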

This evening, emergency maintenance has been done on the IMAP server, both in an effort to eliminate the backlog and in the hope of improving the situation going forward. As a result, the IMAP server has been intermittently unavailable during the evening.

At the time of writing, the backlogged queue has been completely eliminated; all pending mail has been delivered. In addition, the IMAP server appears to be functioning normally and is significantly more responsive than it has been of late. However, it remains to be seen whether this is a function of the time of day, and whether the changes made this evening will bring any marked improvement when load picks up.

Over the course of the last week or so we've made a significant number of changes to the configuration of the IMAP servers in an effort to tune their performance, so that they can cope with the new volumes of mail.

In general the way we do this is to make a change during a low utilisation period (often after hours), and then wait for the next morning's peak load to determine its effectiveness. It is important that we only make a small number of changes at any one time in order to preserve the relative stability of the system (thus far no email has been lost; it has just been delayed); if we make too many changes at once we risk making the situation worse, and being unable to undo the changes we've made. This unfortunately drags out the process, since the main peak of load only happens once a day, on working days, between 8.30AM and 10AM.

The changes we've made have significantly improved the capacity of the IMAP servers, and things have improved slightly. However it seems that the increase in mail volume is such that even these changes are not yet enough; we still haven't broken the back of the problem. We'll continue working on this over the weekend and, if things haven't improved on Monday morning, we'll start making more drastic changes.

It is worth pointing out that the root cause of this problem is that we've seen a 200-600% increase in mail volumes in the last two weeks -- we've gone from an average of 50-60 thousand messages a day to peaking at over 300 thousand. The increase is very sudden (within two weeks), and as yet largely unexplained. The sudden change has taken us completely by surprise; our capacity planning, based on the growth over the last 18 months, suggested we had a long way to go before we had any problems.
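As a quick check on those figures, the sketch below works out the percentage increase from the old daily average to the quoted peak. The function name is ours, purely for illustration. At a peak of 300 thousand messages the increase comes to roughly 400-500%, which sits within the 200-600% range given above.

```python
def percent_increase(before: float, after: float) -> float:
    """Return the increase from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100


# Figures quoted above: a 50-60 thousand daily average, peaking at over 300 thousand.
PEAK = 300_000
for baseline in (50_000, 60_000):
    increase = percent_increase(baseline, PEAK)
    print(f"{baseline:,} -> {PEAK:,} messages a day: {increase:.0f}% increase")
```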


Through the , we've just learnt of a change that should significantly improve the performance of our mail servers. This involves disabling a feature that's seldom used, and in the process halving the number of disk accesses required to deliver a message. Our initial tests show that we're now (subjectively) able to deliver mail much faster than we were this morning.

We're hopeful that this will fix the ongoing problem, but we'll only be able to fully determine this on Monday morning.

The IMAP servers have just been rebooted to increase the hardware resources available to them. We're not convinced this will solve the problem, but it will help us eliminate resources as the cause of the current problems. The reboot will have caused an outage of roughly 10 minutes on the mail system.

We're also in the process of configuring a new mailbox backend server, which will be used to load-balance mailboxes. Again, this is an experiment to help us eliminate a possible cause of contention. This will likely go into service later today, and there should be no impact on services when it does.
