noticeboard.ru.ac.za

2005/08/01 - IMAP Server Instability
Shortly after 5PM this evening, the University's IMAP mail server crashed for reasons unknown. The machine was duly rebooted and service was restored at 18:21. Shortly after 10PM, the server crashed again. This time it recovered on its own, with service being restored at 22:26.

This server (imap.ru.ac.za, aka elephant) stores incoming e-mail and gives users' e-mail clients (Pegasus Mail, Outlook, Thunderbird, etc.) a way to access it, via either IMAP or POP3. It also houses the widely used Horde/IMP webmail client. During these two outages, none of these services would have been available, meaning that users could not read their e-mail.

Outgoing e-mail is handled by a separate server (mail.ru.ac.za) and this server was unaffected by this outage. This means that users should still have been able to send e-mail (unless of course they were attempting to use the webmail client).

The transport protocols used by e-mail are very resilient and tolerate this sort of failure as part of their normal operation. As a result, no e-mail will have been lost because of these two outages. Any remote mail server that attempted to deliver e-mail during the outages would simply have queued the e-mail and tried to send it again later. This may result in delivery being delayed for a number of hours after service was restored, as servers only retry their deliveries periodically.
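
To illustrate why mail can arrive some time after service is restored, here is a rough sketch of the queue-and-retry behaviour described above. The retry schedule, the function name and the example times are assumptions made up for illustration; real mail servers each use their own configurable schedules, and we only know that the first outage began "shortly after 5PM".

```python
from datetime import datetime, timedelta

# Hypothetical retry schedule: minutes to wait before each successive attempt.
# Real mail servers use their own, configurable schedules.
RETRY_DELAYS_MIN = [15, 30, 60, 120, 240]


def delivery_time(first_attempt, outage_start, outage_end, delays=RETRY_DELAYS_MIN):
    """Return when a queued message finally gets through, or None if retries run out."""
    attempt = first_attempt
    remaining = list(delays)
    # Keep retrying for as long as an attempt lands inside the outage window.
    while outage_start <= attempt < outage_end:
        if not remaining:
            return None  # sender gives up (it would normally bounce the message)
        attempt += timedelta(minutes=remaining.pop(0))
    return attempt


# Assumed times: outage from 17:05 until service was restored at 18:21.
outage_start = datetime(2005, 8, 1, 17, 5)
outage_end = datetime(2005, 8, 1, 18, 21)
first_try = datetime(2005, 8, 1, 17, 10)

print(delivery_time(first_try, outage_start, outage_end))
```

With these assumed numbers the sender tries at 17:10, 17:25 and 17:55 (all during the outage) and next at 18:55, so the message arrives 34 minutes after service came back. A sender with longer retry intervals would take correspondingly longer.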

At the moment we're not entirely sure of the reasons for the two crashes, and as such, it is likely that the machine will crash again. We've made some configuration changes in an attempt to get more useful debugging information in the event of this happening. We'll also further analyse what little information we have during the course of tomorrow morning. Updates and further information may be posted to this thread at a later stage once we've got a clearer picture of what's gone wrong.
At about 10:15 this morning, the IMAP mail server crashed again. Service was restored at 10:38. We intend to build a debugging kernel for the machine in an attempt to get a useful crash dump from it. Installing the kernel will involve a few minutes' downtime.
At about 2:45 this morning, the server crashed again and has not recovered on its own. Service will hopefully be restored by 8:00 AM. With luck, this time we'll have got some useful debugging information out of it.

Update: Service was restored at 8:34. The initial disk checks failed and required manual intervention -- this took longer than expected.
We've now managed to glean some useful information from the most recent crash. It appears that the problem is in the network subsystem and primarily affects multi-processor machines (elephant is a dual Xeon). By the looks of things, we're not the only people who are experiencing this problem.

It also appears that there is a workaround that will get the machine stable again at the expense of some network performance (something that shouldn't be noticeable at all in this situation, because the network traffic to the machine in question is reasonably low). We've applied this workaround in the hope that it'll prevent another crash.

In the meantime we're going to search for a real solution to the problem. This may involve a reboot at a later stage.