noticeboard.ru.ac.za

2004/09/01 Mail Server Down
elephant (mail.ru.ac.za/imap.ru.ac.za) has just crashed and is currently running consistency checks on its disk array. This normally takes about twenty minutes, after which the services provided by this machine should be available again.

We've worked out that this machine's instability is caused by the amount of shared memory being used by the web server processes. We will take steps to address this during the course of today, so there may be more unscheduled downtime later today.

While elephant is down, users will be unable to send or read e-mail, or to access the mailman mailing list manager or the RT3 report tracking system. All e-mail from external sites to Rhodes will be automatically queued and will be delivered once service is restored. No e-mail should be lost at any stage.

The problems with the mail server continue. To minimise the impact, mail.ru.ac.za has just been temporarily moved to a different machine. This means that while you might not be able to read your e-mail at times, you should still be able to send it.

One of the implications of this is that people whose mail clients are misconfigured may not be able to send e-mail at times. Your outgoing mail server should be set to mail.ru.ac.za and your incoming mail server should be imap.ru.ac.za. They are no longer the same machine.
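As a rough illustration of the check described above, the sketch below compares a client's server settings against the two hostnames given in this notice. The dictionary keys ("outgoing", "incoming") and the function name are made up for the example and do not correspond to any particular mail client.

```python
# Hostnames taken from this notice; the key names are hypothetical.
EXPECTED = {
    "outgoing": "mail.ru.ac.za",  # SMTP server for sending
    "incoming": "imap.ru.ac.za",  # IMAP server for reading
}

def misconfigured_fields(settings):
    """Return the names of settings that do not match the expected hosts."""
    wrong = []
    for field, expected_host in EXPECTED.items():
        actual = settings.get(field, "").strip().lower()
        if actual != expected_host:
            wrong.append(field)
    return sorted(wrong)

# A client that uses mail.ru.ac.za for both roles is flagged on "incoming".
client = {"outgoing": "mail.ru.ac.za", "incoming": "mail.ru.ac.za"}
print(misconfigured_fields(client))
```

A client with both fields set as described returns an empty list and should be unaffected by the move.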

People using secure SMTP to allow relaying from outside of Rhodes will find that this service no longer works. This is a temporary condition, and secure SMTP relaying will be available again once we get to the bottom of the problems we're experiencing.

Rhodes's MX records have also been altered to make the temporary mail server the least preferred route for mail. This means that should elephant fall over again, mail from remote sites will automatically be queued at Rhodes for delivery (rather than at the remote site, as was the case in the past). This should speed up the delivery of this mail once service is restored.
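The fallback behaviour described above can be sketched roughly as follows. This is a simplified model of how a sending mail server walks a list of MX records, not a description of our actual DNS data: the hostnames and preference values are placeholders, and the function names are made up for the example. It also ignores details such as how real servers handle records with equal preference.

```python
# Hypothetical MX records as (preference, host) pairs.
# A lower preference number means the host is tried first.
MX_RECORDS = [
    (20, "temporary-gateway.example"),  # least preferred: used as fallback
    (10, "elephant.example"),           # most preferred: normal destination
]

def delivery_order(mx_records):
    """Return hosts in the order a sending server would try them."""
    return [host for _preference, host in sorted(mx_records)]

def pick_host(mx_records, is_up):
    """Return the first reachable host in preference order, or None."""
    for host in delivery_order(mx_records):
        if is_up(host):
            return host
    # No host reachable: the sender keeps the message in its own queue.
    return None

# With the preferred host down, mail goes to the fallback gateway.
elephant_down = lambda host: host != "elephant.example"
print(pick_host(MX_RECORDS, elephant_down))
```

Without the fallback entry, `pick_host` would return `None` whenever the preferred host is down, which corresponds to mail waiting in the remote site's queue.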

A new kernel is currently being built on elephant (which explains the poor response times), and we hope that this kernel will go some way towards addressing the problems we are currently experiencing. It will be installed as soon as it has finished compiling. This will require a reboot of elephant, during which incoming mail will be temporarily unavailable. Unfortunately we don't have a schedule for this at the moment, because the kernel compile has already taken significantly longer than we would normally expect.

It appears that the fundamental cause of all of the problems we're experiencing is a large amount of unsolicited e-mail (SPAM) that's being directed at Rhodes e-mail addresses. This unusually high volume of SPAM is consuming significant resources on the mail server and essentially overloading it.

Shortly before lunch the mail server crashed again. It is currently re-checking its disks and will hopefully be up in about twenty minutes. We've now installed a debugging kernel and made some configuration changes in the hope that we'll get some useful information if this happens again.

It appears the temporary SMTP gateway is working correctly, and around 1500 messages have been queued in the last two hours. These will be delivered as soon as elephant comes back up. In order to speed this up, we'll initially bring up only the SMTP and IMAP services on elephant. The web server (which hosts the webmail client, the RT3 queue and the mailman management interface) will only be brought up once the backlog has cleared.