noticeboard.ru.ac.za

2007/07/22 - Problems Affecting Webmail, Others
In the early hours of yesterday morning our management and backup server crashed as a result of a failed hard disk. The hard disk in question contained no valuable information (it was a spool drive used to store backups before they were written to tape), and was replaced by lunchtime yesterday. Unfortunately, the unclean shutdown of the machine, and the subsequent disk checks on boot, caused the machine to lose information from another, otherwise unaffected hard disk. In particular, all the databases that are replicated between nodes of our services cluster were lost. Since the databases are replicated from the management server, and the data loss wasn't noticed until too late, the loss was propagated to the other nodes in the cluster. As a result, all six copies of these databases were destroyed.

The most noticeable impact of this is that users have been unable to log into the webmail system since yesterday morning. This is because the underlying database, which stores user preferences and similar data, no longer existed.

We've had to restore the databases in question from tape backups. Unfortunately, the only viable copies of these databases that we have on tape are two weeks old**. This means that any changes made since 01:00 on Sunday 8th July 2007 have been lost. In particular, this will affect the webmail client (the other databases contained volatile, non-essential information). The restored copies of the databases are online now, and the webmail system is operational again.

Users who've changed any settings using the webmail client (http://www.ru.ac.za/webmail) in the last two weeks should carefully re-check their settings, as they may have been reverted. In the same way, users who make use of the calendaring system attached to the webmail client should be aware that they may have lost changes made in the last two weeks.

** This is because the origin of the problem was in the backup system itself. It has been performing erratically for the last two weeks, and we have been trying to debug it throughout that time. The problems first presented on the 9th of July, and despite repeated efforts to resolve them we've been unable to get a complete backup since then.

We now know that the reason this was happening is that the backup system's spool disk was on the verge of failure, and that the machine was intermittently locking up when it was under the heavy read/write load of a backup cycle. It appeared to crash most frequently when backing up information on the local machine and so, whilst we have more recent backups of other systems, we don't have backups of the management server itself. Whenever the machine crashed, it invariably broke one (and always the same) replica of the RAID1 array that the machine runs on. This led us to the (erroneous) belief that one of the disks in the RAID array was failing. In hindsight, the most likely explanation is that the spool disk and the drive we thought was failing sit on the same ATA controller, so a temporary failure of the spool disk would have stopped the other disk from responding. Since backups typically happen in the middle of the night when most systems are quiescent, nobody was present when the machine crashed, and so nobody noticed that both disks reported errors at the time. All our servers are configured to automatically try to restore service when they crash (to minimise downtime), and so by the time someone looked at the machine the next day the only remaining symptom was a broken RAID array.

Unfortunately, that retrospective information doesn't help us restore the backups that failed to write successfully in the intervening two-week period. We've tried to recover more recent information from the spool disk, but every copy of the database we've managed to extract from it has been corrupt. This left us with no choice but to use the older, but viable, information we had on tape.