noticeboard.ru.ac.za

2011/11/02 - Multiple Failures of Core Services
A number of core ICT services have recently failed or begun behaving erratically.

The underlying problem appears to be related to storage, and specifically the FibreChannel link between Struben and AMM. This is not directly related to the changes that occurred last night, but those changes may have exposed an existing, underlying problem with the fibre optic cable connecting our storage systems. It appears that this fibre pair is causing intermittent connectivity between various servers and our storage systems, which is resulting in servers repeatedly changing the path they use to access storage. Services that have high disk I/O have been most affected by this.

We're in the process of installing a new fibre cable to replace the one we suspect is faulty.

In the meantime we've disabled the faulty storage path in an attempt to stabilise the situation. We're in the process of identifying and restarting the affected services.
Services that have been or are currently affected:
  • IMAP
  • Outgoing e-mail
  • Serval (departmental storage)
  • iPrint
  • Gnu (staff Novell server)
  • Mailing Lists
  • P2000 Access Control System

Others may be added to this list as we discover them.
All services should now be working normally except for the P2000 access control system.

Unfortunately, because of its bizarre licensing requirements, the P2000 system is tied to the physical hardware of the only host that we cannot currently use (all other services have been migrated to other, now unaffected hosts). As such, we will not be able to restore this service until the underlying fibre fault is rectified. We're hoping that we'll be in a position to do this by this evening.