noticeboard.ru.ac.za

2006/03/13 - Core Services Outage
On Monday 13th March we'll be undertaking urgent maintenance of the power supply in Struben building. This will likely result in a short power outage in the data centre in Struben sometime between 11:30 and 12:00. It is anticipated that power should be out for no more than thirty minutes.

During this power outage all core services, including e-mail, Internet access, Novell file space and networked printing, will be unavailable.

Once power to the data centre has been returned it may take a short while for services to stabilise. Users who experience connectivity problems immediately after the outage should wait five to ten minutes and then reboot their computer.

Whilst we'd dearly love to be able to schedule this outage out of office hours, we're unable to do so, as the above time window is the only one available to the contractors who're going to perform the work. Accordingly, we offer our sincere apologies for the inconvenience this necessary maintenance will cause.

As usual, further information and updates will be posted on the IT Division's noticeboard at http://noticeboard.ru.ac.za/

For those who're interested, details of this outage follow:

Over the last few power outages we've noticed that the uninterruptible power supply (UPS) in Struben building has not been maintaining power to the data centre for as long as it should. In addition, when the UPS dropped power to machines in the data centre it didn't try to shut them down beforehand, with the result that they came up with unclean filesystems once power was restored. As a consequence, recovery from the last three power failures has required significant manual intervention.

During the last significant power outage we managed to conclusively prove that the fault lay with the UPS. It fails to set the low battery indicator before it drops power to the load. As a result our monitoring software has no idea that the power's about to be cut and thus doesn't schedule shutdowns of affected machines.
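
For the curious, the failure can be sketched roughly as follows. This is a simplified illustration in Python, not our actual monitoring software: the names UpsStatus, should_schedule_shutdown and first_shutdown_trigger are invented for the example, and it assumes the monitor acts only once the UPS reports both "on battery" and "low battery".

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class UpsStatus:
    """One status reading from the UPS (hypothetical structure)."""
    on_battery: bool   # mains has failed and the load is running off the batteries
    low_battery: bool  # the indicator the UPS should raise before it drops the load


def should_schedule_shutdown(status: UpsStatus) -> bool:
    # Assumed policy: only start shutting machines down once the UPS
    # says it is on battery AND that the battery is nearly exhausted.
    return status.on_battery and status.low_battery


def first_shutdown_trigger(readings: Iterable[UpsStatus]) -> Optional[int]:
    """Return the index of the first reading that triggers a shutdown, or None."""
    for index, status in enumerate(readings):
        if should_schedule_shutdown(status):
            return index
    return None


# A healthy UPS raises the low battery indicator before cutting power.
healthy = [UpsStatus(False, False), UpsStatus(True, False), UpsStatus(True, True)]

# Our faulty UPS never raises it, so no shutdown is ever scheduled
# and the machines simply lose power.
faulty = [UpsStatus(False, False), UpsStatus(True, False), UpsStatus(True, False)]

print(first_shutdown_trigger(healthy))
print(first_shutdown_trigger(faulty))
```

With the healthy sequence a shutdown is triggered on the final reading; with the faulty sequence the function never finds a trigger, which mirrors what we saw during the last outage.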

The people who maintain our UPS suspect the fault lies in one of the batteries supporting our UPS. Since the batteries are reaching the end of their usable life anyway, it makes sense to replace them. This will happen during the scheduled outage.

At the same time the UPS is going to get a capacity upgrade. It'll go from a 15 kVA maximum load to a 20 kVA maximum load. This involves the addition of extra batteries, some recabling and a firmware upgrade. It is during the firmware upgrade that power to the data centre will be cut.

We're told that power should be out for no more than ten minutes. Being conservative by nature, we've advertised this as half an hour to allow for any problems that may occur. We'll attempt to shut down machines cleanly before power is cut, in the hope that services will be restored quickly once power is returned. In a worst-case scenario it takes us about an hour and a half to fully stabilise all services after an unclean shutdown.