noticeboard.ru.ac.za

2009/11/24 - Unscheduled Network Outage
At approximately 2:30PM this afternoon the University's internal network suffered a campus-wide failure. This resulted in intermittent network and service outages, both internally and for external clients of the University, until about 3:45PM.

We believe the initial trigger of this failure may have been an electrical surge caused by the thunderstorm that occurred at the time. It is likely that this caused a temporary network loop between the two legs of a redundant connection to one of the core switches that route traffic for the whole campus. A cascading failure followed, and resulted in all core network devices locking up. Each attempt to reset the devices temporarily restored networking, only for the failure to recur shortly thereafter.

To restore stability we have physically disabled one leg of each of the redundant connections to the network core switches, effectively breaking any possible loop. This has resulted in the network becoming fully operational again, but has not fixed the underlying problem.

We still need to determine the root cause of this outage, as a glitch such as the one described above should not have caused such a catastrophic failure. As we systematically attempt to isolate the problematic link, we hope to uncover the real cause of the outage and will take steps to try to prevent it from recurring in future.

It is possible (perhaps likely) that our efforts to restore the disabled redundant connections will cause a further outage. We expect, however, that any such outage will be short lived and well understood.

As predicted, our attempts to restore connectivity caused an outage last night between approximately 8:30PM and 9:30PM. Unfortunately we're not entirely sure why this happened, as the affected redundant links were actually restored the previous day. We have, however, noticed a possible reason for this.

As a result of last night's outage, we've disabled a number of the links again. We're going to make some adjustments and then try again. Again, it is possible (likely) that this will cause another outage.