2006/04/18 - Network Outage, St Peter's Campus
From about 2.30PM yesterday (Tuesday 18 April) there was a general network outage on the St Peter's campus. This outage will have affected all sites hanging off the Music regional switching centre including, but not limited to, the Law, Music and Education departments; Salisbury, Winchester and Canterbury houses; and Environmental Education.

This outage was our core switch's response to an abnormally high amount of broadcast traffic originating from the regional switching centre in the Music department. This sort of problem usually occurs when someone intentionally or unintentionally creates a closed loop in the network (by, for instance, plugging the same device into two different ports on a switch). Loops cause this because they act as repeaters: they repeat the same broadcast traffic over and over as fast as they can. The outage occurred when the number of broadcast packets from the Music switching centre exceeded 10,000 per second (a normal volume is less than 100 per second), peaking at around 23,000 per second late yesterday afternoon. This triggered protection mechanisms on the core switch that are designed to contain and limit the impact of this traffic. If these protection mechanisms hadn't activated, it is likely that this fault would have caused a far larger outage spanning most of the campus.
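
As a rough illustration of the threshold behaviour described above, the following Python sketch classifies a broadcast rate using the two figures quoted in this report. It is not the core switch's actual logic; the function name, the constant names and the "elevated" middle band are assumptions made purely for illustration.

```python
# Hypothetical sketch: the two figures come from this report, everything
# else (names, the "elevated" band) is an illustrative assumption.
NORMAL_MAX_PPS = 100          # normal volume is less than 100 packets/second
STORM_THRESHOLD_PPS = 10_000  # protection triggered above this rate


def classify_broadcast_rate(packets_per_second: int) -> str:
    """Return a coarse label for an observed broadcast packet rate."""
    if packets_per_second > STORM_THRESHOLD_PPS:
        return "storm"      # protection mechanisms would activate
    if packets_per_second >= NORMAL_MAX_PPS:
        return "elevated"   # above normal, below the trigger point
    return "normal"


if __name__ == "__main__":
    for rate in (50, 5_000, 23_000):
        print(rate, classify_broadcast_rate(rate))
```

Under these assumptions, the peak of around 23,000 packets per second falls well inside the "storm" band, more than double the trigger point.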

What was unusual about this outage was the scope of the problem. The vast majority of our campus, including all of St Peter's, runs on managed switching. One of the main selling points of this technology is the ability to detect and break loops close to their source, thus limiting their impact. This doesn't appear to have happened in this case, which causes us some concern. At present, the only logical explanation we have is that someone has installed an unauthorised, unmanaged device on the network in violation of the University's acceptable use policy. We were unable to confirm this yesterday since, by the time we had figured out what was going on, most people had already left for the day.

So, whilst we haven't yet determined the exact cause of the problem, we did manage to isolate it to two switch ports during the course of yesterday afternoon. As a result, most of St Peter's campus should have had normal networking restored by about 4PM yesterday.

Investigations into the cause, and more specifically into why the protection that should be offered by our investment in managed switching failed to detect and resolve this, will continue during the course of today (Wednesday).

Shortly after 8AM this morning we discovered the loop that created this outage. It appears that the problem was unintentionally created by someone plugging a telephone into two separate network ports.

Our investigations this morning showed that this outage was so widespread because the protocol that detects loops (spanning tree) was disabled on the switch in question. Spanning tree is enabled by default on all switches we buy. The switch in question, however, was a second-hand power-over-Ethernet switch (the only one of its type on campus) and the previous owners may have changed this configuration. Since we never have to set up spanning tree on any of the new switches we install, we naively assumed that the switch was configured correctly, at least as far as this protocol is concerned.
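
To show why a loop-detecting protocol matters here, the sketch below finds redundant links in a small topology using a union-find structure. This is a simplified illustration only: real spanning tree works by exchanging messages between switches and electing a root bridge, which this code does not model. The device names and the `find_redundant_links` function are hypothetical.

```python
# Hypothetical sketch of loop detection: a link is redundant (closes a loop)
# if its two endpoints are already connected by earlier links. This is NOT
# the spanning tree protocol itself, only the idea behind blocking a port.
def find_redundant_links(links):
    """Return the links that would close a loop, in the order given."""
    parent = {}

    def find(node):
        # Locate the representative of the group containing this node.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            redundant.append((a, b))  # endpoints already connected: a loop
        else:
            parent[root_a] = root_b   # merge the two groups
    return redundant


if __name__ == "__main__":
    # Illustrative topology: a telephone plugged into two ports of one switch.
    links = [
        ("core", "music"),
        ("music", "poe-switch"),
        ("poe-switch", "phone"),
        ("poe-switch", "phone"),
    ]
    print(find_redundant_links(links))
```

In this toy topology the second link to the telephone is reported as redundant, which corresponds to the port a switch with spanning tree enabled would be expected to disable.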

The switch in question has now been re-configured to enable spanning tree. We tested that things are working properly by re-creating the original loop and checking that the switch correctly disabled the port. As a result we're confident that there won't be a repeat of this particular outage.