2014/02/08 - storage failure affecting labs, ruconnected

Sat, 08 Feb 2014 13:41:55 +0200

At about 10:15 this morning, a problem developed on one of the University's storage systems. This system primarily provides storage for test and non-production services, as well as bulk storage. However, two notable exceptions are ruconnected.ru.ac.za and public computer lab logins -- it is currently not possible to log into many lab machines, and ruconnected may become inaccessible too.

We're currently investigating the problem.

guy

Sat, 08 Feb 2014 13:56:26 +0200

This work may result in emergency maintenance being performed this evening. A separate post at http://noticeboard.ru.ac.za/post.5532699 explains this, but further updates will be posted here.

guy

Sat, 08 Feb 2014 14:07:21 +0200

Whilst services such as email, protea, etc are not currently affected by this problem and do not store data on the affected storage system, they share common infrastructure with those services that are affected. In particular, the physical servers hosting affected services also host unaffected services. Because of the storage problem, our virtual infrastructure can no longer manage these servers and is thus unable to respond to other problems that may arise. In order to stabilise the situation, it is necessary to reboot these servers -- until we do, we're at risk of further, unrelated outages.

guy

Sat, 08 Feb 2014 21:15:03 +0200

The emergency maintenance is complete, and all unaffected services have been restored.

The storage problem persists, and continues to affect public computer labs and ruconnected. Also affected are serval.ru.ac.za and a number of test services.

guy

Sun, 09 Feb 2014 10:27:13 +0200

ruconnected has been restored, in the sense that it is currently stable and operational. All course backups have been lost, but currently active courses are unaffected. In addition, it is currently not possible to make course backups (i.e. please do not make manual backups of courses).

guy

Sun, 09 Feb 2014 10:38:16 +0200

Public labs and lecture venues remain affected -- it is not possible to log into any centrally managed lab or lecture venue, except for the labs in Law and the Library.

We are currently working on a solution to this problem, which we hope to have in place by tomorrow morning. However, we may need to re-image the affected labs and venues to do this, which will delay things. More information will follow.

guy

Sun, 09 Feb 2014 12:06:56 +0200

Half of Jacaranda lab is currently being imaged with an older image, so that students have somewhere to work whilst we prepare a more permanent solution. This should be complete within the next hour or two.

guy

Sun, 09 Feb 2014 21:36:28 +0200

Thus far we've created a new login & imaging server, and prepared (some) updated images. Unfortunately, to restore login functionality, most labs and venues need to be reimaged -- until they are, only local login is possible (in venues).

We've loaded an older image in Jacaranda and the whole lab is open. The Library is unaffected.

By tomorrow (Monday) we're hoping to have Eden Grove and the six venues listed in the O'week programme for departmental presentations (Barratt 1&2, Zoo Major, Arts Major, Chem Major, Botany Major and GLT) reimaged.

Tomorrow (when we have more staff available) we'll look at the other venues and labs.

guy

Mon, 10 Feb 2014 07:25:03 +0200

The venues, including some of the minor venues in the same buildings, were completed in the early hours of this morning. They're available for use as normal.

Eden Grove was re-imaged overnight. It should be open around 9AM.

guy

Mon, 10 Feb 2014 07:52:57 +0200

guy

Mon, 10 Feb 2014 09:32:22 +0200

The server that provides our software repository (commonly known as \\rhino) is unavailable as a result of this. We are attempting to restore from backups.

guy

Wed, 12 Feb 2014 15:56:52 +0200

Whilst most labs have now been re-imaged, an issue has arisen with the new login & imaging server that was created over the weekend. It seems that, at times, it is not coping with the demands being placed on it. When this happens, people are temporarily unable to log in from lab machines.

We're working on resolving this, but are hampered by the fact that replacements for the storage that failed will likely only arrive next week. We're hoping to find an interim workaround sooner than that.

We're also simultaneously working on a fall-back plan using older images, in case we're not able to resolve the current issue.

guy

Wed, 12 Feb 2014 15:58:36 +0200

The server known as \\rhino (hosting software) was restored to a temporary location yesterday; more work will be needed once replacement hardware is available.

The server known as \\serval can only be properly restored when replacement hardware is available. Details will be communicated to affected parties.

guy

Wed, 12 Feb 2014 16:02:21 +0200

Recovery of ruconnected was completed last night, and a scheduled change also went ahead as planned.

guy

Mon, 17 Feb 2014 14:48:11 +0200

Labs have been completely recovered, and are working as normal. Logins are currently dependent on a single server, and there's a risk that if this server fails for any reason, users will be unable to log into the labs. We're unable to resolve this until replacement storage is available.

New disks have been received, but there's a single critical component (a host bus adapter) outstanding. We do not have a reliable ETA for this. However, we're unable to complete recovery of the remaining systems (including, most notably, \\serval) until this arrives.

post.5532698