noticeboard.ru.ac.za

2014/02/08 - storage failure affecting labs, ruconnected
At about 10:15 this morning, a problem developed on one of the University's storage systems. This system primarily provides storage for test and non-production services, or bulk storage. However, two notable exceptions are ruconnected.ru.ac.za and public computer lab logins -- it is currently not possible to log into many lab machines, and ruconnected may become inaccessible too.

We're currently investigating the problem.
This work may result in emergency maintenance being performed this evening. A separate post at http://noticeboard.ru.ac.za/post.5532699 explains this, but further updates will be posted here.
Whilst services such as email, protea, etc. are not currently affected by this problem and do not store data on the affected storage system, they share common infrastructure with the services that are affected. In particular, the physical servers hosting affected services also host unaffected services. Because of the storage problem, our virtual infrastructure can no longer manage these servers and is thus unable to respond to other problems that may arise. To stabilise the situation, it is necessary to reboot these servers -- until we do, we're at risk of further, unrelated outages.
The emergency maintenance is complete, and all unaffected services have been restored.

The storage problem persists, and continues to affect public computer labs and ruconnected. Also affected are serval.ru.ac.za and a number of test services.
ruconnected has been restored, in the sense that it is currently stable and operational. All course backups have been lost, but currently active courses are unaffected. In addition, it is currently not possible to make course backups (i.e. please do not make manual backups of courses).
Public labs and lecture venues remain affected -- it is not possible to log into any centrally managed lab or lecture venue, except for the labs in Law and the Library.

We are currently working on a solution to this problem, which we hope to have in place by tomorrow morning. However, we may need to re-image the affected labs and venues to do this, which will delay things. More information will follow.
Half of Jacaranda lab is currently being imaged with an older image, so that students have somewhere to work whilst we work on a more permanent solution. This should be complete within the next hour or two.
Thus far we've created a new login & imaging server, and prepared (some) updated images. Unfortunately, to restore login functionality, most labs and venues need to be reimaged -- until they are, only local login is possible (venues).

We've loaded an older image in Jacaranda and the whole lab is open. The Library is unaffected.

By tomorrow (Monday) we're hoping to have Eden Grove and the six venues listed in the O'week programme for departmental presentations (Barratt 1&2, Zoo Major, Arts Major, Chem Major, Botany Major and GLT) reimaged.

Tomorrow (when we have more staff available) we'll look at the other venues and labs.
The venues, including some of the minor venues in the same buildings, were completed in the early hours of this morning. They're available for use as normal.

Eden Grove was re-imaged overnight. It should be open around 9AM.
The server that provides our software repository (commonly known as \\rhino) is unavailable as a result of this. We are attempting to restore from backups.
Whilst most labs have now been re-imaged, an issue has arisen with the new login & imaging server that was created over the weekend. It seems that, at times, it is not coping with the demands being placed on it. When this happens, people are temporarily unable to log in from lab machines.

We're working on resolving this, but are hampered by the fact that replacements for the storage that failed will likely only arrive next week. We're hoping to find an interim workaround sooner than that.

We're also simultaneously working on a fall-back plan using older images, in case we're not able to resolve the current issue.
The server known as \\rhino (hosting software) was restored to a temporary location yesterday; more work will be needed once replacement hardware is available.

The server known as \\serval can only be properly restored when replacement hardware is available. Details will be communicated to affected parties.
Recovery of ruconnected was completed last night, and a scheduled change also went ahead last night as planned.
Labs have been completely recovered, and are working as normal. Logins are currently dependent on a single server, and there's a risk that if this server fails for any reason, users will be unable to log into the labs. We're unable to resolve this until replacement storage is available.

New disks have been received, but a single critical component (a host bus adapter) is still outstanding. We do not have a reliable ETA for this. However, we're unable to complete recovery of the remaining systems (including, most notably, \\serval) until this arrives.
post.5532698