One of our Novell Open Enterprise Server (OES) machines, camel.ru.ac.za, has become increasingly unstable over the last few months.

camel hosts a number of non-critical services, some of which are widely used by staff and students. Most notably, it hosts the iFolder service that some staff rely on as one form of backup for important documents on their local computers. It also hosts most of the software that's available via http://software.ru.ac.za/ and http://studsoft.ru.ac.za/, and directly via the Windows shares on \\rhino.ru.ac.za (rhino is simply an alias for camel, and has been for some time).

Staff who use iFolder for backups will notice the problems on camel when their iFolder client repeatedly asks for authentication and/or fails to synchronize. For most people, the synchronization will eventually complete and the data stored on camel will be up to date. The problem is, however, irritating, and as a result a number of people have disabled the iFolder client.

Staff and students who try to access software on camel will occasionally see this problem as an error such as "Windows cannot find '\\rhino.ru.ac.za'".

The exact cause of the instability is not known at this stage. We have identified at least three possible contributing factors:
  • The disk array that camel uses to host data is known to be very slow, and it is possible that the disks are simply not keeping up with the increasing demands being placed on them as more and more people have started using iFolder (the system was initially designed to host a couple of hundred users and now hosts roughly a thousand collections).
  • The machine also runs an old version of SuSE Linux/Novell OES that contains a number of known problems. This certainly contributes to the slow recovery of the machine after outages, as the current version of Linux takes about half an hour to discover the disk array on boot (total boot time is about 45 minutes).
  • The imaging of public labs, which is also hosted on camel, is known to interact badly with some of the redundant network routing protocols we use on campus. It's possible that this fault lies within the operating system, rather than within the imaging software, and is thus affecting other services.

During the course of this year, we intend to revisit camel's purpose and functionality in light of other developments (particularly those related to storage) on campus. Unfortunately, a number of staffing vacancies in the IT Division have decreased the amount of time we have available to devote to major projects such as this, as well as reduced our capacity for preventative and other maintenance. Nevertheless, it is likely that the services camel provides will change, or change form, when this finally happens. At the very least, the machine will be updated to later versions of the software it currently runs. However, this all forms part of a much larger project that's only in the planning phase at this stage. Whilst some changes (such as an OS upgrade) may happen earlier, the final implementation of any changes we plan in this regard is only likely to happen towards the end of the year.

As a result of the above, we are not attempting to find a complete solution to this problem. We are instead using our experiences to inform other decisions. In the interim, the focus will be on maintaining the machine in its current semi-usable state, and implementing ad hoc workarounds as and when we can. This has several implications for staff and students:
  • Backups: Staff who make use of iFolder to back up important documents should strongly consider making use of alternative backup strategies in addition to iFolder. (This should have been the case all along -- relying on one backup mechanism, particularly a centrally provided one, is considered bad practice.) The IT Division has flash sticks, CD & DVD writers, and external hard disks available for sale if necessary. Staff & students should take particular cognizance of section 6.7.1 of the AUP in this regard.
  • Software: Staff and students who make use of software from \\rhino.ru.ac.za (whether via the software webpages or not) should be aware that this service may be intermittently unavailable. If and when this happens, the solution is simply to try again a couple of hours later.
  • Lab Imaging: Public lab images take significantly longer to deploy than they used to. This can cause scheduling problems.

We ask that you bear with us until such time as we're able to resolve this problem properly.