noticeboard.ru.ac.za

2011/03/28-30 - IMAP Mail Server Outage
The University's IMAP mail server is currently seeing an exceptionally large number of simultaneous connections. As a result it is operating well beyond its design capability, and this is making connections seem slow.

To try and mitigate this and handle a larger number of connections, we need to install some more memory in the server. We'll be doing this during lunch time today (Monday 28 March). As a result, the IMAP server will be unavailable for some time starting at 13:00 today. We're hoping to complete this task before the end of lunch.
We've completed the scheduled work, but have encountered some problems. We're in the process of investigating, but don't currently know how long this will take. I suspect, however, that it'll extend somewhat beyond the end of lunch.
We've reverted the change we made. This has allowed us to bring the mail server back up, but has done nothing to resolve the performance issues.
In an effort to address the performance problem, there'll be another short outage of the IMAP mail server within the next few minutes. During this time we'll change some of the operating system's parameters, but won't make any physical changes to the server.
We've made a number of small changes throughout the day that are aimed at resolving the problem we had when increasing memory. As a result, we'll be re-attempting the memory upgrade at 6PM. If things go to plan, we expect an outage of about 15 minutes; if they don't and the machine crashes again, it'll likely take about three hours to recover.
We've finally managed to establish the cause of the load on our IMAP server. It seems that there are about 500 Rhodes users who're making use of Blackberries, and RIM's e-mail proxies don't behave very well.

A normal IMAP client periodically polls the server for e-mail. This means that at any given time only a random subset of the total number of users is connected to the server, so the IMAP server can handle substantially more clients than connections. RIM's proxies, by contrast, seem to establish a connection per user and then hold it open for as long as the server permits. If the connection fails for any reason, they immediately attempt to re-establish it. This drastically increases the number of connections to the server or, put another way, reduces the number of clients it can support. So far as we can tell, RIM are violating Internet standards in doing so.
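
To give a rough feel for the difference, here is a minimal sketch in Python. The poll interval (5 minutes) and session length (5 seconds) are made-up illustrative figures, not measurements from our server; only the user count of roughly 500 comes from what we've observed.

```python
def expected_concurrent_connections(users, poll_interval_s, session_length_s):
    """Average number of simultaneous connections for `users` clients.

    Assumes each client connects once per poll interval and stays
    connected for session_length_s seconds (hypothetical figures).
    """
    # Fraction of time a single client spends connected, capped at 100%.
    fraction_connected = min(1.0, session_length_s / poll_interval_s)
    return users * fraction_connected


# Polling clients: connected for a few seconds out of every five minutes.
polling = expected_concurrent_connections(500, poll_interval_s=300, session_length_s=5)

# Held-open clients: connected for the whole interval, i.e. all the time.
held_open = expected_concurrent_connections(500, poll_interval_s=300, session_length_s=300)

print(f"polling clients:   ~{polling:.0f} connections on average")
print(f"held-open clients: ~{held_open:.0f} connections on average")
```

Under those assumptions the same 500 users need only a handful of connections when polling, but one connection each when the connection is held open.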

As a short term solution to stabilise our mail services, we've limited the number of simultaneous connections from RIM's proxies. We're not yet sure what effect this will have on Blackberry users. We're hoping that RIM's proxies handle the restrictions sensibly and load balance the connections between users. However, if they don't, it might mean that some Blackberry users are unable to access their mail using their Blackberry.
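
The idea behind the limit can be sketched as follows. This is an illustration in Python only, not our actual server configuration; the class name `ConnectionLimiter` and the cap of 2 are hypothetical.

```python
class ConnectionLimiter:
    """Caps the number of simultaneous connections from one source.

    Hypothetical illustration; not the real server's mechanism.
    """

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.active = 0

    def try_accept(self):
        """Return True and count the connection if there is room, else False."""
        if self.active >= self.max_connections:
            return False
        self.active += 1
        return True

    def release(self):
        """Free one slot when a connection closes."""
        if self.active > 0:
            self.active -= 1


limiter = ConnectionLimiter(max_connections=2)

# The first two connections fit under the cap; the third is refused.
results = [limiter.try_accept() for _ in range(3)]

# Once an existing connection closes, a new one can be accepted again.
limiter.release()
after_release = limiter.try_accept()
```

A refused client only gets in when another connection closes, which is why a proxy that never lets go of its connections can lock some users out.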

In the medium term, we're going to try and increase the number of simultaneous connections our IMAP server can support with a view to reducing the restrictions on Blackberries. This might involve one or more outages of the mail server, perhaps starting later tonight.

We're not yet sure what the longer term solution is.
The IMAP server will be unavailable from about 8.30PM to make some configuration changes and run some disk checks. I expect that this will take about an hour to complete.
We've increased the number of simultaneous connections we'll allow from RIM's mail proxies, but the limit is still well below the number of connections we were seeing (roughly half).

We're told that some Blackberries have no problems with this change, whereas others report that they're unable to connect. This might depend on the specific model of Blackberry in use.
In an effort to allow Blackberries access more equitably, we're resetting all connections from RIM's servers every 15 minutes. This means that random chance should allow a greater proportion of Blackberry clients to connect.
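
As a back-of-the-envelope sketch of why this should help, assume (and this is only an assumption) that after each reset the available slots are handed out to a uniformly random set of clients, independently of previous resets. The figures of 500 clients and 250 slots are illustrative, loosely based on the numbers mentioned above.

```python
def chance_connected_at_least_once(clients, slots, resets):
    """Probability that a given client gets a slot in at least one of
    `resets` rounds, assuming slots are assigned uniformly at random
    and independently each round (an assumption, not a measurement).
    """
    # Chance of missing out in any single round.
    p_miss_once = 1.0 - min(1.0, slots / clients)
    # Chance of missing out in every round, subtracted from 1.
    return 1.0 - p_miss_once ** resets


one_round = chance_connected_at_least_once(clients=500, slots=250, resets=1)
one_hour = chance_connected_at_least_once(clients=500, slots=250, resets=4)

print(f"after one reset:   {one_round:.1%}")
print(f"after four resets: {one_hour:.1%}")
```

Without the resets, whichever clients connect first could hold their slots indefinitely; with them, the odds of any particular client being shut out for a long period drop quickly.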

We've still no idea how to solve this in the longer term, or if it is even possible to do so. Ultimately we've got to prioritise the usability of the IMAP service for PCs on campus (being the bulk of connections).