17 October 2006

Testing the Business Continuity Plan

Every Business Continuity Plan needs to be tested.

Business Continuity planning has been on our list of Current Challenges for some time now. Also called Disaster Recovery, business continuity planning is about knowing what to do when business is interrupted, so that the impact on the smooth running of the organisation is minimised or eliminated.

In the IT context, the Business Continuity Plan (the BCP) details the systems and processes in place to ensure the continued availability of computing facilities to support the operation of, in our case, the Division. In order to develop a BCP, it is common to imagine a number of scenarios and work out what needs to be done to continue to provide or restore services in the event the scenario is realised.

We had a real-life situation last week that has given us great insight into the business recovery process: someone on the team took the Divisional web server out of service, reformatted its hard drives and installed a completely different operating system (Linux rather than Windows), thereby very effectively erasing all the operating system, applications, settings and data that had previously been on the server.

We are decommissioning a number of servers over the next three months, and while the web server (bacillus) is due to be decommissioned at the end of this year, it was still in service supporting the ASP database, the Divisional helpdesk system, all the Divisional websites that have yet to be migrated to UCOnline, and a number of other internal services used by the TSU. The Division’s old mail server (spirillium) was taken out of service last year, and this was the server that was due to be reformatted. Unfortunately, at some time over the last five years, the faceplates on the two servers had been swapped, so the wrong server was removed and erased.

That was around lunchtime last Thursday. By Monday the TSU team, especially Thomas Teng, had restored the web server to its state as of the previous Tuesday night. About a day and a half of data was lost, but all services have now been restored: our disaster recovery worked OK, but it probably took longer than we would have liked, and it exposed some holes in our BCP that need to be addressed.

Faster restores

A new backup system is progressively being installed in the new 9B25 server room. This system allows us to back up both Building 9 and Building 20 servers, and provides a more secure environment for the backed-up data. It also has faster connections to most of the servers that need to be backed up (especially the new ones in Building 20 and Building 9), so backups will take less time for the same amount of data, and restoring servers after a catastrophic failure of the type experienced last week will be quicker. We are also looking at ways to back up operating systems and configurations more effectively, so the restoration process goes faster and more smoothly.
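
One small step in that direction is to write a short "what was on this box" record alongside every backup, so a rebuild doesn't have to start from memory. The sketch below is only an illustration of the idea, not part of our actual backup system; the file name, fields and example values are all assumed:

import json
import platform
import socket
from datetime import datetime
from pathlib import Path

def write_backup_manifest(backup_dir):
    """Record basic facts about the machine alongside its backed-up data,
    so a restore doesn't start with guesswork about what was running."""
    manifest = {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_version": platform.version(),
        "architecture": platform.machine(),
        "recorded_at": datetime.now().isoformat(timespec="seconds"),
        # Hypothetical fields the team would fill in from its own records:
        "role": "divisional web server",
        "applications": ["ASP database", "helpdesk", "divisional websites"],
    }
    path = Path(backup_dir) / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path

if __name__ == "__main__":
    print(write_backup_manifest("."))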

Double checks

One of the factors in the recent disaster was the turnover of staff: there have been four generations of IT Officers responsible for the servers since they were installed five years ago, and many more staff with access to them over that time. A regular program will be put in place to physically check documentation against the actual installations to ensure they match. Where documentation is missing or wrong, new documentation will be developed to reflect the actual situation, so that the risk of disaster is reduced and restoration is faster and easier.
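
No script would have stopped someone swapping faceplates, but the regular check can include comparing what the documentation says against what the network actually reports for each server name. Here is a minimal sketch of that kind of cross-check, assuming the records are kept as a simple CSV of documented names and addresses (the file name and column names are invented for illustration):

import csv
import socket

# Hypothetical register format: one row per server with the documented
# name and IP address, e.g.  name,ip  ->  bacillus,10.1.2.3
def check_register(csv_path):
    """Flag servers whose documented name and address no longer agree
    with what the network actually reports."""
    problems = []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            name, documented_ip = row["name"], row["ip"]
            try:
                actual_ip = socket.gethostbyname(name)
            except socket.gaierror:
                problems.append(f"{name}: does not resolve at all")
                continue
            if actual_ip != documented_ip:
                problems.append(
                    f"{name}: documented as {documented_ip}, resolves to {actual_ip}"
                )
    return problems

if __name__ == "__main__":
    for line in check_register("servers.csv"):
        print(line)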

Better documentation of server installations

With the turnover of staff responsible for the Division’s IT systems, a lot of corporate knowledge has walked out the door over the past five years. We already had a project in place to capture the knowledge of the current team, along with any documentation we had, and we are building an internal-to-TSU knowledge base of the operations the TSU is responsible for maintaining. Last week’s disaster has focused the minds of the team on this task and more than confirmed the need for it.

One shortcoming identified in the process of restoring the web server was that original installation disks and serial numbers were not being kept systematically. Everything needed to restore the server was found, but finding it could have been easier. A more robust, accessible and reliable system will be developed over the summer to ensure the right resources are always readily available.

Tracking of configurations

The Division has around 29 servers in daily use: each of them serving different functions, on different hardware, of different ages (from a few weeks to five years old), with different operating systems (Microsoft Windows Server 2000, 2000 Advanced, 2003, a special version for the Network Attached Storage device, Apple Macintosh OS X Server, and now a Linux machine!), different configurations and locations. Keeping track of which machine is doing what, where it is physically and how it is set up is a major project and one which is being addressed with added enthusiasm at the moment.
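
One way to keep that picture current is to hold the register as structured data rather than prose, so a "what is where" summary, and a list of incomplete records, can be produced on demand. Below is a rough sketch of such a report, assuming a register kept as a CSV with columns like name, role, os, location and installed (the format and column names are assumptions, not our actual register):

import csv
from collections import Counter

REQUIRED = ("name", "role", "os", "location", "installed")

def summarise_register(csv_path):
    """Print a quick 'what is where' summary and flag incomplete records."""
    with open(csv_path, newline="") as fh:
        servers = list(csv.DictReader(fh))

    print(f"{len(servers)} servers on the register")
    for field in ("os", "location"):
        for value, count in Counter(s.get(field, "") for s in servers).items():
            print(f"  {field}: {value or '(blank)'} x {count}")

    for s in servers:
        missing = [f for f in REQUIRED if not s.get(f)]
        if missing:
            print(f"  {s.get('name', '(unnamed)')}: missing {', '.join(missing)}")

if __name__ == "__main__":
    summarise_register("servers.csv")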

Backup regime

Due to limitations of the retiring backup system, we weren’t keeping as many backups as perhaps we should have. We will increase the frequency of backups (especially of critical systems where data changes frequently), and investigate ways of restoring files that have changed even within the same day.
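
To illustrate the idea behind same-day restores: if each intra-day run only copies files changed since the previous run, several runs a day become cheap enough to schedule. This is only a sketch of the approach, not our backup software; the paths and the state file name are invented:

import os
import shutil
import time

def incremental_copy(source, dest, state_file=".last_run"):
    """Copy only files modified since the previous run, so several
    snapshots a day stay cheap enough to be practical."""
    last_run = 0.0
    if os.path.exists(state_file):
        with open(state_file) as fh:
            last_run = float(fh.read().strip() or 0)

    copied = 0
    for dirpath, _dirnames, filenames in os.walk(source):
        for filename in filenames:
            src = os.path.join(dirpath, filename)
            if os.path.getmtime(src) <= last_run:
                continue
            rel = os.path.relpath(src, source)
            dst = os.path.join(dest, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
            copied += 1

    with open(state_file, "w") as fh:
        fh.write(str(time.time()))
    return copied

if __name__ == "__main__":
    print(incremental_copy(r"D:\websites", r"E:\backups\websites"), "files copied")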

Change

Another issue identified as a result of the disaster was the stalled transfer of responsibility for Divisional services to ICT Services.

  • The Academic Skills Program database should have moved to ICT Services on 1 January 2006, when the ASP moved to the Division of Teaching and Learning. Configuration, maintenance and hosting of the database are still the responsibility of the TSU, and despite a number of attempts to find out what the migration plan for the service is, none is in place.
  • Web hosting should have been migrated to ICT Services, although not all of the Division’s web sites will ever be the responsibility of UCOnline. The web sites we will continue to be responsible for will be migrated to a new server by the end of this year, as will those destined for the Web CMS.
  • The Division’s Helpdesk was due to be replaced by CA’s UniCenter ServiceDesk from 1 January 2006. It is still not available for us to use, and it probably won’t make much sense for us to use it until January 2007, assuming the current date of commencement (30 October 2006) does indeed see the system operational.

Skills

With the move to centralising standard and common services (like the network, email and staff home directories), the Division has been shifting to providing more specialised resources (and the appropriately skilled technical staff to look after them) that are not, and may never be, standard or common. When ICT Services doesn’t provide the level of service we require to support teaching, learning, research and administration in the Division, we end up having to look after the old services as well as the new. This is becoming increasingly difficult as, in practice, common services remain the responsibility of the Division.