31 October 2006

More thefts

Stolen equipment continues to concern us.

It appears that around $2,000 worth of microphones was stolen from the audio production area in Building 9 during the last teaching week of semester.

The production areas are always busy during the last few weeks of semester when many staff and students are involved with audio and video productions. It’s hard to know what to do to reduce the risk of thefts from the area without compromising the access required to support teaching.

Network storage outages

Delays in the transition of responsibilities for hosting network storage are causing growing concern for Divisional staff.

The Division’s file server, dcenas, is becoming increasingly unreliable. dcenas houses staff home directories and other volumes shared between Divisional staff (and in some cases, students). In April this year the Division arranged with ICT Services to migrate these services to new server infrastructure that ICT Services is providing to host home directories and shares for staff in the non-Academic Divisions (and eventually the other Academic Divisions as well).

On current estimates, the replacement facility will not be available until early December. The existing facility run by ICT Services to provide network storage to the non-Academic Divisions can’t cope with the Division’s requirements: existing users of the service on Windows already occasionally see lengthy delays, and Macintosh file services are not available. It would be unwise to migrate our Divisional users to the service because the extra load would probably compromise the service even further, and providing the Macintosh users with Macintosh file services could take a lot of effort and increase the unreliability of the system for all users.

Last week dcenas failed a number of times. On Thursday it took the combined effort of the Technical Services Unit and ICT Services to revive dcenas when it looked terminal. A ‘cold reboot’ (basically removing the power cord and waiting a minute or so before reconnecting it and starting the computer up again) has restored it for now and it has been operational since then.

Action Plan

We will continue to use dcenas in its current state until the services can be migrated to ICT Services’ replacement facility in early December, unless of course dcenas fails completely and can’t be revived. If two cold reboots are required within any twenty-four hour period, dcenas will be pronounced dead and an emergency migration to the existing ICT Services facility will be undertaken. This should take about a day and the service will be subject to the shortcomings outlined above.
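The "two cold reboots within any twenty-four hour period" rule can be checked mechanically from a log of reboot times. The following is only a minimal sketch: the function name `should_declare_dead`, the sample timestamps, and the choice to count a gap of exactly 24 hours as "within" the period are all assumptions made for illustration, not part of the agreed plan.

```python
from datetime import datetime, timedelta


def should_declare_dead(reboot_times, window=timedelta(hours=24), limit=2):
    """Return True if `limit` or more cold reboots fall within any `window`."""
    times = sorted(reboot_times)
    # Compare each reboot with the one (limit - 1) positions earlier in the
    # sorted list; if they are close enough, the whole group fits in the window.
    for i in range(limit - 1, len(times)):
        if times[i] - times[i - (limit - 1)] <= window:
            return True
    return False


# Illustrative log only: the times of day are made up.
reboots = [
    datetime(2006, 10, 26, 14, 0),  # a cold reboot on the Thursday
    datetime(2006, 10, 30, 9, 30),  # a hypothetical later reboot
]

print(should_declare_dead(reboots))  # False: the two reboots are days apart
```

With a third hypothetical reboot later on 30 October the check would return True, which is the point at which the emergency migration would be triggered.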

To lower the load on dcenas and reduce the effort for migration (emergency or planned), TSU staff will:

  1. move shared volumes (that is, not home directories) to cestaff.canberra.edu.au.
  2. ask Divisional staff to clean up their home directories on dcenas, reducing the number of files and folders for any planned or emergency recovery.

We will continue to track the installation of the new ICT Services facility, and keep the Division informed if there are any changes to the current plan to move the directories and volumes to ICT Services in early December. In the meantime, daily backups are being made, so in the event of a total failure of dcenas very little work, if any, should be lost.

Videoconferencing for the Division

H.323 Videoconferencing facilities for the Division?

Recently the Executive discussed access to H.323 videoconferencing for the Division. On investigation, a basic point-to-point system could be acquired for around $17,000.

The cost includes:

  • option to allow sharing of data (presentations)
  • ISDN
  • 1 year maintenance (NBD [Next Business Day] swap out on failure)
  • Installation and training
  • Point to point only

The total cost of a system like the one installed in the Division of BLIS could be up to $65,000 when installation, commissioning, network installations, and operator and participant training are taken into account. Such a system could include additional features like multiple cameras, video projector, additional monitors, interactive whiteboard, multiple participants, recording capability, computer, and additional audio enhancements.

As an alternative to buying a system of our own, or organising access and support to use the BLIS facility, the current cost of renting an existing facility commercially off-site is around $250 per hour, plus a booking fee of $75. Staff wishing to organise booking such a facility should contact the Division’s Helpdesk in the first instance.
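To compare renting with buying the basic system, a rough break-even calculation can help. The sketch below uses only the figures quoted above; it assumes the $75 booking fee is charged once per booking, and it ignores maintenance after the first year, staff time, and travel to an off-site facility. The function names are illustrative.

```python
import math

PURCHASE_COST = 17_000  # basic point-to-point system, as estimated above
HOURLY_RATE = 250       # commercial off-site rental, per hour
BOOKING_FEE = 75        # assumed to be charged once per booking


def rental_cost(hours_per_session, sessions):
    """Total cost of renting for a number of sessions of equal length."""
    return sessions * (BOOKING_FEE + HOURLY_RATE * hours_per_session)


def break_even_sessions(hours_per_session):
    """Smallest number of sessions at which renting costs at least as much as buying."""
    per_session = BOOKING_FEE + HOURLY_RATE * hours_per_session
    return math.ceil(PURCHASE_COST / per_session)


print(break_even_sessions(1))  # 53 one-hour sessions
print(break_even_sessions(2))  # 30 two-hour sessions
```

On these assumptions, renting stays cheaper unless the Division expects more than about fifty one-hour videoconferences over the life of the equipment.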

Sensitive Data on Student Shares

Staff should NOT place material they don't want students to see on network shares that are available to students.

Several years ago the Technical Services Unit set up a student share on the Division’s Network Attached Storage Device (dcenas) so that lecturers, tutors and students in particular Units could share materials. One requirement of the system was that staff could share unit materials among themselves without students being necessarily able to see them. The TSU was convinced at that stage that WebCT or other services provided by ICT Services couldn’t handle the particular requirements of the Units.

Once set up, the facility was left in place so that the lecturers could manage the system without requiring TSU staff intervention.

Recently a student in one of the Units with access to the share discovered several folders within their unit folder containing marks for students. The student was able to open the files and read them. When a directory is created within another directory, by default the new directory inherits the permissions of its parent (permissions allow or deny certain users and groups of users access to read from and/or write to files in the directory). The creator of the directory can change the default permissions to allow or restrict other users’ read and/or write access to the directory. Anyone with the right level of administrative access can change the permissions on a directory to allow or deny access to users.
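The inheritance behaviour described above can be illustrated with a small model. This is a simplified sketch only, not the actual permission mechanism on dcenas: the `Directory` class, the folder names, and the group names "staff" and "students" are all hypothetical, and only read access is modelled.

```python
class Directory:
    """Toy model of a directory whose read permissions may be inherited."""

    def __init__(self, name, parent=None, readers=None):
        self.name = name
        self.parent = parent
        # None means "nothing set explicitly: inherit from the parent".
        self._readers = set(readers) if readers is not None else None

    def effective_readers(self):
        """Groups that can read this directory, after inheritance."""
        if self._readers is not None:
            return self._readers
        if self.parent is not None:
            return self.parent.effective_readers()
        return set()

    def restrict_to(self, readers):
        """Replace inherited permissions with an explicit list of groups."""
        self._readers = set(readers)

    def can_read(self, group):
        return group in self.effective_readers()


# A unit folder that both staff and students can read.
unit = Directory("unit", readers={"staff", "students"})

# A new folder created inside it inherits the parent's permissions by default.
marks = Directory("marks", parent=unit)
print(marks.can_read("students"))  # True: students can see the new folder

# Only after the creator changes the defaults is the folder protected.
marks.restrict_to({"staff"})
print(marks.can_read("students"))  # False
```

The point of the sketch is that nothing is protected until someone explicitly restricts it, which is why a folder of marks created inside a student-readable unit folder was readable by students.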

Staff should not store information they don’t want students to see on drives, shares, or volumes that students can access. The Division and the University have network storage for this purpose that can be shared between specified staff members securely: contact the Division’s helpdesk for further information.

A list of shares or volumes on Divisional servers that students have access to will be circulated to staff so they can assess whether they want to move any of their files to a more appropriate location.

17 October 2006

Testing the Business Continuity Plan

Every Business Continuity Plan needs to be tested.

Business Continuity planning has been on our list of Current Challenges for some time now. Also called Disaster Recovery, business continuity planning is about knowing what to do in the case of an interruption to business to ensure the impact of the interruption on the smooth running of the organisation is minimised or eliminated.

In the IT context, the Business Continuity Plan (the BCP) details the systems and processes in place to ensure the continued availability of computing facilities to support the operation of, in our case, the Division. In order to develop a BCP, it is common to imagine a number of scenarios and work out what needs to be done to continue to provide or restore services in the event the scenario is realised.

We had a real-life situation last week that has given us great insight into the business recovery process: someone on the team took out the Divisional web server, reformatted its hard drives, and installed a completely different operating system (Linux rather than Windows) on it, thereby very effectively erasing the operating system, applications, settings and data that were previously on the server.

We are decommissioning a number of servers over the next three months, and while the web server (bacillus) is due to be decommissioned at the end of this year, it was still in service supporting the ASP database, the Divisional helpdesk system, all the Divisional websites that have yet to be migrated to UCOnline, and a number of other internal services used by the TSU. The Division’s old mail server (spirillium) was taken out of service last year, and this was the server that was due to be reformatted. Unfortunately, at some time over the last five years, the faceplates on the servers were swapped around, so the wrong server was removed and erased.

That was around lunchtime last Thursday. By Monday the TSU team, especially Thomas Teng, was able to restore the web server to its status as of the previous Tuesday night. About a day and a half of data was lost, but all services have now been restored: our disaster recovery worked OK, but probably took longer than we would have liked, and exposed some holes in our BCP that need to be addressed.

Faster restores

A new backup system is progressively being installed in the new 9B25 server room. This backup system allows us to back up both Building 9 and Building 20 servers, and provides a more secure environment for the backed-up data. The new system has faster connections to most of the servers that need to be backed up (especially the new ones in Building 20 and Building 9), so backups will take less time to do (for the same amount of data), and restoring the servers in the event of catastrophic failure of the type experienced last week will be quicker. We are also looking at ways to back up operating systems and configurations more effectively so the restoration process goes faster and more smoothly.

Double checks

One of the factors in the recent disaster was the turnover of staff: there have been four generations of IT Officers responsible for the servers since they were installed five years ago, and many more staff with access to them over that time. A regular program will be put in place to physically check documentation against the actual installations to ensure they match up. Where documentation is missing or wrong, new documentation will be developed to reflect the situation accurately, so that the risk of disaster is reduced and restoration is faster and easier.
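A check of this kind amounts to comparing two lists: what the documentation (or the faceplate) says is in each position, and what is actually found there. A minimal sketch follows, assuming both lists are available as simple mappings; the function name and the rack position labels are hypothetical, and the sample data is illustrative only.

```python
def find_mismatches(documented, observed):
    """Return {location: (documented_name, observed_name)} wherever they differ."""
    mismatches = {}
    # Cover locations that appear in either list, so that missing
    # documentation and missing machines are both reported.
    for location in sorted(set(documented) | set(observed)):
        doc = documented.get(location)
        obs = observed.get(location)
        if doc != obs:
            mismatches[location] = (doc, obs)
    return mismatches


# Illustrative data: two servers whose labels have been swapped.
documented = {"rack-slot-1": "bacillus", "rack-slot-2": "spirillium"}
observed = {"rack-slot-1": "spirillium", "rack-slot-2": "bacillus"}

for location, (doc, obs) in find_mismatches(documented, observed).items():
    print(f"{location}: documented as {doc}, found {obs}")
```

A location that appears in only one list shows up with `None` on the other side, which flags either undocumented hardware or documentation for a machine that is no longer there.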

Better documentation of server installations

With the turnover of staff responsible for the Division’s IT, a lot of corporate knowledge has walked out the door over the past five years. We already had a project in place to capture all the knowledge of the current team, along with any documentation we had, and we are building an internal-to-TSU knowledge base of the operations the TSU is responsible for maintaining. Last week’s disaster has focused the minds of the team on this task and more than confirmed the need for it.

One shortcoming identified in the process of restoring the web server was that original installation disks and serial numbers were not being kept systematically. Everything that was needed to restore the server was found, but it could have been easier. A more robust, accessible and reliable system will be developed over the summer to ensure the right resources are always available easily.

Tracking of configurations

The Division has around 29 servers in daily use: each of them serving different functions, on different hardware, of different ages (from a few weeks to five years old), with different operating systems (Microsoft Windows Server 2000, 2000 Advanced, 2003, a special version for the Network Attached Storage device, Apple Macintosh OS X Server, and now a Linux machine!), different configurations and locations. Keeping track of which machine is doing what, where it is physically and how it is set up is a major project and one which is being addressed with added enthusiasm at the moment.

Backup regime

Due to limitations of the retiring backup system, we weren’t keeping as many backups as we perhaps should have. We will increase the frequency of backups (especially of critical systems where data changes frequently), and investigate ways of restoring files changed even on the same day.

Change

Another issue identified as a result of the disaster was the transfer of responsibility for Divisional services to ICT Services.

  • The Academic Skill Program database should have moved to ICT Services on 1 January 2006 when the ASP moved to the Division of Teaching and Learning. Configuration, maintenance and hosting of the database are still the responsibility of the TSU and, despite a number of attempts to find out what the migration plan is for the service, none is in place.
  • Web hosting should have been migrated to ICT Services, although not all of the Division’s web sites will ever be the responsibility of UCOnline. Those web sites we will continue to have responsibility for will be migrated to a new server by the end of this year, as will those destined for the Web CMS.
  • The Division’s Helpdesk was due to be replaced by CA’s UniCenter ServiceDesk from 1 January 2006. It is still not available for us to use, and it probably won’t make much sense for us to use it until January 2007, assuming the current date of commencement (30 October 2006) does indeed see the system operational.

Skills

With the move to centralising standard and common services (like the network, email, staff home directories), the Division has been moving to provide more specialised resources (and the appropriately skilled technical staff to look after those resources) that are not and may never be standard or common. When ICT Services doesn’t provide us with the level of service we require to support teaching, learning, research and administration in the Division, we end up having to be involved in looking after the old services as well as the new. This is becoming increasingly difficult, as common services remain the responsibility of the Division.