Research Issues in Migration Strategies within an Electronic Archive
Steve Binns, David V. Bowen, Alan Murdock, Jean Samuel, Tony Parsons
Electronic Records Meeting
Pittsburgh, PA
May 29, 1997
The primary requirement of an archive of electronic records is that future users, working in their own context and environment, can read and understand records archived in the current context and environment, without loss of pedigree or sense. We use "migration" to refer to the process by which electronic records are changed to maintain access and utility. Two major types of change drive migration: changes in the computer hardware and/or software used to view or use the record, and changes in the computer hardware and/or software used to store and index the record (the electronic archive). This paper therefore discusses issues with migration both of records and of the archive in which they are kept.
If electronic records are to be maintained, they must be identifiable, accessible and usable.
In all three of these requirements electronic records differ significantly from physical records, and all three require significant research effort to define:
The key difference is that a physical record can be seen and handled directly by humans. If it is protected from damage, it maintains its form and content, and the seeing or handling will be possible at any future time. Physical records are normally labelled by physical labels, and these too maintain their existence over time. The techniques required to read and understand many sorts of physical record persist easily over tens of years and often persist over hundreds of years. Thus I can read a letter which was written to me 40 years ago, and I can read a book which is 150 years old. Indeed, with readily available training, I could read manuscripts which are over 1000 years old.
Electronic records are stored as patterns of 1s and 0s represented in a magnetic or electrical form. Computer hardware and software are needed to find the record, to identify it, and to present it in a legible or usable form. Computers change on a 9 to 18-month cycle, and the changes are so severe that a record acquired two or three years ago may not be legible on a modern computer. The software that created that record may not run on a modern computer. Indeed, the computer on which that record was created may become difficult to keep operational, albeit on a 10-20 year timescale.
The famous picture is that of the Rosetta Stone surrounded by old (over 5 years old) floppy discs and magnetic tapes. The Rosetta Stone is the only record that can be read. The records on the other devices must be "migrated" to retain identity, accessibility and utility.
The problem of maintaining electronic records is made more complex by the requirement to migrate the archive in which they are held. Since the electronic archive is a computer, it will become obsolete just as quickly as the systems that created the records. This imposes a second migration requirement and a second migration cycle on an electronic archive: source software and hardware must be migrated, but the archive itself must also be migrated.
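The two migration cycles can be illustrated with a minimal Python sketch. It is an illustration only, not a description of our archive: the Record class, the format labels "wp-v1" and "wp-v2", the converter and the in-memory "platform" are all hypothetical. The point it tries to show is that migrating a record changes its stored bytes (so the change must be noted in its pedigree), whereas migrating the archive should leave every record bit-for-bit unchanged.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical archived record: identifier, format label, stored bytes."""
    record_id: str
    fmt: str
    payload: bytes
    pedigree: list = field(default_factory=list)  # history of changes to the record


def checksum(data: bytes) -> str:
    """SHA-256 digest used as a simple fixity value."""
    return hashlib.sha256(data).hexdigest()


def migrate_record(record: Record, new_fmt: str, convert) -> Record:
    """Cycle 1: the record's format changes, so its bytes change.

    The identifier is kept, and the pedigree notes the change together
    with a digest of the source bytes.
    """
    entry = f"{record.fmt} -> {new_fmt} (source sha256 {checksum(record.payload)[:12]})"
    return Record(record.record_id, new_fmt, convert(record.payload),
                  record.pedigree + [entry])


def migrate_archive(records: list, new_platform: dict) -> dict:
    """Cycle 2: the store changes; each record must arrive unchanged."""
    for rec in records:
        expected = checksum(rec.payload)
        new_platform[rec.record_id] = rec
        # Read back and compare. In this in-memory sketch the check always
        # passes; on real storage media it is the step that detects damage.
        if checksum(new_platform[rec.record_id].payload) != expected:
            raise ValueError(f"fixity check failed for {rec.record_id}")
    return new_platform


# Usage with made-up data: the converter is a stand-in for a real format converter.
memo = Record("memo-001", "wp-v1", b"QUARTERLY REPORT")
memo_v2 = migrate_record(memo, "wp-v2", lambda raw: raw.lower())
new_archive = migrate_archive([memo_v2], {})
```

Keeping the two operations separate reflects the argument above: the two cycles run on different schedules and fail in different ways, so each needs its own check.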
This paper sets out some of the issues and questions raised by the requirement to maintain electronic records. In delivering CEA Phase 1, we have answered some of the questions raised by the profession over recent years and gone on to raise new questions as our operating electronic archive delivered experience. We now have a good idea of the process point where migration alerts need to be raised; it is earlier than is often recognised. We have met the issues of balancing user and custodian responsibility for electronic records, and of deciding between centralised system provision and stand-alone, non-standard systems. This work has been reported at the DMM Forum (Brussels, 1996). We will share solutions where we have found them and raise new questions which we hope the archive and software communities will find challenging. We hope our practical experience will expand the profession's current repertoire of questions, which we know you have already found challenging.
An electronic record may not have a fixed physical form and, even where it does, that form is not perceptible to a person without the aid of computing hardware and software. The first criterion for maintaining electronic records is therefore that they be identifiable as records. This requires:
An electronic record will only be useful if it can be accessed by users and custodians. This implies that users (subject to some security model) must be able to find which records exist that are (or perhaps just 'may be') relevant to a query (a concern, a present activity). They must then (given appropriate security levels) be able to obtain a copy of, or a view of, the record. This raises some interesting questions:
At some time before the software (or even the combination of hardware and software) that created a record becomes obsolete, it is important to consider how to maintain the record so that it can be read, or used. In this respect documents may be easier to manage than experimental data or images. The choices available to the record owners or custodians include:
The discussion so far has considered records which are:
However, the computerised record world is changing too rapidly for these categories to remain useful. Systems are already creating composite records, in which (for example) a document contains a spreadsheet, some graphs, and some images. "Contains" here can mean either "holds a static representation of" or (more importantly for our discussion of issues) "is linked to a live representation of".
The second meaning of "contains" is especially important when multidimensional records are considered. For example, a medical image might be a collection of measurements in which each point in 3-dimensional space is represented by two or more intensities at several different times. A scientific measurement might combine a time-varying measurement with three different measurements of chemical properties at each time point. Each chemical property measurement might contain three or four dimensions.
These multidimensional records cannot be represented by a single image: they can only be represented by software which selects a view. A different user, asking a different question, may need to see a different view. The view contained in a document is not the only view, nor the best view.
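A small Python sketch may make the idea of "software which selects a view" more concrete. It rests on assumptions made purely for illustration: the record is a made-up set of measurements with two intensities at each point of a two-dimensional grid (rather than three-dimensional space, for brevity) at several times, and the function names are hypothetical. The same stored record yields quite different views depending on the question asked.

```python
def make_record(n_t, n_x, n_y):
    """Hypothetical record: two intensities at each (t, x, y) point.

    The values are invented so that the views below are easy to check.
    """
    return {
        (t, x, y): (t + x + y, t * 10 + x)
        for t in range(n_t)
        for x in range(n_x)
        for y in range(n_y)
    }


def spatial_view(record, t, channel, n_x, n_y):
    """One user's view: an image at a fixed time (rows are x, columns are y)."""
    return [[record[(t, x, y)][channel] for y in range(n_y)] for x in range(n_x)]


def time_view(record, x, y, channel, n_t):
    """Another user's view: how one point changes over time."""
    return [record[(t, x, y)][channel] for t in range(n_t)]


record = make_record(n_t=3, n_x=2, n_y=2)
image_at_t1 = spatial_view(record, t=1, channel=0, n_x=2, n_y=2)
trace_at_origin = time_view(record, x=0, y=0, channel=1, n_t=3)
```

Neither view is the record. Archiving only image_at_t1 inside a document would lose the information needed to produce trace_at_origin, which is why the record and the means of selecting views both need to be maintained.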
A database is another sort of complex record.
The World Wide Web presents another challenge to archiving. A Web page can include routines that respond to the user's commands. A record published on the Web can include links to other records, programs (applets) that act by themselves, and applets that respond to the user.
Finally, the archive itself must be migrated. Both hardware and software may need to be changed, and the changes may come about slowly, with time for planning, or more unexpectedly, even as an emergency. For example, it is clear that even stable computer operating systems change versions every year or two, and most disappear from routine use after 10-20 years. Equally, computer data storage peripherals (tape drives, disk drives, ...) may become unreliable after 5-10 years, and may be unsupported by their original vendor. Sometimes a vendor will lose technical staff or cease trading, and support may collapse more suddenly.
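The pressures described above suggest that an archive needs a routine check of its own components. The following Python sketch shows one possible form of such a check. It is an assumption-laden illustration: the Component class, the inventory and the installation years are invented, and the review ages simply take the lower bounds of the ranges quoted above (10 years for operating systems, 5 years for storage peripherals). Loss of vendor support triggers a review regardless of age.

```python
from dataclasses import dataclass

# Lower bounds of the lifetimes quoted in the text, in years (an assumption
# about where a review should start, not a recommendation).
REVIEW_AGE_YEARS = {"operating_system": 10, "storage_peripheral": 5}


@dataclass
class Component:
    """Hypothetical entry in an inventory of archive hardware and software."""
    name: str
    kind: str
    installed_year: int
    vendor_supported: bool = True


def needs_migration_review(component: Component, current_year: int) -> bool:
    """Flag a component that is unsupported or has reached its review age."""
    age = current_year - component.installed_year
    return (not component.vendor_supported) or age >= REVIEW_AGE_YEARS[component.kind]


# Made-up inventory for illustration.
inventory = [
    Component("archive OS", "operating_system", 1990),
    Component("tape drive A", "storage_peripheral", 1991),
    Component("disk array B", "storage_peripheral", 1995, vendor_supported=False),
    Component("disk array C", "storage_peripheral", 1996),
]
flagged = [c.name for c in inventory if needs_migration_review(c, current_year=1997)]
```

Here the tape drive is flagged on age and disk array B on loss of support. A check of this kind covers only the planned case; the emergency case still depends on someone updating the support status promptly.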
As a result of these pressures, plus the advent of new technologies, archives will migrate in small steps every year or so, and will undergo major migrations (to new hardware and software platforms) every 4-7 years. This raises issues, including:
Although this is only a discussion document, you may still wish to ask why these topics were not given priorities. Should we not propose an order for tackling these problems?
In fact the answer, at one level, is simple. The growth of electronic records in business, entertainment, academic research, and private use is so great that all of these issues must be tackled urgently. The penalty, if we do not solve electronic archiving quickly, is that we will lose a large part of our cultural and social history. The forces represented by technological change (and resulting obsolescence) in computing hardware and software are more devastating than war, and as inexorable. Furthermore, they are truly worldwide in their impact. No corner of the globe is free of computers; few aspects of human life are not affected by, and recorded in, computers. If we don't archive these electronic records soon, several generations (ours!) will leave drastically impaired traces for historians. We will also find ourselves unable to transmit what we have learned to our descendants.
At the other end of the scale, since we know some topics will be studied, and others won't, the decisions on priorities are not ours to make. They will be determined by academic forces:
and by commercial forces:
and by governments and the public will:
So, we hope you enjoy the next few years of research into Electronic Records. And please, please archive your research results carefully!