Museums and the Web 2007
April 11-14, 2007
San Francisco, California

From Casual History to Digital Preservation

Ari Davidow, Jewish Women’s Archive, USA

http://katrina.jwa.org

Abstract

Traditional history relies on the ability to review hundreds or thousands of relevant documents and artifacts. Using Web 2.0 tools, an archive can now gather those objects on-line, creating a historical record broader and deeper than ever before possible. In conceiving its "Katrina's Jewish Voices" project in late 2005, the Jewish Women's Archive realized that it was not enough to create a "raw archive" of such objects. Digital preservation requires assurances of fixity, as well as the capture of significant metadata about objects, about their contributors and creators (where possible), and about the relationships between complex objects. In this project we made good progress towards these goals and took a major step toward becoming, and exemplifying, the "Archive for the 21st century."

Keywords: digital preservation, on-line collecting, archival metadata, fixity, raw archive

Introduction

The Jewish Women’s Archive (JWA), founded in November 1995, grew out of the recognition that neither the historical record preserved in America’s Jewish archives nor that preserved in its women’s archives accurately or adequately represented the complexity of Jewish women’s lives and experiences over the three centuries in which they had been making history in the United States. (JWA, 2006)

The Archive’s mission is to “uncover, chronicle, and transmit the rich history of American Jewish women.” That includes a commitment to documenting Jewish women from all walks of life, and to reaching out to the general public, as well as to scholars. Uniquely, JWA does not maintain physical archives. Rather, from its inception, the commitment has been to present information on-line, available at any time to anyone with access to the Internet.

Collaboration has also been a key part of the Archive’s work. Initial JWA projects grew out of collaboration with MIT’s Center for Educational Computing Initiatives and resulted in profiles of significant “Women of Valor,” as well as a database, the “Virtual Archive,” containing information about where to find the papers of hundreds of significant Jewish women. In 2005 we teamed up with the University of Michigan to use OCR and new indexing technology to make the full text of The American Jewess, the first American periodical for Jewish women, accessible and searchable over the Web. That same year we also launched a new type of exhibit, “Jewish Women and the Feminist Revolution,” in which the women profiled were asked to determine the relevant artifacts to be digitized for inclusion in the exhibit. That freedom to choose enabled us to present a variety of documents and digitized artifacts that would never otherwise have been presented.

The Archive has long been active in gathering oral histories, as well. In addition to exhibits profiling communities in Baltimore and Seattle, the Archive has undertaken numerous smaller oral history projects, including a popular series, “Women Who Dared,” featured both on our Web site and in events held in several cities in the United States, including New Orleans.

Katrina’s Jewish Voices

Hurricane Katrina and its aftermath profoundly changed communities along the Gulf Coast and in New Orleans. The Jewish communities were also changed, but in ways that reflected and accelerated shifts that had begun in those communities decades ago. To capture those communities’ history and changes, JWA joined forces with the Goldring/Woldenberg Institute of Southern Jewish Life (ISJL) to conduct 100 in-depth oral histories with members of the Jewish communities of New Orleans, Baton Rouge, and the Gulf Coast. In addition, we partnered with the Center for History and New Media (CHNM) to create a Web site to help capture the stories more broadly, using on-line collecting. This paper is about that on-line collecting project.

By this time, CHNM had been involved with on-line collecting for over half a decade. Starting with their 9/11 on-line collecting site (CHNM, 2001), the organization had literally written the book on on-line collecting (Cohen & Rosenzweig, 2005).

CHNM projects focused on providing a comfortable user interface buttressed by the latest in Web 2.0 technologies. They were early to use AJAX to simplify the contribution form, inviting site visitors to contribute stories and digital items on-line, and similarly early in using a mashup with Google Maps to provide a map-based interface for browsing. For this project, they added the ability to categorize items using folksonomic tags. Tags could be added by the original item contributors, by registered users on the site, and by JWA staff through the administrative interface on the back end.

As is the case with all on-line history projects, outreach would be essential. Throughout the project, CHNM staff warned us that outreach would be the critical requirement for populating the archive: “If you build it, they will not come.” They also provided invaluable and ongoing advice in holding workshops and outreach sessions in various communities.

The Back End Challenges

There were three major components to the back end of the system: administration, users, and data. Each posed special challenges.

Digital Preservation

The back end of the site was an interesting cross between a “traditional” Web site Content Management System (CMS) and a Digital Asset Management System (DAMS). For Content Management, we had to ask ourselves questions about what would appear on the Web, and how.

The DAM side had longer-term implications. Digital Asset Management, buttressed by appropriate policies and procedures, carried out successfully and predictably over time, is Digital Preservation. To do this well, we had to ensure that we captured all appropriate technical, personal, and descriptive metadata. We also had to provide assurance of fixity - to ensure that what we were presenting was, in fact, what had been contributed, and that nothing had been changed or corrupted over time.

The fixity issue is of significant importance. I had personal awareness of how slippery digital provenance could be. About 20 years ago, I typeset a book about Kurt Waldheim. One of the photo exhibits for the book was a copy of Waldheim’s indictment after WWII by the UN. The document we received was unreadable. It looked as though it had been faxed and copied a few too many times. An early software package to edit image bitmaps had just been released. We carefully researched how the text might have read had it been readable, and then used SuperPaint to make it so. The result made a much better impression on casual readers, but was no more “true” than the photoshopped images passed around after 9/11 purporting to show the planes about to hit the Twin Towers.

It was fine, as historical artifacts, for the archive to receive photoshopped images; we have no way of usefully vetting authenticity. But it is essential that we be able to prove that what we display (or the original of what we display) is exactly what was submitted - that each object has neither been altered nor corrupted.

Digital Preservation workflow and processes involve several automated steps. This is an area where entering data by hand takes time and is prone to mistakes. In larger systems, tools such as JHOVE (http://hul.harvard.edu/jhove/) can be used to extract technical information from digital objects as they are submitted. This information includes what type of object has been submitted, as well as compliance with the standard for that object (e.g., the file appears to be a JPEG, but is that really the case? Is anything wrong with the object?). In our case, we used a smaller open source module, as customized by the CHNM project team.
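
As a rough illustration of what such an identification step does, here is a minimal sketch in Python using the Pillow imaging library. This is not the module CHNM customized; the function and the metadata fields are hypothetical.

    from pathlib import Path

    from PIL import Image  # Pillow, used here for basic image validation

    def extract_technical_metadata(path):
        """Derive technical metadata from an uploaded file: a simplified
        stand-in for a JHOVE-style identification and validation step."""
        p = Path(path)
        meta = {
            "filename": p.name,
            "size_bytes": p.stat().st_size,
            "format": None,
            "valid": False,
        }
        try:
            with Image.open(p) as img:
                img.verify()                 # raises if the bitstream is corrupt
                meta["format"] = img.format  # e.g. "JPEG", "TIFF", "GIF"
                meta["valid"] = True
        except Exception:
            # Not an image, or a damaged one; a fuller tool would fall
            # back to other format identifiers here.
            pass
        return meta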

In the original data model for the Web site, we initially followed CHNM’s lead in creating categories called “online text” and “online file”. The reason for segregating content created using our Web site lies with the second category: “online file.” Since we don’t ask contributors for technical information (nor do we expect most on-line contributors to be sufficiently computer-savvy to know such information reliably), it seemed safest at first to segregate such items. But our back-end system does know what it has received - we derive that information as part of the automated technical extraction process. By segregating some objects as “online file,” we complicate site visitors’ ability to find objects. How will our site visitors know that cousin Janet’s pictures are categorized not as “still images” (which we might have further simplified to “photos” or “photos/images”) but as “online files”? As part of “lessons learned,” we are currently changing that functionality to eliminate the “online” categories.

For “fixity” - assurance that what we hold is what was submitted - we insisted on changes to the original CHNM processes. First, the upload software calculates a “checksum,” a digital fingerprint of the submitted file. The software then records the date and time at which the file was received. Subsequently, work done through the administration tools will also automatically record the “last modified” date. By regularly checking that the file generates the same checksum, we know that it has not been modified. By looking at the creation and last-modified dates on the archive’s file system, and comparing those with what is stored in the file metadata, we have further assurance of the file’s integrity over time.
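
The checksum mechanics are simple; here is a minimal sketch in Python, assuming SHA-1-style checksums recorded at upload time (the storage of the recorded checksum is left out, and the function names are ours, not the project’s):

    import hashlib

    def file_checksum(path, algorithm="sha1", chunk_size=65536):
        """Compute a checksum - a digital fingerprint - of a file,
        reading in chunks so large files don't exhaust memory."""
        digest = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_fixity(path, recorded_checksum):
        """True if the file still matches the checksum recorded at
        upload; run regularly as part of a fixity audit."""
        return file_checksum(path) == recorded_checksum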

It is also worth noting that the fact that we take these elaborate precautions to ensure that the original file (or files, in the case of an aggregate object - see below) is not altered does not mean that we present the original file to archive visitors. In some cases this is simply not possible: a URL pointing to a Web site might point to entirely different content tomorrow, and might look entirely different when viewed with tomorrow’s Web browser technology. We preserve Web pages as we can, and present the “look and feel” on the public archive via Acrobat PDF. Similarly, we do not present Word documents, but rather first create Acrobat PDFs. Over time, part of our Digital Preservation effort will include deciding when/if to migrate files or how to otherwise continue to ensure usability (where possible) of the original bitstream. In other cases, where people contribute files that have no common Web analog (image TIFFs, for instance), we create JPEG or GIF files for Web display, as appropriate. We are also looking at techniques for better preserving the look and feel of complex Web sites using technologies such as MPEG-21 (Smith & Nelson, 2007).
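
For image derivatives, the conversion step might look like the following sketch, again using Pillow; the parameters are illustrative, and the preservation master is never touched:

    from PIL import Image

    def make_web_derivative(master_path, derivative_path, max_size=1024):
        """Create a JPEG derivative for Web display from a preservation
        master (a TIFF, say) without modifying the original bitstream."""
        with Image.open(master_path) as img:
            img = img.convert("RGB")             # JPEG has no alpha channel
            img.thumbnail((max_size, max_size))  # scale down, keep aspect ratio
            img.save(derivative_path, "JPEG", quality=85)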

In addition to the data automatically generated when an item is uploaded, there is object type-specific data that is accessed via an AJAX-controlled administration form.

Fig 1: Adding a new object from the KJV administration area, showing the default form

Fig 2: Showing the first part of the email-specific content type once “email” has been selected from the dropdown menu.

Users - or People, in General

In a typical CMS or DAM, “Users” are the people with authorization to use the system. Depending on what role has been assigned to the users, they may be able to view objects, edit them, or even alter the system. Typically, these administrative users are also entirely different from the people who contribute the objects that are being managed. In our case, this model was different, and the roles were much more complex.

This was familiar territory, but we did not notice during our original planning that we were building something more complex than previous CHNM Web sites. Some of the new features didn’t matter: for instance, the fact that the Public role could now include people who didn’t also appear in the Contributors table was minor. Some people contributed items on-line; some registered with the site so that they could mark favorite items in MyArchive and add their own tags to objects. But we failed to consider two things carefully.

First, some contributors use our on-line collecting interface. Because we require only an e-mail address and a name from on-line contributors, we do not collect an explicit login id. There is no reason not to use the e-mail address as the login id, and many good reasons to do so. But, by habit, we think of administration as requiring a login, and in building the admin side of the site, we had created a specific field for login. Thus, we have two login patterns. These patterns are further complicated by the fact that when contributing on-line - doing exactly what we were hoping people would do - individuals didn’t get the opportunity to set a password. So, if we succeed in making fans out of our contributors and they come back to collect favorites, or to contribute more items, they have no way of actually logging in. Worse, if they try to contribute again without logging in, the original site software notes that the contributor’s e-mail address matches one already on file. The logic insists that our contributor is registered on the site and must now log in - which, without a password, is impossible. The solution is to rethink user-related id and password issues. We’ll use e-mail addresses as logins, and will be sending new contributors the appropriate login information and system-generated “starter passwords” as part of the message that informs them that their contribution has been received and tells them how to view it.
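
A sketch of how such a starter password might be generated and folded into the acknowledgement message (Python; the wording and function names are invented for illustration, and a real system would store only a hash of the password):

    import secrets

    def make_starter_password(length=10):
        """Generate a random starter password for a new contributor,
        avoiding look-alike characters (0/O, 1/l)."""
        alphabet = "abcdefghjkmnpqrstuvwxyz23456789"
        return "".join(secrets.choice(alphabet) for _ in range(length))

    def acknowledgement_message(email):
        """Acknowledgement mail that doubles as account setup."""
        password = make_starter_password()
        return ("Thank you for your contribution.\n"
                "You can log in with your e-mail address (%s) and the\n"
                "starter password %s to view your item or add more."
                % (email, password))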

A second set of problems arose when we realized that there were three different contribution-related roles, not one: contributor, creator, and collector. Any user role from Public to Superuser can be attached to a login that also appears in what we are now calling the People table - the table of people who are involved in the creation, collection, or contribution of items to the database. Where the original design had us entering creator or collector names into a text field, whether or not the names were already in our database, we will finally have a way to prevent typos, on the one hand, and to record information about the person on the other.
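
One way to model this is to treat creator, collector, and contributor as roles linking objects to rows in a single People table. A minimal sketch in Python (the class and field names are hypothetical; in the real system these would be database tables):

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Role(Enum):
        CREATOR = "creator"          # made the object
        COLLECTOR = "collector"      # gathered it
        CONTRIBUTOR = "contributor"  # submitted it to the archive

    @dataclass
    class Person:
        person_id: int
        name: str
        email: Optional[str] = None  # required only for contributors

    @dataclass
    class ObjectPersonLink:
        """Ties an archive object to a Person in a specific role,
        instead of storing names as free text on the object."""
        object_id: int
        person_id: int
        role: Role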

When we receive an on-line contribution (http://katrina.jwa.org/contribute/), all that we require of the contributor is that he or she give us a name, a verifiable e-mail address, and permission to use the object(s). What we want, of course, is much more. We have had reasonable success in getting that information from our on-line contributors, as well as from people who submit items via email, CD-ROM, or other media.

The additional information - metadata - includes several pieces of information about the object: a description, whether or not the contributor created the object (and if not, who did), what location (using Google Maps or a form) the object refers to, the date to which the object refers, and even tags. This object metadata is presumed to apply to each individual item that someone contributes in a given session, whether they type in one story, upload one file, or upload several files. Such metadata, like the metadata we gather about the objects themselves, would be the same for any on-line collecting project.

Once we have the contribution, plus whatever object metadata the contributor chooses to enter, we ask for permission to use the object, and then for an e-mail address, and a name. At this point, we have enough information to accept the item. Nonetheless, we do want to know more about each contributor.

Much of the personal metadata that we gather is unique to this project. This is the part of the form that changes with each project. We ask obvious questions such as occupation, age, gender, and religious affiliation. We also ask whether or not the person lived in New Orleans at the time of the hurricane, and if so, where they evacuated to, where they have been since, and whether they plan to return. If they didn’t live in New Orleans, we ask how they came to be involved. The form is mediated using AJAX so that people only see parts of the form that are relevant to their answers. If they change answers, the form will change to suit. (On the administration side, we use similar AJAX constructions to navigate contributor information.) We not only wanted to simplify each person’s experience so that they see only questions relevant to their self-identification, but also wanted them to be able to explore without having to start over.

Fig. 3: Tell Us More

Once the required minimum information has been gathered, a contributor can choose to submit the contribution, or to tell us more. In either case, AJAX is used to shape the form based on the contributor’s responses.
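
The branching logic behind such a form is straightforward; the sketch below (Python, with invented question keys) shows the idea of deriving the relevant question set from the answers so far, which the AJAX front end would use to show or hide form sections:

    def relevant_questions(answers):
        """Return the follow-up questions relevant to the answers so
        far; the client-side form shows or hides sections accordingly."""
        questions = ["occupation", "age", "gender", "religious_affiliation",
                     "lived_in_new_orleans"]
        if answers.get("lived_in_new_orleans") is True:
            questions += ["evacuated_to", "where_since", "plan_to_return"]
        elif answers.get("lived_in_new_orleans") is False:
            questions += ["how_involved"]
        return questions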

In most cases, we receive submissions indirectly - not through the site. We then follow up using e-mail to gather that additional personal metadata.

Objects

If there was one area that highlighted the difference between “good enough to get usable material on-line” - the “raw archive” - and our attempt to set up a system that would support asset management and preservation, getting a handle on “objects” was it. In the CHNM model, items are contributed one at a time. In some cases, items are united by considering them “collections,” but in most cases there is no easy way to view related objects as a group. This is a common DAM issue. How does one indicate that the 10 images used for a VRML presentation of a museum asset are one “object”? How does one indicate that these 5 items are part of a set? In our case, we knew from the start that “object” could not be a synonym for “file,” just as there had to be collections, sub-collections, meta-collections, and the like. The example we used most often was an e-mail with attachments: pictures, for instance.

We simply got it wrong.

We spent several weeks at the beginning of the project working on content models, and CHNM did a great job of coming up with terms and metadata, primarily based on Dublin Core, to describe the various object types, generalized object metadata, and relationships among items. But CHNM, up to this point, had been doing “instant history” and gathering “raw archives.” They had only captured this metadata in text fields, even significant items such as “unique ID” and “upload date.”

Implementing the metadata in ways that would meaningfully contribute to Digital Preservation was significantly more complex than anything required by the on-line collecting sites our vendor had built before, and we got it wrong. We did not sufficiently understand how to set up the application to indicate and enforce relationships such that specific file metadata could be easily stored and edited at the file level without getting in the way of “object” metadata that might have similar metadata descriptors. “Creator” or “creation date” or “location” might refer to a specific “object” (what we now call a “working copy”) as a whole, or to specific parts (which we now call “items”). When a group of items is contributed in a single on-line session, that metadata will apply both to the working copy and to its component items. As we look forward to coming projects, this is the area that will change the most. We just didn’t have the archivists’ or librarians’ knowledge that would have made obvious the need to record component attributes separately from aggregations, nor did the programmer have experience with models that involved relationships more complex than could be indicated in a series of metadata text fields.

Levels of Aggregation/Distinction

For the record, we think we recognize the need for at least four levels of aggregation/distinction:

  1. item - the individual file, the smallest component of something.
  2. working copy - one or more items that form a working object: that e-mail with attachments, or the set of images plus XML in a VRML set.
  3. subcollection - used to collect items in a set or session; in the KJV context, this might represent all items created by one person, contributed in one session, or drawn from a single collection.
  4. collection - used to collect items contributed or collected by one person or organization; in KJV terms, this could be the set of items gathered by one of the local Jewish communal groups.

Items can belong to more than one working copy, subcollection, or collection.
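
A sketch of these four levels as a simple data model (Python dataclasses; the names follow the list above, with many-to-many membership expressed as sets of item ids - all of it illustrative rather than the project’s actual schema):

    from dataclasses import dataclass, field

    @dataclass
    class Item:
        """The individual file - the smallest component of something."""
        item_id: int
        filename: str
        checksum: str

    @dataclass
    class WorkingCopy:
        """One or more items forming a working object, e.g. an e-mail
        plus its attachments."""
        working_copy_id: int
        item_ids: set = field(default_factory=set)

    @dataclass
    class SubCollection:
        """Items grouped by set or session."""
        subcollection_id: int
        item_ids: set = field(default_factory=set)

    @dataclass
    class Collection:
        """Items contributed or collected by one person or organization."""
        collection_id: int
        item_ids: set = field(default_factory=set)

    # Because aggregates hold references rather than the items
    # themselves, an item can belong to more than one working copy,
    # subcollection, or collection.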

A good explanation of these levels has been provided recently by Leslie Johnston, Head of Digital Access Services at the University of Virginia, at the Open Repositories Conference 2007, in San Antonio (Johnston, 2007).

In addition to the complex relationships, it is important to recognize that metadata is best stored and maintained in a variety of ways. Text fields are to be avoided where possible.

In some cases, the application should be gathering the metadata (especially for items such as the item checksum, upload date, and last-modified date, and, we hope, for all technical metadata). Some dates do need to be entered by humans (On what date did this event occur?), but they also need to be stored as dates in the database, not as text, and the form design needs to provide affordances such that dates are entered consistently and correctly. Without enforcing “date” attributes, it is impossible to do any date-related searches or browses (What was uploaded in the last two months? Do we have items pegged to the Katrina commemoration on August 29, 2006?).
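
The enforcement itself can be as simple as parsing the form value into a real date type before storage - a minimal sketch, assuming ISO-8601 input from a constrained date widget:

    from datetime import date

    def parse_event_date(value):
        """Store human-entered dates as real dates, not text, so that
        date-range queries ('uploaded in the last two months') work."""
        return date.fromisoformat(value)  # raises ValueError on bad input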

In the case of repeatable information, such as the people who create, contribute, or collect items, names need to be selected from a list, or verified against a list with a request for “Person” metadata when a new name is encountered. It is not a good thing to have the same person in the database more than once. Even worse is to have that person’s name indicated by variant spellings. (When gathering old documents, such variances are common, and are resolved using something called an “authority file,” which indicates the “authoritative” spelling of a name or a term. In the age of relational databases, where the data are under the control of a single organization, there is no justification for such drift.)

Relationships between objects similarly need to be defined or verified from a pick list, not keyed in. Without such constraints, one winds up with objects having idiosyncratic, one-off relationships: isPartOf here, isChildOf there.
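
Both constraints amount to validating input against a controlled list before it reaches the database. A minimal sketch (the vocabularies and helper names are examples, not the project’s actual lists):

    ALLOWED_RELATIONSHIPS = {"isPartOf", "hasPart", "references"}

    def validated_relationship(rel):
        """Accept only relationships from the pick list, so that one-off
        variants (isChildOf, is_part_of, ...) never enter the database."""
        if rel not in ALLOWED_RELATIONSHIPS:
            raise ValueError("Unknown relationship: %r" % rel)
        return rel

    def resolve_person(name, people):
        """Match a creator/collector name against the People table
        (here a dict of normalized name -> person_id); unknown names
        trigger a request for new Person metadata rather than silently
        creating a variant spelling."""
        key = " ".join(name.split()).lower()  # normalize whitespace/case
        if key in people:
            return people[key]
        raise LookupError("New name %r: please supply Person metadata" % name)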

Duplicate File Names

There is one last, minor issue that we encountered: how to deal with duplicate file names. Files are stored in the Web server’s file system rather than inside the database. This is usually much easier to maintain and back up than the alternative, but it creates a problem when more than one file is submitted with the same file name. (This can happen either because more than one person uploads the same file, or because some names are natural choices when subjects are similar.) The usual practice when encountering a new file using an already extant name is to add a number. So, the first file might be called “overturned car.jpg”; the next time around, the file is stored as “overturned car1.jpg” (with the original filename recorded in the metadata for fixity purposes). In the first iteration of Katrina’s Jewish Voices we decided to be overly clever and included part of the checksum in the filenames, so that a file received as “ZAKA torah photo_1.jpg” becomes “ZAKA torah photo_1_1f8ee628f1.jpg.” The extra work complicates some file maintenance, complicates the usability of the files when downloaded, and gets very much in the way of debugging.
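
The simpler numbering scheme is easy to implement; a sketch in Python (the directory layout is hypothetical):

    from pathlib import Path

    def unique_store_name(directory, filename):
        """Return a filename that does not collide with files already
        stored: 'overturned car.jpg' -> 'overturned car1.jpg', and so on.
        The name as received is kept in the item metadata."""
        target = Path(directory) / filename
        stem, suffix = target.stem, target.suffix
        counter = 1
        while target.exists():
            target = target.with_name("%s%d%s" % (stem, counter, suffix))
            counter += 1
        return target.name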

Lessons Learned

Gathering raw archival data would not have been sufficient to meet the needs of the Jewish Women’s Archive’s mandate. Working with the Center for History and New Media, we succeeded in extending their vision of on-line collecting while also taking advantage of CHNM’s experience in the subject and meeting very tight deadlines. In true Web 2.0 fashion, we are already working on a new iteration of the Katrina site, which should be available by Spring 2007.

Automation is the key to sanity and success. Had we not automated the gathering of technical metadata, we would have no reasonable way of ensuring fixity, nor would we have a reasonable way of handling the hundreds of items currently backlogged. Future projects will want to include good tools for dealing with batches of items at a time.

As we plan for a full-fledged DAMS, ensuring that the maximum amount of metadata is generated by automated tools is something we think about a lot. We are not ready to try automating descriptive metadata, but to the degree that humans don’t have to indicate things such as file size or type, or manually type in dates that the computer can generate, the more likely such data is to be accurate. That accuracy is an essential part of our plan to create the full suite of tools, policies, and procedures required for a Trusted Digital Repository.

The biggest thing we learned is the vast gap between “raw archive” and Digital Preservation. Since this project was initiated, one staff member has completed a course in managing digital preservation and we have hired a digital archivist. At the same time, we need to acknowledge that the project succeeded. This Web site is used by a growing number of people and is attracting exponentially increasing quantities of contributions. The tool is sufficiently successful to encourage use and revision. To me, this strongly reinforces the desirability of undertaking projects in short, iterative steps. If we had waited until we knew what we know now, we would still not be on-line, and many of the materials we have collected would already be lost.

It is also true that we would not have succeeded as well as we did without the meticulous Content Model definition and workflow analysis that preceded the actual design and coding. This leads me to a favorite issue: project management. When one is creating a very simple Web site from a well-known template, project management can be fairly casual. It is absolutely critical that both my own organization and our future vendors be grounded in professional project management practice. A single, full-semester project management course is enough to learn the professional vocabulary and norms and to begin to understand what project management is. But without some basic professional project management skills, it is impossible for staff to break a project down and discover what they do not know, early, and to solve problems effectively before they affect schedule, scope, or cost. Although in principle one project management professional on one team should be enough, my experience suggests that each partner to the collaboration needs these skills. Without them, miscommunication and inadequate management flourish. Where this project underachieved, this is where the problem lay.

Finally, as we look forward, it occurs to us that a single on-line collecting Web site is always going to be a hard sell. It will always require significant effort to reach out to people and to get the first contributors to participate. We are looking hard at ways to make our MyArchive tools easier to use, and to generalize them throughout the JWA Web site. As we build on the success of Katrina’s Jewish Voices and create new on-line collecting projects, we should also be creating a community interested in the Jewish Women’s Archive. As that community develops, we will find ourselves with a core group of people who participate regularly - using Web site forums, sharing tags, commenting on blog posts. They will be familiar with us, and comfortable being the first contributors on new projects. We believe that this will nudge us gradually into a virtuous cycle in which participation brings more participation.

As William H. Whyte noted years ago (Whyte, 1988), people like to be where people are. If JWA is to succeed in its mission of uncovering and transmitting history, we must focus not just on new ways of gathering information and artifacts. We need to ensure that we provide affordances such that our Web site becomes a place “where people are.”

References

Center for History and New Media (2001 and following). “The September 11 Digital Archive.” http://911digitalarchive.org/

Cohen, Daniel J. & Roy Rosenzweig (2005). Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web. Philadelphia: University of Pennsylvania Press.

Jewish Women’s Archive (2006). “Introduction.” JWA Technology Plan, internal document.

Johnston, Leslie (2007). “How the Principles and Activities of Digital Curation Guide Repository Management and Operations.” Presented at Open Repositories Conference 2007, San Antonio, TX.

JSTOR/Harvard Object Validation Environment (JHOVE) (2003). http://hul.harvard.edu/jhove/

Smith, Joan & Michael Nelson (2007). “Using OAI-PMH Resource Harvesting and MPEG-21 DIDL for Digital Preservation.” Presented at Open Repositories Conference 2007, San Antonio, TX. http://openrepositories.org/program/presentations#session2. Demos at: http://beatitude.cs.odu.edu:9999/

Whyte, William H. (1988). City: Rediscovering the Center. New York: Doubleday.

Cite as:

Davidow, A. (2007). From Casual History to Digital Preservation. In J. Trant and D. Bearman (eds.), Museums and the Web 2007: Proceedings. Toronto: Archives & Museum Informatics. Published March 1, 2007. Consulted at http://www.archimuse.com/mw2007/papers/davidow/davidow.html
