Museums and the Web 2010
April 13-17, 2010
Denver, Colorado, USA

Publishing Digital Museum Collections on the Web Using the Metadata Assignment and Search Tool

Nancy McCracken, Syracuse University, and Anne R. Diekema, Utah State University, USA

http://meta.usu.edu/mast/

Abstract

The Metadata Assignment and Search Tool (MAST) is a computer-assisted cataloguing system that enables museums to describe and disseminate their digital materials efficiently and to make them fully available in a digital library or searchable via Web services from their own Web sites. One of the primary goals of this project is to make publishing digital collections easier for small museums.

Keywords: digital collections, metadata, educational standards, Web publishing, cataloging

1. Introduction

Presently, smaller museums wishing to implement digital collections are at a disadvantage: they have varying degrees of metadata expertise, and limited time and budget to catalogue the resources they wish to make available to their constituents. The open source Metadata Assignment and Search Tool (MAST) is freely available to museums to assist in putting collections on-line. MAST is a computer-assisted cataloguing system that enables museums to efficiently describe and disseminate their digital materials and to make them fully available in a digital library or searchable via Web services from their own Web sites. Computer-assisted cataloguing and the ability to link to educational standards simplify the cataloguing process while enhancing access to museum materials for educational purposes.

MAST can increase the efficiency of the cataloguing process, reducing the cognitive load on the cataloguer by presenting a review-and-edit task rather than requiring completely original composition of text fields and vocabulary selections. The cataloguer and system together create an XML metadata record for each collection item.

MAST uses Natural Language Processing to populate as many fields of the metadata record as possible (the 15 simple Dublin Core elements, 8 GEM elements, and K-12 state or national educational standards), leaving the intellectual effort of cataloguing to the curator, who can approve, improve, or add entries rather than starting from scratch. After cataloguing has been completed, the tool provides a template for a simple search box that can be added to any Web site to make the digital collection accessible via any Web browser. It can also export the records via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for sharing with other digital museum collections.
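To make the record structure concrete, the minimal sketch below assembles a simple Dublin Core description for one item using Python's standard library. The field values and the example identifier are illustrative assumptions; the actual MAST record also carries the GEM elements and educational standards, which would be added in the same way under their own namespace.

# A minimal sketch (not MAST's actual record format): building a simple
# Dublin Core description for one collection item with xml.etree.ElementTree.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def build_record(item):
    """Wrap auto-suggested and cataloguer-approved values in DC elements."""
    record = ET.Element("record")
    for field in ("title", "creator", "subject", "description", "date",
                  "type", "format", "identifier", "language", "rights"):
        for value in item.get(field, []):
            elem = ET.SubElement(record, f"{{{DC_NS}}}{field}")
            elem.text = value
    # GEM elements (audience, grade, duration, standards, ...) would be added
    # the same way under their own namespace.
    return record

item = {
    "title": ["Quarry Fossils Teaching Kit"],       # suggested by the system
    "subject": ["paleontology", "earth science"],   # reviewed by the cataloguer
    "identifier": ["http://example.org/items/42"],  # hypothetical item URL
}
print(ET.tostring(build_record(item), encoding="unicode"))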

A team from the Center for Natural Language Processing (CNLP) at Syracuse University and Digital Learning Sciences at the University Corporation for Atmospheric Research (UCAR) created MAST.

2. Background

With the exponential growth of the Web, libraries and museums are increasingly aware of the importance of having a digital presence in addition to their traditional brick-and-mortar existence. Digital collections can augment access to library materials by enlarging user communities and improving visibility. Many libraries and museums have recently begun to develop digital collections, and need efficient and effective means of making these collections searchable by both their local communities and much broader audiences.

The key to making digital collections accessible is providing effective search capabilities. However, merely adding a full-text search mechanism is not sufficient; searching can be substantially enhanced with well-crafted metadata that include searchable controlled vocabularies. Metadata are crucial for managing and providing access to digital resources (Chowdhury and Chowdhury, 2003). Metadata allow digital resources to be organized, validated, and searched according to information not explicit in the full text of a resource, such as dates, grade ranges, subjects, resource types, and content standards.

The challenge of efficiently creating reliable metadata was quickly apparent in digital library development. Independent research into different components of this challenge has produced an array of tools and services that support workflows in many ways, including user interface design, best practices guidelines, and technical support for completing metadata fields, such as the Natural Language Processing approaches we describe here. The opportunity is now before us to combine these independent efforts into an integrated system with applicability in a wider setting, allowing any institution with digital holdings to efficiently catalogue, search, and disseminate its materials.

Of the three main elements that make up a digital library – documents, technology, and work – Levy and Marshall (1995) identify work as the most essential. Work here means not only the work of digital library users, but also the work of librarians to support users by making the resources in the library accessible. Our project addresses both types of work: 1) the work of cataloguers to assign metadata to digital resources and thereby make them available to users, and 2) the work required by users (e.g., students, teachers, the general public) to find relevant materials for their studies, work, or general interest. After all, a digital collection would be of no value if users could not find what they needed.

Greenberg et al. (2005) report on the challenges of metadata assignment and encourage libraries to find ways of assigning metadata that are more cost-effective and efficient than traditional methods. Their research suggests that automatic metadata assignment might be the key to solving the problem of complete and accurate metadata assignment. Our own research has shown (Liddy et al., 2002) that in most cases there was no statistically significant difference between automatically assigned metadata and metadata assigned by humans. In fact, the automatic metadata system consistently populated more metadata fields than did the human cataloguers. However, consistent with input from digital collection holders, we believe that the best approach to metadata assignment is a hybrid one in which the system populates all the metadata fields it can, and the human cataloguer reviews and edits the metadata before finalizing it. In a study of 217 participants (library administrators, cataloguers, digital librarians, archivists), Greenberg et al. found that 70%, or 148 participants, preferred a hybrid approach in which the system initially suggests metadata, followed by a human-mediated review. Only 1.4%, or 3 participants, insisted on a completely manual process.

Since cataloguing digital collections is a tedious process, and high-quality, fully automatic cataloguing is not yet within the realm of technical possibility, our goal was to create a hybrid, computer-assisted cataloguing tool that helps cataloguers assign metadata to digital collections effectively and efficiently.

3. Metadata Assignment and Search Tool (MAST)

The MAST tool was developed by integrating existing tools: MetaExtract and the Content Assignment Tool (CAT), both developed by CNLP at Syracuse University, were combined with the Digital Collection System (DCS), developed by Digital Learning Sciences at UCAR, originally for the Digital Library for Earth System Education (DLESE), funded by the National Science Foundation. Previously, the DCS and CAT had been integrated to provide a cataloguing tool that included automatic suggestions of standards. Our current work on metadata assignment extended this successful prototype integration of the two tools (CAT and DCS) to include more auto-suggested metadata fields within a more generalizable metadata framework for describing museum objects.

In our previous work we conducted a comprehensive evaluation of the workflow support offered by the DCS-CAT tool, focused on comparing the automatic assignments of educational standards to human-selected assignments. An internal DCS-CAT evaluation showed that the support for standards assignment eased the workload of cataloguers substantially and could potentially increase the number of digital items catalogued with educational standards. Our goal was to extend the DCS-CAT prototype with automatically suggested additional descriptive metadata so that cataloguing becomes a review-and-edit task rather than a daunting blank-slate task. To this end, we added the MetaExtract tool to the DCS-CAT system.

The Web-based Metadata Assignment and Search Tool (MAST) uses Natural Language Processing to populate as many fields of the metadata record as possible, leaving the intellectual effort of cataloguing to the librarian or curator, who can approve, improve, or add entries rather than starting from scratch. After cataloguing is complete, the tool can provide a template for a simple search box that can be added to any Web site to make the digital collection accessible via any Web browser. It can also export the records via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for sharing with digital libraries such as the National Science Digital Library (http://www.nsdl.org).
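To illustrate the OAI-PMH side, the sketch below harvests Dublin Core records from a provider endpoint in the way an aggregator such as the NSDL would. The base URL is a placeholder, not MAST's actual endpoint, but the verb, metadataPrefix, and resumptionToken parameters are standard OAI-PMH.

# Sketch of harvesting records over OAI-PMH (standard protocol parameters;
# the base URL below is a placeholder, not MAST's actual endpoint).
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://example.org/mast/oai"   # hypothetical provider endpoint

def harvest(base_url, metadata_prefix="oai_dc"):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        response = urllib.request.urlopen(base_url + "?" + urllib.parse.urlencode(params))
        tree = ET.parse(response)
        for record in tree.iter(OAI + "record"):
            yield record
        # OAI-PMH pages large result sets with resumption tokens.
        token = tree.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for rec in harvest(BASE_URL):
    header = rec.find(OAI + "header/" + OAI + "identifier")
    print(header.text if header is not None else "(no identifier)")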

In short, MAST maintains the hybrid human-mediated approach to cataloguing, and provides necessary workflow support to aid the cataloguer in efficient and accurate metadata generation. In addition, metadata held within this tool can easily be disseminated using the DLESE Search Web services that allow customized search and discovery within a home institution’s Web site. Each of MAST’s three component technologies is now described in detail.

3.1 Digital Collection System (DCS)

The base technology of MAST is provided by the Digital Collection System (DCS), a Web-based cataloguing tool that combines a metadata editor with search, discovery, and OAI-PMH (http://www.openarchives.org/pmh/) dissemination services. The DCS can accommodate multiple metadata frameworks that describe items, collections, and annotations. These frameworks must be expressed as an XML Schema, which the tool uses to build the interface and to enforce controlled vocabularies and required fields. A single instance can support multiple collections with different themes (topics) or metadata formats. For more information on the DCS and its continued development, consult http://ncore.nsdl.org/index.php?menu=services&submenu=services!NCS. (With ongoing development, the DCS is now referred to as the NSDL Collection System (NCS), as it is used as the collection management system of the NSDL. The NCS is available as a metadata management and cataloguing tool for projects, and CAT is provided as an optional feature.)
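This schema-driven behaviour can be pictured with a small sketch: given a framework expressed as an XML Schema, a candidate record can be validated, and missing required fields or out-of-vocabulary values reported. The example uses the lxml library and placeholder file names; it illustrates the approach rather than the DCS implementation.

# Illustration of schema-based validation of a metadata record (lxml);
# "framework.xsd" and "record.xml" are placeholder file names.
from lxml import etree

schema = etree.XMLSchema(etree.parse("framework.xsd"))
record = etree.parse("record.xml")

if schema.validate(record):
    print("Record is well formed against the framework.")
else:
    # The error log points at missing required elements or values that
    # fall outside a controlled vocabulary (xs:enumeration) in the schema.
    for error in schema.error_log:
        print(f"line {error.line}: {error.message}")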

The metadata editor component of the DCS provides many supports for the cataloguing process, including:

Ensuring well-formed records through schema-based validation of metadata values

Ensuring required metadata fields have values

Providing pick-lists for controlled vocabularies to eliminate typographical errors, and

Providing best practices information that is easily accessible for each field.

Because it is Web-based, cataloguers can use the DCS from anywhere with Web access. The system allows multiple cataloguers to work simultaneously while guarding against concurrent editing of individual records. An authentication component permits or restricts user activities based on user roles. A customizable workflow support mechanism allows the status and history of metadata records to be tracked and annotated throughout their life cycle.

The DCS incorporates a Search service that allows the metadata it manages to be disseminated on the Web from a local or a remote Web site. The search service supports the creation of interfaces that accept search queries from users and display search results directly from the DCS, using simple code embedded in an HTML page.

Educational standards are defined in the metadata frameworks as controlled vocabulary lists. The metadata editor displays these lists as collapsible hierarchies, the leaves of which can be selected to assign a standard to a particular resource. While the metadata editor can ensure that a cataloguer does not assign an illegal value to a resource, it cannot offer assistance in selecting the appropriate standards for a resource. Thus, while the DCS metadata editor addresses many of the syntactical tasks facing a cataloguer, more support is needed when it comes to assigning standards.

3.2 MetaExtract

MetaExtract is the component that extracts the bulk of the metadata values from a resource to populate the fields of MAST. It is an information extraction system designed to produce values for the 15 elements of the simple Dublin Core Metadata Element Set (DCMES) (Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights) and the 8 Gateway to Educational Materials (GEM) elements (Audience, Cataloguing, Duration, Essential Resources, Grade, Pedagogy, Quality, and Standards). MetaExtract was originally developed to automatically extract metadata elements from science and mathematics lesson plans. The data used for training and testing were primarily lesson plans: documents that embed metadata in free text, including teaching objectives, methods, duration, grade levels, and evaluation methods, and that are primarily focused on the K-12 population. Under the current project, the system has been extended to work with more general museum on-line educational materials and collection materials.

MetaExtract compiles the output from three distinct item-level extraction modules, along with information from the collection-level configuration file if available, to automatically assign metadata to lesson plans. Some elements are extracted by more than one module so that the system has a higher chance of populating them. For elements processed by more than one module, a prioritization process determines which extraction to assign to the particular metadata element. The results of all of the modules are gathered and output as an XML file.
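To picture the prioritization step, the following minimal sketch assumes each module returns candidate values per element and a per-element priority order decides which candidate is kept. The module names and the priority table are illustrative assumptions, not MetaExtract's actual configuration.

# Illustrative sketch of merging extractions from several modules by priority;
# module names and the priority table are assumptions, not MetaExtract's own.
MODULE_PRIORITY = {
    # element: modules ordered from most to least preferred
    "title":       ["html_structure", "nlp_rules", "collection_config"],
    "description": ["nlp_rules", "html_structure"],
    "grade":       ["nlp_rules", "collection_config"],
}

def merge_extractions(extractions):
    """extractions: {module_name: {element: [values]}} -> {element: [values]}"""
    merged = {}
    for element, modules in MODULE_PRIORITY.items():
        for module in modules:
            values = extractions.get(module, {}).get(element)
            if values:                 # first module with a value wins
                merged[element] = values
                break
    return merged

suggested = merge_extractions({
    "html_structure": {"title": ["Fossils of the Green River Formation"]},
    "nlp_rules": {"grade": ["6-8"], "title": ["Green River fossils"]},
})
print(suggested)   # {'title': ['Fossils of the Green River Formation'], 'grade': ['6-8']}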

The sentence-level extraction module uses Natural Language Processing to extract terms and phrases found within single sentences; it is a rule-based system that uses shallow parsing rules and multiple levels of NLP tagging to assign terms and phrases to specific metadata elements. When possible, MetaExtract also uses the structure of HTML documents to provide additional clues as to where the terms and phrases for some metadata elements can be found. The HTML-based extraction module extracts across sentence boundaries: it uses the structure of the HTML and syntactic clues to determine where the content of a metadata element may be, and then compares the text in that location to a list of clue words, developed by analyzing hundreds of documents, to determine which metadata element is present.
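The clue-word idea can be sketched roughly as follows, assuming an illustrative clue-word list and using the BeautifulSoup HTML parser; the real module's clue lists and rules were developed from hundreds of documents and are more extensive than this.

# Sketch of clue-word driven extraction from HTML structure, using
# BeautifulSoup; the clue words here are illustrative, not MetaExtract's lists.
from bs4 import BeautifulSoup

CLUE_WORDS = {
    "grade": ["grade level", "grades", "intended for"],
    "duration": ["duration", "time required", "class periods"],
    "pedagogy": ["procedure", "teaching method", "instructional strategy"],
}

def extract_from_html(html):
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    # Treat headings and bolded labels as likely field markers, then take the
    # text that follows them as the candidate value for the matching element.
    for label in soup.find_all(["h1", "h2", "h3", "b", "strong", "dt"]):
        label_text = label.get_text(" ", strip=True).lower()
        for element, clues in CLUE_WORDS.items():
            if any(clue in label_text for clue in clues):
                sibling = label.find_next_sibling()
                if sibling is not None:
                    found.setdefault(element, sibling.get_text(" ", strip=True))
    return found

html = "<h3>Time required</h3><p>Two 45-minute class periods</p>"
print(extract_from_html(html))   # {'duration': 'Two 45-minute class periods'}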

MetaExtract was developed as part of an NSDL-funded research project and was evaluated favorably in the MetaTest project (Liddy, 2005). Compared to manually assigned metadata, automatically assigned metadata was empirically shown to perform comparably in retrieval and in quality for most elements, while achieving much better coverage of metadata elements. Compared to full-text, automatically assigned metadata performs comparably in retrieval and has additional advantages as it enables fielded searching, limits searches by particular aspects (e.g., grade, language), and allows easy browsing of results. Although our study showed no significant qualitative difference between automatic and manually assigned metadata, we have worked on improving MetaExtract for some of the metadata fields that performed more poorly than others.

3.3 Content Assignment Tool

The third component of MAST is the Content Assignment Tool (CAT), which provides educational standards metadata.

CAT assists collection providers, cataloguers, and teachers in assigning educational standards by assessing the content of a resource and suggesting which standards might apply. These standards are then manually reviewed and selected for association with that resource (Diekema and Chen, 2005). CAT uses CNLP's in-house Natural Language Processing software, TextTagger, to process both the standards and the educational resource with part-of-speech tagging, phrase bracketing, and text categorization modules, identifying terms, phrases, and possible synonyms to facilitate more accurate matching between the resource and the standards. While CAT was developed to handle math and science standards, this automatic approach can be applied to educational standards in other subjects with minimal additional effort.

CAT incorporates continuous assignment quality improvement through a custom-built, instance-based machine learning algorithm. This type of learning algorithm postpones processing until a new resource needs to be processed. Each final assignment of standards to a resource is stored in a relational database. This information is then used by an algorithm similar to k-Nearest Neighbor (kNN), which uses Euclidean distance to find similar resources and their assignments in order to inform and improve future assignments (Yang, 1999). This learning can take place at the single-user level, or at the organizational level, where all assignments from multiple users in an organization are aggregated and used to inform the organization's uniform effort.
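A minimal sketch of the instance-based idea: represent each previously catalogued resource as a term vector, find the nearest stored neighbours by Euclidean distance, and let their human-approved standards vote on suggestions. The feature representation, the value of k, the voting rule, and the placeholder standard identifiers below are simplifications for illustration, not CAT's actual algorithm.

# Simplified sketch of kNN-style suggestion from stored assignments; feature
# representation, k, voting, and standard ids are illustrative only.
import math
from collections import Counter

def euclidean(a, b):
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def suggest_standards(new_vector, stored, k=3):
    """stored: list of (term_vector, [approved standard ids]) pairs."""
    neighbours = sorted(stored, key=lambda pair: euclidean(new_vector, pair[0]))[:k]
    votes = Counter(std for _, standards in neighbours for std in standards)
    return [std for std, _ in votes.most_common()]

stored_assignments = [
    ({"fossil": 2.0, "geologic": 1.0, "time": 1.0}, ["science-standard-7", "science-standard-3"]),
    ({"fraction": 3.0, "numerator": 1.0},           ["math-standard-12"]),
    ({"fossil": 1.0, "sediment": 2.0},              ["science-standard-7"]),
]
new_resource = {"fossil": 2.0, "sediment": 1.0, "layer": 1.0}
print(suggest_standards(new_resource, stored_assignments))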

In a typical use case, a cataloguer wishes to assign standards to a particular resource. The cataloguer tells the tool which group of standards to use (e.g., New York Math or the National Science Education Standards) and a grade level or grade band (e.g., K-2). The URL of the resource (or the file name and path, if the resource resides locally) is entered in the search box of the tool. The cataloguer then selects the “Suggest Standards” button, and CAT responds by presenting a list of suggested standards, ranked by their relevance to the resource. The cataloguer reviews the list of suggested standards and selects the preferred ones. If the cataloguer wishes to assign additional standards beyond those suggested by the system, a navigable standards tree is available for browsing the standards text and making additional selections. All selected standards are then associated with the resource URL in the CAT database.

In this implementation, the CAT front end communicates with a standards suggestion Web service API (the CAT Service) that is responsible for making standards suggestions for a given resource and for storing standards assignments in a database. The getSuggestions function in CAT returns an ordered list of standards relevant to the content of a provided resource URL. In addition to the resource URL, a getSuggestions request may contain several other pieces of information to influence the suggestions returned by the CAT Service; for example, providing a GradeRange causes the CAT Service to return only standards within that range.
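As a rough sketch of how a client might call such a service: the getSuggestions operation and the grade-range filter are described above, but the endpoint URL, the exact parameter spellings, and the response format in the example below are assumptions for illustration only.

# Hypothetical client call to the CAT Service's getSuggestions operation; the
# endpoint URL, exact parameter names, and response format are assumptions.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

CAT_SERVICE = "http://example.org/cat/service"   # placeholder endpoint

def get_suggestions(resource_url, grade_range=None, max_results=10):
    params = {"verb": "getSuggestions", "resourceUrl": resource_url, "n": max_results}
    if grade_range:
        params["gradeRange"] = grade_range       # e.g. "K-2": restricts suggestions
    with urllib.request.urlopen(CAT_SERVICE + "?" + urllib.parse.urlencode(params)) as resp:
        root = ET.parse(resp).getroot()
    # Assume the response lists suggested standards ranked by relevance.
    return [node.text for node in root.iter("standard")]

for standard in get_suggestions("http://example.org/lessons/rock-cycle.html", "6-8"):
    print(standard)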

4. User Study

The Digital Learning Sciences (DLS) group conducted a focus group on May 6, 2009, to introduce the Metadata Assignment and Search Tool (MAST) to members of the museum and library community. Nine participants from a range of institutions representing public libraries and public, private, and academic museums in the sciences, arts, and history attended. Participants held a range of positions, including collections manager, curator, library director, image archivist, and information technology director. The two-hour session was held at the Bibliographical Center for Research campus in Aurora, Colorado (http://www.bcr.org).

The goal of the session was to gather feedback on the utility of the MAST Digital Collection System (DCS) for the professional communities represented. Specific areas of focus were the usefulness of the metadata suggestion services MetaExtract and the Content Assignment Tool (CAT), and the appropriateness of the native metadata framework (including controlled vocabularies and free text options). An additional goal was to solicit suggestions for refinements or improvements to the tool to enhance its utility.

While the group had a generally favorable impression of the DCS, some of its capabilities raised concerns about the utility of the tool. The first was that, at the time, the system could be applied only to text or HTML documents and not to other document types or, indeed, to other media. Further, the participants in the group were either not involved in assigning educational standards or did not use them in their institutions, and so were unable to judge the general utility of the CAT service.

The group offered favorable general impressions of the user interface and cataloguing process. They liked the expand/collapse features for viewing complex fields, and most found the navigation reasonably easy to follow. Having the best practice guidelines embedded in close proximity to the data entry fields was overwhelmingly praised by all. Several participants indicated they often use volunteers for cataloguing, and thought this would be extremely helpful in assisting new and unpracticed users. The field definitions and vocabulary definitions were both cited as especially beneficial. The suitability of the controlled vocabularies for their professional communities was debated.

The participants in the survey were asked to rate the quality of the metadata suggestions offered by the MetaExtract system. For each field, the participants were asked, “How helpful was the suggested metadata to you in completing the field listed below?” The answers were generally in the "helpful" range. However, among all the fields, the otherSubject field, which contained automatically generated keywords, was ranked lowest by a significant margin and was the only field to fall into the “not helpful” range.

As a result of the user study, work is under way at CNLP to improve the automatically generated keywords, using algorithms that incorporate topics from Wikipedia. Furthermore, the text processing tool for MetaExtract has been extended to work on documents in PDF format, adding at least one very important document format to the types covered by the system.

5. Deployment in Small Museums

Two case studies are currently being conducted by the CNLP group, in which relatively inexperienced museum staff go through the entire process of downloading and installing the software, cataloguing the museum's collection items, and publishing the collection on-line with Web search enabled. The results of these case studies will also be demonstrated at the Museums and the Web conference.

Acknowledgements

The project was funded by a National Leadership Grant from the Institute of Museum and Library Services (IMLS) to extend the ability of museums and libraries to preserve culture, heritage, and knowledge while enhancing learning.

References

Chowdhury, G., and S. Chowdhury (2003). Introduction to Digital Libraries. Facet Publishing, London.

Diekema, A.R., and J. Chen (2005). “Experimenting with the Automatic Assignment of Educational Standards to Digital Library Content”. In: Proceedings of the Joint Conference on Digital Libraries. Denver, Colorado.

Greenberg, J. (2004). “Metadata extraction and harvesting: A comparison of two automatic metadata generation applications”. Journal of Internet Cataloging, 6(4), 59–82.

Greenberg, J., K. Spurgin, and A. Crystal (2005). Final Report for the AMeGA (Automatic Metadata Generation Applications) Project. Submitted to the Library of Congress February 17, 2005. Available at: http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf.

Levy, D.M., and C.C. Marshall (1995). “Going Digital: A Look at Assumptions Underlying Digital Libraries”. Communications of the Association for Computing Machinery.

Liddy, E.D. (2005). “MetaTest: A Tripartite Evaluation”. American Society for Information Science & Technology Conference. Charlotte, NC.

Liddy, E.D., G. Gay, S. Harwell & T. Finneran (2002). A Modest (Metadata) Proposal. Joint Conference on Digital Libraries. Portland, OR, July 17, 2002.

Yang, Y. (1999). “An evaluation of statistical approaches to text categorization”. Information Retrieval, 1(1), 69-90.

Cite as:

McCracken, N., and A.R. Diekema, Publishing Digital Museum Collections on the Web Using the Metadata Assignment and Search Tool. In J. Trant and D. Bearman (eds). Museums and the Web 2010: Proceedings. Toronto: Archives & Museum Informatics. Published March 31, 2010. Consulted http://www.archimuse.com/mw2010/papers/mccracken/mccracken.html