Museums and the Web 2009
April 15-18, 2009
Indianapolis, Indiana, USA

What is Your Museum Good at, and How Do You Build an API for It?

Richard Morgan, Victoria and Albert Museum, London, UK

Abstract

There has been an encouraging surge of interest in the museums sector in opening up museum data and building APIs on museum collections databases. However, a museum's collections are not the only, and sometimes not even the most interesting, service which a museum provides. Events, communities, shopping, learning and interpretation are all areas where museums have lively and engaging offerings. These areas typically have a Web presence, and therefore the possibility exists to build an API, or make use of an existing API, to open up that offering.

Furthermore, as museum collections’ content becomes more readily accessible on the Web, museums need to focus more and more on their value-add, the expertise and authority which they bring to the interpretation of their own collections and those of others.

Keywords: APIs, technical, services, communities, advocacy, value-add

Introduction

The very fact that it is now relatively easy to build a simple API for a museum's Web offer means that museums need to consider seriously what value-add they bring to their collections in order to stay relevant in the digital arena.

This paper describes an approach taken by the Victoria and Albert Museum (V&A) to developing a simple API not only for collections-based material but also for the rest of the museum.

This paper is given in conjunction with a workshop at the Museums and the Web Conference 2009. The paper focuses on the technical approach to API building (“How do you build an API for it?”) while the workshop is concerned with what sort of API a museum should be building in the first place (“What is your museum good at?”). The aim here is to present the steps and decisions taken by the V&A in the course of developing a suite of APIs such that others can replicate those steps and assess whether the decisions taken would be appropriate or not for their own environment. In the workshop context, the aim is to explore further what sorts of services a museum could offer beyond the obvious making available of collections information.

Web Application Frameworks

The V&A has an in-house development team of two and routinely hires digital agencies to assist with certain projects. The development of an API is not a project with a clear start and end, but is more an approach to the development of any digital project. A technical solution was required, one which avoided excessive amounts of unmaintainable bespoke code and vendor lock-in to a third party. With many other projects to be delivered, the V&A also needed a solution which allowed rapid application development so that progress could be made quickly but not necessarily within a single, allotted period of time.

The V&A is the lead partner in the National Museums Online Learning Project (NMOLP) and was able to benefit from the knowledge gained in that project's initial phases. Makewell (2008) indicated that the approach of setting up OAI harvesting of museum collections had been rejected after attempts to do so had led nowhere. As a result we also needed a solution which allowed us to decouple the rapid development of an API from the slower development of our internal systems.

The V&A found the right balance between structure and flexibility in a group of recently matured Web application frameworks, and the museum now uses Symfony, Drupal and Django for all Web development projects.

Of these, Django, which is written in the Python programming language, was considered the most effective for delivering an API because it was the easiest to scale and, through Python, facilitated robust but rapid coding. The Python language and the Django framework are easy to learn and well suited to a small development team.

Django is also supported on Google's App Engine, which provides a way of experimenting with the approach described below if hardware infrastructure is not available.

The choice of Django was essentially a commitment to maintaining a standardized way of doing things rather than a standardized way of managing data and protocols. Given the evident difficulties in setting up services such as OAI and SRU / SRW, it was decided that although such services represented the ideal way of accessing our content, it was better to adopt an approach which delivered something quickly but with a view to slotting in those services later.

Technical readers are referred here to the excellent Django tutorial which describes how to set up a simple Django Web application. In this paper I will outline the steps we took to develop this application, without going into the details of the code.

Getting Started

We began by using Django to write a very simple Web application which serves up what we might term a “museum resource” but what is, in effect, a page from a museum Web site.

The first step is to define a model for a museum resource. This is, at its simplest, a list of fields and, initially, might require only title, url and summary fields.
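As a minimal sketch (the model and field names here are illustrative rather than the V&A's actual schema), such a model might look like this in Django:

```python
# models.py -- a minimal "museum resource" model (names are illustrative)
from django.db import models

class MuseumResource(models.Model):
    title = models.CharField(max_length=255)
    url = models.URLField()
    summary = models.TextField(blank=True)

    def __str__(self):
        return self.title
```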

The second step is to define a view which returns a machine-readable version of a resource. We also define a view which returns a subset of all the resources in our application, again in a machine-readable format, which will likely include some simple search logic. A view is, at its simplest, a small piece of code that fetches the data and passes it to a template indicating how fields should be displayed in a page served up by our Web application. For humans, a view might tell the application to put a title in bold; for machines, a view might tell the application to put the title in a particular XML element.
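The two views described might be sketched as follows; JSON is used here rather than XML purely for brevity, and the function and parameter names are illustrative:

```python
# views.py -- the two machine-readable views described above
# (JSON instead of XML purely for brevity; names are illustrative)
from django.http import JsonResponse
from django.shortcuts import get_object_or_404

from .models import MuseumResource

def resource_detail(request, pk):
    """Return a single museum resource in machine-readable form."""
    resource = get_object_or_404(MuseumResource, pk=pk)
    return JsonResponse({"title": resource.title,
                         "url": resource.url,
                         "summary": resource.summary})

def resource_search(request):
    """Return a subset of resources matching a simple search term."""
    q = request.GET.get("q", "")
    results = MuseumResource.objects.filter(title__icontains=q)
    return JsonResponse({"query": q,
                         "results": [{"title": r.title, "url": r.url}
                                     for r in results]})
```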

Finally, we define a url pattern which will allow someone to construct a URL to pass parameters to our Web application and, hopefully, get a result.
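In a recent version of Django the corresponding URL patterns might look like the following (older versions use regular-expression-based patterns, but the principle is the same; the paths are illustrative):

```python
# urls.py -- URL patterns exposing the views above
from django.urls import path

from . import views

urlpatterns = [
    path("api/resources/<int:pk>/", views.resource_detail),  # one record by identifier
    path("api/resources/", views.resource_search),           # e.g. /api/resources/?q=paris
]
```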

Now we can enter a few records into this system. One of Django's great benefits is an excellent out-of-the-box administration interface which allows for easy data entry. Using this interface we can fill in a few records, and we have created a very simple API.

We can create a URL which returns a record in machine-readable format if we know, say, the identifier of the record. We can also create a URL which returns multiple records if we, say, pass a search term in the URL.

This API is not very exciting. We will need more than a few records and a more sophisticated model in order to do something special.

Crawling, parsing, mining

The next stage is to get as many records as possible into the system. In practice this means two things: crawling your museum Web site and then extracting information from the data provided.

Crawling our own Web site seemed at first to be a rather coarse method of obtaining the data, but proved on every occasion to be the quickest way of progressing. As indicated before, we consistently adopted the approach of taking the line of least resistance. To gather collections records, we started by crawling our own Web site. Since then we have been able to replace this with using an XML export from our collections system. We are still not able to use OAI harvesting to collect the records, but our application is leaving the door open in the expectation that this will be possible in the future.

The same principle was applied elsewhere on the Web site. Initially it was easier to crawl HTML pages about museum events from our Web site than it was to access more structured information from our internal bookings and events system.

The desired basic output from the crawling process is a set of HTML files. In some cases these files were accessible directly from a Web server file system and we were able to copy them; in other cases it was necessary to use a tool to crawl the files. The task is easily accomplished with some Python code, but a tool such as HTTrack can also be used effectively. Crawling a Web site is not difficult but is time-consuming, since it is polite to wait a few seconds between each request so as not to put excessive load on the server.
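A minimal, polite crawler along these lines might look like the following sketch. The start URL, page limit and delay are illustrative, and a production crawler would also respect robots.txt and handle errors:

```python
# crawl.py -- a minimal, polite same-site crawler (limits and delay are illustrative)
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50, delay=2.0):
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith(start_url):  # stay on our own site
                queue.append(absolute)
        time.sleep(delay)  # be polite: pause between requests
    return pages
```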

Once a suitable number of HTML files has been collected, the task is to extract information from these files. At the V&A we used Python for this task and, in particular, the BeautifulSoup library: it provides a set of useful ways of pulling information out of raw HTML pages. The technical reader is again referred to the documentation for BeautifulSoup in order to explore the exact process involved.

Django has an effective API such that records in the system can be manipulated programmatically from Python modules and, indeed, from the Python command line interpreter. This made it easy to batch the processes that will now be described, simply by iteratively fetching a record, performing the processes on it, and then returning it to the Django application in its enriched form.
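The pattern is essentially the following sketch, intended to be run from the Django shell (the application name and the enrichment step are placeholders for the processes described below):

```python
# enrich.py -- batch-enrich records through the Django ORM
# (run inside the Django shell: python manage.py shell)
from myapp.models import MuseumResource  # "myapp" stands in for the real application name

def enrich_text(text):
    """Placeholder for the parsing / text mining steps described below."""
    return text

def enrich_all():
    for resource in MuseumResource.objects.all():
        resource.summary = enrich_text(resource.summary)
        resource.save()  # write the enriched record back to the application
```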

For collections records, the HTML display was largely consistent, and it was straightforward to design a set of rules in order to pull out structured information. For example, a standard page for a museum collections record is likely to contain names and values of fields in a relatively predictable way. In this case, it is easy to identify a pattern where a label is in bold, followed by a colon, and then has the value for a particular field.
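As an illustration of this kind of rule, the following sketch pulls label/value pairs out of a page with BeautifulSoup, assuming the "label in bold, followed by a colon, followed by the value" pattern just described; real markup will vary:

```python
# parse_record.py -- extract label/value pairs from a collections record page
# (the "<b>Label:</b> value" pattern is illustrative; adjust to the real markup)
from bs4 import BeautifulSoup

def extract_fields(html):
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for bold in soup.find_all("b"):
        label = bold.get_text(strip=True).rstrip(":")
        value = bold.next_sibling  # the text immediately after the bold label
        if isinstance(value, str) and value.strip():
            fields[label] = value.strip()
    return fields
```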

For events information, we had only the unstructured HTML to work with, though it is worth remarking at this point that microformats provide an albeit imperfect method of embedding some structured information about events within an HTML page. This sort of structured data was not available on the V&A Web site, and the vast majority of pages which were not collections records were unstructured.

We therefore needed to look at ways of extracting structured information from these pages. Two obvious methods were clearly too time-consuming in the short term, though both worth exploring in the long-term: internal editorial enrichment of the metadata for these resources, and user-generated tagging.

Therefore the V&A chose initially to use a variety of text mining techniques and services in order to enrich the unstructured content on the Web site. These services included OpenCalais, Yahoo term extractor, WorldCat and a number of geocoding services.

As well as external services, we also used the Python Natural Language Toolkit (NLTK) in order to perform some internal text mining analysis on our content. In each case the broad aim was to extract from the unstructured text the words that were most useful in determining the content of the page. The aim, in effect, was to automate the tagging of content by keyword.
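A very simple internal version of this keyword extraction, using the NLTK and a plain frequency count (the cut-off is illustrative, and the NLTK tokenizer and stopword data are assumed to have been downloaded), might look like this:

```python
# keywords.py -- naive keyword extraction with the NLTK
# assumes nltk.download("punkt") and nltk.download("stopwords") have been run
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist

def extract_keywords(text, top_n=10):
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    stops = set(stopwords.words("english"))
    freq = FreqDist(w for w in words if w not in stops)
    return [word for word, count in freq.most_common(top_n)]
```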

The production of tags allowed and required two refinements to the Django API Web application. First, we needed to extend our model. Rather than extend the existing model at this stage, we chose to create a new model for tags and then set up a way of associating our museum resources with given tags. The advantages of this approach were that we would have a better method in the future for determining which tags were duplicates, and that we would be able to extend our very basic museum resource model to encompass specialised types such as events, activities and collections records if we chose.
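A sketch of such a separate tag model, associated with museum resources through a many-to-many relationship (again with illustrative names), is:

```python
# models.py (extended) -- a separate Tag model linked to museum resources
from django.db import models

class Tag(models.Model):
    name = models.CharField(max_length=100, unique=True)
    resources = models.ManyToManyField("MuseumResource", related_name="tags")

    def __str__(self):
        return self.name
```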

Text from the HTML files was fired off to the services involved and we pumped the results into our new tags model. In some cases we were able to do this several times. For example, place names returned by the text extraction services could then be sent to multiple geocoding services and the results compared in order to build up a confident picture of the location a particular page was associated with.

With all of this information in the system, the final stage was to create a new view which brought back museum resources by tag. We then created a machine-readable version of this view and a url pattern which allowed us to send RESTful URLs containing a tag name to our Web application and have museum resources returned to us. This API is better but is still not all that exciting and, in practice, not necessarily better than a Google search of our site. The next stage was to get our API doing something useful.
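Continuing the illustrative Django sketches above, the "resources by tag" view and its RESTful URL pattern might be:

```python
# views.py / urls.py additions -- machine-readable "resources by tag" endpoint
from django.http import JsonResponse
from django.shortcuts import get_object_or_404
from django.urls import path

from .models import Tag

def resources_by_tag(request, name):
    tag = get_object_or_404(Tag, name=name)
    return JsonResponse({
        "tag": tag.name,
        "resources": [{"title": r.title, "url": r.url}
                      for r in tag.resources.all()],
    })

urlpatterns = [
    path("api/tags/<str:name>/", resources_by_tag),  # e.g. /api/tags/paris/
]
```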

Making our API more useful

We paused at this stage to analyze the results of the text mining activity. The first task for making the categories as useful as possible was to attempt to normalize them so that a single term such as “Scotland” can bring back items tagged as “Scots”, “Scot”, “Scottish” and so on. As an initial stage, it was not too time consuming to do some of this mapping “by hand,” but a lot of progress can also be made with stemming tools such as those available from the NLTK.
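The NLTK's Porter stemmer gives a flavour of what can be automated. As the sketch below shows, it clusters some variants ("Scots", "Scot") under a common stem, while others ("Scottish", "Scotland") still need hand-mapping, which matches the experience described above:

```python
# normalise.py -- cluster tag variants under a common stem with the NLTK
from nltk.stem.porter import PorterStemmer

def normalise(tags):
    stemmer = PorterStemmer()
    clusters = {}
    for tag in tags:
        stem = stemmer.stem(tag.lower())
        clusters.setdefault(stem, []).append(tag)
    return clusters

# normalise(["Scots", "Scot", "Scottish", "Scotland"]) groups the first two
# under "scot"; "Scottish" and "Scotland" remain separate and need hand-mapping.
```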

The aim here was to do the best job possible at clustering resources around terms returned from the text mining processes and, in some cases, internal taxonomies. The normalization process represents a big step forward in how useful our API Web application can be. Rather than resources being brought back by the rather woolly string matching process of search, we can build more concrete relationships indicating, say, that the “Paris” that a certain painting comes from is one and the same “Paris” that a fashion designer presenting a fashion event at the museum comes from.

This allowed us to build two refined views in our application. One shows all the tags which are associated with a particular resource; the other shows all the resources associated with a particular tag. Our API now presented an effective way of navigating through the V&A's on-line resources and we were quickly able to build some prototype applications for humans to allow visitors to view objects on a map and browse around objects by clicking on tags.

By looking at the most common terms we could also get a quick visual indication as to what our museum is actually about, according to the text mining services used to create the museum resource tags.

The clustering of museum resources around these tags indicates three things:

  1. The most common terms give some indication of what a museum has actually been doing on the Web.
  2. The common terms which stand out give an interesting indication of how museum practice might differ from museum policy.
  3. The not-so-common terms which stand out can indicate areas where the museum has objects and content that might be interesting in a way not previously thought of.

Adding a social media and user-generated content component

The model of a basic resource which has tagging associated with it is very familiar to any user of Flickr or YouTube. Users of Twitter are familiar with the not unrelated concept of hashtags.

Since the same sort of model is being used in our simple Django Web application, it is not difficult to extend this model also to take in social media activities which have some potential relationship to the content we are modeling. There are two ways in which this can be achieved. Where a museum already has a presence on Flickr, Twitter or YouTube, perhaps through a set of groups or tags, it is possible to crawl that information and combine it with our API application, which until now has been based just on internal museum content.

Using APIs provided by these sites, we can obtain a subset of the information that we believe to be relevant to us and add it to our pot of Web resources in order to gather additional data, additional tags, and to nuance our idea of what our museum is about by factoring in the behaviour of an engaged subset of the general public. For example, people may be contributing photographs in Flickr to a particular V&A group. The textual content associated with these photographs may be an excellent set of resources to hold information about in our API.
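For example, assuming a Flickr API key and Flickr's standard flickr.photos.search REST method, a sketch of pulling in photo metadata for a given tag might look like this (the key and tag values are placeholders):

```python
# flickr_fetch.py -- gather photo metadata for a given tag from Flickr
# (API key and tag are placeholders; fields actually stored would depend on our model)
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_FLICKR_API_KEY"  # placeholder

def fetch_tagged_photos(tag, per_page=50):
    params = urllib.parse.urlencode({
        "method": "flickr.photos.search",
        "api_key": API_KEY,
        "tags": tag,
        "format": "json",
        "nojsoncallback": 1,
        "per_page": per_page,
    })
    url = "https://api.flickr.com/services/rest/?" + params
    data = json.loads(urllib.request.urlopen(url).read())
    return data["photos"]["photo"]  # list of photo records (id, title, owner, ...)
```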

Furthermore, we can enrich our content by encouraging the use of hashtags and machine tags with a view to being able to access these social media resources without having to rely so heavily on the text mining and normalization techniques in order to gather this data. For example, we might encourage visitors who take a photograph of a museum object to “machine tag” it with that object number, or we might encourage visitors who attend a particular event to tweet about it using a hashtag for that particular event.

With either approach, we can now extend our API application to deliver not only “traditional” internal museum resources by tag, but also resources relating to the museum on social media sites.

Many of the social media sites have an API which allows a certain visitor interaction such as submitting a video programmatically. Many museum sites also have some facility for users to contribute content, and the V&A has had particular success with these.

The Next Stage

The next stage in expanding our suite of APIs is to leave the simple “information providing” API application we have developed to one side and to consider how to deliver an API for our own Web site interactions.

Many of the V&A's initial Web 2.0 activities were built using bespoke code which became increasingly difficult to maintain. We therefore considered a number of options for developing these activities in a more standardized way and settled on Drupal as a framework which provided useful out-of-the-box features for user management, community management and submissions by visitors.

Drupal is perhaps best described as a PHP content management framework. It does not have the elegance or robustness of Django, nor are its technical niceties as easy to learn. However, it has a lot of out-of-the-box features; the V&A is confident it can hire third-party developers to work with it; and, after a certain amount of exposure, we have been able to devise a set of guidelines for Drupal module development which will keep our bespoke modules in a maintainable state.

Drupal has good core functionality to consume remote data in the form of RSS, and it is relatively easy to write extension modules to consume serialized PHP objects for more bespoke purposes. But our focus here is on its capacity for service provision rather than service consumption.

A particularly attractive contributed Drupal module is the Services module, which provides a core methodology for exposing Drupal interactions via protocols such as XML-RPC. The Services module also provides a framework (via the concept of Drupal "hooks") so that bespoke Drupal modules can expose a similar API in a predictable way.

The advantage of this approach is that any custom interaction designed as being particularly appropriate for a museum's content can instantly be made available as an API, allowing it to appear in Facebook applications, Google gadgets and the like.

With this methodology in mind, the V&A is now looking to deliver new and updated user-generated content activities with this model. Given an application which allows a user to tell a story about an object, we can now embed that functionality on any page we choose that has an object on it, and we can take not only our content but also our opportunities for participation beyond the V&A Web site.

We now have a Django application holding enriched, structured data which can be made available via an API; we have at our disposal the APIs of any social media site where our visitors are actively participating with our data; finally, we have an API for allowing people to participate in particular user interactions which the museum can design. However, the final stage is perhaps to look at what API the museum can produce for its core, academic expertise.

What is Your Museum Actually Good At?

The APIs we have discussed so far could be built by anyone. Anyone with sufficient technical expertise can crawl the content from a museum Web site and mine the unstructured text. Anyone can set up a simple Web interaction and offer an API for people to use it. But there are some areas where a museum has exceptional expertise and authority in terms of understanding objects and providing interpretation. The question is: if you have some content, what can the V&A tell you about it? And can it provide an API for that service?

For the purposes of this paper, I will finish by focusing on a very simple example: the use of our continually developing, text-mined corpus of data, against which visitor content can be compared and contrasted and then categorized by the V&A.

The process of text categorization is fairly straightforward and is supported by the NLTK. The simplest method is, for any piece of text, to compare the words of most statistical significance in that text to the words of most statistical significance across a set of texts which have already been categorized. The output is an indication as to which of these predefined categories the new piece of text is most likely to belong to.
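A minimal sketch of this kind of categorizer, using the NLTK's Naive Bayes classifier, might look like the following. The training texts and category labels here are invented purely for illustration; in practice the training set is the museum's own categorized corpus:

```python
# categorise.py -- a minimal NLTK text categorizer (training data is illustrative)
# assumes nltk.download("punkt") has been run
import nltk

def features(text):
    """Bag-of-words features: which words appear in the text."""
    return {word.lower(): True for word in nltk.word_tokenize(text) if word.isalpha()}

# texts already categorized in the museum's own corpus (invented examples)
train = [
    (features("silver teapot london hallmark"), "metalwork"),
    (features("silk evening dress paris couture"), "fashion"),
    (features("oak cabinet carved panel"), "furniture"),
]

classifier = nltk.NaiveBayesClassifier.train(train)

def categorise(text):
    """Return the most likely category for a new piece of text."""
    return classifier.classify(features(text))
```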

The final step in the development of our Web application is to set up a form which allows visitors to submit text. The contents of the form submission can then be passed to our categorizer and a set of likely categories can be returned. In this way we have created a very simple service which indicates whether a piece of text outside the context of the museum's own texts would be a good fit. In essence, the V&A can have a service which tells you how “V&A” a particular text is.
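Tying the two ends together, a sketch of a Django view which accepts submitted text and returns the most likely categories might be as follows; it reuses the illustrative classifier above, and CSRF protection is bypassed only to keep the sketch short:

```python
# views.py -- accept visitor text and return likely categories
from django import forms
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

from categorise import classifier, features  # the illustrative classifier sketched above

class TextForm(forms.Form):
    text = forms.CharField(widget=forms.Textarea)

@csrf_exempt  # a real endpoint would handle CSRF and authentication properly
def categorise_view(request):
    form = TextForm(request.POST or None)
    if form.is_valid():
        dist = classifier.prob_classify(features(form.cleaned_data["text"]))
        categories = sorted(dist.samples(), key=dist.prob, reverse=True)
        return JsonResponse({"categories": categories[:3]})
    return JsonResponse({"error": "no text supplied"}, status=400)
```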

Furthermore, assuming we have a number of categories in place, this service can indicate which particular area of the museum the incoming text fits best with. Finally, we can return from the museum pieces of content which seem most relevant to a particular piece of text.

Conclusion

In following the steps taken by the V&A in creating a light-weight API, we have moved through three phases. We started by looking at how to build a light-weight application to serve a simple, machine-readable version of content produced around the V&A. We then looked at the opportunities for user contributions and more sophisticated interactions provided by the V&A which can be presented as APIs using the Drupal services module. Finally, we took a first step at looking at how the V&A might set up a service of its own similar to, say, OpenCalais, and what that service might look like.

The result is a suite of simple, light-weight APIs which have scope for further rapid, iterative and flexible development and which are decoupled from the issues such as standards, vendor lock-in and complexity which accompany attempts to drive APIs directly from back-office collections management systems.

References

Makewell, Terry (2008). The National Museums Online Learning Project Federated Collections Search: Searching Across Museum And Gallery Collections In An Integrated Fashion. In D. Bearman & J. Trant (eds.) Museums and the Web 08 Proceedings. CD ROM. Archives & Museum Informatics, 2008. Available: http://www.archimuse.com/mw2008/papers/makewell/makewell.html

Cite as:

Morgan, R., What is Your Museum Good at, and How Do You Build an API for It?. In J. Trant and D. Bearman (eds). Museums and the Web 2009: Proceedings. Toronto: Archives & Museum Informatics. Published March 31, 2009. Consulted http://www.archimuse.com/mw2009/papers/morgan/morgan.html