MW-photo
April 13-17, 2010
Denver, Colorado, USA

How to Manage and Build a Web Collections Search Project in a Museum

Richard Morgan, Victoria & Albert Museum, United Kingdom

http://collections.vam.ac.uk

Abstract

Large, complex, technical projects which require input from many stakeholders in a large, complex, bureaucratic organisation are difficult to run and even more difficult to keep to budget and schedule.

In September 2009, the V&A released a beta version of its new collections search site. More than one million catalogue records are now available on-line, and the site boasts a beautiful design, powerful faceting and cross-linking, a fast, scalable search and an API.

We had to take brave strategic technical decisions and adopt new, more agile working practices to deliver the project. The purpose of the paper is to describe the processes we went through, how we took the decisions we did, and how we persuaded stakeholders of the merits of our approaches.

The paper is intended to be far more than just a description of what we did and how we did it. It presents techniques for project management, technical strategies, and frameworks for making technology decisions which are generalized and could be applied in any museum. It presents practical and pragmatic methods which work in the reality of a museum context and can achieve a radical shift in culture and practice without the need for top-down organisational change.

Keywords: search, agile, api, organizational change, project management, open source, collections

The Problem

Museums are not set up to run complex Web projects. The on-line function in a museum is often a service provider to other museum departments who wish to get their content on the Web; it is expected that this service will be maintained alongside project work. Large projects are often run by committee, the members of those committees each bringing great experience and expertise in their own domain but having little direct experience in the domains of their colleagues. Groups of intelligent, dedicated individuals can often seem to come together yet be less than the sum of their parts, a situation described by Methven and Hart (2009) which many museum professionals might recognise from their own working practice:

These three teams were classic silos – they saw everything from the perspective of what was right from the point of view of the applications/systems they were responsible for. So, from the Web team’s perspective, everything revolved around the Content Management System – a service could only work and be supported if it was part of the CMS. From the business systems team, everything revolved around the particular requirements of each system they supported, and from the collection systems team, all that mattered was maintaining the collection. Despite extremely clever and dedicated individuals, it was very difficult to develop a strategic approach capable of producing innovative yet consistent on-line services which met needs recognised across the whole organisation.

Methven and Hart go on to describe a strategic approach that brought about an effective organisational change. Unfortunately, museum professionals are not always in the right time and place to bring about what they perceive as necessary organisational change. The V&A's Search the Collections was a project delivered against an organisational background that possessed both strengths and weaknesses. The project was a success because it harnessed the organisation's strengths while mitigating the weaknesses.

Project Background

A brief tour of the project is now presented in order to provide background for the subsequent discussion.

Search the Collections is a project to make the entirety of the V&A's in-house collections catalogue available on-line, irrespective of the quality of those records. This represented a major shift for the museum since previously we published only records with both an image and a "public access description" - a short piece of descriptive text suitable for a non-specialist audience.

Search the Collections also features an image ordering system. Where copyright allows, the V&A encourages users to download large, high-resolution versions of the images for their own personal use. A special licence also allows visitors to use the images freely in scholarly publications with a limited print run.

Data

The available data for these catalogue records presented both challenges and opportunities. The quality and richness of the records were extremely variable. Some records would have many associated images; the majority had none. Some records had many paragraphs of descriptive and interpretive text; others might have nothing more than a generic title and a museum number.

The data also contained a distinction between catalogue records and inventory records. An inventory record might exist for each individual drawer in a chest of drawers, or each individual chess piece in a chess set. We wanted to present the user with catalogue records, but not at the expense of losing information that was contained only within inventory records.

Our in-house Collections Information System also featured a rich taxonomy, albeit one that had grown somewhat organically over the years. We wanted to surface this taxonomy to the users to show how records were linked together.

Design

These realities of the data presented a number of user interface design challenges. The major issue confronting the design of the user interface was that it needed to satisfy both specialist, academic researchers and more casual browsers who knew the collection less well. We dealt with this problem by adopting a user interface where the default screen is as simple as possible, but users can open up and expand areas such as "more search options" and "narrow your search" without leaving the screen. Users can indicate whether they would like to search only "best-quality" records (i.e. those with both images and public access descriptions), to search across all records with images, or to search across all records. Users are also able to select different views of the results. A "list view" is suitable for users searching for specialist information; an "image only" view is more suitable for those looking for images; an "image with text" view provides a compromise between the two.

For visitors who arrive at the site on the homepage, we created a browseable, drag-and-drop "never-ending wall" of collections images which allows them to explore the collections even if they are not sure what to put in the search box.

Finally, we were aware that many of our visitors did not go to the site directly but might land on a record page from a Google search. Mindful of the observations on Google Images of Ellis and Kelly (2007), we designed the URLs, the linking within the site and the placement of images on the page with Google crawling very much in mind.

Technologies

Search the Collections is really two applications plus search and user management technologies. A Django application consumes XML data which is exported from our internal systems. The Django application provides an API service to client applications. It returns machine-readable representations of search results and individual records in JSON. Records are stored as instances of Django models in a MySQL database. Sphinx provides a fast search index, and Django delegates search requests to Sphinx, looking up the records corresponding to the identifiers chosen in that search.
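
A minimal sketch of this back-end pattern is given below, assuming the sphinxapi Python client that ships with Sphinx; the application, model, index and field names are hypothetical rather than our actual code.

```python
# Minimal sketch: Sphinx returns matching document IDs, Django looks up the
# corresponding records and serialises them as JSON. App, model, index and
# field names are hypothetical.
import json

import sphinxapi                                 # client distributed with Sphinx
from django.http import HttpResponse

from collection.models import MuseumObject      # hypothetical Django model


def search(request):
    query = request.GET.get('q', '')

    # Delegate the full-text query to the Sphinx daemon (searchd).
    client = sphinxapi.SphinxClient()
    client.SetServer('localhost', 9312)
    result = client.Query(query, 'objects')      # 'objects' is a hypothetical index

    ids = [match['id'] for match in (result or {}).get('matches', [])]

    # Look up the records corresponding to the identifiers Sphinx chose.
    records = MuseumObject.objects.filter(pk__in=ids)
    payload = [{'id': r.pk, 'title': r.title} for r in records]

    return HttpResponse(json.dumps(payload), content_type='application/json')
```

The important point is that Sphinx only ever returns identifiers; the record data itself always comes from the Django models.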

The front-end Web application uses Symfony. The "never-ending wall" is written with JQuery. User management and image ordering are handled by the V&A's Drupal instance. The Drupal instance exposes API services via XML-RPC provided by the Drupal Services module. The front-end Symfony application makes transactions with these Drupal services when users register with the site, add images to orders, and order the images for download.
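
A rough illustration of what one of these XML-RPC transactions looks like is given below. It is written in Python for consistency with the other sketches in this paper, although the production client is the PHP/Symfony application; the endpoint URL and the imageorder.add method are invented, and the actual method names depend on how the Services module is configured.

```python
# Illustrative only: the production client is the PHP/Symfony front end.
# The endpoint URL and the imageorder.add method are invented; real method
# names depend on how the Drupal Services module is configured.
import xmlrpc.client

endpoint = xmlrpc.client.ServerProxy('https://example.org/services/xmlrpc')

# Open a session with the Services endpoint (assuming the classic
# system.connect handshake exposed by the Services module).
session = endpoint.system.connect()

# Hypothetical service method adding an image to the current user's order.
result = endpoint.imageorder.add(session['sessid'], {'object_id': 'O12345'})
print(result)
```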

Infrastructure

These applications run on virtual servers in the V&A's in-house server room. The virtualization allows us to scale server resources to usage and, if necessary, allows us to clone servers to create horizontal scaling across the front-end application, back-end Django application and database servers. The virtualization project deserves a paper in its own right: readers are referred to Winmill (2009).

Staff and resources

A project board made up of representatives from the Online Museum, Collections and Documentation Management department, a representative from the museum's curators and a representative from the Information Systems and Services Department steered the project. The Other Media, a digital agency, was hired to produce the design of the site. The build of the site and the creation of the Web applications were carried out by the Web Technical Team (two developers and one manager, the author of this paper). Staff from the Collections and Documentation Management department carried out a project to create catalogue records for objects where only inventory records existed. They also undertook extensive work to improve the quality of the data where possible. All these members of staff had to carry on providing their normal service function while carrying out the project.

No dedicated project manager was appointed. Project management and administrative tasks were carried out by a member of the project board, with individual function managers being responsible for organizing their teams and moving different areas of the project along.

Principles

Within the constraints described above, where major organisational change is little more than wishful thinking, project planning and project management can only be effective if grounded in reality and pragmatism. Imposing a process-driven project management framework such as PRINCE2 may have the allure of "best practice" but simply will not work if the organisation in which it is applied is recalcitrant about playing by the rules. A more pragmatic approach is to gain some understanding of the incentives and tensions that exist within that organisation and devise an approach that fits it; it is perhaps like a skilled carver or sculptor seeing a finished piece within a block of wood or stone and then releasing it by exploiting the nature of the raw material rather than fighting it.

There now follows a discussion of a series of project principles and strategies which were used in Search the Collections to deliver an effective Web site. The focus in this discussion is on the design and build of the Web site. This is an appropriate moment to acknowledge the work which went into this project across every department of the museum for a sustained period of time. The work to improve the data and digitize the entire collection continues.

Not using Vaporware

It almost seems too obvious to start by observing that one should plan to use technologies and data that actually exist rather than ones that do not. Yet it does not seem uncommon to come across projects which have run into the ground because a piece of software turned out not to do what was expected, or turned out not to scale to the requirements of a new project. In the run-up to the build of Search the Collections, the author of this paper was undone when an API for a piece of popular ticketing software turned out not actually to exist, despite the claims made by the supplier. Reflecting on how to avoid a similar failure, two courses of action suggested themselves when embarking on this project.

The first was to ask - could we deliver this technology today, with no new infrastructure, no improved data export and no new staff? In the case of Search the Collections, we knew that the internal Collections Information System would likely not be scalable to provide services to a front-end Web application; we also knew that it would likely not be flexible enough to meet our requirements for the user experience. We were able to conclude that if we were going to build Search the Collections "today", we would need to export the data from the Collections Information System and import it into another application which was optimized for the appropriate performance and flexibility.

The second was to ask the same question again - could we really deliver this technology today? Before making claims about what we could and could not do, we prototyped that very application: a Django application, as it happens, into which we could import that data and then provide a scalable API service accessible by client applications. With an eye on the approach taken by hoard.it described in Ellis and Zambonini (2009), we screen-scraped our own existing collections Web site to create this prototype. Then we created the prototype for the front-end client application that would make the API calls, and we made it good enough to demonstrate to the senior management of the museum. With something quick, light-weight but real in place, we could convince ourselves that we were not going to be using vaporware.
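
The scraping step can be imagined along the following lines; this is a schematic sketch rather than the prototype's actual code, and the URL and the markup class names are invented for illustration.

```python
# Schematic sketch of screen-scraping an existing collections page to seed a
# prototype. The URL and the class names are invented; real markup will differ.
import urllib.request

from bs4 import BeautifulSoup        # pip install beautifulsoup4

url = 'https://example.org/collections/object/O12345/'     # hypothetical page
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('h1')
description = soup.find('div', class_='object-description')   # hypothetical class
record = {
    'title': title.get_text(strip=True) if title else '',
    'description': description.get_text(strip=True) if description else '',
}
print(record)
```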

Removing dependencies

Just as important as not relying on non-existent technologies was being realistic, indeed pessimistic, about the availability of staff. All the staff involved in this project had other demands on their time, providing the service their department was responsible for and working on other projects. Often, Search the Collections was not even the most important aspect of the participants' work. Since any of the departments involved might be called away to fight fires or meet deadlines, any dependency in the project between these functions created a potential point of failure. The more departments involved, and the more intertwined the dependencies, the more likely was such a failure to occur.

Nor did the complexity end there. Managers within these departments were also responsible for managing third party suppliers. If a supplier failed to deliver on time or to specification, then not only would those tasks dependent on it be delayed, but the already thinly-stretched management resource would also be diverted from the project, to manage the supplier. The worst-case scenario was that a network of dependencies in a large, complex project would be a house of cards where the failure of one element could cause the failure of all the others.

There was no dedicated project manager to monitor and mitigate the risks. We had to accept that wherever a dependency occurred, there was likely to be a problem, a delay and a failure. We dealt with this by adopting a strategy which removed as many of the dependencies between departments and suppliers as possible. The project was broken down into smaller projects, all of which could proceed in parallel and largely at their own pace. Furthermore, each of these smaller projects was comparable in scope and size to other projects routinely carried out by the departments involved, such that we could feel confident of their success. In effect, there was a project to improve the data and create catalogue records by inferring them from the inventory records; there was a project to create a back-end application which provided the API service and search technology to query the collections; there was a project to create a front-end application with a high quality user experience and look and feel appropriate for a museum of design; there was the on-going project to provide a robust infrastructure using virtualization.

As far as possible we removed dependencies between those subprojects. We made sure that the back-end application would still basically work even if it had to use the existing inventory records; we made sure that the front-end application could basically work if we had to abandon the back-end application and drop in a simpler search technology such as Google site search; we were confident that either application could be installed on virtual servers in the cloud if the infrastructure were not available; if the design was delivered late, the back-end application was only a few Django templates away from being a usable, if basic, public-facing Web site in its own right.

An inconvenient truth about working together

It might seem rather cynical to despair so readily of the ability of museum departments to work together and communicate effectively. If not even the employees of a liberal organisation such as a museum can work together for a common goal, what hope is there for the rest of humanity? Should we not think about how we can improve communication, share our experience and expertise, and ultimately turn our museums into collegiate powerhouses of creativity, productivity and innovation?

The decoupled approach outlined above need not be embarked on cynically. On the contrary, it requires trust in each of the departments to carry out their functions but removes the fear associated with failure by any of those departments. To borrow an idea from game theory, it makes large, complex museum projects less like prisoners' dilemmas and more like stag hunts.

If we are to adopt this approach and persuade our colleagues of its virtues, then we do need to build up their confidence in our abilities, and they need to trust us to do what we say we will do. If the starting point is an atmosphere of mistrust, blame and recrimination, then steps are necessary to build confidence. In building Search the Collections, we attempted to build that confidence by extending our enthusiasm for prototyping to providing iterations and demonstrations of progress at every meeting. Showing the application taking shape to the project board or senior management every meeting was a way of building confidence and trust in the project. This agile approach and its degree of success are discussed in more detail below.

No project is too big to fail

Large, complex projects with fixed budgets rushed through to meet fixed deadlines can become a big sustainability problem. By splitting up the large project into the smaller sub-projects and dividing the effective responsibility for delivery of those smaller projects among different departments and different suppliers, we created a situation where no one supplier or department understood the totality of the detail of the project. There would be no way of creating a Service Level Agreement (SLA) for the whole of Search the Collections, and we would lose any ability to create a synergy saving by costing on-going support for the whole project across every component.

The benefits, however, far outweighed this potential disadvantage. By adopting the decoupled approach we would not have to redevelop the entire project if one of the components changed. Nor would other projects which involved those components be constrained by their role in Search the Collections. For example, if we decided to redesign the site, only the light-weight, front-end application would be affected; if we decided to replace the internal Collections Information System, only the export and import mechanism to the back-end application would need attention.

Search the Collections need not become a money pit for the museum, nor an irksome legacy project for our successors to deal with. As with the philosopher's axe, each component can be replaced and updated individually to respond in a nimble way to changing demands from users and changing skillsets in those whose job it is to maintain it. Because each of the components is easily replaceable, there is also much less sense of investment in each one individually, and consequently much less danger of throwing good money after bad to support an expensive, complicated but ultimately flawed project.

Surviving a thousand cuts

Prototyping and iterative development as described above are hallmarks of an agile approach to managing Web development. It was maintaining something real, demonstrable and visible that allowed us to retain confidence that our progress on each of the smaller projects would ultimately cohere to a greater vision. For example, our designers, the Other Media, had to make certain assumptions about what might be technically possible. Similarly, in developing the back-end application we had to be as confident as possible that we were creating functionality appropriate for real users and not falling into the traps of technology-led design.

Doing this effectively meant working independently of each other but not straying too far away from each other in our thinking and vision. It was far more efficient and far less risky to make constant small steps of progress in every arena so that we could continually compare notes and present something real. For example, when building the front-end application we did not wait for every PSD to be completed and delivered. Rather, as soon as we had even basic wireframes available, we mocked them up and tried to plug them into the application, partly to prove that we could really do it and partly to sanity check our overall approach.

As a Web development team, we needed a way of working which allowed us to take these constant small steps in the face of other demands from elsewhere in the museum. It would have been very undesirable to stop work on Search the Collections for two or three weeks while we attended to some other urgent matter or important project. We felt that an agile methodology was the answer and were inspired by Subramaniam and Hunt (2006) to take this approach.

Our first attempts to do this were a failure. We started by trying to use Scrum. The basic principle of Scrum is to choose a certain period of time and then create a list of relatively short tasks and features to complete within that period of time. One might end up with, say, ten features to deliver in fifteen days, that combination of features and days forming what is called a sprint. During the sprint the list of features to deliver remains unchanged, and at the end of the sprint a release is delivered which clients can review and then request more features for the next sprint.

But in our attempts to do this at the V&A we never completed a sprint, and the requests we had to deliver never remained the same. In fact, something urgent inevitably came up a couple of days into the sprint and required plans to be changed. We tried a different approach: we would ring-fence half our time to the sprint and the other half to dealing with emergency requests and providing support and services. We failed again; the approach was too rigid. The urgent matters were still urgent and sometimes could not be completed in the time allowed. Nor did the approach account for the time when the developers felt most productive or inspired by a creative solution to a particular problem; they were not working to a natural or effective rhythm.

It was a discussion in one of the Museums and the Web 2009 Unconference sessions which eventually pointed to a better answer. We switched from using Scrum to using Kanban. As with Scrum, we identify the small steps that will progress our projects and the support requests and maintenance requests that turn up in the natural course of our work. Each task is written on a post-it note and stuck to a board. The board is divided into sections - we use "analysis", "development", "staging", "feedback" and "deployment" to allow us to track each task through to completion. The rule is that only ten post-it notes can be on the board at any one time. We meet daily to report on our tasks and physically move the post-it notes across the board. When a task is finished we take it off the board; that leaves space for the next one. To regulate this system, we have a further rule that exactly three of the ten tasks will be support or maintenance requests. It means that we keep providing a service, but it also means that there are seven slots available to progress our major projects. The physical element of moving the tasks and the daily meetings create a good rhythm within the development team. If an urgent need does occur, then we physically remove another task from the board to make space for it. In some cases, we make the person requesting the urgent task choose which task it will replace.
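
Purely to illustrate the rules (the board itself is a physical one of post-it notes, not software), the constraints can be expressed in a few lines:

```python
# Illustration of the board rules only, not a tool we use: at most ten cards
# in play, with exactly three slots reserved for support or maintenance work.
MAX_CARDS = 10
SUPPORT_SLOTS = 3
COLUMNS = ['analysis', 'development', 'staging', 'feedback', 'deployment']


def can_add(board, is_support):
    """board is a list of cards, each a dict with a boolean 'support' flag."""
    support_in_play = sum(1 for card in board if card['support'])
    project_in_play = len(board) - support_in_play
    if is_support:
        return support_in_play < SUPPORT_SLOTS
    return project_in_play < MAX_CARDS - SUPPORT_SLOTS
```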

The methodology described was crucial in supporting the decoupled technology strategy we had planned and provided a way in which we could take control of the balance between development work and support requests, saving us from death by a thousand cuts. More detailed discussion of agile practices and Kanban in particular is beyond the scope of this paper, but the interested reader is referred to Subramaniam and Hunt (2006).

Technology

The decoupled approach to Search the Collections was not only a project management strategy but also a technical strategy. In the previous section, I discussed how it was well suited to the realities of the museum, how it minimized risk in development and sustainability, and how we delivered it using agile methodologies. In this section I describe some of the consequences for the technology we used.

APIs as a way of life and data portability

The small, light-weight Web applications which make up Search the Collections need to be able to communicate with each other. All the benefits of the decoupled approach described above would come to nothing if the method by which the applications communicated was not up to the task.

We planned on making the back-end application's API available to the public from the start. We prepared a paper for the senior management of the museum outlining the benefits and the risks of encouraging reuse of our data via an API, and got an agreement that the benefits outweighed those risks, which allowed us to proceed with some confidence.

There are a number of excellent resources available, such as Guy (2009), which describe how a public-facing API can be made successful. Rather than plan this as a public-facing API, however, we took a different approach, focusing first on whether it did the job for us. The result is an API which is rooted in reality and pragmatism, and we now routinely use it for a whole suite of applications. The barriers to accessing it are as low as possible - a key question we asked ourselves was how much easier it would be to screen-scrape our own data compared to using our API. We did not use API keys because we did not want to use API keys ourselves; we used a RESTful URL-as-interface approach because that was the easiest and most pleasant thing for us to code into the client application; we focused on returning JSON results because JSON was easier and more pleasant to process in the client application than XML. We are pleased to have an API grounded in our real usage, one that is genuinely an internal piece of the museum's workings, exposed to the public. There is still a task for us to do in terms of documenting it thoroughly and providing different formats, such as XML using metadata standards.
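
To give a sense of how low that barrier is, everything a client needs is an HTTP GET and a JSON parser. The sketch below uses a hypothetical endpoint URL, query parameter and response shape rather than the real interface.

```python
# Minimal sketch of a client consuming a RESTful, JSON-returning search API.
# The endpoint URL, query parameter and response fields are hypothetical.
import json
import urllib.parse
import urllib.request

query = urllib.parse.urlencode({'q': 'chair'})
url = 'https://example.org/api/search?' + query           # hypothetical endpoint

with urllib.request.urlopen(url) as response:
    results = json.loads(response.read().decode('utf-8'))

for record in results:                                     # hypothetical shape
    print(record.get('id'), record.get('title'))
```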

Taking our decoupled approach requires a certain leap of faith that the applications will be able to talk to each other. We decided to bolster confidence in this leap of faith by providing an API layer for every application or platform we developed, as a matter of principle. We use Drupal as the system which registers users and manages their "orders" to freely download high-resolution images. By using Drupal, we have begun to consolidate user management into a single system across our site. We use the Drupal Services module to provide API access for the front-end client application.

Deeper within the museum there are fewer working APIs and less machine-readable data. To move data from the internal Collections Information System to our back-end application requires an export of questionably-formatted XML to be manipulated by Python scripts before being imported into Django models. Eventually, we will be able to replace this with OAI harvesting.
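
A rough sketch of that import step is given below, assuming an ElementTree parse of the exported XML and the same hypothetical Django model as the earlier sketch; the real export needs considerably more clean-up and field mapping than this.

```python
# Rough sketch of importing an XML export into Django models. The element
# names and the MuseumObject model are hypothetical; the real export requires
# considerably more clean-up than this.
import xml.etree.ElementTree as ET

from collection.models import MuseumObject      # hypothetical Django model


def import_records(path):
    tree = ET.parse(path)
    for element in tree.getroot().findall('record'):       # hypothetical element
        MuseumObject.objects.update_or_create(
            museum_number=element.findtext('museum_number', default=''),
            defaults={
                'title': element.findtext('title', default=''),
                'description': element.findtext('description', default=''),
            },
        )
```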

Technology selection and enterprise search

There are good reasons why each of the technologies used to deliver the applications which power Search the Collections was chosen. Symfony, a PHP framework, is a good choice for the front-end application because PHP developers and design agencies are easy to come by: we can be comfortable that we can hire someone to work on this application without difficulty. The software libraries which access the APIs of our Drupal instance and Django application are code that can be reused by third parties creating new applications for us.

Django, a Python framework, is very satisfactory for the back-end application as it provides rapid application development in complex data-driven circumstances. Python has a good capacity for data manipulation, making it a good choice for managing the export and import of the raw XML data. Drupal has a lot of out-of-the-box functionality, especially in the area of user registration and management.

However, we did not go through any lengthy feasibility study or procurement process to select these technologies. They were selected because they were good enough, far better than the bespoke legacy applications we had inherited (written when frameworks were not so mature), and we could prove that internal and external developers we used across our whole range of work could develop with them rapidly. In deciding on these technologies, we also decided that we needed to avoid proliferation. We began to insist that design agencies creating applications and microsites for us use the Symfony framework; we rejected technologies that did not seem to play well with PHP, Python and our Linux/MySQL stack. We steered clear of Java applications in particular.

For Search the Collections, the selection of Symfony, Django and Drupal was relatively straightforward and coherent with previous decisions we had made. Choosing a search technology was harder; we looked at Solr, Sphinx, and some enterprise search options. Performance was a big concern for us, and none of us had extensive expertise in scaling open source solutions to the anticipated level of usage.

We rejected the Solr / Lucene family because, although it is widely used, we did not want to commit ourselves to maintaining a Java environment or to recruiting for those skills in the future. This left us following two paths: we prototyped Sphinx search, and explored commercial enterprise search options. It soon became apparent that Sphinx would be good enough and delivered features we wanted, such as faceted searching. It also became apparent that licence costs for enterprise search were not going to provide good value for money.

At this point we were able to make a decision in the spirit of the decoupled project strategy: we deployed Sphinx more thoroughly and used JMeter, a stress-testing application, to improve our confidence that it would scale up as necessary. The virtual server infrastructure with our decoupled delivery also made it easy to build in horizontal scaling: we could simply take a snapshot of the virtual server running Sphinx, clone it and write round-robin style load balancing into the front-end client application. However, we also decided to avoid close integration between Sphinx and our frameworks. Plugins exist to use Sphinx with Django and Sphinx with Drupal, but we preferred to write our own configuration, pointing Sphinx directly at the MySQL databases. This approach makes it easy for us to stop using Sphinx and choose another search technology if a better option presents itself.
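
The round-robin idea itself is simple enough to sketch. The snippet below is in Python for consistency with the other examples, although the real logic lives in the PHP front-end application; the host names are invented.

```python
# Illustrative round-robin selection across cloned Sphinx servers. The real
# logic lives in the PHP front-end application; host names here are invented.
import itertools

SPHINX_HOSTS = ['search-01.internal', 'search-02.internal']    # hypothetical
_rotation = itertools.cycle(SPHINX_HOSTS)


def next_search_host():
    """Return the next Sphinx host in round-robin order."""
    return next(_rotation)


# Each search request then goes to the next host in the cycle, e.g.
#   client.SetServer(next_search_host(), 9312)
```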

Epilogue and Reflections

Shortly after launch, we decided to create a mobile version of Search the Collections. In the author's view, the main value of this was that if people were using mobile applications such as Twitter or Google Goggles, where they might be pointed towards our collections, they could be redirected to a sensible mobile rendering of the full record page, with the basic functionality of that page available should they wish to explore further.

This piece of work was accomplished by an external contractor within a couple of weeks; all that was needed was another client-side application making similar API calls to the unchanged back-end application. The robustness of the mobile site and the speed with which it was delivered seemed a fitting vindication for the decoupled approach. Even as this paper is being written, further development is taking place, creating further client-side applications which make use of the same back-end API platform.

With Search the Collections released, it was time to reflect on whether the principles and approach described above were appropriate for the even larger project of a complete Web site redesign. The decoupled, API-based approach will certainly feature, with more APIs being opened up to the public. The commitment to Python and PHP as core technologies will remain. The agile kanban style of managing the build will continue. The area still needing review is how best to present the working iterations of the Web site to the clients and stakeholders.

The agile way of working was certainly suitable for our small Web development team. Demonstrating the iterations and showing a real application taking shape created confidence in our ability to deliver the project. However, it was much harder to obtain meaningful feedback on those iterations. As it was a new way of working for many of the stakeholders, it was sometimes difficult for them to know what they should comment on and what they should just put down to work in progress. There was also a danger of people latching on to petty and trivial details as they used it rather than thinking about the site in the context of its users. Perhaps more importantly, the demands on other people's time meant that they were sometimes not available to provide feedback on an iteration at all, or could give it only a cursory glance.

Search the Collections is a project where we successfully removed dependencies and were able to make production the driving force which brought it to release. Our next challenge is to improve the quality of our output further by effectively harnessing the expertise of our stakeholders and engaging in more proactive user testing.

References

Ellis, M., and B. Kelly (2007). “Web 2.0: How to Stop Thinking and Start Doing: Addressing Organisational Barriers”. In J. Trant and D. Bearman (eds). Museums and the Web 2007: Proceedings. Toronto: Archives & Museum Informatics. Published March 1, 2007. Consulted January 31, 2010. http://www.archimuse.com/mw2007/papers/ellis/ellis.html

Ellis, M., and D. Zambonini (2009). “Hoard.it: Aggregating, Displaying and Mining Object-Data Without Consent (or: Big, Hairy, Audacious Goals for Museum Collections On-line)”. In J. Trant and D. Bearman (eds). Museums and the Web 2009: Proceedings. Toronto: Archives & Museum Informatics. Published March 31, 2009. Consulted January 31, 2010. http://www.archimuse.com/mw2009/papers/ellis/ellis.html

Guy, M. (2009). Good APIs Project. Consulted January 31, 2010. http://blogs.ukoln.ac.uk/good-apis-jisc

Methven, D., and T. Hart (2009). “Organisational Change for the On-line World – Steering the Good Ship Museum Victoria”. In J. Trant and D. Bearman (eds). Museums and the Web 2009: Proceedings. Toronto: Archives & Museum Informatics. Published March 31, 2009. Consulted January 31, 2010. http://www.archimuse.com/mw2009/papers/methven/methven.html

Subramaniam, V., and A. Hunt (2006). Practices of an Agile Developer. Raleigh, North Carolina and Dallas, Texas: The Pragmatic Bookshelf.

Winmill, S. (2009). Using Server and Storage Virtualisation: Our journey to scalability. 12 November 2009. Consulted January 31, 2010. http://www.slideshare.net/swinmill/using-server-and-storage-virtualisation-our-journey-to-scalability

Cite as:

Morgan, R., How to Manage and Build a Web Collections Search Project in a Museum. In J. Trant and D. Bearman (eds). Museums and the Web 2010: Proceedings. Toronto: Archives & Museum Informatics. Published March 31, 2010. Consulted http://www.archimuse.com/mw2010/papers/morgan/morgan.html