published: April, 2002

© Archives & Museum Informatics, 2002.
Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License

MW2002: Papers

Evaluating The Features of Museum Websites: (The Bologna Report)

Nicoletta Di Blas, HOC-DEI, Politecnico di Milano, Maria Pia Guermand, IBC, Istituto Beni Culturali, Emilia Romagna, Carolina Orsini, Università di Bologna, Paolo Paolini, Politecnico di Milano, Italy

Abstract

MiLE (Milano – Lugano Evaluation Method) is an innovative method for evaluating the quality and usability of hypermedia applications. This paper focuses upon the specific “module” of MiLE concerning cultural heritage applications, synthesizing the results of research carried out by a group of seven museum experts in Bologna (Italy), under the joint coordination of IBC (Institute for the Cultural Heritage of the Emilia Romagna Region) and Politecnico di Milano. The “Bologna group” brings together different professional figures working in the museum domain: museum curators of artistic, archaeological and historical heritage; museum communication experts; and communication experts for the Web sites of cultural institutions.

After illustrating the general features of MiLE and the specific features for Cultural Heritage, we will briefly show a few of the results which are to be published in the “Bologna Report”.

Keywords: usability, inspection method, cultural heritage, users’ scenarios

1. MiLE in a nutshell

MiLE is based upon a combination of Inspection (i.e. an expert evaluator systematically exploring the application) and Empirical Testing (i.e. a panel of end users actually using the application, under the guidance and observation of usability experts). While this combination of the two methods is not new (several usability methods propose, in fact, a similar combination), the innovation of MiLE lies in the set of guidelines used to make both inspection and empirical testing more effective and reliable. In brief, we introduce two specific heuristic concepts:

Abstract Tasks, ATs in short, used for inspection. They are a list of generic actions (generic in that they can be applied to a wide range of applications) capable of leading the inspector, like Ariadne’s thread, through the maze of the different parts and levels an application is made of. MiLE, in fact, provides inspectors with guidelines that draw their attention to the most relevant features of the application.

Concrete Tasks, CTs in short. They are a list of specific actions (specific in that they are defined for a single application) which users are required to perform while exploring the application during empirical testing.

Inspection is the focus of this paper, and we will not further explore the issues concerning empirical testing. One contribution of MiLE is the emphasis on the need to separate different levels of analysis: technology, navigation, content, illocutionary force, graphics, etc. For each level a library of Abstract Tasks has to be prepared, when building the method, in order to support the inspection. The Abstract Tasks are, in essence, the distilled knowledge of the experts of each level. For some levels (e.g. graphics or navigation), the abstract tasks can be largely independent of the specific application domain; for other levels (e.g. content) we have different tasks according to the application domain (i.e., specific tasks for the cultural heritage domain, for the e-commerce domain, and so on).
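To make this organization concrete, here is a minimal sketch (our own illustration, not part of the MiLE specification; the task names and level labels are hypothetical) of how an AT library could be indexed by level of analysis and, where needed, by application domain:

  from dataclasses import dataclass
  from typing import List, Optional

  @dataclass
  class AbstractTask:
      name: str                      # a generic action, phrased independently of any single site
      level: str                     # "navigation", "content", "graphics", ...
      domain: Optional[str] = None   # None = domain-independent; otherwise e.g. "cultural heritage"

  # Hypothetical library entries, for illustration only
  AT_LIBRARY = [
      AbstractTask("reach an arbitrary section from the home page", level="navigation"),
      AbstractTask("find the events occurring on a given date", level="content",
                   domain="cultural heritage"),
  ]

  def tasks_for(level: str, domain: Optional[str] = None) -> List[AbstractTask]:
      """Select the ATs of a given level: domain-independent ones plus those of the given domain."""
      return [t for t in AT_LIBRARY if t.level == level and t.domain in (None, domain)]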

The inspector has to understand the client’s communicative goals, combine them with the intended users’ probable requirements, and then select the appropriate set of tasks to perform. If, for example, we have to evaluate a museum site that is specifically meant to attract visitors to the real museum, we, as inspectors, concentrate on all those tasks that at the content level involve the practical-services part of the site (opening hours, “buy a ticket”, etc.).

When performing inspection, the inspector has to check a list of attributes concerning the different facets of usability/quality (e.g. richness, completeness, etc.). For each attribute (in relation to a specific AT), a score must be given. After the scoring phase is over, the set of collected scores is analyzed through “weights” which define the relevance of each attribute for a specific goal (or, technically speaking, for a “user scenario”).

Weighting allows a clean separation of the “scoring phase” (using the application, performing the tasks, and examining them) from the “evaluation phase” in a strict sense, where different possible usages are considered. Let us introduce a simple example: assume that a navigation feature (e.g. using indexes) is not very powerful, but very easy to learn. What should the evaluation be? With MiLE the inspector can provide a score (e.g. 9/10 for “predictability” and 2/10 for “powerfulness”) for the navigation. Later, figuring out two different user scenarios (e.g. casual users and professional users), the evaluator (possibly different from the inspector) can assign two different pairs of weights to the attributes “predictability” and “powerfulness”. The weights, for example, could be <0.8 (predictability), 0.2 (powerfulness)> for casual users, or <0.1 (predictability), 0.9 (powerfulness)> for professional users. The weighted score for the navigation feature is of course very different (7.6 for casual users and 2.7 for professional users), but it reflects the different user scenarios. The inspector could therefore conclude that the application (at least for this feature) is well suited for casual users, while it is somewhat ineffective for professional users. Trying different weighting systems allows the evaluator to test different user scenarios using the same set of scores derived from the inspection.
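The arithmetic of this example can be replayed in a few lines; the following is a minimal sketch of ours (not part of MiLE), using exactly the scores and weights quoted above:

  # Inspector's scores for one navigation feature (out of 10)
  scores = {"predictability": 9, "powerfulness": 2}

  # Two user scenarios, each expressed as a weighting of the same attributes
  scenario_weights = {
      "casual users":       {"predictability": 0.8, "powerfulness": 0.2},
      "professional users": {"predictability": 0.1, "powerfulness": 0.9},
  }

  def weighted_score(scores, weights):
      """Weighted sum of the attribute scores for one feature under one scenario."""
      return sum(weights[a] * s for a, s in scores.items())

  for scenario, weights in scenario_weights.items():
      print(scenario, round(weighted_score(scores, weights), 1))
  # casual users 7.6
  # professional users 2.7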

In short, an inspection with MiLE requires the following steps:

  • selection of the relevant portion of the application (based upon considerations that we can’t investigate here);
  • selection of the Abstract Tasks that are relevant for the intended user scenarios;
  • execution of the Abstract Tasks, providing scores for each attribute;
  • for each user scenario:
      • weighting of the attributes and of the tasks chosen (a certain task can be more relevant than another);
      • production of quantitative evaluation measures (applying weights to scores).
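As an illustration of the last two steps, the following sketch (our own; the attribute names anticipate the list given later in Section 2, while all numeric values are invented placeholders) weights attribute scores within each Abstract Task and then combines the tasks with task weights for one user scenario:

  # scores[task][attribute], as produced while executing the Abstract Tasks
  scores = {
      "find the events on a given date": {"currency": 8, "structure effectiveness": 4},
      "find the opening hours":          {"currency": 9, "structure effectiveness": 7},
  }

  # One user scenario = a set of attribute weights plus a set of task weights
  attribute_weights = {"currency": 0.6, "structure effectiveness": 0.4}
  task_weights = {"find the events on a given date": 0.7, "find the opening hours": 0.3}

  def task_evaluation(task_scores, attribute_weights):
      return sum(attribute_weights[a] * s for a, s in task_scores.items())

  def scenario_evaluation(scores, attribute_weights, task_weights):
      return sum(task_weights[t] * task_evaluation(s, attribute_weights)
                 for t, s in scores.items())

  print(round(scenario_evaluation(scores, attribute_weights, task_weights), 2))  # one quantitative measure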

The reliability of the method has proved to be very high. Executing the Abstract Tasks (at the navigation and content levels) produces more reliable evaluation results and helps spot unexpected usability problems (inconsistencies, lack of clarity, etc.). Even “at-first-sight agreeable” sites, when put to the test through a systematic inspection “à la MiLE”, may reveal weaknesses and defects.

Inspection already provides valuable evaluations; in some cases, however, panels of users may be required for double checking. When empirical testing is required, users are given a list of concrete tasks, i.e. a list of specific actions that they are asked to perform. The definition of concrete tasks (different for each case) is based upon the results of the inspection, which has identified the portions of the application, tasks and attributes that need special attention. “Real users” can fine-tune the inspector’s observations, confirming them (very often), dissenting from them (seldom), or spotting additional problems. We now discuss how the approach outlined above applies to cultural heritage applications.

2. Guidelines for Evaluating Cultural Heritage Applications

Some features of an application (such as navigation or layout) can be examined largely independently of a specific application domain; other features, such as content or the functions offered to users, require a different evaluation schema for each application domain. In order to explore functions and contents for museum Web sites (a specific sub-domain within the larger domain of cultural heritage applications), a specific panel of “experts” (the so-called “Bologna group”) was created, through a partnership between Politecnico di Milano and “Istituto Beni Culturali”, a regional organization supervising cultural heritage activities in a large region, Emilia Romagna, with headquarters in Bologna. The group is composed of museum curators (from archaeology, modern and contemporary art museums and galleries), museum communication experts, and researchers of new technologies for cultural heritage.

The first step of the Bologna group was to identify the main pieces of a generic museum Web site. In order to avoid the danger of “wish listing” the sum of what everybody could foresee as the “ideal Web site”, we took an empirical stance: we selected a large number of sites and considered them to be the “universe of discourse”. The resulting model (shown also in the Appendix) is therefore a synthesis of contents and features found in those sites. At this stage of the research, we have listed more than a hundred “elementary” constituents, organized into three main groups:

A. site presentation: general information about the Web site;

B. museum presentation: contents and functions referring to a “physical museum” (like “arrows” pointing to the real world);

C. the virtual museum: contents and functions exploiting the communicative strength of the medium.

A further analysis has allowed us to detect “high level” constituents, such as collections, services, and promotion, which group the elementary constituents (a full account of all the pieces of the model can be found in the Appendix to this paper).

The next job was to define a set of user scenarios as a way to build a library of suitable ATs. A user scenario, in this context, is a pair <user profile, operation (that users may wish to perform)>. In simple terms, we tried to identify a number of user profiles (culture, expertise, interests, etc.) and for each of them we tried to provide a set of meaningful answers to this simple question: “What might a user (with this profile) want to do with the application?”

Therefore the tasks are coupled to user profiles, in the sense that a given task may be interesting for a given profile and meaningless (or irrelevant) for a different one. When the inspectors perform an inspection, they will learn from their customers who the intended users of the application are and will concentrate on those tasks likely to be performed by those users. In any case the inspectors will be free to create new tasks that better fit the communicative goals of the application, as long as they follow the guidelines of the method and its “philosophy”.

Overall we have classified ATs according to two different dimensions: “scope” and “concern”. Possible values for scope are the following: Narrow (a specific item is interesting), Complex (several items are interesting), and General (a generic overview). Possible values for concern are Practical Info (the user wants to gather useful information), Operational (the user wants to do something) and Cognitive (the user wishes to learn something).

The following table shows some examples of AT, classified accordingly:


Table 1: Some examples of AT
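To make the two dimensions concrete, here is a small illustrative encoding (our own sketch; the example tasks are hypothetical and are not taken from Table 1):

  from enum import Enum

  class Scope(Enum):
      NARROW = "a specific item is interesting"
      COMPLEX = "several items are interesting"
      GENERAL = "a generic overview"

  class Concern(Enum):
      PRACTICAL_INFO = "the user wants to gather useful information"
      OPERATIONAL = "the user wants to do something"
      COGNITIVE = "the user wishes to learn something"

  # (task, scope, concern) triples; hypothetical examples only
  EXAMPLE_ATS = [
      ("find the opening hours of the museum", Scope.NARROW, Concern.PRACTICAL_INFO),
      ("book a guided tour", Scope.NARROW, Concern.OPERATIONAL),
      ("get an overview of a whole collection", Scope.GENERAL, Concern.COGNITIVE),
  ]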

Regarding the users, we took into consideration a number of variables, such as age, expertise, and professional interest (e.g. school students, fine arts students, fine arts experts, tourists, etc.). Each relevant user profile is based upon a number of these variables.

On-going research work consists of identifying the largest possible number of ATs. More than fifty ATs have been identified so far (while 49 ATs were identified for navigation, in a separate research effort), and more than double that number will be the likely result.

Regarding the list of attributes to be scored during the inspection, we started with the idea that they would be different for each AT. At the moment, however, we have developed the following list, which seems to be applicable (with only minor problems) to virtually any AT:

  • Efficiency: the action can be performed successfully and quickly
  • Authority: the author is competent in relation to the subject
  • Currency: the time scope of the content’s validity is clearly stated; the information is up to date.
  • Consistency: similar pieces of information are dealt with in similar fashions
  • Structure effectiveness: the organization of the content pieces is not disorienting
  • Accessibility: the information is easily and intuitively accessible
  • Completeness: the user can find all the information required
  • Richness: the information required is rich (many examples, data…)
  • Clarity: the information is easy to understand
  • Conciseness: the basic pieces of information are given; texts are neither too long nor redundant
  • Multimediality: different media are used to convey the information
  • Multilinguisticity: the information is given in more than one language
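As a convenience for inspectors, the list above can be encoded so that a score sheet for an AT can be checked for completeness; a minimal sketch of ours (not part of MiLE) follows:

  ATTRIBUTES = [
      "efficiency", "authority", "currency", "consistency", "structure effectiveness",
      "accessibility", "completeness", "richness", "clarity", "conciseness",
      "multimediality", "multilinguisticity",
  ]

  def missing_scores(score_sheet):
      """Return the attributes that have not yet been scored for a given AT."""
      return [a for a in ATTRIBUTES if a not in score_sheet]

  # e.g. missing_scores({"efficiency": 7, "clarity": 8}) returns the other ten attributes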

3. Some Examples

In this section we will introduce a few examples of inspection to help the reader grasp how our method works. The examples are very simple, and are taken from actual Web sites. We hope that the Web sites will not be modified between the writing of this paper and its reading, so that readers may try to “inspect” them directly. (The impossibility of “freezing” Web sites, in practice, makes it difficult to develop examples of inspection that could maintain their validity over a long span of time.)

Example 1 (Practical Info AT)

find the events/exhibitions/lectures occurring on a specific date in a real museum

The user scenario for this task is that of well-educated French-speaking tourists (who can speak English too), first-time visitors to the site, who know that on March 9th (Saturday), 2002, they will be in the town where the real museum is located. Therefore they would like to know what special exhibitions or activities of any kind (lectures, guided tours, concerts) will take place on that day.

We performed this task on many different Web sites, and we describe here our findings for the Louvre site (www.louvre.fr) and the Royal Ontario Museum site (www.rom.on.ca), on the basis of an inspection that took place on February 13th, 2002. The focus of our attention is the section named “information about museum activities and events” in our schema (see the Appendix for details).

The relevant attributes that we use for this brief example are the following:

(A1) currency of the information;

(A2) quality of the organization of the information, since users are looking for operational support;

(A3) multilinguisticity, fundamental for an international audience;

(A4) richness of the information provided, very important in order to make the potential interest of the events understandable.

The Louvre Web site offers a choice among four languages (French, English, Spanish, Japanese). On the home page we find (on the left menu) three relevant links: “Expositions”, “Auditorium” (a rather “obscure” name, for it refers to a specific place in the real museum, so its actual meaning can be understood only by returning visitors) and “Visites – conférences et ateliers”.

If we click on “Expositions”, we get a list of the exhibitions currently available or coming soon; “temporal windows” allow users to easily select the exhibition fitting their needs.

If we click on “Auditorium”, we find a wide choice of activities: "agenda", "concerts", "cinéma muet en concert", "classique en images", "colloques", "conférences", "les enfants au Louvre", "films", "lectures", "midis du Louvre" (conferences and activities taking place at midday), "musée-musées", "musique filmée", "l’oeuvre en direct".

From a graphic point of view, the first element of the list (“agenda”) looks exactly the same as all the others; only after a short exploration do visitors discover that under the entry “agenda” they can find, arranged in chronological order, all the pieces of information that are available under the other twelve entries.

If we finally click on “Visites – conférences et ateliers”, we have to choose whether we’re interested in “visites-conférences” or “ateliers” and whether we’re adults or children. The “visites-conférences” are further divided into these categories: "visite découverte", "visite d’une collection", "visites thématiques", "thèmes du lundi soir", "monographies d’artistes", "une heure-une œuvre", "visite d’exposition temporaire", "promenades architecturales". Clicking on “promenades architecturales” we discover that on March 9th (Saturday) there is a special guided tour called “Le Louvre, l'oeuvre et le musée”, of which we find no mention in the agenda. Therefore, even once one has discovered the “collective” character of the agenda page, a brief exploration of other related links makes it clear beyond doubt that the agenda page is not exhaustive with regard to all the special events taking place at the Louvre on a specific date. On the whole, we can say that although the information is well updated and exhaustive once found, too many paths have to be trodden in order to get to the point.

The Royal Ontario Museum shows a better solution: on the home page we find “what’s on calendar”, a link that leads us to a page showing a big calendar of the current month, divided into cells; all the events taking place on a specific day are listed in the day’s cell and are all linked to a description page. Previous or next months (or even years) are shown on demand by clicking on the arrows at the top of the page. The task can be performed quickly and easily. Note that if our French-speaking tourist doesn’t speak English at all and decides to enter the French version of the site, he has to choose the item “expositions”, leading to a much less appealing list of present, future and past exhibitions, very similar to the Louvre’s. For the scoring we have considered the English version.

The table below synthesizes our scoring and evaluation.


Table 2: The scores and the evaluation for “a visit in a given day”

We do not ask the reader to agree with our scores (we may be poor inspectors), but to appreciate a number of features of the method:

a) We are evaluating a specific task and not expressing a global evaluation; in addition, we are scoring each single attribute. This level of detail brings two advantages: precision of the feedback to application designers, and the possibility of pinpointing the causes of discrepancies among different inspectors.

b) Through weights we can take into account the specific objectives for the (portion of the) application. In the example above, we gave great relevance to attributes A2 and A1, and minor relevance to A3 and A4.

c) A global, concise evaluation can be obtained through combining the evaluations for each attribute (as in the above table), and/or combining the evaluations for the different ATs (again using weights in order to assign different relevance to each AT).

d) Different systems of weights can be used in order to take into account different user profiles.
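As a worked illustration of points (b)–(d), the sketch below combines the four attribute evaluations of each site with a weight vector that, as stated above, gives more relevance to A1 and A2 than to A3 and A4; both the scores and the exact weights are placeholders of ours, not the values of Table 2:

  # Placeholder attribute scores (out of 10), loosely reflecting the discussion above
  site_scores = {
      "Louvre":               {"A1": 9, "A2": 4, "A3": 8, "A4": 8},
      "Royal Ontario Museum": {"A1": 9, "A2": 9, "A3": 3, "A4": 7},
  }

  # More relevance given to currency (A1) and organization (A2) than to A3 and A4
  weights = {"A1": 0.35, "A2": 0.35, "A3": 0.15, "A4": 0.15}

  def global_evaluation(scores, weights):
      """Combine attribute evaluations into a single figure for one site and one weight system."""
      return sum(weights[a] * s for a, s in scores.items())

  for site, scores in site_scores.items():
      print(site, round(global_evaluation(scores, weights), 2))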

Example 2 (Cognitive AT)

find all the works of an artist shown in the site

This task might be performed by a high-school student looking for some information about an artist he’s currently studying at school; let’s say Giovanni Battista Tiepolo. He finds out that some of Tiepolo’s works are held by the Metropolitan Museum (www.metmuseum.org) and by the Hermitage Museum (www.hermitagemuseum.org).

The relevant attributes that we will use for this brief example are the following:

(A1) effectiveness of the information;

(A2) completeness of the information;

(A3) richness of the information;

(A4) navigation organization.

Using The Metropolitan Museum’s Web site, we have two choices: to use the search engine, or to navigate the site. Typing the name “Tiepolo” in the search window of the home page, we get a list of more than 200 records, many of which refer to the online shop. If we leave aside this overly long list and enter the section “collections”, we can use the tool “search the collection”, again entering the name “Tiepolo” (or the full name “Giovanni Battista Tiepolo”, in order to avoid mixing Giovanni Battista’s and Giovanni Domenico’s works). This more precise search tool returns the list of the 23 works by Giovanni Battista Tiepolo shown on the Web site. For each of the works, we also have the basic data, a description and the possibility of zooming in on the image.

If we decide to ignore the search and to find what we’re looking for by navigating the site, then we have to reach the sub-section “European paintings” of the section “the collection”; here we find a brief introduction in which there is a mention of our artist and a link. Following the link we are shown a single work by Tiepolo (“Allegory of the planets and continents”), but since we have entered the guided tour of the department’s highlights, clicking on the “next” or “previous” buttons brings up other artists’ works and no more Tiepolos. In order to perform our task, we have either to check one by one the 2275 items preserved by the department (clicking on “entire department”) or, if we don’t want to spend too many hours in front of the screen, to turn again to the search tool.

The Hermitage Web site offers similar functionality, for we have a search tool right on the home page. Entering the name “Tiepolo”, we are given a list of 10 works of art (each of them with a zoom); one of them (neither the first nor the last, so it takes a careful reading to notice it) is in fact not by Giovanni Battista but by Giovanni Domenico Tiepolo; in any case, we are not allowed to search by the full name (Giovanni Battista Tiepolo). There is no description of the works.

If instead we navigate the site, we can choose either “collection highlights” or “digital collection”. In the former case, we have to further select the option “western European art” and eventually “painting”; at the end of this path we are given an introduction to Western European painting with some small icons on the right side, one of them representing the painting by Tiepolo, “Maecenas Presenting the Liberal Arts to Emperor Augustus”. A link leads to a bigger image and a description. As an alternative, we can decide to browse the “digital collection” by type of art work (“paintings prints and drawings”) and artist, getting nine results (corresponding in this case to the nine works by Giovanni Battista Tiepolo).

On the whole, we can say that both sites permit reaching the desired information only by using the search engines: this can be considered a sign of poor organization of the information.

For the Metropolitan, the collection’s search engine must be used and not the “main” search; otherwise the user may get completely lost! The Hermitage search engine doesn’t distinguish between the works of Giovanni Battista and Giovanni Domenico Tiepolo. The Metropolitan Museum offers a good description of all the items, whilst the Hermitage offers a description for only a single work (the others are simply “shown”).

The table below synthesizes our scoring and evaluation.

Table 3: The scores and the evaluation for “all the works of a given artist”

We should note, first, that in this case the weights do not change the relative evaluation of the two sites, but rather reduce both of them, given the high relevance assigned to A4. Secondly, we can see that if more detail about navigation is wanted, then a different level of analysis should be used: we have devised nearly 50 different tasks in order to inspect navigation features precisely.

4. Conclusions and Future Work

The general distinctive features introduced by MiLE can be synthesized as follows:

  • Efficient combination of inspection and empirical testing
  • Use of Abstract Tasks, ATs, as guidelines for inspection
  • Use of Attributes as a way to detail scoring
  • Use of Concrete Tasks, CTs, as guidelines for empirical testing
  • Use of weights as a way to translate scores into evaluation
  • Use of user profiles in order to assign weights      

The specific contribution of the Bologna group (there is also another research group, coordinated by The Museum of Science and Technology of Milan, examining the same issue for scientific and technical museums) is described in this paper. Our task has been the identification of a general framework for defining a set of ATs suitable for art museum Web sites. The framework is the result of an extensive analysis of several Web sites, which are now the object of our trial inspections.

The current work consists of identifying, through the ATs, the “universe of possible functions” that a museum Web site should support; the next step will be to pair user profile features with ATs. The goal is to generate an overall schema showing what type of user is interested in what information/action. The combination of user profile and AT is what we mean by user scenario; therefore, we could also say that we are trying to build a large set of possible user scenarios for museum Web sites.

We aim to provide a contribution to the community of people interested in museum Web sites (museum curators, designers, Web managers, etc.), sharing our understanding of what it means to evaluate the quality and usability of “virtual artifacts”.

Since the amount of work to be performed is immense, and we would like to generate a discussion in a large community, the authors encourage all interested persons to contact them in order to enlarge the scope and the validity of this research in evaluation.

5. References

Costabile, M.F., Garzotto, F., Matera, M., & Paolini, P. The SUE Inspection: A Systematic and Effective Method for Usability Evaluation of Hypermedia. IEEE Transactions on Systems, Man, and Cybernetics. In print.

Costabile, M.F., Garzotto, F., Matera, M., & Paolini, P. Abstract Tasks and Concrete Tasks for the Evaluation of Multimedia Applications. Presented at the Int. Workshop on Theoretical Foundations of Design, Use, and Evaluation, Los Angeles, CA, USA, 1998.

De Angeli, A., Costabile, M.F., Garzotto, F., Matera, M., & Paolini, P. On the Advantages of a Systematic Inspection for Evaluating Hypermedia Usability. International Journal of Human-Computer Interaction, Erlbaum. In print.

De Angeli, A., Matera, M., Costabile, M.F., Garzotto, F., & Paolini, P. Validating the SUE Inspection Technique. In Proc. AVI, 2000, pp. 143-150.

Garzotto, F., & Matera, M. A Systematic Method for Hypermedia Usability Inspection. The New Review of Hypermedia and Multimedia, vol. 3, pp. 39-65, 1997.

Garzotto, F., Matera, M., & Paolini, P. A Framework for Hypermedia Design and Usability Evaluation. In Designing Effective and Usable Multimedia Systems, P. Johnson, A. Sutcliffe, J. Ziegler (Eds.), Boston, MA: Kluwer Academic, 1998, pp. 7-21.

Garzotto, F., Matera, M., & Paolini, P. Abstract Tasks: A Tool for the Inspection of Web Sites and Off-line Hypermedia. In Proc. ACM HT, 1999, pp. 157-163.

Acknowledgements

We wish to acknowledge the work of the other members of the Bologna group, who made this (still ongoing) research effort possible. We therefore warmly thank Dede Auregli (Galleria d'Arte Moderna di Bologna), Gilberta Franzoni (Musei Civici di Arte Antica di Bologna), Paola Giovetti (Museo Civico Archeologico di Bologna), Laura Minarini (Museo Civico Archeologico di Bologna), Federica Liguori (Politecnico di Milano), and Uliana Zanetti (Galleria d'Arte Moderna di Bologna).

Appendix: the Contents Survey Schema for Museum Web sites

1. First section: site presentation

This part of the schema describes the information about the Web site's structure.

2. Second section: the real museum

3. Third section: the virtual museum