About the title
The title of this paper reflects our search for knowledge representation
models to be applied to museums. We want models rich enough to be used
both for daily data management and for multimedia publishing, such as
CD-ROMs and the Web. The knowledge must also be suitable for building
relations between heterogeneous sources, such as different kinds of museums.
In this paper, we will present two different approaches for representing
museological information: one based on relational databases and
the other based on annotated documents. For each, we will try to
identify the advantages and the difficulties, resulting in a trade-off
that should be analyzed before deciding on one of these models.
We have found that annotated documents are a suitable representation
for unstructured information, such as long descriptions of objects and
archaeological sites, biographies, etc.
Structure of the Paper
This paper starts by identifying our goal: the development of a system
that will minimize the effort a museum needs to maintain a presence
on the Web. To that end, we analyze two different models of data
representation: the relational model and the use of annotated documents.
The relational database model is presented more briefly, because it
is traditionally used and already well understood. The second approach
is presented in more detail, with examples of how to exploit the
semantics embedded in the annotations.
Relational Database-based Knowledge
In the context of GEIRA, there was no specific application to support
the daily data management activities of the museums, so the first
step was the development of one. The application was designed and
developed for the Windows operating system; it was written in Delphi,
and the data is stored in a relational database. It typically runs
on a single PC or on a small network of PCs.
For the definition of the data structures, we followed, as much
as possible, the SPECTRUM [Ass97] recommendations. During the development,
we had the support of people from the museums. At this time, the application
is ready to be installed, and only minor adjustments are expected.
In the coming weeks, we will evaluate the application in daily activity.
Meanwhile, we are preparing a technical report on the application,
covering all design and implementation issues.
For the purpose of this paper, we refer to this application as a practical
solution to fulfill the needs of museum data management. This application
was not written overnight, and some problems had to be solved, but
it was not far from the usual development of an interface over a
database (the definition of the data structures, the tables of the
database and their relations, became easier using SPECTRUM, as we
said). The development of this kind of application, built on top
of a relational database, is easier due to the set of existing sophisticated
and affordable tools, such as the Delphi and SQL Server products we used.
This means that, when deciding on this model, we can expect a rich
set of tools to be available.
The first Web site
To make the presence of this variety of museums on the Web possible,
in a systematic way, some alternatives were explored.
Another team of this project, working in Vila Real, developed a
common abstract structure for the museums [LC97].
Each site is then constructed as an instance of that structure.
The root of each museum site starts in an atrium, and is then divided
into five subsections: collections, activities, free theme 1, free theme
2, and contacts. The pages generated for each section are structured
with several frames, each with a specific functionality. This mapping
between the abstract structure and the concrete HTML pages was done
by a multimedia designer.
To make this model systematic, almost all the information is stored
in a relational database, rather than in HTML pages. The pages are
created dynamically using Microsoft Active Server Pages. The construction
of a new site consists of filling in database fields (with text,
images, etc.), rather than writing and composing HTML pages. In
this way, the task of building a new site is faster and more accurate.
Maintaining the sites is also easier, and can be done by people
who are not aware of HTML details.
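The general pattern can be sketched in a few lines of Python (our actual
implementation uses Active Server Pages; the museum.db file, the sections
table and its fields shown here are hypothetical):

# Sketch of database-driven page generation (hypothetical schema).
import sqlite3

def render_section(db_path, museum_id, section):
    # Fetch the stored fields for one museum section.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT title, body FROM sections WHERE museum_id = ? AND name = ?",
        (museum_id, section)).fetchall()
    conn.close()
    # Compose the page from database fields, not from hand-written HTML.
    items = "\n".join("<h2>%s</h2>\n<p>%s</p>" % (t, b) for t, b in rows)
    return "<html><body>\n%s\n</body></html>" % items

# Example: render_section("museum.db", 1, "collections")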
Annotated Document-based Knowledge
From the beginning, in parallel, we started the study of SGML [Her94],
to learn how useful it could be for representing and reasoning about
museum data.
The study reported in this section was carried out to deal
with archaeological sites, where museum objects were found. The approach
is being used in a particular museum with thousands of archaeological
artifacts. That is why the data (documents) we will use to exemplify
some SGML concepts are mostly related to archaeology.
This data is entered by archaeologists, rather than by the museum
staff, but will be available through the museum. The archaeological
data will be cross-referenced against information associated with
the objects available in the museum collections.
Cross-referencing the particular information about a museum object
with the information about the archaeological site where it was
found can be useful to create a framework where the full context
of that piece can be explained. The precise location of the piece,
and even the entire archaeological site, can be seen through a GIS
(Geographical Information System) plug-in.
Writing SGML Documents
For this approach, the data should be written as SGML documents.
These can be created either by transforming the original documents
collected by the archaeologists (in some other format) into SGML,
or by asking them to write SGML directly. After showing them how to
work with a DTD-driven editor, and the benefits of automatic syntax
validation along with structural manipulation of text within the editor,
it was easy to have the archaeologists adopt the latter approach.
Because they used Winword, we first adopted SGML Author from Microsoft,
but it soon became clear that it was not a very good option. We changed
to WordPerfect, also affordable, which supports SGML more conveniently.
WordPerfect has many interesting features: it automatically highlights
violations of the DTD structure, computes a list of the valid choices
depending on the cursor position, has support for tagging non-structured
documents, can automatically ask for the values of required attributes,
etc.
The SGML documents are then processed to perform at least two tasks:
checking additional constraints, and generating HTML.
The additional constraints are necessary to ensure consistency among
the data. This validation is only possible when we check the contents
of a document against others. This validation task, and the related
discussion about quality assurance, is reported in detail in [RRAH97].
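The flavor of such a cross-document constraint can be sketched as follows
(a hypothetical check, written in Python, that every municipality named in
the entries also appears in a master list kept in another document; the
extraction of the element values is assumed to be done by an SGML parser):

# Hypothetical constraint: every CONCEL value used in the entries must
# appear in a master list of municipalities kept in another document.
def check_concelhos(entry_concelhos, master_concelhos):
    known = set(master_concelhos)
    return [c for c in entry_concelhos if c not in known]  # offending values

# Example: check_concelhos(["Braga", "Guimaraes"], ["Braga"]) -> ["Guimaraes"]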
As said above, the second result of processing the original SGML
documents is the generation of the HTML pages. This is not an assisted
step; it is completely automatic, which was one of our strongest
requirements. Being automatic does not mean that it is a blind process:
we can introduce as much intelligence into this step as there is
knowledge to do so. That knowledge depends on the design of the
DTD, and on how the text is tagged (with more or less detail).
Generating HTML pages
To generate the HTML, we had to choose an SGML processing tool.
Such a tool can have two "operating modes": transforming
and formatting. Although it seemed that for this task we would only
need a formatting processor, to meet our goals we also need a transforming
processor (to produce differently structured views of the data).
We compared the available tools in order to choose the best one for
our intended use:
- Perl (sgmlspl.pl and SGMLS.pm) [Meg95b], [Meg95a]: has the advantage
of being freely available, but has major drawbacks if you move deep
into transforming, where programming gets highly complex (e.g.,
processing sub-DTDs).
- OmniMark and Balise: two commercial tools, more or less equivalent;
the major difference lies in the conditions of acquisition; OmniMark
has made a light version freely available that can be used in small
to medium projects.
We chose OmniMark [Omn96] and we are generating HTML with OmniMark scripts.
OmniMark is a complex processor, focused on SGML processing. In
our case we are using only a small subset of its functionality,
mainly "down-translation".
Here is an example of a simple script that generates an HTML list of
all the entries in the archaeological SGML file:
DOCUMENT-START
OUTPUT "<UL>%n"
ELEMENT IDENTI
OUTPUT "<LI>%c%n"
ELEMENT #IMPLIED
SUPPRESS
DOCUMENT-END
OUTPUT "</UL>%n"
In this script, before the processing starts ("DOCUMENT-START")
we open an HTML list; during processing, whenever we find an "IDENTI"
element, which identifies an entry, we generate a list item ("<LI>")
with its contents ("%c"); all other elements we may find are ignored
("SUPPRESS"); when we reach the bottom of the file, we close the HTML
list.
At this point, anyone new to SGML document processing can notice
a major advantage of keeping documents in SGML. Since we can define
DTDs and maintain information according to them, we have a richer
format. It becomes very easy to generate a set of different HTML
pages from the same SGML document. Those HTML pages can reflect the
structure of the source document or have completely new structures.
For example, in the above script, we could collect the "IDENTI"
elements into an associative array, sort this array, and generate
a sorted list of entries (although we are not modifying the structure,
we are changing the order of the content).
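As a sketch of that variation (written in Python rather than OmniMark,
and assuming the IDENTI contents have already been extracted from the
SGML file by a parser):

# Collect the entry identifications, sort them, and emit a sorted HTML list.
def sorted_entry_list(identi_contents):
    lines = ["<ul>"]
    for entry in sorted(identi_contents):
        lines.append("<li>" + entry)
    lines.append("</ul>")
    return "\n".join(lines)

# Example: sorted_entry_list(["Citania de Briteiros", "Bracara Augusta"])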
As a more sophisticated example, we could want a new document with
all the entries grouped by geographical area ("<CONCEL>" in our SGML
files). This implies reordering and restructuring the source document:
DOWN-TRANSLATE
global stream area-stream variable initial-size 0 ; one buffer per area
global stream temp                                ; buffer for the current entry
global stream concelho                            ; buffer for the current area name
ELEMENT #IMPLIED
   SUPPRESS                          ; ignore every element without a rule
ELEMENT arqueo
   OUTPUT "%c"                       ; just process the contents
ELEMENT identi
   OPEN temp as buffer               ; buffer the entry as a list item
   PUT temp "<li>%c%n"
   CLOSE temp
ELEMENT concel
   OPEN concelho as buffer           ; capture the area name
   PUT concelho "%sc"
   CLOSE concelho
   DO WHEN !(area-stream has key concelho)
      NEW area-stream key concelho   ; first entry for this area
   DONE
   REOPEN area-stream key concelho as buffer
   PUT area-stream key concelho temp ; append the entry to its area
   CLOSE area-stream key concelho
DOCUMENT-END
   REPEAT OVER area-stream           ; one heading and one list per area
      OUTPUT "<h2>Entries of '" ||
             key of area-stream ||
             "':</h2>%n<ul>%n%g(area-stream)</ul>%n"
   AGAIN
Moreover, if we distinguish specific visitors, we can generate
pages on the fly, according to user attributes such as level of
expertise.
As another example, if we want a LaTeX version of our documents,
we just have to write a script to do the job.
What should be reinforced here is that we write all these
scripts once. The documents to which they apply can vary in their
contents, but the scripts will remain functional. If we keep the
structure (we do not change the DTD), all the processing remains
stable. At the other end, if one of the formats being used
to present our information (HTML, LaTeX, ...) is upgraded, we only
need to change the scripts to reflect this; we do not need to go
through all our documentation upgrading texts. SGML is standard
and platform independent, and that is the major advantage we expect
to take from it.
Of course, there is the effort of writing the scripts. But, importantly,
this effort is being made by us now, to enable each museum, in
the future, to work with our minimal support.
Tools: Search, Thesaurus and Encyclopedia
In fact, our interest in SGML has to do with document reasoning,
and the reasoning carried out is meant to provide the information
requested by sophisticated users, who are not interested in quantity.
Answers "in quantity" can easily be obtained through several blind
automatic keyword indexing engines (some more blind than others).
Even these search engines are trying to become more adequate for users
who are not so impressed with "more is better" and their sophisticated
technology, but are really searching for specific things.
The main goal of this section is to show how to profit from the
structure of the SGML-annotated information in order to see it as
a knowledge base capable of inference. We will also show how
to exploit meta-information in the process of building new documents
(e.g., HTML pages) and new tools (e.g., browsers, search engines).
The definition of the DTD and the tagging process associate a
type with each tagged element. The element tag (and sometimes the
attributes) indicates the type of the information.
In order to build a knowledge base with the different sorts
of information, a classification structure is necessary. In our
case, a thesaurus will be created.
Building a Thesaurus
In order to establish relations between the different kinds of objects,
the thesaurus will:
- establish relations of equality and normalization (alternate
or UseFor terms)
- define relations of being a particular case of something (isa relation)
- define some properties related to a term
- establish other relations between terms (writers write books,
etc.)
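A minimal sketch of these relations, written in Python (the terms and
relations shown are illustrative, not taken from the project's actual
thesaurus):

# Illustrative thesaurus: normalization, isa hierarchy and other relations.
USE_FOR = {"amphora": "storage vessel"}     # alternate term -> preferred term
ISA = {"storage vessel": "ceramic object",  # particular case of (isa)
       "ceramic object": "artifact"}
RELATED = {("writer", "writes"): "book"}    # other relations between terms

def normalize(term):
    # Map an alternate term to its preferred form.
    return USE_FOR.get(term, term)

def broader_terms(term):
    # Walk the isa relation up to the most general term.
    term = normalize(term)
    chain = []
    while term in ISA:
        term = ISA[term]
        chain.append(term)
    return chain

# Example: broader_terms("amphora") -> ["ceramic object", "artifact"]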
The thesaurus is an important tool to define relations over the
heterogeneous sources of information participating in the GEIRA project
(museums, etc.), and it is a way to reconcile different classification
strategies. A browser and search engine over this heterogeneous
information, with some conceptual structure, will work like an
"encyclopedia".
Building an Encyclopedia
In this context, an "encyclopedia" should be understood both as a view
over the information and as a navigation tool.
The encyclopedia contains terms and associated information, which
can be:
- types and their relations (from the thesaurus)
- instances, with associations to the provider, the information sources,
pointers to the source document, and the context in which the term
appeared (typically, many instances for each type)
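A sketch of such an entry, in Python (all names and values are
illustrative):

# Illustrative encyclopedia entry: a term, its thesaurus relations,
# and the instances found in the document base.
ENTRY = {
    "term": "amphora",
    "isa": "ceramic object",                      # from the thesaurus
    "instances": [
        {"provider": "museum A",                  # who supplied the data
         "source": "arqueo-braga.sgml",           # pointer to the source document
         "context": "two amphorae found at ..."}, # context of the occurrence
    ],
}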
Conclusions and Future Work
The adoption of SGML by the people entering data was an easy step,
as good editors, giving authors prompt feedback in a WYSIWYG
environment, are available. These editors made entering SGML documents
as easy as entering any other unstructured document, but with the
benefit of producing a structured document. In our study, the
archaeologists prefer to enter SGML documents instead of unstructured
descriptions, because they appreciate assistance that ensures structural
validation and some content validation.
For some daily data management, the SGML approach is not as adequate
as the relational database, because for the latter there is already
a rich set of sophisticated tools available to implement applications
relying on the underlying data model; these tools have been developed
incrementally over the last decades.
The relational database model, however, does not adapt well when
we try to use it for less structured data, with textual characteristics
and with the notion of sequence. This kind of data is frequent in
descriptions of objects, in discussions of their importance, in the
context where the objects were made or discovered, in biographies,
and so on. If the data is less structured, how can it be incorporated
into a fixed structure? If there are many variants, the relational
model also tends to grow in the number of fields, many of which
will be null in a particular instance. In a textual description, we
read a clear sequence of words carrying meaning; putting discrete
elements in a database, we lose that sequence. Due to this limitation
of the relational model (mainly when managing less structured data),
we usually create memo fields big enough to store the textual
descriptions. All the information can be stored in memo fields, but
the only thing we can do with it is to store and retrieve the whole
field as a block.
The SGML standard is suitable for documents with less formal structure,
enabling further processing and reasoning. It is possible to manipulate
parts of the data, to build relations between those parts, etc.
The drawback of this model is that this processing and reasoning
does not come without a price, in the sense that the processing
must be programmed. There are sophisticated tools to process SGML
documents, but they are very expensive and require some practice
to get useful results. SGML also has advantages over the relational
model in the preliminary stage of data manipulation, when a structure
is not yet clearly defined: we can only start to work with the relational
model after that structure is built, whereas the SGML approach accepts
incremental refinement.
Future Work
At this stage, we need to evaluate and test as many tools and systems
for SGML processing as we can. We are also considering the possibility
of using just XML, as it seems to be powerful enough for our purposes,
without some of the difficulties of SGML, which is a more general
standard. XML also seems to be well accepted in the community.
From a more scientific point of view, our short-term work is the
investigation of the combination of the relational database model
with SGML. In this kind of architecture, we would have the usual
fields of the relational tables, but some of them would contain
annotated data. To take advantage of the annotations, some improvements
must be made in the relational engine, enabling the processing of the
fields containing SGML according to the respective DTD, or even
without a DTD.
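A minimal sketch of this idea, written in Python and using XML (which we
are considering anyway) inside a relational field; the objects table, its
fields, and the element names are hypothetical:

# Hypothetical hybrid model: a relational table whose 'descr' field holds
# annotated text that can be processed according to its markup.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, descr TEXT)")
conn.execute("INSERT INTO objects VALUES (1, "
             "'<descr>Found at <concel>Braga</concel> in 1992.</descr>')")

# The 'improved engine' step: query inside the annotated field,
# instead of storing and retrieving it as an opaque block.
for oid, descr in conn.execute("SELECT id, descr FROM objects"):
    concel = ET.fromstring(descr).find("concel")
    print(oid, concel.text if concel is not None else None)  # -> 1 Braga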
Acknowledgments
The authors would like to thank Prof. Alberto Proença, responsible
for the GEIRA project at the University of Minho. We would also like
to thank the Vila Real team, working on the same project, for their
contribution, and we acknowledge all the help received from the
archaeological unit of the University of Minho.
The GEIRA project is supported by the EC INTERREG II program.
Finally, we thank the organizers of MW'98 for their work and for the
opportunity to share our work with other institutions.
References
- [Ass97] Museum Documentation Association. Spectrum: The UK Documentation
Standard, 2nd edition. Technical report, 1997.
- [Gro95] CIDOC Archaeological Sites Working Group. CIDOC Core Data
Standard for Archaeological Sites and Monuments. Technical report, 1995.
- [Her94] E. van Herwijnen. Practical SGML. Kluwer Academic Publishers,
1994.
- [LC97] Leonel and José Bulas Cruz. Motor para criação de sítios web
para museus (engine for the creation of museum web sites). Technical
report, 1997. In Portuguese.
- [Meg95a] D. Megginson. sgmlspl: A Simple Post-Processor for sgmls and
nsgmls. Technical report, Department of English, University of Ottawa,
October 1995.
- [Meg95b] D. Megginson. SGMLS.pm: A Perl5 Class Library for Handling
Output from the sgmls and nsgmls Parsers. Technical report, Department
of English, University of Ottawa, Canada, October 1995.
- [omn] Defining Microdocument Architecture. Technical report.
http://www.omnimark.com/white/microdoc/microdoc.html.
- [Omn96] OmniMark Technologies. OmniMark Programmer's Guide. 1996.
- [Roy] Bruce Royan. Quality Control in Electronic Networks. Technical
report.
- [RRAH97] José Ramalho, Jorge Rocha, José Almeida, and Pedro Henriques.
SGML Documents: Where Does Quality Go? In SGML/XML'97 Conference
Proceedings, pages 171-177, Washington, 1997.
- [Sev95] Eric Severson. The Art of SGML Conversion: Eating Your
Vegetables and Enjoying Dessert. Technical report, 1995. White Paper
4001-II.
- [Smi97] Alastair Smith. Criteria for Evaluation of Internet Information
Resources. Technical report, 1997.
http://www.vuw.ac.nz/~agsmith/evaln/index.html.