64th IFLA General Conference
August 16 - August 21, 1998

Code Number: 138-161(WS)-E
Division Number:
Professional Group: UBCIM Core Programme
Joint Meeting with: Permanent UNIMARC Committee and Division of Bibliographic Control
Meeting Number: 161.
Simultaneous Interpretation: No

UNIMARC and Metadata : Dublin Core

Alan Hopkinson

Middlesex University
London, UK

E-mail: a.hopkinson@mdx.ac.uk

Abstract

Metadata means data about data so UNIMARC itself is a carrier of metadata. UNIMARC was developed for a specific purpose: exchanging records between different automated cataloguing systems. Dublin Core is a set of metadata elements, 15 in all, intended to facilitate the retrieval of electronic resources. This paper discusses reasons why we would want to map two sets of metadata elements with their respective syntaxes which have in common only that they are metadata and bibliographic. Problems with the mappings are also outlined.

Paper

1. Introduction

Metadata means data about data. The term includes catalogue data. It is used increasingly to refer to any data used to aid the identification, description and location of networked electronic resources.

Why do we need a new term when we as librarians have managed quite well without it for so long? The answer is that other interested groups in this electronic age are entering what was exclusively librarians' territory, and they are having to think up or re-use terminology for their own purposes, which do not necessarily conflict with ours. So librarians have taken on board a new term, metadata, though they did not need it as they already had terminology to cover this concept.

According to the above definition, data in the UNIMARC format will usually if not always be metadata.

2. Why this paper: UNIMARC and metadata

In the past, finding aids were produced only by librarians, who produced general catalogues, and by some related groups of people, often practitioners or researchers in a particular field, who produced lists of journal articles, or indexing services as they were called. Today, other groups are producing 'finding aids'. The largest arena of this production is in connection with web search engines. When you do a search on a web search engine, you do not search the whole of the web directly but rather an index to the web which has been generated by a computer scanning web sites around the internet. This index is generated automatically and leaves a great deal to be desired. Here is metadata at its worst! Yet this is metadata's biggest exposure to the world at large.

Many (librarians) who a few years ago predicted the death of the library profession are now retracting and saying the world at large must realise the importance of indexing data intellectually rather than automatically. The question is do you have a librarian indexing in place of or perhaps rather in addition to an automated indexer, or do you have a librarian helping the end user who wishes to make his search more effective? The latter is going back to the idea of the intermediary, so beloved of information scientists in the 1970s. Today users, people at large, want, indeed demand to do their own searching, so the intellectual precision has to be at the index generation end rather than with the end user himself or herself.

What is required is for every web page to include some intellectually devised terms so that the computers that generate indexes can pick these up. Additionally they could include author and title information. Basically the information world needs to produce catalogues of web resources in the same way that cataloguers produce catalogues of books. How do cataloguers produce catalogues of books? They use the title-page, a 'device' which has been developed over centuries to represent the definitive aspects of bibliographic material. As soon as we leave the realm of the book and go into other materials, the cataloguers amongst us look for the title page (or title page substitute). Where is the title page of a kit, a film, a gramophone record? Their title pages are often in other media, for example the record label, though in the case of a film, the 'title-page' could be the label or it could be at the start or end of the film itself.

In the case of certain electronic materials we have a similar situation. Is the title page of a CD the label on the CD or is it in electronic form within the CD? With internet materials we have no such luxury of alternative sources; the 'title-page' must be in the electronic page itself. There is a certain amount of structure mandatory for any web-page: the 'syntax' of the page which has to be present to tell the computer system how to process the data to display it on the end-user's screen. Then there are certain features such as the 'title' which appear on the top of each web-page. However, there are also specially defined data elements which can be accessed by web crawlers. Here may be stored more information than what is displayed on the screen that the end user sees. In one way it can be regarded as CIP, Cataloguing in Publication. However, as well as the data being useful for web browsers, they may also be extracted into library catalogues. Computerised catalogues can then include records of electronic resources ideally with as little manual intervention as possible.
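As a rough illustration of how a crawler or a catalogue loader might pick up the title and the specially defined data elements from a page, here is a minimal Python sketch using only the standard library. The class name `HeadMetadataParser`, the function `extract_metadata` and the sample page are all hypothetical, invented for this illustration; real crawlers are considerably more elaborate.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and every <meta name=... content=...> pair."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = []            # (name, content) pairs, in page order
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta.append((name, content))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html_text):
    """Return (title, list of (name, content) pairs) for one page."""
    parser = HeadMetadataParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.meta


# Hypothetical page, invented for illustration only.
SAMPLE_PAGE = """<html><head>
<title>An example page</title>
<meta name="keywords" content="cataloguing, metadata">
<meta name="author" content="A. N. Author">
</head><body><p>Text the end user sees.</p></body></html>"""

title, meta = extract_metadata(SAMPLE_PAGE)
print(title)
for name, content in meta:
    print(name, "=", content)
```

The point of the sketch is only that the embedded elements are machine-readable without any of the displayed text being parsed; whether the values are good enough for a catalogue is a separate question.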

3. Standards for data on web pages

The standards for data on web pages are notoriously free and easy. Standards for indexing are notoriously difficult to achieve anyway, particularly if indexing is to be consistent across more than one discrete catalogue; the web is universal, so the task of indexing across the web is going to be difficult. The structure or syntax on web pages is also customarily free and easy, though there are certain constraints. Dublin Core is shorthand for the Dublin Metadata Core Element Set which was agreed at the OCLC/NCSA Metadata Workshop in March 1995. One of the uses of this set is in the cataloguing of electronic resources and it is generally held that it should be the standard used on web pages for the 'catalogue record', if indeed there is to be one: 'The Dublin Core is the leading candidate as a lingua franca for resource discovery on the net' [1]. It is worth remembering that Dublin Core is not confined to use in HTML pages. Also noteworthy is that it is intended to be usable by non-cataloguers (e.g. the authors of web pages) as well as by those with experience with formal resource description models (i.e. cataloguers).

Here is an example of a Dublin Core document identification embedded in HTML.

Sample

In this record I chose to invert the author's name: there is nothing in Dublin Core to tell me to do this. Incidentally, I created this example manually from the IFLA page. Though UKOLN do have a Dublin Core generator DC-dot [2], it cannot make as good a job of it as a cataloguer.
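To give a concrete idea of what Dublin Core elements embedded in HTML can look like, here is a minimal Python sketch that renders a small record as `<meta>` lines. It is not the author's own sample. It assumes the common convention of prefixing element names with `DC.` in the `name` attribute; the function name `dublin_core_meta_tags` and the choice of elements are illustrative, with the title, inverted author name and year taken from this paper's own heading.

```python
from html import escape


def dublin_core_meta_tags(record):
    """Render (element, value) pairs as HTML <meta> lines.

    Assumes the "DC." name-prefix convention; attribute values are escaped
    so that quotation marks in a title cannot break the markup.
    """
    lines = []
    for element, value in record:
        lines.append('<meta name="DC.{}" content="{}">'.format(
            escape(element, quote=True), escape(value, quote=True)))
    return "\n".join(lines)


# Illustrative record for this paper; the name is inverted by the cataloguer,
# not by any rule in Dublin Core.
RECORD = [
    ("title", "UNIMARC and Metadata : Dublin Core"),
    ("creator", "Hopkinson, Alan"),
    ("date", "1998"),
]

print(dublin_core_meta_tags(RECORD))
```

Nothing in the sketch decides how a value should be formulated: the inversion of the name, for instance, is supplied by whoever builds the record.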

4. Dublin Core and UNIMARC

To recapitulate, library cataloguing systems need MARC records. If a MARC record could be extracted from a web page containing an electronic document thought to be worth cataloguing, so much the better.

4.1 Record structure

4.2 Data elements

Here is a table of comparisons based on the one in that study, with the addition of the recently introduced 856 field which was mentioned in Brian Holt's paper.

Additionally a few changes have been made to add extra titles such as parallel title and to remove certain descriptive data elements such as 200 $f First statement of responsibility (equated to creator). Data in this subfield are not in indexed form and may just not be necessary in an electronic medium as they merely repeat data in an access point field (700) in another form (as on the document instead of formalised).

Dublin Core     UNIMARC
Title           200 $a Title Proper
                200 $e Other Title Information (for subtitle)
                510 $a Parallel title
                517 $a Other Variant Titles (for other titles)
Creator         700 $a Personal Name - Primary Intellectual Responsibility, or if more than one:
                701 $a Personal Name - Alternative Intellectual Responsibility
                710 $a Corporate Body Name - Primary Intellectual Responsibility, or:
                711 $a Corporate Body Name - Alternative Intellectual Responsibility
Subject         610 $a Uncontrolled Subject Terms
                606 Topical Name Used as Subject (for LCSH and MeSH)
                675 UDC
                676 DDC
                680 LCC
                686 Other Classification Systems
Description     330 $a Summary or Abstract
Publisher       210 $c Name of Publisher, Distributor, etc.
Contributors    701 $a Personal Name - Alternative Intellectual Responsibility
                711 $a Corporate Body Name - Alternative Intellectual Responsibility
Date            210 $d Date of Publication, Distribution, etc.
Type            608 Form, Genre or Physical Characteristics Heading
Format          336 $a Type of Computer File (provisional)
Identifier      001 allocated by the system
                010 ISBN
                011 ISSN
                020 (National Bibliography Number)
                856 $a URL
Source          324 Original Version Note
Language        101 Language of the Item
Relation        300 General Note
Coverage        300 General Note
Rights          300 General Note
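The table can be written down as a simple lookup, which makes the one-to-many nature of the mapping easy to see. The following Python sketch transcribes the field tags and subfield codes exactly as given in the table; the names `DC_TO_UNIMARC`, `ambiguous_elements` and `convert` are hypothetical, and the rule of always taking the first-listed target is an assumption made purely to keep the example short, not a recommended conversion policy.

```python
# Candidate UNIMARC targets for each Dublin Core element, transcribed from the
# table in the order listed. Each target is (field tag, subfield code or None
# where the table names no subfield).
DC_TO_UNIMARC = {
    "Title": [("200", "a"), ("200", "e"), ("510", "a"), ("517", "a")],
    "Creator": [("700", "a"), ("701", "a"), ("710", "a"), ("711", "a")],
    "Subject": [("610", "a"), ("606", None), ("675", None),
                ("676", None), ("680", None), ("686", None)],
    "Description": [("330", "a")],
    "Publisher": [("210", "c")],
    "Contributors": [("701", "a"), ("711", "a")],
    "Date": [("210", "d")],
    "Type": [("608", None)],
    "Format": [("336", "a")],
    "Identifier": [("001", None), ("010", None), ("011", None),
                   ("020", None), ("856", "a")],
    "Source": [("324", None)],
    "Language": [("101", None)],
    "Relation": [("300", None)],
    "Coverage": [("300", None)],
    "Rights": [("300", None)],
}


def ambiguous_elements(mapping):
    """Return the elements that have more than one candidate UNIMARC target."""
    return [element for element, targets in mapping.items() if len(targets) > 1]


def convert(dc_record, mapping=DC_TO_UNIMARC):
    """Naively convert (element, value) pairs to (tag, subfield, value) triples.

    ASSUMPTION: the first-listed target is always chosen. A real conversion
    would need information that Dublin Core itself does not carry, such as
    whether a Creator is a person or a corporate body.
    """
    converted = []
    for element, value in dc_record:
        tag, subfield = mapping[element][0]
        converted.append((tag, subfield, value))
    return converted


print(ambiguous_elements(DC_TO_UNIMARC))
print(convert([("Title", "UNIMARC and Metadata : Dublin Core"),
               ("Creator", "Hopkinson, Alan")]))
```

Five of the fifteen elements have more than one possible target, and three quite different elements (Relation, Coverage, Rights) all fall into the same general note field, so information is lost in both directions.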

Michael Day's paper [4] covers the mapping in detail and should be consulted for the full discussion. The main thrust is that UNIMARC records consist of data formulated according to highly prescriptive cataloguing codes, whereas Dublin Core data elements are less highly specified. The data elements reflect this in that they cover broader categories of data. UNIMARC also has a concept of main entry (not mandatory, but usually present). Dublin Core does not include this concept. Day also refers to a study by Caplan and Guenther relating to USMARC [5]. Many characteristics of USMARC apply to UNIMARC.

In short, data produced according to one set of conventions in one tradition by one category of producer will not usually be easily converted to data produced by another. What if cataloguers produce data in Dublin Core with a view to its automatically producing a catalogue record in another format? Even this does not seem possible, as Dublin Core does not have any coding to provide the detail needed to specify a record that could be converted to UNIMARC. The short answer is that it may not be possible to have a standard which is suitable for authors and for cataloguers at the same time. In book production one has publishers in between the author and the publication, and even then many publishers would not be able to provide their own CIP record.

Library catalogues today do not usually distinguish between personal and corporate authors in their indexes, since the powerful retrieval tools we have make it unnecessary. MARC formats include the potential to do this; indeed the distinction is mandatory in most MARC formats and must be followed there. But it is not there in Dublin Core. One distinction that Dublin Core does make is between Creator and Contributor. This is present, though not always explicit, in UNIMARC and, deeper down, in the relator codes, which are not mandatory.

If you go to Day's paper, you will see that, since he wrote it, the only area where compatibility has increased is the new field for the URL in UNIMARC, and this is a clerical, not an intellectual, improvement to UNIMARC. Dublin Core has not changed, though it is extensible and work is going on to formulate best practices for extending it.

5. Conclusion

When comparisons are made between different 'formats', it is often not very profitable to compare anything other than like with like. However, we have isolated a reason for investigating the convertibility of Dublin Core to UNIMARC: the feasibility of including automatically produced catalogue records of electronic items in library catalogues. The issues are not complex, though the conversion itself would be. The nub of the matter is that it is difficult to produce a catalogue record from data which has not been prepared with the aim of producing a UNIMARC record. The Dublin Core 'rules' and intentions do not make this any easier.

References

[1] Miller, Eric. Dublin Core metadata. [Dublin : OCLC, 1995?] http://purl.org/metadata/dublin_core
[2] Powell, Andy. DC-dot : a Dublin Core generator. Bath: UKOLN, [1997?] http://www.ukoln.ac.uk/metadata/dcdot/
[3] Format for Information Interchange. Geneva, ISO, 1996 (ISO 2709-1996).
[4] Day, Michael. Mapping Dublin Core to UNIMARC : draft. Bath, UKOLN, 1997. http://www.ukoln.ac.uk/metadata/interoperability/dc_unimarc.html
[5] Caplan, P. and Guenther, R. Metadata for Internet resources: the Dublin Core data elements set and its mapping to USMARC. Cataloguing and classification quarterly, 22 (3/4), 1996, 48.
