   
Section on Classification and Indexing and Section on Information Technology
Joint Working Group on a Classification Format
Requirements for a Format for Classification Data
6. Classifications reviewed
This section is not intended as a detailed analysis of individual classification systems, but a general description and discussion of only those aspects that may influence the development of the UNIMARC classification format. A detailed analysis of the Library of Congress Classification (LCC) and the Dewey Decimal Classification (DDC) was undertaken during the development of the USMARC format. These schemes are given only a brief description here.
To accommodate the requirements of an international format, the Universal Decimal Classification (UDC) and foreign editions of DDC were analyzed in more depth. Although no single one of LCC, DDC, and UDC is used everywhere, one or another of these schemes is used by the large majority of the world's libraries. In addition, DDC is the foundation and model for many later classifications.
6.1 Library of Congress Classification (LCC)
LCC is a pragmatic enumerative scheme intended as a shelf-location device. It is an arbitrary scheme geared to actual books in a library; it prescribes no universal order of subjects. It uses number building, but to a lesser degree than either DDC or UDC. Its notation is mixed alphanumeric and makes only limited use of symbols. Constant form numbers are not used to any degree, and Cutter-type numbers for subject subdivision are not always mnemonic or constant. LCC is widely used in the United States, especially in university and research libraries. It is also used by university and research libraries in Canada, France, the United Kingdom, and elsewhere. It is maintained by the Library of Congress, and additions and changes to the scheme are published quarterly in printed form. In addition, schedules are revised and reprinted from time to time. The Library of Congress began publishing its schedules from machine-readable records in early 1995. Classification Plus, a CD-ROM version of LCC introduced in early 1996, also uses the machine-readable records.
6.2 Dewey Decimal Classification (DDC)
6.2.1 General description
The DDC is the most widely used classification in the world. It is published in two editions, unabridged (full) and abridged. DDC is a hierarchical scheme and is based on a universal system of knowledge. By means of enumerated classes, and classes synthesized through "add" instructions, the DDC provides numbers identifying the subjects of most books acquired by a general library, regardless of size of library. Standard DDC notation consists of arabic numbers. It is maintained at the Library of Congress Decimal Classification Division and published by Forest Press, a division of OCLC. The print versions are kept up-to-date between editions through DC&, an annual publication of additions and corrections. Electronic Dewey, a DOS-based CD-ROM version of Edition 20, was introduced in January 1993 and an update published in March 1994. Dewey for Windows, a Microsoft Windows-based CD-ROM version of Edition 21 that is the replacement for Electronic Dewey, will be introduced in summer 1996.
DDC currently uses an online system, developed prior to the USMARC format, to support the publication of the scheme in printed and electronic form. This system, the Editorial Support System (ESS), is a UNIX-based system in a non-MARC, Dewey-specific format. Because the ESS is adequate to satisfy DDC's publication needs, it is not intended that the USMARC format will be used for printing of the DDC (Guenther 1992A, 121). In the future, DDC may be available for distribution in the MARC format.
6.2.2 International use of DDC
DDC is used throughout the world. A survey of recent literature on the use of DDC in various countries reports that DDC is used by over 200,000 libraries in 135 countries and has been translated into thirty languages (Sweeney 1991). For example, Béthery (1991) describes the wide use of DDC in France and points to the urgent need for a new translation of the unabridged edition.
To accommodate this international use, the DDC Editorial Office has attempted to revise schedules to correct the inherent Anglo-American bias. Beall (1991) details the changes meant to "internationalize" the DDC, such as expansions to accommodate differences in literature, history, ethnicity, philosophy, religion and law for various countries or cultures. Despite this effort, regional or cultural adaptations of DDC exist. For example, in the Italian DDC 20, books of the Bible are arranged in Catholic order; placement of specific dishes in 641.8 is modified to match Italian tradition; and there is an expanded area table for Italy.
Some of these editions or augmentations are incorporated into the English language DDC. Others are not incorporated into the general DDC schedules, particularly if the expansion or modification will be useful only locally.
When used internationally, DDC may appear in different language editions, different scripts, different subsets or augmentations, and different levels of fullness. Often, the printed schedules must accommodate multiple languages and scripts in one document.
6.3 Universal Decimal Classification (UDC)
6.3.1 General description
The Universal Decimal Classification (UDC) is a general classification scheme for classifying the whole of recorded knowledge. It is often used in two ways: on documents, to determine physical shelf arrangement; and in bibliographies, as part of the entries that refer to the documents (Robinson 1994). UDC is a hierarchical classification, which means that each subdivision may be further subdivided into its logical components (McIlwaine 1993). UDC is also a synthetic classification: it uses number building extensively and permits the joining of any one part of the classification with any other. Because there are also many combined terms in UDC, it can be described as a semi-enumerative scheme, since faceted schemes specify only simple terms.
UDC was developed by two Belgians, Paul Otlet and Henri La Fontaine. First published as a complete edition between 1905 and 1907, it was originally intended only as a tool for the Repertoire bibliographique. However, by the 1920s it was being developed as an all-purpose universal classification (McIlwaine 1993).
UDC originally was derived from the 5th edition of DDC and the UDC schedules incorporate much material published in later editions of DDC (Guide 1963). Although DDC was the original base of UDC, the developers of UDC, over the years, have greatly expanded the UDC and added many synthetic devices and auxiliary tables to make it much more detailed.
6.3.2 Notation
The notation is language-independent and is composed of arabic numbers, arranged according to the decimal system, and symbols. The main principle of division is hierarchical. Although UDC notation is capable of expressing relations between subjects, it is not always capable of distinguishing different kinds of relations. The colon, in particular, may be used to indicate either a coordinate or a subordinate concept (McIlwaine 1994).
6.3.3 Maintenance
The intellectual ownership of UDC and the responsibility for its maintenance and development were held for many years by the Fédération Internationale de Documentation (FID) in The Hague. Both were transferred to the UDC Consortium in January 1992.
6.3.4 UDC publications
Editions of varying fullness exist in many languages. There are three types of editions that may appear in any given language: full, medium and abridged. McIlwaine (1994) notes these terms are relative, and a "medium" edition in one language may be the same size as an abridged in another. There are also special subject editions in various languages, each containing a selection of classes appropriate to a particular discipline (Robinson 1994).
6.3.5 The UDC Master Reference File (MRF)
In 1990, a UDC Task Force on System Development recommended the creation of a standard version of UDC, in English, in a machine-readable format. Named the "Master Reference File (MRF)," this database was to provide the individual publishers of UDC with the core material for all editions of the UDC in whatever language, size and form, and on whatever medium. It was also to be the basis for the revision of the schedules, authorized first through publication in "Extensions and Corrections to the UDC" (Strachan and Oomes 1993). The MRF is the basis for all editions, including the International Medium Edition (McIlwaine 1994). It is revised annually, and all license holders receive the annual update.
The database in its final stage totals ca. 60,000 records. In the final stages of record conversion, a new data element format had to be developed. Some fields in the former design were no longer functional and others had to be added to handle the revisions and revision history. This new format is still experimental. [The fields contained in the new MRF format are given in Appendix C.]
In its current stage of development, the MRF can be delivered as a database in Micro CDS/ISIS, as a file in ISO 2709 interchange format, and as a text file in ASCII that can be loaded into a word processor (Strachan and Oomes 1993, 27).
6.3.6 Organization of the UDC schema
UDC is organized on the basis of classes, which may be simple or compound. A simple class is a straightforward subdivision, while a compound class is formed by two or more different types of concepts (or facets) within the same class. Where two classes intersect, e.g. in the title "Mathematics for engineers," this is referred to as a complex class and is frequently expressed through the use of a colon (McIlwaine 1993).
There are two types of tables in UDC:
- the main tables (main schedules); and
- the auxiliary tables.
Similarly, the notation may be of two types:
- numbers of main classes and subdivisions; and
- auxiliary numbers.
Auxiliary numbers are numbers and/or symbols which are added at the end of the main number to extend the meaning of a number by giving additional facets of the subject such as points of view, forms, geographical areas, chronological subdivisions, etc. The auxiliary numbers are distinguished by specific signs and symbols. It is these auxiliary numbers that permit the construction of synthesized compound numbers. McIlwaine (1994) points out that, in UDC terminology, the applications of some of the connecting symbols such as the colon and square brackets are referred to as tables and do in fact constitute the Common Auxiliary Tables 1a and 1b.
Auxiliary numbers. There are two types of auxiliary numbers: common and special. The common auxiliary tables are applicable throughout the main table and specify recurrent characteristics such as place, language, physical form, etc. These numbers may be used to qualify any concept. Certain common auxiliaries, principally those of common form and of place, may be used as a main number if required. For example, collections of maps might be arranged using the area table, or serials might be designated using the appropriate form number:
(054) Newspapers (of all kinds)
(054)(44) French newspapers
The special auxiliaries are not listed in one place, but occur at various places in the main tables where applicable. They specify locally recurrent characteristics and generally have a more limited subject range. A note always explains the application of these auxiliary subdivisions. For comparison purposes, common auxiliaries can be roughly equated to the external tables in LCC and Tables 1 - 7 in DDC; special auxiliaries can be roughly equated to the internal (add) tables in LCC and DDC.
Symbols used. Symbols used in the UDC notation with auxiliary numbers include:
| Symbol | Description |
| + (plus sign) | Connects two or more non-consecutive numbers to denote a compound subject, e.g., (44 + 460) France and Spain |
| / (slash) | Connects first and last of a series of consecutive numbers to denote a broad subject or range of concepts, e.g., 592/599. Either the main number or the extension number could be a range. If the extension is a range, the digits common to both up to the point are omitted, e.g., 629.734/.735 |
| : (colon) | Simple relationship. Links two or more UDC numbers to specify a related concept(s) of equal value, e.g., 17:7 Relation of ethics to art. The numbers can be reversed to ensure separate retrieval |
| [ ] (square brackets) | Subgrouping. Used as an algebraic subgrouping device when two or more main UDC numbers are linked by a plus sign or a colon to denote a complex subject which is as a whole related to another by colon, e.g., 31:[622+669](485) Statistics of mining and metallurgy in Sweden |
| :: (double colon) | Order-fixing. An irreversible relator used to fix the order of the component numbers in a compound number, especially when the UDC is used in a computer-based system, e.g., 061.1(100)::[54 + 66] IUPAC International Union of Pure and Applied Chemistry |
| = ... | Specifies the language of the document, e.g., =111 English |
| (0...) | Form of document, e.g., (051) Periodicals. In addition, documents may be arranged according to form by citing the form auxiliary first |
| (1/9) | Common auxiliaries of place, e.g., 59(4) Zoology of Europe |
| (=...) | Common auxiliaries of ethnic grouping and nationality, e.g., 17(=11) Ethics in Germanic races |
| "..." | Common auxiliaries of time, e.g., 17"19" Ethics in 20th century |
| * | Codes and notations (non-UDC), e.g., 546.42*90 Strontium 90 |
| A/Z | Alphabetic extension. Names, etc., e.g., 75REM Paintings of Rembrandt |
| .00 | Point of view, e.g., 622.002.5 Mining from the equipment aspect |
| - (hyphen) | A special auxiliary indicating elements, components, properties, etc. of the subject denoted by the main number, e.g., 82-1/-9 denoting literary form. Also used for common auxiliaries of materials (-03) and common auxiliaries of persons and personal characteristics (-05) |
| .0 | A special auxiliary providing sets and subsets of recurrent concepts, such as aspects, studies, activities, etc., e.g., 303.01/.03 theoretical and methodological aspects |
| ' (apostrophe) | A special auxiliary, usually more specific than -1/-9, frequently but not invariably denoting compound subjects by compound notation, e.g., 547.426'171 Nitroglycerine |
| --> (arrow) | Used in English editions (and some others) as an instruction to the classifier, but not inherently part of UDC. Denotes "see also" |
| [equals tilde] | Instruction to the classifier meaning "subdivided as" (parallel subdivision), used when one portion of the classification is subdivided like another |
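Because each auxiliary is introduced by its own sign, the symbols above lend themselves to mechanical recognition. The following Python sketch is an illustration only and is not part of UDC or of any MARC format. It splits a simple notation into a main number and a few of the common auxiliaries listed above (form, place, ethnic grouping, time, language). The function name `split_udc`, the facet labels, and the regular expression are assumptions made for this example; the connecting symbols +, /, :, :: and [ ] are deliberately not interpreted.

```python
import re

# One alternative per auxiliary sign (assumed patterns, for illustration).
# Order matters: "(=" must be tried before the other parenthesized forms.
_FACETS = re.compile(
    r'''
    (?P<ethnic>\(=[^)]*\))        # (=...)  ethnic grouping and nationality
  | (?P<form>\(0[^)]*\))          # (0...)  form of document
  | (?P<place>\([1-9][^)]*\))     # (1/9)   place
  | (?P<time>"[^"]*")             # "..."   time
  | (?P<language>=[0-9.]+)        # =...    language
    ''',
    re.VERBOSE,
)


def split_udc(notation: str) -> list[tuple[str, str]]:
    """Return (label, text) pairs; text not matched by a facet is 'main'."""
    parts = []
    pos = 0
    for m in _FACETS.finditer(notation):
        if m.start() > pos:
            parts.append(("main", notation[pos:m.start()]))
        parts.append((m.lastgroup, m.group()))
        pos = m.end()
    if pos < len(notation):
        parts.append(("main", notation[pos:]))
    return parts


print(split_udc("59(4)"))
print(split_udc("(054)(44)"))
```

Each part could then be indexed separately, which is the kind of retrieval on individual facets that the requirements in section 7 ask for.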
Citation order. UDC provides for changes in citation order. If the structure of the specific area of the scheme does not have a built-in citation order, the classifier may devise an order that best satisfies his needs. This ability to retrieve in various citation orders makes it possible to retrieve all concepts individually. The only time citation order is important is in shelf or file arrangement. However, if the scheme is to be used for the interchange of information, all users will need to adopt the same citation order (McIlwaine 1993).
Filing order for symbols. Unlike citation order, a fixed filing order is specified for the various codes and symbols. First in filing order comes the number followed by + ...; second, the number followed by / ...; third, the simple number; and so on.
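A fixed filing order can be expressed as a sort key. The sketch below covers only the three positions stated above (plus sign, slash, simple number); every other symbol is given a placeholder rank after them, because the rest of the order is not spelled out here. The name `filing_key` and the rank values are assumptions for illustration.

```python
import re

# Ranks for the positions named in the text; the full UDC order is longer.
_RANK = {"+": 0, "/": 1, "": 2}
_UNSPECIFIED = 3  # placeholder for symbols whose position is not given here


def filing_key(notation: str) -> tuple[str, int]:
    """Sort key: leading number first, then the rank of what follows it."""
    compact = notation.replace(" ", "")
    m = re.match(r"[0-9.]+", compact)
    base = m.group() if m else ""
    following = compact[len(base):len(base) + 1]
    return (base, _RANK.get(following, _UNSPECIFIED))


numbers = ["622", "622/623", "622+669"]
print(sorted(numbers, key=filing_key))
```

Comparing the leading number as a string rather than as an integer is intentional, since decimal notation files digit by digit.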
Indentation. In some UDC schedules, indentation is used to indicate subordination of topics. However, indentation sometimes varies from edition to edition.
Authority file. Individual users generally compile an authority file relating to the local application of the scheme, in which all decisions are recorded. This authority file should be searchable both through the UDC notation and the natural language terms (McIlwaine 1993).
7. Requirements for a UNIMARC classification format
7.1 General
This section sets forth requirements for a UNIMARC classification format that are in addition to those covered by the USMARC Format for Classification Data. These requirements address the needs from several areas:
- additional data elements needed to extend the USMARC format coverage to UDC and "international" DDC editions;
- elements in the UNIMARC/Authorities format that it would be desirable to include in the classification format; and
- current and future uses.
General requirement. Since it is impossible to predict all the possible future uses of classification data, elements should be tagged as explicitly as possible and to the smallest level possible. Some of the data in manual classification systems are implicit; all of this data must be made explicit in the machine format.
Attributes. In considering these requirements, one should also consider the attributes of classification records. Data may not be as volatile as in other types of records. For instance, data may come mostly from one place, i.e., the classification maintenance agency. (This is not always true in the case of UDC.) Unlike bibliographic records, classification records to be converted are closer to a finite set. Thus, it may be worthwhile to provide elements in the format that are labor intensive to supply during conversion, but that will save labor in the long run.
7.2 Requirements resulting from extending the format to UDC and DDC international
This section details the requirements peculiar to UDC and "international" DDC and the ability/inability of the USMARC format to handle them.
7.2.1
Ability to handle various combinations of language, script, edition, subset, augmentation, etc. In accommodating these data elements, one must be careful to make a distinction between a straight translation or transliteration (version) and an augmented, abridged, etc. edition. These attributes must be recorded independently of one another; for example, one could have, hypothetically, an expanded edition of DDC in Russian using Cyrillic script and a straight translation of DDC into Russian using the roman script. Should one ever want to incorporate various specialized versions of either DDC or UDC into one main scheme, these data elements will be key.
Language. Indication of language should accommodate:
- ability to get language equivalency
- production of multilingual publications, including the ability to publish side-by-side, e.g., Spanish and English. Note that current multilingual publications are not always one-for-one. For example, tête-bêche publications in Canada consist of two separate texts bound in one volume, with two front covers in two different languages, each with its own title page and prefatory matter
- ability to designate whether a translation, and if a translation, the language, or edition, translated from and/or based on.
The USMARC Format currently treats language information in field 084, $e. This subfield is currently not repeatable. The format does not explicitly describe any translation information. One would want to describe both the details of the translation and the details of the version on which it was based.
Scripts. Indication of scripts should accommodate:
- ability to handle and designate different scripts at the file level
- ability to handle and designate different scripts at the record level
- ability to handle different scripts within one record, e.g., references to English authors within an Arabic classification where the name of the English authors is given in the Roman script.
The USMARC format does not currently describe scripts.
Editions--Fullness of edition. The format should accommodate an indication of the fullness of the edition, e.g., full edition, intermediate edition, abridged edition. Fullness of edition must be recorded independently of whether the edition is augmented or a subset. For example, one could have an abridged edition of a DDC scheme in which certain geographic areas of the scheme are greatly augmented.
McIlwaine (1994) points out the problem that different language editions, which look as though they may be the same, will vary considerably, depending on local requirements. For example, the French Medium edition of 1991/93 has considerable detail for France and those parts of the world that are of French interest. The English Medium edition of 1993 has far greater detail for the United Kingdom and for the Commonwealth than for the rest of the world.
The USMARC format currently describes the fullness of edition in field 084, first indicator.
Augmentation/subset (edition). The format should accommodate an indication of whether the area or part of the scheme has been augmented, e.g., the Italian edition of DDC has an expanded area table for Italy. Included in this category would be national extensions to DDC or UDC.
The USMARC format does not currently explicitly describe this information, although the edition title appears in 084 $b.
Authorized/unauthorized (edition). The format should accommodate an indication of whether an edition is an authorized or an unauthorized edition. This information may also include the person or organization licensed for the edition.
The USMARC format currently does not explicitly describe this information.
Root classification scheme. Need to accommodate the root scheme when the root scheme has been augmented, i.e., the classification scheme and edition upon which the edition is based.
The USMARC format currently does not explicitly describe this information.
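The data elements discussed in 7.2.1 can be pictured as independent attributes of an edition. The Python sketch below is a hypothetical model only; the class name, attribute names, and values are assumptions for illustration and do not correspond to USMARC or UNIMARC field definitions. It encodes the Russian example given at the start of this section, with the fullness and the source language of the translation filled in as assumed values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EditionInfo:
    """Hypothetical edition-level attributes, each coded separately."""
    scheme: str                            # e.g. "DDC" or "UDC"
    language: str                          # language of the text
    script: str                            # script used for the text
    fullness: str                          # "full", "intermediate", or "abridged"
    augmented: bool = False                # locally expanded areas present?
    authorized: bool = True                # authorized edition?
    translated_from: Optional[str] = None  # source language, if a translation
    root_scheme: Optional[str] = None      # scheme/edition this one is based on


# Expanded Russian edition in Cyrillic script (fullness is an assumed value).
expanded = EditionInfo("DDC", "Russian", "Cyrillic", "full", augmented=True)

# Straight translation into Russian in roman script (source language assumed).
translation = EditionInfo("DDC", "Russian", "Roman", "full",
                          translated_from="English")

print(expanded != translation)
```

Keeping each attribute in its own slot is what lets the two Russian editions be told apart even though they share scheme and language.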
7.2.2
Ability to retrieve individually on parts of the notation, e.g., the ability to retrieve by subject or form. In UDC, it is desirable that each part of the UDC number be separately searchable. This may include the ability to recognize various relational symbols (see 7.2.4).
McIlwaine (1994) notes the difficult problems caused by the parallel subdivision device. UDC intends to abandon it in the future in favor of using colon combinations or other auxiliaries, e.g., use of area numbers instead of spelling out place numbers in History.
The USMARC format currently describes parts of notation but not relational symbols in field 765 (see 7.2.4).
7.2.3
Ability to distinguish use of certain common auxiliaries as main numbers versus use of the same number qualifying a number from the main tables. For example, (054)(44) French newspapers (common auxiliary 054 used as main number); versus 378.18(054) university students' newspapers.
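In an uncoded notation string, the only clue to this distinction is position. The following Python sketch shows that heuristic; the function name and the returned labels are assumptions for illustration, and a format would presumably record the role explicitly instead of inferring it.

```python
def role_of_auxiliary(notation: str, auxiliary: str) -> str:
    """Classify an auxiliary by where it first occurs in the notation."""
    index = notation.find(auxiliary)
    if index < 0:
        return "absent"
    # An auxiliary that opens the notation is standing in as the main number.
    return "main" if index == 0 else "qualifier"


print(role_of_auxiliary("(054)(44)", "(054)"))
print(role_of_auxiliary("378.18(054)", "(054)"))
```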
7.2.4
Ability to recognize certain standardized symbols that specify relationships between numbers (see 6.3.6 for description of the symbols). This is critical in any machine-based automatic calculation for synthesized numbers and for subject retrieval in which the relationship adds meaning. For these symbols, one needs:
- ability to recognize these symbols in order to perform machine-based calculations for synthesized numbers
- ability to recognize these symbols during retrieval
- ability to generate standardized text, i.e., display constants, for these symbols, if desirable
- ability to retrieve separate parts of a number, e.g., in UDC to retrieve auxiliary numbers separately
- ability to file these symbols in accordance with the filing rules of the scheme, e.g., UDC has a certain filing order for symbols (see filing order for symbols in 6.3.6).
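Recognizing the symbols matters in retrieval because they differ in behavior: numbers joined by a colon may be reversed to give a second access point, whereas the double colon fixes the order. The Python sketch below illustrates this for the simplest case only. The function name is an assumption, and notations containing square brackets or more than two parts are returned unchanged instead of being guessed at.

```python
def colon_entries(notation: str) -> list[str]:
    """Return the notation and, for a simple two-part colon link, its reverse."""
    if "::" in notation or "[" in notation:
        return [notation]  # order-fixed or subgrouped: do not reverse
    parts = notation.split(":")
    if len(parts) != 2:
        return [notation]
    return [notation, f"{parts[1]}:{parts[0]}"]


print(colon_entries("17:7"))
print(colon_entries("061.1(100)::[54 + 66]"))
```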
The USMARC format could accommodate some of the information represented by the symbols by coding and/or textual content in fields.
A range of subjects denoted by a slash (/) could be accommodated by the presence of a $c in X53 fields, e.g., 153 $a592 $c599 would generate 592/599 in a record coded as UDC.
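Generating the slash from the two subfields is a simple string operation. The Python sketch below assumes the subfield values arrive as plain strings. The function name and the `abbreviate` option are hypothetical; the option follows one reading of the rule in 6.3.6 that digits common to both numbers up to the point are omitted.

```python
def range_notation(first: str, last: str, abbreviate: bool = False) -> str:
    """Join the $a and $c values of a range with a slash."""
    if abbreviate and "." in first and "." in last:
        head_first, _ = first.split(".", 1)
        head_last, tail_last = last.split(".", 1)
        if head_first == head_last:
            # Drop the shared digits before the point from the second number.
            return f"{first}/.{tail_last}"
    return f"{first}/{last}"


print(range_notation("592", "599"))
print(range_notation("629.734", "629.735", abbreviate=True))
```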
The individual facets in a synthesized number such as language =; place (1/9); ethnic group and nationality (=...); time " "; alphabetic extension A/Z; point of view .00; and other aspects in special auxiliaries denoted by the hyphen, apostrophe, and point nought (-, ', and .0) could be accommodated through use of field 765. The appropriate symbol could be generated from the table name if it were added as an element to the 765 field.
The see also (-->) could be generated by the presence of l in 553 $w. Instructions for parallel subdivision [equals tilde] could be handled through use of field 761. It might be necessary to add an indicator to the field to distinguish between a "subdivide as" note and an "add to" note.
The relationships represented by other symbols (+, :, [ ], and ::) are not readily accommodated in the USMARC format. Oomes (1994) observes a solution similar to that suggested above for the slash might be considered for UDC symbols +, :, ::, and [ ]. McIlwaine (1994) cautions that the relationships conveyed by these symbols are not consistently applied in the UDC. Furthermore, it is doubtful that it would work for +.
7.2.5
Ability to indicate citation order. In UDC, there is no standardized citation order; there is only a standardized filing order for symbols (McIlwaine 1994; see also citation order and filing order for symbols in 6.3.6).
The USMARC format accommodates citation and precedence order instructions in field 768. A subfield for source would need to be added to accommodate local adaptations.
7.2.6 Index
The index requirements do not differ from those currently in the USMARC format. One desirable additional feature would be the ability to show hierarchy of terms (e.g., subfield $b) in USMARC field 154, the generic cross reference field.
7.2.7
Ability to recognize the type of special auxiliary
The UDC calls all schedules "tables" and distinguishes the main tables from auxiliary tables. While the USMARC format allows differentiation between schedule records and table records in a fixed field, it currently does not allow one to designate the type of table which may be mutually exclusive and may be repeatable within any given record. This is a code in the UDC MRF.
7.2.8
Ability to indicate the type of combination, i.e., with colon or other connecting symbol, or with a special type of auxiliary
The USMARC format specifies table name or number, but does not accommodate type of combination. This is a code in the UDC MRF.
7.2.9
Ability to indicate administrative data elements in the record history such as source of cancellation.
The USMARC format accommodates the record history, including the institution issuing a cancellation, in field 685 $5.
7.3 Requirements resulting from incorporating UNIMARC/Authorities data
7.3.1 Parallel languages and scripts
One must have the ability to explicitly handle parallel languages and parallel scripts in a manner compatible with other UNIMARC formats. The UNIMARC/Authorities has an explicit linking technique and consideration should be given to using this technique.
7.3.2
Regardless of whether the UNIMARC linking technique or the USMARC linking technique is used, there must be a one-to-one correlation between the elements in USMARC and UNIMARC in order to properly ensure their translation.
8. Additional recommendations
8.1 Testing
Because one cannot always predict the success of any given format with any given classification scheme until records are converted, and especially because some of the elements in the classification format are designed especially for machine manipulation, it would be desirable to do some test conversion and retrieval of classification records. The USMARC format, augmented with the changes suggested above, could be used to do this testing. For the UDC, the testing would best be performed by someone well versed in the application of UDC. In addition, one could take data currently in the UDC MRF and recode it into the revised USMARC format.
For the "international" DDC, it would be desirable to experiment with number-building, and with coding different language editions, augmented editions, etc.
Latest Revision: October 31, 1996
Copyright © 1995-2000
International Federation of Library Associations and Institutions
www.ifla.org