
62nd IFLA General Conference - Conference Proceedings - August 25-31, 1996

Multilingual and Multiscript Issues in Cataloguing

Joan M. Aliprand
Senior Analyst
The Research Libraries Group, Inc.


PAPER

I would like to ask you about the language you write every day and the script it is written in.

Figure 1: Which script do you use?

Please raise your hand if the script you use regularly is on this list. Thank you. This forest of hands gives us some idea of what is needed for multiscript computing, because we all would like to be able to use our own mother tongue.

I have previously spoken at IFLA meetings about the Unicode Standard, the new multiscript character set for worldwide computing. This time, I will speak about multiscript access to materials, including, of course, serials. Note that "multiscript" implies "multilingual" in most cases, and some of my remarks may apply to languages which share a common script.

Since I started my career as a cataloguer, it is natural that I would think about cataloguing in the multiscript context, rather than procurement, or other aspects of library work. What I say about multiscript cataloguing should be applicable not only to serials but to the control of all types of material, whether physically housed in libraries or distributed around the world on the thousands of servers of the Net.

It is appropriate to consider serials, and not just because this program is sponsored by the Section on Serials. Serials, especially legal ones, are arguably the most complex class of bibliographic material. If you can handle serials well, you can handle practically anything.

We should also recognize that there are parallels between serials and other types of material. For example, large scale maps are comprised of multiple sheets, and individual sheets may be issued in revised editions. The cataloguer creates a single record for the map as a whole, and subsidiary "check-in" procedures are used to keep track of the sheets. This is similar to serials processing: cataloguing for the serial as a whole, with check-in, claiming, etc. for the issues. We can envision extension of this to digitized images for maps: a top level record describing the map as a whole, with links to the individual sheets from a summary index map (the cartographic equivalent of a table of contents).
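
The shared pattern can be sketched in present-day code; the structure and field names below are hypothetical, not drawn from any particular system or format.

```python
# A minimal sketch (hypothetical field names) of a top-level record with
# linked, check-in-tracked parts -- the same pattern serves serial issues
# and the sheets of a multi-sheet map.
from dataclasses import dataclass, field

@dataclass
class Part:
    designation: str         # e.g. "v. 12, no. 3" or "Sheet NK 56-4"
    edition: str = ""        # individual sheets may be issued in revisions
    received: bool = False   # check-in status

@dataclass
class Record:
    title: str               # describes the serial or the map as a whole
    parts: list = field(default_factory=list)

    def missing(self):
        """Parts published but not yet received -- candidates for claiming."""
        return [p.designation for p in self.parts if not p.received]

topo_map = Record("National topographic series 1:50,000")
topo_map.parts.append(Part("Sheet NK 56-4", edition="rev. 2", received=True))
topo_map.parts.append(Part("Sheet NK 56-5"))
print(topo_map.missing())   # ['Sheet NK 56-5']
```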

First of all, let me state my axiom or foundation: a language should be written in its proper script. For library service, transliteration or pronunciation-based transcription is a poor approximation to the original. (For the remainder of this paper, I will use "transliteration" imprecisely to mean both transliteration and transcription.) Transliteration is useful in other contexts: tourists here would be lost without pinyin phrase books.

My axiom leads to the requirement that the description of an item should be in the proper language and script. This is generally recognized in cataloguing, for example, in Rule 1.0E of AACR2 (Anglo-American Cataloguing Rules, second edition). But should this be extended to other access points? Do we need to provide name access in the language and script of the work, or only in the language and script of the catalogue user? And, as a slightly different question, do we need to provide complete catalogue access in multiple languages and scripts? But before I expand on these questions, I will say something about multiscript computing.

We are all at different stages of availability of computing resources. Availability varies between countries and also within countries. But computers -- and I include here adjuncts such as CD-ROMs -- are creeping into our lives in all sorts of places. And, whatever our current level of availability, we need to know about computing to be prepared for future opportunities.

Computer processing of text requires a way to represent individual letters, numbers, ideographs, punctuation marks, etc. unambiguously. This is done with a character set, which specifies a unique pattern of bits for each character. The most commonly used character set is probably International Standard ISO 8859-1 or Latin 1 (1), which includes letters from the alphabets of a number of Western European languages.

Figure 2: ISO/IEC 8859-1 (Latin 1)
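
To illustrate (in Python, purely as a present-day example), Latin 1 assigns each character a single, unique byte:

```python
# In ISO 8859-1 (Latin 1) every character is one byte; the accented
# Western European letters fall in the upper half of the 8-bit range.
text = "déjà vu"
encoded = text.encode("iso-8859-1")
for ch, byte in zip(text, encoded):
    print(f"{ch!r} -> 0x{byte:02X}")
# 'd' -> 0x64, 'é' -> 0xE9, 'à' -> 0xE0, ...
```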

Multiscript computing is possible when a single character set is large enough to include the characters of all scripts important to users, or when there is a technique to allow smaller, script-specific character sets to be combined.

A number of multibyte character sets with large character repertoires have been developed in Asia; for example, the two-byte Chinese standard GB 2312. (2) GB 2312 includes not only almost 7,000 ideographs, but also the ASCII (American Standard Code for Information Interchange) (3) character repertoire, the Greek and Russian alphabets, and Japanese kana.
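
A short Python illustration of the two-byte encoding (Python's gb2312 codec uses the set's common EUC-CN byte layout):

```python
# GB 2312 is a two-byte character set: each ideograph is a pair of bytes.
text = "中文"                      # "Chinese language", two ideographs
encoded = text.encode("gb2312")
print(encoded.hex(" "))            # d6 d0 ce c4 -- two bytes per ideograph
print(len(text), "characters ->", len(encoded), "bytes")
```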

An alternative technique for multiscript computing is to use multiple character sets with announcers. "Here begins Hebrew," "Here begins Cyrillic," "now back to Latin script," and so forth. The advantage of this technique is that any combination of scripts can be accommodated. However, it is cumbersome from a design standpoint, since the system always needs to be aware of the current script. A particular code, that is, bit pattern, can stand for entirely different characters -- it all depends on the announcer.

Figure 3: Different characters represented by the same code
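
The announcer technique is realized in the ISO 2022 family of encodings; here is a small Python illustration using ISO-2022-JP:

```python
# Escape sequences announce which character set the following bytes
# belong to; the same byte values mean different characters depending
# on the announcer currently in force.
text = "IFLA 日本 IFLA"
encoded = text.encode("iso2022_jp")
print(encoded)
# b'IFLA \x1b$BF|K\\\x1b(B IFLA'
# ESC $ B announces JIS X 0208; ESC ( B switches back to ASCII. The
# bytes between the announcers ('F|K\') stand for 日本, not for the
# ASCII characters F, |, K, and backslash.
```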

This technique is used for non-Roman scripts in the USMARC formats for bibliographic and authority data. RLIN, the Research Libraries Information Network, a system based on USMARC, supports CJK (Chinese, Japanese, and Korean), and the Cyrillic, Hebraic, and Arabic scripts. You can find Library of Congress records for East Asian, Arabic and Hebrew works in RLIN the day after cataloguing has been done at LC.

Different character sets are used in different parts of the world, which creates extra work when data is interchanged. For example, different multibyte character sets encode the same ideograph differently.

Figure 4: Different character encodings for the same ideograph
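
A small Python illustration of Figure 4's point:

```python
# One ideograph, three different byte sequences -- the interchange
# problem that arises when data moves between national character sets.
ideograph = "中"
for codec in ("gb2312", "big5", "shift_jis"):
    print(f"{codec:9} -> {ideograph.encode(codec).hex(' ')}")
# Each national standard assigns the same character different code values.
```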

The ideal is a single world-wide character set for multiscript computing which treats all scripts uniformly and equally, and has the simplicity and ease of use of ASCII.

Version 2.0 of the Unicode Standard (4) was released this year.

Figure 5: The Unicode Standard, Version 2.0 (cover)

The Unicode Standard is code-for-code identical to International Standard ISO/IEC 10646 (5) and its amendments. ISO/IEC 10646 was adopted by China as its national standard GB 13000. (6)

Figure 6: Chinese National Standard GB 13000 (cover)
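
A small Python illustration of what a single worldwide character set means in practice:

```python
# Under the single worldwide set, every character has one code point,
# identical in the Unicode Standard, ISO/IEC 10646, and GB 13000.
text = "Wien 北京 Москва"
for ch in text:
    if not ch.isspace():
        print(f"{ch} U+{ord(ch):04X}")
# Latin, Han, and Cyrillic characters coexist in one string, with no
# announcers and no ambiguity about what each code stands for.
```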

A frequent question is: what is in the Unicode Standard? Here is a general overview.

Figure 7: Allocation of the Unicode code space

Remember those scripts I asked about earlier? They are all in Version 2.0, chiefly in the General Scripts zone. A repertoire for the Ethiopic script was finalized this year. Important modern scripts that still need to be added are Mongolian, Khmer, Burmese, and Sinhala.
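
Within the General Scripts zone, each script occupies its own block of code points, so a character's script can be read off its code point. A rough sketch (the ranges below are the published block boundaries; the lookup function itself is just an illustration):

```python
# A few of the script blocks in the General Scripts zone.
BLOCKS = {
    "Greek":    (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Hebrew":   (0x0590, 0x05FF),
    "Arabic":   (0x0600, 0x06FF),
    "Thai":     (0x0E00, 0x0E7F),
}

def script_of(ch):
    code_point = ord(ch)
    for name, (low, high) in BLOCKS.items():
        if low <= code_point <= high:
            return name
    return "other"

print(script_of("Ж"))   # Cyrillic (U+0416)
print(script_of("א"))   # Hebrew   (U+05D0)
```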

One of the important concepts in the Unicode Standard is Han unification, which applies to the encoding of East Asian ideographs. We know that, historically, China has influenced the culture of nearby kingdoms and regions.

Figure 8: Japanese daruma

For example, the Japanese daruma doll portrays the Zen Buddhist sage Bodhidharma who is said to have lived in China more than fourteen hundred years ago.

The relationship of Chinese to other East Asian languages is similar to the relationship of Latin to languages of Europe. East Asian writing systems have many ideographs in common; there are also some distinctive local forms. The large number of ideographs with identical appearance and meaning (though with completely different pronunciation) led inevitably to the concept of Han unification -- to have a single coded set of ideographs for Chinese, Japanese, Korean and historical Vietnamese, rather than separate sets for each language with much duplication. There are rules to determine when an ideograph is unique, and when it is a typographic variant of a character which has already been encoded.
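
A small Python illustration of Han unification as encoded:

```python
# Han unification in practice: one code point per unified ideograph,
# shared by Chinese, Japanese, and Korean text; local typographic style
# is a matter of fonts, not of encoding.
import unicodedata

for ch in "漢汉":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+6F22  CJK UNIFIED IDEOGRAPH-6F22  (traditional form)
# U+6C49  CJK UNIFIED IDEOGRAPH-6C49  (simplified form)
# Structurally distinct forms like these are encoded separately; mere
# typographic variants of one form are unified under a single code point.
```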

The prospects for multiscript computing are improving. There are commercial products which use the Unicode Standard: the best known are probably Microsoft's Windows NT and Sun's Java programming language.

Work is underway to establish authoritative mappings between the characters of existing library standards and the Unicode Standard. This work is being carried out by the MARBI Committee of the American Library Association, and, in Europe, by CHASE, a committee within the CoBRA (Computerized Bibliographic Record Actions) Initiative.

Our need is for input and display devices which can handle all the scripts occurring in our own library's collection. We may also want display devices that can handle the scripts that occur in the catalogues of other libraries, and perhaps in reproductions of documents in digital collections.

On top of these automation issues, there are additional issues to do with cataloguing and catalogues in the multiscript context. "Cataloguing" means not just conventional bibliographic entries, but entries in serial indexing and abstracting tools, both printed and online, and entries that describe material out on the Net.

Cataloguing includes both description and provision of access points. Description covers these areas:

The description is transcribed in the language which appears on the source of information. For the description to be as accurate as possible, it should also use the same script as the source of information.

Access points are either controlled, established forms, or are uncontrolled free text. Access points may be classified according to language and script:

I will focus on locale-specific access points. By "locale," I mean the catalogue user's environment, including his or her preferred language and script. My own locale language is English.

Locale-specific access points are normally controlled, that is, formulated according to a source of authority, a set of written or unwritten rules. Except for language-neutral access points (such as the ISBN), the source of authority has an operative language. For example, AACR2 controls the non-subject access points in catalogues created for English speakers; English is its operative language.

When we are dealing with different languages, there may not be a 1:1 equivalence between particular terms. This may be due to the vocabulary of the languages, or be introduced by transliteration. Even within a common language, there are local variants. This is particularly true for subjects: for example, longshoreman, wharf labourer, and stevedore. In Australia, the sandwich spread made from peanuts is called peanut butter in some states, but peanut paste in others.

There are two reasons to provide access points in another language and/or script. One reason is when transliterated access points mandated by the source of authority provide inferior access. Access in the correct script of a language is not only more accurate, it is much easier for the user who reads that language.

The other reason is when the library serves a multilingual population. We might think of this situation as locales which each have a distinct user language, but which are physically at the same place. Nations that have multiple official languages include Canada, Singapore, and Israel. In many other countries there is a desire to provide better service to people who do not speak the predominant language.
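
One possible shape for such multilingual access, sketched as a hypothetical authority structure (the record layout, field names, and locale codes are illustrative assumptions, not from any particular system or format):

```python
# Locale-specific access via authority control: one authority record
# links equivalent controlled headings, so the catalogue can present
# whichever form matches the user's language.
authority_record = {
    "id": "subj-0042",
    "headings": {
        "en-US": "Longshoremen",
        "en-AU": "Wharf labourers",
        "fr-CA": "Débardeurs",
    },
}

def heading_for(record, locale, default="en-US"):
    """Return the controlled heading matching the user's locale, if any."""
    return record["headings"].get(locale, record["headings"][default])

print(heading_for(authority_record, "fr-CA"))  # Débardeurs
print(heading_for(authority_record, "de-DE"))  # falls back to Longshoremen
```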

The need for multilingual and multiscript access raises these questions: how much of the record needs to be in the other language or languages? Should the alternative access points be embedded in the bibliographic record? Or should access in the original language and script be provided by an associated authority control system? What about the user's "dialogue with the catalogue" -- operating instructions, help messages, and so forth?

How should multiscript search results be presented to users? Specifically, how should records in multiple scripts be arranged? Even with relevance ranking, as used with WAIS (wide-area information servers) retrieval, there may be a need for character-based ordering.

The American tradition for sorting has been to convert all scripts to a single script with a single order by the use of transliterated filing titles. Spalding (7) called this single order the universal catalogue -- "the catalog in which all items in the collection are entered in a single alphabet from A to Z, regardless of language, regardless of form, regardless of subject. The American ideal." Note that the concept of the universal catalogue is not limited to English, or even to Latin script.
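
The universal catalogue can be sketched in a few lines of Python; the transliteration table here is a tiny hypothetical stand-in for a real romanization scheme such as ALA-LC or pinyin:

```python
# Every heading is reduced to a single romanized filing key, yielding
# one A-to-Z sequence for all scripts.
TRANSLIT = {"М": "M", "о": "o", "с": "s", "к": "k", "в": "v", "а": "a"}

def filing_key(title):
    return "".join(TRANSLIT.get(ch, ch) for ch in title).casefold()

titles = ["Москва", "Moscow", "Madrid"]
print(sorted(titles, key=filing_key))
# ['Madrid', 'Moscow', 'Москва'] -- one alphabet, one filing sequence
```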

The alternative is to define an order for the scripts, as suggested by Wellisch (8), with a specific order for the characters of each script. For alphabetic and syllabic scripts, the order may be that of a particular alphabet or syllabary, or an artificial integrated order for the whole script. For ideographs, various orders are possible, based on the structure of the ideograph -- radical-stroke or total stroke count are typical -- or on pronunciation -- for example, kana equivalents of kanji. But what should the order of arrangement be when the "filing values" are identical, but the source ideographs are different?
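
This script-ordered arrangement can likewise be sketched; the script ranks and code-point ranges below are illustrative choices, not part of Wellisch's proposal:

```python
# Rank the scripts, then order characters within each script: titles
# sort first by the script of each character, then by the character.
def script_rank(ch):
    code_point = ord(ch)
    if code_point <= 0x024F:                 # Latin and its extensions
        return 0
    if 0x0400 <= code_point <= 0x04FF:       # Cyrillic
        return 1
    if 0x0590 <= code_point <= 0x05FF:       # Hebrew
        return 2
    return 3                                 # everything else files last

def arrangement_key(title):
    return [(script_rank(ch), ch) for ch in title]

titles = ["ספרות", "Moscow", "Москва", "Madrid"]
print(sorted(titles, key=arrangement_key))
# ['Madrid', 'Moscow', 'Москва', 'ספרות'] -- Latin, then Cyrillic, then Hebrew
```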

These are exciting times for librarians who must deal with multiscript materials. Computers provide more flexibility in the indexing of bibliographic records, and in the presentation of search results. This will continue to be true as multiscript capabilities become more widespread due to the use of the Unicode Standard in global software. There are many questions that will need to be answered as we incorporate these developments into our libraries. Thank you for your attention.

Copyright © 1996 by Joan M. Aliprand

CJK is a registered trademark and service mark of The Research Libraries Group, Inc. Unicode is a trademark of Unicode, Inc. All other product names are trademarks of their respective companies.

References

  1. International Organization for Standardization. Information Processing -- 8-Bit Single-Byte Coded Graphic Character Sets -- Part 1: Latin Alphabet No. 1. Geneva, 1987. (ISO/IEC 8859-1:1987)

  2. Code of Chinese Graphic Character Set for Information Interchange, Primary Set. Beijing, Jishu Biaozhun Chubanshe (Technical Standards Press), 1981. (GB 2312 - 1980)

  3. American National Standards Institute. American National Standard Code for Information Interchange. New York, 1977. (ANSI X3.4-1977).

  4. The Unicode Consortium, The Unicode Standard: Worldwide Character Encoding, Version 2.0. Reading, MA: Addison-Wesley, 1996. (ISBN 0-201-48345-9)

  5. International Organization for Standardization. Information Technology -- Universal Multiple-Octet Coded Character Set (UCS), Part 1: Architecture and Basic Multilingual Plane, Geneva, 1993. (ISO/IEC 10646-1:1993).

  6. Information Technology -- Universal Multiple-Octet Coded Character Set (UCS), Part 1: Architecture and Basic Multilingual Plane, Beijing, 1993. (GB 13000.1-93)

  7. C. Sumner Spalding, "Romanization Reexamined." Library Resources & Technical Services, 21(1):3-12 (Winter 1977).

  8. Correspondence. Library Resources & Technical Services, 21(3):303-5 (Summer 1977).

  9. Hans H. Wellisch, "The Arrangement of Entries in Non-Roman Scripts in Multiscript Catalogs and Bibliographies," International Forum for Information Documentation, 3(3):18-24 (1978).