66th IFLA Council and General Conference
Jerusalem, Israel, 13-18 August
Code Number: 174-157-E
Division Number: VI
Professional Group: Preservation and Conservation
Joint Meeting with: -
Meeting Number: 157
Simultaneous Interpretation: No
PANDORA - Towards a National Collection of Selected Australian Online Publications
Coordination Support Branch, National Library of Australia
Background to PANDORA
National and other deposit libraries around the world have a prime responsibility for the collection, description, preservation and provision of long term access to their national imprint. The phenomenon of Internet publishing has presented a huge challenge to these libraries in finding a means to address the collection and management of online publications. It is not sufficient merely to point to an online publication somewhere else. In order to ensure long term access, the responsible library must have custody of it.
The National Library of Australia aims, as part of its statutory obligations, to retain in perpetuity a copy of all material published in printed form in Australia in order to ensure that Australians will have access to the accumulated knowledge, activities and achievements of its citizens in all forms of human endeavour. In line with this objective, the Library is also committed to preserving online publications of lasting cultural value for access by the Australian community now and in the future. With this in mind, the National Library's PANDORA Project commenced in 1996 to build an archive of selected online publications.
What to Collect?
At the outset of the PANDORA Project, we felt it was important to develop selection guidelines and a set of business principles1 to help us define what we were trying to achieve, and to put boundaries around the task so that we wouldn't be swamped by the enormity of what lay before us. In order to develop and test the selection guidelines and business principles we decided to allocate discrete resources to the project. We established a unit of five staff, called the Electronic Unit. The staff were drawn mainly from the Legal Deposit Serials Section. We felt that we needed people who had experience in collection development and acquisition work, in managing complex publications and problem solving, and who had experience liaising with publishers through the ISSN office.
1 National Library of Australia. (1999). A business process model for PANDORA. [On-line]. Available: http://pandora.nla.gov.au/pandora/bpm.html
Establishing the Electronic Unit at an early stage in the project allowed us to work at two levels, the conceptual and the practical. Each informed the other and meant that we were able to test concepts through practical application and to modify or expand our ideas accordingly. It also taught us that our business principles needed to be flexible. We realised that these principles would undergo evolutionary change just as digital publishing would also evolve as technology developed and provided more sophisticated mechanisms for publishers to use. Staff in the Electronic Unit are responsible for managing all aspects of online publications. These responsibilities include:
- liaison with publishers/creators;
- determining a capture schedule (frequency of capture);
- quality control and problem solving, including fairly high level technical issues;
- creation of a title entry page; and
- cataloguing onto our National Bibliographic Database.
The current business principles are outlined below:
1 Select titles which are available in online format only ('born digital')
This principle is based on the need to limit the number of publications eligible for inclusion in the archive. As printed Australian publications are already collected comprehensively between the national and state libraries, it seemed sensible to concentrate resources on collecting and managing those publications which exist only in digital format. The rejection of titles which are published in both formats has to date eliminated about seventy percent of titles that could otherwise fall within the PANDORA selection guidelines.
2 Selectivity rather than comprehensiveness
Within the parameters of the 'born digital' principle a rigorous selection process is applied to titles under consideration for inclusion in the archive. The main reason for this is that it is complex, time consuming and expensive to collect and archive digital publications and the Library has chosen at this stage to concentrate resources on those publications considered to have current and future research value. Our selective approach enables us to:
- Inject quality control over the capture process. By being selective we can ensure that the titles we archive are captured successfully, with all files, software plug-ins and other features in working order.
- Seek permission from the creator/publisher to enable us to make the archived titles available to all Australians.
- Most importantly, if we select a title for PANDORA we are making a commitment to preserve that title for future use, and that is a commitment of current and future resources. This commitment to preservation means that it is not enough merely to capture a title and place it in a stable environment in an archive. The Library has also made a commitment to the following:
  - Developing a persistent naming convention for digital resources so that they remain visible and able to be found;
  - Establishing mechanisms for recording the metadata we will need so we can manage future preservation. This preservation metadata may be used to store technical information that supports preservation decisions and actions, such as file types; to document preservation action taken, such as migration or emulation; to record the effects of preservation strategies; to ensure the authenticity of the item over time; and to note information about collection management and the management of rights;
  - Identifying formats that we should be able to migrate easily and formats that will cause problems;
  - Setting up registers of existing emulation software that we may need to use; and
  - Identifying when we need to take action so that we don't lose access to our online titles.
We believe that these actions will assist us in building pathways from current archiving to future preservation. The selective approach currently used for PANDORA does not rule out the importance of capturing a picture of the whole Australian domain, and in future we might do this periodically. However, the present lack of legal deposit provisions for online publications means that this approach presents copyright difficulties which prevent the Library from providing access to such comprehensive captures.
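To make the preservation metadata commitments above more concrete, the following is a minimal sketch of what one record for an archived title might hold. It is not the Library's actual scheme: the class names, field names, identifier and file type values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Set


@dataclass
class PreservationAction:
    """One preservation action taken on an archived title (hypothetical)."""
    action: str          # e.g. "migration" or "emulation"
    performed_on: date
    effect: str          # free-text note on the observed effect


@dataclass
class PreservationRecord:
    """Illustrative preservation metadata for one archived title."""
    persistent_id: str                 # assumed identifier format
    file_types: List[str]              # technical information about the files
    rights_note: str = ""              # rights management information
    actions: List[PreservationAction] = field(default_factory=list)

    def record_action(self, action: str, performed_on: date, effect: str) -> None:
        """Document a preservation action and its effect."""
        self.actions.append(PreservationAction(action, performed_on, effect))

    def needs_attention(self, risky_types: Set[str]) -> List[str]:
        """Return the file types in this record that are flagged as risky."""
        return sorted(t for t in self.file_types if t in risky_types)


# Example values are invented for this sketch.
record = PreservationRecord("example-id-0001", ["text/html", "audio/x-pn-realaudio"])
record.record_action("migration", date(2000, 8, 1), "markup updated, layout unchanged")
print(record.needs_attention({"audio/x-pn-realaudio"}))
```

Keeping the actions as a dated list, rather than overwriting a single status field, is one way to retain the history needed to judge the effects of a preservation strategy over time.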
3 Create records for the National Bibliography
The decision to catalogue online publications onto the (Australian) National Bibliographic Database is an acknowledgment that as a National Library we have a responsibility to record digital publications as part of the national imprint. Creating catalogue records also facilitates integrated access to Australian publications, regardless of format, and we believe this is important for transparent user access. (Note: The NBD is a national union catalogue of the holdings of Australian libraries and provides the framework for inter-library loan among those libraries).
4 Retain the 'look and feel'
It was agreed at a very early stage that it is important to strive to capture not just the content but the look and feel of digital publications. Many digital publications contain software plug-ins or other features that are an integral part of the publications. This is one of the most challenging aspects of our work. A good example of the sort of challenge that we face is online publications structured as dynamic databases which are entirely reliant on software to enable the publication to be used on the fly.
At this stage I should say something about our definition of a publication on the Internet. After a lot of debate, we decided that any information available on the Internet was a publication. At first we had tried to look at categories of information: formats like e-journals that mimic print publishing, academic papers and so on. But then we found that we wanted to collect more broadly than traditional formats, in particular websites and some listserv discussions. Are those really publications? Our position is that if we think it is important, we will collect it. There are of course grey areas, particularly relating to government records, where we do not want to duplicate the work of the Australian National Archives and government departments.
The PANDORA Selection Guidelines2 is a living document that is subject to regular review. The key elements of the guidelines are:
- To be selected for the Archive, a digital publication should be about Australia or Australians or written by an Australian and be on a subject of significance and relevance to Australia. Australian authorship does not however guarantee automatic selection as it does with print publications.
- Within these parameters five categories of publications are sought in particular:
(1) Publications emanating from the academic sector and in particular e-journals or works that have been subject to a peer review process;
(2) Publications or sites which reflect views on topical issues such as euthanasia, the gun debate, mandatory sentencing or sites which relate to a theme, such as the 2000 Olympic Games or the Centenary of Federation;
(3) Publications or sites which reflect the way in which Australians are using the Internet or which reflect aspects of Australian culture;
(4) Publications or sites maintained by community groups and associations; and
(5) Commonwealth and state government publications.
In addition to these categories the Library has decided to archive e-journals which are being indexed by indexing and abstracting services, regardless of whether the digital publication also appears in print. The Library became aware through its own APAIS (Australian Public Affairs Information Service) indexing service that there is a need to provide a persistent identifier (permanent location address) for articles cited by indexers. By undertaking to archive indexed publications in PANDORA, the Library can assign a persistent identifier (at this stage a PURL) which can be cited by the indexing service and which will provide a guarantee that a user will not encounter a broken link.
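The idea behind a persistent identifier such as a PURL is that the cited identifier never changes, while the location it resolves to can be updated when the archived copy moves. The sketch below illustrates that indirection only; the `Resolver` class, the identifier string and the example URLs are hypothetical and do not describe any actual PURL service or Library system.

```python
from typing import Dict, Optional


class Resolver:
    """Toy persistent-identifier resolver: stable ID -> current location."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, persistent_id: str, location: str) -> None:
        """Assign a persistent identifier to an archived item."""
        self._locations[persistent_id] = location

    def relocate(self, persistent_id: str, new_location: str) -> None:
        """Update where an existing identifier points; the ID itself is unchanged."""
        if persistent_id not in self._locations:
            raise KeyError(f"unknown identifier: {persistent_id}")
        self._locations[persistent_id] = new_location

    def resolve(self, persistent_id: str) -> Optional[str]:
        """Return the current location, or None if the ID was never registered."""
        return self._locations.get(persistent_id)


# Invented identifier and locations, for illustration only.
resolver = Resolver()
resolver.register("example/0001", "http://archive.example/old/path/article.html")
resolver.relocate("example/0001", "http://archive.example/new/path/article.html")
print(resolver.resolve("example/0001"))
```

An indexing service that cites `example/0001` is unaffected by the move, because only the resolver's table changes.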
Providing Access - Legal Deposit and Copyright Issues
Legal deposit as described in the current (Commonwealth) Copyright Act 1968 does not extend to digital publications. In the absence of this, we contact publishers individually for permission to copy their publications and to store them in the PANDORA Archive. Where titles are available free of charge, publishers have in most cases been happy to allow us to provide immediate access to the versions in the PANDORA Archive. We include options to link to publishers' sites both from the catalogue record for each item and again from the Archive.
2 National Library of Australia. (1999). Guidelines for the selection of online Australian publications intended for preservation by the National Library of Australia. [On-line]. Available: http://www.nla.gov.au/scoap/guidelines.html
The Copyright Act is under revision and the Copyright Law Review Committee (CLRC) has proposed that legal deposit be extended to cover publications in all formats, including digital. In anticipation of this change, the Library is participating in a series of discussions with the Australian Publishers Association (APA) in regard to the provision of access to commercial digital publications received on legal deposit.
The APA is chiefly concerned with protecting the interests of its commercial members by limiting access to legal deposit publications as far as possible, while the Library is chiefly concerned with providing the Australian public with as much access as possible to these publications. Nevertheless, there is broad agreement between the two organisations that a set of guidelines governing use of this material, which satisfies the needs of both groups, can be developed.
Based on its experience to date with the cost and complexity of collecting, archiving and preserving digital publications, the Library believes that commercial publications which are selected for the Archive should be available gratis for national access at an agreed point in time when the commercial viability of the publication has diminished. Prior to the provision of national access, publications received on legal deposit will be restricted to on-site access only. However, this model is still largely untested because, to date, there has been very little commercial publishing of titles which appear in digital format only. Australian commercial publishers have not yet moved beyond the dual print/electronic model of publication.
Given the complexities and resource intensive nature of archiving and preserving digital publications, the Library would be more selective in acquiring commercial publications in this format than it has been with printed publications. Selection would be against established guidelines but would also occur in the context of national co-operative arrangements with other collecting organisations.
National Collection of Online Australian Publications
From the inception of the PANDORA project in 1996, the Library has envisaged that the collection, provision of access to and preservation of Australian digital publications would be a co-operative activity involving the National Library and the other deposit institutions, such as the State Libraries and ScreenSound Australia (the national body responsible for collecting film and sound materials). Briefly, for those not familiar with the Australian library system, the National Library is the Commonwealth depository library, and each of the six Australian States has a State deposit library. The National Library already works in close co-operation with the State Libraries and ScreenSound Australia across a range of collection management issues, and the development of a (virtual) National Collection of Online Australian Publications in partnership with these organisations is a natural extension of the current relationships.
Essentially the National Library of Australia is requesting the State Libraries to take responsibility for archiving significant online publications relating to their own State, for example State government publications, and other categories to be developed in agreement with the National Library of Australia. Four of the six State Libraries are now archiving online publications, to varying degrees. Two have developed their own internal processes and two are using the processes and procedures set up by the National Library's Electronic Unit.
From the National Library's point of view it does not matter whether the deposit institutions develop their own archiving infrastructure or whether they join the PANDORA Archive as a contributing partner. What matters is that the depository institutions work within a collaborative framework to ensure that Australia's significant digital publications are collected and preserved for future access. With this goal in mind, the Library has developed a statement outlining the elements which the Library sees as the key to the successful development of a National Collection of Online Australian Publications. These elements are:
- A set of formal collecting agreements: Through a formal agreement each depository institution articulates as clearly as possible the areas in which it will take responsibility for collecting, archiving and preserving digital publications for current and future access;
- Endorsement of the principle that digital publications are part of the national bibliography: The Library is seeking agreement from the deposit institutions that they will catalogue titles selected for archiving and future preservation onto the National Bibliographic Database (NBD). It is envisaged that the NBD will be a key point of access to the virtual National Collection and that depository institutions will be able to flag their preservation intentions to each other via the NBD. The approach does not exclude other forms of access to digital publications, for example via a metadata repository. However, it endorses the importance of recognising digital publications as part of the national imprint.
- A commitment to future access through the development of long-term preservation strategies: Although the strategies for future preservation remain largely untested at this stage, it is crucial that deposit institutions contributing to the National Collection of Online Australian Publications commit to undertake the key steps necessary to preserve these publications for future access. This includes being prepared to record the preservation metadata that will provide the information on which to base future migration or emulation strategies.
- A commitment to negotiate arrangements with publishers which will ensure that publications which form part of the National Collection will, after an agreed period, be available gratis on a national level: The Library believes that the development, in cooperation with the peak Australian publishers' organisation, the Australian Publishers Association, of a common set of guidelines will assist in the achievement of this goal.
At the last count, there were 652 titles in the PANDORA Archive. 546 of these titles have individual title entry pages. The remainder are part of collective entries we have created for topics such as the Sydney Olympics, and broad subjects such as euthanasia. Approximately two-thirds of the titles have been archived on a one-off basis. This is either because the title is static, for example a completed report or project, or because the site has been archived as a "snapshot" example of how the Internet is being used by Australians.
The remaining one-third of items in PANDORA are being archived on a regular basis, ranging from weekly to annually. These items may be electronic journals, or sites of ongoing research interest that are being updated regularly.
On average, around 35 new titles are selected and archived each month. In addition, an average of around 30 titles which are being archived on a regular basis are "regathered" each month.
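The distinction between one-off captures and titles that are regathered on a schedule can be illustrated with a small sketch. This is not the Library's system: the frequency labels, the day counts assigned to them, and the sample titles and dates below are all assumptions made for the example.

```python
from datetime import date, timedelta

# Hypothetical frequency labels mapped to an interval in days.
INTERVALS = {"weekly": 7, "monthly": 30, "quarterly": 91, "annually": 365}


def next_capture(last_capture: date, frequency: str) -> date:
    """Date on which a regularly archived title is next due for gathering."""
    return last_capture + timedelta(days=INTERVALS[frequency])


def due_titles(titles, today: date):
    """Return names of titles due for regathering.

    `titles` is an iterable of (name, frequency, last_capture) tuples;
    titles archived on a one-off basis use a frequency of None.
    """
    due = []
    for name, frequency, last_capture in titles:
        if frequency is None:
            continue  # one-off capture: never regathered
        if next_capture(last_capture, frequency) <= today:
            due.append(name)
    return due


# Invented sample data.
sample = [
    ("Journal A", "weekly", date(2000, 8, 1)),
    ("Report B", None, date(2000, 1, 15)),
    ("Site C", "annually", date(2000, 3, 1)),
]
print(due_titles(sample, date(2000, 8, 10)))
```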
The data is stored in a Unix file system. Currently, the production area of the PANDORA archive is around 15GB, and the working file area (new gatherings being assessed and fixed prior to moving them into production) is around 7GB.
For particularly complex or large publications, the publisher transfers the files over the Internet or sends them on CD-ROM. However, for the majority of titles we use a robot to gather the desired files directly from the Internet. We are in the process of changing the software used for these gatherings to two main tools: HTTrack and Teleport Executive.
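One decision any gathering robot has to make is which links on a page belong to the title being archived and which lead elsewhere. The sketch below shows that scoping step only, using the Python standard library. It does not reproduce the behaviour of HTTrack or any other tool named above; the function name, the scoping rule (same host and under a root path) and the example URLs are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the raw href and src attribute values found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def links_in_scope(page_url, html, root_url):
    """Return absolute URLs on the same host and under the title's root path."""
    parser = LinkCollector()
    parser.feed(html)
    root = urlparse(root_url)
    in_scope = []
    for link in parser.links:
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, link))
        parts = urlparse(absolute)
        if parts.netloc == root.netloc and parts.path.startswith(root.path):
            if absolute not in in_scope:
                in_scope.append(absolute)
    return in_scope


# Invented page content and URLs.
page = "http://publisher.example/journal/issue1/index.html"
body = (
    '<a href="article1.html">Article</a>'
    '<img src="/journal/img/logo.gif">'
    '<a href="http://other.example/x.html">Elsewhere</a>'
    '<a href="article1.html#top">Article again</a>'
)
print(links_in_scope(page, body, "http://publisher.example/journal/"))
```

Including `src` attributes as well as `href` reflects the aim of capturing images and other embedded files along with the text, so that the archived copy keeps its look and feel.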
We are currently developing in-house at the Library a new gathering system. We are moving away from a rather clumsy system of an Access database containing management information combined with a customised communications interface for submission of requests and problem logging. Our new collection management system will be an integrated Web-based interface. It will provide for the entry and maintenance of management information describing the title and interactions with the publisher, as well as the assigning of persistent identifiers, initiation and monitoring of gathering requests, tracking and repairing of problems, and automatic generation or amendment of title entry pages.
I should also mention the National Library is committed to a number of other activities related to the preservation of online publications. Examples of this work include:
- Investigations into data migration strategies. As part of this planning process the Library has:
  - created a list of tags and attributes which are deprecated in HTML 4.0 and are used in HTML files in the PANDORA Archive; and
  - started work on an analysis of the file types within PANDORA to assess which may present critical migration difficulties for the future. For example, compressed delivery or access formats such as RealAudio are subject to constant change as improvements are made to them. They often require special browser plug-ins to use them. These formats are likely to be the most subject to change, and the most complex to migrate.
- Development of a preservation metadata scheme. An exposure draft Preservation Metadata for Digital Collections3 has been developed for comment. The Library is working in co-operation with others such as the United Kingdom CURL Exemplars in Digital Archives (CEDARS) project4 to develop an international standard in this area.
- Developing a persistent identifier scheme for use with National Library digital resources and promoting the use of persistent identifiers for Australian Web publications. The Library expects to implement a persistent identifier scheme for its own digital resources during the next 12 months. This is likely to be a URL based scheme that maintains persistence through use of an in-house resolver service. Use of the Library's resolver service may also be extended to other interested agencies within Australia.
- Producing a set of best practice guidelines for creating and archiving online publications. These guidelines will be trialed with academic, government, and commercial sectors. The Library believes that the widespread use of a set of best practice guidelines will encourage the creators of online publications to take an active role in ensuring that their information remains accessible for the future. This is a vital step if the National Collection of Online Australian Publications is to be realised.
3 National Library of Australia. (1999). Preservation metadata for digital collections: exposure draft. [On-line]. Available: http://www.nla.gov.au/preserve/pmeta.html
4 CEDARS Project. [On-line]. Available: http://www.leeds.ac.uk/cedars/
- International co-operation with other organisations interested in the issues of digital archiving and preservation. As part of this information sharing process, the National Library has developed and maintains the PADI (Preserving Access to Digital Information) Web site - a subject gateway to digital preservation resources from around the world.
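As an illustration of the migration planning work listed above, the sketch below counts occurrences of elements that the HTML 4.01 specification marks as deprecated in a piece of markup. It is a simplified stand-in for that kind of analysis, not the Library's own tool: it looks at element names only (not attributes), and the function name and sample markup are invented.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements marked as deprecated in the HTML 4.01 specification.
DEPRECATED_ELEMENTS = {
    "applet", "basefont", "center", "dir", "font",
    "isindex", "menu", "s", "strike", "u",
}


class DeprecatedTagCounter(HTMLParser):
    """Count start tags whose element name is in DEPRECATED_ELEMENTS."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag in DEPRECATED_ELEMENTS:
            self.counts[tag] += 1


def count_deprecated(html):
    """Return a dict mapping each deprecated element found to its count."""
    parser = DeprecatedTagCounter()
    parser.feed(html)
    return dict(parser.counts)


# Invented sample markup.
sample_html = (
    '<BODY><CENTER><FONT size="2">Hello</FONT></CENTER>'
    "<p>Some <u>underlined</u> text</p></BODY>"
)
print(count_deprecated(sample_html))
```

Run across every HTML file in an archive, counts of this kind would give a rough picture of how much markup might need attention in a future migration.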
In conclusion, the Australian approach is an attempt to respond to the need for National Libraries to actively collect and preserve new digital publications created in the Internet environment. Our approach is not perfect and is based on learning by doing. We do not see ourselves in conflict with the Swedish approach. In fact both libraries are in close contact and share information on developments on a regular basis.
The real issue for us all, especially libraries with legal deposit responsibilities, is to start to see the Internet as a "space" where valuable cultural and documentary heritage "information objects" are being created. We all need to start implementing the values and strategies for preservation we have developed for the print-based world, using the opportunities provided by new technologies. It is our responsibility.
I would like to acknowledge the input of Jasmine Cameron and Julie Whiting in the preparation of this paper.