"Digital Libraries: The Grand Challenges" by John Garrett EDUCOM Review, July/August 1993, Volume 28, Number 4. Star Trek: The Next Generation," a weekly television series, is the horoscope of many in the information community, presenting a vision of one possible computer-rich, informationintensive future. Understanding that vision, and what goes wrong with it, may help us assess how to meet the grand challenges that face us in building a national, networked, distributed, on-line system of linked digital libraries. Much goes right on "Star Trek: The Next Generation." Computing power is embedded throughout the starship Enterprise, so that all crew, all guests, and the ship's computer are instantly accessible via ubiquitous personal communicators. The replicator creates objects on demand, even re-forming them to satisfy verbal instructions. The holodeck generates powerful computer simulations, allowing crew members to relax by way of encounters with either Dr. Moriarty or outlaws in the Old West. And the highest achievement, Lieutenant Commander Data, is a humanoid android with enough computing power to run the starship alone and enough sensibility to look and seem (but not feel) human. SO WHAT GOES WRONG? A variety of things can and do miss the mark. The ship's computer (and Lieutenant Commander Data) occasionally have trouble knowing what information the human crew needs; Data tends to respond to questions with everything he knows about the topic, which can try even the endless patience of Captain Picard. The computer has difficulty replicating objects, due to the imperfect interface between humans', and computers', visual and verbal understanding. The computer is sometimes taken over by alien forces or misled into imagined emergencies. Finally, neither the ship's computer nor Data can easily grasp the deeper aspects of humanness: in one powerful episode, as the Enterprise approaches apparent destruction, Data asks the captain, "What is death?" 
Here, and on the starship Enterprise, data alone (including the Lieutenant Commander) are not enough. Our "Grand Challenges" require generating structures for building information and knowledge, not just data.

There is an evolving shared vision of the new information world. It is a world of ubiquitous, reasonably priced digital information in any and all media, available to everyone from a computer, television, palm, or wrist, as predictable, ordinary, and universal as a toaster. "In addition to traditional text-based information, data accessible through the digital library system will include non-text information (photographs, drawings, illustrations, works of art); streams of numeric data (satellite information, cosmological data); digitized sound and moving visual images; multi-dimensional representations of forms or data (e.g., holograms); and the capacity to integrate these data into new representations drawn from many different sources." [1]

Most important, the digital library system would be ever-changing, dynamic, and interactive, allowing users to engage together in collaborative investigations; empowering discussions between readers and authors; enabling open, public, democratic debate; and allowing, in the words of Vice President Gore, ". . . a school child to come home after class and, instead of just playing Nintendo, to plug into a digital library that has color-moving graphics that respond interactively to that child's curiosity." [2] Unlike traditional libraries, digital libraries don't reside in a building: correctly designed, their information arrives as needed at the user's screen, like the ever-attendant waiter filling your water glass before you know it's empty.

That's the vision. What is it going to take to make it happen?

Building the National Information Infrastructure, and its constituent digital library system, is one of the pivotal challenges of this century. Many of the most difficult tasks are social and cultural, economic, and technical.
There are three problem areas in each of these domains. Each requires cooperation among many and varied groups; there are no obvious, easy solutions; and all must be addressed successfully in building a national digital library system.

TECHNICAL CHALLENGES

Finding out what you want to know: selecting and filtering information. Data are drowning us. For many, the problem is not so much an excess of irrelevant material as the difficulty of finding what we want to know in a sea of potentially interesting data, then turning it into useful information, perhaps even wisdom. Teaching a computer to navigate this sea for us is a daunting task: imagine the instructions required to rank your daily electronic (or snail) mail in descending order of interest and importance.

At the Corporation for National Research Initiatives (CNRI), we have developed Knowbot programs. These intelligent software agents can carry user instructions to many distributed digital libraries, as well as collect and filter data for relevance and importance. Others are working on different models, including alerting systems, which notify users when new material matching their interests enters a digital library (a prototype for computer science is in place at Stanford), and simple interfaces to multiple databases, like MOSAIC from the University of Illinois. Still, we don't have an easy way to sort our e-mail or even to eliminate messages from listbores and unwanted vendors! Without good systems for filtration and selection, digital libraries will be confined to finding facts, and our knowledge will remain in scattered paper files, dusty bookshelves, and the limited capacity of our overtaxed wetware. There must be a better way!

Growing pains: scaling distributed libraries. We know how to build small-scale, Boolean-searchable, single-site digital libraries of a few hundred megabytes of data and how to carry out restricted searches of larger databases.
Building large (terabyte and larger) digital libraries that permit natural language searching of diverse documents is more difficult, although WESTLAW's WIN system, among others, shows progress. WIN permits lawyers and legal researchers to search its substantial legal database by entering normal English language questions, and it seems to do a good job of selecting and ranking relevant materials. And--equally important--WESTLAW has built an attractive, friendly, consistent user interface. But the highly defined and limited legal vocabulary made WESTLAW's task easier: try, say, psychology.

The real Grand Challenge consists of building and scaling large, distributed digital library networks, including thousands of libraries, each containing terabytes of data. But users don't want to have to know where the data they're looking for reside, or how their query gets there: they want it right, they want it fast, they want it seamlessly, they want it now. The technical challenges are many and varied. Here are a few:

* How would these libraries talk to each other?
* How would a user (or a user's agent) know where to look for what?
* How could search results be sorted, selected, merged, and transmitted back to users?
* How would large, distributed systems learn and evolve?
* How could data integrity and security be ensured?
* How about attribution, authorization, and payment?
* How could misuse be detected and corrected?
* How could failures be contained and fixed?

There is important research under way on these issues; the problems are complex and interwoven, and much hard work remains.

In the area of rights and royalties management, U.S. scientific, technical, and cultural innovation derives in large measure from the constitutional guarantee that creators own their creation. Effectively implementing that guarantee in distributed digital libraries is a major technical (as well as a social, cultural, and economic) challenge.
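
As a rough illustration of what implementing that guarantee might involve, the following Python sketch links a request to an authorization, accrues a royalty for the rights holder, and checks that a delivered copy has not been altered. Everything in it is an assumption made for illustration: the names `UseTerms`, `RightsLedger`, `authorize`, and `is_unaltered`, the flat per-use fee, and the in-memory storage are hypothetical and do not describe CNRI's or anyone else's actual system.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class UseTerms:
    """Hypothetical record of what one rights holder permits for one work."""
    rights_holder: str
    permitted_uses: frozenset  # e.g. frozenset({"display", "print"})
    fee_cents: int             # assumed flat fee charged per authorized use


@dataclass
class RightsLedger:
    """Toy in-memory ledger; an illustration only, not a real design."""
    terms: dict = field(default_factory=dict)           # work_id -> UseTerms
    digests: dict = field(default_factory=dict)         # work_id -> SHA-256 hex digest
    royalties_owed: dict = field(default_factory=dict)  # rights holder -> cents

    def register(self, work_id: str, content: bytes, terms: UseTerms) -> None:
        # Store the terms and a digest of the work as deposited.
        self.terms[work_id] = terms
        self.digests[work_id] = hashlib.sha256(content).hexdigest()

    def authorize(self, work_id: str, use: str) -> bool:
        # Refuse unknown works and uses the rights holder has not permitted.
        terms = self.terms.get(work_id)
        if terms is None or use not in terms.permitted_uses:
            return False
        # Accrue the royalty for this authorized use.
        owed = self.royalties_owed.get(terms.rights_holder, 0)
        self.royalties_owed[terms.rights_holder] = owed + terms.fee_cents
        return True

    def is_unaltered(self, work_id: str, content: bytes) -> bool:
        # Compare the delivered copy against the digest taken at registration.
        return self.digests.get(work_id) == hashlib.sha256(content).hexdigest()


ledger = RightsLedger()
ledger.register(
    "article-1",
    b"original text",
    UseTerms("Example Press", frozenset({"display"}), 25),
)
allowed = ledger.authorize("article-1", "display")     # a permitted use
refused = ledger.authorize("article-1", "derivative")  # not permitted
```

A working system would also have to handle confidentiality, record keeping across many independent libraries, and billing, none of which this sketch attempts.
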
Digital libraries will require more authorizations and more payments from more users to more rights holders than has ever been contemplated in a nondigital world. Rights and royalties management will need to link and authorize access to information in many different forms, created and distributed under diverse rights-owning systems (e.g., print, film, sound, photographs). Copyright owners will require assurances that users will not be allowed to create derivative works without permission or to disseminate the information beyond the bounds of the authorization. And owners and users will need evidence that the information sought, and subsequently provided, has not been accidentally or maliciously altered. Users will be unwilling or unable, in many cases, to define precisely all the uses to be made of a document; indeed, the power of digital representations derives in part from their flexibility and potential for unanticipated, new, productive uses. No comprehensive digital library can succeed unless these conditions are satisfied. Finally, our understanding of copyright will need to expand and change to encompass these new ideas: digital displays are not copies, but performances of the protected work. [3]

The digital library system must include a copyright management system that:

* provides for confidential, automated rights and royalty exchange;
* assures owners and users that information is protected from unauthorized, accidental, or intentional misattribution, alteration, or misuse;
* ensures rapid, seamless, efficient linking of requests to authorizations for information use; and
* encompasses effective billing and accounting mechanisms.

These efforts require active collaboration of many key stakeholders: CNRI, together with the Coalition for Networked Information and others, is building the necessary consensus and working on the technical aspects of the problem. A number of prototype systems for managing copyright in digital libraries have been proposed.
CNRI, for example, is implementing several test bed projects and working with the Copyright Office of the Library of Congress to build a system for electronic registration and deposit of digital works.

ECONOMIC CHALLENGES: WHAT WILL IT COST? A LOT.

Much more research is needed on each of the following topics. We don't even know how to ask the right questions yet. For illustration purposes, let's divide the underlying cost structure into three components:

* communications infrastructure: $100-$350 billion
* computer infrastructure: ??
* information: ??

Estimates of the cost of a national fiber-optic network depend on whether the fiber is deployed to the "last mile" for local access. It's impossible now to calculate the cost of the new computer hardware and software needed to take advantage of the new capabilities because much of it doesn't exist yet. No one has yet effectively estimated what all the information stored in a comprehensive digital library system would be worth.

There are a number of proposals detailing who should pay for digital library development. These range from providing regulatory incentives for the regional telephone operating companies and cable operators to accelerate their investment in the infrastructure, to federal support, to encouraging investment by local gas and electric utilities. Given the amounts involved, some combination of public and diverse private investment seems likely. But this may require rethinking the locus and content of regulations, as well as traditional antitrust and other barriers against intraindustrial and interindustrial cooperation.

Invoicing, payment, and authorized access. At least half of the traffic on long-distance phone lines currently involves transmitting billing data, not messages.
Figuring out who owes what to whom in a distributed digital library environment will be far more difficult because of the large numbers of information owners and providers and the much greater quantity of information and possible uses. Some preliminary work is under way at Carnegie Mellon and other research centers, but even the issues have not yet been fully defined, much less the solutions. Under Professor Marvin Sirbu, Carnegie Mellon graduate students have designed and prototyped a model Internet-based billing server. But implementing it over thousands of networks to millions of users is another matter.

Invoicing and payment are closely linked to building simple yet powerful systems for identifying who is authorized to access what data and for finding ways to ensure that intellectual property owners are compensated for all royalty-bearing uses of their information. More research and experimentation are needed on building economic models for costing and pricing digital information (there are few decent models for print, let alone digital works) and on testing various pricing and collection models. Because computers can count everything, some have proposed transaction-based pricing (charging for each use), but under transactional pricing the total a user will owe is inherently unpredictable. Fixed-price licensing is predictable, but it can't account for the intense fluctuations in use, which, so far, have defined life on the Internet. Combinations and other alternatives may be more desirable, but they have not been tested in the marketplace.

SOCIAL AND CULTURAL CHALLENGES

The 500-pound gorilla. Since January 20, interest in information infrastructure has heated up, along with proposals for related federal programs. The National Information Infrastructure (NII) is seen, variously, as a cure for the nation's educational woes, as an engine for enhanced international competitiveness, as a vehicle for retraining displaced workers, or as a source of new jobs and wealth.
But as one influential congressman stated recently, "The good news is, the White House and Capitol Hill are united in our commitment to the NII. The bad news is, none of us has any idea what the hell it is."

Vice President Gore has stated on several occasions that the federal government needs to provide leadership in information infrastructure planning and deployment. Everyone agrees. But where does leadership end and control begin? There are legitimate concerns about the risks involved in federal management of the infrastructure. These are exacerbated, for instance, by FBI-sponsored legislation to force telecommunications suppliers to provide wiretappable fiber (an oxymoron). By some estimates, this would increase the cost of installing fiber by as much as 50 percent, as well as raise significant public policy questions regarding the necessity and appropriateness of a ubiquitous capacity for wiretapping. If you invite a 500-pound gorilla to your 4-year-old's birthday party, there is always the possibility that it will behave like a model guest and go home when asked. But what if it doesn't?

Haves and have-nots. Information costs money: better information costs more. And because time, in these times, is the principal short commodity, systems that save time and also produce valuable information cost the most. There are places to go--such as public libraries and public or advertisement-supported television--if you seek information but don't want to (or can't) pay for it. Indeed, ensuring free public access to information has been a major unifying purpose of the library community. How can free or inexpensive access to information be ensured in a digital library universe? Can it be ensured at all? Even if information can be provided at little or no cost, how will poorer people gain access to the telecommunications and computing infrastructures so that they can find what's there?
One of the most attractive features of an NII is its potential to reduce hierarchical distinctions among creators and users of information. But if everyone can't access the system, how will that potential be realized?

Not long ago, a friend bought a new software program but couldn't get it to work. He sent an e-mail plea to a bulletin board and an expert answered. There were four or five messages back and forth before the problem was solved; finally, the expert asked my friend who he was. My friend responded that he was 46 years old and a professor at X University. "Who are you?" he asked in return. The expert responded, "I'm 12, and I'm in the seventh grade."

WE'LL HANG TOGETHER, OR WE'LL HANG SEPARATELY

In a speech some years ago to a group of librarians, I noted that librarians and journal publishers had one thing in common--each regarded the other as a bloodsucking leech. Things are better, but developing a ubiquitous digital library system will require both new levels of cooperation among the key stakeholders and, more important, new levels of trust.

Critical stakeholders in digital libraries include not only publishers and librarians (as if that wouldn't be tough enough) but also academic and other creators and authors; information providers such as West, DIALOG, and Mead Data Central; representatives of major print and nonprint information communities, such as books and journals, newspapers, film, TV, software, and audio; federal, state, and local governments; colleges, universities, and research centers; major corporations in and around telecommunications and computing; major information user organizations in the public and private sectors; and individuals, corporations, and governments around the world.

In "Star Trek," Capt. Jean-Luc Picard says, "Make it so!" and it is so.
In our world, it will take many iterations, mistakes and corrections, plans, pilots, and test beds to address any one of the Grand Challenges outlined here, so don't expect to address all of them at once, or soon. CNRI and others are building collaborative structures for the mutual exploration and risk-taking that must occur. It's going to get hot inside the tent as we work together to address these exhilarating challenges and build the working prototypes of a comprehensive digital library system. But the complex maelstrom of ideas and interests inside will result, ultimately, in the information infrastructure we all want and need. It won't be quick or easy or cheap, but it can be done--if we can take the heat, trust each other, and keep working on it. Together.

ENDNOTES

1. Joseph S. Alen and John R. Garrett, Toward a Copyright Management System for Digital Libraries. Salem, Mass.: Copyright Clearance Center, 1991, p. 5.
2. Internet Electronic Mail Press Release, White House, Office of the Press Secretary, Remarks by the President and Vice President to Silicon Graphics Employees, February 22, 1993, at Silicon Graphics, Mountain View, California.
3. See the May/June issue of EDUCOM Review for further discussion of these issues.