Archiving and Preservation in Electronic Libraries
Gail Hodge
Information International Associates, Inc.
312 Walnut Place
Havertown,
Pennsylvania 19083 USA
The rapid growth in the creation and dissemination of
electronic information has emphasized the digital environment’s speed and ease
of dissemination with little regard for its long-term preservation and
access. To some extent, electronic
libraries, that is, those libraries that are moving toward provision of
materials in electronic form, have been swept up in this attitude as well. But
electronic information is fragile in ways that traditional paper-based
information is not. Electronic
information is more easily corrupted or altered, intentionally or
unintentionally, without the ability to recognize that the corruption has
occurred. Digital storage media have
unknown life spans. Some formats, such
as multimedia, are so closely linked to the software and hardware technologies
that they cannot be used outside these proprietary environments. Aggravating this situation is the fact that
the time between creation and preservation is shrinking, because technological
advances are occurring so quickly.
While there is a tradition of stewardship, best practices,
and stakeholder roles that has long been institutionalized in the print
environment, many of these traditions are inadequate, inappropriate or not well
known among the stakeholders in the digital environment. Creators of electronic resources are able to
bypass the traditional publication, dissemination and announcement processes
that have been part of the path from creation to archiving and preservation in
the print environment. Publishers and
librarians who traditionally managed this process must now look to computer
scientists to develop systems that support these activities. Digital libraries may be the responsibility
of computer scientists who do not necessarily bring skills in content
management, organization and preservation.
Best practices and policies are needed that satisfy both the
requirements of the digital environment and the economic interests of the
various stakeholder groups.
Electronic information is information that is born digital
or that has its primary version in digital form. Electronic information includes a variety of object types, such
as electronic journals, e-books, databases, data sets, reference works, and web
sites. These are the types of
information that electronic libraries are trying to manage and preserve.
The Open Archival Information System (OAIS) Reference Model provides a
framework for discussing the key areas that affect digital preservation:
the creation of the electronic information, the acquisition of and policies
surrounding the archiving of resources, preservation formats, preservation planning
including issues of migration versus emulation, and long-term access to the
archive’s contents.
Many projects, worldwide, have contributed to the growing collection of
best practices and standards. The
numerous stakeholder groups involved in preservation of electronic resources,
including creators (authors), publishers, librarians and archivists, and
third-party service providers, are working more closely to build a cohesive and
sustainable response to the issues. An
issue of continuing stakeholder interest is the economic model(s) that will
provide ongoing support to electronic preservation.
Despite the remaining issues, local institutions managing electronic
libraries can become involved. They are encouraged to monitor developments and
projects in the field, to raise awareness of the need for preservation within
their institutions, to consider preservation and long-term access issues when
negotiating licenses for electronic resources, and to look for opportunities to
begin small projects at the local level.
1.0 Background
1.1 Definition of Terms
Several terms
will be used throughout this lecture.
They are defined here. In some
cases, these definitions are for consistency within the presentation and are
not indicative of general consensus within the community.
Born digital – materials that are created in bits
and bytes rather than being digitized from paper or another analog medium
Digital archiving – storing the digital information for long-term preservation
Digital preservation – keeping the
bits and bytes safe and unaltered for a long period of time
Digitization – converting materials in non-digital
form (analog) such as paper, to digital form
Emulation – running old products by recreating
the environment of the old hardware and software without actually using the old
hardware and software
Long-term access – the ability to use a preserved
object long after its initial preservation
Migration – moving a digital product from one
version of a program, operating system or hardware environment to another over
time
Recapturing – copying the content from the original
resource again in order to ensure that changes made to the resources are
incorporated in the archival version
Refreshing – moving a digital object to a new
instance of the same media, retaining the same operating system and hardware
environment
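Refreshing and digital preservation, as defined above, both depend on being able to show that the bits are unchanged after a copy. The following is a minimal Python sketch of that idea. It assumes that a SHA-256 checksum is an acceptable fixity measure, which this lecture does not prescribe, and the function names `sha256_of` and `refresh` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, target: Path) -> str:
    """Copy a digital object to new media and verify the copy bit for bit.

    Raises ValueError if the checksums differ; returns the checksum otherwise.
    (Assumption: a checksum comparison is the chosen fixity test.)
    """
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise ValueError(f"fixity check failed for {target}")
    return after
```

Storing the returned checksum with the object would allow the same comparison to be repeated at every later refresh.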
1.2 Outline of Major Projects
I’ve selected
several major projects in digital archiving as examples. (For a more
complete list, I recommend the PADI (Preserving Access to Digital Information)
Web site from the National Library of Australia (NLA 2002).) I will briefly
describe these since they are used throughout the remainder of the lecture.
CAMiLEON
(Creative Archiving at Michigan and Leeds: Emulating the Old on the New), a
joint project of the University of Michigan and the University of Leeds, is
conducting analysis and testing to determine if emulation is a viable technical
strategy for preservation. (University of Michigan)
Cedars
(CURL Exemplars in Digital Archiving) is sponsored by the Joint Information
Systems Committee in the UK. It was
established to determine the feasibility of distributed digital archives. The first implementation included the three
institutions in the Consortium of University Research Libraries. In the last two years, Cedars has included
several other test sites. (Cedars)
ERPANET
(Electronic Resources Preservation and Access Network) is a new project funded
by the European Commission to provide a knowledge base and advice to all
sectors on issues of archiving and preservation of electronic resources.
(ERPANET)
EVA is a
project of the National Library of Finland at the University of Helsinki. It uses a series of automatic tools
including robots, harvesters, and metadata creation tools to support its goal
of capturing electronic network publications of Finland. (Lounamaa and
Salonharju 1999)
InterPARES
(International Research on Permanent Authentic Records in Electronic Systems)
is a global project among seven archiving institutions, including regional
consortia for Asia and Europe. The
project’s goals are to develop best practices related to the creation,
preservation and long-term access to authentic
electronic records. (InterPARES)
Kulturaw3 is a
project of the Royal Library of Sweden.
Its goal is to capture the cultural heritage that is being published via
the Internet. Unfortunately, this
project has been stopped due to the lack of deposit legislation for digital
materials in Sweden. (National Library of Sweden)
LOCKSS (Lots
of Copies Keep Stuff Safe) is a project of the Stanford University Library, its
publishing arm, HighWire Press, and several other libraries to develop a system
for redundant archives. Its major
contribution is an infrastructure for keeping redundant archives synchronized.
(LOCKSS)
JSTOR,
originally funded by the Andrew W. Mellon Foundation, is now a non-profit
organization that archives back issues of journals for publishers by digitizing
them. It is just beginning to deal with
current journal issues that are in electronic form. (JSTOR)
NEDLIB (the Networked European Deposit Library) was
funded by the European Union. It
included eight libraries and numerous publishers and other organizations. This project was completed early in
2001. The major output was a model for
incorporating archives into integrated library systems, work on metadata, early
adoption and testing of the Open Archival Information System Reference Model,
and testing of emulation strategies.
The output of this project is available in a series of reports. The major findings are being incorporated
into operational systems at the British Library and the Dutch National Library.
(NEDLIB)
OCLC Digital
Archive is a service of OCLC that grew out of its electronic
journals’ project. In this service OCLC
acts as a trusted third party archive receiving deposits of electronic journals
into its repository. It provides
several levels of access (continuous or just in case) and controls access
rights so that a library can access only the issues equating to the period for
which it had a license. (OCLC Digital Archive)
PANDORA
(Preserving and Accessing Networked DOcumentary Resources of Australia), a
project of the National Library of Australia, captures the Web-based cultural
heritage of Australia. It involves
capturing content, creating metadata, and making arrangements with rights
holders. A federated approach is
envisioned that includes the libraries in all the Australian states. (PANDORA)
OCLC/RLG
Preservation Metadata Working Group is a joint
project that also includes members from other major projects, including Cedars and
the Digital Preservation Coalition. The major effort at this time is on
establishing a standard element set for preservation metadata. (OCLC 2000,
2001)
2.0 A framework for archiving and preservation
It is valuable
to discuss archiving and preservation within a framework. The framework I’ve chosen is provided by a
reference model, which is being used extensively throughout the digital
preservation community. The Open
Archival Information System Reference Model (CCSDS 2001) provides high level
data and functional models and a consistent terminology for discussing
preservation. The reference model was
originally developed by the Consultative Committee for Space Data Systems to
support the archiving of data among the major space agencies. However, it has become the de facto standard
for the development of digital archives.
It is used by most major projects including those in Australia, the
United Kingdom, the Netherlands, and the United States. The OAIS Reference Model is a draft standard
of the International Organization for Standardization (ISO) and is expected to be formally
balloted in 2002.
In its simplest
form the OAIS looks like this (Fig. 1):
Fig. 1. Open Archival Information System
Source:
Consultative Committee for Space Data Systems (used with permission)
SIP – Submission Information Package (what is submitted or acquired from the producer)
AIP – Archival Information Package (the object that is archived)
DIP – Dissemination Information Package (the object that is distributed based on
access requests)
Descriptive Info
– metadata
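The flow in Fig. 1 can be sketched as three record types and two steps. This is an illustrative Python sketch only: the class names, field names, and the `ingest` and `disseminate` functions are assumptions made for this example and are not taken from the text of the OAIS Reference Model.

```python
from dataclasses import dataclass, field


@dataclass
class SIP:
    """What is submitted by, or acquired from, the producer."""
    producer: str
    content: bytes


@dataclass
class AIP:
    """The object that is archived, with its descriptive metadata."""
    identifier: str
    content: bytes
    descriptive_info: dict = field(default_factory=dict)


@dataclass
class DIP:
    """The object that is distributed in response to an access request."""
    identifier: str
    content: bytes


def ingest(sip: SIP, identifier: str) -> AIP:
    """Turn a submission into an archived object (hypothetical Ingest step)."""
    info = {"producer": sip.producer, "size_bytes": len(sip.content)}
    return AIP(identifier=identifier, content=sip.content, descriptive_info=info)


def disseminate(aip: AIP) -> DIP:
    """Build the object delivered to a consumer (hypothetical Access step)."""
    return DIP(identifier=aip.identifier, content=aip.content)
```

A real archive would record far more descriptive information at ingest; the two fields here only show where such metadata attaches.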
2.1 Production and creation of electronic information
Archiving begins
outside the purview of the archive with the producer or the creator of the
electronic resource. This is where long-term archiving and preservation must
begin. Information that is born digital
may be lost if the producer is unaware of the importance of archiving. Practices used when electronic information
is produced will impact the ease with which the information can be digitally
archived and preserved.
Several key
practices are emerging involving the producers of electronic information. First, the preservation and archiving
process is made more efficient when attention is paid to issues of consistency,
format, standardization and metadata description before the material is
considered for archiving. By limiting the format and layout of certain types of
resources, archiving is made easier. This is, of course, easier for a small
institution or a single company to enforce than for a national archive or
library. In the latter cases, they are
faced with a wide variety of formats that must be ingested, managed and
preserved.
In the case of more formally published materials, such as
electronic journals, efforts are underway to determine standards that will
facilitate archiving. The Andrew W.
Mellon Foundation has funded a study of the electronic journal mark-up
practices of several publishers. The
study concluded that a single SGML document type definition (DTD) or XML schema
can be developed to support the archiving of electronic journals from different
subject disciplines and from different publishers with some loss of special
features (Inera, Inc. 2002). Such
standardization is considered key to efficient archiving of electronic journals
by third-party vendors.
In the case of
less formally published material such as web sites, the creator may be involved
in assessing the long-term value of the information. In lieu of other assessment factors, the creator’s estimate of
the long-term value of the information may be a good indication of the value
that will be placed on it by members of its designated community or audience in
the future. The Preservation Office at the National Library of Medicine has
implemented a “permanence rating system” (Byrnes 2001). The rating is based on three factors:
integrity, persistent location, and constancy of content. These factors have been combined into a
scheme that can be applied to any electronic resource. At the present time, the ratings are being
applied to NLM’s internal Web sites, and guidelines have been developed to
assist creators in assigning the ratings to their sites. This information will be used to manage the
ongoing preservation activities and to alert users about a Web site’s long-term
stability.
Another aspect
of the creator’s involvement in preservation is the creation of metadata. The
best practice is for metadata to be created prior to incorporation into the
archive, i.e., at the producer stage.
However, most of the metadata continues to be created “by hand” and
after-the-fact. Unfortunately, metadata
creation is not sufficiently incorporated into the tools for the creation of
most objects to rely on the creation process alone. However, as standards groups and vendors move to incorporate XML
and other architectures into software products, such as word processors, the
creation of metadata should become easier and more automatic.
2.2 Ingest: Acquisition and collection development
We now move to the functions
performed by the archive itself. The
first is acquisition and collection development. This is the stage in which the created object is “incorporated”
physically or virtually into the archive.
In the terminology of the reference model, this is called “Ingest”.
There are two
main aspects to the acquisition of electronic information for archiving –
collection policies and gathering procedures.
2.2.1 Collection policies
Just as in the paper environment, there
is more material that could be archived than there are resources with which to
accomplish it. Guidelines are needed to tailor the collection policies to the
needs of a particular organization and to establish the boundaries in a
situation where the responsibility for archiving among the stakeholders is
still unregulated. The collection
policies answer questions such as what should be archived, what is the extent
of a digital object, should the links
that point from the object to be archived to other objects also be archived,
and how often should the content of an archived site be recaptured?
In the network environment where any
individual can be a publisher, the publishing process does not always provide
the screening and selection at the manuscript stage on which traditional
archiving policy has relied. Therefore,
libraries are left with a larger burden of selection responsibility to ensure
that publications of lasting cultural and research value are preserved (NLC 1998).
The scope of
NLA’s PANDORA (Preserving and Accessing Networked DOcumentary Resources of
Australia) Project is to preserve Australian Internet publishing. The NLA has formulated Guidelines for the Selection of Online Australian Publications
Intended for Preservation by the National Library of Australia (NLA). Scholarly publications of national
significance and those of current and long-term research value are archived
comprehensively. Other items are
archived on a selective basis “to provide a broad cultural snapshot of how
Australians are using the Internet to disseminate information, express
opinions, lobby, and publish their creative work.” The National Library of Canada has written similar guidelines
(NLC 1998). The broadest guidelines for Collection Management are provided in a
draft document from the Cedars Project (Weinberger 2000). The most comprehensive analysis of such
guidelines is in the Digital Preservation Handbook, which is based on the
combined lessons learned of all the major projects (Beagrie and Jones 2001).
Even the Internet Archive (Internet Archive), which
considers the capture of the entire contents of the Internet as its mandate,
has established limitations. The sites
selected do not include those that are “off-limits,” because they are behind
firewalls, require passwords to access, or are hidden within Web-accessible
databases, and those that require payment.
The major lesson
from the selection guidelines is the importance of creating such a document in
order to set the scope, develop a common understanding, and inform the users
now and in the future what they can expect from the archive.
2.2.1.2 Determining extent
Once the site
has been selected for inclusion, it is necessary to address the issue of
extent. What is the extent or the
boundary of a particular digital work, especially when selecting complex Web
sites? Is it a “home page” and all the
pages underneath it, or are the units to be archived (and cataloged) at a more
specific level?
The PANDORA
(NLA/PANDORA) project in Australia evaluates both the higher and lower site
pages to determine which pages form a cohesive unit for purposes of
preservation, cataloging, and long-term access. While preference is given to breaking down large sites into
components, the final decisions about extent depend upon which pages cluster
together to form a stand-alone unit that conveys valuable information. Each
individual component must meet PANDORA’s initial selection guidelines.
2.2.1.3 Archiving links
The extensive
use of links in electronic publications raises the question of whether these
links and their contents should be archived along with the original site. The
answer to this question by any particular project will depend on the purpose of
the archiving, the anticipated stability of the links, and the degree to which
they contribute to the overall information value of the site.
Most
organizations archive the URLs (Uniform Resource Locators) or other identifiers
for the links and not the content of the linked pages, citing problems with the
instability of links. Some projects
have established variants on this approach.
For example, PANDORA’s decision to archive the content of linked objects
is based on its selection guidelines; the content of the linked site is
captured only if it meets the same selection criteria as other sites. The National Library of Canada captures the
text of a linked object as long as it is on the same server as the object that
is being archived, because these intra-server links have proven to be more
stable than external links. The American Institute of Physics (AIP) points to
the content of a linked reference if it is an item in AIP’s archive of
publications or supplemental material.
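The same-server rule described above for the National Library of Canada can be sketched in a few lines. This is a hedged Python illustration, not the library's actual implementation: the function names are hypothetical, and comparing host names is only one plausible reading of "on the same server".

```python
from urllib.parse import urljoin, urlparse


def on_same_server(archived_url: str, link: str) -> bool:
    """True if `link`, resolved against the archived page, stays on its host.

    Assumption: "same server" is approximated by an identical host name.
    """
    resolved = urljoin(archived_url, link)
    return urlparse(resolved).hostname == urlparse(archived_url).hostname


def links_to_capture(archived_url: str, links: list[str]) -> list[str]:
    """Return absolute URLs of the links whose content would be captured."""
    return [
        urljoin(archived_url, link)
        for link in links
        if on_same_server(archived_url, link)
    ]
```

Under this sketch, links that fail the test would still have their URLs recorded, in line with the common practice of archiving the identifier but not the linked content.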
Elsevier, which
is currently involved in an archiving project with the Yale University Library
funded by the Andrew W. Mellon Foundation, cites a technology-related problem
as the main reason it does not archive links (Hunter 2002). Elsevier’s links
are created on the fly, so there is no URL or live page to capture. Similar
problems exist when trying to capture pages that are active server pages or
those that are created out of a database, portal system, or content management
system.
The American Astronomical Society (AAS) has perhaps the
most comprehensive approach to the archiving of links. The AAS maintains all links to documents and
supporting materials based on
collaboration among the various astronomical societies, researchers,
universities and government agencies involved in this specific domain. Each organization archives its own
publications, retaining all links and access to the full text of all other
links. Within this specific domain, the
contents of all linked objects are archived.
In the future, similar levels of cooperation may be achieved in other
subject domains or by publisher collaborations such as CrossRef.
2.2.1.4 Recapturing the archived contents
In cases where
the site selected for archiving is updated periodically, recapturing the object
is necessary. This would be the case
for an electronic journal that publishes each article online as it becomes
available or for a preprint service that allows the author to modify the
content of the preprint as it proceeds through the review process.
When making
decisions about recapturing the content of an archived site, a balance must be
struck between the completeness and currency of the archive and the burden on
the system resources. PANDORA allocates
a gathering schedule to each “publication” in its automatic harvesting
program. The options include on/off,
weekly, monthly, quarterly, half-yearly, every nine months, or annually. The selection is dependent on the degree of
change expected and the overall stability of the site. When making decisions about recapturing the
content, the EVA Project (Lounamaa and Salonharju 1999) at the University of
Helsinki considers the burden on its system resources and the burden on the
sites (because of the activities of its robots) from which the content would be
recaptured.
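A gathering schedule of the kind PANDORA assigns to each publication can be sketched as a function from the last capture date to the next one. The Python below is an assumption-laden illustration: the schedule labels, the treatment of "off" as "never gather again", and the month-end clamping rule are choices made for this example, not documented PANDORA behavior.

```python
import calendar
from datetime import date, timedelta
from typing import Optional

# Months between gathers for the month-based schedules named in the text.
MONTHS = {"monthly": 1, "quarterly": 3, "half-yearly": 6,
          "nine-monthly": 9, "annually": 12}


def add_months(day: date, months: int) -> date:
    """Add calendar months, clamping to the last day of a shorter month."""
    index = day.year * 12 + (day.month - 1) + months
    year, month = divmod(index, 12)
    last_day = calendar.monthrange(year, month + 1)[1]
    return date(year, month + 1, min(day.day, last_day))


def next_gather(last: date, schedule: str) -> Optional[date]:
    """Return the next gather date, or None when gathering is switched off."""
    if schedule == "off":
        return None
    if schedule == "weekly":
        return last + timedelta(weeks=1)
    return add_months(last, MONTHS[schedule])
```

Choosing a longer interval for a stable site reduces the load both on the archive and on the site being harvested, which is the balance the projects above describe.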
2.2.2 Gathering procedures
There are two
general ways in which the archive acquires material. The producer can submit
the material to be archived, or the archive can gather the material
proactively.
In the first
method, the best practices identified in the earlier section on creation become
extremely important. Even within an
organization, where the producer and the archive are almost one and the same
organization, attention to standardization and limitations on the number of
formats will have a significant impact on the ease with which submissions can
be processed.
In the second
approach, the archive may or may not have a formal relationship with the
creator or the producer. In this
gathering approach, the information to be archived may be hand-selected or
harvested automatically. In the case of
the NLA, sites are identified, reviewed, hand-selected, and monitored for their
persistence before being captured for the archive.
In contrast, the
Royal Library, the National Library of Sweden, until recently automatically
acquired material by running a robot to capture sites for its Kulturaw3 project
(National Library of Sweden). The harvester automatically captured sites from
the .se country domain and from foreign sites with material about Sweden, such
as travel information or translations of Swedish literature. While the acquisition was automatic,
priority was given to periodicals, static documents, and HTML pages. Conferences, Usenet groups, FTP archives,
and databases were considered lower priority. Unfortunately, this project has
been discontinued because of the lack of national deposit legislation for
electronic materials.
2.3 Data management: Metadata for preservation
Metadata is
needed to preserve the object and for users in the future to find and access
it. Metadata supports organization,
preservation and long-term access. In this section, I will deal primarily with
metadata for preservation. Other issues
surrounding metadata for description and discovery were covered in my previous
lecture on Cataloging and Indexing of Electronic Resources.
Archiving and preservation require
special metadata elements to track the lineage of a digital object (where it
came from and how it has changed over time), to detail its physical
characteristics, and to document its behavior in order to reproduce it on
future technologies. Each of the major
preservation projects – Cedars, PANDORA, NEDLIB, the Harvard Library Project,
etc., had its own set of metadata that it considered important for
preservation. In 2000, the Research Libraries Group and OCLC reviewed the
various sets of preservation metadata and concluded that there is sufficient
similarity among the elements that a core set of metadata for preservation
could be identified (RLG 2000).
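The three purposes of preservation metadata named above (lineage, physical characteristics, and behavior) can be pictured as fields of a single record. The Python sketch below is purely illustrative: its class and field names are hypothetical and do not reproduce the element set of any of the projects mentioned.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LineageEvent:
    """One step in the object's history, e.g. ingest or a format migration."""
    when: date
    action: str
    note: str = ""


@dataclass
class PreservationRecord:
    """Illustrative preservation metadata for one archived object."""
    identifier: str
    source: str                  # lineage: where the object came from
    file_format: str             # physical characteristic
    size_bytes: int              # physical characteristic
    rendering_environment: str   # behavior: software needed to reproduce it
    lineage: list[LineageEvent] = field(default_factory=list)

    def record_migration(self, when: date, new_format: str) -> None:
        """Log a format change so the object's history stays traceable."""
        note = f"{self.file_format} -> {new_format}"
        self.lineage.append(LineageEvent(when, "migration", note))
        self.file_format = new_format
```

Keeping the lineage as an append-only list means that each change to the object adds to its documented history and never overwrites it.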
In October 2001,
the Preservation Metadata Working Group developed a draft set of over 20
elements and numerous sub-elements for preservation metadata in the framework
of the OAIS Reference Model (RLG 2001).
They describe the content and the environment (software and operating
systems) needed for the object. The
plan is to achieve international consensus on this set. OCLC is already beginning to use the set as the
basis for its Digital Archive and for the work that has been done with the U.S.
Government Printing Office.
The discussion
provided by the Preservation Metadata Working Group acknowledges that there are
details about the use of the proposed elements and perhaps additional elements
needed to preserve objects of various types.
A recent meeting sponsored by UNESCO, the ICSU Press, the
Committee on Data for Science and Technology (CODATA), and the International Council for Scientific and Technical
Information raised the issue of whether the proposed preservation metadata
element set is broad enough to encompass data sets. Based on this meeting, the data community plans to review the
current draft and comment about any revisions necessary to broaden or clarify
its scope.
2.4 Archival storage: Formats for preservation
A major issue for
the archiving community is which format(s) should be used for archival storage.
Should the electronic resource be transformed into a format more conducive to
archiving? Is the complexity of an interactive journal necessary or should it
be simplified? Should consideration be
given to the re-use of information and its enhancement or representation in
more advanced access technologies in the future? Should the goal be complete replication of the electronic
resource or should preservation provide a copy that is “just good enough”? (For
example, Cedars has identified the concept of “significant properties,” which
are properties that are absolutely required in order for a user in the future
to get the information value from the resource (Russell 2000).)
Of course, the
answers to these questions differ in part by resource type, and there is little
standardization at this point. Most
electronic journals, reference books, and reports use image files (TIFF), PDF, or
HTML. TIFF is the most prevalent for
those organizations that are involved with conversion of paper issues of
journals. For example, JSTOR (JSTOR), a non-profit organization that supports
both storage of current journal issues in electronic format and conversion of
back issues, scans everything from paper into TIFF and then runs optical
character recognition (OCR) on the TIFF image. The OCR text, because it cannot
achieve 100% accuracy, is used only for searching; the TIFF image is the actual
delivery format that the user sees.
However, this does not allow embedded references to be active hyperlinks.
SGML (Standard Generalized Mark-up
Language) is used by many large publishers after years of converting
publication systems from proprietary formats to SGML. The American Astronomical Society (AAS) has a richly encoded SGML
format that is used as the archival format from which numerous other formats,
including HTML and PDF, are made (Boyce 1997).
For purely
electronic documents, Adobe’s PDF (Portable Document Format) is the most
prevalent format. This provides a
replica of the Postscript format of the document, but relies upon proprietary
encoding technologies. PDF is used both
for formal publications and grey literature. While PDF is increasingly
accepted, concerns remain for long-term preservation and it may not be accepted
as a legal depository format, because it is a proprietary format.
Preserving the
“look and feel” is difficult in the text environment, but it is even more
difficult in the multimedia environment, where there is a tightly coupled
interplay between software, hardware and content. The University of California at San Diego has developed a model
for object-based archiving that allows various levels and types of metadata
with separate storage of the multimedia components in systems that are best
suited to the component’s data type.
The UCSD work is funded by the U.S. National Archives and Records
Administration and the U.S. Patent and Trademark Office.
2.5 Preservation planning: Migration and emulation
Preservation
planning is the bridge between the decisions made about archival storage of the
bits and bytes and issues of future access and user needs. There is no common
agreement on the definition of long-term preservation, but some have defined it
as being long enough to be concerned about changes in technology and changes in
the user community. This may be as
short as 2-10 years.
Two strategies
for preservation are migration and emulation.
Migration means copying the object to be archived and moving it to newer
hardware and software as the technology changes. Migration is, of course, a more viable option if the organization
is dealing with well-established commercial software such as Oracle or
Microsoft Word. However, even in these
cases migration is not guaranteed to work for all data types, and it becomes
particularly unreliable if the information product has used sophisticated
software features. Unfortunately, this level of standardization and ease of
migration is not as readily available among technologies used in fields of
study where specialized systems and instruments are used.
Emulation, a
strategy that replicates the behavior of old software and hardware on new
hardware and software, is being considered as an alternative to migration. There are several types of emulation. Encapsulation would store information about
the behavior of the hardware/software with the object. For example, an MS Word
2000 document would be labeled as such, and metadata would be
stored with the object to indicate how to reconstruct the document at the
engineering (bits and bytes) level.
An alternative to encapsulating the behavior with every instance is to
create an emulation registry that uniquely identifies the hardware and software
environments and provides information on how to recreate the environment. Each instance would point to the registry.
(Rothenberg 1999; 2000). Taking
emulation a step further is the idea of creating a virtual machine – a new machine
that, based on the information in the registry, could replicate the behavior of
the hardware/software of the past (Lorie 2001).
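The difference between encapsulation and an emulation registry is essentially where the environment description lives. In the registry approach, each object carries only a pointer. The Python sketch below illustrates this under stated assumptions: the registry key format, the class names, and the operating system and hardware values are placeholders invented for the example; only "MS Word 2000" comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Description of a hardware/software environment to be recreated."""
    software: str
    operating_system: str
    hardware: str


# Hypothetical registry: one shared entry per environment, keyed by an ID.
# The operating system and hardware values are placeholders.
REGISTRY = {
    "env-0001": Environment("MS Word 2000", "Windows 98", "x86 PC"),
}


@dataclass
class ArchivedObject:
    """An archived object that points to the registry and does not
    carry its own copy of the environment description."""
    identifier: str
    environment_id: str


def environment_for(obj: ArchivedObject) -> Environment:
    """Resolve the environment an emulator would need to recreate."""
    try:
        return REGISTRY[obj.environment_id]
    except KeyError:
        raise LookupError(f"no registry entry for {obj.environment_id}") from None
```

With encapsulation, the `Environment` record would instead be stored inside every `ArchivedObject`, trading storage and update effort for independence from the registry.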
While the best
practice for the foreseeable future continues to be migration, emulation has
been tested with some success by the CAMiLEON Project (University of
Michigan). This is a joint project between
the University of Michigan and the University of Leeds to determine if
emulation is a viable long-term strategy for preservation. Granger has concluded that a variety of
preservation strategies and technologies should be available. Some simple objects may benefit from
migration, while others that are more complex may require emulation (Holdsworth
and Wheatley 2001; Granger 2000).
2.6 Access
The life cycle
functions discussed so far are performed for the purpose of ensuring continuous
access to the material in the archive.
Successful practices must consider changes to access mechanisms, as well
as rights management and security requirements over the long term.
2.6.1 Access mechanisms
While many preservation
projects are concerned about the ability to provide long-term access to the
electronic information as it exists today, others are interested in how they
might actually improve access to current information in the future. A major reason for storing the information
related to the U.S. National Library of Medicine’s Profiles of Science
materials in TIFF and other standardized forms, such as tagged ASCII, is so
that the information can be re-purposed or enhanced. Even in its development stage, the project was able to improve
the quality of the video clips by converting them to High Definition
Video. The belief is that there will
always be newer and better technologies, and a goal of the archive is to be able
to take advantage of these advances in the future.
One of the most
difficult access issues for digital archiving involves rights management. What rights does the archive have? What rights do various user groups have? What rights has the owner retained? How will the access mechanism interact with
the archive’s metadata to ensure that these rights are managed properly? How will access rights be updated as the material’s
copyright status or security level changes?
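One way to picture how an access mechanism might consult rights metadata is the following minimal Python sketch. It assumes a much simpler model than any real archive would need: the field names, the user groups, the actions and the rule that all restrictions lapse on a single date are hypothetical and are not taken from any of the projects discussed here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional, Set


@dataclass
class RightsRecord:
    """Hypothetical rights metadata stored alongside an archived object."""
    archive_rights: Set[str]            # what the archive itself may do
    owner_retained: Set[str]            # what the owner has kept
    group_rights: Dict[str, Set[str]]   # actions allowed per user group
    public_from: Optional[date] = None  # date restrictions lapse, if known

    def is_permitted(self, group: str, action: str, today: date) -> bool:
        """Check an access request against the recorded rights."""
        # Assumed policy: once the restriction date has passed (for example
        # when copyright expires or a security level is lowered), any group
        # may perform any action.
        if self.public_from is not None and today >= self.public_from:
            return True
        return action in self.group_rights.get(group, set())


# Illustrative record; all values are invented for the example.
record = RightsRecord(
    archive_rights={"store", "migrate"},
    owner_retained={"commercial reuse"},
    group_rights={"staff": {"view", "copy"}, "public": set()},
    public_from=date(2030, 1, 1),
)
```

Because the check reads the current date and the stored record each time, a change in copyright status or security level is reflected by updating the metadata rather than the access mechanism itself.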
3.0 Emerging stakeholder
roles
A
number of stakeholders can be identified including creators/authors,
publishers, libraries, archives, Internet service providers, secondary
publishers, aggregators, and, of course, users (Haynes, et al. 1997; Hodge
1999; Hodge 2000). The roles these
various stakeholders will play in the archiving process described above remain
unclear, but there are several types of electronic information for which some
patterns of responsibility are emerging.
In the early stages of the digital age,
most electronic journal publishers considered the creation of an electronic
archive to be the same as the internal production system. However, many publishers have since come to
realize that archiving and production are not one and the same function. In some cases, they are quite antithetical.
The current environment shows a growing
understanding of the need for archiving and long-term preservation among the
major electronic journal publishers.
This may not be the situation with smaller learned society publishers,
but that may be more an issue of economics than of desire. The major electronic journal publishers such
as Elsevier, Nature and Blackwell have projects underway. These projects are significant, because they
bring together the publishers and the major research and national libraries
that have been at the forefront of the demand for publisher attention to
archiving issues, particularly in their license agreements.
Librarians and
archivists, particularly those at national libraries, were early advocates of
digital preservation issues. Many
national libraries spearheaded initiatives and research projects without
additional funds and without legislative mandates to cover digital deposit. In most cases, these projects have been
instrumental in advancing the research and implementation of operational
systems.
In addition to
new roles for publishers and librarians/archivists, trusted third party
archives are emerging. These third
parties, such as the OCLC Digital Archive (OCLC Digital Archive), JSTOR (JSTOR),
and PubMed Central (NCBI), see archiving/preservation as an additional
business/service opportunity.
A significant
outgrowth of the OAIS Reference Model has been RLG’s development of attributes
of an OAIS-compliant archive (RLG 2001).
A certification process is being discussed that would assure a library,
publisher or other organization that a particular third-party archive meets
minimal requirements for importing and exporting and basic functionality
related to the other aspects of an archive.
Another
significant development in the emergence of clearer stakeholder
responsibilities, particularly for commercially published materials, is a
January 2002 announcement on digital preservation by the International
Federation of Library Associations and Institutions (IFLA) and the
International Publishers Association (IPA).
The draft presented for discussion highlights the importance of “born
digital” materials and suggests that the appropriate place for preservation of
last resort is with the national libraries.
It is hoped that additional legislative/policy efforts and funding for
cooperative initiatives will result from this statement and from the inclusion
of digital preservation on the agendas of these two major international
stakeholder organizations.
4.0 Trends and issues
The
trend in archiving and preservation has moved from theoretical discussions to
pragmatic projects. There are more
initiatives focused on the realistic details of metadata, selection criteria,
technologies and systems for archiving.
While the need to raise awareness has not completely disappeared, more
time is being spent on partnership development, testing and implementation.
The
focus of research and development has shifted to “filling in the gaps.” The National Science Foundation (NSF) and
the Library of Congress have created a committee of interested federal
agencies, including the National Archives and the national libraries, to
identify key areas of research that could be supported by NSF and other federal
grants. The major research areas
identified to-date include the migration of extremely large data sets and
long-term access to complex multimedia objects. The Dutch National Library and
the British Library efforts have developed system specifications based on the NEDLIB
Project, which ended in 2001. NEDLIB
developed a data and process model for the deposit of digital materials in a
national library setting by working with several European publishers and eight
national libraries (van de Werf 2000; Feenstra 2000). The Deposit System for Electronic Publications (DSEP), which is
based on the OAIS Reference Model, will be implemented in a system to be
developed by IBM.
In
addition to the trend toward pragmatic initiatives, cooperation has increased
among projects and across stakeholder groups. OCLC, the UK’s Digital
Preservation Coalition (JISC) and RLG have been instrumental in identifying,
supporting and advancing key areas of cooperation. As a real sign of maturity, the work is being “divided up.” While some projects are developing
operational systems, others are working in the background to achieve consensus
on standards among/between projects.
Unlike many standards activities in the past that have developed from
local and regional practices, the work related to digital archiving is starting
with the goal of international consensus.
Despite
these positive trends, key issues remain.
The cost of archiving and the lack of established business models that
will sustain long-term preservation may prove to be significant stumbling
blocks in the advancement of the cause of preservation. However, even these issues are being
addressed in a pragmatic fashion. OCLC,
Stanford University Libraries/HighWire Press, JSTOR, and major publishers such
as Elsevier are actively dealing with questions of cost, including who will
pay for the archiving and how. The creation of
groups such as the OCLC Digital Preservation & Co-op (OCLC Digital
Preservation & Co-op) will provide venues where barriers can be identified
and business models can be tested.
Projects such as the archive of Elsevier material at Yale Library (also
funded by the Mellon Foundation) (Hunter 2002) will further identify archiving
practices that can accommodate the needs of libraries, users and the economic
requirements of producers.
Another
key issue for electronic libraries is intellectual property rights. Some progress has been made in the area of
legal deposit of electronic resources, but much remains to be done. Several initiatives show that there is
increased awareness on the part of governments of the need to address this issue. In the UK, the British Library is making
great strides in its voluntary digital deposit program. The Library of Congress has received
appropriations from the U.S. Congress following a study by the U.S. National
Research Council (NRC 2001). The
funding is to produce a plan for development of an infrastructure to support
federated digital preservation for the U.S.
For electronic records, the InterPARES project incorporates seven major
groups, including two regional archive groups from Asia and Europe
(InterPARES).
5.0 Local institutional
responses
We’ve talked
about a number of projects internationally.
Many of these are on a national, regional or even global scale. However, what can a local institution do to
ensure the preservation of electronic resources?
First, it is
important to be aware of what is going on in this field. What are the outcomes of the major
projects? How are standards being
developed?
There are
several sources for this information. First, the major projects have extensive web sites, and many, like Cedars
and NEDLIB, have produced numerous publications, which are available from the
web sites. Secondly, the PADI (NLA/PADI)
site at the NLA is the major portal to digital archiving information. The Electronic Resource Preservation and
Access Network (ERPANET) portal promises to provide practical information and
links to experts. Newsletters such as RLG DigiNews from the Research Libraries
Group are an excellent source of up-to-date information.
The local
librarian should take every opportunity to raise awareness about the importance
of digital preservation at his or her institution. When possible, be proactive in seeking funds to start small
projects for preserving digital materials.
A concrete way
to raise awareness is to ensure that archiving and preservation are considered
when negotiating licenses for electronic resources, such as electronic journals
and databases. With many national regimes for deposit of digital materials
lagging behind the practical uses of these materials, it is important to
address these archiving issues in license agreements. Equally, it is important to try to establish a balance between
the rights of the rights holders and those of the library and users.
The major lesson
is to think globally but to act locally – scaling the findings of the major
global activities to the local needs.
6.0 Conclusions
A review of the
cutting-edge projects shows the beginning of a body of best practices for
digital archiving. The early adopters in the area of digital archiving are
providing lessons that can be adopted by others in the stakeholder
communities. Through the collaborative
efforts of the various stakeholder groups – creators, librarians, archivists,
funding sources, and publishers – a new tradition of stewardship will be
developed to ensure the preservation of, and continued access to, our intellectual
heritage.
7.0 References
Beagrie, N. and Jones, M. (2001). Preservation
management of digital materials: a handbook. London: The British Library.
Boyce, P. (1997, November). Costs, archiving, and the publishing
process in electronic STM journals. Against the Grain, 9(5):
86. [Online]. Available: http://www.aas.org/~pboyce/epubs/atg98a-2.html [3 May 2002].
Byrnes, M. (2000). Assigning permanence levels to NLM’s electronic publications. Presented at Information Infrastructures for Digital Preservation: A One Day Workshop, Dec. 6, 2000, York, England. [Online]. Available: http://www.rlg.org/events/pres-2000/infopapers.html/byrnes.html [3 May 2002].
CAMiLEON: Creating creative archiving at Michigan & Leeds: Emulating the old on the new. (2001). [Online]. Available: http://www.si.umich.edu/CAMILEON/ [3 May 2002].
Cedars: CURL Exemplars in Digital Archives.
[Online]. Available: http://www.leeds.ac.uk/cedars/
[3 May 2002].
Committee on an Information Technology
Strategy for the Library of Congress, Computer Sciences and Telecommunications
Board, National Research Council. (2001). LC21: A Digital Strategy for the
Library of Congress. National Academy Press: Washington DC. [Online]. Available: http://books.nap.edu/books/0309071445/html/index.html
[3 May 2002].
Consultative Committee for Space Data
Systems. (2001). Reference model for an Open Archival Information System
(OAIS). Red Book CCSDS 650.0-R-2, June 2001. [Online]. Available: http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html [3 May 2002].
Digital Preservation Coalition. (2002). [Online]. Available: http://www.jisc.ac.uk/dner/preservation/prescoalition.html
[3 May 2002].
Electronic Resource Preservation and Access NETwork: ERPANET. [Online].
Available: http://www.erpanet.org [3
May 2002].
Feenstra, B. (2000). Standards for
implementation of a DSEP. NEDLIB Report Series; #4, Koninklijke Bibliotheek:
Den Haag. [Online]. Available: http://www.kb.nl/coop/nedlib/results/NEDLIBstandards.pdf
[3 May 2002].
Granger, S. (2000). Emulation as a digital
preservation strategy. D-Lib
Magazine, 6(10). [Online]. Available: http://www.dlib.org/dlib/october00/granger/10granger.html
[3 May 2002].
Haynes, D., Streatfield, D., Jowett, T. and Blake, M. (1997).
Responsibility for digital archiving and long term access to digital data. JISC/NPO Studies on Preservation of
Electronic Materials. [Online]. Available: http://www.ukoln.ac.uk/services/papers/bl/jisc-npo67/digital_preservation.html
[3 May 2002].
Hodge, G. (1999). Digital electronic archiving: The state of the
art, the state of the practice. [Online].
Available: http://www.icsti.org/conferences.html
[3 May 2002].
Hodge, G. (2000). Digital archiving: Bringing stakeholders and
issues together: A report on the ICSTI/ICSU Press Workshop on Digital
Archiving. ICSTI Forum 33. [Online]. Available: http://www.icsti.org/forum/33/#Hodge
[3 May 2002].
Holdsworth, D. and Wheatley, P. (2001).
Emulation, preservation and abstraction. RLG DigiNews, 5 (4), Feature #2. [Online]. Available: http://www.rlg.org/preserv/diginews/diginews5-4.html#feature2
[3 May 2002].
Hunter, K. (2002). Yale-Elsevier Mellon Project. [Online].
Available: http://www.niso.org/presentations/hunter-ppt_01_22_02/index.htm
[3 May 2002].
Inera Inc. (2001). E-journal archive DTD feasibility study. Prepared for the Harvard University Library,
Office of Information Systems E-Journal Archiving Project. Pg. 62-63. [Online].
Available: http://www.diglib.org/preserve/hadtdfs.pdf [3 May 2002].
Internet
Archive: Building an ‘Internet Library’. (2001). [Online]. Available: http://www.archive.org [3 May 2002].
InterPARES: International Research on Permanent Authentic Records
in Electronic Systems. (2002). [Online]. Available: http://www.interpares.org [3 May 2002].
JSTOR: The Scholarly Journal Archive.
(2002). [Online]. Available: http://www.jstor.org [3 May 2002].
Kuny, T. (1998, May). The digital dark ages? Challenges in the
preservation of electronic information. International
Preservation News, 17. [Online]. Available: http://www.ifla.org/VI/4/news/17-98.htm
LOCKSS: Permanent publishing on the Web. [Online].
Available: http://lockss.stanford.edu/index.html [3 May 2002].
Lorie, R. (2001, June). A project on preservation
of digital data. RLG DigiNews, 5
(3), Feature # 2. [Online]. Available: http://www.rlg.org/preserv/diginews/diginews5-3.html#1
[3 May 2002].
Lounamaa, K. and
Salonharju, I. (1999, January). EVA-the acquisition and archiving of electronic
network publications in Finland. Tietolinja News, 1. [Online]. Available: http://www.lib.helsinki.fi/tietolinja/0199/evaart.html
[3 May 2002].
National Library of Australia. Selection of online
Australian publications intended for preservation by the National Library of
Australia. (n.d.) [Online]. Available:
http://pandora.nla.gov.au/selectionguidelines.html [3 May 2002].
National Library of Canada, Electronic Collections
Coordinating Group. (1998). Networked Electronic Publications Policy and
Guidelines. [Online]. Available: http://www.nlc-bnc.ca/9/8/index-e.html
[3 May 2002].
Networked European Deposit Library: NEDLIB. (2001).
[Online]. Available: http://www.konbib.nl/nedlib
[3 May 2002].
PADI: Preserving Access to Digital Information.
(1999). [Online]. Available: http://www.nla.gov.au/padi/
[3 May 2002].
PANDORA. [Online]. Available:
http://pandora.nla.gov.au/index.html [3 May 2002].
PubMed Central: an Archive of life science journals.
(2002). [Online]. Available: http://www.pubmedcentral.nih.gov/
OCLC Digital Archive. (2002). [Online]. Available:
http://www.oclc.org/digitalpreservation/about/archive/
[3 May 2002].
OCLC Digital Preservation Resources, Digital &
Preservation Co-op. (2002). [Online]. Available: http://www.oclc.org/digitalpreservation/about/co-op/
[3 May 2002].
OCLC/RLG Working Group on Preservation Metadata.
(2001, October). A recommendation for content information. [Online]. Available:
http://www.oclc.org/research/pmwg/contentinformation.pdf
[3 May 2002].
Planning Committee of the OCLC/RLG Working Group on Preservation Metadata. (2001, January). Preservation metadata for digital objects: A review of the state of the art. [Online]. Available: http://www.oclc.org/research/pmwg/presmeta_wp.pdf [3 May 2002].
Research Libraries Group. (2001, August). Attributes of a trusted digital repository for digital materials: Meeting the needs for research resources. [Online]. Available: http://www.rlg.org/longterm/attributes01.pdf [3 May 2002].
Ross,
S. (2000). Changing trains at Wigan: Digital preservation and the future of
scholarship. National Preservation Office: The British Library.
Rothenberg,
J. (1999, January). Avoiding technological quicksand: Finding a viable
technical foundation for digital preservation. Report to CLIR. [Online]. Available: http://www.clir.org/pubs/reports/rothenberg/contents.html
[3 May 2002].
Rothenberg, J. (2000, April). An experiment in using emulation to preserve digital publications. NEDLIB Report Series; 1. [Online]. Available: http://www.kb.nl/coop/nedlib/results/NEDLIBemulation.pdf [3 May 2002].
Royal
Library. National Library of Sweden. (n.d.) Kulturaw3 – Heritage Project: Long
term preservation of published electronic documents. [Online]. Available:
http://www.kb.se/ENG/kbstart.htm
[3 May 2002].
Russell,
K. (2000). Digital preservation and the Cedars Project experience. Presented at
Preservation 2000: An International Conference on the Preservation and
Long-Term Accessibility of Digital Materials, York, England, December 7-8, 2000.
[Online]. Available: http://www.rlg.org/events/pres-2000/russell.html
[3 May 2002].
Seville,
C. and Weinberger, E. (2000, June). Intellectual property rights lessons from
the CEDARS Project for digital preservation. (Draft) [Online]. Available:
http://www.leeds.ac.uk/cedars/colman/CIW03.pdf
[3 May 2002].
van de Werf, T. (2000). A process model: The
deposit system for electronic publications. NEDLIB Report Series; #6,
Koninklijke Bibliotheek: Den Haag. [Online]. Available: http://www.kb.nl/coop/nedlib/results/DSEPprocessmodel.pdf
[3 May 2002].
Weinberger,
E. (2000, June). Toward collection management guidance. (Draft) [Online].
Available: http://www.leeds.ac.uk/cedars/colman/CIW02r.html
[3 May 2002].