Buckets – Communications of the ACM

Information content is more important than the systems used for its storage and retrieval. While this seems obvious enough, digital library discussions often become mired in the merits of specific databases, search engines, and other implementation details. This is because digital library services (for example, searching, browsing, document access) are often vertically integrated with the content they service. Such tight integration impedes digital library interoperability and easy transitioning to future digital library systems. Even in open architecture digital libraries, data objects remain tied to a single service controlling their access. We make information objects first-class citizens by dismantling the current stovepipe of DL-archive-content. To demonstrate this decoupling we introduce buckets: aggregative, intelligent, object-oriented constructs that contain data, metadata, and the methods for accessing both.

In the Smart Objects, Dumb Archives (SODA) DL model [2], functionalities traditionally associated with archives are pushed down into buckets, making the buckets smarter and the archives dumber. Some of a bucket’s responsibilities include: storing, tracking, and enforcing its terms and conditions; maintaining, displaying, and disseminating its contents; and maintaining its event logs. The motivation for buckets came from previous experience in the design, implementation, and maintenance of NASA digital libraries. Users replied that while access to technical reports was desirable, they particularly wanted the experimental data, software, video, and other ancillary material. In response, we defined a digital object to capture and preserve arbitrary data objects and the relationships between them.

Additionally, experience making the content accessible through other digital libraries and Web crawlers led to making the information objects intelligent. We did not want the objects trapped inside our digital libraries, with the only method for discovery coming from our digital library interface. The information object should be independent of the DL, capable of existing outside the digital library and transitioning to different digital libraries in the future. However, because we cannot assume which digital library will be used for discovery and access, buckets must be self-sufficient and perform their required tasks without digital library support.

In our NASA digital library experience, data was partitioned by semantic or syntactic type: metadata in one location, PostScript files in another location, PDF files in still another location, and so on. Over time, different metadata formats were introduced, the number of file formats increased, and new information types (software, multimedia) were introduced. “Being in the DL” eventually represented so much DL jetsam—bits and pieces physically and logically strewn across the system.

How Buckets Work

Although multiple bucket implementations are possible, the initial implementation requires only a CGI-enabled HTTP server and a Perl interpreter. Buckets have a bunker mentality: even if other digital library services degenerate, buckets continue to function as long as HTTP and Perl exist.

Aside from Perl, HTTP, and CGI, buckets make no assumptions about their environment. Buckets are self-sufficient, providing their own MIME typing, terms and conditions, and support libraries. Buckets were developed on Solaris and have been tested in various Unix, Linux, and NT configurations.

Our reference system implements the bucket API using HTTP encoding of messages. Users normally do not invoke methods directly—the applicable methods for content access are built into the bucket’s HTML output. Creation and management-oriented methods are to be accessed by bucket tools. If no method is provided, the display method is assumed. This generates a human-readable display of the bucket’s contents. For example, a bucket version of a NACA Technical Note:

http://www.cs.odu.edu/~nelso_m/naca-tn-2509/

which is the same as:

http://www.cs.odu.edu/~nelso_m/naca-tn-2509/?method=display

Both URLs produce the output in Figure 1. A digital library can pass in preferences to alter the bucket’s appearance. For example, a view of the bucket suitable for library staff:

http://www.cs.odu.edu/~nelso_m/naca-tn-2509/?method=display&view=staff

produces the output shown in Figure 2. The human-readable interface produced by the display method automatically includes the link to the PDF file:

http://www.cs.odu.edu/~nelso_m/naca-tn-2509/?method=display&pkg_name=report.pkg&element_name=naca-tn-2509.pdf

Similarly, if users wish to display the scanned pages, they follow the link automatically created in the HTML output:

http://www.cs.odu.edu/~nelso_m/naca-tn-2509/?method=display&pkg_name=report.pkg&element_name=report.scan

which produces the output in Figure 3. To the casual observer, the bucket API is transparent. However, users or robots can exploit the knowledge that a particular URL is a bucket. For example, to extract the metadata:

http://www.cs.odu.edu/~nelso_m/naca-tn-2509/?method=metadata

which returns structured metadata, suitable for indexing by a digital library.
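The request URLs shown so far all follow one pattern: the bucket's base URL, an optional method argument, and further arguments for that method. As a minimal sketch of how a bucket-aware client might compose such requests, the Python below builds them with the standard library. The helper name `bucket_url` is hypothetical and not part of the bucket API; only the argument names `method`, `view`, `pkg_name`, and `element_name` are taken from the examples in the text, and the reference implementation itself is written in Perl, not Python.

```python
from urllib.parse import urlencode

# Base URL of the example bucket used throughout the article.
BASE = "http://www.cs.odu.edu/~nelso_m/naca-tn-2509/"


def bucket_url(base, method=None, **args):
    """Compose a bucket request URL.

    Hypothetical helper for illustration only. With no method and no
    arguments it returns the bare URL, for which the bucket assumes
    the display method.
    """
    params = {}
    if method is not None:
        params["method"] = method
    # Keyword arguments keep their call order (Python 3.7+).
    params.update(args)
    if not params:
        return base
    return base + "?" + urlencode(params)


# The requests discussed in the text, rebuilt with the helper.
default_view = bucket_url(BASE)
staff_view = bucket_url(BASE, "display", view="staff")
pdf_link = bucket_url(
    BASE, "display", pkg_name="report.pkg", element_name="naca-tn-2509.pdf"
)
metadata_request = bucket_url(BASE, "metadata")
```

A harvester that knows a URL is a bucket could issue `metadata_request` to obtain the structured record directly rather than parsing the HTML display.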

To see what methods are defined on a bucket:

http://www.cs.odu.edu/~nelso_m/naca-tn-2509/?method=list_methods

However, if a harvester is not bucket-aware, it can still crawl the buckets as normal URLs, “HTML-scraping” the default output. The full bucket API is discussed in [3].
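The dispatch behavior described in this section (methods selected by a request argument, with display assumed when none is given) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Perl reference implementation: the `handle` function, the `METHODS` table, and the placeholder handler bodies are all hypothetical, and only the method names `display`, `metadata`, and `list_methods` and the `view` argument come from the text.

```python
from urllib.parse import parse_qs


def display(args):
    # Placeholder: a real bucket would render its contents as HTML.
    view = args.get("view", "default")
    return f"<html>bucket display (view={view})</html>"


def metadata(args):
    # Placeholder: a real bucket would return its structured metadata.
    return "placeholder metadata record"


def list_methods(args):
    # Report the names of the methods defined on this (sketch) bucket.
    return "\n".join(sorted(METHODS))


# Hypothetical dispatch table mapping method names to handlers.
METHODS = {
    "display": display,
    "metadata": metadata,
    "list_methods": list_methods,
}


def handle(query_string):
    """Dispatch a request's query string to a bucket method."""
    # parse_qs maps each key to a list of values; keep the first of each.
    args = {key: values[0] for key, values in parse_qs(query_string).items()}
    # If no method is provided, the display method is assumed.
    name = args.pop("method", "display")
    handler = METHODS.get(name)
    if handler is None:
        return f"unknown method: {name}"
    return handler(args)
```

In this sketch an empty query string and `method=display` yield the same output, which mirrors the equivalence of the first two example URLs. A harvester that is not bucket-aware never supplies a method and so only ever receives the default display.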

Future of Buckets

There are projects from the digital library community with aggregation goals similar to those of buckets (for example, the Kahn-Wilensky Framework [1] and its derivatives; multivalent documents [4]). However, they do not feature mobility, self-sufficiency, or the SODA-inspired motivation of archival independence. The aggregative nature of buckets has allowed for easy object-level addition of value-added services such as the SFX reference linking service [5], without modifying digital library system source code.

NASA, Los Alamos National Laboratory, and the Air Force Research Laboratory are using buckets in their next-generation digital libraries. Future plans include significant utilization of bucket mobility and intelligence, including buckets actively involved in their long-term survivability and interacting with digital library services to report their observed usage patterns.