Introduction
From DLXS Documentation
Web-Based Searching of Digital Library Resources
The University of Michigan Digital Library eXtension Service (UM DLXS) offers a suite of resources designed to help educational and non-profit institutions mount a variety of types of digital library collections. Since 1996, the University of Michigan Library has worked to serve a growing need for institutions like Michigan to put encoded text collections online. The UM DLXS offers both a search engine and a set of tools for mounting digital library resources. The search engine, XPAT, is specially designed to handle the kinds of large and highly structured documents and metadata found in digital library efforts. The tools, collected under the DLXS umbrella, are designed to tap the power of the XPAT engine for broad classes of resources consistent with digital library activities. The following sections review the major components of DLXS systems, and the diagram included here provides a graphical overview of DLXS systems.
XPAT Search Engine
The XPAT engine is an SGML- and XML-aware search engine that can index, search, and retrieve UTF-8 Unicode-encoded text data. The University of Michigan has deployed it with an extremely diverse set of digital library resources. XPAT is based on the search engine previously marketed by Open Text as OT5™, sometimes referred to as "Pat" and "Pat5.0." XPAT provides excellent support for word and phrase searching, indexing of SGML/XML elements and attributes, fast retrieval, and ease of integration with other software. As part of the UM DLXS, the University of Michigan Digital Library Production Service has launched a continuous development process through which we hope to add a number of features to XPAT. For more information, please see the XPAT documentation.
DLXS "Classes"
DLXS seeks to enable digital library activities, particularly in the areas of information access and retrieval for large collections of material. To this end, DLXS devotes significant resources to developing and supporting access mechanisms for many types of collections found in the digital library. These types of collections are currently categorized into different classes. All resources within a class are structurally similar, and these classes have different behaviors, or ways of organizing, accessing and searching the information they contain. DLXS mounts collections that include continuous tone images, page-image based books with associated OCR, and SGML/XML-encoded books and journals, among others.
DLXS currently uses object-oriented Perl to manage four main classes: Text Class contains monograph-length books (both modern and historical), Image Class contains images and image metadata, Bib Class organizes bibliographic information, and Finding Aids Class manages collections of archival materials.
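The idea that collections in a class share one interface while each class supplies its own behavior can be sketched as follows. This is a minimal illustration written in Python rather than the Perl that DLXS itself uses. The base class, the method names, and the field lists are all hypothetical and are not taken from the DLXS code base.

```python
from abc import ABC, abstractmethod
from typing import List


class CollectionClass(ABC):
    """Hypothetical base for one DLXS-style class of collections."""

    def __init__(self, collection_id: str) -> None:
        self.collection_id = collection_id

    @abstractmethod
    def searchable_fields(self) -> List[str]:
        """Fields a user can search; each subclass defines its own."""

    def describe(self) -> str:
        # Shared behavior built on top of the class-specific field list.
        fields = ", ".join(self.searchable_fields())
        return f"{type(self).__name__}({self.collection_id}): {fields}"


class TextClass(CollectionClass):
    def searchable_fields(self) -> List[str]:
        return ["full text", "title", "author"]  # illustrative values


class ImageClass(CollectionClass):
    def searchable_fields(self) -> List[str]:
        return ["caption", "subject"]  # illustrative values


class BibClass(CollectionClass):
    def searchable_fields(self) -> List[str]:
        return ["title", "author", "citation"]  # illustrative values


class FindingAidsClass(CollectionClass):
    def searchable_fields(self) -> List[str]:
        return ["collection title", "container list"]  # illustrative values


if __name__ == "__main__":
    for coll in (TextClass("demo-text"), ImageClass("demo-image")):
        print(coll.describe())
```

Code that works with collections can then call `describe()` or `searchable_fields()` without knowing which class it has been handed, which is the sense in which resources within a class are structurally similar while classes differ in behavior.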
For more information, including background and the development history of DLPS Classes, see the DLPS Classes History and Overview.
Middleware
DLXS distributes and supports several software modules, each based on the evolving DLPS classes and behaviors. This middleware is available online as free, open source software, though it all currently depends on the licensed DLXS search engine, XPAT (see above). UM DLXS staff members develop nearly all current middleware in Perl. UM DLXS chose Perl as the middleware language because of, among other things, its powerful regular expression text manipulation and ease of customization. The DLXS Architecture is not, however, predicated on a Perl development environment, and DLXS developers are open to exploring (or seeing others explore) alternatives that meet the current set of design needs. For a related architectural overview, please see the UM DLPS Architecture Overview.
Digital Library Data
DLXS systems are devoted entirely to digital library resources that rely on (or can capitalize on) encoded data and metadata. By using XPAT to index the underlying information, DLXS systems can incorporate XML (both well-formed and valid) and fully validated and normalized SGML. In some cases, such as encoded texts, the entire digital library object is in XML or SGML. In others, such as continuous tone image resources, only the metadata is SGML or XML, and the images are in other formats. Documentation for each DLXS class has an overview of the encoding supported and the various types of digital library resources it can support.
Diagram
This diagram represents some components of the underlying indexed SGML/XML and the web interface through which users access it. The "Data Preparation" box shows how XPAT creates an index based on the SGML/XML data. The "Web Use" section of the diagram shows how the system captures search information from the user, how the CGI program processes this information into a valid XPAT query, and how XSLT transforms the SGML/XML returned by the XPAT search engine into displayable HTML for the user to view and navigate.
DLXS Architecture Overview diagram
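The three "Web Use" stages described above (capture the user's input, build an engine query, transform the returned XML into HTML) can be sketched end to end. The sketch below is in Python and uses only standard-library stand-ins. The query syntax is a placeholder and not real XPAT syntax, the engine is a stub returning canned sample XML, and the rendering step uses `xml.etree.ElementTree` in place of XSLT. All function names and form parameter names are assumptions made for illustration.

```python
import html
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs


def parse_search_form(query_string: str) -> dict:
    """Capture the user's search input from a CGI-style query string."""
    params = parse_qs(query_string)
    return {
        "term": params.get("q", [""])[0].strip(),
        "region": params.get("region", ["text"])[0].strip(),
    }


def build_engine_query(form: dict) -> str:
    """Turn form input into a query string (placeholder syntax, NOT XPAT)."""
    if not form["term"]:
        raise ValueError("empty search term")
    term = form["term"].replace('"', "")  # keep the quoted phrase well-formed
    return f'FIND "{term}" WITHIN {form["region"]}'


def fake_engine_search(query: str) -> str:
    """Stub for the search engine: ignores the query, returns sample XML."""
    return (
        "<results>"
        "<hit id='1'><title>Walden</title></hit>"
        "<hit id='2'><title>Leaves of Grass</title></hit>"
        "</results>"
    )


def render_html(result_xml: str) -> str:
    """Convert result XML to an HTML list (the role XSLT plays in DLXS)."""
    root = ET.fromstring(result_xml)
    items = [
        f"<li>{html.escape(hit.findtext('title', default=''))}</li>"
        for hit in root.iter("hit")
    ]
    return "<ul>" + "".join(items) + "</ul>"


def handle_request(query_string: str) -> str:
    """Run the three stages in order for one incoming request."""
    form = parse_search_form(query_string)
    query = build_engine_query(form)
    return render_html(fake_engine_search(query))


if __name__ == "__main__":
    print(handle_request("q=walden+pond&region=title"))
```

Keeping query construction and HTML rendering in separate functions mirrors the separation in the diagram, where the CGI program and the XSLT stage are distinct components around the search engine.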